strange problem with java.net.URL and java.net.URLConnection - java

I'm trying to download an image from a url.
The process I wrote works for everyone except for ONE content provider that we're dealing with.
When I access their JPGs via Firefox, everything looks kosher (happy Passover, btw). However, when I use my process I either:
A) get a 404
or
B) in the debugger, when I set a breakpoint at the URL line (URL url = new URL(str);),
then after the connection I DO get a file, but it's not a .jpg; it's some HTML that they're producing, with generic links and stuff. I don't see a redirect code, though! It comes back as 200.
Here's my code...
// urlString comes from elsewhere in the method; needs java.io.InputStream,
// java.net.URL and java.net.URLConnection imports.
URL url = new URL(urlString);
URLConnection uc = url.openConnection();

// Header field 0 is the status line, e.g. "HTTP/1.1 200 OK"
String val = uc.getHeaderField(0);
String contType = uc.getContentType();
System.out.println("FOUND OBJECT OF TYPE: " + contType);

InputStream is = null;
if (!val.contains("200")) {
    // problem
} else {
    is = uc.getInputStream();
}
Has anyone seen anything of this nature? I'm thinking maybe it's some mime type issue, but that's just a total guess... I'm completely stumped.

Maybe the site is just using some kind of protection to prevent others from hotlinking their images or to disallow mass downloads.
They usually check either the HTTP referrer (it must be from their own domain), or the user agent (must be a browser, not a download manager). Set both and try it again.
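Something along these lines, for example (just a sketch; the user-agent string, the Referer value, and the image URL are made-up placeholders you'd adapt to their site):

import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

public class ImageFetch {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://example.com/images/photo.jpg"); // placeholder URL

        URLConnection uc = url.openConnection();
        // Pretend to be a browser; many sites reject the default "Java/1.x" agent.
        uc.setRequestProperty("User-Agent",
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Gecko/20100101 Firefox/115.0");
        // Claim we came from a page on their own domain (hypothetical value).
        uc.setRequestProperty("Referer", "http://example.com/gallery.html");

        try (InputStream in = uc.getInputStream()) {
            System.out.println("Content type: " + uc.getContentType());
        }
    }
}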

if(!val.contains("200")) // ...
First of all, I would suggest you use the class HttpURLConnection, which provides the method getResponseCode().
Searching the returned data for some '200' implies:
performance issues, and
inconsistency (binary data can contain a '200' somewhere else).
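For example (a quick sketch; urlString here is just a placeholder for the image URL):

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class ResponseCodeCheck {
    public static void main(String[] args) throws Exception {
        String urlString = "http://example.com/images/photo.jpg"; // placeholder

        HttpURLConnection conn = (HttpURLConnection) new URL(urlString).openConnection();
        int code = conn.getResponseCode(); // parsed from the status line for you

        if (code != HttpURLConnection.HTTP_OK) {
            System.out.println("Problem: HTTP " + code);
        } else {
            try (InputStream in = conn.getInputStream()) {
                // read the image bytes here
            }
        }
    }
}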

Have you tried using Wireshark to see exactly what packets are going back and forth? This is often the fastest way to see what is different. That is:
First run Wireshark while using Firefox to get the JPG, and then
run Wireshark while using your code to get it.
Then compare and contrast the packets in both directions, and I almost guarantee you'll see something different in the HTTP headers or some other part of the traffic that will explain the problem.

All good guesses, but the "right" answer reward, I think, has to go to ivan_pertrovich_ivanovich_harkovich_rostropovitch_o'neil, because using HttpURLConnection I was able to see that, in fact, before getting the 404 I'm first getting a 301. So now it's just a matter of finding out from these people what they expect in the headers that would make them less inclined to redirect me.
Thanks for the suggestion.
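For anyone hitting the same thing, the redirect can be made visible roughly like this (a sketch; turning off automatic redirects is just one way to inspect it, and the URL is a placeholder):

import java.net.HttpURLConnection;
import java.net.URL;

public class RedirectProbe {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://example.com/images/photo.jpg"); // placeholder

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setInstanceFollowRedirects(false); // stop at the 301 instead of following it

        System.out.println("Status: " + conn.getResponseCode());            // e.g. 301
        System.out.println("Redirects to: " + conn.getHeaderField("Location"));
    }
}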

Related

Java web crawling - static URLS

I'm going to look a bit more into the techniques, because obviously there is a lot to learn, but I was wondering what the best approach to handling static URLs is. I would guess it has to do with cookies, but I'm not positive.
E.g., I search my query on a site example.com; example.com/search?string=blah then sends me to a URL that is specific to the search string. From there I can go further (to the data I actually want), but the link to the results is a static URL, example.com/results.php?id=33, and the id stays the same regardless of the search string. So the only logical thing is a cookie being passed, right? If so, how would I have Java open a connection, grab the cookies, then open a new connection and pass the cookies along? I tried something like this with two methods: one that opens the initial connection and grabs the cookies, then opens a new connection and passes the cookies to that method.
There are definitely multiple cookies, if that helps.
Also, any links/resources that you think I might find helpful would be greatly appreciated.
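Something roughly like this might be what you're after (untested sketch with placeholder URLs; the site may also tie the results to a server-side session rather than the cookies alone):

import java.net.HttpURLConnection;
import java.net.URL;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

public class CookiePassing {
    public static void main(String[] args) throws Exception {
        // 1. Hit the search URL and collect every Set-Cookie header.
        HttpURLConnection search = (HttpURLConnection)
                new URL("http://example.com/search?string=blah").openConnection(); // placeholder
        Map<String, List<String>> headers = search.getHeaderFields();

        StringJoiner cookieHeader = new StringJoiner("; ");
        List<String> setCookies = headers.get("Set-Cookie");
        if (setCookies != null) {
            for (String c : setCookies) {
                // Keep only the "name=value" part, dropping Path/Expires attributes.
                cookieHeader.add(c.split(";", 2)[0]);
            }
        }

        // 2. Request the results page, sending the cookies back.
        HttpURLConnection results = (HttpURLConnection)
                new URL("http://example.com/results.php?id=33").openConnection(); // placeholder
        results.setRequestProperty("Cookie", cookieHeader.toString());
        System.out.println("Results status: " + results.getResponseCode());
    }
}

Alternatively, java.net.CookieManager (installed via CookieHandler.setDefault) can carry cookies between URLConnections for you automatically.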

Reference implementation / .lib for full url encoding

I'm writing a Java application which parses links from HTML & uses them to request their content. The area of URL encoding, when we have no idea of the "intent" of the URL's author, is very thorny. For example, when to use %20 versus + is a complex issue (%20 vs +); a browser would perform this encoding for a URL containing an un-encoded space.
There are many other situations in which a browser would change the content of a parsed URL before requesting a page, for example:
http://www.Example.com/þ
... when parsed & requested by a browser becomes ...
http://www.Example.com/%C3%BE
.. and...
http://www.Example.com/&amp;
... when parsed & requested by a browser becomes ...
http://www.Example.com/&
So my question is: instead of re-inventing the wheel again, is there perhaps a Java library I haven't found to do this job? Failing that, can anyone point me towards a reference implementation in a common browser's source, or perhaps pseudocode? Failing that, any recommendations on approach are welcome!
Thanks,
Jon
HtmlUnit can certainly pick URLs out of HTML and resolve them (and much more).
I don't know whether it handles your corner cases, though. I would imagine it will handle the second, since that is a normal, if slightly funny-looking, use of HTML and a URL. I don't know what it will do with the first, in which an invalid URL is embedded in HTML.
I also know that if you find that HtmlUnit does something differently from how real browsers do it, write a JUnit test case to prove it, and file a bug report, its maintainers will happily fix it with great alacrity.
How about using java.net.URLEncoder.encode() & java.net.URLDecoder.decode()?
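One caution on that: URLEncoder/URLDecoder implement application/x-www-form-urlencoded (spaces become +), so they really fit individual query values rather than whole URLs. A small sketch of that, plus java.net.URI's multi-argument constructor, which percent-encodes illegal characters per component (the host and path here are just examples):

import java.net.URI;
import java.net.URLEncoder;

public class EncodingDemo {
    public static void main(String[] args) throws Exception {
        // Form-style encoding: fine for query values; note the '+' for the space.
        System.out.println(URLEncoder.encode("blah blah/þ", "UTF-8"));
        // -> blah+blah%2F%C3%BE

        // Component-wise encoding of a whole URL via java.net.URI.
        URI uri = new URI("http", "www.Example.com", "/þ", null);
        System.out.println(uri.toASCIIString());
        // -> http://www.Example.com/%C3%BE
    }
}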

Get the content of the HTTP stream in Java + Mozilla XPCOM

I've often read StackOverflow as a source to get answers; but now I have a very specific question and I can't really find any data on the internet. I trust you to be as helpful as always! :D
Basically, I'm relying on Mozilla's XULRunner and its XPCOM objects to analyze the HTTP stream of an SWT browser in a Java application on Linux.
Heavily based on Snippet128 and Snippet321 from the Java SWT website (can't post more than 1 URL sorry :/ ), my browser so far can parse all of the HTTP headers using an nsIHttpHeaderVisitor - and do some pretty stuff like printing them on a tree and such.
Full source is here.
Now... That's already pretty good. It covers the majority of what I want to do (school assignment at first, going a bit further than asked!).
But what I would really like is to be able to get the raw "content" data from every HTTP request: HTML of course, but also CSS and images.
I've been trying different ways to achieve this goal, but everything has failed so far:
Using an XPCOM object - but which one?
nsIInputStream would be a good one, but I can't seem to find where the right stream actually is... The nsIHttpChannel open() method (which gives back an nsIInputStream) seems to be called by the SWT browser, leaving me with no way of getting the stream back.
nsIRequest: no luck.
Another listener that I might have missed? I just spent an hour trying to use the nsIHttpActivityObserver interface, but it doesn't give me any HTTP content (merely GETs and 200 OKs).
Using another object
The SWT browser itself, for instance. Well, it kinda works: its getText() method gives me the HTML source of the page I'm visiting. But I want more!
I'm really stuck here, and I would greatly appreciate any help.
Cheers!
Florent
Perhaps nsITraceableChannel can help you?

Multiple questions in Java (Validating URLs, adding applications to startup, etc.)

I have an application to build in Java, and I've got some questions to put.
Is there some way to know if the URL of a webpage is real? The user enters the URL and I have to test whether it exists or not.
How can I know if a webpage has changed since a given date, or what the date of its last update is?
In Java, how can I make an application run on PC boot? The application must start as soon as the user turns on the computer.
I'm not sure what kind of application you want to build; I'll assume it's a desktop application. In order to check if a URL exists, you should make an HTTP HEAD request and parse the results. HEAD can also be used to check if the page has been modified. In order for an application to start when the PC boots, you have to add a registry entry under Windows; this process is explained here.
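For example (a rough sketch; example.com is a placeholder, a 2xx/3xx status suggests the page exists, and Last-Modified is only a hint, since not every server sends it):

import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Date;

public class HeadCheck {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://example.com/"); // placeholder

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("HEAD"); // headers only, no body

        int code = conn.getResponseCode();
        System.out.println("Exists? " + (code >= 200 && code < 400) + " (HTTP " + code + ")");

        long lastModified = conn.getLastModified(); // 0 if the header is absent
        if (lastModified != 0) {
            System.out.println("Last modified: " + new Date(lastModified));
        }
    }
}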
To check whether a URL is valid you could try using a regular expression (regex for URLs).
To know if a webpage has changed you can take a look at the HTTP headers (reading HTTP headers in Java).
You can't make the program start automatically on boot; the user must do that. However, you can write code to help the user set the program as a startup app; how to do this depends on the operating system.
I'm not sure what you mean by "real". If you mean "valid", then you can just construct a java.net.URL from a String and catch the resulting MalformedURLException if it's not valid. If you mean that there's actually something there, you could issue an HTTP HEAD request like Geo says, or you could just retrieve the content. HttpUnit is particularly handy for retrieving web content.
HTTP headers may indicate when the content has changed, as nan suggested above. If you don't want to count on that, you can just retrieve the page and store it, or even better, store a hash of the page content. See DigestOutputStream for generating a hash. On a subsequent check for changes you would simply compare the new hash with the one you stored last time.
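A quick sketch of that idea (placeholder URL; SHA-256 is just one reasonable digest choice, and nullOutputStream/transferTo need Java 11/9+):

import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;
import java.security.DigestOutputStream;
import java.security.MessageDigest;

public class PageHash {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://example.com/"); // placeholder

        MessageDigest md = MessageDigest.getInstance("SHA-256");
        try (InputStream in = url.openStream();
             OutputStream sink = OutputStream.nullOutputStream();
             DigestOutputStream digestOut = new DigestOutputStream(sink, md)) {
            in.transferTo(digestOut); // stream the page through the digest
        }

        // Hex-encode and compare this against the hash stored on the previous check.
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        System.out.println("Page hash: " + hex);
    }
}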
Nan is right about start on boot. What OS are you targeting?

Java HttpURLConnection: Can it cope with duplicate header names?

I am debugging some code in the Selenium-rc proxy server. It seems the culprit is the HttpURLConnection object, whose interface for getting at the HTTP headers does not cope with duplicate header names, such as:
Set-Cookie: foo=foo; Path=/
Set-Cookie: bar=bar; Path=/
The way of getting at the headers through HttpURLConnection (using getHeaderField(int n) and getHeaderFieldKey(int n)) seems to be causing my second cookie to be lost. My question is:
Is it true that HttpURLConnection itself can't cope with this, and
if so, is there a workaround for it?
My recommended workaround is not to use HttpURLConnection at all, since it's crude and unintuitive, but to use commons-httpclient instead.
http://hc.apache.org/httpclient-3.x/
Without actually having tried it (I can't remember having handled that topic myself), there's also getHeaderFields, inherited from URLConnection. Does this do what you need?
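For reference, something like this should show every Set-Cookie value, since getHeaderFields() returns a Map<String, List<String>> (placeholder URL):

import java.net.URL;
import java.net.URLConnection;
import java.util.List;
import java.util.Map;

public class DuplicateHeaders {
    public static void main(String[] args) throws Exception {
        URLConnection conn = new URL("http://example.com/").openConnection(); // placeholder
        Map<String, List<String>> headers = conn.getHeaderFields();

        // Every Set-Cookie value survives, even when the header name is repeated.
        List<String> cookies = headers.get("Set-Cookie");
        if (cookies != null) {
            for (String cookie : cookies) {
                System.out.println("Set-Cookie: " + cookie);
            }
        }
    }
}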
Ok, I found the problem, and the answer to the original question. Basically, the cookie implementation I used (Python's default cookie lib) used \r\n to delimit the different Set-Cookie headers (as opposed to \n). This confused HttpURLConnection and caused it to stop at the first occurrence of that delimiter (I'm going to guess it stops at the first empty line). So the answer to the first question is: yes, it can cope with duplicate header names, but it is buggy in another way. Currently, fixing the Python library is a workable workaround, but it's not going to work long term because we don't own that library. I'm sure using the httpclient library is a sensible way to go, but I'm hoping for a solution that requires less work, so I don't know exactly what to do there yet.
