Java HttpURLConnection: Can it cope with duplicate header names?

I am debugging some code in the Selenium-rc proxy server. It seems the culprit is the HttpURLConnection object, whose interface for getting at the HTTP headers does not cope with duplicate header names, such as:
Set-Cookie: foo=foo; Path=/
Set-Cookie: bar=bar; Path=/
Getting at the headers through HttpURLConnection (using getHeaderField(int n) and getHeaderFieldKey(int n)) seems to cause my second cookie to be lost. My questions are:
Is it true that HttpURLConnection itself can't cope with duplicate header names, and
if so, is there a workaround?

My recommended workaround is to not use HttpURLConnection at all, since it is crude and unintuitive, and to use commons-httpclient instead.
http://hc.apache.org/httpclient-3.x/
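For what it's worth, here is a minimal sketch of reading repeated headers with HttpClient 3.x (the URL is just a placeholder):

import org.apache.commons.httpclient.Header;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.GetMethod;

public class CookieHeaders {
    public static void main(String[] args) throws Exception {
        HttpClient client = new HttpClient();
        GetMethod get = new GetMethod("http://example.com/"); // placeholder URL
        try {
            client.executeMethod(get);
            // getResponseHeaders(name) returns every header sent under that
            // name, so duplicate Set-Cookie headers are all visible.
            for (Header h : get.getResponseHeaders("Set-Cookie")) {
                System.out.println(h.getName() + ": " + h.getValue());
            }
        } finally {
            get.releaseConnection();
        }
    }
}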

Without actually having tried it (I can't remember having dealt with this topic myself), there's also getHeaderFields, inherited from URLConnection. Does this do what you need?
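If it does: getHeaderFields() returns a Map<String, List<String>>, so each occurrence of a repeated name such as Set-Cookie keeps its own entry in the value list. A minimal sketch, again with a placeholder URL:

import java.net.URL;
import java.net.URLConnection;
import java.util.List;
import java.util.Map;

public class DuplicateHeaders {
    public static void main(String[] args) throws Exception {
        URLConnection conn = new URL("http://example.com/").openConnection(); // placeholder URL

        // Keys are header names; each value is the list of all values
        // received under that name, so duplicates are not lost.
        Map<String, List<String>> headers = conn.getHeaderFields();
        List<String> cookies = headers.get("Set-Cookie");
        if (cookies != null) {
            for (String cookie : cookies) {
                System.out.println("Set-Cookie: " + cookie);
            }
        }
    }
}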

Ok, I found the problem, and the answer to the original question. Basically, the cookie implementation I used (Python's default cookie library) used \r\n to delimit the different Set-Cookie headers (as opposed to \n), which confused HttpURLConnection and caused it to stop at the first occurrence of that delimiter (I am going to guess it stops at the first empty line). So the answer to the first question is: yes, it can cope with duplicate header names, but it is buggy in another way. Fixing the Python library is a workable workaround for now, but it's not going to work long term because we don't own that library. I am sure using the httpclient library is a sensible way to go, but I am hoping for a solution that requires less work, so I don't know exactly what to do there yet.

Related

Cleaning up URLs to remove personal information

Are there rules to identify and remove any PII from URLs? I would like this to be generic and handle all sorts of URLs we might encounter on the internet.
Clarification: I have a list of URLs from people browsing the internet and want to remove PII from those.
To answer the question as restated in your reply to snemarch:
Yes, I understand that. I meant: what considerations do I need to keep in mind to identify PII in URLs? What are the various ways in which PII might occur in URLs?
HTTP GET information can be transmitted in many different ways. Some, and likely most, will look like this:
example.com/form.php?key=value.
Other websites, including Stack Overflow, may use a URL rewrite to transform the link "example.com/form/value" into the equivalent "example.com/form.php?key=value". This rewriting is completely dependent on the configuration of the server, and there is no simple way to detect and strip off PII presented this way.
With this in mind, there is really no way to remove 100% of the PII from a list of different URLs, since such information can be indiscernible from a URL without any PII. At the very least, you can strip out information that is DEFINITELY PII, such as a URL in the form "example.com/form.php?key=value". I would be willing to bet that any URL with an "=" has some sort of variable in it and should be filtered. Past that, you're going to have to manually parse a majority of the list.
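A minimal sketch of that first pass, stripping the query string from anything that carries one (the class and method names are just illustrative):

import java.net.URI;
import java.net.URISyntaxException;

public class PiiFilter {
    // First pass only: treat any query string as a potential PII carrier
    // and strip it. Rewritten URLs like example.com/form/value will still
    // slip through, as discussed above.
    public static String stripQuery(String url) {
        try {
            URI u = new URI(url);
            // Rebuild the URI without its query and fragment components.
            return new URI(u.getScheme(), u.getAuthority(), u.getPath(),
                    null, null).toString();
        } catch (URISyntaxException e) {
            // Unparsable URL: safer to drop it than to keep possible PII.
            return null;
        }
    }

    public static void main(String[] args) {
        // Prints "http://example.com/form.php"
        System.out.println(stripQuery("http://example.com/form.php?key=value"));
    }
}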
Depending on how big the list is and how serious you are about filtering it, you could research popular mod_rewrite schemes for popular products and attempt to match them in your list, scrape URLs to determine additional information about them, and write some complicated and likely ugly heuristics to guess at what may be a variable in a URL, possibly factoring in similar URLs a user has visited and comparing the tokens of each URL. Similar URLs with slightly different text in a given token are probably variables and should be filtered.
Good luck!
You should never pass sensitive user information in a URL via GET. If you use POST instead, then just make sure the connection is HTTPS.

Reference implementation / .lib for full url encoding

I'm writing a Java application which parses links from HTML and uses them to request their content. URL encoding is very thorny when we have no idea of the "intent" of the URL's author. For example, when to use %20 or + is a complex issue (%20 vs +), and a browser would perform this encoding for a URL containing an un-encoded space.
There are many other situations in which a browser would change the content of a parsed url before requesting a page, for example:
http://www.Example.com/þ
... when parsed & requested by a browser becomes ...
http://www.Example.com/%C3%BE
.. and...
http://www.Example.com/&amp;
... when parsed & requested by a browser becomes ...
http://www.Example.com/&
So my question is: instead of re-inventing the wheel again, is there perhaps a Java library I haven't found that does this job? Failing that, can anyone point me towards a reference implementation in a common browser's source, or perhaps pseudocode? Failing that, any recommendations on approach are welcome!
Thanks,
Jon
HtmlUnit can certainly pick URLs out of HTML and resolve them (and much more).
I don't know whether it handles your corner cases, though. I would imagine it will handle the second, since that is a normal, if slightly funny-looking, use of HTML and a URL. I don't know what it will do with the first, in which an invalid URL is encoded in HTML.
I also know that if you find HtmlUnit doing something differently from real browsers, write a JUnit test case to prove it and file a bug report; its maintainers will happily fix it with great alacrity.
How about using java.net.URLEncoder.encode() and java.net.URLDecoder.decode()?
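One caveat: URLEncoder implements HTML form encoding (a space becomes +, for instance), which is not quite the same as encoding a whole URL. For encoding a path component, the multi-argument java.net.URI constructors may be closer to what's wanted here; a minimal sketch using the thorn example from the question:

import java.net.URI;

public class EncodeUrl {
    public static void main(String[] args) throws Exception {
        // The multi-argument URI constructors quote characters that are
        // illegal in their component, so toASCIIString() yields the
        // percent-encoded form a browser would request.
        URI uri = new URI("http", "www.Example.com", "/þ", null);
        System.out.println(uri.toASCIIString());
        // Prints "http://www.Example.com/%C3%BE"
    }
}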

Add HttpOnly flag to cookies on the fly with Apache?

So I have a Java webapp that uses Tomcat behind an Apache proxy layer. I'm looking to make all cookies set by the app carry the HttpOnly flag. The problem is that Tomcat is responsible for setting the flag on the application side, and its default (in Servlet API 2.5) is false. I was hoping I could set this flag for all cookies on the fly using Apache.
I've been trying different combinations and the closest I have gotten is setting the last cookie passed to httpOnly which is of course wrong:
Header append Set-Cookie "; HttpOnly"
I have no way of knowing what cookies/values are going to be passed from the app. Is this even possible?
The following mod_headers rewrite has the benefit that it won't duplicate HttpOnly if it's already there, if that sort of thing matters to you:
Header edit Set-Cookie "(?i)^((?:(?!;\s?HttpOnly).)+)$" "$1; HttpOnly"
See:
- Where I originally found the above regex
- An explanation of why all those parentheses with the negative lookahead assertion (search for "Finding Lines Containing or Not Containing Certain Words")
- A post where I found a small improvement to the regex (search for "Header edit Set-Cookie")
Try the following mod_headers directive.
Header edit Set-Cookie ^(.*)$ $1;HttpOnly
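As an aside, if upgrading is ever an option: Servlet 3.0 (implemented by Tomcat 7) added Cookie.setHttpOnly(), which lets the app flag its own cookies and makes the proxy-side rewrite unnecessary. A sketch, with a made-up cookie name:

import java.io.IOException;
import javax.servlet.http.Cookie;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class HttpOnlyServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        Cookie cookie = new Cookie("example", "value"); // hypothetical cookie
        cookie.setHttpOnly(true); // new in Servlet 3.0
        resp.addCookie(cookie);
    }
}

If I remember right, later Tomcat 6.0.x releases also expose a useHttpOnly attribute on the Context element, which at least covers the session cookie without code changes.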

Get the content of the HTTP stream in Java + Mozilla XPCOM

I've often read Stack Overflow as a source of answers, but now I have a very specific question and I can't really find any data on the internet. I trust you to be as helpful as always! :D
Basically, I'm relying on Mozilla's XULRunner and its XPCOM objects to analyze the HTTP stream of an SWT browser in a Java application on Linux.
Heavily based on Snippet128 and Snippet321 from the Java SWT website (can't post more than 1 URL sorry :/ ), my browser so far can parse all of the HTTP headers using an nsIHttpHeaderVisitor - and do some pretty stuff like printing them on a tree and such.
Full source is here.
Now... That's already pretty good. It covers the majority of what I want to do (school assignment at first, going a bit further than asked!).
But what I would really like is to be able to get the raw "content" data from every HTTP request: HTML of course ; but also CSS and images.
I've been trying different ways to achieve this goal, but everything has failed so far:
- Using an XPCOM object. But which one?
  - nsIInputStream would be a good one, but I can't seem to find where the right stream actually is... The nsIHttpChannel open() method (which returns an nsIInputStream) seems to be called by the SWT browser itself, leaving me with no way of getting the stream back.
  - nsIRequest: no luck.
  - Another listener that I'd have missed? I just spent an hour trying to use the nsIHttpActivityObserver interface, but it doesn't give me any HTTP content (merely GETs and 200 OKs).
- Using another object: the SWT browser itself, for instance. Well, it kind of works: its getText() method gives me the HTML source of the page I'm visiting. But I want more!
I'm really stuck here, and I would greatly appreciate any help.
Cheers!
Florent
Perhaps nsITraceableChannel can help you?

Strange problem with java.net.URL and java.net.URLConnection

I'm trying to download an image from a url.
The process I wrote works for everything except ONE content provider that we're dealing with.
When I access their JPGs via Firefox, everything looks kosher (happy Passover, btw). However, when I use my process I either:
A) get a 404
or
B) when I set a breakpoint at the URL line (URL url = new URL(str);) in the debugger,
then after the connection I DO get a file, but it's not a .jpg; it's some HTML that they're producing, with generic links and stuff. I don't see a redirect code, though! It comes back as 200.
Here's my code...
URL url = new URL(urlString);
URLConnection uc = url.openConnection();
String val = uc.getHeaderField(0); // status line, e.g. "HTTP/1.1 200 OK"
String contType = uc.getContentType();
System.out.println("FOUND OBJECT OF TYPE:" + contType);
InputStream is = null;
if(!val.contains("200")){
    //problem
}
else{
    is = uc.getInputStream();
}
Has anyone seen anything of this nature? I'm thinking maybe it's some MIME type issue, but that's just a total guess... I'm completely stumped.
Maybe the site is just using some kind of protection to prevent others from hotlinking its images or to disallow mass downloads.
Such sites usually check either the HTTP referrer (it must be from their own domain) or the user agent (it must be a browser, not a download manager). Set both and try it again.
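A quick way to test that theory with plain URLConnection; both header values below are only illustrative:

URL url = new URL(urlString);
URLConnection uc = url.openConnection();
// Pretend to be a browser following a link from the provider's own site.
uc.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 5.1)"); // made-up UA string
uc.setRequestProperty("Referer", "http://www.example.com/"); // placeholder referrer
InputStream is = uc.getInputStream();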
if(!val.contains("200")) // ...
First of all, I would suggest using the class HttpURLConnection, which provides the method getResponseCode(). Searching the whole data for some "200" implies:
- performance issues, and
- inconsistency (binary files can contain "200").
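A minimal sketch of that approach; disabling redirect following also makes status codes like the 301 mentioned below visible (the URL is a placeholder):

import java.net.HttpURLConnection;
import java.net.URL;

public class CheckStatus {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://example.com/image.jpg"); // placeholder URL
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setInstanceFollowRedirects(false); // surface 3xx codes yourself
        int code = conn.getResponseCode();
        if (code == HttpURLConnection.HTTP_OK) {
            // safe to read conn.getInputStream() here
        } else {
            System.out.println("Got HTTP " + code);
        }
    }
}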
Have you tried using Wireshark to see exactly what packets are going back and forth? This is often the fastest way to see what is different. That is:
First run Wireshark while using Firefox to fetch the image, and then
run Wireshark while using your code to fetch it.
Then compare and contrast the packets in both directions, and I almost guarantee you'll see something different in the HTTP headers or some other part of the traffic that will explain the problem.
All good guesses, but the "right answer" reward, I think, has to go to ivan_pertrovich_ivanovich_harkovich_rostropovitch_o'neil, because using HttpURLConnection I was able to see that, in fact, before getting the 404 I'm first getting a 301. So now it's just a matter of finding out from these people what they're expecting in the header that would make them less inclined to redirect me.
Thanks for the suggestion.
