I think that using getQuery loses information and is dangerous. Instead, only getRawQuery should be used, and any query parameter values that are known to be encoded should be decoded manually with URLDecoder, once the raw query has been split on the & characters.
Case in point: Assume you have the URL www.example.com with two query parameters:
a parameter url with value www.otherexample.com?b=2&c=3
a nondescript parameter d with value 4.
The parameter url should be url-encoded, so the URI that your application sees is:
www.example.com?url=www%2Eotherexample%2Ecom%3Fb%3D2%26c%3D3&d=4
Now, if you obtain the query part with getQuery, you get the following:
url=www.otherexample.com?b=2&c=3&d=4
Notice that you've already lost information, as you can't tell whether d is a query parameter of www.example.com or of www.otherexample.com.
If instead you obtain the query part with getRawQuery, you get the following:
url=www%2Eotherexample%2Ecom%3Fb%3D2%26c%3D3&d=4
This time, no information is lost and all's well. You can parse the query part and URL-decode the value of the url parameter if you like.
Am I missing anything?
You're correct.
URI.getQuery() is broken and you shouldn't use it.
Strange thing is I can't find any confirmation of this apart from your post, which made me think maybe URI.getQuery could be useful for something. But after some testing of my own I'm pretty sure it just shouldn't be used unless your application's query string doesn't follow the convention of separating arguments with ampersand.
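To make the difference concrete, here is a minimal sketch of the approach described in the question: split the raw query on & first, then decode each piece. The class and method names (RawQueryDemo, parseRawQuery) are made up for illustration, and the http:// scheme and the slash before the ? are assumptions added so the example URL parses as an absolute URI. Note that URLDecoder also turns + into a space, which may or may not be what your application wants.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

class RawQueryDemo {

    // Hypothetical helper: splits the RAW query on '&' first and only then
    // decodes each name and value, so an encoded '&' inside a value survives.
    static Map<String, String> parseRawQuery(String rawQuery) {
        Map<String, String> params = new LinkedHashMap<>();
        if (rawQuery == null || rawQuery.isEmpty()) {
            return params;
        }
        for (String pair : rawQuery.split("&")) {
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(decode(name), decode(value));
        }
        return params;
    }

    // URLDecoder handles %xy escapes and also converts '+' to a space.
    private static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        URI uri = URI.create(
            "http://www.example.com/?url=www%2Eotherexample%2Ecom%3Fb%3D2%26c%3D3&d=4");

        // Decoded query: the boundary between "url" and "d" is ambiguous.
        System.out.println(uri.getQuery());    // url=www.otherexample.com?b=2&c=3&d=4

        // Raw query: the encoded '&' is still distinguishable from the separator.
        System.out.println(uri.getRawQuery()); // url=www%2Eotherexample%2Ecom%3Fb%3D2%26c%3D3&d=4

        System.out.println(parseRawQuery(uri.getRawQuery()));
    }
}
```

With the raw query, the map ends up with exactly two entries, url and d, and the value of url still contains its own b and c parameters.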
EDIT 11/11/2019
As pointed out in a comment below, while you can use URI.getRawQuery() to work around the broken URI.getQuery() method, you can't just use the raw query as the query argument to the multi-argument URI constructor, as that constructor is also broken.
You can't use the multi-argument URI constructor if any of the query string arguments contain an ampersand. You could argue this is a bug, but the documentation of the expected behaviour contradicts itself so it's not clear which behaviour is correct. The javadoc of the multi-argument constructor says "Any character that is not a legal URI character is quoted". This implies that an escaped octet should NOT be quoted because the main class documentation includes it as a legal character ("The set of all legal URI characters consists of the unreserved, reserved, escaped, and other characters"). But further down, it documents the observed behaviour that the percent character ('%') is always quoted by the multi-argument constructors, which one assumes is without regard for whether it's part of an escaped octet.
Regardless of whether there is ever any acknowledgement that the documentation is contradictory, or what the correct behaviour should be, it is almost certain the current behaviour will never be altered. The only work-around is not to use the multi-argument constructors if you need the URI to end up containing the quoted ampersand octet "%26". Use the single-argument constructor instead, after doing your own encoding and quoting of special characters.
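The behaviour described above can be checked with a small sketch like the following. The class and helper names are invented for illustration, and the host and query values are placeholders. It compares the raw query produced by the five-argument constructor with the one produced by the single-argument constructor.

```java
import java.net.URI;
import java.net.URISyntaxException;

class UriConstructorDemo {

    // Builds the URI with the multi-argument constructor
    // (scheme, authority, path, query, fragment) and returns its raw query.
    static String viaMultiArg(String query) {
        try {
            return new URI("http", "www.example.com", "/", query, null).getRawQuery();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Builds the URI from a single, already-encoded string.
    static String viaSingleArg(String encodedQuery) {
        try {
            return new URI("http://www.example.com/?" + encodedQuery).getRawQuery();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // '%' is always quoted, so the existing escape gets escaped again.
        System.out.println(viaMultiArg("url=a%26b&d=4"));  // url=a%2526b&d=4

        // A literal '&' is a legal query character, so it is left alone
        // and becomes indistinguishable from the separator.
        System.out.println(viaMultiArg("url=a&b&d=4"));    // url=a&b&d=4

        // The single-argument constructor keeps the string as given.
        System.out.println(viaSingleArg("url=a%26b&d=4")); // url=a%26b&d=4
    }
}
```

Neither input to the multi-argument constructor yields %26 in the result, which is why the single-argument constructor is the only route when that octet is needed.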
I am using the Java URL constructor "URL(URL context, String spec)" found here but the constructed URL is not what I expect - it is leaving out a path segment provided in the context parameter.
As an example, this code
new URL(new URL("http://asdf.com/z"), "a/b/c");
produces a URL with value
http://asdf.com/a/b/c
So it has left out the "z" path segment.
I have two questions:
What is the meaning of "context" the first parameter here in the java doc? I could not find mention of it in the URL specification nor did I really find it in java doc.
Is leaving out the "z" expected behavior?
Thanks!
What is the meaning of "context" the first parameter here in the java doc?
It's like the "base URL" of the spec parameter. If context is https://example.com, and spec is /foo, the constructor would create https://example.com/foo. It's similar to (but not exactly the same as, as we'll see later) asking "I am currently on https://example.com, and I want to go to /foo, what would my final URL be?"
Is leaving out the "z" expected behavior?
Yes. If you follow the rules for resolving a relative URL against a base URL in RFC 2396 for this case, you will reach this step:
(6) If this step is reached, then we are resolving a relative-path
reference. The relative path needs to be merged with the base
URI's path. Although there are many ways to do this, we will
describe a simple method using a separate string buffer.
(a) All but the last segment of the base URI's path component is
copied to the buffer. In other words, any characters after the
last (right-most) slash character, if any, are excluded.
(b) The reference's path component is appended to the buffer
string.
The "last segment" here refers to z, and that is not added to the buffer. Right after that, the path a/b/c "is appended to the buffer". Steps (c) onwards deal with removing . and .., which is irrelevant here.
Note that RFC 2396 doesn't say you MUST implement the algorithm in this way, but that whatever your implementation is, your output must match the output of that algorithm:
The above algorithm is intended to provide an example by which the
output of implementations can be tested -- implementation of the
algorithm itself is not required.
So yeah, this is expected. To keep the /z, you should add another / after the z:
new URL(new URL("http://asdf.com/z/"), "a/b/c")
This way the "last segment" becomes the empty string.
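A quick way to see the effect of the trailing slash is a sketch like this one. The class and the resolve helper are made up for illustration; the URLs are the ones from the question. Newer JDKs mark the URL constructors as deprecated, so expect a compiler note there; the constructors still behave as shown.

```java
import java.net.MalformedURLException;
import java.net.URL;

class UrlContextDemo {

    // Resolves spec against context using the URL(URL, String) constructor.
    static String resolve(String context, String spec) {
        try {
            return new URL(new URL(context), spec).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // No trailing slash: "z" is the last segment and is dropped.
        System.out.println(resolve("http://asdf.com/z", "a/b/c"));  // http://asdf.com/a/b/c

        // Trailing slash: the last segment is empty, so "z" is kept.
        System.out.println(resolve("http://asdf.com/z/", "a/b/c")); // http://asdf.com/z/a/b/c
    }
}
```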
You can treat the context like the current directory in file system.
With context "http://asdf.com/z", the current directory is "http://asdf.com/", and using "a/b/c" as the spec results in the full path "http://asdf.com/a/b/c".
Is there any method that already implements proper ETag quoting for http headers?
As pointed out in Syntax for ETag? the proper way is not as trivial as putting double quotes around it.
Couldn't find anything obvious.
That's kind of misleading. The quotes are an integral part of the ETag, so there's no transition from "unquoted ETag" to "quoted ETag".
If what you're after is a way to include characters not allowed in ETags, you'll just have to invent a custom escaping syntax. Which one doesn't matter, because your server is producing and consuming them, and for clients they are fully opaque.
That answer is based on a previous version of the ETag specification. The current one (RFC 7232) explicitly disallows the use of the double-quote character within the opaque ETag.
So assuming that the opaque part of your ETag is valid according to RFC 7232, it really is as simple as putting double quotes around it.
However, I recommend that instead of doing that you require whoever is providing the ETag to include the double quotes. That's because they are necessary to distinguish weak ETags. Without them you are left with a more complicated API, or more commonly, no way to specify weak ETags at all.
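If you do end up adding the quotes yourself, a minimal sketch could look like the following. The class and method names are hypothetical, and the character check is my reading of the etagc rule in RFC 7232 (0x21, 0x23 to 0x7E, and obs-text 0x80 to 0xFF); treat it as an illustration, not as a vetted validator. The weak flag is included only to show why a bare "add quotes" helper needs an extra parameter.

```java
class ETagQuoting {

    // Wraps an opaque tag in double quotes, optionally with the weak prefix.
    // Rejects characters that the etagc rule does not allow (assumption: the
    // caller passes the opaque part only, without quotes or "W/").
    static String quote(String opaque, boolean weak) {
        for (int i = 0; i < opaque.length(); i++) {
            char c = opaque.charAt(i);
            boolean allowed = c == 0x21
                    || (c >= 0x23 && c <= 0x7E)
                    || (c >= 0x80 && c <= 0xFF);
            if (!allowed) {
                throw new IllegalArgumentException(
                        "Character not allowed in an ETag at index " + i);
            }
        }
        return (weak ? "W/" : "") + '"' + opaque + '"';
    }

    public static void main(String[] args) {
        System.out.println(quote("abc123", false)); // "abc123"
        System.out.println(quote("abc123", true));  // W/"abc123"
    }
}
```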
On many websites (especially Gmail, Yahoo or Hotmail), you will notice that the URL
is followed by something like: yahoo.com/abc/bcd.html;_x=12323;_y=2322;
What are these _x and _y parameters? How can I access them in server-side code?
They are parameters in the URL (as distinct from the query string), this article has a good discussion, including this helpful diagram:
<scheme>://<username>:<password>@<host>:<port>/<path>;<parameters>?<query>#<fragment>
Note that they're not "parameters" in the sense used in the Java EE ServletRequest#getParameter and such (there when they say "parameter" they mean query string or POST arguments, which are different).
This is defined in §3.3 of RFC 2396:
The path may consist of a sequence of path segments separated by a
single slash "/" character. Within a path segment, the characters
"/", ";", "=", and "?" are reserved. Each path segment may include a
sequence of parameters, indicated by the semicolon ";" character.
The parameters are not significant to the parsing of relative
references.
(For the avoidance of doubt: The term "path" above does not include the query string, see the beginning of §3.)
RFC 2396 is obsoleted by RFC 3986, though, which amends the above markedly:
Aside from dot-segments in hierarchical paths, a path segment is
considered opaque by the generic syntax. URI producing applications
often use the reserved characters allowed in a segment to delimit
scheme-specific or dereference-handler-specific subcomponents. For
example, the semicolon (";") and equals ("=") reserved characters are
often used to delimit parameters and parameter values applicable to
that segment. The comma (",") reserved character is often used for
similar purposes. For example, one URI producer might use a segment
such as "name;v=1.1" to indicate a reference to version 1.1 of
"name", whereas another might use a segment such as "name,1.1" to
indicate the same. Parameter types may be defined by scheme-specific
semantics, but in most cases the syntax of a parameter is specific to
the implementation of the URI's dereferencing algorithm.
They're just characters that may appear in the URL. You access them by parsing the URL, because they're not regular query string parameters.
Those are parameters to a segment in the path part of the URI.
The URI syntax is defined in RFC 3986 as follows:
URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
hier-part = "//" authority path-abempty
/ path-absolute
/ path-rootless
/ path-empty
[...]
The following are two example URIs and their component parts:
   foo://example.com:8042/over/there?name=ferret#nose
   \_/   \______________/\_________/ \_________/ \__/
    |           |            |            |        |
 scheme     authority       path        query   fragment
    |   _____________________|__
   / \ /                        \
   urn:example:animal:ferret:nose
In this example (http://yahoo.com/abc/bcd.html;_x=12323;_y=2322;), these parameters are part of the path component. Essentially, this is just a convention used within that application for it to be able to identify resources.
Generally speaking, although paths in HTTP URIs are often similar to what you would find on a file system, they don't have to be related to the file system structure in any way. This is purely an implementation decision from the engine that dereferences the URI (i.e. the HTTP server implementation and what dispatches the request to whatever will produce a response).
Strictly speaking, the query is also an integral part of the URI (so many discussions you'll find on "RESTful" URIs are pointless, except for some SEO techniques).
Because parameters are passed via the query segment when using HTML forms, many HTTP frameworks expose its content by splitting the query for you into a map. For example, in a Java/Servlet context, the query string (getQueryString()) automatically populates the parameters returned by getParameter(...).
If you want to get parameters from bcd.html;_x=12323;_y=2322;, you'll have to split the path yourself.
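Splitting the path yourself could be sketched as follows. The class and method names are made up, and the sketch only looks at the last path segment and does not percent-decode names or values; a real application would adapt both to its own convention.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

class PathParamsDemo {

    // Hypothetical helper: extracts ";name=value" parameters from the last
    // segment of a raw path such as "/abc/bcd.html;_x=12323;_y=2322;".
    static Map<String, String> lastSegmentParams(String rawPath) {
        Map<String, String> params = new LinkedHashMap<>();
        String lastSegment = rawPath.substring(rawPath.lastIndexOf('/') + 1);
        String[] parts = lastSegment.split(";");
        // parts[0] is the segment name itself ("bcd.html"); the rest are parameters.
        for (int i = 1; i < parts.length; i++) {
            if (parts[i].isEmpty()) {
                continue; // tolerate ";;" in the segment
            }
            int eq = parts[i].indexOf('=');
            String name = eq < 0 ? parts[i] : parts[i].substring(0, eq);
            String value = eq < 0 ? "" : parts[i].substring(eq + 1);
            params.put(name, value);
        }
        return params;
    }

    public static void main(String[] args) {
        URI uri = URI.create("http://yahoo.com/abc/bcd.html;_x=12323;_y=2322;");
        System.out.println(lastSegmentParams(uri.getRawPath())); // {_x=12323, _y=2322}
    }
}
```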
I have to use HttpClient 2.0 (I cannot use anything newer), and I am running into the following issue. When I use a method (POST, in this case), it encodes the parameters as hexadecimal ASCII codes, and spaces are turned into "+" (something that the receiver doesn't want).
Does anyone know a way to avoid it?
Thanks a lot.
Even your browser does that, converting the space character into +. See http://download.oracle.com/javase/1.5.0/docs/api/java/net/URLEncoder.html
URLEncoder encodes a string for use in a URL: unsafe characters are converted to bytes (UTF-8 is recommended) and written as %xy escapes.
When encoding a String, the following rules apply:
The alphanumeric characters "a" through "z", "A" through "Z" and "0" through "9" remain the same.
The special characters ".", "-", "*", and "_" remain the same.
The space character " " is converted into a plus sign "+".
All other characters are unsafe and are first converted into one or more bytes using some encoding scheme. Then each byte is represented by the 3-character string "%xy", where xy is the two-digit hexadecimal representation of the byte. The recommended encoding scheme to use is UTF-8. However, for compatibility reasons, if an encoding is not specified, then the default encoding of the platform is used.
Also, see here http://www.w3.org/TR/html4/interact/forms.html#h-17.13.4.1
Control names and values are escaped. Space characters are replaced by `+', and then reserved characters are escaped as described in [RFC1738], section 2.2: Non-alphanumeric characters are replaced by `%HH', a percent sign and two hexadecimal digits representing the ASCII code of the character. Line breaks are represented as "CR LF" pairs (i.e., `%0D%0A').
The control names/values are listed in the order they appear in the document. The name is separated from the value by `=' and name/value pairs are separated from each other by `&'.
To answer your question: if you do not want the encoded form, I guess URLDecoder.decode will help you turn the encoded string back into the original.
You could in theory avoid this by constructing the query string or request body containing parameters by hand.
But this would be a bad thing to do, because the HTML, HTTP, URL and URI specs all mandate that reserved characters in request parameters are encoded. And if you violate this, you may find that server-side HTTP stacks, proxies and so on reject your requests as invalid, or misbehave in other ways.
The correct way to deal with this issue is to do one of the following:
If the server is implemented in Java EE technology, use the relevant servlet API methods (e.g. ServletRequest.getParameter(...)) to fetch the request parameters. These will take care of any decoding for you.
If the parameters are part of a URL query string, you can instantiate a Java URI object and use getQuery() to return the query with the percent-encoding removed (URL.getQuery() returns the query undecoded).
If your server is implemented some other way (or if you need to unpick the request URL's query string or POST data yourself), then use URLDecoder.decode or equivalent to remove the % encoding and replace the +'s, after you have figured out where the query and parameter boundaries are.
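As an illustration of the encoding rules quoted above and of undoing them with URLDecoder, here is a small sketch. The class and method names are invented. The encodeWithPercent20 helper is an extra assumption on my part, not something from the answers: it relies on the fact that URLEncoder writes a literal + as %2B, so every + left in its output stands for a space and can be swapped for %20 if the receiver insists on that form.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class FormEncodingDemo {

    // Standard form encoding: space becomes '+', other unsafe bytes become %xy.
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Assumption: the receiver only accepts %20 for spaces.
    static String encodeWithPercent20(String value) {
        return encode(value).replace("+", "%20");
    }

    // Reverses both '+' and %xy escapes; apply it to one name or value at a time.
    static String decode(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(encode("hello world & more"));              // hello+world+%26+more
        System.out.println(encodeWithPercent20("hello world & more")); // hello%20world%20%26%20more
        System.out.println(decode("hello+world+%26+more"));            // hello world & more
    }
}
```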
Is there any real way to represent a URL (which more than likely will also have a query string) as a filename in Java without obscuring the original URL completely?
My first approach was to simply escape invalid characters with arbitrary replacements (for example, replacing "/" with "_", etc).
The problem with replacing characters with underscores is that a URL such as "app/my_app" becomes "app_my_app", which obscures the original URL completely.
I have also attempted to percent-encode all the special characters, but a name full of sequences like %3e and %20 is not really readable.
Thank you for any suggestions.
Well, you should know what you want here, exactly. Keep in mind that the restrictions on file names vary between systems. On a Unix system you probably only need to escape the virgule somehow, whereas on Windows you need to take care of the colon and the question mark as well.
I guess the safest thing would be to percent-encode anything that could potentially clash (everything non-alphanumeric would be a good candidate, although you might adapt this to the platform). It's still somewhat readable and you're guaranteed to get the original URL back.
Why? URL-encoding is already defined in an RFC: there's not much point in reinventing it. Basically you must have an escape character such as %, otherwise you can't tell whether a character represents itself or an escape. E.g. in your example app_my_app could represent app/my/app. You therefore also need a double-escape convention so you can represent the escape character itself. It is not simple.
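A reversible mapping along these lines could be sketched as below. The class and method names are hypothetical, and the set of characters left unescaped (letters, digits, '.', '-', '_') is an assumption you would tune per platform. Because % itself is escaped as %25, the original URL can always be recovered, which is the double-escape point made above.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class UrlFileNames {

    private static final String HEX = "0123456789ABCDEF";

    // Percent-encodes every byte that is not a letter, digit, '.', '-' or '_'.
    static String toFileName(String url) {
        StringBuilder sb = new StringBuilder();
        for (byte b : url.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            boolean safe = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                    || (c >= '0' && c <= '9') || c == '.' || c == '-' || c == '_';
            if (safe) {
                sb.append((char) c);
            } else {
                sb.append('%').append(HEX.charAt(c >> 4)).append(HEX.charAt(c & 0xF));
            }
        }
        return sb.toString();
    }

    // Reverses toFileName; assumes the input was produced by it.
    static String fromFileName(String name) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (c == '%') {
                out.write(Integer.parseInt(name.substring(i + 1, i + 3), 16));
                i += 2; // skip the two hex digits just consumed
            } else {
                out.write(c);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(toFileName("app/my_app"));                 // app%2Fmy_app
        System.out.println(toFileName("http://example.com/a?b=1"));   // http%3A%2F%2Fexample.com%2Fa%3Fb%3D1
        System.out.println(fromFileName("app%2Fmy_app"));             // app/my_app
    }
}
```

Here "app/my_app" and "app_my_app" map to different file names, so the ambiguity from the underscore approach does not arise.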