On many websites (especially Gmail, Yahoo or Hotmail), you may notice that the URL
ends with something like: yahoo.com/abc/bcd.html;_x=12323;_y=2322;
What are these _x and _y parameters? How can I access them in server-side code?
They are parameters in the URL (as distinct from the query string). This article has a good discussion, including this helpful diagram:
<scheme>://<username>:<password>@<host>:<port>/<path>;<parameters>?<query>#<fragment>
Note that they're not "parameters" in the sense used in the Java EE ServletRequest#getParameter and such (there when they say "parameter" they mean query string or POST arguments, which are different).
This is defined in §3.3 of RFC 2396:
The path may consist of a sequence of path segments separated by a
single slash "/" character. Within a path segment, the characters
"/", ";", "=", and "?" are reserved. Each path segment may include a
sequence of parameters, indicated by the semicolon ";" character.
The parameters are not significant to the parsing of relative
references.
(For the avoidance of doubt: The term "path" above does not include the query string, see the beginning of §3.)
RFC 2396 is obsoleted by RFC 3986, though, which amends the above markedly:
Aside from dot-segments in hierarchical paths, a path segment is
considered opaque by the generic syntax. URI producing applications
often use the reserved characters allowed in a segment to delimit
scheme-specific or dereference-handler-specific subcomponents. For
example, the semicolon (";") and equals ("=") reserved characters are
often used to delimit parameters and parameter values applicable to
that segment. The comma (",") reserved character is often used for
similar purposes. For example, one URI producer might use a segment
such as "name;v=1.1" to indicate a reference to version 1.1 of
"name", whereas another might use a segment such as "name,1.1" to
indicate the same. Parameter types may be defined by scheme-specific
semantics, but in most cases the syntax of a parameter is specific to
the implementation of the URI's dereferencing algorithm.
They're just characters that may appear in the URL. You access them by parsing the URL, because they're not regular query string parameters.
Those are parameters to a segment in the path part of the URI.
The URI syntax is defined in RFC 3986 as follows:
URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
hier-part = "//" authority path-abempty
/ path-absolute
/ path-rootless
/ path-empty
[...]
The following are two example URIs and their component parts:
  foo://example.com:8042/over/there?name=ferret#nose
  \_/   \______________/\_________/ \_________/ \__/
   |           |            |            |        |
scheme     authority       path        query   fragment
   |   _____________________|__
  / \ /                        \
  urn:example:animal:ferret:nose
In this example (http://yahoo.com/abc/bcd.html;_x=12323;_y=2322;), these parameters are part of the path component. Essentially, this is just a convention used within that application for it to be able to identify resources.
Generally speaking, although paths in HTTP URIs are often similar to what you would find on a file system, they don't have to be related to the file system structure in any way. This is purely an implementation decision from the engine that dereferences the URI (i.e. the HTTP server implementation and what dispatches the request to whatever will produce a response).
Strictly speaking, the query is also an integral part of the URI (so many discussions you'll find on "RESTful" URIs are pointless, except for some SEO techniques).
Because parameters are passed via the query component when using HTML forms, many HTTP frameworks expose its content by splitting the query into a map for you. For example, in a Java/Servlet context, the query string (getQueryString()) automatically populates the parameters returned by getParameter(...).
If you want to get parameters from bcd.html;_x=12323;_y=2322;, you'll have to split the path yourself.
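As a rough illustration of splitting the path yourself, here is a minimal Java sketch. The class and method names (PathParams, lastSegmentParams) are made up for this example, and it assumes the parameters of interest sit on the last path segment in ;name=value form, as in the URL from the question.

```java
import java.net.URI;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class PathParams {
    // Extracts ";name=value" parameters from the last segment of the raw path.
    // Hypothetical helper for illustration; not part of any servlet API.
    static Map<String, String> lastSegmentParams(URI uri) {
        String rawPath = uri.getRawPath();
        String segment = rawPath.substring(rawPath.lastIndexOf('/') + 1);
        Map<String, String> params = new LinkedHashMap<>();
        String[] parts = segment.split(";");
        // parts[0] is the segment name itself ("bcd.html"); the rest are parameters.
        for (int i = 1; i < parts.length; i++) {
            if (parts[i].isEmpty()) {
                continue;
            }
            int eq = parts[i].indexOf('=');
            String name = eq < 0 ? parts[i] : parts[i].substring(0, eq);
            String value = eq < 0 ? "" : parts[i].substring(eq + 1);
            params.put(decode(name), decode(value));
        }
        return params;
    }

    // Percent-decodes a path piece. URLDecoder treats '+' as a space (form rules),
    // so a literal '+' is protected first.
    private static String decode(String s) {
        return URLDecoder.decode(s.replace("+", "%2B"), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        URI uri = URI.create("http://yahoo.com/abc/bcd.html;_x=12323;_y=2322;");
        System.out.println(lastSegmentParams(uri)); // {_x=12323, _y=2322}
    }
}
```

The split happens on the raw (still encoded) path, and decoding is applied only afterwards to each name and value, so an encoded %3B inside a value is not mistaken for a delimiter.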
I think that using getQuery loses information, is dangerous and that instead only getRawQuery should be used, and that any query parameter values that are known to be encoded should be manually decoded (once the raw query is split on the & characters) with URLDecoder.
Case in point: Assume you have the URL www.example.com with two query parameters:
a parameter url with value www.otherexample.com?b=2&c=3
a nondescript parameter d with value 4.
The parameter url should be url-encoded, so the URI that your application sees is:
www.example.com?url=www%2Eotherexample%2Ecom%3Fb%3D2%26c%3D3&d=4
Now, if you obtain the query part with getQuery, you get the following:
url=www.otherexample.com?b=2&c=3&d=4
Notice that you've already lost information as you can't say whether d is a query parameter of the www.example.com or of www.otherexample.com.
If instead you obtain the query part with getRawQuery, you get the following:
url=www%2Eotherexample%2Ecom%3Fb%3D2%26c%3D3&d=4
This time, no information is lost and all's well. You can parse the query part and URL-decode the value of the url parameter if you like.
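The difference can be reproduced with java.net.URI directly. This is only a sketch: the http:// scheme is added so that the example parses as an absolute URI, and the helper parseRawQuery is a hypothetical name for the "split on & first, decode afterwards" approach described above. It uses the Charset overload of URLDecoder.decode, which needs Java 10 or later.

```java
import java.net.URI;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class RawQueryDemo {
    // Splits the raw query on '&' first, then decodes each name and value.
    static Map<String, String> parseRawQuery(URI uri) {
        Map<String, String> params = new LinkedHashMap<>();
        String raw = uri.getRawQuery();
        if (raw == null) {
            return params;
        }
        for (String pair : raw.split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(URLDecoder.decode(name, StandardCharsets.UTF_8),
                       URLDecoder.decode(value, StandardCharsets.UTF_8));
        }
        return params;
    }

    public static void main(String[] args) {
        URI uri = URI.create(
            "http://www.example.com?url=www%2Eotherexample%2Ecom%3Fb%3D2%26c%3D3&d=4");
        // Decoded too early: the boundary between url and d is no longer recoverable.
        System.out.println(uri.getQuery());    // url=www.otherexample.com?b=2&c=3&d=4
        // Still encoded: safe to split on '&'.
        System.out.println(uri.getRawQuery());
        System.out.println(parseRawQuery(uri)); // {url=www.otherexample.com?b=2&c=3, d=4}
    }
}
```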
Am I missing anything?
You're correct.
URI.getQuery() is broken and you shouldn't use it.
Strange thing is I can't find any confirmation of this apart from your post, which made me think maybe URI.getQuery could be useful for something. But after some testing of my own I'm pretty sure it just shouldn't be used unless your application's query string doesn't follow the convention of separating arguments with ampersand.
EDIT 11/11/2019
As pointed out in a comment below, while you can use URI.getRawQuery() to work around the broken URI.getQuery() method, you can't just use the raw query as the query argument to the multi-argument URI constructor, as that constructor is also broken.
You can't use the multi-argument URI constructor if any of the query string arguments contain an ampersand. You could argue this is a bug, but the documentation of the expected behaviour contradicts itself so it's not clear which behaviour is correct. The javadoc of the multi-argument constructor says "Any character that is not a legal URI character is quoted". This implies that an escaped octet should NOT be quoted because the main class documentation includes it as a legal character ("The set of all legal URI characters consists of the unreserved, reserved, escaped, and other characters"). But further down, it documents the observed behaviour that the percent character ('%') is always quoted by the multi-argument constructors, which one assumes is without regard for whether it's part of an escaped octet.
Regardless of whether there is ever any acknowledgement that the documentation is contradictory, or what the correct behaviour should be, it is almost certain the current behaviour will never be altered. The only work-around is not to use the multi-argument constructors if you need the URI to end up containing the quoted ampersand octet "%26". Use the single-argument constructor instead, after doing your own encoding and quoting of special characters.
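A small sketch of the behaviour described in this edit, using a made-up host, path and query. The five-argument constructor re-quotes the percent sign of an already-encoded %26, whereas URI.create (which delegates to the single-argument constructor) leaves it alone.

```java
import java.net.URI;
import java.net.URISyntaxException;

class UriConstructorDemo {
    // Builds the URI with the multi-argument constructor, passing an already-encoded query.
    static URI multiArg(String rawQuery) {
        try {
            return new URI("http", "www.example.com", "/search", rawQuery, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Builds the same URI from a single string; the caller is responsible for all encoding.
    static URI singleArg(String rawQuery) {
        return URI.create("http://www.example.com/search?" + rawQuery);
    }

    public static void main(String[] args) {
        String rawQuery = "q=a%26b&d=4"; // q has the value "a&b"
        System.out.println(multiArg(rawQuery).getRawQuery());  // q=a%2526b&d=4
        System.out.println(singleArg(rawQuery).getRawQuery()); // q=a%26b&d=4
    }
}
```

Passing the decoded query ("q=a&b&d=4") to the multi-argument constructor does not help either, because the ampersand is a legal URI character and is left unquoted, which is why the single-argument form is the only way to end up with %26.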
What is the deal with Java's bizarre file protocol handling?
I mean, on Windows UNC paths get turned into five slashes, and I get why that happens, but on Linux an absolute path gets turned into file:/local/path/to/file
Shouldn't that have three slashes?
I'm assuming the authors of Java aren't incompetent, so is there an explanation for why that's acceptable?
Let’s start with the documentation of the URI class:
A hierarchical URI is subject to further parsing according to the syntax
[scheme:][//authority][path][?query][#fragment]
As you can see, the authority is optional. This is supported by the URI specification, section 3:
The scheme and path components are required, though the path may be empty (no characters). When authority is present, the path must either be empty or begin with a slash ("/") character. When authority is not present, the path cannot begin with two slash characters ("//").
A file: URI can have an authority separator, //, with an effectively empty authority after it, but it serves no purpose, so there is no harm in omitting it. It’s still a fully compliant URI.
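To see that both spellings carry the same components, one can parse them with java.net.URI. This is just a sketch with a made-up path; the helper names are hypothetical.

```java
import java.net.URI;

class FileUriDemo {
    static String pathOf(String uri) {
        return URI.create(uri).getPath();
    }

    static String authorityOf(String uri) {
        return URI.create(uri).getAuthority();
    }

    public static void main(String[] args) {
        String oneSlash = "file:/local/path/to/file";
        String threeSlashes = "file:///local/path/to/file";
        // Both forms yield the same path component.
        System.out.println(pathOf(oneSlash));     // /local/path/to/file
        System.out.println(pathOf(threeSlashes)); // /local/path/to/file
        // Neither form has an authority: the empty one is reported as null.
        System.out.println(authorityOf(oneSlash));
        System.out.println(authorityOf(threeSlashes));
    }
}
```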
I need to generate a href to a URI. All easy with the exception when it comes to reserved characters which need percent-encoding, e.g. link to /some/path;element should appear as <a href="/some/path%3Belement"> (I know that path;element represents a single entity).
Initially I was looking for a Java library that does this but I ended up writing something myself (look below for what failed with Java, as this question isn't Java-specific).
So, RFC 3986 does suggest when NOT to encode. This should happen, as I read it, when character falls under unreserved (ALPHA / DIGIT / "-" / "." / "_" / "~") class. So far so good. But what about the opposite case? RFC only mentions that percent (%) always needs encoding. But what about the others?
Question: is it correct to assume that everything that is not unreserved, can/should be percent-encoded? For example, opening bracket ( does not necessarily need encoding but semicolon ; does. If I don't encode it I end up looking for /first* when following <a href="/first;second">. But following <a href="/first(second"> I always end up looking for /first(second, as expected. What confuses me is that both ( and ; are in the same sub-delims class as far as RFC goes. As I imagine, encoding everything non-unreserved is a safe bet, but what about SEOability, user friendliness when it comes to localized URIs?
Now, what failed with Java libs. I have tried doing it like
new java.net.URI("http", "site", "/pa;th", null).toASCIIString()
but this gives http://site/pa;th which is no good. Similar results observed with:
javax.ws.rs.core.UriBuilder
Spring's UriUtils - I have tried both encodePath(String, String) and encodePathSegment(String, String)
[*] /first is a result of call to HttpServletRequest.getServletPath() in the server side when clicking on <a href="/first;second">
EDIT: I probably need to mention that this behaviour was observed under Tomcat, and I have checked both Tomcat 6 and 7 behave the same way.
Is it correct to assume that everything that is not unreserved, can/should be percent-encoded?
No. RFC 3986 says this:
"Under normal circumstances, the only time when octets within a URI are percent-encoded is during the process of producing the URI from its component parts. This is when an implementation determines which of the reserved characters are to be used as subcomponent delimiters and which can be safely used as data. "
The implication is that you decide which of the delimiters (i.e. the <delimiter> characters) need to be encoded depending on the context. Those which don't need to be encoded shouldn't be encoded.
For instance, you should not percent-encode a / if it appears in a path component, but you should percent-encode it when it appears in a query or fragment.
So, in fact, a ; character (which is a member of <reserved>) should not be automatically percent-encoded. And indeed the Java URL and URI classes won't do this; see the URI(...) javadoc, specifically step 7, for how the <path> component is handled.
This is reinforced by this paragraph:
"The purpose of reserved characters is to provide a set of delimiting characters that are distinguishable from other data within a URI. URIs that differ in the replacement of a reserved character with its corresponding percent-encoded octet are not equivalent. Percent- encoding a reserved character, or decoding a percent-encoded octet that corresponds to a reserved character, will change how the URI is interpreted by most applications. Thus, characters in the reserved set are protected from normalization and are therefore safe to be used by scheme-specific and producer-specific algorithms for delimiting data subcomponents within a URI."
So this says that a URL containing a percent-encoded ; is not the same as a URL that contains a raw ;. And the last sentence implies that they should NOT be percent encoded or decoded automatically.
Which leaves us with the question - why do you want ; to be percent encoded?
Let's say you have a CMS where people can create arbitrary pages having arbitrary paths. Later on, I need to generate href links to all pages in, for example, site map component. Therefore I need an algorithm to know which characters to escape. Semicolon has to be treated literally in this case and should be escaped.
Sorry, but it does not follow that semicolon should be escaped.
As far as the URL / URI spec is concerned, the ; has no special meaning. It might have special meaning to a particular web server / web site, but in general (i.e. without specific knowledge of the site) you have no way of knowing this.
If the ; does have special meaning in a particular URI, then if you percent-escape it, then you break that meaning. For instance, if the site uses ; to allow a session token to be appended to the path, then percent-encoding will stop it from recognizing the session token ...
If the ; is simply a data character provided by some client, then if you percent-encode it, you are potentially changing the meaning of the URI. Whether this matters depends on what the server does; i.e. whether it decodes or not as part of the application logic.
What this means is that knowing the "right thing to do" requires intimate knowledge of what the URI means to the end user and/or the site. This would require advanced mind-reading technology to implement. My recommendation would be to get the CMS to solve it by suitably escaping any delimiters in the URI paths before it delivers them to your software. The algorithm is necessarily going to be specific to the CMS and content delivery platform. It/they will be responding to requests for documents identified by the URLs and will need to know how to interpret them.
(Supporting arbitrary people using arbitrary paths is a bit crazy. There have to be some limits. For instance, not even Windows allows you to use a file separator character in a filename component. So you are going to have to have some boundaries somewhere. It is just a matter of deciding where they should be.)
The ABNF for an absolute path part:
path-absolute = "/" [ segment-nz *( "/" segment ) ]
segment = *pchar
segment-nz = 1*pchar
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
pct-encoded = "%" HEXDIG HEXDIG
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
reserved = gen-delims / sub-delims
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
pchar includes sub-delims so you would not have to encode any of these in the path part: :@-._~!$&'()*+,;=
I wrote my own URL builder which includes an encoder for the path - as always, caveat emptor.
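A path-segment encoder along those lines might look like the sketch below. This is not the builder mentioned above; the class name, method names and the alsoEncode parameter are invented for illustration. It keeps every character that RFC 3986 allows in pchar (unreserved, sub-delims, ":" and "@") and percent-encodes everything else as UTF-8 bytes, with the option of additionally encoding characters such as ";" when the producer knows they are data rather than delimiters.

```java
import java.nio.charset.StandardCharsets;

class SegmentEncoder {
    private static final String UNRESERVED_PUNCT = "-._~";
    private static final String SUB_DELIMS = "!$&'()*+,;=";
    private static final String HEX = "0123456789ABCDEF";

    // Percent-encodes a single path segment (input is treated as raw, unencoded data).
    // Characters allowed by pchar stay literal unless they are listed in alsoEncode.
    static String encodeSegment(String segment, String alsoEncode) {
        StringBuilder out = new StringBuilder();
        for (byte b : segment.getBytes(StandardCharsets.UTF_8)) {
            char c = (char) (b & 0xFF);
            if (isPchar(c) && alsoEncode.indexOf(c) < 0) {
                out.append(c);
            } else {
                out.append('%')
                   .append(HEX.charAt((b >> 4) & 0xF))
                   .append(HEX.charAt(b & 0xF));
            }
        }
        return out.toString();
    }

    // True for ASCII characters that may appear literally in a path segment.
    static boolean isPchar(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9')
            || UNRESERVED_PUNCT.indexOf(c) >= 0
            || SUB_DELIMS.indexOf(c) >= 0
            || c == ':' || c == '@';
    }

    public static void main(String[] args) {
        System.out.println(encodeSegment("path;element", ""));  // path;element
        System.out.println(encodeSegment("path;element", ";")); // path%3Belement
        System.out.println(encodeSegment("a b/c", ""));         // a%20b%2Fc
    }
}
```

Whether ";" goes into alsoEncode is exactly the context-dependent decision discussed earlier: the encoder cannot know it, so the caller has to say.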
I have to use HttpClient 2.0 (I cannot use anything newer), and I am running into the following issue. When I use a method (POST, in this case), it encodes the parameters as hexadecimal ASCII codes, and spaces are turned into "+" (something that the receiver doesn't want).
Does anyone know a way to avoid it?
Thanks a lot.
Even your browser does that, converting space character into +. See here http://download.oracle.com/javase/1.5.0/docs/api/java/net/URLEncoder.html
It URL-encodes the string, converting unsafe characters to bytes (UTF-8 is the recommended encoding) and then escaping them.
When encoding a String, the following rules apply:
The alphanumeric characters "a" through "z", "A" through "Z" and "0" through "9" remain the same.
The special characters ".", "-", "*", and "_" remain the same.
The space character " " is converted into a plus sign "+".
All other characters are unsafe and are first converted into one or more bytes using some encoding scheme. Then each byte is represented by the 3-character string "%xy", where xy is the two-digit hexadecimal representation of the byte. The recommended encoding scheme to use is UTF-8. However, for compatibility reasons, if an encoding is not specified, then the default encoding of the platform is used.
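These rules are easy to observe with a short sketch. The input string is made up, and the Charset overloads used here require Java 10 or later (older versions take the encoding name as a String instead).

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class FormEncodingDemo {
    static String encode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    static String decode(String s) {
        return URLDecoder.decode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "a b.c-d*e_f&g=h";
        String encoded = encode(original);
        // Space becomes '+', ". - * _" are untouched, '&' and '=' become %26 and %3D.
        System.out.println(encoded);         // a+b.c-d*e_f%26g%3Dh
        // Decoding reverses both the '+' and the %xy escapes.
        System.out.println(decode(encoded)); // a b.c-d*e_f&g=h
    }
}
```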
Also, see here http://www.w3.org/TR/html4/interact/forms.html#h-17.13.4.1
Control names and values are escaped. Space characters are replaced by `+', and then reserved characters are escaped as described in [RFC1738], section 2.2: Non-alphanumeric characters are replaced by `%HH', a percent sign and two hexadecimal digits representing the ASCII code of the character. Line breaks are represented as "CR LF" pairs (i.e., `%0D%0A').
The control names/values are listed in the order they appear in the document. The name is separated from the value by `=' and name/value pairs are separated from each other by `&'.
To answer your question: if you do not want the encoded form, I guess URLDecoder.decode will help you undo the encoding.
You could in theory avoid this by constructing the query string or request body containing parameters by hand.
But this would be a bad thing to do, because the HTML, HTTP, URL and URI specs all mandate that reserved characters in request parameters are encoded. And if you violate this, you may find that server-side HTTP stacks, proxies and so on reject your requests as invalid, or misbehave in other ways.
The correct way to deal with this issue is to do one of the following:
If the server is implemented in Java EE technology, use the relevant servlet API methods (e.g. ServletRequest.getParameter(...)) to fetch the request parameters. These will take care of any decoding for you.
If the parameters are part of a URL query string, you can instantiate a Java URL or URI object and use the getter to return you the query with the encoding removed.
If your server is implemented some other way (or if you need to unpick the request URL's query string or POST data yourself), then use URLDecoder.decode or equivalent to remove the % encoding and replace +'s ... after you have figured out where the query and parameter boundaries, etc are.
What would be the best practice for replacing Unicode characters in a URL?
For example, if I have a multilingual website that supports East European languages,
how should I format the URL so that it always contains valid characters?
What you want to do is called slugifying.
$slugified_url_part = iconv('utf-8', 'us-ascii//TRANSLIT', $url_part);
The above code will turn non-ASCII characters into their closest ASCII equivalents.
You should also trim whitespace and replace inner whitespace with a dash or underscore.
Making all chars lowercase is also common.
Slugifying is handy for memorable URLs and SEO.
You could of course use percent-encoding, but that can look ugly.
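Since the question is also tagged Java, here is a rough Java counterpart of the steps above. It is not equivalent to iconv's //TRANSLIT: it relies on Unicode decomposition (java.text.Normalizer) to strip accents, so letters that have no decomposition (for example the Polish "ł") are simply dropped rather than transliterated. The class and method names are made up for this sketch.

```java
import java.text.Normalizer;
import java.util.Locale;

class Slugify {
    static String slugify(String input) {
        // Decompose accented letters into base letter + combining mark, then drop the marks.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        String ascii = decomposed.replaceAll("\\p{M}+", "");
        return ascii.toLowerCase(Locale.ROOT)
                .replaceAll("[^a-z0-9]+", "-") // whitespace and anything else becomes a dash
                .replaceAll("^-+|-+$", "");    // trim leading and trailing dashes
    }

    public static void main(String[] args) {
        // Czech sample text, written with escapes to stay independent of the file encoding.
        String title = "  \u017Dlu\u0165ou\u010Dk\u00FD k\u016F\u0148  ";
        System.out.println(slugify(title)); // zlutoucky-kun
    }
}
```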
Use Percent-encoding. Most languages have a helper function already built in.
Percent-encoding, also known as URL encoding, is a mechanism for encoding information in a Uniform Resource Identifier (URI) under certain circumstances. Although it is known as URL encoding it is, in fact, used more generally within the main Uniform Resource Identifier (URI) set, which includes both Uniform Resource Locator (URL) and Uniform Resource Name (URN). As such it is also used in the preparation of data of the "application/x-www-form-urlencoded" media type, as is often used in email messages and the submission of HTML form data in HTTP requests.
When using PHP, you can use urlencode() to build your URLs.
The tags on this are a bit confusing, containing both PHP and Java.
For the Java side.
Use URLEncoder.encode("Your String Here", "UTF-8");
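A slightly fuller sketch of the Java side, with a made-up Czech sample string. URLEncoder follows the form-encoding rules, so a space becomes "+"; the second helper (a hypothetical name) swaps that for %20, which is the usual form outside query strings. This is an approximation rather than a strict RFC 3986 path encoder. The Charset overload needs Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class Utf8EncodeDemo {
    // Form-style encoding, suitable for query parameter names and values.
    static String encodeForQuery(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    // Same encoding, but with spaces as %20. A literal '+' in the input has already
    // been encoded as %2B, so replacing the remaining '+' characters is safe.
    static String encodeForPathSegment(String s) {
        return encodeForQuery(s).replace("+", "%20");
    }

    public static void main(String[] args) {
        String text = "\u017Elut\u00FD k\u016F\u0148";
        System.out.println(encodeForQuery(text));       // %C5%BElut%C3%BD+k%C5%AF%C5%88
        System.out.println(encodeForPathSegment(text)); // %C5%BElut%C3%BD%20k%C5%AF%C5%88
    }
}
```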