Java URI Builder build method replacing "?" character in the path

Java URI Builder build method replacing "?" character in the path - java

I have an API at the following path
/v0/segments/ch/abc/view/status/ACTIVE?sc=%s&expiryGteInMs=%d
I am building a Client using the URIBuilder in Java.
return UriBuilder
.fromUri(config.getHost())
.path(String.format(config.getPath(),request.List(), request.getTime()))
.build();
The request contains a list to be substituted in place of %s and the time to be substituted in place of %d. But the request being formed has a path like this
/v0/segments/ch/abc/view/status/ACTIVE%3Fsc=FK,GR&expiryGteInMs=1611081000000
Basically the "?" character is being replaced by %3F. Can somebody help me with this, please?
P.S: I know that we can use the ".queryParam" option available in URIBuilder. Looking for the actual reason why this is happening?

Most probably library that you are using is encoding url, and ? encodes to %3F.
Why this happens(in short): url could contain only specific set of character, and ? is not one of them, so, in order to transfer this character, we should encode it (so called Percent-encoding).
A bit longer explanation (taken from here):
URL encoding converts characters into a format that can be transmitted over the Internet.
URLs can only be sent over the Internet using the ASCII character-set.
Since URLs often contain characters outside the ASCII set, the URL has to be converted into a valid ASCII format.
URL encoding replaces unsafe ASCII characters with a "%" followed by two hexadecimal digits.
URLs cannot contain spaces. URL encoding normally replaces a space with a plus (+) sign or with %20.

Related

Request.getParameter is wrongly decoding the string

I have the URL with the following query string
equipmentAccessoryRoute=LFVR+BASICACC
When I do request.getParameter("equipmentAccessoryRoute") it returns 'LFVR BASICACC' in a string variable, replacing plus sign with space.
To resolve this issue I did something like this
String accessoryRoute = java.net.URLEncoder.encode(request.getParameter("equipmentAccessoryRoute"),"UTF-8");
It worked perfectly but now yt is not working for the following query string (which was working before)
`equipmentAccessoryRoute=C1000IP5EL#-A`
Decoding converts this into 'C1000IP5EL%40-A' and stores into a string.
I am really confused. I tried to learn URL encoding but find it very hard to understand.

URL - Uniform Resource Locator
URLs can only be sent over the Internet using the ASCII
character-set. Since URLs often contain characters outside the ASCII
set, the URL has to be converted into a valid ASCII format.
URL encoding replaces unsafe ASCII characters with a "%" followed by
two hexadecimal digits.
URLs cannot contain spaces. URL encoding normally replaces a space
with a plus (+) sign or with %20.
Hope this helps.

Need help identifying type of UTF Encoding

I'm having a hard time trying to figure out the type of unicode that i need to convert to pass data for post request. Mostly would be chinese characters.
Example String:
的事故事务院治党派驻地是不是
Expected Unicode: %u7684%u4E8B%u6545%u4E8B%u52A1%u9662%u6CBB%u515A%u6D3E%u9A7B%u5730%u662F%u4E0D%u662F
Tried to encode to UTF16-BE:
%76%84%4E%8B%65%45%4E%8B%52%A1%5C%40%5C%40%95%7F%67%1F%8D%27%7B%49%5F%85%62%08%59%1A
Encoded text in UTF-16: %FF%FE%84%76%8B%4E%45%65%8B%4E%A1%52%62%96%BB%6C%5A%51%3E%6D%7B%9A%30%57%2F%66%0D%4E%2F%66
Encoded text in UTF-8: %E7%9A%84%E4%BA%8B%E6%95%85%E4%BA%8B%E5%8A%A1%E9%99%A2%E6%B2%BB%E5%85%9A%E6%B4%BE%E9%A9%BB%E5%9C%B0%E6%98%AF%E4%B8%8D%E6%98%AF
As you can see, UTF16-BE is the closest, but it only takes 2 bytes and there should be an additional %u in front of every character as shown in the expected unicode.
I've been using URLEncoder method to get the encoded text, with the standard charset encodings but it doesn't seem to return the expected unicode.
Code:
String text = "的事故事务院治党派驻地是不是";
URLEncoder.encode(text, "UTF-16BE");

As Kayaman said in a comment: Your expectation is wrong.
That is because %uNNNN is not a valid URL encoding of Unicode text. As Wikipedia says it:
There exists a non-standard encoding for Unicode characters: %uxxxx, where xxxx is a UTF-16 code unit represented as four hexadecimal digits. This behavior is not specified by any RFC and has been rejected by the W3C.
So unless your server is expected non-standard input, your expectation is wrong.
Instead, use UTF-8. As Wikipedia says it:
The generic URI syntax mandates that new URI schemes that provide for the representation of character data in a URI must, in effect, represent characters from the unreserved set without translation, and should convert all other characters to bytes according to UTF-8, and then percent-encode those values. This requirement was introduced in January 2005 with the publication of RFC 3986. URI schemes introduced before this date are not affected.
That is however for sending data in a URL, e.g. as part of a GET.
For sending text data as part of a application/x-www-form-urlencoded encoded POST, see the HTML5 documentation:
If the form element has an accept-charset attribute, let the selected character encoding be the result of picking an encoding for the form.
Otherwise, if the form element has no accept-charset attribute, but the document's character encoding is an ASCII-compatible character encoding, then that is the selected character encoding.
Otherwise, let the selected character encoding be UTF-8.
Since most web pages ("the document") are presented in UTF-8 these days, that would likely mean UTF-8.

I think that you are thinking too far. The encoding of a text doesn't need to "resemble" in any way the string of Unicode code points of this text. These are two different things.
To send the string 的事故事务院治党派驻地是不是 in a POST request, just write the entire POST request and encode it with UTF-8, and the resulting bytes are what is sent as the body of the POST request to the server.
As pointed out by #Andreas, UTF-8 is the default encoding of HTML5, so it's not even necessary to set the accept-charset attribute, because the server will automatically use UTF-8 to decode the body of your request, if accept-charset is not set.

HttpClient 2.0. Params "codified"

I have to use HttpClient 2.0 (can not use anything newer), and I am running into the next issue. When I use the method (post, in that case), it "codify" the parameters to the Hexadecimal ASCII code, and the "spaces" turned into "+" (something that the receiver don't want).
Does anyone know a way to avoid it?
Thanks a lot.

Even your browser does that, converting space character into +. See here http://download.oracle.com/javase/1.5.0/docs/api/java/net/URLEncoder.html
It encodes URL, converts to UTF-8 like string.
When encoding a String, the following rules apply:
The alphanumeric characters "a" through "z", "A" through "Z" and "0" through "9" remain the same.
The special characters ".", "-", "*", and "_" remain the same.
The space character " " is converted into a plus sign "+".
All other characters are unsafe and are first converted into one or more bytes using some encoding scheme. Then each byte is represented by the 3-character string "%xy", where xy is the two-digit hexadecimal representation of the byte. The recommended encoding scheme to use is UTF-8. However, for compatibility reasons, if an encoding is not specified, then the default encoding of the platform is used.
Also, see here http://www.w3.org/TR/html4/interact/forms.html#h-17.13.4.1
Control names and values are escaped. Space characters are replaced by +', and then reserved characters are escaped as described in [RFC1738], section 2.2: Non-alphanumeric characters are replaced by%HH', a percent sign and two hexadecimal digits representing the ASCII code of the character. Line breaks are represented as "CR LF" pairs (i.e., `%0D%0A').
The control names/values are listed in the order they appear in the document. The name is separated from the value by =' and name/value pairs are separated from each other by&'.
To answer your question, if you do not want to encode. I guess, URLDecoder.decode will help you to undo the encoded string.

You could in theory avoid this by constructing the query string or request body containing parameters by hand.
But this would be a bad thing to do, because the HTML, HTTP, URL and URI specs all mandate that reserved characters in request parameters are encoded. And if you violate this, you may find that server-side HTTP stacks, proxies and so on reject your requests as invalid, or misbehave in other ways.
The correct way to deal with this issue is to do one of the following:
If the server is implemented in Java EE technology, use the relevant servlet API methods (e.g. ServletRequest.getParam(...)) to fetch the request parameters. These will take care of any decoding for you.
If the parameters are part of a URL query string, you can instantiate a Java URL or URI object and use the getter to return you the query with the encoding removed.
If your server is implemented some other way (or if you need to unpick the request URL's query string or POST data yourself), then use URLDecoder.decode or equivalent to remove the % encoding and replace +'s ... after you have figured out where the query and parameter boundaries, etc are.

formatting url characters

WHat will be the best practice to replace Unicode character in URL.
For example if I have a multilingual website and support East European languages
How should I format the URL that it always contains valid characters?

What you want todo is called slugify.
$slugified_url_part = iconv('utf-8', 'us-ascii//TRANSLIT', $url_part);
The above code will turn non ascii chars to it's closest ascii char.
You should also trim whitespace and replace inner whitespace with a dash or underscore.
Making all chars lowercase is also common.
Slugify is handy for remembering URLS and SEO.
You could ofcourse use percent encoding but that can look ugly.

Use Percent-encoding. Most languages have a helper function already built in.
Percent-encoding, also known as URL encoding, is a mechanism for encoding information in a Uniform Resource Identifier (URI) under certain circumstances. Although it is known as URL encoding it is, in fact, used more generally within the main Uniform Resource Identifier (URI) set, which includes both Uniform Resource Locator (URL) and Uniform Resource Name (URN). As such it is also used in the preparation of data of the "application/x-www-form-urlencoded" media type, as is often used in email messages and the submission of HTML form data in HTTP requests.

when using php you can use urlencode() to build your urls

The tags on this are a bit confusing, containing both PHP and Java.
For the Java side.
Use URLEncoder.encode("Your String Here", "UTF-8");

URI encoding in UNICODE for apache httpclient 4

I am working with apache http client 4 for all of my web accesses.
This means that every query that I need to do has to pass the URI syntax checks.
One of the sites that I am trying to access uses UNICODE as the url GET params encoding, i.e:
http://maya.tase.co.il/bursa/index.asp?http://maya.tase.co.il/bursa/index.asp?view=search&company_group=147&srh_txt=%u05E0%u05D9%u05D1&arg_comp=&srh_from=2009-06-01&srh_until=2010-02-16&srh_anaf=-1&srh_event=9999&is_urgent=0&srh_company_press=
(the param "srh_txt=%u05E0%u05D9%u05D1" encodes srh_txt=ניב in UNICODE)
The problem is that URI doesn't support UNICODE encoding(it only supports UTF-8)
The really big issue here, is that this site expect it's params to be encoded in UNICODE, so any attempts to convert the url using String.format("http://...srh_txt=%s&...",URLEncoder.encode( "ניב" , "UTF8"))
results in a url which is legal and can be used to construct a URI but the site response to it with an error message, since it's not the encoding that it expects.
by the way URL object can be created and even used to connect to the web site using the non converted url.
Is there any way of creating URI in non UTF-8 encoding?
Is there any way of working with apache httpclient 4 with regular URL(and not URI)?
thanks,
Niv

(the param "srh_txt=%u05E0%u05D9%u05D1" encodes srh_txt=ניב in UNICODE)
It doesn't really. That's not URL-encoding and the sequence %u is invalid in a URL.
%u05E0%u05D9%u05D1" encodes ניב only in JavaScript's oddball escape syntax. escape is the same as URL-encoding for all ASCII characters except for +, but the %u#### escapes it produces for Unicode characters are completely of its own invention.
(One should, in general, never use escape. Using encodeURIComponent instead produces the correct URL-encoded UTF-8, ניב=%D7%A0%D7%99%D7%91.)
If a site requires %u#### sequences in its query string, it is very badly broken.
Is there any way of creating URI in non UTF-8 encoding?
Yes, URIs may use any character encoding you like. It is conventionally UTF-8; that's what IRI requires and what browsers will usually submit if the user types non-ASCII characters into the address bar, but URI itself concerns itself only with bytes.
So you could convert ניב to %F0%E9%E1. There would be no way for the web app to tell that those bytes represented characters encoded in code page 1255 (Hebrew, similar to ISO-8859-8). But it does appear to work, on the link above, which the UTF-8 version does not. Oh dear!

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.