URI encoding in UNICODE for apache httpclient 4 - java

I am working with Apache HttpClient 4 for all of my web accesses.
This means that every query I make has to pass the URI syntax checks.
One of the sites I am trying to access uses UNICODE (%uXXXX) escapes to encode its URL GET params, i.e.:
http://maya.tase.co.il/bursa/index.asp?view=search&company_group=147&srh_txt=%u05E0%u05D9%u05D1&arg_comp=&srh_from=2009-06-01&srh_until=2010-02-16&srh_anaf=-1&srh_event=9999&is_urgent=0&srh_company_press=
(the param "srh_txt=%u05E0%u05D9%u05D1" encodes srh_txt=ניב in UNICODE)
The problem is that URI doesn't support this UNICODE encoding (it only supports UTF-8).
The really big issue here is that this site expects its params to be encoded this way, so any attempt to convert the URL using String.format("http://...srh_txt=%s&...", URLEncoder.encode("ניב", "UTF8"))
results in a URL which is legal and can be used to construct a URI, but the site responds to it with an error message, since it's not the encoding it expects.
By the way, a URL object can be created, and even used to connect to the web site, with the non-converted URL.
Is there any way of creating URI in non UTF-8 encoding?
Is there any way of working with apache httpclient 4 with regular URL(and not URI)?
thanks,
Niv

(the param "srh_txt=%u05E0%u05D9%u05D1" encodes srh_txt=ניב in UNICODE)
It doesn't really. That's not URL-encoding and the sequence %u is invalid in a URL.
"%u05E0%u05D9%u05D1" encodes ניב only in JavaScript's oddball escape() syntax. escape() is the same as URL-encoding for all ASCII characters except for +, but the %u#### escapes it produces for Unicode characters are completely of its own invention.
(One should, in general, never use escape. Using encodeURIComponent instead produces the correct URL-encoded UTF-8, ניב=%D7%A0%D7%99%D7%91.)
If a site requires %u#### sequences in its query string, it is very badly broken.
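If you are stuck talking to such a broken site from Java anyway, you can build the non-standard %uXXXX sequences yourself. This is a minimal sketch, not a full reimplementation of JavaScript's escape() (real escape() also percent-encodes some ASCII punctuation, which is skipped here):

```java
public class UnicodeEscape {
    // Produce JavaScript escape()-style %uXXXX sequences for non-ASCII
    // characters; ASCII characters are passed through unchanged.
    static String escapeNonStandard(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c < 128) {
                sb.append(c);
            } else {
                // %u followed by the UTF-16 code unit as four hex digits
                sb.append(String.format("%%u%04X", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeNonStandard("ניב")); // %u05E0%u05D9%u05D1
    }
}
```

Since the resulting string is pure ASCII, it passes the URI syntax checks and can be dropped into the query string that HttpClient's URI expects.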
Is there any way of creating URI in non UTF-8 encoding?
Yes, URIs may use any character encoding you like. It is conventionally UTF-8; that's what IRI requires and what browsers will usually submit if the user types non-ASCII characters into the address bar, but URI itself concerns itself only with bytes.
So you could convert ניב to %F0%E9%E1. There would be no way for the web app to tell that those bytes represented characters encoded in code page 1255 (Hebrew, similar to ISO-8859-8). But it does appear to work, on the link above, which the UTF-8 version does not. Oh dear!
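To reproduce those %F0%E9%E1 bytes from Java, percent-encode with the windows-1255 charset instead of UTF-8. A sketch, assuming your JRE ships the windows-1255 charset (it is not one of the six charsets the Java spec guarantees) and that you are on Java 10+ for the Charset overload:

```java
import java.net.URLEncoder;
import java.nio.charset.Charset;

public class Cp1255Encode {
    public static void main(String[] args) {
        // Percent-encode the Hebrew text using code page 1255 bytes
        String encoded = URLEncoder.encode("ניב", Charset.forName("windows-1255"));
        System.out.println(encoded); // %F0%E9%E1
    }
}
```

On older Java versions, URLEncoder.encode("ניב", "windows-1255") does the same but throws a checked UnsupportedEncodingException.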

Related

Java URI Builder build method replacing "?" character in the path

I have an API at the following path
/v0/segments/ch/abc/view/status/ACTIVE?sc=%s&expiryGteInMs=%d
I am building a Client using the URIBuilder in Java.
return UriBuilder
    .fromUri(config.getHost())
    .path(String.format(config.getPath(), request.getList(), request.getTime()))
    .build();
The request contains a list to be substituted in place of %s and the time to be substituted in place of %d. But the request being formed has a path like this
/v0/segments/ch/abc/view/status/ACTIVE%3Fsc=FK,GR&expiryGteInMs=1611081000000
Basically the "?" character is being replaced by %3F. Can somebody help me with this, please?
P.S.: I know that we can use the .queryParam option available in UriBuilder. I am looking for the actual reason why this is happening.
Most probably the library you are using is encoding the URL, and ? encodes to %3F.
Why this happens (in short): UriBuilder.path() treats its whole argument as path data, and a URL path may contain only a specific set of characters, of which a literal ? is not one. So, in order to transfer this character as data, it has to be encoded (so-called percent-encoding).
A bit longer explanation (taken from here):
URL encoding converts characters into a format that can be transmitted over the Internet.
URLs can only be sent over the Internet using the ASCII character-set.
Since URLs often contain characters outside the ASCII set, the URL has to be converted into a valid ASCII format.
URL encoding replaces unsafe ASCII characters with a "%" followed by two hexadecimal digits.
URLs cannot contain spaces. URL encoding normally replaces a space with a plus (+) sign or with %20.
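A quick way to see the same encoding in isolation: when ? is treated as data rather than as the query-string delimiter, any standards-conforming encoder turns it into %3F.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class QuestionMarkEncoding {
    public static void main(String[] args) {
        // '?' is a reserved character; encoded as data it becomes %3F
        System.out.println(URLEncoder.encode("?", StandardCharsets.UTF_8));
    }
}
```

This is also why .queryParam() is the right fix rather than a workaround: it tells the builder which part is the query, so the ? delimiter is emitted literally and only the parameter values get encoded.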

Need help identifying type of UTF Encoding

I'm having a hard time trying to figure out the type of Unicode encoding I need to convert to in order to pass data in a POST request. It would mostly be Chinese characters.
Example String:
的事故事务院治党派驻地是不是
Expected Unicode: %u7684%u4E8B%u6545%u4E8B%u52A1%u9662%u6CBB%u515A%u6D3E%u9A7B%u5730%u662F%u4E0D%u662F
Tried to encode to UTF16-BE:
%76%84%4E%8B%65%45%4E%8B%52%A1%5C%40%5C%40%95%7F%67%1F%8D%27%7B%49%5F%85%62%08%59%1A
Encoded text in UTF-16: %FF%FE%84%76%8B%4E%45%65%8B%4E%A1%52%62%96%BB%6C%5A%51%3E%6D%7B%9A%30%57%2F%66%0D%4E%2F%66
Encoded text in UTF-8: %E7%9A%84%E4%BA%8B%E6%95%85%E4%BA%8B%E5%8A%A1%E9%99%A2%E6%B2%BB%E5%85%9A%E6%B4%BE%E9%A9%BB%E5%9C%B0%E6%98%AF%E4%B8%8D%E6%98%AF
As you can see, UTF-16BE is the closest, but it encodes each character as two separate bytes, whereas the expected output has a single %u in front of each character's four hex digits.
I've been using the URLEncoder method to get the encoded text with the standard charset encodings, but it doesn't return the expected output.
Code:
String text = "的事故事务院治党派驻地是不是";
URLEncoder.encode(text, "UTF-16BE");
As Kayaman said in a comment: Your expectation is wrong.
That is because %uNNNN is not a valid URL encoding of Unicode text. As Wikipedia says it:
There exists a non-standard encoding for Unicode characters: %uxxxx, where xxxx is a UTF-16 code unit represented as four hexadecimal digits. This behavior is not specified by any RFC and has been rejected by the W3C.
So unless your server expects non-standard input, your expectation is wrong.
Instead, use UTF-8. As Wikipedia says it:
The generic URI syntax mandates that new URI schemes that provide for the representation of character data in a URI must, in effect, represent characters from the unreserved set without translation, and should convert all other characters to bytes according to UTF-8, and then percent-encode those values. This requirement was introduced in January 2005 with the publication of RFC 3986. URI schemes introduced before this date are not affected.
That is however for sending data in a URL, e.g. as part of a GET.
For sending text data as part of a application/x-www-form-urlencoded encoded POST, see the HTML5 documentation:
If the form element has an accept-charset attribute, let the selected character encoding be the result of picking an encoding for the form.
Otherwise, if the form element has no accept-charset attribute, but the document's character encoding is an ASCII-compatible character encoding, then that is the selected character encoding.
Otherwise, let the selected character encoding be UTF-8.
Since most web pages ("the document") are presented in UTF-8 these days, that would likely mean UTF-8.
I think that you are thinking too far. The encoding of a text doesn't need to "resemble" in any way the string of Unicode code points of this text. These are two different things.
To send the string 的事故事务院治党派驻地是不是 in a POST request, just write the entire POST request and encode it with UTF-8, and the resulting bytes are what is sent as the body of the POST request to the server.
As pointed out by @Andreas, UTF-8 is the default encoding of HTML5, so it's not even necessary to set the accept-charset attribute; the server will automatically use UTF-8 to decode the body of your request if accept-charset is not set.
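As a concrete check of the UTF-8 advice above, URLEncoder with UTF-8 produces exactly the "Encoded text in UTF-8" line from the question:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class Utf8FormEncode {
    public static void main(String[] args) {
        String text = "的事故事务院治党派驻地是不是";
        // application/x-www-form-urlencoded with UTF-8, the HTML5 default
        String encoded = URLEncoder.encode(text, StandardCharsets.UTF_8);
        System.out.println(encoded);
        // %E7%9A%84%E4%BA%8B%E6%95%85%E4%BA%8B%E5%8A%A1%E9%99%A2%E6%B2%BB%E5%85%9A%E6%B4%BE%E9%A9%BB%E5%9C%B0%E6%98%AF%E4%B8%8D%E6%98%AF
    }
}
```

Whether the server accepts this depends on it doing standard UTF-8 form decoding; if it really insists on %uXXXX sequences, it is relying on non-standard behavior.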

How to decode the Unicode encoding in java?

I have search on my site: we frame the query and send it in the request, and the response comes back from the vendor as JSON. The vendor crawls our site, captures the data from it and sends the response. In our design we convert the JSON into a Java object using GSON. We use UTF-8 as the charset in the meta tag.
I have a situation where the response sometimes contains Unicode escapes for special characters, depending on the request. The browser renders these escaped special characters in a strange way. How should I decode them?
For example, for the special character 'ndash' I see it encoded in the response as '\u2013'.
To clarify the difference between Unicode and a character encoding:
Unicode
is an abstract standard aiming to identify all characters (currently more than 110,000).
Character encoding
defines how a character can be represented by a sequence of bytes;
one such encoding is UTF-8, which uses 1-4 bytes to represent a Unicode character.
A Java String is always UTF-16 internally. Hence when you construct a String from bytes you can use the following String constructor:
new String(byte[], encoding)
The second argument should be the encoding the characters are in when the client sends them. If you don't explicitly define an encoding, you get the default system encoding, which you can examine using Charset.defaultCharset().
You can manually set the default encoding as an argument when starting the JVM
-Dfile.encoding="utf-8"
Although rarely needed, you can also employ CharsetDecoder/CharsetEncoder.
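The constructor described above can be demonstrated end-to-end; decoding with the right charset restores the text, while the wrong charset produces mojibake (which is likely what the browser is showing):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeBytes {
    public static void main(String[] args) {
        // Bytes as they might arrive over the wire, UTF-8 encoded;
        // "\u2013" is the ndash from the question
        byte[] wire = "a \u2013 b".getBytes(StandardCharsets.UTF_8);

        // Correct charset: the ndash survives
        String ok = new String(wire, StandardCharsets.UTF_8);

        // Wrong charset: the ndash is mangled into three Latin-1 characters
        String bad = new String(wire, StandardCharsets.ISO_8859_1);

        System.out.println(ok);
        System.out.println(bad);
        System.out.println(Charset.defaultCharset()); // platform default
    }
}
```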

HTTP headers encoding/decoding in Java

A custom HTTP header is being passed to a Servlet application for authentication purposes. The header value must be able to contain accents and other non-ASCII characters, so must be in a certain encoding (ideally UTF-8).
I am provided with this piece of Java code by the developers who control the authentication environment:
String firstName = request.getHeader("my-custom-header");
String decodedFirstName = new String(firstName.getBytes(),"UTF-8");
But this code doesn't look right to me: it presupposes the encoding of the header value, when it seemed to me that there was a proper way of specifying an encoding for header values (from MIME I believe).
Here is my question: what is the right way (tm) of dealing with custom header values that need to support a UTF-8 encoding:
on the wire (how the header looks like over the wire)
from the decoding point of view (how to decode it using the Java Servlet API, and can we assume that request.getHeader() already properly does the decoding)
Here is an environment independent code sample to treat headers as UTF-8 in case you can't change your service:
String valueAsISO = request.getHeader("my-custom-header");
String valueAsUTF8 = new String(valueAsISO.getBytes("ISO8859-1"), "UTF-8");
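The trick above can be sketched as a self-contained round trip, assuming the container decoded the raw UTF-8 header bytes as ISO-8859-1 (which is what Tomcat does, as noted further down):

```java
import java.nio.charset.StandardCharsets;

public class HeaderRecode {
    public static void main(String[] args) {
        // Simulate the container mis-decoding UTF-8 header bytes as ISO-8859-1
        String original = "Café";
        String asIso = new String(original.getBytes(StandardCharsets.UTF_8),
                                  StandardCharsets.ISO_8859_1);

        // Reverse the mis-decoding: back to bytes, then decode as UTF-8
        String recovered = new String(asIso.getBytes(StandardCharsets.ISO_8859_1),
                                      StandardCharsets.UTF_8);

        System.out.println(asIso);     // mojibake
        System.out.println(recovered); // Café
    }
}
```

This works because ISO-8859-1 maps every byte to a character one-to-one, so no information is lost on the way through.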
Again: RFC 2047 is not implemented in practice. The next revision of HTTP/1.1 is going to remove any mention of it.
So, if you need to transport non-ASCII characters, the safest way is to encode them into a sequence of ASCII, such as the "Slug" header in the Atom Publishing Protocol.
As mentioned already the first look should always go to the HTTP 1.1 spec (RFC 2616). It says that text in header values must use the MIME encoding as defined RFC 2047 if it contains characters from character sets other than ISO-8859-1.
So here's a plus for you. If your requirements are covered by the ISO-8859-1 charset then you just put your characters into your request/response messages. Otherwise MIME encoding is the only alternative.
As long as the user agent sends the values to your custom headers according to these rules you wont have to worry about decoding them. That's what the Servlet API should do.
However, there's a more basic reason why your code snippet doesn't do what it's supposed to. The first line fetches the header value as a Java string. Since Java strings are represented as UTF-16 internally, at this point the HTTP request message parsing is already done and finished.
The next line fetches the byte array of this string. Since no encoding was specified (IMHO the no-argument version of this method should have been deprecated long ago), the current system default encoding is used, which is usually not UTF-8, and then the array is converted again as if it were UTF-8 encoded. Ouch.
The HTTPbis working group is aware of the issue, and the latest drafts get rid of all the language with respect to TEXT and RFC 2047 encoding -- it is not used in practice over HTTP.
See http://trac.tools.ietf.org/wg/httpbis/trac/ticket/74 for the whole story.
See the HTTP spec for the rules, which says in section 2.2
The TEXT rule is only used for descriptive field contents and values that are not intended to be interpreted by the message parser. Words of *TEXT MAY contain characters from character sets other than ISO- 8859-1 [22] only when encoded according to the rules of RFC 2047 [14].
The above code will not correctly decode an RFC 2047-encoded string, leading me to believe that the service doesn't correctly follow the spec and is just embedding raw UTF-8 data in the header.
Thanks for the answers. It seems that the ideal would be to follow the proper HTTP header encoding as per RFC 2047. Header values in UTF-8 on the wire would look something like this:
=?UTF-8?Q?...?=
Now here is the funny thing: it seems that neither Tomcat 5.5 nor 6 properly decodes HTTP headers as per RFC 2047! The Tomcat code assumes everywhere that header values use ISO-8859-1.
So for Tomcat, specifically, I will work around this by writing a filter which handles the proper decoding of the header values.

How do you unescape URLs in Java?

When I read the xml through a URL's InputStream, and then cut out everything except the url, I get "http://cliveg.bu.edu/people/sganguly/player/%20Rang%20De%20Basanti%20-%20Tu%20Bin%20Bataye.mp3".
As you can see, there are a lot of "%20"s.
I want the url to be unescaped.
Is there any way to do this in Java, without using a third-party library?
This is not unescaped XML, this is URL encoded text. Looks to me like you want to use the following on the URL strings.
URLDecoder.decode(url);
This will give you the correct text. The result of decoding the link you provided is this.
http://cliveg.bu.edu/people/sganguly/player/ Rang De Basanti - Tu Bin Bataye.mp3
The %20 is an escaped space character. To get the above I used the URLDecoder object.
Starting from Java 11, use
URLDecoder.decode(url, StandardCharsets.UTF_8).
For earlier versions, use URLDecoder.decode(url, "UTF-8").
URLDecoder.decode(String s) has been deprecated since Java 5.
Regarding the chosen encoding:
Note: The World Wide Web Consortium Recommendation states that UTF-8 should be used. Not doing so may introduce incompatibilites.
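Putting the recommendation together, a minimal Java 11+ example using the URL from the question:

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class Unescape {
    public static void main(String[] args) {
        String url = "http://cliveg.bu.edu/people/sganguly/player/"
                + "%20Rang%20De%20Basanti%20-%20Tu%20Bin%20Bataye.mp3";
        // Each %20 becomes a space; UTF-8 is the recommended charset
        String decoded = URLDecoder.decode(url, StandardCharsets.UTF_8);
        System.out.println(decoded);
    }
}
```

Note that URLDecoder also turns + into a space (it implements form decoding, not pure RFC 3986 decoding), which is harmless here but matters for URLs containing literal + characters.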
I'm having problems using this method when I have special characters like á, é, í, etc. My (probably wild) guess is widechars are not being encoded properly... well, at least I was expecting to see sequences like %uC2BF instead of %C2%BF.
Edited: My bad, this post explains the difference between URL encoding and JavaScript's escape sequences: URI encoding in UNICODE for apache httpclient 4
In my case the URL contained escaped html entities, so StringEscapeUtils.unescapeHtml4() from apache-commons did the trick
