How to decode the Unicode encoding in Java?

I have a search feature on my site: we frame the query, send it in the request, and the response comes back from the vendor as JSON. The vendor crawls our site, captures the data, and sends it back in the response. In our design we convert the JSON into Java objects using GSON. We use UTF-8 as the charset in the meta tag.
I have a situation where the response sometimes contains Unicode escapes for special characters, depending on the request, and the browser renders these escapes in a strange way. How should I decode them?
For example, for the special character 'ndash' I see it encoded in the response as '\u2013'.

To clarify the difference between Unicode and a character encoding:
Unicode is an abstract concept aiming to identify all letters (currently more than 110,000).
A character encoding defines how a character can be represented by a sequence of bytes; one such encoding is UTF-8, which uses 1 to 4 bytes to represent a Unicode character.
A Java String is always UTF-16 internally. Hence, when you construct a String from bytes you can use the following constructor:
new String(byte[], encoding)
The second argument should be the encoding the characters are in when the client sends them. If you don't explicitly define an encoding, you get the default system encoding, which you can examine using Charset.defaultCharset().
You can manually set the default encoding as an argument when starting the JVM
-Dfile.encoding="utf-8"
Although rarely needed, you can also employ CharsetDecoder/CharsetEncoder.
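A minimal sketch of that constructor in use (the byte array here is fabricated for illustration; in your case the bytes would come from the vendor's response):

import java.nio.charset.StandardCharsets;

public class DecodeResponse {
    public static void main(String[] args) {
        // Pretend these are the raw response bytes, assumed to be UTF-8 encoded.
        byte[] responseBytes = "range: \u2013 100".getBytes(StandardCharsets.UTF_8);

        // Decode with an explicit charset instead of relying on the JVM default.
        String decoded = new String(responseBytes, StandardCharsets.UTF_8);
        System.out.println(decoded); // prints: range: – 100 (the en dash, if the console charset allows it)
    }
}

Note also that if the JSON body itself contains the literal six characters \u2013, a JSON parser such as GSON turns that escape into the real character for you; strange rendering in the browser usually means the page's declared charset does not match the actual bytes being served.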

Related

Need help identifying type of UTF Encoding

I'm having a hard time figuring out which Unicode encoding I need to convert to in order to pass data in a POST request. The data will mostly be Chinese characters.
Example String:
的事故事务院治党派驻地是不是
Expected Unicode: %u7684%u4E8B%u6545%u4E8B%u52A1%u9662%u6CBB%u515A%u6D3E%u9A7B%u5730%u662F%u4E0D%u662F
Tried encoding to UTF-16BE:
%76%84%4E%8B%65%45%4E%8B%52%A1%5C%40%5C%40%95%7F%67%1F%8D%27%7B%49%5F%85%62%08%59%1A
Encoded text in UTF-16: %FF%FE%84%76%8B%4E%45%65%8B%4E%A1%52%62%96%BB%6C%5A%51%3E%6D%7B%9A%30%57%2F%66%0D%4E%2F%66
Encoded text in UTF-8: %E7%9A%84%E4%BA%8B%E6%95%85%E4%BA%8B%E5%8A%A1%E9%99%A2%E6%B2%BB%E5%85%9A%E6%B4%BE%E9%A9%BB%E5%9C%B0%E6%98%AF%E4%B8%8D%E6%98%AF
As you can see, UTF-16BE is the closest, but it only produces two percent-encoded bytes per character, and there should be an additional %u in front of every character's four hex digits, as shown in the expected Unicode.
I've been using the URLEncoder method with the standard charsets to get the encoded text, but it doesn't return the expected output.
Code:
String text = "的事故事务院治党派驻地是不是";
URLEncoder.encode(text, "UTF-16BE");
As Kayaman said in a comment: Your expectation is wrong.
That is because %uNNNN is not a valid URL encoding of Unicode text. As Wikipedia says it:
There exists a non-standard encoding for Unicode characters: %uxxxx, where xxxx is a UTF-16 code unit represented as four hexadecimal digits. This behavior is not specified by any RFC and has been rejected by the W3C.
So unless your server expects non-standard input, your expectation is wrong.
Instead, use UTF-8. As Wikipedia says it:
The generic URI syntax mandates that new URI schemes that provide for the representation of character data in a URI must, in effect, represent characters from the unreserved set without translation, and should convert all other characters to bytes according to UTF-8, and then percent-encode those values. This requirement was introduced in January 2005 with the publication of RFC 3986. URI schemes introduced before this date are not affected.
That is however for sending data in a URL, e.g. as part of a GET.
For sending text data as part of an application/x-www-form-urlencoded encoded POST, see the HTML5 documentation:
If the form element has an accept-charset attribute, let the selected character encoding be the result of picking an encoding for the form.
Otherwise, if the form element has no accept-charset attribute, but the document's character encoding is an ASCII-compatible character encoding, then that is the selected character encoding.
Otherwise, let the selected character encoding be UTF-8.
Since most web pages ("the document") are presented in UTF-8 these days, that would likely mean UTF-8.
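For example, percent-encoding the sample string with UTF-8 reproduces exactly the UTF-8 sequence listed in the question (a sketch; the class name is just for illustration):

import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class Utf8FormEncoding {
    public static void main(String[] args) throws Exception {
        String text = "的事故事务院治党派驻地是不是";
        // Percent-encodes the UTF-8 bytes of each character, e.g. 的 becomes %E7%9A%84
        String encoded = URLEncoder.encode(text, StandardCharsets.UTF_8.name());
        System.out.println(encoded);
    }
}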
I think that you are thinking too far. The encoding of a text doesn't need to "resemble" in any way the string of Unicode code points of this text. These are two different things.
To send the string 的事故事务院治党派驻地是不是 in a POST request, just write the entire POST request and encode it with UTF-8, and the resulting bytes are what is sent as the body of the POST request to the server.
As pointed out by @Andreas, UTF-8 is the default encoding of HTML5, so it's not even necessary to set the accept-charset attribute; if accept-charset is not set, the server will automatically use UTF-8 to decode the body of your request.
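A sketch of such a POST with the JDK's built-in HttpClient (Java 11+; the URL and the parameter name are placeholders):

import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class Utf8Post {
    public static void main(String[] args) throws Exception {
        // Form body: the value is percent-encoded as UTF-8 bytes.
        String body = "q=" + URLEncoder.encode("的事故事务院治党派驻地是不是", StandardCharsets.UTF_8.name());

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://example.com/search"))   // placeholder URL
                .header("Content-Type", "application/x-www-form-urlencoded; charset=UTF-8")
                .POST(HttpRequest.BodyPublishers.ofString(body, StandardCharsets.UTF_8))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
    }
}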

Is sending UTF-8 encoded characters Network Safe?

The reason for encoding with the standard Base64 format is to make sure the data won't contain anything that may be interpreted as control characters over the network. This ensures the same data is received on the other side of the transfer.
In this scenario, does UTF-8 character encoding provide the same guarantee as Base64, i.e. no control characters in the output, so that we can send it over the network?
The reason for encoding with the standard Base64 format is to make sure the data won't contain anything that may be interpreted as control characters over the network.
The above statement is incorrect. Base64 is used specifically to encode binary data using 64 of the printable ASCII characters. It is only necessary in specific situations where you are embedding binary data in a protocol which was designed to transfer text (such as embedding attachments in email); it is not required in general for transmitting data over a network. HTTP, for instance, manages perfectly well without it.
In this scenario, does UTF-8 character encoding provide the same guarantee as Base64, i.e. no control characters in the output, so that we can send it over the network?
No. UTF-8 is a Unicode string format. It cannot be used to encode arbitrary binary data.
Control characters (0-31 in ASCII) are not touched by UTF-8 encoding and therefore if your protocol cannot transmit them safely you wouldn't solve the issue by using UTF-8.
UTF-8 is about encoding Unicode text into a stream of 8-bit bytes, not about escaping control characters. It solves a different problem.
Note that the input for UTF-8 encoding is Unicode text, not random bytes: for example, it's not possible to encode the byte 0x83 with UTF-8. What you can do is convert the Greek letter "Δ", encoded in cp737 as 0x83, into UTF-8, or convert the Russian letter "Ѓ", encoded in cp855 as 0x83, into UTF-8, but the results differ ("Δ" is 0xCE 0x94, while "Ѓ" is 0xD0 0x83).
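A short sketch contrasting the two (the byte values are arbitrary and purely for illustration): Base64 turns any bytes into safe ASCII and back, while decoding a stray byte such as 0x83 as UTF-8 just produces the replacement character.

import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class Base64VsUtf8 {
    public static void main(String[] args) {
        byte[] binary = {(byte) 0x83, 0x00, (byte) 0xFF};   // arbitrary binary data

        // Base64: any byte sequence becomes printable ASCII and can be decoded back exactly.
        String base64 = Base64.getEncoder().encodeToString(binary);
        System.out.println(base64);                          // gwD/

        // UTF-8: 0x83 on its own is not a valid UTF-8 sequence, so decoding
        // replaces it with U+FFFD instead of preserving the original byte.
        String decoded = new String(binary, StandardCharsets.UTF_8);
        System.out.println(decoded.charAt(0) == '\uFFFD');   // true
    }
}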

What character set should be used for URL encoding?

I need to encode a URL component. The URL component can contain special character like "?,#,=" and also characters of Chinese language.
Which of the character sets should I use: UTF-8, UTF-16 or UTF-32? And why?
I suppose you mean percent encoding here.
RFC 3986, section 2.5 is pretty clear about this (emphasis mine):
When a new URI scheme defines a component that represents textual
data consisting of characters from the Universal Character Set [UCS],
the data should first be encoded as octets according to the UTF-8
character encoding [STD63]; then only those octets that do not
correspond to characters in the unreserved set should be percent-
encoded. For example, the character A would be represented as "A",
the character LATIN CAPITAL LETTER A WITH GRAVE would be represented
as "%C3%80", and the character KATAKANA LETTER A would be represented
as "%E3%82%A2".
Therefore, this should be UTF-8.
Also, beware of URLEncoder.encode(); although it is recommended again and again, the fact is that it is not suitable for URI encoding. Quoting the javadoc of the class itself:
This class contains static methods for converting a String to the application/x-www-form-urlencoded MIME format
which is not what URI encoding uses (in case you are wondering, application/x-www-form-urlencoded is what is used in HTTP POST data). What you want to use is a URI template instead. See for instance here.
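If you want to stay within the JDK, the multi-argument java.net.URI constructors perform this kind of percent-encoding for you; a sketch (host, path and query value are placeholders):

import java.net.URI;

public class UriEncodingExample {
    public static void main(String[] args) throws Exception {
        // The multi-argument constructors quote illegal characters,
        // and toASCIIString() percent-encodes non-ASCII ones as UTF-8.
        URI uri = new URI("https", "example.com", "/search", "q=维也纳", null);
        System.out.println(uri.toASCIIString());
        // https://example.com/search?q=%E7%BB%B4%E4%B9%9F%E7%BA%B3
    }
}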
A reference from an HTML point of view.
The HTML4 specification, section Non-ASCII characters in URI attribute values, states (my emphasis):
We recommend that user agents adopt the following convention for
handling non-ASCII characters in such cases:
Represent each character in UTF-8 (see [RFC2279]) as one or more bytes.
Escape these bytes with the URI escaping mechanism (i.e., by converting each byte to %HH, where HH is the hexadecimal notation of the byte value).
Similarly, the Selecting a form submission encoding section of the HTML5 specification basically says that UTF-8 should be used if no accept-charset attribute is specified.
On the other hand, I found nothing that states UTF-8 must be used. Some older software use iso-8859-1 in particular. For example, Apache Tomcat before version 8 has iso-8859-1 as default value for its URIEncoding setting.
UTF-8 (Unicode) is the default character encoding in HTML5, as it encompasses almost all symbols/characters.
Go for UTF-8. You can achieve the same thing with
URLEncoder.encode(string, encoding)
In addition, you can refer to this blog, which tries to encode some Chinese characters like '维也纳恩斯特哈佩尔球场'.
Encode your URL to escape special characters. There are several websites that can do this for you.
E.g. http://www.url-encode-decode.com/

How to use UTF-16 in URL encoding?

Currently I am using UTF-8 for URL encoding. I want to convert it to UTF-16.
How can I achieve this?
When encoding Unicode characters in URLs, it's necessary to encode them in such a fashion that all URL parsers and consumers can understand your URLs.
To that end; when the URL was expanded by RFCs in the wake of the development of Unicode and related standards and tools, it was decided that the encoding to employ for encoding characters (using percent escapes) was to be UTF-8, as this would mean that established ASCII escapes would Just Work™.
Consequently, even if you could generate URLs with UTF-16-based percent escapes, no other program would be able to understand them, making them useless. In fact, by matter of definition, they wouldn't even be URLs.
There's also the question of why on earth you would want to use UTF-16 for anything, it being silly and all.
Remember: Never Don't Use UTF-8! (N'DUUH!)
URL escapes, as in %nn hex values, encode bytes: 8-bit bytes. If for some very non-standard reason you want to encode the bytes of UTF-16 instead of UTF-8, you must first pick a byte order (BE or LE). Then you have to write code in your program to take the two bytes of each 16-bit UTF-16 code unit and represent each one as %nn in hex.
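A sketch of that manual approach, purely for illustration (the helper name is made up, and the output is non-standard, so virtually no server or parser will understand it):

import java.nio.charset.StandardCharsets;

public class Utf16PercentEncoding {
    // Hypothetical helper: percent-encode every byte of the UTF-16BE representation.
    static String percentEncodeUtf16Be(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_16BE)) {
            sb.append(String.format("%%%02X", b & 0xFF));   // each byte becomes %HH
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(percentEncodeUtf16Be("ü"));       // %00%FC
    }
}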

URI encoding in UNICODE for apache httpclient 4

I am working with apache http client 4 for all of my web accesses.
This means that every query that I need to do has to pass the URI syntax checks.
One of the sites that I am trying to access uses UNICODE as the encoding of its URL GET params, i.e.:
http://maya.tase.co.il/bursa/index.asp?http://maya.tase.co.il/bursa/index.asp?view=search&company_group=147&srh_txt=%u05E0%u05D9%u05D1&arg_comp=&srh_from=2009-06-01&srh_until=2010-02-16&srh_anaf=-1&srh_event=9999&is_urgent=0&srh_company_press=
(the param "srh_txt=%u05E0%u05D9%u05D1" encodes srh_txt=ניב in UNICODE)
The problem is that URI doesn't support this UNICODE encoding (it only supports UTF-8).
The really big issue here is that this site expects its params to be encoded in UNICODE, so any attempt to convert the URL using String.format("http://...srh_txt=%s&...", URLEncoder.encode("ניב", "UTF8"))
results in a URL which is legal and can be used to construct a URI, but the site responds to it with an error message, since it's not the encoding it expects.
By the way, a URL object can be created, and even used to connect to the web site, using the non-converted URL.
Is there any way of creating URI in non UTF-8 encoding?
Is there any way of working with apache httpclient 4 with regular URL(and not URI)?
thanks,
Niv
(the param "srh_txt=%u05E0%u05D9%u05D1" encodes srh_txt=ניב in UNICODE)
It doesn't really. That's not URL-encoding and the sequence %u is invalid in a URL.
"%u05E0%u05D9%u05D1" encodes ניב only in JavaScript's oddball escape syntax. The escape function is the same as URL-encoding for all ASCII characters except +, but the %u#### escapes it produces for Unicode characters are entirely of its own invention.
(One should, in general, never use escape. Using encodeURIComponent instead produces the correct URL-encoded UTF-8, ניב=%D7%A0%D7%99%D7%91.)
If a site requires %u#### sequences in its query string, it is very badly broken.
Is there any way of creating URI in non UTF-8 encoding?
Yes, URIs may use any character encoding you like. It is conventionally UTF-8; that's what IRI requires and what browsers will usually submit if the user types non-ASCII characters into the address bar, but URI itself concerns itself only with bytes.
So you could convert ניב to %F0%E9%E1. There would be no way for the web app to tell that those bytes represent characters encoded in code page 1255 (Hebrew, similar to ISO-8859-8). But it does appear to work on the link above, which the UTF-8 version does not. Oh dear!
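A sketch of producing those bytes from Java (assuming the site really does want code page 1255; "windows-1255" is the usual JDK name for that charset and is available in full JDK installations):

import java.net.URLEncoder;

public class Cp1255Encoding {
    public static void main(String[] args) throws Exception {
        // Percent-encode the bytes of the windows-1255 (Hebrew) representation.
        String encoded = URLEncoder.encode("ניב", "windows-1255");
        System.out.println(encoded);   // %F0%E9%E1
    }
}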
