Convert string from UTF-8 to ISO 8859-1 in Java

I want to encode a UTF-8 string to an ISO 8859-1 string in Java.
I have this:
String title = new String(item.getTitle().getText().getBytes("ISO-8859-1"));
But it isn't working; the output is SÃ¸rensen instead of Sørensen, for example.

There's no such thing as a "UTF-8 string" in Java... there are just strings, which are always in Unicode. (They're effectively always UTF-16.)
You can have a byte array which is an ISO-8859-1 encoded form of a string (or UTF-8 or whatever) but it doesn't make sense to have a string with an encoding.
If you've read a string with the incorrect encoding somewhere, the correct thing to do is fix the code which reads the string, rather than trying to decode/encode the data from the string form later.
If you could give more information about the problem, we can probably give some more useful advice.
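For example, if the feed is being read manually from a stream, the fix belongs at the point of decoding. A minimal sketch, assuming the feed really is UTF-8 (the feedUrl variable is hypothetical):

BufferedReader in = new BufferedReader(
        new InputStreamReader(feedUrl.openStream(), "UTF-8"));
// Characters are now decoded correctly; no later re-encoding is needed.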

This problem isn't to be solved that way. Strings in Java are always in the same encoding (UTF-16); you've basically only changed the content. You need to set the encoding in the destination of this string. If it's stdout, you need to set its encoding. If it's a file, you need to set its Writer's encoding. If it's an HTML page, you need to set the response encoding. If it's a database, you need to set the DB/table/connection encoding. Et cetera.
Update: as per the comments:
The string is from an RSS feed that is in UTF-8, and I want to show it in an HTML page that uses ISO 8859 encoding
You'll need to upgrade the HTML page's encoding from vintage ISO 8859 encoding to the modern and world-domination-prepared UTF-8 encoding.
Update 2: as per the comments:
Firefox shows it in the right encoding by default (UTF-8), but Internet Explorer, for example, doesn't
Then the text is actually fine. You don't need to massage the string into another encoding. The symptoms tell us that the character encoding information is missing from the response headers. Firefox actually has a pretty smart encoding detector, while IE will use the platform default encoding when the encoding is unknown. IE will also fail if the HTML is (drastically) malformed in the doctype and head.
Thus, either the HTML response is syntactically invalid, or the response content type wasn't set correctly. Assuming that your website validates and that you're using JSP/Servlet (judging by your post history here), you basically need to add the following line to the top of your JSP:
<%@ page pageEncoding="UTF-8" %>
That's all. It will automatically set both the response encoding (so that the server knows which encoding to use to write the characters to the byte stream of the response) and the encoding in the Content-Type response header (so that the client knows which encoding to use to read/display those characters from the byte stream of the response). For more background information you may find this article useful.
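If the page is written by a plain servlet rather than a JSP, a roughly equivalent sketch (set the content type before obtaining the writer):

response.setContentType("text/html; charset=UTF-8");
PrintWriter out = response.getWriter(); // this writer now encodes to UTF-8
out.println("Sørensen");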

Related

Need help identifying type of UTF Encoding

I'm having a hard time trying to figure out the type of Unicode encoding that I need to convert to in order to pass data in a POST request. The data will mostly be Chinese characters.
Example String:
的事故事务院治党派驻地是不是
Expected Unicode: %u7684%u4E8B%u6545%u4E8B%u52A1%u9662%u6CBB%u515A%u6D3E%u9A7B%u5730%u662F%u4E0D%u662F
Tried to encode to UTF-16BE:
%76%84%4E%8B%65%45%4E%8B%52%A1%5C%40%5C%40%95%7F%67%1F%8D%27%7B%49%5F%85%62%08%59%1A
Encoded text in UTF-16: %FF%FE%84%76%8B%4E%45%65%8B%4E%A1%52%62%96%BB%6C%5A%51%3E%6D%7B%9A%30%57%2F%66%0D%4E%2F%66
Encoded text in UTF-8: %E7%9A%84%E4%BA%8B%E6%95%85%E4%BA%8B%E5%8A%A1%E9%99%A2%E6%B2%BB%E5%85%9A%E6%B4%BE%E9%A9%BB%E5%9C%B0%E6%98%AF%E4%B8%8D%E6%98%AF
As you can see, UTF-16BE is the closest, but it percent-encodes each byte separately, whereas the expected output has %u followed by four hex digits for each character.
I've been using the URLEncoder method with the standard charset encodings to get the encoded text, but it doesn't return the expected result.
Code:
String text = "的事故事务院治党派驻地是不是";
URLEncoder.encode(text, "UTF-16BE");
As Kayaman said in a comment: Your expectation is wrong.
That is because %uNNNN is not a valid URL encoding of Unicode text. As Wikipedia puts it:
There exists a non-standard encoding for Unicode characters: %uxxxx, where xxxx is a UTF-16 code unit represented as four hexadecimal digits. This behavior is not specified by any RFC and has been rejected by the W3C.
So unless your server is expecting non-standard input, your expectation is wrong.
Instead, use UTF-8. As Wikipedia puts it:
The generic URI syntax mandates that new URI schemes that provide for the representation of character data in a URI must, in effect, represent characters from the unreserved set without translation, and should convert all other characters to bytes according to UTF-8, and then percent-encode those values. This requirement was introduced in January 2005 with the publication of RFC 3986. URI schemes introduced before this date are not affected.
That is however for sending data in a URL, e.g. as part of a GET.
For sending text data as part of an application/x-www-form-urlencoded encoded POST, see the HTML5 documentation:
If the form element has an accept-charset attribute, let the selected character encoding be the result of picking an encoding for the form.
Otherwise, if the form element has no accept-charset attribute, but the document's character encoding is an ASCII-compatible character encoding, then that is the selected character encoding.
Otherwise, let the selected character encoding be UTF-8.
Since most web pages ("the document") are presented in UTF-8 these days, that would likely mean UTF-8.
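A minimal sketch of producing that standard UTF-8 percent-encoding in Java:

String text = "的事故事务院治党派驻地是不是";
// Percent-encodes the UTF-8 bytes of each character, e.g. 的 -> %E7%9A%84
String encoded = URLEncoder.encode(text, "UTF-8");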
I think that you are overthinking this. The encoding of a text doesn't need to "resemble" the string of Unicode code points of that text in any way. These are two different things.
To send the string 的事故事务院治党派驻地是不是 in a POST request, just write the entire POST request and encode it with UTF-8, and the resulting bytes are what is sent as the body of the POST request to the server.
As pointed out by @Andreas, UTF-8 is the default encoding of HTML5, so it's not even necessary to set the accept-charset attribute; the server will automatically use UTF-8 to decode the body of your request if accept-charset is not set.

ISO-8859-1 to UTF-8 in Java

An XML containing 哈瓦那 (UTF-8) is sent to Service A.
Service A sends it to Service B.
The string was encoded to å“ˆç“¦é‚£ (ISO-8859-1).
How do I convert it back to 哈瓦那, considering that all strings in Java are UTF-16? Service B has to compare it as 哈瓦那, not å“ˆç“¦é‚£.
Thanks.
When you read a text file, you have to read it using the actual encoding used to create the file. If you specify the appropriate encoding, you'll get the correct characters in memory. So, if the same file (semantically) exists in two versions (UTF-8 encoded and ISO-8859-1), reading the first one with UTF-8 and the second one with ISO-8859-1 will lead to exactly the same chars in memory.
The above is true only if it made sense to encode the file in ISO-8859-1 in the first place. UTF-8 is able to store every Unicode character, but ISO-8859-1 can encode only a small subset of them (Western language characters). The characters you posted look like Chinese to me, and I don't think encoding them in ISO-8859-1 is even possible without losing everything.
I think you are misdiagnosing the problem:
An XML containing 哈瓦那 (UTF-8) is sent to Service A.
OK ...
Service A sends it to Service B.
OK ...
The string was converted to å“ˆç“¦é‚£ (ISO-8859-1).
This is not correct. The string has not been "converted". Rather, it has been decoded with the wrong character encoding. Specifically, it looks very much like something has taken UTF-8 encoded bytes, and assumed that they are ISO-8859-1 encoded, and decoded them accordingly.
Can you unpick this? It depends where the mistaken decoding first occurred. If it happens in Service B, then you should be able to relabel the data source as UTF-8, and then decode it correctly. On the other hand, if the first mistaken decoding happens in service A, then you could be out of luck. A mistaken decoding can result in loss of data as unrecognized codes are replaced with some other character. If that happens, the original data will be gone forever.
In either case, the best way to deal with this is to figure out what is getting the wrong character encoding mixed up, and fix that. Perhaps the XML needs to be fixed to specify the charset / encoding. Perhaps, the transport mechanism (e.g. HTTP request or response) needs to be corrected to include the proper document encoding.
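If the bytes survived intact (ISO-8859-1 maps every byte value to a character, so that particular mis-decode is lossless), a minimal sketch of reversing it:

// Re-encode with the charset that was wrongly used to decode,
// then decode the recovered bytes with the intended charset.
String fixed = new String(garbled.getBytes("ISO-8859-1"), "UTF-8");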
Use writers and readers to encode/decode your output/input streams:
String yourText = "...";
OutputStream yourOutputStream = ...;
Writer out = new OutputStreamWriter(yourOutputStream, "UTF-8");
out.write(yourText);
Same for the reader.
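For reading, the corresponding sketch (yourInputStream is a stand-in):

Reader in = new InputStreamReader(yourInputStream, "UTF-8");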

How to "fix" broken Java Strings (charset-conversion)

I'm running a Servlet that takes POST requests from websites that aren't necessarily encoded in UTF-8. These requests get parsed with GSON, and the information (mainly strings) ends up in objects.
Client side charset doesn't seem to be used for any of this, as Java just stores Strings in Unicode internally.
Now if a page sending a request has a non-Unicode charset, the information in the strings gets garbled and doesn't represent what was sent; it seems to be misinterpreted somewhere, either while being stringified by the servlet or while being parsed by GSON.
Assuming there is no easy way of fixing the root of the issue, is there a way of recovering that information, given the (misinterpreted) Java Strings and the charset identifier (e.g. "Shift_JIS", "Windows-1255") used to display it on the client's side?
I haven't had need to do this before, but I believe that
final String realCharsetName = "Shift_JIS"; // for example
new String(brokenString.getBytes(), realCharsetName);
stands a good chance of doing the trick.
(This does however assume that encoding issues were entirely ignored while reading, and so the platform's default character set was used (a likely assumption since if people thought about charsets they probably would have got it right). It also assumes you're decoding on a machine with the same default charset as the one that originally read the bytes and created the String.)
If you happen to know exactly which charset was incorrectly used to read the string, you can pass it into the getBytes() call to make this 100% reliable.
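For example, if the string was really Shift_JIS but was mis-decoded as ISO-8859-1 (a sketch):

String fixed = new String(brokenString.getBytes("ISO-8859-1"), "Shift_JIS");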
Assuming that it's obtained as a POST request parameter the following way
String string = request.getParameter("name");
then you need to URL-encode the string back to the original query string parameter value, using the charset which the server itself was using to decode the parameter value:
String original = URLEncoder.encode(string, "UTF-8");
and then URL-decode it using the intended charset
String fixed = URLDecoder.decode(original, "Shift_JIS");
As a better alternative, you could also just instruct the server to use the given charset directly, before obtaining any request parameter, via ServletRequest#setCharacterEncoding():
request.setCharacterEncoding("Shift_JIS");
String string = request.getParameter("name");
By the way, there's no way to know which charset the client used to URL-encode the POST request body. Almost none of the clients specify it in the Content-Type request header, otherwise the ServletRequest#setCharacterEncoding() call would already be done implicitly by the servlet API based on it. You can determine whether it was specified by checking getCharacterEncoding(); if it returns null, then the client specified none.
However, this of course does not work if the client has already properly encoded the value as UTF-8 or any other charset; the Shift_JIS massage would then break it again. There are tools/APIs to guess the original charset based on the obtained byte sequence, but that's not 100% reliable. If your servlet is a public API, then you should document properly that it only accepts UTF-8 encoded parameters whenever the charset is not specified in the request header. You can then move the problem to the client side and point them to their mistake.
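A sketch of that null check, applied before reading any parameter (the UTF-8 fallback follows the documented default suggested above):

if (request.getCharacterEncoding() == null) {
    // The client declared no charset; fall back to the documented default.
    request.setCharacterEncoding("UTF-8");
}
String string = request.getParameter("name");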
Am I correct that what you get is a string that was parsed as if it were UTF-8 but was encoded in Windows-1255? The solution would be to encode your string in UTF-8 and decode the result as Windows-1255.
The correct way to fix the problem is to ensure that when you read the content, you do so using the correct character encoding. Most frameworks and libraries will take care of this for you, but if you're manually writing servlets, it's something you need to be aware of. This isn't a shortcoming of Java. You just need to pay attention to the encodings. Specifically, the Content-Type header should contain useful information.
Any time you convert from a byte stream to a character stream in Java, you should supply a character encoding so that the bytes can be properly decoded into characters. See for example the InputStreamReader constructors.
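For instance, a sketch of decoding a request body with the declared charset, falling back to UTF-8 when none is given (the fallback choice is an assumption):

String declared = request.getCharacterEncoding();
Reader body = new InputStreamReader(
        request.getInputStream(),
        declared != null ? declared : "UTF-8");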

If I implement UTF-16 file handler, can it accurately process all other encodings

I am writing a small-scale HTML crawler in Java. I want to have a single file handler that can open all the HTML files one by one and process them. But there is no way to know what encoding an HTML file uses before actually opening that particular file. So I would like to know if I can have something like this:
new BufferedReader(
        new InputStreamReader(
                new FileInputStream(file), "UTF-16"));
and the handler will be able to read, accurately, all possible encodings the files may have (my idea being that UTF-16 is backward compatible with all other encodings). I will have to deal with the following encodings:
charset=iso-8859-1
charset=utf-8
charset=iso-8859-1
charset=iso-8859-15'
charset="UTF-8"
charset=windows-1252
charset=utf-16
Thanks. Any suggestion would be highly appreciated.
No, UTF-16 is certainly not compatible with all other encodings (in that you can't use a UTF-16 decoder to decode any old text). Try using it for UTF-8, ISO-Latin-1 or any number of other encodings and it will fail.
Assuming this HTML has been fetched from a web server, you should remember the content type given in the response. Alternatively you could heuristically guess the encoding, of course.
No. UTF-16 can only read files encoded in UTF-16. Your best bet is to determine the encoding before you process the file. Use the GuessEncoding library to detect the encoding, and then construct the reader with the encoding detected.
I'd use this in combination with Jon Skeet's suggestion.
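A sketch of that flow; detectCharset is a hypothetical stand-in for whatever detection library you pick (GuessEncoding, juniversalchardet, etc.):

Charset charset = detectCharset(file); // hypothetical detection call
BufferedReader reader = new BufferedReader(
        new InputStreamReader(new FileInputStream(file), charset));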
Wow. Just wow.
The only way to do this is to read the first few hundred bytes in a safe encoding such as Windows-1252 and look for the NULs that indicate UTF-16/32 and for the META charset tag.
Failing that, look at the headers for a charset.
If no header is found, assume UTF-8 (the standard); if that fails to parse, assume Windows-1252 (a common error is sending Windows-1252 with no charset header).
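A rough sketch of that heuristic (the method and its details are illustrative, not a standard API):

static Charset sniff(byte[] head) {
    // Early NUL bytes strongly suggest UTF-16/32: ASCII-range characters
    // are padded with zero bytes in those encodings.
    for (int i = 0; i < Math.min(head.length, 4); i++) {
        if (head[i] == 0) return Charset.forName("UTF-16");
    }
    // Windows-1252 maps every byte to some character, so this scan is safe.
    String prefix = new String(head, Charset.forName("windows-1252"));
    Matcher m = Pattern.compile("charset=[\"']?([\\w-]+)", Pattern.CASE_INSENSITIVE)
            .matcher(prefix);
    if (m.find()) return Charset.forName(m.group(1));
    return Charset.forName("UTF-8"); // fall back; retry as windows-1252 on failure
}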

HTTP headers encoding/decoding in Java

A custom HTTP header is being passed to a Servlet application for authentication purposes. The header value must be able to contain accents and other non-ASCII characters, so must be in a certain encoding (ideally UTF-8).
I am provided with this piece of Java code by the developers who control the authentication environment:
String firstName = request.getHeader("my-custom-header");
String decodedFirstName = new String(firstName.getBytes(),"UTF-8");
But this code doesn't look right to me: it presupposes the encoding of the header value, when it seemed to me that there was a proper way of specifying an encoding for header values (from MIME I believe).
Here is my question: what is the right way (tm) of dealing with custom header values that need to support a UTF-8 encoding:
on the wire (what the header looks like over the wire)
from the decoding point of view (how to decode it using the Java Servlet API, and whether we can assume that request.getHeader() already does the decoding properly)
Here is an environment-independent code sample to treat headers as UTF-8 in case you can't change your service:
String valueAsISO = request.getHeader("my-custom-header");
String valueAsUTF8 = new String(valueAsISO.getBytes("ISO-8859-1"), "UTF-8");
Again: RFC 2047 is not implemented in practice. The next revision of HTTP/1.1 is going to remove any mention of it.
So, if you need to transport non-ASCII characters, the safest way is to encode them into a sequence of ASCII, such as the "Slug" header in the Atom Publishing Protocol.
As mentioned already, the first look should always go to the HTTP 1.1 spec (RFC 2616). It says that text in header values must use the MIME encoding defined in RFC 2047 if it contains characters from character sets other than ISO-8859-1.
So here's a plus for you: if your requirements are covered by the ISO-8859-1 charset, then you can just put your characters into your request/response messages. Otherwise, MIME encoding is the only alternative.
As long as the user agent sends the values of your custom headers according to these rules, you won't have to worry about decoding them. That's what the Servlet API should do.
However, there's a more basic reason why your code snippet doesn't do what it's supposed to. The first line fetches the header value as a Java String. As we know, it's represented as UTF-16 internally, so at this point the HTTP request message parsing is already done and finished.
The next line fetches the byte array of this string. Since no encoding was specified (IMHO this no-argument method should have been deprecated long ago), the current system default encoding is used, which is usually not UTF-8, and then the array is converted again as if it were UTF-8 encoded. Ouch.
The HTTPbis working group is aware of the issue, and the latest drafts get rid of all the language with respect to TEXT and RFC 2047 encoding -- it is not used in practice over HTTP.
See http://trac.tools.ietf.org/wg/httpbis/trac/ticket/74 for the whole story.
See the HTTP spec for the rules, which says in section 2.2
The TEXT rule is only used for descriptive field contents and values that are not intended to be interpreted by the message parser. Words of *TEXT MAY contain characters from character sets other than ISO- 8859-1 [22] only when encoded according to the rules of RFC 2047 [14].
The above code will not correctly decode an RFC 2047-encoded string, leading me to believe that the service doesn't correctly follow the spec and that they are just embedding raw UTF-8 data in the header.
Thanks for the answers. It seems that the ideal would be to follow the proper HTTP header encoding as per RFC 2047. Header values in UTF-8 on the wire would look something like this:
=?UTF-8?Q?...?=
Now here is the funny thing: it seems that neither Tomcat 5.5 nor Tomcat 6 properly decodes HTTP headers as per RFC 2047! The Tomcat code assumes everywhere that header values use ISO-8859-1.
So for Tomcat, specifically, I will work around this by writing a filter which handles the proper decoding of the header values.
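A minimal sketch of such a filter, assuming the container decoded the raw header bytes as ISO-8859-1 while the client actually sent UTF-8 (the class name is made up):

import java.io.*;
import javax.servlet.*;
import javax.servlet.http.*;

public class Utf8HeaderFilter implements Filter {
    public void init(FilterConfig config) {}
    public void destroy() {}

    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        chain.doFilter(new HttpServletRequestWrapper((HttpServletRequest) req) {
            @Override
            public String getHeader(String name) {
                String raw = super.getHeader(name);
                if (raw == null) return null;
                try {
                    // Reverse the container's ISO-8859-1 decode, then decode as UTF-8.
                    return new String(raw.getBytes("ISO-8859-1"), "UTF-8");
                } catch (UnsupportedEncodingException e) {
                    throw new AssertionError(e); // both charsets always exist
                }
            }
        }, res);
    }
}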
