How to "fix" broken Java Strings (charset-conversion)

How to "fix" broken Java Strings (charset-conversion) - java

I'm running a Servlet that takes POST requests from websites that aren't necessarily encoded in UTF-8. These requests get parsed with GSON and information (mainly strings) end up in objects.
Client side charset doesn't seem to be used for any of this, as Java just stores Strings in Unicode internally.
Now if a page sending a request has a non-unicode-charset, the information in the strings is garbled up and doesn't represent what was sent - it seems to be misinterpreted somewhere either in the process of being stringified by the servlet, or parsed by gson.
Assuming there is no easy way of fixing the root of the issue, is there a way of recovering that information, given the (misinterpreted) Java Strings and the charset identifier (i.e. "Shift_JIS", "Windows-1255") used to display it on the client's side?

I haven't had need to do this before, but I believe that
final String realCharsetName = "Shift_JIS"; // for example
new String(brokenString.getBytes(), realCharsetName);
stands a good chance of doing the trick.
(This does however assume that encoding issues were entirely ignored while reading, and so the platform's default character set was used (a likely assumption since if people thought about charsets they probably would have got it right). It also assumes you're decoding on a machine with the same default charset as the one that originally read the bytes and created the String.)
If you happen to know exactly which charset was incorrectly used to read the string, you can pass it into the getBytes() call to make this 100% reliable.

Assuming that it's obtained as a POST request parameter the following way
String string = request.getParameter("name");
then you need to URL-encode the string back to the original query string parameter value using the charset which the server itself was using to decode the parameter value
String original = URLEncoder.encode(string, "UTF-8");
and then URL-decode it using the intended charset
String fixed = URLDecoder.decode(original, "Shift_JIS");
As the better alternative, you could also just instruct the server to use the given charset directly before obtaining any request parameter by ServletRequest#setCharacterEncoding().
request.setCharacterEncoding("Shift_JIS");
String string = request.getParameter("name");
There's by the way no way to know about the charset which the client used to URL-encode the POST request body. Almost no of the clients specifies it in the Content-Type request header, otherwise the ServletRequest#setCharacterEncoding() call would be already implicitly done by the servlet API based on that. You could determine it by checking getCharacterEncoding(), if it returns null then the client has specified none.
However, this does of course not work if the client has already properly encoded the value as UTF-8 or for any other charset. The Shift_JIS massage would break it again. There exist tools/API's to guess the original charset used based on the obtained byte sequence, but that's not 100% reliable. If your servlet concerns a public API, then you should document properly that it only accepts UTF-8 encoded parameters whenever the charset is not specified in the request header. You can then move the problem to the client side and point them on their mistake.

Am I correct that what you get is a string that was parsed as if it were UTF-8 but was encoded in Windows-1255? The solution would be to encode your string in UTF-8 and decode the result as Windows-1255.

The correct way to fix the problem is to ensure that when you read the content, you do so using the correct character encoding. Most frameworks and libraries will take care of this for you, but if you're manually writing servlets, it's something you need to be aware of. This isn't a shortcoming of Java. You just need to pay attention to the encodings. Specifically, the Content-Type header should contain useful information.
Any time you convert from a byte stream to a character stream in Java, you should supply a character encoding so that the bytes can be properly decoded into characters. See for example the InputStreamReader constructors.

Related

decoding and encoding strings, ISO-8859-1 to UTF-8 in Java

I have read the other posts on this issue, but the solutions they presented did not work for me. Actually, the official Java documentation also did not work as intended (I am using Java 11) : https://docs.oracle.com/javase/tutorial/i18n/text/string.html
My problem is that I am reading one byte at a time from a byte buffer, putting that in a byte array, and making a String out of that byte array. The bytes I read are from an embedded system that can only send ISO-8859-1 bytes, so I end up with a byte array with ISO-8859-1 bytes and the Java String I end up getting is thus ISO-8859-1 encoded. No problem here. The String in IntelliJ looks like this :
The bytes I am trying to convert from ISO-8859-1 to UTF-8 are the ones in yellow. I want them to be UTF-8, so in the end the "C9" byte should be replace by the "C3A9" bytes.
The first step works correctly, I do this : maintenanceResponseString.getBytes(StandardCharsets.UTF_8) and I get the right bytes that I want, the UTF-8 encoding of the string, that's good :
The problem comes in here , when I try to make a STRING out of these new (and GOOD) bytes, like this :
new String(maintenanceResponseString.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8)
The old bytes are back ?!! It's like the "getBytes(UTF-8)" never actually happened. That is NOT what the documentation says should happen... what am I missing here ? I have done tests and the string really is still ISO-8859-1 encoded... I don't know what is going on here. Where are the bytes from "getBytes" ?
How do you convert a String that contains ISO-8859-1 bytes to UTF-8 bytes ? I'm out of alternatives and I need to get it done real bad for a pro project... this should be easy !
Note : I have tried alternatives like
ByteBuffer buffer = StandardCharsets.UTF_8.encode(s);
return StandardCharsets.UTF_8.decode(buffer).toString();
But the exact same thing happens.
Thank you in advance for your help.
EDIT :
With some info in the comments about how Strings in Java 9+ get represented internally not as UTF-16 only anymore, but Latin-1 (why...), I think that is what made me think the Strings were "internally encoded in Latin-1" when it is just the default representation of the String if we don't specify the encoding we want to use when displaying the String.
From what I undestand now the String itself is not bound to any encoding, and you can CHOOSE the encoding you want to display it in when it gets written.
Actually my issue is that the String ends up written to an XML file via JAXB marshalling in LATIN-1, and I now think the issues lies over there... I will dig further when I access my work computer again and report here

It turns out there was nothing wrong with Strings and "their encoding". What happened is I got really confused because the debugger shows the contents of the String in a "default internal storage encoding", and that is ISO-8859-1 (but can be UTF-16, depends on the content of the String).
Quote from the JEP-254 :
We propose to change the internal representation of the String class
from a UTF-16 char array to a byte array plus an encoding-flag field.
The new String class will store characters encoded either as
ISO-8859-1/Latin-1 (one byte per character), or as UTF-16 (two bytes
per character), based upon the contents of the string. The encoding
flag will indicate which encoding is used.
But actually it doesn't matter the internal encoding storage. When it is time to be written, the String will use whatever encoding you want at the time of writing.
My issue actually was when I was sending the String in an HTTP request with Spring RestTemplate. I didn't have the header specifying the "charset" to use in the request, and RestTemplate defaults to ISO-8859-1 if not told otherwise. I added the charset=utf-8, and the String was correctly written as UTF-8 in the request.
Thank you to #VGR #Eugene #skomisa for the help

ISO-8859-1 to UTF-8 in Java

An XML containing 哈瓦那 (UTF-8) is sent to Service A.
Service A sends it to Service B.
The string was encoded to å“ˆç“¦é‚£ (ISO-8859-1).
How do I encode it back to 哈瓦那? Considering that all strings in Java are UTF-16. Service B has to compare it as 哈瓦那 not å“ˆç“¦é‚£.
Thanks.

When you read a text file, you have to read it using the actual encoding used to create the file. If you specify the appropriate encoding, you'll get the correct characters in memory. So, if the same file (semantically) exists in two versions (UTF-8 encoded and ISO-8859-1), reading the first one with UTF-8 and the second one with ISO-8859-1 will lead to exactly the same chars in memory.
The above is true only if it made sense to encode the file in ISO-8859-1 in the first place. UTF-8 is able to store every unicode character. But ISO-8859-1 is able to encode only a small subset of the unicode characters (western languages characters). The characters you posted literally look like Chinese to me, and I don't think encoding them in ISO-8859-1 is even possible without losing everything.

I think you are misdiagnosing the problem:
An XML containing 哈瓦那 (UTF-8) is sent to Service A.
OK ...
Service A sends it to Service B.
OK ...
The string was converted to å“ˆç“¦é‚£ (ISO-8859-1).
This is not correct. The string has not been "converted". Rather, it has been decoded with the wrong character encoding. Specifically, it looks very much like something has taken UTF-8 encoded bytes, and assumed that they are ISO-8859-1 encoded, and decoded them accordingly.
Can you unpick this? It depends where the mistaken decoding first occurred. If it happens in Service B, then you should be able to relabel the data source as UTF-8, and then decode it correctly. On the other hand, if the first mistaken decoding happens in service A, then you could be out of luck. A mistaken decoding can result in loss of data as unrecognized codes are replaced with some other character. If that happens, the original data will be gone forever.
In either case, the best way to deal with this is to figure out what is getting the wrong character encoding mixed up, and fix that. Perhaps the XML needs to be fixed to specify the charset / encoding. Perhaps, the transport mechanism (e.g. HTTP request or response) needs to be corrected to include the proper document encoding.

Use writers and readers to encode/decode your input/output streams:
String yourText = "...";
InputStream yourInputStream = ...;
Writer out = new OutputStreamWriter(youInputStream, "UTF-8");
out.write(yourText);
Same for reader.

What could be the possible consequences of default encoding to UTF-8 for a String to Stream conversion?

I need to convert Strings obtained from some API's to InputStream consumed by other API's. The only way is that I convert the String to Stream without knowing the exact encoding. So I assume it to be UTF-8 and it works fine for now. However I would like to know what could be a better solution for this given that I have no way of identifying the the encoding of the source of the string.

There is no good solution to the problem of not knowing the encoding.
Because of this, you must demand that the encoding be explicitly specified, or else use one single agreed-upon encoding that is strictly adhered to.
Also, make sure you use the rare form of the contructor to InputStreamReader that condescends to raise an exception on an encoding error. That is InputStreamReader(InputStream in, CharsetDecoder dec). The other three are either broken or else infelicitously designed depending on your point of view or purposes, because they suppress encoding errors and render your program unreliable and nonportable.
Be very careful about missing errors, especially when you do not know for sure what you are getting — and even if you think you do :).

The possible consequences of applying the incorrect encoding is getting the wrong data out the other end.
The specific consequences will depend on the specific encodings. For example, if you receive a stream of ISO-8859-1 characters, and try to decode using UTF-8, you'll probably get errors due to incorrect sequences. If you start with UTF-16 and assume that it's ISO-8859-1, you'll get twice as many characters as you expect, and every other one will be garbage.

Encodings are not a property of Strings in Java, they're only relevant when you convert between Strings and bytes. If those APIs give you Strings, there is only one point where your program needs to use an encoding, which is when you convert the String back to bytes to be returned by the InputStream. And those "other APIs" of course need to know which encoding to use if they're going to interpret the contents as text data.

To add to the other answers, your deployed application will no longer be portable between Windows and Linux, since these usually have different default encodings.

HTTP headers encoding/decoding in Java

A custom HTTP header is being passed to a Servlet application for authentication purposes. The header value must be able to contain accents and other non-ASCII characters, so must be in a certain encoding (ideally UTF-8).
I am provided with this piece of Java code by the developers who control the authentication environment:
String firstName = request.getHeader("my-custom-header");
String decodedFirstName = new String(firstName.getBytes(),"UTF-8");
But this code doesn't look right to me: it presupposes the encoding of the header value, when it seemed to me that there was a proper way of specifying an encoding for header values (from MIME I believe).
Here is my question: what is the right way (tm) of dealing with custom header values that need to support a UTF-8 encoding:
on the wire (how the header looks like over the wire)
from the decoding point of view (how to decode it using the Java Servlet API, and can we assume that request.getHeader() already properly does the decoding)
Here is an environment independent code sample to treat headers as UTF-8 in case you can't change your service:
String valueAsISO = request.getHeader("my-custom-header");
String valueAsUTF8 = new String(firstName.getBytes("ISO8859-1"),"UTF-8");

Again: RFC 2047 is not implemented in practice. The next revision of HTTP/1.1 is going to remove any mention of it.
So, if you need to transport non-ASCII characters, the safest way is to encode them into a sequence of ASCII, such as the "Slug" header in the Atom Publishing Protocol.

As mentioned already the first look should always go to the HTTP 1.1 spec (RFC 2616). It says that text in header values must use the MIME encoding as defined RFC 2047 if it contains characters from character sets other than ISO-8859-1.
So here's a plus for you. If your requirements are covered by the ISO-8859-1 charset then you just put your characters into your request/response messages. Otherwise MIME encoding is the only alternative.
As long as the user agent sends the values to your custom headers according to these rules you wont have to worry about decoding them. That's what the Servlet API should do.
However, there's a more basic reason why your code sniplet doesn't do what it's supposed to. The first line fetches the header value as a Java string. As we know it's represented as UTF8 internally so at this point the HTTP request message parsing is already done and finished.
The next line fetches the byte array of this string. Since no encoding was specified (IMHO this method with no argument should have been deprecated long ago), the current system default encoding is used, which is usually not UTF8 and then the array is again converted as being UTF8 encoded. Outch.

The HTTPbis working group is aware of the issue, and the latest drafts get rid of all the language with respect to TEXT and RFC 2047 encoding -- it is not used in practice over HTTP.
See http://trac.tools.ietf.org/wg/httpbis/trac/ticket/74 for the whole story.

See the HTTP spec for the rules, which says in section 2.2
The TEXT rule is only used for descriptive field contents and values that are not intended to be interpreted by the message parser. Words of *TEXT MAY contain characters from character sets other than ISO- 8859-1 [22] only when encoded according to the rules of RFC 2047 [14].
The above code will not correctly decode an RFC2047 encoding string, leading me to believe that the service doesn't correctly follow the spec, and they just embeding raw utf-8 data in the header.

Thanks for the answers. It seems that the ideal would be to follow the proper HTTP header encoding as per RFC 2047. Header values in UTF-8 on the wire would look something like this:
=?UTF-8?Q?...?=
Now here is the funny thing: it seems that neither Tomcat 5.5 or 6 properly decodes HTTP headers as per RFC 2047! The Tomcat code assumes everywhere that header values use ISO-8859-1.
So for Tomcat, specifically, I will work around this by writing a filter which handles the proper decoding of the header values.

Decoding split 16-bit character in Java

In my application, I receive a URL-UTF8 encoded string of characters, which is split up by the sending client. After splitting, each message part includes some header information which is meant to be used to reconstruct the message.
With English characters, it's pretty straightforward
String content = new String(request.getParameter("content").getBytes("UTF-8"));
I store this in along with the header information in a buffer for each received part. When all parts have been received, I simply recompose the message by concatenating each individual part according to header information.
With languages that use 16-bit encodings this is sometimes not working as expected. Everything works fine if the split does NOT happen in the middle of a single character.
For instance here's a string of three Hebrew characters being sent by the client:
%D7%93%D7%99%D7%91
If this winds up split as follows: {%D7%93%D7%99} {%D7%91}, reconstruction isn't a problem.
However sometimes the client splits it up in the middle (example: {%D7%93%D7} {%99%D7%91})
When this happens, after reconstruction I get two � characters at the boundary point instead of the single correct Hebrew character.
I thought the inability to correctly retain the single byte information was related to passing around strings, so I tried passing around byte array from request.getParameter("content").getBytes("UTF-8") to the buffer without wrapping in the string joining together the byte arrays. In the buffer I joined all these arrays BEFORE converting the final array to a string.
Even after doing this, it appears I still "lost" that information held by the single bytes. I'm guessing this is because the getBytes("UTF-8") method can't correctly resolve the single bytes since they are not valid characters. Is that right?
Is there any way I can get around this and preserve these tail/head bytes?

Your client is the problem here. Apparently it treats the text data as a byte array for the purpose of splitting it up, and then sending the invalid fragments as text (HTTP request parameters are inherently textual). At that point, you have already lost.
You either have to change the client to split the data as text (i.e. along character boundaries), or change your protocol to send the fragments as binary data, i.e. not as a parameter but as the request body, to be retrieved via ServletRequest.getInputStream() - then, concatenating the data before decoding it should work.
(Caveat: the above assumes that you are indeed writing Servlet code, which I inferred from the request.getParameter() method; but even if that's a coincidence the same principles apply: either split the data as a String before any conversion to byte[] happens on the client side, or make sure you concatenate the byte arrays on the server before any conversion to String happens.)

You must first collect all bytes and then convert them all at once into a string.

Following scheme is a hack but it should work in your case,
Set you server/page in Latin-1 mode. If this is a GET, client has no way to set encoding. You have to do this on server's end. For example, you need to add URIEncoding="iso-8859-1" in connector for Tomcat.
Get content as Latin1. It will be wrong value at this point but don't worry,
String content = request.getParameter("content");
Concatenate the string as Latin-1.
data = data + content;
When you get the whole thing, you need to re-encode the string as UTF-8 like this,
String value = new String(data.getBytes("iso-8859-1"), "utf-8");
The value should contain the correct characters.

You never need to convert a string to bytes and then to a String java, it is completely pointless. Once a series of bytes have been decoded to a String it is in Java String encoding (UTF-16E I think).
The problem you have is that the application server is making an assumation about the encoding of the incoming HTTP request, usually the platform encoding. You can give the application server a hint as to the expected encoding by calling ServletRequest.setCharacterEncoding(String) before anything else calls getParameter().
Browser's assume that form fields should be submitted back to the server using the same encoding that the page was served with. This is a general rule as the HTTP spec doesn't have a way to specify the encoding of the incoming request, only the response.
Spring has a nice Filter to do this for you CharacterEncodingFilter if you define this as the every first filter in web.xml most of your encoding issue will go away.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.