Decoding and encoding strings, ISO-8859-1 to UTF-8 in Java

I have read the other posts on this issue, but the solutions they presented did not work for me. Actually, the approach in the official Java documentation also did not work as intended (I am using Java 11): https://docs.oracle.com/javase/tutorial/i18n/text/string.html
My problem is that I am reading one byte at a time from a byte buffer, putting each into a byte array, and making a String out of that byte array. The bytes I read come from an embedded system that can only send ISO-8859-1 bytes, so I end up with a byte array of ISO-8859-1 bytes, and the Java String I get is thus ISO-8859-1 encoded. No problem here. The String in IntelliJ looks like this:
The bytes I am trying to convert from ISO-8859-1 to UTF-8 are the ones in yellow. I want them to be UTF-8, so in the end the "C9" byte should be replaced by the "C3A9" bytes.
The first step works correctly. I do this: maintenanceResponseString.getBytes(StandardCharsets.UTF_8) and I get exactly the bytes I want, the UTF-8 encoding of the string. That's good:
The problem comes in here, when I try to make a String out of these new (and GOOD) bytes, like this:
new String(maintenanceResponseString.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8)
The old bytes are back?! It's as if the getBytes(UTF-8) never actually happened. That is NOT what the documentation says should happen... What am I missing here? I have done tests and the string really is still ISO-8859-1 encoded... I don't know what is going on. Where are the bytes from getBytes?
How do you convert a String that contains ISO-8859-1 bytes to UTF-8 bytes? I'm out of alternatives and I need to get this done for a professional project... this should be easy!
Note : I have tried alternatives like
ByteBuffer buffer = StandardCharsets.UTF_8.encode(s);
return StandardCharsets.UTF_8.decode(buffer).toString();
But the exact same thing happens.
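Both attempts behave identically because encoding and then decoding with the same charset is a round trip, not a conversion. A minimal sketch of that round trip (the string literal is arbitrary):

```java
import java.nio.charset.StandardCharsets;

public class RoundTrip {
    public static void main(String[] args) {
        // A String holds characters, not bytes: 'é' is just the code point U+00E9.
        String s = "métier";

        // Encoding to UTF-8 produces the two-byte sequence C3 A9 for 'é'...
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);

        // ...and decoding those bytes as UTF-8 yields the SAME characters again.
        String back = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(back.equals(s)); // true
    }
}
```

So the "old bytes" never went away: the String never contained UTF-8 bytes (or ISO-8859-1 bytes) in the first place, only characters.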
Thank you in advance for your help.
EDIT :
With some info from the comments about how Strings in Java 9+ are no longer represented internally as UTF-16 only, but can also be Latin-1, I think that is what made me believe the Strings were "internally encoded in Latin-1", when that is just the default representation shown if we don't specify the encoding we want when displaying the String.
From what I understand now, the String itself is not bound to any encoding, and you CHOOSE the encoding you want it displayed in when it gets written.
Actually my issue is that the String ends up written to an XML file via JAXB marshalling in LATIN-1, and I now think the issue lies over there... I will dig further when I am back at my work computer and report here.

It turns out there was nothing wrong with Strings and "their encoding". What happened is I got really confused because the debugger shows the contents of the String in its default internal storage representation, and that is ISO-8859-1 (but can be UTF-16, depending on the content of the String).
Quote from the JEP-254 :
We propose to change the internal representation of the String class
from a UTF-16 char array to a byte array plus an encoding-flag field.
The new String class will store characters encoded either as
ISO-8859-1/Latin-1 (one byte per character), or as UTF-16 (two bytes
per character), based upon the contents of the string. The encoding
flag will indicate which encoding is used.
But the internal storage encoding actually doesn't matter. When it is time to be written, the String will be encoded with whatever charset you specify at the time of writing.
My issue actually was when I was sending the String in an HTTP request with Spring RestTemplate. I didn't have the header specifying the "charset" to use in the request, and RestTemplate defaults to ISO-8859-1 if not told otherwise. I added the charset=utf-8, and the String was correctly written as UTF-8 in the request.
Thank you to @VGR, @Eugene and @skomisa for the help

Related

When is encoding being relevant in Java?

This might be a bit of a beginner question, but it's fairly relevant when debugging encoding in Java: at what point does an encoding become relevant to a String object?
Consider I have a String object that I want to save to a file. Does the String object itself use some sort of encoding I should manipulate, or is the encoding only supplied by me when I create a stream of bytes to save?
The same applies to importing: when I open a file and get its bytes, I assume there's no encoding at hand, only bytes. When I parse these bytes into a String, I have to use an encoding to understand what characters they are. After I parse those bytes, does the String (in memory) carry some sort of meta information about the encoding, or is this only handled by the JVM?
This is vital considering I'm having file import/export issues and I need to understand at which point I should worry about getting the right encoding.
Hope I explained my doubt well, and thank you in advance!
Java strings do not carry explicit encoding information. They don't know where they came from, and they don't know where they are going. Conceptually, a Java string is a sequence of UTF-16 code units (since Java 9's compact strings, the JVM may store Latin-1-only strings in one byte per character, but that is an invisible implementation detail).
You (optionally) specify what encoding to use whenever you want to turn a String into a sequence of bytes (e.g., to save to a file), or when you want to turn a sequence of bytes (e.g., read from a file) into a String.
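A sketch of that boundary, using a temp file (file name and text are arbitrary): the encoding only matters at the two conversions, and using a different charset on each side produces mojibake.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class FileCharsets {
    public static void main(String[] args) throws IOException {
        String text = "café";
        Path p = Files.createTempFile("demo", ".txt");

        // Encoding happens here: characters become bytes (C3 A9 for 'é').
        Files.write(p, text.getBytes(StandardCharsets.UTF_8));

        // Decoding must use the charset the file was written with.
        String back = new String(Files.readAllBytes(p), StandardCharsets.UTF_8);
        System.out.println(back.equals(text)); // true

        // Reading UTF-8 bytes as ISO-8859-1 silently produces mojibake instead.
        String wrong = new String(Files.readAllBytes(p), StandardCharsets.ISO_8859_1);
        System.out.println(wrong); // cafÃ©
    }
}
```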
Encoding matters for a String when you are serializing it to, or deserializing it from, disk or the web. There are multiple text encodings: ASCII, Latin-1, UTF-8, and UTF-16 (UTF-16 comes in two byte orders, big-endian and little-endian, which is why a byte-order mark exists).
See InputStreamReader for how to load a String from text encoded in a non-default format

UTF-8 Encoding ; Only some Japanese characters are not getting converted

I am getting the parameter value from a Jersey web service, and it contains Japanese characters.
Here, 'japaneseString' is the web service parameter containing the characters in Japanese.
String name = new String(japaneseString.getBytes(), "UTF-8");
However, while I am able to convert a few string literals successfully, some of them are creating problems.
The following were successfully converted:
1) アップル
2) 赤
3) 世丕且且世两上与丑万丣丕且丗丕
4) 世世丗丈
While these didn't:
1) ひほわれよう
2) 存在する
When I investigated further, I found that these 2 strings were converted into JUNK characters.
1) Input: ひほわれよう Output : �?��?��?れよ�?�
2) Input: 存在する Output: 存在�?�る
Any idea why some of the japanese characters are not converted properly?
Thanks.
You are mixing concepts here.
A String is just a sequence of characters (chars); a String in itself has no encoding at all. For what it's worth, replace characters in the above with carrier pigeons. Same thing. A carrier pigeon has no encoding. Neither does a char. (1)
What you are doing here:
new String(x.getBytes(), "UTF-8")
is a "poor man's encoding/decoding process". You will probably have noticed that there are two versions of .getBytes(): one where you pass a charset as an argument and the other where you don't.
If you don't, and that is what happens here, it means you will get the result of the encoding process using your default character set; and then you try and re-decode this byte sequence using UTF-8.
Don't do that. Just take in the string as it comes. If, however, you have trouble reading the original byte stream into a string, it means you use a Reader with the wrong charset. Fix that part.
For more information, read this link.
(1) the fact that, in fact, a char is a UTF-16 code unit is irrelevant to this discussion
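A sketch of "fix that part": decode once, at the boundary, with the charset the bytes actually use. The ByteArrayInputStream here is a stand-in for the real source (e.g. the request body).

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class ReadWithCharset {
    public static void main(String[] args) throws IOException {
        // Stand-in for the real input stream carrying UTF-8 bytes.
        InputStream in =
            new ByteArrayInputStream("存在する".getBytes(StandardCharsets.UTF_8));

        // The Reader does the decoding; pass the charset explicitly.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line = reader.readLine();
            System.out.println(line); // 存在する
        }
    }
}
```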
Try setting the JVM parameter file.encoding to UTF-8 when starting Tomcat (the JVM).
E.g.: -Dfile.encoding=UTF-8
I concur with @fge.
Clarification
In java String/char/Reader/Writer handle (Unicode) text, and can combine all scripts in the world.
And byte[]/InputStream/OutputStream are binary data, which need an indication of some encoding to be converted to String.
In your case japaneseString should already be a correct String, or should be substituted by the original byte[].
Traps in Java
Encoding often is an optional parameter, which then defaults to the platform encoding. You fell in that trap too:
String s = "...";
byte[] b = s.getBytes();                        // Platform default encoding: non-portable.
byte[] b = s.getBytes("UTF-8");                 // Explicit (throws checked UnsupportedEncodingException).
byte[] b = s.getBytes(StandardCharsets.UTF_8);  // Explicit, better (constants exist for UTF-8, ISO-8859-1, ...).
In general avoid the overloaded methods without encoding parameter, as they are for current-computer only data: non-portable. For completeness: classes FileReader/FileWriter should be avoided as they even provide no encoding parameters.
Error
japaneseString is already wrong. So you have to read that right.
It could have been read erroneously as Windows-1252 (Windows Latin-1) and suffered when recoding to UTF-8. Evidently only some cases get messed up.
Maybe you had:
String japaneseString = new String(bytes);
instead of:
String japaneseString = new String(bytes, StandardCharsets.UTF_8);
At the end:
String name = japaneseString;
Show the code for reading japaneseString for further help.
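The failure mode described above can be reproduced deterministically. A sketch using ISO-8859-1 as the stand-in wrong charset (the exact junk in the question suggests a Windows-1252 default, but any wrong charset shows the effect):

```java
import java.nio.charset.StandardCharsets;

public class Mojibake {
    public static void main(String[] args) {
        String original = "存在する";

        // The bytes themselves are fine; they are valid UTF-8.
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        // Decoding them with the wrong charset is where the "junk" appears.
        String wrong = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(wrong.equals(original)); // false: mojibake

        // Decoding with the charset the bytes were written in is the fix.
        String right = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(right.equals(original)); // true
    }
}
```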

Android UTF-8 vs ANSI .equals not working / how change charset?

I read a list for my Android app from a CSV or TXT file.
If the file is encoded as UTF-8 with Notepad++, I see the list correctly, but I can't search/find strings with .equals.
If the file is encoded with Windows as ANSI, I can't see äöü etc., but now I can find strings.
Now my question: how can I find out what charset my string has?
I compare my first string (from the file) with another string, read in the app via a SearchView.
I "THINK" my SearchView string from the app is ANSI too. How can I change that to UTF-8, in the hope that the comparison works again?
Android 4.4.2
Thank you
The following doesn't work:
String s = null;
try
{
    s = new String(query.getBytes(), "UTF-8");
}
catch (UnsupportedEncodingException e)
{
    Log.e("utf8", "conversion", e);
}
Java strings are always sequences of UTF-16 code units (chars), regardless of where the string data comes from.
It is important that you correctly identify the charset of the source data when converting it to a Java string. new String(query.getBytes(), "UTF-8") will work fine if the byte[] array actually contains UTF-8 encoded bytes. Note that you only get an UnsupportedEncodingException if you name a charset Java does not support at all. If you specify a supported charset that is merely the wrong one for the data, the String constructor raises no error: malformed/illegal bytes are silently converted to the Unicode U+FFFD replacement character. APIs that decode more strictly, such as a CharsetDecoder configured to report errors, throw MalformedInputException or UnmappableCharacterException instead. If you need control over error handling during the conversion, use the CharsetDecoder class.
Sometimes UTF-encoded files will have a BOM in the front, so you can check for that. But Ansi files do not use BOMs. If a UTF BOM is not present in the file, then you have to either analyze the raw data and take a guess (which will lead to problems if you guess wrong), or simply ask the user which charset to use.
Always know the charset of your data. If you don't know, ask. Avoid guessing.
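The CharsetDecoder route mentioned above, sketched. The single 0xC3 byte is an arbitrary example of a truncated UTF-8 sequence:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecode {
    public static void main(String[] args) {
        // 0xC3 alone is a truncated (malformed) UTF-8 sequence.
        byte[] bad = { (byte) 0xC3 };

        // The String constructor silently substitutes U+FFFD...
        System.out.println(new String(bad, StandardCharsets.UTF_8)); // prints "\uFFFD"

        // ...whereas a CharsetDecoder set to REPORT fails loudly instead.
        CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            strict.decode(ByteBuffer.wrap(bad));
            System.out.println("decoded ok");
        } catch (CharacterCodingException e) {
            System.out.println("malformed input detected"); // this branch runs
        }
    }
}
```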

Why does ICU4J return the byte-order-mark when reading an array of bytes into a String?

I read a file into an array of bytes. Then I use ICU4J to detect the file's encoding (I don't know what the encoding might be, these files can have multiple different encodings) and return a Unicode String.
Like so:
byte[] fileContent = // read file into byte array
CharsetDetector cd = new CharsetDetector();
cd.setText(fileContent);
CharsetMatch cm = cd.detect();
String result = cm.getString();
When my file is encoded using UTF-16LE the first character in "result" is the byte-order-mark. I'm not interested in this and because it is specific to the encoding scheme and not really part of the file content I would expect it to be gone.
Yet ICU4J returns it. Why is this happening and is there some way of getting around this problem? The only solution I see is manually checking if the first character in the returned String is the byte order mark and stripping it manually. Is there some cleaner/better way?
I just consulted the docs ... icu-project.org/apiref/icu4j/com/ibm/icu/text/…. They in fact say that it returns the corresponding Java String but they do not say anything about removing the BOM. So I'd expect it to be there if it was in the first place.
To me it is natural that it is also retrieved. I'd expect them to explicitly mention it in the docs if they were trimming out the BOM.
I think the answer is here unicode.org/faq/utf_bom.html#bom1 - "Under some higher level protocols, use of a BOM may be mandatory (or prohibited) in the Unicode data stream defined in that protocol."
I think that's pretty much it. If a BOM is mandatory, you'd have to add it again. Filtering it out if the BOM is prohibited is considered the easy part I guess :)
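The manual check-and-strip described above is small, since a BOM decodes to U+FEFF as the first char of the String. A sketch (helper name hypothetical):

```java
public class StripBom {
    // If a BOM was present in the bytes, it survives decoding as a leading U+FEFF.
    static String stripBom(String s) {
        return (!s.isEmpty() && s.charAt(0) == '\uFEFF') ? s.substring(1) : s;
    }

    public static void main(String[] args) {
        System.out.println(stripBom("\uFEFFhello")); // hello
        System.out.println(stripBom("hello"));       // hello (unchanged)
    }
}
```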

ISO-8859-1 to UTF-8 in Java

An XML containing 哈瓦那 (UTF-8) is sent to Service A.
Service A sends it to Service B.
The string was encoded to 哈瓦那 (ISO-8859-1).
How do I encode it back to 哈瓦那? Considering that all strings in Java are UTF-16. Service B has to compare it as 哈瓦那 not 哈瓦那.
Thanks.
When you read a text file, you have to read it using the actual encoding used to create the file. If you specify the appropriate encoding, you'll get the correct characters in memory. So, if the same file (semantically) exists in two versions (UTF-8 encoded and ISO-8859-1), reading the first one with UTF-8 and the second one with ISO-8859-1 will lead to exactly the same chars in memory.
The above is true only if it made sense to encode the file in ISO-8859-1 in the first place. UTF-8 is able to store every unicode character. But ISO-8859-1 is able to encode only a small subset of the unicode characters (western languages characters). The characters you posted literally look like Chinese to me, and I don't think encoding them in ISO-8859-1 is even possible without losing everything.
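That loss can be demonstrated directly: String.getBytes replaces each character the target charset cannot represent with its replacement byte, which for ISO-8859-1 is '?'. A sketch:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class LossyEncode {
    public static void main(String[] args) {
        // ISO-8859-1 cannot represent Chinese characters at all:
        // each unmappable character becomes '?', destroying the data.
        byte[] lossy = "哈瓦那".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(Arrays.toString(lossy)); // [63, 63, 63] — three '?'

        // UTF-8 can encode any Unicode character, so the round trip is lossless.
        byte[] utf8 = "哈瓦那".getBytes(StandardCharsets.UTF_8);
        System.out.println(new String(utf8, StandardCharsets.UTF_8)); // 哈瓦那
    }
}
```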
I think you are misdiagnosing the problem:
An XML containing 哈瓦那 (UTF-8) is sent to Service A.
OK ...
Service A sends it to Service B.
OK ...
The string was converted to 哈瓦那 (ISO-8859-1).
This is not correct. The string has not been "converted". Rather, it has been decoded with the wrong character encoding. Specifically, it looks very much like something has taken UTF-8 encoded bytes, and assumed that they are ISO-8859-1 encoded, and decoded them accordingly.
Can you unpick this? It depends where the mistaken decoding first occurred. If it happens in Service B, then you should be able to relabel the data source as UTF-8, and then decode it correctly. On the other hand, if the first mistaken decoding happens in service A, then you could be out of luck. A mistaken decoding can result in loss of data as unrecognized codes are replaced with some other character. If that happens, the original data will be gone forever.
In either case, the best way to deal with this is to figure out what is getting the wrong character encoding mixed up, and fix that. Perhaps the XML needs to be fixed to specify the charset / encoding. Perhaps, the transport mechanism (e.g. HTTP request or response) needs to be corrected to include the proper document encoding.
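When the mistaken decode happened to use ISO-8859-1, which maps every byte value to a character and is therefore lossless, the damage can be undone by reversing it. A sketch (this does not work if replacement characters were already introduced, e.g. by a Windows-1252 decode of unmapped bytes):

```java
import java.nio.charset.StandardCharsets;

public class UnpickMojibake {
    public static void main(String[] args) {
        String original = "哈瓦那";

        // Reproduce the mistake: UTF-8 bytes decoded as ISO-8859-1.
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        String mangled = new String(utf8, StandardCharsets.ISO_8859_1);

        // ISO-8859-1 is reversible: re-encode to recover the original bytes,
        // then decode them as UTF-8, as should have happened in the first place.
        String repaired = new String(
                mangled.getBytes(StandardCharsets.ISO_8859_1),
                StandardCharsets.UTF_8);
        System.out.println(repaired.equals(original)); // true
    }
}
```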
Use writers and readers to encode/decode your output/input streams:
String yourText = "...";
OutputStream yourOutputStream = ...;
Writer out = new OutputStreamWriter(yourOutputStream, StandardCharsets.UTF_8);
out.write(yourText);
Do the same with an InputStreamReader for reading.