UTF-8 Encoding ; Only some Japanese characters are not getting converted

UTF-8 Encoding ; Only some Japanese characters are not getting converted - java

I am getting the parameter value as parameter from the Jersey Web Service, which is in Japaneses characters.
Here, 'japaneseString' is the web service parameter containing the characters in japanese language.
String name = new String(japaneseString.getBytes(), "UTF-8");
However, I am able to convert a few sting literals successfully, while some of them are creating problems.
The following were successfully converted:
1) アップル
2) 赤
3) 世丕且且世两上与丑万丣丕且丗丕
4) 世世丗丈
While these din't:
1) ひほわれよう
2) 存在する
When I further investigated, i found that these 2 strings are getting converted in to some JUNK characters.
1) Input: ひほわれよう Output : �?��?��?れよ�?�
2) Input: 存在する Output: 存在�?�る
Any idea why some of the japanese characters are not converted properly?
Thanks.

You are mixing concepts here.
A String is just a sequence of characters (chars); a String in itself has no encoding at all. For what it's worth, replace characters in the above with carrier pigeons. Same thing. A carrier pigeon has no encoding. Neither does a char. (1)
What you are doing here:
new String(x.getBytes(), "UTF-8")
is a "poor man's encoding/decoding process". You will probably have noticed that there are two versions of .getBytes(): one where you pass a charset as an argument and the other where you don't.
If you don't, and that is what happens here, it means you will get the result of the encoding process using your default character set; and then you try and re-decode this byte sequence using UTF-8.
Don't do that. Just take in the string as it comes. If, however, you have trouble reading the original byte stream into a string, it means you use a Reader with the wrong charset. Fix that part.
For more information, read this link.
(1) the fact that, in fact, a char is a UTF-16 code unit is irrelevant to this discussion

Try with JVM parameter file.encoding to set with value UTF-8 in startup of Tomcat(JVM).
E.x.: -Dfile.encoding=UTF-8

I concur with #fge.
Clarification
In java String/char/Reader/Writer handle (Unicode) text, and can combine all scripts in the world.
And byte[]/InputStream/OutputStream are binary data, which need an indication of some encoding to be converted to String.
In your case japaneseStingr should already be a correct String, or be substituted by the original byte[].
Traps in Java
Encoding often is an optional parameter, which then defaults to the platform encoding. You fell in that trap too:
String s = "...";
byte[] b = s.getBytes(); // Platform encoding, non-portable.
byte[] b = s.getBytes("UTF-8"); // Explicit
byte[] b = s.getBytes(StandardCharsets.UTF_8); // Explicit,
// better (for UTF-8, ISO-8859-1)
In general avoid the overloaded methods without encoding parameter, as they are for current-computer only data: non-portable. For completeness: classes FileReader/FileWriter should be avoided as they even provide no encoding parameters.
Error
japaneseString is already wrong. So you have to read that right.
It could have been read erroneouslyas Windows-1252 (Windows Latin-1) and suffered when recoding to UTF-8. Evidently only some cases get messed up.
Maybe you had:
String japanesString = new String(bytes);
instead of:
String japanesString = new String(bytes, StandardCharsets.UTF_8);
At the end:
String name = japaneseString;
Show the code for reading japaneseString for further help.

Related

decoding and encoding strings, ISO-8859-1 to UTF-8 in Java

I have read the other posts on this issue, but the solutions they presented did not work for me. Actually, the official Java documentation also did not work as intended (I am using Java 11) : https://docs.oracle.com/javase/tutorial/i18n/text/string.html
My problem is that I am reading one byte at a time from a byte buffer, putting that in a byte array, and making a String out of that byte array. The bytes I read are from an embedded system that can only send ISO-8859-1 bytes, so I end up with a byte array with ISO-8859-1 bytes and the Java String I end up getting is thus ISO-8859-1 encoded. No problem here. The String in IntelliJ looks like this :
The bytes I am trying to convert from ISO-8859-1 to UTF-8 are the ones in yellow. I want them to be UTF-8, so in the end the "C9" byte should be replace by the "C3A9" bytes.
The first step works correctly, I do this : maintenanceResponseString.getBytes(StandardCharsets.UTF_8) and I get the right bytes that I want, the UTF-8 encoding of the string, that's good :
The problem comes in here , when I try to make a STRING out of these new (and GOOD) bytes, like this :
new String(maintenanceResponseString.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8)
The old bytes are back ?!! It's like the "getBytes(UTF-8)" never actually happened. That is NOT what the documentation says should happen... what am I missing here ? I have done tests and the string really is still ISO-8859-1 encoded... I don't know what is going on here. Where are the bytes from "getBytes" ?
How do you convert a String that contains ISO-8859-1 bytes to UTF-8 bytes ? I'm out of alternatives and I need to get it done real bad for a pro project... this should be easy !
Note : I have tried alternatives like
ByteBuffer buffer = StandardCharsets.UTF_8.encode(s);
return StandardCharsets.UTF_8.decode(buffer).toString();
But the exact same thing happens.
Thank you in advance for your help.
EDIT :
With some info in the comments about how Strings in Java 9+ get represented internally not as UTF-16 only anymore, but Latin-1 (why...), I think that is what made me think the Strings were "internally encoded in Latin-1" when it is just the default representation of the String if we don't specify the encoding we want to use when displaying the String.
From what I undestand now the String itself is not bound to any encoding, and you can CHOOSE the encoding you want to display it in when it gets written.
Actually my issue is that the String ends up written to an XML file via JAXB marshalling in LATIN-1, and I now think the issues lies over there... I will dig further when I access my work computer again and report here

It turns out there was nothing wrong with Strings and "their encoding". What happened is I got really confused because the debugger shows the contents of the String in a "default internal storage encoding", and that is ISO-8859-1 (but can be UTF-16, depends on the content of the String).
Quote from the JEP-254 :
We propose to change the internal representation of the String class
from a UTF-16 char array to a byte array plus an encoding-flag field.
The new String class will store characters encoded either as
ISO-8859-1/Latin-1 (one byte per character), or as UTF-16 (two bytes
per character), based upon the contents of the string. The encoding
flag will indicate which encoding is used.
But actually it doesn't matter the internal encoding storage. When it is time to be written, the String will use whatever encoding you want at the time of writing.
My issue actually was when I was sending the String in an HTTP request with Spring RestTemplate. I didn't have the header specifying the "charset" to use in the request, and RestTemplate defaults to ISO-8859-1 if not told otherwise. I added the charset=utf-8, and the String was correctly written as UTF-8 in the request.
Thank you to #VGR #Eugene #skomisa for the help

Importance of specifying the encoding in getBytes in Java

I understand the need to specify encoding when converting a byte[] to String in Java using appropriate format i.e. hex, base64 etc. because the default encoding may not be same in different platforms. But I am not sure I understand the same while converting a string to bytes. So this question, is to wrap my head around the concept of need to specify character set while transferring Strings over web.
Consider foll. code in Java
Note: The String in example below is not read from a file, another resource, it is created.
1: String message = "a good message";
2: byte[] encryptedMsgBytes = encrypt(key,,message.getBytes());
3: String base64EncodedMessage = new String (Base64.encodeBase64(encryptedMsgBytes));
I need to send this over the web using Http Post & will be received & processed (decrypted, converted from base64 etc.) at other end.
Based on reading up article, the recommended practice is to use .getBytes("utf-8")
on line 2, i.e message.getBytes("UTF-8")
& the similar approach is recommended on other end to process the data as shown on line 7 below
4: String base64EncodedMsg =
5: byte[] base64EncodedMsgBytes = Base64.encodeBase64(base64EncodedMsg));
6: byte[] decryptedMsgBytes = decrypt(aesKey, "AES", Base64.decodeBase64(base64EncodedMessage);
7: String originalMsg = new String(decryptedMsgBytes, "UTF-8");
Given that Java's internal in-memory string representation is utf-16. ( excluding: UTF8 during serialization & file saving) , do we really need this if the decryption was also done in Java (Note: This is not a practical assumption, just for sake of discussion to understand the need to mention encoding)? Since, in JVM the String 'message' on line 1 was represented using UTF-16, wouldn't the .getBytes() method without specifying the encoding always return the UTF-16 bytes ? or is that incorrect and .getBytes() method without specifying the encoding always returns raw bytes ? Since the internal representation is UTF-16 why would the default character encoding on a particular JVM matter ?
If indeed it returns UTF-16, then is there is need to use new String(decryptedMsgBytes, "UTF-8") on other end ?

wouldn't the .getBytes() method without specifying the encoding
always return the UTF-16 bytes ?
This is incorrect. Per the Javadoc, this uses the platform's default charset:
Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array.

Converting String from One Charset to Another

I am working on converting a string from one charset to another and read many example on it and finally found below code, which looks nice to me and as a newbie to Charset Encoding, I want to know, if it is the right way to do it .
public static byte[] transcodeField(byte[] source, Charset from, Charset to) {
return new String(source, from).getBytes(to);
}
To convert String from ASCII to EBCDIC, I have to do:
System.out.println(new String(transcodeField(ebytes,
Charset.forName("US-ASCII"), Charset.forName("Cp1047"))));
And to convert from EBCDIC to ASCII, I have to do:
System.out.println(new String(transcodeField(ebytes,
Charset.forName("Cp1047"), Charset.forName("US-ASCII"))));

The code you found (transcodeField) doesn't convert a String from one encoding to another, because a String doesn't have an encoding¹. It converts bytes from one encoding to another. The method is only useful if your use case satisfies 2 conditions:
Your input data is bytes in one encoding
Your output data needs to be bytes in another encoding
In that case, it's straight forward:
byte[] out = transcodeField(inbytes, Charset.forName(inEnc), Charset.forName(outEnc));
If the input data contains characters that can't be represented in the output encoding (such as converting complex UTF8 to ASCII) those characters will be replaced with the ? replacement symbol, and the data will be corrupted.
However a lot of people ask "How do I convert a String from one encoding to another", to which a lot of people answer with the following snippet:
String s = new String(source.getBytes(inputEncoding), outputEncoding);
This is complete bull****. The getBytes(String encoding) method returns a byte array with the characters encoded in the specified encoding (if possible, again invalid characters are converted to ?). The String constructor with the 2nd parameter creates a new String from a byte array, where the bytes are in the specified encoding. Now since you just used source.getBytes(inputEncoding) to get those bytes, they're not encoded in outputEncoding (except if the encodings use the same values, which is common for "normal" characters like abcd, but differs with more complex like accented characters éêäöñ).
So what does this mean? It means that when you have a Java String, everything is great. Strings are unicode, meaning that all of your characters are safe. The problem comes when you need to convert that String to bytes, meaning that you need to decide on an encoding. Choosing a unicode compatible encoding such as UTF8, UTF16 etc. is great. It means your characters will still be safe even if your String contained all sorts of weird characters. If you choose a different encoding (with US-ASCII being the least supportive) your String must contain only the characters supported by the encoding, or it will result in corrupted bytes.
Now finally some examples of good and bad usage.
String myString = "Feng shui in chinese is 風水";
byte[] bytes1 = myString.getBytes("UTF-8"); // Bytes correct
byte[] bytes2 = myString.getBytes("US-ASCII"); // Last 2 characters are now corrupted (converted to question marks)
String nordic = "Här är några merkkejä";
byte[] bytes3 = nordic.getBytes("UTF-8"); // Bytes correct, "weird" chars take 2 bytes each
byte[] bytes4 = nordic.getBytes("ISO-8859-1"); // Bytes correct, "weird" chars take 1 byte each
String broken = new String(nordic.getBytes("UTF-8"), "ISO-8859-1"); // Contains now "HÃ¤r Ã¤r nÃ¥gra merkkejÃ¤"
The last example demonstrates that even though both of the encodings support the nordic characters, they use different bytes to represent them and using the wrong encoding when decoding results in Mojibake. Therefore there's no such thing as "converting a String from one encoding to another", and you should never use the broken example.
Also note that you should always specify the encoding used (with both getBytes() and new String()), because you can't trust that the default encoding is always the one you want.
As a last issue, Charset and Encoding aren't the same thing, but they're very much related.
¹ Technically the way a String is stored internally in the JVM is in UTF-16 encoding up to Java 8, and variable encoding from Java 9 onwards, but the developer doesn't need to care about that.
NOTE
It's possible to have a corrupted String and be able to uncorrupt it by fiddling with the encoding, which may be where this "convert String to other encoding" misunderstanding originates from.
// Input comes from network/file/other place and we have misconfigured the encoding
String input = "HÃ¤r Ã¤r nÃ¥gra merkkejÃ¤"; // UTF-8 bytes, interpreted wrongly as ISO-8859-1 compatible
byte[] bytes = input.getBytes("ISO-8859-1"); // Get each char as single byte
String asUtf8 = new String(bytes, "UTF-8"); // Recreate String as UTF-8
If no characters were corrupted in input, the string would now be "fixed". However the proper approach is to use the correct encoding when reading input, not fix it afterwards. Especially if there's a chance of it becoming corrupted.

Convert UTF-8 encoded string to human readable string

How to convert any UTF8 strings to readable strings.
Like : â¬ (in UTF8) is €
I tried using Charset but not working.

You are encoding a string to ISO-8859-15 with byte[] b = "Üü?öäABC".getBytes("ISO-8859-15"); then you are decoding it with UTF-8 System.out.println(new String(b, "UTF-8"));. You have to decode it the same way with ISO-8859-15.

This is not "UTF-8" but completely broken and unrepairable data. Strings do not have encodings. It makes no sense to say "UTF-8" string in this context. String is a string of abstract characters - it doesn't have any encodings except as an internal implementation detail that is not our concern and not related to your problem.

A string in java is already an unicode representation. When you call one of the getBytes methods on it you get an encoded representation (as bytes, thus binary values) in a specific encoding - ISO-8859-15 in your example. If you want to convert this byte array back to an unicode string you can do that with one of the string constructors accepting a byte array, like you did, but you must do so using the exact same encoding the byte array was originally generated with. Only then you can convert it back to an unicode string (which has no encoding, and doesn't need one).
Beware of the encoding-less methods, both the string constructor and the getBytes method, since they use the default encoding of the platform the code is running on, which might not be what you want to achieve.

You are trying to decode a byteArray encoded with "ISO-8859-15" with "UTF-8" format
b = "Üü?öäABC".getBytes("ISO-8859-15");
u = "Üü?öäABC".getBytes("UTF-8");
System.out.println(new String(b, "ISO-8859-15")); // will be ok
System.out.println(new String(b, "UTF-8")); // will look garbled
System.out.println(new String(u,"UTF-8")); // will be ok

I think the problem here is that you're assuming a java String is encoded with whatever you've specified in the constructor. It's not. It's in UTF-16.
So, "Üü?öäABC".getBytes("ISO-8859-15") is actually converting a UTF-16 string to ISO-8859-15, and then getting the byte representation of that.
If you want to get the human-readable format in your Eclipse console, just keep it as it is (in UTF-16) - and call System.out.println("Üü?öäABC"), because your Eclipse console will decode the string and display it as UTF-16.

Encoding issues

I have a "windows1255" encoded String, is there any safe way i can convert it to a "UTF-8"
String and vice versa?
In general is there a safe way(meaning data will not be damaged) to convert between
Encodings in Java?
str.getBytes("UTF-8");
new String(str,"UTF-8");
if the original string is not encoded as "UTF-8" can the data be damaged?

You can can't have a String object in Java properly encoded as anything other than UTF-16 - as that's the sole encoding for those objects defined by the spec. Of course you could do something untoward like put 1252 values in a char[] and create a String from it, but things will go wrong pretty much immediately.
What you can have is byte[] encoded in various different ways, and you can convert them to and from String using constructors which take a Charset, and with getBytes as in your code.
So you can do conversions using a String as an intermediate. I don't know of any way in the JDK to do a direct conversion, but the intermediate is likely not too costly in practice.
About round-trip comversions - it is not generally true that you can convert between encoding without losing data. Only a few encodings can handle the full spectrum of Unicode characters (eg the UTF family, GB18030, etc) - while many legacy character sets encode only a small subset. You can't safely round trip through those character sets without losing data, unless you are sure the input falls into the representable set.

String is attempting to be a sequence of abstract characters, it does not have any encoding from the point of view
of its users. Of course, it must have an internal encoding but that's an implementation detail.
It makes no sense to encode String as UTF-8, and then decode the result back as UTF-8. It will be no-op, in that:
(new String(str.getBytes("UTF-8"), "UTF-8") ).equals(str) == true;
But there are cases where the String abstraction falls apart and the above will be a "lossy" conversion. Because of the internal
implementation details, a String can contain unpaired UTF-16 surrogates which cannot be represented in UTF-8 (or any encoding
for that matter, including the internal UTF-16 encoding*). So they will be lost in the encoding, and when you decode back, you get the original string without the invalid unpaired surrogates.
The only thing I can take from your question is that you have a String result from interpreting binary data as Windows-1255, where it should have been interpreted in UTF-8.
To fix this, you would have to go to the source of this and use UTF-8 decoding explicitly.
If you however, only have the string result from misinterpretation, you can't really do anything as so many bytes have no representation in Windows-1255 and would have not made it to the string.
If this wasn't the case, you could fully restore the original intended message by:
new String( str.getBytes("Windows-1255"), "UTF-8");
* It is actually wrong of Java to allow unpaired surrogates to exist in its Strings in the first place since it's not valid UTF-16

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.