Importance of specifying the encoding in getBytes in Java

I understand the need to specify an encoding when converting a byte[] to a String in Java (and to use an appropriate representation such as hex or Base64), because the default encoding may differ between platforms. But I am not sure I understand the same need when converting a String to bytes. So this question is to wrap my head around why a character set must be specified when transferring Strings over the web.
Consider the following code in Java.
Note: The String in the example below is not read from a file or another resource; it is created in place.
1: String message = "a good message";
2: byte[] encryptedMsgBytes = encrypt(key, message.getBytes());
3: String base64EncodedMessage = new String (Base64.encodeBase64(encryptedMsgBytes));
I need to send this over the web using Http Post & will be received & processed (decrypted, converted from base64 etc.) at other end.
Based on the articles I have read, the recommended practice is to use .getBytes("UTF-8") on line 2, i.e. message.getBytes("UTF-8"), and the analogous approach is recommended at the other end when processing the data, as shown on line 7 below:
4: String base64EncodedMsg = ...; // as received over HTTP POST
5: byte[] encryptedMsgBytes = Base64.decodeBase64(base64EncodedMsg);
6: byte[] decryptedMsgBytes = decrypt(aesKey, "AES", encryptedMsgBytes);
7: String originalMsg = new String(decryptedMsgBytes, "UTF-8");
Given that Java's internal in-memory string representation is UTF-16 (leaving aside serialization and file saving), do we really need this if the decryption is also done in Java? (Note: this is not a practical assumption, just for the sake of discussion, to understand the need to specify an encoding.) Since the String 'message' on line 1 is represented in the JVM using UTF-16, wouldn't the .getBytes() method without a specified encoding always return the UTF-16 bytes? Or is that incorrect, and does .getBytes() without a specified encoding return raw bytes? Since the internal representation is UTF-16, why would the default character encoding of a particular JVM matter?
If it indeed returned UTF-16, would there be any need to use new String(decryptedMsgBytes, "UTF-8") at the other end?

wouldn't the .getBytes() method without specifying the encoding
always return the UTF-16 bytes ?
This is incorrect. Per the Javadoc, this uses the platform's default charset:
Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array.
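That platform dependence is easy to demonstrate with a minimal sketch (the class name and the sample character are mine; the no-arg result varies by machine, so only the explicit UTF-8 result is predictable):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DefaultCharsetDemo {
    public static void main(String[] args) {
        String s = "€"; // U+20AC, encoded differently by different charsets

        // No-arg getBytes() uses the platform default charset, which varies by machine
        byte[] platformBytes = s.getBytes();
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("platform bytes:  " + platformBytes.length); // e.g. 1 for windows-1252

        // An explicit charset gives the same bytes on every machine
        byte[] utf8Bytes = s.getBytes(StandardCharsets.UTF_8);
        System.out.println("UTF-8 bytes:     " + utf8Bytes.length); // always 3
    }
}
```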

Related

String encoding (UTF-8) JAVA

Could anyone please help me out here? I want to know the difference between the two string conversions below. I am trying to encode the string to UTF-8. Which one is the correct method?
String string2 = new String(string1.getBytes("UTF-8"), "UTF-8");
OR
String string3 = new String(string1.getBytes(), "UTF-8");
ALSO, if I use the two lines above together, i.e.
line 1: string1 = new String(string1.getBytes("UTF-8"), "UTF-8");
line 2: string1 = new String(string1.getBytes(), "UTF-8");
Will the value of string1 be the same after both lines?
PS: Purpose of doing all this is to send Japanese text in web service call.
So I want to send it with UTF-8 encoding.
According to the javadoc of String#getBytes(String charsetName):
Encodes this String into a sequence of bytes using the named charset,
storing the result into a new byte array.
And the documentation of String(byte[] bytes, Charset charset):
Constructs a new String by decoding the specified array of bytes using
the specified charset.
Thus getBytes() is the opposite operation of the String(byte[]) constructor: getBytes() encodes the string to bytes, and String(byte[]) decodes the byte array back into a string. You have to use the same charset for both operations to preserve the actual string value. That is, your second example is wrong:
// This is wrong because you are calling getBytes() with the default charset
// but converting those bytes to a String using the UTF-8 encoding. This will
// mostly work, because the default encoding is often UTF-8, but it can fail,
// so it is wrong.
new String(string1.getBytes(), "UTF-8");
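A runnable sketch of the difference (class and variable names are mine; whether the last comparison holds depends on the machine's default charset):

```java
import java.nio.charset.StandardCharsets;

public class MatchingCharsets {
    public static void main(String[] args) {
        String string1 = "こんにちは";

        // Correct: encode and decode with the same charset; always lossless
        String string2 = new String(string1.getBytes(StandardCharsets.UTF_8),
                                    StandardCharsets.UTF_8);
        System.out.println(string1.equals(string2)); // true

        // Fragile: getBytes() uses the platform default charset; if that is
        // not UTF-8, the Japanese characters are corrupted on the round trip
        String string3 = new String(string1.getBytes(), StandardCharsets.UTF_8);
        System.out.println(string1.equals(string3)); // true only if the default is UTF-8
    }
}
```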
String and char (two-byte UTF-16 code units) in Java are for (Unicode) text.
When converting from and to byte[], one needs the Charset (encoding) of those bytes.
Both String.getBytes() and new String(byte[]) are shortcuts that use the default operating-system encoding. That is almost always wrong for cross-platform use.
So use
byte[] b = s.getBytes("UTF-8");
s = new String(b, "UTF-8");
Or better, not throwing a checked UnsupportedEncodingException:
byte[] b = s.getBytes(StandardCharsets.UTF_8);
s = new String(b, StandardCharsets.UTF_8);
(Older Android versions do not provide StandardCharsets, however.)
The same holds for InputStreamReader, OutputStreamWriter that bridge binary data (InputStream/OutputStream) and text (Reader, Writer).
Please don't confuse yourself. "String" is usually used to refer to values in a datatype that stores text. In this case, java.lang.String.
Serialized text is a sequence of bytes created by applying a character encoding to a string. In this case, byte[].
There are no UTF-8-encoded strings in Java.
If your web service client library takes a string, pass it the string. If it lets you specify an encoding to use for serialization, pass it StandardCharsets.UTF_8 or equivalent.
If it doesn't take a string, then pass it string1.getBytes(StandardCharsets.UTF_8) and use whatever other mechanism it provides to let you tell the recipient that the bytes are UTF-8-encoded text. Or, get a different client library.

Converting String from One Charset to Another

I am working on converting a string from one charset to another. I read many examples and finally found the code below, which looks good to me; as a newbie to charset encoding, I want to know whether it is the right way to do it.
public static byte[] transcodeField(byte[] source, Charset from, Charset to) {
    return new String(source, from).getBytes(to);
}
To convert String from ASCII to EBCDIC, I have to do:
System.out.println(new String(transcodeField(ebytes,
Charset.forName("US-ASCII"), Charset.forName("Cp1047"))));
And to convert from EBCDIC to ASCII, I have to do:
System.out.println(new String(transcodeField(ebytes,
Charset.forName("Cp1047"), Charset.forName("US-ASCII"))));
The code you found (transcodeField) doesn't convert a String from one encoding to another, because a String doesn't have an encoding¹. It converts bytes from one encoding to another. The method is only useful if your use case satisfies 2 conditions:
Your input data is bytes in one encoding
Your output data needs to be bytes in another encoding
In that case, it's straightforward:
byte[] out = transcodeField(inbytes, Charset.forName(inEnc), Charset.forName(outEnc));
If the input data contains characters that can't be represented in the output encoding (such as converting UTF-8 text with non-ASCII characters to US-ASCII), those characters will be replaced with the ? replacement character, and the data will be corrupted.
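For instance, here is a sketch of both the lossless and the lossy case, using the same transcodeField shape as above (the class name and the sample word are mine):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class TranscodeDemo {
    // Same shape as the transcodeField method in the question
    static byte[] transcodeField(byte[] source, Charset from, Charset to) {
        return new String(source, from).getBytes(to);
    }

    public static void main(String[] args) {
        byte[] utf8 = "naïve".getBytes(StandardCharsets.UTF_8);

        // Lossless: ISO-8859-1 can represent 'ï'
        byte[] latin1 = transcodeField(utf8, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(new String(latin1, StandardCharsets.ISO_8859_1)); // naïve

        // Lossy: US-ASCII cannot represent 'ï', so it is replaced with '?'
        byte[] ascii = transcodeField(utf8, StandardCharsets.UTF_8, StandardCharsets.US_ASCII);
        System.out.println(new String(ascii, StandardCharsets.US_ASCII)); // na?ve
    }
}
```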
However a lot of people ask "How do I convert a String from one encoding to another", to which a lot of people answer with the following snippet:
String s = new String(source.getBytes(inputEncoding), outputEncoding);
This is complete bull****. The getBytes(String encoding) method returns a byte array with the characters encoded in the specified encoding (if possible; again, invalid characters are converted to ?). The String constructor with the second parameter creates a new String from a byte array, where the bytes are assumed to be in the specified encoding. Since you just used source.getBytes(inputEncoding) to get those bytes, they are not encoded in outputEncoding (except where the encodings use the same byte values, which is common for "normal" characters like abcd, but differs for more complex ones like the accented characters éêäöñ).
So what does this mean? It means that when you have a Java String, everything is great. Strings are unicode, meaning that all of your characters are safe. The problem comes when you need to convert that String to bytes, meaning that you need to decide on an encoding. Choosing a unicode compatible encoding such as UTF8, UTF16 etc. is great. It means your characters will still be safe even if your String contained all sorts of weird characters. If you choose a different encoding (with US-ASCII being the least supportive) your String must contain only the characters supported by the encoding, or it will result in corrupted bytes.
Now finally some examples of good and bad usage.
String myString = "Feng shui in chinese is 風水";
byte[] bytes1 = myString.getBytes("UTF-8"); // Bytes correct
byte[] bytes2 = myString.getBytes("US-ASCII"); // Last 2 characters are now corrupted (converted to question marks)
String nordic = "Här är några merkkejä";
byte[] bytes3 = nordic.getBytes("UTF-8"); // Bytes correct, "weird" chars take 2 bytes each
byte[] bytes4 = nordic.getBytes("ISO-8859-1"); // Bytes correct, "weird" chars take 1 byte each
String broken = new String(nordic.getBytes("UTF-8"), "ISO-8859-1"); // Contains now "HÃ¤r Ã¤r nÃ¥gra merkkejÃ¤"
The last example demonstrates that even though both of the encodings support the nordic characters, they use different bytes to represent them and using the wrong encoding when decoding results in Mojibake. Therefore there's no such thing as "converting a String from one encoding to another", and you should never use the broken example.
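The byte-level difference behind that Mojibake can be shown directly (a small sketch; the class name is mine):

```java
import java.nio.charset.StandardCharsets;

public class ByteDiff {
    public static void main(String[] args) {
        // 'ä' is one byte (0xE4) in ISO-8859-1, but two bytes (0xC3 0xA4) in UTF-8
        byte[] latin1 = "ä".getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = "ä".getBytes(StandardCharsets.UTF_8);
        System.out.println("ISO-8859-1: " + latin1.length + " byte(s)"); // 1
        System.out.println("UTF-8: " + utf8.length + " byte(s)");        // 2

        // Decoding the UTF-8 bytes as ISO-8859-1 turns 'ä' into "Ã¤" (Mojibake)
        System.out.println(new String(utf8, StandardCharsets.ISO_8859_1)); // Ã¤
    }
}
```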
Also note that you should always specify the encoding used (with both getBytes() and new String()), because you can't trust that the default encoding is always the one you want.
As a last issue, Charset and Encoding aren't the same thing, but they're very much related.
¹ Technically the way a String is stored internally in the JVM is in UTF-16 encoding up to Java 8, and variable encoding from Java 9 onwards, but the developer doesn't need to care about that.
NOTE
It's possible to have a corrupted String and be able to uncorrupt it by fiddling with the encoding, which may be where this "convert String to other encoding" misunderstanding originates from.
// Input comes from network/file/other place and we have misconfigured the encoding
String input = "HÃ¤r Ã¤r nÃ¥gra merkkejÃ¤"; // UTF-8 bytes, wrongly decoded as ISO-8859-1
byte[] bytes = input.getBytes("ISO-8859-1"); // Get each char back as a single byte
String asUtf8 = new String(bytes, "UTF-8"); // Re-decode the bytes as UTF-8
If no characters were corrupted in input, the string would now be "fixed": "Här är några merkkejä". However, the proper approach is to use the correct encoding when reading the input in the first place, not to fix it afterwards, especially if there is a chance of it becoming corrupted.

How to encode a string into UTF-8 in JAVA

I have a Japanese String, 文字列, and I want to convert it to UTF-8 encoding. This question seems like a bit of a duplicate. I have googled for some time but was not able to find a direct answer.
Encoding a String is the process of transforming a sequence of characters into a sequence of bytes.
For that, use the getBytes() method.
This method accepts an encoding parameter, which defines the encoding used in this process. Therefore, you can use:
byte[] encoded = "文字列".getBytes("UTF-8");
As per Jon Skeet's comment, don't use magic strings:
byte[] encoded = "文字列".getBytes(StandardCharsets.UTF_8);
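A quick round-trip sketch (the class name is mine) showing that decoding with the same charset recovers the original string:

```java
import java.nio.charset.StandardCharsets;

public class EncodeDemo {
    public static void main(String[] args) {
        String s = "文字列";

        // Each of these three characters takes 3 bytes in UTF-8
        byte[] encoded = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(encoded.length); // 9

        // Decoding with the same charset recovers the original string
        String decoded = new String(encoded, StandardCharsets.UTF_8);
        System.out.println(decoded.equals(s)); // true
    }
}
```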

UTF-8 Encoding ; Only some Japanese characters are not getting converted

I am getting the parameter value from a Jersey web service, and it contains Japanese characters.
Here, 'japaneseString' is the web service parameter containing the characters in the Japanese language.
String name = new String(japaneseString.getBytes(), "UTF-8");
However, I am able to convert a few string literals successfully, while some of them are creating problems.
The following were successfully converted:
1) アップル
2) 赤
3) 世丕且且世两上与丑万丣丕且丗丕
4) 世世丗丈
While these didn't:
1) ひほわれよう
2) 存在する
When I investigated further, I found that these two strings are getting converted into junk characters.
1) Input: ひほわれよう Output : �?��?��?れよ�?�
2) Input: 存在する Output: 存在�?�る
Any idea why some of the japanese characters are not converted properly?
Thanks.
You are mixing concepts here.
A String is just a sequence of characters (chars); a String in itself has no encoding at all. For what it's worth, replace characters in the above with carrier pigeons. Same thing. A carrier pigeon has no encoding. Neither does a char. (1)
What you are doing here:
new String(x.getBytes(), "UTF-8")
is a "poor man's encoding/decoding process". You will probably have noticed that there are two versions of .getBytes(): one where you pass a charset as an argument and the other where you don't.
If you don't, and that is what happens here, it means you will get the result of the encoding process using your default character set; and then you try and re-decode this byte sequence using UTF-8.
Don't do that. Just take in the string as it comes. If, however, you have trouble reading the original byte stream into a string, it means you use a Reader with the wrong charset. Fix that part.
For more information, read this link.
(1) the fact that, in fact, a char is a UTF-16 code unit is irrelevant to this discussion
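To illustrate the "fix that part" advice, here is a sketch that decodes an incoming byte stream exactly once, with an explicit charset (the ByteArrayInputStream stands in for whatever stream the web service actually provides):

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class ReaderCharsetDemo {
    public static void main(String[] args) throws IOException {
        // Simulated incoming UTF-8 bytes (e.g. from a network stream)
        byte[] incoming = "ひほわれよう".getBytes(StandardCharsets.UTF_8);

        // Decode once, with the right charset, at the point where bytes enter
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(incoming),
                                      StandardCharsets.UTF_8))) {
            String line = reader.readLine();
            System.out.println("ひほわれよう".equals(line)); // true
        }
    }
}
```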
Try setting the JVM parameter file.encoding to UTF-8 when starting Tomcat (the JVM).
E.g.: -Dfile.encoding=UTF-8
I concur with @fge.
Clarification
In Java, String/char/Reader/Writer handle (Unicode) text and can combine all scripts in the world.
byte[]/InputStream/OutputStream are binary data, which need an indication of the encoding to be converted to a String.
In your case, japaneseString should already be a correct String, or it should be substituted by the original byte[].
Traps in Java
Encoding often is an optional parameter, which then defaults to the platform encoding. You fell in that trap too:
String s = "...";
byte[] b = s.getBytes(); // Platform encoding, non-portable.
byte[] b = s.getBytes("UTF-8"); // Explicit
byte[] b = s.getBytes(StandardCharsets.UTF_8); // Explicit,
// better (for UTF-8, ISO-8859-1)
In general, avoid the overloaded methods without an encoding parameter, as they are for current-computer-only data: non-portable. For completeness: the classes FileReader/FileWriter should be avoided, as they don't even provide encoding parameters.
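As a sketch of the portable alternative to FileReader/FileWriter (file name and content are mine), java.nio.file.Files provides factory methods that take an explicit charset:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class PortableFileIo {
    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("demo", ".txt");

        // Instead of FileWriter (platform encoding), write with an explicit charset
        try (BufferedWriter w = Files.newBufferedWriter(path, StandardCharsets.UTF_8)) {
            w.write("存在する");
        }

        // Instead of FileReader, read back with the same explicit charset
        try (BufferedReader r = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            System.out.println(r.readLine()); // 存在する
        }

        Files.delete(path);
    }
}
```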
Error
japaneseString is already wrong, so you have to read it correctly in the first place.
It could have been read erroneously as Windows-1252 (Windows Latin-1) and suffered when recoding to UTF-8. Evidently only some cases get messed up.
Maybe you had:
String japaneseString = new String(bytes);
instead of:
String japaneseString = new String(bytes, StandardCharsets.UTF_8);
At the end:
String name = japaneseString;
Show the code for reading japaneseString for further help.

Convert UTF-8 encoded string to human readable string

How can I convert such UTF-8 strings to readable strings?
For example: ⬠(in UTF-8) should be €.
I tried using Charset, but it is not working.
You are encoding a string to ISO-8859-15 with byte[] b = "Üü?öäABC".getBytes("ISO-8859-15"); and then decoding it with UTF-8: System.out.println(new String(b, "UTF-8")). You have to decode it the same way, with ISO-8859-15.
This is not "UTF-8" but completely broken and unrepairable data. Strings do not have encodings. It makes no sense to say "UTF-8" string in this context. String is a string of abstract characters - it doesn't have any encodings except as an internal implementation detail that is not our concern and not related to your problem.
A string in Java is already a Unicode representation. When you call one of the getBytes methods on it, you get an encoded representation (as bytes, thus binary values) in a specific encoding, ISO-8859-15 in your example. If you want to convert this byte array back to a Unicode string, you can do that with one of the String constructors accepting a byte array, like you did, but you must do so using the exact same encoding the byte array was originally generated with. Only then can you convert it back to a Unicode string (which has no encoding, and doesn't need one).
Beware of the encoding-less methods, both the string constructor and the getBytes method, since they use the default encoding of the platform the code is running on, which might not be what you want to achieve.
You are trying to decode a byte array encoded with ISO-8859-15 using the UTF-8 format:
b = "Üü?öäABC".getBytes("ISO-8859-15");
u = "Üü?öäABC".getBytes("UTF-8");
System.out.println(new String(b, "ISO-8859-15")); // will be ok
System.out.println(new String(b, "UTF-8")); // will look garbled
System.out.println(new String(u,"UTF-8")); // will be ok
I think the problem here is that you're assuming a java String is encoded with whatever you've specified in the constructor. It's not. It's in UTF-16.
So, "Üü?öäABC".getBytes("ISO-8859-15") is actually converting a UTF-16 string to ISO-8859-15, and then getting the byte representation of that.
If you want the human-readable form in your Eclipse console, just keep it as it is (a Java String) and call System.out.println("Üü?öäABC"); the Eclipse console will decode and display the string correctly.
