I am working on converting a string from one charset to another. After reading many examples I found the code below, which looks nice to me, but as a newbie to charset encoding I want to know whether it is the right way to do it.
public static byte[] transcodeField(byte[] source, Charset from, Charset to) {
return new String(source, from).getBytes(to);
}
To convert String from ASCII to EBCDIC, I have to do:
System.out.println(new String(transcodeField(ebytes,
Charset.forName("US-ASCII"), Charset.forName("Cp1047"))));
And to convert from EBCDIC to ASCII, I have to do:
System.out.println(new String(transcodeField(ebytes,
Charset.forName("Cp1047"), Charset.forName("US-ASCII"))));
The code you found (transcodeField) doesn't convert a String from one encoding to another, because a String doesn't have an encoding¹. It converts bytes from one encoding to another. The method is only useful if your use case satisfies 2 conditions:
Your input data is bytes in one encoding
Your output data needs to be bytes in another encoding
In that case, it's straightforward:
byte[] out = transcodeField(inbytes, Charset.forName(inEnc), Charset.forName(outEnc));
If the input data contains characters that can't be represented in the output encoding (for example, when converting non-ASCII UTF-8 text to ASCII), those characters will be replaced with the ? replacement symbol, and the data will be corrupted.
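If silent ? replacement is not acceptable, one option is to let the conversion fail instead. The following is only a sketch: transcodeStrict is a hypothetical variant of transcodeField (the name and the choice of IllegalArgumentException are my assumptions) that uses the standard CharsetDecoder/CharsetEncoder with CodingErrorAction.REPORT.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictTranscode {
    // Hypothetical helper: like transcodeField, but throws instead of writing '?'.
    static byte[] transcodeStrict(byte[] source, Charset from, Charset to) {
        try {
            // Decode the input bytes; malformed input is reported, not replaced.
            CharBuffer chars = from.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(source));
            // Encode into the target charset; unmappable characters are reported.
            ByteBuffer encoded = to.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(chars);
            byte[] result = new byte[encoded.remaining()];
            encoded.get(result);
            return result;
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException(
                    "Cannot transcode from " + from + " to " + to, e);
        }
    }

    public static void main(String[] args) {
        byte[] plain = "abc".getBytes(StandardCharsets.UTF_8);
        System.out.println(transcodeStrict(
                plain, StandardCharsets.UTF_8, StandardCharsets.US_ASCII).length);

        // \u00e9 is e with acute accent, which US-ASCII cannot represent.
        byte[] accented = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        try {
            transcodeStrict(accented, StandardCharsets.UTF_8, StandardCharsets.US_ASCII);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Whether failing is better than replacing depends on the use case; for data that must survive a round trip, an early error is usually easier to deal with than a stream of question marks.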
However a lot of people ask "How do I convert a String from one encoding to another", to which a lot of people answer with the following snippet:
String s = new String(source.getBytes(inputEncoding), outputEncoding);
This is complete bull****. The getBytes(String encoding) method returns a byte array with the characters encoded in the specified encoding (where possible; again, unmappable characters are converted to ?). The String constructor with the 2nd parameter creates a new String from a byte array, where the bytes are assumed to be in the specified encoding. Now since you just used source.getBytes(inputEncoding) to get those bytes, they're not encoded in outputEncoding (unless the two encodings happen to use the same byte values, which is common for "normal" characters like abcd but not for accented characters such as éêäöñ).
So what does this mean? It means that when you have a Java String, everything is great. Strings are Unicode, meaning that all of your characters are safe. The problem comes when you need to convert that String to bytes, meaning that you need to decide on an encoding. Choosing a Unicode-compatible encoding such as UTF-8 or UTF-16 is great: your characters will still be safe even if your String contains all sorts of weird characters. If you choose a different encoding (with US-ASCII being the least supportive), your String must contain only characters supported by that encoding, or the resulting bytes will be corrupted.
Now finally some examples of good and bad usage.
String myString = "Feng shui in chinese is 風水";
byte[] bytes1 = myString.getBytes("UTF-8"); // Bytes correct
byte[] bytes2 = myString.getBytes("US-ASCII"); // Last 2 characters are now corrupted (converted to question marks)
String nordic = "Här är några merkkejä";
byte[] bytes3 = nordic.getBytes("UTF-8"); // Bytes correct, "weird" chars take 2 bytes each
byte[] bytes4 = nordic.getBytes("ISO-8859-1"); // Bytes correct, "weird" chars take 1 byte each
String broken = new String(nordic.getBytes("UTF-8"), "ISO-8859-1"); // Contains now "HÃ¤r Ã¤r nÃ¥gra merkkejÃ¤"
The last example demonstrates that even though both of the encodings support the nordic characters, they use different bytes to represent them and using the wrong encoding when decoding results in Mojibake. Therefore there's no such thing as "converting a String from one encoding to another", and you should never use the broken example.
Also note that you should always specify the encoding used (with both getBytes() and new String()), because you can't trust that the default encoding is always the one you want.
As a last issue, Charset and Encoding aren't the same thing, but they're very much related.
¹ Technically the way a String is stored internally in the JVM is in UTF-16 encoding up to Java 8, and variable encoding from Java 9 onwards, but the developer doesn't need to care about that.
NOTE
It's possible to have a corrupted String and be able to uncorrupt it by fiddling with the encoding, which may be where this "convert String to other encoding" misunderstanding originates from.
// Input comes from network/file/other place and we have misconfigured the encoding
String input = "HÃ¤r Ã¤r nÃ¥gra merkkejÃ¤"; // UTF-8 bytes, wrongly decoded as ISO-8859-1
byte[] bytes = input.getBytes("ISO-8859-1"); // Get each char as single byte
String asUtf8 = new String(bytes, "UTF-8"); // Recreate String as UTF-8
If no characters were corrupted in input, the string would now be "fixed". However the proper approach is to use the correct encoding when reading input, not fix it afterwards. Especially if there's a chance of it becoming corrupted.
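As a rough illustration of the "if no characters were corrupted" caveat, the repair can be guarded so that it only happens when it is lossless. repairLatin1Mojibake is a hypothetical helper name, and the sketch assumes the specific case of UTF-8 bytes that were decoded as ISO-8859-1; non-ASCII characters are written as \u escapes so the source compiles regardless of file encoding.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class MojibakeRepair {
    // Hypothetical helper: undoes "UTF-8 bytes decoded as ISO-8859-1",
    // but returns the input unchanged when the repair would not be lossless.
    static String repairLatin1Mojibake(String suspect) {
        // Characters above U+00FF cannot come from an ISO-8859-1 decode.
        if (!StandardCharsets.ISO_8859_1.newEncoder().canEncode(suspect)) {
            return suspect;
        }
        // Recover the original bytes: ISO-8859-1 maps each char to one byte.
        byte[] raw = suspect.getBytes(StandardCharsets.ISO_8859_1);
        try {
            // Strict decode: fail rather than insert replacement characters.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            return suspect; // the bytes are not valid UTF-8, so leave it alone
        }
    }

    public static void main(String[] args) {
        // "H\u00c3\u00a4r" is what "H\u00e4r" (a-umlaut) looks like after the mix-up.
        String garbled = "H\u00c3\u00a4r";
        System.out.println(repairLatin1Mojibake(garbled).equals("H\u00e4r"));
        // A string that was decoded correctly is left untouched.
        System.out.println(repairLatin1Mojibake("H\u00e4r").equals("H\u00e4r"));
    }
}
```

This is still a band-aid: it cannot tell a garbled string from a legitimate one that happens to look like valid UTF-8 when re-encoded, so fixing the decoding at the input remains the real solution.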
Related
I get an input string in UTF-8, apply TIS620 encoding and create a new string from it. How do I retain the bytes? UTF-8 represents a Thai character in 3 bytes, whereas TIS620 uses 1 byte. I have a requirement where the backend system stores each character of a string as 1 byte only, so the default UTF-8 breaks it.
How do I convert a String's character encoding from UTF-8 to TIS620?
How do I retain the byte size while passing it to the backend system?
If the string is reassigned to a new String, is the character encoding retained, or is it converted to UTF-16 (Java default) again?
Is this possible in Java? Is there any library or utility that can be integrated?
I've tried the code below and can confirm that after encoding to TIS620 the byte count matches the character count, i.e. 1 byte per character. But if encodedString is assigned to a new String, will it lose the TIS620 format?
(Convert String with encoding UTF-8 to TIS620 (Thai encoding) in Java. What are the ways to do it, and is there any data loss?)
public String encode() {
try {
String input = "ใบใบใบใบ";
byte[] encodedBytes = input.getBytes("TIS620");
String encodedString = new String(encodedBytes, "TIS620");
return encodedString;
} catch (UnsupportedEncodingException e) {
// Encoding failed
return null;
}
}
The expected result is that if I convert 5 Thai characters from UTF-8 to TIS620, the byte count should go from 15 (UTF-8) to 5 (TIS620) and stay that way.
A String in Java is always encoded in UTF-16, no matter how it was constructed. Or put differently: as soon as you have a String object, you should not care about which encoding it has. The encoding only comes back into the picture once you want to go back towards a byte[] (or OutputStream or the like).
This is correct and almost certainly exactly what you want to do. You should not try to work around that fact.
If you need to write the string to disk or send it to some other system in some specific encoding then you can get that encoded data from the String by using getBytes() as you did in your sample code.
In other words:
A String object in Java can not "have TIS620" encoding. A byte[] can contain TIS620 encoded data and you create that from a String using .getBytes("TIS620").
If you pass the encoded byte[] to the other system, it will have the correct byte size, simply because it was created with the correct encoding.
String always uses UTF-16. Creating a String with the content "ใบใบใบใบ" from UTF-8 data and from TIS620 data will produce exactly identical String objects, there's no way to know what encoding was used to create them.
InputStreamReader, OutputStreamWriter and comparable classes can also be passed an encoding to decode/encode with that encoding respectively. Other than that, no special handling is required.
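To make the points above concrete, here is a small sketch that compares byte counts and checks that the resulting String objects are identical. It assumes the "TIS-620" charset is available in your runtime (it normally ships with a full JDK in the jdk.charsets module); the class and method names are made up for the example, and the Thai text is written with \u escapes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ThaiByteCounts {
    // Assumption: the runtime provides TIS-620; forName throws otherwise.
    static final Charset TIS620 = Charset.forName("TIS-620");

    // Number of bytes the text occupies once encoded with the given charset.
    static int byteCount(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // Eight Thai characters: U+0E43 U+0E1A repeated four times.
        String thai = "\u0e43\u0e1a\u0e43\u0e1a\u0e43\u0e1a\u0e43\u0e1a";

        // UTF-8 needs 3 bytes per Thai character, TIS-620 needs 1.
        System.out.println("UTF-8 bytes:   " + byteCount(thai, StandardCharsets.UTF_8));
        System.out.println("TIS-620 bytes: " + byteCount(thai, TIS620));

        // Decoding either byte array with its own charset gives equal Strings;
        // the String keeps no trace of where its characters came from.
        String fromUtf8 = new String(thai.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
        String fromTis620 = new String(thai.getBytes(TIS620), TIS620);
        System.out.println("same String:   " + fromUtf8.equals(fromTis620));
    }
}
```

So the 1-byte-per-character requirement is met by handing the backend the byte[] from getBytes, not by trying to keep a "TIS620 String" around.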
Java's text datatypes (String, char and Character; the same goes for .NET, JavaScript, VB4/5/6/A/Script, …) always use the UTF-16 character encoding of the Unicode character set.
Many interfaces, bindings, drivers, data adaptors, and whatnot understand that the text datatype is UTF-16, know which character encoding the target needs, and so do the conversion themselves. As long as you are using Java datatypes, if you have text encoded as UTF-8 or TIS620, you would typically hold it in a byte array.
That's it for straightforward text as text.
Now, if you had an array of arbitrary bytes and wanted to write it into a text context, you could use Base64. Such a function takes a byte array and returns a String (UTF-16 encoded, of course). Since the characters it uses are supported by practically every character set, no data is lost when that String is then converted to whichever character encoding is needed.
People do like dealing with text datatypes so the above scheme is great. But for some reason, instead of Base64, some people use what I call Base256. They have an array of bytes (very often created from encoding text with a character encoding) and they apply an encoding function to convert the bytes to text, choosing to encode by decoding with a character encoding. You need to identify if that's what you are dealing with and if so, which character encoding was co-opted as a Base256 encoding. (Often the character encoding used for this is ISO 8859-1.)
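For illustration, both schemes can be sketched with standard library calls only. The class and method names are invented for the example; "Base256" is just the nickname used above, not an official API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

class BytesAsText {
    // Base64: arbitrary bytes become a String of ASCII characters.
    static String toBase64(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    static byte[] fromBase64(String text) {
        return Base64.getDecoder().decode(text);
    }

    // "Base256": each byte becomes the char with the same numeric value,
    // which is exactly what decoding as ISO-8859-1 does.
    static String toLatin1Text(byte[] data) {
        return new String(data, StandardCharsets.ISO_8859_1);
    }

    static byte[] fromLatin1Text(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] data = {0x00, 0x41, (byte) 0x80, (byte) 0xFF};

        String base64 = toBase64(data);
        System.out.println(base64 + " -> round trip ok: "
                + Arrays.equals(data, fromBase64(base64)));

        String latin1 = toLatin1Text(data);
        System.out.println("Base256 round trip ok: "
                + Arrays.equals(data, fromLatin1Text(latin1)));
    }
}
```

Both round trips are lossless inside Java. The difference shows up once the String leaves the JVM: the Base64 text survives almost any text channel, while the "Base256" text contains control and non-ASCII characters that only survive if every hop uses a compatible encoding.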
I have made a workaround for my web application, as I failed to set the character encoding to UTF-8 in all scopes when first creating it. I made a simple character conversion Java class, so that I could insert character encoding conversion where needed. These are my methods for that:
public static String encodeUTF8ToLatin(String s) throws UnsupportedEncodingException {
byte[] b = s.getBytes("UTF-8");
return new String(b, "ISO-8859-1");
}
public static String encodeLatinToUTF8(String s) throws UnsupportedEncodingException {
byte[] b = s.getBytes("ISO-8859-1");
return new String(b, "UTF-8");
}
I am using these methods because of the special Danish/Norwegian characters ÆØÅ æøå. They have been working well for a while now, but I just discovered that the second method can't convert upper case characters. When sending the String "ÆØÅ æøå" it returns "?????? æøå". This confuses me, as the conversion table found here seems to claim that all six characters follow the same encoding.
Does anyone know why my upper case characters do not convert properly here?
UPDATE:
From the answers provided, I can tell that I have some serious gaps in my knowledge regarding charsets and encoding. I think I have to just go back to basics, read more, and then I'll decide if the question is salvageable afterwards.
Your encodeUTF8ToLatin converts a Unicode String to a byte array using UTF-8 encoding. Then it decodes that UTF-8 encoded byte array pretending that it is ISO-8859-1 (there is your problem) and converts it to a Unicode string.
Same with the other method.
Your methods are a bit pointless. Strings don't have an encoding, as they are already decoded to characters. A character encoding is a way to represent characters as sequences of bytes, so it only makes sense in a byte array context.
I finally made it work. I simply used "Windows-1252" instead of "ISO-8859-1" to get the bytes, before creating a new string, using UTF-8.
I created a new method, which works for both lower case and upper case letters:
public static String encodeWindows1252ToUTF8(String s) throws UnsupportedEncodingException {
byte[] b = s.getBytes("Windows-1252");
return new String(b, "UTF-8");
}
I found this answer, by referring to this page, which states:
Symptom The following characters fail, while other characters display
correctly:
€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ.
The trademark and Euro currency symbol, ellipsis, single and double
"smart quotes", en and em dash, and the OE ligature characters are
used frequently and are most likely to be reported as a symptom of
this problem.
Explanation The characters in the range 0x80-0x9F (128-159) ... are in
Windows-1252 and not in ISO-8859-1. If you have a problem with
characters in that range only, it is because the characters are
treated as ISO-8859-1 and not Windows-1252.
Look for references to ISO-8859-1 and replace them with "Windows-1252"
(or CP1252, or the correct character encoding name for the library or
platform you are using.)
The three characters that failed were Æ, Ø and Å; their garbled forms all include characters from the list above.
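The difference between the two charsets can be checked with a short sketch. It assumes the incoming text was UTF-8 that got decoded as Windows-1252 somewhere upstream; garble and repair are hypothetical helper names, and the letters are written as \u escapes (\u00c6 is upper case AE, \u00e6 is lower case ae).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Cp1252VersusLatin1 {
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Simulates the upstream mistake: UTF-8 bytes decoded as Windows-1252.
    static String garble(String original) {
        return new String(original.getBytes(StandardCharsets.UTF_8), CP1252);
    }

    // Attempts the repair, assuming the text was mis-decoded with 'assumed'.
    static String repair(String garbled, Charset assumed) {
        return new String(garbled.getBytes(assumed), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String upper = "\u00c6"; // UTF-8 bytes C3 86; 0x86 is a dagger in Windows-1252
        String lower = "\u00e6"; // UTF-8 bytes C3 A6; 0xA6 is the same char in both charsets

        // The lower case letter survives either repair.
        System.out.println(repair(garble(lower), StandardCharsets.ISO_8859_1).equals(lower));
        System.out.println(repair(garble(lower), CP1252).equals(lower));

        // The upper case letter only survives the Windows-1252 repair, because
        // the dagger (U+2020) has no ISO-8859-1 byte and becomes '?'.
        System.out.println(repair(garble(upper), StandardCharsets.ISO_8859_1).equals(upper));
        System.out.println(repair(garble(upper), CP1252).equals(upper));
    }
}
```

This only demonstrates why the workaround behaves as it does. The earlier point still stands: the cleaner fix is to decode the incoming bytes as UTF-8 in the first place.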
I am getting a parameter value from a Jersey web service, and the value contains Japanese characters.
Here, 'japaneseString' is the web service parameter containing the characters in the Japanese language.
String name = new String(japaneseString.getBytes(), "UTF-8");
However, I am able to convert only some string literals successfully, while others cause problems.
The following were successfully converted:
1) アップル
2) 赤
3) 世丕且且世两上与丑万丣丕且丗丕
4) 世世丗丈
While these didn't:
1) ひほわれよう
2) 存在する
When I investigated further, I found that these 2 strings are getting converted into junk characters.
1) Input: ひほわれよう Output : �?��?��?れよ�?�
2) Input: 存在する Output: 存在�?�る
Any idea why some of the Japanese characters are not converted properly?
Thanks.
You are mixing concepts here.
A String is just a sequence of characters (chars); a String in itself has no encoding at all. For what it's worth, replace characters in the above with carrier pigeons. Same thing. A carrier pigeon has no encoding. Neither does a char. (1)
What you are doing here:
new String(x.getBytes(), "UTF-8")
is a "poor man's encoding/decoding process". You will probably have noticed that there are two versions of .getBytes(): one where you pass a charset as an argument and the other where you don't.
If you don't, and that is what happens here, it means you will get the result of the encoding process using your default character set; and then you try and re-decode this byte sequence using UTF-8.
Don't do that. Just take in the string as it comes. If, however, you have trouble reading the original byte stream into a string, it means you use a Reader with the wrong charset. Fix that part.
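A minimal sketch of "fix that part": decode the byte stream once, with an explicit charset, at the point where the bytes enter your program. readAll is a hypothetical helper, and a ByteArrayInputStream stands in for the real network or file stream.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitReader {
    // Hypothetical helper: reads the whole stream as text in the given charset.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder text = new StringBuilder();
        try (BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, charset))) {
            char[] buffer = new char[1024];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Japanese sample text (four characters), written with escapes.
        String original = "\u5b58\u5728\u3059\u308b";
        byte[] wire = original.getBytes(StandardCharsets.UTF_8);

        // Right charset: the String comes out intact, no further fixing needed.
        String right = readAll(new ByteArrayInputStream(wire), StandardCharsets.UTF_8);
        System.out.println(right.equals(original));

        // Wrong charset: each of the 12 bytes turns into its own character.
        String wrong = readAll(new ByteArrayInputStream(wire), StandardCharsets.ISO_8859_1);
        System.out.println(wrong.equals(original) + ", length " + wrong.length());
    }
}
```

In a web service the framework usually does this decoding for you, so the equivalent fix there is typically a configuration matter (request encoding) rather than code like this.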
For more information, read this link.
(1) the fact that, in fact, a char is a UTF-16 code unit is irrelevant to this discussion
Try setting the JVM parameter file.encoding to UTF-8 when starting Tomcat (the JVM).
E.g.: -Dfile.encoding=UTF-8
I concur with @fge.
Clarification
In Java, String/char/Reader/Writer handle (Unicode) text, and can combine all scripts in the world.
And byte[]/InputStream/OutputStream are binary data, which need an indication of some encoding to be converted to a String.
In your case japaneseString should already be a correct String, or be substituted by the original byte[].
Traps in Java
Encoding often is an optional parameter, which then defaults to the platform encoding. You fell in that trap too:
String s = "...";
byte[] b1 = s.getBytes(); // Platform encoding, non-portable.
byte[] b2 = s.getBytes("UTF-8"); // Explicit
byte[] b3 = s.getBytes(StandardCharsets.UTF_8); // Explicit,
// better (constants exist for UTF-8, ISO-8859-1, ...)
In general avoid the overloaded methods without an encoding parameter, as they are for current-computer-only data: non-portable. For completeness: the classes FileReader/FileWriter are best avoided before Java 11, as they only gained constructors with a charset parameter in that release.
Error
japaneseString is already wrong. So you have to read that right.
It could have been read erroneously as Windows-1252 (Windows Latin-1) and suffered when recoding to UTF-8. Evidently only some cases get messed up.
Maybe you had:
String japaneseString = new String(bytes);
instead of:
String japaneseString = new String(bytes, StandardCharsets.UTF_8);
At the end:
String name = japaneseString;
Show the code for reading japaneseString for further help.
How do I convert any UTF-8 strings to readable strings?
Like : ⬠(in UTF8) is €
I tried using Charset, but it is not working.
You are encoding a string to ISO-8859-15 with byte[] b = "Üü?öäABC".getBytes("ISO-8859-15"); then you are decoding it with UTF-8 System.out.println(new String(b, "UTF-8"));. You have to decode it the same way with ISO-8859-15.
This is not "UTF-8" but completely broken and unrepairable data. Strings do not have encodings. It makes no sense to say "UTF-8" string in this context. String is a string of abstract characters - it doesn't have any encodings except as an internal implementation detail that is not our concern and not related to your problem.
A string in Java is already a Unicode representation. When you call one of the getBytes methods on it you get an encoded representation (as bytes, thus binary values) in a specific encoding, ISO-8859-15 in your example. If you want to convert this byte array back to a Unicode string you can do that with one of the String constructors accepting a byte array, like you did, but you must do so using the exact same encoding the byte array was originally generated with. Only then can you convert it back to a Unicode string (which has no encoding, and doesn't need one).
Beware of the encoding-less methods, both the string constructor and the getBytes method, since they use the default encoding of the platform the code is running on, which might not be what you want to achieve.
You are trying to decode a byte array that was encoded with "ISO-8859-15" as if it were "UTF-8":
b = "Üü?öäABC".getBytes("ISO-8859-15");
u = "Üü?öäABC".getBytes("UTF-8");
System.out.println(new String(b, "ISO-8859-15")); // will be ok
System.out.println(new String(b, "UTF-8")); // will look garbled
System.out.println(new String(u,"UTF-8")); // will be ok
I think the problem here is that you're assuming a java String is encoded with whatever you've specified in the constructor. It's not. It's in UTF-16.
So, "Üü?öäABC".getBytes("ISO-8859-15") is actually converting a UTF-16 string to ISO-8859-15, and then getting the byte representation of that.
If you want to get the human-readable format in your Eclipse console, just keep the String as it is and call System.out.println("Üü?öäABC"): System.out encodes the text for the console itself, so no manual conversion is needed.
How can I decode a UTF-8 string on Android? I tried these commands, but the output is the same as the input:
URLDecoder.decode("hello&//à", "UTF-8");
new String("hello&//à", "UTF-8");
EntityUtils.toString("hello&//à", "utf-8");
A string needs no encoding. It is simply a sequence of Unicode characters.
You need to encode when you want to turn a String into a sequence of bytes. The charset that you choose (UTF-8, cp1255, etc.) determines the Character->Byte mapping. Note that a character is not necessarily translated into a single byte. In UTF-8, for example, every character outside the ASCII range is translated to at least two bytes.
Encoding of a String is carried out by:
String s1 = "some text";
byte[] bytes = s1.getBytes("UTF-8"); // Charset to encode into
You need to decode when you have a sequence of bytes and you want to turn them into a String. When you do that you need to specify, again, the charset with which the bytes were originally encoded (otherwise you'll end up with garbled text).
Decoding:
String s2 = new String(bytes, "UTF-8"); // Charset with which bytes were encoded
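Putting the encode/decode pair above together, a small sketch shows how the same character can take a different number of bytes depending on the charset, and that decoding with the matching charset restores the text. encodedLength is a hypothetical helper; \u00e0 is "a with grave accent" and \u20ac is the euro sign.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodedLengths {
    // Number of bytes produced when the text is encoded with the charset.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String[] samples = {"a", "\u00e0", "\u20ac"};
        for (String sample : samples) {
            System.out.println("U+" + Integer.toHexString(sample.codePointAt(0))
                    + ": UTF-8=" + encodedLength(sample, StandardCharsets.UTF_8)
                    + " UTF-16BE=" + encodedLength(sample, StandardCharsets.UTF_16BE));
        }

        // Encode, then decode with the same charset: the text is unchanged.
        String text = "hello&//\u00e0";
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        String back = new String(bytes, StandardCharsets.UTF_8);
        System.out.println(back.equals(text));
    }
}
```

Using the StandardCharsets constants instead of charset names also avoids the checked UnsupportedEncodingException.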
If you want to understand this better, a great text is "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)"
The core functions are getBytes(String charset) and new String(byte[] data). You can use these functions to do UTF-8 decoding.
UTF-8 decoding here is actually a string-to-string conversion, with a byte array as the intermediate buffer. Since the target is a UTF-8 string, the only parameter for new String() is the byte array, which on Android (where the default charset is UTF-8) is equivalent to calling new String(bytes, "UTF-8").
The key is then the charset used to turn the input encoded string into the intermediate byte array, which you should know beforehand. If you don't, guess the most likely one; "ISO-8859-1" is a good guess for English users.
The decoding statement should be
String decoded = new String(encoded.getBytes("ISO-8859-1"));
Try looking at decode string encoded in utf-8 format in android but it doesn't look like your string is encoded with anything particular. What do you think the output should be?
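If what you actually have is percent-encoded (URL-encoded) text, then URLDecoder is the right tool, and it only changes input that contains %XX escapes or + signs. The sketch below assumes that scenario; decodeUtf8 and encodeUtf8 are hypothetical wrappers around the standard URLDecoder/URLEncoder calls, and \u00e0 stands for the accented a from the question.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class PercentDecoding {
    // Wraps URLDecoder.decode so callers need not handle the checked exception.
    static String decodeUtf8(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 should always be supported", e);
        }
    }

    // Wraps URLEncoder.encode in the same way.
    static String encodeUtf8(String plain) {
        try {
            return URLEncoder.encode(plain, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 should always be supported", e);
        }
    }

    public static void main(String[] args) {
        String plain = "hello&//\u00e0";

        // Nothing is escaped in the plain text, so decoding leaves it unchanged.
        System.out.println(decodeUtf8(plain).equals(plain));

        // Once the text is percent-encoded, decoding restores the original.
        String encoded = encodeUtf8(plain);
        System.out.println(encoded);
        System.out.println(decodeUtf8(encoded).equals(plain));
    }
}
```

That would explain why the output equals the input in the question: a plain String has nothing left to decode.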