In my assigned project, the original author has written a function:
public String asString() throws DataException
{
    if (getData() == null) return null;
    CharBuffer charBuf = null;
    try
    {
        charBuf = s_charset.newDecoder().decode(ByteBuffer.wrap(f_data));
    }
    catch (CharacterCodingException e)
    {
        throw new DataException("You can't have a string from this ParasolBlob: " + this, e);
    }
    return charBuf.toString() + "你好";
}
Please note that the constant s_charset is defined as:
private static final Charset s_charset = Charset.forName("UTF-8");
Please also note that I have hard-coded a Chinese string in the return string.
Now when the program flow reaches this method, it will throw the following exception:
java.nio.charset.UnmappableCharacterException: Input length = 2
And more interestingly, the hard-coded Chinese string will be shown as "??" at the console if I do a System.out.println().
I think this problem is quite interesting with regard to localization. I've tried changing it to
Charset.forName("GBK");
but that doesn't seem to be the solution. Also, I have set the encoding of the Java source file to UTF-8.
Do any experts have experience in this regard? Would you please share a little? Thanks in advance!
And more interestingly, the hard-coded Chinese string will be shown as "??" at the console if I do a System.out.println().
System.out performs a transcoding operation from UTF-16 strings to the default JRE character encoding. If this does not match the encoding used by the receiving device, the character data is corrupted. So the console should be set to the right character encoding (UTF-8) to render the Chinese characters properly.
If you are using Eclipse, you can change the console encoding by going to
Run Configuration -> Common -> Encoding (select UTF-8 from the dropdown)
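If you cannot reconfigure the console, another option is to wrap System.out in a PrintStream with an explicit encoding. A minimal sketch (the utf8Out name is my own; the receiving console must still be set to UTF-8 to display the output correctly):

import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

try
{
    // Transcode to UTF-8 explicitly instead of the platform default encoding.
    PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
    utf8Out.println("你好");
}
catch (UnsupportedEncodingException e)
{
    throw new RuntimeException(e); // cannot happen: UTF-8 is always supported
}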
Java Strings are Unicode:
System.out.println("你好");
As Kevin stated, the encoding of your source file determines how string literals are converted to UTF-16 (the internal encoding of a Java String). So when you see "??", it is most likely a simple conversion error.
Now, if you want to convert a simple byte array to a String using a given character encoding, I believe there is a much easier way to do it than using a raw CharsetDecoder. That is:
byte[] bytes = {0x61};
String string = new String(bytes, Charset.forName("UTF-8"));
System.out.println(string);
This will work, provided that the byte array really contains a UTF-8 encoded stream of bytes, and that it is without a BOM; otherwise the BOM bytes will be decoded into a spurious extra character at the start of the result. Make sure that what you are trying to convert does not start with the sequence 0xEF 0xBB 0xBF.
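If you cannot rule a BOM out, you can strip it defensively before decoding. A minimal sketch, with a helper name of my own choosing:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper: drops a leading UTF-8 BOM (0xEF 0xBB 0xBF) if present, then decodes.
static String decodeUtf8(byte[] bytes) {
    if (bytes.length >= 3
            && bytes[0] == (byte) 0xEF
            && bytes[1] == (byte) 0xBB
            && bytes[2] == (byte) 0xBF) {
        bytes = Arrays.copyOfRange(bytes, 3, bytes.length);
    }
    return new String(bytes, StandardCharsets.UTF_8);
}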
Related
I can't display a string containing Latin Extended-A characters in an appropriate way.
I have tried getting the bytes with different encodings and creating a new String with a different encoding.
If you have the string "ăăăăăăăăă", how can you output it in an appropriate way?
Java supports Unicode characters. If you have:
String x = "ăăăăăăăăă";
System.out.println(x);
You'll get
ăăăăăăăăă
If you get question marks or funky-looking characters, then it's most likely not a problem with Java or the code, but with the fonts on your computer not supporting them.
I am working on converting a string from one charset to another. I have read many examples and finally found the code below, which looks good to me. As a newbie to charset encoding, I want to know if it is the right way to do it.
public static byte[] transcodeField(byte[] source, Charset from, Charset to) {
    return new String(source, from).getBytes(to);
}
To convert String from ASCII to EBCDIC, I have to do:
System.out.println(new String(transcodeField(ebytes,
Charset.forName("US-ASCII"), Charset.forName("Cp1047"))));
And to convert from EBCDIC to ASCII, I have to do:
System.out.println(new String(transcodeField(ebytes,
Charset.forName("Cp1047"), Charset.forName("US-ASCII"))));
The code you found (transcodeField) doesn't convert a String from one encoding to another, because a String doesn't have an encoding¹. It converts bytes from one encoding to another. The method is only useful if your use case satisfies 2 conditions:
Your input data is bytes in one encoding
Your output data needs to be bytes in another encoding
In that case, it's straightforward:
byte[] out = transcodeField(inbytes, Charset.forName(inEnc), Charset.forName(outEnc));
If the input data contains characters that can't be represented in the output encoding (such as converting UTF-8 text that uses non-ASCII characters to ASCII), those characters will be replaced with the ? replacement symbol, and the data will be corrupted.
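For example, a legitimate round trip might look like this (a sketch; it assumes the input bytes really are ISO-8859-1):

import java.nio.charset.StandardCharsets;

byte[] latin1 = "café".getBytes(StandardCharsets.ISO_8859_1); // 4 bytes, é is 0xE9
byte[] utf8 = transcodeField(latin1,
        StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);  // 5 bytes, é becomes 0xC3 0xA9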
However a lot of people ask "How do I convert a String from one encoding to another", to which a lot of people answer with the following snippet:
String s = new String(source.getBytes(inputEncoding), outputEncoding);
This is complete bull****. The getBytes(String encoding) method returns a byte array with the characters encoded in the specified encoding (if possible; again, invalid characters are converted to ?). The String constructor with the 2nd parameter creates a new String from a byte array, where the bytes are in the specified encoding. Now since you just used source.getBytes(inputEncoding) to get those bytes, they're not encoded in outputEncoding (except if the encodings use the same values, which is common for "normal" characters like abcd, but differs for more complex ones like the accented characters éêäöñ).
So what does this mean? It means that when you have a Java String, everything is great. Strings are unicode, meaning that all of your characters are safe. The problem comes when you need to convert that String to bytes, meaning that you need to decide on an encoding. Choosing a unicode compatible encoding such as UTF8, UTF16 etc. is great. It means your characters will still be safe even if your String contained all sorts of weird characters. If you choose a different encoding (with US-ASCII being the least supportive) your String must contain only the characters supported by the encoding, or it will result in corrupted bytes.
Now finally some examples of good and bad usage.
String myString = "Feng shui in chinese is 風水";
byte[] bytes1 = myString.getBytes("UTF-8"); // Bytes correct
byte[] bytes2 = myString.getBytes("US-ASCII"); // Last 2 characters are now corrupted (converted to question marks)
String nordic = "Här är några merkkejä";
byte[] bytes3 = nordic.getBytes("UTF-8"); // Bytes correct, "weird" chars take 2 bytes each
byte[] bytes4 = nordic.getBytes("ISO-8859-1"); // Bytes correct, "weird" chars take 1 byte each
String broken = new String(nordic.getBytes("UTF-8"), "ISO-8859-1"); // Contains now "HÃ¤r Ã¤r nÃ¥gra merkkejÃ¤"
The last example demonstrates that even though both of the encodings support the nordic characters, they use different bytes to represent them and using the wrong encoding when decoding results in Mojibake. Therefore there's no such thing as "converting a String from one encoding to another", and you should never use the broken example.
Also note that you should always specify the encoding used (with both getBytes() and new String()), because you can't trust that the default encoding is always the one you want.
As a last issue, Charset and Encoding aren't the same thing, but they're very much related.
¹ Technically the way a String is stored internally in the JVM is in UTF-16 encoding up to Java 8, and variable encoding from Java 9 onwards, but the developer doesn't need to care about that.
NOTE
It's possible to have a corrupted String and be able to uncorrupt it by fiddling with the encoding, which may be where this "convert String to other encoding" misunderstanding originates from.
// Input comes from network/file/other place and we have misconfigured the encoding
String input = "HÃ¤r Ã¤r nÃ¥gra merkkejÃ¤"; // UTF-8 bytes, interpreted wrongly as ISO-8859-1 compatible
byte[] bytes = input.getBytes("ISO-8859-1"); // Get each char as single byte
String asUtf8 = new String(bytes, "UTF-8"); // Recreate String as UTF-8
If no characters were corrupted in input, the string would now be "fixed". However the proper approach is to use the correct encoding when reading input, not fix it afterwards. Especially if there's a chance of it becoming corrupted.
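For instance, a sketch of reading input with the encoding declared up front (inputStream stands for whatever raw network/file stream you actually have):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Declare the encoding at the point where bytes enter the program,
// so no "fixing" is ever needed afterwards.
BufferedReader reader = new BufferedReader(
        new InputStreamReader(inputStream, StandardCharsets.UTF_8));
String line = reader.readLine();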
I am getting a parameter value from a Jersey web service, and it contains Japanese characters.
Here, 'japaneseString' is the web service parameter containing the Japanese-language characters.
String name = new String(japaneseString.getBytes(), "UTF-8");
However, while I am able to convert a few string literals successfully, some of them are creating problems.
The following were successfully converted:
1) アップル
2) 赤
3) 世丕且且世两上与丑万丣丕且丗丕
4) 世世丗丈
While these didn't:
1) ひほわれよう
2) 存在する
When I further investigated, I found that these 2 strings are getting converted into junk characters.
1) Input: ひほわれよう Output : �?��?��?れよ�?�
2) Input: 存在する Output: 存在�?�る
Any idea why some of the japanese characters are not converted properly?
Thanks.
You are mixing concepts here.
A String is just a sequence of characters (chars); a String in itself has no encoding at all. For what it's worth, replace characters in the above with carrier pigeons. Same thing. A carrier pigeon has no encoding. Neither does a char. (1)
What you are doing here:
new String(x.getBytes(), "UTF-8")
is a "poor man's encoding/decoding process". You will probably have noticed that there are two versions of .getBytes(): one where you pass a charset as an argument and the other where you don't.
If you don't, and that is what happens here, it means you will get the result of the encoding process using your default character set; and then you try and re-decode this byte sequence using UTF-8.
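For instance, here is a sketch of what that process does on a machine whose default character set is ISO-8859-1 (simulated below by passing it explicitly):

import java.nio.charset.StandardCharsets;

String x = "日本語";
byte[] defaultBytes = x.getBytes(StandardCharsets.ISO_8859_1); // unmappable chars become '?' (0x3F)
String roundTrip = new String(defaultBytes, StandardCharsets.UTF_8);
System.out.println(roundTrip); // prints "???" -- the first step already destroyed the data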
Don't do that. Just take in the string as it comes. If, however, you have trouble reading the original byte stream into a string, it means you use a Reader with the wrong charset. Fix that part.
For more information, read this link.
(1) the fact that, in fact, a char is a UTF-16 code unit is irrelevant to this discussion
Try setting the JVM parameter file.encoding to UTF-8 at the startup of Tomcat (the JVM).
E.g.: -Dfile.encoding=UTF-8
I concur with @fge.
Clarification
In Java, String/char/Reader/Writer handle (Unicode) text and can combine all the scripts in the world.
And byte[]/InputStream/OutputStream are binary data, which need an indication of some encoding to be converted to a String.
In your case, japaneseString should already be a correct String, or should be substituted by the original byte[].
Traps in Java
The encoding often is an optional parameter, which then defaults to the platform encoding. You fell into that trap too:
String s = "...";
byte[] b = s.getBytes();                        // Platform encoding, non-portable.
byte[] b = s.getBytes("UTF-8");                 // Explicit
byte[] b = s.getBytes(StandardCharsets.UTF_8);  // Explicit, better (for UTF-8, ISO-8859-1)
In general, avoid the overloaded methods without an encoding parameter, as they are for current-computer-only data: non-portable. For completeness: the classes FileReader/FileWriter should be avoided, as they (before Java 11) provide no encoding parameter at all.
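For instance, a sketch of a portable replacement for FileReader (the file name is made up):

import java.io.BufferedReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Explicit charset per read; behaves the same on every platform.
BufferedReader in = Files.newBufferedReader(Paths.get("data.txt"), StandardCharsets.UTF_8);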
Error
japaneseString is already wrong, so you have to read it correctly in the first place.
It could have been read erroneously as Windows-1252 (Windows Latin-1) and suffered when recoding to UTF-8. Evidently, only some cases get messed up.
Maybe you had:
String japaneseString = new String(bytes);
instead of:
String japaneseString = new String(bytes, StandardCharsets.UTF_8);
At the end:
String name = japaneseString;
Show the code for reading japaneseString for further help.
I read a list for my Android app from a CSV or TXT file.
If the file is encoded as UTF-8 with Notepad++, I see the list all right, but I can't search/find strings with .equals().
If the file is encoded with Windows as ANSI, I can't see äöü etc., but now I can find strings.
Now my question: how can I find out what charset my string has?
I compare my first string (from the file) with another string, read in the app with a SearchView.
I "THINK" my SearchView string from the app is ANSI too; how do I change that to UTF-8, hoping that the compare then works again?
Android 4.4.2
Thank you
The following doesn't work:
String s = null;
try
{
    s = new String(query.getBytes(), "UTF-8");
}
catch (UnsupportedEncodingException e)
{
    Log.e("utf8", "conversion", e);
}
Java strings are always encoded as UTF-16, regardless of where the string data comes from.
It is important that you correctly identify the charset of the source data when converting it to a Java string. new String(query.getBytes(), "UTF-8") will work fine if the byte[] array is actually UTF-8 encoded.

If you specify a charset that Java does not support, you will get an UnsupportedEncodingException. However, if you specify a charset that Java does support and the decoding of the data then fails (typically because you specified the wrong charset for the data), you will get other errors instead, such as MalformedInputException or UnmappableCharacterException; or, worse, you will get no errors at all and malformed/illegal bytes will simply be converted to the Unicode U+FFFD replacement character.

If you need more control over error handling during the conversion process, you need to use the CharsetDecoder class instead.
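If you do need that control, a minimal CharsetDecoder sketch might look like this (bytes stands for your input data):

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)      // throw instead of substituting U+FFFD
        .onUnmappableCharacter(CodingErrorAction.REPORT);
try {
    String s = decoder.decode(ByteBuffer.wrap(bytes)).toString();
} catch (CharacterCodingException e) {
    // The bytes were not valid UTF-8; try another charset or ask the user.
}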
Sometimes UTF-encoded files will have a BOM at the front, so you can check for that. But ANSI files do not use BOMs. If a UTF BOM is not present in the file, then you have to either analyze the raw data and take a guess (which will lead to problems if you guess wrong), or simply ask the user which charset to use.
Always know the charset of your data. If you don't know, ask. Avoid guessing.
I'm Lithuanian and I'm creating an app in the Lithuanian language, but Strings can't contain letters such as: ą, č, ę, ė, į, š, ų, ū, ž...
I searched the internet for a simple way to make it possible, but I ended up here...
Here is some of my code that I want to modify:
if (dayOfWeek.equals("Wednesday")) {
    dayOfWeek = "Treciadienis"; // this should be Trečiadienis
}
And I have an array that has a bunch of these letters. How should I deal with it?
static JSONArray jArray = new JSONArray(data);
Thank you in advance!
A String can contain letters like ą and č, so the following code is valid: dayOfWeek = "Trečiadienis";
Have you checked whether your file is encoded in UTF-8? To do that in Eclipse, go to File => Properties, and you'll see the Text file encoding at the bottom.
If you really cannot (I think you're talking about the character č, a c with caron), the other solution is to refer to the byte values of the String, and do: dayOfWeek = "Tre".concat(new String(new byte[]{(byte) 0xC4, (byte) 0x8D}, StandardCharsets.UTF_8)).concat("iadienis"); (yep, quite extreme, but it works; 0xC4 0x8D is the UTF-8 encoding of č).
It's very common, if you're using Windows, that Eclipse sets the default encoding to Cp1252, which you must change to UTF-8 so you're able to use those kinds of characters hard-coded in your .java files.
Don't forget that you can also use the string constructor:
String(byte[] data, String charsetName)
Adding to gahfy's answer:
Instead of adding individual bytes, you can use the \uXXXX escape syntax within the String, where XXXX is the hexadecimal Unicode code point of the character. This is of course more annoying than using UTF-8 encoding, but less annoying than adding bytes.
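For example, completing the earlier dayOfWeek snippet (č is U+010D):

dayOfWeek = "Tre\u010Diadienis"; // \u010D is č, so this is "Trečiadienis"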