How to encode a string into UTF-8 in Java

I have a Japanese String, 文字列, that I want to convert to UTF-8 encoding. This question may seem like a duplicate, but I have googled for some time and could not find a direct answer.

Encoding a String is the process of transforming a sequence of characters into a sequence of bytes.
For that use the getBytes() method.
This method accepts an encoding parameter, which defines the encoding used in this process. Therefore, you can use:
byte[] encoded = "文字列".getBytes("UTF-8");
As per Jon Skeet's comment, don't use magic strings:
byte[] encoded = "文字列".getBytes(StandardCharsets.UTF_8);
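As a rough sketch of the full round trip (the class and method names below are made up for illustration, they are not from the question), encoding produces a byte[] and decoding with the same charset restores the original String:

```java
import java.nio.charset.StandardCharsets;

public class Utf8EncodeExample {
    // Encode a String into UTF-8 bytes.
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Decode UTF-8 bytes back into a String.
    static String fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Unicode escapes for 文字列, so the example does not depend on the source file encoding.
        String original = "\u6587\u5B57\u5217";
        byte[] encoded = toUtf8(original);
        // Each of these three characters takes 3 bytes in UTF-8.
        System.out.println("UTF-8 byte count: " + encoded.length);
        System.out.println("Round trip equal: " + fromUtf8(encoded).equals(original));
    }
}
```

The StandardCharsets overloads also avoid the checked UnsupportedEncodingException that the String-name overloads declare.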

Related

Java- How to verify if Thai characters are encoded correctly from UTF-8 to TIS620

I receive an input string in UTF-8, apply TIS620 encoding and create a new String from it. How do I retain the bytes? UTF-8 represents a Thai character in 3 bytes, whereas TIS620 uses 1 byte. I have a requirement where the backend system stores each character as 1 byte only, so the default UTF-8 breaks it.
How to convert String character encoding from UTF-8 to TIS620?
How to retain the byte size while passing it to the backend system?
If the string is reassigned to a new String, is the character encoding retained, or is it converted to UTF-16 (the Java default) again?
Is it possible in Java? Is there any lib/utility which can be integrated?
I've tried the code below and can check that after TIS620 encoding the byte count matches the character count, i.e. 1 byte per char. But if encodedString is assigned to a new String, will it lose the TIS620 format?
(Convert String with encoding UTF-8 to TIS620 (Thai encoding) in Java. What are the ways to do it, and is there any data loss?)
public String encode() {
    try {
        String input = "ใบใบใบใบ";
        byte[] encodedBytes = input.getBytes("TIS620");
        String encodedString = new String(encodedBytes, "TIS620");
        return encodedString;
    } catch (UnsupportedEncodingException e) {
        // Encoding failed
        return null;
    }
}
Expected result: if I convert 5 Thai characters from UTF-8 to TIS620, the byte count should change from 15 (UTF-8) to 5 (TIS620) and be retained?
A String in Java is always encoded in UTF-16, no matter how it was constructed. Or put differently: as soon as you have a String object, you should not care about which encoding it has. The encoding only comes back into the picture once you want to go back towards a byte[] (or OutputStream or the like).
This is correct and almost certainly exactly what you want to do. You should not try to work around that fact.
If you need to write the string to disk or send it to some other system in some specific encoding then you can get that encoded data from the String by using getBytes() as you did in your sample code.
In other words:
A String object in Java can not "have TIS620" encoding. A byte[] can contain TIS620 encoded data and you create that from a String using .getBytes("TIS620").
If you pass the encoded byte[] to the other system, it will have the correct byte size, simply because it was created with the correct encoding.
String always uses UTF-16. Creating a String with the content "ใบใบใบใบ" from UTF-8 data and from TIS620 data will produce exactly identical String objects, there's no way to know what encoding was used to create them.
InputStreamReader, OutputStreamWriter and comparable classes can also be passed an encoding to decode/encode with that encoding respectively. Other than that, no special handling is required.
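The points above can be checked with a small sketch. It assumes the running JVM provides the TIS-620 charset (standard JDK builds usually do, but it is not one of the guaranteed charsets), and the class name is invented for illustration:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ThaiEncodingExample {
    // Assumption: this JVM ships TIS-620; Charset.forName throws if it does not.
    static final Charset TIS_620 = Charset.forName("TIS-620");

    public static void main(String[] args) {
        // Unicode escapes for the eight Thai characters ใบใบใบใบ.
        String thai = "\u0E43\u0E1A\u0E43\u0E1A\u0E43\u0E1A\u0E43\u0E1A";

        byte[] utf8Bytes = thai.getBytes(StandardCharsets.UTF_8);
        byte[] tisBytes = thai.getBytes(TIS_620);
        System.out.println("UTF-8 bytes:   " + utf8Bytes.length); // 3 bytes per Thai character
        System.out.println("TIS-620 bytes: " + tisBytes.length);  // 1 byte per Thai character

        // Decoding each byte[] with its own charset yields equal Strings;
        // the String keeps no trace of the encoding it was created from.
        String fromUtf8 = new String(utf8Bytes, StandardCharsets.UTF_8);
        String fromTis = new String(tisBytes, TIS_620);
        System.out.println("Strings equal: " + fromUtf8.equals(fromTis));
    }
}
```

So the thing to hand to the backend is the byte[] (tisBytes here), not a String built from it.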
Java's text datatypes (String, char and Character) always use the UTF-16 character encoding of the Unicode character set. The same goes for .NET, JavaScript, VB4/5/6/A/Script, …
Many interfaces, bindings, drivers, data adaptors, and what not, understand that the text datatype is UTF-16, know which character encoding the target needs, and so do the conversion themselves. As long as you are using Java datatypes, if you have text encoded as UTF-8 or TIS620, you would typically hold it in a byte array.
That's it for straightforward text as text.
Now, if you had an array of arbitrary bytes and you want to write it into a text context, you could use Base64. Such a function takes a byte array and returns a String (UTF-16 encoded, of course). But since the characters used are supported by every character set, there would be no loss of data to convert the data to using whichever character encoding is needed.
People do like dealing with text datatypes so the above scheme is great. But for some reason, instead of Base64, some people use what I call Base256. They have an array of bytes (very often created from encoding text with a character encoding) and they apply an encoding function to convert the bytes to text, choosing to encode by decoding with a character encoding. You need to identify if that's what you are dealing with and if so, which character encoding was co-opted as a Base256 encoding. (Often the character encoding used for this is ISO 8859-1.)
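To make the difference concrete, here is a minimal sketch of both schemes; the helper names are hypothetical. It relies on java.util.Base64 (Java 8+) and on the fact that ISO 8859-1 maps every byte value 0x00-0xFF to exactly one char:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

public class BytesAsTextExample {
    // Base64: arbitrary bytes become ASCII-only text.
    static String toBase64(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    static byte[] fromBase64(String text) {
        return Base64.getDecoder().decode(text);
    }

    // "Base256": arbitrary bytes become text by decoding them as ISO 8859-1 (one char per byte).
    static String toLatin1Text(byte[] data) {
        return new String(data, StandardCharsets.ISO_8859_1);
    }

    static byte[] fromLatin1Text(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] data = {0x00, 0x7F, (byte) 0x80, (byte) 0xFF};

        String base64 = toBase64(data);
        System.out.println("Base64 text: " + base64);
        System.out.println("Base64 round trip: " + Arrays.equals(data, fromBase64(base64)));

        String latin1 = toLatin1Text(data);
        System.out.println("Latin-1 text length: " + latin1.length());
        System.out.println("Latin-1 round trip: " + Arrays.equals(data, fromLatin1Text(latin1)));
    }
}
```

The Latin-1 variant only survives if every later step keeps those chars intact; Base64 text is the safer choice when the text passes through systems with other character encodings.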

Converting String from One Charset to Another

I am working on converting a string from one charset to another. I have read many examples and finally found the code below, which looks nice to me. As a newbie to charset encoding, I want to know whether it is the right way to do it.
public static byte[] transcodeField(byte[] source, Charset from, Charset to) {
return new String(source, from).getBytes(to);
}
To convert String from ASCII to EBCDIC, I have to do:
System.out.println(new String(transcodeField(ebytes,
Charset.forName("US-ASCII"), Charset.forName("Cp1047"))));
And to convert from EBCDIC to ASCII, I have to do:
System.out.println(new String(transcodeField(ebytes,
Charset.forName("Cp1047"), Charset.forName("US-ASCII"))));
The code you found (transcodeField) doesn't convert a String from one encoding to another, because a String doesn't have an encoding¹. It converts bytes from one encoding to another. The method is only useful if your use case satisfies 2 conditions:
Your input data is bytes in one encoding
Your output data needs to be bytes in another encoding
In that case, it's straightforward:
byte[] out = transcodeField(inbytes, Charset.forName(inEnc), Charset.forName(outEnc));
If the input data contains characters that can't be represented in the output encoding (such as converting complex UTF8 to ASCII) those characters will be replaced with the ? replacement symbol, and the data will be corrupted.
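If silent replacement is not acceptable, one possible variation (a sketch, not part of the code from the question; transcodeStrict is a made-up name) is to go through CharsetDecoder and CharsetEncoder with CodingErrorAction.REPORT, so that malformed input or unmappable characters raise an exception instead of turning into "?":

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictTranscodeExample {
    // Like transcodeField, but fails instead of substituting a replacement character.
    static byte[] transcodeStrict(byte[] source, Charset from, Charset to)
            throws CharacterCodingException {
        CharBuffer chars = from.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(source));
        ByteBuffer encoded = to.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .encode(chars);
        byte[] result = new byte[encoded.remaining()];
        encoded.get(result);
        return result;
    }

    public static void main(String[] args) {
        byte[] utf8 = "caf\u00E9".getBytes(StandardCharsets.UTF_8); // "café"
        try {
            // é exists in ISO-8859-1, so this conversion succeeds.
            byte[] latin1 = transcodeStrict(utf8, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
            System.out.println("ISO-8859-1 length: " + latin1.length);
            // é does not exist in US-ASCII, so this conversion throws.
            transcodeStrict(utf8, StandardCharsets.UTF_8, StandardCharsets.US_ASCII);
        } catch (CharacterCodingException e) {
            System.out.println("Cannot convert without loss: " + e);
        }
    }
}
```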
However a lot of people ask "How do I convert a String from one encoding to another", to which a lot of people answer with the following snippet:
String s = new String(source.getBytes(inputEncoding), outputEncoding);
This is complete bull****. The getBytes(String encoding) method returns a byte array with the characters encoded in the specified encoding (if possible, again invalid characters are converted to ?). The String constructor with the 2nd parameter creates a new String from a byte array, where the bytes are in the specified encoding. Now since you just used source.getBytes(inputEncoding) to get those bytes, they're not encoded in outputEncoding (except if the encodings use the same values, which is common for "normal" characters like abcd, but differs with more complex like accented characters éêäöñ).
So what does this mean? It means that when you have a Java String, everything is great. Strings are unicode, meaning that all of your characters are safe. The problem comes when you need to convert that String to bytes, meaning that you need to decide on an encoding. Choosing a unicode compatible encoding such as UTF8, UTF16 etc. is great. It means your characters will still be safe even if your String contained all sorts of weird characters. If you choose a different encoding (with US-ASCII being the least supportive) your String must contain only the characters supported by the encoding, or it will result in corrupted bytes.
Now finally some examples of good and bad usage.
String myString = "Feng shui in chinese is 風水";
byte[] bytes1 = myString.getBytes("UTF-8"); // Bytes correct
byte[] bytes2 = myString.getBytes("US-ASCII"); // Last 2 characters are now corrupted (converted to question marks)
String nordic = "Här är några merkkejä";
byte[] bytes3 = nordic.getBytes("UTF-8"); // Bytes correct, "weird" chars take 2 bytes each
byte[] bytes4 = nordic.getBytes("ISO-8859-1"); // Bytes correct, "weird" chars take 1 byte each
String broken = new String(nordic.getBytes("UTF-8"), "ISO-8859-1"); // Contains now "HÃ¤r Ã¤r nÃ¥gra merkkejÃ¤"
The last example demonstrates that even though both of the encodings support the nordic characters, they use different bytes to represent them and using the wrong encoding when decoding results in Mojibake. Therefore there's no such thing as "converting a String from one encoding to another", and you should never use the broken example.
Also note that you should always specify the encoding used (with both getBytes() and new String()), because you can't trust that the default encoding is always the one you want.
As a last issue, Charset and Encoding aren't the same thing, but they're very much related.
¹ Technically the way a String is stored internally in the JVM is in UTF-16 encoding up to Java 8, and variable encoding from Java 9 onwards, but the developer doesn't need to care about that.
NOTE
It's possible to have a corrupted String and be able to uncorrupt it by fiddling with the encoding, which may be where this "convert String to other encoding" misunderstanding originates from.
// Input comes from network/file/other place and we have misconfigured the encoding
String input = "HÃ¤r Ã¤r nÃ¥gra merkkejÃ¤"; // UTF-8 bytes, interpreted wrongly as ISO-8859-1 compatible
byte[] bytes = input.getBytes("ISO-8859-1"); // Get each char as single byte
String asUtf8 = new String(bytes, "UTF-8"); // Recreate String as UTF-8
If no characters were corrupted in input, the string would now be "fixed". However the proper approach is to use the correct encoding when reading input, not fix it afterwards. Especially if there's a chance of it becoming corrupted.
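For cases where such a repair is unavoidable, a more defensive sketch is shown below. The repair helper is hypothetical; it uses the strict encode/decode convenience methods of CharsetEncoder and CharsetDecoder, which throw instead of substituting replacement characters, so a failed repair is noticed rather than silently producing more garbage:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

public class MojibakeRepairExample {
    // Undo "UTF-8 bytes decoded as ISO-8859-1". Throws if the input contains chars
    // outside ISO-8859-1 or if the recovered bytes are not valid UTF-8.
    static String repair(String misdecoded) throws CharacterCodingException {
        ByteBuffer bytes = StandardCharsets.ISO_8859_1.newEncoder()
                .encode(CharBuffer.wrap(misdecoded));
        return StandardCharsets.UTF_8.newDecoder().decode(bytes).toString();
    }

    public static void main(String[] args) throws CharacterCodingException {
        // "HÃ¤r" written with escapes: the UTF-8 bytes of "Här" read as ISO-8859-1.
        String broken = "H\u00C3\u00A4r";
        String fixed = repair(broken);
        System.out.println(fixed.equals("H\u00E4r")); // compares against "Här"
    }
}
```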

UTF-8 Encoding ; Only some Japanese characters are not getting converted

I receive a parameter value from a Jersey web service, and the value is in Japanese characters.
Here, 'japaneseString' is the web service parameter containing the Japanese characters.
String name = new String(japaneseString.getBytes(), "UTF-8");
However, while I am able to convert a few string literals successfully, some of them cause problems.
The following were successfully converted:
1) アップル
2) 赤
3) 世丕且且世两上与丑万丣丕且丗丕
4) 世世丗丈
While these didn't:
1) ひほわれよう
2) 存在する
When I investigated further, I found that these 2 strings are converted into junk characters.
1) Input: ひほわれよう Output : �?��?��?れよ�?�
2) Input: 存在する Output: 存在�?�る
Any idea why some of the japanese characters are not converted properly?
Thanks.
You are mixing concepts here.
A String is just a sequence of characters (chars); a String in itself has no encoding at all. For what it's worth, replace characters in the above with carrier pigeons. Same thing. A carrier pigeon has no encoding. Neither does a char. (1)
What you are doing here:
new String(x.getBytes(), "UTF-8")
is a "poor man's encoding/decoding process". You will probably have noticed that there are two versions of .getBytes(): one where you pass a charset as an argument and the other where you don't.
If you don't, and that is what happens here, it means you will get the result of the encoding process using your default character set; and then you try and re-decode this byte sequence using UTF-8.
Don't do that. Just take in the string as it comes. If, however, you have trouble reading the original byte stream into a string, it means you use a Reader with the wrong charset. Fix that part.
For more information, read this link.
(1) the fact that, in fact, a char is a UTF-16 code unit is irrelevant to this discussion
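As a sketch of "fix the Reader" (the class and method names are invented, and the ByteArrayInputStream merely stands in for whatever byte source you really have), the charset is named exactly once, at the point where bytes become chars:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReaderCharsetExample {
    // Read a whole byte stream into a String, decoding with the given charset.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buffer = new char[1024];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                text.append(buffer, 0, count);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        String original = "\u5B58\u5728\u3059\u308B"; // 存在する
        byte[] wire = original.getBytes(StandardCharsets.UTF_8); // stand-in for bytes from the network
        String decoded = readAll(new ByteArrayInputStream(wire), StandardCharsets.UTF_8);
        System.out.println(decoded.equals(original));
    }
}
```

Once the String has been decoded correctly like this, no further new String(x.getBytes(), ...) step is needed.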
Try setting the JVM parameter file.encoding to UTF-8 when starting Tomcat (the JVM).
E.g.: -Dfile.encoding=UTF-8
I concur with @fge.
Clarification
In java String/char/Reader/Writer handle (Unicode) text, and can combine all scripts in the world.
And byte[]/InputStream/OutputStream are binary data, which need an indication of some encoding to be converted to String.
In your case japaneseString should already be a correct String, or be substituted by the original byte[].
Traps in Java
Encoding often is an optional parameter, which then defaults to the platform encoding. You fell in that trap too:
String s = "...";
byte[] b1 = s.getBytes(); // Platform encoding, non-portable.
byte[] b2 = s.getBytes("UTF-8"); // Explicit
byte[] b3 = s.getBytes(StandardCharsets.UTF_8); // Explicit,
// better (for UTF-8, ISO-8859-1)
In general, avoid the overloaded methods without an encoding parameter, as they are for current-computer-only data: non-portable. For completeness: the FileReader/FileWriter constructors without a charset should be avoided too; before Java 11 these classes offered no encoding parameter at all.
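One way to keep file I/O explicit, sketched here with java.nio.file.Files (the class name, helper names and temp-file prefix are made up for illustration):

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ExplicitCharsetFileExample {
    // Write text to a file using an explicitly chosen charset.
    static void writeText(Path path, String text, Charset charset) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(path, charset)) {
            writer.write(text);
        }
    }

    // Read the whole file back, decoding with an explicitly chosen charset.
    static String readText(Path path, Charset charset) throws IOException {
        return new String(Files.readAllBytes(path), charset);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("charset-demo", ".txt");
        try {
            String original = "\u30A2\u30C3\u30D7\u30EB"; // アップル
            writeText(file, original, StandardCharsets.UTF_8);
            System.out.println(readText(file, StandardCharsets.UTF_8).equals(original));
        } finally {
            Files.delete(file);
        }
    }
}
```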
Error
japaneseString is already wrong. So you have to read that right.
It could have been read erroneously as Windows-1252 (Windows Latin-1) and suffered when recoding to UTF-8. Evidently only some cases get messed up.
Maybe you had:
String japaneseString = new String(bytes);
instead of:
String japaneseString = new String(bytes, StandardCharsets.UTF_8);
At the end:
String name = japaneseString;
Show the code for reading japaneseString for further help.

Convert UTF-8 encoded string to human readable string

How to convert any UTF8 strings to readable strings.
Like: â‚¬ (in UTF8) is €
I tried using Charset but not working.
You are encoding a string to ISO-8859-15 with byte[] b = "Üü?öäABC".getBytes("ISO-8859-15"); then you are decoding it with UTF-8 System.out.println(new String(b, "UTF-8"));. You have to decode it the same way with ISO-8859-15.
This is not "UTF-8" but completely broken and unrepairable data. Strings do not have encodings. It makes no sense to say "UTF-8" string in this context. String is a string of abstract characters - it doesn't have any encodings except as an internal implementation detail that is not our concern and not related to your problem.
A string in java is already an unicode representation. When you call one of the getBytes methods on it you get an encoded representation (as bytes, thus binary values) in a specific encoding - ISO-8859-15 in your example. If you want to convert this byte array back to an unicode string you can do that with one of the string constructors accepting a byte array, like you did, but you must do so using the exact same encoding the byte array was originally generated with. Only then you can convert it back to an unicode string (which has no encoding, and doesn't need one).
Beware of the encoding-less methods, both the string constructor and the getBytes method, since they use the default encoding of the platform the code is running on, which might not be what you want to achieve.
You are trying to decode a byteArray encoded with "ISO-8859-15" with "UTF-8" format
byte[] b = "Üü?öäABC".getBytes("ISO-8859-15");
byte[] u = "Üü?öäABC".getBytes("UTF-8");
System.out.println(new String(b, "ISO-8859-15")); // will be ok
System.out.println(new String(b, "UTF-8")); // will look garbled
System.out.println(new String(u, "UTF-8")); // will be ok
I think the problem here is that you're assuming a java String is encoded with whatever you've specified in the constructor. It's not. It's in UTF-16.
So, "Üü?öäABC".getBytes("ISO-8859-15") is actually converting a UTF-16 string to ISO-8859-15, and then getting the byte representation of that.
If you want to get the human-readable format in your Eclipse console, just keep the String as it is and call System.out.println("Üü?öäABC"). System.out takes care of encoding the String for output, so no manual conversion is needed.

Encoding conversion in java

Is there any free java library which I can use to convert string in one encoding to other encoding, something like iconv? I'm using Java version 1.3.
You don't need a library beyond the standard one - just use Charset. (You can just use the String constructors and getBytes methods, but personally I don't like just working with the names of character encodings. Too much room for typos.)
EDIT: As pointed out in comments, you can still use Charset instances but have the ease of use of the String methods: new String(bytes, charset) and String.getBytes(charset).
See "URL Encoding (or: 'What are those "%20" codes in URLs?')".
CharsetDecoder should be what you are looking for, no ?
Many network protocols and files store their characters with a byte-oriented character set such as ISO-8859-1 (ISO-Latin-1).
However, Java's native character encoding is Unicode UTF16BE (Sixteen-bit UCS Transformation Format, big-endian byte order).
See Charset. That doesn't mean UTF16 is the default charset (i.e.: the default "mapping between sequences of sixteen-bit Unicode code units and sequences of bytes"):
Every instance of the Java virtual machine has a default charset, which may or may not be one of the standard charsets.
[US-ASCII, ISO-8859-1 a.k.a. ISO-LATIN-1, UTF-8, UTF-16BE, UTF-16LE, UTF-16]
The default charset is determined during virtual-machine startup and typically depends upon the locale and charset being used by the underlying operating system.
This example demonstrates how to convert ISO-8859-1 encoded bytes in a ByteBuffer to a string in a CharBuffer and vice versa.
// Create the encoder and decoder for ISO-8859-1
Charset charset = Charset.forName("ISO-8859-1");
CharsetDecoder decoder = charset.newDecoder();
CharsetEncoder encoder = charset.newEncoder();
try {
// Convert a string to ISO-LATIN-1 bytes in a ByteBuffer
// The new ByteBuffer is ready to be read.
ByteBuffer bbuf = encoder.encode(CharBuffer.wrap("a string"));
// Convert ISO-LATIN-1 bytes in a ByteBuffer to a CharBuffer and then to a string.
// The new CharBuffer is ready to be read.
CharBuffer cbuf = decoder.decode(bbuf);
String s = cbuf.toString();
} catch (CharacterCodingException e) {
// Thrown if the input is malformed or contains unmappable characters
}
I would just like to add that if the String is originally encoded using the wrong encoding it might be impossible to change it to another encoding without errors.
The question does not state that the conversion here is made from wrong encoding to correct encoding but I personally stumbled to this question just because of this situation so just a heads up for others as well.
This answer to another question explains why the conversion does not always yield correct results:
https://stackoverflow.com/a/2623793/4702806
It is a whole lot easier if you think of Unicode as a character set (which it actually is: it is very basically the numbered set of all known characters). You can encode it as UTF-8 (1-4 bytes per character depending) or maybe UTF-16 (2 bytes per character or 4 bytes using surrogate pairs).
Back in the mists of time Java used UCS-2 to encode the Unicode character set. This could only handle 2 bytes per character and is now obsolete. It was a fairly obvious hack to add surrogate pairs and move up to UTF-16.
A lot of people think they should have used UTF-8 in the first place. When Java was originally written, though, all assigned Unicode characters still fit in 16 bits; characters beyond U+FFFF were only assigned later (Unicode 3.1).
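The byte counts mentioned above can be checked with a short sketch (the class and helper names are made up; UTF_16BE is used so that no byte order mark is counted):

```java
import java.nio.charset.StandardCharsets;

public class UnicodeWidthExample {
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        // U+0041 (A), U+00E9 (é), U+20AC (€) and U+1F600 (an emoji outside the 16-bit range).
        String[] samples = {"A", "\u00E9", "\u20AC", "\uD83D\uDE00"};
        for (String s : samples) {
            System.out.println(
                    "code points=" + s.codePointCount(0, s.length())
                    + " chars=" + s.length()
                    + " utf8Bytes=" + utf8Bytes(s)
                    + " utf16Bytes=" + utf16Bytes(s));
        }
    }
}
```

The last sample is a single code point but two Java chars, which is the surrogate pair mechanism described above.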
