Encoding issues - Java

I have a "windows1255" encoded String; is there any safe way I can convert it to a "UTF-8"
String and vice versa?
In general, is there a safe way (meaning data will not be damaged) to convert between
encodings in Java?
str.getBytes("UTF-8");
new String(str,"UTF-8");
If the original string is not encoded as "UTF-8", can the data be damaged?

You can't have a String object in Java properly encoded as anything other than UTF-16 - as that's the sole encoding for those objects defined by the spec. Of course you could do something untoward like put Windows-1252 values in a char[] and create a String from it, but things will go wrong pretty much immediately.
What you can have is byte[] encoded in various different ways, and you can convert them to and from String using constructors which take a Charset, and with getBytes as in your code.
So you can do conversions using a String as an intermediate. I don't know of any way in the JDK to do a direct conversion, but the intermediate is likely not too costly in practice.
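For illustration, a minimal sketch of that intermediate step (the sample bytes and class name are made up; it assumes the source bytes really are Windows-1255):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Transcode {
    public static void main(String[] args) {
        // Hypothetical input: bytes known to be Windows-1255 encoded Hebrew text
        byte[] windows1255Bytes = "שלום".getBytes(Charset.forName("windows-1255"));

        // Decode with the charset the bytes actually are in...
        String text = new String(windows1255Bytes, Charset.forName("windows-1255"));

        // ...then encode with the charset you want
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);

        System.out.println(windows1255Bytes.length); // 4 (one byte per Hebrew letter)
        System.out.println(utf8Bytes.length);        // 8 (two bytes per Hebrew letter)
    }
}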
About round-trip conversions - it is not generally true that you can convert between encodings without losing data. Only a few encodings can handle the full spectrum of Unicode characters (e.g. the UTF family, GB18030, etc.), while many legacy character sets encode only a small subset. You can't safely round-trip through those character sets without losing data, unless you are sure the input falls within the representable set.

String is meant to be a sequence of abstract characters; it does not have any encoding from the point of view of its users. Of course, it must have an internal encoding, but that's an implementation detail.
It makes no sense to encode a String as UTF-8 and then decode the result back as UTF-8. It will be a no-op, in that:
(new String(str.getBytes("UTF-8"), "UTF-8") ).equals(str) == true;
But there are cases where the String abstraction falls apart and the above will be a "lossy" conversion. Because of the internal
implementation details, a String can contain unpaired UTF-16 surrogates which cannot be represented in UTF-8 (or any encoding
for that matter, including the internal UTF-16 encoding*). So they will be lost in the encoding, and when you decode back, you get the original string without the invalid unpaired surrogates.
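A small sketch of that lossy round trip (the class name is made up; the '?' in the output is what the JDK's UTF-8 encoder substitutes for the unencodable surrogate):

import java.nio.charset.StandardCharsets;

public class UnpairedSurrogate {
    public static void main(String[] args) {
        // "\uD800" is a lone high surrogate - legal inside a String, but not a valid character
        String str = "a\uD800b";

        byte[] utf8 = str.getBytes(StandardCharsets.UTF_8);   // the surrogate cannot be encoded
        String roundTripped = new String(utf8, StandardCharsets.UTF_8);

        System.out.println(roundTripped.equals(str));          // false
        System.out.println(roundTripped);                      // a?b
    }
}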
The only thing I can take from your question is that you have a String result from interpreting binary data as Windows-1255, where it should have been interpreted in UTF-8.
To fix this, you would have to go to the source of this and use UTF-8 decoding explicitly.
If, however, you only have the String result of the misinterpretation, you can't really do much, as many byte sequences have no representation in Windows-1255 and would not have made it into the String.
If this wasn't the case, you could fully restore the original intended message by:
new String( str.getBytes("Windows-1255"), "UTF-8");
* It is actually wrong of Java to allow unpaired surrogates to exist in its Strings in the first place, since that is not valid UTF-16.


How to create UTF16 Strings [duplicate]

Is there a way to create a UTF-16 string from scratch, or from an actual UTF-8 string, that doesn't involve some weird "hack" like looping through each char and appending a 00 byte to make it a UTF-16 char?
Ideally I would like to be able to do something like this:
String s = new String("TestData".getBytes(), StandardCharsets.UTF_16);
But that doesn't work as the string literal is interpreted as UTF8.
In Java, a String instance doesn't have an encoding. It just is - it represents the characters as characters, and therefore there is no encoding.
Encoding just isn't a thing except in transition: When you 'transition' a bunch of characters into a bunch of bytes, or vice versa - that operation cannot be performed unless a charset is provided.
Take, for example, your snippet. It is broken. You write:
"TestData".getBytes().
This compiles. That is unfortunate; this is an API design error in Java; you should never use these methods (that is: methods that silently paper over the fact that a charset IS involved). This IS a transition from characters (a String) to bytes. If you read the javadoc on the getBytes() method, it'll tell you that the 'platform default encoding' will be used. This means it's a fine formula for writing code that passes all tests on your machine and will then fail at runtime.
There are valid reasons to want platform default encoding, but I -strongly- encourage you to never use getBytes() regardless. If you run into one of these rare scenarios, write "TestData".getBytes(Charset.defaultCharset()) so that your code makes explicit that a charset-using conversion is occurring here, and that you intended it to be the platform default.
So, going back to your question: there is no such thing as a UTF-16 string (if 'string' here is to be taken as meaning java.lang.String, and not a slang English term meaning 'sequence of bytes').
There IS such a thing as a sequence of bytes, representing unicode characters encoded in UTF-16 format. In other words, 'a UTF-16 string', in java, would look like byte[]. Not String.
Thus, all you really need is:
byte[] utf16 = "TestData".getBytes(StandardCharsets.UTF_16);
You write:
But that doesn't work as the string literal is interpreted as UTF8.
That's a property of the code then, not of the string. If you have some code you can't change that will turn a string into bytes using the UTF8 charset, and you don't want that to happen, then find the source and fix it. There is no other solution.
In particular, trying to hack things such that you have a string with gobbledygook that has the crazy property that if you take this gobbledygook, turn it into bytes using the UTF-8 charset, and then take those bytes and turn them back into a string using the UTF-16 charset, you get what you actually wanted - cannot work. This is theoretically possible (but a truly bad idea) for charsets in which every sequence of bytes is decodable, such as ISO-8859-1, but UTF-8 does not have that property. There are sequences of bytes that are simply an error in UTF-8 and will cause an exception. On the flip side, it is not possible to craft a string such that encoding it with UTF-8 into a byte array produces an arbitrary desired sequence of bytes.
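Whether a malformed UTF-8 sequence surfaces as an exception or as replacement characters depends on which API you use; a short sketch (the byte values and class name are chosen only for illustration):

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

public class StrictUtf8Decode {
    public static void main(String[] args) {
        byte[] notValidUtf8 = { (byte) 0xC0, (byte) 0x80 }; // overlong encoding, illegal in UTF-8

        // new String(..., UTF_8) never throws - it substitutes U+FFFD for malformed input
        System.out.println(new String(notValidUtf8, StandardCharsets.UTF_8));

        // A CharsetDecoder defaults to reporting errors, so malformed input throws instead
        CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder();
        try {
            strict.decode(ByteBuffer.wrap(notValidUtf8));
        } catch (CharacterCodingException e) {
            System.out.println("Malformed input: " + e);
        }
    }
}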

Java- How to verify if Thai characters are encoded correctly from UTF-8 to TIS620

I get an input string in UTF-8. I applied TIS620 encoding and created a new String from it; now how do I retain the bytes? UTF-8 represents a Thai char in 3 bytes, whereas TIS620 uses 1 byte. I have a requirement where the backend system stores characters in a string as 1 byte only, so the default UTF-8 breaks it.
How to convert String character encoding from UTF-8 to TIS620?
How to retain the byte size while passing it to backend system?
If the string is reassigned via new String, is the character encoding retained, or does it get converted back to UTF-16 (Java's default)?
Is it possible in Java? Any lib/utility which can be integrated?
I've tried the code below and can check that after the TIS620 conversion the byte count matches the character count, i.e. 1 byte/char. But if encodedString gets a new String assignment, will it lose the TIS620 format?
(Convert a String with encoding UTF-8 to TIS620 (Thai encoding) in Java. What are the ways to do it, and is there any data loss?)
public String encode() {
    try {
        String input = "ใบใบใบใบ";
        byte[] encodedBytes = input.getBytes("TIS620");
        String encodedString = new String(encodedBytes, "TIS620");
        return encodedString;
    } catch (UnsupportedEncodingException e) {
        // Encoding failed
        return null;
    }
}
The expected result is: if I convert 5 Thai characters from UTF-8 format to TIS620, the byte count should go from 15 (UTF-8) to 5 (TIS620) and stay that way?
A String in Java is always encoded in UTF-16, no matter how it was constructed. Or put differently: as soon as you have a String object, you should not care about which encoding it has. The encoding only comes back into the picture once you want to go back towards a byte[] (or OutputStream or the like).
This is correct and almost certainly exactly what you want to do. You should not try to work around that fact.
If you need to write the string to disk or send it to some other system in some specific encoding then you can get that encoded data from the String by using getBytes() as you did in your sample code.
In other words:
A String object in Java can not "have TIS620" encoding. A byte[] can contain TIS620 encoded data and you create that from a String using .getBytes("TIS620").
If you pass the encoded byte[] to the other system, it will have the correct byte size, simply because it was created with the correct encoding.
String always uses UTF-16. Creating a String with the content "ใบใบใบใบ" from UTF-8 data and from TIS620 data will produce exactly identical String objects, there's no way to know what encoding was used to create them.
InputStreamReader, OutputStreamWriter and comparable classes can also be passed an encoding to decode/encode with that encoding respectively. Other than that, no special handling is required.
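For example, a minimal sketch of writing the Thai text with an explicit charset, assuming the TIS-620 charset is available in your runtime (it ships with the full JDK's extended charsets):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;

public class Tis620Output {
    public static void main(String[] args) throws IOException {
        // Assumption: "TIS-620" is supported by this JDK
        Charset tis620 = Charset.forName("TIS-620");

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, tis620)) {
            writer.write("ใบใบใบใบ");
        }
        System.out.println(out.size()); // 8 - one byte per Thai character
    }
}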
Java's text datatypes (String, char and Character; the same goes for .NET, JavaScript, VB4/5/6/A/Script, …) always use the UTF-16 character encoding of the Unicode character set.
Many interfaces, bindings, drivers, data adaptors, and whatnot understand that the text datatype is UTF-16 and know which character encoding the target needs, so they do the conversion themselves. As long as you are using Java's text datatypes, that is handled for you; if you need text encoded as UTF-8 or TIS620 yourself, you would typically use a byte array.
That is it for straightforward text as text.
Now, if you have an array of arbitrary bytes and you want to write it into a text context, you can use Base64. Such a function takes a byte array and returns a String (UTF-16 encoded, of course). And since the characters it uses are supported by every character set, there is no loss of data when converting that String to whichever character encoding is needed.
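A minimal sketch of that Base64 round trip using java.util.Base64 (the sample bytes and class name are arbitrary):

import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class Base64RoundTrip {
    public static void main(String[] args) {
        byte[] arbitraryBytes = "ใบ".getBytes(StandardCharsets.UTF_8);

        // bytes -> text that survives any character encoding
        String base64 = Base64.getEncoder().encodeToString(arbitraryBytes);
        System.out.println(base64);

        // text -> the original bytes, with no loss
        byte[] restored = Base64.getDecoder().decode(base64);
        System.out.println(new String(restored, StandardCharsets.UTF_8)); // ใบ
    }
}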
People do like dealing with text datatypes so the above scheme is great. But for some reason, instead of Base64, some people use what I call Base256. They have an array of bytes (very often created from encoding text with a character encoding) and they apply an encoding function to convert the bytes to text, choosing to encode by decoding with a character encoding. You need to identify if that's what you are dealing with and if so, which character encoding was co-opted as a Base256 encoding. (Often the character encoding used for this is ISO 8859-1.)

Converting String from One Charset to Another

I am working on converting a string from one charset to another and have read many examples on it. Finally I found the code below, which looks good to me, and as a newbie to charset encoding I want to know whether it is the right way to do it.
public static byte[] transcodeField(byte[] source, Charset from, Charset to) {
    return new String(source, from).getBytes(to);
}
To convert String from ASCII to EBCDIC, I have to do:
System.out.println(new String(transcodeField(ebytes,
Charset.forName("US-ASCII"), Charset.forName("Cp1047"))));
And to convert from EBCDIC to ASCII, I have to do:
System.out.println(new String(transcodeField(ebytes,
Charset.forName("Cp1047"), Charset.forName("US-ASCII"))));
The code you found (transcodeField) doesn't convert a String from one encoding to another, because a String doesn't have an encoding¹. It converts bytes from one encoding to another. The method is only useful if your use case satisfies 2 conditions:
Your input data is bytes in one encoding
Your output data needs to be bytes in another encoding
In that case, it's straightforward:
byte[] out = transcodeField(inbytes, Charset.forName(inEnc), Charset.forName(outEnc));
If the input data contains characters that can't be represented in the output encoding (such as converting complex UTF8 to ASCII) those characters will be replaced with the ? replacement symbol, and the data will be corrupted.
However a lot of people ask "How do I convert a String from one encoding to another", to which a lot of people answer with the following snippet:
String s = new String(source.getBytes(inputEncoding), outputEncoding);
This is complete bull****. The getBytes(String encoding) method returns a byte array with the characters encoded in the specified encoding (if possible, again invalid characters are converted to ?). The String constructor with the 2nd parameter creates a new String from a byte array, where the bytes are in the specified encoding. Now since you just used source.getBytes(inputEncoding) to get those bytes, they're not encoded in outputEncoding (except if the encodings use the same values, which is common for "normal" characters like abcd, but differs with more complex like accented characters éêäöñ).
So what does this mean? It means that when you have a Java String, everything is great. Strings are unicode, meaning that all of your characters are safe. The problem comes when you need to convert that String to bytes, meaning that you need to decide on an encoding. Choosing a unicode compatible encoding such as UTF8, UTF16 etc. is great. It means your characters will still be safe even if your String contained all sorts of weird characters. If you choose a different encoding (with US-ASCII being the least supportive) your String must contain only the characters supported by the encoding, or it will result in corrupted bytes.
Now finally some examples of good and bad usage.
String myString = "Feng shui in chinese is 風水";
byte[] bytes1 = myString.getBytes("UTF-8"); // Bytes correct
byte[] bytes2 = myString.getBytes("US-ASCII"); // Last 2 characters are now corrupted (converted to question marks)
String nordic = "Här är några merkkejä";
byte[] bytes3 = nordic.getBytes("UTF-8"); // Bytes correct, "weird" chars take 2 bytes each
byte[] bytes4 = nordic.getBytes("ISO-8859-1"); // Bytes correct, "weird" chars take 1 byte each
String broken = new String(nordic.getBytes("UTF-8"), "ISO-8859-1"); // Contains now "HÃ¤r Ã¤r nÃ¥gra merkkejÃ¤"
The last example demonstrates that even though both of the encodings support the nordic characters, they use different bytes to represent them and using the wrong encoding when decoding results in Mojibake. Therefore there's no such thing as "converting a String from one encoding to another", and you should never use the broken example.
Also note that you should always specify the encoding used (with both getBytes() and new String()), because you can't trust that the default encoding is always the one you want.
As a last issue, Charset and Encoding aren't the same thing, but they're very much related.
¹ Technically the way a String is stored internally in the JVM is in UTF-16 encoding up to Java 8, and variable encoding from Java 9 onwards, but the developer doesn't need to care about that.
NOTE
It's possible to have a corrupted String and be able to uncorrupt it by fiddling with the encoding, which may be where this "convert String to other encoding" misunderstanding originates from.
// Input comes from network/file/other place and we have misconfigured the encoding
String input = "HÃ¤r Ã¤r nÃ¥gra merkkejÃ¤"; // UTF-8 bytes of "Här är några merkkejä", interpreted wrongly as ISO-8859-1
byte[] bytes = input.getBytes("ISO-8859-1"); // Get each char as single byte
String asUtf8 = new String(bytes, "UTF-8"); // Recreate String as UTF-8
If no characters were corrupted in input, the string would now be "fixed". However the proper approach is to use the correct encoding when reading input, not fix it afterwards. Especially if there's a chance of it becoming corrupted.
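A small sketch of that "use the correct encoding when reading" approach, assuming the incoming stream is known to be UTF-8 (the helper and class names are made up):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class ReadWithCharset {
    // Decodes the incoming bytes explicitly as UTF-8 instead of relying on the platform default
    static String readAll(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }
}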

Will String.getBytes("UTF-16") return the same result on all platforms?

I need to create a hash from a String containing a user's password. To create the hash, I use a byte array which I get by calling String.getBytes(). But when I call this method with a specified encoding (such as UTF-8) on a platform where this is not the default encoding, the non-ASCII characters get replaced by a default character (if I understand the behaviour of getBytes() correctly), and therefore on such a platform I will get a different byte array, and eventually a different hash.
Since Strings are internally stored in UTF-16, will calling String.getBytes("UTF-16") guarantee me that I get the same byte array on every platform, regardless of its default encoding?
Yes. Not only is it guaranteed to be UTF-16, but the byte order is defined too:
When decoding, the UTF-16 charset interprets the byte-order mark at the beginning of the input stream to indicate the byte-order of the stream but defaults to big-endian if there is no byte-order mark; when encoding, it uses big-endian byte order and writes a big-endian byte-order mark.
(Note that this means String.getBytes("UTF-16") includes a big-endian BOM at the start of the returned array; if you don't want one, use UTF-16BE or UTF-16LE instead.)
So long as you have the same string content - i.e. the same sequence of char values - then you'll get the same bytes on every implementation of Java, barring bugs. (Any such bug would be pretty surprising, given that UTF-16 is probably the simplest encoding to implement in Java...)
The fact that UTF-16 is the native representation for char (and usually for String) is only relevant in terms of ease of implementation, however. For example, I'd also expect String.getBytes("UTF-8") to give the same results on every platform.
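To see this concretely, a small sketch printing the bytes (assuming a compliant JVM, the output should be identical everywhere; the class name is made up):

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf16Bytes {
    public static void main(String[] args) {
        byte[] utf16 = "Aä".getBytes(StandardCharsets.UTF_16);
        // BOM (FE FF) followed by big-endian code units 00 41 and 00 E4 - on every platform
        System.out.println(Arrays.toString(utf16));  // [-2, -1, 0, 65, 0, -28]

        byte[] noBom = "Aä".getBytes(StandardCharsets.UTF_16BE);
        System.out.println(Arrays.toString(noBom));  // [0, 65, 0, -28]
    }
}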
It is true, Java uses Unicode internally so it may combine any script/language. String and char use UTF-16, but .class files store their String constants in UTF-8. In general it is irrelevant what String does internally, as there is always a conversion to bytes that specifies the encoding the bytes have to be in.
If this encoding of the bytes cannot represent some of the Unicode characters, a placeholder character or question mark is substituted. Also, fonts might not have glyphs for all Unicode characters; 35 MB for a full Unicode font is a normal size. You might then see a square with 2x2 hex digits for missing code points, or on Linux another font might substitute the char.
Hence UTF-8 is a perfectly fine choice.
String s = ...;
if (!s.startsWith("\uFEFF")) { // Add a Unicode BOM
    s = "\uFEFF" + s;
}
byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
Both UTF-16 (in both byte orders) and UTF-8 always are present in the JRE, whereas some Charsets are not. Hence you can use a constant from StandardCharsets not needing to handle any UnsupportedEncodingException.
Above I added a BOM especially for Windows Notepad, so that it recognizes UTF-8. It certainly is not good practice, but it is a small help here.
There is no real disadvantage to UTF-16LE or UTF-16BE. I think UTF-8 is a bit more universally used, and UTF-16 also cannot store all Unicode code points in a single 16-bit unit. Text in Asian scripts would be more compact in UTF-16, but HTML pages are already more compact in UTF-8 because of the HTML tags and other Latin-script content.
For Windows UTF-16LE might be more native.
Problems with placeholder characters on non-Unicode platforms, especially Windows, might still happen.
I just found this:
https://github.com/facebook/conceal/issues/138
which seems to answer your question negatively.
As per Jon Skeet's answer: the specification is clear. But I guess Android/Mac implementations of Dalvik/JVM don't agree.

How to use UTF-16 in URL encoding?

Currently I am using utf-8 for URL encoding. I want to convert it to UTF-16.
How can I achieve this?
When encoding Unicode characters in URLs, it's necessary to encode them in such a fashion that all URL parsers and consumers can understand your URLs.
To that end; when the URL was expanded by RFCs in the wake of the development of Unicode and related standards and tools, it was decided that the encoding to employ for encoding characters (using percent escapes) was to be UTF-8, as this would mean that established ASCII escapes would Just Work™.
Consequently, even if you could generate URLs with UTF-16-based percent escapes, no other program would be able to understand them, making them useless. In fact, by matter of definition, they wouldn't even be URLs.
There's also the question of why on earth you would want to use UTF-16 for anything, it being silly and all.
Remember: Never Don't Use UTF-8! (N'DUUH!)
URL escapes, as in %nn hex values, encode bytes - 8-bit bytes. If for some very nonstandard reason you want to encode the bytes of UTF-16 instead of UTF-8, you must first pick a byte order (BE or LE). Then you have to write code in your program to take the two bytes of each 16-bit UTF-16 code unit and represent each one as %nn in hex.
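A hedged illustration of the difference (the URLEncoder.encode overload taking a Charset requires Java 10 or later; the UTF-16 loop is a hand-rolled, nonstandard escaping shown only to make the point, and the class name is made up):

import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class PercentEncoding {
    public static void main(String[] args) {
        // Standard and interoperable: percent-escapes of the UTF-8 bytes
        System.out.println(URLEncoder.encode("ä", StandardCharsets.UTF_8)); // %C3%A4

        // Nonstandard: hand-rolled escapes of the UTF-16BE bytes - other software won't understand this
        StringBuilder sb = new StringBuilder();
        for (byte b : "ä".getBytes(StandardCharsets.UTF_16BE)) {
            sb.append(String.format("%%%02X", b & 0xFF));
        }
        System.out.println(sb); // %00%E4
    }
}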
