Java char to bytes and back is converted wrongly (UTF-8)

While programming I encountered some weird behavior with Strings that are converted to bytes and then back to Strings again. Some chars are converted wrongly, and therefore the hashCode of the String changes too; the length of the String remains the same.
The problem seems to occur with chars from 55296 to 57343 (U+D800 to U+DFFF); other chars work fine. Is it because they are surrogates?
String string = new String(new char[] { 56000 });
System.out.println((int)string.charAt(0));
System.out.println((int)new String(string.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8).charAt(0));
The console output is:
56000
63
What is going on here? Is this a java bug, or am I misunderstanding something?

That's because these values are not characters but surrogates: two of them form a surrogate pair that in turn represents one character. A lone high or low surrogate value is an invalid encoding, not a character.
Since it is an invalid encoding, it is replaced by a "?" character when you convert it to UTF-8.
You can read more about this at https://en.wikipedia.org/wiki/UTF-16
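To illustrate, here is a minimal sketch contrasting a lone surrogate with a valid surrogate pair (the code point U+1F600 is just an example of a character outside the BMP):
String lone = new String(new char[] { (char) 56000 }); // lone high surrogate, invalid on its own
String pair = new String(Character.toChars(0x1F600)); // valid surrogate pair encoding one character
String loneBack = new String(lone.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
String pairBack = new String(pair.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
System.out.println((int) loneBack.charAt(0)); // 63 - the '?' replacement
System.out.println(pair.equals(pairBack)); // true - the pair round-trips intact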

Related

Store data in Byte array in java

I am trying to convert a string like "password" to hex values and then store them inside a long array. The loop works fine until it reaches the value "6F" (the hex value for the 'o' char), at which point I get a java.lang.NumberFormatException.
String password = "password";
char[] array = password.toCharArray();
long[] data = new long[array.length]; // declaration missing in the original snippet
int index = 0;
for (char c : array) {
    String hex = Integer.toHexString((int) c);
    data[index] = Long.parseLong(hex); // throws NumberFormatException on "6f"
    index++;
}
How can I store the 6F value inside a byte array, as 6F is greater than 1 byte? Please help me with this.
Long.parseLong parses decimal numbers: it turns the string "10" into the number 10. If the input is hex, that is incorrect - the string "10" should then become the number 16. The fix is to use the Long.parseLong(String input, int radix) method. The radix you want is 16, though writing it as 0x10 may be more readable - it's the same thing to the compiler, purely a personal style choice. Thus, Long.parseLong(hex, 0x10) is what you want.
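For example, with the radix supplied the hex digits parse as intended:
System.out.println(Long.parseLong("6f", 16)); // 111, the code of 'o'
System.out.println(Long.parseLong("10", 16)); // 16, not 10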
Note that in practice char holds values from 0 to 65535, which do not fit in a byte. In effect, you must put a marker down that passwords must not contain any characters that aren't ASCII characters (so no umlauts, snowmen, emoji, funny quotes, etc).
If you fail to check this, Integer.toHexString((int) c) can produce 3 or 4 hex characters (something like 16F or worse) instead of 2, and for small values it may also produce a single character.
More generally, converting a char c to a hex string and then parsing that hex string back into a number is completely pointless: it turns 15 into "F" and then "F" back into 15. If you just want to shove a char into a byte, data[index++] = (byte) c; is all you need - that is the only line you need in your for loop.
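Putting that together, a minimal version of the loop (assuming, per the note above, that the password contains only ASCII characters) would be:
String password = "password";
byte[] data = new byte[password.length()];
int index = 0;
for (char c : password.toCharArray()) {
    data[index++] = (byte) c; // safe only for ASCII characters
}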
But, heed this:
This really isn't how you're supposed to do that!
What you're doing is converting character data to a byte array. This is not actually simple - there are only 256 possible byte values, and folks have invented far more characters than that. Literally hundreds of thousands of them.
Thus, to convert characters to bytes or vice versa, you must apply an encoding. Encodings have wildly varying properties. The most commonly used encoding, however, is UTF-8. It can represent every Unicode symbol, and has the convenient property that basic ASCII characters look exactly the same. However, it has the downside that any given character is smeared out into 1, 2, 3, or even 4 bytes, depending on the character. Fortunately, Java has plenty of tools for this, so you don't need to care. What you really want is this:
byte[] data = password.getBytes(StandardCharsets.UTF_8);
That's asking the string to turn itself into a byte array using UTF-8 encoding. That means "password" turns into the sequence 112 97 115 115 119 111 114 100, which is no doubt what you want, but you can also have as password, say, außgescheignet ☃, and that works too - it's turned into bytes, and you can get back to your snowman-enabled password:
String in = "außgescheignet ☃";
byte[] data = in.getBytes(StandardCharsets.UTF_8);
String andBackAgain = new String(data, StandardCharsets.UTF_8);
assert in.equals(andBackAgain); // true
If you stick this in a source file, make sure you save it as UTF-8 in whatever text editor you use, and that javac compiles it that way too (javac has an -encoding parameter to enforce this).
If you think this is going to cause issues for whatever you send these bytes to, and you want to restrict passwords to what someone with a rather USA-centric view would call 'normal' characters, use StandardCharsets.US_ASCII instead. Note, however, that password.getBytes(StandardCharsets.US_ASCII) will not error on non-ASCII input: per the getBytes(Charset) contract it silently replaces unmappable characters with '?'. To fail fast - and throwing an exception early in the process on a relevant line is exactly what you want, since we just posited that your infrastructure would not deal with such characters correctly - you need a CharsetEncoder.
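A minimal sketch of that fail-fast approach (a CharsetEncoder from newEncoder() reports errors by default, so unmappable input throws a CharacterCodingException; the password value is just an example):
// requires java.nio.ByteBuffer, java.nio.CharBuffer and java.nio.charset.*
String password = "außgescheignet ☃";
try {
    ByteBuffer bytes = StandardCharsets.US_ASCII.newEncoder()
            .encode(CharBuffer.wrap(password)); // throws on non-ASCII input
    System.out.println("encoded " + bytes.remaining() + " bytes");
} catch (CharacterCodingException e) {
    System.out.println("password contains non-ASCII characters");
}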

String to Byte Conversion and Back Again Not Returning Same Result (ASCII)

I'm having a few issues converting a string back to the appropriate value after it has been converted to bytes.
The initial string:
"0000000000Y Yã"
Where the 'ã' is just a character value.
The conversion code:
byte[] b = s.getBytes(StandardCharsets.US_ASCII);
However, when converting it back:
String str = new String(b, StandardCharsets.US_ASCII);
I receive:
"0000000000Y Y?"
Anyone know why this is?
Thanks.
ã is not an ASCII character, so how it is handled is determined by the charset implementation:
https://docs.oracle.com/javase/8/docs/api/java/lang/String.html#getBytes-java.nio.charset.Charset-
This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement byte array.
For this charset, the replacement comes out as '?'.
ã is not part of the US_ASCII character set.
The getBytes() method is documented with:
This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement byte array.
(see the documentation)
For US_ASCII, the default replacement byte array seems to be one byte representing the '?' character (ASCII code 0x3F). So this is what gets inserted into the byte array in place of your ã character.
When converting back to String, you get the character corresponding to the replacement byte, being the '?' character.
So, if you convert to bytes and want to get the identical characters back, be sure to use a character set that contains every character you intend to use. A safe choice is UTF-8.
If you need to obey some specific character encoding (e.g. because an external interface requires it), then Java's replacement strategy makes sense, but of course some characters will get lost.
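A small sketch of the difference, using the string from the question:
String s = "0000000000Y Yã";
String ascii = new String(s.getBytes(StandardCharsets.US_ASCII), StandardCharsets.US_ASCII);
String utf8 = new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
System.out.println(ascii); // 0000000000Y Y? - the ã is lost
System.out.println(utf8.equals(s)); // true - UTF-8 round-trips every character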
This is because ã is not an ASCII character. Check an ASCII table for valid ASCII characters.

Java string "hello" has 12 bytes when getBytes("UTF-16")?

I expected that, when a Java string is stored as "UTF-16", each character uses 2 bytes, so "hello" should consume 10 bytes, but this code:
String h = "hello";
System.out.println(new String(h.getBytes("UTF-16"), "UTF-16").length());
System.out.println(new String(h.getBytes("UTF-8"), "UTF-8").getBytes("UTF-16").length);
will print "5 12".
My question:
(1) I expected that the first println should get "10" as I mentioned. But why 5?
(2) For the second println, I am trying to getBytes for it first as "UTF-8" then as "UTF-16". I suppose it should also be 10. But actually it's 12.
I'm using a Mac and my region is Hong Kong. Could you help explain what's happening in the program, and how "5 12" actually comes out?
Thanks a lot!
(1) I expected that the first println should get "10" as I mentioned. But why 5?
You take a 5-character string and encode it as bytes using the UTF-16 encoding.
Then you create a new string by decoding those bytes (correctly) from UTF-16, which gives you a new string consisting of your original 5 characters again - and length() counts characters, not bytes.
(2) For the second println, I am trying to getBytes for it first as "UTF-8" then as "UTF-16". I suppose it should also be 10. But actually it's 12.
This part of the code:
new String(h.getBytes("UTF-8"), "UTF-8")
is actually a no-op. It is just a rather expensive way to copy a string. You encode the string to bytes using UTF-8 as the encoding scheme, and then you create a new string by decoding the UTF-8 encoded bytes.
So effectively, you are doing this:
"hello".getBytes("UTF-16").length
The reason for the extra 2 bytes is that UTF-16 encoding puts a BOM (byte order mark) as the first (2 byte) code unit.
For more information, read the Unicode FAQs on "UTF-8, UTF-16, UTF-32 & BOM".
I expected that the first println should get "10" as I mentioned. But why 5?
You are calling length() on the String, not on the byte[]. So this gives you the length of the String in chars (which matches the number of characters only as long as we stay in the Unicode Basic Multilingual Plane - it breaks down for characters that need two UTF-16 code units).
Once you have a String, it does not matter what encoding was used to create it. length is always given in terms of char units.
If you converted this into a byte[] using UTF-16, you might rightfully have expected 10 (for the five characters times two bytes each) -- that it actually ends up being 12 is due to a Byte Order Mark being included.
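You can see the BOM directly by dumping the encoded bytes; the leading FE FF is the byte order mark:
byte[] b = "hello".getBytes(StandardCharsets.UTF_16);
for (byte x : b) {
    System.out.printf("%02X ", x);
}
// prints: FE FF 00 68 00 65 00 6C 00 6C 00 6F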

How do I convert a single character code to a `char` given a character set?

I want to convert a decimal character code to ASCII, and this code returns unexpected results. Here is the code I am using:
public static void main(String[] args) {
    char ret = (char) 146;
    System.out.println(ret); // prints nothing visible
}
I expect to get the right single quote character (') as per http://www.ascii-code.com/
Has anyone come across this? Thanks.
So, a couple of things.
First of all the page you linked to says this about the code point range in question:
The extended ASCII codes (character code 128-255)
There are several different variations of the 8-bit ASCII table. The table below is according to ISO 8859-1, also called ISO Latin-1. Codes 128-159 contain the Microsoft® Windows Latin-1 extended characters.
This is incorrect, or at least misleadingly worded: ISO 8859-1 / Latin-1 does not define code point 146. So that's already asking for trouble. You can see this if you do the conversion through String:
String s = new String(new byte[] {(byte)146}, "iso-8859-1");
System.out.println(s);
Outputs the same "unexpected" result. It appears that what they are actually referring to is the Windows-1252 set (aka "Windows Latin-1", although that name is almost completely obsolete these days), which does define that code point as a right single quote (other charsets that provide this character do so at 0x92 as well), and we can verify this as such:
String s = new String(new byte[] {(byte)146}, "windows-1252");
System.out.println(s);
So the first mistake is that page is confusing.
But the big mistake is that you can't do what you're trying to do the way you are doing it. A char in Java is a UTF-16 code unit: a single char corresponds to a BMP code point, while the supplementary characters above 0xFFFF require a pair of chars (or a single int) to represent.
Unfortunately, Java doesn't really expose a lot of API for single-character conversions. Even Character doesn't have any readily available ways to convert from the charset of your choice to UTF-16.
So one option is to do it via String as hinted at in the examples above, e.g. express your code points as a raw byte[] array and convert from there:
String s = new String(new byte[] {(byte)146}, "windows-1252");
System.out.println(s);
char c = s.charAt(0);
System.out.println(c);
You could grab the char again via s.charAt(0). Note that you have to be mindful of your character set when doing this. Here we know that our byte sequence is valid for the specified encoding, and we know that the result is only one char long, so we can do this.
However, you have to watch out for things in the general case. For example, perhaps your byte sequence and character set yield a result that is in the UTF-16 supplementary character range. In that case s.charAt(0) would not be sufficient and s.codePointAt(0) stored in an int would be required instead.
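For instance, with the supplementary code point U+1F600 (just an example of a character outside the BMP):
String s = new String(Character.toChars(0x1F600));
System.out.println(s.length()); // 2 - two char code units
System.out.println((int) s.charAt(0)); // 55357 - only the high surrogate
System.out.println(s.codePointAt(0)); // 128512 - the full code point, 0x1F600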
As an alternative, with the same caveats, you could use Charset to decode, although it's just as clunky, e.g.:
Charset cs = Charset.forName("windows-1252");
CharBuffer cb = cs.decode(ByteBuffer.wrap(new byte[] {(byte)146}));
char c = cb.get(0);
System.out.println(c);
Note that I am not entirely sure how Charset#decode handles supplementary characters and can't really test right now (but anybody, feel free to chime in).
As an aside: in your case, 146 (0x92) cast directly to char corresponds to the UTF-16 code unit "PRIVATE USE TWO" (U+0092), and all bets are off for what you'll end up displaying there. This character is classified by Unicode as a control character, and falls in the range of characters reserved for ANSI terminal control (although AFAIK it isn't actually used, it's in that range regardless). I wouldn't be surprised if some browsers in some locales rendered it as a right single quote for compatibility, but terminals did something weird with it.
Also, FYI, the official Unicode code point for the right single quote is 0x2019. You can reliably store that in a char by using that value, e.g.:
System.out.println((char)0x2019);
You can also see this for yourself by looking at the value after the conversion from windows-1252:
String s = new String(new byte[] {(byte)146}, "windows-1252");
char c = s.charAt(0);
System.out.printf("0x%x\n", (int)c); // outputs 0x2019
Or, for completeness:
String s = new String(new byte[] {(byte)146}, "windows-1252");
int cp = s.codePointAt(0);
System.out.printf("0x%x\n", cp); // outputs 0x2019
The page you refer to mentions that values 160 to 255 correspond to the ISO-8859-1 (aka Latin-1) table; values in the range 128 to 159 come from the Windows-specific variant of Latin-1 (ISO-8859-1 leaves that range undefined, to be assigned by the operating system).
Java characters are based on UTF-16, which is itself based on the Unicode table. If you want to refer specifically to the right quote character, you can write it as '\u2019' in Java (see http://www.fileformat.info/info/unicode/char/2019/index.htm).

Why new String(bytes, enc).getBytes(enc) does not return the original byte array?

I made the following "simulation":
byte[] b = new byte[256];
for (int i = 0; i < 256; i++) {
    b[i] = (byte) (i - 128);
}
byte[] transformed = new String(b, "cp1251").getBytes("cp1251");
for (int i = 0; i < b.length; i++) {
    if (b[i] != transformed[i]) {
        System.out.println("Wrong : " + i);
    }
}
For cp1251 this outputs only one wrong byte - at position 25.
For KOI8-R - all fine.
For cp1252 - 4 or 5 differences.
What is the reason for this and how can this be overcome?
I know it is wrong to represent byte arrays as strings in whatever encoding, but it is a requirement of the protocol of a payment provider, so I don't have a choice.
Update: representing it in ISO-8859-1 works, and I'll use that for the byte[] part and cp1251 for the textual part, so the question remains only out of curiosity.
Some of the bytes are not supported in the target set - they are replaced with the ? character. When you convert back, ? is converted to the byte value 63 - which isn't what it was before.
What is the reason for this
The reason is that character encodings are not necessarily bijective, and there is no good reason to expect them to be. Not all bytes or byte sequences are legal in all encodings, and usually illegal sequences are decoded to some sort of placeholder character like '?' or U+FFFD, which of course does not produce the same bytes when re-encoded.
Additionally, some encodings may map several different legal byte sequences to the same string.
It appears that both cp1251 and cp1252 have byte values that do not correspond to defined characters; i.e. they are "unmappable".
The javadoc for String(byte[], String) says this:
The behavior of this constructor when the given bytes are not valid in the given charset is unspecified. The CharsetDecoder class should be used when more control over the decoding process is required.
Other constructors say this:
This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement string.
If you see this kind of thing happening in practice it indicates that either you are using the wrong character set, or you've been given some bad data. Either way, it is probably not a good idea to carry on as if there was no problem.
I've been trying to figure out if there is a way to get a CharsetDecoder to "preserve" unmappable characters, and I don't think it is possible unless you are willing to implement a custom decoder/encoder pair. But I've also concluded that it does not make sense to even try: it is (theoretically) wrong to map those unmappable characters to real Unicode code points, and if you do, how is your application going to handle them?
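If the goal is simply to detect the problem rather than silently corrupting data, a decoder configured to report errors is enough. A minimal sketch, assuming 0x98 (the one code point windows-1251 leaves undefined) as the offending byte; ISO-8859-1, by contrast, round-trips losslessly because every byte value maps to the Unicode code point of the same value:
// requires java.nio.ByteBuffer and java.nio.charset.*
CharsetDecoder decoder = Charset.forName("cp1251").newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT);
try {
    decoder.decode(ByteBuffer.wrap(new byte[] { (byte) 0x98 }));
} catch (CharacterCodingException e) {
    System.out.println("undecodable byte: " + e); // thrown instead of yielding U+FFFD
}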
Actually there is exactly one difference: the byte 0x98 (at index 24 in your array) is converted to a char of value 0xFFFD - the "Unicode replacement character", used for untranslatable bytes. When converted back, you get a question mark (value 63).
In CP1251, the code 0x98 is left undefined and cannot be part of a proper string, which is why Java deems it "untranslatable".
Historical reason: in the ancient character encodings (EBCDIC, ASCII) the first 32 codes have special 'control' meanings and may not map to readable characters - examples are backspace, bell, and carriage return. Newer character encoding standards usually inherit this, and they don't define readable characters for every one of those positions. Java characters are Unicode.
