Why is Java String.length inconsistent across platforms with unicode characters? - java

According to the Java documentation for String.length:
public int length()
Returns the length of this string.
The length is equal to the number of Unicode code units in the string.
Specified by:
length in interface CharSequence
Returns:
the length of the sequence
of characters represented by this object.
But then I don't understand why the following program, HelloUnicode.java, produces different results on different platforms. According to my understanding, the number of Unicode code units should be the same, since Java supposedly always represents strings in UTF-16:
public class HelloUnicode {
    public static void main(String[] args) {
        String myString = "I have a 🙂 in my string";
        System.out.println("String: " + myString);
        System.out.println("Bytes: " + bytesToHex(myString.getBytes()));
        System.out.println("String Length: " + myString.length());
        System.out.println("Byte Length: " + myString.getBytes().length);
        System.out.println("Substring 9 - 13: " + myString.substring(9, 13));
        System.out.println("Substring Bytes: " + bytesToHex(myString.substring(9, 13).getBytes()));
    }

    // Code from https://stackoverflow.com/a/9855338/4019986
    private final static char[] hexArray = "0123456789ABCDEF".toCharArray();

    public static String bytesToHex(byte[] bytes) {
        char[] hexChars = new char[bytes.length * 2];
        for (int j = 0; j < bytes.length; j++) {
            int v = bytes[j] & 0xFF;
            hexChars[j * 2] = hexArray[v >>> 4];
            hexChars[j * 2 + 1] = hexArray[v & 0x0F];
        }
        return new String(hexChars);
    }
}
The output of this program on my Windows box is:
String: I have a 🙂 in my string
Bytes: 492068617665206120F09F998220696E206D7920737472696E67
String Length: 26
Byte Length: 26
Substring 9 - 13: 🙂
Substring Bytes: F09F9982
The output on my CentOS 7 machine is:
String: I have a 🙂 in my string
Bytes: 492068617665206120F09F998220696E206D7920737472696E67
String Length: 24
Byte Length: 26
Substring 9 - 13: 🙂 i
Substring Bytes: F09F99822069
I ran both with Java 1.8. Same byte length, different String length. Why?
UPDATE
By replacing the "🙂" in the string with "\uD83D\uDE42", I get the following results:
Windows:
String: I have a ? in my string
Bytes: 4920686176652061203F20696E206D7920737472696E67
String Length: 24
Byte Length: 23
Substring 9 - 13: ? i
Substring Bytes: 3F2069
CentOS:
String: I have a 🙂 in my string
Bytes: 492068617665206120F09F998220696E206D7920737472696E67
String Length: 24
Byte Length: 26
Substring 9 - 13: 🙂 i
Substring Bytes: F09F99822069
Why "\uD83D\uDE42" ends up being encoded as 0x3F on the Windows machine is beyond me...
Java Versions:
Windows:
java version "1.8.0_211"
Java(TM) SE Runtime Environment (build 1.8.0_211-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.211-b12, mixed mode)
CentOS:
openjdk version "1.8.0_201"
OpenJDK Runtime Environment (build 1.8.0_201-b09)
OpenJDK 64-Bit Server VM (build 25.201-b09, mixed mode)
Update 2
Using .getBytes("utf-8"), with the "🙂" embedded in the string literal, here are the outputs.
Windows:
String: I have a 🙂 in my string
Bytes: 492068617665206120C3B0C5B8E284A2E2809A20696E206D7920737472696E67
String Length: 26
Byte Length: 32
Substring 9 - 13: 🙂
Substring Bytes: C3B0C5B8E284A2E2809A
CentOS:
String: I have a 🙂 in my string
Bytes: 492068617665206120F09F998220696E206D7920737472696E67
String Length: 24
Byte Length: 26
Substring 9 - 13: 🙂 i
Substring Bytes: F09F99822069
So yes it appears to be a difference in system encoding. But then that means string literals are encoded differently on different platforms? That sounds like it could be problematic in certain situations.
Also... where is the byte sequence C3B0C5B8E284A2E2809A coming from to represent the smiley in Windows? That doesn't make sense to me.
For completeness, using .getBytes("utf-16"), with the "🙂" embedded in the string literal, here are the outputs.
Windows:
String: I have a 🙂 in my string
Bytes: FEFF00490020006800610076006500200061002000F001782122201A00200069006E0020006D007900200073007400720069006E0067
String Length: 26
Byte Length: 54
Substring 9 - 13: 🙂
Substring Bytes: FEFF00F001782122201A
CentOS:
String: I have a 🙂 in my string
Bytes: FEFF004900200068006100760065002000610020D83DDE4200200069006E0020006D007900200073007400720069006E0067
String Length: 24
Byte Length: 50
Substring 9 - 13: 🙂 i
Substring Bytes: FEFFD83DDE4200200069

You have to be careful about specifying the encodings:
when you compile the Java file, the compiler uses some encoding to read the source file. My guess is that this already broke your original String literal at compile time. This can be fixed by using the escape sequence.
after you use the escape sequence, the String lengths are the same. The bytes inside the String are also the same, but what you are printing out does not show that.
the bytes printed are different because you called getBytes(), and that again uses the environment or platform-specific encoding. So it was also broken (replacing unencodable smilies with question marks). You need to call getBytes("UTF-8") to be platform-independent, as in the sketch below.
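A minimal sketch of that platform-independent combination, reusing the string from the question (not from the original answer):
import java.nio.charset.StandardCharsets;

// escape sequence keeps the literal stable; an explicit charset keeps the bytes stable
String s = "I have a \uD83D\uDE42 in my string";
byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);  // F0 9F 99 82 for the smiley on every platform
System.out.println(s.length());                    // 24 everywhere
System.out.println(utf8.length);                   // 26 everywhere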
So to answer the specific questions posed:
Same byte length, different String length. Why?
Because the string literal is being encoded by the Java compiler, and the Java compiler often uses a different default encoding on different systems. This may result in a different number of char units per intended Unicode character, which results in a different string length. Passing the same -encoding option to javac on every platform (for example, -encoding UTF-8) will make the literals encode consistently.
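As a quick illustration (with the surrogate-pair escape, so the literal itself is unambiguous), length() counts UTF-16 code units while codePointCount() counts Unicode characters:
String s = "I have a \uD83D\uDE42 in my string";
System.out.println(s.length());                       // 24: the smiley occupies two code units
System.out.println(s.codePointCount(0, s.length()));  // 23: one code point per character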
Why "\uD83D\uDE42" ends up being encoded as 0x3F on the Windows machine is beyond me...
It's not encoded as 0x3F in the string. 0x3F is the question mark. Java substitutes it when it is asked to output, via System.out.println or getBytes(), a character that the target encoding cannot represent; that is what happened when you put the escaped UTF-16 surrogate pair in the string and then printed it to the console and called getBytes() with an encoding that has no smiley.
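A small sketch of that substitution, assuming a Windows-1252 default charset like the Windows box above:
import java.nio.charset.Charset;

String s = "I have a \uD83D\uDE42 in my string";
byte[] cp1252 = s.getBytes(Charset.forName("windows-1252"));  // the smiley is unmappable
System.out.println(cp1252.length);                            // 23, matching the Windows output above
System.out.printf("%02X%n", cp1252[9]);                       // 3F, the substituted question mark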
But then that means string literals are encoded differently on different platforms?
By default, yes.
Also... where is the byte sequence C3B0C5B8E284A2E2809A coming from to represent the smiley in Windows?
This is quite convoluted. The "🙂" character (Unicode code point U+1F642) is stored in the Java source file with UTF-8 encoding using the byte sequence F0 9F 99 82. The Java compiler then reads the source file using the platform default encoding, Cp1252 (Windows-1252), so it treats these UTF-8 bytes as though they were Cp1252 characters, making a 4-character string by translating each byte from Cp1252 to Unicode, resulting in U+00F0 U+0178 U+2122 U+201A. The getBytes("utf-8") call then converts this 4-character string into bytes by encoding them as utf-8. Since every character of the string is higher than hex 7F, each character is converted into 2 or more UTF-8 bytes; hence the resulting string being this long. The value of this string is not significant; it's just the result of using an incorrect encoding.
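That chain can be reproduced directly; here is a rough sketch that starts from the raw UTF-8 bytes of U+1F642:
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// the four UTF-8 bytes of U+1F642 as they sit in the source file
byte[] smileyUtf8 = { (byte) 0xF0, (byte) 0x9F, (byte) 0x99, (byte) 0x82 };
// a Cp1252 compiler turns them into four characters instead of one
String misread = new String(smileyUtf8, Charset.forName("windows-1252"));
System.out.println(misread.length());  // 4
StringBuilder hex = new StringBuilder();
for (byte b : misread.getBytes(StandardCharsets.UTF_8)) {
    hex.append(String.format("%02X", b));
}
System.out.println(hex);               // C3B0C5B8E284A2E2809A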

You didn't take into account that getBytes() returns the bytes in the platform's default encoding, which is different on Windows and CentOS.
See also How to Find the Default Charset/Encoding in Java? and the API documentation on String.getBytes().

Related

Java - decode base64 - Illegal base64 character 1

I have the following data in a file:
I want to decode the UserData field. Reading it as the string comment, I'm doing the following:
String[] split = comment.split("=");
if (split[0].equals("UserData")) {
    System.out.println(split[1]);
    byte[] callidArray = Arrays.copyOf(java.util.Base64.getDecoder().decode(split[1]), 9);
    System.out.println("UserData:" + Hex.encodeHexString(callidArray).toString());
}
But I'm getting the following exception:
java.lang.IllegalArgumentException: Illegal base64 character 1
What could be the reason?
The image suggests that the string you are trying to decode contains characters like SOH and BEL. These are ASCII control characters, and will not ever appear in a Base64 encoded string.
(Base64 typically consists of letters, digits, and +, / and =. There are some variant formats, but control characters are never included.)
This is confirmed by the exception message:
java.lang.IllegalArgumentException: Illegal base64 character 1
The SOH character has ASCII code 1.
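For reference, a tiny made-up reproduction showing how a control byte triggers exactly this exception:
import java.util.Base64;

String notBase64 = "AB\u0001CD";        // contains SOH (0x01); hypothetical input
Base64.getDecoder().decode(notBase64);  // IllegalArgumentException: Illegal base64 character 1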
Conclusions:
You cannot decode that string as if it was Base64. It won't work.
It looks like the string is not "encoded" at all ... in the normal sense of what "encoding" means in Java.
We can't advise you on what you should do with it without a clear explanation of:
where the (binary) data comes from,
what you expected it to contain, and
how you read the data and turned it into a Java String object: show us the code that did that!
The UserData field in the picture in the question actually contains the raw bytes themselves, not Base64 text.
So I don't need to decode Base64. I just need to copy the string into a byte array and print the hexadecimal representation of that byte array.
String[] split = comment.split("=");
if (split[0].equals("UserData")) {
    System.out.println(split[1]);
    byte[] callidArray = Arrays.copyOf(split[1].getBytes(), 9);
    System.out.println("UserData:" + Hex.encodeHexString(callidArray).toString());
}
Output:
UserData:010a20077100000000

Different codepoints for same character in MacOS and Windows

I have a small piece of code in which I am checking the code point for the character Ü.
Locale lc = Locale.getDefault();
System.out.println(lc.toString());
System.out.println(Charset.defaultCharset());
System.out.println(System.getProperty("file.encoding"));
String inUnicode = "\u00dc";
String glyph = "Ü";
System.out.println("inUnicode " + inUnicode + " code point " + inUnicode.codePointAt(0));
System.out.println("glyph " + glyph + " code point " + glyph.codePointAt(0));
I am getting different values for the code point when I run this code on MacOS and Windows 10; see the output below.
Output on MacOS
en_US
UTF-8
UTF-8
inUnicode Ü code point 220
glyph Ü code point 220
Output on Windows
en_US
windows-1252
Cp1252
inUnicode Ü code point 220
glyph ?? code point 195
I checked the code page for windows-1252 at https://en.wikipedia.org/wiki/Windows-1252#Character_set; there the code point for Ü is 220.
For String glyph = "Ü"; why do I get the code point as 195 on Windows? As per my understanding the glyph should have been rendered properly and the code point should have been 220, since it is defined in Windows-1252.
If I replace String glyph = "Ü"; with String glyph = new String("Ü".getBytes(), Charset.forName("UTF-8")); then the glyph is rendered correctly and the code point value is 220.
Is this the correct and efficient way to standardize behavior of String on any OS irrespective of locale and charset?
195 is 0xC3 in hex.
In UTF-8, Ü is encoded as bytes 0xC3 0x9C.
System.getProperty("file.encoding") says the default file encoding on Windows is not UTF-8, but clearly your Java file is actually encoded in UTF-8. The fact that println() is outputting glyph ?? (note 2 ?, meaning 2 chars are present), and that you are able to decode the raw string bytes using the UTF-8 Charset, proves this.
glyph should have a single char whose value is 0x00DC, not 2 chars whose values are 0x00C3 0x009C. codePointAt(0) is returning 0x00C3 (195) on Windows because your Java file is encoded in UTF-8 but is being loaded as if it were encoded in Windows-1252 instead, so the 2 bytes 0xC3 0x9C get decoded as characters 0x00C3 0x009C instead of as character 0x00DC.
You need to specify the actual file encoding when running Java, eg:
java -Dfile.encoding=UTF-8 ...
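To see the mis-decoding itself, here is a small sketch (not from the answer) that feeds the UTF-8 bytes of Ü through a Windows-1252 decoder:
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// the UTF-8 bytes of Ü, read back as if they were Windows-1252
byte[] utf8 = "\u00DC".getBytes(StandardCharsets.UTF_8);       // 0xC3 0x9C
String misread = new String(utf8, Charset.forName("windows-1252"));
System.out.println(misread.length());                          // 2 characters instead of 1
System.out.println(misread.codePointAt(0));                    // 195, as observed on Windows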

ASC Visual Basic for Java

I need a function in Java that does the same as the ASC function in Visual Basic. I've been looking for it on the internet, but I can't find a solution.
The String whose codes I need to know was created in Visual Basic, using ISO 8859-1 / Microsoft Windows Latin-1 characters. The ASC function in Visual Basic knows those codes, but in Java I can't find a function that does the same thing.
In Java I have this:
String myString = "ÅÛ–ßÕÅÝ•ÞÃ";
int first = (int) myString.charAt(0);   // "Å" - VB and Java return: 197
int second = (int) myString.charAt(1);  // "Û" - VB and Java return: 219
int third = (int) myString.charAt(2);   // "–" - VB returns: 150 and Java returns: 8211
For the first two characters I have no problem, but the third character is not an ASCII code.
How can I get same codes in VB and Java?
First of all, note that ISO 8859-1 != Windows Latin-1. (See http://en.wikipedia.org/wiki/Windows-1252)
The problem is that Java encodes characters as UTF16, so casting to int will generally result in the Unicode value of the char.
To get the Latin-1 encoding of a char, first convert it to a Latin-1 encoded byte array:
import java.nio.charset.Charset;

public class Encoding {
    public static void main(String[] args) {
        // Cp1252 is Windows codepage 1252
        byte[] bytes = "ÅÛ–ßÕÅÝ•ÞÃ".getBytes(Charset.forName("Cp1252"));
        for (byte b : bytes) {
            System.out.println(b & 255);
        }
    }
}
prints:
197
219
150
223
213
197
221
149
222
195
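If a reusable helper is wanted, a minimal Asc-style sketch (assuming the VB side really is Windows-1252; the name asc is just illustrative):
import java.nio.charset.Charset;

// returns the Windows-1252 code of a single character, like VB's Asc
static int asc(char c) {
    byte[] b = String.valueOf(c).getBytes(Charset.forName("Cp1252"));
    return b[0] & 0xFF;  // unsigned byte value, e.g. 150 for '–'
}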

Is UTF to EBCDIC Conversion lossless?

We have a process which communicates with an external system via MQ. The external system runs on a mainframe machine (IBM z/OS), while we run our process on a CentOS Linux platform. So far we never had any issues.
Recently we started receiving messages from them with non-printable EBCDIC characters embedded in the message. They use the characters as a compressed ID, 8 bytes long. When we receive it, it arrives on our queue encoded in UTF-8 (CCSID 1208).
They need the original 8 bytes back in order to identify our response messages. I'm trying to find a solution in Java to convert the ID back from UTF-8 to EBCDIC before sending the response.
I've been playing around with the JTOpen library, using the AS400Text class to do the conversion. Also, the counterparty has sent us a snapshot of the ID in bytes. However, when I compare the bytes after conversion, they are different from the original message.
Has anyone ever encountered this issue? Maybe I'm using the wrong code page?
Thanks for any input you may have.
Bytes from counterparty(Positions [5,14]):
00000 F0 40 D9 F0 F3 F0 CB 56--EF 80 04 C9 10 2E C4 D4 |0 R030.....I..DM|
Program output:
UTF String: [R030รƒยดรƒยฎรƒโ€ขรƒหœร‚ล“IDMDHP1027W 0510]
EBCDIC String: [R030รƒยดรƒยฎรƒรƒร‚IDMDHP1027W 0510]
NATIVE CHARSET - HEX: [52303330C3B4C3AEC395C398C29C491006444D44485031303237572030353130]
CP500 CHARSET - HEX: [D9F0F3F066BE66AF663F663F623FC9102EC4D4C4C8D7F1F0F2F7E640F0F5F1F0]
Here is some sample code:
private void readAndPrint(MQMessage mqMessage) throws IOException {
    mqMessage.seek(150);
    byte[] subStringBytes = new byte[32];
    mqMessage.readFully(subStringBytes);
    String msgId = toHexString(mqMessage.messageId).toUpperCase();
    System.out.println("----------------------------------------------------------------");
    System.out.println("MESSAGE_ID: " + msgId);
    String hexString = toHexString(subStringBytes).toUpperCase();
    String subStr = new String(subStringBytes);
    System.out.println("NATIVE CHARSET - HEX: [" + hexString + "] [" + subStr + "]");

    // Transform to EBCDIC
    int codePageNumber = 37;
    String codePage = "CP037";
    AS400Text converter = new AS400Text(subStr.length(), codePageNumber);
    byte[] bytesData = converter.toBytes(subStr);
    String resultedEbcdicText = new String(bytesData, codePage);
    String hexStringEbcdic = toHexString(bytesData).toUpperCase();
    System.out.println("CP500 CHARSET - HEX: [" + hexStringEbcdic + "] [" + resultedEbcdicText + "]");
    System.out.println("----------------------------------------------------------------");
}
If an MQ message has varying sub-message fields that require different encodings, then that's how you should handle those messages, i.e., as separate message pieces.
But as you describe this, the entire message needs to be received without conversion. The first eight bytes need to be extracted and held separately. The remainder of the message can then have its encoding converted (unless other sub-fields also need to be extracted as binary, unconverted bytes).
For any return message, the opposite conversion must be done. The text portion of the message can be converted, and then that sub-string can have the original eight bytes prepended to it. The newly reconstructed message then can be sent back through the queue, again without automatic conversion.
Your partner on the other end is not using the messaging product correctly. (Of course, you probably shouldn't say that out loud.) There should be no part of such a message that cannot automatically survive intact across both directions. Instead of an 8-byte binary field, it should be represented as something more like a 16-byte hex representation of the 8-byte value for one example method. In hex, there'd be no conversion problem either way across the route.
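A rough sketch of that hex-transport idea (the ID bytes below are illustrative, not the counterparty's real values):
// carry the 8-byte binary ID as 16 hex characters, which survive any codepage conversion
byte[] id = { (byte) 0xCB, 0x56, (byte) 0xEF, (byte) 0x80, 0x04, (byte) 0xC9, 0x10, 0x2E };
StringBuilder hex = new StringBuilder();
for (byte b : id) {
    hex.append(String.format("%02X", b));
}
System.out.println(hex);  // CB56EF8004C9102E

// and back to the original bytes on the receiving side
byte[] restored = new byte[hex.length() / 2];
for (int i = 0; i < restored.length; i++) {
    restored[i] = (byte) Integer.parseInt(hex.substring(i * 2, i * 2 + 2), 16);
}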
It seems to me that the special 8 bytes are not actually EBCDIC characters but simply 8 bytes of binary data. If that is the case then I believe, as mentioned in the other answer, that you should handle those 8 bytes separately, without letting them be converted to UTF-8 and back to EBCDIC for further processing.
Depending on the EBCDIC variant you are using, it is quite possible that a byte in EBCDIC does not convert to a meaningful UTF-8 character, and hence you will fail to get the original byte back by converting the UTF-8 character you received to EBCDIC.
A brief search on Google gives me several EBCDIC tables (e.g. http://www.simotime.com/asc2ebc1.htm#AscEbcTables). You can see there are lots of values in EBCDIC that do not have a character assigned. Hence, when they are converted to UTF-8, you cannot assume each of them will map to a distinct Unicode character. Therefore your proposed way of processing is going to be very dangerous and error-prone.
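The loss is easy to demonstrate with a short sketch (illustrative bytes again): decoding arbitrary binary data as UTF-8 replaces invalid sequences, so the original bytes cannot be recovered:
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

byte[] id = { (byte) 0xCB, 0x56, (byte) 0xEF, (byte) 0x80, 0x04, (byte) 0xC9, 0x10, 0x2E };
String decoded = new String(id, StandardCharsets.UTF_8);     // invalid sequences become U+FFFD
byte[] roundTrip = decoded.getBytes(StandardCharsets.UTF_8);
System.out.println(Arrays.equals(id, roundTrip));            // false: the original ID is gone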

String.getBytes("UTF-32") returns different results on JVM and Dalvik VM

I have a 48 character AES-192 encryption key which I'm using to decrypt an encrypted database.
However, it tells me the key length is invalid, so I logged the results of getBytes().
When I execute:
final String string = "346a23652a46392b4d73257c67317e352e3372482177652c";
final byte[] utf32Bytes = string.getBytes("UTF-32");
System.out.println(utf32Bytes.length);
Using BlueJ on my mac (Java Virtual Machine), I get 192 as the output.
However, when I use:
Log.d(C.TAG, "Key Length: " + String.valueOf("346a23652a46392b4d73257c67317e352e3372482177652c".getBytes("UTF-32").length));
I get 196 as the output.
Does anybody know why this is happening, and where Dalvik is getting an additional 4 bytes from?
You should specify the endianness on both machines:
final byte[] utf32Bytes = string.getBytes("UTF-32BE");
Note that "UTF-32BE" is a different encoding, not a special .getBytes parameter. It has fixed endianness and doesn't need a BOM. More info: http://www.unicode.org/faq/utf_bom.html#gen6
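A quick check, as a sketch: the explicit-endianness charset never writes a byte order mark, so both VMs agree, while plain "UTF-32" may prepend a 4-byte BOM on some implementations, which would account for the 196:
import java.nio.charset.Charset;

String key = "346a23652a46392b4d73257c67317e352e3372482177652c";       // 48 chars
System.out.println(key.getBytes(Charset.forName("UTF-32BE")).length);  // 192 = 48 x 4, no BOM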
Why would you UTF-32 encode a plain hexadecimal number? That's 8x larger than it needs to be. :P
import java.math.BigInteger;

String s = "346a23652a46392b4d73257c67317e352e3372482177652c";
byte[] bytes = new BigInteger(s, 16).toByteArray();
String s2 = new BigInteger(1, bytes).toString(16);
System.out.println("Strings match is " + s.equals(s2) + " length " + bytes.length);
prints
Strings match is true length 24
