I have a small piece of code in which I am checking the code point for the character Ü.
Locale lc = Locale.getDefault();
System.out.println(lc.toString());
System.out.println(Charset.defaultCharset());
System.out.println(System.getProperty("file.encoding"));
String inUnicode = "\u00dc";
String glyph = "Ü";
System.out.println("inUnicode " + inUnicode + " code point " + inUnicode.codePointAt(0));
System.out.println("glyph " + glyph + " code point " + glyph.codePointAt(0));
I am getting different values for the code point when I run this code on macOS and Windows 10; see the output below.
Output on macOS
en_US
UTF-8
UTF-8
inUnicode Ü code point 220
glyph Ü code point 220
Output on Windows
en_US
windows-1252
Cp1252
inUnicode Ü code point 220
glyph ?? code point 195
I checked the code page for Windows-1252 at https://en.wikipedia.org/wiki/Windows-1252#Character_set, where the code point for Ü is 220.
For String glyph = "Ü"; why do I get code point 195 on Windows? As per my understanding, glyph should have been rendered properly and the code point should have been 220, since it is defined in Windows-1252.
If I replace String glyph = "Ü"; with String glyph = new String("Ü".getBytes(), Charset.forName("UTF-8")); then glyph is rendered correctly and the code point value is 220.
Is this the correct and efficient way to standardize behavior of String on any OS irrespective of locale and charset?
195 is 0xC3 in hex.
In UTF-8, Ü is encoded as bytes 0xC3 0x9C.
System.getProperty("file.encoding") says the default file encoding on Windows is not UTF-8, but clearly your Java file is actually encoded in UTF-8. The fact that println() is outputting glyph ?? (note 2 ?, meaning 2 chars are present), and that you are able to decode the raw string bytes using the UTF-8 Charset, proves this.
glyph should have a single char whose value is 0x00DC, not 2 chars whose values are 0x00C3 0x009C. getCodepointAt(0) is returning 0x00C3 (195) on Windows because your Java file is encoded in UTF-8 but is being loaded as if it were encoded in Windows-1252 instead, so the 2 bytes 0xC3 0x9C get decoded as characters 0x00C3 0x009C instead of as character 0x00DC.
You need to specify the actual file encoding when compiling and running Java, e.g.:
javac -encoding UTF-8 ...
java -Dfile.encoding=UTF-8 ...
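To see both decodings side by side, here is a minimal sketch (the class name is illustrative) that decodes the two UTF-8 bytes of Ü with each charset and prints the resulting code points:
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CodePointDemo {
    public static void main(String[] args) {
        byte[] utf8Bytes = {(byte) 0xC3, (byte) 0x9C};   // UTF-8 encoding of Ü

        // Decoded with the right charset: one char, code point 220 (U+00DC)
        String correct = new String(utf8Bytes, StandardCharsets.UTF_8);
        System.out.println(correct.length() + " " + correct.codePointAt(0));   // 1 220

        // Decoded as windows-1252: two chars, first code point 195 (U+00C3)
        String wrong = new String(utf8Bytes, Charset.forName("windows-1252"));
        System.out.println(wrong.length() + " " + wrong.codePointAt(0));       // 2 195
    }
}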
According to the Java documentation for String.length:
public int length()
Returns the length of this string.
The length is equal to the number of Unicode code units in the string.
Specified by:
length in interface CharSequence
Returns:
the length of the sequence of characters represented by this object.
But then I don't understand why the following program, HelloUnicode.java, produces different results on different platforms. According to my understanding, the number of Unicode code units should be the same, since Java supposedly always represents strings in UTF-16:
public class HelloUnicode {
    public static void main(String[] args) {
        String myString = "I have a 🙂 in my string";
        System.out.println("String: " + myString);
        System.out.println("Bytes: " + bytesToHex(myString.getBytes()));
        System.out.println("String Length: " + myString.length());
        System.out.println("Byte Length: " + myString.getBytes().length);
        System.out.println("Substring 9 - 13: " + myString.substring(9, 13));
        System.out.println("Substring Bytes: " + bytesToHex(myString.substring(9, 13).getBytes()));
    }

    // Code from https://stackoverflow.com/a/9855338/4019986
    private final static char[] hexArray = "0123456789ABCDEF".toCharArray();

    public static String bytesToHex(byte[] bytes) {
        char[] hexChars = new char[bytes.length * 2];
        for (int j = 0; j < bytes.length; j++) {
            int v = bytes[j] & 0xFF;
            hexChars[j * 2] = hexArray[v >>> 4];
            hexChars[j * 2 + 1] = hexArray[v & 0x0F];
        }
        return new String(hexChars);
    }
}
The output of this program on my Windows box is:
String: I have a 🙂 in my string
Bytes: 492068617665206120F09F998220696E206D7920737472696E67
String Length: 26
Byte Length: 26
Substring 9 - 13: 🙂
Substring Bytes: F09F9982
The output on my CentOS 7 machine is:
String: I have a 🙂 in my string
Bytes: 492068617665206120F09F998220696E206D7920737472696E67
String Length: 24
Byte Length: 26
Substring 9 - 13: 🙂 i
Substring Bytes: F09F99822069
I ran both with Java 1.8. Same byte length, different String length. Why?
UPDATE
By replacing the "🙂" in the string with "\uD83D\uDE42", I get the following results:
Windows:
String: I have a ? in my string
Bytes: 4920686176652061203F20696E206D7920737472696E67
String Length: 24
Byte Length: 23
Substring 9 - 13: ? i
Substring Bytes: 3F2069
CentOS:
String: I have a 🙂 in my string
Bytes: 492068617665206120F09F998220696E206D7920737472696E67
String Length: 24
Byte Length: 26
Substring 9 - 13: 🙂 i
Substring Bytes: F09F99822069
Why "\uD83D\uDE42" ends up being encoded as 0x3F on the Windows machine is beyond me...
Java Versions:
Windows:
java version "1.8.0_211"
Java(TM) SE Runtime Environment (build 1.8.0_211-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.211-b12, mixed mode)
CentOS:
openjdk version "1.8.0_201"
OpenJDK Runtime Environment (build 1.8.0_201-b09)
OpenJDK 64-Bit Server VM (build 25.201-b09, mixed mode)
Update 2
Using .getBytes("utf-8"), with the "🙂" embedded in the string literal, here are the outputs.
Windows:
String: I have a 🙂 in my string
Bytes: 492068617665206120C3B0C5B8E284A2E2809A20696E206D7920737472696E67
String Length: 26
Byte Length: 32
Substring 9 - 13: 🙂
Substring Bytes: C3B0C5B8E284A2E2809A
CentOS:
String: I have a 🙂 in my string
Bytes: 492068617665206120F09F998220696E206D7920737472696E67
String Length: 24
Byte Length: 26
Substring 9 - 13: 🙂 i
Substring Bytes: F09F99822069
So yes, it appears to be a difference in system encoding. But then that means string literals are encoded differently on different platforms? That sounds like it could be problematic in certain situations.
Also... where is the byte sequence C3B0C5B8E284A2E2809A coming from to represent the smiley in Windows? That doesn't make sense to me.
For completeness, using .getBytes("utf-16"), with the "🙂" embedded in the string literal, here are the outputs.
Windows:
String: I have a 🙂 in my string
Bytes: FEFF00490020006800610076006500200061002000F001782122201A00200069006E0020006D007900200073007400720069006E0067
String Length: 26
Byte Length: 54
Substring 9 - 13: 🙂
Substring Bytes: FEFF00F001782122201A
CentOS:
String: I have a 🙂 in my string
Bytes: FEFF004900200068006100760065002000610020D83DDE4200200069006E0020006D007900200073007400720069006E0067
String Length: 24
Byte Length: 50
Substring 9 - 13: 🙂 i
Substring Bytes: FEFFD83DDE4200200069
You have to be careful about specifying the encodings:
when you compile the Java file, the compiler uses some encoding to read the source file. My guess is that this already broke your original String literal at compilation time. This can be fixed by using the escape sequence.
after you use the escape sequence, the String lengths are the same. The bytes inside the String are also the same, but what you are printing out does not show that.
the bytes printed are different because you called getBytes(), and that again uses the environment- or platform-specific encoding. So the output was also broken (replacing unencodable smileys with question marks). You need to call getBytes("UTF-8") to be platform-independent, as sketched below.
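A minimal sketch of that last point (the class name is illustrative):
import java.nio.charset.StandardCharsets;

public class GetBytesDemo {
    public static void main(String[] args) {
        String myString = "I have a \uD83D\uDE42 in my string";

        // Platform-dependent: uses the JVM's default charset, so the byte
        // length can differ between Windows (Cp1252) and Linux (UTF-8)
        byte[] defaultBytes = myString.getBytes();

        // Platform-independent: always the same UTF-8 byte sequence
        byte[] utf8Bytes = myString.getBytes(StandardCharsets.UTF_8);

        System.out.println(defaultBytes.length + " vs " + utf8Bytes.length);
    }
}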
So to answer the specific questions posed:
Same byte length, different String length. Why?
Because the string literal is being encoded by the Java compiler, and the Java compiler often uses a different encoding on different systems by default. This may result in a different number of char units per Unicode character, which results in a different string length. Passing the -encoding command line option to javac with the same value across platforms will make them encode consistently.
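For example, using the file name from the question (-encoding is a standard javac option):
javac -encoding UTF-8 HelloUnicode.java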
Why "\uD83D\uDE42" ends up being encoded as 0x3F on the Windows machine is beyond me...
It's not encoded as 0x3F in the string. 0x3F is the question mark. The string built from "\uD83D\uDE42" holds the correct surrogate pair; Java substitutes ? whenever it is asked to output a character that the target encoding cannot represent, which is what happens when System.out.println() and getBytes() encode that string with the platform default encoding (Cp1252 on your Windows machine) for the console and for the byte array.
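A minimal sketch of that substitution (the class name is illustrative):
import java.nio.charset.Charset;

public class ReplacementDemo {
    public static void main(String[] args) {
        String s = "\uD83D\uDE42";   // 🙂 as a surrogate pair
        byte[] cp1252 = s.getBytes(Charset.forName("windows-1252"));
        // windows-1252 has no mapping for U+1F642, so getBytes()
        // substitutes the charset's replacement byte, '?' (0x3F)
        for (byte b : cp1252) {
            System.out.printf("%02X ", b & 0xFF);   // prints: 3F
        }
        System.out.println();
    }
}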
But then that means string literals are encoded differently on different platforms?
By default, yes.
Also... where is the byte sequence C3B0C5B8E284A2E2809A coming from to represent the smiley in Windows?
This is quite convoluted. The "🙂" character (Unicode code point U+1F642) is stored in the Java source file with UTF-8 encoding using the byte sequence F0 9F 99 82. The Java compiler then reads the source file using the platform default encoding, Cp1252 (Windows-1252), so it treats these UTF-8 bytes as though they were Cp1252 characters, making a 4-character string by translating each byte from Cp1252 to Unicode, resulting in U+00F0 U+0178 U+2122 U+201A. The getBytes("utf-8") call then converts this 4-character string into bytes by encoding them as utf-8. Since every character of the string is higher than hex 7F, each character is converted into 2 or more UTF-8 bytes; hence the resulting string being this long. The value of this string is not significant; it's just the result of using an incorrect encoding.
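A minimal sketch that reproduces this chain of mis-decodings (the class name is illustrative):
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        // The UTF-8 bytes of 🙂 (U+1F642) as they sit in the source file
        byte[] sourceBytes = {(byte) 0xF0, (byte) 0x9F, (byte) 0x99, (byte) 0x82};

        // The compiler reads them as Cp1252, producing the 4 chars ð Ÿ ™ ‚
        String misread = new String(sourceBytes, Charset.forName("windows-1252"));

        // Re-encoding those 4 chars as UTF-8 yields the 10-byte sequence
        byte[] reEncoded = misread.getBytes(StandardCharsets.UTF_8);
        StringBuilder hex = new StringBuilder();
        for (byte b : reEncoded) {
            hex.append(String.format("%02X", b & 0xFF));
        }
        System.out.println(hex);   // C3B0C5B8E284A2E2809A
    }
}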
You didn't take into account that getBytes() returns the bytes in the platform's default encoding. This is different on Windows and CentOS.
See also How to Find the Default Charset/Encoding in Java? and the API documentation on String.getBytes().
I have this code in Java that takes a PDF file and extracts all the text:
File file = new File("C:/file.pdf");
PDDocument doc = PDDocument.load(file);
PDFTextStripper s = new PDFTextStripper();
String content = s.getText(doc);
System.out.println(content);
If we run the application on Windows, it works correctly and extracts all the text. However, when we move the app to the server, which uses Linux, the Spanish accents are converted into "strange" characters, e.g. "carácter" (it should be "carácter"). I tried to convert the String to bytes and then back to a UTF-8 string:
byte[] b = content.getBytes(Charset.forName("UTF-8"));
String text = new String(b);
System.out.println(text);
But it does not work: on Windows it keeps working well, but on the Linux server it still shows the Spanish accents wrong, etc. I understand that if it works correctly in a Windows environment, it should work in a Linux environment too. Any idea what it can be, or what I can do? Thank you.
á is what you get when the UTF-8 encoded form of á is misinterpreted as Latin-1.
There are two possibilities for this to happen:
a bug in PDFTextStripper.getText() - Java strings are UTF-16 encoded, but getText() may be returning a string containing UTF-8 byte octets that have been expanded as-is to 16-bit Java chars, thus producing 2 chars 0x00C3 0x00A1 instead of 1 char 0x00E1 for á. Subsequently calling content.getBytes(UTF8) on such a malformed string would just give you more corrupted data.
To "fix" this kind of mistake, loop through the string copying its chars as-is to a byte[] array, and then decode that array as UTF-8:
byte[] b = new byte[content.length()];
for (int i = 0; i < content.length(); ++i) {
    b[i] = (byte) content.charAt(i);   // copy each char's low byte as-is
}
String text = new String(b, "UTF-8");
System.out.println(text);
a configuration mismatch - PDFTextStripper.getText() may be returning a properly encoded UTF-16 string containing an á char as expected, but then System.out.println() outputs the UTF-8 encoded form of that string, and your terminal/console misinterprets the output as Latin-1 instead of as UTF-8.
In this case, the code you have shown is fine; you would just need to double-check your Java environment and terminal/console configuration to make sure they agree on the charset used for console output.
You need to check the actual char values in content to know which case is actually happening.
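For example, a minimal diagnostic sketch (assuming content is the string returned by getText(), as in the question):
// Dump the numeric value of each char; inspect the region around an accented word.
// Case 1 (malformed string): you will see pairs such as 00C3 00A1 instead of 00E1.
// Case 2 (console mismatch): you will see the expected single 00E1 for á.
for (int i = 0; i < Math.min(content.length(), 200); i++) {
    System.out.printf("%04X ", (int) content.charAt(i));
}
System.out.println();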
I have a byte[] containing the value -110 (and other negative values). When I convert it to a string, I need it to display a ’ (right single quote). Currently, I am getting a question mark (?).
The ’ aligns to the special ASCII character #146 mentioned on this page, but I am now stuck as to how I can take -110, or 146 (-110 + 256), and get a ’ value. I have also tried the charsets shown below. Any advice would be greatly appreciated.
byte[] b = {-110,84};
System.out.println(new String(b, Charset.forName("Windows-1252"))); //Displays ?T . The desired output should be ’T
System.out.println(new String(b, Charset.forName("UTF-8"))); //Displays ?T . The desired output should be ’T
System.out.println(new String(b, Charset.forName("ISO-8859-1"))); //Displays ?T . The desired output should be ’T
Thanks for the responses. As Jon Skeet points out in his reply, the Java program needed to treat the input data as Windows-1252, and the Windows command line wasn't set to that code page either.
Setting the command line codepage to Windows-1252 was done by running
chcp 1252
Starting the Java program to use Windows-1252 as the default was done by adding the following parameter
-Dfile.encoding="Windows-1252"
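Alternatively, here is a sketch that avoids relying on the JVM default entirely: decode the bytes with the charset they were actually produced in, and print through a PrintStream whose charset you choose explicitly (it still has to match the console code page set with chcp). The class name is illustrative.
import java.io.PrintStream;
import java.nio.charset.Charset;

public class QuoteDemo {
    public static void main(String[] args) throws Exception {
        byte[] b = {-110, 84};   // 0x92 0x54, i.e. ’T in Windows-1252

        // Decode with the charset the bytes were produced in
        String s = new String(b, Charset.forName("windows-1252"));

        // Print with an explicitly chosen console charset
        PrintStream out = new PrintStream(System.out, true, "Cp1252");
        out.println(s);   // ’T, provided the console code page is 1252
    }
}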
ElasticSearch is a search server which accepts data only in UTF-8.
When I try to give ElasticSearch the following text
Small businesses potentially in line for a lighter reporting load include those with an annual turnover of less than £440,000, net assets of less than £220,000 and fewer than ten employees"
through my Java application (basically my Java application takes this info from a web page and gives it to ElasticSearch), ES complains it can't understand £ and it fails. After filtering it through the code below:
byte bytes[] = s.getBytes("ISO-8859-1");
s = new String(bytes, "UTF-8");
Here £ is converted to �
But when I copy it to a file in my home directory using bash, it goes in fine. Any pointers will help.
You have ISO-8859-1 octets in bytes, which you then tell String to decode as if they were UTF-8. When it does that, it doesn't recognize the illegal 0xA3 byte and replaces it with the replacement character (U+FFFD, shown as �).
To convert correctly, you have to construct the string with the encoding the bytes actually use, then encode it into the encoding that you want. See How do I convert between ISO-8859-1 and UTF-8 in Java?.
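A minimal sketch of that conversion, assuming the incoming bytes really are ISO-8859-1 (the class name is illustrative):
import java.nio.charset.StandardCharsets;

public class PoundDemo {
    public static void main(String[] args) {
        String original = "\u00A3440,000";   // £440,000, escaped to avoid source-encoding issues

        // Suppose the data arrives as ISO-8859-1 bytes (0xA3 is £ in Latin-1)
        byte[] latin1 = original.getBytes(StandardCharsets.ISO_8859_1);

        // Decode with the charset the bytes are actually in ...
        String s = new String(latin1, StandardCharsets.ISO_8859_1);

        // ... then encode as UTF-8 when handing the text to ElasticSearch
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length);   // 9: the £ becomes the two bytes C2 A3
    }
}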
UTF-8 is easier than one thinks. In a String everything is Unicode characters.
Byte/string conversion is done as follows. (Note: Cp1252, or Windows-1252, is the Windows Latin-1 extension of ISO-8859-1; better to use that one.)
// Reading a file whose bytes are in Cp1252:
BufferedReader in = new BufferedReader(
        new InputStreamReader(new FileInputStream(inFile), "Cp1252"));
// Writing a file in UTF-8:
PrintWriter out = new PrintWriter(
        new OutputStreamWriter(new FileOutputStream(outFile), "UTF-8"));
// Declaring the encoding of a servlet response:
response.setContentType("text/html; charset=UTF-8");
response.setCharacterEncoding("UTF-8");
// Escaping non-ASCII characters in the source:
String s = "20 \u00A3"; // "20 £"
To see why Cp1252 is more suitable than ISO-8859-1:
http://en.wikipedia.org/wiki/Windows-1252
A String s is a series of characters that are basically independent of any character encoding (OK, not exactly independent, but close enough for our needs now). Whatever encoding your data was in when you loaded it into a String has already been decoded. The decoding was done either using the system default encoding (which is practically ALWAYS AN ERROR; do not ever use the system default encoding, trust me, I have over 10 years of experience in dealing with bugs related to wrong default encodings) or the encoding you explicitly specified when you loaded the data.
When you call getBytes("ISO-8859-1") for a String, you request that the String is encoded into bytes according to ISO-8859-1 encoding.
When you create a String from a byte array, you need to specify the encoding in which the characters in the byte array are represented. You create a string from the byte array as if it had been encoded in UTF-8 (but just above you encoded it in ISO-8859-1; that is your error).
What you want to do is:
byte bytes[] = s.getBytes("UTF-8");
s = new String(bytes, "UTF-8");
Why is the following displayed differently on Linux vs Windows?
System.out.println(new String("¿".getBytes("UTF-8"), "UTF-8"));
in Windows:
¿
in Linux:
¿
System.out.println() outputs the text in the system default encoding, but the console interprets that output according to its own encoding (or "codepage") setting. On your Windows machine the two encodings seem to match, but on the Linux box the output is apparently in UTF-8 while the console is decoding it as a single-byte encoding like ISO-8859-1. Or maybe, as Jon suggested, the source file is being saved as UTF-8 and javac is reading it as something else, a problem that can be avoided by using Unicode escapes.
When you need to output anything other than ASCII text, your best bet is to write it to a file using an appropriate encoding, then read the file with a text editor--consoles are too limited and too system-dependent. By the way, this bit of code:
new String("¿".getBytes("UTF-8"), "UTF-8")
...has no effect on the output. All that does is encode the contents of the string to a byte array and decode it again, reproducing the original string--an expensive no-op. If you want to output text in a particular encoding, you need to use an OutputStreamWriter, like so:
FileOutputStream fos = new FileOutputStream("out.txt");
OutputStreamWriter osw = new OutputStreamWriter(fos, "UTF-8");
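For completeness, a minimal runnable sketch along those lines (the file name and content are illustrative):
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;

public class WriteUtf8Demo {
    public static void main(String[] args) throws Exception {
        // Write with an explicit encoding, then open out.txt in an editor
        // that lets you choose or display the file's encoding.
        try (Writer osw = new OutputStreamWriter(
                new FileOutputStream("out.txt"), "UTF-8")) {
            osw.write("\u00BF");   // ¿, written as the UTF-8 bytes C2 BF
        }
    }
}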
Not sure where the problem is exactly, but it's worth noting that ¿ (0xC2, 0xBF) is the result of encoding with UTF-8 the code point U+00BF, which is the Unicode code point for ¿. So it looks like in the Linux case the output is not being displayed as UTF-8, but as a single-byte encoding.
Check what encoding your Linux terminal uses.
For gnome-terminal in Ubuntu, go to the "Terminal" menu and select "Set Character Encoding".
For PuTTY: Configuration -> Window -> Translation -> UTF-8 (and if that doesn't work, see this post).
Run this code to help determine if it is a compiler or console issue:
public static void main(String[] args) throws Exception {
    String s = "¿";
    printHex(Charset.defaultCharset(), s);

    Charset utf8 = Charset.forName("UTF-8");
    printHex(utf8, s);
}

public static void printHex(Charset encoding, String s)
        throws UnsupportedEncodingException {
    System.out.print(encoding + "\t" + s + "\t");

    byte[] barr = s.getBytes(encoding);
    for (int i = 0; i < barr.length; i++) {
        int n = barr[i] & 0xFF;
        String hex = Integer.toHexString(n);
        if (hex.length() == 1) {
            System.out.print('0');
        }
        System.out.print(hex);
    }
    System.out.println();
}
If the encoded bytes for UTF-8 are different on each platform (it should be c2bf), it is a compiler issue.
If it is a compiler issue, replace "¿" with "\u00bf".
It's hard to know exactly which bytes your source code contains, or the string which getBytes() is being called on, due to your editor and compiler encodings.
Can you produce a short but complete program containing only ASCII (and the relevant \uxxxx escaping in the string) which still shows the problem?
I suspect the problem may well be with the console output on either Windows or Linux, but it would be good to get a reproducible program first.
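For what it's worth, a sketch of such an ASCII-only program (hypothetical class name) could look like this:
public class Repro {
    public static void main(String[] args) throws Exception {
        String s = "\u00BF";   // ¿ via a Unicode escape, so the source stays pure ASCII
        byte[] utf8 = s.getBytes("UTF-8");
        for (byte b : utf8) {
            System.out.printf("%02X", b & 0xFF);   // expected: C2BF on every platform
        }
        System.out.println();
    }
}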