I have a .java file with a string String s="P�rsh�ndetje bot�!";.
When I open this file in Notepad++ and change the encoding to ISO-8859-1, it shows the correct string: "Përshëndetje botë!", but if I open the file in IntelliJ IDEA and change the encoding to ISO-8859-1, it warns me that some symbols can't be converted and then replaces those symbols with question marks: "P?rsh?ndetje bot?!".
Why is this happening? Why is Notepad++ able to convert the file while IDEA is not?
I'm not sure, but it is possible that when you first opened the file it was read as UTF-8, and the invalid byte sequences were turned into the Unicode replacement character. Then, when you try to convert to ISO-8859-1, it tries to convert the Unicode replacement character, but there is no value for that in ISO-8859-1, so it is converted to ? instead.
(Even though text like "ërs" can be represented in Unicode and thus UTF-8, the ISO-8859-1 encoding of "ërs" is EB 72 73. EB is the start byte of a three-byte UTF-8 sequence, but 72 and 73 are not continuation bytes, so a program treating the file as UTF-8 would consider the accented character invalid.)
I think you need to get IntelliJ to open the file as ISO-8859-1, rather than opening it first as UTF-8 and then trying to convert to ISO-8859-1.
(When you switch the encoding in Notepad++ it must be going back to the original bytes of the file and interpreting them as ISO-8859-1, rather than trying to convert content that it has already altered by changing invalid bytes to the replacement character.)
Note that ë is a perfectly valid Unicode character. It can be represented either as the single code point U+00EB, Latin small letter e with diaeresis, or as two code points, U+0065 and U+0308, Latin small letter e followed by combining diaeresis. But U+00EB is encoded in UTF-8 as the two-byte sequence C3 AB, and for U+0065 U+0308 the "e" is encoded as itself, 65, and U+0308 is encoded as CC 88.
So "ë" in UTF-8 must be either C3 AB or 65 CC 88. It can't be EB.
I believe there is a bug here in IDEA (where the default encoding is UTF-8): when you take a file containing valid ISO-8859-1 encoded characters and change the file encoding to ISO-8859-1, it messes it up. The particular character that it messes up is ë. For some reason, it replaces it with \ufffd, whereas its correct code point is \u00eb. \ufffd is the replacement character that shows up as � in your editor.
My suggestion is to just keep the file in UTF-8 and not change it to ISO-8859-1. UTF-8 is backward compatible with ASCII, and you can type this string directly using the input method of your OS (which appears to be Windows). I am not sure how to do it on Windows, but on a Mac I use the Unicode Hex Input keyboard
and type 00eb while holding the Option (Alt) key. Then the character shows up correctly.
Related
I am getting a java.nio.charset.UnmappableCharacterException when trying to write a CSV file that contains the character µ.
I need to write the file with ASCII encoding so that Excel can open it directly without the user having to do anything.
How can I convert my µ character into its ASCII equivalent before writing it to the file?
ASCII only uses the lower 7 bits of a character, so there are only 2^7 = 128 possible characters. Of those, only 95 are actually printable (read: visible), and that includes the space character (because it still has a fixed width). Unfortunately, your character is not part of that list.
The most used ASCII-compatible character encoding is probably UTF-8 by now. However, it needs two bytes, 0xC2 0xB5, to encode the micro sign µ.
Western Latin, also known as ISO/IEC 8859-1 (since 1987), has the character at U+00B5 (Alt+0181 on Windows) and encodes it as the single byte 0xB5. Western Latin is not used that much under that name, though; instead, its extended version, Windows-1252, is used, with the character at the same position.
You can look up the Unicode encoding and the Windows-1252 encoding at the fileformat.info site.
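So one option is to write the CSV in Windows-1252 instead of ASCII, so that µ becomes the single byte 0xB5 and Excel on a Western-locale machine decodes it directly. A minimal sketch (the file name and contents are placeholders; Path.of needs Java 11+):

import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class MicroSignCsv {
    public static void main(String[] args) throws IOException {
        // Windows-1252 encodes the micro sign µ (U+00B5) as the single byte 0xB5,
        // so there is no UnmappableCharacterException and Excel shows it as-is.
        Charset cp1252 = Charset.forName("windows-1252");
        List<String> lines = List.of("value;unit", "20;µm");
        Files.write(Path.of("out.csv"), lines, cp1252);
    }
}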
I open Windows Notepad, type 18, and save the file with UTF-8 encoding. I know that my file will have a BOM header, so my file is a UTF-8 encoded file (with a BOM).
The problem is that when I print that string with the code below:
// str is the string read from the file using StandardCharsets.UTF_8
System.out.println(str);
On Windows I got:
?18
But on Linux I got:
18
So why does Java behave differently on the two systems? How should I understand this?
A BOM is the character U+FEFF, a zero-width no-break space, so it is invisible in principle.
However, the Windows console does not use UTF-8 by default; it uses one of the many single-byte encodings. When the String is written to the output, the BOM character, which is missing from that charset, is turned into a question mark.
Notepad, on the other hand, recognizes the BOM and displays the text as UTF-8.
Linux nowadays generally uses UTF-8, so it has no problem, even in the console.
Further explanation
On Windows, System.out writes to the console, and that console typically uses a charset/encoding such as Cp850, a single-byte charset of some 256 characters. Characters such as ĉ or the BOM character may very well be missing from it. If a Java String contains such characters, they cannot be encoded to one of the 256 available characters, so they are converted to ?.
You can check up front with a CharsetEncoder whether the default charset can encode a string:
String s = ...; // the text read from the file, BOM included
CharsetEncoder encoder = Charset.defaultCharset().newEncoder();
if (!encoder.canEncode(s)) {
System.out.println("A problem");
}
Windows applications generally also run on a single-byte encoding, such as Cp1252, again with 256 characters. Editors, however, may handle several encodings, and if the font can display the character (Unicode code point), everything works.
The behavior of Java is the same on both platforms: FileInputStream does not handle the BOM.
On Windows, your file (file1) has the hex bytes EF BB BF 31 38.
On Linux, your file (file2) has the hex bytes 31 38.
When you read them, you get different strings.
I recommend converting the BOM file to a file without a BOM using Notepad++.
Or you can use BOMInputStream from Apache Commons IO, as in the sketch below.
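A minimal sketch, assuming Apache Commons IO is on the classpath and that file1.txt stands in for your actual file:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.commons.io.input.BOMInputStream;

public class ReadWithoutBom {
    public static void main(String[] args) throws IOException {
        // BOMInputStream skips a leading UTF-8 BOM, if present, before the bytes are decoded.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(
                        new BOMInputStream(new FileInputStream("file1.txt")),
                        StandardCharsets.UTF_8))) {
            String str = reader.readLine();
            System.out.println(str); // prints 18 whether or not the file had a BOM
        }
    }
}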
My Java program is trying to read a text file (a mainframe VSAM file converted to a flat file). I believe this means the file is encoded in EBCDIC.
I am using com.ibm.jzos.FileFactory.newBufferedReader(fullyQualifiedFileName, ZFile.DEFAULT_EBCDIC_CODE_PAGE); to open the file.
and use String inputLine = inputFileReader.readLine() to read a line and store it in a Java String variable for processing. I have read that text becomes Unicode when stored in a String variable.
How can I ensure that the content is not corrupted when storing in the java string variable?
The charset decoder will map the bytes to their correct Unicode code points for the String, and vice versa when writing.
The only problem is that BufferedReader.readLine drops the line ending (including the EBCDIC end-of-line NEL character, \u0085, which is also a recognized Unicode newline). So when writing, write the NEL yourself, or set the line.separator system property.
Nothing is easier than writing a unit test with all 256 EBCDIC byte values and converting them back and forth.
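For example, a minimal round-trip sketch, assuming the Cp1047 EBCDIC code page is the one your files use and is available in your JDK:

import java.nio.charset.Charset;
import java.util.Arrays;

public class EbcdicRoundTrip {
    public static void main(String[] args) {
        // Assumed code page; substitute the EBCDIC code page your files actually use.
        Charset ebcdic = Charset.forName("Cp1047");

        byte[] original = new byte[256];
        for (int i = 0; i < 256; i++) {
            original[i] = (byte) i;
        }

        // Decode to a Java (Unicode) String and encode back to bytes.
        String decoded = new String(original, ebcdic);
        byte[] roundTripped = decoded.getBytes(ebcdic);

        System.out.println(Arrays.equals(original, roundTripped)
                ? "Round trip OK" : "Round trip lost data");
    }
}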
If you have read the file with the correct character set (which is the biggest assumption here), then it doesn't matter that Java itself uses Unicode internally; Unicode contains all the characters of EBCDIC.
A character set specifies the mapping between a character (codepoint) and one or more bytes. A file is nothing more than a stream of bytes, if you apply the right character set, then the right characters are mapped in memory.
Say byte 1 maps to 'A' in character set X and to bytes 0 and 65 in UTF-16. Then reading a file which contains byte 1 using character set X will make the system read the character 'A', even if that system uses bytes 0 and 65 in memory to store that character.
However, there is no way to know whether you used the right character set unless you specifically know what the actual result should be.
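As a small illustration of the point above (again assuming the Cp1047 EBCDIC code page is available), the same byte decodes to different characters under different charsets:

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class SameByteTwoCharsets {
    public static void main(String[] args) {
        byte[] data = { (byte) 0xC1 };

        // The same byte decodes to different characters depending on the charset used:
        System.out.println(new String(data, Charset.forName("Cp1047")));   // A  (EBCDIC)
        System.out.println(new String(data, StandardCharsets.ISO_8859_1)); // Á  (Latin-1)
    }
}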
I need to export string data that includes the degree symbol ("\u00B0"). This data is exported as a CSV text file with UTF-8 encoding. As would be expected, the degree symbol is encoded as two bytes (0xC2, 0xB0) in the UTF-8 output. When the CSV file is imported into Excel, it is displayed as a capital A with a circumflex accent, followed by the degree symbol.
I know that UTF-8 only covers 7-bit ASCII as single bytes, not 8-bit "extended ASCII", and that US-ASCII only supports 7-bit ASCII, period.
Is there some way to specify encoding such that the 0xC2 prefix byte is suppressed?
I'm leaning toward allowing normal processing to occur, then reading and overwriting the file contents, stripping the extra byte.
I'd really prefer a more elegant solution...
Excel assumes CSV files are in an 8-bit code page.
To get Excel to parse your csv as UTF-8, you need to add a UTF-8 Byte Order Mark to the start of the file.
Edit:
If you're in Western Europe or the US, Excel will likely use the Windows-1252 character set for decoding and encoding when it encounters a file without a Unicode Byte Order Mark.
Since 0xC2 and 0xB0 are both legal Windows-1252 byte values, Excel will decode them to the following:
0xC2 = Â
0xB0 = °
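If you want to add the BOM from Java yourself, a minimal sketch (the file name and CSV content are placeholders; Files.writeString needs Java 11+):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class CsvWithBom {
    public static void main(String[] args) throws IOException {
        // "\uFEFF" at the start of the text becomes the bytes EF BB BF in UTF-8,
        // which Excel takes as a BOM and then decodes the rest of the file as UTF-8.
        String csv = "\uFEFF" + "temperature\n20\u00B0C\n";
        Files.writeString(Path.of("export.csv"), csv, StandardCharsets.UTF_8);
    }
}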
I'm reading some text that I got from Wikipedia.
The text contains a hyphen-like character, as in this String: "Australia for the [[2011–12 NBL season]]"
What I'm trying to do is convert the text to UTF-8, using this code:
String myStr = "Australia for the [[2011–12 NBL season]]";
new String(myStr.getBytes(), "utf-8");
The result is:
Australia for the [[2011�12 NBL season]]
The problem is that the hyphen is not being mapped correctly.
The hyphen value in bytes is [-106] (I have no idea what to do with it...)
Do you know how to convert it to a hyphen that utf-8 encoding recognizes?
I would be happy to replace other special characters as well with some general code, but specific "hyphen" replacement code would also help.
The problem code point is U+2013 EN DASH which can be represented with the escape \u2013.
Try replacing the string with "2011\u201312". If this works, then there is a mismatch between your editor's character encoding and the one the compiler is using.
Otherwise, the problem is with the transcoding operation from the string to whatever device you are writing to. Anywhere you convert from bytes to chars or chars to bytes is a potential point of corruption when the wrong encoding is used; this can include System.out.
Note: Java strings are always UTF-16.
new String(myStr.getBytes(), "utf-8");
This code takes UTF-16, converts it to the platform encoding, which might be anything, then pretends it's UTF-8 and converts it back to UTF-16. At best, the platform encoding is UTF-8 and this is a no-op; otherwise it will just corrupt the data.
This is how you create UTF-8 in Java:
byte[] utf8 = str.getBytes(StandardCharsets.UTF_8); // Java 7
You can read more here.
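To illustrate, here is a small sketch (the PrintStream constructor taking a Charset needs Java 10+) that keeps the en dash intact by naming UTF-8 on both sides of the conversion:

import java.io.PrintStream;
import java.nio.charset.StandardCharsets;

public class DashRoundTrip {
    public static void main(String[] args) {
        String myStr = "Australia for the [[2011\u201312 NBL season]]";

        // Encode and decode with an explicit charset instead of the platform default.
        byte[] utf8 = myStr.getBytes(StandardCharsets.UTF_8);
        String back = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(back.equals(myStr)); // true: the round trip is lossless

        // When writing to a stream yourself, name the charset there as well.
        PrintStream out = new PrintStream(System.out, true, StandardCharsets.UTF_8);
        out.println(back);
    }
}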
This is probably because the source code is saved by your editor in Windows-1252 (extended Latin-1), while the compiler reads it with another encoding, UTF-8. These two encodings must be the same (for example by passing -encoding to javac), or you can use the escape "\u2013" in the source, which is an ASCII-only representation of the dash.