Determining ISO-8859-1 vs US-ASCII charset - java

I am trying to determine whether to use
PrintWriter pw = new PrintWriter(outputFilename, "ISO-8859-1");
or
PrintWriter pw = new PrintWriter(outputFilename, "US-ASCII");
I was reading "All about character sets" to determine the character set of an example file which I must create in the same encoding via Java code.
When my example file contains "European" letters (Norwegian: å ø æ), then the following command tells me the file encoding is "iso-8859-1"
file -bi example.txt
However, when I take a copy of the same example file and modify it to contain different data, without any Norwegian text (let's say, I replace "Bjørn" with "Bjorn"), then the same command tells me the file encoding is "us-ascii".
file -bi example-no-european-letters.txt
What does this mean? Is ISO-8859-1 in practise the same as US-ASCII if there are no "European" characters in it?
Should I just use the charset "ISO-8859-1" and everything will be OK?

If the file contains only 7-bit US-ASCII characters, it can be read as US-ASCII. That doesn't tell you anything about the intended charset; it may just be a coincidence that no characters requiring a different encoding were present.
ISO-8859-1 (and -15) is a common European encoding, able to encode äöåéü and other characters, with the first 128 characters being the same as in US-ASCII (as is often the case, for compatibility reasons).
However you can't just pick an encoding and assume that "everything will be OK". The very common UTF-8 encoding also contains the US-ASCII charset, but it will encode for example äöå characters as two bytes instead of ISO-8859-1's one byte.
TL;DR: Don't assume things with encodings. Find out what was intended and use that. If you can't find it out, observe the data to try to figure out what is a correct charset to use (as you noted yourself, multiple encodings may work at least temporarily).
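To make that concrete, here is a minimal sketch (not from either post; the sample strings are borrowed from the question) showing that ISO-8859-1 and US-ASCII produce identical bytes for pure ASCII text and diverge as soon as a Norwegian character appears:
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

String ascii = "Bjorn";      // 7-bit ASCII only
String norwegian = "Bjørn";  // contains ø, which US-ASCII cannot represent

// For pure ASCII text the two charsets produce identical bytes, which is why
// file -bi can only report us-ascii for such a file.
System.out.println(Arrays.equals(
        ascii.getBytes(StandardCharsets.US_ASCII),
        ascii.getBytes(StandardCharsets.ISO_8859_1)));        // true

// Once ø appears the charsets diverge: ISO-8859-1 encodes it as the single
// byte 0xF8, while US-ASCII silently substitutes '?' (0x3F).
System.out.println(Arrays.toString(norwegian.getBytes(StandardCharsets.ISO_8859_1)));
System.out.println(Arrays.toString(norwegian.getBytes(StandardCharsets.US_ASCII)));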

It depends on which characters appear in the document. ASCII is a 7-bit charset, while ISO-8859-1 is an 8-bit charset that supports some additional characters. If you are going to reproduce the document from an input stream, I recommend ISO-8859-1; it will work for plain text files such as those from Notepad and MS Word.
If the text uses other international characters, you need to pick a charset that supports them, such as UTF-8.
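If you want to decide programmatically whether plain US-ASCII is enough for a given piece of text, a small check along these lines should work (a sketch, not part of the original answer):
import java.nio.charset.StandardCharsets;

String text = "Bjørn";

// canEncode() reports whether every character of the text is representable
// in the target charset; here it is false because of the ø.
boolean fitsInAscii = StandardCharsets.US_ASCII.newEncoder().canEncode(text);
System.out.println(fitsInAscii);   // false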

Related

Fix mixed encoding in string

I have a file which contains the following string:
AAdοbe Dοcument Clοud
if viewed in Notepad++. In hex view, each ο shows up as the byte pair CE BF.
If I read the file with Java the string looks like this:
AAdÎ¿be DÎ¿cument ClÎ¿ud
How can I get the same encoding in Java as with Notepad++?
Your file is encoded as UTF-8, and the bytes CE BF are the UTF-8 encoding of the character ο ('GREEK SMALL LETTER OMICRON' (U+03BF)).
If you use the Encoding pull-down menu in Notepad++ to specify UTF-8, you should see the content as:
AAdοbe Dοcument Clοud
You might want to replace those Greek ο's with regular Latin o's ('LATIN SMALL LETTER O' (U+006F)).
If you decide to keep the Greek ο's, you need to make sure your Java program reads the file using UTF-8, which is best done using one of these:
BufferedReader reader = Files.newBufferedReader(Paths.get("file.txt")); // UTF-8 is the default
BufferedReader reader = Files.newBufferedReader(Paths.get("file.txt"), StandardCharsets.UTF_8);
If you look at the text with a debugger, you should see that it is now read correctly. If you print the text, make sure the console window you're using can handle UTF-8 characters, otherwise it might just print wrong, even though it was read correctly.
You must set the encoding on the file reader, like this (this FileReader constructor takes a Charset and requires Java 11 or later):
new FileReader(fileName, StandardCharsets.UTF_8)
You must read the file in Java using the same encoding the file was written with.
If you are working with non-standard encodings, even querying the encoding with something like:
InputStreamReader r = new InputStreamReader(new FileInputStream(theFile));
r.getEncoding()
can report the wrong value.
There's little library which handles recognition of encoding a bit better: https://code.google.com/archive/p/juniversalchardet/
It also has some gaps in detecting the proper encoding, but I have used it.
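For reference, a detection pass with juniversalchardet looks roughly like this (adapted from the library's documented API; fileName stands for whatever file you are probing):
import java.io.FileInputStream;
import org.mozilla.universalchardet.UniversalDetector;

byte[] buf = new byte[4096];
UniversalDetector detector = new UniversalDetector(null);   // argument is an optional listener

try (FileInputStream fis = new FileInputStream(fileName)) {
    int nread;
    while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
        detector.handleData(buf, 0, nread);
    }
}
detector.dataEnd();

String detected = detector.getDetectedCharset();   // null if nothing was recognized
detector.reset();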
And while using it I found out that most non-standard encodings can be read with UTF-16, like:
new FileReader(fileName, StandardCharsets.UTF_16)
Java has supported the UTF-16 encoding for a long time; it is defined in the standard API as StandardCharsets.UTF_16. That character set covers a lot of language-specific characters and emoji.

Unable to convert Hyphen to UTF-8

I'm reading some text that I got from Wikipedia.
The text contains a hyphen, as in this String: "Australia for the [[2011–12 NBL season]]"
What I'm trying to do is convert the text to UTF-8, using this code:
String myStr = "Australia for the [[2011–12 NBL season]]";
new String(myStr.getBytes(), "utf-8");
The result is:
Australia for the [[2011�12 NBL season]]
The problem is that the hyphen is not being mapped correctly.
The hyphen value in bytes is [-106] (I have no idea what to do with it...)
Do you know how to convert it to a hyphen that utf-8 encoding recognizes?
I would be happy to replace other special characters as well by some general code, but also specific "hyphens" replacement code will help.
The problem code point is U+2013 EN DASH which can be represented with the escape \u2013.
Try replacing the string with "2011\u201312". If this works then there is a mismatch between your editor character encoding and the one the compiler is using.
Otherwise, the problem is with the transcoding operation from string to whatever device you are writing to. Anywhere where you convert from bytes to chars or chars to bytes is a potential point of corruption when the wrong encoding is used; this can include System.out.
Note: Java strings are always UTF-16.
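As a minimal sketch of keeping the encoding explicit at those byte/char boundaries (the output file name is made up, and the PrintStream charset constructor requires Java 10+):
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

String myStr = "Australia for the [[2011\u201312 NBL season]]";

// Write the file explicitly as UTF-8 instead of relying on the platform default.
Files.write(Paths.get("out.txt"), myStr.getBytes(StandardCharsets.UTF_8));

// If you also print to the console, give the stream an explicit encoding (Java 10+).
PrintStream utf8Out = new PrintStream(System.out, true, StandardCharsets.UTF_8);
utf8Out.println(myStr);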
new String(myStr.getBytes(), "utf-8");
This code takes the UTF-16 string, converts it to the platform encoding (which might be anything), then pretends the result is UTF-8 and converts it back to UTF-16. At best the platform encoding is UTF-8 and this is a no-op; otherwise it will just corrupt the data.
This is how you create UTF-8 in Java:
byte[] utf8 = str.getBytes(StandardCharsets.UTF_8); // Java 7
This happens when the source code is saved by the editor in one encoding, perhaps Windows-1252 (extended Latin-1), but compiled by the compiler as another, e.g. UTF-8. The two encodings must match, or you can write the character in the source as the escape "\u2013", which keeps the source file pure ASCII.

Java - UTF8/16 is a Charset Name or Character Encoding?

The application I am developing will be used by folks in Western & Eastern Europe as well in the US. I am encoding my input and decoding my output with UTF-8 character set.
My confusion arises because when I use the method String(byte[] bytes, String charsetName), I provide UTF-8 as the charset name when it really is a character encoding. And my default encoding is set in Eclipse as Cp1252.
Does this mean that if, in the US, my Java application creates an output text file using Cp1252 as the charset encoding and UTF-8 as the charset name, the folks in Europe will be able to read this file in my Java application, and vice versa?
They're encodings. It's a pity that Java uses "charset" all over the place when it really means "encoding", but that's hard to fix now :( Annoyingly, IANA made the same mistake.
Actually, by Unicode terminology they're probably most accurately character encoding schemes:
A character encoding form plus byte serialization. There are seven character encoding schemes in Unicode: UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE.
Where a character encoding form is:
Mapping from a character set definition to the actual code units used to represent the data.
Yes, the fact that Unicode defines only seven character encoding schemes makes this even more confusing. Fundamentally, all that most developers need to know is that a "charset" in Java terminology is a mapping between text data (String, char[]) and binary data (byte[]).
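A small sketch of that mapping (the sample text is invented; windows-1252 is shipped with the usual JDKs but is not one of the guaranteed charsets):
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

String text = "café";

// Encoding: String -> byte[] under a named charset.
byte[] utf8Bytes   = text.getBytes(StandardCharsets.UTF_8);             // é becomes C3 A9 (two bytes)
byte[] cp1252Bytes = text.getBytes(Charset.forName("windows-1252"));    // é becomes E9 (one byte)

// Decoding: byte[] -> String. Decoding with the wrong charset gives mojibake.
System.out.println(new String(utf8Bytes, StandardCharsets.UTF_8));              // café
System.out.println(new String(utf8Bytes, Charset.forName("windows-1252")));     // cafÃ©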
I think those two things are not directly related.
The Eclipse setting decides how your Eclipse editor will save the text file (typically source code) you created/edited. You can use other editors, and the file may therefore be saved in some other encoding scheme. As long as your Java compiler has no problem compiling your source code, you're safe.
The
java String(byte[] bytes, String charsetName)
constructor is your own application logic that decides how you want to interpret data you read either from a file or the network. A different charsetName (essentially a different character encoding scheme) may interpret the byte array differently.
A "charset" does imply the set of characters that the text uses. For UTF-8/16, the character set happens to be "all" characters. For others, not necessarily. Back in the day, everybody was inventing their own character sets and encoding schemes, and the two were almost a 1-to-1 mapping, so one name could be used to refer to both the character set and the encoding scheme.

Java safeguards for when UTF-16 doesn't cut it

My understanding is that Java uses UTF-16 by default (for String and char and possibly other types) and that UTF-16 is a major superset of most character encodings on the planet (though, I could be wrong). But I need a way to protect my app for when it's reading files that were generated with encodings (I'm not sure if there are many, or none at all) that UTF-16 doesn't support.
So I ask:
Is it safe to assume the file is UTF-16 prior to reading it, or, to maximize my chances of not getting NPEs or other malformed input exceptions, should I be using a character encoding detector like JUniversalCharDet or JCharDet or ICU4J to first detect the encoding?
Then, when writing to a file, I need to be sure that a character/byte didn't make it into the in-memory object (the String, the OutputStream, whatever) that produces garbage text/characters when written to a string or file. Ideally, I'd like to have some way of making sure that such a garbage-producing character gets caught before making it into the file that I am writing. How do I safeguard against this?
Thanks in advance.
Java normally uses UTF-16 for its internal representation of characters. In Java, char arrays are a sequence of UTF-16-encoded Unicode code points. By default, char values are considered to be big-endian (as any Java basic type is). You should, however, not use char values to write strings to files or memory; you should make use of the character encoding/decoding facilities in the Java API (see below).
UTF-16 is not a major superset of encodings. Actually, UTF-8 and UTF-16 can both encode any Unicode code point. In that sense, Unicode does define almost any character that you possibly want to use in modern communication.
If you read a file from disk and assume UTF-16, you will quickly run into trouble. Most text files use ASCII or an extension of ASCII that uses all 8 bits of a byte. Examples of such extensions are UTF-8 (which can be used to read any ASCII text) or ISO 8859-1 (Latin-1). Then there are a lot of encodings, e.g. those used by Windows, that are extensions of those extensions. UTF-16 is not compatible with ASCII, so it should not be used as the default for most applications.
So yes, please use some kind of detector if you want to read a lot of plain text files with unknown encoding. This should answer question #1.
As for question #2, think of a file that is completely ASCII. Now you want to add a character that is not in ASCII. You choose UTF-8 (which is a pretty safe bet). There is no way of knowing whether the program that opens the file will correctly guess that it should use UTF-8. It may try to use Latin-1 or, even worse, assume 7-bit ASCII. In that case you get garbage. Unfortunately there are no smart tricks to make sure this never happens.
Look into the CharsetEncoder and CharsetDecoder classes to see how Java handles encoding/decoding.
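Roughly, that looks like the following (file names are placeholders); configuring the coders to REPORT errors is one way to catch garbage before it reaches the file, as asked in question #2:
import java.io.*;
import java.nio.charset.*;

// Reading: REPORT makes the reader throw MalformedInputException instead of
// silently substituting a replacement character for undecodable bytes.
CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT);
Reader in = new InputStreamReader(new FileInputStream("input.txt"), decoder);

// Writing: refuse to write characters the target charset cannot represent,
// so garbage is rejected before it reaches the output file.
CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT);
Writer out = new OutputStreamWriter(new FileOutputStream("output.txt"), encoder);
Note that Files.newBufferedReader already configures its decoder this way, which is why it throws MalformedInputException on bad input, whereas new InputStreamReader(in, charset) quietly replaces bad bytes with '\uFFFD'.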
Whenever a conversion between bytes and characters takes place, Java allows you to specify the character encoding to be used. If it is not specified, a machine-dependent default encoding is used. In some encodings, the bit pattern representing a certain character has no similarity with the bit pattern used for the same character in the UTF-16 encoding.
To question 1 the answer is therefore "no", you cannot assume the file is encoded in UTF-16.
Which characters are representable depends on the encoding used.

Java: Turkish Encoding Mac/Windows

I have a problem with turkish special characters on different machines. The following code:
String turkish = "ğüşçĞÜŞÇı";
String test1 = new String(turkish.getBytes());
String test2 = new String(turkish.getBytes("UTF-8"));
String test3 = new String(turkish.getBytes("UTF-8"), "UTF-8");
System.out.println(test1);
System.out.println(test2);
System.out.println(test3);
On a Mac, the three Strings are the same as the original string. On a Windows machine, the three lines are (printed with the NetBeans 6.7 console):
?ü?ç?Ü?Ç?
ğüşçĞÜŞÇı
?ü?ç?Ü?Ç?
I don't get the problem.
String test1 = new String(turkish.getBytes());
You're taking the Unicode String including the Turkish characters, and turning it into bytes using the default encoding (using the default encoding is usually a mistake). You're then taking those bytes and decoding them back into a String, again using the default encoding. The result is you've achieved nothing (except losing any characters that don't fit in the default encoding); whether you have put a String through an encode/decode cycle has no effect on what the following System.out.println(test1) does because that's still printing a String and not bytes.
String test2 = new String(turkish.getBytes("UTF-8"));
Encodes as UTF-8 and then decodes using the default encoding. On Mac the default encoding is UTF-8 so this does nothing. On Windows the default encoding is never UTF-8 so the result is the wrong characters.
String test3 = new String(turkish.getBytes("UTF-8"), "UTF-8");
Does precisely nothing.
To write Strings to stdout with a different encoding than the default encoding, you'd create an encoder, something like new OutputStreamWriter(System.out, "cp1252"), and send the string content to that.
However in this case, it looks like the console is using Windows code page 1252 Western European (+1 ATorres). There is no encoding mismatch issue here at all, so you won't be able to solve it by re-encoding strings!
The default encoding cp1252 matches the console's encoding, it's just that cp1252 doesn't contain the Turkish characters ğşĞŞı at all. You can see the other characters that are in cp1252, üçÜÇ, come through just fine. Unless you can reconfigure the console to use a different encoding that does include all the characters you want, there is no way you'll be able to output those characters.
Presumably on a Turkish Windows install, the default code page will be cp1254 instead and you will get the characters you expect (but other characters don't work). You can test this by changing the ‘Language to use for non-Unicode applications’ setting in the Regional and Language Options Control Panel app.
Unfortunately no Windows locale uses UTF-8 as the default code page. Putting non-ASCII output onto the console with the stdio stream functions is not something that's really reliable at all. There is a Win32 API to write Unicode directly to the console, but unfortunately nothing much uses it.
Don't rely on the console, or on the default platform encoding. Always specify the character encoding for calls like getBytes and the String constructor taking a byte array, and if you want to examine the contents of a string, print out the unicode value of each character.
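For example, to inspect the actual code units instead of trusting the console rendering, something like this (a sketch using the string from the question) will do:
String s = "ğüşçĞÜŞÇı";
for (int i = 0; i < s.length(); i++) {
    System.out.printf("U+%04X%n", (int) s.charAt(i));
}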
I would also advise either restricting your source code to use ASCII (and \uxxxx to encode non-ASCII characters) or explicitly specifying the character encoding when you compile.
Now, what bigger problem are you trying to solve?
You may be dealing with different settings of the default encoding.
java -Dfile.encoding=utf-8
versus
java -Dfile.encoding=something else
Or, you may just be seeing the fact that the Mac terminal window works in UTF-8, and the Windows DOS box does not work in UTF-8.
As per Mr. Skeet, you have a third possible problem, which is that you are trying to embed UTF-8 chars in your source. Depending on the compiler options, you may or may not be getting what you intend there. Put this data in a properties file, or use \u escapes.
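If you do keep non-ASCII characters in the source, the compiler's assumption can be made explicit on the command line (class name is hypothetical):
javac -encoding UTF-8 MyClass.java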
Finally, also per Mr. Skeet, never, ever call the zero-argument getBytes().
If you are using the AspectJ compiler, do not forget to set its encoding to UTF-8 too. I struggled for hours to find this.
