Handling non-English characters using Eclipse - Java

Below is the text that I would like to paste in bundle.properties file using Eclipse.
Честит рожден ден
Instead, Eclipse displays these characters in Unicode escape notation, as shown below:
\u0427\u0435\u0441\u0442\u0438\u0442 \u0440\u043E\u0436\u0434\u0435\u043D \u0434\u0435\u043D
How do I resolve this problem?

This is intended behavior. PropertyResourceBundle relies on the Properties class, whose load(InputStream) method always assumes the file is encoded in ISO-8859-1 (Latin-1)¹:
The input stream is in a simple line-oriented format as specified in load(Reader) and is assumed to use the ISO 8859-1 character encoding; that is each byte is one Latin1 character. Characters not in Latin1, and certain special characters, are represented in keys and elements using Unicode escapes as defined in section 3.3 of The Java™ Language Specification.
So converting your copied characters to Unicode escape sequences is the right thing to do to ensure that they will be loaded properly. At runtime, the ResourceBundle will contain the right character content.
While source files in Eclipse usually inherit the charset setting from their parent, up to the project-wide or even system-wide setting, Eclipse supports setting the charset encoding for individual files and conveniently switches it to ISO-8859-1 automatically for .properties files.
Note that starting with Java 9, you can use UTF-8 for properties resource bundles. This does not require additional configuration actions, as the charset encoding is determined by probing. As the documentation of the PropertyResourceBundle(InputStream) constructor states:
This constructor reads the property file in UTF-8 by default. If a MalformedInputException or an UnmappableCharacterException occurs on reading the input stream, then the PropertyResourceBundle instance resets to the state before the exception, re-reads the input stream in ISO-8859-1 and continues reading. If the system property java.util.PropertyResourceBundle.encoding is set to either "ISO-8859-1" or "UTF-8", the input stream is solely read in that encoding, and throws the exception if it encounters an invalid sequence.
This works because both encodings are identical for ASCII characters, while for real-life text it practically never happens that a non-ASCII ISO-8859-1 sequence forms a valid UTF-8 sequence. Note that this applies to PropertyResourceBundle, which performs the probing, but not to the Properties class, whose load(InputStream) method still uses only ISO-8859-1.
¹ I kept the statement in this absolute form for simplicity, even though, as elaborated at the end of this answer, Java 9 has lifted this restriction.
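For completeness, here is a minimal sketch (not part of the original answer; the key name greeting is made up) showing that Properties parses the \uXXXX escapes back into the original characters at load time:

import java.io.StringReader;
import java.util.Properties;

public class EscapedPropertiesDemo {
    public static void main(String[] args) throws Exception {
        // The escaped form that Eclipse produces for the pasted Cyrillic text
        String line = "greeting=\\u0427\\u0435\\u0441\\u0442\\u0438\\u0442 "
                + "\\u0440\\u043E\\u0436\\u0434\\u0435\\u043D \\u0434\\u0435\\u043D";

        Properties props = new Properties();
        props.load(new StringReader(line)); // load(Reader) resolves the escapes

        // Prints "Честит рожден ден" (provided the console can display Cyrillic)
        System.out.println(props.getProperty("greeting"));
    }
}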

Related

Is 'the local character set' the same as 'the encoding of the text data you want to process'?

The Oracle Java Documentation states the following boast in its Tutorial introduction to character streams:
A program that uses character streams in place of byte streams automatically adapts to the local character set and is ready for internationalization — all without extra effort by the programmer.
(http://docs.oracle.com/javase/tutorial/essential/io/charstreams.html)
My question is concerned with the meaning of the word 'automatically' in this context. Elsewhere the documentation warns
Data in text files is automatically converted to Unicode when its encoding matches the default file encoding of the Java Virtual Machine.... If the default file encoding differs from the encoding of the text data you want to process, then you must perform the conversion yourself. You might need to do this when processing text from another country or computing platform.
(http://docs.oracle.com/javase/tutorial/i18n/text/convertintro.html)
Is 'the local character set' in the first quote analogous to 'the encoding of the text data you want to process' of the second quote? And if so, is the second quote not exploding the boast of the first - that you don't need to do any conversion unless you need to do a conversion?
In the context of the first tutorial you linked, I read "local character set" as meaning the JVM's default character set.
For example:
inputStream = new FileReader("xanadu.txt");
They are creating a FileReader, which does not allow you to specify a Charset, so the JVM's default charset will be used:
FileReader(String) calls
InputStreamReader(InputStream), which calls
StreamDecoder.forInputStreamReader(InputStream, Object, String), with null as the last parameter
So Charset.defaultCharset() is used as the Charset
If you wanted to use an explicit charset, you would write:
inputStream = new InputStreamReader(new FileInputStream("xanadu.txt"), charset);
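Expanded into a runnable sketch (xanadu.txt is the tutorial's sample file; UTF-8 is merely an assumption about its actual encoding):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetRead {
    public static void main(String[] args) throws Exception {
        // new FileReader("xanadu.txt") would silently use Charset.defaultCharset();
        // wrapping a FileInputStream lets us name the encoding explicitly.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream("xanadu.txt"),
                        StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}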
No. The local character set is the character set (the table of characters and their respective codes) that the file uses, while the default encoding is the one the JVM uses to interpret characters (convert them to and from their codes). They are linked and very similar, but not exactly the same.
Also, it says that the conversion happens "automatically" because that is the function of the JVM: it converts the bytes of the text file into characters the program can work with, without extra effort from the programmer.

Determining ISO-8859-1 vs US-ASCII charset

I am trying to determine whether to use
PrintWriter pw = new PrintWriter(outputFilename, "ISO-8859-1");
or
PrintWriter pw = new PrintWriter(outputFilename, "US-ASCII");
I was reading All about character sets to determine the character set of an example file which I must create in the same encoding via Java code.
When my example file contains "European" letters (Norwegian: å ø æ), then the following command tells me the file encoding is "iso-8859-1"
file -bi example.txt
However, when I take a copy of the same example file and modify it to contain different data, without any Norwegian text (let's say, I replace "Bjørn" with "Bjorn"), then the same command tells me the file encoding is "us-ascii".
file -bi example-no-european-letters.txt
What does this mean? Is ISO-8859-1 in practise the same as US-ASCII if there are no "European" characters in it?
Should I just use the charset "ISO-8859-1" and everything will be OK?
If the file contains only 7-bit US-ASCII characters, it can be read as US-ASCII. That doesn't tell you anything about which charset was intended; it may just be a coincidence that there were no characters that would require a different encoding.
ISO-8859-1 (and -15) is a common European encoding, able to encode äöåéü and other characters, with the first 128 characters being the same as in US-ASCII (as is often the case, for convenience).
However you can't just pick an encoding and assume that "everything will be OK". The very common UTF-8 encoding also contains the US-ASCII charset, but it will encode for example äöå characters as two bytes instead of ISO-8859-1's one byte.
TL;DR: Don't assume things with encodings. Find out what was intended and use that. If you can't find it out, observe the data to try to figure out what is a correct charset to use (as you noted yourself, multiple encodings may work at least temporarily).
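To make the coincidence visible, here is a small sketch (not from the answer above) that encodes the question's sample names with each charset and dumps the bytes in hex:

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingComparison {
    static void dump(String text, Charset cs) {
        StringBuilder hex = new StringBuilder();
        for (byte b : text.getBytes(cs)) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        System.out.printf("%-6s as %-10s : %s%n", text, cs, hex);
    }

    public static void main(String[] args) {
        // Pure ASCII text encodes identically in all three charsets.
        dump("Bjorn", StandardCharsets.US_ASCII);
        dump("Bjorn", StandardCharsets.ISO_8859_1);
        dump("Bjorn", StandardCharsets.UTF_8);

        // 'ø' is one byte (F8) in ISO-8859-1, two bytes (C3 B8) in UTF-8,
        // and gets replaced by '?' (3F) in US-ASCII, which cannot encode it.
        dump("Bjørn", StandardCharsets.ISO_8859_1);
        dump("Bjørn", StandardCharsets.UTF_8);
        dump("Bjørn", StandardCharsets.US_ASCII);
    }
}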
It depends on the types of characters used in the respective document. ASCII is a 7-bit charset and ISO-8859-1 is an 8-bit charset which supports some additional characters. But mostly, if you are going to reproduce the document from an InputStream, I recommend the ISO-8859-1 charset. It will work for text files created with editors such as Notepad or MS Word.
If you are using other international characters, you need to check for a charset which supports those particular characters, such as UTF-8.

In Java, how to detect if a string is Unicode escaped

I have a property file which may or may not contain Unicode-escaped characters in the values of its keys. Please see the sample below. My job is to ensure that if a value in the property file contains a non-ASCII character, it is Unicode escaped. So, in the sample below, the first entry is OK, and all entries like the second one should be converted to the form of the first.
##sample.properties
escaped=cari\u00F1o
nonescaped=cariño
normal=darling
Essentially my question is: how can I differentiate in Java between cari\u00F1o and cariño, since as far as Java is concerned they are identical?
Properties files in Java must be saved in the ISO-8859-1 character set for Java to read them properly. That means that it is possible to use special characters from Western European languages without escaping them. It is not possible to use characters from other languages such as those from Eastern Europe, Russia, or China without escaping them.
As such, there are only a few non-ASCII characters that can appear in a properties file without being escaped.
To detect whether characters have been escaped or not, you will need to open the properties file directly, rather than through the Properties class, which does all the unescaping for you when you load a file through it. You should open the file as an InputStream, for example with a FileInputStream or via ClassLoader.getSystemResourceAsStream. Once you do so, you can scan through the input stream one byte at a time and ensure that every byte is in the range 0x20-0x7E, plus the newline characters \r and \n, which is the range of ASCII characters you would expect in a properties file.
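A minimal sketch of that byte-by-byte scan (the file name, and also allowing tab characters, are assumptions of this example):

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class FindUnescapedBytes {
    // Returns the offset of the first byte outside the printable ASCII range
    // (plus carriage return, line feed and tab), or -1 if every byte is plain ASCII.
    static long firstNonAsciiByte(InputStream in) throws IOException {
        long offset = 0;
        int b;
        while ((b = in.read()) != -1) {
            boolean ok = (b >= 0x20 && b <= 0x7E) || b == '\r' || b == '\n' || b == '\t';
            if (!ok) {
                return offset;
            }
            offset++;
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        try (InputStream in = new FileInputStream("sample.properties")) {
            long pos = firstNonAsciiByte(in);
            System.out.println(pos < 0
                    ? "All values are ASCII (escaped or plain)."
                    : "Unescaped non-ASCII byte at offset " + pos);
        }
    }
}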
I would suggest that your translators don't try to write properties files directly. They should provide you with documents like spreadsheets that you convert into properties file. Or they could use a translation editor such as Attesoro (which I wrote) to let them save the properties files properly escaped.
You could simply use the native2ascii tool, which performs exactly this conversion (it will convert all non-ASCII characters to escapes but leave existing escapes intact).
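If you would rather do that conversion in code than shell out to native2ascii, a rough sketch (my own approximation, not the tool itself) could look like this:

public class EscapeNonAscii {
    // Replace every char above 0x7F with a backslash-uXXXX escape, leaving
    // ASCII text (including already-escaped sequences) untouched.
    static String escape(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c > 0x7F) {
                out.append(String.format("\\u%04X", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Escapes the 'ñ', producing the same line as the "escaped" entry in the sample.
        System.out.println(escape("nonescaped=cariño"));
    }
}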
Your problem is that the Java Properties class decodes the properties files, assuming ISO-8859-1 encoding, and parsing escaped unicode characters.
So from a Properties point of view, these two strings are indeed the same.
I believe if you need to differentiate these two, you will need to write your own parser.
It's actually a feature that you do not need to care about by default. The one thing that strikes me as most odd is that the (only) encoding is ISO-8859-1, probably for historical reasons.
The library ICU4J seems to be what you're looking for. See the Normalization page.

Java - UTF8/16 is a Charset Name or Character Encoding?

The application I am developing will be used by folks in Western & Eastern Europe as well in the US. I am encoding my input and decoding my output with UTF-8 character set.
My confusion is because when I use the constructor String(byte[] bytes, String charsetName), I provide UTF-8 as the charset name when it really is a character encoding. And my default encoding is set in Eclipse as Cp1252.
Does this mean that if, in the US, I create an output text file in my Java application using Cp1252 as the charset encoding and UTF-8 as the charset name, the folks in Europe will be able to read this file in my Java application, and vice versa?
They're encodings. It's a pity that Java uses "charset" all over the place when it really means "encoding", but that's hard to fix now :( Annoyingly, IANA made the same mistake.
Actually, by Unicode terminology they're probably most accurately character encoding schemes:
A character encoding form plus byte serialization. There are seven character encoding schemes in Unicode: UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE.
Where a character encoding form is:
Mapping from a character set definition to the actual code units used to represent the data.
Yes, the fact that Unicode only defines seven character encoding schemes makes this even more confusing. Fundamentally, all most developers need to know is that a "charset" in Java terminology is a mapping between text data (String, char[]) and binary data (byte[]).
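A tiny sketch of that mapping in both directions (the sample text is arbitrary):

import java.nio.charset.StandardCharsets;

public class CharsetAsMapping {
    public static void main(String[] args) {
        String text = "café";

        // Text -> binary: the "charset" decides how each char becomes bytes.
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);         // 5 bytes, 'é' = C3 A9
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);  // 4 bytes, 'é' = E9

        // Binary -> text: decoding with the same charset restores the string ...
        System.out.println(new String(utf8, StandardCharsets.UTF_8));       // café
        // ... decoding with the wrong one does not.
        System.out.println(new String(utf8, StandardCharsets.ISO_8859_1));  // cafÃ©
    }
}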
I think those two things are not directly related.
The Eclipse setting decides how your Eclipse editor will save the text files (typically source code) you create or edit. You could use other editors, and the file might then be saved in some other encoding scheme. As long as your Java compiler has no problem compiling your source code, you're safe.
The constructor
String(byte[] bytes, String charsetName)
is part of your own application logic and determines how you want to interpret data you read from a file or from the network. A different charsetName (essentially a different character encoding scheme) may give a different interpretation of the byte array.
A "charset" does implies the set of characters that the text uses. For UTF-8/16, the character set happens to be "all" characters. For others, not necessarily. Back in the days, everybody were inventing their own character sets and encoding schemes, and the two were almost 1-to-1 mapping, therefore one name can be used to refer to both character set and encoding scheme.

What happens if your input file contains some unsupported character?

I have this text file which might contain some unsupported characters in the Latin1 character set, which is the default character set of my JVM.
What would those characters be turned into when my java program tries to read from the file?
Concretely, suppose I had a 2-byte-long character in the file; would it be read as a one-byte character (because each character in Latin1 is only 1 byte long)?
Thanks,
I can't use the InputStreamReader option, because the file has to be read as Latin1.
And
I have this text file which might contain some unsupported characters in the Latin1 character set ...
You have contradictory requirements here.
Either the file is LATIN-1 (and there are no "unsupported characters") or it is not LATIN-1. If it is not LATIN-1, you should be trying to find out what character set / encoding it really is, and use that one instead of LATIN-1 to read the file.
As other answers / comments have explained, you can either change the JVM's default character set, or specify a character set explicitly when you open the Reader.
I'm having trouble setting the default character set of my JVM .
Please explain what you are trying and what problems you are having.
(and was a bit afraid of messing it up!)
COWARD! :-)
FWIW - if you try to read a data stream in (say) LATIN-1 and the data stream is not actually in LATIN-1, then you can expect the following:
Characters that encode the same in LATIN-1 and the actual character set will be passed unharmed.
Characters that don't encode the same will either be replaced by a character that means "unknown character" (e.g. a question mark), or will be garbled. Which happens depends on whether the byte or byte sequence at issue encodes a valid (but wrong) character, or no character at all.
The net result will be partially garbled text. The garbling may or may not be reversible, depending on exactly what the real character set and characters are. But it is best to avoid "going there" ... by using the RIGHT character set to decode in the first instance.
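To illustrate both outcomes with made-up Norwegian sample text (assuming the bytes on disk are really UTF-8):

import java.nio.charset.StandardCharsets;

public class WrongCharsetDemo {
    public static void main(String[] args) {
        // Pretend the file on disk is really UTF-8 ...
        byte[] fileBytes = "Bjørn møtte Ståle".getBytes(StandardCharsets.UTF_8);

        // ... but it is decoded as ISO-8859-1: ASCII letters pass unharmed,
        // everything else is garbled (here: "BjÃ¸rn mÃ¸tte StÃ¥le").
        System.out.println(new String(fileBytes, StandardCharsets.ISO_8859_1));

        // Decoded as US-ASCII instead, each invalid byte becomes the replacement
        // character U+FFFD, so every 'ø' or 'å' turns into two "unknown" marks.
        System.out.println(new String(fileBytes, StandardCharsets.US_ASCII));
    }
}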
First of all, you can specify the character set to use when reading a file. See, for example: java.io.InputStreamReader
Secondly: yes, if you read using a 1-byte character set, then each byte will be mapped to one character.
Thirdly: test it and you shall see, beyond doubt, what actually happens!
If you don't know the charset, you'll have to guess it. This is tricky and error-prone.
Here is a question regarding this issue:
How can I detect the encoding/codepage of a text file
Check out how you can fool Notepad into guessing wrong.
