How to use a UTF8 properties file with Vaadin Bean Validation


I'm currently using Vaadin and an add-on named Vaadin Bean Validation for Java Bean Validation API 1.0 (JSR-303). The implementation of this API is hibernate-validator.
I have a custom properties file encoded in UTF-8, but with this mechanism, special letters such as "éèà" are always displayed incorrectly.
How can I fix this?

Per the specification, properties files are read using ISO-8859-1.
... the input/output stream is encoded in ISO 8859-1 character encoding. Characters that cannot be directly represented in this encoding can be written using Unicode escapes ; only a single 'u' character is allowed in an escape sequence. The native2ascii tool can be used to convert property files to and from other character encodings.
So any character outside the ISO-8859-1 range needs to be written as a Unicode escape sequence of the form \uXXXX. You can use the JDK-supplied native2ascii tool to convert the file. You can find it in the JDK's /bin folder.
Here's an example assuming that foo_utf8.properties is the one which you saved using UTF-8 and that foo.properties is the one which you'd like to use in your application:
native2ascii -encoding UTF-8 foo_utf8.properties foo.properties
If you're using an IDE such as Eclipse, then you can just use the builtin properties file editor which should automatically be associated with .properties files. If you use this editor instead of the plain text editor, then it'll automatically escape the characters which are not covered by the ISO-8859-1 range.
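If neither native2ascii nor the Eclipse editor is at hand (as far as I know, the tool is no longer shipped with JDK 9 and later), the same kind of escaping can be sketched in a few lines of Java. This is only an illustration, not the tool's actual implementation; the class and method names are made up for this example, and it escapes everything above 0x7E rather than only the characters outside ISO-8859-1.

```java
final class PropertiesEscaper {
    // Replaces every char above 0x7E with a backslash-u escape of its UTF-16 code unit.
    // Supplementary characters therefore become two escapes (a surrogate pair),
    // which is the form Properties.load expects.
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c > 0x7E) {
                out.append(String.format("\\u%04X", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "caf\u00E9" is "cafe" with an acute accent on the last letter.
        System.out.println(escapeNonAscii("caf\u00E9"));
    }
}
```

Applied line by line to the UTF-8 file, this should give a file that can be saved as plain ASCII and loaded the usual way.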

Escape the non-ASCII characters in the properties file.
e.g.:
foo.bar.max=Foo \u00E1 \u00E9 and \u00F6bar
will display as:
Foo á é and öbar
Here is a tool that can help you convert the characters: http://rishida.net/tools/conversion/
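To check that the escapes decode as expected, something like the following can be used. It is a minimal sketch: the class and method names are invented for the example, and the properties text is passed as a string instead of being read from a real file.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

final class EscapeDecodingDemo {
    // Parses properties-format text and returns the value stored under the given key.
    static String read(String propertiesText, String key) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(key);
    }

    public static void main(String[] args) {
        // The doubled backslashes put a literal backslash-u sequence into the text,
        // exactly as it would appear in the properties file.
        String text = "foo.bar.max=Foo \\u00E1 \\u00E9 and \\u00F6bar";
        System.out.println(read(text, "foo.bar.max"));
    }
}
```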

Related

Encoding of special properties in Eclipse

I have an application which uses swedish language in some java and jsp pages.
Swedish words are described in application.properties file and those names will be used in the application.
Application Screen:
The words defined in the properties file and the words I see in the JSP page are different.
button.search=Sök
I tried all content types in the settings. I still get this error, and because of these differing words my application does not work in Eclipse.
Could anyone please tell me what I need to change in Eclipse to make this application work?

From your screenshot it looks like your properties file is encoded in UTF-8,
thus ö is represented by 2 bytes.
But properties files must be encoded in ISO-8859-1 (optionally with \uXXXX escapes), not in UTF-8 or anything else.
Quoted from the javadoc of class
java.util.Properties:
The load(Reader) / store(Writer, String) methods load and store
properties from and to a character based stream in a simple
line-oriented format specified below. The load(InputStream) /
store(OutputStream, String) methods work the same way as the
load(Reader)/store(Writer, String) pair, except the input/output
stream is encoded in ISO 8859-1 character encoding.
Characters that cannot be directly represented in this encoding
can be written using Unicode escapes [...]
That means, you should store your application.properties file
in ISO-8859-1 encoding. Or better yet, you should write
button.search=S\u00F6k
instead of
button.search=Sök
Using the \uXXXX escapes for all non-ASCII characters has the advantage
that you can store the file in UTF-8 or any ISO-8859-x, and you get the same
file content anyway.
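The difference between the two load variants quoted from the javadoc can be reproduced with a small sketch like this one. The class and method names are made up for the example, and the file content is simulated with an in-memory byte array. It also shows a third option, assuming you control the loading code: keep the file in UTF-8 and pass a Reader with an explicit charset.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

final class Utf8PropertiesDemo {
    // load(InputStream) always decodes the bytes as ISO 8859-1.
    static String loadAsLatin1(byte[] fileBytes, String key) {
        Properties props = new Properties();
        try {
            props.load(new ByteArrayInputStream(fileBytes));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(key);
    }

    // load(Reader) uses whatever charset the Reader was created with, here UTF-8.
    static String loadAsUtf8(byte[] fileBytes, String key) {
        Properties props = new Properties();
        try {
            props.load(new InputStreamReader(
                    new ByteArrayInputStream(fileBytes), StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(key);
    }

    public static void main(String[] args) {
        // Simulates a file containing button.search=S(o-umlaut)k saved as UTF-8.
        byte[] utf8File = "button.search=S\u00F6k\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(loadAsLatin1(utf8File, "button.search")); // garbled: two chars instead of one
        System.out.println(loadAsUtf8(utf8File, "button.search"));   // the intended word
    }
}
```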

Just use Eclipse's Properties Editor. It saves a .properties file in the only allowed character encoding (ISO 8859-1) and \u escapes characters that are not in that character set.
It does have a hover display to show decoded codepoints but a view showing a table of name-value pairs would be nicer.

Maybe changing the workspace encoding will help. Go to Window -> Preferences -> General -> Workspace and change the text file encoding to UTF-8.

Change your Eclipse default content type:
Window > Preferences > General > Content Types, set UTF-8 as the default encoding for all content types.

Getting String ZweibrÃ¼cken in java

Hi, I have this entry in my properties file: ZX=Zweibrücken.
When I fetch this variable, its value changes and shows up as ZweibrÃ¼cken.
Normally I use Java code to read all values from the .properties file,
but in this case I get the wrong value.
Could you please help?

Java properties files MUST be encoded in ISO 8859-1, so if you have to put characters which do not belong to this encoding, you must encode them with \uXXXX where XXXX is the unicode codepoint of the character.
To create a valid properties file you should either use an editor for properties files, or encode non ascii characters in \uXXXX notation.
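The garbled value is what you get when UTF-8 bytes are decoded as ISO 8859-1, which can be demonstrated with a short sketch (class and method names are invented for this example). The repair method is only meant as a diagnostic and assumes the string really is UTF-8 that was misread as ISO 8859-1; the proper fix is still to escape the characters in the file.

```java
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    // Encodes the text as UTF-8 and decodes the bytes as ISO 8859-1,
    // mimicking what happens to a UTF-8 properties file.
    static String misread(String original) {
        return new String(original.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Reverses the misreading by going back to the bytes and decoding them as UTF-8.
    static String repair(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String city = "Zweibr\u00FCcken"; // u-umlaut written as an escape to keep this source ASCII-only
        String garbled = misread(city);
        System.out.println(garbled);
        System.out.println(repair(garbled));
    }
}
```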

Handling non-english characters using Eclipse

Below is the text that I would like to paste in bundle.properties file using Eclipse.
Честит рожден ден
Instead Eclipse displays these characters in unicode escape notation, as shown below:
\u0427\u0435\u0441\u0442\u0438\u0442 \u0440\u043E\u0436\u0434\u0435\u043D \u0434\u0435\u043D
How do I resolve this problem?

This is intended behavior. The PropertyResourceBundle relies on the Properties class, whose load method always assumes the file to be encoded in iso-latin-1¹:
The input stream is in a simple line-oriented format as specified in load(Reader) and is assumed to use the ISO 8859-1 character encoding; that is each byte is one Latin1 character. Characters not in Latin1, and certain special characters, are represented in keys and elements using Unicode escapes as defined in section 3.3 of The Java™ Language Specification.
So converting your copied characters to Unicode escape sequences is the right thing to do to ensure that they will be loaded properly. At runtime, the ResourceBundle will contain the right character content.
While in Eclipse, source files usually inherit the charset setting from their parent, to end up at the project or even system wide setting, it supports setting the charset encoding for single files and conveniently changes it automatically to iso-latin-1 for .properties files.
Note that starting with Java 9, you can use UTF-8 for properties resource bundles. This does not require additional configuration actions, as the charset encoding is determined by probing. As the documentation of the PropertyResourceBundle(InputStream) constructor states:
This constructor reads the property file in UTF-8 by default. If a MalformedInputException or an UnmappableCharacterException occurs on reading the input stream, then the PropertyResourceBundle instance resets to the state before the exception, re-reads the input stream in ISO-8859-1 and continues reading. If the system property java.util.PropertyResourceBundle.encoding is set to either "ISO-8859-1" or "UTF-8", the input stream is solely read in that encoding, and throws the exception if it encounters an invalid sequence.
This works, as both encodings are identical for ASCII characters, while for non-ASCII sequences, it practically never happens for real life text that an iso-latin-1 sequence forms a valid UTF-8 sequence. This applies to PropertyResourceBundle, which handles this probing, not to the Properties class, which still only uses iso-latin-1 in its load(InputStream) method.
¹ I kept the statement in this absolute form for simplicity, even though, as elaborated at the end of this answer, Java 9 has lifted this restriction.
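The probing can be tried out with a sketch like the following. It assumes Java 9 or later and that the system property java.util.PropertyResourceBundle.encoding is not set; the class and method names are made up, and the file is simulated with an in-memory byte array.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.PropertyResourceBundle;

final class BundleEncodingDemo {
    // Encodes the text in the given charset, then reads it back through the
    // PropertyResourceBundle(InputStream) constructor without naming the charset.
    static String readViaBundle(String propertiesText, Charset fileCharset, String key) {
        byte[] bytes = propertiesText.getBytes(fileCharset);
        try {
            return new PropertyResourceBundle(new ByteArrayInputStream(bytes)).getString(key);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Cyrillic text stored as UTF-8: read directly as UTF-8.
        String cyrillic = "greeting=\u0427\u0435\u0441\u0442\u0438\u0442";
        System.out.println(readViaBundle(cyrillic, StandardCharsets.UTF_8, "greeting"));

        // A Latin-1 file with a byte that is invalid in UTF-8: the bundle falls back to ISO-8859-1.
        String swedish = "button.search=S\u00F6k";
        System.out.println(readViaBundle(swedish, StandardCharsets.ISO_8859_1, "button.search"));
    }
}
```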

In Java, How to detect if a string is unicode escaped

I have a properties file which may or may not contain Unicode-escaped characters in the values of its keys. Please see the sample below. My job is to ensure that if a value in the properties file contains a non-ASCII character, it is Unicode escaped. So, in the sample below, the first entry is OK, and all entries like the second one should be converted to the form of the first.
##sample.properties
escaped=cari\u00F1o
nonescaped=cariño
normal=darling
Essentially my question is how can I differentiate in Java between cari\u00F1o and cariño since as far as Java is concerned it treats them as identical.

Properties files in Java must be saved in the ISO-8859-1 character set for Java to read them properly. That means that it is possible to use special characters from Western European languages without escaping them. It is not possible to use characters from other languages such as those from Eastern Europe, Russia, or China without escaping them.
As such there are only a few non-ascii characters that can appear in a properties file without being escaped.
To detect whether characters have been escaped or not, you will need to open the properties file directly, rather than through the Properties class. The Properties class does all the unescaping for you when you load a file through it. You should open the file as an InputStream, using FileInputStream or Class.getResourceAsStream. Once you do so, you can scan through the input stream one byte at a time and ensure that all bytes are in the 0x20-0x7E range, plus the line terminators \r and \n, which is the ASCII range of characters you would expect in a properties file.
I would suggest that your translators don't try to write properties files directly. They should provide you with documents like spreadsheets that you convert into properties file. Or they could use a translation editor such as Attesoro (which I wrote) to let them save the properties files properly escaped.
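The byte-by-byte check suggested in this answer could look roughly like this. It is a sketch with invented names, following the range given above literally, so a tab character would also be flagged; for a real file you would probably wrap the stream in a BufferedInputStream.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class AsciiScanner {
    // Returns the 1-based numbers of all lines that contain a byte outside
    // 0x20-0x7E. CR and LF are accepted as line terminators.
    static List<Integer> linesWithUnexpectedBytes(InputStream in) {
        List<Integer> flagged = new ArrayList<>();
        int line = 1;
        try {
            int b;
            while ((b = in.read()) != -1) {
                if (b == '\n') {
                    line++;
                } else if (b != '\r' && (b < 0x20 || b > 0x7E)) {
                    // Record each offending line only once.
                    if (flagged.isEmpty() || flagged.get(flagged.size() - 1) != line) {
                        flagged.add(line);
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return flagged;
    }

    public static void main(String[] args) {
        // Line 1 is escaped, line 2 contains a raw n-tilde, line 3 is plain ASCII.
        String sample = "escaped=cari\\u00F1o\nnonescaped=cari\u00F1o\nnormal=darling\n";
        byte[] bytes = sample.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(linesWithUnexpectedBytes(new ByteArrayInputStream(bytes)));
    }
}
```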

You could simply use the native2ascii tool, which performs exactly this conversion (it will convert all non-ASCII characters to escapes but leave existing escapes intact).

Your problem is that the Java Properties class decodes the properties files, assuming ISO-8859-1 encoding, and parsing escaped unicode characters.
So from a Properties point of view, these two strings are indeed the same.
I believe if you need to differentiate these two, you will need to write your own parser.
It's actually a feature that you do not need to care about this by default. The one thing that strikes me as the most odd is that the (only) encoding is ISO-8859-1, probably for historical reasons.
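A very small "own parser" that works on the raw text instead of going through Properties might look like this. It is deliberately simplified and all names are hypothetical: it only splits at the first '=', skips comment lines, and ignores ':' separators, escaped separators and line continuations, all of which a complete properties parser would have to handle.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

final class RawValueClassifier {
    enum Kind { PLAIN, ESCAPED, RAW_NON_ASCII }

    // Matches a literal backslash followed by 'u' and four hex digits.
    private static final Pattern UNICODE_ESCAPE = Pattern.compile("\\\\u[0-9A-Fa-f]{4}");

    // Classifies a raw (not yet unescaped) value.
    static Kind classify(String rawValue) {
        if (rawValue.chars().anyMatch(c -> c > 0x7E)) {
            return Kind.RAW_NON_ASCII;
        }
        return UNICODE_ESCAPE.matcher(rawValue).find() ? Kind.ESCAPED : Kind.PLAIN;
    }

    // Classifies every key=value line of the raw file text, keeping file order.
    static Map<String, Kind> classifyAll(String rawFileText) {
        Map<String, Kind> result = new LinkedHashMap<>();
        for (String line : rawFileText.split("\\R")) {
            String trimmed = line.trim();
            int eq = trimmed.indexOf('=');
            if (trimmed.isEmpty() || trimmed.startsWith("#") || trimmed.startsWith("!") || eq < 0) {
                continue; // blank line, comment, or no '=' separator
            }
            result.put(trimmed.substring(0, eq).trim(), classify(trimmed.substring(eq + 1)));
        }
        return result;
    }

    public static void main(String[] args) {
        String sample = "##sample.properties\n"
                + "escaped=cari\\u00F1o\n"
                + "nonescaped=cari\u00F1o\n"
                + "normal=darling";
        System.out.println(classifyAll(sample));
    }
}
```

The raw text has to be read with the encoding the file was actually saved in, otherwise the non-ASCII check sees different characters than the author typed (although it would still flag them).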

The library ICU4J seems to be what you're looking for. See the Normalization page.

Using a unicode character in a .java file?

I want to set a unicode character in a class file like this:
TextView tv = ...;
tv.setText("·");
is there anything potentially wrong with using a unicode character in a .java file?
Thanks

No. Java strings support Unicode so you shouldn't run into any problems. You might have to check that the TextView class handles all the Unicode characters (which it should), but Java itself will handle the unicode characters.
You should also ensure that the file is saved with the correct encoding settings. Essentially this means that your editor should save the java file as UTF-8 encoded Unicode. See the comments to this answer for more details on this.

Is there anything potentially wrong with using a unicode character in a .java file?
As you know, Strings within the JVM are stored as Unicode - so the question is how to deal with Unicode in Java source files ...
In short, using Unicode is fine. There are a few ways to approach it ...
By default, the javac compiler expects the source file to be in the platform default encoding. This can be overridden using the -encoding flag:
-encoding encoding
Sets the source file encoding name, such as
EUCJIS/SJIS/ISO8859-1/UTF8. If -encoding is not specified, the
platform default converter is used.
Alternatively, if it's a single character (like it appears to be), you can keep your source file in your platform default encoding, and specify the character using the Unicode escape sequence:
tv.setText("\u1234");
... where '1234' is the Unicode value for the character you want.
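For the middle dot from the question, the value would be 00B7. If you do not know the value for a character, a sketch like this one prints it in the escape form used above. The names are made up, and it assumes a character from the Basic Multilingual Plane, since anything beyond that needs two escapes.

```java
final class SourceEscapeDemo {
    // Formats the first code point of the text as a Java source escape.
    // Only meaningful for BMP characters (code points up to FFFF).
    static String toSourceEscape(String text) {
        return String.format("\\u%04X", text.codePointAt(0));
    }

    public static void main(String[] args) {
        String middleDot = "\u00B7"; // the character from the question, written as an escape
        System.out.println(toSourceEscape(middleDot));
        System.out.println(middleDot.length()); // a single char, whatever the source file encoding
    }
}
```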
Another alternative is to first save your file in your Unicode-compatible encoding (say UTF-8), then use native2ascii to convert that file to your native encoding (it will convert any out of range characters to the corresponding Unicode escape sequence).
NAME
native2ascii - native to ASCII converter
SYNOPSIS
native2ascii [ options ] [ inputfile [outputfile]]
DESCRIPTION
The Java compiler and other Java tools can only process files that contain Latin-1 or Unicode-encoded (\udddd notation) characters.
native2ascii converts files that contain other character encoding into
files containing Latin-1 or Unicode-encoded charaters.
If outputfile is omitted, standard output is used for output. In addition, if inputfile is omitted, standard input is used for input.
