I want to set a unicode character in a class file like this:
TextView tv = ...;
tv.setText("·");
is there anything potentially wrong with using a unicode character in a .java file?
Thanks
No. Java strings support Unicode, so you shouldn't run into any problems. You might want to check that the TextView class handles all the Unicode characters you need (which it should), but Java itself will handle the Unicode characters fine.
You should also ensure that the file is saved with the correct encoding settings. Essentially this means that your editor should save the .java file as UTF-8-encoded Unicode. See the comments to this answer for more details on this.
Is there anything potentially wrong with using a unicode character in a .java file?
As you know, Strings within the JVM are stored as Unicode - so the question is how to deal with Unicode in Java source files ...
In short, using Unicode is fine. There are a few ways to approach it ...
By default, the javac compiler expects the source file to be in the platform default encoding. This can be overridden using the -encoding flag:
-encoding encoding
Sets the source file encoding name, such as
EUCJIS/SJIS/ISO8859-1/UTF8. If -encoding is not specified, the
platform default converter is used.
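For example, a hypothetical invocation (substitute your own source file name):
javac -encoding UTF-8 MyTextViewActivity.java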
Alternatively, if it's a single character (as it appears to be), you can keep your source file in your platform default encoding and specify the character using a Unicode escape sequence:
tv.setText("\u1234");
... where '1234' is the Unicode value for the character you want.
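For the middle dot from your example, '·' is U+00B7, so this becomes:
tv.setText("\u00B7");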
Another alternative is to first save your file in your Unicode-compatible encoding (say UTF-8), then use native2ascii to convert that file to your native encoding (it will convert any out of range characters to the corresponding Unicode escape sequence).
NAME
native2ascii - native to ASCII converter
SYNOPSIS
native2ascii [ options ] [ inputfile [outputfile]]
DESCRIPTION
The Java compiler and other Java tools can only process files that contain Latin-1 or Unicode-encoded (\udddd notation) characters.
native2ascii converts files that contain other character encodings into
files containing Latin-1 or Unicode-encoded characters.
If outputfile is omitted, standard output is used for output. In addition, if inputfile is omitted, standard input is used for input.
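For example, a hypothetical round trip where the source was saved as UTF-8:
native2ascii -encoding UTF-8 MainUtf8.java Main.java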
Related
I have an application which uses the Swedish language in some Java and JSP pages.
Swedish words are defined in the application.properties file, and those names are used in the application.
Application Screen:
The words defined in the properties file and the words I am seeing in the JSP page are different.
button.search=Sök
I tried all content types in the settings. I am still getting this error, and because of these different words my application is not working in Eclipse.
Could anyone please tell me what changes I need to make in Eclipse to make this application work?
From your screenshot it looks like your properties file is encoded in UTF-8,
thus ö is represented by 2 bytes.
But properties files must be encoded in ISO-8859-1 (optionally with \uXXXX escapes), not in UTF-8 or anything else.
Quoted from the javadoc of class
java.util.Properties:
The load(Reader) / store(Writer, String) methods load and store
properties from and to a character based stream in a simple
line-oriented format specified below. The load(InputStream) /
store(OutputStream, String) methods work the same way as the
load(Reader)/store(Writer, String) pair, except the input/output
stream is encoded in ISO 8859-1 character encoding.
Characters that cannot be directly represented in this encoding
can be written using Unicode escapes [...]
That means you should store your application.properties file
in ISO-8859-1 encoding. Or better yet, you should write
button.search=S\u00F6k
instead of
button.search=Sök
Using the \uXXXX escapes for all non-ASCII characters has the advantage
that you can store the file in UTF-8 or any ISO-8859-x, and you get the same
file content anyway.
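For instance, a minimal sketch showing that load(InputStream) turns the escape back into ö (the file name and key are taken from your example):
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

public class PropsDemo {
    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        // load(InputStream) reads the bytes as ISO-8859-1 and also
        // decodes \uXXXX escapes, so S\u00F6k comes back as "Sök".
        try (FileInputStream in = new FileInputStream("application.properties")) {
            props.load(in);
        }
        System.out.println(props.getProperty("button.search")); // prints: Sök
    }
}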
Just use Eclipse's Properties Editor. It saves a .properties file in the only allowed character encoding (ISO 8859-1) and \u escapes characters that are not in that character set.
It does have a hover display to show decoded codepoints but a view showing a table of name-value pairs would be nicer.
Maybe changing the workspace encoding will help. Go to Window -> Preferences -> General -> Workspace and change the text file encoding to UTF-8.
Change your Eclipse default content type:
Window > Preferences > General > Content Types, set UTF-8 as the default encoding for all content types.
I am trying to determine whether to use
PrintWriter pw = new PrintWriter(outputFilename, "ISO-8859-1");
or
PrintWriter pw = new PrintWriter(outputFilename, "US-ASCII");
I was reading All about character sets to determine the character set of an example file which I must create in the same encoding via Java code.
When my example file contains "European" letters (Norwegian: å ø æ), then the following command tells me the file encoding is "iso-8859-1"
file -bi example.txt
However, when I take a copy of the same example file and modify it to contain different data, without any Norwegian text (let's say, I replace "Bjørn" with "Bjorn"), then the same command tells me the file encoding is "us-ascii".
file -bi example-no-european-letters.txt
What does this mean? Is ISO-8859-1 in practice the same as US-ASCII if there are no "European" characters in it?
Should I just use the charset "ISO-8859-1" and everything will be ok?
If the file contains only 7-bit US-ASCII characters, it can be read as US-ASCII. That doesn't tell you anything about which charset was intended; it may be just a coincidence that there were no characters that would require a different encoding.
ISO-8859-1 (and -15) is a common European encoding, able to encode äöåéü and other characters, with the first 128 characters being the same as in US-ASCII (as is often the case, for compatibility reasons).
However, you can't just pick an encoding and assume that "everything will be OK". The very common UTF-8 encoding also contains US-ASCII as a subset, but it will encode characters such as äöå as two bytes instead of ISO-8859-1's one byte.
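A minimal sketch of that difference (assuming Java 7+ for StandardCharsets):
import java.nio.charset.StandardCharsets;

public class EncodingLengthDemo {
    public static void main(String[] args) {
        String s = "ö";
        // 'ö' is one byte (0xF6) in ISO-8859-1 but two bytes (0xC3 0xB6) in UTF-8.
        System.out.println(s.getBytes(StandardCharsets.ISO_8859_1).length); // 1
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);      // 2
    }
}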
TL;DR: Don't assume things with encodings. Find out what was intended and use that. If you can't find it out, observe the data to try to figure out what is a correct charset to use (as you noted yourself, multiple encodings may work at least temporarily).
It depends on the types of characters used in the document. ASCII is a 7-bit charset and ISO-8859-1 is an 8-bit charset which supports some additional characters. If you are going to reproduce the document from an input stream, I would generally recommend the ISO-8859-1 charset; it will work for plain text files such as those created in Notepad.
If you are using other international characters, you need to pick a charset which supports those particular characters, such as UTF-8.
I am trying to create a CSV file from my Java code.
File file = File.createTempFile("DummyReport", ".csv");
SomeListofObjects items = getSomeList();
FileUtils.write(file, "ID;CREATION;" + System.lineSeparator());
FileUtils.writeLines(file, items.getItems(), true);
return file;
I am facing some issue with special chars.
When I debug the code, I can see that I have the character "ö". But in the generated CSV file it comes out garbled as "ö".
Can we set the encoding in FileUtils or File? Can someone help me solve this?
First check if you are using a text viewer that displays your output correctly. If not, the problem might be your system encoding.
FileUtils.write(file, string) uses the default system encoding, which on your system seems to be an 8-bit one. The "ö" character, however, is encoded as two bytes, resulting in "ö".
Use FileUtils.write(File file, CharSequence data, String encoding) instead, with an appropriate encoding:
ISO 8859-1 (8-bit standard, Latin-1)
CP1252 (8-bit proprietary, Windows default, extends Latin-1)
MacRoman (8-bit proprietary, Apple default)
UTF-8 (variable-width standard, Linux default)
ISO 8859-15 (Latin-9, extends Latin-1; not always supported)
My suggestion is to use FileUtils.write(file, string, "UTF-8").
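Applied to the code from the question, that would look something like the following sketch (SomeListofObjects and getSomeList() are the question's own placeholders; error handling omitted):
File file = File.createTempFile("DummyReport", ".csv");
SomeListofObjects items = getSomeList();
// Write the header and the lines with an explicit encoding.
FileUtils.write(file, "ID;CREATION;" + System.lineSeparator(), "UTF-8");
FileUtils.writeLines(file, "UTF-8", items.getItems(), true);
return file;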
You do not specify an encoding when you write to your file.
The result of which is that the default encoding is used.
It appears however that you use UTF-8, and unfortunately, you use Excel.
And Excel cannot read UTF-8 until you prepend the file with a BOM... Which no other program requires.
So, you have two choices:
keep doing what you are doing and to hell with Excel;
prepend a BOM to the file and make the file unreadable with other programs!
Also, if you are using Java 7+, use Files.write() instead.
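For example, a minimal sketch of option two with Files.write() (the file name is just an illustration):
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class BomExample {
    public static void main(String[] args) throws IOException {
        // "\uFEFF" is the byte order mark that Excel needs to detect UTF-8.
        String content = "\uFEFF" + "ID;CREATION;" + System.lineSeparator();
        Files.write(Paths.get("DummyReport.csv"),
                content.getBytes(StandardCharsets.UTF_8));
    }
}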
Another solution would of course be to use ISO-8859-1 as the encoding, but... well, that's your choice.
I'm currently using Vaadin and an add-on named Vaadin Bean Validation for Java Bean Validation API 1.0 (JSR-303). The implementation of this API is hibernate-validator.
I have a custom property file with UTF-8 as its charset. But with this mechanism, special letters like "éèà" are always displayed incorrectly.
How can I fix this?
Properties files are as per specification read using ISO-8859-1.
... the input/output stream is encoded in ISO 8859-1 character encoding. Characters that cannot be directly represented in this encoding can be written using Unicode escapes ; only a single 'u' character is allowed in an escape sequence. The native2ascii tool can be used to convert property files to and from other character encodings.
So, any character which is not covered by the ISO-8859-1 range needs to be escaped in the Unicode escape sequences \uXXXX. You can use the JDK-supplied native2ascii tool to convert them. You can find it in JDK's /bin folder.
Here's an example assuming that foo_utf8.properties is the one which you saved using UTF-8 and that foo.properties is the one which you'd like to use in your application:
native2ascii -encoding UTF-8 foo_utf8.properties foo.properties
If you're using an IDE such as Eclipse, then you can just use the builtin properties file editor which should automatically be associated with .properties files. If you use this editor instead of the plain text editor, then it'll automatically escape the characters which are not covered by the ISO-8859-1 range.
Escape the UTF-8 characters in the properties file.
e.g.:
foo.bar.max=Foo \u00E1 \u00E9 and \u00F6bar
will display as:
Foo á é and öbar
Here is a tool that can help you convert the characters: http://rishida.net/tools/conversion/
When I compile my web application using ant I get the following compiler message:
unclosed character literal
The line of offending code is:
protected char[] diacriticVowelsArray = { 'á', 'é', 'í', 'ó', 'ú' };
What does the compiler message mean?
Java source files are normally expected to be in the platform default encoding unless you tell the compiler otherwise. Have you got your editor set up to save the source file in the encoding the compiler expects? The problem is that if you use a different encoding, the Java compiler will be confused (since the characters you're using are encoded differently between UTF-8 and other encodings) and be unable to decode your source.
It's also possible that your compiler is set up to use a different encoding than your editor. In that case, try:
javac -encoding UTF8 YourSourceFile.java
Use UTF-8 encoding for the text files containing your Java sources.
or
Use '\uCODE', where CODE is the Unicode code point for á, é, etc. (for example, for 'á' you write '\u00E1'; see the example below).
You might need this:
http://www.fileformat.info/info/unicode/char/e1/index.htm
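For example, the array from the question written entirely with escapes (á, é, í, ó, ú are U+00E1, U+00E9, U+00ED, U+00F3, U+00FA):
protected char[] diacriticVowelsArray = { '\u00E1', '\u00E9', '\u00ED', '\u00F3', '\u00FA' };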
Using " instead of the ' character worked for me.
The javac -encoding UTF8 parameter described above also worked.
This means that the compiler was not using UTF-8 encoding.