Java properties UTF-8 encoding in Eclipse - java

I've recently had to switch encoding of webapp I'm working on from ISO-xx to utf8. Everything went smooth, except properties files. I added -Dfile.encoding=UTF-8 in eclipse.ini and normal files work fine. Properties however show some strange behaviour.
If I copy utf8 encoded properties from Notepad++ and paste them in Eclipse, they show and work fine. When I reopen properties file, I see some Unicode characters instead of proper ones, like:
Zur\u00EF\u00BF\u00BDck instead of Zurück
but app still works fine.
If I start to edit properties, add some special characters and save, they display correctly, however they don't work and all previously working special characters don't work any more.
When I compare local version with CVS I can see special characters correctly on remote file and after update I'm at start again: app works, but Eclipse displays Unicode chars.
I tried changing file encoding by right clicking it and selecting „Other: UTF8” but it didn't help. It also said: „determined from content: ISO-8859-1”
I'm using Java 6 and Jboss Developer based on Eclipse 3.3
I can live with it by editing properties in Notepad++ and pasting them in Eclipse, but I would be grateful if someone could help me with fixing this in Eclipse.

Answer for "pre-Java-9" is below. As of Java 9, properties files are saved and loaded in UTF-8 by default, but falling back to ISO-8859-1 if an invalid UTF-8 byte sequence is detected. See the Java 9 release notes for details.
Properties files are ISO-8859-1 by definition - see the docs for the Properties class.
Spring has a replacement which can load with a specified encoding, using PropertiesFactoryBean.
EDIT: As Laurence noted in the comments, Java 1.6 introduced overloads for load and store which take a Reader/Writer. This means you can create a reader for the file with whatever encoding you want, and pass it to load. Unfortunately FileReader still doesn't let you specify the encoding in the constructor (aargh) so you'll be stuck with chaining FileInputStream and InputStreamReader together. However, it'll work.
For example, to read a file using UTF-8:
Properties properties = new Properties();
InputStream inputStream = new FileInputStream("path/to/file");
try {
Reader reader = new InputStreamReader(inputStream, "UTF-8");
try {
properties.load(reader);
} finally {
reader.close();
}
} finally {
inputStream.close();
}

Don't waste your time, you can use Resource Bundle plugin in Eclipse
Old Sourceforge page

It is not a problem with Eclipse. If you are using the Properties class to read and store the properties file, the class will escape all special characters.
From the class documentation:
When saving properties to a stream or loading them from a stream, the ISO 8859-1 character encoding is used. For characters that cannot be directly represented in this encoding, Unicode escapes are used; however, only a single 'u' character is allowed in an escape sequence. The native2ascii tool can be used to convert property files to and from other character encodings.
From the API, store() method:
Characters less than \u0020 and characters greater than \u007E are written as \uxxxx for the appropriate hexadecimal value xxxx.

Properties props = new Properties();
URL resource = getClass().getClassLoader().getResource("data.properties");
props.load(new InputStreamReader(resource.openStream(), "UTF8"));
Works like a charm
:-)

There are too many points in the process you describe where errors can occur, so I won't try to guess what you're doing wrong, but I think I know what's happening under the hood.
EF BF BD is the UTF-8 encoded form of U+FFFD, the standard replacement character that's inserted by decoders when they encounter malformed input. It sounds like your text is being saved as ISO-8859-1, then read as if it were UTF-8, then saved as UTF-8, then converted to the Properties format using native2ascii using the platform default encoding (e.g., windows-1252).
ü => 0xFC // save as ISO-8859-1
0xFC => U+FFFD // read as UTF-8
U+FFFD => 0xEF 0xBF 0xBD // save as UTF-8
0xEF 0xBF 0xBD => \u00EF\u00BF\u00BD // native2ascii
I suggest you leave the "file.encoding" property alone. Like "file.separator" and "line.separator", it's not nearly as useful as you would expect it to be. Instead, get into the habit of always specifying an encoding when reading and writing text files.

Properties props = new Properties();
URL resource = getClass().getClassLoader().getResource("data.properties");
props.load(new InputStreamReader(resource.openStream(), "UTF8"));
this works well in java 1.6. How can i do this in 1.5, Since Properties class does not have a method to pars InputStreamReader.

There is much easier way:
props.load(new InputStreamReader(new FileInputStream("properties_file"), "UTF8"));

Just another Eclipse plugin for *.properties files:
Properties Editor

You can define UTF-8 .properties files to store your translations and use ResourceBundle, to get values. To avoid problems you can change encoding:
String value = RESOURCE_BUNDLE.getString(key);
return new String(value.getBytes("ISO-8859-1"), "UTF-8");

This seems to work only for some characters ... including special characters for German, Portuguese, French. However, I ran into trouble with Russian, Hindi and Mandarin characters. These are not converted to Properties format 'native2ascii', instead get saved with ?? ?? ??
The only way I could get my app to display these characters correctly is by putting them in the properties file translated to UTF-8 format - as \u0915 instead of क, or \u044F instead of я.
Any advice?

I recommend you to use Attesoro (http://attesoro.org/). Is simple and easy to use. And is made in java.

I found a solution to this problem. You need to write file (*.properties) use standard "Properties", example:
Properties properties = new Properties();
properties.put("DB_DRIVER", "com.mysql.cj.jdbc.Driver");
properties.put("DB_URL", "jdbc:mysql://localhost:3306/world");
properties.put("DB_USERNAME", "root");
properties.put("DB_PASSWORD", "1111");
properties.put("DB_AUTO_RECONNECT", "true");
properties.put("DB_CHARACTER_ENCODING", "UTF-8");
properties.put("DB_USE_UNICODE", "true");
try {
properties.store(new FileWriter("src/connectionDB/base/db.properties"), "Comment writes");
} catch (IOException e) {
System.out.println(e.getMessage());
}
then, you can read file without mistakes:
try {
properties.load(new FileReader("src\\connectionDB\\base\\db.properties"));
properties.list(System.out);
} catch (IOException ex) {
System.out.println(ex.getMessage());
}
or
try {
String str = new String(Files.readAllBytes(Paths.get("src/connectionDB/base/db.properties")), StandardCharsets.UTF_8);
properties.load(new StringReader(str));
properties.list(System.out);
} catch (IOException e) {
System.out.println(e.getMessage());
}
or
InputStream inputStream = getClass().getClassLoader().getResourceAsStream("connectionDB/base/db.properties");
try {
Reader reader = new InputStreamReader(inputStream, "UTF-8");
try {
properties.load(reader);
properties.list(System.out);
} catch (IOException e) {
System.out.println(e.getMessage());
}
} catch (UnsupportedEncodingException e1) {
// TODO Auto-generated catch block
e1.printStackTrace();
}
never mind....
then close the code that creates this file and use file *.properties

If the properties are for XML or HTML, it's safest to use XML entities. They're uglier to read, but it means that the properties file can be treated as straight ASCII, so nothing will get mangled.
Note that HTML has entities that XML doesn't, so I keep it safe by using straight XML: http://www.w3.org/TR/html4/sgml/entities.html

Related

How do I write chinese charactes in ZipEntry?

I want to export a string(chinese text) to CSV file inside a zip file. Where do I need to set the encoding to UTF-8? Or what approach should I take (based on the code below) to display chinese characters in the exported CSV file?
This is the code I currently have.
ByteArrayOutputStream out = new ByteArrayOutputStream();
ZipOutputStream zipOut = new ZipOutputStream(out, StandardCharsets.UTF_8)
try {
ZipEntry entry = new ZipEntry("chinese.csv");
zipOut.putNextEntry(entry);
zipOut.write("类型".getBytes());
} catch (IOException e) {
e.printStackTrace();
} finally {
zipOut.close();
out.close();
}
Instead of "类型", I get "类型" in the CSV file.
First, you definitely need to change zipOut.write("类型".getBytes()); to zipOut.write("类型".getBytes(StandardCharsets.UTF_8)); Also, when you open your resultant CSV file, the editor might not be aware that the content is encoded in UTF-8. You may need to tell your editor that it is UTF-8 encoding. For instance, in Notepad, you can save your file with "Save As" option and change encoding to UTF-8. Also, your issue might be just wrong display issue rather than actual encoding. There is an Open Source Java library that has a utility that converts any String to Unicode Sequence and vice-versa. This utility helped me many times when I was working on diagnosing various charset related issues. Here is the sample what the code does
result = "Hello World";
result = StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence(result);
System.out.println(result);
result = StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString(result);
System.out.println(result);
The output of this code is:
\u0048\u0065\u006c\u006c\u006f\u0020\u0057\u006f\u0072\u006c\u0064
Hello World
The library can be found at Maven Central or at Github It comes as maven artifact and with sources and javadoc
Here is javadoc for the class StringUnicodeEncoderDecoder
I tried your inputs and got this:
System.out.println(StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence("类型"));
System.out.println(StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence("类型"));
And the output was:
\u7c7b\u578b
\u00e7\u00b1\u00bb\u00e5\u017e\u2039
So it looks like you did lose the info, and it is not just a display issue
The getBytes() method is one culprit, without an explicit charset it takes the default character set of your machine. As of the Java String documentation:
getBytes()
Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array.
getBytes(string charsetName)
Encodes this String into a sequence of bytes using the given charset, storing the result into a new byte array.
Furthermore, as #Slaw pointed out, make sure that you compile (javac -encoding <encoding>) your files with the same encoding the files are in:
-encoding Set the source file encoding name, such as EUC-JP and UTF-8. If -encoding is not specified, the platform default converter is used.
A call to closeEntry() was missing in the OP btw. I stripped the snippet down to what I found necessary to achieve the desired funcitonality.
try (FileOutputStream fileOut = new FileOutputStream("out.zip");
ZipOutputStream zipOut = new ZipOutputStream(fileOut)) {
zipOut.putNextEntry(new ZipEntry("chinese.csv"));
zipOut.write("类型".getBytes("UTF-8"));
zipOut.closeEntry();
}
Finally, as #MichaelGantman pointed out, you might want to check what is in which encoding using a tool like a hex-editor for example, also to rule out that the editor you view the result file in displays correct utf-8 in a wrong way. "类" in utf-8 is (hex) e7 b1 bb in utf-16 (the java default encoding) it is 7c 7b

Java: How to write "Arabic" in properties file?

I want to write "Arabic" in the message resource bundle (properties) file but when I try to save it I get this error:
"Save couldn't be completed
Some characters cannot be mapped using "ISO-85591-1" character encoding. Either change encoding or remove the character ..."
Can anyone guide please?
I want to write:
global.username = اسم المستخدم
How should I write the Arabic of "username" in properties file? So, that internationalization works..
BR
SC
http://sourceforge.net/projects/eclipse-rbe/
You can use the above plugin for eclipse IDE to make the Unicode conversion for you.
As described in the class reference for "Properties"
The load(Reader) / store(Writer, String) methods load and store properties from and to
a character based stream in a simple line-oriented format specified below.
The load(InputStream) / store(OutputStream, String) methods work the same way as the
load(Reader)/store(Writer, String) pair, except the input/output stream is encoded in
ISO 8859-1 character encoding. Characters that cannot be directly represented in this
encoding can be written using Unicode escapes ; only a single 'u' character is allowed
in an escape sequence. The native2ascii tool can be used to convert property files to
and from other character encodings.
Properties-based resource bundles must be encoded in ISO-8859-1 to use the default loading mechanism, but I have successfully used this code to allow the properties files to be encoded in UTF-8:
private static class ResourceControl extends ResourceBundle.Control {
#Override
public ResourceBundle newBundle(String baseName, Locale locale,
String format, ClassLoader loader, boolean reload)
throws IllegalAccessException, InstantiationException,
IOException {
String bundlename = toBundleName(baseName, locale);
String resName = toResourceName(bundlename, "properties");
InputStream stream = loader.getResourceAsStream(resName);
return new PropertyResourceBundle(new InputStreamReader(stream,
"UTF-8"));
}
}
Then of course you have to change the encoding of the file itself to UTF-8 in your IDE, and can use it like this:
ResourceBundle bundle = ResourceBundle.getBundle(
"package.Bundle", new ResourceControl());
new String(ret.getBytes("ISO-8859-1"), "UTF-8"); worked for me.
property file saved in ISO-8859-1 Encodiing.
If you are using Eclipse, you can choose "Window-->Preferences" and then filter on "content types". Then you should be able to set the default encoding. There's a screen shot showing this at the top of this post.
This is mainly an editor configuration issue. If you're working in Windows, you can edit the text in an editor that supports UTF-8. Notepad or Eclipse built-in editor should be more than enough, provided you've saved file as UTF-8. In Linux, I've used gedit and emacs successfully. In Notepad, you can do this by clicking 'Save As' button and choosing 'UTF-8' encoding. Other editors should have similar feature. Some editors might require font change in order to display letters correctly, but it seems that you don't have this issue.
Having said that, there are other steps to consider when performing i18n for arabic. You can find some useful links below. Make sure to use native2ascii on properties file before using it otherwise it might not work. I spent a lot of time until I figured this one out.
Tomcat webapps
Using nativ2ascii with properties files
Besides native2ascii tool mentioned in other answers there is a java Open Source library that can provide conversion functionality to be used in code
Library MgntUtils has a Utility that converts Strings in any language (including special characters and emojis to unicode sequence and vise versa:
result = "Hello World";
result = StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence(result);
System.out.println(result);
result = StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString(result);
System.out.println(result);
The output of this code is:
\u0048\u0065\u006c\u006c\u006f\u0020\u0057\u006f\u0072\u006c\u0064
Hello World
The library can be found at Maven Central or at Github It comes as maven artifact and with sources and javadoc
Here is javadoc for the class StringUnicodeEncoderDecoder

Google App Engine encoding

I'm trying to log some russian text:
LOG.info("тестирование русского");
But I get question symbols instead (viewing from web):
[app-id/app-version].:
15:18:44,753 INFO [class] -
???????????? ????????
Java file saved with UTF-8 encoding. All settings are default.
Even I read file in UTF-8 with russian characters and try to log something from it -- encoding is wrong too.
I had a similar problem with Hebrew text. I found out it was caused by the default encoding.
To check the default encoding, I used this code:
OutputStreamWriter out = new OutputStreamWriter(new ByteArrayOutputStream());
String encoding = out.getEncoding();
On my computer, the encoding is "UTF8". On the GAE server, it is "ASCII".
I solved the problem by replacing all the file readers in my code with:
new InputStreamReader(new FileInputStream(file), "UTF8"));
This tells Java to ignore the default encoding, and open all input files as UTF8.
Try this. Apparently GAE tries to autodetect encoding and fails.
Even constant strings were messed up
public class Util {
public static String FixRussianString(String string){
try {
return new String(string.getBytes("CP1251"), "UTF-8");
} catch (UnsupportedEncodingException e) {
e.printStackTrace();
}
return string;
}
}

Use cyrillic .properties file in eclipse project

I'm developing a small project and I'd like to use internationalization for it. The problem is that when I try to use .properties file with cyrillic symbols inside, the text is displayed as rubbish. When I hard-code the strings it's displayed just fine.
Here is my code:
ResourceBundle labels = ResourceBundle.getBundle("Labels");
btnQuit = new JButton(labels.getString("quit"));
And in my .properties file:
quit = Изход
And I get rubbish. When i try
btnQuit = new JButton("Изход);
It is displayed correctly. As far as I am aware, UTF-8 is the encoding used for the files.
Any ideas?
AnyEdit is an eclipse-plugin that allows you to easily convert your your properties files from and to unicode notation. (avoiding the use of command-line tools like native2ascii)
If you were using the Properties class alone (without resource bundle), since Java 1.6 you have the option to load the file with a custom encoding, using a Reader (rather than an InputStream)
I'd guess you can also use new PropertyResourceBundle(reader), rather than ResourceBundle.getBundle(..), where reader is:
Reader reader = new BufferedReader(new InputStreamReader(
getClass().getResourceAsStream("messages.properties"), "utf-8")));
Properties are ISO-8859-1 encoded by default. You must use native2ascii to convert your UTF-8 properties to a valid ISO-8859-1 properties file containing unicode escape sequences for all non-ISO-8859-1 characters.

File is not saved in UTF-8 encoding even when I set encoding to UTF-8

When I check my file with Notepad++ it's in ANSI encoding. What I am doing wrong here?
OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream(file), "UTF8");
try
{
out.write(text);
out.flush();
} finally
{
out.close();
}
UPDATE:
This is solved now, reason for jboss not understanding my xml wasn't encoding, but it was naming of my xml. Thanx all for help, even there really wasn't any problem...
If you're creating an XML file (as your comments imply), I would strongly recommend that you use the XML libraries to output this and write the correct XML encoding header. Otherwise your character encoding won't conform to XML standards and other tools (like your JBoss instance) will rightfully complain.
// Prepare the DOM document for writing
Source source = new DOMSource(doc);
// Prepare the output file
File file = new File(filename);
Result result = new StreamResult(file);
// Write the DOM document to the file
Transformer xformer = TransformerFactory.newInstance().newTransformer();
xformer.transform(source, result);
There's no such thing as plain text. The problem is that an application is decoding character data without you telling it which encoding the data uses.
Although many Microsoft apps rely on the presence of a Byte Order Mark to indicate a Unicode file, this is by no means standard. The Unicode BOM FAQ says more.
You can add a BOM to your output by writing the character '\uFEFF' at the start of the stream. More info here. This should be enough for applications that rely on BOMs.
UTF-8 is designed to be, in the common case, rather indistinguishable from ANSI. So when you write text to a file and encode the text with UTF-8, in the common case, it looks like ANSI to anyone else who opens the file.
UTF-8 is 1-byte-per-character for all ASCII characters, just like ANSI.
UTF-8 has all the same bytes for the ASCII characters as ANSI does.
UTF-8 does not have any special header characters, just as ANSI does not.
It's only when you start to get into the non-ASCII codepoints that things start looking different.
But in the common case, byte-for-byte, ANSI and UTF-8 are identical.
If there is no BOM (and Java doesn't output one for UTF8, it doesn't even recognize it), the text is identical in ANSI and UTF8 encoding as long as only characters in the ASCII range are being used. Therefore Notepad++ cannot detect any difference.
(And there seems to be an issue with UTF8 in Java anyways...)
The IANA registered type is "UTF-8", not "UTF8". However, Java should throw an exception for invalid encodings, so that's probably not the problem.
I suspect that Notepad is the problem. Examine the text using a hexdump program, and you should see it properly encoded.
Did you try to write a BOM at the beginning of the file? BOM is the only thing that can tell the editor the file is in UTF-8. Otherwise, the UTF-8 file can just look like Latin-1 or extended ANSI.
You can do it like this,
public final static byte[] UTF8_BOM = {(byte)0xEF, (byte)0xBB, (byte)0xBF};
...
OutputStream os = new FileOutputStream(file);
os.write(UTF8_BOM);
os.flush();
OutputStreamWriter out = new OutputStreamWriter(os, "UTF8");
try
{
out.write(text);
out.flush();
} finally
{
out.close();
}

Categories