I have a project where everything is in UTF-8. I was using the Properties.load(Reader) method to read properties files in this encoding. But now, I need to make the project compatible with Java 1.5, and the mentioned method doesn't exist in Java 1.5. There is only a load method that takes an InputStream as a parameter, which is assumed to be in ISO-8859-1.
Is there any simple way to make my project 1.5-compatible without having to change all the .properties files to ISO-8859-1? I don't really want to have a mix of encodings in my project (encodings are already a time sink one at a time, let alone when you mix them) or change all my project to ISO-8859-1.
With "a simple way" I mean "without creating a custom Properties class from scratch".
Could you use xml-properties instead? As I understand by the spec .properties files should be in ISO-8859-1, if you want other characters, they should be quoted, using the native2ascii tool.
One strategy that might work for this situation is as follows:
Read the bytes of the Reader into a ByteArrayOutputStream.
Once that is completed, call toByteArray() See below.
With the byte[] construct a ByteArrayInputStream
Use the ByteArrayInputStream in Properties.load(InputStream)
As pointed out, the above failed to actually convert the character set from UTF-8 to ISO-8859-1. To fix that, a tweak.
After the BAOS has been filled, instead of calling toByteArray()..
Call toString("ISO-8859-1") to get an ISO-8859-1 encoded String. Then look to..
Call String.getBytes() to get the byte[]
What you can do is open a thread that would read data using a BufferedReader then write out the data to a PipedOutputStream which is then linked by a PipedInputStream that load uses.
PipedOutputStream pos = new PipedOutputStream();
PipedInputStream pis = new PipedInputStream(pos);
ReaderRunnable reader = new ReaderRunnable(pos, new File("utfproperty.properties"));
Thread t = new Thread(reader);
t.start();
properties.load(pis);
t.join();
The BufferedReader will read the data one character at a time and if it detects it to be a character data not to be within the US-ASCII (i.e. low 7-bit) range then it writes "\u" + the character code into the PipedOutputStream.
ReaderRunnable would be a class that looks like:
public class ReaderRunnable implements Runnable {
public ReaderRunnable(OutputStream os, File f) {
this.os = os;
this.f = f;
}
private final OutputStream os;
private final File f;
public void run() {
// open file
// read file, escape any non US-ASCII characters
}
}
Now after writing all that I was thinking that someone should've had this problem before and solved it, and the best place to look for these things is in Apache Commons. Fortunately, they have an implementation there.
https://commons.apache.org/io/apidocs/org/apache/commons/io/input/ReaderInputStream.html
The implementation from Apache is not without flaws though. Your input file even if it is UTF-8 must only contain the characters from the ISO-8859-1 character set. The design I had provided above can handle that situation.
Depending on your build engine you can \uXXXX-escape the properties into the build target directory. Maven can filter them via the native2ascii-maven-plugin.
What I personally do in my projects is I keep my properties in UTF-8 files with an extension .uproperties and I convert them to ISO at the build time to .properties files using native2ascii.exe. This allows me to maintain my properties in UTF-8 and the Ant script does everything else for me.
What I just now experienced is, Make all .java files also UTF-8 encoding type (not only properties file where you store UTF-8 characters). This way there no need to use for InputStreamReader also. Also, make sure to compile to UTF-8 encoding.
This has worked for me without any added parameter of UTF-8.
To test this, write a simple stub program in eclipse and change the format of that java file by going to properties of that file and Resource section, to set the UTF-8 encoding format.
Related
I want to export a string(chinese text) to CSV file inside a zip file. Where do I need to set the encoding to UTF-8? Or what approach should I take (based on the code below) to display chinese characters in the exported CSV file?
This is the code I currently have.
ByteArrayOutputStream out = new ByteArrayOutputStream();
ZipOutputStream zipOut = new ZipOutputStream(out, StandardCharsets.UTF_8)
try {
ZipEntry entry = new ZipEntry("chinese.csv");
zipOut.putNextEntry(entry);
zipOut.write("类型".getBytes());
} catch (IOException e) {
e.printStackTrace();
} finally {
zipOut.close();
out.close();
}
Instead of "类型", I get "类型" in the CSV file.
First, you definitely need to change zipOut.write("类型".getBytes()); to zipOut.write("类型".getBytes(StandardCharsets.UTF_8)); Also, when you open your resultant CSV file, the editor might not be aware that the content is encoded in UTF-8. You may need to tell your editor that it is UTF-8 encoding. For instance, in Notepad, you can save your file with "Save As" option and change encoding to UTF-8. Also, your issue might be just wrong display issue rather than actual encoding. There is an Open Source Java library that has a utility that converts any String to Unicode Sequence and vice-versa. This utility helped me many times when I was working on diagnosing various charset related issues. Here is the sample what the code does
result = "Hello World";
result = StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence(result);
System.out.println(result);
result = StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString(result);
System.out.println(result);
The output of this code is:
\u0048\u0065\u006c\u006c\u006f\u0020\u0057\u006f\u0072\u006c\u0064
Hello World
The library can be found at Maven Central or at Github It comes as maven artifact and with sources and javadoc
Here is javadoc for the class StringUnicodeEncoderDecoder
I tried your inputs and got this:
System.out.println(StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence("类型"));
System.out.println(StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence("类型"));
And the output was:
\u7c7b\u578b
\u00e7\u00b1\u00bb\u00e5\u017e\u2039
So it looks like you did lose the info, and it is not just a display issue
The getBytes() method is one culprit, without an explicit charset it takes the default character set of your machine. As of the Java String documentation:
getBytes()
Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array.
getBytes(string charsetName)
Encodes this String into a sequence of bytes using the given charset, storing the result into a new byte array.
Furthermore, as #Slaw pointed out, make sure that you compile (javac -encoding <encoding>) your files with the same encoding the files are in:
-encoding Set the source file encoding name, such as EUC-JP and UTF-8. If -encoding is not specified, the platform default converter is used.
A call to closeEntry() was missing in the OP btw. I stripped the snippet down to what I found necessary to achieve the desired funcitonality.
try (FileOutputStream fileOut = new FileOutputStream("out.zip");
ZipOutputStream zipOut = new ZipOutputStream(fileOut)) {
zipOut.putNextEntry(new ZipEntry("chinese.csv"));
zipOut.write("类型".getBytes("UTF-8"));
zipOut.closeEntry();
}
Finally, as #MichaelGantman pointed out, you might want to check what is in which encoding using a tool like a hex-editor for example, also to rule out that the editor you view the result file in displays correct utf-8 in a wrong way. "类" in utf-8 is (hex) e7 b1 bb in utf-16 (the java default encoding) it is 7c 7b
We have a requirement of picking the data from Oracle DB table and dump that data into a csv file and a plain pipe seperated text file. Give a link to user on application so user can view the generated csv/text files.
As lot of parsing was involved so we wrote a Unix shell script and are calling it from out Struts/J2ee application.
Earlier we were loosing the Chinese and Roman chars in the generated files and the generated file were having us-ascii charset(cheked using-> file -i). Later we used NLS_LANG=AMERICAN_AMERICA.AL32UTF8 and this gave us utf-8 format files.
But still the characters were gibberish, so again we tried iconv command and converted utf-8 files to utf-16le charset.
iconv -f utf-8 -t utf-16le $recordFile > $tempFile
This works fine for the generated text file. But with CSV the Chinese and Roman chars are still not correct. Now if we open this csv file in a Notepad and give a newline by pressing Enter key from keyboard, save it. Open it with MS-Excel, all characters are coming fine including the Chinese and Romans but now the text is in single line for each row instead of columns.
Not sure what's going on.
Java code
PrintWriter out = servletResponse.getWriter();
servletResponse.setContentType("application/vnd.ms-excel; charset=UTF-8");
servletResponse.setCharacterEncoding("UTF-8");
servletResponse.setHeader("Content-Disposition","attachment; filename="+ fileName.toString());
FileInputStream fileInputStream = new FileInputStream(fileLoc + fileName);
int i;
while ((i=fileInputStream.read()) != -1) {
out.write(i);
}
fileInputStream.close();
out.close();
Please let me know if i missed out any details.
Thanks to all for taking out time to go through this.
Was able to solve it out. First as mentioned by Aaron removed UTF-16LE encoding to avoid future issues and encoded files to UTF-8. Changed the PrintWriter in Java code to OutputStream and was able to see the correct characters in my text file.
CSV was still showing garbage. Came to know that we need to prepend EF BB BF at the beginning of file as the BOM aware software like MS-Excel needs it. So changing the Java code as below did the trick for csv.
OutputStream out = servletResponse.getOutputStream();
os.write(239); //0xEF
os.write(187); //0xBB
out.write(191); //0xBF
FileInputStream fileInputStream = new FileInputStream(fileLoc + fileName);
int i;
while ((i=fileInputStream.read()) != -1) {
out.write(i);
}
fileInputStream.close();
out.flush();
out.close();
As always with Unicode problems, every single step of the transformation chain must work perfectly. If you make a mistake in one place, data will be silently corrupted. There is no easy way to figure out where it happens, you have to debug the code or write unit tests.
The Java code above only works if the file actually contains UTF-8 encoded data; it doesn't "magically" figure out what's in the file and converts it to UTF-8. So if the file already contains garbage, you just slap a "this is UTF-8" label on it but it's still garbage.
That means for you that you need to create test cases which take known test data and move that through every step of the chain: Inserting into database, reading from the database, writing to CSV, writing to the text file, reading those files and download to the user.
For each step, you need to write unit tests which takes a known Unicode string like abc öäü and processes it and then check the result. To make it easier to input in Java code, use "abc \u00f6\u00e4\u00fc" You may also want to add spaces at the beginning and end of the string to see whether they are properly preserved or not.
file -i doesn't help you much here since it just makes a guess what the file contains. There is no indicator (data or metadata) in a text file which says "this is UTF-8". UTF-16 supports a BOM header for this but almost no one uses UTF-16, so many tools don't support it (properly).
I had to change the Eclipse Indigo encoding to UTF-8. Now all the spécial characters as éàçè are replaced with �.
I can do a search and replace but I wonder if there is better solution.
Thanks
Changing the encoding in Eclipse doesn't change your existing files : it only changes the way Eclipse reads them.
What you need is to convert your old files to UTF-8 as well as configuring Eclipse.
There are some tools to do that and you may write a small java program too.
If you want to use an existing tool, here's the first I found : http://www.marblesoftware.com/Marble_Software/Charco.html (you could find a better one for your (unspecified) OS.
If you want to write a tool yourself (about 20 LOC), the thing to know is that you must :
read the file with their initial charset
write the files in UTF-8
Here's the core of the operation :
reader = new BufferedReader(new InputStreamReader(new FileInputStream(...), "you have to know it"));
writer = new OutputStreamWriter(new FileOutputStream(...), "UTF-8");
String line;
while ((line=reader.readLine())!=null) {
writer.write(line);
}
I recommend notepad++ for conversion. This is an editor which has some very useful/powerful view and conversion tools to troubleshoot charsets.
Also some more "swiss-knife"-like functions (file comparison, advanced search and replace and many more...)
notepad++
Just only need alt + enter then chooses resource UTF-8
I want to write "Arabic" in the message resource bundle (properties) file but when I try to save it I get this error:
"Save couldn't be completed
Some characters cannot be mapped using "ISO-85591-1" character encoding. Either change encoding or remove the character ..."
Can anyone guide please?
I want to write:
global.username = اسم المستخدم
How should I write the Arabic of "username" in properties file? So, that internationalization works..
BR
SC
http://sourceforge.net/projects/eclipse-rbe/
You can use the above plugin for eclipse IDE to make the Unicode conversion for you.
As described in the class reference for "Properties"
The load(Reader) / store(Writer, String) methods load and store properties from and to
a character based stream in a simple line-oriented format specified below.
The load(InputStream) / store(OutputStream, String) methods work the same way as the
load(Reader)/store(Writer, String) pair, except the input/output stream is encoded in
ISO 8859-1 character encoding. Characters that cannot be directly represented in this
encoding can be written using Unicode escapes ; only a single 'u' character is allowed
in an escape sequence. The native2ascii tool can be used to convert property files to
and from other character encodings.
Properties-based resource bundles must be encoded in ISO-8859-1 to use the default loading mechanism, but I have successfully used this code to allow the properties files to be encoded in UTF-8:
private static class ResourceControl extends ResourceBundle.Control {
#Override
public ResourceBundle newBundle(String baseName, Locale locale,
String format, ClassLoader loader, boolean reload)
throws IllegalAccessException, InstantiationException,
IOException {
String bundlename = toBundleName(baseName, locale);
String resName = toResourceName(bundlename, "properties");
InputStream stream = loader.getResourceAsStream(resName);
return new PropertyResourceBundle(new InputStreamReader(stream,
"UTF-8"));
}
}
Then of course you have to change the encoding of the file itself to UTF-8 in your IDE, and can use it like this:
ResourceBundle bundle = ResourceBundle.getBundle(
"package.Bundle", new ResourceControl());
new String(ret.getBytes("ISO-8859-1"), "UTF-8"); worked for me.
property file saved in ISO-8859-1 Encodiing.
If you are using Eclipse, you can choose "Window-->Preferences" and then filter on "content types". Then you should be able to set the default encoding. There's a screen shot showing this at the top of this post.
This is mainly an editor configuration issue. If you're working in Windows, you can edit the text in an editor that supports UTF-8. Notepad or Eclipse built-in editor should be more than enough, provided you've saved file as UTF-8. In Linux, I've used gedit and emacs successfully. In Notepad, you can do this by clicking 'Save As' button and choosing 'UTF-8' encoding. Other editors should have similar feature. Some editors might require font change in order to display letters correctly, but it seems that you don't have this issue.
Having said that, there are other steps to consider when performing i18n for arabic. You can find some useful links below. Make sure to use native2ascii on properties file before using it otherwise it might not work. I spent a lot of time until I figured this one out.
Tomcat webapps
Using nativ2ascii with properties files
Besides native2ascii tool mentioned in other answers there is a java Open Source library that can provide conversion functionality to be used in code
Library MgntUtils has a Utility that converts Strings in any language (including special characters and emojis to unicode sequence and vise versa:
result = "Hello World";
result = StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence(result);
System.out.println(result);
result = StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString(result);
System.out.println(result);
The output of this code is:
\u0048\u0065\u006c\u006c\u006f\u0020\u0057\u006f\u0072\u006c\u0064
Hello World
The library can be found at Maven Central or at Github It comes as maven artifact and with sources and javadoc
Here is javadoc for the class StringUnicodeEncoderDecoder
What is the best way to add non-ASCII file names to a zip file using Java, in such a way that the files can be properly read in both Windows and Linux?
Here is one attempt, adapted from https://truezip.dev.java.net/tutorial-6.html#Example, which works in Windows Vista but fails in Ubuntu Hardy. In Hardy the file name is shown as abc-ЖДФ.txt in file-roller.
import java.io.IOException;
import java.io.PrintStream;
import de.schlichtherle.io.File;
import de.schlichtherle.io.FileOutputStream;
public class Main {
public static void main(final String[] args) throws IOException {
try {
PrintStream ps = new PrintStream(new FileOutputStream(
"outer.zip/abc-åäö.txt"));
try {
ps.println("The characters åäö works here though.");
} finally {
ps.close();
}
} finally {
File.umount();
}
}
}
Unlike java.util.zip, truezip allows specifying zip file encoding. Here's another sample, this time explicitly specifiying the encoding. Neither IBM437, UTF-8 nor ISO-8859-1 works in Linux. IBM437 works in Windows.
import java.io.IOException;
import de.schlichtherle.io.FileOutputStream;
import de.schlichtherle.util.zip.ZipEntry;
import de.schlichtherle.util.zip.ZipOutputStream;
public class Main {
public static void main(final String[] args) throws IOException {
for (String encoding : new String[] { "IBM437", "UTF-8", "ISO-8859-1" }) {
ZipOutputStream zipOutput = new ZipOutputStream(
new FileOutputStream(encoding + "-example.zip"), encoding);
ZipEntry entry = new ZipEntry("abc-åäö.txt");
zipOutput.putNextEntry(entry);
zipOutput.closeEntry();
zipOutput.close();
}
}
}
The encoding for the File-Entries in ZIP is originally specified as IBM Code Page 437. Many characters used in other languages are impossible to use that way.
The PKWARE-specification refers to the problem and adds a bit. But that is a later addition (from 2007, thanks to Cheeso for clearing that up, see comments). If that bit is set, the filename-entry have to be encoded in UTF-8. This extension is described in 'APPENDIX D - Language Encoding (EFS)', that is at the end of the linked document.
For Java it is a known bug, to get into trouble with non-ASCII-characters. See bug #4244499 and the high number of related bugs.
My colleague used as workaround URL-Encoding for the filenames before storing them into the ZIP and decoding after reading them. If you control both, storing and reading, that may be a workaround.
EDIT: At the bug someone suggests using the ZipOutputStream from Apache Ant as workaround. This implementation allows the specification of an encoding.
In Zip files, according to the spec owned by PKWare, the encoding of file names and file comments is IBM437. In 2007 PKWare extended the spec to also allow UTF-8. This says nothing about the encoding of the files contained within the zip. Only the encoding of the filenames.
I think all tools and libraries (Java and non Java) support IBM437 (which is a superset of ASCII), and fewer tools and libraries support UTF-8. Some tools and libs support other code pages. For example if you zip something using WinRar on a computer running in Shanghai, you will get the Big5 code page. This is not "allowed" by the zip spec but it happens anyway.
The DotNetZip library for .NET does Unicode, but of course that doesn't help you if you are using Java!
Using the Java built-in support for ZIP, you will always get IBM437. If you want an archive with something other than IBM437, then use a third party library, or create a JAR.
Miracles indeed happen, and Sun/Oracle did really fix the long-living bug/rfe:
Now it's possible to set up filename encodings upon creating the zip file/stream (requires Java 7).
You can still use the Apache Commons implementation of the zip stream : http://commons.apache.org/compress/apidocs/org/apache/commons/compress/archivers/zip/ZipArchiveOutputStream.html#setEncoding%28java.lang.String%29
Calling setEncoding("UTF-8") on your stream should be enough.
From a quick look at the TrueZIP manual - they recommend the JAR format:
It uses UTF-8 for file name encoding
and comments - unlike ZIP, which only
uses IBM437.
This probably means that the API is using the java.util.zip package for its implementation; that documentation states that it is still using a ZIP format from 1996. Unicode support wasn't added to the PKWARE .ZIP File Format Specification until 2006.
Did it actually fail or was just a font issue? (e.g. font having different glyphs for those charcodes) I've seen similar issues in Windows where rendering "broke" because the font didn't support the charset but the data was actually intact and correct.
Non-ASCII file names are not reliable across ZIP implementations and are best avoided. There is no provision for storing a charset setting in ZIP files; clients tend to guess with 'the current system codepage', which is unlikely to be what you want. Many combinations of client and codepage can result in inaccessible files.
Sorry!