Add non-ASCII file names to zip in Java

What is the best way to add non-ASCII file names to a zip file using Java, in such a way that the files can be properly read in both Windows and Linux?
Here is one attempt, adapted from https://truezip.dev.java.net/tutorial-6.html#Example, which works in Windows Vista but fails in Ubuntu Hardy. In Hardy the file name is shown as abc-ЖДФ.txt in file-roller.
import java.io.IOException;
import java.io.PrintStream;
import de.schlichtherle.io.File;
import de.schlichtherle.io.FileOutputStream;
public class Main {
    public static void main(final String[] args) throws IOException {
        try {
            PrintStream ps = new PrintStream(new FileOutputStream(
                    "outer.zip/abc-åäö.txt"));
            try {
                ps.println("The characters åäö works here though.");
            } finally {
                ps.close();
            }
        } finally {
            File.umount();
        }
    }
}
Unlike java.util.zip, truezip allows specifying the zip file encoding. Here's another sample, this time explicitly specifying the encoding. Neither IBM437, UTF-8 nor ISO-8859-1 works in Linux. IBM437 works in Windows.
import java.io.IOException;
import de.schlichtherle.io.FileOutputStream;
import de.schlichtherle.util.zip.ZipEntry;
import de.schlichtherle.util.zip.ZipOutputStream;
public class Main {
    public static void main(final String[] args) throws IOException {
        for (String encoding : new String[] { "IBM437", "UTF-8", "ISO-8859-1" }) {
            ZipOutputStream zipOutput = new ZipOutputStream(
                    new FileOutputStream(encoding + "-example.zip"), encoding);
            ZipEntry entry = new ZipEntry("abc-åäö.txt");
            zipOutput.putNextEntry(entry);
            zipOutput.closeEntry();
            zipOutput.close();
        }
    }
}

The encoding for the file entries in ZIP was originally specified as IBM Code Page 437. Many characters used in other languages cannot be represented that way.
The PKWARE specification acknowledges the problem and adds a flag bit. But that is a later addition (from 2007, thanks to Cheeso for clearing that up, see comments). If that bit is set, the file name entry has to be encoded in UTF-8. This extension is described in 'APPENDIX D - Language Encoding (EFS)', at the end of the linked document.
For Java, trouble with non-ASCII characters in ZIP file names is a known bug. See bug #4244499 and the high number of related bugs.
As a workaround, my colleague URL-encoded the file names before storing them in the ZIP and decoded them after reading. If you control both storing and reading, that may work for you.
EDIT: In the bug report someone suggests using the ZipOutputStream from Apache Ant as a workaround. This implementation allows the specification of an encoding.
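The URL-encoding workaround could look roughly like the sketch below. It uses only java.util.zip and java.net; the class and method names (UrlEncodedZipNames, writeZip, readFirstName) are invented for illustration, and the archive is kept in memory to keep the example short.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper: stores entry names URL-encoded so the archive only
// ever contains ASCII names, and decodes them again when reading.
public class UrlEncodedZipNames {

    /** Writes a zip with one empty entry whose stored name is pure ASCII. */
    public static byte[] writeZip(String fileName) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        ZipOutputStream zip = new ZipOutputStream(buffer);
        try {
            zip.putNextEntry(new ZipEntry(URLEncoder.encode(fileName, "UTF-8")));
            zip.closeEntry();
        } finally {
            zip.close();
        }
        return buffer.toByteArray();
    }

    /** Reads the first entry and decodes its name back to the original. */
    public static String readFirstName(byte[] zipBytes) throws IOException {
        ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes));
        try {
            ZipEntry entry = zip.getNextEntry();
            return URLDecoder.decode(entry.getName(), "UTF-8");
        } finally {
            zip.close();
        }
    }
}
```

For the name abc-åäö.txt the stored entry name is abc-%C3%A5%C3%A4%C3%B6.txt, which is what any tool that does not know about the convention will display. Note that URLEncoder also encodes "/" and turns spaces into "+", so names containing directory paths would need to be encoded segment by segment.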

In Zip files, according to the spec owned by PKWare, the encoding of file names and file comments is IBM437. In 2007 PKWare extended the spec to also allow UTF-8. This says nothing about the encoding of the files contained within the zip, only the encoding of the file names.
I think all tools and libraries (Java and non-Java) support IBM437 (which is a superset of ASCII), and fewer tools and libraries support UTF-8. Some tools and libs support other code pages. For example if you zip something using WinRar on a computer running in Shanghai, you will get the Big5 code page. This is not "allowed" by the zip spec but it happens anyway.
The DotNetZip library for .NET does Unicode, but of course that doesn't help you if you are using Java!
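Java's built-in ZIP support (java.util.zip) does not let you choose the file name encoding before Java 7: it always writes names as UTF-8, the same encoding JAR files use, regardless of what the reading tool expects. If you want an archive with a different file name encoding, use a third party library.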

Miracles do happen: Sun/Oracle really did fix the long-standing bug/RFE.
It is now possible to set the file name encoding when creating the zip file/stream (requires Java 7).
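A minimal sketch of the Java 7 API: java.util.zip.ZipOutputStream and ZipInputStream both have constructors that take a Charset. The class and method names below (ZipNameCharset, writeZip, readFirstName) are made up for illustration, and the archive is kept in memory.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class; requires Java 7 or later.
public class ZipNameCharset {

    /** Writes a zip with one empty entry, encoding its name with the given charset. */
    public static byte[] writeZip(String entryName, Charset charset) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer, charset)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.closeEntry();
        }
        return buffer.toByteArray();
    }

    /** Reads the name of the first entry, decoding it with the given charset. */
    public static String readFirstName(byte[] zipBytes, Charset charset) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes), charset)) {
            return zip.getNextEntry().getName();
        }
    }
}
```

Passing StandardCharsets.UTF_8 matches the UTF-8 extension of the ZIP spec; a legacy code page could be passed via Charset.forName(...) instead, provided the runtime ships that charset. Whether a given archive tool then shows the names correctly still depends on that tool.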

You can still use the Apache Commons implementation of the zip stream : http://commons.apache.org/compress/apidocs/org/apache/commons/compress/archivers/zip/ZipArchiveOutputStream.html#setEncoding%28java.lang.String%29
Calling setEncoding("UTF-8") on your stream should be enough.

From a quick look at the TrueZIP manual - they recommend the JAR format:
"It uses UTF-8 for file name encoding and comments - unlike ZIP, which only uses IBM437."
This probably means that the API is using the java.util.zip package for its implementation; that documentation states that it is still using a ZIP format from 1996. Unicode support wasn't added to the PKWARE .ZIP File Format Specification until 2006.

Did it actually fail or was just a font issue? (e.g. font having different glyphs for those charcodes) I've seen similar issues in Windows where rendering "broke" because the font didn't support the charset but the data was actually intact and correct.

Non-ASCII file names are not reliable across ZIP implementations and are best avoided. There is no provision for storing a charset setting in ZIP files; clients tend to guess with 'the current system codepage', which is unlikely to be what you want. Many combinations of client and codepage can result in inaccessible files.
Sorry!

Related

Why does Java ignore the first line of a .properties file?

I was working with an app that loads a .properties file with java.util.Properties like this:
Properties _properties = new Properties();
_properties.load(new FileInputStream("app.properties"));
The properties file (initially) was this:
app=myApp
dbLogin=myDbLogin
version=0.9.8.10
server=1
freq=10000
stateGap=360000
The strange thing was that when I called _properties.getProperty("app"), it always returned null; however, I could load all of the other properties without any issues. I solved the problem by adding a comment to the top of the properties file, after which everything worked fine.
My question is: Why does Java do this? I can't seem to find any documentation about this, and it seems counter-intuitive.
Thanks to @KonstantinV.Salikhov and @pms for their help in hunting this down; I decided to post the answer that was discovered to save people hunting through the comments.
The problem was that my file was the wrong encoding, as mentioned here: http://docs.oracle.com/javase/7/docs/api/java/util/Properties.html
The load(Reader) / store(Writer, String) methods load and store properties from and to a character based stream in a simple line-oriented format specified below. The load(InputStream) / store(OutputStream, String) methods work the same way as the load(Reader)/store(Writer, String) pair, except the input/output stream is encoded in ISO 8859-1 character encoding.
(Emphasis mine).
I changed the encoding of the properties file to ISO-8859-1 and everything worked.
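Java does not handle the BOM correctly: load(InputStream) treats the BOM bytes as ordinary characters, so they end up as part of the first key (you can see this if you list the property keys). That is why getProperty("app") returned null and why adding a comment line at the top appeared to fix it. It is possible to save the file as UTF-8 but without a BOM. In vim, for instance: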
:set nobomb
See vim wiki
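If the file cannot be re-saved, the BOM can also be skipped in code. The following is a sketch only, assuming the file is UTF-8 and Java 7 or later is available; BomAwareProperties is a made-up name.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical helper: loads a UTF-8 properties stream and ignores a leading BOM.
public class BomAwareProperties {

    public static Properties load(InputStream in) throws IOException {
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8));
        // Decoded as UTF-8, the BOM shows up as the single character U+FEFF.
        reader.mark(1);
        if (reader.read() != '\uFEFF') {
            reader.reset(); // first character is real content, put it back
        }
        Properties properties = new Properties();
        properties.load(reader);
        return properties;
    }
}
```

Usage would be along the lines of BomAwareProperties.load(new FileInputStream("app.properties")), after which getProperty("app") finds the first key whether or not the editor wrote a BOM.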

Text corrupt after changing the Eclipse to UTF-8 Encoding

I had to change the Eclipse Indigo encoding to UTF-8. Now all the spécial characters such as éàçè are replaced with �.
I can do a search and replace but I wonder if there is better solution.
Thanks
Changing the encoding in Eclipse doesn't change your existing files: it only changes the way Eclipse reads them.
What you need is to convert your old files to UTF-8 as well as configuring Eclipse.
There are some tools to do that, and you may write a small Java program too.
If you want to use an existing tool, here's the first I found: http://www.marblesoftware.com/Marble_Software/Charco.html (you could find a better one for your (unspecified) OS).
If you want to write a tool yourself (about 20 LOC), the thing to know is that you must:
read the files with their initial charset
write the files in UTF-8
Here's the core of the operation:
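// Decode with the charset the file was originally saved in.
BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(...), "you have to know it"));
Writer writer = new OutputStreamWriter(new FileOutputStream(...), "UTF-8");
String line;
while ((line = reader.readLine()) != null) {
    writer.write(line);
    writer.write('\n'); // readLine() strips the line terminator, so add it back
}
reader.close();
writer.close(); // also flushes the remaining output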
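As a self-contained variant of the same idea, here is a sketch that converts one file in a single pass, assuming Java 7 or later (java.nio.file) and files small enough to fit in memory. ToUtf8 and convert are illustrative names, and the source charset still has to be known in advance.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical converter: re-encodes a text file into UTF-8.
public class ToUtf8 {

    /** Decodes source with sourceCharset and writes the same text to target as UTF-8. */
    public static void convert(Path source, Charset sourceCharset, Path target)
            throws IOException {
        byte[] original = Files.readAllBytes(source);
        String text = new String(original, sourceCharset);
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }
}
```

Because the whole file is decoded at once rather than line by line, the original line endings are preserved as they were. Write to a separate target (or keep a backup) until you have checked the result, since decoding with the wrong source charset silently produces wrong characters.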
I recommend Notepad++ for the conversion. It is an editor with some very useful and powerful view and conversion tools for troubleshooting charsets.
It also has some more "Swiss army knife" functions (file comparison, advanced search and replace, and many more).
notepad++
You only need to press Alt+Enter and then choose UTF-8 under Resource.

Reading UTF-8 .properties files in Java 1.5?

I have a project where everything is in UTF-8. I was using the Properties.load(Reader) method to read properties files in this encoding. But now, I need to make the project compatible with Java 1.5, and the mentioned method doesn't exist in Java 1.5. There is only a load method that takes an InputStream as a parameter, which is assumed to be in ISO-8859-1.
Is there any simple way to make my project 1.5-compatible without having to change all the .properties files to ISO-8859-1? I don't really want to have a mix of encodings in my project (encodings are already a time sink one at a time, let alone when you mix them) or change all my project to ISO-8859-1.
With "a simple way" I mean "without creating a custom Properties class from scratch".
Could you use XML properties instead? As I understand the spec, .properties files should be in ISO-8859-1; if you want other characters, they should be escaped, using the native2ascii tool.
One strategy that might work for this situation is as follows:
Read the bytes of the Reader into a ByteArrayOutputStream.
Once that is completed, call toByteArray() (but see the tweak below).
With the byte[] construct a ByteArrayInputStream
Use the ByteArrayInputStream in Properties.load(InputStream)
As pointed out, the above failed to actually convert the character set from UTF-8 to ISO-8859-1. To fix that, a tweak.
After the BAOS has been filled, instead of calling toByteArray():
Call toString("UTF-8") to decode the UTF-8 bytes into a String.
Call String.getBytes("ISO-8859-1") to get the ISO-8859-1 encoded byte[]. Characters that do not exist in ISO-8859-1 are lost in this step.
What you can do is start a thread that reads the data using a BufferedReader and writes it to a PipedOutputStream, which is connected to the PipedInputStream that load uses.
PipedOutputStream pos = new PipedOutputStream();
PipedInputStream pis = new PipedInputStream(pos);
ReaderRunnable reader = new ReaderRunnable(pos, new File("utfproperty.properties"));
Thread t = new Thread(reader);
t.start();
properties.load(pis);
t.join();
The reader thread reads the data one character at a time; whenever it finds a character outside the US-ASCII (i.e. low 7-bit) range, it writes "\u" followed by the character's four-digit hex code into the PipedOutputStream instead.
ReaderRunnable would be a class that looks like:
public class ReaderRunnable implements Runnable {
    public ReaderRunnable(OutputStream os, File f) {
        this.os = os;
        this.f = f;
    }
    private final OutputStream os;
    private final File f;
    public void run() {
        // open file
        // read file, escape any non US-ASCII characters
    }
}
Now after writing all that I was thinking that someone should've had this problem before and solved it, and the best place to look for these things is in Apache Commons. Fortunately, they have an implementation there.
https://commons.apache.org/io/apidocs/org/apache/commons/io/input/ReaderInputStream.html
The implementation from Apache is not without flaws though. Your input file even if it is UTF-8 must only contain the characters from the ISO-8859-1 character set. The design I had provided above can handle that situation.
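The escaping step can also be sketched without the extra thread by building the escaped text in memory first. This is only an illustration of the idea, not the pipe-based design described above; Utf8PropertiesLoader is a made-up name, and the code sticks to APIs that exist in Java 1.5.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.util.Properties;

// Hypothetical helper: loads UTF-8 properties using only Properties.load(InputStream).
public class Utf8PropertiesLoader {

    public static Properties load(InputStream utf8Input) throws IOException {
        Reader reader = new InputStreamReader(utf8Input, "UTF-8");
        StringBuilder escaped = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) {
            if (c < 0x80) {
                escaped.append((char) c); // US-ASCII passes through unchanged
            } else {
                // Replace anything else with a backslash-u escape, zero-padded to 4 digits.
                escaped.append("\\u");
                String hex = Integer.toHexString(c);
                for (int i = hex.length(); i < 4; i++) {
                    escaped.append('0');
                }
                escaped.append(hex);
            }
        }
        Properties properties = new Properties();
        // The escaped text is pure ASCII, so it is also valid ISO-8859-1.
        properties.load(new ByteArrayInputStream(escaped.toString().getBytes("ISO-8859-1")));
        return properties;
    }
}
```

Unlike a plain charset conversion to ISO-8859-1, this keeps characters that do not exist in ISO-8859-1. One known limitation of the sketch: a non-ASCII character written directly after a backslash in the source file would be escaped incorrectly.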
Depending on your build engine you can \uXXXX-escape the properties into the build target directory. Maven can filter them via the native2ascii-maven-plugin.
What I personally do in my projects is I keep my properties in UTF-8 files with an extension .uproperties and I convert them to ISO at the build time to .properties files using native2ascii.exe. This allows me to maintain my properties in UTF-8 and the Ant script does everything else for me.
What I just found is that it helps to make all .java files UTF-8 encoded as well (not only the properties files where you store UTF-8 characters). This way there is no need to use an InputStreamReader either. Also, make sure to compile with UTF-8 encoding.
This has worked for me without passing any extra UTF-8 parameter.
To test this, write a simple stub program in Eclipse and change the encoding of that Java file by opening the file's properties and, in the Resource section, setting the encoding to UTF-8.

Java: How to write "Arabic" in properties file?

I want to write "Arabic" in the message resource bundle (properties) file but when I try to save it I get this error:
"Save couldn't be completed
Some characters cannot be mapped using "ISO-8859-1" character encoding. Either change encoding or remove the character ..."
Can anyone guide please?
I want to write:
global.username = اسم المستخدم
How should I write the Arabic of "username" in the properties file so that internationalization works?
BR
SC
http://sourceforge.net/projects/eclipse-rbe/
You can use the above plugin for eclipse IDE to make the Unicode conversion for you.
As described in the class reference for "Properties"
The load(Reader) / store(Writer, String) methods load and store properties from and to
a character based stream in a simple line-oriented format specified below.
The load(InputStream) / store(OutputStream, String) methods work the same way as the
load(Reader)/store(Writer, String) pair, except the input/output stream is encoded in
ISO 8859-1 character encoding. Characters that cannot be directly represented in this
encoding can be written using Unicode escapes ; only a single 'u' character is allowed
in an escape sequence. The native2ascii tool can be used to convert property files to
and from other character encodings.
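Written with Unicode escapes, the line from the question becomes global.username = \u0627\u0633\u0645 \u0627\u0644\u0645\u0633\u062a\u062e\u062f\u0645. As a sketch of how to produce such escapes without an external tool, Properties.store(OutputStream, String) writes them itself; EscapedPropertiesDemo and its methods are made-up names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Properties;

// Hypothetical demo: lets java.util.Properties generate the Unicode escapes.
public class EscapedPropertiesDemo {

    /** Returns the ISO-8859-1 text that Properties.store(OutputStream, ...) writes. */
    public static String storeAsText(String key, String value) throws IOException {
        Properties properties = new Properties();
        properties.setProperty(key, value);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        properties.store(out, null); // non-ASCII characters are written as escapes
        return out.toString("ISO-8859-1");
    }

    /** Loads the stored text back to show the escapes decode to the original value. */
    public static String roundTrip(String key, String value) throws IOException {
        byte[] stored = storeAsText(key, value).getBytes("ISO-8859-1");
        Properties properties = new Properties();
        properties.load(new ByteArrayInputStream(stored));
        return properties.getProperty(key);
    }
}
```

The stored text also contains a timestamp comment line, so for hand-maintained bundles a tool such as native2ascii or an editor plugin is usually more convenient; the sketch mainly shows that the escaped form loads back to the original Arabic text.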
Properties-based resource bundles must be encoded in ISO-8859-1 to use the default loading mechanism, but I have successfully used this code to allow the properties files to be encoded in UTF-8:
private static class ResourceControl extends ResourceBundle.Control {
    @Override
    public ResourceBundle newBundle(String baseName, Locale locale,
            String format, ClassLoader loader, boolean reload)
            throws IllegalAccessException, InstantiationException,
            IOException {
        String bundleName = toBundleName(baseName, locale);
        String resName = toResourceName(bundleName, "properties");
        InputStream stream = loader.getResourceAsStream(resName);
        if (stream == null) {
            return null; // no bundle for this locale; let the lookup fall back
        }
        try {
            // Decode the file as UTF-8 instead of the default ISO-8859-1.
            return new PropertyResourceBundle(new InputStreamReader(stream,
                    "UTF-8"));
        } finally {
            stream.close();
        }
    }
}
Then of course you have to change the encoding of the file itself to UTF-8 in your IDE, and can use it like this:
ResourceBundle bundle = ResourceBundle.getBundle(
"package.Bundle", new ResourceControl());
new String(ret.getBytes("ISO-8859-1"), "UTF-8"); worked for me.
The property file was saved in ISO-8859-1 encoding.
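Why that one-liner can work, as a sketch: if the bytes in the file are really UTF-8 but were decoded as ISO-8859-1, every byte became exactly one character, so encoding back to ISO-8859-1 recovers the original bytes, which can then be decoded properly. Latin1ToUtf8 and repair are illustrative names, and the code assumes Java 7 or later for StandardCharsets.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: undoes a wrong ISO-8859-1 decoding of UTF-8 data.
public class Latin1ToUtf8 {

    /** Re-decodes a string whose UTF-8 bytes were wrongly decoded as ISO-8859-1. */
    public static String repair(String misdecoded) {
        // ISO-8859-1 maps each char 0..255 to one byte, so this restores the raw bytes.
        byte[] originalBytes = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }
}
```

This only helps when the value really went through that specific wrong decoding; applied to a string that was read correctly, it corrupts any non-ASCII characters instead.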
If you are using Eclipse, you can choose "Window-->Preferences" and then filter on "content types". Then you should be able to set the default encoding. There's a screen shot showing this at the top of this post.
This is mainly an editor configuration issue. If you're working in Windows, you can edit the text in an editor that supports UTF-8. Notepad or Eclipse built-in editor should be more than enough, provided you've saved file as UTF-8. In Linux, I've used gedit and emacs successfully. In Notepad, you can do this by clicking 'Save As' button and choosing 'UTF-8' encoding. Other editors should have similar feature. Some editors might require font change in order to display letters correctly, but it seems that you don't have this issue.
Having said that, there are other steps to consider when performing i18n for Arabic. You can find some useful links below. Make sure to run native2ascii on the properties file before using it, otherwise it might not work. I spent a lot of time until I figured this one out.
Tomcat webapps
Using native2ascii with properties files
Besides the native2ascii tool mentioned in other answers, there is an open-source Java library that provides the conversion functionality for use in code.
The MgntUtils library has a utility that converts strings in any language (including special characters and emojis) to Unicode escape sequences and vice versa:
result = "Hello World";
result = StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence(result);
System.out.println(result);
result = StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString(result);
System.out.println(result);
The output of this code is:
\u0048\u0065\u006c\u006c\u006f\u0020\u0057\u006f\u0072\u006c\u0064
Hello World
The library can be found at Maven Central or at Github It comes as maven artifact and with sources and javadoc
Here is javadoc for the class StringUnicodeEncoderDecoder

How to compile a java source file which is encoded as "UTF-8"?

I saved my Java source file specifying its encoding type as UTF-8 (using Notepad; by default Notepad's encoding type is ANSI) and then I tried to compile it using:
javac -encoding "UTF-8" One.java
but it gave an error message:
One.java:1: illegal character: \65279
?public class One {
^
1 error
Is there any other way, I can compile this?
Here is the source:
public class One {
    public static void main( String[] args ){
        System.out.println("HI");
    }
}
Your file is being read as UTF-8, otherwise a character with value "65279" could never appear. javac expects your source code to be in the platform default encoding, according to the javac documentation:
If -encoding is not specified, the platform default converter is used.
Decimal 65279 is hex FEFF, which is the Unicode Byte Order Mark (BOM). It's unnecessary in UTF-8, because UTF-8 is always encoded as an octet stream and doesn't have endianness issues.
Notepad likes to stick in BOMs even when they're not necessary, but some programs don't like finding them. As others have pointed out, Notepad is not a very good text editor. Switching to a different text editor will almost certainly solve your problem.
Open the file in Notepad++ and select Encoding -> Convert to UTF-8 without BOM.
This isn't a problem with your text editor, it's a problem with javac!
The Unicode spec says the BOM is optional in UTF-8, it doesn't say it's forbidden!
If a BOM can be there, then javac HAS to handle it, but it doesn't. Actually, using the BOM in UTF-8 files IS useful to distinguish an ANSI-encoded file from a Unicode-encoded file.
The proposed solution of removing the BOM is only a workaround and not the proper solution.
This bug report indicates that this "problem" will never be fixed : https://web.archive.org/web/20160506002035/http://bugs.java.com/view_bug.do?bug_id=4508058
Since this thread is in the top 2 google results for the "javac BOM" search, I'm leaving this here for future readers.
Try javac -encoding UTF8 One.java
Without the quotes and it's UTF8, no dash.
See this forum thread for more links
See below.
As an example, consider a program that uses Telugu words as identifiers.
Program (UnicodeEx.java)
class UnicodeEx {
    public static void main(String[] args) {
        double ఎత్తు = 10;
        double వెడల్పు = 25;
        double దీర్ఘ_చతురస్ర_వైశాల్యం;
        System.out.println("The Value of Height = "+ఎత్తు+" and Width = "+వెడల్పు+"\n");
        దీర్ఘ_చతురస్ర_వైశాల్యం = ఎత్తు * వెడల్పు;
        System.out.println("Area of Rectangle = "+దీర్ఘ_చతురస్ర_వైశాల్యం);
    }
}
Save the program as "UnicodeEx.java" and change the encoding to "unicode".
How to Compile
javac -encoding "unicode" UnicodeEx.java
How to Execute
java UnicodeEx
The Value of Height = 10.0 and Width = 25.0
Area of Rectangle = 250.0
I know this is a very old thread, but I was experiencing a similar problem with PHP instead of Java and Google took me here. I was writing PHP on Notepad++ (not plain Notepad) and noticed that an extra white line appeared every time I called an include file. Firebug showed that there was a 65279 character in those extra lines.
Actually both the main PHP file and the included files were encoded in UTF-8. However, Notepad++ has also an option to encode as "UTF-8 without BOM". This solved my problem.
Bottom line: when saving as UTF-8, the editor inserts this extra BOM character at the start of each file unless you instruct it to use UTF-8 without BOM.
Works fine here, even edited in Notepad. Moral of the story is, don't use Notepad. There's likely an unprintable character in there that Notepad is either inserting or happily hiding from you.
I had the same problem. To solve it, I opened the file in a hex editor and found three "invisible" bytes at the beginning of the file. I removed them, and compilation worked.
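Those three bytes are the UTF-8 encoding of the BOM (EF BB BF), which javac decodes as character 65279 (hex FEFF). Removing them can also be scripted; the following is a sketch with made-up names (StripBom, strip), assuming Java 7 or later for java.nio.file.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical tool: removes a leading UTF-8 BOM from the files given as arguments.
public class StripBom {

    /** Returns the content without a leading UTF-8 BOM (EF BB BF), if any. */
    public static byte[] strip(byte[] content) {
        boolean hasBom = content.length >= 3
                && (content[0] & 0xFF) == 0xEF
                && (content[1] & 0xFF) == 0xBB
                && (content[2] & 0xFF) == 0xBF;
        return hasBom ? Arrays.copyOfRange(content, 3, content.length) : content;
    }

    public static void main(String[] args) throws IOException {
        for (String name : args) {
            Path file = Paths.get(name);
            // Rewrites the file in place; files without a BOM keep the same bytes.
            Files.write(file, strip(Files.readAllBytes(file)));
        }
    }
}
```

It could be run as java StripBom One.java, after which javac -encoding UTF-8 One.java should no longer report the illegal character.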
Open your file with WordPad or any other editor except Notepad.
Select Save As type as Text Document - MS-DOS Format
Reopen the Project
To extend the existing answers with a solution for Linux users:
To remove the BOM on all .java files at once, go to your source directory and execute
find -iregex '.*\.java' -type f -print0 | xargs -0 dos2unix
Requires find, xargs and dos2unix to be installed, which should be included in most distributions. The first statement finds all .java files in the current directory recursively, the second one converts each of them with the dos2unix tool, which is intended to convert line endings but also removes the BOM.
The line ending conversion should have no effect, since the files should already use Linux \n line endings if your version control is configured correctly. Be warned, though, that dos2unix does convert line endings as well, in case you have one of those rare cases where that is not intended.
In IntelliJ IDEA (Settings > Editor > File Encodings), the project encoding was "windows-1256", so I used the following code to convert static strings to UTF-8:
protected String persianString(String persianString) throws UnsupportedEncodingException {
    return new String(persianString.getBytes("windows-1256"), "UTF-8");
}
Now It is OK!
Depending on the file encoding you should change "windows-1256" to a proper one
