Java, unzip folder with German characters in filenames


I'm trying to unzip a folder whose filenames contain German characters, for example Aufhänge.
I know that Java 7 uses UTF-8 by default, and I think "ä" can be represented in UTF-8.
Here is my code snippet:
public static void main(String[] args) throws IOException {
    ZipInputStream zipInputStream = new ZipInputStream(new FileInputStream(ZIP_PATH), StandardCharsets.UTF_8);
    ZipEntry zipEntry;
    while ((zipEntry = zipInputStream.getNextEntry()) != null) {
        System.out.println(zipEntry.getName());
    }
}
This is the error I get: java.lang.IllegalArgumentException: MALFORMED
It works with Charset.forName("Cp437"), but it doesn't work with StandardCharsets.UTF_8.

You don't mention your operating system, nor how you created the zip file, but I managed to recreate your problem anyway, using 7-Zip on Windows 10:
Create a simple text file with some trivial content (e.g. nothing but the three characters "abc").
Save the file as D:\Temp\Aufhänge.txt. Note the umlaut in the file name.
Locate that file in Windows File Explorer.
Select the file and right click. From the context menu select 7-Zip > Add to "Aufhänge.zip" to create Aufhänge.zip.
Then, in NetBeans run the following code to unzip the file you just created:
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class GermanZip {
    static String ZIP_PATH = "D:\\Temp\\Aufhänge.zip";

    public static void main(String[] args) throws FileNotFoundException, IOException {
        ZipInputStream zipInputStream = new ZipInputStream(new FileInputStream(ZIP_PATH), Charset.forName("UTF-8"));
        ZipEntry zipEntry;
        while ((zipEntry = zipInputStream.getNextEntry()) != null) {
            System.out.println(zipEntry.getName());
        }
    }
}
As you pointed out, the code throws java.lang.IllegalArgumentException: MALFORMED when evaluating the loop condition (zipEntry = zipInputStream.getNextEntry()) != null.
The problem arises because by default 7-Zip encodes the names of the files within the zip file using Cp437, as noted in this comment from 7-Zip:
Default encoding is OEM (DOS) encoding. It's for compatibility with
old zip software.
That's why the unzip works when using Charset.forName("Cp437") instead of Charset.forName("UTF-8").
If you want to unzip using Charset.forName("UTF-8") then you have to force 7-Zip to encode the filenames within the zip in UTF-8. To do this specify the cu parameter when running 7-Zip, as noted in the linked comment:
In Windows File Explorer select the file and right click.
From the context menu select 7-Zip > Add to Archive...
In the Add to Archive dialog specify cu in the Parameters field.
Having stored the zipped filenames in UTF-8 format, you can then replace Charset.forName("Cp437") with Charset.forName("UTF-8") in your code, and no exception will be thrown when unzipping.
This answer is specific to Windows 10 and 7-Zip, but the general principle should apply in any environment: if specifying an encoding of UTF-8 for your ZipInputStream be certain that the filenames within the zip file really are encoded using UTF-8. You can easily verify this by opening the zip file in a binary editor and searching for the names of the zipped files.
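If you cannot control how the zip was created, one pragmatic option is to try UTF-8 first and fall back to Cp437 when decoding a name fails. The following is only a minimal sketch under some assumptions: the archive fits in memory, legacy names are in Cp437, and the class and method names (ZipNameFallback, listNames, listNamesWithFallback) are hypothetical. It is a heuristic, because a Cp437-encoded name can occasionally happen to be valid UTF-8 as well.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipException;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of the JDK.
final class ZipNameFallback {

    // Lists all entry names, decoding them with the given charset.
    static List<String> listNames(byte[] zipBytes, Charset charset) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zipBytes), charset)) {
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    // Tries UTF-8 first; if a name cannot be decoded, rereads the archive as Cp437.
    static List<String> listNamesWithFallback(byte[] zipBytes) throws IOException {
        try {
            return listNames(zipBytes, StandardCharsets.UTF_8);
        } catch (IllegalArgumentException | ZipException e) {
            // Assumption: anything that is not valid UTF-8 is treated as Cp437.
            return listNames(zipBytes, Charset.forName("Cp437"));
        }
    }
}
```

Restarting from the beginning of the archive keeps the sketch simple; it avoids having to recover a ZipInputStream that has already failed part-way through.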
Update based on OP's comment/question below:
Unfortunately the .ZIP File Format Specification does not currently provide a way to store the encoding used for zipped file names apart from one exception, as described in "APPENDIX D - Language Encoding (EFS)":
D.2 If general purpose bit 11 is unset, the file name and comment
SHOULD conform to the original ZIP character encoding. If general
purpose bit 11 is set, the filename and comment MUST support The
Unicode Standard, Version 4.1.0 or greater using the character
encoding form defined by the UTF-8 storage specification. The
Unicode Standard is published by the The Unicode Consortium
(www.unicode.org). UTF-8 encoded data stored within ZIP files is
expected to not include a byte order mark (BOM).
So in your code, for each zipped file, first check whether bit 11 of the general purpose bit flag is set. If it is, then you can be certain that the name of that zipped file is encoded using UTF-8. Otherwise the encoding is whatever was used when the zipped file was created. That is Cp437 by default on Windows, but if you are running on Windows and processing a zip file created on Linux, I don't think there is an easy way of determining the encoding(s) used.
Unfortunately ZipEntry does not provide a method to access the general purpose bit flag field of a zipped file, so you would need to process the zip file at the byte level to do that.
To add a further complication, "encoding" in this context relates to the encoding used for each zipped filename rather than for the zip file itself. One zipped file name could be encoded in UTF-8, another zipped file name could have been added using Cp437, etc.
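As a rough illustration of what "processing the zip file at the byte level" could look like, here is a minimal sketch that walks the central directory and reports, per entry, whether bit 11 is set. It rests on several assumptions: the whole archive is in memory, it is not a Zip64 or multi-disk archive, and names without the flag are decoded with a caller-supplied legacy charset. The class and method names (ZipUtf8FlagReader, readUtf8Flags) are hypothetical; the field offsets follow the central directory layout in the .ZIP File Format Specification.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the JDK.
final class ZipUtf8FlagReader {
    private static final int EOCD_SIGNATURE = 0x06054b50; // end of central directory record
    private static final int CEN_SIGNATURE = 0x02014b50;  // central directory file header
    private static final int UTF8_FLAG = 1 << 11;         // general purpose bit 11 (EFS)

    // Maps each entry name to true if general purpose bit 11 is set for it.
    static Map<String, Boolean> readUtf8Flags(byte[] zip, Charset legacyCharset) {
        ByteBuffer buf = ByteBuffer.wrap(zip).order(ByteOrder.LITTLE_ENDIAN);

        // The end-of-central-directory record is at least 22 bytes; scan backwards for it.
        int eocd = -1;
        for (int i = zip.length - 22; i >= 0; i--) {
            if (buf.getInt(i) == EOCD_SIGNATURE) {
                eocd = i;
                break;
            }
        }
        if (eocd < 0) {
            throw new IllegalArgumentException("No end of central directory record found");
        }

        int entryCount = buf.getShort(eocd + 10) & 0xFFFF; // total number of entries
        int pos = buf.getInt(eocd + 16);                   // offset of the central directory

        Map<String, Boolean> result = new LinkedHashMap<>();
        for (int n = 0; n < entryCount; n++) {
            if (buf.getInt(pos) != CEN_SIGNATURE) {
                throw new IllegalArgumentException("Bad central directory header at offset " + pos);
            }
            int flags = buf.getShort(pos + 8) & 0xFFFF;
            int nameLen = buf.getShort(pos + 28) & 0xFFFF;
            int extraLen = buf.getShort(pos + 30) & 0xFFFF;
            int commentLen = buf.getShort(pos + 32) & 0xFFFF;

            boolean utf8 = (flags & UTF8_FLAG) != 0;
            Charset charset = utf8 ? StandardCharsets.UTF_8 : legacyCharset;
            result.put(new String(zip, pos + 46, nameLen, charset), utf8);

            pos += 46 + nameLen + extraLen + commentLen; // advance to the next header
        }
        return result;
    }
}
```

Because the flag is read per entry, this also copes with the mixed case described above, where some names in one archive are UTF-8 and others are not.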

Related

Java UTF-8 non ASCII chars not supported on Windows

I am working on a tool which produces a zip with some generated files. Some of my users on Windows 10 reported to me that when I add a string to a file within a zip, non-ASCII characters are replaced by "?".
It is really strange because it works perfectly on Linux (NixOS). Do you have any idea?
fis = new ByteArrayInputStream(content.getBytes(StandardCharsets.UTF_8));
...
public static void addToZip(String zipFilePath, final InputStream fis, final ZipOutputStream zos)
        throws IOException {
    final ZipEntry zipEntry = new ZipEntry(zipFilePath);
    zipEntry.setLastModifiedTime(FileTime.fromMillis(0L));
    zos.putNextEntry(zipEntry);
    final byte[] bytes = new byte[1024];
    int length;
    while ((length = fis.read(bytes)) >= 0)
        zos.write(bytes, 0, length);
    zos.closeEntry();
    fis.close();
    if (!(Settings.PROTECTION.toBool()))
        return;
    zipEntry.setCrc(bytes.length);
    zipEntry.setSize(new BigInteger(bytes).mod(BigInteger.valueOf(Long.MAX_VALUE)).longValue());
}
...
final ZipOutputStream zos = new ZipOutputStream(fos, StandardCharsets.UTF_8);

Since your description and your code are incomplete, I'm making a few assumptions:
Some of the files in the zip archive are text files (or similar, such as CSV).
The zip archives are created on a Linux system (or a single system under your control).
The zip archives are then sent to your users and used on different operating systems.
If so, the problem is not related to the zip archive. Instead, it's a general problem of text files. It would also occur if you just sent a single text file.
The cause of the problem is that text files do not contain any reliable information about their encoding. On your side, the text file is created using UTF-8 encoding. On the users' side, different operating systems and different tools are used to view or process the text file. Some of these tools might make an effort to determine the encoding and guess it correctly. But if they just use the operating system's default encoding, users on Windows will end up with the wrong encoding, as Windows defaults to Windows-1252 and similar encodings.
The result of processing a UTF-8 encoded file with Windows-1252 encoding is that non-ASCII characters come out garbled (for example "Ã¤" instead of "ä"), or as "?" where a byte or character has no mapping.
If your users view the text files with text editors, ask them to set the text editor to UTF-8. If the text files are processed with custom software, ask them to modify the software such that it explicitly uses UTF-8.
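For the custom-software case, "explicitly uses UTF-8" usually means passing the charset when wrapping the byte stream instead of relying on the platform default. A minimal sketch, assuming the data arrives as an InputStream; the class and method names (ExplicitCharsetRead, readAll) are hypothetical:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

// Hypothetical helper, not part of the JDK.
final class ExplicitCharsetRead {

    // Reads the whole stream as text, decoding with the charset the caller names
    // rather than with the platform default.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                text.append(buf, 0, n);
            }
        }
        return text.toString();
    }
}
```

Calling readAll(stream, StandardCharsets.UTF_8) on UTF-8 data returns the original text on any operating system, whereas passing Charset.forName("windows-1252") for the same bytes turns "ä" into "Ã¤".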

can not save utf8 file in windows server with java

I have a simple Java application that saves some strings using UTF-8 encoding.
But when I open that file with Notepad and choose Save As, it shows the encoding as ANSI. I don't know where the problem is.
The code that saves the file is:
File fileDir = new File("c:\\Sample.txt");
Writer out = new BufferedWriter(new OutputStreamWriter(
        new FileOutputStream(fileDir), "UTF8"));
out.append("kodehelp UTF-8").append("\r\n");
out.append("??? UTF-8").append("\r\n");
out.append("???? UTF-8").append("\r\n");
out.flush();
out.close();

The characters you are writing to the file, as they appear in the code snippet, are in the basic ASCII subset of UTF-8. Notepad is likely auto-detecting the format and, seeing nothing outside the ASCII range, decides the file is ANSI.
If you want to force a different decision, write characters such as 字 or õ, which are well outside the ASCII range.
It is possible that the ??? strings in your example were intended to be non-ASCII text. If so, make sure your IDE and/or build tool recognizes the source files as UTF-8, and that the files are indeed UTF-8 encoded. If you provide more information about your build system, then we can help further.
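One way to take the source-file encoding out of the picture while testing is to write the non-ASCII characters as Unicode escapes. A minimal sketch, assuming Java 7 or later; the class and method names (Utf8SampleWriter, writeSample) are hypothetical, and the escapes stand for 字 and õ:

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of the JDK.
final class Utf8SampleWriter {

    // Writes two lines to the target file as UTF-8. The second line contains
    // non-ASCII characters, given as escapes so that the encoding of this
    // source file cannot corrupt them.
    static void writeSample(Path target) throws IOException {
        try (BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            out.write("kodehelp UTF-8\r\n");
            out.write("\u5b57 \u00f5 UTF-8\r\n");
        }
    }
}
```

If the resulting file still shows the wrong characters in an editor, the problem is on the viewing side rather than in the writing code.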

Char set issue in ZipEntry.getName()

In my project, I have functionality for uploading a zip file.
When a user uploads a zip, my system extracts it and displays the folder structure to the user.
If the zip contains a file with a name like Õ.txt, it is displayed as O.txt.
ZipFile zipFile = new ZipFile(filePath, Charset.forName("UTF8"));
Enumeration entries = zipFile.entries();
while (entries.hasMoreElements())
{
    ZipEntry entry = (ZipEntry) entries.nextElement();
    System.out.println(entry.getName());
}
Above is my code to read the zip entries.
When I try to get the name of the entry, it gives me O.txt instead of Õ.txt.
I have tested this code with JDK 7, with the same result.
I have also tried different encodings such as CP437, IBM437 and ISO-8859-1, but the result does not change.
Please suggest a way to support all characters when reading entries from the zip file.
Thanks & Regards
Yatin

It seems there may be something wrong with your environment and not necessarily with the way you access the ZIP file. Here's a checklist:
Does the ZIP file really contain a UTF-8 encoded entry with that name? Use a tool like 7-Zip to verify.
Does the JVM use the correct character set? Check the system property file.encoding.
Does the encoding of your output terminal / window match this setting?
After all, the result will only be correct if all elements of the processing chain use correct settings.
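The last two checklist items can be probed from inside the JVM. The following is only a sketch: the class and method names (EncodingCheck, describe, utf8Stream) are hypothetical, and whether the terminal then renders the bytes correctly still depends on the terminal's own settings.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

// Hypothetical helper, not part of the JDK.
final class EncodingCheck {

    // Summarizes the character set settings the JVM is currently running with.
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset();
    }

    // Wraps a stream so that everything printed to it is encoded as UTF-8,
    // independent of the platform default.
    static PrintStream utf8Stream(OutputStream target) throws UnsupportedEncodingException {
        return new PrintStream(target, true, "UTF-8");
    }
}
```

For example, System.out.println(EncodingCheck.describe()) shows the JVM side, and EncodingCheck.utf8Stream(System.out).println(entry.getName()) forces UTF-8 output so that only the terminal encoding remains to be checked.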

Preserving file checksum after extract from zip in java

This is what I'm trying to accomplish:
1) Calculate the checksum of all files to be added to a zip file. Currently using Apache Commons IO as follows:
final Checksum oChecksum = new Adler32();
...
//for every file iFile in folder
long lSum = (FileUtils.checksum(iFile, oChecksum)).getValue();
//store this checksum in a log
2) Compress the folder processed as a zip using the Ant zip task.
3) Extract files from the zip one by one to the specified folder (using both commons io and compression for this), and calculate the checksum of the extracted file:
final Checksum oChecksum = new Adler32();
...
ZipFile myZip = new ZipFile("test.zip");
ZipArchiveEntry zipEntry = myZip.getEntry("checksum.log"); //reads the filename from the log
BufferedInputStream myInputStream = new BufferedInputStream(myZip.getInputStream(zipEntry));
File destFile = new File("/mydir", zipEntry.getName());
destFile.createNewFile();
FileUtils.copyInputStreamToFile(myInputStream, destFile);
long newChecksum = FileUtils.checksum(destFile, oChecksum).getValue();
The problem I have is that the value from newChecksum doesn't match the one from the original file. The files' sizes match on disk. Funny thing is that if I run cksum or md5sum commands on both files directly on a terminal, these are the same for both files. The mismatch occurs only from java.
Is this the correct way to approach it or is there any way to preserve the checksum value after extraction?
I also tried using a CheckedInputStream but this also gets me different values from java.
EDIT: This seems related to the Adler32 object used (pre-zip vs unzip checks). If I do "new Adler32()" in the unzip check for every file instead of reusing the same Adler32 for all, I get the correct result.

Are you trying to compute one checksum for all files concatenated? If yes, you need to make sure you're reading them in the same order in which you checksummed them.
If no, you need to call checksum.reset() between computing the checksum for each file. You'll notice (if you look at the source) that Adler32 is stateful, which means you're computing the checksum of the file plus all the preceding ones during part one.
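The reset() variant can be sketched with plain JDK classes. This is a minimal illustration that works on in-memory byte arrays rather than files; the class and method names (PerFileChecksum, checksums) are hypothetical.

```java
import java.util.zip.Adler32;
import java.util.zip.Checksum;

// Hypothetical helper, not part of the JDK or of Commons IO.
final class PerFileChecksum {

    // Computes an independent Adler-32 value for each input while reusing one
    // Checksum instance. Without the reset() call, each value would also
    // include the data of all preceding inputs.
    static long[] checksums(byte[][] contents) {
        Checksum checksum = new Adler32();
        long[] result = new long[contents.length];
        for (int i = 0; i < contents.length; i++) {
            checksum.reset();
            checksum.update(contents[i], 0, contents[i].length);
            result[i] = checksum.getValue();
        }
        return result;
    }
}
```

Creating a new Adler32 per file, as described in the question's edit, has the same effect as calling reset().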

decompress .gz file in batch

I have 100 of .gz files which I need to de-compress.
I have a couple of questions:
a) I am using the code given at http://www.roseindia.net/java/beginners/JavaUncompress.shtml to decompress the .gz file. It's working fine.
Question: is there a way to get the file name of the zipped file? I know that the Zip classes of Java give an enumeration of entries to work on, which can give me the filename, size etc. stored in a .zip file. But do we have the same for .gz files, or is the file name the same as filename.gz with .gz removed?
b) Is there another elegant way to decompress a .gz file by calling a utility function from the Java code, like calling the 7-Zip application from a Java class? Then I wouldn't have to worry about input/output streams.
Thanks in advance.
Kapil

a) Zip is an archive format, while gzip is not. So an entry iterator does not make much sense unless (for example) your gz-files are compressed tar files. What you want is probably:
File outFile = new File(infile.getParent(), infile.getName().replaceAll("\\.gz$", ""));
b) Do you only want to uncompress the files? If not you may be ok with using GZIPInputStream and read the files directly, i.e. without intermediate decompression.
But ok. Let's say you really only want to uncompress the files. If so, you could probably use this:
public static File unGzip(File infile, boolean deleteGzipfileOnSuccess) throws IOException {
    GZIPInputStream gin = new GZIPInputStream(new FileInputStream(infile));
    FileOutputStream fos = null;
    try {
        File outFile = new File(infile.getParent(), infile.getName().replaceAll("\\.gz$", ""));
        fos = new FileOutputStream(outFile);
        byte[] buf = new byte[100000];
        int len;
        while ((len = gin.read(buf)) > 0) {
            fos.write(buf, 0, len);
        }
        fos.close();
        if (deleteGzipfileOnSuccess) {
            infile.delete();
        }
        return outFile;
    } finally {
        if (gin != null) {
            gin.close();
        }
        if (fos != null) {
            fos.close();
        }
    }
}

Regarding A, the gunzip command creates an uncompressed file with the original name minus the .gz suffix. See the man page.
Regarding B, do you need gunzip specifically, or will another compression algorithm do? There's a Java port of the LZMA compression algorithm used by 7-Zip to create .7z files, but it will not handle .gz files.

If you have a fixed number of files to decompress once, why don't you use existing tools for that?
As Paul Morie noticed, gunzip can do that:
for i in *.gz; do gunzip "$i"; done
It will automatically name the output files by stripping the trailing .gz.
On Windows, try WinRAR, or gunzip from http://unxutils.sf.net

GZip is normally used only on single files, so it generally does not contain information about individual files. To bundle multiple files into one compressed archive, they are first combined into an uncompressed Tar file (with info about individual contents), and then compressed as a single file. This combination is called a Tarball.
There are libraries to extract the individual file info from a Tar, just as with ZipEntries. You will first have to extract the .gz file into a temporary file in order to use it, or at least feed the GZIPInputStream into the Tar library.
You may also call 7-Zip from the command line using Java. 7-Zip command-line syntax is here: 7-Zip Command Line Syntax. Example of calling the command shell from Java: Executing shell commands in Java. You will have to call 7-Zip twice: once to extract the Tar from the .tar.gz or .tgz file, and again to extract the individual files from the Tar.
Or, you could just do the easy thing and write a brief shell script or batch file to do your decompression. There's no reason to hammer a square peg in a round hole -- this is what batch files are made for. As a bonus, you can also feed them parameters, reducing the complexity of a java command line execution considerably, while still letting java control execution.

Have you tried
gunzip *.gz

.gz files (gzipped) can store the filename of a compressed file. So for example FuBar.doc can be saved inside myDocument.gz and with appropriate uncompression, the file can be restored to the filename FuBar.doc. Unfortunately, java.util.zip.GZIPInputStream does not support any way of reading the filename even if it is stored inside the archive.
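If the stored name is needed, the gzip header can be parsed by hand before the stream is passed to GZIPInputStream. The following is a minimal sketch based on the header layout in RFC 1952 (magic bytes, flag byte, optional extra field, then an optional zero-terminated ISO-8859-1 name); the class and method names (GzipOriginalName, read) are hypothetical, and it returns null when no name is stored.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of the JDK.
final class GzipOriginalName {
    private static final int FEXTRA = 0x04; // an extra field precedes the name
    private static final int FNAME = 0x08;  // an original file name is present

    // Returns the original file name stored in the gzip header, or null if none.
    // The stream is consumed up to the end of the name field.
    static String read(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        if (data.readUnsignedByte() != 0x1f || data.readUnsignedByte() != 0x8b) {
            throw new IOException("Not a gzip stream");
        }
        data.readUnsignedByte();             // compression method
        int flags = data.readUnsignedByte(); // header flags
        data.readFully(new byte[6]);         // modification time, extra flags, OS

        if ((flags & FEXTRA) != 0) {
            // Extra field: 2-byte little-endian length followed by that many bytes.
            int xlen = data.readUnsignedByte() | (data.readUnsignedByte() << 8);
            data.readFully(new byte[xlen]);
        }
        if ((flags & FNAME) == 0) {
            return null;
        }
        ByteArrayOutputStream name = new ByteArrayOutputStream();
        int b;
        while ((b = data.readUnsignedByte()) != 0) {
            name.write(b);
        }
        return name.toString("ISO-8859-1");
    }
}
```

Since the method consumes part of the stream, the file has to be reopened (or the stream reset) before decompressing it with GZIPInputStream.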
