Char set issue in ZipEntry.getName()

In my project, users can upload a zip file.
When a user uploads a zip, my system extracts it and displays the folder structure to the user.
If the zip contains a file with a name like Õ.txt, it is displayed as O.txt.
ZipFile zipFile = new ZipFile(filePath, Charset.forName("UTF8"));
Enumeration<? extends ZipEntry> entries = zipFile.entries();
while (entries.hasMoreElements()) {
    ZipEntry entry = entries.nextElement();
    System.out.println(entry.getName());
}
Above is my code to read the zip entries.
When I get the name of an entry, it gives me O.txt instead of Õ.txt.
I have tested this code with JDK 7 and get the same result.
I have also tried other encodings such as CP437, IBM437 and ISO-8859-1, with no change in the result.
Please suggest a way to support all characters when reading entry names from the zip file.
Thanks & Regards
Yatin

It seems there may be something wrong with your environment and not necessarily with the way you access the ZIP file. Here's a checklist:
Does the ZIP file really contain a UTF-8 encoded entry with that name? Use a tool like 7-Zip to verify.
Does the JVM use the correct character set? Check the system property file.encoding.
Does the encoding of your output terminal / window match this setting?
After all, the result will only be correct if all elements of the processing chain use correct settings.
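One way to separate a wrongly decoded name from a wrongly displayed one is to print the code points of the name instead of the characters. The following is only a minimal sketch: the class and method names are made up for illustration, and in real use you would pass entry.getName() rather than a literal.

```java
import java.nio.charset.Charset;

public class EncodingCheck {
    /** Renders each code point of s as U+XXXX so the console encoding cannot hide it. */
    static String toCodePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Settings named in the checklist above
        System.out.println("file.encoding  = " + System.getProperty("file.encoding"));
        System.out.println("defaultCharset = " + Charset.defaultCharset());
        // Stand-in for entry.getName(); \u00D5 is the character Õ
        System.out.println(toCodePoints("\u00D5.txt"));
    }
}
```

If the first code point printed for the entry name is U+00D5, the name was read correctly and only the terminal is at fault. If it is U+004F, the plain O is already in the decoded name.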

Related

Java, unzip folder with German characters in filenames

I'm trying to unzip a folder that contains German characters in its file names, for example Aufhänge.
I know that Java 7 uses UTF-8 by default, and I think "ä" is a valid UTF-8 character.
Here is my code snippet:
public static void main(String[] args) throws IOException {
    ZipInputStream zipInputStream = new ZipInputStream(new FileInputStream(ZIP_PATH), StandardCharsets.UTF_8);
    ZipEntry zipEntry;
    while ((zipEntry = zipInputStream.getNextEntry()) != null) {
        System.out.println(zipEntry.getName());
    }
}
This is an error that I get: java.lang.IllegalArgumentException: MALFORMED
It works with Charset.forName("Cp437"), but it doesn't work with StandardCharsets.UTF_8

You don't mention your operating system, nor how you created the zip file, but I managed to recreate your problem anyway, using 7-Zip on Windows 10:
Create a simple text file with some trivial content (e.g. nothing but the three characters "abc").
Save the file as D:\Temp\Aufhänge.txt. Note the umlaut in the file name.
Locate that file in Windows File Explorer.
Select the file and right click. From the context menu select 7-Zip > Add to "Aufhänge.zip" to create Aufhänge.zip.
Then, in NetBeans run the following code to unzip the file you just created:
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
public class GermanZip {
    static String ZIP_PATH = "D:\\Temp\\Aufhänge.zip";
    public static void main(String[] args) throws FileNotFoundException, IOException {
        ZipInputStream zipInputStream = new ZipInputStream(new FileInputStream(ZIP_PATH), Charset.forName("UTF-8"));
        ZipEntry zipEntry;
        while ((zipEntry = zipInputStream.getNextEntry()) != null) {
            System.out.println(zipEntry.getName());
        }
    }
}
As you pointed out, the code throws java.lang.IllegalArgumentException: MALFORMED when evaluating this expression: (zipEntry = zipInputStream.getNextEntry()) != null.
The problem arises because by default 7-Zip encodes the names of the files within the zip file using Cp437, as noted in this comment from 7-Zip:
Default encoding is OEM (DOS) encoding. It's for compatibility with
old zip software.
That's why the unzip works when using Charset.forName("Cp437") instead of Charset.forName("UTF-8").
If you want to unzip using Charset.forName("UTF-8") then you have to force 7-Zip to encode the filenames within the zip in UTF-8. To do this specify the cu parameter when running 7-Zip, as noted in the linked comment:
In Windows File Explorer select the file and right click.
From the context menu select 7-Zip > "Add to Archive...".
In the Add to Archive dialog specify cu in the Parameters field.
Having stored the zipped filenames in UTF-8 format, you can then replace Charset.forName("Cp437") with Charset.forName("UTF-8") in your code, and no exception will be thrown when unzipping.
This answer is specific to Windows 10 and 7-Zip, but the general principle should apply in any environment: if specifying an encoding of UTF-8 for your ZipInputStream be certain that the filenames within the zip file really are encoded using UTF-8. You can easily verify this by opening the zip file in a binary editor and searching for the names of the zipped files.
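When you cannot control how the zip was created, one pragmatic option is to try UTF-8 first and fall back to Cp437 if a name fails to decode. This is a heuristic sketch, not a reliable detector: it assumes Cp437 is the only alternative, and a non-UTF-8 name can occasionally happen to be valid UTF-8. The class and method names are hypothetical, and sampleZip exists only to make the example self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipException;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipNameFallback {
    /** Lists entry names, trying UTF-8 first and Cp437 if a name is not valid UTF-8. */
    static List<String> entryNames(byte[] zipBytes) {
        try {
            try {
                return read(zipBytes, StandardCharsets.UTF_8);
            } catch (IllegalArgumentException | ZipException badName) {
                // Assumption: an archive that is not UTF-8 uses the OEM default, Cp437
                return read(zipBytes, Charset.forName("Cp437"));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static List<String> read(byte[] zipBytes, Charset charset) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zipBytes), charset)) {
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    /** Helper for the demo: builds a one-entry zip in memory, names encoded with charset. */
    static byte[] sampleZip(String entryName, Charset charset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes, charset)) {
            out.putNextEntry(new ZipEntry(entryName));
            out.write("abc".getBytes(StandardCharsets.US_ASCII));
            out.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] oemZip = sampleZip("Aufh\u00E4nge.txt", Charset.forName("Cp437"));
        System.out.println(entryNames(oemZip));
    }
}
```

Both exception types are caught because the failure is reported as IllegalArgumentException in the question above, while ZipException is included as a precaution in case a runtime reports it that way.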
Update based on OP's comment/question below:
Unfortunately the .ZIP File Format Specification does not currently provide a way to store the encoding used for zipped file names apart from one exception, as described in "APPENDIX D - Language Encoding (EFS)":
D.2 If general purpose bit 11 is unset, the file name and comment
SHOULD conform to the original ZIP character encoding. If general
purpose bit 11 is set, the filename and comment MUST support The
Unicode Standard, Version 4.1.0 or greater using the character
encoding form defined by the UTF-8 storage specification. The
Unicode Standard is published by the The Unicode Consortium
(www.unicode.org). UTF-8 encoded data stored within ZIP files is
expected to not include a byte order mark (BOM).
So in your code, for each zipped file, first check whether bit 11 of the general purpose bit flag is set. If it is then you can be certain that the name of that zipped file is encoded using UTF-8. Otherwise the encoding is whatever was used when the zipped file was created. That is Cp437 by default on Windows, but if you are running on Windows and processing a zip file created on Linux I don't think there is an easy way of determining the encoding(s) used.
Unfortunately ZipEntry does not provide a method to access the general purpose bit flag field of a zipped file, so you would need to process the zip file at the byte level to do that.
To add a further complication, "encoding" in this context relates to the encoding used for each zipped filename rather than for the zip file itself. One zipped file name could be encoded in UTF-8, another zipped file name could have been added using Cp437, etc.
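Processing the zip at the byte level could look roughly like the sketch below. It reads the central directory of a zip held in memory and picks the charset per entry from bit 11. It is an illustration under assumptions: no ZIP64 or multi-disk support, the whole archive fits in a byte array, and the caller supplies the fallback charset for entries without the flag. The class and method names are invented for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class ZipFlagReader {
    private static final int END_SIG = 0x06054b50;     // end of central directory record
    private static final int CENTRAL_SIG = 0x02014b50; // central directory file header
    private static final int UTF8_FLAG = 1 << 11;      // general purpose bit 11

    /** Decodes each entry name as UTF-8 if bit 11 is set, otherwise with fallback. */
    static List<String> entryNames(byte[] zip, Charset fallback) {
        ByteBuffer buf = ByteBuffer.wrap(zip).order(ByteOrder.LITTLE_ENDIAN);
        // The end record is at least 22 bytes long; scan backwards for its signature
        int end = zip.length - 22;
        while (end >= 0 && buf.getInt(end) != END_SIG) {
            end--;
        }
        if (end < 0) {
            throw new IllegalArgumentException("end of central directory not found");
        }
        int count = buf.getShort(end + 10) & 0xFFFF; // total number of entries
        int pos = buf.getInt(end + 16);              // offset of the central directory
        List<String> names = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            if (buf.getInt(pos) != CENTRAL_SIG) {
                throw new IllegalArgumentException("bad central directory header at " + pos);
            }
            int flags = buf.getShort(pos + 8) & 0xFFFF;
            int nameLen = buf.getShort(pos + 28) & 0xFFFF;
            int extraLen = buf.getShort(pos + 30) & 0xFFFF;
            int commentLen = buf.getShort(pos + 32) & 0xFFFF;
            Charset cs = (flags & UTF8_FLAG) != 0 ? StandardCharsets.UTF_8 : fallback;
            names.add(new String(zip, pos + 46, nameLen, cs)); // name follows the 46-byte fixed part
            pos += 46 + nameLen + extraLen + commentLen;
        }
        return names;
    }

    /** Helper for the demo: builds a one-entry zip in memory, names encoded with charset. */
    static byte[] sampleZip(String entryName, Charset charset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes, charset)) {
            out.putNextEntry(new ZipEntry(entryName));
            out.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        Charset cp437 = Charset.forName("Cp437");
        System.out.println(entryNames(sampleZip("Aufh\u00E4nge.txt", cp437), cp437));
        System.out.println(entryNames(sampleZip("Aufh\u00E4nge.txt", StandardCharsets.UTF_8), cp437));
    }
}
```

Because the charset is chosen inside the loop, an archive that mixes UTF-8 and Cp437 names is handled entry by entry, which matches the per-entry nature of the flag.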

Java: `A` Archive attribute missing while creating zip programmatically

We are dealing with a decompression library/utility that uses the archive attribute to check for the presence of directories/files within the zip.
The problem is that we are not able to set the archive bit on a zip while creating it. When we create the zip programmatically, it also wipes out the previous attributes.
We tried to set the archive bit with the steps below, but have not got the desired result so far:
1. Parse each zip entry and getExtra byte[].
2. Use Int value=32 and perform bitwise 'OR' operation.
3. setExtra byte[] after 'OR' operation.
Adding some more details:
We tried the following approaches, but the issue is still unresolved.
Using Files.setAttribute() on the file system, but the attributes are reset while creating the zip.
Files.setAttribute(file, "dos:archive", true)
Using Files.copy(), which copies the file attributes associated with the file to the target file, but with no success. Even existing attributes are not retained on the target file.
Files.copy(path, path, StandardCopyOption.COPY_ATTRIBUTES)
Using ZipEntry.setExtra(byte[]).
We found some information online that Java doesn't have any direct method to set these attributes, but according to some articles the extra field is used to set the file permissions on Unix and the MS-DOS file attributes. This is an undocumented field and we didn't find any reliable information online. Basically, the first 2 bytes are used for Unix and the last 2 bytes for DOS file attributes. We tried setting the DOS file attributes with different values.
ZipEntry.setExtra(byte[]) - Sets the optional extra field data for the entry.
Using the WinZip command line tool, but that is not an elegant solution.

I assume this is about DOS (Windows) attributes.
With Java 7:
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
// Set the DOS "archive" attribute on the zip file itself
File theFile = new File("yourfile.zip");
Path file = theFile.toPath();
Files.setAttribute(file, "dos:archive", true);
See: http://kodejava.org/how-do-i-set-the-value-of-file-attributes/

Convert URL to normal windows filename Java 6

I am trying to read a package name from a jar file. My problem is that the URL I get is in a form that is not recognized as a Windows file name.
I read this solution, but it did not help me: Convert URL to normal windows filename Java.
directoryURL.toURI().getSchemeSpecificPart() does not convert it to Windows style.
This is my code:
// Get a File object for the package
URL directoryURL = Thread.currentThread().getContextClassLoader()
        .getResource(packageNameSlashed);
logger.info("URI" + directoryURL.toURI());
logger.info("Windows file Name" + directoryURL.toURI().getSchemeSpecificPart());
// build jar file name, then loop through zipped entries
jarFileName = URLDecoder.decode(directoryURL.getFile(), "UTF-8");
jarFileName = jarFileName.substring(0, jarFileName.indexOf(".jar"));
// HERE it throws an exception
jf = new JarFile(jarFileName + ".jar");
while (jarEntries.hasMoreElements()) {
    entryName = jarEntries.nextElement().getName();
    logger.info("Entry name: " + entryName);
    if (entryName.startsWith(packageNameSlashed)
            && entryName.length() > packageNameSlashed.length() + 5
            && entryName.endsWith(".class")) {
        entryName = entryName.substring(packageNameSlashed.length() + 1);
        packageClassNames.put(entryName, packageName);
    }
}
This is the log:
16-02-2015 14:02:15 INFO - URI jar:file:/C:/SVN/AAA/trunk/aaa/client/target/server-1.0.jar!/packageName
16-02-2015 14:02:15 INFO Windows file Name file:/C:/SVN/AAA/trunk/aaa/client/target/server-1.0.jar!/packageName

A "jar:..." URL does not identify a file. Rather, it identifies a member of a JAR file.
The syntax is (roughly speaking) "jar:<jar-url>!<path-within-jar>", where the <jar-url> is itself a URL; e.g. a "file:" URL in your example.
If you are going to open the JAR file and iterate entries like that, you need to:
Extract the schemeSpecificPart of the original URL
Split the schemeSpecificPart on the "!" character.
Parse the part before the "!" as a URI, then use File(URI) to get the File.
Use the File to open the ZipFile.
Look up the part after the "!" in the ZipFile ...
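The steps above can be sketched as follows. This is only an illustration: JarUrlParts and parse are made-up names, and percent-encoded characters in the part after the "!" are left undecoded.

```java
import java.io.File;
import java.net.URI;

public class JarUrlParts {
    final File jarFile;     // the JAR on disk
    final String entryPath; // path inside the JAR, without a leading slash

    JarUrlParts(File jarFile, String entryPath) {
        this.jarFile = jarFile;
        this.entryPath = entryPath;
    }

    /** Splits a "jar:<jar-url>!<path-within-jar>" URL into its two parts. */
    static JarUrlParts parse(String jarUrl) {
        URI uri = URI.create(jarUrl);
        if (!"jar".equals(uri.getScheme())) {
            throw new IllegalArgumentException("not a jar: URL: " + jarUrl);
        }
        // e.g. file:/C:/.../server-1.0.jar!/packageName
        String ssp = uri.getRawSchemeSpecificPart();
        int bang = ssp.indexOf('!');
        if (bang < 0) {
            throw new IllegalArgumentException("missing '!' separator: " + jarUrl);
        }
        // File(URI) turns the file: URL into a platform path (C:\... on Windows)
        File jar = new File(URI.create(ssp.substring(0, bang)));
        String inner = ssp.substring(bang + 1);
        if (inner.startsWith("/")) {
            inner = inner.substring(1); // zip entry names have no leading slash
        }
        return new JarUrlParts(jar, inner);
    }

    public static void main(String[] args) {
        JarUrlParts parts = parse(
                "jar:file:/C:/SVN/AAA/trunk/aaa/client/target/server-1.0.jar!/packageName");
        System.out.println(parts.jarFile + " -> " + parts.entryPath);
    }
}
```

The resulting jarFile can be passed to new JarFile(File) or new ZipFile(File), and entryPath used as the prefix to match against entry names.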

The answer by Stephen has all the elements you need.
With getResource(package).toURI() or getResource(package).getFile() you get the path to the resource.
Take the substring between file: and ! to get the physical location of the JAR you are interested in.
Call new File on that sub-path and you have a handle to your JAR.
A JAR is a normal zip file, so process it as such (see java.util.zip; there are tutorials on the web).
List the contents of the zip file (you may need to navigate using the part after the ! in your original path) and you get your class names.
I am not sure this is the best way to achieve your goal. I would check how class discovery (which is what you are trying to do) is implemented in some open source framework; for example, Tomcat uses it, and JPA implementations use it to find entities. There is also a discovery project at Apache, but it seems to have been dead for a while.

handling about 450.000 files in a zip

My question is simple. Can Java handle a .zip file containing about 450,000 files? The code I wrote would not load all of the files; it would look up just one specific file in the zip and read it line by line. The file size is about 500 KB.
Would this work, or will I get an OutOfMemoryError?
Sorry, to clarify: uncompressed, each file is about 0.5 MB. Zipped, all the files together are about 250 MB.
The names of the files in the zip are ID + date (unique). When I have to check a log, I call Java with the ID and date, and Java reads just that one file, never more.
Edit: It works, and it works very well. With about 400,000 files in a zip, it works without any problem as long as you have the memory to zip the files.
Edit 2: It works on Linux filesystems without a problem, but on NTFS it sometimes crashed. NTFS has a problem with that many files in one zip.

Using the zip filesystem in Java 7, you can actually access one individual file pretty easily and open a BufferedReader on it.
First you have to create the FileSystem:
public static FileSystem getZipFileSystem(final String zipPath) throws IOException
{
    final Path path = Paths.get(zipPath).toAbsolutePath();
    final Map<String, Object> env = new HashMap<>();
    // Build the URI from path.toUri() so spaces and Windows drive letters are escaped correctly
    final URI uri = URI.create("jar:" + path.toUri());
    return FileSystems.newFileSystem(uri, env, null);
}
Once you have done that, you can create a BufferedReader from an entry in the zip itself:
try (
final FileSystem fs = getZipFileSystem("/path/to/the.zip");
final BufferedReader reader = Files.newBufferedReader(fs.getPath("path/to/entry"),
StandardCharsets.UTF_8);
) {
// operate on the reader
}
You could also read all lines in the entry at once using Files.readAllLines().
If you wish to copy a zip entry to a file on the filesystem, you can also do that:
Files.copy(fs.getPath("path/to/entry"), Paths.get("file/on/local/fs"));
Or you can directly copy the result to an OutputStream, or directly create an entry from an OutputStream...
Or even walk the entire zip using Files.walkFileTree().
Or get all the entries in a "directory" in a zip using Files.newDirectoryStream(). Note that, as its name says, this is a stream; unlike File.listFiles() (which only works on files on disk anyway), it returns an iterator over the entries.
Or... Or... Or...
Note that a FileSystem needs to be .close()d.
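As an illustration of Files.newDirectoryStream() on a zip, here is a small round trip that creates a throwaway zip through the same zip filesystem and then lists one of its "directories". It is a sketch, not a recommendation for the 450,000-file case; the class name, paths and file contents are invented, and UncheckedIOException requires Java 8.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ZipFsListing {
    /** Opens the zip as a FileSystem; create=true makes a new archive if none exists. */
    static FileSystem open(Path zip, boolean create) throws IOException {
        Map<String, String> env = new HashMap<>();
        env.put("create", Boolean.toString(create));
        return FileSystems.newFileSystem(URI.create("jar:" + zip.toUri()), env);
    }

    /** Returns the sorted names of the entries directly under dir inside the zip. */
    static List<String> list(Path zip, String dir) throws IOException {
        List<String> names = new ArrayList<>();
        try (FileSystem fs = open(zip, false);
             DirectoryStream<Path> entries = Files.newDirectoryStream(fs.getPath(dir))) {
            for (Path entry : entries) {
                names.add(entry.getFileName().toString());
            }
        }
        Collections.sort(names);
        return names;
    }

    /** Round trip on a temporary zip: write two entries, then list them. */
    static List<String> demo() {
        try {
            // The zip itself must not exist yet, so only the directory is pre-created
            Path zip = Files.createTempDirectory("zipfs-demo").resolve("logs.zip");
            try (FileSystem fs = open(zip, true)) {
                Files.createDirectory(fs.getPath("/logs"));
                Files.write(fs.getPath("/logs/b.log"), Arrays.asList("second"), StandardCharsets.UTF_8);
                Files.write(fs.getPath("/logs/a.log"), Arrays.asList("first"), StandardCharsets.UTF_8);
            }
            return list(zip, "/logs");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Both the FileSystem and the DirectoryStream are opened in try-with-resources, so each is closed even if listing fails part way through.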

I'm not sure that I understand what you're trying to do.
If it's 0.5 MB/file and 450,000 files, you'll need 225GB. You won't have enough memory to do all this in a single zip in memory even if you get 90% compression.
I'd recommend breaking it into manageable chunks. You'll be able to parallelize that way too, so it's not a bad idea.

How to verify the content of ZipFile before saving to disk

I have an application that requires a user to upload a zipfile containing xml report file among other files.
What I want to do is verify that it is a zip, then open it, check whether there is an XML file, and verify a few nodes that are required in that XML.
I want to do this before I save the zip file to disk, and without creating a temporary file. I will only save the file if it passes validation.
I am using Spring multipart CommonsMultipartFile to manage uploads.
The application is using Java, jsp, tomcat
Thanks.

See my comment on the OP about the wisdom of buffering the entire file in memory.
One quick first check for a valid zip file is to check the first 4 bytes for the appropriate "magic" bytes. A zip file should start with the 4 bytes {(byte)0x50, (byte)0x4b, (byte)0x03, (byte)0x04}. The only way to really check it, however, is to attempt to unzip it.
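A minimal sketch of both checks on an upload held in memory, with no temporary file. It assumes the upload bytes are already available as a byte[] (for example from the multipart file's getBytes()), and UploadCheck and its methods are hypothetical names. Note that an empty zip archive begins with 50 4B 05 06 instead, so this magic check rejects it.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.Locale;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class UploadCheck {
    /** True if the data starts with the local file header signature 50 4B 03 04. */
    static boolean hasZipMagic(byte[] data) {
        return data.length >= 4
                && data[0] == 0x50 && data[1] == 0x4B
                && data[2] == 0x03 && data[3] == 0x04;
    }

    /** True if the in-memory upload can be unzipped and contains at least one .xml entry. */
    static boolean containsXml(byte[] upload) {
        if (!hasZipMagic(upload)) {
            return false;
        }
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(upload))) {
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                String name = entry.getName().toLowerCase(Locale.ROOT);
                if (!entry.isDirectory() && name.endsWith(".xml")) {
                    return true;
                }
            }
            return false;
        } catch (IOException | IllegalArgumentException e) {
            // Corrupt archive or an entry name that cannot be decoded
            return false;
        }
    }
}
```

To validate the required nodes, the same loop could hand the ZipInputStream to an XML parser when it reaches the matching entry, since the stream is positioned at that entry's data after getNextEntry().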

If you want to check whether a file is a ZIP file, perhaps you can use getContentType() method of the URLConnection class? Something like this:
URL u = new URL(fileUrl);
URLConnection uc = u.openConnection();
String type = uc.getContentType();
But it would be much faster to detect the magic bytes which, for the ZIP format, are 50 4B.
