When creating a zip archive, what constitutes a duplicate entry

In a Java web application I am creating a zip file from various in-memory files (stored as byte[]).
Here's the key bit of code:
ByteArrayOutputStream baos = new ByteArrayOutputStream();
ZipOutputStream zos = new ZipOutputStream(baos);
for (/* each member of a collection of objects */) {
    PDFDocument pdfDocument = /* generate PDF for this member of the collection */;
    ZipEntry entry = new ZipEntry(pdfDocument.getFileName());
    entry.setSize(pdfDocument.getBody().length);
    zos.putNextEntry(entry);
    zos.write(pdfDocument.getBody()); // pdfDocument.getBody() returns byte[]
    zos.closeEntry();
}
zos.close();
The problem: I'm sometimes getting a "ZipException: duplicate entry" when doing the "putNextEntry()" line.
The PDF files themselves will certainly be different, but they may have the same name ("PDF_File_for_John_Smith.pdf"). Is a name collision sufficient to cause this exception?

You can't store two entries with the same name in a zip archive (in the same folder), much like you can't have two files with the same name in the same folder in a filesystem.
Edit: And while technically the zip file format allows this, the Java API for dealing with ZIP archives does not.

Yes -- you can use a directory structure inside your ZIP file if you need to hold multiple files with the same file name.

I believe so. Zip was originally intended to archive a directory structure, so it expects filenames to be unique. You could add directories to keep your files separated (and provide extra information to differentiate them, if you want).
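As a rough illustration of the answers above, here is a minimal sketch of one way to avoid the name collision: track the entry names already used and add a counter before the extension when a name repeats. The class name UniqueZipNames, the helper uniqueName, and the " (n)" suffix scheme are assumptions made for this example, not part of the original code; putting each file in its own directory inside the archive would work just as well.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

class UniqueZipNames {

    /** Returns name if it is unused so far; otherwise inserts " (n)" before the extension. */
    static String uniqueName(String name, Set<String> used) {
        int dot = name.lastIndexOf('.');
        String base = dot < 0 ? name : name.substring(0, dot);
        String ext = dot < 0 ? "" : name.substring(dot);
        String candidate = name;
        // Set.add returns false when the name is already taken
        for (int n = 2; !used.add(candidate); n++) {
            candidate = base + " (" + n + ")" + ext;
        }
        return candidate;
    }

    public static void main(String[] args) throws IOException {
        // Two different documents that happen to share a file name
        String[] names = { "PDF_File_for_John_Smith.pdf", "PDF_File_for_John_Smith.pdf" };
        Set<String> used = new HashSet<>();
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(baos)) {
            for (String name : names) {
                String entryName = uniqueName(name, used);
                zos.putNextEntry(new ZipEntry(entryName)); // names are now distinct
                zos.write(("body of " + entryName).getBytes(StandardCharsets.UTF_8));
                zos.closeEntry();
                System.out.println(entryName);
            }
        }
    }
}
```

The Set must live for the whole archive, since uniqueness is only required within one zip.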

Related

In Java, can you walk through the contents of nested zip files without inflating the parent?

I have a zip file which contains zip files (which may themselves contain zip files).
parent.zip
|- child_1.zip
| |- foo.txt
|
|- child_2.zip
| |- bar.txt
|
|- baz.txt
Using ZipFile, I can get the ZipEntries of the parent zip file, and see the children (child_1.zip, child_2.zip, baz.txt), but I cannot find a way to examine the contents of those child zips (foo.txt, bar.txt) without inflating the parent zip.
Is this possible, or do I need to inflate parent.zip?

One can use a zip file system using the jar:file: protocol:
URI uri = new URI(
        "jar:file:/home/.../.../external.zip!/.../internal.zip!/");
Map<String, ?> env = new HashMap<>();
try (FileSystem zipfs = FileSystems.newFileSystem(uri, env)) {
    Path rootPath2 = zipfs.getPath("/");
    Files.walk(rootPath2).forEach(p -> {
        System.out.printf("Path %s%n", p.toString());
    });
}
For a recursive walk one has to create URIs with an added "!/", and do the recursion oneself.
Using Files one can copy files out of and into a zip file system. (Here I have some doubts.)

This isn't a problem with zip files themselves (though it is a horrific format), but with the java.util.zip API, and probably with zlib, which is typically used to implement it.
ZipFile requires a File which it likes to memory map. If the "file" is actually a nested entry, that's not going to fly unless you copy it out, or have some OS-specific trick up your sleeve.
If the nested zip file is compressed within the outer zip file, random access is obviously out. You would need a different API anyway. However, java.util.zip does have ZipInputStream. Don't treat it as an InputStream - that's a typically strange subtyping arrangement. It does allow you to stream out entries, even if the archive is a compressed entry of the outer file.
(Roughly, ZIP files work like this: at the end of the file is a central directory. To access the archive in a random-access manner, you need to load the end of the file and read that in. It contains names, lengths, etc., as well as an offset to each entry in the file. The entries themselves contain names, lengths, etc., and the actual file contents. No, the two needn't be consistent, or have any kind of 1-1 correlation. The entries may also contain other lies, such as the decompressed length being wrong or -1. Anyway, you can ignore the central directory and read the entries sequentially.
JARs add to the fun by adding a META-INF/INDEX.LIST and a META-INF/MANIFEST.MF as the first entries of the file. The former contains an index, similar to the central directory, but at the front rather than the end. The latter may contain a listing of the files together with signatures. Executable zips and GIFARs (and I think similar, earlier discovered equivalents for Microsoft products) may have something stuffed in front of the zip, so you have to go in through the rear for those.)
A small demonstration program.
import java.io.*;
import java.util.zip.*;
interface Code {
    static void main(String[] args) throws Exception {
        // Open the outer archive with random access.
        ZipFile zipZip = new ZipFile("zip.zip.zip");
        ZipEntry zipEntry = zipZip.getEntry("zip.zip");
        if (zipEntry == null) {
            throw new Error("zip.zip not found");
        }
        // Stream the nested archive straight out of the outer one.
        InputStream zipIn = zipZip.getInputStream(zipEntry);
        ZipInputStream zip = new ZipInputStream(zipIn);
        for (;;) {
            ZipEntry entry = zip.getNextEntry();
            if (entry == null) {
                break;
            }
            System.err.println(entry.getName());
            // Reading stops at the end of the current entry, not at the end of the archive.
            new BufferedReader(new InputStreamReader(zip)).lines().forEach(l -> {
                System.err.println("> "+l);
            });
        }
    }
}
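The same streaming idea can be applied recursively. The following is a sketch only: it builds the parent.zip layout from the question in memory (so it needs no files on disk), then wraps each entry whose name ends in ".zip" in a further ZipInputStream. The class NestedZipLister, its helper methods, and the "!/" separator in the printed paths are assumptions made for this example, and deciding by file extension is a simplification.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class NestedZipLister {

    /** Adds every entry name to out, descending into entries whose name ends in ".zip". */
    static void list(InputStream in, String prefix, List<String> out) throws IOException {
        // Not closed here: closing a nested ZipInputStream would close the outer stream too.
        ZipInputStream zip = new ZipInputStream(in);
        ZipEntry entry;
        while ((entry = zip.getNextEntry()) != null) {
            String path = prefix + entry.getName();
            out.add(path);
            if (!entry.isDirectory() && entry.getName().endsWith(".zip")) {
                list(zip, path + "!/", out); // stream the child archive out of the parent
            }
        }
    }

    /** Builds a zip in memory; names[i] gets bodies[i] as its content. */
    static byte[] zip(String[] names, byte[][] bodies) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bytes)) {
            for (int i = 0; i < names.length; i++) {
                zos.putNextEntry(new ZipEntry(names[i]));
                zos.write(bodies[i]);
                zos.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    /** Recreates the parent.zip layout from the question, entirely in memory. */
    static byte[] sampleParent() throws IOException {
        byte[] text = "hello".getBytes(StandardCharsets.UTF_8);
        byte[] child1 = zip(new String[] { "foo.txt" }, new byte[][] { text });
        byte[] child2 = zip(new String[] { "bar.txt" }, new byte[][] { text });
        return zip(new String[] { "child_1.zip", "child_2.zip", "baz.txt" },
                   new byte[][] { child1, child2, text });
    }

    public static void main(String[] args) throws IOException {
        List<String> names = new ArrayList<>();
        list(new ByteArrayInputStream(sampleParent()), "", names);
        names.forEach(System.out::println);
    }
}
```

Because everything is read sequentially, the parent is still decompressed as it streams past, but nothing is written to disk and no entry has to be held in memory as a whole.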

Handling about 450,000 files in a zip

My question is simple: would Java handle a .zip file with about 450,000 files in it? The code I wrote would not load all of the files; it would search for just one specific file in the zip and read it line by line. The file size is about 500 KB.
Would this work, or will I get an OutOfMemoryError?
Sorry, to clarify: each file is about 0.5 MB uncompressed. Zipped, all the files together come to about 250 MB.
OK, the file names in that zip are ID + date (unique). If I have to check a log, I'll call Java with the ID + date, and Java reads just that one file, never more.
Edit: It works, and it works very well. With about 400,000 files in a zip it works without any problem, provided you have the memory to zip the files.
Edit 2: It works on Linux filesystems without a problem; on NTFS it sometimes crashed. NTFS has a problem with that many files in one zip.

Using the zip filesystem in Java 7, you can actually access one individual file pretty easily and open a BufferedReader on it.
First you have to create the FileSystem:
public static FileSystem getZipFileSystem(final String zipPath)
    throws IOException
{
    final Path path = Paths.get(zipPath).toAbsolutePath();
    final Map<String, Object> env = new HashMap<>();
    // path.toUri() yields a properly encoded file: URI (handles spaces etc.)
    final URI uri = URI.create("jar:" + path.toUri());
    return FileSystems.newFileSystem(uri, env, null);
}
Once you have done that, you can create a BufferedReader from an entry in the zip itself:
try (
    final FileSystem fs = getZipFileSystem("/path/to/the.zip");
    final BufferedReader reader = Files.newBufferedReader(fs.getPath("path/to/entry"),
        StandardCharsets.UTF_8);
) {
    // operate on the reader
}
You could also read all lines in the entry at once using Files.readAllLines().
If you wish to copy a zip entry to a file on the filesystem, you can also do that:
Files.copy(fs.getPath("path/to/entry"), Paths.get("file/on/local/fs"));
Or you can directly copy the result to an OutputStream, or directly create an entry from an OutputStream...
Or even walk the entire zip using Files.walkFileTree().
Or get all the entries in a "directory" in a zip using Files.newDirectoryStream(). Note that as its name says, this is a stream; unlike File.listFiles() (which only works on files on disk anyway), this returns an iterator over the entries.
Or... Or... Or...
Note that a FileSystem needs to be .close()d.
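To make this concrete, here is a self-contained sketch that first creates a throwaway zip with many small entries and then opens it as a file system to read a single entry. The class ZipFsSingleEntry, the entry naming scheme ("/id7.log") and the entry contents are made up for this example; the "create" option is the zip file system provider's property for creating a new archive.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.Map;

class ZipFsSingleEntry {

    static URI zipUri(Path zip) {
        // toUri() percent-encodes spaces etc., unlike plain string concatenation
        return URI.create("jar:" + zip.toAbsolutePath().toUri());
    }

    /** Creates a temporary zip holding the given number of small text entries. */
    static Path createSampleZip(int entries) throws IOException {
        Path zip = Files.createTempFile("logs", ".zip");
        Files.delete(zip); // the provider creates the archive itself when "create" is "true"
        Map<String, String> env = Collections.singletonMap("create", "true");
        try (FileSystem fs = FileSystems.newFileSystem(zipUri(zip), env)) {
            for (int i = 0; i < entries; i++) {
                String text = "first line of " + i + "\nsecond line\n";
                Files.write(fs.getPath("/id" + i + ".log"), text.getBytes(StandardCharsets.UTF_8));
            }
        }
        return zip;
    }

    /** Opens the archive, reads the first line of one entry, and closes the file system again. */
    static String firstLine(Path zip, String entry) throws IOException {
        Map<String, Object> env = Collections.emptyMap();
        try (FileSystem fs = FileSystems.newFileSystem(zipUri(zip), env);
             BufferedReader reader = Files.newBufferedReader(fs.getPath(entry), StandardCharsets.UTF_8)) {
            return reader.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        Path zip = createSampleZip(1000);
        try {
            System.out.println(firstLine(zip, "/id7.log"));
        } finally {
            Files.deleteIfExists(zip);
        }
    }
}
```

Only the requested entry is decompressed; how much memory the provider needs for the directory of a very large archive is something to measure rather than assume.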

I'm not sure that I understand what you're trying to do.
If it's 0.5 MB/file and 450,000 files, you'll need 225GB. You won't have enough memory to do all this in a single zip in memory even if you get 90% compression.
I'd recommend breaking it into manageable chunks. You'll be able to parallelize that way too, so it's not a bad idea.

How to check if a ZIP file is empty in Java?

I have the following piece of code -
ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
ZipOutputStream zos = new ZipOutputStream(outputStream);
for (int i = 0; i < params.getGrades().size(); i++) {
    generateReport(param1, param2, zos);
}
zos.flush();
zos.close();
In the generateReport method, I have code to generate my reports as xls files and add them to ZIP.
Is there any way we can check if any files have been written to the ZIP file, or if the ZIP file is empty? Is there any property I can use?
Thanks,
Raaz

You can use ZipFile from the java.util.zip package and invoke its size() method.

After you close zos, outputStream.size() gives you the number of bytes written. You would have to allow for whatever the ZIP header size is for an empty ZIP file.
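Since the archive in the question lives in a ByteArrayOutputStream rather than in a file, another option is to read the finished bytes back with ZipInputStream and count the entries, which avoids guessing at header sizes. This is a sketch under assumptions: ZipEntryCounter and zipWith are hypothetical names, zipWith merely stands in for the question's report loop, and writing an archive with zero entries assumes Java 7 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class ZipEntryCounter {

    /** Counts entries by streaming through the archive bytes. */
    static int countEntries(byte[] zipBytes) throws IOException {
        int count = 0;
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            while (zis.getNextEntry() != null) {
                count++;
            }
        }
        return count;
    }

    /** Stand-in for the question's report loop: writes one tiny entry per name. */
    static byte[] zipWith(String... names) throws IOException {
        ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(outputStream)) {
            for (String name : names) {
                zos.putNextEntry(new ZipEntry(name));
                zos.write(1);
                zos.closeEntry();
            }
        }
        return outputStream.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        System.out.println("empty archive: " + countEntries(zipWith()));
        System.out.println("two reports:   " + countEntries(zipWith("a.xls", "b.xls")));
    }
}
```

If the code that writes the entries is under your control, simply counting the putNextEntry calls is cheaper than re-reading the archive.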

See:
http://www.java-examples.com/get-number-entries-zip-file-example
and:
Count files in ZIP's directory - JAVA, Android
and:
Android: Get Number of Files within Zip?

@Raaz, please go through this link.
In that you can see a Class called 'ZipEntry'. It represents the files contained in a zip folder. It provides some useful methods such as:
zipEntry.getName(); // name of the file contained by zip.
zipEntry.getSize(); // size of the file contained by zip.

@Didier - I decided to take your advice on returning a value, but ended up doing it this way:
Instead of checking if a file has been added to the ZIP, I checked if the list data I'm trying to write to an xls file (the file which in turn gets added to the ZIP) is empty. If it's empty, I set an error value of "No file generated". If the list is not empty, I assign an empty value to the string and return it to the calling function.

Where is a Java File instance stored?

I have a question regarding the Java File class. When I create a File instance, for example,
File aFile = new File(path);
where is the instance aFile stored on the computer? Or is it stored in the JVM? I mean, is there a temp file stored on the local disk?
If I have an InputStream instance and write it to a file by using an OutputStream, for example
File aFile = new File("test.txt");
OutputStream anOutputStream = new FileOutputStream(aFile);
byte aBuffer[] = new byte[1024];
int iLength;
while ((iLength = anInputStream.read(aBuffer)) > 0)
{
    anOutputStream.write(aBuffer, 0, iLength);
}
Now where is the file test.txt stored?
Thanks in advance!

A File object isn't a real file at all - it's really just a filename/location, and methods which hook into the file system to check whether or not the file really exists etc. There's no content directly associated with the File instance - it's not like it's a virtual in-memory file, for example. The instance itself is just an object in memory like any other object.
Creating a File instance on its own does nothing to the file system.
When you create a FileOutputStream, however, that does affect whatever file system you're writing to. The File instance is relatively irrelevant though - you'd get the same effect from:
OutputStream anOutputStream = new FileOutputStream("test.txt");

It will write the file to the location you specify with the path argument.
In your case, it will write it in the directory where you run your java class.
If you specify /test/myproject/myfile.txt
it will go in /test/myproject/myfile.txt

If you don't provide a path, it is in the current directory (i.e., the directory from which java.exe is executed). If you provide a full path, it is stored there.
Regardless, it is always stored in the filesystem, not in JVM memory.
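A small experiment can confirm the points made in these answers. The sketch below (FileInstanceDemo and its method name are made up for this example, and it writes into a temporary directory instead of the working directory) checks whether test.txt exists right after new File(...) and again after writing through a FileOutputStream, then prints where a relative name would resolve to.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class FileInstanceDemo {

    /** Returns { exists after new File(...), exists after writing through FileOutputStream }. */
    static boolean[] existsBeforeAndAfter(Path dir) throws IOException {
        File aFile = new File(dir.toFile(), "test.txt");
        boolean afterNew = aFile.exists(); // only an in-memory path object so far
        try (OutputStream out = new FileOutputStream(aFile)) { // this creates the file on disk
            out.write("hello".getBytes(StandardCharsets.UTF_8));
        }
        boolean afterWrite = aFile.exists();
        aFile.delete(); // clean up the demo file
        return new boolean[] { afterNew, afterWrite };
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("file-demo");
        boolean[] exists = existsBeforeAndAfter(dir);
        System.out.println("exists after new File(): " + exists[0]);
        System.out.println("exists after FileOutputStream: " + exists[1]);
        // A relative name is resolved against the JVM's working directory (user.dir)
        System.out.println(new File("test.txt").getAbsolutePath());
        Files.delete(dir);
    }
}
```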

How do I write XML files directly to a zip archive?

What is the proper way to write a list of XML files using JAXB directly to a zip archive without using a 3rd party library?
Would it be better to just write all the XML files to a directory and then zip?

As others pointed out, you can use the ZipOutputStream class to create a ZIP-file. The trick to put multiple files in a single ZIP-file is to use the ZipEntry descriptors prior to writing (marshalling) the JAXB XML data in the ZipOutputStream. So your code might look similar to this one:
JAXBElement jaxbElement1 = objectFactory.createRoot(rootType);
JAXBElement jaxbElement2 = objectFactory.createRoot(rootType);
ZipOutputStream zos = null;
try {
    zos = new ZipOutputStream(new FileOutputStream("xml-file.zip"));
    // add zip-entry descriptor
    ZipEntry ze1 = new ZipEntry("xml-file-1.xml");
    zos.putNextEntry(ze1);
    // add zip-entry data
    marshaller.marshal(jaxbElement1, zos);
    ZipEntry ze2 = new ZipEntry("xml-file-2.xml");
    zos.putNextEntry(ze2);
    marshaller.marshal(jaxbElement2, zos);
    zos.flush();
} finally {
    if (zos != null) {
        zos.close();
    }
}

The "proper" way — without using a 3rd party library — would be to use java.util.zip.ZipOutputStream.
Personally, though, I prefer TrueZip.
TrueZIP is a Java based plug-in framework for virtual file systems (VFS) which provides transparent access to archive files as if they were just plain directories.

I don't know what JAXB has to do with anything, nor XML - file contents are file contents. Your question is really "How can I output characters directly to a zip archive?"
To do that, open a ZipOutputStream and use the API to create entries, then write contents to each entry. Remember that a zip archive is like a series of named files within the archive.
btw, ZipOutputStream is part of the JDK (i.e. it's not a "library").
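A minimal sketch of that approach, using plain strings as stand-ins for the XML documents so that it runs without JAXB on the classpath. The class XmlToZip and its methods are hypothetical names for this example; with JAXB, the zos.write(...) line would be replaced by marshaller.marshal(element, zos), as in the answer above. The readBack method exists only to check the result.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class XmlToZip {

    /** Writes each (entry name, XML text) pair as one zip entry and returns the archive bytes. */
    static byte[] zipXml(Map<String, String> xmlByName) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bytes)) {
            for (Map.Entry<String, String> e : xmlByName.entrySet()) {
                zos.putNextEntry(new ZipEntry(e.getKey()));                // entry descriptor first
                zos.write(e.getValue().getBytes(StandardCharsets.UTF_8)); // then the entry data
                zos.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    /** Reads the archive back into a map, in entry order. */
    static Map<String, String> readBack(byte[] zipBytes) throws IOException {
        Map<String, String> result = new LinkedHashMap<>();
        byte[] buffer = new byte[4096];
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                ByteArrayOutputStream content = new ByteArrayOutputStream();
                for (int n; (n = zis.read(buffer)) != -1; ) {
                    content.write(buffer, 0, n);
                }
                result.put(entry.getName(), new String(content.toByteArray(), StandardCharsets.UTF_8));
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("xml-file-1.xml", "<root>1</root>");
        docs.put("xml-file-2.xml", "<root>2</root>");
        System.out.println(readBack(zipXml(docs)));
    }
}
```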
