Reading files from an embedded ZIP archive

Reading files from an embedded ZIP archive - java

I have a ZIP archive that's embedded inside a larger file. I know the archive's starting offset within the larger file and its length.
Are there any Java libraries that would enable me to directly read the files contained within the archive? I am thinking along the lines of ZipFile.getInputStream(). Unfortunately, ZipFile doesn't work for this use case since its constructors require a standalone ZIP file.
For performance reasons, I cannot copy the ZIP achive into a separate file before opening it.
edit: Just to be clear, I do have random access to the file.

I've come up with a quick hack (which needs to get sanitized here and there), but it reads the contents of files from a ZIP archive which is embedded inside a TAR. It uses Java6, FileInputStream, ZipEntry and ZipInputStream. 'Works on my local machine':
final FileInputStream ins = new FileInputStream("archive.tar");
// Zip starts at 0x1f6400, size is not needed
long toSkip = 0x1f6400;
// Safe skipping
while(toSkip > 0)
toSkip -= ins.skip(toSkip);
final ZipInputStream zipin = new ZipInputStream(ins);
ZipEntry ze;
while((ze = zipin.getNextEntry()) != null)
{
final byte[] content = new byte[(int)ze.getSize()];
int offset = 0;
while(offset < content.length)
{
final int read = zipin.read(content, offset, content.length - offset);
if(read == -1)
break;
offset += read;
}
// DEBUG: print out ZIP entry name and filesize
System.out.println(ze + ": " + offset);
}
zipin.close();

1.create FileInputStream fis=new FileInputStream(..);
position it at the start of embedded zipfile:
fis.skip(offset);
open ZipInputStream(fis)

I suggest using TrueZIP, it provides file system access to many kinds of archives. It worked well for me in the past.

If you're using Java SE 7, it provides a zip fie system which allows you to read/ write files in the zip directly: http://docs.oracle.com/javase/7/docs/technotes/guides/io/fsp/zipfilesystemprovider.html

I think apache commons compress may help you.
There is a class org.apache.commons.compress.archivers.zip.ZipArchiveEntry, which inherit java.util.zip.ZipEntry.
It has a method getDataOffset(), that can get the offset of data stream within the archive file.

7-zip-JavaBinding is a Java wrapper for the 7-zip C++ library.
The code snippets page in particular has some nice examples including printing a list of items in an archive, extracting a single file and opening multi-part archives.

Check whether zip4j helps you or not.
You can try PartInputStream to read zip file as per your use case.
I think it is better to create temp zip file and then accessing it.

Related

handling about 450.000 files in a zip

My question is simple. Would Java handle a .zip file with about 450,000 files in there? The code that I wrote would not load all of the files, just one specific file would be searched in the zip, and be read line by line. The file size is about 500kb.
Would this work or will I get an OutOfMemory Exception?
Oh sry, uncompressed there about 0,5MB. Zipped are they whole files about 250mb.
Ok, the name of the Files are IDs + Date(unique) in that zip file. If i have to check a log, ill call Java and give the ID + Date and Java is reading just that one file, never more.
Edit: It works, it works very well. About 400.000 files in a zip, if u have the Memory to Zip the Files works without any problem.
Edit2: It works on Linux Filesystems witout a problem, on NTFS sometimes it crashed. NTFS has a problem with that musch files in 1 Zip.

Using the zip filesystem in Java 7, you can actually access one individual file pretty easily and open a BufferedReader on it.
First you have to create the FileSystem:
public static FileSystem getZipFileSystem(final String zipPath)
{
final Path path = Paths.get(zipPath).toAbsolutePath();
final Map<String, Object> env = new HashMap<>();
final URI uri = URI.create("jar:file:" + path.toString());
return FileSystems.newFileSystem(uri, env, null);
}
Once you have done that, you can create a BufferedReader from an entry in the zip itself:
try (
final FileSystem fs = getZipFileSystem("/path/to/the.zip");
final BufferedReader reader = Files.newBufferedReader(fs.getPath("path/to/entry"),
StandardCharsets.UTF_8);
) {
// operate on the reader
}
You could also read all lines in the entry at once using Files.readAllLines().
If you wish to copy a zip entry to a file on the filesystem, you can also do that:
Files.copy(zipfs.getPath("path/to/entry"), Paths.get("file/on/local/fs"));
Or you can directly copy the result to an OutputStream, or directly create an entry from an OutputStream...
Or even walk the entire zip using Files.walkFileTree().
Or get all the entries in a "directory" in a zip using Files.newDirectoryStream(). Note that as its name says, this is a stream; unlike File.listFiles() (which only works on files on disk anyway), this returns a iterator over the entries.
Or... Or... Or...
Note that a FileSystem needs to be .close()d.

I'm not sure that I understand what you're trying to do.
If it's 0.5 MB/file and 450,000 files, you'll need 225GB. You won't have enough memory to do all this in a single zip in memory even if you get 90% compression.
I'd recommend breaking it into manageable chunks. You'll be able to parallelize that way too, so it's not a bad idea.

How do I write XML files directly to a zip archive?

What is the proper way to write a list of XML files using JAXB directly to a zip archive without using a 3rd party library.
Would it be better to just write all the XML files to a directory and then zip?

As others pointed out, you can use the ZipOutputStream class to create a ZIP-file. The trick to put multiple files in a single ZIP-file is to use the ZipEntry descriptors prior to writing (marshalling) the JAXB XML data in the ZipOutputStream. So your code might look similar to this one:
JAXBElement jaxbElement1 = objectFactory.createRoot(rootType);
JAXBElement jaxbElement2 = objectFactory.createRoot(rootType);
ZipOutputStream zos = null;
try {
zos = new ZipOutputStream(new FileOutputStream("xml-file.zip"));
// add zip-entry descriptor
ZipEntry ze1 = new ZipEntry("xml-file-1.xml");
zos.putNextEntry(ze1);
// add zip-entry data
marshaller.marshal(jaxbElement1, zos);
ZipEntry ze2 = new ZipEntry("xml-file-2.xml");
zos.putNextEntry(ze2);
marshaller.marshal(jaxbElement2, zos);
zos.flush();
} finally {
if (zos != null) {
zos.close();
}
}

The "proper" way — without using a 3rd party library — would be to use java.util.zip.ZipOutputStream.
Personally, though, I prefer TrueZip.
TrueZIP is a Java based plug-in framework for virtual file systems (VFS) which provides transparent access to archive files as if they were just plain directories.

I don't know what JAXB has to do with anything, nor XML - file contents are file contents. Your question is really "How can I output characters directly to a zip archive"
To do that, open a ZipOututStream and use the API to create entries then write contents to each entry. Remember that a zip archive is like a series of named files within the archive.
btw, ZipOututStream is part of the JDK (ie it's not a "library")

Preserving file checksum after extract from zip in java

This is what I'm trying to accomplish:
1) Calculate the checksum of all files to be added to a zip file. Currently using apache commons io follows:
final Checksum oChecksum = new Adler32();
...
//for every file iFile in folder
long lSum = (FileUtils.checksum(iFile, oChecksum)).getValue();
//store this checksum in a log
2) Compress the folder processed as a zip using the Ant zip task.
3) Extract files from the zip one by one to the specified folder (using both commons io and compression for this), and calculate the checksum of the extracted file:
final Checksum oChecksum = new Adler32();
...
ZipFile myZip = new ZipFile("test.zip");
ZipArchiveEntry zipEntry = myZip.getEntry("checksum.log"); //reads the filename from the log
BufferedInputStream myInputStream = new BufferedInputStream(myZip.getInputStream(zipEntry));
File destFile = new File("/mydir", zipEntry.getName());
lDestFile.createNewFile();
FileUtils.copyInputStreamToFile(myInputStream, destFile);
long newChecksum = FileUtils.checksum(destFile, oChecksum).getValue();
The problem I have is that the value from newChecksum doesn't match the one from the original file. The files' sizes match on disk. Funny thing is that if I run cksum or md5sum commands on both files directly on a terminal, these are the same for both files. The mismatch occurs only from java.
Is this the correct way to approach it or is there any way to preserve the checksum value after extraction?
I also tried using a CheckedInputStream but this also gets me different values from java.
EDIT: This seems related to the Adler32 object used (pre-zip vs unzip checks). If I do "new Adler32()" in the unzip check for every file instead of reusing the same Adler32 for all, I get the correct result.

Are you trying to for all file concatenated? If yes, you need to make sure you're reading them in the same order "checksumed" them.
If no, you need to call checksum.reset() between computing the checksum for each file. You'll notice (in you look at the source) that Adler32 is stateful, which means you're computing the checksum of the file plus all the preceding ones during part one.

What is the fastest way to extract 1 file from a zip file which contain a lot of file?

I tried the java.util.zip package, it is too slow.
Then I found LZMA SDK and 7z jbinding but they are also lacking something.
The LZMA SDK does not provide a kind of documentation/tutorial of how-to-use, it is very frustrating. No javadoc.
While the 7z jbinding does not provide a simple way to extract only 1 file, however, it only provide way to extract all the content of the zip file. Moreover, it does not provide a way to specify a location to place the unzipped file.
Any idea please?

What does your code with java.util.zip look like and how big of a zip file are you dealing with?
I'm able to extract a 4MB entry out of a 200MB zip file with 1,800 entries in roughly a second with this:
OutputStream out = new FileOutputStream("your.file");
FileInputStream fin = new FileInputStream("your.zip");
BufferedInputStream bin = new BufferedInputStream(fin);
ZipInputStream zin = new ZipInputStream(bin);
ZipEntry ze = null;
while ((ze = zin.getNextEntry()) != null) {
if (ze.getName().equals("your.file")) {
byte[] buffer = new byte[8192];
int len;
while ((len = zin.read(buffer)) != -1) {
out.write(buffer, 0, len);
}
out.close();
break;
}
}

I have not benchmarked the speed but with java 7 or greater, I extract a file as follows.
I would imagine that it's faster than the ZipFile API:
A short example extracting META-INF/MANIFEST.MF from a zip file test.zip:
// file to extract from zip file
String file = "MANIFEST.MF";
// location to extract the file to
File outputLocation = new File("D:/temp/", file);
// path to the zip file
Path zipFile = Paths.get("D:/temp/test.zip");
// load zip file as filesystem
try (FileSystem fileSystem = FileSystems.newFileSystem(zipFile)) {
// copy file from zip file to output location
Path source = fileSystem.getPath("META-INF/" + file);
Files.copy(source, outputLocation.toPath());
}

Use a ZipFile rather than a ZipInputStream.
Although the documentation does not indicate this (it's in the docs for JarFile), it should use random-access file operations to read the file. Since a ZIPfile contains a directory at a known location, this means a LOT less IO has to happen to find a particular file.
Some caveats: to the best of my knowledge, the Sun implementation uses a memory-mapped file. This means that your virtual address space has to be large enough to hold the file as well as everything else in your JVM. Which may be a problem for a 32-bit server. On the other hand, it may be smart enough to avoid memory-mapping on 32-bit, or memory-map just the directory; I haven't tried.
Also, if you're using multiple files, be sure to use a try/finally to ensure that the file is closed after use.

decompress .gz file in batch

I have 100 of .gz files which I need to de-compress.
I have couple of questions
a) I am using the code given at http://www.roseindia.net/java/beginners/JavaUncompress.shtml to decompress the .gz file. Its working fine.
Quest:- is there a way to get the file name of the zipped file. I know that Zip class of Java gives of enumeration of entery file to work upon. This can give me the filename, size etc stored in .zip file. But, do we have the same for .gz files or does the file name is same as filename.gz with .gz removed.
b) is there another elegant way to decompress .gz file by calling the utility function in the java code. Like calling 7-zip application from your java class. Then, I don't have to worry about input/output stream.
Thanks in advance.
Kapil

a) Zip is an archive format, while gzip is not. So an entry iterator does not make much sense unless (for example) your gz-files are compressed tar files. What you want is probably:
File outFile = new File(infile.getParent(), infile.getName().replaceAll("\\.gz$", ""));
b) Do you only want to uncompress the files? If not you may be ok with using GZIPInputStream and read the files directly, i.e. without intermediate decompression.
But ok. Let's say you really only want to uncompress the files. If so, you could probably use this:
public static File unGzip(File infile, boolean deleteGzipfileOnSuccess) throws IOException {
GZIPInputStream gin = new GZIPInputStream(new FileInputStream(infile));
FileOutputStream fos = null;
try {
File outFile = new File(infile.getParent(), infile.getName().replaceAll("\\.gz$", ""));
fos = new FileOutputStream(outFile);
byte[] buf = new byte[100000];
int len;
while ((len = gin.read(buf)) > 0) {
fos.write(buf, 0, len);
}
fos.close();
if (deleteGzipfileOnSuccess) {
infile.delete();
}
return outFile;
} finally {
if (gin != null) {
gin.close();
}
if (fos != null) {
fos.close();
}
}
}

Regarding A, the gunzip command creates an uncompressed file with the original name minus the .gz suffix. See the man page.
Regarding B, Do you need gunzip specifically, or will another compression algorithm do? There's a java port of the LZMA compression algorithm used by 7zip to create .7z files, but it will not handle .gz files.

If you have a fixed number of files to decompress once, why don't you use existing tools for that?
As Paul Morie noticed, gunzip can do that:
for i in *.gz; do gunzip $i; done
And it would automatically name them, stripping .gz$
On windows, try winrar, probably, or gunzip from http://unxutils.sf.net

GZip is normally used only on single files, so it generally does not contain information about individual files. To bundle multiple files into one compressed archive, they are first combined into an uncompressed Tar file (with info about individual contents), and then compressed as a single file. This combination is called a Tarball.
There are libraries to extract the individual file info from a Tar, just as with ZipEntries. One example. You will first have to extract the .gz file into a temporary file in order to use it, or at least feed the GZipInputStream into the Tar library.
You may also call 7-Zip from the command line using Java. 7-Zip command-line syntax is here: 7-Zip Command Line Syntax. Example of calling the command shell from Java: Executing shell commands in Java. You will have to call 7-Zip twice: once to extract the Tar from the .tar.gz or .tgz file, and again to extract the individual files from the Tar.
Or, you could just do the easy thing and write a brief shell script or batch file to do your decompression. There's no reason to hammer a square peg in a round hole -- this is what batch files are made for. As a bonus, you can also feed them parameters, reducing the complexity of a java command line execution considerably, while still letting java control execution.

Have you tried
gunzip *.gz

.gz files (gzipped) can store the filename of a compressed file. So for example FuBar.doc can be saved inside myDocument.gz and with appropriate uncompression, the file can be restored to the filename FuBar.doc. Unfortunately, java.util.zip.GZIPInputStream does not support any way of reading the filename even if it is stored inside the archive.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Reading files from an embedded ZIP archive - java

1.create FileInputStream fis=new FileInputStream(..); position it at the start of embedded zipfile: fis.skip(offset); open ZipInputStream(fis)

I suggest using TrueZIP, it provides file system access to many kinds of archives. It worked well for me in the past.

If you're using Java SE 7, it provides a zip fie system which allows you to read/ write files in the zip directly: http://docs.oracle.com/javase/7/docs/technotes/guides/io/fsp/zipfilesystemprovider.html

I think apache commons compress may help you. There is a class org.apache.commons.compress.archivers.zip.ZipArchiveEntry, which inherit java.util.zip.ZipEntry. It has a method getDataOffset(), that can get the offset of data stream within the archive file.

7-zip-JavaBinding is a Java wrapper for the 7-zip C++ library. The code snippets page in particular has some nice examples including printing a list of items in an archive, extracting a single file and opening multi-part archives.

Check whether zip4j helps you or not. You can try PartInputStream to read zip file as per your use case. I think it is better to create temp zip file and then accessing it.

Related

handling about 450.000 files in a zip

How do I write XML files directly to a zip archive?

Preserving file checksum after extract from zip in java

What is the fastest way to extract 1 file from a zip file which contain a lot of file?

decompress .gz file in batch

Categories

Resources