decompress .gz file in batch - java

I have hundreds of .gz files which I need to decompress.
I have a couple of questions.
a) I am using the code given at http://www.roseindia.net/java/beginners/JavaUncompress.shtml to decompress the .gz files. It's working fine.
Question: is there a way to get the file name of the compressed file? I know that Java's Zip classes give an enumeration of entries to work on, which provides the file name, size, etc. stored in the .zip file. Is there something similar for .gz files, or is the output file name simply the .gz file name with the .gz extension removed?
b) Is there a more elegant way to decompress a .gz file by calling an external utility from the Java code, e.g. invoking the 7-Zip application from the Java class? Then I wouldn't have to worry about input/output streams.
Thanks in advance.
Kapil

a) Zip is an archive format, while gzip is not, so an entry iterator does not make much sense unless (for example) your .gz files are compressed tar files. What you want is probably:
File outFile = new File(infile.getParent(), infile.getName().replaceAll("\\.gz$", ""));
b) Do you only want to uncompress the files? If not, you may be fine using GZIPInputStream and reading the files directly, i.e. without an intermediate decompression step.
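For example, a minimal sketch of reading a .gz file's contents directly, with no decompressed copy written to disk. It assumes the compressed content is text and reuses the same infile File as in the method below:
try (BufferedReader reader = new BufferedReader(new InputStreamReader(
        new GZIPInputStream(new FileInputStream(infile)), StandardCharsets.UTF_8))) {
    String line;
    while ((line = reader.readLine()) != null) {
        // process the decompressed line
    }
}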
But OK, let's say you really only want to uncompress the files. If so, you could probably use this:
public static File unGzip(File infile, boolean deleteGzipfileOnSuccess) throws IOException {
    GZIPInputStream gin = new GZIPInputStream(new FileInputStream(infile));
    FileOutputStream fos = null;
    try {
        // Output file: same name as the input, minus the .gz suffix
        File outFile = new File(infile.getParent(), infile.getName().replaceAll("\\.gz$", ""));
        fos = new FileOutputStream(outFile);
        byte[] buf = new byte[100000];
        int len;
        while ((len = gin.read(buf)) > 0) {
            fos.write(buf, 0, len);
        }
        fos.close();
        if (deleteGzipfileOnSuccess) {
            infile.delete();
        }
        return outFile;
    } finally {
        if (gin != null) {
            gin.close();
        }
        if (fos != null) {
            fos.close();
        }
    }
}

Regarding A, the gunzip command creates an uncompressed file with the original name minus the .gz suffix. See the man page.
Regarding B, do you need gunzip specifically, or will another compression algorithm do? There's a Java port of the LZMA compression algorithm that 7-Zip uses to create .7z files, but it will not handle .gz files.

If you have a fixed number of files to decompress once, why don't you use existing tools for that?
As Paul Morie noticed, gunzip can do that:
for i in *.gz; do gunzip $i; done
It will name the output files automatically, stripping the .gz suffix.
On Windows, try WinRAR, or gunzip from http://unxutils.sf.net.

GZip is normally used only on single files, so it generally does not contain information about individual files. To bundle multiple files into one compressed archive, they are first combined into an uncompressed Tar file (with info about individual contents), and then compressed as a single file. This combination is called a Tarball.
There are libraries to extract the individual file info from a tar, just as with ZipEntries; a sketch using one such library follows below. You will first have to extract the .gz file into a temporary file in order to use it, or at least feed the GZIPInputStream into the tar library.
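A minimal sketch assuming Apache Commons Compress as the tar library (classes from org.apache.commons.compress.archivers.tar; the archive name is a placeholder), iterating entries much like ZipEntries:
try (TarArchiveInputStream tar = new TarArchiveInputStream(
        new GZIPInputStream(new FileInputStream("archive.tar.gz")))) {
    TarArchiveEntry entry;
    while ((entry = tar.getNextTarEntry()) != null) {
        // entry name and size come from the tar header, like a ZipEntry
        System.out.println(entry.getName() + " (" + entry.getSize() + " bytes)");
    }
}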
You may also call 7-Zip from the command line using Java. 7-Zip command-line syntax is here: 7-Zip Command Line Syntax. Example of calling the command shell from Java: Executing shell commands in Java. You will have to call 7-Zip twice: once to extract the Tar from the .tar.gz or .tgz file, and again to extract the individual files from the Tar.
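A rough sketch of one such call from Java using ProcessBuilder; it assumes the 7z executable is on the PATH, and the archive and output paths are placeholders ("x" extracts with full paths, "-o" sets the output directory, "-y" answers yes to prompts):
ProcessBuilder pb = new ProcessBuilder("7z", "x", "archive.tar.gz", "-ooutdir", "-y");
pb.redirectErrorStream(true);                 // merge stderr into stdout
Process p = pb.start();
try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
    String line;
    while ((line = r.readLine()) != null) {
        System.out.println(line);             // drain output so 7z cannot block on a full pipe
    }
}
int exit = p.waitFor();                       // 0 means success; run a second pass on the extracted .tar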
Or, you could just do the easy thing and write a brief shell script or batch file to do your decompression. There's no reason to hammer a square peg into a round hole; this is what batch files are made for. As a bonus, you can also feed them parameters, which reduces the complexity of the Java command-line execution considerably while still letting Java control execution.

Have you tried
gunzip *.gz

.gz (gzipped) files can store the original filename of the compressed file. So, for example, FuBar.doc can be saved inside myDocument.gz, and with appropriate decompression the file can be restored to the name FuBar.doc. Unfortunately, java.util.zip.GZIPInputStream does not support any way of reading the filename, even if it is stored inside the archive.
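If you really need the stored name, the gzip header itself (RFC 1952) can be parsed by hand; the optional FNAME field holds the original file name. A rough sketch using plain java.io classes (returns null when no name was stored):
public static String readGzipFileName(File gzFile) throws IOException {
    try (DataInputStream in = new DataInputStream(new FileInputStream(gzFile))) {
        if (in.readUnsignedByte() != 0x1f || in.readUnsignedByte() != 0x8b) {
            throw new IOException("Not a gzip file: " + gzFile);
        }
        in.readUnsignedByte();                          // CM (compression method)
        int flg = in.readUnsignedByte();                // FLG
        in.skipBytes(6);                                // MTIME (4 bytes) + XFL + OS
        if ((flg & 0x04) != 0) {                        // FEXTRA: skip XLEN extra bytes
            int xlen = in.readUnsignedByte() | (in.readUnsignedByte() << 8);
            in.skipBytes(xlen);
        }
        if ((flg & 0x08) == 0) {                        // no FNAME field stored
            return null;
        }
        StringBuilder name = new StringBuilder();
        int b;
        while ((b = in.readUnsignedByte()) != 0) {      // FNAME is zero-terminated ISO-8859-1
            name.append((char) b);
        }
        return name.toString();
    }
}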

Related

handling about 450.000 files in a zip

My question is simple: can Java handle a .zip file with about 450,000 files in it? The code that I wrote would not load all of the files; only one specific file would be looked up in the zip and read line by line. Each file is about 500 KB.
Would this work, or will I get an OutOfMemoryError?
Sorry, uncompressed the files are about 0.5 MB each; zipped, the whole set is about 250 MB.
The file names in the zip are ID + date (unique). If I have to check a log, I call Java with the ID + date, and Java reads just that one file, never more.
Edit: It works, and it works very well. About 400,000 files in a zip work without any problem, provided you have the memory to zip the files in the first place.
Edit 2: It works on Linux filesystems without a problem; on NTFS it sometimes crashed. NTFS has a problem with that many files in one zip.
Using the zip filesystem in Java 7, you can actually access one individual file pretty easily and open a BufferedReader on it.
First you have to create the FileSystem:
public static FileSystem getZipFileSystem(final String zipPath)
    throws IOException
{
    final Path path = Paths.get(zipPath).toAbsolutePath();
    final Map<String, Object> env = new HashMap<>();
    final URI uri = URI.create("jar:file:" + path.toString());
    return FileSystems.newFileSystem(uri, env, null);
}
Once you have done that, you can create a BufferedReader from an entry in the zip itself:
try (
    final FileSystem fs = getZipFileSystem("/path/to/the.zip");
    final BufferedReader reader = Files.newBufferedReader(fs.getPath("path/to/entry"),
        StandardCharsets.UTF_8);
) {
    // operate on the reader
}
You could also read all lines in the entry at once using Files.readAllLines().
If you wish to copy a zip entry to a file on the filesystem, you can also do that:
Files.copy(zipfs.getPath("path/to/entry"), Paths.get("file/on/local/fs"));
Or you can directly copy the result to an OutputStream, or directly create an entry from an OutputStream...
Or even walk the entire zip using Files.walkFileTree().
Or get all the entries in a "directory" in the zip using Files.newDirectoryStream(). Note that, as its name says, this is a stream; unlike File.listFiles() (which only works on files on disk anyway), this returns an iterator over the entries.
Or... Or... Or...
Note that a FileSystem needs to be .close()d.
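For instance, listing the entries directly under one "directory" of the zip might look like this (a sketch reusing the getZipFileSystem() helper above; the zip path and directory name are placeholders), with try-with-resources taking care of closing the FileSystem:
try (FileSystem fs = getZipFileSystem("/path/to/the.zip");
     DirectoryStream<Path> dir = Files.newDirectoryStream(fs.getPath("path/in/zip"))) {
    for (Path entry : dir) {
        System.out.println(entry + " (" + Files.size(entry) + " bytes)");
    }
}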
I'm not sure that I understand what you're trying to do.
If it's 0.5 MB/file and 450,000 files, you'll need 225GB. You won't have enough memory to do all this in a single zip in memory even if you get 90% compression.
I'd recommend breaking it into manageable chunks. You'll be able to parallelize that way too, so it's not a bad idea.

zip folder in windows using command line

I am writing a program that needs to zip a file.
It will run on both Linux and Windows machines. It works just fine on Linux, but I am not able to get anything done on Windows.
To send commands I am using Apache Commons Exec. I've also tried using Runtime.exec(),
but it isn't working.
Can somebody suggest something?
CommandLine cmdLine = new CommandLine("zip");
cmdLine.addArgument("-r");
cmdLine.addArgument("documents.zip");
cmdLine.addArgument("documents");

DefaultExecutor exec = new DefaultExecutor();
ExecuteWatchdog dog = new ExecuteWatchdog(60 * 1000);
exec.setWorkingDirectory(new File("."));
exec.setWatchdog(dog);

int check = -1;
try {
    check = exec.execute(cmdLine);
} catch (ExecuteException e) {
    // the command ran but failed (non-zero exit value)
} catch (IOException e) {
    // the command could not be started at all
}
Java provides its own compression library in java.util.zip.* that supports the .zip format. An example that zips a folder can be found here. Here's a quick example that works on a single file. The benefit of going with native Java is that it works on multiple operating systems and does not depend on specific binaries being installed.
public static void zip(String origFileName) {
    try {
        String zipName = origFileName + ".zip";
        ZipOutputStream out = new ZipOutputStream(new BufferedOutputStream(new FileOutputStream(zipName)));
        byte[] data = new byte[1000];
        BufferedInputStream in = new BufferedInputStream(new FileInputStream(origFileName));
        int count;
        out.putNextEntry(new ZipEntry(origFileName));
        while ((count = in.read(data, 0, 1000)) != -1) {
            out.write(data, 0, count);
        }
        in.close();
        out.flush();
        out.close();
    } catch (Exception ex) {
        ex.printStackTrace();
    }
}
The same code won't work on Windows. Windows doesn't have a "zip" program the way that Linux does. You will need to see whether Windows 7 has a command-line zip program (I don't think it does; see here: http://answers.microsoft.com/en-us/windows/forum/windows_vista-files/how-to-compress-a-folder-from-command-prompt/02f93b08-bebc-4c9d-b2bb-907a2184c8d5). You will likely need to do two things (see the sketch after this list):
Make sure the user has a suitable third-party zip program
Do OS detection to execute the proper command
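A rough sketch of the OS-detection part, using the same Commons Exec classes as above; the Windows branch assumes some third-party archiver (7z here, purely as an example) is installed and on the PATH:
String os = System.getProperty("os.name").toLowerCase();
CommandLine cmdLine;
if (os.contains("win")) {
    cmdLine = new CommandLine("7z");      // hypothetical: whatever zip tool the user has installed
    cmdLine.addArgument("a");             // "a" = add files to archive
} else {
    cmdLine = new CommandLine("zip");
    cmdLine.addArgument("-r");
}
cmdLine.addArgument("documents.zip");
cmdLine.addArgument("documents");
int exitValue = new DefaultExecutor().execute(cmdLine);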
You can use the built-in compact.exe to compress/uncompress files from the command prompt.
It displays or alters the compression of files on NTFS partitions.
COMPACT [/C | /U] [/S[:dir]] [/A] [/I] [/F] [/Q] [filename [...]]
/C Compresses the specified files. Directories will be marked so that files added afterward will be compressed.
/U Uncompresses the specified files. Directories will be marked so that files added afterward will not be compressed.
/S Performs the specified operation on files in the given directory and all subdirectories. Default "dir" is the current directory.
/A Displays files with the hidden or system attributes. These files are omitted by default.
/I Continues performing the specified operation even after errors have occurred. By default, COMPACT stops when an error is encountered.
/F Forces the compress operation on all specified files, even those that are already compressed. Already-compressed files are skipped by default.
/Q Reports only the most essential information.
filename Specifies a pattern, file, or directory.
Used without parameters, COMPACT displays the compression state of the current directory and any files it contains. You may use multiple filenames and wildcards. You must put spaces between multiple parameters.
Examples
compact
Display all the files in the current directory and their compact status.
compact file.txt
Display the compact status of the file file.txt
compact file.txt /C
Compacts the file.txt file.

Reading files from an embedded ZIP archive

I have a ZIP archive that's embedded inside a larger file. I know the archive's starting offset within the larger file and its length.
Are there any Java libraries that would enable me to directly read the files contained within the archive? I am thinking along the lines of ZipFile.getInputStream(). Unfortunately, ZipFile doesn't work for this use case since its constructors require a standalone ZIP file.
For performance reasons, I cannot copy the ZIP archive into a separate file before opening it.
edit: Just to be clear, I do have random access to the file.
I've come up with a quick hack (which needs to be sanitized here and there), but it reads the contents of files from a ZIP archive which is embedded inside a TAR. It uses Java 6, FileInputStream, ZipEntry and ZipInputStream. Works on my local machine:
final FileInputStream ins = new FileInputStream("archive.tar");

// Zip starts at 0x1f6400, size is not needed
long toSkip = 0x1f6400;

// Safe skipping
while (toSkip > 0)
    toSkip -= ins.skip(toSkip);

final ZipInputStream zipin = new ZipInputStream(ins);
ZipEntry ze;
while ((ze = zipin.getNextEntry()) != null)
{
    // Caution: getSize() may return -1 if the size is not stored in the local header
    final byte[] content = new byte[(int) ze.getSize()];
    int offset = 0;
    while (offset < content.length)
    {
        final int read = zipin.read(content, offset, content.length - offset);
        if (read == -1)
            break;
        offset += read;
    }
    // DEBUG: print out ZIP entry name and filesize
    System.out.println(ze + ": " + offset);
}
zipin.close();
1. Create a FileInputStream: fis = new FileInputStream(..);
2. Position it at the start of the embedded zip file: fis.skip(offset);
3. Open a new ZipInputStream(fis)
I suggest using TrueZIP; it provides file system access to many kinds of archives. It worked well for me in the past.
If you're using Java SE 7, it provides a zip file system which allows you to read/write files in the zip directly: http://docs.oracle.com/javase/7/docs/technotes/guides/io/fsp/zipfilesystemprovider.html
I think Apache Commons Compress may help you.
There is a class org.apache.commons.compress.archivers.zip.ZipArchiveEntry, which inherits from java.util.zip.ZipEntry.
It has a method getDataOffset(), which returns the offset of the data stream within the archive file.
7-zip-JavaBinding is a Java wrapper for the 7-zip C++ library.
The code snippets page in particular has some nice examples including printing a list of items in an archive, extracting a single file and opening multi-part archives.
Check whether zip4j helps you or not.
You can try PartInputStream to read the zip file for your use case.
I think it is better to create a temp zip file and then access it.

Preserving file checksum after extract from zip in java

This is what I'm trying to accomplish:
1) Calculate the checksum of all files to be added to a zip file. I'm currently using Apache Commons IO as follows:
final Checksum oChecksum = new Adler32();
...
//for every file iFile in folder
long lSum = (FileUtils.checksum(iFile, oChecksum)).getValue();
//store this checksum in a log
2) Compress the folder processed as a zip using the Ant zip task.
3) Extract files from the zip one by one to the specified folder (using both Commons IO and Commons Compress for this), and calculate the checksum of the extracted file:
final Checksum oChecksum = new Adler32();
...
ZipFile myZip = new ZipFile("test.zip");
ZipArchiveEntry zipEntry = myZip.getEntry("checksum.log"); //reads the filename from the log
BufferedInputStream myInputStream = new BufferedInputStream(myZip.getInputStream(zipEntry));
File destFile = new File("/mydir", zipEntry.getName());
destFile.createNewFile();
FileUtils.copyInputStreamToFile(myInputStream, destFile);
long newChecksum = FileUtils.checksum(destFile, oChecksum).getValue();
The problem I have is that the value of newChecksum doesn't match the one from the original file. The files' sizes on disk match. The funny thing is that if I run the cksum or md5sum commands on both files directly in a terminal, the results are the same for both files. The mismatch occurs only from Java.
Is this the correct way to approach it or is there any way to preserve the checksum value after extraction?
I also tried using a CheckedInputStream but this also gets me different values from java.
EDIT: This seems related to the Adler32 object used (pre-zip vs unzip checks). If I do "new Adler32()" in the unzip check for every file instead of reusing the same Adler32 for all, I get the correct result.
Are you trying to checksum all the files concatenated? If yes, you need to make sure you read them in the same order you checksummed them.
If no, you need to call checksum.reset() between computing the checksum for each file. You'll notice (if you look at the source) that Adler32 is stateful, which means that in part one you're computing the checksum of each file plus all the preceding ones.
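A minimal sketch of the reset-per-file approach with the same Commons IO call (the folder variable is a placeholder for the directory being processed):
Checksum checksum = new Adler32();
for (File f : folder.listFiles()) {
    checksum.reset();                                    // discard state from the previous file
    long value = FileUtils.checksum(f, checksum).getValue();
    // store "f.getName() -> value" in the log; each value now covers exactly one file
}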

What is the fastest way to extract 1 file from a zip file which contains a lot of files?

I tried the java.util.zip package; it is too slow.
Then I found the LZMA SDK and 7z jbinding, but they are also lacking something.
The LZMA SDK does not provide any documentation or tutorial on how to use it, which is very frustrating. No Javadoc.
The 7z jbinding does not provide a simple way to extract only one file; it only provides a way to extract all the contents of the zip file. Moreover, it does not provide a way to specify a location to place the unzipped file.
Any ideas, please?
What does your code with java.util.zip look like and how big of a zip file are you dealing with?
I'm able to extract a 4MB entry out of a 200MB zip file with 1,800 entries in roughly a second with this:
OutputStream out = new FileOutputStream("your.file");
FileInputStream fin = new FileInputStream("your.zip");
BufferedInputStream bin = new BufferedInputStream(fin);
ZipInputStream zin = new ZipInputStream(bin);
ZipEntry ze = null;
while ((ze = zin.getNextEntry()) != null) {
    if (ze.getName().equals("your.file")) {
        byte[] buffer = new byte[8192];
        int len;
        while ((len = zin.read(buffer)) != -1) {
            out.write(buffer, 0, len);
        }
        out.close();
        break;
    }
}
I have not benchmarked the speed, but with Java 7 or greater I extract a file as follows.
I would imagine that it's faster than the ZipFile API:
A short example extracting META-INF/MANIFEST.MF from a zip file test.zip:
// file to extract from zip file
String file = "MANIFEST.MF";
// location to extract the file to
File outputLocation = new File("D:/temp/", file);
// path to the zip file
Path zipFile = Paths.get("D:/temp/test.zip");

// load zip file as filesystem
// (note: the single-argument newFileSystem(Path) overload needs Java 13+;
//  on Java 7-12 use FileSystems.newFileSystem(zipFile, (ClassLoader) null))
try (FileSystem fileSystem = FileSystems.newFileSystem(zipFile)) {
    // copy file from zip file to output location
    Path source = fileSystem.getPath("META-INF/" + file);
    Files.copy(source, outputLocation.toPath());
}
Use a ZipFile rather than a ZipInputStream.
Although the documentation does not indicate this (it's in the docs for JarFile), it should use random-access file operations to read the file. Since a ZIP file contains a directory at a known location, this means a LOT less IO has to happen to find a particular file.
Some caveats: to the best of my knowledge, the Sun implementation uses a memory-mapped file. This means that your virtual address space has to be large enough to hold the file as well as everything else in your JVM, which may be a problem for a 32-bit server. On the other hand, it may be smart enough to avoid memory-mapping on 32-bit, or to memory-map just the directory; I haven't tried.
Also, if you're using multiple files, be sure to use a try/finally to ensure that the file is closed after use.
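A minimal sketch of that approach (file and entry names are placeholders; on Java 7+ a try-with-resources replaces the explicit try/finally):
try (ZipFile zip = new ZipFile("your.zip")) {
    ZipEntry entry = zip.getEntry("your.file");          // looked up via the central directory, no full scan
    if (entry != null) {
        try (InputStream in = zip.getInputStream(entry)) {
            Files.copy(in, Paths.get("your.file"));
        }
    }
}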
