Poor Performance of Java's unzip utilities

I have noticed that the unzip facility in Java is extremely slow compared to using a native tool such as WinZip.
Is there a third party library available for Java that is more efficient?
Open Source is preferred.
Edit
Here is a speed comparison using the Java built-in solution vs 7zip.
I added buffered input/output streams in my original solution (thanks Jim, this did make a big difference).
Zip File size: 800K
Java Solution: 2.7 seconds
7Zip solution: 204 ms
Here is the modified code using the built-in Java decompression:
/** Unpacks the given zip file using the built-in Java facilities for unzip. */
@SuppressWarnings("unchecked")
public final static void unpack(File zipFile, File rootDir) throws IOException
{
ZipFile zip = new ZipFile(zipFile);
Enumeration<ZipEntry> entries = (Enumeration<ZipEntry>) zip.entries();
while(entries.hasMoreElements()) {
ZipEntry entry = entries.nextElement();
java.io.File f = new java.io.File(rootDir, entry.getName());
if (entry.isDirectory()) { // if its a directory, create it
continue;
}
if (!f.exists()) {
f.getParentFile().mkdirs();
f.createNewFile();
}
BufferedInputStream bis = new BufferedInputStream(zip.getInputStream(entry)); // get the input stream
BufferedOutputStream bos = new BufferedOutputStream(new java.io.FileOutputStream(f));
while (bis.available() > 0) { // write contents of 'is' to 'fos'
bos.write(bis.read());
}
bos.close();
bis.close();
}
}

The problem is not the unzipping, it's the inefficient way you write the unzipped data back to disk. My benchmarks show that using
InputStream is = zip.getInputStream(entry); // get the input stream
OutputStream os = new java.io.FileOutputStream(f);
byte[] buf = new byte[4096];
int r;
while ((r = is.read(buf)) != -1) {
os.write(buf, 0, r);
}
os.close();
is.close();
instead reduces the method's execution time by a factor of 5 (from 5 to 1 second for a 6 MB zip file).
The likely culprit is your use of bis.available(). Aside from being incorrect (available returns the number of bytes until a call to read would block, not until the end of the stream), this bypasses the buffering provided by BufferedInputStream, requiring a native system call for every byte copied into the output file.
Note that wrapping in a buffered stream is not necessary if you use the bulk read and write methods as I do above, and that the code that closes the resources is not exception safe (if reading or writing fails for any reason, neither is nor os will be closed). Finally, if you have Apache Commons IO's IOUtils on the class path, I recommend using its well-tested IOUtils.copy instead of rolling your own.
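For illustration, here is a minimal sketch of how the bulk copy and the exception-safety point could be combined using try-with-resources. The class name ZipUnpacker, the 8 KiB buffer size and the check that rejects entries resolving outside the target directory are assumptions of this sketch, not part of the original code.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

class ZipUnpacker {

    /** Unpacks zipFile into rootDir, copying each entry in 8 KiB chunks. */
    static void unpack(File zipFile, File rootDir) throws IOException {
        String rootPath = rootDir.getCanonicalPath() + File.separator;
        // try-with-resources closes the ZipFile even if extracting an entry fails
        try (ZipFile zip = new ZipFile(zipFile)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                File target = new File(rootDir, entry.getName());
                // Assumption of this sketch: reject entries such as "../x" that leave rootDir
                if (!target.getCanonicalPath().startsWith(rootPath)) {
                    throw new IOException("Entry outside target directory: " + entry.getName());
                }
                if (entry.isDirectory()) {
                    Files.createDirectories(target.toPath());
                    continue;
                }
                Files.createDirectories(target.getParentFile().toPath());
                // Both streams are closed whether or not the copy loop throws
                try (InputStream in = zip.getInputStream(entry);
                     OutputStream out = Files.newOutputStream(target.toPath())) {
                    byte[] buf = new byte[8192];
                    int n;
                    while ((n = in.read(buf)) != -1) {
                        out.write(buf, 0, n);
                    }
                }
            }
        }
    }
}
```

Calling ZipUnpacker.unpack(new File("archive.zip"), new File("out")) would extract into the out directory; the loop ends on read returning -1 rather than relying on available().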

Make sure you are feeding the unzip method a BufferedInputStream in your Java application. If you have made the mistake of using an unbuffered input stream your IO performance is guaranteed to suck.

I have found an 'inelegant' solution. There is an open source utility 7zip (www.7-zip.org) that is free to use. You can download the command line version (http://www.7-zip.org/download.html). 7-zip is only supported on Windows, but it looks like this has been ported to other platforms (p7zip).
Obviously this solution is not ideal since it is platform specific and relies on an executable. However, the speed compared to doing the unzip in Java is incredible.
Here is the code for the utility function that I created to interface with this utility. There is room for improvement as the code below is Windows specific.
/** Unpacks the zipfile to the output directory. Note: this code relies on 7-zip
(specifically the cmd line version, 7za.exe). The exeDir specifies the location of the 7za.exe utility. */
public static void unpack(File zipFile, File outputDir, File exeDir) throws IOException, InterruptedException
{
if (!zipFile.exists()) throw new FileNotFoundException(zipFile.getAbsolutePath());
if (!exeDir.exists()) throw new FileNotFoundException(exeDir.getAbsolutePath());
if (!outputDir.exists()) outputDir.mkdirs();
String cmd = exeDir.getAbsolutePath() + "/7za.exe -y e " + zipFile.getAbsolutePath();
ProcessBuilder builder = new ProcessBuilder(new String[] { "cmd.exe", "/C", cmd });
builder.directory(outputDir);
Process p = builder.start();
int rc = p.waitFor();
if (rc != 0) {
log.severe("Util::unpack() 7za process did not complete normally. rc: " + rc);
}
}
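One possible improvement, sketched under assumptions: passing each argument to ProcessBuilder separately avoids cmd.exe entirely, so paths containing spaces need no quoting and the code is no longer tied to the Windows shell. The class name SevenZipLauncher and the sevenZipExe parameter (the full path to the 7za binary, whatever it is called on the platform) are hypothetical; the "-y e" arguments are simply carried over from the function above.

```java
import java.io.File;
import java.io.IOException;

class SevenZipLauncher {

    /** Builds the process description; sevenZipExe is the full path to the 7za binary. */
    static ProcessBuilder buildUnpackProcess(File sevenZipExe, File zipFile, File outputDir) {
        // One list element per argument: no shell is involved, so spaces in paths are safe
        ProcessBuilder builder = new ProcessBuilder(
                sevenZipExe.getAbsolutePath(), "-y", "e", zipFile.getAbsolutePath());
        builder.directory(outputDir);
        // Let the child write to our stdout/stderr so it cannot block on a full pipe
        builder.inheritIO();
        return builder;
    }

    /** Runs 7za and fails with an IOException if it exits with a non-zero code. */
    static void unpack(File sevenZipExe, File zipFile, File outputDir)
            throws IOException, InterruptedException {
        if (!outputDir.exists() && !outputDir.mkdirs()) {
            throw new IOException("Cannot create " + outputDir);
        }
        int rc = buildUnpackProcess(sevenZipExe, zipFile, outputDir).start().waitFor();
        if (rc != 0) {
            throw new IOException("7za exited with code " + rc);
        }
    }
}
```

Throwing on a non-zero exit code, instead of only logging it, is a design choice of this sketch so that callers cannot miss a failed extraction.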

Related

Extracting PDF inside a Zip inside a Zip

I have checked everywhere online, including Stack Overflow, and could not find a match specific to this issue.
I am trying to extract a PDF file that is located in a zip file that is itself inside a zip file (nested zips).
Calling the extraction method again on the inner zip does not work, nor does changing the whole program to accept InputStreams instead of the approach shown below.
The .pdf file inside the nested zip is simply skipped at this stage.
public static void main(String[] args)
{
try
{
//Paths
String basePath = "C:\\Users\\user\\Desktop\\Scan\\";
File lookupDir = new File(basePath + "Data\\");
String doneFolder = basePath + "DoneUnzipping\\";
File[] directoryListing = lookupDir.listFiles();
for (int i = 0; i < directoryListing.length; i++)
{
if (directoryListing[i].isFile()) //there's definitely a file
{
//Save the current file's path
String pathOrigFile = directoryListing[i].getAbsolutePath();
Path origFileDone = Paths.get(pathOrigFile);
Path newFileDone = Paths.get(doneFolder + directoryListing[i].getName());
//unzip it
if(directoryListing[i].getName().toUpperCase().endsWith(ZIP_EXTENSION)) //ZIP files
{
unzip(directoryListing[i].getAbsolutePath(), DESTINATION_DIRECTORY + directoryListing[i].getName());
//move to the 'DoneUnzipping' folder
Files.move(origFileDone, newFileDone);
}
}
}
} catch (Exception e)
{
e.printStackTrace(System.out);
}
}
private static void unzip(String zipFilePath, String destDir)
{
//buffer for read and write data to file
byte[] buffer = new byte[BUFFER_SIZE];
try (ZipInputStream zis = new ZipInputStream(new FileInputStream(zipFilePath)))
{
FileInputStream fis = new FileInputStream(zipFilePath);
ZipEntry ze = zis.getNextEntry();
while(ze != null)
{
String fileName = ze.getName();
int index = fileName.lastIndexOf("/");
String newFileName = fileName.substring(index + 1);
File newFile = new File(destDir + File.separator + newFileName);
//Zips inside zips
if(fileName.toUpperCase().endsWith(ZIP_EXTENSION))
{
ZipInputStream innerZip = new ZipInputStream(zis);
ZipEntry innerEntry = null;
while((innerEntry = innerZip.getNextEntry()) != null)
{
System.out.println("The file: " + fileName);
if(fileName.toUpperCase().endsWith("PDF"))
{
FileOutputStream fos = new FileOutputStream(newFile);
int len;
while ((len = innerZip.read(buffer)) > 0)
{
fos.write(buffer, 0, len);
}
fos.close();
}
}
}
//close this ZipEntry
zis.closeEntry(); // java.io.IOException: Stream Closed
ze = zis.getNextEntry();
}
//close last ZipEntry
zis.close();
fis.close();
} catch (IOException e)
{
e.printStackTrace();
}
}

The solution to this is not as obvious as it seems. Despite writing a few zip utilities myself some time ago, getting zip entries from inside another zip file only seems obvious in retrospect
(and I also got the java.io.IOException: Stream Closed on my first attempt).
The Java classes for ZipFile and ZipInputStream really direct your thinking into using the file system, but it is not required.
The functions below will scan a parent-level zip file, and continue scanning until it finds an entry with a specified name. (Nearly) everything is done in-memory.
Naturally, this can be modified to use different search criteria, find multiple file types, etc. and take different actions, but this at least demonstrates the basic technique in question -- zip files inside of zip files -- no guarantees on other aspects of the code, and someone more savvy could most likely improve the style.
final static String ZIP_EXTENSION = ".zip";
public static byte[] getOnePDF() throws IOException
{
final File source = new File("/path/to/MegaData.zip");
final String nameToFind = "FindThisFile.pdf";
final ByteArrayOutputStream mem = new ByteArrayOutputStream();
try (final ZipInputStream in = new ZipInputStream(new BufferedInputStream(new FileInputStream(source))))
{
digIntoContents(in, nameToFind, mem);
}
// Save to disk, if you want
// copy(new ByteArrayInputStream(mem.toByteArray()), new FileOutputStream(new File("/path/to/output.pdf")));
// Otherwise, just return the binary data
return mem.toByteArray();
}
private static void digIntoContents(final ZipInputStream in, final String nameToFind, final ByteArrayOutputStream mem) throws IOException
{
ZipEntry entry;
while (null != (entry = in.getNextEntry()))
{
final String name = entry.getName();
// Found the file we are looking for
if (name.equals(nameToFind))
{
copy(in, mem);
return;
}
// Found another zip file
if (name.toUpperCase().endsWith(ZIP_EXTENSION.toUpperCase()))
{
digIntoContents(new ZipInputStream(new ByteArrayInputStream(getZipEntryFromMemory(in))), nameToFind, mem);
}
}
}
private static byte[] getZipEntryFromMemory(final ZipInputStream in) throws IOException
{
final ByteArrayOutputStream mem = new ByteArrayOutputStream();
copy(in, mem);
return mem.toByteArray();
}
// General purpose, reusable, utility function
// OK for binary data (bad for non-ASCII text, use Reader/Writer instead)
public static void copy(final InputStream from, final OutputStream to) throws IOException
{
final int bufferSize = 4096;
final byte[] buf = new byte[bufferSize];
int len;
while (0 < (len = from.read(buf)))
{
to.write(buf, 0, len);
}
to.flush();
}

Your question asks how to use Java (by implication on Windows) to extract a PDF from a zip nested inside another zip.
On many systems, including Windows, this is a single-line command whose exact form depends on the source and target folders. Taking the current Downloads folder as the shortest example, in a shell it is as simple as
tar -xf "german (2).zip" && tar -xf "german.zip" && german.pdf
To run the command from Java on Windows, see
How do I execute Windows commands in Java?
The default PDF viewer can open the result: Microsoft Edge on Windows or, in my case, SumatraPDF.
There is generally little point in putting a PDF inside a zip, because it cannot be opened in place. If a zip is needed for download or transport, a single level of nesting is advisable.
There is no need to add a password to the zip, because PDF has its own password for opening. Adding two levels of complexity is unwise. Keep it simple.
If you have multiple zips nested inside multiple zips with multiple PDFs in each, you have to be more specific and filter by name. However, avoid that extra onion skin where possible.
\Downloads>tar -xf "german (2).zip" "both.zip" && tar -xf "both.zip" "English language.pdf"
You could complicate that by running in memory or in a temp folder, but using the native file system is reliable and simple. Without Java, the fastest approach is to run
CD /D "C:/Users/user/Desktop/Scan/DoneUnzipping" && for %f in (..\Data\*.zip) do tar -xf "%f" "*.zip" && for %f in (*.zip) do tar -xf "%f" "*.pdf" && del "*.zip"
This extracts all inner zips into the working folder, then extracts all PDFs and removes the inner zips, which are essentially temporary. The source double zips are not deleted; they are only read.

The line that causes your problem appears to be the auto-close block you created when reading the inner zip:
try(ZipInputStream innerZip = new ZipInputStream(fis)) {
...
}
There are several likely issues. Firstly, it is reading the wrong stream: fis, not the existing zis.
Secondly, you shouldn't use try-with-resources on innerZip, as this implicitly calls innerZip.close() when exiting the block. If you view the source code of ZipInputStream in a good IDE you will see that ZipInputStream extends InflaterInputStream, which itself extends FilterInputStream. A call to innerZip.close() closes the underlying outer stream (zis, or fis in your case), so the stream is already closed when you move on to the next entry of the outer zip.
Therefore remove the try() block and use zis:
ZipInputStream innerZip = new ZipInputStream(zis);
Use a try-with-resources block only for the outermost file handling:
try (ZipInputStream zis = new ZipInputStream(new FileInputStream(zipFilePath))) {
ZipEntry ze = zis.getNextEntry();
...
}
Thirdly, you appear to be copying the wrong stream when extracting a PDF: use innerZip, not the outer zis. The code will never extract a PDF, because these two conditions can never be true at the same time (a file name ending in ZIP can never also end in PDF):
if(fileName.toUpperCase().endsWith(ZIP_EXTENSION)) {
...
// You want innerEntry.getName() here
if(fileName.toUpperCase().endsWith("PDF"))
You should be able to switch to one line Files.copy and make use of the PDF filename not zip filename:
if(innerEntry.getName().toUpperCase().endsWith("PDF")) {
Path newFile = Paths.get(destDir + '-'+innerEntry.getName().replace("/", "-"));
System.out.println("Files.copy to " + newFile);
Files.copy(innerZip, newFile);
}
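Putting the three points together, a minimal self-contained sketch might look like the following. The class name NestedZipPdfExtractor and the choice to flatten each PDF to its bare file name in destDir (overwriting duplicates) are assumptions made for illustration, not the asker's requirements.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Locale;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

class NestedZipPdfExtractor {

    /** Extracts every PDF found in zips nested inside zipFile; returns the number of PDFs written. */
    static int extractPdfs(Path zipFile, Path destDir) throws IOException {
        Files.createDirectories(destDir);
        int count = 0;
        // Only the outermost stream is managed by try-with-resources
        try (ZipInputStream zis = new ZipInputStream(Files.newInputStream(zipFile))) {
            ZipEntry ze;
            while ((ze = zis.getNextEntry()) != null) {
                if (!ze.getName().toLowerCase(Locale.ROOT).endsWith(".zip")) {
                    continue;
                }
                // Deliberately not closed: closing innerZip would also close zis
                ZipInputStream innerZip = new ZipInputStream(zis);
                ZipEntry inner;
                while ((inner = innerZip.getNextEntry()) != null) {
                    String name = inner.getName();
                    if (inner.isDirectory() || !name.toLowerCase(Locale.ROOT).endsWith(".pdf")) {
                        continue;
                    }
                    // Assumption: keep only the file name, dropping any folders inside the inner zip
                    Path target = destDir.resolve(name.substring(name.lastIndexOf('/') + 1));
                    // Files.copy reads the current inner entry to its end and leaves the stream open
                    Files.copy(innerZip, target, StandardCopyOption.REPLACE_EXISTING);
                    count++;
                }
            }
        }
        return count;
    }
}
```

The test for ".pdf" is made against the inner entry's name, and the outer loop simply moves on to the next outer entry once the inner stream is exhausted.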

How copy files bigger than 4.3 GB in java

I'm writing part of a program that simply copies a file from a source to a destination file. The code works as it should, but with a large file the copy fails with an exception once the destination file reaches a size of 4.3 GB. The exception says the file is too large (the message in the trace below, "Die Datei ist zu groß", is German for "The file is too large"). It looks like this:
java.io.IOException: Die Datei ist zu groß
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:60)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
at sun.nio.ch.IOUtil.write(IOUtil.java:65)
at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:211)
at java.nio.channels.Channels.writeFullyImpl(Channels.java:78)
at java.nio.channels.Channels.writeFully(Channels.java:101)
at java.nio.channels.Channels.access$000(Channels.java:61)
at java.nio.channels.Channels$1.write(Channels.java:174)
at java.nio.file.Files.copy(Files.java:2909)
at java.nio.file.Files.copy(Files.java:3069)
at sample.Controller.copyStream(Controller.java:318)
The method to produce that is following:
private void copyStream(File src, File dest){
try {
FileInputStream fis = new FileInputStream(src);
OutputStream newFos = java.nio.file.Files.newOutputStream(dest.toPath(),StandardOpenOption.WRITE);
Files.copy(src.toPath(),newFos);
newFos.flush();
newFos.close();
fis.close();
} catch (Exception e) {
e.printStackTrace();
}
}
I also tried using java.io.FileOutputStream and writing a kilobyte at a time, but the same thing happens. How can I copy or create files larger than 4.3 GB? Is it perhaps possible in a language other than Java? I run this program on Linux (Ubuntu 16.04 LTS).
Thanks in advance.
Edit:
Thank you all very much for your help. As you said, the file system was the problem. After I formatted the file system as exFAT, it works fine.
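Since the file system turned out to be the cause, a pre-flight check along these lines could flag the problem before a long copy starts. This is only a sketch: the class name CopyPreflight is hypothetical, and the strings returned by FileStore.type() are implementation-specific (for example "vfat" is commonly reported on Linux), so the name matching is a heuristic assumption. The size limit itself is a property of FAT32, which stores a file's size in 32 bits.

```java
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;

class CopyPreflight {

    // FAT32 records a file's size in 32 bits, so one file can hold at most 4 GiB minus 1 byte
    static final long FAT32_MAX_FILE_SIZE = (1L << 32) - 1;

    /** Heuristic: file store type names vary by platform, so this matching is an assumption. */
    static boolean looksLikeFat(String fileStoreType) {
        String t = fileStoreType.toLowerCase(Locale.ROOT);
        return t.contains("fat") && !t.contains("exfat");
    }

    /** Returns a warning if copying src into destDir looks likely to fail, or null if nothing is detected. */
    static String check(Path src, Path destDir) throws IOException {
        long size = Files.size(src);
        FileStore store = Files.getFileStore(destDir);
        if (looksLikeFat(store.type()) && size > FAT32_MAX_FILE_SIZE) {
            return "Source is " + size + " bytes; a " + store.type() + " file system cannot hold a file that large";
        }
        if (size > store.getUsableSpace()) {
            return "Not enough free space on " + store;
        }
        return null;
    }
}
```

The 4 GiB limit is about 4.29 GB in decimal units, which matches the size at which the copy above stopped.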

POSIX (and thus Unix) systems are allowed to impose a maximum length on the path (what you get from File.getPath()) or on the components of a path (the last of which you can get with File.getName()). You might be seeing this problem because of a long file name.
In that case, the file-open system call fails with an ENAMETOOLONG error code.
However, the message "File too large" is typically associated with the EFBIG error code. That is more likely to result from a write system call:
An attempt was made to write a file that exceeds the implementation-dependent maximum file size or the process' file size limit.
Perhaps the file is being opened for appending, and the implied lseek to the end of the file is giving the EFBIG error.
Finally, you could try other methods of copying in case the problem is related to memory.
Another possibility is that the disk is full.
There are basically four ways to copy files (and it turns out streams are the fastest on a basic level):
Copy with streams:
private static void copyFileUsingStream(File source, File dest) throws IOException {
    // try-with-resources closes both streams, even if opening the second one fails
    try (InputStream is = new FileInputStream(source);
         OutputStream os = new FileOutputStream(dest)) {
        byte[] buffer = new byte[1024];
        int length;
        // read returns -1 at end of stream
        while ((length = is.read(buffer)) != -1) {
            os.write(buffer, 0, length);
        }
    }
}
Copy with Java NIO classes:
private static void copyFileUsingChannel(File source, File dest) throws IOException {
    try (FileChannel sourceChannel = new FileInputStream(source).getChannel();
         FileChannel destChannel = new FileOutputStream(dest).getChannel()) {
        long size = sourceChannel.size();
        long position = 0;
        // transferFrom may copy fewer bytes than requested, so loop until everything is copied
        while (position < size) {
            long n = destChannel.transferFrom(sourceChannel, position, size - position);
            if (n <= 0) break; // source ended early
            position += n;
        }
    }
}
Copy with Apache Commons IO FileUtils:
private static void copyFileUsingApacheCommonsIO(File source, File dest) throws IOException {
FileUtils.copyFile(source, dest);
}
and your Method by using Java 7 and the Files class:
private static void copyFileUsingJava7Files(File source, File dest) throws IOException {
Files.copy(source.toPath(), dest.toPath());
}
Edit 1:
As suggested in the comments, here are three SO questions that cover the problem and explain the four different copying methods better:
Standard concise way to copy a file in Java?
File copy/move methods and approaches explanation, comparison
Reading and writing a large file using Java NIO
Thanks to @jww for pointing it out.

How to determine the compression method of a zip file

I receive .zip files from a third party and want to unzip them to another folder. I found a method that does exactly that (see the code below): it iterates through all entries and unzips them to another folder. However, when I print the compression method of each entry, I see that it differs between entries. For some entries the code reports "invalid compression method" and then stops unzipping the rest of the zip file.
As the compression method varies, I suspect I need to set it to the correct one (though that may be a wrong assumption). So my question is: how do I determine which compression method is needed?
The code I am using:
public void unZipIt(String zipFile, String outputFolder){
//create the output directory if it does not exist
File folder = new File(OUTPUT_FOLDER);
if(!folder.exists()){
folder.mkdir();
}
FileInputStream fis = null;
ZipInputStream zipIs = null;
ZipEntry zEntry = null;
try
{
fis = new FileInputStream(zipFile);
zipIs = new ZipInputStream(new BufferedInputStream(fis));
while((zEntry = zipIs.getNextEntry()) != null){
System.out.println(zEntry.getMethod());
try{
byte[] tmp = new byte[4*1024];
FileOutputStream fos = null;
String opFilePath = OUTPUT_FOLDER + "\\" + zEntry.getName();
System.out.println("Extracting file to "+opFilePath);
fos = new FileOutputStream(opFilePath);
int size = 0;
while((size = zipIs.read(tmp)) != -1){
fos.write(tmp, 0 , size);
}
fos.flush();
fos.close();
} catch(IOException e){
System.out.println(e.getMessage());
}
}
zipIs.close();
} catch (FileNotFoundException e) {
System.out.println(e.getMessage());
}
catch(IOException ex){
System.out.println(ex.getMessage());
}
}
Currently I am retrieving the following output:
8
Extracting file to C:\Users\nlmeibe2\Documents\Projects\Output_test\SOPHIS_cptyrisk_tradedata_1192_20140616.csv
8
Extracting file to C:\Users\nlmeibe2\Documents\Projects\Output_test\SOPHIS_cptyrisk_underlying_1192_20140616.csv
0
Extracting file to C:\Users\nlmeibe2\Documents\Projects\Output_test\10052013/
12
Extracting file to C:\Users\nlmeibe2\Documents\Projects\Output_test\MRM_Daily_Position_Report_Package_Level_Underlying_View_EQB_v2_COBDATE_2014-06-16_RUNDATETIME_2014-06-17-04h15.csv
invalid compression method
invalid compression method

Since you only print the exception message and not the stack trace (with line numbers), it is impossible to know exactly where the exception is thrown, but I suppose it is not thrown until you actually try to read from the ZipEntry.
If the numbers in your output are the ZIP methods, the last entry you encounter is compressed with method 12 (bzip2), which is not supported by the Java ZIP implementation. PKWARE (the maintainers of the ZIP format) regularly add new compression methods to the ZIP specification, and there are currently some 12-15 (I am not sure about the exact number) compression methods specified. Java only supports methods 0 (stored) and 8 (deflated) and throws an exception with the message "invalid compression method" if you try to decompress a ZIP file that uses an unsupported compression method.
Both WinZip and the ZIP functions in Windows may use compression methods not supported by the Java API.
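To find out up front which entries would fail, the method numbers can be inspected before any data is read. The following is a sketch under assumptions: the class name ZipMethodCheck is hypothetical, and it relies on ZipFile being able to list the entries of such an archive from its central directory even when it could not decompress them.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

class ZipMethodCheck {

    /** java.util.zip can only read stored (0) and deflated (8) entries. */
    static boolean isSupported(int method) {
        return method == ZipEntry.STORED || method == ZipEntry.DEFLATED;
    }

    /** Lists the entries that java.util.zip cannot decompress, as "name (method N)". */
    static List<String> unsupportedEntries(File zipFile) throws IOException {
        List<String> result = new ArrayList<>();
        try (ZipFile zip = new ZipFile(zipFile)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (!isSupported(entry.getMethod())) {
                    result.add(entry.getName() + " (method " + entry.getMethod() + ")");
                }
            }
        }
        return result;
    }
}
```

If the returned list is non-empty, the archive needs a different library or an external tool for those entries; extraction with java.util.zip would stop at them.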

Use zEntry.getMethod() to get the compression method. From the documentation:
Returns the compression method of the entry, or -1 if not specified.
It returns an int, which will be one of
public static final int STORED
public static final int DEFLATED
or -1 if the method is not known.

How to read the contents of a file inside a .7z archive

I want to read a file that is inside a .7z archive. I do not want to extract it to the local file system; I need to read the whole contents of the file into a buffer in Java. Is there any way to do this? If so, can you provide example code?
Scenario:
Main File- TestFile.7z
Files inside TestFile.7z are First.xml, Second.xml, Third.xml
I want to read First.xml without unzipping it.

You can use the Apache Commons Compress library. This library supports packing and unpacking for several archive formats. To use 7z format you also have to put xz-1.4.jar into the classpath. Here are the XZ for Java sources. You can download the XZ binary from Maven Central Repository.
Here is a small example to read the contents of a 7z archive.
public static void main(String[] args) throws IOException {
SevenZFile archiveFile = new SevenZFile(new File("archive.7z"));
SevenZArchiveEntry entry;
try {
// Go through all entries
while((entry = archiveFile.getNextEntry()) != null) {
// Maybe filter by name. Name can contain a path.
String name = entry.getName();
if(entry.isDirectory()) {
System.out.println(String.format("Found directory entry %s", name));
} else {
// If this is a file, we read the file content into a
// ByteArrayOutputStream ...
System.out.println(String.format("Unpacking %s ...", name));
ByteArrayOutputStream contentBytes = new ByteArrayOutputStream();
// ... using a small buffer byte array.
byte[] buffer = new byte[2048];
int bytesRead;
while((bytesRead = archiveFile.read(buffer)) != -1) {
contentBytes.write(buffer, 0, bytesRead);
}
// Assuming the content is a UTF-8 text file we can interpret the
// bytes as a string.
String content = contentBytes.toString("UTF-8");
System.out.println(content);
}
}
} finally {
archiveFile.close();
}
}

While the Apache Commons Compress library works as advertised above, I've found it to be unusably slow for files of any substantial size (mine were around a GB or more). For my large image files I had to call the native command-line 7z.exe from Java, which was at least 10 times faster.
I used jre1.7. Maybe things will improve for higher versions of the jre.

When a file is moved/renamed, what is the difference between Java's RandomAccessFile and File?

I only want to discuss this in a Java/Linux context.
RandomAccessFile rand = new RandomAccessFile("test.log", "r");
VS
File file = new File("test.log");
After creating it, we read the file to the end.
In the java.io.File case, an IOException is thrown when reading the file if you mv or delete the physical file before the read.
public void readIOFile() throws IOException, InterruptedException {
File file = new File("/tmp/test.log");
System.out.print("file created"); // convert byte into char
Thread.sleep(5000);
while (true) {
char[] buffer = new char[1024];
FileReader fr = new FileReader(file);
fr.read(buffer);
System.out.println(buffer);
}
}
But in the RandomAccessFile case, if you mv or delete the physical file before reading, it finishes reading the file without errors or exceptions.
public void readRAF() throws IOException, InterruptedException {
File file = new File("/tmp/test.log");
RandomAccessFile rand = new RandomAccessFile(file, "rw");
System.out.println("file created"); // convert byte into char
while (true) {
System.out.println(file.lastModified());
System.out.println(file.length());
Thread.sleep(5000);
System.out.println("finish sleeping");
int i = (int) rand.length();
rand.seek(0); // Seek to start point of file
for (int ct = 0; ct < i; ct++) {
byte b = rand.readByte(); // read byte from the file
System.out.print((char) b); // convert byte into char
}
}
}
Can anyone explain why? Does it have anything to do with the file's inode?

Unlike RandomAccessFile or, say, InputStream and many other Java IO facilities, File is just an immutable handle that you carry around and use whenever you need to perform file system operations. You may think of it as a reference: a File instance points to some specified path. RandomAccessFile, on the other hand, has a notion of path only at construction time: it goes to the specified path, opens the file and acquires a file descriptor (you may think of it as a unique id of a given file, which does not change on a move and some other operations) and uses this id throughout its lifetime to address the file.
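A small experiment can make this distinction visible. The sketch below assumes a POSIX file system such as the Linux setup in the question (on other platforms, renaming a file that is still open may behave differently); the class name RenameWhileOpenDemo and its helper methods are hypothetical.

```java
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;

class RenameWhileOpenDemo {

    /** Opens original, renames it while it is open, then reads the first byte via the open handle. */
    static int readAfterRename(Path original, Path renamed) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(original.toFile(), "r")) {
            Files.move(original, renamed); // rename while the descriptor is open
            raf.seek(0);
            return raf.read();             // the descriptor still refers to the same file
        }
    }

    /** Returns true if opening the path fails because no file exists under that name. */
    static boolean openFails(Path path) throws IOException {
        try (FileReader reader = new FileReader(path.toFile())) {
            return false;
        } catch (FileNotFoundException e) {
            return true;
        }
    }
}
```

After readAfterRename has run, openFails(original) is expected to return true, because a FileReader resolves the path again at the moment it is constructed, while the RandomAccessFile resolved it only once.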

The OS based file system services such as creating folders, files, verifying the permissions, changing file names etc., are provided by the java.io.File class.
The java.io.RandomAccessFile class provides random access to the records stored in a data file. Using this class, you can read, write and manipulate the data. It can also read and write primitive data types, which helps when handling data files in a structured way.
Unlike the input and output stream classes in java.io, RandomAccessFile is used for both reading and writing files. RandomAccessFile does not inherit from InputStream or OutputStream. It implements the DataInput and DataOutput interfaces.

There is no evidence here that you have moved or renamed the file at all.
If you did that from outside the program, clearly it is just a timing issue.
If you rename a file before you try to open it with the old name, it will fail. Surely this is obvious?

One of the main differences is that File cannot read or write directly; it requires IO streams to do that. With RandomAccessFile, we can read and write the file directly.
