How to determine the compression method of a zip file


I receive .zip files from a third party and want to unzip them into another folder. I found a method that does this (see the code below): it iterates over all entries and extracts each one to the output folder. However, when I print each entry's compression method, I see that it differs between entries, and for some entries I get "invalid compression method", after which extraction of the rest of the zip file is aborted.
Since the compression method seems to vary, I suspect I need to set it to the correct one (although that assumption may be wrong). So my question is: how do I determine which compression method is needed?
The code I am using:
public void unZipIt(String zipFile, String outputFolder){
//create the output directory if it does not exist
File folder = new File(OUTPUT_FOLDER);
if(!folder.exists()){
folder.mkdir();
}
FileInputStream fis = null;
ZipInputStream zipIs = null;
ZipEntry zEntry = null;
try
{
fis = new FileInputStream(zipFile);
zipIs = new ZipInputStream(new BufferedInputStream(fis));
while((zEntry = zipIs.getNextEntry()) != null){
System.out.println(zEntry.getMethod());
try{
byte[] tmp = new byte[4*1024];
FileOutputStream fos = null;
String opFilePath = OUTPUT_FOLDER + "\\" + zEntry.getName();
System.out.println("Extracting file to "+opFilePath);
fos = new FileOutputStream(opFilePath);
int size = 0;
while((size = zipIs.read(tmp)) != -1){
fos.write(tmp, 0 , size);
}
fos.flush();
fos.close();
} catch(IOException e){
System.out.println(e.getMessage());
}
}
zipIs.close();
} catch (FileNotFoundException e) {
System.out.println(e.getMessage());
}
catch(IOException ex){
System.out.println(ex.getMessage());
}
}
Currently I get the following output:
8
Extracting file to C:\Users\nlmeibe2\Documents\Projects\Output_test\SOPHIS_cptyrisk_tradedata_1192_20140616.csv
8
Extracting file to C:\Users\nlmeibe2\Documents\Projects\Output_test\SOPHIS_cptyrisk_underlying_1192_20140616.csv
0
Extracting file to C:\Users\nlmeibe2\Documents\Projects\Output_test\10052013/
12
Extracting file to C:\Users\nlmeibe2\Documents\Projects\Output_test\MRM_Daily_Position_Report_Package_Level_Underlying_View_EQB_v2_COBDATE_2014-06-16_RUNDATETIME_2014-06-17-04h15.csv
invalid compression method
invalid compression method

Since you only print the exception message and not the stack trace (with line numbers), it is impossible to know exactly where the exception is thrown, but I suppose it is not thrown until you actually try to read from the ZipEntry.
If the numbers in your output are the ZIP compression methods, the last entry you encounter is compressed with method 12 (bzip2), which is not supported by the Java ZIP implementation. PKWARE (the maintainer of the ZIP format) regularly adds new compression methods to the ZIP specification, and there are currently some 12-15 (not sure about the exact number) compression methods specified. Java only supports methods 0 (stored) and 8 (deflated) and throws an exception with the message "invalid compression method" if you try to decompress an entry that uses any other method.
Both WinZip and the ZIP functions in Windows may use compression methods not supported by the Java API.
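As a rough illustration of the point above, the sketch below checks each entry's method before any attempt to read it, and stops at the first entry that java.util.zip cannot decompress. The class name ZipMethodCheck and the report format are made up for this example; it assumes, as the output in the question suggests, that ZipInputStream reports the raw method number from the entry header.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of any library.
final class ZipMethodCheck {

    // java.util.zip can only decompress STORED (0) and DEFLATED (8) entries.
    static boolean isSupported(int method) {
        return method == ZipEntry.STORED || method == ZipEntry.DEFLATED;
    }

    // Walks the archive and reports each entry until an unsupported method appears.
    // It stops there because moving past an entry means reading it first,
    // which is exactly what fails for an unsupported method.
    static List<String> scan(InputStream rawZip) throws IOException {
        List<String> report = new ArrayList<>();
        try (ZipInputStream zis = new ZipInputStream(rawZip)) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                if (!isSupported(entry.getMethod())) {
                    report.add(entry.getName() + ": unsupported method " + entry.getMethod());
                    break;
                }
                report.add(entry.getName() + ": ok");
            }
        }
        return report;
    }
}
```

If the report ends with an unsupported entry, the archive probably needs a different tool or library, or the sender has to be asked to use deflate.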

Use zEntry.getMethod() to get the compression method. According to the documentation, it:
Returns the compression method of the entry, or -1 if not specified.
It returns an int, which will be one of
public static final int STORED
public static final int DEFLATED
or -1 if the method is not known.
Docs.

Related

Extracting PDF inside a Zip inside a Zip

I have checked everywhere online and on Stack Overflow and could not find a match specific to this issue.
I am trying to extract a PDF file that is located in a zip file that is itself inside a zip file (nested zips).
Calling my extraction method again on the inner zip does not work, nor does changing the whole program to accept InputStreams instead of the approach below.
At this stage the .pdf file inside the nested zip is simply skipped.
public static void main(String[] args)
{
try
{
//Paths
String basePath = "C:\\Users\\user\\Desktop\\Scan\\";
File lookupDir = new File(basePath + "Data\\");
String doneFolder = basePath + "DoneUnzipping\\";
File[] directoryListing = lookupDir.listFiles();
for (int i = 0; i < directoryListing.length; i++)
{
if (directoryListing[i].isFile()) //there's definitely a file
{
//Save the current file's path
String pathOrigFile = directoryListing[i].getAbsolutePath();
Path origFileDone = Paths.get(pathOrigFile);
Path newFileDone = Paths.get(doneFolder + directoryListing[i].getName());
//unzip it
if(directoryListing[i].getName().toUpperCase().endsWith(ZIP_EXTENSION)) //ZIP files
{
unzip(directoryListing[i].getAbsolutePath(), DESTINATION_DIRECTORY + directoryListing[i].getName());
//move to the 'DoneUnzipping' folder
Files.move(origFileDone, newFileDone);
}
}
}
} catch (Exception e)
{
e.printStackTrace(System.out);
}
}
private static void unzip(String zipFilePath, String destDir)
{
//buffer for read and write data to file
byte[] buffer = new byte[BUFFER_SIZE];
try (ZipInputStream zis = new ZipInputStream(new FileInputStream(zipFilePath)))
{
FileInputStream fis = new FileInputStream(zipFilePath);
ZipEntry ze = zis.getNextEntry();
while(ze != null)
{
String fileName = ze.getName();
int index = fileName.lastIndexOf("/");
String newFileName = fileName.substring(index + 1);
File newFile = new File(destDir + File.separator + newFileName);
//Zips inside zips
if(fileName.toUpperCase().endsWith(ZIP_EXTENSION))
{
ZipInputStream innerZip = new ZipInputStream(zis);
ZipEntry innerEntry = null;
while((innerEntry = innerZip.getNextEntry()) != null)
{
System.out.println("The file: " + fileName);
if(fileName.toUpperCase().endsWith("PDF"))
{
FileOutputStream fos = new FileOutputStream(newFile);
int len;
while ((len = innerZip.read(buffer)) > 0)
{
fos.write(buffer, 0, len);
}
fos.close();
}
}
}
//close this ZipEntry
zis.closeEntry(); // java.io.IOException: Stream Closed
ze = zis.getNextEntry();
}
//close last ZipEntry
zis.close();
fis.close();
} catch (IOException e)
{
e.printStackTrace();
}
}

The solution to this is not as obvious as it seems. Despite writing a few zip utilities myself some time ago, getting zip entries from inside another zip file only seems obvious in retrospect
(and I also got the java.io.IOException: Stream Closed on my first attempt).
The Java classes for ZipFile and ZipInputStream really direct your thinking into using the file system, but it is not required.
The functions below will scan a parent-level zip file, and continue scanning until it finds an entry with a specified name. (Nearly) everything is done in-memory.
Naturally, this can be modified to use different search criteria, find multiple file types, etc. and take different actions, but this at least demonstrates the basic technique in question -- zip files inside of zip files -- no guarantees on other aspects of the code, and someone more savvy could most likely improve the style.
final static String ZIP_EXTENSION = ".zip";
public static byte[] getOnePDF() throws IOException
{
final File source = new File("/path/to/MegaData.zip");
final String nameToFind = "FindThisFile.pdf";
final ByteArrayOutputStream mem = new ByteArrayOutputStream();
try (final ZipInputStream in = new ZipInputStream(new BufferedInputStream(new FileInputStream(source))))
{
digIntoContents(in, nameToFind, mem);
}
// Save to disk, if you want
// copy(new ByteArrayInputStream(mem.toByteArray()), new FileOutputStream(new File("/path/to/output.pdf")));
// Otherwise, just return the binary data
return mem.toByteArray();
}
private static void digIntoContents(final ZipInputStream in, final String nameToFind, final ByteArrayOutputStream mem) throws IOException
{
ZipEntry entry;
while (null != (entry = in.getNextEntry()))
{
final String name = entry.getName();
// Found the file we are looking for
if (name.equals(nameToFind))
{
copy(in, mem);
return;
}
// Found another zip file
if (name.toUpperCase().endsWith(ZIP_EXTENSION.toUpperCase()))
{
digIntoContents(new ZipInputStream(new ByteArrayInputStream(getZipEntryFromMemory(in))), nameToFind, mem);
}
}
}
private static byte[] getZipEntryFromMemory(final ZipInputStream in) throws IOException
{
final ByteArrayOutputStream mem = new ByteArrayOutputStream();
copy(in, mem);
return mem.toByteArray();
}
// General purpose, reusable, utility function
// OK for binary data (bad for non-ASCII text, use Reader/Writer instead)
public static void copy(final InputStream from, final OutputStream to) throws IOException
{
final int bufferSize = 4096;
final byte[] buf = new byte[bufferSize];
int len;
while (0 < (len = from.read(buf)))
{
to.write(buf, 0, len);
}
to.flush();
}

Your question asks how to use Java (by implication on Windows) to extract a PDF from a zip inside another, outer zip.
On many systems, including Windows, this is a single-line command that depends on the locations of the source and target folders. The shortest example, run in a shell in the current Downloads folder, is as simple as
tar -xf "german (2).zip" && tar -xf "german.zip" && german.pdf
To run the command from Java on Windows, see
How do I execute Windows commands in Java?
The default PDF viewer then opens the result: Microsoft Edge, or in my case SumatraPDF.
There is generally no point in putting a PDF inside a zip, because it cannot be opened in place there. If a zip is needed for download or transport, a single level of nesting is advisable.
There is no need to add a password to the zip, because a PDF can carry its own password for opening. Adding two levels of complexity is unwise. Keep it simple.
If you have multiple zips nested inside multiple zips, each with multiple PDFs, you have to be more specific and filter by name. Avoid that extra onion skin where possible, however.
\Downloads>tar -xf "german (2).zip" "both.zip" && tar -xf "both.zip" "English language.pdf"
You could complicate this by working in memory or in a temp folder, but using the native file system is reliable and simple. Without Java, the fastest approach is to run
CD /D "C:/Users/user/Desktop/Scan/DoneUnzipping" && for %f in (..\Data\*.zip) do tar -xf "%f" "*.zip" && for %f in (*.zip) do tar -xf "%f" "*.pdf" && del "*.zip"
This extracts all inner zips into the working folder, then extracts all PDFs, and finally removes the temporary inner zips. The source outer zips are not deleted, only read.

The line that causes your problem appears to be the auto-close block you created when reading the inner zip:
try(ZipInputStream innerZip = new ZipInputStream(fis)) {
...
}
There are several likely issues. Firstly, it reads the wrong stream: fis rather than the existing zis.
Secondly, you shouldn't use try-with-resources for innerZip, because it implicitly calls innerZip.close() on exiting the block. If you view the source code of ZipInputStream in a good IDE, you will see that ZipInputStream extends InflaterInputStream, which itself extends FilterInputStream. A call to innerZip.close() therefore closes the underlying outer stream (zis, or fis in your case), so the stream is already closed when you move on to the next entry of the outer zip.
Therefore remove the try() block and use zis:
ZipInputStream innerZip = new ZipInputStream(zis);
Use try-with-resources only for the outermost file handling:
try (ZipInputStream zis = new ZipInputStream(new FileInputStream(zipFilePath))) {
ZipEntry ze = zis.getNextEntry();
...
}
Thirdly, you appear to be copying the wrong stream when extracting a PDF: use innerZip, not the outer zis. The code will also never extract a PDF, because these two conditions can never be true at the same time: a file name ending in ZIP cannot also end in PDF:
if(fileName.toUpperCase().endsWith(ZIP_EXTENSION)) {
...
// You want innerEntry.getName() here
if(fileName.toUpperCase().endsWith("PDF"))
You should be able to switch to a single Files.copy call and use the PDF file name rather than the zip file name:
if(innerEntry.getName().toUpperCase().endsWith("PDF")) {
Path newFile = Paths.get(destDir + '-'+innerEntry.getName().replace("/", "-"));
System.out.println("Files.copy to " + newFile);
Files.copy(innerZip, newFile);
}
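Putting the three points together, a minimal sketch might look like the following. The class name NestedZipPdfs, the method extractPdfs and the flattening of entry paths into file names are choices made for this example only; it handles one level of nesting and treats any entry ending in .zip as a zip.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper illustrating the fixes described above.
final class NestedZipPdfs {

    // Extracts every PDF found in zips nested one level inside the outer zip.
    static List<Path> extractPdfs(InputStream outer, Path destDir) throws IOException {
        List<Path> written = new ArrayList<>();
        Files.createDirectories(destDir);
        // try-with-resources only for the outermost stream
        try (ZipInputStream zis = new ZipInputStream(outer)) {
            ZipEntry ze;
            while ((ze = zis.getNextEntry()) != null) {
                if (!ze.getName().toUpperCase(Locale.ROOT).endsWith(".ZIP")) {
                    continue;
                }
                // Deliberately NOT closed: closing innerZip would close zis as well
                ZipInputStream innerZip = new ZipInputStream(zis);
                ZipEntry innerEntry;
                while ((innerEntry = innerZip.getNextEntry()) != null) {
                    String name = innerEntry.getName();
                    if (innerEntry.isDirectory() || !name.toUpperCase(Locale.ROOT).endsWith(".PDF")) {
                        continue;
                    }
                    // Flatten the entry path so nothing is written outside destDir
                    Path target = destDir.resolve(name.replace('/', '-').replace('\\', '-'));
                    // Copies only the current inner entry; Files.copy does not close innerZip
                    Files.copy(innerZip, target, StandardCopyOption.REPLACE_EXISTING);
                    written.add(target);
                }
            }
        }
        return written;
    }
}
```

It could be called as extractPdfs(new FileInputStream(zipFilePath), Paths.get(destDir)), with the caller deciding what to do with the returned paths.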

File has been moved, can not be read again (Spring mvc)

I am using Spring MVC, where I upload a zip file through an API using MultipartFile. In the backend I have to convert the uploaded zip file into an InputStream for further processing. But my code intermittently fails with the error "File has been moved, can not be read again".
Here is the code snippet:
File temp = null;
InputStream stream = null;
try {
InputStream initialStream = inputFile.getInputStream();
byte[] buffer = new byte[initialStream.available()];
initialStream.read(buffer);
temp = File.createTempFile("upload", null);
try (OutputStream outStream = new FileOutputStream(temp)) {
outStream.write(buffer);
}
ZipFile zipFile = new ZipFile(temp);
stream = zipFile.getInputStream(zipFile.getEntries().nextElement());
} catch (Exception e) {
log.error("Exception occurred while processing zip file " + e.getMessage());
throw e;
} finally {
if (temp != null)
temp.delete();
}
return stream;
Here inputFile is MultipartFile.
Could you please suggest what is wrong here?

Your code returns an input stream backed by a file that you have deleted: the last step before returning is temp.delete().
ZipInputStream has a small internal buffer for decoding, so that may explain why some read calls work after the delete, but it will not be possible to continue reading from a file that you deleted, hence the exception.
Also, the call initialStream.available() is unlikely to be the correct way to determine the size of the uploaded file part. Try printing the size, or check how to read the actual length of the file in the multipart stream (such as part.getSize()), or transfer the bytes into a new ByteArrayOutputStream() before assigning to buffer.
I would not recommend doing any work with files or multipart streams using direct transfer to byte[], as you risk an OutOfMemoryError. However, in your case, where you are happy to have a byte[] for the ZIP and you read only the first entry of the ZIP file (ignoring other entries), you could try extracting the first entry as an InputStream without writing to a file, as follows:
// Read a zip input stream from a zip stored in byte[]:
ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(buffer));
// Select first entry from ZIP
ZipEntry entry = zis.getNextEntry();
// You should be able to read the entry from zis directly,
// if this is text file you could test with:
// zis.transferTo(System.out);
return zis;
You should ensure that you close the stream after use.
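If handing an open stream back to the caller is awkward, one alternative is to read the first entry completely and return its bytes, so that nothing stays open. This is only a sketch under the assumption that the first entry fits comfortably in memory; FirstZipEntry and readFirstEntry are names invented for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper: no temp file and no dependence on available().
final class FirstZipEntry {

    // Returns the decompressed bytes of the first entry of the uploaded zip.
    static byte[] readFirstEntry(InputStream upload) throws IOException {
        try (ZipInputStream zis = new ZipInputStream(upload)) {
            ZipEntry entry = zis.getNextEntry();
            if (entry == null) {
                throw new IOException("ZIP archive has no entries");
            }
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            // read() may return fewer bytes than the buffer holds, so loop until -1
            while ((n = zis.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        }
    }
}
```

With a MultipartFile this would be called as readFirstEntry(inputFile.getInputStream()), and the result can be wrapped in a ByteArrayInputStream if an InputStream is still required.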

Potential issues I can see in your code:
The temp file is used as the zip file, yet you delete the temp file before returning. How can you use the zip file as a file stream if you have deleted it?
Do you support concurrent uploads? If so, you have a concurrent resource access problem: multiple calls create the temp file "upload" and process it. Why not create a different file name, e.g. with a date-time suffix plus a random number suffix?

How to detect file type from its content in zip archive?

I have a zip archive that contains several gzip files, but the gzip files' extensions are also .zip. I walk through the zip archive with ZipInputStream. How can I detect an inner file's type by reading its content rather than its extension? I also must not change (or must be able to reset) the ZipInputStream position.
So I need to:
Read the files in the zip using an InputStream (ZipInputStream in my case), because a zip inside a zip is possible.
Find the file type from its content.
Leave the InputStream position unchanged while finding the file type from its content, because I will continue to read the next files.
Example:
root/1.zip/2.zip/3.zip(actually 3 is gzip)/4.txt
Sample Java Code:
public static void main(String[] args) {
//root/1.zip/2.zip/3.zip(actually 3 is gzip)/4.txt
String file = "root/1.zip";
File rootZip = new File(file);
try (FileInputStream fis = new FileInputStream(rootZip)) {
lookupInZip(fis)
.stream()
.forEach(System.out::println);
} catch (IOException e) {
System.out.println("Failed to get files");
}
}
public static List<String> lookupInZip(InputStream inputStream) throws IOException {
Tika tika = new Tika();
List<String> paths = new ArrayList<>();
ZipInputStream zipInputStream = new ZipInputStream(inputStream);
ZipEntry entry = zipInputStream.getNextEntry();
while (entry != null) {
String entryName = entry.getName();
if (!entry.isDirectory()) {
//Option 1
//String fileType = tika.detect(entryName);
//Option 2
String fileType = tika.detect(zipInputStream);
if ("application/zip".equals(fileType)) {
List<String> innerPaths = lookupInZip(zipInputStream);
paths.addAll(innerPaths);
} else {
paths.add(entryName);
}
}
entry = zipInputStream.getNextEntry();
}
return paths;
}
If I use option 1, '3.zip' is evaluated as a zip file, but it is gzip.
If I use option 2, '2.zip' is correctly evaluated as zip from its content. But when lookupInZip() is called recursively for '3.zip', zipInputStream.getNextEntry() returns null, because in the previous step we used the inputStream content to detect the type and the inputStream position changed.
Note: tika.detect() uses a BufferedInputStream in its implementation to reset the inputStream position, but that does not solve my problem.

The first two bytes are enough to see if it is likely a zip file, likely a gzip file, or certainly something else.
If the first two bytes are 0x50 0x4b, then it is likely a zip file. If the first two bytes are 0x1f 0x8b, then it is likely a gzip file. If it is neither, then the file is something else.
The first two bytes matching is not a guarantee it is that type, but it appears from your structure that it is usually one or the other, and you can use the extension as further corroborating evidence that it is compressed.
As for not changing the position, you need a way to peek at the first two bytes without advancing the position, or a way to get them and then unget them to return the position to where it was.
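One way to "get them and then unget them" in Java is java.io.PushbackInputStream. The sketch below is an illustration only: MagicSniffer, Kind and sniff are invented names, and the caller is assumed to create the PushbackInputStream with a pushback buffer of at least 2 bytes.

```java
import java.io.IOException;
import java.io.PushbackInputStream;

// Hypothetical helper: peeks at the first two bytes and pushes them back.
final class MagicSniffer {

    enum Kind { ZIP, GZIP, OTHER }

    // The stream must have been created with a pushback buffer of at least 2 bytes.
    static Kind sniff(PushbackInputStream in) throws IOException {
        byte[] head = new byte[2];
        int n = 0;
        while (n < head.length) {
            int r = in.read(head, n, head.length - n);
            if (r < 0) {
                break; // fewer than two bytes available
            }
            n += r;
        }
        if (n > 0) {
            in.unread(head, 0, n); // restore the position for the real reader
        }
        if (n == 2 && head[0] == 0x50 && head[1] == 0x4b) {
            return Kind.ZIP; // "PK"
        }
        if (n == 2 && (head[0] & 0xff) == 0x1f && (head[1] & 0xff) == 0x8b) {
            return Kind.GZIP;
        }
        return Kind.OTHER;
    }
}
```

Inside the entry loop this would be used as PushbackInputStream peekable = new PushbackInputStream(zipInputStream, 2), followed by sniff(peekable) and then new ZipInputStream(peekable) or new GZIPInputStream(peekable) depending on the result. The wrappers should not be closed, since closing them would also close the outer ZipInputStream.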

Java utility library for Nested ZIP file handling

I am aware that Oracle notes ZIP/GZIP file compressor/decompressor methods on their website. But I have a scenario where I need to scan and find out whether any nested ZIPs/RARs are involved. For example, the following case:
-MyFiles.zip
-MyNestedFiles.zip
-MyMoreNestedFiles.zip
-MoreProbably.zip
-Other_non_zips
-Other_non_zips
-Other_non_zips
I know that Apache Commons Compress and java.util.zip are the widely used packages, where Commons Compress caters for features missing in java.util.zip, e.g. some character-set settings when writing zips. But what I am not sure about is utilities for recursing through nested zip files, and the answers provided on SO are not very good examples of doing this. I tried the following code (which I got from an Oracle blog), but as I suspected, the nested directory recursion fails because it simply cannot find the files:
public static void processZipFiles(String pathName) throws Exception{
ZipInputStream zis = null;
InputStream is = null;
try {
ZipFile zipFile = new ZipFile(new File(pathName));
String nestPathPrefix = zipFile.getName().substring(0, zipFile.getName().length() -4);
for(Enumeration e = zipFile.entries(); e.hasMoreElements();){
ZipEntry ze = (ZipEntry)e.nextElement();
if(ze.getName().contains(".zip")){
is = zipFile.getInputStream(ze);
zis = new ZipInputStream(is);
ZipEntry zentry = zis.getNextEntry();
while (zentry!=null){
System.out.println(zentry.getName());
zentry = zis.getNextEntry();
ZipFile nestFile = new ZipFile(nestPathPrefix+"\\"+zentry.getName());
if (zentry.getName().contains(".zip")) {
processZipFiles(nestPathPrefix+"\\"+zentry.getName());
}
}
is.close();
}
}
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
} finally{
if(is != null)
is.close();
if(zis!=null)
zis.close();
}
}
Maybe I am doing something wrong, or using the wrong utilities. My objective is to identify whether any of the files or nested zip files have file extensions that I do not allow. This is to make sure that I can prevent my users from uploading forbidden files even when they zip them. I also have the option to use Tika, which can do recursive parsing (using Jukka Zitting's solution), but I am not sure whether I can use the Metadata to do this detection the way I want.
Any help/suggestion is appreciated.

Using Commons Compress would be easier, not least because it has sensible shared interfaces between the various decompressors, which makes life easier and allows handling other archive formats (e.g. tar) at the same time.
If you do want to use only the built-in Zip support, I'd suggest you do something like this:
// Call site (inside a method that declares throws IOException)
File file = new File("outermost.zip");
try (FileInputStream input = new FileInputStream(file)) {
    check(input, file.toString());
}

// Prints every entry, recursing into entries whose names end with .zip
public static void check(InputStream compressedInput, String name) throws IOException {
    // Not closed here: closing it would also close the enclosing stream
    ZipInputStream input = new ZipInputStream(compressedInput);
    ZipEntry entry = null;
    while ( (entry = input.getNextEntry()) != null ) {
        System.out.println("Found " + entry.getName() + " in " + name);
        if (entry.getName().endsWith(".zip")) { // TODO Better checking
            check(input, name + "/" + entry.getName());
        }
    }
}
Your code will fail because you're trying to read inner.zip within outer.zip as a local file, but it doesn't exist as a standalone file. The code above treats anything ending with .zip as another zip file and recurses into it.
You probably want to use Commons Compress though, so you can handle things with alternate filenames, other compression formats, etc.
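For the stated goal of spotting forbidden extensions, the same recursion can collect matches instead of printing them. The following is a sketch using only java.util.zip; NestedZipScanner, findForbidden and the extension set are assumptions made for illustration, and detection is by file name only.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper built on the recursive approach shown above.
final class NestedZipScanner {

    // Returns the full nested paths of entries whose extension is in forbiddenExt (lower case, no dot).
    static List<String> findForbidden(InputStream in, String path, Set<String> forbiddenExt)
            throws IOException {
        List<String> hits = new ArrayList<>();
        // Not closed here: closing would also close the parent stream
        ZipInputStream zis = new ZipInputStream(in);
        ZipEntry entry;
        while ((entry = zis.getNextEntry()) != null) {
            if (entry.isDirectory()) {
                continue;
            }
            String full = path + "/" + entry.getName();
            String lower = entry.getName().toLowerCase(Locale.ROOT);
            int dot = lower.lastIndexOf('.');
            String ext = dot < 0 ? "" : lower.substring(dot + 1);
            if (ext.equals("zip")) {
                // The current entry is itself a zip: read it as a nested stream
                hits.addAll(findForbidden(zis, full, forbiddenExt));
            } else if (forbiddenExt.contains(ext)) {
                hits.add(full);
            }
        }
        return hits;
    }
}
```

The caller opens and closes the outermost stream, for example with try-with-resources around a FileInputStream, and rejects the upload if the returned list is not empty.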

java Extracting Zip file

I'm looking for a way to extract a ZIP file. So far I have tried java.util.zip and org.apache.commons.compress, but both gave corrupted output.
Basically, the input is a ZIP file containing a single .doc file.
java.util.zip: output corrupted.
org.apache.commons.compress: output is a blank file, but 2 MB in size.
So far only commercial software like WinRAR works perfectly. Is there a Java library that can do the same?
This is my method using the java.util.zip library:
public void extractZipNative(File fileZip)
{
ZipInputStream zis;
StringBuilder sb;
try {
zis = new ZipInputStream(new FileInputStream(fileZip));
ZipEntry ze = zis.getNextEntry();
byte[] buffer = new byte[(int) ze.getSize()];
FileOutputStream fos = new FileOutputStream(this.tempFolderPath+ze.getName());
int len;
while ((len=zis.read(buffer))>0)
{
fos.write(buffer);
}
fos.flush();
fos.close();
} catch (FileNotFoundException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} finally
{
if (zis!=null)
{
try { zis.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
}
Many thanks,
Mike

I think your input may have been compressed by some "incompatible" zip program like 7zip.
Try investigating first whether it can be unpacked with a classical WinZip or similar.
Java's zip handling is well able to deal with zipped archives that come from a "compatible" zip compressor.

It was an error in my code: I need to specify the offset and length of the bytes to write.
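For reference, a copy loop with the offset and length supplied might look like the sketch below. StreamCopy is a name made up for the example. It also uses a fixed-size buffer rather than one sized from ZipEntry.getSize(), since getSize() returns -1 when the size is not known.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper showing the corrected write call.
final class StreamCopy {

    // Copies everything from in to out and returns the number of bytes copied.
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int len;
        while ((len = in.read(buffer)) != -1) {
            // Write only the bytes actually read; write(buffer) would also
            // emit stale bytes left over from earlier, larger reads
            out.write(buffer, 0, len);
            total += len;
        }
        return total;
    }
}
```

In the method from the question, the loop body would become copy(zis, fos) after getNextEntry() has returned the entry.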

This works for me:
ZipFile Vanilla = new ZipFile(new File("Vanilla.zip")); // the zip file must be in the working directory
File folderw = new File("tkwgter5834"); // output directory
folderw.mkdirs(); // create it if it does not exist yet
Enumeration<? extends ZipEntry> entries = Vanilla.entries(); // all entries of the zip file
while (entries.hasMoreElements()) { // runs while there are entries left
    ZipEntry entry = entries.nextElement(); // next entry in the zip
    if (entry.isDirectory()) continue; // nothing to write for a directory entry
    File target = new File(folderw, entry.getName()); // output file named after the entry
    target.getParentFile().mkdirs(); // create sub-folders used by the entry name
    try (InputStream stream = Vanilla.getInputStream(entry); // decompressed bytes of this entry
         FileOutputStream outter = new FileOutputStream(target)) {
        outter.write(stream.readAllBytes()); // write the entry's bytes, not the bytes of the zip file itself
    }
}
Vanilla.close();

Have you tried jUnrar? Perhaps it might work:
https://github.com/edmund-wagner/junrar
If that doesn't work either, I guess your archive is corrupted in some way.

If you know the environment that you're going to be running this code in, I think you're much better off just making a call to the system to unzip it for you. It will be way faster than anything that you implement in java.
I wrote the code to extract a zip file with nested directories and it ran slowly and took a lot of CPU. I wound up replacing it with this:
Runtime.getRuntime().exec(String.format("unzip %s -d %s", archive.getAbsolutePath(), basePath));
That works a lot better.
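A variant of the same idea with ProcessBuilder passes each argument separately, so paths containing spaces are not split the way they would be in a single formatted command string. This is a sketch that assumes an unzip executable is available on the PATH; SystemUnzip, command and run are names invented here.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; assumes an "unzip" executable is installed and on the PATH.
final class SystemUnzip {

    // Builds the argument list: unzip <archive> -d <destDir>
    static List<String> command(File archive, File destDir) {
        return Arrays.asList("unzip", archive.getAbsolutePath(), "-d", destDir.getAbsolutePath());
    }

    // Runs unzip, forwards its output to this process, and returns its exit code.
    static int run(File archive, File destDir) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(archive, destDir))
                .inheritIO()
                .start();
        return process.waitFor();
    }
}
```

A non-zero exit code from run(...) would indicate that unzip reported a problem.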
