handling about 450,000 files in a zip - java

My question is simple. Would Java handle a .zip file with about 450,000 files in it? The code I wrote would not load all of the files; it would only look up one specific file in the zip and read it line by line. The file size is about 500 KB.
Would this work, or will I get an OutOfMemoryError?
Oh sorry, uncompressed each file is about 0.5 MB. Zipped, the whole set of files is about 250 MB.
OK, the file names in that zip are ID + date (unique). If I have to check a log, I call Java with the ID + date, and Java reads just that one file, never more.
Edit: It works, and it works very well. About 400,000 files in a zip works without any problem, as long as you have the memory to create the zip in the first place.
Edit 2: It works on Linux filesystems without a problem; on NTFS it sometimes crashed. NTFS has a problem with that many files in one zip.

Using the zip filesystem in Java 7, you can actually access one individual file pretty easily and open a BufferedReader on it.
First you have to create the FileSystem:
public static FileSystem getZipFileSystem(final String zipPath)
    throws IOException
{
    final Path path = Paths.get(zipPath).toAbsolutePath();
    final Map<String, Object> env = new HashMap<>();
    final URI uri = URI.create("jar:file:" + path.toString());
    // the third argument is a ClassLoader; null is fine here
    return FileSystems.newFileSystem(uri, env, null);
}
Once you have done that, you can create a BufferedReader from an entry in the zip itself:
try (
    final FileSystem fs = getZipFileSystem("/path/to/the.zip");
    final BufferedReader reader = Files.newBufferedReader(fs.getPath("path/to/entry"),
        StandardCharsets.UTF_8);
) {
    // operate on the reader
}
You could also read all lines in the entry at once using Files.readAllLines().
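For example, a minimal sketch (the entry path is a placeholder):
try (final FileSystem fs = getZipFileSystem("/path/to/the.zip")) {
    final List<String> lines = Files.readAllLines(fs.getPath("path/to/entry"),
        StandardCharsets.UTF_8);
    // operate on the lines
}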
If you wish to copy a zip entry to a file on the filesystem, you can also do that:
Files.copy(zipfs.getPath("path/to/entry"), Paths.get("file/on/local/fs"));
Or you can directly copy the result to an OutputStream, or directly create an entry from an OutputStream...
Or even walk the entire zip using Files.walkFileTree().
Or get all the entries in a "directory" in a zip using Files.newDirectoryStream(). Note that, as its name says, this is a stream; unlike File.listFiles() (which only works on files on disk anyway), this returns an iterator over the entries (see the short sketch below).
Or... Or... Or...
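For instance, a minimal sketch of listing the entries of one "directory" inside the zip (the directory path is a placeholder):
try (
    final FileSystem fs = getZipFileSystem("/path/to/the.zip");
    final DirectoryStream<Path> dir = Files.newDirectoryStream(fs.getPath("path/to/dir"));
) {
    for (final Path entry : dir) {
        System.out.println(entry);
    }
}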
Note that a FileSystem needs to be .close()d.

I'm not sure that I understand what you're trying to do.
If it's 0.5 MB per file and 450,000 files, you'll need 225 GB. You won't have enough memory to hold all of this in a single in-memory zip, even if you get 90% compression.
I'd recommend breaking it into manageable chunks. You'll be able to parallelize that way too, so it's not a bad idea.

Related

Random-access Zip file without writing it to disk

I have a 1-2 GB zip file with 500-1000k entries. I need to get files by name in a fraction of a second, without fully unpacking the archive. If the file is stored on the HDD, this works fine:
public class ZipMapper {
    private HashMap<String, ZipEntry> map;
    private ZipFile zf;

    public ZipMapper(File file) throws IOException {
        map = new HashMap<>();
        zf = new ZipFile(file);
        Enumeration<? extends ZipEntry> en = zf.entries();
        while (en.hasMoreElements()) {
            ZipEntry ze = en.nextElement();
            map.put(ze.getName(), ze);
        }
    }

    public Node getNode(String key) throws IOException {
        return Node.loadFromStream(zf.getInputStream(map.get(key)));
    }
}
But what can I do if the program has downloaded the zip file from Amazon S3 and has it as an InputStream (or byte array)? While downloading 1 GB takes ~1 second, writing it to the HDD may take some time, and it is slightly harder to handle multiple files since there is no "garbage collector" for files on disk.
ZipInputStream does not allow random access to entries.
It would be nice to create a virtual File in memory backed by the byte array, but I couldn't find a way to do that.
You could mark the file to be deleted on exit.
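For example (a minimal sketch; the temp-file prefix and suffix are arbitrary):
// download the zip into a temporary file that the JVM removes on normal exit
File tmp = File.createTempFile("downloaded", ".zip");
tmp.deleteOnExit();
// copy the S3 InputStream into tmp, then open it with new ZipFile(tmp)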
If you want to go for an in-memory approach: have a look at the new NIO.2 File API. Oracle provides a filesystem provider for zip/jar, and AFAIK ShrinkWrap provides an in-memory filesystem. You could try a combination of the two.
I've written some utility methods to copy directories and files to/from a Zip file using the NIO.2 File API (the library is Open Source):
Maven:
<dependency>
    <groupId>org.softsmithy.lib</groupId>
    <artifactId>softsmithy-lib-core</artifactId>
    <version>0.3</version>
</dependency>
Tutorial:
http://softsmithy.sourceforge.net/lib/current/docs/tutorial/nio-file/index.html
API: CopyFileVisitor.copy
Especially PathUtils.resolve helps with resolving paths across filesystems.
You can use the SecureBlackbox library; it allows ZIP operations on any seekable stream.
I think you should consider using your OS to create an "in memory" file system (i.e. a RAM drive).
In addition, take a look at the FileSystems API.
A completely different approach: If the server has the file on disk (and possibly cached in RAM already): make it give you the file(s) directly. In other words, submit which files you need and then take care to extract and deliver these on the server.
The Blackbox library only has an Extract(String name, String outputPath) method. It does seem to be able to randomly access any file in a seekable zip stream, but it can't write the result to a byte array or return a stream.
I couldn't find any documentation for ShrinkWrap, and I couldn't find any suitable implementations of FileSystem/FileSystemProvider, etc.
However, it turned out that the Amazon EC2 instance I'm running (Large) somehow writes a 1 GB file to disk in ~1 second. So I just write the file to disk and use ZipFile.
If the HDD were slow, I think a RAM disk would be the easiest solution.

RandomAccessFile from ZipEntry (java)

I was looking for something about reading zip archives via RandomAccessFile. So I found this example: http://www.java2s.com/Code/JavaAPI/java.io/RandomAccessFilereadLine.htm
However, it doesn't work for me; it says there's no such file or directory, but the file path is right. Is this example incorrect?
UPDATE: from docs.oracle.com:
RandomAccessFile(String name, String mode)
Creates a random access file stream to read from, and optionally to write to, a file with the specified name.
It's weird that they try to create a RandomAccessFile with the entry name as the "name" parameter in this example.
There's one more example with the same thing: http://www.java-tips.org/java-se-tips/java.util.zip/how-to-read-files-within-a-zip-file-3.html
I think this is a case where un-vetted code winds up on the internets and causes no end of problems.
There is no way the code in those two examples is going to do anything useful. The only way that code would do anything is if the contents of the zip file had already been extracted into the folder that contains the zip.
Long and short: you can't use RAF with ZipEntry because the ZipEntry refers to a compressed stream. You can't do random access on a stream (unless you decompress the whole thing and buffer the results).
It's really interesting to me how:
a) the code in the java-tips article doesn't follow proper naming conventions for Java
b) the code in both articles is astoundingly similar
Here's sample code that shows how to properly use ZipInputStream
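A minimal sketch of that idea, reading entries sequentially since the stream can't be accessed randomly (the archive name is a placeholder):
try (ZipInputStream zis = new ZipInputStream(new FileInputStream("archive.zip"))) {
    ZipEntry entry;
    while ((entry = zis.getNextEntry()) != null) {
        System.out.println(entry.getName());
        // read this entry's bytes from zis here if it is the one you want
        zis.closeEntry();
    }
}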
With the NIO.2 File API (Java 7), working with zip files becomes much easier.
Try (untested):
try (FileSystem zipFS = FileSystems.newFileSystem(URI.create("jar:" + zipURI), Map.of())) {
    // Map.of() needs Java 9+; on Java 7/8 use Collections.<String, Object>emptyMap() instead
    Path targetInZipPath = zipFS.getPath(targetInZipPathString);
    // do something here
}
Read more about the ZIP filesystem (JDK module jdk.zipfs) here: https://docs.oracle.com/en/java/javase/17/docs/api/jdk.zipfs/module-summary.html

Fastest way to access given lines of text file with and without using GZip and the Jar File (GZip in memory?)

I have a given number (5-7) of large UTF-8 text files (7 MB each). Decoded in memory their size is about 15 MB each.
I need to load given parts of a given file. The files are known and do not change. I would like to access and load lines at a given place as fast as possible. I load these lines, add HTML tags, and display them in a JEditorPane. I know the bottleneck will be the JEditorPane rendering the generated HTML, but for now I would like to concentrate on file access performance.
Moreover the user can search for a given word in all the files.
For now the code I use is:
private static void loadFile(String filename, int startLine, int stopLine) {
    try {
        FileInputStream fis = new FileInputStream(filename);
        InputStreamReader isr = new InputStreamReader(fis, "UTF-8");
        BufferedReader reader = new BufferedReader(isr);
        // skip the lines before startLine
        for (int j = 0; j < startLine; j++) {
            reader.readLine();
        }
        for (int j = startLine; j <= stopLine; j++) {
            //here I add HTML tags
            //or do string comparison in case of search by the user
            sb.append(reader.readLine()); // sb is a StringBuilder field of the class
        }
        reader.close();
    } catch (FileNotFoundException e) {
        System.out.println(e);
    } catch (IOException e) {
        System.out.println(e);
    }
}
Now my questions:
As the number of parts of each file is known, 67 in my case (for each file), I could create 67 smaller files. It would be "faster" to load a given part, but slower when I do a search, as I must open each of the 67 files.
I have not done any benchmarking, but my feeling is that opening 67 files for a search takes much longer than performing empty reader.readLine() calls when loading a part of a single file.
So in my case it is better to have a single larger file. Do you agree with that?
If I put each large file in the resources, I mean in the jar file, will the performance be worse? If yes, is it significantly worse?
And the related question: what if I zip each file to save space? As far as I understand, a jar file is simply a zip file.
I think I don't know how unzipping works. If I zip a file, will it be decompressed in memory, or will my program be able to access the given lines I need directly on disk?
Same for the jar file: will it be decompressed in memory?
If unzipping is not done in memory, can someone edit my code to use a zip file?
Final question, and the most important for me: I could improve performance if everything were done in memory, but due to Unicode and the fairly large files this could easily result in a heap of more than 100 MB. Is there a possibility of having the zip file loaded in memory and working on that? This would be fast and use only a little memory.
Summary of the questions
In my case, is one large file better than plenty of small ones?
If files are zipped, is the unzip process (GZIPInputStream) performed in memory? Is the whole file unzipped in memory and then accessed, or is it possible to access it directly on disk?
If yes to question 2, can someone edit my code to be able to do it?
MOST IMPORTANT: is it possible to have the zip file loaded in memory, and how?
I hope my questions are clear enough. ;-)
UPDATE: Thanks to Mike for the getResourceAsStream hint, I got it working.
Note that benchmarking shows that loading the gzip file is efficient, but in my case it is still too slow:
~200 ms for the gzip file
~125 ms for the standard file, so about 1.6 times faster.
Assuming that the resource folder is called resources:
private static void loadFile(String filename, int startLine, int stopLine) {
    try {
        // this.class does not compile in a static method; use the enclosing class literal
        // (MyClass stands for whatever class contains this method)
        GZIPInputStream zip = new GZIPInputStream(
                MyClass.class.getResourceAsStream("resources/" + filename));
        InputStreamReader isr = new InputStreamReader(zip, "UTF-8");
        BufferedReader reader = new BufferedReader(isr);
        // skip the lines before startLine
        for (int j = 0; j < startLine; j++) {
            reader.readLine();
        }
        for (int j = startLine; j <= stopLine; j++) {
            //here I add HTML tags
            //or do string comparison in case of search by the user
            sb.append(reader.readLine());
        }
        reader.close();
    } catch (FileNotFoundException e) {
        System.out.println(e);
    } catch (IOException e) {
        System.out.println(e);
    }
}
If the files really aren't changing very often, I would suggest using some other data structures. Creating a hash table of all the words and the locations where they appear would make searching much faster, and creating an index of all the line start positions would make loading a given range of lines much faster.
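A rough sketch of the line-start index idea (not from the original answer; sb and the file name follow the question's code):
// Build the index once: the byte offset at which each line starts.
List<Long> lineOffsets = new ArrayList<>();
try (RandomAccessFile raf = new RandomAccessFile(filename, "r")) {
    lineOffsets.add(0L);
    while (raf.readLine() != null) {
        lineOffsets.add(raf.getFilePointer()); // offset where the next line starts
    }
}

// Later: jump straight to startLine instead of reading from the top of the file.
try (FileInputStream fis = new FileInputStream(filename)) {
    fis.getChannel().position(lineOffsets.get(startLine));
    BufferedReader reader = new BufferedReader(new InputStreamReader(fis, "UTF-8"));
    for (int j = startLine; j <= stopLine; j++) {
        sb.append(reader.readLine());
    }
}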
But, to answer your questions more directly:
Yes, one large file is probably still better than many small files; I doubt that reading a line and decoding from UTF-8 will be noticeable compared to opening many files, or decompressing many files.
Yes, the unzipping process is performed in memory, and on the fly. It happens as you request data, but it acts as a buffered stream: it decompresses entire blocks at a time, so it is actually very efficient.
I can't fix your code directly, but I can suggest looking up getResourceAsStream:
http://docs.oracle.com/javase/6/docs/api/java/lang/Class.html#getResourceAsStream%28java.lang.String%29
This function will open a file that is in a zip / jar file and give you access to it as a stream, automatically decompressing it in memory as you use it.
If you treat it as a resource, Java will do it all for you; you will have to read up on some of the specifics of handling resources, but Java should handle it fairly intelligently.
I think it would be quicker for you to load the file(s) into memory. You can then zip around to whatever part of the file you need.
Take a look at RandomAccessFile for this.
The GZipInputStream reads the files into memory as a buffered stream.
That's another question entirely :)
Again, the zip file will be decompressed in memory, depending on which class you use to open it.

Preserving file checksum after extract from zip in java

This is what I'm trying to accomplish:
1) Calculate the checksum of all files to be added to a zip file. Currently using Apache Commons IO as follows:
final Checksum oChecksum = new Adler32();
...
//for every file iFile in folder
long lSum = (FileUtils.checksum(iFile, oChecksum)).getValue();
//store this checksum in a log
2) Compress the folder processed as a zip using the Ant zip task.
3) Extract files from the zip one by one to the specified folder (using both commons io and compression for this), and calculate the checksum of the extracted file:
final Checksum oChecksum = new Adler32();
...
ZipFile myZip = new ZipFile("test.zip");
ZipArchiveEntry zipEntry = myZip.getEntry("checksum.log"); //reads the filename from the log
BufferedInputStream myInputStream = new BufferedInputStream(myZip.getInputStream(zipEntry));
File destFile = new File("/mydir", zipEntry.getName());
destFile.createNewFile();
FileUtils.copyInputStreamToFile(myInputStream, destFile);
long newChecksum = FileUtils.checksum(destFile, oChecksum).getValue();
The problem I have is that the value of newChecksum doesn't match the one from the original file. The files' sizes match on disk. The funny thing is that if I run the cksum or md5sum commands on both files directly in a terminal, they are the same for both files. The mismatch occurs only in Java.
Is this the correct way to approach it or is there any way to preserve the checksum value after extraction?
I also tried using a CheckedInputStream, but this also gives me different values in Java.
EDIT: This seems related to the Adler32 object used (pre-zip vs unzip checks). If I do "new Adler32()" in the unzip check for every file instead of reusing the same Adler32 for all, I get the correct result.
Are you trying to checksum all the files concatenated? If yes, you need to make sure you're reading them in the same order you "checksummed" them.
If no, you need to call checksum.reset() between computing the checksum for each file. You'll notice (if you look at the source) that Adler32 is stateful, which means you're computing the checksum of the file plus all the preceding ones during part one.
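A minimal sketch of the reset-per-file approach (variable names follow the question's snippet; folder is assumed to be the File for the directory):
final Checksum oChecksum = new Adler32();
for (final File iFile : folder.listFiles()) {
    oChecksum.reset(); // start from a clean state for every file
    long lSum = FileUtils.checksum(iFile, oChecksum).getValue();
    // store lSum in the log
}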

decompress .gz file in batch

I have 100 .gz files which I need to decompress.
I have a couple of questions:
a) I am using the code given at http://www.roseindia.net/java/beginners/JavaUncompress.shtml to decompress the .gz files. It's working fine.
Question: is there a way to get the file name of the compressed file? I know that the zip classes in Java give an enumeration of entries to work on, which provides the filename, size, etc. stored in the .zip file. Do we have the same for .gz files, or is the file name simply filename.gz with the .gz removed?
b) Is there a more elegant way to decompress a .gz file by calling a utility from the Java code, like calling the 7-Zip application from your Java class? Then I wouldn't have to worry about input/output streams.
Thanks in advance.
Kapil
a) Zip is an archive format, while gzip is not. So an entry iterator does not make much sense unless (for example) your gz-files are compressed tar files. What you want is probably:
File outFile = new File(infile.getParent(), infile.getName().replaceAll("\\.gz$", ""));
b) Do you only want to uncompress the files? If not, you may be OK with using GZIPInputStream and reading the files directly, i.e. without decompressing them to disk first.
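For example, a minimal sketch of reading a .gz text file line by line without writing an uncompressed copy anywhere (the file name and charset are assumptions):
try (BufferedReader reader = new BufferedReader(new InputStreamReader(
        new GZIPInputStream(new FileInputStream("data.txt.gz")), "UTF-8"))) {
    String line;
    while ((line = reader.readLine()) != null) {
        // process the line
    }
}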
But ok. Let's say you really only want to uncompress the files. If so, you could probably use this:
public static File unGzip(File infile, boolean deleteGzipfileOnSuccess) throws IOException {
    GZIPInputStream gin = new GZIPInputStream(new FileInputStream(infile));
    FileOutputStream fos = null;
    try {
        File outFile = new File(infile.getParent(), infile.getName().replaceAll("\\.gz$", ""));
        fos = new FileOutputStream(outFile);
        byte[] buf = new byte[100000];
        int len;
        while ((len = gin.read(buf)) > 0) {
            fos.write(buf, 0, len);
        }
        fos.close();
        if (deleteGzipfileOnSuccess) {
            infile.delete();
        }
        return outFile;
    } finally {
        if (gin != null) {
            gin.close();
        }
        if (fos != null) {
            fos.close();
        }
    }
}
Regarding A, the gunzip command creates an uncompressed file with the original name minus the .gz suffix. See the man page.
Regarding B: do you need gunzip specifically, or will another compression algorithm do? There's a Java port of the LZMA compression algorithm used by 7-Zip to create .7z files, but it will not handle .gz files.
If you have a fixed number of files to decompress once, why don't you use existing tools for that?
As Paul Morie noticed, gunzip can do that:
for i in *.gz; do gunzip $i; done
And it would automatically name them, stripping .gz$
On Windows, try WinRAR, or gunzip from http://unxutils.sf.net
GZip is normally used only on single files, so it generally does not contain information about individual files. To bundle multiple files into one compressed archive, they are first combined into an uncompressed Tar file (with info about individual contents), and then compressed as a single file. This combination is called a Tarball.
There are libraries to extract the individual file info from a Tar, just as with ZipEntries. One example. You will first have to extract the .gz file into a temporary file in order to use it, or at least feed the GZIPInputStream into the Tar library.
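A minimal sketch of the stream-feeding approach, assuming Apache Commons Compress is on the classpath (the archive name is a placeholder):
try (TarArchiveInputStream tar = new TarArchiveInputStream(
        new GZIPInputStream(new FileInputStream("bundle.tar.gz")))) {
    TarArchiveEntry entry;
    while ((entry = tar.getNextTarEntry()) != null) {
        System.out.println(entry.getName() + " (" + entry.getSize() + " bytes)");
        // read this entry's bytes from tar here if needed
    }
}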
You may also call 7-Zip from the command line using Java. 7-Zip command-line syntax is here: 7-Zip Command Line Syntax. Example of calling the command shell from Java: Executing shell commands in Java. You will have to call 7-Zip twice: once to extract the Tar from the .tar.gz or .tgz file, and again to extract the individual files from the Tar.
Or, you could just do the easy thing and write a brief shell script or batch file to do your decompression. There's no reason to hammer a square peg in a round hole -- this is what batch files are made for. As a bonus, you can also feed them parameters, reducing the complexity of a java command line execution considerably, while still letting java control execution.
Have you tried
gunzip *.gz
.gz files (gzipped) can store the filename of the compressed file. So, for example, FuBar.doc can be saved inside myDocument.gz, and with an appropriate decompression tool the file can be restored to the filename FuBar.doc. Unfortunately, java.util.zip.GZIPInputStream does not support any way of reading the filename, even if it is stored inside the archive.
