I am using a ZipOutputStream to zip up a bunch of files that are a mix of already zipped formats as well as lots of large highly compressible formats like plain text.
Most of the already zipped formats are large files, and it makes no sense to spend CPU and memory recompressing them: they never get smaller and occasionally end up slightly larger.
I am trying to use .setMethod(ZipEntry.STORED) when I detect a pre-compressed file but it complains that I need to supply the size, compressedSize and crc for those files.
I can get it to work with the following approach, but it requires reading each file twice: once to calculate the CRC32, then again to actually copy the file into the ZipOutputStream.
// code that determines the value of method omitted for brevity
if (STORED == method)
{
    fze.setMethod(STORED);
    fze.setCompressedSize(fe.attributes.size());
    // first pass over the file: calculate the CRC32 required for STORED entries
    final HashingInputStream his = new HashingInputStream(Hashing.crc32(), fis);
    ByteStreams.copy(his, ByteStreams.nullOutputStream());
    fze.setCrc(his.hash().padToLong());
}
else
{
    fze.setMethod(DEFLATED);
}
zos.putNextEntry(fze);
// second pass over the same file: copy its contents into the archive
ByteStreams.copy(new FileInputStream(fe.path.toFile()), zos);
zos.closeEntry();
Is there a way to provide this information without having to read the input stream twice?
Short Answer:
I could not determine a way to read the files only once and calculate the CRC with the standard library given the time I had to solve this problem.
I did find an optimization that decreased the time by about 50% on average.
I pre-calculate the CRCs of the files to be stored concurrently with an ExecutorCompletionService limited to Runtime.getRuntime().availableProcessors() threads and wait until they are all done. The effectiveness of this varies with the number of files that need a CRC calculated: the more files, the greater the benefit.
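For illustration, here is a minimal sketch of that pre-calculation step, assuming Java 9+ and Guava on the classpath (the same HashingInputStream/ByteStreams used above); the method name precalculateCrcs and the Path list are my own, not part of the original code:

import com.google.common.hash.Hashing;
import com.google.common.hash.HashingInputStream;
import com.google.common.io.ByteStreams;
import java.io.BufferedInputStream;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Calculate the CRC32 of every file that will be STORED, using as many threads
// as there are available processors, and block until all of them are finished.
static Map<Path, Long> precalculateCrcs(List<Path> storedFiles) throws Exception {
    int threads = Runtime.getRuntime().availableProcessors();
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    ExecutorCompletionService<Path> ecs = new ExecutorCompletionService<>(pool);
    Map<Path, Long> crcs = new ConcurrentHashMap<>();

    for (Path p : storedFiles) {
        ecs.submit(() -> {
            try (InputStream in = new BufferedInputStream(Files.newInputStream(p));
                 HashingInputStream his = new HashingInputStream(Hashing.crc32(), in)) {
                ByteStreams.copy(his, ByteStreams.nullOutputStream());
                crcs.put(p, his.hash().padToLong());
                return p;
            }
        });
    }
    // Wait for every CRC before the single-threaded zip-writing pass starts.
    for (int i = 0; i < storedFiles.size(); i++) {
        ecs.take().get();
    }
    pool.shutdown();
    return crcs;
}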
Then in .postVisitDirectory() I wrap the ZipOutputStream around a PipedOutputStream from a PipedInputStream/PipedOutputStream pair, with the zip writing running on a temporary Thread. This converts the ZipOutputStream into an InputStream I can pass to the HttpRequest, so the results of the ZipOutputStream are uploaded to the remote server while all the precalculated ZipEntry/Path objects are written out serially.
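A rough sketch of that piped-stream bridge; ZipWriter is a hypothetical callback interface standing in for the code that writes the precalculated entries, and the error handling is deliberately minimal:

import java.io.IOException;
import java.io.InputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.util.zip.ZipOutputStream;

interface ZipWriter {
    void writeTo(ZipOutputStream zos) throws IOException;
}

// Whatever the writer puts into the ZipOutputStream becomes readable from the
// returned InputStream, which can then be handed to the HTTP client as the body.
static InputStream zipAsInputStream(ZipWriter writer) throws IOException {
    PipedInputStream in = new PipedInputStream(64 * 1024);
    PipedOutputStream out = new PipedOutputStream(in);

    Thread t = new Thread(() -> {
        try (ZipOutputStream zos = new ZipOutputStream(out)) {
            writer.writeTo(zos);      // writes all precalculated entries serially
        } catch (IOException e) {
            e.printStackTrace();      // real code should surface this to the caller
        }
    }, "zip-writer");
    t.start();
    return in;
}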
This is good enough for now to process the 300+ GB of immediate needs, but when I get to the 10 TB job I will revisit it and try to find further gains without adding too much complexity.
If I come up with something substantially better time-wise, I will update this answer with the new implementation.
Long answer:
I ended up writing a clean-room ZipOutputStream that supports multipart zip files and intelligent selection of compression level vs STORED, and that calculates the CRC as it reads, writing out the metadata at the end of the stream.
Why ZipOutputStream.setLevel() swapping will not work:
The ZipOutputStream.setLevel(NO_COMPRESSION/DEFAULT_COMPRESSION) hack is not a viable approach. I did extensive tests on hundreds of gigabytes of data, thousands of folders and files, and the measurements were conclusive. It gains nothing over calculating the CRC separately for the STORED files; compressing them at NO_COMPRESSION is actually slower by a large margin!
In my tests the files are on a network-mounted drive, so reading the already compressed files twice over the network (once to calculate the CRC, then again to add them to the ZipOutputStream) was as fast or faster than processing all the files once as DEFLATED and changing .setLevel() on the ZipOutputStream.
There is no local filesystem caching going on with the network access.
This is a worst-case scenario; processing files on the local disk will be much, much faster because of local filesystem caching.
So this hack is a naive approach based on false assumptions. The data is still processed through the compression algorithm even at NO_COMPRESSION level, and that overhead is higher than reading the files twice.
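For clarity, this is the shape of the hack being argued against, not a recommendation: every entry stays DEFLATED and only the level is toggled, so the deflater still touches every byte (the method name and byte[] parameter are illustrative):

import java.io.IOException;
import java.util.zip.Deflater;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// The "hack": no CRC is needed up front, but the data still flows through the deflater.
static void addEntry(ZipOutputStream zos, String name, byte[] data, boolean alreadyCompressed)
        throws IOException {
    zos.setLevel(alreadyCompressed ? Deflater.NO_COMPRESSION : Deflater.DEFAULT_COMPRESSION);
    ZipEntry entry = new ZipEntry(name);
    entry.setMethod(ZipEntry.DEFLATED);
    zos.putNextEntry(entry);
    zos.write(data);
    zos.closeEntry();
}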
Related
I'm developing a basic download manager that can download a file over HTTP using multiple connections. At the end of the download, I have several temp files, each containing a part of the file.
I now want to merge them into a single file.
It's not hard to do: simply create an output stream and the input streams, and pipe the inputs into the output in the correct order.
But I was wondering: is there a way to do it more efficiently? I mean, from my understanding, what will happen here is that the JVM will read the inputs byte by byte and write the output byte by byte.
So basically I have :
- read byte from disk
- store byte in memory
- some CPU instructions will probably run and the byte will probably be copied into the CPU's cache
- write byte to the disk
I was wondering if there is a way to keep the whole process on the disk. I don't know if I'm making sense, but basically I want to tell the disk "hey disk, take these files of yours and make one with them".
In short, I want to reduce CPU and memory usage as much as possible.
In theory it may be possible to do this operation at the file system level: you could append the block list from one inode to another without moving the data. This is not very practical though; most likely you would have to bypass your operating system and access the disk directly.
The next best thing may be using FileChannel.transferTo or transferFrom methods:
This method is potentially much more efficient than a simple loop that reads from this channel and writes to the target channel. Many operating systems can transfer bytes directly from the filesystem cache to the target channel without actually copying them.
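As a rough sketch of that idea, merging the downloaded part files could look like the following (the method name and path parameters are illustrative):

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

// Append each part to the target file using channel-to-channel transfers,
// letting the OS move the data without copying it through a Java byte[] buffer.
static void mergeParts(List<Path> parts, Path target) throws IOException {
    try (FileChannel out = FileChannel.open(target,
            StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
        for (Path part : parts) {
            try (FileChannel in = FileChannel.open(part, StandardOpenOption.READ)) {
                long position = 0;
                long size = in.size();
                while (position < size) {
                    // transferTo may move fewer bytes than requested, so loop until done
                    position += in.transferTo(position, size - position, out);
                }
            }
        }
    }
}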
You should also test reading and writing large blocks of bytes using streams or RandomAccessFile - it may still be faster than using channels. Here's a good article about testing sequential IO performance in Java.
I want to convert all files inside the ROOT directory into a Java 8 stream of InputStream.
Suppose there are a lot of files inside ROOT.
For example, the file layout is:
ROOT
|--> file1
|--> file2
I want to do something like this, but using Java 8 and with the best PERFORMANCE!!
List<InputStream> ins = new ArrayList<>();
File[] arr = new File(ROOT).listFiles();
for (File file : arr) {
    ins.add(FileUtils.openInputStream(file));
}
Short answer: best solution is the one that you have dismissed:
Use File.listFiles() (or equivalent) to iterate over the files in each directory.
Use recursion for nested directories (a minimal sketch follows this list).
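A hedged sketch of that traversal; the Consumer action is a placeholder for whatever you do with each file (for example, adding it to an upload batch):

import java.io.File;
import java.util.function.Consumer;

// Walk the tree with File.listFiles() and recurse into subdirectories,
// handing each regular file to the supplied action.
static void visit(File dir, Consumer<File> action) {
    File[] children = dir.listFiles();
    if (children == null) {
        return;                    // not a directory, or an I/O error occurred
    }
    for (File child : children) {
        if (child.isDirectory()) {
            visit(child, action);  // recursion for nested directories
        } else {
            action.accept(child);
        }
    }
}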
Let's start with the performance issue. When you are uploading a large number of individual files to cloud storage, the performance bottleneck is likely to be the network and the remote server:
Unless you have an extraordinarily good network link, a single TCP stream won't transfer data anywhere near as fast as it can be read from disk (or written at the other end).
Each time you transfer a file, there is likely to be an overhead for starting the new file. The remote server has to create the new file, which entails adding a directory entry, an inode to hold the metadata, and so on.
Even on the sending side, the OS and disc overheads of reading directories and metadata are likely to dominate the Java overheads.
(But don't just trust what I say ... measure it!)
The chances are that the above overheads will be orders of magnitude greater than any speedup you can get by tweaking the Java-side file traversal.
But ignoring the above, I don't think that using the Java 8 Stream paradigm would help anyway. AFAIK, there are no special high performance "adapters" for applying streams to directory entries, so you would most likely end up with a Stream wrapper for the result of listFiles() calls. And that would not improve performance.
(You might get some benefit from parallel streams, but I don't think you will get enough control over the parallelism.)
Furthermore, you would need to deal with the fact that if your Java 8 Stream produces InputStream or similar handles, then you need to make sure that those handles are properly closed. You can't just close them all at the end, or rely on the GC to finalize them. If you do either of those, you risk running out of file descriptors.
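One way to keep the descriptor count bounded is to open and close each stream as you go rather than collecting a List<InputStream>. A hedged sketch using Files.walk (which itself must be closed); the per-file processing is a placeholder:

import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Open one stream at a time and close it immediately after use, so the number
// of open file descriptors never grows with the number of files under root.
static void processAll(Path root) throws IOException {
    try (Stream<Path> paths = Files.walk(root)) {   // the walk() stream must also be closed
        paths.filter(Files::isRegularFile).forEach(p -> {
            try (InputStream in = Files.newInputStream(p)) {
                // consume the stream here (upload, parse, copy, ...)
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
    }
}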
InputStream is = new ByteArrayInputStream(Files.readAllBytes(Paths.get(Root)));
I have a Java project with a huge set of XML files (>500). Reading these files at runtime leads to performance issues.
Is there an option to load all the XML files to RAM and read from there instead of the disk?
I know there are products like RamDisk but this one is a commercial tool.
Can I copy XML files to main memory and read from main memory using any existing Java API / libraries?
I would first try memory-mapped files, as provided by RandomAccessFile and FileChannel in the standard Java library. This way the OS will be able to keep the frequently used file content in memory, effectively achieving what you want.
You can use an in-memory database to store the intermediate files (the XML files). This gives you the speed of RAM and a database together.
For reference use the following links:
http://www.mcobject.com/in_memory_database
Usage of H2 as in memory database:
http://www.javatips.net/blog/2014/07/h2-in-memory-database-example
Use the java.io.RandomAccessFile class. It behaves like a large array of bytes stored in the file system. Instances of this class support both reading and writing to a random access file.
I would also suggest using a memory-mapped file (MappedByteBuffer) to read the file directly from the disk instead of loading it all into the heap:
RandomAccessFile file = new RandomAccessFile("wiki.txt", "r");
FileChannel channel = file.getChannel();
MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, 1024 * 50);
And then you can read the buffer as usual.
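For example, a simple read loop over the mapped buffer (the chunk size here is arbitrary):

byte[] chunk = new byte[4096];
while (buf.hasRemaining()) {
    int n = Math.min(chunk.length, buf.remaining());
    buf.get(chunk, 0, n);   // copies the next n mapped bytes into chunk
    // process chunk[0..n) here
}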
Have you considered creating an object structure for these files and serializing them? Java object serialization and deserialization is much faster than parsing XML. This assumes that these 500 or so XML files don't get modified between reads.
Here is an article which talks about serializing and deserializing.
If the concern is loading file content into memory, then consider the ByteArrayInputStream and ByteArrayOutputStream classes, or maybe even ByteBuffer; these can hold the bytes in memory.
Java object serialization/deserialization is not faster than XML writing and parsing in general. When large numbers of objects are involved Java serialization/deserialization can actually be very inefficient, because it tracks each individual object (so that repeated references aren't serialized more than once). This is great for networks of objects, but for simple tree structures it adds a lot of overhead with no gains.
Your best approach is probably to just use a fast technique for processing the XML (such as javax.xml.stream.XMLStreamReader). Unless the files are huge, that 30-40 seconds time to load the XML files is way out of line - you're probably using an inefficient approach to processing the XML, such as loading them into a DOM. You can also try reading multiple files in parallel (such as by using Java 8 parallel Streams).
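A minimal StAX sketch of that approach (the file path and element handling are placeholders):

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Pull-parse one file with javax.xml.stream instead of building a DOM; only the
// current event is held in memory, which is usually much cheaper for large batches.
static void scan(Path xmlFile) throws Exception {
    XMLInputFactory factory = XMLInputFactory.newFactory();
    try (InputStream in = Files.newInputStream(xmlFile)) {
        XMLStreamReader reader = factory.createXMLStreamReader(in);
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                String element = reader.getLocalName();
                // pull out only the elements/attributes you actually need
            }
        }
        reader.close();
    }
}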
It looks like your main issue is the large number of files, and that RAM is not a constraint. Can you confirm?
Is it possible that you do a preprocessing step where you append all these files using some kind of separator and create a big file? This way you can increase the block size of your reads and avoid the performance penalty of disk seeks.
Have you thought about compressing the XML files and reading in those compressed XML files? Compressed XML could be as little as 3-5% the size of the original or better. You can uncompress it when it is visible to users and then store it compressed again for further reading.
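If you go that route, plain GZIP from the standard library is enough. A small sketch, assuming Java 9+ for InputStream.transferTo and with illustrative method names and paths:

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Keep the XML gzipped on disk and decompress transparently while reading.
static void compress(Path xml, Path gz) throws IOException {
    try (InputStream in = Files.newInputStream(xml);
         OutputStream out = new GZIPOutputStream(Files.newOutputStream(gz))) {
        in.transferTo(out);
    }
}

static InputStream openCompressed(Path gz) throws IOException {
    return new GZIPInputStream(Files.newInputStream(gz));   // feed this to the XML parser
}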
Here is a library I found that might help:
zip4j
It all depends on whether you read the data more than once or not.
Assume we use some sort of Java-based RAM disk (it would actually be some sort of buffer or byte array).
Further assume that processing the data takes less time than reading it. You have to read the data at least once either way, so it would make no difference whether you read it from disk and process it directly, or read it from disk into memory first and then process it from memory.
If you would read a file more than once, you could read all the files into memory (various options, Buffer, Byte-Arrays, custom FileSystem, ...).
In case processing takes longer than reading (which seems not to be the case), you could pre-fetch the files from disk using a separate thread - and process the data from memory using another thread.
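A rough sketch of that pre-fetching idea using a bounded BlockingQueue; the processing step is a placeholder and the queue depth is arbitrary:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// One thread reads files ahead into byte arrays while the main thread processes them.
static void prefetchAndProcess(List<Path> files) throws Exception {
    BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(4);   // bounded read-ahead

    Thread reader = new Thread(() -> {
        try {
            for (Path p : files) {
                queue.put(Files.readAllBytes(p));
            }
        } catch (IOException | InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }, "prefetch");
    reader.start();

    for (int i = 0; i < files.size(); i++) {
        byte[] content = queue.take();
        // process content here while the reader is already fetching the next file
    }
    reader.join();
}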
I want to add, remove or modify files in a zip in the most efficient way possible.
Yes, you may say I should just unzip the files to the file system and re-zip them, but that doesn't work on Windows when a file has a special name like 'aux' or 'con', since those are DOS device names, and there may also be filename encoding issues that prevent the process from working properly. Another reason I don't just unzip to the file system and re-zip is that it is much slower and takes more disk space than just using RAM.
Image: http://i.stack.imgur.com/yPuYG.png
You could use a memory-based stream, like ByteArrayOutputStream, to read/write the contents of the file.
The issue is the amount of available memory; because RAM is limited, you're eventually going to need to store the output on something larger, like a disk.
In order to try and optimize the process, you could set a preferred time threshold for the read/write/process operation.
Basically you would run the process and measure how long it took, then adjust the buffer size for the next loop based on the preferred threshold.
I would allow for a number of loops and average the time, so you're not trying to exercise fine-grained control over the buffer, which might actually slow you down.
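To make the in-memory approach concrete, here is a hedged sketch that rewrites a zip held in a byte array, dropping or replacing entries via a hypothetical changes map (name to new content, null meaning remove); it assumes Java 9+ for InputStream.transferTo:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Rewrite a zip entirely in memory: copy existing entries, skip the ones being
// replaced or removed, then append the new/changed entries at the end.
static byte[] rewriteZip(byte[] originalZip, Map<String, byte[]> changes) throws IOException {
    ByteArrayOutputStream result = new ByteArrayOutputStream();
    try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(originalZip));
         ZipOutputStream out = new ZipOutputStream(result)) {
        ZipEntry entry;
        while ((entry = in.getNextEntry()) != null) {
            if (changes.containsKey(entry.getName())) {
                continue;                       // replaced (or removed) below
            }
            out.putNextEntry(new ZipEntry(entry.getName()));
            in.transferTo(out);                 // copies the entry's uncompressed bytes
            out.closeEntry();
        }
        for (Map.Entry<String, byte[]> change : changes.entrySet()) {
            if (change.getValue() == null) {
                continue;                       // null content means "remove this entry"
            }
            out.putNextEntry(new ZipEntry(change.getKey()));
            out.write(change.getValue());
            out.closeEntry();
        }
    }
    return result.toByteArray();
}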
how do I load a file into main memory?
I currently read a file using
BufferedReader buf = new BufferedReader(new FileReader(fileName));
I presume that this is reading the file line by line from the disk. What is the advantage of this?
What is the advantage of loading the file directly into memory?
How do we do that in Java?
I found some examples using Scanner or RandomAccessFile methods. Do they load the files into memory? Should I use them? Which of the two should I use?
Thanks in advance!!!
BufferedReader buf = new BufferedReader(new FileReader(fileName));
I presume that this is reading the file line by line from the disk. What is the advantage of this?
Not exactly. It is reading the file in chunks whose size is the default buffer size (8k bytes I think).
The advantage is that you don't need a huge heap to read a huge file. This is a significant issue since the maximum heap size can only be specified at JVM startup (with Hotspot Java).
You also don't consume the system's physical / virtual memory resources to represent the huge heap.
What is the advantage of loading the file directly into memory?
It reduces the number of system calls, and may read the file faster. How much faster depends on a number of factors. And you have the problem of dealing with really large files.
How do we do that in Java?
Find out how large the file is.
Allocate a byte (or character) array big enough.
Use the relevant read(byte[], int, int) or read(char[], int, int) method to read the entire file (a minimal sketch follows).
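Something like this, with readWholeFile and its path parameter as illustrative names; Files.readAllBytes(Path) does the equivalent in a single call:

import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

// 1. find the size, 2. allocate the array, 3. read until the array is full
static byte[] readWholeFile(String path) throws IOException {
    File file = new File(path);
    byte[] content = new byte[(int) file.length()];   // assumes the file fits in one array (< 2 GB)
    try (DataInputStream in = new DataInputStream(new FileInputStream(file))) {
        in.readFully(content);                         // loops internally over read(byte[], int, int)
    }
    return content;
}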
You can also use a memory-mapped file ... but that requires using the Buffer APIs which can be a bit tricky to use.
I found some examples on Scanner or RandomAccessFile methods. Do they load the files into memory?
No, and no.
Should I use them? Which of the two should I use ?
Do they provide the functionality that you require? Do you need to read / parse text-based data? Do you need to do random access on binary data?
Under normal circumstances, you should chose your I/O APIs based primarily on the functionality that you require, and secondarily on performance considerations. Using a BufferedInputStream or BufferedReader is usually enough to get acceptable* performance if you intend to parse it as you read it. (But if you actually need to hold the entire file in memory in its original form, then a BufferedXxx wrapper class actually makes reading a bit slower.)
* - Note that acceptable performance is not the same as optimal performance, but your client / project manager probably would not want you to waste time writing code that performs optimally ... if this is not a stated requirement.
If you're reading in the file and then parsing it, walking from beginning to end once to extract your data, then not referencing the file again, a buffered reader is about as "optimal" as you'll get. You can "tune" the performance somewhat by adjusting the buffer size: a larger buffer will read larger chunks from the file. (Make the buffer a power of 2, e.g. 262144.) Reading an entire large file (larger than, say, 1 MB) into memory will generally cost you performance in paging and heap management.
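For example, a buffered single-pass read with an explicit 256 KiB buffer (the file name is a placeholder):

// Same single-pass parse, but with a 262144-byte buffer instead of the 8 KiB default.
try (BufferedReader reader = new BufferedReader(new FileReader("data.txt"), 262144)) {
    String line;
    while ((line = reader.readLine()) != null) {
        // extract what you need from 'line', then let it go out of scope
    }
} catch (IOException e) {
    e.printStackTrace();
}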