I have a memory issue when zipping large files/folders (the resulting zip is >15 GB) and uploading them to S3 storage. I can create the zip file on disk, append files/folders to it, and upload that file in parts to S3, but in my experience that is not a good way to solve this. Do you know any good patterns for zipping large files/folders and uploading them to S3 without memory issues (such as OOM)? It would be great if I could append these files/folders directly to an already uploaded zip on S3.
Current approach: zip the files/folders to disk, then upload that zip file in parts to S3.
👋
The main reason you are getting an OOM is simply how the deflate algorithm of zlib works.
Imagine this setup:
It starts to read the whole file by opening a readable stream.
It creates a temporary 0 byte output file from the start.
It then reads the data in chunks of a fixed dictionary size, sends each chunk to the CPU for further processing and compression, and the results are propagated back to RAM.
When it finishes with one fixed-size dictionary it moves on to the next one, and so on until it reaches the END OF FILE terminator.
After that, it grabs all the deflated (compressed) bytes from RAM and writes them to the actual file.
You can observe and deduce that behavior by initiating a deflate operation; an example is below.
(The output file is created and 372 MB is processed, but nothing is written to the file until the last byte has been processed.)
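A minimal sketch of such a deflate run in Java, using GZIPOutputStream over placeholder file names; run it against a large file and watch the output file's size on disk to see when compressed bytes actually land there:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.zip.GZIPOutputStream;

public class DeflateProbe {
    public static void main(String[] args) throws Exception {
        // Placeholder paths: compress a large local file chunk by chunk.
        try (BufferedInputStream in = new BufferedInputStream(new FileInputStream("big-input.bin"));
             GZIPOutputStream out = new GZIPOutputStream(new FileOutputStream("big-input.bin.gz"))) {
            byte[] buffer = new byte[64 * 1024];
            long processed = 0;
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
                processed += n;
                // Compare this counter with the size of big-input.bin.gz on disk while it runs.
                System.out.println("Processed " + processed + " bytes");
            }
        }
    }
}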
You could technically grab all of the parts, archive them AGAIN into a tar.gz and then upload that to AWS as one file, but you may run into the same memory problem, just now on the uploading part.
Here are the file size limitations:
https://docs.aws.amazon.com/AmazonS3/latest/userguide/upload-objects.html
If you use the CLI you can technically do that; if you need or have to use the REST API, that's not an option for you, as the limit there is only 5 GB per single request.
Also, you have not specified the maximum size, so if it's even larger than 160 GB that's not an option EVEN using the AWS CLI (which takes care of releasing the memory after each uploaded chunk). So your best bet would be a multipart upload.
https://docs.aws.amazon.com/cli/latest/reference/s3api/create-multipart-upload.html
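As an illustration only, here is a minimal sketch using the AWS SDK for Java v1 TransferManager, which performs a multipart upload behind the scenes and frees each part's buffer as it goes; the bucket name, key and file path are placeholders:

import java.io.File;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.TransferManagerBuilder;
import com.amazonaws.services.s3.transfer.Upload;

public class MultipartZipUpload {
    public static void main(String[] args) throws InterruptedException {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        TransferManager tm = TransferManagerBuilder.standard().withS3Client(s3).build();
        try {
            // Placeholder bucket/key/path: the 15+ GB zip already written to disk.
            Upload upload = tm.upload("my-bucket", "backups/huge.zip", new File("/data/huge.zip"));
            upload.waitForCompletion(); // blocks until every part is uploaded and the upload is completed
        } finally {
            tm.shutdownNow(false); // shut down the transfer threads, keep the S3 client usable
        }
    }
}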
All the best!
You can use AWS Lambda to zip your files for you before uploading them to an S3 bucket. You can even configure Lambda to be triggered and zip your files on upload. Here is a Java example of a Lambda function for zipping large files.
Lambda's ephemeral storage is limited to 10 GB, but you can attach EFS storage to handle larger files. The cost should be close to none if you delete the files after use.
Also, remember to use multipart upload when uploading files larger than 100 MB to S3. If you are using the SDK, it should handle this for you.
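Going back to the Lambda idea, a rough sketch of what such a function might look like (this is not the linked example; the class name, scratch path and trigger wiring are assumptions): it streams the uploaded object through a ZipOutputStream into /tmp or an EFS mount and writes the zip back to the bucket.

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.S3Event;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

// Hypothetical handler: triggered by an S3 upload, zips the object into scratch storage, uploads the zip.
public class ZipOnUploadHandler implements RequestHandler<S3Event, String> {
    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

    @Override
    public String handleRequest(S3Event event, Context context) {
        String bucket = event.getRecords().get(0).getS3().getBucket().getName();
        String key = event.getRecords().get(0).getS3().getObject().getKey();
        // /tmp is limited to 10 GB of ephemeral storage; point this at an EFS mount for larger files.
        File scratch = new File("/tmp/" + key.replace('/', '_') + ".zip");
        try (InputStream in = s3.getObject(bucket, key).getObjectContent();
             ZipOutputStream zip = new ZipOutputStream(new BufferedOutputStream(new FileOutputStream(scratch)))) {
            zip.putNextEntry(new ZipEntry(key));
            byte[] buffer = new byte[64 * 1024];
            int n;
            while ((n = in.read(buffer)) != -1) {
                zip.write(buffer, 0, n); // stream chunk by chunk, never the whole object in memory
            }
            zip.closeEntry();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        // For very large zips prefer TransferManager here, which does a multipart upload.
        s3.putObject(bucket, key + ".zip", scratch);
        scratch.delete();
        return "zipped " + key;
    }
}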
Zipping the file in one go is not exactly the right way to go about it. I think the better way is to break down the problem so that you don't load the whole data set at once, but read it byte by byte and send it to your destination byte by byte. This way you not only gain speed (~10x) but also address those OOMs.
Your destination could be a web endpoint on an EC2 instance or an API Gateway fronted web service, depending on your architectural choice.
So essentially part 1 of the solution is to STREAM - zip it byte by byte and send it to an HTTP endpoint, as in the snippet below. Part 2 might be to use the multipart upload interfaces from the AWS SDK (in your destination) and push it in parallel to S3.
import java.io.BufferedOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.commons.compress.compressors.gzip.GzipCompressorOutputStream;

Path source = Paths.get("abc.huge");
Path target = Paths.get("abc.huge.gz");
final int bufferSize = 64 * 1024;
try (InputStream in = Files.newInputStream(source);
     OutputStream fout = Files.newOutputStream(target);
     GzipCompressorOutputStream gzOut =
             new GzipCompressorOutputStream(new BufferedOutputStream(fout))) {
    // Read and write chunk by chunk so the whole file is never held in memory
    final byte[] buffer = new byte[bufferSize];
    int n;
    while (-1 != (n = in.read(buffer))) {
        gzOut.write(buffer, 0, n);
    }
}
Related
When downloading large files and receiving the file in chunks, is it possible to know the file size, or is that information not in any type of header that you get back during the request/response?
Basically I would like to know if the file size is above a threshold before downloading.
It will depend on the file or stream that you are downloading - at one extreme, if the chunks are part of a continuous stream, like a live TV channel or a data analytics stream, then there is no real end and no size.
A more common example might be a video, where many formats have the size information in their metadata or header information.
For example, an MP4 file will have the file size in the header information, which should be the first part downloaded if the file has been prepared for streaming properly - MP4 files usually have the header at the end, but when being prepared for streaming it is common best practice to move the header to the start.
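For a plain HTTP download of a regular file, the server often also advertises the size up front in the Content-Length response header, so you can check it with a HEAD request before committing to the download. A minimal sketch (the URL and threshold are placeholders), keeping in mind that chunked or live responses may omit the header entirely:

import java.net.HttpURLConnection;
import java.net.URL;

public class SizeCheck {
    public static void main(String[] args) throws Exception {
        long threshold = 1_000_000_000L; // 1 GB, placeholder threshold
        HttpURLConnection conn = (HttpURLConnection) new URL("https://example.com/big.mp4").openConnection();
        conn.setRequestMethod("HEAD");             // headers only, no body
        long length = conn.getContentLengthLong(); // -1 if no Content-Length was sent
        conn.disconnect();
        if (length < 0) {
            System.out.println("Size unknown (no Content-Length, possibly a chunked or live stream)");
        } else if (length > threshold) {
            System.out.println("Too large to download: " + length + " bytes");
        } else {
            System.out.println("OK to download: " + length + " bytes");
        }
    }
}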
I am using a ZipOutputStream to zip up a bunch of files that are a mix of already zipped formats as well as lots of large highly compressible formats like plain text.
Most of the already zipped formats are large files, and it makes no sense to spend CPU and memory recompressing them since they never get smaller and, on the rare occasion, get slightly larger.
I am trying to use .setMethod(ZipEntry.STORED) when I detect a pre-compressed file but it complains that I need to supply the size, compressedSize and crc for those files.
I can get it to work with the following approach, but it requires that I read the file twice: once to calculate the CRC32, then again to actually copy the file into the ZipOutputStream.
// code that determines the value of method omitted for brevity
if (STORED == method)
{
    fze.setMethod(STORED);
    fze.setCompressedSize(fe.attributes.size());
    final HashingInputStream his = new HashingInputStream(Hashing.crc32(), fis);
    ByteStreams.copy(his, ByteStreams.nullOutputStream());
    fze.setCrc(his.hash().padToLong());
}
else
{
    fze.setMethod(DEFLATED);
}
zos.putNextEntry(fze);
ByteStreams.copy(new FileInputStream(fe.path.toFile()), zos);
zos.closeEntry();
Is there a way provide this information without having to read the input stream twice?
Short Answer:
I could not determine a way to read the files only once and calculate the CRC with the standard library given the time I had to solve this problem.
I did find an optimization that decreased the time by about 50% on average.
I pre-calculate the CRCs of the files to be stored concurrently, with an ExecutorCompletionService limited to Runtime.getRuntime().availableProcessors(), and wait until they are done. The effectiveness of this varies based on the number of files that need the CRC calculated: the more files, the more benefit.
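A minimal sketch of that pre-calculation step, assuming plain java.util.zip.CRC32 and a made-up helper class (the names here are illustrative, not the original code):

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.zip.CRC32;

// Hypothetical helper: computes CRC32 values for the files that will be STORED,
// using one worker per available processor, and returns a path -> crc map.
public class CrcPrecalculator {
    public static Map<Path, Long> crcAll(List<Path> storedFiles) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        CompletionService<Map.Entry<Path, Long>> completion = new ExecutorCompletionService<>(pool);
        for (Path p : storedFiles) {
            completion.submit(() -> Map.entry(p, crcOf(p)));
        }
        Map<Path, Long> result = new HashMap<>();
        try {
            for (int i = 0; i < storedFiles.size(); i++) {
                Map.Entry<Path, Long> done = completion.take().get(); // block until each CRC finishes
                result.put(done.getKey(), done.getValue());
            }
        } finally {
            pool.shutdown();
        }
        return result;
    }

    private static long crcOf(Path path) throws IOException {
        CRC32 crc = new CRC32();
        byte[] buffer = new byte[64 * 1024];
        try (InputStream in = Files.newInputStream(path)) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                crc.update(buffer, 0, n);
            }
        }
        return crc.getValue();
    }
}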
Then, in .postVisitDirectories(), I wrap a ZipOutputStream around the PipedOutputStream half of a PipedInputStream/PipedOutputStream pair, with the writing side running on a temporary Thread. That converts the ZipOutputStream into an InputStream I can pass into the HttpRequest, so the output of the ZipOutputStream is uploaded to a remote server while all the precalculated ZipEntry/Path objects are written serially.
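A rough sketch of that piped arrangement (simplified and with made-up names; error handling, STORED entries and the precalculated CRCs are omitted):

import java.io.IOException;
import java.io.InputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical illustration: the zip is written on one thread while the other end
// is consumed as an InputStream (for example as an HTTP request body).
public class ZipToInputStream {
    public static InputStream zipAsStream(Iterable<Path> files) throws IOException {
        PipedInputStream readSide = new PipedInputStream(64 * 1024);
        PipedOutputStream writeSide = new PipedOutputStream(readSide);
        Thread writer = new Thread(() -> {
            try (ZipOutputStream zos = new ZipOutputStream(writeSide)) {
                for (Path p : files) {
                    zos.putNextEntry(new ZipEntry(p.getFileName().toString()));
                    Files.copy(p, zos); // stream the file contents into the entry
                    zos.closeEntry();
                }
            } catch (IOException e) {
                e.printStackTrace(); // real code should surface this to the reading side
            }
        }, "zip-writer");
        writer.start();
        return readSide; // hand this to the HTTP upload as the request body
    }
}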
This is good enough for now, to process the 300+GB of immediate needs, but when I get to the 10TB job I will look at addressing it and trying to find some more advantages without adding too much complexity.
If I come up with something substantially better time wise I will update this answer with the new implementation.
Long answer:
I ended up writing a clean-room ZipOutputStream that supports multipart zip files and intelligent compression levels vs. STORE, and that was able to calculate the CRC as I read and then write out the metadata at the end of the stream.
Why ZipOutputStream.setLevel() swapping will not work:
The ZipOutputStream.setLevel(NO_COMPRESSION/DEFAULT_COMPRESSION) hack is not a viable approach. I did extensive tests on hundreds of gigs of data and thousands of folders and files, and the measurements were conclusive. It gains nothing over calculating the CRC for the STORED files versus compressing them at NO_COMPRESSION. It is actually slower by a large margin!
In my tests the files are on a network mounted drive, so reading the already compressed files twice over the network (once to calculate the CRC, then again to add them to the ZipOutputStream) was as fast or faster than just processing all the files once as DEFLATED and changing the .setLevel() on the ZipOutputStream. There is no local filesystem caching going on with the network access. This is a worst case scenario; processing files on the local disk will be much, much faster because of local filesystem caching.
So this hack is a naive approach and is based on false assumptions. It still pushes the data through the compression algorithm even at the NO_COMPRESSION level, and that overhead is higher than reading the files twice.
My app creates a large amount of output, but only over a long time. Each time there is new output to add it is just a string (a few hundred bytes worth).
It would simplify my code considerably if I could add incrementally (i.e. append) to a pre-existing GZIP (or Zip) file. Is this even possible (in Java, specifically)?
I am looking for a solution that will create a file that can be opened by 3rd party apps.
I realize I can decompress the file, add the additional text and compress it again as a new blob.
Thanks
PVS
Yes. See this example in C in the examples directory of the zlib distribution: gzlog.h and gzlog.c. It does exactly that, allowing you to append short pieces of data to a gzip file. It does so efficiently, by not compressing the additions until a threshold is reached, and then compressing what hasn't been compressed so far. After each addition, the gzip file contains the addition and is a valid gzip file. The code also protects against system crashes in the middle of an append operation, recovering the file on the next append operation.
Though allowed by the format, this code does not simply concatenate short gzip streams. That would result in very poor compression.
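For contrast, the naive concatenation the last sentence warns about is trivial in Java - each call below appends a complete new gzip member, which keeps the file a valid .gz that tools like gzip and zcat can read, but compresses every small addition separately and therefore poorly (this is a sketch of the approach gzlog deliberately avoids, not of gzlog itself):

import java.io.FileOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Naive illustration only: appends one self-contained gzip member per call.
public class NaiveGzipAppend {
    public static void append(String gzFile, String text) throws Exception {
        try (GZIPOutputStream out = new GZIPOutputStream(new FileOutputStream(gzFile, true))) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } // closing writes the member trailer, so the file stays valid after every append
    }
}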
I'm trying to work out how I can copy a growing file using Java. An example of what I would like to work is the following:
A file is downloaded from an HTTP server.
I initiate a file copy before the file has finished downloading
The copying begins, and doesn't end until the file is completely downloaded and everything has been copied
I have used the following code:
InputStream is = new FileInputStream(sourceFile);
OutputStream os = new FileOutputStream(targetFile);
byte[] buf = new byte[8192];
int num;
while ((num = is.read(buf)) != -1) {
    os.write(buf, 0, num);
}
But that only copies the content that has so far been downloaded, so I end up with a broken target file.
I have also tested using BufferedInputStream and BufferedOutputStream, but that didn't work either.
Is there any way to achieve what I want?
Thanks
If you are in control of the file download via HTTP, then you could download to a temporary file and then rename the file once the download has completed, thus making the operation atomic.
The alternative is for your file copy process to periodically check the file size of the target file and to only initiate the copy once the file size has stabilised and is no longer increasing. For example, you may elect to record the file size every second and only initiate the copy if the size remains constant for 3 successive poll attempts.
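A rough sketch of that polling approach (the paths, the one-second interval and the three-poll rule are just the example values from above):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch: copy only after the size has been stable for three successive polls.
public class CopyWhenStable {
    public static void main(String[] args) throws IOException, InterruptedException {
        Path source = Paths.get("download.part");    // placeholder paths
        Path target = Paths.get("copy-of-download");

        int stablePolls = 0;
        long lastSize = -1;
        while (stablePolls < 3) {
            long size = Files.size(source);
            stablePolls = (size == lastSize) ? stablePolls + 1 : 0;
            lastSize = size;
            Thread.sleep(1000);                      // poll once per second
        }
        Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
    }
}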
This is going to be tricky, since the copying process has no reliable way of knowing when the download has finished. You could look at whether the file is growing, but the download could stall for a period of time, and you could erroneously conclude that it has finished. If the download fails in the middle, you also have no way of knowing that you're looking at an incomplete file.
I think your best bet is the downloading process. If you control it, you could modify it to store the file in the other location, or both locations, or move/rename it at the end, depending on your requirements.
If you don't control the downloading process, and it's a simple HTTP download, you could replace it with something that you do control.
I am trying to send some very large files (>200MB) through an Http output stream from a Java client to a servlet running in Tomcat.
My protocol currently packages the file contents in a byte[], and that is placed in a Map<String, Object> along with some metadata (filename, etc.), each part under a "standard" key ("FILENAME" -> "Foo", "CONTENTS" -> byte[], "USERID" -> 1234, etc.). The Map is written to the URL connection output stream (urlConnection.getOutputStream()). This works well when the file contents are small (<25MB), but I am running into Tomcat memory issues (OutOfMemoryError) when the file size is very large.
I thought of sending the metadata Map first, followed by the file contents, and finally by a checksum on the file data. The receiver servlet can then read the metadata from its input stream, then read bytes until the entire file is finished, finally followed by reading the checksum.
Would it be better to send the metadata in connection headers? If so, how? If I send the metadata down the socket first, followed by the file contents, is there some kind of standard protocol for doing this?
You will almost certainly want to use a multipart POST to send the data to the server. Then on the server you can use something like commons-fileupload to process the upload.
The good thing about commons-fileupload is that it understands that the server may not have enough memory to buffer large files and will automatically stream the uploaded data to disk once it exceeds a certain size, which is quite helpful in avoiding OutOfMemoryError type problems.
Otherwise you are going to have to implement something comparable yourself. It doesn't really make much difference how you package and send your data, so long as the server can 1) parse the upload and 2) redirect data to a file so that it doesn't ever have to buffer the entire request in memory at once. As mentioned both of these come free if you use commons-fileupload, so that's definitely what I'd recommend.
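As a hedged sketch of the receiving side with commons-fileupload's DiskFileItemFactory (the servlet name, paths and threshold are placeholders): anything over the size threshold is spooled to disk rather than buffered in memory.

import java.io.File;
import java.util.List;

import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.apache.commons.fileupload.FileItem;
import org.apache.commons.fileupload.disk.DiskFileItemFactory;
import org.apache.commons.fileupload.servlet.ServletFileUpload;

// Illustrative servlet: parses a multipart POST; large parts go to a temp directory, not the heap.
public class UploadServlet extends HttpServlet {
    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp) {
        try {
            DiskFileItemFactory factory = new DiskFileItemFactory();
            factory.setSizeThreshold(1024 * 1024);   // keep only small items in memory
            factory.setRepository(new File("/tmp")); // spool anything larger here

            ServletFileUpload upload = new ServletFileUpload(factory);
            List<FileItem> items = upload.parseRequest(req);
            for (FileItem item : items) {
                if (item.isFormField()) {
                    // metadata fields such as FILENAME or USERID
                    String name = item.getFieldName();
                    String value = item.getString();
                } else {
                    // the file part: write it straight to its final location
                    item.write(new File("/data/uploads/" + item.getName()));
                }
            }
            resp.setStatus(HttpServletResponse.SC_OK);
        } catch (Exception e) {
            resp.setStatus(HttpServletResponse.SC_INTERNAL_SERVER_ERROR);
        }
    }
}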
I don't have a direct answer for you but you might consider using FTP instead. Apache Mina provides FTPLets, essentially servlets that respond to FTP events (see http://mina.apache.org/ftpserver/ftplet.html for details).
This would allow you to push your data in any format without requiring the receiving end to accommodate the entire data in memory.
Regards.