Java create tar archive with entries of unknown size


I have a web app where I need to be able to serve the user an archive of multiple files. I've set up a generic ArchiveExporter, and made a ZipArchiveExporter. Works beautifully! I can stream my data to my server, and archive the data and stream it to the user all without using much memory, and without needing a filesystem (I'm on Google App Engine).
Then I remembered the Zip64 issue with zip files over 4 GB. My archives can potentially get very large (high-res images), so I'd like to have an option to avoid zip files for my larger input.
I checked out org.apache.commons.compress.archivers.tar.TarArchiveOutputStream and thought I had found what I needed! Sadly, after checking the docs and running into some errors, I quickly found out that you MUST pass the size of each entry before you stream it. This is a problem because the data is being streamed to me with no way of knowing the size beforehand.
I tried counting and returning the written bytes from export(), but TarArchiveOutputStream expects a size in TarArchiveEntry before writing to it, so that obviously doesn't work.
I can use a ByteArrayOutputStream and read each entry entirely before writing its content so I know its size, but my entries can potentially get very large, and this is not very polite to the other processes running on the instance.
I could use some form of persistence, upload the entry, and query the data size. However, that would be a waste of my Google storage API calls, bandwidth, storage, and runtime.
I am aware of this SO question asking almost the same thing, but the asker settled for using zip files and there is no more relevant information.
What is the ideal solution to creating a tar archive with entries of unknown size?
public abstract class ArchiveExporter<T extends OutputStream> extends Exporter { //base class
public abstract void export(OutputStream out) throws IOException; //from Exporter interface
//protected so that the subclasses below can override it without narrowing visibility
protected abstract void archiveItems(T t) throws IOException;
}
public class ZipArchiveExporter extends ArchiveExporter<ZipOutputStream> { //zip class, works as intended
@Override
public void export(OutputStream out) throws IOException {
try(ZipOutputStream zos = new ZipOutputStream(out, Charsets.UTF_8)) {
zos.setLevel(0);
archiveItems(zos);
}
}
@Override
protected void archiveItems(ZipOutputStream zos) throws IOException {
zos.putNextEntry(new ZipEntry(exporter.getFileName()));
exporter.export(zos);
//chained call to export from other exporter like json exporter for instance
zos.closeEntry();
}
}
public class TarArchiveExporter extends ArchiveExporter<TarArchiveOutputStream> {
@Override
public void export(OutputStream out) throws IOException {
try(TarArchiveOutputStream taos = new TarArchiveOutputStream(out, "UTF-8")) {
archiveItems(taos);
}
}
@Override
protected void archiveItems(TarArchiveOutputStream taos) throws IOException {
TarArchiveEntry entry = new TarArchiveEntry(exporter.getFileName());
//entry.setSize(?);
taos.putArchiveEntry(entry);
exporter.export(taos);
taos.closeArchiveEntry();
}
}
EDIT this is what I was thinking with the ByteArrayOutputStream. It works, but I cannot guarantee I will always have enough memory to store the whole entry at once, hence my streaming efforts. There has to be a more elegant way of streaming a tarball! Maybe this is a question more suited for Code Review?
protected void byteArrayOutputStreamApproach(TarArchiveOutputStream taos) throws IOException {
TarArchiveEntry entry = new TarArchiveEntry(exporter.getFileName());
try(ByteArrayOutputStream baos = new ByteArrayOutputStream()) {
exporter.export(baos);
byte[] data = baos.toByteArray();
//holding ENTIRE entry in memory. What if it's huge? What if it has more than Integer.MAX_VALUE bytes? :[
int len = data.length;
entry.setSize(len);
taos.putArchiveEntry(entry);
taos.write(data);
taos.closeArchiveEntry();
}
}
EDIT This is what I meant by uploading the entry to a medium (Google Cloud Storage in this case) to accurately query the whole size. Seems like major overkill for what seems like a simple problem, but this doesn't suffer from the same ram problems as the solution above. Just at the cost of bandwidth and time. I hope someone smarter than me comes by and makes me feel stupid soon :D
protected void googleCloudStorageTempFileApproach(TarArchiveOutputStream taos) throws IOException {
TarArchiveEntry entry = new TarArchiveEntry(exporter.getFileName());
String name = NameHelper.getRandomName(); //get random name for temp storage
BlobInfo blobInfo = BlobInfo.newBuilder(StorageHelper.OUTPUT_BUCKET, name).build(); //prepare upload of temp file
WritableByteChannel wbc = ApiContainer.storage.writer(blobInfo); //get WriteChannel for temp file
try(OutputStream out = Channels.newOutputStream(wbc)) {
exporter.export(out); //stream items to remote temp file
} finally {
wbc.close();
}
Blob blob = ApiContainer.storage.get(blobInfo.getBlobId());
long size = blob.getSize(); //accurately query the size after upload
entry.setSize(size);
taos.putArchiveEntry(entry);
ReadableByteChannel rbc = blob.reader(); //get ReadChannel for temp file
try(InputStream in = Channels.newInputStream(rbc)) {
IOUtils.copy(in, taos); //stream back to local tar stream from remote temp file
} finally {
rbc.close();
}
blob.delete(); //delete remote temp file
taos.closeArchiveEntry();
}

I've been looking at a similar issue, and as far as I can tell this is a constraint of the tar file format.
Tar files are written as a stream, and metadata (filenames, permissions etc.) is written between the file data (i.e. metadata 1, filedata 1, metadata 2, filedata 2 etc.). The program that extracts the data reads metadata 1, then starts extracting filedata 1, but it needs a way of knowing when it's done. This could be done in a number of ways; tar does it by storing the length in the metadata.
Depending on your needs, and what the recipient expects out, there are a few options that I can see (not all apply to your situation):
As you mentioned, load an entire file, work out the length, then send it.
Divide the file into blocks, of predefined length (which fits into memory), then tar them up as file1-part1, file1-part2 etc.; the last block would be short.
Divide the file into blocks of a predefined length (which don't need to fit into memory), then pad the last block to that size with something appropriate.
Work out the maximum possible size of the file, and pad to that size.
Use a different archive format.
Make your own archive format, which does not have this limitation.
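The second option in the list above (splitting an entry into parts that fit in memory) could be sketched roughly as follows. This is only an illustration using the standard library: PartSink and PartingOutputStream are hypothetical names, not part of Commons Compress, and the part naming scheme is an assumption. The exporter writes into the stream as usual; each time the buffer fills up, one part of known size is handed to the sink.

```java
import java.io.IOException;
import java.io.OutputStream;

/** Hypothetical callback: receives one buffered part whose size (len) is known. */
interface PartSink {
    void accept(String partName, byte[] buf, int len) throws IOException;
}

/** Buffers at most partSize bytes and hands every full buffer to the sink as one part. */
final class PartingOutputStream extends OutputStream {
    private final String baseName;
    private final byte[] buf;
    private final PartSink sink;
    private int filled = 0;
    private int parts = 0;

    PartingOutputStream(String baseName, int partSize, PartSink sink) {
        if (partSize <= 0) throw new IllegalArgumentException("partSize must be positive");
        this.baseName = baseName;
        this.buf = new byte[partSize];
        this.sink = sink;
    }

    @Override
    public void write(int b) throws IOException {
        if (filled == buf.length) emit();
        buf[filled++] = (byte) b;
    }

    @Override
    public void write(byte[] src, int off, int len) throws IOException {
        while (len > 0) {
            if (filled == buf.length) emit(); // buffer full: flush it as one part
            int n = Math.min(len, buf.length - filled);
            System.arraycopy(src, off, buf, filled, n);
            filled += n;
            off += n;
            len -= n;
        }
    }

    /** Emits the last (possibly short) part; call once after the exporter is done. */
    void finish() throws IOException {
        if (filled > 0 || parts == 0) emit(); // an empty input still yields one empty part
    }

    int partCount() {
        return parts;
    }

    private void emit() throws IOException {
        parts++;
        sink.accept(baseName + ".part" + parts, buf, filled);
        filled = 0;
    }
}
```

In the tar case, the sink would presumably create a TarArchiveEntry for partName, call setSize(len) and putArchiveEntry, write buf from 0 to len, and then call closeArchiveEntry. The recipient has to concatenate the parts again, so this only helps if you control what the recipient expects.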
Interestingly, gzip does not have predefined limits, and multiple gzips can be concatenated together, each with its own "original filename". Unfortunately, standard gunzip extracts all the resulting data into one file, using the (?) first filename.

Related

Slow operations in parallel

I need help with running parallel operations. The goal of the code is to extract a large number of small files from the same tar into different folders in a very short time.
This is the code:
public void decompress(File archive, File destination) throws RuntimeException {
try (InputStream in = new FileInputStream(archive);
BufferedInputStream buff = new BufferedInputStream(in);
TarArchiveInputStream is = (TarArchiveInputStream) new ArchiveStreamFactory().createArchiveInputStream("tar", buff)
) {
TarArchiveEntry entry;
while ((entry = is.getNextTarEntry()) != null) {
File file = new File(destination, entry.getName());
file.getParentFile().mkdirs();
Files.write(file.toPath(), is.readAllBytes());
}
} catch (IOException | ArchiveException e) {
e.printStackTrace();
}
}
When I execute this operation once, it takes ~900ms.
But when I do something like this to execute the same operation multiple times in parallel, it takes 20000ms:
ExecutorService EXECUTOR_SERVICE = Executors.newFixedThreadPool(20);
File archive = ...;
for (int i = 0; i < 5; i++) {
File directory = new File("Dir_" + i);
EXECUTOR_SERVICE.submit(() -> decompress(archive, directory));
}
or
File archive = ...;
for (int i = 0; i < 5; i++) {
File directory = new File("Dir_" + i);
new Thread(() -> decompress(archive, directory)).start();
}

One suspicion is that the directories contain many files, so File.mkdirs performs many needless checks.
The constructor of BufferedInputStream can take a custom buffer size. That has never helped me much, but it might with your disk. With parallelism it could also help to prevent excessive disk head movement.
You probably already tried Files.copy, but still: it might have better memory behavior than readAllBytes.
So the version becomes (eschewing File in favor of Path):
public void decompress(File archive, File destination) throws RuntimeException {
final int bufferSize = 1024 * 128;
Path archivePath = archive.toPath();
Path destinationPath = destination.toPath();
try (InputStream in = new FileInputStream(archive);
BufferedInputStream buff = new BufferedInputStream(in, bufferSize);
TarArchiveInputStream is = (TarArchiveInputStream)
new ArchiveStreamFactory().createArchiveInputStream("tar", buff)
) {
Path oldFileParent = destinationPath;
Files.createDirectories(oldFileParent);
TarArchiveEntry entry;
while ((entry = is.getNextTarEntry()) != null) {
Path file = destinationPath.resolve(entry.getName());
Path fileParent = file.getParent();
// only create directories when the parent changes between entries
if (!fileParent.equals(oldFileParent)) {
oldFileParent = fileParent;
Files.createDirectories(oldFileParent);
}
// REPLACE_EXISTING keeps the overwrite behavior of the original Files.write
Files.copy(is, file, StandardCopyOption.REPLACE_EXISTING);
//Files.write(file, is.readAllBytes());
}
} catch (IOException | ArchiveException e) {
e.printStackTrace();
}
}
Declaring throws RuntimeException while catching the IOException/ArchiveException without rethrowing it (for instance as new IllegalStateException(e)) is a matter of taste.
Now to adding parallelism: disk output is probably the bottleneck. Writing two files to the same disk in parallel means skipping back and forth on the disk. For small files that might be fine.
A better approach seems to be reading the next file in one thread and writing it in another.
Two threads might theoretically perform better than many threads with heightened disk traffic. readAllBytes might then be appropriate, so that the writing thread does not need to touch is.
The tar entry presumably stores the file size too, which would let you check whether readAllBytes is efficient enough for large files.
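The reader/writer split described above might look roughly like the sketch below. It is not a measured solution: PipelinedExtractor is a hypothetical name, the queue size of 16 is arbitrary, and the entries are modelled as name/content pairs so the example needs only the standard library. In the real code, the reading thread would produce them from entry.getName() and is.readAllBytes().

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class PipelinedExtractor {
    /** The caller's thread reads entries; a single writer thread puts them on disk. */
    static int extract(Iterable<Map.Entry<String, byte[]>> entries, Path destination)
            throws IOException, InterruptedException {
        // Bounded queue: if the writer falls behind, CallerRunsPolicy makes the
        // reading thread write the file itself, which limits buffered memory.
        ExecutorService writer = new ThreadPoolExecutor(1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(16), new ThreadPoolExecutor.CallerRunsPolicy());
        List<Future<?>> pending = new ArrayList<>();
        int count = 0;
        try {
            for (Map.Entry<String, byte[]> e : entries) {
                Path target = destination.resolve(e.getKey());
                byte[] data = e.getValue();
                pending.add(writer.submit(() -> {
                    Files.createDirectories(target.getParent());
                    Files.write(target, data);
                    return null; // Callable, so checked IOExceptions can propagate
                }));
                count++;
            }
        } finally {
            writer.shutdown();
        }
        for (Future<?> f : pending) {
            try {
                f.get(); // wait for the write and surface any failure
            } catch (ExecutionException ex) {
                throw new IOException("writing an entry failed", ex.getCause());
            }
        }
        return count;
    }
}
```

Whether this actually beats the single-threaded version depends on the disk, so it is worth measuring before adopting it.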
Logging was mentioned in this question. It is known that logging can consume a lot of time, and with parallelism this becomes even more critical. But you seem to be aware of it; you wrote that you have written your own logger. For a library, System.Logger is actually best: it is a façade that uses whatever logger the application provides. That would have prevented the logger vulnerability hidden in library dependencies of the past year.

Ignoring the fact that you are not decompressing the file in parallel here (you are running multiple threads decompressing the same file concurrently, essentially overwriting the result), there may be several reasons for this performance hit. I/O is one, so it depends on the underlying implementation. Also, what is the Logger you are using there? While other parts of your code don't seem to be shared among multiple threads, the static call to Logger is something that is shared.
Also note: java.nio uses FileChannels which provide synchronous I/O, so depending on how you create the channels, you may get into similar situations (though I don't believe this applies here).

Out of Memory issue (Heap) from generating large csv file

I have an application for users to get data from database and download as csv file.
The general workflow follows:
1. The user clicks the download button in the frontend.
2. The backend (Spring Boot in this case) starts an async thread to get the data from the database.
3. It generates csv files with the data from step (2) and uploads them to Google Cloud Storage.
4. It sends the user an email with a signed URL to download the data.
My problem is that the backend keeps throwing "OOM Java heap space" errors in some extreme cases, when all of my memory (4GB) is used up. My initial plan was to load the data from the database page by page (not all at once, to save memory) and generate one csv per page. That way, the GC could reclaim the memory once a csv was generated, keeping overall memory usage low. In practice, however, memory usage keeps growing until everything is used up; the GC does not work as expected. In the extreme case I have 18 pages in total and around 200000 records (from the db) per page.
I used JProfiler to monitor heap usage and found that the retained size of those large byte[] objects is not 0, which suggests that some references to them still exist (I guess that's why the GC does not clear them from memory as expected).
How should I optimize my code and VM environment so that memory usage stays below 1GB in the extreme case? What prevents those large byte[] objects from being cleared by the GC as expected?
The code to get data from database and generate csv file
@Override
@Async
@Transactional(timeout = DOWNLOAD_DATA_TRANSACTION_TIME_LIMIT)
public void startDownloadDataInCSVBySearchQuery(SearchQuery query, DownloadRequestRecord downloadRecord) throws IOException {
logger.debug(Thread.currentThread().getName() + ": starts to process download data");
String username = downloadRecord.getUsername();
// get posts from database first
List<? extends SocialPost> posts = this.postsService.getPosts(query);
try (ByteArrayOutputStream out = new ByteArrayOutputStream()) {
// get ids of posts
List<String> postsIDs = this.getPostsIDsFromPosts(posts);
int postsSize = postsIDs.size();
// do pagination db search. For each page, there are 1500 posts
int numPages = postsSize / POSTS_COUNT_PER_PAGE + 1;
for (int i = 0; i < numPages; i++) {
logger.debug("Download comments: start at page {}, out of total page {}", i + 1, numPages);
int pageStartPos = i * POSTS_COUNT_PER_PAGE; // this is set to 1500
int pageEndPos = Math.min((i + 1) * POSTS_COUNT_PER_PAGE, postsSize);
// get post ids per page
List<String> postsIDsPerPage = postsIDs.subList(pageStartPos, pageEndPos);
// use posts ids to get corresponding comments from db, via sql "IN"
List<Comment> commentsPerPage = this.commentsService.getCommentsByPostsIDs(postsIDsPerPage);
// generate csv file for page data and upload to google cloud
String commentsFileName = "comments-" + downloadRecord.getDownloadTime() + "-" + (i + 1) + ".csv";
this.csvUtil.generateCommentsCsvFileStream(commentsPerPage, commentsFileName, out);
this.googleCloudStorageInstance.uploadDownloadOutputStreamData(out.toByteArray(), commentsFileName);
}
} catch (Exception ex) {
logger.error("Exception from downloading data: ", ex);
}
}
Code to generate csv file
// use Apache csv
public void generateCommentsCsvFileStream(List<Comment> comments, String filename, ByteArrayOutputStream out) throws IOException {
CSVPrinter csvPrinter = new CSVPrinter(new OutputStreamWriter(out), CSVFormat.DEFAULT.withHeader(PostHeaders.class).withQuoteMode(QuoteMode.MINIMAL));
for (Comment comment: comments) {
List<Object> record = Arrays.asList(
// write csv content
comment.getPageId(),
...
);
csvPrinter.printRecord(record);
}
// close printer to release memory
csvPrinter.flush();
csvPrinter.close();
}
Code to upload the file to Google Cloud Storage
public Blob uploadDownloadOutputStreamData(byte[] fileStream, String filename) {
logger.debug("Upload file: '{}' to google cloud storage", filename);
BlobId blobId = BlobId.of(this.DownloadDataBucketName, filename);
BlobInfo blobInfo = BlobInfo.newBuilder(blobId).build();
return this.cloudStorage.create(blobInfo, fileStream);
}
The heap usage keeps increasing as the page number increases. The G1 old gen heap usage is still very high after the system crashes.
The G1 Eden space is almost empty; big files are saved into old gen directly.
Old gen GC activity is low; most of the GC activity comes from the Eden space.
The heap walker shows that the retained size of those big byte[] objects is not 0.

You're using a single instance of ByteArrayOutputStream, which just writes to an in-memory byte array.
That looks like a mistake, because you seem to want to upload one page at a time, not the accumulated result so far (which includes ALL previous pages).
By the way, doing this is pointless:
try (ByteArrayOutputStream out = new ByteArrayOutputStream())
ByteArrayOutputStream does not need to be closed, as it lives in memory. Just remove that, and create a new instance for each page (inside the pages for loop) instead of re-using the same instance for all pages; it might then just work fine.
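To make the difference concrete, here is a small standard-library sketch. PerPageBuffers and uploadSizes are hypothetical names, and out.toByteArray().length stands in for the size of the payload that would be passed to the upload call. With one shared stream every upload contains all previous pages; with a fresh stream per page it contains only the current one.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class PerPageBuffers {
    /** Returns the size of the byte[] that would be uploaded for each page. */
    static List<Integer> uploadSizes(List<String> pages, boolean freshBufferPerPage) {
        List<Integer> sizes = new ArrayList<>();
        ByteArrayOutputStream shared = new ByteArrayOutputStream();
        for (String page : pages) {
            // Either a new buffer for this page, or the one that keeps accumulating.
            ByteArrayOutputStream out = freshBufferPerPage ? new ByteArrayOutputStream() : shared;
            byte[] bytes = page.getBytes(StandardCharsets.UTF_8);
            out.write(bytes, 0, bytes.length);
            sizes.add(out.toByteArray().length); // stands in for the upload call
        }
        return sizes;
    }
}
```

For pages of 4, 2 and 1 bytes, the shared stream yields upload sizes 4, 6 and 7, while a fresh stream per page yields 4, 2 and 1. If allocating a new stream per page is undesirable, calling reset() on the shared stream after each upload should have a similar effect.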
EDIT
Another piece of advice would be to break this code up into more methods... not just because it's more readable with smaller methods, but because you're keeping temporary variables in scope for too long (causing memory to stick around longer than needed).
For example:
List<? extends SocialPost> posts = this.postsService.getPosts(query);
try (ByteArrayOutputStream out = new ByteArrayOutputStream()) {
// get ids of posts
List<String> postsIDs = this.getPostsIDsFromPosts(posts);
....
From this point on, posts is not used anymore, and I assume that it contains a lot of stuff... so you should "drop" that variable once you got the IDs.
Do something like this instead:
List<String> postsIDs = getAllPostIds(query);
....
List<String> getAllPostIds(SearchQuery query) {
// this variable will be GC'd after this method returns as it's no longer referenced (assuming getPostIDsFromPosts() doesn't store it in a field)
List<? extends SocialPost> posts = this.postsService.getPosts(query);
return this.getPostsIDsFromPosts(posts);
}

Copied DocumentFile has different size and hash to original

I'm attempting to copy / duplicate a DocumentFile in an Android application, but upon inspecting the created duplicate, it does not appear to be exactly the same as the original (which is causing a problem, because I need to do an MD5 check on both files the next time a copy is called, so as to avoid overwriting the same files).
The process is as follows:
User selects a file from an ACTION_OPEN_DOCUMENT_TREE
Source file's type is obtained
New DocumentFile in target location is initialised
Contents of first file is duplicated into second file
The initial stages are done with the following code:
// Get the source file's type
String sourceFileType = MimeTypeMap.getSingleton().getExtensionFromMimeType(contextRef.getContentResolver().getType(file.getUri()));
// Create the new (empty) file
DocumentFile newFile = targetLocation.createFile(sourceFileType, file.getName());
// Copy the file
CopyBufferedFile(new BufferedInputStream(contextRef.getContentResolver().openInputStream(file.getUri())), new BufferedOutputStream(contextRef.getContentResolver().openOutputStream(newFile.getUri())));
The main copy process is done using the following snippet:
void CopyBufferedFile(BufferedInputStream bufferedInputStream, BufferedOutputStream bufferedOutputStream)
{
// Duplicate the contents of the temporary local File to the DocumentFile
try
{
byte[] buf = new byte[1024];
bufferedInputStream.read(buf);
do
{
bufferedOutputStream.write(buf);
}
while(bufferedInputStream.read(buf) != -1);
}
catch (IOException e)
{
e.printStackTrace();
}
finally
{
try
{
if (bufferedInputStream != null) bufferedInputStream.close();
if (bufferedOutputStream != null) bufferedOutputStream.close();
}
catch (IOException e)
{
e.printStackTrace();
}
}
}
The problem that I'm facing, is that although the file copies successfully and is usable (it's a picture of a cat, and it's still a picture of a cat in the destination), it is slightly different.
The file size has changed from 2261840 to 2262016 (+176)
The MD5 hash has changed completely
Is there something wrong with my copying code that is causing the file to change slightly?
Thanks in advance.

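Your copying code is incorrect. It assumes (incorrectly) that each call to read will either fill the whole buffer (buf.length bytes) or return -1. That is consistent with the extra 176 bytes: 2262016 is an exact multiple of 1024, so the final, partially filled buffer (848 bytes) was written out in full.
What you should do is capture the number of bytes read in a variable each time, and then write exactly that number of bytes. Your code for closing the streams is verbose and (in theory [1]) buggy as well.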
Here is a rewrite that addresses both of those issues, and some others as well.
void copyBufferedFile(BufferedInputStream bufferedInputStream,
BufferedOutputStream bufferedOutputStream)
throws IOException
{
try (BufferedInputStream in = bufferedInputStream;
BufferedOutputStream out = bufferedOutputStream)
{
byte[] buf = new byte[1024];
int nosRead;
while ((nosRead = in.read(buf)) != -1) // read this carefully ...
{
out.write(buf, 0, nosRead);
}
}
}
As you can see, I have gotten rid of the bogus "catch and squash exception" handlers, and fixed the resource leak using Java 7+ try with resources.
There are still a couple of issues:
It is better for the copy function to take file name strings (or File or Path objects) as parameters and be responsible for opening the streams.
Given that you are doing block reads and writes, there is little value in using buffered streams. (Indeed, it might conceivably be making the I/O slower.) It would be better to use plain streams and make the buffer the same size as the default buffer size used by the Buffered* classes .... or larger.
If you are really concerned about performance, try using transferFrom as described here:
https://www.journaldev.com/861/java-copy-file
[1] In theory, if bufferedInputStream.close() throws an exception, the bufferedOutputStream.close() call will be skipped. In practice, it is unlikely that closing an input stream will throw an exception. But either way, the try-with-resources approach deals with this correctly, and far more concisely.
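For the transferFrom suggestion above, a minimal sketch might look like this. ChannelCopy is a hypothetical name, and the code assumes that both source and target are real filesystem paths; an Android DocumentFile is addressed by a content Uri, so this only applies where such paths are available.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class ChannelCopy {
    /** Copies source to target via FileChannel.transferFrom; returns the bytes copied. */
    static long copy(Path source, Path target) throws IOException {
        try (FileChannel in = FileChannel.open(source, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(target, StandardOpenOption.WRITE,
                     StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING)) {
            long size = in.size();
            long pos = 0;
            // transferFrom may move fewer bytes than requested, so loop until done.
            while (pos < size) {
                long n = out.transferFrom(in, pos, size - pos);
                if (n <= 0) {
                    break; // source shrank or nothing more to transfer
                }
                pos += n;
            }
            return pos;
        }
    }
}
```

Because the number of bytes written is driven by what was actually transferred, the copy has the same size as the original, unlike the fixed 1024-byte writes in the question.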

How to download a large file from Google Cloud Storage using Java with checksum control

I want to download large files from Google Cloud Storage using the google provided Java library com.google.cloud.storage. I have working code, but I still have one question and one major concern:
My major concern is, when is the file content actually downloaded? During (references to the code below) storage.get(blobId), during blob.reader() or during reader.read(bytes)? This gets very important when it comes to how to handle an invalid checksum, what do I need to do in order to actually trigger that the file is fetched over the network again?
The simpler question is: Is there built in functionality to do md5 (or crc32c) check on the received file in the google library? Maybe I don't need to implement it on my own.
Here is my method trying to download big files from Google Cloud Storage:
private static final int MAX_NUMBER_OF_TRIES = 3;
public Path downloadFile(String storageFileName, String bucketName) throws IOException {
// In my real code, this is a field populated in the constructor.
Storage storage = Objects.requireNonNull(StorageOptions.getDefaultInstance().getService());
BlobId blobId = BlobId.of(bucketName, storageFileName);
Path outputFile = Paths.get(storageFileName.replaceAll("/", "-"));
int retryCounter = 1;
Blob blob;
boolean checksumOk;
MessageDigest messageDigest;
try {
messageDigest = MessageDigest.getInstance("MD5");
} catch (NoSuchAlgorithmException ex) {
throw new RuntimeException(ex);
}
do {
LOGGER.debug("Start download file {} from bucket {} to Content Store (try {})", storageFileName, bucketName, retryCounter);
blob = storage.get(blobId);
if (null == blob) {
throw new CloudStorageCommunicationException("Failed to download file after " + retryCounter + " tries.");
}
if (Files.exists(outputFile)) {
Files.delete(outputFile);
}
try (ReadChannel reader = blob.reader();
FileChannel channel = new FileOutputStream(outputFile.toFile(), true).getChannel()) {
ByteBuffer bytes = ByteBuffer.allocate(128 * 1024);
int bytesRead = reader.read(bytes);
while (bytesRead > 0) {
bytes.flip();
messageDigest.update(bytes.array(), 0, bytesRead);
channel.write(bytes);
bytes.clear();
bytesRead = reader.read(bytes);
}
}
String checksum = Base64.encodeBase64String(messageDigest.digest());
checksumOk = checksum.equals(blob.getMd5());
if (!checksumOk) {
Files.delete(outputFile);
messageDigest.reset();
}
} while (++retryCounter <= MAX_NUMBER_OF_TRIES && !checksumOk);
if (!checksumOk) {
throw new CloudStorageCommunicationException("Failed to download file after " + MAX_NUMBER_OF_TRIES + " tries.");
}
return outputFile;
}

The google-cloud-java storage library does not validate checksums on its own when reading data beyond normal HTTPS/TCP correctness checking. If it compared the MD5 of the received data to the known MD5, it would need to download the entire file before it could return any results from read(), which for very large files would be infeasible.
What you're doing is a good idea if you need the additional protection of comparing MD5s. If this is a one-off task, you could use the gsutil command-line tool, which does this same sort of additional check.

As the JavaDoc of ReadChannel says:
Implementations of this class may buffer data internally to reduce remote calls.
So the implementation you get from blob.reader() could cache the whole file, some bytes or nothing and just fetch byte for byte when you call read(). You will never know and you shouldn't care.
As only read() throws an IOException and the other methods you used do not, I'd say that only calling read() will actually download stuff. You can also see this in the sources of the lib.
Btw. despite the example in the JavaDocs of the library, you should check for >= 0, not > 0. 0 just means nothing was read, not that end of stream is reached. End of stream is signaled by returning -1.
To retry after a failed checksum check, get a new reader from the blob. If anything caches the downloaded data, it is the reader itself, so a new reader from the blob will download the file from the remote again.
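Putting the points above together, the read loop could be sketched as below. DigestingCopy is a hypothetical helper written against the standard ReadableByteChannel interface (which, as far as I know, ReadChannel extends), so it can be tried without the Google library. The Base64 form of the digest is chosen to match how the question compares it to blob.getMd5().

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.WritableByteChannel;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

final class DigestingCopy {
    /** Copies src to dst and returns the Base64-encoded MD5 of the copied bytes. */
    static String copyWithMd5(ReadableByteChannel src, WritableByteChannel dst) throws IOException {
        MessageDigest md5;
        try {
            md5 = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
        ByteBuffer buf = ByteBuffer.allocate(128 * 1024);
        // -1 signals end of stream; 0 only means nothing was read this time.
        while (src.read(buf) >= 0) {
            buf.flip();
            md5.update(buf.duplicate()); // digest a view so buf's position is untouched
            while (buf.hasRemaining()) {
                dst.write(buf); // a channel write may be partial
            }
            buf.clear();
        }
        return Base64.getEncoder().encodeToString(md5.digest());
    }
}
```

On each retry you would call this with a new reader from the blob and a freshly opened output channel, and compare the result with the stored MD5. A new MessageDigest is created per call, so no reset is needed.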

JaxRS create and return zip file from server

I want to create and return a zip file from my server using JaxRS. I don't think that I want to create an actual file on the server, if possible I would like to create the zip on the fly and pass that back to the client. If I create a huge zip file on the fly will I run out of memory if too many files are in the zip file?
Also I am not sure the most efficient way to do this. Here is what I was thinking but I am very rusty when it comes to input/output in java.
public Response getFiles() {
// These are the files to include in the ZIP file
String[] filenames = // ... bunch of filenames
byte[] buf = new byte[1024];
try {
// Create the ZIP file
ByteArrayOutputStream baos= new ByteArrayOutputStream();
ZipOutputStream out = new ZipOutputStream(new BufferedOutputStream(baos));
// Compress the files
for (String filename : filenames) {
FileInputStream in = new FileInputStream(filename);
// Add ZIP entry to output stream.
out.putNextEntry(new ZipEntry(filename));
// Transfer bytes from the file to the ZIP file
int len;
while ((len = in.read(buf)) > 0) {
out.write(buf, 0, len);
}
// Complete the entry
out.closeEntry();
in.close();
}
// Complete the ZIP file
out.close();
ResponseBuilder response = Response.ok(out); // Not a 100% sure this will work
response.type(MediaType.APPLICATION_OCTET_STREAM);
response.header("Content-Disposition", "attachment; filename=\"files.zip\"");
return response.build();
} catch (IOException e) {
}
}
Any help would be greatly appreciated.

There are two options:
1- Create the ZIP in a temporary directory and then send it to the client.
2- Use the OutputStream from the Response to send the ZIP directly to the client while you are creating it.
But never build a huge ZIP file in memory.

There's no need to build the ZIP file from the first to the last byte in memory before serving it to the client. There's also no need to create such a file in a temp directory in advance (especially because the IO might be really slow).
The key is to start streaming the "ZIP response" and generate the content on the fly.
Let's say we have aMethodReturningStream(), which returns a Stream, and we want to turn each element into a file stored in the ZIP file, without keeping the bytes of every element in some intermediate representation such as a collection or an array.
Then pseudocode like this might help:
@GET
@Produces("application/zip")
public Response generateZipOnTheFly() {
StreamingOutput output = strOut -> {
try (ZipOutputStream zout = new ZipOutputStream(strOut)) {
aMethodReturningStream().forEach(singleStreamElement -> {
try {
ZipEntry zipEntry = new ZipEntry(createFileName(singleStreamElement));
FileTime fileTime = FileTime.from(singleStreamElement.getCreationTime());
zipEntry.setCreationTime(fileTime);
zipEntry.setLastModifiedTime(fileTime);
zout.putNextEntry(zipEntry);
zout.write(singleStreamElement.getBytes());
zout.flush();
} catch (IOException e) {
throw new RuntimeException(e);
}
});
}
};
return Response.ok(output)
.header("Content-Disposition", "attachment; filename=\"generated.zip\"")
.build();
}
This concept relies on passing a StreamingOutput to the Response builder. The StreamingOutput is not a full response/entity/body generated before sending the response, but a recipe used to generate the flow of bytes on the fly (here wrapped into a ZipOutputStream). If you're not sure about this, set a breakpoint on flush() and observe the download progress using e.g. wget.
The key thing to remember here is that the stream here is not a "wrapper" of pre-computed or pre-fetched items. It must be dynamic, e.g. wrapping a DB cursor or something like that. Also, it can be replaced by anything that's streaming data. That's why it cannot be a foreach loop iterating over Element[] elems array (with each Element having all the bytes "inside"), like
for(Element elem: elems)
if you'd like to avoid reading all items into the heap at once before streaming the ZIP.
(Please note this is a pseudocode and you might want to add better handling and polish other stuff as well.)
