I have an application that lets users fetch data from a database and download it as a CSV file.
The general workflow is:
1. The user clicks the download button on the frontend.
2. The backend (Spring Boot in this case) starts an async thread to fetch the data from the database.
3. CSV files are generated from the data in step 2 and uploaded to Google Cloud Storage.
4. The user is sent an email with a signed URL to download the data.
My problem is that the backend keeps throwing an "OOM Java heap space" error in some extreme cases, where all of my memory (4 GB) gets filled. My plan was to load data from the database via pagination (not all at once, to save memory) and generate a CSV for each page of data; that way the GC could reclaim the memory once each CSV was generated and overall usage would stay low. In practice, however, memory keeps growing until it is all used up, and the GC does not behave as I expected. In the extreme case there are 18 pages in total, with around 200,000 records (from the DB) per page.
I used JProfiler to monitor heap usage and found that the retained size of those large byte[] objects is not 0, which suggests there are still references to them (I guess that is why the GC does not clear them from memory as expected).
How should I optimize my code and JVM settings so that memory usage stays below 1 GB in the extreme case? And what keeps those large byte[] objects from being collected by the GC?
The code to get data from the database and generate the CSV files:
@Override
@Async
@Transactional(timeout = DOWNLOAD_DATA_TRANSACTION_TIME_LIMIT)
public void startDownloadDataInCSVBySearchQuery(SearchQuery query, DownloadRequestRecord downloadRecord) throws IOException {
    logger.debug(Thread.currentThread().getName() + ": starts to process download data");
    String username = downloadRecord.getUsername();
    // get posts from database first
    List<? extends SocialPost> posts = this.postsService.getPosts(query);
    try (ByteArrayOutputStream out = new ByteArrayOutputStream()) {
        // get ids of posts
        List<String> postsIDs = this.getPostsIDsFromPosts(posts);
        int postsSize = postsIDs.size();
        // do pagination db search. For each page, there are 1500 posts
        int numPages = postsSize / POSTS_COUNT_PER_PAGE + 1;
        for (int i = 0; i < numPages; i++) {
            logger.debug("Download comments: start at page {}, out of total page {}", i + 1, numPages);
            int pageStartPos = i * POSTS_COUNT_PER_PAGE; // this is set to 1500
            int pageEndPos = Math.min((i + 1) * POSTS_COUNT_PER_PAGE, postsSize);
            // get post ids per page
            List<String> postsIDsPerPage = postsIDs.subList(pageStartPos, pageEndPos);
            // use posts ids to get corresponding comments from db, via sql "IN"
            List<Comment> commentsPerPage = this.commentsService.getCommentsByPostsIDs(postsIDsPerPage);
            // generate csv file for page data and upload to google cloud
            String commentsFileName = "comments-" + downloadRecord.getDownloadTime() + "-" + (i + 1) + ".csv";
            this.csvUtil.generateCommentsCsvFileStream(commentsPerPage, commentsFileName, out);
            this.googleCloudStorageInstance.uploadDownloadOutputStreamData(out.toByteArray(), commentsFileName);
        }
    } catch (Exception ex) {
        logger.error("Exception from downloading data: ", ex);
    }
}
The code to generate a CSV file:
// use Apache csv
public void generateCommentsCsvFileStream(List<Comment> comments, String filename, ByteArrayOutputStream out) throws IOException {
    CSVPrinter csvPrinter = new CSVPrinter(new OutputStreamWriter(out), CSVFormat.DEFAULT.withHeader(PostHeaders.class).withQuoteMode(QuoteMode.MINIMAL));
    for (Comment comment : comments) {
        List<Object> record = Arrays.asList(
            // write csv content
            comment.getPageId(),
            ...
        );
        csvPrinter.printRecord(record);
    }
    // close printer to release memory
    csvPrinter.flush();
    csvPrinter.close();
}
The code to upload the file to Google Cloud Storage:
public Blob uploadDownloadOutputStreamData(byte[] fileStream, String filename) {
    logger.debug("Upload file: '{}' to google cloud storage", filename);
    BlobId blobId = BlobId.of(this.DownloadDataBucketName, filename);
    BlobInfo blobInfo = BlobInfo.newBuilder(blobId).build();
    return this.cloudStorage.create(blobInfo, fileStream);
}
The heap usage keeps increasing as the page number increases. The G1 Old Gen heap usage is still very high after the system crashes.
The G1 Eden space is almost empty; the big files go directly into the Old Gen.
Old Gen GC activity is low; most of the GC activity happens in Eden space.
The heap walker shows that the retained size of those big byte[] objects is not 0.
You're using a single instance of ByteArrayOutputStream, which just writes to an in-memory byte array.
That looks like a mistake, because you presumably want to upload one page at a time, not the accumulated result so far (which includes ALL previous pages).
By the way, doing this is useless:
try (ByteArrayOutputStream out = new ByteArrayOutputStream())
ByteArrayOutputStream does not need to be closed, as it lives entirely in memory, so just remove the try-with-resources. And create a new instance for each page (inside the pages for loop) instead of reusing the same instance for all pages, and it might just work fine.
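A minimal sketch of what that change could look like, reusing the names from the code in the question (only the buffer handling changes):

for (int i = 0; i < numPages; i++) {
    int pageStartPos = i * POSTS_COUNT_PER_PAGE;
    int pageEndPos = Math.min((i + 1) * POSTS_COUNT_PER_PAGE, postsSize);
    List<String> postsIDsPerPage = postsIDs.subList(pageStartPos, pageEndPos);
    List<Comment> commentsPerPage = this.commentsService.getCommentsByPostsIDs(postsIDsPerPage);
    String commentsFileName = "comments-" + downloadRecord.getDownloadTime() + "-" + (i + 1) + ".csv";

    // fresh buffer per page: once this iteration ends nothing references it, so the GC can reclaim it
    ByteArrayOutputStream pageOut = new ByteArrayOutputStream();
    this.csvUtil.generateCommentsCsvFileStream(commentsPerPage, commentsFileName, pageOut);
    this.googleCloudStorageInstance.uploadDownloadOutputStreamData(pageOut.toByteArray(), commentsFileName);
}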
EDIT
Another piece of advice would be to break this code up into more methods... not just because smaller methods are more readable, but because you're keeping temporary variables in scope for too long (causing memory to stick around longer than needed).
For example:
List<? extends SocialPost> posts = this.postsService.getPosts(query);
try (ByteArrayOutputStream out = new ByteArrayOutputStream()) {
// get ids of posts
List<String> postsIDs = this.getPostsIDsFromPosts(posts);
....
From this point on, posts is not used anymore, and I assume that it contains a lot of stuff... so you should "drop" that variable once you got the IDs.
Do something like this instead:
List<String> postsIDs = getAllPostIds(query);
....
List<String> getAllPostIds(SearchQuery query) {
    // this variable will be GC'd after this method returns, as it's no longer referenced
    // (assuming getPostsIDsFromPosts() doesn't store it in a field)
    List<? extends SocialPost> posts = this.postsService.getPosts(query);
    return this.getPostsIDsFromPosts(posts);
}
Related
I have multiple 100 MB raw files containing series of user activities in CSV format. I only want to download the first 100 lines of each file.
The problem is that each file may have different CSV header columns and data values, because they are user activities from multiple subdomains using different activity-tracking providers. This means that a line can be 50 characters long or 500 characters long, and I can't know until I've read them.
S3 supports a getObject API with a Range parameter, which you can use to download a specific range of bytes of the file.
https://docs.aws.amazon.com/AmazonS3/latest/API/API_GetObject.html#API_GetObject_RequestSyntax
If I use this API to fetch the first 1 MB of each file and iterate over each byte until I see 100 newline characters (\n), would that technically work? Is there anything I have to be careful about with this approach (e.g. multibyte chars)?
There is no built-in way; byte-range fetches are the best way forward.
As you're not sure of the header or line length in each case, downloading 1 MB chunks until you have 100 lines is a safe and efficient approach.
Multibyte chars etc. won't matter at this level, as you're purely looking to stop reading after 100 \n characters. Depending on the source of your files, however, I would also be conscious of \r\n and \r being valid line endings.
I've written the below Java code for getting the last n bytes, feel free to use it as a starting point for getting the first n bytes:
public String getLastBytesOfObjectAsString(String bucket, String key, long lastBytesCount) {
    try {
        final ObjectMetadata objectMetadata = client.getObjectMetadata(bucket, key);
        final long fileSizeInBytes = objectMetadata.getContentLength();

        long rangeStart = fileSizeInBytes - lastBytesCount;
        if (rangeStart < 0) {
            rangeStart = 0;
        }

        final GetObjectRequest getObjectRequest =
                new GetObjectRequest(bucket, key).withRange(rangeStart);

        try (S3Object s3Object = client.getObject(getObjectRequest);
             InputStream inputStream = s3Object.getObjectContent()) {
            return new String(inputStream.readAllBytes());
        }
    } catch (Exception ex) {
        ...
    }
}
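For the "first n bytes" direction, a rough sketch along the same lines (it assumes the same AWS SDK v1 client as above, that a single 1 MB range request is enough to cover the lines you need, and \n-only line endings; a follow-up range request for longer lines is left out):

public String getFirstLinesOfObjectAsString(String bucket, String key, int lineCount) {
    final long rangeSize = 1024 * 1024; // 1 MB, assumed to contain at least lineCount lines
    final GetObjectRequest getObjectRequest =
            new GetObjectRequest(bucket, key).withRange(0, rangeSize - 1);
    try (S3Object s3Object = client.getObject(getObjectRequest);
         InputStream inputStream = s3Object.getObjectContent()) {
        byte[] bytes = inputStream.readAllBytes();
        int newlines = 0;
        int end = bytes.length;
        for (int i = 0; i < bytes.length; i++) {
            if (bytes[i] == '\n' && ++newlines == lineCount) {
                end = i + 1; // keep the newline itself
                break;
            }
        }
        return new String(bytes, 0, end);
    } catch (IOException ex) {
        throw new UncheckedIOException(ex);
    }
}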
You can use smart_open like this:
import csv
from smart_open import open

with open('s3://bucket/path/file.csv', 'r') as f:
    csv_reader = csv.DictReader(f, delimiter=',')
    data = ''
    for i, row in enumerate(csv_reader):
        data += ','.join(row.values()) + '\n'
        if i >= 100:
            store(data)  # store() is whatever you use to save the lines locally
            break
You will need to open another file on your local machine with write permission to store the 100 lines (or as many as you like). If you want the first lines from multiple files, you can do the same, but use the boto3 function for listing the files and pass each path/file name to a function that uses smart_open:
s3client = boto3.client('s3')
listObj = s3client.list_objects_v2(Bucket=bucket, Prefix=prefix)
for obj in listObj['Contents']:
    smart_function(obj['Key'])
obj['Key'] contains the path and file name of each file under that bucket and prefix.
I have a web app where I need to be able to serve the user an archive of multiple files. I've set up a generic ArchiveExporter, and made a ZipArchiveExporter. Works beautifully! I can stream my data to my server, and archive the data and stream it to the user all without using much memory, and without needing a filesystem (I'm on Google App Engine).
Then I remembered the whole zip64 issue with 4 GB zip files. My archives can potentially get very large (high-res images), so I'd like to have an option to avoid zip files for my larger inputs.
I checked out org.apache.commons.compress.archivers.tar.TarArchiveOutputStream and thought I had found what I needed! Sadly, when I checked the docs and ran into some errors, I quickly found out that you MUST pass the size of each entry as you stream. This is a problem because the data is being streamed to me with no way of knowing the size beforehand.
I tried counting and returning the written bytes from export(), but TarArchiveOutputStream expects the size in TarArchiveEntry before you write to it, so that obviously doesn't work.
I can use a ByteArrayOutputStream and read each entry entirely before writing its content so I know its size, but my entries can potentially get very large, and this is not very polite to the other processes running on the instance.
I could use some form of persistence, upload the entry, and query the data size. However, that would be a waste of my google storage api calls, bandwidth, storage, and runtime.
I am aware of this SO question asking almost the same thing, but he settled for using zip files and there is no more relevant information.
What is the ideal solution to creating a tar archive with entries of unknown size?
public abstract class ArchiveExporter<T extends OutputStream> extends Exporter { //base class
    public abstract void export(OutputStream out) throws IOException; //from Exporter interface
    protected abstract void archiveItems(T t) throws IOException;
}

public class ZipArchiveExporter extends ArchiveExporter<ZipOutputStream> { //zip class, works as intended
    @Override
    public void export(OutputStream out) throws IOException {
        try (ZipOutputStream zos = new ZipOutputStream(out, Charsets.UTF_8)) {
            zos.setLevel(0);
            archiveItems(zos);
        }
    }

    @Override
    protected void archiveItems(ZipOutputStream zos) throws IOException {
        zos.putNextEntry(new ZipEntry(exporter.getFileName()));
        exporter.export(zos);
        //chained call to export from another exporter, like a json exporter for instance
        zos.closeEntry();
    }
}

public class TarArchiveExporter extends ArchiveExporter<TarArchiveOutputStream> {
    @Override
    public void export(OutputStream out) throws IOException {
        try (TarArchiveOutputStream taos = new TarArchiveOutputStream(out, "UTF-8")) {
            archiveItems(taos);
        }
    }

    @Override
    protected void archiveItems(TarArchiveOutputStream taos) throws IOException {
        TarArchiveEntry entry = new TarArchiveEntry(exporter.getFileName());
        //entry.setSize(?);
        taos.putArchiveEntry(entry);
        exporter.export(taos);
        taos.closeArchiveEntry();
    }
}
EDIT: This is what I was thinking of with the ByteArrayOutputStream. It works, but I cannot guarantee I will always have enough memory to store the whole entry at once, hence my streaming efforts. There has to be a more elegant way of streaming a tarball! Maybe this question is better suited for Code Review?
protected void byteArrayOutputStreamApproach(TarArchiveOutputStream taos) throws IOException {
    TarArchiveEntry entry = new TarArchiveEntry(exporter.getFileName());
    try (ByteArrayOutputStream baos = new ByteArrayOutputStream()) {
        exporter.export(baos);
        byte[] data = baos.toByteArray();
        //holding ENTIRE entry in memory. What if it's huge? What if it has more than Integer.MAX_VALUE bytes? :[
        int len = data.length;
        entry.setSize(len);
        taos.putArchiveEntry(entry);
        taos.write(data);
        taos.closeArchiveEntry();
    }
}
EDIT: This is what I meant by uploading the entry to another medium (Google Cloud Storage in this case) to accurately query the whole size. It seems like major overkill for what looks like a simple problem, but it doesn't suffer from the same RAM problems as the solution above, just at the cost of bandwidth and time. I hope someone smarter than me comes by and makes me feel stupid soon :D
protected void googleCloudStorageTempFileApproach(TarArchiveOutputStream taos) throws IOException {
    TarArchiveEntry entry = new TarArchiveEntry(exporter.getFileName());
    String name = NameHelper.getRandomName(); //get random name for temp storage
    BlobInfo blobInfo = BlobInfo.newBuilder(StorageHelper.OUTPUT_BUCKET, name).build(); //prepare upload of temp file
    WritableByteChannel wbc = ApiContainer.storage.writer(blobInfo); //get WriteChannel for temp file
    try (OutputStream out = Channels.newOutputStream(wbc)) {
        exporter.export(out); //stream items to remote temp file
    } finally {
        wbc.close();
    }
    Blob blob = ApiContainer.storage.get(blobInfo.getBlobId());
    long size = blob.getSize(); //accurately query the size after upload
    entry.setSize(size);
    taos.putArchiveEntry(entry);
    ReadableByteChannel rbc = blob.reader(); //get ReadChannel for temp file
    try (InputStream in = Channels.newInputStream(rbc)) {
        IOUtils.copy(in, taos); //stream back to local tar stream from remote temp file
    } finally {
        rbc.close();
    }
    blob.delete(); //delete remote temp file
    taos.closeArchiveEntry();
}
I've been looking at a similar issue, and as far as I can tell this is a constraint of the tar file format.
Tar files are written as a stream, and metadata (filenames, permissions, etc.) is written between the file data (i.e. metadata 1, filedata 1, metadata 2, filedata 2, etc.). The program that extracts the data reads metadata 1, then starts extracting filedata 1, but it has to have a way of knowing when it's done. This could be done in a number of ways; tar does it by putting the length in the metadata.
Depending on your needs, and on what the recipient expects, there are a few options that I can see (not all of which apply to your situation):
As you mentioned, load an entire file, work out the length, then send it.
Divide the file into blocks of a predefined length (which fits into memory), then tar them up as file1-part1, file1-part2, etc.; the last block would be short (see the sketch after this list).
Divide the file into blocks of a predefined length (which don't need to fit into memory), then pad the last block to that size with something appropriate.
Work out the maximum possible size of the file, and pad to that size.
Use a different archive format.
Make your own archive format, which does not have this limitation.
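A rough sketch of the block-splitting option, assuming the data can be consumed as an InputStream (e.g. via a PipedInputStream fed by the exporter, which is a change from the push-style export(OutputStream) in the question) and using an arbitrary chunk size that fits comfortably in memory:

protected void chunkedEntriesApproach(TarArchiveOutputStream taos, InputStream in) throws IOException {
    final int CHUNK_SIZE = 8 * 1024 * 1024; // arbitrary block size that fits in memory
    byte[] chunk = new byte[CHUNK_SIZE];
    int part = 1;
    int read;
    while ((read = readFully(in, chunk)) > 0) {
        TarArchiveEntry entry = new TarArchiveEntry(exporter.getFileName() + "-part" + part++);
        entry.setSize(read); // the size of each block is known, so the tar constraint is satisfied
        taos.putArchiveEntry(entry);
        taos.write(chunk, 0, read);
        taos.closeArchiveEntry();
    }
}

// fill the buffer as far as possible; returns the number of bytes actually read
private static int readFully(InputStream in, byte[] buffer) throws IOException {
    int total = 0;
    while (total < buffer.length) {
        int n = in.read(buffer, total, buffer.length - total);
        if (n < 0) break;
        total += n;
    }
    return total;
}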
Interestingly, gzip does not have such predefined limits, and multiple gzip streams can be concatenated together, each with its own "original filename". Unfortunately, standard gunzip extracts all the resulting data into one file, using (I think) the first filename.
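To illustrate the concatenated-gzip idea (not a drop-in replacement for the tar requirement), a sketch using Commons Compress, whose GzipParameters lets you set the embedded "original filename" of each member; no sizes need to be known up front:

// Each call appends one gzip member (with its own embedded filename) to the same output stream.
void appendGzipMember(OutputStream out, String memberName, InputStream data) throws IOException {
    GzipParameters params = new GzipParameters();
    params.setFilename(memberName); // stored in the member's FNAME header field
    // note: don't put the gzip stream in try-with-resources, closing it would close `out` too
    GzipCompressorOutputStream gzos = new GzipCompressorOutputStream(out, params);
    IOUtils.copy(data, gzos);
    gzos.finish(); // ends this member so the next one can be appended
}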
I am using jPod to merge my PDF documents. I merged 400 PDFs of 20 pages each, resulting in a 190 MB file, whereas the size of a single PDF is 38 KB. I checked the heap status in my IDE and didn't get any OutOfMemoryError. I ran the same thing in Apache Tomcat with almost 30 clients, and my Tomcat stopped serving requests. Is it because jPod doesn't use streaming, or is it due to some other reason?
private void run() throws Throwable {
    String sOutFileFullPathAndName = "/Users/test/Downloads/" + UUID.randomUUID().toString().replace("-", "");
    PDDocument dstDocument = PDDocument.createNew();
    for (int i = 0; i < 400; i++) {
        //System.out.println(Runtime.getRuntime().freeMemory());
        PDDocument srcDocument = PDDocument.createFromLocator(new FileLocator("/Users/test/Downloads/2.pdf"));
        mergeDocuments(dstDocument, srcDocument);
    }
    FileLocator destinationLocator = new FileLocator(sOutFileFullPathAndName);
    dstDocument.save(destinationLocator, null);
    dstDocument.close();
}

private void mergeDocuments(PDDocument dstDocument, PDDocument srcDocument) {
    PDPageTree pageTree = srcDocument.getPageTree();
    int pageCount = pageTree.getCount();
    for (int index = 0; index < pageCount; index++) {
        PDPage srcPage = pageTree.getPageAt(index);
        appendPage(dstDocument, srcPage);
        srcPage = null;
    }
}

private void appendPage(PDDocument document, PDPage page) {
    PDResources srcResources = page.getResources();
    CSContent cSContent = page.getContentStream();
    PDPage newPage = (PDPage) PDPage.META.createNew();
    // copy resources from source page to the newly created page
    PDResources newResources = (PDResources) PDResources.META
            .createFromCos(srcResources.cosGetObject().copyDeep());
    newPage.setResources(newResources);
    newPage.setContentStream(cSContent);
    // add that new page to the destination document
    document.addPageNode(newPage);
}
PDF is not simply a "stream" of page data. It is a complex data structure containing objects that reference each other: in this concrete case page trees/nodes, content streams, resources, ...
jPod keeps persistent objects in memory using weak references only; they can always be refreshed from the random-access data. If you start updating the object structure, objects get "locked" in memory, simply because the change is not yet persisted and the object can no longer be refreshed.
Making lots of changes without periodically saving the result will keep the complete structure in memory; I assume that's your problem here. Saving every now and then should reduce the memory footprint.
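A sketch of that periodic-save idea against the asker's run() loop (the batch size of 50 is arbitrary, the source document is closed after each merge, and it assumes jPod allows re-saving the still-open destination document to the same locator):

PDDocument dstDocument = PDDocument.createNew();
FileLocator destinationLocator = new FileLocator(sOutFileFullPathAndName);
for (int i = 0; i < 400; i++) {
    PDDocument srcDocument = PDDocument.createFromLocator(new FileLocator("/Users/test/Downloads/2.pdf"));
    mergeDocuments(dstDocument, srcDocument);
    srcDocument.close();
    if ((i + 1) % 50 == 0) {
        // persist the changes made so far, so the corresponding objects no longer have to be kept in memory
        dstDocument.save(destinationLocator, null);
    }
}
dstDocument.save(destinationLocator, null);
dstDocument.close();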
In addition, this algorithm creates a poor page tree, containing thousands of pages in one linear array. You should try to create a balanced tree structure. Another point for optimization is resource handling: merging resources like fonts or images may dramatically reduce the target size.
I am trying to read lines from a file which may be large.
To improve performance, I tried using a memory-mapped file. But when I compare the performance, I find that the mapped-file approach is even a little slower than reading with a BufferedReader.
public long chunkMappedFile(String filePath, int trunkSize) throws IOException {
    long begin = System.currentTimeMillis();
    logger.info("Processing imei file, mapped file [{}], trunk size = {} ", filePath, trunkSize);

    //Create file object
    File file = new File(filePath);

    //Get file channel in readonly mode
    FileChannel fileChannel = new RandomAccessFile(file, "r").getChannel();

    long positionStart = 0;
    StringBuilder line = new StringBuilder();
    long lineCnt = 0;
    while (positionStart < fileChannel.size()) {
        long mapSize = positionStart + trunkSize < fileChannel.size() ? trunkSize : fileChannel.size() - positionStart;
        MappedByteBuffer buffer = fileChannel.map(FileChannel.MapMode.READ_ONLY, positionStart, mapSize); //mapped read
        for (int i = 0; i < buffer.limit(); i++) {
            char c = (char) buffer.get();
            //System.out.print(c); //Print the content of file
            if ('\n' != c) {
                line.append(c);
            } else { // line ends
                processor.processLine(line.toString());
                if (++lineCnt % 100000 == 0) {
                    try {
                        logger.info("mappedfile processed {} lines already, sleep 1ms", lineCnt);
                        Thread.sleep(1);
                    } catch (InterruptedException e) {}
                }
                line = new StringBuilder();
            }
        }
        closeDirectBuffer(buffer);
        positionStart = positionStart + buffer.limit();
    }
    long end = System.currentTimeMillis();
    logger.info("chunkMappedFile {} , trunkSize: {}, cost : {} ", filePath, trunkSize, end - begin);
    return lineCnt;
}
public long normalFileRead(String filePath) throws IOException {
    long begin = System.currentTimeMillis();
    logger.info("Processing imei file, Normal read file [{}] ", filePath);
    long lineCnt = 0;
    try (BufferedReader br = new BufferedReader(new FileReader(filePath))) {
        String line;
        while ((line = br.readLine()) != null) {
            processor.processLine(line);
            if (++lineCnt % 100000 == 0) {
                try {
                    logger.info("file processed {} lines already, sleep 1ms", lineCnt);
                    Thread.sleep(1);
                } catch (InterruptedException e) {}
            }
        }
    }
    long end = System.currentTimeMillis();
    logger.info("normalFileRead {} , cost : {} ", filePath, end - begin);
    return lineCnt;
}
Test results on Linux, reading a file whose size is 537 MB:
MappedBuffer way:
2017-09-28 14:33:19.277 [main] INFO com.oppo.push.ts.dispatcher.imei2device.ImeiTransformerOfflineImpl - process imei file ends:/push/file/imei2device-local/20170928/imei2device-13 , lines :12758858 , cost :14804 , lines per seconds: 861852.0670089165
BufferedReader way:
2017-09-28 14:27:03.374 [main] INFO com.oppo.push.ts.dispatcher.imei2device.ImeiTransformerOfflineImpl - process imei file ends:/push/file/imei2device-local/20170928/imei2device-13 , lines :12758858 , cost :13001 , lines per seconds: 981375.1249903854
That is the thing: file IO isn't straightforward and easy.
You have to keep in mind that your operating system has a huge impact on what exactly is going to happen. In that sense: there are no solid rules that would work for all JVM implementations on all platforms.
When you really have to worry about the last bit of performance, doing in-depth profiling on your target platform is the primary solution.
Beyond that, you are getting the "performance" aspect wrong. Memory-mapped IO doesn't magically increase the performance of reading a single file once within an application. Its major advantages lie elsewhere:
mmap is great if you have multiple processes accessing data in a read only fashion from the same file, which is common in the kind of server systems I write. mmap allows all those processes to share the same physical memory pages, saving a lot of memory.
( quoted from this answer on using the C mmap() system call )
In other words: your example is about reading a file's contents. In the end, the OS still has to turn to the drive to read all the bytes from there. Meaning: it reads disk content and puts it in memory. When you do that for the first time, it really doesn't matter that you do something "special" on top of that. On the contrary: because of that overhead, the memory-mapped approach might even be slower than an "ordinary" read.
And coming back to my first point: even if you had 5 processes reading the same file, the memory-mapped approach isn't necessarily faster. Linux might figure: I already read that file into memory and it hasn't changed, so even without explicit memory mapping, the kernel can serve the data from its cache.
The memory mapping doesn't really give any advantage, since even though you're bulk loading a file into memory, you're still processing it one byte at a time. You might see a performance increase if you processed the buffer in suitably sized byte[] chunks. Even then the BufferedReader version may perform better or at least almost the same.
The nature of your task is to process a file sequentially. BufferedReader already does this very well and the code is simple, so if I had to choose I'd go with the simplest option.
Also note that your buffer code doesn't work except for single-byte encodings. As soon as you get multiple bytes per character, it will fail magnificently.
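A sketch of that chunked approach, reusing the variables from the question's loop (and still assuming a single-byte encoding):

byte[] chunk = new byte[64 * 1024];
while (buffer.hasRemaining()) {
    int len = Math.min(chunk.length, buffer.remaining());
    buffer.get(chunk, 0, len); // one bulk copy instead of len individual get() calls
    for (int i = 0; i < len; i++) {
        char c = (char) chunk[i];
        if ('\n' != c) {
            line.append(c);
        } else {
            processor.processLine(line.toString());
            line = new StringBuilder();
        }
    }
}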
GhostCat is correct. And in addition to your OS choice, there are other things that can affect performance:
Mapping a file will place greater demand on physical memory. If physical memory is "tight" that could cause paging activity, and a performance hit.
The OS could use a different read-ahead strategy if you read a file using read syscalls versus mapping it into memory. Read-ahead (into the buffer cache) can make file reading a lot faster.
The default buffer size for BufferedReader and the OS memory page size are likely to be different. This may result in the size of disk read requests being different. (Larger reads often result in greater I/O throughput, at least up to a certain point.)
There could also be "artefacts" caused by the way that you benchmark. For example:
The first time you read a file, a copy of some or all of the file will land in the buffer cache (in memory)
The second time you read the same file, parts of it may still be in memory, and the apparent read time will be shorter.
I'm trying to make a downloader so I can automatically update my program. The following is my code so far:
public void applyUpdate(final CharSequence ver)
{
    java.io.InputStream is;
    java.io.BufferedWriter bw;
    try
    {
        String s, ver;
        alertOf(s);
        updateDialogProgressBar.setIndeterminate(true); //This is a javax.swing.JProgressBar which is configured beforehand, and is displayed to the user in an update dialog
        is = latestUpdURL.openStream(); //This is a java.net.URL which is configured beforehand, and contains the path to a replacement JAR file
        bw = new java.io.BufferedWriter(new java.io.FileWriter(new java.io.File(System.getProperty("user.dir") + java.io.File.separatorChar + TITLE + ver + ".jar"))); //Creates a new buffered writer which writes to a file adjacent to the JAR being run, whose name is the title of the application, then a space, then the version number of the update, then ".jar"
        updateDialogProgressBar.setValue(0);
        //updateDialogProgressBar.setMaximum(totalSize); //This is where I would input the total number of bytes in the target file
        updateDialogProgressBar.setIndeterminate(false);
        {
            for (int i, prog = 0; (i = is.read()) != -1; prog++)
            {
                bw.write(i);
                updateDialogProgressBar.setValue(prog);
            }
            bw.close();
            is.close();
        }
    }
    catch (Throwable t)
    {
        //Alert the user of a problem
    }
}
As you can see, I'm just trying to make a downloader with a progress bar, but I don't know how to tell the total size of the target file. How can I tell how many bytes are going to be downloaded before the file is done downloading?
A stream is a flow of bytes; you can't ask it how many bytes are remaining, you just read from it until it says "I'm done". Now, depending on how the connection that provides the stream is established, the underlying protocol (HTTP, for example) may know in advance the total length to be sent... or it may not. For this, see URLConnection.getContentLength(). But it might well return -1 (meaning "I don't know").
By the way, your code is not the proper way to read a stream of bytes and write it to a file. For one thing, you are using a Writer when you should use an OutputStream (you are converting from bytes to characters, and then back to bytes; this hurts performance and might corrupt everything if the received content is binary, or if the encodings don't match). Secondly, it's inefficient to read and write one byte at a time.
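A rough sketch of both points together (reusing latestUpdURL, TITLE, ver and updateDialogProgressBar from the question; the buffer size is arbitrary):

java.net.URLConnection conn = latestUpdURL.openConnection();
int totalSize = conn.getContentLength(); // may be -1 if the server doesn't report it
if (totalSize > 0) {
    updateDialogProgressBar.setMaximum(totalSize);
    updateDialogProgressBar.setIndeterminate(false);
}
try (java.io.InputStream in = conn.getInputStream();
     java.io.OutputStream out = new java.io.BufferedOutputStream(new java.io.FileOutputStream(
             System.getProperty("user.dir") + java.io.File.separatorChar + TITLE + ver + ".jar"))) {
    byte[] buf = new byte[8192];
    int read, progress = 0;
    while ((read = in.read(buf)) != -1) {
        out.write(buf, 0, read); // raw bytes, no character conversion
        progress += read;
        if (totalSize > 0) {
            updateDialogProgressBar.setValue(progress);
        }
    }
}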
To get the length of the file, you do this:
new File("System.getProperty("user.dir") + java.io.File.separatorChar + TITLE + ver + ".jar").length()