I have been using Apache POI to process data coming from big files. Reading was done via the SAX event API which makes it efficient for large sets of data without consuming much memory.
However, there is also a requirement that I need to update an existing template for the final report. This template may be larger than 10 MB (even 20 MB in some cases).
Do you know a way to efficiently update a big template file (xlsx)? Currently I am reading the entire contents of the template into memory and modifying those contents (using XSSF from POI). My current method works for small files (under 5 MB), but for bigger files it fails with an out-of-memory exception.
Is there a solution for this in Java (not necessarily using Apache POI)? Open-source/free solutions are preferred, but commercial is also fine as long as the price is reasonable.
Thank you,
Iulian
For large spreadsheet handling, it is advisable to use SXSSF.
As far as I know, the streaming classes are a bit slower than HSSF and XSSF, but far superior when it comes to memory management (feel free to correct me).
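To make that concrete, here is a minimal sketch of writing through SXSSF's sliding window (the file name, sheet name and row counts are just placeholders for illustration). Only about 100 rows are kept on the heap at any time; older rows are flushed to a temporary file:

    import java.io.FileOutputStream;

    import org.apache.poi.ss.usermodel.Cell;
    import org.apache.poi.ss.usermodel.Row;
    import org.apache.poi.ss.usermodel.Sheet;
    import org.apache.poi.xssf.streaming.SXSSFWorkbook;

    public class StreamingWriteDemo {
        public static void main(String[] args) throws Exception {
            // Keep only 100 rows in memory; older rows are flushed to a temp file
            SXSSFWorkbook wb = new SXSSFWorkbook(100);
            try {
                Sheet sheet = wb.createSheet("report");
                for (int r = 0; r < 1_000_000; r++) {
                    Row row = sheet.createRow(r);
                    for (int c = 0; c < 10; c++) {
                        Cell cell = row.createCell(c);
                        cell.setCellValue("value " + r + ":" + c);
                    }
                }
                try (FileOutputStream out = new FileOutputStream("report.xlsx")) {
                    wb.write(out);
                }
            } finally {
                wb.dispose(); // delete the temporary files SXSSF created
            }
        }
    }

Note that SXSSF is write-oriented: it streams rows out, and rows that have already been flushed can no longer be revisited.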
This guy has made a few classes that can read .xlsx files and process them as XML. They return an array of Strings, and these Strings are essentially the rows of the .xlsx file.
Link:
You could then process these arrays line by line as a stream, instead of loading all of them at once.
Most likely the error you are facing is about heap space (java.lang.OutOfMemoryError: Java heap space), which is triggered when you try to add more data into the heap than the JVM can accommodate. In many cases you are good to go with just increasing the heap size by specifying (or altering, if present) the -Xmx parameter, similar to the following:
-Xmx1024m
Related
I am iterating through 30 large files, parsing each one with CSVParser and converting each line to some object. I use Java 8's parallel streams in the hope of loading them in parallel. But I am getting a Java heap space error. I tried increasing the memory to -Xmx1024m but still got the heap space error. How should I be loading these files efficiently?
The problem is that you are attempting to load too much information into memory. Either way you do it (in parallel, or one file at a time), you will run out of memory if you want to hold too many objects in memory at the same time.
This is not an "efficiency" problem. It is a more fundamental problem with the design of your application. Ask yourself why you need to hold all of those objects in memory at the same time, and whether you can either avoid that or reduce the space needed to represent the information you need.
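As a rough sketch of that idea (the file names and per-line work here are placeholders), you can stream each file line by line and keep only the aggregate you actually need, instead of collecting every parsed object:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.Arrays;
    import java.util.List;
    import java.util.stream.Stream;

    public class LineByLine {
        public static void main(String[] args) throws IOException {
            // Placeholder paths; in practice these would be the 30 CSV files
            List<Path> files = Arrays.asList(Paths.get("data1.csv"), Paths.get("data2.csv"));

            long usableRows = 0;
            for (Path file : files) {
                // Files.lines is lazy, so only a small window of each file sits on the heap
                try (Stream<String> lines = Files.lines(file)) {
                    usableRows += lines
                            .map(line -> line.split(","))     // parse the line
                            .filter(cols -> cols.length > 1)  // keep only rows you actually need
                            .count();                         // aggregate instead of collecting objects
                }
            }
            System.out.println("Processed " + usableRows + " usable rows");
        }
    }

If you really must keep results, write them out incrementally (to a file or database) as you go rather than holding them all in collections.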
I am trying to parse a large Excel file (.xlsx) using the Apache POI XSSF library. After 100,000 rows it throws a heap space error. I tried increasing the memory but it does not help. Is there a workaround for this problem? Or can someone suggest another library for parsing large Excel files?
Thanks!
You can use http://poi.apache.org/spreadsheet/how-to.html#xssf_sax_api
Have a look at this thread for details.
Efficient way to search records from an excel file using Apache-POI
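Roughly, the SAX-based reading from that how-to page looks like the sketch below (the file name and the row handling are placeholders; note that string cells store an index into the shared strings table, which you would resolve via the reader as in the full how-to example):

    import java.io.InputStream;
    import java.util.Iterator;

    import javax.xml.parsers.SAXParserFactory;

    import org.apache.poi.openxml4j.opc.OPCPackage;
    import org.apache.poi.openxml4j.opc.PackageAccess;
    import org.apache.poi.xssf.eventusermodel.XSSFReader;
    import org.xml.sax.Attributes;
    import org.xml.sax.InputSource;
    import org.xml.sax.XMLReader;
    import org.xml.sax.helpers.DefaultHandler;

    public class SheetSaxReader {
        public static void main(String[] args) throws Exception {
            // Open the package read-only; the full XSSF object model is never built
            OPCPackage pkg = OPCPackage.open("big.xlsx", PackageAccess.READ);
            try {
                XSSFReader reader = new XSSFReader(pkg);
                XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
                parser.setContentHandler(new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName, String qName, Attributes attrs) {
                        if ("row".equals(qName)) {
                            // handle one row at a time here instead of keeping them all in memory
                        }
                    }
                });

                // Each sheet's XML is streamed through the SAX handler
                Iterator<InputStream> sheets = reader.getSheetsData();
                while (sheets.hasNext()) {
                    try (InputStream sheet = sheets.next()) {
                        parser.parse(new InputSource(sheet));
                    }
                }
            } finally {
                pkg.close();
            }
        }
    }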
Try the latest (stable!) version of Apache POI.
An alternative might be smartXLS.
When facing the most common OutOfMemoryError, namely the one "java.lang.OutOfMemoryError: Java heap space", some simple aspects must first be understood.
Java applications are allowed to use only a limited amount of memory. This limit is specified during application startup. To make things more complex, Java memory is separated into different regions, named heap space and permgen.
The size of those regions is set during the Java Virtual Machine (JVM) launch by specifying parameters such as -Xmx and -XX:MaxPermSize. If you do not explicitly set the sizes, platform-specific defaults will be used.
So, the “java.lang.OutOfMemoryError: Java heap space” error will be triggered when you try to add more data into the heap space area, but there is not enough room for it.
Based on this simple description, you have two options:
Give more room to the data structures
Reduce the size of the data structures used
Giving more room is easy - just increase the heap size by changing the -Xmx parameter, similar to the following example giving your Java process 1G of heap to play with:
java -Xmx1024m com.mycompany.MyClass
Reducing the size of the data structures typically takes more effort, but this might be necessary in order to get rid of the underlying problems - giving more room can sometimes just mask the symptoms and postpone the inevitable. For example, when facing a memory leak you are just postponing the time when all the memory is filled with leaking garbage.
In your case, reading the data in smaller batches and processing one batch at a time might be an option.
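As a rough sketch of that batching idea (the file name, batch size and per-batch work are placeholders), read a fixed number of lines, process them, then clear the buffer before reading the next chunk:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.List;

    public class BatchProcessor {
        private static final int BATCH_SIZE = 10_000;

        public static void main(String[] args) throws IOException {
            List<String> batch = new ArrayList<>(BATCH_SIZE);
            try (BufferedReader reader = Files.newBufferedReader(Paths.get("big-input.txt"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    batch.add(line);
                    if (batch.size() == BATCH_SIZE) {
                        process(batch); // work on this chunk
                        batch.clear();  // the heap never holds more than one batch
                    }
                }
                if (!batch.isEmpty()) {
                    process(batch);     // handle the final partial batch
                }
            }
        }

        private static void process(List<String> batch) {
            // placeholder for the real per-batch work
            System.out.println("processed " + batch.size() + " lines");
        }
    }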
I have a program that parses through hundreds of thousands of files, stores data from each file and, towards the end, prints some of the extracted data into an Excel document.
These are some of the errors I encountered and handled in regards to memory:
java.lang.OutOfMemoryError: Java heap space
Increased memory to 2 GB
Error occurred during initialization of VM. Could not reserve enough space for 2097152KB object heap
Downloaded JRE 8 for a 64-bit machine and set -d64 as one of the default VM arguments
java.lang.OutOfMemoryError: GC overhead limit exceeded
Increased the Java heap from 2 GB to 3 GB and included the -XX:-UseGCOverheadLimit argument
So now my default VM arguments are: -d64 -Xmx3g -XX:-UseGCOverheadLimit
The issue is that my program runs for several hours, reads in and stores all of the information I need from these files, and then, if a memory error occurs, it throws it at the very end while trying to print everything out.
What I'm wondering is whether there is a way to store the extracted data and access it again even if the program exits due to an error. I want to store the data in the same format I use in the program. For instance, say I have several hundred thousand files of user records; I traverse all of them, store the extracted data in user objects, and keep these user objects and other custom objects in HashMaps and LinkedLists. Is there a way to store these user objects, HashMaps and LinkedLists so that, even if the program exits due to an error, I can write another program that goes through the objects stored so far and prints out the information I want, without having to go through the process of reading in, extracting and storing the information all over again?
One way of doing so is called serialization. (What is object serialization?).
However, depending on your data, you could just write your information into a handy XML file and, after extracting all the data, simply load the XML and proceed further.
Hope that helps.
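As a rough illustration of the serialization route (the User class and file name below are made up), a checkpoint could look like this:

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.ObjectInputStream;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;
    import java.util.HashMap;

    public class SnapshotDemo {

        // Hypothetical user class; everything it references must also be Serializable
        static class User implements Serializable {
            private static final long serialVersionUID = 1L;
            final String name;
            User(String name) { this.name = name; }
        }

        public static void main(String[] args) throws Exception {
            HashMap<String, User> users = new HashMap<>();
            users.put("u1", new User("Alice"));

            // Dump the whole map to disk, e.g. at regular checkpoints while parsing
            try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream("users.ser"))) {
                out.writeObject(users);
            }

            // A separate program (or a later run) can restore it without re-parsing the source files
            try (ObjectInputStream in = new ObjectInputStream(new FileInputStream("users.ser"))) {
                @SuppressWarnings("unchecked")
                HashMap<String, User> restored = (HashMap<String, User>) in.readObject();
                System.out.println(restored.size() + " users restored");
            }
        }
    }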
First of all, it is very rare that you need this much text data in memory at the same time, and can't use and discard it iteratively.
If you really need to operate on this much data, consider using a map-reduce framework (such as those that Google provides). It will solve both speed and memory problems.
Finally, if you are really sure you can't solve your problem in the other two ways, or if the map-reduce setup is not worth it to you, then your only option is to write the data to a file (somewhere). A good way to serialize your data is to use JSON. Google's Gson and also Jackson 2 are popular libraries for doing this.
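For example, a small sketch using Gson (the UserRecord class and file name are made up for illustration):

    import java.io.FileReader;
    import java.io.FileWriter;
    import java.util.Arrays;
    import java.util.List;

    import com.google.gson.Gson;

    public class JsonDump {

        // Hypothetical data holder, just for illustration
        static class UserRecord {
            String name;
            int visits;
            UserRecord(String name, int visits) { this.name = name; this.visits = visits; }
        }

        public static void main(String[] args) throws Exception {
            Gson gson = new Gson();
            List<UserRecord> records = Arrays.asList(new UserRecord("Alice", 3), new UserRecord("Bob", 7));

            // Write the extracted data to a JSON file
            try (FileWriter out = new FileWriter("records.json")) {
                gson.toJson(records, out);
            }

            // A later program can reload it instead of re-reading the original input files
            try (FileReader in = new FileReader("records.json")) {
                UserRecord[] restored = gson.fromJson(in, UserRecord[].class);
                System.out.println(restored.length + " records restored");
            }
        }
    }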
I have code that reads lots of data from a file and writes it to an Excel file. The problem I'm facing is that when the data goes beyond the heap size limit, it throws an out-of-memory exception. I tried increasing the heap size and the program ran normally. But the problem is that there is limited RAM on my machine, and if I dedicate huge space to the heap, the machine becomes very slow. So, is there any way to free the memory after processing some amount of data, so that I need not increase my heap size to run my code? I'm relatively new to this kind of stuff, so please suggest some ideas.
In cases like this you need to restructure your code, so that it works with small chunks of data. Create a small buffer, read the data into it, process it and write it to the Excel file. Then continue with the next iteration, reading into the same buffer.
Of course the Excel library you are using needs to be able to work like this and shouldn't require writing the whole file in a single go.
I think the JXL API provides this kind of limited-buffer functionality. It's an open-source API.
http://jexcelapi.sourceforge.net/
You can use a direct ByteBuffer (ByteBuffer.allocateDirect(byteSize)) or a memory-mapped file (a MappedByteBuffer obtained via FileChannel.map); both use memory outside the Java heap.
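As a rough sketch of the memory-mapped variant (the file name is a placeholder; note that a single mapping is limited to about 2 GB, so very large files have to be mapped in slices):

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class MappedRead {
        public static void main(String[] args) throws Exception {
            // Map the file into memory outside the Java heap; the OS pages it in on demand
            try (RandomAccessFile raf = new RandomAccessFile("big-input.dat", "r");
                 FileChannel channel = raf.getChannel()) {
                MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
                long byteCount = 0;
                while (buffer.hasRemaining()) {
                    buffer.get(); // read a byte without copying the whole file onto the heap
                    byteCount++;
                }
                System.out.println("Scanned " + byteCount + " bytes");
            }
        }
    }

Keep in mind this only keeps the raw input off the heap; whatever you build for the Excel output still needs to be written in chunks.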
I am getting OutOfMemoryErrors when uploading large (>300MB) files to a servlet utilizing Commons FileUpload 1.2.1. It seems odd, because the entire point of using DiskFileItem is to prevent the (possibly large) file from residing in memory. I am using the default size threshold of 10KB, so that's all that should ever be loaded into the heap, right? Here is the partial stack trace:
java.lang.OutOfMemoryError
at java.io.FileInputStream.readBytes(Native Method)
at java.io.FileInputStream.read(FileInputStream.java:177)
at org.apache.commons.fileupload.disk.DiskFileItem.get(DiskFileItem.java:334)
at org.springframework.web.multipart.commons.CommonsMultipartFile.getBytes(CommonsMultipartFile.java:114)
Why is this happening? Is there some configuration I'm missing? Any tips/tricks to avoid this situation besides increasing my heap size?
I really shouldn't have to increase my heap, because in theory the most that should be loaded into memory from this operation is a little over 10KB. Plus, my heap max (-Xmx) is already set for 1GB which should be plenty.
When dealing with file uploads, especially big ones, you should process those files as streams which you slurp into a medium-size in-memory buffer and copy directly into your output file. The wrong way to do it is to inhale the whole thing into memory before writing it out.
The commons-fileupload documentation mentions, just below the middle of the page, how to "Process a file upload". If you remember to copy from the InputStream to the OutputStream in reasonably sized chunks (say, 1 MB), you should have no problem.
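To make that concrete, a minimal copy loop might look like the sketch below (the class and method names are made up; if I recall correctly, the streaming API hands you the InputStream via FileItemStream.openStream(), and you would copy it straight into a FileOutputStream):

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;

    public final class StreamCopy {

        // Copy from the upload's InputStream to any OutputStream in 1 MB chunks,
        // so only one buffer's worth of the upload is ever on the heap at a time.
        public static long copy(InputStream in, OutputStream out) throws IOException {
            byte[] buffer = new byte[1024 * 1024];
            long total = 0;
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
                total += read;
            }
            out.flush();
            return total;
        }
    }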
While Carl Smotricz's answer is probably better in the general case, the exception you get is a JVM bug that is reported here:
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6478546