Storing temporary data even after program exits - Java

I have a program that parses through hundreds of thousands of files, stores data from each file and, towards the end, prints some of the extracted data into an Excel document.
These are some of the errors I encountered and handled with regard to memory:
java.lang.OutOfMemoryError: Java heap space
Increased memory to 2 GB.
Error occurred during initialization of VM. Could not reserve enough space for 2097152KB object heap
Downloaded the 64-bit JRE 8 and set -d64 as one of the default VM arguments.
java.lang.OutOfMemoryError: GC overhead limit exceeded
Increased the Java heap from 2 GB to 3 GB and included the argument -XX:-UseGCOverheadLimit.
So now my default VM arguments are: -d64 -Xmx3g -XX:-UseGCOverheadLimit
The issue is that my program runs for several hours, reads in and stores all of the information I need from these files, and then throws a memory error at the end while it is trying to print everything, losing all of that work.
What I'm wondering is whether there is a way to store the extracted data and access it again even if the program exits due to an error, keeping the data in the same form I use in the program. For instance, say I have several hundred thousand files of user records; I traverse all of them, store the extracted data in user objects, and keep these user objects and other personally defined objects in HashMaps and LinkedLists. Is there a way to persist these user objects, HashMaps and LinkedLists so that, even if the program exits due to an error, I can write another program that goes through the objects stored so far and prints out the information I want, without having to repeat the whole process of reading in, extracting and storing the information?

One way of doing so is called serialization. (What is object serialization?).
Alternatively, depending on your data, you could just write your information into a handy XML file and, after extracting all the data, load the XML and proceed from there.
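As a minimal sketch of the serialization idea (the User class, the map and the snapshot file name are illustrative): write the collections to disk periodically, or from a catch block, so a separate program can read them back. Every object reachable from the map has to implement Serializable for this to work.

import java.io.*;
import java.util.HashMap;

public class SnapshotDemo {

    // Illustrative user type; every field it holds must also be Serializable.
    static class User implements Serializable {
        private static final long serialVersionUID = 1L;
        String name;
        int age;
        User(String name, int age) { this.name = name; this.age = age; }
    }

    // Write the whole map to disk so a later run (or another program) can reload it.
    static void save(HashMap<String, User> users, File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(users);
        }
    }

    @SuppressWarnings("unchecked")
    static HashMap<String, User> load(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return (HashMap<String, User>) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        HashMap<String, User> users = new HashMap<>();
        users.put("alice", new User("Alice", 30));
        save(users, new File("users.snapshot"));   // call this periodically while parsing
        System.out.println(load(new File("users.snapshot")).size());
    }
}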
Hope that helps.

First of all, it is very rare that you need this much text data in memory at the same time, and can't use and discard it iteratively.
If you really need to operate on this much data, consider using a map-reduce framework (such as those that Google provides). It will solve both speed and memory problems.
Finally, if you are really sure you can't solve your problem in the other two ways, or if a map-reduce setup is not worth it to you, then your only option is to write the data to a file (somewhere). A good way to serialize your data is to use JSON; Google's Gson and Jackson 2 are popular libraries for this.
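For example, a small sketch using Gson (the User type and the file name are illustrative); the same round trip looks very similar with Jackson:

import com.google.gson.Gson;
import com.google.gson.reflect.TypeToken;

import java.io.FileReader;
import java.io.FileWriter;
import java.lang.reflect.Type;
import java.util.HashMap;

public class JsonSnapshot {

    // Illustrative record type; Gson serializes plain fields without extra annotations.
    static class User {
        String name;
        int age;
        User(String name, int age) { this.name = name; this.age = age; }
    }

    public static void main(String[] args) throws Exception {
        Gson gson = new Gson();
        HashMap<String, User> users = new HashMap<>();
        users.put("alice", new User("Alice", 30));

        // Write the map out as JSON.
        try (FileWriter writer = new FileWriter("users.json")) {
            gson.toJson(users, writer);
        }

        // Read it back in a later run; TypeToken preserves the generic map type.
        Type mapType = new TypeToken<HashMap<String, User>>() {}.getType();
        try (FileReader reader = new FileReader("users.json")) {
            HashMap<String, User> restored = gson.fromJson(reader, mapType);
            System.out.println(restored.get("alice").age);
        }
    }
}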

Related

Efficient way of parsing multiple large csv files using CSVParser

I am iterating through all 30 large files, parsing each using CSVParser and converting each line to an object. I use Java 8's parallel stream in the hope of loading them in parallel, but I am getting a Java heap space error. I tried increasing the memory to -Xmx1024m but still got the heap space error. How should I load these files efficiently?
The problem is that you are attempting to load too much information into memory. Either way you do it (in parallel, or one file at a time), you will run out of memory if you want to hold too many objects in memory at the same time.
This is not an "efficiency" problem. It is a more fundamental problem with the design of your application. Ask yourself why you need to hold all of those object in memory at the same time, and whether you can either avoid that or reduce the space needed to represent the information you need.

Heap space error with Apache POI XSSF

I am trying to parse a large Excel file (.xlsx) using the Apache POI XSSF library. After 100,000 rows it throws a heap space error. I tried increasing the memory but it does not help. Is there a workaround for this problem, or can someone suggest another library for parsing large Excel files?
Thanks!
You can use http://poi.apache.org/spreadsheet/how-to.html#xssf_sax_api
Have a look at this thread for details.
Efficient way to search records from an excel file using Apache-POI
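A minimal sketch of that SAX/event approach, assuming Apache POI's XSSF event API (exact class names and constructor signatures have shifted slightly between POI versions, and the file name is illustrative). Cells arrive as events, so the workbook is never loaded into memory:

import java.io.InputStream;

import javax.xml.parsers.SAXParserFactory;

import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xssf.eventusermodel.ReadOnlySharedStringsTable;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler;
import org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.SheetContentsHandler;
import org.apache.poi.xssf.usermodel.XSSFComment;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class LargeXlsxReader {

    // Receives one cell at a time; only the current row's data ever needs to be kept.
    static class PrintingHandler implements SheetContentsHandler {
        public void startRow(int rowNum) { }
        public void endRow(int rowNum) { System.out.println(); }
        public void cell(String cellReference, String formattedValue, XSSFComment comment) {
            System.out.print(formattedValue + "\t");
        }
        public void headerFooter(String text, boolean isHeader, String tagName) { }
    }

    public static void main(String[] args) throws Exception {
        try (OPCPackage pkg = OPCPackage.open("big-file.xlsx")) {   // illustrative file name
            XSSFReader xssfReader = new XSSFReader(pkg);
            ReadOnlySharedStringsTable strings = new ReadOnlySharedStringsTable(pkg);

            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader xmlReader = factory.newSAXParser().getXMLReader();
            xmlReader.setContentHandler(new XSSFSheetXMLHandler(
                    xssfReader.getStylesTable(), strings, new PrintingHandler(), false));

            // Parse the first sheet as a stream of SAX events; iterate further for more sheets.
            try (InputStream sheet = xssfReader.getSheetsData().next()) {
                xmlReader.parse(new InputSource(sheet));
            }
        }
    }
}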
Try the latest (stable!) version of Apache POI.
An alternative might be smartXLS.
When facing the most common OutOfMemoryError, namely "java.lang.OutOfMemoryError: Java heap space", a few basic aspects must first be understood.
Java applications are allowed to use only a limited amount of memory. This limit is specified during application startup. To make things more complex, Java memory is separated into different regions named heap space and permgen.
The size of those regions is set during Java Virtual Machine (JVM) launch by specifying parameters such as -Xmx and -XX:MaxPermSize. If you do not explicitly set the sizes, platform-specific defaults will be used.
So the "java.lang.OutOfMemoryError: Java heap space" error is triggered when you try to add more data into the heap space area but there is not enough room for it.
Based on this simple description, you have two options:
Give more room to the data structures
Reduce the size of the data structures used
Giving more room is easy - just increase the heap size by changing the -Xmx parameter, similar to the following example giving your Java process 1G of heap to play with:
java -Xmx1024m com.mycompany.MyClass
Reducing the size of the data structures typically takes more effort, but this might be necessary in order to get rid of the underlying problems - giving more room can sometimes just mask the symptoms and postpone the inevitable. For example, when facing a memory leak you are just postponing the time when all the memory is filled with leaking garbage.
In your case, reading the data in smaller batches and processing one batch at a time might be an option.
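A minimal sketch of that batching idea, with a hypothetical input file and process() step; the point is that only one batch of lines is ever alive on the heap:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class BatchProcessor {

    private static final int BATCH_SIZE = 10_000;

    public static void main(String[] args) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(Paths.get("big-input.txt"))) { // illustrative
            List<String> batch = new ArrayList<>(BATCH_SIZE);
            String line;
            while ((line = reader.readLine()) != null) {
                batch.add(line);
                if (batch.size() == BATCH_SIZE) {
                    process(batch);   // write results out here, then let the batch be collected
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                process(batch);       // final partial batch
            }
        }
    }

    // Hypothetical processing step: extract what you need and append it to the output.
    static void process(List<String> batch) {
        System.out.println("processed " + batch.size() + " lines");
    }
}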

Java big list object causing out of memory

I am using Java with Spring and iBatis.
I have a Java-based reporting application which displays a large amount of data. I notice that when the system tries to process a large amount of data it throws an "out of memory" error.
I know we can either increase the memory size or introduce paging in the reporting application.
Any ideas? I am curious whether there is something that, if the list object is large enough, splits it between memory and disk so we don't have to make any major changes to the application code.
Any suggestion is appreciated.
The first thing to do should be to check exactly what is causing you to run out of memory.
Add the following to your command line
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/where/you/want
This will generate a heap dump hprof file.
You can use something like the Eclipse Memory Analyser Tool to see which part of the heap (if at all) you need to increase or whether you have a memory leak.

How to avoid Out of memory exception without increasing Heap size on eclipse

I have code which reads lots of data from a file and writes it to an Excel file. The problem I'm facing is that when the data goes beyond the limit of the heap size, it throws an out of memory exception. I tried increasing the heap size and the program ran normally. But there is limited RAM on my machine, and if I dedicate a huge amount of space to the heap, the machine becomes very slow. So, is there any way to free the memory after processing some portion of the data so that I need not increase my heap size to run my code? I'm relatively new to this kind of thing, so please suggest some ideas.
In cases like this you need to restructure your code, so that it works with small chunks of data. Create a small buffer, read the data into it, process it and write it to the Excel file. Then continue with the next iteration, reading into the same buffer.
Of course the Excel library you are using needs to be able to work like this and shouldn't require writing the whole file in a single go.
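As one illustration of this chunked approach (not necessarily the library the poster is using), Apache POI's streaming SXSSFWorkbook keeps only a sliding window of rows on the heap and flushes earlier rows to a temporary file. A minimal sketch with an illustrative output file and data source:

import java.io.FileOutputStream;

import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;

public class StreamingExcelWriter {

    public static void main(String[] args) throws Exception {
        // Keep at most 100 rows in memory; older rows are flushed to a temp file.
        try (SXSSFWorkbook workbook = new SXSSFWorkbook(100);
             FileOutputStream out = new FileOutputStream("report.xlsx")) {   // illustrative name
            Sheet sheet = workbook.createSheet("data");

            for (int i = 0; i < 1_000_000; i++) {          // illustrative data source
                Row row = sheet.createRow(i);
                row.createCell(0).setCellValue(i);
                row.createCell(1).setCellValue("value-" + i);
            }

            workbook.write(out);
            workbook.dispose();   // delete the temporary files backing the flushed rows
        }
    }
}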
I think the JXL API provides this kind of limited-buffer functionality. It's an open source API.
http://jexcelapi.sourceforge.net/
You can use a direct ByteBuffer via ByteBuffer.allocateDirect(byteSize), or a memory-mapped file; both use memory outside the heap.
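A small sketch of both options, with an illustrative file name; note that off-heap memory still has to be sized and managed, it just doesn't count against -Xmx:

import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class OffHeapExamples {

    public static void main(String[] args) throws Exception {
        // Direct buffer: 64 MB allocated outside the Java heap.
        ByteBuffer direct = ByteBuffer.allocateDirect(64 * 1024 * 1024);
        direct.putLong(42L);

        // Memory-mapped file: the OS pages the file in and out on demand.
        try (RandomAccessFile raf = new RandomAccessFile("data.bin", "r");   // illustrative file
             FileChannel channel = raf.getChannel()) {
            MappedByteBuffer mapped = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            System.out.println("first byte: " + mapped.get(0));
        }
    }
}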

Java Memory Leak Due to Massive Data Processing

I am currently developing an application that processes several files, each containing around 75,000 records (stored in binary format). When this app is run (manually, about once a month), about 1 million records are contained in the files in total. Files are put in a folder, I click process, and the app stores the data into a MySQL database (table_1).
The records contain information that needs to be compared to another table (table_2) containing over 700k records.
I have gone about this a few ways:
METHOD 1: Import Now, Process Later
In this method, I would import the data into the database without any processing from the other table. However, when I wanted to run a report on the collected data, it would crash, presumably due to a memory leak (1 GB used in total before the crash).
METHOD 2: Import Now, Use MySQL to Process
This is what I would like to do, but in practice it didn't seem to turn out so well. Here I would write the logic for finding the correlations between table_1 and table_2 in MySQL. However, the MySQL result is massive and I couldn't get consistent output, with MySQL sometimes giving up.
METHOD 3: Import Now, Process Now
I am currently trying this method, and although the memory leak is subtle, it still only gets to about 200,000 records before crashing. I have tried numerous forced garbage collections along the way, properly destroying objects, etc. It seems something is fighting me.
I am at my wits' end trying to solve this memory leak / app crash issue. I am no expert in Java and have yet to really deal with very large amounts of data in MySQL. Any guidance would be extremely helpful. I have put thought into these methods:
Break the processing of each line into an individual class, hopefully releasing the memory used for each line
Some sort of stored routine where once a line is stored into the database, MySQL does the table_1 <=> table_2 computation and stores the result
But I would like to pose the question to the many skilled Stack Overflow members to learn properly how this should be handled.
I concur with the answers that say "use a profiler".
But I'd just like to point out a couple of misconceptions in your question:
The storage leak is not due to massive data processing. It is due to a bug. The "massiveness" simply makes the symptoms more apparent.
Running the garbage collector won't cure a storage leak. The JVM always runs a full garbage collection immediately before it decides to give up and throw an OOME.
It is difficult to give advice on what might actually be causing the storage leak without more information on what you are trying to do and how you are doing it.
The learning curve for a profiler like VisualVM is pretty small. With luck, you'll have an answer - at least a very big clue - within an hour or so.
You properly handle this situation by either:
generating a heap dump when the app crashes and analyzing it in a good memory profiler
hooking up the running app to a good memory profiler and looking at the heap
I personally prefer yjp, but there are some decent free tools as well (e.g. jvisualvm and NetBeans).
Without knowing too much about what you're doing: if you're running out of memory, there's likely some point where you're storing everything in the JVM, but you should be able to do a data-processing task like this without the severe memory problems you're experiencing. In the past, I've seen data-processing pipelines that run out of memory because one class reads everything out of the db, wraps it all up in a nice collection, and then passes it off to another class, which of course requires all of the data to be in memory simultaneously. Frameworks are good at hiding this sort of thing.
Heap dumps and digging with VisualVM haven't been terribly helpful for me, as the details I'm looking for are often hidden. For example, if you've got a ton of memory filled with maps of strings, it doesn't really help to be told that Strings are the largest component of your memory usage; you need to know who owns them.
Can you post more detail about the actual problem you're trying to solve?
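To illustrate the "process as you read, don't collect everything" point, here is a sketch of row-at-a-time processing over JDBC. The Integer.MIN_VALUE fetch size is the MySQL Connector/J convention for streaming result sets; the connection settings, query and handleRow() step are hypothetical:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class StreamingReport {

    public static void main(String[] args) throws SQLException {
        String url = "jdbc:mysql://localhost:3306/mydb";   // illustrative connection settings
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             Statement stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY,
                                                   ResultSet.CONCUR_READ_ONLY)) {
            // MySQL Connector/J streams rows instead of buffering the whole result set
            // when the statement is forward-only, read-only and the fetch size is Integer.MIN_VALUE.
            stmt.setFetchSize(Integer.MIN_VALUE);

            try (ResultSet rs = stmt.executeQuery("SELECT id, payload FROM table_1")) { // hypothetical query
                while (rs.next()) {
                    handleRow(rs.getLong("id"), rs.getString("payload"));
                }
            }
        }
    }

    // Hypothetical per-row step: compare against table_2, write the result out, keep nothing.
    static void handleRow(long id, String payload) {
        System.out.println(id + " -> " + payload.length());
    }
}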