Mallet topic modelling - java

I have been using Mallet to infer topics for a text file containing 100,000 lines (around 34 MB in Mallet format). Now I need to run it on a file containing a million lines (around 180 MB), and I am getting a java.lang.OutOfMemoryError. Is there a way to split the file into smaller ones and build a single model from the data in all of the files combined?
Thanks in advance.

In bin/mallet.bat, increase the value on this line:
set MALLET_MEMORY=1G

I'm not sure how well Mallet scales to big data, but the project at http://dragon.ischool.drexel.edu/ can keep its data in disk-backed storage, so it can scale to corpora far larger than memory (at lower performance, of course).

The model is still going to be pretty huge, even if it is read from multiple files. Have you tried increasing the heap size of your Java VM?

A java.lang.OutOfMemoryError occurs mainly because of insufficient heap space.
You can use -Xms and -Xmx to set the initial and maximum heap size, which may keep it from happening again.
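To confirm that the heap settings actually reached the JVM, a small check like the following can help. This is only a sketch: the class name HeapCheck is made up for illustration, and it uses nothing beyond the standard Runtime API.

```java
class HeapCheck {
    /** Maximum heap the JVM will attempt to use, in MiB. */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long mib = 1024L * 1024L;
        // maxMemory reflects -Xmx; totalMemory is what is currently reserved.
        System.out.println("max heap   (MiB): " + maxHeapMiB());
        System.out.println("total heap (MiB): " + rt.totalMemory() / mib);
        System.out.println("free heap  (MiB): " + rt.freeMemory() / mib);
    }
}
```

Run it with, for example, java -Xms512m -Xmx2g HeapCheck and compare the reported maximum with the value you passed.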

Given the memory size of current PCs, it should be easy to use a heap as large as 2 GB.
You should try the single-machine solution before considering using a cluster.

Related

Efficient way of parsing multiple large csv files using CSVParser

I am iterating through all 30 large files, parsing each with CSVParser and converting each line to an object. I use Java 8's parallel streams in the hope of loading them in parallel, but I am getting a Java heap space error. I tried increasing the memory to -Xmx1024m but still got the heap space error. How can I load these files efficiently?
The problem is that you are attempting to load too much information into memory. Either way you do it (in parallel, or one file at a time), you will run out of memory if you want to hold too many objects in memory at the same time.
This is not an "efficiency" problem. It is a more fundamental problem with the design of your application. Ask yourself why you need to hold all of those objects in memory at the same time, and whether you can either avoid that or reduce the space needed to represent the information you need.
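As an illustration of not holding everything at once, here is a minimal sketch that reads one line at a time and keeps only a running total. It is not the asker's code: the class and method names are hypothetical, and it uses a naive String.split instead of CSVParser (so quoted fields are not handled) to stay self-contained.

```java
import java.io.BufferedReader;
import java.io.StringReader;

class StreamingSum {
    /**
     * Sums the numeric values in one column, skipping the header row.
     * BufferedReader.lines() is lazy, so only the current line is in memory.
     */
    static double sumColumn(BufferedReader reader, int column) {
        return reader.lines()
                .skip(1)                                // header row
                .filter(line -> !line.isEmpty())
                .mapToDouble(line ->
                        // naive split: assumes no quoted or escaped commas
                        Double.parseDouble(line.split(",", -1)[column].trim()))
                .sum();
    }

    public static void main(String[] args) {
        String csv = "id,amount\n1,10.5\n2,4.5\n";
        double total = sumColumn(new BufferedReader(new StringReader(csv)), 1);
        System.out.println(total); // prints 15.0
    }
}
```

For a real file you would pass Files.newBufferedReader(path) instead of the StringReader. The same idea applies with CSVParser if you iterate over its records rather than collecting them into a list.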

Heap space error with Apache POI XSSF

I am trying to parse a large Excel file (.xlsx) using the Apache POI XSSF library. After 100,000 rows it throws a heap space error. I tried increasing the memory but it does not help. Is there a workaround for this problem? Or can someone suggest another library for parsing large Excel files?
Thanks!
You can use the XSSF SAX (event) API: http://poi.apache.org/spreadsheet/how-to.html#xssf_sax_api
Have a look at this thread for details:
Efficient way to search records from an excel file using Apache-POI
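The reason the event API saves memory is that a SAX parser hands you one element at a time instead of building the whole document. The sketch below shows only that idea, using the JDK's own SAX parser on a small hand-written string shaped like sheet XML. It does not use POI; the POI how-to linked above describes how to get at the real sheet data. The RowCounter class is hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class RowCounter extends DefaultHandler {
    private int rows;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        // Called once per opening tag; nothing is retained except the counter.
        if ("row".equals(qName)) {
            rows++;
        }
    }

    /** Counts row elements in the given XML without building a document tree. */
    static int countRows(String sheetXml) {
        try {
            RowCounter handler = new RowCounter();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(sheetXml)), handler);
            return handler.rows;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sheet XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<worksheet><sheetData>"
                + "<row r=\"1\"><c r=\"A1\"><v>1</v></c></row>"
                + "<row r=\"2\"><c r=\"A2\"><v>2</v></c></row>"
                + "</sheetData></worksheet>";
        System.out.println(countRows(xml)); // prints 2
    }
}
```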
Try the latest (stable!) version of Apache POI.
An alternative might be smartXLS.
When facing the most common OutOfMemoryError, namely the one "java.lang.OutOfMemoryError: Java heap space", some simple aspects must first be understood.
Java applications are allowed to use a limited amount of memory. This limit is specified during application startup. To make things more complex, Java memory is separated into different regions, such as heap space and permgen.
The size of those regions is set during the Java Virtual Machine (JVM) launch by specifying parameters such as -Xmx and -XX:MaxPermSize (permgen was replaced by Metaspace in Java 8, where -XX:MaxMetaspaceSize applies instead). If you do not explicitly set the sizes, platform-specific defaults will be used.
So the "java.lang.OutOfMemoryError: Java heap space" error will be triggered when you try to add more data into the heap space area, but there is not enough room for it.
Based on this simple description, you have two options:
Give more room to the data structures
Reduce the size of the data structures used
Giving more room is easy - just increase the heap size by changing the -Xmx parameter, similar to the following example giving your Java process 1G of heap to play with:
java -Xmx1024m com.mycompany.MyClass
Reducing the size of the data structures typically takes more effort, but this might be necessary in order to get rid of the underlying problems - giving more room can sometimes just mask the symptoms and postpone the inevitable. For example, when facing a memory leak you are just postponing the time when all the memory is filled with leaking garbage.
In your case, reading the data in smaller batches and processing one batch at a time might be an option.
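Batch processing as described above could look roughly like this. It is a generic sketch, not tied to POI: BatchProcessor and forEachBatch are made-up names, and the iterator stands in for whatever row source you have.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

class BatchProcessor {
    /**
     * Passes items to the handler in lists of at most batchSize elements
     * and returns the number of batches handled.
     */
    static <T> int forEachBatch(Iterator<T> items, int batchSize, Consumer<List<T>> handler) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int batches = 0;
        List<T> batch = new ArrayList<>(batchSize);
        while (items.hasNext()) {
            batch.add(items.next());
            if (batch.size() == batchSize) {
                handler.accept(batch);
                batch = new ArrayList<>(batchSize); // the old batch can now be collected
                batches++;
            }
        }
        if (!batch.isEmpty()) { // final, possibly shorter batch
            handler.accept(batch);
            batches++;
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> rows = Arrays.asList(1, 2, 3, 4, 5, 6, 7);
        int n = forEachBatch(rows.iterator(), 3, batch -> System.out.println(batch));
        System.out.println(n + " batches"); // prints "3 batches"
    }
}
```

This only helps if the handler does not keep references to earlier batches; otherwise memory use grows just as before.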

Java big list object causing out of memory

I am using Java, Spring and iBATIS.
I have a Java-based reporting application which displays large amounts of data. I notice that when the system tries to process a large amount of data, it throws an "out of memory" error.
I know we can either increase the memory size or introduce paging in the reporting application.
Any ideas? I am curious whether there is something that, when a list object gets large enough, splits it between memory and disk, so that we don't have to make any major change to the application code.
Any suggestions appreciated.
The first thing to do should be to check exactly what is causing you to run out of memory.
Add the following to your command line
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/where/you/want
This will generate a heap dump hprof file.
You can use something like the Eclipse Memory Analyzer Tool (MAT) to see which part of the heap (if any) you need to increase, or whether you have a memory leak.

Modify big excel (xlsx) file in java

I have been using Apache POI to process data coming from big files. Reading was done via the SAX event API which makes it efficient for large sets of data without consuming much memory.
However, there is also a requirement to update an existing template for the final report. This template may be larger than 10 MB (even 20 MB in some cases).
Do you know a way to efficiently update a big template file (xlsx)? Currently I read the entire contents of the template into memory and modify them (using XSSF from POI). My current method works for small files (under 5 MB), but for bigger files it fails with an out of memory exception.
Is there a solution for this in Java? (not necessarily using Apache POI) Open source/free solutions are preferred but commercial is also good as long as it has a reasonable price.
Thank you,
Iulian
For large spreadsheet handling, it is advisable to use SXSSF.
As far as I know, the streaming classes are a bit slower than HSSF and XSSF, but far superior when it comes to memory usage (feel free to correct me).
Someone has written a few classes that can read xlsx files and process them as XML. They return an array of Strings, which are effectively the rows of the xlsx file.
Link:
You could then use these arrays to load them line by line in a stream, instead of all of them at once.
Most likely the message you are facing is about heap space (java.lang.OutOfMemoryError: Java heap space), which is triggered when you try to add more data to the heap than the JVM can accommodate. In many cases you are good to go with just increasing the heap size by specifying (or altering, if present) the -Xmx parameter, similar to the following:
-Xmx1024m

How to avoid Out of memory exception without increasing Heap size on eclipse

I have code which reads lots of data from a file and writes it to an Excel file. The problem I'm facing is that when the data goes beyond the heap size limit, it throws an out of memory exception. I tried increasing the heap size and the program ran normally. But there is limited RAM on my machine, and if I dedicate a huge amount of space to the heap, the machine becomes very slow. So, is there any way to free the memory after processing some amount of data, so that I don't need to increase my heap size to run my code? I'm relatively new to this kind of stuff, so please suggest some ideas.
In cases like this you need to restructure your code, so that it works with small chunks of data. Create a small buffer, read the data into it, process it and write it to the Excel file. Then continue with the next iteration, reading into the same buffer.
Of course, the Excel library you are using needs to be able to work like this and shouldn't require writing the whole file in a single go.
I think the JXL API provides this kind of limited-buffer functionality. It's an open source API.
http://jexcelapi.sourceforge.net/
You can use a direct buffer via ByteBuffer.allocateDirect(byteSize), or a memory-mapped file (a MappedByteBuffer obtained from FileChannel.map); both use memory outside the Java heap.
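A rough sketch of both off-heap options, using only java.nio. The class name OffHeapDemo and the newline-counting task are just for illustration. Keep in mind that off-heap memory still consumes RAM, direct buffers have their own limit (-XX:MaxDirectMemorySize), and a single mapping cannot exceed Integer.MAX_VALUE bytes.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class OffHeapDemo {
    /** Counts newline bytes in a file through a read-only memory mapping. */
    static long countNewlines(Path file) {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // The file contents are paged in by the OS, not copied onto the Java heap.
            MappedByteBuffer mapped =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            long count = 0;
            while (mapped.hasRemaining()) {
                if (mapped.get() == '\n') {
                    count++;
                }
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // A direct buffer is allocated outside the garbage-collected heap.
        ByteBuffer direct = ByteBuffer.allocateDirect(1024);
        System.out.println(direct.isDirect()); // prints true

        Path tmp = Files.createTempFile("offheap", ".txt");
        tmp.toFile().deleteOnExit();
        Files.write(tmp, "a\nb\nc\n".getBytes(StandardCharsets.UTF_8));
        System.out.println(countNewlines(tmp)); // prints 3
    }
}
```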
