I am trying to read a file (tab- or comma-separated) with roughly 3M rows in Java; I have also raised the JVM memory with -Xmx6g. The code works fine up to about 400K rows for the tab-separated file and slightly fewer for the CSV file. There are many LinkedHashMaps and Vectors involved, and I call System.gc() after every few hundred rows to try to free memory and discarded values. However, my code gives the following error after 400K rows.
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Vector.<init>(Vector.java:111)
at java.util.Vector.<init>(Vector.java:124)
at java.util.Vector.<init>(Vector.java:133)
at cleaning.Capture.main(Capture.java:110)
Your attempt to load the whole file is fundamentally ill-fated. You may optimize all you want, but you'll just be pushing the upper limit slightly higher. What you need is to remove the limit itself.
There is only a negligible chance that you actually need the whole contents in memory all at once. You probably need to calculate something from that data, so you should start working out a way to make that calculation chunk by chunk, each time being able to throw away the processed chunk.
If your data is deeply intertwined, preventing you from serializing your calculation, then the reasonable recourse is, as HovercraftFOE mentions above, to transfer the data into a database and work from there, indexing everything you need, normalizing it, etc.
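As a minimal sketch of the chunked approach (the file name, delimiter, and per-row handling below are placeholders, not your actual code), you can stream the file one line at a time and fold each row into whatever running result you need:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class ChunkedRead {
        public static void main(String[] args) throws IOException {
            long rowCount = 0;
            // Hypothetical input file; only one line is held in memory at a time.
            try (BufferedReader reader = Files.newBufferedReader(Paths.get("data.tsv"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] fields = line.split("\t");
                    // Update running totals or write partial results here,
                    // then let 'fields' become garbage instead of storing it.
                    rowCount++;
                }
            }
            System.out.println("Processed " + rowCount + " rows");
        }
    }

With this pattern there is no need to call System.gc() by hand; each processed row becomes unreachable and is collected normally.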
I am developing a text analysis program that represents documents as arrays of "feature counts" (e.g., occurrences of a particular token) within some pre-defined feature space. These arrays are stored in an ArrayList after some processing.
I am testing the program on a 64 MB dataset with 50,000 records. The program worked fine with small data sets, but now it consistently throws an "out of memory" Java heap exception when I start loading the arrays into an ArrayList object (using the .add(double[]) method). Depending on how much memory I allocate to the heap, I get this exception somewhere between the 1,000th and 3,000th addition to the ArrayList, far short of my 50,000 entries. It became clear to me that I cannot store all this data in RAM and operate on it as usual.
However, I'm not sure what data structures are best suited to let me access and perform calculations on the entire dataset when only part of it can be loaded into RAM.
I was thinking that serializing the data to disk and storing the file offsets in a HashMap in RAM would be useful. However, I have also seen discussions of caching and buffered processing.
I'm 100% sure this is a common CS problem, so I'm sure there are several clever ways that this has been addressed. Any pointers would be appreciated :-)
You have plenty of choices:
Increase heap size (-Xmx) to several gigabytes.
Do not use boxing collections; use fastutil instead - that should decrease your memory use roughly 4x (see the sketch after this list). http://fastutil.di.unimi.it/
Process your data in batches or sequentially - do not keep whole dataset in memory simultaneously.
Use a proper database. There are even in-process databases such as HSQLDB; your mileage may vary.
Process your data via map-reduce, perhaps with something that can run locally such as Apache Pig.
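As a quick, hedged illustration of the fastutil point above (the numbers are arbitrary), a primitive-backed list stores raw doubles instead of boxed Double objects:

    import it.unimi.dsi.fastutil.doubles.DoubleArrayList;

    public class FastutilExample {
        public static void main(String[] args) {
            // Backed by a plain double[], so memory use stays close to the raw data size,
            // unlike ArrayList<Double>, which boxes every element.
            DoubleArrayList counts = new DoubleArrayList();
            for (int i = 0; i < 1_000_000; i++) {
                counts.add(i * 0.5);
            }
            System.out.println(counts.getDouble(42)); // primitive accessor, no unboxing
        }
    }

Since your records are already double[] arrays, the bigger win is usually the batching point: fold each array into running statistics as you read it, rather than keeping all 50,000 arrays in an ArrayList.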
How about using Apache Spark (great for in-memory cluster computing)? It would help scale your infrastructure as your data set gets larger.
Perhaps I'm doing this the wrong way:
I have a 4GB file (33 million lines of text), where each line has a string in it.
I'm trying to create a trie, and the algorithm itself works.
The problem is that Node.js has a process memory limit of 1.4GB, so the moment I process 5.5 million lines, it crashes.
To get around this, I tried the following:
Instead of 1 Trie, I create many Tries, each having a range of the alphabet.
For example:
aTrie ---> all words starting with a
bTrie ---> all words starting with b...
etc...
But the problem is, I still can't keep all the objects in memory while reading the file, so each time I read a line I load/unload a trie from disk. When there is a change, I delete the old file and write the updated trie from memory to disk.
This is SUPER SLOW, even on my MacBook Pro with an SSD.
I've considered writing this in Java, but then the problem of converting Java objects to JSON comes up (the same problem applies to C++, etc.).
Any suggestions ?
You may extend the memory limit that the node process uses by specifying the option below (the size is in MB):
node --max_old_space_size=4096
For more options, please see:
https://github.com/thlorenz/v8-flags/blob/master/flags-0.11.md
Instead of using 26 tries you could use a hash function to create an arbitrary number of sub-tries. That way, the amount of data you have to read from disk is limited to the size of a single sub-trie, which you control. In addition, you could cache the most recently used sub-tries in memory and persist changes to disk asynchronously in the background if I/O is still a problem.
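A minimal sketch of the bucketing idea in Java (the bucket count and file naming scheme are illustrative assumptions, not part of your setup):

    public class TrieBuckets {
        static final int NUM_BUCKETS = 256; // tune so one bucket's trie fits comfortably in memory

        // Map a word to the sub-trie (and its backing file) responsible for it.
        static int bucketFor(String word) {
            return Math.floorMod(word.toLowerCase().hashCode(), NUM_BUCKETS);
        }

        public static void main(String[] args) {
            String word = "example";
            int bucket = bucketFor(word);
            String file = "trie-" + bucket + ".bin"; // hypothetical per-bucket trie file
            System.out.println(word + " -> bucket " + bucket + " (" + file + ")");
        }
    }

The same idea works in Node.js with any string hash; the point is only that the bucket is derived from the word, so lookups and updates always touch exactly one sub-trie file.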
We have an application in which an XML string is created from a stored procedure resultset and transformed using XSLT to return to the calling servlet. This works fine with smaller datasets but causes an out of memory error with large amounts of data. What would be the ideal solution in this case?
XSLT transformations, in general, require the entire dataset to be loaded into memory, so the easiest thing is to get more memory.
If you can rewrite your XSLT, there is Streaming Transformations for XML (STX), which allows for incremental processing of data.
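Short of that, one change that often helps is to stop materializing the XML as one giant Java String and feed the transformer streams instead. This is a hedged sketch using the standard javax.xml.transform API (file names are placeholders; in the servlet you would wrap the resultset output and the response stream). It does not make the transformation itself streaming, but it removes the extra in-memory copies of the document:

    import java.io.File;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.stream.StreamResult;
    import javax.xml.transform.stream.StreamSource;

    public class StreamTransform {
        public static void main(String[] args) throws Exception {
            // Compile the stylesheet, then transform from stream to stream,
            // avoiding a full String copy of the source document.
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new File("transform.xslt")));
            t.transform(new StreamSource(new File("data.xml")),
                        new StreamResult(new File("out.html")));
        }
    }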
If you're processing the entire XML document at once then it sounds like you'll need to allocate more memory to the Java heap. But that only works up to the defined maximum heap size. Do you know a reasonable maximum data set size or is it unbounded?
Why do you need the database to generate the XML?
A few important things to note:
You mentioned that it works fine functionally with a small data set but goes out of memory with large data sets. You need to identify whether it is the creation of the dataset or the transfer of the dataset within the same process that causes the out of memory error.
You are doing something that keeps many objects alive in memory. Re-check your code and explicitly null out some objects after use; this makes life easier for the garbage collector. You can also play with the MaxPermSize setting of the JVM, which gives you additional space for strings.
This approach has a limitation: even if you are able to transfer large datasets for a single user, it might still go out of memory with multiple concurrent users.
A suggestion that might work for you.
Break this into an asynchronous process: make the creation of the large dataset one process and the downloading of that dataset a separate process.
While making the dataset available for download, you can control memory consumption well by using stream-based downloading.
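A minimal sketch of the stream-based download step, assuming the dataset has already been written to a file (the path, content type, and buffer size are assumptions for the example):

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import javax.servlet.http.HttpServletResponse;

    public class DownloadHelper {
        // Copy a previously generated file to the response in small buffers,
        // so only one buffer's worth of data is in memory per request.
        static void streamFile(String path, HttpServletResponse response) throws IOException {
            response.setContentType("application/xml");
            try (InputStream in = Files.newInputStream(Paths.get(path));
                 OutputStream out = response.getOutputStream()) {
                byte[] buffer = new byte[8192];
                int read;
                while ((read = in.read(buffer)) != -1) {
                    out.write(buffer, 0, read);
                }
            }
        }
    }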
I've been working on a graphing/data processing application (you can see a screenshot here) using Clojure (though, oftentimes, it feels like I'm using more Java than Clojure), and have started testing my application with bigger datasets. I have no problem with around 100k points, but when I start getting higher than that, I run into heap space problems.
Now, theoretically, about half a GB should be enough to hold around 70 million doubles. Granted, I'm doing many things that require some overhead, and I may in fact be holding 2-3 copies of the data in memory at the same time, but I haven't optimized much yet, and 500k or so points is still orders of magnitude less than what I should be able to load.
I understand that Java has an artificial restriction on the size of the heap, and that it can be changed, in part, with options you specify as the JVM starts. This leads me to my first questions:
Can I change the maximum heap space the JVM is allowed on startup if I am using Swank-Clojure (via Leiningen)?
If I package this application (like I plan to) as an Uberjar, would I be able to ensure my JVM has some kind of minimum heap space?
But I'm not content with just relying on the heap of the JVM to power my application. I don't know the size of the data I may eventually be working with, but it could reach millions of points, and perhaps the heap couldn't accommodate that. Therefore, I'm interested in finding alternatives to just piling the data on. Here are some ideas I had, and questions about them:
Would it be possible to read in only parts of a large (text) file at a time, so I could import and process the data in "chunks", e.g, n lines at a time? If so, how?
Is there some faster way of accessing the file I'd be reading from (potentially rapidly, depending on the implementation), other than simply reading from it a bit at a time? I guess I'm asking here for any tips/hacks that have worked for you in the past, if you've done a similar thing.
Can I "sample" from the file; e.g. read only every z lines, effectively downsampling my data?
Right now, if there are answers to the above (I'll keep searching!), or insights offered that lead to equivalent solutions, I plan to read in a chunk of data at a time, graph it to the timeline (see the screenshot; the timeline is green), and allow the user to interact with just that bit until she clicks the next chunk (or something); then I'd save changes made to a file, load the next "chunk" of data, and display it.
Alternatively, I'd display the whole timeline of all the data (downsampled, so I could load it), but only allow access to one "chunk" of it at a time in the main window (the part that is viewed above the green timeline, as outlined by the viewport rectangle in the timeline).
Most of all, though, is there a better way? Note that I cannot downsample the primary window's data, as I need to be able to process it and let the user interact with it (e.g., click a point or near one to add a "marker" to that point; the marker is drawn as a vertical rule over that point).
I'd appreciate any insight, answers, suggestions or corrections! I'm also willing to expound on my question in any way you'd like.
This will hopefully, at least in part, be open-sourced; I'd like a simple-to-use yet fast way to make xy-plots of lots of data in the Clojure world.
EDIT: Downsampling is possible only when graphing, and not always even then, depending on the parts being graphed. I need access to all the data to perform analysis on it. (Just clearing that up!) Though I should definitely look into downsampling, I don't think it will solve my memory issues in the least, as all I'm doing to graph is drawing on a BufferedImage.
Can I change the maximum heap space the JVM is allowed on startup if I am using Swank-Clojure (via Leiningen)?
You can change the Java heap size by supplying the -Xms (min heap) and -Xmx (max heap) options at startup, see the docs.
So something like java -Xms256m -Xmx1024m ... would give 256MB initial heap with the option to grow to 1GB.
I don't use Leiningen/Swank, but I expect that it's possible to change it. If nothing else, there should be a startup script for Java somewhere where you can change the arguments.
If I package this application (like I plan to) as an Uberjar, would I be able to ensure my JVM has some kind of minimum heap space?
Memory isn't controlled from within a jar file, but from the startup script, normally a .sh or .bat file that calls java and supplies the arguments.
Can I "sample" from the file; e.g.
read only every z lines?
java.io.RandomAccessFile gives random file access by byte index, which you can build on to sample the contents.
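A minimal sketch of byte-indexed sampling with RandomAccessFile (the file name and number of samples are assumptions; seeking lands in the middle of a line, so the partial line is skipped and the next full line is taken as the sample):

    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class SampleLines {
        public static void main(String[] args) throws IOException {
            try (RandomAccessFile raf = new RandomAccessFile("data.txt", "r")) {
                long length = raf.length();
                long stride = Math.max(1, length / 100);  // take roughly 100 samples across the file
                for (long pos = 0; pos < length; pos += stride) {
                    raf.seek(pos);
                    raf.readLine();                       // discard the (likely partial) line we landed in
                    String line = raf.readLine();         // next complete line is the sample
                    if (line != null) {
                        System.out.println(line);
                    }
                }
            }
        }
    }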
Would it be possible to read in only parts of a large (text) file at a time, so I could import and process the data in "chunks", e.g, n lines at a time? If so, how?
line-seq returns a lazy sequence of each line in a file, so you can process as much at a time as you wish.
Alternatively, use the Java mechanisms in java.io - BufferedReader.readLine() or FileInputStream.read(byte[] buffer)
Is there some faster way of accessing the file I'd be reading from (potentially rapidly, depending on the implementation), other than simply reading from it a bit at a time?
Within Java/Clojure there is BufferedReader, or you can maintain your own byte buffer and read larger chunks at a time.
To make the most out of the memory you have, keep the data as primitive as possible.
For some actual numbers, let's assume you want to graph the contents of a music CD:
A CD has two channels, each with 44,100 samples per second
60 min. of music is then ~300 million data points
Represented as 16 bits (2 bytes, a short) per datapoint: 600MB
Represented as primitive int array (4 bytes per datapoint): 1.2GB
Represented as Integer array (32 bytes per datapoint): 10GB
Using the numbers from this blog for object size (16 byte overhead per object, 4 bytes for primitive int, objects aligned to 8-byte boundaries, 8-byte pointers in the array = 32 bytes per Integer datapoint).
Even 600MB of data is a stretch to keep in memory all at once on a "normal" computer, since you will probably be using lots of memory elsewhere too. But the switch from primitive to boxed numbers will all by itself reduce the number of datapoints you can hold in memory by an order of magnitude.
If you were to graph the data from a 60 min CD on a 1900 pixel wide "overview" timeline, you would have one pixel to display two seconds of music (~180,000 datapoints). This is clearly way too little to show any level of detail; you would want some form of subsampling or summary data there.
So the solution you describe - process the full dataset one chunk at a time for a summary display in the 'overview' timeline, and keep only the small subset for the main "detail" window in memory - sounds perfectly reasonable.
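As an illustration of the summary step (the method and array shapes here are made up for the example), a per-pixel min/max reduction over one chunk of samples keeps only 2 * width doubles for the overview instead of the raw chunk:

    public class OverviewSummary {
        // Reduce a chunk of raw samples to one (min, max) pair per pixel column.
        static double[][] summarize(double[] samples, int widthPixels) {
            double[][] minMax = new double[widthPixels][2];
            for (double[] mm : minMax) {
                mm[0] = Double.POSITIVE_INFINITY;
                mm[1] = Double.NEGATIVE_INFINITY;
            }
            for (int i = 0; i < samples.length; i++) {
                int px = (int) ((long) i * widthPixels / samples.length);
                minMax[px][0] = Math.min(minMax[px][0], samples[i]);
                minMax[px][1] = Math.max(minMax[px][1], samples[i]);
            }
            return minMax;
        }
    }

Summaries from successive chunks can be merged the same way (take the min of mins and max of maxes per pixel), so the full dataset never needs to be resident.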
Update:
On fast file reads: This article times the file reading speed for 13 different ways to read a 100MB file in Java - the results vary from 0.5 seconds to 10 minutes(!). In general, reading is fast with a decent buffer size (4k to 8k bytes) and (very) slow when reading one byte at a time.
The article also has a comparison to C in case anyone is interested. (Spoiler: The fastest Java reads are within a factor 2 of a memory-mapped file in C.)
Tossing out a couple ideas from left field...
You might find something useful in the Colt library... http://acs.lbl.gov/software/colt/
Or perhaps memory-mapped I/O.
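If the data lives in a flat binary file, a hedged sketch of the memory-mapped approach looks like this (the file name and layout are assumptions; a single mapping is limited to about 2 GB, so larger files need to be mapped in windows):

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class MappedRead {
        public static void main(String[] args) throws Exception {
            try (RandomAccessFile raf = new RandomAccessFile("data.bin", "r");
                 FileChannel channel = raf.getChannel()) {
                // The OS pages the file in and out on demand, so it never has to
                // sit on the Java heap all at once.
                long size = Math.min(channel.size(), Integer.MAX_VALUE);
                MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, size);
                double first = buf.getDouble(0);   // read a double at byte offset 0
                System.out.println("First value: " + first);
            }
        }
    }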
A couple of thoughts:
The best way to handle large in-memory data sets in Java/Clojure is to use large primitive arrays. If you do this, you are basically using only a little more memory than the size of the underlying data. You can handle these arrays in Clojure just fine with the aget/aset functionality.
I'd be tempted to downsample, but maintain a way to lazily access the detailed points "on demand" if you need to, e.g. in the user interaction case. Kind of like the way that Google maps lets you see the whole world, and only loads the detail when you zoom in....
If you only care about the output image from the x-y plot, then you can construct it by loading in a few thousand points at a time (e.g. loading into your primitive arrays), plotting them, then discarding them. That way you won't need to hold the full data set in memory; a sketch of this follows below.
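A minimal sketch of that last idea, plotting straight onto a BufferedImage as points stream in (the file format, data ranges, and image size are placeholder assumptions):

    import java.awt.Color;
    import java.awt.Graphics2D;
    import java.awt.image.BufferedImage;
    import java.io.BufferedReader;
    import java.io.File;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import javax.imageio.ImageIO;

    public class ChunkedPlot {
        public static void main(String[] args) throws IOException {
            int width = 1900, height = 400;
            double xMax = 1_000_000, yMax = 100;   // assumed data ranges for scaling
            BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
            Graphics2D g = image.createGraphics();
            g.setColor(Color.GREEN);
            // Hypothetical input: one "x<TAB>y" pair per line.
            try (BufferedReader in = Files.newBufferedReader(Paths.get("points.tsv"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] xy = line.split("\t");
                    int px = (int) (Double.parseDouble(xy[0]) / xMax * (width - 1));
                    int py = height - 1 - (int) (Double.parseDouble(xy[1]) / yMax * (height - 1));
                    g.fillRect(px, py, 1, 1);      // draw the point, then forget it
                }
            }
            g.dispose();
            ImageIO.write(image, "png", new File("plot.png"));
        }
    }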
OK, so I am writing a program that unfortunately needs to use a huge data structure to complete its work, but it is failing with an "out of memory" error during its initialization. While I understand entirely what that means and why it is a problem, I am having trouble overcoming it, since my program needs to use this large structure and I don't know any other way to store it.
The program first indexes a large corpus of text files that I provide. This works fine.
Then it uses this index to initialize a large 2D array. This array will have n² entries, where "n" is the number of unique words in the corpus of text. For the relatively small chunk I am testing it on (about 60 files) it needs to make approximately 30,000 x 30,000 entries. This will probably be bigger once I run it on my full intended corpus too.
It consistently fails every time, after it indexes, while it is initializing the data structure (to be worked on later).
Things I have done include:
revamp my code to use a primitive int[] instead of a TreeMap
eliminate redundant structures, etc...
Also, I have run the program with -Xmx2g to max out my allocated memory.
I am fairly confident this is not going to be a simple line of code solution, but is most likely going to require a very new approach. I am looking for what that approach is, any ideas?
Thanks,
B.
It sounds like (making some assumptions about what you're using your array for) most of the entries will be 0. If so, you might consider using a sparse matrix representation.
If you really have that many entries (your current array is somewhere over 3 gigabytes already, even assuming no overhead), then you'll have to use some kind of on-disk storage, or a lazy-load/unload system.
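A minimal sketch of the sparse-matrix idea using nested maps (this carries boxing overhead, so for serious use a primitive-keyed map library would be leaner, but it shows why empty cells cost nothing):

    import java.util.HashMap;
    import java.util.Map;

    public class SparseCooccurrence {
        // One small map per non-empty row; absent cells are implicitly zero.
        private final Map<Integer, Map<Integer, Integer>> rows = new HashMap<>();

        void increment(int i, int j) {
            rows.computeIfAbsent(i, k -> new HashMap<>())
                .merge(j, 1, Integer::sum);
        }

        int get(int i, int j) {
            Map<Integer, Integer> row = rows.get(i);
            return (row == null) ? 0 : row.getOrDefault(j, 0);
        }

        public static void main(String[] args) {
            SparseCooccurrence m = new SparseCooccurrence();
            m.increment(12, 3400);
            m.increment(12, 3400);
            System.out.println(m.get(12, 3400)); // 2
            System.out.println(m.get(0, 0));     // 0, and costs no storage
        }
    }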
There are several causes of out of memory issues.
Firstly, the simplest case is that you simply need more heap. You're using a 512M max heap when your program could operate correctly with 2G. Increase it with -Xmx2048m as a JVM option and you're fine. Also be aware that 64-bit VMs will use up to twice the memory of 32-bit VMs, depending on the makeup of that data.
If your problem isn't that simple then you can look at optimization. Replacing objects with primitives and so on. This might be an option. I can't really say based on what you've posted.
Ultimately, however, you come to a crossroads where you have to make a choice between virtualization and partitioning.
Virtualizing in this context simply means some form of pretending there is more memory than there is. Operating systems do this with virtual address spaces, using hard disk space as extra memory. In your case it could mean keeping only some of the data structure in memory at a time and persisting the rest to secondary storage (e.g. a file or database).
Partitioning is splitting your data across multiple servers (either real or virtual). For example, if you were keeping track of stock trades on the NASDAQ you could put stock codes starting with "A" on server1, "B" on server2, etc. You need to find a reasonable approach to slice your data such that you reduce or eliminate the need for cross-communication because that cross-communication is what limits your scalability.
So in the simple case, if what you're storing is 30K words and 30K x 30K combinations of words, you could divide it up across four servers:
A-M x A-M
A-M x N-Z
N-Z x A-M
N-Z x N-Z
That's just one idea. Again, it's hard to comment without knowing specifics.
This is a common problem dealing with large datasets. You can optimize as much as you want, but the memory will never be enough (probably), and as soon as the dataset grows a little more you are still smoked. The most scalable solution is simply to keep less in memory, work on chunks, and persist the structure on disk (database/file).
If you don't need a full 32 bits (the size of an int) for each value in your 2D array, perhaps a smaller type such as a byte would do the trick? Also, you should give it as much heap space as possible - 2GB is still relatively small for a modern system. RAM is cheap, especially if you're expecting to do a lot of processing in memory.