I've been working on a graphing/data processing application (you can see a screenshot here) using Clojure (though, oftentimes, it feels like I'm using more Java than Clojure), and have started testing my application with bigger datasets. I have no problem with around 100k points, but when I start getting higher than that, I run into heap space problems.
Now, theoretically, about half a GB should be enough to hold around 70 million doubles. Granted, I'm doing many things that require some overhead, and I may in fact be holding 2-3 copies of the data in memory at the same time, but I haven't optimized much yet, and 500k points or so is still orders of magnitude less than what I should be able to load.
I understand that Java places artificial restrictions on the size of the heap, and that those can be changed, in part, with options you specify as the JVM starts. This leads me to my first questions:
Can I change the maximum heap space the JVM has on startup if I am using Swank-Clojure (via Leiningen)?
If I package this application (like I plan to) as an Uberjar, would I be able to ensure my JVM has some kind of minimum heap space?
But I'm not content with just relying on the heap of the JVM to power my application. I don't know the size of the data I may eventually be working with, but it could reach millions of points, and perhaps the heap couldn't accommodate that. Therefore, I'm interested in finding alternatives to just piling the data on. Here are some ideas I had, and questions about them:
Would it be possible to read in only parts of a large (text) file at a time, so I could import and process the data in "chunks", e.g., n lines at a time? If so, how?
Is there some faster way of accessing the file I'd be reading from (potentially rapidly, depending on the implementation), other than simply reading from it a bit at a time? I guess I'm asking here for any tips/hacks that have worked for you in the past, if you've done a similar thing.
Can I "sample" from the file; e.g. read only every z lines, effectively downsampling my data?
Right now, if there are answers to the above (I'll keep searching!), or insights offered that lead to equivalent solutions, my plan is to read in a chunk of data at a time, graph it to the timeline (see the screenshot; the timeline is green), and allow the user to interact with just that bit until she clicks "next chunk" (or something); then I'd save the changes made to a file, load the next "chunk" of data, and display it.
Alternatively, I'd display the whole timeline of all the data (downsampled, so I could load it), but only allow access to one "chunk" of it at a time in the main window (the part that is viewed above the green timeline, as outlined by the viewport rectangle in the timeline).
Most of all, though, is there a better way? Note that I cannot downsample the primary window's data, as I need to be able to process it and let the user interact with it (e.g., click a point or near one to add a "marker" to that point: that marker is drawn as a vertical rule over that point).
I'd appreciate any insight, answers, suggestions or corrections! I'm also willing to expound on my question in any way you'd like.
This will hopefully, at least in part, be open-sourced; I'd like a simple-to-use yet fast way to make xy-plots of lots of data in the Clojure world.
EDIT: Downsampling is possible only when graphing, and not always then, depending on the parts being graphed. I need access to all the data to perform analysis on it. (Just clearing that up!) Though I should definitely look into downsampling, I don't think that will solve my memory issues in the least, as all I'm doing to graph is drawing on a BufferedImage.
Can I change the maximum heap
space the JVM has on startup if I am
using Swank-Clojure (via Leiningen)?
You can change the Java heap size by supplying the -Xms (min heap) and -Xmx (max heap) options at startup, see the docs.
So something like java -Xms256m -Xmx1024m ... would give 256MB initial heap with the option to grow to 1GB.
I don't use Leiningen/Swank, but I expect that it's possible to change it. If nothing else, there should be a startup script for Java somewhere where you can change the arguments.
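For Leiningen specifically, heap flags can typically be set in project.clj via :jvm-opts, so every JVM it launches (including the Swank one) picks them up. A minimal sketch; the project name and version here are made up:

```clojure
;; project.clj -- hypothetical project; only :jvm-opts matters here
(defproject graphapp "0.1.0"
  :jvm-opts ["-Xms256m" "-Xmx1024m"])
```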
If I package this application (like I
plan to) as an Uberjar, would I be
able to ensure my JVM has some kind of
minimum heap space?
Memory isn't controlled from within a jar file, but from the startup script, normally a .sh or .bat file that calls java and supplies the arguments.
Can I "sample" from the file; e.g.
read only every z lines?
java.io.RandomAccessFile gives random file access by byte index, which you can build on to sample the contents.
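If byte-level access is more than you need, a plain line-reading pass can do the sampling too. A rough sketch (the reader source and the value of z are whatever your import code uses):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class LineSampler {
    // Keep every z-th line (lines 0, z, 2z, ...) - a simple downsampling pass.
    public static List<String> sampleEvery(BufferedReader reader, int z) throws IOException {
        List<String> kept = new ArrayList<>();
        String line;
        long i = 0;
        while ((line = reader.readLine()) != null) {
            if (i % z == 0) kept.add(line);
            i++;
        }
        return kept;
    }

    public static void main(String[] args) throws IOException {
        try (BufferedReader r = new BufferedReader(new FileReader(args[0]))) {
            System.out.println(sampleEvery(r, 10).size() + " lines kept");
        }
    }
}
```

Note this still reads every byte once; RandomAccessFile only pays off if you can seek past data without reading it (e.g. fixed-width records).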
Would it be possible to read in only
parts of a large (text) file at a
time, so I could import and process
the data in "chunks", e.g, n lines at
a time? If so, how?
line-seq returns a lazy sequence of each line in a file, so you can process as much at a time as you wish.
Alternatively, use the Java mechanisms in java.io - BufferedReader.readLine() or FileInputStream.read(byte[] buffer)
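On the Java side, a chunked pass might be sketched like this, handing each chunk of n lines to a callback so only one chunk is ever alive:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class ChunkReader {
    // Read n lines at a time, passing each chunk to a handler;
    // the previous chunk becomes garbage before the next is filled.
    public static void processInChunks(BufferedReader reader, int n,
                                       Consumer<List<String>> handler) throws IOException {
        List<String> chunk = new ArrayList<>(n);
        String line;
        while ((line = reader.readLine()) != null) {
            chunk.add(line);
            if (chunk.size() == n) {
                handler.accept(chunk);
                chunk = new ArrayList<>(n); // drop the old chunk for GC
            }
        }
        if (!chunk.isEmpty()) handler.accept(chunk); // trailing partial chunk
    }
}
```

(In Clojure, the analogous approach is partitioning the lazy sequence, e.g. `(partition-all n (line-seq rdr))`.)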
Is there some faster way of accessing
the file I'd be reading from
(potentially rapidly, depending on the
implementation), other than simply
reading from it a bit at a time?
Within Java/Clojure there is BufferedReader, or you can maintain your own byte buffer and read larger chunks at a time.
To make the most out of the memory you have, keep the data as primitive as possible.
For some actual numbers, let's assume you want to graph the contents of a music CD:
A CD has two channels, each with 44,100 samples per second
60 min. of music is then ~300 million data points
Represented as 16 bits (2 bytes, a short) per datapoint: 600MB
Represented as primitive int array (4 bytes per datapoint): 1.2GB
Represented as Integer array (32 bytes per datapoint): 10GB
Using the numbers from this blog for object size (16 byte overhead per object, 4 bytes for primitive int, objects aligned to 8-byte boundaries, 8-byte pointers in the array = 32 bytes per Integer datapoint).
Even 600MB of data is a stretch to keep in memory all at once on a "normal" computer, since you will probably be using lots of memory elsewhere too. But the switch from primitive to boxed numbers will all by itself reduce the number of datapoints you can hold in memory by an order of magnitude.
If you were to graph the data from a 60 min CD on a 1900 pixel wide "overview" timeline, you would have one pixel to display two seconds of music (~180,000 datapoints). This is clearly far too little resolution to show any level of detail; you would want some form of subsampling or summary data there.
So the solution you describe - process the full dataset one chunk at a time for a summary display in the 'overview' timeline, and keep only the small subset for the main "detail" window in memory - sounds perfectly reasonable.
Update:
On fast file reads: This article times the file reading speed for 13 different ways to read a 100MB file in Java - the results vary from 0.5 seconds to 10 minutes(!). In general, reading is fast with a decent buffer size (4k to 8k bytes) and (very) slow when reading one byte at a time.
The article also has a comparison to C in case anyone is interested. (Spoiler: The fastest Java reads are within a factor 2 of a memory-mapped file in C.)
Tossing out a couple ideas from left field...
You might find something useful in the Colt library... http://acs.lbl.gov/software/colt/
Or perhaps memory-mapped I/O.
A couple of thoughts:
The best way to handle large in-memory data sets in Java/Clojure is to use large primitive arrays. If you do this, you are basically using only a little more memory than the size of the underlying data. You can handle these arrays just fine in Clojure with the aget/aset functions.
I'd be tempted to downsample, but maintain a way to lazily access the detailed points "on demand" if you need to, e.g. in the user interaction case. Kind of like the way that Google maps lets you see the whole world, and only loads the detail when you zoom in....
If you only care about the output image from the x-y plot, then you can construct it by loading in a few thousand points at a time (e.g. loading into your primitive arrays), plotting them, then discarding them. In this way you won't need to hold the full data set in memory.
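That chunk-plot-discard loop might look roughly like this; the linear scaling and the green color are placeholder assumptions, not the app's actual drawing code:

```java
import java.awt.Color;
import java.awt.image.BufferedImage;

public class ChunkPlotter {
    // Plot one batch of points onto the image; only the current batch's
    // primitive arrays are alive, never the whole dataset.
    public static void plotBatch(BufferedImage img, double[] xs, double[] ys,
                                 double xMax, double yMax) {
        int w = img.getWidth(), h = img.getHeight();
        int rgb = Color.GREEN.getRGB();
        for (int i = 0; i < xs.length; i++) {
            int px = (int) (xs[i] / xMax * (w - 1));
            int py = (h - 1) - (int) (ys[i] / yMax * (h - 1)); // flip: y grows upward
            img.setRGB(px, py, rgb);
        }
    }
}
```

Call plotBatch once per chunk as you read the file; the BufferedImage accumulates the full plot while each chunk's arrays are discarded.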
Related
I'm reading a file to parse a few of the fields of each record as a reference key and another field as the reference value. These keys and values are referred to by another process.
Hence, I chose a HashMap, so that I can get the values for each key, easily.
But each file consists of tens of millions of records, so the HashMap throws OutOfMemoryError. I suspect increasing the heap memory will not be a good solution if the input file grows in the future.
For similar questions in SO, most have suggested to use a database. I fear I'll not be given option to use a DB. Is there any other way to handle the problem?
EDIT: I need to do this same HashMap loading for 4 such files :( I need all four, because if I don't find a matching entry for my input in the first map, I need to look in the second, then if it's not there, the third, and finally the fourth.
Edit 2: The files I have sums up to, around 1 GB.
EDIT 3:
034560000010000001750
000234500010000100752
012340000010000300374
I have records like these in a file. I need to have 03456000001000000 as the key and 1750 as the value, for all the millions of records. I'll look up these keys and get the values for another process.
Using a database will not reduce memory cost or runtime by itself.
However, the default hashmaps may not be what you are looking for, depending on your data types. When used with primitive values such as Integers then java hashmaps have a massive memory overhead. In a HashMap<Integer, Integer>, every entry uses like 24+16+16 bytes. Unused entries (and the hashmap keeps up to half of them unused) take 4 bytes extra. So you can roughly estimate >56 bytes per int->int entry in Java HashMap<Integer, Integer>.
If you encode the integers as Strings, and we're talking maybe 6-digit numbers, that is likely 24 bytes for the underlying char[] array (16-bit characters; 12 bytes overhead for the array, sizes are a multiple of 8!), plus 16 bytes for the String object around it (maybe 24, too). For the key and the value each. So that is then around 24+40+40, i.e. about 104 bytes per entry.
(Update: as your keys are 17 characters in length, make this 24+62+40, i.e. 136 bytes)
If you used a primitive hashmap such as GNU Trove's TIntIntHashMap, it would only take 8 bytes plus unused buckets, so let's estimate 16 bytes per entry: at least 6 times less memory.
(Update: for TLongIntHashMap, estimate 12 bytes per entry, 24 bytes with overhead of unused buckets.)
Now you could also just store everything in a massive sorted list. This will allow you to perform a fast join operation, and you will lose much of the overhead of unused entries, and can probably process twice as many in much shorter time.
Oh, and if you know the valid value range, you can abuse an array as "hashmap".
I.e. if your valid keys are 0...999999, then just use an int[1000000] as storage, and write each entry into the appropriate row. Don't store the key at all - it's the offset in the array.
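For the record layout shown in the question (17-digit key, 4-digit value), a sketch of the sorted-list variant with parallel primitive arrays; the fixed field widths are an assumption read off the sample records, and the 17-digit keys fit comfortably in a long:

```java
import java.util.Arrays;

public class SortedLookup {
    private final long[] keys;   // sorted 17-digit keys
    private final int[] values;  // values aligned with keys

    public SortedLookup(long[] keys, int[] values) {
        // Sort the (key, value) pairs together by key.
        Integer[] order = new Integer[keys.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> Long.compare(keys[a], keys[b]));
        this.keys = new long[keys.length];
        this.values = new int[values.length];
        for (int i = 0; i < order.length; i++) {
            this.keys[i] = keys[order[i]];
            this.values[i] = values[order[i]];
        }
    }

    // Binary search: ~log2(n) probes, 12 bytes per entry, no boxing.
    public int get(long key, int missing) {
        int idx = Arrays.binarySearch(keys, key);
        return idx >= 0 ? values[idx] : missing;
    }

    // Parse one fixed-width record: first 17 chars = key, rest = value.
    public static long parseKey(String record)   { return Long.parseLong(record.substring(0, 17)); }
    public static int  parseValue(String record) { return Integer.parseInt(record.substring(17)); }
}
```

Tens of millions of entries then cost roughly 12 bytes each, versus the 100+ bytes of a String-keyed HashMap.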
Last but not least, Java by default only uses 25% of your memory. You probably want to increase its memory limit.
Short answer: no. It's quite clear that you can't load your entire dataset in memory. You need a way to keep it on disk together with an index, so that you can access the relevant bits of the dataset without rescanning the whole file every time a new key is requested.
Essentially, a DBMS is a mechanism for handling (large) quantities of data: storing, retrieving, combining, filtering etc. They also provide caching for commonly used queries and responses. So anything you are going to do will be a (partial) reimplementation of what a DBMS already does.
I understand your concerns about having an external component to depend on; however, note that a DBMS is not necessarily a server daemon. There are tiny DBMSs which link with your program and keep the whole dataset in a file, like SQLite does.
Such large data collections should be handled with a database. Java programs are limited in memory, and the limit varies from device to device. You provided no info about your program, but remember that if it is run on different devices, some of them may have very little RAM and will crash very quickly. A DB (be it SQL or file-based) is a must for large-data programs.
You have to either
a) have enough memory to load the data into memory, or
b) read the data from disk, with an index which is either in memory or not.
Whether you use a database or not the problem is much the same. If you don't have enough memory, you will see a dramatic drop in performance if you start randomly accessing the disk.
There are alternatives like Chronicle Map which use off-heap memory and perform well up to double your main memory size, so you won't get an out-of-memory error; however, you still have the problem that you can't hold more data in memory than you have main memory.
The memory footprint depends on how you approach the file in Java. A widely used solution is based on streaming the file using the Apache Commons IO LineIterator. Their recommended usage:
LineIterator it = FileUtils.lineIterator(file, "UTF-8");
try {
    while (it.hasNext()) {
        String line = it.nextLine();
        // do something with line
    }
} finally {
    it.close();
}
It's an optimized approach, but if the file is too big, you can still end up with an OutOfMemoryError.
Since you write that you fear that you will not be given the option to use a database some kind of embedded DB might be the answer. If it is impossible to keep everything in memory it must be stored somewhere else.
I believe that some kind of embedded database that uses the disk as storage might work. Examples include BerkeleyDB and Neo4j. Since both databases use a file index for fast lookups, the memory load is lower than keeping everything in memory, but they are still fast.
You could try lazy loading it.
I am in the middle of a Java project which will be using a 'big dictionary' of words. By 'dictionary' I mean certain numbers (int) assigned to Strings. And by 'big' I mean a file of the order of 100 MB. The first solution that I came up with is probably the simplest possible. At initialization I read in the whole file and create a large HashMap which will be later used to look strings up.
Is there an efficient way to do it without the need of reading the whole file at initialization? Perhaps not, but what if the file is really large, let's say in the order of the RAM available? So basically I'm looking for a way to look things up efficiently in a large dictionary stored in memory.
Thanks for the answers so far, as a result I've realised I could be more specific in my question. As you've probably guessed the application is to do with text mining, in particular representing text in a form of a sparse vector (although some had other inventive ideas :)). So what is critical for usage is to be able to look strings up in the dictionary, obtain their keys as fast as possible. Initial overhead of 'reading' the dictionary file or indexing it into a database is not as important as long as the string look-up time is optimized. Again, let's assume that the dictionary size is big, comparable to the size of RAM available.
Consider ChronicleMap (https://github.com/OpenHFT/Chronicle-Map) in a non-replicated mode. It is an off-heap Java Map implementation, or, from another point of view, a superlightweight NoSQL key-value store.
What it does useful for your task out of the box:
Persistence to disk via memory-mapped files (see comment by Michał Kosmulski)
Lazy load (disk pages are loaded only on demand) -> fast startup
If your data volume is larger than available memory, operating system will unmap rarely used pages automatically.
Several JVMs can use the same map, because off-heap memory is shared at the OS level. Useful if you do the processing within a map-reduce-like framework, e.g. Hadoop.
Strings are stored in UTF-8 form -> ~50% memory savings if strings are mostly ASCII (as maaartinus noted)
int or long values take just 4 (8) bytes, as if you had a primitive-specialized map implementation.
Very little per-entry memory overhead, much less than in standard HashMap and ConcurrentHashMap
Good configurable concurrency via lock striping, if you already need it or are going to parallelize text processing in the future.
Once your data structure is a few hundred MB up to the order of your RAM, you're better off not initializing it at run-time, but rather using a database which supports indexing (which most do these days). Indexing is going to be one of the only ways to ensure the fastest retrieval of text once your file gets so large that you're running up against the -Xmx settings of your JVM. This is because if your file is as large as, or much larger than, your maximum heap setting, you're inevitably going to crash your JVM.
As for having to read the whole file at initialization: you're going to have to do this eventually so that you can efficiently search and analyze the text in your code. If you know that you're only going to be searching a certain portion of your file at a time, you can implement lazy loading. If not, you might as well bite the bullet and load your entire file into the DB in the beginning. You can implement parallelism in this process if there are other parts of your code execution that don't depend on it.
Please let me know if you have any questions!
As stated in a comment, a Trie will save you a lot of memory.
You should also consider using bytes instead of chars as this saves you a factor of 2 for plain ASCII text or when using your national charset as long as it has no more than 256 different letters.
At first glance, combining this low-level optimization with tries makes no sense, as with them the node size is dominated by the pointers. But there is a way, if you want to go low-level.
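For illustration, a naive (uncompressed) byte-based trie mapping words to int ids; the fan-out of 128 assumes plain ASCII keys and deliberately trades memory for clarity, so it shows the lookup structure rather than the compact encoding described next:

```java
import java.nio.charset.StandardCharsets;

public class ByteTrie {
    // One node per byte of the key; the children array is sized for
    // ASCII (128). A real implementation would compress this fan-out.
    private static final class Node {
        Node[] children = new Node[128];
        int value = -1; // -1 marks "no entry ends here"
    }

    private final Node root = new Node();

    public void put(String word, int id) {
        Node n = root;
        for (byte b : word.getBytes(StandardCharsets.US_ASCII)) {
            if (n.children[b] == null) n.children[b] = new Node();
            n = n.children[b];
        }
        n.value = id;
    }

    public int get(String word) {
        Node n = root;
        for (byte b : word.getBytes(StandardCharsets.US_ASCII)) {
            n = n.children[b];
            if (n == null) return -1;
        }
        return n.value;
    }
}
```

Lookup cost is proportional to the key length, independent of dictionary size, and shared prefixes are stored once.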
So what is critical for usage is to be able to look strings up in the dictionary, obtain their keys as fast as possible.
Then forget any database, as they're damn slow when compared to HashMaps.
If it doesn't fit into memory, the cheapest solution is usually to get more of it. Otherwise, consider loading only the most common words and doing something slower for the others (e.g., a memory mapped file).
I was asked to point to a good tries implementation, especially off-heap. I'm not aware of any.
Assuming the OP needs no mutability, especially no mutability of keys, it all looks very simple.
I guess, the whole dictionary could be easily packed into a single ByteBuffer. Assuming mostly ASCII and with some bit hacking, an arrow would need 1 byte per arrow label character and 1-5 bytes for the child pointer. The child pointer would be relative (i.e., difference between the current node and the child), which would make most of them fit into a single byte when stored in a base 128 encoding.
I can only guess the total memory consumption, but I'd say, something like <4 bytes per word. The above compression would slow the lookup down, but still nowhere near what a single disk access needs.
It sounds too big to store in memory. Either store it in a relational database (easy, and with an index on the hash, fast), or a NoSQL solution, like Solr (small learning curve, very fast).
Although NoSQL is very fast, if you really want to tweak performance, and there are entries that are far more frequently looked up than others, consider using a limited size cache to hold the most recently used (say) 10000 lookups.
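The limited-size cache can be sketched with java.util.LinkedHashMap's access-order mode; pass the 10000 suggested above as maxEntries:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public LruCache(int maxEntries) {
        super(16, 0.75f, true); // accessOrder = true -> iteration order is LRU
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries; // evict the least-recently-used entry
    }
}
```

On a cache miss you fall through to the slower store (DB or index) and put the result in the cache.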
What options are there for processing large files quickly, multiple times?
I have a single file (min 1.5 GB, but can be upwards of 10-15 GB) that needs to be read multiple times - on the order of hundreds to thousands of times. The server has a large amount of RAM (64+ GB) and plenty of processors (24+).
The file will be sequential, read-only. Files are encrypted (sensitive data) on disk. I also use MessagePack to deserialize them into objects during the read process.
I cannot store the objects created from the file in memory - too large an expansion (a 1.5 GB file turns into a 35 GB in-memory object array). The file can't be stored as a single byte array either (Java's maximum array length is 2^31-1).
My initial thought is to use a memory mapped file, but that has its own set of limitations.
The idea is to get the file off the disk and into memory for processing.
The large volume of data is for a machine learning algorithm, that requires multiple reads. During the calculation of each file pass, there's a considerable amount of heap usage by the algorithm itself, which is unavoidable, hence the requirement to read it multiple times.
The problem you have here is that you cannot mmap() the way the system call of the same name does; the syscall can map up to 2^64 bytes, while FileChannel#map() cannot map more than 2^31-1 bytes, since a MappedByteBuffer is indexed by an int.
However, what you can do is wrap a FileChannel into a class and create several "map ranges" covering all the file.
I have done "nearly" such a thing, except more complicated: largetext. More complicated because I have to do the decoding process to boot, and the decoded text must be loaded into memory, unlike your case, where you read bytes. Less complicated because I had a defined JDK interface to implement and you don't.
You can however use nearly the same technique using Guava and a RangeMap<Long, MappedByteBuffer>.
I implement CharSequence in this project above; I suggest that you implement a LargeByteMapping interface instead, from which you can read whatever parts you want; or, well, whatever suits you. Your main problem will be to define that interface. I suspect what CharSequence does is not what you want.
Meh, I may even have a go at it some day; largetext is quite an exciting project, and this looks like the same kind of thing, except less complicated, ultimately!
One could even imagine a LargeByteMapping implementation where a factory would create such mappings with only a small part kept in memory and the rest written to a file; such an implementation would also use the principle of locality: the most recently queried part of the file would be kept in memory for faster access.
See also here.
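A bare-bones version of the "several map ranges" wrapper described above might look like this; the chunk size and class name are placeholders, and a real version would also handle multi-byte reads that straddle a chunk boundary:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class LargeByteMapper {
    private final MappedByteBuffer[] maps;
    private final long chunkSize;

    // Map the whole file as consecutive read-only regions of chunkSize bytes,
    // so the total mapped size can exceed the 2^31-1 limit of a single map.
    public LargeByteMapper(FileChannel channel, long chunkSize) throws IOException {
        this.chunkSize = chunkSize;
        long size = channel.size();
        int nChunks = (int) ((size + chunkSize - 1) / chunkSize);
        maps = new MappedByteBuffer[nChunks];
        for (int i = 0; i < nChunks; i++) {
            long offset = i * chunkSize;
            long len = Math.min(chunkSize, size - offset);
            maps[i] = channel.map(FileChannel.MapMode.READ_ONLY, offset, len);
        }
    }

    // Read one byte at an arbitrary long offset.
    public byte get(long position) {
        return maps[(int) (position / chunkSize)].get((int) (position % chunkSize));
    }
}
```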
EDIT I feel some more explanation is needed here... A MappedByteBuffer will NOT EAT HEAP SPACE!!
It will eat address space only; it is nearly the equivalent of a ByteBuffer.allocateDirect(), except it is backed by a file.
And a very important distinction needs to be made here; all of the text above supposes that you are reading bytes, not characters!
Figure out how to structure the data. Get a good book about NoSQL and find the appropriate database (wide-column, graph, etc.) for your scenario. That's what I'd do. You'd not only have sophisticated query methods on your data, but could also mangle the data using distributed map-reduce implementations doing whatever you want. Maybe that's what you want (you even dropped the big-data bomb).
How about creating "a dictionary" as the bridge between your program and the target file? Your program will call the dictionary, and the dictionary will refer you to the big fat file.
I am trying to read a file (tab- or csv-separated) in Java with roughly 3m rows; I have also increased the virtual machine memory to -Xmx6g. The code works fine with 400K rows for a tab-separated file, and slightly fewer for a csv file. There are many LinkedHashMaps and Vectors involved, and I try to use System.gc() after every few hundred rows in order to free memory and garbage values. However, my code gives the following error after 400K rows:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Vector.<init>(Vector.java:111)
at java.util.Vector.<init>(Vector.java:124)
at java.util.Vector.<init>(Vector.java:133)
at cleaning.Capture.main(Capture.java:110)
Your attempt to load the whole file is fundamentally ill-fated. You may optimize all you want, but you'll just be pushing the upper limit slightly higher. What you need is eradicate the limit itself.
There is a very negligible chance that you actually need the whole contents in memory all at once. You probably need to calculate something from that data, so you should start working out a way to make that calculation chunk by chunk, each time being able to throw away the processed chunk.
If your data is deeply intertwined, preventing you from serializing your calculation, then the reasonable recourse is, as HovercraftFOE mentions above, transferring the data into a database and working from there, indexing everything you need, normalizing it, etc.
OK, so I am writing a program that unfortunately needs to use a huge data structure to complete its work, but it is failing with an "out of memory error" during its initialization. While I understand entirely what that means and why it is a problem, I am having trouble overcoming it, since my program needs to use this large structure and I don't know any other way to store it.
The program first indexes a large corpus of text files that I provide. This works fine.
Then it uses this index to initialize a large 2D array. This array will have n² entries, where "n" is the number of unique words in the corpus of text. For the relatively small chunk I am testing it on (about 60 files) it needs to make approximately 30,000x30,000 entries. This will probably be bigger once I run it on my full intended corpus, too.
It consistently fails every time, after it indexes, while it is initializing the data structure (to be worked on later).
Things I have done include:
revamp my code to use a primitive int[] instead of a TreeMap
eliminate redundant structures, etc...
Also, I have run the program with -Xmx2g to max out my allocated memory.
I am fairly confident this is not going to be a simple line of code solution, but is most likely going to require a very new approach. I am looking for what that approach is, any ideas?
Thanks,
B.
It sounds like (making some assumptions about what you're using your array for) most of the entries will be 0. If so, you might consider using a sparse matrix representation.
If you really have that many entries (your current array is somewhere over 3 gigabytes already, even assuming no overhead), then you'll have to use some kind of on-disk storage, or a lazy-load/unload system.
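A minimal sparse-matrix sketch along those lines, storing only nonzero cells in a map keyed by a packed (row, col) pair; boxing overhead still applies, and a primitive-keyed map would shrink it further:

```java
import java.util.HashMap;
import java.util.Map;

public class SparseMatrix {
    // Store only nonzero entries, keyed by (row, col) packed into one long.
    private final Map<Long, Integer> cells = new HashMap<>();

    private static long key(int row, int col) {
        return ((long) row << 32) | (col & 0xFFFFFFFFL);
    }

    public void set(int row, int col, int value) {
        if (value == 0) cells.remove(key(row, col)); // keep the map sparse
        else cells.put(key(row, col), value);
    }

    public int get(int row, int col) {
        return cells.getOrDefault(key(row, col), 0);
    }

    public int nonZeroCount() { return cells.size(); }
}
```

For a 30,000x30,000 word co-occurrence matrix where most pairs never co-occur, memory then grows with the number of nonzero entries rather than with n².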
There are several causes of out of memory issues.
Firstly, the simplest case is that you simply need more heap. You're using 512M max heap when your program could operate correctly with 2G. Increase it with -Xmx2048m as a JVM option and you're fine. Also be aware that 64-bit VMs will use up to twice the memory of 32-bit VMs, depending on the makeup of that data.
If your problem isn't that simple then you can look at optimization. Replacing objects with primitives and so on. This might be an option. I can't really say based on what you've posted.
Ultimately, however, you come to a crossroads where you have to make a choice between virtualization and partitioning.
Virtualizing in this context simply means some form of pretending there is more memory than there is. Operating systems use this with virtual address spaces, using hard disk space as extra memory. This could mean only keeping some of the data structure in memory at a time and persisting the rest to secondary storage (e.g. a file or database).
Partitioning is splitting your data across multiple servers (either real or virtual). For example, if you were keeping track of stock trades on the NASDAQ you could put stock codes starting with "A" on server1, "B" on server2, etc. You need to find a reasonable approach to slice your data such that you reduce or eliminate the need for cross-communication because that cross-communication is what limits your scalability.
So in the simple case, if what you're storing is 30K words and 30K x 30K combinations of words, you could divide it up into four servers:
A-M x A-M
A-M x N-Z
N-Z x A-M
N-Z x N-Z
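The routing function for that four-way split is trivial; a sketch (the class and method names are made up):

```java
public class Partitioner {
    // Route a (word1, word2) pair to one of four servers based on
    // which half of the alphabet each word starts with.
    public static int serverFor(String w1, String w2) {
        int a = Character.toUpperCase(w1.charAt(0)) <= 'M' ? 0 : 1; // A-M = 0, N-Z = 1
        int b = Character.toUpperCase(w2.charAt(0)) <= 'M' ? 0 : 1;
        return a * 2 + b; // server index 0..3
    }
}
```

The point is that any node can compute the route locally, with no cross-communication, which is what keeps the scheme scalable.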
That's just one idea. Again, it's hard to comment without knowing specifics.
This is a common problem dealing with large datasets. You can optimize as much as you want, but the memory will never be enough (probably), and as soon as the dataset grows a little more you are still smoked. The most scalable solution is simply to keep less in memory, work on chunks, and persist the structure on disk (database/file).
If you don't need a full 32 bits (the size of an int) for each value in your 2D array, perhaps a smaller type such as a byte would do the trick. Also you should give it as much heap space as possible - 2GB is still relatively small for a modern system. RAM is cheap, especially if you're expecting to be doing a lot of processing in-memory.