I am developing a text analysis program that represents documents as arrays of "feature counts" (e.g., occurrences of a particular token) within some pre-defined feature space. These arrays are stored in an ArrayList after some processing.
I am testing the program on a 64 mb dataset, with 50,000 records. The program worked fine with small data sets, but now it consistently throws a "out of memory" Java heap exception when I start loading the arrays into an ArrayList object (using the .add(double[]) method). Depending on how much memory I allocate to the stack, I will get this exception at the 1000th to 3000th addition to the ArrayList, far short of my 50,000 entries. It became clear to me that I cannot store all this data in RAM and operate on it as usual.
However, I'm not sure what data structures are best suited to allow me to access and perform calculations on the entire dataset when only part of it can be loaded into RAM?
I was thinking that serializing the data to disk and storing the locations in a hashmap in RAM would be useful. However, I have also seen discussions on caching and buffered processing.
I'm 100% sure this is a common CS problem, so I'm sure there are several clever ways that this has been addressed. Any pointers would be appreciated :-)
You have plenty of choices:
Increase heap size (-Xmx) to several gigabytes.
Do not use boxing collections, use fastutil - that should decrease your memory use 4x. http://fastutil.di.unimi.it/
Process your data in batches or sequentially - do not keep whole dataset in memory simultaneously.
Use a proper database. There are even intraprocess databases like HSQL, your mileage may vary.
Process your data via map-reduce, perhaps something local like pig.
How about using Apache Spark (Great for in-memory cluster computing) ?This would help scale your infrastructure as your data set gets Larger.
Related
I am in the middle of a Java project which will be using a 'big dictionary' of words. By 'dictionary' I mean certain numbers (int) assigned to Strings. And by 'big' I mean a file of the order of 100 MB. The first solution that I came up with is probably the simplest possible. At initialization I read in the whole file and create a large HashMap which will be later used to look strings up.
Is there an efficient way to do it without the need of reading the whole file at initialization? Perhaps not, but what if the file is really large, let's say in the order of the RAM available? So basically I'm looking for a way to look things up efficiently in a large dictionary stored in memory.
Thanks for the answers so far, as a result I've realised I could be more specific in my question. As you've probably guessed the application is to do with text mining, in particular representing text in a form of a sparse vector (although some had other inventive ideas :)). So what is critical for usage is to be able to look strings up in the dictionary, obtain their keys as fast as possible. Initial overhead of 'reading' the dictionary file or indexing it into a database is not as important as long as the string look-up time is optimized. Again, let's assume that the dictionary size is big, comparable to the size of RAM available.
Consider ChronicleMap (https://github.com/OpenHFT/Chronicle-Map) in a non-replicated mode. It is an off-heap Java Map implementation, or, from another point of view, a superlightweight NoSQL key-value store.
What it does useful for your task out of the box:
Persistance to disk via memory mapped files (see comment by Michał Kosmulski)
Lazy load (disk pages are loaded only on demand) -> fast startup
If your data volume is larger than available memory, operating system will unmap rarely used pages automatically.
Several JVMs can use the same map, because off-heap memory is shared on OS level. Useful if you does the processing within a map-reduce-like framework, e. g. Hadoop.
Strings are stored in UTF-8 form, -> ~50% memory savings if strings are mostly ASCII (as maaartinus noted)
int or long values takes just 4(8) bytes, like if you have primitive-specialized map implementation.
Very little per-entry memory overhead, much less than in standard HashMap and ConcurrentHashMap
Good configurable concurrency via lock striping, if you already need, or are going to parallelize text processing in future.
At the point your data structure is a few hundred MB to orders of RAM, you're better off not initializing a data structure at run-time, but rather using a database which supports indexing(which most do these days). Indexing is going to be one of the only ways you can ensure the fastest retrieval of text once you're file gets so large and you're running up against the -Xmx settings of your JVM. This is because if your file is as large, or much larger than your maximum size settings, you're inevitably going to crash your JVM.
As for having to read the whole file at initialization. You're going to have to do this eventually so that you can efficiently search and analyze the text in your code. If you know that you're only going to be searching a certain portion of your file at a time, you can implement lazy loading. If not, you might as well bite the bullet and load your entire file into the DB in the beggenning. You can implement parallelism in this process, if there are other parts of your code execution that doesn't depend on this.
Please let me know if you have any questions!
As stated in a comment, a Trie will save you a lot of memory.
You should also consider using bytes instead of chars as this saves you a factor of 2 for plain ASCII text or when using your national charset as long as it has no more than 256 different letters.
At the first glance, combining this low-level optimization with tries makes no sense, as with them the node size is dominated by the pointers. But there's a way if you want to go low level.
So what is critical for usage is to be able to look strings up in the dictionary, obtain their keys as fast as possible.
Then forget any database, as they're damn slow when compared to HashMaps.
If it doesn't fit into memory, the cheapest solution is usually to get more of it. Otherwise, consider loading only the most common words and doing something slower for the others (e.g., a memory mapped file).
I was asked to point to a good tries implementation, especially off-heap. I'm not aware of any.
Assuming the OP needs no mutability, especially no mutability of keys, it all looks very simple.
I guess, the whole dictionary could be easily packed into a single ByteBuffer. Assuming mostly ASCII and with some bit hacking, an arrow would need 1 byte per arrow label character and 1-5 bytes for the child pointer. The child pointer would be relative (i.e., difference between the current node and the child), which would make most of them fit into a single byte when stored in a base 128 encoding.
I can only guess the total memory consumption, but I'd say, something like <4 bytes per word. The above compression would slow the lookup down, but still nowhere near what a single disk access needs.
It sounds too big to store in memory. Either store it in a relational database (easy, and with an index on the hash, fast), or a NoSQL solution, like Solr (small learning curve, very fast).
Although NoSQL is very fast, if you really want to tweak performance, and there are entries that are far more frequently looked up than others, consider using a limited size cache to hold the most recently used (say) 10000 lookups.
What options are there for processing large files quickly, multiple times?
I have a single file (min 1.5 GB, but can be upwards of 10-15 GB) that needs to be read multiple times - on the order of hundreds to thousands of times. The server has a large amount of RAM (64+ GB) and plenty of processors (24+).
The file will be sequential, read-only. Files are encrypted (sensitive data) on disk. I also use MessagePack to deserialize them into objects during the read process.
I cannot store the objects created from the file into memory - too large of an expansion (1.5 GB file turns into 35 GB in-memory object array). File can't be stored as a byte array (limited by Java's array length of 2^32-1).
My initial thought is to use a memory mapped file, but that has its own set of limitations.
The idea is to get the file off the disk and into memory for processing.
The large volume of data is for a machine learning algorithm, that requires multiple reads. During the calculation of each file pass, there's a considerable amount of heap usage by the algorithm itself, which is unavoidable, hence the requirement to read it multiple times.
The problem you have here is that you cannot mmap() the way the system call of the same name does; the syscall can map up to 2^64, FileChannel#map() cannot map more than 2^30 reliably.
However, what you can do is wrap a FileChannel into a class and create several "map ranges" covering all the file.
I have done "nearly" such a thing except more complicated: largetext. More complicated because I have to do the decoding process to boot, and the text which is loaded must be so into memory, unlike you who reads bytes. Less complicated because I have a define JDK interface to implement and you don't.
You can however use nearly the same technique using Guava and a RangeMap<Long, MappedByteBuffer>.
I implement CharSequence in this project above; I suggest that you implement a LargeByteMapping interface instead, from which you can read whatever parts you want; or, well, whatever suits you. Your main problem will be to define that interface. I suspect what CharSequence does is not what you want.
Meh, I may even have a go at it some day, largetext is quite exciting a project and this looks like the same kind of thing; except less complicated, ultimately!
One could even imagine a LargeByteMapping implementation where a factory would create such mappings with only a small part of that into memory and the rest written to a file; and such an implementation would also use the principle of locality: the latest queried part of the file into memory would be kept into memory for faster access.
See also here.
EDIT I feel some more explanation is needed here... A MappedByteBuffer will NOT EAT HEAP SPACE!!
It will eat address space only; it is nearly the equivalent of a ByteBuffer.allocateDirect(), except it is backed by a file.
And a very important distinction needs to be made here; all of the text above supposes that you are reading bytes, not characters!
Figure out how to structure the data. Get a good book about NoSQL and find the appropriate Database (Wide-Column, Graph, etc.) for your scenario. That's what I'd do. You'd not only have sophisticated query methods on your data but also mangling the data using distribute map-reduced implementations doing whatever you want. Maybe that's what you want (you even dropped the bigdata bomb)
How about creating "a dictionary" as the bridge between your program and the target file? Your program will call the dictionary then dictionary will refer you to the big fat file.
I have a huge dump file - 12GB of text containing millions of entries. Each entry has a numeric id, some text, and other irrelevant properties. I want to convert this file into something that will provide an efficient look-up. That is, given an id, it would return the text quickly. The limitations:
Embedded in Java, preferably without an external server or foreign language dependencies.
Read and writes to the disk, not in-memory - I don't have 12GB of RAM.
Does not blow up too much - I don't want to turn a 12GB file into a 200GB index. I don't need full text search, sorting, or anything fancy - Just key-value lookup.
Efficient - It's a lot of data and I have just one machine, so speed is an issue. Tools that can store large batches and/or work well with several threads are preferred.
Storing more than one field is nice, but not a must. The main concern is the text.
Your recommendations are welcomed!
I would use Java Chronicle or something like it (partly because I wrote it) because it is designed to access large amounts of data (larger than your machine) some what randomly.
It can store any number of fields in text or binary formats (or a combination if you wish) It adds 8 bytes per record you want to be able to randomly access. It doesn't support deleting records (you can mark them for reuse), but you can update and add new records.
It can only have a single writer thread, but it can be read by an number of threads on the same machine (even different processes)
It doesn't support batching but it can read/write millions of entries per second with typical sub microsecond latency (except for random reads/writes which are not in memory)
It uses next to no heap (<1 MB for TBs of data)
It uses an id which is sequential but you can build a table to do just that translation.
BTW: You can buy 32 GB for less than $200. Perhaps its time to get more memory ;)
Why not use JavaDb - the db that comes with Java ?
It'll store the info on disk, and be efficient in terms of lookups, provided you index properly. It'll run in-JVM, so you don't need a separate server/service. You talk to it using standard JDBC.
I suspect it'll be pretty efficient. This database has a long history (it used to be IBM's Derby) and will have had a lot of effort expended on it in terms of robustness and efficiency.
You'll obviously need to do an initial onboarding of the data to create the database, but that's a one-off task.
We have an application in which an XML string is created from a stored proc resultset and transformed using XSLT to return to the calling servlet. This work fine with smaller dataset but causing out of memory error with large amount of data. What will be the ideal solution in this case ?
XSLT transformations, in general, require the entire dataset to be loaded into memory, so the easiest thing is to get more memory.
If you can rewrite your XSLT, there is Streaming Transformations for XML which allow for incremental processing of data.
If you're processing the entire XML document at once then it sounds like you'll need to allocate more memory to the Java heap. But that only works up to the defined maximum heap size. Do you know a reasonable maximum data set size or is it unbounded?
Why do you need the database to generate the XML?
Few important things to note.
You mentioned works fine functionally with small data-set but goes out of memory with large data sets. you need to identify whether its creation of datasets that causes out of memory or transfer of datasets in the same process.
You are doing something which is making many objects to stay in memory. Re-Check your code and nullify some objects explicitly after usage.This will make life easier for garbage collector. Play with MaxPermSize settings of JVM. This will give you additional space for strings.
This approach is going to have a limitation that even if you are able to transfer large datasets for single user this might go outOfMemory for multiple users.
A suggestion that might work for you.
Break this in an Asyncronous process.Make creation of large datasets separate process and downloading of that datasets a different process.
While making the datasets available for download you can very well control the memory consumption using stream based downloading.
OK, so I am writing a program that unfortunately needs to use a huge data structure to complete its work, but it is failing with a "out of memory error" during its initialization. While I understand entirely what that means and why it is a problem, I am having trouble overcoming it, since my program needs to use this large structure and I don't know any other way to store it.
The program first indexes a large corpus of text files that I provide. This works fine.
Then it uses this index to initialize a large 2D array. This array will have n² entries, where "n" is the number of unique words in the corpus of text. For the relatively small chunk I am testing it o n(about 60 files) it needs to make approximately 30,000x30,000 entries. This will probably be bigger once I run it on my full intended corpus too.
It consistently fails every time, after it indexes, while it is initializing the data structure(to be worked on later).
Things I have done include:
revamp my code to use a primitive int[] instead of a TreeMap
eliminate redundant structures, etc...
Also, I have run the program with-Xmx2g to max out my allocated memory
I am fairly confident this is not going to be a simple line of code solution, but is most likely going to require a very new approach. I am looking for what that approach is, any ideas?
Thanks,
B.
It sounds like (making some assumptions about what you're using your array for) most of the entries will be 0. If so, you might consider using a sparse matrix representation.
If you really have that many entries (your current array is somewhere over 3 gigabytes already, even assuming no overhead), then you'll have to use some kind of on-disk storage, or a lazy-load/unload system.
There are several causes of out of memory issues.
Firstly, the simplest case is you simply need more heap. You're using 512M max heap when your program could operate correctly with 2G. Increase is with -Xmx2048m as a JVM option and you're fine. Also be aware than 64 bit VMs will use up to twice the memory of 32 bit VMs depending on the makeup of that data.
If your problem isn't that simple then you can look at optimization. Replacing objects with primitives and so on. This might be an option. I can't really say based on what you've posted.
Ultimately however you come to a cross roads where you have to make a choice between virtulization and partitioning.
Virtualizing in this context simply means some form of pretending there is more memory than there is. Operating systems use this with virtual address spaces and using hard disk space as extra memory. This could mean only keeping some of the data structure in memory at a time and persisting the rest to secondary storage (eg file or database).
Partitioning is splitting your data across multiple servers (either real or virtual). For example, if you were keeping track of stock trades on the NASDAQ you could put stock codes starting with "A" on server1, "B" on server2, etc. You need to find a reasonable approach to slice your data such that you reduce or eliminate the need for cross-communication because that cross-communication is what limits your scalability.
So simple case, if what you're storing is 30K words and 30K x 30K combinations of words you could divide it up into four server:
A-M x A-M
A-M x N-Z
N-Z x A-M
N-Z x N-Z
That's just one idea. Again it's hard toc omment without knowing specifics.
This is a common problem dealing with large datasets. You can optimize as much as you want, but the memory will never be enough (probably), and as soon as the dataset grows a little more you are still smoked. The most scalable solution is simply to keep less in memory, work on chunks, and persist the structure on disk (database/file).
If you don't need a full 32 bits (size of integer) for each value in your 2D array, perhaps a smaller type such as a byte would do the trick? Also you should give it as much heap space as possible - 2GB is still relatively small for a modern system. RAM is cheap, especially if you're expecting to be doing a lot of processing in-memory.