Reducing memory usage of very large HashMap

I have a very large hash map (2+ million entries) that is created by reading in the contents of a CSV file. Some information:
The HashMap maps a String key (which is less than 20 chars) to a String value (which is approximately 50 characters).
This HashMap is initialized with an initial capacity of 3 million so that the load factor is around .66.
The HashMap is only utilized by a single operation, and once that operation is completed, I "clear()" it. (Although it doesn't appear that this clear actually clears up memory, is a separate call to System.gc() necessary?).
One idea I had was to change the HashMap<String, String> to HashMap<Integer, String> and use the hashCode of the String as the key. This will end up saving a bit of memory but risks issues with collisions if two strings have identical hash codes ... how likely is this for strings that are less than 20 characters long?
Does anyone else have any ideas on what to do here? The CSV file itself is only 100 MB, but Java ends up using over 600 MB of memory for this HashMap.
Thanks!

It sounds like you have the framework to try this already. Instead of adding the string, add the string.hashCode() and see if you get collisions.
In terms of freeing up memory, the JVM generally doesn't get smaller, but it will garbage collect if it needs to.
Also, it sounds like you might have an algorithm that doesn't need the hash table at all. Could you describe what you're trying to do in a little more detail?
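To get a feel for the collision risk before changing the map, something like the sketch below could be run over the real keys. It is only an illustration: the class and method names are made up, and the estimate assumes hash codes are spread uniformly over all 2^32 int values, which String.hashCode() does not guarantee. Under that assumption, 2 million keys are expected to produce a few hundred colliding pairs, so a collision is close to certain rather than unlikely.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

class HashCollisionCheck {

    /** Expected number of colliding pairs among n keys, assuming uniform 32-bit hash codes. */
    static double expectedCollidingPairs(long n) {
        return (n * (n - 1) / 2.0) / Math.pow(2, 32);
    }

    /** Counts keys whose hashCode() equals that of a different key seen first. */
    static int countCollisions(Iterable<String> keys) {
        Map<Integer, String> firstSeen = new HashMap<>();
        int collisions = 0;
        for (String key : keys) {
            String previous = firstSeen.putIfAbsent(key.hashCode(), key);
            if (previous != null && !previous.equals(key)) {
                collisions++;
            }
        }
        return collisions;
    }

    public static void main(String[] args) {
        // "Aa" and "BB" are distinct strings that share one hash code.
        System.out.println("Aa".hashCode() == "BB".hashCode());
        System.out.println(countCollisions(Arrays.asList("Aa", "BB", "Cc")));
        System.out.printf("%.0f%n", expectedCollidingPairs(2_000_000L));
    }
}
```

With an Integer key, two colliding strings would silently overwrite each other, so the count from the real data matters more than the estimate.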

Parse the CSV, and build a Map whose keys are your existing keys, but whose values are Integer pointers to the locations in the file for each key.
When you want the value for a key, look up its location in the map, then use a RandomAccessFile to read that line from the file. Keep the RandomAccessFile open during processing, then close it when done.
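A minimal sketch of this idea follows. It assumes a two-column CSV (key,value) in a single-byte encoding with no quoted commas, and the class name CsvOffsetIndex is made up for illustration. It stores long byte offsets rather than Integer, so files larger than 2 GB would also fit.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

class CsvOffsetIndex implements AutoCloseable {
    private final RandomAccessFile file;
    private final Map<String, Long> offsets = new HashMap<>();

    /** Scans the file once and records the byte offset at which each key's line starts. */
    CsvOffsetIndex(String path) throws IOException {
        file = new RandomAccessFile(path, "r");
        long lineStart = file.getFilePointer();
        String line;
        // readLine() is unbuffered and maps each byte to one char, so this
        // sketch favors clarity over speed and assumes ASCII/Latin-1 content.
        while ((line = file.readLine()) != null) {
            int comma = line.indexOf(',');
            if (comma > 0) {
                offsets.put(line.substring(0, comma), lineStart);
            }
            lineStart = file.getFilePointer();
        }
    }

    /** Seeks to the stored offset and re-reads the value from disk; null for unknown keys. */
    String get(String key) throws IOException {
        Long offset = offsets.get(key);
        if (offset == null) {
            return null;
        }
        file.seek(offset);
        String line = file.readLine();
        return line.substring(line.indexOf(',') + 1);
    }

    @Override
    public void close() throws IOException {
        file.close();
    }

    public static void main(String[] args) throws IOException {
        // Demo on a small temporary file; a real run would pass the path of the CSV.
        Path csv = Files.createTempFile("demo", ".csv");
        Files.write(csv, "k1,alpha\nk2,beta\n".getBytes(StandardCharsets.US_ASCII));
        try (CsvOffsetIndex index = new CsvOffsetIndex(csv.toString())) {
            System.out.println(index.get("k2"));
        }
        Files.delete(csv);
    }
}
```

Note that the keys still live on the heap; only the roughly 50-character values stay on disk, at the cost of one seek per lookup.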

What you are trying to do is exactly a JOIN operation. Consider an in-memory DB like H2: load both CSV files into temp tables and then do a JOIN over them.
In my experience H2 handles load operations well, and this approach should be faster and less memory-intensive than your manual HashMap-based join.

If performance isn't the primary concern, store the entries in a database instead. Then memory isn't a concern, and you have good, if not great, search speed thanks to the database.

Related

Java - Millions of records, HashMap throws OutOfMemoryError

I'm reading a file and parsing a few fields of each record as a reference key and another field as the reference value. These keys and values are then looked up by another process.
So I chose a HashMap, so that I can easily get the value for each key.
But each file consists of tens of millions of records, so the HashMap throws an OutOfMemoryError. I don't think increasing the heap size is a good solution, since the input files may grow in the future.
For similar questions on SO, most answers suggest using a database. I fear I won't be given the option to use a DB. Is there any other way to handle the problem?
EDIT: I need to do this HashMap loading for 4 such files :( I need all four, because if I don't find a matching entry for my input in the first Map, I need to look in the second, then the third, and finally the fourth.
Edit 2: The files I have add up to around 1 GB.
EDIT 3:
034560000010000001750
000234500010000100752
012340000010000300374
I have records like these in a file. I need to have 03456000001000000 as the key and 1750 as the value, for all the millions of records. I'll look up these keys and get the values in another process.

Using a database will not reduce memory cost or runtime by itself.
However, the default hashmaps may not be what you are looking for, depending on your data types. When used with primitive values such as Integers, Java hashmaps have a massive memory overhead. In a HashMap<Integer, Integer>, every entry uses roughly 24+16+16 bytes. Unused entries (and the hashmap keeps up to half of them unused) take 4 bytes extra. So you can roughly estimate >56 bytes per int->int entry in Java HashMap<Integer, Integer>.
If you encode the integers as String, and we're talking maybe 6 digit numbers, that is likely 24 bytes for the underlying char[] array (16 bit characters; 12 bytes overhead for the array, sizes are a multiple of 8!), plus 16 bytes for the String object around (maybe 24, too). For key and value each. So that is then around 24+40+40, i.e. over 104 bytes per entry.
(Update: as your keys are 17 characters in length, make this 24+62+40, i.e. 136 bytes)
If you used a primitive hashmap such as GNU Trove TIntIntHashMap, it would only take 8 bytes + unused, so let's estimate 16 bytes per entry, at least 6 times less memory.
(Update: for TLongIntHashMap, estimate 12 bytes per entry, 24 bytes with overhead of unused buckets.)
Now you could also just store everything in a massive sorted list. This will allow you to perform a fast join operation, and you will lose much of the overhead of unused entries, and can probably process twice as many in much shorter time.
Oh, and if you know the valid value range, you can abuse an array as "hashmap".
I.e. if your valid keys are 0...999999, then just use an int[1000000] as storage, and write each entry into the appropriate row. Don't store the key at all - it's the offset in the array.
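A rough sketch of the array-as-hashmap trick, assuming keys in 0...999999 and int values. The class name and the MISSING sentinel are made up for illustration, and the sentinel only works if -1 is never a real value.

```java
import java.util.Arrays;

class DirectIndexMap {
    static final int MISSING = -1; // hypothetical sentinel; assumes real values are never -1
    private final int[] values;

    DirectIndexMap(int keyRange) {
        values = new int[keyRange];
        Arrays.fill(values, MISSING);
    }

    /** The key is never stored; it is simply the position in the array. */
    void put(int key, int value) {
        values[key] = value;
    }

    int get(int key) {
        return values[key];
    }

    public static void main(String[] args) {
        DirectIndexMap map = new DirectIndexMap(1_000_000); // about 4 MB for the int[]
        map.put(345_600, 1750);
        System.out.println(map.get(345_600));
        System.out.println(map.get(42) == MISSING);
    }
}
```

This only pays off when the key range is dense enough; the 17-digit keys from the question would be far too sparse for it.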
Last but not least, by default the JVM caps the heap at roughly 25% of the machine's physical memory. You probably want to raise that limit, for example with the -Xmx option.

Short answer: no. It's quite clear that you can't load your entire dataset in memory. You need a way to keep it on disk together with an index, so that you can access the relevant bits of the dataset without rescanning the whole file every time a new key is requested.
Essentially, a DBMS is a mechanism for handling (large) quantities of data: storing, retrieving, combining, filtering etc. They also provide caching for commonly used queries and responses. So anything you are going to do will be a (partial) reimplementation of what a DBMS already does.
I understand your concerns about having an external component to depend on, however note that a DBMS is not necessarily a server daemon. There are tiny DBMS which link with your program and keep all the dataset in a file, like SQLite does.

Such large data collections should be handled with a database. Java programs are limited in memory, varying from device to device. You provided no info about your program, but please remember that if it is run on different devices, some of them may have very little ram and will crash very quickly. DB (be it SQL or file-based) is a must when it comes to large-data programs.

You have to either
a) have enough memory to load the data into memory, or
b) read the data from disk, with an index which is either in memory or not.
Whether you use a database or not, the problem is much the same. If you don't have enough memory, you will see a dramatic drop in performance if you start randomly accessing the disk.
There are alternatives like Chronicle Map which use off-heap memory and perform well up to double your main memory size, so you won't get an out of memory error; however, you still have the problem that you can't store more data in memory than you have main memory.

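The memory footprint depends on how you read the file in Java. A widely used solution is to stream the file with the Apache Commons IO LineIterator. Their recommended usage:
    import java.io.File;
    import org.apache.commons.io.FileUtils;
    import org.apache.commons.io.LineIterator;

    // "file" is the java.io.File to read; lines are streamed one at a time.
    LineIterator it = FileUtils.lineIterator(file, "UTF-8");
    try {
        while (it.hasNext()) {
            String line = it.nextLine();
            // do something with line
        }
    } finally {
        it.close();
    }
It's an optimized approach, but if the file is too big you can still end up with an OutOfMemoryError (for example, when every record is still kept in a map).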

Since you write that you fear that you will not be given the option to use a database some kind of embedded DB might be the answer. If it is impossible to keep everything in memory it must be stored somewhere else.
I believe that some kind of embedded database that uses the disk as storage might work. Examples include BerkeleyDB and Neo4j. Since both databases use a file index for fast lookups, the memory load is lower than if you keep the entire dataset in memory, but they are still fast.

You could try lazy loading it.

Java: Optimal approach for storing and reading 1 billion data records

I'm looking for the fastest approach, in Java, to store ~1 billion records of ~250 bytes each (storage will happen only once) and then being able to read it multiple times in a non-sequential order.
The source records are being generated into simple java value objects and I would like to read them back in the same format.
For now my best guess is to store these objects, using a fast serialization library such as Kryo, in a flat file and then to use Java FileChannel to make direct random access to read the records at specific positions in the file (when storing the data, I will keep in a hashmap (also to be saved on disk) with the position in the file of each record so that I know where to read it).
Also, there is no need to optimize disk space. My key concern is to optimize read performance, while having a reasonable write performance (that, again, will happen only once).
One last detail: while the records are all of the same type (same Java value object), their size (in bytes) is variable (e.g. they contain strings).
Is there any better approach than what I mentioned above? Any hint or suggestion would be greatly appreciated!
Many thanks,
Thomas

You can use Apache Lucene, it will take care of everything you have mentioned above :)
It is super fast; you can search results more quickly than ever.
Apache Lucene persists objects in files and indexes them. We have used it in a couple of apps and it is super fast.

You could just use an embedded Derby database. It's written in Java and you can actually run it up embedded within your process so there is no overhead of inter-process or networked communication. It will store the data and allow you to query it/etc handling all the complexity and indexing for you.

300 million items in a Map

If each of them is guaranteed to have a unique key (generated and
enforced by an external keying system) which Map implementation is
the correct fit for me? Assume this has to be optimized for
concurrent lookup only (The data is initialized once during the
application startup).
Do these 300 million unique keys have any positive or negative
implications on bucketing/collisions?
Any other suggestions?
My map would look something like this
Map<String, <boolean, boolean, boolean, boolean>>

I would not use a map; it needs too much memory, especially in your case.
Store the values in one data array, and store the keys in a sorted index array.
In the sorted array you use binary search to find the position of a key in data[].
The tricky part will be building up the array without running out of memory.
You don't need to consider concurrency, because you only read from the data.
Further, try to avoid using a String as the key; try to convert the keys to long.
The advantage of this solution: search time is guaranteed not to exceed O(log n), even in the worst case where the keys have badly distributed hash codes.
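The layout described above could look roughly like the sketch below. It assumes the String keys can be converted to long (as suggested) and that the arrays arrive already sorted; the class name and the packing of the four booleans into one byte are illustrative choices, not something from the question. At 8 bytes per key plus 1 byte per value, 300 million entries would need about 2.7 GB.

```java
import java.util.Arrays;

class SortedKeyIndex {
    private final long[] keys;   // sorted ascending, unique
    private final byte[] flags;  // flags[i] packs the four booleans for keys[i]

    /** Assumes sortedKeys is sorted ascending and flags is in matching order. */
    SortedKeyIndex(long[] sortedKeys, byte[] flags) {
        this.keys = sortedKeys;
        this.flags = flags;
    }

    /** Packs four booleans into the low four bits of a byte. */
    static byte pack(boolean a, boolean b, boolean c, boolean d) {
        return (byte) ((a ? 1 : 0) | (b ? 2 : 0) | (c ? 4 : 0) | (d ? 8 : 0));
    }

    /** Returns the packed flags for the key, or -1 if the key is absent. */
    int lookup(long key) {
        int pos = Arrays.binarySearch(keys, key);
        return pos >= 0 ? flags[pos] : -1;
    }

    /** Reads boolean number index (0 to 3) out of a packed value. */
    static boolean flag(int packed, int index) {
        return (packed & (1 << index)) != 0;
    }

    public static void main(String[] args) {
        long[] keys = {10L, 20L, 30L};
        byte[] flags = {
            pack(true, false, false, false),
            pack(false, true, true, false),
            pack(true, true, true, true)
        };
        SortedKeyIndex index = new SortedKeyIndex(keys, flags);
        System.out.println(flag(index.lookup(20L), 1));
        System.out.println(index.lookup(25L));
    }
}
```

Because nothing is written after construction, lookups from many threads need no locking as long as the index is published safely at startup.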

Other suggestion? You bet.
Use a proper key-value store, Redis is the first option that comes to mind. Sure it's a separate process and dependency, but you'll win big time when it comes to proper system design.
There should be a very good reason why you would want to couple your business logic with several gigs of data in same process memory, even if it's ephemeral. I've tried this several times, and was always proved wrong.

It seems to me that you can simply use a TreeMap, because it gives you O(log(n)) lookups thanks to its sorted structure. Furthermore, it is a suitable choice because, as you said, all data will be loaded at startup.

If you need to keep everything in memory, then you will need a library designed for this number of elements, like Huge collections. On top of that, if the number of writes will be large, you also have to think about more sophisticated solutions like a non-blocking hash map.

How to remove duplicate words using Java when words are more than 200 million?

I have a file (size = ~1.9 GB) which contains ~220,000,000 (~220 million) words / strings. They have duplication, almost 1 duplicate word every 100 words.
In my second program, I want to read the file. I can successfully read the file line by line using BufferedReader.
Now to remove duplicates, we can use Set (and its implementations), but Set has problems, as described below in 3 different scenarios:
With default JVM size, Set can contain up to 0.7-0.8 million words, and then OutOfMemoryError.
With 512M JVM size, Set can contain up to 5-6 million words, and then OOM error.
With 1024M JVM size, Set can contain up to 12-13 million words, and then OOM error. Here, after 10 million records have been added to the Set, operations become extremely slow. For example, adding the next ~4000 records took 60 seconds.
I have restrictions that I can't increase the JVM size further, and I want to remove duplicate words from the file.
Please let me know if you have any idea about any other ways/approaches to remove duplicate words using Java from such a gigantic file. Many Thanks :)
Addition of info to question: My words are basically alpha-numeric and they are IDs which are unique in our system. Hence they are not plain English words.

Use merge sort and remove the duplicates in a second pass. You could even remove the duplicates while merging (just keep the latest word added to output in RAM and compare the candidates to it as well).
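The "remove duplicates while merging" step might look like the sketch below. It works on in-memory lists to stay short; in a real external sort each run would be a sorted temporary file and the output a writer, which is left out here. The class and method names are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

class DedupMerge {
    /** One sorted run plus the word currently at its head. */
    private static final class Run implements Comparable<Run> {
        final Iterator<String> rest;
        String head;

        Run(Iterator<String> it) {
            rest = it;
            head = it.next();
        }

        @Override
        public int compareTo(Run other) {
            return head.compareTo(other.head);
        }
    }

    /** Merges individually sorted runs into one sorted, duplicate-free list. */
    static List<String> mergeUnique(List<? extends Iterable<String>> sortedRuns) {
        PriorityQueue<Run> queue = new PriorityQueue<>();
        for (Iterable<String> run : sortedRuns) {
            Iterator<String> it = run.iterator();
            if (it.hasNext()) {
                queue.add(new Run(it));
            }
        }
        List<String> out = new ArrayList<>();
        String last = null; // most recent word written to the output
        while (!queue.isEmpty()) {
            Run run = queue.poll();
            if (!run.head.equals(last)) {
                out.add(run.head);
                last = run.head;
            }
            if (run.rest.hasNext()) {
                run.head = run.rest.next();
                queue.add(run); // re-insert with its new head
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> runs = Arrays.asList(
            Arrays.asList("a", "b", "b", "d"),
            Arrays.asList("b", "c", "d"));
        System.out.println(mergeUnique(runs));
    }
}
```

Only one word per run plus the last written word is held in memory at a time, which is what makes the approach fit a small heap.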

Divide the huge file into 26 smaller files based on the first letter of the word. If any of the letter files are still too large, divide that letter file by using the second letter.
Process each of the letter files separately using a Set to remove duplicates.

You might be able to use a trie data structure to do the job in one pass. It has advantages that recommend it for this type of problem. Lookup and insert are quick. And its representation is relatively space efficient. You might be able to represent all of your words in RAM.
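As a rough illustration of a trie used for one-pass duplicate detection, here is a sketch with a made-up WordTrie class. It uses a HashMap per node for brevity, which is far from the most compact representation; for 220 million IDs a denser child table would be needed to see real savings.

```java
import java.util.HashMap;
import java.util.Map;

class WordTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord; // true if a word ends at this node
    }

    private final Node root = new Node();

    /** Inserts the word; returns true if it was not present before. */
    boolean add(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        boolean isNew = !node.isWord;
        node.isWord = true;
        return isNew;
    }

    public static void main(String[] args) {
        WordTrie trie = new WordTrie();
        for (String word : new String[] {"id42", "id4", "id42"}) {
            if (trie.add(word)) {
                System.out.println(word); // only first occurrences are printed
            }
        }
    }
}
```

Shared prefixes are stored once, which is where any space saving would come from.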

If you sort the items, duplicates will be easy to detect and remove, as the duplicates will bunch together.
There is code here you could use to mergesort the large file:
http://www.codeodor.com/index.cfm/2007/5/10/Sorting-really-BIG-files/1194

For large files I try not to read the data into memory but instead operate on a memory mapped file and let the OS page in/out memory as needed. If your set structures contain offsets into this memory mapped file instead of the actual strings it would consume significantly less memory.
Check out this article:
http://javarevisited.blogspot.com/2012/01/memorymapped-file-and-io-in-java.html

Question: Are these really WORDS, or are they something else -- phrases, part numbers, etc?
For WORDS in a common spoken language one would expect that after the first couple of thousand you'd have found most of the unique words, so all you really need to do is read a word in, check it against a dictionary, if found skip it, if not found add it to the dictionary and write it out.
In this case your dictionary is only a few thousand words large. And you don't need to retain the source file since you write out the unique words as soon as you find them (or you can simply dump the dictionary when you're done).

If you have the possibility to insert the words into a temporary table of a database (using batch inserts), then it would be a SELECT DISTINCT on that table.

One classic way to solve this kind of problem is a Bloom filter. Basically you hash your word a number of times and for each hash result set some bits in a bit vector. If you're checking a word and all the bits from its hashes are set in the vector you've probably (you can set this probability arbitrarily low by increasing the number of hashes/bits in the vector) seen it before and it's a duplicate.
This was how early spell checkers worked. They knew whether a word was in the dictionary, but they couldn't tell you what the correct spelling was, because the filter only tells you whether the current word has been seen.
There are a number of open source implementations out there including java-bloomfilter
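For illustration, a hand-rolled Bloom filter can be sketched in a few lines. This is not the java-bloomfilter API; the class name, the choice of FNV-1a as a second hash, and the double-hashing scheme are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

class SimpleBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    SimpleBloomFilter(int sizeInBits, int hashCount) {
        this.bits = new BitSet(sizeInBits);
        this.size = sizeInBits;
        this.hashCount = hashCount;
    }

    /** 32-bit FNV-1a, used as a second hash alongside String.hashCode(). */
    private static int fnv1a(String s) {
        int h = 0x811c9dc5;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 16777619;
        }
        return h;
    }

    /** Sets the word's bits; returns true if all were already set (word probably seen before). */
    boolean addAndCheckSeen(String word) {
        int h1 = word.hashCode();
        int h2 = fnv1a(word);
        boolean seen = true;
        for (int i = 0; i < hashCount; i++) {
            // Derive the i-th bit position from the two base hashes.
            int index = Math.floorMod(h1 + i * h2, size);
            if (!bits.get(index)) {
                seen = false;
                bits.set(index);
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        SimpleBloomFilter filter = new SimpleBloomFilter(1 << 20, 4);
        System.out.println(filter.addAndCheckSeen("abc"));
        System.out.println(filter.addAndCheckSeen("abc"));
    }
}
```

A false positive here means a unique word is wrongly dropped, so the bit vector has to be sized generously if every ID must survive.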

I'd tackle this in Java the same way as in every other language: Write a deduplication filter and pipe it as often as necessary.
This is what I mean (in pseudo code):
Input parameters: Offset, Size
Allocate a searchable structure of size Size (= Set, but it need not be one)
Read Offset elements from stdin (or until EOF is encountered) and just copy them to stdout
Read Size elements from stdin (or until EOF) and store them in the Set. If an element is a duplicate, drop it, else write it to stdout.
Read elements from stdin until EOF; if they are in the Set then drop them, else write them to stdout
Now pipe as many instances as you need (If storage is no problem, maybe only as many as you have cores) with increasing Offsets and sane Size. This lets you use more cores, as I suspect the process is CPU bound. You can even use netcat and spread processing over more machines, if you are in a hurry.
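One stage of such a pipeline could be sketched as follows, assuming one word per line. The class name DedupStage and the built-in demo input are made up; with two arguments it reads stdin and writes stdout as described above.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.StringReader;
import java.io.Writer;
import java.util.HashSet;
import java.util.Set;

class DedupStage {
    /** Copies the first offset lines, remembers the next size lines, then filters the rest. */
    static void run(Reader input, Writer output, long offset, int size) throws IOException {
        BufferedReader in = new BufferedReader(input);
        BufferedWriter out = new BufferedWriter(output);
        Set<String> seen = new HashSet<>();
        String word;
        long position = 0;
        while ((word = in.readLine()) != null) {
            boolean keep;
            if (position < offset) {
                keep = true;                 // handled by an earlier stage: copy unchanged
            } else if (position < offset + size) {
                keep = seen.add(word);       // this stage's window: keep first occurrences
            } else {
                keep = !seen.contains(word); // later input: drop words seen in the window
            }
            if (keep) {
                out.write(word);
                out.write('\n');
            }
            position++;
        }
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 2) {
            // Pipeline use: java DedupStage <offset> <size>, reading stdin and writing stdout.
            run(new InputStreamReader(System.in), new OutputStreamWriter(System.out),
                Long.parseLong(args[0]), Integer.parseInt(args[1]));
        } else {
            // Small built-in demo when no arguments are given.
            run(new StringReader("a\nb\na\nc\nb\nd\n"), new OutputStreamWriter(System.out), 0, 2);
        }
    }
}
```

Offsets count lines in a stage's own input, which earlier stages may already have shortened, so an offset no larger than the number of lines known to be unique keeps the result correct.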

Even in English, which has a huge number of words for a natural language, the upper estimates are only about 80000 words. Based on that, you could just use a HashSet and add all your words to it (probably in all lower case to avoid case issues):
    // Sketch: assumes whitespace-separated words arrive on standard input (java.util.Scanner, HashSet, Set).
    Set<String> words = new HashSet<>();
    try (Scanner in = new Scanner(System.in)) {
        while (in.hasNext()) {
            words.add(in.next().toLowerCase());
        }
    }
If they are real words, this isn't going to cause memory problems, and it will be pretty fast too!

To not have to worry too much about the implementation you should use a database system, either plain old relational SQL or a NoSQL solution. I'm pretty sure you could use e.g. Berkeley DB Java Edition and then do (pseudo code)
    for (word : stream) {
        if (!DB.exists(word)) {
            DB.put(word)
            outstream.add(word)
        }
    }
The problem is in essence easy: you need to store things on disk because there is not enough memory, then use either sorting, O(N log N) (unnecessary), or hashing, O(N), to find the unique words.
If you want a solution that will very likely work but is not guaranteed to do so, use an LRU-type hash table. According to the empirical Zipf's law you should be OK.
A follow-up question to some smart guy out there: what if I have a 64-bit machine and set the heap size to, say, 12GB? Shouldn't virtual memory take care of the problem (although not in an optimal way), or is Java not designed this way?

Quicksort would be a good option over Mergesort in this case because it needs less memory. This thread has a good explanation as to why.

The most performant solutions arise from omitting unnecessary stuff. You are looking only for duplicates, so do not store the words themselves, store hashes. But wait, you are not interested in the hashes either, only in whether they were seen already, so do not store them. Treat the hash as a really large number, and use a bitset to check whether you have already seen this number.
So your problem boils down to a really big, sparsely populated bitmap, with its size depending on the hash width. If your hash is up to 32 bits, you can use riak bitmap.
... gone thinking about really big bitmap for 128+ bit hashes %) (I'll be back )

LRU byte Cache java

I need to implement a cache in Java with a maximum size, and I would like to bound it by the real size of the cache in memory rather than by the number of elements. This cache will basically have String keys and String values. I have already implemented the cache using Java's LinkedHashMap, but the question is how to determine the actual size of the cache so that I can adapt the policy to drop an object when the cache gets too big.
I wanted to compute it using getObjectSize() from the instrumentation package, but it does not seem to work as desired.
When I call getObjectSize() on a string, it returns the same size, 32, whatever the length of the string. I guess it's just measuring the String object itself (or its reference), not the content. So I don't know how to solve this problem efficiently.
Do you have any ideas ?
Thanks a lot!

You might want to consider using Ehcache with memory based cache sizing.

If your keys and values are both strings, then the calculation is easy: object overhead + 2 bytes per character in the strings. On a 32-bit Sun JVM, 32 bytes for overhead sounds correct.
There are a couple of caveats: first, the Map that you use to hold the cache adds its own overhead. This will depend on the size of the hash table and the number of entries in the map. Personally, I'd just ignore all overheads and base the calculation on the string lengths.
Second, unless you track strings by identity, you may over-count because the same string may be stored with multiple keys. Since tracking strings by identity would add yet more overhead, this is probably not worth doing.
And finally: while memory-limited caches seem like a good idea, they rarely are. If you know your application well enough, you should know the average string length, and can control the cache based on number of entries. And if you don't know your application that well, a simple LRU expiration policy is likely to get you into trouble: a large entry can cause many small entries to be expired. And if that happens, unless the cost to rebuild is proportional to the size, you've just made your cache less effective.
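If a byte-bounded cache is still wanted, the string-length estimate from above can drive eviction in an access-ordered LinkedHashMap. The sketch below is one possible shape, with a made-up class name; it counts 2 bytes per character and ignores all object and map overhead, so the limit is approximate (newer JVMs may also store Latin-1 strings more compactly). It is not thread-safe.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

class ByteBoundedLruCache {
    private final long maxBytes;
    private long currentBytes;
    // accessOrder = true makes iteration run from least to most recently used.
    private final LinkedHashMap<String, String> map = new LinkedHashMap<>(16, 0.75f, true);

    ByteBoundedLruCache(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    /** Rough estimate: 2 bytes per char, ignoring object and map overhead. */
    private static long sizeOf(String key, String value) {
        return 2L * (key.length() + value.length());
    }

    String get(String key) {
        return map.get(key); // also marks the entry as most recently used
    }

    void put(String key, String value) {
        String old = map.put(key, value);
        if (old != null) {
            currentBytes -= sizeOf(key, old);
        }
        currentBytes += sizeOf(key, value);
        // Evict least recently used entries until the estimate fits again.
        Iterator<Map.Entry<String, String>> it = map.entrySet().iterator();
        while (currentBytes > maxBytes && it.hasNext()) {
            Map.Entry<String, String> eldest = it.next();
            currentBytes -= sizeOf(eldest.getKey(), eldest.getValue());
            it.remove();
        }
    }

    int size() {
        return map.size();
    }

    public static void main(String[] args) {
        ByteBoundedLruCache cache = new ByteBoundedLruCache(20);
        cache.put("a", "1234");
        cache.put("b", "1234");
        cache.get("a");          // "a" is now more recently used than "b"
        cache.put("c", "1234");  // pushes the estimate over the limit
        System.out.println(cache.get("b"));
        System.out.println(cache.size());
    }
}
```

The loop is needed because LinkedHashMap's removeEldestEntry hook removes at most one entry per insertion, while a single large value may have to push out several small ones, which is exactly the effect the last paragraph warns about.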
