HashSet of Strings taking up too much memory, suggestions...?

HashSet of Strings taking up too much memory, suggestions...? - java

I am currently storing a list of words (around 120,000) in a HashSet, for the purpose of using as a list to check enetered words against to see if they are spelt correctly, and just returning yes or no.
I was wondering if there is a way to do this which takes up less memory. Currently 120,000 words is around 12meg, the actual file the words are read from is around 900kb.
Any suggestions?
Thanks in advance

You could use a prefix tree or trie: http://en.wikipedia.org/wiki/Trie

Check out bloom filters or cuckoo hashing. Bloom filter or cuckoo hashing?
I am not sure if this is the answer for your question but worth looking into these alternatives. bloom filters are mainly used for spell checker kind of use cases.

HashSet is probably not the right structure for this. Use Trie instead.

This might be a bit late but using Google you can easily find the DAWG investigation and C code that I posted a while ago.
http://www.pathcom.com/~vadco/dawg.html
TWL06 - 178,691 words - fits into 494,676 Bytes
The downside of a compressed-shared-node structure is that it does not work as a hash function for the words in your list. That is to say, it will tell you if a word exists, but it will not return an index to related data for a word that does exist.
If you want the perfect and complete hash functionality, in a processor-cache sized structure, you are going to have to read, understand, and modify a data structure called the ADTDAWG. It will be slightly larger than a traditional DAWG, but it is faster and more useful.
http://www.pathcom.com/~vadco/adtdawg.html
All the very best,
JohnPaul Adamovsky

12MB to store 120,000 words is about 100 bytes per word. Probably at least 32 bytes of that is String overhead. If words average 10 letters and they are stored as 2-byte chars, that accounts for another 20 bytes. Then there is the reference to each String in your HashSet, which is probably another 4 bytes. The remaining 44 bytes is probably the HashSet entry and indexing overhead, or something I haven't considered above.
The easiest thing to go after is the overhead of the String objects themselves, which can take far more memory than is required to store the actual character data. So your main approach would be to develop a custom representation that avoids storing a separate object for each string. In the course of doing this, you can also get rid of the HashSet overhead, since all you really need is a simple word lookup, which can be done by a straightforward binary search on an array that will be part of your custom implementation.
You could create your custom implementation as an array of type int with one element for each word. Each of these int elements would be broken into sub-fields that contain a length and an offset that points into a separate backing array of type char. Put both of these into a class that manages them, and that supports public methods allowing you to retrieve and/or convert your data and individual characters given a string index and an optional character index, and to perform the simple searches on the list of words that are needed for your spell check feature.
If you have no more than 16777216 characters of underlying string data (e.g., 120,000 strings times an average length of 10 characters = 1.2 million chars), you can take the low-order 24 bits of each int and store the starting offset of each string into your backing array of char data, and take the high-order 8 bits of each int and store the size of the corresponding string there.
Your char data will have your erstwhile strings crammed together without any delimiters, relying entirely upon the int array to know where each string starts and ends.
Taking the above approach, your 120,000 words (at an average of 10 letters each) would require about 2,400,000 bytes of backing array data and 480,000 bytes of integer index data (120,000 x 4 bytes), for a total of 2,880,000 bytes, which is about a 75 percent savings over the present 12MB amount you have reported above.
The words in the arrays would be sorted alphabetically, and your lookup process could be a simple binary search on the int array (retrieving the corresponding words from the char array for each test), which should be very efficient.
If your words happen to be entirely ASCII data, you could save an additional 1,200,000 bytes by storing the backing data as bytes instead of as chars.
This could get more difficult if you needed to alter these strings. Apparently, in your case (spell checker), you don't need to (unless you want to support user additions to the list, which would be infrequent anyway, and so re-writing the char data and indexes to add or delete words might be acceptable).

One way to save memory to save memory is to use a radix tree. This is better than a trie as the prefixes are not stored redundantly.
As your dictionary is fixed another way is to build a perfect hash function for it. Your hash set does not need buckets (and the associated overhead) as there cannot be collisions. Every implementation of a hash table/hash set that uses open addressing can be used for this (like google collection's ImmutableSet).

The problem is by design: Storing such a huge amount of words in a HashSet for spell-check-reasons isn't a good idea:
You can either use a spell-checker (example: http://softcorporation.com/products/spellcheck/ ), or you can buildup a "auto-wordcompletion" with a prefix tree ( description: http://en.wikipedia.org/wiki/Trie ).
There is no way to reduce memory-usage in this design.

You can also try Radix Tree(Wiki,Implementation) .This some what like trie but more memory efficient.

Related

How to store integer numbers in the range of 0-9 in only 4 bits and use the same as Key in HashMap?

I have been asked to come up with a solution where you have a file where each line represent a 10 digit phone number and we need to tell whether a given 10 digit phone number is present in the file or not.
I came up with Trie Data structure where each each children is nothing but a Map of integer as Key and Trie as Value.
class Trie{
boolean isEnd;
Map<Integer, Trie> map = new HashMap<>();
}
I can take int[] arr also to store the children.
As we have only numbers ranging from 0 - 9, so we can store these numbers in 4 bits only. Why to take 'int' or Integer as data type. How to reduce memory here?
How we can store this numbers in Map or array but not taking int as we will end up wasting lot of memory.
Moreover is there any better solution than Trie?

If you're going for memory efficiency, I would actually advise against using a trie and recommend a different data structure. As I understand it, you are only interested in answering queries of the form "have I see this phone number before?" While you could do this by treating the phone numbers as strings and throwing all of them into a trie, you wouldn't be taking advantage of the operations that tries are designed to support (fast prefix searching, retrieving elements in sorted order, etc.), so you'd be paying for things you wouldn't be using.
Moreover, let's think about the space usage of the trie. Even if every phone number had a long common prefix, each node in the trie requires space to store its child pointers. If you store even one (64-bit) pointer per node, you're using the same amount of space that you'd be using to store a 10-digit phone number (which fits comfortably into a 64-bit integer). If the phone numbers don't have long shared prefixes, you're potentially storing ten pointers per number, a huge space blowup, regardless of how big the hash table keys are.
Instead of throwing things into a trie, I'd consider just using a simple, vanilla hash table. After all, hash tables are specifically optimized to support membership queries and membership queries alone. Hashing phone numbers shouldn't be too bad, as they can be packed into 64-bit integers and hashed using a variety of simple hashing techniques. This lets you control what kind of time/space tradeoff you want to make (larger table sizes increase memory and decrease time, smaller tables increase time and decrease memory).

How to reduce memory usage for a HashMap<String, Integer> like data structure

Before starting to explain my problem, I should mention that I am not looking for a way to increase Java heap memory. I should strictly store these objects.
I am working on storing huge number (5-10 GB) of DNA sequences and their counts (Integer) in a hash table. The DNA sequences (with length 32 or less) consists of 'A', 'C', 'G', 'T', and 'N' (undefined) chars. As we know, when storing a large number of objects in memory, Java has poor space efficiency compared to lower level languages like C and C++. Thus, if I store this sequence as string (it holds about 100 MB memory for a sequence with length ~30), I see the error.
I tried to represent nucleic acids as 'A'=00, 'C'=01, 'G'=10, 'T'=11 and neglect 'N' (because it ruins the char to 2-bit transform as the 5-th acid). Then, concatenate these 2-bit acids into byte array. It brought some improvement but unfortunately I see the error after a couple of hours again. I need a convenient solution or at least a workaround to handle this error. Thank you in advance.

Being fairly complex maybe this here is a weird idea, and would require quite a lot of work, but this is what I would try:
You already pointed out two individual subproblems of your overall task:
the default HashMap implementation may be suboptimal for such large collection sizes
you need to store something else than strings
The map implementation
I would recommend to write a highly tailored hash map implementation for the Map<String, Long> interface. Internally you do not have to store strings. Unfortunately 5^32 > 2^64, so there is no way to pack your whole string into a single long, well, let's stick to two longs for a key. You can make string to/back long[2] conversion fairly efficiently on the fly when providing a string key to your map implementation (use bit shifts etc).
As for packing the values, here are some considerations:
for a key-value pair a standard hashmap will need to have an array of N longs for buckets, where N is the current capacity, when the bucket is found from the hash key it will need to have a linked list of key-value pairs to resolve keys that produce identical hash codes. For your specific case you could try to optimize it in the following way:
use a long[] of size 3N where N is the capacity to store both keys and values in a continuous array
in this array, at locations 3 * (hashcode % N) and 3 * (hashcode % N) + 1 you store the long[2] representation of the key, of the first key that matches this bucket or of the only one (on insertion, zero otherwise), at location 3 * (hashcode % N) + 2 you store the corresponding count
for all those cases where a different key results in the same hash code and thus the same bucket, your store the data in a standard HashMap<Long2KeyWrapper, Long>. The idea is to keep the capacity of the array mentioned above (and resize correspondingly) large enough to have by far the largest part of the data in that contiguous array and not in the fallback hash map. This will dramatically reduce the storage overhead of the hashmap
do not expand the capacity in N=2N iterations, make smaller growth steps, e.g. 10-20%. this will cost performance on populating the map, but will keep your memory footprint under control
The keys
Given the inequality 5^32 > 2^64 your idea to use bits to encode 5 letters seems to be the best I can think of right now. Use 3 bits and correspondingly long[2].

I recommend you look into the Trove4j Collections API; it offers Collections that hold primitives which will use less memory than their boxed, wrapper classes.
Specifically, you should check out their TObjectIntHashMap.
Also, I wouldn't recommended storing anything as a String or char until JDK 9 is released, as the backing char array of a String is UTF-16 encoded, using two bytes per char. JDK 9 defaults to UTF-8 where only one byte is used.

If you're using on the order of ~10gb of data, or at least data with an in memory representation size of ~10gb, then you might need to think of ways to write the data you don't need at the moment to disk and load individual portions of your dataset into memory to work on them.
I had this exact problem a few years ago when I was conducting research with monte carlo simulations so I wrote a Java data structure to solve it. You can clone/fork the source here: github.com/tylerparsons/surfdep
The library supports both MySQL and SQLite as the underlying database. If you don't have either, I'd recommend SQLite as it's much quicker to set up.
Full disclaimer: this is not the most efficient implementation, but it will handle very large datasets if you let it run for a few hours. I tested it successfully with matrices of up to 1 billion elements on my Windows laptop.

Storing a (string,integer) tuple more efficiently and apply binary search

Introduction
We store tuples (string,int) in a binary file. The string represents a word (no spaces nor numbers). In order to find a word, we apply binary search algorithm, since we know that all the tuples are sorted with respect to the word.
In order to store this, we use writeUTF for the string and writeInt for the integer. Other than that, let's assume for now there are no ways to distinguish between the start and the end of the tuple unless we know them in advance.
Problem
When we apply binary search, we get a position (i.e. (a+b)/2) in the file, which we can read using methods in Random Access File, i.e. we can read the byte at that place. However, since we can be in the middle of the word, we cannot know where this words starts or finishes.
Solution
Here're two possible solutions we came up with, however, we're trying to decide which one will be more space efficient/faster.
Method 1: Instead of storing the integer as a number, we thought to store it as a string (using eg. writeChars or writeUTF), because in that case, we can insert a null character in the end of the tuple. That is, we can be sure that none of the methods used to serialize the data will use the null character, since the information we store (numbers and digits) have higher ASCII value representations.
Method 2: We keep the same structure, but instead we separate each tuple with 6-8 (or less) bytes of random noise (same across the file). In this case, we assume that words have a low entropy, so it's very unlikely they will have any signs of randomness. Even if the integer may get 4 bytes that are exactly the same as those in the random noise, the additional two bytes that follow will not (with high probability).
Which of these methods would you recommend? Is there a better way to store this kind of information. Note, we cannot serialize the entire file and later de-serialize it into memory, since it's very big (and we are not allowed to).

I assume you're trying to optimize for speed & space (in that order).
I'd use a different layout, built from 2 files:
Interger + Index file
Each "record" is exactly 8 bytes long, the lower 4 are the integer value for the record, and the upper 4 bytes are an integer representing the offset for the record in the other file (the characters file).
Characters file
Contiguous file of characters (UTF-8 encoding or anything you choose). "Records" are not separated, not terminated in any way, simple 1 by 1 characters. For example, the records Good, Hello, Morning will look like GoodHelloMorning.
To iterate the dataset, you iterate the integer/index file with direct access (recordNum * 8 is the byte offset of the record), read the integer and the characters offset, plus the character offset of the next record (which is the 4 byte integer at recordNum * 8 + 12), then read the string from the characters file between the offsets you read from the index file. Done!

it's less than 200MB. Max 20 chars for a word.
So why bother? Unless you work on some severely restricted system, load everything into a Map<String, Integer> and get a few orders of magnitude speed up.
But let's say, I'm overlooking something and let's continue.
Method 1: Instead of storing the integer as a number, we thought to store it as a string (using eg. writeChars or writeUTF), because in that case, we can insert a null character
You don't have to as you said that your word contains no numbers. So you can always parse things like 0124some456word789 uniquely.
The efficiency depends on the distribution. You may win a factor of 4 (single digit numbers) or lose a factor of 2.5 (10-digit numbers). You could save something by using a higher base. But there's the storage for the string and it may dominate.
Method 2: We keep the same structure, but instead we separate each tuple with 6-8 (or less) bytes of random noise (same across the file).
This is too wasteful. Using four zeros between the data byte would do:
Find a sequence of at least four zeros.
Find the last zero.
That's the last separator byte.
Method 3: Using some hacks, you could ensure that the number contains no zero byte (either assuming that it doesn't use the whole range or representing it with five bytes). Then a single zero byte would do.
Method 4: As disk is organized in blocks, you should probably split your data into 4 KiB blocks. Then you can add some time header allowing you quick access to the data (start indexes for the 8th, 16th, etc. piece of data). The range between e.g., the 8th and 16th block should be scanned sequentially as it's both simpler and faster than binary search.

Data structure choice for ngrams upto length 5, when building count-based distributional model

I am building a distributional model (count based) from text. Basically for each ngram (a sequence of words), I have to store a count. I need reasonably quick access to the count. For n=5, technically all possible 5-grams are (10^4)^5 even if I assume a conservative estimate of 10k words, which is too high. But many combinations of these n-grams wouldn't exist in text, so a 5d array kind of structure is out of consideration.
I built a trie, where each word is a node. So this trie would be really wide, with max depth 5. That gave me considerable saving of memory. But I still run out of memory (64GB) after I train on enough files. To be fair, I am not using any super efficient Java practices here. Each node has a count, index of word as int. I then have a HashMap to store children. I initially started with a list. Tried to sort it each time I added a child, but I was losing lot of time there, so moved to HashMap. Even with a list, I will run out of memory after reading some more files.
So I guess I need to divide my task into parts, store each part to disk. But ultimately, when accessing I would need to merge these data structures. So I think the way forward is a disk based solution, where I know which file to access for ngrams which start with something (some sort of ordering). As I see it, the problem with trie is it's not very efficient when I go around to merging it. I would need to load two parts into memory to merge. That wouldn't really work.
What approach would you recommend? I looked into a HashMap encoding based structure for language models (like the one berkeleylm uses). But in their use case, they don't need to reconstruct the ngram, so they just hash it and store the hashvalue as context. I need to be able to access the context later.
Any suggestions? Is there any value in using a database? Can they do it without being in-memory?

I wouldn't use HashMap, it's quite memory intensive, a simple sorted array should be better, you can then use binary search on it.
Maybe you could also try a binary prefix-trie. First you create a single char-string, for example by interleave the letters of the words into a single string (I suppose you could also concatenate them, separated by a blank). This long String could then be stored in a binary trie. See CritBit1D for an example.
You could also use a multi-dimensional tree. Many trees are limited to 64bit numbers, but you cold turn the eight leading ASCII characters of every word into a 64-bit integer number and then store that as a 5D key. That should be much more efficient than a 5D array. Multi-dim indexes are: kd-trees, R-trees or quadtrees. The 5-gram-count and the full 5-gram (including remaining characters) can be stored separately in the VALUE that can be associated with each 5D-KEY.
If you are using Java you could try my very own tree. It's a prefix-sharing bitwise quadtree. It is very memory efficient, very well suited to larger datasets (1M entries upwards) and works natively with 'integer' rather than 'float'. It also has very good nearest neighbour search.

Memory and speed efficient search on Strings

I have a bunch of Strings I'd like a fast lookup for. Each String is 22 chars long and is looked up by the first 12 only (the "key" so to say), the full set of Strings is recreated periodically. They are loaded from a file and refreshed when the file changes. I have to deal with too little available memory, other server processes on my VPS need it too and need it more.
How do I best store the Strings and search for them?
My current idea is to store them all one after another inside a char[] (to save RAM), and sort them for faster lookups (I figure the lookup is fastest if I have them presorted so I can use binary or interpolation search). But I'm not exactly sure how I should code it - if anyone is in the mood for a challenging puzzle: here it is...
Btw: It's probably ok to exceed the memory constraints for a while during the recreation / sorting, but it shouldn't be by much or for long.
Thanks!
Update
For the "I want to know specifics" crowd (correct me if I'm wrong in the Java details): The source files contain about 320 000 entries (all ANSI text), I really want to stay (WAY!) below 64 MB RAM usage and the data is only part of my program. Here's some information on sizes of Java types in memory.
My VPS is a 32bit OS, so...
one byte[], all concatenated = 12 + length bytes
one char[], all concatenated = 12 + length * 2 bytes
String = 32 + length * 2 bytes (is Object, has char[] + 3 int)
So I have to keep in memory:
~7 MB if all are stored in a byte[]
~14 MB if all are stored in a char[]
~25 MB if all are stored in a String[]
> 40 MB if they are stored in a HashTable / Map (for which I'd probably have to finetune the initial capacity)
A HashTable is not magical - it helps on insertion, but in principle it's just a very long array of String where the hashCode modulus capacity is used as an index, the data is stored in the next free position after the index and searched lineary if it's not found there on lookup. But for a Hashtable, I'd need the String itself and a substring of the first 12 chars for lookup. I don't want that (or do I miss something here?), sorry folks...

I would probably use a cache solution for that, may be even guava will do. Of course sort them, then binary search. Unfortunately I do not have the time for it :(

Sounds like a HashTable would be the right implementation for this situation.
Searching is done in constant time and refreshing could be done in linear time.
Java Data Structure Big-O (Warning PDF)

I coded a solution myself - but it's a little different than the question I posted because I could use information I didn't publish (I'll do better next time, sorry).
I'm just answering this because it's solved, I won't accept one of the other answers because they didn't really help with the memory constraints (and were a little short for my taste). They still got an upvote each, no hard feelings and thanks for taking the time!
I managed to push all of the info into two longs (with the key completely residing in the first one). The first 12 chars are an ISIN which can be compressed into a long because it only uses digits and capital letters, always starts with two capital letters and ends with a digit which can be reconstructed from the other chars. The product of all possible values leaves a little more than 3 bits to spare.
I store all entries from my source file in a long[] (packed ISIN first, other stuff in the second long) and sort them based on the first of two longs.
When I do a query by a key, I transform it to a long, do a binary search (which I'll maybe change to an interpolation search) and return the matching index. The different parts of the value are retrievable by said index - I get the second long from the array, unpack it and return the requested data.
The result: RAM usage dropped from ~110 MB to < 50 MB including Jetty (btw - I used a HashTable before) and lookups are lightning fast.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.