HashMap alternatives for memory-efficient data storage - java

I've currently got a spreadsheet-type program that keeps its data in an ArrayList of HashMaps. You'll no doubt be shocked when I tell you that this hasn't proven ideal: the overhead seems to be about 5x the size of the data itself.
This question asks about efficient collections libraries, and the answer was to use Google Collections. My follow-up is "which part?". I've been reading through the documentation but don't feel it gives a very good sense of which classes are a good fit for this. (I'm also open to other libraries or suggestions.)
So I'm looking for something that will let me store dense spreadsheet-type data with minimal memory overhead.
My columns are currently referenced by Field objects, rows by their indexes, and values are Objects, almost always Strings.
Some columns will have a lot of repeated values.
The primary operations are updating or removing records based on the values of certain fields, and adding/removing/combining columns.
I'm aware of options like H2 and Derby but in this case I'm not looking to use an embedded database.
EDIT: If you're suggesting libraries, I'd also appreciate it if you could point me to a particular class or two in them that would apply here. Whereas Sun's documentation usually includes information about which operations are O(1), which are O(N), etc, I'm not seeing much of that in third-party libraries, nor really any description of which classes are best suited for what.

Some columns will have a lot of repeated values
immediately suggests to me the possible use of the Flyweight pattern, regardless of the solution you choose for your collections.

The Trove collections take particular care about the space they occupy (they also have tailored data structures if you stick to primitive types). Take a look here.
Otherwise you can try the Apache Commons Collections; just run your own benchmarks!
In any case, if you've got many references around to the same elements, look into a suitable pattern (like Flyweight).
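For illustration, here is a minimal flyweight/interning sketch (the class name and API are made up): repeated cell values are routed through a pool so equal values share one instance instead of being stored many times.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// A tiny flyweight pool: canonicalizes equal values so repeated cell contents
// share a single instance instead of N separate copies.
final class ValuePool<T> {
    private final Map<T, T> pool = new ConcurrentHashMap<>();

    // Returns the canonical instance for the given value.
    T intern(T value) {
        T canonical = pool.putIfAbsent(value, value);
        return canonical != null ? canonical : value;
    }
}

Usage would be to run every cell value through the pool before storing it, e.g. row[col] = strings.intern(rawValue); for Strings specifically, String.intern() or Guava's Interners are alternatives.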

Chronicle Map could have an overhead of less than 20 bytes per entry (see a test proving this). For comparison, java.util.HashMap's overhead varies from 37-42 bytes with -XX:+UseCompressedOops to 58-69 bytes without compressed oops (reference).
Additionally, Chronicle Map stores keys and values off-heap, so it doesn't store object headers, which are not even counted in the HashMap overhead figures above. Chronicle Map integrates with Chronicle-Values, a library for generating flyweight implementations of interfaces, the pattern suggested by Brian Agnew in another answer.

So I'm assuming that you have a Map<ColumnName, Column>, where each column is actually something like an ArrayList<Object>.
A few possibilities -
Are you completely sure that memory is an issue? If you're just generally worried about size it'd be worth confirming that this will really be an issue in a running program. It takes an awful lot of rows and maps to fill up a JVM.
You could test your data set with the different types of maps in the collections libraries. Depending on your data, you can also initialize maps with preset size/load-factor combinations that may help. I've messed around with this in the past; you might get a 30% reduction in memory if you're lucky.
What about storing your data in a single matrix-like data structure (an existing library implementation or something like a wrapper around a List of Lists), with a single map that maps column keys to matrix columns?

Assuming all your rows have most of the same columns, you can just use an array for each row, and a Map<ColumnKey, Integer> to look up which column maps to which array slot. This way you have only 4-8 bytes of overhead per cell.
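For illustration, a minimal sketch of this layout (the class and method names are invented; K stands in for your Field type):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// One Object[] per row; a single shared map records which array slot each column
// uses, so per-cell overhead is just the reference in the row array.
class ColumnStore<K> {
    private final Map<K, Integer> columnIndex = new HashMap<>();
    private final List<Object[]> rows = new ArrayList<>();

    ColumnStore(List<K> columns) {
        for (int i = 0; i < columns.size(); i++) {
            columnIndex.put(columns.get(i), i);
        }
    }

    void addRow(Object[] cells) {                 // cells.length == number of columns
        rows.add(cells);
    }

    Object get(int row, K column) {
        return rows.get(row)[columnIndex.get(column)];
    }

    void set(int row, K column, Object value) {
        rows.get(row)[columnIndex.get(column)] = value;
    }
}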
If Strings are often repeated, you could use a String pool to reduce duplication of strings. Object pools for other immutable types may be useful in reducing memory consumed.
EDIT: You can structure your data as either row-based or column-based. If it's row-based (one array of cells per row), adding or removing a row is just a matter of removing that row. If it's column-based, you can have an array per column. This can make handling primitive types much more efficient: you can have one column that is an int[] and another that is a double[]. It's much more common for an entire column to have the same data type than for a whole row to.
However, whichever way you structure the data, it will be optimised for either row or column modification, and performing an add/remove of the other kind will result in a rebuild of the entire dataset.
(Something I do is keep row-based data and add columns to the end, assuming that if a row isn't long enough, the column has a default value. This avoids a rebuild when adding a column. Rather than removing a column, I have a means of ignoring it.)

Guava does include a Table interface and a hash-based implementation. Seems like a natural fit for your problem. Note that this is still marked as beta.
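As a rough sketch of what that could look like (assuming Guava on the classpath; HashBasedTable is its hash-based Table implementation, and the row/column values here are invented):

import com.google.common.collect.HashBasedTable;
import com.google.common.collect.Table;
import java.util.Map;

public class TableDemo {
    public static void main(String[] args) {
        // Rows keyed by index, columns by field name, values as Strings.
        Table<Integer, String, String> sheet = HashBasedTable.create();
        sheet.put(0, "name", "Alice");
        sheet.put(0, "city", "Oslo");
        sheet.put(1, "name", "Bob");

        String city = sheet.get(0, "city");                  // "Oslo"
        Map<String, String> firstRow = sheet.row(0);          // live view of row 0
        Map<Integer, String> names = sheet.column("name");    // live view of one column
        System.out.println(city + " " + firstRow + " " + names);
    }
}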

keeps its data in an ArrayList of HashMaps
Well, this part seems terribly inefficient to me. An empty HashMap already allocates 16 * (size of a pointer) bytes for its bucket table (16 being the default initial capacity), plus the fields of the HashMap object itself (roughly 14 bytes plus a pointer). If you have a lot of sparsely filled rows, this could be a big problem.
One option would be to use a single large hash map with a composite key (combining row and column). That doesn't make operations on whole rows very efficient, though.
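A minimal sketch of the composite-key idea, packing the row and column indexes into one long (the class is invented for illustration):

import java.util.HashMap;
import java.util.Map;

// One big map instead of a HashMap per row: the key packs the row index and the
// column index into a single long (both assumed to fit in 32 bits).
class CellStore {
    private final Map<Long, Object> cells = new HashMap<>();

    private static long key(int row, int col) {
        return ((long) row << 32) | (col & 0xFFFFFFFFL);
    }

    void put(int row, int col, Object value) {
        cells.put(key(row, col), value);
    }

    Object get(int row, int col) {
        return cells.get(key(row, col));
    }
}

Each key is still a boxed Long; a primitive-keyed map (something like Trove's TLongObjectHashMap) would avoid even that.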
Also, since you don't mention adding cells, you can create the hash maps with only the necessary internal storage (the initialCapacity parameter).
I don't know much about Google Collections, so I can't help there. Also, if you find any useful optimization, please do post here! It would be interesting to know.

I've been experimenting with using the SparseObjectMatrix2D from the Colt project. My data is pretty dense but their Matrix classes don't really offer any way to enlarge them, so I went with a sparse matrix set to the maximum size.
It seems to use roughly 10% less memory and loads about 15% faster for the same data, as well as offering some clever manipulation methods. Still interested in other options though.

From your description, it seems that instead of an ArrayList of HashMaps you'd rather want a (Linked)HashMap of ArrayLists (each ArrayList being a column).
I'd also add a map from field name to column number, and some clever getters/setters that never throw IndexOutOfBoundsException.
You can also use an ArrayList<ArrayList<Object>> (basically a jagged, dynamically growing matrix) and keep the mapping to field (column) names outside.
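A minimal sketch of that column-oriented layout, with the padding setters described above (names invented; String field names assumed):

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Column-oriented storage: one growable list per column, keyed by field name.
class ColumnarSheet {
    private final Map<String, List<Object>> columns = new LinkedHashMap<>();

    Object get(String field, int row) {
        List<Object> column = columns.get(field);
        return (column == null || row >= column.size()) ? null : column.get(row);
    }

    void set(String field, int row, Object value) {
        List<Object> column = columns.computeIfAbsent(field, f -> new ArrayList<>());
        while (column.size() <= row) {
            column.add(null);        // pad instead of throwing IndexOutOfBoundsException
        }
        column.set(row, value);
    }
}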
Some columns will have a lot of repeated values
I doubt this matters much, especially if they are Strings (which can be interned), since your collection would only store references to them.

Why don't you try using a cache implementation like EHCache?
This turned out to be very effective for me when I hit the same situation.
You can simply store your collection within the EHCache implementation.
There are configuration options such as the maximum number of bytes to be used from the local heap.
Once the memory used by your application exceeds what the cache is configured for, the cache implementation takes care of writing the data to disk. You can also configure the amount of time after which objects are written to disk, using a Least Recently Used algorithm.
You can be sure of avoiding out-of-memory errors by using this type of cache implementation.
It only increases the I/O operations of your application by a small degree.
This is just a bird's-eye view of the configuration; there are a lot of options for tuning it to your requirements.

For me, Apache Commons Collections did not save any space. I compared two similar heap dumps taken just before an OOME, one using the Java 11 HashMap and one using the Apache Commons HashedMap: the Apache Commons HashedMap didn't appear to make any meaningful difference.

Related

'Big dictionary' implementation in Java

I am in the middle of a Java project which will be using a 'big dictionary' of words. By 'dictionary' I mean certain numbers (int) assigned to Strings. And by 'big' I mean a file of the order of 100 MB. The first solution that I came up with is probably the simplest possible. At initialization I read in the whole file and create a large HashMap which will be later used to look strings up.
Is there an efficient way to do it without the need of reading the whole file at initialization? Perhaps not, but what if the file is really large, let's say in the order of the RAM available? So basically I'm looking for a way to look things up efficiently in a large dictionary stored in memory.
Thanks for the answers so far, as a result I've realised I could be more specific in my question. As you've probably guessed the application is to do with text mining, in particular representing text in a form of a sparse vector (although some had other inventive ideas :)). So what is critical for usage is to be able to look strings up in the dictionary, obtain their keys as fast as possible. Initial overhead of 'reading' the dictionary file or indexing it into a database is not as important as long as the string look-up time is optimized. Again, let's assume that the dictionary size is big, comparable to the size of RAM available.
Consider ChronicleMap (https://github.com/OpenHFT/Chronicle-Map) in a non-replicated mode. It is an off-heap Java Map implementation, or, from another point of view, a superlightweight NoSQL key-value store.
What it offers for your task out of the box (see the sketch after this list):
Persistence to disk via memory-mapped files (see the comment by Michał Kosmulski)
Lazy load (disk pages are loaded only on demand) -> fast startup
If your data volume is larger than available memory, the operating system will unmap rarely used pages automatically.
Several JVMs can use the same map, because off-heap memory is shared at the OS level. Useful if you do the processing within a map-reduce-like framework, e.g. Hadoop.
Strings are stored in UTF-8 form -> ~50% memory savings if strings are mostly ASCII (as maaartinus noted)
int or long values take just 4 (or 8) bytes, as if you had a primitive-specialized map implementation.
Very little per-entry memory overhead, much less than in standard HashMap and ConcurrentHashMap
Good configurable concurrency via lock striping, if you already parallelize your text processing or plan to in the future.
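A rough sketch of building such a dictionary with the Chronicle Map builder (the method names follow the project's tutorial but should be treated as an assumption and checked against the current API; the map name, entry count and file name are invented):

import java.io.File;
import java.io.IOException;
import net.openhft.chronicle.map.ChronicleMap;

public class OffHeapDictionary {
    public static void main(String[] args) throws IOException {
        // averageKey gives the builder a typical key so it can size the off-heap storage.
        ChronicleMap<CharSequence, Integer> dictionary = ChronicleMap
                .of(CharSequence.class, Integer.class)
                .name("word-ids")
                .entries(10_000_000)                    // expected number of words
                .averageKey("representative-word")
                .createPersistedTo(new File("dictionary.dat"));   // memory-mapped file

        dictionary.put("example", 42);
        System.out.println(dictionary.get("example"));
        dictionary.close();
    }
}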
At the point where your data structure is a few hundred MB up to the size of your RAM, you're better off not building the data structure at run-time, but rather using a database that supports indexing (which most do these days). Indexing is going to be one of the only ways you can ensure the fastest retrieval of text once your file gets that large and you're running up against the -Xmx settings of your JVM, because if your file is as large as, or larger than, your maximum heap setting, you're inevitably going to crash your JVM.
As for having to read the whole file at initialization: you're going to have to do this eventually so that you can efficiently search and analyze the text in your code. If you know that you're only going to be searching a certain portion of your file at a time, you can implement lazy loading. If not, you might as well bite the bullet and load your entire file into the DB at the beginning. You can implement parallelism in this process if there are other parts of your code execution that don't depend on it.
Please let me know if you have any questions!
As stated in a comment, a Trie will save you a lot of memory.
You should also consider using bytes instead of chars as this saves you a factor of 2 for plain ASCII text or when using your national charset as long as it has no more than 256 different letters.
At first glance, combining this low-level optimization with tries makes no sense, as their node size is dominated by the pointers. But there is a way, if you want to go low-level.
So what is critical for usage is to be able to look strings up in the dictionary, obtain their keys as fast as possible.
Then forget any database, as they're damn slow when compared to HashMaps.
If it doesn't fit into memory, the cheapest solution is usually to get more of it. Otherwise, consider loading only the most common words and doing something slower for the others (e.g., a memory mapped file).
I was asked to point to a good trie implementation, especially off-heap. I'm not aware of any.
Assuming the OP needs no mutability, especially no mutability of keys, it all looks very simple.
I guess the whole dictionary could easily be packed into a single ByteBuffer. Assuming mostly ASCII and with some bit hacking, an arrow would need 1 byte per arrow-label character and 1-5 bytes for the child pointer. The child pointer would be relative (i.e., the difference between the current node and the child), which would make most of them fit into a single byte when stored in a base-128 encoding.
I can only guess the total memory consumption, but I'd say something like <4 bytes per word. The above compression would slow the lookup down, but still nowhere near what a single disk access needs.
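A minimal sketch of the base-128 (varint) encoding used for those relative child pointers, so small offsets cost one byte and larger ones up to five (the trie layout itself is not shown):

import java.nio.ByteBuffer;

// Base-128 encoding: 7 payload bits per byte, high bit set means "more bytes follow".
final class VarInt {
    static void write(ByteBuffer out, int value) {
        while ((value & ~0x7F) != 0) {
            out.put((byte) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.put((byte) value);
    }

    static int read(ByteBuffer in) {
        int value = 0;
        int shift = 0;
        int b;
        do {
            b = in.get() & 0xFF;
            value |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return value;
    }
}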
It sounds too big to store in memory. Either store it in a relational database (easy, and with an index on the hash, fast), or a NoSQL solution, like Solr (small learning curve, very fast).
Although NoSQL is very fast, if you really want to tweak performance, and there are entries that are far more frequently looked up than others, consider using a limited size cache to hold the most recently used (say) 10000 lookups.

300 million items in a Map

If each of them is guaranteed to have a unique key (generated and enforced by an external keying system), which Map implementation is the correct fit for me? Assume this has to be optimized for concurrent lookup only (the data is initialized once during application startup).
Do these 300 million unique keys have any positive or negative implications for bucketing/collisions?
Any other suggestions?
My map would look something like this
Map<String, <boolean, boolean, boolean, boolean>>
I would not use a map; it needs too much memory, especially in your case.
Store the values in one data array, and store the keys in a sorted index array.
In the sorted array you use binary search to find the position of a key; that position is also its index into data[].
The tricky part will be building up the arrays without running out of memory.
You don't need to consider concurrency, because you only read from the data.
Further, try to avoid using a String as the key; try to convert the keys to long.
The advantage of this solution: the search time is guaranteed not to exceed O(log n), even in worst cases where the keys cause hashcode problems.
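A minimal sketch of that layout, assuming the keys have already been converted to long and the four booleans are packed into one byte per entry (both assumptions, not part of the question):

import java.util.Arrays;

// Keys kept sorted in a primitive array; values live in a parallel array.
// Lookup is Arrays.binarySearch: O(log n) worst case, no per-entry node objects.
final class SortedLongIndex {
    private final long[] keys;     // sorted ascending, built once at startup
    private final byte[] values;   // e.g. four booleans packed into one byte

    SortedLongIndex(long[] sortedKeys, byte[] values) {
        this.keys = sortedKeys;
        this.values = values;
    }

    // Returns the packed value, or -1 if the key is absent.
    int get(long key) {
        int i = Arrays.binarySearch(keys, key);
        return i >= 0 ? values[i] : -1;
    }
}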
Other suggestion? You bet.
Use a proper key-value store, Redis is the first option that comes to mind. Sure it's a separate process and dependency, but you'll win big time when it comes to proper system design.
There should be a very good reason why you would want to couple your business logic with several gigs of data in same process memory, even if it's ephemeral. I've tried this several times, and was always proved wrong.
It seems to me that you could simply use a TreeMap, because it will give you O(log n) search due to its sorted structure. Furthermore, it is a suitable approach because, as you said, all the data will be loaded at startup.
If you need to keep everything in memory, then you will need some library meant for this number of elements, like Huge collections. On top of that, if the number of writes will be big, you should also think about more sophisticated solutions like a non-blocking hash map.

Is a Struct of Arrays always faster than an Array of Structs?

I was wondering whether the Struct of Arrays (SoA) data layout is always faster than an Array of Structs (AoS) or an Array of Pointers (AoP) for problems whose input only fits in RAM, programmed in C/Java.
Some days ago I was improving the performance of a Molecular Dynamics algorithm (in C). In summary, the algorithm calculates the force interaction among particles based on their force and position.
Originally the particles were represented by a struct containing 9 different doubles: 3 for the particle's forces (Fx, Fy, Fz), 3 for positions and 3 for velocity. The algorithm had an array containing pointers to all the particles (AoP). I decided to change the layout from AoP to SoA to improve cache use.
So now I have a struct with 9 arrays, where each array stores the forces, velocities and positions (x, y, z) of the particles. Each particle is accessed by its own array index.
I got a performance gain of about 1.9x (for an input that only fits in RAM), so I was wondering whether changing from AoP or AoS to SoA will always perform better, and if not, for which types of algorithms it does not.
Much depends on how useful all the fields are. If you have a data structure where using one field means you are likely to use all of them, then an array of structs is more efficient, as it keeps together all the things you are likely to need.
Say you have time-series data where you only need a small selection of the possible fields. You might have all sorts of data about an event or point in time, but you only need, say, 3-5 of them. In this case a struct of arrays is more efficient because a) you don't need to cache the fields you don't use and b) you often access values in order, i.e. caching a field's value, its next value and the one after that is useful.
For this reason, time-series information is often stored as a collection of columns.
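For illustration, the two layouts for the particle example above might look like this in Java (note that in Java an "array of structs" is really an array of object references, i.e. AoP):

// AoS/AoP: one object per particle; the fields used together sit in the same object.
class Particle {
    double x, y, z, vx, vy, vz, fx, fy, fz;
}

// SoA: one primitive array per field; a loop over a single field walks consecutive
// memory and wastes no cache space on the other eight fields.
class Particles {
    final double[] x, y, z, vx, vy, vz, fx, fy, fz;

    Particles(int n) {
        x = new double[n];  y = new double[n];  z = new double[n];
        vx = new double[n]; vy = new double[n]; vz = new double[n];
        fx = new double[n]; fy = new double[n]; fz = new double[n];
    }
}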
This will depend on how exactly you access the data.
Try to imagine what exactly happens in the hardware when you access your data, in either SoA or AoS.
To reason about your question, you must consider the following things:
If there were no cache, performance would be the same, assuming that memory access latency is equal for all elements of the data.
With a cache, accessing consecutive address locations definitely gives a performance improvement. This is exactly your case: with AoS, the values of a given field are not consecutive in memory, so you lose some performance there.
You presumably access your data in loops like for (int i = 0; i < 1000000; i++) Fx[i] = 0;. So if the data is large, you will easily see the performance benefit; if your data were small, it would not matter much.
Finally, the gain also depends on the DRAM you are using: DRAM performs better when you access consecutive data. To understand why, you can refer to Wikipedia.

How bad is it to use HashMaps and ArrayLists with huge data?

I am reading an XML document into HashMaps and ArrayLists so that the relationships are maintained in memory. My code does the job, but I am worried about the iterations and function calls I perform on these huge maps and lists. Currently the XML data I am working with is not that huge, but I don't know what happens when it is. What are the test cases I need to run against the logic that uses these HashMaps? How bad is using the Java collections for such huge data? Are there any alternatives? Will the huge data cause the JVM to crash?
Java collections have a certain overhead, which can increase the memory usage a lot (20 times in extreme cases) when they're the primary data structures of an application and the payload data consists of a large number of small objects. This could lead to the application terminating with an OutOfMemoryError even though the actual data is much smaller than the available memory.
ArrayList is actually very efficient for large numbers of elements, but inefficient when you have a large number of lists that are empty or contain only one element. For those cases, you could use Collections.emptyList() and Collections.singletonList() to improve efficiency.
HashMap has the same problem as well as a considerable overhead for each element stored in it. So the same advice applies as for ArrayList. If you have a large number of elements, there may be alternative Map implementations that are more efficient, e.g. Google Guava.
The biggest overheads happen when you store primitive values such as int or long in collections, as they need to be wrapped as objects. In those cases, the GNU Trove collections offer an alternative.
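To illustrate the difference the wrapping makes (the class is just a demo; the libraries mentioned above give you growable primitive collections so you don't have to manage raw arrays yourself):

import java.util.ArrayList;
import java.util.List;

public class BoxingDemo {
    public static void main(String[] args) {
        // Boxed: every element is a separate Integer object (object header plus the
        // int itself), plus a reference slot in the backing Object[]; roughly 20 bytes each.
        List<Integer> boxed = new ArrayList<>();
        for (int i = 0; i < 1_000_000; i++) {
            boxed.add(i);                 // autoboxing on every add
        }

        // Unboxed: 4 bytes per element, stored contiguously.
        int[] unboxed = new int[1_000_000];
        for (int i = 0; i < unboxed.length; i++) {
            unboxed[i] = i;
        }
        System.out.println(boxed.size() + " vs " + unboxed.length);
    }
}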
In your case specifically, the question is whether you really need to keep the entire data from the XML in memory at once, or whether you can process it in small chunks. This would probably be the best solution if your data can grow arbitrarily large.
The easiest short term solution would be to simply buy more memory. It's cheap.
The JVM will not crash in the scenario you describe. What may happen is an OutOfMemoryError. Also, if you retain the data in those collections for long, you may have issues with garbage collection. Do you really need to store the whole XML data in memory?
If you are dealing with temporary data and you need fast access to it, you do not have many alternatives. The question is what you mean by "huge": megabytes? Gigabytes? Terabytes?
While your data does not exceed 1 GB, IMHO holding it in memory may be OK. Otherwise you should think about alternatives like a DB (relational or NoSQL), files, etc.
In your specific example I'd think about replacing the ArrayList with a LinkedList, unless you need random access. An ArrayList is just a wrapper over an array, so when you need 1 million elements it allocates an array 1 million elements long. A linked list is better when the number of elements is big, but access by index costs O(n). If you need both (i.e. a huge list and fast access by index), use a TreeMap with the index as the key instead; you will get O(log n) access.
What are the test cases I need to run against the logic that uses these HashMaps?
Why not generate large XML files (for example, 5 times larger than your current data samples) and check your parsers/memory storage with them? Since only you know what files are possible in your case and how fast they will grow, this is the only realistic approach.
How bad is using the Java collections for such huge data? Are there any alternatives? Will the huge data cause the JVM to crash?
Of course it is possible that you will get an OutOfMemoryError if you try to store too much data in memory and it is not eligible for GC. This library: http://trove.starlight-systems.com/ claims to use less memory, but I haven't used it myself. Some discussion is available here: What is the most efficient Java Collections library?
How bad is using the Java collections for such huge data?
Java Map implementations and (to a lesser extent) Collection implementations do tend to use a fair amount of memory. The effect is most pronounced when the key / value / element types are wrapper types for primitive types.
Are there any alternatives?
There are alternative implementations of "collections" of primitive types that use less memory; e.g. the GNU Trove libraries. But they don't implement the standard Java collection APIs, and that severely limits their usefulness.
If your collections don't use the primitive wrapper classes, then your options are more limited. You might be able to implement your own custom data structures to use less memory, but the saving won't be that great (in percentage terms) and you've got a significant amount of work to do to implement the code.
A better solution is to redesign your application so that it doesn't need to represent the entire XML data structure in memory. (If you can achieve this.)
Will the huge data cause the JVM to crash?
It could cause a JVM to throw an OutOfMemoryError. That's not technically a crash, but in your use-case it probably means that the application has no choice but to give up.

Duplication detection for 3K incoming requests per second, recommended data structure/algorithm?

Designing a system where a service endpoint (probably a simple servlet) will have to handle 3K requests per second (data will be http posted).
These requests will then be stored into mysql.
The key issue that I need guidance on is that there will be a high % of duplicate data posted to this endpoint.
I only need to store unique data to mysql, so what would you suggest I use to handle the duplication?
The posted data will look like:
<root>
<prop1></prop1>
<prop2></prop2>
<prop3></prop3>
<body>
maybe 10-30K of text in here
</body>
</root>
I will write a method that will hash prop1, prop2, prop3 to create a unique hashcode (the body can be different and still be considered unique).
I was thinking of creating some sort of concurrent dictionary that will be shared across requests.
There is a higher chance of duplication of posted data within a period of 24 hours, so I can purge data from this dictionary every x hours.
Any suggestions on the data structure to store the duplicates? And what about purging, and how many records should I store, considering 3K requests per second? It will get large very fast.
Note: there are 10K different sources that will be posting, and duplication only occurs within a given source. This means I could have more than one dictionary, maybe one per group of sources, to spread things out. If source1 posts data and then source2 posts data, the chances of duplication are very low; but if source1 posts 100 times in a day, the chances of duplication are very high.
Note: please ignore for now the task of saving the posted data to MySQL, as that is another issue on its own; duplication detection is the first hurdle I need help with.
Interesting question.
I would probably be looking at some kind of HashMap of HashMaps structure here, where the first level of HashMaps would use the sources as keys and the second level would contain the actual data (the minimum needed to detect duplicates) and use your hashcode function for hashing. For the actual implementation, Java's ConcurrentHashMap would probably be the choice.
This way you have also set up the structure to partition your incoming load depending on sources, if you need to distribute the load over several machines.
With regards to purging, I think you have to measure the exact behavior with production-like data. You need to learn how quickly the data grows when you successfully eliminate duplicates and how it becomes distributed in the HashMaps. With a good distribution and not-too-quick growth, I can imagine it is good enough to do a cleanup occasionally. Otherwise maybe an LRU policy would be good.
Sounds like you need a hashing structure that can add and check the existence of a key in constant time. In that case, try to implement a Bloom filter. Be careful that this is a probabilistic structure i.e. it may tell you that a key exists when it does not, but you can make the probability of failure extremely low if you tweak the parameters carefully.
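For what it's worth, a minimal sketch using Guava's BloomFilter (the sizing numbers are assumptions: roughly 24 hours of traffic at 3K/s with a 1% false-positive rate, which needs on the order of 300 MB; the composite key format is also invented):

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

public class DuplicateFilter {
    private final BloomFilter<String> seen = BloomFilter.create(
            Funnels.stringFunnel(StandardCharsets.UTF_8),
            260_000_000L,   // expected insertions (~24h at 3K/s)
            0.01);          // acceptable false-positive probability

    // Returns true if the key is definitely new; false means "probably a duplicate".
    public boolean markIfNew(String prop1, String prop2, String prop3) {
        String key = prop1 + '|' + prop2 + '|' + prop3;
        if (seen.mightContain(key)) {
            return false;   // may very rarely reject a genuinely new key
        }
        seen.put(key);
        return true;
    }
}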
Edit: Ok, so bloom filters are not acceptable. To still maintain constant lookup (albeit not a constant insertion), try to look into Cuckoo hashing.
1) Set up your database like this:
ALTER TABLE Root ADD UNIQUE INDEX(Prop1, Prop2, Prop3);
INSERT INTO Root (Prop1, Prop2, Prop3, Body) VALUES (#prop1, #prop2, #prop3, #body)
ON DUPLICATE KEY UPDATE Body=#body
2) You don't need any algorithms or fancy hashing ADTs
shell> mysqlimport [options] db_name textfile1 [textfile2 ...]
http://dev.mysql.com/doc/refman/5.1/en/mysqlimport.html
Make use of the --replace or --ignore flags, as well as --compress.
3) All your Java will do is...
a) generate CSV files: use the StringBuffer class, then every X seconds or so swap in a fresh StringBuffer and pass the .toString() of the old one to a thread that flushes it to a file /temp/SOURCE/TIME_STAMP.csv
b) occasionally kick off a Runtime.getRuntime().exec of the mysqlimport command
c) delete the old CSV files if space is an issue, or archive them to network storage/backup device
Well, you're basically looking for some kind of extremely large HashMap and something like
if (map.put(key, val) == null) // first time this key was seen, so store the data
There are lots of different HashMap implementations available, but you could look at NBHM (Cliff Click's NonBlockingHashMap). Non-blocking puts and a design aimed at large, scalable problems could work just fine. The map also has iterators that do NOT throw a ConcurrentModificationException while you use them to traverse the map, which is basically a requirement for removing old data as I see it. Also, putIfAbsent is all you actually need, but I have no idea if that's more efficient than a simple put; you'd have to ask Cliff or check the source.
The trick then is to avoid resizing of the map by making it large enough; otherwise the throughput will suffer while it resizes (which could be a problem). And think about how to implement the removal of old data, e.g. with an idle thread that traverses an iterator and removes old entries.
Use a java.util.ConcurrentHashMap for building a map of your hashes, but make sure you have the correct initialCapacity and concurrencyLevel assigned to the map at creation time.
The api docs for ConcurrentHashMap have all the relevant information:
initialCapacity - the initial capacity. The implementation performs internal sizing to accommodate this many elements.
concurrencyLevel - the estimated number of concurrently updating threads. The implementation performs internal sizing to try to accommodate this many threads.
You should be able to use putIfAbsent for handling 3K requests as long as you have initialized the ConcurrentHashMap the right way - make sure this is tuned as part of your load testing.
At some point, though, trying to handle all the requests in one server may prove to be too much, and you will have to load-balance across servers. At that point you may consider using memcached for storing the index of hashes, instead of the CHP.
The interesting problems that you will still have to solve, though, are:
loading all of the hashes into memory at startup
determining when to knock off hashes from the in-memory map
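A minimal sketch of the in-memory index described here (the capacity and concurrencyLevel figures are placeholders to be derived from load testing, not recommendations):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class HashIndex {
    // Sized up front so the map does not rehash while taking traffic.
    private final ConcurrentMap<String, Boolean> seen =
            new ConcurrentHashMap<>(1 << 24,   // initialCapacity
                                    0.75f,     // loadFactor
                                    64);       // concurrencyLevel ~ expected writer threads

    // Returns true exactly once per hash, so only the first request gets written to MySQL.
    public boolean firstTimeSeen(String hash) {
        return seen.putIfAbsent(hash, Boolean.TRUE) == null;
    }
}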
If you use a strong hash formula, such as MD5 or SHA-1, you will not need to store any data at all. The probability of a collision is virtually nil, so if you find the same hash result twice, the second is a duplicate.
Given that an MD5 hash is 16 bytes and SHA-1 is 20 bytes, this should decrease memory requirements, therefore keeping more elements in the CPU cache and dramatically improving speed.
Storing these keys requires little more than a small hash table followed by trees to handle collisions.
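A small sketch of computing such a digest over the three properties with the JDK's MessageDigest (the separator byte and the ByteBuffer wrapping are choices made here for illustration):

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class RequestDigest {
    // 20-byte SHA-1 over the three properties; the body is excluded on purpose.
    public static ByteBuffer digest(String prop1, String prop2, String prop3)
            throws NoSuchAlgorithmException {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        sha1.update(prop1.getBytes(StandardCharsets.UTF_8));
        sha1.update((byte) 0);   // separator so "ab"+"c" and "a"+"bc" don't collide
        sha1.update(prop2.getBytes(StandardCharsets.UTF_8));
        sha1.update((byte) 0);
        sha1.update(prop3.getBytes(StandardCharsets.UTF_8));
        // Wrap in a ByteBuffer so the digest can be used directly as a map/set key
        // (byte[] compares by identity, ByteBuffer by content).
        return ByteBuffer.wrap(sha1.digest());
    }
}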
