300 million items in a Map - java

If each of them is guaranteed to have a unique key (generated and
enforced by an external keying system) which Map implementation is
the correct fit for me? Assume this has to be optimized for
concurrent lookup only (The data is initialized once during the
application startup).
Do these 300 million unique keys have any positive or negative
implications for bucketing/collisions?
Any other suggestions?
My map would look something like this
Map<String, <boolean, boolean, boolean, boolean>>

I would not use a map; it needs too much memory, especially in your case.
Store the values in one data array, and store the keys in a sorted index array.
In the sorted array you use binary search to find the position of a key, which is also its position in data[].
The tricky part will be building up the arrays without running out of memory.
You don't need to consider concurrency because you only ever read the data.
Further, try to avoid using a String as the key; try to convert the keys to long.
The advantage of this solution: the search time is guaranteed not to exceed O(log n), even in worst cases where the keys hash badly.
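A minimal sketch of that layout, assuming the external keys really can be converted to long and the four booleans are packed into one byte per entry (all names below are illustrative):

```java
import java.util.Arrays;

// Sketch: keys converted to long and sorted once at startup; values packed
// into a parallel array (the four booleans fit into a single byte).
public class SortedKeyIndex {
    private final long[] keys;   // sorted ascending
    private final byte[] values; // values[i] belongs to keys[i]

    // Caller must pass keys already sorted, with values in matching order.
    public SortedKeyIndex(long[] sortedKeys, byte[] values) {
        this.keys = sortedKeys;
        this.values = values;
    }

    // O(log n) lookup with no per-entry object overhead.
    public byte lookup(long key) {
        int idx = Arrays.binarySearch(keys, key);
        if (idx < 0) {
            throw new IllegalArgumentException("unknown key: " + key);
        }
        return values[idx];
    }
}
```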

Other suggestion? You bet.
Use a proper key-value store; Redis is the first option that comes to mind. Sure, it's a separate process and dependency, but you'll win big when it comes to proper system design.
There should be a very good reason to couple your business logic with several gigabytes of data in the same process's memory, even if it's ephemeral. I've tried this several times and was always proved wrong.

It seems to me that you can simply use a TreeMap, because its sorted structure gives you O(log n) search. It is also a suitable choice because, as you said, all data is loaded at startup.

If you need to keep everything in memory, then you will need a library designed for this number of elements, such as Huge Collections. On top of that, if the number of writes will be large, you should also consider more sophisticated solutions such as a non-blocking hash map.

Related

Fastest way to access this object

Let's say I have a list of 1,000,000 users whose unique identifier is their username string. So to compare two User objects I just override the compareTo() method and compare the username members.
Given a username string, I wish to find the corresponding User object in the list. What, in the average case, would be the fastest way to do this?
I'm guessing a HashMap mapping usernames to User objects, but I wondered if there was something else I didn't know about that would be better.
If you don't need to store them in a database (which is the usual scenario), a HashMap<String, User> will work fine: it has O(1) average-case complexity for lookup.
As noted, the usual scenario is to have them in the database. But in order to get faster results, caching is utilized. You can use EhCache - it is similar to ConcurrentHashMap, but it has time-to-live for elements and the option to be distributed across multiple machines.
You should not dump your whole database in memory, because it will be hard to synchronize. You will face issues with invalidating the entries in the map and keeping them up-to-date. Caching frameworks make all this easier. Also note that the database has its own optimizations, and it is not unlikely that your users will be kept in memory there for faster access.
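For reference, a minimal sketch of the plain in-memory index (assuming the asker's User class exposes a getUsername() accessor):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Build the index once; lookups are then O(1) on average.
static Map<String, User> indexByUsername(List<User> users) {
    Map<String, User> byUsername = new HashMap<>(users.size() * 2);
    for (User u : users) {
        byUsername.put(u.getUsername(), u);   // getUsername() assumed to exist
    }
    return byUsername;
}
// Usage: User match = indexByUsername(allUsers).get("someUsername");
```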
I'm sure you want a hash map. They're the fastest thing going, and memory efficient. As also noted in other replies, a String works as a great key, so you don't need to override anything. (This is also true of the following.)
The chief alternative is a TreeMap. This is slower and uses a bit more memory. It's a lot more flexible, however: the same map works great with 5 entries and 5 million entries, and you don't need to clue it in in advance. If your list varies wildly in size, the TreeMap will grab memory as it needs it and let it go when it doesn't. HashMaps are not so good about letting go, and, as I explain below, they can be awkward when grabbing more memory.
TreeMaps work better with garbage collectors. They ask for memory in small, easily found chunks. If you start a hash table with room for 100,000 entries, then when it gets full it frees the 100,000-element array (almost a megabyte on a 64-bit machine) and asks for an even larger one. If it does this repeatedly, it can get ahead of the GC, which tends to throw an out-of-memory error rather than spend a lot of time gathering up and compacting scattered bits of free memory. (It prefers to maintain its reputation for speed at the expense of your machine's reputation for having a lot of memory. You really can run out of memory with 90% of your heap unused because it's fragmented.)
So if you are running your program full tilt, your list of names varies wildly in size--and perhaps you even have several lists of names varying wildly in size--a TreeMap will work a lot better for you.
A hash map will no doubt be just what you need. But when things get really crazy, there's the ConcurrentSkipListMap. This is everything a TreeMap is, except it's a bit slower. On the other hand, it allows adds, updates, deletes, and reads from multiple threads with no synchronization at all. (I mention it just to be complete.)
In terms of data structures, a HashMap can be a good choice. It handles larger data sets well, and the time for inserts is considered constant, O(1).
In this case it sounds like you will be carrying out more lookups than inserts. For lookups, the average time complexity is O(1 + n/k); the key factor here (sorry about the pun) is how effective the hashing algorithm is at evenly distributing the data across the buckets.
The risk here is that the usernames are short and use a small character set such as a-z, in which case there could be a lot of collisions, causing the HashMap to be loaded unevenly and therefore slowing down lookups. One option to improve this could be to create your own key object and override the hashCode() method with an algorithm that suits your keys better.
In summary, if you have a large data set, a good/suitable hashing algorithm, and the space to hold it all in memory, then a HashMap can provide a relatively fast lookup.
Given your last post on the ArrayList and its scalability, I would take Bozho's suggestion and go for a purpose-built cache such as EhCache. This lets you control memory usage and eviction policies, and it is still a lot faster than database access.
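If profiling really did show poor bucket distribution with plain String keys, the custom-key idea above could look roughly like this (purely illustrative; in practice String.hashCode() is usually fine):

```java
// Illustrative wrapper key; only worth it if measurements show poor bucket
// distribution with plain String keys, which is rare in practice.
public final class UsernameKey {
    private final String username;

    public UsernameKey(String username) {
        this.username = username;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof UsernameKey
                && ((UsernameKey) o).username.equals(username);
    }

    @Override
    public int hashCode() {
        // Example: mix the default hash to spread short, low-entropy names.
        int h = username.hashCode();
        return h ^ (h >>> 16);
    }
}
```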
If you don't change your list of users very often, then you may want to use Aho-Corasick. You will need a pre-processing step that takes O(T) time and space, where T is the sum of the lengths of all usernames. After that you can match usernames in O(n) time, where n is the length of the username you are looking for. Since you have to look at every character of the username anyway, I don't think you can do better than this.

Java Performance: Map vs List

I'm building a tree pagination in JSF 1.2 and RichFaces 3.3.2. Because I have a lot of tree nodes (something like 80k), it's slow.
So, as a first attempt, I created a HashMap mapping each page to the list of nodes on that page.
But the performance isn't good enough...
So I was wondering if there is something faster than a HashMap, maybe a List of Lists or something.
Does anyone have experience with this? What can I do?
Thanks in advance.
EDIT:
The big problem is that I have to validate the user's permissions on the child nodes of the tree. This validation is slow because I have to descend into the nodes; I don't have a good way to know whether the user has permission on a 10th-level node without iterating over all of them. On top of that, the same tree is used in other places...
The basic reason I was doing this pagination is that the client side gets very slow because of the structure generated by RichFaces: a lot of tr's and td's, and the browser just goes crazy with it.
So, unfortunately, I have to load all the nodes and paginate on the client side only, and I need to know which structure is faster to iterate...
Sorry for my bad English.
A hash map is the fastest data structure if you want to get all nodes for a page. The list of nodes can be fetched in constant time (O(1)), while with lists the time is O(n) (n = number of pages; faster on sorted lists, but never getting near O(1)).
Which operations on your data structure are too slow? That's what you have to analyse before you start optimizing.
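As a sketch, the page-to-nodes index amounts to this (Node stands in for whatever the tree node type actually is):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Built once while the tree is assembled; each page fetch is then a single
// O(1) get instead of an O(n) scan over a flat list of nodes.
Map<Integer, List<Node>> nodesByPage = new HashMap<>();
// ... populate nodesByPage while walking the tree ...
List<Node> pageThree = nodesByPage.get(3);   // null if the page doesn't exist
```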
It's probably due more to the fact that JSF is a performance pig than to the data structure choice. The one attempt I've seen to create a JSF app could be timed with a sundial.
You're making a mistake by guessing about solutions without more knowledge about the root cause. I'd recommend that you profile your app to see where the time is being spent.
The data structure to use always depends on how you need to store the data and how you need to access it. HashMap<K, V> is supposed to have constant-time access to a value given its key: when you call get(key), the hashCode() of key is computed and used to retrieve the related value. Unless you have many different keys with the same hash code (in which case something is probably wrong, since, while it's not mandatory, different objects should usually have different hash codes), this is fast.
Searching for an element in a plain list requires scanning the list, which will (almost) always be slower than computing a hash code.
If you need to associate values with keys, a Map is the way. And HashMap should be fast enough.
I don't know too much about JSF, but I think - if the data structure and access pattern is the one that a Map is designed for - the problem is not the HashMap itself.
I would solve this with a JavaScript/Ajax method that fetches child nodes on demand.

Duplication detection for 3K incoming requests per second, recommended data structure/algorithm?

I'm designing a system where a service endpoint (probably a simple servlet) will have to handle 3K requests per second (data will be HTTP POSTed).
These requests will then be stored in MySQL.
The key issue I need guidance on is that there will be a high percentage of duplicate data posted to this endpoint.
I only need to store unique data in MySQL, so what would you suggest I use to handle the duplication?
The posted data will look like:
<root>
<prop1></prop1>
<prop2></prop2>
<prop3></prop3>
<body>
maybe 10-30K of text in here
</body>
</root>
I will write a method that hashes prop1, prop2, and prop3 to create a unique hash code (the body is not part of the hash, so requests with the same three properties count as duplicates even if the body differs).
I was thinking of creating some sort of concurrent dictionary that will be shared across requests.
The chances of duplicated data being posted are highest within a 24-hour period, so I can purge data from this dictionary every x hours.
Any suggestions on a data structure to store the duplicates? And what about purging, and how many records should I keep, considering 3K requests per second, i.e. it will get large very fast?
Note: there are 10K different sources posting, and duplication only occurs within a given source. This means I could use more than one dictionary, perhaps one per group of sources, to spread things out. If source1 posts data and then source2 posts data, the chances of duplication are very, very low; but if source1 posts 100 times in a day, the chances of duplication are very high.
Note: please ignore for now the task of saving the posted data to MySQL, as that is a separate issue; duplicate detection is the first hurdle I need help with.
Interesting question.
I would probably be looking at some kind of HashMap-of-HashMaps structure here, where the first level of HashMaps uses the sources as keys and the second level contains the actual data (the minimum needed for detecting duplicates), using your hash-code function for hashing. For the actual implementation, Java's ConcurrentHashMap would probably be the choice.
This way you have also set up the structure to partition your incoming load by source if you need to distribute the load over several machines.
With regards to purging, I think you have to measure the exact behaviour with production-like data. You need to learn how quickly the data grows when you successfully eliminate duplicates and how it becomes distributed across the HashMaps. With a good distribution and not-too-quick growth, I can imagine it being good enough to do a cleanup occasionally. Otherwise an LRU policy might be good.
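A rough sketch of that two-level structure (class and field names are illustrative):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Sketch: source id -> (hash of prop1-3 -> first-seen time).
// The timestamp lets a background thread purge entries older than X hours.
public class DuplicateDetector {
    private final ConcurrentMap<String, ConcurrentMap<Integer, Long>> seenBySource =
            new ConcurrentHashMap<>();

    // Returns true if this (source, hash) pair has been seen before.
    public boolean isDuplicate(String sourceId, int propsHash) {
        ConcurrentMap<Integer, Long> seen =
                seenBySource.computeIfAbsent(sourceId, s -> new ConcurrentHashMap<>());
        // putIfAbsent returns null only for the first thread to insert this hash.
        return seen.putIfAbsent(propsHash, System.currentTimeMillis()) != null;
    }
}
```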
Sounds like you need a hashing structure that can add and check the existence of a key in constant time. In that case, try a Bloom filter. Be careful that this is a probabilistic structure, i.e. it may tell you that a key exists when it does not, but you can make the probability of a false positive extremely low if you tweak the parameters carefully.
Edit: OK, so Bloom filters are not acceptable. To still maintain constant-time lookup (albeit not constant-time insertion), look into cuckoo hashing.
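For completeness, a Bloom-filter version is only a few lines with Guava's BloomFilter (assuming Guava is available; the false-positive rate is tunable but never zero, which is why it was ruled out above):

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

// Sized for roughly one day of traffic (3K/s * 86,400 s ≈ 260M) at a 1% FPR.
BloomFilter<String> seen = BloomFilter.create(
        Funnels.stringFunnel(StandardCharsets.UTF_8), 260_000_000, 0.01);

// Key built from the three properties of the posted XML.
String key = prop1 + '|' + prop2 + '|' + prop3;
if (!seen.mightContain(key)) {
    seen.put(key);
    // definitely new: hand off to MySQL
} else {
    // probably a duplicate (small chance of a false positive)
}
```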
1) Setup your database like this
ALTER TABLE Root ADD UNIQUE INDEX(Prop1, Prop2, Prop3);
INSERT INTO Root (Prop1, Prop2, Prop3, Body) VALUES (#prop1, #prop2, #prop3, #body)
ON DUPLICATE KEY UPDATE Body=#body
2) You don't need any algorithms or fancy hashing ADTs
shell> mysqlimport [options] db_name textfile1 [textfile2 ...]
http://dev.mysql.com/doc/refman/5.1/en/mysqlimport.html
Make use of the --replace or --ignore flags, as well as, --compress.
3) All your Java will do is...
a) generate CSV files: use the StringBuffer class, then every X seconds or so swap in a fresh StringBuffer and pass the .toString() of the old one to a thread that flushes it to a file /temp/SOURCE/TIME_STAMP.csv
b) occasionally kick off a Runtime.getRuntime().exec of the mysqlimport command
c) delete the old CSV files if space is an issue, or archive them to network storage/backup device
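Step (a) might look roughly like this (paths, timing, and the class name are illustrative; step (b) would periodically exec mysqlimport against the flushed files):

```java
import java.io.FileWriter;
import java.io.IOException;

// Sketch of the buffer-swap idea: request threads append lines, a scheduled
// task swaps in a fresh buffer and flushes the old one to disk.
public class CsvBatcher {
    private StringBuilder buffer = new StringBuilder();

    public synchronized void append(String csvLine) {
        buffer.append(csvLine).append('\n');
    }

    // Called every X seconds by a scheduled task.
    public void flush(String source) throws IOException {
        StringBuilder old;
        synchronized (this) {
            old = buffer;
            buffer = new StringBuilder();
        }
        String path = "/temp/" + source + "/" + System.currentTimeMillis() + ".csv";
        try (FileWriter out = new FileWriter(path)) {
            out.write(old.toString());
        }
        // Step (b): occasionally run mysqlimport --replace against the new file.
    }
}
```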
Well, you're basically looking for some kind of extremely large hash map and something like
if (map.put(key, val) == null) // first time we've seen this key -> store the data
There are lots of different hash map implementations available, but you could look at NBHM (Cliff Click's NonBlockingHashMap). Non-blocking puts, and it's designed with large, scalable problems in mind, so it could work just fine. The map also has iterators that do NOT throw a ConcurrentModificationException while you use them to traverse the map, which is basically a requirement for removing old data as I see it. Also, putIfAbsent is all you actually need - but no idea whether that's more efficient than a simple put; you'd have to ask Cliff or check the source.
The trick then is to avoid resizing of the map by making it large enough to begin with - otherwise the throughput will suffer while it resizes (which could be a problem). And think about how to implement the removal of old data - probably with an idle thread that traverses an iterator and removes old entries.
Use a java.util.concurrent.ConcurrentHashMap for building a map of your hashes, but make sure you have the correct initialCapacity and concurrencyLevel assigned to the map at creation time.
The API docs for ConcurrentHashMap have all the relevant information:
initialCapacity - the initial capacity. The implementation performs internal sizing to accommodate this many elements.
concurrencyLevel - the estimated number of concurrently updating threads. The implementation performs internal sizing to try to accommodate this many threads.
You should be able to use putIfAbsent for handling 3K requests per second as long as you have initialized the ConcurrentHashMap the right way - make sure this is tuned as part of your load testing.
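Concretely, the creation and the check might look like this (the capacity and concurrency numbers are placeholders and should come out of your load testing):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Pre-sized so the map never resizes under load; 16M entries and 64 writer
// threads are illustrative numbers only.
ConcurrentMap<String, Boolean> seenHashes =
        new ConcurrentHashMap<>(16_000_000, 0.75f, 64);

// putIfAbsent returns null exactly once per key, so only the first request
// carrying a given hash ('requestHash' here is your prop1-3 hash) gets stored.
boolean firstTime = seenHashes.putIfAbsent(requestHash, Boolean.TRUE) == null;
```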
At some point, though, trying to handle all the requests in one server may prove to be too much, and you will have to load-balance across servers. At that point you may consider using memcached for storing the index of hashes, instead of the ConcurrentHashMap.
The interesting problems that you will still have to solve, though, are:
loading all of the hashes into memory at startup
determining when to knock off hashes from the in-memory map
If you use a strong hash function, such as MD5 or SHA-1, you will not need to store any data at all. The probability of a collision is virtually nil, so if you see the same hash result twice, the second request is a duplicate.
Given that MD5 is 16 bytes and SHA-1 is 20 bytes, this should decrease memory requirements, therefore keeping more elements in the CPU cache and dramatically improving speed.
Storing these keys requires little more than a small hash table followed by trees to handle collisions.
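A hedged sketch of computing such a digest over the three properties (parameter names follow the posted XML):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Digest prop1..prop3 only; the body is deliberately excluded, matching the
// asker's definition of "duplicate". SHA-1 gives a 20-byte key per request.
static byte[] requestDigest(String prop1, String prop2, String prop3)
        throws NoSuchAlgorithmException {
    MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
    sha1.update(prop1.getBytes(StandardCharsets.UTF_8));
    sha1.update((byte) 0); // separator so "ab"+"c" is distinct from "a"+"bc"
    sha1.update(prop2.getBytes(StandardCharsets.UTF_8));
    sha1.update((byte) 0);
    sha1.update(prop3.getBytes(StandardCharsets.UTF_8));
    return sha1.digest();
}
```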

Reducing memory usage of very large HashMap

I have a very large hash map (2+ million entries) that is created by reading in the contents of a CSV file. Some information:
The HashMap maps a String key (less than 20 characters) to a String value (approximately 50 characters).
The HashMap is initialized with an initial capacity of 3 million, so the load factor is around 0.66.
The HashMap is only used by a single operation, and once that operation is completed, I "clear()" it. (Although it doesn't appear that this clear actually frees the memory - is a separate call to System.gc() necessary?)
One idea I had was to change the HashMap<String, String> to a HashMap<Integer, String> and use the hashCode() of the String as the key. This would save a bit of memory but risks issues with collisions if two strings have identical hash codes... how likely is this for strings that are less than 20 characters long?
Does anyone else have any ideas on what to do here? The CSV file itself is only 100 MB, but Java ends up using over 600 MB of memory for this HashMap.
Thanks!
It sounds like you already have the framework to try this. Instead of adding the string, add string.hashCode() and see if you get collisions.
In terms of freeing up memory, the JVM generally doesn't shrink its heap, but it will garbage collect when it needs to.
Also, it sounds like you might have an algorithm that doesn't need the hash table at all. Could you describe what you're trying to do in a little more detail?
Parse the CSV and build a Map whose keys are your existing keys, but whose values are Integer pointers to the locations in the file for each key.
When you want the value for a key, find the index in the map, then use a RandomAccessFile to read that line from the file. Keep the RandomAccessFile open during processing, then close it when done.
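A minimal sketch of that idea, using long offsets and assuming simple "key,value" lines (names are illustrative):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

// Pass 1: remember the byte offset of each line instead of its value.
// Pass 2: seek to the offset on demand and read the line back.
public class OffsetIndex implements AutoCloseable {
    private final Map<String, Long> offsets = new HashMap<>();
    private final RandomAccessFile file;

    public OffsetIndex(String path) throws IOException {
        file = new RandomAccessFile(path, "r");
        long pos = file.getFilePointer();
        String line;
        while ((line = file.readLine()) != null) {
            String key = line.substring(0, line.indexOf(','));  // "key,value" lines assumed
            offsets.put(key, pos);
            pos = file.getFilePointer();
        }
    }

    public String valueFor(String key) throws IOException {
        Long pos = offsets.get(key);
        if (pos == null) return null;
        file.seek(pos);
        String line = file.readLine();
        return line.substring(line.indexOf(',') + 1);
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}
```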
What you are trying to do is exactly a JOIN operation. Consider an in-memory DB like H2: you can achieve this by loading both CSV files into temp tables and then doing a JOIN over them.
In my experience, H2 runs great with load operations, and this approach will certainly be faster and less memory-intensive than your manual HashMap-based joining method.
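A rough sketch of that H2 approach using its CSVREAD table function (file paths and column names are illustrative):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Load both CSVs into an in-memory H2 database and let the SQL engine do the join.
try (Connection con = DriverManager.getConnection("jdbc:h2:mem:csvjoin");
     Statement st = con.createStatement()) {
    st.execute("CREATE TABLE a AS SELECT * FROM CSVREAD('left.csv')");
    st.execute("CREATE TABLE b AS SELECT * FROM CSVREAD('right.csv')");
    try (ResultSet rs = st.executeQuery(
            "SELECT a.ID, a.VAL AS LEFT_VAL, b.VAL AS RIGHT_VAL "
          + "FROM a JOIN b ON a.ID = b.ID")) {
        while (rs.next()) {
            // process each joined row here
        }
    }
}
```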
If performance isn't the primary concern, store the entries in a database instead. Then memory isn't a concern, and you have good, if not great, search speed thanks to the database.

HashMap alternatives for memory-efficient data storage

I've currently got a spreadsheet type program that keeps its data in an ArrayList of HashMaps. You'll no doubt be shocked when I tell you that this hasn't proven ideal. The overhead seems to use 5x more memory than the data itself.
This question asks about efficient collections libraries, and the answer was use Google Collections. My follow up is "which part?". I've been reading through the documentation but don't feel like it gives a very good sense of which classes are a good fit for this. (I'm also open to other libraries or suggestions).
So I'm looking for something that will let me store dense spreadsheet-type data with minimal memory overhead.
My columns are currently referenced by Field objects, rows by their indexes, and values are Objects, almost always Strings
Some columns will have a lot of repeated values
primary operations are to update or remove records based on values of certain fields, and also adding/removing/combining columns
I'm aware of options like H2 and Derby but in this case I'm not looking to use an embedded database.
EDIT: If you're suggesting libraries, I'd also appreciate it if you could point me to a particular class or two in them that would apply here. Whereas Sun's documentation usually includes information about which operations are O(1), which are O(N), etc, I'm not seeing much of that in third-party libraries, nor really any description of which classes are best suited for what.
Some columns will have a lot of
repeated values
immediately suggests to me the possible use of the Flyweight pattern, regardless of the solution you choose for your collections.
Trove collections take particular care about the space occupied (I think they also have tailored data structures if you stick to primitive types)... take a look here.
Otherwise you can try Apache Commons Collections... just run your own benchmarks!
In any case, if you have many references to the same elements, try to apply a suitable pattern (like flyweight).
Chronicle Map could have overhead of less than 20 bytes per entry (see a test proving this). For comparison, java.util.HashMap's overhead varies from 37-42 bytes with -XX:+UseCompressedOops to 58-69 bytes without compressed oops (reference).
Additionally, Chronicle Map stores keys and values off-heap, so it doesn't store Object headers, which are not accounted as HashMap's overhead above. Chronicle Map integrates with Chronicle-Values, a library for generation of flyweight implementations of interfaces, the pattern suggested by Brian Agnew in another answer.
So I'm assuming that you have a map of Map<ColumnName,Column>, where the column is actually something like ArrayList<Object>.
A few possibilities -
Are you completely sure that memory is an issue? If you're just generally worried about size it'd be worth confirming that this will really be an issue in a running program. It takes an awful lot of rows and maps to fill up a JVM.
You could test your data set with different types of maps in the collections. Depending on your data, you can also initialize maps with preset size/load factor combinations that may help. I've messed around with this in the past, you might get a 30% reduction in memory if you're lucky.
What about storing your data in a single matrix-like data structure (an existing library implementation or something like a wrapper around a List of Lists), with a single map that maps column keys to matrix columns?
Assuming most of your rows have the same columns, you can just use an array for each row, plus a Map<ColumnKey, Integer> to look up which column refers to which cell. This way you have only 4-8 bytes of overhead per cell.
If Strings are often repeated, you could use a String pool to reduce duplication of strings. Object pools for other immutable types may be useful in reducing memory consumed.
EDIT: You can structure your data as either row-based or column-based. If it's row-based (one array of cells per row), adding/removing a row is just a matter of removing that row. If it's column-based, you can have an array per column. This can make handling primitive types much more efficient: you can have one column which is an int[] and another which is a double[]. It's much more common for an entire column to have the same data type than for a whole row.
However, whichever way you structure the data, it will be optimised for either row or column modification, and performing an add/remove of the other type will result in a rebuild of the entire dataset.
(Something I do is keep row-based data and add columns to the end, assuming that if a row isn't long enough, the column has a default value. This avoids a rebuild when adding a column. Rather than removing a column, I have a means of ignoring it.)
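A sketch of the row-as-array layout with a shared column index (names are illustrative):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// One shared map from column name to position, plus a plain Object[] per row:
// roughly one array header + N references per row instead of a HashMap per row.
public class ColumnarSheet {
    private final Map<String, Integer> columnIndex = new HashMap<>();
    private final List<Object[]> rows = new ArrayList<>();

    public ColumnarSheet(List<String> columns) {
        for (int i = 0; i < columns.size(); i++) {
            columnIndex.put(columns.get(i), i);
        }
    }

    public void addRow(Object[] cells) {      // cells.length == number of columns
        rows.add(cells);
    }

    public Object get(int row, String column) {
        Integer col = columnIndex.get(column);
        return col == null ? null : rows.get(row)[col];
    }
}
```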
Guava does include a Table interface and a hash-based implementation. Seems like a natural fit to your problem. Note that this is still marked as beta.
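Usage is roughly this (assuming Guava is on the classpath):

```java
import com.google.common.collect.HashBasedTable;
import com.google.common.collect.Table;
import java.util.Map;

// Row key = row index, column key = field name, value = cell contents.
Table<Integer, String, Object> sheet = HashBasedTable.create();
sheet.put(0, "name", "Alice");
sheet.put(0, "age", 30);
Object age = sheet.get(0, "age");             // 30
Map<String, Object> firstRow = sheet.row(0);  // live view of one row
```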
keeps its data in an ArrayList of HashMaps
Well, this part seems terribly inefficient to me. An empty HashMap already allocates 16 * (size of a pointer) bytes (16 being the default initial capacity), plus some fields for the hash object itself (roughly 14 + pointer-size bytes). If you have a lot of sparsely filled rows, this could be a big problem.
One option would be to use a single large hash with a composite key (combining row and column). Although that doesn't make operations on whole rows very efficient.
Also, since you don't mention the operation of adding cells, you can create hashes with only the necessary internal storage (the initialCapacity parameter).
I don't know much about Google Collections, so I can't help there. Also, if you find any useful optimization, please do post it here! It would be interesting to know.
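The composite-key variant mentioned above could be as simple as the following sketch; a single HashMap<Cell, Object> then holds every populated cell, at the cost of making whole-row operations a full scan.

```java
import java.util.Objects;

// Immutable (row, column) key for one flat cell map.
final class Cell {
    final int row;
    final String column;

    Cell(int row, String column) {
        this.row = row;
        this.column = column;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof Cell
                && ((Cell) o).row == row
                && ((Cell) o).column.equals(column);
    }

    @Override
    public int hashCode() {
        return Objects.hash(row, column);
    }
}
```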
I've been experimenting with using the SparseObjectMatrix2D from the Colt project. My data is pretty dense but their Matrix classes don't really offer any way to enlarge them, so I went with a sparse matrix set to the maximum size.
It seems to use roughly 10% less memory and loads about 15% faster for the same data, as well as offering some clever manipulation methods. Still interested in other options though.
From your description, it seems that instead of an ArrayList of HashMaps you'd rather want a (Linked)HashMap of ArrayLists (each ArrayList being a column).
I'd add a double map from field name to column number, and some clever getters/setters that never throw IndexOutOfBoundsException.
You can also use an ArrayList<ArrayList<Object>> (basically a jagged, dynamically growing matrix) and keep the mapping to field (column) names outside.
Some columns will have a lot of
repeated values
I doubt this matters, especially if they are Strings (which are interned), since your collection would only store references to them.
Why don't you try using a cache implementation like EhCache?
This turned out to be very effective for me when I hit the same situation.
You can simply store your collection within the EhCache implementation.
There are configuration options such as:
maximum bytes to be used from the local heap.
Once the bytes used by your application overflow what is configured for the cache, the cache implementation takes care of writing the data to disk. You can also configure the amount of time after which objects are written to disk, using a least-recently-used algorithm.
You can be sure of avoiding out-of-memory errors by using this type of cache implementation.
It only increases the I/O operations of your application by a small degree.
This is just a bird's-eye view of the configuration. There are a lot of options to optimize for your requirements.
For me, Apache Commons Collections did not save any space. I compared two similar heap dumps taken just before an OOME, one using the Java 11 HashMap and one using the Apache Commons HashedMap, and the Apache Commons HashedMap doesn't appear to make any meaningful difference.
