Let's say I have a list of 1,000,000 users whose unique identifier is their username string. So to compare two User objects I just override the compareTo() method and compare the username members.
Given a username string, I wish to find the User object from a list. What, in an average case, would be the fastest way to do this?
I'm guessing a HashMap, mapping usernames to User objects, but I wondered if there was something else that I didn't know about which would be better.
If you don't need to store them in a database (which is the usual scenario), a HashMap<String, User> would work fine - it has O(1) complexity for lookup.
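A minimal sketch of that approach; getUsername() is assumed to be the accessor for the username member mentioned in the question:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

Map<String, User> buildIndex(List<User> userList) {
    Map<String, User> usersByName = new HashMap<>(2_000_000); // presized for ~1M users to limit rehashing
    for (User u : userList) {
        usersByName.put(u.getUsername(), u); // getUsername() assumed as the username accessor
    }
    return usersByName;
}

// Lookup: User found = usersByName.get(username); // average O(1), null if absent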
As noted, the usual scenario is to have them in the database. But in order to get faster results, caching is utilized. You can use EhCache - it is similar to ConcurrentHashMap, but it has time-to-live for elements and the option to be distributed across multiple machines.
You should not dump your whole database in memory, because it will be hard to synchronize. You will face issues with invalidating the entries in the map and keeping them up-to-date. Caching frameworks make all this easier. Also note that the database has its own optimizations, and it is not unlikely that your users will be kept in memory there for faster access.
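A hedged sketch of the cache-aside pattern with the EhCache 2.x API; the "users" cache (and its time-to-live) is assumed to be configured in ehcache.xml, and loadUserFromDatabase() is a hypothetical DAO call:

import net.sf.ehcache.Cache;
import net.sf.ehcache.CacheManager;
import net.sf.ehcache.Element;

User lookupUser(String username) {
    Cache userCache = CacheManager.create().getCache("users"); // "users" assumed declared in ehcache.xml
    Element element = userCache.get(username);
    if (element != null) {
        return (User) element.getObjectValue(); // cache hit
    }
    User user = loadUserFromDatabase(username); // hypothetical DB call
    userCache.put(new Element(username, user)); // expires according to the cache's TTL settings
    return user;
}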
I'm sure you want a hash map. They're the fastest thing going, and memory efficient. As also noted in other replies, a String works as a great key, so you don't need to override anything. (This is also true of the following.)
The chief alternative is a TreeMap. This is slower and uses a bit more memory. It's a lot more flexible, however. The same map will work great with 5 entries and 5 million entries. You don't need to clue it in in advance. If your list varies wildly in size, the TreeMap will grab memory as it needs it and let it go when it doesn't. HashMaps are not so good about letting go, and as I explain below, they can be awkward when grabbing more memory.
TreeMaps work better with garbage collectors. They ask for memory in small, easily found chunks. If you start a hash map with room for 100,000 entries, when it gets full it will free the 100,000-element array (almost a megabyte on a 64-bit machine) and ask for one that's even larger. If it does this repeatedly, it can get ahead of the GC, which tends to throw an out-of-memory error rather than spend a lot of time gathering up and consolidating scattered bits of free memory. (It prefers to maintain its reputation for speed at the expense of your machine's reputation for having a lot of memory. You really can manage to run out of memory with 90% of your heap unused because it's fragmented.)
So if you are running your program full tilt and your list of names varies wildly in size (perhaps you even have several lists of names varying wildly in size), a TreeMap will work a lot better for you.
A hash map will no doubt be just what you need. But when things get really crazy, there's the ConcurrentSkipListMap. This is everything a TreeMap is except it's a bit slower. On the other hand, it allows adds, updates, deletes, and reads from multiple threads willy-nilly, with no synchronization. (I mention it just to be complete.)
In terms of data structures, a HashMap can be a good choice. It works well for larger datasets, and the time for inserts is considered constant, O(1).
In this case it sounds like you will be carrying out more lookups than inserts. For lookups the average time complexity is O(1 + n/k), where n is the number of entries and k the number of buckets; the key factor here (sorry about the pun) is how effectively the hashing algorithm distributes the data across the buckets.
The risk here is that the usernames are short and drawn from a small character set such as a-z, in which case there could be a lot of collisions, loading the HashMap unevenly and slowing down the lookups. One option to improve this is to create your own key object and override the hashCode() method with an algorithm that suits your keys better, as in the sketch below.
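A hypothetical sketch of such a key wrapper; the mixing constants are illustrative only, not tuned to any particular username distribution:

import java.util.Objects;

public final class UserKey {
    private final String username;

    public UserKey(String username) {
        this.username = Objects.requireNonNull(username);
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof UserKey && ((UserKey) o).username.equals(username);
    }

    @Override
    public int hashCode() {
        // Spread short, low-entropy usernames across buckets more evenly than the
        // default String.hashCode() might for some data sets.
        int h = 0;
        for (int i = 0; i < username.length(); i++) {
            h = h * 131 + username.charAt(i);
        }
        return h ^ (h >>> 16);
    }
}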
In summary, if you have a large data set, a good/suitable hashing algorithm, and the space to hold it all in memory, then a HashMap can provide a relatively fast lookup.
Given your last post on the ArrayList and its scalability, I would take Bozho's suggestion and go for a purpose-built cache such as EhCache. This will allow you to control memory usage and eviction policies, and it will still be a lot faster than DB access.
If you don't change your list of users very often then you may want to use Aho-Corasick. You will need a pre-processing step that will take O(T) time and space, where T is the sum of the lengths of all user names. After that you can match user names in O(n) time, where n is the length of the user name you are looking for. Since you will have to look at every character in the user name you are looking for I don't think it's possible to do better than this.
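For exact one-name lookups, a plain trie (the structure Aho-Corasick builds on, minus the failure links needed for multi-pattern matching) already gives the O(n) lookup described above; a minimal sketch, reusing the question's User type:

import java.util.HashMap;
import java.util.Map;

final class UserTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        User user; // set only at the node that ends a complete username
    }

    private final Node root = new Node();

    void add(String username, User user) {
        Node node = root;
        for (char c : username.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.user = user;
    }

    User find(String username) {
        Node node = root;
        for (char c : username.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return null; // no user with this name
            }
        }
        return node.user;
    }
}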
Related
I am in the middle of a Java project which will be using a 'big dictionary' of words. By 'dictionary' I mean certain numbers (int) assigned to Strings. And by 'big' I mean a file of the order of 100 MB. The first solution that I came up with is probably the simplest possible. At initialization I read in the whole file and create a large HashMap which will be later used to look strings up.
Is there an efficient way to do it without the need of reading the whole file at initialization? Perhaps not, but what if the file is really large, let's say in the order of the RAM available? So basically I'm looking for a way to look things up efficiently in a large dictionary stored in memory.
Thanks for the answers so far, as a result I've realised I could be more specific in my question. As you've probably guessed the application is to do with text mining, in particular representing text in a form of a sparse vector (although some had other inventive ideas :)). So what is critical for usage is to be able to look strings up in the dictionary, obtain their keys as fast as possible. Initial overhead of 'reading' the dictionary file or indexing it into a database is not as important as long as the string look-up time is optimized. Again, let's assume that the dictionary size is big, comparable to the size of RAM available.
Consider ChronicleMap (https://github.com/OpenHFT/Chronicle-Map) in a non-replicated mode. It is an off-heap Java Map implementation, or, from another point of view, a superlightweight NoSQL key-value store.
What it offers out of the box that is useful for your task:
Persistence to disk via memory-mapped files (see the comment by Michał Kosmulski)
Lazy loading (disk pages are loaded only on demand) -> fast startup
If your data volume is larger than the available memory, the operating system will unmap rarely used pages automatically.
Several JVMs can use the same map, because off-heap memory is shared at the OS level. Useful if you do the processing within a map-reduce-like framework, e.g. Hadoop.
Strings are stored in UTF-8 form -> ~50% memory savings if strings are mostly ASCII (as maaartinus noted)
int or long values take just 4 (or 8) bytes, as if you had a primitive-specialized map implementation.
Very little per-entry memory overhead, much less than in the standard HashMap and ConcurrentHashMap
Good, configurable concurrency via lock striping, if you already need it or are going to parallelize the text processing in the future.
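A hedged construction sketch against the Chronicle Map 3.x builder; the sizes, the sample key, and the file name are placeholders, and the exact builder methods may differ between versions:

import java.io.File;
import java.io.IOException;
import net.openhft.chronicle.map.ChronicleMap;

ChronicleMap<CharSequence, Integer> openDictionary() throws IOException {
    return ChronicleMap
            .of(CharSequence.class, Integer.class)
            .entries(50_000_000)              // expected number of distinct words (placeholder)
            .averageKey("representativeWord") // helps size the off-heap segments
            .createPersistedTo(new File("dictionary.dat")); // memory-mapped, lazily paged in
}

// Lookup then stays an ordinary Map call: Integer id = dictionary.get(word);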
Once your data structure is a few hundred MB up to the order of your RAM, you're better off not initializing a data structure at run-time, but rather using a database which supports indexing (which most do these days). Indexing is going to be one of the only ways you can ensure the fastest retrieval of text once your file gets so large and you're running up against the -Xmx settings of your JVM. This is because if your file is as large as, or much larger than, your maximum heap size settings, you're inevitably going to crash your JVM.
As for having to read the whole file at initialization: you're going to have to do this eventually so that you can efficiently search and analyze the text in your code. If you know that you're only going to be searching a certain portion of your file at a time, you can implement lazy loading. If not, you might as well bite the bullet and load your entire file into the DB in the beginning. You can implement parallelism in this process, if there are other parts of your code execution that don't depend on it.
Please let me know if you have any questions!
As stated in a comment, a Trie will save you a lot of memory.
You should also consider using bytes instead of chars as this saves you a factor of 2 for plain ASCII text or when using your national charset as long as it has no more than 256 different letters.
At first glance, combining this low-level optimization with tries makes no sense, as with them the node size is dominated by the pointers. But there's a way if you want to go low-level.
So what is critical for usage is to be able to look strings up in the dictionary, obtain their keys as fast as possible.
Then forget any database, as they're damn slow when compared to HashMaps.
If it doesn't fit into memory, the cheapest solution is usually to get more of it. Otherwise, consider loading only the most common words and doing something slower for the others (e.g., a memory mapped file).
I was asked to point to a good trie implementation, especially off-heap. I'm not aware of any.
Assuming the OP needs no mutability, especially no mutability of keys, it all looks very simple.
I guess, the whole dictionary could be easily packed into a single ByteBuffer. Assuming mostly ASCII and with some bit hacking, an arrow would need 1 byte per arrow label character and 1-5 bytes for the child pointer. The child pointer would be relative (i.e., difference between the current node and the child), which would make most of them fit into a single byte when stored in a base 128 encoding.
I can only guess the total memory consumption, but I'd say, something like <4 bytes per word. The above compression would slow the lookup down, but still nowhere near what a single disk access needs.
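An illustrative sketch of the base-128 (varint) encoding mentioned above for the relative child pointers; small offsets fit in one byte, larger ones in up to five:

import java.nio.ByteBuffer;

final class VarInt {
    // Writes a non-negative offset using 7 data bits per byte plus a continuation bit.
    static void write(ByteBuffer buf, int offset) {
        while ((offset & ~0x7F) != 0) {
            buf.put((byte) ((offset & 0x7F) | 0x80)); // more bytes follow
            offset >>>= 7;
        }
        buf.put((byte) offset); // last byte, continuation bit clear
    }

    static int read(ByteBuffer buf) {
        int value = 0;
        int shift = 0;
        int b;
        do {
            b = buf.get() & 0xFF;
            value |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return value;
    }
}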
It sounds too big to store in memory. Either store it in a relational database (easy, and with an index on the hash, fast), or a NoSQL solution, like Solr (small learning curve, very fast).
Although NoSQL is very fast, if you really want to tweak performance, and there are entries that are far more frequently looked up than others, consider using a limited size cache to hold the most recently used (say) 10000 lookups.
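A minimal sketch of such a cache using the JDK's LinkedHashMap in access order; 10,000 entries matches the figure above, and the value type stands in for whatever the dictionary maps strings to:

import java.util.LinkedHashMap;
import java.util.Map;

static <V> Map<String, V> newLruCache(final int maxEntries) {
    return new LinkedHashMap<String, V>(maxEntries, 0.75f, true) { // true = access order
        @Override
        protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
            return size() > maxEntries; // evict the least recently used entry beyond the cap
        }
    };
}

// e.g. Map<String, Integer> recentLookups = newLruCache(10_000);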
If each of them is guaranteed to have a unique key (generated and enforced by an external keying system), which Map implementation is the correct fit for me? Assume this has to be optimized for concurrent lookup only (the data is initialized once during the application startup).
Do these 300 million unique keys have any positive or negative implications on bucketing/collisions?
Any other suggestions?
My map would look something like this
Map<String, <boolean, boolean, boolean, boolean>>
I would not use a map; it needs too much memory, especially in your case.
Store the values in one data array, and store the keys in a sorted index array.
In the sorted key array you use binary search to find the position of a key; that position is then the index into data[].
The tricky part will be building up the arrays without running out of memory.
You don't need to consider concurrency, because you only read from the data.
Further, try to avoid using a String as the key; try to convert the keys to long.
The advantage of this solution: the search time is guaranteed not to exceed log n, even in worst cases where the keys cause problems with hashCode().
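A minimal sketch of that layout, assuming the keys have already been converted to long and sorted, with data[i] packing the four booleans for keys[i] into a byte:

import java.util.Arrays;

final class SortedIndex {
    private final long[] keys; // sorted ascending, built once at startup
    private final byte[] data; // data[i] holds the packed flags for keys[i]

    SortedIndex(long[] sortedKeys, byte[] data) {
        this.keys = sortedKeys;
        this.data = data;
    }

    // Returns the packed flags, or -1 if the key is absent. Guaranteed O(log n).
    int lookup(long key) {
        int pos = Arrays.binarySearch(keys, key);
        return pos >= 0 ? data[pos] & 0xFF : -1;
    }
}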
Other suggestion? You bet.
Use a proper key-value store; Redis is the first option that comes to mind. Sure, it's a separate process and dependency, but you'll win big time when it comes to proper system design.
There should be a very good reason why you would want to couple your business logic with several gigs of data in same process memory, even if it's ephemeral. I've tried this several times, and was always proved wrong.
It seems to me that you can simply use a TreeMap, because it will give you O(log(n)) search time due to its sorted structure. Furthermore, it is a suitable method because, as you said, all data will be loaded at startup.
If you need to keep everything in memory, then you will need to use some library meant for this amount of elements, like Huge Collections. On top of that, if the number of writes will be big, then you also have to think about more sophisticated solutions like a non-blocking hash map.
I have this list of TCP/UDP port numbers and their string description:
http://en.wikipedia.org/wiki/List_of_TCP_and_UDP_port_numbers
Now, this is in the form of a HashMap with the port number as the key and the string description as the value. It might not be that big, but I have to look up the port description in real time as the packets come in, and as you can imagine, this requires efficient retrieval, otherwise it slows down the processing considerably.
Initially I thought of implementing a huge switch case/break logic or if/else if, but that sounded too shabby, so I came up with this HashMap.
Now I want to know: does Java have something like a caching mechanism to speed things up if the queries are always the same? Mostly the queried ports will be 80, 443, 23, 22, etc., and only rarely will packets for other service types arrive.
My Options:
Should I make a couple of else-if checks at the start for the most common types and then fall back to this HashMap if not found earlier?
Should I continue with this HashMap to do the search for me?
Should I revert to some other clever way of doing this?
Please suggest.
Have you measured how long this takes? I suspect that a lookup in a hash map with a reasonable number of buckets is going to be negligible compared to whatever else you're doing.
As always with this sort of question, it's well worth measuring the supposed performance issue before working on it. Premature optimisation is the root of all evil, as they say.
it slows down the processing considerably.
A lookup in a HashMap typically takes about 50 ns. Given that reading data from a socket typically takes 10,000-20,000 ns, I suspect this isn't the problem you think it is.
If you want a really fast lookup, use an array, as this can be faster.
String[] portToName = new String[65536];
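Filling and querying that array might look like the following; the descriptions here are just examples, and the full table would be loaded from your existing list at startup:

final class PortNames {
    private static final String[] PORT_TO_NAME = new String[65536];
    static {
        PORT_TO_NAME[22]  = "SSH";
        PORT_TO_NAME[80]  = "HTTP";
        PORT_TO_NAME[443] = "HTTPS";
        // ... populate the rest from your list ...
    }

    static String describe(int port) {
        // A single bounds-checked array read: no hashing and no boxing of the port number.
        return (port >= 0 && port < PORT_TO_NAME.length) ? PORT_TO_NAME[port] : null;
    }
}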
A HashMap has expected O(1) access time for get operations. The way you're doing it right now is perfect from any point of view.
Maintaining an if/else-if structure would be error-prone and useless in terms of speedup (for a large list it would actually be worse, with O(n) asymptotic time).
Possible Duplicate:
Memory overhead of Java HashMap compared to ArrayList
I tried to search on Google but didn't find an answer. I need to store 160 entries in a collection, and I don't want to iterate over them; I want to get the value by one entry, a 2D point. What is the best?
With less memory consumption and faster access?
Thanks a lot in advance ;)
For 160 entries? Who cares. Use whichever API is better for this – probably the HashMap.
Also, with data structures, very frequently the tradeoff is speed versus memory use – as a rule you can't have both. (E.g. a hash table that uses chaining for collision resolution will have memory overhead for the buckets over an array; accessing an element by key in it will be faster than searching the array for the key.)
With less memory consumption and faster access?
HashMap probably gives faster access. This could be computationally important, but only if the application does a huge number of lookups using the collection.
ArrayList probably gives least memory usage. However, for a single collection containing 160 elements, the difference is probably irrelevant.
My advice is to not spend a lot of time trying to decide which is best. Toss a coin if you need to, and then move on to more important problems. You only should be worrying about this kind of thing (where the collection is always small) if CPU and/or memory profiling tells you that this data structure is a critical bottleneck.
If HashMap vs ArrayList are your only two options, obviously it is going to be a HashMap for speed of retrieval. I am not sure about memory usage. But an ArrayList has the ability to maintain order.
Designing a system where a service endpoint (probably a simple servlet) will have to handle 3K requests per second (data will be HTTP POSTed).
These requests will then be stored in MySQL.
The key issue that I need guidance on is that there will be a high % of duplicate data posted to this endpoint.
I only need to store unique data in MySQL, so what would you suggest I use to handle the duplication?
The posted data will look like:
<root>
<prop1></prop1>
<prop2></prop2>
<prop3></prop3>
<body>
maybe 10-30K of text in here
</body>
</root>
I will write a method that will hash prop1, prop2, prop3 to create a unique hashcode (the body can be different and still be considered unique).
I was thinking of creating some sort of concurrent dictionary that will be shared across requests.
Duplicates are most likely to occur within a period of 24 hours, so I can purge data from this dictionary every x hours.
Any suggestions on the data structure to store the duplicates? And what about purging, and how many records should I store, considering 3K requests per second? It will get large very fast.
Note: there are 10K different sources that will be posting, and duplication only occurs within a given source. Meaning I could have more than one dictionary, for maybe a group of sources, to spread things out. Meaning if source1 posts data and then source2 posts data, the chances of duplication are very, very low. But if source1 posts 100 times in a day, the chances of duplication are very high.
Note: please ignore for now the task of saving the posted data to MySQL, as that is another issue on its own; duplicate detection is the first hurdle I need help with.
Interesting question.
I would probably be looking at some kind of HashMap of HashMaps structure here where the first level of HashMaps would use the sources as keys and the second level would contain the actual data (the minimal for detecting duplicates) and use your hashcode function for hashing. For actual implementation, Java's ConcurrentHashMap would probably be the choice.
This way you have also set up the structure to partition your incoming load depending on sources if you need to distribute the load over several machines.
With regards to purging, I think you have to measure the exact behavior with production-like data. You need to learn how quickly the data grows when you successfully eliminate duplicates and how it becomes distributed in the HashMaps. With a good distribution and not-too-quick growth, I can imagine it is good enough to do a cleanup occasionally. Otherwise maybe an LRU policy would be good.
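A minimal sketch of that two-level idea, using a concurrent set of content hashes per source (only presence matters here, so the inner level doesn't need to be a full map); sourceId and contentHash are whatever your own hashing of prop1/prop2/prop3 produces:

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class DuplicateDetector {
    private final ConcurrentHashMap<String, Set<Long>> seenBySource = new ConcurrentHashMap<>();

    // Returns true if this (source, hash) pair was already seen.
    boolean isDuplicate(String sourceId, long contentHash) {
        Set<Long> seen = seenBySource.computeIfAbsent(sourceId, s -> ConcurrentHashMap.newKeySet());
        return !seen.add(contentHash); // add() returns false when the hash was already present
    }
}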
Sounds like you need a hashing structure that can add and check the existence of a key in constant time. In that case, try to implement a Bloom filter. Be careful that this is a probabilistic structure i.e. it may tell you that a key exists when it does not, but you can make the probability of failure extremely low if you tweak the parameters carefully.
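A hedged sketch using Guava's BloomFilter; the library choice and the sizing numbers are assumptions for illustration, not something from the question:

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

final class ProbabilisticDeduper {
    // Sized for ~10M distinct keys with a 0.1% false-positive rate; tune to your load.
    // Note: check the thread-safety guarantees of your Guava version, or guard writes with a lock.
    private final BloomFilter<CharSequence> seen =
            BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 10_000_000, 0.001);

    boolean probablyDuplicate(String key) {
        if (seen.mightContain(key)) {
            return true; // may rarely be a false positive
        }
        seen.put(key);
        return false; // definitely not seen before
    }
}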
Edit: Ok, so bloom filters are not acceptable. To still maintain constant lookup (albeit not a constant insertion), try to look into Cuckoo hashing.
1) Set up your database like this:
ALTER TABLE Root ADD UNIQUE INDEX(Prop1, Prop2, Prop3);
INSERT INTO Root (Prop1, Prop2, Prop3, Body) VALUES (#prop1, #prop2, #prop3, #body)
ON DUPLICATE KEY UPDATE Body=#body
2) You don't need any algorithms or fancy hashing ADTs
shell> mysqlimport [options] db_name textfile1 [textfile2 ...]
http://dev.mysql.com/doc/refman/5.1/en/mysqlimport.html
Make use of the --replace or --ignore flags, as well as --compress.
3) All your Java will do is...
a) generate CSV files: use the StringBuffer class, then every X seconds or so swap in a fresh StringBuffer and pass the .toString() of the old one to a thread to flush it to a file /temp/SOURCE/TIME_STAMP.csv
b) occasionally kick off a Runtime.getRuntime().exec of the mysqlimport command
c) delete the old CSV files if space is an issue, or archive them to network storage/backup device
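A rough sketch of step (a); the directory layout follows the answer above, while the class name, the scheduling, and the error handling are illustrative only:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class CsvSpooler {
    private StringBuffer current = new StringBuffer();

    synchronized void append(String csvLine) {
        current.append(csvLine).append('\n');
    }

    // Called every X seconds from a scheduled thread: swap in a fresh buffer,
    // then write the old contents to /temp/SOURCE/TIME_STAMP.csv for mysqlimport.
    void flush(String source) throws IOException {
        String out;
        synchronized (this) {
            if (current.length() == 0) {
                return;
            }
            out = current.toString();
            current = new StringBuffer();
        }
        Path file = Paths.get("/temp", source, System.currentTimeMillis() + ".csv");
        Files.createDirectories(file.getParent());
        Files.write(file, out.getBytes(StandardCharsets.UTF_8));
    }
}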
Well, you're basically looking for some kind of extremely large HashMap and something like
if (map.put(key, val) == null) // first time seen: store the data
There are lots of different HashMap implementations available, but you could look at NBHM (Cliff Click's NonBlockingHashMap). Non-blocking puts, and it was designed with large, scalable problems in mind, so it could work just fine. The map also has iterators that do NOT throw a ConcurrentModificationException while you use them to traverse the map, which is basically a requirement for removing old data as I see it. Also, putIfAbsent is all you actually need - but no idea if that's more efficient than just a simple put; you'd have to ask Cliff or check the source.
The trick then is to try to avoid resizing of the map by making it large enough - otherwise the throughput will suffer while resizing (which could be a problem). And think about how to implement the removal of old data - probably using some idle thread that traverses an iterator and removes old entries.
Use a java.util.concurrent.ConcurrentHashMap for building a map of your hashes, but make sure you have the correct initialCapacity and concurrencyLevel assigned to the map at creation time.
The api docs for ConcurrentHashMap have all the relevant information:
initialCapacity - the initial capacity. The implementation performs internal sizing to accommodate this many elements.
concurrencyLevel - the estimated number of concurrently updating threads. The implementation performs internal sizing to try to accommodate this many threads.
You should be able to use putIfAbsent for handling 3K requests as long as you have initialized the ConcurrentHashMap the right way - make sure this is tuned as part of your load testing.
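A minimal sketch of that construction; the capacity and concurrency figures are placeholders to be replaced with numbers from your own load testing, and requestHash is whatever your method computes from prop1/prop2/prop3:

import java.util.concurrent.ConcurrentHashMap;

final class HashRegistry {
    // Placeholder sizing: expected unique hashes per purge window and expected writer threads.
    private final ConcurrentHashMap<String, Long> seenHashes =
            new ConcurrentHashMap<>(10_000_000, 0.75f, 64); // initialCapacity, loadFactor, concurrencyLevel

    // Returns true only for the first request that presents this hash.
    boolean markIfNew(String requestHash) {
        return seenHashes.putIfAbsent(requestHash, System.currentTimeMillis()) == null;
    }
}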
At some point, though, trying to handle all the requests in one server may prove to be too much, and you will have to load-balance across servers. At that point you may consider using memcached for storing the index of hashes, instead of the ConcurrentHashMap.
The interesting problems that you will still have to solve, though, are:
loading all of the hashes into memory at startup
determining when to knock off hashes from the in-memory map
If you use a strong hash function, such as MD5 or SHA-1, you will not need to store any data at all. The probability of two different inputs producing the same hash is virtually nil, so if you find the same hash result twice, the second is a duplicate.
Given that MD5 is 16 bytes and SHA-1 is 20 bytes, this should decrease memory requirements, therefore keeping more elements in the CPU cache and dramatically improving speed.
Storing these keys requires little more than a small hash table followed by trees to handle collisions.
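A sketch of deriving such a key with the JDK's MessageDigest; concatenating the properties with a separator and Base64-encoding the 16-byte digest are assumptions for illustration:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

final class ContentKeys {
    static String md5Key(String prop1, String prop2, String prop3) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(
                    (prop1 + '\u0000' + prop2 + '\u0000' + prop3).getBytes(StandardCharsets.UTF_8));
            return Base64.getEncoder().encodeToString(digest); // 16-byte digest as a compact map/set key
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is required to be available by the JCA spec", e);
        }
    }
}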