Difference in HashMap in Java 7 and Java 8

What is the difference between the HashMap of Java 7 and Java 8 when both are described as constant-complexity structures? As per my understanding, a hash map searches in constant time by generating a hash key for an object through a hash function.

In Java 7, after the hash is calculated by the hash function, if more than one element has the same hash they are found by a linear scan of the bucket, so the complexity is O(n). In Java 8 that search within a large bucket is performed over a balanced tree, so the complexity becomes O(log n). So the idea that a hash map always finds an object in constant time is wrong, because it is not the case when many keys collide.

You might find recent issues of the Java Specialists newsletter very helpful. It goes into great depth on how hashing in Java has evolved over the years; for example, it points out that you should make sure your map keys implement Comparable (when using Java 8), so that colliding keys can be ordered inside a bucket.
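To see why that advice matters, here is a minimal sketch (class names are made up) of a key type with a deliberately terrible hashCode. Because the key implements Comparable, Java 8's HashMap can order the colliding keys inside its tree bins and keep the lookup around O(log n); on Java 7 the same lookup walks a linked list.

import java.util.HashMap;
import java.util.Map;

// Hypothetical key whose hashCode always collides.
final class BadKey implements Comparable<BadKey> {
    private final int id;

    BadKey(int id) { this.id = id; }

    @Override
    public int hashCode() { return 42; }              // every key lands in the same bucket

    @Override
    public boolean equals(Object o) {
        return o instanceof BadKey && ((BadKey) o).id == id;
    }

    @Override
    public int compareTo(BadKey other) {              // lets Java 8 order the tree bin
        return Integer.compare(id, other.id);
    }
}

public class TreeBinDemo {
    public static void main(String[] args) {
        Map<BadKey, Integer> map = new HashMap<>();
        for (int i = 0; i < 100_000; i++) {
            map.put(new BadKey(i), i);                // all entries share one bucket
        }
        // On Java 8+ this walks a red-black tree (roughly log2(100000) ~ 17 comparisons)
        // instead of a 100,000-element linked list.
        System.out.println(map.get(new BadKey(99_999)));
    }
}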

Related

Algorithm used for bucket lookup for hashcodes [duplicate]

In most cases, HashSet has lookup complexity O(1). I understand that this is because objects are kept in buckets corresponding to their hashcodes.
When a lookup is done, it goes directly to the bucket and finds the element (using equals if many objects are present in the same bucket).
I always wonder: how does it go directly to the required bucket? Which algorithm is used for the bucket lookup? Doesn't that add to the total lookup time?
I always wonder: how does it go directly to the required bucket?
The hash code is used as an index into an array.
The index is determined by hash & (array.length - 1), because the length of the Java HashMap's internal array is always a power of 2. (This is a cheaper way of computing hash % array.length.)
Each "bucket" is actually a linked list (and now, possibly a tree) where entries with colliding hashes are grouped. If there are collisions, then a linear search through the bucket is performed.
Doesn't that add to the total lookup time?
It incurs the cost of a few loads from memory.
Often, the algorithm is simply
hash = hashFunction(key)
index = hash % arraySize
See the wikipedia article on Hash Table for details.
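As a concrete sketch (the method and variable names here are mine, not the JDK's), the bucket selection can be written as below; Java 8's HashMap additionally XORs the high bits of the hash into the low bits before masking:

public class BucketIndex {
    // Illustrative only. Masking with (tableLength - 1) is equivalent to
    // hash % tableLength when tableLength is a power of two, as it always
    // is for HashMap's internal table.
    static int bucketIndex(Object key, int tableLength) {
        int h = key.hashCode();
        h ^= (h >>> 16);                  // spread the high bits, as Java 8's HashMap.hash does
        return (tableLength - 1) & h;
    }

    public static void main(String[] args) {
        System.out.println(bucketIndex("ABC", 16));   // which of 16 buckets "ABC" would land in
    }
}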
From memory: the HashSet is actually backed by a HashMap and the basic look up process is:
Get the key
hash it (hashcode())
hashcode % the number of buckets
Go to that bucket and evaluate equals()
For a Set there would only be unique elements. I would suggest reading the source for HashSet and it should be able to answer your queries.
http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/util/HashMap.java#HashMap.containsKey%28java.lang.Object%29
Also note that the code was updated in Java 8 and this explanation covers the pre-Java 8 codebase. I have not examined the Java 8 implementation in detail, except to note that it is different.
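Those steps also explain a classic pitfall: if hashCode and equals are inconsistent, the lookup jumps to the wrong bucket before equals is ever consulted. A small sketch (class names are made up) of what that looks like:

import java.util.HashSet;
import java.util.Set;

// Hypothetical class that overrides equals but not hashCode.
final class Point {
    final int x, y;
    Point(int x, int y) { this.x = x; this.y = y; }

    @Override
    public boolean equals(Object o) {
        return o instanceof Point && ((Point) o).x == x && ((Point) o).y == y;
    }
    // hashCode deliberately not overridden: "equal" points get different identity
    // hashes and therefore usually land in different buckets.
}

public class BucketLookupDemo {
    public static void main(String[] args) {
        Set<Point> set = new HashSet<>();
        set.add(new Point(1, 2));
        // Almost always false: the set hashes to the wrong bucket first,
        // so equals is never reached.
        System.out.println(set.contains(new Point(1, 2)));
    }
}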

Java Key-Value Collection with complexity of O(1) for millions of random unordered keys

I am stuck with a problem where I have millions of key-value pairs that I need to access using the keys randomly (not by using an iterator).
The range of keys is not known at compile time, but total number of the key-value pairs is known.
I have looked into the HashMap and HashSet data structures, but they are not truly O(1): when hash codes collide, a bucket becomes a linked list, which has linear search complexity in the worst case.
I have also considered increasing the number of buckets in the HashMap but it does not ensure that every element will be stored in a separate bucket.
Is there any way to store and access millions of key-value pairs with O(1) complexity?
Ideally, I would like every key to be like a variable and the corresponding value to be the value assigned to that key.
Thanks in advance.
I think you are confusing what Big O notation represents. It defines limiting behavior of a function, not necessarily actual behavior.
The average complexity of a hash map is O(1) for insert, delete, and search operations. What does this mean? It means that, on average, those operations complete in constant time regardless of the size of the hash map. So, depending on the implementation of the map, a lookup might not take exactly one step, but it will most likely not involve more than a few steps, regardless of the hash map's size.
How well a hash map actually behaves for those operations is determined by a few factors. The most obvious is the hash function used to bucket keys. Hash functions that distribute the computed hashes more uniformly over the hash range and limit the number of collisions are preferred. The better the hash function in those areas, the closer a hash map will actually operate in constant time.
Another factor that affects actual hash map behavior is how storage is managed. How a map resizes and repositions entries as items are added and removed helps control hash collisions by maintaining an optimal number of buckets. Managing the hash map's storage effectively allows the hash map to operate close to constant time.
With all that said, there are ways to construct hash maps that have O(1) worst case behavior for lookups. This is accomplished using a perfect hash function. A perfect hash function is an invertible 1-1 function between keys and hashes. With a perfect hash function and the proper hash map storage, O(1) lookups can be achieved. The prerequisite for using this approach is knowing all the key values in advance so a perfect hash function can be developed.
Sadly, your case does not involve known keys, so a perfect hash function cannot be constructed, but the available research might help you construct a near-perfect hash function for your case.
No, there isn't such a (known) data structure for generic data types.
If there were, it would most likely have replaced hash tables in most commonly-used libraries, unless there's some significant disadvantage like a massive constant factor or ridiculous memory usage, either of which would probably make it nonviable for you as well.
I said "generic data types" above, as there may be some specific special cases for which it's possible, such as when the key is a integer in a small range - in this case you could just have an array where each index corresponds to the same key, but this is also really a hash table where the key hashes to itself.
Note that you need a terrible hash function, the pathological input for your hash function, or a very undersized hash table to actually get the worst-case O(n) performance for your hash table. You really should test it and see if it's fast enough before you go in search of something else. You could also try TreeMap, which, with its O(log n) operations, will sometimes outperform HashMap.

Difference between a general Hash table and java's HashMap in Big O

The Hash table wiki entry lists its Big O as:
Search: O(n)
Insert: O(n)
Delete: O(n)
while a java HashMap is listed with Big O as:
get: O(1)
put: O(1)
remove: O(1)
Can someone please explain why the Big O differs between the concept and the implementation? I mean, if there is an implementation with a worst case of O(1), then why is there a possibility of O(n) in the concept?
The worst case is O(n) because it might be possible that every entry you put into the HashMap produces the same hash value (let's say 10). This produces a conflict for every entry because every entry is put at HashMap[10]. Depending on which collision resolution strategy is implemented, the HashMap either builds a list at index 10 or moves the entry to the next free index.
Nevertheless, when an entry is accessed again, the hash value is used to get the initial index into the HashMap. As it is 10 in every case, the HashMap has to resolve the collision on every lookup.
Because there's a difference between worst case and average case, and even Wikipedia lists the O(1) complexity for the average case. Java's HashMap is exactly the same as Wikipedia's hash table. So it is just a documentation issue.
Basically, hash tables compute a numerical value from the object you want to store. That numerical value is roughly used as an index to access the location to store the object into (leading to O(1) complexity). However, sometimes certain objects may lead to the same numerical value. In this case those objects will be stored in a list stored in the corresponding location in the hash map, hence the O(n) complexity for the worst case.
I'm not sure where you found the reported complexity of a Java HashMap, but it is listing the average case, which matches what Wikipedia states on the page you linked.
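To make the chaining idea concrete, here is a deliberately stripped-down sketch of a chained table (not the JDK's actual code: fixed 16 buckets, no resizing, no null keys). get is O(1) on average but degrades to O(n) only when many keys land in the same bucket:

import java.util.LinkedList;

public class ChainedMap<K, V> {
    private static class Entry<K, V> {
        final K key;
        V value;
        Entry(K key, V value) { this.key = key; this.value = value; }
    }

    @SuppressWarnings("unchecked")
    private final LinkedList<Entry<K, V>>[] buckets = new LinkedList[16];

    private int indexFor(K key) {
        return (buckets.length - 1) & key.hashCode();   // power-of-two mask
    }

    public void put(K key, V value) {
        int i = indexFor(key);
        if (buckets[i] == null) buckets[i] = new LinkedList<>();
        for (Entry<K, V> e : buckets[i]) {
            if (e.key.equals(key)) { e.value = value; return; }
        }
        buckets[i].add(new Entry<>(key, value));
    }

    public V get(K key) {
        int i = indexFor(key);
        if (buckets[i] == null) return null;
        for (Entry<K, V> e : buckets[i]) {              // linear scan only within one bucket
            if (e.key.equals(key)) return e.value;
        }
        return null;
    }
}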

Hash Array Mapped Trie (HAMT)

I am trying to get my head around the details of a HAMT. I've implemented one myself in Java just to understand it. I am familiar with tries and I think I get the main concept of the HAMT.
Basically, there are two types of nodes:

Key/Value node:
    K key
    V value

Index node:
    int bitmap (32 bits)
    Node[] table (max length of 32)
1. Generate a 32-bit hash for an object.
2. Step through the hash 5 bits at a time (bits 0-4, 5-9, 10-14, 15-19, 20-24, 25-29, 30-31). Note: the last (7th) step uses only 2 bits.
3. At each step, find the position of that 5-bit value in the bitmap, i.e. test the bit at that position (e.g. a value of 5 tests bit 5 of the bitmap, mask 1 << 5).
4. If the bit is a 1 then that part of the hash exists.
5. If the bit is a 0 then the key doesn't exist.
6. If the key exists, find its index into the table by counting the number of 1s in the bitmap below that position, e.g. value==6, bitmap==0101010101, index==3 (see the sketch after these steps).
7. If the table entry is a key/value node, compare the keys.
8. If the table entry is an index node, go back to step 2 and move ahead one 5-bit step.
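A sketch of the bitmap bookkeeping in steps 3-6 (method names are mine; Integer.bitCount is the popcount):

public class HamtIndex {
    // Extract the 5-bit chunk for the given level (level 0 = lowest 5 bits here;
    // which end of the hash you start from is an implementation choice).
    static int chunk(int hash, int level) {
        return (hash >>> (5 * level)) & 0x1F;
    }

    // Does the index node have an entry for this chunk?
    static boolean present(int bitmap, int chunk) {
        return (bitmap & (1 << chunk)) != 0;
    }

    // Position in the compressed table = number of set bits below this chunk's bit.
    static int tableIndex(int bitmap, int chunk) {
        return Integer.bitCount(bitmap & ((1 << chunk) - 1));
    }
}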
The part I don't quite understand is collision detection and mitigation. In the linked paper he alludes to it:
The existing key is then inserted in the new sub-hash table and the new key added. Each time 5 more bits of the hash are used the probability of a collision reduces by a factor of 1/32. Occasionally an entire 32 bit hash may be consumed and a new one must be computed to differentiate the two keys.
If I were to compute a "new" hash and store the object at that new hash, how would you ever be able to look up the object in the structure? When doing a look-up, wouldn't it generate the "initial" hash and not the "re-computed" hash?
I must be missing something.....
BTW: the HAMT performs fairly well; it sits between a hash map and a tree map in my tests.
Data Structure         Add time    Remove time  Sorted add time  Sorted remove time  Lookup time  Size
Java's Hash Map        38.67 ms    18 ms        30 ms            15 ms               16.33 ms     8.19 MB
HAMT                   68.67 ms    30 ms        57.33 ms         35.33 ms            33 ms        10.18 MB
Java's Tree Map        86.33 ms    78 ms        75 ms            39.33 ms            76 ms        8.79 MB
Java's Skip List Map   111.33 ms   106 ms       72 ms            41 ms               72.33 ms     9.19 MB
HAMT is a great and highly performant structure, especially when one needs immutable objects, i.e. when after any modification a new copy of the data structure is created!
As for your question on hash collisions, I have found a C# implementation (which is buggy now) that shows how it works: on each hash collision the key is rehashed and the lookup is retried recursively until a maximum iteration limit is reached.
Currently I am also exploring HAMT in a functional programming context and learning from existing code. There are several reference implementations of HAMT, in Haskell as Data.HashMap and in Clojure as PersistentHashMap.
There are some other, simpler implementations on the web that do not deal with collisions, but they are useful for understanding the concept. Here they are in Haskell and OCaml.
I have found a nice summary article that describes HAMT with pictures and links to the original research papers by Phil Bagwell.
Related points:
While implementing HAMT in F# I noticed that the popCount function implementation described here really matters and gives a 10-15% improvement compared to the naive implementation described in the other answers at that link. Not huge, but a free lunch.
The related IntMap structures (Haskell and its port to F#) are very good when the key can be an integer; they implement the related PATRICIA/radix trie.
I believe all these implementations are very good for learning efficient immutable data structures and functional languages in all their beauty - they really shine together!
There are two sections of the paper I think you might have missed. The first is the bit immediately preceding the one you quoted:
Or the key will collide with an existing one. In which case the existing key must be replaced with a sub-hash table and the next 5 bit hash of the existing key computed. If there is still a collision then this process is repeated until no collision occurs.
So if you have object A in the table and you add object B which clashes, the cell at which their keys clashed will be a pointer to another subtable (where they don't clash).
Next, Section 3.7 of the paper you linked describes the method for generating a new hash when you run off the end of your first 32 bits:
The hash function was tailored to give a 32 bit hash. The algorithm requires that the hash can be extended to an arbitrary number of bits. This was accomplished by rehashing the key combined with an integer representing the trie level, zero being the root. Hence if two keys do give the same initial hash then the rehash has a probability of 1 in 2^32 of a further collision.
If this doesn't seem to explain things, say so and I'll extend this answer with more detail.
If I were to compute a "new" hash and store the object at that new hash, how would you ever be able to look up the object in the structure? When doing a look-up, wouldn't it generate the "initial" hash and not the "re-computed" hash?
When doing a look-up, the initial hash is used. When the bits of the initial hash are exhausted, one of the following conditions is true:
we end up with a key/value node - return it
we end up with an index node - this is the hint that we have to go deeper by recomputing a new hash.
The key here is hash bits exhaustion.
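A sketch of how the "re-computed" hash stays reproducible at lookup time (the mixing function here is my own assumption; the paper only requires rehashing the key combined with the trie level):

import java.util.Objects;

public class ExtendedHash {
    // Level 0 uses the ordinary hash; deeper levels mix in the level, as
    // section 3.7 of the paper describes. Objects.hash is used purely for illustration.
    static int hashForLevel(Object key, int level) {
        return level == 0 ? key.hashCode() : Objects.hash(key, level);
    }
}

Since both insertion and lookup derive the level from how many hash bits they have already consumed, they call the same function with the same arguments, so the extended hash can always be regenerated during a look-up.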
The chance of collision is presumably very low, and generally only problematic for huge trees. Given this, you're better off just storing collisions in an array at the leaf and searching it linearly (I do this in my C# HAMT).

Is a Java hashmap search really O(1)?

I've seen some interesting claims on SO re Java hashmaps and their O(1) lookup time. Can someone explain why this is so? Unless these hashmaps are vastly different from any of the hashing algorithms I was brought up on, there must always exist a dataset that contains collisions.
In which case, the lookup would be O(n) rather than O(1).
Can someone explain whether they are O(1) and, if so, how they achieve this?
A particular feature of a HashMap is that, unlike, say, balanced trees, its behavior is probabilistic. In these cases it's usually most helpful to talk about complexity in terms of the probability that a worst-case event occurs. For a hash map, that of course is a collision, and its likelihood depends on how full the map happens to be. A collision is pretty easy to estimate:
p(collision) = n / capacity
So a hash map with even a modest number of elements is pretty likely to experience at least one collision. Big O notation allows us to do something more compelling. Observe that for any arbitrary, fixed constant k:
O(n) = O(k * n)
We can use this feature to improve the performance of the hash map. We could instead think about the probability of at most 2 collisions.
p(2 collisions) = (n / capacity)^2
This is much lower. Since the cost of handling one extra collision is irrelevant to Big O performance, we've found a way to improve performance without actually changing the algorithm! We can generalize this to:
p(k collisions) = (n / capacity)^k
And now we can disregard some arbitrary number of collisions and end up with vanishingly tiny likelihood of more collisions than we are accounting for. You could get the probability to an arbitrarily tiny level by choosing the correct k, all without altering the actual implementation of the algorithm.
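As an illustration using the (simplified) formula above: with Java's default load factor of 0.75, the chance of having to step past 10 colliding entries is at most 0.75^10 ≈ 0.06, and past 20 entries roughly 0.003, so a modest k already makes the leftover probability negligible.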
We describe this by saying that the hash map has O(1) access with high probability.
You seem to mix up worst-case behaviour with average-case (expected) runtime. The former is indeed O(n) for hash tables in general (i.e. not using perfect hashing), but this is rarely relevant in practice.
Any dependable hash table implementation, coupled with a half decent hash, has a retrieval performance of O(1) with a very small factor (2, in fact) in the expected case, within a very narrow margin of variance.
In Java, how does HashMap work?
It uses hashCode to locate the corresponding bucket (inside its internal array of buckets).
Each bucket is a LinkedList (or a Balanced Red-Black Binary Tree under some conditions starting from Java 8) of items residing in that bucket.
The items are scanned one by one, using equals for comparison.
When adding more items, the HashMap is resized (doubling the size) once a certain load percentage is reached.
So, sometimes it will have to compare against a few items, but generally, it's much closer to O(1) than O(n) / O(log n).
For practical purposes, that's all you should need to know.
Remember that O(1) does not mean that each lookup only examines a single item - it means that the average number of items checked remains constant w.r.t. the number of items in the container. So if it takes on average 4 comparisons to find an item in a container with 100 items, it should also take an average of 4 comparisons to find an item in a container with 10,000 items, and for any other number of items (there's always a bit of variance, especially around the points at which the hash table rehashes, and when there's a very small number of items).
So collisions don't prevent the container from having O(1) operations, as long as the average number of keys per bucket remains within a fixed bound.
I know this is an old question, but there's actually a new answer to it.
You're right that a hash map isn't really O(1), strictly speaking, because as the number of elements gets arbitrarily large, eventually you will not be able to search in constant time (and O-notation is defined in terms of numbers that can get arbitrarily large).
But it doesn't follow that the real time complexity is O(n)--because there's no rule that says that the buckets have to be implemented as a linear list.
In fact, Java 8 converts a bucket into a balanced (red-black) tree once it exceeds a threshold, which makes the actual worst-case time for that bucket O(log n).
Lookup is O(1 + n/k), where k is the number of buckets.
If the implementation keeps k = n/alpha (i.e. a bounded load factor alpha), then it is O(1 + alpha) = O(1), since alpha is a constant.
If the number of buckets (call it b) is held constant (the usual case), then lookup is actually O(n).
As n gets large, the number of elements in each bucket averages n/b. If collision resolution is done in one of the usual ways (linked list for example), then lookup is O(n/b) = O(n).
The O notation is about what happens when n gets larger and larger. It can be misleading when applied to certain algorithms, and hash tables are a case in point. We choose the number of buckets based on how many elements we're expecting to deal with. When n is about the same size as b, then lookup is roughly constant-time, but we can't call it O(1) because O is defined in terms of a limit as n → ∞.
Elements inside the HashMap are stored as an array of linked lists (nodes); each linked list in the array represents a bucket for the unique hash value of one or more keys.
While adding an entry in the HashMap, the hashcode of the key is used to determine the location of the bucket in the array, something like:
location = (arraylength - 1) & keyhashcode
Here the & represents bitwise AND operator.
For example: 100 & "ABC".hashCode() = 64 (location of the bucket for the key "ABC")
During the get operation it uses the same method to determine the bucket location for the key. In the best case each key has a unique hashcode and lands in its own bucket; in that case the get method spends time only on determining the bucket location and retrieving the value, which is constant, O(1).
In the worst case, all the keys have the same hashcode and are stored in the same bucket; this results in traversing the entire list, which leads to O(n).
In the case of Java 8, the linked-list bucket is replaced with a balanced tree if its size grows beyond 8, which improves the worst-case search to O(log n).
We've established that the standard description of hash table lookups being O(1) refers to the average-case expected time, not the strict worst-case performance. For a hash table resolving collisions with chaining (like Java's hashmap) this is technically O(1+α) with a good hash function, where α is the table's load factor. Still constant as long as the number of objects you're storing is no more than a constant factor larger than the table size.
It's also been explained that strictly speaking it's possible to construct input that requires O(n) lookups for any deterministic hash function. But it's also interesting to consider the worst-case expected time, which is different than average search time. Using chaining this is O(1 + the length of the longest chain), for example Θ(log n / log log n) when α=1.
If you're interested in theoretical ways to achieve constant time expected worst-case lookups, you can read about dynamic perfect hashing which resolves collisions recursively with another hash table!
It is O(1) only if your hashing function is very good. The Java hash table implementation does not protect against bad hash functions.
Whether you need to grow the table when you add items or not is not relevant to the question because it is about lookup time.
This basically goes for most hash table implementations in most programming languages, as the algorithm itself doesn't really change.
If there are no collisions present in the table, you only have to do a single look-up, therefore the running time is O(1). If there are collisions present, you have to do more than one look-up, which drives down the performance towards O(n).
It depends on the algorithm you choose to avoid collisions. If your implementation uses separate chaining then the worst-case scenario happens when every data element is hashed to the same value (a poor choice of hash function, for example). In that case, data lookup is no different from a linear search on a linked list, i.e. O(n). However, the probability of that happening is negligible, and the lookup's best and average cases remain constant, i.e. O(1).
Strictly speaking, O(1) holds only in the theoretical case where hashcodes are always different and every hash code gets its own bucket. In practice, though, the lookup stays of constant order: as the HashMap grows, the number of steps per search remains roughly the same.
Academics aside, from a practical perspective, HashMaps should be accepted as having an inconsequential performance impact (unless your profiler tells you otherwise.)
Of course the performance of the hashmap will depend on the quality of the hashCode() function for the given object. However, if the function is implemented such that the possibility of collisions is very low, it will have very good performance (this is not strictly O(1) in every possible case, but it is in most cases).
For example, the default implementation in the Oracle JRE is to use a random number (which is stored in the object header so that it doesn't change - but it also disables biased locking, though that's another discussion), so the chance of collisions is very low.
