How does Java implement hash tables?


Does anyone know how Java implements its hash tables (HashSet or HashMap)? Given the various types of objects that one may want to put in a hash table, it seems very difficult to come up with a hash function that would work well for all cases.

HashMap and HashSet are very similar. In fact, the second contains an instance of the first.
A HashMap contains an array of buckets to hold its entries. The array size is always a power of 2. If you don't specify another value, there are initially 16 buckets.
When you put an entry (key and value) in it, it decides which bucket the entry goes into by calculating an index from the key's hash code (the hash code is not necessarily a memory address, and the index is not computed with a modulus). Different entries can collide in the same bucket, in which case they are put in a list.
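As a rough illustration of how a bucket index can be derived without a modulus, here is a minimal sketch. It assumes the bit-spreading step used by OpenJDK 8's HashMap (other versions mix the bits differently), and the class and method names are made up for this example.

```java
final class BucketIndexSketch {
    // Mixes the high 16 bits into the low 16 bits.
    // Assumption: this mirrors the OpenJDK 8 approach; older versions differ.
    static int spread(int hashCode) {
        return hashCode ^ (hashCode >>> 16);
    }

    // Masks the hash down to a bucket index.
    // Only valid when tableLength is a power of 2.
    static int indexFor(int hash, int tableLength) {
        return hash & (tableLength - 1);
    }

    public static void main(String[] args) {
        String key = "today";
        int index = indexFor(spread(key.hashCode()), 16);
        System.out.println(key + " -> bucket " + index);
    }
}
```

Because the table length is a power of 2, the bit mask always yields a value between 0 and length - 1, even for negative hash codes.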
Entries are inserted until their number reaches the threshold set by the load factor. This factor is 0.75 by default, and changing it is not recommended unless you are very sure of what you're doing. A load factor of 0.75 means that a HashMap of 16 buckets can only contain 12 entries (16*0.75). After that, a new array of buckets is created, double the size of the previous one, and all entries are redistributed into the new array. This process is known as rehashing, and can be expensive.
Therefore, a best practice, if you know how many entries will be inserted, is to construct a HashMap with an initial capacity large enough that no rehash is needed. Note that the constructor argument is the initial capacity (number of buckets), not the number of entries, so it should be at least the expected number of entries divided by the load factor:
new HashMap(initialCapacity);
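A small sketch of the arithmetic, assuming the default load factor of 0.75. The helper name capacityFor is hypothetical, not part of the JDK.

```java
import java.util.HashMap;
import java.util.Map;

final class PresizeSketch {
    // Smallest capacity whose threshold (capacity * 0.75) is at least
    // expectedEntries. HashMap rounds this up to a power of 2 internally.
    static int capacityFor(int expectedEntries) {
        return (int) Math.ceil(expectedEntries / 0.75);
    }

    public static void main(String[] args) {
        int expected = 1000;
        Map<String, Integer> map = new HashMap<>(capacityFor(expected));
        for (int i = 0; i < expected; i++) {
            map.put("key" + i, i); // no resize should be needed along the way
        }
        System.out.println(map.size());
    }
}
```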

You can check the source of HashMap, for example.

Java depends on each class's implementation of the hashCode() method to distribute the objects evenly. Obviously, a bad hashCode() method will result in performance problems for large hash tables. If a class does not provide a hashCode() method, the default in the current implementation is to return some function (i.e. a hash) of the object's address in memory. Quoting from the API doc:
As much as is reasonably practical, the hashCode method defined by class Object does return distinct integers for distinct objects. (This is typically implemented by converting the internal address of the object into an integer, but this implementation technique is not required by the Java™ programming language.)

There are two general ways to implement a HashMap. The difference is how one deals with collisions.
The first method, which is the one Java uses, makes every bucket in the HashMap contain a singly linked list. To accomplish this, each bucket contains an Entry type, which caches the hashCode and has a pointer to the key, a pointer to the value, and a pointer to the next entry. When a collision occurs in Java, another entry is added to the list.
The other method for handling collisions is to simply put the item into the next empty bucket. The advantage of this method is that it requires less space. However, it complicates removals: if the bucket following the removed item is not empty, one has to check whether that item is in the right or wrong bucket, and shift it if it originally collided with the item being removed.
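To make the second approach concrete, here is a minimal sketch of "put the item into the next empty bucket" (linear probing). This is not how java.util.HashMap works. The class is hypothetical, has a fixed capacity, and leaves out removal, which is the complicated part described above.

```java
final class LinearProbingSketch {
    private final String[] keys;
    private final int[] values;

    LinearProbingSketch(int capacity) {
        keys = new String[capacity];
        values = new int[capacity];
    }

    // Non-negative starting slot for a key.
    private int start(String key) {
        return (key.hashCode() & 0x7fffffff) % keys.length;
    }

    void put(String key, int value) {
        int i = start(key);
        for (int probes = 0; probes < keys.length; probes++) {
            // Take the first free slot, or overwrite the same key.
            if (keys[i] == null || keys[i].equals(key)) {
                keys[i] = key;
                values[i] = value;
                return;
            }
            i = (i + 1) % keys.length; // collision: try the next bucket
        }
        throw new IllegalStateException("table is full");
    }

    Integer get(String key) {
        int i = start(key);
        for (int probes = 0; probes < keys.length; probes++) {
            if (keys[i] == null) {
                return null; // reached an empty slot: key is absent
            }
            if (keys[i].equals(key)) {
                return values[i];
            }
            i = (i + 1) % keys.length;
        }
        return null;
    }
}
```

The strings "Aa" and "BB" have the same hashCode(), so inserting both forces the second one into the neighbouring slot.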

In my own words:
An Entry object is created to hold the reference of the Key and Value.
The HashMap has an array of Entry objects.
The index for a given entry is derived from the hash returned by key.hashCode().
If there is a collision (two keys gave the same index), the entry is stored in the .next attribute of the existing entry.
That's how two objects with the same hash could be stored into the collection.
From this answer we get:
public V get(Object key) {
    if (key == null)
        return getForNullKey();
    int hash = hash(key.hashCode());
    for (Entry<K,V> e = table[indexFor(hash, table.length)];
         e != null;
         e = e.next) {
        Object k;
        if (e.hash == hash && ((k = e.key) == key || key.equals(k)))
            return e.value;
    }
    return null;
}
Let me know if I got something wrong.
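The chaining described in this answer can be sketched as a tiny map. This is a simplified illustration under stated assumptions, not the JDK source: the class name is hypothetical, the table has a fixed size of 16, it never resizes, and it does not support null keys.

```java
final class ChainedMapSketch<K, V> {
    // One node of a bucket's singly linked list.
    private static final class Entry<K, V> {
        final int hash;
        final K key;
        V value;
        Entry<K, V> next;

        Entry(int hash, K key, V value, Entry<K, V> next) {
            this.hash = hash;
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    @SuppressWarnings("unchecked")
    private final Entry<K, V>[] table = (Entry<K, V>[]) new Entry[16];

    private static int indexFor(int hash, int length) {
        return hash & (length - 1);
    }

    // Returns the previous value for the key, or null if there was none.
    V put(K key, V value) {
        int hash = key.hashCode();
        int i = indexFor(hash, table.length);
        for (Entry<K, V> e = table[i]; e != null; e = e.next) {
            if (e.hash == hash && key.equals(e.key)) {
                V old = e.value;
                e.value = value; // same key: overwrite
                return old;
            }
        }
        // New key: link it in front of the existing chain (if any).
        table[i] = new Entry<>(hash, key, value, table[i]);
        return null;
    }

    V get(K key) {
        int hash = key.hashCode();
        for (Entry<K, V> e = table[indexFor(hash, table.length)]; e != null; e = e.next) {
            if (e.hash == hash && key.equals(e.key)) {
                return e.value;
            }
        }
        return null;
    }
}
```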

Related

Why does HashMap.Entry class have a field for hash?

When we are adding an entry or retrieving an entry to the HashMap, the equals() on the key would have been sufficient to find it in the particular index of the bucket. Why is hash also stored and checked?

You are correct that there is no need to recalculate the hash values for keys (in entries) when doing operations like lookup and removal. They are stored for performance related reasons.
Calculating the hash value for a key can be expensive.
When the ratio of entries to the size of a main array exceeds a certain value (the "load factor") the HashMap implementation expands the array. When this happens, all existing entries need to be redistributed to new hash chains ... based on the entry keys' hash values. The hash values are stored in the Entry objects so that they don't have to be recalculated each time the entries need to be redistributed.
Having stored the hashcode values, they could also be used to accelerate lookup ...
// Version #1
if (node.key.equals(keyToTest)) {
    ...
}
// Version #2
if (node.hashValue == hashValueToTest && node.key.equals(keyToTest)) {
    ...
}
If the key.equals method is expensive, then you can save some time (on average) by avoiding calling it when the hash values don't match. (But when they do match, the call must be made anyway!)
So, really, there are TWO reasons why the hash values are stored.
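The lookup saving can be made visible by counting equals() calls. The following is a sketch with hypothetical names (Key, Node, equalsCalls); a plain array stands in for one hash chain.

```java
final class CachedHashSketch {
    static int equalsCalls = 0;

    static final class Key {
        final String name;

        Key(String name) {
            this.name = name;
        }

        @Override
        public boolean equals(Object o) {
            equalsCalls++; // count every call so the two versions can be compared
            return o instanceof Key && ((Key) o).name.equals(name);
        }

        @Override
        public int hashCode() {
            return name.hashCode();
        }
    }

    // A chain node that caches its key's hash code.
    static final class Node {
        final int hash;
        final Key key;

        Node(Key key) {
            this.key = key;
            this.hash = key.hashCode();
        }
    }

    // Version #1: always calls equals().
    static boolean containsWithoutHash(Node[] chain, Key probe) {
        for (Node n : chain) {
            if (n.key.equals(probe)) return true;
        }
        return false;
    }

    // Version #2: calls equals() only when the cached hash matches.
    static boolean containsWithHash(Node[] chain, Key probe) {
        int h = probe.hashCode();
        for (Node n : chain) {
            if (n.hash == h && n.key.equals(probe)) return true;
        }
        return false;
    }
}
```

With a chain of four keys whose hash codes all differ, looking up the last one costs four equals() calls in version #1 and one in version #2.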

HashMap is an implementation of Map which maintains a table of entries, with references to the associated keys and values, organized according to their hash code. This implementation provides constant-time performance for the basic operations (get and put), assuming the hash function disperses the elements properly among the buckets.
Please see:
Hashmap and how this works behind the scene
While it is correct that equals() would be sufficient to (ultimately) find the key, this would not provide constant-time performance.
There are other implementations of Map which do not use hash codes.

LinkedHashMap Implementation in Java

I cannot understand the use of HashFunction in LinkedHashMap.
In the HashMap implementation, the hash function is used to find the index into the internal array, which can be justified by the hash function contract (the same key must always have the same hash code, but distinct keys can have the same hash code).
My questions are:
1) What is the use of the hash function in LinkedHashMap?
2) How do the put and get methods work for LinkedHashMap?
3) Why does it maintain the doubly linked list internally?
What's wrong with using a HashMap as the internal implementation (just like HashSet) and maintaining a separate Array/List of indexes of the Entry array in the sequence of insertion?
Appreciate useful response and references.

1) LinkedHashMap extends HashMap, so the hash function is the same as HashMap's (if you check the code, the hash function is inherited from HashMap). That is, the function computes the hash of the key being inserted, and that hash is used to store the entry in a bucket alongside other entries whose keys hash to the same bucket; the hash function is used again in the get method to retrieve the entry for the key passed as a parameter.
2) The put and get methods behave the same way as in HashMap, and additionally track the insertion order of the elements, so when you iterate over the key set you get the keys in the order you inserted them into the map (see here for more details).
3) LinkedHashMap uses a doubly linked list instead of an array because a doubly linked list is an efficient data structure for a list where you both insert and remove items; if you mostly insert/append elements then an array-based implementation may be better. Since a map has key-value semantics and removing elements from the map could be a frequent operation, a doubly linked list is a better fit. The internal implementation could have been built on a LinkedList, but my opinion is that using a low-level data structure is more efficient and decouples LinkedHashMap from other classes.

A LinkedHashMap does use a HashMap (in fact it extends from it), so the hashCode is used to identify the right hash bucket in the array of hash buckets, just as for HashMap. put and get work just as for HashMap (except that the before and after references for iterating over the entries are updated differently for the two implementations).
The reason insertion order is not kept by keeping an Array or ArrayList is that addition or removal in the middle of an ArrayList is an O(n) operation because you have to move all subsequent items along one place. You could do this with a LinkedList because addition and removal in the middle of a LinkedList is O(1) (all you have to do is break a few links and make a few new ones). However there's no point using a separate LinkedList because you may as well make the Map.Entry objects reference the previous and next Entry objects, which is exactly how LinkedHashMap works.

LinkedHashMap is a good choice for a data structure where you want to be able to put and get entries with O(1) running time, but you also need the behavior of a LinkedList. The internal hashing function is what allows you to put and get entries in constant time.
Here is how you use LinkedHashMap:
Map<String, String> linkedHashMap = new LinkedHashMap<String, String>();
linkedHashMap.put("today", "Wednesday");
linkedHashMap.put("tomorrow", "Thursday");
String today = linkedHashMap.get("today"); // today is 'Wednesday'
There are several arguments against using a simple HashMap and maintaining a separate List for the insertion order. First, if you go this route it means you will have to maintain 2 data structures instead of one. This is error prone, and makes maintaining your code more difficult. Second, if you have to make your data structure Thread-safe, this would be complex for 2 data structures. On the other hand, if you use a LinkedHashMap you only would have to worry about making this single data structure thread-safe.
As for implementation details, when you do a put into a LinkedHashMap, the map takes your key's hashCode() and applies a (non-cryptographic) hash function to turn it into an index into the internal bucket array, where the entry holding your value is stored. Doing a get with a given key uses the same function to find the bucket, and equals() to find the matching entry in it. The entrySet() method returns a Set view of all the key-value pairs in the LinkedHashMap. Sets in general are not ordered, but iterating this view follows the map's insertion order. Like the map itself, the entrySet() view is not thread-safe.

Ans. 2)
When we call put(key, value) on a LinkedHashMap, internally it ends up calling createEntry:
void createEntry(int hash, K key, V value, int bucketIndex) {
    HashMap.Entry<K,V> old = table[bucketIndex];
    Entry<K,V> e = new Entry<K,V>(hash, key, value, old);
    table[bucketIndex] = e;
    e.addBefore(header);
    size++;
}
Ans 3)
To efficiently maintain a linkedHashmap, you actually need a doubly linked list.
Consider three entries in order
A ---> B ---> C
Suppose you want to remove B. Obviously A should now point to C. But unless you know the entry before B you cannot efficiently say which entry should now point to C. To fix this, you need entries to point in both the directions Like this
   --->   --->
A       B       C
   <---   <---
This way, when you remove B you can look at the entries before and after B (A and C) and update so that A and C point to each other.
similar post in this link discussed earlier
why linkedhashmap maintains doubly linked list for iteration
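The A, B, C example can be sketched in code. This is an illustrative, hypothetical class, not the JDK implementation, though the before/after field names and the sentinel header follow the snippets quoted in this thread.

```java
import java.util.ArrayList;
import java.util.List;

final class DoublyLinkedSketch {
    static final class Node {
        final String key;
        Node before, after;

        Node(String key) {
            this.key = key;
        }
    }

    // Sentinel node: header.after is the oldest entry, header.before the newest.
    private final Node header = new Node(null);

    DoublyLinkedSketch() {
        header.before = header;
        header.after = header;
    }

    // Links a new node at the end of the list, just before the header.
    Node append(String key) {
        Node n = new Node(key);
        n.after = header;
        n.before = header.before;
        header.before.after = n;
        header.before = n;
        return n;
    }

    // O(1): both neighbours are reachable from the node itself.
    void remove(Node n) {
        n.before.after = n.after;
        n.after.before = n.before;
    }

    List<String> keys() {
        List<String> out = new ArrayList<>();
        for (Node n = header.after; n != header; n = n.after) {
            out.add(n.key);
        }
        return out;
    }
}
```

With only a forward pointer, remove(B) would first have to walk the list from the start to find A.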

HashMap has containsValue method, but not getValue method

I am curious that in the Java collections library, HashMap has a method that searches for the existence of a particular object value called containsValue(Object value) returning a boolean, but no method exists to get the value object by value object directly like you do by providing a key via the get(Object key) method. Now, I know that the purpose of HashMap is to access object values via the keys, but in exceptional cases you may want to retrieve via the object value, so why is there not a getValue(Object value) method? I ask this because the algorithm that the method containsValue() implements to search for the object value is faster than my custom search (see below). Also, is there a better way to accomplish this search using HashMap in Java 7?
Code Snippet:
// Custom Search
MyCustomer findCust = new MyCustomer(50000, "Joe Bloggs", "London");
for (MyCustomer value : hashMap.values()) {
if (value.equals(findCust)) { // found
cust = value;
break;
}
}

The basic assumption of the collections framework is that if two objects are .equals, they are interchangeable in every way. Given that assumption, there's no reason to get out the value from a Map, because you already have one that is equals and interchangeable. As far as the Collections Framework is concerned, these two methods are fully equivalent:
for (V value : map.values()) {
if (value.equals(myValue)) {
return value;
}
}
and
if (map.containsValue(myValue)) {
return myValue;
}
This assumption is built into the Collections Framework in many places, and this is one of many examples.

hashMap.values().contains(findCust)
You will need equals and hashCode on Customer based on your "business rules" (for example, are two customers with the same "id" but with different other values "equal"????... Obviously you are already doing that because you are using equals...)

HashMap is designed to aid constant lookups using hashcode() and equals() of the key you use to put some value into map.
If you look at the internal structure of HashMap, it's nothing but an array. Each index is called a bucket which can be obtained by normalizing current array's length and the hashcode of the key you pass. Once you find the bucket, it will store the element at that particular index. But if there's already some element stored in that index, they will form a LinkedList of these elements chaining all the values having same hashcode() but different equals() criteria.
In Java 8, this linked list is converted to a balanced tree (a red-black tree of tree nodes inside the bucket, not a TreeMap) once the number of elements in that bucket reaches a threshold (8), to improve performance.
Coming to your question, containsValue() basically iterates over all the buckets in the array and again through all the elements in the linked list of each bucket
V v;
// iterate through buckets
for (int i = 0; i < table.length; ++i) {
    // iterate through each element in linked list at each bucket
    for (Node<K,V> e = table[i]; e != null; e = e.next) {
        if ((v = e.value) == value ||
            (value != null && value.equals(v)))
            return true;
    }
}
HashMap.values() returns a Collection with the iterator implemented to traverse each element in HashMap providing access to Value object in each iteration.
containsValue() is used when you want to do something if some value is already there in the map but you don't need that value to proceed with your flow. This is merely a convenience method: if you use values(), you create a Collection object and an iterator object to iterate over them, whereas containsValue() just runs two nested for loops. I think the reason for not having a getValue() is to encourage the purpose HashMap is intended for: near constant time lookups using hashcode & equals of some key.
values() is used when you basically need to iterate over all the values. This is different from calling map.get(key) in a loop because you don't have to normalize the hashcode, find the bucket, then find the element in the linked list in each iteration, you just loop in the natural order, the way the elements are laid out in the array.
If you're doing this value lookup way too many times, you lose the advantage of constant lookups offered by HashMap. If you're only going to skim through the values searching for some value, I suggest you use an ArrayList. If there are too many elements in that list, and you need to search for some random value quite often, sort the list and use Binary Search.
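If you occasionally do need the stored instance rather than a yes/no answer, the linear scan can be wrapped in a small helper. This is a sketch; findStoredValue is a hypothetical name, and the method is O(n) just like containsValue().

```java
import java.util.Map;

final class ValueLookupSketch {
    // Returns the instance stored in the map that equals the probe,
    // or null if there is none. Scans every value: O(n).
    static <V> V findStoredValue(Map<?, V> map, Object probe) {
        for (V value : map.values()) {
            if (value != null && value.equals(probe)) {
                return value;
            }
        }
        return null;
    }
}
```

The returned object is the one held by the map, which only matters if your code treats equal objects as distinguishable.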

How is the internal implementation of LinkedHashMap different from HashMap implementation?

I read that HashMap has the following implementation:
main array
↓
[Entry] → Entry → Entry ← linked-list implementation
[Entry]
[Entry] → Entry
[Entry]
[null ]
So, it has an array of Entry objects.
Questions:
I was wondering how can an index of this array store multiple Entry objects in case of same hashCode but different objects.
How is this different from LinkedHashMap implementation? Its doubly linked list implementation of map but does it maintain an array like the above and how does it store pointers to the next and previous element?

HashMap does not maintain insertion order, hence it does not maintain any doubly linked list.
Most salient feature of LinkedHashMap is that it maintains insertion order of key-value pairs. LinkedHashMap uses doubly Linked List for doing so.
Entry of LinkedHashMap looks like this-
static class Entry<K, V> {
K key;
V value;
Entry<K,V> next;
Entry<K,V> before, after; //For maintaining insertion order
public Entry(K key, V value, Entry<K,V> next){
this.key = key;
this.value = value;
this.next = next;
}
}
By using before and after - we keep track of newly added entry in LinkedHashMap, which helps us in maintaining insertion order.
Before refers to previous entry and
after refers to next entry in LinkedHashMap.
For diagrams and step by step explanation please refer http://www.javamadesoeasy.com/2015/02/linkedhashmap-custom-implementation.html
Thanks..!!

So, it has an array of Entry objects.
Not exactly. It has an array of Entry object chains. A HashMap.Entry object has a next field allowing the Entry objects to be chained as a linked list.
I was wondering how can an index of this array store multiple Entry objects in case of same hashCode but different objects.
Because (as the picture in your question shows) the Entry objects are chained.
How is this different from LinkedHashMap implementation? Its doubly linked list implementation of map but does it maintain an array like the above and how does it store pointers to the next and previous element?
In the LinkedHashMap implementation, the LinkedHashMap.Entry class extends the HashMap.Entry class, by adding before and after fields. These fields are used to assemble the LinkedHashMap.Entry objects into an independent doubly-linked list that records the insertion order. So, in the LinkedHashMap class, each entry object is in two distinct chains:
There are a number of singly linked hash chains that are accessed via the main hash array. These are used for (regular) hashmap lookups.
There is a separate doubly linked list that contains all of the entry objects. It is kept in entry insertion order, and is used when you iterate the entries, keys or values in the hashmap.

Take a look for yourself. For future reference, you can just google:
java LinkedHashMap source
HashMap uses a linked list to handle collisions, but the difference between HashMap and LinkedHashMap is that LinkedHashMap has a predictable iteration order, which is achieved through an additional doubly-linked list, which usually maintains the insertion order of the keys. When a key is reinserted, it keeps its original position in the list.
For reference, iterating through a LinkedHashMap is more efficient than iterating through a HashMap, but LinkedHashMap is less memory efficient.
In case it wasn't clear from my above explanation, the hashing process is the same, so you get the benefits of a normal hash, but you also get the iteration benefits as stated above, since you're using a doubly linked list to maintain the ordering of your Entry objects, which is independent of the linked-list used during hashing for collisions, in case that was ambiguous..
EDIT: (in response to OP's comment):
A HashMap is backed by an array, in which some slots contain chains of Entry objects to handle the collisions. To iterate through all of the (key,value) pairs, you would need to go through all of the slots in the array and then go through the LinkedLists; hence, your overall time would be proportional to the capacity.
When using a LinkedHashMap, all you need to do is traverse through the doubly-linked list, so the overall time is proportional to the size.

Since none of the other answers actually explain how something like this could be implemented I'll give it a shot.
One way would be to keep some extra information alongside the value (of the key->value pair), not visible to the user, holding a reference to the previous and next elements inserted into the hash map. The benefit is that you can still delete elements in constant time: removing from a hash map is constant time, and removing from a linked list is constant time in this case because you already have a reference to the entry. You can still insert in constant time: hash map insertion is constant, and although linked list insertion normally isn't, here you have constant-time access to the spot in the linked list. Lastly, retrieval is constant time because it only involves the hash map part of the structure.
Keep in mind that a data structure like this does not come without costs. The size of the hash map will rise significantly because of all the extra references. Each of the main methods will be slightly slower (could matter if they are called repeatedly). And the indirection of the data structure (not sure if that's a real term :P) is increased, though this might not be as big a deal because the references are guaranteed to be pointing to stuff inside the hash map.
Since the only advantage of this type of structure is that it preserves order be careful when you use it. Also when reading the answer keep in mind I don't know that this is the way it's implemented but it is how I would do it if given the task.
On the oracle docs there is a quote confirming some of my guesses.
This implementation differs from HashMap in that it maintains a doubly-linked list running through all of its entries.
Another relevant quote from the same website.
This class provides all of the optional Map operations, and permits null elements. Like HashMap, it provides constant-time performance for the basic operations (add, contains and remove), assuming the hash function disperses elements properly among the buckets. Performance is likely to be just slightly below that of HashMap, due to the added expense of maintaining the linked list, with one exception: Iteration over the collection-views of a LinkedHashMap requires time proportional to the size of the map, regardless of its capacity. Iteration over a HashMap is likely to be more expensive, requiring time proportional to its capacity.

The hashCode is mapped to a bucket by the hash function. If there is a collision in hashCode, then HashMap resolves this collision by chaining, i.e. it adds the entry to the bucket's linked list. Below is the code which does this:
for (Entry<K,V> e = table[i]; e != null; e = e.next) {
    Object k;
    if (e.hash == hash && ((k = e.key) == key || key.equals(k))) {
        V oldValue = e.value;
        e.value = value;
        e.recordAccess(this);
        return oldValue;
    }
}
You can clearly see that it traverses the linked list, and if it finds the key then it replaces the old value with the new one; otherwise a new entry is added to the linked list.
But the difference between LinkedHashMap and HashMap is LinkedHashMap maintains the insertion order. From docs:
This linked list defines the iteration ordering, which is normally the order in which keys were inserted into the map (insertion-order). Note that insertion order is not affected if a key is re-inserted into the map. (A key k is reinserted into a map m if m.put(k, v) is invoked when m.containsKey(k) would return true immediately prior to the invocation).
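The re-insertion rule from the quoted docs is easy to check with the real java.util.LinkedHashMap API. Only the wrapper class and method name below are made up for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class InsertionOrderSketch {
    // Inserts two keys, re-inserts the first, and returns the iteration order.
    static List<String> keysAfterReinsert() {
        Map<String, String> map = new LinkedHashMap<>();
        map.put("today", "Wednesday");
        map.put("tomorrow", "Thursday");
        map.put("today", "Friday"); // re-insert: value changes, position does not
        return new ArrayList<>(map.keySet());
    }

    public static void main(String[] args) {
        System.out.println(keysAfterReinsert()); // prints [today, tomorrow]
    }
}
```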

How does hashing in Java work?

I am trying to figure something out about hashing in java.
If i want to store some data in a hashmap for example, will it have some kind of underlying hashtable with the hashvalues?
Or if someone could give a good and simple explanation of how hashing work, I would really appreciate it.

HashMap is basically implemented internally as an array of entries (Entry[]). If you understand what a linked list is, this Entry type is nothing but a linked-list node. It stores both the key and the value.
To insert an element into the array, you need an index. How do you calculate the index? This is where the hashing function (hashFunction) comes into the picture. You pass an integer to this hash function. To get this integer, Java calls the hashCode method of the object that is being added as a key in the map. This step is called prehashing.
Once the index is known, you place the element at this index. This slot is called a BUCKET, so if an element is inserted at Entry[0], you say that it falls under bucket 0.
Now assume that the hashFunction returns the same index, say 0, for another object that you want to insert as a key in the map. This is where the equals method is called. If equals returns true, the key is already present and its value is simply replaced. If equals returns false, the keys are different and you have a hash collision. In that case, since Entry is a linked-list node, you add one more node (Entry) to the list that already starts at this index. Bottom line: on a hash collision, there is more than one element at a particular index, held in a linked list.
The same applies when you get a value from the map. Based on the index returned by hashFunction, the equals method is called on each entry in that bucket's linked list until the matching key is found.
Hope this helps with the internals of how it works :)

Hash values in Java are provided by objects through the implementation of public int hashCode() which is declared in Object class and it is implemented for all the basic data types. Once you implement that method in your custom data object then you don't need to worry about how these are used in miscellaneous data structures provided by Java.
A note: implementing that method requires also to have public boolean equals(Object o) implemented in a consistent manner.

If i want to store some data in a hashmap for example, will it have some kind of underlying hashtable with the hashvalues?
A HashMap is a form of hash table (and Hashtable is another). They work by using the hashCode() and equals(Object) methods provided by the HashMap's key type. Depending on how you want your keys to behave, you can use the hashCode / equals methods implemented by java.lang.Object ... or you can override them.
Or if someone could give a good and simple explanation of how hashing work, I would really appreciate it.
I suggest you read the Wikipedia page on Hash Tables to understand how they work. (FWIW, the HashMap and Hashtable classes use "separate chaining with linked lists", and some other tweaks to optimize average performance.)
A hash function works by turning an object (i.e. a "key") into an integer. How it does this is up to the implementor. But a common approach is to combine hashcodes of the object's fields something like this:
hashcode = (..((field1.hashcode * prime) + field2.hashcode) * prime + ...)
where prime is a smallish prime number like 31. The key is that you get a good spread of hashcode values for different keys. What you DON'T want is lots of keys all hashing to the same value. That causes "collisions" and is bad for performance.
When you implement the hashcode and equals methods, you need to do it in a way that satisfies the following constraints for the hash table to work correctly:
1. o1.equals(o2) => o1.hashcode() == o2.hashcode()
2. o1.equals(o2) == o2.equals(o1)
3. The hashcode of an object doesn't change while it is a key in a hash table.
It is also worth noting that the default hashCode and equals methods provided by Object are based on the target object's identity.
"But where is the hash values stored then? It is not a part of the HashMap, so is there an array assosiated to the HashMap?"
The hash values are typically not stored. Rather they are calculated as required.
In the case of the HashMap class, the hashcode for each key is actually cached in the entry's Node.hash field. But that is a performance optimization ... to make hash chain searching faster, and to avoid recalculating hashes if / when the hash table is resized. But if you want this level of understanding, you really need to read the source code rather than asking Questions.
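As a concrete instance of the "multiply by a small prime" pattern and the constraints listed above, here is a minimal sketch of a key class. PointKeySketch is a hypothetical example type; its fields are final so the hash code cannot change while the object is a key.

```java
final class PointKeySketch {
    private final int x;
    private final int y;

    PointKeySketch(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof PointKeySketch)) return false;
        PointKeySketch other = (PointKeySketch) o;
        return x == other.x && y == other.y;
    }

    // Combines the fields with the prime 31, using exactly the fields
    // that equals() compares, so equal objects get equal hash codes.
    @Override
    public int hashCode() {
        return 31 * x + y;
    }
}
```

Two separately constructed instances with the same coordinates are then interchangeable as HashMap keys.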

This is the most fundamental contract in Java: the .equals()/.hashCode() contract.
The most important part of it here is that two objects which are considered .equals() should return the same .hashCode().
The reverse is not true: objects not considered equal may return the same hash code. But it should be as rare an occurrence as possible. Consider the following .hashCode() implementation, which, while perfectly legal, is as broken an implementation as can exist:
@Override
public int hashCode() { return 42; } // legal!!
While this implementation obeys the contract, it is pretty much useless... Hence the importance of a good hash function to begin with.
Now: the Set contract stipulates that a Set should not contain duplicate elements; however, the strategy of a Set implementation is left... Well, to the implementation. You will notice, if you look at the javadoc of Map, that its keys can be retrieved by a method called .keySet(). Therefore, Map and Set are very closely related in this regard.
If we take the case of a HashSet (and, ultimately, HashMap), it relies on .equals() and .hashCode(): when adding an item, it first calculates this item's hash code, and according to this hash code, attempts to insert the item into a given bucket. In contrast, a TreeSet (and TreeMap) relies on the natural ordering of elements (see Comparable).
However, if an object is to be inserted and the hash code of this object would trigger its insertion into a non empty hash bucket (see the legal, but broken, .hashCode() implementation above), then .equals() is used to determine whether that object is really unique.
Note that, internally, a HashSet is a HashMap...
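To see that a constant hashCode() is legal but only costs performance, here is a small sketch. Item and distinctCount are hypothetical names. Every element lands in the same bucket, and equals() alone decides uniqueness.

```java
import java.util.HashSet;
import java.util.Set;

final class ConstantHashSketch {
    static final class Item {
        final String name;

        Item(String name) {
            this.name = name;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof Item && name.equals(((Item) o).name);
        }

        @Override
        public int hashCode() {
            return 42; // legal, but every Item collides with every other
        }
    }

    // Adds one Item per name and reports how many distinct items remain.
    static int distinctCount(String... names) {
        Set<Item> set = new HashSet<>();
        for (String name : names) {
            set.add(new Item(name));
        }
        return set.size();
    }
}
```

The set still rejects the duplicate correctly, but each add has to compare against the items already in the shared bucket.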

Hashing is a way to assign a code to any variable/object by applying a function/algorithm to its properties. The code is not guaranteed to be unique: two different objects may get the same one.

HashMap stores key-value pair in Map.Entry static nested class implementation.
HashMap works on hashing algorithm and uses hashCode() and equals() method in put and get methods.
When we call the put method by passing a key-value pair, HashMap uses the Key's hashCode() with hashing to find out
the index to store the key-value pair. The Entry is stored in a linked list, so if there is already
an existing entry, it uses the equals() method to check if the passed key already exists; if yes, it overwrites
the value, else it creates a new entry and stores this key-value Entry.
When we call the get method by passing a Key, again it uses hashCode() to find the index
in the array and then uses the equals() method to find the correct Entry and return its value.
The other important things to know about HashMap are capacity, load factor, threshold resizing.
HashMap initial default capacity is 16 and load factor is 0.75. Threshold is capacity multiplied
by load factor and whenever we try to add an entry, if map size is greater than threshold,
HashMap rehashes the contents of map into a new array with a larger capacity.
The capacity is always power of 2, so if you know that you need to store a large number of key-value pairs,
for example in caching data from database, it’s good idea to initialize the HashMap with correct capacity
and load factor.
