How does hashing in Java work? - java

I am trying to figure something out about hashing in Java.
If I want to store some data in a HashMap, for example, will it have some kind of underlying hash table with the hash values?
Or if someone could give a good and simple explanation of how hashing works, I would really appreciate it.

HashMap is basically implemented internally as an array of Entry objects (Entry[]). If you understand what a linked list is, this Entry type is nothing but a linked-list implementation. This type actually stores both key and value.
To insert an element into the array, you need an index. How do you calculate the index? This is where the hash function (hashFunction) comes into the picture. You pass an integer to this hash function; to get that integer, Java calls the hashCode method of the object being added as a key in the map. This concept is called pre-hashing.
Once the index is known, you place the element at that index. This slot is called a BUCKET, so if an element is inserted at Entry[0], you say that it falls under bucket 0.
Now assume that the hashFunction returns the same index, say 0, for another object that you want to insert as a key in the map. This is where the equals method is called: if equals returns true, the key already exists and its value is simply replaced; if equals returns false, it means there is a hash collision. In that case, since Entry is a linked-list implementation, you add one more node (Entry) to the list already sitting at this index. So, bottom line: on a hash collision there is more than one element at a particular index, chained through a linked list.
The same logic applies when you are getting a key from the map: based on the index returned by the hashFunction, if there is only one entry, that entry is returned; otherwise the equals method is called on the entries in the linked list.
Hope this helps with the internals of how it works :)
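As a rough illustration, here is a simplified sketch of that put/get flow (made-up teaching code, not the actual JDK implementation; TinyMap, Entry and indexFor are hypothetical names):

import java.util.LinkedList;

// Simplified sketch of hashing into buckets with linked-list chaining.
class TinyMap<K, V> {
    static class Entry<K, V> {
        final K key;
        V value;
        Entry(K key, V value) { this.key = key; this.value = value; }
    }

    @SuppressWarnings("unchecked")
    private final LinkedList<Entry<K, V>>[] table = new LinkedList[16];

    private int indexFor(K key) {
        // hashCode() is the "pre-hashing" step; the modulo maps it to a bucket
        return (key.hashCode() & 0x7fffffff) % table.length;
    }

    void put(K key, V value) {
        int index = indexFor(key);
        if (table[index] == null) table[index] = new LinkedList<>();
        for (Entry<K, V> e : table[index]) {
            if (e.key.equals(key)) { e.value = value; return; } // same key: overwrite
        }
        table[index].add(new Entry<>(key, value)); // empty slot or collision: append
    }

    V get(K key) {
        LinkedList<Entry<K, V>> bucket = table[indexFor(key)];
        if (bucket == null) return null;
        for (Entry<K, V> e : bucket) {
            if (e.key.equals(key)) return e.value; // equals() picks the right node
        }
        return null;
    }
}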

Hash values in Java are provided by objects through the implementation of public int hashCode(), which is declared in the Object class and implemented for all the basic data types. Once you implement that method in your custom data object, you don't need to worry about how it is used by the miscellaneous data structures Java provides.
A note: implementing that method also requires implementing public boolean equals(Object o) in a consistent manner.
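For example, a consistent pair could look like this (a hypothetical Point class, purely for illustration):

import java.util.Objects;

// equals() and hashCode() derived from the same fields, so objects that
// are equal always produce the same hash value.
final class Point {
    private final int x, y;

    Point(int x, int y) { this.x = x; this.y = y; }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Point)) return false;
        Point p = (Point) o;
        return x == p.x && y == p.y;
    }

    @Override
    public int hashCode() {
        return Objects.hash(x, y); // combines exactly the fields equals() compares
    }
}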

If I want to store some data in a HashMap, for example, will it have some kind of underlying hash table with the hash values?
A HashMap is a form of hash table (and Hashtable is another). They work by using the hashCode() and equals(Object) methods provided by the HashMap's key type. Depending on how you want your keys to behave, you can use the hashCode / equals methods implemented by java.lang.Object ... or you can override them.
Or if someone could give a good and simple explanation of how hashing works, I would really appreciate it.
I suggest you read the Wikipedia page on Hash Tables to understand how they work. (FWIW, the HashMap and Hashtable classes use "separate chaining with linked lists", plus some other tweaks to optimize average performance.)
A hash function works by turning an object (i.e. a "key") into an integer. How it does this is up to the implementor, but a common approach is to combine the hash codes of the object's fields, something like this:
hashcode = (..((field1.hashcode * prime) + field2.hashcode) * prime + ...)
where prime is a smallish prime number like 31. The key point is that you get a good spread of hashcode values for different keys. What you DON'T want is lots of keys all hashing to the same value. That causes "collisions" and is bad for performance.
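Written out in Java, with two hypothetical String fields standing in for field1 and field2, that formula becomes:

// The prime-multiply-and-add pattern from the formula above.
final class CompositeKey {
    private final String field1;
    private final String field2;

    CompositeKey(String field1, String field2) {
        this.field1 = field1;
        this.field2 = field2;
    }

    @Override
    public int hashCode() {
        final int prime = 31;
        int result = field1.hashCode();              // innermost term
        result = prime * result + field2.hashCode(); // (field1.hc * prime) + field2.hc
        return result;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof CompositeKey)) return false;
        CompositeKey k = (CompositeKey) o;
        return field1.equals(k.field1) && field2.equals(k.field2);
    }
}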
When you implement the hashcode and equals methods, you need to do it in a way that satisfies the following constraints for the hash table to work correctly:
1. o1.equals(o2) => o1.hashcode() == o2.hashcode()
2. o1.equals(o2) == o2.equals(o1)
3. The hashcode of an object doesn't change while it is a key in a hash table.
It is also worth noting that the default hashCode and equals methods provided by Object are based on the target object's identity.
"But where is the hash values stored then? It is not a part of the HashMap, so is there an array assosiated to the HashMap?"
The hash values are typically not stored. Rather they are calculated as required.
In the case of the HashMap class, the hashcode for each key is actually cached in the entry's Node.hash field. But that is a performance optimization ... to make hash chain searching faster, and to avoid recalculating hashes if / when the hash table is resized. If you want this level of understanding, you really need to read the source code rather than asking questions.
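To make that concrete, an entry type with a cached hash looks roughly like this (a minimal sketch modeled on, but not copied from, the JDK's internal node class):

// Sketch: each entry caches the key's hash so resizing and chain
// searches never need to recompute hashCode().
final class CachedHashNode<K, V> {
    final int hash;            // cached hash of the key
    final K key;
    V value;
    CachedHashNode<K, V> next; // next node in the same bucket's chain

    CachedHashNode(int hash, K key, V value, CachedHashNode<K, V> next) {
        this.hash = hash;
        this.key = key;
        this.value = value;
        this.next = next;
    }
}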

This is the most fundamental contract in Java: the .equals()/.hashCode() contract.
The most important part of it here is that two objects which are considered .equals() should return the same .hashCode().
The reverse is not true: objects not considered equal may return the same hash code. But it should be as rare an occurrence as possible. Consider the following .hashCode() implementation, which, while perfectly legal, is as broken an implementation as can exist:
@Override
public int hashCode() { return 42; } // legal!!
While this implementation obeys the contract, it is pretty much useless... Hence the importance of a good hash function to begin with.
Now: the Set contract stipulates that a Set should not contain duplicate elements; however, the strategy of a Set implementation is left... Well, to the implementation. You will notice, if you look at the javadoc of Map, that its keys can be retrieved by a method called .keySet(). Therefore, Map and Set are very closely related in this regard.
If we take the case of a HashSet (and, ultimately, HashMap), it relies on .equals() and .hashCode(): when adding an item, it first calculates this item's hash code and, according to this hash code, attempts to insert the item into a given bucket. In contrast, a TreeSet (and TreeMap) relies on the natural ordering of elements (see Comparable).
However, if an object is to be inserted and its hash code would trigger insertion into a non-empty hash bucket (see the legal, but broken, .hashCode() implementation above), then .equals() is used to determine whether that object is really unique.
Note that, internally, a HashSet is a HashMap...
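A quick demonstration of why the constant hash code above is legal yet useless (Broken and Demo are hypothetical names): the HashSet still keeps elements distinct, because .equals() disambiguates, but every element lands in the same bucket and lookups degrade to a linear scan:

import java.util.HashSet;
import java.util.Set;

// Legal but broken: every instance collides on hash code 42.
final class Broken {
    private final int id;
    Broken(int id) { this.id = id; }

    @Override
    public boolean equals(Object o) {
        return o instanceof Broken && ((Broken) o).id == id;
    }

    @Override
    public int hashCode() { return 42; }
}

class Demo {
    public static void main(String[] args) {
        Set<Broken> set = new HashSet<>();
        for (int i = 0; i < 3; i++) set.add(new Broken(i));
        System.out.println(set.size());                  // 3: equals() keeps them distinct
        System.out.println(set.contains(new Broken(1))); // true, found via a linear scan
    }
}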

Hashing is a way to derive an integer code for any variable/object by applying a function/algorithm to its properties. The code is not guaranteed to be unique: two distinct objects may end up with the same hash code.

HashMap stores each key-value pair in an implementation of the Map.Entry interface (a static nested class).
HashMap works on a hashing algorithm and uses the hashCode() and equals() methods in put and get.
When we call put with a key-value pair, HashMap uses the key's hashCode() with hashing to find the index at which to store the pair. Entries are stored in a linked list, so if there is already an existing entry at that index, it uses equals() to check whether the passed key already exists: if yes, it overwrites the value, otherwise it creates a new entry and stores this key-value Entry.
When we call get with a key, it again uses hashCode() to find the index in the array and then uses equals() to find the correct Entry and return its value.
The other important things to know about HashMap are capacity, load factor, and threshold resizing.
HashMap's initial default capacity is 16 and the default load factor is 0.75. The threshold is the capacity multiplied by the load factor (16 × 0.75 = 12 by default), and whenever we add an entry that makes the map size exceed the threshold, HashMap rehashes the contents of the map into a new array with a larger capacity.
The capacity is always a power of 2, so if you know that you need to store a large number of key-value pairs, for example when caching data from a database, it's a good idea to initialize the HashMap with the correct capacity and load factor.
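For instance (a sketch; the numbers just follow the defaults quoted above):

import java.util.HashMap;
import java.util.Map;

class PresizeDemo {
    public static void main(String[] args) {
        // Defaults: capacity 16, load factor 0.75 -> threshold 16 * 0.75 = 12,
        // so adding the 13th entry triggers a resize to capacity 32.
        Map<String, String> withDefaults = new HashMap<>();

        // Expecting ~10,000 entries? Choose a capacity above
        // expectedSize / loadFactor so the map never needs to rehash.
        int expectedSize = 10_000;
        float loadFactor = 0.75f;
        int capacity = (int) (expectedSize / loadFactor) + 1;
        Map<String, String> presized = new HashMap<>(capacity, loadFactor);
    }
}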

Related

Does HashMap use the hashCode/equals methods of the HashMap itself or of the object?

I am confused about the internal working of HashMap. As far as I understand, HashMap uses the hashCode() and equals() methods of the given object that is stored in the map (that's why the implementation of these methods is so important). On the other hand, the sources I followed explain it as if HashMap used its own internal hashCode/equals methods to compare key values. Which one is right? Could you please clarify? I know the put / remove scenario; I just need to know where the hashCode/equals methods live. Are they in HashMap or in Object?
In a HashMap, keys are hashed and compared, not the values. So, if you want to use your own objects as keys, you need to implement the hashCode and equals methods. E.g. you may want to store a Student and the corresponding Marksheet in a HashMap; then, to use Student objects as keys, override the hashCode and equals methods in that class.
If you just want to use String as the key, then HashMap will call the hashCode and equals methods of the String class.
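Sketching that Student example (both classes and the field name are hypothetical):

import java.util.HashMap;
import java.util.Map;

// Student overrides hashCode() and equals() so that two instances with the
// same roll number locate the same map entry.
final class Student {
    private final int rollNo;
    Student(int rollNo) { this.rollNo = rollNo; }

    @Override
    public boolean equals(Object o) {
        return o instanceof Student && ((Student) o).rollNo == rollNo;
    }

    @Override
    public int hashCode() { return rollNo; }
}

class MarksheetDemo {
    public static void main(String[] args) {
        Map<Student, String> marksheets = new HashMap<>();
        marksheets.put(new Student(7), "A+");
        // A *different* Student instance with the same roll number still finds
        // the entry, because its hashCode() and equals() match the stored key.
        System.out.println(marksheets.get(new Student(7))); // A+
    }
}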
It uses them on the object to facilitate key lookup. Here is an example.
Imagine you had 1000 files of individuals in a box and you had the ID number of the one you had to find. You would need to go through the box and search through 1000 files to find the correct one.
Now imagine that you had two boxes, with all the even IDs in one box marked even and all the odd ones in a box marked odd. By looking at the ID you can pick the proper box (even or odd), so statistically you would only have to search through about 500 files.
So the hashCode generates "boxes" based on the key.
And the search for the exact key within a box uses equals.
In my example, hashCode() could return 0 or 1 via something like the following:
@Override
public int hashCode() {
    return id % 2; // returns 0 or 1 for even or odd box
}
Ideally a hashCode that generates many boxes (within reason) is best as the boxes become smaller (holds fewer objects). And a hashCode of a fixed value will work but it reduces the map to a single box with no benefit of improved performance.

What do HashMap and HashSet have in common? [closed]

Everywhere you can find answers about the differences:
Map - stores key-value pairs, is not synchronized (not thread safe), allows null values and only one null key, is faster to get a value because every value has a unique key, etc.
Set - not sorted, slower to get a value, stores only values, does not allow duplicates or null values, I guess.
BUT what does the Hash word mean (that is the part they have in common)? Is it something about hashing values or whatever? I hope you can answer me clearly.
Both use the hash value of the object for storage, which internally uses the hashCode() method of the Object class.
So if you are storing instances of your custom class, you need to override the hashCode() method.
HashSet and HashMap have a number of things in common:
The start of their name - which is a clue to the real similarity.
They use Hash Codes (from the hashCode method built into all Java objects) to quickly process and organize Objects.
They are both unordered collections - but both provide ordered variants (LinkedHashX, to store objects in the order of addition)
There is also TreeSet/TreeMap, which sort all objects present in the collection and keep them sorted. A comparison of TreeSet to TreeMap will find very similar differences and similarities to those between HashSet and HashMap.
They are also both impacted by the strengths and limitations of Hash algorithms in general.
Hashing is only effective if the objects have well behaved hash functions.
Hashing breaks entirely if equals and hashCode do not follow the correct contract.
Key objects in maps and objects in set should be immutable (or at least their hashCode and equals return values should never change) as otherwise behavior becomes undefined.
If you look at the Map API you can also see a number of other interesting connections - such as the fact that keySet and entrySet both return a Set.
None of the classic Java collections are thread safe. Some older classes (such as Hashtable and Vector) were synchronized, but they have mostly been retired. For thread safety look at the java.util.concurrent package; for non-thread-safe versions look at the collections package.
Just look into the HashSet source code and you will see that it uses a HashMap internally. So they share the same properties regarding null handling, synchronization, etc.:
public class HashSet<E>
    ...
    private transient HashMap<E,Object> map;

    // Dummy value to associate with an Object in the backing Map
    private static final Object PRESENT = new Object();

    /**
     * Constructs a new, empty set; the backing <tt>HashMap</tt> instance has
     * default initial capacity (16) and load factor (0.75).
     */
    public HashSet() {
        map = new HashMap<>();
    }
    ...
    public boolean contains(Object o) {
        return map.containsKey(o);
    }
    ...
    public boolean add(E e) {
        return map.put(e, PRESENT) == null;
    }
    ...
}
HashSet is like a HashMap where you don't care about the values but about the keys only.
So you care only whether a given key K is in the set, not about the value V to which it is mapped (you can think of it as if V were a constant, e.g. V = Boolean.TRUE, for all keys in the HashSet). So HashSet has no values (no V set). This is the whole difference from a structural point of view. The hash part means that when putting elements into the structure, Java first calls the hashCode method. See also http://en.wikipedia.org/wiki/Open_addressing to understand in general what can happen under the hood (note that Java's HashMap and HashSet actually resolve collisions by separate chaining, not open addressing).
The hash value is used to check faster whether two objects are the same. If two objects have the same hash, they may or may not be equal (so they are then compared for equality with the equals method). But if they have different hashes, they are different for sure and the equality check is not needed. This doesn't mean that two objects with the same hash value overwrite each other when they are stored in the HashSet or in the HashMap.
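A concrete example of that last point: the strings "Aa" and "BB" happen to have the same hash code in Java, yet they are not equal, and both can be stored side by side:

import java.util.HashSet;
import java.util.Set;

class CollisionDemo {
    public static void main(String[] args) {
        System.out.println("Aa".hashCode());   // 2112
        System.out.println("BB".hashCode());   // 2112 -- same hash, different strings
        System.out.println("Aa".equals("BB")); // false

        Set<String> set = new HashSet<>();
        set.add("Aa");
        set.add("BB");
        System.out.println(set.size()); // 2: a shared hash does not mean overwriting
    }
}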
Neither is thread safe, and both store values using hashCode(). Those are the common facts. Another one: both are members of the Java Collections Framework. But there are lots of differences between the two.
Hash refers to the technique used to convert the key to an index. Back in data structures class we used to learn how to construct a hash table; to do that, you would take the strings that were inserted as keys and convert them to a number used to index an array, which served internally as the storing data structure.
One problem that was also much discussed was finding a hashing function that would incur minimal collisions, so that we don't end up with two different objects, with different keys, sharing the same position.
So, the hash is about how the keys are processed to be stored. If we think about it for a while, there isn't a (real) way to index memory with strings, only with numbers, so to have a 2D structure like a table that is indexed by a string (or an object, as you wish), you need to generate a number (a hash) from that string and store the value in an array at that index. When you later need the value for the key "name", hashing "name" again leads you back to the same index.
Cheers
The "HASH" word is common because both uses hashing mechanism. HashSet is actually implemented using HashMap, using dummy object instance on every entry of the Set. And thereby a wastage of 4 bytes for each entry.

Java Overriding hashCode() method has any Performance issue?

If I override the hashCode() method, will it degrade the performance of the application? I am overriding this method in many places in my application.
Yes, you can degrade the performance of a hashed collection if the hashCode method is implemented badly. Ideally, a hashCode method should generate distinct hash codes for distinct objects. Distinct hash codes avoid collisions, so an element can be stored and retrieved with O(1) complexity. But the hashCode method alone cannot achieve this; you need to override the equals method as well, to help the JVM.
If the hashCode method is not able to generate distinct hashes for distinct objects, there is a chance that more than one object ends up in a bucket. This happens when two elements have the same hash but equals returns false for them. Each time this happens, the element is added to the list at that hash bucket. This slows down both insertion and retrieval of elements: the get method degrades to O(n) complexity, where n is the size of the list in a bucket.
Note: while you try to generate distinct hashes for distinct objects in your hashCode implementation, be sure to keep the algorithm simple. If your algorithm for generating the hash is too heavy, you will surely see poor performance for operations on your hashed collection, as the hashCode method is called for most operations on it.
It can improve performance if the right data structure is used in the right place.
For example: a proper hashCode implementation in your object can turn a nearly O(n) HashMap lookup into an O(1) one, unless you are doing overly complicated work in the hashCode() method.
hashCode() is invoked every time a hash data structure has to deal with your object, so a heavy hashCode() method (which it shouldn't be) is paid for on every such operation.
It depends entirely on how you're implementing hashCode. If you're doing lots of expensive deep operations, then perhaps it might, and in that case, you should consider caching a copy of the hashCode (like String does). But a decent implementation, such as with HashCodeBuilder, won't be a big deal. Having a good hashCode value can make lookups in data structures like HashMaps and HashSets much, much faster, and if you override equals, you need to override hashCode.
Java's hashCode() is a virtual function anyway, so there is no performance loss by the sheer fact that it is overridden and the overridden method is used.
The real difference may be the implementation of the method. By default, hashCode() works like this (source):
As much as is reasonably practical, the hashCode method defined by class Object does return distinct integers for distinct objects. (This is typically implemented by converting the internal address of the object into an integer, but this implementation technique is not required by the Java™ programming language.)
So, whenever your implementation is as simple as this, there will be no performance loss. However, if you perform complex computing operations based on many fields, calling many other functions - you will notice a performance loss but only because your hashCode() does more things.
There is also the issue of inefficient hashCode() implementations. For example, if your hashCode() simply returns the value 1, then the use of HashMap or HashSet will be significantly slower than with a proper implementation. There is a great question covering the topic of implementing hashCode() and equals() on SO: What issues should be considered when overriding equals and hashCode in Java?
One more note: remember, that whenever you implement hashCode() you should also implement equals(). Moreover, you should do it with care, because if you write an invalid hashCode() you may break equality checks for various collections.
Overriding hashCode() in a class in itself does not cause any performance issues. However, when an instance of such a class is inserted into a HashMap, a HashSet, or an equivalent data structure, hashCode() and, where needed, equals() are invoked to identify the right bucket to put the element in. The same applies to retrieval, search, and deletion.
As posted by others, performance depends entirely on how hashCode() is implemented.
However, if a particular class's equals method is not used at all, then it is not mandatory to override equals() and hashCode(); but if equals() is overridden, hashCode() must be overridden as well.
As all previous comments mentioned, the hash code is used for hashing in collections, and it can also be used as a negative condition in equals. So yes, you can slow your app down a lot. Obviously there are more use cases.
First of all I would say that the approach (whether to override it at all) depends on the type of objects you are talking about.
The default implementation of hashCode is as fast as possible because it is in effect unique per object instance. That may be enough for many cases.
It is not enough when you want to use a HashSet and, say, do not want to store two "same" objects in the collection. Now, the point is in the word "same".
"Same" can mean "the same instance". "Same" can mean an object with the same (database) identifier, when your object is an entity. Or "same" can mean an object with all properties equal. The choice can clearly affect performance.
Moreover, one of the properties can itself be an object that evaluates its hashCode() on demand, so calling the hash-code method on the root object can end up evaluating the hash codes of a whole object tree.
So, what would I recommend? You need to define and clarify what you want. Do you really need to distinguish different object instances, is the identifier crucial, or is it a value object?
It also depends on immutability. It is possible to calculate the hash code once, when the object is constructed from its constructor properties (which only have getters), and reuse it whenever hashCode() is called. Another option is to recalculate the hash code whenever any property changes. You need to decide whether your code mostly reads the value or writes it.
The last thing I would say is: override hashCode() only when you know that you need it and when you know what you are doing.
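For an immutable object, that calculate-once strategy can look like this (a sketch with hypothetical fields):

// Hash code computed once at construction; valid because every field that
// participates is final and never changes afterwards.
final class CachedKey {
    private final String name;
    private final int version;
    private final int hash; // cached at construction time

    CachedKey(String name, int version) {
        this.name = name;
        this.version = version;
        this.hash = 31 * name.hashCode() + version;
    }

    @Override
    public int hashCode() { return hash; } // no recomputation per call

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof CachedKey)) return false;
        CachedKey k = (CachedKey) o;
        return version == k.version && name.equals(k.name);
    }
}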
The main purpose of the hashCode method is to allow an object to be a key in a hash map or a member of a hash set. In that case the object should also implement the equals(Object) method in a way that is consistent with the hashCode implementation:
If a.equals(b), then a.hashCode() == b.hashCode()
If hashCode() is called twice on the same object, it should return the same result, provided that the object has not changed
hashCode from the performance point of view
From the performance point of view, the main objective for your hashCode method implementation is to minimize the number of objects sharing the same hash code.
All JDK hash based collections store their values in an array.
The hash code is used to calculate an initial lookup position in this array. After that, equals is used to compare the given value with the values stored in the internal array. So, if all values have distinct hash codes, this minimizes the possibility of hash collisions.
On the other hand, if all values have the same hash code, the hash map (or set) degrades into a list: a single lookup becomes O(n), and populating the whole map becomes O(n²).
From Java 8 onwards, though, collisions do not impact performance as much as in earlier versions, because after a threshold the linked list in a bucket is replaced by a balanced tree, which gives you O(log n) worst-case performance instead of the linked list's O(n).
Never write a hashCode method which returns a constant.
String.hashCode's result distribution is nearly perfect, so you can sometimes substitute Strings with their hash codes.
The next objective is to check how many identifiers with non-unique hash codes you still have. Improve your hashCode method, or increase the range of allowed hash code values, if you have too many non-unique hash codes. In the perfect case, all your identifiers will have unique hash codes.

Why is HashMap faster than HashSet?

I have been reading/researching the reason why HashMap is faster than HashSet.
I am not quite understanding the following statements:
HashMap is faster than HashSet because the values are associated to a unique key.
In HashSet, member object is used for calculating hashcode value which can be same for two objects so equals() method is used to check for equality. If it returns false, that means the two objects are different. In HashMap, the hashcode value is calculated using the key object.
The HashMap hashcode value is calculated using the key object. Here, the member object is used to calculate the hashcode, which can be the same for two objects, so equals() method is used to check for equality. If it returns false, that means the two objects are different.
To conclude my question:
I thought HashMap and HashSet calculate the hashcode in the same way. Why are they different?
Can you provide a concrete example of how HashSet and HashMap calculate the hashcode differently?
I know what a "key object" is, but what does "member object" mean?
HashMap can do the same things as HashSet, and faster. Why do we need HashSet? Example:
Map<String, Boolean> map = new HashMap<>();
map.put("obj1", true);  // => exists
map.get("obj1");        // => if null = does not exist, else exists
Performance:
If you look at the source code of HashSet (at least in JDK 6, 7 and 8), it uses HashMap internally, so it basically does exactly what you are doing in your sample code.
So, if you need a Set implementation, you use HashSet; if you need a Map, HashMap. Code using a HashMap instead of a HashSet will have exactly the same performance as using a HashSet directly.
Choosing the right collection
Map - maps keys to values (associative array) - http://en.wikipedia.org/wiki/Associative_array.
Set - a collection that contains no duplicate elements - http://en.wikipedia.org/wiki/Set_(computer_science).
If the only thing you need your collection for is to check whether an element is present, use a Set. Your code will be cleaner and more understandable to others.
If you need to store some data for your elements, use a Map.
None of these answers really explains why HashMap is faster than HashSet. They both have to calculate the hashcode, but think about the nature of the key of a HashMap: it is typically a simple String or even a number. Calculating the hashcode of that is much faster than the default hashcode calculation of an entire object. If the key of the HashMap were the same object as the one stored in a HashSet, there would be no real difference in performance. The difference comes from what sort of object the HashMap's key is.

Best practices on what should be key in a hashtable

The best look-up structure is a hash table. It provides constant access on average (linear in the worst case).
This depends on the hash function. Ok.
My question is the following: assuming a good implementation of a hash table, e.g. HashMap, is there a best practice concerning the keys passed into the map? I mean, it is recommended that the key be an immutable object, but I was wondering if there are other recommendations.
For example, the size of the key? In a good hashmap (in the sense described above), if we used Strings as keys, wouldn't the "bottleneck" be the string comparison in equals (trying to find the key)? So should the keys be kept small? Or are there objects that should not be used as keys, e.g. a URL? In such cases how can you choose what to use as a key?
The best performing key for a HashMap is probably an Integer, where hashCode() and equals() are implemented as:
public int hashCode() {
    return value;
}

public boolean equals(Object obj) {
    if (obj instanceof Integer) {
        return value == ((Integer) obj).intValue();
    }
    return false;
}
That said, the purpose of a HashMap is to map some objects (keys) to others (values). The hash function is applied to the key in order to provide fast, constant-time access to the value.
it is recommended that the key be an immutable object but I was wondering if there are other recommendations.
The recommendation is to map objects according to what you need: don't think about what is faster, but think about what is best for your business logic when addressing the objects you want to retrieve.
The important requirement is that the key object be immutable, because if you change the key object after storing it in the Map, it may not be possible to retrieve the associated value later.
The key word in HashMap is Map. Your object should just map. If you sacrifice the mapping task to optimize the key, you are defeating the purpose of the Map, probably without achieving any performance boost.
I 100% agree with the first two comments in your question:
the major constraint is that it has to be the thing that you want to base the lookup on ;)
– Oli Charlesworth
The general rule is to use as the key whatever you need to look up with.
– Louis Wasserman
Remember the two rules for optimization:
Don't.
(For experts only) Don't yet.
The third rule is: profile before you optimize.
You should use whatever key you need to look things up with in the data structure; it's typically a domain-specific decision. With that said, keep in mind that both hashCode() and equals() will be used in finding a key in the table.
hashCode() is used to find the position of the key, while equals() is used to determine whether the key you are searching for is actually the key that was just found using hashCode().
For example, consider two keys a and b that have the same hash code in a table using separate chaining. Then a search for a would require testing equals() against potentially both a and b, once hashCode() gives us the index of the list containing them.
it is recommended that the key be an immutable object but I was wondering if there are other recommendations.
The key of the value should be final (effectively immutable).
Most times a field of the object is used as the key. If that field changes, then the map cannot find the object:
void foo(Map<String, Employee> map, Employee e) {
    map.put(e.getId(), e);        // stored under the old id
    String newId = e.getId() + "new";
    e.setId(newId);
    Employee e2 = map.get(newId); // null! nothing was stored under newId
    // e != e2 -- the entry is still filed under the old id
}
So Employee should not have a setId() method at all, but that is difficult because when you are writing Employee you don't know what it will be keyed by.
I dug into the implementation. I had an assumption that the effectiveness of the hashCode() method would be the key factor.
When I looked into the HashMap and the Hashtable implementations, I found that they are quite similar (with one exception). Both use and store an internal hash code for all entries, which is a good sign that hashCode() does not influence performance too heavily.
Both have a number of buckets where the values are stored. There is an important balance between the number of buckets (say n) and the average number of keys within a bucket (say k). The bucket is found in O(1) time, the content of the bucket is iterated in O(k) time, but the more buckets we have, the more memory is allocated. Also, if many buckets are empty, it means that the hashCode() method for the key class does not spread the hash codes widely enough.
The algorithm works like this:
1. Take the `hashCode()` of the key (and apply a slight bijective transformation to it)
2. Find the appropriate bucket
3. Loop through the content of the bucket (which is some kind of linked list)
4. Compare the keys as follows:
   a. Compare the hash codes (calculated in the first step, and stored for the entry)
   b. Check whether key `==` the stored key (still no `equals()` call; this step is missing from Hashtable)
   c. Compare the keys by `key.equals(storedKey)`
To summarize:
hashCode() is called once per operation (this is a must, you cannot do without it)
equals() is called only if the hash codes are not well spread and two keys happen to have the same hash code
The same algorithm is used for get() and put() (because in the put() case you may set the value for an existing key). So, the most important thing is how the hashCode() method is implemented. That is the most frequently called method.
Two strategies are: make it fast and make it effective (well spread). The JDK developers made efforts to achieve both, but it's not always possible to have them both.
Numeric types are good
Object (and classes that don't override it) are good (hashCode() is native), except that you cannot specify your own equals()
String is not so good: it iterates through the characters, but it caches the result after that (see my comment below)
Any class with a synchronized hashCode() is not good
Any class whose hashCode() iterates over its contents is not good
Classes that cache the hash code are a bit better (depends on the usage)
A comment on String: to make it fast, in the first versions of the JDK the String hash code calculation used only the first 32 characters. But the hash codes it produced were not well spread, so they decided to take all the characters into the hash code.
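For reference, the lookup path described above can be sketched as follows. The bit-spreading in hash() mirrors what JDK 8's HashMap does with a key's hashCode(); the surrounding class and method names are made up for this sketch:

// Condensed sketch of a JDK-8-style lookup: compare the cached hash first
// (cheap), then reference identity (==), and only then call equals().
final class LookupSketch<K, V> {
    static final class Node<K, V> {
        final int hash;
        final K key;
        V value;
        Node<K, V> next;
        Node(int hash, K key, V value, Node<K, V> next) {
            this.hash = hash; this.key = key; this.value = value; this.next = next;
        }
    }

    // Spread the raw hashCode() so the high bits also influence the bucket index.
    static int hash(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }

    V find(Node<K, V> bucketHead, K key) {
        int h = hash(key);
        for (Node<K, V> e = bucketHead; e != null; e = e.next) {
            if (e.hash == h && (e.key == key || (key != null && key.equals(e.key)))) {
                return e.value;
            }
        }
        return null;
    }
}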
