How HashSet works with regards to hashCode()?

How HashSet works with regards to hashCode()? - java

I'm trying to understand java.util.Collection and java.util.Map a little deeper but I have some doubts about HashSet funcionality:
In the documentation, it says: This class implements the Set interface, backed by a hash table (actually a HashMap instance). Ok, so I can see that a HashSet always has a Hashtable working in background. A hashtable is a struct that asks for a key and a value everytime you want to add a new element to it. Then, the value and the key are stored in a bucket based on the key hashCode. If the hashcodes of two keys are the same, they add both key values to the same bucket, using a linkedlist. Please, correct me if I said something wrong.
So, my question is: If a HashSet always has a Hashtable acting in background, then everytime we add a new element to the HashSet using HashSet.add() method, the HashSet should add it to its internal Hashtable. But, the Hashtable asks for a value and a key, so what key does it use? Does it just uses the value we're trying to add also as a key and then take its hashCode? Please, correct me if I said something wrong about HashSet implementation.
Another question that I have is: In general, what classes can use the hashCode() method of an java object? I'm asking this because, in the documentation, it says that everytime we override equals() method we need to override hashCode() method. Ok, it really makes sense, but my doubt is if it's just a recommendation we should do to keep everything 'nice and perfect' (putting in this way), or if it's really necessary, because maybe a lot of Java defaults classes will constantly uses hashCode() method of your objects. In my vision, I can't see other classes using this method instead of those classes related to Collections. Thank you very much guys

If you look at the actual javacode of HashSet you can see what it does:
// Dummy value to associate with an Object in the backing Map
private static final Object PRESENT = new Object();
...
public boolean add(E e) {
return map.put(e, PRESENT)==null;
}
So the element you are adding is the Key in the backing hashmap with a dummy value as the value. this dummy value is never actually used by the hashSet.
Your second question regarding overriding equals and hashcode:
It is really necessary to always override both if you want to override either one. This is because the contract for hashCode says equal objects must have the same hashcode. the default implementation of hashcode will give different values for each instance.
Therefore, if you override equals() but not hashcode() This could happen
object1.equals(object2) //true
MySet.add(object1);
MySet.contains(object2); //false but should be true if we overrode hashcode()
Since contains will use hashcode to find the bucket to search in we might get a different bucket back and not find the equal object.

If you look at the source for HashSet (the source comes with the JDK and is very informative), you will see that it creates an object to use as the value:
// Dummy value to associate with an Object in the backing Map
private static final Object PRESENT = new Object();
Each value that is added to the HashSet is used as a key to the backing HashMap with this PRESENT object as the value.
Regarding overriding equals() whenever you override hashCode() (and vice versa), it is very important that these two methods return consistent results. That is, they should agree with one another. For more details, see the book Effective Java by Josh Bloch.

Related

Should I override hashCode() of Collections?

Given that I some class with various fields in it:
class MyClass {
private String s;
private MySecondClass c;
private Collection<someInterface> coll;
// ...
#Override public int hashCode() {
// ????
}
}
and of that, I do have various objects which I'd like to store in a HashMap. For that, I need to have the hashCode() of MyClass.
I'll have to go into all fields and respective parent classes recursively to make sure they all implement hashCode() properly, because otherwise hashCode() of MyClass might not take into consideration some values. Is this right?
What do I do with that Collection? Can I always rely on its hashCode() method? Will it take into consideration all child values that might exist in my someInterface object?
I OPENED A SECOND QUESTION regarding the actual problem of uniquely IDing an object here: How do I generate an (almost) unique hash ID for objects?
Clarification:
is there anything more or less unqiue in your class? The String s? Then only use that as hashcode.
MyClass hashCode() of two objects should definitely differ, if any of the values in the coll of one of the objects is changed. HashCode should only return the same value if all fields of two objects store the same values, resursively. Basically, there is some time-consuming calculation going on on a MyClass object. I want to spare this time, if the calculation had already been done with the exact same values some time ago. For this purpose, I'd like to look up in a HashMap, if the result is available already.
Would you be using MyClass in a HashMap as the key or as the value? If the key, you have to override both equals() and hashCode()
Thus, I'm using the hashCode OF MyClass as the key in a HashMap. The value (calculation result) will be something different, like an Integer (simplified).
What do you think equality should mean for multiple collections? Should it depend on element ordering? Should it only depend on the absolute elements that are present?
Wouldn't that depend on the kind of Collection that is stored in coll? Though I guess ordering not really important, no
The response you get from this site is gorgeous. Thank you all
#AlexWien that depends on whether that collection's items are part of the class's definition of equivalence or not.
Yes, yes they are.

I'll have to go into all fields and respective parent classes recursively to make sure they all implement hashCode() properly, because otherwise hashCode() of MyClass might not take into consideration some values. Is this right?
That's correct. It's not as onerous as it sounds because the rule of thumb is that you only need to override hashCode() if you override equals(). You don't have to worry about classes that use the default equals(); the default hashCode() will suffice for them.
Also, for your class, you only need to hash the fields that you compare in your equals() method. If one of those fields is a unique identifier, for instance, you could get away with just checking that field in equals() and hashing it in hashCode().
All of this is predicated upon you also overriding equals(). If you haven't overridden that, don't bother with hashCode() either.
What do I do with that Collection? Can I always rely on its hashCode() method? Will it take into consideration all child values that might exist in my someInterface object?
Yes, you can rely on any collection type in the Java standard library to implement hashCode() correctly. And yes, any List or Set will take into account its contents (it will mix together the items' hash codes).

So you want to do a calculation on the contents of your object that will give you a unique key you'll be able to check in a HashMap whether the "heavy" calculation that you don't want to do twice has already been done for a given deep combination of fields.
Using hashCode alone:
I believe hashCode is not the appropriate thing to use in the scenario you are describing.
hashCode should always be used in association with equals(). It's part of its contract, and it's an important part, because hashCode() returns an integer, and although one may try to make hashCode() as well-distributed as possible, it is not going to be unique for every possible object of the same class, except for very specific cases (It's easy for Integer, Byte and Character, for example...).
If you want to see for yourself, try generating strings of up to 4 letters (lower and upper case), and see how many of them have identical hash codes.
HashMap therefore uses both the hashCode() and equals() method when it looks for things in the hash table. There will be elements that have the same hashCode() and you can only tell if it's the same element or not by testing all of them using equals() against your class.
Using hashCode and equals together
In this approach, you use the object itself as the key in the hash map, and give it an appropriate equals method.
To implement the equals method you need to go deeply into all your fields. All of their classes must have equals() that matches what you think of as equal for the sake of your big calculation. Special care needs to be be taken when your objects implement an interface. If the calculation is based on calls to that interface, and different objects that implement the interface return the same value in those calls, then they should implement equals in a way that reflects that.
And their hashCode is supposed to match the equals - when the values are equal, the hashCode must be equal.
You then build your equals and hashCode based on all those items. You may use Objects.equals(Object, Object) and Objects.hashCode( Object...) to save yourself a lot of boilerplate code.
But is this a good approach?
While you can cache the result of hashCode() in the object and re-use it without calculation as long as you don't mutate it, you can't do that for equals. This means that calculation of equals is going to be lengthy.
So depending on how many times the equals() method is going to be called for each object, this is going to be exacerbated.
If, for example, you are going to have 30 objects in the hashMap, but 300,000 objects are going to come along and be compared to them only to realize that they are equal to them, you'll be making 300,000 heavy comparisons.
If you're only going to have very few instances in which an object is going to have the same hashCode or fall in the same bucket in the HashMap, requiring comparison, then going the equals() way may work well.
If you decide to go this way, you'll need to remember:
If the object is a key in a HashMap, it should not be mutated as long as it's there. If you need to mutate it, you may need to make a deep copy of it and keep the copy in the hash map. Deep copying again requires consideration of all the objects and interfaces inside to see if they are copyable at all.
Creating a unique key for each object
Back to your original idea, we have established that hashCode is not a good candidate for a key in a hash map. A better candidate for that would be a hash function such as md5 or sha1 (or more advanced hashes, like sha256, but you don't need cryptographic strength in your case), where collisions are a lot rarer than a mere int. You could take all the values in your class, transform them into a byte array, hash it with such a hash function, and take its hexadecimal string value as your map key.
Naturally, this is not a trivial calculation. So you need to think if it's really saving you much time over the calculation you are trying to avoid. It is probably going to be faster than repeatedly calling equals() to compare objects, as you do it only once per instance, with the values it had at the time of the "big calculation".
For a given instance, you could cache the result and not calculate it again unless you mutate the object. Or you could just calculate it again only just before doing the "big calculation".
However, you'll need the "cooperation" of all the objects you have inside your class. That is, they will all need to be reasonably convertible into a byte array in such a way that two equivalent objects produce the same bytes (including the same issue with the interface objects that I mentioned above).
You should also beware of situations in which you have, for example, two strings "AB" and "CD" which will give you the same result as "A" and "BCD", and then you'll end up with the same hash for two different objects.

For future readers.
Yes, equals and hashCode go hand in hand.
Below shows a typical implementation using a helper library, but it really shows the "hand in hand" nature. And the helper library from apache keeps things simpler IMHO:
#Override
public boolean equals(Object o) {
if (this == o) {
return true;
}
if (o == null || getClass() != o.getClass()) {
return false;
}
MyCustomObject castInput = (MyCustomObject) o;
boolean returnValue = new org.apache.commons.lang3.builder.EqualsBuilder()
.append(this.getPropertyOne(), castInput.getPropertyOne())
.append(this.getPropertyTwo(), castInput.getPropertyTwo())
.append(this.getPropertyThree(), castInput.getPropertyThree())
.append(this.getPropertyN(), castInput.getPropertyN())
.isEquals();
return returnValue;
}
#Override
public int hashCode() {
return new org.apache.commons.lang3.builder.HashCodeBuilder(17, 37)
.append(this.getPropertyOne())
.append(this.getPropertyTwo())
.append(this.getPropertyThree())
.append(this.getPropertyN())
.toHashCode();
}
17, 37 .. those you can pick your own values.

From your clarifications:
You want to store MyClass in an HashMap as key.
This means the hashCode() is not allowed to change after adding the object.
So if your collections may change after object instantiation, they should not be part of the hashcode().
From http://docs.oracle.com/javase/8/docs/api/java/util/Map.html
Note: great care must be exercised if mutable objects are used as map
keys. The behavior of a map is not specified if the value of an object
is changed in a manner that affects equals comparisons while the
object is a key in the map.
For 20-100 objects it is not worth that you enter the risk of an inconsistent hash() or equals() implementation.
There is no need to override hahsCode() and equals() in your case.
If you don't overide it, java takes the unique object identity for equals and hashcode() (and that works, epsecially because you stated that you don't need an equals() considering the values of the object fields).
When using the default implementation, you are on the safe side.
Making an error like using a custom hashcode() as key in the HashMap when the hashcode changes after insertion, because you used the hashcode() of the collections as part of your object hashcode may result in an extremly hard to find bug.
If you need to find out whether the heavy calculation is finished, I would not absue equals(). Just write an own method objectStateValue() and call hashcode() on the collection, too. This then does not interfere with the objects hashcode and equals().
public int objectStateValue() {
// TODO make sure the fields are not null;
return 31 * s.hashCode() + coll.hashCode();
}
Another simpler possibility: The code that does the time consuming calculation can raise an calculationCounter by one as soon as the calculation is ready. You then just check whether or not the counter has changed. this is much cheaper and simpler.

why should I override equals and hashcode method for following scnerio

why I need to override for direct access of value in Hash map.That is if insert data into hashmap as follow HashMap,I could get value by giving the Key as Integer ,would get Object as Value.In this case is it necessary to Override equals() and hashCode() method?Please give suggestion.

No, you don't need to override anything to use an object as a value in a HashMap.
Only keys need to have a working hashCode().
However, you need to implement these two methods (technically only equals, but these two are a set, really) if you want to use things like Map#containsValue, List#indexOf or Collection#contains (and these should not just be using reference identity).

hashCode() is used to search for a specific elem when you want to retrieve it from a hashTable. hashCode() doesn't have to be distinct. in fact, you could just return the same integer for all your instance, but then, elems are stored in a list instead of a hashTable, and will cause a performance problem.
By default implementation of hashCode() (which is the implementation of Object for subClass to extents from )of JVM returns a integer according to the memory address of the object, so this should be enough, but this implement was not required by the JVM standard.
By default(Class object), implementation of equals() will return true and only return true when they have same reference , ie obj1 == obj2. read this
keep in mind that:
equal objects must have same hashCode()
those have same hashCode() are not required to be equal to each other.
I think override of hashCode() is not needed in most situations(not extends from other Class), cause modern JVMs has done
pretty good job for you.
So conclusion is:
if your super class have overwrite the hashCode() and equals() method, then you should override them, or at least take a look at the implementation, and decide whether you should override them.

how does hashing in java works?

I am trying to figure something out about hashing in java.
If i want to store some data in a hashmap for example, will it have some kind of underlying hashtable with the hashvalues?
Or if someone could give a good and simple explanation of how hashing work, I would really appreciate it.

HashMap is basically implemented internally as an array of Entry[]. If you understand what is linkedList, this Entry type is nothing but a linkedlist implementation. This type actually stores both key and value.
To insert an element into the array, you need index. How do you calculate index? This is where hashing function(hashFunction) comes into picture. Here, you pass an integer to this hashfunction. Now to get this integer, java gives a call to hashCode method of the object which is being added as a key in the map. This concept is called preHashing.
Now once the index is known, you place the element on this index. This is basically called as BUCKET , so if element is inserted at Entry[0], you say that it falls under bucket 0.
Now assume that the hashFunction returns you same index say 0, for another object that you wanted to insert as a key in the map. This is where equals method is called and if even equals returns true, it simple means that there is a hashCollision. So under this case, since Entry is a linkedlist implmentation, on this index itself, on the already available entry at this index, you add one more node(Entry) to this linkedlist. So bottomline, on hashColission, there are more than one elements at a perticular index through linkedlist.
The same case is applied when you are talking about getting a key from map. Based on index returned by hashFunction, if there is only one entry, that entry is returned otherwise on linkedlist of entries, equals method is called.
Hope this helps with the internals of how it works :)

Hash values in Java are provided by objects through the implementation of public int hashCode() which is declared in Object class and it is implemented for all the basic data types. Once you implement that method in your custom data object then you don't need to worry about how these are used in miscellaneous data structures provided by Java.
A note: implementing that method requires also to have public boolean equals(Object o) implemented in a consistent manner.

If i want to store some data in a hashmap for example, will it have some kind of underlying hashtable with the hashvalues?
A HashMap is a form of hash table (and HashTable is another). They work by using the hashCode() and equals(Object) methods provided by the HashMaps key type. Depending on how you want you keys to behave, you can use the hashCode / equals methods implemented by java.lang.Object ... or you can override them.
Or if someone could give a good and simple explanation of how hashing work, I would really appreciate it.
I suggest you read the Wikipedia page on Hash Tables to understand how they work. (FWIW, the HashMap and HashTable classes use "separate chaining with linked lists", and some other tweaks to optimize average performance.)
A hash function works by turning an object (i.e. a "key") into an integer. How it does this is up to the implementor. But a common approach is to combine hashcodes of the object's fields something like this:
hashcode = (..((field1.hashcode * prime) + field2.hashcode) * prime + ...)
where prime is a smallish prime number like 31. The key is that you get a good spread of hashcode values for different keys. What you DON'T want is lots of keys all hashing to the same value. That causes "collisions" and is bad for performance.
When you implement the hashcode and equals methods, you need to do it in a way that satisfies the following constraints for the hash table to work correctly:
1. O1.equals(o2) => o1.hashcode() == o2.hashcode()
2. o2.equals(o2) == o2.equals(o1)
3. The hashcode of an object doesn't change while it is a key in a hash table.
It is also worth noting that the default hashCode and equals methods provided by Object are based on the target object's identity.
"But where is the hash values stored then? It is not a part of the HashMap, so is there an array assosiated to the HashMap?"
The hash values are typically not stored. Rather they are calculated as required.
In the case of the HashMap class, the hashcode for each key is actually cached in the entry's Node.hash field. But that is a performance optimization ... to make hash chain searching faster, and to avoid recalculating hashes if / when the hash table is resized. But if you want this level of understanding, you really need to read the source code rather than asking Questions.

This is the most fundamental contract in Java: the .equals()/.hashCode() contract.
The most important part of it here is that two objects which are considered .equals() should return the same .hashCode().
The reverse is not true: objects not considered equal may return the same hash code. But it should be as rare an occurrence as possible. Consider the following .hashCode() implementation, which, while perfectly legal, is as broken an implementation as can exist:
#Override
public int hashCode() { return 42; } // legal!!
While this implementation obeys the contract, it is pretty much useless... Hence the importance of a good hash function to begin with.
Now: the Set contract stipulates that a Set should not contain duplicate elements; however, the strategy of a Set implementation is left... Well, to the implementation. You will notice, if you look at the javadoc of Map, that its keys can be retrieved by a method called .keySet(). Therefore, Map and Set are very closely related in this regard.
If we take the case of a HashSet (and, ultimately, HashMap), it relies on .equals() and .hashCode(): when adding an item, it first calculates this item's hash code, and according to this hash code, attemps to insert the item into a given bucket. In contrast, a TreeSet (and TreeMap) relies on the natural ordering of elements (see Comparable).
However, if an object is to be inserted and the hash code of this object would trigger its insertion into a non empty hash bucket (see the legal, but broken, .hashCode() implementation above), then .equals() is used to determine whether that object is really unique.
Note that, internally, a HashSet is a HashMap...

Hashing is a way to assign a unique code for any variable/object after applying any function/algorithm on its properties.

HashMap stores key-value pair in Map.Entry static nested class implementation.
HashMap works on hashing algorithm and uses hashCode() and equals() method in put and get methods.
When we call put method by passing key-value pair, HashMap uses Key hashCode() with hashing to find out
the index to store the key-value pair. The Entry is stored in the LinkedList, so if there are already
existing entry, it uses equals() method to check if the passed key already exists, if yes it overwrites
the value else it creates a new entry and store this key-value Entry.
When we call get method by passing Key, again it uses the hashCode() to find the index
in the array and then use equals() method to find the correct Entry and return it’s value.
Below image will explain these detail clearly.
The other important things to know about HashMap are capacity, load factor, threshold resizing.
HashMap initial default capacity is 16 and load factor is 0.75. Threshold is capacity multiplied
by load factor and whenever we try to add an entry, if map size is greater than threshold,
HashMap rehashes the contents of map into a new array with a larger capacity.
The capacity is always power of 2, so if you know that you need to store a large number of key-value pairs,
for example in caching data from database, it’s good idea to initialize the HashMap with correct capacity
and load factor.

Java Map Implementation not based on HashCode

Is there some implementation of java.util.Map that does not uses HashCode?
I have the following problem:
I store an object associated to another object on a HashMap;
Change a property from the key object used on step 1;
As the hashcode is used to store the keys on the regular implementation of HashMap, when I perform a get() on the HashMap, I get null, because the old object hashCode was different at step 1.
Is there a solution for that? Or should I really use just immutable fields for my equals / hashCode methods?

IdentityHashMap uses the Object identity instead of the hashCode; however that does mean that you require the original object used as key to retrieve the value of the map. Other options would be redefine the hashcode to exclude the mutable parts of the object, or - if you can't redefine the hashCode for some reason - wrap the object in another object which provides a stable hashCode.

You would be well advised to use an immutable key, and to re-insert the key/value pair into Map, rather than mutating the key in-place. As you discovered, that just leads to weird bugs.
If this isn't an option for you, then see if you can ignore the mutable property in the hashCode() method, so that the hash code doesn't change. If that's the only property of the class, though, that's not a good idea.
You may be able to get away with using TreeMap, which I don't think uses hashCode(). However, it does require consistency between the key's compareTo() and equals() methods, so you may just end up with the same problem as before if the return values of those methods can change.

All Maps should use immutable objects for keys. True for Python; true for Java.
If you implement equals and hashCode using only immutable fields you should be fine.

How about removing and adding it again ?
On Step 2, You can remove the element added in Step 1 and again add it with new latest properties set. This way when you are try to get in Step 3, you will find it.
Try it.

I think modify the key object in map is not a good practice.
But if you really want, you can override the hashCode() and remember to override the equal() method.

All associative containers use comparing or hash code, so I would like to recommend you using immutable fields for equals() / hashCode() methods.

Override equals and hashCode methods if you don't want original implementation.

Understanding contains method of Java HashSet

Newbie question about java HashSet
Set<User> s = new HashSet<User>();
User u = new User();
u.setName("name1");
s.add(u);
u.setName("name3");
System.out.println(s.contains(u));
Can someone explain why this code output false ? Moreover this code does not even call equals method of User. But according to the sources of HashSet and HashMap it have to call it. Method equals of User simply calls equals on user's name. Method hashCode return hashCode of user's name

If the hash code method is based on the name field, and you then change it after adding the object, then the second contains check will use the new hash value, and won't find the object you were looking for. This is because HashSets first search by hash code, so they won't bother calling equals if that search fails.
The only way this would work is if you hadn't overridden equals (and so the default reference equality was used) and you got lucky and the hash codes of the two objects were equal. But this is a really unlikely scenario, and you shouldn't rely on it.
In general, you should never update an object after you have added it to a HashSet if that change will also change its hashcode.

Since your new User has a different hashcode, the HashSet knows that it isn't equal.
HashSets store their items according to their hashcodes.
The HashSet will only call equals if it finds an item with the same hashcode, to make sure that the two items are actually equal (as opposed to a hash collision)

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.