1.
I understand the different hash map mechanisms and the ways in which key collisions are handled (open addressing with linear/quadratic probing, chaining, extendible hashing, etc.). Which one does HashSet/HashMap make use of?
2.
I realise that a good HashMap relies on a good hash function. How does Java's HashSet/HashMap hash the objects? I know that there is a hash function, but so far for strings I have not needed to implement this. What if I now want to hash a Java object that I create - do I need to implement the hash function? Or does Java have a built-in way of creating a hash code?
I know that the default implementation cannot be relied on as it bases the hash function on the memory address which is not constant.
You could answer many of these questions yourself, by reading the source code for HashMap.
(Hint: you can usually find the source code for Java SE classes using Google; e.g. search for "java.util.HashMap source".)
I understand the different hash map mechanisms and the ways in which key collisions are handled (open addressing with linear/quadratic probing, chaining, extendible hashing, etc.). Which one does HashSet/HashMap make use of?
Chaining. See the source code. (Line 154 in the version I linked to).
How does Java's HashSet/HashMap hash the objects?
It doesn't. The object's hashCode method is called to do this. See the source code. (line 360).
If you look at the code you will see some interesting wrinkles:
The code (in the version I linked to) is hashing Strings using a special method. (It appears that this is to allow hashing of strings to be "tuned" at the platform level. I didn't dig into this ...)
The hashcode returned by the Object.hashCode() call is "scrambled" further to reduce the chance of collisions. (Read the comment!)
What if I now want to hash a Java object that I create - do I need to implement the hash function?
You can do that.
Whether you need to do this depends on how you have defined equals for the class. Specifically, Java's HashMap, HashSet and related classes place the following requirements on hashCode() and equals(Object):
If a.equals(b) then a.hashCode() == b.hashCode().
While a is in a HashSet or is a key in a HashMap, the value returned by a.hashCode() must not change.
If !a.equals(b), then the probability that a.hashCode() == b.hashCode() should be low, especially if a and b are probable hash keys for the application.
(The last requirement is there for performance reasons. If you have a "poor" hash function that results in a high probability that different keys hash to the same hash code, you get lots of collisions. The hash chains will become unbalanced, and you won't get the average O(1) performance that is normally expected of hash table operations. In the worst case, performance will be O(N); i.e. equivalent to a linear search of a linked list.)
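To make that last point concrete, here is a small sketch (both classes are hypothetical, not from the question): a constant hashCode is legal under the contract, but it sends every key to the same bucket, while spreading the field values out keeps the chains short.

// Hypothetical example: both classes satisfy the equals/hashCode contract,
// but BadKey degrades HashMap lookups towards O(N) because every key collides.
final class BadKey {
    final int id;
    BadKey(int id) { this.id = id; }
    @Override public boolean equals(Object o) {
        return o instanceof BadKey && ((BadKey) o).id == id;
    }
    @Override public int hashCode() { return 42; }  // every instance hashes the same
}

final class GoodKey {
    final int id;
    GoodKey(int id) { this.id = id; }
    @Override public boolean equals(Object o) {
        return o instanceof GoodKey && ((GoodKey) o).id == id;
    }
    @Override public int hashCode() { return Integer.hashCode(id); }  // spreads keys across buckets
}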
Or does Java have a built in way of creating a hash code?
Every class inherits a default hashCode() method from Object (unless this is overridden). It uses what is known as an "identity hash code"; i.e. a hash value that is based on the object's identity (its reference). This matches the default implementation of equals(Object) ... which simply uses == to compare references.
I know that the default implementation cannot be relied on as it bases the hash function on the memory address which is not constant.
This is incorrect.
The default hashCode() method returns the "identity hashcode". This is typically based on the object's memory address at some point in time, but it is NOT the object's memory address.
In particular, if an object is moved by the garbage collector, its "identity hashcode" is guaranteed not to change. Yes. That's right, it DOES NOT CHANGE ... even though the object was moved!
(How they implement this efficiently is rather clever. See https://stackoverflow.com/a/3796963/139985 for details.)
The bottom line is that the default Object.hashCode() method satisfies all of the requirements that I listed above. It can be relied on.
Question 1)
The Java HashMap implementation uses chaining to deal with collisions. Think of it as an array of linked lists.
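As a rough sketch of that idea (illustrative only; the real java.util.HashMap adds resizing, hash spreading and, in recent versions, treeified bins), a chained table looks something like this:

// Simplified sketch of separate chaining; java.util.HashMap is considerably
// more sophisticated (resizing, treeified bins, hash spreading, etc.).
class ChainedMap<K, V> {
    private static class Node<K, V> {
        final K key;
        V value;
        Node<K, V> next;          // next entry in the same bucket
        Node(K key, V value, Node<K, V> next) {
            this.key = key; this.value = value; this.next = next;
        }
    }

    private final Node<K, V>[] buckets;

    @SuppressWarnings("unchecked")
    ChainedMap(int capacity) {
        buckets = (Node<K, V>[]) new Node[capacity];
    }

    V get(K key) {
        int index = Math.floorMod(key.hashCode(), buckets.length);
        for (Node<K, V> n = buckets[index]; n != null; n = n.next) {
            if (n.key.equals(key)) {
                return n.value;   // walk the chain until the key matches
            }
        }
        return null;
    }

    void put(K key, V value) {
        int index = Math.floorMod(key.hashCode(), buckets.length);
        for (Node<K, V> n = buckets[index]; n != null; n = n.next) {
            if (n.key.equals(key)) {
                n.value = value;  // replace existing mapping
                return;
            }
        }
        buckets[index] = new Node<>(key, value, buckets[index]);  // prepend to the chain
    }
}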
Question 2)
Object has a default implementation of equals and hashCode. equals is implemented as return this == other, and hashCode is (to all intents and purposes) implemented as assigning a random identifier to each instance and using that as the hash code.
As all classes in Java extend Object, they all inherit these implementations.
Some classes override these implementations by default. String, as you mentioned, is a very good example. Another is the classes in the collections API - so ArrayList implements these methods based on the elements it contains.
As far as implementing a good hashCode goes, this is a little bit of a dark art. Here's a pretty good summary of best practice.
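As a concrete example of the usual pattern (the Point class here is made up for illustration), the common practice is to hash exactly the fields that equals compares:

import java.util.Objects;

// Hypothetical value class illustrating the common equals/hashCode pattern.
final class Point {
    private final int x;
    private final int y;

    Point(int x, int y) { this.x = x; this.y = y; }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Point)) return false;
        Point other = (Point) o;
        return x == other.x && y == other.y;
    }

    @Override
    public int hashCode() {
        return Objects.hash(x, y);   // combines the same fields equals uses
    }
}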
Your final comment:
I know that the default implementation cannot be relied on as it bases the hash function on the memory address which is not constant.
This is not correct. The default implementation of hashCode is constant as that is part of the method's contract. From the Javadoc:
Whenever it is invoked on the same object more than once during an execution of a Java application, the hashCode method must consistently return the same integer, provided no information used in equals comparisons on the object is modified. This integer need not remain consistent from one execution of an application to another execution of the same application.
Related
In Kotlin as in Java, in the normal course of events, HashMap uses the equals and hashCode methods provided by the key class.
Suppose the key class does not provide those methods (that is, just uses the defaults provided by java.lang.Object), whether because it was written by someone else who did not foresee the need, or because it needs to have reference semantics in some other context and value semantics in the current one.
Is it possible to create a hash map and supply equals and hash code functions on the fly, without modifying the key class?
(I would be happy with either a Kotlin-specific solution, or one defined in terms of the Java standard library.)
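One possible approach, sketched here purely as an illustration (the KeyWrapper class below is hypothetical, not part of the standard library): wrap each key in a small adapter that delegates to caller-supplied functions, and use the wrapper as the actual map key.

import java.util.function.BiPredicate;
import java.util.function.ToIntFunction;

// Hypothetical adapter: supplies equals/hashCode behaviour for a key type
// without modifying the key class itself.
final class KeyWrapper<K> {
    final K key;
    private final BiPredicate<K, K> equality;
    private final ToIntFunction<K> hasher;

    KeyWrapper(K key, BiPredicate<K, K> equality, ToIntFunction<K> hasher) {
        this.key = key;
        this.equality = equality;
        this.hasher = hasher;
    }

    @Override
    @SuppressWarnings("unchecked")
    public boolean equals(Object o) {
        return o instanceof KeyWrapper && equality.test(key, ((KeyWrapper<K>) o).key);
    }

    @Override
    public int hashCode() {
        return hasher.applyAsInt(key);
    }
}

The map would then be declared over the wrapper type (e.g. a HashMap keyed by KeyWrapper<MyKey>), with every key wrapped on the way in; whether that indirection is acceptable depends on the use case.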
There is a generator in IntelliJ IDEA. You press Alt+Ins, choose 'equals() and hashCode()' and a wizard opens. You can choose fields for equals and then you can choose fields for hashCode(). Why can we choose different field sets? Doesn't that contradict the equals-hashCode contract?
As per the Javadoc of the Object class:
Note that it is generally necessary to override the {@code hashCode} method whenever this method is overridden, so as to maintain the general contract for the {@code hashCode} method, which states that equal objects must have equal hash codes.
By default, the equals method returns true only for inputs referring to the same object instance. An overridden equals might return true even for completely different objects (even objects with different field values); that depends entirely on your implementation.
The contract enforces that if your equals logic determines two different objects to be the same, your hashCode method should return the same value for those two objects.
This does not mean that you should be using the same fields for hashCode as well. This is all you need to take care of when overriding these functions.
Well, it doesn't really allow you to choose different field sets, it allows you to pick a subset of the fields for equals, to use for hashCode.
While this will probably lead to a poorer hash code since this will cause more hash collisions, it will technically still be correct. Note that the requirement is just that equal objects have equal hash codes, not that equal hash codes must be from equal objects. The latter would be impossible to achieve for classes that can have more different instances than there are ints (e.g. java.lang.Long).
There may be good reasons to use a suboptimal hash (one that produces more collisions) if calculating a better one would be too expensive compared to simply dealing with the collisions.
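As a small illustration (a made-up Person class), hashing only a subset of the fields that equals compares is still correct, just potentially slower:

// Hypothetical class: equals compares both fields, hashCode uses only one.
// The contract still holds (equal objects get equal hash codes), but two
// people with the same lastName always collide.
final class Person {
    final String firstName;
    final String lastName;

    Person(String firstName, String lastName) {
        this.firstName = firstName;
        this.lastName = lastName;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Person)) return false;
        Person p = (Person) o;
        return firstName.equals(p.firstName) && lastName.equals(p.lastName);
    }

    @Override
    public int hashCode() {
        return lastName.hashCode();   // subset of the equals fields
    }
}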
It seems that many classes (e.g. HashSet) assume that the hashCode of an object will not change. The documentation is clear about what the relationship between equals and hashCode should be.
But is it poor design to implement a hashCode that changes across an object's life-time?
There at least needs to be a point in the application where the hashCode is frozen while the object is in a collection that cares. Typically, the hashCode will change while you build up the object (e.g., adding to an ArrayList), then you add it to a collection and stop changing it. Later, if you remove it from the collection, you could mutate it again. I would say it is generally a best practice to use immutable data structures (à la String, or your own type with finals all the way down) with collections that rely on the hashCode (e.g., HashMap key or HashSet).
No, it's ok that the hashCode changes when a mutable object changes its internal state.
However, once the object is in a place that expects a constant hashCode, the application must make sure that the object is not mutated such that the hashCode changes.
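Here is a minimal sketch of that failure mode (the mutable Box class is hypothetical): once a key's hashCode changes while it sits in a HashSet, the set can no longer find it.

import java.util.HashSet;
import java.util.Set;

// Hypothetical mutable class whose hashCode depends on mutable state.
final class Box {
    int value;
    Box(int value) { this.value = value; }
    @Override public boolean equals(Object o) {
        return o instanceof Box && ((Box) o).value == value;
    }
    @Override public int hashCode() { return value; }
}

public class MutatedKeyDemo {
    public static void main(String[] args) {
        Set<Box> set = new HashSet<>();
        Box box = new Box(1);
        set.add(box);
        box.value = 2;                                    // hashCode changes while in the set
        System.out.println(set.contains(box));            // false: wrong bucket is searched
        System.out.println(set.contains(new Box(1)));     // false: old state is no longer equal
    }
}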
It depends on what you call the "lifetime". Your exact link states that:
Whenever it is invoked on the same object more than once during an execution of a Java application, the hashCode method must consistently return the same integer
This means that there is no guarantee whatsoever that the hash code of an object remain consistent across two different runs of the application.
But it is guaranteed that on a given run of an application (that is, an instance of a JVM running Java code), the hash code of an object will never change, provided no information used in equals comparisons on the object is modified.
The contract does guarantee this, but nothing more.
You are talking about different things.
If you want to use a HashMap or HashSet, the keys should be immutable objects.
So in that case the hashCode will be immutable too.
But in the general case, hashCode should change with the object's state (according to the fields that are significant for hashCode).
hashCode does not guarantee the same result in different executions. As the javadocs point out -
Whenever it is invoked on the same object more than once during an execution of a Java application, the hashCode method must consistently return the same integer, provided no information used in equals comparisons on the object is modified. This integer need not remain consistent from one execution of an application to another execution of the same application.
In practice, hash codes changing between executions is uncommon, and some classes in the class library even specify the exact formula they use to calculate hash codes (e.g. String). For those classes the hash code will always be the same. But while most hashCode implementations provide stable values, you must not rely on it.
Furthermore, some think that the hash code is a unique handle to an object. This is wrong and an anti-pattern. For example, the Strings "Aa" and "BB" produce the same hashCode: 2112.
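This is easy to verify directly:

// "Aa" and "BB" hash to the same value because String.hashCode uses
// s[0]*31 + s[1] for two-character strings: 65*31 + 97 == 66*31 + 66 == 2112.
public class CollisionDemo {
    public static void main(String[] args) {
        System.out.println("Aa".hashCode());                     // 2112
        System.out.println("BB".hashCode());                     // 2112
        System.out.println("Aa".hashCode() == "BB".hashCode());  // true
        System.out.println("Aa".equals("BB"));                   // false
    }
}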
This is what the Java documentation of Object.hashCode() says:
If two objects are equal according to the equals(Object) method, then
calling the hashCode method on each of the two objects must produce
the same integer result.
But they don't explain why two equal objects must return equal hash codes. Why did Oracle engineers decide that hashCode must be overridden when overriding equals?
The typical implementation of equals doesn't call the hashCode method:
@Override
public boolean equals(Object arg0) {
    if (this == arg0) {
        return true;
    }
    if (!(arg0 instanceof MyClass)) {
        return false;
    }
    MyClass another = (MyClass) arg0;
    // compare significant fields here and return the result
}
In Effective Java (2nd Edition) I read:
Item 9: Always override hashCode when you override equals.
A common source of bugs is the failure to override the hashCode method. You must override hashCode in every class that overrides equals. Failure to do so will result in a violation of the general contract for Object.hashCode, which will prevent your class from functioning properly in conjunction with all hash-based collections, including HashMap, HashSet, and Hashtable.
Suppose I don't need to use MyClass as a key of a hash table. Why do I need to override hashCode() in this case?
Of course, if you have a little program written only by yourself, and every time you use an external lib you check that it does not rely on hashCode(), then you can ignore all these warnings. But when a software project grows you will use external libraries, and these will rely on hashCode(), and you will lose a lot of time searching for bugs. Or in a newer Java version some other classes may use hashCode() too, and your program will fail.
So it's a lot easier to just implement this and follow this easy rule, because a modern IDE can auto-generate equals and hashCode with one click.
Update, a little story: At work we ignored this rule too in a lot of classes and only implemented the method we needed, mostly equals or compareTo. One day some strange things happened, because one programmer had used a Hash*-class in the GUI and our objects did not follow this rule. In the end an apprentice had to search all classes with equals and add the corresponding hashCode method.
As the text says, it's a violation of the general contract that is in use. Of course, if you never ever use it in any place where hashCode would be required, nobody is forcing you to implement it.
But what if some day in the future it is needed? What if the class is used by some other developer? That's why there is this contract that both of them have to be implemented, so there is no confusion.
Obviously if nobody ever calls your class's hashCode method, nobody will know that it's inconsistent with equals. Can you guarantee that for the duration of your project, including the years in maintenance, no one will need to, say, remove duplicate objects from a list or associate some extra bit of data with your objects?
You are probably just safer implementing hashCode so that it's consistent with equals. It's not particularly hard; always returning 0 is already a valid implementation.
(please don't just return 0 though)
The reason why hashCode has to be overridden to agree with equals is because of why and how hashCode is used. Hash codes are used as surrogates for values, so that when mapping a key value to something, hashing can be used to give near-constant lookup time with reasonable space. When two values compare equal (i.e., they are the same value), they have to map to the same something when used as a key for a hashed collection. That requires that they have the same hash code to get them there.
You have to override hashCode appropriately because the manual says you have to. It says you have to because the decision was made that libraries can assume that you are satisfying that contract so they (& you) can have the performance benefits of hashing when using a value that you gave it as a key in their functions' implementations.
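To make the failure concrete, here is a small sketch (the Name class is hypothetical, not from the question) where equals is overridden but hashCode is not, so a HashSet treats equal objects as distinct:

import java.util.HashSet;
import java.util.Set;

// Hypothetical class: equals is overridden, hashCode is inherited from Object.
final class Name {
    final String value;
    Name(String value) { this.value = value; }
    @Override public boolean equals(Object o) {
        return o instanceof Name && ((Name) o).value.equals(value);
    }
    // no hashCode override: two equal Names usually land in different buckets
}

public class MissingHashCodeDemo {
    public static void main(String[] args) {
        Set<Name> set = new HashSet<>();
        set.add(new Name("Ada"));
        System.out.println(new Name("Ada").equals(new Name("Ada"))); // true
        System.out.println(set.contains(new Name("Ada")));           // almost certainly false
        set.add(new Name("Ada"));
        System.out.println(set.size());                              // typically 2, not 1
    }
}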
The common wisdom dictates that the logic for // compare significant fields here is required for equals() and also for compareTo() (in case you want to sort instances of MyClass) and also for using hash tables. So it makes sense to put this logic in hashCode() and have the other methods use the hash code.
The method hashCode() in class Enum is final and defined as super.hashCode(), which means it returns a number based on the address of the instance, which is a random number from the programmer's POV.
Defining it e.g. as ordinal() ^ getClass().getName().hashCode() would be deterministic across different JVMs. It would even work a bit better, since the least significant bits would "change as much as possible", e.g., for an enum containing up to 16 elements and a HashMap of size 16, there'd be for sure no collisions (sure, using an EnumMap is better, but sometimes not possible, e.g. there's no ConcurrentEnumMap). With the current definition you have no such guarantee, have you?
Summary of the answers
Using Object.hashCode() compares to a nicer hashCode like the one above as follows:
PROS
simplicity
CONS
speed
more collisions (for any size of a HashMap)
non-determinism, which propagates to other objects, making them unusable for:
  deterministic simulations
  ETag computation
  hunting down bugs depending e.g. on a HashSet iteration order
I'd personally prefer the nicer hashCode, but IMHO no single reason weighs much, except maybe the speed.
UPDATE
I was curious about the speed and wrote a benchmark with surprising results. For the price of a single field per class you can get a deterministic hash code which is nearly four times faster. Storing the hash code in a field of each instance would be even faster, although only negligibly.
The explanation why the standard hash code is not much faster is that it can't be the object's address, as objects get moved by the GC.
UPDATE 2
There are some strange things going on with the hashCode performance in general. Even once those are understood, there remains the open question of why System.identityHashCode (reading from the object header) is way slower than accessing a normal object field.
The only reason for using Object's hashCode() and for making it final I can imagine, is to make me ask this question.
First of all, you should not rely on such mechanisms for sharing objects between JVMs. That's simply not a supported use case. When you serialize / deserialize you should rely on your own comparison mechanisms or only "compare" the results against objects within your own JVM.
The reason for letting enums' hashCode be implemented as Object's hash code (based on identity) is that, within one JVM, there will only be one instance of each enum object. This is enough to ensure that such an implementation makes sense and is correct.
You could argue "Hey, String and the wrappers for the primitives (Long, Integer, ...) all have well-defined, deterministic specifications of hashCode! Why don't the enums have that?" Well, to begin with, you can have several distinct string references representing the same string, which means that using super.hashCode would be an error, so these classes necessarily need their own hashCode implementations. For these core classes it made sense to let them have well-defined, deterministic hashCodes.
Why did they choose to solve it like this?
Well, look at the requirements of the hashCode implementation. The main concern is to make sure that each object should return a distinct hash code (unless it is equal to another object). The identity-based approach is super efficient and guarantees this, while your suggestion does not. This requirement is apparently stronger than any "convenience bonus" about easing up on serialization etc.
I think that the reason they made it final is to avoid developers shooting themselves in the foot by rewriting a suboptimal (or even incorrect) hashCode.
Regarding the chosen implementation: it's not stable across JVMs, but it's very fast, avoids collisions, and doesn't need an additional field in the enum. Given the normally small number of instances of an enum class, and the speed of the equals method, I wouldn't be surprised if the HashMap lookup time were bigger with your algorithm than with the current one, due to its additional complexity.
I asked the same question because I did not see this one. Why does Enum's hashCode() use the Object hashCode() implementation, instead of the ordinal() function?
I encountered it as a sort of a problem when defining my own hash function for an object that relies on an enum's hashCode as one of its components. When checking values in a Set of objects returned by the function, I checked them in an order I expected to be the same, since I define the hashCode myself, and so I expected elements to fall at the same nodes in the tree; but since the hashCode returned by the enum changes from run to run, this assumption was wrong, and the test could fail once in a while.
So, when I figured out the problem, I started using ordinal instead. I am not sure everyone writing hashCode for their object realizes this.
So basically, you cannot define your own deterministic hashCode while relying on an enum's hashCode; you need to use ordinal instead.
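A small sketch of that workaround (the Pixel class and the use of Objects.hash are my own illustration, not something prescribed by the answer):

import java.util.Objects;

enum Color { RED, GREEN, BLUE }

// Hypothetical class whose hash must be stable across JVM runs.
final class Pixel {
    final int x;
    final Color color;

    Pixel(int x, Color color) { this.x = x; this.color = color; }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Pixel)) return false;
        Pixel p = (Pixel) o;
        return x == p.x && color == p.color;
    }

    @Override
    public int hashCode() {
        // color.hashCode() would differ between runs; ordinal() is deterministic.
        return Objects.hash(x, color.ordinal());
    }
}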
P.S. This was too big for a comment :)
The JVM enforces that for an enum constant, only one object will exist in memory. There is no way that you could end up with two different instance objects of the same enum constant within a single VM, not with reflection, not across the network via serialization/deserialization.
That being said, since it is the only object to represent this constant, it doesn't matter that its hash code is its address, since no other object can occupy the same address space at the same time. It is guaranteed to be unique and "deterministic" (in the sense that in the same VM, in memory, all references to it will be the same, no matter what it is).
There is no requirement for hash codes to be deterministic between JVMs and no advantage gained if they were. If you are relying on this fact you are using them wrong.
As only one instance of each enum value exists, Object.hashCode() is guaranteed never to collide, is good code reuse and is very fast.
If equality is defined by identity, then Object.hashCode() will always give the best performance.
The determinism of other hash codes is just a side effect of their implementation. As their equality is usually defined by field values, mixing in non-deterministic values would be a waste of time.
As long as we can't send an enum object1 to a different JVM, I see no reason for putting such a requirement on enums (and objects in general).
1 I thought it was clear enough - an object is an instance of a class. A serialized object is a sequence of bytes, usually stored in a byte array. I was talking about an object.
One more reason it is implemented like this, I could imagine, is the requirement for hashCode() and equals() to be consistent, and the design goal that enums should be simple to use and compile-time constant (so they can be used as "case" constants). This also makes it legal to compare enum instances with "==", and you simply wouldn't want "equals" to behave differently from "==" for enums. This again ties hashCode to the default Object.hashCode() reference-based behaviour.
As said before, I also don't expect equals() and hashCode() to consider two enum constants from different JVMs as being equal. When talking about serialization: for instance fields typed as enums, the default binary serializer in Java has a special behaviour that serializes only the name of the constant, and on deserialization the reference to the corresponding enum value in the deserializing JVM is re-created. JAXB and other XML-based serialization mechanisms work in a similar way. So: just don't worry.