Hash map with custom equals and hash code functions


In Kotlin as in Java, in the normal course of events, HashMap uses the equals and hashCode methods provided by the key class.
Suppose the key class does not provide those methods (that is, just uses the defaults provided by java.lang.Object), whether because it was written by someone else who did not foresee the need, or because it needs to have reference semantics in some other context and value semantics in the current one.
Is it possible to create a hash map and supply equals and hash code functions on the fly, without modifying the key class?
(I would be happy with either a Kotlin-specific solution, or one defined in terms of the Java standard library.)
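One common way to get this effect with only the standard library is to wrap each key in a small object that carries the desired equality and hash functions. The following is a minimal sketch under that assumption, not a definitive answer; Wrapped and key are hypothetical names introduced here, and int[] merely stands in for a key class whose equals and hashCode are the identity-based defaults.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiPredicate;
import java.util.function.ToIntFunction;

final class WrappedKeyDemo {
    // Hypothetical wrapper: delegates equals/hashCode to caller-supplied functions.
    static final class Wrapped<T> {
        final T value;
        private final BiPredicate<T, T> equality;
        private final ToIntFunction<T> hasher;

        Wrapped(T value, BiPredicate<T, T> equality, ToIntFunction<T> hasher) {
            this.value = value;
            this.equality = equality;
            this.hasher = hasher;
        }

        @Override
        @SuppressWarnings("unchecked")
        public boolean equals(Object other) {
            if (this == other) return true;
            if (!(other instanceof Wrapped)) return false;
            // Assumes every key in one map wraps the same type with the same functions.
            return equality.test(value, ((Wrapped<T>) other).value);
        }

        @Override
        public int hashCode() {
            return hasher.applyAsInt(value);
        }
    }

    // Arrays use identity-based equals/hashCode, so they play the "unmodifiable key class" here.
    static Wrapped<int[]> key(int... values) {
        return new Wrapped<int[]>(values, Arrays::equals, Arrays::hashCode);
    }

    public static void main(String[] args) {
        Map<Wrapped<int[]>, String> map = new HashMap<>();
        map.put(key(1, 2, 3), "first");
        // Looked up with a different array instance that has the same contents.
        System.out.println(map.get(key(1, 2, 3)));
    }
}
```

The cost of this design is one extra allocation per lookup and the need to wrap and unwrap keys at the map boundary.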

Related

Why should two equal objects return equal hash codes if I don't want to use my object as a key in a hash table?

This is what the Java documentation of Object.hashCode() says:
If two objects are equal according to the equals(Object) method, then
calling the hashCode method on each of the two objects must produce
the same integer result.
But they don't explain why two equal objects must return equal hash codes. Why did Oracle engineers decide hashCode must be overridden when overriding equals?
The typical implementation of equals doesn't call the hashCode method:
    @Override
    public boolean equals(Object arg0) {
        if (this == arg0) {
            return true;
        }
        if (!(arg0 instanceof MyClass)) {
            return false;
        }
        MyClass another = (MyClass) arg0;
        // compare significant fields here ("id" is a placeholder field name)
        return java.util.Objects.equals(this.id, another.id);
    }
In Effective Java (2nd Edition) I read:
Item 9: Always override hashCode when you override equals.
A common source of bugs is the failure to override the hashCode
method. You must override hashCode in every class that overrides
equals. Failure to do so will result in a violation of the general
contract for Object.hashCode, which will prevent your class from
functioning properly in conjunction with all hash-based collections,
including HashMap, HashSet, and Hashtable.
Suppose I don't need to use MyClass as a key of a hash table. Why do I need to override hashCode() in this case?

Of course, when you have a small program written only by yourself, and you check every time you use an external library that it does not rely on hashCode(), then you can ignore all these warnings. But when a software project grows, you will use external libraries, and these will rely on hashCode(), and you will lose a lot of time searching for bugs. Or in a newer Java version some other classes will use hashCode() too, and your program will fail.
So it's a lot easier to just implement this and follow this easy rule, because a modern IDE can auto-generate equals and hashCode with one click.
Update, a little story: At work we ignored this rule too in a lot of classes and only implemented the one we needed, mostly equals or compareTo. One day some strange things happened, because one programmer had used a Hash* class in the GUI and our objects did not follow this rule. In the end an apprentice had to search all classes with equals and add the corresponding hashCode method.

As the text says, it's a violation of the general contract that is in use. Of course if you never ever use it in any place that hashCode would be required nobody is forcing you to implement it.
But what if some day in the future it is needed? What if the class is used by some other developer? That's why there is this contract that both of them have to be implemented, so there is no confusion.

Obviously if nobody ever calls your class's hashCode method, nobody will know that it's inconsistent with equals. Can you guarantee that for the duration of your project, including the years in maintenance, no one will need to, say, remove duplicate objects from a list or associate some extra bit of data with your objects?
You are probably just safer implementing hashCode so that it's consistent with equals. It's not particularly hard; always returning 0 is already a valid implementation.
(please don't just return 0 though)

The reason hashCode has to be overridden to agree with equals lies in why and how hashCode is used. Hash codes are used as surrogates for values, so that when mapping a key value to something, hashing can give near-constant lookup time with reasonable space. When two values compare equal (i.e., they are the same value), they have to map to the same something when used as a key for a hashed collection. That requires that they have the same hash code to get them there.
You have to override hashCode appropriately because the manual says you have to. It says you have to because the decision was made that libraries can assume that you are satisfying that contract so they (& you) can have the performance benefits of hashing when using a value that you gave it as a key in their functions' implementations.
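To make the point above concrete, here is a small sketch with two hypothetical classes, EqualsOnly and EqualsAndHash. Both define value-based equals, but only the second overrides hashCode, so only the second can be found reliably in a HashSet through an equal but distinct instance.

```java
import java.util.HashSet;
import java.util.Set;

final class HashContractDemo {
    // Overrides equals only; hashCode stays the identity-based default from Object.
    static final class EqualsOnly {
        final String name;
        EqualsOnly(String name) { this.name = name; }
        @Override public boolean equals(Object o) {
            return o instanceof EqualsOnly && name.equals(((EqualsOnly) o).name);
        }
    }

    // Overrides both, so equal instances land in the same bucket.
    static final class EqualsAndHash {
        final String name;
        EqualsAndHash(String name) { this.name = name; }
        @Override public boolean equals(Object o) {
            return o instanceof EqualsAndHash && name.equals(((EqualsAndHash) o).name);
        }
        @Override public int hashCode() { return name.hashCode(); }
    }

    static boolean foundEqualsOnly() {
        Set<EqualsOnly> set = new HashSet<>();
        set.add(new EqualsOnly("a"));
        return set.contains(new EqualsOnly("a"));
    }

    static boolean foundEqualsAndHash() {
        Set<EqualsAndHash> set = new HashSet<>();
        set.add(new EqualsAndHash("a"));
        return set.contains(new EqualsAndHash("a"));
    }

    public static void main(String[] args) {
        // Almost always false: the two instances have unrelated identity hash codes.
        System.out.println("equals only:     " + foundEqualsOnly());
        // Always true: equal objects report equal hash codes.
        System.out.println("equals and hash: " + foundEqualsAndHash());
    }
}
```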

The common wisdom dictates that the logic for // compare significant fields here is required for equals() and also for compareTo() (in case you want to sort instances of MyClass) and also for using hash tables. So it makes sense to put this logic in hashCode() and have the other methods use the hash code.

Annotation.equals() vs. Object.equals()

Some frameworks (e.g. guice) require in certain situations to create an implementing class of an annotation interface.
There seems to be a difference between the Annotation.equals(Object) and Object.equals(Object) definitions which need to be respected in that case (same applies for hashCode()).
Questions:
Why was it designed that way and what is the reason of the difference?
What side-effects can occur when using the Object.equals(Object) definition for annotation classes instead?
Update:
Additional questions:
What about the Annotation.hashCode() definition? Is it really required to implement it that way, especially the "(...)127 times the hash code of the member-name as computed by String.hashCode()) XOR the hash code(...)"-part?
What happens if a hashCode() method is implemented to be consistent to equals() but doesn't match the exact definition of Annotation.hashCode() (e.g. using 128 times the hash code of the member-name)?

The definitions are not different. The definition in Annotation is simply specialized for the annotation type.
The definition in Object basically says "If you decide to implement equals for your class, it should represent an equivalence relation that follows these rules".
In Annotation it defines an equivalence that follows those rules, which is meaningful specifically for Annotation instances.
In fact, the Annotation equivalence would work for many other classes. The point is that different classes have different meanings, and therefore their instances may have different equivalence relationships, and it's up to the programmer to decide which equivalence relation to use for his/her class. In Annotation, the contract is for this particular equivalence relation.
As for side effects - suppose an Annotation type inherited Object's equals. This is a mistake many people make when they try to use their own classes in maps or other equals()-dependent situations. Object has an equals() function that is the same as object identity: two references are equal only if they are references to the same object.
If you used that, then no two instances would be considered the same. You would not be able to create a second Annotation instance that would be equivalent to a previous one, despite them having the same values in their fields and semantically representing the same sort of behavior. So you wouldn't be able to tell if two items are annotated with the same annotation, when they have different instances of the same annotation.
As for the hashCode question, although Jeff Bowman has already answered that, I'll address that to make my answer more complete:
Basically, implementation of annotations is left to compilers, and the JLS doesn't dictate the exact implementation. It is also possible to create implementing classes, as your question itself mentions.
This means that annotation classes can come from different sources - different compilers (you are supposed to be able to run .class files anywhere, no matter which java compiler created them) and developer-created implementations.
The equals() and hashCode() methods are usually considered in a single class context, not in an interface context. This is because interfaces are usually antithetic to implementation - they only define contracts. When you create these methods for a particular class, you know that the object you compare with is supposed to be of the same class, and thus have the same implementation. Once it has a hashCode method that returns the same value for objects that are equivalent under equals for the same class, then whatever that implementation is, it satisfies the contract.
However, in this particular case, you have an interface, and you are required to make equals() and hashCode() work not only for two instances of the same class, but for instances of different classes that implement the same interface. This means that if you don't agree on a single implementation across all possible classes, you might get two instances of the same annotation with the same element values, and different hash codes. This would break the hashCode() contract.
As an example, imagine an annotation @SomeAnnotation that doesn't take parameters. Imagine that you implement it with a class SomeAnnotationImpl that returns 15 as the hash code. Two equal instances of SomeAnnotationImpl will have the same hash code, which is good. But the Java compiler would return 0 as the hash code when you check the returned instance of its own implementation of @SomeAnnotation. Therefore two objects of type Annotation are equal (they implement the same annotation interface and if they follow the equals() definition above, they should return true for equals), but have different hash codes. That breaks the contract.
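As a rough illustration of matching the built-in implementation, the sketch below hand-writes an implementation of a hypothetical one-member annotation Tag and compares it with the instance obtained through reflection. The hashCode follows the formula quoted from the Annotation Javadoc: the sum, over all members, of (127 times the member name's String.hashCode()) XOR the hash code of the member value. Tag, TagImpl and Holder are names invented for this example.

```java
import java.lang.annotation.Annotation;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

final class AnnotationHashDemo {
    @Retention(RetentionPolicy.RUNTIME)
    @interface Tag {
        String value();
    }

    @Tag("db")
    static final class Holder {}

    // Hand-written implementation that follows the Annotation equals/hashCode contract.
    static final class TagImpl implements Tag {
        private final String value;

        TagImpl(String value) { this.value = value; }

        @Override public String value() { return value; }

        @Override public Class<? extends Annotation> annotationType() { return Tag.class; }

        @Override public boolean equals(Object o) {
            // Equal to any Tag instance with the same member value, whatever its class.
            return o instanceof Tag && value.equals(((Tag) o).value());
        }

        @Override public int hashCode() {
            // Single member named "value": (127 * name hash) XOR value hash.
            return (127 * "value".hashCode()) ^ value.hashCode();
        }
    }

    static Tag reflective() {
        return Holder.class.getAnnotation(Tag.class);
    }

    static Tag handWritten(String value) {
        return new TagImpl(value);
    }

    public static void main(String[] args) {
        Tag real = reflective();
        Tag mine = handWritten("db");
        System.out.println("mine.equals(real): " + mine.equals(real));
        System.out.println("real.equals(mine): " + real.equals(mine));
        System.out.println("same hash code:    " + (real.hashCode() == mine.hashCode()));
    }
}
```

Changing the multiplier to 128 in TagImpl.hashCode() would keep TagImpl consistent with itself but would no longer agree with the reflective instance, which is the situation the question asks about.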

RealSkeptic's answer is great, but I'll put it a slightly different way.
This is a specific instance of a general problem:
You defined an interface (specifically an annotation).
Someone (javac) wrote a particular (built-in) implementation of that interface. You can't access that implementation, but need to be able to create equal instances, particularly for use in Sets and Maps. (Guice is one big Map<Key, Provider> after all.)
The implementor (javac) wrote a custom implementation of equals so that annotation instances with the same parameters pass equals. You need to match that implementation so that equals is symmetric (a.equals(b) if and only if b.equals(a), which is assumed in Java along with reflexivity, consistency, and transitivity).
Equal objects must have equal hashCodes because Java uses it as a shortcut for equality: if objects have unequal hashCodes then they cannot be equal. This comes in handy to make the efficient Map implementation HashMap, because you can use the hashCode to only check objects in the right hashCode-determined bucket. If you used a different or modified hashCode algorithm, you'd be breaking spec in theory, and in practice your annotation implementation wouldn't match others consistently in HashSet or HashMap (rendering it worthless to Guice). Many other features use hashCode, but those are the most obvious examples.
It would be much easier if Java let you instantiate their implementation, or generate an implementation automatically for your class, but here the best they've done is an exact spec for you to match.
So yes, you'll run into this with annotations more often than anything else, but these matter any time you're trying to act equal with an implementation you can't control or use yourself.

The above answers are excellent general answers to the question, but since I haven't seen them mentioned I'll just add that the use of AnnotationLiteral for implementing Annotations takes care of the equals and hashCode issues properly. There are a couple to choose from:
AnnotationLiteral
AnnotationLiteral

Hash function for creating a generic hash table in Java (for learning purposes)

If you were creating a generic hash table in Java (assume it didn't already have one), then how would you implement its default hash function? I know you can pass one in (via an interface), but most data structures have defaults.
My Attempt:
As Java generics require reference types, and all reference types in Java implement hashCode(), I figured that you could just use T.hashCode() % backingArraySize as the hash function, and that this would be sufficient. After all, the implementer of any type you may store in the hash table should give their type an appropriate hashCode() function, right?
Is there a better way to do this?

In my hashtable implementation I decided to use plain hashCode() % backingArraySize (i.e. your suggestion) when the algorithm isn't subject to primary clustering, and hashCode() * 2654435761 (the constant is taken from this answer) when it is, i.e. for the linear hashing implementation. The reason is that many default hashCode() implementations don't distribute values well across the full int range (all numeric boxed types, String, List), and when the keys are somehow biased, linear hashing may suffer from primary clustering.
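The two index computations can be sketched as follows. One detail worth noting is that hashCode() may be negative, and in Java the % operator then yields a negative remainder, so this sketch uses Math.floorMod to keep the index inside the array. The method names and the way the multiplied value is reduced are assumptions made for illustration, not the answerer's actual code.

```java
final class BucketIndexDemo {
    // The multiplier quoted in the answer above.
    static final long MULTIPLIER = 2654435761L;

    // Plain variant: hash code reduced modulo the table size, never negative.
    static int plainIndex(Object key, int capacity) {
        return Math.floorMod(key.hashCode(), capacity);
    }

    // Scrambled variant: multiply first to spread biased hash codes, then reduce.
    static int scrambledIndex(Object key, int capacity) {
        long mixed = key.hashCode() * MULTIPLIER; // int * long cannot overflow a long here
        return (int) Math.floorMod(mixed, (long) capacity);
    }

    public static void main(String[] args) {
        Integer negative = -7;
        System.out.println(negative.hashCode() % 16);      // -7, unusable as an array index
        System.out.println(plainIndex(negative, 16));      // 9
        System.out.println(scrambledIndex(negative, 16));  // some value in [0, 16)
    }
}
```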

clarifying facts behind Java's implementation of HashSet/HashMap

1.
I understand the different hash map mechanisms and the ways in which key collisions are handled (open addressing with linear/quadratic probing, chaining, extendable hashing, etc.). Which one does HashSet/HashMap make use of?
2.
I realise that a good HashMap relies on a good hash function. How does Java's HashSet/HashMap hash the objects? I know that there is a hash function but so far for strings I have not needed to implement this. What if I now want to Hash a Java Object that I create - do I need to implement the hash function? Or does Java have a built in way of creating a hash code?
I know that the default implementation cannot be relied on as it bases the hash function on the memory address which is not constant.

You could answer many of these questions yourself, by reading the source code for HashMap.
(Hint: you can usually find the source code for Java SE classes using Google; e.g. search for "java.util.HashMap source".)
I understand the different hash map mechanisms and the ways in which key collisions are handled (open addressing with linear/quadratic probing, chaining, extendable hashing, etc.). Which one does HashSet/HashMap make use of?
Chaining. See the source code. (Line 154 in the version I linked to).
How does Java's HashSet/HashMap hash the objects?
It doesn't. The object's hashCode method is called to do this. See the source code. (line 360).
If you look at the code you will see some interesting wrinkles:
The code (in the version I linked to) is hashing Strings using a special method. (It appears that this is to allow hashing of strings to be "tuned" at the platform level. I didn't dig into this ...)
The hashcode returned by the Object.hashCode() call is "scrambled" further to reduce the chance of collisions. (Read the comment!)
What if I now want to Hash a Java Object that I create - do I need to implement the hash function?
You can do that.
Whether you need to do this depends on how you have defined equals for the class. Specifically, Java's HashMap, HashSet and related classes place the following requirement on hashcode() and equals(Object):
If a.equals(b) then a.hashCode() == b.hashCode().
While a is in a HashSet or is a key in a HashMap, the value returned by a.hashCode() must not change.
If !a.equals(b), then the probability that a.hashCode() == b.hashCode() should be low, especially if a and b are probably hash keys for the application.
(The last requirement is for performance reasons. If you have a "poor" hash function that results in a high probability that different keys hash to the same hashcode, you get lots of collisions. The hash chains will become unbalanced, and you won't get the average O(1) performance that is normally expected of hash table operations. In the worst case, performance will be O(N); i.e. equivalent to a linear search of a linked list.)
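The second requirement in the list above (the hash code must not change while the object is in the collection) is the one that tends to surprise people. A minimal sketch, using a hypothetical mutable Point class:

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

final class MutableKeyDemo {
    // Mutable class whose equals and hashCode depend on its fields.
    static final class Point {
        int x;
        int y;
        Point(int x, int y) { this.x = x; this.y = y; }
        @Override public boolean equals(Object o) {
            return o instanceof Point && x == ((Point) o).x && y == ((Point) o).y;
        }
        @Override public int hashCode() { return Objects.hash(x, y); }
    }

    // Returns true if the element can no longer be found after being mutated in place.
    static boolean lostAfterMutation() {
        Set<Point> set = new HashSet<>();
        Point p = new Point(1, 2);
        set.add(p);
        p.x = 99; // the hash code changes while p is still stored under the old one
        return !set.contains(p);
    }

    public static void main(String[] args) {
        System.out.println("lost after mutation: " + lostAfterMutation()); // true
    }
}
```

The set still holds the object (its size is 1), but the lookup uses the new hash code, which no longer matches the hash recorded when the element was added.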
Or does Java have a built in way of creating a hash code?
Every class inherits a default hashCode() method from Object (unless this is overridden). It uses what is known as an "identity hash code"; i.e. a hash value that is based on the object's identity (its reference). This matches the default implementation of equals(Object) ... which simply uses == to compare references.
I know that the default implementation cannot be relied on as it bases the hash function on the memory address which is not constant.
This is incorrect.
The default hashCode() method returns the "identity hashcode". This is typically based on the object's memory address at some point in time, but it is NOT the object's memory address.
In particular, if an object is moved by the garbage collector, its "identity hashcode" is guaranteed not to change. Yes. That's right, it DOES NOT CHANGE ... even though the object was moved!
(How they implement this efficiently is rather clever. See https://stackoverflow.com/a/3796963/139985 for details.)
The bottom line is that the default Object.hashCode() method satisfies all of the requirements that I listed above. It can be relied on.
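A quick way to convince yourself of this is the sketch below. It relies only on Object.hashCode() and System.identityHashCode(); note that System.gc() is merely a hint, so the check does not prove that the object was actually moved.

```java
final class IdentityHashDemo {
    // Checks that the default hash code stays the same across a requested GC
    // and that it equals System.identityHashCode for a plain Object.
    static boolean stableAcrossGc() {
        Object o = new Object();
        int before = o.hashCode();
        System.gc(); // only a request; the JVM may or may not relocate the object
        int after = o.hashCode();
        return before == after && after == System.identityHashCode(o);
    }

    public static void main(String[] args) {
        System.out.println("stable: " + stableAcrossGc()); // true
    }
}
```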

Question 1)
The Java HashMap implementation uses the chaining implementation to deal with collisions. Think of it as an array of linked lists.
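To picture the "array of linked lists" idea, here is a deliberately simplified sketch. It is not the real java.util.HashMap code: it has a fixed capacity, never resizes, does not accept null keys, and ChainedMapSketch is a name made up for this example.

```java
import java.util.LinkedList;

final class ChainedMapSketch<K, V> {
    private static final class Entry<K, V> {
        final K key;
        V value;
        Entry(K key, V value) { this.key = key; this.value = value; }
    }

    // One linked list ("chain") per bucket.
    private final LinkedList<Entry<K, V>>[] buckets;

    @SuppressWarnings("unchecked")
    ChainedMapSketch(int capacity) {
        buckets = new LinkedList[capacity];
        for (int i = 0; i < capacity; i++) {
            buckets[i] = new LinkedList<>();
        }
    }

    // hashCode picks the bucket; floorMod keeps the index non-negative.
    private LinkedList<Entry<K, V>> bucketFor(K key) {
        return buckets[Math.floorMod(key.hashCode(), buckets.length)];
    }

    void put(K key, V value) {
        LinkedList<Entry<K, V>> chain = bucketFor(key);
        for (Entry<K, V> e : chain) {
            if (e.key.equals(key)) { // equals decides within the chain
                e.value = value;
                return;
            }
        }
        chain.add(new Entry<>(key, value));
    }

    V get(K key) {
        for (Entry<K, V> e : bucketFor(key)) {
            if (e.key.equals(key)) {
                return e.value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Capacity 1 forces every key into the same chain.
        ChainedMapSketch<String, Integer> map = new ChainedMapSketch<>(1);
        map.put("a", 1);
        map.put("b", 2);
        System.out.println(map.get("a") + " " + map.get("b")); // 1 2
    }
}
```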
Question 2)
Object has a default implementation of equals and hashCode. equals is implemented as return this == other and hashcode is (to all intents and purposes) implemented as assigning a random identifier to each instance and using that as the hashCode.
As all classes in Java extend Object, they all inherit these implementations.
Some classes override these implementations by default. String, as you mentioned, is a very good example. Another is the classes in the collections API - so ArrayList implements these methods based on the elements it contains.
As far as implementing a good hashCode, this is a little bit of a dark art. Here's a pretty good summary of best practice.
Your final comment:
I know that the default implementation cannot be relied on as it bases the hash function on the memory address which is not constant.
This is not correct. The default implementation of hashCode is constant as that is part of the method's contract. From the Javadoc:
Whenever it is invoked on the same object more than once during an
execution of a Java application, the hashCode method must consistently
return the same integer, provided no information used in equals
comparisons on the object is modified. This integer need not remain
consistent from one execution of an application to another execution
of the same application.

Any disadvantage to using arbitrary objects as Map keys in Java?

I have two kinds of objects in my application where every object of one kind has exactly one corresponding object of the other kind.
The obvious choice to keep track of this relationship is a Map<type1, type2>, like a HashMap. But somehow, I'm suspicious. Can I use an object as a key in the Map, pass it around, have it sitting in another collection, too, and retrieve its partner from the Map any time?
After an object is created, all I'm passing around is an identifier, right? So probably no problem there. What if I serialize and deserialize the key?
Any other caveats? Should I use something else to correlate the object pairs, like a number I generate myself?

1. The key needs to implement .equals() and .hashCode() correctly.
2. The key must not be changed in any way that changes its .hashCode() value while it's used as the key.
3. Ideally any object used as a key in a HashMap should be immutable. This would automatically ensure that 2. always holds.
4. Objects that could otherwise be GCed might be kept around when they are used as key and/or value.

I have two kinds of objects in my
application where every object of one
kind has exactly one corresponding
object of the other kind.
This really sounds like a has-a relationship and thus could be implemented using a simple attribute.

It depends on the implementation of the map you choose:
HashMap uses equals() and hashCode(). By default (in Object) these are based on the object identity, which will work OK unless you serialize/deserialize. With a proper implementation of equals() and hashCode() based on the content of the object you will have no problems, as long as you don't modify it while it is a key in a hash map.
TreeMap uses compareTo(). There is no default implementation, so you need to provide one. The same limitations apply as for implementing hashCode() and equals() above.
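For the TreeMap case, the ordering does not have to come from the key class itself: the TreeMap constructor also accepts a Comparator. A small sketch using the standard String.CASE_INSENSITIVE_ORDER comparator (the class and method names are made up for the example):

```java
import java.util.Map;
import java.util.TreeMap;

final class TreeMapComparatorDemo {
    // The comparator, not String.equals, decides which keys count as the same.
    static Map<String, Integer> caseInsensitiveMap() {
        return new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
    }

    public static void main(String[] args) {
        Map<String, Integer> map = caseInsensitiveMap();
        map.put("hello", 1);
        System.out.println(map.get("HELLO"));       // 1
        System.out.println(map.containsKey("bye")); // false
    }
}
```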

You could use a standard Map, but doing so you will keep strong references to your objects in the Map. If your objects are referenced in another structure and you need the Map just to link them together consider using a WeakHashMap.
And BTW you don't have to override equals and hashCode unless you have to consider several instances of an object as equal...

Can I use an object as a key in the Map, pass it around, have it sitting in another collection, too, and retrieve its partner from the Map any time?
Yes, no problem here at all.
After an object is created, all I'm passing around is an identifier, right? So probably no problem there. What if I serialize and deserialize the key?
That's right, you are only passing a reference around - they will all point to the same actual object. If you serialize or deserialize the object, that would create a new object. However, if your object implements equals and hashCode properly, you should still be able to use the new deserialized object to retrieve items from the map.
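The serialization point can be checked with a short round trip through an in-memory byte array. This is only a sketch; Id is a hypothetical key class with value-based equals and hashCode.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

final class SerializedKeyDemo {
    static final class Id implements Serializable {
        private static final long serialVersionUID = 1L;
        final String code;
        Id(String code) { this.code = code; }
        @Override public boolean equals(Object o) {
            return o instanceof Id && code.equals(((Id) o).code);
        }
        @Override public int hashCode() { return code.hashCode(); }
    }

    // Serializes and deserializes the key, producing a new but equal object.
    static Id roundTrip(Id original) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(original);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (Id) in.readObject();
        }
    }

    static boolean foundAfterRoundTrip() {
        try {
            Id original = new Id("A-1");
            Map<Id, String> map = new HashMap<>();
            map.put(original, "partner");
            Id copy = roundTrip(original);
            // The copy is a different object, yet it finds the same entry.
            return copy != original && "partner".equals(map.get(copy));
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("found after round trip: " + foundAfterRoundTrip()); // true
    }
}
```

With the default identity-based equals and hashCode, the same lookup with the deserialized copy would be expected to return null.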
Any other caveats? Should I use something else to correlate the object pairs, like a number I generate myself?
As for caveats: yes, you can't change anything that would cause the hashCode of the object to change while the object is in the Map.

Any object can be a map key. The important thing here is to make sure that you override .equals() and .hashCode() for any objects that will be used as map keys.
The reason you do this is that if you don't, equals will be understood as object identity, and the only way you'll be able to find "equal" map keys is to have a handle to the original object itself.
You override hashcode because it needs to be consistent with equals. This is so that objects that you've defined as equals hash identically.

The failure points are the hashcode and equals functions. If they don't produce consistent and proper return values, the Map will behave strangely. Effective Java has a whole section on them and is highly, highly recommended.

You might consider BiMap from Google Collections (now part of Guava).
