In a nutshell, the hashCode contract, according to Java's Object.hashCode():
1. The hash code shouldn't change unless something affecting equals() changes.
2. equals() implies hash codes are ==.
Let's assume interest primarily in immutable data objects - their information never changes after they're constructed, so #1 is assumed to hold. That leaves #2: the problem is simply one of confirming that equals implies hash code ==.
Obviously, we can't test every conceivable data object unless that set is trivially small. So, what is the best way to write a unit test that is likely to catch the common cases?
Since the instances of this class are immutable, there are limited ways to construct such an object; this unit test should cover all of them if possible. Off the top of my head, the entry points are the constructors, deserialization, and constructors of subclasses (which should be reducible to the constructor call problem).
[I'm going to try to answer my own question via research. Input from other StackOverflowers is a welcome safety mechanism to this process.]
[This could be applicable to other OO languages, so I'm adding that tag.]
EqualsVerifier is a relatively new open source project and it does a very good job at testing the equals contract. It doesn't have the issues the EqualsTester from GSBase has. I would definitely recommend it.
My advice would be to think of why/how this might ever not hold true, and then write some unit tests which target those situations.
For example, let's say you had a custom Set class. Two sets are equal if they contain the same elements, but it's possible for the underlying data structures of two equal sets to differ if those elements are stored in a different order. For example:
MySet s1 = new MySet( new String[]{"Hello", "World"} );
MySet s2 = new MySet( new String[]{"World", "Hello"} );
assertEquals(s1, s2);
assertTrue( s1.hashCode()==s2.hashCode() );
In this case, the order of the elements in the sets might affect their hash, depending on the hashing algorithm you've implemented. So this is the kind of test I'd write, since it tests the case where I know it would be possible for some hashing algorithm to produce different results for two objects I've defined to be equal.
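For instance, here is a minimal sketch of an order-insensitive implementation that would satisfy that test; the list-based storage and the elements field name are assumptions of mine, not part of the question:

import java.util.Arrays;
import java.util.List;

public class MySet {
    private final List<String> elements;  // insertion order is preserved here

    public MySet(String[] items) {
        this.elements = Arrays.asList(items);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof MySet)) return false;
        MySet other = (MySet) o;
        // Order-independent equality: same elements, regardless of storage order.
        return elements.size() == other.elements.size()
                && elements.containsAll(other.elements)
                && other.elements.containsAll(elements);
    }

    @Override
    public int hashCode() {
        // Summing element hash codes keeps the hash order-independent,
        // so s1 and s2 above produce the same value.
        int h = 0;
        for (String e : elements) {
            h += e.hashCode();
        }
        return h;
    }
}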
You should use a similar standard with your own custom class, whatever that is.
It's worth using the JUnit-addons for this. Check out the class EqualsHashCodeTestCase (http://junit-addons.sourceforge.net/): you can extend it and implement createInstance and createNotEqualInstance, and it will check that the equals and hashCode methods are correct.
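A sketch of what that extension looks like, assuming the junitx.extensions package name and a hypothetical MyClass with a String constructor:

import junitx.extensions.EqualsHashCodeTestCase;

public class MyClassEqualsHashCodeTest extends EqualsHashCodeTestCase {

    public MyClassEqualsHashCodeTest(String name) {
        super(name);
    }

    @Override
    protected Object createInstance() throws Exception {
        return new MyClass("foo");          // any instance
    }

    @Override
    protected Object createNotEqualInstance() throws Exception {
        return new MyClass("bar");          // must not be equal to the one above
    }
}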
I would recommend the EqualsTester from GSBase. It does basically what you want. I have two (minor) problems with it though:
The constructor does all the work, which I don't consider to be good practice.
It fails when an instance of class A equals an instance of a subclass of class A, which is not necessarily a violation of the equals contract.
[At the time of this writing, three other answers were posted.]
To reiterate, the aim of my question is to find standard cases of tests to confirm that hashCode and equals are agreeing with each other. My approach to this question is to imagine the common paths taken by programmers when writing the classes in question, namely, immutable data. For example:
1. Wrote equals() without writing hashCode(). This often means equality was defined to mean equality of the fields of two instances.
2. Wrote hashCode() without writing equals(). This may mean the programmer was seeking a more efficient hashing algorithm.
In the case of #2, the problem seems nonexistent to me. No additional pairs of instances have been made equals(), so no additional instances are required to have equal hash codes. At worst, the hash algorithm may yield poorer performance for hash maps, which is outside the scope of this question.
In the case of #1, the standard unit test entails creating two instances of the same object with the same data passed to the constructor, and verifying equal hash codes. What about false positives? It's possible to pick constructor parameters that just happen to yield equal hash codes on a nonetheless unsound algorithm. A unit test that tends to avoid such parameters would fulfill the spirit of this question. The shortcut here is to inspect the source code for equals(), think hard, and write a test based on that, but while this may be necessary in some cases, there may also be common tests that catch common problems - and such tests also fulfill the spirit of this question.
For example, if the class to be tested (call it Data) has a constructor that takes a String, and instances constructed from Strings that are equals() yield instances that are equals(), then a good test would probably test:
new Data("foo")
another new Data("foo")
We could even check the hash code for new Data(new String("foo")), to force the String to not be interned, although that's more likely to yield a correct hash code than Data.equals() is to yield a correct result, in my opinion.
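A sketch of such a test (JUnit 4 style; Data is the hypothetical class described above, assumed to define equals() on its String):

import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class DataHashCodeTest {

    @Test
    public void equalInstancesHaveEqualHashCodes() {
        Data a = new Data("foo");
        Data b = new Data("foo");
        assertEquals(a, b);                       // precondition: they are equals()
        assertEquals(a.hashCode(), b.hashCode()); // the contract under test

        // Force a non-interned String to rule out accidental identity effects.
        Data c = new Data(new String("foo"));
        assertEquals(a.hashCode(), c.hashCode());
    }
}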
Eli Courtwright's answer is an example of thinking hard about a way to break the hash algorithm based on knowledge of the equals specification. The example of a special collection is a good one, as user-made Collections do turn up at times, and are quite prone to muckups in the hash algorithm.
This is one of the only cases where I would have multiple asserts in a test. Since you need to test the equals method you should also check the hashCode method at the same time. So on each of your equals method test cases check the hashCode contract as well.
A one = new A(...);
A two = new A(...);
assertEquals("These should be equal", one, two);
int oneCode = one.hashCode();
assertEquals("HashCodes should be equal", oneCode, two.hashCode());
assertEquals("HashCode should not change", oneCode, one.hashCode());
And of course checking for a good hashCode is another exercise. Honestly I wouldn't bother to do the double check to make sure the hashCode wasn't changing in the same run; that sort of problem is better handled by catching it in a code review and helping the developer understand why that's not a good way to write hashCode methods.
You can also use something similar to http://code.google.com/p/guava-libraries/source/browse/guava-testlib/src/com/google/common/testing/EqualsTester.java
to test equals and hashCode.
If I have a class Thing, as most others do I write a class ThingTest, which holds all the unit tests for that class. Each ThingTest has a method
public static void checkInvariants(final Thing thing) {
...
}
and if the Thing class overrides hashCode and equals it has a method
public static void checkInvariants(final Thing thing1, final Thing thing2) {
    ObjectTest.checkInvariants(thing1, thing2);
    // ... invariants that are specific to Thing
}
That method is responsible for checking all invariants that are designed to hold between any pair of Thing objects. The ObjectTest method it delegates to is responsible for checking all invariants that must hold between any pair of objects. As equals and hashCode are methods of all objects, that method checks that hashCode and equals are consistent.
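As a sketch, the delegated ObjectTest.checkInvariants might look something like this (JUnit assertions assumed; only the equals/hashCode consistency check is shown):

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertTrue;

public final class ObjectTest {

    private ObjectTest() {
    }

    // Invariants that must hold between any pair of objects.
    public static void checkInvariants(final Object o1, final Object o2) {
        final boolean forward = o1.equals(o2);
        final boolean backward = o2.equals(o1);
        // equals must be symmetric.
        assertEquals("equals must be symmetric", forward, backward);
        if (forward) {
            // Objects that compare equal must report equal hash codes.
            assertTrue("equal objects must have equal hash codes",
                    o1.hashCode() == o2.hashCode());
        }
    }
}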
I then have some test methods that create pairs of Thing objects, and pass them to the pairwise checkInvariants method. I use equivalence partitioning to decide what pairs are worth testing. I usually create each pair to be different in only one attribute, plus a test that tests two equivalent objects.
I also sometimes have a 3-argument checkInvariants method, although I find that it is less useful in finding defects, so I do not do this often.
Related
This is what the Java documentation of Object.hashCode() says:
If two objects are equal according to the equals(Object) method, then
calling the hashCode method on each of the two objects must produce
the same integer result.
But they don't explain why two equal objects must return equal hash codes. Why did Oracle engineers decide hashCode must be overridden when overriding equals?
The typical implementation of equals doesn't call the hashCode method:
@Override
public boolean equals(Object arg0) {
    if (this == arg0) {
        return true;
    }
    if (!(arg0 instanceof MyClass)) {
        return false;
    }
    MyClass another = (MyClass) arg0;
    // compare significant fields here, e.g. with a hypothetical field `value`:
    return this.value.equals(another.value);
}
In Effective Java (2nd Edition) I read:
Item 9: Always override hashCode when you override equals.
A common source of bugs is the failure to override the hashCode
method. You must override hashCode in every class that overrides
equals. Failure to do so will result in a violation of the general
contract for Object.hashCode, which will prevent your class from
functioning properly in conjunction with all hash-based collections,
including HashMap, HashSet, and Hashtable.
Suppose I don't need to use MyClass as a key of a hash table. Why do I need to override hashCode() in this case?
Of course, if you have a little program written only by yourself, and you check every time you use an external library that it does not rely on hashCode(), then you can ignore all these warnings. But when a software project grows you will use external libraries, these will rely on hashCode(), and you will lose a lot of time searching for bugs. Or in a newer Java version some other class will use hashCode() too, and your program will fail.
So it's a lot easier to just implement it and follow this easy rule, because a modern IDE can auto-generate equals and hashCode with one click.
Update, a little story: at work we ignored this rule too in a lot of classes and only implemented the method we needed, mostly equals or compareTo. One day some strange things happened, because a programmer had used a Hash*-class in the GUI and our objects did not follow this rule. In the end an apprentice had to search all classes with equals and add the corresponding hashCode method.
As the text says, it's a violation of the general contract that is in use. Of course, if you never ever use it in any place where hashCode would be required, nobody is forcing you to implement it.
But what if some day in the future it is needed? What if the class is used by some other developer? That's why there is this contract that both of them have to be implemented, so there is no confusion.
Obviously if nobody ever calls your class's hashCode method, nobody will know that it's inconsistent with equals. Can you guarantee that for the duration of your project, including the years in maintenance, no one will need to, say, remove duplicate objects from a list or associate some extra bit of data with your objects?
You are probably just safer implementing hashCode so that it's consistent with equals. It's not particularly hard, always returning 0 is already a valid implementation.
(please don't just return 0 though)
The reason why hashCode has to be overridden to agree with equals is because of why and how hashCode is used. Hash codes are used as surrogates for values so that, when mapping a key value to something, hashing can be used to give near-constant lookup time with reasonable space. When two values compare equal (i.e., they are the same value) then they have to map to the same something when used as a key for a hashed collection. That requires that they have the same hash code to get them there.
You have to override hashCode appropriately because the manual says you have to. It says you have to because the decision was made that libraries can assume that you are satisfying that contract so they (& you) can have the performance benefits of hashing when using a value that you gave it as a key in their functions' implementations.
The common wisdom dictates that the logic for // compare significant fields here is required for equals() and also for compareTo() (in case you want to sort instances of MyClass) and also for using hash tables, so it makes sense to put this logic in hashCode() and have the other methods use the hash code.
I have a class with a few fields, one of which is an int and two of which are long. What I'm thinking of doing is adding a check in equals() so that if an Integer object is passed in, it will compare it against the int field and return true if they are the same. Likewise, if a Long is passed in and it is between the two long fields, it will return true.
So, if I add several of these objects to a List or Set, I can then do a get() and have it automatically give me the first object that matches. My thought is if I do this, then I simply make the get() call, and then I'll have it, instead of having to have an extra loop & checks.
Is this a good idea or bad idea compared to simply iterating over all of the objects and doing the comparisons that way?
Don't do this.
The equals() method has a well-defined contract, and your proposed implementation violates it. For example, it won't be symmetric; if x is your object and y is an Integer, y.equals(x) will be false even when x.equals(y) is true. Breaking these rules will confuse anyone who has to work with your code, even yourself in the future, when you are more accustomed to using this method correctly.
Your use cases sound like they could be satisfied with a NavigableMap, where keys are integers, and values are instances of your class.
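One way the range-lookup part of that could look, as a sketch keyed by the start of each range rather than the int field; the MyClass fields and the assumption of non-overlapping ranges are mine, not the question's:

import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

public class RangeLookup {

    // Hypothetical holder with the fields described in the question.
    static final class MyClass {
        final int id;
        final long rangeStart;
        final long rangeEnd;

        MyClass(int id, long rangeStart, long rangeEnd) {
            this.id = id;
            this.rangeStart = rangeStart;
            this.rangeEnd = rangeEnd;
        }
    }

    // Keyed by the start of each range; assumes ranges do not overlap.
    private final NavigableMap<Long, MyClass> byRangeStart = new TreeMap<>();

    public void add(MyClass value) {
        byRangeStart.put(value.rangeStart, value);
    }

    // Returns the object whose [rangeStart, rangeEnd] contains the key, or null.
    public MyClass findByLong(long key) {
        Map.Entry<Long, MyClass> entry = byRangeStart.floorEntry(key);
        if (entry != null && key <= entry.getValue().rangeEnd) {
            return entry.getValue();
        }
        return null;
    }
}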
The performance will be the same but the code will be obfuscated. A different developer (or yourself in a couple of months) will just expect equals() to check whether an object is equal.
I would go for a more explicit solution.
Your equals method should have one concrete implementation that does not depend on the type of the object being passed in; read the equals contract here, as anyone reading your code or Javadoc will expect it to behave as per the contract.
For such cases you can write your own custom Comparator and use it to search for your objects in a collection.
Or have separate equality methods such as checkIntEquality and checkLongEquality and call them as appropriate.
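As a rough sketch of the Comparator route (the Item class and its intField are hypothetical stand-ins for the class in the question):

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class ComparatorSearchExample {

    // Hypothetical holder with the int field from the question.
    static final class Item {
        final int intField;
        Item(int intField) { this.intField = intField; }
    }

    public static void main(String[] args) {
        List<Item> items = new ArrayList<>();
        items.add(new Item(7));
        items.add(new Item(3));

        // The Comparator carries the lookup logic; equals() is left alone.
        Comparator<Item> byIntField = Comparator.comparingInt(i -> i.intField);

        items.sort(byIntField);
        int index = Collections.binarySearch(items, new Item(7), byIntField);
        System.out.println(index >= 0 ? "found at " + index : "not found");
    }
}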
That only makes sense if the semantics of the object follow the same logic.
If the different types represent different values, with different meanings, this type of overloading generates confusion.
It also sounds like the "equals" for longs isn't even an equals, which is worse.
Encapsulating the behavior in the object is fine, but should be named sensibly.
If only some of the fields of an object represent the actual state, I suppose these could be ignored when overriding equals and hashCode...
I get an uneasy feeling about this though, and wanted to ask,
Is this common practice?
Are there any potential pitfalls with this approach?
Is there any documentation or guidelines when it comes to ignoring some fields in equals / hashCode?
In my particular situation, I'm exploring a state-space of a problem. I'd like to keep a hash set of visited states, but I'm also considering including the path which led to the state. Obviously, two states should be considered equal even though they were found through different paths.
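Concretely, something like this is what I have in mind (a sketch; all names are hypothetical):

import java.util.List;
import java.util.Objects;

// Hypothetical search node: `board` is the actual state, `path` is how we got there.
public final class SearchNode {
    private final String board;        // the state itself
    private final List<String> path;   // bookkeeping only, not part of identity

    public SearchNode(String board, List<String> path) {
        this.board = board;
        this.path = path;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof SearchNode)) return false;
        // The same visited state, even if reached via different paths.
        return board.equals(((SearchNode) o).board);
    }

    @Override
    public int hashCode() {
        return Objects.hashCode(board); // must use only the fields used by equals()
    }
}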
This is based on how you would consider the uniqueness of a given object. If it has a primary key (unique key), then using that attribute alone is enough.
If you think the uniqueness is combination of 10 different attributes, then use all 10 attributes in the equals.
Then use only the attributes that you used in equals to generate the hashcode, because equal objects should generate the same hashcodes.
Selecting the attribute(s) for equals and hashcode is how you define the uniqueness of a given object.
Is this common practice? Yes
Are there any potential pitfalls with this approach? No
Is there any documentation or guidelines when it comes to ignoring some fields in equals / hashCode?
"The equals method for class Object implements the most discriminating
possible equivalence relation on objects;"
This is from the Object class Javadoc. But as the author of the class, you know how uniqueness is defined.
Ultimately, "equals" means what you want it to mean. There is the restriction that "equal" values must return the same hashcode, and, of course, if presented with two identical address "equals" must return true. But you could, eg, have an "equals" that compared the contents of two web pages (ignoring the issue of repeatability for the nonce), and, even though the URLs were different, said "equal" if the page contents matched in some way.
The best documentation/guidelines I have seen for overriding the methods on Object was in Josh Bloch's Effective Java. It has a whole chapter on "Methods Common to All Objects" which includes sections about "Obey the general contract when overriding equals" and "Always override hashCode when you override equals". It describes, in detail, the things you should consider when overriding these two methods. I won't give away the answer directly; the book is definitely worth the cost for every Java developer.
The method hashCode() in class Enum is final and defined as super.hashCode(), which means it returns a number based on the address of the instance, which is a random number from the programmer's point of view.
Defining it e.g. as ordinal() ^ getClass().getName().hashCode() would be deterministic across different JVMs. It would even work a bit better, since the least significant bits would "change as much as possible", e.g., for an enum containing up to 16 elements and a HashMap of size 16, there'd be for sure no collisions (sure, using an EnumMap is better, but sometimes not possible, e.g. there's no ConcurrentEnumMap). With the current definition you have no such guarantee, have you?
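As a sketch, the deterministic alternative proposed above could be packaged as a small helper (the class name is hypothetical):

public final class EnumHashes {

    private EnumHashes() {
    }

    // Deterministic across JVMs: depends only on the constant's position and
    // its class name, not on the instance's identity.
    public static int stableHash(Enum<?> constant) {
        return constant.ordinal() ^ constant.getClass().getName().hashCode();
    }
}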
Summary of the answers
Using Object.hashCode() compares to a nicer hashCode like the one above as follows:
PROS
- simplicity
CONS
- speed
- more collisions (for any size of a HashMap)
- non-determinism, which propagates to other objects, making them unusable for
  - deterministic simulations
  - ETag computation
  - hunting down bugs that depend e.g. on a HashSet iteration order
I'd personally prefer the nicer hashCode, but IMHO no reason weighs much, maybe except for the speed.
UPDATE
I was curious about the speed and wrote a benchmark with surprising results. For the price of a single field per class you can get a deterministic hash code which is nearly four times faster. Storing the hash code in each field would be even faster, although negligibly.
The explanation why the standard hash code is not much faster is that it can't be the object's address, as objects get moved by the GC.
UPDATE 2
There are some strange things going on with the hashCode performance in general. Even when I understand them, there's still the open question why System.identityHashCode (reading from the object header) is way slower than accessing a normal object field.
The only reason for using Object's hashCode() and for making it final that I can imagine is to make me ask this question.
First of all, you should not rely on such mechanisms for sharing objects between JVMs. That's simply not a supported use case. When you serialize / deserialize you should rely on your own comparison mechanisms or only "compare" the results against objects within your own JVM.
The reason for letting an enum's hashCode be implemented as Object's hash code (based on identity) is that, within one JVM, there will only be one instance of each enum constant. This is enough to ensure that such an implementation makes sense and is correct.
You could argue, "Hey, String and the wrappers for the primitives (Long, Integer, ...) all have well-defined, deterministic specifications of hashCode! Why don't the enums have it?" Well, to begin with, you can have several distinct string references representing the same string, which means that using super.hashCode would be an error, so these classes necessarily need their own hashCode implementations. For these core classes it made sense to let them have well-defined, deterministic hashCodes.
Why did they choose to solve it like this?
Well, look at the requirements of the hashCode implementation. The main concern is to make sure that each object returns a distinct hash code (unless it is equal to another object). The identity-based approach is super efficient and guarantees this, while your suggestion does not. This requirement is apparently stronger than any "convenience bonus" about easing up on serialization etc.
I think that the reason they made it final is to avoid developers shooting themselves in the foot by rewriting a suboptimal (or even incorrect) hashCode.
Regarding the chosen implementation: it's not stable across JVMs, but it's very fast, avoids collisions, and doesn't need an additional field in the enum. Given the normally small number of instances of an enum class, and the speed of the equals method, I wouldn't be surprised if the HashMap lookup time were bigger with your algorithm than with the current one, due to its additional complexity.
I asked the same question because I did not see this one: why does Enum's hashCode() refer to the Object hashCode() implementation instead of the ordinal() function?
I encountered it as a problem when defining my own hash function for an object that relies on an enum's hashCode as one of its components. When checking the values in a Set of such objects returned by the function, I checked them in an order which I expected to be the same, since I define the hashCode myself and so I expected elements to fall at the same nodes on the tree. But since the hashCode returned by the enum changes from start to start, this assumption was wrong, and the test could fail once in a while.
So, when I figured out the problem, I started using ordinal instead. I am not sure everyone writing hashCode for their object realizes this.
So basically, you can't define your own deterministic hashCode while relying on an enum's hashCode; you need to use ordinal instead.
P.S. This was too big for a comment :)
The JVM enforces that for an enum constant, only one object will exist in memory. There is no way that you could end up with two different instance objects of the same enum constant within a single VM, not with reflection, not across the network via serialization/deserialization.
That being said, since it is the only object to represent this constant, it doesn't matter that its hashcode is its address, since no other object can occupy the same address space at the same time. It is guaranteed to be unique and "deterministic" (in the sense that in the same VM, in memory, all objects will have the same reference, no matter what it is).
There is no requirement for hash codes to be deterministic between JVMs and no advantage gained if they were. If you are relying on this fact you are using them wrong.
As only one instance of each enum value exists, Object.hashCode() is guaranteed never to collide, is good code reuse and is very fast.
If equality is defined by identity, then Object.hashCode() will always give the best performance.
The determinism of other hash codes is just a side effect of their implementation. As their equality is usually defined by field values, mixing in non-deterministic values would be a waste of time.
As long as we can't send an enum object1 to a different JVM I see no reason for putting such a requirement on enums (and objects in general).
1 I thought it was clear enough - an object is an instance of a class. A serialized object is a sequence of bytes, usually stored in a byte array. I was talking about an object.
One more reason that it is implemented like this, I could imagine, is because of the requirement for hashCode() and equals() to be consistent, and for the design goal of enums that they should be simple to use and compile-time constant (to use them as "case" constants). This also makes it legal to compare enum instances with "==", and you simply wouldn't want "equals" to behave differently from "==" for enums. This again ties hashCode to the default Object.hashCode() reference-based behavior.
As said before, I also don't expect equals() and hashCode() to consider two enum constants from different JVMs as being equal. When talking about serialization: for instance fields typed as enums, the default binary serializer in Java has a special behaviour that serializes only the name of the constant, and on deserialization the reference to the corresponding enum value in the deserializing JVM is re-created. JAXB and other XML-based serialization mechanisms work in a similar way. So: just don't worry.
I'm in the middle of QA'ing a bunch of code and have found several instances where the developer has a DTO which implements Comparable. This DTO has 7 or 8 fields in it. The compareTo method has been implemented on just one field:
private DateMidnight field1; //from Joda date/time library
public int compareTo(SomeObject o) {
if (o == null) {
return -1;
}
return field1.compareTo(o.getField1());
}
Similarly the equals method is overridden and basically boils down to:
return field1.equals(o.getField1());
and finally the hashcode method implementation is:
return field1.hashCode();
field1 should never be null and will be unique across these objects (i.e. we shouldn't get two objects with the same field1).
So, the implementations are consistent which is good, but should I be concerned that only one field is used? Is this unusual? Is it likely to cause problems or confuse other developers? I'm thinking of the scenario where a list of these objects are passed around and another developer uses a Map or Set of somesort and gets unusual behaviour from these objects. Any thoughts appreciated. Thanks!
I suspect that this is a case of "first use wins" - someone needed to sort a collection of these objects or put them in a hash map, and they only cared about the date. The easiest way of implementing that was to override equals/hashCode and implement Comparable<T> in the way you've said.
For specialist sorting, a better approach would be to implement Comparator<T> in a different class... but Java doesn't have any equivalent class for equality testing, unfortunately. I consider it a major weakness in the Java collections, to be honest.
Assuming this really isn't "the one natural and obvious comparison", it certainly smells in terms of design... and should be very carefully documented.
Strictly speaking, this violates the Comparable spec:
http://download.oracle.com/javase/6/docs/api/java/lang/Comparable.html
Note that null is not an instance of any class, and e.compareTo(null) should throw a NullPointerException even though e.equals(null) returns false.
Similarly, it looks like the equals method will throw NPE on equals(null) instead of returning false (unless of course you "boiled" out the null handling code).
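A version that follows the spec on the null cases might look roughly like this, keeping the single-field comparison and the SomeObject/field1 names from the snippet above:

public int compareTo(SomeObject o) {
    // Per the Comparable spec, compareTo(null) should throw NullPointerException.
    if (o == null) {
        throw new NullPointerException("cannot compare to null");
    }
    return field1.compareTo(o.getField1());
}

@Override
public boolean equals(Object o) {
    if (this == o) {
        return true;
    }
    // equals(null) and instances of other types return false rather than throwing.
    if (!(o instanceof SomeObject)) {
        return false;
    }
    return field1.equals(((SomeObject) o).getField1());
}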
Is it likely to cause problems or confuse other developers?
Possibly, possibly not. It really depends on how large your project is and how widespread/"reusable"/long-lived your object's source code is expected to be:
Small/short-lived/limited use == probably not a problem.
Large/long-lived/widespread use == counter-intuitive implementation may cause future problems
You shouldn't be concerned about it if field1 is really unique. If it's not, you may have problems. Anyway, my advice is to write some unit tests. They should show the truth.
I don't think you need to be concerned. The contract between the three methods is kept and it's consistent.
Whether it's correct from a business logic point of view is a different question.
If e.g. field1 maps to a primary key in the database, it's perfectly valid. If field1 is the "firstname" of a person, I would be concerned.