I have a POJO having ~450 fields and I'm trying to compare instances of this POJO using hascode. I've generated the overridden hashCode() method with eclipse. In quite a few cases the generated hashcode is crossing the integer boundary. As a result, it's getting difficult to perform the comparison. What's the workaround?
The hashCode() method is as follows:
public int hashCode()
{
final int prime = 31;
int result = 1;
result = prime * result + ((stringOne == null) ? 0 : stringOne.hashCode());
result = prime * result + intOne;
result = prime * result + Arrays.hashCode(someArray);
result = prime * result + ((stringTwo == null) ? 0 : stringTwo.hashCode());
result = prime * result + intTwo;
result = prime * result + intThree;
result = prime * result + ((stringThree == null) ? 0 : stringThree.hashCode());
result = prime * result + ((stringFour == null) ? 0 : stringFour.hashCode());
result = prime * result + ((stringFive == null) ? 0 : stringFive.hashCode());
result = prime * result + ((objectOne == null) ? 0 : objectOne.hashCode());
result = prime * result + ((objectTwo == null) ? 0 : objectTwo.hashCode());
return result;
}
Integer overflow is a normal part of hashCode() calculations. It is not a problem.
For example, the hashCode() of a String is often negative.
System.out.println("The hashCode() of this String is negative".hashCode());
If a hashCode() calculation can overflow, obviously that can mean that unequal Objects can have the same hashCode, but this can happen without overflow. For example, both of these print true.
System.out.println("Aa".hashCode() == "BB".hashCode());
System.out.println(new HashSet<>(Arrays.asList(1, 2)).hashCode() == Collections.singleton(3).hashCode());
The only requirement is that equal objects should have the same hashCode. There is no requirement that different objects should have different hashCodes.
hashCode() and equals() should also be quick. You can improve the performance of equals() by comparing the fields most likely to be different first and returning early. You can't do this with hashCode() because the calculation must involve all the relevant fields. If your class has 450 fields, you may want to consider caching the result of hashCode() or, better, refactoring your class into smaller units.
The other thing to consider is whether you need to override these methods at all. It is only absolutely necessary if the objects are going to used as keys in a hash based container, such as HashMap.
The workaround is to use a different method to compute the hashcode. For instance, you could xor the hashcodes of your 450 fields (btw: wow!), but without knowing more about your object it's hard to say whether that would be a good approach for your particular case.
Ideally, since hashcodes are used for hashing, objects that are not equal should also with high probability produce different hashcodes.
Related
The accepted answer in Best implementation for hashCode method gives a seemingly good method for finding Hash Codes. But I'm new to Hash Codes, so I don't quite know what to do.
For 1), does it matter what nonzero value I choose? Is 1 just as good as other numbers such as the prime 31?
For 2), do I add each value to c? What if I have two fields that are both a long, int, double, etc?
Did I interpret it right in this class:
public MyClass{
long a, b, c; // these are the only fields
//some code and methods
public int hashCode(){
return 37 * (37 * ((int) (a ^ (a >>> 32))) + (int) (b ^ (b >>> 32)))
+ (int) (c ^ (c >>> 32));
}
}
The value is not important, it can be whatever you want. Prime numbers will result in a better distribution of the hashCode values therefore they are preferred.
You do not necessary have to add them, you are free to implement whatever algorithm you want, as long as it fulfills the hashCode contract:
Whenever it is invoked on the same object more than once during an execution of a Java application, the hashCode method must consistently return the same integer, provided no information used in equals comparisons on the object is modified. This integer need not remain consistent from one execution of an application to another execution of the same application.
If two objects are equal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce the same integer result.
It is not required that if two objects are unequal according to the equals(java.lang.Object) method, then calling the hashCode method on each of the two objects must produce distinct integer results. However, the programmer should be aware that producing distinct integer results for unequal objects may improve the performance of hash tables.
There are some algorithms which can be considered as not good hashCode implementations, simple adding of the attributes values being one of them. The reason for that is, if you have a class which has two fields, Integer a, Integer b and your hashCode() just sums up these values then the distribution of the hashCode values is highly depended on the values your instances store. For example, if most of the values of a are between 0-10 and b are between 0-10 then the hashCode values are be between 0-20. This implies that if you store the instance of this class in e.g. HashMap numerous instances will be stored in the same bucket (because numerous instances with different a and b values but with the same sum will be put inside the same bucket). This will have bad impact on the performance of the operations on the map, because when doing a lookup all the elements from the bucket will be compared using equals().
Regarding the algorithm, it looks fine, it is very similar to the one that Eclipse generates, but it is using a different prime number, 31 not 37:
#Override
public int hashCode() {
final int prime = 31;
int result = 1;
result = prime * result + (int) (a ^ (a >>> 32));
result = prime * result + (int) (b ^ (b >>> 32));
result = prime * result + (int) (c ^ (c >>> 32));
return result;
}
A well-behaved hashcode method already exists for long values - don't reinvent the wheel:
int hashCode = Long.hashCode((a * 31 + b) * 31 + c); // Java 8+
int hashCode = Long.valueOf((a * 31 + b) * 31 + c).hashCode() // Java <8
Multiplying by a prime number (usually 31 in JDK classes) and cumulating the sum is a common method of creating a "unique" number from several numbers.
The hashCode() method of Long keeps the result properly distributed across the int range, making the hash "well behaved" (basically pseudo random).
This question already has answers here:
Why use a prime number in hashCode?
(9 answers)
What issues should be considered when overriding equals and hashCode in Java?
(11 answers)
Closed 3 years ago.
I’ve read about hash and HashMap coming from here: https://howtodoinjava.com/java/collections/hashmap/design-good-key-for-hashmap/
In particular:
//Depends only on account number
#Override
public int hashCode() {
final int prime = 31;
int result = 1;
result = prime * result + accountNumber;
return result;
}
Why the hash code is calculated with prime * result + accountNumber and not only from accountNumber
What meaning has the fixed value prime * result?
Why not just prime (result value is 1 so prime time 1 is prime)?
Code generation uses that in order in the case of multiple fields to have randomness:
result = prime * result + accountNumber;
result = prime * result + hairColorAsInt; // Biometrics to identify account
result = prime * result + userWeight;
So indeed unneeded, but here it might serve a purpose: accountNumber should never leak to the outside if possible, hence duplicating it into the hashCode is undesirable. I do not know of any serialisation storing the hashCode, so it is a weak argument.
Citing this SO answer, a prime number is necessary to ensure all value combinations of fields in an object create distinct and unique hash codes. Yes, this may not be necessary when you only have a single int for fields, but as soon as you want to add any, you would need to change your implementation to the above source; this is of course not desirable, and the solution works in all use cases, so it may as well be used from the beginning.
I created a class "Book":
public class Book {
public static int idCount = 1;
private int id;
private String title;
private String author;
private String publisher;
private int yearOfPublication;
private int numOfPages;
private Cover cover;
...
}
And then i need to override the hashCode() and equals() methods.
#Override
public int hashCode() {
int result = id; // !!!
result = 31 * result + (title != null ? title.hashCode() : 0);
result = 31 * result + (author != null ? author.hashCode() : 0);
result = 31 * result + (publisher != null ? publisher.hashCode() : 0);
result = 31 * result + yearOfPublication;
result = 31 * result + numOfPages;
result = 31 * result + (cover != null ? cover.hashCode() : 0);
return result;
}
It's no problem with equals(). I just wondering about one thing in hashCode() method.
Note: IntelliJ IDEA generated that hashCode() method.
So, is it OK to set the result variable to id, or should i use some prime number?
What is the better choice here?
Thanks!
Note that only the initial value of the result is set to id, not the final one. The final value is calculated by combining that initial value with hash codes of other parts of the object, multiplied by a power of a small prime number (i.e. 31). Using id rather than an arbitrary prime is definitely right in this context.
In general, there is no advantage to hash code being prime (it's the number of hash buckets that needs to be prime). Using an int as its own hash code (in your case, that's id and numOfPages) is a valid approach.
It helps to know what the hashCode is used for. It's supposed to help you map a theoretically infinite set of objects to fitting in a small number of "bins", with each bin having a number, and each object saying which bin it wants to go in based on its hashCode. The question is not whether it's okay to do one thing or another, but whether what you want to do matches what the hashCode function is for.
As per http://docs.oracle.com/javase/6/docs/api/java/lang/Object.html#hashCode(), it's not about the number you return, it's about how it behaves for different objects of the same class.
If the object doesn't change, the hashCode must be the same value every time you call the hashCode() function.
Two objects that are equal according to .equals, must have the same hashCode.
Two objects that are not equal may have the same hashCode. (if this wasn't the case, there would be no point in using the hashCode at all, because every object already has a unique object pointer)
If you're reimplementing the hashCode function, the most important thing is to either rely on a tool to generate it for you, or to use code you understand that obeys those rules. The basic Java hashCode function uses an incredibly well-researched, seemingly simple bit of code for String hashing, so the code you see is based on turning everything into Strings and falling back to that.
If you don't know why that works, don't touch it. Just rely on it working and move on. That 31 is ridiculously important and ensures an even hashing distribution. See Why does Java's hashCode() in String use 31 as a multiplier? for the why on that one.
However, this might also be way more than you need. You could use id, but then you're basically negating the reason to use a hashCode (because now every object will want to be in a bin on its own, turning any hashed collection into a flat array. Kind of silly).
If you know the distribution of your id values, there are far easier hashCodes to come up with. Say you know they are always between 0 and Interger.MAX_VALUE, and you know there are never any gaps between ids, you could simply generate a hashCode like
final int modulus = Intereger.MAX_VALUE / 255;
int hashCode() {
return this.id % modulus;
}
now, you have a hashCode optimised for 255 bins, fulfilling the necessary requirements for an acceptable hashCode function.
Note : In my answer I am assuming that you know how hash code is meant to be used. The following just talks about any potential optimization using a non-zero constant for the initial value of result may produce.
If id is rarely 0 then it's fine to use it. However, if it's 0 frequently you should use some constant instead (just using 1 should be fine). The reason you want for it to be non-zero is so that the 31 * result part always adds some value to the hash. That way say if object A has all fields null or 0 except for yearOfPublication = 1 and object B has all fields null or 0 except for numOfPages = 1 the hash codes will be:
A.hashCode() => initialValue * 31 ^ 4 + 1
B.hashCode() => initialValue * 31 ^ 5 + 1
As you can see if initialValue is 0 then both hash codes are the same, however if it's not 0 then they will be different. It is preferable for them to be different so as to reduce collisions in data structures that use the hash code like HashMap.
That said, in your example of the Book class it is likely that id will never be 0. In fact, if id uniquely identifies the Book then you can have the hashCode() method just return the id.
I used eclipse to generate the overriding of the Object's hashCode and equals methods and that generated a few questions regarding the hashCode overriding. Is the below hashCode() correct?
Questions:
-Why does eclipse generate two result = lines of code? I would think that adding the two results together is appropriate. Any ideas why they're separate assignments?
-Can the final int prime be any prime number?
-Should int result always be 1?
public class Overrider {
private Long id;
private String name;
#Override
public int hashCode() {
final int prime = 31;
int result = 1;
result = prime * result + ((id == null) ? 0 : id.hashCode());
result = prime * result + ((name == null) ? 0 : name.hashCode());
return result;
}
#Override
public boolean equals(Object obj) {
if (this == obj)
return true;
if (obj == null)
return false;
if (getClass() != obj.getClass())
return false;
Overrider other = (Overrider) obj;
if (id == null) {
if (other.id != null)
return false;
} else if (!id.equals(other.id))
return false;
if (name == null) {
if (other.name != null)
return false;
} else if (!name.equals(other.name))
return false;
return true;
}
}
Looks fine to me.
"Why does eclipse generate two result = lines of code?"
I'm guessing the reason is readability. It looks like you are multiplying the first result + the hash of the second field by the prime again, not just adding the two lines together. The generated code looks a lot better than:
result = (prime * result + ((id == null) ? 0 : id.hashCode())) +
(prime * (prime * result + ((id == null) ? 0 : id.hashCode())) +
((name == null) ? 0 : name.hashCode()));
"Can the final int prime be any prime number?"
Yep, but the higher the better. A higher number reduces the likelihood of a collision. In most cases 31 should be more than sufficient.
"Should int result always be 1?"
Assuming you mean "Should int result always be initialized to 1":
No, it just needs to be some constant that isn't 0.
-Why does eclipse generate two result = lines of code? I would think that adding the two results together is appropriate. Any ideas why they're separate assignments?
Answer : You have fix prime number defined above 'final int prime = 31;'.
Assume following cases
id.hashcode()=1 & name.hashcode()=2.
Hashcode by addition = 32+33=65 Hashcode with eclipse Implementation= 994
id.hashcode()=2 & name.hashcode()=1.
Hashcode by addition = 33+32=65 Hashcode with eclipse Implementation= 1024
Hashcode should be as unique as possible for better performance. With addition of results there is possibility of 2 different objects return same hashcode. Whereas with multiplication, there is very less chance to have duplicate hashcodes for different object.
-Can the final int prime be any prime number?
It can be any number but it ensures that there will be very less possible duplicate hashcodes for different objects. E.g 31 above represents that even if difference of 1 in id.hashcodes of 2 different objects will ensure that actual hashcodes of these 2 objects will differ by atleast 31. S
-Should int result always be 1?
It can be anything. It's just it should be non-zero and should avoid complexity in calculations to make it work fast.
-Why does eclipse generate two result = lines of code? I would think that adding the two results together is appropriate. Any ideas why
they're separate assignments?
Please remember that this is code that runs in eclipse to generate code. Therefore, there is a specific logic.
It is much easier to generate one line per variable instead of one line for all.
Also, it makes the code much more readable... can you imagine combining those two statements into one?
return prime * (prime + ((id == null) ? 0 : id.hashCode()) ) +
((name == null) ? 0 : name.hashCode());
I will not bother with simplifying that but it would become quite large and hideous if there were 10 class variables...
"Can the final int prime be any prime number?"
Have a look at: Why does Java's hashCode() in String use 31 as a multiplier?
I am quoting from there, which quotes the Book "Effective Java"...
According to Joshua Bloch's Effective Java (a book that can't be
recommended enough, and which I bought thanks to continual mentions on
stackoverflow):
"The value 31 was chosen because it is an odd prime. If it were even
and the multiplication overflowed, information would be lost, as
multiplication by 2 is equivalent to shifting. The advantage of using
a prime is less clear, but it is traditional. A nice property of 31 is
that the multiplication can be replaced by a shift and a subtraction
for better performance: 31 * i == (i << 5) - i. Modern VMs do this
sort of optimization automatically."
In SO I have read several answers related to the implementation of hashcode and the suggestion to use the XOR operator. (E.g. Why are XOR often used in java hashCode() but another bitwise operators are used rarely?).
When I use Eclipse to generate the hashcode function where field is an object and timestamp a long, the output is:
public int hashCode() {
final int prime = 31;
int result = 1;
result = prime * result
+ field == null) ? 0 : field.hashCode());
return result;
}
Is there any reason by not using the XOR operator like below?
result = prime * result + (int) (timestamp ^ (timestamp >>> 32));
Eclipse takes the safe way out. Although the calculation method that uses a prime, a multiplication, and an addition is slower than a single XOR, it gives you an overall better hash code in situations when you have multiple fields.
Consider a simple example - a class with two Strings, a and b. You can use
a.hashCode() ^ b.hashCode()
or
a.hashCode() * 31 + b.hashCode()
Now consider two objects:
a = "ABC"; b = "XYZ"
and
a = "XYZ"; b = "ABC"
The first method will produce identical hash codes for them, because XOR is symmetric; the second method will produce different hash codes, which is good, because the objects are not equal. In general, you want non-equal objects to have different hash codes as often as possible, to improve performance of hash-based containers of these objects. The 31*a+b method achieves this goal better than XOR.
Note that when you are dealing with portions of the same object, as in
timestamp ^ (timestamp >>> 32)
the above argument is much weaker: encountering two timestamps such that the only difference between them is that their upper and lower parts are swapped is harder to imagine than two objects with swapped a and b field values.