Overriding hashCode() in Java - java

I created a class "Book":
public class Book {
public static int idCount = 1;
private int id;
private String title;
private String author;
private String publisher;
private int yearOfPublication;
private int numOfPages;
private Cover cover;
...
}
And then i need to override the hashCode() and equals() methods.
#Override
public int hashCode() {
int result = id; // !!!
result = 31 * result + (title != null ? title.hashCode() : 0);
result = 31 * result + (author != null ? author.hashCode() : 0);
result = 31 * result + (publisher != null ? publisher.hashCode() : 0);
result = 31 * result + yearOfPublication;
result = 31 * result + numOfPages;
result = 31 * result + (cover != null ? cover.hashCode() : 0);
return result;
}
It's no problem with equals(). I just wondering about one thing in hashCode() method.
Note: IntelliJ IDEA generated that hashCode() method.
So, is it OK to set the result variable to id, or should i use some prime number?
What is the better choice here?
Thanks!

Note that only the initial value of the result is set to id, not the final one. The final value is calculated by combining that initial value with hash codes of other parts of the object, multiplied by a power of a small prime number (i.e. 31). Using id rather than an arbitrary prime is definitely right in this context.
In general, there is no advantage to hash code being prime (it's the number of hash buckets that needs to be prime). Using an int as its own hash code (in your case, that's id and numOfPages) is a valid approach.

It helps to know what the hashCode is used for. It's supposed to help you map a theoretically infinite set of objects to fitting in a small number of "bins", with each bin having a number, and each object saying which bin it wants to go in based on its hashCode. The question is not whether it's okay to do one thing or another, but whether what you want to do matches what the hashCode function is for.
As per http://docs.oracle.com/javase/6/docs/api/java/lang/Object.html#hashCode(), it's not about the number you return, it's about how it behaves for different objects of the same class.
If the object doesn't change, the hashCode must be the same value every time you call the hashCode() function.
Two objects that are equal according to .equals, must have the same hashCode.
Two objects that are not equal may have the same hashCode. (if this wasn't the case, there would be no point in using the hashCode at all, because every object already has a unique object pointer)
If you're reimplementing the hashCode function, the most important thing is to either rely on a tool to generate it for you, or to use code you understand that obeys those rules. The basic Java hashCode function uses an incredibly well-researched, seemingly simple bit of code for String hashing, so the code you see is based on turning everything into Strings and falling back to that.
If you don't know why that works, don't touch it. Just rely on it working and move on. That 31 is ridiculously important and ensures an even hashing distribution. See Why does Java's hashCode() in String use 31 as a multiplier? for the why on that one.
However, this might also be way more than you need. You could use id, but then you're basically negating the reason to use a hashCode (because now every object will want to be in a bin on its own, turning any hashed collection into a flat array. Kind of silly).
If you know the distribution of your id values, there are far easier hashCodes to come up with. Say you know they are always between 0 and Interger.MAX_VALUE, and you know there are never any gaps between ids, you could simply generate a hashCode like
final int modulus = Intereger.MAX_VALUE / 255;
int hashCode() {
return this.id % modulus;
}
now, you have a hashCode optimised for 255 bins, fulfilling the necessary requirements for an acceptable hashCode function.

Note : In my answer I am assuming that you know how hash code is meant to be used. The following just talks about any potential optimization using a non-zero constant for the initial value of result may produce.
If id is rarely 0 then it's fine to use it. However, if it's 0 frequently you should use some constant instead (just using 1 should be fine). The reason you want for it to be non-zero is so that the 31 * result part always adds some value to the hash. That way say if object A has all fields null or 0 except for yearOfPublication = 1 and object B has all fields null or 0 except for numOfPages = 1 the hash codes will be:
A.hashCode() => initialValue * 31 ^ 4 + 1
B.hashCode() => initialValue * 31 ^ 5 + 1
As you can see if initialValue is 0 then both hash codes are the same, however if it's not 0 then they will be different. It is preferable for them to be different so as to reduce collisions in data structures that use the hash code like HashMap.
That said, in your example of the Book class it is likely that id will never be 0. In fact, if id uniquely identifies the Book then you can have the hashCode() method just return the id.

Related

Bad Hash Function [duplicate]

The accepted answer in Best implementation for hashCode method gives a seemingly good method for finding Hash Codes. But I'm new to Hash Codes, so I don't quite know what to do.
For 1), does it matter what nonzero value I choose? Is 1 just as good as other numbers such as the prime 31?
For 2), do I add each value to c? What if I have two fields that are both a long, int, double, etc?
Did I interpret it right in this class:
public MyClass{
long a, b, c; // these are the only fields
//some code and methods
public int hashCode(){
return 37 * (37 * ((int) (a ^ (a >>> 32))) + (int) (b ^ (b >>> 32)))
+ (int) (c ^ (c >>> 32));
}
}
The value is not important, it can be whatever you want. Prime numbers will result in a better distribution of the hashCode values therefore they are preferred.
You do not necessary have to add them, you are free to implement whatever algorithm you want, as long as it fulfills the hashCode contract:
Whenever it is invoked on the same object more than once during an execution of a Java application, the hashCode method must consistently return the same integer, provided no information used in equals comparisons on the object is modified. This integer need not remain consistent from one execution of an application to another execution of the same application.
If two objects are equal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce the same integer result.
It is not required that if two objects are unequal according to the equals(java.lang.Object) method, then calling the hashCode method on each of the two objects must produce distinct integer results. However, the programmer should be aware that producing distinct integer results for unequal objects may improve the performance of hash tables.
There are some algorithms which can be considered as not good hashCode implementations, simple adding of the attributes values being one of them. The reason for that is, if you have a class which has two fields, Integer a, Integer b and your hashCode() just sums up these values then the distribution of the hashCode values is highly depended on the values your instances store. For example, if most of the values of a are between 0-10 and b are between 0-10 then the hashCode values are be between 0-20. This implies that if you store the instance of this class in e.g. HashMap numerous instances will be stored in the same bucket (because numerous instances with different a and b values but with the same sum will be put inside the same bucket). This will have bad impact on the performance of the operations on the map, because when doing a lookup all the elements from the bucket will be compared using equals().
Regarding the algorithm, it looks fine, it is very similar to the one that Eclipse generates, but it is using a different prime number, 31 not 37:
#Override
public int hashCode() {
final int prime = 31;
int result = 1;
result = prime * result + (int) (a ^ (a >>> 32));
result = prime * result + (int) (b ^ (b >>> 32));
result = prime * result + (int) (c ^ (c >>> 32));
return result;
}
A well-behaved hashcode method already exists for long values - don't reinvent the wheel:
int hashCode = Long.hashCode((a * 31 + b) * 31 + c); // Java 8+
int hashCode = Long.valueOf((a * 31 + b) * 31 + c).hashCode() // Java <8
Multiplying by a prime number (usually 31 in JDK classes) and cumulating the sum is a common method of creating a "unique" number from several numbers.
The hashCode() method of Long keeps the result properly distributed across the int range, making the hash "well behaved" (basically pseudo random).

Collision strength of Java's Arrays.hashCode

How strong is the hashing mechanism that is used in the Arrays.hashCode methods against collision? What is the possibility of two different arrays (of, say, double) to have an exact hash value calculated with these methods?
Arrays.hashCode(double[]) is specified to return the equivalent value of a List containing Double values representing the same numeric value.
List.hashCode in turn is specified with a fairly simple algorithm:
int hashCode = 1;
for (E e : list)
hashCode = 31*hashCode + (e==null ? 0 : e.hashCode());
In general the multiplication with a prime number is a good practice for general-purpose hash functions, but it's far from a cryptographically strong hash function.
This means that while collisions are unlikely in the general (effectively random) case, they can usually be constructed quite easily if you can influence (or select) the hashCode of the items in the List.
As a constructed example consider these two statements:
System.out.println(Arrays.hashCode(new double[] {4.753E-321d}));
System.out.println(Arrays.hashCode(new double[] {4.9E-324d, 4.9E-324d}));
Both of these will output 993, despite being clearly different arrays.
This is the implementation of Arrays.hashCode that you use
public static int hashCode(int a[]) {
if (a == null)
return 0;
int result = 1;
for (int element : a)
result = 31 * result + element;
return result;
}
If your values happen to be smaller then 31 they are treated like distinct numbers in the base 31, so each result in a different numbers (if we ignore overflows for now). Lets call those pure hashes
Now of course 31^11 is way larger then the number of integers in Java, so we will get tons of overflows. But since the powers of 31 and the maximum integer are "very different" you don't get a almost random distribution, but a very regular uniform distribution.
Lets consider a smaller example. I assume you have only 2 elements in your array and the range from 0 to 5 each. I try to create "hashCode" between 0 and 37 by taking the modulo 38 of the "pure hash" The result is that I get streaks of 5 integers with small gaps in between, and not a single collision.
val hashes = for {
i <- 0 to 4
j <- 0 to 4
} yield (i * 31 + j) % 38
enter code here
println(hashes.size) // prints 25
println(hashes.toSet.size) // prints 25
To verify if this is what happens to your numbers you might create a graph as follows: For each hash take the first 16 bits for x and and the second 16 bits for y, color that dot black. I bet you will see an extremely regular pattern.

Java Hashcode gives integer overflow

Background information:
In my project I'm applying Reinforcement Learning (RL) to the Mario domain. For my state representation I chose to use a hashtable with custom objects as keys. My custom objects are immutable and have overwritten the .equals() and the .hashcode() (which were generated by the IntelliJ IDE).
This is the resulting .hashcode(), I've added the possible values in comments as extra information:
#Override
public int hashCode() {
int result = (stuck ? 1 : 0); // 2 possible values: 0, 1
result = 31 * result + (facing ? 1 : 0); // 2 possible values: 0, 1
result = 31 * result + marioMode; // 3 possible values: 0, 1, 2
result = 31 * result + (onGround ? 1 : 0); // 2 possible values: 0, 1
result = 31 * result + (canJump ? 1 : 0); // 2 possible values: 0, 1
result = 31 * result + (wallNear ? 1 : 0); // 2 possible values: 0, 1
result = 31 * result + nearestEnemyX; // 33 possible values: - 16 to 16
result = 31 * result + nearestEnemyY; // 33 possible values: - 16 to 16
return result;
}
The Problem:
The problem here is that the result in the above code can exceed Integer.MAX_VALUE. I've read online this doesn't have to be a problem, but in my case it is. This is partly due to algorithm used which is Q-Learning (an RL method) and depends on the correct Q-values stored inside the hashtable. Basically I cannot have conflicts when retrieving values. When running my experiments I see that the results are not good at all and I'm 95% certain the problem lies with the retrieval of the Q-values from the hashtable. (If needed I can expand on why I'm certain about this, but this requires some extra information on the project which isn't relevant for the question.)
The Question:
Is there a way to avoid the integer overflow, maybe I'm overlooking something here? Or is there another way (perhaps another datastructure) to get reasonably fast the values given my custom-key?
Remark:
After reading some comments I do realise that my choice for using a HashTable wasn't maybe the best one as I want unique keys that do not cause collisions. If I still want to use the HashTable I will probably need a proper encoding.
You need a dedicated Key Field to guarantee uniqueness
.hashCode() isn't designed for what you are using it for
.hashCode() is designed to give good general results in bucketing algorithms, which can tolerate minor collisions. It is not designed to provide a unique key. The default algorithm is a trade off of time and space and minor collisions, it isn't supposed to guarantee uniqueness.
Perfect Hash
What you need to implement is a perfect hash or some other unique key based on the contents of the object. This is possible within the boundries of an int but I wouldn't use .hashCode() for this representation. I would use an explicit key field on the object.
Unique Hashing
One way to use use SHA1 hashing that is built into the standard library which has an extremely low chance of collisions for small data sets. You don't have a huge combinational explosion in the values you posts to SHA1 will work.
You should be able to calculate a way to generate a minimal perfect hash with the limited values that you are showing in your question.
A minimal perfect hash function is a perfect hash function that maps n
keys to n consecutive integers—usually [0..n−1] or [1..n]. A more
formal way of expressing this is: Let j and k be elements of some
finite set K. F is a minimal perfect hash function iff F(j) =F(k)
implies j=k (injectivity) and there exists an integer a such that the
range of F is a..a+|K|−1. It has been proved that a general purpose
minimal perfect hash scheme requires at least 1.44 bits/key.2 The
best currently known minimal perfect hashing schemes use around 2.6
bits/key.[3]
A minimal perfect hash function F is order preserving if keys are
given in some order a1, a2, ..., an and for any keys aj and ak, j
A minimal perfect hash function F is monotone if it preserves the
lexicographical order of the keys. In this case, the function value is
just the position of each key in the sorted ordering of all of the
keys. If the keys to be hashed are themselves stored in a sorted
array, it is possible to store a small number of additional bits per
key in a data structure that can be used to compute hash values
quickly.[6]
Solution
Note where it talks about a URL it can be any byte[] representation of any String that you calculate from your object.
I usually override the toString() method to make it generate something unique, and then feed that into the UUID.nameUUIDFromBytes() method.
Type 3 UUID can be just as useful as well UUID.nameUUIDFromBytes()
Version 3 UUIDs use a scheme deriving a UUID via MD5 from a URL, a
fully qualified domain name, an object identifier, a distinguished
name (DN as used in Lightweight Directory Access Protocol), or on
names in unspecified namespaces. Version 3 UUIDs have the form
xxxxxxxx-xxxx-3xxx-yxxx-xxxxxxxxxxxx where x is any hexadecimal digit
and y is one of 8, 9, A, or B.
To determine the version 3 UUID of a given name, the UUID of the
namespace (e.g., 6ba7b810-9dad-11d1-80b4-00c04fd430c8 for a domain) is
transformed to a string of bytes corresponding to its hexadecimal
digits, concatenated with the input name, hashed with MD5 yielding 128
bits. Six bits are replaced by fixed values, four of these bits
indicate the version, 0011 for version 3. Finally, the fixed hash is
transformed back into the hexadecimal form with hyphens separating the
parts relevant in other UUID versions.
My preferred solution is Type 5 UUID ( SHA version of Type 3)
Version 5 UUIDs use a scheme with SHA-1 hashing; otherwise it is the
same idea as in version 3. RFC 4122 states that version 5 is preferred
over version 3 name based UUIDs, as MD5's security has been
compromised. Note that the 160 bit SHA-1 hash is truncated to 128 bits
to make the length work out. An erratum addresses the example in
appendix B of RFC 4122.
Key objects should be immutable
That way you can calculate toString(), .hashCode() and generate a unique primary key inside the Constructor and set them once and not calculate them over and over.
Here is a straw man example of an idiomatic immutable object and calculating a unique key based on the contents of the object.
package com.stackoverflow;
import javax.annotation.Nonnull;
import java.util.Date;
import java.util.UUID;
public class Q23633894
{
public static class Person
{
private final String firstName;
private final String lastName;
private final Date birthday;
private final UUID key;
private final String strRep;
public Person(#Nonnull final String firstName, #Nonnull final String lastName, #Nonnull final Date birthday)
{
this.firstName = firstName;
this.lastName = lastName;
this.birthday = birthday;
this.strRep = String.format("%s%s%d", firstName, lastName, birthday.getTime());
this.key = UUID.nameUUIDFromBytes(this.strRep.getBytes());
}
#Nonnull
public UUID getKey()
{
return this.key;
}
// Other getter/setters omitted for brevity
#Override
#Nonnull
public String toString()
{
return this.strRep;
}
#Override
public boolean equals(final Object o)
{
if (this == o) { return true; }
if (o == null || getClass() != o.getClass()) { return false; }
final Person person = (Person) o;
return key.equals(person.key);
}
#Override
public int hashCode()
{
return key.hashCode();
}
}
}
For a unique representation of your object's state, you would need 19 bits in total. Thus, it is possible to represent it by a "perfect hash" integer value (which can have up to 32 bits):
#Override
public int hashCode() {
int result = (stuck ? 1 : 0); // needs 1 bit (2 possible values)
result += (facing ? 1 : 0) << 1; // needs 1 bit (2 possible values)
result += marioMode << 2; // needs 2 bits (3 possible values)
result += (onGround ? 1 : 0) << 4; // needs 1 bit (2 possible values)
result += (canJump ? 1 : 0) << 5; // needs 1 bit (2 possible values)
result += (wallNear ? 1 : 0) << 6; // needs 1 bit (2 possible values)
result += (nearestEnemyX + 16) << 7; // needs 6 bits (33 possible values)
result += (nearestEnemyY + 16) << 13; // needs 6 bits (33 possible values)
}
Instead of using 31 as a your magic number, you need to use the number of possibilities (normalised to 0)
#Override
public int hashCode() {
int result = (stuck ? 1 : 0); // 2 possible values: 0, 1
result = 2 * result + (facing ? 1 : 0); // 2 possible values: 0, 1
result = 3 * result + marioMode; // 3 possible values: 0, 1, 2
result = 2 * result + (onGround ? 1 : 0); // 2 possible values: 0, 1
result = 2 * result + (canJump ? 1 : 0); // 2 possible values: 0, 1
result = 2 * result + (wallNear ? 1 : 0); // 2 possible values: 0, 1
result = 33 * result + (16 + nearestEnemyX); // 33 possible values: - 16 to 16
result = 33 * result + (16 + nearestEnemyY); // 33 possible values: - 16 to 16
return result;
}
This will give you 104544 possible hashCodes() BTW you can reverse this process to get the original values from the code by using a series of / and %
Try Guava's hashCode() method or JDK7's Objects.hash(). It's way better than writing your own. Don't repeat code yourself (and anyone else when you can use out of box solution):

Hashcode generated by Eclipse

In SO I have read several answers related to the implementation of hashcode and the suggestion to use the XOR operator. (E.g. Why are XOR often used in java hashCode() but another bitwise operators are used rarely?).
When I use Eclipse to generate the hashcode function where field is an object and timestamp a long, the output is:
public int hashCode() {
final int prime = 31;
int result = 1;
result = prime * result
+ field == null) ? 0 : field.hashCode());
return result;
}
Is there any reason by not using the XOR operator like below?
result = prime * result + (int) (timestamp ^ (timestamp >>> 32));
Eclipse takes the safe way out. Although the calculation method that uses a prime, a multiplication, and an addition is slower than a single XOR, it gives you an overall better hash code in situations when you have multiple fields.
Consider a simple example - a class with two Strings, a and b. You can use
a.hashCode() ^ b.hashCode()
or
a.hashCode() * 31 + b.hashCode()
Now consider two objects:
a = "ABC"; b = "XYZ"
and
a = "XYZ"; b = "ABC"
The first method will produce identical hash codes for them, because XOR is symmetric; the second method will produce different hash codes, which is good, because the objects are not equal. In general, you want non-equal objects to have different hash codes as often as possible, to improve performance of hash-based containers of these objects. The 31*a+b method achieves this goal better than XOR.
Note that when you are dealing with portions of the same object, as in
timestamp ^ (timestamp >>> 32)
the above argument is much weaker: encountering two timestamps such that the only difference between them is that their upper and lower parts are swapped is harder to imagine than two objects with swapped a and b field values.

Hashcode for objects with only integers

How do you in a general (and performant) way implement hashcode while minimizing collisions for objects with 2 or more integers?
update: as many stated, you cant ofcource eliminate colisions entierly (honestly didnt think about it). So my question should be how do you minimize collisions in a proper way, edited to reflect that.
Using NetBeans' autogeneration fails; for example:
public class HashCodeTest {
#Test
public void testHashCode() {
int loopCount = 0;
HashSet<Integer> hashSet = new HashSet<Integer>();
for (int outer = 0; outer < 18; outer++) {
for (int inner = 0; inner < 2; inner++) {
loopCount++;
hashSet.add(new SimpleClass(inner, outer).hashCode());
}
}
org.junit.Assert.assertEquals(loopCount, hashSet.size());
}
private class SimpleClass {
int int1;
int int2;
public SimpleClass(int int1, int int2) {
this.int1 = int1;
this.int2 = int2;
}
#Override
public int hashCode() {
int hash = 5;
hash = 17 * hash + this.int1;
hash = 17 * hash + this.int2;
return hash;
}
}
}
Can you in a general (and performant) way implement hashcode without
colisions for objects with 2 or more integers.
It is technically impossible to have zero collision when hashing to 32 bits (one integer) something made of more than 32 bits (like 2 or more integers).
This is what eclipse auto-generates:
#Override
public int hashCode() {
final int prime = 31;
int result = 1;
result = prime * result + getOuterType().hashCode();
result = prime * result + int1;
result = prime * result + int2;
return result;
}
And with this code your testcase passes...
PS: And don't forget to implement equals()!
There is no way to eliminate hash collisions entirely. Your approach is basically the preferred one to minimize collisions.
Creating a hash method with zero collisions is impossible. The idea of a hash method is you're taking a large set of objects and mapping it to a smaller set of integers. The best you can do is minimize the number of collisions you get within a subset of your objects.
As others have said, it's more important to minimize collisions that to eliminate them -- especially since you didn't say how many buckets you're aiming for. It's going to be much easier to have zero collisions with 5 items in 1000 buckets than if you have 5 items in 2 buckets! And even if there are plenty of buckets, your collisions could look very different with 1000 buckets vs 1001.
Another thing to note is that there's a good chance that the hash you provide won't even be the one the HashMap eventually uses. If you take a look at the OpenJDK HashMap code, for instance, you'll see that your keys' hashCodes are put through a private hash method (line 264 in that link) which re-hashes them. So, if you're going through the trouble of creating a carefully constructed custom hash function to reduce collisions (rather than just a simple, auto-generated one), make sure you also understand who's going to use it, and how.

Categories