I’ve read about hash and HashMap coming from here: https://howtodoinjava.com/java/collections/hashmap/design-good-key-for-hashmap/
In particular:
//Depends only on account number
@Override
public int hashCode() {
final int prime = 31;
int result = 1;
result = prime * result + accountNumber;
return result;
}
Why is the hash code calculated as prime * result + accountNumber and not from accountNumber alone?
What is the meaning of the fixed value prime * result?
Why not just prime (result is 1, so prime times 1 is prime)?
Code generators use this pattern so that, with multiple fields, the values get mixed together and the result looks more random:
result = prime * result + accountNumber;
result = prime * result + hairColorAsInt; // Biometrics to identify account
result = prime * result + userWeight;
So indeed unneeded, but here it might serve a purpose: accountNumber should never leak to the outside if possible, hence duplicating it into the hashCode is undesirable. I do not know of any serialisation storing the hashCode, so it is a weak argument.
Citing this SO answer, a prime multiplier helps ensure that different combinations of field values produce different hash codes (collisions remain possible, since there are only 2^32 int values). This may not be necessary when you have only a single int field, but as soon as you add another you would need to change your implementation to the one above; that is not desirable, and since the pattern works in all cases, it may as well be used from the beginning.
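To make the pattern concrete, here is a minimal sketch (the class and method names are hypothetical and not taken from the linked article). With a single field the result is simply 31 + accountNumber; the multiplication only starts to matter once a second field is folded in, because it makes the result depend on the order of the fields.

```java
import java.util.Objects;

final class AccountHashDemo {
    // Same pattern as the generated code, for a single int field.
    static int singleField(int accountNumber) {
        final int prime = 31;
        int result = 1;
        result = prime * result + accountNumber;
        return result;
    }

    // Same pattern extended to a second field (userWeight is only an illustration).
    static int twoFields(int accountNumber, int userWeight) {
        final int prime = 31;
        int result = 1;
        result = prime * result + accountNumber;
        result = prime * result + userWeight;
        return result;
    }

    public static void main(String[] args) {
        // With one field the hash is just 31 + accountNumber.
        System.out.println(singleField(42));
        // Objects.hash uses the same 31-based formula for a single boxed int.
        System.out.println(singleField(42) == Objects.hash(42));
        // Swapping the two field values changes the hash.
        System.out.println(twoFields(1, 2) == twoFields(2, 1));
    }
}
```

So for a one-field class the extra arithmetic buys nothing by itself; it is there so the method keeps the same shape when more fields are added.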
The accepted answer in Best implementation for hashCode method gives a seemingly good method for finding Hash Codes. But I'm new to Hash Codes, so I don't quite know what to do.
For 1), does it matter what nonzero value I choose? Is 1 just as good as other numbers such as the prime 31?
For 2), do I add each value to c? What if I have two fields that are both a long, int, double, etc?
Did I interpret it right in this class:
public class MyClass {
long a, b, c; // these are the only fields
//some code and methods
public int hashCode(){
return 37 * (37 * ((int) (a ^ (a >>> 32))) + (int) (b ^ (b >>> 32)))
+ (int) (c ^ (c >>> 32));
}
}
The value is not important, it can be whatever you want. Prime numbers will result in a better distribution of the hashCode values therefore they are preferred.
You do not necessarily have to add them; you are free to implement whatever algorithm you want, as long as it fulfills the hashCode contract:
Whenever it is invoked on the same object more than once during an execution of a Java application, the hashCode method must consistently return the same integer, provided no information used in equals comparisons on the object is modified. This integer need not remain consistent from one execution of an application to another execution of the same application.
If two objects are equal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce the same integer result.
It is not required that if two objects are unequal according to the equals(java.lang.Object) method, then calling the hashCode method on each of the two objects must produce distinct integer results. However, the programmer should be aware that producing distinct integer results for unequal objects may improve the performance of hash tables.
Some algorithms are considered poor hashCode implementations, and simply adding up the attribute values is one of them. The reason is that if you have a class with two fields, Integer a and Integer b, and your hashCode() just sums these values, then the distribution of the hashCode values depends heavily on the values your instances store. For example, if most values of a are between 0 and 10 and most values of b are between 0 and 10, then the hashCode values will be between 0 and 20. This implies that if you store instances of this class in, e.g., a HashMap, numerous instances will end up in the same bucket (because instances with different a and b values but the same sum share a hash code). This hurts the performance of operations on the map, because during a lookup all the elements in the bucket have to be compared using equals().
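The effect described above can be checked by counting distinct hash codes. The following is a small sketch (class and method names are made up for illustration) that enumerates every pair with a and b in 0..10 and compares plain summation with the 31-based combination.

```java
import java.util.HashSet;
import java.util.Set;

final class SumVersusPolynomial {
    // Distinct hash codes produced by a + b for a, b in 0..10.
    static int distinctSums() {
        Set<Integer> seen = new HashSet<>();
        for (int a = 0; a <= 10; a++) {
            for (int b = 0; b <= 10; b++) {
                seen.add(a + b);
            }
        }
        return seen.size();
    }

    // Distinct hash codes produced by the 31-based pattern for the same inputs.
    static int distinctPolynomial() {
        Set<Integer> seen = new HashSet<>();
        for (int a = 0; a <= 10; a++) {
            for (int b = 0; b <= 10; b++) {
                seen.add(31 * (31 * 1 + a) + b);
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println(distinctSums());       // 21 codes for 121 different objects
        System.out.println(distinctPolynomial()); // 121 codes, one per object
    }
}
```

For this particular value range the 31-based version happens to be collision-free because b is always smaller than the multiplier; that is not guaranteed for arbitrary inputs.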
Regarding the algorithm, it looks fine, it is very similar to the one that Eclipse generates, but it is using a different prime number, 31 not 37:
@Override
public int hashCode() {
final int prime = 31;
int result = 1;
result = prime * result + (int) (a ^ (a >>> 32));
result = prime * result + (int) (b ^ (b >>> 32));
result = prime * result + (int) (c ^ (c >>> 32));
return result;
}
A well-behaved hashcode method already exists for long values - don't reinvent the wheel:
int hashCode = Long.hashCode((a * 31 + b) * 31 + c); // Java 8+
int hashCode = Long.valueOf((a * 31 + b) * 31 + c).hashCode(); // Java <8
Multiplying by a prime number (usually 31 in JDK classes) and cumulating the sum is a common method of creating a "unique" number from several numbers.
The hashCode() method of Long keeps the result properly distributed across the int range, making the hash "well behaved" (basically pseudo random).
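As a quick check, Long.hashCode(long) (Java 8+) is documented to compute the same fold that the hand-written snippets use, (int) (v ^ (v >>> 32)). The sketch below (hypothetical class name) compares the two and shows the combine-then-fold variant from this answer. Note that the combined variant does not produce the same numbers as the field-by-field version; both are valid hash codes.

```java
final class LongHashDemo {
    // Manual fold: XOR the upper and lower 32-bit halves of the long.
    static int fold(long v) {
        return (int) (v ^ (v >>> 32));
    }

    // Combine three longs into one long first, then fold once.
    static int combined(long a, long b, long c) {
        return Long.hashCode((a * 31 + b) * 31 + c);
    }

    public static void main(String[] args) {
        long[] samples = {0L, -1L, 1L << 40, Long.MAX_VALUE, Long.MIN_VALUE};
        for (long v : samples) {
            // Each line prints true: the library method and the manual fold agree.
            System.out.println(fold(v) == Long.hashCode(v));
        }
        System.out.println(combined(1L, 2L, 3L));
    }
}
```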
I have a POJO with ~450 fields and I'm trying to compare instances of this POJO using their hash codes. I've generated the overridden hashCode() method with Eclipse. In quite a few cases the generated hash code crosses the integer boundary. As a result, it's getting difficult to perform the comparison. What's the workaround?
The hashCode() method is as follows:
public int hashCode()
{
final int prime = 31;
int result = 1;
result = prime * result + ((stringOne == null) ? 0 : stringOne.hashCode());
result = prime * result + intOne;
result = prime * result + Arrays.hashCode(someArray);
result = prime * result + ((stringTwo == null) ? 0 : stringTwo.hashCode());
result = prime * result + intTwo;
result = prime * result + intThree;
result = prime * result + ((stringThree == null) ? 0 : stringThree.hashCode());
result = prime * result + ((stringFour == null) ? 0 : stringFour.hashCode());
result = prime * result + ((stringFive == null) ? 0 : stringFive.hashCode());
result = prime * result + ((objectOne == null) ? 0 : objectOne.hashCode());
result = prime * result + ((objectTwo == null) ? 0 : objectTwo.hashCode());
return result;
}
Integer overflow is a normal part of hashCode() calculations. It is not a problem.
For example, the hashCode() of a String is often negative.
System.out.println("The hashCode() of this String is negative".hashCode());
If a hashCode() calculation can overflow, obviously that can mean that unequal Objects can have the same hashCode, but this can happen without overflow. For example, both of these print true.
System.out.println("Aa".hashCode() == "BB".hashCode());
System.out.println(new HashSet<>(Arrays.asList(1, 2)).hashCode() == Collections.singleton(3).hashCode());
The only requirement is that equal objects should have the same hashCode. There is no requirement that different objects should have different hashCodes.
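A small sketch of why overflow is harmless (names are hypothetical): int arithmetic in Java wraps around silently, the wrapped result is still deterministic, and a negative value can still be mapped to a bucket. The bucket method uses Math.floorMod as one simple way to do that mapping; it is not meant to mirror how HashMap computes its indices internally.

```java
import java.util.Arrays;

final class OverflowDemo {
    // The usual 31-based combination; the multiplication may overflow and wrap.
    static int hash(int... fields) {
        int result = 1;
        for (int f : fields) {
            result = 31 * result + f;
        }
        return result;
    }

    // Maps any int, negative included, to an index in 0 .. buckets-1.
    static int bucket(int hash, int buckets) {
        return Math.floorMod(hash, buckets);
    }

    public static void main(String[] args) {
        // Overflow wraps around instead of failing.
        System.out.println(Integer.MAX_VALUE + 1 == Integer.MIN_VALUE);
        int h = hash(Integer.MAX_VALUE, Integer.MAX_VALUE);
        // Same inputs always give the same (possibly negative) result.
        System.out.println(h == hash(Integer.MAX_VALUE, Integer.MAX_VALUE));
        // Arrays.hashCode(int[]) uses the same formula.
        System.out.println(h == Arrays.hashCode(new int[] {Integer.MAX_VALUE, Integer.MAX_VALUE}));
        System.out.println(bucket(h, 16));
    }
}
```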
hashCode() and equals() should also be quick. You can improve the performance of equals() by comparing the fields most likely to be different first and returning early. You can't do this with hashCode() because the calculation must involve all the relevant fields. If your class has 450 fields, you may want to consider caching the result of hashCode() or, better, refactoring your class into smaller units.
The other thing to consider is whether you need to override these methods at all. It is only absolutely necessary if the objects are going to be used as keys in a hash-based container, such as HashMap.
The workaround is to use a different method to compute the hashcode. For instance, you could xor the hashcodes of your 450 fields (btw: wow!), but without knowing more about your object it's hard to say whether that would be a good approach for your particular case.
Ideally, since hashcodes are used for hashing, objects that are not equal should also with high probability produce different hashcodes.
Background information:
In my project I'm applying Reinforcement Learning (RL) to the Mario domain. For my state representation I chose to use a hashtable with custom objects as keys. My custom objects are immutable and override .equals() and .hashCode() (both generated by the IntelliJ IDE).
This is the resulting .hashcode(), I've added the possible values in comments as extra information:
@Override
public int hashCode() {
int result = (stuck ? 1 : 0); // 2 possible values: 0, 1
result = 31 * result + (facing ? 1 : 0); // 2 possible values: 0, 1
result = 31 * result + marioMode; // 3 possible values: 0, 1, 2
result = 31 * result + (onGround ? 1 : 0); // 2 possible values: 0, 1
result = 31 * result + (canJump ? 1 : 0); // 2 possible values: 0, 1
result = 31 * result + (wallNear ? 1 : 0); // 2 possible values: 0, 1
result = 31 * result + nearestEnemyX; // 33 possible values: - 16 to 16
result = 31 * result + nearestEnemyY; // 33 possible values: - 16 to 16
return result;
}
The Problem:
The problem here is that the result in the above code can exceed Integer.MAX_VALUE. I've read online that this doesn't have to be a problem, but in my case it is. This is partly due to the algorithm used, Q-learning (an RL method), which depends on the correct Q-values being stored inside the hashtable. Basically I cannot have conflicts when retrieving values. When running my experiments I see that the results are not good at all, and I'm 95% certain the problem lies with the retrieval of the Q-values from the hashtable. (If needed I can expand on why I'm certain about this, but this requires some extra information on the project which isn't relevant for the question.)
The Question:
Is there a way to avoid the integer overflow? Maybe I'm overlooking something here. Or is there another way (perhaps another data structure) to retrieve the values reasonably fast given my custom key?
Remark:
After reading some comments I realise that a HashTable may not have been the best choice, as I want unique keys that do not cause collisions. If I still want to use the HashTable I will probably need a proper encoding.
You need a dedicated Key Field to guarantee uniqueness
.hashCode() isn't designed for what you are using it for
.hashCode() is designed to give good general results in bucketing algorithms, which can tolerate minor collisions. It is not designed to provide a unique key. The default algorithm is a trade off of time and space and minor collisions, it isn't supposed to guarantee uniqueness.
Perfect Hash
What you need to implement is a perfect hash or some other unique key based on the contents of the object. This is possible within the boundaries of an int, but I wouldn't use .hashCode() for this representation. I would use an explicit key field on the object.
Unique Hashing
One way is to use the SHA-1 hashing that is built into the standard library, which has an extremely low chance of collisions for small data sets. You don't have a huge combinatorial explosion in the values you pass to it, so SHA-1 will work.
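A minimal sketch of that idea using java.security.MessageDigest follows. The class name, the helper name, and the text format of the state string are assumptions made for illustration; any canonical string that fully describes the object would do.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class Sha1KeyDemo {
    // Returns the SHA-1 digest of the text as a 40-character lowercase hex string.
    static String sha1Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is expected to be present on a standard Java platform.
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical canonical description of one state.
        String state = "stuck=0;facing=1;marioMode=2;onGround=1;canJump=0;wallNear=0;x=-3;y=5";
        System.out.println(sha1Hex(state));
    }
}
```

The resulting string (or the raw digest wrapped in a suitable key type) can then serve as the map key instead of relying on hashCode() alone.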
You should be able to calculate a way to generate a minimal perfect hash with the limited values that you are showing in your question.
A minimal perfect hash function is a perfect hash function that maps n
keys to n consecutive integers—usually [0..n−1] or [1..n]. A more
formal way of expressing this is: Let j and k be elements of some
finite set K. F is a minimal perfect hash function iff F(j) =F(k)
implies j=k (injectivity) and there exists an integer a such that the
range of F is a..a+|K|−1. It has been proved that a general purpose
minimal perfect hash scheme requires at least 1.44 bits/key.[2] The
best currently known minimal perfect hashing schemes use around 2.6
bits/key.[3]
A minimal perfect hash function F is order preserving if keys are
given in some order a1, a2, ..., an and for any keys aj and ak, j < k implies F(aj) < F(ak).
A minimal perfect hash function F is monotone if it preserves the
lexicographical order of the keys. In this case, the function value is
just the position of each key in the sorted ordering of all of the
keys. If the keys to be hashed are themselves stored in a sorted
array, it is possible to store a small number of additional bits per
key in a data structure that can be used to compute hash values
quickly.[6]
Solution
Note where it talks about a URL it can be any byte[] representation of any String that you calculate from your object.
I usually override the toString() method to make it generate something unique, and then feed that into the UUID.nameUUIDFromBytes() method.
A Type 3 UUID can be just as useful; UUID.nameUUIDFromBytes() produces one.
Version 3 UUIDs use a scheme deriving a UUID via MD5 from a URL, a
fully qualified domain name, an object identifier, a distinguished
name (DN as used in Lightweight Directory Access Protocol), or on
names in unspecified namespaces. Version 3 UUIDs have the form
xxxxxxxx-xxxx-3xxx-yxxx-xxxxxxxxxxxx where x is any hexadecimal digit
and y is one of 8, 9, A, or B.
To determine the version 3 UUID of a given name, the UUID of the
namespace (e.g., 6ba7b810-9dad-11d1-80b4-00c04fd430c8 for a domain) is
transformed to a string of bytes corresponding to its hexadecimal
digits, concatenated with the input name, hashed with MD5 yielding 128
bits. Six bits are replaced by fixed values, four of these bits
indicate the version, 0011 for version 3. Finally, the fixed hash is
transformed back into the hexadecimal form with hyphens separating the
parts relevant in other UUID versions.
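In Java, a name-based version 3 UUID is available through UUID.nameUUIDFromBytes(byte[]). The sketch below (hypothetical class and helper names) shows that the same input always yields the same UUID and that the reported version is 3. As far as I know the JDK method hashes only the bytes you pass in and does not prepend a namespace UUID, so the values will not match those of a namespace-aware generator.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

final class NameUuidDemo {
    // Derives a deterministic, name-based UUID from a canonical string.
    static UUID keyFor(String canonical) {
        return UUID.nameUUIDFromBytes(canonical.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        UUID first = keyFor("JaneDoe946684800000");
        UUID second = keyFor("JaneDoe946684800000");
        System.out.println(first);
        System.out.println(first.equals(second)); // same name, same UUID
        System.out.println(first.version());      // name-based (MD5) UUIDs report version 3
    }
}
```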
My preferred solution is a Type 5 UUID (the SHA-1 version of Type 3)
Version 5 UUIDs use a scheme with SHA-1 hashing; otherwise it is the
same idea as in version 3. RFC 4122 states that version 5 is preferred
over version 3 name based UUIDs, as MD5's security has been
compromised. Note that the 160 bit SHA-1 hash is truncated to 128 bits
to make the length work out. An erratum addresses the example in
appendix B of RFC 4122.
Key objects should be immutable
That way you can calculate toString(), .hashCode() and generate a unique primary key inside the Constructor and set them once and not calculate them over and over.
Here is a straw man example of an idiomatic immutable object and calculating a unique key based on the contents of the object.
package com.stackoverflow;
import javax.annotation.Nonnull;
import java.util.Date;
import java.util.UUID;
public class Q23633894
{
public static class Person
{
private final String firstName;
private final String lastName;
private final Date birthday;
private final UUID key;
private final String strRep;
public Person(@Nonnull final String firstName, @Nonnull final String lastName, @Nonnull final Date birthday)
{
this.firstName = firstName;
this.lastName = lastName;
this.birthday = birthday;
this.strRep = String.format("%s%s%d", firstName, lastName, birthday.getTime());
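// Use an explicit charset so the key does not depend on the platform default encoding.
this.key = UUID.nameUUIDFromBytes(this.strRep.getBytes(java.nio.charset.StandardCharsets.UTF_8));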
}
@Nonnull
public UUID getKey()
{
return this.key;
}
// Other getter/setters omitted for brevity
@Override
@Nonnull
public String toString()
{
return this.strRep;
}
@Override
public boolean equals(final Object o)
{
if (this == o) { return true; }
if (o == null || getClass() != o.getClass()) { return false; }
final Person person = (Person) o;
return key.equals(person.key);
}
@Override
public int hashCode()
{
return key.hashCode();
}
}
}
For a unique representation of your object's state, you would need 19 bits in total. Thus, it is possible to represent it by a "perfect hash" integer value (which can have up to 32 bits):
@Override
public int hashCode() {
int result = (stuck ? 1 : 0); // needs 1 bit (2 possible values)
result += (facing ? 1 : 0) << 1; // needs 1 bit (2 possible values)
result += marioMode << 2; // needs 2 bits (3 possible values)
result += (onGround ? 1 : 0) << 4; // needs 1 bit (2 possible values)
result += (canJump ? 1 : 0) << 5; // needs 1 bit (2 possible values)
result += (wallNear ? 1 : 0) << 6; // needs 1 bit (2 possible values)
result += (nearestEnemyX + 16) << 7; // needs 6 bits (33 possible values)
result += (nearestEnemyY + 16) << 13; // needs 6 bits (33 possible values)
return result; // at most 19 bits are used, so no overflow is possible
}
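Because every field owns its own bit range, the packed value can be taken apart again with shifts and masks, which is a convenient way to convince yourself that the encoding is unique. The following is a sketch under the same bit layout; the class name and the accessor methods are hypothetical.

```java
final class PackedState {
    // Packs the eight fields into disjoint bit ranges of one int (19 bits in total).
    static int pack(boolean stuck, boolean facing, int marioMode, boolean onGround,
                    boolean canJump, boolean wallNear, int enemyX, int enemyY) {
        int result = stuck ? 1 : 0;          // bit 0
        result |= (facing ? 1 : 0) << 1;     // bit 1
        result |= marioMode << 2;            // bits 2-3
        result |= (onGround ? 1 : 0) << 4;   // bit 4
        result |= (canJump ? 1 : 0) << 5;    // bit 5
        result |= (wallNear ? 1 : 0) << 6;   // bit 6
        result |= (enemyX + 16) << 7;        // bits 7-12, offset so the value is 0..32
        result |= (enemyY + 16) << 13;       // bits 13-18
        return result;
    }

    static int marioMode(int packed) { return (packed >>> 2) & 0x3; }
    static int enemyX(int packed)    { return ((packed >>> 7) & 0x3F) - 16; }
    static int enemyY(int packed)    { return ((packed >>> 13) & 0x3F) - 16; }

    public static void main(String[] args) {
        int packed = pack(true, false, 2, true, false, true, -16, 16);
        System.out.println(marioMode(packed)); // 2
        System.out.println(enemyX(packed));    // -16
        System.out.println(enemyY(packed));    // 16
    }
}
```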
Instead of using 31 as your magic number, you need to use the number of possibilities for each field (normalised to start at 0):
@Override
public int hashCode() {
int result = (stuck ? 1 : 0); // 2 possible values: 0, 1
result = 2 * result + (facing ? 1 : 0); // 2 possible values: 0, 1
result = 3 * result + marioMode; // 3 possible values: 0, 1, 2
result = 2 * result + (onGround ? 1 : 0); // 2 possible values: 0, 1
result = 2 * result + (canJump ? 1 : 0); // 2 possible values: 0, 1
result = 2 * result + (wallNear ? 1 : 0); // 2 possible values: 0, 1
result = 33 * result + (16 + nearestEnemyX); // 33 possible values: - 16 to 16
result = 33 * result + (16 + nearestEnemyY); // 33 possible values: - 16 to 16
return result;
}
This will give you 104544 possible hashCode() values. BTW, you can reverse this process to recover the original values from the code by using a series of / and % operations.
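The reversal mentioned above could look like the following sketch. It treats the fields as digits of a mixed-radix number; the class name, the RADIX array, and the convention that the enemy offsets are already shifted by +16 are assumptions made for the example.

```java
import java.util.Arrays;

final class MixedRadixState {
    // Number of possible values per field, in the order the fields are folded in:
    // stuck, facing, marioMode, onGround, canJump, wallNear, enemyX + 16, enemyY + 16.
    private static final int[] RADIX = {2, 2, 3, 2, 2, 2, 33, 33};

    // digits[i] must lie in 0 .. RADIX[i] - 1.
    static int encode(int[] digits) {
        int result = 0;
        for (int i = 0; i < RADIX.length; i++) {
            result = RADIX[i] * result + digits[i];
        }
        return result;
    }

    // Peels the digits off again, last field first, using % and /.
    static int[] decode(int code) {
        int[] digits = new int[RADIX.length];
        for (int i = RADIX.length - 1; i >= 0; i--) {
            digits[i] = code % RADIX[i];
            code /= RADIX[i];
        }
        return digits;
    }

    public static void main(String[] args) {
        int[] state = {1, 0, 2, 1, 0, 1, 13, 21}; // enemyX = -3, enemyY = 5 after the +16 shift
        int code = encode(state);
        System.out.println(code);
        System.out.println(Arrays.equals(state, decode(code))); // round trip restores the fields
    }
}
```

The largest code is produced by the largest value in every field and equals 104543, so the codes fill the range 0..104543 exactly.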
Try Guava's Objects.hashCode() method or JDK 7's Objects.hash(). It's way better than writing your own: don't repeat yourself (or anyone else) when you can use an out-of-the-box solution.
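With the JDK alone, that suggestion might be sketched as follows. The class is a cut-down, hypothetical version of the state key with only three of the fields. Keep in mind that Objects.hash uses the same 31-based combination internally, so it saves typing but does not make the hash codes unique.

```java
import java.util.Objects;

final class MarioStateKey {
    private final boolean stuck;
    private final int marioMode;
    private final int nearestEnemyX;

    MarioStateKey(boolean stuck, int marioMode, int nearestEnemyX) {
        this.stuck = stuck;
        this.marioMode = marioMode;
        this.nearestEnemyX = nearestEnemyX;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) { return true; }
        if (!(o instanceof MarioStateKey)) { return false; }
        MarioStateKey other = (MarioStateKey) o;
        return stuck == other.stuck
                && marioMode == other.marioMode
                && nearestEnemyX == other.nearestEnemyX;
    }

    @Override
    public int hashCode() {
        // Boxes the arguments and combines them with the usual 31-based formula.
        return Objects.hash(stuck, marioMode, nearestEnemyX);
    }

    public static void main(String[] args) {
        MarioStateKey a = new MarioStateKey(true, 2, -3);
        MarioStateKey b = new MarioStateKey(true, 2, -3);
        System.out.println(a.equals(b));                  // equal state
        System.out.println(a.hashCode() == b.hashCode()); // therefore equal hash codes
    }
}
```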
I created a class "Book":
public class Book {
public static int idCount = 1;
private int id;
private String title;
private String author;
private String publisher;
private int yearOfPublication;
private int numOfPages;
private Cover cover;
...
}
And then I need to override the hashCode() and equals() methods.
@Override
public int hashCode() {
int result = id; // !!!
result = 31 * result + (title != null ? title.hashCode() : 0);
result = 31 * result + (author != null ? author.hashCode() : 0);
result = 31 * result + (publisher != null ? publisher.hashCode() : 0);
result = 31 * result + yearOfPublication;
result = 31 * result + numOfPages;
result = 31 * result + (cover != null ? cover.hashCode() : 0);
return result;
}
There's no problem with equals(). I'm just wondering about one thing in the hashCode() method.
Note: IntelliJ IDEA generated that hashCode() method.
So, is it OK to set the result variable to id, or should I use some prime number?
What is the better choice here?
Thanks!
Note that only the initial value of the result is set to id, not the final one. The final value is calculated by combining that initial value with hash codes of other parts of the object, multiplied by a power of a small prime number (i.e. 31). Using id rather than an arbitrary prime is definitely right in this context.
In general, there is no advantage to hash code being prime (it's the number of hash buckets that needs to be prime). Using an int as its own hash code (in your case, that's id and numOfPages) is a valid approach.
It helps to know what the hashCode is used for. It's supposed to help you map a theoretically infinite set of objects to fitting in a small number of "bins", with each bin having a number, and each object saying which bin it wants to go in based on its hashCode. The question is not whether it's okay to do one thing or another, but whether what you want to do matches what the hashCode function is for.
As per http://docs.oracle.com/javase/6/docs/api/java/lang/Object.html#hashCode(), it's not about the number you return, it's about how it behaves for different objects of the same class.
If the object doesn't change, the hashCode must be the same value every time you call the hashCode() function.
Two objects that are equal according to .equals, must have the same hashCode.
Two objects that are not equal may have the same hashCode. (if this wasn't the case, there would be no point in using the hashCode at all, because every object already has a unique object pointer)
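The third rule is easy to demonstrate with a deliberately bad but legal hashCode. In the sketch below (a hypothetical class, not a recommendation) every instance returns the same hash code; a HashSet still behaves correctly because equals() makes the final decision, it just loses the speed benefit of hashing.

```java
import java.util.HashSet;
import java.util.Set;

final class ConstantHashKey {
    private final int id;

    ConstantHashKey(int id) {
        this.id = id;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof ConstantHashKey && ((ConstantHashKey) o).id == id;
    }

    @Override
    public int hashCode() {
        return 42; // legal: equal objects agree, but every object lands in the same bin
    }

    // Adds 100 distinct keys and checks that lookups still give correct answers.
    static boolean lookupWorks() {
        Set<ConstantHashKey> set = new HashSet<>();
        for (int i = 0; i < 100; i++) {
            set.add(new ConstantHashKey(i));
        }
        return set.size() == 100
                && set.contains(new ConstantHashKey(57))
                && !set.contains(new ConstantHashKey(100));
    }

    public static void main(String[] args) {
        System.out.println(lookupWorks());
    }
}
```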
If you're reimplementing the hashCode function, the most important thing is to either rely on a tool to generate it for you, or to use code you understand that obeys those rules. The basic Java hashCode function uses an incredibly well-researched, seemingly simple bit of code for String hashing, so the code you see is based on turning everything into Strings and falling back to that.
If you don't know why that works, don't touch it. Just rely on it working and move on. That 31 is ridiculously important and ensures an even hashing distribution. See Why does Java's hashCode() in String use 31 as a multiplier? for the why on that one.
However, this might also be way more than you need. You could use id, but then you're basically negating the reason to use a hashCode (because now every object will want to be in a bin on its own, turning any hashed collection into a flat array. Kind of silly).
If you know the distribution of your id values, there are far easier hashCodes to come up with. Say you know they are always between 0 and Integer.MAX_VALUE, and you know there are never any gaps between ids; then you could simply generate a hashCode like
final int modulus = Integer.MAX_VALUE / 255;
public int hashCode() {
return this.id % modulus;
}
Now you have a hashCode optimised for 255 bins, fulfilling the necessary requirements for an acceptable hashCode function.
Note: In my answer I am assuming that you know how a hash code is meant to be used. The following only discusses the potential benefit of using a non-zero constant for the initial value of result.
If id is rarely 0 then it's fine to use it. However, if it's 0 frequently you should use some constant instead (just using 1 should be fine). The reason you want for it to be non-zero is so that the 31 * result part always adds some value to the hash. That way say if object A has all fields null or 0 except for yearOfPublication = 1 and object B has all fields null or 0 except for numOfPages = 1 the hash codes will be:
A.hashCode() => initialValue * 31 ^ 4 + 1
B.hashCode() => initialValue * 31 ^ 5 + 1
As you can see if initialValue is 0 then both hash codes are the same, however if it's not 0 then they will be different. It is preferable for them to be different so as to reduce collisions in data structures that use the hash code like HashMap.
That said, in your example of the Book class it is likely that id will never be 0. In fact, if id uniquely identifies the Book then you can have the hashCode() method just return the id.
I have a class which has three integers to represent it: a serverID, a streamID and a messageID.
I have some HashSets that are small but on which I do lots of operations such as set intersection, and others that contain 10K+ elements.
There are only a handful of values for serverID, but they are truly random numbers with a full 32-bits of randomness. Often there is only one serverID for a whole hashtable; other times just a couple of serverIDs.
The streamID is a small number, typically 0 but may be 1 or 2 sometimes.
The messageID is sequentially increasing for each serverID/streamID pair.
I currently have:
(-messageID << 24) ^ messageID ^ serverID ^ streamID
I want to understand whether I have a good hash function despite having a sequentially increasing messageID and not a lot of other bits to mix in.
What makes a good hashCode and how can I best mix these three numbers?
I personally always use strategy implemented in java.lang.String:
for (int i = 0; i < len; i++) {
h = 31*h + val[off++];
}
So, in your case I'd use the following: 31 * (31 * messageID + serverID) + streamID
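One rough way to sanity-check such a formula against the data described in the question is to count distinct hash codes over a realistic sample. The sketch below assumes a single serverID (the constant is an arbitrary stand-in for a random 32-bit value), streamIDs 0 to 2, and sequential messageIDs; the class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

final class MessageKeyHashDemo {
    // The 31-based combination suggested above.
    static int hash(int serverID, int streamID, int messageID) {
        return 31 * (31 * messageID + serverID) + streamID;
    }

    // Counts distinct hash codes for messageIDs 0 .. messages-1 and streamIDs 0 .. 2.
    static int distinctCodes(int serverID, int messages) {
        Set<Integer> seen = new HashSet<>();
        for (int m = 0; m < messages; m++) {
            for (int s = 0; s <= 2; s++) {
                seen.add(hash(serverID, s, m));
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        int serverID = 0x5EED1234; // arbitrary example value
        // 10,000 messages on 3 streams give 30,000 keys; the count shows how many codes they map to.
        System.out.println(distinctCodes(serverID, 10_000));
    }
}
```

Distinct codes are only part of the picture, since a hash table uses just some of the bits to pick a bucket, but a count well below the number of keys would be an early warning sign.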
Eclipse itself generates a good hashCode():
@Override
public int hashCode() {
final int prime = 31;
int result = 1;
result = prime * result + messageID;
result = prime * result + serverID;
result = prime * result + streamID;
return result;
}