The accepted answer in Best implementation for hashCode method gives a seemingly good method for finding Hash Codes. But I'm new to Hash Codes, so I don't quite know what to do.
For 1), does it matter what nonzero value I choose? Is 1 just as good as other numbers such as the prime 31?
For 2), do I add each value to c? What if I have two fields that are both a long, int, double, etc?
Did I interpret it right in this class:
public class MyClass {
    long a, b, c; // these are the only fields
    // some code and methods
    public int hashCode() {
        return 37 * (37 * ((int) (a ^ (a >>> 32))) + (int) (b ^ (b >>> 32)))
                + (int) (c ^ (c >>> 32));
    }
}
The exact value is not important; it can be whatever you want. Prime numbers tend to produce a better distribution of hashCode values, so they are preferred.
You do not necessarily have to add them; you are free to implement whatever algorithm you want, as long as it fulfills the hashCode contract:
Whenever it is invoked on the same object more than once during an execution of a Java application, the hashCode method must consistently return the same integer, provided no information used in equals comparisons on the object is modified. This integer need not remain consistent from one execution of an application to another execution of the same application.
If two objects are equal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce the same integer result.
It is not required that if two objects are unequal according to the equals(java.lang.Object) method, then calling the hashCode method on each of the two objects must produce distinct integer results. However, the programmer should be aware that producing distinct integer results for unequal objects may improve the performance of hash tables.
There are some algorithms which can be considered bad hashCode implementations, simple adding of the attribute values being one of them. The reason is that if you have a class with two fields, Integer a and Integer b, and your hashCode() just sums up these values, then the distribution of the hashCode values is highly dependent on the values your instances store. For example, if most of the values of a are between 0 and 10 and most of the values of b are between 0 and 10, then the hashCode values will be between 0 and 20. This implies that if you store instances of this class in e.g. a HashMap, numerous instances will end up in the same bucket (because numerous instances with different a and b values but the same sum will be put in the same bucket). This has a bad impact on the performance of operations on the map, because during a lookup all the elements in the bucket are compared using equals().
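To see the difference concretely, here is a quick sketch (the ranges are illustrative) that counts distinct hash values for all pairs of a and b between 0 and 10; plain addition yields only 21 distinct values for 121 pairs, while multiply-and-add keeps all 121 distinct:
Set<Integer> sumHashes = new HashSet<>();
Set<Integer> mulHashes = new HashSet<>();
for (int a = 0; a <= 10; a++) {
    for (int b = 0; b <= 10; b++) {
        sumHashes.add(a + b);      // plain sum: heavy collisions
        mulHashes.add(31 * a + b); // multiply-and-add: no collisions here
    }
}
System.out.println(sumHashes.size()); // 21
System.out.println(mulHashes.size()); // 121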
Regarding the algorithm: it looks fine. It is very similar to the one Eclipse generates, except that Eclipse uses a different prime number, 31 instead of 37:
@Override
public int hashCode() {
    final int prime = 31;
    int result = 1;
    result = prime * result + (int) (a ^ (a >>> 32));
    result = prime * result + (int) (b ^ (b >>> 32));
    result = prime * result + (int) (c ^ (c >>> 32));
    return result;
}
A well-behaved hashcode method already exists for long values - don't reinvent the wheel:
int hashCode = Long.hashCode((a * 31 + b) * 31 + c); // Java 8+
int hashCode = Long.valueOf((a * 31 + b) * 31 + c).hashCode(); // Java <8
Multiplying by a prime number (usually 31 in JDK classes) and accumulating the sum is a common way of creating a "unique" number from several numbers.
The hashCode() method of Long keeps the result properly distributed across the int range, making the hash "well behaved" (basically pseudo random).
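Applied to the three-long MyClass from the question, a minimal sketch (assuming Java 8+) would be:
@Override
public int hashCode() {
    // Combine the three longs, then fold the result with the built-in
    // Long.hashCode, which XORs the upper and lower 32 bits.
    return Long.hashCode((a * 31 + b) * 31 + c);
}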
Related
How strong is the hashing mechanism used in the Arrays.hashCode methods against collisions? What is the probability of two different arrays (of, say, double) having the exact same hash value calculated with these methods?
Arrays.hashCode(double[]) is specified to return the same value as a List of Double objects representing the same numeric values.
List.hashCode in turn is specified with a fairly simple algorithm:
int hashCode = 1;
for (E e : list)
    hashCode = 31*hashCode + (e==null ? 0 : e.hashCode());
In general, multiplication with a prime number is good practice for general-purpose hash functions, but it is far from a cryptographically strong hash function.
This means that while collisions are unlikely in the general (effectively random) case, they can usually be constructed quite easily if you can influence (or select) the hashCode of the items in the List.
As a constructed example consider these two statements:
System.out.println(Arrays.hashCode(new double[] {4.753E-321d}));
System.out.println(Arrays.hashCode(new double[] {4.9E-324d, 4.9E-324d}));
Both of these will output 993, despite being clearly different arrays.
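Here is a sketch of how such a collision can be constructed by working backwards from the algorithm: Double.MIN_VALUE (4.9E-324) has raw bits 1, which fold to an element hash of 1, so the two-element array hashes to 31 * (31 * 1 + 1) + 1 = 993. A one-element array therefore needs an element hash of 962 (since 31 * 1 + 962 = 993), and Double.longBitsToDouble(962) is approximately 4.753E-321:
// Same two arrays as above, spelled out via the bit patterns used to build them
System.out.println(Arrays.hashCode(new double[] {Double.longBitsToDouble(962)}));       // 993
System.out.println(Arrays.hashCode(new double[] {Double.MIN_VALUE, Double.MIN_VALUE})); // 993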
This is the implementation of Arrays.hashCode that you use (shown here for int[]; the double[] variant folds each element to an int and then does the same):
public static int hashCode(int a[]) {
    if (a == null)
        return 0;
    int result = 1;
    for (int element : a)
        result = 31 * result + element;
    return result;
}
If your values happen to be smaller than 31, they are treated like distinct digits in base 31, so each combination results in a different number (if we ignore overflow for now). Let's call those "pure hashes".
Now of course 31^11 is way larger than the number of ints in Java, so we will get tons of overflows. But since the powers of 31 and the maximum integer are "very different", you don't get an almost-random distribution, but a very regular, uniform one.
Let's consider a smaller example. Assume you have only 2 elements in your array, each in the range 0 to 4. I try to create "hash codes" between 0 and 37 by taking the "pure hash" modulo 38. The result is that I get streaks of 5 integers with small gaps in between, and not a single collision:
val hashes = for {
i <- 0 to 4
j <- 0 to 4
} yield (i * 31 + j) % 38
println(hashes.size) // prints 25
println(hashes.toSet.size) // prints 25
To verify whether this is what happens to your numbers, you might create a graph as follows: for each hash, take the first 16 bits for x and the second 16 bits for y, and color that dot black. I bet you will see an extremely regular pattern.
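A rough Java equivalent of that check (the snippet above is Scala; the values and range here are illustrative):
for (long v = 0; v < 1000; v++) {
    int hash = Arrays.hashCode(new double[] {v, v + 1});
    int x = hash >>> 16;   // upper 16 bits
    int y = hash & 0xFFFF; // lower 16 bits
    System.out.println(x + " " + y); // dump coordinates, plot with any tool
}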
I have a POJO with ~450 fields and I'm trying to compare instances of this POJO using hashCode. I've generated the overridden hashCode() method with Eclipse. In quite a few cases the generated hash code overflows the integer range. As a result, it's getting difficult to perform the comparison. What's the workaround?
The hashCode() method is as follows:
public int hashCode()
{
    final int prime = 31;
    int result = 1;
    result = prime * result + ((stringOne == null) ? 0 : stringOne.hashCode());
    result = prime * result + intOne;
    result = prime * result + Arrays.hashCode(someArray);
    result = prime * result + ((stringTwo == null) ? 0 : stringTwo.hashCode());
    result = prime * result + intTwo;
    result = prime * result + intThree;
    result = prime * result + ((stringThree == null) ? 0 : stringThree.hashCode());
    result = prime * result + ((stringFour == null) ? 0 : stringFour.hashCode());
    result = prime * result + ((stringFive == null) ? 0 : stringFive.hashCode());
    result = prime * result + ((objectOne == null) ? 0 : objectOne.hashCode());
    result = prime * result + ((objectTwo == null) ? 0 : objectTwo.hashCode());
    return result;
}
Integer overflow is a normal part of hashCode() calculations. It is not a problem.
For example, the hashCode() of a String is often negative.
System.out.println("The hashCode() of this String is negative".hashCode());
Overflow in a hashCode() calculation can obviously mean that unequal objects end up with the same hashCode, but this can happen even without overflow. For example, both of these print true.
System.out.println("Aa".hashCode() == "BB".hashCode());
System.out.println(new HashSet<>(Arrays.asList(1, 2)).hashCode() == Collections.singleton(3).hashCode());
The only requirement is that equal objects should have the same hashCode. There is no requirement that different objects should have different hashCodes.
hashCode() and equals() should also be quick. You can improve the performance of equals() by comparing the fields most likely to be different first and returning early. You can't do this with hashCode() because the calculation must involve all the relevant fields. If your class has 450 fields, you may want to consider caching the result of hashCode() or, better, refactoring your class into smaller units.
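A minimal sketch of such caching, assuming the object is effectively immutable once constructed (computeHashCode() is a hypothetical name standing in for your generated 450-field calculation):
private int cachedHash; // 0 means "not yet computed"

@Override
public int hashCode() {
    int h = cachedHash;
    if (h == 0) {
        h = computeHashCode(); // the expensive, generated calculation
        cachedHash = h;
    }
    return h;
}
This is essentially the idiom java.lang.String uses for its own hash; a hash that legitimately computes to 0 is simply recomputed on each call.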
The other thing to consider is whether you need to override these methods at all. It is only absolutely necessary if the objects are going to be used as keys in a hash-based container, such as HashMap.
The workaround is to use a different method to compute the hashcode. For instance, you could xor the hashcodes of your 450 fields (btw: wow!), but without knowing more about your object it's hard to say whether that would be a good approach for your particular case.
Ideally, since hashcodes are used for hashing, objects that are not equal should, with high probability, also produce different hashcodes.
From the Javadoc for BigDecimal:
Note: care should be exercised if BigDecimal objects are used as keys in a SortedMap or elements in a SortedSet since BigDecimal's natural ordering is inconsistent with equals.
For example, if you create a HashSet and add new BigDecimal("1.0") and new BigDecimal("1.00") to it, the set will contain two elements (because the values have different scales, so are non-equal according to equals and hashCode), but if you do the same thing with a TreeSet, the set will contain only one element, because the values compare as equal when you use compareTo.
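Concretely, a quick check of the behaviour described above:
Set<BigDecimal> hashSet = new HashSet<>();
hashSet.add(new BigDecimal("1.0"));
hashSet.add(new BigDecimal("1.00"));
System.out.println(hashSet.size()); // 2 - equals() sees different scales

Set<BigDecimal> treeSet = new TreeSet<>();
treeSet.add(new BigDecimal("1.0"));
treeSet.add(new BigDecimal("1.00"));
System.out.println(treeSet.size()); // 1 - compareTo() considers them equal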
Is there any specific reason behind this inconsistency?
From the OpenJDK implementation of BigDecimal:
/**
 * Compares this {@code BigDecimal} with the specified
 * {@code Object} for equality. Unlike {@link
 * #compareTo(BigDecimal) compareTo}, this method considers two
 * {@code BigDecimal} objects equal only if they are equal in
 * value and scale (thus 2.0 is not equal to 2.00 when compared by
 * this method).
 *
 * @param x {@code Object} to which this {@code BigDecimal} is
 *        to be compared.
 * @return {@code true} if and only if the specified {@code Object} is a
 *         {@code BigDecimal} whose value and scale are equal to this
 *         {@code BigDecimal}'s.
 * @see #compareTo(java.math.BigDecimal)
 * @see #hashCode
 */
@Override
public boolean equals(Object x) {
    if (!(x instanceof BigDecimal))
        return false;
    BigDecimal xDec = (BigDecimal) x;
    if (x == this)
        return true;
    if (scale != xDec.scale)
        return false;
    long s = this.intCompact;
    long xs = xDec.intCompact;
    if (s != INFLATED) {
        if (xs == INFLATED)
            xs = compactValFor(xDec.intVal);
        return xs == s;
    } else if (xs != INFLATED)
        return xs == compactValFor(this.intVal);
    return this.inflate().equals(xDec.inflate());
}
More from the implementation:
* <p>Since the same numerical value can have different
* representations (with different scales), the rules of arithmetic
* and rounding must specify both the numerical result and the scale
* used in the result's representation.
Which is why the implementation of equals takes scale into consideration. The constructor that takes a string as a parameter is implemented like this:
public BigDecimal(String val) {
    this(val.toCharArray(), 0, val.length());
}
where the third parameter will be used for the scale (in another constructor) which is why the strings 1.0 and 1.00 will create different BigDecimals (with different scales).
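You can observe the two components that equals() compares directly, via scale() and unscaledValue():
System.out.println(new BigDecimal("1.0").scale());          // 1
System.out.println(new BigDecimal("1.00").scale());         // 2
System.out.println(new BigDecimal("1.0").unscaledValue());  // 10
System.out.println(new BigDecimal("1.00").unscaledValue()); // 100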
From Effective Java By Joshua Bloch:
The final paragraph of the compareTo contract, which is a strong
suggestion rather than a true provision, simply states that the
equality test imposed by the compareTo method should generally return
the same results as the equals method. If this provision is obeyed,
the ordering imposed by the compareTo method is said to be consistent
with equals. If it’s violated, the ordering is said to be inconsistent
with equals. A class whose compareTo method imposes an order that is
inconsistent with equals will still work, but sorted collections
containing elements of the class may not obey the general contract of
the appropriate collection interfaces (Collection, Set, or Map). This
is because the general contracts for these interfaces are defined in
terms of the equals method, but sorted collections use the equality
test imposed by compareTo in place of equals. It is not a catastrophe
if this happens, but it’s something to be aware of.
The behaviour makes sense in the context of arithmetic precision, where trailing zeros are significant figures and 1.0 does not carry the same meaning as 1.00. Making them unequal is a reasonable choice.
However, from a comparison perspective, neither of the two is greater or less than the other, and the Comparable interface requires a total order (i.e. each BigDecimal must be comparable with any other BigDecimal). The only reasonable option here was to define a total order such that the compareTo method considers the two numbers equal.
Note that an inconsistency between equals and compareTo is not a problem as long as it's documented. It is sometimes even exactly what one needs.
BigDecimal works by having two numbers, an integer and a scale. The integer is the "number" and the scale is the number of digits to the right of the decimal place. Basically a base 10 floating point number.
When you say "1.0" and "1.00" these are technically different values in BigDecimal notation:
1.0
integer: 10
scale: 1
precision: 2
= 10 x 10 ^ -1
1.00
integer: 100
scale: 2
precision: 3
= 100 x 10 ^ -2
In scientific notation you wouldn't do either of those, it should be 1 x 10 ^ 0 or just 1, but BigDecimal allows it.
In compareTo the scale is ignored and the two are evaluated as ordinary numbers, 1 == 1. In equals the unscaled value and the scale are compared: 10 != 100 and 1 != 2. (Note that equals checks the type before the x == this shortcut; I assume the intention is that each BigDecimal is treated as a kind of number, not as an object.)
I would liken it to this:
// same number, different types
float floatOne = 1.0f;
double doubleOne = 1.0;
// true: 1 == 1
System.out.println( (double)floatOne == doubleOne );
// also compare a float to a double
Float boxFloat = floatOne;
Double boxDouble = doubleOne;
// false: one is a 32-bit Float and the other is a 64-bit Double
System.out.println( boxFloat.equals(boxDouble) );
// BigDecimal should behave essentially the same way
BigDecimal bdOne1 = new BigDecimal("1.0");
BigDecimal bdOne2 = new BigDecimal("1.00");
// true: 1 == 1 (compareTo returns 0 for numerically equal values)
System.out.println( bdOne1.compareTo(bdOne2) == 0 );
// false: 10 != 100 and 1 != 2 ensuring 2 digits != 3 digits
System.out.println( bdOne1.equals(bdOne2) );
Because BigDecimal allows for a specific "precision", comparing both the integer and the scale is more or less the same as comparing both the number and the precision.
There is a semi-caveat to that concerning BigDecimal's precision() method, which always returns 1 when the BigDecimal is 0. In that case compareTo plus precision evaluates to "equal" while equals evaluates to false. But 0 * 10^-1 should not equal 0 * 10^-2, because the former is the 2-digit number 0.0 and the latter is the 3-digit number 0.00. The equals method is comparing both the value and the number of digits.
I suppose it is weird that BigDecimal allows trailing zeroes but this is basically necessary. Doing a mathematical operation like "1.1" + "1.01" requires a conversion but "1.10" + "1.01" doesn't.
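For example, addition carries the larger scale through to the result, so no information about trailing zeros is lost:
System.out.println(new BigDecimal("1.1").add(new BigDecimal("1.01")));  // 2.11
System.out.println(new BigDecimal("1.10").add(new BigDecimal("1.01"))); // 2.11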
So compareTo compares BigDecimals as numbers and equals compares BigDecimals as BigDecimals.
If the comparison is unwanted, use a List or array where this doesn't matter. HashSet and TreeSet are of course designed specifically for holding unique elements.
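Alternatively, if you do want a hash-based set to treat numerically equal values as duplicates, one common workaround (a sketch, not the only option) is to normalise values on the way in with stripTrailingZeros:
Set<BigDecimal> set = new HashSet<>();
set.add(new BigDecimal("1.0").stripTrailingZeros());
set.add(new BigDecimal("1.00").stripTrailingZeros());
System.out.println(set.size()); // 1 - both normalise to the same value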
The answer is pretty short. The equals() method compares objects, while compareTo() compares values. In the case of BigDecimal, different objects can represent the same value. That's why equals() might return false while compareTo() returns 0.
equal objects => equal values
equal values =/> equal objects
An object is just a computer representation of some real-world value. For example, the same picture might be represented in GIF and JPEG formats. That's very much like BigDecimal, where the same value might have distinct representations.
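Or, in code:
BigDecimal x = new BigDecimal("2.0");
BigDecimal y = new BigDecimal("2.00");
System.out.println(x.equals(y));         // false - different representations
System.out.println(x.compareTo(y) == 0); // true  - same value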
I created a class "Book":
public class Book {
public static int idCount = 1;
private int id;
private String title;
private String author;
private String publisher;
private int yearOfPublication;
private int numOfPages;
private Cover cover;
...
}
And then I need to override the hashCode() and equals() methods.
@Override
public int hashCode() {
    int result = id; // !!!
    result = 31 * result + (title != null ? title.hashCode() : 0);
    result = 31 * result + (author != null ? author.hashCode() : 0);
    result = 31 * result + (publisher != null ? publisher.hashCode() : 0);
    result = 31 * result + yearOfPublication;
    result = 31 * result + numOfPages;
    result = 31 * result + (cover != null ? cover.hashCode() : 0);
    return result;
}
There's no problem with equals(). I'm just wondering about one thing in the hashCode() method.
Note: IntelliJ IDEA generated that hashCode() method.
So, is it OK to set the result variable to id, or should I use some prime number?
What is the better choice here?
Thanks!
Note that only the initial value of result is set to id, not the final one. The final value is calculated by combining that initial value with the hash codes of the other parts of the object, each multiplied by a power of a small prime (here, 31). Using id rather than an arbitrary prime is definitely right in this context.
In general, there is no advantage to hash code being prime (it's the number of hash buckets that needs to be prime). Using an int as its own hash code (in your case, that's id and numOfPages) is a valid approach.
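The JDK itself follows this approach: an Integer is its own hash code, so there is nothing special you need to do for int fields like id.
System.out.println(Integer.valueOf(42).hashCode()); // 42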
It helps to know what the hashCode is used for. It's supposed to help you map a theoretically infinite set of objects onto a small number of "bins": each bin has a number, and each object says which bin it wants to go in based on its hashCode. The question is not whether it's okay to do one thing or another, but whether what you want to do matches what the hashCode function is for.
As per http://docs.oracle.com/javase/6/docs/api/java/lang/Object.html#hashCode(), it's not about the number you return, it's about how it behaves for different objects of the same class.
If the object doesn't change, the hashCode must be the same value every time you call the hashCode() function.
Two objects that are equal according to .equals, must have the same hashCode.
Two objects that are not equal may have the same hashCode. (if this wasn't the case, there would be no point in using the hashCode at all, because every object already has a unique object pointer)
If you're reimplementing the hashCode function, the most important thing is to either rely on a tool to generate it for you, or to use code you understand that obeys those rules. The hashing used by Java's own String class is an incredibly well-researched, seemingly simple bit of code, and generated implementations like the one above follow the same pattern.
If you don't know why that works, don't touch it. Just rely on it working and move on. That 31 is ridiculously important for an even hashing distribution. See Why does Java's hashCode() in String use 31 as a multiplier? for the why on that one.
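For reference, the String hash is specified as s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], the same multiply-by-31-and-add pattern as the generated method above:
System.out.println("Ab".hashCode()); // 'A'*31 + 'b' = 65*31 + 98 = 2113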
However, this might also be way more than you need. You could use id, but then you're basically negating the reason to use a hashCode (because now every object will want to be in a bin on its own, turning any hashed collection into a flat array. Kind of silly).
If you know the distribution of your id values, there are far easier hashCodes to come up with. Say you know they are always between 0 and Integer.MAX_VALUE, and you know there are never any gaps between ids; then you could simply generate a hashCode like:
final int modulus = Integer.MAX_VALUE / 255;

int hashCode() {
    return this.id % modulus;
}
Now you have a hashCode optimised for 255 bins, fulfilling the necessary requirements for an acceptable hashCode function.
Note: in my answer I am assuming that you know how a hash code is meant to be used. The following just discusses the potential benefit of using a non-zero constant as the initial value of result.
If id is rarely 0, then it's fine to use it. However, if it's frequently 0, you should use some constant instead (just using 1 is fine). The reason you want it to be non-zero is so that the 31 * result part always adds some value to the hash. That way, if, say, object A has all fields null or 0 except for yearOfPublication = 1 and object B has all fields null or 0 except for numOfPages = 1, then the partial hash codes up to those fields will be:
A.hashCode() => initialValue * 31^4 + 1
B.hashCode() => initialValue * 31^5 + 1
As you can see, if initialValue is 0 then both partial results are the same, whereas if it's non-zero they differ. It is preferable for them to be different, so as to reduce collisions in data structures that use the hash code, like HashMap.
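The effect of a non-zero seed is easiest to see with variable-length structures such as List, which uses the same accumulation starting at 1: with a seed of 0, every list whose elements all hash to 0 would collapse to hash 0, while the seed of 1 keeps different lengths apart:
System.out.println(Arrays.asList(0).hashCode());    // 31 * 1 + 0 = 31
System.out.println(Arrays.asList(0, 0).hashCode()); // 31 * 31 + 0 = 961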
That said, in your example of the Book class it is likely that id will never be 0. In fact, if id uniquely identifies the Book then you can have the hashCode() method just return the id.
On SO I have read several answers related to the implementation of hashCode and the suggestion to use the XOR operator (e.g. Why are XOR often used in java hashCode() but another bitwise operators are used rarely?).
When I use Eclipse to generate the hashcode function where field is an object and timestamp a long, the output is:
public int hashCode() {
    final int prime = 31;
    int result = 1;
    result = prime * result
            + ((field == null) ? 0 : field.hashCode());
    return result;
}
Is there any reason for not using the XOR operator, like below?
result = prime * result + (int) (timestamp ^ (timestamp >>> 32));
Eclipse takes the safe way out. Although the calculation method that uses a prime, a multiplication, and an addition is slower than a single XOR, it gives you an overall better hash code in situations when you have multiple fields.
Consider a simple example - a class with two Strings, a and b. You can use
a.hashCode() ^ b.hashCode()
or
a.hashCode() * 31 + b.hashCode()
Now consider two objects:
a = "ABC"; b = "XYZ"
and
a = "XYZ"; b = "ABC"
The first method will produce identical hash codes for them, because XOR is symmetric; the second method will produce different hash codes, which is good, because the objects are not equal. In general, you want non-equal objects to have different hash codes as often as possible, to improve performance of hash-based containers of these objects. The 31*a+b method achieves this goal better than XOR.
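You can verify this directly:
String a1 = "ABC", b1 = "XYZ";
String a2 = "XYZ", b2 = "ABC";
// true - XOR is symmetric, so swapping the fields collides:
System.out.println((a1.hashCode() ^ b1.hashCode()) == (a2.hashCode() ^ b2.hashCode()));
// false - the order-sensitive form keeps them apart:
System.out.println((a1.hashCode() * 31 + b1.hashCode()) == (a2.hashCode() * 31 + b2.hashCode()));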
Note that when you are dealing with portions of the same object, as in
timestamp ^ (timestamp >>> 32)
the above argument is much weaker: encountering two timestamps such that the only difference between them is that their upper and lower parts are swapped is harder to imagine than two objects with swapped a and b field values.