Redefinition of equals() - java

This is an extract from Core Java by C. Horstmann.
+++++
The hashCode method should return an integer (which can be negative). Just combine the
hash codes of the instance fields so that the hash codes for different objects are likely to
be widely scattered.
For example, here is a hashCode method for the Employee class:
class Employee
{
    public int hashCode()
    {
        return 7 * name.hashCode()
             + 11 * new Double(salary).hashCode()
             + 13 * hireDay.hashCode();
    }
    . . .
}
+++
I can't understand these 7, 11, and 13. Are they just pulled out of a hat? Without them the result (checking for equality of two objects) seems to be the same.

In general, testing for equality does not use the hash code.
The 7, 11, 13 are all prime numbers. This lowers the possibility of two different employees having the same hash code (because of Bézout's theorem).
In fact, I would suggest (to widen the obtained hash) using much bigger but non-consecutive primes, e.g. 1039, 2011, 32029. On Linux, the /usr/games/primes utility from package bsdgames is very useful to get them.
What is important is that if two things compare equal they have the same hash code.
For performance reasons, you want the hash codes to be widely distributed (so if two things are not equal, their hash codes should usually be different) to lower the probability of hash collisions.
Read the Wikipedia page on hash tables.
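As a hedged sketch of that suggestion (assuming the Employee fields from the Horstmann excerpt, with hireDay taken to be a java.time.LocalDate and Double.hashCode(salary) standing in for new Double(salary).hashCode()):

import java.time.LocalDate;

class Employee
{
    private String name;
    private double salary;
    private LocalDate hireDay;   // assumed type for this sketch

    // Combine the field hashes with larger, non-consecutive primes, as suggested above.
    public int hashCode()
    {
        return 1039 * name.hashCode()
             + 2011 * Double.hashCode(salary)
             + 32029 * hireDay.hashCode();
    }

    // Whatever hashCode uses, equals must agree: equal employees get equal hash codes.
    public boolean equals(Object other)
    {
        if (this == other) return true;
        if (!(other instanceof Employee)) return false;
        Employee e = (Employee) other;
        return name.equals(e.name)
            && Double.compare(salary, e.salary) == 0
            && hireDay.equals(e.hireDay);
    }
}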

The numbers are prime numbers.
You don't want to just add the hash codes, because that would give you more collisions.
E.g.:
situation A: foo="bla", bar="111"
situation B: foo="111", bar="bla"
This means that foo.hashCode() + bar.hashCode() will return the same value in both situations. You use prime numbers because the function f: Z/2^32 -> Z/2^32, x -> x * p (mod 2^32) is bijective if p is a prime > 2 (i.e., you would lose bits if you multiplied by 256 instead...).
Collisions only need to be avoided if you use something like hash sets.
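A small demonstration of that point, using the values from the example (the weights 7 and 11 are just the ones from the excerpt above):

public class HashOrderDemo {
    public static void main(String[] args) {
        String fooA = "bla", barA = "111";   // situation A
        String fooB = "111", barB = "bla";   // situation B

        // Plain addition commutes, so both situations collide:
        System.out.println(fooA.hashCode() + barA.hashCode());
        System.out.println(fooB.hashCode() + barB.hashCode());

        // A prime-weighted combination is order-sensitive, so the values differ:
        System.out.println(7 * fooA.hashCode() + 11 * barA.hashCode());
        System.out.println(7 * fooB.hashCode() + 11 * barB.hashCode());
    }
}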

Multiplying with primes is a common optimization which is often done for you by your IDE. I wouldn't do it if there is no need for optimization.

Related

Can anybody explain how Java designed HashMap's hash() function?

After I read the JDK's source code, I find HashMap's hash() function interesting. Its source code looks like this:
static int hash(int h) {
    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}
The parameter h is the hashCode of the object that was put into the HashMap. How does this method work, and why? Why can this method defend against poor hashCode functions?
Hashtable uses the 'classical' approach of prime numbers: to get the 'index' of a value, you take the hash of the key and perform the modulus against the size. Taking a prime number as size, gives (normally) a nice spread over the indexes (depending on the hash as well, of course).
HashMap uses a 'power of two'-approach, meaning the sizes are a power of two. The reason is it's supposed to be faster than prime number calculations. However, since a power of two is not a prime number, there would be more collisions, especially with hash values having the same lower bits.
Why? The modulus performed against the size to get the (bucket/slot) index is simply calculated by: hash & (size-1) (which is exactly what's used in HashMap to get the index!). That's basically the problem with the 'power-of-two' approach: if the length is limited, e.g. 16, the default value of HashMap, only the last bits are used and hence, hash values with the same lower bits will result in the same (bucket) index. In the case of 16, only the last 4 bits are used to calculate the index.
That's why an extra hash is calculated: basically, it shifts the higher bit values down and combines them with the lower bit values. The reason for the numbers 20, 12, 7 and 4, I don't really know. They used to be different (in Java 1.5 or so, the hash function was a little different). I suppose there's more advanced literature available. You might find more info about why they use the numbers they use in all kinds of algorithm-related literature, e.g.
http://en.wikipedia.org/wiki/The_Art_of_Computer_Programming
http://mitpress.mit.edu/books/introduction-algorithms
http://burtleburtle.net/bob/hash/evahash.html#lookup uses different algorithms depending on the length (which makes some sense).
http://www.javaspecialists.eu/archive/Issue054.html is probably interesting as well. Check the reaction of Joshua Bloch near the bottom of the article: "The replacement secondary hash function (which I developed with the aid of a computer) has strong statistical properties that pretty much guarantee good bucket distribution.") So, if you ask me, the numbers come from some kind of analysis performed by Josh himself, probably assisted by who knows who.
So: a power of two gives faster calculation, but it necessitates an additional hash calculation in order to get a nice spread over the slots/buckets.
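To illustrate the point about the low bits, here is a small sketch (the key hash values are made up, and hash(int) is copied from the question): two hash codes that differ only above the low 4 bits land in the same bucket of a 16-entry table until the extra mixing step is applied.

public class PowerOfTwoIndexDemo {
    static int hash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    public static void main(String[] args) {
        int size = 16;                       // HashMap's default capacity
        int h1 = 0x10, h2 = 0x20;            // differ only above the low 4 bits

        System.out.println(h1 & (size - 1)); // 0: same bucket...
        System.out.println(h2 & (size - 1)); // 0: ...without any mixing

        System.out.println(hash(h1) & (size - 1)); // after mixing, the indexes
        System.out.println(hash(h2) & (size - 1)); // (usually) differ
    }
}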

Explanation of the constants used while calculating hashcode value of java.util.hash

Can someone explain the significance of these constants and why they are chosen?
static int hash(int h) {
    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}
source: java-se6 library
Understanding what makes for a good hash function is tricky, as there are in fact a great many different functions that are used and for slightly different purposes.
Java's hash tables work as follows:
They ask the key object to produce its hash code. The implementation of the hashCode() method is likely to be of distinctly variable quality (in the worst case, returning a constant value!) and will definitely not be adapted to the particular hash table you're working with.
They then use the above function to mix the bits up a bit, so that information present in the high bits also gets moved down to the low bits. This is important because next …
They take the mod of the hash code (w.r.t. the number of hash table array entries) to get the index into the array of hash table chains. There's a distinct possibility that the hash table array will have size equivalent to a power of 2, so the mixing down of the bits in step 2 is important to ensure that they don't just get thrown away.
They then traverse the chain until they get to the entry with an equal key (according to the equals() method).
To complete the picture, the number of entries in the hash table array is non-constant; if the chains get too long the array gets replaced with a new larger array and everything gets rehashed. That's relatively fast and has good performance implications for normal use patterns (e.g., lots of put()s followed by lots of get()s).
The actual constants used are fairly arbitrary (and are probably chosen by experiment with some simple corpus including things like large numbers of Integer and String values) but their purpose is not: getting the information in the whole value spread to most of the low bits in the value ensures that such information as is present in the output of the hashCode() is used as well as possible.
(You wouldn't do this with perfect hashing or cryptographic hashing; despite the similar names, they have very different implementation strategies. The former requires knowledge of the key space so that collisions are avoided/reduced, and the latter needs information to be moved about in all directions, not just to the low bits.)
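To make those steps concrete, here is a deliberately tiny chained map. It is only a sketch (resizing, null keys and everything else the real java.util.HashMap does are omitted), and the class and method names are invented for illustration:

import java.util.LinkedList;

class TinyChainedMap<K, V> {
    private static final int SIZE = 16;                       // power of two
    private final LinkedList<Object[]>[] buckets = new LinkedList[SIZE];

    private static int mix(int h) {                           // step 2: spread high bits down
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    private int indexFor(Object key) {                        // steps 1 and 3
        return mix(key.hashCode()) & (SIZE - 1);
    }

    public void put(K key, V value) {
        int i = indexFor(key);
        if (buckets[i] == null) buckets[i] = new LinkedList<>();
        for (Object[] entry : buckets[i]) {                   // step 4: equals() decides
            if (entry[0].equals(key)) { entry[1] = value; return; }
        }
        buckets[i].add(new Object[] { key, value });
    }

    @SuppressWarnings("unchecked")
    public V get(K key) {
        int i = indexFor(key);
        if (buckets[i] == null) return null;
        for (Object[] entry : buckets[i]) {
            if (entry[0].equals(key)) return (V) entry[1];
        }
        return null;
    }
}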
I have also wondered about such "magic" numbers. As far as I know they are magic numbers.
It has been proven by extensive testing that odd and prime numbers have interesting properties that can be exploited in hashing (avoiding primary/secondary clustering, etc.).
I believe that most of the numbers come from research and testing that statistically showed them to give good distributions. Why specifically these numbers do that, I have no idea, but I have the impression (hopefully colleagues here can correct me if I am way off) that not even the implementers know why these specific numbers have these qualities.

A good hash function to use in interviews for integer numbers, strings?

I have come across situations in interviews where I needed to use a hash function for integer numbers or for strings. In such situations, which ones should we choose? I've gone wrong here because I end up choosing ones which generate a lot of collisions, yet hash functions tend to be so mathematical that you cannot recall them in an interview. Are there any general recommendations, so that at least the interviewer is satisfied with your approach for integer or string inputs? Which functions would be adequate for both inputs in an "interview situation"?
Here is a simple recipe from Effective Java, page 33:
1. Store some constant nonzero value, say, 17, in an int variable called result.
2. For each significant field f in your object (each field taken into account by the equals method, that is), do the following:
   a. Compute an int hash code c for the field:
      i. If the field is a boolean, compute (f ? 1 : 0).
      ii. If the field is a byte, char, short, or int, compute (int) f.
      iii. If the field is a long, compute (int) (f ^ (f >>> 32)).
      iv. If the field is a float, compute Float.floatToIntBits(f).
      v. If the field is a double, compute Double.doubleToLongBits(f), and then hash the resulting long as in step 2.a.iii.
      vi. If the field is an object reference and this class's equals method compares the field by recursively invoking equals, recursively invoke hashCode on the field. If a more complex comparison is required, compute a "canonical representation" for this field and invoke hashCode on the canonical representation. If the value of the field is null, return 0 (or some other constant, but 0 is traditional).
      vii. If the field is an array, treat it as if each element were a separate field. That is, compute a hash code for each significant element by applying these rules recursively, and combine these values per step 2.b. If every element in an array field is significant, you can use one of the Arrays.hashCode methods added in release 1.5.
   b. Combine the hash code c computed in step 2.a into result as follows: result = 31 * result + c;
3. Return result.
When you are finished writing the hashCode method, ask yourself whether
equal instances have equal hash codes. Write unit tests to verify your intuition!
If equal instances have unequal hash codes, figure out why and fix the problem.
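Putting the recipe together, here is a hedged sketch for a hypothetical class with one field of each representative type (the class and field names are invented for illustration):

public final class Person {
    private final String name;     // object reference
    private final long id;         // long
    private final double salary;   // double
    private final boolean active;  // boolean

    public Person(String name, long id, double salary, boolean active) {
        this.name = name;
        this.id = id;
        this.salary = salary;
        this.active = active;
    }

    public int hashCode() {
        int result = 17;                                     // constant nonzero seed
        result = 31 * result + (name == null ? 0 : name.hashCode());
        result = 31 * result + (int) (id ^ (id >>> 32));     // long -> int
        long bits = Double.doubleToLongBits(salary);         // double -> long
        result = 31 * result + (int) (bits ^ (bits >>> 32)); // -> int, as for a long
        result = 31 * result + (active ? 1 : 0);             // boolean -> int
        return result;
    }

    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Person)) return false;
        Person p = (Person) o;
        return id == p.id
            && active == p.active
            && Double.compare(salary, p.salary) == 0
            && (name == null ? p.name == null : name.equals(p.name));
    }
}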
You should ask the interviewer what the hash function is for - the answer to this question will determine what kind of hash function is appropriate.
If it's for use in hashed data structures like hashmaps, you want it to be as simple as possible (fast to execute) and to avoid collisions (most common values map to different hash values). A good example is an integer hashing to the same integer - this is the standard hashCode() implementation in java.lang.Integer.
If it's for security purposes, you will want to use a cryptographic hash function. These are primarily designed so that it is hard to reverse the hash function or find collisions.
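For the security case, one option (an illustrative sketch, not something prescribed by the answer above) is to lean on the JDK's MessageDigest rather than hand-rolling anything; SHA-256 is just an example algorithm name:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class CryptoHashDemo {
    public static void main(String[] args) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] digest = md.digest("some input".getBytes(StandardCharsets.UTF_8));

        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));   // render the digest as hex
        }
        System.out.println(hex);
    }
}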
If you want fast pseudo-random-ish hash values (e.g. for a simulation) then you can usually modify a pseudo-random number generator to create these. My personal favourite is:
public static final int hash(int a) {
    a ^= (a << 13);
    a ^= (a >>> 17);
    a ^= (a << 5);
    return a;
}
If you are computing a hash for some form of composite structure (e.g. a string with multiple characters, or an array, or an object with multiple fields), then there are various techniques you can use to create a combined hash function. I'd suggest something that XORs the rotated hash values of the constituent parts, e.g.:
public static <T> int hashCode(T[] data) {
    int result = 0;
    for (int i = 0; i < data.length; i++) {
        result ^= data[i].hashCode();
        result = Integer.rotateRight(result, 1);
    }
    return result;
}
Note the above is not cryptographically secure, but will do for most other purposes. You will obviously get collisions, but that's unavoidable when hashing a large structure to an integer :-)
For integers, I usually go with k % p where p = size of the hash table and is a prime number and for strings I choose hashcode from String class. Is this sufficient enough for an interview with a major tech company? – phoenix
Maybe not. It's not uncommon to need to provide a hash function to a hash table whose implementation is unknown to you. Further, if you hash in a way that depends on the implementation using a prime number of buckets, then your performance may degrade if the implementation changes due to a new library, compiler, OS port, etc.
Personally, I think the important thing at interview is a clear understanding of the ideal characteristics of a general-purpose hash algorithm, which is basically that for any two input keys with values varying by as little as one bit, each and every bit in the output has about a 50/50 chance of flipping. I found that quite counter-intuitive, because a lot of the hashing functions I first saw used bit-shifts and XOR, and a flipped input bit usually flipped one output bit (usually in another bit position), so 1-input-bit-affects-many-output-bits was a little revelation moment when I read it in one of Knuth's books. With this knowledge you're at least capable of testing and assessing specific implementations regardless of how they're implemented.
One approach I'll mention because it achieves this ideal and is easy to remember, though the memory usage may make it slower than mathematical approaches (it could be faster too, depending on hardware), is to simply use each byte in the input to look up a table of random ints. For example, given a 24-bit RGB value and int table[3][256], table[0][r] ^ table[1][g] ^ table[2][b] is a great int-sized hash value - indeed "perfect" if the inputs are randomly scattered through the int values (rather than, say, incrementing - see below). This approach isn't ideal for long or arbitrary-length keys, though you can start revisiting tables and bit-shifting the values, etc.
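As a sketch of that table-lookup idea for the 24-bit RGB case (the fixed seed and the table layout are illustrative choices, nothing more):

import java.util.Random;

public class TabulationHashDemo {
    private static final int[][] TABLE = new int[3][256];    // one table per input byte
    static {
        Random rnd = new Random(42);                          // fixed seed, for reproducibility
        for (int[] row : TABLE) {
            for (int i = 0; i < row.length; i++) row[i] = rnd.nextInt();
        }
    }

    // r, g and b are each expected to be in 0..255.
    static int hashRgb(int r, int g, int b) {
        return TABLE[0][r] ^ TABLE[1][g] ^ TABLE[2][b];
    }

    public static void main(String[] args) {
        System.out.println(hashRgb(255, 128, 0));
    }
}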
All that said, you can sometimes do better than this randomising approach for specific cases where you are aware of the patterns in the input keys and/or the number of buckets involved (for example, you may know the input keys are contiguous from 1 to 100 and there are 128 buckets, so you can pass the keys through without any collisions). If, however, the input ceases to meet your expectations, you can get horrible collision problems, while a "randomising" approach should never get much worse than load (size() / buckets) implies. Another interesting insight is that when you want a quick-and-mediocre hash, you don't necessarily have to incorporate all the input data when generating the hash: e.g. last time I looked at Visual C++'s string hashing code it picked ten letters evenly spaced along the text to use as inputs....

Hashcode comparison problem

I have a list of objects, each of which is termed a rule in our case; such an object is itself a list of fields for which I have to do hashcode comparison, as we can't have duplicate rules in the system.
For example, say I have two rules R1 and R2 with fields A and B.
If the values of A and B in R1 are 7 and 2 respectively, and in R2 they are 3 and 4 respectively, then the process I have used to check for duplicate rules in the system, namely hashcode comparison, fails.
The method which I have used is:
for (Rule rule : rules) {
    changeableAttrCode = 0;
    fieldCounter = 1;
    attributes = rule.getAttributes();
    for (RuleField ruleField : attributes) {
        changeableAttrCode = changeableAttrCode + (fieldCounter * ruleField.getValue().hashCode());
        fieldCounter++;
    }
    parameters = rule.getParameters();
    for (RuleField ruleField : parameters) {
        changeableAttrCode = changeableAttrCode + (fieldCounter * ruleField.getValue().hashCode());
        fieldCounter++;
    }
    changeableAttrCodes.add(changeableAttrCode);
}
Here changeableAttrCodes is where we store the hash codes of all the rules.
So can you please suggest a better method, so that this kind of problem does not arise in the future and duplicate rules in the system can still be detected?
Thanks in advance
hashCode() is not meant to be used to check for equality. return 42; is a perfectly valid implementation of hashCode(). Why don't you override equals() (and hashCode() for that matter) in the rule objects and use that to check whether two rules are equal? You could still use the hash code to check which objects you need to investigate, since two equal() objects should always have the same hash code, but that is a performance improvement that you may or may not need, depending on your system.
Implement hashCode and equals in class Rule.
Implementation of equals has to compare its values.
Then use a HashSet<Rule> and ask if(mySet.contains(newRule))
HashSet + equals implementation solves the problem of the non-uniqueness of the hash. It uses hash for classifying and speed but it uses equals at the end to ensure that two Rules with same hash are the same Rule or not.
More on hashes: if you want to do it by hand, use the prime number suggestion and review the JDK code for String hash codes. If you want to make a clean implementation, try to retrieve the hash codes of the elements, make some kind of array of ints, and use Arrays.hashCode(int[]) to get a hash code for the combination of them.
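A minimal sketch of that approach, assuming a much-simplified Rule whose significant state is just an array of int field values (the real Rule/RuleField classes from the question are not reproduced here):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class Rule {
    private final int[] fieldValues;

    Rule(int... fieldValues) {
        this.fieldValues = fieldValues.clone();
    }

    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Rule)) return false;
        return Arrays.equals(fieldValues, ((Rule) o).fieldValues);
    }

    public int hashCode() {
        return Arrays.hashCode(fieldValues);      // position-sensitive combination
    }

    public static void main(String[] args) {
        Set<Rule> rules = new HashSet<>();
        rules.add(new Rule(7, 2));
        System.out.println(rules.contains(new Rule(3, 4)));  // false: equals() decides
        System.out.println(rules.contains(new Rule(7, 2)));  // true: duplicate detected
    }
}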
Update: Your hashing algorithm is not producing a good spread of hash values - it gives the same value for (7, 2) and (3, 4):
1 * 7 + 2 * 2 = 11
1 * 3 + 2 * 4 = 11
It would also give the same value for (11, 0), (-1, 6), ... and one can trivially make up an endless number of similar equivalence classes based on your current algorithm.
Of course you can not avoid collisions - if you have enough instances, hash collision is inevitable. However, you should aim to minimize the chance for collisions. Good hashing algorithms strive to spread hash values equally over a wide range of values. A typical way to achieve this is to generate the hash value for an object containing n independent fields as an n-digit number with a base big enough to hold the different hash values for the individual fields.
In your case, instead of multiplying with fieldCounter you should multiply with a prime constant, e.g. 31 (that would be the base of your number), and add another prime constant to the result, e.g. 17. This gives you a better spread of hash values. (Of course the concrete base depends on what values your fields can take - I have no info about that.)
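For instance, with an illustrative seed of 17 and multiplier 31, the two rules from the question no longer produce the same value:

R1: (17 * 31 + 7) * 31 + 2 = 534 * 31 + 2 = 16556
R2: (17 * 31 + 3) * 31 + 4 = 530 * 31 + 4 = 16434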
Also if you implement hashCode, you are strongly advised to implement equals as well - and in fact, you should use the latter to test for equality.
Here is an article about implementing hashCode.
I don't understand what you are trying to do here. With most hash function scenarios, collision is inevitable, because there are way more objects to hash than there are possible hash values (it's the pigeonhole principle).
It is generally the case that two different objects may have the same hash value. You cannot rely on hash functions alone to eliminate duplicates.
Some hash functions are better than others in minimizing collisions, but it's still an inevitability.
That said, there are some simple guidelines that usually give a good enough hash function. Joshua Bloch gives the following in his book Effective Java, 2nd Edition:
1. Store some constant nonzero value, say 17, in an int variable called result.
2. Compute an int hash code c for each field:
   - If the field is a boolean, compute (f ? 1 : 0).
   - If the field is a byte, char, short, or int, compute (int) f.
   - If the field is a long, compute (int) (f ^ (f >>> 32)).
   - If the field is a float, compute Float.floatToIntBits(f).
   - If the field is a double, compute Double.doubleToLongBits(f), then hash the resulting long as above.
   - If the field is an object reference and this class's equals method compares the field by recursively invoking equals, recursively invoke hashCode on the field. If the value of the field is null, return 0.
   - If the field is an array, treat it as if each element were a separate field. If every element in an array field is significant, you can use one of the Arrays.hashCode methods added in release 1.5.
3. Combine the hash code c into result as follows: result = 31 * result + c;
I started to write that the only way you can achieve what you want is with Perfect Hashing.
But then I thought about the fact that you said you can't duplicate objects in your system.
Edit based on thought-provoking comment from helios:
Your solution depends on what you meant when you wrote that you "can't duplicate rules".
If you meant that literally you cannot, that there is guaranteed to be only one instance of a rule with a particular set of values, then your problem is trivial: you can do identity comparison using ==.
If, on the other hand, you meant that you shouldn't for some reason (e.g. performance), then your problem is also trivial: just do value comparisons.
Given the way you've defined your problem, under no circumstances should you be considering the use of hashcodes as a substitute for equality. As others have noted, hashcodes by their nature yield collisions (false equality), unless you go to a Perfect Hashing solution, but why would you in this case?

Why does Java's hashCode() in String use 31 as a multiplier?

Per the Java documentation, the hash code for a String object is computed as:
s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
using int arithmetic, where s[i] is the i-th character of the string, n is the length of the string, and ^ indicates exponentiation.
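In loop form, that formula amounts to the following Horner-style accumulation (a sketch of what String.hashCode effectively computes, not the JDK source verbatim):

static int stringHash(String s) {
    int h = 0;
    for (int i = 0; i < s.length(); i++) {
        h = 31 * h + s.charAt(i);   // h = s[0]*31^(n-1) + ... + s[n-1], with int overflow
    }
    return h;
}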
Why is 31 used as a multiplier?
I understand that the multiplier should be a relatively large prime number. So why not 29, or 37, or even 97?
According to Joshua Bloch's Effective Java (a book that can't be recommended enough, and which I bought thanks to continual mentions on stackoverflow):
The value 31 was chosen because it is an odd prime. If it were even and the multiplication overflowed, information would be lost, as multiplication by 2 is equivalent to shifting. The advantage of using a prime is less clear, but it is traditional. A nice property of 31 is that the multiplication can be replaced by a shift and a subtraction for better performance: 31 * i == (i << 5) - i. Modern VMs do this sort of optimization automatically.
(from Chapter 3, Item 9: Always override hashcode when you override equals, page 48)
Goodrich and Tamassia computed from over 50,000 English words (formed as the union of the word lists provided in two variants of Unix) that using the constants 31, 33, 37, 39, and 41 will produce fewer than 7 collisions in each case. This may be the reason that so many Java implementations choose such constants.
See section 9.2 Hash Tables (page 522) of Data Structures and Algorithms in Java.
On (mostly) old processors, multiplying by 31 can be relatively cheap. On an ARM, for instance, it is only one instruction:
RSB r1, r0, r0, ASL #5 ; r1 := - r0 + (r0<<5)
Most other processors would require a separate shift and subtract instruction. However, if your multiplier is slow this is still a win. Modern processors tend to have fast multipliers so it doesn't make much difference, so long as 32 goes on the correct side.
It's not a great hash algorithm, but it's good enough and better than the 1.0 code (and very much better than the 1.0 spec!).
By multiplying, bits are shifted to the left. This uses more of the available space of hash codes, reducing collisions.
By not using a power of two, the lower-order, rightmost bits are populated as well, to be mixed with the next piece of data going into the hash.
The expression n * 31 is equivalent to (n << 5) - n.
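A quick sanity check of that identity; it holds for every int, including under overflow, because both sides wrap modulo 2^32:

public class ShiftIdentityDemo {
    public static void main(String[] args) {
        int[] samples = { 0, 1, -1, 12345, Integer.MAX_VALUE, Integer.MIN_VALUE };
        for (int i : samples) {
            System.out.println(31 * i == (i << 5) - i);   // prints true every time
        }
    }
}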
You can read Bloch's original reasoning under "Comments" in http://bugs.java.com/bugdatabase/view_bug.do?bug_id=4045622. He investigated the performance of different hash functions in regards to the resulting "average chain size" in a hash table. P(31) was one of the common functions during that time which he found in K&R's book (but even Kernighan and Ritchie couldn't remember where it came from). In the end he basically had to choose one and so he took P(31) since it seemed to perform well enough. Even though P(33) was not really worse and multiplication by 33 is equally fast to calculate (just a shift by 5 and an addition), he opted for 31 since 33 is not a prime:
Of the remaining
four, I'd probably select P(31), as it's the cheapest to calculate on a RISC
machine (because 31 is the difference of two powers of two). P(33) is
similarly cheap to calculate, but it's performance is marginally worse, and
33 is composite, which makes me a bit nervous.
So the reasoning was not as rational as many of the answers here seem to imply. But we're all good in coming up with rational reasons after gut decisions (and even Bloch might be prone to that).
Actually, 37 would work pretty well! z := 37 * x can be computed as y := x + 8 * x; z := x + 4 * y. Both steps correspond to one LEA x86 instruction, so this is extremely fast.
In fact, multiplication with the even-larger prime 73 could be done at the same speed by setting y := x + 8 * x; z := x + 8 * y.
Using 73 or 37 (instead of 31) might be better, because it leads to denser code: The two LEA instructions only take 6 bytes vs. the 7 bytes for move+shift+subtract for the multiplication by 31. One possible caveat is that the 3-argument LEA instructions used here became slower on Intel's Sandy bridge architecture, with an increased latency of 3 cycles.
Moreover, 73 is Sheldon Cooper's favorite number.
Neil Coffey explains why 31 is used under Ironing out the bias.
Basically using 31 gives you a more even set-bit probability distribution for the hash function.
From JDK-4045622, where Joshua Bloch describes the reasons why that particular (new) String.hashCode() implementation was chosen
The table below summarizes the performance of the various hash
functions described above, for three data sets:
1) All of the words and phrases with entries in Merriam-Webster's
2nd Int'l Unabridged Dictionary (311,141 strings, avg length 10 chars).
2) All of the strings in /bin/, /usr/bin/, /usr/lib/, /usr/ucb/
and /usr/openwin/bin/* (66,304 strings, avg length 21 characters).
3) A list of URLs gathered by a web-crawler that ran for several
hours last night (28,372 strings, avg length 49 characters).
The performance metric shown in the table is the "average chain size"
over all elements in the hash table (i.e., the expected value of the
number of key compares to look up an element).
                         Webster's   Code Strings      URLs
                         ---------   ------------      ----
Current Java Fn.            1.2509         1.2738   13.2560
P(37)    [Java]             1.2508         1.2481    1.2454
P(65599) [Aho et al]        1.2490         1.2510    1.2450
P(31)    [K+R]              1.2500         1.2488    1.2425
P(33)    [Torek]            1.2500         1.2500    1.2453
Vo's Fn                     1.2487         1.2471    1.2462
WAIS Fn                     1.2497         1.2519    1.2452
Weinberger's Fn(MatPak)     6.5169         7.2142   30.6864
Weinberger's Fn(24)         1.3222         1.2791    1.9732
Weinberger's Fn(28)         1.2530         1.2506    1.2439
Looking at this table, it's clear that all of the functions except for
the current Java function and the two broken versions of Weinberger's
function offer excellent, nearly indistinguishable performance. I
strongly conjecture that this performance is essentially the
"theoretical ideal", which is what you'd get if you used a true random
number generator in place of a hash function.
I'd rule out the WAIS function as its specification contains pages of random numbers, and its performance is no better than any of the
far simpler functions. Any of the remaining six functions seem like
excellent choices, but we have to pick one. I suppose I'd rule out
Vo's variant and Weinberger's function because of their added
complexity, albeit minor. Of the remaining four, I'd probably select
P(31), as it's the cheapest to calculate on a RISC machine (because 31
is the difference of two powers of two). P(33) is similarly cheap to
calculate, but it's performance is marginally worse, and 33 is
composite, which makes me a bit nervous.
Josh
Bloch doesn't quite go into this, but the rationale I've always heard/believed is that this is basic algebra. Hashes boil down to multiplication and modulus operations, which means that you never want to use numbers with common factors if you can help it. In other words, relatively prime numbers provide an even distribution of answers.
The numbers that go into computing a hash are typically:
- the modulus of the data type you put it into (2^32 or 2^64)
- the modulus of the bucket count in your hashtable (varies; in Java it used to be prime, now 2^n)
- the magic number you multiply or shift by in your mixing function
- the input value
You really only get to control a couple of these values, so a little extra care is due.
In the latest version of the JDK, 31 is still used. https://docs.oracle.com/en/java/javase/12/docs/api/java.base/java/lang/String.html#hashCode()
The purpose of the string hash is to be:
- unique (see the ^ operator in the hashCode calculation documentation; it helps with uniqueness), and
- cheap to calculate.
31 fits in an 8-bit (1-byte) register, is a prime number, and is odd.
Multiplying by 31 is a shift left by 5 followed by subtracting the value itself, and therefore needs few resources.
Java String hashCode() and 31
This is because 31 has a nice property: its multiplication can be replaced by a bit shift and a subtraction, which is faster than standard multiplication:
31 * i == (i << 5) - i
I'm not sure, but I would guess they tested some sample of prime numbers and found that 31 gave the best distribution over some sample of possible Strings.
A big expectation from hash functions is that their result's uniform randomness survives an operation such as hash(x) % N where N is an arbitrary number (and in many cases, a power of two), one reason being that such operations are used commonly in hash tables for determining slots. Using prime number multipliers when computing the hash decreases the probability that your multiplier and the N share divisors, which would make the result of the operation less uniformly random.
Others have pointed out the nice property that multiplication by 31 can be done by a shift and a subtraction. I just want to point out that there is a mathematical term for such primes: Mersenne prime.
All Mersenne primes are one less than a power of two, so we can write them as:
p = 2^n - 1
Multiplying x by p:
x * p = x * (2^n - 1) = x * 2^n - x = (x << n) - x
Shifts (SAL/SHL) and subtractions (SUB) are generally faster than multiplications (MUL) on many machines. See instruction tables from Agner Fog
That's why GCC seems to optimize multiplications by Mersenne primes by replacing them with shifts and subs, see here.
However, in my opinion, such a small prime is a bad choice for a hash function. With a relatively good hash function, you would expect to have randomness at the higher bits of the hash. However, with the Java hash function, there is almost no randomness at the higher bits with shorter strings (and still highly questionable randomness at the lower bits). This makes it more difficult to build efficient hash tables. See this nice trick you couldn't do with the Java hash function.
Some answers mention that they believe it is good that 31 fits into a byte. This is actually useless since:
(1) We execute shifts instead of multiplications, so the size of the multiplier does not matter.
(2) As far as I know, there is no specific x86 instruction to multiply an 8-byte value with a 1-byte value, so you would have needed to convert "31" to an 8-byte value anyway even if you were multiplying. See here: you multiply entire 64-bit registers.
(And 127 is actually the largest Mersenne prime that could fit in a byte.)
Does a smaller value increase randomness in the middle-lower bits? Maybe, but it also seems to greatly increase the possible collisions :).
One could list many different issues but they generally boil down to two core principles not being fulfilled well: Confusion and Diffusion
But is it fast? Probably, since it doesn't do much. However, if performance is really the focus here, one character per loop is quite inefficient. Why not do 4 characters at a time (8 bytes) per loop iteration for longer strings, like this? Well, that would be difficult to do with the current definition of hash where you need to multiply every character individually (please tell me if there is a bit hack to solve this :D).
