Per the Java documentation, the hash code for a String object is computed as:
s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
using int arithmetic, where s[i] is the
ith character of the string, n is the length of
the string, and ^ indicates exponentiation.
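For concreteness, that formula reduces to a Horner-style loop. Below is a minimal sketch (not the JDK source) that reproduces String.hashCode():

public class StringHashDemo {
    static int polynomialHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i); // accumulates s[0]*31^(n-1) + ... + s[n-1]
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(polynomialHash("hello")); // 99162322
        System.out.println("hello".hashCode());      // 99162322, the same value
    }
}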
Why is 31 used as a multiplier?
I understand that the multiplier should be a relatively large prime number. So why not 29, or 37, or even 97?
According to Joshua Bloch's Effective Java (a book that can't be recommended enough, and which I bought thanks to continual mentions on Stack Overflow):
The value 31 was chosen because it is an odd prime. If it were even and the multiplication overflowed, information would be lost, as multiplication by 2 is equivalent to shifting. The advantage of using a prime is less clear, but it is traditional. A nice property of 31 is that the multiplication can be replaced by a shift and a subtraction for better performance: 31 * i == (i << 5) - i. Modern VMs do this sort of optimization automatically.
(from Chapter 3, Item 9: Always override hashCode when you override equals, page 48)
Goodrich and Tamassia computed from over 50,000 English words (formed as the union of the word lists provided in two variants of Unix) that using the constants 31, 33, 37, 39, and 41 will produce fewer than 7 collisions in each case. This may be the reason that so many Java implementations choose such constants.
See section 9.2 Hash Tables (page 522) of Data Structures and Algorithms in Java.
On (mostly older) processors, multiplying by 31 can be relatively cheap. On an ARM, for instance, it is only one instruction:
RSB r1, r0, r0, ASL #5 ; r1 := - r0 + (r0<<5)
Most other processors would require a separate shift and subtract instruction. However, if your multiplier is slow, this is still a win. Modern processors tend to have fast multipliers, so it doesn't make much difference.
It's not a great hash algorithm, but it's good enough and better than the 1.0 code (and very much better than the 1.0 spec!).
By multiplying, bits are shifted to the left. This uses more of the available space of hash codes, reducing collisions.
By not using a power of two, the lower-order, rightmost bits are populated as well, to be mixed with the next piece of data going into the hash.
The expression n * 31 is equivalent to (n << 5) - n.
You can read Bloch's original reasoning under "Comments" in http://bugs.java.com/bugdatabase/view_bug.do?bug_id=4045622. He investigated the performance of different hash functions with regard to the resulting "average chain size" in a hash table. P(31) was one of the common functions at that time, which he found in K&R's book (though even Kernighan and Ritchie couldn't remember where it came from). In the end he basically had to choose one, and so he took P(31) since it seemed to perform well enough. Even though P(33) was not really worse, and multiplication by 33 is equally fast to calculate (just a shift by 5 and an addition), he opted for 31 since 33 is not a prime:
Of the remaining four, I'd probably select P(31), as it's the cheapest to calculate on a RISC machine (because 31 is the difference of two powers of two). P(33) is similarly cheap to calculate, but its performance is marginally worse, and 33 is composite, which makes me a bit nervous.
So the reasoning was not as rational as many of the answers here seem to imply. But we're all good at coming up with rational reasons after gut decisions (and even Bloch might be prone to that).
Actually, 37 would work pretty well! z := 37 * x can be computed as y := x + 8 * x; z := x + 4 * y. Both steps correspond to a single x86 LEA instruction, so this is extremely fast.
In fact, multiplication with the even-larger prime 73 could be done at the same speed by setting y := x + 8 * x; z := x + 8 * y.
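Here is a quick sketch (in Java, for illustration; the LEA mapping itself is an x86 detail) verifying that both shift-and-add decompositions above really compute 37 * x and 73 * x:

public class LeaDemo {
    static int times37(int x) {
        int y = x + (x << 3); // y = x + 8*x = 9*x (one LEA)
        return x + (y << 2);  // x + 4*y = 37*x  (one LEA)
    }

    static int times73(int x) {
        int y = x + (x << 3); // y = x + 8*x = 9*x (one LEA)
        return x + (y << 3);  // x + 8*y = 73*x  (one LEA)
    }

    public static void main(String[] args) {
        System.out.println(times37(12345) == 37 * 12345); // true
        System.out.println(times73(12345) == 73 * 12345); // true
    }
}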
Using 73 or 37 (instead of 31) might be better because it leads to denser code: the two LEA instructions take only 6 bytes vs. the 7 bytes for move+shift+subtract for the multiplication by 31. One possible caveat is that the 3-argument LEA instructions used here became slower on Intel's Sandy Bridge architecture, with an increased latency of 3 cycles.
Moreover, 73 is Sheldon Cooper's favorite number.
Neil Coffey explains why 31 is used under Ironing out the bias.
Basically using 31 gives you a more even set-bit probability distribution for the hash function.
From JDK-4045622, where Joshua Bloch describes the reasons why that particular (new) String.hashCode() implementation was chosen:
The table below summarizes the performance of the various hash
functions described above, for three data sets:
1) All of the words and phrases with entries in Merriam-Webster's
2nd Int'l Unabridged Dictionary (311,141 strings, avg length 10 chars).
2) All of the strings in /bin/, /usr/bin/, /usr/lib/, /usr/ucb/
and /usr/openwin/bin/* (66,304 strings, avg length 21 characters).
3) A list of URLs gathered by a web-crawler that ran for several
hours last night (28,372 strings, avg length 49 characters).
The performance metric shown in the table is the "average chain size"
over all elements in the hash table (i.e., the expected value of the
number of key compares to look up an element).
                           Webster's  Code Strings      URLs
                           ---------  ------------      ----
Current Java Fn.              1.2509        1.2738   13.2560
P(37)      [Java]             1.2508        1.2481    1.2454
P(65599)   [Aho et al]        1.2490        1.2510    1.2450
P(31)      [K+R]              1.2500        1.2488    1.2425
P(33)      [Torek]            1.2500        1.2500    1.2453
Vo's Fn                       1.2487        1.2471    1.2462
WAIS Fn                       1.2497        1.2519    1.2452
Weinberger's Fn(MatPak)       6.5169        7.2142   30.6864
Weinberger's Fn(24)           1.3222        1.2791    1.9732
Weinberger's Fn(28)           1.2530        1.2506    1.2439
Looking at this table, it's clear that all of the functions except for
the current Java function and the two broken versions of Weinberger's
function offer excellent, nearly indistinguishable performance. I
strongly conjecture that this performance is essentially the
"theoretical ideal", which is what you'd get if you used a true random
number generator in place of a hash function.
I'd rule out the WAIS function as its specification contains pages of random numbers, and its performance is no better than any of the
far simpler functions. Any of the remaining six functions seem like
excellent choices, but we have to pick one. I suppose I'd rule out
Vo's variant and Weinberger's function because of their added
complexity, albeit minor. Of the remaining four, I'd probably select
P(31), as it's the cheapest to calculate on a RISC machine (because 31
is the difference of two powers of two). P(33) is similarly cheap to
calculate, but its performance is marginally worse, and 33 is
composite, which makes me a bit nervous.
Josh
Bloch doesn't quite go into this, but the rationale I've always heard/believed is that this is basic algebra. Hashes boil down to multiplication and modulus operations, which means that you never want to use numbers with common factors if you can help it. In other words, relatively prime numbers provide an even distribution of answers.
The numbers involved in computing a hash are typically:
the modulus of the data type you put it into (2^32 or 2^64)
the modulus of the bucket count in your hashtable (this varies; in Java it used to be prime, now it's 2^n)
the magic number you multiply or shift by in your mixing function
the input value
You really only get to control a couple of these values, so a little extra care is due.
In the latest JDK versions, 31 is still used: https://docs.oracle.com/en/java/javase/12/docs/api/java.base/java/lang/String.html#hashCode()
The purpose of a string hash is to be:
unique (see the ^ operator in the hash code calculation documentation; it helps with uniqueness)
cheap to calculate
31 fits in an 8-bit (= 1-byte) register, is the largest prime number that can be put in a 1-byte register, and is an odd number.
Multiplying by 31 is a left shift by 5 followed by subtracting the value itself ((i << 5) - i), so it needs only cheap resources.
Java String hashCode() and 31
This is because 31 has a nice property: multiplication by it can be replaced by a bitwise shift and a subtraction, which is faster than an ordinary multiplication:
31 * i == (i << 5) - i
I'm not sure, but I would guess they tested some sample of prime numbers and found that 31 gave the best distribution over some sample of possible Strings.
A big expectation from hash functions is that their result's uniform randomness survives an operation such as hash(x) % N where N is an arbitrary number (and in many cases, a power of two), one reason being that such operations are used commonly in hash tables for determining slots. Using prime number multipliers when computing the hash decreases the probability that your multiplier and the N share divisors, which would make the result of the operation less uniformly random.
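To make that concrete, here is a small hedged sketch (the 1000-key range and table size 16 are arbitrary choices for illustration): an even multiplier shares a factor with a power-of-two N and wastes buckets, while a multiplier coprime to N reaches all of them.

import java.util.HashSet;
import java.util.Set;

public class MultiplierDemo {
    // Counts how many of the n buckets are actually hit by keys 0..999.
    static int coverage(int multiplier, int n) {
        Set<Integer> buckets = new HashSet<>();
        for (int key = 0; key < 1000; key++) {
            buckets.add((key * multiplier) & (n - 1)); // hash % N for power-of-two N
        }
        return buckets.size();
    }

    public static void main(String[] args) {
        System.out.println(coverage(32, 16)); // 1: every key lands in bucket 0
        System.out.println(coverage(31, 16)); // 16: all buckets are reachable
    }
}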
Others have pointed out the nice property that multiplication by 31 can be done with a shift and a subtraction. I just want to point out that there is a mathematical term for such primes: Mersenne primes.
All Mersenne primes are one less than a power of two, so we can write them as:
p = 2^n - 1
Multiplying x by p:
x * p = x * (2^n - 1) = x * 2^n - x = (x << n) - x
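A quick sketch checking this identity for the first few Mersenne primes (plain Java, arbitrary test value):

public class MersenneDemo {
    public static void main(String[] args) {
        int x = 123456;
        System.out.println(31 * x == (x << 5) - x);    // true: 31 = 2^5 - 1
        System.out.println(127 * x == (x << 7) - x);   // true: 127 = 2^7 - 1
        System.out.println(8191 * x == (x << 13) - x); // true: 8191 = 2^13 - 1
    }
}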
Shifts (SAL/SHL) and subtractions (SUB) are generally faster than multiplications (MUL) on many machines; see the instruction tables from Agner Fog.
That's why GCC seems to optimize multiplications by Mersenne primes by replacing them with shifts and subtractions; see here.
However, in my opinion, such a small prime is a bad choice for a hash function. With a relatively good hash function, you would expect to have randomness at the higher bits of the hash. However, with the Java hash function, there is almost no randomness at the higher bits with shorter strings (and still highly questionable randomness at the lower bits). This makes it more difficult to build efficient hash tables. See this nice trick you couldn't do with the Java hash function.
Some answers mention that they believe it is good that 31 fits into a byte. This is actually useless since:
(1) We execute shifts instead of multiplications, so the size of the multiplier does not matter.
(2) As far as I know, there is no x86 instruction to multiply an 8-byte value by a 1-byte value, so you would have needed to widen "31" to an 8-byte value anyway, even if you were multiplying. See here: you multiply entire 64-bit registers.
(And 127 is actually the largest Mersenne prime that could fit in a byte.)
Does a smaller value increase randomness in the middle-lower bits? Maybe, but it also seems to greatly increase the possible collisions :).
One could list many different issues, but they generally boil down to two core principles not being fulfilled well: confusion and diffusion.
But is it fast? Probably, since it doesn't do much. However, if performance is really the focus here, processing one character per loop iteration is quite inefficient. Why not process 4 characters at a time (8 bytes) per loop iteration for longer strings, like this? Well, that is difficult with the current definition of hash, where you need to multiply every character individually (please tell me if there is a bit hack to solve this :D).
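For what it's worth, the polynomial structure does allow block processing without changing the result: multiply the accumulator by 31^4 and add the block's own polynomial. A hedged sketch (assuming, for brevity, a length divisible by 4):

public class BlockHashDemo {
    static final int P2 = 31 * 31, P3 = 31 * P2, P4 = 31 * P3;

    static int blockHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i += 4) {
            // Shift the accumulator four "positions" and mix in the next block.
            h = h * P4
                + s.charAt(i) * P3
                + s.charAt(i + 1) * P2
                + s.charAt(i + 2) * 31
                + s.charAt(i + 3);
        }
        return h;
    }

    public static void main(String[] args) {
        String s = "abcdefgh"; // length divisible by 4
        System.out.println(blockHash(s) == s.hashCode()); // true
    }
}

Whether this is actually faster than the one-character loop depends on the JIT and the hardware; it only shows that the definition isn't an inherent obstacle.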
Related
I have to raise a BigInteger to the power of another BigInteger. Unfortunately, only BigInteger.pow(int) is available.
I have no clue how to solve this problem.
I have to raise a BigInteger to the power of another BigInteger.
No, you don't.
You read a crypto spec and it seemed to say that. But that's not what it said; you didn't read carefully enough. The mathematical 'universe' that the math in the paper / spec you're reading operates in is different from normal math. It's a modulo-space. All operations are implicitly performed modulo X, where X is some number the crypto algorithm explains.
You can do that just fine.
Alternatively, the spec is quite clear and says something like C = (A^B) % M, and you've broken that down into steps (... first, I must calculate A to the power of B. I'll worry about what the % M part is all about later). That's not how it works; you can't split that operation into parts. (A^B) % M is quite doable and has its own efficient algorithm. (A^B) on its own is simply not calculable without a few years' worth of the planet's entire energy and GDP output.
The reason I know that must be what you've been reading is that (A^B) % M is a common operation in crypto. (Well, that, and the simple fact that A^B can't be done.)
Just to be crystal clear: when I say impossible, I mean it in the same way that 'travelling faster than the speed of light' is impossible. It's a law in the physics sense of the word: if you really just want to do A^B, not in a mod-space, where B is so large it doesn't fit in an int, a computer cannot calculate it, and the result would be gigabytes large. An int can hold about 9 digits' worth. Just for fun, imagine doing X^Y where both X and Y are 20-digit numbers.
The result would have on the order of 10^21 digits (the digit count of X^Y is roughly Y * log10(X), here about 10^19 * 20). That's roughly equal to the total amount of disk space available worldwide: 10^12 bytes is a terabyte, so forget about calculating it; merely storing the result would require a thousand million hard disks of 1 TB each.
Thus, I'm 100% certain that you do not want what you think you want.
TIP: If you can't follow the math (which is quite bizarre; it's not like you get modulo-space math in your basic AP math class!), rolling your own implementation of a crypto algorithm generally isn't going to work out. The problem with crypto is that if you mess up, a unit test often cannot catch it. No; someone will hack your stuff, and then you know, and that's a high price to pay. Rely on experts to build the algorithm and spend your time ensuring the protocol is correct (which is still quite difficult to get right; don't take it lightly!). If you insist, make dang sure you have a heap of plaintext+key / encrypted (or plaintext / hashed, or whatever it is you're doing) pairs to test against, and assume that whatever you wrote, even if it passes those tests, is still insecure, because e.g. it is trivial to leak the key out of your algorithm via timing attacks.
Since you want to use it in a modulo operation with a prime number anyway, you can use modPow(), as @Progman said in the comments.
Below is example code:

import java.math.BigInteger;

public class ModPowExample {
    public static void main(String[] args) {
        // The prime modulus
        int pNumber = 5;

        // Initializing the BigInteger objects
        BigInteger base = new BigInteger("23895");
        BigInteger modulus = BigInteger.valueOf(pNumber);
        BigInteger exponent = new BigInteger("15");

        // Computes 23895^15 mod 5 without ever materializing 23895^15
        BigInteger result = base.modPow(exponent, modulus);
        System.out.println(result); // 0, since 23895 is divisible by 5
    }
}
Hash functions are incredibly useful and versatile. In general, they are used to map a large space onto a much smaller one. Of course, that means two objects may hash to the same value (a collision), but this is unavoidable when you are reducing the space (pigeonhole principle).
The efficiency of the function largely depends on the size of the hash space.
It comes as a surprise, then, that a lot of Java hashCode functions use multiplication to produce the hash code of a new object, e.g. as follows (creating-a-hashcode-method-java):
@Override
public int hashCode() {
    final int prime = 31;
    int result = 1;
    result = prime * result + ((email == null) ? 0 : email.hashCode());
    result = prime * result + (int) (id ^ (id >>> 32));
    result = prime * result + ((name == null) ? 0 : name.hashCode());
    return result;
}
If we want to mix two hash codes in the same range, XOR should be much better than addition and is, I think, traditionally used. If we wanted to increase the space, shifting by some bytes and then XORing would still, in my opinion, make sense. I guess multiplying by 31 is almost the same as shifting one hash by 1 and then adding, but it should be much less efficient...
As it is the recommended approach, though, I think I am missing something. So my question is: why would this be?
Notes:
I am not asking why we use a prime. It is pretty clear that if we use multiplication, we should go with a prime. However, multiplying by any number, even a prime, should still be suboptimal compared to XOR. That is why, e.g., all these other well-known non-cryptographic hash functions, as well as most cryptographic ones, use XOR and not multiplication...
Indeed, I have no indication (apart from all those well-known hash functions) that XOR would be better. In fact, just from how widely accepted it is, I suspect that multiplying by a prime and summing must be as good, and in practice better. I am asking why this is...
The int type in Java can represent any whole number from -2147483648 to 2147483647.
Sometimes the hash code of an object may be its memory address (which makes sense and is efficient in a lot of situations), e.g. if it is inherited from Object.
The answer to this is a mixture of different factors:
On modern architectures, the time taken to perform a multiplication versus a shift may not end up being measurable within a given pipeline of instructions; it has more to do with the availability of the relevant execution unit on the CPU than with the "raw" time taken;
In practice when integrating with standard collections libraries in day-to-day programming, it's often more important that a hash function is correct, "good enough" and easy to automate in an IDE than for it to be as perfect as possible;
The collections libraries generally add secondary hash functions and potentially other techniques behind the scenes to overcome some of the weaknesses of what would otherwise be a poor hash function;
With resizable collections, an effective hash function has the goal of dispersing its hashes across the available range for arbitrary sizes of hash tables (though as I say, it will get help from the built-in secondary function): multiplying by a "magic" constant is often a cheap way to achieve this (or, even if multiplication turned out to be a bit more expensive than a shift: still cheap enough, given the benefit); addition rather than XOR may help to allow this 'avalanche' effect slightly. (In most practical cases, you will probably find that they work equally well.)
You can generally assume that the JIT compiler "knows" about equivalents such as shifting left 5 places and subtracting the original value rather than multiplying by 31. Just because you write "*31" in the source code doesn't mean that it will literally be compiled to a multiplication instruction. (In practice, it might be, though, because despite what you might think, the multiply instruction may well be "faster" on average on the architecture in question... It's usually better to make your code stick to the required logic and let the JIT compiler handle the low-level optimisations in a case such as this.)
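To illustrate the point about addition and order-sensitivity versus XOR: XOR is commutative and self-cancelling, so field order is erased and equal fields cancel to zero, whereas the 31-based recipe keeps them distinct. A small hedged sketch (the combining functions are illustrative stand-ins, not library code):

public class CombineDemo {
    static int xorCombine(int a, int b) { return a ^ b; }
    static int mulCombine(int a, int b) { return 31 * (31 + a) + b; } // the 31-based recipe, seeded with 1

    public static void main(String[] args) {
        int h1 = "alice".hashCode(), h2 = "bob".hashCode();
        System.out.println(xorCombine(h1, h2) == xorCombine(h2, h1)); // true: order is lost
        System.out.println(xorCombine(h1, h1));                       // 0: equal fields cancel
        System.out.println(mulCombine(h1, h2) == mulCombine(h2, h1)); // false: order preserved
    }
}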
The list of possible algorithms for multiplication is quite long:
Schoolbook long multiplication
Karatsuba algorithm
3-way Toom–Cook multiplication
k-way Toom–Cook multiplication
Mixed-level Toom–Cook
Schönhage–Strassen algorithm
Fürer's algorithm
Which one is used by Java by default and why? When does it switch to a "better performance" algorithm?
Well ... the * operator will use whatever the hardware provides. Java has no say in it.
But if you are talking about BigInteger.multiply(BigInteger), the answer depends on the Java version. For Java 11 it uses:
naive "long multiplication" for small numbers,
the Karatsuba algorithm for medium-sized numbers, and
3-way Toom–Cook multiplication for large numbers.
The thresholds are: Karatsuba for numbers represented by 80 to 239 int values, and 3-way Toom-Cook for 240 or more int values. The smaller of the numbers being multiplied controls the algorithm selection.
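A rough sketch to see the size regimes (timings are machine- and JIT-dependent and purely illustrative; the threshold constants KARATSUBA_THRESHOLD = 80 and TOOM_COOK_THRESHOLD = 240 can be found in the OpenJDK BigInteger source):

import java.math.BigInteger;
import java.util.Random;

public class BigIntegerMulDemo {
    static long nanosPerMultiply(int ints, Random rnd) {
        BigInteger a = new BigInteger(ints * 32, rnd);
        BigInteger b = new BigInteger(ints * 32, rnd);
        long sink = 0, start = System.nanoTime();
        for (int i = 0; i < 1000; i++) sink += a.multiply(b).bitLength(); // keep the result live
        long elapsed = (System.nanoTime() - start) / 1000;
        return elapsed + (sink & 0); // fold sink in so the work can't be optimized away
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        System.out.println("small:  " + nanosPerMultiply(79, rnd) + " ns/op");  // long multiplication
        System.out.println("medium: " + nanosPerMultiply(150, rnd) + " ns/op"); // Karatsuba range
        System.out.println("large:  " + nanosPerMultiply(300, rnd) + " ns/op"); // Toom-Cook range
    }
}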
Which one is used by Java by default and why?
Which ones? See above.
Why? Comments in the code imply that the thresholds were chosen empirically; i.e. someone did some systematic testing to determine which threshold values gave the best performance1.
You can find more details by reading the source code2.
1 - The current BigInteger implementation hasn't changed significantly since 2013, so it is possible that it doesn't incorporate more recent research results.
2 - Note that this link is to the latest version on Github.
I'm programming a minhashing algorithm in Java that requires me to generate an arbitrary number of random hash functions (240 hash functions in my case), and run any number of integers through it (2000 at the moment).
In order to do that, I've been generating random numbers a, b, and c (from the range 1 to 2001) for each of the 240 hash functions. Then, my hash function returns h = ((a*x) + b) % c, where h is the return value and x is one of the integers run through it.
Is this an efficient implementation of random hashing, or is there a more common/acceptable way to do it?
This post was asking a similar question, but I'm still somewhat confused by the wording of the answer: Minhash implementation how to find hash functions for permutations
When I was working with Bloom filters a few years ago, I ran across an article that describes how to generate multiple hash functions very simply, with a minimum of code. The method it describes works very well. See Less Hashing, Same Performance: Building a Better Bloom Filter.
The basic idea is to create two hash functions, call them h1 and h2, with which you can then simulate multiple hash functions, g1 through gk, using the formula:
g_i(x) = h1(x) + i*h2(x)
where i varies from 1 to k (the number of hash functions you want).
The paper is well worth reading even if you decide not to implement the idea, although after reading it I can't imagine not wanting to. It made my Bloom filter code a whole lot more tractable and didn't negatively impact performance.
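A hedged sketch of that construction (the two base hash functions here are arbitrary stand-ins chosen for illustration, not taken from the paper):

public class DoubleHashDemo {
    static int h1(int x) { return x * 0x9E3779B9; }        // illustrative mixer
    static int h2(int x) { return (x ^ (x >>> 16)) * 31; } // illustrative mixer

    // The i-th simulated hash function, mapped into [0, m).
    static int g(int i, int x, int m) {
        return Math.floorMod(h1(x) + i * h2(x), m);
    }

    public static void main(String[] args) {
        int m = 2000; // range of the minhash inputs
        for (int i = 1; i <= 3; i++) { // first 3 of the k = 240 functions
            System.out.println("g" + i + "(42) = " + g(i, 42, m));
        }
    }
}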
So the method that I described above was almost correct. The numbers a and b should be randomly generated. However, c needs to be a prime number that is slightly larger than the maximum possible value of x. Once those numbers have been chosen, finding hash value h using h = ((a*x)+b) % c is the standard, accepted way to generate hash functions.
Also, a and b should be random numbers from the range 1 to c-1.
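A hedged sketch of that corrected scheme (c = 2003 is a prime just above the stated maximum input of 2000; the class and variable names are mine):

import java.util.Random;

public class MinhashFunctions {
    public static void main(String[] args) {
        final int c = 2003; // prime slightly larger than the max value of x (2000)
        final int k = 240;  // number of hash functions
        Random rnd = new Random();
        int[] a = new int[k], b = new int[k];
        for (int i = 0; i < k; i++) {
            a[i] = 1 + rnd.nextInt(c - 1); // a in [1, c-1]
            b[i] = 1 + rnd.nextInt(c - 1); // b in [1, c-1]
        }
        int x = 1234; // an example input integer
        for (int i = 0; i < 3; i++) { // show the first 3 of the 240 functions
            int h = (a[i] * x + b[i]) % c;
            System.out.println("h" + i + "(" + x + ") = " + h);
        }
    }
}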
After reading the JDK's source code, I find HashMap's hash() function interesting. Its source code looks like this:
static int hash(int h) {
    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}
The parameter h is the hashCode of an object being put into the HashMap. How does this method work, and why? Why can this method defend against poor hashCode functions?
Hashtable uses the 'classical' approach of prime numbers: to get the 'index' of a value, you take the hash of the key and perform the modulus against the size. Taking a prime number as the size gives (normally) a nice spread over the indexes (depending on the hash as well, of course).
HashMap uses a 'power-of-two' approach, meaning the sizes are a power of two. The reason is that it's supposed to be faster than prime-number calculations. However, since a power of two is not a prime number, there would be more collisions, especially with hash values having the same lower bits.
Why? The modulus performed against the size to get the (bucket/slot) index is simply calculated by hash & (size - 1), which is exactly what HashMap uses to get the index! That's basically the problem with the power-of-two approach: if the length is limited, e.g. to 16, the default size of HashMap, only the last bits are used, and hence hash values with the same lower bits will result in the same (bucket) index. In the case of 16, only the last 4 bits are used to calculate the index.
That's why an extra hash is calculated: basically, it shifts the higher bit values down and mixes them with the lower bit values. I don't really know the reason for the particular numbers 20, 12, 7, and 4. They used to be different (in Java 1.5 or so, the hash function was a little different). I suppose there's more advanced literature available. You might find more info about why they use the numbers they use in all kinds of algorithm-related literature, e.g.:
http://en.wikipedia.org/wiki/The_Art_of_Computer_Programming
http://mitpress.mit.edu/books/introduction-algorithms
http://burtleburtle.net/bob/hash/evahash.html#lookup uses different algorithms depending on the length (which makes some sense).
http://www.javaspecialists.eu/archive/Issue054.html is probably interesting as well. Check the reaction of Joshua Bloch near the bottom of the article: "The replacement secondary hash function (which I developed with the aid of a computer) has strong statistical properties that pretty much guarantee good bucket distribution.") So, if you ask me, the numbers come from some kind of analysis performed by Josh himself, probably assisted by who knows who.
So: a power of two gives a faster index calculation, but necessitates an additional hash computation in order to get a nice spread over the slots/buckets.
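To make the bucket-index arithmetic concrete, here's a small hedged sketch (table size 16 and the high-bit-only hash values are arbitrary choices for illustration) showing how hashes that differ only above the low 4 bits all collide on the raw index, while the supplemental hash quoted above spreads them out:

public class IndexDemo {
    // The JDK 6/7-era supplemental hash quoted in the question.
    static int hash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    public static void main(String[] args) {
        int size = 16; // default HashMap capacity; index mask is size - 1
        for (int i = 1; i <= 3; i++) {
            int h = i << 16; // hash codes differing only in the high bits
            System.out.println("raw index: " + (h & (size - 1))
                    + ", spread index: " + (hash(h) & (size - 1)));
        }
    }
}

The raw index is 0 for all three hash codes, while the spread indexes differ.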