What is a sensible prime for hashcode calculation?

Eclipse 3.5 has a very nice feature to generate Java hashCode() functions. For example, it would generate (slightly shortened):
class HashTest {
    int i;
    int j;

    public int hashCode() {
        final int prime = 31;
        int result = prime + i;
        result = prime * result + j;
        return result;
    }
}
(If you have more attributes in the class, result = prime * result + attribute.hashCode(); is repeated for each additional attribute. For ints .hashCode() can be omitted.)
This seems fine except for the choice of 31 as the prime. It was probably taken from the hashCode implementation of Java's String, where it was chosen for performance reasons that have long since disappeared with the introduction of hardware multipliers. With 31 you get many hashcode collisions for small values of i and j: for example, (0,0) and (-1,31) have the same value. I think that is a Bad Thing(TM), since small values occur often. For String.hashCode you'll also find many short strings with the same hashcode, for instance "Ca" and "DB". If you choose a suitably large prime, this problem disappears.
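The two collisions mentioned above can be checked directly. The following is a minimal sketch; the class name CollisionDemo and the helper pairHash are made up for illustration, and pairHash simply repeats the generated template with the multiplier passed in as a parameter.

```java
final class CollisionDemo {
    // Same template as the generated hashCode(), with the multiplier as a parameter.
    static int pairHash(int prime, int i, int j) {
        int result = prime + i;
        result = prime * result + j;
        return result;
    }

    public static void main(String[] args) {
        System.out.println(pairHash(31, 0, 0));                  // 961
        System.out.println(pairHash(31, -1, 31));                // 961
        System.out.println("Ca".hashCode() == "DB".hashCode());  // true
    }
}
```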
So my question: what is a good prime to choose? What criteria do you apply to find it?
This is meant as a general question - so I do not want to give a range for i and j. But I suppose in most applications relatively small values occur more often than large values. (If you have large values the choice of the prime is probably unimportant.) It might not make much of a difference, but a better choice is an easy and obvious way to improve this - so why not do it? Commons lang HashCodeBuilder also suggests curiously small values.
(Clarification: this is not a duplicate of Why does Java's hashCode() in String use 31 as a multiplier? since my question is not concerned with the history of the 31 in the JDK, but with what would be a better value in new code using the same basic template. None of the answers there try to answer that.)

I recommend using 92821. Here's why.
To give a meaningful answer to this you have to know something about the possible values of i and j. The only thing I can think of in general is that in many cases small values will be more common than large values. (The odds of 15 appearing as a value in your program are much better than, say, 438281923.) So it seems a good idea to make the smallest hashcode collision as large as possible by choosing an appropriate prime. For 31 this is rather bad: already for i=-1 and j=31 you have the same hash value as for i=0 and j=0.
Since this is interesting, I've written a little program that searched the whole int range for the best prime in this sense. That is, for each prime I searched for the minimum value of Math.abs(i) + Math.abs(j) over all values of i,j that have the same hashcode as 0,0, and then took the prime where this minimum value is as large as possible.
Drumroll: the best prime in this sense is 486187739 (with the smallest collision being i=-25486, j=67194). Nearly as good and much easier to remember is 92821 with the smallest collision being i=-46272 and j=46016.
If you give "small" another meaning and want the minimum of Math.sqrt(i*i+j*j) for a collision to be as large as possible, the results are a little different: the best would be 1322837333 with i=-6815 and j=70091, but my favourite 92821 (smallest collision -46272,46016) is again almost as good as the best value.
I do acknowledge that it is quite debatable whether these calculations make much sense in practice. But I do think that taking 92821 as the prime makes much more sense than 31, unless you have good reasons not to.
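For a single prime, the "smallest collision" measure described above can be sketched as follows. This is not the search program from the answer, only an illustration for one candidate; the class and method names are hypothetical. It relies on the observation that in the two-field template hash(i, j) = prime*prime + prime*i + j, so for every i exactly one int j collides with (0,0), namely j = -(prime * i) in wrapping int arithmetic.

```java
final class SmallestCollision {
    // Two-field template from the question, with the multiplier as a parameter.
    static int hash(int prime, int i, int j) {
        int result = prime + i;
        return prime * result + j;
    }

    // Smallest |i| + |j| over all pairs (i, j) != (0, 0) that hash like (0, 0).
    static long smallestCollision(int prime) {
        long best = Long.MAX_VALUE;
        // Negative i mirrors positive i (j just changes sign), so only i > 0 is scanned.
        // Once i reaches the best sum found so far, no smaller sum is possible.
        for (long i = 1; i < best; i++) {
            int j = -(prime * (int) i);  // the only j that collides for this i
            best = Math.min(best, i + Math.abs((long) j));
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(smallestCollision(31));     // 32, from (-1, 31)
        System.out.println(smallestCollision(92821));  // 92288, from (-46272, 46016)
    }
}
```

Running the same function over all primes in the int range and keeping the one with the largest result would reproduce the kind of search described in the answer, at a correspondingly higher cost.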

Actually, if you take a prime so large that it comes close to INT_MAX, you have the same problem because of modular arithmetic. If you expect to hash mostly strings of length 2, perhaps a prime near the square root of INT_MAX would be best; if the strings you hash are longer, it doesn't matter so much, and collisions are unavoidable anyway...

Collisions may not be such a big issue... The primary goal of the hash is to avoid using equals for 1:1 comparisons.
If you have an implementation where equals is "generally" extremely cheap for objects whose hashes have collided, then this is not an issue (at all).
In the end, the best way of hashing depends on what you are comparing. In the case of an int pair (as in your example), using basic bitwise operators (such as & or ^) could be sufficient.
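Whether a plain bitwise combination is sufficient depends on the data. As a small sketch of the trade-off (the class and method names are hypothetical), XOR is symmetric, so swapped pairs collide and equal fields cancel out, whereas the multiply-and-add template keeps such pairs apart:

```java
final class XorPairHash {
    // Plain XOR of the two fields.
    static int xorHash(int i, int j) {
        return i ^ j;
    }

    // Multiply-and-add template from the question, with prime = 31.
    static int multiplyAddHash(int i, int j) {
        return 31 * (31 + i) + j;
    }

    public static void main(String[] args) {
        System.out.println(xorHash(1, 2) == xorHash(2, 1));                 // true
        System.out.println(xorHash(5, 5));                                  // 0
        System.out.println(multiplyAddHash(1, 2) == multiplyAddHash(2, 1)); // false
    }
}
```

If equals is cheap and such pairs are rare in your data, this may well not matter.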

You need to define your range for i and j. You could use a prime number for both.
public int hashCode() {
    // http://primes.utm.edu/curios/ ;)
    return 97654321 * i ^ 12356789 * j;
}

I'd choose 7243. Large enough to avoid collisions with small numbers. Doesn't overflow to small numbers quickly.

I just want to point out that hashcode has nothing to do with primes.
In the JDK implementation:
for (int i = 0; i < value.length; i++) {
    h = 31 * h + val[i];
}
I found that if you replace 31 with 27, the results are very similar.
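A claim like this can be tried out with a small harness. The sketch below is an assumption about how one might compare multipliers, not the experiment the answer refers to; polyHash and countCollisions are hypothetical helpers. polyHash is the same loop as above with the multiplier as a parameter, so with 31 it matches String.hashCode().

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class MultiplierComparison {
    // Same loop as String.hashCode(), but with the multiplier as a parameter.
    static int polyHash(String s, int multiplier) {
        int h = 0;
        for (int k = 0; k < s.length(); k++) {
            h = multiplier * h + s.charAt(k);
        }
        return h;
    }

    // Number of inputs whose hash was already produced by an earlier input.
    // Assumes the inputs themselves are distinct.
    static int countCollisions(List<String> inputs, int multiplier) {
        Set<Integer> seen = new HashSet<>();
        int collisions = 0;
        for (String s : inputs) {
            if (!seen.add(polyHash(s, multiplier))) {
                collisions++;
            }
        }
        return collisions;
    }
}
```

Calling countCollisions on your own key set with 31, 27 and any other candidate gives a rough idea of how much the multiplier matters for that data.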

Related

Multiplication should be suboptimal. Why is it used in hashCode?

Hash functions are incredibly useful and versatile. In general, they are used to map a space to one much smaller space. Of course that means that two objects may hash to the same value (collision), but this is because you are reducing the space (pigeonhole principle).
The efficiency of the function largely depends on the size of the hash space.
It comes as a surprise then that a lot of Java hashCode functions are using multiplication to produce the hash code of a new object as e.g. follows (creating-a-hashcode-method-java)
@Override
public int hashCode() {
    final int prime = 31;
    int result = 1;
    result = prime * result + ((email == null) ? 0 : email.hashCode());
    result = prime * result + (int) (id ^ (id >>> 32));
    result = prime * result + ((name == null) ? 0 : name.hashCode());
    return result;
}
If we want to mix two hashcodes in the same range, xor should be much better than addition and is I think traditionally used. If we wanted to increase the space, shifting by some bytes and then xoring would still imho make sense. I guess multiplying by 31 is almost the same as shifting one hash by 1 and then adding but it should be much less efficient...
As it is the recommended approach though, I think I am missing something. So my question is why would this be?
Notes:
I am not asking why we use a prime. It is pretty clear that if we used multiplication, we should go with a prime. However multiplying by any number, even a prime, should still be suboptimal to xor. That is why e.g. all these other non-cryptographic hash functions - as well as most cryptographic - use xor and not multiplications...
I have indeed no indication (apart from all those well known hash functions) xor would be better. In fact just by the fact it is so widely accepted, I suspect it should be as good and in practice better to multiply by a prime and sum. I am asking why this is...
The int type in Java can be used to represent any whole number from -2147483648 to 2147483647.
Sometimes the hashcode of an object may be its memory address (which makes sense and is efficient in a lot of situations) (if inherited from e.g. object)

The answer to this is a mixture of different factors:
On modern architecture, the time taken to perform a multiplication versus a shift may not end up being measurable overall within a given pipeline of instructions-- it has more to do with the availability of the relevant execution unit on the CPU than the "raw" time taken;
In practice when integrating with standard collections libraries in day-to-day programming, it's often more important that a hash function is correct, "good enough" and easy to automate in an IDE than for it to be as perfect as possible;
The collections libraries generally add secondary hash functions and potentially other techniques behind the scenes to overcome some of the weaknesses of what would otherwise be a poor hash function;
With resizable collections, an effective hash function has the goal of dispersing its hashes across the available range for arbitrary sizes of hash tables (though as I say, it will get help from the built-in secondary function): multiplying by a "magic" constant is often a cheap way to achieve this (or, even if multiplication turned out to be a bit more expensive than a shift: still cheap enough, given the benefit); addition rather than XOR may help to allow this 'avalanche' effect slightly. (In most practical cases, you will probably find that they work equally well.)
You can generally assume that the JIT compiler "knows" about equivalents such as shifting 5 places and subtracting 1 rather than multiplying by 31. Just because you write "*31" in the source code doesn't mean that it will literally be compiled to a multiplication instruction. (In practice, it might be, though, because despite what you think, the multiply instruction may well be "faster" on average on the architecture in question... It's usually better to make your code stick to the required logic and let the JIT compiler handle the low level optimisations in a case such as this.)
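The equivalence mentioned in the last point holds exactly in Java's wrapping int arithmetic, including for values that overflow. A minimal check (class and method names are made up for illustration):

```java
final class ShiftIdentity {
    // What you write in the source code.
    static int timesThirtyOne(int x) {
        return 31 * x;
    }

    // What a compiler is free to emit instead: 32 * x - x.
    static int shiftAndSubtract(int x) {
        return (x << 5) - x;
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 12345, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int x : samples) {
            System.out.println(timesThirtyOne(x) == shiftAndSubtract(x)); // true for every sample
        }
    }
}
```

Whether the JIT actually performs this rewrite on a given platform is its own decision; the sketch only shows that both forms produce the same value.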

Finding a prime number at least a 100 digits long that contains 273042282802155991

I am new to Java and one of my class assignments is to find a prime number at least 100 digits long that contains the numbers 273042282802155991.
I have this so far but when I compile it and run it it seems to be in a continuous loop.
I'm not sure if I've done something wrong.
public static void main(String[] args) {
BigInteger y = BigInteger.valueOf(304877713615599127L);
System.out.println(RandomPrime(y));
}
public static BigInteger RandomPrime(BigInteger x)
{
BigInteger i;
for (i = BigInteger.valueOf(2); i.compareTo(x)<0; i.add(i)) {
if ((x.remainder(i).equals(BigInteger.ZERO))) {
x.divide(i).equals(x);
i.subtract(i);
}
}
return i;
}

Since this is homework ...
There is a method on BigInteger that tests for primality. This is much much faster than attempting to factorize a number. (If you take an approach that involves attempting to factorize 100 digit numbers you will fail. Factorization is believed to be computationally hard, though it is not known to be NP-complete. Certainly, there is no known polynomial time solution for classical computers.)
The question is asking for a prime number that contains a given sequence of digits when it is represented as a sequence of decimal digits.
The approach of generating "random" primes and then testing if they contain those digits is infeasible. (Some simple high-school maths tells you that the probability that a randomly generated 100 digit number contains a given 18 digit sequence is roughly 82 / 10^18. And you haven't tested for primality yet ...)
But there's another way to do it ... think about it!
Only start writing code once you've figured out in your head how your algorithm will work, and done the mental estimates to confirm that it will give an answer in a reasonable length of time.
When I say infeasible, I mean infeasible for you. Given a large enough number of computers, enough time and some high-powered mathematics, it may be possible to do some of these things. Thus, technically they may be computationally feasible. But they are not feasible as a homework exercise. I'm sure that the point of this exercise is to get you to think about how to do this the smart way ...

One tip is that these statements do nothing:
x.divide(i).equals(x);
i.subtract(i);
Same with part of your for loop:
i.add(i)
They don't modify the instances themselves, but return new values - values that you're failing to check and do anything with. BigIntegers are "immutable". They can't be changed - but they can be operated upon and return new values.
If you actually wanted to do something like this, you would have to do:
i = i.add(i);
Also, why would you subtract i from i? Wouldn't you always expect this to be 0?
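A minimal sketch of the difference between discarding and keeping the result of a BigInteger operation (the class and method names are hypothetical):

```java
import java.math.BigInteger;

final class ImmutableBigIntegerDemo {
    // The sum is computed and thrown away; i still refers to the original value.
    static BigInteger discardResult(BigInteger i) {
        i.add(i);
        return i;
    }

    // The new value is assigned back, so the caller sees the doubled number.
    static BigInteger keepResult(BigInteger i) {
        i = i.add(i);
        return i;
    }

    public static void main(String[] args) {
        System.out.println(discardResult(BigInteger.valueOf(2))); // 2
        System.out.println(keepResult(BigInteger.valueOf(2)));    // 4
    }
}
```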

You need to implement or use the Miller-Rabin algorithm:
Handbook of Applied Cryptography
Chapter 4, Algorithm 4.24
http://www.cacr.math.uwaterloo.ca/hac/about/chap4.pdf
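If using a library is acceptable, java.math.BigInteger already provides probabilistic primality testing through isProbablePrime and nextProbablePrime. The following is one possible sketch, not a prescribed solution: it assumes the required digits may appear at the front of the number, pads them with zeros to the wanted length, and takes the next probable prime. The class and method names are made up, and the result is a probable prime rather than a proven one. It also assumes the gap to the next prime is tiny compared with the zero padding, so the leading digits stay unchanged.

```java
import java.math.BigInteger;

final class PrimeContainingDigits {
    // Returns a probable prime with at least totalDigits digits that starts with the given digits.
    static BigInteger find(String digits, int totalDigits) {
        int padding = Math.max(0, totalDigits - digits.length());
        // digits followed by `padding` zeros
        BigInteger start = new BigInteger(digits).multiply(BigInteger.TEN.pow(padding));
        // first probable prime greater than start
        return start.nextProbablePrime();
    }

    public static void main(String[] args) {
        System.out.println(find("273042282802155991", 100));
    }
}
```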

What's the benefit of seeding a random number generator with only prime numbers?

While conducting some experiments in Java, my project supervisor reminded me to seed each iteration of the experiment with a different number. He also mentioned that I should use prime numbers for the seed values. This got me thinking: why primes? Why not any other number as the seed? Also, why must the prime number be sufficiently big? Any ideas? I would've asked him this myself, but it's 4am here right now, everyone's asleep, I just remembered this question and I'm burning to know the answer (I'm sure you know the feeling).
It would be nice if you could provide some references, I'm very interested in the math/concept behind all this!
EDIT:
I'm using java.util.Random.
FURTHER EDIT:
My professor comes from a C background, but I'm using Java. Don't know if that helps. It appears that using primes is his idiosyncrasy, but I think we've unearthed some interesting answers about generating random numbers. Thanks to everyone for the effort!

Well, one glance at the implementation would show you that he CAN'T have any reason for that claim at all. Why? Because this is what the setSeed function looks like:
synchronized public void setSeed(long seed) {
    seed = (seed ^ multiplier) & mask;
    this.seed.set(seed);
    haveNextNextGaussian = false;
}
And that's exactly what's called from the constructor. So even if you give it a prime, it won't use it anyhow, so if at all you'd have to use a seed s where (s^ multiplier) & mask results in a prime ;)
Java uses a standard linear congruential method, i.e.:
x_(n+1) = (a * x_n + c) mod m with 2 <= a < m; 0 <= c < m.
To get a maximal period, c and m have to be relatively prime, and there are a few other rather obscure conditions, plus a few tips on how to get a practically useful version. Knuth covers this in detail in Volume 2 of The Art of Computer Programming ;)
But anyhow, the seed doesn't influence the qualities of the generator at all. Even if the implementation used a Lehmer generator, it would obviously make sure that N is prime (otherwise the algorithm is practically useless, and not uniformly distributed if all random values had to be coprime to a non-prime N, I wager), which makes the point moot.
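The recurrence and the seed scrambling can be reproduced in a few lines. The sketch below uses the constants given in the java.util.Random documentation (multiplier 0x5DEECE66D, addend 0xB, modulus 2^48); the class LcgSketch and its methods are hypothetical and only cover nextInt().

```java
import java.util.Random;

final class LcgSketch {
    private static final long MULTIPLIER = 0x5DEECE66DL; // a
    private static final long ADDEND = 0xBL;             // c
    private static final long MASK = (1L << 48) - 1;     // m = 2^48, applied as a bit mask

    private long state;

    LcgSketch(long seed) {
        // Same scrambling step as in setSeed: the seed is XORed with the multiplier.
        state = (seed ^ MULTIPLIER) & MASK;
    }

    int nextInt() {
        state = (state * MULTIPLIER + ADDEND) & MASK; // x_(n+1) = (a * x_n + c) mod m
        return (int) (state >>> 16);                  // upper 32 of the 48 state bits
    }

    // Compares the first `count` outputs with java.util.Random for the same seed.
    static boolean matchesJavaUtilRandom(long seed, int count) {
        LcgSketch mine = new LcgSketch(seed);
        Random reference = new Random(seed);
        for (int k = 0; k < count; k++) {
            if (mine.nextInt() != reference.nextInt()) {
                return false;
            }
        }
        return true;
    }
}
```

Prime and non-prime seeds go through exactly the same steps, which illustrates the point that primality of the seed plays no role here.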

If the generator is a Lehmer generator, then the seed and the modulus must be co-prime; see the wiki page. One way to ensure they are co-prime is to start with a prime number.

If you are talking about java.util.Random, or one of its subclasses in the Oracle runtime, there's no reason for this. It's just a whim of your supervisor.

Why multiply by a prime before xoring in many GetHashCode Implementations?

I understand that multiplication by a large number before xoring should help with badly distributed operands but why should the multiplier be a prime?
Related:
Why should hash functions use a prime number modulus?
Close, but not quite a Duplicate:
Why does Java’s hashCode() in String use 31 as a multiplier?

There's a good article on the Computing Life blog that discusses this topic in detail. It was originally posted as a response to the Java hashCode() question I linked to in the question. According to the article:
Primes are unique numbers. They are unique in that the product of a prime with any other number has the best chance of being unique (not as unique as the prime itself, of course) due to the fact that a prime is used to compose it. This property is used in hashing functions.
Given a string “Samuel”, you can generate a unique hash by multiplying each of the constituent digits or letters with a prime number and adding them up. This is why primes are used.
However using primes is an old technique. The key here is to understand that as long as you can generate a sufficiently unique key you can move to other hashing techniques too. Go here for more on this topic about hashes without primes.

Multiplying by a non-prime has a cyclic repeating pattern much smaller than the number. If you use a prime then the cyclic repeating pattern is guaranteed to be at least as large as the prime number.

I'm not sure exactly which algorithm you're talking about, but typically the constants in such algorithms need to be relatively prime. Otherwise, you get cycles and not all the possible values show up in the result.
The number probably doesn't need to be prime in your case, only relatively prime to some other numbers, but making it prime guarantees that. It also covers the cases where the other magic numbers change.
For example, if you are talking about taking the last bits of some number, then the multiplier needs to not be a multiple of 2. So, 9 would work even though it's not prime.
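The last example can be made concrete with a small counting sketch (the class and method names are hypothetical): it multiplies every value of a given bit width by a constant, keeps only the low bits, and counts how many distinct results appear.

```java
import java.util.HashSet;
import java.util.Set;

final class LowBitsCoverage {
    // Counts how many distinct values (x * multiplier) mod 2^bits takes for x in [0, 2^bits).
    // Intended for small bit widths only.
    static int distinctLowBitValues(int multiplier, int bits) {
        int size = 1 << bits;
        Set<Integer> seen = new HashSet<>();
        for (int x = 0; x < size; x++) {
            seen.add((x * multiplier) & (size - 1)); // keep the last `bits` bits
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println(distinctLowBitValues(9, 4)); // 16: odd multiplier, every value shows up
        System.out.println(distinctLowBitValues(6, 4)); // 8: shares a factor 2 with 16
        System.out.println(distinctLowBitValues(8, 4)); // 2: only 0 and 8 remain
    }
}
```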

Consider the simplest multiplication: x2.
It is equivalent to a left-bitshift. In other words, it really didn't "randomize" the data, it just shifted it over.
Same with x4, or any power of two. The original data is intact, just shifted.
Now, multiplication by other numbers (non-powers of two) is not as obvious, but still has the same problem, more or less. The original data is intact, or trivially transformed. (e.g. x5 is the same as left-bitshift two places, then add on the original data).
The point of GetHashCode is to essentially distribute the data as randomly as possible. Multiplying by a prime number guarantees that the answer won't be a simpler transform like bit-shifting or adding a number to itself.

Efficient hashCode() implementation

I often auto-generate a class's hashCode() method using IntelliJ IDEA and typically the method takes the form:
result = 31 * result + ...
My question is what is the purpose of multiplying by 31? I know this is a prime number but why pick 31 specifically? Also, if implementing a hashCode() for a particularly small / large dataset would people approach this problem differently?

Multiplying by 31 is fast because the JIT can convert it to a shift left by 5 bits and a subtract:
x * 31 == (x << 5) - x
Without any particular extra information, I'd stick to this approach. It's reasonably fast and likely to end up with reasonably well-distributed hash codes, and it's also easy to get right :)
The size of the dataset doesn't really matter, but if you have particular extra information about the values you'll be working with (e.g. "it's always even") then you may be able to design a better hash function. I'd wait until it's an actual problem first though :)
