Multiplication should be suboptimal. Why is it used in hashCode? - java

Hash functions are incredibly useful and versatile. In general, they are used to map a large space onto a much smaller one. That of course means that two objects may hash to the same value (a collision), but this is inevitable when you are reducing the space (pigeonhole principle).
The efficiency of the function largely depends on the size of the hash space.
It comes as a surprise, then, that a lot of Java hashCode functions use multiplication to produce the hash code of a new object, e.g. as follows (creating-a-hashcode-method-java):
@Override
public int hashCode() {
    final int prime = 31;
    int result = 1;
    result = prime * result + ((email == null) ? 0 : email.hashCode());
    result = prime * result + (int) (id ^ (id >>> 32));
    result = prime * result + ((name == null) ? 0 : name.hashCode());
    return result;
}
If we want to mix two hashcodes in the same range, xor should be much better than addition and is, I think, traditionally used. If we wanted to increase the space, shifting by some bytes and then xoring would still imho make sense. I guess multiplying by 31 is almost the same as shifting one hash left by 5 and then subtracting it, but it should be much less efficient...
As it is the recommended approach, though, I think I am missing something. So my question is: why would this be?
Notes:
I am not asking why we use a prime. It is pretty clear that if we use multiplication, we should go with a prime. However, multiplying by any number, even a prime, should still be suboptimal compared to xor. That is why e.g. all these other well-known non-cryptographic hash functions, as well as most cryptographic ones, use xor and not multiplication...
I have indeed no indication (apart from all those well-known hash functions) that xor would be better. In fact, just from the fact that it is so widely accepted, I suspect that multiplying by a prime and summing should be as good and in practice better. I am asking why this is...
The int type in Java can represent any whole number from -2147483648 to 2147483647.
Sometimes the hashcode of an object may be its memory address (which makes sense and is efficient in a lot of situations), e.g. if inherited from Object.
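To make the comparison concrete, here is a tiny demo I put together (the class name is mine, not from any library); it shows one structural difference between plain xor and the multiply-and-add scheme:
// Hypothetical demo: why XOR alone is a questionable combiner.
public class XorVsPrime {
    public static void main(String[] args) {
        int a = "alice".hashCode();
        int b = "bob".hashCode();

        // XOR is symmetric, so swapping two fields leaves the hash unchanged:
        System.out.println((a ^ b) == (b ^ a));                      // true
        // ...and equal fields cancel each other out entirely:
        System.out.println(("x".hashCode() ^ "x".hashCode()) == 0);  // true

        // The 31-based scheme is order-sensitive: (a, b) and (b, a) differ.
        System.out.println(31 * (31 + a) + b == 31 * (31 + b) + a);  // false (in general)
    }
}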

The answer to this is a mixture of different factors:
On modern architectures, the time taken to perform a multiplication versus a shift may not end up being measurable overall within a given pipeline of instructions; it has more to do with the availability of the relevant execution unit on the CPU than with the "raw" time taken;
In practice, when integrating with standard collections libraries in day-to-day programming, it's often more important that a hash function is correct, "good enough" and easy to automate in an IDE than that it is as perfect as possible;
The collections libraries generally add secondary hash functions and potentially other techniques behind the scenes to overcome some of the weaknesses of what would otherwise be a poor hash function;
With resizable collections, an effective hash function has the goal of dispersing its hashes across the available range for arbitrary sizes of hash table (though, as I say, it will get help from the built-in secondary function). Multiplying by a "magic" constant is often a cheap way to achieve this; even if multiplication turned out to be a bit more expensive than a shift, it is still cheap enough given the benefit. Addition rather than XOR may help this 'avalanche' effect slightly. (In most practical cases, you will probably find that they work equally well.)
You can generally assume that the JIT compiler "knows" about equivalences such as shifting left 5 places and subtracting the original value rather than multiplying by 31. Just because you write "*31" in the source code doesn't mean it will literally be compiled to a multiplication instruction. (In practice it might be, though, because despite what you might think, the multiply instruction may well be "faster" on average on the architecture in question... It's usually better to make your code stick to the required logic and let the JIT compiler handle low-level optimisations in a case such as this.)
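To see the equivalence the JIT can exploit, here is a quick self-check (my own snippet): 31 * x equals (x << 5) - x for every int, including under overflow, because both sides are computed modulo 2^32.
// Minimal check (mine, not from the JDK): 31 * x == (x << 5) - x for all ints.
public class ShiftIdentity {
    public static void main(String[] args) {
        java.util.Random rnd = new java.util.Random();
        for (int i = 0; i < 1_000_000; i++) {
            int x = rnd.nextInt();
            if (31 * x != (x << 5) - x) throw new AssertionError("mismatch at " + x);
        }
        System.out.println("31 * x == (x << 5) - x held for all samples");
    }
}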

Related

Is there a way to pow 2 BigInteger Numbers in java?

I have to pow a bigInteger number with another BigInteger number.
Unfortunately, only BigInteger.pow(int) is available.
I have no clue how I can solve this problem.
I have to pow a bigInteger number with another BigInteger number.
No, you don't.
You read a crypto spec and it seemed to say that. But that's not what it said; you didn't read carefully enough. The mathematical 'universe' that the math in the paper / spec you're reading operates in is different from normal math. It's a modulo-space. All operations are implicitly performed modulo X, where X is some number the crypto algorithm explains.
You can do that just fine.
Alternatively, the spec is quite clear and says something like: C = (A^B) % M, and you've broken that down into steps (... first, I must calculate A to the power of B; I'll worry about what the % M part is all about later). That's not how it works: you can't chop that operation into parts. (A^B) % M is quite doable, and has its own efficient algorithm. (A^B) on its own is simply not calculable without a few years' worth of the planet's entire energy and GDP output.
The reason I know that must be what you've been reading, is because (A ^ B) % M is a common operation in crypto. (Well, that, and the simple fact that A^B can't be done).
Just to be crystal clear: when I say impossible, I mean it in the same way 'travelling faster than the speed of light' is impossible. It's a law in the physics sense of the word: if you really just want A^B (not in a modulo-space), where B is so large it doesn't fit in an int, a computer cannot calculate it; the result would be far beyond gigabytes large. An int can hold about 9 digits' worth. Just for fun, imagine doing X^Y where both X and Y are 20-digit numbers.
The result would have 10^21 digits.
That's roughly equal to the total amount of disk space available worldwide. 10^12 bytes is a terabyte. You're asking to calculate a number where, forget about calculating it, merely storing it would require a thousand million hard disks of 1 TB each.
Thus, I'm 100% certain that you do not want what you think you want.
TIP: If you can't follow the math (which is understandable; it's not like you get modulo-space math in a basic AP math class!), rolling your own implementation of a crypto algorithm generally isn't going to work out. The problem with crypto is that if you mess up, a unit test often cannot catch it. No; someone will hack your stuff, and then you'll know, and that's a high price to pay. Rely on experts to build the algorithm, and spend your time ensuring the protocol is correct (which is still quite difficult to get right; don't take that lightly!). If you insist, make dang sure you have a heap of plaintext+key / encrypted (or plaintext / hashed, or whatever it is you're doing) pairs to test against, and assume that whatever you wrote, even if it passes those tests, is still insecure, because e.g. it is trivial to leak the key out of your algorithm using timing attacks.
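To show why (A^B) % M is cheap while A^B alone is impossible, here is a sketch of mine (not from any spec) of the square-and-multiply idea; every intermediate value stays below M, so the numbers never blow up. BigInteger.modPow does this, and more cleverly, for you.
// Sketch (mine): modular exponentiation by repeated squaring.
import java.math.BigInteger;

public class ModPowSketch {
    static BigInteger modPow(BigInteger a, BigInteger b, BigInteger m) {
        BigInteger result = BigInteger.ONE;
        a = a.mod(m);
        for (int i = b.bitLength() - 1; i >= 0; i--) {
            result = result.multiply(result).mod(m);              // square
            if (b.testBit(i)) result = result.multiply(a).mod(m); // multiply
        }
        return result;
    }

    public static void main(String[] args) {
        BigInteger a = new BigInteger("23895"), b = new BigInteger("15"), m = BigInteger.valueOf(5);
        System.out.println(modPow(a, b, m)); // matches...
        System.out.println(a.modPow(b, m));  // ...the built-in
    }
}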
Since you want to use it in a modulo operation with a prime number anyway, as @Progman said in the comments, you can use modPow().
Below is some example code:
// Create BigInteger objects
BigInteger biginteger1, biginteger2, exponent, result;
// prime modulus
int pNumber = 5;
// Initializing all BigInteger objects
biginteger1 = new BigInteger("23895");
biginteger2 = BigInteger.valueOf(pNumber);
exponent = new BigInteger("15");
// Perform the modPow operation: result = 23895^15 mod 5
result = biginteger1.modPow(exponent, biginteger2);

Bitwise operator advantages in StringBuilder

Why does the reverse() method in the StringBuffer/StringBuilder classes use a bitwise operator?
I would like to know its advantages.
public AbstractStringBuilder reverse() {
    boolean hasSurrogate = false;
    int n = count - 1;
    for (int j = (n-1) >> 1; j >= 0; --j) {
        char temp = value[j];
        char temp2 = value[n - j];
        if (!hasSurrogate) {
            hasSurrogate = (temp >= Character.MIN_SURROGATE && temp <= Character.MAX_SURROGATE)
                || (temp2 >= Character.MIN_SURROGATE && temp2 <= Character.MAX_SURROGATE);
        }
        value[j] = temp2;
        value[n - j] = temp;
    }
    if (hasSurrogate) {
        // Reverse back all valid surrogate pairs
        for (int i = 0; i < count - 1; i++) {
            char c2 = value[i];
            if (Character.isLowSurrogate(c2)) {
                char c1 = value[i + 1];
                if (Character.isHighSurrogate(c1)) {
                    value[i++] = c1;
                    value[i] = c2;
                }
            }
        }
    }
    return this;
}
Right shifting by one means dividing by two; I don't think you'll notice any performance difference, as the compiler will perform these optimizations at compile time.
Many programmers are used to right shifting by one when dividing by two instead of writing / 2; it's a matter of style, or maybe one day it really was more efficient to right shift instead of actually dividing by writing / 2 (prior to optimizations). Compilers know how to optimize things like that, and I wouldn't waste my time trying to write things that might be unclear to other programmers (unless they really make a difference). Anyway, the loop is equivalent to:
int n = count - 1;
for (int j = (n-1) / 2; j >= 0; --j)
As @MarkoTopolnik mentioned in his comment, the JDK was written without counting on any optimization at all, which might explain why they explicitly right-shifted the number by one instead of explicitly dividing it; if they had counted on the full power of the optimizer, they would probably have written / 2.
Just in case you're wondering why they are equivalent, the best explanation is by example, consider the number 32. Assuming 8 bits, its binary representation is:
00100000
right shift it by one:
00010000
which has the value 16 (1 * 2^4)
In summary:
The >> operator in Java is known as the Sign-Extended Right Bit Shift operator.
X >> 1 is mathematically equivalent to X / 2, for all strictly positive values of X.
X >> 1 is always faster than X / 2, in a ratio of roughly 1:16, though the difference might turn out to be much less significant in actual benchmarks due to modern processor architectures.
All mainstream JVMs correctly perform such optimizations, but the non-optimized byte code will be executed in interpreted mode thousands of times before the optimization actually occurs.
The JRE source code uses a lot of optimization idioms, because they make an important difference for code executed in interpreted mode (and, most importantly, at JVM launch time).
The systematic use of proven-to-be-effective code-optimization idioms that are accepted by a whole development team is not premature optimization.
Long answer
The following discussion tries to correctly address all the questions and doubts that have been raised in other comments on this page. It is so long because I felt it necessary to put the emphasis on why some approaches are better, rather than showing off personal benchmark results, beliefs and practices, where mileage might vary significantly from one person to the next.
So let's take questions one at a time.
1. What does X >> 1 (or X << 1, or X >>> 1) mean in Java?
The >>, << and >>> operators are collectively known as the Bit Shift operators. >> is commonly known as the Sign-Extended Right Bit Shift, or Arithmetic Right Bit Shift. >>> is the Non-Sign-Extended Right Bit Shift (also known as the Logical Right Bit Shift), and << is simply the Left Bit Shift (sign extension does not apply in that direction, so there is no need for logical and arithmetic variants).
Bit Shift operators are available (though with varying notation) in many programming languages (from a quick survey I would say almost every language that is more or less a descendant of C, plus a few others). Bit shifts are fundamental binary operations, and consequently almost every CPU ever created offers assembly instructions for them. Bit shifters are also a classic building block in electronic design which, given a reasonable number of transistors, provides its final result in a single step, with a constant and predictable stabilization time.
Concretely, a bit shift operator transforms a number by moving all of its bits n positions, either left or right. Bits that fall off are forgotten; bits that "come in" are forced to 0, except in the case of the sign-extended right bit shift, in which the left-most bit preserves its value (and therefore its sign). See Wikipedia for some graphics of this.
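A small illustration (my own snippet) of the three operators on a negative value:
// Demo (mine): the three Java shift operators applied to a negative int.
public class ShiftDemo {
    public static void main(String[] args) {
        int x = -8; // binary: 11111111_11111111_11111111_11111000
        System.out.println(x << 1);  // -16: left shift, low bit filled with 0
        System.out.println(x >> 1);  // -4: arithmetic shift, sign bit copied in
        System.out.println(x >>> 1); // 2147483644: logical shift, 0 shifted in
    }
}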
2. Does X >> 1 equal X / 2?
Yes, as long as the dividend is guaranteed to be positive.
More generally:
a left shift by N is equivalent to a multiplication by 2^N;
a logical right shift by N is equivalent to an unsigned integer division by 2^N;
an arithmetic right shift by N is equivalent to a non-integer division by 2^N, rounded to integer toward negative infinity (which is also equivalent to a signed integer division by 2^N for any strictly positive integer); see the snippet just below.
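The rounding distinction in the last point is easy to see with a negative odd number (again, my own snippet):
// Demo (mine): >> rounds toward negative infinity, / rounds toward zero.
public class RoundingDemo {
    public static void main(String[] args) {
        System.out.println(-5 >> 1); // -3 (floor of -2.5)
        System.out.println(-5 / 2);  // -2 (truncation toward zero)
        System.out.println(5 >> 1);  //  2: identical to 5 / 2 for positives
    }
}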
3. Is bit shifting faster than the equivalent arithmetic operation at the CPU level?
Yes, it is.
First of all, we can easily assert that, at the CPU's level, bit shifting requires less work than the equivalent arithmetic operation. This is true both for multiplication and division, and the reason is simple: both the integer multiplication and the integer division circuits themselves contain several bit shifters. Put otherwise: a bit-shift unit represents a mere fraction of the complexity level of a multiplication or division unit. It is therefore guaranteed that less energy is required to perform a simple bit shift than a full arithmetic operation. Yet, in the end, unless you monitor your CPU's electric consumption or heat dissipation, I doubt that you will notice that your CPU is using more energy.
Now, let's talk about speed. On processors with reasonably simple architectures (roughly, any processor designed before the Pentium or the PowerPC, plus recent processors that do not feature any form of execution pipeline), integer division (and multiplication, to a lesser degree) is generally implemented by iterating over bits (actually groups of bits, known as the radix) of one of the operands. Each iteration requires one CPU cycle, which means that integer division on a 32-bit processor would require (at most) 16 cycles (assuming a radix-2 SRT division unit, on a hypothetical processor). Multiplication units usually handle more bits at once, so a 32-bit processor might complete integer multiplication in 4 to 8 cycles. These units might use some form of variable bit shifter to quickly jump over sequences of consecutive zeros, and therefore might terminate quickly when multiplying or dividing by simple operands (such as a positive power of two); in that case, the arithmetic operation will complete in fewer cycles, but will still require more than a simple bit-shift operation.
Obviously, instruction timing varies between processor designs, but the preceding ratio (bit shift = 1, multiplication = 4, division = 16) is a reasonable approximation of the actual performance of these instructions. For reference, on the Intel 486, the SHR, IMUL and IDIV instructions (for 32 bits, assuming register by a constant) required respectively 2, 13-42 and 43 cycles (see here for a list of 486 instructions with their timings).
What about the CPUs found in modern computers? These processors are designed around pipeline architectures that allow the simultaneous execution of several instructions; the result is that most instructions nowadays require only one cycle of dedicated time. But this is misleading, since instructions actually remain in the pipeline for several cycles before being released, during which they might prevent other instructions from being completed. The integer multiplication or division unit remains "reserved" during that time, and therefore any further division will be held back. That is particularly a problem in short loops, where a single multiplication or division will end up being stalled by the previous invocation of itself that hasn't yet completed. Bit-shift instructions do not suffer from such a risk: most "complex" processors have access to several bit-shift units and don't need to reserve them for very long (though generally at least 2 cycles, for reasons intrinsic to the pipeline architecture). Actually, to put this into numbers, a quick look at the Intel Optimization Reference Manual for the Atom seems to indicate that SHR, IMUL and IDIV (same parameters as above) respectively have latencies of 2, 5 and 57 cycles; for 64-bit operands, it is 8, 14 and 197 cycles. Similar latencies apply to the most recent Intel processors.
So, yes, bit shifting is faster than the equivalent arithmetic operations, even though in some situations, on modern processors, it might actually make absolutely no difference. But in most cases, it is very significant.
4. Will the Java Virtual Machine perform such optimizations for me?
Sure, it will. Well... most certainly, and... eventually.
Unlike most language compilers, regular Java compilers perform no optimization. It is considered that the Java Virtual Machine is in the best position to decide how to optimize a program for a specific execution context, and this indeed provides good results in practice. The JIT compiler acquires a very deep understanding of the code's dynamics, and exploits this knowledge to select and apply tons of minor code transforms in order to produce very efficient native code.
But compiling byte code into optimized native methods requires a lot of time and memory. That is why the JVM will not even consider optimizing a code block before it has been executed thousands of times. Then, even though the code block has been scheduled for optimization, it might be a long time before the compiler thread actually processes that method. And later, various conditions might cause that optimized code block to be discarded, reverting back to byte-code interpretation.
Though the JSE API is designed with the objective of being implementable by various vendors, it is incorrect to claim that the same is true of the JRE. The Oracle JRE is provided to everyone as the reference implementation, but its use with another JVM is discouraged (actually, it was forbidden not so long ago, before Oracle open-sourced the JRE's source code).
Optimizations in the JRE source code are the result of adopted conventions and optimization efforts among the JRE developers, to provide reasonable performance even in situations where JIT optimizations haven't yet helped or simply can't help. For example, hundreds of classes are loaded before your main method is invoked. That early, the JIT compiler has not yet acquired sufficient information to properly optimize code. At such times, hand-made optimizations make an important difference.
5. Ain't this premature optimization?
It is, unless there is a reason why it is not.
It is a fact of modern life that whenever a programmer demonstrates a code optimization somewhere, another programmer will oppose Donald Knuth's quote on optimization (well, was it his? who knows...). It is even perceived by many as a clear assertion by Knuth that we should never try to optimize code. Unfortunately, that is a major misunderstanding of Knuth's important contributions to computer science over the last decades: Knuth has actually authored thousands of pages of literature on practical code optimization.
As Knuth put it:
Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.
— Donald E. Knuth, "Structured Programming with Goto Statements"
What Knuth qualifies as premature optimization are optimizations that require a lot of thinking, apply only to non-critical parts of a program, and have a strong negative impact on debugging and maintenance. Now, all of this could be debated for a long time, but let's not.
It should, however, be understood that small local optimizations that have been proven to be effective (that is, at least on average, overall), that do not negatively affect the overall construction of a program, that do not reduce a code's maintainability, and that do not require extraneous thinking are not a bad thing at all. Such optimizations are actually good, since they cost you nothing, and we should not pass up such opportunities.
Yet, and this is the most important thing to remember, an optimization that is trivial to programmers in one context might turn out to be incomprehensible to programmers in another context. Bit-shifting and masking idioms are particularly problematic for that reason. Programmers who know the idiom can read it and use it without much thinking, and the effectiveness of these optimizations is proven, though generally insignificant unless the code contains hundreds of occurrences. These idioms are rarely an actual source of bugs. Still, programmers unfamiliar with a specific idiom will lose time understanding what that specific code snippet does, why, and how.
In the end, whether to favor such optimizations or not, and exactly which idioms should be used, is really a matter of team decision and code context. I personally consider a certain number of idioms to be best practice in all situations, and any new programmer joining my team quickly acquires them. Many more idioms are reserved for critical code paths. All code put into internal shared code libraries is treated as a critical code path, since it might turn out to be invoked from such critical code paths. Anyway, that is my personal practice, and your mileage may vary.
It uses (n-1) >> 1 instead of (n-1)/2 to find the middle index of the internal array to be reversed. Bitwise shift operators are usually more efficient than the division operator.
In this method there's just this one expression: (n-1) >> 1. I assume this is the expression you're referring to. This is called a right shift. It is equivalent to (n-1)/2, but it's generally considered faster and more efficient. It's often used in many other languages too (e.g. in C/C++).
Note, though, that modern compilers will optimize your code anyway, even if you use division like (n-1)/2. So there is no obvious benefit to right shifting; it's more a question of coding preference, style and habit.
See also:
Is shifting bits faster than multiplying and dividing in Java? .NET?

Explanation of the constants used while calculating the hashcode value of java.util.HashMap

Can someone explain the significance of these constants and why they are chosen?
static int hash(int h) {
    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}
Source: the Java SE 6 library
Understanding what makes for a good hash function is tricky, as there are in fact a great many different functions that are used and for slightly different purposes.
Java's hash tables work as follows:
They ask the key object to produce its hash code. The implementation of the hashCode() method is likely to be of distinctly variable quality (in the worst case, returning a constant value!) and will definitely not be adapted to the particular hash table you're working with.
They then use the above function to mix the bits up a bit, so that information present in the high bits also gets moved down to the low bits. This is important because next …
They take the hash code mod the number of hash-table array entries to get the index into the array of hash-table chains. There's a distinct possibility that the hash-table array will have a power-of-two size, so the mixing-down of the bits in step 2 is important to ensure that the high bits don't just get thrown away (see the sketch at the end of this answer).
They then traverse the chain until they get to the entry with an equal key (according to the equals() method).
To complete the picture, the number of entries in the hash table array is non-constant; if the chains get too long the array gets replaced with a new larger array and everything gets rehashed. That's relatively fast and has good performance implications for normal use patterns (e.g., lots of put()s followed by lots of get()s).
The actual constants used are fairly arbitrary (and are probably chosen by experiment with some simple corpus including things like large numbers of Integer and String values) but their purpose is not: getting the information in the whole value spread to most of the low bits in the value ensures that such information as is present in the output of the hashCode() is used as well as possible.
(You wouldn't do this with perfect hashing or cryptographic hashing; despite the similar names, they have very different implementation strategies. The former requires knowledge of the key space so that collisions are avoided/reduced, and the latter needs information to be moved about in all directions, not just to the low bits.)
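To make step 3 concrete: with a power-of-two-sized table, the index keeps only the low bits of the hash, which is exactly why the mixing in step 2 matters. A sketch, mirroring (to the best of my knowledge) the indexFor method in the Java SE 6 HashMap sources:
// Sketch: how a power-of-two-sized table turns a hash into a bucket index.
static int indexFor(int h, int length) { // length is a power of two
    return h & (length - 1);             // keeps only the low bits of h
}

// Without the mixing step, hash codes that differ only in their high bits
// would all collide; with a table of length 16:
//   indexFor(0x10000, 16) == indexFor(0x20000, 16) == 0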
I have also wondered about such "magic" numbers. As far as I know, they are magic numbers.
It has been shown by extensive testing that odd and prime numbers have interesting properties that can be exploited in hashing (avoiding primary/secondary clustering, etc.).
I believe that most of the numbers came from research and testing that showed statistically good distributions. Why specifically these numbers do that, I have no idea, but I have the impression (hopefully colleagues here can correct me if I am way off) that not even the implementers know why these specific numbers present these qualities.

A good hash function to use in interviews for integer numbers, strings?

I have come across situations in interviews where I needed to use a hash function for integers or for strings. In such situations, which ones should we choose? I've been wrong in these situations because I ended up choosing ones that generate a lot of collisions, but hash functions tend to be so mathematical that you cannot recall them in an interview. Are there any general recommendations, so that at least the interviewer is satisfied with your approach for integer or string inputs? Which functions would be adequate for both inputs in an "interview situation"?
Here is a simple recipe from Effective Java, page 33:
1. Store some constant nonzero value, say 17, in an int variable called result.
2. For each significant field f in your object (each field taken into account by the equals method, that is), do the following:
   a. Compute an int hash code c for the field:
      i. If the field is a boolean, compute (f ? 1 : 0).
      ii. If the field is a byte, char, short, or int, compute (int) f.
      iii. If the field is a long, compute (int) (f ^ (f >>> 32)).
      iv. If the field is a float, compute Float.floatToIntBits(f).
      v. If the field is a double, compute Double.doubleToLongBits(f), and then hash the resulting long as in step 2.a.iii.
      vi. If the field is an object reference and this class's equals method compares the field by recursively invoking equals, recursively invoke hashCode on the field. If a more complex comparison is required, compute a "canonical representation" for this field and invoke hashCode on the canonical representation. If the value of the field is null, return 0 (or some other constant, but 0 is traditional).
      vii. If the field is an array, treat it as if each element were a separate field. That is, compute a hash code for each significant element by applying these rules recursively, and combine these values per step 2.b. If every element in an array field is significant, you can use one of the Arrays.hashCode methods added in release 1.5.
   b. Combine the hash code c computed in step 2.a into result as follows: result = 31 * result + c;
3. Return result.
4. When you are finished writing the hashCode method, ask yourself whether equal instances have equal hash codes. Write unit tests to verify your intuition! If equal instances have unequal hash codes, figure out why and fix the problem.
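For illustration, here is a sketch of the recipe applied to a hypothetical class (the class and its fields are mine, purely for demonstration):
// Sketch (mine): the Effective Java recipe applied to a hypothetical class.
public final class Person {
    private final long id;
    private final String name;
    private final boolean active;

    Person(long id, String name, boolean active) {
        this.id = id;
        this.name = name;
        this.active = active;
    }

    @Override
    public int hashCode() {
        int result = 17;                                              // constant nonzero seed
        result = 31 * result + (int) (id ^ (id >>> 32));              // long field
        result = 31 * result + (name == null ? 0 : name.hashCode());  // reference field
        result = 31 * result + (active ? 1 : 0);                      // boolean field
        return result;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Person)) return false;
        Person p = (Person) o;
        return id == p.id && active == p.active
                && (name == null ? p.name == null : name.equals(p.name));
    }
}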
You should ask the interviewer what the hash function is for - the answer to this question will determine what kind of hash function is appropriate.
If it's for use in hashed data structures like hashmaps, you want it to be as simple as possible (fast to execute) and to avoid collisions (most common values map to different hash values). A good example is an integer hashing to the same integer; this is the standard hashCode() implementation in java.lang.Integer.
If it's for security purposes, you will want to use a cryptographic hash function. These are primarily designed so that it is hard to reverse the hash function or find collisions.
If you want fast pseudo-random-ish hash values (e.g. for a simulation) then you can usually modify a pseudo-random number generator to create these. My personal favourite is:
public static final int hash(int a) {
    a ^= (a << 13);
    a ^= (a >>> 17);
    a ^= (a << 5);
    return a;
}
If you are computing a hash for some form of composite structure (e.g. a string with multiple characters, or an array, or an object with multiple fields), then there are various techniques you can use to create a combined hash function. I'd suggest something that XORs the rotated hash values of the constituent parts, e.g.:
public static <T> int hashCode(T[] data) {
    int result = 0;
    for (int i = 0; i < data.length; i++) {
        result ^= data[i].hashCode();
        result = Integer.rotateRight(result, 1);
    }
    return result;
}
Note that the above is not cryptographically secure, but it will do for most other purposes. You will obviously get collisions, but that's unavoidable when hashing a large structure to an integer :-)
For integers, I usually go with k % p, where p is the size of the hash table and a prime number, and for strings I choose the hashcode from the String class. Is this sufficient for an interview with a major tech company? – phoenix
Maybe not. It's not uncommon to need to provide a hash function to a hash table whose implementation is unknown to you. Further, if you hash in a way that depends on the implementation using a prime number of buckets, then your performance may degrade if the implementation changes due to a new library, compiler, OS port, etc.
Personally, I think the important thing at interview is a clear understanding of the ideal characteristics of a general-purpose hash algorithm, which is basically that for any two input keys with values varying by as little as one bit, each and every bit in the output has about a 50/50 chance of flipping. I found that quite counter-intuitive, because a lot of the hashing functions I first saw used bit shifts and XOR, and a flipped input bit usually flipped one output bit (usually in another bit position), so 1-input-bit-affects-many-output-bits was a little revelation moment when I read it in one of Knuth's books. With this knowledge you're at least capable of testing and assessing specific implementations regardless of how they're implemented.
One approach I'll mention, because it achieves this ideal and is easy to remember, though the memory usage may make it slower than mathematical approaches (it could be faster too, depending on hardware), is to simply use each byte in the input to look up a table of random ints. For example, given a 24-bit RGB value and an int[3][256] table, table[0][r] ^ table[1][g] ^ table[2][b] is a great int-sized hash value; indeed "perfect" if the inputs are randomly scattered through the int values (rather than, say, incrementing; see below). This approach isn't ideal for long or arbitrary-length keys, though you can start revisiting tables and bit-shifting the values, etc.
All that said, you can sometimes do better than this randomising approach for specific cases where you are aware of the patterns in the input keys and/or the number of buckets involved (for example, you may know the input keys are contiguous from 1 to 100 and there are 128 buckets, so you can pass the keys through without any collisions). If, however, the input ceases to meet your expectations, you can get horrible collision problems, while a "randomising" approach should never get much worse than what load (size() / buckets) implies. Another interesting insight is that when you want a quick-and-mediocre hash, you don't necessarily have to incorporate all the input data when generating the hash: e.g. the last time I looked at Visual C++'s string hashing code, it picked ten letters evenly spaced along the text to use as inputs...
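A sketch of that table-lookup idea in Java (the class name and the fixed seed are my own choices):
// Sketch (mine): tabulation hashing for a 24-bit RGB value.
import java.util.Random;

public class TabulationHash {
    private static final int[][] TABLE = new int[3][256];
    static {
        Random rnd = new Random(42); // fixed seed so hashes are stable per run
        for (int[] row : TABLE)
            for (int i = 0; i < 256; i++)
                row[i] = rnd.nextInt();
    }

    /** Hash an RGB value by XOR-ing one random table entry per byte. */
    static int hash(int r, int g, int b) {
        return TABLE[0][r & 0xFF] ^ TABLE[1][g & 0xFF] ^ TABLE[2][b & 0xFF];
    }

    public static void main(String[] args) {
        System.out.printf("%08x%n", hash(255, 128, 0));
    }
}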

What is a sensible prime for hashcode calculation?

Eclipse 3.5 has a very nice feature to generate Java hashCode() functions. It would generate, for example (slightly shortened):
class HashTest {
    int i;
    int j;

    public int hashCode() {
        final int prime = 31;
        int result = prime + i;
        result = prime * result + j;
        return result;
    }
}
(If you have more attributes in the class, result = prime * result + attribute.hashCode(); is repeated for each additional attribute. For ints, .hashCode() can be omitted.)
This seems fine except for the choice of 31 as the prime. It is probably taken from the hashCode implementation of Java's String, where it was used for performance reasons that are long gone since the introduction of hardware multipliers. Here you get many hashcode collisions for small values of i and j: for example, (0,0) and (-1,31) have the same value. I think that is a Bad Thing(TM), since small values occur often. For String.hashCode you'll also find many short strings with the same hashcode, for instance "Ca" and "DB". If you take a large prime, this problem disappears, provided you choose the prime right.
So my question: what is a good prime to choose? What criteria do you apply to find it?
This is meant as a general question, so I do not want to give a range for i and j. But I suppose in most applications relatively small values occur more often than large values. (If you have large values, the choice of the prime is probably unimportant.) It might not make much of a difference, but a better choice is an easy and obvious way to improve this, so why not do it? Commons Lang's HashCodeBuilder also suggests curiously small values.
(Clarification: this is not a duplicate of "Why does Java's hashCode() in String use 31 as a multiplier?", since my question is not concerned with the history of the 31 in the JDK, but with what would be a better value in new code using the same basic template. None of the answers there try to answer that.)
I recommend using 92821. Here's why.
To give a meaningful answer to this, you have to know something about the possible values of i and j. The only thing I can think of in general is that in many cases small values will be more common than large values. (The odds of 15 appearing as a value in your program are much better than, say, 438281923.) So it seems a good idea to make the smallest hashcode collision as large as possible by choosing an appropriate prime. For 31 this is rather bad: already for i=-1 and j=31 you have the same hash value as for i=0 and j=0.
Since this is interesting, I've written a little program that searched the whole int range for the best prime in this sense. That is, for each prime I searched for the minimum value of Math.abs(i) + Math.abs(j) over all values of i,j that have the same hashcode as 0,0, and then took the prime where this minimum value is as large as possible.
Drumroll: the best prime in this sense is 486187739 (with the smallest collision being i=-25486, j=67194). Nearly as good, and much easier to remember, is 92821, with the smallest collision being i=-46272 and j=46016.
If you give "small" another meaning and want the minimum of Math.sqrt(i*i+j*j) over all collisions to be as large as possible, the results are a little different: the best would be 1322837333 with i=-6815 and j=70091, but my favourite 92821 (smallest collision -46272,46016) is again almost as good as the best value.
I do acknowledge that it is quite debatable whether these calculations make much sense in practice. But I do think that taking 92821 as the prime makes much more sense than 31, unless you have good reasons not to.
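For reference, a sketch of the kind of search described above (the class name is mine, and the search window is truncated for brevity). With the Eclipse template, the pair hash is p*p + p*i + j, so (i, j) collides with (0, 0) exactly when p*i + j wraps to 0 under int overflow; for each i that gives j = -p*i.
// Sketch (mine): smallest |i| + |j| colliding with (0, 0) for a given prime p.
public class PrimeSearch {
    static long smallestCollision(int p) {
        long best = Long.MAX_VALUE;
        for (int i = 1; i < 100_000; i++) {   // search window; widen as needed
            int j = -p * i;                   // wraps modulo 2^32, like the hash
            best = Math.min(best, (long) i + Math.abs((long) j));
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(smallestCollision(31));    // tiny: (1, -31) collides
        System.out.println(smallestCollision(92821)); // much larger
    }
}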
Actually, if you take a prime so large that it comes close to INT_MAX, you have the same problem because of modulo arithmetic. If you expect to hash mostly strings of length 2, perhaps a prime near the square root of INT_MAX would be best; if the strings you hash are longer, it doesn't matter so much, and collisions are unavoidable anyway...
Collisions may not be such a big issue... The primary goal of the hash is to avoid using equals for 1:1 comparisons.
If you have an implementation where equals is "generally" extremely cheap for objects that have colliding hashes, then this is not an issue (at all).
In the end, what the best way of hashing is depends on what you are comparing. In the case of an int pair (as in your example), using basic bitwise operators could be sufficient (such as & or ^).
You need to define your range for i and j. You could use a prime number for both.
public int hashCode() {
    // http://primes.utm.edu/curios/ ;)
    return 97654321 * i ^ 12356789 * j;
}
I'd choose 7243. Large enough to avoid collisions with small numbers. It doesn't overflow to small numbers quickly.
I just want to point out that the hashcode has nothing to do with primes.
In the JDK implementation:
char[] val = value; // value is the String's backing array
for (int i = 0; i < value.length; i++) {
    h = 31 * h + val[i];
}
I found that if you replace 31 with 27, the results are very similar.
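If you want to check that claim yourself, here is a rough harness I sketched (the class name and the random-string corpus are my own choices): it counts hash collisions over the same corpus for both multipliers.
// Sketch (mine): compare collision counts of the String-style hash with
// multipliers 31 and 27 over random lowercase strings.
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

public class MultiplierCompare {
    static int hash(String s, int multiplier) {
        int h = 0;
        for (int i = 0; i < s.length(); i++)
            h = multiplier * h + s.charAt(i);
        return h;
    }

    public static void main(String[] args) {
        for (int m : new int[] {31, 27}) {
            Random rnd = new Random(1); // same seed, so the corpus is identical
            Set<Integer> seen = new HashSet<>();
            int collisions = 0;
            for (int n = 0; n < 1_000_000; n++) {
                StringBuilder sb = new StringBuilder();
                for (int k = 0; k < 8; k++)
                    sb.append((char) ('a' + rnd.nextInt(26)));
                if (!seen.add(hash(sb.toString(), m))) collisions++;
            }
            System.out.println("multiplier " + m + ": " + collisions + " collisions");
        }
    }
}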
