Why must the capacity be a power of 2?
Why use "&" in the indexFor function?
Why recompute the hash in the hash function instead of directly using the key's hash code?
I think there are some important differences between this implementation and the description in "Introduction to Algorithms".
What does ">>>" mean?
static int hash(int h) {
// This function ensures that hashCodes that differ only by
// constant multiples at each bit position have a bounded
// number of collisions (approximately 8 at default load factor).
h ^= (h >>> 20) ^ (h >>> 12);
return h ^ (h >>> 7) ^ (h >>> 4);
}
Can anyone give me some guidance? I would appreciate it if someone could explain the hash algorithm.
Thanks a lot!
This is a performance optimization. The usual way to map a hash code to a table index is
table_index = hash_code % table_length;
The % operator is expensive. If table_length is a power of 2, then the calculation:
table_index = hash_code & (table_length - 1);
gives the same result as the (much) more expensive modulo operation for non-negative hash codes, and unlike % in Java it never yields a negative index.
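A minimal sketch of that equivalence, assuming a power-of-two table length. The class and method names here are made up for illustration. Math.floorMod is used as the reference, because Java's % operator can return a negative remainder for negative hash codes.

```java
public class MaskVsModDemo {
    // Index via bit mask; only valid when tableLength is a power of two.
    static int indexByMask(int hashCode, int tableLength) {
        return hashCode & (tableLength - 1);
    }

    // Reference index via a true (never negative) modulo.
    static int indexByFloorMod(int hashCode, int tableLength) {
        return Math.floorMod(hashCode, tableLength);
    }

    public static void main(String[] args) {
        int tableLength = 16;
        int[] samples = {0, 1, 15, 16, 12345, -1, -7, Integer.MIN_VALUE, Integer.MAX_VALUE};
        for (int h : samples) {
            System.out.println(h + " -> mask " + indexByMask(h, tableLength)
                    + ", floorMod " + indexByFloorMod(h, tableLength)
                    + ", % " + (h % tableLength));
        }
    }
}
```

For -7 and a table length of 16, the mask and floorMod both give 9, while the plain % gives -7, which could not be used as an array index.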
Pay no attention to the man behind the curtain.
The actual algorithm is no doubt a combination of "what feels good" to the developer, fixes for some odd degenerate cases, and simple tradition (for which users often develop obscure dependencies).
And note this:
* Applies a supplemental hash function to a given hashCode, which
* defends against poor quality hash functions. This is critical
* because HashMap uses power-of-two length hash tables, that
* otherwise encounter collisions for hashCodes that do not differ
* in lower bits. Note: Null keys always map to hash 0, thus index 0.
Net: So long as it works and the performance is good, you don't care.
I was going through the source code of HashMap, but the bitwise operators confuse me a lot.
I do understand the general purpose of the code below: fair distribution, and bringing the hashCode within the bucket limit.
Can someone explain the comments here, and what the benefit is of doing it the way it is done right now?
/**
* Computes key.hashCode() and spreads (XORs) higher bits of hash
* to lower. Because the table uses power-of-two masking, sets of
* hashes that vary only in bits above the current mask will
* always collide. (Among known examples are sets of Float keys
* holding consecutive whole numbers in small tables.) So we
* apply a transform that spreads the impact of higher bits
* downward. There is a tradeoff between speed, utility, and
* quality of bit-spreading. Because many common sets of hashes
* are already reasonably distributed (so don't benefit from
* spreading), and because we use trees to handle large sets of
* collisions in bins, we just XOR some shifted bits in the
* cheapest possible way to reduce systematic lossage, as well as
* to incorporate impact of the highest bits that would otherwise
* never be used in index calculations because of table bounds.
*/
static final int hash(Object key) {
int h;
return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}
It would be a really big help if someone could help me understand it.
This is not a duplicate because the other questions are related to hash implementations before Java 8.
Thanks in advance
hashCode() returns an int, which is 32 bits wide.
Internally, HashMap keeps the objects in pow(2, n) buckets or bins. The value of n can vary — the details are not important here; what is important is that n is typically much smaller than 32 (the number of bits in the hash).
Each object is placed into one of the buckets. To achieve good performance, it is desirable to spread the objects evenly across the buckets. This is where object hashes come in: The simplest way to choose a bucket would be by taking the lowest n bits of the object's hash code (using a simple bitwise AND). However, this would only use the lowest n bits and ignore the rest of the hash.
In the comments, the authors make an argument that this is undesirable. They cite examples of known use cases in which object hashes would systematically differ in bits other than the lowest n. This would lead to systematic collisions and systematic collisions are bad news.
To partially address this, they've implemented the current heuristic that:
keeps the top 16 bits of the hash as they are;
replaces the bottom 16 bits with an XOR of the top 16 and the bottom 16 bits.
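A small sketch of the effect, not taken from the JDK apart from the h ^ (h >>> 16) expression quoted in the question. The class name, the helper names and the choice of test hash codes (i << 16, which differ only in bits above the mask) are assumptions made for illustration.

```java
import java.util.HashSet;
import java.util.Set;

public class SpreadDemo {
    // Same transform as the quoted Java 8 hash(), applied directly to an int.
    static int spread(int h) {
        return h ^ (h >>> 16);
    }

    // Counts the distinct bucket indices hit by the hash codes i << 16, i = 0..count-1.
    static int distinctBuckets(int count, int tableLength, boolean useSpread) {
        Set<Integer> buckets = new HashSet<>();
        for (int i = 0; i < count; i++) {
            int h = i << 16;                       // low 16 bits are always zero
            int hash = useSpread ? spread(h) : h;
            buckets.add(hash & (tableLength - 1)); // power-of-two masking
        }
        return buckets.size();
    }

    public static void main(String[] args) {
        System.out.println("without spreading: " + distinctBuckets(16, 16, false));
        System.out.println("with spreading:    " + distinctBuckets(16, 16, true));
    }
}
```

Without spreading, all 16 hash codes land in bucket 0; with spreading they land in 16 different buckets. Other inputs will not necessarily separate this cleanly, which is the tradeoff the source comment mentions.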
This is an extract from Core Java by C. Horstmann.
+++++
The hashCode method should return an integer (which can be negative). Just combine the
hash codes of the instance fields so that the hash codes for different objects are likely to
be widely scattered.
For example, here is a hashCode method for the Employee class:
class Employee
{
public int hashCode()
{
return 7 * name.hashCode() + 11 * new Double(salary).hashCode() + 13 * hireDay.hashCode();
}
. . .
}
+++
I can't understand these 7, 11, and 13. Are they just pulled out of a hat? Without them the result (checking for equality of two objects) seems to be the same.
In general, testing for equality does not use the hash code.
The 7, 11 and 13 are all prime numbers. This lowers the possibility of two different employees having the same hash code (by Bézout's theorem).
In fact, I would suggest (to widen the obtained hash) using much bigger but non-consecutive primes, e.g. 1039, 2011, 32029. On Linux, the /usr/games/primes utility from package bsdgames is very useful to get them.
What is important is that if two things compare equal they have the same hash code.
For performance reasons, you want the hash codes to be widely distributed (so if two things are not equal, their hash codes should usually be different) to lower the probability of a hash collision.
Read the Wikipedia page on hash tables.
The numbers are prime numbers.
You don't want to just add the hash codes, because that would give you more collisions.
E.g.
situation A: foo="bla", bar="111"
situation B: foo="111", bar="bla"
This means that foo.hash() + bar.hash() will return the same value in both situations. You use prime numbers because the function f: N/2^32 -> N/2^32: x -> x * p (mod 2^32) is bijective if p is a prime > 2 (in fact for any odd p), i.e. you would lose bits if you multiplied by 256 instead.
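To make the swapped-fields case concrete, here is a small sketch. The class and method names are hypothetical; the multipliers 7 and 11 are borrowed from the Employee example in the question.

```java
public class WeightedHashDemo {
    // Plain sum: symmetric in its arguments, so swapping the fields changes nothing.
    static int plainSum(String foo, String bar) {
        return foo.hashCode() + bar.hashCode();
    }

    // Weighted sum: each field gets its own odd multiplier, so position matters.
    static int weightedSum(String foo, String bar) {
        return 7 * foo.hashCode() + 11 * bar.hashCode();
    }

    // Multiplying by 256 shifts the top 8 bits out of the 32-bit int.
    static int times256(int x) {
        return 256 * x;
    }

    public static void main(String[] args) {
        System.out.println(plainSum("bla", "111") == plainSum("111", "bla"));
        System.out.println(weightedSum("bla", "111") == weightedSum("111", "bla"));
        System.out.println(times256(1 << 24) == times256(0));
    }
}
```

The first comparison is true (the plain sums collide), the second is false (the weighted sums differ), and the third is true: multiplying by 256 maps both 0 and 1 << 24 to 0, which is the "losing bits" effect.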
Collisions only need to be avoided if you use something like hash sets.
Multiplying by primes is a common optimization which is often done for you by your IDE. I wouldn't do it if there is no need for optimization.
After reading the JDK's source code, I find HashMap's hash() function interesting. Its source code looks like this:
static int hash(int h) {
// This function ensures that hashCodes that differ only by
// constant multiples at each bit position have a bounded
// number of collisions (approximately 8 at default load factor).
h ^= (h >>> 20) ^ (h >>> 12);
return h ^ (h >>> 7) ^ (h >>> 4);
}
The parameter h is the hashCode of the object that was put into the HashMap. How does this method work, and why? How can this method defend against poor hashCode functions?
Hashtable uses the 'classical' approach of prime numbers: to get the 'index' of a value, you take the hash of the key and perform the modulus against the size. Taking a prime number as size, gives (normally) a nice spread over the indexes (depending on the hash as well, of course).
HashMap uses a 'power of two'-approach, meaning the sizes are a power of two. The reason is it's supposed to be faster than prime number calculations. However, since a power of two is not a prime number, there would be more collisions, especially with hash values having the same lower bits.
Why? The modulus performed against the size to get the (bucket/slot) index is simply calculated by: hash & (size-1) (which is exactly what's used in HashMap to get the index!). That's basically the problem with the 'power-of-two' approach: if the length is limited, e.g. 16, the default value of HashMap, only the last bits are used and hence, hash values with the same lower bits will result in the same (bucket) index. In the case of 16, only the last 4 bits are used to calculate the index.
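A rough sketch of the difference between the two approaches. The modulus version is modelled on the Hashtable style (clear the sign bit, then take the remainder); the class name, helper names and the sample hash codes (multiples of 16) are assumptions for illustration.

```java
import java.util.HashSet;
import java.util.Set;

public class TableSizeDemo {
    // Hashtable-style index: clear the sign bit, then take the remainder.
    static int indexByModulus(int hash, int size) {
        return (hash & 0x7FFFFFFF) % size;
    }

    // HashMap-style index: keep only the low bits (size must be a power of two).
    static int indexByMask(int hash, int size) {
        return hash & (size - 1);
    }

    // Distinct indices for the hash codes 0, 16, 32, ..., 160.
    static int distinctIndices(int size, boolean useMask) {
        Set<Integer> seen = new HashSet<>();
        for (int i = 0; i < 11; i++) {
            int hash = i * 16; // the last 4 bits are always zero
            seen.add(useMask ? indexByMask(hash, size) : indexByModulus(hash, size));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println("size 16, mask:    " + distinctIndices(16, true));
        System.out.println("size 11, modulus: " + distinctIndices(11, false));
    }
}
```

With a size of 16 and masking, all eleven hash codes end up at index 0. With a prime size of 11 and the modulus, they end up at eleven different indices.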
That's why an extra hash is calculated: it basically shifts the higher bits down and combines them with the lower bits. I don't really know the reason for the numbers 20, 12, 7 and 4. They used to be different (in Java 1.5 or so, the hash function was a little different). I suppose there's more advanced literature available. You might find more info about why they use the numbers they use in all kinds of algorithm-related literature, e.g.
http://en.wikipedia.org/wiki/The_Art_of_Computer_Programming
http://mitpress.mit.edu/books/introduction-algorithms
http://burtleburtle.net/bob/hash/evahash.html#lookup uses different algorithms depending on the length (which makes some sense).
http://www.javaspecialists.eu/archive/Issue054.html is probably interesting as well. Check the reaction of Joshua Bloch near the bottom of the article: "The replacement secondary hash function (which I developed with the aid of a computer) has strong statistical properties that pretty much guarantee good bucket distribution." So, if you ask me, the numbers come from some kind of analysis performed by Josh himself, probably assisted by who knows who.
So: a power of two gives a faster calculation, but makes an additional hash calculation necessary in order to get a nice spread over the slots/buckets.
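As a small illustration of that last point, here is a sketch that applies the hash(int) function quoted in the question to hash codes sharing the same last 4 bits. Only hash(int) and the hash & (size-1) expression come from the quoted source; the class name, the counting helper and the sample inputs are made up.

```java
import java.util.HashSet;
import java.util.Set;

public class SupplementalHashDemo {
    // The supplemental hash quoted in the question.
    static int hash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    // Distinct indices in a 16-slot table for the hash codes i << 4, i = 0..15.
    static int distinctIndices(boolean useSupplementalHash) {
        Set<Integer> seen = new HashSet<>();
        for (int i = 0; i < 16; i++) {
            int h = i << 4; // the last 4 bits are always zero
            seen.add(indexFor(useSupplementalHash ? hash(h) : h, 16));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println("raw hashCode:       " + distinctIndices(false));
        System.out.println("after hash(int h):  " + distinctIndices(true));
    }
}
```

For these inputs the raw hash codes all map to index 0, while the rehashed values cover all 16 indices. This is a hand-picked case; it shows the mechanism, not a general guarantee.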
Can someone explain the significance of these constants and why they are chosen?
static int hash(int h) {
// This function ensures that hashCodes that differ only by
// constant multiples at each bit position have a bounded
// number of collisions (approximately 8 at default load factor).
h ^= (h >>> 20) ^ (h >>> 12);
return h ^ (h >>> 7) ^ (h >>> 4);
}
source: java-se6 library
Understanding what makes for a good hash function is tricky, as there are in fact a great many different functions that are used and for slightly different purposes.
Java's hash tables work as follows:
1. They ask the key object to produce its hash code. The implementation of the hashCode() method is likely to be of distinctly variable quality (in the worst case, returning a constant value!) and will definitely not be adapted to the particular hash table you're working with.
2. They then use the above function to mix the bits up a bit, so that information present in the high bits also gets moved down to the low bits. This is important because next …
3. They take the mod of the hash code (w.r.t. the number of hash table array entries) to get the index into the array of hash table chains. There's a distinct possibility that the hash table array will have size equivalent to a power of 2, so the mixing down of the bits in step 2 is important to ensure that they don't just get thrown away.
4. They then traverse the chain until they get to the entry with an equal key (according to the equals() method).
To complete the picture, the number of entries in the hash table array is non-constant; if the chains get too long the array gets replaced with a new larger array and everything gets rehashed. That's relatively fast and has good performance implications for normal use patterns (e.g., lots of put()s followed by lots of get()s).
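The steps above can be sketched as a toy chained hash table. This is not the JDK implementation: the class ToyChainedMap, its Node type, the initial size of 16 and the 3/4 growth threshold are all assumptions chosen to keep the example short. Only the mix() body is the function quoted in the question.

```java
import java.util.Objects;

@SuppressWarnings("unchecked")
public class ToyChainedMap<K, V> {
    // One entry in a bucket's singly linked chain.
    private static final class Node<K, V> {
        final int hash;
        final K key;
        V value;
        Node<K, V> next;

        Node(int hash, K key, V value, Node<K, V> next) {
            this.hash = hash;
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    private Node<K, V>[] table = (Node<K, V>[]) new Node[16];
    private int size;

    // Step 2: mix high bits into low bits (function quoted in the question).
    static int mix(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    // Step 3: reduce the hash to an array index (length is a power of two).
    static int indexFor(int hash, int length) {
        return hash & (length - 1);
    }

    public V get(K key) {
        int hash = mix(Objects.hashCode(key)); // step 1 + step 2
        // Step 4: walk the chain until equals() matches.
        for (Node<K, V> n = table[indexFor(hash, table.length)]; n != null; n = n.next) {
            if (n.hash == hash && Objects.equals(n.key, key)) {
                return n.value;
            }
        }
        return null;
    }

    public void put(K key, V value) {
        int hash = mix(Objects.hashCode(key));
        int i = indexFor(hash, table.length);
        for (Node<K, V> n = table[i]; n != null; n = n.next) {
            if (n.hash == hash && Objects.equals(n.key, key)) {
                n.value = value; // key already present: overwrite
                return;
            }
        }
        table[i] = new Node<>(hash, key, value, table[i]);
        if (++size > table.length * 3 / 4) {
            resize();
        }
    }

    public int size() {
        return size;
    }

    // Replace the array with one twice as large and re-link every entry.
    private void resize() {
        Node<K, V>[] old = table;
        table = (Node<K, V>[]) new Node[old.length * 2];
        for (Node<K, V> head : old) {
            for (Node<K, V> n = head; n != null; ) {
                Node<K, V> next = n.next;
                int i = indexFor(n.hash, table.length);
                n.next = table[i];
                table[i] = n;
                n = next;
            }
        }
    }

    public static void main(String[] args) {
        ToyChainedMap<String, Integer> map = new ToyChainedMap<>();
        for (int i = 0; i < 100; i++) {
            map.put("key" + i, i);
        }
        System.out.println(map.get("key42") + " of " + map.size() + " entries");
    }
}
```

Storing the mixed hash in each node means resizing does not have to call hashCode() again; it only recomputes the index against the new array length.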
The actual constants used are fairly arbitrary (and are probably chosen by experiment with some simple corpus including things like large numbers of Integer and String values) but their purpose is not: getting the information in the whole value spread to most of the low bits in the value ensures that such information as is present in the output of the hashCode() is used as well as possible.
(You wouldn't do this with perfect hashing or cryptographic hashing; despite the similar names, they have very different implementation strategies. The former requires knowledge of the key space so that collisions are avoided/reduced, and the latter needs information to be moved about in all directions, not just to the low bits.)
I have also wondered about such "magic" numbers. As far as I know they are magic numbers.
Extensive testing has shown that odd and prime numbers have interesting properties that can be used in hashing (avoiding primary/secondary clustering, etc.).
I believe that most of the numbers come from research and testing that showed statistically that they give good distributions. Why specifically these numbers do that, I have no idea, but I have the impression (hopefully colleagues here can correct me if I am way off) that the implementers don't know either why these specific numbers have these qualities.
/**
* Applies a supplemental hash function to a given hashCode, which
* defends against poor quality hash functions. This is critical
* because HashMap uses power-of-two length hash tables, that
* otherwise encounter collisions for hashCodes that do not differ
* in lower bits. Note: Null keys always map to hash 0, thus index 0.
*/
static int hash(int h) {
// This function ensures that hashCodes that differ only by
// constant multiples at each bit position have a bounded
// number of collisions (approximately 8 at default load factor).
h ^= (h >>> 20) ^ (h >>> 12);
return h ^ (h >>> 7) ^ (h >>> 4);
}
Can anyone explain this method in detail? Thanks.
One of the problems with designing a general-purpose hash-code, is that you put all of this effort into ensuring a nice spread of bits, and then someone comes along and uses it in such a way as to completely undo that.
Let's take a classic example of a co-ordinate class with an X and a Y, both integers.
It's a classic example, because people will use it to demonstrate that X ^ Y is not a good hashcode, because it's common for there to be several objects where X == Y (all hash to 0) or one whose X and Y are the Y and X of the other (will hash the same) and other cases where we end up with the same hash code.
It comes down to the fact that while integers have a possible range covering about 4 billion values, in 99% of uses they tend to be less than a few hundred, or a few tens of thousands at most. We can never get away from trying to spread about 18 quintillion possible values (two 32-bit integers) among 4 billion possible results, but we can work to make those we're likely to actually see less likely to clash.
Now, (X << 16 | X >> 16) ^ Y does a better job in this case, spreading the bits from X around more.
Unfortunately, if this hash is used with % someBinaryRoundNumber to index into a table (or even % someOtherNumber, to a slightly lesser extent), then for values of X below someBinaryRoundNumber - which we can expect to be most common - this hash effectively becomes return Y.
All our hard work wasted!
The rehash is there to make a hash built with such logic slightly better.
It's worth noting that it wouldn't be fair to be too critical of the (X << 16 | X >> 16) ^ Y approach, as in another use of the hash that form could be superior to a given alternative.
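Here is a sketch of the coordinate example, assuming Java and a 16-slot table. The rotation is written with >>> so that it behaves as a rotation for negative X as well; the class and method names are hypothetical, and rehash() is the hash(int) function quoted in the question.

```java
public class PointHashDemo {
    // Rotate X by 16 bits, then XOR with Y (the hash discussed above).
    static int pointHash(int x, int y) {
        return ((x << 16) | (x >>> 16)) ^ y;
    }

    // The supplemental hash quoted in the question.
    static int rehash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    static int index(int h, int length) {
        return h & (length - 1);
    }

    public static void main(String[] args) {
        for (int x = 0; x < 4; x++) {
            int h = pointHash(x, 5);
            System.out.println("x=" + x + ", y=5: raw index " + index(h, 16)
                    + ", rehashed index " + index(rehash(h), 16));
        }
    }
}
```

For small X the raw index is always Y's low bits (5 here), so X contributes nothing. After the rehash, (0, 5) and (1, 5) land at different indices.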
Well, without going into too fine a detail:
Under the hashCode() and equals() contracts, a poor implementation of the hashCode() method can lead to different instances having the same hash code. This means that you may have a class with a poor hashCode() implementation, such that the equals() method of the class returns false for instances A and B (meaning that they are different from the business logic point of view) but the hashCode() method returns the same value for both. Again, this is technically valid according to the hashCode() and equals() contract, but not very proper.
In a Hashtable-like structure (like HashMap), "buckets" are used to place instances inside the map according to their hash code. If two instances have the same hashCode() (but are different according to the equals() method), they will both be placed in the same bucket. This is bad from a performance point of view, because you lose some of the retrieval speed inherent to a Hashtable-like structure when you have a lot of such situations. These are called collisions. What happens is that if later on someone uses a "search" instance to retrieve something from the map, and the corresponding bucket has more than one occupant, each element in that bucket must be checked with the equals() method to find out which one needs to be retrieved. In an extreme situation a HashMap can degenerate into a linked list like this.
The hash(int h) method mixes the bits of the existing hashCode() value in order to defend against such situations. It helps when hash codes differ only in bits that the index calculation would otherwise ignore; it cannot separate instances whose hashCode() values are identical.
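A minimal sketch of the extreme case described above: a key class whose hashCode() is constant. BadKey and the helper names are made up for illustration. The map stays correct because equals() still tells the keys apart, but every entry shares one bucket.

```java
import java.util.HashMap;
import java.util.Map;

public class ConstantHashDemo {
    // A key with a legal but very poor hashCode(): every instance collides.
    static final class BadKey {
        final int id;

        BadKey(int id) {
            this.id = id;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).id == id;
        }

        @Override
        public int hashCode() {
            return 42; // same value for all instances
        }
    }

    // Builds a map with n colliding keys.
    static Map<BadKey, String> build(int n) {
        Map<BadKey, String> map = new HashMap<>();
        for (int i = 0; i < n; i++) {
            map.put(new BadKey(i), "value" + i);
        }
        return map;
    }

    public static void main(String[] args) {
        Map<BadKey, String> map = build(1000);
        // Lookup still works, but has to compare against entries in the shared bucket.
        System.out.println(map.get(new BadKey(7)) + " of " + map.size());
    }
}
```

No amount of bit mixing helps here, since identical hash codes stay identical after any deterministic transform.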