I am wondering: if we implement our own hash map that doesn't use power-of-two table lengths (for the initial capacity and on every resize), can we just take the object's hashCode modulo the table size directly, instead of first passing the hashCode through a hash function?
For example:
public V put(K key, V value) {
if (key == null)
return putForNullKey(value);
// int hash = hash(key.hashCode()); original way
//can we just use the key's hashcode if our table length is not power-of-two ?
int hash = key.hashCode();
int i = indexFor(hash, table.length);
...
...
}
Presuming we're talking about OpenJDK 7, the additional hash is a mixing function used to stimulate avalanching. It is needed because the capacity is a power of 2, so the mapping from a hash to a bucket is a mere bitwise & (for non-negative a, a % b is equivalent to a & (b - 1) iff b is a power of 2). This means that only the lower bits matter, so applying this mixing step helps protect against poorer hashes.
static int hash(int h) {
// This function ensures that hashCodes that differ only by
// constant multiples at each bit position have a bounded
// number of collisions (approximately 8 at default load factor).
h ^= (h >>> 20) ^ (h >>> 12);
return h ^ (h >>> 7) ^ (h >>> 4);
}
If you want to use sizes that aren't powers of 2, the above may not be needed.
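To see why the mixing step matters with power-of-two masking, here is a minimal sketch. The class name MixingDemo and the choice of test hash codes (i << 16) are assumptions made for illustration; hash and indexFor are copied from the OpenJDK 7 snippets quoted in this thread.

```java
import java.util.HashSet;
import java.util.Set;

final class MixingDemo {
    // Supplemental hash from OpenJDK 7's HashMap, as quoted above.
    static int hash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    // Power-of-two bucket mapping: keeps only the low bits of h.
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    // Counts how many distinct buckets the 16 hash codes i << 16 (i = 0..15) occupy.
    static int distinctBuckets(boolean mix, int length) {
        Set<Integer> buckets = new HashSet<>();
        for (int i = 0; i < 16; i++) {
            int h = i << 16; // these hash codes differ only in bits 16..19
            buckets.add(indexFor(mix ? hash(h) : h, length));
        }
        return buckets.size();
    }

    public static void main(String[] args) {
        System.out.println(distinctBuckets(false, 16)); // 1: every key lands in bucket 0
        System.out.println(distinctBuckets(true, 16));  // 16: mixing spreads them out
    }
}
```

Without mixing, all sixteen keys share one bucket because the mask discards the only bits in which they differ.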
Actually changing the mapping from hashes to buckets (which normally relies on the capacity being a power of 2) will require you to look at indexFor:
static int indexFor(int h, int length) {
return h & (length-1);
}
You could use (h & 0x7fffffff) % length here.
You can think of the mod function as a simple form of hash function: it maps a large range of data to a smaller space. Assuming the original hashcode is well designed, I see no reason why a mod cannot be used to map the hashcode into the range of the table you are using.
If your original hashfunction is not well implemented, e.g. always returns an even number, you will create quite a lot of collisions using just a mod function as your hashfunction.
This is true; for the table size you can pick pseudo-prime numbers instead of powers of two.
Note: indexFor then needs to use %, compensating for the sign, instead of a simple &, which can actually make the lookup slower.
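The effect of an always-even hashCode can be sketched as follows. The class name EvenHashDemo, the hundred sample hash codes and the two table lengths are assumptions chosen for illustration only.

```java
import java.util.HashSet;
import java.util.Set;

final class EvenHashDemo {
    // Counts the distinct buckets hit by the even hash codes 0, 2, 4, ..., 198
    // when the bucket index is simply hash % length.
    static int distinctBuckets(int length) {
        Set<Integer> buckets = new HashSet<>();
        for (int i = 0; i < 100; i++) {
            int h = 2 * i;           // a poor hashCode that is always even
            buckets.add(h % length); // h is non-negative here, so % needs no sign fix
        }
        return buckets.size();
    }

    public static void main(String[] args) {
        System.out.println(distinctBuckets(16)); // 8: odd buckets are never used
        System.out.println(distinctBuckets(17)); // 17: a prime length uses every bucket
    }
}
```

With an even table length, half the buckets stay empty; a prime length shares no factor with the hash codes, so all buckets get used.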
indexFor = (h & Integer.MAX_VALUE) % length
// or
indexFor = Math.abs(h % length)
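A small sketch comparing the sign-compensating variants for an arbitrary table length. The class and method names are hypothetical; Math.floorMod (Java 8+) is added here as a third option that the answers above do not mention.

```java
final class IndexForDemo {
    // Clears the sign bit first, then takes the remainder.
    static int byMask(int h, int length) {
        return (h & Integer.MAX_VALUE) % length;
    }

    // Takes the remainder first, then drops the sign. Note the order:
    // Math.abs(h) % length would be wrong, because
    // Math.abs(Integer.MIN_VALUE) is still negative.
    static int byAbs(int h, int length) {
        return Math.abs(h % length);
    }

    // Floor modulus: always in [0, length) for a positive length.
    static int byFloorMod(int h, int length) {
        return Math.floorMod(h, length);
    }

    public static void main(String[] args) {
        int length = 10; // deliberately not a power of two
        int[] samples = {Integer.MIN_VALUE, -1, 0, 1, Integer.MAX_VALUE};
        for (int h : samples) {
            System.out.println(h + " -> " + byMask(h, length) + ", "
                    + byAbs(h, length) + ", " + byFloorMod(h, length));
        }
    }
}
```

The three variants can map the same negative hash to different buckets (for h = -1 and length 10 they give 7, 1 and 9), but each one is a valid index, which is all a table needs as long as it uses one variant consistently.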
The accepted answer in Best implementation for hashCode method gives a seemingly good method for finding Hash Codes. But I'm new to Hash Codes, so I don't quite know what to do.
For 1), does it matter what nonzero value I choose? Is 1 just as good as other numbers such as the prime 31?
For 2), do I add each value to c? What if I have two fields that are both a long, int, double, etc?
Did I interpret it right in this class:
public class MyClass {
long a, b, c; // these are the only fields
//some code and methods
public int hashCode(){
return 37 * (37 * ((int) (a ^ (a >>> 32))) + (int) (b ^ (b >>> 32)))
+ (int) (c ^ (c >>> 32));
}
}
The value is not important; it can be whatever you want. Prime numbers result in a better distribution of the hashCode values, so they are preferred.
You do not necessarily have to add them. You are free to implement whatever algorithm you want, as long as it fulfills the hashCode contract:
Whenever it is invoked on the same object more than once during an execution of a Java application, the hashCode method must consistently return the same integer, provided no information used in equals comparisons on the object is modified. This integer need not remain consistent from one execution of an application to another execution of the same application.
If two objects are equal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce the same integer result.
It is not required that if two objects are unequal according to the equals(java.lang.Object) method, then calling the hashCode method on each of the two objects must produce distinct integer results. However, the programmer should be aware that producing distinct integer results for unequal objects may improve the performance of hash tables.
Some algorithms are considered poor hashCode implementations, and simply adding the attribute values is one of them. The reason is that if you have a class with two fields, Integer a and Integer b, and your hashCode() just sums these values, then the distribution of the hashCode values is highly dependent on the values your instances store. For example, if most of the values of a are between 0 and 10 and most of the values of b are between 0 and 10, then the hashCode values are between 0 and 20. This implies that if you store instances of this class in, e.g., a HashMap, numerous instances will be stored in the same bucket (because instances with different a and b values but the same sum will be put in the same bucket). This has a bad impact on the performance of operations on the map, because a lookup has to compare against the elements in the bucket using equals().
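The collision count for the example above can be checked with a short sketch. The class name SumHashDemo and the comparison formula 31 * a + b are assumptions for illustration; the value range 0 to 10 comes from the paragraph above.

```java
import java.util.HashSet;
import java.util.Set;

final class SumHashDemo {
    // Counts the distinct hash values over all pairs (a, b) with a, b in 0..10.
    static int distinctHashes(boolean usePrime) {
        Set<Integer> hashes = new HashSet<>();
        for (int a = 0; a <= 10; a++) {
            for (int b = 0; b <= 10; b++) {
                hashes.add(usePrime ? 31 * a + b : a + b);
            }
        }
        return hashes.size();
    }

    public static void main(String[] args) {
        System.out.println(distinctHashes(false)); // 21 distinct sums for 121 pairs
        System.out.println(distinctHashes(true));  // 121: every pair gets its own value
    }
}
```

With the plain sum, 121 different objects share only 21 hash values; multiplying by 31 before adding keeps them all apart in this range.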
Regarding the algorithm, it looks fine, it is very similar to the one that Eclipse generates, but it is using a different prime number, 31 not 37:
@Override
public int hashCode() {
final int prime = 31;
int result = 1;
result = prime * result + (int) (a ^ (a >>> 32));
result = prime * result + (int) (b ^ (b >>> 32));
result = prime * result + (int) (c ^ (c >>> 32));
return result;
}
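As a side note, the expression (int) (x ^ (x >>> 32)) is the same computation that Long.hashCode performs, so the generated method can also be written with java.util.Objects.hash (Java 7+). The sketch below uses a hypothetical class LongFieldsHash to check that the two forms agree; it is an illustration, not the only way to write it.

```java
import java.util.Objects;

final class LongFieldsHash {
    final long a, b, c;

    LongFieldsHash(long a, long b, long c) {
        this.a = a;
        this.b = b;
        this.c = c;
    }

    // Same steps as the Eclipse-generated method above.
    int manualHash() {
        int result = 1;
        result = 31 * result + (int) (a ^ (a >>> 32));
        result = 31 * result + (int) (b ^ (b >>> 32));
        result = 31 * result + (int) (c ^ (c >>> 32));
        return result;
    }

    // Objects.hash boxes each long and combines the Long hash codes
    // with the same "31 * result + element" scheme, starting from 1.
    int libraryHash() {
        return Objects.hash(a, b, c);
    }

    public static void main(String[] args) {
        LongFieldsHash x = new LongFieldsHash(1L, -5L, 123456789012L);
        System.out.println(x.manualHash() == x.libraryHash());
    }
}
```

Objects.hash allocates an array and boxes the values, so the hand-written form may be preferable in hot code paths.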
A well-behaved hashcode method already exists for long values - don't reinvent the wheel:
int hashCode = Long.hashCode((a * 31 + b) * 31 + c); // Java 8+
int hashCode = Long.valueOf((a * 31 + b) * 31 + c).hashCode(); // Java <8
Multiplying by a prime number (usually 31 in JDK classes) and accumulating the sum is a common method of creating a "unique" number from several numbers.
The hashCode() method of Long keeps the result properly distributed across the int range, making the hash "well behaved" (basically pseudo random).
When a key's hash code is calculated (in ConcurrentHashMap), the spread() method is called:
static final int spread(int h) {
return (h ^ (h >>> 16)) & HASH_BITS;
}
where HASH_BITS equals 0x7fffffff. So what is the purpose of HASH_BITS? Someone said it sets the sign bit to 0, but I am not sure about that.
The index of a key-value node in the bucket array is calculated by the following formula:
index = (n - 1) & hash
hash is the result of spread()
n is the length of the bucket array, whose maximum is 2^30
private static final int MAXIMUM_CAPACITY = 1 << 30;
So the maximum of n - 1 is 2^30 - 1, which means the top bit of hash will never be used in the index calculation.
But I still don't understand whether it is necessary to clear the top bit of hash to 0. It seems that there must be other reasons to do so.
/**
* Spreads (XORs) higher bits of hash to lower and also forces top
* bit to 0. Because the table uses power-of-two masking, sets of
* hashes that vary only in bits above the current mask will
* always collide. (Among known examples are sets of Float keys
* holding consecutive whole numbers in small tables.) So we
* apply a transform that spreads the impact of higher bits
* downward. There is a tradeoff between speed, utility, and
* quality of bit-spreading. Because many common sets of hashes
* are already reasonably distributed (so don't benefit from
* spreading), and because we use trees to handle large sets of
* collisions in bins, we just XOR some shifted bits in the
* cheapest possible way to reduce systematic lossage, as well as
* to incorporate impact of the highest bits that would otherwise
* never be used in index calculations because of table bounds.
*/
static final int spread(int h) {
return (h ^ (h >>> 16)) & HASH_BITS;
}
I think it is to avoid collisions with the reserved hash values MOVED (-1), TREEBIN (-2) and RESERVED (-3), whose sign bits are always 1.
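This can be illustrated with a small sketch. The constants and spread() are copied from the snippets in this thread; the class name SpreadDemo and the helper unmasked() are hypothetical and exist only to show what would happen without the & HASH_BITS step.

```java
final class SpreadDemo {
    static final int HASH_BITS = 0x7fffffff;
    static final int MOVED = -1;    // reserved hash values quoted above
    static final int TREEBIN = -2;
    static final int RESERVED = -3;
    static final int MAXIMUM_CAPACITY = 1 << 30;

    static int spread(int h) {
        return (h ^ (h >>> 16)) & HASH_BITS;
    }

    // Hypothetical variant without the mask, for comparison only.
    static int unmasked(int h) {
        return h ^ (h >>> 16);
    }

    public static void main(String[] args) {
        int h = 0xFFFF0000; // a perfectly legal hashCode() result
        System.out.println(unmasked(h)); // -1, indistinguishable from MOVED
        System.out.println(spread(h));   // 2147483647, never negative
        // The bucket index is the same either way, even at maximum capacity:
        int n = MAXIMUM_CAPACITY;
        System.out.println(((n - 1) & spread(h)) == ((n - 1) & unmasked(h)));
    }
}
```

So the mask does not change which bucket a key lands in; it only guarantees that the hash stored for an ordinary key is non-negative and therefore cannot be mistaken for one of the negative marker values.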
I am trying this code snippet:
Map<String, String> headers = new HashMap<>();
headers.put("X-Capillary-Relay","abcd");
headers.put("Message-ID","abcd");
Now when I do a get for either of the keys, it works fine.
However, I am seeing a strange phenomenon in the Eclipse debugger.
When I debug, open the Variables view and look inside the table entry, at first I see this:
->table
--->[4]
------>key:X-Capillary-Relay
...........
However, after stepping over the 2nd put, I get:
->table
--->[4]
------>key:Message-ID
...........
Instead of creating a new entry, it seems to overwrite the existing key. For any other key this overwrite does not occur. The size of the map is shown as 2, and get works for both keys. So what is the reason behind this discrepancy in the Eclipse debugger? Is it an Eclipse problem or a hashing problem? The hashcode is different for the 2 keys.
The hashCode of the keys is not used as is.
Two transformations are applied to it (at least based on Java 6 code):
static int hash(int h) {
// This function ensures that hashCodes that differ only by
// constant multiples at each bit position have a bounded
// number of collisions (approximately 8 at default load factor).
h ^= (h >>> 20) ^ (h >>> 12);
return h ^ (h >>> 7) ^ (h >>> 4);
}
and
/**
* Returns index for hash code h.
*/
static int indexFor(int h, int length) {
return h & (length-1);
}
Since length is the initial capacity of the HashMap (16 by default), you get 4 for both keys:
System.out.println (hash("X-Capillary-Relay".hashCode ())&(16-1));
System.out.println (hash("Message-ID".hashCode ())&(16-1));
Therefore both entries are stored in a linked list in the same bucket of the map (index 4 of the table array, as you can see in the debugger). The fact that the debugger shows only one of them doesn't mean that the other was overwritten. It means that you see the key of the first Entry of the linked list, and each new Entry is added to the head of the list.
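That a shared bucket does not mean an overwritten entry can be checked without a debugger. The sketch below uses the strings "Aa" and "BB", which are known to have the same String.hashCode() and therefore always land in the same bucket; the class name SameBucketDemo and the stored values are assumptions for illustration.

```java
import java.util.HashMap;
import java.util.Map;

final class SameBucketDemo {
    // Builds a map whose two keys are guaranteed to collide.
    static Map<String, String> build() {
        Map<String, String> map = new HashMap<>();
        map.put("Aa", "first");
        map.put("BB", "second"); // same hash code as "Aa", different key
        return map;
    }

    public static void main(String[] args) {
        System.out.println("Aa".hashCode() == "BB".hashCode()); // true
        Map<String, String> map = build();
        System.out.println(map.size());    // 2: nothing was overwritten
        System.out.println(map.get("Aa")); // first
        System.out.println(map.get("BB")); // second
    }
}
```

Both entries live in one bucket and are told apart by equals(), which is the same situation the debugger shows for the two header keys.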
I am a little bit confused about how HashSet works internally. As far as I know, HashSet uses the key (K) to find the right bucket and equals to compare values, but how does HashSet generate the hash key?
Here it is:
final int hash(Object k) {
int h = hashSeed;
if (0 != h && k instanceof String) {
return sun.misc.Hashing.stringHash32((String) k);
}
h ^= k.hashCode();
// This function ensures that hashCodes that differ only by
// constant multiples at each bit position have a bounded
// number of collisions (approximately 8 at default load factor).
h ^= (h >>> 20) ^ (h >>> 12);
return h ^ (h >>> 7) ^ (h >>> 4);
}
It's actually in HashMap, which HashSet uses internally.
Internally, HashSet uses a HashMap: the hash key of the value is generated and used to store the element in the hash table.
To generate the hash code of the element, its hashCode() method is called.
Below is the HashMap method for putting an element, which HashSet uses internally to add an element:
public V put(K paramK, V paramV)
{
if (paramK == null)
return putForNullKey(paramV);
int i = hash(paramK.hashCode()); // the element's hashCode() is called here
// More code
}
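A minimal sketch that makes the hashCode() call visible from the outside. The Point class, the call counter and the class name HashSetDemo are all hypothetical and exist only for this demonstration.

```java
import java.util.HashSet;
import java.util.Set;

final class HashSetDemo {
    static int hashCodeCalls = 0; // counts how often hashCode() is invoked

    static final class Point {
        final int x, y;

        Point(int x, int y) {
            this.x = x;
            this.y = y;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Point)) return false;
            Point p = (Point) o;
            return x == p.x && y == p.y;
        }

        @Override
        public int hashCode() {
            hashCodeCalls++;
            return 31 * x + y;
        }
    }

    // Adds two equal points; returns the resulting set size.
    static int addTwoEqualPoints() {
        Set<Point> set = new HashSet<>();
        set.add(new Point(1, 2));
        set.add(new Point(1, 2)); // same hash code and equals() is true, so rejected
        return set.size();
    }

    public static void main(String[] args) {
        System.out.println(addTwoEqualPoints()); // 1
        System.out.println(hashCodeCalls);       // at least 2, one per add
    }
}
```

The set asks each element for its hashCode() to find the bucket and then uses equals() to decide whether the element is already present.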
Here is an implementation of HashMap.
It provides this code for getting index of the bin:
private int getIndex(K key)
{
int hash = key.hashCode() % nodes.length;
if (hash < 0)
hash += nodes.length;
return hash;
}
To make sure the hash value is not bigger than the size of the table,
the result of the user provided hash function is used modulo the
length of the table. We need the index to be non-negative, but the
modulus operator (%) will return a negative number if the left operand
(the hash value) is negative, so we have to test for it and make it
non-negative.
If hash turns out to be a very big negative value, the additions hash += nodes.length in a loop may take a lot of processing.
I think there should be an O(1) algorithm for it (independent of the hash value).
If so, how can it be achieved?
It can't be a very big negative number.
The result of anything % nodes.length is always less than nodes.length in absolute value, so you need a single if, not a loop. This is exactly what the code does:
if (hash < 0) /* `if', not `while' */
hash += nodes.length;
This is not the approach HashMap uses in reality.
/**
 * Returns index for hash code h.
 */
static int indexFor(int h, int length) {
return h & (length-1);
}
This works because length is always a power of 2, so the mask gives the same result as an unsigned % length.
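The equivalence can be checked with a short sketch. The class name MaskDemo and the helper agrees() are hypothetical; Integer.remainderUnsigned and Math.floorMod require Java 8 or later.

```java
final class MaskDemo {
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    // True if the mask gives the same index as an unsigned remainder
    // and as a floor modulus for this hash and table length.
    static boolean agrees(int h, int length) {
        int masked = indexFor(h, length);
        return masked == Integer.remainderUnsigned(h, length)
                && masked == Math.floorMod(h, length);
    }

    public static void main(String[] args) {
        int[] samples = {Integer.MIN_VALUE, -17, -1, 0, 1, 12345, Integer.MAX_VALUE};
        for (int h : samples) {
            System.out.println(h + ": " + agrees(h, 16));
        }
        // For a length that is not a power of two the mask is no longer a remainder:
        System.out.println(agrees(6, 6)); // false, since 6 & 5 is 4 but 6 % 6 is 0
    }
}
```

This is why a table that gives up power-of-two sizes also has to give up the mask and pay for a real remainder operation.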
"If hash turns out to be a very big negative value, the additions hash += nodes.length in a loop may take a lot of processing."
The hash at this point must be between -length+1 and length-1, so it cannot be a very large negative value, and the code wouldn't work if it were. In any case, it doesn't matter how large the value is: the cost is always the same.