HashMap keeps its data in buckets as:
transient Node<K,V>[] table;
To put something in a HashMap we need a hash() function which returns the hash of the key in the range from 0 to table.length - 1, right?
Suppose, I have:
String s = "15315";
// Just pasted the internal operation. Is it supposed to calculate the hash in the table.length range?
int h;
int hmhc = (h = s.hashCode()) ^ (h >>> 16);
System.out.println("String native hashCode: "+s.hashCode() + ", HashMap hash: "+hmhc);
This returns the following:
String native hashCode: 46882035, HashMap hash: 46882360
We should have approximately 256 buckets (so the hash of the key should be in the range 0 to 255), but the internal hash in HashMap gives us 46882360. How is this hash "normalized" into our range? I just can't see it in the source code.
I looked at this JDK source (put() starts at line 610): http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/classes/java/util/HashMap.java
Generally, the hash code returned is reduced modulo the number of buckets. Since HashMap keeps the table length a power of two, it does this with a bitwise AND, (length - 1) & hash, which is equivalent to hash % length for non-negative hashes.
In your case, the entry goes into bucket 46882360 % 256 = 56.
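To make the arithmetic concrete, here is a small, self-contained sketch (not code from the JDK itself) that reproduces the numbers above, assuming a hypothetical table of 256 buckets:
public class BucketIndexDemo {
    public static void main(String[] args) {
        String s = "15315";

        // JDK 8 style spreading step, as pasted in the question.
        int h = s.hashCode();
        int hash = h ^ (h >>> 16);

        int buckets = 256; // assumed table length; HashMap always uses a power of two

        // HashMap derives the index as (length - 1) & hash, which equals
        // hash % length for non-negative hashes when length is a power of two.
        int index = (buckets - 1) & hash;

        System.out.println(hash);           // 46882360
        System.out.println(index);          // 56
        System.out.println(hash % buckets); // 56 as well
    }
}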
Related
I am trying this code snippet
Map<String, String> headers = new HashMap<>();
headers.put("X-Capillary-Relay","abcd");
headers.put("Message-ID","abcd");
Now when I do a get for either of the keys it works fine.
However, I am seeing a strange phenomenon in the Eclipse debugger.
When I debug, go into the Variables view, and check inside the table entry, at first I see this:
->table
--->[4]
------>key:X-Capillary-Relay
...........
However, after stepping over the second put I get:
->table
--->[4]
------>key:Message-ID
...........
Instead of creating a new entry, it seems to overwrite the existing key. For any other key this overwrite does not occur. The size of the map is shown as 2, and get works for both keys. So what is the reason behind this discrepancy in the Eclipse debugger? Is it an Eclipse problem or a hashing problem? The hashCodes of the two keys are different.
The hashCode of the keys is not used as is.
Two transformations are applied to it (at least in the Java 6 code):
static int hash(int h) {
    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}
and
/**
 * Returns index for hash code h.
 */
static int indexFor(int h, int length) {
    return h & (length-1);
}
Since length is the table length, which here is the initial capacity of the HashMap (16 by default), you get 4 for both keys:
System.out.println (hash("X-Capillary-Relay".hashCode ())&(16-1));
System.out.println (hash("Message-ID".hashCode ())&(16-1));
Therefore both entries are stored in a linked list in the same bucket of the map (index 4 of the table array, as you can see in the debugger). The fact that the debugger shows only one of them doesn't mean that the other was overwritten. It means that you see the key of the first Entry of the linked list, and each new Entry is added to the head of the list.
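To see why the newest key appears at the top of table[4] in the debugger, here is a minimal, hedged sketch of that pre-Java-8 chaining behaviour (assumed class and method names, resizing omitted; it is not the real java.util.HashMap):
class TinyChainedMap<K, V> {
    static class Entry<K, V> {
        final K key;
        V value;
        Entry<K, V> next;
        Entry(K key, V value, Entry<K, V> next) {
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    @SuppressWarnings("unchecked")
    private final Entry<K, V>[] table = (Entry<K, V>[]) new Entry[16];

    // Same mixing and index functions as the Java 6 source quoted above.
    private static int hash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    private static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    public void put(K key, V value) {
        int i = indexFor(hash(key.hashCode()), table.length);
        for (Entry<K, V> e = table[i]; e != null; e = e.next) {
            if (e.key.equals(key)) {  // same key: overwrite the value only
                e.value = value;
                return;
            }
        }
        // A genuinely new key becomes the head of the bucket's list; the previous
        // head survives as its "next", which is what the debugger tree shows nested below it.
        table[i] = new Entry<>(key, value, table[i]);
    }
}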
My concern was to check how Java's HashMap gets the same index for a key even when its size expands from the default 16 to much higher values as we keep adding entries.
I tried to reproduce the indexing algorithm of HashMap.
int length=1<<5;
int v=6546546;
int h = new Integer(v).hashCode();
h =h^( (h >>> 20) ^ (h >>> 12));
h=h ^ h >>> 7 ^ h >>> 4;
System.out.println("index :: " + (h & (length-1)));
I ran my code for different values of "length".
So for the same key I am getting a different index as the length of the HashMap changes. What am I missing here?
My Results:
length=1<<5;
index :: 10
length=1<<15;
index :: 7082
length=1<<30;
index :: 6626218
You're missing the fact that every time the length changes, the entries are redistributed - they're put in the new buckets appropriately. That's why it takes a bit of time (O(N)) when the map expands - everything needs to be copied from the old buckets to the new ones.
So long as you only ever have indexes for one length at a time (not a mixture), you're fine.
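Here is a small sketch (assumed class name; the mixing function is the Java 6 one quoted earlier) that reproduces those indexes and shows they simply differ per table length:
public class IndexPerLengthDemo {
    // Java 6 style mixing, as in the snippets quoted above.
    static int hash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    public static void main(String[] args) {
        int h = hash(Integer.hashCode(6546546));
        // The same mixed hash maps to a different bucket for each table length;
        // that is fine, because resize() moves every entry to its new bucket.
        for (int shift : new int[] {5, 15, 30}) {
            int length = 1 << shift;
            System.out.println("length=1<<" + shift + "  index :: " + (h & (length - 1)));
        }
    }
}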
I am not sure of the best way to go about hashing a "dictionary" into a table.
The dictionary has 61406 words; I determine the table size as sizeOfDictionary / 0.75 (to keep the load factor at 0.75).
That gives me 81874 buckets in the table.
I run it through my hash function (a generic random algorithm) and 31690 buckets get used, while some fifty thousand remain empty. The largest bucket only contains 10 words.
My question: do these numbers suffice for a hashing project? I am unfamiliar with what I am trying to achieve; to me, some fifty thousand empty buckets seems like a lot.
Here is my hashing function.
private void hashingAlgorithm(String word)
{
    int key = 1;
    // Multiplying ASCII values of string
    // to determine the index
    for (int i = 0; i < word.length(); i++) {
        key *= (int) word.charAt(i);
        // Accounting for integer overflow
        if (key < 0)
            key *= -1;
    }
    key %= sizeOfTable;
    // Inserting into the table
    table[key].addToBucket(word);
}
Performance analysis:
Your hashing function doesn't take the order of the characters into account. According to your algorithm, if there's no overflow,
hash("ab") == hash("ba"). Your code depends on overflow to distinguish different orderings, so there is room for a lot of extra collisions, which can be removed if you treat the strings as base-N numbers.
Suggested Improvement:
2 * 3 == 3 * 2
but
2 * 223 + 3 != 3 * 223 + 2
So if we represent the strings as base-N numbers, the number of collisions will decrease dramatically.
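A quick, hedged check of that claim in plain Java (31 here is just an example base, the one used later in this answer):
public class OrderMatters {
    public static void main(String[] args) {
        // 'a' == 97, 'b' == 98: the product ignores order, the base-31 form does not.
        System.out.println('a' * 'b' == 'b' * 'a');           // true  -> collision
        System.out.println('a' * 31 + 'b' == 'b' * 31 + 'a'); // false -> no collision
    }
}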
If the dictionary contains words like:
abdc
abcd
dbca
dabc
dacb
they will all hash to the same value in the hash table, namely int(a)*int(b)*int(c)*int(d), which is not a good idea.
So, use a rolling (polynomial) hash.
Example:
hash = [0]*base^(n-1) + [1]*base^(n-2) + ... + [n-1]
where base is a prime number, say 31.
NOTE: [i] means word.charAt(i).
You can also apply a modulo p (where p is a prime number) to avoid overflow and to limit the size of your hash table:
hash = ([0]*base^(n-1) + [1]*base^(n-2) + ... + [n-1]) mod p
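Here is a hedged sketch of that polynomial hash in Java (assumed class and method names; base 31 as suggested, and, as a variation on the mod p above, the table size itself is used as the modulus so the result is already a valid index):
public class RollingHash {
    private static final int BASE = 31; // the suggested prime base

    // hash = [0]*BASE^(n-1) + [1]*BASE^(n-2) + ... + [n-1]  (mod tableSize),
    // computed with Horner's rule and reduced at each step to avoid overflow.
    static int polynomialHash(String word, int tableSize) {
        long hash = 0;
        for (int i = 0; i < word.length(); i++) {
            hash = (hash * BASE + word.charAt(i)) % tableSize;
        }
        return (int) hash;
    }

    public static void main(String[] args) {
        // The anagrams from the example above now land in distinct buckets,
        // whereas the multiplicative hash sends them all to the same one.
        for (String w : new String[] {"abdc", "abcd", "dbca", "dabc", "dacb"}) {
            System.out.println(w + " -> " + polynomialHash(w, 81874));
        }
    }
}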
Here is an implementation of HashMap.
It provides this code for getting the index of the bin:
private int getIndex(K key)
{
    int hash = key.hashCode() % nodes.length;
    if (hash < 0)
        hash += nodes.length;
    return hash;
}
To make sure the hash value is not bigger than the size of the table,
the result of the user provided hash function is used modulo the
length of the table. We need the index to be non-negative, but the
modulus operator (%) will return a negative number if the left operand
(the hash value) is negative, so we have to test for it and make it
non-negative.
If hash turns out to be a very big negative value, the addition hash += nodes.length in the loop may take a lot of processing.
I think there should be an O(1) algorithm for this (independent of the hash value).
If so, how can it be achieved?
It can't be a very big negative number.
The result of anything % nodes.length is always less than nodes.length in absolute value, so you need a single if, not a loop. This is exactly what the code does:
if (hash < 0) /* `if', not `while' */
    hash += nodes.length;
This is not the approach HashMap uses in reality.
/**
 * Returns index for hash code h.
 */
static int indexFor(int h, int length) {
    return h & (length-1);
}
This works because length is always a power of 2, and for such lengths h & (length-1) is the same as an unsigned h % length.
If hash turns out to be a very big negative value, the addition hash += nodes.length in the loop may take a lot of processing.
The hash at this point must be between -length+1 and length-1, so it cannot be a very large negative value, and the code wouldn't work if it were. In any case, it doesn't matter how large the value is; the cost is always the same.
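As a small sketch (assumed class and method names), here are the two index computations side by side; they agree whenever the length is a power of two, and the modulo form needs at most one corrective addition:
public class IndexForDemo {
    // Modulo-based index, as in the implementation quoted above: at most one
    // correction is needed because (h % length) is always in (-length, length).
    static int indexByMod(int h, int length) {
        int index = h % length;
        if (index < 0)          // a single `if', never a loop
            index += length;
        return index;
    }

    // Power-of-two index used by java.util.HashMap: length must be a power of two.
    static int indexByMask(int h, int length) {
        return h & (length - 1);
    }

    public static void main(String[] args) {
        int length = 16;
        for (int h : new int[] {42, -42, Integer.MIN_VALUE}) {
            System.out.println(h + " -> mod: " + indexByMod(h, length)
                                 + ", mask: " + indexByMask(h, length));
        }
    }
}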
I am wondering: if we implement our own hash map that doesn't use power-of-two table lengths (for the initial capacity and whenever we resize), can we just use the object's hashCode modulo the table size directly, instead of passing the hashCode through an extra hash function?
For example:
public V put(K key, V value) {
    if (key == null)
        return putForNullKey(value);
    // int hash = hash(key.hashCode()); original way
    // can we just use the key's hashCode if our table length is not a power of two?
    int hash = key.hashCode();
    int i = indexFor(hash, table.length);
    ...
    ...
}
Presuming we're talking about OpenJDK 7, the additional hash is used to promote avalanching; it is a mixing function. It is used because the mapping from a hash to a bucket, since we're using a power of 2 for the capacity, is a mere bitwise & (a % b is equivalent to a & (b - 1) when b is a power of 2 and a is non-negative); this means that only the lower bits matter, so applying this mixing step helps protect against poorer hashes.
static int hash(int h) {
    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}
If you want to use sizes that aren't powers of 2, the above may not be needed.
Actually, changing the mapping from hashes to buckets (which normally relies on the capacity being a power of 2) will require you to look at indexFor:
static int indexFor(int h, int length) {
    return h & (length-1);
}
You could use (h & 0x7fffffff) % length here.
You can think of the mod function as a simple form of hash function. It maps a large range of data onto a smaller space. Assuming the original hashCode is well designed, I see no reason why a mod cannot be used to transform the hashCode into an index for the table size you are using.
If your original hash function is not well implemented, e.g. it always returns an even number, you will create quite a lot of collisions using just a mod function as your hash function.
This is true; you can pick prime (or pseudo-prime) table sizes instead to mitigate it.
Note: indexFor needs to use % and compensate for the sign instead of a simple &, which can actually make the lookup slower.
indexFor = (h & Integer.MAX_VALUE) % length
// or
indexFor = Math.abs(h % length)
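To illustrate why the sign handling is needed at all, here is a small hedged example with an arbitrary negative hash value and a non-power-of-two length; either correction yields a valid (though possibly different) bucket, as long as it is applied consistently for both put and get:
public class SignedModDemo {
    public static void main(String[] args) {
        int length = 17;        // deliberately not a power of two
        int h = -123456789;     // stand-in for a key with a negative hashCode

        System.out.println(h % length);                       // -1: negative, not a valid index
        System.out.println((h & Integer.MAX_VALUE) % length); // 8: clears the sign bit first
        System.out.println(Math.abs(h % length));             // 1: remainder, then absolute value
    }
}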