Here is an implementation of HashMap.
It provides this code for getting the index of the bin:
private int getIndex(K key)
{
    int hash = key.hashCode() % nodes.length;
    if (hash < 0)
        hash += nodes.length;
    return hash;
}
To make sure the resulting index is not bigger than the size of the table, the result of the
user-provided hash function is taken modulo the length of the table. We need the index to be
non-negative, but the modulus operator (%) will return a negative number if the left operand
(the hash value) is negative, so we have to test for that and make it non-negative.
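For instance, here is a tiny stand-alone demonstration (my own, not part of the implementation above) of that sign behaviour; Math.floorMod performs the same adjustment in a single call:

public class ModSign {
    public static void main(String[] args) {
        int length = 16;
        System.out.println(-7 % length);               // -7: % keeps the sign of the left operand
        System.out.println((-7 % length) + length);    // 9, a valid bucket index
        System.out.println(Math.floorMod(-7, length)); // 9, same adjustment in one call
    }
}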
If hash turns out to be a very big negative value, the additions hash += nodes.length in the loop may take a lot of processing.
I think there should be an O(1) algorithm for it (independent of the hash value).
If so, how can it be achieved?
It can't be a very big negative number.
The result of anything % nodes.length is always less than nodes.length in absolute value, so you need a single if, not a loop. This is exactly what the code does:
if (hash < 0) /* `if', not `while' */
    hash += nodes.length;
This is not the approach HashMap uses in reality.
/**
 * Returns index for hash code h.
 */
static int indexFor(int h, int length) {
    return h & (length-1);
}
This works because length is always a power of two, and for a power-of-two length, h & (length-1) is the same as an unsigned h % length.
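A quick sketch of my own to check that claim: for a power-of-two length, the mask produces exactly the non-negative remainder, even for negative hashes.

public class MaskVsMod {
    public static void main(String[] args) {
        int length = 16; // must be a power of two
        int[] hashes = { 5, 37, -1, -123456789, Integer.MIN_VALUE };
        for (int h : hashes) {
            // both columns print the same value, and the & version never goes negative
            System.out.println((h & (length - 1)) + " " + Math.floorMod(h, length));
        }
    }
}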
If hash turns out to be a very big negative value, the additions hash += nodes.length in the loop may take a lot of processing.
The hash at this point must be between -(length - 1) and length - 1, so it cannot be a very large negative value (and if it could be, a single addition would not be enough to fix it). In any case, it does not matter how large the value is; the cost of the addition is always the same.
How strong against collisions is the hashing mechanism used in the Arrays.hashCode methods? What is the probability of two different arrays (of, say, double) having exactly the same hash value calculated with these methods?
Arrays.hashCode(double[]) is specified to return the same value as a List of Double objects representing the same numeric values.
List.hashCode in turn is specified with a fairly simple algorithm:
int hashCode = 1;
for (E e : list)
    hashCode = 31*hashCode + (e==null ? 0 : e.hashCode());
In general, multiplication by a prime number is good practice for general-purpose hash functions, but it is far from a cryptographically strong hash.
This means that while collisions are unlikely in the general (effectively random) case, they can usually be constructed quite easily if you can influence (or select) the hashCode of the items in the List.
As a constructed example consider these two statements:
System.out.println(Arrays.hashCode(new double[] {4.753E-321d}));
System.out.println(Arrays.hashCode(new double[] {4.9E-324d, 4.9E-324d}));
Both of these will output 993, despite being clearly different arrays.
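For another constructed example (my own, using int arrays, which follow the same 31-based formula): taking one unit out of an element and adding 31 units to the following element leaves the hash unchanged.

import java.util.Arrays;

public class Collide31 {
    public static void main(String[] args) {
        // 31*(31*1 + 1) + 0  ==  31*(31*1 + 0) + 31  ==  992
        System.out.println(Arrays.hashCode(new int[] {1, 0}));  // 992
        System.out.println(Arrays.hashCode(new int[] {0, 31})); // 992
    }
}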
This is the implementation of Arrays.hashCode that you are using:
public static int hashCode(int a[]) {
    if (a == null)
        return 0;

    int result = 1;
    for (int element : a)
        result = 31 * result + element;

    return result;
}
If your values happen to be smaller than 31, they are treated like distinct digits in base 31, so each combination results in a different number (if we ignore overflow for now). Let's call those "pure hashes".
Now of course 31^11 is way larger than the number of ints in Java, so we will get tons of overflow. But since the powers of 31 and the maximum integer are "very different", you don't get an almost random distribution, but a very regular, uniform one.
Let's consider a smaller example. Assume you have only 2 elements in your array, each in the range 0 to 4. I try to create a "hashCode" between 0 and 37 by taking the "pure hash" modulo 38. The result is that I get streaks of 5 integers with small gaps in between, and not a single collision.
val hashes = for {
  i <- 0 to 4
  j <- 0 to 4
} yield (i * 31 + j) % 38
println(hashes.size) // prints 25
println(hashes.toSet.size) // prints 25
To verify whether this is what happens with your numbers, you might create a graph as follows: for each hash, take the first 16 bits as x and the second 16 bits as y, and color that dot black. I bet you will see an extremely regular pattern.
When a key's hash code is calculated, the spread() method is called:
static final int spread(int h) {
    return (h ^ (h >>> 16)) & HASH_BITS;
}
where HASH_BITS equals 0x7fffffff. So, what is the purpose of HASH_BITS? Someone says it forces the sign bit to 0, but I am not sure about that.
The index of a KV node in the hash buckets is calculated by the following formula:
index = (n - 1) & hash
hash is the result of spread()
n is the length of the bucket table, whose maximum is 2^30:
private static final int MAXIMUM_CAPACITY = 1 << 30;
So the maximum of n - 1 is 2^30 - 1, which means the top bit of hash will never be used in the index calculation.
But I still don't understand why it is necessary to clear the top bit of hash to 0. It seems that there are more reasons to do so.
/**
* Spreads (XORs) higher bits of hash to lower and also forces top
* bit to 0. Because the table uses power-of-two masking, sets of
* hashes that vary only in bits above the current mask will
* always collide. (Among known examples are sets of Float keys
* holding consecutive whole numbers in small tables.) So we
* apply a transform that spreads the impact of higher bits
* downward. There is a tradeoff between speed, utility, and
* quality of bit-spreading. Because many common sets of hashes
* are already reasonably distributed (so don't benefit from
* spreading), and because we use trees to handle large sets of
* collisions in bins, we just XOR some shifted bits in the
* cheapest possible way to reduce systematic lossage, as well as
* to incorporate impact of the highest bits that would otherwise
* never be used in index calculations because of table bounds.
*/
static final int spread(int h) {
    return (h ^ (h >>> 16)) & HASH_BITS;
}
I think it is to avoid collisions with the reserved hash codes MOVED (-1), TREEBIN (-2) and RESERVED (-3), whose sign bits are always 1.
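Here is a small sketch of my own (reusing the constants named above, which ConcurrentHashMap defines with exactly these values) showing that the mask keeps every spread hash non-negative, so a user-supplied hash can never look like one of those negative sentinels:

public class SpreadSketch {
    static final int HASH_BITS = 0x7fffffff; // usable bits of normal node hash
    static final int MOVED     = -1;         // hash for forwarding nodes
    static final int TREEBIN   = -2;         // hash for roots of trees
    static final int RESERVED  = -3;         // hash for transient reservations

    static int spread(int h) {
        return (h ^ (h >>> 16)) & HASH_BITS;
    }

    public static void main(String[] args) {
        int[] samples = { 0, 42, -1, -2, -3, Integer.MIN_VALUE, Integer.MAX_VALUE };
        for (int h : samples) {
            int s = spread(h);
            System.out.println(h + " -> " + s + " (non-negative: " + (s >= 0) + ")");
        }
    }
}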
I'm trying to implement cuckoo hashing with hash functions:
hash1: key.hashcode() % capacity
hash2: key.hashcode() / capacity % capacity
With an infinite-loop check and a rehashing method that doubles the capacity. The program works fine with a small amount of data, but when the data gets big (around 20k elements), it keeps rehashing until the capacity overflows.
I figured out that the infinite rehashing is mostly caused by elements with exactly the same hash code. After rehashing, there is a chance that other elements end up with the same hash code, causing rehashing again.
I already use Java's built-in hashCode, but the chance of identical hash codes is still high when the data set is large. Even after I modified the hashCode method a little, eventually there is still data with the same hash code.
So which hash method should I use to prevent this?
A usual method to create a hash function is to use primes. I wrote a function (below); I don't guarantee it is free of collisions, but they should be lessened.
int hashFunction1(String s){
    int k = 7; // take a prime number, can be anything (I just chose 7)
    for(int i = 0; i < s.length(); i++){
        k *= (23 * (int)(s.charAt(i)));
        k %= capacity; // capacity is the table size, assumed to be a field of the hash table
    }
    return k;
}
//23 is another randomly chosen number.
You can write a similar hash function as hashFunction2 by choosing two different prime numbers. But here the main problem is that for strings like "stop" and "pots", this gives the same hash code.
So, an improvement over this function can be:
int hashFunction1(String s){
    int k = 7; // take a prime number, can be anything (I just chose 7)
    for(int i = 0; i < s.length(); i++){
        k *= (23 * (int)(s.charAt(i)));
        k += (int)(s.charAt(i)); // adding the character breaks the symmetry between anagrams
        k %= capacity;
    }
    return k;
}
which will resolve this (for most cases, if not all).
If you still find this function bad, then instead of s.charAt(i) you can use a unique prime number mapped to each character, i.e. a=3, b=5, c=7, d=11 and so on. This should reduce collisions even further.
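A rough sketch of that idea (my own illustration; the prime table only covers lowercase a-z, and capacity is again assumed to be a field of the hash table as in the functions above):

// Map 'a'..'z' to the first 26 odd primes and mix those instead of raw char codes.
private static final int[] CHAR_PRIMES = {
     3,  5,  7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43,
    47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103
};

int hashFunction3(String s) {
    int k = 7;
    for (int i = 0; i < s.length(); i++) {
        int p = CHAR_PRIMES[s.charAt(i) - 'a']; // assumes lowercase a-z input
        k *= 23 * p;
        k += p;
        k = Math.floorMod(k, capacity); // floorMod guards against overflow making k negative
    }
    return k;
}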
EDIT:
You are using +n, which is a constant.
2 is not the prime to be used in such cases. Use an odd prime number, 3 works.
Say I have a hash table with 59 elements (each element value is an integer). Index 15 is blank and the rest of the table is full of data. Depending on the number I want to insert, the quadratic probing formula never hits element 15!
Assume I want to insert the number 199 (which hashes to 22 using the hashFunc() function I'm using below):
public int hashFunc(int key)
{
    return key % arraySize; // 199 % 59 = 22
}
public void insert(DataItem item)
{
    int key = item.getKey();     // extract the key (199)
    int hashVal = hashFunc(key); // hash the key (22)
    int i = 1;

    // The while loop just checks that the array index isn't null and isn't equal to -1,
    // which I defined to be a deleted element
    while(hashArray[hashVal] != null && hashArray[hashVal].getKey() != -1)
    {
        hashVal = hashFunc(key) + (i * i); // This never hits element 15!!!
        i++;
        hashVal %= arraySize;              // wraparound when hashVal is beyond 59
    }

    hashArray[hashVal] = item; // insert item
}
This is expected in a quadratic probing hash table. Using some modular arithmetic, you can show that only the first p / 2 probe locations in the probe sequence are guaranteed to be unique (p being the table size), meaning that an element's probe sequence may never visit half of the locations in the table.
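You can check this for the 59-slot table with a small sketch of my own (it reuses the home slot 22 from the question):

import java.util.HashSet;
import java.util.Set;

public class QuadraticProbeCoverage {
    public static void main(String[] args) {
        int arraySize = 59;
        int home = 22; // 199 % 59
        Set<Integer> visited = new HashSet<>();
        for (int i = 0; i < arraySize; i++) {
            visited.add((home + i * i) % arraySize);
        }
        System.out.println(visited.size());       // 30: only 30 of the 59 slots are ever probed
        System.out.println(visited.contains(15)); // false: the lone free slot is never reached
    }
}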
To fix this, you should probably update your code so that you rehash any time p / 2 or more of the table locations are in use. Alternatively, you can use the technique suggested in the Wikipedia article of alternating the sign of your probe offset (+1, -4, +9, -16, +25, etc.), which should ensure that you can hit every possible location.
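A hedged sketch of that alternating-sign variant, written against the same fields as your insert() (hashArray, arraySize, hashFunc); for a table size like 59 (a prime congruent to 3 mod 4) the sequence +1, -4, +9, -16, ... reaches every slot:

public void insert(DataItem item) {
    int key = item.getKey();
    int home = hashFunc(key); // e.g. 199 % 59 = 22
    int hashVal = home;
    int i = 1;
    while (hashArray[hashVal] != null && hashArray[hashVal].getKey() != -1) {
        int offset = (i % 2 == 1 ? 1 : -1) * i * i; // +1, -4, +9, -16, +25, ...
        hashVal = (home + offset) % arraySize;
        if (hashVal < 0)
            hashVal += arraySize; // Java's % is negative when the left operand is negative
        i++;
    }
    hashArray[hashVal] = item;
}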
Hope this helps!
I am wondering: if we implement our own hash map that does not use power-of-two table lengths (for the initial capacity and whenever we resize), can we then use the object's hash code modulo the table size directly, instead of running the object's hash code through an extra hash function?
For example:
public V put(K key, V value) {
    if (key == null)
        return putForNullKey(value);

    // int hash = hash(key.hashCode()); original way
    // can we just use the key's hashCode if our table length is not power-of-two?
    int hash = key.hashCode();
    int i = indexFor(hash, table.length);
    ...
    ...
}
Presuming we're talking about OpenJDK 7, the additional hash is used to stimulate avalanching; it is a mixing function. It is needed because the mapping from a hash to a bucket is a mere bitwise &, since we're using a power of 2 for the capacity (a % b is equivalent to a & (b - 1) iff b is a power of 2). This means that only the lower bits matter, so applying this mixing step helps protect against poorer hashes.
static int hash(int h) {
    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}
If you want to use sizes that aren't powers of 2, the above may not be needed.
Actually changing the mapping from hashes to buckets (which normally relies on the capacity being a power of 2) will require you to look at indexFor:
static int indexFor(int h, int length) {
    return h & (length-1);
}
You could use (h & 0x7fffffff) % length here.
You can think of the mod function as a simple form of hash function. It maps a large range of data into a smaller space. Assuming the original hashCode is well designed, I see no reason why a mod cannot be used to map the hash code into the range of the table size you are using.
If your original hash function is not well implemented, e.g. it always returns an even number, you will create quite a lot of collisions using just a mod function as your hash function.
This is true; you can pick pseudo-prime numbers (as table sizes) instead.
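A quick illustration of that point (my own sketch): a deliberately bad hash that only produces even numbers fills only half the buckets of an even-sized table, while a prime-sized table still uses all of them.

import java.util.HashSet;
import java.util.Set;

public class EvenHashBuckets {
    public static void main(String[] args) {
        int[] badHashes = new int[1000];
        for (int i = 0; i < badHashes.length; i++)
            badHashes[i] = 2 * i; // always even

        System.out.println(countBuckets(badHashes, 16)); // 8: only the even buckets are hit
        System.out.println(countBuckets(badHashes, 17)); // 17: a prime size still uses every bucket
    }

    static int countBuckets(int[] hashes, int length) {
        Set<Integer> used = new HashSet<>();
        for (int h : hashes)
            used.add(h % length);
        return used.size();
    }
}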
Note: indexFor then needs to use %, compensating for the sign, instead of a simple &, which can actually make the lookup slower.
indexFor = (h & Integer.MAX_VALUE) % length
// or
indexFor = Math.abs(h % length)
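Note that the two expressions are not interchangeable for negative hashes, so whichever one you pick must be used consistently for both put and get; a small sketch of my own:

public class NonPowerOfTwoIndex {
    public static void main(String[] args) {
        int h = -7, length = 10;
        System.out.println((h & Integer.MAX_VALUE) % length); // 1
        System.out.println(Math.abs(h % length));             // 7
        // Both are valid indices in [0, length), but they differ for negative h.
    }
}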