How does HashMap determine where to put things? [duplicate] - java

This question already has answers here:
Internals of how the HashMap put() and get() methods work (basic logic only)
(3 answers)
Closed 9 years ago.
How does the put method of HashMap determine where a key goes? For example, if I put "S", "T", "A", "C", "K" into a HashMap of size 10, how does it determine where each letter goes?

The least significant bits of the object's hash code are used to select a bucket. Note that there is no such thing as a java.util.HashMap with 10 buckets; the table length must be a power of 2 so that the bits can be masked to choose a bucket. If you pass 10 to the constructor, you will get a HashMap with 16 buckets back.
So, reducing to 8 bits for clarity, if "S" returns hash code 123, Java will do
01111011 & 00001111 -> 00001011
and put "S" in bucket 11.
The real HashMap also applies a secondary hash function that shifts higher-order bits rightward and mixes them in, to make sure the least significant bits carry some entropy, so that keys have a good chance of being distributed evenly even if their hashCode implementation isn't that great.
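A minimal sketch of that bucket selection, assuming a 16-bucket table (this mirrors the idea, not the exact JDK code):

    // Illustrative sketch only, not the actual java.util.HashMap source.
    // The spread step resembles the secondary hash mentioned above.
    public class BucketDemo {
        // Mix the high bits into the low bits so weak hashCodes still spread out.
        static int spread(int h) {
            return h ^ (h >>> 16);
        }

        // tableLength is assumed to be a power of 2, e.g. 16.
        static int bucketIndex(Object key, int tableLength) {
            return spread(key.hashCode()) & (tableLength - 1);
        }

        public static void main(String[] args) {
            for (String s : new String[] {"S", "T", "A", "C", "K"}) {
                System.out.println(s + " -> bucket " + bucketIndex(s, 16));
            }
        }
    }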


How does Java Hashtable calculate where to place an element based on hashcode? [duplicate]

This question already has answers here:
How does a hash table work?
(17 answers)
Closed 2 years ago.
In Java, a Hashtable has a number of buckets equal to its capacity. Now how does it determine which bucket to store an object in? I know it uses the hash code of the object, but the hash code is a weird long string; what does Hashtable do to the hash code to determine the place of entry?
Implementation-dependent (as in, if you rely on it working this way, your code is broken; the things HashMap guarantees are spelled out in its javadoc, and none of what I'm about to type is in there):
Hashes are just a number, between about -2 billion and +2 billion. That 'long weird string' you see is just a more convenient way to show it to you.
First, the higher digits of that number are mixed into the lower digits (actually, the higher bits are XORed into the lower ones): 12340005 is turned into 12341239.
Then, that number is divided by how many buckets there currently are; the result is tossed out, because it's the remainder we are interested in. This remainder is necessarily 0 or higher and always less than the number of buckets, so it always points exactly at one of the buckets.
That's the bucket that the object goes into.
If the buckets grow too large, the table is resized.
For more, well, HashMap is open source, as is HashSet - just have a look.
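Before diving into the source, here is a rough, non-authoritative sketch of that recipe (the mixing step and bucket count are arbitrary, purely for illustration):

    // Rough sketch of the recipe described above; not the actual JDK code.
    static int bucketFor(Object key, int bucketCount) {
        int h = key.hashCode();                 // just a 32-bit number
        h ^= (h >>> 16);                        // mix the higher bits into the lower ones
        return (h & 0x7FFFFFFF) % bucketCount;  // keep only the non-negative remainder
    }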
For the behavior as of JDK 7, see:
https://github.com/openjdk-mirror/jdk7u-jdk/blob/master/src/share/classes/java/util/Hashtable.java#L358
int index = (hash & 0x7FFFFFFF) % tab.length;
This is a common technique for hash tables. The sign bit is cleared (to make the value non-negative), and the index is the remainder from division by the table size.
I know it uses the hash code of the object, but the hash code is a weird long string; what does Hashtable do to the hash code to determine the place of entry?
A hashcode is not a "weird long string". It is a 32-bit signed integer.
(I think you are confusing the hashcode with what you get when you call Object::toString ... which is a string consisting of a Java type name AND the hashcode in hexadecimal.)
So what HashMap and Hashtable (and HashSet and LinkedHashMap) actually do is (see the sketch below):
call hashCode() to get the 32 bit integer,
perform some implementation-specific mangling1 on the integer,
convert the mangled integer to a non-negative integer by removing the sign bit,
compute an array index (for the bucket) as value % array.length where array is the hash table's current array of hash chains (or trees).
1 - Some implementations of HashMap / Hashtable perform some simple, cheap bitwise mangling. The purpose is to reduce clustering in cases where the low few bits of the hashcode values are not evenly distributed.
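A hedged illustration of those points (the mangling shown is just one possibility, not necessarily what your JDK does):

    // Not copied from the JDK; it just walks through the four steps above.
    public class HashIndexDemo {
        public static void main(String[] args) {
            Object o = new Object();
            System.out.println(o.hashCode());   // a plain 32-bit signed integer
            System.out.println(o);              // the "weird string": type name @ hashcode in hex

            int tableLength = 16;               // hypothetical current table size
            int h = o.hashCode();               // step 1: get the raw hash
            h ^= (h >>> 16);                    // step 2: some cheap bitwise mangling
            int nonNegative = h & 0x7FFFFFFF;   // step 3: clear the sign bit
            System.out.println("bucket " + (nonNegative % tableLength)); // step 4: index
        }
    }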

Storing more objects in HashMap than range of int [duplicate]

This question already has answers here:
Theoretical limit for number of keys (objects) that can be stored in a HashMap?
(4 answers)
Closed 5 years ago.
I was reading about HashMap. hashCode() returns an int value. What if I have a huge HashMap that needs to store more objects than the int range allows? Assume that the hashCode() method returns a unique value for every object. What will happen in this case?
Is an exception thrown? Or
will it behave randomly?
You mean storing more than 2 billion entries? Java collections and maps can't do that; their size is always an int value.
There are 3rd-party libraries for huge maps.
Are you sure you can store that many objects in memory anyway? One object takes at least 24 bytes (you will be out of the range of compressed oops), so you will be using well beyond 100 gigabytes of RAM, and that is with very small objects stored in the HashMap.
PS: I don't understand what you mean by "hashCode returning a unique value". Hash codes don't have to be unique. For a 2+ billion entry hash map, a 32-bit hash code is a bit weak, but still theoretically possible.

Algorithm used for bucket lookup for hashcodes [duplicate]

This question already has answers here:
What hashing function does Java use to implement Hashtable class?
(6 answers)
Closed 8 years ago.
In most cases, HashSet has lookup complexity O(1). I understand that this is because objects are kept in buckets corresponding to the hash codes of the objects.
When a lookup is done, it goes directly to the bucket and finds the element (using equals() if many objects are present in the same bucket).
I have always wondered: how does it go directly to the required bucket? Which algorithm is used for bucket lookup? Doesn't that add to the total lookup time?
I have always wondered: how does it go directly to the required bucket?
The hash code is treated and used as an index into an array.
The index is determined by hash & (array.length - 1), because the length of the Java HashMap's internal array is always a power of 2. (This is a cheaper way to compute hash % array.length.)
Each "bucket" is actually a linked list (and now, possibly a tree) where entries with colliding hashes are grouped. If there are collisions, then a linear search through the bucket is performed.
Doesn't that add to the total lookup time?
It incurs the cost of a few loads from memory.
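To make the bucket walk concrete, here is a simplified, hypothetical sketch; the real java.util.HashMap additionally handles resizing, null keys, and treeified bins:

    // Simplified, hypothetical lookup: pick a bucket by masking, then scan its chain.
    class SimpleMap<K, V> {
        static class Node<K, V> {
            final int hash;
            final K key;
            V value;
            Node<K, V> next;
            Node(int hash, K key, V value, Node<K, V> next) {
                this.hash = hash; this.key = key; this.value = value; this.next = next;
            }
        }

        @SuppressWarnings("unchecked")
        private final Node<K, V>[] table = (Node<K, V>[]) new Node[16]; // power-of-two length

        V get(Object key) {
            int hash = key.hashCode();
            int index = hash & (table.length - 1);          // jump straight to the bucket
            for (Node<K, V> n = table[index]; n != null; n = n.next) {
                if (n.hash == hash && key.equals(n.key)) {  // linear search on collision
                    return n.value;
                }
            }
            return null;
        }
    }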
Often, the algorithm is simply
hash = hashFunction(key)
index = hash % arraySize
See the Wikipedia article on hash tables for details.
From memory: the HashSet is actually backed by a HashMap, and the basic lookup process is:
Get the key
Hash it (hashCode())
hashCode % the number of buckets
Go to that bucket and evaluate equals()
For a Set there would only be unique elements. I would suggest reading the source for HashSet and it should be able to answer your queries.
http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/util/HashMap.java#HashMap.containsKey%28java.lang.Object%29
Also note that the Java 8 code has been updated; this explanation covers the pre-Java 8 codebase. I have not examined the Java 8 implementation in detail, except to note that it is different.
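From memory again, the delegation looks roughly like this (a simplified sketch, not the actual java.util.HashSet source, though the real class uses the same trick of a shared dummy value):

    import java.util.HashMap;

    // Simplified sketch: a Set backed by a HashMap, so contains() reuses the
    // hash -> bucket -> equals() lookup described above.
    class SimpleHashSet<E> {
        private static final Object PRESENT = new Object();
        private final HashMap<E, Object> map = new HashMap<>();

        boolean add(E e)           { return map.put(e, PRESENT) == null; } // duplicates are ignored
        boolean contains(Object o) { return map.containsKey(o); }
    }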

How to improve the complexity of HashMap iteration?

I implemented a custom HashMap class (in C++, but that shouldn't matter). The implementation is simple:
A large array holds pointers to Items.
Each Item contains the key-value pair, and a pointer to another Item (to form a linked list in case of key collision).
I also implemented an iterator for it.
My implementation of incrementing/decrementing the iterator is not very efficient. From the current position, the iterator scans the array of buckets for the next non-null entry. This is very inefficient when the map is sparsely populated (which it would be for my use case).
Can anyone suggest a faster implementation, without affecting the complexity of other operations like insert and find? My primary use case is find, secondary is insert. Iteration is not even needed; I just want to know this for the sake of learning.
PS: Why did I implement a custom class? Because I need to find strings with some error tolerance, while the ready-made hash maps I have seen provide only exact matching.
EDIT: To clarify, I am talking about incrementing/decrementing an already obtained iterator. Yes, this is mostly done in order to traverse the whole map.
The errors in the strings (keys) in my case come from OCR errors, so I cannot use the error-handling techniques used to detect typing errors. The chance of the first character being wrong is almost the same as that of the last one.
Also, my keys are always strings, one word to be exact. The number of entries will be less than 5000, so a hash table size of 2^16 is enough for me. It will still be sparsely populated, but that's OK.
My hash function (a code sketch follows this list):
The hash code size is 16 bits.
The first 5 bits are for the word length. ==> Max possible key length = 32. Reasonable, given that the key is a single word.
The last 11 bits are for the sum of the char codes. I only store English alphabet characters and do not need case sensitivity, so 26 codes are enough, 0 to 25. A key of 32 'z's gives 25 * 32 = 800, which is well within 2^11. I even have scope to add case sensitivity, if needed in future.
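Here is a small sketch of that layout (shown in Java for consistency with the rest of this page, even though my implementation is in C++; the method name is made up):

    // Hypothetical 16-bit hash: top 5 bits hold (length - 1), low 11 bits hold the
    // sum of character codes with 'a' = 0 ... 'z' = 25.
    static int wordHash(String word) {
        int sum = 0;
        for (char c : word.toLowerCase().toCharArray()) {
            sum += c - 'a';                      // assumes English letters only
        }
        return ((word.length() - 1) << 11) | (sum & 0x7FF);
    }

    // wordHash("hello") == (4 << 11) | 47, matching the worked example below.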
Now when you compare a key containing an error with the correct one,
say "helo" with "hello":
1. The length of the keys is approximately the same.
2. The sum of their chars will differ by the sum of the dropped/added/distorted chars.
In the hash code, since the first 5 bits are for the length, the whole table has a fixed section for every possible key length. All sections are of the same size. The first section stores keys of length 1, the second keys of length 2, and so on.
Now 'hello' is stored in the 5th section, as its length is 5. Suppose the OCR output is 'helo' and we try to find the stored 'hello':
Hashcode of 'hello' = (length - 1) (sum of chars) = (4) (7 + 4 + 11 + 11 + 14) = (4) (47)
= (00100)(00000101111)
Similarly, the hashcode of 'helo' = (3)(36)
= (00011)(00000100100)
We jump to the bucket for 'helo', and don't find anything there.
So we check for ONE distorted character. This will not change the length, but it changes the sum of characters by at most -25 to +25. So we search from 25 places backwards to 25 places forwards, i.e., we check the sum part from (36 - 25) to (36 + 25) in the same section. We won't find it.
Next we check for an additional-character error. That would mean the correct string contains only 3 characters, so we go to the third section. The extra character would have increased the sum of chars by at most 25, which has to be compensated for, so we search the appropriate places in the third section, from (36 - 0) to (36 - 25). Again we don't find it.
Now we consider the case of a missing character. The original string would then contain 5 chars, and the sum-of-chars part of its hashcode would be larger by 0 to 25. So we search the corresponding buckets in the 5th section, (36 + 0) to (36 + 25). As 47 (the sum part of 'hello') lies in this range, we find a match of the hashcode, and we also know that this match is due to a missing character. So we compare the keys allowing for a tolerance of one missing character, and we get a match!
In reality, this has been implemented to allow more than one error in the key.
It can also be optimized to use only 25 places for the first section (since it has only one character) and so on.
Also, checking 25 places seems overkill, as we already know the largest and smallest char of the key. But it gets complex in case of multiple errors.
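For the sake of illustration, here is a hypothetical sketch of just the "one distorted character" probe described above (Java again for consistency; helper and parameter names are made up):

    import java.util.List;
    import java.util.Map;

    // Hypothetical probe for a single distorted character: same length section,
    // character sum shifted by at most +/-25 in either direction.
    class DistortedCharProbe {
        static int charSum(String w) {
            int s = 0;
            for (char c : w.toLowerCase().toCharArray()) s += c - 'a';
            return s;
        }

        static boolean differsByOneChar(String a, String b) {
            if (a.length() != b.length()) return false;
            int diffs = 0;
            for (int i = 0; i < a.length(); i++) {
                if (a.charAt(i) != b.charAt(i)) diffs++;
            }
            return diffs == 1;
        }

        // buckets maps a 16-bit hash to the stored words in that bucket.
        static boolean findWithOneDistortedChar(Map<Integer, List<String>> buckets, String query) {
            int length = query.length();
            int sum = charSum(query);
            for (int delta = -25; delta <= 25; delta++) {      // probe the neighbouring buckets
                int candidateSum = sum + delta;
                if (candidateSum < 0 || candidateSum > 0x7FF) continue;
                int hash = ((length - 1) << 11) | candidateSum;
                for (String stored : buckets.getOrDefault(hash, List.of())) {
                    if (differsByOneChar(stored, query)) {
                        return true;                           // tolerant match found
                    }
                }
            }
            return false;
        }
    }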
You mention an 'error tolerance' for the strings. Why not build the tolerance into the hash function itself and thus obviate the need for iteration?
You could go the way of Java's LinkedHashMap class. It adds efficient iteration to a hash map by also making it a doubly-linked list.
The entries are key-value pairs that have pointers to the previous and next entries. The hashmap itself has the large array as well as the head of the linked list.
Insertion/deletion are constant time for both data structures, searches are done via the hashmap, and iteration via the linked list.
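A minimal sketch of such an entry (illustrative only, not the real java.util.LinkedHashMap code):

    // Each entry lives in a hash bucket chain AND in a doubly-linked list, so
    // iteration can follow before/after links and skip empty buckets entirely.
    class LinkedEntry<K, V> {
        final K key;
        V value;
        LinkedEntry<K, V> nextInBucket;   // collision chain inside one bucket
        LinkedEntry<K, V> before, after;  // insertion-order links used for iteration

        LinkedEntry(K key, V value) {
            this.key = key;
            this.value = value;
        }
    }

The map keeps head and tail pointers for the linked list; iteration then walks the after links from the head, so its cost depends on the number of entries rather than the number of buckets.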

Can anybody explain how Java designed HashMap's hash() function? [duplicate]

This question already has answers here:
Explanation of HashMap#hash(int) method
(2 answers)
Closed 7 years ago.
After reading the JDK's source code, I find HashMap's hash() function quite interesting. Its source code looks like this:
static int hash(int h) {
    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}
The parameter h is the hashCode of the objects that are put into the HashMap. How does this method work, and why? Why can this method defend against poor hashCode functions?
Hashtable uses the 'classical' prime-number approach: to get the 'index' of a value, you take the hash of the key and compute the modulus against the size. Taking a prime number as the size gives (normally) a nice spread over the indexes (depending on the hash as well, of course).
HashMap uses a 'power of two' approach, meaning the sizes are a power of two. The reason is that it's supposed to be faster than prime-number calculations. However, since a power of two is not a prime number, there would be more collisions, especially with hash values having the same lower bits.
Why? The modulus against the size to get the (bucket/slot) index is simply calculated as hash & (size - 1) (which is exactly what HashMap uses to get the index!). That's basically the problem with the power-of-two approach: if the length is small, e.g. 16, the default size of HashMap, only the last bits are used, and hence hash values with the same lower bits will result in the same bucket index. In the case of 16, only the last 4 bits are used to calculate the index.
That's why an extra hash is calculated: it basically shifts the higher bit values down and XORs them with the lower bit values. Why the numbers 20, 12, 7 and 4 exactly, I don't really know. They used to be different (in Java 1.5 or so, the hash function was a little different). I suppose there's more advanced literature available. You might find more info about why they use the numbers they use in all kinds of algorithm-related literature, e.g.
http://en.wikipedia.org/wiki/The_Art_of_Computer_Programming
http://mitpress.mit.edu/books/introduction-algorithms
http://burtleburtle.net/bob/hash/evahash.html#lookup uses different algorithms depending on the length (which makes some sense).
http://www.javaspecialists.eu/archive/Issue054.html is probably interesting as well. Check the reaction of Joshua Bloch near the bottom of the article: "The replacement secondary hash function (which I developed with the aid of a computer) has strong statistical properties that pretty much guarantee good bucket distribution." So, if you ask me, the numbers come from some kind of analysis performed by Josh himself, probably assisted by who knows who.
So: a power of two gives a faster calculation, but it requires an additional hash calculation in order to get a nice spread over the slots/buckets.
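As a quick, non-authoritative demonstration, take two hash codes that share their low four bits; with a 16-bucket table they collide under plain masking, but the extra hash quoted above usually separates them:

    // Not JDK code; just shows the effect of the secondary hash on low-bit collisions.
    public class SpreadDemo {
        static int hash(int h) {                 // the JDK 6/7-era supplemental hash quoted above
            h ^= (h >>> 20) ^ (h >>> 12);
            return h ^ (h >>> 7) ^ (h >>> 4);
        }

        public static void main(String[] args) {
            int mask = 16 - 1;                   // default table size 16 -> low 4 bits
            int a = 0x10003, b = 0x20003;        // identical low bits, different high bits
            System.out.println((a & mask) + " vs " + (b & mask));             // 3 vs 3: collision
            System.out.println((hash(a) & mask) + " vs " + (hash(b) & mask)); // 2 vs 1: spread apart
        }
    }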
