Algorithm used for bucket lookup for hashcodes [duplicate] - java

This question already has answers here:
What hashing function does Java use to implement Hashtable class?
In most cases, HashSet has lookup complexity O(1). I understand that this is because objects are kept in buckets corresponding to hashcodes of the object.
When lookup is done, it directly goes to the bucket and finds (using equals if many objects are present in same bucket) the element.
I have always wondered: how does it go directly to the required bucket? Which algorithm is used for bucket lookup? Does that add nothing to the total lookup time?

I have always wondered: how does it go directly to the required bucket?
The hash code is used as an index into an array.
The index is determined by hash & (array.length - 1), because the length of the Java HashMap's internal array is always a power of 2. (This is a cheaper way of computing hash % array.length.)
Each "bucket" is actually a linked list (and, since Java 8, possibly a tree) where entries with colliding hashes are grouped. If there are collisions, a linear search through the bucket is performed.
Does that add nothing to the total lookup time?
It incurs the cost of a few loads from memory.
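As a minimal standalone sketch (the class and variable names are made up for illustration), you can check that for a power-of-two table length the bitwise mask picks exactly the same bucket as a proper remainder would, even for negative hash codes:

public class IndexDemo {
    public static void main(String[] args) {
        int tableLength = 16;                                  // power of two, like HashMap's internal array
        int[] sampleHashes = {42, 12345, -7, Integer.MIN_VALUE};
        for (int hash : sampleHashes) {
            int byMask = hash & (tableLength - 1);             // the cheap HashMap-style computation
            int byMod = Math.floorMod(hash, tableLength);      // the mathematical remainder, never negative
            System.out.printf("hash=%d -> bucket %d (mask) / %d (floorMod)%n", hash, byMask, byMod);
        }
    }
}

Note that the plain % operator would produce a negative index for a negative hash, which is why implementations either mask the sign bit or rely on the power-of-two trick above.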

Often, the algorithm is simply
hash = hashFunction(key)
index = hash % arraySize
See the Wikipedia article on hash tables for details.

From memory: the HashSet is actually backed by a HashMap, and the basic lookup process is:
Get the key
Hash it (hashCode())
Take hashCode % the number of buckets
Go to that bucket and evaluate equals()
For a Set there would only be unique elements. I would suggest reading the source for HashSet and it should be able to answer your queries.
http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/util/HashMap.java#HashMap.containsKey%28java.lang.Object%29
Also note that the code was updated in Java 8, so this explanation covers the pre-Java-8 codebase. I have not examined the Java 8 implementation in detail, except to establish that it is different.
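As a stripped-down sketch of that idea (this is not the real JDK source, just the shape of it), a set can be built by storing its elements as the keys of a private HashMap and sharing one dummy value:

import java.util.HashMap;

class MiniHashSet<E> {
    private static final Object PRESENT = new Object();           // shared dummy value
    private final HashMap<E, Object> map = new HashMap<>();

    boolean add(E e)      { return map.put(e, PRESENT) == null; } // placement uses the key's hashCode()/equals()
    boolean contains(E e) { return map.containsKey(e); }          // the bucket lookup happens inside the HashMap
    boolean remove(E e)   { return map.remove(e) == PRESENT; }
}

The uniqueness of set elements then falls out of the map's key uniqueness for free.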

Related

How does Java Hashtable calculate where to place an element based on hashcode? [duplicate]

This question already has answers here:
How does a hash table work?
In Java, Hashtable has buckets whose quantity is equal to its capacity. Now how does it determine that it has to store an object in a particular bucket? I know it uses hashcode of the object but hashcode is a weird long string, what does hashtable do to the hashcode to determine place of entry?
This is implementation-dependent (as in: if you rely on it working this way, your code is broken; the things HashMap guarantees are spelled out in its javadoc, and none of what follows is in there):
Hashes are just numbers, between about -2 billion and +2 billion (a 32-bit signed int). That "weird long string" you see is just a more convenient way to show it to you.
First, the higher digits of that number are mixed into the lower digits (actually, the higher bits are XORed into the lower ones): 12340005 is turned into 12341239.
Then that number is divided by the number of buckets there currently are; the quotient is tossed out, because it's the remainder we are interested in. The remainder is necessarily 0 or higher and always less than the number of buckets, so it always points at exactly one of the buckets.
That's the bucket that the object goes into.
If the table gets too full, a resize is done: a bigger table is allocated and everything is redistributed.
For more, well, HashMap is open source, as is HashSet - just have a look.
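For the curious, here is roughly the shape of what the Java 8 HashMap source does, paraphrased into a standalone demo (it is not a verbatim copy, and the class and method names are mine):

public class SpreadDemo {
    // Mix the high 16 bits of the hashCode into the low 16 bits, as described above.
    static int spread(Object key) {
        int h = key.hashCode();
        return h ^ (h >>> 16);
    }

    public static void main(String[] args) {
        int tableLength = 16;                                 // HashMap keeps this a power of two
        String key = "example";
        int index = spread(key) & (tableLength - 1);          // equivalent to a remainder here, always 0..15
        System.out.println("\"" + key + "\" lands in bucket " + index);
    }
}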
For behavior as of jdk7 see:
https://github.com/openjdk-mirror/jdk7u-jdk/blob/master/src/share/classes/java/util/Hashtable.java#L358
int index = (hash & 0x7FFFFFFF) % tab.length;
This is a common technique for hash tables. The sign bit is masked off (to make the value non-negative), and the index is the remainder from division by the table size.
I know it uses hashcode of the object but hashcode is a weird long string, what does hashtable do to the hashcode to determine place of entry?
A hashcode is not a "weird long string". It is a 32-bit signed integer.
(I think you are confusing the hashcode and what you get when you call Object::toString ... which is a string consisting of a hashcode AND a Java internal type name.)
So what HashMap and Hashtable (and HashSet and LinkedHashMap) actually do is:
call hashCode() to get the 32 bit integer,
perform some implementation-specific mangling [1] on the integer,
convert the mangled integer to a non-negative integer by removing the sign bit,
compute an array index (for the bucket) as value % array.length where array is the hash table's current array of hash chains (or trees).
[1] Some implementations of HashMap / Hashtable perform some simple / cheap bitwise mangling. The purpose is to reduce clustering in cases where the low-order bits of the hashcode values are not evenly distributed.

What makes hashmaps faster?

So I understand that hashmaps use buckets and hashcodes and whatnot. From my experience, Java hashcodes are usually not small but rather large numbers, so I assume they are not used as indexes directly. Unless the hashcode quality is poor, resulting in approximately equal bucket length and number of buckets, what makes hashmaps faster than a list of name->value pairs?
Hashmaps work by mapping elements to "buckets" using a hash function. When someone tries to insert an element, a hash code is calculated and a modulus operation is applied to it to get the bucket index in which the element should be inserted (that is why it doesn't matter how big the hashcode is). For example, if you have 4 buckets and your hashcode is 40, it will be inserted in bucket 0, because 40 mod 4 is 0.
When two elements are mapped to the same bucket, a "collision" occurs, and usually the element is stored in a list under the same bucket.
If you try to obtain an element, the key is hashed again using the hash function. If the bucket contains a list of elements, the equals() function is used to identify which element is the correct one (that is why you must implement equals() and hashCode() to insert a custom object into a hashmap).
So, if you search for an element and your hashmap does not have any lists in its buckets, you have an O(1) cost. The worst case would be when you have only 1 bucket and a list containing all the elements, in which case obtaining an element would be the same as searching a list: O(N).
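To make that concrete, here is a toy, deliberately simplified map (hypothetical code, nothing like the real HashMap internals) showing the bucket-array-plus-chaining idea described above:

import java.util.LinkedList;
import java.util.Objects;

class ToyMap<K, V> {
    private static class Entry<K, V> {
        final K key;
        V value;
        Entry(K key, V value) { this.key = key; this.value = value; }
    }

    @SuppressWarnings("unchecked")
    private final LinkedList<Entry<K, V>>[] buckets = new LinkedList[16];

    private int indexFor(K key) {
        return Math.floorMod(Objects.hashCode(key), buckets.length);     // the hashcode's size doesn't matter
    }

    void put(K key, V value) {
        int i = indexFor(key);
        if (buckets[i] == null) buckets[i] = new LinkedList<>();
        for (Entry<K, V> e : buckets[i]) {
            if (Objects.equals(e.key, key)) { e.value = value; return; } // same key: replace the value
        }
        buckets[i].add(new Entry<>(key, value));                         // collision: chain it in the bucket
    }

    V get(K key) {
        int i = indexFor(key);
        if (buckets[i] == null) return null;
        for (Entry<K, V> e : buckets[i]) {
            if (Objects.equals(e.key, key)) return e.value;              // equals() resolves collisions
        }
        return null;
    }
}

With a decent hashCode(), each get() only scans the handful of entries that happen to share one bucket, instead of every pair in the structure - which is the O(1)-on-average behaviour described above.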
I looked at the Java implementation and found that it does a bitwise AND that acts like a modulus, which makes a lot of sense for reducing the hash to the array size. This allows the O(1) access that makes HashMaps nice.

What is size of a hash-table bucket in java?

We know that more than one object with the same hash code can be stored in a single bucket of a hash table in Java. My question is:
What is the maximum number of objects a single bucket can store?
It's unlimited. Whatever has the same hashCode (after the mask) goes into the same position in the hash table. It's basically a linked list.
Obviously this can cause problems, as it could significantly affect performance, but with a reasonable distribution of items it hardly ever happens that there are more than one or two items in a single position.
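You can see the "unlimited" part (and the performance cost) with a deliberately bad key whose hashCode() is constant, so that every entry lands in the same bucket - a hypothetical sketch:

import java.util.HashMap;
import java.util.Map;

public class OneBucketDemo {
    static final class BadKey {
        final int id;
        BadKey(int id) { this.id = id; }
        @Override public int hashCode() { return 42; }              // constant hash: every key collides
        @Override public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).id == id;    // equals() still tells the keys apart
        }
    }

    public static void main(String[] args) {
        Map<BadKey, String> map = new HashMap<>();
        for (int i = 0; i < 10_000; i++) {
            map.put(new BadKey(i), "value-" + i);                    // all 10,000 entries share one bucket
        }
        System.out.println(map.get(new BadKey(1234)));               // prints value-1234, just slowly
    }
}

Everything still works because equals() distinguishes the keys within the bucket; only the speed suffers (and since Java 8, HashMap mitigates this by turning an overgrown bucket into a tree).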

How HashTable and HashMap key-value are stored in the memory?

I understand that a hashing technique is applied to a key to determine where its value is stored in memory.
But I don't understand how collisions happen here. Which hashing algorithm does Java use to pick the location? Is it MD5?
The basic idea of HashMap is this:
A HashMap is really an array of special objects that hold both Key and Value.
The array has some amount of buckets (slots), say 16.
The hashing algorithm is provided by the hashCode() method that every object has. Therefore, when you are writing a new Class, you should take care of proper hashCode() and equals() methods implementation. The default one (of the Object class) takes the memory pointer as a number. But that's not good for most of the classes we would like to use. For example, the String class uses an algorithm that makes the hash from all the characters in the string - think of it like this: hashCode = 1.char + 2.char + 3.char... (simplified). Therefore, two equal Strings, even though they are on a different location in the memory, have the same hashCode().
The result of hashCode(), say "132", is then the number of the bucket where the object should be stored, if we had an array that big. But we don't; ours is only 16 buckets long. So we use the obvious calculation 'hashcode % array.length = bucket', or '132 mod 16 = 4', and store the Key-Value pair in bucket number 4.
If there's no other pair yet, it's ok.
If there's one with a Key equal to the Key we have, we replace the old one.
If there's another Key-Value pair (collision), we chain the new one after the old one into a linked list.
If the backing array becomes too full, so that we would have to make too many linked lists, we make a new array with double the length, rehash all the elements and add them into the new array, then dispose of the old one. This is most likely the most expensive operation on a HashMap, so you want to tell your Maps how many buckets you'll use if you know it in advance (see the sketch after this list).
If somebody tries to get a value, he provides a key, we hash it, mod it, and then go through the potential linked list for the exact match.
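Since resizing/rehashing is the expensive part, a common idiom (sketched here assuming the default 0.75 load factor) is to pre-size the map when you know roughly how many entries it will hold:

import java.util.HashMap;
import java.util.Map;

public class PresizeDemo {
    public static void main(String[] args) {
        int expectedEntries = 1_000_000;

        // Grows from the default capacity, rehashing everything each time the table doubles.
        Map<Integer, String> resized = new HashMap<>();

        // Pre-sized so the table never has to grow: HashMap resizes once size exceeds
        // capacity * loadFactor (0.75 by default), hence the rule of thumb expected / 0.75 + 1.
        Map<Integer, String> presized = new HashMap<>((int) (expectedEntries / 0.75f) + 1);

        for (int i = 0; i < expectedEntries; i++) {
            resized.put(i, "v" + i);
            presized.put(i, "v" + i);
        }
        System.out.println(resized.size() + " / " + presized.size());  // same contents, fewer rehashes on the right
    }
}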
An image, courtesy of Wikipedia (a hash table of 256 buckets mapping names to telephone numbers):
In this case,
there is an array with 256 buckets (numbered, of course, 0-255)
there are five people. Their hashcodes, after being put through mod 256, point to four different slots in the array.
you can see Sandra Dee didn't have a free slot, so she has been chained after John Smith.
Now, if you tried to look up a telephone number of Sandra Dee, you would hash her name, mod it by 256, and look into the bucket 152. There you'd find John Smith. That's no Sandra, look further ... aha, there's Sandra chained after John.
Hash here does not refer to hashing techniques such as MD5. It is the hashCode() of the key that is used to pick the location where the entry for that key is stored.
Readings:
How does Java hashmap work?
How HashMap works in Java
This is a somewhat clearer explanation of how HashMap works.
By default, the hashCode() implementation in the Object class derives the hash from the object's identity (historically, its memory address); that hash is what Hashtable and HashMap use to place the key.
After going through @Slanec's answer, do see the javadoc from Java 8, as there are significant changes. For example, buckets are 'treeified': the linked list is converted to a balanced tree once a threshold number of entries per bucket (currently 8) is reached.

Hash : How does it work internally?

This might sound like a very vague question up front, but it is not. I have gone through the Hash Function description on wiki, but it is not very helpful to understand.
I am looking for simple answers to rather complex topics like hashing. Here are my questions:
What do we mean by hashing? How does it work internally?
What algorithm does it follow?
What is the difference between HashMap, HashTable and HashList?
What do we mean by 'constant time complexity', and why do different implementations of the hash give constant-time operations?
Lastly, why are Hash and LinkedList asked about in most interview questions? Is there any specific logic to it, in terms of testing an interviewee's knowledge?
I know my list of questions is long, but I would really appreciate some clear answers to them, as I really want to understand the topic.
Here is a good explanation of hashing. Say you want to store the string "Rachel": you apply a hash function to that string to get a memory location, e.g. myHashFunction(key: "Rachel", value: "Rachel") --> 10. The function may return 10 for the input "Rachel", so assuming you have an array of size 100, you store "Rachel" at index 10. If you want to retrieve that element, you just call myHashFunction("Rachel") again and it will return 10. Note that in this example the key is "Rachel" and the value is "Rachel", but you could use another value for that key, for example a birth date or an object. Your hash function may return the same memory location for two different inputs; in this case you have a collision, and if you are implementing your own hash table you have to take care of it, perhaps using a linked list or other techniques.
Here are some common hash functions used. A good hash function satisfies the property that each key is equally likely to hash to any of the n memory slots, independently of where any other key has hashed to. One such method is called the division method: we map a key k into one of n slots by taking the remainder of k divided by n, i.e. h(k) = k mod n. For example, if your array size is n = 100 and your key is the integer k = 115, then h(k) = 15.
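In Java (the helper names here are made up), the division method is just a remainder, with Math.floorMod used so that negative keys still map to a valid slot:

public class DivisionMethodDemo {
    // h(k) = k mod n, kept non-negative so it is always a valid slot index.
    static int h(int k, int n) {
        return Math.floorMod(k, n);
    }

    public static void main(String[] args) {
        int n = 100;                        // number of slots
        System.out.println(h(115, n));      // 15
        System.out.println(h(215, n));      // 15 -- a collision: both keys map to the same slot
    }
}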
Hashtable is synchronised and HashMap is not.
HashMap allows a null key (and null values) but Hashtable does not.
The purpose of a hash table is to have O(1), i.e. constant, time complexity for adding and getting elements. In a linked list of size N, if you want to get the last element you have to traverse the whole list, so the complexity is O(N). With a hash table, if you want to retrieve an element you just pass the key and the hash function tells you where the desired element is. If the hash function is well implemented, this takes constant time, O(1). This means you don't have to traverse all the elements stored in the hash table; you get the element "instantly".
Of course a programmer/developer/computer scientist needs to know about data structures and complexity =)
Hashing means generating a (hopefully) unique number that represents a value.
Different types of values (Integer, String, etc) use different algorithms to compute a hashcode.
HashMap and HashTable are maps; they are a collection of unique keys, each of which is associated with a value.
Java doesn't have a HashList class. A HashSet is a set of unique values.
Getting an item from a hashtable is constant-time with regard to the size of the table.
Computing a hash is not necessarily constant-time with regard to the value being hashed.
For example, computing the hash of a string involves iterating the string, and isn't constant-time with regard to the size of the string.
These are things that people ought to know.
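For instance, String's documented hashCode() has exactly this property: it is a short loop over every character, so computing the hash costs O(length of the string), even though the table lookup that follows does not depend on the table's size. A small sketch of the same formula:

public class StringHashDemo {
    // Same shape as java.lang.String's documented hashCode(): h = 31*h + ch for every character.
    static int hash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);              // visits each character once: O(length)
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(hash("Rachel"));
        System.out.println("Rachel".hashCode());   // matches, since the formula is the same
    }
}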
Hashing is transforming a given entity (in Java terms, an object) into some number (or sequence). The hash function is not reversible, i.e. you can't obtain the original object from the hash. Internally it is implemented (for java.lang.Object) by the JVM, typically based on something like the object's memory address.
The JVM address thing is an unimportant detail. Each class can override the hashCode() method with its own algorithm. Modern Java IDEs allow generating good hashCode() methods.
Hashtable and HashMap are essentially the same thing: they hold key-value pairs, where the keys are hashed. Hash lists and hash sets don't store values - only keys.
Constant-time means that no matter how many entries there are in the hashtable (or any other collection), the number of operations needed to find a given object by its key is constant. That is - 1, or close to 1
This is basic computer-science material, and everyone is supposed to be familiar with it. I think Google has said that the hash table is the most important data structure in computer science.
I'll try to give simple explanations of hashing and of its purpose.
First, consider a simple list. Each operation (insert, find, delete) on such a list has O(n) complexity, meaning that you have to scan the whole list (or half of it, on average) to perform the operation.
Hashing is a very simple and effective way of speeding this up: consider splitting the whole list into a set of small lists. Items in one such small list have something in common, and this something can be deduced from the key. For example, given a list of names, we could use the first letter to choose which small list to look in. By partitioning the data by the first letter of the key, we obtain a simple hash that splits the whole list into ~30 smaller lists, so that each operation takes O(n)/30 time.
However, the results are not that perfect. First, there are only 30 partitions, and we can't change that. Second, some letters are used more often than others, so the set for Y or Z will be much smaller than the set for A. For better results, it's better to find a way to partition the items into sets of roughly the same size. How could we solve that? This is where hash functions come in. A hash function is able to create an arbitrary number of partitions with roughly the same number of items in each. In our example with names, we could use something like
#include <string.h>

#define NUMBER_OF_PARTITIONS 30   /* number of buckets; defined here so the example is self-contained */

int hash(const char* str) {
    unsigned int rez = 0;                        /* unsigned, so overflow wraps instead of going negative */
    for (size_t i = 0, len = strlen(str); i < len; i++)
        rez = rez * 37 + (unsigned char) str[i]; /* mix each character into the running hash */
    return (int)(rez % NUMBER_OF_PARTITIONS);    /* map the hash onto one of the partitions */
}
This ensures a fairly even distribution and a configurable number of sets (also called buckets).
What do we mean by hashing? How does it work internally?
Hashing is the transformation of a string into a shorter fixed-length value or key that represents the original string. It is not indexing. The heart of hashing is the hash table: it contains an array of items. Hash tables compute an index from the data item's key and use this index to place the data into the array.
What algorithm does it follow?
In simple words, most hash algorithms work on the logic "index = f(key, arrayLength)".
Lastly, why are Hash and LinkedList asked about in most interview questions? Is there any specific logic to it, in terms of testing an interviewee's knowledge?
It's about how good you are at logical reasoning. The hash table is among the most important data structures, and every programmer should know it.
