Hash : How does it work internally? - java

This might sound like a very vague question up front, but it is not. I have gone through the Hash Function description on Wikipedia, but it is not very helpful for understanding.
I am looking for simple answers to rather complex topics like hashing. Here are my questions:
What do we mean by hashing? How does it work internally?
What algorithm does it follow?
What is the difference between HashMap, HashTable and HashList?
What do we mean by 'constant time complexity', and why do different implementations of a hash give constant-time operations?
Lastly, why are hashes and linked lists asked about in most interview questions? Is there any specific logic behind using them to test an interviewee's knowledge?
I know my question list is long, but I would really appreciate clear answers to these questions, as I really want to understand the topic.

Here is a good explanation of hashing. Say you want to store the string "Rachel". You apply a hash function to that string to get a memory location: myHashFunction(key: "Rachel", value: "Rachel") --> 10. The function may return 10 for the input "Rachel", so assuming you have an array of size 100, you store "Rachel" at index 10. If you want to retrieve that element, you just call GetmyHashFunction("Rachel") and it will return 10. Note that in this example the key is "Rachel" and the value is "Rachel", but you could use another value for that key, for example a birth date or an object. Your hash function may return the same memory location for two different inputs; in that case you have a collision. If you are implementing your own hash table, you have to take care of this, perhaps using a linked list or other techniques.
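A minimal sketch of that idea in Java (the class name, the table size of 100, and myHashFunction are made up for illustration; a real implementation would also have to handle collisions):

public class ToyHashExample {
    static String[] table = new String[100];

    // Hypothetical hash function: maps a key to an index in [0, 100).
    static int myHashFunction(String key) {
        return (key.hashCode() & 0x7fffffff) % table.length; // mask the sign bit, then mod
    }

    public static void main(String[] args) {
        table[myHashFunction("Rachel")] = "Rachel";          // store
        System.out.println(table[myHashFunction("Rachel")]); // retrieve -> Rachel
    }
}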
Here are some common hash functions used. A good hash function satisfies the property that each key is equally likely to hash to any of the n memory slots, independently of where any other key has hashed to. One of the methods is called the division method: we map a key k into one of n slots by taking the remainder of k divided by n, that is, h(k) = k mod n. For example, if your array size is n = 100 and your key is the integer k = 15, then h(k) = 15.
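The division method as a runnable Java sketch (the class name is just for illustration; Math.floorMod is used so that negative keys would still map into a valid slot):

public class DivisionMethod {
    public static void main(String[] args) {
        int n = 100; // number of slots
        int k = 15;  // integer key
        System.out.println(Math.floorMod(k, n)); // h(k) = 15
    }
}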
Hashtable is synchronised and HashMap is not.
HashMap allows one null key (and null values), but Hashtable does not.
The purpose of a hash table is to have O(1) constant time complexity for adding and getting elements. In a linked list of size N, if you want to get the last element you have to traverse the whole list until you reach it, so the complexity is O(N). With a hash table, if you want to retrieve an element you just pass the key, and the hash function tells you where to find it. If the hash function is well implemented, this takes constant time, O(1). That means you don't have to traverse all the elements stored in the hash table; you get the element "instantly".
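A small contrast sketch of the two lookups (class and key names are made up): the list lookup has to walk the nodes, while the map lookup hashes the key and jumps straight to a bucket.

import java.util.HashMap;
import java.util.LinkedList;

public class LookupContrast {
    public static void main(String[] args) {
        LinkedList<String> list = new LinkedList<>();
        HashMap<String, String> map = new HashMap<>();
        for (int i = 0; i < 100_000; i++) {
            list.add("item" + i);
            map.put("item" + i, "value" + i);
        }
        // O(N): contains() must walk the list node by node
        System.out.println(list.contains("item99999"));
        // O(1) on average: hash the key and jump to its bucket
        System.out.println(map.get("item99999"));
    }
}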
Of course a programmer/developer/computer scientist needs to know about data structures and complexity =)

Hashing means generating a (hopefully) unique number that represents a value.
Different types of values (Integer, String, etc) use different algorithms to compute a hashcode.
HashMap and HashTable are maps; they are a collection of unique keys, each of which is associated with a value.
Java doesn't have a HashList class. A HashSet is a set of unique values.
Getting an item from a hashtable is constant-time with regard to the size of the table.
Computing a hash is not necessarily constant-time with regard to the value being hashed.
For example, computing the hash of a string involves iterating the string, and isn't constant-time with regard to the size of the string.
These are things that people ought to know.

Hashing is transforming a given entity (in Java terms, an object) into some number (or sequence). The hash function is not reversible, i.e. you can't obtain the original object from the hash. Internally it is implemented (for java.lang.Object) by the JVM, using a value derived from a memory address.
The JVM address thing is an unimportant detail. Each class can override the hashCode() method with its own algorithm. Modern Java IDEs allow generating good hashCode methods.
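A typical IDE-generated pair of methods might look like the following (a sketch; the Person class and its fields are invented for illustration):

import java.util.Objects;

public class Person {
    private final String name;
    private final int age;

    public Person(String name, int age) {
        this.name = name;
        this.age = age;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Person)) return false;
        Person p = (Person) o;
        return age == p.age && Objects.equals(name, p.name);
    }

    @Override
    public int hashCode() {
        return Objects.hash(name, age); // 31-based combination of the same fields equals() uses
    }
}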
Hashtable and HashMap are essentially the same thing: they store key-value pairs, where the keys are hashed. Hash lists and hash sets don't store values, only keys.
Constant-time means that no matter how many entries there are in the hashtable (or any other collection), the number of operations needed to find a given object by its key is constant: 1, or close to 1.
This is basic computer-science material, and everyone is supposed to be familiar with it. I think Google has called the hash table the most important data structure in computer science.

I'll try to give simple explanations of hashing and of its purpose.
First, consider a simple list. Each operation (insert, find, delete) on such a list has O(n) complexity, meaning that you have to scan the whole list (or half of it, on average) to perform the operation.
Hashing is a very simple and effective way of speeding this up. Consider splitting the whole list into a set of small lists, where the items in one small list have something in common that can be deduced from the key. For example, with a list of names, we could use the first letter to choose which small list to look in. In this way, by partitioning the data by the first letter of the key, we obtain a simple hash that splits the whole list into ~30 smaller lists, so that each operation takes O(n)/30 time.
However, the results are not that perfect. First, there are only ~30 partitions, and we can't change that. Second, some letters are used more often than others, so the set for Y or Z will be much smaller than the set for A. For better results, we want a way to partition the items into sets of roughly the same size. How could we solve that? This is where hash functions come in: a hash function can create an arbitrary number of partitions with roughly the same number of items in each. In our example with names, we could use something like
#include <stddef.h>

#define NUMBER_OF_PARTITIONS 30  /* configurable bucket count */

unsigned int hash(const char* str) {
    unsigned int rez = 0;  /* unsigned: overflow wraps instead of producing a negative index */
    for (size_t i = 0; str[i] != '\0'; i++)
        rez = rez * 37 + (unsigned char) str[i];
    return rez % NUMBER_OF_PARTITIONS;  /* map the raw hash into a bucket index */
}
This ensures a reasonably even distribution and a configurable number of sets (also called buckets).

What do we mean by Hashing, how does it work internally?
Hashing is the transformation of a string into a shorter fixed-length value or key that represents the original string. It is not indexing. The heart of hashing is the hash table, which contains an array of items. A hash table computes an index from a data item's key and uses this index to place the data into the array.
What algorithm does it follow?
In simple words, most hash algorithms work on the logic "index = f(key, arrayLength)".
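A sketch of that step in Java (the class and method names are hypothetical):

public class IndexFor {
    // Reduce an arbitrary hash code to a valid array index.
    static int indexFor(Object key, int arrayLength) {
        return (key.hashCode() & 0x7fffffff) % arrayLength; // mask the sign bit, then mod
    }

    public static void main(String[] args) {
        System.out.println(indexFor("Rachel", 16)); // always in [0, 16)
    }
}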
Lastly, why in most interview questions are Hash and LinkedList asked? Is there any specific logic to it for testing an interviewee's knowledge?
It's about how good you are at logical reasoning. The hash table is one of the most important data structures, and every programmer should know it.

Related

Is HashTable/HashMap an array?

I am confused about hashing:
When we use Hashtable/HashMap (key, value), I understand that the internal data structure is an array (already allocated in memory).
Java's hashCode() method has an int return type, so I thought this hash value would be used directly as an index into the array, but in that case we would need 2^32 entries in the array in RAM, which is not what actually happens.
So does Java create an index from hashCode() with a smaller range?
Answer:
As the guys pointed out below and from the documentation: http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/util/HashMap.java
HashMap is backed by an array. The hashCode() is rehashed but is still an integer, and the index into the array becomes h & (length-1); so if the length of the array is 2^n, the index takes the low n bits of the re-hashed value.
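Condensed from the JDK 6 source linked above into a runnable sketch (the demo class and main method are added for illustration):

public class Jdk6IndexDemo {
    // Supplemental hash, paraphrased from the JDK 6 HashMap source:
    // it spreads the bits of the original hashCode to reduce collisions.
    static int hash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    // length is always a power of two, so h & (length - 1) keeps exactly
    // the low n bits, equivalent to h % length for non-negative h.
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    public static void main(String[] args) {
        int h = hash("Rachel".hashCode());
        System.out.println(indexFor(h, 16)); // bucket index in a 16-slot table
    }
}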
The structure for a Java HashMap is not just an array. It is an array, but not of 2^31 entries (int is a signed type!), but of some smaller number of buckets, by default 16 initially. The Javadocs for HashMap explain that.
When the number of entries exceeds a certain fraction (the "load factor") of the capacity, the array grows to a larger size.
Each element of the array does not hold just one entry. Each element of the array holds a structure of entries: a linked list that, in current Java versions, is converted to a red-black tree when it grows large. Every entry in a given structure has a hash code that maps internally to the same bucket position in the array.
Have you read the docs on this type?
http://docs.oracle.com/javase/8/docs/api/java/util/HashMap.html
You really should.
Generally the base data structure will indeed be an array.
The methods that need to find an entry (or empty gap in the case of adding a new object) will reduce the hash code to something that fits the size of the array (generally by modulo), and use this as an index into that array.
Of course this makes collisions more likely, since many objects could have hash codes that reduce to the same index (this is possible anyway, since multiple objects might have exactly the same hash code, but now it is much more likely). There are different strategies for dealing with this: generally either a linked-list-like structure per slot, or a mechanism for picking another slot if the first matching slot is occupied by a non-equal key.
Since this adds cost, the more often such collisions happen, the slower things become, and in the worst case lookup would in fact be O(n) (and slow as O(n) goes, too).
Increasing the size of the internal store will generally improve this, especially if the new size is not a multiple of the previous size (so the operation that reduces the hash code to an index won't send a group of items that collided on one index all to the same index again). Some implementations increase the internal size before it is absolutely necessary (while some empty space remains), triggered by certain conditions (a certain fill percentage, a certain number of collisions between objects that don't share the same full hash code, etc.).
This means that unless the hash codes are very bad (most obviously, if they are in fact all exactly the same), the order of operation stays at O(1).

Inevitable Collisions When Hashing?

If I create a new Map:
Map<Integer,String> map = new HashMap<Integer,String>();
If I then call map.put() a bunch of times, each with a unique key, say a million times, will there ever be a collision, or does Java's hashing algorithm guarantee no collisions if the keys are unique?
Hashing does not guarantee that there will be no collisions if the key is unique. In fact, the only thing that's required is that objects that are equal have the same hashcode. The number of collisions determines how efficient retrieval will be (fewer collisions, closer to O(1), more collisions, closer to O(n)).
What an object's hashcode will be depends on what type it is. For instance, a string's default hashcode is
s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
which necessarily collapses the string down to a single number; it is definitely possible for two different strings to have the same hashcode, though it will be pretty rare.
If two things hash to the same thing, hashmap uses .equals to determine whether a particular key matches. That's why it's so important that you override both hashCode() and equals() together and ensure that things that are equal have the same hash code.
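A quick demonstration using the well-known pair "Aa" and "BB", which have identical hash codes in Java:

public class CollisionDemo {
    public static void main(String[] args) {
        System.out.println("Aa".hashCode());   // 2112
        System.out.println("BB".hashCode());   // 2112 -- same hash, different string
        System.out.println("Aa".equals("BB")); // false -> HashMap must call equals()
    }
}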
Hashtable works somewhat as follows:
A hashmap is created with an initial capacity (or number of buckets)
Each time you add an object to it, Java invokes the key's hash function, which yields a number, and takes it modulo the current size of the hashtable
The object is stored in the bucket with the result from step 2.
So even if you have unique keys, they can still collide, unless you have as many buckets as the range of your key's hash.
There are two things you need to know:
Even if there is a collision, it is not going to cause a problem, because each bucket holds a list. If you put into a bucket that already has a value inside, the new entry is simply appended to the list. On retrieval, the map first finds out which bucket to look in, and then goes through each value in that bucket's list to find the one that is equal (by calling equals()).
If you are putting millions of values into the HashMap, you may wonder whether every linked list in the map will then contain thousands of values, so that we are always doing a big, slow linear search. You need to know that Java's HashMap is resized whenever the number of entries grows larger than a certain threshold (have a look at capacity and loadFactor in the Javadoc). With a properly implemented hash code, the number of entries in each bucket stays small.
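A toy separate-chaining table illustrating both points (a simplified sketch, not the real HashMap implementation; names are invented):

import java.util.LinkedList;

public class ChainedTable<K, V> {
    private static class Entry<K, V> {
        final K key;
        V value;
        Entry(K key, V value) { this.key = key; this.value = value; }
    }

    @SuppressWarnings("unchecked")
    private final LinkedList<Entry<K, V>>[] buckets = new LinkedList[16];

    private int bucketFor(K key) {
        return (key.hashCode() & 0x7fffffff) % buckets.length;
    }

    public void put(K key, V value) {
        int i = bucketFor(key);
        if (buckets[i] == null) buckets[i] = new LinkedList<>();
        for (Entry<K, V> e : buckets[i]) {
            if (e.key.equals(key)) { e.value = value; return; } // same key: replace
        }
        buckets[i].add(new Entry<>(key, value)); // collision or empty: append to the chain
    }

    public V get(K key) {
        int i = bucketFor(key);
        if (buckets[i] == null) return null;
        for (Entry<K, V> e : buckets[i]) {
            if (e.key.equals(key)) return e.value; // equals() picks the right entry
        }
        return null;
    }
}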

Java Key-Value Collection with complexity of O(1) for millions of random unordered keys

I am stuck with a problem where I have millions of key-value pairs that I need to access using the keys randomly (not by using an iterator).
The range of keys is not known at compile time, but the total number of key-value pairs is known.
I have looked into the HashMap and HashSet data structures, but they are not truly O(1): in case of collisions in the hash code they become arrays of linked lists, which have linear search complexity in the worst case.
I have also considered increasing the number of buckets in the HashMap, but that does not ensure that every element will be stored in a separate bucket.
Is there any way to store and access millions of key-value pairs with O(1) complexity?
Ideally, I would like every key to be like a variable, and the corresponding value should be the value assigned to that key.
Thanks in advance.
I think you are confusing what Big O notation represents. It defines the limiting behavior of a function, not necessarily the actual behavior.
The average complexity of a hash map is O(1) for insert, delete, and search operations. What does this mean? It means that, on average, those operations complete in constant time regardless of the size of the hash map. So, depending on the implementation of the map, a lookup might not take exactly one step, but it will most likely not involve more than a few steps, relative to the hash map's size.
How well a hash map actually behaves for those operations is determined by a few factors. The most obvious is the hash function used to bucket keys. Hash functions that distribute the computed hashes more uniformly over the hash range and limit the number of collisions are preferred. The better the hash function in those areas, the closer a hash map will actually operate in constant time.
Another factor that affects actual hash map behavior is how storage is managed. How a map resizes and repositions entries as items are added and removed helps control hash collisions by maintaining an optimal number of buckets. Managing the hash map storage effectively will allow the hash map to operate close to constant time.
With all that said, there are ways to construct hash maps that have O(1) worst-case behavior for lookups. This is accomplished using a perfect hash function: a collision-free, one-to-one mapping between keys and hashes. With a perfect hash function and the proper hash map storage, O(1) lookups can be achieved. The prerequisite for this approach is knowing all the key values in advance, so that a perfect hash function can be developed.
Sadly, your case does not involve known keys, so a perfect hash function cannot be constructed; but the available research might help you construct a near-perfect hash function for your case.
No, there isn't such a (known) data structure for generic data types.
If there were, it would most likely have replaced hash tables in most commonly-used libraries, unless there's some significant disadvantage like a massive constant factor or ridiculous memory usage, either of which would probably make it nonviable for you as well.
I said "generic data types" above, as there may be some specific special cases for which it's possible, such as when the key is a integer in a small range - in this case you could just have an array where each index corresponds to the same key, but this is also really a hash table where the key hashes to itself.
Note that you need a terrible hash function, the pathological input for your hash function, or a very undersized hash table to actually get the worst-case O(n) performance for your hash table. You really should test it and see if it's fast enough before you go in search of something else. You could also try TreeMap, which, with its O(log n) operations, will sometimes outperform HashMap.

How to get the optimal number to use in hashtable?

As the question states: how do you calculate the optimal number to use, and how do you motivate the choice?
If we are going to build a hashtable which uses the following hash function:
h(k) = k mod m, k = key
So some sources tell me:
to use the number of elements to be inserted as the value of m
to use a prime close to m
that Java simply uses 31 as its value of m
And some people tell me to use a prime close to 2^n as m.
I'm so confused at this point that I don't know what value to use for m. For instance, if we use the table size for m, then what happens if we want to expand the table size? Will I then have to rehash all the values with the new value of m? If so, why does Java simply use 31 as its prime value for m?
I've also heard that the table size should be two times bigger than the total number of elements in the hashtable, and that this applies each time it rehashes. But how come we use, for instance, m = 10 for a table of 10 elements, when it should be m = 20 to create that extra empty space?
Can someone please help me understand how to calculate the value of m based on different scenarios, such as a static hashtable (where we know we will only insert, say, 10 elements) or a dynamic one (rehash after a certain limit)?
Let's illustrate my problem with the following examples:
I have the values {1,2,...,n}.
Question: What would be an optimal value of m if I must use division by mod in my hash function?
Scenario 1: n = 100?
Scenario 2: n = 5043?
Additional question:
Would the m value of the hash function be different if we used an open or a closed hashtable?
Note that I don't need to understand hashtables for Java specifically, but hashtables in general, where I must use a division (mod) hash function.
Thank you for your time!
You have several issues here:
1) What should m equal?
2) How much free space should you have in your hash table?
3) Should you make the size of your table be a prime number?
1) As was mentioned in the comments, the h(k) you describe isn't the hash function, it gives you the index into your hash table. The idea is that every object produces some hash code, which is a positive integer. You use the hash code to figure out where to put the object in the hash table (so that you can find it again later). You clearly don't want a hash table of size MAX_INT, so you choose some size m. Then for any object, you take its hash code, compute k % m, and now you have an integer in the interval [0, m-1], which is a valid index into your hash table.
2) Because a hash table works by using a hash code to find the place in a table where an object should go, you get into trouble if multiple items are assigned to the same location. This is called a collision. Every hash table implementation must deal with collisions, either by putting items into nearby spots or by keeping a linked list of items in each location. No matter the solution, more collisions mean lower performance for your hash table. For that reason, it is recommended that you not let your hash table fill up; otherwise, collisions become more likely. Keeping your hash table at least twice as large as the number of items is a common recommendation for reducing the probability of collisions. Obviously, this means you will have to resize your table as it fills up. Yes, you then have to rehash each item, since it will go into a different location when you take the modulus by a different value. That is the hidden cost of a hash table: it runs in constant time (assuming few or no collisions), but that constant can be large (amortized resizing, rehashing, etc.).
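A tiny demonstration of why resizing forces a rehash: the same key maps to a different slot once m changes (the key and sizes are chosen arbitrarily):

public class RehashDemo {
    public static void main(String[] args) {
        int k = 12345;
        System.out.println(k % 16); // 9  -> bucket in the old, 16-slot table
        System.out.println(k % 32); // 25 -> different bucket in the grown, 32-slot table
    }
}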
3) It is also often recommended that you make the size of your hash table be a prime number. This is because it tends to produce a better distribution of items in your hash table in certain common use cases, thus avoiding collisions. Rather than giving a complete explanation here, I will refer you to this excellent answer: Why should hash functions use a prime number modulus?

How HashTable and HashMap key-value are stored in the memory?

I understand that a hashing technique is applied to a key to store its value at a memory address.
But I don't understand how collisions happen here. Which hashing algorithm does Java use to map to a memory location? Is it MD5?
The basic idea of HashMap is this:
A HashMap is really an array of special objects that hold both Key and Value.
The array has some amount of buckets (slots), say 16.
The hashing algorithm is provided by the hashCode() method that every object has. Therefore, when you are writing a new class, you should take care to implement proper hashCode() and equals() methods. The default one (from the Object class) uses the memory pointer as a number, but that's not good for most of the classes we would like to use. For example, the String class uses an algorithm that computes the hash from all the characters in the string; think of it like hashCode = 1st char + 2nd char + 3rd char ... (simplified). Therefore, two equal Strings, even though they live at different locations in memory, have the same hashCode().
The result of hashCode(), say 132, is then the number of the bucket where the object should be stored, if we had an array that big. But we don't; ours is only 16 buckets long. So we use the obvious calculation 'hashcode % array.length = bucket', or '132 mod 16 = 4', and store the key-value pair in bucket number 4.
If there's no other pair yet, it's ok.
If there's one with Key equal to the Key we have, we remove the old one.
If there's another Key-Value pair (collision), we chain the new one after the old one into a linked list.
If the backing array becomes too full, so that we would have to make too many long linked lists, we make a new array with double the length, rehash all the elements and add them to the new array, then dispose of the old one. This is most likely the most expensive operation on a HashMap, so you want to tell your Maps how many buckets you'll use if you know it in advance (see the sketch after this walkthrough).
If somebody tries to get a value, he provides a key, we hash it, mod it, and then go through the potential linked list for the exact match.
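For example, pre-sizing a HashMap as suggested above (the capacity formula assumes the default 0.75 load factor; the class name and sizes are illustrative):

import java.util.HashMap;
import java.util.Map;

public class PresizedMap {
    public static void main(String[] args) {
        int expectedEntries = 1_000_000;
        // Capacity chosen so expectedEntries stays under the 0.75 load factor,
        // avoiding the expensive resize-and-rehash passes described above.
        Map<Integer, String> map = new HashMap<>((int) (expectedEntries / 0.75f) + 1);
        for (int i = 0; i < expectedEntries; i++) {
            map.put(i, "value" + i);
        }
        System.out.println(map.size()); // 1000000
    }
}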
An image, courtesy of Wikipedia (the classic phone-book hash table illustration):
In this case,
there is an array with 256 buckets (numbered, of course, 0-255)
there are five people. Their hashcodes, after being put through mod 256, point to four different slots in the array.
you can see Sandra Dee didn't have a free slot, so she has been chained after John Smith.
Now, if you tried to look up a telephone number of Sandra Dee, you would hash her name, mod it by 256, and look into the bucket 152. There you'd find John Smith. That's no Sandra, look further ... aha, there's Sandra chained after John.
Here, 'hash' does not refer to cryptographic hashing techniques such as MD5. It is the hashCode of the object, used to pick the memory location (bucket) where an object is stored for a particular key.
Readings:
How does Java hashmap work?
How HashMap works in Java
This is a somewhat clearer explanation of how HashMap works.
By default, the hashCode() implementation in the Object class returns a value derived from the memory address as the hash, which is then used to place keys in Hashtable and HashMap.
After going through @Slanec's answer, do see the Javadoc from Java 8, as there are significant changes; for example, 'treeification', where a bucket's linked list is converted to a red-black tree once the number of entries in that bucket reaches a threshold (currently 8).
