How to retrieve values after a hash collision - java

I have read in many places that after a hash collision Java internally uses a linked list or a tree, depending on the number of collisions. That much is clear.
But how do I retrieve the expected value using the key?

It just iterates the linked list stored in that bucket and checks the elements using equals which has no collisions.
The running time for that is linear, but only in the number of items stored in that specific bucket, so it is fine as long as the buckets are kept reasonably balanced.
So the implementation will make sure that a get operation, even if it has collisions, gives back the correct result in the end.
Note that Java's HashSet and HashMap are not pure chained hash tables. Once a bucket grows past a certain threshold, they switch that bucket internally from a linked list to a red-black tree.
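A minimal sketch of that lookup with the real java.util.HashMap. The strings "Aa" and "BB" are used because they have the same String.hashCode() value (2112), so they are guaranteed to collide. The class and method names are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

final class CollisionLookupDemo {
    // Builds a map whose two keys share a hash code but are not equal.
    static Map<String, String> build() {
        Map<String, String> map = new HashMap<>();
        map.put("Aa", "first");   // "Aa".hashCode() == 2112
        map.put("BB", "second");  // "BB".hashCode() == 2112 as well
        return map;
    }

    public static void main(String[] args) {
        Map<String, String> map = build();
        System.out.println("Aa".hashCode() == "BB".hashCode()); // true
        // get() locates the bucket from the hash, then uses equals() to pick the entry.
        System.out.println(map.get("Aa")); // first
        System.out.println(map.get("BB")); // second
    }
}
```

Both entries live in the same bucket, yet each key still returns its own value because the final comparison is done with equals(), not with the hash.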

Related

What does "assuming the hash function disperses the elements properly among the buckets" mean at all?

Having started learning Java, I came across this statement in the docs of Java 8:
assuming the hash function disperses the elements properly among the buckets.
Does that simply mean that the order you get, after assigning, will be a mess?
It means that HashMap maintains an array of buckets under the hood. The hash code produced by the key's hashCode() method determines which bucket an entry goes into.
A situation in which multiple keys yield hashes that map to the same bucket is called a collision.
Entries of the map that are mapped to the same bucket are structured as a linked list. Starting with Java 8, when the number of entries in a bucket grows past a certain threshold, the list is transformed into a tree.
As you probably know, the cost of accessing an element at a given index of an array is O(1). HashMap provides access to values by key with an average time complexity of O(1), but only if the number of collisions is negligible, i.e. hashCode() is implemented in such a way that it spreads the keys relatively evenly between the buckets.
In the edge case where the hash function is badly implemented and, let's say, returns the same hash for every key, all the entries end up in the same bucket. The time complexity of methods like get() and containsKey() then degrades to O(n) (with Java 7 and earlier), because you have to iterate over the list of all entries in order to find a particular one. From Java 8 onwards the time complexity will be O(log n), because that is the worst-case time required to find an element in a red-black tree.
Does that simply mean that the order you get, after assigning, will be a mess?
The order of elements in a HashMap is undefined. This class is useful when you need quick access and don't care about the order. If you need an ordered map, consider LinkedHashMap, which tracks the order in which entries were added by maintaining a linked list, or TreeMap, which keeps keys sorted according to their natural order or a given comparator.
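A short sketch of the difference, using only standard java.util classes. The demo class, the method name, and the sample keys are invented for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class OrderedMapsDemo {
    // Inserts the same three keys into the given map and returns them in iteration order.
    static List<String> keysAfterInsert(Map<String, Integer> map) {
        map.put("pear", 1);
        map.put("apple", 2);
        map.put("mango", 3);
        return new ArrayList<>(map.keySet());
    }

    public static void main(String[] args) {
        // LinkedHashMap iterates in insertion order.
        System.out.println(keysAfterInsert(new LinkedHashMap<>())); // [pear, apple, mango]
        // TreeMap iterates in the natural (alphabetical) order of the keys.
        System.out.println(keysAfterInsert(new TreeMap<>()));       // [apple, mango, pear]
    }
}
```

With a plain HashMap the same call would return the keys in whatever order the buckets happen to produce, which is not something code should rely on.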
A hash map contains a number of "buckets". For best performance, you want the number of entries to be more or less the same in each bucket. The bucket is determined by the hash function; thus you want a hash function that results in more or less the same probability of hitting each bucket. That is, "the hash function disperses the elements properly among the buckets".
At the other extreme: a hash function that always returned, say, the value 3 would work, but map access wouldn't be very efficient, since one bucket would have all the entries.
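To make the two extremes concrete, here is a rough sketch that counts how many keys land in each bucket for a given hash function. It is a simplified model, not HashMap's actual code: the bucket is taken as the hash modulo the bucket count, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.function.IntUnaryOperator;

final class BucketSpreadDemo {
    // Counts how many of the keys 0..keyCount-1 fall into each bucket for the given hash.
    static int[] occupancy(IntUnaryOperator hash, int keyCount, int buckets) {
        int[] counts = new int[buckets];
        for (int key = 0; key < keyCount; key++) {
            // floorMod keeps the index non-negative even for negative hashes.
            counts[Math.floorMod(hash.applyAsInt(key), buckets)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // A hash that always returns 3: every key ends up in bucket 3.
        System.out.println(Arrays.toString(occupancy(k -> 3, 1000, 8)));
        // prints [0, 0, 0, 1000, 0, 0, 0, 0]

        // A hash that spreads the keys: every bucket gets the same share.
        System.out.println(Arrays.toString(occupancy(k -> k, 1000, 8)));
        // prints [125, 125, 125, 125, 125, 125, 125, 125]
    }
}
```

In the first case a lookup has to search through all 1000 entries of one bucket; in the second it searches through about 125, and far fewer once the number of buckets grows with the map.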
I don't understand what you mean by the order being a "mess". A hash map is not ordered; the location of an element depends on its hash code.

What is size of a hash-table bucket in java?

We know that more than one object with the same hash code can be stored in a single bucket of a hash table in Java. My question is:
What is maximum number of objects a single bucket can store?
It's unlimited. Whatever has the same hashCode (after the mask is applied) goes into the same position in the hash table. Each position is basically a linked list.
Obviously this may cause problems, as it can significantly affect performance, but with a reasonable distribution of items it rarely happens that there are more than one or two items in a single position.

Inevitable Collisions When Hashing?

If I create a new Map:
Map<Integer,String> map = new HashMap<Integer,String>();
If I then call map.put() a bunch of times, say a million, each time with a unique key, will there ever be a collision, or does Java's hashing algorithm guarantee no collisions if the keys are unique?
Hashing does not guarantee that there will be no collisions if the key is unique. In fact, the only thing that's required is that objects that are equal have the same hashcode. The number of collisions determines how efficient retrieval will be (fewer collisions, closer to O(1), more collisions, closer to O(n)).
What an object's hashcode will be depends on what type it is. For instance, a string's default hashcode is
s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
which necessarily reduces the whole string to a single 32-bit number, so it is definitely possible for two different strings to reach the same hashcode, though it will be pretty rare.
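As a sanity check, the formula can be evaluated by hand and compared with what String.hashCode() returns. This is only a sketch of the documented formula (using Horner's rule), not a copy of the JDK source, and the class and method names are invented.

```java
final class StringHashSketch {
    // Evaluates s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1] with Horner's rule.
    // The arithmetic is done in int, so overflow simply wraps around.
    static int polyHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(polyHash("hello") == "hello".hashCode()); // true
        System.out.println(polyHash("a")); // 97, the code of 'a'
    }
}
```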
If two things hash to the same thing, hashmap uses .equals to determine whether a particular key matches. That's why it's so important that you override both hashCode() and equals() together and ensure that things that are equal have the same hash code.
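A minimal sketch of a key class that overrides both methods consistently. GridPoint is a hypothetical class made up for this example; Objects.hash is the standard helper from java.util.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

final class GridPoint {
    private final int x;
    private final int y;

    GridPoint(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof GridPoint)) return false;
        GridPoint p = (GridPoint) other;
        return x == p.x && y == p.y;
    }

    @Override
    public int hashCode() {
        // Built from the same fields as equals(), so equal points get equal hash codes.
        return Objects.hash(x, y);
    }

    public static void main(String[] args) {
        Map<GridPoint, String> labels = new HashMap<>();
        labels.put(new GridPoint(1, 2), "home");
        // A different but equal instance finds the entry.
        System.out.println(labels.get(new GridPoint(1, 2))); // home
    }
}
```

If hashCode() were left at the default from Object, the second GridPoint(1, 2) would most likely be sent to a different bucket and the lookup would return null even though equals() says the keys match.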
Hashtable works somewhat as follows:
1. A hashmap is created with an initial capacity (the number of buckets).
2. Each time you add an object to it, Java invokes the hash function of the key, which yields a number, and then reduces that number modulo the current size of the hashtable.
3. The object is stored in the bucket given by the result of step 2.
So even if you have unique keys, they can still collide unless you have as many buckets as there are possible hash values for your keys.
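The steps above can be sketched as follows. The bit-mixing and masking are modelled on what OpenJDK's HashMap does since Java 8, as far as I know, where the "modulo" is a bit mask because the capacity is always a power of two. Treat it as an approximation rather than the exact source; the class and method names are mine.

```java
final class BucketIndexSketch {
    // Folds the high 16 bits into the low 16 bits so they also influence the bucket.
    static int spread(int h) {
        return h ^ (h >>> 16);
    }

    // For a power-of-two capacity, (capacity - 1) & hash equals hash modulo capacity.
    static int indexFor(Object key, int capacity) {
        return (capacity - 1) & spread(key.hashCode());
    }

    public static void main(String[] args) {
        // Two different Integer keys, same bucket when there are 16 buckets.
        System.out.println(indexFor(1, 16));  // 1
        System.out.println(indexFor(17, 16)); // 1
        // With 32 buckets the same keys no longer collide.
        System.out.println(indexFor(17, 32)); // 17
    }
}
```

The keys 1 and 17 are unique and have different hash codes, yet they share a bucket in a 16-bucket table, which is exactly the kind of collision described above.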
There are two things you need to know:
1. Even if there is a collision, it is not going to cause a problem, because each bucket holds a list. If you put into a bucket that already has a value inside, the new entry is simply appended to the list. On retrieval, the map first works out which bucket to look in, then goes through each entry in that bucket's list and finds the one whose key is equal (by calling equals()).
2. If you are putting millions of values into the HashMap, you may wonder whether every linked list in the map will then contain thousands of values, so that every lookup is a slow linear search. It will not: Java's HashMap is resized whenever the number of entries exceeds a certain threshold (have a look at capacity and loadFactor in the Javadoc). With a properly implemented hash code, the number of entries in each bucket is going to be small.
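A rough sketch of how the bucket count keeps up with the number of entries. It assumes the documented defaults (initial capacity 16, load factor 0.75) and models resizing as "double the capacity whenever entries exceed capacity * loadFactor". The class and method names are hypothetical, and the real implementation has more details than this.

```java
final class ResizeSketch {
    // Returns the bucket count a map would have grown to after inserting `entries` keys.
    static int capacityFor(int entries, float loadFactor) {
        int capacity = 16; // documented default initial capacity of HashMap
        // Double the table while the entry count exceeds the resize threshold.
        while (entries > capacity * loadFactor && capacity < (1 << 30)) {
            capacity *= 2;
        }
        return capacity;
    }

    public static void main(String[] args) {
        System.out.println(capacityFor(12, 0.75f));        // 16
        System.out.println(capacityFor(13, 0.75f));        // 32
        System.out.println(capacityFor(1_000_000, 0.75f)); // 2097152
    }
}
```

For a million entries the model ends up with about two million buckets, so the average bucket holds fewer than one entry. If the final size is known up front, passing a suitable initial capacity to the HashMap constructor avoids the intermediate resizes.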

Java Key-Value Collection with complexity of O(1) for millions of random unordered keys

I am stuck with a problem where I have millions of key-value pairs that I need to access using the keys randomly (not by using an iterator).
The range of keys is not known at compile time, but total number of the key-value pairs is known.
I have looked into the HashMap and HashSet data structures, but they are not truly O(1): in case of hash-code collisions they become an array of linked lists, which has linear search complexity in the worst case.
I have also considered increasing the number of buckets in the HashMap but it does not ensure that every element will be stored in a separate bucket.
Is there any way to store and access millions of key-value pairs with O(1) complexity?
Ideally I would like every key to be like a variable and corresponding value should be the value assigned to that key
Thanks in advance.
I think you are confusing what Big O notation represents. It defines limiting behavior of a function, not necessarily actual behavior.
The average complexity of a hash map is O(1) for insert, delete, and search operations. What does this mean? It means that, on average, those operations will complete in constant time regardless of the size of the hash map. So, depending on the implementation of the map, a lookup might not take exactly one step, but it will most likely not involve more than a few steps, however large the hash map is.
How well a hash map actually behaves for those operations is determined by a few factors. The most obvious is the hash function used to bucket keys. Hash functions that distribute the computed hashes more uniformly over the hash range and limit the number of collisions are preferred. The better the hash function in those areas, the closer a hash map will actually operate in constant time.
Another factor that affects actual hash map behavior is how storage is managed. How a map resizes and repositions entries as items are added and removed helps control hash collisions by keeping the number of buckets appropriate. Managing the hash map storage effectively will allow the hash map to operate close to constant time.
With all that said, there are ways to construct hash maps that have O(1) worst case behavior for lookups. This is accomplished using a perfect hash function. A perfect hash function is an invertible 1-1 function between keys and hashes. With a perfect hash function and the proper hash map storage, O(1) lookups can be achieved. The prerequisite for using this approach is knowing all the key values in advance so a perfect hash function can be developed.
Sadly, your case does not involve known keys, so a perfect hash function cannot be constructed, but the available research might help you construct a near-perfect hash function for your case.
No, there isn't such a (known) data structure for generic data types.
If there were, it would most likely have replaced hash tables in most commonly-used libraries, unless there's some significant disadvantage like a massive constant factor or ridiculous memory usage, either of which would probably make it nonviable for you as well.
I said "generic data types" above, as there may be some specific special cases for which it's possible, such as when the key is an integer in a small range. In this case you could just have an array where each index corresponds to the key with that value, but this is really also a hash table where the key hashes to itself.
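A minimal sketch of that special case, assuming the keys are integers in the range 0 to keyRange - 1. DirectAddressTable is a made-up class for illustration, not something from the JDK.

```java
final class DirectAddressTable {
    // One slot per possible key; the key itself is the array index.
    private final String[] slots;

    DirectAddressTable(int keyRange) {
        slots = new String[keyRange];
    }

    void put(int key, String value) {
        slots[key] = value; // O(1) worst case, no hashing and no collisions
    }

    String get(int key) {
        return slots[key]; // null if nothing was stored under this key
    }

    public static void main(String[] args) {
        DirectAddressTable table = new DirectAddressTable(10);
        table.put(3, "three");
        System.out.println(table.get(3)); // three
        System.out.println(table.get(4)); // null
    }
}
```

The cost is memory: the array must be as large as the key range, which is why this only works when that range is small.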
Note that you need a terrible hash function, the pathological input for your hash function, or a very undersized hash table to actually get the worst-case O(n) performance for your hash table. You really should test it and see if it's fast enough before you go in search of something else. You could also try TreeMap, which, with its O(log n) operations, will sometimes outperform HashMap.

Hash : How does it work internally?

This might sound like a very vague question up front, but it is not. I have gone through the Hash Function description on the wiki, but it did not help me much in understanding the topic.
I am looking for simple answers to a rather complex topic like hashing. Here are my questions:
What do we mean by hashing? How does it work internally?
What algorithm does it follow ?
What is the difference between HashMap, HashTable and HashList ?
What do we mean by 'Constant Time Complexity', and why do different implementations of the hash give constant-time operations?
Lastly, why are Hash and LinkedList asked about in most interview questions? Is there any specific logic behind it for testing the interviewee's knowledge?
I know my question list is big but I would really appreciate if I can get some clear answers to these questions as I really want to understand the topic.
Here is a good explanation about hashing. Suppose you want to store the string "Rachel": you apply a hash function to that string to get a memory location, e.g. myHashFunction(key: "Rachel" value: "Rachel") --> 10. The function may return 10 for the input "Rachel", so assuming you have an array of size 100, you store "Rachel" at index 10. If you want to retrieve that element, you just call GetmyHashFunction("Rachel") and it will return 10. Note that in this example the key is "Rachel" and the value is "Rachel", but you could use another value for that key, for example a birth date or an object. Your hash function may return the same memory location for two different inputs; in that case you have a collision, and if you are implementing your own hash table you have to take care of this, for example by using a linked list or another technique.
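To illustrate, here is a small hand-rolled hash table that handles collisions with a linked list per array slot. It is a teaching sketch only: ChainedTable and its methods are hypothetical names, it uses String.hashCode() as the hash function, and it never resizes.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

final class ChainedTable {
    // One key-value pair stored in a bucket's chain.
    private static final class Entry {
        final String key;
        String value;
        Entry(String key, String value) { this.key = key; this.value = value; }
    }

    private final List<LinkedList<Entry>> buckets = new ArrayList<>();

    ChainedTable(int bucketCount) {
        for (int i = 0; i < bucketCount; i++) {
            buckets.add(new LinkedList<>());
        }
    }

    // Hash code -> array slot; floorMod keeps the index non-negative.
    private LinkedList<Entry> bucketFor(String key) {
        return buckets.get(Math.floorMod(key.hashCode(), buckets.size()));
    }

    void put(String key, String value) {
        LinkedList<Entry> chain = bucketFor(key);
        for (Entry e : chain) {
            if (e.key.equals(key)) { e.value = value; return; } // overwrite existing key
        }
        chain.add(new Entry(key, value)); // new key: append to the chain
    }

    String get(String key) {
        for (Entry e : bucketFor(key)) {
            if (e.key.equals(key)) return e.value; // equals() singles out the right entry
        }
        return null;
    }

    public static void main(String[] args) {
        ChainedTable table = new ChainedTable(1); // a single slot forces every key to collide
        table.put("Rachel", "1990-05-01");
        table.put("Ross", "1988-10-18");
        System.out.println(table.get("Rachel")); // 1990-05-01
        System.out.println(table.get("Ross"));   // 1988-10-18
    }
}
```

Even with only one slot, where every insertion is a collision, each key still gets its own value back; the price is that get() has to walk the whole chain.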
Here are some common hash functions used. A good hash function satisfies the following: each key is equally likely to hash to any of the n memory slots, independently of where any other key has hashed to. One of the methods is called the division method. We map a key k into one of n slots by taking the remainder of k divided by n: h(k) = k mod n. For example, if your array size is n = 100 and your key is the integer k = 15, then h(k) = 15.
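The division method is short enough to write out directly. This sketch uses Math.floorMod rather than the % operator so that negative keys still map to a valid slot; the class name is invented.

```java
final class DivisionMethod {
    // h(k) = k mod n, with the result always in the range 0..n-1.
    static int h(int k, int n) {
        return Math.floorMod(k, n);
    }

    public static void main(String[] args) {
        System.out.println(h(15, 100));  // 15
        System.out.println(h(115, 100)); // 15, so 15 and 115 collide
        System.out.println(h(-5, 100));  // 95, whereas -5 % 100 would give -5
    }
}
```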
Hashtable is synchronised and HashMap is not.
HashMap allows one null key and null values, but Hashtable allows neither.
The purpose of a hash table is to have O(1) constant time complexity for adding and getting elements. In a linked list of size N, if you want to get the last element you have to traverse the whole list until you reach it, so the complexity is O(N). With a hash table, if you want to retrieve an element you just pass the key, and the hash function tells you where the desired element is stored. If the hash function is well implemented, this takes constant time, O(1). This means you don't have to traverse all the elements stored in the hash table. You will get the element "instantly".
Of course a programmer, developer, or computer scientist needs to know about data structures and complexity =)
Hashing means generating a (hopefully) unique number that represents a value.
Different types of values (Integer, String, etc) use different algorithms to compute a hashcode.
HashMap and Hashtable are maps; they are a collection of unique keys, each of which is associated with a value.
Java doesn't have a HashList class. A HashSet is a set of unique values.
Getting an item from a hashtable is constant-time with regard to the size of the table.
Computing a hash is not necessarily constant-time with regard to the value being hashed.
For example, computing the hash of a string involves iterating the string, and isn't constant-time with regard to the size of the string.
These are things that people ought to know.
Hashing is transforming a given entity (in Java terms, an object) into some number (or sequence). The hash function is not reversible, i.e. you can't obtain the original object from the hash. Internally, for java.lang.Object, it is typically derived from some memory address supplied by the JVM.
The JVM address is an unimportant detail. Each class can override the hashCode() method with its own algorithm. Modern Java IDEs can generate good hashCode methods.
Hashtable and hashmap are conceptually the same thing. They store key-value pairs, where the keys are hashed. Hash lists and hashsets don't store values, only keys.
Constant-time means that no matter how many entries there are in the hashtable (or any other collection), the number of operations needed to find a given object by its key is constant. That is, 1 or close to 1.
This is basic computer-science material, and it is assumed that everyone is familiar with it. I think Google have stated that the hashtable is the most important data structure in computer science.
I'll try to give simple explanations of hashing and of its purpose.
First, consider a simple list. Each operation (insert, find, delete) on such a list would have O(n) complexity, meaning that you have to traverse the whole list (or half of it, on average) to perform such an operation.
Hashing is a very simple and effective way of speeding it up: consider that we split the whole list in a set of small lists. Items in one such small list would have something in common, and this something can be deduced from the key. For example, by having a list of names, we could use first letter as the quality that will choose in which small list to look. In this way, by partitioning the data by the first letter of the key, we obtained a simple hash, that would be able to split the whole list in ~30 smaller lists, so that each operation would take O(n)/30 time.
However, we could note that the results are not that perfect. First, there are only 30 of them, and we can't change that. Second, some letters are used more often than others, so the set for Y or Z will be much smaller than the set for A. For better results, we need a way to partition the items into sets of roughly the same size. How could we solve that? This is where you use hash functions. A hash function is able to create an arbitrary number of partitions with roughly the same number of items in each. In our example with names, we could use something like
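#include <stddef.h>
#define NUMBER_OF_PARTITIONS 32 /* example bucket count, chosen for illustration */

/* Maps a string to a bucket index in the range 0..NUMBER_OF_PARTITIONS-1. */
unsigned int hash(const char* str) {
    unsigned int rez = 0; /* unsigned, so overflow wraps instead of being undefined */
    for (size_t i = 0; str[i] != '\0'; i++) /* avoids calling strlen on every pass */
        rez = rez * 37 + (unsigned char)str[i];
    return rez % NUMBER_OF_PARTITIONS;
}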
This would give a fairly even distribution and a configurable number of sets (also called buckets).
What do we mean by Hashing, how does it work internally ?
Hashing is the transformation of a string into a shorter fixed-length value or key that represents the original string. It is not indexing. The heart of hashing is the hash table, which contains an array of items. A hash table computes an index from the data item's key and uses this index to place the data into the array.
What algorithm does it follow ?
In simple words most of the Hash algorithms work on the logic "index = f(key, arrayLength)"
Lastly, why in most interview questions Hash and LinkedList are asked, is there any specific logic for it from testing interviewee's knowledge ?
It's about how good you are at logical reasoning. It is one of the most important data structures, and every programmer should know it.
