So I understand that hashmaps use buckets and hashcodes and whatnot. In my experience, Java hashcodes are usually large numbers rather than small ones, so I assume they are not used directly as an index internally. Unless the hashcode quality is so poor that a few buckets end up holding most of the entries, what makes a hashmap faster than a list of name->value pairs?
Hashmaps work by mapping elements to "buckets" by using a hash function. When someone tries to insert an element, a hash code is calculated and a modulus operation is applied to the hash code in order to get the bucket index in which the element should be inserted (That is the reason why it doesn't matter how big the hashcode is). For example, if you have 4 buckets and your hashcode is 40, it will be inserted in the bucket 0, because 40 mod 4 is 0.
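As a rough sketch of that mapping (not HashMap's actual code; the class and method names are made up for illustration), the bucket index can be computed with Math.floorMod, which also keeps the result non-negative when a hash code is negative:

```java
class BucketIndexSketch {
    // Hypothetical helper: maps any int hash code to a bucket index in [0, bucketCount).
    static int bucketIndex(int hashCode, int bucketCount) {
        // floorMod never returns a negative value for a positive divisor,
        // unlike the % operator, which gives -1 for (-41 % 4).
        return Math.floorMod(hashCode, bucketCount);
    }

    public static void main(String[] args) {
        System.out.println(bucketIndex(40, 4));   // 0, as in the example above
        System.out.println(bucketIndex(-41, 4));  // 3
        System.out.println(bucketIndex("hello".hashCode(), 4));
    }
}
```

However large the hash code is, the index always fits the bucket array.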
When two elements are mapped to the same bucket a "collision" occurs and usually the element is stored in a list under the same bucket.
If you try to obtain an element, the key is mapped again using the hash function. If the bucket contains a list of elements, the equals() method is used to identify which element is the correct one (that is the reason why you must implement both equals() and hashCode() to use a custom object as a key in a hashmap).
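A minimal sketch of such a custom key, assuming a made-up PointKey type that is not part of the original question:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

class PointKeyDemo {
    // Hypothetical key type used only for illustration.
    static final class PointKey {
        private final int x;
        private final int y;

        PointKey(int x, int y) {
            this.x = x;
            this.y = y;
        }

        @Override
        public boolean equals(Object other) {
            if (this == other) return true;
            if (!(other instanceof PointKey)) return false;
            PointKey p = (PointKey) other;
            return x == p.x && y == p.y;
        }

        @Override
        public int hashCode() {
            // Equal points must produce equal hash codes.
            return Objects.hash(x, y);
        }
    }

    static String lookup() {
        Map<PointKey, String> map = new HashMap<>();
        map.put(new PointKey(1, 2), "found");
        // A different but equal instance locates the same entry.
        return map.get(new PointKey(1, 2));
    }

    public static void main(String[] args) {
        System.out.println(lookup()); // found
    }
}
```

Without the hashCode() override, the second PointKey(1, 2) would most likely land in a different bucket and get() would return null.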
So, if you search for an element and no bucket in your hashmap holds more than one element, the lookup costs O(1). The worst case is when all elements end up in the list of a single bucket (for instance, when there is only 1 bucket); obtaining an element is then the same as searching a list, O(N).
I looked at the Java implementation and found that it uses a bitwise AND akin to a modulus, which makes sense as a way to reduce the hash code to the array size. This allows the O(1) access that makes HashMaps nice.
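To illustrate why the bitwise AND behaves like a modulus, here is a small sketch. It assumes the table length is a power of two, and the method names are invented for this example:

```java
class MaskVersusModulo {
    // Index computed with a mask; only valid when tableLength is a power of two.
    static int indexByMask(int hash, int tableLength) {
        return hash & (tableLength - 1);
    }

    // Reference: the same index computed with a non-negative modulus.
    static int indexByModulo(int hash, int tableLength) {
        return Math.floorMod(hash, tableLength);
    }

    public static void main(String[] args) {
        int tableLength = 16;
        int[] samples = {40, 12303512, -7, Integer.MIN_VALUE, "key".hashCode()};
        for (int h : samples) {
            // Both columns print the same index for every sample.
            System.out.println(h + " -> " + indexByMask(h, tableLength)
                    + " / " + indexByModulo(h, tableLength));
        }
    }
}
```

The mask keeps only the low bits of the hash, which is cheaper than a division and works for negative hash codes too.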
Having started learning Java, I came across this statement in the docs of Java 8:
assuming the hash function disperses the elements properly among the buckets.
Does that simply mean that the order you get, after assigning, will be a mess?
It means that HashMap maintains an array of buckets under the hood. Hash code produced by the hashCode() method of a key object determines to which bucket this entry should go.
A situation in which multiple keys produce hash codes that map to the same bucket is called a collision. The hash codes do not have to be equal for this to happen.
Entries of the map that are mapped to the same bucket are structured as a linked list. Starting with Java 8, when the number of entries in one bucket grows beyond a certain threshold (8 in the OpenJDK implementation, provided the table has at least 64 buckets), the list is transformed into a red-black tree.
As you probably know, the cost of accessing an element at a given index of an array is O(1). HashMap likewise provides access to values by key with an expected time complexity of O(1), but only if the number of collisions is negligible, i.e. hashCode() is implemented in such a way that it spreads the keys relatively evenly between the buckets.
In the edge case where the hash function is badly implemented and, let's say, returns the same hash for every key, all the entries end up in the same bucket. The time complexity of methods like get() and containsKey() then degrades to O(n) (with Java 7 and earlier), because you have to iterate over the list of all entries in order to find a particular one. With Java 8 onwards the time complexity will be O(log n), because that is the worst-case time required to find an element in a red-black tree.
Does that simply mean that the order you get, after assigning, will be a mess?
The order of elements in a HashMap is undefined. This class is useful when you need quick access and don't care about the order. If you need an ordered map, consider LinkedHashMap, which tracks the order in which the entries were added by maintaining a linked list, or TreeMap, which keeps keys sorted according to their natural order or a given comparator.
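A short sketch of the difference between the two ordered maps (the keys and the helper method are arbitrary examples):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MapOrderDemo {
    // Inserts the same three keys into the given map and returns its iteration order.
    static List<String> keysAfterInserting(Map<String, Integer> map) {
        map.put("pear", 1);
        map.put("apple", 2);
        map.put("mango", 3);
        return new ArrayList<>(map.keySet());
    }

    public static void main(String[] args) {
        // Insertion order: [pear, apple, mango]
        System.out.println(keysAfterInserting(new LinkedHashMap<String, Integer>()));
        // Natural (alphabetical) key order: [apple, mango, pear]
        System.out.println(keysAfterInserting(new TreeMap<String, Integer>()));
    }
}
```

A plain HashMap gives no such guarantee, and its iteration order may even change after a resize.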
A hash map contains a number of "buckets". For best performance, you want the number of entries to be more or less the same in each bucket. The bucket is determined by the hash function; thus you want a hash function that results in more or less the same probability of hitting each bucket. That is, "the hash function disperses the elements properly among the buckets".
At the other extreme: a hash function that always returned, say, the value 3 would work, but map access wouldn't be very efficient, since one bucket would have all the entries.
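To make that extreme concrete, here is a sketch with a deliberately bad key class (BadKey is a made-up name). The map stays correct; only its speed suffers:

```java
import java.util.HashMap;
import java.util.Map;

class ConstantHashDemo {
    // Hypothetical key whose hashCode() is deliberately terrible.
    static final class BadKey {
        final String name;

        BadKey(String name) {
            this.name = name;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).name.equals(name);
        }

        @Override
        public int hashCode() {
            return 3; // every key lands in the same bucket
        }
    }

    static Map<BadKey, Integer> build(int n) {
        Map<BadKey, Integer> map = new HashMap<>();
        for (int i = 0; i < n; i++) {
            map.put(new BadKey("k" + i), i);
        }
        return map;
    }

    public static void main(String[] args) {
        Map<BadKey, Integer> map = build(1000);
        System.out.println(map.size());                  // 1000
        System.out.println(map.get(new BadKey("k500"))); // 500
    }
}
```

Every get() here has to search the one overloaded bucket, using equals() to tell the keys apart.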
I don't understand what you mean by the order being a "mess". A hash map is not ordered; the location of an element depends on its hash code.
I have read in many places that, after a hash collision, Java internally uses a linked list or a tree, depending on the number of hash collisions. That much is clear.
But how does it retrieve the expected value using the 'key'?
It just iterates over the linked list stored in that bucket and checks each element using equals, which, unlike the hash code, has no collisions.
The running time for that is linear, but only linear in the number of items stored in that specific bucket, so it is okay as long as the buckets are kept well enough balanced.
Look at this illustration (source):
So the implementation will make sure that a get operation, even if it has collisions, gives back the correct result in the end.
Note that Java's HashSet and HashMap are not a pure hash table like the one illustrated. They switch a bucket to a red-black tree internally once it grows past a certain threshold.
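The lookup described above can be sketched as a tiny chained hash table. This is a simplified illustration, not the JDK's code: the class name is invented, null keys are not handled, and there is no resizing or tree conversion.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

class ChainedTableSketch<K, V> {
    private static final class Entry<K, V> {
        final K key;
        V value;

        Entry(K key, V value) {
            this.key = key;
            this.value = value;
        }
    }

    private final List<LinkedList<Entry<K, V>>> buckets = new ArrayList<>();

    ChainedTableSketch(int bucketCount) {
        for (int i = 0; i < bucketCount; i++) {
            buckets.add(new LinkedList<Entry<K, V>>());
        }
    }

    // Step 1: the hash code selects the bucket.
    private LinkedList<Entry<K, V>> bucketFor(K key) {
        return buckets.get(Math.floorMod(key.hashCode(), buckets.size()));
    }

    void put(K key, V value) {
        for (Entry<K, V> e : bucketFor(key)) {
            if (e.key.equals(key)) {
                e.value = value; // key already present: overwrite
                return;
            }
        }
        bucketFor(key).add(new Entry<>(key, value));
    }

    // Step 2: equals() picks the right entry inside the bucket.
    V get(K key) {
        for (Entry<K, V> e : bucketFor(key)) {
            if (e.key.equals(key)) {
                return e.value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // A single bucket forces every key to collide.
        ChainedTableSketch<String, Integer> table = new ChainedTableSketch<>(1);
        table.put("one", 1);
        table.put("two", 2);
        System.out.println(table.get("two")); // 2
    }
}
```

Even with every key in the same bucket, get() returns the right value because equals() has the final say.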
I am having confusion in hashing:
When we use Hashtable/HashMap (key,value), first I understood the internal data structure is an array (already allocated in memory).
Java hashcode() method has an int return type, so I think this hash value will be used as an index for the array and in this case, we should have 2 power 32 entries in the array in RAM, which is not what actually happens.
So does Java create an index from the hashcode() which is smaller range?
Answer:
As the guys pointed out below and from the documentation: http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/util/HashMap.java
HashMap is backed by an array. The hashCode() result is rehashed but is still an integer, and the index in the array becomes h & (length-1). So if the length of the array is 2^n, the index consists of the lowest n bits of the rehashed value.
The structure for a Java HashMap is not just an array. It is an array, but not of 2^31 entries (int is a signed type!), but of some smaller number of buckets, by default 16 initially. The Javadocs for HashMap explain that.
When the number of entries exceeds a certain fraction (the "load factor") of the capacity, the array grows to a larger size.
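As a rough model of that growth rule (a simplified simulation, not HashMap's real resize code), assume a threshold of capacity * loadFactor and a doubling of the capacity whenever the entry count exceeds it. The 0.75 used below is HashMap's documented default load factor; the method names are invented:

```java
class ResizeThresholdSketch {
    // Number of entries the table holds before it grows.
    static int threshold(int capacity, float loadFactor) {
        return (int) (capacity * loadFactor);
    }

    // Capacity after inserting entryCount entries one by one,
    // doubling whenever the size exceeds the current threshold.
    static int capacityAfter(int entryCount, int initialCapacity, float loadFactor) {
        int capacity = initialCapacity;
        for (int size = 1; size <= entryCount; size++) {
            if (size > threshold(capacity, loadFactor)) {
                capacity *= 2;
            }
        }
        return capacity;
    }

    public static void main(String[] args) {
        System.out.println(threshold(16, 0.75f));         // 12
        System.out.println(capacityAfter(12, 16, 0.75f)); // 16
        System.out.println(capacityAfter(13, 16, 0.75f)); // 32
    }
}
```

With the defaults, the 13th entry is the one that triggers the first resize.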
Each element of the array does not hold only one entry. Each element of the array holds a structure of entries (a linked list, which since Java 8 is converted to a red-black tree when it grows large). Every entry in the structure has a hash code that maps internally to the same bucket position in the array.
Have you read the docs on this type?
http://docs.oracle.com/javase/8/docs/api/java/util/HashMap.html
You really should.
Generally the base data structure will indeed be an array.
The methods that need to find an entry (or empty gap in the case of adding a new object) will reduce the hash code to something that fits the size of the array (generally by modulo), and use this as an index into that array.
Of course this makes the chance of collisions more likely, since many objects could have a hash code that reduces to the same index (possible anyway since multiple objects might have exactly the same hash code, but now much more likely). There are different strategies for dealing with this, generally either by using a linked-list-like structure or a mechanism for picking another slot if the first slot that matched was occupied by a non-equal key.
Since this adds cost, the more often such collisions happen the slower things become, and in the worst case lookup would in fact be O(n) (and slow as O(n) goes, too).
Increasing the size of the internal store will generally improve this though, especially if it is not to a multiple of the previous size (so the operation that reduced the hash code to find an index won't take a bunch of items colliding on the same index and then give them all the same index again). Some mechanisms will increase the internal size before absolutely necessary (while there is some empty space remaining) in certain cases (certain percentage, certain number of collisions with objects that don't have the same full hash code, etc.)
This means that unless the hash codes are very bad (most obviously, if they are in fact all exactly the same), the order of operation stays at O(1).
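The remark about growing to a size that is not a multiple of the previous one can be checked with a small sketch. The hash codes and table sizes are arbitrary examples, and a plain modulus is assumed as the reduction step:

```java
import java.util.HashSet;
import java.util.Set;

class ResizeSpreadSketch {
    // Counts how many distinct slots the given hash codes occupy in a table of this length.
    static int distinctSlots(int[] hashCodes, int tableLength) {
        Set<Integer> used = new HashSet<>();
        for (int h : hashCodes) {
            used.add(Math.floorMod(h, tableLength));
        }
        return used.size();
    }

    public static void main(String[] args) {
        int[] hashCodes = {5, 21, 37, 53}; // all equal to 5 modulo 16
        System.out.println(distinctSlots(hashCodes, 16)); // 1: everything collides
        System.out.println(distinctSlots(hashCodes, 32)); // 2: doubling separates only some
        System.out.println(distinctSlots(hashCodes, 17)); // 4: a non-multiple separates all
    }
}
```

Different hash codes that collided before a resize can be pulled apart afterwards, whereas identical hash codes never can.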
In some post I read:
ConcurrentHashMap groups elements by a proximity based on loadfactor
How this grouping happens?
Let's say I override the hashCode() function so that it always returns 1. Now how are higher and lower values of loadfactor going to affect inserts into a ConcurrentHashMap?
Now I override the hashCode() function so that it always returns different hashcodes. Now how are higher and lower values of loadfactor going to affect inserts into a ConcurrentHashMap?
A hashmap is essentially an array of lists. For example, let's say a given hashmap has an array of 100 lists. When you add something to it, the hashCode is calculated for that object. Then the modulus of that value and the number of lists (in this case 100) is used to determine which list it is added to. So if you add an object with hashcode 13, it gets added to list 13. If you add an object with the hashcode 12303512, it gets added to list 12.
The load factor tells the hashmap when to increase the number of lists. It's based on the number of items in the entire map and the current capacity.
In your first scenario where hashcode always returns 1, no matter how many lists there are, your objects will end up in the same list (this is bad.) In the second scenario, they will be distributed more evenly across the lists (this is good.)
Since the load factor is based on the overall size of the map and not that of the lists, the quality of your hashcodes doesn't really interact with the loadfactor. In the first scenario, it will grow just like in the second one but everything will still end up in the same list regardless.
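The two scenarios can be compared with a short sketch that counts how long the longest list gets. It uses the simplified array-of-100-lists model from this answer rather than ConcurrentHashMap's real internals, and the names are invented:

```java
class ListDistributionSketch {
    // Length of the longest list after distributing the hash codes over listCount lists.
    static int longestList(int[] hashCodes, int listCount) {
        int[] sizes = new int[listCount];
        int longest = 0;
        for (int h : hashCodes) {
            int i = Math.floorMod(h, listCount);
            sizes[i]++;
            longest = Math.max(longest, sizes[i]);
        }
        return longest;
    }

    public static void main(String[] args) {
        int[] constant = new int[1000];
        int[] distinct = new int[1000];
        for (int i = 0; i < 1000; i++) {
            constant[i] = 1; // first scenario: hashCode() always returns 1
            distinct[i] = i; // second scenario: every hashCode() differs
        }
        System.out.println(longestList(constant, 100)); // 1000
        System.out.println(longestList(distinct, 100)); // 10
    }
}
```

Raising the number of lists shortens them only in the second scenario; in the first, all 1000 items stay in one list no matter how many lists exist.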
If I create a new Map:
Map<Integer,String> map = new HashMap<Integer,String>();
Then I call map.put() a bunch of times, each with a unique key, say, a million times. Will there ever be a collision, or does Java's hashing algorithm guarantee no collisions if the key is unique?
Hashing does not guarantee that there will be no collisions if the key is unique. In fact, the only thing that's required is that objects that are equal have the same hashcode. The number of collisions determines how efficient retrieval will be (fewer collisions, closer to O(1), more collisions, closer to O(n)).
What an object's hashcode will be depends on what type it is. For instance, a string's default hashcode is
s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
which necessarily reduces the complexity of the string to a single number. It is definitely possible to reach the same hashcode with two different strings: for example, "Aa" and "BB" both hash to 2112. For typical data such collisions are still fairly uncommon.
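The formula can be evaluated with Horner's rule using ordinary int arithmetic, which wraps around on overflow. The sketch below (the method name is invented) does this and compares the result with String.hashCode():

```java
class StringHashSketch {
    // Evaluates s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1] in int arithmetic.
    static int polynomialHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        String s = "hashmap";
        // Both values are the same.
        System.out.println(polynomialHash(s));
        System.out.println(s.hashCode());
    }
}
```

Because the result is squeezed into 32 bits, distinct strings must share hash codes at some point.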
If two things hash to the same thing, hashmap uses .equals to determine whether a particular key matches. That's why it's so important that you override both hashCode() and equals() together and ensure that things that are equal have the same hash code.
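As a small check of this, the strings "Aa" and "BB" have identical hash codes, yet a HashMap keeps them apart because equals() distinguishes them (the class name here is invented):

```java
import java.util.HashMap;
import java.util.Map;

class CollidingKeysDemo {
    static Map<String, Integer> build() {
        Map<String, Integer> map = new HashMap<>();
        map.put("Aa", 1);
        map.put("BB", 2); // same hash code as "Aa", but not equal to it
        return map;
    }

    public static void main(String[] args) {
        Map<String, Integer> map = build();
        System.out.println("Aa".hashCode() == "BB".hashCode()); // true
        System.out.println(map.size());    // 2
        System.out.println(map.get("Aa")); // 1
        System.out.println(map.get("BB")); // 2
    }
}
```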
Hashtable works somewhat as follows:
1. A hashmap is created with an initial capacity (or number of buckets).
2. Each time you add an object to it, Java invokes the hash function of the key, which returns a number, and then takes that number modulo the current size of the hashtable.
3. The object is stored in the bucket with the result from step 2.
So even if you have unique keys, they can still collide unless you have as many buckets as your range of hash of your key.
There are two things you need to know:
1. Even if there is a collision, it is not going to cause a problem, because each bucket holds a list. If you put an entry into a bucket that already has a value inside, it is simply appended to the list. On retrieval, the map first finds out which bucket to look in and then goes through each entry in that bucket's list to find the one whose key is equal (by calling equals()).
2. If you are putting millions of values into the HashMap, you may wonder whether every linked list in the map will then contain thousands of values, so that every lookup is a big, slow linear search. This does not happen, because Java's HashMap is resized whenever the number of entries grows larger than a certain threshold (have a look at capacity and loadFactor in the Javadoc). With a properly implemented hash code, the number of entries in each bucket is going to be small.