What is the maximum size of the linked list inside a single bucket of a HashSet, and what happens when that maximum is reached, if any? Suppose all n input elements have hash codes that land in the same slot of the backing hash table's node array, i.e. due to a specific input, bucket 0 keeps on growing while the rest of the buckets stay empty. Is rehashing done in that case, or is there a specific way to avoid this problem?
The strategy is somewhat implementation specific, but in general, once the backing table of a HashMap (and HashSet is based on one) has reached a capacity of 64 and a single bucket holds 8 entries, that bucket is transformed into a tree. Until then, a resize happens instead: the table is doubled in size, so an extra bit of the hash is taken into consideration when deciding where to place an entry. This is called rehashing, and it is done to try to move entries to different buckets.
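For illustration, here is a minimal sketch (not from the answer above) using a hypothetical key class whose hashCode is constant, so every element of the set lands in the same bucket of the backing HashMap:

    import java.util.HashSet;
    import java.util.Set;

    public class SingleBucketDemo {
        // Illustrative key: all instances collide on purpose.
        static final class BadKey implements Comparable<BadKey> {
            final int id;
            BadKey(int id) { this.id = id; }
            @Override public int hashCode() { return 42; }            // every key collides
            @Override public boolean equals(Object o) {
                return o instanceof BadKey && ((BadKey) o).id == id;
            }
            @Override public int compareTo(BadKey other) {
                return Integer.compare(id, other.id);                 // lets the tree bin order keys
            }
        }

        public static void main(String[] args) {
            Set<BadKey> set = new HashSet<>();
            for (int i = 0; i < 1_000; i++) {
                set.add(new BadKey(i));                               // all end up in one bucket
            }
            // In Java 8+, once the bucket holds 8 entries and the table capacity is
            // at least 64, the bucket becomes a red-black tree, so this lookup is
            // O(log n) rather than O(n) despite the worst-case hashCode.
            System.out.println(set.contains(new BadKey(999)));        // true
        }
    }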
See this and this for some implementation specifics.
Related
I understand that when we declare a map like the following:
    Map<String, Integer> map = new HashMap<>();
The default load factor is 0.75 and the default capacity is 16; when the map exceeds 12 elements, the capacity changes to 32.
However, the way the map chooses the bucket index where an object is placed by put is defined by hashCode % n. So what happens when the map's size exceeds the load-factor threshold? n no longer has the same value, so how can you find the previously stored entries if applying hashCode % n now yields an index that is not the same as before?
My final question is: how can the index of the bucket be the same after we've increased the size?
The simple answer is that it can't. HashMap has to perform a rehashing of all of the elements at the point when it expands.
See the following method (from the pre-Java 8 implementation of HashMap):
    /**
     * Transfers all entries from current table to newTable.
     */
    void transfer(Entry[] newTable, boolean rehash) {
which is called by resize, whose JavaDoc says:
Rehashes the contents of this map into a new array with a larger
capacity. This method is called automatically when the number of keys
in this map reaches its threshold.
Emphasis mine
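As a rough sketch of the arithmetic behind that rehash (not the JDK code itself, and assuming the power-of-two table with h & (length - 1) indexing that HashMap uses): doubling the capacity brings one more hash bit into play, so after a resize an entry either stays at its old index or moves to oldIndex + oldCapacity.

    public class RehashIndexDemo {
        public static void main(String[] args) {
            int oldCapacity = 16;
            int newCapacity = 32;                       // doubled on resize
            int[] hashes = {5, 21, 37, 100};            // made-up hash values
            for (int h : hashes) {
                int oldIndex = h & (oldCapacity - 1);
                int newIndex = h & (newCapacity - 1);
                System.out.printf("hash=%d  old bucket=%d  new bucket=%d%n",
                        h, oldIndex, newIndex);
            }
            // hash=21 moves from bucket 5 to bucket 21 (oldIndex + oldCapacity),
            // while hash=5 and hash=37 stay in bucket 5.
        }
    }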
See also:
Rehashing process in hashmap or hashtable
how and when Rehashing is done in HashMap
Rehashing in HashMap in Java
The default initial capacity of a HashMap is 16 and the default load factor is 0.75f (i.e. 75% of the current capacity). The load factor determines at what fill level the HashMap's capacity is doubled.
For example, the product of capacity and load factor is 16 * 0.75 = 12. This means that once the map holds more than 12 key-value pairs (i.e. on inserting the 13th entry), its capacity becomes 32.
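A small arithmetic sketch of that threshold, assuming the default settings:

    public class ThresholdDemo {
        public static void main(String[] args) {
            int capacity = 16;
            float loadFactor = 0.75f;
            int threshold = (int) (capacity * loadFactor);   // 16 * 0.75 = 12
            System.out.println("resize threshold = " + threshold);
            // Once more than 12 entries are stored (i.e. on inserting the 13th
            // with default settings), the capacity doubles to 32 and the
            // threshold becomes 24.
        }
    }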
For Further Exposure
Further Process Explanation
Rehashing of a hash map is done when the number of elements in the map reaches the maximum threshold value.
Usually the load factor is 0.75 and the default initial capacity is 16. Once the number of elements reaches or crosses 0.75 times the capacity, rehashing of the map takes place. In this case, rehashing occurs when the number of elements reaches 12 (0.75 * 16 = 12).
When rehashing occurs, a new hash function (or even the same one) could be used, but the buckets in which the values end up can change. Basically, the number of buckets is approximately doubled, and hence the index at which a value has to be put changes.
In the pre-Java 8 implementation, the linked list in each bucket got reversed during rehashing. This happened because HashMap did not append a new element at the tail; it inserted it at the head. So when rehashing occurred, each element was read from the old table and inserted at the head of its new bucket, and adding each subsequent element from the old map at the head of the new one reversed the linked list. If multiple threads manipulated the same hash map concurrently, this could result in an infinite loop. (Java 8 changed resizing to preserve the order within each bucket, but a HashMap is still unsafe under concurrent modification.)
A detailed explanation of how the infinite loop occurs in the above case can be found here:
Read this article for more understanding
If the entries inserted in the map have to be sorted with respect to the keys, then a TreeMap can be used. But a HashMap is more efficient if the order of the keys doesn't matter.
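A quick sketch of that trade-off (illustrative keys and values only): TreeMap iterates its keys in sorted order, while HashMap makes no ordering guarantee but offers O(1) average lookups.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.TreeMap;

    public class OrderingDemo {
        public static void main(String[] args) {
            Map<String, Integer> hash = new HashMap<>();
            Map<String, Integer> tree = new TreeMap<>();
            for (String key : new String[]{"pear", "apple", "mango"}) {
                hash.put(key, key.length());
                tree.put(key, key.length());
            }
            System.out.println(hash.keySet()); // some unspecified order
            System.out.println(tree.keySet()); // [apple, mango, pear] -- sorted by key
        }
    }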
The internal data structures are rebuilt. From the docs (https://docs.oracle.com/javase/8/docs/api/java/util/HashMap.html):
When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the hash table is rehashed (that is, internal data structures are rebuilt) so that the hash table has approximately twice the number of buckets.
HashMaps do not preserve ordering. Look at using LinkedHashMap instead.
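A minimal sketch of that suggestion (illustrative entries only): a LinkedHashMap iterates in insertion order no matter how the keys hash or when the backing table resizes.

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class LinkedOrderDemo {
        public static void main(String[] args) {
            Map<Integer, String> map = new LinkedHashMap<>();
            map.put(30, "thirty");
            map.put(10, "ten");
            map.put(20, "twenty");
            System.out.println(map.keySet()); // [30, 10, 20] -- insertion order kept
        }
    }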
I want to store 1*10^8 objects in a map for searching. When my program starts, it reads and stores these objects in the map. Once the reading is finished, the map is never updated until the program exits, and I don't want the JVM to discard any of the entries. I have learned that a HashMap wastes a lot of memory; is there any kind of map that can store this many objects while saving memory?
I also know that the JVM will scan these objects during garbage collection, which wastes time. How can I avoid this?
Sorry, the situation is this: I am writing a bolt with Apache Storm and I want to read data from databases. When the bolt processes a tuple, I need to do calculations with the data from the databases, so for performance I have to keep it in memory. I know the JVM is not good at managing a lot of memory, so maybe I should try Koloboke?
A HashMap needs to allocate an array of sufficient size in order to minimize hash collisions: two or more objects that are not equal can have the same hash code, and the probability of that depends on the quality of the hash function. Collisions are resolved by techniques such as linear probing, which stores the entry at the next unoccupied index (hash + i) mod length; quadratic probing, which stores it at the next unoccupied index (hash + i^k) mod length; and separate chaining, which stores a linked list of entries at each bucket (this is what HashMap does). The collision probability is decreased by increasing the length of the backing array, which is where the memory is wasted.
However, you can use a TreeMap, which stores entries in a tree structure that creates exactly one node per entry, i.e. memory usage proportional to the number of entries.
Note that there is a difference in the complexity of the get, put and remove operations: HashMap has average complexity O(1), while TreeMap has complexity O(log n).
Suppose you want to get an entry from a map of size 100 000 000. In the worst case (the element to be found is a leaf, i.e. it sits on the last level of the tree), the path down the tree has a length of about log2(100 000 000) ≈ 27 comparisons.
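A quick check of that estimate, assuming a balanced binary search tree over the 100 000 000 entries mentioned in the question:

    public class TreeDepthDemo {
        public static void main(String[] args) {
            long n = 100_000_000L;
            double depth = Math.log(n) / Math.log(2);    // log2(n)
            // Prints roughly 26.6: a lookup walks on the order of 27 nodes,
            // not 100 000 000.
            System.out.printf("worst-case path length ~ %.1f comparisons%n", depth);
        }
    }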
Well, I am back.
At first I used about 30 GB to store about 5x10^7 key-value entries, but GC was not stable. I had made the mistake of using a String to store a double; it is bigger than a double in memory, since a char is 16 bits in Java. After fixing that mistake, GC was better, but still not good enough. Finally I used the 'fileDB' backend of MapDB to fix this.
I am confused about hashing:
When we use a Hashtable/HashMap (key, value), my first understanding was that the internal data structure is an array (already allocated in memory).
Java's hashCode() method has an int return type, so I thought this hash value would be used as the index into that array, in which case we would need 2^32 entries in the array in RAM, which is not what actually happens.
So does Java derive an index from the hashCode() that has a smaller range?
Answer:
As the guys pointed out below, and from the documentation: http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/util/HashMap.java
HashMap is backed by an array. The hashCode() is rehashed again (but stays an int) and the index into the array becomes h & (length - 1); so if the length of the array is 2^n, the index is taken from the lowest n bits of the re-hashed value.
The structure of a Java HashMap is indeed an array, but not one of 2^31 entries (int is a signed type!); it has some smaller number of buckets, 16 by default initially. The Javadocs for HashMap explain that.
When the number of entries exceeds a certain fraction (the "load factor") of the capacity, the array grows to a larger size.
Each element of the array does not hold only one entry; it holds a structure of entries (a linked list which, since Java 8, is converted to a red-black tree once it grows large). Every entry in that structure has a hash code that maps internally to the same bucket position in the array.
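As a small sketch of that index derivation, assuming the Java 8 scheme where the key's hashCode is first "spread" by XOR-ing in its high bits and then masked down to the power-of-two table length:

    public class BucketIndexDemo {
        // Same idea as HashMap.hash(Object) in Java 8: mix the high bits into
        // the low bits so that the mask below sees more of the hash.
        static int spread(int h) {
            return h ^ (h >>> 16);
        }

        public static void main(String[] args) {
            String key = "example";
            int tableLength = 16;                              // default initial capacity
            int index = spread(key.hashCode()) & (tableLength - 1);
            System.out.println("bucket index = " + index);     // always in [0, 15]
        }
    }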
Have you read the docs on this type?
http://docs.oracle.com/javase/8/docs/api/java/util/HashMap.html
You really should.
Generally the base data structure will indeed be an array.
The methods that need to find an entry (or empty gap in the case of adding a new object) will reduce the hash code to something that fits the size of the array (generally by modulo), and use this as an index into that array.
Of course this makes collisions more likely, since many objects can have hash codes that reduce to the same index (it is possible anyway for multiple objects to share exactly the same hash code, but after the reduction it becomes much more likely). There are different strategies for dealing with this, generally either a linked-list-like structure per slot or a mechanism for picking another slot if the first matching slot is occupied by a non-equal key.
Since this adds cost, the more often such collisions happen, the slower things become, and in the worst case lookup is in fact O(n) (and slow as O(n) goes, too).
Increasing the size of the internal store generally improves this, especially if the new size is not a multiple of the previous size (so the operation that reduces the hash code to an index doesn't take a bunch of items colliding on the same index and give them all the same index again). Some implementations increase the internal size before it is absolutely necessary (while some empty space remains) in certain cases (a certain fill percentage, a certain number of collisions between objects that don't share the same full hash code, etc.).
This means that unless the hash codes are very bad (most obviously, if they are in fact all exactly the same), each operation stays O(1).
This may be a strange question, but it is based on some results I get using a Java Map: is element retrieval faster in a HashMap when the map is smaller?
I have some code that uses the containsKey and get(key) methods of a HashMap, and it seems to run faster if the number of elements in the Map is smaller. Is that so?
My understanding is that HashMap uses a hash function to reach a certain slot of the map, and there are versions in which that slot is a reference to a linked list (because several keys can hash to the same slot), or points to other slots in the map when implemented fully statically.
Is this correct: can retrieval be faster if the Map has fewer elements?
I need to extend my question, with a concrete example.
I have two cases; in both, the total number of elements is the same.
In the first case, I have 10 HashMaps; I'm not aware of how the elements are distributed. The execution time of that part of the algorithm is 141 ms.
In the second case, I have 25 HashMaps with the same total number of elements. The execution time of the same algorithm is 69 ms.
In both cases, I have a for loop that goes through each of the HashMaps, tries to find the same elements, and gets them if present.
Can it be that the total execution time is smaller because each individual lookup inside a smaller HashMap is faster, and those times add up?
I know that this is very strange, but is something like this somehow possible, or am I doing something wrong?
A Map<Integer, Double> is considered. It is hard to tell what the distribution of elements is, since this is actually an implementation of the KMeans clustering algorithm and the elements are representations of cluster centroids. That means they mostly depend on the initialization of the algorithm. Also, the total number of elements will mostly not be the same, but I have tried to simplify the problem; sorry if that was misleading.
The number of collisions is decisive for a slowdown.
Assume an array of some size; the hash code modulo the size then points to the index where the object is put. Two objects with the same index collide.
Having a large capacity (array size) with respect to number of elements helps.
With HashMap there are overloaded constructors with extra settings.
    public HashMap(int initialCapacity, float loadFactor)
Constructs an empty HashMap with the specified initial capacity and load factor.
You might experiment with that.
For a specific key class used with a HashMap, having a good hashCode can help too. Hash codes are a separate mathematical field.
Of course using less memory helps on the processor / physical memory level, but I doubt an influence in this case.
Does your timing take into account only the cost of get / containsKey, or are you also performing puts in the timed code section? If so, and if you're using the default constructor (initial capacity 16, load factor 0.75), then the larger hash tables are going to need to resize themselves more often than the smaller hash tables will. Like Joop Eggen says in his answer, try playing around with the initial capacity in the constructor. For example, if you know that you have N elements, set the initial capacity to N / number_of_hash_tables or something along those lines; this ought to give both the smaller and the larger hash tables sufficient capacity that they won't need to be resized.
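A hedged sketch of that tuning (the counts and names below are illustrative, not from the question): give each of the smaller maps enough initial capacity for its share of the N elements so that no resize happens inside the timed section.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class PreSizedMapsDemo {
        public static void main(String[] args) {
            int totalElements = 1_000_000;      // N, assumed known up front
            int numberOfMaps = 25;
            int perMap = totalElements / numberOfMaps;
            // Capacity chosen so perMap entries stay below the 0.75 threshold.
            int initialCapacity = (int) (perMap / 0.75f) + 1;

            List<Map<Integer, Double>> maps = new ArrayList<>();
            for (int i = 0; i < numberOfMaps; i++) {
                maps.add(new HashMap<>(initialCapacity, 0.75f));
            }
            System.out.println("each map pre-sized for " + perMap + " entries");
        }
    }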
From the JavaDocs of HashSet:
This class offers constant time performance for the basic operations
(add, remove, contains and size), assuming the hash function disperses
the elements properly among the buckets. Iterating over this set
requires time proportional to the sum of the HashSet instance's size
(the number of elements) plus the "capacity" of the backing HashMap
instance (the number of buckets). Thus, it's very important not to set
the initial capacity too high (or the load factor too low) if
iteration performance is important
Why does iteration take time proportional to the sum (number of elements in the set + capacity of the backing map) and not only to the number of elements in the set itself?
HashSet is implemented using a HashMap where the elements are the map keys. Since the map has a fixed number of buckets, each of which may contain zero or more elements, iteration needs to check every bucket, whether it contains elements or not.
Using a LinkedHashSet instead follows the "linked" list of entries, so the number of empty buckets doesn't matter. Normally you wouldn't have a HashSet whose capacity is much more than double the size actually used. Even if you do, scanning a million buckets, mostly null, doesn't take much time (milliseconds).
Why does iteration take time proportional to the sum (number of elements in the set + capacity of the backing map) and not only to the number of elements in the set itself?
The elements are dispersed inside the underlying HashMap, which is backed by an array.
So it is not known which buckets are occupied (only the total number of elements is known).
So to iterate over all elements, all buckets must be checked.
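A rough sketch of that effect (timings are only indicative and the capacity is deliberately exaggerated): two sets hold the same 100 elements, but iterating the over-allocated one has to walk millions of mostly empty buckets.

    import java.util.HashSet;
    import java.util.Set;

    public class IterationCostDemo {
        public static void main(String[] args) {
            Set<Integer> compact = new HashSet<>();                // default capacity
            Set<Integer> oversized = new HashSet<>(10_000_000);    // huge, mostly empty table
            for (int i = 0; i < 100; i++) {
                compact.add(i);
                oversized.add(i);
            }

            long t0 = System.nanoTime();
            for (int ignored : compact) { }                        // visits only a few buckets
            long t1 = System.nanoTime();
            for (int ignored : oversized) { }                      // walks ~10^7 buckets
            long t2 = System.nanoTime();

            System.out.printf("compact: %d us, oversized: %d us%n",
                    (t1 - t0) / 1_000, (t2 - t1) / 1_000);
        }
    }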
If your concern is the time it takes to iterate over the set, and you are using Java 6 or greater, take a look at this beauty:
ConcurrentSkipListSet
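A tiny usage sketch (illustrative elements only): ConcurrentSkipListSet keeps its elements sorted and iterates over them without scanning empty buckets.

    import java.util.concurrent.ConcurrentSkipListSet;

    public class SkipListSetDemo {
        public static void main(String[] args) {
            ConcurrentSkipListSet<String> set = new ConcurrentSkipListSet<>();
            set.add("banana");
            set.add("apple");
            set.add("cherry");
            for (String s : set) {
                System.out.println(s);   // apple, banana, cherry -- sorted order
            }
        }
    }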