What is the size of a hash-table bucket in Java?

We know that more than one object with the same hash code can be stored in a single bucket of a hash table in Java. My question is:
What is the maximum number of objects a single bucket can store?

It's unlimited. Whatever has the same hash code (after masking) goes into the same position in the hash table; each position is basically a linked list.
This can obviously hurt performance if a bucket grows large, but with a reasonable distribution of items it rarely happens that a single position holds more than one or two items.
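A minimal sketch that shows this (SameHashKey is a made-up class whose hashCode is deliberately constant, so every entry lands in one bucket, and the map still holds them all):

import java.util.HashMap;
import java.util.Map;

// Hypothetical key whose hashCode is constant, so every instance
// lands in the same bucket; equals() still tells instances apart.
final class SameHashKey {
    private final int id;
    SameHashKey(int id) { this.id = id; }
    @Override public int hashCode() { return 42; }   // force a single bucket
    @Override public boolean equals(Object o) {
        return o instanceof SameHashKey && ((SameHashKey) o).id == id;
    }
}

public class BucketDemo {
    public static void main(String[] args) {
        Map<SameHashKey, Integer> map = new HashMap<>();
        for (int i = 0; i < 10_000; i++) {
            map.put(new SameHashKey(i), i);   // all 10 000 entries share one bucket
        }
        System.out.println(map.size());       // 10000 -- nothing was dropped
    }
}

Lookups in that map are slow, of course, which is exactly the performance problem mentioned above.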

Related

How to retrieve values after a hash collision

I have read in many places that after a hash collision, Java internally uses a linked list or a tree, depending on the number of collisions. Up to there it is fine.
But how is the expected value retrieved back using the key?
It just iterates over the linked list stored in that bucket and checks each element using equals, which, unlike the hash code, identifies the key exactly.
The running time for that is linear, but only linear in the number of items stored in that specific bucket, so it is okay as long as the buckets are kept reasonably balanced.
So the implementation will make sure that a get operation, even if it has collisions, gives back the correct result in the end.
Note that Java's HashSet and HashMap are not pure linked-list-based hash tables: once a bucket holds more than a certain threshold of entries (8 by default, since Java 8), they switch that bucket to a red-black tree internally.
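For intuition, here is a minimal sketch of such a chained lookup; this is a toy, not the JDK code, and TinyMap is a made-up name:

// Toy chained hash table; a sketch only, not the JDK implementation.
class TinyMap<K, V> {
    private static class Node<K, V> {
        final K key; V value; Node<K, V> next;
        Node(K key, V value, Node<K, V> next) {
            this.key = key; this.value = value; this.next = next;
        }
    }

    @SuppressWarnings("unchecked")
    private final Node<K, V>[] table = new Node[16];  // fixed size for simplicity

    private int indexFor(Object key) {
        return key.hashCode() & (table.length - 1);   // hash -> bucket index
    }

    public void put(K key, V value) {
        int i = indexFor(key);
        for (Node<K, V> n = table[i]; n != null; n = n.next) {
            if (n.key.equals(key)) { n.value = value; return; }  // overwrite
        }
        table[i] = new Node<>(key, value, table[i]);  // prepend new node
    }

    public V get(Object key) {
        // Walk the chain in the key's bucket; equals() picks the right entry.
        for (Node<K, V> n = table[indexFor(key)]; n != null; n = n.next) {
            if (n.key.equals(key)) return n.value;
        }
        return null;
    }
}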

How to save memory when storing a lot of 'Entry' objects in a map in Java?

I want to store 1*10^8 objects in a map for searching. When my program starts, it reads and stores these objects in a map; after the reading finishes, the map is never updated until the program exits. I don't want the JVM to discard any of them. I have learned that HashMap wastes a lot of memory; is there any kind of map that can store this many objects while saving memory?
I also know that the JVM's garbage collector will keep scanning these objects, which wastes time. How can I avoid that?
Sorry, here is the situation: I am writing a bolt with Apache Storm. I want to read data from databases, and when the bolt processes a tuple I need to do calculations with that data. For performance I have to keep it in memory. I know the JVM is not good at managing a huge heap, so maybe I should try Koloboke?
HashMap needs to allocate a backing array of sufficient size in order to minimize hash collisions: two or more objects that are not equal can have the same hash code, and the probability of that depends on the quality of the hash function. Collisions are resolved by techniques such as linear probing, which stores the entry at the next unoccupied index (hash + i) mod length; quadratic probing, which stores it at the next unoccupied index (hash + i^2) mod length; and separate chaining, which stores a linked list of entries at each bucket.
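A quick sketch of those probe sequences (using i^2 for the quadratic case):

// Prints the first few slots each strategy would try for hash 7 in a
// table of length 16, until an unoccupied index is found.
public class Probing {
    public static void main(String[] args) {
        int h = 7, len = 16;
        for (int i = 0; i < 5; i++) {
            System.out.printf("linear: %2d   quadratic: %2d%n",
                    (h + i) % len, (h + i * i) % len);
        }
    }
}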
Whatever the strategy, collision probability is decreased by increasing the length of the backing array, which is exactly where the memory goes. If memory efficiency matters more, you can instead use TreeMap, which stores entries in a tree structure and creates exactly as many nodes as there are entries.
Note that there is a difference in the complexity of the get, put, and remove operations: HashMap performs them in expected O(1) time, while TreeMap takes O(log n).
Suppose you want to get an entry from a map of size 100 000 000. In the worst case (the element to be found is a leaf, i.e. located at the last level of the tree), the path down the tree has length about log2(100 000 000) ≈ 27.
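A quick sanity check of that number (just standard library math, nothing project-specific):

// Depth of a balanced binary search tree with 10^8 entries ~ log2(10^8).
public class TreeDepth {
    public static void main(String[] args) {
        System.out.println(Math.log(100_000_000d) / Math.log(2));  // ~26.6
    }
}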
Well, I am back.
At first I used about 30 GB to store about 5x10^7 key-value entries, but GC was not stable. I had made the mistake of storing doubles as Strings, which take far more memory than a double since each char is 16 bits in Java. After fixing that, GC behaved better, but still not well enough. In the end I used the file-backed 'filedb' storage in MapDB to solve it.

Is HashTable/HashMap an array?

I am a bit confused about hashing:
When we use a Hashtable/HashMap (key, value), I first understood that the internal data structure is an array (already allocated in memory).
Java's hashCode() method has an int return type, so I thought this hash value would be used directly as an index into the array; but in that case we would need 2^32 entries in the array in RAM, which is not what actually happens.
So does Java create an index from the hashCode() that has a smaller range?
Answer:
As the guys pointed out below and from the documentation: http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/util/HashMap.java
HashMap is backed by an array. The hashCode() is re-hashed, but remains an int, and the index into the array becomes h & (length - 1); so if the length of the array is 2^n, the index is just the lowest n bits of the re-hashed value.
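For reference, the Java 8 flavor of that re-hash is a single XOR that folds the high bits into the low bits; roughly (simplified from the JDK sources, with indexFor shown as a separate helper the way older JDKs had it):

final class HashMath {
    // Spread the high bits of the hash downward (this XOR is what the
    // Java 8 source does), then mask to the table size.
    static int hash(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }

    // length is always a power of two, so this keeps only the lowest n bits.
    static int indexFor(int hash, int length) {
        return hash & (length - 1);
    }
}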
The backing structure of a Java HashMap is indeed an array, but not one of 2^31 entries (int is a signed type!): it holds some smaller number of buckets, by default 16 initially. The Javadocs for HashMap explain that.
When the number of entries exceeds a certain fraction (the "load factor") of the capacity, the array grows to a larger size.
Also, each element of the array does not hold just one entry. Each element holds a structure of entries: a linked list, which since Java 8 is converted into a red-black tree once it grows past a threshold. Every entry in that structure has a hash code that maps internally to the same bucket position in the array.
Have you read the docs on this type?
http://docs.oracle.com/javase/8/docs/api/java/util/HashMap.html
You really should.
Generally the base data structure will indeed be an array.
The methods that need to find an entry (or an empty slot, in the case of adding a new object) reduce the hash code to something that fits the size of the array (generally by taking it modulo the array length) and use that as an index into the array.
Of course this makes collisions more likely, since many objects can have hash codes that reduce to the same index (collisions are possible anyway, because multiple objects may share exactly the same hash code, but now they are much more likely). There are different strategies for dealing with this: generally either a linked-list-like structure per slot, or a mechanism for picking another slot when the first matching slot is occupied by a non-equal key.
Since this adds cost, the more often such collisions happen the slower things become, and in the worst case lookup degrades to O(n) (and a slow O(n) at that).
Increasing the size of the internal store generally improves this, especially if the new size is not a multiple of the previous one (so the reduction step does not take a bunch of items colliding on one index and give them all the same index again). Some implementations grow the internal store before it is strictly necessary (while some empty space remains), for example at a certain fill percentage or after a certain number of collisions between objects whose full hash codes differ.
This means that unless the hash codes are very bad (most obviously, if they are all exactly the same), the cost of operations stays at O(1) on average.
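To see how much the table size matters, here is a small self-contained experiment (class and method names are made up): it drops a million random hash codes into a 16-slot table and into a 2^21-slot table and reports the longest chain each produces.

import java.util.Random;

public class ChainLengths {
    static int longestChain(int[] hashes, int buckets) {
        int[] counts = new int[buckets];
        int max = 0;
        for (int h : hashes) {
            int i = Math.floorMod(h, buckets);  // reduce hash to an index
            max = Math.max(max, ++counts[i]);
        }
        return max;
    }

    public static void main(String[] args) {
        int[] hashes = new Random(0).ints(1_000_000).toArray();
        System.out.println(longestChain(hashes, 16));        // huge chains
        System.out.println(longestChain(hashes, 1 << 21));   // short chains
    }
}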

Inevitable Collisions When Hashing?

If I create a new Map:
Map<Integer,String> map = new HashMap<Integer,String>();
Then I call map.put() a bunch of times, each time with a unique key (say, a million times): will there ever be a collision, or does Java's hashing algorithm guarantee no collisions as long as the keys are unique?
Hashing does not guarantee the absence of collisions even when every key is unique. In fact, the only thing required is that objects that are equal have the same hash code. The number of collisions determines how efficient retrieval will be (fewer collisions: closer to O(1); more collisions: closer to O(n)).
What an object's hash code will be depends on its type. For instance, a String's hashCode is
s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
which necessarily collapses the whole string down to a single number; it is definitely possible for two different strings to end up with the same hash code, though it will be fairly rare.
If two keys hash to the same value, the HashMap uses .equals() to determine which key actually matches. That's why it is so important to override hashCode() and equals() together and to ensure that objects that are equal have the same hash code.
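A concrete example of such a collision: "Aa" and "BB" both hash to 2112, yet a HashMap keeps them apart via equals().

import java.util.HashMap;
import java.util.Map;

public class StringCollision {
    public static void main(String[] args) {
        System.out.println("Aa".hashCode());  // 2112
        System.out.println("BB".hashCode());  // 2112
        Map<String, Integer> map = new HashMap<>();
        map.put("Aa", 1);
        map.put("BB", 2);
        System.out.println(map.get("Aa") + " " + map.get("BB"));  // 1 2
    }
}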
A hash table works somewhat as follows:
1. The hash map is created with an initial capacity (a number of buckets).
2. Each time you add an object, Java invokes the hash function of the key to get a number, then takes that number modulo the current size of the hash table.
3. The object is stored in the bucket given by the result of step 2.
So even if your keys are unique, they can still collide, unless you have as many buckets as the range of your keys' hash codes.
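For example, Integer.hashCode() is the int value itself, so with the default 16 buckets the unique keys 1 and 17 already end up in the same bucket:

public class CollisionDemo {
    public static void main(String[] args) {
        int buckets = 16;
        System.out.println(Integer.valueOf(1).hashCode() % buckets);   // 1
        System.out.println(Integer.valueOf(17).hashCode() % buckets);  // 1 -> same bucket
    }
}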
There are two things you need to know:
Even when there is a collision it does not cause a problem, because each bucket holds a list. If you put into a bucket that already has a value inside, the new entry is simply appended to that list. On retrieval, the map first finds the right bucket, then goes through each value in the bucket's list to find the one that is equal (by calling equals()).
If you are putting millions of values into the HashMap, you may wonder whether every linked list in the map will then contain thousands of values, leaving us with a slow linear search every time. The answer is no: Java's HashMap resizes itself whenever the number of entries grows larger than a certain threshold (have a look at capacity and loadFactor in the Javadoc). With a properly implemented hash code, the number of entries in each bucket stays small.
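As a rough illustration of that threshold (assuming the default capacity of 16 and load factor of 0.75, and a million insertions): the table doubles every time the entry count crosses capacity * loadFactor.

public class ResizeSchedule {
    public static void main(String[] args) {
        int capacity = 16;            // HashMap default
        float loadFactor = 0.75f;     // HashMap default
        int doublings = 0;
        while (capacity * loadFactor < 1_000_000) {  // until a million entries fit
            capacity <<= 1;
            doublings++;
        }
        System.out.println(capacity + " buckets after " + doublings + " doublings");
        // prints: 2097152 buckets after 17 doublings
    }
}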

HashMap speed greater for smaller maps

This may be a strange question, but it is based on some results I got using a Java Map: is element retrieval faster in a HashMap when the map is smaller?
I have some code that uses the containsKey and get(key) methods of a HashMap, and it seems to run faster when the number of elements in the map is smaller. Is that so?
My understanding is that a HashMap uses a hash function to reach a certain slot of the map, and in some implementations that slot refers to a linked list (because several keys can hash to the same slot), or to other slots in the map when open addressing is used.
Is it correct that lookups can be faster when the map has fewer elements?
I need to extend my question with a concrete example.
I have two cases; in both, the total number of elements is the same.
In the first case I have 10 HashMaps, and I am not aware of how the elements are distributed among them. The execution time of that part of the algorithm is 141 ms.
In the second case I have 25 HashMaps with the same total number of elements. The execution time of the same algorithm is 69 ms.
In both cases, a for loop goes through each of the HashMaps, tries to find the same elements, and gets them if present.
Can the execution time be smaller because each individual search inside a smaller HashMap is faster, and therefore so is their sum?
I know this is very strange, but is something like this somehow possible, or am I doing something wrong?
A Map<Integer, Double> is being used. It is hard to tell what the distribution of elements is, since this is actually an implementation of the KMeans clustering algorithm and the elements are representations of cluster centroids, so they mostly depend on the initialization of the algorithm. Also, the total number of elements will not always be exactly the same; I tried to simplify the problem, sorry if that was misleading.
The number of collisions is decisive for a slowdown.
Assume an array of some size; the hash code modulo that size then points to the index where the object is put. Two objects with the same index collide.
Having a large capacity (array size) relative to the number of elements helps.
With HashMap there are overloaded constructors with extra settings.
public HashMap(int initialCapacity, float loadFactor)
Constructs an empty HashMap with the specified initial capacity and load factor.
You might experiment with that.
For a specific key class used with a HashMap, having a good hashCode helps too; hash codes are a mathematical field of their own.
Of course using less memory helps at the processor/physical-memory level, but I doubt that has much influence in this case.
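On the hashCode point, a typical well-behaved key class (Point is just an illustrative example) combines its fields with Objects.hash and keeps equals() consistent with it:

import java.util.Objects;

final class Point {
    final int x, y;
    Point(int x, int y) { this.x = x; this.y = y; }
    @Override public boolean equals(Object o) {
        if (!(o instanceof Point)) return false;
        Point p = (Point) o;
        return p.x == x && p.y == y;
    }
    @Override public int hashCode() {
        return Objects.hash(x, y);  // 31-based combination of the fields
    }
}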
Does your timing account only for the cost of get/containsKey, or are you also performing puts in the timed section? If so, and if you are using the default constructor (initial capacity 16, load factor 0.75), then the larger hash tables will need to resize themselves more often than the smaller ones. As Joop Eggen says in his answer, try playing around with the initial capacity in the constructor: e.g. if you know that you have N elements, set the initial capacity to N / number_of_hash_tables or something along those lines. This ought to give both the smaller and the larger hash tables sufficient capacity that they never need to be resized.
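A sketch of that presizing idea, following the Javadoc rule of thumb that no rehash happens if the initial capacity exceeds the expected entry count divided by the load factor (newPresizedMap is a made-up helper):

import java.util.HashMap;
import java.util.Map;

public class Presized {
    static <K, V> Map<K, V> newPresizedMap(int expectedEntries) {
        // capacity >= expected / loadFactor means no resize is ever needed
        return new HashMap<>((int) (expectedEntries / 0.75f) + 1);
    }

    public static void main(String[] args) {
        Map<Integer, Double> m = newPresizedMap(1_000_000);
        for (int i = 0; i < 1_000_000; i++) m.put(i, i * 0.5);  // no rehashing
    }
}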
