A Java hash code is an int, so it has 2^32 possible values.
When we create a Hashtable/HashMap, it creates buckets whose number equals the initial capacity of the map. In other words, it creates an array of size "initial capacity".
Question
1. How does it map the hashcode of a key (java object) to the bucket index?
2. Since the size of a hashmap can grow, can the size of the hashmap be equal to 2^32? If yes, is it wise to have an array of size 2^32?
Here is a link to the current source code: http://www.docjar.com/html/api/java/util/HashMap.java.html
The answers to your questions are (in part) implementation specific.
1) See the code. Note that your assumption about how initialCapacity is implemented is incorrect ... for Oracle Java 6 and 7 at least. Specifically, initialCapacity is not necessarily the hashmap's array size.
2) The size of a HashMap is the number of entries, and that can exceed 2^32! I assume that you are actually talking about the capacity. The size of the HashMap's array is theoretically limited to 2^31 - 1 (the largest size for a Java array). For the current implementations, MAX_CAPACITY is actually 2^30; see the code.
3) "... it is wise to have an array of size 2^32?" It is not possible with Java as currently defined, and it is unwise to try to do something that is impossible.
If you are really asking about the design of hash table data structures in Java, then there is a trade-off between efficiency for normal sized hash tables, and ones that are HUGE; i.e. maps with significantly more than 2^30 elements. The HashMap implementations are tuned to work best for normal sized maps. If you routinely had to deal with HUGE maps, and performance was critical, then you should be looking to implement a custom map class that is tuned to your specific requirements.
The size of a Java array is actually limited to Integer.MAX_VALUE elements, 2^31-1.
HashMap uses power-of-two array sizes, so the largest it could use is 2^30 (an array of 2^31 would exceed the array size limit). You would need a large physical memory to make that reasonable.
HashMap does a series of shift and xor operations to reduce some sources of collisions before doing a simple bitwise AND to get the bucket index.
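As a rough sketch of what those two steps look like (simplified and with my own names, modeled on the OpenJDK 6/7 source linked above, so treat it as illustrative rather than the exact implementation):

```java
public class BucketIndexSketch {

    // Supplemental hash: a series of shifts and xors that spreads the influence
    // of the high-order bits of the original hashCode into the low-order bits.
    static int spread(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    // With a power-of-two table length, masking keeps only the lowest bits,
    // which is equivalent to (h mod length) for non-negative h but cheaper.
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    public static void main(String[] args) {
        Object key = "some key";
        int bucket = indexFor(spread(key.hashCode()), 16); // default capacity is 16
        System.out.println("bucket index: " + bucket);
    }
}
```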
Related
I am confused about hashing:
When we use a Hashtable/HashMap (key, value), my understanding is that the internal data structure is an array (already allocated in memory).
The Java hashCode() method has an int return type, so I thought this hash value would be used directly as an index for the array, in which case we would need 2^32 entries in the array in RAM, which is not what actually happens.
So does Java derive an index with a smaller range from the hashCode()?
Answer:
As the guys pointed out below and from the documentation: http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/util/HashMap.java
Internally, HashMap uses an array. The hashCode() is re-hashed (but is still an int), and the index into the array becomes h & (length - 1); so if the length of the array is 2^n, the index is just the lowest n bits of the re-hashed value.
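A tiny standalone illustration of that masking step (not JDK code, just the arithmetic):

```java
public class MaskDemo {
    public static void main(String[] args) {
        int h = 0b1011_0110_1101;      // some re-hashed value
        int length = 16;               // a 2^4-bucket table
        int index = h & (length - 1);  // keeps the lowest 4 bits: 0b1101
        System.out.println(index);     // prints 13
    }
}
```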
The structure for a Java HashMap is not just an array. It is an array, but not one of 2^31 entries (int is a signed type!); rather, it has some smaller number of buckets, by default 16 initially. The Javadocs for HashMap explain that.
When the number of entries exceeds a certain fraction (the "load factor") of the capacity, the array grows to a larger size.
Each element of the array does not hold only one entry. Each element holds a structure of entries (in current versions a linked list that is converted to a red-black tree once a bucket gets large; formerly always a list). Every entry in that structure has a hash code that maps internally to the same bucket position in the array.
Have you read the docs on this type?
http://docs.oracle.com/javase/8/docs/api/java/util/HashMap.html
You really should.
Generally the base data structure will indeed be an array.
The methods that need to find an entry (or empty gap in the case of adding a new object) will reduce the hash code to something that fits the size of the array (generally by modulo), and use this as an index into that array.
Of course this makes the chance of collisions more likely, since many objects could have a hash code that reduces to the same index (possible anyway since multiple objects might have exactly the same hash code, but now much more likely). There are different strategies for dealing with this, generally either by using a linked-list-like structure or a mechanism for picking another slot if the first slot that matched was occupied by a non-equal key.
Since this adds cost, the more often such collisions happen, the slower things become, and in the worst case lookup would in fact be O(n) (and slow as O(n) goes, too).
Increasing the size of the internal store will generally improve this, especially if the new size is not a multiple of the previous size (so the operation that reduces the hash code to an index won't take a bunch of items colliding on the same index and give them all the same index again). Some implementations increase the internal size before it is absolutely necessary (while there is still some empty space remaining), based on criteria such as a certain fill percentage or a certain number of collisions between objects that don't share the same full hash code.
This means that unless the hash codes are very bad (most obviously, if they are in fact all exactly the same), the operations stay O(1) on average.
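To make the collision handling concrete, here is a minimal separate-chaining sketch (illustrative only: my own names, no resizing, no treeification):

```java
import java.util.LinkedList;
import java.util.List;

public class ChainedMap<K, V> {

    private static class Entry<K, V> {
        final K key;
        V value;
        Entry(K key, V value) { this.key = key; this.value = value; }
    }

    private final List<Entry<K, V>>[] buckets;

    @SuppressWarnings("unchecked")
    public ChainedMap(int capacity) {
        buckets = new List[capacity];
    }

    // Reduce the (possibly negative) hash code to a valid array index.
    private int indexFor(K key) {
        return Math.floorMod(key.hashCode(), buckets.length);
    }

    public void put(K key, V value) {
        int i = indexFor(key);
        if (buckets[i] == null) buckets[i] = new LinkedList<>();
        for (Entry<K, V> e : buckets[i]) {
            if (e.key.equals(key)) { e.value = value; return; } // same key: overwrite
        }
        buckets[i].add(new Entry<>(key, value));                // collision: chain it
    }

    public V get(K key) {
        int i = indexFor(key);
        if (buckets[i] == null) return null;
        for (Entry<K, V> e : buckets[i]) {
            if (e.key.equals(key)) return e.value;
        }
        return null;
    }
}
```

The longer the chains get, the closer lookups drift toward O(n), which is why real implementations resize the array (and, in current HashMap versions, turn large buckets into trees).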
Before starting to explain my problem, I should mention that I am not looking for a way to increase Java heap memory. I should strictly store these objects.
I am working on storing a huge number (5-10 GB) of DNA sequences and their counts (Integer) in a hash table. The DNA sequences (of length 32 or less) consist of the chars 'A', 'C', 'G', 'T', and 'N' (undefined). As we know, when storing a large number of objects in memory, Java has poor space efficiency compared to lower-level languages like C and C++. Thus, if I store the sequences as Strings (it holds about 100 MB of memory for a sequence of length ~30), I see the error.
I tried to represent the nucleic acids as 'A'=00, 'C'=01, 'G'=10, 'T'=11 and to neglect 'N' (because, as a fifth acid, it breaks the 2-bits-per-char transform), and then to concatenate these 2-bit acids into a byte array. That brought some improvement, but unfortunately I see the error again after a couple of hours. I need a convenient solution, or at least a workaround, to handle this error. Thank you in advance.
This may be a weird idea, and it would require quite a lot of work, but here is what I would try:
You already pointed out two individual subproblems of your overall task:
the default HashMap implementation may be suboptimal for such large collection sizes
you need to store something else than strings
The map implementation
I would recommend writing a highly tailored hash map implementation for the Map<String, Long> interface. Internally you do not have to store strings. Unfortunately 5^32 > 2^64, so there is no way to pack your whole string into a single long; well, let's stick to two longs for a key. You can do the String-to-long[2] conversion (and back) fairly efficiently on the fly when a string key is handed to your map implementation (use bit shifts etc.).
As for packing the values, here are some considerations:
To store key-value pairs, a standard hash map needs an array of N buckets, where N is the current capacity; once the bucket for a hash code is found, it needs a linked list of key-value pairs to resolve keys that produce identical hash codes. For your specific case you could try to optimize it in the following way:
use a long[] of size 3N, where N is the capacity, to store both keys and values in one contiguous array
in this array, at locations 3 * (hashcode % N) and 3 * (hashcode % N) + 1, store the long[2] representation of the first (or only) key that maps to this bucket (zero if the bucket is unused); at location 3 * (hashcode % N) + 2, store the corresponding count
for all those cases where a different key results in the same hash code and thus the same bucket, store the data in a standard HashMap<Long2KeyWrapper, Long>. The idea is to keep the capacity of the array mentioned above (resizing it correspondingly) large enough that by far the largest part of the data lives in the contiguous array and not in the fallback hash map. This will dramatically reduce the storage overhead of the hash map (a rough sketch of this layout is given after this list)
do not expand the capacity by doubling (N → 2N); make smaller growth steps, e.g. 10-20%. This will cost performance when populating the map, but will keep your memory footprint under control
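A rough sketch of that layout under my own assumptions (hypothetical class names PackedCountMap and LongPair; resizing and the smaller growth steps are left out for brevity; keys are assumed to already be encoded as two longs, see "The keys" below):

```java
import java.util.HashMap;
import java.util.Map;

public class PackedCountMap {

    private final int capacity;   // N buckets
    private final long[] table;   // 3 slots per bucket: keyHi, keyLo, count
    // Fallback for keys that collide with a different key in the packed array.
    private final Map<LongPair, Long> overflow = new HashMap<>();

    public PackedCountMap(int capacity) {
        this.capacity = capacity;
        this.table = new long[3 * capacity];
    }

    private int bucket(long hi, long lo) {
        long h = hi * 31 + lo;                          // simple mix of the two halves
        return (int) ((h & Long.MAX_VALUE) % capacity); // non-negative bucket index
    }

    public void increment(long hi, long lo) {
        int base = 3 * bucket(hi, lo);
        if (table[base + 2] == 0) {                     // counts are >= 1, so 0 = empty
            table[base] = hi;
            table[base + 1] = lo;
            table[base + 2] = 1;
        } else if (table[base] == hi && table[base + 1] == lo) {
            table[base + 2]++;                          // same key: bump count in place
        } else {
            overflow.merge(new LongPair(hi, lo), 1L, Long::sum); // true collision
        }
    }

    public long count(long hi, long lo) {
        int base = 3 * bucket(hi, lo);
        if (table[base + 2] != 0 && table[base] == hi && table[base + 1] == lo) {
            return table[base + 2];
        }
        return overflow.getOrDefault(new LongPair(hi, lo), 0L);
    }

    // Minimal stand-in for the Long2KeyWrapper mentioned above.
    record LongPair(long hi, long lo) {}
}
```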
The keys
Given the inequality 5^32 > 2^64, your idea to use bits to encode the 5 letters seems to be the best I can think of right now. Use 3 bits per letter and, correspondingly, a long[2] per key.
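A possible encoding along those lines, as a sketch (the helper names are mine; packing 21 bases per long avoids any 3-bit code straddling a word boundary, and codes start at 1 so that sequences of different lengths stay distinguishable):

```java
public class DnaKeyEncoder {

    private static final String ALPHABET = "ACGTN";

    // Packs a sequence of up to 42 bases (so length <= 32 easily fits) into two longs.
    public static long[] encode(String seq) {
        if (seq.length() > 42) {
            throw new IllegalArgumentException("sequence too long for two longs");
        }
        long[] packed = new long[2];
        for (int i = 0; i < seq.length(); i++) {
            long code = ALPHABET.indexOf(seq.charAt(i)) + 1; // 1..5, 0 means "no base"
            if (code == 0) {
                throw new IllegalArgumentException("unexpected base: " + seq.charAt(i));
            }
            packed[i / 21] |= code << ((i % 21) * 3);        // 3 bits per base, 21 per long
        }
        return packed;
    }

    public static void main(String[] args) {
        long[] key = encode("ACGTNACGTNACGTNACGTNACGTNACGTNAC"); // 32 bases
        System.out.println(Long.toHexString(key[0]) + " " + Long.toHexString(key[1]));
    }
}
```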
I recommend you look into the Trove4j Collections API; it offers Collections that hold primitives which will use less memory than their boxed, wrapper classes.
Specifically, you should check out their TObjectIntHashMap.
Also, I wouldn't recommend storing anything as a String or char until JDK 9 is released, as the backing char array of a String is UTF-16 encoded, using two bytes per char. JDK 9's compact strings store strings containing only Latin-1 characters with one byte per char.
If you're working with on the order of ~10 GB of data, or at least data with an in-memory representation of ~10 GB, then you might need to think of ways to write the data you don't need at the moment to disk and load individual portions of your dataset into memory to work on them.
I had this exact problem a few years ago when I was conducting research with Monte Carlo simulations, so I wrote a Java data structure to solve it. You can clone/fork the source here: github.com/tylerparsons/surfdep
The library supports both MySQL and SQLite as the underlying database. If you don't have either, I'd recommend SQLite as it's much quicker to set up.
Full disclaimer: this is not the most efficient implementation, but it will handle very large datasets if you let it run for a few hours. I tested it successfully with matrices of up to 1 billion elements on my Windows laptop.
This may be a strange question, but it is based on some results I get using a Java Map: is element retrieval faster when the HashMap is smaller?
I have some code that uses the containsKey and get(key) methods of a HashMap, and it seems to run faster if the number of elements in the Map is smaller. Is that so?
My understanding is that a HashMap uses a hash function to reach a certain slot of the map, and in some implementations that slot is a reference to a linked list (because several keys can hash to the same slot), or to other slots of the map, when implemented fully statically.
Is this correct: can lookups be faster if the Map has fewer elements?
I need to extend my question, with a concrete example.
I have 2 cases; in both, the total number of elements is the same.
In the first case, I have 10 HashMaps, and I'm not aware of how the elements are distributed. The execution time of that part of the algorithm is 141 ms.
In the second case, I have 25 HashMaps with the same total number of elements. The execution time of the same algorithm is 69 ms.
In both cases, I have a for loop that goes through each of the HashMaps, tries to find the same elements, and gets them if present.
Can it be that the execution time is smaller because each individual search inside a HashMap is faster, and so is their sum?
I know that this is very strange, but is something like this somehow possible, or am I doing something wrong?
A Map<Integer, Double> is considered. It is hard to tell what the distribution of elements is, since this is actually an implementation of the KMeans clustering algorithm and the elements are representations of cluster centroids. That means they will mostly depend on the initialization of the algorithm. Also, the total number of elements will not always be the same, but I have tried to simplify the problem; sorry if that was misleading.
The number of collisions is decisive for a slowdown.
Assume an array of some size; the hash code modulo the size then points to an index where the object is put. Two objects with the same index collide.
Having a large capacity (array size) with respect to the number of elements helps.
With HashMap there are overloaded constructors with extra settings.
public HashMap(int initialCapacity, float loadFactor)
Constructs an empty HashMap with the specified initial capacity and load factor.
You might experiment with that.
For a specific key class used with a HashMap, having a good hashCode can help too. Hash codes are a separate mathematical field.
Of course using less memory helps on the processor / physical memory level, but I doubt an influence in this case.
Does your timing take into account only the cost of get / containsKey, or are you also performing puts in the timed code section? If so, and if you're using the default constructor (initial capacity 16, load factor 0.75), then the larger hash tables will need to resize themselves more often than the smaller hash tables will. Like Joop Eggen says in his answer, try playing around with the initial capacity in the constructor, e.g. if you know that you have N elements, set the initial capacity to N / number_of_hash_tables or something along those lines; this ought to give both the smaller and larger hash tables sufficient capacity that they won't need to be resized.
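For example (a sketch only, with made-up element counts), pre-sizing each map so it never needs to rehash while being populated could look like this:

```java
import java.util.HashMap;
import java.util.Map;

public class PresizeDemo {
    public static void main(String[] args) {
        int totalElements = 100_000;  // hypothetical N
        int numberOfMaps = 25;        // hypothetical number of hash tables
        int perMap = totalElements / numberOfMaps;

        // expected / loadFactor + 1 keeps the map below its resize threshold
        int initialCapacity = (int) (perMap / 0.75f) + 1;
        Map<Integer, Double> centroids = new HashMap<>(initialCapacity, 0.75f);

        System.out.println("initial capacity per map: " + initialCapacity);
    }
}
```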
What is the maximum size of HashSet, Vector, LinkedList? I know that ArrayList can store more than 3277000 numbers.
However, the size of a list depends on the memory (heap) size. If it reaches the maximum, the JVM throws an OutOfMemoryError.
But I don't know the limit for the number of elements in HashSet, Vector and LinkedList.
There is no specified maximum size of these structures.
The actual practical size limit is probably somewhere in the region of Integer.MAX_VALUE (i.e. 2147483647, roughly 2 billion elements), as that's the maximum size of an array in Java.
A HashSet uses a HashMap internally, so it has the same maximum size as that
A HashMap uses an array which always has a size that is a power of two, so it can be at most 2^30 = 1073741824 elements big (since the next power of two is bigger than Integer.MAX_VALUE).
Normally the number of elements is at most the number of buckets multiplied by the load factor (0.75 by default). However, when the HashMap stops resizing, then it will still allow you to add elements, exploiting the fact that each bucket is managed via a linked list. Therefore the only limit for elements in a HashMap/HashSet is memory.
A Vector uses an array internally which has a maximum size of exactly Integer.MAX_VALUE, so it can't support more than that many elements
A LinkedList doesn't use an array as the underlying storage, so that doesn't limit the size. It uses a classical doubly linked list structure with no inherent limit, so its size is only bounded by the available memory. Note that a LinkedList will report the size wrongly if it is bigger than Integer.MAX_VALUE, because it uses an int field to store the size and the return type of size() is int as well.
Note that the Collection API does define how a Collection with more than Integer.MAX_VALUE elements should behave. Most importantly, it states this in the size() documentation:
If this collection contains more than Integer.MAX_VALUE elements, returns Integer.MAX_VALUE.
Note that while HashMap, HashSet and LinkedList seem to support more than Integer.MAX_VALUE elements, none of those implement the size() method in this way (i.e. they simply let the internal size field overflow).
This leads me to believe that other operations also aren't well-defined in this condition.
So I'd say it's safe to use those general-purpose collections with up to Integer.MAX_VALUE elements. If you know that you'll need to store more than that, then you should switch to dedicated collection implementations that actually support this.
In all cases, you're likely to be limited by the JVM heap size rather than anything else. Eventually you'll always get down to arrays so I very much doubt that any of them will manage more than 2^31 - 1 elements, but you're very, very likely to run out of heap before then anyway.
It very much depends on the implementation details.
A HashSet uses an array as an underlying store, which by default it attempts to grow when the collection is 75% full. This means it will fail if you try to add more than about 750,000,000 entries. (It cannot grow the array from 2^30 to 2^31 entries.)
Increasing the load factor increases the maximum size of the collection. e.g. a load factor of 10 allows 10 billion elements. (It is worth noting that HashSet is relatively inefficient past 100 million elements as the distribution of the 32-bit hashcode starts to look less random, and the number of collisions increases)
A Vector doubles its capacity and starts at 10. This means it will fail to grow above approx 1.34 billion. Changing the initial size to 2^n-1 gives you slightly more head room.
BTW: Use ArrayList rather than Vector if you can.
A LinkedList has no inherent limit and can grow beyond 2.1 billion. At that point size() could return Integer.MAX_VALUE; however, some functions such as toArray will fail, as they cannot put all the objects into an array: you will instead get the first Integer.MAX_VALUE elements rather than an exception.
As @Joachim Sauer notes, the current OpenJDK could return an incorrect result for sizes above Integer.MAX_VALUE, e.g. a negative number.
The maximum size depends on the memory settings of the JVM and of course the available system memory. Specific size of memory consumption per list entry also differs between platforms, so the easiest way might be to run simple tests.
As stated in other answers, an array cannot reach 2^31 entries. Other data types are limited either by this or they will likely misreport their size() eventually. However, these theoretical limits cannot be reached on some systems:
On a 32-bit system, the number of addressable bytes never exceeds 2^32, and that is assuming that the operating system takes up no memory. A 32-bit pointer is 4 bytes. Anything which does not rely on arrays must include at least one pointer per entry: this means that the maximum number of entries is 2^32 / 4, or 2^30, for things that do not utilize arrays.
A plain array can achieve its theoretical limit, but only a byte array; a short array of length 2^31 - 1 would already use up about 2^32 + 38 bytes.
Some Java VMs have introduced a memory model that uses compressed pointers. By adjusting pointer alignment, slightly more than 2^32 bytes may be referenced with 32-bit pointers, around four times more. This is enough to cause a LinkedList's size() to become negative, but not enough to let it wrap around to zero.
A 64-bit system has 64-bit pointers, making all pointers twice as big and non-array lists a good deal fatter. This also means that the maximum supported capacity jumps to 2^64 bytes. This is enough for a 2D array to reach its theoretical maximum: byte[0x7fffffff][0x7fffffff] uses approximately 40 + 40·(2^31 − 1) + (2^31 − 1)^2 = 40 + 40·(2^31 − 1) + (2^62 − 2^32 + 1) bytes.