What is the maximum size of HashSet, Vector, LinkedList? I know that ArrayList can store more than 3277000 numbers.
However the size of list depends on the memory (heap) size. If it reaches maximum the JDK throws an OutOfMemoryError.
But I don't know the limit for the number of elements in HashSet, Vector and LinkedList.
There is no specified maximum size of these structures.
The actual practical size limit is probably somewhere in the region of Integer.MAX_VALUE (i.e. 2147483647, roughly 2 billion elements), as that's the maximum size of an array in Java.
A HashSet uses a HashMap internally, so it has the same maximum size as that
A HashMap uses an array which always has a size that is a power of two, so it can be at most 230 = 1073741824 elements big (since the next power of two is bigger than Integer.MAX_VALUE).
Normally the number of elements is at most the number of buckets multiplied by the load factor (0.75 by default). However, when the HashMap stops resizing, then it will still allow you to add elements, exploiting the fact that each bucket is managed via a linked list. Therefore the only limit for elements in a HashMap/HashSet is memory.
A Vector uses an array internally which has a maximum size of exactly Integer.MAX_VALUE, so it can't support more than that many elements
A LinkedList doesn't use an array as the underlying storage, so that doesn't limit the size. It uses a classical doubly linked list structure with no inherent limit, so its size is only bounded by the available memory. Note that a LinkedList will report the size wrongly if it is bigger than Integer.MAX_VALUE, because it uses a int field to store the size and the return type of size() is int as well.
Note that while the Collection API does define how a Collection with more than Integer.MAX_VALUE elements should behave. Most importantly it states this the size() documentation:
If this collection contains more than Integer.MAX_VALUE elements, returns Integer.MAX_VALUE.
Note that while HashMap, HashSet and LinkedList seem to support more than Integer.MAX_VALUE elements, none of those implement the size() method in this way (i.e. they simply let the internal size field overflow).
This leads me to believe that other operations also aren't well-defined in this condition.
So I'd say it's safe to use those general-purpose collections with up to Integer.MAX_VLAUE elements. If you know that you'll need to store more than that, then you should switch to dedicated collection implementations that actually support this.
In all cases, you're likely to be limited by the JVM heap size rather than anything else. Eventually you'll always get down to arrays so I very much doubt that any of them will manage more than 231 - 1 elements, but you're very, very likely to run out of heap before then anyway.
It very much depends on the implementation details.
A HashSet uses an array as an underlying store which by default it attempt to grow when the collection is 75% full. This means it will fail if you try to add more than about 750,000,000 entries. (It cannot grow the array from 2^30 to 2^31 entries)
Increasing the load factor increases the maximum size of the collection. e.g. a load factor of 10 allows 10 billion elements. (It is worth noting that HashSet is relatively inefficient past 100 million elements as the distribution of the 32-bit hashcode starts to look less random, and the number of collisions increases)
A Vector doubles its capacity and starts at 10. This means it will fail to grow above approx 1.34 billion. Changing the initial size to 2^n-1 gives you slightly more head room.
BTW: Use ArrayList rather than Vector if you can.
A LinkedList has no inherent limit and can grow beyond 2.1 billion. At this point size() could return Integer.MAX_VALUE, however some functions such as toArray will fail as it cannot put all objects into an array, in will instead give you the first Integer.MAX_VALUE rather than throw an exception.
As #Joachim Sauer notes, the current OpenJDK could return an incorrect result for sizes above Integer.MAX_VALUE. e.g. it could be a negative number.
The maximum size depends on the memory settings of the JVM and of course the available system memory. Specific size of memory consumption per list entry also differs between platforms, so the easiest way might be to run simple tests.
As stated in other answers, an array cannot reach 2^31 entries. Other data types are limited either by this or they will likely misreport their size() eventually. However, these theoretical limits cannot be reached on some systems:
On a 32 bit system, the number of bytes available never exceeds 2^32 exactly. And that is assuming that you have no operating system taking up memory. A 32 bit pointer is 4 bytes. Anything which does not rely on arrays must include at least one pointer per entry: this means that the maximum number of entries is 2^32/4 or 2^30 for things that do not utilize arrays.
A plain array can achieve it's theoretical limit, but only a byte array, a short array of length 2^31-1 would use up about 2^32+38 bytes.
Some java VMs have introduced a new memory model that uses compressed pointers. By adjusting pointer alignment, slightly more than 2^32 bytes may be referenced with 32 byte pointers. Around four times more. This is enough to cause a LinkedList size() to become negative, but not enough to allow it to wrap around to zero.
A sixty four bit system has sixty four bit pointers, making all pointers twice as big, making non array lists a bunch fatter. This also means that the maximum capacity supported jumps to 2^64 bytes exactly. This is enough for a 2D array to reach its theoretical maximum. byte[0x7fffffff][0x7fffffff] uses memory apporximately equal to 40+40*(2^31-1)+(2^31-1)(2^31-1)=40+40(2^31-1)+(2^62-2^32+1)
Related
I have a use-case where I need to store Key - Value pairs of size approx. 500 Million entries in sinle VM of size 8 GB. Key and Value are of type Long. Key is auto incremented starting from 1, 2 ,3, so on..
Only once I build this Map[K-V] structure at the start of program as a exclusive operation, Once this is build, used only for lookup, No update or delete is performed in this structure.
I have tried this with java.util.hashMap but as expected it consumes a lot of memory and program give OOM : Heap usage exceeds Error.
I need some guidance on following which helps in reducing the memory footprint, I am Ok with some degradation in access performance.
What are the other alternative (from java collection or other libraries)
that can be tried here.
What is a recommended way to get the memory footprint by this Map, for
comparison purpose.
Just use a long[] or long[][].
500 million ascending keys is less than 2^31. And if you go over 2^31, use a long[][] where the first dimension is small and the second one is large.
(When the key type is an integer, you only need a complicated "map" data structure if the key space is sparse.)
The space wastage in a 1D array is insignificant. Every Java array node has 12 byte header, and the node size is rounded up to a multiple of 8 bytes. So a 500 million entry long[] will take so close to 500 million x 8 bytes == 4 billion bytes that it doesn't matter.
However, a JVM typically cannot allocate a single object that takes up the entire available heap space. If virtual address space is at a premium, it would be advisable to use a 2-D array; e.g. new long[4][125_000_000]. This makes the lookups slightly more complicated, but you will most likely reduce the memory footprint by doing this.
If you don't know beforehand the number of keys to expect, you could do the same thing with a combination of arrays and ArrayList objects. But an ArrayList has the problem that if you don't set an (accurate) capacity, the memory utilization is liable to be suboptimal. And if you populate an ArrayList by appending to it, the instantaneous memory demand for the append can be as much as 3 times the list's current space usage.
There is no reason for using a Map in your case.
If you just have a start index and further indizes are just constant increments, just use a List:
List<Long> data=new ArrayList<>(510_000_000);//capacity should ideally not be reached, if it is reached, the array behind the ArrayList needs to be reallocated, the allocated memory would be doubled by that
data.add(1337L);//inserting, how often you want
long value=data.get(1-1);//1...your index that starts with 1, -1...because your index starts with 1, you should subtract one from the index.
If you don't even add more elements and know the size from the start, an array will be even better:
long[] data=long[510_000_000];//capacity should surely not be reached, you will need to create a new array and copy all data if it is higher
int currentIndex=0;
data[currentIndex++]=1337L//inserting, as often as it is smaller than the size
long value=data[1-1];//1...your index that starts with 1, -1...because your index starts with 1, you should subtract one from the index.
Note that you should check the index (currentIndex) before inserting so that it is smaller than the array length.
When iterating, use currentIndex+1 as length instead of .length.
Create an array with the size you need and whenever you need to access it, use arr[i-1] (-1 because your indizes start with 1 instead of zero).
If you "just" have 500 million entries, you will not reach the integer limit and a simple array will be fine.
If you need more entries and you have sufficient memories, use an array of arrays.
The memory footprint of using an array this big is the memory footprint of the data and a bit more.
However, if you don't know the size, you should use a higher length/capacity then you may need. If you use an ArrayList, the memory footprint will be doubled (temporarily tripled) whenever the capacity is reached because it needs to allocate a bigger array.
A Map would need an object for each entry and an array of lists for all those object that would highly increase the memory footprint. The increasing of the memory footprint (using HashMap) is even worse than with ÀrrayLists as the underlaying array is reallocated even if the Map is not completely filled up.
But consider saving it to the HDD/SSD if you need to store that much data. In most cases, this works much better. You can use RandomAccessFile in order to access the data on the HDD/SSD on any point.
I am parsing data where precision is not my main concern. I often get java.lang.OutOfMemoryError even if I use maximum Java heap size. So my main concern here is memory usage, and java heap space. Should I use double or float data type?
I consistently get OOM exceptions because I use a great number of ArrayLists with numbers.
Well that is your problem!
An ArrayList of N 32-bit floating point values takes at least1 20 * N bytes in a 32-bit JVM and 24 * N bytes in a 64-bit JVM2.
An ArrayList of N 64-bit floating point values takes the same amount of space3.
The above only accounts for the backing array and the list elements. If you have huge numbers of small ArrayList objects, the overhead of the ArrayList object itself may be significant. (Add 16 or 24 bytes for each ArrayList object`.)
If you make use of dynamic resizing, this may generate object churn as the backing array grows. At some points, the backing array may be as much as twice as large as it needs to be.
By contrast:
An array of 32-bit floating point values takes approximately 4 * N bytes4.
An array of 64-bit floating point values takes approximately 8 * N bytes4.
There is no wastage due to dynamic resizing.
Solutions:
ArrayList<Float> versus ArrayList<Double> makes no difference. It is NOT a solution
For maximal saving, use float[] or double[] depending on your precision requirements. Preallocate the arrays to hold the exact number of elements required.
If you want the flexibility of dynamic resizing there are 3rd-party libraries that implement space efficient lists of primitive types. Alternatively implement your own. However, you won't be able to use the standard List<...> API because that forces you down the path of using Float OR Double.
1 - The actual space used depends on how the ArrayList was created and populated. If you pre-allocate an ArrayList with exactly the correct capacity, you will use the space I said above. If you build the array by repeatedly appending to an ArrayList with the default initial capacity, you will use on average N * 2 bytes extra space for a 32-bit JVM. This is due to the heuristic that ArrayList uses to grow the backing array when it is full.
2 - On a 64-bit JVM, a pointer occupies 8 bytes rather than 4 ... unless you are using compressed oops.
3 - The reason it takes the same amount of bytes is that on a typical JVM a Float and a Double are both 16 bytes due to heap node padding.
4 - There is a header overhead of (typically) 12 bytes per array, and the array's heap node size is padded to a multiple of 8 bytes.
If your memory usage is related to a huge amount (many millions) of floating-point numbers (which can be verified with a decent memory profiler), then you're most probably storing them in some data structures like arrays or lists.
Recommendations (I guess, you are already following most of them...):
Prefer float over double if number range and precision are sufficient, as that consumes only half the size.
Do not use the java.lang.Float or java.lang.Double classes for storage, as they hav a considerable memory overhead compared to the naked scalar values.
Be sure to use arrays, not collections like java.util.List, as they store boxed java.lang.Float instances instead of the naked numbers.
But above that, have a decent memory profiler show you which instances occupy most of your memory. Maybe there are other memory consumers besides the float/double data.
EDIT:
The OP's recent comment "I consistently get OOM exceptions because I use a great number of ArrayLists with numbers" makes it clear. ArrayList<Float> wastes a lot of memory when compared to float[] (Stephen C gave detailed numbers in his answer), but gives the benefit of dynamic resizing.
So, I see the following possibilities:
If you can tell the array size from the beginning, then immediately use float[] arrays.
If you need the dynamic size while initializing instances, use ArrayList<Float> while building one object (when size still increases), and then copy the contents to a float[] array for long-term storage. Then the wasteful ArrayLists exist only for a limited timespan.
If you need dynamic sizes over the whole lifespan of your data, create your own FloatArrayList class based on a float[] array, resembling the ArrayList<Float> as far as your code needs it (that can range from a very shallow implementation up to a full-featured List, maybe based on AbstractList).
How to declare an array of size 10^9 in java ?.I have tried Array list but the problem is i need to find minimum and maximum element in array and so I need to compare 0th element of an array with all other elements of array and i initially need some fixed size of array which is required in the input format of array on code-chef. Can anybody help?. I tried using long array but it gave out of memory error.
A Java array can have a maximum size equal to Integer.MAX_VALUE (or in some cases a slightly different value) which is about 2.3 * 10^9, so creating that large an array is theoretically possible. However since 10^9 would mean the prefix giga (to make it easier to read) the array would have a size of at least 1GB (when using byte[]). Depending on the data type you're using the array probably just takes too much memory (an int[] would already take up 4GB).
You could try to increase the JVM's max memory by using the -Xmx option (e.g. to allow a maximum of 4GB you could use -Xmx=4g) but you're still limited by the maximum of adressable memory (e.g. IIRC a 32-bit JVM can only adress up top 4GB in total) and available memory.
Alternatively you could try and split the array over multiple machines or JVMs and employ some distributed approach. Or you could write the array to a (memory mapped) file and keep only a part of the array in memory.
The best approach, however, would probably be to check whether you really need that much memory. In many cases using some clever algorithms or structures can dramatically reduce memory requirements. What to use depends on what you're trying to achieve in the end though.
What is the maximum size of HashSet, Vector, LinkedList? I know that ArrayList can store more than 3277000 numbers.
However the size of list depends on the memory (heap) size. If it reaches maximum the JDK throws an OutOfMemoryError.
But I don't know the limit for the number of elements in HashSet, Vector and LinkedList.
There is no specified maximum size of these structures.
The actual practical size limit is probably somewhere in the region of Integer.MAX_VALUE (i.e. 2147483647, roughly 2 billion elements), as that's the maximum size of an array in Java.
A HashSet uses a HashMap internally, so it has the same maximum size as that
A HashMap uses an array which always has a size that is a power of two, so it can be at most 230 = 1073741824 elements big (since the next power of two is bigger than Integer.MAX_VALUE).
Normally the number of elements is at most the number of buckets multiplied by the load factor (0.75 by default). However, when the HashMap stops resizing, then it will still allow you to add elements, exploiting the fact that each bucket is managed via a linked list. Therefore the only limit for elements in a HashMap/HashSet is memory.
A Vector uses an array internally which has a maximum size of exactly Integer.MAX_VALUE, so it can't support more than that many elements
A LinkedList doesn't use an array as the underlying storage, so that doesn't limit the size. It uses a classical doubly linked list structure with no inherent limit, so its size is only bounded by the available memory. Note that a LinkedList will report the size wrongly if it is bigger than Integer.MAX_VALUE, because it uses a int field to store the size and the return type of size() is int as well.
Note that while the Collection API does define how a Collection with more than Integer.MAX_VALUE elements should behave. Most importantly it states this the size() documentation:
If this collection contains more than Integer.MAX_VALUE elements, returns Integer.MAX_VALUE.
Note that while HashMap, HashSet and LinkedList seem to support more than Integer.MAX_VALUE elements, none of those implement the size() method in this way (i.e. they simply let the internal size field overflow).
This leads me to believe that other operations also aren't well-defined in this condition.
So I'd say it's safe to use those general-purpose collections with up to Integer.MAX_VLAUE elements. If you know that you'll need to store more than that, then you should switch to dedicated collection implementations that actually support this.
In all cases, you're likely to be limited by the JVM heap size rather than anything else. Eventually you'll always get down to arrays so I very much doubt that any of them will manage more than 231 - 1 elements, but you're very, very likely to run out of heap before then anyway.
It very much depends on the implementation details.
A HashSet uses an array as an underlying store which by default it attempt to grow when the collection is 75% full. This means it will fail if you try to add more than about 750,000,000 entries. (It cannot grow the array from 2^30 to 2^31 entries)
Increasing the load factor increases the maximum size of the collection. e.g. a load factor of 10 allows 10 billion elements. (It is worth noting that HashSet is relatively inefficient past 100 million elements as the distribution of the 32-bit hashcode starts to look less random, and the number of collisions increases)
A Vector doubles its capacity and starts at 10. This means it will fail to grow above approx 1.34 billion. Changing the initial size to 2^n-1 gives you slightly more head room.
BTW: Use ArrayList rather than Vector if you can.
A LinkedList has no inherent limit and can grow beyond 2.1 billion. At this point size() could return Integer.MAX_VALUE, however some functions such as toArray will fail as it cannot put all objects into an array, in will instead give you the first Integer.MAX_VALUE rather than throw an exception.
As #Joachim Sauer notes, the current OpenJDK could return an incorrect result for sizes above Integer.MAX_VALUE. e.g. it could be a negative number.
The maximum size depends on the memory settings of the JVM and of course the available system memory. Specific size of memory consumption per list entry also differs between platforms, so the easiest way might be to run simple tests.
As stated in other answers, an array cannot reach 2^31 entries. Other data types are limited either by this or they will likely misreport their size() eventually. However, these theoretical limits cannot be reached on some systems:
On a 32 bit system, the number of bytes available never exceeds 2^32 exactly. And that is assuming that you have no operating system taking up memory. A 32 bit pointer is 4 bytes. Anything which does not rely on arrays must include at least one pointer per entry: this means that the maximum number of entries is 2^32/4 or 2^30 for things that do not utilize arrays.
A plain array can achieve it's theoretical limit, but only a byte array, a short array of length 2^31-1 would use up about 2^32+38 bytes.
Some java VMs have introduced a new memory model that uses compressed pointers. By adjusting pointer alignment, slightly more than 2^32 bytes may be referenced with 32 byte pointers. Around four times more. This is enough to cause a LinkedList size() to become negative, but not enough to allow it to wrap around to zero.
A sixty four bit system has sixty four bit pointers, making all pointers twice as big, making non array lists a bunch fatter. This also means that the maximum capacity supported jumps to 2^64 bytes exactly. This is enough for a 2D array to reach its theoretical maximum. byte[0x7fffffff][0x7fffffff] uses memory apporximately equal to 40+40*(2^31-1)+(2^31-1)(2^31-1)=40+40(2^31-1)+(2^62-2^32+1)
Java hashcode is an integer (size 2 pow 32)
When we create a hashtable/hashmap it creates buckets with size equals Initial Capacity of the Map. In other words, it creates an array of size "Initial Capacity"
Question
1. How does it map the hashcode of a key (java object) to the bucket index?
2. Since size of hashmap can grow, can the size of hashmap be equal to 2 pow 32? If answer is yes, it is wise to have an array of size 2 pow 32?
Here is a link to the current source code: http://www.docjar.com/html/api/java/util/HashMap.java.html
The answers to your questions are (in part) implementation specific.
1) See the code. Note that your assumption about how initialCapacity is implemented is incorrect ... for Oracle Java 6 and 7 at least. Specifically, initialCapacity is not necessarily the hashmap's array size.
2) The size of a HashMap is the number of entries, and that can exceed 2^32! I assume that your are actually talking about the capacity. The size of the HashMap's array is theoretically limited to 2^31 - 1 (the largest size for a Java array). For the current implementations, MAX_CAPACITY is actually 2^30; see the code.
3) "... it is wise to have an array of size 2^32?" It is not possible with Java as currently defined, and it is unwise to try to do something that is impossible.
If you are really asking about the design of hash table data structures in Java, then there is a trade-off between efficiency for normal sized hash tables, and ones that are HUGE; i.e. maps with significantly more than 2^30 elements. The HashMap implementations are tuned to work best for normal sized maps. If you routinely had to deal with HUGE maps, and performance was critical, then you should be looking to implement a custom map class that is tuned to your specific requirements.
The size of a Java array is actually limited to Integer.MAX_VALUE elements, 2^31-1.
HashMap uses power of two array sizes, so presumably the largest it could use is 2^31. You would need a large physical memory to make that reasonable.
HashMap does a series of shift and xor operations to reduce some sources of collisions before doing a simple bitwise AND to get the bucket index.