Developing in Java, I need a data structure to select N distinct random numbers between 0 and 999999.
I want to be able to quickly allocate N numbers and make sure they don't repeat themselves.
The main goal is not to use too much memory while still keeping performance reasonable.
I am considering using a BitSet, but I am not sure about the memory implications.
Can someone tell me whether the memory requirements of this class are related to the number of bits or to the number of set bits? And what is the complexity of setting/testing a bit?
UPDATE:
Thanks for all the replies so far.
I think I had this in my initial wording of the question but removed it when I first saw the BitSet class.
Anyway I wanted to add the following info:
Currently I am looking at an N of a few thousand at most (most likely around 1000-2000) and a number range of 0 to 999999.
But I would like my choice to take into consideration the option of increasing the range to 8 digits (i.e. 0 to 99,999,999) while keeping N in roughly the same range (maybe increasing it to 5K or 10K).
So the "used values" are quite sparse.
It depends on how large N is.
For small values of N, you could use a HashSet<Integer> to hold the numbers you have already issued. This gives you O(1) lookup and O(N) space usage.
A BitSet for the range 0-999999 is going to use roughly 125 KB, regardless of the value of N. For large enough values of N, this will be more space efficient than a HashSet. I'm not sure exactly what the value of N is where a BitSet will use less space, but my guesstimate would be 10,000 to 20,000.
Can someone tell me if the memory requirements of BitSet are related to the number of bits or to the number of set bits?
The size is determined either by the largest bit that has ever been set, or the nBits parameter if you use the BitSet(int nBits) constructor.
And what is the complexity of setting/testing a bit?
Testing bit B is O(1).
Setting bit B is O(1) best case, and O(B) if you need to expand the bitset backing array. However, since the size of the backing array is the next largest power of 2, the cost of expansion can typically be amortized over multiple BitSet operations.
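A minimal sketch of the approach described above, assuming the current parameters (N around 2000, range 0 to 999999): rejection sampling against a BitSet that marks already-issued values. The class and method names are illustrative only, not an established API.

import java.util.BitSet;
import java.util.concurrent.ThreadLocalRandom;

public class DistinctPicker {

    // Returns n distinct random values in [0, bound). The BitSet costs roughly bound/8
    // bytes regardless of n; each attempt is an O(1) test plus an O(1) set.
    static int[] pick(int n, int bound) {
        BitSet used = new BitSet(bound);
        int[] result = new int[n];
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        int count = 0;
        while (count < n) {
            int candidate = rnd.nextInt(bound);
            if (!used.get(candidate)) {          // collision check against already-issued values
                used.set(candidate);
                result[count++] = candidate;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int[] numbers = pick(2000, 1_000_000);
        System.out.println("picked " + numbers.length + " distinct values");
    }
}

Because N is tiny relative to the range, collisions are rare and the loop finishes in roughly N iterations; swapping the BitSet for a HashSet<Integer> changes only the memory profile, not the logic.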
A BitSet will take up as much space as 1,000,000 booleans packed one per bit, which is 125,000 bytes or roughly 122 KB, plus some minor overhead and space to grow. An array of the actual numbers, i.e. an int[], will take N × 4 bytes of space plus some overhead. The break-even point is
4 × N = 125,000
N = 31250
I'm not intimately familiar with Java internals, but I suspect it won't allocate more than twice the actual space used, so you're using less than 250 KB of memory with a BitSet. Also, an array makes it harder to find the duplicates when you need unique integers, so I'd use the BitSet either way and perhaps convert it to an array at the end, if that's more convenient for further processing.
Setting/getting a bit in a BitSet will have constant complexity, although it takes a few more operations than getting one out of a boolean[].
Related
This might sound dumb, but I was wondering while thinking about it a bit: can't you play around with an algorithm and make O(n) memory look like O(1)?
(Java)
Let's say you have an array of N elements of true or false.
Then that array would result in O(n) memory.
However, if we have a String such as "FFFFFTFTFFT", with each charAt(i) answering the value at the i-th index of the array, have we only used O(1) memory, or is it still considered O(n) memory since the String itself has size O(n)?
Let's take this further.
If we have an N-array of true and false and convert this to bytes, we use even less memory. Then is the byte also considered O(1) memory or O(n) memory?
For instance, let's say n = 6. Then the array size is 6 = O(n). But the byte representation is just 1 byte, since 1 byte holds 8 bits. So is this O(1), or is it O(n), since for large N we get the following case:
N equals 10000. The array is O(n) memory, but what is the byte representation? Because our bytes are O(n/8) = O(n)?
All the cases you've described are O(n). Big O describes the limiting behavior as n tends towards infinity; mathematically:
f(n) = O(n) as n -> INF means f(n)/n -> const as n -> INF, where const <> 0
So 10*n + 100 = O(n) and 0.1*n = O(n).
And, as you wrote, the next statement is correct too: O(n/8) = O(n) = O(n/const).
I'm not sure you understand the concepts of Big O completely, but you still have N elements in each of the listed cases.
The notation O(N) is an upper bound for a function of N elements, not so much defined by the size of the underlying datatypes, since as noted O(N/8) = O(N).
So, for example:
If we have an N-array of true and false and convert this to bytes
You are converting N elements into O(N) bytes. This is O(N) time complexity, and you have now stored two arrays of O(N) size, which is still O(N) space complexity.
charAt(i)
This operation alone is O(1) time complexity because you are accessing one element. But you still have N elements in the array or string, so it is O(N) space complexity.
I'm not really sure there is a common O(1) space complexity algorithm (outside of simple math operations).
There is another misconception here: to truly make a "character container" with that O(1) property (or rather O(log n), since the required memory still grows with the data), it would only work for exactly that case: strings that contain n characters of one kind and 1 character of another kind.
In such cases, yes, you would only need to remember the index that has the different character. That is similar to defining a super sparse matrix: if only one value is != 0 in a huge matrix, you could store only the corresponding indexes instead of the whole matrix with gazillions of 0 values.
And of course there are libraries that do such things for sparse matrices, to reduce the cost of keeping known 0 values in memory. Why remember something when you can (easily) compute it?
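To make the idea concrete, here is a small hedged sketch of a "remember only the exceptional positions" container; the class name and methods are made up for illustration. Space grows with the number of T positions, not with the total logical length n.

import java.util.HashSet;
import java.util.Set;

public class SparseFlags {

    // Only the positions holding the rare value ('T') are remembered;
    // every other position is implicitly 'F'.
    private final Set<Integer> truePositions = new HashSet<>();

    void setTrue(int index)  { truePositions.add(index); }
    void setFalse(int index) { truePositions.remove(index); }
    boolean get(int index)   { return truePositions.contains(index); }
}

If exactly one position differs, this degenerates to storing a single int, which is the O(1) case described above.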
I'm trying to find a counterexample to the Pólya Conjecture, which will be somewhere in the 900 millions. I'm using a very efficient algorithm that doesn't even require any factorization (similar to a Sieve of Eratosthenes, but with even more information). So, a large array of ints is required.
The program is efficient and correct, but requires an array up to the x I want to check (it checks all numbers from 2 to x). So, if the counterexample is in the 900 millions, I need an array that is just as large. Java won't allow me anything over about 20 million. Is there anything I can possibly do to get an array that large?
You may want to extend the max size of the JVM Heap. You can do that with a command line option.
I believe it is -Xmx3600m (3600 megabytes).
Java arrays are indexed by int, so an array can't have more than 2^31 - 1 elements (there are no unsigned ints). So, the maximum size of an array is 2147483647 entries, which consumes (for a plain int[]) roughly 8589934588 bytes (≈ 8 GB).
Thus, the int-index is usually not a limitation, since you would run out of memory anyway.
In your algorithm, you should use a List (or a Map) as your data structure instead, and choose an implementation of List (or Map) that can grow beyond 2^31. This can get tricky, since the "usual" implementations ArrayList (and HashMap) use arrays internally. You will have to implement a custom data structure, e.g. a 2-level array (a list of arrays). While you are at it, you can also try to pack the bits more tightly.
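A minimal sketch of such a 2-level structure, assuming a plain int per entry; the class name, chunk size, and lazy allocation are illustrative choices, not part of any standard API.

public class BigIntArray {

    private static final int CHUNK_SHIFT = 24;               // 2^24 ints (64 MB) per chunk
    private static final int CHUNK_SIZE = 1 << CHUNK_SHIFT;
    private final int[][] chunks;

    BigIntArray(long size) {
        // One outer slot per chunk; inner arrays are created on first write.
        chunks = new int[(int) ((size + CHUNK_SIZE - 1) >>> CHUNK_SHIFT)][];
    }

    void set(long index, int value) {
        int chunk = (int) (index >>> CHUNK_SHIFT);
        if (chunks[chunk] == null) {
            chunks[chunk] = new int[CHUNK_SIZE];
        }
        chunks[chunk][(int) (index & (CHUNK_SIZE - 1))] = value;
    }

    int get(long index) {
        int[] chunk = chunks[(int) (index >>> CHUNK_SHIFT)];
        return chunk == null ? 0 : chunk[(int) (index & (CHUNK_SIZE - 1))];
    }
}

The long index removes the 2^31 limit on its own; whether 900 million ints actually fit is then purely a heap-size question.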
Java will allow up to 2 billion array entries. It's your machine (and your limited memory) that cannot handle such a large amount.
900 million 32 bit ints with no further overhead - and there will always be more overhead - would require a little over 3.35 GiB. The only way to get that much memory is with a 64 bit JVM (on a machine with at least 8 GB of RAM) or use some disk backed cache.
If you don't need it all loaded in memory at once, you could segment it into files and store on disk.
What do you mean by "won't allow"? You're probably getting an OutOfMemoryError, so add more memory with the -Xmx command-line option.
You could define your own class which stores the data in a 2d array which would be closer to sqrt(n) by sqrt(n). Then use an index function to determine the two indices of the array. This can be extended to more dimensions, as needed.
The main problem you will run into is running out of RAM. If you approach this limit, you'll need to rethink your algorithm or consider external storage (i.e. a file or database).
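A hedged sketch of the index-function idea from this answer: an n-element logical array stored as a roughly sqrt(n) x sqrt(n) grid, so no single Java array has to be n elements long. The row-major mapping and the names are illustrative.

public class GridIntArray {

    private final int width;
    private final int[][] rows;

    GridIntArray(long n) {
        // Round the side length up so width * width >= n.
        width = (int) Math.ceil(Math.sqrt((double) n));
        rows = new int[width][width];
    }

    // The index function: logical index i maps to row i / width, column i % width.
    void set(long i, int value) { rows[(int) (i / width)][(int) (i % width)] = value; }

    int get(long i) { return rows[(int) (i / width)][(int) (i % width)]; }
}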
If your algorithm allows it:
Compute it in slices which fit into memory.
You will have to redo the computation for each slice, but it will often be fast enough.
Use an array of a smaller numeric type such as byte.
Depending on how you need to access the array, you might find a RandomAccessFile will allow you to use a file which is larger than will fit in memory. However, the performance you get is very dependent on your access behaviour.
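A rough sketch of that suggestion, using one byte per flag in a file on disk; the file path, byte-per-entry layout, and class name are illustrative, and real performance depends heavily on the access pattern.

import java.io.IOException;
import java.io.RandomAccessFile;

public class FileBackedFlags implements AutoCloseable {

    private final RandomAccessFile file;

    FileBackedFlags(String path, long entries) throws IOException {
        file = new RandomAccessFile(path, "rw");
        file.setLength(entries);                 // one byte per entry, zero-filled
    }

    void set(long index, boolean value) throws IOException {
        file.seek(index);
        file.writeByte(value ? 1 : 0);
    }

    boolean get(long index) throws IOException {
        file.seek(index);
        return file.readByte() != 0;
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}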
I wrote a version of the Sieve of Eratosthenes for Project Euler which worked on chunks of the search space at a time. It processes the first 1M integers (for example), but keeps each prime number it finds in a table. After you've iterated over all the primes found so far, the array is re-initialised and the primes already found are used to mark the array before looking for the next ones.
The table maps a prime to its 'offset' from the start of the array for the next processing iteration.
This is similar in concept (if not in implementation) to the way functional programming languages perform lazy evaluation of lists (although in larger steps). Allocating all the memory up-front isn't necessary, since you're only interested in the parts of the array that pass your test for primeness. Keeping the non-primes hanging around isn't useful to you.
This method also provides memoisation for later iterations over prime numbers. It's faster than scanning your sparse sieve data structure looking for the ones every time.
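This is not the answerer's actual Project Euler code, but a hedged sketch of the segmented idea it describes: each segment is pre-marked with the primes found so far, then scanned for new ones. The segment size and names are illustrative.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SegmentedSieve {

    // Collects all primes up to 'limit', keeping only 'segmentSize' flags in memory at a time.
    static List<Long> primesUpTo(long limit, int segmentSize) {
        List<Long> primes = new ArrayList<>();
        boolean[] composite = new boolean[segmentSize];

        for (long low = 2; low <= limit; low += segmentSize) {
            long high = Math.min(low + segmentSize - 1, limit);
            Arrays.fill(composite, false);

            // Pre-mark multiples of every prime found in earlier segments.
            for (long p : primes) {
                if (p * p > high) {
                    break;
                }
                long first = Math.max(p * p, ((low + p - 1) / p) * p);
                for (long m = first; m <= high; m += p) {
                    composite[(int) (m - low)] = true;
                }
            }

            // Unmarked values are prime; newly found primes also mark their own
            // multiples inside the current segment.
            for (long v = low; v <= high; v++) {
                if (!composite[(int) (v - low)]) {
                    primes.add(v);
                    for (long m = v * v; m <= high; m += v) {
                        composite[(int) (m - low)] = true;
                    }
                }
            }
        }
        return primes;
    }

    public static void main(String[] args) {
        System.out.println(primesUpTo(100, 30));   // primes up to 100, 30 flags at a time
    }
}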
I second sfossen's idea and Aaron Digulla's. I'd go for disk access. If your algorithm can work against a List interface rather than a plain array, you could write an adapter from the List to a memory-mapped file.
Use Tokyo Cabinet, Berkeley DB, or any other disk-based key-value store. They're faster than any conventional database but allow you to use the disk instead of memory.
Could you get by with 900 million bits (maybe stored as a byte array)?
You can try splitting it up into multiple smaller lists and iterating over them in turn:

import java.util.ArrayList;
import java.util.List;

List<Integer> myFirstList = new ArrayList<>();
List<Integer> mySecondList = new ArrayList<>();

for (int x = 0; x <= 1000000; x++) {
    myFirstList.add(x);
}
for (int x = 1000001; x <= 2000000; x++) {
    mySecondList.add(x);
}

then iterate over them:

for (int x : myFirstList) {
    for (int y : myFirstList) {
        // Remove multiples of x here
    }
}
// repeat for the second list
Use a memory mapped file (Java 5 NIO package) instead. Or move the sieve into a small C library and use Java JNI.
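A hedged sketch of the memory-mapped-file part of that suggestion (the JNI route is not shown). A single MappedByteBuffer can cover up to 2 GB, so a byte-per-flag sieve over the 900-million range fits in one mapping; names and layout are illustrative.

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedFlags implements AutoCloseable {

    private final RandomAccessFile file;
    private final MappedByteBuffer buffer;

    MappedFlags(String path, int entries) throws IOException {
        file = new RandomAccessFile(path, "rw");
        // The OS pages the file in and out on demand instead of holding it all on the heap.
        buffer = file.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, entries);
    }

    void set(int index, boolean value) { buffer.put(index, (byte) (value ? 1 : 0)); }

    boolean get(int index) { return buffer.get(index) != 0; }

    @Override
    public void close() throws IOException {
        file.close();
    }
}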
What is the maximum size of HashSet, Vector, LinkedList? I know that ArrayList can store more than 3277000 numbers.
However, the size of a list depends on the memory (heap) size. If it reaches the maximum, the JVM throws an OutOfMemoryError.
But I don't know the limit for the number of elements in HashSet, Vector and LinkedList.
There is no specified maximum size of these structures.
The actual practical size limit is probably somewhere in the region of Integer.MAX_VALUE (i.e. 2147483647, roughly 2 billion elements), as that's the maximum size of an array in Java.
A HashSet uses a HashMap internally, so it has the same maximum size as that
A HashMap uses an array which always has a size that is a power of two, so it can be at most 2^30 = 1,073,741,824 elements big (since the next power of two is bigger than Integer.MAX_VALUE).
Normally the number of elements is at most the number of buckets multiplied by the load factor (0.75 by default). However, when the HashMap stops resizing, it will still allow you to add elements, exploiting the fact that each bucket is managed via a linked list. Therefore the only limit for elements in a HashMap/HashSet is memory.
A Vector uses an array internally which has a maximum size of exactly Integer.MAX_VALUE, so it can't support more than that many elements
A LinkedList doesn't use an array as the underlying storage, so that doesn't limit the size. It uses a classical doubly linked list structure with no inherent limit, so its size is only bounded by the available memory. Note that a LinkedList will report the size wrongly if it is bigger than Integer.MAX_VALUE, because it uses an int field to store the size and the return type of size() is int as well.
Note that the Collection API does define how a Collection with more than Integer.MAX_VALUE elements should behave. Most importantly, it states this in the size() documentation:
If this collection contains more than Integer.MAX_VALUE elements, returns Integer.MAX_VALUE.
Note that while HashMap, HashSet and LinkedList seem to support more than Integer.MAX_VALUE elements, none of those implement the size() method in this way (i.e. they simply let the internal size field overflow).
This leads me to believe that other operations also aren't well-defined in this condition.
So I'd say it's safe to use those general-purpose collections with up to Integer.MAX_VALUE elements. If you know that you'll need to store more than that, then you should switch to dedicated collection implementations that actually support this.
In all cases, you're likely to be limited by the JVM heap size rather than anything else. Eventually you'll always get down to arrays, so I very much doubt that any of them will manage more than 2^31 - 1 elements, but you're very, very likely to run out of heap before then anyway.
It very much depends on the implementation details.
A HashSet uses an array as its underlying store, which by default it attempts to grow when the collection is 75% full. This means it will fail if you try to add more than about 750,000,000 entries. (It cannot grow the array from 2^30 to 2^31 entries.)
Increasing the load factor increases the maximum size of the collection. e.g. a load factor of 10 allows 10 billion elements. (It is worth noting that HashSet is relatively inefficient past 100 million elements as the distribution of the 32-bit hashcode starts to look less random, and the number of collisions increases)
A Vector doubles its capacity and starts at 10. This means it will fail to grow above approximately 1.34 billion. Changing the initial size to 2^n-1 gives you slightly more headroom.
BTW: Use ArrayList rather than Vector if you can.
A LinkedList has no inherent limit and can grow beyond 2.1 billion. At this point size() could return Integer.MAX_VALUE; however, some functions such as toArray will fail, as they cannot put all objects into an array; it will instead give you the first Integer.MAX_VALUE elements rather than throw an exception.
As @Joachim Sauer notes, the current OpenJDK could return an incorrect result for sizes above Integer.MAX_VALUE, e.g. it could be a negative number.
The maximum size depends on the memory settings of the JVM and of course the available system memory. Specific size of memory consumption per list entry also differs between platforms, so the easiest way might be to run simple tests.
As stated in other answers, an array cannot reach 2^31 entries. Other data types are limited either by this or they will likely misreport their size() eventually. However, these theoretical limits cannot be reached on some systems:
On a 32 bit system, the number of bytes available never exceeds 2^32 exactly. And that is assuming that you have no operating system taking up memory. A 32 bit pointer is 4 bytes. Anything which does not rely on arrays must include at least one pointer per entry: this means that the maximum number of entries is 2^32/4 or 2^30 for things that do not utilize arrays.
A plain array can achieve its theoretical limit, but only a byte array; a short array of length 2^31-1 would already use up about 2^32+38 bytes.
Some Java VMs have introduced a new memory model that uses compressed pointers. By adjusting pointer alignment, slightly more than 2^32 bytes may be referenced with 32-bit pointers, around four times more. This is enough to cause a LinkedList size() to become negative, but not enough to allow it to wrap around to zero.
A sixty-four bit system has sixty-four bit pointers, making all pointers twice as big and making non-array lists a fair bit fatter. This also means that the maximum capacity supported jumps to 2^64 bytes. This is enough for a 2D array to reach its theoretical maximum: byte[0x7fffffff][0x7fffffff] uses memory approximately equal to 40 + 40*(2^31-1) + (2^31-1)*(2^31-1) = 40 + 40*(2^31-1) + (2^62-2^32+1) bytes.
I was trying to get all primes before 600851475143.
I was using Sieve of Eratosthenes for this.
This requires me to create a boolean array of that huge size.
Bad idea, you can run out of memory.
Is there any other way? I tried using a String, with each index holding 0 or 1 to represent true or false, but the indexOf method also returns an int.
Next I am using a 2D array for my problem.
Any other better way to store such a huge array?
The memory requirement for 600851475143 booleans is at best 70 GB. This isn't feasible. You need to either use compression as suggested by Stephan, or find a different algorithm for calculating the primes.
I had a similar problem and I used a bit set (basically setting 1 or 0 for the desired offset, in order). I recommend using EWAHCompressedBitmap, which will also compress your bit set.
EDIT
As Alan said, the BitSet would occupy 70 GB of memory, but you can do another thing: have multiple BitSets (consecutive ones, so that you can calculate the absolute position) and load into memory only the BitSet you need at that moment, something like a lazy load. That way you have control over the memory used.
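A hedged sketch of that chunking idea, with the lazy-load part simplified: chunks are only created when touched and simply stay on the heap here, whereas a real version would evict idle chunks to disk. The class name and chunk size are illustrative.

import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

public class ChunkedBitSet {

    private static final long CHUNK_BITS = 1L << 30;        // 2^30 bits (~128 MB) per chunk

    private final Map<Long, BitSet> chunks = new HashMap<>();

    void set(long index) {
        // The chunk number gives the coarse position; the remainder is the bit within it.
        chunks.computeIfAbsent(index / CHUNK_BITS, k -> new BitSet())
              .set((int) (index % CHUNK_BITS));
    }

    boolean get(long index) {
        BitSet chunk = chunks.get(index / CHUNK_BITS);
        return chunk != null && chunk.get((int) (index % CHUNK_BITS));
    }
}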
It's not really practical to remember for each number whether it was a prime or not for such a large amount (the sieve is a very slow approach for large numbers in general).
From this link you get an idea of how many primes to expect below X. For your 600 billion range you can expect roughly 20 billion primes to exist within that range. Storing them as a long[] would require about 160 GB of memory; that is notably more than the suggested 70 GB for storing a single bit for each number, or half of that if you exclude even numbers (2 is the only even prime).
For a desktop computer, 35 GB in memory may be a bit much, but a good workstation can have that much RAM. I would try a two-dimensional array with bit shifting/masking.
I would still expect your sieve code to run for a considerable amount of time (something from days to years). I suggest you investigate more advanced prime detection methods than a sieve.
You could use HotSpot's internal sun.misc.Unsafe API to allocate a bigger array. I wrote a blog post on how to simulate an array with it. However, it's not an official Java API, so it qualifies as a hack.
Use BitSet. You can then set the bit at any index. 600851475143 is less than 2^40, so it needs only 40 bits to represent (although in reality it will occupy 64 bits, as it is stored in a long).
You can in fact go up to 2^63, which is massive for most purposes.