Setting aside the heap's capacity, are there ways to go beyond Integer.MAX_VALUE constraints in Java?
Examples are:
Collections limit themselves to Integer.MAX_VALUE.
StringBuilder / StringBuffer limit themselves to Integer.MAX_VALUE.
If you have a huge Collection you're going to hit all sorts of practical limits before you ever have 2^31 - 1 items in it. A Collection with a million items in it is going to be pretty unwieldy, let alone one with more than a thousand times that many.
Similarly, a StringBuilder can build a String of roughly 2 billion characters before it hits the MAX_VALUE limit, which is more than adequate for any practical purpose.
If you truly think that you might be hitting these limits your application should be storing your data in a different way, probably in a database.
With a long? Works for me.
Edit: Ah, clarification of the question. Cool. My new and improved answer:
With a paging algorithm.
Coincidentally, somewhat recently for another question (Binary search in a sorted (memory-mapped ?) file in java), I whipped up a paging algorithm to get around the int parameters in the java.nio.MappedByteBuffer API.
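The linked answer's code is not reproduced here, but the general idea of paging around the int-based buffer API can be sketched as follows. This is a minimal sketch, assuming a read-only file; the class name PagedMappedFile and the pageSize parameter are made up for illustration. Each page is its own MappedByteBuffer, and a long position is split into a page number and an int offset within that page.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper: read-only view of a file, addressed by long and mapped in int-sized pages. */
final class PagedMappedFile {
    private final int pageSize;
    private final long length;
    private final MappedByteBuffer[] pages;

    private PagedMappedFile(int pageSize, long length, MappedByteBuffer[] pages) {
        this.pageSize = pageSize;
        this.length = length;
        this.pages = pages;
    }

    /** Maps the whole file as consecutive pages of at most pageSize bytes each. */
    static PagedMappedFile open(Path file, int pageSize) throws IOException {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            long length = channel.size();
            int pageCount = (int) ((length + pageSize - 1) / pageSize);
            MappedByteBuffer[] pages = new MappedByteBuffer[pageCount];
            for (int i = 0; i < pageCount; i++) {
                long start = (long) i * pageSize;
                long size = Math.min(pageSize, length - start);
                // A mapping stays valid after the channel that created it is closed.
                pages[i] = channel.map(FileChannel.MapMode.READ_ONLY, start, size);
            }
            return new PagedMappedFile(pageSize, length, pages);
        }
    }

    long length() {
        return length;
    }

    /** Reads one byte at a long position by selecting the page and the offset inside it. */
    byte get(long position) {
        if (position < 0 || position >= length) {
            throw new IndexOutOfBoundsException("position " + position);
        }
        return pages[(int) (position / pageSize)].get((int) (position % pageSize));
    }
}
```

With a page size of, say, 1 GB, a file of many gigabytes can be addressed through get(long) even though each individual buffer is still indexed by int.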
You can create your own collections which have a long size() based on the source code for those collections. To have larger arrays of Objects for example, you can have an array of arrays (and stitch these together)
This approach will allow almost 2^62 elements.
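As a rough illustration of the array-of-arrays idea, here is a minimal sketch of a long-indexed array of long values. The class name BigLongArray and the chunkBits parameter are assumptions made for this example, not part of any standard library. The high bits of the index select a chunk and the low bits select the element inside it.

```java
/** Hypothetical long-indexed array of longs, stitched together from fixed-size chunks. */
final class BigLongArray {
    private final int chunkBits;
    private final int chunkMask;
    private final long size;
    private final long[][] chunks;

    /** Creates an array of the given size, split into chunks of 2^chunkBits elements. */
    BigLongArray(long size, int chunkBits) {
        if (size < 0 || chunkBits < 1 || chunkBits > 30) {
            throw new IllegalArgumentException("invalid size or chunkBits");
        }
        this.size = size;
        this.chunkBits = chunkBits;
        this.chunkMask = (1 << chunkBits) - 1;
        long chunkSize = 1L << chunkBits;
        long chunkCount = (size + chunkSize - 1) >>> chunkBits;
        if (chunkCount > Integer.MAX_VALUE - 8) {
            throw new IllegalArgumentException("too many chunks; use a larger chunkBits");
        }
        chunks = new long[(int) chunkCount][];
        for (int i = 0; i < chunks.length; i++) {
            long remaining = size - ((long) i << chunkBits);
            // The last chunk may be shorter than the others.
            chunks[i] = new long[(int) Math.min(chunkSize, remaining)];
        }
    }

    long size() {
        return size;
    }

    long get(long index) {
        checkIndex(index);
        return chunks[(int) (index >>> chunkBits)][(int) (index & chunkMask)];
    }

    void set(long index, long value) {
        checkIndex(index);
        chunks[(int) (index >>> chunkBits)][(int) (index & chunkMask)] = value;
    }

    private void checkIndex(long index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index + ", size " + size);
        }
    }
}
```

Whether such a structure is usable in practice still depends on the heap: every element occupies 8 bytes, so the memory arithmetic discussed in this thread applies unchanged.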
Array indexes are limited by Integer.MAX_VALUE, not the physical size of the array.
Therefore the maximum size of an array is linked to the size of the array-type.
byte = 1 byte => max 2 GB of data
char = 2 bytes => max 4 GB of data
int = 4 bytes => max 8 GB of data
long = 8 bytes => max 16 GB of data
Dictionaries are a different story because they often use techniques like buckets or an internal data layout as a tree. Therefore these "limits" usually don't apply, or you will need even more data to reach the limit.
Short: Integer.MAX_VALUE is not really a limit because you need lots of memory to actually reach the limit. If you should ever reach this limit you might want to think about improving your algorithm and/or data-layout :)
Yes, with BigInteger class.
A memory upgrade is necessary.. :)
Related
Before starting to explain my problem, I should mention that I am not looking for a way to increase Java heap memory. I should strictly store these objects.
I am working on storing a huge number (5-10 GB) of DNA sequences and their counts (Integer) in a hash table. The DNA sequences (with length 32 or less) consist of 'A', 'C', 'G', 'T', and 'N' (undefined) chars. As we know, when storing a large number of objects in memory, Java has poor space efficiency compared to lower level languages like C and C++. Thus, if I store this sequence as string (it holds about 100 MB memory for a sequence with length ~30), I see the error.
I tried to represent nucleic acids as 'A'=00, 'C'=01, 'G'=10, 'T'=11 and neglect 'N' (because it ruins the char to 2-bit transform as the 5-th acid). Then, concatenate these 2-bit acids into byte array. It brought some improvement but unfortunately I see the error after a couple of hours again. I need a convenient solution or at least a workaround to handle this error. Thank you in advance.
This may be a weird idea, it is fairly complex, and it would require quite a lot of work, but this is what I would try:
You already pointed out two individual subproblems of your overall task:
the default HashMap implementation may be suboptimal for such large collection sizes
you need to store something else than strings
The map implementation
I would recommend to write a highly tailored hash map implementation for the Map<String, Long> interface. Internally you do not have to store strings. Unfortunately 5^32 > 2^64, so there is no way to pack your whole string into a single long, well, let's stick to two longs for a key. You can make string to/back long[2] conversion fairly efficiently on the fly when providing a string key to your map implementation (use bit shifts etc).
As for packing the values, here are some considerations:
for a key-value pair a standard hashmap will need to have an array of N longs for buckets, where N is the current capacity, when the bucket is found from the hash key it will need to have a linked list of key-value pairs to resolve keys that produce identical hash codes. For your specific case you could try to optimize it in the following way:
use a long[] of size 3N where N is the capacity to store both keys and values in a continuous array
in this array, at locations 3 * (hashcode % N) and 3 * (hashcode % N) + 1 you store the long[2] representation of the key, of the first key that matches this bucket or of the only one (on insertion, zero otherwise), at location 3 * (hashcode % N) + 2 you store the corresponding count
for all those cases where a different key results in the same hash code and thus the same bucket, you store the data in a standard HashMap<Long2KeyWrapper, Long>. The idea is to keep the capacity of the array mentioned above (and resize correspondingly) large enough to have by far the largest part of the data in that contiguous array and not in the fallback hash map. This will dramatically reduce the storage overhead of the hashmap
do not expand the capacity in N=2N iterations, make smaller growth steps, e.g. 10-20%. this will cost performance on populating the map, but will keep your memory footprint under control
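The layout described in the list above could look roughly like the following. This is only a sketch under several assumptions: the class name FlatCountMap is made up, the hash function is an arbitrary choice, resizing is left out entirely, a List<Long> stands in for the Long2KeyWrapper key class, and the all-zero key is reserved to mark an empty slot.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical counting map: keys are two longs, counts live in one flat long[] of size 3N. */
final class FlatCountMap {
    private final int capacity;
    // Slot i occupies table[3*i], table[3*i + 1] (the key) and table[3*i + 2] (the count).
    private final long[] table;
    // Fallback for keys whose slot is already taken by a different key.
    // List<Long> is a stand-in for the Long2KeyWrapper class mentioned in the answer.
    private final Map<List<Long>, Long> overflow = new HashMap<>();

    FlatCountMap(int capacity) {
        this.capacity = capacity;
        this.table = new long[3 * capacity];
    }

    /** Returns the offset of the slot for this key; the hash function is an arbitrary choice. */
    private int slotOffset(long keyA, long keyB) {
        long h = keyA * 31 + keyB;
        h ^= (h >>> 32);
        return (int) Math.floorMod(h, (long) capacity) * 3;
    }

    void increment(long keyA, long keyB) {
        if (keyA == 0 && keyB == 0) {
            throw new IllegalArgumentException("the all-zero key is reserved for empty slots");
        }
        int s = slotOffset(keyA, keyB);
        if (table[s] == 0 && table[s + 1] == 0) {
            // Empty slot: the first key that hashes here claims it.
            table[s] = keyA;
            table[s + 1] = keyB;
            table[s + 2] = 1;
        } else if (table[s] == keyA && table[s + 1] == keyB) {
            table[s + 2]++;
        } else {
            overflow.merge(Arrays.asList(keyA, keyB), 1L, Long::sum);
        }
    }

    long count(long keyA, long keyB) {
        int s = slotOffset(keyA, keyB);
        if (table[s] == keyA && table[s + 1] == keyB) {
            return table[s + 2];
        }
        Long c = overflow.get(Arrays.asList(keyA, keyB));
        return c == null ? 0L : c;
    }
}
```

A real implementation would also need the resizing policy from the last list item, since the benefit disappears once most keys end up in the fallback map.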
The keys
Given the inequality 5^32 > 2^64 your idea to use bits to encode 5 letters seems to be the best I can think of right now. Use 3 bits and correspondingly long[2].
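A minimal sketch of the 3-bit encoding into a long[2] might look like this. The class name DnaCodec and the concrete code assignment (A=1, C=2, G=3, T=4, N=5, with 0 meaning "no base") are assumptions for illustration. Each long holds 21 bases (63 bits), so two longs are enough for sequences of up to 32 bases.

```java
/** Hypothetical codec: packs up to 32 bases (A, C, G, T, N) into two longs, 3 bits per base. */
final class DnaCodec {
    // Code of a base is its index here plus 1; code 0 marks "no base" (end of sequence).
    private static final String ALPHABET = "ACGTN";
    private static final int BASES_PER_LONG = 21; // 21 * 3 bits = 63 bits per long

    static long[] encode(String sequence) {
        if (sequence.length() > 32) {
            throw new IllegalArgumentException("at most 32 bases are supported");
        }
        long[] packed = new long[2];
        for (int i = 0; i < sequence.length(); i++) {
            int code = ALPHABET.indexOf(sequence.charAt(i)) + 1;
            if (code == 0) {
                throw new IllegalArgumentException("unexpected base: " + sequence.charAt(i));
            }
            packed[i / BASES_PER_LONG] |= (long) code << (3 * (i % BASES_PER_LONG));
        }
        return packed;
    }

    static String decode(long[] packed) {
        StringBuilder sb = new StringBuilder(32);
        for (int i = 0; i < 2 * BASES_PER_LONG; i++) {
            int code = (int) (packed[i / BASES_PER_LONG] >>> (3 * (i % BASES_PER_LONG))) & 7;
            if (code == 0) {
                break; // reached the padding after the last base
            }
            sb.append(ALPHABET.charAt(code - 1));
        }
        return sb.toString();
    }
}
```

Because no base is encoded as 0, the length of the sequence does not need to be stored separately, and a non-empty sequence never produces the all-zero key.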
I recommend you look into the Trove4j Collections API; it offers Collections that hold primitives which will use less memory than their boxed, wrapper classes.
Specifically, you should check out their TObjectIntHashMap.
Also, I wouldn't recommend storing anything as a String or char until JDK 9 is released, as the backing char array of a String is UTF-16 encoded, using two bytes per char. JDK 9 introduces compact strings, which store a String as Latin-1 (one byte per char) whenever all of its characters fit, which is the case for these sequences.
If you're using on the order of ~10 GB of data, or at least data with an in-memory representation size of ~10 GB, then you might need to think of ways to write the data you don't need at the moment to disk and load individual portions of your dataset into memory to work on them.
I had this exact problem a few years ago when I was conducting research with monte carlo simulations so I wrote a Java data structure to solve it. You can clone/fork the source here: github.com/tylerparsons/surfdep
The library supports both MySQL and SQLite as the underlying database. If you don't have either, I'd recommend SQLite as it's much quicker to set up.
Full disclaimer: this is not the most efficient implementation, but it will handle very large datasets if you let it run for a few hours. I tested it successfully with matrices of up to 1 billion elements on my Windows laptop.
I'm trying to find a counterexample to the Pólya Conjecture which will be somewhere in the 900 millions. I'm using a very efficient algorithm that doesn't even require any factorization (similar to a Sieve of Eratosthenes, but with even more information). So, a large array of ints is required.
The program is efficient and correct, but requires an array up to the x I want to check for (it checks all numbers from 2 to x). So, if the counterexample is in the 900 millions, I need an array that will be just as large. Java won't allow me anything over about 20 million. Is there anything I can possibly do to get an array that large?
You may want to extend the max size of the JVM Heap. You can do that with a command line option.
I believe it is -Xmx3600m (3600 megabytes)
Java arrays are indexed by int, so an array can't hold more than 2^31 - 1 elements (there are no unsigned ints). So, the maximum size of an array is 2147483647, which consumes (for a plain int[]) about 8589934588 bytes (roughly 8 GB).
Thus, the int-index is usually not a limitation, since you would run out of memory anyway.
In your algorithm, you should use a List (or a Map) as your data structure instead, and choose an implementation of List (or Map) that can grow beyond 2^31. This can get tricky, since the "usual" implementation ArrayList (and HashMap) uses arrays internally. You will have to implement a custom data structure, e.g. by using a 2-level array (a list/array). While you are at it, you can also try to pack the bits more tightly.
Java will allow up to 2 billion array entries. It's your machine (and its limited memory) that cannot handle such a large amount.
900 million 32 bit ints with no further overhead - and there will always be more overhead - would require a little over 3.35 GiB. The only way to get that much memory is with a 64 bit JVM (on a machine with at least 8 GB of RAM) or use some disk backed cache.
If you don't need it all loaded in memory at once, you could segment it into files and store on disk.
What do you mean by "won't allow"? You are probably getting an OutOfMemoryError, so add more memory with the -Xmx command line option.
You could define your own class which stores the data in a 2d array which would be closer to sqrt(n) by sqrt(n). Then use an index function to determine the two indices of the array. This can be extended to more dimensions, as needed.
The main problem you will run into is running out of RAM. If you approach this limit, you'll need to rethink your algorithm or consider external storage (i.e. a file or database).
If your algorithm allows it:
Compute it in slices which fit into memory.
You will have to redo the computation for each slice, but it will often be fast enough.
Use an array of a smaller numeric type such as byte.
Depending on how you need to access the array, you might find a RandomAccessFile will allow you to use a file which is larger than will fit in memory. However, the performance you get is very dependent on your access behaviour.
I wrote a version of the Sieve of Eratosthenes for Project Euler which worked on chunks of the search space at a time. It processes the first 1M integers (for example), but keeps each prime number it finds in a table. After you've iterated over all the primes found so far, the array is re-initialised and the primes found already are used to mark the array before looking for the next one.
The table maps a prime to its 'offset' from the start of the array for the next processing iteration.
This is similar in concept (if not in implementation) to the way functional programming languages perform lazy evaluation of lists (although in larger steps). Allocating all the memory up-front isn't necessary, since you're only interested in the parts of the array that pass your test for primeness. Keeping the non-primes hanging around isn't useful to you.
This method also provides memoisation for later iterations over prime numbers. It's faster than scanning your sparse sieve data structure looking for the ones every time.
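The chunked approach can be sketched as a segmented sieve. This is not the author's code: it is a simplified variant that first collects the primes up to sqrt(limit) and then recomputes each prime's starting multiple per segment by division, instead of keeping a table of offsets. The class and method names are made up, and limit is assumed to be small enough that its square root fits in an int.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical segmented sieve: counts primes while only keeping one segment in memory. */
final class SegmentedSieve {

    /** Counts the primes p with 2 <= p <= limit, sieving segmentSize numbers at a time. */
    static long countPrimes(long limit, int segmentSize) {
        if (limit < 2) {
            return 0;
        }
        // Integer square root of limit, corrected for floating-point rounding.
        int root = (int) Math.sqrt((double) limit);
        while ((long) root * root > limit) {
            root--;
        }
        while ((long) (root + 1) * (root + 1) <= limit) {
            root++;
        }

        // Plain sieve for the base primes up to sqrt(limit).
        boolean[] composite = new boolean[root + 1];
        List<Integer> basePrimes = new ArrayList<>();
        for (int i = 2; i <= root; i++) {
            if (!composite[i]) {
                basePrimes.add(i);
                for (long j = (long) i * i; j <= root; j += i) {
                    composite[(int) j] = true;
                }
            }
        }

        long count = 0;
        boolean[] segment = new boolean[segmentSize]; // reused for every segment
        for (long low = 2; low <= limit; low += segmentSize) {
            long high = Math.min(low + segmentSize - 1, limit);
            Arrays.fill(segment, false);
            for (int p : basePrimes) {
                // First multiple of p inside the segment, but never below p * p.
                long start = Math.max((long) p * p, ((low + p - 1) / p) * p);
                for (long m = start; m <= high; m += p) {
                    segment[(int) (m - low)] = true;
                }
            }
            for (long n = low; n <= high; n++) {
                if (!segment[(int) (n - low)]) {
                    count++;
                }
            }
        }
        return count;
    }
}
```

Memory use is bounded by the segment size plus the list of base primes, regardless of how large the limit is.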
I second #sfossen's idea and #Aaron Digulla. I'd go for disk access. If your algorithm can take in a List interface rather than a plain array, you could write an adapter from the List to the memory mapped file.
Use Tokyo Cabinet, Berkeley DB, or any other disk-based key-value store. They're faster than any conventional database but allow you to use the disk instead of memory.
could you get by with 900 million bits? (maybe stored as a byte array).
You can try splitting it up into multiple arrays.
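    List<Integer> myFirstList = new ArrayList<>();
    List<Integer> mySecondList = new ArrayList<>();
    for (int x = 0; x <= 1000000; x++) {
        myFirstList.add(x);
    }
    for (int x = 1000001; x <= 2000000; x++) {
        mySecondList.add(x);
    }
Then iterate over them:
    for (int x : myFirstList) {
        for (int y : myFirstList) {
            // Remove multiples (mark them; removing from a list while iterating over it fails)
        }
    }
    // Repeat for the second list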
Use a memory mapped file (java.nio, available since Java 1.4) instead. Or move the sieve into a small C library and call it through JNI.
How do I declare an array of size 10^9 in Java? I have tried an ArrayList, but the problem is that I need to find the minimum and maximum element in the array, so I need to compare the 0th element with all other elements, and I initially need a fixed-size array, as required by the input format on CodeChef. Can anybody help? I tried using a long array but it gave an out of memory error.
A Java array can have a maximum size equal to Integer.MAX_VALUE (or in some cases a slightly smaller value), which is about 2.1 * 10^9, so creating that large an array is theoretically possible. However, since 10^9 corresponds to the prefix giga, the array would have a size of at least 1 GB (when using byte[]). Depending on the data type you're using, the array probably just takes too much memory (an int[] would already take up 4 GB).
You could try to increase the JVM's max memory by using the -Xmx option (e.g. to allow a maximum of 4 GB you could use -Xmx4g) but you're still limited by the maximum of addressable memory (e.g. IIRC a 32-bit JVM can only address up to 4 GB in total) and available memory.
Alternatively you could try and split the array over multiple machines or JVMs and employ some distributed approach. Or you could write the array to a (memory mapped) file and keep only a part of the array in memory.
The best approach, however, would probably be to check whether you really need that much memory. In many cases using some clever algorithms or structures can dramatically reduce memory requirements. What to use depends on what you're trying to achieve in the end though.
Developing in Java, I need a data structure to select N distinct random numbers between 0 and 999999 ?
I want to be able to quickly allocate N numbers and make sure they don't repeat themselves.
Main goal is not to use too much memory and still keep performance reasonable.
I am considering using a BitSet, but I am not sure of the memory implications.
Can someone tell me if the memory requirements of this class are related to the number of bits or to the number of set bits? and what is the complexity to setting/testing a bit ?
UPDATE:
Thanks for all the replies so far.
I think I had this in my initial wording of this question but removed it when I first saw the BitSet class.
Anyway I wanted to add the following info:
Currently I am looking at N of a few thousands at most (most likely around 1000-2000) and a number range of 0 to 999999.
But I would like my choice to take into consideration the option of increasing the range to 8 digits (i.e. 0 to 99 999 999) while keeping N at roughly the same ranges (maybe increase it to 5K or 10K).
So the "used values" are quite sparse.
It depends on how large N is.
For small values of N, you could use a HashSet<Integer> to hold the numbers you have already issued. This gives you O(1) lookup and O(N) space usage.
A BitSet for the range 0-999999 is going to use roughly 125 KB, regardless of the value of N. For large enough values of N, this will be more space efficient than a HashSet. I'm not sure exactly what the value of N is where a BitSet will use less space, but my guesstimate would be 10,000 to 20,000.
Can someone tell me if the memory requirements of BitSet are related to the number of bits or to the number of set bits?
The size is determined either by the largest bit that has ever been set, or the nBits parameter if you use the BitSet(int nBits) constructor.
and what is the complexity to setting/testing a bit ?
Testing bit B is O(1).
Setting bit B is O(1) best case, and O(B) if you need to expand the bitset backing array. However, since the size of the backing array is the next largest power of 2, the cost of expansion can typically be amortized over multiple BitSet operations.
A BitSet will take up as much space as 1,000,000 booleans, which is 125,000 bytes or roughly 122kB, plus some minor overhead and space to grow. An array of the actual numbers, i.e. an int[] will take N × 4B of space plus some overhead. The break-even point is
4 × N = 125,000
N = 31250
I'm not intimately familiar with Java internals, but I suspect it won't allocate more than twice the actual space used, so you're using less than 250kB of memory with a bitset. Also, an array makes it harder to find the duplicates when you need unique integers, so I'd use the bitset either way and perhaps convert it to an array at the end, if that's more convenient for further processing.
Setting/getting a bit in a BitSet will have constant complexity, although it takes a few more operations than getting one out of a boolean[].
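For the sparse case described in the question (a few thousand values out of a million), the BitSet approach can be sketched as below. The class and method names are made up for illustration. The retry loop is only a reasonable choice while N is small compared to the range, which is an assumption taken from the question's numbers.

```java
import java.util.BitSet;
import java.util.Random;

/** Hypothetical helper: draws n distinct random numbers in [0, range) using a BitSet. */
final class DistinctRandom {

    static int[] pick(int n, int range, Random random) {
        if (n < 0 || n > range) {
            throw new IllegalArgumentException("n must be between 0 and range");
        }
        BitSet used = new BitSet(range); // one bit per possible value
        int[] result = new int[n];
        int filled = 0;
        while (filled < n) {
            int candidate = random.nextInt(range);
            if (!used.get(candidate)) { // O(1) duplicate check
                used.set(candidate);
                result[filled++] = candidate;
            }
        }
        return result;
    }
}
```

A call such as DistinctRandom.pick(2000, 1000000, new Random()) allocates about 125,000 bytes for the bit set plus 8,000 bytes for the result array.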
I was trying to get all primes before 600851475143.
I was using Sieve of Eratosthenes for this.
This requires me to create a boolean array of that huge size.
Bad idea, you can run out of memory.
Is there any other way? I tried using a string, using each index with values 0 and 1 to represent true or false, but the indexOf method returns an int too.
Next I am going to try a 2D array for my problem.
Any other better way to store such a huge array?
The memory requirement for 600851475143 booleans is at best 70 GB. This isn't feasible. You need to either use compression as suggested by Stephan, or find a different algorithm for calculating the primes.
I had a similar problem and I used a bit set (basically setting 1 or 0 at the desired offset). I recommend using EWAHCompressedBitmap; it will also compress your bit set.
EDIT
As Alan said, the BitSet will occupy 70 GB of memory, but you can do another thing: have multiple BitSets (consecutive ones, so that you can calculate the absolute position) and load into memory just the BitSet that you need at that moment, something like a lazy load. In this case you will have control of the memory used.
It's not really practical to remember for each number whether it was a prime or not for such a large range (the sieve is a very slow approach for large numbers in general).
From this link you get an idea how many primes there are to be expected smaller than X. For your 600 billion range you can expect roughly 20 billion primes to exist within that range. Storing them as long[] would require about 160 GB of memory. That is notably more than the suggested 70 GB for storing a single bit for each number, or half of that if you exclude even numbers (2 is the only even prime).
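A minimal sketch of the "multiple consecutive BitSets" idea follows. It only covers lazy allocation in memory: chunks are created the first time a bit in them is set, and nothing is written to or loaded from disk, which a full lazy-load scheme would add. The class name ChunkedBitSet and the chunk size are assumptions for this example.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical long-indexed bit set made of consecutive BitSets that are created on demand. */
final class ChunkedBitSet {
    private static final int CHUNK_BITS = 1 << 30; // bits covered by one BitSet
    private final Map<Long, BitSet> chunks = new HashMap<>();

    void set(long index) {
        checkIndex(index);
        // Absolute position = chunk number * CHUNK_BITS + offset inside the chunk.
        chunks.computeIfAbsent(index / CHUNK_BITS, k -> new BitSet())
              .set((int) (index % CHUNK_BITS));
    }

    boolean get(long index) {
        checkIndex(index);
        BitSet chunk = chunks.get(index / CHUNK_BITS);
        return chunk != null && chunk.get((int) (index % CHUNK_BITS));
    }

    private static void checkIndex(long index) {
        if (index < 0) {
            throw new IndexOutOfBoundsException("negative index: " + index);
        }
    }
}
```

Chunks that are never touched cost nothing, but a fully populated range of 600851475143 bits would still need roughly 70 GB in total.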
For a desktop computer 35GB in memory may be a bit much, but a good workstation can have that much RAM. I would try a two-dimensional array with bit shifting/masking.
I still would expect your sieve code to run a considerable amount of time (something from days to years). I suggest you investigate more advanced prime detection methods than sieve.
You could use HotSpot's internal sun.misc.Unsafe API to allocate a bigger array. I wrote a blog post on how to simulate an array with it. However, it's not an official Java API, so it qualifies as a hack.
Use BitSet. You can then set the bit at any index. 600851475143 is just under 2^40, so the value itself fits in 40 bits (in reality it will occupy 64 bits, as a long is used). Note that java.util.BitSet is indexed by int, so a single BitSet only covers indexes up to 2^31 - 1.
With a long index spread over several BitSets you can in fact go up to 2^63 - 1, which is massive for most purposes.