ArrayList<Double> to double[] with 300 million entries - java

I'm using a java program to get some data from a DB. I then calculate some numbers and start storing them in an array. The machine I'm using has 4 gigs of RAM. Now, I don't know how many numbers there will be in advance, so I use an ArrayList<Double>. But I do know there will be roughly 300 million numbers.
So, since one double is 8 bytes, a rough estimate of the memory this array will consume is 2.4 gigs (probably more, because of the overhead of an ArrayList). After that I want to calculate the median of this array, using the org.apache.commons.math3.stat.descriptive.rank.Median class, which takes a double[] as input. So I need to convert the ArrayList<Double> to a double[].
I did see many questions where this is raised, and they all mention there is no way around looping through the entire list. That is fine, but since they also keep both objects in memory, this brings my memory requirement up to 4.8 gigs. Now we have a problem, since the total RAM available is 4 gigs.
First of all, is my suspicion correct that the program will at some point give me a memory error (it is currently running)? And if so, how can I calculate the median without having to allocate double the memory? I want to avoid sorting, since the median can be found in O(n) with a selection algorithm.

Your problem is even worse than you realize, because ArrayList<Double> is much less efficient than 8 bytes per entry. Each entry is actually a Double object, to which the ArrayList keeps an array of references. A Double object is typically 16 to 24 bytes (an 8 to 16 byte object header plus the 8-byte double, padded to a multiple of 8), and the reference to it adds another 4 to 8 bytes, bringing the total up to at least 20 bytes per entry, even excluding overhead for memory management and spare list capacity.
If the constraints were a little wider, you could implement your own DoubleArray that is backed by a double[] but knows how to resize itself. However, the resizing means you'll have to keep a copy of both the old and the new array in memory at the same time, also blowing your memory limit.
That still leaves a few options though:
Loop through the input twice: once to count the entries, once to read them into a right-sized double[] (a sketch follows this list). Whether this is possible depends on the nature of your input, of course.
Make some assumption on the maximum input size (perhaps user-configurable), and allocate a double[] up front that is this fixed size. Use only the part of it that's filled.
Use float instead of double to cut memory requirements in half, at the expense of some precision.
Rethink your algorithm to avoid holding everything in memory at once.
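A rough sketch of that first, two-pass option, assuming the numbers come straight from a JDBC query; the table and column names are placeholders, and your derived values would be computed per row where the comment indicates:

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import org.apache.commons.math3.stat.descriptive.rank.Median;

class TwoPassMedian {
    // Sketch only: count first, then fill a right-sized primitive array,
    // so no ArrayList<Double> and no second copy of the data is ever needed.
    static double medianFromDb(Connection conn) throws Exception {
        long count;
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT COUNT(*) FROM measurements")) {
            rs.next();
            count = rs.getLong(1);
        }
        double[] values = new double[(int) count];   // ~2.4 GB for 300 million doubles
        int i = 0;
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT value FROM measurements")) {
            while (rs.next()) {
                values[i++] = rs.getDouble(1);        // compute your derived number here
            }
        }
        return new Median().evaluate(values);
    }
}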

There are many open source libraries that provide dynamic arrays for primitives. One of them is Trove:
http://trove.starlight-systems.com/

The median is the value in the middle of a sorted list (for an even number of elements, the average of the two middle values). So you don't have to use a second array, you can just do:
Collections.sort(myArray);
final double median = myArray.get(myArray.size() / 2);
And since you get that data from a DB anyway, you could just tell the DB to give you the median instead of computing it in Java, which also saves the time (and memory) spent transmitting the data.
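For example (a sketch only: conn is assumed to be an open JDBC Connection, the table and column names are placeholders, and the SQL for a median varies by database - percentile_cont as written here exists in e.g. PostgreSQL and Oracle):

try (Statement st = conn.createStatement();
     ResultSet rs = st.executeQuery(
         "SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY value) FROM measurements")) {
    rs.next();
    double median = rs.getDouble(1);   // the DB does the sorting and selection
}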

I agree: use the Trove4j TDoubleArrayList class (see the javadoc) to store doubles, or TFloatArrayList for floats. Combining the previous answers, we get:
// guess an initial capacity to avoid resizing
TDoubleArrayList data = new TDoubleArrayList(initialCapacity);
// fill data
data.sort();
double median = data.get(data.size()/2);

Related

Java memory optimized [Key:Long, Value:Long] store of very large size (500M) for concurrent read-access

I have a use case where I need to store key-value pairs, approximately 500 million entries, in a single JVM with 8 GB of heap. Key and value are both of type Long. The key is auto-incremented, starting from 1, 2, 3, and so on.
I build this Map[K-V] structure only once, at the start of the program, as an exclusive operation. Once it is built, it is used only for lookups; no update or delete is performed on it.
I have tried this with java.util.HashMap, but as expected it consumes a lot of memory and the program fails with an OutOfMemoryError (heap space exceeded).
I need some guidance on the following points, which could help reduce the memory footprint; I am OK with some degradation in access performance.
What other alternatives (from the Java collections or other libraries) could be tried here?
What is a recommended way to measure the memory footprint of this map, for comparison purposes?
Just use a long[] or long[][].
500 million ascending keys is less than 2^31. And if you go over 2^31, use a long[][] where the first dimension is small and the second one is large.
(When the key type is an integer, you only need a complicated "map" data structure if the key space is sparse.)
The space wastage in a 1D array is insignificant. Every Java array node has a 12-byte header, and the node size is rounded up to a multiple of 8 bytes. So a 500 million entry long[] will take so close to 500 million x 8 bytes == 4 billion bytes that it doesn't matter.
However, a JVM typically cannot allocate a single object that takes up the entire available heap space. If virtual address space is at a premium, it would be advisable to use a 2-D array; e.g. new long[4][125_000_000]. This makes the lookups slightly more complicated (see the sketch below), but you will most likely reduce the memory footprint by doing this.
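A sketch of what that two-level layout and its lookup could look like; the chunk size and class name here are my own choices, not anything prescribed:

// Dense long-to-long "map" for auto-incremented keys starting at 1, split into
// fixed-size chunks so no single array has to cover the whole key space.
class ChunkedLongArray {
    private static final int CHUNK = 125_000_000;
    private final long[][] chunks;

    ChunkedLongArray(long size) {
        int n = (int) ((size + CHUNK - 1) / CHUNK);
        chunks = new long[n][];
        for (int i = 0; i < n; i++) {
            long remaining = size - (long) i * CHUNK;
            chunks[i] = new long[(int) Math.min(CHUNK, remaining)];
        }
    }

    long get(long key) {                 // keys start at 1
        long index = key - 1;
        return chunks[(int) (index / CHUNK)][(int) (index % CHUNK)];
    }

    void put(long key, long value) {
        long index = key - 1;
        chunks[(int) (index / CHUNK)][(int) (index % CHUNK)] = value;
    }
}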
If you don't know beforehand the number of keys to expect, you could do the same thing with a combination of arrays and ArrayList objects. But an ArrayList has the problem that if you don't set an (accurate) capacity, the memory utilization is liable to be suboptimal. And if you populate an ArrayList by appending to it, the instantaneous memory demand for the append can be as much as 3 times the list's current space usage.
There is no reason for using a Map in your case.
If you just have a start index and subsequent indices are just constant increments, just use a List:
List<Long> data = new ArrayList<>(510_000_000); // the capacity should ideally not be reached; if it is, the array behind the ArrayList is reallocated and the allocated memory is doubled by that
data.add(1337L); // inserting, as often as you want
long value = data.get(1 - 1); // your index starts with 1, so subtract one from it
If you don't even add more elements and know the size from the start, an array will be even better:
long[] data = new long[510_000_000]; // the capacity should surely not be reached; if it is, you will need to create a new array and copy all data over
int currentIndex = 0;
data[currentIndex++] = 1337L; // inserting, as often as currentIndex is smaller than the size
long value = data[1 - 1]; // your index starts with 1, so subtract one from it
Note that you should check the index (currentIndex) before inserting so that it is smaller than the array length.
When iterating, use currentIndex+1 as length instead of .length.
Create an array with the size you need and whenever you need to access it, use arr[i-1] (-1 because your indices start with 1 instead of zero).
If you "just" have 500 million entries, you will not reach the integer limit and a simple array will be fine.
If you need more entries and you have sufficient memory, use an array of arrays.
The memory footprint of using an array this big is the memory footprint of the data and a bit more.
However, if you don't know the size, you should use a higher length/capacity than you may need. If you use an ArrayList, the memory footprint will be doubled (temporarily tripled) whenever the capacity is reached, because it needs to allocate a bigger array.
A Map would need an object for each entry, plus an array of buckets for all those objects, which would greatly increase the memory footprint. The growth of the memory footprint (using HashMap) is even worse than with ArrayLists, as the underlying array is reallocated even if the Map is not completely filled up.
But consider saving it to the HDD/SSD if you need to store that much data. In most cases, this works much better. You can use RandomAccessFile in order to access the data on the HDD/SSD at any point.
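A minimal sketch of that RandomAccessFile idea (the file name is made up): since the keys are dense and start at 1, each key maps to a fixed 8-byte slot at offset (key - 1) * 8.

import java.io.IOException;
import java.io.RandomAccessFile;

// Sketch: one 8-byte long per key on disk, located by seeking to its offset.
class LongStoreOnDisk implements AutoCloseable {
    private final RandomAccessFile file;

    LongStoreOnDisk(String path) throws IOException {
        this.file = new RandomAccessFile(path, "rw");
    }

    void put(long key, long value) throws IOException {   // keys start at 1
        file.seek((key - 1) * 8);
        file.writeLong(value);
    }

    long get(long key) throws IOException {
        file.seek((key - 1) * 8);
        return file.readLong();
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}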

Float or double type in terms of storage and memory

I am parsing data where precision is not my main concern. I often get java.lang.OutOfMemoryError even if I use maximum Java heap size. So my main concern here is memory usage, and java heap space. Should I use double or float data type?
I consistently get OOM exceptions because I use a great number of ArrayLists with numbers.
Well that is your problem!
An ArrayList of N 32-bit floating point values takes at least [1] 20 * N bytes in a 32-bit JVM and 24 * N bytes in a 64-bit JVM [2].
An ArrayList of N 64-bit floating point values takes the same amount of space [3].
The above only accounts for the backing array and the list elements. If you have huge numbers of small ArrayList objects, the overhead of the ArrayList object itself may be significant. (Add 16 or 24 bytes for each ArrayList object.)
If you make use of dynamic resizing, this may generate object churn as the backing array grows. At some points, the backing array may be as much as twice as large as it needs to be.
By contrast:
An array of 32-bit floating point values takes approximately 4 * N bytes [4].
An array of 64-bit floating point values takes approximately 8 * N bytes [4].
There is no wastage due to dynamic resizing.
Solutions:
ArrayList<Float> versus ArrayList<Double> makes no difference. It is NOT a solution
For maximal saving, use float[] or double[] depending on your precision requirements. Preallocate the arrays to hold the exact number of elements required.
If you want the flexibility of dynamic resizing there are 3rd-party libraries that implement space efficient lists of primitive types. Alternatively implement your own. However, you won't be able to use the standard List<...> API because that forces you down the path of using Float OR Double.
[1] - The actual space used depends on how the ArrayList was created and populated. If you pre-allocate an ArrayList with exactly the correct capacity, you will use the space I said above. If you build the array by repeatedly appending to an ArrayList with the default initial capacity, you will use on average N * 2 bytes extra space for a 32-bit JVM. This is due to the heuristic that ArrayList uses to grow the backing array when it is full.
[2] - On a 64-bit JVM, a pointer occupies 8 bytes rather than 4 ... unless you are using compressed oops.
[3] - The reason it takes the same amount of bytes is that on a typical JVM a Float and a Double are both 16 bytes due to heap node padding.
[4] - There is a header overhead of (typically) 12 bytes per array, and the array's heap node size is padded to a multiple of 8 bytes.
If your memory usage is related to a huge amount (many millions) of floating-point numbers (which can be verified with a decent memory profiler), then you're most probably storing them in some data structures like arrays or lists.
Recommendations (I guess, you are already following most of them...):
Prefer float over double if number range and precision are sufficient, as that consumes only half the size.
Do not use the java.lang.Float or java.lang.Double classes for storage, as they have a considerable memory overhead compared to the naked scalar values.
Be sure to use arrays, not collections like java.util.List, as they store boxed java.lang.Float instances instead of the naked numbers.
But above that, have a decent memory profiler show you which instances occupy most of your memory. Maybe there are other memory consumers besides the float/double data.
EDIT:
The OP's recent comment "I consistently get OOM exceptions because I use a great number of ArrayLists with numbers" makes it clear. ArrayList<Float> wastes a lot of memory when compared to float[] (Stephen C gave detailed numbers in his answer), but gives the benefit of dynamic resizing.
So, I see the following possibilities:
If you can tell the array size from the beginning, then immediately use float[] arrays.
If you need the dynamic size while initializing instances, use ArrayList<Float> while building one object (when size still increases), and then copy the contents to a float[] array for long-term storage. Then the wasteful ArrayLists exist only for a limited timespan.
If you need dynamic sizes over the whole lifespan of your data, create your own FloatArrayList class based on a float[] array, resembling the ArrayList<Float> as far as your code needs it (that can range from a very shallow implementation up to a full-featured List, maybe based on AbstractList). A minimal sketch follows.
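A very shallow version of such a FloatArrayList might look like this; the growth factor and method names are my own choices:

import java.util.Arrays;

// Minimal growable list backed by float[]; not a full java.util.List implementation.
class FloatArrayList {
    private float[] data;
    private int size;

    FloatArrayList(int initialCapacity) {
        data = new float[initialCapacity];
    }

    void add(float value) {
        if (size == data.length) {
            data = Arrays.copyOf(data, data.length + (data.length >> 1) + 1); // grow by ~50%
        }
        data[size++] = value;
    }

    float get(int index) {
        if (index >= size) throw new IndexOutOfBoundsException(String.valueOf(index));
        return data[index];
    }

    int size() {
        return size;
    }

    float[] toArray() {
        return Arrays.copyOf(data, size); // trimmed copy for long-term storage
    }
}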

Best data structure to hold large amounts of data?

Reading in a lot of data from a file. There may be 100 different data objects with necessary headings, but there can be well over 300,000 values stored in each of these data objects. The values need to be stored in the same order that they are read in. This is the constructor for the data object:
public Data(String heading, ArrayList<Float> values) {
    this.heading = heading;
    this.values = values;
}
What would be the quickest way to store and retrieve these values sequentially in RAM?
Although in your comments you mention "quickness", without specifying what operation needs to be "quick", your main concern seems to be heap memory consumption.
Let's assume 100 groups of 300,000 numbers (you've used words like "may be" and "well over" but this will do as an example).
That's 30,000,000 numbers to store, plus 100 headings and some structural overhead for grouping.
A primitive Java float is 32 bits, that is 4 bytes. So at an absolute minimum, you're going to need 30,000,000 * 4 bytes == 120MB.
An array of primitives - float[30000000] - is just all the values concatenated into a contiguous chunk of memory, so will consume this theoretical minimum of 120MB -- plus a few bytes of once-per-array overhead that I won't go into detail about here.
A Java Float wrapper object is typically 16 bytes (a 12-byte object header plus the 4-byte float, padded), and when you store an object (rather than a primitive) in an array, the reference itself takes another 4 bytes. So an array of Float - Float[30000000] - will consume roughly 30,000,000 * (16 + 4) == 600MB.
So, you can cut your memory use by more than half by using primitives rather than wrappers.
An ArrayList is quite a light wrapper around an array of Object and so has about the same memory costs. The once-per-list overheads are too small to have an impact compared to the elements, at these list sizes. But there are some caveats:
ArrayList can only store Objects, not primitives, so if you choose a List you're stuck with the per-element overhead of boxed Float objects described above.
There are some third-party libraries that provide lists of primitives - see: Create a List of primitive int?
The capacity of an ArrayList is dynamic, and to achieve this, if you grow the list to be bigger than its backing array, it will:
create a new array, 50% bigger than the old array
copy the contents of the old array into the new array (this sounds expensive, but hardware is very fast at doing this)
discard the old array
This means that if the backing array happens to have 30 million elements, and is full, ArrayList.add() will replace the array with one of 45 million elements, even if your List only needs 30,000,001.
You can avoid this if you know the needed capacity in advance, by providing the capacity in the constructor.
You can use ArrayList.trimToSize() to drop unneeded capacity and claw some memory back after you've filled the ArrayList.
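Both points in code form; the capacity of 300,000 here is just the per-group figure from the question:

ArrayList<Float> values = new ArrayList<>(300_000); // capacity known in advance, no grow-and-copy
// ... fill the list while reading ...
values.trimToSize();                                 // backing array shrunk to exactly size()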
If I was striving to use as little heap memory as possible, I would aim to store my lists of numbers as arrays of primitives:
class Data {
    String header;
    float[] values;
}
... and I would just put these into an ArrayList<Data>.
With this structure, you have O(1) access to arbitrary values, and you can use Arrays.binarySearch() (if the values are sorted) to find by value within a group.
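For example, searching by value within one group (done on a sorted copy here, since the question needs the original read order preserved; 42.0f is just a sample value):

float[] sorted = Arrays.copyOf(data.values, data.values.length);
Arrays.sort(sorted);                          // binarySearch requires sorted input
int pos = Arrays.binarySearch(sorted, 42.0f); // >= 0 if found, otherwise -(insertionPoint) - 1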
If at all possible, I would find out the size of each group before reading the values, and initialise the array to the right size. If you can, make your input file format facilitate this:
String line;
while ((line = readLine()) != null) {
    if (isHeader(line)) {
        ParsedHeader header = new ParsedHeader(line);
        currentArray = new float[header.size()];
        arrayIndex = 0;
        currentGroup = new Group(header.name(), currentArray);
        groups.add(currentGroup);
    } else if (isValue(line)) {
        currentArray[arrayIndex++] = parseValue(line);
    }
}
If you can't change the input format, consider making two passes through the file - once to discover group lengths, once again to fill your arrays.
If you have to consume the file in one pass, and the file format can't provide group lengths before groups, then you'll have to do something that allows a "list" to grow arbitrarily. There are several options:
Consume each group into an ArrayList<Float> - when the group is complete, convert it into a float[]:
float[] array = new float[list.size()];
int i = 0;
for (Float f : list) {
    array[i++] = f; // auto-unboxes Float to float
}
Use a third-party list-of-float library class
Copy the logic used by ArrayList to replace your array with a bigger one when needed -- http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/util/ArrayList.java#ArrayList.ensureCapacity%28int%29
Any number of approaches discussed in Computer Science textbooks, for example a linked list of arrays.
However none of this considers your reasons for slurping all these numbers into memory in the first place, nor whether this store meets your needs when it comes to processing the numbers.
You should step back and consider what your actual data processing requirement is, and whether slurping into memory is the best approach.
See whether you can do your processing by storing only a slice of data at a time, rather than storing the whole thing in memory. For example, to calculate max/min/mean, you don't need every number to be in memory -- you just need to keep a running total.
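A sketch of that running-statistics idea, reusing the hypothetical readLine()/isValue()/parseValue() helpers from the parsing sketch above: nothing is stored except a handful of accumulators.

double min = Double.POSITIVE_INFINITY;
double max = Double.NEGATIVE_INFINITY;
double sum = 0;
long count = 0;
String line;
while ((line = readLine()) != null) {
    if (isValue(line)) {
        float v = parseValue(line);
        min = Math.min(min, v);
        max = Math.max(max, v);
        sum += v;
        count++;
    }
}
double mean = count > 0 ? sum / count : Double.NaN; // guard against an empty input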
Or, consider using a lightweight database library.
You could use a red-black BST, which will be an efficient way to store/retrieve data. This relies on nodes that link to other nodes, so there's no limit to the size of the input, as long as you have enough memory for the JVM.

Java - Large array advice on how to break it down [duplicate]

I'm trying to find a counterexample to the Pólya Conjecture, which will be somewhere in the 900 millions. I'm using a very efficient algorithm that doesn't even require any factorization (similar to a Sieve of Eratosthenes, but with even more information). So, a large array of ints is required.
The program is efficient and correct, but requires an array up to the x I want to check for (it checks all numbers from (2, x)). So, if the counterexample is in the 900 millions, I need an array that is just as large. Java won't allow me anything over about 20 million. Is there anything I can possibly do to get an array that large?
You may want to extend the max size of the JVM Heap. You can do that with a command line option.
I believe it is -Xmx3600m (3600 megabytes)
Java arrays are indexed by int, so an array can't have more than 2^31 - 1 entries (there are no unsigned ints). So, the maximum size of an array is 2,147,483,647 elements, which for a plain int[] consumes roughly 8.6 billion bytes (= 8 GB).
Thus, the int-index is usually not a limitation, since you would run out of memory anyway.
In your algorithm, you should use a List (or a Map) as your data structure instead, and choose an implementation of List (or Map) that can grow beyond 2^31 entries. This can get tricky, since the "usual" implementations ArrayList (and HashMap) use arrays internally. You will have to implement a custom data structure; e.g. by using a 2-level array (a list of arrays). While you are at it, you can also try to pack the bits more tightly.
Java will allow up to about 2 billion array entries. It's your machine (and your limited memory) that cannot handle such a large amount.
900 million 32-bit ints with no further overhead - and there will always be more overhead - would require a little over 3.35 GiB. The only way to get that much memory is with a 64-bit JVM (on a machine with at least 8 GB of RAM) or by using some disk-backed cache.
If you don't need it all loaded in memory at once, you could segment it into files and store on disk.
What do you mean by "won't allow"? You are probably getting an OutOfMemoryError, so add more memory with the -Xmx command-line option.
You could define your own class which stores the data in a 2d array which would be closer to sqrt(n) by sqrt(n). Then use an index function to determine the two indices of the array. This can be extended to more dimensions, as needed.
The main problem you will run into is running out of RAM. If you approach this limit, you'll need to rethink your algorithm or consider external storage (ie a file or database).
If your algorithm allows it:
Compute it in slices which fit into memory.
You will have to redo the computation for each slice, but it will often be fast enough.
Use an array of a smaller numeric type such as byte.
Depending on how you need to access the array, you might find a RandomAccessFile will allow you to use a file which is larger than will fit in memory. However, the performance you get is very dependent on your access behaviour.
I wrote a version of the Sieve of Eratosthenes for Project Euler which worked on chunks of the search space at a time. It processes the first 1M integers (for example), but keeps each prime number it finds in a table. After you've iterated over all the primes found so far, the array is re-initialised and the primes found already are used to mark the array before looking for the next one.
The table maps a prime to its 'offset' from the start of the array for the next processing iteration.
This is similar in concept (if not in implementation) to the way functional programming languages perform lazy evaluation of lists (although in larger steps). Allocating all the memory up-front isn't necessary, since you're only interested in the parts of the array that pass your test for primeness. Keeping the non-primes hanging around isn't useful to you.
This method also provides memoisation for later iterations over prime numbers. It's faster than scanning your sparse sieve data structure looking for the ones every time.
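The windowed idea, in rough code; this is my own illustration, not the poster's Project Euler program, and it recomputes each prime's next offset arithmetically instead of keeping it in a table:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Segmented Sieve of Eratosthenes: only one fixed-size window of flags is in
// memory; primes found so far are reused to mark every later window.
public class SegmentedSieve {
    public static List<Long> primesUpTo(long limit, int windowSize) {
        List<Long> primes = new ArrayList<>();
        boolean[] composite = new boolean[windowSize];
        for (long low = 2; low <= limit; low += windowSize) {
            long high = Math.min(low + windowSize - 1, limit);
            Arrays.fill(composite, false);
            for (long p : primes) {                       // primes from earlier windows
                if (p * p > high) break;
                long start = Math.max(p * p, ((low + p - 1) / p) * p);
                for (long m = start; m <= high; m += p) composite[(int) (m - low)] = true;
            }
            for (long n = low; n <= high; n++) {          // new primes found in this window
                if (!composite[(int) (n - low)]) {
                    primes.add(n);
                    for (long m = n * n; m <= high; m += n) composite[(int) (m - low)] = true;
                }
            }
        }
        return primes;
    }
}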
I second #sfossen's idea and #Aaron Digulla. I'd go for disk access. If your algorithm can take in a List interface rather than a plain array, you could write an adapter from the List to the memory mapped file.
Use Tokyo Cabinet, Berkeley DB, or any other disk-based key-value store. They're faster than any conventional database but allow you to use the disk instead of memory.
could you get by with 900 million bits? (maybe stored as a byte array).
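If a single flag per number really is enough, java.util.BitSet already packs that: 900 million bits is roughly 110 MB. A tiny illustration (whether one bit per number suffices depends on what the sieve actually has to record):

java.util.BitSet flags = new java.util.BitSet(900_000_000); // ~110 MB of longs internally
flags.set(123_456_789);                                     // mark one number
boolean marked = flags.get(123_456_789);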
You can try splitting it up into multiple arrays.
List<Integer> myFirstList = new ArrayList<>();
List<Integer> mySecondList = new ArrayList<>();
for (int x = 0; x <= 1000000; x++) {
    myFirstList.add(x);
}
for (int x = 1000001; x <= 2000000; x++) {
    mySecondList.add(x);
}
then iterate over them.
for (int x : myFirstList) {
    for (int y : myFirstList) {
        // Remove multiples
    }
}
// repeat for second list
Use a memory-mapped file (java.nio, available since Java 1.4) instead. Or move the sieve into a small C library and use Java JNI.
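A sketch of the memory-mapped approach; the file name is made up, and note that a single mapping is limited to Integer.MAX_VALUE bytes, so 900 million ints (about 3.6 GB) would have to be split over several mappings - only one is shown here:

import java.io.RandomAccessFile;
import java.nio.IntBuffer;
import java.nio.channels.FileChannel;

// Back a large int "array" with a file instead of the Java heap.
public class MappedIntArray {
    public static void main(String[] args) throws Exception {
        int count = 500_000_000;                              // 2 GB worth of ints
        try (RandomAccessFile file = new RandomAccessFile("sieve.dat", "rw");
             FileChannel channel = file.getChannel()) {
            IntBuffer ints = channel
                    .map(FileChannel.MapMode.READ_WRITE, 0, (long) count * 4)
                    .asIntBuffer();
            ints.put(123_456_789, 42);                        // write by index
            int value = ints.get(123_456_789);                // read by index
            System.out.println(value);
        }
    }
}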

Array programming in java large size

How do I declare an array of size 10^9 in Java? I have tried an ArrayList, but the problem is that I need to find the minimum and maximum element in the array, so I need to compare the 0th element of the array with all the other elements, and I initially need a fixed array size, which the input format on CodeChef requires. Can anybody help? I tried using a long array, but it gave an out-of-memory error.
A Java array can have a maximum size of Integer.MAX_VALUE elements (or in some cases a slightly smaller value), which is about 2.1 * 10^9, so creating an array that large is theoretically possible. However, since 10^9 corresponds to the prefix giga, such an array would occupy at least 1 GB even as a byte[]. Depending on the element type you're using, the array probably just takes too much memory (an int[] of 10^9 elements would already take 4 GB).
You could try to increase the JVM's maximum heap by using the -Xmx option (e.g. -Xmx4g allows a maximum of 4 GB), but you're still limited by the amount of addressable memory (e.g. IIRC a 32-bit JVM can only address up to 4 GB in total) and by the memory actually available.
Alternatively you could try and split the array over multiple machines or JVMs and employ some distributed approach. Or you could write the array to a (memory mapped) file and keep only a part of the array in memory.
The best approach, however, would probably be to check whether you really need that much memory. In many cases clever algorithms or data structures can dramatically reduce the memory requirements; for this question, for example, the minimum and maximum can be tracked while the numbers are read, without storing them at all. What to use depends on what you're trying to achieve in the end, though.
