Best data structure to hold large amounts of data?


I'm reading in a lot of data from a file. There may be around 100 data objects, each with its own heading, and each can hold well over 300,000 values. The values need to be stored in the same order that they are read in. This is the constructor for the data object:
public Data(String heading, ArrayList<Float> values) {
    this.heading = heading;
    this.values = values;
}
What would be the quickest way to store and retrieve these values sequentially in RAM?

Although in your comments you mention "quickness", without specifying what operation needs to be "quick", your main concern seems to be heap memory consumption.
Let's assume 100 groups of 300,000 numbers (you've used words like "may be" and "well over" but this will do as an example).
That's 30,000,000 numbers to store, plus 100 headings and some structural overhead for grouping.
A primitive Java float is 32 bits, that is 4 bytes. So at an absolute minimum, you're going to need 30,000,000 * 4 bytes == 120MB.
An array of primitives - float[30000000] - is just all the values concatenated into a contiguous chunk of memory, so will consume this theoretical minimum of 120MB -- plus a few bytes of once-per-array overhead that I won't go into detail about here.
A Java Float wrapper object is roughly 12 bytes (the exact figure depends on the JVM). When you store an object (rather than a primitive) in an array, the reference itself is 4 bytes. So an array of Float - Float[30000000] - will consume about 30,000,000 * (12 + 4) == 480MB.
So, you can cut your memory use by more than half by using primitives rather than wrappers.
An ArrayList is quite a light wrapper around an array of Object and so has about the same memory costs. The once-per-list overheads are too small to have an impact compared to the elements, at these list sizes. But there are some caveats:
ArrayList can only store Objects, not primitives, so if you choose a List you're stuck with the 12-bytes-per-element overhead of Float.
There are some third-party libraries that provide lists of primitives - see: Create a List of primitive int?
The capacity of an ArrayList is dynamic, and to achieve this, if you grow the list to be bigger than its backing array, it will:
create a new array, 50% bigger than the old array
copy the contents of the old array into the new array (this sounds expensive, but hardware is very fast at doing this)
discard the old array
This means that if the backing array happens to have 30 million elements, and is full, ArrayList.add() will replace the array with one of 45 million elements, even if your List only needs 30,000,001.
You can avoid this if you know the needed capacity in advance, by providing the capacity in the constructor.
You can use ArrayList.trimToSize() to drop unneeded capacity and claw some memory back after you've filled the ArrayList.
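The two capacity tips above can be sketched roughly as follows. This is only an illustration: the class and method names are made up for the example, and the values written into the list stand in for whatever you parse from your file.

```java
import java.util.ArrayList;

final class PresizedList {
    // Size known in advance: pass it to the constructor so add() never
    // has to replace the backing array.
    static ArrayList<Float> fillKnownSize(int expected) {
        ArrayList<Float> values = new ArrayList<>(expected);
        for (int i = 0; i < expected; i++) {
            values.add((float) i); // stand-in for a parsed value
        }
        return values;
    }

    // Size unknown: let the list grow as needed, then drop the spare
    // capacity once the group is complete.
    static ArrayList<Float> fillUnknownSize(Iterable<Float> source) {
        ArrayList<Float> values = new ArrayList<>();
        for (Float f : source) {
            values.add(f);
        }
        values.trimToSize();
        return values;
    }
}
```

Note that trimToSize() itself copies the elements into a smaller array, so both arrays exist briefly while it runs.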
If I was striving to use as little heap memory as possible, I would aim to store my lists of numbers as arrays of primitives:
class Data {
    String header;
    float[] values;
}
... and I would just put these into an ArrayList<Data>.
With this structure, you have O(1) access to arbitrary values, and you can use Arrays.binarySearch() (if the values are sorted) to find by value within a group.
If at all possible, I would find out the size of each group before reading the values, and initialise the array to the right size. If you can, make your input file format facilitate this:
String line;
while ((line = readLine()) != null) {
    if (isHeader(line)) {
        // the header carries the group size, so the array can be sized exactly
        ParsedHeader header = new ParsedHeader(line);
        currentArray = new float[header.size()];
        arrayIndex = 0;
        currentGroup = new Group(header.name(), currentArray);
        groups.add(currentGroup);
    } else if (isValue(line)) {
        currentArray[arrayIndex++] = parseValue(line);
    }
}
If you can't change the input format, consider making two passes through the file - once to discover group lengths, once again to fill your arrays.
If you have to consume the file in one pass, and the file format can't provide group lengths before groups, then you'll have to do something that allows a "list" to grow arbitrarily. There are several options:
Consume each group into an ArrayList<Float> - when the group is complete, convert it into a float[]:
float[] array = new float[list.size()];
int i = 0;
for (Float f : list) {
    array[i++] = f; // auto-unboxes Float to float, then advances the index
}
Use a third-party list-of-float library class
Copy the logic used by ArrayList to replace your array with a bigger one when needed -- http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/util/ArrayList.java#ArrayList.ensureCapacity%28int%29
Any number of approaches discussed in Computer Science textbooks, for example a linked list of arrays.
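As a rough illustration of the "copy the logic used by ArrayList" option, here is a minimal growable array of primitive floats. The class name, the starting capacity of 16 and the 50% growth step are my own choices for the sketch, not something taken from a library, and it ignores integer overflow for extremely large arrays.

```java
import java.util.Arrays;

final class GrowableFloatArray {
    private float[] data = new float[16]; // arbitrary starting capacity
    private int size = 0;

    void add(float value) {
        if (size == data.length) {
            // Grow by 50%: allocate a bigger array and copy the old contents.
            data = Arrays.copyOf(data, data.length + (data.length >> 1));
        }
        data[size++] = value;
    }

    int size() {
        return size;
    }

    float get(int index) {
        if (index >= size) {
            throw new IndexOutOfBoundsException("index " + index + ", size " + size);
        }
        return data[index];
    }

    // Returns an exactly sized copy, e.g. to store in the Data object.
    float[] toArray() {
        return Arrays.copyOf(data, size);
    }
}
```

Values come back in the order they were added, and the only per-element cost is the 4 bytes of the float itself plus whatever spare capacity is left at the end.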
However none of this considers your reasons for slurping all these numbers into memory in the first place, nor whether this store meets your needs when it comes to processing the numbers.
You should step back and consider what your actual data processing requirement is, and whether slurping into memory is the best approach.
See whether you can do your processing by storing only a slice of data at a time, rather than storing the whole thing in memory. For example, to calculate max/min/mean, you don't need every number to be in memory -- you just need to keep a running total.
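For instance, max/min/mean can be tracked with a handful of fields while the values stream past. A minimal sketch (the class name is hypothetical, and the sum is kept in a double to limit rounding error):

```java
final class RunningStats {
    private long count = 0;
    private double sum = 0.0;
    private float min = Float.POSITIVE_INFINITY;
    private float max = Float.NEGATIVE_INFINITY;

    // Call once per value as it is read; nothing is retained afterwards.
    void accept(float v) {
        count++;
        sum += v;
        if (v < min) min = v;
        if (v > max) max = v;
    }

    long count() { return count; }
    float min() { return min; }
    float max() { return max; }

    double mean() {
        return count == 0 ? Double.NaN : sum / count;
    }
}
```

You would keep one such object per heading and feed it each value as it is parsed.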
Or, consider using a lightweight database library.

You could use a RedBlack BST, which will be an extremely efficient way to store/retrieve data. This relies on nodes that link to other nodes, so there's no limit to the size of the input, as long as you have enough memory for java.

Related

Growable multidimensional data structure supporting range queries

Let me put the question first: considering the situation and requirements I'll describe further down, what data structures would make sense/help achieving the non-functional requirements?
I tried to look up several structures but wasn't very successful so far, which might be due to me missing some terminology.
Since we'll implement that in Java any answers should take that into account (e.g. no pointer-magic, assume 8-byte references etc.).
The situation
We have a somewhat large set of values that are mapped via a 4-dimensional key (let's call those dimensions A, B, C and D). Each dimension can have a different size, so we'll assume the following:
A: 100
B: 5
C: 10000
D: 2
This means a completely filled structure would contain 10 million elements. Not considering their size the space needed to hold the references alone would be like 80 megabytes, so that would be considered a lower bound for memory consumption.
We further can assume that the structure won't be completely filled but quite densely.
The requirements
Since we build and query that structure quite often we have the following requirements:
constructing the structure should be fast
queries on single elements and ranges (e.g. [A1-A5, B3, any C, D0]) should be efficient
fast deletion of elements isn't required (won't happen too often)
the memory footprint should be low
What we already considered
kd-trees
Building such a tree takes some time since it can get quite deep and we'd either have to accept slower queries or take rebalancing measures. Additionally, the memory footprint is quite high since we need to hold the complete key in each node (there might be ways to reduce that though).
Nested maps/map tree
Using nested maps we could store only the key for each dimension as well as a reference to the next dimension map or the values - effectively building a tree out of those maps. To support range queries we'd keep sorted sets of the possible keys and access those while traversing the tree.
Construction and queries were way faster than with kd-trees but the memory footprint was much higher (as expected).
A single large map
An alternative would be to keep the sets for individual available keys and use a single large map instead.
Construction and queries were fast as well but memory consumption was even higher due to each map node being larger (they need to hold all dimensions of a key now).
What we're thinking of at the moment
Building insertion-order index-maps for the dimension keys, i.e. we map each incoming key to a new integer index as it comes in. Thus we can make sure that those indices grow one step at a time without any gaps (not considering deletions).
With those indices we'd then access a tree of n-dimensional arrays (flattened to a 1-d array of course) - aka n-ary tree. That tree would grow on demand, i.e. if we need a new array then instead of creating a larger one and copying all the data we'd just create the new block. Any needed non-leaf nodes would be created on demand, replacing the root if needed.
Let me illustrate that with an example of 2 dimensions A and B. We'll allocate 2 elements for each dimension resulting in a 2x2 matrix (array of length 4).
Adding the first element A1/B1 we'd get something like this:
[A1/B1,null,null,null]
Now we add element A2/B2:
[A1/B1,null,A2/B2,null]
Now we add element A3/B3. Since we can't map the new element to the existing array we'll create a new one as well as a common root:
[x,null,x,null]
/ \
[A1/B1,null,A2/B2,null] [A3/B3,null,null,null]
Memory consumption for densely filled matrices should be rather low depending on the size of each array (having 4 dimensions and 4 values per dimension in an array we'd have arrays of length 256 and thus get a maximum tree depth of 2-4 in most cases).
Does this make sense?

If the structure will be "quite densely" filled, then I think it makes sense to assume that it will be full. That simplifies things quite a bit. And it's not like you're going to save a lot (or anything) using a sparse matrix representation of a densely filled matrix.
I'd try the simplest possible structure first. It might not be the most memory efficient, but it should be reasonable and quite easy to work with.
First, a simple array of 10,000,000 references. That is (and please pardon the C#, as I'm not really a Java programmer):
MyStructure[] theArray = new MyStructure[10000000];
As you say, that's going to consume 80 megabytes.
Next is four different dictionaries (maps, I think, in Java), one for each key type:
Dictionary<KeyAType, int> ADict;
Dictionary<KeyBType, int> BDict;
Dictionary<KeyCType, int> CDict;
Dictionary<KeyDType, int> DDict;
When you add an element at {A,B,C,D}, you look up the respective keys in the dictionary to get their indexes (or add a new index if that key doesn't exist), and do the math to compute an index into the array. The math is, I think:
DIndex + 2*(CIndex + 10000*(BIndex + 5*AIndex));
In .NET, dictionary overhead is something like 24 bytes per key. But you only have 10,107 total keys (100 + 5 + 10,000 + 2), so the dictionaries are going to consume something like 250 kilobytes.
This should be very quick to query directly, and range queries should be as fast as a single lookup and then some array manipulation.
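Since the question is about Java, here is a rough Java translation of the idea: one HashMap per dimension that hands out indexes in insertion order, plus a single flat array addressed with the formula above. The class name DenseGrid4 and its methods are made up for this sketch, and it assumes the dimension sizes are known up front.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class DenseGrid4<V> {
    private final int[] sizes; // e.g. {100, 5, 10000, 2}
    private final List<Map<Object, Integer>> indexOf = new ArrayList<>();
    private final Object[] cells;

    DenseGrid4(int a, int b, int c, int d) {
        sizes = new int[] {a, b, c, d};
        for (int i = 0; i < 4; i++) {
            indexOf.add(new HashMap<>());
        }
        cells = new Object[a * b * c * d];
    }

    // Looks up the index of a key in one dimension; optionally assigns the
    // next free index to a key that has not been seen before.
    private int index(int dim, Object key, boolean create) {
        Map<Object, Integer> m = indexOf.get(dim);
        Integer i = m.get(key);
        if (i != null) return i;
        if (!create) return -1;
        if (m.size() == sizes[dim]) {
            throw new IllegalStateException("dimension " + dim + " is full");
        }
        m.put(key, m.size());
        return m.size() - 1;
    }

    // Same arithmetic as DIndex + 2*(CIndex + 10000*(BIndex + 5*AIndex)),
    // with the dimension sizes taken from the constructor.
    int flat(int a, int b, int c, int d) {
        return d + sizes[3] * (c + sizes[2] * (b + sizes[1] * a));
    }

    void put(Object a, Object b, Object c, Object d, V value) {
        cells[flat(index(0, a, true), index(1, b, true),
                   index(2, c, true), index(3, d, true))] = value;
    }

    @SuppressWarnings("unchecked")
    V get(Object a, Object b, Object c, Object d) {
        int ia = index(0, a, false);
        int ib = index(1, b, false);
        int ic = index(2, c, false);
        int id = index(3, d, false);
        if (ia < 0 || ib < 0 || ic < 0 || id < 0) return null;
        return (V) cells[flat(ia, ib, ic, id)];
    }
}
```

A range query such as [A1-A5, B3, any C, D0] then becomes nested loops over the corresponding index ranges, calling flat() for each combination.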
One thing I'm not clear on is whether you want a key to resolve to the same index with every build. That is, if "foo" maps to index 1 in one build, will it always map to index 1?
If so, you probably should statically construct the dictionaries. I guess it depends on if your range queries always expect things in the same key order.
Anyway, this is a very simple and very effective data structure. If you can afford 81 megabytes as the maximum size of the structure (minus the actual data), it seems like a good place to start. You could probably have it working in a couple of hours.
At best it's all you'll have to do. And if you end up having to replace it, at least you have a working implementation that you can use to verify the correctness of whatever new structure you come up with.

There are other multidimensional trees that are usually better than kd-trees: quadtrees, R*Trees (like R-Tree, but much faster for updates) or PH-Tree.
The PH-Tree is like a quadtree, but much more space efficient and scales better with dimensions. Its depth is limited by the maximum bit width of the values, i.e. a maximum of '10000' requires 14 bits, so the depth will not be more than 14.
Java implementations of all trees can be found on my repo, either here (quadtree may be a bit buggy) or here.
EDIT
The following optimization can probably be ignored. Of course the described query will result in a full scan, but that may not be as bad as it sounds, because it will on average anyway return 33%-50% of the whole tree.
Possible optimisation (not tested, but might work for the PH-Tree):
One problem with range queries is the different selectivity of your dimensions, which may result in something close to a full scan of the tree. For example when querying for [0..100][0..5][0..10000][1..1], i.e. constraining only the last dimension (with least selectivity).
To avoid this, especially for the PH-Tree, I would try to multiply your values by a fixed constant. For example multiply A by 100, B by 2000, C by 1 and D by 5000. This allows all values to range from 0 to 10000, which may improve query performance when constraining only dimensions with low selectivity (the 2nd or 4th).

Tradeoff between guessing ArrayList capacity wrongly vs. unused values?

Say I have to read out data, which can be either 1 object (majority of time) or multiple objects (some of the time).
If I do:
List<Object> list = new ArrayList<Object>(1);
... loop over the loaded object(s) and add it/them to the list...
This will serve me well the majority of times when there is only 1 object loaded from the database.
But assuming the less common scenario where I have to expand my initial list, this will cause me to lose operations.
I assume this won't really make much of an impact in the real world, but I wonder how I could calculate the following:
Assume X% of my data is 1 object, and Y% is a list of multiple objects. Is there a way I can calculate the ideal initial capacity for my list, going for the least operations (through list expansions, allocated but unused fields in the list)?

You dissociate your data into 2 groups X (1 element) and Y (more than one). You optimized your code for the X group because it is the most common case.
It's a good idea to initialize your ArrayList with one element so most of the time you won't waste any memory.
But if the members of the Y group have a high average size (and a small standard deviation) you can still optimize the worst case with ensureCapacity(int cap). On the second iteration you can force the ArrayList to resize its backing array to the average size of the Y group.
For a member of the Y group with 100 elements, the unoptimized list creates and copies its backing array 12 times and ends with a backing array of length 141, against 1 small array copy and no wasted memory if you implement the optimization.
Example of this optimization :
Iterator<Obj> it = // Get your iterator from your resource
ArrayList<Obj> result = new ArrayList<Obj>(1);
if (it.hasNext()) {
    result.add(it.next());
}
if (it.hasNext()) {
    result.ensureCapacity(100); // Avg size of the Y group
    while (it.hasNext()) {
        result.add(it.next());
    }
}
But unless it's a performance-critical feature, it's not worth the effort, because to make sure this trick really improves speed and memory use you have to analyse the distribution of sizes in the Y group.
It's not directly related to your problem, but this question contains a lot of useful comments on ArrayList: When to use LinkedList over ArrayList?
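The figures of 12 reallocations and a final backing array of 141 slots can be checked with a small simulation. This is a model, not the real class: it assumes the growth rule used by recent OpenJDK versions of ArrayList (grow by 50%, but always by at least one slot), and other implementations may grow differently.

```java
final class GrowthSimulation {
    // Returns {reallocations, finalCapacity} after adding 'elements' items
    // to a list whose backing array starts with 'initialCapacity' slots.
    static int[] simulate(int initialCapacity, int elements) {
        int capacity = initialCapacity;
        int reallocations = 0;
        for (int size = 0; size < elements; size++) {
            if (size == capacity) {
                // Full: grow by 50%, but by at least one slot.
                capacity = Math.max(capacity + (capacity >> 1), size + 1);
                reallocations++;
            }
        }
        return new int[] {reallocations, capacity};
    }

    public static void main(String[] args) {
        int[] r = simulate(1, 100);
        System.out.println(r[0] + " reallocations, capacity " + r[1]);
    }
}
```

Starting from capacity 1 the capacity goes 2, 3, 4, 6, 9, 13, 19, 28, 42, 63, 94, 141, which is where the 12 and the 141 come from.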

ArrayList<Double> to double[] with 300 million entries

I'm using a java program to get some data from a DB. I then calculate some numbers and start storing them in an array. The machine I'm using has 4 gigs of RAM. Now, I don't know how many numbers there will be in advance, so I use an ArrayList<Double>. But I do know there will be roughly 300 million numbers.
So, since one double is 8 bytes a rough estimate of the memory this array will consume is 2.4 gigs (probably more because of the overheads of an ArrayList). After this, I want to calculate the median of this array and am using the org.apache.commons.math3.stat.descriptive.rank.Median library which takes as input a double[] array. So, I need to convert the ArrayList<Double> to double[].
I did see many questions where this is raised and they all mention there is no way around looping through the entire array. Now this is fine, but since they also maintain both objects in memory, this brings my memory requirements up to 4.8 gigs. Now we have a problem since the total RAM available is 4 gigs.
First of all, is my suspicion that the program will at some point give me a memory error correct (it is currently running)? And if so, how can I calculate the median without having to allocate double the memory? I want to avoid sorting the array as calculating the median is O(n).

Your problem is even worse than you realize, because ArrayList<Double> is much less efficient than 8 bytes per entry. Each entry is actually an object, to which the ArrayList keeps an array of references. A Double object is probably about 12 bytes (4 bytes for some kind of type identifier, 8 bytes for the double itself), and the reference to it adds another 4, bringing the total up to 16 bytes per entry, even excluding overhead for memory management and such.
If the constraints were a little wider, you could implement your own DoubleArray that is backed by a double[] but knows how to resize itself. However, the resizing means you'll have to keep a copy of both the old and the new array in memory at the same time, also blowing your memory limit.
That still leaves a few options though:
Loop through the input twice; once to count the entries, once to read them into a right-sized double[]. It depends on the nature of your input whether this is possible, of course.
Make some assumption on the maximum input size (perhaps user-configurable), and allocate a double[] up front that is this fixed size. Use only the part of it that's filled.
Use float instead of double to cut memory requirements in half, at the expense of some precision.
Rethink your algorithm to avoid holding everything in memory at once.
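The "allocate a double[] up front and use only the filled part" option could look roughly like this. The class name and the capacity argument are assumptions for the sketch. It sorts the filled prefix in place with Arrays.sort, which is O(n log n) rather than the O(n) selection the question asks for, but it needs no second array.

```java
import java.util.Arrays;

final class BoundedDoubleStore {
    private final double[] data;
    private int size = 0;

    // maxEntries is an upper bound on the input size, chosen by the caller.
    BoundedDoubleStore(int maxEntries) {
        data = new double[maxEntries];
    }

    void add(double v) {
        if (size == data.length) {
            throw new IllegalStateException("capacity exceeded");
        }
        data[size++] = v;
    }

    int size() {
        return size;
    }

    // Sorts only the filled part of the array, in place, and returns its
    // median (the mean of the two middle values when the count is even).
    double median() {
        if (size == 0) {
            throw new IllegalStateException("no data");
        }
        Arrays.sort(data, 0, size);
        int mid = size / 2;
        return (size % 2 == 1) ? data[mid] : (data[mid - 1] + data[mid]) / 2.0;
    }
}
```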

There are many open source libraries that create dynamic arrays for primitives. One of these:
http://trove.starlight-systems.com/

The Median value is the value at the middle of a sorted list. So you don't have to use a second array, you can just do:
Collections.sort(myArray);
final double median = myArray.get(myArray.size() / 2);
And since you get that data from a DB anyways, you could just tell the DB to give you the median instead of doing it in Java, which will save all the time (and memory) for transmitting the data as well.

I agree, use the Trove4j TDoubleArrayList class (see javadoc) to store double, or TFloatArrayList for float. Combining the previous answers, we get:
// guess initialcapacity to remove requirement for resizing
TDoubleArrayList data = new TDoubleArrayList(initialcapacity);
// fill data
data.sort();
double median = data.get(data.size()/2);

Merge many short sorted lists into a long sorted list efficiently

I'm repeatedly merging 10000 sorted lists into a single long sorted list. Each list contains about 5000 doubles.
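double[] result = new double[0]; // this is the single long sorted list
void merge(double[] x) {
    double[] newList = new double[x.length + result.length];
    int i = 0, j = 0, k = 0;
    // take the smaller head element until one input is exhausted
    while (i < x.length && j < result.length) {
        newList[k++] = (x[i] <= result[j]) ? x[i++] : result[j++];
    }
    // copy whatever remains of either input
    while (i < x.length) {
        newList[k++] = x[i++];
    }
    while (j < result.length) {
        newList[k++] = result[j++];
    }
    result = newList;
}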
This method allocates a new array every time. As result[] grows, this is not efficient. Any advice?

You could handle it the same way ArrayList does and double the length of your array every time you need to reallocate and then only reallocate when you run out of space. Although you might have a fair amount of leftover space at the end, you would save processing time due to less allocations. Then just do an in-place merge with Result and X.

You clearly have enough memory to hold the entire result (about 400 MB, is it?), so presumably you could hold all the source lists too; 800 MB is big, but not too big. Then you can allocate the entire answer buffer right at the start.
If you are prepared to use even more memory you could take a "doubling" approach.
Merge 1 & 2 to form A1, 3 & 4 to form A2, etc., up to A5000 (you can now discard the first-level arrays)
Then merge A1 and A2 to form B1, A3 & A4 to form B2, up to B2500 (you now discard the A arrays)
And so on, yielding C1-C1250, D1-D625, E1-E313 ... N1, which is the final answer
This way any given number gets moved about 14 times, whereas at present you move every number about 5000 times on average.
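A minimal sketch of this pairwise "doubling" merge, assuming all the sorted lists are available in memory at once. The class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

final class PairwiseMerge {
    // Standard two-way merge of two sorted arrays into a new array.
    static double[] mergeTwo(double[] a, double[] b) {
        double[] out = new double[a.length + b.length];
        int i = 0, j = 0, k = 0;
        while (i < a.length && j < b.length) {
            out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
        }
        while (i < a.length) out[k++] = a[i++];
        while (j < b.length) out[k++] = b[j++];
        return out;
    }

    // Merges neighbours pairwise, level by level, until one array is left.
    static double[] mergeAll(List<double[]> lists) {
        if (lists.isEmpty()) return new double[0];
        List<double[]> level = new ArrayList<>(lists);
        while (level.size() > 1) {
            List<double[]> next = new ArrayList<>((level.size() + 1) / 2);
            for (int i = 0; i + 1 < level.size(); i += 2) {
                next.add(mergeTwo(level.get(i), level.get(i + 1)));
            }
            if (level.size() % 2 == 1) {
                // Odd one out is carried up to the next level unchanged.
                next.add(level.get(level.size() - 1));
            }
            level = next; // the previous level can now be garbage collected
        }
        return level.get(0);
    }
}
```

Each element is copied once per level, and the number of levels is the base-2 logarithm of the number of lists, rounded up.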

See your problem as the merging part of a merge-sort. Create 2 arrays that are big enough to hold the content of all the small lists combined. Then use them alternatingly for source and target storage in the merge steps.

Java large datastructure for storing a matrix

I need to store a 2d matrix containing zip codes and the distance in km between each one of them. My client has an application that calculates the distances which are then stored in an Excel file. Currently, there are 952 places. So the matrix would have 952x952 = 906304 entries.
I tried to map this into a HashMap<Integer, Float>. The Integer is the hash code of the two Strings for two places, e.g. "A" and "B". The float value is the distance in km between them.
While filling in the data I run into an OutOfMemoryError after 205k entries. Do you have a tip on how I can store this in a clever way? I don't even know if it's clever to have the whole bunch in memory. My options are SQL and MS Access...
The problem is that I need to access the data very quickly and possibly very often which is why I chose the HashMap because it runs in O(1) for the look up.
Thanks for your replies and suggestions!
Marco

A 2d array would be more memory efficient.
You can use a small hashmap to map the 952 places into a number between 0 and 951 .
Then, just do:
float[][] distances= new float[952][952];
To look things up, just use two hash lookups to convert the two places into two integers, and use them as indexes into the 2d array.
By doing it this way, you avoid the boxing of floats, and also the memory overhead of the large hashmap.
However, 906304 really isn't that many entries; you may just need to increase the maximum heap size with -Xmx.
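Put together, the lookup described above might look something like this. The class name and its methods are hypothetical, and the place names are assumed to be unique.

```java
import java.util.HashMap;
import java.util.Map;

final class DistanceTable {
    private final Map<String, Integer> indexOf = new HashMap<>();
    private final float[][] km;

    // 'places' is assumed to contain each place name exactly once.
    DistanceTable(String[] places) {
        for (String p : places) {
            indexOf.put(p, indexOf.size());
        }
        km = new float[places.length][places.length];
    }

    private int index(String place) {
        Integer i = indexOf.get(place);
        if (i == null) {
            throw new IllegalArgumentException("unknown place: " + place);
        }
        return i;
    }

    void set(String from, String to, float distanceKm) {
        km[index(from)][index(to)] = distanceKm;
    }

    // Two hash lookups plus one array access; no boxing of the distance.
    float get(String from, String to) {
        return km[index(from)][index(to)];
    }
}
```

For 952 places the array holds 906,304 floats, which is only a few megabytes.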

I would have thought that you could calculate the distances on the fly. Presumably someone has already done this, so you simply need to find out what algorithm they used, and the input data; e.g. longitude/latitude of the notional centres of each ZIP code.
EDIT: There are two commonly used algorithms for finding the (approximate) geodesic distance between two points given by longitude/latitude pairs.
The Vincenty formula is based on an ellipsoid approximation. It is more accurate, but more complicated to implement.
The Haversine formula is based on a spherical approximation. It is less accurate (0.3%), but simpler to implement.
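For reference, the Haversine formula fits in a few lines. This is a sketch only: the class name is made up, and the mean Earth radius of 6371 km is an assumed constant, so results will differ slightly from whatever the client's application computes.

```java
final class Haversine {
    static final double EARTH_RADIUS_KM = 6371.0; // assumed mean radius

    // Great-circle distance in km between two points given in degrees.
    static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
        double phi1 = Math.toRadians(lat1);
        double phi2 = Math.toRadians(lat2);
        double dPhi = Math.toRadians(lat2 - lat1);
        double dLambda = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dPhi / 2) * Math.sin(dPhi / 2)
                 + Math.cos(phi1) * Math.cos(phi2)
                 * Math.sin(dLambda / 2) * Math.sin(dLambda / 2);
        // min() guards against rounding pushing the argument above 1
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.min(1.0, Math.sqrt(a)));
    }
}
```

With this you only need to keep one latitude/longitude pair per ZIP code instead of a full distance matrix.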

Can you simply boost the memory available to the JVM ?
java -Xmx512m ...
On older JVMs the default maximum heap size is 64 MB (newer JVMs pick a default based on the machine's physical memory). Some more tuning tips here. If you can do this then you can keep the data in-process and maximise the performance (i.e. you don't need to calculate on the fly).

I upvoted Chi's and Benjamin's answers, because they're telling you what you need to do, but while I'm here, I'd like to stress that using the hashcode of the two strings directly will get you into trouble. You're likely to run into the problem of hash collisions.
This would not be a problem if you were concatenating the two strings (being careful to use a delimiter which cannot appear in the place designators), and letting HashMap do its magic, but the method you suggested, using the hashcodes for the two strings as a key, that's going to get you into trouble.
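To make the collision problem concrete: the strings "Aa" and "BB" have the same hashCode() in Java, so any key built only from hash codes cannot tell them apart. A small sketch of the delimiter-based key suggested above (the class name and the '|' delimiter are my own choices, and the delimiter is assumed never to appear in a place name):

```java
import java.util.HashMap;
import java.util.Map;

final class PairKeyDemo {
    // Builds a map key from the two place names themselves, not their hashes.
    static String key(String from, String to) {
        return from + '|' + to;
    }

    public static void main(String[] args) {
        // Different strings, identical hash codes.
        System.out.println("Aa".hashCode() == "BB".hashCode());

        Map<String, Float> distances = new HashMap<>();
        distances.put(key("Aa", "X"), 1.0f);
        distances.put(key("BB", "X"), 2.0f);
        // HashMap resolves the collision via equals(), so both entries survive.
        System.out.println(distances.size());
    }
}
```

HashMap still uses hashCode() internally, but it falls back to equals() when two keys collide, which is exactly the step that is lost when the hash code itself is used as the key.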

You will simply need more memory. When starting your Java process, kick it off like so:
java -Xmx256M MyClass
The -Xmx defines the max heap size, so this says the process can use up to 256 MB of memory for the heap. If you still run out, keep bumping that number up until you hit the physical limit.

I recently dealt with similar requirements for my master's thesis.
I ended up with a Matrix class that uses a double[], not a double[][], in order to avoid the cost of the double dereference (with a double[][], data[i] is itself an array, and only data[i][j] is a double) while allowing the VM to allocate one big, contiguous chunk of memory:
public class Matrix {
    private final double[] data;
    private final int rows;
    private final int columns;
    public Matrix(int rows, int columns, double[][] initializer) {
        this.rows = rows;
        this.columns = columns;
        this.data = new double[rows * columns];
        int k = 0;
        for (int i = 0; i < initializer.length; i++) {
            System.arraycopy(initializer[i], 0, data, k, initializer[i].length);
            k += initializer[i].length;
        }
    }
    public Matrix set(int i, int j, double value) {
        data[j + i * columns] = value;
        return this;
    }
    public double get(int i, int j) {
        return data[j + i * columns];
    }
}
This class should use less memory than a HashMap since it uses a primitive array (no boxing needed): it needs only 906304 * 8 bytes ≈ 7 MB (for doubles) or 906304 * 4 bytes ≈ 3.6 MB (for floats). My 2 cents.
NB
I've omitted some sanity checks for simplicity's sake

Stephen C. has a good point: if the distances are as-the-crow-flies, then you could probably save memory by doing some calculations on the fly. All you'd need is space for the longitude and latitude for 952 zip codes and then you could use the Vincenty formula to do your calculation when you need to. This would make your memory usage O(n) in zipcodes.
Of course, that solution makes some assumptions that may turn out to be false in your particular case, i.e. that you have longitude and latitude data for your zipcodes and that you're concerned with as-the-crow-flies distances and not something more complicated like driving directions.
If those assumptions are true though, trading a few computes for a whole bunch of memory might help you scale in the future if you ever need to handle a bigger dataset.

The above suggestions regarding heap size will be helpful. However, I am not sure if you gave an accurate description of the size of your matrix.
Suppose you have 4 locations. Then you need to assess the distances between A->B, A->C, A->D, B->C, B->D, C->D. This suggests six entries in your HashMap (4 choose 2).
That would lead me to believe the actual optimal size of your HashMap is (952 choose 2)=452,676; NOT 952x952=906,304.
This is all assuming, of course, that you only store one-way relationships (i.e. from A->B, but not from B->A, since that is redundant), which I would recommend since you are already experiencing problems with memory space.
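If the distances really are symmetric, the one-way storage can also be done without a HashMap by packing the lower triangle of the matrix into a flat array. A sketch, assuming places have already been mapped to integer indexes 0..n-1 (the class name is hypothetical):

```java
final class TriangularDistances {
    private final float[] km;

    // Allocates one slot per unordered pair of distinct places: n choose 2.
    TriangularDistances(int n) {
        km = new float[n * (n - 1) / 2];
    }

    // Position of the unordered pair {i, j} in the packed lower triangle.
    static int index(int i, int j) {
        if (i == j) {
            throw new IllegalArgumentException("no entry for a place and itself");
        }
        int hi = Math.max(i, j);
        int lo = Math.min(i, j);
        return hi * (hi - 1) / 2 + lo;
    }

    void set(int i, int j, float distanceKm) {
        km[index(i, j)] = distanceKm;
    }

    // A -> B and B -> A share one slot; the distance to itself is 0.
    float get(int i, int j) {
        return i == j ? 0f : km[index(i, j)];
    }

    int storedEntries() {
        return km.length;
    }
}
```

For 952 places this stores 452,676 floats, matching the count above.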
Edit: Should have said that the size of your matrix is not optimal, rather than saying the description was not accurate.

Create a new class with 2 slots for location names. Have it always put the alphabetically first name in the first slot. Give it a proper equals and hashcode method. Give it a compareTo (e.g. order alphabetically by names). Throw them all in an array. Sort it.
Also, hash1 = hash2 does not imply object1 = object2. Don't ever do this. It's a hack.
