I was solving Combination Sum IV on LeetCode (#377), which reads:
"Given an integer array with all positive numbers and no duplicates, find the number of possible combinations that add up to a positive integer target."
I solved it in Java using a top-down recursive approach with a memoization array:
public int combinationSum4(int[] nums, int target) {
    int[] memo = new int[target + 1];
    for (int i = 1; i < target + 1; i++) {
        memo[i] = -1;
    }
    memo[0] = 1;
    return topDownCalc(nums, target, memo);
}

public static int topDownCalc(int[] nums, int target, int[] memo) {
    if (memo[target] >= 0) {
        return memo[target];
    }
    int tot = 0;
    for (int num : nums) {
        if (target - num >= 0) {
            tot += topDownCalc(nums, target - num, memo);
        }
    }
    memo[target] = tot;
    return tot;
}
Then I figured I was wasting time by initializing the entire memo array and could just use a Map instead (which would also save space / memory). So I rewrote the code as follows:
public int combinationSum4(int[] nums, int target) {
    Map<Integer, Integer> memo = new HashMap<Integer, Integer>();
    memo.put(0, 1);
    return topDownMapCalc(nums, target, memo);
}

public static int topDownMapCalc(int[] nums, int target, Map<Integer, Integer> memo) {
    if (memo.containsKey(target)) {
        return memo.get(target);
    }
    int tot = 0;
    for (int num : nums) {
        if (target - num >= 0) {
            tot += topDownMapCalc(nums, target - num, memo);
        }
    }
    memo.put(target, tot);
    return tot;
}
I am confused though, because after submitting the second version of my code, LeetCode said it was slower and used more space than the first version. How does the HashMap use more space and run slower than an array whose values all had to be initialized and whose length is greater than the HashMap's size?
Then I figured I was wasting time by initializing the entire memo array
You could have stored 'answer + 1' instead, so that the default value (0) can now be a placeholder for 'not calculated yet', and save that initialization. Not that it is expensive. Let's dig into cache pages.
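Before that, a minimal sketch of the 'answer + 1' trick (illustrative only, not your original code; a memo value of 0 now means 'not calculated yet'):

public int combinationSum4(int[] nums, int target) {
    // memo[t] == 0 means "not calculated yet"; otherwise it stores answer + 1
    int[] memo = new int[target + 1];
    return count(nums, target, memo);
}

private static int count(int[] nums, int target, int[] memo) {
    if (target == 0) {
        return 1; // exactly one way to reach 0: pick nothing
    }
    if (memo[target] != 0) {
        return memo[target] - 1; // undo the +1 shift
    }
    int tot = 0;
    for (int num : nums) {
        if (target - num >= 0) {
            tot += count(nums, target - num, memo);
        }
    }
    memo[target] = tot + 1; // store shifted so 0 stays free as the sentinel
    return tot;
}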
Cache pages
CPUs are complex beasts. They don't operate on memory directly; not anymore. They literally cannot do it; the chip's calculating parts are simply not hooked up to main memory at all. Instead, the CPU has caches, which come in set sizes (for example, 64k - a single cache node can't hold more or less than precisely 64k, and that entire 64k is then considered to be a cached copy of some 64k segment of main memory). One such node is called a cache page.
The CPU can only operate on cache pages.
In Java, int[] leads to a contiguous, fixed-size chunk of memory holding the data. In other words, int[] x = new int[1000] declares a single chunk of memory of 1000 * 4 = 4000 bytes (because ints are 4 bytes, and you reserved room for 1000 of them). That fits inside a single page. So, when you write your loop to initialize the values to -1, you're asking the CPU to loop through a single cache page and write some data to it. CPUs have pipelines and other speedup factors; this costs maybe 250 cycles.
Contrast that with the cost of fetching a cache page: the CPU twiddles its thumbs (which is good; it can cool down a bit - on modern hardware the CPU is often limited not by its raw speed but by the system's ability to wick away the heat of running it - and it can also spend time on other threads/processes) while it farms out the job of fetching some chunk of memory into a cache page to the memory controller. Nevertheless, that thumb twiddling takes on the order of 500 cycles or more. It's nice that the CPU gets to cool down or focus on other things during it, but it's still the case that writing 4000 contiguous bytes in a tight loop is faster than a single cache miss.
Thus, 'fill a 1000-large int array with -1s' is an extremely cheap operation.
Wrapper objects
Maps operate on objects, not ints, which is why you had to write Integer and not int. An Integer, in memory at least, is a much, much larger load on memory. It's an entire object containing an int field. Then your variable (or your map) holds a pointer to it.
So, an int[] x = new int[1000] takes 4000 bytes, plus some change for the object headers (maybe add 12 bytes to it all), and 1 reference (depends on VM, but let's say 64 bit), for a grand total of 4020 bytes.
In contrast,
Integer[] x = new Integer[1000];
for (int i = 0; i < 1000; i++) x[i] = i;
is much, much larger. It's 1000 pointers (can be as large as 8 bytes per pointer, or as small as 4. So, 4000 to 8000 bytes), to 1000 separate integer objects. Each integer object gets the object overhead (~12 bytes or more), + 1 integer field, generally word-aligned (so, 64-bits, even though it's only 32-bit, assuming a 64-bit VM running on 64-bit hardware, which is going to be the case on anything modern), for another 20000 bytes. A grand total of something closer to 30000 bytes.
That is about 8x more memory required.
Then consider that the 'key' in your memoized array is inherent (it's the index into the array) whereas in the map the key needs separate storage and it gets worse still: Each k/v pair in your map occupies at least 12+12+8+8+8+8 bytes (2 object overheads and 2 int fields for your key and value Integer objects, and 2 pointers for the map to point at these), 56 bytes. In contrast to your int[] which does it in 4.
That gives you a rate of 56/4 = 14.
If your map contains only 1 in 14 numbers, then the map should be about as large as your array, because the map can do a thing your array can't: The array is as large as it has to be from the get-go, the map only needs to store required nodes.
Still, one would assume for most 'interesting' inputs, the coverage factor of that map is going to be far north of 7.14%, thus resulting in the map being larger.
The map also has its objects smeared out over memory which risks them being in more than one cache page: Large memory load + fragmentation = an easy road to having the CPU wait for multiple cache page fetches vs. being able to do all the work in one go, never having to wait for cache misses.
Can it be faster?
Yeah, probably - but with map occupancy rates at 10% or higher, the concept of using a map to save space is dubious. If you want to try, you'd need a map specifically designed to hold ints and nothing else. These do exist, such as eclipse collections' IntIntMap.
But I bet in this case the simple array memoization strategy is just the winner, even if you use IntIntMap.
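If you do want to try it anyway, a sketch using Eclipse Collections might look like the following (this assumes its IntIntHashMap API with put and getIfAbsent; illustrative only, not benchmarked):

import org.eclipse.collections.impl.map.mutable.primitive.IntIntHashMap;

public class CombinationSum {
    public int combinationSum4(int[] nums, int target) {
        IntIntHashMap memo = new IntIntHashMap(); // primitive int -> int, no Integer boxing
        memo.put(0, 1);
        return topDown(nums, target, memo);
    }

    private static int topDown(int[] nums, int target, IntIntHashMap memo) {
        int cached = memo.getIfAbsent(target, -1); // -1 marks "not computed yet"
        if (cached >= 0) {
            return cached;
        }
        int tot = 0;
        for (int num : nums) {
            if (target - num >= 0) {
                tot += topDown(nums, target - num, memo);
            }
        }
        memo.put(target, tot);
        return tot;
    }
}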
These things came to my mind first:
HashMap is what the name implies, a hash-based map. So whenever you put something into it or get something out of it, it has to hash the key, then find the target based on that hash.
put() operation isn't just a walk in the park, either - you can check here to get an idea what it does. Definitely more than array assignment.
in java it doesn't work with primitives, so for each value you have to convert ints to Integers and vice versa. (as noted by others, there are int-specialized map alternatives available, but not in standard lib)
And since you're not initializing it, it might need to resize internally several times during your run - the default size for a HashMap is just 16 - which is definitely more expensive than the one-shot initialization you did with the array. Here's what each resizing does.
it also works with Entry objects that it needs for each internal entry it's got, and all those objects also take some space, plenty more than just having an array of integers
So I wouldn't expect a HashMap to save you either space or time. Why would it?
Related
I was working on linear probing, which hashes values mod the table size, and wrote some code for it.
public class LinearProbing
{
    private int table[];
    private int size;

    LinearProbing(int size)
    {
        this.size = size;
        table = new int[size];
    }

    public void hash(int value)
    {
        int key = value % size;
        while (table[key] != 0)
        {
            key++;
            if (key == size)
            {
                key = 0;
            }
        }
        table[key] = value;
    }

    public void display()
    {
        for (int i = 0; i < size; i++)
        {
            System.out.println(i + "->" + table[i]);
        }
    }
}
It works fine for every value except zero (0). Since each index of a Java int array is initially zero, checking against zero to see whether an index is free causes trouble when zero itself needs to be hashed: it can be overwritten. I also tried comparing against null, but that raises a type-mismatch error.
Does anyone have any suggestion?
Computers don't work that way, at least, not without paying a rather great cost.
Specifically, a new int[10] quite literally just creates a contiguous block of memory that is precisely large enough to hold 10 int variables, and not a bit more than that. Specifically, each int will cover 32 bits, and those bits can be used to represent precisely 2^32 different things. Think about it: If I give you a panel of 3 light switches, and all you get to do is walk in, flip some switches, and walk back out again, then I walk in and I get to look at what you have been flipping, and that is all the communication channel we ever get, we can pre-arrange for 8 different signals. Why 8? Because that's 2^3. A bit is like that lightswitch. It's on, or off. There is no other option, and there is no 'unset'. There is no way to represent 'oh, you have not been in the room yet' unless we 'spend' one of our 8 different arrangements on this signal, leaving only 7 left.
Thus, if you want each 'int' to also know 'whether it has been set or not', and for 'not set yet' to be different from any of the valid values, you need an entire new bit, and given that modern CPUs don't like doing work on sub-word units, that one bit is excessively expensive. In either case, you have to program it.
For example:
private int table[];
private int set[]; // one bit per table slot: has this index been set yet?

LinearProbing(int size) {
    this.size = size;
    this.table = new int[size];
    this.set = new int[(size + 31) / 32];
}

boolean isSet(int idx) {
    int setIdx = idx / 32;
    int bit = idx % 32;
    return ((this.set[setIdx] >>> bit) & 1) != 0; // test exactly that one bit
}

private void markAsSet(int idx) {
    int setIdx = idx / 32;
    int bit = idx % 32;
    this.set[setIdx] |= (1 << bit);
}
This rather complex piece of machinery 'packs' that additional 'is it set?' bit into a separate array called set, which we can get away with making 1/32nd the size of the whole thing, as each int contains 32 bits and we need just 1 bit to mark an index slot as set or unset. Unfortunately, this means we need to do all sorts of bit wrangling, and thus we're using the bitwise OR operator (|=) and bit shifts (<< and >>>) to isolate the right bit.
This is why, usually, this is not the way to go: bit wrangling isn't cheap.
It's a much, much better idea to take away exactly one of the 2^32 different values a hash can be. You could choose 0, but you can also choose some arbitrarily chosen value; there is a very minor benefit to picking a large prime number. Let's say 7549.
Now all you need to do is decree a certain algorithm: The practical hash of a value is derived from this formula:
If the actual hash is 7549 specifically, we say the practical hash is 6961. Yes, that means 6961 will occur more often.
If the actual hash is anything else, including 6961, the practical hash is identical.
Tada: This algorithm means '7549' is free. No practical hash can ever be 7549. That means we can now use 7549 as marker as meaning 'unset'.
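A minimal sketch of how that could look in the linear-probing table from the question (7549 and 6961 are the arbitrary picks from above; this is an illustration, not the only way to wire it up):

private static final int UNSET = 7549;    // reserved marker: no stored value is ever 7549
private static final int STAND_IN = 6961; // a real 7549 is remapped onto this value instead

LinearProbing(int size) {
    this.size = size;
    this.table = new int[size];
    java.util.Arrays.fill(table, UNSET); // every slot starts out 'unset'
}

public void hash(int value) {
    int stored = (value == UNSET) ? STAND_IN : value; // the 'practical' value
    int key = stored % size;
    while (table[key] != UNSET) { // probe until a free slot is found; 0 is now a normal value
        key++;
        if (key == size) {
            key = 0;
        }
    }
    table[key] = stored;
}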
The fact that 6961 is now doubled up is technically not relevant: Any hash bucket system cannot just state that equal hashes means equal objects - after all, there are only 2^32 hashes, so collisions are mathematically impossible to avoid. That's why e.g. java's own HashMap doesn't JUST compare hashes - it also calls .equals. If you shove 2 different (as in, not .equals) objects in the same map that so happen to hash to the same value, HashMap is fine with it. Hence, having more conflicts around 6961 is not particularly relevant.
The additional cost associated with the additional chance of collision on 6961 is vastly less than the additional cost associated with keeping track of which buckets have been set or not. After all, assuming good hash distribution, our transformation algorithm that frees up 7549 means 1 in 4 billion items happens to collide twice more likely. That's... an infinitesimal occurrence on top of another infinitesimal, it's not going to matter.
NB: 6961 and 7549 are randomly chosen prime numbers. Prime numbers are merely slightly less likely to collide, it's not crucial that you pick primes here.
I am reading in a lot of data from a file. There may be 100 different data objects with the necessary headings, but there can be well over 300,000 values stored in each of these data objects. The values need to be stored in the same order in which they are read. This is the constructor for the data object:
public Data(String heading, ArrayList<Float> values) {
    this.heading = heading;
    this.values = values;
}
What would be the quickest way to store and retrieve these values sequentially in RAM?
Although in your comments you mention "quickness", without specifying what operation needs to be "quick", your main concern seems to be heap memory consumption.
Let's assume 100 groups of 300,000 numbers (you've used words like "may be" and "well over" but this will do as an example).
That's 30,000,000 numbers to store, plus 100 headings and some structural overhead for grouping.
A primitive Java float is 32 bits, that is 4 bytes. So at an absolute minimum, you're going to need 30,000,000 * 4 bytes == 120MB.
An array of primitives - float[30000000] - is just all the values concatenated into a contiguous chunk of memory, so it will consume this theoretical minimum of 120MB -- plus a few bytes of once-per-array overhead that I won't go into detail about here.
A java Float wrapper object is 12 bytes. When you store an object (rather than a primitive) in an array, the reference itself is 4 bytes. So an array of Float - Float[30000000] will consume 30,000,000 * (12 + 4) == 480MB.
So, you can cut your memory use by more than half by using primitives rather than wrappers.
An ArrayList is quite a light wrapper around an array of Object and so has about the same memory costs. The once-per-list overheads are too small to have an impact compared to the elements, at these list sizes. But there are some caveats:
ArrayList can only store Objects, not primitives, so if you choose a List you're stuck with the 12-bytes-per-element overhead of Float.
There are some third-party libraries that provide lists of primitives - see: Create a List of primitive int?
The capacity of an ArrayList is dynamic, and to achieve this, if you grow the list to be bigger than its backing array, it will:
create a new array, 50% bigger than the old array
copy the contents of the old array into the new array (this sounds expensive, but hardware is very fast at doing this)
discard the old array
This means that if the backing array happens to have 30 million elements, and is full, ArrayList.add() will replace the array with one of 45 million elements, even if your List only needs 30,000,001.
You can avoid this if you know the needed capacity in advance, by providing the capacity in the constructor.
You can use ArrayList.trimToSize() to drop unneeded capacity and claw some memory back after you've filled the ArrayList.
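For example (a tiny illustration of those last two points; the 300,000 figure is just the size from the question):

import java.util.ArrayList;

ArrayList<Float> values = new ArrayList<>(300_000); // backing array allocated once, no regrowth
// ... fill the list while reading ...
values.trimToSize(); // afterwards, release any unused spare capacity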
If I was striving to use as little heap memory as possible, I would aim to store my lists of numbers as arrays of primitives:
class Data {
    String header;
    float[] values;
}
... and I would just put these into an ArrayList<Data>.
With this structure, you have O(1) access to arbitrary values, and you can use Arrays.binarySearch() (if the values are sorted) to find by value within a group.
If at all possible, I would find out the size of each group before reading the values, and initialise the array to the right size. If you can, make your input file format facilitate this:
while ((line = readLine()) != null) {
    if (isHeader(line)) {
        ParsedHeader header = new ParsedHeader(line);
        currentArray = new float[header.size()];
        arrayIndex = 0;
        currentGroup = new Group(header.name(), currentArray);
        groups.add(currentGroup);
    } else if (isValue(line)) {
        currentArray[arrayIndex++] = parseValue(line);
    }
}
If you can't change the input format, consider making two passes through the file - once to discover group lengths, once again to fill your arrays.
If you have to consume the file in one pass, and the file format can't provide group lengths before groups, then you'll have to do something that allows a "list" to grow arbitrarily. There are several options:
Consume each group into an ArrayList<Float> - when the group is complete, convert it into a float[]:
float[] array = new float[list.size()];
int i = 0;
for (Float f : list) {
    array[i++] = f; // auto-unboxes Float to float
}
Use a third-party list-of-float library class
Copy the logic used by ArrayList to replace your array with a bigger one when needed -- http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/util/ArrayList.java#ArrayList.ensureCapacity%28int%29
Any number of approaches discussed in Computer Science textbooks, for example a linked list of arrays.
However none of this considers your reasons for slurping all these numbers into memory in the first place, nor whether this store meets your needs when it comes to processing the numbers.
You should step back and consider what your actual data processing requirement is, and whether slurping into memory is the best approach.
See whether you can do your processing by storing only a slice of data at a time, rather than storing the whole thing in memory. For example, to calculate max/min/mean, you don't need every number to be in memory -- you just need to keep a running total.
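For instance, a rough sketch of computing max/min/mean in a single pass without keeping the values in memory (readNextValue() is a placeholder for however the file is parsed, returning null at end of input):

float min = Float.POSITIVE_INFINITY, max = Float.NEGATIVE_INFINITY;
double sum = 0;  // accumulate in double to limit rounding error
long count = 0;
Float v;
while ((v = readNextValue()) != null) { // hypothetical reader
    min = Math.min(min, v);
    max = Math.max(max, v);
    sum += v;
    count++;
}
double mean = count > 0 ? sum / count : Double.NaN;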
Or, consider using a lightweight database library.
You could use a RedBlack BST, which will be an extremely efficient way to store/retrieve data. This relies on nodes that link to other nodes, so there's no limit to the size of the input, as long as you have enough memory for java.
In Ulrich Drepper's paper What every programmer should know about memory, in the third part (CPU Caches), he shows a graph of the relationship between the "working set" size and the CPU cycles consumed per operation (in this case, sequential reading). There are two jumps in the graph which indicate the sizes of the L1 and L2 caches. I wrote my own program to reproduce the effect in C. It simply reads an int array sequentially from head to tail, and I've tried different array sizes (from 1KB to 1MB). I plotted the data into a graph, and there is no jump; the graph is a straight line.
My questions are:
Is there something wrong with my method? What is the right way to produce the cpu cache effect(to see the jumps).
I was thinking, if it is sequential read, then it should operate like this:
When reading the first element, it's a cache miss, and then within the cache line (64 bytes) there will be hits. With the help of prefetching, the latency of reading the next cache line will be hidden. It will keep reading contiguous data into the L1 cache, and even when the working set size is over the L1 cache size, it will evict the least recently used lines and continue prefetching. So most of the cache misses will be hidden; the time consumed by fetching data from L2 will be hidden behind the reading activity, meaning they operate at the same time. The associativity (8-way in my case) will hide the latency of reading data from L2. So the behavior of my program should be right - am I missing something?
Is it possible to get the same effect in java?
By the way, I am doing this in linux.
Edit 1
Thanks for Stephen C's suggestion, here are some additional Information:
This is my code:
int *arrayInt;

void initInt(long len) {
    int i;
    arrayInt = (int *)malloc(len * sizeof(int));
    memset(arrayInt, 0, len * sizeof(int));
}

long sreadInt(long len) {
    int sum = 0;
    struct timespec tsStart, tsEnd;

    initInt(len);
    clock_gettime(CLOCK_REALTIME, &tsStart);
    for (i = 0; i < len; i++) {
        sum += arrayInt[i];
    }
    clock_gettime(CLOCK_REALTIME, &tsEnd);
    free(arrayInt);
    return (tsEnd.tv_nsec - tsStart.tv_nsec) / len;
}
In the main() function, I've tried array sizes from 1KB to 100MB, and it is still the same: the average time per element is 2 nanoseconds. I think that is the access time of L1d.
My cache size:
L1d == 32k
L2 == 256k
L3 == 6144k
EDIT 2
I've changed my code to use a linked list.
// element type
struct l {
    struct l *n;
    long int pad[NPAD]; // NPAD can be changed; in my case I set it to 1
};

struct l *array;
long globalSum;

// initialize the array
void init(long len) {
    long i, j;
    struct l *ptr;
    array = (struct l *)malloc(sizeof(struct l));
    ptr = array;
    for (j = 0; j < NPAD; j++) {
        ptr->pad[j] = j;
    }
    ptr->n = NULL;
    for (i = 1; i < len; i++) {
        ptr->n = (struct l *)malloc(sizeof(struct l));
        ptr = ptr->n;
        for (j = 0; j < NPAD; j++) {
            ptr->pad[j] = i + j;
        }
        ptr->n = NULL;
    }
}

// free the array when the operation is done
void release() {
    struct l *ptr = array;
    struct l *tmp = NULL;
    while (ptr) {
        tmp = ptr;
        ptr = ptr->n;
        free(tmp);
    }
}

double sread(long len) {
    int i;
    long sum = 0;
    struct l *ptr;
    struct timespec tsStart, tsEnd;

    init(len);
    ptr = array;
    clock_gettime(CLOCK_REALTIME, &tsStart);
    while (ptr) {
        for (i = 0; i < NPAD; i++) {
            sum += ptr->pad[i];
        }
        ptr = ptr->n;
    }
    clock_gettime(CLOCK_REALTIME, &tsEnd);
    release();
    globalSum += sum;
    return (double)(tsEnd.tv_nsec - tsStart.tv_nsec) / (double)len;
}
Finally, I printf the globalSum in order to defeat compiler optimization. As you can see, it is still a sequential read. I've even tried array sizes up to 500MB, and the average time per element is approximately 4 nanoseconds (maybe because it has to access both the data 'pad' and the pointer 'n' - two accesses), the same as with a 1KB array. So I think it is because cache optimizations like prefetching hide the latency very well - am I right? I will try random access and post the result later.
EDIT 3
I've tried a random access to the linked list, this is the result:
The first red line is my L1 cache size, the second is L2. We can see a little jump there, and sometimes the latency is still hidden well.
This answer isn't an answer, but more of a set of notes.
First, the CPU tends to operate on cache lines, not on individual bytes/words/dwords. This means that if you sequentially read/write an array of integers then the first access to a cache line may cause a cache miss but subsequent accesses to different integers in that same cache line won't. For 64-byte cache lines and 4-byte integers this means that you'd only get a cache miss once for every 16 accesses; which will dilute the results.
Second, the CPU has a "hardware pre-fetcher." If it detects that cache lines are being read sequentially, the hardware pre-fetcher will automatically pre-fetch cache lines it predicts will be needed next (in an attempt to fetch them into cache before they're needed).
Third, the CPU does other things (like "out of order execution") to hide fetch costs. The time difference (between cache hit and cache miss) that you can measure is the time that the CPU couldn't hide and not the total cost of the fetch.
These 3 things combined mean that; for sequentially reading an array of integers, it's likely that the CPU pre-fetches the next cache line while you're doing 16 reads from the previous cache line; and any cache miss costs won't be noticeable and may be entirely hidden. To prevent this; you'd want to "randomly" access each cache line once, to maximise the performance difference measured between "working set fits in cache/s" and "working set doesn't fit in cache/s."
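Since the question also asks about Java: a rough sketch of that "touch each cache line once, in random order" idea might look like the following (64-byte lines and a 4-byte int are assumed; a proper harness with JIT warm-up and repeated runs would still be needed):

import java.util.Random;

public class CacheLineWalk {
    static final int INTS_PER_LINE = 16; // 64-byte cache line / 4-byte int (assumption)

    // Touch one int per cache line, in random order; return a sum so the JIT can't drop the loop.
    static long walk(int[] data, int[] lineOrder) {
        long sum = 0;
        for (int line : lineOrder) {
            sum += data[line * INTS_PER_LINE];
        }
        return sum;
    }

    public static void main(String[] args) {
        int lines = 1 << 20;                   // ~64 MB working set: well past L1/L2/L3
        int[] data = new int[lines * INTS_PER_LINE];
        int[] order = new int[lines];
        for (int i = 0; i < lines; i++) order[i] = i;
        Random rnd = new Random(42);
        for (int i = lines - 1; i > 0; i--) {  // Fisher-Yates shuffle of the line order
            int j = rnd.nextInt(i + 1);
            int t = order[i]; order[i] = order[j]; order[j] = t;
        }
        long t0 = System.nanoTime();
        long sum = walk(data, order);
        long t1 = System.nanoTime();
        System.out.println(sum + " : " + (t1 - t0) / (double) lines + " ns per line");
    }
}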
Finally, there are other factors that may influence measurements. For example, for an OS that uses paging (e.g. Linux and almost all other modern OSs) there's a whole layer of caching above all this (TLBs/Translation Look-aside Buffers), and TLB misses once the working set gets beyond a certain size; which should be visible as a fourth "step" in the graph. There's also interference from the kernel (IRQs, page faults, task switches, multiple CPUs, etc); which might be visible as random static/error in the graph (unless tests are repeated often and outliers discarded). There are also artifacts of the cache design (cache associativity) that can reduce the effectiveness of the cache in ways that depend on the physical address/es allocated by the kernel; which might be seen as the "steps" in the graph shifting to different places.
Is there something wrong with my method?
Possibly, but without seeing your actual code that cannot be answered.
Your description of what your code is doing does not say whether you are reading the array once or many times.
The array may not be big enough ... depending on your hardware. (Don't some modern chips have a 3rd level cache of a few megabytes?)
In the Java case in particular you have to do lots of things the right way to implement a meaningful micro-benchmark.
In the C case:
You might try adjusting the C compiler's optimization switches.
Since your code is accessing the array serially, the compiler might be able to order the instructions so that the CPU can keep up, or the CPU might be optimistically prefetching or doing wide fetches. You could try reading the array elements in a less predictable order.
It is even possible that the compiler has entirely optimized the loop away because result of the loop calculation is not used for anything.
(According to this Q&A - How much time does it take to fetch one word from memory?, a fetch from L2 cache is ~7 nanoseconds and a fetch from main memory is ~100 nanoseconds. But you are getting ~2 nanoseconds. Something clever has to be going on here to make it run as fast as you are observing.)
With gcc-4.7 and compilation with gcc -std=c99 -O2 -S -D_GNU_SOURCE -fverbose-asm tcache.c you can see that the compiler is optimizing enough to remove the for loop (because sum is not used).
I had to improve your source code; some #include-s are missing, and i is not declared in the second function, so your example doesn't even compile as it is.
Make sum a global variable, or pass it somehow to the caller (perhaps with a global int globalsum; and putting globalsum=sum; after the loop).
And I am not sure you are right to clear the array with a memset. I could imagine a clever-enough compiler understanding that you are summing all zeros.
Lastly, your code has extremely regular behavior with good locality: once in a while a cache miss happens, the entire cache line is loaded, and the data is good enough for many iterations. Some clever optimizations (e.g. -O3 or better) might generate the good prefetch instructions. This is optimal for caches, because with a 32-word L1 cache line the cache miss happens only every 32 loops, so it is well amortized.
Making a linked list of data will make cache behavior be worse. Conversely, in some real programs carefully adding a __builtin_prefetch at few well chosen places may improve performance by more than 10% (but adding too many of them will decrease performance).
In real life, the processor is spending the majority of the time to wait for some cache (and it is difficult to measure that; this waiting is CPU time, not idle time). Remember that during an L3 cache miss, the time needed to load data from your RAM module is the time needed to execute hundreds of machine instructions!
I can't say for certain about 1 and 2, but it would be more challenging to successfully run such a test in Java. In particular, I might be concerned that managed language features like automatic garbage collection might happen during the middle of your testing and throw off your results.
As you can see from graph 3.26 the Intel Core 2 shows hardly any jumps while reading (red line at the top of the graph). It is writing/copying where the jumps are clearly visible. Better to do a write test.
How can I store a 100K X 100K matrix in Java?
I can't do that with a normal array declaration as it is throwing a java.lang.OutOfMemoryError.
The Colt library has a sparse matrix implementation for Java.
You could alternatively use Berkeley DB as your storage engine.
Now if your machine has enough actual RAM (at least 9 gigabytes free), you can increase the heap size in the Java command-line.
If the vast majority of entries in your matrix will be zero (or even some other constant value) a sparse matrix will be suitable. Otherwise it might be possible to rewrite your algorithm so that the whole matrix doesn't exist simultaneously. You could produce and consume one row at a time, for example.
Sounds like you need a sparse matrix. Others have already suggested good 3rd-party implementations that may suit your needs...
Depending on your applications, you could get away without a third-party matrix library by just using a Map as a backing-store for your matrix data. Kind of...
public class SparseMatrix<T> {
    private T defaultValue;
    private int m;
    private int n;
    // key is the linearized index; it must be a long so that m * n can exceed Integer.MAX_VALUE
    private Map<Long, T> data = new TreeMap<Long, T>();

    /// create a new matrix with m rows and n columns
    public SparseMatrix(int m, int n, T defaultValue) {
        this.m = m;
        this.n = n;
        this.defaultValue = defaultValue;
    }

    /// set value at [i,j] (row, col)
    public void setValueAt(int i, int j, T value) {
        if (i >= m || j >= n || i < 0 || j < 0)
            throw new IllegalArgumentException(
                    "index (" + i + ", " + j + ") out of bounds");
        data.put((long) i * n + j, value);
    }

    /// retrieve value at [i,j] (row, col)
    public T getValueAt(int i, int j) {
        if (i >= m || j >= n || i < 0 || j < 0)
            throw new IllegalArgumentException(
                    "index (" + i + ", " + j + ") out of bounds");
        T value = data.get((long) i * n + j);
        return value != null ? value : defaultValue;
    }
}
A simple test case illustrating the SparseMatrix's use would be:
public class SparseMatrixTest extends TestCase {
    public void testMatrix() {
        SparseMatrix<Float> matrix =
                new SparseMatrix<Float>(100000, 100000, 0.0F);
        matrix.setValueAt(1000, 1001, 42.0F);
        assertTrue(matrix.getValueAt(1000, 1001) == 42.0);
        assertTrue(matrix.getValueAt(1001, 1000) == 0.0);
    }
}
This is not the most efficient way of doing it because every non-default entry in the matrix is stored as an Object. Depending on the number of actual values you are expecting, the simplicity of this approach might trump integrating a 3rd-party solution (and possibly dealing with its License - again, depending on your situation).
Adding matrix operations like multiplication to the above SparseMatrix implementation should be straightforward (and is left as an exercise for the reader ;-)
100,000 x 100,000 = 10,000,000,000 (10 billion) entries. Even if you're storing single byte entries, that's still in the vicinity of 10 GB - does your machine even have that much physical memory, let alone have a will to allocate that much to a single process?
Chances are you're going to need to look into some kind of a way to only keep part of the matrix in memory at any given time, and the rest buffered on disk.
There are a number possible solutions depending on how much memory you have, how sparse the array actually is, and what the access patterns are going to be.
If 100K * 100K * 8 bytes is less than the amount of physical memory on your machine that is available to the JVM, a simple non-sparse array is a viable solution.
If the array is sparse, with (say) 75% or more of the elements being zero, then you can save space by using a sparse array library. Various alternatives have been suggested, but in all cases, you still need to work out if this is going to give you enough savings. Figure out how many non-zero elements there are going to be, multiply that by 8 (to give you doubles) and (say) 4 to account for the overheads of the sparse array. If that is less than the amount of physical memory that you can make available to the JVM, then sparse arrays are a viable solution.
If sparse and non-sparse arrays (in memory) won't work, things will get more complicated, and the viability of any solution will depend on the access patterns for the array data.
One approach is to represent the array as a file that is mapped into memory in the form of a MappedByteBuffer. Assuming that you don't have enough physical memory to store the entire file in memory, you are going to be hitting the virtual memory system hard. So it is best if your algorithm only needs to operate on contiguous sections of the array at any time. Otherwise, you'll probably die from swapping.
A second approach is a variation of the first. Map the array/file a section at a time, and when you are done, unmap and move to the next section. This only works if the algorithm works on the array in sections.
A third approach is to represent the array using a light-weight database like BDB. This will be slower than any in-memory solution because reading array elements will translate into disc accesses. But if you get it wrong it won't kill the system like the memory mapped approach will. (And if you do this on Linux/Unix, the system's disc block cache may speed things up, depending on your algorithm's array access patterns)
A fourth approach is to use a distributed memory cache. This replaces disc i/o with network i/o, and it is hard to say whether this is a good or bad thing.
A fifth approach is to analyze your algorithm and see if it is amenable to implementing as a distributed algorithm; e.g. with sections of the array and corresponding parts of the algorithm on different machines.
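As an illustration of the first two approaches, here is a sketch of mapping one block of rows of a file-backed double matrix at a time (the file name and block size are made up, error handling is omitted, and the backing file ends up ~80 GB, sparse on most filesystems):

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class FileBackedMatrix {
    static final int N = 100_000;          // 100K x 100K doubles
    static final int ROWS_PER_BLOCK = 128; // how many rows to map at once (assumption)

    public static void main(String[] args) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile("matrix.bin", "rw");
             FileChannel ch = raf.getChannel()) {
            long rowBytes = (long) N * Double.BYTES;
            // Map one block of rows, work on it, then move on to the next block.
            for (long firstRow = 0; firstRow < N; firstRow += ROWS_PER_BLOCK) {
                long rows = Math.min(ROWS_PER_BLOCK, N - firstRow);
                MappedByteBuffer block = ch.map(FileChannel.MapMode.READ_WRITE,
                        firstRow * rowBytes, rows * rowBytes);
                // Example access: element (i, j) relative to this block.
                int i = 0, j = 42;
                block.putDouble((int) (i * rowBytes) + j * Double.BYTES, 1.0);
                double v = block.getDouble((int) (i * rowBytes) + j * Double.BYTES);
            }
        }
    }
}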
You can upgrade to this machine:
http://www.azulsystems.com/products/compute_appliance.htm
864 processor cores and 768 GB of memory, only costs a single family house somewhere.
Well, I'd suggest that you increase the memory in your JVM, but you're going to need a lot of memory, as you're talking about 10 billion items. It's (barely) possible with lots of memory or a clustered JVM, but that's probably the wrong answer.
You're getting the OutOfMemoryError because when you declare int[1000], the memory is allocated immediately (additionally, doubles take up more space than ints - an int representation will also save you space). Maybe you can substitute a more efficient implementation of your array (if you have many empty entries, look up "sparse matrix" representations).
You could store pieces in an outside system, like memcached or memory-mapped buffers.
There are lots of good suggestions here, maybe if you posted a more detailed description of the problem you're trying to solve people could be more specific.
You should try an "external" package to handle matrices - I never did that myself, though - maybe something like JAMA.
Unless you have 100K x 100K x 8 ~ 80GB of memory, you cannot create this matrix in memory. You can create this matrix on disk and access it using memory mapping. However, using this approach will be very slow.
What are you trying to do? You may find that representing your data in a different way will be much more efficient.
I need to store a 2d matrix containing zip codes and the distance in km between each one of them. My client has an application that calculates the distances which are then stored in an Excel file. Currently, there are 952 places. So the matrix would have 952x952 = 906304 entries.
I tried to map this into a HashMap<Integer, Float>. The Integer is the hash code of the two Strings for two places, e.g. "A" and "B". The float value is the distance in km between them.
While filling in the data I run into OutOfMemoryErrors after 205k entries. Do you have a tip for how I can store this in a clever way? I don't even know if it's clever to have the whole bunch in memory. My options are SQL and MS Access...
The problem is that I need to access the data very quickly and possibly very often which is why I chose the HashMap because it runs in O(1) for the look up.
Thanks for your replies and suggestions!
Marco
A 2d array would be more memory efficient.
You can use a small HashMap to map the 952 places to a number between 0 and 951.
Then, just do:
float[][] distances = new float[952][952];
To look things up, just use two hash lookups to convert the two places into two integers, and use them as indexes into the 2d array.
By doing it this way, you avoid the boxing of floats, and also the memory overhead of the large hashmap.
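A small sketch of that layout (the place names and the 224 km figure are invented for the example):

import java.util.HashMap;
import java.util.Map;

Map<String, Integer> index = new HashMap<>();   // place name -> 0..951
// ... fill while reading the places, e.g. index.put("Zurich", index.size()); ...
float[][] distances = new float[952][952];

// store
distances[index.get("Zurich")][index.get("Geneva")] = 224.0f;
// look up
float km = distances[index.get("Zurich")][index.get("Geneva")];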
However, 906,304 really isn't that many entries; you may just need to increase the -Xmx maximum heap size.
I would have thought that you could calculate the distances on the fly. Presumably someone has already done this, so you simply need to find out what algorithm they used, and the input data; e.g. longitude/latitude of the notional centres of each ZIP code.
EDIT: There are two commonly used algorithms for finding the (approximate) geodesic distance between two points given by longitude/latitude pairs.
The Vincenty formula is based on an ellipsoid approximation. It is more accurate, but more complicated to implement.
The Haversine formula is based on a spherical approximation. It is less accurate (0.3%), but simpler to implement.
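For reference, a minimal Java sketch of the Haversine formula (spherical approximation, mean Earth radius taken as 6371 km, inputs in degrees):

// Great-circle distance in km between two lat/long points, spherical approximation.
static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
    double r = 6371.0; // mean Earth radius in km
    double dLat = Math.toRadians(lat2 - lat1);
    double dLon = Math.toRadians(lon2 - lon1);
    double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
             + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
             * Math.sin(dLon / 2) * Math.sin(dLon / 2);
    double c = 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
    return r * c;
}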
Can you simply boost the memory available to the JVM?
java -Xmx512m ...
By default the maximum memory configuration is 64Mb. Some more tuning tips here. If you can do this then you can keep the data in-process and maximise the performance (i.e. you don't need to calculate on the fly).
I upvoted Chi's and Benjamin's answers, because they're telling you what you need to do, but while I'm here, I'd like to stress that using the hashcode of the two strings directly will get you into trouble. You're likely to run into the problem of hash collisions.
This would not be a problem if you were concatenating the two strings (being careful to use a delimiter which cannot appear in the place designators), and letting HashMap do its magic, but the method you suggested, using the hashcodes for the two strings as a key, that's going to get you into trouble.
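For example, something along these lines keeps the full strings in the key instead of collapsing them into one int (placeA, placeB and km stand in for your real variables):

Map<String, Float> distances = new HashMap<>();
// '|' is assumed never to appear in a place designator
distances.put(placeA + "|" + placeB, km);
Float d = distances.get(placeA + "|" + placeB);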
You will simply need more memory. When starting your Java process, kick it off like so:
java -Xmx256M MyClass
The -Xmx defines the max heap size, so this says the process can use up to 256 MB of memory for the heap. If you still run out, keep bumping that number up until you hit the physical limit.
Lately I've dealt with similar requirements for my master's thesis.
I ended up with a Matrix class that uses a double[], not a double[][], in order to avoid the double dereference cost (data[i] yields an array, then [j] yields a double) while allowing the VM to allocate one big, contiguous chunk of memory:
public class Matrix {
    private final double data[];
    private final int rows;
    private final int columns;

    public Matrix(int rows, int columns, double[][] initializer) {
        this.rows = rows;
        this.columns = columns;
        this.data = new double[rows * columns];
        int k = 0;
        for (int i = 0; i < initializer.length; i++) {
            System.arraycopy(initializer[i], 0, data, k, initializer[i].length);
            k += initializer[i].length;
        }
    }

    public Matrix set(int i, int j, double value) {
        data[j + i * columns] = value;
        return this;
    }

    public double get(int i, int j) {
        return data[j + i * columns];
    }
}
This class should use less memory than a HashMap since it uses a primitive array (no boxing needed): it needs only 906304 * 8 ~ 8 MB (for doubles) or 906304 * 4 ~ 4 MB (for floats). My 2 cents.
NB
I've omitted some sanity checks for simplicity's sake
Stephen C. has a good point: if the distances are as-the-crow-flies, then you could probably save memory by doing some calculations on the fly. All you'd need is space for the longitude and latitude of the 952 zip codes, and then you could use the Vincenty formula to do your calculation when you need to. This would make your memory usage O(n) in zip codes.
Of course, that solution makes some assumptions that may turn out to be false in your particular case, i.e. that you have longitude and latitude data for your zipcodes and that you're concerned with as-the-crow-flies distances and not something more complicated like driving directions.
If those assumptions are true though, trading a few computes for a whole bunch of memory might help you scale in the future if you ever need to handle a bigger dataset.
The above suggestions regarding heap size will be helpful. However, I am not sure if you gave an accurate description of the size of your matrix.
Suppose you have 4 locations. Then you need to assess the distances between A->B, A->C, A->D, B->C, B->D, C->D. This suggests six entries in your HashMap (4 choose 2).
That would lead me to believe the actual optimal size of your HashMap is (952 choose 2)=452,676; NOT 952x952=906,304.
This is all assuming, of course, that you only store one-way relationships (i.e. from A->B, but not from B->A, since that is redundant), which I would recommend since you are already experiencing problems with memory space.
Edit: Should have said that the size of your matrix is not optimal, rather than saying the description was not accurate.
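If you store only one-way pairs, they can even live in a single primitive float array indexed by a small formula - a sketch, assuming the 952 places have already been mapped to indexes 0..951 as suggested in another answer:

static final int N = 952;
static final float[] dist = new float[N * (N - 1) / 2]; // 452,676 entries, one per unordered pair

// index of the pair (a, b) with a < b; callers must order the two place indexes first
static int pairIndex(int a, int b) {
    return a * (2 * N - a - 1) / 2 + (b - a - 1);
}

// usage (with a != b): dist[pairIndex(Math.min(a, b), Math.max(a, b))] = km;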
Create a new class with 2 slots for location names. Have it always put the alphabetically first name in the first slot. Give it a proper equals and hashcode method. Give it a compareTo (e.g. order alphabetically by names). Throw them all in an array. Sort it.
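A rough sketch of such a key class (names are illustrative):

class PlacePair implements Comparable<PlacePair> {
    final String first, second; // the alphabetically first name always goes in 'first'

    PlacePair(String a, String b) {
        if (a.compareTo(b) <= 0) { first = a; second = b; }
        else { first = b; second = a; }
    }

    @Override public boolean equals(Object o) {
        if (!(o instanceof PlacePair)) return false;
        PlacePair p = (PlacePair) o;
        return first.equals(p.first) && second.equals(p.second);
    }

    @Override public int hashCode() {
        return 31 * first.hashCode() + second.hashCode();
    }

    @Override public int compareTo(PlacePair o) {
        int c = first.compareTo(o.first);
        return c != 0 ? c : second.compareTo(o.second);
    }
}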
Also, hash1 == hash2 does not imply object1 equals object2. Don't ever do this. It's a hack.