How big does a buffer need to be in Java before it's worth reusing?
Or, put another way: I can repeatedly allocate, use, and discard byte[] objects OR run a pool to keep and reuse them. I might allocate a lot of small buffers that get discarded often, or a few big ones that don't. At what size is it cheaper to pool them than to reallocate, and how do small allocations compare to big ones?
EDIT:
Ok, specific parameters. Say an Intel Core 2 Duo CPU, latest VM version for the OS of choice. This question isn't as vague as it sounds... a little code and a graph could answer it.
EDIT2:
You've posted a lot of good general rules and discussions, but the question really asks for numbers. Post 'em (and code too)! Theory is great, but the proof is the numbers. It doesn't matter if results vary some from system to system, I'm just looking for a rough estimate (order of magnitude). Nobody seems to know if the performance difference will be a factor of 1.1, 2, 10, or 100+, and this is something that matters. It is important for any Java code working with big arrays -- networking, bioinformatics, etc.
Suggestions to get a good benchmark:
Warm up code before running it in the benchmark. Methods should all be called at least 10000 times to get full JIT optimization.
Make sure benchmarked methods run for at least 10 seconds, and use System.nanoTime if possible, to get accurate timings.
Run the benchmark on a system that is only running minimal applications.
Run benchmark 3-5 times and report all times, so we see how consistent it is.
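Putting those suggestions together, a minimal harness might look like the sketch below. All names are mine, and the buffer size is just one of the sizes from the question; treat it as a skeleton, not a finished benchmark.

```java
public class AllocBench {
    static final int SIZE = 32 * 1024;          // 32 kB buffers, one size from the question

    // Allocate a fresh buffer each iteration; touch it and accumulate a value
    // so the JIT cannot optimize the allocation away.
    static long allocateFresh(int iterations) {
        long sink = 0;
        for (int i = 0; i < iterations; i++) {
            byte[] buf = new byte[SIZE];
            buf[0] = 1;
            sink += buf[0];
        }
        return sink;
    }

    public static void main(String[] args) {
        allocateFresh(100_000);                  // warm-up so the JIT compiles the loop
        for (int run = 0; run < 5; run++) {      // several timed runs to check consistency
            long t0 = System.nanoTime();
            long sink = allocateFresh(100_000);
            long t1 = System.nanoTime();
            System.out.printf("run %d: %d ms (sink=%d)%n",
                    run, (t1 - t0) / 1_000_000, sink);
        }
    }
}
```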
I know this is a vague and somewhat demanding question. I will check this question regularly, and answers will get comments and rated up consistently. Lazy answers will not (see below for criteria). If I don't have any answers that are thorough, I'll attach a bounty. I might anyway, to reward a really good answer with a little extra.
What I know (and don't need repeated):
Java memory allocation and GC are fast and getting faster.
Object pooling used to be a good optimization, but now it hurts performance most of the time.
Object pooling is "not usually a good idea unless objects are expensive to create." Yadda yadda.
What I DON'T know:
How fast should I expect memory allocations to run (MB/s) on a standard modern CPU?
How does allocation size affect allocation rate?
What's the break-even point for number/size of allocations vs. re-use in a pool?
Routes to an ACCEPTED answer (the more the better):
A recent whitepaper showing figures for allocation & GC on modern CPUs (recent as in last year or so, JVM 1.6 or later)
Code for a concise and correct micro-benchmark I can run
Explanation of how and why the allocations impact performance
Real-world examples/anecdotes from testing this kind of optimization
The Context:
I'm working on a library adding LZF compression support to Java. This library extends the H2 DBMS LZF classes, by adding additional compression levels (more compression) and compatibility with the byte streams from the C LZF library. One of the things I'm thinking about is whether or not it's worth trying to reuse the fixed-size buffers used to compress/decompress streams. The buffers may be ~8 kB, or ~32 kB, and in the original version they're ~128 kB. Buffers may be allocated one or more times per stream. I'm trying to figure out how I want to handle buffers to get the best performance, with an eye toward potentially multithreading in the future.
Yes, the library WILL be released as open source if anyone is interested in using this.
If you want a simple answer, it is that there is no simple answer. No amount of calling answers (and by implication people) "lazy" is going to help.
How fast should I expect memory allocations to run (MB/s) on a standard modern CPU?
At the speed at which the JVM can zero memory, assuming that the allocation does not trigger a garbage collection. If it does trigger garbage collection, it is impossible to predict without knowing what GC algorithm is used, the heap size and other parameters, and an analysis of the application's working set of non-garbage objects over the lifetime of the app.
How does allocation size affect allocation rate?
See above.
What's the break-even point for number/size of allocations vs. re-use in a pool?
If you want a simple answer, it is that there is no simple answer.
The golden rule is, the bigger your heap is (up to the amount of physical memory available), the smaller the amortized cost of GC'ing a garbage object. With a fast copying garbage collector, the amortized cost of freeing a garbage object approaches zero as the heap gets larger. The cost of the GC is actually determined by (in simplistic terms) the number and size of non-garbage objects that the GC has to deal with.
Under the assumption that your heap is large, the lifecycle cost of allocating and GC'ing a large object (in one GC cycle) approaches the cost of zeroing the memory when the object is allocated.
EDIT: If all you want is some simple numbers, write a simple application that allocates and discards large buffers and run it on your machine with various GC and heap parameters and see what happens. But beware that this is not going to give you a realistic answer because real GC costs depend on an application's non-garbage objects.
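A sketch of such a throw-away measurement, meant to be run several times under different -Xmx and GC flags (all names here are illustrative, and per the warning above the numbers won't reflect real GC costs in an application with live objects):

```java
public class AllocThroughput {
    // Allocate and discard `count` buffers of `bufSize` bytes; return MB/s.
    static double throughput(int bufSize, int count) {
        byte[] last = null;
        long t0 = System.nanoTime();
        for (int i = 0; i < count; i++) {
            last = new byte[bufSize];            // previous buffer becomes garbage
        }
        long t1 = System.nanoTime();
        last[0] = 1;                             // keep the final buffer live
        double seconds = Math.max(t1 - t0, 1) / 1e9;
        return ((double) bufSize * count) / (1024.0 * 1024.0) / seconds;
    }

    public static void main(String[] args) {
        throughput(128 * 1024, 1_000);           // warm-up
        for (int size : new int[]{8 * 1024, 32 * 1024, 128 * 1024}) {
            System.out.printf("%4d kB buffers: %.0f MB/s%n",
                    size / 1024, throughput(size, 10_000));
        }
    }
}
```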
I'm not going to write a benchmark for you because I know that it would give you bogus answers.
EDIT 2: In response to the OP's comments.
So, I should expect allocations to run about as fast as System.arraycopy, or a fully JITed array initialization loop (about 1GB/s on my last bench, but I'm dubious of the result)?
Theoretically yes. In practice, it is difficult to measure in a way that separates the allocation costs from the GC costs.
By heap size, are you saying allocating a larger amount of memory for JVM use will actually reduce performance?
No, I'm saying it is likely to increase performance. Significantly. (Provided that you don't run into OS-level virtual memory effects.)
Allocations are just for arrays, and almost everything else in my code runs on the stack. It should simplify measuring and predicting performance.
Maybe. Frankly, I think that you are not going to get much improvement by recycling buffers.
But if you are intent on going down this path, create a buffer pool interface with two implementations. The first is a real thread-safe buffer pool that recycles buffers. The second is dummy pool which simply allocates a new buffer each time alloc is called, and treats dispose as a no-op. Finally, allow the application developer to choose between the pool implementations via a setBufferPool method and/or constructor parameters and/or runtime configuration properties. The application should also be able to supply a buffer pool class / instance of its own making.
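A sketch of that interface and its two implementations. The names are hypothetical; the answer prescribes only the shape, not an API, and a real pool would need a policy for sizing and eviction.

```java
import java.util.concurrent.ArrayBlockingQueue;

// Hypothetical names; only the shape follows the answer's advice.
interface BufferPool {
    byte[] alloc(int size);
    void dispose(byte[] buffer);
}

// Dummy pool: plain allocation, dispose is a no-op.
class NoPool implements BufferPool {
    public byte[] alloc(int size) { return new byte[size]; }
    public void dispose(byte[] buffer) { /* let the GC reclaim it */ }
}

// Real pool: thread-safe recycling of one fixed buffer size.
class RecyclingPool implements BufferPool {
    private final ArrayBlockingQueue<byte[]> free;
    private final int size;

    RecyclingPool(int size, int capacity) {
        this.size = size;
        this.free = new ArrayBlockingQueue<>(capacity);
    }

    public byte[] alloc(int requested) {
        if (requested != size) return new byte[requested]; // size mismatch: plain alloc
        byte[] buf = free.poll();
        return (buf != null) ? buf : new byte[size];
    }

    public void dispose(byte[] buffer) {
        if (buffer.length == size) free.offer(buffer);     // silently dropped when full
    }
}

public class PoolDemo {
    public static void main(String[] args) {
        BufferPool pool = new RecyclingPool(32 * 1024, 8);
        byte[] b = pool.alloc(32 * 1024);
        pool.dispose(b);
        System.out.println("recycled: " + (pool.alloc(32 * 1024) == b)); // true
    }
}
```

Swapping NoPool in for RecyclingPool then lets the application developer (or your benchmark) compare the two strategies without touching the calling code.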
When it is larger than young space.
If your array is larger than the thread-local young space, it is directly allocated in the old space. Garbage collection on the old space is way slower than on the young space. So if your array is larger than the young space, it might make sense to reuse it.
On my machine, 32 kB exceeds the young space. So it would make sense to reuse buffers of that size.
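To check where that boundary lies on your own JVM, you can list the memory pool sizes. This is a small diagnostic sketch of mine, not part of the answer; pool names vary by collector (e.g. "PS Eden Space", "G1 Eden Space"):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

// Print each memory pool's maximum size so you can compare your buffer
// size against the eden/survivor (young) spaces of the running JVM.
public class YoungGenSize {
    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            // getMax() may be -1 when the limit is undefined for a pool
            System.out.printf("%-25s max=%d bytes%n",
                    pool.getName(), pool.getUsage().getMax());
        }
    }
}
```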
You've neglected to mention anything about thread safety. If it's going to be reused by multiple threads you'll have to worry about synchronization.
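One way to sidestep that synchronization entirely is a per-thread buffer, at the cost of one buffer per live thread. A minimal sketch, assuming one fixed buffer size (names are illustrative):

```java
// Each thread gets its own lazily created buffer; no locking needed
// because no buffer is ever shared between threads.
public class ThreadLocalBuffer {
    private static final int SIZE = 32 * 1024;
    private static final ThreadLocal<byte[]> BUFFER =
            ThreadLocal.withInitial(() -> new byte[SIZE]);

    static byte[] get() { return BUFFER.get(); }

    public static void main(String[] args) {
        byte[] a = get();
        byte[] b = get();
        System.out.println("same buffer on same thread: " + (a == b)); // true
    }
}
```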
An answer from a completely different direction: let the user of your library decide.
Ultimately, however optimized you make your library, it will only be a component of a larger application. And if that larger application makes infrequent use of your library, there's no reason that it should pay to maintain a pool of buffers -- even if that pool is only a few hundred kilobytes.
So create your pooling mechanism as an interface, and based on some configuration parameter select the implementation that's used by your library. Set the default to be whatever your benchmark tests determine to be the best solution.1 And yes, if you use an interface you'll have to rely on the JVM being smart enough to inline calls.2
(1) By "benchmark," I mean a long-running program that exercises your library outside of a profiler, passing it a variety of inputs. Profilers are extremely useful, but so is measuring the total throughput after an hour of wall-clock time. On several different computers with differing heap sizes, and several different JVMs, running in single and multi-threaded modes.
(2) This can get you into another line of debate about the relative performance of the various invoke opcodes.
Short answer: Don't buffer.
The reasons:
Don't optimize yet; wait until it becomes a bottleneck.
If you recycle buffers, the overhead of managing the pool will be another bottleneck.
Try to trust the JIT. In recent JVMs, escape analysis may let your array be allocated on the stack rather than the heap.
Trust me, the JRE usually handles this faster and better than you can DIY.
Keep it simple, for easier reading and debugging.
When you should recycle an object:
Only if it is heavy. Memory size alone doesn't make it heavy; native resources and CPU cycles do, since they cost additional finalization and CPU time.
You may want to recycle buffers if they are ByteBuffer rather than byte[].
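For example, a direct ByteBuffer's native backing memory is genuinely expensive to allocate, so resetting one buffer with clear() beats reallocating per use. An illustrative sketch:

```java
import java.nio.ByteBuffer;

// Reuse one direct buffer: clear() resets position/limit instead of
// paying for a fresh native allocation on every call.
public class DirectBufferReuse {
    private static final ByteBuffer BUF = ByteBuffer.allocateDirect(32 * 1024);

    // Refill the shared buffer and leave it ready for reading.
    static int fill(byte[] chunk) {
        BUF.clear();            // contents are simply overwritten
        BUF.put(chunk);
        BUF.flip();             // position=0, limit=chunk.length
        return BUF.remaining();
    }

    public static void main(String[] args) {
        System.out.println(fill(new byte[]{1, 2, 3})); // 3
    }
}
```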
Keep in mind that cache effects will probably be more of an issue than the cost of "new int[size]" and its corresponding collection. Reusing buffers is therefore a good idea if you have good temporal locality. Reallocating the buffer instead of reusing it means you might get a different chunk of memory each time. As others mentioned, this is especially true when your buffers don't fit in the young generation.
If you allocate but then don't use the whole buffer, it also pays to reuse as you don't waste time zeroing out memory you never use.
I forgot that this is a managed-memory system.
Actually, you probably have the wrong mindset. The appropriate way to decide depends on the application, the system it is running on, and the usage pattern.
In other words: just profile the system, determine how much time is being spent in garbage collection as a percentage of total application time in a typical session, and see if it is worthwhile to optimize that.
You will probably find out that GC isn't even being called at all, so writing code to optimize this would be a complete waste of time.
With today's large memory spaces I suspect that 90% of the time it isn't worth doing at all. You can't really determine this based on parameters; it is too complex. Just profile: easy and accurate.
Looking at a micro-benchmark (code below) there is no appreciable difference in time on my machine regardless of the size and the number of times the array is used (I am not posting the times; you can easily run it on your machine :-). I suspect that this is because the garbage is alive for so short a time that there is not much cleanup to do. Array allocation is probably just a call to calloc or malloc/memset, which, depending on the CPU, is a very fast operation. If the arrays survived long enough to make it past the initial GC area (the nursery), then the version that allocates several arrays might take a bit longer.
code:
import java.util.Random;
public class Main
{
public static void main(String[] args)
{
final int size;
final int times;
size = 1024 * 128;
times = 100;
// uncomment only one of the ones below for each run
test(new NewTester(size), times);
// test(new ReuseTester(size), times);
}
private static void test(final Tester tester, final int times)
{
final long total;
// warmup
testIt(tester, 1000);
total = testIt(tester, times);
System.out.println("took: " + total);
}
private static long testIt(final Tester tester, final int times)
{
long total;
total = 0;
for(int i = 0; i < times; i++)
{
final long start;
final long end;
final int value;
start = System.nanoTime();
value = tester.run();
end = System.nanoTime();
total += (end - start);
// make sure the value is used so the VM cannot optimize too much
System.out.println(value);
}
return (total);
}
}
interface Tester
{
int run();
}
abstract class AbstractTester
implements Tester
{
protected final Random random;
{
random = new Random(0);
}
public final int run()
{
int value;
value = 0;
// make sure the random number generator always has the same work to do
random.setSeed(0);
// make sure that we have something to return so the VM cannot optimize the code out of existence.
value += doRun();
return (value);
}
protected abstract int doRun();
}
class ReuseTester
extends AbstractTester
{
private final int[] array;
ReuseTester(final int size)
{
array = new int[size];
}
public int doRun()
{
final int size;
// make sure the lookup of the array.length happens once
size = array.length;
for(int i = 0; i < size; i++)
{
array[i] = random.nextInt();
}
return (array[size - 1]);
}
}
class NewTester
extends AbstractTester
{
private int[] array;
private final int length;
NewTester(final int size)
{
length = size;
}
public int doRun()
{
final int size;
// make sure the lookup of the length happens once
size = length;
array = new int[size];
for(int i = 0; i < size; i++)
{
array[i] = random.nextInt();
}
return (array[size - 1]);
}
}
I came across this thread and, since I was implementing a Floyd-Warshall all pairs connectivity algorithm on a graph with one thousand vertices, I tried to implement it in both ways (re-using matrices or creating new ones) and check the elapsed time.
For the computation I need 1000 different matrices of size 1000 x 1000, so it seems a decent test.
My system is Ubuntu Linux with the following virtual machine.
java version "1.7.0_65"
Java(TM) SE Runtime Environment (build 1.7.0_65-b17)
Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)
Re-using matrices was about 10% slower (average running time over 5 executions: 17354 ms vs. 15708 ms). I don't know if creating new matrices would still be faster if the matrices were much bigger.
Here is the relevant code:
private void computeSolutionCreatingNewMatrices() {
computeBaseCase();
smallest = Integer.MAX_VALUE;
for (int k = 1; k <= nVertices; k++) {
current = new int[nVertices + 1][nVertices + 1];
for (int i = 1; i <= nVertices; i++) {
for (int j = 1; j <= nVertices; j++) {
if (previous[i][k] != Integer.MAX_VALUE && previous[k][j] != Integer.MAX_VALUE) {
current[i][j] = Math.min(previous[i][j], previous[i][k] + previous[k][j]);
} else {
current[i][j] = previous[i][j];
}
smallest = Math.min(smallest, current[i][j]);
}
}
previous = current;
}
}
private void computeSolutionReusingMatrices() {
computeBaseCase();
current = new int[nVertices + 1][nVertices + 1];
smallest = Integer.MAX_VALUE;
for (int k = 1; k <= nVertices; k++) {
for (int i = 1; i <= nVertices; i++) {
for (int j = 1; j <= nVertices; j++) {
if (previous[i][k] != Integer.MAX_VALUE && previous[k][j] != Integer.MAX_VALUE) {
current[i][j] = Math.min(previous[i][j], previous[i][k] + previous[k][j]);
} else {
current[i][j] = previous[i][j];
}
smallest = Math.min(smallest, current[i][j]);
}
}
matrixCopy(current, previous);
}
}
private void matrixCopy(int[][] source, int[][] destination) {
assert source.length == destination.length : "matrix sizes must be the same";
for (int i = 0; i < source.length; i++) {
assert source[i].length == destination[i].length : "matrix sizes must be the same";
System.arraycopy(source[i], 0, destination[i], 0, source[i].length);
}
}
More important than buffer size is the number of allocated objects and the total memory allocated.
Is memory usage an issue at all? If it is a small app, it may not be worth worrying about.
The real advantage of pooling is avoiding memory fragmentation. The overhead of allocating/freeing memory is small, but the disadvantage is that if you repeatedly allocate many objects of many different sizes, memory becomes more fragmented. Using a pool prevents fragmentation.
I think the answer you need is related to the 'order' (measuring space, not time!) of the algorithm.
Copy file example
For example, if you want to copy a file you need to read from an input stream and write to an output stream. The TIME order is O(n) because the time will be proportional to the size of the file. But the SPACE order will be O(1) because the program only needs a fixed amount of memory (one fixed buffer). In this case it's clear that it's convenient to reuse the very buffer you instantiated at the beginning of the program.
Relate the buffer policy with your algorithm execution structure
Of course, if your algorithm needs an endless supply of buffers and each buffer is a different size, you probably cannot reuse them. But it gives you some clues:
Try to fix the size of buffers (even sacrificing a little bit of memory).
Look at the structure of the execution: for example, if your algorithm traverses some kind of tree and your buffers are related to each node, maybe you only need O(log n) buffers... so you can make an educated guess of the space required.
If you need different buffers but can arrange things to share different segments of the same array... maybe that's a better solution.
When you release a buffer you can add it to a pool of buffers. That pool can be a heap ordered by a "best fit" criterion (buffers that fit best should come first).
What I'm trying to say is: there's no fixed answer. If you instantiate something that you can reuse... it's probably better to reuse it. The tricky part is finding how to do it without incurring buffer-management overhead. That's where algorithm analysis comes in handy.
Hope it helps... :)
Related
I have the following Java method:
static Board board;
static int[][] POSSIBLE_PLAYS; // [262143][0 - 81]
public static void playSingleBoard() {
int subBoard = board.subBoards[board.boardIndex];
int randomMoveId = generateRandomInt(POSSIBLE_PLAYS[subBoard].length);
board.play(board.boardIndex, POSSIBLE_PLAYS[subBoard][randomMoveId]);
}
Accessed arrays do not change at runtime. The method is always called by the same thread. board.boardIndex may change from 0 to 8, there is a total of 9 subBoards.
In VisualVM I end up with the method being executed 2 228 212 times, with (Total Time CPU):
Self Time 27.9%
Board.play(int, int) 24.6%
MainClass.generateRandomInt(int) 8.7%
What I am wondering is where those 27.9% of self time (999 ms / 2189 ms) come from.
I first thought that allocating 2 int could slow down the method so I tried the following:
public static void playSingleBoard() {
board.play(
board.boardIndex,
POSSIBLE_PLAYS[board.subBoards[board.boardIndex]]
[generateRandomInt(POSSIBLE_PLAYS[board.subBoards[board.boardIndex]].length)]
);
}
But I ended up with similar results, and I have no clue what this self time can be... is it GC time? Memory access?
I have tried with and without the JVM options mentioned here => VisualVM - strange self time.
First, VisualVM (like many other safepoint-based profilers) is inherently misleading. Try a profiler that does not suffer from safepoint bias; e.g., async-profiler can show not only methods, but also the particular lines/bytecodes where the most CPU time is spent.
Second, in your example, playSingleBoard may indeed take relatively long. Even without a profiler, I can tell that the most expensive operations here are the numerous array accesses.
RAM is the new disk. Memory access is not free, especially random access, and especially when the data set is too big to fit into the CPU cache. Furthermore, an array access in Java needs to be bounds-checked. Also, there are no "true" two-dimensional arrays in Java; they are arrays of arrays.
This means, an expression like POSSIBLE_PLAYS[subBoard][randomMoveId] will result in at least 5 memory reads and 2 bounds checks. And every time there is a L3 cache miss (which is likely for large arrays like in your case), this will result in ~50 ns latency - the time enough to execute a hundred arithmetic operations otherwise.
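One common mitigation, not spelled out in the answer but consistent with it, is flattening the array-of-arrays into a single int[], so each lookup is one read and one bounds check. The sketch below uses made-up, much smaller dimensions than the question's 262143 x 82 table:

```java
// A flat int[] indexed as row * COLS + col replaces int[][]: one pointer
// dereference and one bounds check per access instead of two of each.
public class FlatBoard {
    static final int ROWS = 512, COLS = 82;     // illustrative sizes only
    static final int[] FLAT = new int[ROWS * COLS];

    static int get(int row, int col)          { return FLAT[row * COLS + col]; }
    static void set(int row, int col, int v)  { FLAT[row * COLS + col] = v; }

    public static void main(String[] args) {
        set(5, 3, 42);
        System.out.println(get(5, 3)); // 42
    }
}
```

A flat layout also keeps each row's neighbours contiguous in memory, which helps the cache behaviour discussed above.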
In Ulrich Drepper's paper What Every Programmer Should Know About Memory, part 3 (CPU Caches), he shows a graph of the relationship between working-set size and the CPU cycles consumed per operation (in this case, sequential reading). There are two jumps in the graph which indicate the sizes of the L1 and L2 caches. I wrote my own program to reproduce the effect in C. It simply reads an int[] array sequentially from head to tail, and I've tried different array sizes (from 1KB to 1MB). I plotted the data into a graph, and there is no jump; the graph is a straight line.
My questions are:
Is there something wrong with my method? What is the right way to produce the CPU cache effect (to see the jumps)?
I was thinking, if it is sequential read, then it should operate like this:
When reading the first element there is a cache miss, and accesses within the same cache line (64 bytes) will be hits. With the help of prefetching, the latency of reading the next cache line will be hidden. It will contiguously read data into the L1 cache; even when the working-set size is over the L1 cache size, it will evict the least recently used lines and continue prefetching. So most of the cache misses will be hidden: the time spent fetching data from L2 is hidden behind the reading activity, meaning they operate at the same time. The associativity (8-way in my case) will hide the latency of reading data from L2. So the behaviour of my program should be right; am I missing something?
Is it possible to get the same effect in java?
By the way, I am doing this in linux.
Edit 1
Thanks to Stephen C for the suggestion; here is some additional information:
This is my code:
int *arrayInt;
void initInt(long len) {
int i;
arrayInt = (int *)malloc(len * sizeof(int));
memset(arrayInt, 0, len * sizeof(int));
}
long sreadInt(long len) {
int sum = 0;
struct timespec tsStart, tsEnd;
initInt(len);
clock_gettime(CLOCK_REALTIME, &tsStart);
for(i = 0; i < len; i++) {
sum += arrayInt[i];
}
clock_gettime(CLOCK_REALTIME, &tsEnd);
free(arrayInt);
return (tsEnd.tv_nsec - tsStart.tv_nsec) / len;
}
In the main() function I've tried array sizes from 1KB to 100MB; still the same, the average time consumed per element is 2 nanoseconds. I think this time is the access time of L1d.
My cache size:
L1d == 32k
L2 == 256k
L3 == 6144k
EDIT 2
I've changed my code to use a linked list.
// element type
struct l {
struct l *n;
long int pad[NPAD]; // the NPAD could be changed, in my case I set it to 1
};
struct l *array;
long globalSum;
// for init the array
void init(long len) {
long i, j;
struct l *ptr;
array = (struct l*)malloc(sizeof(struct l));
ptr = array;
for(j = 0; j < NPAD; j++) {
ptr->pad[j] = j;
}
ptr->n = NULL;
for(i = 1; i < len; i++) {
ptr->n = (struct l*)malloc(sizeof(struct l));
ptr = ptr->n;
for(j = 0; j < NPAD; j++) {
ptr->pad[j] = i + j;
}
ptr->n = NULL;
}
}
// for free the array when operation is done
void release() {
struct l *ptr = array;
struct l *tmp = NULL;
while(ptr) {
tmp = ptr;
ptr = ptr->n;
free(tmp);
}
}
double sread(long len) {
int i;
long sum = 0;
struct l *ptr;
struct timespec tsStart, tsEnd;
init(len);
ptr = array;
clock_gettime(CLOCK_REALTIME, &tsStart);
while(ptr) {
for(i = 0; i < NPAD; i++) {
sum += ptr->pad[i];
}
ptr = ptr->n;
}
clock_gettime(CLOCK_REALTIME, &tsEnd);
release();
globalSum += sum;
return (double)(tsEnd.tv_nsec - tsStart.tv_nsec) / (double)len;
}
At the end, I printf the globalSum in order to avoid compiler optimization. As you can see, it is still a sequential read. I've tried array sizes up to 500MB; the average time per element is approximately 4 nanoseconds (maybe because it has to access both the data 'pad' and the pointer 'n': two accesses), the same as with a 1KB array. So I think it is because cache optimizations like prefetch hide the latency very well; am I right? I will try a random access and post the result later.
EDIT 3
I've tried a random access to the linked list, this is the result:
The first red line is my L1 cache size, the second is L2. So we can see a little jump there, and sometimes the latency is still hidden well.
This answer isn't an answer, but more of a set of notes.
First, the CPU tends to operate on cache lines, not on individual bytes/words/dwords. This means that if you sequentially read/write an array of integers then the first access to a cache line may cause a cache miss but subsequent accesses to different integers in that same cache line won't. For 64-byte cache lines and 4-byte integers this means that you'd only get a cache miss once for every 16 accesses; which will dilute the results.
Second, the CPU has a "hardware pre-fetcher." If it detects that cache lines are being read sequentially, the hardware pre-fetcher will automatically pre-fetch cache lines it predicts will be needed next (in an attempt to fetch them into cache before they're needed).
Third, the CPU does other things (like "out of order execution") to hide fetch costs. The time difference (between cache hit and cache miss) that you can measure is the time that the CPU couldn't hide and not the total cost of the fetch.
These 3 things combined mean that; for sequentially reading an array of integers, it's likely that the CPU pre-fetches the next cache line while you're doing 16 reads from the previous cache line; and any cache miss costs won't be noticeable and may be entirely hidden. To prevent this; you'd want to "randomly" access each cache line once, to maximise the performance difference measured between "working set fits in cache/s" and "working set doesn't fit in cache/s."
Finally, there are other factors that may influence measurements. For example, for an OS that uses paging (e.g. Linux and almost all other modern OSs) there's a whole layer of caching above all this (TLBs/Translation Look-aside Buffers), and TLB misses once the working set gets beyond a certain size; which should be visible as a fourth "step" in the graph. There's also interference from the kernel (IRQs, page faults, task switches, multiple CPUs, etc); which might be visible as random static/error in the graph (unless tests are repeated often and outliers discarded). There are also artifacts of the cache design (cache associativity) that can reduce the effectiveness of the cache in ways that depend on the physical address/es allocated by the kernel; which might be seen as the "steps" in the graph shifting to different places.
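To illustrate the "randomly access each cache line once" idea in the question's other language of interest, here is a hedged Java sketch. It assumes 64-byte cache lines and 4-byte ints; a real measurement would time the final loop and vary the array size.

```java
import java.util.Arrays;
import java.util.Random;

// Visit each 64-byte cache line of an array exactly once, in a random
// order, which defeats the hardware prefetcher described above.
public class RandomLineAccess {
    static long touchRandomLines(int[] array, long seed) {
        int intsPerLine = 64 / 4;                 // 16 ints per 64-byte line
        int lines = array.length / intsPerLine;
        int[] order = new int[lines];
        for (int i = 0; i < lines; i++) order[i] = i;
        Random rnd = new Random(seed);            // Fisher-Yates shuffle of the visit order
        for (int i = lines - 1; i > 0; i--) {
            int j = rnd.nextInt(i + 1);
            int tmp = order[i]; order[i] = order[j]; order[j] = tmp;
        }
        long sum = 0;                             // one read per line, random order
        for (int line : order) sum += array[line * intsPerLine];
        return sum;
    }

    public static void main(String[] args) {
        int[] data = new int[1 << 20];            // 4 MB: larger than most L2 caches
        Arrays.fill(data, 1);
        System.out.println(touchRandomLines(data, 0)); // 65536 (one hit per line)
    }
}
```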
Is there something wrong with my method?
Possibly, but without seeing your actual code that cannot be answered.
Your description of what your code is doing does not say whether you are reading the array once or many times.
The array may not be big enough ... depending on your hardware. (Don't some modern chips have a 3rd level cache of a few megabytes?)
In the Java case in particular you have to do lots of things the right way to implement a meaningful micro-benchmark.
In the C case:
You might try adjusting the C compiler's optimization switches.
Since your code is accessing the array serially, the compiler might be able to order the instructions so that the CPU can keep up, or the CPU might be optimistically prefetching or doing wide fetches. You could try reading the array elements in a less predictable order.
It is even possible that the compiler has entirely optimized the loop away because result of the loop calculation is not used for anything.
(According to this Q&A - How much time does it take to fetch one word from memory?, a fetch from L2 cache is ~7 nanoseconds and a fetch from main memory is ~100 nanoseconds. But you are getting ~2 nanoseconds. Something clever has to be going on here to make it run as fast as you are observing.)
With gcc-4.7 and compilation with gcc -std=c99 -O2 -S -D_GNU_SOURCE -fverbose-asm tcache.c you can see that the compiler is optimizing enough to remove the for loop (because sum is not used).
I had to improve your source code; some #include-s are missing, and i is not declared in the second function, so your example doesn't even compile as it is.
Make sum a global variable, or pass it somehow to the caller (perhaps with a global int globalsum; and putting globalsum=sum; after the loop).
And I am not sure you are right to clear the array with a memset. I could imagine a clever-enough compiler understanding that you are summing all zeros.
Finally, your code has extremely regular behavior with good locality: once in a while a cache miss happens, the entire cache line is loaded, and the data is good for many iterations. Some clever optimizations (e.g. -O3 or better) might generate the good prefetch instructions. This is optimal for caches, because for an L1 cache line of 32 words the cache miss happens only every 32 loops, so it is well amortized.
Making a linked list of the data will make cache behavior worse. Conversely, in some real programs carefully adding a __builtin_prefetch at a few well-chosen places may improve performance by more than 10% (but adding too many of them will decrease performance).
In real life, the processor is spending the majority of the time to wait for some cache (and it is difficult to measure that; this waiting is CPU time, not idle time). Remember that during an L3 cache miss, the time needed to load data from your RAM module is the time needed to execute hundreds of machine instructions!
I can't say for certain about 1 and 2, but it would be more challenging to successfully run such a test in Java. In particular, I might be concerned that managed language features like automatic garbage collection might happen during the middle of your testing and throw off your results.
As you can see from graph 3.26 the Intel Core 2 shows hardly any jumps while reading (red line at the top of the graph). It is writing/copying where the jumps are clearly visible. Better to do a write test.
Java has a tendency to create a large number of objects that need to be garbage collected when processing large data sets. This happens fairly frequently when streaming large amounts of data from the database, creating reports, etc. Is there a strategy to reduce the memory churn?
In this example, the object-based version spends a significant amount of time (2+ seconds) generating objects and performing garbage collection, whereas the boolean array version completes in a fraction of a second without any garbage collection whatsoever.
How do I reduce the memory churn (the need for large number of garbage collections) when processing large data sets?
java -verbose:gc -Xmx500M UniqChars
...
----------------
[GC 495441K->444241K(505600K), 0.0019288 secs] x 45 times
70000007
================
70000007
import java.util.HashSet;
import java.util.Set;
public class UniqChars {
static String a=null;
public static void main(String [] args) {
//Generate data set
StringBuffer sb=new StringBuffer("sfdisdf");
for (int i =0; i< 10000000; i++) {
sb.append("sfdisdf");
}
a=sb.toString();
sb=null; //free sb
System.out.println("----------------");
compareAsSet();
System.out.println("================");
compareAsAry();
}
public static void compareAsSet() {
Set<String> uniqSet = new HashSet<String>();
int n=0;
for(int i=0; i<a.length(); i++) {
String chr = a.substring(i, i + 1);
uniqSet.add(chr);
n++;
}
System.out.println(n);
}
public static void compareAsAry() {
boolean uniqSet[] = new boolean[65536];
int n=0;
for(int i=0; i<a.length(); i++) {
int chr = (int) a.charAt(i);
uniqSet[chr]=true;
n++;
}
System.out.println(n);
}
}
Well, as pointed out by one of the comments, it's your code, not Java, at fault for the memory churn. So let's see: you've written code that builds an insanely large String from a StringBuffer, calls toString() on it, then calls substring() on that insanely large string in a loop, creating a.length() new Strings. Then it does some in-place work on an array that really will perform pretty damn fast, since there is no object creation, but that ultimately writes true to the same 5-6 locations in a huge array. Waste much? So what did you think would happen? Ditch StringBuffer and use StringBuilder, since it's not synchronized, which will be a little faster.
Ok, so here's where your algorithm is probably spending its time. The StringBuffer allocates an internal character array to store stuff in each time you call append(). When that character array fills up entirely, it has to allocate a larger character array, copy all the junk you just wrote into the new array, then append what you originally called it with. So your code allocates, fills up, allocates a bigger chunk, copies the junk to the new array, and repeats that process until it has done so 10000000 times. You can speed that up by pre-allocating the character array for the StringBuffer: roughly 10000000 * "sfdisdf".length(). That will keep Java from creating tons of memory that it just dumps over and over.
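A sketch of that pre-allocation; note the resulting length, 7 * 10000001 = 70000007, matches the count printed in the question's output.

```java
// Pre-sizing the builder avoids the repeated grow-and-copy cycle:
// one internal allocation instead of a series of reallocations.
public class PreSized {
    static int build(String piece, int repeats) {
        // capacity for the initial piece plus `repeats` appends
        StringBuilder sb = new StringBuilder(piece.length() * (repeats + 1));
        sb.append(piece);                       // the initial "sfdisdf"
        for (int i = 0; i < repeats; i++) {
            sb.append(piece);                   // never triggers an internal copy
        }
        return sb.length();
    }

    public static void main(String[] args) {
        System.out.println(build("sfdisdf", 10_000_000)); // 70000007
    }
}
```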
Next is the compareAsSet() mess. Your line String chr = a.substring(i,i); creates a NEW String a.length() times (and note that, as written, substring(i, i) actually returns the empty string; you presumably meant substring(i, i+1)). Since each substring is only a single character, you could just use charAt(i), and then there's no allocating happening at all. There's also the option of CharSequence, which doesn't create a new String with its own character array but simply points at the original underlying char[] with an offset and length: see String.subSequence().
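Assuming the goal was counting distinct characters, a charAt-based version avoids the per-iteration String allocation entirely (a sketch, not the original poster's code; autoboxing of char still happens, but it's far cheaper than substring()):

```java
import java.util.HashSet;
import java.util.Set;

public class UniqueChars {
    // Count distinct characters without allocating a String per character.
    public static int countDistinct(String s) {
        Set<Character> uniq = new HashSet<Character>();
        for (int i = 0; i < s.length(); i++) {
            uniq.add(s.charAt(i)); // no substring(), no new String
        }
        return uniq.size();
    }

    public static void main(String[] args) {
        // "sfdisdf" contains only s, f, d, i.
        System.out.println(countDistinct("sfdisdf")); // 4
    }
}
```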
You could plug this same code into any other language and it would suck there too. In fact, I'd say far worse. Just try this in C++ and watch it be significantly worse than Java if you allocate and deallocate this much. Java memory allocation is much faster than C++'s, because everything in Java is allocated from a memory pool, so creating objects is magnitudes faster. But there are limits. Furthermore, Java compacts its memory when it becomes too fragmented; C++ doesn't. So as you allocate memory and dump it in the same way, you'll probably run the risk of fragmenting the memory in C++. That could mean your StringBuffer loses the ability to grow large enough to finish, and the program would crash.
In fact, that might also explain some of the performance issues with GC, because the collector has to make room for a contiguous block big enough after lots of trash has been taken out. So Java is not only cleaning up the memory, it's also having to compact the address space so it can get a block big enough for your StringBuffer.
Anyway, I'm sure you're just kicking the tires, but testing with code like this isn't really smart, because it will never perform well: its memory allocation is unrealistic. You know the old adage: Garbage In, Garbage Out. And garbage is what you got.
In your example your two methods are doing very different things.
In compareAsSet() you are generating the same 4 Strings ("s", "d", "f" and "i") and calling String.hashCode() and String.equals(String) (HashSet does this when you try to add them) 70,000,000 times. What you end up with is a HashSet of size 4. While you are doing this, you are allocating a String object each time String.substring(int, int) returns, which forces a minor collection every time the 'new' generation of the garbage collector fills up.
In compareAsAry() you've allocated a single array 65536 elements wide, changed some values in it, and then let it go out of scope when the method returns. This is a single heap memory operation versus the 70,000,000 done in compareAsSet(). You do have a local int variable being changed 70,000,000 times, but that happens in stack memory, not heap memory. This method generates hardly any garbage in the heap compared to the other one (basically just the array).
Regarding churn, your options are recycling objects or tuning the garbage collector.
Recycling is not really possible with Strings in general, as they are immutable. Though the VM may perform interning, that only reduces total memory footprint, not garbage churn. A recycling solution targeted at the above scenario could be written, but the implementation would be brittle and inflexible.
Tuning the garbage collector so that the 'new' generation is larger could reduce the total number of collections performed during your method call and thus increase the throughput of the call; you could also just increase the heap size in general, which would accomplish the same thing.
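For instance, a command line along these lines (the flag values are illustrative only, not tuned for any particular workload, and YourBenchmark is a placeholder class name):

```shell
# -Xmn sizes the young ('new') generation; -Xms/-Xmx size the overall heap.
# -verbose:gc prints each collection so you can see how often they happen.
java -Xms1g -Xmx1g -Xmn512m -verbose:gc YourBenchmark
```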
For further reading on garbage collector tuning in Java 6, I recommend the Oracle white paper linked below.
http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html
For comparison, if you wrote the following, it would do the same thing.
public static void compareLength() {
    // All the loop does is count the length in a complex way.
    System.out.println(a.length());
}

// I assume you intended to write this.
public static void compareAsBitSet() {
    BitSet uniqSet = new BitSet();
    for (int i = 0; i < a.length(); i++)
        uniqSet.set(a.charAt(i));
    // cardinality() is the number of set bits, i.e. distinct characters.
    // (size() would report the BitSet's allocated capacity in bits.)
    System.out.println(uniqSet.cardinality());
}
Note: the BitSet uses 1 bit per element, rather than 1 byte per element. It also expands as required, so if you have ASCII text, the BitSet might use 128 bits, or 16 bytes (plus some fixed object overhead). The boolean[] uses 64 KB, which is much higher. Ironically, using a boolean[] can be faster, as it involves less bit shifting, and only the portion of the array actually used needs to be in memory.
As you can see, with either solution, you get a much more efficient result because you use a better algorithm for what needs to be done.
How can I store a 100K X 100K matrix in Java?
I can't do that with a normal array declaration as it is throwing a java.lang.OutofMemoryError.
The Colt library has a sparse matrix implementation for Java.
You could alternatively use Berkeley DB as your storage engine.
Now, if your machine has enough actual RAM free (around 10 GB even for single-byte entries, and about 80 GB for doubles), you can increase the heap size on the Java command line.
If the vast majority of entries in your matrix will be zero (or even some other constant value) a sparse matrix will be suitable. Otherwise it might be possible to rewrite your algorithm so that the whole matrix doesn't exist simultaneously. You could produce and consume one row at a time, for example.
Sounds like you need a sparse matrix. Others have already suggested good 3rd party implementations that may suit your needs...
Depending on your application, you could get away without a third-party matrix library by just using a Map as a backing store for your matrix data. Kind of...
import java.util.Map;
import java.util.TreeMap;

public class SparseMatrix<T> {
    private T defaultValue;
    private int m;
    private int n;
    // Long keys: i * n + j overflows int once m * n exceeds Integer.MAX_VALUE,
    // which it does for a 100K x 100K matrix.
    private Map<Long, T> data = new TreeMap<Long, T>();

    /// create a new matrix with m rows and n columns
    public SparseMatrix(int m, int n, T defaultValue) {
        this.m = m;
        this.n = n;
        this.defaultValue = defaultValue;
    }

    /// set value at [i,j] (row, col)
    public void setValueAt(int i, int j, T value) {
        if (i >= m || j >= n || i < 0 || j < 0)
            throw new IllegalArgumentException(
                    "index (" + i + ", " + j + ") out of bounds");
        data.put((long) i * n + j, value);
    }

    /// retrieve value at [i,j] (row, col)
    public T getValueAt(int i, int j) {
        if (i >= m || j >= n || i < 0 || j < 0)
            throw new IllegalArgumentException(
                    "index (" + i + ", " + j + ") out of bounds");
        T value = data.get((long) i * n + j);
        return value != null ? value : defaultValue;
    }
}
A simple test case illustrating the SparseMatrix's use would be:
public class SparseMatrixTest extends TestCase {
    public void testMatrix() {
        SparseMatrix<Float> matrix =
                new SparseMatrix<Float>(100000, 100000, 0.0F);
        matrix.setValueAt(1000, 1001, 42.0F);
        assertTrue(matrix.getValueAt(1000, 1001) == 42.0);
        assertTrue(matrix.getValueAt(1001, 1000) == 0.0);
    }
}
This is not the most efficient way of doing it, because every non-default entry in the matrix is stored as an Object. Depending on the number of actual values you are expecting, the simplicity of this approach might trump integrating a third-party solution (and possibly dealing with its license, again depending on your situation).
Adding matrix operations like multiplication to the above SparseMatrix implementation should be straightforward (and is left as an exercise for the reader ;-)
100,000 x 100,000 = 10,000,000,000 (10 billion) entries. Even if you're storing single-byte entries, that's still in the vicinity of 10 GB. Does your machine even have that much physical memory, let alone a willingness to allocate that much to a single process?
Chances are you're going to need to look into some kind of a way to only keep part of the matrix in memory at any given time, and the rest buffered on disk.
There are a number of possible solutions, depending on how much memory you have, how sparse the array actually is, and what the access patterns are going to be.
If 100K * 100K * 8 bytes (about 80 GB) is less than the amount of physical memory your machine can give the JVM, a simple non-sparse array is a viable solution (as a 2D array of rows; a single flat Java array cannot hold 10 billion elements).
If the array is sparse, with (say) 75% or more of the elements being zero, then you can save space by using a sparse array library. Various alternatives have been suggested, but in all cases you still need to work out whether this will give you enough savings. Figure out how many non-zero elements there will be, multiply that by 8 (for doubles) and by (say) 4 to account for the overheads of the sparse structure. If that is less than the amount of physical memory you can make available to the JVM, then sparse arrays are a viable solution.
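That back-of-envelope estimate can be sketched as follows (the 4x overhead factor is the assumption stated above, and the 0.25% density is an arbitrary example):

```java
public class SparseEstimate {
    // Estimated bytes for a sparse store of doubles: 8 bytes per value,
    // times an assumed 4x structural overhead (pointers, keys, etc.).
    public static long sparseBytes(long nonZeroCount) {
        return nonZeroCount * 8 * 4;
    }

    public static void main(String[] args) {
        long nonZero = 25_000_000L; // e.g. 0.25% of a 100K x 100K matrix
        System.out.println(sparseBytes(nonZero)); // 800000000, i.e. ~0.8 GB
    }
}
```

So at that density the sparse form fits comfortably in memory, whereas the dense form (80 GB) does not.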
If sparse and non-sparse arrays (in memory) won't work, things will get more complicated, and the viability of any solution will depend on the access patterns for the array data.
One approach is to represent the array as a file that is mapped into memory in the form of a MappedByteBuffer. Assuming that you don't have enough physical memory to store the entire file in memory, you are going to be hitting the virtual memory system hard. So it is best if your algorithm only needs to operate on contiguous sections of the array at any time. Otherwise, you'll probably die from swapping.
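A minimal sketch of the file-backed approach (my own illustration, not a complete solution: a single MappedByteBuffer is limited to 2 GB, so the real 100K x 100K case would need one mapping per band of rows; this sketch uses a size that fits in one mapping):

```java
import java.io.File;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedMatrix {
    private final MappedByteBuffer buf;
    private final int n; // columns

    public MappedMatrix(String path, int rows, int cols) throws Exception {
        this.n = cols;
        // Mapping past the current end of file extends the file to that size.
        RandomAccessFile raf = new RandomAccessFile(path, "rw");
        long bytes = (long) rows * cols * 8; // 8 bytes per double
        buf = raf.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, bytes);
    }

    public void set(int i, int j, double v) {
        buf.putDouble((i * n + j) * 8, v);
    }

    public double get(int i, int j) {
        return buf.getDouble((i * n + j) * 8);
    }

    public static void main(String[] args) throws Exception {
        File f = File.createTempFile("matrix", ".bin");
        f.deleteOnExit();
        MappedMatrix m = new MappedMatrix(f.getPath(), 1000, 1000); // 8 MB file
        m.set(123, 456, 42.0);
        System.out.println(m.get(123, 456)); // 42.0
    }
}
```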
A second approach is a variation of the first. Map the array/file a section at a time, and when you are done, unmap and move to the next section. This only works if the algorithm works on the array in sections.
A third approach is to represent the array using a light-weight database like BDB. This will be slower than any in-memory solution because reading array elements will translate into disc accesses. But if you get it wrong it won't kill the system like the memory mapped approach will. (And if you do this on Linux/Unix, the system's disc block cache may speed things up, depending on your algorithm's array access patterns)
A fourth approach is to use a distributed memory cache. This replaces disc i/o with network i/o, and it is hard to say whether this is a good or bad thing.
A fifth approach is to analyze your algorithm and see if it is amenable to implementing as a distributed algorithm; e.g. with sections of the array and corresponding parts of the algorithm on different machines.
You can upgrade to this machine:
http://www.azulsystems.com/products/compute_appliance.htm
864 processor cores and 768 GB of memory; it only costs about as much as a single-family house somewhere.
Well, I'd suggest that you increase the memory available to your JVM, but you're going to need a lot of it, as you're talking about 10 billion items. It's (barely) possible with lots of memory or a clustered JVM, but that's probably the wrong answer.
You're getting the OutOfMemoryError because when you declare int[1000], the memory is allocated immediately (additionally, doubles take up more space than ints, so an int representation will also save you space). Maybe you can substitute a more efficient implementation of your array (if you have many empty entries, look up "sparse matrix" representations).
You could store pieces in an outside system, like memcached or memory-mapped buffers.
There are lots of good suggestions here, maybe if you posted a more detailed description of the problem you're trying to solve people could be more specific.
You could try an "external" package to handle matrices. I've never done that myself, but maybe something like JAMA.
Unless you have 100K x 100K x 8 ~ 80GB of memory, you cannot create this matrix in memory. You can create this matrix on disk and access it using memory mapping. However, using this approach will be very slow.
What are you trying to do? You may find that representing your data in a different way will be much more efficient.
I want to test how much memory a class (Foo) takes in Java. In the constructor of Foo I have the following allocations:
int[] a1 = new int[size];
int[] a2 = new int[size];
...
int[] a6 = new int[size];
The size starts at 100 and increases up to 4000.
So my code is:
Runtime r = Runtime.getRuntime();
for (int i = 0; i < 10; i++)
    r.gc();
long before = r.totalMemory() - r.freeMemory();
Foo f = new Foo();
long after = r.totalMemory() - r.freeMemory();
long result = after - before;
The problem is that up to size 2000 I get a nicely increasing result, but after 2000 I get a number smaller than the result at 2000. I guess the GC is being triggered. And sometimes I get the same number twice, as if it doesn't see the difference. I ran with -Xms2024m -Xmx2024m, which is my PC's full memory, but I get the same behaviour. I also ran with -Xmn2023m -Xmx2024m and got some strange results, such as 3.1819152E7.
Please help me with this. Thanks in advance.
All these "I need to know how much memory object A takes" questions are usually a symptom of premature optimization.
If you are optimizing prematurely (and I assume as much), please stop what you're doing right now and get back to what you really should be doing: completing the application you're currently working on (another assumption on my part).
If you are not optimizing prematurely you probably still need to stop right now and start using a profiler that will tell you which objects actually use memory. Only then can you start cutting down memory requirements for objects or checking for objects you have forgotten to remove from some collection.
Garbage collectors are clever beasts. They don't need to collect everything every time; they can defer shuffling things around. You could read about generational garbage collection.
If you want to know how much memory your class is taking, why introduce uncertainty by asking for garbage collection? Hold on to the successively bigger objects and examine how big your app gets. Look at the increments in size.
List<Foo> myListOfBigObjects = new ArrayList<Foo>();
for (int size = 100; size <= 4000; size += 100) {
    myListOfBigObjects.add(new Foo(size)); // keep it reachable
    // now how big are we?
    long used = Runtime.getRuntime().totalMemory()
              - Runtime.getRuntime().freeMemory();
    System.out.println(size + " -> " + used);
}
Or you could just say "an int is so many bytes and we have n of them"; there's some constant overhead per object, but just increasing the array size will surely increase the object's size by a predictable amount.
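A rough calculation along those lines, assuming a typical 64-bit HotSpot layout (16-byte array header, 4 bytes per int, 8 bytes per reference field; all of these constants are assumptions, not measured values):

```java
public class FooSizeEstimate {
    static final long ARRAY_HEADER = 16;    // assumed array header size
    static final long OBJECT_OVERHEAD = 16; // assumed header + padding for Foo

    // Rough estimate for an object holding six int[size] arrays:
    // six arrays plus six reference fields plus Foo's own header.
    public static long estimateBytes(int size) {
        long oneArray = ARRAY_HEADER + 4L * size; // 4 bytes per int
        return OBJECT_OVERHEAD + 6 * (oneArray + 8); // +8 per reference field
    }

    public static void main(String[] args) {
        System.out.println(estimateBytes(100));  // 2560
        System.out.println(estimateBytes(4000)); // 96160
    }
}
```

The exact constants vary with JVM, word size, and compressed-oops settings, but the per-element slope (6 * 4 bytes per unit of size) is predictable, which is all the increment-based measurement needs.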