VisualVM java profiling - self time execution?

I have the following Java method:
static Board board;
static int[][] POSSIBLE_PLAYS; // [262143][0 - 81]

public static void playSingleBoard() {
    int subBoard = board.subBoards[board.boardIndex];
    int randomMoveId = generateRandomInt(POSSIBLE_PLAYS[subBoard].length);
    board.play(board.boardIndex, POSSIBLE_PLAYS[subBoard][randomMoveId]);
}
The accessed arrays do not change at runtime, and the method is always called by the same thread. board.boardIndex may range from 0 to 8; there are a total of 9 subBoards.
In VisualVM I end up with the method being executed 2 228 212 times, with (Total Time CPU):
Self Time 27.9%
Board.play(int, int) 24.6%
MainClass.generateRandomInt(int) 8.7%
What I am wondering is where those 27.9% of self time (999 ms / 2189 ms) come from.
I first thought that the two int locals could slow down the method, so I tried the following:
public static void playSingleBoard() {
    board.play(
        board.boardIndex,
        POSSIBLE_PLAYS[board.subBoards[board.boardIndex]]
                      [generateRandomInt(POSSIBLE_PLAYS[board.subBoards[board.boardIndex]].length)]
    );
}
But that ended up with similar results, and I have no clue what this self time can be. Is it GC time? Memory access?
I have tried with and without the JVM options mentioned here => VisualVM - strange self time.

First, VisualVM (like many other safepoint-based profilers) is inherently misleading. Try using a profiler that does not suffer from the safepoint bias. E.g. async-profiler can show not only methods, but also the particular lines/bytecodes where the most CPU time is spent.
Second, in your example, playSingleBoard may indeed take relatively long. Even without a profiler, I can tell that the most expensive operations here are the numerous array accesses.
RAM is the new disk. Memory access is not free, especially random access, and especially when the dataset is too big to fit into the CPU cache. Furthermore, an array access in Java needs to be bounds-checked, and there are no "true" two-dimensional arrays in Java; they are rather arrays of arrays.
This means, an expression like POSSIBLE_PLAYS[subBoard][randomMoveId] will result in at least 5 memory reads and 2 bounds checks. And every time there is a L3 cache miss (which is likely for large arrays like in your case), this will result in ~50 ns latency - the time enough to execute a hundred arithmetic operations otherwise.
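If these array accesses do turn out to dominate, a common mitigation is to flatten the array of arrays into one contiguous int[], removing one level of indirection and one bounds check per access. A minimal sketch (the flat layout, MAX_PLAYS and lookupPlay are illustrative assumptions, not code from the question):

static final int MAX_PLAYS = 81;
// Hypothetical sketch: every row padded to a fixed width MAX_PLAYS, stored in one block.
static int[] POSSIBLE_PLAYS_FLAT = new int[262143 * MAX_PLAYS];

static int lookupPlay(int subBoard, int randomMoveId) {
    // one base+offset computation, one bounds check, one memory read
    return POSSIBLE_PLAYS_FLAT[subBoard * MAX_PLAYS + randomMoveId];
}

This trades memory spent on padding for better locality and fewer dereferences; whether it actually wins depends on cache behavior, so measure it.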

Related

Java can recognize SIMD advantages of CPU; or there is just optimization effect of loop unrolling

This part of the code is from the dotproduct method of a vector class of mine. The method computes the inner product against a target array of vectors (1000 vectors).
When the vector length is an odd number (262145), compute time is 4.37 seconds. When the vector length (N) is 262144 (a multiple of 8), compute time is 1.93 seconds.
time1 = System.nanoTime();
int count = 0;
for (int j = 0; j < 1000; i++)
{
    b = vektors[i]; // selects next vector(b) to multiply as inner product.
                    // each vector has an array of float elements.
    if (((N / 2) * 2) != N)
    {
        for (int i = 0; i < N; i++)
        {
            t1 += elements[i] * b.elements[i];
        }
    }
    else if (((N / 8) * 8) == N)
    {
        float[] vek = new float[8];
        for (int i = 0; i < (N / 8); i++)
        {
            vek[0] = elements[i] * b.elements[i];
            vek[1] = elements[i + 1] * b.elements[i + 1];
            vek[2] = elements[i + 2] * b.elements[i + 2];
            vek[3] = elements[i + 3] * b.elements[i + 3];
            vek[4] = elements[i + 4] * b.elements[i + 4];
            vek[5] = elements[i + 5] * b.elements[i + 5];
            vek[6] = elements[i + 6] * b.elements[i + 6];
            vek[7] = elements[i + 7] * b.elements[i + 7];
            t1 += vek[0] + vek[1] + vek[2] + vek[3] + vek[4] + vek[5] + vek[6] + vek[7];
            // t1 is total sum of all dot products.
        }
    }
}
time2 = System.nanoTime();
time3 = (time2 - time1) / 1000000000.0; // seconds
Question: Could the reduction in time from 4.37 s to 1.93 s (2x as fast) be the JIT's wise decision to use SIMD instructions, or just the positive effect of my loop unrolling?
If the JIT cannot do SIMD optimization automatically, does that mean no unrolling optimization is done automatically by the JIT in this example either?
For 1M iterations (vectors) and a vector size of 64, the speedup multiplier goes to 3.5x (a cache advantage?).
Thanks.
Your code has a bunch of problems. Are you sure you're measuring what you think you're measuring?
Your first loop does this, indented more conventionally:
for (int j = 0; j < 1000; i++) {
    b = vektors[i]; // selects next vector(b) to multiply as inner product.
                    // each vector has an array of float elements.
}
Your rolled loop involves a really long chain of dependent loads and stores. Your unrolled loop involves 8 separate chains of dependent loads and stores. The JVM can't turn one into the other if you're using floating-point arithmetic because they're fundamentally different computations. Breaking dependent load-store chains can lead to major speedups on modern processors.
Your rolled loop iterates over the whole vector. Your unrolled loop only iterates over the first (roughly) eighth. Thus, the unrolled loop again computes something fundamentally different.
I haven't seen a JVM generate vectorised code for something like your second loop, but I'm maybe a few years out of date on what JVMs do. Try using -XX:+PrintAssembly when you run your code and inspect the code opto generates.
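For reference, a correct unrolling would step the index by the unroll factor and cover the whole vector, and using several independent accumulators is what actually breaks the dependency chain. A sketch under the question's assumptions (float[] elements fields as in the question, N a multiple of 4); note that it changes floating-point rounding order, which is precisely why the JIT cannot make this transformation for you:

float t1 = 0f;
float s0 = 0f, s1 = 0f, s2 = 0f, s3 = 0f;
for (int i = 0; i < N; i += 4) {            // step by the unroll factor
    s0 += elements[i]     * b.elements[i];
    s1 += elements[i + 1] * b.elements[i + 1];
    s2 += elements[i + 2] * b.elements[i + 2];
    s3 += elements[i + 3] * b.elements[i + 3];
}
t1 = s0 + s1 + s2 + s3;                     // combine the independent partial sums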
I have done a little research on this (and am drawing from knowledge from a similar project I did in C with matrix multiplication), but take my answer with a grain of salt as I am by no means an expert on this topic.
As for your first question, I think the speedup is coming from your loop unrolling; you're making roughly 87% fewer condition checks in terms of the for loop. As far as I know, the JVM has supported SSE since 1.4, but to actually control whether your code is using vectorization (and to know for sure), you'll need to use JNI.
See an example of JNI here: Do any JVM's JIT compilers generate code that uses vectorized floating point instructions?
When you decrease the size of your vector to 64 from 262144, cache is definitely a factor. When I did this project in C, we had to implement cache blocking for larger matrices in order to take advantage of the cache. One thing you might want to do is check your cache size.
Just as a side note: It might be a better idea to measure performance in flops rather than seconds, just because the runtime (in seconds) of your program can vary based on many different factors, such as CPU usage at the time.

Decreased array access time without array bounds check in java

OK, I miscalculated things in this microbenchmark. Please don't read on unless you have time to spare.
Instead of
double[] my_array = new array[1000000];
double blabla = 0;
for (int i = 0; i < 1000000; i++)
{
    my_array[i] = Math.sqrt(i); // init
}
for (int i = 0; i < 1000000; i++)
{
    blabla += my_array[i]; // array access time is 3.7 ms per 1M operations
}
i used
public final static class my_class
{
    public static double element = 0;

    my_class(double elementz)
    {
        element = elementz;
    }
}

my_class[] class_z = new my_class[1000000];
for (int i = 0; i < 1000000; i++)
{
    class_z[i] = new my_class(Math.sqrt(i)); // instantiating array elements for later use (random access)
}

double blabla = 0;
for (int i = 0; i < 1000000; i++)
{
    blabla += class_z[i].element; // array access time 2.7 ms per 1M operations.
}
The looping overhead is nearly 0.5 ms per 1M loop iterations (I subtracted this as an offset).
The class array's element access time is 25% lower than the primitive array's.
Question: Do you know any other way to lower random-access time even further?
Intel 2 GHz single core, Java with Eclipse.
Looking at your code again, I can see that in the first loop you are adding 1M different elements. In the second example, you are adding the same static element 1M times.
A common problem with micro-benchmarks is the order you perform the tests impacts the results.
For example, if you have two loops, the first loops is initially not compiled to native code. However after some time, the whole method will be compiled and the loop will run faster.
Then you run the second loop and find it is either
much faster because it is optimised from the start. (For simple loops)
much slower because it is optimised without any runtime metrics. (For complex loop)
You need to place each loop in a separate method and run the test alternately a number of times to get reproducible results.
In your first case, the loop is not optimised until after it has run for a while. In the second case, your loop is likely to already be compiled when it starts.
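A minimal sketch of that structure (all names hypothetical): each measured loop in its own method, warmed up before timing, with several timed runs so you can see the variance:

public class Bench {
    static double sumArray(double[] a) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += a[i];
        return s;
    }

    public static void main(String[] args) {
        double[] data = new double[1000000];
        for (int i = 0; i < data.length; i++) data[i] = Math.sqrt(i);

        double sink = 0;
        // warm up so the method is fully JIT-compiled before any timing
        for (int w = 0; w < 10000; w++) sink += sumArray(data);

        // several timed runs to check for consistency
        for (int run = 0; run < 5; run++) {
            long t0 = System.nanoTime();
            sink += sumArray(data);
            long t1 = System.nanoTime();
            System.out.printf("run %d: %.3f ms%n", run, (t1 - t0) / 1e6);
        }
        System.out.println(sink); // keep the result live so the JIT cannot discard the work
    }
}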
The difference is easily explained:
The primitive array has a memory footprint of 1M * 8 bytes = 8MB.
The class array has a memory footprint of 1M * 4 bytes = 4MB, all pointing to the same instance (assuming 32bit VM or compressed refs 64bit VM).
Put different objects into your class array and you will see the primitive array perform better. You are comparing oranges to apples at the moment.
There are several problems with your benchmarks and your assessment above. First, your code doesn't compile as shown. Second, your benchmark times (i.e., a few milliseconds) are far too short to be of any statistical worth with today's high-speed processors. Third, you're comparing apples to oranges (as mentioned above). That is, you're timing two completely different use cases: a single static and a million variables.
I fixed your code and ran it several times on an i7-2620m for 10,000 x 1,000,000 repetitions. All results were within +/- 1%, which is good enough for this discussion. Then, I took the fastest of all of those runs in order to compare their performance.
Above, you claimed that the second use case was "25% lower" than the first. That is wildly inaccurate.
In order to do a "static" versus "variable" performance comparison, I changed the first benchmark to add the 999,999th square-root just like the second one is doing. The difference was only about 4.63% in favor of the second use case.
In order to do an array access performance comparison, I changed the second use case to a "non-static" variable. The difference was about 68.2% in favor of the first use case (primitive array access), meaning that the first way was much faster than the second.
(Feel free to ask me more about micro-benchmarking since I've been doing performance measurement and assessment for over 25 years.)

Out of memory : Multithreading using hashset

I have implemented a Java program. It is basically a multi-threaded service with a fixed number of threads. Each thread takes one task at a time and creates a HashSet; the size of a single HashSet can vary from 10 to 20,000+ items. At the end of each thread, the result is added to a shared List under synchronization.
The problem is that at some point I start getting out-of-memory exceptions. After doing a bit of research, I found that this memory exception occurs when the GC is busy clearing memory, and at that point it stops the whole world from executing anything.
Please give me suggestions on how to deal with such a large amount of data. Is a HashSet the correct data structure to use? How do I deal with the memory exception? One way is to use System.gc(), which again is not good, as it will slow down the whole process. Or is it possible to dispose of the "HashSet hsN" after I add it to the shared List?
Please let me know your thoughts and point out where I am going wrong. This service is going to deal with a huge amount of data processing.
Thanks
import java.util.HashSet;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// business object - to save the result of thread execution
public class Location {
    Integer taskIndex;
    HashSet<Integer> hsN;
}

// task to be performed by each thread
public class MyTask implements Runnable {
    private final long task;

    MyTask(long task) {
        this.task = task;
    }

    @Override
    public void run() {
        HashSet<Integer> hsN = GiveMeResult(task); // some function returning a collection of Integers
                                                   // whose size varies from 10 to 20000
        synchronized (locations) {
            locations.add(task, hsN);
        }
    }
}

public class Main {
    private static final int NTHREDS = 8;
    private static List<Location> locations;

    public static void main(String[] args) {
        ExecutorService executor = Executors.newFixedThreadPool(NTHREDS);
        for (int i = 0; i < 216000; i++) {
            Runnable worker = new MyTask(i);
            executor.execute(worker);
        }
        // This will make the executor accept no new threads
        // and finish all existing threads in the queue
        executor.shutdown();
        // Wait until all threads are finished
        while (!executor.isTerminated()) {
        }
        System.out.println("Finished all threads");
    }
}
For such an implementation, is Java the best choice, or C# with .NET 4?
A couple of issues that I can see:
You synchronize on the MyTask object, which is created separately for each execution. You should be synchronizing on a shared object, preferably the one that you are modifying i.e. the locations object.
216,000 runs, multiplied by say 10,000 returned objects each, multiplied by a minimum of 12 bytes per Integer object is about 24 GB of memory. Do you even have that much physical memory available on your computer, let alone available to the JVM?
32-bit JVMs have a heap size limit of less than 2 GB. On a 64-bit JVM on the other hand, an Integer object takes about 16 bytes, which raises the memory requirements to over 30 GB.
With these numbers it's hardly surprising that you get an OutOfMemoryError...
PS: If you do have that much physical memory available and you still think that you are doing the right thing, you might want to have a look at tuning the JVM heap size.
EDIT:
Even with 25GB of memory available to the JVM it could still be pushing it:
Each Integer object requires 16 bytes on modern 64-bit JVMs.
You also need an 8-byte reference that will point to it, regardless of which List implementation you are using.
If you are using a linked list implementation, each entry will also have an overhead of at least 24 bytes for the list entry object.
At best you could hope to store about 1,000,000,000 Integer objects in 25GB - half that if you are using a linked list. That means that each task could not produce more than 5,000 (2,500 respectively) objects on average without causing an error.
I am unsure of your exact requirement, but have you considered returning a more compact object? For example an int[] array produced from each HashSet would only keep the minimum of 4 bytes per result without the object container overhead.
EDIT 2:
I just realized that you are storing the HashSet objects themselves in the list. HashSet objects use a HashMap internally which then uses a HashMap.Entry object of each entry. On an 64-bit JVM the entry object consumes about 40 bytes of memory in addition to the stored object:
The key reference which points to the Integer object - 8 bytes.
The value reference (always null in a HashSet) - 8 bytes.
The next entry reference - 8 bytes.
The hash value - 4 bytes.
The object overhead - 8 bytes.
Object padding - 4 bytes.
I.e. for each Integer object you need 56 bytes of storage in a HashSet. With the typical HashMap load factor of 0.75, you should add another 10 or so bytes for the HashMap array references. At 66 bytes per Integer you can only store about 400,000,000 such objects in 25 GB, without taking into account the rest of your application or any other overhead. That's less than 2,000 objects per task...
EDIT 3:
You would be better off storing a sorted int[] array instead of a HashSet. That array is searchable in logarithmic time for any arbitrary integer and minimizes the memory consumption to 4 bytes per number. Considering the memory I/O, it would also be as fast as (or faster than) the HashSet implementation.
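A sketch of that conversion, assuming the per-task results initially arrive as a HashSet<Integer>:

import java.util.Arrays;
import java.util.HashSet;

public class CompactResults {
    // ~4 bytes per value instead of ~56+ bytes per boxed HashSet entry
    static int[] toSortedArray(HashSet<Integer> set) {
        int[] result = new int[set.size()];
        int i = 0;
        for (int v : set) result[i++] = v;
        Arrays.sort(result);
        return result;
    }

    // membership test in O(log n)
    static boolean contains(int[] sorted, int value) {
        return Arrays.binarySearch(sorted, value) >= 0;
    }
}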
If you want a more memory-efficient solution, I would use TIntHashSet or a sorted int[]. As for the GC: you do get a Full GC before an OutOfMemoryError, but that is not the cause of the problem, only a symptom. The cause is that you are using more memory than the maximum heap you have allowed.
Another solution is to create tasks as you go instead of creating all your tasks in advance. You can do this by breaking your task in to NTHREAD tasks instead. It appears that you are trying to retain every solution. If so this won't help much. Instead you need to find a way to reduce consumption.
Depending on your distribution of numbers, a BitSet may be more efficient. This uses 1 bit per integer in a range. E.g. if your range is 0 - 20,000, this will use only about 2.5 KB.
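A small sketch of the BitSet approach (illustrative values; assumes the integers fall in a known non-negative range):

import java.util.BitSet;

public class BitSetDemo {
    public static void main(String[] args) {
        BitSet seen = new BitSet(20001);        // one bit per value in 0..20,000: about 2.5 KB
        seen.set(12345);                        // record a value
        System.out.println(seen.get(12345));    // membership test: true
        System.out.println(seen.cardinality()); // number of recorded values: 1
    }
}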
If you are going to keep 216000 * 10000 Integers in memory, you require a huge amount of memory.
You can try the -Xmx setting at the maximum allowable on your system and see how many objects you can store before you run out of memory.
It is not clear why you want to store the results of processing so many threads; what is the next step? If you really need to store that much data, you probably need to use a database.
Now after doing bit of research, I found that this memory exception occurs when GC is busy clearing the memory and at that point it stops the whole world to execute anything.
No, not true. Memory exceptions occur because you are using more memory than was allocated to your program. Very rarely is a memory exception due to some behavior of the GC; that can happen if you configure the GC poorly.
Have you tried running with a larger -Xmx value? And why don't you just use a Hashtable for locations?
You probably need to increase the size of your heap. Please look at the -Xmx JVM setting.

What is the typical speed of a memory allocation in Java?

I was profiling a Java application and discovered that object allocations were happening considerably slower than I'd expect. I ran a simple benchmark to attempt to establish the overall speed of small-object allocations, and I found that allocating a small object (a vector of 3 floats) seems to take about 200 nanoseconds on my machine. I'm running on a (dual-core) 2.0 GHz processor, so this is roughly 400 CPU cycles. I wanted to ask people here who have profiled Java applications before whether that sort of speed is to be expected. It seems a little cruel and unusual to me. After all, I would think that a language like Java that can compact the heap and relocate objects would have object allocation look something like the following:
int obj_addr = heap_ptr;
heap_ptr += some_constant_size_of_object;
return obj_addr;
...which is a couple of lines of assembly. As for garbage collection, I don't allocate or discard enough objects for that to come into play. When I optimize my code by reusing objects, I get performance on the order of 15 ns per object processed instead of 200 ns per object processed, so reusing objects hugely improves performance. I'd really prefer not to reuse objects, because that makes the notation kind of hairy (many methods need to accept a receptacle argument instead of returning a value).
So the question is: is it normal that object allocation is taking so long? Or might something be wrong on my machine that, once fixed, might allow me to have better performance on this? How long do small-object allocations typically take for others, and is there a typical value? I'm using a client machine and not using any compile flags at the moment. If things are faster on your machine, what is your machine's JVM version and operating system?
I realize that individual mileage may vary greatly when it comes to performance, but I'm just asking whether the numbers I'm mentioning above seem like they're in the right ballpark.
Creating objects is very fast when the object is small and there is no GC cost.
final int batch = 1000 * 1000;
Double[] doubles = new Double[batch];
long start = System.nanoTime();
for (int j = 0; j < batch; j++)
    doubles[j] = (double) j;
long time = System.nanoTime() - start;
System.out.printf("Average object allocation took %.1f ns.%n", (double) time / batch);
prints with -verbosegc
Average object allocation took 13.0 ns.
Note: no GCs occurred. However increase the size, and the program needs to wait to copy memory around in the GC.
final int batch = 10 * 1000 * 1000;
prints
[GC 96704K->94774K(370496K), 0.0862160 secs]
[GC 191478K->187990K(467200K), 0.4135520 secs]
[Full GC 187990K->187974K(618048K), 0.2339020 secs]
Average object allocation took 78.6 ns.
I suspect your allocation is relatively slow because you are performing GCs. One way around this is to increase the memory available to the application. (Though this may just delay the cost)
If I run it again with -verbosegc -XX:NewSize=1g
Average object allocation took 9.1 ns.
I don't know how you measure the allocation time. It is probably inlined to at least the equivalent of
intptr_t obj_addr = heap_ptr;
heap_ptr += CONSTANT_SIZE;
if (heap_ptr > young_region_limit)
    call_the_garbage_collector();
return obj_addr;
But it is more complex than that, because you have to fill obj_addr; then some JIT compilation or class loading may happen; and very probably the first few words are initialized (e.g. the class pointer and the hash code, which may involve some random number generation...), and the object constructors are called. They may require synchronization, etc.
And more importantly, a freshly allocated object is perhaps not in the nearest level-one cache, so some cache misses may happen.
So while I am not a Java expert, I am not surprised by your measurements. I do believe that allocating fresh objects makes your code cleaner and more maintainable than trying to reuse older objects.
Yes. The difference between what you think it should do and what it actually does can be pretty large. Pooling may be messy, but when allocation and garbage collection is a large fraction of execution time, which it certainly can be, pooling is a big win, performance-wise.
The objects to pool are the ones you most often find being allocated when you take stack samples.
Here's what such a sample looks like in C++. In Java the details are different, but the idea's the same:
... blah blah system stuff ...
MSVCRTD! 102129f9()
MSVCRTD! 1021297f()
operator new() line 373 + 22 bytes
operator new() line 65 + 19 bytes
COpReq::Handler() line 139 + 17 bytes <----- here is the line that's doing it
doit() line 346 + 12 bytes
main() line 367
mainCRTStartup() line 338 + 17 bytes
KERNEL32! 7c817077()
V------ and that line shows what's being allocated
COperation* pOp = new COperation(iNextOp++, jobid);

At what point is it worth reusing arrays in Java?

How big does a buffer need to be in Java before it's worth reusing?
Or, put another way: I can repeatedly allocate, use, and discard byte[] objects, OR run a pool to keep and reuse them. I might allocate a lot of small buffers that get discarded often, or a few big ones that aren't. At what size does it become cheaper to pool them than to reallocate, and how do small allocations compare to big ones?
EDIT:
Ok, specific parameters. Say an Intel Core 2 Duo CPU, latest VM version for the OS of choice. This question isn't as vague as it sounds... a little code and a graph could answer it.
EDIT2:
You've posted a lot of good general rules and discussions, but the question really asks for numbers. Post 'em (and code too)! Theory is great, but the proof is the numbers. It doesn't matter if results vary some from system to system, I'm just looking for a rough estimate (order of magnitude). Nobody seems to know if the performance difference will be a factor of 1.1, 2, 10, or 100+, and this is something that matters. It is important for any Java code working with big arrays -- networking, bioinformatics, etc.
Suggestions to get a good benchmark:
Warm up the code before running it in the benchmark. All methods should be called at least 10000 times to get full JIT optimization.
Make sure benchmarked methods run for at least 10 seconds, and use System.nanoTime if possible to get accurate timings.
Run benchmark on a system that is only running minimal applications
Run benchmark 3-5 times and report all times, so we see how consistent it is.
I know this is a vague and somewhat demanding question. I will check this question regularly, and answers will get comments and rated up consistently. Lazy answers will not (see below for criteria). If I don't have any answers that are thorough, I'll attach a bounty. I might anyway, to reward a really good answer with a little extra.
What I know (and don't need repeated):
Java memory allocation and GC are fast and getting faster.
Object pooling used to be a good optimization, but now it hurts performance most of the time.
Object pooling is "not usually a good idea unless objects are expensive to create." Yadda yadda.
What I DON'T know:
How fast should I expect memory allocations to run (MB/s) on a standard modern CPU?
How does allocation size affect allocation rate?
What's the break-even point for number/size of allocations vs. re-use in a pool?
Routes to an ACCEPTED answer (the more the better):
A recent whitepaper showing figures for allocation & GC on modern CPUs (recent as in last year or so, JVM 1.6 or later)
Code for a concise and correct micro-benchmark I can run
Explanation of how and why the allocations impact performance
Real-world examples/anecdotes from testing this kind of optimization
The Context:
I'm working on a library adding LZF compression support to Java. This library extends the H2 DBMS LZF classes, by adding additional compression levels (more compression) and compatibility with the byte streams from the C LZF library. One of the things I'm thinking about is whether or not it's worth trying to reuse the fixed-size buffers used to compress/decompress streams. The buffers may be ~8 kB, or ~32 kB, and in the original version they're ~128 kB. Buffers may be allocated one or more times per stream. I'm trying to figure out how I want to handle buffers to get the best performance, with an eye toward potentially multithreading in the future.
Yes, the library WILL be released as open source if anyone is interested in using this.
If you want a simple answer, it is that there is no simple answer. No amount of calling answers (and by implication people) "lazy" is going to help.
How fast should I expect memory allocations to run (MB/s) on a standard modern CPU?
At the speed at which the JVM can zero memory, assuming that the allocation does not trigger a garbage collection. If it does trigger garbage collection, it is impossible to predict without knowing what GC algorithm is used, the heap size and other parameters, and an analysis of the application's working set of non-garbage objects over the lifetime of the app.
How does allocation size effect allocation rate?
See above.
What's the break-even point for number/size of allocations vs. re-use in a pool?
If you want a simple answer, it is that there is no simple answer.
The golden rule is, the bigger your heap is (up to the amount of physical memory available), the smaller the amortized cost of GC'ing a garbage object. With a fast copying garbage collector, the amortized cost of freeing a garbage object approaches zero as the heap gets larger. The cost of the GC is actually determined by (in simplistic terms) the number and size of non-garbage objects that the GC has to deal with.
Under the assumption that your heap is large, the lifecycle cost of allocating and GC'ing a large object (in one GC cycle) approaches the cost of zeroing the memory when the object is allocated.
EDIT: If all you want is some simple numbers, write a simple application that allocates and discards large buffers and run it on your machine with various GC and heap parameters and see what happens. But beware that this is not going to give you a realistic answer because real GC costs depend on an application's non-garbage objects.
I'm not going to write a benchmark for you because I know that it would give you bogus answers.
EDIT 2: In response to the OP's comments.
So, I should expect allocations to run about as fast as System.arraycopy, or a fully JITed array initialization loop (about 1GB/s on my last bench, but I'm dubious of the result)?
Theoretically yes. In practice, it is difficult to measure in a way that separates the allocation costs from the GC costs.
By heap size, are you saying allocating a larger amount of memory for JVM use will actually reduce performance?
No, I'm saying it is likely to increase performance. Significantly. (Provided that you don't run into OS-level virtual memory effects.)
Allocations are just for arrays, and almost everything else in my code runs on the stack. It should simplify measuring and predicting performance.
Maybe. Frankly, I think that you are not going to get much improvement by recycling buffers.
But if you are intent on going down this path, create a buffer pool interface with two implementations. The first is a real thread-safe buffer pool that recycles buffers. The second is dummy pool which simply allocates a new buffer each time alloc is called, and treats dispose as a no-op. Finally, allow the application developer to choose between the pool implementations via a setBufferPool method and/or constructor parameters and/or runtime configuration properties. The application should also be able to supply a buffer pool class / instance of its own making.
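A minimal sketch of that interface and the two implementations (all names hypothetical):

import java.util.ArrayDeque;

interface BufferPool {
    byte[] alloc(int size);
    void dispose(byte[] buffer);
}

// Dummy pool: every alloc is a fresh allocation, dispose is a no-op.
class NoPool implements BufferPool {
    public byte[] alloc(int size) { return new byte[size]; }
    public void dispose(byte[] buffer) { /* let the GC reclaim it */ }
}

// Real pool: recycles buffers of one fixed size; thread-safe via synchronization.
class RecyclingPool implements BufferPool {
    private final ArrayDeque<byte[]> free = new ArrayDeque<byte[]>();
    private final int bufferSize;

    RecyclingPool(int bufferSize) { this.bufferSize = bufferSize; }

    public synchronized byte[] alloc(int size) {
        if (size != bufferSize) return new byte[size]; // unexpected size: fall back to plain allocation
        byte[] b = free.poll();
        return (b != null) ? b : new byte[bufferSize];
    }

    public synchronized void dispose(byte[] buffer) {
        if (buffer.length == bufferSize) free.push(buffer);
    }
}

The library would then pick one of these implementations through the setBufferPool hook mentioned above.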
When it is larger than young space.
If your array is larger than the thread-local young space, it is directly allocated in the old space. Garbage collection on the old space is way slower than on the young space. So if your array is larger than the young space, it might make sense to reuse it.
On my machine, 32kb exceeds the young space. So it would make sense to reuse it.
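If you want to check the young-space size on your own JVM, the memory pool MXBeans report it; a small sketch (the exact pool names vary by collector):

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

public class YoungGenSize {
    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            // look for an eden pool, e.g. "PS Eden Space" or "G1 Eden Space" depending on the GC
            System.out.println(pool.getName() + ": max " + pool.getUsage().getMax() + " bytes");
        }
    }
}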
You've neglected to mention anything about thread safety. If it's going to be reused by multiple threads you'll have to worry about synchronization.
An answer from a completely different direction: let the user of your library decide.
Ultimately, however optimized you make your library, it will only be a component of a larger application. And if that larger application makes infrequent use of your library, there's no reason that it should pay to maintain a pool of buffers -- even if that pool is only a few hundred kilobytes.
So create your pooling mechanism as an interface, and based on some configuration parameter select the implementation that's used by your library. Set the default to be whatever your benchmark tests determine to be the best solution. [1] And yes, if you use an interface you'll have to rely on the JVM being smart enough to inline calls. [2]
(1) By "benchmark," I mean a long-running program that exercises your library outside of a profiler, passing it a variety of inputs. Profilers are extremely useful, but so is measuring the total throughput after an hour of wall-clock time. On several different computers with differing heap sizes, and several different JVMs, running in single and multi-threaded modes.
(2) This can get you into another line of debate about the relative performance of the various invoke opcodes.
Short answer: don't buffer.
The reasons are as follows:
Don't optimize it until it becomes a bottleneck.
If you recycle buffers, the overhead of the pool management will become another bottleneck.
Try to trust the JIT. In the latest JVMs, your array may be allocated on the STACK rather than the HEAP.
Trust me, the JRE usually handles this faster and better than DIY.
Keep it simple, for easier reading and debugging.
When you should recycle an object:
Only if it is heavy. Memory size alone doesn't make it heavy, but native resources and CPU cycles do, which cost additional finalization and CPU cycles.
You may want to recycle buffers if they are ByteBuffers rather than byte[].
Keep in mind that cache effects will probably be more of an issue than the cost of "new int[size]" and its corresponding collection. Reusing buffers is therefore a good idea if you have good temporal locality. Reallocating the buffer instead of reusing it means you might get a different chunk of memory each time. As others mentioned, this is especially true when your buffers don't fit in the young generation.
If you allocate but then don't use the whole buffer, it also pays to reuse as you don't waste time zeroing out memory you never use.
I forgot that this is a managed-memory system.
Actually, you probably have the wrong mindset. The appropriate way to determine when it is useful is dependent on the application, system it is running on, and user usage pattern.
In other words - just profile the system, determine how much time is being spent in garbage collection as a percentage of total application time in a typical session, and see if it is worthwhile to optimize that.
You will probably find out that gc isn't even being called at all. So writing code to optimize this would be a complete waste of time.
With today's large memory spaces, I suspect 90% of the time it isn't worth doing at all. You can't really determine this from parameters; it is too complex. Just profile: it's easy and accurate.
Looking at a micro-benchmark (code below), there is no appreciable difference in time on my machine regardless of the size and the number of times the array is used (I am not posting the times; you can easily run it on your machine :-). I suspect that this is because the garbage is alive for so short a time that there is not much cleanup to do. Array allocation is probably a call to calloc or malloc/memset. Depending on the CPU this will be a very fast operation. If the arrays survived for a longer time, making it past the initial GC area (the nursery), then the time for the version that allocated several arrays might take a bit longer.
code:
import java.util.Random;

public class Main
{
    public static void main(String[] args)
    {
        final int size;
        final int times;

        size = 1024 * 128;
        times = 100;

        // uncomment only one of the ones below for each run
        test(new NewTester(size), times);
        // test(new ReuseTester(size), times);
    }

    private static void test(final Tester tester, final int times)
    {
        final long total;

        // warmup
        testIt(tester, 1000);
        total = testIt(tester, times);
        System.out.println("took: " + total);
    }

    private static long testIt(final Tester tester, final int times)
    {
        long total;

        total = 0;
        for (int i = 0; i < times; i++)
        {
            final long start;
            final long end;
            final int value;

            start = System.nanoTime();
            value = tester.run();
            end = System.nanoTime();
            total += (end - start);

            // make sure the value is used so the VM cannot optimize too much
            System.out.println(value);
        }

        return (total);
    }
}

interface Tester
{
    int run();
}

abstract class AbstractTester
    implements Tester
{
    protected final Random random;

    {
        random = new Random(0);
    }

    public final int run()
    {
        int value;

        value = 0;

        // make sure the random number generator always has the same work to do
        random.setSeed(0);

        // make sure that we have something to return so the VM cannot optimize the code out of existence.
        value += doRun();

        return (value);
    }

    protected abstract int doRun();
}

class ReuseTester
    extends AbstractTester
{
    private final int[] array;

    ReuseTester(final int size)
    {
        array = new int[size];
    }

    public int doRun()
    {
        final int size;

        // make sure the lookup of the array.length happens once
        size = array.length;

        for (int i = 0; i < size; i++)
        {
            array[i] = random.nextInt();
        }

        return (array[size - 1]);
    }
}

class NewTester
    extends AbstractTester
{
    private int[] array;
    private final int length;

    NewTester(final int size)
    {
        length = size;
    }

    public int doRun()
    {
        final int size;

        // make sure the lookup of the length happens once
        size = length;
        array = new int[size];

        for (int i = 0; i < size; i++)
        {
            array[i] = random.nextInt();
        }

        return (array[size - 1]);
    }
}
I came across this thread and, since I was implementing a Floyd-Warshall all pairs connectivity algorithm on a graph with one thousand vertices, I tried to implement it in both ways (re-using matrices or creating new ones) and check the elapsed time.
For the computation I need 1000 different matrices of size 1000 x 1000, so it seems a decent test.
My system is Ubuntu Linux with the following virtual machine.
java version "1.7.0_65"
Java(TM) SE Runtime Environment (build 1.7.0_65-b17)
Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)
Re-using matrices was about 10% slower (average running time over 5 executions: 17354 ms vs 15708 ms). I don't know whether it would still be faster if the matrix were much bigger.
Here is the relevant code:
private void computeSolutionCreatingNewMatrices() {
    computeBaseCase();
    smallest = Integer.MAX_VALUE;
    for (int k = 1; k <= nVertices; k++) {
        current = new int[nVertices + 1][nVertices + 1];
        for (int i = 1; i <= nVertices; i++) {
            for (int j = 1; j <= nVertices; j++) {
                if (previous[i][k] != Integer.MAX_VALUE && previous[k][j] != Integer.MAX_VALUE) {
                    current[i][j] = Math.min(previous[i][j], previous[i][k] + previous[k][j]);
                } else {
                    current[i][j] = previous[i][j];
                }
                smallest = Math.min(smallest, current[i][j]);
            }
        }
        previous = current;
    }
}

private void computeSolutionReusingMatrices() {
    computeBaseCase();
    current = new int[nVertices + 1][nVertices + 1];
    smallest = Integer.MAX_VALUE;
    for (int k = 1; k <= nVertices; k++) {
        for (int i = 1; i <= nVertices; i++) {
            for (int j = 1; j <= nVertices; j++) {
                if (previous[i][k] != Integer.MAX_VALUE && previous[k][j] != Integer.MAX_VALUE) {
                    current[i][j] = Math.min(previous[i][j], previous[i][k] + previous[k][j]);
                } else {
                    current[i][j] = previous[i][j];
                }
                smallest = Math.min(smallest, current[i][j]);
            }
        }
        matrixCopy(current, previous);
    }
}

private void matrixCopy(int[][] source, int[][] destination) {
    assert source.length == destination.length : "matrix sizes must be the same";
    for (int i = 0; i < source.length; i++) {
        assert source[i].length == destination[i].length : "matrix sizes must be the same";
        System.arraycopy(source[i], 0, destination[i], 0, source[i].length);
    }
}
More important than buffer size are the number of allocated objects and the total memory allocated.
Is memory usage an issue at all? If it is a small app, it may not be worth worrying about.
The real advantage of pooling is avoiding memory fragmentation. The overhead of allocating/freeing memory is small, but the disadvantage is that if you repeatedly allocate many objects of many different sizes, memory becomes more fragmented. Using a pool prevents fragmentation.
I think the answer you need is related to the 'order' (measuring space, not time!) of the algorithm.
Copy file example
For example, if you want to copy a file, you need to read from an input stream and write to an output stream. The TIME order is O(n) because the time is proportional to the size of the file. But the SPACE order is O(1), because the program needs only a fixed amount of memory (a single fixed buffer). In this case it is clearly convenient to reuse the very buffer you instantiated at the beginning of the program.
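The classic copy loop makes this concrete: one fixed buffer, allocated once and reused for every chunk (a sketch; the 8 kB buffer size is an arbitrary choice):

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class CopyStreams {
    // O(n) time, O(1) space: the single buffer is reused for the whole copy
    static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8 * 1024]; // allocated once, reused every iteration
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
    }
}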
Relate the buffer policy to your algorithm's execution structure
Of course, if your algorithm needs an endless supply of buffers and each buffer is a different size, you probably cannot reuse them. But it gives you some clues:
Try to fix the size of the buffers (even sacrificing a little bit of memory).
Look at the structure of the execution: for example, if your algorithm traverses some kind of tree and your buffers are related to its nodes, maybe you only need O(log n) buffers... so you can make an educated guess of the space required.
If you need different buffers but can arrange things so they share different segments of the same array, that may be an even better solution.
When you release a buffer, you can add it to a pool of buffers. That pool can be a heap ordered by a "best fit" criterion (buffers that fit best should come first).
What I'm trying to say is: there's no fixed answer. If you instantiated something that you can reuse, it's probably better to reuse it. The tricky part is finding how to do that without incurring buffer-management overhead. That's where algorithm analysis comes in handy.
Hope it helps... :)
