Optimizing processing and management of large Java data arrays

I'm writing some pretty CPU-intensive, concurrent numerical code that will process large amounts of data stored in Java arrays (e.g. lots of double[100000]s). Some of the algorithms might run millions of times over several days so getting maximum steady-state performance is a high priority.
In essence, each algorithm is a Java object that has a method API something like:
public double[] runMyAlgorithm(double[] inputData);
or alternatively a reference could be passed to the array to store the output data:
public void runMyAlgorithm(double[] inputData, double[] outputData);
Given this requirement, I'm trying to determine the optimal strategy for allocating / managing array space. Frequently the algorithms will need large amounts of temporary storage space. They will also take large arrays as input and create large arrays as output.
Among the options I am considering are:
Always allocate new arrays as local variables whenever they are needed (e.g. new double[100000]). Probably the simplest approach, but will produce a lot of garbage.
Pre-allocate temporary arrays and store them as final fields in the algorithm object - big downside would be that this would mean that only one thread could run the algorithm at any one time.
Keep pre-allocated temporary arrays in ThreadLocal storage, so that a thread can use a fixed amount of temporary array space whenever it needs it. ThreadLocal would be required since multiple threads will be running the same algorithm simultaneously.
Pass around lots of arrays as parameters (including the temporary arrays for the algorithm to use). Not good since it will make the algorithm API extremely ugly if the caller has to be responsible for providing temporary array space....
Allocate extremely large arrays (e.g. double[10000000]) but also provide the algorithm with offsets into the array so that different threads will use a different area of the array independently. Will obviously require some code to manage the offsets and allocation of the array ranges.
Any thoughts on which approach would be best (and why)?

What I have noticed when working with memory in Java is the following. If your memory usage patterns are simple (mostly 2-3 types of memory allocation) you can usually do better than the default allocator. You can either preallocate a pool of buffers at application startup and hand them out as needed, or go the other route (allocate one huge array at the beginning and hand out pieces of it when needed). In effect you are writing your own memory allocator. But chances are you will do a worse job than Java's default allocator.
I would probably try the following: standardize the buffer sizes and allocate normally. That way, after a while, the only allocations/deallocations will be in fixed sizes, which greatly helps the garbage collector run fast. Another thing I would do is make sure at algorithm design time that the total memory needed at any one point does not exceed something like 80-85% of the machine's memory, in order not to trigger a full collection inadvertently.
Apart from those heuristics, I would test the hell out of any solution I picked and see how it works in practice.
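For what it's worth, the simplest form of "preallocate and reuse" here is probably option 3 from your list: a per-thread scratch buffer. A minimal sketch, assuming the temporary size is fixed per algorithm (all names are illustrative):

import java.util.Arrays;

public class MyAlgorithm {
    private static final int TEMP_SIZE = 100_000;

    // Each thread lazily gets its own scratch array and reuses it across calls.
    private static final ThreadLocal<double[]> SCRATCH =
            ThreadLocal.withInitial(() -> new double[TEMP_SIZE]);

    public void runMyAlgorithm(double[] inputData, double[] outputData) {
        double[] temp = SCRATCH.get();
        Arrays.fill(temp, 0.0);   // only needed if the algorithm expects zeroed scratch space
        // ... do the actual numerical work using inputData, temp and outputData ...
    }
}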

Allocating big arrays is relatively cheap for the GC. You tend to use up your Eden space quickly, but the cost is largely per object. I suggest you write the code in the simplest manner possible and optimise it later, after profiling the application. A double[100000] is less than a MB, so you can fit over a thousand of them in a GB.
Memory is a lot cheaper than it used to be. An 8 GB server costs about £850; a 24 GB server costs about £1,800 (a 24 GB machine could allow you 24K x double[100000]). You may find that using a large heap size, or even a large Eden size, gives you the efficiency you want.

Related

Java: is it faster to create new array or set all elements of current array to 0

In a performance-critical part of my code, I need to clear an int array buffer by setting it back to all 0s.
Should I do buffer = new int[size] or Arrays.fill(buffer, 0)? The first seems to be faster in my tests, but maybe it will slow down eventually because of garbage collection. I don't have confidence in my own tests (because of stuff like compiler optimization), so I am asking it here.
If it matters, buffer will be size of about 300, and I need to clear buffer when it fills up, so after 300 iterations of my main loop.
I read More efficient to create new array or reset array but it doesn't specifically say for larger arrays. Also it is for Objects, not ints, which I think could matter.
Is it faster to create a new array or set all elements of the current array to 0?
There is no simple answer. The JVM can allocate a default-initialized array faster than fill(array, 0) can fill an array of the same size. But the flipside is that there are GC-related overheads that are difficult to quantify:
The GC costs are typically proportional to the amount of reachable data. For non-reachable objects, the cost is essentially the cost of zeroing the memory.
The GC costs / efficiency will depend on the heap size, and on how full it is.
The GC overheads also depend on the lifetime of the objects. For example a long-lived object will typically be tenured to the "old" generation and GC'd less often. But the flipside is that write barriers may make array writes slower.
Different GC's perform differently.
Different Java JIT compilers, etc perform differently.
And so on.
The bottom line is that it is not possible to give a clear answer without knowing ... more information than you can provide to create a valid model of the behavior.
Likewise, artificial benchmarks are liable to involve making explicit or implicit choices about various of the above (overt and hidden) variables. The result is liable to be that the benchmark results don't reflect real performance in your application.
So the best answer is to measure and compare the performance in the context of your actual application. In other words:
Get your application working
Write a benchmark for measuring your application's performance with realistic test data / inputs
Use the benchmark to compare the performance of the two alternatives in the context of your application.
(Your question has the smell of premature optimization about it. You should be able to put off deciding which of these alternatives is better ... until you have the tools to make a well-founded decision.)
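That said, if you do end up writing a microbenchmark rather than measuring inside your application, a harness such as JMH sidesteps most of the "compiler optimization" traps you mention. A minimal sketch of the comparison (the sizes and scope here are my assumptions, and the caveats above about GC and JIT variability still apply):

import java.util.Arrays;
import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
public class ClearBufferBenchmark {
    @Param({"300", "100000"})   // the question mentions ~300; the larger size is for contrast
    int size;

    int[] buffer;

    @Setup(Level.Iteration)
    public void setUp() {
        buffer = new int[size];
    }

    @Benchmark
    public int[] reallocate() {
        buffer = new int[size];   // fresh, default-initialised array
        return buffer;            // returned so the JIT cannot eliminate the work
    }

    @Benchmark
    public int[] fillWithZeros() {
        Arrays.fill(buffer, 0);   // reuse the existing array
        return buffer;
    }
}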

Java: Fastest way to make a local copy of an initial ArrayList?

My code requires me to create a large (301x301x19 items) ArrayList that has some initial values (some are 0, some are 1, etc) every time I call a function. The starting values are always the same and need to be loaded into the array every time the function is called so that the function has its own copy of these initial values to mess with.
Originally I was recalculating the array every time the function was called, but that proved to be laughably slow; instead, I am now calculating the initial array only once and am making local copies of it every time the function is called (so that I can change the values without changing the initial array's values).
However, copying the array is still proving to be prohibitively slow (well over 3/4ths of the computation time is spent just copying this array). I have tried the following:
// oldList is an ArrayList<Byte>
ArrayList<Byte> newList = new ArrayList<Byte>(oldList);
// oldList is an ArrayList<Byte>
ArrayList<Byte> newList = new ArrayList<Byte>();
newList.addAll(oldList);
// oldList is a Byte[]
ArrayList<Byte> newList = new ArrayList<Byte>(Arrays.asList(oldList));
All of these methods are simply too slow for my application; is there any faster technique to do this or am I out of luck?
In summary:
Aim to design out the need to copy so many large data structures (a hard problem, I know)
Avoid pointer chasing, use arrays rather than ArrayLists. If your objects contain other objects, try to replace them with primitives. The ultimate here is to reduce to an array of primitives, such as a byte array
Compact your data structures, use arrays, smaller types; the goal is to gain the same amount of benefit from copying less actual bytes
Use System.arraycopy
If you still want to go still faster, then take memory layout and responsibility away from the JVM and use sun.misc.Unsafe directly (otherwise known as 'running with scissors')
Changing to a more easily copied data structure, and using System.arraycopy is going to be about as fast as you can get with the approach that you outlined in your question.
System.arraycopy is implemented as a native call. Most JVM providers will have prepared a native version that makes use of native instructions to accelerate the memory copying.
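To make the summary above concrete, here is roughly what the "flatten to a primitive array + System.arraycopy" version could look like for the 301x301x19 grid. This is a sketch: the indexing scheme and the assumption that one byte per cell is enough are mine.

public class Grid {
    static final int X = 301, Y = 301, Z = 19;
    static final byte[] TEMPLATE = buildTemplate();   // the initial values, computed once

    private static byte[] buildTemplate() {
        byte[] t = new byte[X * Y * Z];
        // ... fill in the initial 0/1/... values here, once ...
        return t;
    }

    static int index(int x, int y, int z) {
        return (x * Y + y) * Z + z;   // flattens (x, y, z) into one array offset
    }

    /** Returns a fresh working copy of the template for one function call. */
    static byte[] freshCopy() {
        byte[] copy = new byte[TEMPLATE.length];
        System.arraycopy(TEMPLATE, 0, copy, 0, TEMPLATE.length);
        return copy;   // TEMPLATE.clone() would do the same job
    }
}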
Unfortunately copying large regions of memory has unintended side effects within the JVM, mostly around the Garbage Collector.
during a memory copy the JVM cannot reach a safepoint, which prevents a stop-the-world GC from starting
a GC that cannot start causes threads that did reach the safepoint to wait longer, creating odd stalls on threads that have nothing to do with this work
a large array may not fit within the TLAB (a thread-local allocation buffer used to accelerate object allocation), meaning that object allocation slows down as it enters special-case code
large objects increase the likelihood of premature tenuring during GC cycles, which increases the frequency of the more costly old-gen/full GCs (as opposed to the cheaper young-gen GCs)
NB: for the above effects to be seen, we must be talking about very high rates of allocation and discarding. Most algorithms that do a few allocations and copies here and there will not see these problems; modern JVMs can even cope with fairly high rates. These problems do not occur until a threshold is exceeded and the plates we had been spinning on poles start to hit the floor.

Best practice for creating millions of small temporary objects

What are the "best practices" for creating (and releasing) millions of small objects?
I am writing a chess program in Java and the search algorithm generates a single "Move" object for each possible move, and a nominal search can easily generate over a million move objects per second. The JVM GC has been able to handle the load on my development system, but I'm interested in exploring alternative approaches that would:
Minimize the overhead of garbage collection, and
reduce the peak memory footprint for lower-end systems.
A vast majority of the objects are very short-lived, but about 1% of the moves generated are persisted and returned as the result, so any pooling or caching technique would have to provide the ability to exclude specific objects from being re-used.
I don't expect fully-fleshed out example code, but I would appreciate suggestions for further reading/research, or open source examples of a similar nature.
Run the application with verbose garbage collection:
java -verbose:gc
And it will tell you when it collects. There will be two types of collection: a fast (minor) sweep and a full sweep.
[GC 325407K->83000K(776768K), 0.2300771 secs]
[GC 325816K->83372K(776768K), 0.2454258 secs]
[Full GC 267628K->83769K(776768K), 1.8479984 secs]
The arrow shows the heap usage before and after the collection.
As long as it is just doing minor GCs and not full GCs you are home safe. The regular GC is a copying collector in the 'young generation', so objects that are no longer referenced are simply forgotten about, which is exactly what you want.
Reading Java SE 6 HotSpot Virtual Machine Garbage Collection Tuning is probably helpful.
Since version 6, the server mode of the JVM employs escape analysis. Using it, you can avoid GC altogether for objects that never escape.
Well, there are several questions in one here !
1 - How are short-lived objects managed ?
As previously stated, the JVM can deal perfectly well with a huge number of short-lived objects, since it follows the weak generational hypothesis.
Note that we are speaking of objects that reached the main memory (heap). This is not always the case: a lot of the objects you create never even leave a CPU register. For instance, consider this for-loop:
for (int i = 0; i < max; i++) {
    // stuff that uses i
}
Let's not think about loop unrolling (an optimisation that the JVM performs heavily on your code). If max is equal to Integer.MAX_VALUE, your loop might take some time to execute. However, the i variable never escapes the loop block. Therefore the JVM will keep that variable in a CPU register, regularly increment it, but never send it back to main memory.
So, creating millions of objects is not a big deal if they are used only locally. They will be dead before being stored in Eden, so the GC won't even notice them.
2 - Is it useful to reduce the overhead of the GC ?
As usual, it depends.
First, you should enable GC logging to have a clear view about what is going on. You can enable it with -Xloggc:gc.log -XX:+PrintGCDetails.
If your application is spending a lot of time in a GC cycle, then, yes, tune the GC, otherwise, it might not be really worth it.
For instance, if you have a young GC every 100ms that takes 10ms, you spend 10% of your time in the GC and you have 10 collections per second (which is huuuuuge). In such a case, I would not spend any time on GC tuning, since those 10 GC/s would still be there; reducing the allocation rate is what actually helps (see the next section).
3 - Some experience
I had a similar problem with an application that was creating a huge number of instances of a given class. In the GC logs, I noticed that the allocation rate of the application was around 3 GB/s, which is way too much (come on... 3 gigabytes of data every second ?!).
The problem : too-frequent GCs caused by too many objects being created.
In my case, I attached a memory profiler and noticed that a class represented a huge percentage of all my objects. I tracked down the instantiations to find out that this class was basically a pair of booleans wrapped in an object. In that case, two solutions were available :
Rework the algorithm so that I do not return a pair of booleans but instead I have two methods that return each boolean separately
Cache the objects, knowing that there were only 4 different instances
I chose the second one, as it had the least impact on the application and was easy to introduce. It took me minutes to put in place a factory with a non-thread-safe cache (I did not need thread safety since there would eventually be only 4 distinct instances).
The allocation rate went down to 1 GB/s, and so did the frequency of young GC (divided by 3).
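For illustration, the cached-instances fix looked roughly like this (class and method names are invented; note that this eager version happens to be thread-safe, unlike the quick non-thread-safe cache described above):

final class BoolPair {
    private static final BoolPair FF = new BoolPair(false, false);
    private static final BoolPair FT = new BoolPair(false, true);
    private static final BoolPair TF = new BoolPair(true, false);
    private static final BoolPair TT = new BoolPair(true, true);

    final boolean first, second;

    private BoolPair(boolean first, boolean second) {
        this.first = first;
        this.second = second;
    }

    // Always returns one of the four shared instances, so nothing new is allocated.
    static BoolPair of(boolean first, boolean second) {
        return first ? (second ? TT : TF) : (second ? FT : FF);
    }
}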
Hope that helps !
If you have just value objects (that is, no references to other objects) and really, but I mean really, tons and tons of them, you can use direct ByteBuffers with native byte ordering [the latter is important], and you'll need a few hundred lines of code to allocate/reuse them plus getters/setters. Getters look similar to long getQuantity(int tupleIndex){return buffer.getLong(tupleIndex+QUANTITY_OFFSET);}
That would solve the GC problem almost entirely as long as you allocate only once, i.e. one huge chunk, and then manage the objects yourself. Instead of references you'd have only an index (that is, an int) into the ByteBuffer that has to be passed along. You may need to do the memory alignment yourself as well.
The technique would feel like using C and void*, but with some wrapping it's bearable. A performance downside could be bounds checking, if the compiler fails to eliminate it. A major upside is locality if you process the tuples like vectors; the lack of an object header reduces the memory footprint as well.
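A rough sketch of what such a ByteBuffer-backed tuple store can look like (the 16-byte layout and the field names are invented for illustration):

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class TupleStore {
    static final int TUPLE_SIZE = 16;        // one "object" = two longs
    static final int QUANTITY_OFFSET = 0;
    static final int PRICE_OFFSET = 8;

    private final ByteBuffer buffer;

    public TupleStore(int tupleCount) {
        // One direct buffer with native byte order, allocated once up front.
        buffer = ByteBuffer.allocateDirect(tupleCount * TUPLE_SIZE)
                           .order(ByteOrder.nativeOrder());
    }

    public long getQuantity(int tupleIndex) {
        return buffer.getLong(tupleIndex * TUPLE_SIZE + QUANTITY_OFFSET);
    }

    public void setQuantity(int tupleIndex, long quantity) {
        buffer.putLong(tupleIndex * TUPLE_SIZE + QUANTITY_OFFSET, quantity);
    }
}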
Other than that, it's likely you won't need such an approach, as the young generation of virtually all JVMs dies trivially and the allocation cost is just a pointer bump. Allocation cost can be a bit higher if you use final fields, as they require a memory fence on some platforms (namely ARM/Power); on x86 it is free, though.
Assuming you find GC is an issue (as others point out it might not be), you will be implementing your own memory management for your special case, i.e. a class which suffers massive churn. Give object pooling a go; I've seen cases where it works quite well. Implementing object pools is a well-trodden path, so no need to re-visit it here, but look out for:
multi-threading: using thread local pools might work for your case
backing data structure: consider using ArrayDeque as it performs well on remove and has no allocation overhead
limit the size of your pool :)
Measure before/after, etc.
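A minimal sketch combining the first two bullets (a thread-local pool backed by an ArrayDeque) with a size cap as per the third; Move is just a placeholder class:

import java.util.ArrayDeque;

public class MovePool {
    public static class Move { int from, to; }   // placeholder for the churning class

    private static final int MAX_POOLED = 1024;  // cap the pool size

    private static final ThreadLocal<ArrayDeque<Move>> POOL =
            ThreadLocal.withInitial(ArrayDeque::new);

    public static Move acquire() {
        Move m = POOL.get().poll();              // null if the pool is empty
        return m != null ? m : new Move();
    }

    public static void release(Move m) {
        ArrayDeque<Move> pool = POOL.get();
        if (pool.size() < MAX_POOLED) {
            pool.push(m);
        }                                        // otherwise let the GC have it
    }
}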
I've met a similar problem. First of all, try to reduce the size of the small objects. We introduced some shared default field values and referenced them from each object instance.
For example, MouseEvent has a reference to the Point class. We cached the Points and referenced them instead of creating new instances. The same goes for, for example, empty strings.
Another source was multiple booleans, which were replaced with a single int, with each boolean using just one byte of that int.
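A related, denser variant is to pack the flags into individual bits of the int using masks; a small sketch with invented flag names:

public class Flags {
    private static final int VISIBLE  = 1 << 0;
    private static final int SELECTED = 1 << 1;
    private static final int DIRTY    = 1 << 2;

    private int flags;   // replaces three separate boolean fields

    boolean isSelected() {
        return (flags & SELECTED) != 0;
    }

    void setSelected(boolean value) {
        if (value) flags |= SELECTED; else flags &= ~SELECTED;
        // accessors for VISIBLE and DIRTY look the same
    }
}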
I dealt with this scenario with some XML processing code some time ago. I found myself creating millions of XML tag objects which were very small (usually just a string) and extremely short-lived (failure of an XPath check meant no-match so discard).
I did some serious testing and came to the conclusion that I could only achieve about a 7% improvement on speed using a list of discarded tags instead of making new ones. However, once implemented I found that the free queue needed a mechanism added to prune it if it got too big - this completely nullified my optimisation so I switched it to an option.
In summary - probably not worth it - but I'm glad to see you are thinking about it, it shows you care.
Given that you are writing a chess program there are some special techniques you can use for decent performance. One simple approach is to create a large array of longs (or bytes) and treat it as a stack. Each time your move generator creates moves it pushes a couple of numbers onto the stack, e.g. move from square and move to square. As you evaluate the search tree you will be popping off moves and updating a board representation.
If you want expressive power use objects. If you want speed (in this case) go native.
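A bare-bones sketch of that long-array move stack (the 6-bit square packing and the stack depth are assumptions):

public class MoveStack {
    private final long[] stack = new long[4096];   // deep enough for any search path
    private int top;

    public void push(int fromSquare, int toSquare) {
        stack[top++] = ((long) fromSquare << 6) | toSquare;   // pack both 0-63 squares
    }

    public long pop() {
        return stack[--top];
    }

    public static int from(long move) { return (int) (move >>> 6) & 0x3F; }
    public static int to(long move)   { return (int) (move & 0x3F); }
}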
One solution I've used for such search algorithms is to create just one Move object, mutate it with the new move, and then undo the move before leaving the scope. You are probably analyzing just one move at a time, and then just storing the best move somewhere.
If that's not feasible for some reason, and you want to decrease peak memory usage, a good article about memory efficiency is here: http://www.cs.virginia.edu/kim/publicity/pldi09tutorials/memory-efficient-java-tutorial.pdf
Just create your millions of objects and write your code in the proper way: don't keep unnecessary references to these objects. GC will do the dirty job for you. You can play around with verbose GC as mentioned to see if they are really GC'd. Java IS about creating and releasing objects. :)
I think you should read about stack allocation in Java and escape analysis.
Because if you go deeper into this topic you may find that your objects are not even allocated on the heap, and they are not collected by GC the way that objects on the heap are.
There is a Wikipedia explanation of escape analysis, with an example of how this works in Java:
http://en.wikipedia.org/wiki/Escape_analysis
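For a concrete feel, here is the kind of allocation escape analysis can typically remove via scalar replacement; whether it actually happens depends on the JVM version and on flags such as -XX:+DoEscapeAnalysis:

public class EscapeDemo {
    static final class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    static long distanceSquared(int x, int y) {
        Point p = new Point(x, y);   // never escapes this method: a candidate for scalar replacement
        return (long) p.x * p.x + (long) p.y * p.y;
    }
}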
I am not a big fan of GC, so I always try finding ways around it. In this case I would suggest using Object Pool pattern:
The idea is to avoid creating new objects by storing them in a stack so you can reuse them later.
import java.util.LinkedList;

class MyPool {
    private final LinkedList<Object> stack = new LinkedList<>();

    Object getObject() {        // takes from the stack; if it's empty, creates a new one
        return stack.isEmpty() ? new Object() : stack.pop();
    }

    void returnObject(Object o) { stack.push(o); }   // adds back to the stack
}
Object pools provide tremendous (sometimes 10x) improvements over object allocation on the heap. But the above implementation using a linked list is both naive and wrong! The linked list creates objects to manage its internal structure, nullifying the effort.
A ring buffer using an array of objects works well. In the example given (a chess program managing moves), the ring buffer should be wrapped in a holder object for the list of all computed moves. Only references to that move-holder object would then be passed around.
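A minimal sketch of such a ring-buffer holder (the capacity and the Move placeholder are assumptions):

public class MoveBuffer {
    public static class Move { int from, to; }   // placeholder

    private final Move[] ring;
    private int next;

    public MoveBuffer(int capacity) {
        ring = new Move[capacity];
        for (int i = 0; i < capacity; i++) {
            ring[i] = new Move();                // everything allocated up front, once
        }
    }

    /** Hands out the next pre-allocated Move, overwriting the oldest entry. */
    public Move nextMove() {
        Move m = ring[next];
        next = (next + 1) % ring.length;
        return m;
    }
}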

Java allocation : allocating objects from a pre-existing/allocated pool

In a Java program when it is necessary to allocate thousands of similar-size objects, it would be better (in my mind) to have a "pool" (which is a single allocation) with reserved items that can be pulled from when needed. This single large allocation wouldn't fragment the heap as much as thousands of smaller allocations.
Obviously, there isn't a way to specifically point an object reference to an address in memory (for its member fields) to set up a pool. Even if the new object referenced an area of the pool, the object itself would still need to be allocated. How would you handle many allocations like this without resorting to native OS libraries?
You could try using the Commons Pool library.
That said, unless I had proof the JVM wasn't doing what I needed, I'd probably hold off on optimizing object creation.
Don't worry about it. Unless you have done a lot of testing and analysis on the actual code being run and know that it is a problem with garbage collection and that the JVM isn't doing a good enough job, spend your time elsewhere.
If you are building an application, where a predictable response time is very important, then pooling of objects, no matter how small they are will pay you dividends. Again, pooling is also a factor of how big of a data set you are trying to pool and how much physical memory your machine has.
There is ample proof on the web that shows that object pooling, no matter how small the objects are, is beneficial for application performance.
There are two levels of pooling you could do:
Pooling of the basic objects such as Vectors, which you retrieve from the pool each time you have to use the vector to form a map or such.
Have the higher-level composite objects that are most commonly used pooled.
This is generally an application design decision.
Also, in a multi-threaded application, you would like to be sensitive about how many different threads are going to be allocating and returning to the pool. You certainly do not want your application to be bogged down by contention - especially if you are dealing with thousands of objects at the same time.
@Dave and Casey, you don't need any proof to show that a contiguous memory layout improves cache efficiency, which is the major bottleneck in most OOP apps that need high performance but follow a "too idealistic" OOP design trajectory.
People often think of the GC as the culprit causing low performance in high performance Java applications and after fixing it, just leave it at that, without actually profiling memory-behavior of the application. Note though that un-cached memory instructions are inherently more expensive than arithmetic instructions (and are getting more and more expensive due to the memory access <-> computation gap). So if you care about performance, you should certainly care about memory management.
Cache-aware, or more general, data-oriented programming, is the key to achieving high performance in many kinds of applications, such as games, or mobile apps (to reduce power consumption).
Here is a SO thread on DOP.
Here is a slideshow from the Sony R&D department that shows the usefulness of DOP as applied to a playstation game (high performance required).
So how do you solve the problem that Java does not, in general, allow you to allocate a chunk of memory? My guess is that when the program is just starting, you can assume that there is very little internal fragmentation in the already allocated pages. If you now have a loop that allocates thousands or millions of objects, they will probably all be as contiguous as possible. Note that you only need to make sure that consecutive objects end up on the same cache line, which on many modern systems is only 64 bytes. Also, take a look at the DOP slides if you really care about the (memory) performance of your application.
In short: Always allocate multiple objects at once (increase temporal locality of allocation), and, if your GC has defragmentation, run it beforehand, else try to reduce such allocations to the beginning of your program.
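One concrete way to apply the data-oriented idea in pure Java, without trying to control the GC's layout at all, is a structure-of-arrays design: one primitive array per field instead of one object per item, so that iterating a single field walks memory sequentially. A sketch with invented field names:

public class Particles {
    final float[] x, y;
    final float[] velocityX, velocityY;

    Particles(int count) {
        x = new float[count];
        y = new float[count];
        velocityX = new float[count];
        velocityY = new float[count];
    }

    void step(float dt) {
        for (int i = 0; i < x.length; i++) {   // contiguous, cache-friendly access
            x[i] += velocityX[i] * dt;
            y[i] += velocityY[i] * dt;
        }
    }
}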
I hope this is of some help,
-Domi
PS: @Dave, the Commons Pool library does not allocate objects contiguously. It only keeps track of the allocations by putting them into a reference array embedded in a stack, linked list, or similar.

Does allocation speed depend on the garbage collector being used?

My app is allocating a ton of objects (>1mln per second; most objects are byte arrays of size ~80-100 and strings of the same size) and I think it might be the source of its poor performance.
The app's working set is only tens of megabytes. Profiling the app shows that GC time is negligibly small.
However, I suspect that perhaps the allocation procedure depends on which GC is being used, and some settings might make allocation faster or perhaps make a positive influence on cache hit rate, etc.
Is that so? Or is allocation performance independent of GC settings, under the assumption that garbage collection itself takes little time?
Of course your performance depends on the allocator used. But you have profiled the GC and seen that it is not much of an issue. Also, one of the strengths of the GC is fast allocation at the expense of slower collection.
I think you are having issues with the resulting fragmentation, which makes the memory access pattern problematic for the CPU, since it may need to invalidate its cache too often. Most GC algorithms don't reclaim space in an optimal way.
Since your working set is limited and predictable, you might want to use an object pool which is allocated beforehand. You may also want to use reference counting to avoid much of the manual memory management. Technically it is still GC but not in the common sense of the GC.
Still, I don't think the performance is affected so much by how you manage memory as by how you actually use and access it. Most likely your profiler has the definitive answer.
There are two distinct aspects to object allocation. The first is finding a suitable area of memory; with today's generational garbage collectors, this is usually very fast (on the order of a few tens of machine cycles).
The second is the initialization of the objects you allocate. Since everything you allocate in Java is initialized, the cost of initialization can easily outweigh the cost of allocation (except for the simplest, smallest objects). There is more: since initialization requires writing the entire memory area the new object occupies (if you allocate a "new byte[1<<20]", for example, the entire megabyte needs to be set to zeros), this also usually pulls that memory into the CPU's cache, evicting other, older cache lines (which may or may not belong to your current "hot" working set).
If you do comparatively little processing on each of your arrays, those effects can severely affect the performance of your code. This can be partially avoided by re-using the same arrays over and over, but that usually makes the program logic more complex. It is also often not easy to determine whether cache thrashing is really the culprit; it's impossible to say from the little information given in your question.
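As a sketch of that re-use idea: one buffer per reader, refilled on every read, instead of a fresh ~100-byte array per record (the buffer size here is an assumption):

import java.io.IOException;
import java.io.InputStream;

public class ChecksumReader {
    private final byte[] buffer = new byte[4096];   // allocated once, reused for every read

    long checksum(InputStream in) throws IOException {
        long sum = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {       // refills the same buffer each time
            for (int i = 0; i < n; i++) {
                sum += buffer[i] & 0xFF;
            }
        }
        return sum;
    }
}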
Does your VM try to pool strings? I heard once that IBM's VM did something like string interning, but dynamically (no idea if that's true); perhaps your VM is doing extra work to build an internal data structure of String internals.
Are you doing something like byte b[] = new byte[100]; String s = new String(b); by any chance? You might try not to allocate the String objects, and instead allocate some random object which has a reference to the byte[] (for comparison).
