Efficient GC-assisted cleanup of LARGE native resources - java

I'm currently attempting to write a tensor-processing/deep learning library in Java similar to PyTorch or Tensorflow.
Tensors reference MemoryHandles, which hold the native memory needed for the tensor data.
During training, tensor instances are created rapidly, but nevertheless the JVM heap itself stays at about 100-200 MB, so the garbage collector is never prompted to run.
This results in the memory footprint of the application exploding and consuming upwards of 16GB of RAM, due to how much native memory is needed to store the tensor data.
The memory handles themselves are allocated via a central MemoryManager, which creates PhantomReferences to the handed-out handles; after a handle object is garbage collected, the associated native memory is correctly freed.
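For readers unfamiliar with this setup, the mechanism described above can be sketched roughly as follows. This is a minimal sketch under assumptions, not the library's actual code: the shapes of MemoryHandle and MemoryManager are hypothetical, and the real native malloc/free calls are replaced by a fake address counter and a byte counter.

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for the handle type described above.
final class MemoryHandle {
    final long address;
    final long numBytes;

    MemoryHandle(long address, long numBytes) {
        this.address = address;
        this.numBytes = numBytes;
    }
}

final class MemoryManager {
    // Remembers what to free. It must not hold a strong reference to the handle itself.
    private static final class HandleRef extends PhantomReference<MemoryHandle> {
        final long address;
        final long numBytes;

        HandleRef(MemoryHandle handle, ReferenceQueue<MemoryHandle> queue) {
            super(handle, queue);
            this.address = handle.address;
            this.numBytes = handle.numBytes;
        }
    }

    private final ReferenceQueue<MemoryHandle> queue = new ReferenceQueue<>();
    // The references themselves must stay reachable, or they are never enqueued.
    private final Set<HandleRef> live = ConcurrentHashMap.newKeySet();
    private final AtomicLong nativeBytesInUse = new AtomicLong();
    private long nextFakeAddress = 1;

    synchronized MemoryHandle allocate(long numBytes) {
        long address = nextFakeAddress; // assumption: placeholder for a native malloc
        nextFakeAddress += numBytes;
        MemoryHandle handle = new MemoryHandle(address, numBytes);
        live.add(new HandleRef(handle, queue));
        nativeBytesInUse.addAndGet(numBytes);
        return handle;
    }

    // Releases memory for every handle the GC has already found unreachable.
    int drainQueue() {
        int freed = 0;
        Reference<? extends MemoryHandle> ref;
        while ((ref = queue.poll()) != null) {
            HandleRef handleRef = (HandleRef) ref;
            if (live.remove(handleRef)) {
                // assumption: placeholder for a native free(handleRef.address)
                nativeBytesInUse.addAndGet(-handleRef.numBytes);
                freed++;
            }
        }
        return freed;
    }

    long nativeBytesInUse() {
        return nativeBytesInUse.get();
    }
}
```

Note that drainQueue() can only release what the GC has already discovered, which is exactly why a quiet JVM heap lets native memory pile up.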
What makes this problem hard
Why is the GC not smart enough to instantly clean these tensors?
Operations such as .matmul(), .plus() etc. are not immediately executed, but rather recorded into a Graph, where nodes represent either variables or operations. This graph is necessary for backpropagation and thus creating it is not optional.
This creates a rather complicated reference structure that is hard to unravel for a GC.
Attempted solutions
I have attempted various less-than-ideal ways to fix this problem:
Insanely small JVM heap size
-Xmx100M
By forcing the Garbage collector to work with insanely low heap sizes, the garbage collector keeps the native memory footprint bearable.
This introduces very little slowdown to the training loop in the cases I have evaluated and would be bearable, if finding the ideal heap size to make the GC do what you want weren't so painful. Also, if the memory usage of your application isn't more or less constant, this approach bursts into flames.
Periodic full gc
Running a full gc for every X Mb of natively allocated memory.
This introduces abysmal slow down to the training loop in the cases I have evaluated.
This is the only "in-application" fix that I can think of, meaning, that the user is not forced to use weird jvm args when running their program.
While -XX:+UseZGC and -XX:+ExplicitGCInvokesConcurrent show some improvement, the situation remains rather bad.
Both these solutions do in fact keep the memory footprint of the application at bay, which goes to show that IF the GC catches all the un-referenced MemoryHandles, everything is freed correctly.
Thus my question:
When Jvm applications experience high allocation rates, the GC usually kicks in hard.
Now the problem here is that we effectively have high allocation rates, but that is not at all reflected in the JVM heap. If you put yourself into the shoes of the garbage collector, the last thing you would suspect is that freeing a Java object consisting solely of an 8-byte long is where you should focus your efforts.
If however it was possible to hint the GC to try harder to free objects of the MemoryHandle type, I suspect these problems would largely disappear. So my question would be: Is this possible?
I wouldn't mind writing hacky native code, if necessary.
Another idea would be to use some jvm argument to make the full GC less aggressive, more in line with the slight slowdown that I experienced with -Xmx100m .
If this is in fact not possible, are there alternative solutions to solving this problem?
Surely I can't be the first person to attempt to write a Java library with large native resources.

I think that I have now figured out a solution that works as well as it can.
The problem
If you face a similar issue, you probably have code that fits some of these criteria:
A high allocation rate of small objects, which hold large native resources
Objects referencing each other in complicated ways that are hard for the GC to untangle
No place in the code where you can safely determine that the resources are no longer in use
Requirements for a potential solution
Your requirements probably are:
Don't bottleneck the loop that allocates the native handles
Nearly instantaneous cleanup after the native handle becomes unreferenced
The tradeoff
It turns out you cannot accomplish both these requirements at once.
You unfortunately have to choose between one or the other.
If you don't want to bottleneck the loop that allocates these native handles at a high rate, you need to trade RAM to do that.
If you want instantaneous cleanup after the native handle becomes unreferenced,
you have to sacrifice the execution speed of the code that allocates the native handles.
The (hacky) solution
Create a mechanism such that you can asynchronously request a full GC to be performed.
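import java.util.concurrent.atomic.AtomicBoolean;

private final AtomicBoolean shouldRunGC = new AtomicBoolean(false);

private final Thread gcThread = new Thread(() -> {
    while (true) {
        try {
            Thread.sleep(10);
        } catch (InterruptedException e) {
            // Restore the interrupt flag and stop polling instead of looping on.
            Thread.currentThread().interrupt();
            return;
        }
        // Run a GC only if one was requested since the last poll.
        if (shouldRunGC.getAndSet(false)) {
            System.gc();
        }
    }
}, "GC-Invoker-Thread");

{
    gcThread.setDaemon(true);
    gcThread.start();
}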
Ideally, you have a region of code that is loosely associated with cleanup of these handle objects. It doesn't have to mean that these objects can be safely disposed of at this point in time; it just has to mean that the object is >probably< safe to delete. This call site merely serves as a statistical metric to determine the best interval at which to trigger the garbage collection.
You should also know the size of your native resource, or alternatively have an estimate of how bad it would be to keep a given object around.
Alternatively, you could also place this at the point where your native handles are allocated, but note that the statistical metric you collect there is less effective.
This is an example of such a method in my tensor processing library Sci-Core:
/**
 * Drops the history of how this tensor was computed.
 * This is useful e.g. when the tensor was changed by the optimizer
 * and thus backpropagation back into the last training step (wtf) would be brain-dead.
 * Thus, we no longer need to keep a record of how the tensor was computed.
 * Executes all operations to compute the value of the specified tensor contained in the graph, if it is not already computed.
 * @param tensor the tensor to drop the computation history for
 */
public void dropHistory(ITensor tensor) {
    // for all nodes now dropped from the graph
    ...
    nBytesDeletedSinceLastAsyncGC += value.getNumBytes();
    nBytesDeletedSinceLastOnSameThreadGC += value.getNumBytes();
    ...
    if (nBytesDeletedSinceLastAsyncGC > 100_000_000) { // 100 MB
        shouldRunGC.set(true);
        nBytesDeletedSinceLastAsyncGC = 0;
    }
    if (nBytesDeletedSinceLastOnSameThreadGC > 2_000_000_000) { // 2 GB
        System.gc();
        nBytesDeletedSinceLastOnSameThreadGC = 0;
    }
}
To fight against bottlenecking your allocation loop, you can use the following JVM arguments:
-XX:+UseZGC -XX:+ExplicitGCInvokesConcurrent -XX:MaxGCPauseMillis=1
Why would this work?
Triggering regular garbage collection seems to make the garbage collector interested in cleaning up the very small handle objects (along with basically every other object that you create in your application). You still don't have "prioritization" for your handles; they just happen to be garbage collected as well. If your application also allocates a significant number of other small objects in addition to the native handle objects, the effectiveness of this technique will be significantly reduced.
Note, however, that triggering the garbage collector is expensive, and thus the thresholds for nBytesDeletedSinceLastAsyncGC and nBytesDeletedSinceLastOnSameThreadGC must be chosen carefully.
Running the garbage collector asynchronously is less expensive, as it will not bottleneck your allocation loop very much, but it is also less effective than calling the garbage collector on the same thread on which the objects are allocated. So doing both at carefully chosen intervals can probably get you a good compromise between the execution speed of your allocation loop and the memory footprint.
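Pulled together, the two thresholds can be sketched as one small helper. This is a minimal, hypothetical sketch and not Sci-Core's actual code: the class name GcPressureTracker, its method names, and the injectable gcAction (so that System.gc() can be swapped out when testing) are all assumptions made for illustration.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical helper combining the asynchronous and same-thread GC triggers.
final class GcPressureTracker {
    private final long asyncThresholdBytes;
    private final long syncThresholdBytes;
    private final Runnable gcAction; // in production this would be System::gc
    private final AtomicBoolean asyncGcRequested = new AtomicBoolean(false);
    private long bytesSinceAsyncGc;
    private long bytesSinceSyncGc;

    GcPressureTracker(long asyncThresholdBytes, long syncThresholdBytes, Runnable gcAction) {
        this.asyncThresholdBytes = asyncThresholdBytes;
        this.syncThresholdBytes = syncThresholdBytes;
        this.gcAction = gcAction;
    }

    // Call from the single thread that drops handles; the counters are not thread-safe.
    void recordReleased(long numBytes) {
        bytesSinceAsyncGc += numBytes;
        bytesSinceSyncGc += numBytes;
        if (bytesSinceAsyncGc > asyncThresholdBytes) {
            asyncGcRequested.set(true); // cheap: only flags the background thread
            bytesSinceAsyncGc = 0;
        }
        if (bytesSinceSyncGc > syncThresholdBytes) {
            gcAction.run(); // expensive: runs on the calling thread
            bytesSinceSyncGc = 0;
        }
    }

    // Call periodically from the background thread; runs a GC if one was requested.
    boolean pollAsyncRequest() {
        if (asyncGcRequested.getAndSet(false)) {
            gcAction.run();
            return true;
        }
        return false;
    }
}
```

In the setup above, the daemon thread would call pollAsyncRequest() after each sleep, and dropHistory would call recordReleased(value.getNumBytes()).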

Related

Java: reliably allocate large array on heap

The Task
Allocate X=4..8MB of byte array (on heap), e.g. using ByteBuffer.allocate() such that it will not cause an OutOfMemoryError. It is not allowed to split the array and process it in smaller portions. Note that the allocation happens on heap, this is not a direct ByteBuffer.
The Challenges
Memory can be fragmented, and even if there is enough free memory (greater than X), a contiguous portion of X bytes may still be unavailable for the array (any API to find out whether a contiguous region of X bytes is available would probably help).
Heap memory is divided into regions to keep objects of different generations, and an object cannot span two or more regions of the heap: see "Huge arrays throws out of memory despite enough memory available" and "Large Array allocation across young and tenured portions of java Heap".
Large objects are immediately allocated in a tenured region, but it is tricky to reason reliably about which region exactly, even using ManagementFactory.getMemoryPoolMXBeans(): see "how can I know size of each generation in java heap with jmx". Some JVMs dynamically adjust LOAs: https://www.ibm.com/docs/en/sdk-java-technology/8?topic=SSYKE2_8.0.0/com.ibm.java.vm.80.doc/docs/mm_allocation_loa.html
Question
Is there a way in Java to code as follows?
if (<I can reliably allocate an array sized X bytes on heap right now>) {
ByteBuffer.allocate(X);
}
There’s a fundamental problem with the idea to do
if (<I can reliably allocate an array sized X bytes on heap right now>) {
ByteBuffer.allocate(X);
}
known as the “check-then-act” anti-pattern. Regardless of how the check in the if’s condition is supposed to work, you need to ensure that its result doesn’t change between the check and the subsequent action, i.e. the allocation.
To ensure that the result doesn’t change, you’d not only need to stop all other threads of the same JVM from performing allocations (or concurrent garbage collection from completing) but also prevent all other processes on the same machine from allocating memory, as it is possible that the operating system did not reserve memory for your JVM exclusively but still allows other processes to take it right at this point.
The condition itself has the challenges already named in your question and, as you said yourself, all this fiddling with implementation-specific memory regions might be moot when the JVM is capable of reconfiguring them on the fly. Since this is usually done in response to the result of a garbage collection, you’d need to perform a full garbage collection first to determine the resulting situation. Only then could we be sure that another GC won’t change the situation, and only if we were also able to stop all other threads and processes from doing allocations.
And on some JVMs the only way to reliably trigger a garbage collection, is to perform an actual allocation.
So you need a way to atomically perform the check, followed by an actual allocation that ensures that the memory stays available to you no matter what happens in the environment, or an answer that the memory is not available. This mechanism does exist. Just call ByteBuffer.allocate(X) and if it completes normally, the returned reference ensures that the memory stays available as long as you keep it. Otherwise, the thrown OutOfMemoryError signals the unavailability of the memory. Since this mechanism exists, there is no reason to provide a second one with the same outcome.
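The "just allocate and see" approach can be wrapped as follows. This is a minimal sketch; the class and method names (HeapBuffers, tryAllocate) are hypothetical, and catching OutOfMemoryError is only reasonable here because the failed allocation is a single, known-large request.

```java
import java.nio.ByteBuffer;
import java.util.Optional;

final class HeapBuffers {
    // Attempts the allocation itself instead of checking first; the attempt is the check.
    static Optional<ByteBuffer> tryAllocate(int numBytes) {
        try {
            return Optional.of(ByteBuffer.allocate(numBytes));
        } catch (OutOfMemoryError e) {
            // The heap could not provide a contiguous array of this size right now.
            return Optional.empty();
        }
    }
}
```

A caller would then write something like HeapBuffers.tryAllocate(8 * 1024 * 1024).ifPresent(buffer -> process(buffer)), where process is whatever the application does with the array.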
No, there is no reliable way to do this in Java.
There are several ways to get estimates or best-effort guesses for the available memory, but nothing reliable. Also note that even if there were such a thing, another thread could change the available amount between the condition and the call to allocate.
This related answer contains a way to get such an estimate, and also explains some of the reasons why this can not be reliable.

Predicting Java memory

Is there a way to predict how much memory my Java program is going to take? I come from a C++ background where I implemented methods such as "size_in_bytes()" on classes and I could fairly accurately predict the runtime memory footprint of my app. Now I'm in a Java world, and that is not so easy... There are references, pools, immutable objects that are shared... but I'd still like to be able to predict my memory footprint before I look at the process size in top.
You can inspect the size of objects if you use the instrumentation API. It is a bit tricky to use -- it requires a "premain" method and extra VM parameters -- but there are plenty of examples on the web. "java instrumentation size" should find you these.
Note that the default methods will only give you a shallow size. And unless you avoid any object construction outside of the constructor (which is next to impossible), there will be dead objects around waiting to be garbage collected.
But in general, you could use these to estimate the memory requirements of your application, if you have a good control on the amount of objects generated.
You can't predict the amount of memory a program is going to take. However, you can predict how much an object will take. Edit: it turns out I'm almost completely wrong; this document describes the memory usage of objects better: http://www.javamex.com/tutorials/memory/object_memory_usage.shtml
In general, you can predict fairly closely what a given object will require. There's some overhead that is relatively fixed, plus the instance fields in the object, plus a modest amount of padding. But then object size is rounded up to at least (on most JVMs) a 16-byte boundary, and some JVMs round up some object sizes to larger boundaries (to allow the use of standard sized pre-allocated object frames). But all this is relatively fixed for a given JVM.
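As a rough worked example of the rounding described above, the following sketch estimates an object's shallow size as header plus fields, rounded up to the alignment boundary. The header size and alignment are passed in as parameters because they vary by JVM and settings; the values used in the note below (a 12-byte header, 8- or 16-byte alignment) are assumptions for illustration, the helper names are hypothetical, and gaps caused by field alignment are ignored.

```java
final class ObjectSizeEstimate {
    // Rounds size up to the next multiple of alignment.
    static long alignUp(long size, long alignment) {
        return (size + alignment - 1) / alignment * alignment;
    }

    // Fixed header overhead plus instance fields, padded to the alignment boundary.
    static long shallowSize(long headerBytes, long fieldBytes, long alignment) {
        return alignUp(headerBytes + fieldBytes, alignment);
    }
}
```

Under these assumptions, an object holding a single 8-byte long comes to 12 + 8 = 20 bytes, which pads to 24 bytes with 8-byte alignment and to 32 bytes with 16-byte alignment.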
What varies, of course, is the overhead required for garbage collection. A naive garbage collector requires 100% overhead (at least one free byte for every allocated byte), though certain varieties of "generational" collectors can improve on this to a degree. But how much space is required for GC is highly dependent on the workload (on most JVMs).
The other problem is that when you're running at a relatively low level of allocation (where you're only using maybe 10% of max available heap) then garbage will accumulate. It isn't actively referenced, but the bits of garbage are interspersed with your active objects, so it takes up working set. As a result, your working set tends to be roughly equal to your current overall garbage-collected heap size (plus other system overhead).
You can, of course, "throttle" the heap size so that you run at a higher % utilization, but that increases the frequency of garbage collection (and the overall cost of GC to a lesser degree).
You can use profilers to understand the constant set of objects that are always in memory. Then you should execute all the code paths to check for memory leaks. JProfiler is a good one to start with.

Is object creation a bottleneck in Java in multithreaded environment?

Based on the understanding from the following:
Where is allocated variable reference, in stack or in the heap?
I was wondering since all the objects are created on the common heap. If multiple threads create objects then to prevent data corruption there has to be some serialization that must be happening to prevent the multiple threads from creating objects at same locations. Now, with a large number of threads this serialization would cause a big bottleneck. How does Java avoid this bottleneck? Or am I missing something?
Any help appreciated.
Modern VM implementations reserve a separate area on the heap for each thread to create objects in. So there is no problem as long as this area does not get full (at which point the garbage collector moves the surviving objects).
Further reading: how TLAB works in Sun's JVM. Azul's VM uses a slightly different approach (look at "A new thread & stack layout"); the article shows quite a few tricks JVMs may perform behind the scenes to achieve the speed of today's Java.
The main idea is keeping a per-thread (non-shared) area in which to allocate new objects, much like allocating on the stack in C/C++. Copying garbage collection is very quick at deallocating short-lived objects; the few survivors, if any, are moved into a different area. Thus, creating relatively small objects is very fast and lock-free.
Lock-free allocation is very important, especially since the question concerns a multithreaded environment. It also allows truly lock-free algorithms to exist. Even if an algorithm is itself lock-free, if the allocation of new objects is synchronized, the entire algorithm is effectively synchronized and ultimately less scalable.
java.util.concurrent.ConcurrentLinkedQueue, which is based on the work of Maged M. Michael and Michael L. Scott, is a classic example.
What happens if an object is referenced by another thread? (due to discussion request)
That object (call it A) will be moved to some "survivor" area. The survivor area is checked less often than the thread-local areas. It contains, as the name suggests, objects whose references managed to escape, or, in this particular case, A, which managed to stay alive. The copy (move) part occurs during some "safe point" (safe point excludes properly JIT'd code), so the garbage collector is sure the object is not being referenced. The references to the object are updated, the necessary memory fences are issued, and the application (Java code) is free to continue. Further reading on this simplistic scenario.
To the very interested reader and if possible to chew it: the highly advanced Pauseless GC Algorithm
No. The JVM has all sorts of tricks up its sleeves to avoid any sort of simpleminded serialization at the point of 'new'.
Sometimes. I wrote a recursive method that generates integer permutations and creates objects from those. The multithreaded version (every branch from root = task, but concurrent thread count limited to number of cores) of that method wasn't faster. And the CPU load wasn't higher. The tasks didn't share any object. After I removed the object creation from both methods the multithreaded method was ~4x faster (6 cores) and used 100% CPU. In my test case the methods generated ~4,500,000 permutations, 1500 per task.
I think TLAB didn't work because its space is limited (see: Thread Local Allocation Buffers).

Does allocation speed depend on the garbage collector being used?

My app is allocating a ton of objects (more than 1 million per second; most objects are byte arrays of size ~80-100 and strings of the same size) and I think it might be the source of its poor performance.
The app's working set is only tens of megabytes. Profiling the app shows that GC time is negligibly small.
However, I suspect that perhaps the allocation procedure depends on which GC is being used, and some settings might make allocation faster or perhaps make a positive influence on cache hit rate, etc.
Is that so? Or is allocation performance independent of GC settings under the assumption that garbage collection itself takes little time?
Of course your performance depends on the allocator used. But you have profiled GC and saw that it is not much of an issue. Also, one of the strengths of the GC is fast allocation at the expense of slower collection.
I think you are having issues with the resulting fragmentation, which makes the memory access pattern problematic for the CPU, since it may need to invalidate its cache too often. Most GC algorithms don't reclaim space in an optimal way.
Since your working set is limited and predictable, you might want to use an object pool which is allocated beforehand. You may also want to use reference counting to avoid much of the manual memory management. Technically it is still GC but not in the common sense of the GC.
Still, I don't think the performance is much affected by how you manage memory but how you actually use, access it. Most likely your profiler has the definite answer.
There are two distinct aspects to object allocation. The first is finding a suitable area of memory. With today's generational garbage collectors, this is usually very fast (on the order of a few tens of machine cycles).
The second is the initialization of the objects you allocate. Since everything you allocate in Java is initialized, the cost of initialization can easily outweigh the cost of allocation (except for the simplest, smallest objects). There is more. Since initialization requires writing the entire memory area the new object occupies (if you allocate a "new byte[1<<20]" for example, the entire megabyte needs to be set to zeros), this also usually pulls that memory into the CPU's cache, evicting other, older cache lines (which may or may not belong to your current "hot" working set).
If you do comparatively little processing on each of your arrays, those effects can severely affect the performance of your code. This can be partially avoided by re-using the same arrays over and over, but it usually makes the program logic more complex. It is also often not easy to determine whether cache thrashing is really the culprit. It's impossible to say from what little information is given in your question.
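The array re-use mentioned above could look roughly like this. It is a minimal sketch with hypothetical names (ReusableBuffer, acquire), intended for single-threaded use, and it clears only the portion of the array that the caller is about to use rather than zeroing a fresh array on every call.

```java
import java.util.Arrays;

// Hands out the same backing array repeatedly instead of allocating a new one per call.
final class ReusableBuffer {
    private byte[] buffer = new byte[0];

    // Returns an array of at least numBytes whose first numBytes bytes are zero.
    byte[] acquire(int numBytes) {
        if (buffer.length < numBytes) {
            buffer = new byte[numBytes]; // grows rarely; new arrays are already zeroed
        } else {
            Arrays.fill(buffer, 0, numBytes, (byte) 0); // clear only the part in use
        }
        return buffer;
    }
}
```

The added complexity the answer warns about shows up immediately: the caller must not keep a reference to the returned array across calls, because the next acquire() overwrites it.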
Does your VM try to pool strings? I once heard that IBM's VM did something like string interning, but dynamically (no idea if it's true). Perhaps your VM is doing extra work to build an internal data structure of interned strings.
Are you doing something like byte b[] = new byte[100]; String s = new String(b); by any chance? You might try not to allocate the String objects, and instead allocate some random object which has a reference to the byte[] (for comparison).

Finding Memory Usage in Java

The following is the scenario I need to solve. I am stuck between two solutions.
I need to maintain a cache of data fetched from a database to be shown in a Swing GUI.
Whenever JVM memory usage exceeds 70% of the allocated memory, I need to warn the user about excessive usage. Once usage exceeds 80%, I have to halt all database querying, clean up the existing cache fetched as part of the user's operations, and notify the user. During cleanup, I will manually delete some data based on some rules and request a GC from the JVM. Once a GC has run and memory usage has dropped to 60% of the allocated memory, I need to resume all database handling and give control back to the user.
For checking JVM memory statistics I found the following two solutions. I could not decide which is better and why.
Runtime.freeMemory() - A thread runs every 10 seconds and checks the free memory; if usage exceeds the limits mentioned, popups inform the user and the methods to halt operations and free memory are called.
MemoryPoolMXBean.getUsage() - Java 5 introduced JMX for getting a snapshot of memory at runtime. With JMX I cannot use the threshold notification, since it only notifies when memory reaches or exceeds the given threshold. The only way to use it is to poll the MemoryMXBean and check the memory statistics over a period.
With polling, it seems to me both implementations would be the same.
Please suggest the advantages of each method, any other alternatives, and any corrections to the methods I am using.
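For concreteness, the polling variant of the second option might look roughly like the sketch below. The class name HeapPoller, the Level enum, and the choice to fall back to the committed size when no maximum is defined are assumptions for illustration; only the 70% and 80% limits come from the requirements above.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

final class HeapPoller {
    enum Level { NORMAL, WARN, CRITICAL }

    // Maps a used/max pair to the warning levels from the requirements (70% and 80%).
    static Level classify(long usedBytes, long maxBytes) {
        double ratio = (double) usedBytes / maxBytes;
        if (ratio > 0.80) {
            return Level.CRITICAL;
        }
        if (ratio > 0.70) {
            return Level.WARN;
        }
        return Level.NORMAL;
    }

    // Takes one snapshot of the heap via JMX; meant to be called from a polling thread.
    static Level currentLevel() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        // getMax() returns -1 when the maximum is undefined; fall back to the committed size.
        long max = heap.getMax() > 0 ? heap.getMax() : heap.getCommitted();
        return classify(heap.getUsed(), max);
    }
}
```

Keep in mind that getUsed() includes garbage that has not been collected yet, so a single reading can overstate how much memory is actually retained.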
Just a side note: Runtime.freeMemory() doesn't state the amount of memory that's left for allocating; it's just the amount of memory that's free within the currently allocated memory (which is initially smaller than the maximum memory the VM is configured to use, but grows over time).
When starting a VM, the max memory (Runtime.maxMemory()) just defines the upper limit of memory that the VM may allocate (configurable using the -Xmx VM option).
The total memory (Runtime.totalMemory()) is the memory currently allocated for the VM process. It starts at the initial size (configurable using the -Xms VM option) and grows dynamically every time you allocate more than the currently free portion of it (Runtime.freeMemory()), until it reaches the max memory.
The metric you're interested in is the memory available for further allocation:
long usableFreeMemory = Runtime.getRuntime().maxMemory()
        - Runtime.getRuntime().totalMemory()
        + Runtime.getRuntime().freeMemory();
or:
double usedPercent = (double) (Runtime.getRuntime().totalMemory()
        - Runtime.getRuntime().freeMemory()) / Runtime.getRuntime().maxMemory();
The usual way to handle this sort of thing is to use WeakReferences and SoftReferences. You need to use both - the weak reference means you are not holding multiple copies of things, and the soft references mean that the GC will hang onto things until it starts running out of memory.
If you need to do additional cleanup, then you can register the references with a ReferenceQueue and poll that queue to trigger the cleanup. It's all good fun, but you do need to understand what these classes do.
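A minimal sketch of such a cache is shown below, using only SoftReference and a ReferenceQueue from java.lang.ref. The class name SoftCache and its methods are hypothetical, and cleanup here simply removes stale map entries; any additional cleanup would go where the entry is removed.

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

final class SoftCache<K, V> {
    // A soft reference that remembers its key so the map entry can be removed later.
    private static final class Entry<K, V> extends SoftReference<V> {
        final K key;

        Entry(K key, V value, ReferenceQueue<V> queue) {
            super(value, queue);
            this.key = key;
        }
    }

    private final Map<K, Entry<K, V>> map = new HashMap<>();
    private final ReferenceQueue<V> queue = new ReferenceQueue<>();

    synchronized void put(K key, V value) {
        expunge();
        map.put(key, new Entry<>(key, value, queue));
    }

    // Returns null if the key is absent or the GC has already cleared the value.
    synchronized V get(K key) {
        expunge();
        Entry<K, V> entry = map.get(key);
        return entry == null ? null : entry.get();
    }

    synchronized int size() {
        expunge();
        return map.size();
    }

    // Drops map entries whose values the GC has cleared under memory pressure.
    private void expunge() {
        Reference<? extends V> ref;
        while ((ref = queue.poll()) != null) {
            @SuppressWarnings("unchecked")
            Entry<K, V> entry = (Entry<K, V>) ref;
            map.remove(entry.key, entry); // only if the key still maps to this entry
        }
    }
}
```

With this design the GC decides when cached rows are dropped, so the application does not need its own 70%/80% thresholds for the cache itself.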
It is entirely normal for a JVM to go up to 100% memory usage and then back to, say, 10% after a GC, and to do this every few seconds.
You shouldn't need to try managing the memory in this way.
You cannot say how much memory is being retained until a full GC has been run.
I suggest you work out what you are really trying to achieve and look at the problem another way.
The requirements you mention clearly contradict how garbage collection works in a JVM.
Because of the behaviour of the JVM, it will be very hard to warn your users correctly.
Stopping all database manipulation altogether, cleaning things up, and starting again really is not the way to go.
Let the JVM do what it is supposed to do: handle all memory-related matters for you.
Modern generations of the JVM are very good at it, and with some fine-tuning of the GC parameters you will get much cleaner memory handling than by forcing things yourself.
Articles like http://www.kodewerk.com/advice_on_jvm_heap_tuning_dont_touch_that_dial.htm mention the pros and cons and offer a nice explanation of what the VM does for you
I've only used the first method for similar task and it was OK.
One thing you should note, for both methods, is to implement some kind of debouncing - i.e. once you recognize you've hit 70% of memory, wait for a minute (or any other time you find appropriate) - GC can run at that time and clean up lots of memory.
If you implement a Runtime.freeMemory() graph in your system you'll see how the memory is constantly going up and down, up and down.
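The debouncing suggested above can be sketched as a small state holder that only reports a breach once usage has stayed above the threshold for a minimum time. The class name DebouncedThreshold and its parameters are hypothetical; timestamps are passed in explicitly so that the logic works with either polling method.

```java
// Reports a threshold breach only after usage has stayed above it for holdMillis.
final class DebouncedThreshold {
    private final double threshold;
    private final long holdMillis;
    private long firstExceededAt = -1; // -1 means "currently below the threshold"

    DebouncedThreshold(double threshold, long holdMillis) {
        this.threshold = threshold;
        this.holdMillis = holdMillis;
    }

    // usedRatio is used/max in the range 0..1; nowMillis is the time of the sample.
    boolean update(double usedRatio, long nowMillis) {
        if (usedRatio < threshold) {
            firstExceededAt = -1; // a GC freed memory in the meantime, start over
            return false;
        }
        if (firstExceededAt < 0) {
            firstExceededAt = nowMillis;
        }
        return nowMillis - firstExceededAt >= holdMillis;
    }
}
```

A polling thread would call update(usedRatio, System.currentTimeMillis()) on each tick and warn the user only when it returns true.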
VisualVM is a bit nicer than JConsole because it gives you a nice visual Garbage Collector view.
Look into JConsole. It graphs the information you need, so it is a matter of adapting this to your needs (given that you run on a Sun Java 6).
This also allows you to detach the monitoring process from what you want to look at.
Very late after the original post, I know, but I thought I'd post an example of how I've done it. Hopefully it'll be of some use to someone (I stress, it's a proof-of-principle example, nothing else... not particularly elegant either :) )
Just stick these two functions in a class, and it should work.
EDIT: Oh, and the imports:

import java.util.ArrayList;
import java.util.List;

// Available heap in MB: the not-yet-committed part of the max heap plus the free part of the committed heap.
public static int MEM() {
    Runtime rt = Runtime.getRuntime();
    // Divide before narrowing to int; casting first overflows for heaps above 2 GB.
    return (int) ((rt.maxMemory() - rt.totalMemory() + rt.freeMemory()) / 1024 / 1024);
}

public static void main(String[] args) throws InterruptedException {
    List<Double> list = new ArrayList<>();
    // get available memory before filling list
    int initMem = MEM();
    int lowMemWarning = (int) (initMem * 0.2);
    int highMem = (int) (initMem * 0.8);
    int iteration = 0;
    while (true) {
        // use up some memory
        list.add(Math.random());
        // report
        if (++iteration % 10000 == 0) {
            System.out.printf("Available Memory: %dMb \tListSize: %d\n", MEM(), list.size());
            // if low on memory, clear list and await garbage collection before continuing
            if (MEM() < lowMemWarning) {
                System.out.printf("Warning! Low memory (%dMb remaining). Clearing list and cleaning up.\n", MEM());
                // clear list
                list = new ArrayList<>(); // obviously, here is a good place to put your warning logic
                // ensure garbage collection occurs before continuing to re-add to list,
                // to avoid immediately entering this block again
                while (MEM() < highMem) {
                    System.out.printf("Awaiting gc...(%dMb remaining)\n", MEM());
                    // give it a nudge
                    Runtime.getRuntime().gc();
                    Thread.sleep(250);
                }
                System.out.printf("gc successful! Continuing to fill list (%dMb remaining). List size: %d\n", MEM(), list.size());
                Thread.sleep(3000); // just to view output
            }
        }
    }
}
EDIT: This approach still relies on sensible setting of memory in the jvm using -Xmx, however.
EDIT2: It seems that the gc request line really does help things along, at least on my jvm. ymmv.
