We are designing a new software system architecture, and I am working as the project manager.
However, there is a point of disagreement within our team.
Our architect says: "System memory should be kept as small as possible, because a Full GC takes a long time when it occurs (on the JVM)."
I am not sure that opinion is right.
When setting up system memory, what level of Full GC (Garbage Collection) time should be reviewed?
How long will it take if a Full GC occurs in a 16GB memory environment?
You might be worrying (or your architect might be) about something that is not a problem for your throughput to begin with. Until Java 9, the default collector was ParallelGC, and there are dozens and dozens of applications that have never changed it and are happy with the pause times (and that collector stops the world every time it runs). So the only real answer is: measure. Enable GC logs and look into them.
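For example (assuming a HotSpot-based JVM; the log file name is just a placeholder), GC logging can be enabled on Java 8 and earlier with:

-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:gc.log

and on Java 9+ with unified logging:

-Xlog:gc*:file=gc.log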
On the other hand, if you choose a concurrent collector (you should start with G1), having enough breathing room for it in the heap is crucial. It matters even more for Shenandoah and ZGC, since they do almost everything concurrently. Every time the GC runs a concurrent phase, it works via so-called "barriers", which are basically interceptors for the objects in the heap, and the structures those barriers use require storage. If you squeeze that storage, the GC is not going to be happy.
In rather simple terms: the more "free" space in the heap, the better your GC will perform.
When setting up system memory, what level of Full GC (Garbage Collection) time should be reviewed?
This is not quite the right question. If you are happy with your target response times, it does not matter. If you are not, you start analysing the GC logs to understand what is causing your delays (this is not going to be trivial, though).
How long will it take if a Full GC occurs in a 16GB memory environment?
It depends on the collector, on the Java version, and on the kinds of objects to be collected; there is no easy answer. For Shenandoah and ZGC the question is largely irrelevant, since their pause times do not depend on the size of the heap to be scanned. For G1 it is most probably going to be in the range of a few seconds. But if you have WeakReferences and finalizers, and a Java version that is known to handle them poorly, collection times can get long.
How long will it take if a Full GC occurs in a 16GB memory environment?
On a small heap like that, the ballpark figure is around 10 seconds, I would guess.
But that is not what you should be focusing on.
When setting up system memory, what level of Full GC (Garbage Collection) time should be reviewed?
All of them. Whenever a Full GC occurs, it should be reviewed if your application is latency-critical. That is what you should consider.
A Full GC is a failure.
A failure on multiple levels:
a failure to address the memory size available to the application,
a failure to address the GC type you use,
a failure to address the types of workloads,
a failure to address graceful degradation under load,
and the list goes on.
A concurrent GC implicitly relies on the simple fact that it can collect faster than the application allocates.
When allocation pressure becomes overwhelming, the GC has two options: slow down allocations or stop them altogether.
And when it stops them, all hell breaks loose: futures time out, clusters break apart, and engineers fear and loathe large heaps for the rest of their lives...
It's a common scenario for applications that have evolved for years with increasing complexity and load, without an overhaul to accommodate a changing world.
It doesn't have to be this way though.
When you build a new application from the ground up, you can design it with performance and latency in mind, with scalability and graceful degradation, instead of heap size and GC times.
You can split off workloads that are not latency-critical but memory-heavy into a different VM and run them under good ol' ParallelGC, and it will outperform any concurrent GC in both throughput and memory overhead.
You can run latency-critical tasks under a modern state-of-the-art GC like Shenandoah and get sub-second collection pauses on heaps of several TB, if you don't mind a roughly 30% memory overhead and a considerable amount of CPU overhead.
Let the application and its requirements dictate your heap size, not the engineers.
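As a rough sketch of that split (the flags are real, but the heap sizes and jar names are placeholders, and Shenandoah availability depends on your JDK build):

# memory-heavy, latency-tolerant workload on the throughput collector
java -Xms8g -Xmx8g -XX:+UseParallelGC -jar batch-worker.jar
# latency-critical workload on a concurrent collector
java -Xms4g -Xmx4g -XX:+UseShenandoahGC -jar latency-service.jar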
We have Linux servers each running about 200 Java microservices, using cgroups for CPU and memory isolation. One thing we have discovered is that Java tends to need a full CPU core to perform a (major) garbage collection efficiently.
However, with 200 apps running and only 24 CPUs, if they all decided to GC at the same time they would be throttled by the cgroups. Since typical application CPU usage is relatively small (say about 15% of one CPU at peak), it would be nice to find a way to ensure they don't all GC at the same time.
I'm looking into how we can schedule GCs so that the microservices don't all collect at the same time, so that we can still run over 200 apps per host, but I was wondering whether anybody has suggestions or experience on this topic before I try to reinvent the wheel.
I found that there are command-line options we can use, as well as MBeans that can trigger a GC directly, but I have read that this is not advised, as it interferes with the non-deterministic process Java uses for GC.
Something I'm thinking about is using performance metrics (CPU, memory, traffic) to try to predict a GC; then, if several services are about to GC, we could perhaps trigger them one at a time. However, this might be impractical, or simply a bad idea.
We are running Java 7 and 8.
You can't schedule a GC, since it depends on the allocation rate - i.e. on the load and the application logic. Some garbage collectors try to control the major GC duration, or the total time consumed by GC, but not the rate. There is also no guarantee that an externally triggered GC (e.g. via an MBean) will actually run, and if it does run, it may run later than it was triggered.
As others have pointed out, it is a very rare possibility (you can estimate it by gathering the average period, in seconds, between major GCs from each of your apps and building a histogram) to hit this under "normal" load. Under "heavy" load you will likely face CPU shortage much earlier than a simultaneous GC caused by an increased allocation rate, because you would need a "lot" (depending on the size of your objects) of long-lived objects polluting the old generation before a major GC is triggered.
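A minimal sketch of how such per-JVM figures could be gathered with the standard GarbageCollectorMXBean API; treating the old-generation bean (e.g. "PS MarkSweep" under ParallelGC) as the major-GC counter is an assumption that depends on your collector:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class MajorGcPeriod {
    public static void main(String[] args) {
        long uptimeSec = ManagementFactory.getRuntimeMXBean().getUptime() / 1000;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();   // collections since JVM start
            long timeMs = gc.getCollectionTime();   // total time spent in them
            double avgPeriodSec = count > 0 ? (double) uptimeSec / count : Double.NaN;
            // The old-generation bean (e.g. "PS MarkSweep") approximates major GCs.
            System.out.printf("%s: count=%d totalTime=%dms avgPeriod=%.1fs%n",
                    gc.getName(), count, timeMs, avgPeriodSec);
        }
    }
}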
We monitor our production JVMs and have monitoring triggers that (ideally) send warnings when a JVM runs low on heap space. However, coming up with an effective detection algorithm is quite difficult, since it is in the nature of garbage collection that the application regularly has no available memory just before the GC kicks in.
There are many ways I can think of to work around this. E.g. monitor the available space, send a warning when it becomes too low, but delay it and only trigger when the condition persists for more than a minute. So, what works for you in practice?
Of particular interest:
How to detect a critical memory/heap problem that needs an immediate reaction?
How to detect a heap problem that calls for precautionary action?
What approaches work universally? E.g. without the need to adapt the triggers to certain JVM tuning parameters (or vice versa), or to force a GC at certain time intervals.
Is there any best practice that is used widely?
I have found the percentage of time the JVM spends in garbage collection to be a very effective measure of JVM memory health. A healthy, well-tuned JVM will use very little (< 1% or so) of its CPU time collecting garbage. An unhealthy JVM will "waste" much of its time keeping the heap clean, and the percentage of CPU used on collection will climb steeply in a JVM experiencing a memory leak or running with a max heap setting that is too low (as more CPU is used keeping the heap clean, less is used doing "real work"; assuming the inbound request rate doesn't slow down, it is easy to fall off a cliff where you become CPU-bound and can't get enough work done quickly enough, long before you actually get a java.lang.OutOfMemoryError).
It's worth noting that this is really the condition you want to guard against, too. You don't actually care if the JVM uses all of its heap, so long as it can efficiently reclaim memory without getting in the way of the "real work" it needs to do. (In fact, if you're never hitting the max heap size, you may want to consider shrinking your heap.)
This statistic is provided by many modern JVMs (certainly Oracle's and IBM's, at least).
Another somewhat effective measure can be the time between Full GCs: the more often you have to perform a Full GC, the more time you are spending in GC.
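As a hedged illustration of the "percentage of time in GC" check, here is a minimal sampling sketch built on the standard GarbageCollectorMXBean API; the 10% threshold and the once-a-minute sampling cadence are arbitrary examples, not recommendations:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcOverheadMonitor {
    private long lastGcMs = totalGcTimeMs();
    private long lastWallMs = System.currentTimeMillis();

    private static long totalGcTimeMs() {
        long sum = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime();
            if (t > 0) sum += t;   // -1 means the collector does not report time
        }
        return sum;
    }

    // Call periodically, e.g. once a minute, from your monitoring scheduler.
    public void sample() {
        long nowGcMs = totalGcTimeMs();
        long nowWallMs = System.currentTimeMillis();
        double fraction = (double) (nowGcMs - lastGcMs) / Math.max(1, nowWallMs - lastWallMs);
        lastGcMs = nowGcMs;
        lastWallMs = nowWallMs;
        if (fraction > 0.10) {
            System.err.println("WARNING: JVM spent " + Math.round(fraction * 100)
                    + "% of the last interval in GC");
        }
    }
}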
I want to implement a heartbeat mechanism in which a child process periodically pings the parent to tell it that it is still alive. I implement the ping task with a thread pool's scheduleWithFixedDelay method, like this:
pingService.scheduleWithFixedDelay(new PingGroomServer(umbilical, taskId),
        0, pingPeriod, TimeUnit.MILLISECONDS);
Here, umbilical is the RPC client used to talk to the parent process.
Will scheduleWithFixedDelay actually fire with the fixed delay? Will the ping thread be stopped during a stop-the-world GC?
In practice, I still miss heartbeats even after waiting for 6 * pingPeriod ms.
So there are several options for a more deterministic runtime (GC pauses, thread scheduling, etc.) in your application.
Go predictable, go real-time. One way is to head for a real-time Java implementation based on RTSJ, or another real-time Java implementation, such as:
http://fiji-systems.com/
http://www.aicas.com/
http://www-03.ibm.com/linux/realtime.html
Go low latency, go without stop-the-world GC. Another way is to use a Java implementation with no stop-the-world garbage collection pauses, like the Zing JVM from Azul Systems, which uses a low-latency concurrent collector.
http://www.azulsystems.com/solutions/low-latency/overview
Since the above choices may be a rather big step for a small application, there are things you can do with a "classic" Java implementation.
So, if you want short GC pauses with an Oracle/OpenJDK JVM, there are some rules of thumb (a sample command line follows the list):
Hardware - this is the most important. You need an SMP system with a fast multi-threaded CPU and fast memory modules, and you must avoid swapping. The faster the hardware, the faster the garbage collector will perform.
Use a multithreaded garbage collector. If you are using the Oracle JVM, go for G1 or ParallelGC (aka the throughput collector).
Use a small heap: the larger the heap, the more space the garbage collector has to handle.
Tune memory (heap) ergonomics. Fine-tune your memory ergonomics so that objects are collected in the young generation (and not promoted to the old generation), to avoid Full garbage collections.
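As a hedged example of the last two rules, a G1 command line on an Oracle/OpenJDK 8 JVM might look like this; the heap size and pause goal are placeholders, not recommendations:

java -Xms2g -Xmx2g -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -verbose:gc -XX:+PrintGCDetails -jar app.jar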
Another answer to a similar question:
Why does a full GC on a small heap take 5 seconds?
For more information about GC ergonomics:
http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html
Yes, stop-the-world is exactly what it says it is - every thread that accesses objects on the heap that is not specifically involved with the garbage collector is stopped.
If you want a heartbeat in Java, it is best to farm it out to a separate VM running just that code, so that the VM pauses are not so long. Or, better still, do not rely on millisecond timing - you can't assume that level of scheduling precision among processes on a desktop OS.
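A minimal sketch of that idea: instead of trusting that a fixed-delay task fires exactly on time, measure the gap between runs and treat a large gap as evidence of a stop-the-world pause or scheduling starvation (the period and the tolerance factor of 3 are arbitrary):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class PauseDetector {
    public static void main(String[] args) {
        final long periodMs = 100;
        final AtomicLong last = new AtomicLong(System.nanoTime());
        ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor();
        ses.scheduleWithFixedDelay(new Runnable() {
            public void run() {
                long now = System.nanoTime();
                long gapMs = TimeUnit.NANOSECONDS.toMillis(now - last.getAndSet(now));
                if (gapMs > periodMs * 3) {
                    System.err.println("Scheduling gap of " + gapMs
                            + " ms - possible stop-the-world pause");
                }
            }
        }, periodMs, periodMs, TimeUnit.MILLISECONDS);
    }
}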
I create a fixed thread pool using forPool = Executors.newFixedThreadPool(poolSize); where poolSize is initialized to the number of cores on the processor (let's say 4). In some runs it works fine, and CPU utilisation is consistently at 400%.
But sometimes, the usage drops to 100%, and never rises back to 400%. I have 1000s of tasks scheduled, so the problem is not that. I catch every exception, but no exception is thrown. So the issue is random and not reproducible, but very much present. They are data parallel operations. At the end of each thread, there is a synchronised access to update a single variable. Highly unlikely I have a deadlock there. In fact, once I spot this issue, if I destroy the pool, and create a fresh one of size 4, it is still only 100% usage. There is no I/O.
It seems counter intuitive to java's assurance of a "FixedThreadPool". Am I reading the guarantee wrong? Is only concurrency guaranteed and not parallelism?
And to the question - Have you come across this issue and solved it? If I want parallelism, am I doing the correct thing?
Thanks!
On doing a thread dump:
I find that there are 4 threads, all doing their parallel operations, but the usage is still only ~100%. Here are the thread dumps at 400% usage and at 100% usage. I set the number of threads to 16 to trigger the scenario: it runs at 400% for a while and then drops to 100%. When I use 4 threads, it runs at 400% and only rarely drops to 100%. This is the parallelization code.
****** [MAJOR UPDATE] ******
It turns out that if I give the JVM a huge amount of memory to play with, this issue is solved and the performance does not drop. But I don't know how to use this information to solve this issue. Help!
Given the fact that increasing your heap size makes the problem 'go away' (perhaps not permanently), the issue is probably related to GC.
Is it possible that the Operation implementation is generating some state, stored on the heap, between calls to
pOperation.perform(...);
? If so, then you might have a memory usage problem, perhaps a leak. As more tasks complete, more data stays on the heap. The garbage collector has to work harder and harder to reclaim as much as it can, gradually taking up 75% of your total available CPU resources. Even destroying the ThreadPool won't help, because that's not where the references are held - they're in the Operation.
The 16-thread case hitting this problem more often could be because it generates state faster (I don't know the Operation implementation, so it is hard for me to say).
And increasing the heap size while keeping the problem set the same makes the problem appear to go away, because you have more room for all of this state.
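A purely hypothetical sketch of the kind of hidden state being guessed at here (the class and field names are invented; the question's real Operation may look nothing like this):

import java.util.ArrayList;
import java.util.List;

// Hypothetical: an operation that quietly accumulates per-call state on the heap.
class LeakyOperation {
    private final List<double[]> history = new ArrayList<double[]>();  // never cleared

    public double perform(double[] input) {
        history.add(input.clone());   // grows with every task -> rising GC pressure
        double sum = 0;
        for (double v : input) sum += v;
        return sum;
    }
}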
I suggest you use the YourKit thread analysis feature to understand the real behaviour. It will tell you exactly which threads are running, blocked or waiting, and why.
If you can't or don't want to purchase it, the next best option is VisualVM, which is bundled with the JDK, for this analysis. It won't give you information as detailed as YourKit's. The following blog post can get you started with VisualVM:
http://marxsoftware.blogspot.in/2009/06/thread-analysis-with-visualvm.html
My answer is based on a mixture of knowledge about JVM memory management and some guesses about details for which I could not find precise information. I believe that your problem is related to the thread-local allocation buffers (TLABs) Java uses:
A Thread Local Allocation Buffer (TLAB) is a region of Eden that is used for allocation by a single thread. It enables a thread to do object allocation using thread local top and limit pointers, which is faster than doing an atomic operation on a top pointer that is shared across threads.
Let's say you have an Eden size of 2M and use 4 threads: the JVM may choose a TLAB size of eden/64 = 32K, and each thread gets a TLAB of that size. Once a thread's 32K TLAB is exhausted, it needs to acquire a new one, which requires global synchronization. Global synchronization is also needed for the allocation of objects which are larger than the TLAB.
But, to be honest with you, things are not as easy as I described: The JVM adaptively sizes a thread's TLAB based on its estimated allocation rate determined at minor GCs [1] which makes TLAB-related behavior even less predictable. However, I can imagine that the JVM scales the TLAB sizes down when more threads are working. This seems to make sense, because the sum of all TLABs must be less than the available eden space (and even some fraction of the eden space in practice to be able to refill the TLABs).
Let us assume a fixed TLAB size per thread of (eden size / (16 * user threads working)):
for 4 threads this results in TLABs of 32K
for 16 threads this results in TLABs of 8K
You can imagine that 16 threads, which exhaust their TLABs faster because the TLABs are smaller, will cause many more locks in the TLAB allocator than 4 threads with 32K TLABs.
To conclude, when you decrease the number of working threads or increase the memory available to the JVM, the threads can be given larger TLABs and the problem is solved.
https://blogs.oracle.com/daviddetlefs/entry/tlab_sizing_an_annoying_little
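If you want to test this theory, HotSpot exposes a few TLAB-related switches; availability can vary by version and build, so treat these as a starting point rather than a recipe:

-XX:+PrintTLAB                      (JDK 7/8: print per-thread TLAB statistics at each GC)
-XX:TLABSize=32k -XX:-ResizeTLAB    (pin the TLAB size instead of letting it adapt)
-Xlog:gc+tlab=trace                 (JDK 9+ unified-logging equivalent)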
This is almost certainly due to GC.
If you want to be sure add the following startup flags to your Java program:
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
and check stdout.
You will see lines containing "Full GC" including the time this took: during this time you will see 100% CPU usage.
The default garbage collector on multi-CPU or multi-core machines is the throughput collector, which collects the young generation in parallel but uses serial collection (a single thread) for the old generation.
So what is probably happening in your 100% CPU case is that a collection of the old generation is going on; it is done in one thread and therefore keeps only one core busy.
Suggestion for solution: use the concurrent mark-and-sweep collector, by using the flag -XX:+UseConcMarkSweepGC at JVM startup.
Tune the JVM
The core of the Java platform is the Java Virtual Machine (JVM). The entire Java application server runs inside a JVM. The JVM takes many startup parameters as command line flags, and some of them have great implications on the application performance. So, let's examine some of the important JVM parameters for server applications.
First, you should allocate as much memory as possible to the JVM using the -Xms (minimum memory) and -Xmx (maximum memory) flags. For instance, the -Xms1g -Xmx1g flags allocate 1GB of RAM to the JVM. If you don't specify a memory size in the JVM startup flags, the JVM limits the heap memory to 64MB (512MB on Linux), no matter how much physical memory you have on the server! More memory allows the application to handle more concurrent web sessions, and to cache more data to improve slow I/O and database operations. We typically specify the same amount of memory for both flags to force the server to use all the allocated memory from startup. This way, the JVM doesn't need to dynamically resize the heap at runtime, which is a leading cause of JVM instability. For 64-bit servers, make sure that you run a 64-bit JVM on top of a 64-bit operating system to take advantage of all the RAM on the server. Otherwise, the JVM will only be able to utilize 2GB or less of memory space. 64-bit JVMs are typically only available for JDK 5.0 and later.
With a large heap memory, the garbage collection (GC) operation could become a major performance bottleneck. It could take more than ten seconds for the GC to sweep through a multiple gigabyte heap. In JDK 1.3 and earlier, GC is a single threaded operation, which stops all other tasks in the JVM. That not only causes long and unpredictable pauses in the application, but it also results in very poor performance on multi-CPU computers since all other CPUs must wait in idle while one CPU is running at 100% to free up the heap memory space. It is crucial that we select a JDK 1.4+ JVM that supports parallel and concurrent GC operations. Actually, the concurrent GC implementation in the JDK 1.4 series of JVMs is not very stable. So, we strongly recommend you upgrade to JDK 5.0. Using the command line flags, you can choose from the following two GC algorithms. Both of them are optimized for multi-CPU computers.
If your priority is to increase the total throughput of the application and you can tolerate occasional GC pauses, you should use the -XX:+UseParallelGC and -XX:+UseParallelOldGC (the latter is only available in JDK 5.0) flags to turn on parallel GC. The parallel GC uses all available CPUs to perform the GC operation, and hence it is much faster than the default single-threaded GC. It still pauses all other activities in the JVM during GC, however.
If you need to minimize the GC pauses, you can use the -XX:+UseConcMarkSweepGC flag to turn on the concurrent GC. The concurrent GC still pauses the JVM and uses parallel GC to clean up short-lived objects. However, it cleans up long-lived objects from the heap using a background thread running in parallel with other JVM threads. The concurrent GC drastically reduces the GC pause, but managing the background thread does add to the overhead of the system and reduces the total throughput.
Furthermore, there are a few more JVM parameters you can tune to optimize the GC operations.
On 64-bit systems, the call stack for each thread is allocated 1MB of memory space by default. Most threads do not use that much space. Using the -Xss256k flag (equivalently -XX:ThreadStackSize=256, whose value is in KB), you can decrease the stack size to 256KB to allow more threads.
Use the -XX:+DisableExplicitGC flag to ignore explicit application calls to System.gc(). If the application calls this method frequently, then we could be doing a lot of unnecessary GCs.
The -Xmn flag lets you manually set the size of the "young generation" memory space for short-lived objects. If your application generates lots of new objects, you might improve GCs dramatically by increasing this value. The "young generation" size should almost never be more than 50% of the heap.
Since the GC has a big impact on performance, the JVM provides several flags to help you fine-tune the GC algorithm for your specific server and application. It's beyond the scope of this article to discuss GC algorithms and tuning tips in detail, but we'd like to point out that the JDK 5.0 JVM comes with an adaptive GC-tuning feature called ergonomics. It can automatically optimize GC algorithm parameters based on the underlying hardware, the application itself, and desired goals specified by the user (e.g., the max pause time and desired throughput). That saves you time trying different GC parameter combinations yourself. Ergonomics is yet another compelling reason to upgrade to JDK 5.0. Interested readers can refer to Tuning Garbage Collection with the 5.0 Java Virtual Machine. If the GC algorithm is misconfigured, it is relatively easy to spot the problems during the testing phase of your application. In a later section, we will discuss several ways to diagnose GC problems in the JVM.
Finally, make sure that you start the JVM with the -server flag. It optimizes the Just-In-Time (JIT) compiler to trade slower startup time for faster runtime performance. There are more JVM flags we have not discussed; for details on these, please check out the JVM options documentation page.
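Putting the (now dated) advice above together, a startup line might look something like the following; the sizes are placeholders that should come from your own measurements, not recommendations:

java -server -Xms1g -Xmx1g -Xmn256m -Xss256k -XX:+UseConcMarkSweepGC -XX:+DisableExplicitGC -jar myapp.jar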
Reference:
http://onjava.com/onjava/2006/11/01/scaling-enterprise-java-on-64-bit-multi-core.html
A total CPU utilisation of 100% implies that what you have written is effectively single-threaded - i.e. you may have any number of concurrent tasks, but due to locking, only one can execute at a time.
If you have a lot of I/O, you can get less than 400%, but it is unlikely you will get a round number for CPU utilisation, e.g. you might see 38%, 259%, 72%, 9%, etc. (and it is likely to jump around).
A common problem is locking the data you are using too often. You need to consider how the code could be rewritten so that locking is held for the briefest period and over the smallest portion of the overall work. Ideally, you want to avoid locking altogether.
Using multiple threads means you can use up to that many CPUs, but if your code prevents that, you are likely to be better off (i.e. faster) writing the code single-threaded, as that avoids the overhead of locking.
Since you are using locking, it is possible that one of your four threads acquires the lock but is then context-switched out - perhaps to run the GC thread. The other threads can't make progress, since they can't acquire the lock. When the thread is switched back in, it completes the work in the critical section and releases the lock, allowing only one other thread to acquire it. So now you have two threads active. It is possible that while the second thread executes the critical section, the first thread does the next piece of data-parallel work but generates enough garbage to trigger the GC, and we're back where we started :)
P.S. This is just a best guess, since it is hard to figure out what is happening without any code snippets.
Increasing the size of the Java heap usually improves throughput until the heap no longer resides in physical memory. When the heap size exceeds the physical memory, the heap begins swapping to disk which causes Java performance to drastically decrease. Therefore, it is important to set the maximum heap size to a value that allows the heap to be contained within physical memory.
Since you give the JVM ~90% of the physical memory on the machines, the problem may be related to I/O caused by memory paging and swapping when you try to allocate memory for more objects. Note that the physical memory is also used by other running processes as well as the OS. Also, since the symptoms occur after a while, this is also an indication of a memory leak.
Try to find out how much physical memory is available (not already used) and allocate ~90% of the available physical memory to your JVM heap.
What happens if you leave the system running for an extended period of time? Does it ever come back to 400% CPU utilization?
Do you notice any disk activity when the CPU is at 100% utilization?
Can you monitor which threads are running and which are blocked, and when?
Take a look at the following link for tuning:
http://java.sun.com/performance/reference/whitepapers/tuning.html#section4