Make ZGC run more often - java

ZGC does not run often enough. GC logs show that it runs once every 2-3 minutes for my application, and because of this my memory usage climbs high between GC cycles (as high as 90%). After a GC it drops to as low as 20%.
How can I increase the GC frequency so that it runs more often?

-XX:ZCollectionInterval=N - set maximum gap between collections to N seconds.
-XX:ZUncommitDelay=M - set the delay until unused memory is returned to the OS to M seconds.
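For illustration, the two flags might be combined on the command line like this (the values of 5 and 30 seconds are arbitrary examples, not recommendations, and app.jar is a placeholder):

```shell
# Ask ZGC to start a cycle at least every 5 seconds, and return
# memory that has been unused for 30 seconds to the OS.
java -XX:+UseZGC \
     -XX:ZCollectionInterval=5 \
     -XX:ZUncommitDelay=30 \
     -jar app.jar
```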

Before tuning the GC, I would recommend investigating why this is happening. There might be an issue/bug in your application.
[Some notes about GC]
-XX:ZUncommitDelay=M (Check if it is supported by your linux kernel)
-XX:+ZProactive: Enables proactive GC cycles when using ZGC. By default, this option is enabled. ZGC will start a proactive GC cycle if doing so is expected to have minimal impact on the running application. This is useful if the application is mostly idle or allocates very few objects, but you still want to keep the heap size down and allow reference processing to happen even when there is a lot of free space on the heap.
More details about ZGC configuration options can be found on the ZGC Home Page and in the Oracle documentation.

Presently (as of JDK 17), ZGC's primary strategy is to wait until the last possible moment before the heap fills up and then do a collection. Its goals are:
Avoid unnecessary CPU load by running GC only when it's necessary.
Start the GC early enough so that it will finish before the heap actually fills up (since the heap filling up would be bad, leading to a temporary application stall).
It does this by measuring how fast your app is allocating memory, how long the GC takes to run, and predicting at what point it should start the GC. You can find the exact algorithm in the source code.
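As a rough illustration of that idea (this is a simplified sketch of the general technique, not ZGC's actual algorithm; all names and the safety margin are my own):

```java
// Simplified sketch of an allocation-rate-based GC trigger.
// Compare predicted time-until-heap-full against predicted GC duration.
public class GcTriggerHeuristic {
    /**
     * @param freeBytes    currently free heap
     * @param allocRateBps observed allocation rate (bytes/second)
     * @param avgGcSeconds observed average GC cycle duration
     * @param safetyMargin headroom factor, e.g. 1.2 = start 20% early
     * @return true if a GC cycle should start now
     */
    static boolean shouldStartGc(long freeBytes, double allocRateBps,
                                 double avgGcSeconds, double safetyMargin) {
        if (allocRateBps <= 0) return false;           // idle app: no rush
        double secondsUntilFull = freeBytes / allocRateBps;
        // Start GC when it would (just barely) finish before the heap fills.
        return secondsUntilFull <= avgGcSeconds * safetyMargin;
    }

    public static void main(String[] args) {
        // 1 GB free, allocating 100 MB/s -> ~10s until full; GC takes 2s.
        System.out.println(shouldStartGc(1_000_000_000L, 100_000_000.0, 2.0, 1.2)); // false
        // Only 200 MB free -> ~2s until full; must start now.
        System.out.println(shouldStartGc(200_000_000L, 100_000_000.0, 2.0, 1.2));   // true
    }
}
```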
ZGC also exposes some knobs for running GC more often (i.e., proactively), but honestly I don't find them terribly effective. You can find more info in my other answer. G1 does a better job of being proactive, but whether that's good or not depends on your use-case. (It sounds like you care more about throughput than memory usage, so I think you should prefer ZGC's behavior.)
However, if you find that ZGC is making mistakes in predicting when the heap will fill up and that your application really is hitting stalls, please share that info here or on the ZGC mailing list.

Related

Design issue - Physical memory size and Full Garbage Collection

We are designing a new software system architecture, and I am working as the project manager.
However, there is some disagreement on this issue within our team.
Our architect says: "System memory should be kept as small as possible, because a Full GC takes a long time when it occurs (JVM)."
I am not sure about that opinion.
When setting up system memory, what level of Full GC(Garbage Collection) time should be reviewed?
How long will it take if Full GC occurs in a 16GB memory environment?
You (or rather your "architects") might be worrying about something that may not be a problem for your throughput to begin with. Until Java 9, the default collector was ParallelGC, and there are dozens and dozens of applications that have never changed it and are happy with the pause times (and that collector pauses the world every time). So the only real answer is: measure. Enable GC logs and look into them.
On the other hand, if you choose a concurrent collector (you should start with G1), having enough breathing room for it in the heap is crucial. It is a lot more important for Shenandoah and ZGC, since they do everything concurrently. Every time the GC initiates a concurrent phase, it works via so-called "barriers", which are basically interceptors for the objects in the heap. The structures used by these barriers require storage. If you squeeze that storage, the GC is not going to be happy.
In rather simple words - the more "free" space in the heap, the better your GC will perform.
When setting up system memory, what level of Full GC(Garbage Collection) time should be reviewed?
This is not the correct question. If you are happy with your target response times, this does not matter. If you are not - you start analyzing gc logs and understand what is causing your delays (this is not going to be trivial, though).
How long will it take if Full GC occurs in a 16GB memory environment?
It depends on the collector, the Java version, and the type of objects to be collected; there is no easy answer. For Shenandoah and ZGC this is largely irrelevant, since their pause times do not depend on the size of the heap to be scanned. For G1 it is most probably going to be in the range of a "few seconds". But if you have WeakReferences and finalizers, and a Java version that is known to handle these poorly, collection times are going to be long.
How long will it take if Full GC occurs in a 16GB memory environment?
On a small heap like that, the ballpark figure is around 10 seconds, I guess.
But that's not what you should consider.
When setting up system memory, what level of Full GC(Garbage Collection) time should be reviewed?
Every time. Whenever a full GC occurs, it should be reviewed if your application is latency-critical. This is what you should consider.
Full GC is a failure.
A failure on multiple levels:
failure to address the memory size available to the application
failure to address the GC type you use
failure to address the types of workloads
failure to address graceful degradation under load
and the list goes on
A concurrent GC implicitly relies on the simple fact that it can collect faster than the application allocates.
When allocation pressure becomes overwhelming, the GC has two options: slow down allocations or stop them altogether.
And when it stops them, all hell breaks loose: futures time out, clusters break apart, and engineers fear and loathe large heaps for the rest of their lives...
It's a common scenario for applications that evolve for years with increasing complexity and load, but lack an overhaul to accommodate a changing world.
It doesn't have to be this way though.
When you build a new application from the ground up, you can design it with performance and latency in mind, with scalability and graceful degradation instead of heap size and GC times.
You can split off workloads that are not latency-critical but memory-heavy to a different VM and run it under good ol' ParallelGC, and it will outperform any concurrent GC in both throughput and memory overhead.
You can run latency-critical tasks under a modern state-of-the-art GC like Shenandoah and get sub-second collection pauses on heaps of several TB, if you don't mind some ~30% memory overhead and a considerable amount of CPU overhead.
Let the application and its requirements dictate your heap size, not the engineers.

Confusing Zookeeper Memory usage

I have an instance of zookeeper that has been running for some time... (Java 1.7.0_131, ZK 3.5.1-1), with -Xmx10G -XX:+UseParallelGC.
Recently there was a leadership change, and the memory usage on most instances in the quorum went from ~200MB to 2GB+. I took a jmap dump, and what I found interesting was that there was a lot of byte[] serialization data (>1GB) that had no GC root but hadn't been collected.
(The data was in ByteArrayOutputStream, DataOutputStream, org.apache.jute.BinaryOutputArchive, and HeapByteBuffer objects.)
Looking at the GC log, shortly before the leadership change the full GC was running every 4-5 minutes. After the election, the tenuring threshold increased from 1 to 15 (the max) and the full GC ran less and less often; eventually it didn't even run on some days.
After several days, suddenly, and mysteriously to me, something changed, and the memory plummeted back to ~200MB with full GC running every 4-5 minutes again.
What I'm confused about here, is how so much memory can have no GC Root, and not get collected by a full GC. I even tried triggering a GC.run from jcmd a few times.
I wondered if something in ZK native land was holding onto this memory, or leaking this memory... which could explain it.
I'm looking for any debugging suggestions. I'm planning on upgrading to Java 1.8, and maybe ZK 3.5.4, but would really like to root-cause this before moving on.
So far I've used visualvm, GCviewer and Eclipse MAT.
(Solid vertical black lines are full GC. Yellow is young generation).
I am not an expert on ZK. However, I have been tuning JVMs on WebLogic for a while, and based on this information I feel that your configuration is causing the heap to expand and shrink (-Xmx10G -XX:+UseParallelGC). You should therefore try using -Xms10G together with -Xmx10G to avoid this resizing. Importantly, each time the JVM heap is resized a full GC is executed, so avoiding resizing is a good way to minimize the number of full garbage collections.
Please read this
"When a Hotspot JVM starts, the heap, the young generation and the perm generation space are allocated to their initial sizes, determined by the -Xms, -XX:NewSize, and -XX:PermSize parameters respectively, and increment as needed to the maximum reserved sizes, which are -Xmx, -XX:MaxNewSize, and -XX:MaxPermSize. The JVM may also shrink the real size at runtime if the memory is not needed as much as originally specified. However, each resizing activity triggers a Full Garbage Collection (GC), and therefore impacts performance. As a best practice, we recommend that you make the initial and maximum sizes identical."
Source: http://www.oracle.com/us/products/applications/aia-11g-performance-tuning-1915233.pdf
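Applied to the configuration in the question, that recommendation amounts to something like this (an illustrative launch line; the jar name is a placeholder, since ZooKeeper is normally started via its own scripts):

```shell
# Pin the heap: equal initial and maximum sizes mean the JVM never
# resizes the heap, so resize-triggered full GCs cannot occur.
java -Xms10G -Xmx10G -XX:+UseParallelGC -jar zookeeper-server.jar
```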
If you could provide your gc.log, it would be useful to analyse this case thoroughly.
Best regards,
RCC

Java: how to trace/monitor GC times for the CMS garbage collector?

I'm having trouble figuring out a way to monitor the JVM GC for memory exhaustion issues.
With the serial GC, we could just look at the full GC pause times and have a pretty good notion if the JVM was in trouble (if it took more than a few seconds, for example).
CMS seems to behave differently.
When querying lastGcInfo from the java.lang:type=GarbageCollector,name=ConcurrentMarkSweep MXBean (via JMX), the reported duration is the sum of all GC steps, and is usually several seconds long. This does not indicate an issue with GC; on the contrary, I've found that too-short GC times are usually more of an indicator of trouble (which happens, for example, if the JVM goes into a CMS-concurrent-mark-start -> concurrent-mode-failure loop).
I've tried jstat as well, which gives the cumulative time spent garbage collecting (I'm unsure whether it covers old- or young-gen GC). This can be graphed, but it's not trivial to use for monitoring purposes. For example, I could parse jstat -gccause output and calculate differences over time, and trace+monitor that (e.g. the amount of time spent GC'ing over the last X minutes).
I'm using the following JVM arguments for GC logging:
-Xloggc:/xxx/gc.log
-XX:+PrintGCDetails
-verbose:gc
-XX:+PrintGCDateStamps
-XX:+PrintReferenceGC
-XX:+PrintPromotionFailure
Parsing gc.log is also an option if nothing else is available, but the optimal solution would be to have a java-native way to get at the relevant information.
The information must be machine-readable (to send to monitoring platforms) so visual tools are not an option. I'm running a production environment with a mix of JDK 6/7/8 instances, so version-agnostic solutions are better.
Is there a simple(r) way to monitor CMS garbage collection? What indicators should I be looking at?
Fundamentally one wants two things from the CMS concurrent collector
the throughput of the concurrent cycle to keep up with the promotion rate, i.e. the objects surviving into the old gen per unit of time
enough room in the old generation for objects promoted during a concurrent cycle
So let's say the IHOP (initiating heap occupancy percentage) is fixed at 70%; then you are probably approaching a problem when occupancy reaches more than 90% at some point. Maybe even earlier, if you do some large allocations that don't fit into the young generation or outlive it (that's entirely application-specific).
Additionally, you usually want it to spend more time outside the concurrent cycle than in it, although that depends on how tightly you tune the collector. In principle you could have the concurrent cycle running almost all the time, but then you have very little throughput margin and burn a lot of CPU time on concurrent collections.
If you really, really want to avoid even the occasional full GC, then you'll need even more safety margin due to fragmentation (CMS is non-compacting). I think this can't be monitored via MX beans; you'll have to enable some CMS-specific GC logging to get fragmentation info.
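The old-generation occupancy discussed above can be polled in-process via MemoryPoolMXBean. A minimal sketch (the pool-name matching and the 90% threshold are illustrative; pool names vary by collector):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

public class OldGenMonitor {
    /** Old-generation occupancy as a fraction of max, or -1 if not found. */
    static double oldGenOccupancy() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            // Pool names vary by collector: "CMS Old Gen", "PS Old Gen",
            // "G1 Old Gen", "Tenured Gen", ...
            String name = pool.getName();
            if (pool.getType() == MemoryType.HEAP
                    && (name.contains("Old") || name.contains("Tenured"))) {
                MemoryUsage u = pool.getUsage();
                if (u.getMax() > 0) {
                    return (double) u.getUsed() / u.getMax();
                }
            }
        }
        return -1.0;  // no matching pool (e.g. a non-generational collector)
    }

    public static void main(String[] args) {
        double occ = oldGenOccupancy();
        // Illustrative threshold: start worrying past ~90% occupancy.
        if (occ > 0.9) {
            System.out.println("WARNING: old gen is " + (int) (occ * 100) + "% full");
        } else {
            System.out.println("old gen occupancy fraction: " + occ);
        }
    }
}
```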
For viewing GC logs:
If you have already enabled GC logging, I suggest GCViewer - this is an open source tool that can be used to view GC logs and look at parameters like throughput, pause times etc.
For profiling:
I don't see a JDK version mentioned in the question. For JDK 6, I would recommend VisualVM to profile the application. For JDK 7/8 I would suggest Mission Control. You can find their launchers in the JDK's bin folder. These tools can be used to see how the application performs over a period of time and during GC (you can trigger a GC via the VisualVM UI).

How to detect a low heap situation for monitoring and alerting purposes?

We monitor our production JVMs and have monitoring triggers that (ideally) send warnings, when the JVM runs low on heap space. However, coming up with an effective detection algorithm is quite difficult, since it is the nature of garbage collection, that the application regularly has no available memory, before the GC kicks in.
There are many ways I can think of to work around this. E.g. monitor the available space and send a warning when it becomes too low, but delay it and only trigger when the condition persists for more than a minute. So, what works for you in practice?
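That "delay it and only trigger when persistent" idea can be sketched like this (thresholds, sample counts, and all names are illustrative):

```java
// Sketch of a low-heap alert with persistence: only fire when usage has
// stayed above the threshold for N consecutive samples, so the normal
// saw-tooth pattern before a GC doesn't cause false alarms.
public class HeapAlert {
    private final double threshold;      // e.g. 0.90 = 90% of max heap
    private final int requiredSamples;   // e.g. 6 samples at 10s = 1 minute
    private int consecutiveHigh = 0;

    HeapAlert(double threshold, int requiredSamples) {
        this.threshold = threshold;
        this.requiredSamples = requiredSamples;
    }

    /** Feed one usage sample (used/max); returns true when the alert fires. */
    boolean sample(double usedFraction) {
        if (usedFraction > threshold) {
            consecutiveHigh++;
        } else {
            consecutiveHigh = 0;         // GC reclaimed memory: reset
        }
        return consecutiveHigh >= requiredSamples;
    }
}
```

In production you would feed it `ManagementFactory.getMemoryMXBean().getHeapMemoryUsage()` readings on a timer.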
Particularly interesting:
How to detect a critical memory/heap problem, that needs immediate reaction?
How to detect a heap problem, that needs a precaution action?
What approaches work universally? E.g. without the need to adapt the triggers to certain JVM tuning parameters (or vice versa), or to force a GC at certain time intervals.
Is there any best practice that is used widely?
I have found a very effective measure of JVM memory health to be the percentage of time the JVM spends in garbage collection. A healthy, well-tuned JVM will use very little (< 1% or so) of its CPU time collecting garbage. An unhealthy JVM will be "wasting" much of its time keeping the heap clean, and the percentage of CPU used on collection will climb exponentially in a JVM experiencing a memory leak or with a max heap setting that is too low. As more CPU is used keeping the heap clean, less is used doing "real work"; assuming the inbound request rate doesn't slow down, it's easy to fall off a cliff where you become CPU-bound and can't get enough work done quickly enough, long before you actually get a java.lang.OutOfMemoryError.
It's worth noting that this is really the condition you want to guard against, too. You don't actually care if the JVM uses all of its heap, so long as it can efficiently reclaim memory without getting in the way of the "real work" it needs to do. (In fact, if you're never hitting the max heap size, you may want to consider shrinking your heap.)
This statistic is provided by many modern JVMs (certainly Oracle's and IBM's, at least).
Another somewhat effective measure can be the time between full GCs. The more often you are having to perform a full GC, the more time you're spending in GC.
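The cumulative counters behind both measures are exposed by the standard GarbageCollectorMXBean. A minimal sketch of turning them into a "percent of time in GC" metric by differencing between polls (class and method names are my own):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcTimePercent {
    private long lastGcMillis = totalGcTimeMillis();
    private long lastWallMillis = System.currentTimeMillis();

    /** Sum of cumulative collection time across all collectors, in ms. */
    static long totalGcTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime();   // -1 if unsupported
            if (t > 0) total += t;
        }
        return total;
    }

    /** Percentage of wall-clock time spent in GC since the previous poll. */
    double percentSinceLastPoll() {
        long nowGc = totalGcTimeMillis();
        long nowWall = System.currentTimeMillis();
        long gcDelta = nowGc - lastGcMillis;
        long wallDelta = Math.max(1, nowWall - lastWallMillis);
        lastGcMillis = nowGc;
        lastWallMillis = nowWall;
        return 100.0 * gcDelta / wallDelta;
    }
}
```

Calling `percentSinceLastPoll()` on a fixed schedule yields a machine-readable number you can ship to a monitoring platform and alert on.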

JVM consumes 100% CPU with a lot of GC

After running for a few days, the CPU load of my JVM is about 100%, with about 10% of it spent in GC (screenshot).
Memory consumption is near the max (about 6 GB).
Tomcat is extremely slow in that state.
Since it's too much for a comment, I'll write it up as an answer:
Looking at your charts, the JVM seems to be using CPU for non-GC tasks; peak "GC activity" seems to stay within 10%.
So on first impression it would seem that your task is simply CPU-bound, so if that's unexpected maybe you should do some CPU-profiling on your java application to see if something pops out.
Apart from that, based on comments I suspect that physical memory filling up might evict file caches and memory-mapped things, leading to increased page faults which forces the CPU to wait for IO.
Freeing up 500 MB with a manual GC out of a 4 GB heap does not seem all that much. Most GCs try to keep pause times low as their primary goal, keep the total time spent in GC within some bound as a secondary goal, and only when those goals are met do they try to reduce memory footprint as a tertiary goal.
Before recommending further steps you should gather more statistics/provide more information since it's hard to even discern what your actual problem is from your description.
monitor page faults
figure out which GC algorithm is used in your setup and how they're tuned (-XX:+PrintFlagsFinal)
log GC activity - I suspect it's pretty busy with minor GCs and thus eating up its pause time or CPU load goals
perform allocation profiling of your application (anything creating excessive garbage?)
You also have to be careful to distinguish problems caused by the java heap reaching its sizing limit vs. problems causing by the OS exhausting its physical memory.
TL;DR: Unclear problem, more information required.
Or, if you're lazy or can afford it, just plug in more RAM / remove other services from the machine and see if the problem goes away.
I learned to check this on GC problems:
Give the JVM enough memory e.g. -Xmx2G
If memory is not sufficient and no more RAM is available on the host, analyze the HEAP dump (e.g. by jvisualvm).
Turn on Concurrent Mark and Sweep:
-XX:+UseConcMarkSweepGC -XX:+UseParNewGC
Check the garbage collection log: -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:gc.log
My Solution:
But I finally solved the problem by tuning the cache sizes.
The cache sizes were too big, so memory became scarce.
If you want to keep your server's memory free, you can simply try the VM parameter
-Xmx2G //or any different value
This ensures your program never takes more than 2 gigabytes of RAM. But be aware that under high workload the server may then get an OutOfMemoryError.
Since an old-generation (full) GC may block your whole server from working for some seconds, Java will try to avoid a full garbage collection.
The RAM limitation may trigger a full-generation GC more easily (or even cause more objects to be collected by the young-generation GC).
From my (more guessing than actually knowing) opinion: I don't think another algorithm will help much here.
