Reducing JVM pause time > 1 second using UseConcMarkSweepGC

Reducing JVM pause time > 1 second using UseConcMarkSweepGC - java

I'm running a memory intensive app on a machine with 16Gb of RAM, and an 8-core processor, and Java 1.6 all running on CentOS release 5.2 (Final). Exact JVM details are:
java version "1.6.0_10"
Java(TM) SE Runtime Environment (build 1.6.0_10-b33)
Java HotSpot(TM) 64-Bit Server VM (build 11.0-b15, mixed mode)
I'm launching the app with the following command line options:
java -XX:+UseConcMarkSweepGC -verbose:gc -server -Xmx10g -Xms10g ...
My application exposes a JSON-RPC API, and my goal is to respond to requests within 25ms. Unfortunately, I'm seeing delays up to and exceeding 1 second and it appears to be caused by garbage collection. Here are some of the longer examples:
[GC 4592788K->4462162K(10468736K), 1.3606660 secs]
[GC 5881547K->5768559K(10468736K), 1.2559860 secs]
[GC 6045823K->5914115K(10468736K), 1.3250050 secs]
Each of these garbage collection events was accompanied by a delayed API response of very similar duration to the length of the garbage collection shown (to within a few ms).
Here are some typical examples (these were all produced within a few seconds):
[GC 3373764K->3336654K(10468736K), 0.6677560 secs]
[GC 3472974K->3427592K(10468736K), 0.5059650 secs]
[GC 3563912K->3517273K(10468736K), 0.6844440 secs]
[GC 3622292K->3589011K(10468736K), 0.4528480 secs]
The thing is that I thought the UseConcMarkSweepGC would avoid this, or at least make it extremely rare. On the contrary, delays exceeding 100ms are occurring almost once a minute or more (although delays of over 1 second are considerably rarer, perhaps once every 10 or 15 minutes).
The other thing is that I thought only a FULL GC would cause threads to be paused, yet these don't appear to be full GCs.
It may be relevant to note that most of the memory is occupied by a LRU memory cache that makes use of soft references.
Any assistance or advice would be greatly appreciated.

First, check out the Java SE 6 HotSpot[tm] Virtual Machine Garbage Collection Tuning documentation, if you haven't already done so. This documentation says:
the concurrent collector does most of its tracing and sweeping work with the
application threads still running, so only brief pauses are seen by the
application threads. However, if the concurrent collector is unable to finish
reclaiming the unreachable objects before the tenured generation fills up, or if
an allocation cannot be satisfied with the available free space blocks in the
tenured generation, then the application is paused and the collection is completed
with all the application threads stopped. The inability to complete a collection
concurrently is referred to as concurrent mode failure and indicates the need to
adjust the concurrent collector parameters.
and a little bit later on...
The concurrent collector pauses an application twice during a
concurrent collection cycle.
I notice that those GCs don't seem to be freeing very much memory. Perhaps many of your objects are long lived? You may wish to tune the generation sizes and other GC parameters. 10 Gig is a huge heap by many standards, and I would naively expect GC to take longer with such a huge heap. Still, 1 second is a very long pause time and indicates either something is wrong (your program is generating a large number of unneeded objects or is generating difficult-to-reclaim objects, or something else) or you just need to tune the GC.
Usually, I would tell someone that if they have to tune GC then they have other problems they need to fix first. But with an application of this size, I think you fall into the territory of "needing to understand GC much more than the average programmer."
As others have said, you need to profile your application to see where the bottleneck is. Is your PermGen too large for the space allocated to it? Are you creating unnecessary objects? jconsole works to at least show a minimum of information about the VM. It's a starting point. As others have indicated however, you very likely need more advanced tools than this.
Good luck.

Since you mention your desire to cache, I'm guessing that most of your huge heap is occupied by that cache. You might want to limit the size of the cache so that you are sure it never attempts to grow large enough to fill the tenured generation. Don't rely on SoftReference alone to limit the size. As the old generation fills with soft references, older references will be cleared and become garbage. New references (perhaps to the same information) will be created, but cleared quickly because free space is in short supply. Eventually, the tenured space is full of garbage and needs to be cleaned.
Consider adjusting the -XX:NewRatio setting too. The default is 1:2, meaning that one-third of the heap is allocated to the new generation. For a large heap, this is almost always too much. You might want to try something like 9, which would keep 9 Gb of your 10 Gb heap for the old generation.

Turns out that part of the heap was getting swapped out to disk, so that garbage collection had to pull a bunch of data off the disk back into memory.
I resolved this by setting Linux's "swappiness" parameter to 0 (so that it wouldn't swap data out to disk).

Here are some things I have found which could be significant.
JSON-RPC can generate a lot of objects. Not as much as XML-RPC, but still something to watch for. In any case you do appear to be generating as much at 100 MB of objects per second which means your GC is running a high percentage of the time and is likely to be adding to your random latency. Even though the GC is concurrent, your hardware/OS is very likely to exhibit non-ideal random latency under load.
Have a look at your memory bank architecture. On Linux the command is numactl --hardware. If your VM is being split across more than one memory bank this will increase your GC times significantly. (It will also slow down your application as these accesses can be significantly less efficient) The harder you work the memory subsystem the more likely the OS will have to shift memory around (Often in large amounts) and you get dramatic pauses as a result (100 ms is not surprising). Don't forget your OS does more than just run your app.
Consider compacting/reducing the memory consumption of your cache. If you are using multiple GB of cache it is worth looking at ways to cut memory consumption further than you have already.
I suggest you profile your app with memory allocation tracing AND cpu sampling on at the same time. This can yield very different results and often points to the cause of these sort of problems.
Using these approaches, the latency of an RPC call can be reduced to below 200 micro-second and the GC times reduced to 1-3 ms effecting less than 1/300 of calls.

I'd also suggest GCViewer and a profiler.

Some places to start looking:
https://visualvm.dev.java.net/
http://java.sun.com/j2se/1.5.0/docs/tooldocs/share/jstat.html
http://www.javaperformancetuning.com/tools/gcviewer/index.shtml
Also I'd run the code through a profiler.. I like the one in NetBeans but there are other ones as well. You can view the gc behaviour in real time. The Visual VM does that as well... but I haven't run it yet (been looking for a reason to... but haven't had the time or the need yet).

I haven't personally used such a huge heap but I've experienced very low latency in general using following switches for Oracle/Sun Java 1.6.x:
-Xincgc -XX:+UseConcMarkSweepGC -XX:CMSIncrementalSafetyFactor=50
-XX:+UseParNewGC
-XX:+CMSConcurrentMTEnabled -XX:ConcGCThreads=2 -XX:ParallelGCThreads=2
-XX:CMSIncrementalDutyCycleMin=0 -XX:CMSIncrementalDutyCycle=5
-XX:GCTimeRatio=90 -XX:MaxGCPauseMillis=20 -XX:GCPauseIntervalMillis=1000
The important parts are, in my opinion, the use of CMS for tenured generation and ParNewGC for young generation. In addition, this adds a pretty big safety factor for CMS (default is 10% instead of 50%) and request short pause times. As you're targeting for 25 ms response time, I'd try setting -XX:MaxGCPauseMillis to even smaller value. You could even try to use more than two cores for concurrent GC but I would guess that is not worth the CPU usage.
You should probably also check the HotSpot JVM GC cheat sheet.

Related

Major GC decreasing performance

We are having frecuent outages in our app, basically the heap grows over time to the point were the GC takes a lot of CPU time and execute for several minutes, decreasing the app perfomance drastically. The app is in JSF with a tomcat server.
In the mean time, we:
Increased the heap size from 15G to 26G (-Xms27917287424 -Xmx27917287424)
Take several heap dumps (we are trying to determine the problem using these)
Activated GC logs
With the heap size increase, GC is not executing for that much time but still takes a lot of CPU and frezees the app.
So the question is:
Is this normal? When the GC executes it frees memory, so i think this probably isn't a memory leak (Am I right?)
Is there a way of optimize the GC or maybe this behavior is just a sympthom of something wrong in the app itself?
How can I monitor and analyze this without taking a heap dump?
UPDATE:
I changed JSF from 2.2 to 2.3 because some heap dumps were pointing that JSF was using a lot of memory.
That didn't work out, and yesterday we had and outage again, but this time a little different (from my point of view). Also this time, we had to reset tomcat because the app didn't work anymore after a while
In this case, the garbage collector is running when de old gen heap is not full, and the new generation GC is running all the time.
¿What can be the cause of this?

As has been said in the comments, the behaviour of the application does not look unreasonable. Your code is continually allocating objects that leads to heap space filling up, causing the GC to run. There does not appear to be a memory leak since GC reclaims a lot of space and the overall used space is not continually increasing.
What does appear to be an issue is that a significant number of objects are being promoted to the old-gen before being collected. Major GC cycles are more expensive in terms of CPU due to the relocation and remapping of objects (assuming you're using a compacting algorithm).
To reduce this, you could try increasing the size of the young generation. This will have happened when you increased the overall heap size but not by enough. Ideally, you want the majority of objects to be collected during a minor GC cycle since this is effectively free (the GC does nothing to the objects in Eden space as they are collected). You can do this with the -XX:NewRatio= or -XX:NewSize= flags. You could also try changing the survivor space sizes, again to increase the number of objects collected before tenuring. (use the -XX:SurvivorRatio= flag for this).
For monitoring, I find Flight Recorder and Mission Control very useful as you can drill down into details of how many objects of specific types are allocated. It's also easy to connect to a running JVM or take dumps for later analysis.

Why does java ParNew young generation collection degrade to single CPU with CMS?

I've been unable to find an explanation of which phase of young collection is actually serialized to a single CPU. My speculation is that is when doing the first fit (best fit?) search in the old generation.
From what I've been able to glean, the horrible performance we see deals with the eden size being to small, and we end up unnecessarily copying lots of data into the old generation.
We hit periods were the process basically stops in young GC, and CPU utilization drops to about 1 CPU for sometimes a minute or to. We are not seeing evidence that we actually fail to allocate in the old generation due to fragmentation.
We're using -XX:+UseConcMarkSweepGC -XX:+UseParNewGC with 16GB or 31GB heaps on 32 CPU machines. Both on Java 7 and Java 8 (u66) . Will upgrading help this?

I have found a solution to my problem with GC, and understand what was happening, but not exactly which part of ParNew young GC was actually serialized to a single CPU.
It turns out that if you are using CMS, then the documented default value of NewRatio, 2, is not used, even on a new Java 8. Simply specifying -XX:NewRatio=2 on the command line got us completely out of GC hell on our 31GB heap ElasticSearch instances. The "this is not a bug, its a feature" explanation is here: http://bugs.java.com/view_bug.do?bug_id=8153578
The young size we had been getting was 1.4GB, rather than the ~10GB we expected from the default NewRatio. Therefore at high thread counts, the young data was treated as tenured, even though we had plenty of memory.
Related to this, we have noticed a phenomenon with thread pools, such as ElasticSearch has. The eden size was large enough for the typical 8-12 threads that would typically run, but if the load increased, we'd descend into GC hell. So if you have a server with thread pools, the eden sizing needs to be based on cases where all the threads are in use.

JVM consumes 100% CPU with a lot of GC

After running a few days the CPU load of my JVM is about 100% with about 10% of GC (screenshot).
The memory consumption is near to max (about 6 GB).
The tomcat is extremely slow at that state.

Since it's too much for a comment i'll write it up ans answer:
Looking at your charts it seems to be using CPU for non-GC tasks, peak "GC activity" seems to stay within 10%.
So on first impression it would seem that your task is simply CPU-bound, so if that's unexpected maybe you should do some CPU-profiling on your java application to see if something pops out.
Apart from that, based on comments I suspect that physical memory filling up might evict file caches and memory-mapped things, leading to increased page faults which forces the CPU to wait for IO.
Freeing up 500MB on a manual GC out of a 4GB heap does not seem all that much, most GCs try to keep pause times low as their primary goal, keep the total time spent in GC within some bound as secondary goal and only when the other goals are met they try to reduce memory footprint as tertiary goal.
Before recommending further steps you should gather more statistics/provide more information since it's hard to even discern what your actual problem is from your description.
monitor page faults
figure out which GC algorithm is used in your setup and how they're tuned (-XX:+PrintFlagsFinal)
log GC activity - I suspect it's pretty busy with minor GCs and thus eating up its pause time or CPU load goals
perform allocation profiling of your application (anything creating excessive garbage?)
You also have to be careful to distinguish problems caused by the java heap reaching its sizing limit vs. problems causing by the OS exhausting its physical memory.
TL;DR: Unclear problem, more information required.
Or if you're lazy/can afford it just plug in more RAM / remove other services from the machine and see if the problem goes away.

I learned to check this on GC problems:
Give the JVM enough memory e.g. -Xmx2G
If memory is not sufficient and no more RAM is available on the host, analyze the HEAP dump (e.g. by jvisualvm).
Turn on Concurrent Marc and Sweep:
-XX:+UseConcMarkSweepGC -XX:+UseParNewGC
Check the garbage collection log: -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:gc.log
My Solution:
But I solved that problem finally by tuning the cache sizes.
The cache sizes were to big, so memory got scarce.

if you want keep the memory of your server free you can simply try the vm-parameter
-Xmx2G //or any different value
This ensures your program never takes more than 2 Gigabyte of Ram. But be aware if case of high workload the server may be get an OutOfMemoryError.
Since a old generation (full) GC may block your whole server from working for some seconds java will try to avoid a Full Garbage collection.
The Ram-Limitation may trigger a Full-Generation GC more easy (or even support more objects to be collected by Young-Generation GC).
From my (more guessing than actually knowing) opinion: I don't think another algorithm can help so much here.

Tuning JVM (GC) for high responsive server application

I am running an application server on Linux 64bit with 8 core CPUs and 6 GB memory.
The server must be highly responsive.
After some inspection I found that the application running on the server creates rather a huge amount of short-lived objects, and has only about 200~400 MB long-lived objects(as long as there is no memory leak)
After reading http://java.sun.com/javase/technologies/hotspot/gc/gc_tuning_6.html
I use these JVM options
-server -Xms2g -Xmx2g -XX:MaxPermSize=256m -XX:NewRatio=1 -XX:+UseConcMarkSweepGC
Result: the minor GC takes 0.01 ~ 0.02 sec, the major GC takes 1 ~ 3 sec
the minor GC happens constantly.
How can I further improve or tune the JVM?
larger heap size? but will it take more time for GC?
larger NewSize and MaxNewSize (for young generation)?
other collector? parallel GC?
is it a good idea to let major GC take place more often? and how?

Result: the minor GC takes 0.01 ~ 0.02 sec, the major GC takes 1 ~ 3 sec the minor GC happens constantly.
Unless you are reporting pauses, I would say that the CMS collector is doing what you have asked it to do. By definition, CMS will use a larger percentage of the CPU than the Serial and Parallel collectors. This is the penalty you pay for low pause times.
If you are seeing 1 to 3 second pause times, I'd say that you need to do some tuning. I'm no expert, but it looks like you should start by reducing the value of CMSInitiatingOccupancyFraction from the default value of 92.
Increasing the heap size will improve the "throughput" of the GC. But if your problem is long pauses, increasing the heap size is likely to make the problem worse.

Careful .... GC can be a hairy subject if you are not cautious. Within any runtime (JVM for Java / CLR for .Net) there are several processes that take place. Generally there is an early stage optimization of memory (Young Generational Garbage Collection / Young Gen GC & Old Generational Garbage Collection / Old Gen GC). The young gen gc happens on a regular basis and is commonly attributed to your smaller pauses / hiccups. The old gen gc is normally what is going on when you see the long "stop the world" pauses.
Why you might ask? The reason you get pauses with your runtime / JVM is that when the runtime does its cleanup of the Heap it has to go through what is called a phase change. It stops the threads running your application in order to mark and swap pointers to optimize your available memory. Yong gen is faster as it is mainly releasing objects that are only temporary. Old gen, however, evaluates all the objects on the heap and when you run out of memory will it will kick of to free up much needed memory.
Why the Caution? Old gen gets exponentially worse in pause time the more heap you use. at 2-4 GB in total heap size you should be fine on modern runtimes like Java 6 (JDK 1.6+). Once you go beyond that threashold you will see exponential increases in pause times. I have run into some clients that have to restart servers - as in some rare cases where a heap is large GC pause times can take longer than a full restart.
There are some new tools out there that are pretty cool and can give you a leading edge on evaluating if GC is your pain. JHiccup is one and it is free from the azulsystemswebsite. At this time I think it is only for Linux though. They also have a JVM that has a re-built GC algorithm that runs pauseless ... but if you are on a single server deployment with a non-critical app it may not be cost effective (that one is not free).
To sum up - if your runtime / JVM / CLR heap is less than 2 GB adding more memory will help. Be sure to give yourself some overhead. You never want to hit 100% Heap size / memory size if possible. That is when the long pauses are the longest. Give yourself an extra 20%+ memory over what you think you will need. That way you have room for the GC algorithms to move objects around for optimization. If you ever plan to go large ... there is one tool that fixes the circa 1990 JVM technology (Azul Systems Zing JVM), but it is not free. They do offer an open source tool to diagnose GC issues. The JVM (as I have tried it) also has a really cool thread level visibility tool that lets you report on any leaks, bugs, or locks in production without overhead (some trick with offloading data the JVM already deals with and time stamping). That has saved tons of dev test time ... but again, not for small apps.
Stay below 4 GB. Give extra headroom. And if you want you can turn on these flags to monitor GC for Java / JVM:
java -verbose:gc myProgram
java -Xloggc:D:/log/myLogFile.log -XX:+PrintGCDetails myProgram
You may try some of the other collectors Hotspot uses. There are more than one.
If you are on Linux go ahead and try the JHiccup tool as well. It is free.

You may be interested in trying the low-pause Garbage-First collector instead of concurrent mark-sweep (although it's not necessarily more performant for all collections, it's supposed to have a better worst case). It's enabled by -XX:+UseG1GC and is supposed to be really awesomesauce, but you may want to give it a thorough evaluation before using it in production. It has probably improved since, but it seems to have been a bit buggy a year ago, as seen in Experience with JDK 1.6.x G1 (“Garbage First”)

It is perfectly fine for the garbage collection to run in parallel with your program, if you have plenty of cpu, which you do.
What you want, is to make absolutely certain that you will not run into a scenario where the garbage collection PAUSES your main program.
Have you tried simply not stating any flags except saying you want the server VM (for the Sun JVM), and then put your server under heavy load to see how it behaves? Only then can you see, if you get any improvements from tinkering with options.

This actually sounds like a throughput app and should probably use the throughput collector. I would balance the size of the new gen making it big enough to not GC too often and small enough to prevent long pauses. 20ms sounds like a long minor GC to me. I also suspect your survivor space is set too large and is just being wasted. If you don't have much surviving to old gen, you shouldn't have that much surviving your minor GCs.
In the end, you should use jvmstat and VisualGC to really get a feel for how your app is using memory.

For high responsive server application, I think you want to see the major GC happens less frequently. Here is the list of parameters would help.
-XX:+CMSParallelRemarkEnabled
-XX:+CMSScavengeBeforeRemark
-XX:+UseCMSInitiatingOccupancyOnly
-XX:CMSInitiatingOccupancyFraction=50
-XX:CMSWaitDuration=300000
-XX:GCTimeRatio=40
Larger heap size may not help on low pause, as long as your app doesn't run out of memory.
Larger NewSize and MaxNewSize would help on throughput, may not help on low pause. If you choose to take this approach, you may consider give GC threads more execution time by setting -XX:GCTimeRatio lower. The point is to remember to take a holistic when tuning JVM.

I think previous posters have missed something very obvious- the Perm Generation size is too low. If the system uses 200 to 400 MB as permanent generation- then it may be best to set Max Perm Gen to 400 MB. the PerGen size should also be set to the same value. You will then never run out of Permanent Generation Space.
Currently- it looks like the JVM has to spend a lot of time moving objects in and out of Permanent Generation. This can take time. JVM tries to allocate contiguous memory areas to Java Objects- this speeds memory access due to hardware level features. In order to do that, it is very helpful to have plenty of buffer in memory. If Permanent Generation is almost full, newly discovered permanent objects must be split or existing objects must be shuffled. This is what triggers a full GC, as well as causes long full GC pauses.
The question states that the Permanent Generation size has already been measured- if this has not been done, it should be measured using a tool. These tools process logs generated by the JVM with the verboseGC option turned on.
All the mark and sweep options listed above- may not be needed with this basic improvement.
People throw GC options as solutions without evaluating how mature they have proven to be in actual use.

Does the Sun JVM slow down when more memory is allocated via -Xmx?

Does the Sun JVM slow down when more memory is available and used via -Xmx? (Assumption: The machine has enough physical memory so that virtual memory swapping is not a problem.)
I ask because my production servers are to receive a memory upgrade. I'd like to bump up the -Xmx value to something decadent. The idea is to prevent any heap space exhaustion failures due to my own programming errors that occur from time to time. Rare events, but they could be avoided with my rapidly evolving webapp if I had an obscene -Xmx value, like 2048mb or higher. The application is heavily monitored, so unusual spikes in JVM memory consumption would be noticed and any flaws fixed.
Possible important details:
Java 6 (runnign in 64-bit mode)
4-core Xeon
RHEL4 64-bit
Spring, Hibernate
High disk and network IO
EDIT: I tried to avoid posting the configuration of my JVM, but clearly that makes the question ridiculously open ended. So, here we go with relevant configuration parameters:
-Xms256m
-Xmx1024m
-XX:+UseConcMarkSweepGC
-XX:+AlwaysActAsServerClassMachine
-XX:MaxGCPauseMillis=1000
-XX:MaxGCMinorPauseMillis=1000
-XX:+PrintGCTimeStamps
-XX:+HeapDumpOnOutOfMemoryError

By adding more memory, it will take longer for the heap to fill up. Consequently, it will reduce the frequency of garbage collections. However, depending on how mortal your objects are, you may find that how long it takes to do any single GC increases.
The primary factor for how long a GC takes is how many live objects there are. Thus, if virtually all of your objects die young and once you get established, none of them escape the young heap, you may not notice much of a change in how long it takes to do a GC. However, whenever you have to cycle the tenured heap, you may find everything halting for an unreasonable amount of time since most of these objects will still be around. Tune the sizes accordingly.

If you just throw more memory at the problem, you will have better throughput in your application, but your responsiveness can go down if you're not on a multi core system using the CMS garbage collector. This is because fewer GCs will occur, but they will have more work to do. The upside is that you will get more memory freed up with your GCs, so allocation will continue to be very cheap, hence the higher througput.
You seem to be confusing -Xmx and -Xms, by the way. -Xms just sets the initial heap size, whereas -Xmx is your max heap size.

More memory usually gives you better performance in garbage collected environments, at least as long as this does not lead to virtual memory usage / swapping.
The GC only tracks references, not memory per se. In the end, the VM will allocate the same number of (mostly short-lived, temporary) objects, but the garbage collector needs to be invoked less often - the total amount of garbage collector work will therefore not be more - even less, since this can also help with caching mechanisms which use weak references.
I'm not sure if there is still a server and a client VM for 64 bit (there is for 32 bit), so you may want to investigate that also.

According to my experience, it does not slow down BUT the JVM tries to cut back to Xms all the time and try to stay at the lower boundary or close to. So if you can effort it, bump Xms as well. Sun is recommending both at the same size. Add some -XX:NewSize=512m (just a made up number) to avoid the costly pile up of old data in the old generation with leads to longer/heavier GCs on the way. We are running our web app with 700 MB NewSize because most data is short-lived.
So, bottom line: I do not expect a slow down, but put your more of memory to work. Set a larger new size area and set Xms to Xmx to lower the stress on the GC, because it does not need to try to cut back to Xms limits...

It typically will not help your peformance/throughput,if you increase -Xmx.
Theoretically there could be longer "stop the world" phases but in practice with the CMS that's not a real problem.
Of course you should not set -Xmx to some insane value like 300Gbyte unless you really need it :)

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.