After running for a few days, the CPU load of my JVM is at about 100%, with about 10% of that spent in GC (screenshot).
The memory consumption is near the maximum (about 6 GB).
Tomcat is extremely slow in that state.
Since it's too much for a comment, I'll write it up as an answer:
Looking at your charts, the CPU seems to be spent on non-GC tasks; peak "GC activity" stays within 10%.
So on first impression it would seem that your task is simply CPU-bound, so if that's unexpected you should do some CPU profiling on your Java application to see if something pops out.
Apart from that, based on the comments I suspect that physical memory filling up might evict file caches and memory-mapped data, leading to increased page faults, which force the CPU to wait for IO.
Freeing up 500 MB on a manual GC out of a 4 GB heap does not seem like much. Most GCs try to keep pause times low as their primary goal, keep the total time spent in GC within some bound as a secondary goal, and only when those goals are met do they try to reduce memory footprint as a tertiary goal.
Before recommending further steps you should gather more statistics/provide more information since it's hard to even discern what your actual problem is from your description.
monitor page faults
figure out which GC algorithm is used in your setup and how it's tuned (-XX:+PrintFlagsFinal); see the sketch after this list
log GC activity - I suspect it's pretty busy with minor GCs and thus eating up its pause time or CPU load goals
perform allocation profiling of your application (anything creating excessive garbage?)
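To make those checks concrete, here is a minimal sketch assuming a HotSpot JVM up to Java 8 (the PrintGC* flags were replaced by -Xlog:gc* in Java 9+; yourapp.jar is a placeholder):
java -XX:+PrintFlagsFinal -version | grep -i gc //shows the selected collector and its tuning flags
jcmd <pid> VM.flags //same information for an already running JVM
java -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:gc.log -jar yourapp.jar //logs GC activity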
You also have to be careful to distinguish problems caused by the Java heap reaching its sizing limit vs. problems caused by the OS exhausting its physical memory.
TL;DR: Unclear problem, more information required.
Or if you're lazy/can afford it just plug in more RAM / remove other services from the machine and see if the problem goes away.
Here is what I learned to check when facing GC problems (a combined example follows the list):
Give the JVM enough memory, e.g. -Xmx2G
If memory is not sufficient and no more RAM is available on the host, analyze a heap dump (e.g. with jvisualvm).
Turn on Concurrent Mark and Sweep:
-XX:+UseConcMarkSweepGC -XX:+UseParNewGC
Check the garbage collection log: -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:gc.log
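Combined into one launch command (these CMS flags apply to Java 8 and earlier, since CMS was removed in Java 14; myapp.jar is a placeholder):
java -Xmx2G -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:gc.log -jar myapp.jar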
My Solution:
I finally solved the problem by tuning the cache sizes: they were too big, so memory became scarce.
If you want to keep some of your server's memory free, you can simply try the VM parameter
-Xmx2G //or any other value
This ensures your program never takes more than 2 gigabytes of RAM. But be aware that in case of a high workload the server may then get an OutOfMemoryError.
Since an old-generation (full) GC may block your whole server from working for some seconds, Java will try to avoid a full garbage collection.
The RAM limitation may trigger a full GC more easily (or even cause more objects to be collected by the young-generation GC).
From my (more guessing than actually knowing) point of view: I don't think another algorithm would help much here.
Related
ZGC does not run often enough. GC logs show that it runs once every 2-3 minutes for my application, and because of this my memory usage goes high between GC cycles (as high as 90%). After GC, it drops to as low as 20%.
How can I increase the GC frequency so it runs more often?
-XX:ZCollectionInterval=N - set maximum gap between collections to N seconds.
-XX:ZUncommitDelay=M - set the delay until unused memory is returned to the OS to M seconds.
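For instance, a hedged launch sketch combining the two flags above (ZGC requires Java 11+, and before Java 15 it also needs -XX:+UnlockExperimentalVMOptions; myapp.jar and the values are placeholders):
java -XX:+UseZGC -XX:ZCollectionInterval=5 -XX:ZUncommitDelay=300 -Xlog:gc* -jar myapp.jar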
Before tuning the GC, I would recommend investigating why this is happening. You might have some issue/bug in your application.
[Some notes about GC]
-XX:ZUncommitDelay=M (check whether it is supported by your Linux kernel)
-XX:+ZProactive: Enables proactive GC cycles when using ZGC. By default, this option is enabled. ZGC will start a proactive GC cycle if doing so is expected to have minimal impact on the running application. This is useful if the application is mostly idle or allocates very few objects, but you still want to keep the heap size down and allow reference processing to happen even when there is a lot of free space on the heap.
More details about ZGC configuration options can be found here:
ZGC Home Page.
Oracle Documentation
Presently (as of JDK 17), ZGC's primary strategy is to wait until the last possible moment before the heap fills up and then do a collection. Its goals are:
Avoid unnecessary CPU load by running GC only when it's necessary.
Start the GC early enough so that it will finish before the heap actually fills up (since the heap filling up would be bad, leading to a temporary application stall).
It does this by measuring how fast your app is allocating memory, how long the GC takes to run, and predicting at what point it should start the GC. You can find the exact algorithm in the source code.
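As a deliberately simplified illustration of that idea (this is not ZGC's actual code; the names and the 1.2 headroom factor are invented for this sketch):
// decide whether a collection should start now, given measured allocation rate and GC duration
static boolean shouldStartGc(long freeHeapBytes, double allocBytesPerSec, double expectedGcSeconds) {
    double secondsUntilHeapFull = freeHeapBytes / allocBytesPerSec; // time left at the current allocation rate
    return secondsUntilHeapFull <= expectedGcSeconds * 1.2;         // start early enough to finish before the heap fills
}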
ZGC also exposes some knobs for running GC more often (i.e., proactively), but honestly I don't find them terribly effective. You can find more info in my other answer. G1 does a better job of being proactive, but whether that's good or not depends on your use case. (It sounds like you care more about throughput than memory usage, so I think you should prefer ZGC's behavior.)
However, if you find that ZGC is making mistakes in predicting when the heap will fill up and that your application really is hitting stalls, please share that info here or on the ZGC mailing list.
I have an instance of zookeeper that has been running for some time... (Java 1.7.0_131, ZK 3.5.1-1), with -Xmx10G -XX:+UseParallelGC.
Recently there was a leadership change, and the memory usage on most instances in the quorum went from ~200MB to 2GB+. I took a jmap dump, and what I found interesting was that there was a lot of byte[] serialization data (>1GB) that had no GC root but hadn't been collected.
(This is ByteArrayOutputStream, DataOutputStream, org.apache.jute.BinaryOutputArchive, or HeapByteBuffer, BinaryOutputArchive).
Looking at the GC log, shortly before the leadership change the full GC was running every 4-5 minutes. After the election, the tenuring threshold increases from 1 to 15 (max) and the full GC runs less and less often; eventually it doesn't even run on some days.
After several days, suddenly, and mysteriously to me, something changes and the memory plummets back to ~200MB, with full GC running every 4-5 minutes again.
What I'm confused about here is how so much memory can have no GC root and still not get collected by a full GC. I even tried triggering GC.run from jcmd a few times.
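For reference, the kinds of commands involved here (the PID and dump file name are placeholders):
jmap -dump:format=b,file=zk-heap.hprof <pid> //heap dump; adding ",live" forces a full GC before dumping
jcmd <pid> GC.run //request a full GC
jcmd <pid> GC.class_histogram //quick view of which classes are filling the heap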
I wondered if something in ZK native land was holding onto this memory, or leaking this memory... which could explain it.
I'm looking for any debugging suggestions; I'm planning on upgrading to Java 1.8, maybe ZK 3.5.4, but would really like to root-cause this before moving on.
So far I've used visualvm, GCviewer and Eclipse MAT.
(Solid vertical black lines are full GC. Yellow is young generation).
I am not an expert on ZK. However, I have been tuning JVMs on WebLogic for a while, and based on this information I feel that your configuration is causing the heap to expand and shrink (-Xmx10G -XX:+UseParallelGC). Thus, perhaps you should try using -Xms10G and -Xmx10G to avoid this resizing. Importantly, each time the JVM is resized a full GC is executed, so avoiding this process is a good way to minimize the number of full garbage collections.
Please read this
"When a Hotspot JVM starts, the heap, the young generation and the perm generation space are
allocated to their initial sizes determined by the -Xms, -XX:NewSize, and -XX:PermSize parameters
respectively, and increment as-needed to the maximum reserved size, which are -Xmx, -
XX:MaxNewSize, and -XX:MaxPermSize. The JVM may also shrink the real size at runtime if the
memory is not needed as much as originally specified. However, each resizing activity triggers a
Full Garbage Collection (GC), and therefore impacts performance. As a best practice, we
recommend that you make the initial and maximum sizes identical"
Source: http://www.oracle.com/us/products/applications/aia-11g-performance-tuning-1915233.pdf
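Concretely, for the heap size from the question this would mean starting the JVM with:
-Xms10G -Xmx10G -XX:+UseParallelGC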
If you could provide your gc.log, it would be useful to analyse this case thoroughly.
With VisualVM I am observing the following heap usage on a JBoss server:
The server is started with the following (relevant) JVM options:
-Xrs -Xms3072m -Xmx3072m -XX:MaxPermSize=512m -XX:+UseParallelOldGC -Dsun.rmi.dgc.client.gcInterval=3600000 -Dsun.rmi.dgc.server.gcInterval=3600000
And we currently also have enabled GC logging:
-XX:+PrintGC -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -Xloggc:log\gc.log
Basically I am happy with the observed pattern, since it looks like we don't have any memory leaks (the pattern repeats itself over days).
However I am wondering if there is room for optimization?
First of all, I don't understand why garbage collection already kicks in when heap usage reaches about 2GB. It looks to me like it could kick in later, since the heap has 3GB available.
Further more I would be interested in tips regarding the observed heap usage pattern and the used JVM options:
Does the observed pattern allow me to draw conclusions about the GC strategy used (UseParallelOldGC)? Is this strategy the right one, or should I try another one given the observed heap usage?
Can I optimize the GC process, so that the full heap size (3GB) is used?
Right now it looks like the full 3GB are never used, should I reduce the Xms/Xmx to 2.5GB?
Are there any obvious GC optimizations that I am missing? Like tuning -XX:NewSize or -XX:NewRatio?
Any other tips that come to mind?
Thanks!
I'd say the GC behaviour in your screen-shot looks 'normal'.
You'd usually want major collections to trigger before the heap space gets too full, or it would be very easy to encounter OutOfMemoryErrors in a number of scenarios.
Also, are you aware that Java's heap space is divided into distinct areas for new (eden), current (survivor) and old (tenured) objects?
This answer provides some excellent information on the subject, so I won't repeat it here:
How is the java memory pool divided?
Very basically, each area of the heap triggers its own collections. The eden space is normally collected often and 'quickly'; the survivor and tenured spaces are usually larger and take longer to collect.
Could you reduce your heap size based on the above graph?
Yes. However, your current configuration allows your application some breathing room, if it's ever likely to encounter busier periods or spikes in load.
Can you optimize GC?
Yes, but there are no magic settings. The first question is do you really need to? If your application is just a non-interactive 'processor', I really wouldn't bother. If you have a genuine need for a low pause application, then there are some tweaks available. The trade off is generally that you'll need more resources to achieve the same result.
My experience is that low-pause JVM configurations have a very noticeable fall-off point when load increases. If your application is usually fairly idle, but you expect a 'quick' response when it is called, low pause may be appropriate. On a busier system, with peaks in traffic / load, you may prefer a more traditional approach.
Summary
In any case, don't be tempted to make arbitrary changes to 'improve' your configuration. Be scientific and professional about your approach.
If you don't have production metrics available, consider using tools like Apache JMeter to build load-test scenarios that simulate the typical live load on your application, increased load (by say, 10%, 20% or 50%), and intermittent peak load.
Use metrics for both the GC and the application, measuring at least:
Average throughput.
Peak throughput.
Average load (CPU and memory).
Peak load.
Application pause times (total and individual pauses).
Time spent performing collections.
Reliability (OOME's etc.).
Once you're happy that you've recorded an accurate benchmark of your application's performance with its current configuration, only then should you start making any changes.
Obviously, record your configuration and its metrics. Document any changes and then perform the same benchmark tests. Then you'll be able to see any performance gain (or loss) and any trade-off that may apply.
Here's some further reading from Oracle on the subject to get you started:
Java SE 6 Virtual Machine Garbage Collection Tuning
I create a fixed thread pool using forPool = Executors.newFixedThreadPool(poolSize); where poolSize is initialized to the number of cores on the processor (let's say 4). In some runs it works fine and the CPU utilisation is consistently at 400%.
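A minimal sketch of that setup (the task body and the loop count are placeholders based on the description above):
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

int poolSize = Runtime.getRuntime().availableProcessors(); // e.g. 4 on a quad-core machine
ExecutorService forPool = Executors.newFixedThreadPool(poolSize);
for (int i = 0; i < 1000; i++) {
    forPool.submit(new Runnable() {
        public void run() {
            // data-parallel work here, then a synchronized update of a single shared variable
        }
    });
}
forPool.shutdown();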
But sometimes, the usage drops to 100%, and never rises back to 400%. I have 1000s of tasks scheduled, so the problem is not that. I catch every exception, but no exception is thrown. So the issue is random and not reproducible, but very much present. They are data parallel operations. At the end of each thread, there is a synchronised access to update a single variable. Highly unlikely I have a deadlock there. In fact, once I spot this issue, if I destroy the pool, and create a fresh one of size 4, it is still only 100% usage. There is no I/O.
It seems counter intuitive to java's assurance of a "FixedThreadPool". Am I reading the guarantee wrong? Is only concurrency guaranteed and not parallelism?
And to the question - Have you come across this issue and solved it? If I want parallelism, am I doing the correct thing?
Thanks!
On doing a thread dump:
I find that there are 4 threads all doing their parallel operations, but the usage is still only ~100%. Here are the thread dumps at 400% usage and 100% usage. I set the number of threads to 16 to trigger the scenario. It runs at 400% for a while and then drops to 100%. When I use 4 threads, it runs at 400% and only rarely drops to 100%. This is the parallelization code.
****** [MAJOR UPDATE] ******
It turns out that if I give the JVM a huge amount of memory to play with, this issue is solved and the performance does not drop. But I don't know how to use this information to solve this issue. Help!
Given the fact that increasing your heap size makes the problem 'go away' (perhaps not permanently), the issue is probably related to GC.
Is it possible that the Operation implementation is generating some state, that is stored on the heap, between calls to
pOperation.perform(...);
? If so, then you might have a memory usage problem, perhaps a leak. As more tasks complete, more data is on the heap. The garbage collector has to work harder and harder to try and reclaim as much as it can, gradually taking up 75% of your total available CPU resources. Even destroying the ThreadPool won't help, because that's not where the references are stored, it's in the Operation.
The 16-thread case hitting this problem more often could be due to the fact that it generates more state more quickly (I don't know the Operation implementation, so it's hard for me to say).
And increasing the heap size while keeping the problem set the same would make this problem appear to disappear, because you'd have more room for all this state.
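Purely to illustrate the kind of state accumulation meant above (the Operation interface and every name in this snippet are hypothetical, not taken from the asker's code):
interface Operation { // hypothetical signature for this sketch
    void perform(byte[] chunk);
}

class LeakyOperation implements Operation {
    // grows on every call and is never cleared, so the garbage collector can never reclaim it
    private static final java.util.List<byte[]> history = new java.util.ArrayList<byte[]>();

    public void perform(byte[] chunk) {
        // ... data-parallel work ...
        history.add(chunk.clone()); // state kept between calls: heap usage climbs as tasks complete
    }
}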
I suggest that you use the YourKit thread analysis feature to understand the real behaviour. It will tell you exactly which threads are running, blocked or waiting, and why.
If you can't or don't want to purchase it, the next best option is to use VisualVM, which is bundled with the JDK, to do this analysis. It won't give you as detailed information as YourKit. The following blog post can get you started with VisualVM:
http://marxsoftware.blogspot.in/2009/06/thread-analysis-with-visualvm.html
My answer is based on a mixture of knowledge about JVM memory management and some guesses about facts which I could not find precise information on. I believe that your problem is related to the thread-local allocation buffers (TLABs) Java uses:
A Thread Local Allocation Buffer (TLAB) is a region of Eden that is used for allocation by a single thread. It enables a thread to do object allocation using thread local top and limit pointers, which is faster than doing an atomic operation on a top pointer that is shared across threads.
Let's say you have an eden size of 2M and use 4 threads: the JVM may choose a TLAB size of eden/64 = 32K, and each thread gets a TLAB of that size. Once a thread's 32K TLAB is exhausted, it needs to acquire a new one, which requires global synchronization. Global synchronization is also needed for allocation of objects which are larger than the TLAB.
But to be honest with you, things are not as easy as I described: the JVM adaptively sizes a thread's TLAB based on its estimated allocation rate, determined at minor GCs [1], which makes TLAB-related behavior even less predictable. However, I can imagine that the JVM scales the TLAB sizes down when more threads are working. This seems to make sense, because the sum of all TLABs must be less than the available eden space (and in practice even only a fraction of the eden space, so that the TLABs can be refilled).
Let us assume a fixed TLAB size per thread of (eden size / (16 * user threads working)):
for 4 threads this results in TLABs of 32K
for 16 threads this results in TLABs of 8K
You can imagine that 16 threads, which exhaust their smaller TLABs faster, will cause many more locks in the TLAB allocator than 4 threads with 32K TLABs.
To conclude, when you decrease the number of working threads or increase the memory available to the JVM, the threads can be given larger TLABs and the problem is solved.
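A tiny sketch of that assumed formula, just to make the arithmetic explicit (the formula is this answer's guess, not the JVM's real TLAB sizing logic):
class TlabSizeGuess {
    static final long EDEN_BYTES = 2L * 1024 * 1024; // the 2M eden from the example above

    static long tlabBytes(int workingThreads) {
        return EDEN_BYTES / (16L * workingThreads); // assumed: eden / (16 * working threads)
    }

    public static void main(String[] args) {
        System.out.println(tlabBytes(4));  // 32768 bytes = 32K per thread
        System.out.println(tlabBytes(16)); // 8192 bytes = 8K per thread
    }
}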
https://blogs.oracle.com/daviddetlefs/entry/tlab_sizing_an_annoying_little
This is almost certainly due to GC.
If you want to be sure add the following startup flags to your Java program:
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
and check stdout.
You will see lines containing "Full GC" including the time this took: during this time you will see 100% CPU usage.
The default garbage collector on multi-CPU or multi-core machines is the throughput collector, which collects the young generation in parallel but uses serial collection (in one thread) for the old generation.
So what is probably happening is that in your 100% CPU example a GC of the old generation is going on, which is done in one thread and so keeps only one core busy.
Suggestion for solution: use the concurrent mark-and-sweep collector, by using the flag -XX:+UseConcMarkSweepGC at JVM startup.
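Combining that suggestion with the logging flags above (CMS is only available up to Java 13; myProgram stands in for your main class):
java -XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps myProgram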
Tune the JVM
The core of the Java platform is the Java Virtual Machine (JVM). The entire Java application server runs inside a JVM. The JVM takes many startup parameters as command line flags, and some of them have great implications on the application performance. So, let's examine some of the important JVM parameters for server applications.
First, you should allocate as much memory as possible to the JVM using the -Xms (minimum memory) and -Xmx (maximum memory) flags. For instance, the -Xms1g -Xmx1g flags allocate 1GB of RAM to the JVM. If you don't specify a memory size in the JVM startup flags, the JVM would limit the heap memory to 64MB (512MB on Linux), no matter how much physical memory you have on the server! More memory allows the application to handle more concurrent web sessions, and to cache more data to improve the slow I/O and database operations. We typically specify the same amount of memory for both flags to force the server to use all the allocated memory from startup. This way, the JVM wouldn't need to dynamically change the heap size at runtime, which is a leading cause of JVM instability. For 64-bit servers, make sure that you run a 64-bit JVM on top of a 64-bit operating system to take advantage of all the RAM on the server. Otherwise, the JVM would only be able to utilize 2GB or less of memory space. 64-bit JVMs are typically only available for JDK 5.0.
With a large heap memory, the garbage collection (GC) operation could become a major performance bottleneck. It could take more than ten seconds for the GC to sweep through a multiple gigabyte heap. In JDK 1.3 and earlier, GC is a single threaded operation, which stops all other tasks in the JVM. That not only causes long and unpredictable pauses in the application, but it also results in very poor performance on multi-CPU computers since all other CPUs must wait in idle while one CPU is running at 100% to free up the heap memory space. It is crucial that we select a JDK 1.4+ JVM that supports parallel and concurrent GC operations. Actually, the concurrent GC implementation in the JDK 1.4 series of JVMs is not very stable. So, we strongly recommend you upgrade to JDK 5.0. Using the command line flags, you can choose from the following two GC algorithms. Both of them are optimized for multi-CPU computers.
If your priority is to increase the total throughput of the application and you can tolerate occasional GC pauses, you should use the -XX:+UseParallelGC and -XX:+UseParallelOldGC (the latter is only available in JDK 5.0) flags to turn on parallel GC. The parallel GC uses all available CPUs to perform the GC operation, and hence it is much faster than the default single-thread GC. It still pauses all other activities in the JVM during GC, however.
If you need to minimize the GC pause, you can use the -XX:+UseConcMarkSweepGC flag to turn on the concurrent GC. The concurrent GC still pauses the JVM and uses parallel GC to clean up short-lived objects. However, it cleans up long-lived objects from the heap using a background thread running in parallel with other JVM threads. The concurrent GC drastically reduces the GC pause, but managing the background thread does add to the overhead of the system and reduces the total throughput.
Furthermore, there are a few more JVM parameters you can tune to optimize the GC operations.
On 64-bit systems, the call stack for each thread is allocated 1MB of memory space. Most threads do not use that much space. Using the -XX:ThreadStackSize=256k flag, you can decrease the stack size to 256k to allow more threads.
Use the -XX:+DisableExplicitGC flag to ignore explicit application calls to System.gc(). If the application calls this method frequently, then we could be doing a lot of unnecessary GCs.
The -Xmn flag lets you manually set the size of the "young generation" memory space for short-lived objects. If your application generates lots of new objects, you might improve GCs dramatically by increasing this value. The "young generation" size should almost never be more than 50% of heap.
Since the GC has a big impact on performance, the JVM provides several flags to help you fine-tune the GC algorithm for your specific server and application. It's beyond the scope of this article to discuss GC algorithms and tuning tips in detail, but we'd like to point out that the JDK 5.0 JVM comes with an adaptive GC-tuning feature called ergonomics. It can automatically optimize GC algorithm parameters based on the underlying hardware, the application itself, and desired goals specified by the user (e.g., the max pause time and desired throughput). That saves you time trying different GC parameter combinations yourself. Ergonomics is yet another compelling reason to upgrade to JDK 5.0. Interested readers can refer to Tuning Garbage Collection with the 5.0 Java Virtual Machine. If the GC algorithm is misconfigured, it is relatively easy to spot the problems during the testing phase of your application. In a later section, we will discuss several ways to diagnose GC problems in the JVM.
Finally, make sure that you start the JVM with the -server flag. It optimizes the Just-In-Time (JIT) compiler to trade slower startup time for faster runtime performance. There are more JVM flags we have not discussed; for details on these, please check out the JVM options documentation page.
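Putting the article's flags together, a hedged example startup line for a JDK 5/6-era server (the heap and young-generation sizes and the jar name are placeholders; the stack size is written here as -Xss256k, the more common spelling of a 256 KB thread stack; pick either the parallel or the concurrent collector, not both):
java -server -Xms1g -Xmx1g -Xmn512m -Xss256k -XX:+DisableExplicitGC -XX:+UseParallelGC -XX:+UseParallelOldGC -jar myserver.jar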
Reference:
http://onjava.com/onjava/2006/11/01/scaling-enterprise-java-on-64-bit-multi-core.html
A total CPU utilisation of 100% implies that what you have written is effectively single-threaded, i.e. you may have any number of concurrent tasks, but due to locking only one can execute at a time.
If you have high IO you can get less than 400%, but it is unlikely you will get a round number for CPU utilisation; e.g. you might see 38%, 259%, 72%, 9% etc. (It is also likely to jump around.)
A common problem is locking the data you are using too often. You need to consider how the code could be rewritten so that locking is performed for the briefest period and over the smallest portion of the overall work. Ideally, you want to avoid locking altogether.
Using multiple threads means you can use up to that many CPUs, but if your code prevents it, you are likely to be better off (i.e. faster) writing the code single-threaded, as that avoids the overhead of locking.
Since you are using locking, it is possible that one of your four threads attains the lock but is then context switched - perhaps to run the GC thread. The other threads can't make progress since they can't attain the lock. When the thread context switches back, it completes the work in the critical section and relinquishes the lock to allow only one other thread to attain the lock. So now you have two threads active. It is possible that while the second thread executes the critical section the first thread does the next piece of data parallel work but generates enough garbage to trigger the GC and we're back where we started :)
P.S. This is just a best guess, since it is hard to figure out what is happening without any code snippets.
Increasing the size of the Java heap usually improves throughput until the heap no longer resides in physical memory. When the heap size exceeds the physical memory, the heap begins swapping to disk which causes Java performance to drastically decrease. Therefore, it is important to set the maximum heap size to a value that allows the heap to be contained within physical memory.
Since you give the JVM ~90% of the physical memory on the machines, the problem may be related to IO caused by memory paging and swapping when you try to allocate memory for more objects. Note that physical memory is also used by other running processes as well as the OS. Also, since the symptoms occur after a while, this is also an indication of memory leaks.
Try to find out how much physical memory is available (not already used) and allocate ~90% of the available physical memory to your JVM heap.
What happens if you leave the system running for an extended period of time? Does it ever come back to 400% CPU utilization?
Do you notice any disk activity when CPU is at 100% utilization?
Can you monitor which threads are running and which are blocked, and when?
Take a look at following link for tuning:
http://java.sun.com/performance/reference/whitepapers/tuning.html#section4
I am running an application server on Linux 64bit with 8 core CPUs and 6 GB memory.
The server must be highly responsive.
After some inspection I found that the application running on the server creates a rather huge amount of short-lived objects, and has only about 200~400 MB of long-lived objects (as long as there is no memory leak).
After reading http://java.sun.com/javase/technologies/hotspot/gc/gc_tuning_6.html
I use these JVM options
-server -Xms2g -Xmx2g -XX:MaxPermSize=256m -XX:NewRatio=1 -XX:+UseConcMarkSweepGC
Result: the minor GC takes 0.01 ~ 0.02 sec, the major GC takes 1 ~ 3 sec
the minor GC happens constantly.
How can I further improve or tune the JVM?
larger heap size? but will it take more time for GC?
larger NewSize and MaxNewSize (for young generation)?
other collector? parallel GC?
is it a good idea to let major GC take place more often? and how?
Result: the minor GC takes 0.01 ~ 0.02 sec, the major GC takes 1 ~ 3 sec the minor GC happens constantly.
Unless you are reporting pauses, I would say that the CMS collector is doing what you have asked it to do. By definition, CMS will use a larger percentage of the CPU than the Serial and Parallel collectors. This is the penalty you pay for low pause times.
If you are seeing 1 to 3 second pause times, I'd say that you need to do some tuning. I'm no expert, but it looks like you should start by reducing the value of CMSInitiatingOccupancyFraction from the default value of 92.
Increasing the heap size will improve the "throughput" of the GC. But if your problem is long pauses, increasing the heap size is likely to make the problem worse.
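For example, something along these lines (75 is only an illustrative starting point; tune it against your own GC logs):
-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly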
Careful .... GC can be a hairy subject if you are not cautious. Within any runtime (JVM for Java / CLR for .NET) there are several processes that take place. Generally there is an early-stage optimization of memory (young-generation garbage collection / young gen GC, and old-generation garbage collection / old gen GC). The young gen GC happens on a regular basis and is commonly responsible for your smaller pauses / hiccups. The old gen GC is normally what is going on when you see the long "stop the world" pauses.
Why, you might ask? The reason you get pauses with your runtime / JVM is that when the runtime does its cleanup of the heap, it has to go through what is called a phase change. It stops the threads running your application in order to mark and swap pointers to optimize your available memory. Young gen is faster, as it is mainly releasing objects that are only temporary. Old gen, however, evaluates all the objects on the heap, and when you run out of memory it will kick off to free up much-needed memory.
Why the caution? Old gen gets exponentially worse in pause time the more heap you use. At 2-4 GB of total heap size you should be fine on modern runtimes like Java 6 (JDK 1.6+). Once you go beyond that threshold you will see exponential increases in pause times. I have run into some clients that have to restart servers, as in some rare cases where a heap is large, GC pause times can take longer than a full restart.
There are some new tools out there that are pretty cool and can give you a leading edge in evaluating whether GC is your pain. jHiccup is one, and it is free from the Azul Systems website. At this time I think it is only for Linux though. They also have a JVM with a rebuilt GC algorithm that runs pauseless... but if you are on a single-server deployment with a non-critical app it may not be cost-effective (that one is not free).
To sum up: if your runtime / JVM / CLR heap is less than 2 GB, adding more memory will help. Be sure to give yourself some overhead. You never want to hit 100% of heap size / memory size if possible. That is when the long pauses are the longest. Give yourself an extra 20%+ memory over what you think you will need. That way you have room for the GC algorithms to move objects around for optimization. If you ever plan to go large... there is one tool that fixes the circa-1990 JVM technology (Azul Systems Zing JVM), but it is not free. They do offer an open-source tool to diagnose GC issues. The JVM (as I have tried it) also has a really cool thread-level visibility tool that lets you report on any leaks, bugs, or locks in production without overhead (some trick with offloading data the JVM already deals with and time-stamping). That has saved tons of dev test time... but again, not for small apps.
Stay below 4 GB. Give extra headroom. And if you want you can turn on these flags to monitor GC for Java / JVM:
java -verbose:gc myProgram
java -Xloggc:D:/log/myLogFile.log -XX:+PrintGCDetails myProgram
You may try some of the other collectors HotSpot offers; there is more than one.
If you are on Linux, go ahead and try the jHiccup tool as well. It is free.
You may be interested in trying the low-pause Garbage-First collector instead of concurrent mark-sweep (although it's not necessarily more performant for all collections, it's supposed to have a better worst case). It's enabled by -XX:+UseG1GC and is supposed to be really awesomesauce, but you may want to give it a thorough evaluation before using it in production. It has probably improved since, but it seems to have been a bit buggy a year ago, as seen in Experience with JDK 1.6.x G1 (“Garbage First”)
It is perfectly fine for the garbage collection to run in parallel with your program, if you have plenty of cpu, which you do.
What you want, is to make absolutely certain that you will not run into a scenario where the garbage collection PAUSES your main program.
Have you tried simply not setting any flags except saying you want the server VM (for the Sun JVM), and then putting your server under heavy load to see how it behaves? Only then can you see whether you get any improvements from tinkering with options.
This actually sounds like a throughput app and should probably use the throughput collector. I would balance the size of the new gen making it big enough to not GC too often and small enough to prevent long pauses. 20ms sounds like a long minor GC to me. I also suspect your survivor space is set too large and is just being wasted. If you don't have much surviving to old gen, you shouldn't have that much surviving your minor GCs.
In the end, you should use jvmstat and VisualGC to really get a feel for how your app is using memory.
For a highly responsive server application, I think you want to see the major GC happen less frequently. Here is a list of parameters that would help (a combined example follows the list).
-XX:+CMSParallelRemarkEnabled
-XX:+CMSScavengeBeforeRemark
-XX:+UseCMSInitiatingOccupancyOnly
-XX:CMSInitiatingOccupancyFraction=50
-XX:CMSWaitDuration=300000
-XX:GCTimeRatio=40
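These are CMS tuning flags, so they go alongside the -XX:+UseConcMarkSweepGC flag already used in the question, for example (app.jar is a placeholder):
java -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:+CMSScavengeBeforeRemark -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=50 -XX:CMSWaitDuration=300000 -XX:GCTimeRatio=40 -jar app.jar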
Larger heap size may not help on low pause, as long as your app doesn't run out of memory.
Larger NewSize and MaxNewSize would help throughput, but may not help with low pause. If you choose to take this approach, you may consider giving the GC threads more execution time by setting -XX:GCTimeRatio lower. The point is to remember to take a holistic view when tuning the JVM.
I think previous posters have missed something very obvious: the permanent generation size is too low. If the system uses 200 to 400 MB of permanent generation, then it may be best to set the max perm gen to 400 MB. The PermGen size should also be set to the same value. You will then never run out of permanent generation space.
Currently it looks like the JVM has to spend a lot of time moving objects in and out of the permanent generation. This can take time. The JVM tries to allocate contiguous memory areas to Java objects; this speeds up memory access due to hardware-level features. In order to do that, it is very helpful to have plenty of buffer in memory. If the permanent generation is almost full, newly discovered permanent objects must be split or existing objects must be shuffled. This is what triggers a full GC, as well as causing long full GC pauses.
The question states that the permanent generation size has already been measured; if this has not been done, it should be measured using a tool. Such tools process logs generated by the JVM with the verbose GC option turned on.
All the mark-and-sweep options listed above may not be needed with this basic improvement.
People throw out GC options as solutions without evaluating how mature they have proven to be in actual use.