Garbage collection tuning for a production application - java

I've been tasked with tuning a production application that consists of a Spring MVC REST interface serving large (~0 MB - 100 MB) JSON documents from a GemFire in-memory cache backend. The application runs on a CentOS server inside Tomcat 7 on JDK 1.6. We realized that the application needed to be tuned because we were seeing frequent stop-the-world old generation garbage collections, which would eventually lead to java.lang.OutOfMemoryError: GC overhead limit exceeded errors if left unattended.
Through some trial and error and monitoring I've managed to tune the application with these parameters:
-Xms20g
-Xmx20g
-XX:PermSize=256m
-XX:MaxPermSize=256m
-XX:NewSize=8g
-XX:MaxNewSize=8g
-XX:SurvivorRatio=8
-XX:+DisableExplicitGC
-XX:+UseConcMarkSweepGC
-XX:+UseParNewGC
-XX:CMSInitiatingOccupancyFraction=70
The garbage collection behavior that I'm seeing now (48 hours under heavy test load) is that eden space collection is happening about once every 10 seconds and lasting about .04 seconds. The old generation is not growing at all after 48 hours and there have been 0 collections in that space.
My question is should I be concerned about not having the old generation garbage collected? Overall does this look like a healthy tuning?
Edit:
For anyone who cares, my GC log is available here: http://filebin.ca/2U8awo1KTS1D/udf-gc.log.0

My question is should I be concerned about not having the old generation garbage collected?
The logs look fine. Given the trends, the old gen occupancy grows very slowly, so it will take several days until it becomes full enough for a concurrent marking cycle to be initiated.
Overall does this look like a healthy tuning?
It seems like you're giving it much more memory than it needs.
Old gen occupancy is around 2G out of 12G. This means you could probably shrink the old gen to 4G and it would still take many hours before a concurrent cycle gets started.
Most young objects only live to age 1 (out of 15) in the young generation. This means the young generation could be shrunk too without increasing object promotion too much.
-XX:CMSInitiatingOccupancyFraction=70
That should be combined with -XX:+UseCMSInitiatingOccupancyOnly; otherwise the JVM treats the configured fraction only as an initial hint and then falls back to its own heuristics.
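In other words, something along these lines (a sketch; the exact fraction is workload-dependent and should be validated against your own GC logs):
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=70
-XX:+UseCMSInitiatingOccupancyOnly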

Tuning garbage collection is no different from generic performance tuning in the sense that, without requirements, you can (for non-trivial applications at least) keep improving forever. At some point the improvements no longer matter for the practical use case. That is why you should have goals in place.
The goals regarding GC should be derived from the generic performance requirements. These in turn usually describe three dimensions:
Latency. Or more precisely, the acceptable latency distribution per service published by the application. For example: 99% of login() operations must complete under 500ms and the worst case cannot exceed 2,500ms.
Throughput. How many operations per time unit must be completed. This is tougher to measure for large monoliths, but if you are running microservices you can express it as, say, "1,000 login operations processed per second".
Capacity. Adding more resources and scaling out will improve the situation, but for practical purposes things such as the monthly AWS bill set limits in this regard.
Having these requirements in place, you can start deriving GC goals from them and, if necessary, optimizing further. The company I am affiliated with recently published a rather thorough handbook about GC tuning, so you can find more detail in the GC tuning sections of that handbook.

Related

Design issue - Physical memory size and Full Garbage Collection

We are designing a new software system architecture, and I am working as the project manager,
but there is disagreement on one issue within our team.
Our architect says: "System memory should be kept as small as possible, because when a Full GC occurs it takes a long time (JVM)."
I am not sure about that opinion.
When setting up system memory, what level of Full GC(Garbage Collection) time should be reviewed?
How long will it take if Full GC occurs in a 16GB memory environment?
You (or rather your architects) might be worrying about something that may not be a problem for your throughput to begin with. Up until Java 9 the default collector is ParallelGC, and there are dozens and dozens of applications that have never changed it and are happy with the pause times (and that collector pauses the world every time). So the only real answer is: measure. Enable GC logs and look into them.
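For example, something like the following is usually enough to get started (file names are placeholders; the -Xlog form applies to JDK 9 and later, the Print* flags to JDK 8 and earlier):
-Xlog:gc*:file=gc.log:time
-XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:gc.log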
On the other hand, if you choose a concurrent collector (you should start with G1), having enough breathing room for it in the heap is crucial. This matters even more for Shenandoah and ZGC, since they do almost everything concurrently. Every time the GC runs a concurrent phase it works via so-called "barriers", which are basically interceptors for accesses to objects in the heap. The structures these barriers rely on need space, and if you starve the collector of that space it is not going to be happy.
In rather simple words - the more "free" space in the heap, the better your GC will perform.
When setting up system memory, what level of Full GC(Garbage Collection) time should be reviewed?
This is not the correct question. If you are happy with your target response times, it does not matter. If you are not, you start analyzing GC logs and work out what is causing your delays (this is not going to be trivial, though).
How long will it take if Full GC occurs in a 16GB memory environment?
It depends on the collector, on the Java version, and on the type of objects to be collected; there is no easy answer. For Shenandoah and ZGC this is largely irrelevant, since they do not depend on the size of the heap to be scanned. For G1 it is most probably going to be in the range of a "few seconds". But if you have WeakReferences and finalizers and a Java version that is known to handle these not so well, the time to collect is going to be long.
How long will it take if Full GC occurs in a 16GB memory environment?
On a smallish heap like that, the ballpark figure is around 10 seconds, I guess.
But it's not what you should consider.
When setting up system memory, what level of Full GC(Garbage Collection) time should be reviewed?
All of them. Whenever a full GC occurs it should be reviewed, if your application is latency-critical. This is what you should consider.
Full GC is a failure.
Failure on multiple levels:
to address the memory size available to the application
to address the GC type you use
to address the types of workloads
to address graceful degradation under load
and the list goes on
Concurrent GC implicitly relies on the simple fact that it can collect faster than the application allocates.
When allocation pressure becomes overwhelming, the GC has two options: slow down allocations or stop them altogether.
And when it stops, you know, all hell breaks loose: futures time out, clusters break apart, and engineers fear and loathe large heaps for the rest of their lives...
It's a common scenario for applications that evolve for years with increasing complexity and load, and without an overhaul to accommodate a changing world.
It doesn't have to be this way though.
When you build a new application from the ground up you can design it with performance and latency in mind, with scalability and graceful degradation instead of heap size and GC times.
You can split workloads that are not latency-critical but memory-heavy onto a different VM and run it under good ol' ParallelGC, and it will outperform any concurrent GC in both throughput and memory overhead.
You can run latency-critical tasks under a modern state-of-the-art GC like Shenandoah and have sub-second collection pauses on heaps of several TB, if you don't mind some 30% memory overhead and a considerable amount of CPU overhead.
Let the application and the requirements dictate your heap size, not the engineers.

Java: how to trace/monitor GC times for the CMS garbage collector?

I'm having trouble figuring out a way to monitor the JVM GC for memory exhaustion issues.
With the serial GC, we could just look at the full GC pause times and have a pretty good notion if the JVM was in trouble (if it took more than a few seconds, for example).
CMS seems to behave differently.
When querying lastGcInfo from the java.lang:type=GarbageCollector,name=ConcurrentMarkSweep MXBean (via JMX), the reported duration is the sum of all GC steps and is usually several seconds long. This does not indicate an issue with GC; to the contrary, I've found that too-short GC times are usually more of an indicator of trouble (which happens, for example, if the JVM goes into a CMS-concurrent-mark-start -> concurrent mode failure loop).
I've tried jstat as well, which gives the cumulative time spent garbage collecting (unsure if it's for old or new gen GC). This can be graphed, but it's not trivial to use for monitoring purposes. For example, I could parse jstat -gccause output and calculate differences over time, and trace+monitor that (e.g. the amount of time spent GC'ing over the last X minutes).
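For illustration, here is a rough sketch of the kind of delta computation I have in mind, using only the standard GarbageCollectorMXBean (the class name and sampling interval are arbitrary):

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Sample the cumulative GC time and report the delta per interval.
public class GcTimeSampler {
    public static void main(String[] args) throws InterruptedException {
        long previousTotalMs = 0;
        while (true) {
            long totalMs = 0;
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                long t = gc.getCollectionTime(); // cumulative milliseconds, -1 if undefined
                if (t > 0) {
                    totalMs += t;
                }
            }
            System.out.println("GC time in last interval: " + (totalMs - previousTotalMs) + " ms");
            previousTotalMs = totalMs;
            Thread.sleep(60000); // sample once per minute
        }
    }
}

This gives me a number I can push to a monitoring platform, but on its own it still doesn't tell me whether CMS is keeping up.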
I'm using the following JVM arguments for GC logging:
-Xloggc:/xxx/gc.log
-XX:+PrintGCDetails
-verbose:gc
-XX:+PrintGCDateStamps
-XX:+PrintReferenceGC
-XX:+PrintPromotionFailure
Parsing gc.log is also an option if nothing else is available, but the optimal solution would be to have a java-native way to get at the relevant information.
The information must be machine-readable (to send to monitoring platforms) so visual tools are not an option. I'm running a production environment with a mix of JDK 6/7/8 instances, so version-agnostic solutions are better.
Is there a simple(r) way to monitor CMS garbage collection? What indicators should I be looking at?
Fundamentally, one wants two things from the CMS concurrent collector:
the throughput of the concurrent cycle to keep up with the promotion rate, i.e. the objects surviving into the old gen per unit of time
enough room in the old generation for objects promoted during a concurrent cycle
So let's say the IHOP is fixed at 70%; then you are probably approaching a problem when occupancy reaches >90% at some point. Maybe even earlier if you do some large allocations that don't fit into the young generation or outlive it (that's entirely application-specific).
Additionally, you usually want it to spend more time outside the concurrent cycle than in it. That depends on how tightly you tune the collector; in principle you could have the concurrent cycle running almost all the time, but then you have very little throughput margin and burn a lot of CPU time on concurrent collections.
If you really, really want to avoid even the occasional Full GC then you'll need even more safety margin due to fragmentation (CMS is non-compacting). I think this can't be monitored via MX beans; you'll have to enable some CMS-specific GC logging to get fragmentation info.
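If you do want something machine-readable for the occupancy side, a minimal sketch using the standard MemoryPoolMXBean could look like this (the pool-name matching and the 90% threshold are assumptions you would tune to your own setup; with CMS the pool is usually called "CMS Old Gen"):

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Warn when old gen occupancy climbs well past the configured IHOP.
public class OldGenOccupancyCheck {
    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName();
            if (!name.contains("Old Gen") && !name.contains("Tenured")) {
                continue; // only interested in the tenured generation pool
            }
            MemoryUsage usage = pool.getUsage();
            if (usage == null || usage.getMax() <= 0) {
                continue; // max can be undefined; skip rather than divide by -1
            }
            double occupancy = (double) usage.getUsed() / usage.getMax();
            System.out.printf("%s occupancy: %.1f%%%n", name, occupancy * 100);
            if (occupancy > 0.90) {
                System.err.println("WARNING: old gen occupancy above 90%; the concurrent cycle may be losing ground");
            }
        }
    }
}

Fragmentation, as said, is not visible this way; for that you are stuck with CMS-specific GC logging.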
For viewing GC logs:
If you have already enabled GC logging, I suggest GCViewer - this is an open source tool that can be used to view GC logs and look at parameters like throughput, pause times etc.
For profiling:
I don't see a JDK version mentioned in the question. For JDK 6 I would recommend VisualVM to profile the application; for JDK 7/8 I would suggest Mission Control. You can find these in the JDK's bin folder. These tools can be used to see how the application performs over a period of time and during GC (you can trigger a GC via the VisualVM UI).

Explain observed JVM Garbage Collection on JBoss Server

With VisualVM I am observing the following heap usage on a JBoss server:
The server is started with the following (relevant) JVM options:
-Xrs -Xms3072m -Xmx3072m -XX:MaxPermSize=512m -XX:+UseParallelOldGC -Dsun.rmi.dgc.client.gcInterval=3600000 -Dsun.rmi.dgc.server.gcInterval=3600000
And we currently also have enabled GC logging:
-XX:+PrintGC -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -Xloggc:log\gc.log
Basically I am happy with the observed pattern, since it looks like we don't have any memory leaks (the pattern repeats itself over days).
However I am wondering if there is room for optimization?
First of all, I don't understand why garbage collection already kicks in when the heap usage reaches about 2GB. It looks to me like it could kick in later, since the heap has 3GB available?
Further more I would be interested in tips regarding the observed heap usage pattern and the used JVM options:
Does the observed pattern allow me to draw conclusions about the used GC strategy (UseParallelOldGC)? Is this strategy the right one, or should I try another one given the observed heap usage?
Can I optimize the GC process, so that the full heap size (3GB) is used?
Right now it looks like the full 3GB is never used; should I reduce Xms/Xmx to 2.5GB?
Are there any obvious GC optimizations that I am missing? Like tuning -XX:NewSize or -XX:NewRatio?
Any other tips that come to mind?
Thanks!
I'd say the GC behaviour in your screen-shot looks 'normal'.
You'd usually want major collections to trigger before the heap space gets too full, or it would be very easy to encounter OutOfMemoryErrors in a number of scenarios.
Also, are you aware that Java's heap space is divided into distinct areas for new (eden), current (survivor) and old (tenured) objects?
This answer provides some excellent information on the subject, so I won't repeat it here:
How is the java memory pool divided?
Very basically, each area of the heap triggers its own collections. The eden space is normally collected often and 'quickly'; the survivor and tenured spaces are usually larger and take longer to collect.
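If you're curious which pools your particular JVM and collector actually define, a small sketch using the standard management API will list them at runtime (with -XX:+UseParallelOldGC you would typically see names like "PS Eden Space", "PS Survivor Space" and "PS Old Gen"):

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

// Print each memory pool with its current usage.
public class ListMemoryPools {
    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getUsage() == null) {
                continue; // usage can be unavailable for some pools
            }
            System.out.printf("%-25s used=%,d bytes, max=%,d bytes%n",
                    pool.getName(), pool.getUsage().getUsed(), pool.getUsage().getMax());
        }
    }
}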
Could you reduce your heap size based on the above graph?
Yes. However, your current configuration allows your application some breathing room, if it's ever likely to encounter busier periods or spikes in load.
Can you optimize GC?
Yes, but there are no magic settings. The first question is: do you really need to? If your application is just a non-interactive 'processor', I really wouldn't bother. If you have a genuine need for a low-pause application, then there are some tweaks available. The trade-off is generally that you'll need more resources to achieve the same result.
My experience is that low-pause JVM configurations have a very noticeable fall-off point when load increases. If your application is usually fairly idle, but you expect a 'quick' response when it is called, low pause may be appropriate. On a busier system, with peaks in traffic / load, you may prefer a more traditional approach.
Summary
In any case, don't be tempted to make arbitrary changes to 'improve' your configuration. Be scientific and professional about your approach.
If you don't have production metrics available, consider using tools like Apache JMeter to build load test scenarios to simulate the typical live load on your application, increased load (by say, 10%, 20% or 50% etc.) and intermittent peak load.
Use metrics for both the GC and the application, measuring at least:
Average throughput.
Peak throughput.
Average load (CPU and memory).
Peak load.
Application pause times (total and individual pauses).
Time spent performing collections.
Reliability (OOME's etc.).
Once you're happy that you've recorded an accurate benchmark of the performance of your application with its current configuration, only then should you start making any changes.
Obviously, record your configuration and its metrics. Document any changes and then perform the same benchmark tests. Then you'll be able to see any performance gain (or loss) and any trade-off that may be applicable.
Here's some further reading from Oracle on the subject to get you started:
Java SE 6 Virtual Machine Garbage Collection Tuning

Tuning JVM (GC) for high responsive server application

I am running an application server on Linux 64bit with 8 core CPUs and 6 GB memory.
The server must be highly responsive.
After some inspection I found that the application running on the server creates a rather huge number of short-lived objects, and has only about 200~400 MB of long-lived objects (as long as there is no memory leak).
After reading http://java.sun.com/javase/technologies/hotspot/gc/gc_tuning_6.html
I use these JVM options
-server -Xms2g -Xmx2g -XX:MaxPermSize=256m -XX:NewRatio=1 -XX:+UseConcMarkSweepGC
Result: the minor GC takes 0.01 ~ 0.02 sec, the major GC takes 1 ~ 3 sec
the minor GC happens constantly.
How can I further improve or tune the JVM?
larger heap size? but will it take more time for GC?
larger NewSize and MaxNewSize (for young generation)?
other collector? parallel GC?
is it a good idea to let major GC take place more often? and how?
Result: the minor GC takes 0.01 ~ 0.02 sec, the major GC takes 1 ~ 3 sec the minor GC happens constantly.
Unless you are reporting pauses, I would say that the CMS collector is doing what you have asked it to do. By definition, CMS will use a larger percentage of the CPU than the Serial and Parallel collectors. This is the penalty you pay for low pause times.
If you are seeing 1 to 3 second pause times, I'd say that you need to do some tuning. I'm no expert, but it looks like you should start by reducing the value of CMSInitiatingOccupancyFraction from the default value of 92.
Increasing the heap size will improve the "throughput" of the GC. But if your problem is long pauses, increasing the heap size is likely to make the problem worse.
Careful... GC can be a hairy subject if you are not cautious. Within any runtime (JVM for Java / CLR for .NET) there are several processes that take place. Generally there are two stages of memory reclamation: young generational garbage collection (young gen GC) and old generational garbage collection (old gen GC). The young gen GC happens on a regular basis and is commonly responsible for your smaller pauses / hiccups. The old gen GC is normally what is going on when you see the long "stop the world" pauses.
Why, you might ask? The reason you get pauses with your runtime / JVM is that when the runtime does its cleanup of the heap it has to go through what is called a phase change. It stops the threads running your application in order to mark and swap pointers to optimize your available memory. Young gen is faster as it is mainly releasing objects that are only temporary. Old gen, however, evaluates all the objects on the heap, and when you run out of memory it will kick in to free up much-needed memory.
Why the caution? Old gen gets exponentially worse in pause time the more heap you use. At 2-4 GB of total heap size you should be fine on modern runtimes like Java 6 (JDK 1.6+). Once you go beyond that threshold you will see exponential increases in pause times. I have run into some clients that have to restart servers; in some rare cases where a heap is large, GC pause times can take longer than a full restart.
There are some new tools out there that are pretty cool and can give you a leading edge in evaluating whether GC is your pain point. jHiccup is one, and it is free from the Azul Systems website. At this time I think it is only for Linux though. They also have a JVM with a rebuilt GC algorithm that runs pauseless... but if you are on a single-server deployment with a non-critical app it may not be cost effective (that one is not free).
To sum up: if your runtime / JVM / CLR heap is less than 2 GB, adding more memory will help. Be sure to give yourself some overhead. You never want to hit 100% of heap size / memory size if possible; that is when the long pauses are the longest. Give yourself an extra 20%+ memory over what you think you will need, so that the GC algorithms have room to move objects around for optimization. If you ever plan to go large... there is one tool that fixes the circa-1990 JVM technology (Azul Systems' Zing JVM), but it is not free. They do offer an open source tool to diagnose GC issues. The JVM (as I have tried it) also has a really cool thread-level visibility tool that lets you report on any leaks, bugs, or locks in production without overhead (some trick with offloading data the JVM already deals with and time-stamping it). That has saved tons of dev test time... but again, not for small apps.
Stay below 4 GB. Give yourself extra headroom. And if you want, you can turn on these flags to monitor GC for Java / the JVM:
java -verbose:gc myProgram
java -Xloggc:D:/log/myLogFile.log -XX:+PrintGCDetails myProgram
You may try some of the other collectors Hotspot uses. There are more than one.
If you are on Linux go ahead and try the JHiccup tool as well. It is free.
You may be interested in trying the low-pause Garbage-First collector instead of concurrent mark-sweep (although it's not necessarily more performant for all collections, it's supposed to have a better worst case). It's enabled by -XX:+UseG1GC and is supposed to be really awesomesauce, but you may want to give it a thorough evaluation before using it in production. It has probably improved since, but it seems to have been a bit buggy a year ago, as seen in Experience with JDK 1.6.x G1 (“Garbage First”)
It is perfectly fine for the garbage collection to run in parallel with your program, if you have plenty of cpu, which you do.
What you want, is to make absolutely certain that you will not run into a scenario where the garbage collection PAUSES your main program.
Have you tried simply not specifying any flags except saying you want the server VM (for the Sun JVM), and then putting your server under heavy load to see how it behaves? Only then can you see whether you get any improvements from tinkering with options.
This actually sounds like a throughput app and should probably use the throughput collector. I would balance the size of the new gen making it big enough to not GC too often and small enough to prevent long pauses. 20ms sounds like a long minor GC to me. I also suspect your survivor space is set too large and is just being wasted. If you don't have much surviving to old gen, you shouldn't have that much surviving your minor GCs.
In the end, you should use jvmstat and VisualGC to really get a feel for how your app is using memory.
For a highly responsive server application, I think you want to see major GCs happen less frequently. Here is a list of parameters that would help:
-XX:+CMSParallelRemarkEnabled
-XX:+CMSScavengeBeforeRemark
-XX:+UseCMSInitiatingOccupancyOnly
-XX:CMSInitiatingOccupancyFraction=50
-XX:CMSWaitDuration=300000
-XX:GCTimeRatio=40
A larger heap size may not help with low pauses, as long as your app doesn't run out of memory.
Larger NewSize and MaxNewSize would help throughput, but may not help with low pauses. If you choose to take this approach, you may consider giving the GC threads more execution time by setting -XX:GCTimeRatio lower (-XX:GCTimeRatio=n targets spending at most roughly 1/(1+n) of total time in GC, so 40 corresponds to about 2.4%). The point is to remember to take a holistic view when tuning the JVM.
I think previous posters have missed something very obvious: the permanent generation size is too low. If the system uses 200 to 400 MB of permanent generation, then it may be best to set the max perm gen to 400 MB. The PermGen size should also be set to the same value. You will then never run out of permanent generation space.
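For example (a sketch; the 400m figure should come from actually measuring perm gen usage, not from the heap numbers):
-XX:PermSize=400m -XX:MaxPermSize=400m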
Currently, it looks like the JVM has to spend a lot of time moving objects into and out of the permanent generation. This can take time. The JVM tries to allocate contiguous memory areas to Java objects; this speeds up memory access due to hardware-level features. In order to do that, it is very helpful to have plenty of buffer in memory. If the permanent generation is almost full, newly discovered permanent objects must be split or existing objects must be shuffled. This is what triggers a full GC, as well as causing long full GC pauses.
The question states that the Permanent Generation size has already been measured- if this has not been done, it should be measured using a tool. These tools process logs generated by the JVM with the verboseGC option turned on.
All the mark-and-sweep options listed above may not be needed with this basic improvement.
People throw GC options as solutions without evaluating how mature they have proven to be in actual use.

Reducing JVM pause time > 1 second using UseConcMarkSweepGC

I'm running a memory-intensive app on a machine with 16Gb of RAM, an 8-core processor, and Java 1.6, all running on CentOS release 5.2 (Final). Exact JVM details are:
java version "1.6.0_10"
Java(TM) SE Runtime Environment (build 1.6.0_10-b33)
Java HotSpot(TM) 64-Bit Server VM (build 11.0-b15, mixed mode)
I'm launching the app with the following command line options:
java -XX:+UseConcMarkSweepGC -verbose:gc -server -Xmx10g -Xms10g ...
My application exposes a JSON-RPC API, and my goal is to respond to requests within 25ms. Unfortunately, I'm seeing delays up to and exceeding 1 second and it appears to be caused by garbage collection. Here are some of the longer examples:
[GC 4592788K->4462162K(10468736K), 1.3606660 secs]
[GC 5881547K->5768559K(10468736K), 1.2559860 secs]
[GC 6045823K->5914115K(10468736K), 1.3250050 secs]
Each of these garbage collection events was accompanied by a delayed API response of very similar duration to the length of the garbage collection shown (to within a few ms).
Here are some typical examples (these were all produced within a few seconds):
[GC 3373764K->3336654K(10468736K), 0.6677560 secs]
[GC 3472974K->3427592K(10468736K), 0.5059650 secs]
[GC 3563912K->3517273K(10468736K), 0.6844440 secs]
[GC 3622292K->3589011K(10468736K), 0.4528480 secs]
The thing is that I thought the UseConcMarkSweepGC would avoid this, or at least make it extremely rare. On the contrary, delays exceeding 100ms are occurring almost once a minute or more (although delays of over 1 second are considerably rarer, perhaps once every 10 or 15 minutes).
The other thing is that I thought only a FULL GC would cause threads to be paused, yet these don't appear to be full GCs.
It may be relevant to note that most of the memory is occupied by a LRU memory cache that makes use of soft references.
Any assistance or advice would be greatly appreciated.
First, check out the Java SE 6 HotSpot[tm] Virtual Machine Garbage Collection Tuning documentation, if you haven't already done so. This documentation says:
the concurrent collector does most of its tracing and sweeping work with the application threads still running, so only brief pauses are seen by the application threads. However, if the concurrent collector is unable to finish reclaiming the unreachable objects before the tenured generation fills up, or if an allocation cannot be satisfied with the available free space blocks in the tenured generation, then the application is paused and the collection is completed with all the application threads stopped. The inability to complete a collection concurrently is referred to as concurrent mode failure and indicates the need to adjust the concurrent collector parameters.
and a little bit later on...
The concurrent collector pauses an application twice during a concurrent collection cycle.
I notice that those GCs don't seem to be freeing very much memory. Perhaps many of your objects are long lived? You may wish to tune the generation sizes and other GC parameters. 10 Gig is a huge heap by many standards, and I would naively expect GC to take longer with such a huge heap. Still, 1 second is a very long pause time and indicates either something is wrong (your program is generating a large number of unneeded objects or is generating difficult-to-reclaim objects, or something else) or you just need to tune the GC.
Usually, I would tell someone that if they have to tune GC then they have other problems they need to fix first. But with an application of this size, I think you fall into the territory of "needing to understand GC much more than the average programmer."
As others have said, you need to profile your application to see where the bottleneck is. Is your PermGen too large for the space allocated to it? Are you creating unnecessary objects? jconsole works to at least show a minimum of information about the VM. It's a starting point. As others have indicated however, you very likely need more advanced tools than this.
Good luck.
Since you mention your desire to cache, I'm guessing that most of your huge heap is occupied by that cache. You might want to limit the size of the cache so that you are sure it never attempts to grow large enough to fill the tenured generation. Don't rely on SoftReference alone to limit the size. As the old generation fills with soft references, older references will be cleared and become garbage. New references (perhaps to the same information) will be created, but cleared quickly because free space is in short supply. Eventually, the tenured space is full of garbage and needs to be cleaned.
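A minimal sketch of such a hard limit, using nothing but LinkedHashMap in access order (counting entries rather than bytes is a simplification, and the class name is made up):

import java.util.LinkedHashMap;
import java.util.Map;

// Evicts the least-recently-used entry once maxEntries is exceeded,
// instead of waiting for the GC to clear SoftReferences under memory pressure.
public class BoundedLruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public BoundedLruCache(int maxEntries) {
        super(16, 0.75f, true); // accessOrder = true gives LRU iteration order
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }
}

Sizing the limit so that the cache comfortably fits in the tenured generation is what keeps it from filling the old gen with garbage.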
Consider adjusting the -XX:NewRatio setting too. The default is 2 (an old-to-young ratio of 2:1), meaning that one-third of the heap is allocated to the new generation. For a large heap, this is almost always too much. You might want to try something like 9, which would keep 9 GB of your 10 GB heap for the old generation.
Turns out that part of the heap was getting swapped out to disk, so that garbage collection had to pull a bunch of data off the disk back into memory.
I resolved this by setting Linux's "swappiness" parameter to 0 (so that it wouldn't swap data out to disk).
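For reference, that is typically done like this on Linux (the persistence step varies by distribution):
sysctl -w vm.swappiness=0
echo "vm.swappiness = 0" >> /etc/sysctl.conf    # make it survive a reboot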
Here are some things I have found which could be significant.
JSON-RPC can generate a lot of objects. Not as many as XML-RPC, but still something to watch for. In any case you do appear to be generating as much as 100 MB of objects per second, which means your GC is running a high percentage of the time and is likely to be adding to your random latency. Even though the GC is concurrent, your hardware/OS is very likely to exhibit non-ideal random latency under load.
Have a look at your memory bank architecture. On Linux the command is numactl --hardware. If your VM is being split across more than one memory bank, this will increase your GC times significantly. (It will also slow down your application, as these accesses can be significantly less efficient.) The harder you work the memory subsystem, the more likely the OS will have to shift memory around (often in large amounts), and you get dramatic pauses as a result (100 ms is not surprising). Don't forget your OS does more than just run your app.
Consider compacting/reducing the memory consumption of your cache. If you are using multiple GB of cache it is worth looking at ways to cut memory consumption further than you have already.
I suggest you profile your app with memory allocation tracing AND cpu sampling on at the same time. This can yield very different results and often points to the cause of these sort of problems.
Using these approaches, the latency of an RPC call can be reduced to below 200 microseconds and the GC times reduced to 1-3 ms, affecting less than 1 in 300 calls.
I'd also suggest GCViewer and a profiler.
Some places to start looking:
https://visualvm.dev.java.net/
http://java.sun.com/j2se/1.5.0/docs/tooldocs/share/jstat.html
http://www.javaperformancetuning.com/tools/gcviewer/index.shtml
Also, I'd run the code through a profiler. I like the one in NetBeans but there are others as well. You can view the GC behaviour in real time. VisualVM does that as well... but I haven't run it yet (I've been looking for a reason to, but haven't had the time or the need yet).
I haven't personally used such a huge heap but I've experienced very low latency in general using following switches for Oracle/Sun Java 1.6.x:
-Xincgc -XX:+UseConcMarkSweepGC -XX:CMSIncrementalSafetyFactor=50
-XX:+UseParNewGC
-XX:+CMSConcurrentMTEnabled -XX:ConcGCThreads=2 -XX:ParallelGCThreads=2
-XX:CMSIncrementalDutyCycleMin=0 -XX:CMSIncrementalDutyCycle=5
-XX:GCTimeRatio=90 -XX:MaxGCPauseMillis=20 -XX:GCPauseIntervalMillis=1000
The important parts are, in my opinion, the use of CMS for the tenured generation and ParNewGC for the young generation. In addition, this adds a pretty big safety factor for CMS (50% instead of the default 10%) and requests short pause times. As you're targeting a 25 ms response time, I'd try setting -XX:MaxGCPauseMillis to an even smaller value. You could even try to use more than two cores for concurrent GC, but I would guess that is not worth the CPU usage.
You should probably also check the HotSpot JVM GC cheat sheet.
