FixedThreadPool not parallel enough - java

I create a fixed thread pool using forPool = Executors.newFixedThreadPool(poolSize); where poolSize is initialized to the number of cores on the processor (let's say 4). In some runs it works fine and the CPU utilisation is consistently at 400%.
But sometimes the usage drops to 100% and never rises back to 400%. I have thousands of tasks scheduled, so the problem is not that. I catch every exception, but no exception is thrown. The issue is random and not reproducible, but very much present. The tasks are data-parallel operations. At the end of each thread there is a synchronised access to update a single variable. It is highly unlikely that I have a deadlock there. In fact, once I spot this issue, if I destroy the pool and create a fresh one of size 4, it still runs at only 100% usage. There is no I/O.
This seems counter-intuitive given Java's assurance of a "FixedThreadPool". Am I reading the guarantee wrong? Is only concurrency guaranteed, and not parallelism?
And to the question: have you come across this issue and solved it? If I want parallelism, am I doing the correct thing?
Thanks!
On doing a thread dump:
I find that there are 4 threads, all doing their parallel operations, but the usage is still only ~100%. Here are the thread dumps at 400% usage and at 100% usage. I set the number of threads to 16 to trigger the scenario. It runs at 400% for a while and then drops to 100%. When I use 4 threads, it runs at 400% and only rarely drops to 100%. This is the parallelization code.
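The linked parallelization code is not reproduced here, so purely as a point of reference, a minimal sketch of the setup described above might look like the following. The class and method names (Operation-style work, doDataParallelWork, addToResult) and the work done per task are assumptions, not the actual code:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelRunner {

    private double result; // the single shared variable updated at the end of each task

    private synchronized void addToResult(double partial) {
        result += partial;  // the synchronised access mentioned in the question
    }

    private synchronized double getResult() {
        return result;
    }

    // Stand-in for the real data-parallel work: CPU-bound, no I/O.
    private double doDataParallelWork(int chunk) {
        double sum = 0;
        for (int i = 0; i < 1_000_000; i++) {
            sum += Math.sqrt(chunk + i);
        }
        return sum;
    }

    public double runAll(int taskCount) throws InterruptedException {
        int poolSize = Runtime.getRuntime().availableProcessors(); // e.g. 4
        ExecutorService forPool = Executors.newFixedThreadPool(poolSize);
        for (int i = 0; i < taskCount; i++) {
            final int chunk = i;
            forPool.submit(() -> {
                try {
                    addToResult(doDataParallelWork(chunk));
                } catch (Exception e) {
                    e.printStackTrace(); // every exception is caught, as in the question
                }
            });
        }
        forPool.shutdown();
        forPool.awaitTermination(1, TimeUnit.HOURS);
        return getResult();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(new ParallelRunner().runAll(5_000));
    }
}

With purely CPU-bound tasks like this, all cores should stay busy until the queue drains, which is why a drop to 100% points at something outside the pool itself.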
****** [MAJOR UPDATE] ******
It turns out that if I give the JVM a huge amount of memory to play with, the issue goes away and the performance does not drop. But I don't know how to use this information to fix the underlying problem. Help!

Given the fact that increasing your heap size makes the problem 'go away' (perhaps not permanently), the issue is probably related to GC.
Is it possible that the Operation implementation is generating some state, stored on the heap, between calls to pOperation.perform(...)? If so, then you might have a memory usage problem, perhaps a leak. As more tasks complete, more data stays on the heap. The garbage collector has to work harder and harder to try to reclaim as much as it can, gradually taking up 75% of your total available CPU resources. Even destroying the ThreadPool won't help, because that's not where the references are stored; they are in the Operation.
The 16-thread case hitting this problem more often could be because it generates state more quickly (I don't know the Operation implementation, so it is hard for me to say).
And increasing the heap size while keeping the problem set the same would make this problem appear to disappear, because you'd have more room for all this state.
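For illustration only, here is a contrived Operation-like class that accumulates state across perform() calls. The field and sizes are invented, but it shows the pattern of retention that would produce exactly this GC behaviour:

import java.util.ArrayList;
import java.util.List;

// Hypothetical Operation that keeps a reference to every intermediate result.
// Each perform() call grows the list, so the heap fills up as tasks complete
// and the garbage collector has to work harder and harder.
public class LeakyOperation {
    private final List<double[]> history = new ArrayList<>();   // never cleared

    public double perform(int chunk) {
        double[] intermediate = new double[100_000];
        for (int i = 0; i < intermediate.length; i++) {
            intermediate[i] = Math.sqrt(chunk + i);
        }
        history.add(intermediate);   // the retained state: survives the call
        return intermediate[0];
    }
}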

I suggest you use the YourKit thread analysis feature to understand the real behavior. It will tell you exactly which threads are running, blocked or waiting, and why.
If you can't or don't want to purchase it, the next best option is VisualVM, which is bundled with the JDK, to do this analysis. It won't give you information as detailed as YourKit's. The following blog post can get you started with VisualVM:
http://marxsoftware.blogspot.in/2009/06/thread-analysis-with-visualvm.html

My answer is based on a mixture of knowledge about JVM memory management and some guesses about facts which I could not find precise information on. I believe that your problem is related to the thread-local allocation buffers (TLABs) Java uses:
A Thread Local Allocation Buffer (TLAB) is a region of Eden that is used for allocation by a single thread. It enables a thread to do object allocation using thread local top and limit pointers, which is faster than doing an atomic operation on a top pointer that is shared across threads.
Let's say you have an eden size of 2M and use 4 threads: the JVM may choose a TLAB size of (eden/64)=32K and each thread gets a TLAB of that size. Once a thread's 32K TLAB is exhausted, it needs to acquire a new one, which requires global synchronization. Global synchronization is also needed for allocation of objects which are larger than the TLAB.
But, to be honest with you, things are not as easy as I described: the JVM adaptively sizes a thread's TLAB based on its estimated allocation rate, determined at minor GCs [1], which makes TLAB-related behavior even less predictable. However, I can imagine that the JVM scales the TLAB sizes down when more threads are working. This seems to make sense, because the sum of all TLABs must be less than the available eden space (and in practice even some fraction of the eden space, so the TLABs can be refilled).
Let us assume a fixed TLAB size per thread of (eden size / (16 * user threads working)):
for 4 threads this results in TLABs of 32K
for 16 threads this results in TLABs of 8K
You can imagine that 16 threads, each exhausting its smaller TLAB faster, will cause much more contention on the TLAB allocator than 4 threads with 32K TLABs.
To conclude, when you decrease the number of working threads or increase the memory available to the JVM, the threads can be given larger TLABs and the problem is solved.
https://blogs.oracle.com/daviddetlefs/entry/tlab_sizing_an_annoying_little
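If you want to test this hypothesis on your own workload, HotSpot can report per-thread TLAB statistics at each minor GC via -XX:+PrintTLAB (on Java 9+ the unified-logging equivalent is something like -Xlog:gc+tlab=trace). A rough, hypothetical experiment comparing 4 and 16 allocating threads; the class name and allocation loop are invented, only the flags are standard HotSpot options:

// TlabStress.java: many threads making small, short-lived allocations.
// Run with, for example:
//   java -verbose:gc -XX:+PrintTLAB TlabStress 4
//   java -verbose:gc -XX:+PrintTLAB TlabStress 16
// and compare the per-thread TLAB sizes and refill counts printed at each minor GC.
public class TlabStress {
    public static void main(String[] args) throws InterruptedException {
        int threads = args.length > 0 ? Integer.parseInt(args[0]) : 4;
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                long sink = 0;
                for (int i = 0; i < 50_000_000; i++) {
                    sink += new byte[32].length; // small allocation, normally served from the TLAB
                }
                System.out.println(sink);
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            w.join();
        }
    }
}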

This is almost certainly due to GC.
If you want to be sure add the following startup flags to your Java program:
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
and check stdout.
You will see lines containing "Full GC" including the time this took: during this time you will see 100% CPU usage.
The default garbage collector on multi-CPU or multi-core machines is the throughput collector, which collects the young generation in parallel but uses serial collection (in one thread) for the old generation.
So what is probably happening in your 100% CPU example is that a GC of the old generation is going on; it is done in one thread and so keeps only one core busy.
Suggestion for solution: use the concurrent mark-and-sweep collector, by using the flag -XX:+UseConcMarkSweepGC at JVM startup.
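Putting both suggestions together, the launch line would look roughly like this (MyApp stands in for your actual main class):
java -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseConcMarkSweepGC MyApp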

Tune the JVM
The core of the Java platform is the Java Virtual Machine (JVM). The entire Java application server runs inside a JVM. The JVM takes many startup parameters as command line flags, and some of them have great implications on the application performance. So, let's examine some of the important JVM parameters for server applications.
First, you should allocate as much memory as possible to the JVM using the -Xms (minimum memory) and -Xmx (maximum memory) flags. For instance, the -Xms1g -Xmx1g flags allocate 1GB of RAM to the JVM. If you don't specify a memory size in the JVM startup flags, the JVM would limit the heap memory to 64MB (512MB on Linux), no matter how much physical memory you have on the server! More memory allows the application to handle more concurrent web sessions, and to cache more data to improve the slow I/O and database operations. We typically specify the same amount of memory for both flags to force the server to use all the allocated memory from startup. This way, the JVM wouldn't need to dynamically change the heap size at runtime, which is a leading cause of JVM instability. For 64-bit servers, make sure that you run a 64-bit JVM on top of a 64-bit operating system to take advantage of all RAM on the server. Otherwise, the JVM would only be able to utilize 2GB or less of memory space. 64-bit JVMs are typically only available for JDK 5.0.
With a large heap memory, the garbage collection (GC) operation could become a major performance bottleneck. It could take more than ten seconds for the GC to sweep through a multiple gigabyte heap. In JDK 1.3 and earlier, GC is a single threaded operation, which stops all other tasks in the JVM. That not only causes long and unpredictable pauses in the application, but it also results in very poor performance on multi-CPU computers since all other CPUs must wait in idle while one CPU is running at 100% to free up the heap memory space. It is crucial that we select a JDK 1.4+ JVM that supports parallel and concurrent GC operations. Actually, the concurrent GC implementation in the JDK 1.4 series of JVMs is not very stable. So, we strongly recommend you upgrade to JDK 5.0. Using the command line flags, you can choose from the following two GC algorithms. Both of them are optimized for multi-CPU computers.
If your priority is to increase the total throughput of the application and you can tolerate occasional GC pauses, you should use the -XX:+UseParallelGC and -XX:+UseParallelOldGC (the latter is only available in JDK 5.0) flags to turn on parallel GC. The parallel GC uses all available CPUs to perform the GC operation, and hence it is much faster than the default single-threaded GC. It still pauses all other activities in the JVM during GC, however.
If you need to minimize the GC pause, you can use the -XX:+UseConcMarkSweepGC flag to turn on the concurrent GC. The concurrent GC still pauses the JVM and uses parallel GC to clean up short-lived objects. However, it cleans up long-lived objects from the heap using a background thread running in parallel with other JVM threads. The concurrent GC drastically reduces the GC pause, but managing the background thread does add to the overhead of the system and reduces the total throughput.
Furthermore, there are a few more JVM parameters you can tune to optimize the GC operations.
On 64-bit systems, the call stack for each thread is allocated 1MB of memory space. Most threads do not use that much space. Using the -XX:ThreadStackSize=256k flag, you can decrease the stack size to 256k to allow more threads.
Use the -XX:+DisableExplicitGC flag to ignore explicit application calls to System.gc(). If the application calls this method frequently, then we could be doing a lot of unnecessary GCs.
The -Xmn flag lets you manually set the size of the "young generation" memory space for short-lived objects. If your application generates lots of new objects, you might improve GCs dramatically by increasing this value. The "young generation" size should almost never be more than 50% of the heap.
Since the GC has a big impact on performance, the JVM provides several flags to help you fine-tune the GC algorithm for your specific server and application. It's beyond the scope of this article to discuss GC algorithms and tuning tips in detail, but we'd like to point out that the JDK 5.0 JVM comes with an adaptive GC-tuning feature called ergonomics. It can automatically optimize GC algorithm parameters based on the underlying hardware, the application itself, and desired goals specified by the user (e.g., the max pause time and desired throughput). That saves you time trying different GC parameter combinations yourself. Ergonomics is yet another compelling reason to upgrade to JDK 5.0. Interested readers can refer to Tuning Garbage Collection with the 5.0 Java Virtual Machine. If the GC algorithm is misconfigured, it is relatively easy to spot the problems during the testing phase of your application. In a later section, we will discuss several ways to diagnose GC problems in the JVM.
Finally, make sure that you start the JVM with the -server flag. It optimizes the Just-In-Time (JIT) compiler to trade slower startup time for faster runtime performance. There are more JVM flags we have not discussed; for details on these, please check out the JVM options documentation page.
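For illustration, a startup line combining the flags discussed above might look like the following. The heap, young-generation and stack sizes are arbitrary examples rather than recommendations, MyServer is a placeholder main class, and -Xss256k is the more common way to spell the thread-stack-size setting:
java -server -Xms1g -Xmx1g -Xmn256m -Xss256k -XX:+UseParallelGC -XX:+UseParallelOldGC -XX:+DisableExplicitGC MyServer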
Reference:
http://onjava.com/onjava/2006/11/01/scaling-enterprise-java-on-64-bit-multi-core.html

A total CPU utilisation of 100% implies that what you have written is effectively single-threaded; i.e. you may have any number of concurrent tasks, but due to locking only one can execute at a time.
If you have high I/O you can get less than 400%, but it is unlikely you will get a round number for CPU utilisation; e.g. you might see 38%, 259%, 72%, 9% etc. (It is also likely to jump around.)
A common problem is locking the data you are using too often. You need to consider how the code could be rewritten so that locking is performed for the briefest period and over the smallest portion of the overall work. Ideally, you want to avoid locking altogether.
Using multiple threads means you can use up to that many CPUs, but if your code prevents that, you are likely to be better off (i.e. faster) writing the code single-threaded, as that avoids the overhead of locking.
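As a concrete illustration of shrinking the locked portion of the work: each task can accumulate into a local variable and touch the shared state exactly once at the end. The sketch below uses java.util.concurrent.atomic.DoubleAdder, which is a different mechanism from the synchronised block described in the question, and the class and method names are invented:

import java.util.concurrent.atomic.DoubleAdder;

public class Accumulator {
    private final DoubleAdder total = new DoubleAdder();   // lock-free accumulation

    public void processChunk(double[] data) {
        double local = 0;
        for (double v : data) {
            local += Math.sqrt(v);   // all the work happens on thread-local data
        }
        total.add(local);            // one cheap shared update per task, no synchronized block
    }

    public double result() {
        return total.sum();
    }
}

The loop runs entirely on thread-local data, so four such tasks can keep four cores busy without contending on a lock.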

Since you are using locking, it is possible that one of your four threads attains the lock but is then context switched - perhaps to run the GC thread. The other threads can't make progress since they can't attain the lock. When the thread context switches back, it completes the work in the critical section and relinquishes the lock to allow only one other thread to attain the lock. So now you have two threads active. It is possible that while the second thread executes the critical section the first thread does the next piece of data parallel work but generates enough garbage to trigger the GC and we're back where we started :)
P.S. This is just a best guess, since it is hard to figure out what is happening without any code snippets.

Increasing the size of the Java heap usually improves throughput until the heap no longer resides in physical memory. When the heap size exceeds the physical memory, the heap begins swapping to disk which causes Java performance to drastically decrease. Therefore, it is important to set the maximum heap size to a value that allows the heap to be contained within physical memory.
Since you give the JVM ~90% of the physical memory on the machine, the problem may be related to I/O caused by memory paging and swapping when you try to allocate memory for more objects. Note that the physical memory is also used by other running processes as well as the OS. Also, since the symptoms occur after a while, this is an indication of a possible memory leak.
Try to find out how much physical memory is available (not already used) and allocate ~90% of the available physical memory to your JVM heap.
What happens if you leave the system running for an extended period of time? Does it ever come back to 400% CPU utilization?
Do you notice any disk activity when the CPU is at 100% utilization?
Can you monitor which threads are running and which are blocked, and when?
Take a look at following link for tuning:
http://java.sun.com/performance/reference/whitepapers/tuning.html#section4

Related

Design issue - Physical memory size and Full Garbage Collection

We are designing a new software system architecture, and I am working with the project manager, but there is a disagreement on this issue within our team.
Our architect says: "System memory should be kept as small as possible, because it takes a long time when a Full GC occurs." (JVM)
I am not sure about that opinion.
When setting up system memory, what level of Full GC(Garbage Collection) time should be reviewed?
How long will it take if Full GC occurs in a 16GB memory environment?
You (or your architects) might be worrying about something that might not be a problem for your throughput to begin with. Before Java 9, the default collector was ParallelGC, and there are dozens and dozens of applications that have not changed it and are happy with the pause times (and that collector pauses the world every time). So the only real answer is: measure. Enable GC logs and look into them.
On the other hand, if you choose a concurrent collector (you should start with G1), having enough breathing room for it in the heap is crucial. It is a lot more important for Shenandoah and ZGC, since they do everything concurrently. Every time the GC initiates a concurrent phase, it works via so-called "barriers", which are basically interceptors for the objects in the heap. The structures used by these barriers require storage. If you constrain this storage, the GC is not going to be happy.
In rather simple words - the more "free" space in the heap, the better your GC will perform.
When setting up system memory, what level of Full GC(Garbage Collection) time should be reviewed?
This is not the correct question. If you are happy with your target response times, this does not matter. If you are not, you start analyzing GC logs and work out what is causing your delays (this is not going to be trivial, though).
How long will it take if Full GC occurs in a 16GB memory environment?
It depends on the collector, on the Java version, and on the type of objects to be collected; there is no easy answer. For Shenandoah and ZGC this is irrelevant, since they do not care about the size of the heap to be scanned. For G1 it is most probably going to be in the range of a few seconds. But if you have WeakReferences and finalizers and a Java version that is known to handle these not so well, the time to collect is going to be long.
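To "enable GC logs and look into them", as suggested above, the flags differ by Java version (the log file name is a placeholder):
On Java 9 and later:
-Xlog:gc*:file=gc.log:time,uptime,level,tags
On Java 8 and earlier:
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:gc.log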
How long will it take if Full GC occurs in a 16GB memory environment?
On a small heap like that, the ballpark figure is around 10 seconds, I guess.
But that's not what you should consider.
When setting up system memory, what level of Full GC(Garbage Collection) time should be reviewed?
All of the times. Whenever a full GC occurs it should be reviewed, if your application is latency-critical. This is what you should consider.
A full GC is a failure.
A failure on multiple levels:
to address the memory size available for the application
to address the GC type you use
to address the types of workload
to address graceful degradation under load
and the list goes on
A concurrent GC implicitly relies on the simple fact that it can collect faster than the application allocates.
When allocation pressure becomes overwhelming, the GC has two options: slow down allocations or stop them altogether.
And when it stops, you know, all hell breaks loose: futures time out, clusters break apart, and engineers fear and loathe large heaps for the rest of their lives...
It's a common scenario for applications that evolve for years with increasing complexity and load, without an overhaul to accommodate a changing world.
It doesn't have to be this way, though.
When you build a new application from the ground up, you can design it with performance and latency in mind, with scalability and graceful degradation, instead of heap size and GC times.
You can split workloads that are not latency-critical but memory-heavy onto a different VM and run it under good ol' ParallelGC, and it will outperform any concurrent GC in both throughput and memory overhead.
You can run latency-critical tasks under a modern state-of-the-art GC like Shenandoah and have sub-second collection pauses on heaps of several TB, if you don't mind a roughly 30% memory overhead and a considerable amount of CPU overhead.
Let the application and its requirements dictate your heap size, not the engineers.

JVM consumes 100% CPU with a lot of GC

After running for a few days, the CPU load of my JVM is about 100%, with about 10% of it spent in GC (screenshot).
The memory consumption is near the maximum (about 6 GB).
Tomcat is extremely slow in that state.
Since it's too much for a comment, I'll write it up as an answer:
Looking at your charts, it seems to be using CPU for non-GC tasks; peak "GC activity" seems to stay within 10%.
So at first impression it would seem that your task is simply CPU-bound, so if that's unexpected, maybe you should do some CPU profiling of your Java application to see if something pops out.
Apart from that, based on the comments I suspect that physical memory filling up might evict file caches and memory-mapped things, leading to increased page faults which force the CPU to wait for I/O.
Freeing up 500MB via a manual GC out of a 4GB heap does not seem all that much. Most GCs try to keep pause times low as their primary goal, keep the total time spent in GC within some bound as a secondary goal, and only when the other goals are met do they try to reduce memory footprint as a tertiary goal.
Before recommending further steps you should gather more statistics/provide more information since it's hard to even discern what your actual problem is from your description.
monitor page faults
figure out which GC algorithm is used in your setup and how it is tuned (-XX:+PrintFlagsFinal; see the example at the end of this answer)
log GC activity - I suspect it's pretty busy with minor GCs and thus eating up its pause time or CPU load goals
perform allocation profiling of your application (anything creating excessive garbage?)
You also have to be careful to distinguish problems caused by the Java heap reaching its sizing limit vs. problems caused by the OS exhausting its physical memory.
TL;DR: Unclear problem, more information required.
Or if you're lazy/can afford it just plug in more RAM / remove other services from the machine and see if the problem goes away.
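For the second bullet above, the GC configuration actually in effect can be dumped like this (on Unix-like systems you can additionally pipe the output through grep to pick out the Use...GC flags):
java -XX:+PrintFlagsFinal -version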
I learned to check the following on GC problems:
Give the JVM enough memory, e.g. -Xmx2G
If memory is not sufficient and no more RAM is available on the host, analyze the heap dump (e.g. with jvisualvm).
Turn on Concurrent Mark and Sweep:
-XX:+UseConcMarkSweepGC -XX:+UseParNewGC
Check the garbage collection log: -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:gc.log
My Solution:
But I finally solved that problem by tuning the cache sizes.
The cache sizes were too big, so memory got scarce.
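The caches involved aren't shown here, but as an illustration of the kind of bounding that helps, a size-capped LRU cache can be built on LinkedHashMap; the capacity value is something you would have to choose for your own workload:

import java.util.LinkedHashMap;
import java.util.Map;

// Simple LRU cache: once maxEntries is exceeded, the eldest entry is dropped,
// so the cache cannot keep growing until heap memory gets scarce.
public class BoundedCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public BoundedCache(int maxEntries) {
        super(16, 0.75f, true);          // accessOrder = true gives LRU behaviour
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }
}

Usage: new BoundedCache<String, byte[]>(10_000) behaves like a normal map but never holds more than 10,000 entries.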
If you want to keep the memory of your server free, you can simply try the VM parameter
-Xmx2G //or any other value
This ensures your program never takes more than 2 gigabytes of RAM. But be aware that in case of a high workload the server may then get an OutOfMemoryError.
Since an old-generation (full) GC may block your whole server from working for some seconds, Java will try to avoid a full garbage collection.
The RAM limitation may trigger a full GC more easily (or even cause more objects to be collected by the young-generation GC).
From my (more guessing than actually knowing) opinion: I don't think another algorithm can help much here.

Is it possible to fix OutOfMemory error by specify GC flags?

Is specifying GC flags a possible solution to OutOfMemory exceptions, or does it have no impact on whether the program will run out of memory?
The GC flags in question are: -XX:+UseConcMarkSweepGC and -XX:+CMSIncrementalMode
I'm asking because I thought the above flags (and GC flags in general) are there to tune JVM performance (in relation to its responsiveness/speed), but that they have no impact on reducing the minimum memory requirements of your program. In other words, if there is not enough memory for the program to run to completion (e.g. it runs into an OutOfMemory exception), no amount of JVM tuning will resolve that.
If you get OutOfMemoryError: heap space, there is nothing you can do. JVM will never throw this before GCing the last byte out of the heap.
However, if you get OutOfMemoryError: GC overhead limit exceeded, then there may be something you can still do about it because this is a "soft" error and there may be a config setting that will lessen the GC overhead. Quite improbable, I must add, but at least possible in theory.
The only tuning parameter which matters is the maximum memory, i.e. -Xmx or -mx. If your program has run out of memory due to a memory leak, even raising this won't help.
BTW: Setting memory tuning options can actually reduce the amount of memory you can use before you run out. E.g. if you set the NewSize, this can limit how much the JVM can resize the generations to use all the memory.
In general, the fewer options you use the better, if you want to use all your memory.
If there is a memory leak in your code, those flags are not useful.
As per Virtual Machine Garbage Collection Tuning:
Unless your application has rather strict pause time requirements, first run your application and allow the VM to select a collector. If necessary, adjust the heap size to improve performance. If the performance still does not meet your goals, then use the following guidelines as a starting point for selecting a collector.
If the application has a small data set (up to approximately 100MB), then
select the serial collector with -XX:+UseSerialGC.
If the application will be run on a single processor and there are no pause time requirements, then
let the VM select the collector, or
select the serial collector with -XX:+UseSerialGC.
If (a) peak application performance is the first priority and (b) there are no pause time requirements or pauses of one second or longer are acceptable, then
let the VM select the collector, or
select the parallel collector with -XX:+UseParallelGC and (optionally) enable parallel compaction with -XX:+UseParallelOldGC.
If response time is more important than overall throughput and garbage collection pauses must be kept shorter than approximately one second, then
select the concurrent collector with -XX:+UseConcMarkSweepGC. If only one or two processors are available, consider using incremental mode, described below.

-Xmx attribute and available system memory correlation

I have a question on my mind. Let's assume that I have two parameters passed to JVM:
-Xms256mb -Xmx1024mb
At the beginning of the program 256MB is allocated. Next, some objects are created and the JVM process tries to allocate more memory. Let's say that the JVM needs to allocate 800MB. The Xmx attribute allows that, but the memory which is currently available on the system (let's say Linux/Windows) is 600MB. Is it possible that an OutOfMemoryError will be thrown? Or maybe the swap mechanism will play a role?
My second question is related to the quality of GC algorithms. Let's say that I have jdk1.5u7 and jdk1.5u22. Is it possible that in the latter JVM the memory leaks vanish and OutOfMemoryError does not occur? Can the quality of GC be better in the latest version?
The quality of the GC (barring a buggy GC) does not affect memory leaks, as memory leaks are an artifact of the application -- GC can't collect what isn't actual garbage.
If a JVM needs more memory, it will take it from the system. If the system can swap, it will swap (like any other process). If the system can not swap, your JVM will fail with a system error, not an OOM exception, because the system can not satisfy the request, and at this point it is effectively fatal.
As a rule, you NEVER want to have an active JVM partially swapped out. A GC event will crush you as the system thrashes, cycling pages through the virtual memory system. It's one thing to have an idle background JVM swapped out as a whole, but if your machine has 1GB of RAM and your main process wants 1.5GB, then you have a major problem.
The JVM likes room to breathe. I've seen JVMs in a GC death spiral when they didn't have enough memory, even though they didn't have memory leaks. They simply didn't have enough room for their working set. Adding another chunk of heap transformed those JVMs from awful to happy sawtooth GC graphs.
Give a JVM the memory it needs, and both you and it will be much happier.
"Memory" and "RAM" aren't the same thing. Memory includes virtual memory (swap), so you can allocate a total of free RAM+ free swap before you get the OutOfMemoryError.
Allocation depends on the OS being used.
If you allocate too much memory, you could end up with portions of the heap loaded into swap, which is slow.
Whether your program runs faster or slower depends on how the VM handles the memory.
I would specify a heap that is not too big, to make sure it doesn't occupy all the memory and slow the VM down.
Concerning your first question:
Actually, if the machine cannot allocate the 1024 MB that you asked for as the max heap size, it will not even start the JVM.
I know this because I noticed it often when trying to open Eclipse with a large heap size: when the OS could not allocate the larger heap space, the JVM failed to start. You could also try it out yourself to confirm. So the rest of the details are irrelevant to you. Of course, if your program uses too much swap (the same as in any language), then the performance will be horrible.
Concerning your second question:
the memory leaks vanish
Not possible, as they are bugs you will have to fix.
and OutOfMemoryError does not occur? Can the quality of GC be better
in the latest version?
This could happen if, for example, a different GC algorithm is used and it manages to kick in before you see the exception. But if you have a memory leak, then it would probably only mask it, or you would see it intermittently.
Also, various JVMs have different GCs you can configure.
Update:
I have to admit (after seeing @Orochi's note) that I noticed the behavior of the max heap on Windows. I can not say for sure that this applies to Linux as well. But you could try it yourself.
Update 2:
As an answer to the comments of @DennisCheung:
From IBM(my emphasis):
The table shows both the maximum Java heap possible and a recommended limit for the maximum Java heap size setting ... It is important to have more physical memory than is required by all of the processes on the machine combined to prevent paging or swapping. Paging reduces the performance of the system and affects the performance of the Java memory management system.

Force full garbage collection when memory occupation goes beyond a certain threshold

I have a server application that, in rare occasions, can allocate large chunks of memory.
It's not a memory leak, as these chunks can be claimed back by the garbage collector by executing a full garbage collection. Normal garbage collection frees amounts of memory that are too small: it is not adequate in this context.
The garbage collector executes these full GCs when it deems appropriate, namely when the memory footprint of the application nears the allotted maximum specified with -Xmx.
That would be OK, if it weren't for the fact that these problematic memory allocations come in bursts and can cause OutOfMemoryErrors, because the JVM is not able to perform a GC quickly enough to free the required memory. If I manually call System.gc() beforehand, I can prevent this situation.
Anyway, I'd prefer not having to monitor my JVM's memory allocation myself (or insert memory management into my application's logic); it would be nice if there were a way to run the virtual machine with a memory threshold, above which full GCs would be executed automatically, in order to release the memory I'm going to need very early.
Long story short: I need a way (a command line option?) to configure the JVM so that it releases a good amount of memory early (i.e. performs a full GC) when memory occupation reaches a certain threshold. I don't care if this slows my application down every once in a while.
All I've found till now are ways to modify the size of the generations, but that's not what I need (at least not directly).
I'd appreciate your suggestions,
Silvio
P.S. I'm working on a way to avoid large allocations, but it could require a long time and meanwhile my app needs a little stability
UPDATE: analyzing the app with jvisualvm, I can see that the problem is in the old generation
From here (this is a 1.4.2 page, but the same option should exist in all Sun JVMs):
assuming you're using the CMS garbage collector (which I believe the server turns on by default), the option you want is
-XX:CMSInitiatingOccupancyFraction=<percent>
where <percent> is the percentage of memory in use that will trigger a full GC.
Insert standard disclaimers here that messing with GC parameters can give you severe performance problems, varies wildly by machine, etc.
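For example (the 60 here is an arbitrary threshold, and -XX:+UseCMSInitiatingOccupancyOnly keeps the JVM from adaptively overriding the value you set):
-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=60 -XX:+UseCMSInitiatingOccupancyOnly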
When you allocate large objects that do not fit into the young generation, they are immediately allocated in the tenured generation space. This space is only GC'ed when a full-GC is run which you try to force.
However, I am not sure this would solve your problem. You say the "JVM is not able to perform a GC quickly enough". Even if your allocations come in bursts, each allocation will cause the VM to check whether it has enough space available. If not, and if the object is too large for the young generation, it will cause a full GC which should "stop the world", thereby preventing new allocations from taking place in the first place. Once the GC is complete, your new object will be allocated.
If shortly after that the second large allocation is requested in your burst, it will do the same thing again. Depending on whether the initial object is still needed, it will either be able to succeed in GC'ing it, thereby making room for the next allocation, or fail if the first instance is still referenced.
You say "I need a way [...] to release early a good amount of memory (i.e. perform a full GC) when memory occupation reaches a certain threshold". This by definition can only succeed, if that "good amount of memory" is not referenced by anything in your application anymore.
From what I understand here, you might have a race condition which you might sometimes avoid by interspersing manual GC requests. In general you should never have to worry about these things - from my experience an OutOfMemoryError only occurs if there are in fact too many allocations to be fit into the heap concurrently. In all other situations the "only" problem should be a performance degradation (which might become extreme, depending on the circumstances, but this is a different problem).
I suggest you do further analysis of the exact problem to rule this out. I recommend the VisualVM tool that comes with Java 6. Start it and install the VisualGC plugin. This will allow you to see the different memory generations and their sizes. Also there is a plethora of GC related logging options, depending on which VM you use. Some options have been mentioned in other answers.
The other options for choosing which GC to use and how to tweak thresholds should not matter in your case, because they all depend on enough memory being available to contain all the objects that your application needs at any given time. These options can be helpful if you have performance problems related to heavy GC activity, but I fear they will not lead to a solution in your particular case.
Once you are more confident in what is actually happening, finding a solution will become easier.
Do you know which of the garbage collection pools is growing too large, i.e. eden vs. survivor space? (Try the JVM option -Xloggc:<file> to log GC status to a file with time stamps.) When you know this, you should be able to tweak the size of the affected pool with one of the options mentioned here: hotspot options for Java 1.4
I know that page is for the 1.4 JVM; I can't seem to find the same -X options in my current 1.6 install's help output, unless setting those individual pool sizes is a non-standard feature!
The JVM is only supposed to throw an OutOfMemoryError after it has attempted to release memory via garbage collection (according to both the API docs for OutOfMemoryError and the JVM specification). Therefore your attempts to force garbage collection shouldn't make any difference. So there might be something more significant going on here - either a problem with your program not properly clearing references or, less likely, a JVM bug.
There's a very detailed explanation of how GC works here and it lists parameters to control memory available to different memory pools/generations.
Try the -server option. It will enable the parallel GC and you will see some performance increase if you use a multi-core processor.
Have you tried playing with the G1 GC? It should be available from 1.6.0u14 onwards.
