Critical guava memory leak - Workaround needed

Critical guava memory leak - Workaround needed - java

Is there any way to workaround the Google Guava r15 memory leak (link to the bug report) in the cache component?
(Without relying that the application server might clean things up and/or considering that the web application will never be restarted/redeployed)

I guess you don't need to care about it. The Tomcat message says
Threads are going to be renewed over time to try and avoid a probable memory leak.
IIUIC it means that once all old threads are gone, so will all the pointers to the old version of your class.
Details: The reason for the thread pooling is the big cost of thread creation. The pooling itself is hacky as you get a thread which was doing something else and thread are not stateless. Thread creation is expensive assuming you'd need a lot of them and never recycle them. There's nothing wrong with renewing all threads every few minutes, so I hoped, Tomcat's workaround solves it perfectly. But it's not the case.
EDIT
I'm afraid, I misunderstood something. The linked bug says
It seems that web applications which are using guava cache might face a memory leak.
After several redeployments, the application container crashes or stalls with an OutOfMemoryError.
I thought Tomcat could solve it easily, but for whatever reason it doesn't. So I'm afraid, you have to clean the ThreadLocals yourself. This is easily possible via reflection, the concerned fields are Thread.threadLocals and possibly inheritableThreadLocals. It's a bad hack and the harder part is to make this happen when nothing can go wrong, i.e., when no application is loaded.
EDIT 2 and 3
I guess it's safe to do something like
Stripped64.threadHashCode = new ThreadHashCode();
as the contained things are only needed for performance under heavy contention and they get recreated upon use. But according to MRalwasser's comment, it won't help at all as alive threads will still refer the old value. So there seem to be no way.
As ThreadLocal works by storing data with the threads (rather then using a real Map<Thread, Something>), you'd have to through all threads and remove references there. Fooling around with other threads' private fields is a terrible idea as they are not thread-safe and also due to visibility issues.
Another thing that might or mighn't not work is my proposal on the issue page. It's just a 20 line patch. Or simply wait, the issue has been assigned yesterday.
EDIT 4
Thread locals which don't get used can't cause any problems. AFAIK the only use of this TL is in cache stats. So avoid both CacheBuilder.recordStats and Cache.stats and Stripped64 won't get loaded.
EDIT 5
It looks like it's gonna get fixed finally. From the issue:
Doug fixed this upstream for us and we patched it back into Guava:
http://gee.cs.oswego.edu/cgi-bin/viewcvs.cgi/jsr166/src/jsr166e/Striped64.java?revision=1.9
At the first glance his change seems to be identical to mine.
EDIT 6
Finally, this has been marked as fixed and Guava 18.0-rc1 has been announced. It's just sad it took that long given that the change is the same as mine (9 month ago).

You can use the ServletListener ClassLoaderLeakPreventor https://github.com/mjiderhamn/classloader-leak-prevention/ which also clears ThreadLocals on undeploy/stop. It also has fixes/workarounds for other common leaks.

Seems to be a drawback of ThreadLocals. You'll get the same every time you put an application level class in ThreadLocal.
The only workaround is to restart server on deploy I guess. I think it's a known issue of Java applications. Are you sure that it's the only place that stops unloading classloader?

Related

List of threads of a specific application on tomcat

So, the scenario is that I have multiple application in a single tomcat, and once in a while I have to update them without restarting the service.
To prevent some leaks (presumably generated by classes that I don't have access to [i.e., TimerThread that never ends]) when I reload or redeploy applications on tomcat 7, I decided to list the threads when destroying the context and stop/interrupt them by force.
I know that it doesn't sound like a perfect approach, but it seems the one that's working, for I couldn't find a point I could close the thread nicely. Therefore, I'm stuck with them generating these leaks.
I listed them with "Thread.getAllStackTraces()", but it gives me all the threads in the JVM apparently, and I just wanted the threads of a specific application, so I'd be able to iterate over them and find the one(s) I must interrupt.
I used "getName()" to find them.
Well, if anyone can clarify me on this...
Java 7
Tomcat 7

First of all I want to thank the guys that commented on my doubt above.
Anyway... I kept looking for an answer and couldn't list the threads as I thought it'd be the best way so solve the problem. Then, I decided to look over somewhere else, over the libraries and found that one of them, apache.commons.dbcp version 1.4 was destined to Java 6, and as I was using Java 7, it misteriously creates a TimerThread that never ends, therefore creating a leak. Updating to apache.commons.dbcp2 for Java 7+ made the Application never even start this TimerThread I mentioned.
It looks good now.

Tomcat behaves as if it's not threading requests, causes slow response time

I'm seeing strange behavior and I don't know how to gain any further insight into and am hoping someone can help.
Background: I have a query that takes a long time to return results so instead of making the user wait for the data directly upon request I execute this query via a Timer object at regular intervals and store the results in a static variable. Therefore, when the user requests the data I always just pull from the static variable, therefore making the response virtually instant. So far so good.
Issue: The behavior I'm seeing, however, is that if I make a request for the data just as the background (Timer) request has begun to query the data, my user's request waits for the data to come back before responding -- forcing the user to wait. It's as if tomcat is behaving synchronously with the threads (I know it's not -- it just looks that way).
This is in a Production environment and, for the most part, everything works great but for users there are times when the site just hangs for them and they feel it's unreliable (well, in a sense it is).
What I've done: Being that the requests for the data were in a static method I thought "A ha! The threads are syncronized which is causing the delay!" so i pulled all of my static methods out, removed the syncronization and forced each call to instantiate it's own object to retrieve the data (to keep it thread safe). There isn't any syncronization on a semaphore to the static variable either.
I've also installed javamelody to try and gain some insight into what's going on but nothing new thus far. I have noticed a lot (majority) of threads are in "WAITING" state but they also have 0ms for User and CPU time so don't think that is pointing to anything(?).
Running Tomcat 5.5 (no apache layer), struts 2, Java 1.5
If anyone has any idea why a simple request to a static variable hangs for longer background processes I would really appreciate it! Or if you know how I can gain insight that would be great too.
Thanks!

One possible explanation is that the threads are actually blocking at the database level due to database locking (or something) caused by the long-running query.
The way to figure out what is going on is to find out exactly where the blocked threads are blocking. A thread dump can be produced by sending a SIGQUIT (or equivalent) to the JVM, and included stack traces for all Java thread stacks. Alternatively, you can get the same information (and more) by attaching a debugger, etcetera. Either way, the class name and line number of the top frame of each stack should allow you to look at the source code and figure out (at least) what kind of locking or blocking is going on.

For those who would like to know I eventually found VisualVM (http://visualvm.java.net/download.html). It's perfect. I run Tomcat from eclipse like I normally do and it appears within the VisualVM client. Right-mouse click the tomcat icon, choose Thread Dump and, boom, I've got it all.
Thanks, all, for the help and pointers towards the right direction!

Dumping a Java program into a file and restarting it

I was just wondering if it's possible to dump a running Java program into a file, and later on restart it (same machine)
It's sounds a bit weird, but who knows
--- update -------
Yes, this is the hibernate feature for a process instead of a full system. But google 'hibernate jvm process' and you'll understand my pain.
There is a question for linux on this subject (here). Quickly, it's possible to hibernate a process (far from 100% reliable) with CryoPID.
A similar question was raised in stackoverflow some years ago.
With a JVM my educated guess is that hibernating should be a lot easier, not always possible and not reliable at 100% (e.g. UI and files).
Serializing a persistent state of the application is an option but it is not an answer to the question.

This may me a bit overkill but one thing you can do is run something like VirtualBox and halt/save the machine.
There is also:
- JavaFlow from Apache that should do just that even though I haven't personally tried
it.
- Brakes that may be exactly what you're looking for

There are a lot restrictions any solution to your problem will have: all external connections might or might not survive your attempt to freeze and awake them. Think of timeouts on the other side, or even stopped communication partners - anything from a web server to a database or even local files.
You are asking for a generic solution, without any internal knowledge of your program, that you would like to hibernate. What you can always do, is serialize that part of the state of your program, that you need to restart your program. It is, or at least was common wisdom to implement restart point in long running computations (think of days or weeks). So, when you hit a bug in your program after it run for a week, you could fix the bug and save some computation days.
The state of a program could be surprisingly small, compared to the complete memory size used.
You asked "if it's possible to dump a running Java program into a file, and later on restart it." - Yes it is, but I would not suggest a generic and automatic solution that has to handle your program as a black box, but I suggest that you externalize the important part of your programs state and program restart points.
Hope that helps - even if it's more complicated than what you might have hoped for.

I believe what the OP is asking is what the Smalltalk guys have been doing for decades - store the whole programming/execution environment in an image file, and work on it.
AFAIK there is no way to do the same thing in Java.

There has been some research in "persisting" the execution state of the JVM and then move it to another JVM and start it again. Saw something demonstrated once but don't remember which one. Don't think it has been standardized in the JVM specs though...
Found the presentation/demo I was thinking about, it was at OOPSLA 2005 that they were talking about squawk
Good luck!
Other links of interest:
Merpati
Aglets
M-JavaMPI

How about using SpringBatch framework?
As far as I understood from your question you need some reliable and resumable java task, if so, I believe that Spring Batch will do the magic, because you can split your task (job) to several steps while each step (and also the entire job) has its own execution context persisted to a storage you choose to work with.
In case of crash you can recover by analyzing previous run of specific job and resume it from exact point where the failure occurred.
You can also pause and restart your job programmatically if the job was configured as restartable and the ExecutionContext for this job already exists.
Good luck!

I believe :
1- the only generic way is to implement serialization.
2- a good way to restore a running system is OS virtualization
3- now you are asking something like single process serialization.
The problem are IOs.
Says your process uses a temporary file which gets deleted by the system after
'hybernation', but your program does not know it. You will have an IOException
somewhere.
So word is , if the program is not designed to be interrupted at random , it won't work.
Thats a risky and unmaintable solution so i believe only 1,2 make sense.

I guess IDE supports debugging in such a way. It is not impossible, though i don't know how. May be you will get details if you contact some eclipse or netbeans contributer.

First off you need to design your app to use the Memento pattern or any other pattern that allows you to save state of your application. Observer pattern may also be a possibility. Once your code is structured in a way that saving state is possible, you can use Java serialization to actually write out all the objects etc to a file rather than putting it in a DB.
Just by 2 cents.

What you want is impossible from the very nature of computer architecture.
Every Java program gets compiled into Java intermediate code and this code is then interpreted into into native platform code (when run). The native code is quite different from what you see in Java files, because it depends on underlining platform and JVM version. Every platform has different instruction set, memory management, driver system, etc... So imagine that you hibernated your program on Windows and then run it on Linux, Mac or any other device with JRE, such as mobile phone, car, card reader, etc... All hell would break loose.
You solution is to serialize every important object into files and then close the program gracefully. When "unhibernating", you deserialize these instances from these files and your program can continue. The number of "important" instances can be quite small, you only need to save the "business data", everything else can be reconstructed from these data. You can use Hibernate or any other ORM framework to automatize this serialization on top of a SQL database.

Probably Terracotta can this: http://www.terracotta.org
I am not sure but they are supporting server failures. If all servers stop, the process should saved to disk and wait I think.
Otherwise you should refactor your application to hold state explicitly. For example, if you implement something like runnable and make it Serializable, you will be able to save it.

How do I make the JVM exit on ANY OutOfMemoryException even when bad people try to catch it

An OOME is of the class of errors which generally you shouldn't recover from. But if it is buried in a thread, or someone catches it, it is possible for an application to get in a state from which it isn't exiting, but isn't useful. Any suggestions in how to prevent this even in the face of using libraries which may foolishly try to catch Throwable or Error/OOME? (ie you don't have direct access to modify the source code)

Solution:
On newer JVMs:
-XX:+ExitOnOutOfMemoryError
to exit on OOME, or to crash:
-XX:+CrashOnOutOfMemoryError
On Older:
-XX:OnOutOfMemoryError="<cmd args>; <cmd args>"
Definition: Run user-defined commands when an OutOfMemoryError is first thrown. (Introduced in 1.4.2 update 12, 6)
See http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html
An example that kills the running process:
-XX:OnOutOfMemoryError="kill -9 %p"

If some piece of code in your application's JVM decides that it wants to try to catch OOMEs and attempt to recover, there is (unfortunately) nothing you that you can do to stop it ... apart from AOP heroics that are probably impractical, and definitely are bad for your application's performance and maintainability. Apart from that, the best you can do is to pull the plug on the JVM using an "OnOutOfMemoryError" hook. See the answer above: https://stackoverflow.com/a/3878199/139985/
Basically, you have to trust other developers not to do stupid things. Other stupid things that you probably shouldn't try to defend against include:
calling System.exit() deep in a library method,
calling Thread.stop() and friends,
leaking open streams, database connections and so on,
spawning lots of threads,
randomly squashing (i.e. catching and ignoring) exception,
etc.
In practice, the way to pick up problems like this in code written by other people is to use code quality checkers, and perform code reviews.
If the problem is in 3rd-party code, report it as a BUG (which it probably is) and if they disagree, start looking for alternatives.
For those who don't already know this, there are a number of reason why it is a bad idea to try to recover from an OOME:
The OOME might have been thrown while the current thread was in the middle of updating some important data structure. In the general case, the code that catches this OOME has no way of knowing this, and if it tries to "recover" there is a risk that the application will continue with a damages data structure.
If the application is multi-threaded there is a chance that OOMEs might have been thrown on other threads as well, making recovery even harder.
Even if the application can recover without leaving data structures in an inconsistent state, the recovery may just cause the application to limp along for a few seconds more and then OOME again.
Unless you set the JVM options appropriately, a JVM that has almost run out of memory tends to spend a lot of time garbage collecting in a vain attempt to keep doing. Attempting to recover from OOMEs is likely to prolong the agony.
Recovering from an OOME does nothing to address the root cause which is typically, a memory leak, a poorly designed (i.e. memory wasteful) data structure, and/or launching the application with a heap that is too small.

edit OutOfMemoryError.java, add
System.exit() in its constructors.
compile it. (interestingly javac
doesn't care it's in package java.lang)
add the class into JRE rt.jar
now jvm will use this new class. (evil laughs)
This is a possibility you might want to be aware of. Whether it's a good idea, or even legal, is another question.

User #dennie posted a comment which should really be its own answer. Newer JVM features make this easy, specifically
-XX:+ExitOnOutOfMemoryError
to exit on OOME, or to crash:
-XX:+CrashOnOutOfMemoryError
Since Java 8u92 https://www.oracle.com/java/technologies/javase/8u92-relnotes.html

One more thing I could think of (although I do not know how to implement it) would be to run your app in some kind of debugger. I noticed, that my debugger can stop the execution when an exception is thrown. :-)
So may be one could implement some kind of execution environment to achieve that.

How about catching OOME yourself in your code and System.exit()?

You can run your java program using Java Service Wrapper with an OutOfMemory Detection Filter. However, this assumes that the "bad people" are nice enough to log the error :)

One possibility, which I would love to be talked out of, is have a stupid thread thats job is to do something on the heap. Should it receive OOME - then it exits the whole JVM.
Please tell me this isn't sensible.

You could use the MemoryPoolMXBean to be notified when a program exceeds a set heap allocation threshold.
I haven't used it myself but it should be possible to shut down this way when the remaining memory gets low by setting an allocation threshold and calling System.exit() when you receive the notification.

Only thing I can think of is using AOP to wrap every single method (beware to rule out java.*) with a try-catch for OOME and if so, log something and call System.exit() in the catch block.
Not a solution I'd call elegant, though...

Tracking down a memory leak / garbage-collection issue in Java

This is a problem I have been trying to track down for a couple months now. I have a java app running that processes xml feeds and stores the result in a database. There have been intermittent resource problems that are very difficult to track down.
Background:
On the production box (where the problem is most noticeable), i do not have particularly good access to the box, and have been unable to get Jprofiler running. That box is a 64bit quad-core, 8gb machine running centos 5.2, tomcat6, and java 1.6.0.11. It starts with these java-opts
JAVA_OPTS="-server -Xmx5g -Xms4g -Xss256k -XX:MaxPermSize=256m -XX:+PrintGCDetails -
XX:+PrintGCTimeStamps -XX:+UseConcMarkSweepGC -XX:+PrintTenuringDistribution -XX:+UseParNewGC"
The technology stack is the following:
Centos 64-bit 5.2
Java 6u11
Tomcat 6
Spring/WebMVC 2.5
Hibernate 3
Quartz 1.6.1
DBCP 1.2.1
Mysql 5.0.45
Ehcache 1.5.0
(and of course a host of other dependencies, notably the jakarta-commons libraries)
The closest I can get to reproducing the problem is a 32-bit machine with lower memory requirements. That I do have control over. I have probed it to death with JProfiler and fixed many performance problems (synchronization issues, precompiling/caching xpath queries, reducing the threadpool, and removing unnecessary hibernate pre-fetching, and overzealous "cache-warming" during processing).
In each case, the profiler showed these as taking up huge amounts of resources for one reason or another, and that these were no longer primary resource hogs once the changes went in.
The Problem:
The JVM seems to completely ignore the memory usage settings, fills all memory and becomes unresponsive. This is an issue for the customer facing end, who expects a regular poll (5 minute basis and 1-minute retry), as well for our operations teams, who are constantly notified that a box has become unresponsive and have to restart it. There is nothing else significant running on this box.
The problem appears to be garbage collection. We are using the ConcurrentMarkSweep (as noted above) collector because the original STW collector was causing JDBC timeouts and became increasingly slow. The logs show that as the memory usage increases, that is begins to throw cms failures, and kicks back to the original stop-the-world collector, which then seems to not properly collect.
However, running with jprofiler, the "Run GC" button seems to clean up the memory nicely rather than showing an increasing footprint, but since I can not connect jprofiler directly to the production box, and resolving proven hotspots doesnt seem to be working I am left with the voodoo of tuning Garbage Collection blind.
What I have tried:
Profiling and fixing hotspots.
Using STW, Parallel and CMS garbage collectors.
Running with min/max heap sizes at 1/2,2/4,4/5,6/6 increments.
Running with permgen space in 256M increments up to 1Gb.
Many combinations of the above.
I have also consulted the JVM [tuning reference](http://java.sun.com/javase/technologies/hotspot/gc/gc_tuning_6.html) , but can't really find anything explaining this behavior or any examples of _which_ tuning parameters to use in a situation like this.
I have also (unsuccessfully) tried jprofiler in offline mode, connecting with jconsole, visualvm, but I can't seem to find anything that will interperet my gc log data.
Unfortunately, the problem also pops up sporadically, it seems to be unpredictable, it can run for days or even a week without having any problems, or it can fail 40 times in a day, and the only thing I can seem to catch consistently is that garbage collection is acting up.
Can anyone give any advice as to:
a) Why a JVM is using 8 physical gigs and 2 gb of swap space when it is configured to max out at less than 6.
b) A reference to GC tuning that actually explains or gives reasonable examples of when and what kind of setting to use the advanced collections with.
c) A reference to the most common java memory leaks (i understand unclaimed references, but I mean at the library/framework level, or something more inherenet in data structures, like hashmaps).
Thanks for any and all insight you can provide.
EDIT
Emil H:
1) Yes, my development cluster is a mirror of production data, down to the media server. The primary difference is the 32/64bit and the amount of RAM available, which I can't replicate very easily, but the code and queries and settings are identical.
2) There is some legacy code that relies on JaxB, but in reordering the jobs to try to avoid scheduling conflicts, I have that execution generally eliminated since it runs once a day. The primary parser uses XPath queries which call down to the java.xml.xpath package. This was the source of a few hotspots, for one the queries were not being pre-compiled, and two the references to them were in hardcoded strings. I created a threadsafe cache (hashmap) and factored the references to the xpath queries to be final static Strings, which lowered resource consumption significantly. The querying still is a large part of the processing, but it should be because that is the main responsibility of the application.
3) An additional note, the other primary consumer is image operations from JAI (reprocessing images from a feed). I am unfamiliar with java's graphic libraries, but from what I have found they are not particularly leaky.
(thanks for the answers so far, folks!)
UPDATE:
I was able to connect to the production instance with VisualVM, but it had disabled the GC visualization / run-GC option (though i could view it locally). The interesting thing: The heap allocation of the VM is obeying the JAVA_OPTS, and the actual allocated heap is sitting comfortably at 1-1.5 gigs, and doesnt seem to be leaking, but the box level monitoring still shows a leak pattern, but it is not reflected in the VM monitoring. There is nothing else running on this box, so I am stumped.

Well, I finally found the issue that was causing this, and I'm posting a detail answer in case someone else has these issues.
I tried jmap while the process was acting up, but this usually caused the jvm to hang further, and I would have to run it with --force. This resulted in heap dumps that seemed to be missing a lot of data, or at least missing the references between them. For analysis, I tried jhat, which presents a lot of data but not much in the way of how to interpret it. Secondly, I tried the eclipse-based memory analysis tool ( http://www.eclipse.org/mat/ ), which showed that the heap was mostly classes related to tomcat.
The issue was that jmap was not reporting the actual state of the application, and was only catching the classes on shutdown, which was mostly tomcat classes.
I tried a few more times, and noticed that there were some very high counts of model objects (actually 2-3x more than were marked public in the database).
Using this I analyzed the slow query logs, and a few unrelated performance problems. I tried extra-lazy loading ( http://docs.jboss.org/hibernate/core/3.3/reference/en/html/performance.html ), as well as replacing a few hibernate operations with direct jdbc queries (mostly where it was dealing with loading and operating on large collections -- the jdbc replacements just worked directly on the join tables), and replaced some other inefficient queries that mysql was logging.
These steps improved pieces of the frontend performance, but still did not address the issue of the leak, the app was still unstable and acting unpredictably.
Finally, I found the option: -XX:+HeapDumpOnOutOfMemoryError . This finally produced a very large (~6.5GB) hprof file that accurately showed the state of the application. Ironically, the file was so large that jhat could not anaylze it, even on a box with 16gb of ram. Fortunately, MAT was able to produce some nice looking graphs and showed some better data.
This time what stuck out was a single quartz thread was taking up 4.5GB of the 6GB of heap, and the majority of that was a hibernate StatefulPersistenceContext ( https://www.hibernate.org/hib_docs/v3/api/org/hibernate/engine/StatefulPersistenceContext.html ). This class is used by hibernate internally as its primary cache (i had disabled the second-level and query-caches backed by EHCache).
This class is used to enable most of the features of hibernate, so it can't be directly disabled (you can work around it directly, but spring doesn't support stateless session) , and i would be very surprised if this had such a major memory leak in a mature product. So why was it leaking now?
Well, it was a combination of things:
The quartz thread pool instantiates with certain things being threadLocal, spring was injecting a session factory in, that was creating a session at the start of the quartz threads lifecycle, which was then being reused to run the various quartz jobs that used the hibernate session. Hibernate then was caching in the session, which is its expected behavior.
The problem then is that the thread pool was never releasing the session, so hibernate was staying resident and maintaining the cache for the lifecycle of the session. Since this was using springs hibernate template support, there was no explicit use of the sessions (we are using a dao -> manager -> driver -> quartz-job hierarchy, the dao is injected with hibernate configs through spring, so the operations are done directly on the templates).
So the session was never being closed, hibernate was maintaining references to the cache objects, so they were never being garbage collected, so each time a new job ran it would just keep filling up the cache local to the thread, so there was not even any sharing between the different jobs. Also since this is a write-intensive job (very little reading), the cache was mostly wasted, so the objects kept getting created.
The solution: create a dao method that explicitly calls session.flush() and session.clear(), and invoke that method at the beginning of each job.
The app has been running for a few days now with no monitoring issues, memory errors or restarts.
Thanks for everyone's help on this, it was a pretty tricky bug to track down, as everything was doing exactly what it was supposed to, but in the end a 3 line method managed to fix all the problems.

Can you run the production box with JMX enabled?
-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.port=<port>
...
Monitoring and Management Using JMX
And then attach with JConsole, VisualVM?
Is it ok to do a heap dump with jmap?
If yes you could then analyze the heap dump for leaks with JProfiler (you already have), jhat, VisualVM, Eclipse MAT. Also compare heap dumps that might help to find leaks/patterns.
And as you mentioned jakarta-commons. There is a problem when using the jakarta-commons-logging related to holding onto the classloader. For a good read on that check
A day in the life of a memory leak hunter (release(Classloader))

It seems like memory other than heap is leaking, you mention that heap is remaining stable. A classical candidate is permgen (permanent generation) which consists of 2 things: loaded class objects and interned strings. Since you report having connected with VisualVM you should be able to seem the amount of loaded classes, if there is a continues increase of the loaded classes (important, visualvm also shows the total amount of classes ever loaded, it's okay if this goes up but the amount of loaded classes should stabilize after a certain time).
If it does turn out to be a permgen leak then debugging gets trickier since tooling for permgen analysis is rather lacking in comparison to the heap. Your best bet is to start a small script on the server that repeatedly (every hour?) invokes:
jmap -permstat <pid> > somefile<timestamp>.txt
jmap with that parameter will generate an overview of loaded classes together with an estimate of their size in bytes, this report can help you identify if certain classes do not get unloaded. (note: with I mean the process id and should be some generated timestamp to distinguish the files)
Once you identified certain classes as being loaded and not unloaded you can figure out mentally where these might be generated, otherwise you can use jhat to analyze dumps generated with jmap -dump. I'll keep that for a future update should you need the info.

I would look for directly allocated ByteBuffer.
From the javadoc.
A direct byte buffer may be created by invoking the allocateDirect factory method of this class. The buffers returned by this method typically have somewhat higher allocation and deallocation costs than non-direct buffers. The contents of direct buffers may reside outside of the normal garbage-collected heap, and so their impact upon the memory footprint of an application might not be obvious. It is therefore recommended that direct buffers be allocated primarily for large, long-lived buffers that are subject to the underlying system's native I/O operations. In general it is best to allocate direct buffers only when they yield a measureable gain in program performance.
Perhaps the Tomcat code uses this do to I/O; configure Tomcat to use a different connector.
Failing that you could have a thread that periodically executes System.gc(). "-XX:+ExplicitGCInvokesConcurrent" might be an interesting option to try.

Any JAXB? I find that JAXB is a perm space stuffer.
Also, I find that visualgc, now shipped with JDK 6, is a great way to see what's going on in memory. It shows the eden, generational, and perm spaces and the transient behavior of the GC beautifully. All you need is the PID of the process. Maybe that will help while you work on JProfile.
And what about the Spring tracing/logging aspects? Maybe you can write a simple aspect, apply it declaratively, and do a poor man's profiler that way.

"Unfortunately, the problem also pops up sporadically, it seems to be unpredictable, it can run for days or even a week without having any problems, or it can fail 40 times in a day, and the only thing I can seem to catch consistently is that garbage collection is acting up."
Sounds like, this is bound to a use case which is executed up to 40 times a day and then not anymore for days. I hope, you do not just track only the symptoms. This must be something, that you can narrow down by tracing the actions of the application's actors (users, jobs, services).
If this happens by XML imports, you should compare the XML data of the 40 crashes day with data, that is imported on a zero crash day. Maybe it's some sort of logical problem, that you do not find inside your code, only.

I had the same problem, with couple of differences..
My technology is the following:
grails 2.2.4
tomcat7
quartz-plugin 1.0
I use two datasources on my application. That is a
particularity determinant to bug causes..
Another thing to consider is that quartz-plugin, inject hibernate session in quartz threads, just like #liam says, and quartz threads still alive, untill I finish application.
My problem was a bug on grails ORM combined with the way the plugin handle session and my two datasources.
Quartz plugin had a listener to init and destroy hibernate sessions
public class SessionBinderJobListener extends JobListenerSupport {
public static final String NAME = "sessionBinderListener";
private PersistenceContextInterceptor persistenceInterceptor;
public String getName() {
return NAME;
}
public PersistenceContextInterceptor getPersistenceInterceptor() {
return persistenceInterceptor;
}
public void setPersistenceInterceptor(PersistenceContextInterceptor persistenceInterceptor) {
this.persistenceInterceptor = persistenceInterceptor;
}
public void jobToBeExecuted(JobExecutionContext context) {
if (persistenceInterceptor != null) {
persistenceInterceptor.init();
}
}
public void jobWasExecuted(JobExecutionContext context, JobExecutionException exception) {
if (persistenceInterceptor != null) {
persistenceInterceptor.flush();
persistenceInterceptor.destroy();
}
}
}
In my case, persistenceInterceptor instances AggregatePersistenceContextInterceptor, and it had a List of HibernatePersistenceContextInterceptor. One for each datasource.
Every opertion do with AggregatePersistenceContextInterceptor its passed to HibernatePersistence, without any modification or treatments.
When we calls init() on HibernatePersistenceContextInterceptor he increment the static variable below
private static ThreadLocal<Integer> nestingCount = new ThreadLocal<Integer>();
I don't know the pourpose of that static count. I just know he it's incremented two times, one per datasource, because of the AggregatePersistence implementation.
Until here I just explain the cenario.
The problem comes now...
When my quartz job finish, the plugin calls the listener to flush and destroy hibernate sessions, like you can see in source code of SessionBinderJobListener.
The flush occurs perfectly, but the destroy not, because HibernatePersistence, do one validation before close hibernate session... It examines nestingCount to see if the value is grather than 1. If the answer is yes, he not close the session.
Simplifying what was did by Hibernate:
if(--nestingCount.getValue() > 0)
do nothing;
else
close the session;
That's the base of my memory leak..
Quartz threads still alive with all objects used in session, because grails ORM not close session, because of a bug caused because I have two datasources.
To solve that, I customize the listener, to call clear before destroy, and call destroy two times, (one for each datasource). Ensuring my session was clear and destroyed, and if the destroy fails, he was clear at least.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.