I was wondering. I have a simple Java web project with a servlet. When no users are using it (I host it local on GlassFish) I still see a sawtooth pattern in the memory usage of the process. I can't seem to understand what the JVM is doing? I understand the normal flow of the memory used by JVM. The heap is getting filled with objects the application is creating. At a certain point the heap reaches a point where the garbage collector comes in, which will remove all the 'cached' objects which no longer are used, so that new objects can be created and can be filled in the heap size.
What I wonder is, what is the JVM doing all the time when it's idle?
EDIT:
Like I said in the comments. I created a very simple Java application in Eclipse which showed a Swing window saying 'hello world'. When I watch the memory usage of the JVM in Java VisualVM, I see the same sawtooth pattern.
It turns out that VisualVM is the guilty one. This is because VisualVM constantly asks the JVM what it's statistics are, so the JVM creates objects to give the information to VisualVM.
Thanks to
Peter Lawrey
My guess is that the app server is doing some sort of behind-the-scenes book-keeping. Keep in mind that, even though no one is using it, there are still threads running in the background. Also, does your app use any type of chronological trigger framework, like Quartz? Or, what about logging. Even though it may not be actively sending log messages to a file, the Loggers are still getting called, which is causing all sorts of things to be created/destroyed.
I have a J2EE java application which processes SOAP requests. In our production environment (HPUX,OC4J,Java 5) we have about 20 threads running for this process, and we sometimes see 1 thread pausing for ~15 seconds. Until now, I haven't succeeded replicating the problem in our preproduction environment, and I'm scared of breaking stuff and violating SLA's if I use jconsole and associated tools on our production server.
Who has any inspiration? I know about http://java.sun.com/j2se/1.5/pdf/jdk50_ts_guide.pdf but I miss the experience to dare using it straight in production (plus, the HPUX guys threw some of these tools out of the toolbox, replacing them with HPJMeter)
Also, although this suggests a GC problem to me, I don't yet know enough to prove or disprove this theory and I am open to other suggestions.
We connect jconsole (and other tools) straight to production regularly. There is no significant overhead for us, the instrumentation is already going on within the JVM so you'd just be connecting a remote process to read published values. I say go for it!
Either way, you really need to see what's going on on the box. Thread dumps might or do some internal instrumentation. By internal instrumentation, I mean recording key measures within the code and exposing those somehow. It's essentially what the JVM does (exposing them via JMX) but rolling your own gives you more specificity. For example, I'm frequently recording request/response or other critical path performance timings internally.
oh, and one more thing. You can setup your app to using an agent to provide even more information. Typically this would be to plug a profiler in (like jprofiler or yourkit) but this does usually add more overhead and isn't recommended for production.
It's also worth thinking about the cost of not getting the information you need out of the VM. For example, is the cost of not fixing the issue more or less than the cost of a small % drop of performance when monitoring?
More scientifically, this article has some comments. It's suggesting up to 7% overhead (contradicting my previous point), a previous article from 2006 suggests 3-4% but both are highly contextual results. For example, CPU intensive applications may or may not be affected more than IO bound ones.
So a more appropriate answer from me (rather than just "go for it") would be to understand the impact it would have for your application in your environment through measurement. Run representative tests on a similar environment to production with jconsole connected and disconnected and see for your self.
Also see this stackoverflow question.
There are a few things that you can do on HP-UX to get additional information from a running Java process. If you send the PROF signal to the JVM, it will toggle the generation of a GC log (as if you had used the -Xverbosegc command line option). Generating the GC log is very inexpensive, so you should be able to turn this on in production without affecting the performance.
If you send the USR2 signal to the JVM, it starts profiling (same as -Xeprof). If you send the signal a second time, it turns off the profiling. This will have a noticeable performance impact, though it is smaller that what you would see from an external, third party profiler.
You can analyze the resulting data files using HPjmeter. HPjmeter can also connect to a running JVM for real-time monitoring. With Java 5, you need to start the JVM with the -agentlib option. If you were using Java 6, you could attach to the running JVM without needing any extra command line options.
So I've been trying to track down a good way to monitor when the JVM might potentially be heading towards an OOM situation. They best way that seems to work with our app is to track back-to-back concurrent mode failures through CMS. This indicates that the tenured pool is filling up faster than it can actually clean itself up, or its reclaiming very little.
The JMX bean for tracking GCs has very generic information such as memory usage before/after and the like. This information has been relatively inconsistent at best. Is there a better way I can be monitoring this potential warning sign of a dying JVM?
Assuming you're using the Sun JVM then I am aware of 2 options;
memory management mxbeans (API ref starts here) which you appear to be using already though note there are some hotspot specific internal ones you can get access to, see this blog for an example of how to use
jstat: cmd reference is here, you'll want the -gccause option. You can either write a script to launch this and parse the output or, theoretically, you could spawn a process from the host jvm (or another one) that then reads the output stream from jstat to detect the gc causes. I don't think the cause reporting is 100% comprehensive though. I don't know a way to get this info programatically from java code.
With standard JRE 1.6 GC, heap utilization can trend upwards overtime with the garbage collector running less and less frequently depending on the nature of your application and your maximum specified heap size. That said, it is hard to say what is going on without having more information.
A few methods to investigate further:
You could take a heap dump of your application while it is running using jmap, and then inspect the heap using jhat to see which objects are in heap at any given time.
You could also run your application with -XX:+HeapDumpOnOutOfMemoryError which will automatically make a heap dump on the first out of memory exception that the JVM encounters.
You could create a monitoring bean specific to your application, and create accessor methods you can hit with a remote JMX client. For example methods to return the sizes of queues and other collections that are likely places of memory utilization in your program.
HTH
We have an Java ERP type of application. Communication between server an client is via RMI. In peak hours there can be up to 250 users logged in and about 20 of them are working at the same time. This means that about 20 threads are live at any given time in peak hours.
The server can run for hours without any problems, but all of a sudden response times get higher and higher. Response times can be in minutes.
We are running on Windows 2008 R2 with Sun's JDK 1.6.0_16. We have been using perfmon and Process Explorer to see what is going on. The only thing that we find odd is that when server starts to work slow, the number of handles java.exe process has opened is around 3500. I'm not saying that this is the acual problem.
I'm just curious if there are some guidelines I should follow to be able to pinpoint the problem. What tools should I use? ....
Can you access to the log configuration of this application.
If you can, you should change the log level to "DEBUG". Tracing the DEBUG logs of a request could give you a usefull information about the contention point.
If you can't, profiler tools are can help you :
VisualVM (Free, and good product)
Eclipse TPTP (Free, but more complicated than VisualVM)
JProbe (not Free but very powerful. It is my favorite Java profiler, but it is expensive)
If the application has been developped with JMX control points, you can plug a JMX viewer to get informations...
If you want to stress the application to trigger the problem (if you want to verify whether it is a charge problem), you can use stress tools like JMeter
Sounds like the garbage collection cannot keep up and starts "halt-the-world" collecting for some reason.
Attach with jvisualvm in the JDK when starting and have a look at the collected data when the performance drops.
The problem you'r describing is quite typical but general as well. Causes can range from memory leaks, resource contention etcetera to bad GC policies and heap/PermGen-space allocation. To point out exact problems with your application, you need to profile it (I am aware of tools like Yourkit and JProfiler). If you profile your application wisely, only some application cycles would reveal the problems otherwise profiling isn't very easy itself.
In a similar situation, I have coded a simple profiling code myself. Basically I used a ThreadLocal that has a "StopWatch" (based on a LinkedHashMap) in it, and I then insert code like this into various points of the application: watch.time("OperationX");
then after the thread finishes a task, I'd call watch.logTime(); and the class would write a log that looks like this: [DEBUG] StopWatch time:Stuff=0, AnotherEvent=102, OperationX=150
After this I wrote a simple parser that generates CSV out from this log (per code path). The best thing you can do is to create a histogram (can be easily done using excel). Averages, medium and even mode can fool you.. I highly recommend to create a histogram.
Together with this histogram, you can create line graphs using average/medium/mode (which ever represents data best, you can determine this from the histogram).
This way, you can be 100% sure exactly what operation is taking time. If you can't determine the culprit, binary search is your friend (fine grain the events).
Might sound really primitive, but works. Also, if you make a library out of it, you can use it in any project. It's also cool because you can easily turn it on in production as well..
Aside from the GC that others have mentioned, Try taking thread dumps every 5-10 seconds for about 30 seconds during your slow down. There could be a case where DB calls, Web Service, or some other dependency becomes slow. If you take a look at the tread dumps you will be able to see threads which don't appear to move, and you could narrow your culprit that way.
From the GC stand point, do you monitor your CPU usage during these times? If the GC is running frequently you will see a jump in your overall CPU usage.
If only this was a Solaris box, prstat would be your friend.
For acute issues like this a quick jstack <pid> should quickly point out the problem area. Probably no need to get all fancy on it.
If I had to guess, I'd say Hotspot jumped in and tightly optimised some badly written code. Netbeans grinds to a halt where it uses a WeakHashMap with newly created objects to cache file data. When optimised, the entries can be removed from the map straight after being added. Obviously, if the cache is being relied upon, much file activity follows. You probably wont see the drive light up, because it'll all be cached by the OS.
This is a problem I have been trying to track down for a couple months now. I have a java app running that processes xml feeds and stores the result in a database. There have been intermittent resource problems that are very difficult to track down.
Background:
On the production box (where the problem is most noticeable), i do not have particularly good access to the box, and have been unable to get Jprofiler running. That box is a 64bit quad-core, 8gb machine running centos 5.2, tomcat6, and java 1.6.0.11. It starts with these java-opts
JAVA_OPTS="-server -Xmx5g -Xms4g -Xss256k -XX:MaxPermSize=256m -XX:+PrintGCDetails -
XX:+PrintGCTimeStamps -XX:+UseConcMarkSweepGC -XX:+PrintTenuringDistribution -XX:+UseParNewGC"
The technology stack is the following:
Centos 64-bit 5.2
Java 6u11
Tomcat 6
Spring/WebMVC 2.5
Hibernate 3
Quartz 1.6.1
DBCP 1.2.1
Mysql 5.0.45
Ehcache 1.5.0
(and of course a host of other dependencies, notably the jakarta-commons libraries)
The closest I can get to reproducing the problem is a 32-bit machine with lower memory requirements. That I do have control over. I have probed it to death with JProfiler and fixed many performance problems (synchronization issues, precompiling/caching xpath queries, reducing the threadpool, and removing unnecessary hibernate pre-fetching, and overzealous "cache-warming" during processing).
In each case, the profiler showed these as taking up huge amounts of resources for one reason or another, and that these were no longer primary resource hogs once the changes went in.
The Problem:
The JVM seems to completely ignore the memory usage settings, fills all memory and becomes unresponsive. This is an issue for the customer facing end, who expects a regular poll (5 minute basis and 1-minute retry), as well for our operations teams, who are constantly notified that a box has become unresponsive and have to restart it. There is nothing else significant running on this box.
The problem appears to be garbage collection. We are using the ConcurrentMarkSweep (as noted above) collector because the original STW collector was causing JDBC timeouts and became increasingly slow. The logs show that as the memory usage increases, that is begins to throw cms failures, and kicks back to the original stop-the-world collector, which then seems to not properly collect.
However, running with jprofiler, the "Run GC" button seems to clean up the memory nicely rather than showing an increasing footprint, but since I can not connect jprofiler directly to the production box, and resolving proven hotspots doesnt seem to be working I am left with the voodoo of tuning Garbage Collection blind.
What I have tried:
Profiling and fixing hotspots.
Using STW, Parallel and CMS garbage collectors.
Running with min/max heap sizes at 1/2,2/4,4/5,6/6 increments.
Running with permgen space in 256M increments up to 1Gb.
Many combinations of the above.
I have also consulted the JVM [tuning reference](http://java.sun.com/javase/technologies/hotspot/gc/gc_tuning_6.html) , but can't really find anything explaining this behavior or any examples of _which_ tuning parameters to use in a situation like this.
I have also (unsuccessfully) tried jprofiler in offline mode, connecting with jconsole, visualvm, but I can't seem to find anything that will interperet my gc log data.
Unfortunately, the problem also pops up sporadically, it seems to be unpredictable, it can run for days or even a week without having any problems, or it can fail 40 times in a day, and the only thing I can seem to catch consistently is that garbage collection is acting up.
Can anyone give any advice as to:
a) Why a JVM is using 8 physical gigs and 2 gb of swap space when it is configured to max out at less than 6.
b) A reference to GC tuning that actually explains or gives reasonable examples of when and what kind of setting to use the advanced collections with.
c) A reference to the most common java memory leaks (i understand unclaimed references, but I mean at the library/framework level, or something more inherenet in data structures, like hashmaps).
Thanks for any and all insight you can provide.
EDIT
Emil H:
1) Yes, my development cluster is a mirror of production data, down to the media server. The primary difference is the 32/64bit and the amount of RAM available, which I can't replicate very easily, but the code and queries and settings are identical.
2) There is some legacy code that relies on JaxB, but in reordering the jobs to try to avoid scheduling conflicts, I have that execution generally eliminated since it runs once a day. The primary parser uses XPath queries which call down to the java.xml.xpath package. This was the source of a few hotspots, for one the queries were not being pre-compiled, and two the references to them were in hardcoded strings. I created a threadsafe cache (hashmap) and factored the references to the xpath queries to be final static Strings, which lowered resource consumption significantly. The querying still is a large part of the processing, but it should be because that is the main responsibility of the application.
3) An additional note, the other primary consumer is image operations from JAI (reprocessing images from a feed). I am unfamiliar with java's graphic libraries, but from what I have found they are not particularly leaky.
(thanks for the answers so far, folks!)
UPDATE:
I was able to connect to the production instance with VisualVM, but it had disabled the GC visualization / run-GC option (though i could view it locally). The interesting thing: The heap allocation of the VM is obeying the JAVA_OPTS, and the actual allocated heap is sitting comfortably at 1-1.5 gigs, and doesnt seem to be leaking, but the box level monitoring still shows a leak pattern, but it is not reflected in the VM monitoring. There is nothing else running on this box, so I am stumped.
Well, I finally found the issue that was causing this, and I'm posting a detail answer in case someone else has these issues.
I tried jmap while the process was acting up, but this usually caused the jvm to hang further, and I would have to run it with --force. This resulted in heap dumps that seemed to be missing a lot of data, or at least missing the references between them. For analysis, I tried jhat, which presents a lot of data but not much in the way of how to interpret it. Secondly, I tried the eclipse-based memory analysis tool ( http://www.eclipse.org/mat/ ), which showed that the heap was mostly classes related to tomcat.
The issue was that jmap was not reporting the actual state of the application, and was only catching the classes on shutdown, which was mostly tomcat classes.
I tried a few more times, and noticed that there were some very high counts of model objects (actually 2-3x more than were marked public in the database).
Using this I analyzed the slow query logs, and a few unrelated performance problems. I tried extra-lazy loading ( http://docs.jboss.org/hibernate/core/3.3/reference/en/html/performance.html ), as well as replacing a few hibernate operations with direct jdbc queries (mostly where it was dealing with loading and operating on large collections -- the jdbc replacements just worked directly on the join tables), and replaced some other inefficient queries that mysql was logging.
These steps improved pieces of the frontend performance, but still did not address the issue of the leak, the app was still unstable and acting unpredictably.
Finally, I found the option: -XX:+HeapDumpOnOutOfMemoryError . This finally produced a very large (~6.5GB) hprof file that accurately showed the state of the application. Ironically, the file was so large that jhat could not anaylze it, even on a box with 16gb of ram. Fortunately, MAT was able to produce some nice looking graphs and showed some better data.
This time what stuck out was a single quartz thread was taking up 4.5GB of the 6GB of heap, and the majority of that was a hibernate StatefulPersistenceContext ( https://www.hibernate.org/hib_docs/v3/api/org/hibernate/engine/StatefulPersistenceContext.html ). This class is used by hibernate internally as its primary cache (i had disabled the second-level and query-caches backed by EHCache).
This class is used to enable most of the features of hibernate, so it can't be directly disabled (you can work around it directly, but spring doesn't support stateless session) , and i would be very surprised if this had such a major memory leak in a mature product. So why was it leaking now?
Well, it was a combination of things:
The quartz thread pool instantiates with certain things being threadLocal, spring was injecting a session factory in, that was creating a session at the start of the quartz threads lifecycle, which was then being reused to run the various quartz jobs that used the hibernate session. Hibernate then was caching in the session, which is its expected behavior.
The problem then is that the thread pool was never releasing the session, so hibernate was staying resident and maintaining the cache for the lifecycle of the session. Since this was using springs hibernate template support, there was no explicit use of the sessions (we are using a dao -> manager -> driver -> quartz-job hierarchy, the dao is injected with hibernate configs through spring, so the operations are done directly on the templates).
So the session was never being closed, hibernate was maintaining references to the cache objects, so they were never being garbage collected, so each time a new job ran it would just keep filling up the cache local to the thread, so there was not even any sharing between the different jobs. Also since this is a write-intensive job (very little reading), the cache was mostly wasted, so the objects kept getting created.
The solution: create a dao method that explicitly calls session.flush() and session.clear(), and invoke that method at the beginning of each job.
The app has been running for a few days now with no monitoring issues, memory errors or restarts.
Thanks for everyone's help on this, it was a pretty tricky bug to track down, as everything was doing exactly what it was supposed to, but in the end a 3 line method managed to fix all the problems.
Can you run the production box with JMX enabled?
-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.port=<port>
...
Monitoring and Management Using JMX
And then attach with JConsole, VisualVM?
Is it ok to do a heap dump with jmap?
If yes you could then analyze the heap dump for leaks with JProfiler (you already have), jhat, VisualVM, Eclipse MAT. Also compare heap dumps that might help to find leaks/patterns.
And as you mentioned jakarta-commons. There is a problem when using the jakarta-commons-logging related to holding onto the classloader. For a good read on that check
A day in the life of a memory leak hunter (release(Classloader))
It seems like memory other than heap is leaking, you mention that heap is remaining stable. A classical candidate is permgen (permanent generation) which consists of 2 things: loaded class objects and interned strings. Since you report having connected with VisualVM you should be able to seem the amount of loaded classes, if there is a continues increase of the loaded classes (important, visualvm also shows the total amount of classes ever loaded, it's okay if this goes up but the amount of loaded classes should stabilize after a certain time).
If it does turn out to be a permgen leak then debugging gets trickier since tooling for permgen analysis is rather lacking in comparison to the heap. Your best bet is to start a small script on the server that repeatedly (every hour?) invokes:
jmap -permstat <pid> > somefile<timestamp>.txt
jmap with that parameter will generate an overview of loaded classes together with an estimate of their size in bytes, this report can help you identify if certain classes do not get unloaded. (note: with I mean the process id and should be some generated timestamp to distinguish the files)
Once you identified certain classes as being loaded and not unloaded you can figure out mentally where these might be generated, otherwise you can use jhat to analyze dumps generated with jmap -dump. I'll keep that for a future update should you need the info.
I would look for directly allocated ByteBuffer.
From the javadoc.
A direct byte buffer may be created by invoking the allocateDirect factory method of this class. The buffers returned by this method typically have somewhat higher allocation and deallocation costs than non-direct buffers. The contents of direct buffers may reside outside of the normal garbage-collected heap, and so their impact upon the memory footprint of an application might not be obvious. It is therefore recommended that direct buffers be allocated primarily for large, long-lived buffers that are subject to the underlying system's native I/O operations. In general it is best to allocate direct buffers only when they yield a measureable gain in program performance.
Perhaps the Tomcat code uses this do to I/O; configure Tomcat to use a different connector.
Failing that you could have a thread that periodically executes System.gc(). "-XX:+ExplicitGCInvokesConcurrent" might be an interesting option to try.
Any JAXB? I find that JAXB is a perm space stuffer.
Also, I find that visualgc, now shipped with JDK 6, is a great way to see what's going on in memory. It shows the eden, generational, and perm spaces and the transient behavior of the GC beautifully. All you need is the PID of the process. Maybe that will help while you work on JProfile.
And what about the Spring tracing/logging aspects? Maybe you can write a simple aspect, apply it declaratively, and do a poor man's profiler that way.
"Unfortunately, the problem also pops up sporadically, it seems to be unpredictable, it can run for days or even a week without having any problems, or it can fail 40 times in a day, and the only thing I can seem to catch consistently is that garbage collection is acting up."
Sounds like, this is bound to a use case which is executed up to 40 times a day and then not anymore for days. I hope, you do not just track only the symptoms. This must be something, that you can narrow down by tracing the actions of the application's actors (users, jobs, services).
If this happens by XML imports, you should compare the XML data of the 40 crashes day with data, that is imported on a zero crash day. Maybe it's some sort of logical problem, that you do not find inside your code, only.
I had the same problem, with couple of differences..
My technology is the following:
grails 2.2.4
tomcat7
quartz-plugin 1.0
I use two datasources on my application. That is a
particularity determinant to bug causes..
Another thing to consider is that quartz-plugin, inject hibernate session in quartz threads, just like #liam says, and quartz threads still alive, untill I finish application.
My problem was a bug on grails ORM combined with the way the plugin handle session and my two datasources.
Quartz plugin had a listener to init and destroy hibernate sessions
public class SessionBinderJobListener extends JobListenerSupport {
public static final String NAME = "sessionBinderListener";
private PersistenceContextInterceptor persistenceInterceptor;
public String getName() {
return NAME;
}
public PersistenceContextInterceptor getPersistenceInterceptor() {
return persistenceInterceptor;
}
public void setPersistenceInterceptor(PersistenceContextInterceptor persistenceInterceptor) {
this.persistenceInterceptor = persistenceInterceptor;
}
public void jobToBeExecuted(JobExecutionContext context) {
if (persistenceInterceptor != null) {
persistenceInterceptor.init();
}
}
public void jobWasExecuted(JobExecutionContext context, JobExecutionException exception) {
if (persistenceInterceptor != null) {
persistenceInterceptor.flush();
persistenceInterceptor.destroy();
}
}
}
In my case, persistenceInterceptor instances AggregatePersistenceContextInterceptor, and it had a List of HibernatePersistenceContextInterceptor. One for each datasource.
Every opertion do with AggregatePersistenceContextInterceptor its passed to HibernatePersistence, without any modification or treatments.
When we calls init() on HibernatePersistenceContextInterceptor he increment the static variable below
private static ThreadLocal<Integer> nestingCount = new ThreadLocal<Integer>();
I don't know the pourpose of that static count. I just know he it's incremented two times, one per datasource, because of the AggregatePersistence implementation.
Until here I just explain the cenario.
The problem comes now...
When my quartz job finish, the plugin calls the listener to flush and destroy hibernate sessions, like you can see in source code of SessionBinderJobListener.
The flush occurs perfectly, but the destroy not, because HibernatePersistence, do one validation before close hibernate session... It examines nestingCount to see if the value is grather than 1. If the answer is yes, he not close the session.
Simplifying what was did by Hibernate:
if(--nestingCount.getValue() > 0)
do nothing;
else
close the session;
That's the base of my memory leak..
Quartz threads still alive with all objects used in session, because grails ORM not close session, because of a bug caused because I have two datasources.
To solve that, I customize the listener, to call clear before destroy, and call destroy two times, (one for each datasource). Ensuring my session was clear and destroyed, and if the destroy fails, he was clear at least.