What could cause global Tomcat/JVM slowdown?
I'm experiencing a strange but severe problem running several (about 15) instances of a Java EE-ish web application (Hibernate 4 + Spring + Quartz + JSF + Facelets + RichFaces) on Tomcat 7/Java 7.
The system runs just fine, but after a greatly varying amount of time all instances of the application suddenly and simultaneously suffer from rising response times. Basically the application still works, but the response times are about three times higher.
These are two diagrams displaying the response times of two short workflows/actions (log in, access list of seminars, ajax-refresh this list, log out; the lower line is just the request time for the ajax refresh) on two example instances of the application:
As you can see, both instances of the application "explode" at the exact same time and stay slow; in fact all instances of the application "explode" simultaneously. After restarting the server everything is back to normal.
We're storing the session data in a database and use this for clustering. We checked session size and count and both are rather low (on other servers with other applications we sometimes have larger and more sessions). The other Tomcat in the cluster usually stays fast for some more hours, and after this random-ish amount of time it also "dies". We checked the heap sizes with jconsole and the main heap stays between 1 and 2.5 GB, the DB connection pool is basically full of free connections, as are the thread pools. Max heap size is 5 GB and there's also plenty of perm gen space available. The load is not especially high; there's only about 5% load on the main CPU. The server does not swap. It's also not a hardware issue, as we additionally deployed the applications to a VM where the problems remain the same.
I don't know where to look anymore; I am out of ideas. Does anyone have an idea where to look?
2013-02-21 Update: New Data!
I added two more timing traces to the application. As for the measurement: the monitoring system calls a servlet that performs two tasks, measures the execution time of each of them on the server and writes the times taken as the response. These values are then logged by the monitoring system.
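For illustration, here is a minimal sketch of what such a measurement servlet could look like (the class name and the two measured tasks are placeholders, not the actual code):

    public class TimingServlet extends javax.servlet.http.HttpServlet {

        @Override
        protected void doGet(javax.servlet.http.HttpServletRequest req,
                             javax.servlet.http.HttpServletResponse resp)
                throws javax.servlet.ServletException, java.io.IOException {
            // Task 1: the Hibernate read described below (placeholder)
            long t1 = System.nanoTime();
            runDbReadTask();
            long dbMillis = (System.nanoTime() - t1) / 1000000;

            // Task 2: the pure CPU benchmark (see the appendix; placeholder)
            long t2 = System.nanoTime();
            runCpuBenchmark();
            long cpuMillis = (System.nanoTime() - t2) / 1000000;

            // The monitoring system parses these two numbers out of the response body.
            resp.getWriter().print(dbMillis + ";" + cpuMillis);
        }

        private void runDbReadTask() { /* load and iterate entities via Hibernate */ }

        private void runCpuBenchmark() { /* Math.tan + integer loop, see appendix */ }
    }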
I have several interesting new facts: a hot redeployment of the application causes this single instance on the current Tomcat to go nuts. This also seems to affect raw CPU calculation performance (see below). This individual-context-explosion is different from the overall-context-explosion that occurs randomly.
Now for some data:
First the individual lines:
Light blue is total execution time of a small workflow (details see above), measured on the client
Red is "part" of light blue and is the time taken to perform a special step of that workflow, measured on the client
Dark blue is measured in the application and consists of reading a list of entities from the DB through Hibernate and iterating over that list, fetching lazy collections and lazy entities.
Green is a small CPU benchmark using floating point and integer operations. As far as I can see there is no object allocation, so no garbage.
Now for the individual stages of the explosion: I marked each image with three black dots. The first one is a "small" explosion in more or less only one application instance - in Inst1 it jumps (especially visible in the red line), while Inst2 below more or less stays calm.
After this small explosion the "big bang" occurs and all application instances on that Tomcat explode (2nd dot). Note that this explosion affects all high level operations (request processing, DB access), but not the CPU benchmark. It stays low in both systems.
After that I hot-redeployed Inst1 by touching the context.xml file. As I said earlier, this instance goes from exploded to completely devastated now (the light blue line is off the chart - it is at about 18 secs). Note how a) this redeployment does not affect Inst2 at all, and b) the raw DB access of Inst1 is also not affected - but the CPU suddenly seems to have become slower! This is crazy, I say.
Update of update
The leak prevention listener of Tomcat does not complain about stale ThreadLocals or threads when the application is undeployed. There obviously seems to be some cleanup problem (which I assume is not directly related to the Big Bang), but Tomcat doesn't have a hint for me.
2013-02-25 Update: Application Environment and Quartz Schedule
The application environment is not very sophisticated. Network components aside (I don't know enough about those), there's basically one application server (Linux) and two database servers (MySQL 5 and MSSQL 2008). The main load is on the MSSQL server; the other one merely serves as a place to store the sessions.
The application server runs an Apache as a load balancer in front of two Tomcats, so we have two JVMs running on the same hardware (two Tomcat instances). We use this configuration not to actually balance load, as the application server is capable of running the application just fine (which it did for years now), but to enable small application updates without downtime. The web application in question is deployed as separate contexts for different customers, about 15 contexts per Tomcat. (I seem to have mixed up "instances" and "contexts" in my posting - here in the office they're often used synonymously and we usually magically know what the colleague is talking about. My bad, I'm really sorry.)
To clarify the situation with better wording: the diagrams I posted show response times of two different contexts of the same application on the same JVM. The Big Bang affects all contexts on one JVM but doesn't happen on the other one (the order in which the Tomcats explode is random btw). After hot-redeployment one context on one Tomcat instance goes nuts (with all the funny side effects, like seemingly slower CPU for that context).
The overall load on the system is rather low. It's an internal, core-business-related software with about 30 simultaneously active users. Application-specific requests (server touches) are currently at about 130 per minute. The number of single requests is low, but the requests themselves often require several hundred selects against the database, so they're rather expensive. But usually everything's perfectly acceptable. The application also does not create large, unbounded caches - some lookup data is cached, but only for a short amount of time.
Above I wrote that the servers were capable of running the application just fine for several years. I know that the best way to find the problem would be to find out exactly when things went wrong for the first time and see what has been changed in that timeframe (in the application itself, the associated libraries or the infrastructure); however, the problem is that we don't know when the problems first occurred. Let's just call that suboptimal (in the sense of absent) application monitoring... :-/
We ruled out some aspects, but the application has been updated several times during the last months, so we e.g. cannot simply deploy an older version. The largest update that wasn't a feature change was a switch from JSP to Facelets. But still, "something" must be the cause of all the problems, yet I have no idea why Facelets, for instance, should influence pure DB query times.
Quartz
As for the Quartz schedule: there's a total of 8 jobs. Most of them run only once per day and have to do with large-volume data synchronization (absolutely not "large" as in "big data" large; it's just more than the average user sees through his usual daily work). However, those jobs of course run at night, while the problems occur during daytime. I omit a detailed job listing here (if beneficial I can of course provide more details). The jobs' source code has not been altered during the last months. I already checked whether the explosions align with the jobs - yet the results are inconclusive at best. I'd actually say that they don't align, but as there are several jobs that run every minute I can't rule it out just yet. The actual jobs that run every minute are pretty lightweight in my opinion: they usually check if data is available (in different sources - DB, external systems, email account) and if so write it to the DB or push it to another system.
However, I'm currently enabling logging of individual job executions so that I can see the exact start and end timestamp of each single job execution. Perhaps this provides more insight.
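If it helps anyone, such a job-execution log can be produced with a global Quartz JobListener. A minimal, hedged sketch against the plain org.quartz.JobListener interface (the logging target is a placeholder, and registration differs slightly between Quartz 1.x and 2.x):

    public class JobTimingListener implements org.quartz.JobListener {

        public String getName() {
            return "jobTimingListener";
        }

        public void jobToBeExecuted(org.quartz.JobExecutionContext ctx) {
            // log the start timestamp of this job execution
            System.out.println(new java.util.Date() + " START " + ctx.getJobDetail());
        }

        public void jobExecutionVetoed(org.quartz.JobExecutionContext ctx) {
            // not needed for timing
        }

        public void jobWasExecuted(org.quartz.JobExecutionContext ctx,
                                   org.quartz.JobExecutionException ex) {
            // log the end timestamp plus the run time Quartz measured itself
            System.out.println(new java.util.Date() + " END   " + ctx.getJobDetail()
                    + " took " + ctx.getJobRunTime() + " ms");
        }
    }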
2013-02-28 Update: JSF Phases and Timing
I manually added a JSF phase listener to the application. I executed a sample call (the ajax refresh) and this is what I got (left: normally running Tomcat instance, right: Tomcat instance after the Big Bang - the numbers have been taken almost simultaneously from both Tomcats and are in milliseconds; a minimal sketch of such a timing listener follows the numbers below):
RESTORE_VIEW: 17 vs 46
APPLY_REQUEST_VALUES: 170 vs 486
PROCESS_VALIDATIONS: 78 vs 321
UPDATE_MODEL_VALUES: 75 vs 307
RENDER_RESPONSE: 1059 vs 4162
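For reference, a phase-timing listener could look roughly like this (a sketch against the JSF 1.2 PhaseListener API, not the actual code; the logging target is a placeholder and the listener has to be registered via <phase-listener> in faces-config.xml):

    public class PhaseTimingListener implements javax.faces.event.PhaseListener {

        // start time of the currently running phase, per request thread
        private static final ThreadLocal<Long> START = new ThreadLocal<Long>();

        public javax.faces.event.PhaseId getPhaseId() {
            return javax.faces.event.PhaseId.ANY_PHASE;   // listen to every phase
        }

        public void beforePhase(javax.faces.event.PhaseEvent event) {
            START.set(System.nanoTime());
        }

        public void afterPhase(javax.faces.event.PhaseEvent event) {
            Long start = START.get();
            if (start == null) {
                return;
            }
            long millis = (System.nanoTime() - start) / 1000000;
            System.out.println(event.getPhaseId() + ": " + millis + " ms");
        }
    }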
The ajax refresh itself belongs to a search form and its search result. There's also another delay between the application's outermost request filter and the point where Web Flow starts its work: there's a FlowExecutionListenerAdapter that measures the time taken in certain phases of Web Flow. This listener reports 1405 ms for "Request submitted" (which is, as far as I know, the first Web Flow event) out of a total of 1632 ms for the complete request on an un-exploded Tomcat, so I estimate about 200 ms of overhead.
But on the exploded Tomcat it reports 5332 ms for request submitted (meaning all JSF phases happen in those 5 seconds) out of a total request duration of 7105 ms, so we're up to almost 2 seconds of overhead for everything outside Web Flow's request submitted.
Below my measurement filter the filter chain contains an org.ajax4jsf.webapp.BaseFilter, then the Spring servlet is called.
2013-06-05 Update: All the stuff going on in the last weeks
A small and rather late update... the application performance still sucks after some time and the behaviour remains erratic. Profiling has not helped much yet; it just generated an enormous amount of data that's hard to dissect. (Try poking around in performance data on, or profiling, a production system... sigh.) We conducted several tests (ripping out certain parts of the software, undeploying other applications, etc.) and actually had some improvements that affect the whole application: the default flush mode of our EntityManager is AUTO, and during view rendering lots of fetches and selects are issued, always including the check whether flushing is necessary.
So we built a JSF phase listener that sets the flush mode to COMMIT during RENDER_RESPONSE. This improved overall performance a lot and seems to have mitigated the problems somewhat.
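A minimal sketch of such a listener (not the actual code; how you obtain the current EntityManager depends on your setup - with Spring's OpenEntityManagerInViewFilter it is bound to the request - so currentEntityManager() below is only a placeholder):

    public class RenderResponseFlushModeListener implements javax.faces.event.PhaseListener {

        public javax.faces.event.PhaseId getPhaseId() {
            return javax.faces.event.PhaseId.RENDER_RESPONSE;
        }

        public void beforePhase(javax.faces.event.PhaseEvent event) {
            // rendering only reads, so skip the dirty-checking flushes before each query
            currentEntityManager().setFlushMode(javax.persistence.FlushModeType.COMMIT);
        }

        public void afterPhase(javax.faces.event.PhaseEvent event) {
            // restore the default behaviour after rendering
            currentEntityManager().setFlushMode(javax.persistence.FlushModeType.AUTO);
        }

        private javax.persistence.EntityManager currentEntityManager() {
            // placeholder: look up the request-bound EntityManager here, e.g. via
            // Spring's EntityManagerFactoryUtils.getTransactionalEntityManager(emf)
            throw new UnsupportedOperationException("wire this to your EntityManager lookup");
        }
    }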
Yet our application monitoring keeps yielding completely insane results and response times for some contexts on some Tomcat instances - like an action that should finish in under a second (and that actually does so right after deployment) now taking more than four seconds. (These numbers are supported by manual timing in the browsers, so it's not the monitoring that causes the problems.)
See the following picture for example:
This diagram shows two Tomcat instances running the same context (meaning same DB, same configuration, same JAR). Again, the blue line is the amount of time taken by pure DB read operations (fetch a list of entities, iterate over them, lazily fetch collections and associated data). The turquoise-ish and red lines are measured by rendering several views and doing an ajax refresh, respectively. The data rendered by two of the requests in turquoise-ish and red is mostly the same as is queried for the blue line.
Now around 07:00 on instance 1 (right) there's this huge increase in pure DB time which seems to affect actual render response times as well, but only on Tomcat 1. Tomcat 0 is largely unaffected, so it cannot be caused by the DB server or the network, with both Tomcats running on the same physical hardware. It has to be a software problem in the Java domain.
During my last tests I found out something interesting: All responses contain the header "X-Powered-By: JSF/1.2, JSF/1.2". Some (the redirect responses produced by WebFlow) even have "JSF/1.2" three times in there.
I traced down the code parts that set those headers and the first time this header is set it's caused by this stack:
... at org.ajax4jsf.webapp.FilterServletResponseWrapper.addHeader(FilterServletResponseWrapper.java:384)
at com.sun.faces.context.ExternalContextImpl.<init>(ExternalContextImpl.java:131)
at com.sun.faces.context.FacesContextFactoryImpl.getFacesContext(FacesContextFactoryImpl.java:108)
at org.springframework.faces.webflow.FlowFacesContext.newInstance(FlowFacesContext.java:81)
at org.springframework.faces.webflow.FlowFacesContextLifecycleListener.requestSubmitted(FlowFacesContextLifecycleListener.java:37)
at org.springframework.webflow.engine.impl.FlowExecutionListeners.fireRequestSubmitted(FlowExecutionListeners.java:89)
at org.springframework.webflow.engine.impl.FlowExecutionImpl.resume(FlowExecutionImpl.java:255)
at org.springframework.webflow.executor.FlowExecutorImpl.resumeExecution(FlowExecutorImpl.java:169)
at org.springframework.webflow.mvc.servlet.FlowHandlerAdapter.handle(FlowHandlerAdapter.java:183)
at org.springframework.webflow.mvc.servlet.FlowController.handleRequest(FlowController.java:174)
at org.springframework.web.servlet.mvc.SimpleControllerHandlerAdapter.handle(SimpleControllerHandlerAdapter.java:48)
at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:925)
at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:856)
at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:920)
at org.springframework.web.servlet.FrameworkServlet.doPost(FrameworkServlet.java:827)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:641)
... several thousands ;) more
The second time this header is set by
at org.ajax4jsf.webapp.FilterServletResponseWrapper.addHeader(FilterServletResponseWrapper.java:384)
at com.sun.faces.context.ExternalContextImpl.<init>(ExternalContextImpl.java:131)
at com.sun.faces.context.FacesContextFactoryImpl.getFacesContext(FacesContextFactoryImpl.java:108)
at org.springframework.faces.webflow.FacesContextHelper.getFacesContext(FacesContextHelper.java:46)
at org.springframework.faces.richfaces.RichFacesAjaxHandler.isAjaxRequestInternal(RichFacesAjaxHandler.java:55)
at org.springframework.js.ajax.AbstractAjaxHandler.isAjaxRequest(AbstractAjaxHandler.java:19)
at org.springframework.webflow.mvc.servlet.FlowHandlerAdapter.createServletExternalContext(FlowHandlerAdapter.java:216)
at org.springframework.webflow.mvc.servlet.FlowHandlerAdapter.handle(FlowHandlerAdapter.java:182)
at org.springframework.webflow.mvc.servlet.FlowController.handleRequest(FlowController.java:174)
at org.springframework.web.servlet.mvc.SimpleControllerHandlerAdapter.handle(SimpleControllerHandlerAdapter.java:48)
at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:925)
at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:856)
at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:920)
at org.springframework.web.servlet.FrameworkServlet.doPost(FrameworkServlet.java:827)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:641)
I have no idea if this could indicate a problem, but I did not notice this with any other application running on our servers, so it might as well provide some hints. I really have no idea what that framework code is doing (admittedly I have not dived into it yet)... perhaps someone has an idea? Or am I running into a dead end?
Appendix
My CPU benchmark code consists of a loop that calculates Math.tan and uses the result value to modify some fields on the servlet instance (no volatile/synchronized there), and secondly performs several raw integer calculations. It's not terribly sophisticated, I know, but well... it seems to show something in the charts; however, I am not sure what it shows. I do the field updates to prevent HotSpot from optimizing away all my precious code ;)
    long time2 = System.nanoTime();

    // Floating point part: the field writes (l1/l2) keep HotSpot from
    // eliminating the loop as dead code.
    for (int i = 0; i < 5000000; i++) {
        double tan = Math.tan(i);
        if (tan < 0) {
            this.l1++;
        } else {
            this.l2++;
        }
    }

    // Integer part: Collatz-style iteration, again writing to a field (steps).
    for (int i = 1; i < 7500; i++) {
        int n = i;
        while (n != 1) {
            this.steps++;
            if (n % 2 == 0) {
                n /= 2;
            } else {
                n = n * 3 + 1;
            }
        }
    }

    // This execution time is written to the client.
    time2 = System.nanoTime() - time2;
Solution
Increase the maximum size of the Code Cache:
-XX:ReservedCodeCacheSize=256m
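For a stock Tomcat installation, one common place to set this is CATALINA_OPTS, e.g. in bin/setenv.sh (adapt this to wherever your JVM options are defined):

    CATALINA_OPTS="$CATALINA_OPTS -XX:ReservedCodeCacheSize=256m"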
Background
We are using ColdFusion 10 which runs on Tomcat 7 and Java 1.7.0_15. Our symptoms were similar to yours. Occasionally the response times and the CPU usage on the server would go up by a lot for no apparent reason. It seemed as if the CPU got slower. The only solution was to restart ColdFusion (and Tomcat).
Initial analysis
I started by looking at the memory usage and the garbage collector log. There was nothing there that could explain our problems.
My next step was to schedule a heap dump every hour and to regularly perform sampling using VisualVM. The goal was to get data from before and after a slowdown so that it could be compared. I managed to accomplish that.
There was one function in the sampling that stood out: get() in coldfusion.runtime.ConcurrentReferenceHashMap. A lot of time was spent in it after the slowdown compared to very little before. I spent some time understanding how the function worked and developed a theory that maybe there was a problem with the hash function resulting in some huge buckets. Using the heap dumps I was able to see that the largest buckets only contained 6 elements, so I discarded that theory.
Code Cache
I finally got on the right track when I read "Java Performance: The Definitive Guide". It has a chapter on the JIT Compiler which talks about the Code Cache which I had not heard of before.
Compiler disabled
When monitoring the number of compilations performed (with jstat) and the size of the Code Cache (with the Memory Pools plugin of VisualVM), I saw that the size increased up to the maximum (which is 48 MB by default in our environment - the default varies depending on the Java version and the Java compiler). When the Code Cache became full, the JIT compiler was turned off. I have read that "CodeCache is full. Compiler has been disabled." should be printed when that happens, but I did not see that message; maybe the version we are using does not have that message. I know that the compiler was turned off because the number of compilations performed stopped increasing.
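If you want to watch this yourself, something like the following jstat invocation (pid and interval are placeholders) prints the cumulative compilation statistics every 5 seconds; if the "Compiled" column stops increasing while the application is still under load, the JIT compiler has most likely been switched off:

    jstat -compiler <pid of tomcat> 5000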
Deoptimization continues
The JIT compiler can deoptimize previously compiled functions, which causes the function to be executed by the interpreter again (unless the function is replaced by an improved compilation). The deoptimized function can be garbage collected to free up space in the Code Cache.
For some reason functions continued to be deoptimized even though nothing was compiled to replace them. More and more memory would become available in the Code Cache but the JIT Compiler was not restarted.
I never had -XX:+PrintCompilation enabled when we experienced a slowdown, but I am quite sure that I would have seen either ConcurrentReferenceHashMap.get(), or a function that it depends on, being deoptimized at that time.
Result
We have not seen any slowdowns since we increased the maximum size of the Code Cache to 256 MB and we have also seen a general performance improvement. There is currently 110 MB in our Code Cache.
First, let me say that you have done an excellent job grabbing detailed facts about the problem; I really like how you make it clear what you know and what you are speculating - it really helps.
EDIT 1 Massive edit after the update on context vs. instance
We can rule out:
GCs (that would affect the CPU benchmark service thread and spike the main CPU)
Quartz jobs (that would either affect both Tomcats or the CPU benchmark)
The database (that would affect both Tomcats)
Network packet storms and similar (that would affect both Tomcats)
What I believe you are suffering from is an increase in latency somewhere in your JVM. Latency is where a thread is waiting (synchronously) for a response from somewhere - it increases your servlet response time but at no cost to the CPU. Typical latencies are caused by:
Network calls, including
JDBC
EJB or RMI
JNDI
DNS
File shares
Disk reading and writing
Threading
Reading from (and sometimes writing to) queues
synchronized method or block
futures
Thread.join()
Object.wait()
Thread.sleep()
Confirming that the problem is latency
I suggest using a commercial profiling tool. I like [JProfiler](http://www.ej-technologies.com/products/jprofiler/overview.html) (a 15-day trial version is available), but YourKit is also recommended by the StackOverflow community. In this discussion I will use JProfiler terminology.
Attach to the Tomcat process while it is performing fine and get a feel for how it looks under normal conditions. In particular, use the high-level JDBC, JPA, JNDI, JMS, servlet, socket and file probes to see how long the JDBC, JMS, etc. operations take (screencast). Run this again when the server is exhibiting problems and compare. Hopefully you will see what precisely has been slowed down. In the product screenshot below, you can see the SQL timings using the JPA probe:
(source: ej-technologies.com)
However, it's possible that the probes did not isolate the issue - for example it might be some threading issue. Go to the Threads view for the application; this displays a running chart of the states of each thread, and whether it is executing on the CPU, in an Object.wait(), waiting to enter a synchronized block or waiting on network I/O. When you know which thread or threads are exhibiting the issue, go to the CPU views, select the thread and use the thread states selector to immediately drill down to the expensive methods and their call stacks (screencast). You will be able to drill up into your application code.
This is a call stack for runnable time:
And this is the same one, but showing network latency:
When you know what is blocking, hopefully the path to resolution will be clearer.
We had the same problem, running on Java 1.7.0_101 (one of Oracle's supported versions, since the latest public JDK/JRE 7 is 1.7.0_79) with the G1 garbage collector. I cannot tell whether the problem appears in other Java 7 versions or with other GCs.
Our process was Tomcat running Liferay Portal (I believe the exact version of Liferay is of no interest here).
This is the behavior we observed: with an -Xmx of 5 GB, the initial Code Cache pool size right after startup was about 40 MB. After a while, it dropped to about 30 MB (which is kind of normal, since a lot of code runs during startup that will never be executed again, so it is expected to be evicted from the cache after some time). We observed that there was some JIT activity, so the JIT actually populated the cache (compared to the sizes I mention later, it seems that the small cache size relative to the overall heap size places stringent requirements on the JIT, and this makes the latter evict the cache rather nervously). However, after a while, no more compilations ever took place, and the JVM got painfully slow. We had to kill our Tomcats every now and then to get adequate performance back, and as we added more code to our portal, the problem got worse and worse (since the Code Cache got saturated more quickly, I guess).
It seems that there are several bugs in the JDK 7 JVM that cause it to not restart the JIT after an emergency flush (see this blog post: https://blogs.oracle.com/poonam/entry/why_do_i_get_message; it mentions Java bugs 8006952, 8012547, 8020151 and 8029091).
This is why manually increasing the Code Cache to a level where an emergency flush is unlikely to ever occur "fixes" the issue (I guess this is the case with JDK 7).
In our case, instead of trying to adjust the Code Cache pool size, we chose to upgrade to Java 8. This seems to have fixed the issue. Also, the Code Cache now seems to be quite a bit larger (startup size is about 200 MB, and cruising size about 160 MB). As expected, after some idle time the cache pool size drops, only to rise again when some user (or robot, or whatever) browses our site, causing more code to be executed.
I hope you find the above data helpful.
Forgot to say: I found the exposition, the supporting data, the inference logic and the conclusion of this post very, very helpful. Thank you, really!
Does anyone have an idea where to look?
The issue could be outside Tomcat/the JVM - do you have some batch job which kicks in and stresses shared resource(s) like a common database?
Take a thread dump and see what the Java processes are doing when the application response time explodes.
If you are using Linux, use a tool like strace and check what the Java process is doing.
Have you checked JVM GC times? Some GC algorithms might 'pause' the application threads and increase the response time.
You can use jstat utility to monitor garbage collection statistics:
jstat -gcutil <pid of tomcat> 1000 100
The above command prints GC statistics every second, 100 times. Look at the FGC/YGC columns; if the numbers keep rising, there is something wrong with your GC options.
You might want to switch to CMS GC if you want to keep response time low:
-XX:+UseConcMarkSweepGC
You can check more GC options here.
What happens after your app has been performing slowly for a while - does it get back to performing well?
If so, then I would check whether any activity that is not related to your app takes place at that time.
Something like an antivirus scan or a system/db backup.
If not, then I would suggest running it with a profiler (JProfiler, YourKit, etc.); these tools can point you to your hotspots very easily.
You are using Quartz, which manages timed processes, and this seems to take place at particular times.
Post your Quartz schedule and let us know if that aligns, and if so, you can determine which internal application process may be kicking off to consume your resources.
Alternately, it is possible a portion of your application code has finally been activated and decides to load data to the memory cache. You're using Hibernate; check the calls to your database and see if anything coincides.
Related
Multithreaded legacy Java application's threads taking turns in order
TL;DR - A user has an error (ORA-01438: value larger than specified precision allows for this column). I can't recreate it locally because when my machine runs the multithreaded app, only one of ten threads runs at a time, in sequence. Furthermore, running it often results in my machine running out of heap even with 8 GB allocated, and then I happen to hit a NullPointerException instead of the user's issue. I'm attempting to debug a multithreaded legacy Java app (JDK 1.6) written years ago by people that are no longer around. It is attempting to insert some data into an Oracle DB. The app usually runs on a WebLogic 11g server and takes about 5 minutes to finish running the calculations. However, debugging locally, the threads don't work concurrently; they take turns on my local machine. This makes the running time go from the aforementioned 5 minutes to ~1 hour, and it still manages to run out of heap (I gave it 8 GB) or throw a NullPointerException if I'm lucky, but that isn't the business user's error. I've thought about cutting it down to use only one thread since it's taking turns anyway, but after touching this for a week, the business impact is becoming real and I can't just keep hitting it with a hammer. This may be a long shot given I haven't provided any of the code, but does anyone have experience with a similar issue? Specifically, why are the threads taking turns? EDIT: the user's error is a constraint violation, so I think it's modifying the inputted data and doing something like adding extra precision. The problem: the application's 10 threads are working in sequence rather than concurrently, and the code potentially contains a memory leak, resulting in the app crashing before hitting the constraint violation that the business user is encountering. Edit 2: I suspect the threads trading off, rather than running concurrently, could be causing garbage collection not to run on my local machine, perhaps? Though it still doesn't explain why I receive a different error than the business user if I'm lucky enough not to run out of heap.
You may well be correct in your instincts which tell you that the "threads" are working against you and that your predecessor simply left you with an unworkable design which he could never manage to fix. "The eventual recipient," in all cases, "is the [Oracle ...] database." No matter what the application does in presenting requests to it, the only thing that matters is the requests that it receives. Obviously the clients are colliding with themselves, and it is therefore probable that there's no reason for having multiple threads at all.
Weird Memory Usage by Tomcat
It's a vague question, so please feel free to ask for any specific data. We have a Tomcat server running two web services. One is built using Spring; we use MySQL for 90% of tasks and Mongo for caching of JSONs (10%). The other web service is written using Grails. Both services are medium-sized codebases (about 35k lines of code each). Computation only happens when there is an HTTP request (no batch processing), with about 2000 database hits per request (I know it's humongous; we are working on it). The request rate is about 30 req/min. For one particular request there is image processing, which is quite memory expensive. No JNI anywhere. We have found a weird behavior. Last night, I can confirm that there was no request to the server for about 12 hours. But when I look at the memory consumption, it is very confusing: without any requests, the memory keeps jumping from 500 MB to 1.2 GB (a 700 MB jump is worrisome). There is no computation on the server side, as mentioned. I am not sure if it's a memory leak: the memory usage comes down (things would have been way easier if the memory didn't come down). This behavior is reproducible with caches based on SoftReference or so, with full GCs. But I am not using them anywhere (not sure if something else is using them). What else can be the reason? Is it a cause for worry? PS: We have had Out of Memory crashes (not errors, but JVM crashes) quite frequently very recently.
This is actually normal behavior. You're just seeing garbage collection occur.
How to determine why is Java app slow
We have a Java ERP type of application. Communication between server and client is via RMI. In peak hours there can be up to 250 users logged in and about 20 of them are working at the same time. This means that about 20 threads are live at any given time in peak hours. The server can run for hours without any problems, but all of a sudden response times get higher and higher. Response times can be in minutes. We are running on Windows 2008 R2 with Sun's JDK 1.6.0_16. We have been using perfmon and Process Explorer to see what is going on. The only thing that we find odd is that when the server starts to work slowly, the number of handles the java.exe process has opened is around 3500. I'm not saying that this is the actual problem; I'm just curious if there are some guidelines I should follow to be able to pinpoint the problem. What tools should I use?
Can you access the log configuration of this application? If you can, you should change the log level to "DEBUG". Tracing the DEBUG logs of a request could give you useful information about the contention point. If you can't, profiler tools can help you: VisualVM (free, and a good product), Eclipse TPTP (free, but more complicated than VisualVM), JProbe (not free but very powerful; it is my favorite Java profiler, but it is expensive). If the application has been developed with JMX control points, you can plug in a JMX viewer to get information... If you want to stress the application to trigger the problem (if you want to verify whether it is a load problem), you can use stress tools like JMeter.
Sounds like the garbage collection cannot keep up and starts "stop-the-world" collecting for some reason. Attach with jvisualvm from the JDK when starting and have a look at the collected data when the performance drops.
The problem you're describing is quite typical, but general as well. Causes can range from memory leaks and resource contention to bad GC policies and heap/PermGen-space allocation. To point out the exact problems in your application, you need to profile it (I am aware of tools like YourKit and JProfiler). If you profile your application wisely, only a few application cycles will reveal the problems; otherwise profiling isn't very easy in itself.
In a similar situation, I coded a simple profiling helper myself. Basically I used a ThreadLocal that holds a "StopWatch" (based on a LinkedHashMap), and I then insert code like this at various points of the application: watch.time("OperationX"); Then, after the thread finishes a task, I call watch.logTime(); and the class writes a log line that looks like this: [DEBUG] StopWatch time:Stuff=0, AnotherEvent=102, OperationX=150 After this I wrote a simple parser that generates CSV from this log (per code path). The best thing you can do is to create a histogram (easily done using Excel). Averages, medians and even modes can fool you; I highly recommend creating a histogram. Together with this histogram, you can create line graphs using the average/median/mode (whichever represents the data best; you can determine this from the histogram). This way, you can be 100% sure exactly which operation is taking time. If you can't determine the culprit, binary search is your friend (fine-grain the events). Might sound really primitive, but it works. Also, if you make a library out of it, you can use it in any project. It's also cool because you can easily turn it on in production as well.
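A minimal sketch of such a helper, reconstructed from the description above (class layout and names are guesses, not the original code):

    public final class StopWatch {

        // one StopWatch per thread
        private static final ThreadLocal<StopWatch> CURRENT = new ThreadLocal<StopWatch>() {
            @Override protected StopWatch initialValue() { return new StopWatch(); }
        };

        // insertion-ordered map: checkpoint name -> millis since the previous checkpoint
        private final java.util.LinkedHashMap<String, Long> times =
                new java.util.LinkedHashMap<String, Long>();
        private long last = System.currentTimeMillis();

        public static StopWatch get() { return CURRENT.get(); }

        public void time(String event) {
            long now = System.currentTimeMillis();
            times.put(event, now - last);   // time spent since the previous checkpoint
            last = now;
        }

        public void logTime() {
            System.out.println("[DEBUG] StopWatch time:" + format());   // logging target is a placeholder
            times.clear();
            last = System.currentTimeMillis();
        }

        private String format() {
            StringBuilder sb = new StringBuilder();
            for (java.util.Map.Entry<String, Long> e : times.entrySet()) {
                if (sb.length() > 0) sb.append(", ");
                sb.append(e.getKey()).append('=').append(e.getValue());
            }
            return sb.toString();
        }
    }

Usage would then be StopWatch.get().time("OperationX"); at each checkpoint and StopWatch.get().logTime(); when the task is done.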
Aside from the GC that others have mentioned, try taking thread dumps every 5-10 seconds for about 30 seconds during your slowdown. There could be a case where DB calls, a web service, or some other dependency becomes slow. If you take a look at the thread dumps you will be able to see threads which don't appear to move, and you can narrow down your culprit that way. From the GC standpoint, do you monitor your CPU usage during these times? If the GC is running frequently you will see a jump in your overall CPU usage. If only this was a Solaris box, prstat would be your friend.
For acute issues like this, a quick jstack <pid> should quickly point out the problem area. Probably no need to get all fancy about it. If I had to guess, I'd say HotSpot jumped in and tightly optimised some badly written code. NetBeans grinds to a halt where it uses a WeakHashMap with newly created objects to cache file data. When optimised, the entries can be removed from the map straight after being added. Obviously, if the cache is being relied upon, much file activity follows. You probably won't see the drive light up, because it'll all be cached by the OS.
Tracking down a memory leak / garbage-collection issue in Java
This is a problem I have been trying to track down for a couple of months now. I have a Java app running that processes XML feeds and stores the result in a database. There have been intermittent resource problems that are very difficult to track down.

Background: On the production box (where the problem is most noticeable), I do not have particularly good access to the box, and have been unable to get JProfiler running. That box is a 64-bit quad-core, 8 GB machine running CentOS 5.2, Tomcat 6, and Java 1.6.0_11. It starts with these Java opts:

JAVA_OPTS="-server -Xmx5g -Xms4g -Xss256k -XX:MaxPermSize=256m -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseConcMarkSweepGC -XX:+PrintTenuringDistribution -XX:+UseParNewGC"

The technology stack is the following: CentOS 64-bit 5.2, Java 6u11, Tomcat 6, Spring/WebMVC 2.5, Hibernate 3, Quartz 1.6.1, DBCP 1.2.1, MySQL 5.0.45, Ehcache 1.5.0 (and of course a host of other dependencies, notably the jakarta-commons libraries).

The closest I can get to reproducing the problem is a 32-bit machine with lower memory requirements. That I do have control over. I have probed it to death with JProfiler and fixed many performance problems (synchronization issues, precompiling/caching XPath queries, reducing the thread pool, and removing unnecessary Hibernate pre-fetching and overzealous "cache-warming" during processing). In each case, the profiler showed these as taking up huge amounts of resources for one reason or another, and these were no longer primary resource hogs once the changes went in.

The problem: The JVM seems to completely ignore the memory usage settings, fills all memory and becomes unresponsive. This is an issue for the customer-facing end, which expects a regular poll (5-minute basis and 1-minute retry), as well as for our operations teams, who are constantly notified that a box has become unresponsive and have to restart it. There is nothing else significant running on this box.

The problem appears to be garbage collection. We are using the ConcurrentMarkSweep collector (as noted above) because the original STW collector was causing JDBC timeouts and became increasingly slow. The logs show that as the memory usage increases, it begins to throw CMS failures and kicks back to the original stop-the-world collector, which then seems not to collect properly.

However, running with JProfiler, the "Run GC" button seems to clean up the memory nicely rather than showing an increasing footprint, but since I cannot connect JProfiler directly to the production box, and resolving proven hotspots doesn't seem to be working, I am left with the voodoo of tuning garbage collection blind.

What I have tried: Profiling and fixing hotspots. Using STW, Parallel and CMS garbage collectors. Running with min/max heap sizes at 1/2, 2/4, 4/5, 6/6 increments. Running with PermGen space in 256M increments up to 1 GB. Many combinations of the above. I have also consulted the JVM [tuning reference](http://java.sun.com/javase/technologies/hotspot/gc/gc_tuning_6.html), but can't really find anything explaining this behavior or any examples of _which_ tuning parameters to use in a situation like this. I have also (unsuccessfully) tried JProfiler in offline mode, connecting with jconsole and VisualVM, but I can't seem to find anything that will interpret my GC log data.
Unfortunately, the problem also pops up sporadically; it seems to be unpredictable. It can run for days or even a week without having any problems, or it can fail 40 times in a day, and the only thing I can seem to catch consistently is that garbage collection is acting up.

Can anyone give any advice as to: a) Why a JVM is using 8 physical gigs and 2 GB of swap space when it is configured to max out at less than 6. b) A reference to GC tuning that actually explains or gives reasonable examples of when and what kind of settings to use the advanced collectors with. c) A reference to the most common Java memory leaks (I understand unclaimed references, but I mean at the library/framework level, or something more inherent in data structures, like hashmaps). Thanks for any and all insight you can provide.

EDIT (Emil H): 1) Yes, my development cluster is a mirror of production data, down to the media server. The primary difference is the 32/64-bit and the amount of RAM available, which I can't replicate very easily, but the code, queries and settings are identical. 2) There is some legacy code that relies on JAXB, but in reordering the jobs to try to avoid scheduling conflicts, I have that execution generally eliminated since it runs once a day. The primary parser uses XPath queries which call down to the java.xml.xpath package. This was the source of a few hotspots: for one, the queries were not being pre-compiled, and two, the references to them were in hardcoded strings. I created a threadsafe cache (hashmap) and factored the references to the XPath queries to be final static Strings, which lowered resource consumption significantly. The querying is still a large part of the processing, but it should be, because that is the main responsibility of the application. 3) An additional note: the other primary consumer is image operations from JAI (reprocessing images from a feed). I am unfamiliar with Java's graphics libraries, but from what I have found they are not particularly leaky. (Thanks for the answers so far, folks!)

UPDATE: I was able to connect to the production instance with VisualVM, but it had disabled the GC visualization / run-GC option (though I could view it locally). The interesting thing: the heap allocation of the VM is obeying the JAVA_OPTS, and the actual allocated heap is sitting comfortably at 1-1.5 gigs and doesn't seem to be leaking, but the box-level monitoring still shows a leak pattern which is not reflected in the VM monitoring. There is nothing else running on this box, so I am stumped.
Well, I finally found the issue that was causing this, and I'm posting a detailed answer in case someone else has these issues.

I tried jmap while the process was acting up, but this usually caused the JVM to hang further, and I would have to run it with --force. This resulted in heap dumps that seemed to be missing a lot of data, or at least missing the references between them. For analysis, I tried jhat, which presents a lot of data but not much in the way of how to interpret it. Secondly, I tried the Eclipse-based Memory Analysis Tool (http://www.eclipse.org/mat/), which showed that the heap was mostly classes related to Tomcat.

The issue was that jmap was not reporting the actual state of the application, and was only catching the classes on shutdown, which was mostly Tomcat classes. I tried a few more times, and noticed that there were some very high counts of model objects (actually 2-3x more than were marked public in the database). Using this I analyzed the slow query logs, and a few unrelated performance problems. I tried extra-lazy loading (http://docs.jboss.org/hibernate/core/3.3/reference/en/html/performance.html), as well as replacing a few Hibernate operations with direct JDBC queries (mostly where they were dealing with loading and operating on large collections - the JDBC replacements just worked directly on the join tables), and replaced some other inefficient queries that MySQL was logging. These steps improved pieces of the frontend performance, but still did not address the issue of the leak; the app was still unstable and acting unpredictably.

Finally, I found the option -XX:+HeapDumpOnOutOfMemoryError. This finally produced a very large (~6.5 GB) hprof file that accurately showed the state of the application. Ironically, the file was so large that jhat could not analyze it, even on a box with 16 GB of RAM. Fortunately, MAT was able to produce some nice looking graphs and showed some better data.

This time what stuck out was that a single Quartz thread was taking up 4.5 GB of the 6 GB of heap, and the majority of that was a Hibernate StatefulPersistenceContext (https://www.hibernate.org/hib_docs/v3/api/org/hibernate/engine/StatefulPersistenceContext.html). This class is used by Hibernate internally as its primary cache (I had disabled the second-level and query caches backed by EHCache). This class is used to enable most of the features of Hibernate, so it can't be directly disabled (you can work around it directly, but Spring doesn't support stateless sessions), and I would be very surprised if this had such a major memory leak in a mature product. So why was it leaking now?

Well, it was a combination of things: the Quartz thread pool instantiates with certain things being ThreadLocal, Spring was injecting a session factory that was creating a session at the start of the Quartz thread's lifecycle, which was then being reused to run the various Quartz jobs that used the Hibernate session. Hibernate then was caching in the session, which is its expected behavior. The problem is that the thread pool was never releasing the session, so Hibernate was staying resident and maintaining the cache for the lifecycle of the session. Since this was using Spring's Hibernate template support, there was no explicit use of the sessions (we are using a dao -> manager -> driver -> quartz-job hierarchy; the DAO is injected with Hibernate configs through Spring, so the operations are done directly on the templates).
So the session was never being closed, and Hibernate was maintaining references to the cache objects, so they were never garbage collected. Each time a new job ran it would just keep filling up the cache local to the thread, so there was not even any sharing between the different jobs. Also, since this is a write-intensive job (very little reading), the cache was mostly wasted, but the objects kept getting created.

The solution: create a DAO method that explicitly calls session.flush() and session.clear(), and invoke that method at the beginning of each job. The app has been running for a few days now with no monitoring issues, memory errors or restarts.

Thanks for everyone's help on this. It was a pretty tricky bug to track down, as everything was doing exactly what it was supposed to, but in the end a 3-line method managed to fix all the problems.
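For anyone wanting to see it spelled out, the fix could look roughly like this when the DAO is based on Spring's HibernateDaoSupport (a sketch under that assumption, not the original code):

    public class BaseDao extends org.springframework.orm.hibernate3.support.HibernateDaoSupport {

        // Call this at the start of every Quartz job so the thread-bound session
        // does not keep accumulating cached entities across job runs.
        public void flushAndClearSession() {
            getHibernateTemplate().flush();   // push pending changes to the database
            getHibernateTemplate().clear();   // evict all entities from the session-level cache
        }
    }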
Can you run the production box with JMX enabled? -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=<port> ... (see "Monitoring and Management Using JMX") and then attach with JConsole or VisualVM? Is it OK to do a heap dump with jmap? If yes, you could then analyze the heap dump for leaks with JProfiler (which you already have), jhat, VisualVM or Eclipse MAT. Also compare heap dumps; that might help to find leaks/patterns. And as you mentioned jakarta-commons: there is a problem when using jakarta-commons-logging related to holding onto the classloader. For a good read on that, check "A day in the life of a memory leak hunter" (release(Classloader)).
It seems like memory other than the heap is leaking; you mention that the heap is remaining stable. A classical candidate is permgen (permanent generation), which consists of 2 things: loaded class objects and interned strings. Since you report having connected with VisualVM, you should be able to see the number of loaded classes; check whether there is a continuous increase of loaded classes (important: VisualVM also shows the total number of classes ever loaded; it's okay if this goes up, but the number of currently loaded classes should stabilize after a certain time). If it does turn out to be a permgen leak, then debugging gets trickier, since tooling for permgen analysis is rather lacking in comparison to the heap. Your best bet is to start a small script on the server that repeatedly (every hour?) invokes: jmap -permstat <pid> > somefile<timestamp>.txt jmap with that parameter will generate an overview of loaded classes together with an estimate of their size in bytes; this report can help you identify whether certain classes do not get unloaded. (Note: with <pid> I mean the process id, and <timestamp> should be some generated timestamp to distinguish the files.) Once you have identified certain classes as being loaded and not unloaded, you can figure out mentally where these might be generated; otherwise you can use jhat to analyze dumps generated with jmap -dump. I'll keep that for a future update should you need the info.
I would look for directly allocated ByteBuffers. From the javadoc: a direct byte buffer may be created by invoking the allocateDirect factory method of this class. The buffers returned by this method typically have somewhat higher allocation and deallocation costs than non-direct buffers. The contents of direct buffers may reside outside of the normal garbage-collected heap, and so their impact upon the memory footprint of an application might not be obvious. It is therefore recommended that direct buffers be allocated primarily for large, long-lived buffers that are subject to the underlying system's native I/O operations. In general it is best to allocate direct buffers only when they yield a measurable gain in program performance. Perhaps the Tomcat code uses this to do I/O; configure Tomcat to use a different connector. Failing that, you could have a thread that periodically executes System.gc(). "-XX:+ExplicitGCInvokesConcurrent" might be an interesting option to try.
Any JAXB? I find that JAXB is a perm space stuffer. Also, I find that visualgc, now shipped with JDK 6, is a great way to see what's going on in memory. It shows the eden, generational and perm spaces and the transient behavior of the GC beautifully. All you need is the PID of the process. Maybe that will help while you work with JProfiler. And what about the Spring tracing/logging aspects? Maybe you can write a simple aspect, apply it declaratively, and do a poor man's profiler that way.
"Unfortunately, the problem also pops up sporadically, it seems to be unpredictable, it can run for days or even a week without having any problems, or it can fail 40 times in a day, and the only thing I can seem to catch consistently is that garbage collection is acting up." Sounds like, this is bound to a use case which is executed up to 40 times a day and then not anymore for days. I hope, you do not just track only the symptoms. This must be something, that you can narrow down by tracing the actions of the application's actors (users, jobs, services). If this happens by XML imports, you should compare the XML data of the 40 crashes day with data, that is imported on a zero crash day. Maybe it's some sort of logical problem, that you do not find inside your code, only.
I had the same problem, with a couple of differences. My technology is the following: Grails 2.2.4, Tomcat 7, quartz-plugin 1.0. I use two datasources in my application; that particularity is decisive for the cause of the bug. Another thing to consider is that the quartz-plugin injects a Hibernate session into the quartz threads, just like @liam says, and the quartz threads stay alive until I shut the application down.

My problem was a bug in the Grails ORM combined with the way the plugin handles the session and my two datasources.

The Quartz plugin has a listener to init and destroy Hibernate sessions:

    public class SessionBinderJobListener extends JobListenerSupport {

        public static final String NAME = "sessionBinderListener";

        private PersistenceContextInterceptor persistenceInterceptor;

        public String getName() {
            return NAME;
        }

        public PersistenceContextInterceptor getPersistenceInterceptor() {
            return persistenceInterceptor;
        }

        public void setPersistenceInterceptor(PersistenceContextInterceptor persistenceInterceptor) {
            this.persistenceInterceptor = persistenceInterceptor;
        }

        public void jobToBeExecuted(JobExecutionContext context) {
            if (persistenceInterceptor != null) {
                persistenceInterceptor.init();
            }
        }

        public void jobWasExecuted(JobExecutionContext context, JobExecutionException exception) {
            if (persistenceInterceptor != null) {
                persistenceInterceptor.flush();
                persistenceInterceptor.destroy();
            }
        }
    }

In my case, persistenceInterceptor is an instance of AggregatePersistenceContextInterceptor, and it has a List of HibernatePersistenceContextInterceptor, one for each datasource. Every operation done with AggregatePersistenceContextInterceptor is passed on to HibernatePersistence without any modification or treatment. When init() is called on HibernatePersistenceContextInterceptor, it increments the static variable below:

    private static ThreadLocal<Integer> nestingCount = new ThreadLocal<Integer>();

I don't know the purpose of that static count. I just know that it is incremented two times, once per datasource, because of the AggregatePersistence implementation.

Up to here I have just explained the scenario. The problem comes now: when my quartz job finishes, the plugin calls the listener to flush and destroy the Hibernate sessions, as you can see in the source code of SessionBinderJobListener. The flush occurs perfectly, but the destroy does not, because HibernatePersistence does one validation before closing the Hibernate session: it examines nestingCount to see if the value is greater than 1, and if so, it does not close the session. Simplifying what Hibernate does:

    if (--nestingCount.getValue() > 0)
        // do nothing
    else
        // close the session

That's the basis of my memory leak. The quartz threads stay alive with all the objects used in the session, because the Grails ORM does not close the session, because of a bug caused by my having two datasources.

To solve that, I customized the listener to call clear before destroy, and to call destroy two times (once for each datasource), ensuring my session was cleared and destroyed - and if the destroy fails, it was at least cleared.
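A sketch of what that customization could look like (assuming the listener can simply be subclassed and that PersistenceContextInterceptor exposes clear(); the exact wiring depends on the plugin version):

    public class FixedSessionBinderJobListener extends SessionBinderJobListener {

        public void jobWasExecuted(JobExecutionContext context, JobExecutionException exception) {
            PersistenceContextInterceptor interceptor = getPersistenceInterceptor();
            if (interceptor != null) {
                interceptor.flush();
                interceptor.clear();      // make sure the session holds no cached entities
                interceptor.destroy();    // decrements nestingCount for the first datasource
                interceptor.destroy();    // ...and for the second one, so the session really closes
            }
        }
    }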
CPU Usage Spikes in WebSphere 6.1
First, just a bit of background: one of our customers is experiencing CPU usage spikes for WebSphere instances running one of our web apps (other instances with other apps are fine). They have a test environment and a live environment (both iSeries) which both experience the problem - with a single-app-per-instance setup. We have deployed this application locally in our own test environments and also for many other customers, all on iSeries, with no similar problems.

What's actually happening: every second or so, the WebSphere process's CPU usage jumps to anywhere from 7%-20% even though there are no requests being processed at the time. The customer has reported seeing spikes as high as 30%. These spikes average out to 1.5% of CPU overall - the other WebSphere instances typically use 0%-0.1% when idle.

My investigations so far: I had a look at the threads. One thread in their test environment was using ~350 CPU cycles per second. A similar thread in their live environment was using ~1500 CPU cycles per second (showing that it has a bigger CPU). The call stack for these threads (columns: Type, Program, Statement, Procedure) looks like this:

QLESPI QSYS 17 LE_Create_Thread2__FP12crtt >
QJVALIBJVM QSYS 7 startThread__FPv
J com/ibm/ws/util/Threa > run
J com/ibm/ws/util/Threa > run
J com/ibm/ws/util/Threa > getTask
J com/ibm/ws/util/Bound > poll

The entire class name from the bottom line is com/ibm/ws/util/BoundedBuffer. I asked the customer to do a JVM dump for me - the only additional information I got from this was the thread name:

Thread: 00002F82 Deferrable Alarm : 11

Now for my questions: Can any of you identify the problem, given these symptoms? (Maybe that's a long shot!) What is a Deferrable Alarm? From the JVM dump, I can see 4 threads with this name. The other three seem to be doing just fine. By debugging my local WebSphere (on Windows) and adding breakpoints in the BoundedBuffer class, I see that BoundedBuffers are polling and periodically invoking some listener. I don't have access to the WebSphere console for the customer machines, and they aren't owning up to having made any config changes. I can ask them to check the console for me though - what should I be asking them to look at? I have telnet access to the customer boxes; is there anything else I can investigate there? Looking at the WebSphere profile files, etc.? Which files should I be looking at? Because the call stack and JVM dump don't explicitly reference our code, is it safe to assume that this is a configuration problem? It's been a long question, so thanks for reading this far.

30 April Update (1): This morning I've noticed that this behaviour only happens after the first request of the day has been processed (irrespective of which web service is invoked). This points the finger back at our application or Apache Axis. Could it be that this is just normal behaviour?!

30 April Update (2): So it seems that this CPU activity is some kind of housekeeping activity for the web container, or maybe something within Apache Axis. I've now observed this happening on a few different web applications on a few different servers. Applications with no web component don't suffer the same additional CPU overhead. I'd imagine that if it is housekeeping work, "tuning" it somehow could be counterproductive - by that, I mean that making the app server idle better would probably negatively affect the amount of "real" work it can do.
You could try to profile and take heap dumps of the application; that could answer a few questions related to memory and CPU usage.
I would recommend following the must-gather documentation provided by IBM, and raising a PMR alongside your own investigation. Things you might suspect: garbage collection (unlikely with low application utilization), timers or tasks (such as java.util.Timer or a commonj work manager), a pretest connection with a complex SQL query (in the DataSource's WebSphere Application Server data source properties). I would also recommend using a profiler to determine the cause; the YourKit profiler is a pretty decent one.
Very instinctively (being unfamiliar with iSeries platforms) I would look at disk IO related issues. Can you describe the disk subsystem? Can you see if your app is spending an unusually large amount of time in iowait?
I know this doesn't quite match your problem, but it might be worth a look if you're running prior to WAS 6.1 patch 17: http://www-01.ibm.com/support/docview.wss?uid=swg24018437 Hope this helps. Cheers, John
My best guess is that some type of monitoring is being done on the instance, like Tivoli etc. Have you ruled out any GC activity? HTH Tom
Most application servers are implemented in Java itself, and so is WebSphere. These servers, apart from serving client requests, have to do other periodic jobs, such as resource pool management. Performing these jobs will create some temporary objects that need to be garbage collected. Depending on how much heap you have allocated, the usage and the garbage collector settings, the garbage collector will be invoked. I'd say try to see whether it is the garbage collector thread that is taking up your CPU. For this, connect the jconsole utility to the remote WebSphere process for a day and see if there is any correlation between heap usage and CPU usage.
I am also experiencing this very same issue, [Deferrable Alarm : x] with BoundedBuffer. The only difference I have is that this is on a Windows 7 64-bit machine. There is absolutely no Tivoli or other batch process running, no requests being made; the single instance is just idle. I can run the application in DEBUG mode and pause the Deferrable Alarm thread and the CPU spikes stop; resume and they start again. I've checked disk activity and network activity and there is nothing happening there. I am running WebSphere 6.1.0.27.