An application that we run has recently started sporadically crashing with a message about "java.lang.OutOfMemoryError: requested 8589934608 bytes for Chunk::new. Out of swap space?".
I've looked around on the net, and everywhere suggestions are limited to
revert to a previous version of Java
fiddle with the memory settings
use client instead of server mode
Reverting to a previous version implies that the new Java has a bug, but I haven't seen any indication of that. Memory isn't an issue at all; the server has 32GB available, Xmx is set to 20GB and Xms to 10GB, and I can't see the JVM running out of the remaining 12GB (less the amount given to the handful of other processes on the machine). And we're stuck with server mode due to the nature of the application and environment.
When I look at the memory and CPU usage for the application, I see constant memory usage for the whole day, but then suddenly, right before it dies, CPU usage goes up to 100% and memory usage goes from X, to X + 2GB, to X + 4GB, to (sometimes) X + 8GB, to JVM death. It would appear that there is a cycle of repeated array resizing going on in the JIT compiler.
I've now seen the error occur with the above 8GB request and also with 16GB requests. Each time, the method being compiled when this happens is the same. It is a simple method that has non-nested loops, no recursion, and uses methods on objects that return static member fields or instance member fields directly with little computation.
So I have 2 questions:
Does anybody have any suggestions?
Can I test out whether there is a problem compiling this specific method in a test environment, without running the whole application, by invoking the JIT compiler directly? Or should I start up the application and tell it to compile methods after a much smaller call count (like 2) to force it to compile the suspect method almost instantly instead of at a random point in the day?
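In case it matters, the flags I would use for the "compile early" experiment are standard HotSpot options (the threshold value of 2 is just the small call count mentioned above):
-XX:CompileThreshold=2 -XX:+PrintCompilation
-XX:CompileThreshold lowers the invocation count at which the server compiler kicks in, and -XX:+PrintCompilation logs each method as it is compiled, so I could see whether compiling the suspect method on its own reproduces the blow-up.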
#StephenC
The JVM is 1.6.0_20 (previously 1.6.0_0), running on Solaris. I know it's the compilation that is causing a problem for a couple reasons.
ps in the seconds leading up to it shows that a java thread with id corresponding to the compiler thread (from jstack) is taking up 100% of the CPU time
jstack shows the issue is in JavaThread "CompilerThread1" daemon [_thread_in_native, id=34, ...]
The method mentioned in jstack is always the same one, and is one we wrote. If you look at sample jstack output you will know what I mean, but for obvious reasons I can't provide code samples or filenames. I will say that it is a very simple method. Essentially a handful of null checks, 2 for loops that do equality checks and possibly assign values, and some simple method calls afterwards. All in all, maybe 40 lines of code.
This issue has happened twice in 2 weeks, though the application runs every day and is restarted daily. In addition, the application wasn't under heavy load either of these times.
You can exclude a particular method from being JIT'ed by creating a file called .hotspot_compiler and putting it in your application's 'working directory'. Simply add an entry to the file in the following format:
exclude com/amir/SomeClass someMethod
And the console output from the compiler will look like:
### Excluding compile: com.amir.SomeClass::someMethod
For more information, read this. If you're not sure what your application's 'working directory' is, use
-XX:CompileCommandFile=/my/excludefile/location/.hotspot_compiler
in your Java start script or command line.
Alternatively, if you're not sure it's the JIT compiler's fault and want to see if you can reproduce the problem without any JIT'ing, run your Java process with -Xint.
Okay, I did a quick search and found a thread on the Sun Java forums that discusses this. Hope it helps.
Here's another entry on Oracle's forum, a similar sporadic crash. There is one answer where someone solved the problem by reconfiguring the GC's survivor ratio.
Related
I've been developing a Reporting Engine (RE is to generate PDF reports) in C++ on Linux. If a PDF report being generated must contain some charts, I need to build them while building the report. ChartBuilder is written in Java (with JFreeChart or Java-TeeChart - it does not matter anyway). While RE is building a report, it invokes some ChartBuilder API functions via JNI to build a chart (or several charts) step by step (ChartBuilder is packed into a .jar file). The problem is that it takes a lot of time to build the first chart (that is, to execute every ChartBuilder API function for the first time during the process lifetime)! More specifically, it takes about 1.5 seconds to build the first chart. If there are several charts to be created, the rest of the charts are built in about 0.05 to 0.1 seconds each. That is 30 times faster than the first one! It's worth noting that this first chart is the same as the rest of them (except for the data). The problem seems to be fundamental to Java (and I'm not very experienced with this platform).
Below is a picture that illustrates the described problem:
I wonder if there is a way to hasten the first execution. It would be great to understand how to avoid the overhead on the first execution at all because now it hampers the whole performance of RE.
In addition I'd like to describe the way it works: Somebody invokes C++RE::CreateReport with all needed parameters. This function, if it's needed, creates a JVM and makes requests to it via JNI. When a report is created, the JVM is destroyed.
Thanks in advance!
Just-in-time compilation. Keep your JVM alive as a service to avoid paying JIT compilation cost multiple times.
I think it is likely a combination of things, as people have pointed out in the comments and the other answer - JVM startup, class loading, the fact that Java 'interprets' your code when it first runs it, etc.
Most fall into the category of 'first time start up' overhead - hence the higher performance in subsequent runs.
I would personally be inclined to agree with Thomas (in the comments to your question) that the highest overhead is possibly the class loader.
There are tools you can use to profile the Java JVM to get a feel for what is taking the most time within the JVM itself - such as:
visualvm (http://visualvm.java.net)
JVM monitor (http://jvmmonitor.org)
You have to be careful to interpret the results from these tools with some thought: you may want to measure first runs and subsequent runs separately, and you may also want to add your own timings into the C++ code that wraps the JNI calls to get a better picture of the end-to-end timings. With performance monitoring, multiple test runs are very important to allow for individual runs being slow or fast for one reason or another (e.g. other load on the computer - even on a non-shared laptop).
As LeffeBrune mentions, if you can have the chart builder already running as a service, it will likely speed up the first run, although you will probably need to experiment to see how much difference it makes if it has not actually been running on a processor for a while, for example.
I faced a very frustrating issue. I used an IDE to develop my application and was monitoring performance through the execution times in the IDE, but unfortunately when I exported my classes to a jar file, it ran 7x slower on the command line. I checked my JVM and made sure the same JVM was linked to both executions, and I allocated more heap memory to the command-line run.
Later, I used a process performance monitoring tool and ran both applications (on CMD and in the IDE) simultaneously. I noticed that both processes were identical in everything except CPU usage: the process running in the IDE took around 19%-23% of CPU while the one on CMD took about 6%-11%. This may explain why the run in the IDE takes less time.
To make the situation clearer, below is the piece of code that is taking most of the time:
for (Call call : calls) {
    CallXCD callXCD = call.getCallXCD();
    if (callXCD.isOnNet()) {
        System.out.println("OnNet");
    } else if (callXCD.isXNet()) {
        System.out.println("XNet");
    } else if (callXCD.isOthers()) {
        System.out.println("Others");
    } else if (callXCD.isIntra()) {
        System.out.println("Intra");
    } else {
        System.out.println("Not Known");
    }
}
CallXCD is an object containing several string variables and several methods like isOnNet and isXNet. These methods apply the compareTo() method on the strings of the object and return true or false according to this comparison.
I profiled this piece of code by printing the time each iteration takes. In the IDE each iteration took about 0.007 milliseconds, while on the command line, running the jar file, it took about 0.2 milliseconds. Because my code runs around 4 million iterations of this loop, the negative impact on performance is very significant.
Why does this happen? Is there any way to allocate more processing power to the JVM, the way the memory arguments allocate more heap?
(After a conversation in comments.)
It seems this is down to a difference in how console output is handled by your command line and by your IDE. The two can behave very differently in terms of auto-scrolling, buffer size, etc.
Generally, when benchmarking code it's good to isolate the code you're actually interested in away from any diagnostic output which can interfere with the results. If you absolutely have to include diagnostic output, writing it to a file (possibly on a ram disk) can help to reduce the differences between console implementations.
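As a rough illustration using the types from your question (the counting scheme is just one way you might restructure it, not your actual requirement), something like this keeps all console output out of the measured section:

long start = System.nanoTime();
int onNet = 0, xNet = 0, others = 0, intra = 0, unknown = 0;
for (Call call : calls) {
    CallXCD callXCD = call.getCallXCD();
    if (callXCD.isOnNet()) {
        onNet++;
    } else if (callXCD.isXNet()) {
        xNet++;
    } else if (callXCD.isOthers()) {
        others++;
    } else if (callXCD.isIntra()) {
        intra++;
    } else {
        unknown++;
    }
}
long elapsedMs = (System.nanoTime() - start) / 1000000L;
// Single diagnostic print, outside the timed loop, so console speed no longer matters.
System.out.println("OnNet=" + onNet + " XNet=" + xNet + " Others=" + others
        + " Intra=" + intra + " NotKnown=" + unknown + " in " + elapsedMs + " ms");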
Maybe your IDE has different JVM options for running your process? A larger heap size (-Xmx) could reduce garbage collection costs. If your application is memory-intensive and choking on an insufficient heap size, this could easily be the problem.
The IDE typically also pipes your process's standard input/output. I don't know if that would be more efficient than standard output to a console, but if you're printing/logging a lot it might be a factor to consider.
As Jon Skeet says, there's a big apparent difference between the I/O traces on left & right.
The third main thing an IDE does, in Debug mode, is attach to your process as a debugger. But this normally makes the process slower, not faster.
Summary: best check your IDE's JVM options & launch configuration. Maybe it's run with different parameters, in a different working directory, or with too small a heap size.
I have seen such effects when the IDE (eclipse) used a newer Java version than is used when starting from the command line. Also, it could be that when running from within eclipse you are using the server VM while on the command line the client JVM is used. This can be set either in the run configuration or in the workspace JVM settings ("Default VM arguments" when you click on "Edit JRE" in the "Installed JREs" preference page).
So, first make sure the same arguments are used. Add the arguments that are used when starting from within eclipse to your command line.
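One way to compare the effective settings (just a suggestion; -XX:+PrintFlagsFinal is a standard HotSpot flag) is to dump the resolved flag values with the same arguments you use in each environment and diff the two outputs:
java <your usual options> -XX:+PrintFlagsFinal -version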
I'm experiencing a strange but severe problem running several (about 15) instances of a Java EE-ish web applications (Hibernate 4+Spring+Quartz+JSF+Facelets+Richfaces) on Tomcat 7/Java 7.
The system runs just fine, but after a greatly varying amount of time all instances of the application suddenly suffer from rising response times at the same time. Basically the application still works, but the response times are about three times higher.
These are two diagrams displaying the response times of two particular short workflows/actions (log in, access the list of seminars, ajax-refresh this list, log out; the lower line is just the request time for the ajax refresh) for two example instances of the application:
As you can see both instances of the application "explode" at the exact same time and stay slow. After restarting the server everything's back to normal. All the instances of the application "explode" simultaneously.
We're storing the session data to a database and use this for clustering. We checked session size and number and both are rather low (meaning that on other servers with other applications we sometimes have larger and more sessions). The other Tomcat in the cluster usually stays fast for some more hours and after this random-ish amount of time it also "dies". We checked the heap sizes with jconsole and the main heap stays between 2.5 and 1 GB size, db connection pool is basically full of free connections, as well as the thread pools. Max heap size is 5 GB, there's also plenty of perm gen space available. The load is not especially high; there's just about 5% load on the main CPU. The server does not swap. It's also no hardware issue as we additionally deployed the applications to a VM where the problems remain the same.
I don't know where to look anymore, I am out of ideas. Has someone an idea where to look?
2013-02-21 Update: New Data!
I added two more timing traces to the application. As for the measurement: the monitoring system calls a servlet that performs two tasks, measures execution time for each on the server and writes the time taken as response. These values are logged by the monitoring system.
I have several interesting new facts: a hot redeployment of the application causes this single instance on the current Tomcat to go nuts. This also seems to affect raw CPU calculation performance (see below). This individual-context-explosion is different from the overall-context-explosion that occurs randomly.
Now for some data:
First the individual lines:
Light blue is total execution time of a small workflow (details see above), measured on the client
Red is "part" of light blue and is the time taken to perform a special step of that workflow, measured on the client
Dark blue is measured in the application and consists of reading a list of entities from the DB through Hibernate and iterating over that list, fetching lazy collections and lazy entities.
Green is a small CPU benchmark using floating point and integer operations. As far as I can see there is no object allocation, so no garbage.
Now for the individual stages of the explosion: I marked each image with three black dots. The first one is a "small" explosion in more or less only one application instance - in Inst1 it jumps (especially visible in the red line), while Inst2 below more or less stays calm.
After this small explosion the "big bang" occurs and all application instances on that Tomcat explode (2nd dot). Note that this explosion affects all high level operations (request processing, DB access), but not the CPU benchmark. It stays low in both systems.
After that I hot-redeployed Inst1 by touching the context.xml file. As I said earlier, this instance goes from exploded to completely devastated now (the light blue line is off the chart - it is at about 18 secs). Note how a) this redeployment does not affect Inst2 at all and b) the raw DB access of Inst1 is also not affected - but the CPU suddenly seems to have become slower! This is crazy, I say.
Update of update
The leak prevention listener of Tomcat does not whine about stale ThreadLocals or Threads when the application is undeployed. There obviously seems to be some cleanup problem (which is I assume not directly related to the Big Bang), but Tomcat doesn't have a hint for me.
2013-02-25 Update: Application Environment and Quartz Schedule
The application environment is not very sophisticated. Network components aside (I don't know enough about those) there's basically one application server (Linux) and two database servers (MySQL 5 and MSSQL 2008). The main load is on the MSSQL server, the other one merely serves as a place to store the sessions.
The application server runs an Apache as a load balancer between two Tomcats. So we have two JVMs running on the same hardware (two Tomcat instances). We use this configuration not to actually balance load, as the application server is capable of running the application just fine (which it did for years), but to enable small application updates without downtime. The web application in question is deployed as separate contexts for different customers, about 15 contexts per Tomcat. (I seem to have mixed up "instances" and "contexts" in my posting - here in the office they're often used synonymously and we usually magically know what the colleague is talking about. My bad, I'm really sorry.)
To clarify the situation with better wording: the diagrams I posted show response times of two different contexts of the same application on the same JVM. The Big Bang affects all contexts on one JVM but doesn't happen on the other one (the order in which the Tomcats explode is random btw). After hot-redeployment one context on one Tomcat instance goes nuts (with all the funny side effects, like seemingly slower CPU for that context).
The overall load on the system is rather low. It's internal, core-business-related software with about 30 simultaneously active users. Application-specific requests (server touches) are currently at about 130 per minute. The number of individual requests is low, but the requests themselves often require several hundred selects from the database, so they're rather expensive. Usually, though, everything's perfectly acceptable. The application also does not build up large unbounded caches - some lookup data is cached, but only for a short amount of time.
Above I wrote that the servers were capable of running the application just fine for several years. I know that the best way to find the problem would be to find out exactly when things went wrong for the first time and see what was changed in that timeframe (in the application itself, the associated libraries or the infrastructure); however, the problem is that we don't know when the problems first occurred. Let's just call that suboptimal (in the sense of absent) application monitoring... :-/
We ruled out some aspects, but the application has been updated several times during the last months, so we e.g. cannot simply deploy an older version. The largest update that wasn't a feature change was a switch from JSP to Facelets. But still, "something" must be the cause of all the problems, yet I have no idea why Facelets, for instance, should influence pure DB query times.
Quartz
As for the Quartz schedule: there's a total of 8 jobs. Most of them run only once per day and have to do with large-volume data synchronization (absolutely not "large" as in "big data" large; it's just more than the average user sees through their usual daily work). However, those jobs of course run at night, while the problems occur during the daytime. I omit a detailed job listing here (if beneficial I can provide more details, of course). The jobs' source code has not been altered during the last months. I already checked whether the explosions align with the jobs - yet the results are inconclusive at best. I'd actually say that they don't align, but as there are several jobs that run every minute, I can't rule it out just yet. The actual jobs that run every minute are pretty lightweight in my opinion; they usually check whether data is available (in different sources: DB, external systems, email account) and if so write it to the DB or push it to another system.
However, I'm currently enabling logging of individual job executions so that I can see the exact start and end timestamps of each single job execution. Perhaps this provides more insight.
2013-02-28 Update: JSF Phases and Timing
I manually added a JSF phase listener to the application. I executed a sample call (the ajax refresh) and this is what I got (left: normally running Tomcat instance, right: Tomcat instance after the Big Bang - the numbers were taken almost simultaneously from both Tomcats and are in milliseconds; a sketch of such a timing listener follows the numbers):
RESTORE_VIEW: 17 vs 46
APPLY_REQUEST_VALUES: 170 vs 486
PROCESS_VALIDATIONS: 78 vs 321
UPDATE_MODEL_VALUES: 75 vs 307
RENDER_RESPONSE: 1059 vs 4162
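Roughly, the timing listener looks like this (a simplified sketch rather than the actual class; the name and the logging are illustrative). It would be registered in faces-config.xml with a lifecycle/phase-listener entry:

import java.util.HashMap;
import java.util.Map;
import javax.faces.event.PhaseEvent;
import javax.faces.event.PhaseId;
import javax.faces.event.PhaseListener;

public class PhaseTimingListener implements PhaseListener {

    // JSF handles a request on a single thread, but the listener instance is
    // shared across requests, so keep the start times per thread.
    private final ThreadLocal<Map<PhaseId, Long>> started =
            new ThreadLocal<Map<PhaseId, Long>>() {
                @Override
                protected Map<PhaseId, Long> initialValue() {
                    return new HashMap<PhaseId, Long>();
                }
            };

    public PhaseId getPhaseId() {
        return PhaseId.ANY_PHASE; // listen to every phase
    }

    public void beforePhase(PhaseEvent event) {
        started.get().put(event.getPhaseId(), System.nanoTime());
    }

    public void afterPhase(PhaseEvent event) {
        Long start = started.get().get(event.getPhaseId());
        if (start != null) {
            long millis = (System.nanoTime() - start) / 1000000L;
            System.out.println(event.getPhaseId() + ": " + millis + " ms");
        }
    }
}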
The ajax refresh itself belongs to a search form and its search result. There's also another delay between the application's outermost request filter and the point where web flow starts its work: there's a FlowExecutionListenerAdapter that measures the time taken in certain phases of web flow. This listener reports 1405 ms for "Request submitted" (which is, as far as I know, the first web flow event) out of a total of 1632 ms for the complete request on an un-exploded Tomcat, so I estimate about 200 ms of overhead.
But on the exploded Tomcat it reports 5332 ms for request submitted (meaning all JSF phases happen in those 5 seconds) out of a total request duration of 7105ms, thus we're up to almost 2 seconds overhead for everything outside of web flow's request submitted.
Below my measurement filter the filter chain contains a org.ajax4jsf.webapp.BaseFilter, then the Spring servlet is called.
2013-06-05 Update: All the stuff going on in the last weeks
A small and rather late update... the application performance still sucks after some time and the behaviour remains erratic. Profiling has not helped much yet; it just generated an enormous amount of data that's hard to dissect. (Try poking around in performance data of, or profiling, a production system... sigh.) We conducted several tests (ripping out certain parts of the software, undeploying other applications, etc.) and actually found some improvements that affect the whole application: the default flush mode of our EntityManager is AUTO, and during view rendering lots of fetches and selects are issued, each of which includes a check whether flushing is necessary.
So we built a JSF phase listener that sets the flush mode to COMMIT during RENDER_RESPONSE. This improved overall performance a lot and seems to have mitigated the problems somewhat.
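In essence the listener does something like this (a simplified sketch, not our production code; how the current EntityManager is obtained is application-specific, so lookupCurrentEntityManager() below is just a placeholder):

import javax.faces.event.PhaseEvent;
import javax.faces.event.PhaseId;
import javax.faces.event.PhaseListener;
import javax.persistence.EntityManager;
import javax.persistence.FlushModeType;

public class RenderResponseFlushModeListener implements PhaseListener {

    public PhaseId getPhaseId() {
        return PhaseId.RENDER_RESPONSE; // we only care about view rendering
    }

    public void beforePhase(PhaseEvent event) {
        EntityManager em = lookupCurrentEntityManager();
        if (em != null) {
            // Skip the dirty-check flush before every query issued while rendering.
            em.setFlushMode(FlushModeType.COMMIT);
        }
    }

    public void afterPhase(PhaseEvent event) {
        // Nothing to restore if the EntityManager is request-scoped;
        // otherwise reset FlushModeType.AUTO here.
    }

    // Placeholder: resolve the request's EntityManager however the application does it
    // (e.g. a Spring-managed shared EntityManager).
    private EntityManager lookupCurrentEntityManager() {
        return null;
    }
}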
Yet, our application monitoring keeps yielding completely insane response times for some contexts on some Tomcat instances - like an action that should finish in under a second (and that actually does so right after deployment) now taking more than four seconds. (These numbers are supported by manual timing in the browsers, so it's not the monitoring that causes the problem.)
See the following picture for example:
This diagram shows two tomcat instances running the same context (meaning same db, same configuration, same jar). Again the blue line is the amount of time taken by pure DB read operations (fetch a list of entities, iterate over them, lazily fetch collections and associated data). The turquoise-ish and red line are measured by rendering several views and doing an ajax refresh, respectively. The data rendered by two of the requests in turquoise-ish and red is mostly the same as is queried for the blue line.
Now around 07:00 on instance 1 (right) there's this huge increase in pure DB time, which seems to affect actual render response times as well, but only on Tomcat 1. Tomcat 0 is largely unaffected by this, so it cannot be caused by the DB server or the network, with both Tomcats running on the same physical hardware. It has to be a software problem in the Java domain.
During my last tests I found out something interesting: All responses contain the header "X-Powered-By: JSF/1.2, JSF/1.2". Some (the redirect responses produced by WebFlow) even have "JSF/1.2" three times in there.
I traced down the code parts that set those headers and the first time this header is set it's caused by this stack:
... at org.ajax4jsf.webapp.FilterServletResponseWrapper.addHeader(FilterServletResponseWrapper.java:384)
at com.sun.faces.context.ExternalContextImpl.<init>(ExternalContextImpl.java:131)
at com.sun.faces.context.FacesContextFactoryImpl.getFacesContext(FacesContextFactoryImpl.java:108)
at org.springframework.faces.webflow.FlowFacesContext.newInstance(FlowFacesContext.java:81)
at org.springframework.faces.webflow.FlowFacesContextLifecycleListener.requestSubmitted(FlowFacesContextLifecycleListener.java:37)
at org.springframework.webflow.engine.impl.FlowExecutionListeners.fireRequestSubmitted(FlowExecutionListeners.java:89)
at org.springframework.webflow.engine.impl.FlowExecutionImpl.resume(FlowExecutionImpl.java:255)
at org.springframework.webflow.executor.FlowExecutorImpl.resumeExecution(FlowExecutorImpl.java:169)
at org.springframework.webflow.mvc.servlet.FlowHandlerAdapter.handle(FlowHandlerAdapter.java:183)
at org.springframework.webflow.mvc.servlet.FlowController.handleRequest(FlowController.java:174)
at org.springframework.web.servlet.mvc.SimpleControllerHandlerAdapter.handle(SimpleControllerHandlerAdapter.java:48)
at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:925)
at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:856)
at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:920)
at org.springframework.web.servlet.FrameworkServlet.doPost(FrameworkServlet.java:827)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:641)
... several thousands ;) more
The second time this header is set by
at org.ajax4jsf.webapp.FilterServletResponseWrapper.addHeader(FilterServletResponseWrapper.java:384)
at com.sun.faces.context.ExternalContextImpl.<init>(ExternalContextImpl.java:131)
at com.sun.faces.context.FacesContextFactoryImpl.getFacesContext(FacesContextFactoryImpl.java:108)
at org.springframework.faces.webflow.FacesContextHelper.getFacesContext(FacesContextHelper.java:46)
at org.springframework.faces.richfaces.RichFacesAjaxHandler.isAjaxRequestInternal(RichFacesAjaxHandler.java:55)
at org.springframework.js.ajax.AbstractAjaxHandler.isAjaxRequest(AbstractAjaxHandler.java:19)
at org.springframework.webflow.mvc.servlet.FlowHandlerAdapter.createServletExternalContext(FlowHandlerAdapter.java:216)
at org.springframework.webflow.mvc.servlet.FlowHandlerAdapter.handle(FlowHandlerAdapter.java:182)
at org.springframework.webflow.mvc.servlet.FlowController.handleRequest(FlowController.java:174)
at org.springframework.web.servlet.mvc.SimpleControllerHandlerAdapter.handle(SimpleControllerHandlerAdapter.java:48)
at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:925)
at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:856)
at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:920)
at org.springframework.web.servlet.FrameworkServlet.doPost(FrameworkServlet.java:827)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:641)
I have no idea if this could indicate a problem, but I did not notice this with other applications that are running on any of our servers, so this might as well provide some hints. I really have no idea what that framework code is doing (admittedly I did not dive into it yet)... perhaps someone has an idea? Or am I running into a dead end?
Appendix
My CPU benchmark code consists of a loop that calculates Math.tan and uses the result value to modify some fields on the servlet instance (no volatile/synchronized there), and then performs several raw integer calculations. It's not severely sophisticated, I know, but well... it seems to show something in the charts; however, I am not sure what it shows. I do the field updates to prevent HotSpot from optimizing away all my precious code ;)
long time2 = System.nanoTime();
// Floating-point part: the field updates keep HotSpot from eliminating the loop.
for (int i = 0; i < 5000000; i++) {
    double tan = Math.tan(i);
    if (tan < 0) {
        this.l1++;
    } else {
        this.l2++;
    }
}
// Integer part: a Collatz-style loop, again writing to a field so the work
// survives optimization.
for (int i = 1; i < 7500; i++) {
    int n = i;
    while (n != 1) {
        this.steps++;
        if (n % 2 == 0) {
            n /= 2;
        } else {
            n = n * 3 + 1;
        }
    }
}
// This execution time is written to the client.
time2 = System.nanoTime() - time2;
Solution
Increase the maximum size of the Code Cache:
-XX:ReservedCodeCacheSize=256m
Background
We are using ColdFusion 10 which runs on Tomcat 7 and Java 1.7.0_15. Our symptoms were similar to yours. Occasionally the response times and the CPU usage on the server would go up by a lot for no apparent reason. It seemed as if the CPU got slower. The only solution was to restart ColdFusion (and Tomcat).
Initial analysis
I started by looking at the memory usage and the garbage collector log. There was nothing there that could explain our problems.
My next step was to schedule a heap dump every hour and to regularly perform sampling using VisualVM. The goal was to get data from before and after a slowdown so that it could be compared. I managed to accomplish that.
There was one function in the sampling that stood out: get() in coldfusion.runtime.ConcurrentReferenceHashMap. A lot of time was spent in it after the slowdown compared to very little before. I spent some time on understanding how the function worked and developed a theory that maybe there was a problem with the hash function resulting in some huge buckets. Using the heap dumps I was able to see that the largest buckets only contained 6 elements so I discarded that theory.
Code Cache
I finally got on the right track when I read "Java Performance: The Definitive Guide". It has a chapter on the JIT Compiler which talks about the Code Cache which I had not heard of before.
Compiler disabled
When monitoring the number of compilations performed (monitored with jstat) and the size of the Code Cache (monitored with Memory Pools plugin of VisualVM) I saw that the size increased up to the maximum size (which is 48 MB by default in our environment -- the default varies depending on Java version and Java compiler). When the Code Cache became full the JIT Compiler was turned off. I have read that "CodeCache is full. Compiler has been disabled." should be printed when that happens but I did not see that message; maybe the version we are using does not have that message. I know that the compiler was turned off because the number of compilations performed stopped increasing.
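For anyone wanting to watch this happen: the compilation counter mentioned above can be tracked with jstat's -compiler option (the interval below is arbitrary; this is just one way to do it):
jstat -compiler <pid of tomcat> 5000
If the "Compiled" column stops increasing while the application is still busy, the compiler has most likely been turned off.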
Deoptimization continues
The JIT Compiler can deoptimize previously compiled functions, which causes the function to be executed by the interpreter again (unless it is replaced by an improved compilation). The deoptimized function can be garbage collected to free up space in the Code Cache.
For some reason functions continued to be deoptimized even though nothing was compiled to replace them. More and more memory would become available in the Code Cache but the JIT Compiler was not restarted.
I never had -XX:+PrintCompilation enabled when we experienced a slowdown, but I am quite sure that I would have seen either ConcurrentReferenceHashMap.get(), or a function that it depends on, being deoptimized at that time.
Result
We have not seen any slowdowns since we increased the maximum size of the Code Cache to 256 MB and we have also seen a general performance improvement. There is currently 110 MB in our Code Cache.
First, let me say that you have done an excellent job grabbing detailed facts about the problem; I really like how you make it clear what you know and what you are speculating - it really helps.
EDIT 1 Massive edit after the update on context vs. instance
We can rule out:
GCs (that would affect the CPU benchmark service thread and spike the main CPU)
Quartz jobs (that would either affect both Tomcats or the CPU benchmark)
The database (that would affect both Tomcats)
Network packet storms and similar (that would affect both Tomcats)
I believe that what you are suffering from is an increase in latency somewhere in your JVM. Latency is where a thread is waiting (synchronously) for a response from somewhere - it increases your servlet response time but at no cost to the CPU. Typical latencies are caused by:
Network calls, including
JDBC
EJB or RMI
JNDI
DNS
File shares
Disk reading and writing
Threading
Reading from (and sometimes writing to) queues
synchronized method or block
futures
Thread.join()
Object.wait()
Thread.sleep()
Confirming that the problem is latency
I suggest using a commercial profiling tool. I like JProfiler (http://www.ej-technologies.com/products/jprofiler/overview.html; 15-day trial version available), but YourKit is also recommended by the StackOverflow community. In this discussion I will use JProfiler terminology.
Attach to the Tomcat process while it is performing fine and get a feel for how it looks under normal conditions. In particular, use the high-level JDBC, JPA, JNDI, JMS, servlet, socket and file probes to see how long the JDBC, JMS, etc. operations take (screencast). Run this again when the server is exhibiting problems and compare. Hopefully you will see what precisely has been slowed down. In the product screenshot below, you can see the SQL timings using the JPA Probe:
(source: ej-technologies.com)
However, it's possible that the probes do not isolate the issue - for example it might be some threading issue. Go to the Threads view for the application; this displays a running chart of the state of each thread: whether it is executing on the CPU, in an Object.wait(), waiting to enter a synchronized block, or waiting on network I/O. When you know which thread or threads are exhibiting the issue, go to the CPU views, select the thread and use the thread states selector to drill down immediately to the expensive methods and their call stacks (screencast). You will be able to drill up into your application code.
This is a call stack for runnable time:
And this is the same one, but showing network latency:
When you know what is blocking, hopefully the path to resolution will be clearer.
We had the same problem, running on Java 1.7.0_u101 (one of Oracle's supported versions, since the latest public JDK/JRE 7 is 1.7.0_u79), running on G1 garbage collector. I cannot tell if the problem appears in other Java 7 versions or with other GCs.
Our process was Tomcat running Liferay Portal (I believe the exact version of Liferay is of no interest here).
This is the behavior we observed: using an -Xmx of 5GB, the initial Code Cache pool size right after startup was about 40MB. After a while, it dropped to about 30MB (which is kind of normal, since a lot of code runs during startup and is never executed again, so it is expected to be evicted from the cache after some time). We observed that there was some JIT activity, so the JIT actually populated the cache (compared to the sizes I mention later, it seems that the small cache size relative to the overall heap size places stringent requirements on the JIT, and this makes the latter evict the cache rather nervously). However, after a while no more compilations ever took place, and the JVM got painfully slow. We had to kill our Tomcats every now and then to get back adequate performance, and as we added more code to our portal, the problem got worse and worse (since the Code Cache became saturated more quickly, I guess).
It seems that there are several bugs in the JDK 7 JVM that cause it to not restart the JIT after an emergency flush (see this blog post: https://blogs.oracle.com/poonam/entry/why_do_i_get_message; it mentions Java bugs 8006952, 8012547, 8020151 and 8029091).
This is why manually increasing the Code Cache to a level where an emergency flush is unlikely to ever occur "fixes" the issue (I guess this is the case with JDK 7).
In our case, instead of trying to adjust the Code Cache pool size, we chose to upgrade to Java 8. This seems to have fixed the issue. Also, the Code Cache now seems to be quite a bit larger (the startup size is about 200MB, and the cruising size about 160MB). As expected, after some idle time the cache pool size drops, and goes up again when some user (or robot, or whatever) browses our site, causing more code to be executed.
I hope you find the above data helpful.
Forgot to say: I found the exposition, the supporting data, the inference logic and the conclusion of this post very, very helpful. Thank you, really!
Has someone an idea where to look?
The issue could be outside Tomcat/the JVM - do you have some batch job which kicks in and stresses the shared resource(s), like a common database?
Take a thread dump and see what the Java processes are doing when the application response time explodes.
If you are using Linux, use a tool like strace and check what the Java process is doing.
Have you checked JVM GC times? Some GC algorithms might 'pause' the application threads and increase the response time.
You can use jstat utility to monitor garbage collection statistics:
jstat -gcutil <pid of tomcat> 1000 100
The above command prints GC statistics every second, 100 times. Look at the FGC/YGC columns; if the numbers keep rising, there is something wrong with your GC options.
You might want to switch to CMS GC if you want to keep response time low:
-XX:+UseConcMarkSweepGC
You can check more GC options here.
What happens after your app is performing slow for a while, does it get back to performing well?
If so, then I would check if there is any activity not related to your app taking place at that time.
Something like an antivirus scan or a system/db backup.
If not, then I would suggest running it with a profiler (JProfiler, YourKit, etc.); these tools can point you to your hotspots very easily.
You are using Quartz, which manages timed processes, and this seems to take place at particular times.
Post your Quartz schedule and let us know if that aligns, and if so, you can determine which internal application process may be kicking off to consume your resources.
Alternately, it is possible a portion of your application code has finally been activated and decides to load data into the memory cache. You're using Hibernate; check the calls to your database and see if anything coincides.
My Java application has started to crash regularly with a SIGSEGV and a dump of stack data and a load of information in a text file.
I have debugged C programs in gdb and I have debugged Java code from my IDE. I'm not sure how to approach C-like crashes in a running Java program.
I'm assuming I'm not looking at a JVM bug here. Other Java programs run just fine, and the JVM from Sun is probably more stable than my code. However, I have no idea how I could even cause segfaults with Java code. There definitely is enough memory available, and when I last checked in the profiler, heap usage was around 50% with occasional spikes around 80%. Are there any startup parameters I could investigate? What is a good checklist when approaching a bug like this?
Though so far I'm not able to reliably reproduce the event, it does not seem to occur entirely at random either, so testing is not completely impossible.
ETA: Some of the gory details
(I'm looking for a general approach, since the actual problem might be very specific. Still, there's some info I already collected and that may be of some value.)
A while ago, I had similar-looking trouble after upgrading my CI server (see here for more details), but that fix (setting -XX:MaxPermSize) did not help this time.
Further investigation revealed that in the crash log files the thread marked as "current thread" is never one of mine, but either one called "VMThread" or one called "GCTaskThread". If it's the latter, it is additionally marked with the comment "(exited)"; if it's the former, the GCTaskThread is not in the list. This makes me suppose that the problem might be around the end of a GC operation.
I'm assuming I'm not looking at a JVM bug here. Other Java programs run just fine, and the JVM from Sun is probably more stable than my code.
I don't think you should make that assumption. Without using JNI, you should not be able to write Java code that causes a SIGSEGV (although we know it happens). My point is, when it happens, it is either a bug in the JVM (not unheard of) or a bug in some JNI code. If you don't have any JNI in your own code, that doesn't mean that you aren't using some library that is, so look for that. When I have seen this kind of problem before, it was in an image manipulation library. If the culprit isn't in your own JNI code, you probably won't be able to 'fix' the bug, but you may still be able to work around it.
First, you should get an alternate JVM on the same platform and try to reproduce it. You can try one of these alternatives.
If you cannot reproduce it, it likely is a JVM bug. From that, you can either mandate a particular JVM or search the bug database, using what you know about how to reproduce it, and maybe get suggested workarounds. (Even if you can reproduce it, many JVM implementations are just tweaks on Oracle's Hotspot implementation, so it might still be a JVM bug.)
If you can reproduce it with an alternative JVM, the fault might be that you have some JNI bug. Look at what libraries you are using and what native calls they might be making. Sometimes there are alternative "pure Java" configurations or jar files for the same library or alternative libraries that do almost the same thing.
Good luck!
The following will almost certainly be useless unless you have native code. However, here goes.
Start the java program in the java debugger, with a breakpoint well before the possible SIGSEGV.
Use the ps command to obtain the process id of java.
gdb /usr/lib/jvm/sun-java6/bin/java processid
Make sure that the gdb 'handle' command is set to stop on SIGSEGV.
Continue in the java debugger from the breakpoint.
Wait for the explosion.
Use gdb to investigate.
If you've really managed to make the JVM take a sigsegv without any native code of your own, you are very unlikely to make any sense of what you will see next, and the best you can do is push a test case onto a bug report.
I found a good list at http://www.oracle.com/technetwork/java/javase/crashes-137240.html. As I'm getting the crashes during GC, I'll try switching between garbage collectors.
I tried switching between the serial and the parallel GC (the latter being the default on a 64-bit Linux server), this only changed the error message accordingly.
Reducing the max heap size from 16G to 10G after a fresh analysis in the profiler (which showed heap usage flattening out at 8G) did lead to a significantly lower "virtual memory" footprint (16G instead of 60G), but I don't even know what that means, and the Internet says it doesn't matter.
Currently, the JVM is running in client mode (using the -client startup option thus overriding the default of -server). So far, there's no crash, but the performance impact seems rather large.
If you have a corefile you could try running jstack on it, which would give you something a little more comprehensible - see http://download.oracle.com/javase/6/docs/technotes/tools/share/jstack.html, although if it's a bug in the gc thread it may not be all that helpful.
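The usual invocation for a core file is to pass the launching java binary plus the core (this is the documented jstack usage for core files, though I haven't tried it on your particular dump):
jstack $JAVA_HOME/bin/java core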
Try to check whether a crash in C code is what caused the Java crash. Use valgrind to find invalid memory accesses, and also cross-check the stack size.
I asked this question a few weeks ago, but I'm still having the problem and I have some new hints. The original question is here:
Java Random Slowdowns on Mac OS
Basically, I have a java application that splits a job into independent pieces and runs them in separate threads. The threads have no synchronization or shared memory items. The only resources they do share are data files on the hard disk, with each thread having an open file channel.
Most of the time it runs very fast, but occasionally it will run very slowly for no apparent reason. If I attach a CPU profiler to it, it will start running quickly again. If I take a CPU snapshot, it says it's spending most of its time in "self time" in a function that doesn't do anything except check a few (unshared, unsynchronized) booleans. I don't know how this could be accurate because 1) it makes no sense, and 2) attaching the profiler seems to knock the threads out of whatever mode they're in and fix the problem. Also, regardless of whether it runs fast or slow, it always finishes and gives the same output, and it never dips in total CPU usage (in this case ~1500%), implying that the threads aren't getting blocked.
I have tried different garbage collectors, different sizings of the parts of the memory space, writing data output to non-RAID drives, and putting all data output in threads separate from the main worker threads.
Does anyone have any idea what kind of problem this could be? Could it be the operating system (OS X 10.6.2) ? I have not been able to duplicate it on a windows machine, but I don't have one with a similar hardware configuration.
It's probably a bit late to reply, but I observed similar slowdowns using Random in threads, related to a volatile variable used within java.util.Random - see "How can assigning a variable result in a serious performance drop while the execution order is (nearly) untouched?" for details. If the answer I got is correct (and it sounds pretty reasonable to me), the slowdown might be related to the in-memory addresses of the volatile variables used within Random (have a look at the answer by user 'irreputable' to my question, which explains the problem much better than I do here).
In case you're creating the Random instances within the run method of your threads, you could simply try turning them into object variables and initializing them within the constructor of your Thread: this would most likely ensure that the volatile fields of your Random instances end up in 'different areas' of RAM, which do not have to be synchronized between the processor cores.
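For illustration, something along these lines (Worker is a made-up name; the reasoning in the comments is the same as above, not a guarantee):

import java.util.Random;

public class Worker implements Runnable {

    // One Random per worker, allocated up front in the constructor instead of
    // inside run(); the idea is that the instances (and their internal volatile
    // seed fields) are less likely to sit next to each other in memory.
    private final Random random;

    public Worker(long seed) {
        this.random = new Random(seed);
    }

    public void run() {
        // ... use this.random here instead of creating a new Random in run() ...
        double sample = random.nextDouble();
    }
}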
How do you know it's running slow? How do you know that it runs quicker when CPU profiler is active? If you do the entire run under the profiler does it ever run slow? If you restrict the number of threads to one does it ever run slow?
Actually this is an interesting problem; I'm curious to know what the cause is.
First, in your previous question you say you split the job between "multiple" processors. Are they physically multiple, as in multiple machines, or a multi-core CPU?
Second, I'm not sure if Snow Leopard has something to do with it, but we know that SL introduced a few new features in terms of multi-processor machines. So there might be some problem with the VM on the new OS. Try using another Java version; I know SL uses Java 6 by default. Try Java 5.
Third, did you try making the thread pool a little smaller? You are talking about 100 threads running at the same time. Try making it 20 or 40, for example, and see if that makes a difference.
Finally, I would be interested in seeing how you implemented the multi-threading solution. Small parts of the code would be good.