I'm trying to track down a bottleneck in our servers, and I'm having a hard time deciding where to start. The symptoms before the machine crashes are dropped connections and out-of-memory errors. The dropped connections are timeouts: the client gives up when a response takes too long, which possibly indicates that the server-side code couldn't allocate a processor to handle the request. The out-of-memory error is reported by the JVM in an error log, but I have a hard time believing that RAM and a lack of available CPUs are both bottlenecks at the same time. (Every crash in the past few days has followed this same pattern.) We have our own in-house server code, and I wouldn't classify it as analogous to Apache or any other code I've seen. (Sorry if this makes giving advice that much more difficult.)
I'd like to take some time to create a somewhat controlled test locally. I'm running the server, and I'm creating a program that will request different things from the local server. What's a good way to monitor RAM/CPU? I'm currently using Java's VisualVM, but monitors stop responding when I hammer it with some of the tests.
Any ideas out there would be greatly appreciated. Like I mentioned, I'm trying to grab as much useful data as I can to help me troubleshoot further. In general, when a bottleneck issue like this arises, what are some general strategies to take? The live servers are all running on Windows Server 2008. The version of Java is 7.03. My local box is running Windows 7, with Java 7.03 as well. I don't want to make too many assumptions, but I think it's reasonable to assume that Server 2008 and Windows 7 are pretty similar. (The OS architecture is the same.) Aside from that, my local box has identical hardware to our servers.
For Windows, you need to:
1) Establish a "performance baseline"
2) Try to reproduce the bottleneck, and compare the behavior under stress with the baseline
3) Identify the cause of the bottleneck, and correct it.
These links will help:
http://technet.microsoft.com/en-us/library/cc781394%28v=ws.10%29.aspx
http://msdn.microsoft.com/en-us/library/windows/hardware/gg463394.aspx
You should also look at these TCP/IP registry settings:
Increase the range of ephemeral ports
HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters, MaxUserPort
Consider disabling auto-tuning:
http://geekswithblogs.net/cajunmcse/archive/2010/08/20/windows-2008-exchange-and-tcpip-auto-tuning.aspx
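For the baseline in step 1, one low-overhead option is to let the JVM log its own numbers, so you still get data when an external monitor such as VisualVM stops responding under load. This is only a sketch using the standard java.lang.management beans; the class name ResourceSampler and the CSV layout are my own invention, not anything from the links above. Note that getSystemLoadAverage() returns a negative value where it isn't available, which includes Windows, so on your boxes you'd still want perfmon for the CPU side.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import java.lang.management.OperatingSystemMXBean;
import java.lang.management.ThreadMXBean;

// Hypothetical helper: samples heap, thread count and load from inside the JVM.
class ResourceSampler {

    /** Returns one CSV line: timestamp, heap used/committed/max, threads, load average, CPUs. */
    static String sample() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();

        return System.currentTimeMillis()
                + "," + heap.getUsed()
                + "," + heap.getCommitted()
                + "," + heap.getMax()
                + "," + threads.getThreadCount()
                + "," + os.getSystemLoadAverage()   // negative when unavailable (e.g. Windows)
                + "," + os.getAvailableProcessors();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("timestampMs,heapUsed,heapCommitted,heapMax,threads,loadAvg,cpus");
        // Five one-second samples; a real run would loop for the whole test.
        for (int i = 0; i < 5; i++) {
            System.out.println(sample());
            Thread.sleep(1000);
        }
    }
}
```

Run it once on an idle server for the baseline and once during the stress test, then compare the two files.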
I have some issues with my PC and don't know what else to check or look for.
If the tags or description are off, feel free to edit/comment.
Basically the question is: Do you know of any tool that I could run during reproduction of the test, which logs frequently and might provide a hint on what's going on?
If anyone already has a clue of what could cause the problem, that would be great as well.
So here's the problem:
I have a running JBoss 4.2.3.GA server application which provides some EJBs with remote interfaces. Those EJBs write or read stuff to/from the database but that doesn't seem to matter since I also had methods that just did a System.out.println(...) and nothing more.
Now I run a test client from the console which basically just "remotely" calls one of those EJB methods in a loop (to take some timings etc.).
So far nothing too unorthodox is going on, it's basically just a bunch of remote EJB calls.
However, during the execution of the loop the computer freezes completely (the keyboard doesn't respond either, e.g. the num-lock key) - the only thing that still changes is the blinking cursor. :)
Unfortunately I didn't manage to find a reason for this and since I often do my tests from eclipse I'd like to not have that happen too often (workspace crashes etc.)
Here's what I tried so far:
Numerous hardware tests including Lenovo PC Doctor (it's a Lenovo PC) - all succeeded, so it seems like there's no hardware problem
Use different JDK versions: 1.5.0, 1.6.0, 1.7.0 - all crashed
JRockit JVM (Java 6) - crashed as well
make Java cause 100% CPU load (10 threads running constantly on 4 cores) - succeeded/no error
allocate as much memory as the JVM would allow me - succeeded
run the tests on other computers - succeeded, except one that has the same hardware and similar software setup
Windows logs don't provide a hint (except "system was not shut down correctly" ... well that helps :) )
After all these tests I assume it might be a problem with the system configuration (drivers etc.) but I don't know how to track that (and I can't just use brute force due to the massive time requirements).
So, did anyone experience similar problems?
Do you know of any tool that I might use to log what the system does and preferably get a log right before the crash?
Thanks in advance,
Thomas
I have a J2EE Java application which processes SOAP requests. In our production environment (HPUX, OC4J, Java 5) we have about 20 threads running for this process, and we sometimes see 1 thread pausing for ~15 seconds. Until now, I haven't succeeded in replicating the problem in our preproduction environment, and I'm scared of breaking stuff and violating SLAs if I use jconsole and associated tools on our production server.
Does anyone have any inspiration? I know about http://java.sun.com/j2se/1.5/pdf/jdk50_ts_guide.pdf but I lack the experience to dare use it straight in production (plus, the HPUX guys threw some of these tools out of the toolbox, replacing them with HPJMeter)
Also, although this suggests a GC problem to me, I don't yet know enough to prove or disprove this theory and I am open to other suggestions.
We connect jconsole (and other tools) straight to production regularly. There is no significant overhead for us, the instrumentation is already going on within the JVM so you'd just be connecting a remote process to read published values. I say go for it!
Either way, you really need to see what's going on on the box. Thread dumps might help, or you could do some internal instrumentation. By internal instrumentation, I mean recording key measures within the code and exposing those somehow. It's essentially what the JVM does (exposing them via JMX) but rolling your own gives you more specificity. For example, I'm frequently recording request/response or other critical path performance timings internally.
Oh, and one more thing: you can set up your app to use an agent to provide even more information. Typically this would be to plug a profiler in (like jprofiler or yourkit) but this does usually add more overhead and isn't recommended for production.
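To make the "internal instrumentation" idea concrete, here's a rough sketch of a home-grown counter exposed as a standard MBean, so jconsole (or any JMX client) can read it. Everything named here is made up for illustration: the RequestTimings class, the attribute names and the ObjectName "example.app:type=RequestTimings". The interface is nested and public because JMX wants the management interface to be public and named after the implementation class plus "MBean".

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicLong;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical example of exposing request timings through JMX.
class RequestTimings {

    /** Management interface: must be public and named <ImplementationClass>MBean. */
    public interface StatsMBean {
        long getRequestCount();
        long getTotalMillis();
        long getMaxMillis();
    }

    public static class Stats implements StatsMBean {
        private final AtomicLong count = new AtomicLong();
        private final AtomicLong totalMillis = new AtomicLong();
        private final AtomicLong maxMillis = new AtomicLong();

        /** Call this once per request from the critical path. */
        public void record(long millis) {
            count.incrementAndGet();
            totalMillis.addAndGet(millis);
            long seen;
            do {
                seen = maxMillis.get();
                // Retry only if we hold a new maximum and lost the race to store it.
            } while (millis > seen && !maxMillis.compareAndSet(seen, millis));
        }

        public long getRequestCount() { return count.get(); }
        public long getTotalMillis() { return totalMillis.get(); }
        public long getMaxMillis() { return maxMillis.get(); }
    }

    public static void main(String[] args) throws Exception {
        Stats stats = new Stats();
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ObjectName name = new ObjectName("example.app:type=RequestTimings"); // made-up name
        server.registerMBean(stats, name);

        long start = System.nanoTime();
        Thread.sleep(20); // stand-in for real request handling
        stats.record((System.nanoTime() - start) / 1000000L);

        // The same attribute a remote JMX client would see.
        System.out.println("requests=" + server.getAttribute(name, "RequestCount"));
    }
}
```

The counters are plain atomics, so the cost on the request path is a few uncontended increments; the JMX side only reads them when a client asks.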
It's also worth thinking about the cost of not getting the information you need out of the VM. For example, is the cost of not fixing the issue more or less than the cost of a small % drop of performance when monitoring?
More scientifically, this article has some comments. It's suggesting up to 7% overhead (contradicting my previous point), a previous article from 2006 suggests 3-4% but both are highly contextual results. For example, CPU intensive applications may or may not be affected more than IO bound ones.
So a more appropriate answer from me (rather than just "go for it") would be to understand the impact it would have for your application in your environment through measurement. Run representative tests on a similar environment to production with jconsole connected and disconnected and see for yourself.
Also see this stackoverflow question.
There are a few things that you can do on HP-UX to get additional information from a running Java process. If you send the PROF signal to the JVM, it will toggle the generation of a GC log (as if you had used the -Xverbosegc command line option). Generating the GC log is very inexpensive, so you should be able to turn this on in production without affecting the performance.
If you send the USR2 signal to the JVM, it starts profiling (same as -Xeprof). If you send the signal a second time, it turns off the profiling. This will have a noticeable performance impact, though it is smaller than what you would see from an external, third party profiler.
You can analyze the resulting data files using HPjmeter. HPjmeter can also connect to a running JVM for real-time monitoring. With Java 5, you need to start the JVM with the -agentlib option. If you were using Java 6, you could attach to the running JVM without needing any extra command line options.
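If you'd rather cross-check the GC theory from inside the application, the standard java.lang.management API (available since Java 5) exposes cumulative GC counts and times. This is a generic sketch, not an HP-specific feature, and I can't vouch for how the HP JVM populates these beans; GcTotals is a made-up class name. Log the totals periodically and look at the delta around one of the 15-second pauses.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper: cumulative GC figures summed over all collectors.
class GcTotals {

    /** Total number of collections so far; collectors reporting -1 (undefined) are skipped. */
    static long totalGcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    /** Approximate accumulated collection time in milliseconds. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime();
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    public static void main(String[] args) throws InterruptedException {
        long lastCount = totalGcCount();
        long lastMillis = totalGcMillis();
        for (int i = 0; i < 3; i++) {
            Thread.sleep(1000);
            long count = totalGcCount();
            long millis = totalGcMillis();
            // A large time delta in one interval points at a long collection.
            System.out.println("collections=+" + (count - lastCount)
                    + " gcMillis=+" + (millis - lastMillis));
            lastCount = count;
            lastMillis = millis;
        }
    }
}
```

If a pause shows up with no matching jump in GC time, that's a hint to look at locks or slow dependencies instead.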
I've developed a web application using the following tech stack:
Java
Mysql
Scala
Play Framework
DavMail integration (for calendar and exchange server)
Javamail
Akka actors
For the first few days, the application runs smoothly and without lags. But after 5 days or so, the application gets really slow! And now I have no clue how to profile this, since I have huge dependencies and it's hard to reproduce this kind of thing. I have looked into the memory and it seems that everything is okay.
Any pointers on the matter?
Try using VisualVM - you can monitor gc behaviour, memory usage, heap, threads, cpu usage etc. You can use it to connect to a remote VM.
`visualvm` is also a great tool for such purposes, you can connect to a remote JVM as well and see what's inside.
I suggest doing this:
take a snapshot of the application after it has been running for a few hours and another after 5 days
compare thread counts
compare object counts, search for increasing numbers
see if your program spends more time in particular methods on the 5th day than on the 1st one
check for disk space, maybe you are running out of it
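If attaching a tool is awkward, the thread-count comparison can also be done from inside the app. Below is a small sketch (ThreadCensus is a made-up name) that groups live threads by name with the trailing number stripped, so a pool that keeps growing stands out when you compare the output from day 1 and day 5.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: counts live threads per name prefix.
class ThreadCensus {

    /** Strips trailing digits, dashes and '#' so "pool-1-thread-3" becomes "pool-1-thread". */
    static String prefixOf(String threadName) {
        return threadName.replaceAll("[-#\\d]+$", "");
    }

    /** Number of live threads per name prefix, sorted by prefix. */
    static Map<String, Integer> countByPrefix() {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (Thread thread : Thread.getAllStackTraces().keySet()) {
            String prefix = prefixOf(thread.getName());
            Integer old = counts.get(prefix);
            counts.put(prefix, old == null ? 1 : old + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Integer> entry : countByPrefix().entrySet()) {
            System.out.println(entry.getValue() + "\t" + entry.getKey());
        }
    }
}
```

Dump this to the log once an hour; a prefix whose count only ever goes up is a good candidate for threads that are created but never exit.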
jconsole comes with the JDK and is an easy tool to spot bottlenecks. Connect it to your server, look into memory usage, GC times, take a look at how many threads are alive because it could be that the server creates many threads and they never exit.
I agree with tulskiy. On top of that you could also use JMeter if your investigations with jconsole are inconclusive.
The probable causes of the performance degradation are threads (that are created but never exit) and also memory leaks: if you allocate more and more memory, you may see performance degrade well before you hit the OutOfMemoryError (happened to me a few weeks ago).
To eliminate your database you can monitor slow queries (and/or queries that are not using an index) using the slow query log
see: http://dev.mysql.com/doc/refman/5.1/en/slow-query-log.html
I would hazard a guess that you have a missing index, and it has only become apparent as your data volumes have increased.
Yet another profiler is Yourkit.
It is commercial, but with trial period (two weeks).
Actually, I first tried VisualVM as @axel22 suggested, but our remote server was only reachable over ssh and we had problems connecting with VisualVM (not saying that it is impossible, I just gave up after a few hours).
You might just want to try the 'play status' command, which will list web app state (threads, jobs, etc). This might give you a hint on what's going on.
So guys, in this specific case, I was running play in Developer mode, which makes the compiler run every now and then.
After changing to production mode, everything was lightning fast and there were no more problems. But thanks for all the help.
We have a Java ERP type of application. Communication between server and client is via RMI. In peak hours there can be up to 250 users logged in and about 20 of them are working at the same time. This means that about 20 threads are live at any given time in peak hours.
The server can run for hours without any problems, but all of a sudden response times get higher and higher. Response times can be in minutes.
We are running on Windows 2008 R2 with Sun's JDK 1.6.0_16. We have been using perfmon and Process Explorer to see what is going on. The only thing that we find odd is that when the server starts to slow down, the number of handles the java.exe process has open is around 3500. I'm not saying that this is the actual problem.
I'm just curious if there are some guidelines I should follow to be able to pinpoint the problem. What tools should I use? ....
Can you access the log configuration of this application?
If you can, you should change the log level to "DEBUG". Tracing the DEBUG logs of a request could give you useful information about the contention point.
If you can't, profiler tools can help you:
VisualVM (Free, and good product)
Eclipse TPTP (Free, but more complicated than VisualVM)
JProbe (not Free but very powerful. It is my favorite Java profiler, but it is expensive)
If the application has been developed with JMX control points, you can plug in a JMX viewer to get information...
If you want to stress the application to trigger the problem (if you want to verify whether it is a load problem), you can use stress tools like JMeter
Sounds like the garbage collection cannot keep up and starts "stop-the-world" collecting for some reason.
Attach with jvisualvm in the JDK when starting and have a look at the collected data when the performance drops.
The problem you're describing is quite typical but general as well. Causes can range from memory leaks and resource contention to bad GC policies and heap/PermGen-space allocation. To point out exact problems with your application, you need to profile it (I am aware of tools like Yourkit and JProfiler). If you profile your application wisely, a few application cycles should be enough to reveal the problems; otherwise profiling isn't very easy in itself.
In a similar situation, I wrote some simple profiling code myself. Basically I used a ThreadLocal that has a "StopWatch" (based on a LinkedHashMap) in it, and I then inserted code like this at various points of the application: watch.time("OperationX");
Then, after the thread finishes a task, I'd call watch.logTime(); and the class would write a log that looks like this: [DEBUG] StopWatch time:Stuff=0, AnotherEvent=102, OperationX=150
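For anyone who wants to try it, here is a minimal reconstruction of what such a class could look like. It is not the original code: the current() accessor and the choice to record milliseconds since the watch was started (rather than since the previous mark) are my assumptions, and it prints to stdout where the real thing would use a logger.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of a per-thread stopwatch; details beyond time()/logTime() are assumptions.
class StopWatch {

    private static final ThreadLocal<StopWatch> CURRENT = new ThreadLocal<StopWatch>() {
        @Override
        protected StopWatch initialValue() {
            return new StopWatch();
        }
    };

    // LinkedHashMap keeps the events in the order they were recorded.
    private final Map<String, Long> marks = new LinkedHashMap<String, Long>();
    private long startNanos = System.nanoTime();

    /** The stopwatch belonging to the calling thread. */
    static StopWatch current() {
        return CURRENT.get();
    }

    /** Records how many milliseconds have passed since this watch was (re)started. */
    void time(String event) {
        marks.put(event, (System.nanoTime() - startNanos) / 1000000L);
    }

    /** Writes one log line with all events, then resets the watch for the next task. */
    String logTime() {
        StringBuilder line = new StringBuilder("[DEBUG] StopWatch time:");
        boolean first = true;
        for (Map.Entry<String, Long> mark : marks.entrySet()) {
            if (!first) {
                line.append(", ");
            }
            line.append(mark.getKey()).append('=').append(mark.getValue());
            first = false;
        }
        System.out.println(line);
        marks.clear();
        startNanos = System.nanoTime();
        return line.toString();
    }

    public static void main(String[] args) throws InterruptedException {
        StopWatch watch = StopWatch.current();
        watch.time("Stuff");
        Thread.sleep(100); // stand-in for real work
        watch.time("AnotherEvent");
        Thread.sleep(50);
        watch.time("OperationX");
        watch.logTime();
    }
}
```

Because the watch lives in a ThreadLocal, code deep in the call stack can call StopWatch.current().time(...) without anything being passed around.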
After this I wrote a simple parser that generates CSV from this log (per code path). The best thing you can do is to create a histogram (can be easily done using Excel). Averages, medians and even modes can fool you. I highly recommend creating a histogram.
Together with this histogram, you can create line graphs using the average/median/mode (whichever represents the data best, you can determine this from the histogram).
This way, you can see exactly which operation is taking the time. If you can't determine the culprit, binary search is your friend (make the events more fine-grained).
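If you don't feel like opening a spreadsheet, the bucketing is a few lines of code. A sketch, with a made-up class name and made-up sample timings, using fixed-width buckets:

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: fixed-width latency histogram.
class LatencyHistogram {

    /** Maps each bucket's lower bound (in ms) to the number of samples that fell into it. */
    static Map<Long, Integer> bucket(long[] millis, long width) {
        Map<Long, Integer> buckets = new TreeMap<Long, Integer>();
        for (long ms : millis) {
            long lowerBound = (ms / width) * width;
            Integer old = buckets.get(lowerBound);
            buckets.put(lowerBound, old == null ? 1 : old + 1);
        }
        return buckets;
    }

    public static void main(String[] args) {
        // Invented timings: mostly fast, with a few very slow outliers.
        long[] samples = {10, 12, 11, 13, 9, 480, 510, 495, 12, 10};
        for (Map.Entry<Long, Integer> entry : bucket(samples, 100).entrySet()) {
            StringBuilder bar = new StringBuilder();
            for (int i = 0; i < entry.getValue(); i++) {
                bar.append('#');
            }
            System.out.println(entry.getKey() + "ms+\t" + bar);
        }
    }
}
```

The mean of those invented samples is about 156 ms, a latency that none of the requests actually had, which is exactly how an average can fool you.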
Might sound really primitive, but works. Also, if you make a library out of it, you can use it in any project. It's also cool because you can easily turn it on in production as well..
Aside from the GC that others have mentioned, try taking thread dumps every 5-10 seconds for about 30 seconds during your slowdown. There could be a case where DB calls, a web service, or some other dependency becomes slow. If you take a look at the thread dumps you will be able to see threads which don't appear to move, and you could narrow down your culprit that way.
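The comparison between dumps can be automated from inside the JVM too. The sketch below (StuckThreadFinder is a made-up name) takes several snapshots via ThreadMXBean.dumpAllThreads and keeps only the threads whose state and stack are identical every time. Bear in mind that idle pool threads parked waiting for work will also show up, so you still have to read the stacks.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: finds threads that look the same across several dumps.
class StuckThreadFinder {

    /** One dump: thread id mapped to "name state [stack frames]". */
    static Map<Long, String> snapshot() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        Map<Long, String> stacks = new HashMap<Long, String>();
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            stacks.put(info.getThreadId(), info.getThreadName() + " " + info.getThreadState()
                    + " " + Arrays.toString(info.getStackTrace()));
        }
        return stacks;
    }

    /** Keeps the threads whose entry is identical in both snapshots. */
    static Map<Long, String> unchanged(Map<Long, String> earlier, Map<Long, String> later) {
        Map<Long, String> same = new HashMap<Long, String>();
        for (Map.Entry<Long, String> entry : later.entrySet()) {
            if (entry.getValue().equals(earlier.get(entry.getKey()))) {
                same.put(entry.getKey(), entry.getValue());
            }
        }
        return same;
    }

    public static void main(String[] args) throws InterruptedException {
        Map<Long, String> candidates = snapshot();
        // Three more dumps, 5 seconds apart, as suggested above.
        for (int i = 0; i < 3; i++) {
            Thread.sleep(5000);
            candidates = unchanged(candidates, snapshot());
        }
        for (String stack : candidates.values()) {
            System.out.println("did not move: " + stack);
        }
    }
}
```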
From the GC stand point, do you monitor your CPU usage during these times? If the GC is running frequently you will see a jump in your overall CPU usage.
If only this was a Solaris box, prstat would be your friend.
For acute issues like this a quick jstack <pid> should quickly point out the problem area. Probably no need to get all fancy on it.
If I had to guess, I'd say Hotspot jumped in and tightly optimised some badly written code. Netbeans grinds to a halt where it uses a WeakHashMap with newly created objects to cache file data. When optimised, the entries can be removed from the map straight after being added. Obviously, if the cache is being relied upon, much file activity follows. You probably won't see the drive light up, because it'll all be cached by the OS.
We are doing some Java stress runs (involving network IO). Initially things are all fine and the system responds very fast (avg latency in test 2ms). But hours later when I redo the same test I observe the performance goes down (20 - 60ms). It's the same Jar files, same JVM, and the same LAN over which the stress is running. I am not understanding the reason for this behavior.
The LAN is 1 Gbps, and for the stress requirements I'm sure we are not using all of it.
So my questions:
Can it be because of some switches in the LANs?
Does the machine slow down after some time? (The machines were last restarted about 6 months ago, well before the stress runs started. They are RHEL5, 64-bit quad-core Xeon.)
What is the general way to debug such issues?
A few questions...
How much of the environment is under your control and are you putting any measures in place to ensure it's consistent for each run? i.e. are you sharing the network with other systems, is the machine you're using being used solely for your stress testing?
The way I'd look at this is to start gathering details on what your machine and code are up to. That means using perfmon (Windows) or sar (Unix) to find out what the OS and hardware are doing, and attaching a profiler to check that your code is doing the same thing on every run and to help pinpoint where the bottleneck is occurring from a code perspective.
Nothing terribly detailed but something I hope that will help get you started.
The general way is "measure everything". This, in particular, might mean:
Ensure the time on all servers is the same (use ntp or something similar);
Measure how long it took to generate the request (what if the request generator has a bug?);
Measure when the request left the client machine(s), or at least how long the i/o took. Sometimes it is enough to know the average time necessary for many requests.
Measure when the request arrived.
Measure how long it took to generate a response.
Measure how long it took to send the response.
You can probably start from the 5th element, as this is (you believe) your critical chain. But it is best to log as much as you can - as according to what you've said yourself, it takes days to produce different results.
If you don't want to modify your code, look for cases where you can sniff data without intervening (e.g. define a servlet filter in your web.xml).