We are working on a large Java program that was converted from a Forte application. During the day we are getting blocking SPIDs on the server. A DBA visited yesterday and set up a Profiler trace template to catch the locking/blocking activity. When we run this trace, the blocking problem goes away. Why?
This application is distributed using RMI and has around 70 users. We are using SQL Server 2000 and Windows 2000 servers to stay compatible with a bunch of old VB helper applications.
We have traced the blocking down to a specific screen and stored procedure, but now we can't get the errors to happen while Profiler is running.
Thanks for any help!
Theo
The good old Heisenberg debugger problem.
Any profiler does two things: it adds code in place to invoke the debugger, and it stores data. The first can thwart optimizers, and the second can change the timing of something, causing a race condition to go away.
This blocking SPID problem seems to show up on Google a lot. It appears to occur when one process holds a lock on a resource that another process wants, so a timing problem sounds likely.
Microsoft has an article on how to deal with the problem.
Just a collection of random thoughts.. I've seen traces take a server down but never make things better.
What trace template are you using? (These are taken from SQL Server 2005 tools, sorry)
The "Standard (default)" one tracks high-level calls and logon/logout.
The "TSQL_SPs" one tracks statement calls, which would be a lot more intrusive.
Is it binary and repeatable? That is, trace on = no blocks and trace off = blocks, or is it an unlucky coincidence? When you're all watching the DBA, does someone stop clicking in the client and come over to watch?
Is something else being switched off as part of the trace? That is, are you using Profiler or a scripted trace (lots of sp_trace_set% statements)? In a scripted trace, there may be something that switches something else off.
Related
I have a browser game built on a Java web server using JSP.
I added a new module that keeps some data in an HTTP session object. However, after it runs for 3-4 hours, it suddenly stops working and freezes. When I check the error log, I don't see any exception thrown.
The server has 50-60 users online at any moment.
I monitored the server using VisualVM and here is the result after 4 hours, until it stops:
I set the max memory to 1024 MB. As you can see, the problem is not about memory.
What I notice is that when the server stops, the thread count has increased.
According to the screenshot, should I suspect the HttpSession object? Why does the server stop responding?
It looks like a system limitation or a deadlock.
Your thread graph looks problematic: the number of live threads is large and never decreases. A web application should be stateless. The live thread count should rise when requests arrive but also drop when the requests are finished.
I do not have the impression that this is the case in your application.
MGorgon is right.
You should also check "Deadlock detection" in jconsole.
If you use JDK 6 or later, you could use ThreadMXBean. It has a findDeadlockedThreads() method and other useful methods for this purpose.
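A minimal sketch of the ThreadMXBean check mentioned above. The class name DeadlockCheck and the method describeDeadlocks() are made up for illustration; only the java.lang.management calls are standard JDK API. It has to run inside the JVM you want to inspect (for example from a background thread or an admin page).

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper class, not part of any library.
public class DeadlockCheck {

    /** Returns one description per deadlocked thread, or an empty array if none is found. */
    public static String[] describeDeadlocks() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.findDeadlockedThreads(); // null when no deadlock is detected
        if (ids == null) {
            return new String[0];
        }
        // Ask for locked monitors and synchronizers so the report shows who holds what.
        ThreadInfo[] infos = mx.getThreadInfo(ids, true, true);
        String[] out = new String[infos.length];
        for (int i = 0; i < infos.length; i++) {
            // An entry can be null if the thread ended between the two calls.
            out[i] = (infos[i] == null) ? "(thread no longer alive)" : infos[i].toString();
        }
        return out;
    }

    public static void main(String[] args) {
        String[] found = describeDeadlocks();
        System.out.println("deadlocked threads: " + found.length);
        for (String description : found) {
            System.out.println(description);
        }
    }
}
```

If this reports nothing while the server is frozen, the threads are probably waiting on something other than each other (a database call, a socket, a pool), which is a different problem from a true deadlock.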
Anyway, if it is not a deadlock, I advise you to look in the system log, whatever your OS is, to get more information about the cause of the problem. You may find something interesting there.
Since last month, we got a problem on our company's server (Win2008ServerStd + IIS7 + CF enterprise 9.0.1 (hotfix2)).
I used JConsole to monitor the ColdFusion JVM (1.6.0_24) activity and here's what I see:
Notice that strange "curve" between 14:10 and 14:15! What is that?
Obviously it's not standard behaviour. When it happens, my applications hang for 30 to 70 seconds!
Do you know what can cause that memory issue? It seems like the GC does not run correctly, or hangs itself.
I don't expect a quick answer, and I realize there can be a lot of root problems causing that, but where can I start investigating?
Using cfstat, perfmon, FusionReactor, or the CF performance monitor, take a look at running and queued requests during your problem. What you will likely see is running requests climbing past the simultaneous requests setting (in the CF admin). Then the requests will start to queue. Eventually the queue will clear out (if your server is recovering on its own).
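To make the "running climbs to the limit, then requests queue" pattern concrete, here is a small plain-Java analogy. It is not ColdFusion code and says nothing about how CF implements its request limit; the class RequestQueueDemo and its method are hypothetical, and a fixed-size ThreadPoolExecutor stands in for the simultaneous requests setting.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class: a fixed-size pool models a "simultaneous requests" limit.
public class RequestQueueDemo {

    /** Submits slow "requests" to a pool with `limit` workers and returns {running, queued}. */
    public static int[] runningAndQueued(int limit, int requests) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                limit, limit, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<Runnable>());
        final CountDownLatch started = new CountDownLatch(Math.min(limit, requests));
        final CountDownLatch release = new CountDownLatch(1);
        for (int i = 0; i < requests; i++) {
            pool.execute(new Runnable() {
                public void run() {
                    started.countDown();
                    try {
                        release.await(); // simulates a request stuck on a slow dependency
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
        }
        try {
            started.await(); // wait until every worker is busy
            return new int[] { pool.getActiveCount(), pool.getQueue().size() };
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return new int[] { -1, -1 };
        } finally {
            release.countDown(); // let the stuck "requests" finish
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int[] result = runningAndQueued(2, 5);
        System.out.println("running=" + result[0] + ", queued=" + result[1]);
    }
}
```

The point of the analogy: once every worker is blocked on something slow, extra requests do not fail, they wait. From the outside that looks like a hang that clears up on its own when the slow dependency recovers.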
This sort of thing can be caused by a number of things. For example, if your DB server slows down or has an issue, if your network has a problem, or if network ports are resyncing, if your disks have I/O problems etc.
My guess is that you will drive yourself batty trying to figure this out by monitoring your heap. See if you can watch one of the monitors for some specific scripts that might be the culprit.
The other comment (about some indexing agents) is also a possibility. A flurry of indexing can definitely cause this behavior. If that's the case, you might take a look at the simultaneous requests setting. If it is set at the default, you might have enough headroom to increase it.
It could have been a spider creating lots and lots of sessions as it crawled the site which would eat up memory for a period of time. Once the spider stopped crawling those sessions would time out and be garbage collected.
I would compare your HTTP server logs with the JVM logs. Look at that time frame and see if there are a lot of requests from a search engine spider (Googlebot, msnbot, etc.).
Fabio,
I had the same kind of issue a couple of months ago: I was getting spikes at regular intervals, with the server eating up around 50% of the CPU. I wrote up the full story at the URL below,
http://www.isummation.com/blog/strange-coldfusion-issue-jrun-eating-up-to-50-of-cpu/
which may help you (sorry it is so long).
I found that storing client variables in the registry was causing the issue. I was able to catch it with the help of VisualVM: I first found the thread causing the problem, and its stack trace led me to the solution.
The only thing that's really odd IMO is the sudden spike to having so many threads. Capture thread dumps on a regular basis (jstack and similar tools are your friends) and then correlate them with the point where your monitoring shows the spike.
The root problem will become more obvious once you understand what all the extra threads are doing. Perhaps it's more threads handling transactions, but it might be something else entirely.
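If running jstack on a schedule is awkward, a rough in-process alternative is to log a timestamped thread count per state and line it up with the monitoring graph afterwards. A minimal sketch; ThreadStateSummary is a made-up name and this is only a coarse summary, not a replacement for a full dump.

```java
import java.util.Date;
import java.util.EnumMap;
import java.util.Map;

// Hypothetical helper: counts live threads per Thread.State.
public class ThreadStateSummary {

    /** Returns how many live threads are currently in each state. */
    public static Map<Thread.State, Integer> countByState() {
        Map<Thread.State, Integer> counts =
                new EnumMap<Thread.State, Integer>(Thread.State.class);
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            Thread.State state = t.getState();
            Integer old = counts.get(state);
            counts.put(state, old == null ? 1 : old + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        // One line per sample; call this from a timer to build a history.
        System.out.println(new Date() + " " + countByState());
    }
}
```

A spike made up mostly of BLOCKED or WAITING threads points to contention or a slow dependency, while a spike of RUNNABLE threads points to real work (or threads sitting in native I/O, which also shows up as RUNNABLE).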
We have a Java ERP type of application. Communication between server and client is via RMI. In peak hours there can be up to 250 users logged in and about 20 of them are working at the same time. This means that about 20 threads are live at any given time in peak hours.
The server can run for hours without any problems, but all of a sudden response times get higher and higher. Response times can be in minutes.
We are running on Windows 2008 R2 with Sun's JDK 1.6.0_16. We have been using perfmon and Process Explorer to see what is going on. The only thing that we find odd is that when the server starts to slow down, the number of handles the java.exe process has open is around 3500. I'm not saying that this is the actual problem.
I'm just curious if there are some guidelines I should follow to be able to pinpoint the problem. What tools should I use? ....
Do you have access to the log configuration of this application?
If you do, you should change the log level to "DEBUG". Tracing the DEBUG logs of a request could give you useful information about the contention point.
If you don't, profiling tools can help you:
VisualVM (free, and a good product)
Eclipse TPTP (free, but more complicated than VisualVM)
JProbe (not free but very powerful. It is my favorite Java profiler, but it is expensive)
If the application has been developed with JMX control points, you can plug in a JMX viewer to get information...
If you want to stress the application to trigger the problem (to verify whether it is a load problem), you can use stress tools like JMeter.
Sounds like the garbage collector cannot keep up and starts doing "stop-the-world" collections for some reason.
Attach with jvisualvm in the JDK when starting and have a look at the collected data when the performance drops.
The problem you're describing is quite typical, but also quite general. Causes can range from memory leaks and resource contention to bad GC policies and heap/PermGen space allocation. To pinpoint the exact problem in your application, you need to profile it (I am aware of tools like YourKit and JProfiler). If you profile wisely, a few application cycles should be enough to reveal the problem; otherwise profiling isn't very easy in itself.
In a similar situation, I wrote some simple profiling code myself. Basically I used a ThreadLocal that holds a "StopWatch" (based on a LinkedHashMap), and I then inserted code like this at various points of the application: watch.time("OperationX");
Then, after the thread finishes a task, I'd call watch.logTime(); and the class would write a log line that looks like this: [DEBUG] StopWatch time:Stuff=0, AnotherEvent=102, OperationX=150
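The original class isn't shown, so the following is only a guess at what such a StopWatch could look like. Assumptions: each time(...) call records the milliseconds elapsed since the previous mark, repeated event names are summed, and logTime() returns the line (instead of writing it to a logger) and resets the watch. Only the names time and logTime come from the description above; current() is invented here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of a per-thread stopwatch; the semantics are assumptions, see the text above.
public class StopWatch {

    // One watch per thread, so concurrent requests do not mix their timings.
    private static final ThreadLocal<StopWatch> CURRENT = new ThreadLocal<StopWatch>() {
        @Override
        protected StopWatch initialValue() {
            return new StopWatch();
        }
    };

    // LinkedHashMap keeps events in the order they were first recorded.
    private final Map<String, Long> elapsed = new LinkedHashMap<String, Long>();
    private long last = System.currentTimeMillis();

    /** Returns the watch that belongs to the calling thread. */
    public static StopWatch current() {
        return CURRENT.get();
    }

    /** Records the milliseconds since the previous mark under the given event name. */
    public void time(String event) {
        long now = System.currentTimeMillis();
        Long old = elapsed.get(event);
        elapsed.put(event, (old == null ? 0L : old) + (now - last));
        last = now;
    }

    /** Builds the log line for the finished task and resets the watch. */
    public String logTime() {
        StringBuilder sb = new StringBuilder("[DEBUG] StopWatch time:");
        boolean first = true;
        for (Map.Entry<String, Long> e : elapsed.entrySet()) {
            if (!first) {
                sb.append(", ");
            }
            sb.append(e.getKey()).append('=').append(e.getValue());
            first = false;
        }
        elapsed.clear();
        last = System.currentTimeMillis();
        return sb.toString();
    }

    public static void main(String[] args) throws InterruptedException {
        StopWatch watch = StopWatch.current();
        Thread.sleep(20);
        watch.time("Stuff");
        Thread.sleep(50);
        watch.time("OperationX");
        System.out.println(watch.logTime());
    }
}
```

Returning the string rather than logging it keeps the sketch free of any logging library; in a real application you would hand the line to whatever logger you already use.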
After this I wrote a simple parser that generates CSV from this log (per code path). The best thing you can do is to create a histogram (which can easily be done in Excel). The average, the median and even the mode can fool you. I highly recommend creating a histogram.
Together with this histogram, you can create line graphs using the average, median or mode (whichever represents the data best; you can determine this from the histogram).
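If you would rather not go through Excel, the bucketing itself is only a few lines. A minimal sketch, assuming non-negative timings in milliseconds and fixed-width buckets, with the last bucket collecting everything larger; LatencyHistogram and the sample data are made up.

```java
import java.util.Arrays;

// Hypothetical helper: fixed-width histogram of non-negative timings.
public class LatencyHistogram {

    /** Counts samples per bucket of width bucketMillis; the last bucket also takes the overflow. */
    public static int[] histogram(long[] samplesMillis, long bucketMillis, int buckets) {
        int[] counts = new int[buckets];
        for (long sample : samplesMillis) {
            int index = (int) Math.min(sample / bucketMillis, buckets - 1);
            counts[index]++;
        }
        return counts;
    }

    /** Arithmetic mean, shown only to contrast it with the histogram. */
    public static double mean(long[] samplesMillis) {
        long sum = 0;
        for (long sample : samplesMillis) {
            sum += sample;
        }
        return (double) sum / samplesMillis.length;
    }

    public static void main(String[] args) {
        // Made-up timings: mostly fast, with two slow outliers.
        long[] timings = { 100, 110, 105, 95, 98, 102, 2000, 1900 };
        System.out.println("mean = " + mean(timings));
        System.out.println("histogram (500 ms buckets) = "
                + Arrays.toString(histogram(timings, 500, 5)));
    }
}
```

With data like this the mean lands in a range where no single request actually falls, which is exactly how an average can fool you; the histogram shows the two separate groups directly.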
This way, you can be sure exactly which operation is taking the time. If you can't determine the culprit, binary search is your friend (make the events more fine-grained).
Might sound really primitive, but works. Also, if you make a library out of it, you can use it in any project. It's also cool because you can easily turn it on in production as well..
Aside from the GC that others have mentioned, try taking thread dumps every 5-10 seconds for about 30 seconds during your slowdown. There could be a case where DB calls, a web service, or some other dependency becomes slow. If you take a look at the thread dumps you will be able to see threads which don't appear to move, and you could narrow down your culprit that way.
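The "threads which don't appear to move" comparison can also be roughed out in code. A minimal sketch that takes two in-process snapshots with Thread.getAllStackTraces() and lists the threads whose stack is identical in both; StuckThreads and the thread name are hypothetical. Idle pool threads will show up too, so read the list together with the actual stacks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical helper: finds threads whose stack did not change between two snapshots.
public class StuckThreads {

    /** Names of threads whose stack trace is non-empty and identical in both snapshots. */
    public static List<String> unchanged(Map<Thread, StackTraceElement[]> first,
                                         Map<Thread, StackTraceElement[]> second) {
        List<String> names = new ArrayList<String>();
        for (Map.Entry<Thread, StackTraceElement[]> entry : first.entrySet()) {
            StackTraceElement[] later = second.get(entry.getKey());
            if (later != null && later.length > 0 && Arrays.equals(entry.getValue(), later)) {
                names.add(entry.getKey().getName());
            }
        }
        return names;
    }

    /** Takes two snapshots intervalMillis apart and compares them. */
    public static List<String> sample(long intervalMillis) throws InterruptedException {
        Map<Thread, StackTraceElement[]> first = Thread.getAllStackTraces();
        Thread.sleep(intervalMillis);
        return unchanged(first, Thread.getAllStackTraces());
    }

    public static void main(String[] args) throws InterruptedException {
        // Stand-in for a thread stuck on a slow DB call or web service.
        Thread waiter = new Thread(new Runnable() {
            public void run() {
                try {
                    Thread.sleep(60000);
                } catch (InterruptedException ignored) {
                    // demo thread, nothing to clean up
                }
            }
        }, "slow-dependency-call");
        waiter.setDaemon(true);
        waiter.start();
        Thread.sleep(100);
        System.out.println("unchanged between snapshots: " + sample(500));
    }
}
```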
From the GC standpoint, do you monitor your CPU usage during these times? If the GC is running frequently you will see a jump in your overall CPU usage.
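Besides watching CPU, you can ask the JVM directly how much time it has spent collecting. A minimal sketch using the standard GarbageCollectorMXBean; the class GcWatch and the one-second polling loop are just for illustration, and the code has to run inside the JVM being watched.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper: sums accumulated GC time over all collectors.
public class GcWatch {

    /** Total milliseconds spent in GC so far; collectors that report -1 (undefined) are skipped. */
    public static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime();
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    public static void main(String[] args) throws InterruptedException {
        long previous = totalGcMillis();
        for (int i = 0; i < 5; i++) {
            Thread.sleep(1000);
            long now = totalGcMillis();
            // A delta close to the 1000 ms interval means the JVM is doing little besides GC.
            System.out.println("GC time in the last second: " + (now - previous) + " ms");
            previous = now;
        }
    }
}
```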
If only this was a Solaris box, prstat would be your friend.
For acute issues like this a quick jstack <pid> should quickly point out the problem area. Probably no need to get all fancy on it.
If I had to guess, I'd say HotSpot jumped in and tightly optimised some badly written code. NetBeans grinds to a halt where it uses a WeakHashMap with newly created objects to cache file data. When optimised, the entries can be removed from the map straight after being added. Obviously, if the cache is being relied upon, much file activity follows. You probably won't see the drive light up, because it'll all be cached by the OS.
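To illustrate the WeakHashMap point only (this is not the NetBeans code, and it does not show the JIT effect itself): an entry whose key is referenced from nowhere else may disappear at any time, while an entry whose key is still strongly held stays. System.gc() is only a hint, so the first entry may or may not be gone when you run this. WeakCacheDemo is a made-up name.

```java
import java.util.Map;
import java.util.WeakHashMap;

// Hypothetical demo of weak keys; results for the unreferenced key depend on the GC.
public class WeakCacheDemo {

    /** Requests a GC (a hint only) and returns the cache size afterwards. */
    public static int sizeAfterGc(Map<Object, String> cache) {
        System.gc();
        return cache.size();
    }

    public static void main(String[] args) {
        Map<Object, String> cache = new WeakHashMap<Object, String>();

        // The key is referenced nowhere else, so this entry may vanish at any time.
        cache.put(new Object(), "eligible for removal right away");

        // This key stays strongly reachable through 'held', so its entry stays.
        Object held = new Object();
        cache.put(held, "kept while 'held' is reachable");

        System.out.println("entries after GC hint: " + sizeAfterGc(cache));
        System.out.println("held key still cached: " + cache.containsKey(held));
    }
}
```

A cache built this way only works if something else keeps the keys alive; if nothing does, every lookup can turn into a miss and the expensive work is repeated.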
I'm trying to profile my java app, just to find out the methods in which most time is being spent. Given the poor reactions here to TPTP, I thought I'd give Java VisualVM a go.
It all seemed rather simple to use - except that I can't seem to get anything consistent or useful out of it.
I can't seem to see anything relating to MY OWN code - all I get is a whole bunch of calls to things like java.* methods.
I've tried restricting instrumentation to only my own packages, which seems to cut down the number of methods instrumented, but still I don't ever seem to see my own.
Each time I run, I get varying numbers of methods instrumented, ranging from tens to thousands.
I've tried putting in a sleep at the start of my app, to make sure I get VisualVM up and running before my app starts to do anything interesting, to make sure it's profiling when the interesting stuff is running.
Is there something I have to do to ensure my classes get instrumented ?
Are there timing issues ? ..like, have to wait for classes to be loaded etc ?
I've also tried running the guts of the code twice, to make sure all the code does get exercised...
I'm just running an app, with a main, from Eclipse. I've tried using the Eclipse integration so that VisualVM starts up when I start the app - results are the same.
I've also tried exporting the app as a runnable app, and running it standalone from the command line, rather than through Eclipse - same result.
My app is not a long running web app etc - just a main that calls some other of my own classes to do some processing, then quits.
I'd be grateful for any advice about what I might be doing wrong ! :)
Thanks !
I too am struggling with VisualVM, which is a shame because its user interface is fantastic while its profiling output seems horrific. You can see my question here.
Java VisualVM giving bizarre results for CPU profiling - Has anyone else run into this?
I can tell you a couple of odd things that I have learned about VisualVM and the way it seems to do its profiling.
VisualVM appears to be counting the total time spent inside a method (wall-clock time). I have a thread in my application which starts a number of other threads and then immediately blocks waiting for a message on a queue. VisualVM will not register this method in the profiler until one of the other threads sends the message the first thread was waiting for (when the application terminates). Suddenly the blocking method call dominates the profiling output and is recorded as taking up more than 80% of the application time.
Other profilers, such as JProfiler and the one used by Azul, do not count a blocked thread as taking up time. This means that blocking methods, which probably aren't interesting (situation dependent) for performance profiling, are obscuring your view of the code that is eating your CPU time.
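The difference between the two ways of counting is easy to see with a thread that only waits. A minimal sketch comparing wall-clock time (System.nanoTime) with per-thread CPU time (ThreadMXBean.getCurrentThreadCpuTime); it assumes the JVM supports thread CPU time measurement, and WallVsCpu is a made-up name. It says nothing about how VisualVM measures internally.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Hypothetical demo: a blocked thread accumulates wall-clock time but almost no CPU time.
public class WallVsCpu {

    /** Returns {wallNanos, cpuNanos} for a block of code that only sleeps. */
    public static long[] measureBlocking(long sleepMillis) {
        // Assumption: thread CPU time is supported and enabled on this JVM.
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long wallStart = System.nanoTime();
        long cpuStart = mx.getCurrentThreadCpuTime();
        try {
            Thread.sleep(sleepMillis); // stands in for a blocking queue or socket read
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        long cpu = mx.getCurrentThreadCpuTime() - cpuStart;
        long wall = System.nanoTime() - wallStart;
        return new long[] { wall, cpu };
    }

    public static void main(String[] args) {
        long[] result = measureBlocking(500);
        System.out.println("wall-clock ms: " + result[0] / 1000000);
        System.out.println("cpu ms:        " + result[1] / 1000000);
    }
}
```

A profiler that ranks methods by the first number will put waiting methods at the top; one that ranks by the second will barely show them.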
When I am running my profiling I end up with
sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run()
obscuring my profiling right up until that message comes back to the waiting thread and then the top spot is shared between these two totally irrelevant methods, as well as various other uninteresting methods which don't appear on other profilers.
Secondly and I think quite importantly the method filtering mechanism doesn't work as I would have expected. This means that I can't filter out the I am trying to track down what the story is with this right now.
Not a really helpful answer. The solution as I see it right now is to pay for JProfiler - VisualVM just doesn't seem trustworthy for this task.
You could take a look at AppDynamics Lite. It has some nice features, such as business transaction discovery, which can sample all calls made to a specific method in your code.
The Lite version has a lot of limitations, such as a maximum of 10 minutes of sampling and a maximum of 30 discovered business transactions.
It would be nice to have a free tool that does the same.
I assume this isn't just an academic question - you would like to see if you could make the app run faster. I assume you also wouldn't mind a little "out of the box" thinking. There are many popular ideas about performance that are actually pretty fuzzy.
For example, you say you're looking for "methods in which most time is being spent". If by that you mean "self time" (program counter actually in the method) there is probably very little, unless you've got some intense loops. Methods generally spend time by calling other methods, sometimes doing I/O.
Another fuzzy idea is that measuring method time or counting the number of calls can tell you very much about where bottlenecks are. Bottlenecks are specific lines of code, not methods, so even if you know approximately where to look, you're still playing detective.
So those are a few of the fuzzy ideas. Here is a bunch more. Let me suggest how one should think about it, and how that leads to results.
When you eventually fix something, it will reduce execution time by some percent, like (pick a number) 30%, right? (Otherwise you didn't fix anything.) OK, during that 30% it was doing something, something that it didn't need to do because later you got rid of it. So, you don't need to measure. You do need to find out what it is doing in that time, so you know what to get rid of.
A very simple way is to "pause" it 10 (or some number of) times at random. Understand what it is doing and why, by looking at the call stack and possibly some of the data. On about 3 of those times you will see it doing something you could get rid of.
You will know approximately how much it will save by seeing what percent of samples is showing it. Approximate is good enough. You can easily see exactly how much time is saved by stopwatching it before and after.
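The pausing is normally done by hand in a debugger or with jstack, but the bookkeeping can be sketched in code. Assumptions: the class PauseSampler, its parameters and the busy-loop worker are all invented for this example, and it samples a thread inside the same JVM. It counts in how many samples each stack frame appears, so a frame seen in 3 of 10 samples is responsible for roughly 30% of the time.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Random;
import java.util.Set;

// Hypothetical helper: samples one thread's stack at random intervals.
public class PauseSampler {

    /** Takes `samples` stack snapshots of `target` and counts the snapshots containing each frame. */
    public static Map<String, Integer> sample(Thread target, int samples, int maxPauseMillis) {
        Map<String, Integer> hits = new HashMap<String, Integer>();
        Random random = new Random();
        for (int i = 0; i < samples; i++) {
            // A set, so a recursive frame is counted once per snapshot.
            Set<String> frames = new HashSet<String>();
            for (StackTraceElement f : target.getStackTrace()) {
                frames.add(f.getClassName() + "." + f.getMethodName() + ":" + f.getLineNumber());
            }
            for (String frame : frames) {
                Integer old = hits.get(frame);
                hits.put(frame, old == null ? 1 : old + 1);
            }
            try {
                Thread.sleep(1 + random.nextInt(maxPauseMillis)); // random gap between samples
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Thread worker = new Thread(new Runnable() {
            public void run() {
                long x = 0;
                for (long i = 0; i < 2000000000L; i++) {
                    x += i % 7; // stand-in for the work you want to speed up
                }
                System.out.println("worker result: " + x);
            }
        }, "worker");
        worker.setDaemon(true);
        worker.start();
        for (Map.Entry<String, Integer> e : sample(worker, 10, 50).entrySet()) {
            System.out.println(e.getValue() + "/10  " + e.getKey());
        }
    }
}
```

Counting by line rather than by method follows the argument above: the thing you end up removing is a specific call site, and the line number tells you which one.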
Then, don't stop. You've made the app faster. Do it again, and make it faster yet. Sooner or later you get to a point where you can't make it any faster, but it's probably in more than one step.
I've built an RCP-based application, and one of my users running on Windows XP, Sun JVM 1.6.0_12 had a full application crash. After the app was running for two days (and this is not a new version or anything), he got the nice gray JVM force exit box, with exit code=1073807364.
He was away from the machine at the time, and the only thing I can find near that time in the application logs was some communication with the database (SQL Server by way of Hibernate). There are no hs_ files or anything similar as far as I can tell. Web searching found a bunch of crash reports with that exit code in a variety of applications, but I didn't see any fundamental explanation of what causes it.
Can anyone tell me what causes it? Is there additional information likely to have been dumped that could prove useful?
From what I can tell, this error code (0x40010004) arises in all sorts of situations, with (as you noted) no obvious common thread.
However this page says "0x40010004" means "the task is running"! So, I would surmise that the correct way to interpret it is as saying "this task has exited in a way that prevented it setting a proper exit code".
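For anyone wondering how the decimal exit code in the question maps to the hex value here, it is a plain base conversion: 0x40010004 = 4 x 16^7 + 1 x 16^4 + 4 = 1073741824 + 65536 + 4 = 1073807364. A tiny sketch to check it (ExitCodeHex is a made-up name):

```java
// Hypothetical helper: formats a process exit code the way Windows status codes are usually written.
public class ExitCodeHex {

    /** Formats an exit code as an 8-digit hex value with a 0x prefix. */
    public static String toHex(long exitCode) {
        return String.format("0x%08X", exitCode);
    }

    public static void main(String[] args) {
        System.out.println(toHex(1073807364L));
    }
}
```

Searching for the hex form usually turns up more than searching for the decimal one, since that is how Windows documentation and headers write these values.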
I don't know if this will help, but I would try looking in the Windows Event logs to see if the problem is being reported there.