What are the most likely causes for application server failure?
For example: "out of disk space" is more likely than "2 of the drives in a RAID 4 setup die simultaneously".
My particular environment is Java, so Java-specific answers are welcome, but not required.
EDIT just to clarify, i'm looking for downtime-related crashes (out of memory is a good example) not just one-time issues (like a temporary network glitch).
If you are trying to keep an application server up, start monitoring it. Nagios, Big Sister, and other Network Monitoring tools can be very useful.
Watch memory availability / usage, disk availability / usage, cpu availability / usage, etc.
The most common reason why a server goes down is rarely the same reason twice. Someone "fixes" the last-most-common-reason, and a new-most-common-reason is born.
Edwin is right - you need monitoring to understand what the problem is. Or better - understand what the problem is AND prevent it from causing downtime.
You should not only track resource consumption but also demand. The difference between the two shows you if you have sized your server correctly.
There are a ton of open source tools like nagios, CollectD, etc. that can give you server specific data - that's only monitoring though, not prevention. Librato Silverline (disclosure: I work there) allows you to monitor individual processes and then throttle the resources they use by placing them in application containers for which you define resource polices.
If your server is 8 cores or less you can use it for free.
"Out of Memory" exception due to memory leaks.
All sorts of things can cause a server to crash, ranging from busted hardware (e.g. disk failures) to faulty code (memory leak resulting in an out of memory exception, network failure that got rethrown as a runtime exception and was never caught, in servers that aren't Java servers a SEGFAULT, etc.)
At first, it is usually because of memory leaks, disk space problems, endless loops causing cpu to eat up.
Once you monitor those issues and set up correct logging and warning mechanisms, they turn meta on you... and exploding error handling becomes a possible reason for a full lockup: an error (or more likely: two in an unhappy combination) occurs but when the handler is trying to write to the logfiles or send a warning (by mail or something) it gets another error which it is trying to handle by writing to the logfile or sending a warning or... and this continues until one of the resources gives out: it may lead to skyrocketing server load, memory problems, filling disk space, locking up network traffic which means it won't be accessible for a remote user to correct the problem, etc.
Related
We have a Java application, deployed in CloudFoundry, that occasionally throws an OOM error, due to requests that have a large payload as result.
When this happens, CloudFoundry kills the app and restarts it.
When the application is running on a development machine (rather than in CF), the OOM does not result in a crash (but does display an "out of heap memory" message in the output); usually the request-handler thread ends and the memory that was allocated for the request is garbage-collected. The application continues to run and successfully serves more requests.
Is there a way to configure CF to avoid restarting the app on OOM?
Thanks.
The short answer is no. The platform will always kill your app when you exceed the memory limit that you've assigned to it. This is the intended behavior. You cannot bypass this because that would essentially mean that your application has no memory limit.
On a side note, I would highly recommend using the Java buildpack v4.x (latest), if you are not already. It is much better about configuring the JVM such that you get meaningful errors like JVM OOME's instead of just letting your application crash. It also dumps helpful diagnostic information when this happens that will direct you to the source of the problem.
One other side note...
the OOM does not result in a crash (but does display an "out of heap memory" message in the output); usually the request-handler thread ends and the memory that was allocated for the request is garbage-collected.
You don't want to rely on this behavior. Once an OOME happens in the JVM, all bets are off. It may be able to recover and it may put your application into a horrible and unusable state. There's no way to know because there's no way to know exactly where the OOME will strike. When you get an OOME, the best course of action is to obtain any diagnostic information that you need and restart. This is exactly what the Java buildpack (v4+) does when your app runs on CF.
Hope that helps!
Its a vague question. So please feel free to ask for any specific data.
We have a tomcat server running with two web-service's. One tomcat built using spring. We use mysql for 90% of tasks and mongo for caching of jsons (10%). The other web-service is written using grails. Both the services are medium sized codebases (About 35k lines of code each)
The computation only happens when there is an HTTP request (No batch processing). With about 2000 database hits per request (I know its humongous. We are working on it). The request rate is about 30 req/min. For one particular request, there is Image processing which is quite memory expensive. No JNI anywhere
We have found a weird behavior. Last night, I can confirm that there was no request to the server for about 12 hours. But when I look at the memory consumption, it is very confusing:
Without any requests, the memory keeps jumping from 500Mb to 1.2Gb (700 Mb jump is worrysome). There is no computation on server side as mentioned. I am not sure if its a memory leak:
The memory usage comes down. (Things would have been way easier if the memory didnt come down).
This behavior is reproducable with caches based on SoftReference or so. With full gc's. But I am not using them anywhere (Not sure if something else is using it)
What else can be the reason. is it a cause to worry?
PS: We have had Our of Memory Crashes (Not errors but JVM crash) quite frequently very recently.
This is actually normal behavior. You're just seeing garbage collection occur.
Since last month, we got a problem on our company's server (Win2008ServerStd + IIS7 + CF enterprise 9.0.1 (hotfix2)).
I used jConsole to monitor the Coldfusion JVM (1.6.0_24) activity and here's what I see:
Notice that strange "curve" between 14:10 and 14:15 ! What is that?
Obviously it's not a standard behaviour, when it happens, my applications hang for 30 to 70 seconds!
Do you know what can cause that memory issue? It seems like GC does not run correctly, or hangs itself.
I don't expect a flash-answer, I wonder there can be a lot of root problems causing that but.... where can I start investigating?
Using cfstat, perfmon, fusionreactor, or cf perfomance monitor take a look at running requests and queued duing your problem. What you will likely see is running requests climbing past the setting of the simultaneous requests (in the cf admin). Then the requests will start to queue. Eventually the queue will clear out (if your server is recovering on it's own).
This sort of thing can be caused by a number of things. For example, if your DB server slows down or has an issue, if your network has a problem, or if network ports are resyncing, if your disks have I/O problems etc.
My guess is that you will drive yourself batty trying to figure this out by monitoring your heap. See if you can watch one of the monitors for some specific scripts that might be the culprit.
The other comment (about some indexing agents) is also a possibility. A flurry of indexing can definitely cause behavior. If that's the case, you might take a look at the simultaneous request settings. If it is set at the default you might have enough head room to increase it.
It could have been a spider creating lots and lots of sessions as it crawled the site which would eat up memory for a period of time. Once the spider stopped crawling those sessions would time out and be garbage collected.
I would compare your HTTP server logs w/ the JVM logs. Compare that time frame and see if there are a lot of requests from a search engine spider (Googlebot, msnbot, etc).
Fabio,
Same kind of issue I have couple of month ago where I was getting spike on regular interval and server eating up arround 50% of CPU usage. I wrote full story below URL
http://www.isummation.com/blog/strange-coldfusion-issue-jrun-eating-up-to-50-of-cpu/
which may help you (Sorry for so long).
I found that client variables storing in registry was causing issue and I am able to catch with help of VisualVM where I first find out thread causing issue and looking into trace of exactly find solution.
The only thing that's really odd IMO is the sudden spike to having so many threads. Capture a thread dump on a regular basis (jstack, etc.. are your friends) and then correlate those thread dumps to your monitoring where it shows the spike.
The root problem will become more obvious once you understand what all the extra threads are doing. Perhaps it's more threads handling transactions, but it might be something else entirely.
I have a J2EE java application which processes SOAP requests. In our production environment (HPUX,OC4J,Java 5) we have about 20 threads running for this process, and we sometimes see 1 thread pausing for ~15 seconds. Until now, I haven't succeeded replicating the problem in our preproduction environment, and I'm scared of breaking stuff and violating SLA's if I use jconsole and associated tools on our production server.
Who has any inspiration? I know about http://java.sun.com/j2se/1.5/pdf/jdk50_ts_guide.pdf but I miss the experience to dare using it straight in production (plus, the HPUX guys threw some of these tools out of the toolbox, replacing them with HPJMeter)
Also, although this suggests a GC problem to me, I don't yet know enough to prove or disprove this theory and I am open to other suggestions.
We connect jconsole (and other tools) straight to production regularly. There is no significant overhead for us, the instrumentation is already going on within the JVM so you'd just be connecting a remote process to read published values. I say go for it!
Either way, you really need to see what's going on on the box. Thread dumps might or do some internal instrumentation. By internal instrumentation, I mean recording key measures within the code and exposing those somehow. It's essentially what the JVM does (exposing them via JMX) but rolling your own gives you more specificity. For example, I'm frequently recording request/response or other critical path performance timings internally.
oh, and one more thing. You can setup your app to using an agent to provide even more information. Typically this would be to plug a profiler in (like jprofiler or yourkit) but this does usually add more overhead and isn't recommended for production.
It's also worth thinking about the cost of not getting the information you need out of the VM. For example, is the cost of not fixing the issue more or less than the cost of a small % drop of performance when monitoring?
More scientifically, this article has some comments. It's suggesting up to 7% overhead (contradicting my previous point), a previous article from 2006 suggests 3-4% but both are highly contextual results. For example, CPU intensive applications may or may not be affected more than IO bound ones.
So a more appropriate answer from me (rather than just "go for it") would be to understand the impact it would have for your application in your environment through measurement. Run representative tests on a similar environment to production with jconsole connected and disconnected and see for your self.
Also see this stackoverflow question.
There are a few things that you can do on HP-UX to get additional information from a running Java process. If you send the PROF signal to the JVM, it will toggle the generation of a GC log (as if you had used the -Xverbosegc command line option). Generating the GC log is very inexpensive, so you should be able to turn this on in production without affecting the performance.
If you send the USR2 signal to the JVM, it starts profiling (same as -Xeprof). If you send the signal a second time, it turns off the profiling. This will have a noticeable performance impact, though it is smaller that what you would see from an external, third party profiler.
You can analyze the resulting data files using HPjmeter. HPjmeter can also connect to a running JVM for real-time monitoring. With Java 5, you need to start the JVM with the -agentlib option. If you were using Java 6, you could attach to the running JVM without needing any extra command line options.
I have a Java program for doing a set of scientific calculations across multiple processors by breaking it into pieces and running each piece in a different thread. The problem is trivially partitionable so there's no contention or communication between the threads. The only common data they access are some shared static caches that don't need to have their access synchronized, and some data files on the hard drive. The threads are also continuously writing to the disk, but to separate files.
My problem is that sometimes when I run the program I get very good speed, and sometimes when I run the exact same thing it runs very slowly. If I see it running slowly and ctrl-C and restart it, it will usually start running fast again. It seems to set itself into either slow mode or fast mode early on in the run and never switches between modes.
I have hooked it up to jconsole and it doesn't seem to be a memory problem. When I have caught it running slowly, I've tried connecting a profiler to it but the profiler won't connect. I've tried running with -Xprof but the dumps between a slow run and fast run don't seem to be much different. I have tried using different garbage collectors and different sizings of the various parts of the memory space, also.
My machine is a mac pro with striped RAID partition. The cpu usage never drops off whether its running slowly or quickly, which you would expect if threads were spending too much time blocking on reads from the disk, so I don't think it could be a disk read problem.
My question is, what types of problems with my code could cause this? Or could this be an OS problem? I haven't been able to duplicate it on a windows a machine, but I don't have a windows machine with a similar RAID setup.
You might have thread that have gone into an endless loop.
Try connecting with VisualVM and use the Thread monitor.
https://visualvm.dev.java.net
You may have to connect before the problem occurs.
I second that you should be doing it with a profiler looking at the threads view - how many threads, what states are they in, etc. It might be an odd race condition happening every now and then. It could also be the case that instrumenting the classes with profiler hooks (which causes slowdown), sortes the race condition out and you will see no slowdown with the profiler attached :/
Please have a look at this post, or rather the answer, where there is Cache contention problem mentioned.
Are you spawning the same umber of threads each time? Is that number less or equal the number of threads available on your platform? That number could be checked or guestimated with a fair accuracy.
Please post any finidngs!
Do you have a tool to measure CPU temperature? The OS might be throttling the CPU to deal with temperature issues.
Is it possible that your program is being paged to disk sometimes? In this case, you will need to look at the memory usage of the operating system as whole, rather than just your program. I know from experience there is a huge difference in runtime performance when memory is being continually paged to the disk and back.
I don't know much about OSX, but in linux the "free" command is useful for this purpose.
Another issue that might cause this slowdown is log files? I've known at least some logging code that slowed down the system incrementally as the log files grew. It's possible that your threads are synchronizing on a log file which is growing in size, then when you restart your program, another log file is used.