How should I tackle "Program Crashing" issues?

I am working on a Java product. A client reports that the application crashes after an arbitrary amount of time. Since it is a crash, we can't find any information in our logs.
Are there any tools or methods for finding the cause of such issues?
Can we do anything on the code side to get more information about such crashes?
Can we enable a "DEBUG" mode for the JVM? If so, where can I find the JVM log files or crash dumps?
Are there any known procedures for dealing with this sort of issue?
If you ran into this problem, what would your troubleshooting procedure be?

I find it hard to believe there's no output from the JVM when it crashes. Start by taking a long, hard look at your run scripts and checking whether you are simply discarding output. If the JVM ends due to an unhandled exception, it prints the exception's stack trace to stderr by default. If it crashes hard (heap corruption etc.), it writes a short error report to the console as well. Your in-application logging is useful, but you should also be capturing everything that goes to stdout and stderr (you don't say which platform your app is running on, but this applies to basically all of them).
Aside from that, there's a whole host of non-standard options you can pass to define the location of error files and the like; see Java HotSpot VM Options.
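For the unhandled-exception case, one option is to route those exceptions into the application log so they survive even if the console output is thrown away. The following is only a minimal sketch: it assumes java.util.logging, and the logger name "crash" and the class name are made up for illustration. It does nothing for hard JVM crashes, which never reach Java code.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

class CrashLogging {
    // "crash" is a hypothetical logger name chosen for this sketch.
    static final Logger LOG = Logger.getLogger("crash");

    /** Sends exceptions that no thread caught to the application log. */
    static void install() {
        Thread.setDefaultUncaughtExceptionHandler((thread, error) ->
                LOG.log(Level.SEVERE, "Uncaught exception in thread " + thread.getName(), error));
    }

    public static void main(String[] args) throws InterruptedException {
        install();
        // Simulated failure in a worker thread; the handler above logs it.
        Thread worker = new Thread(() -> {
            throw new IllegalStateException("simulated failure");
        }, "worker-1");
        worker.start();
        worker.join();
    }
}
```

Calling install() as early as possible in main() means threads started later are covered too.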

I would raise your application logging to more verbose levels or tweak the JVM as suggested above. If you want more options, you can use JVisualVM to watch for anything odd (memory, threads, GC, JMX operations) and, as a last resort, I would search for hs_err_pid*.log files.
These files contain information about the state of the JVM at the moment of a hard crash (memory access violations and so on).
Here is an example:
#
# An unexpected error has been detected by HotSpot Virtual Machine:
#
# EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x6d741e3a, pid=1572, tid=1364
#
# Java VM: Java HotSpot(TM) Client VM (1.5.0_11-b03 mixed mode)
# Problematic frame:
# V [jvm.dll+0x1e3a]
#
--------------- T H R E A D ---------------
Current thread (0x00a85c78): VMThread [id=1364]
siginfo: ExceptionCode=0xc0000005, reading address 0x00000054
Registers:
EAX=0x00000050, EBX=0x00990000, ECX=0x0847b9f8, EDX=0x00000050
ESP=0x0ab0f660, EBP=0x0ab0f684, ESI=0x0847b9f8, EDI=0x0847b9f8
EIP=0x6d741e3a, EFLAGS=0x00010216

After a crash, you have no logs from the crash itself, but you still have all the logs written before it. That should give you a lot of information, if your logs are detailed enough.
In Java, you combine two things:
logging in the code can be very detailed, using levels (fatal, error, warning, info, debug)
logging can be configured in production to output only what is relevant (even as specific as a single class's logs at debug level, while the rest stays at error level), to keep decent performance and log files of acceptable size.
Using the power of logging, you should be able to narrow your focus little by little. Note that if your application logs too little, you should start adding more as soon as possible (at the appropriate logging levels, of course). Example process:
activate error level for all the application, see what you get
activate warning level for one module, see what you get
deactivate the previous, activate info level for one package, see what you get
deactivate the previous, activate debug level for one class, see what you get
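The last step of that process might look roughly like this. It is a sketch under assumptions: it uses java.util.logging, where SEVERE and FINE play approximately the roles of "error" and "debug", and the logger name com.example.billing.InvoiceJob is hypothetical. With log4j or logback the same thing is normally done in the configuration file instead.

```java
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.Logger;

class LogLevels {
    // Keep strong references: java.util.logging holds named loggers weakly, so an
    // unreferenced logger can be garbage-collected and lose its configured level.
    static final Logger ROOT = Logger.getLogger("");
    // Hypothetical class under suspicion.
    static final Logger SUSPECT = Logger.getLogger("com.example.billing.InvoiceJob");

    static void narrowDown() {
        ROOT.setLevel(Level.SEVERE);   // whole application: errors only
        SUSPECT.setLevel(Level.FINE);  // one class: debug detail
        // Handlers filter as well; let them pass whatever the loggers allow.
        for (Handler handler : ROOT.getHandlers()) {
            handler.setLevel(Level.ALL);
        }
    }

    public static void main(String[] args) {
        narrowDown();
        Logger other = Logger.getLogger("com.example.billing.Other");
        other.warning("dropped: below the application-wide level");
        SUSPECT.fine("kept: the suspect class logs at debug detail");
    }
}
```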

First, you should determine whether it is the JVM that crashes or your application itself. If the JVM crashes, the java process writes a crash dump to the file system, named something like hs_err_pidXXX.log. If you find one of these files in the directory where java was started, you should search for the error in the official bug database at Sun.
If your application crashes, you should extend your logging infrastructure (as KLE mentioned). Using a shutdown hook to print a message that the application is shutting down (normally) is also quite handy. See here for the API reference.
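A shutdown hook of the kind mentioned above could be as small as the following sketch (class name and message text are illustrative). Keep in mind that the JVM gives no guarantee that hooks run when it aborts, so a missing "shutting down" line in the log is itself a hint that the process died abnormally.

```java
import java.io.PrintStream;
import java.time.Instant;

class ShutdownNotice {
    /** Registers a hook that records an orderly JVM shutdown and returns it. */
    static Thread install(PrintStream out) {
        Thread hook = new Thread(() -> {
            out.println("JVM shutting down in an orderly way at " + Instant.now());
            out.flush();
        }, "shutdown-notice");
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }

    public static void main(String[] args) {
        install(System.err);
        System.out.println("working...");
        // Returning from main starts a normal shutdown, which runs the hook.
    }
}
```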

If this problem occurs only with that client, ask them if they run the application on more than one machine. If yes, does the problem occur on all of them?
If the problem occurs only on one machine, I'd suspect faulty hardware, most likely RAM. This can be diagnosed with a tool like memtest.
I've personally witnessed only two instances of recurring JVM crashes. In both cases, the problem was faulty RAM.

A few options that will help to diagnose memory issues:
The JVM option -XX:+HeapDumpOnOutOfMemoryError will create a heap dump if the VM exits due to memory exhaustion. You can analyse the dump using something like Eclipse MAT to determine the cause of the problem.
Also -verbose:gc will provide detailed garbage collection stats, and adding -Xloggc:<file> will redirect this to a file.
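Since options like these are usually buried in start scripts, it can be worth logging at startup which ones actually reached the JVM, especially on a client's machine. A minimal sketch, assuming you want exactly the two flags named above (the class name and the list of wanted flags are illustrative):

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class DiagnosticFlagCheck {
    // Flags this sketch expects; adjust to whatever your start script should pass.
    static final List<String> WANTED =
            Arrays.asList("-XX:+HeapDumpOnOutOfMemoryError", "-verbose:gc");

    /** Returns the wanted flags that are absent from the given JVM arguments. */
    static List<String> missing(List<String> jvmArgs) {
        List<String> result = new ArrayList<>();
        for (String flag : WANTED) {
            if (!jvmArgs.contains(flag)) {
                result.add(flag);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.err.println("JVM arguments: " + jvmArgs);
        for (String flag : missing(jvmArgs)) {
            System.err.println("Diagnostic flag not set: " + flag);
        }
    }
}
```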

If you're using JNI (or any libraries that use JNI), it's easy to crash the JVM so that it leaves no traces at all. As far as I know, the only way to debug this kind of problem is to step through the native code with a debugger.

In addition to all of the other suggestions, check your codebase for calls to System.exit().
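If grepping the codebase is not enough (for example because the call sits in a third-party jar), a shutdown hook can report which thread asked for the exit. This is a sketch that relies on an assumption about JDK behaviour: while hooks run, the thread that called System.exit() is still blocked inside java.lang.Runtime.exit, so its stack trace shows the caller. The class and method names are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class ExitTracer {
    /** Builds a report for every thread whose stack is inside Runtime.exit. */
    static List<String> exitCallers(Map<Thread, StackTraceElement[]> traces) {
        List<String> reports = new ArrayList<>();
        for (Map.Entry<Thread, StackTraceElement[]> entry : traces.entrySet()) {
            StackTraceElement[] stack = entry.getValue();
            boolean exiting = false;
            for (StackTraceElement frame : stack) {
                if ("java.lang.Runtime".equals(frame.getClassName())
                        && "exit".equals(frame.getMethodName())) {
                    exiting = true;
                    break;
                }
            }
            if (exiting) {
                StringBuilder report = new StringBuilder(
                        "exit requested by thread " + entry.getKey().getName());
                for (StackTraceElement frame : stack) {
                    report.append("\n    at ").append(frame);
                }
                reports.add(report.toString());
            }
        }
        return reports;
    }

    /** Registers a hook that prints the exit callers when the JVM shuts down. */
    static void install() {
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            for (String report : exitCallers(Thread.getAllStackTraces())) {
                System.err.println(report);
            }
        }, "exit-tracer"));
    }

    public static void main(String[] args) {
        install();
        System.exit(3); // stands in for an exit call buried somewhere in the codebase
    }
}
```

A native exit() or Runtime.halt() bypasses shutdown hooks, so this only helps with exits requested from Java code.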

Related

How does a Java process die?

A quick answer would be "without crying" of course :).
I have a really strange problem with my Java application (J2SE 1.7) on a 32-bit Windows 7 system. I have encountered all of these cases:
Sometimes it runs out of Java heap memory (so I can log it and recover from it).
Sometimes it crashes in native code and I get an hs_err_pidxxxx.log file, so I can analyze what is going on.
Sometimes it crashes in native code and I get no hs_err file, but I get a popup saying that Java has stopped working, and I can see the exception in the Windows event log and even debug part of the process with Visual Studio.
Sometimes it crashes and I get nothing (no hs_err, no popup, nothing...). It just ends as if there were a System.exit() or a native exit() call.
So my questions are:
How can I be sure this is a native exit call, given that I don't have all the code of the native libraries I am using?
Is it possible for this strange behaviour to be produced by other means?
Finally, how can I debug and track down which library is the root cause?

How can I be sure this is a native exit call as I don't have all the code of native libraries I am using?
The only way I know to be sure is to wrap each call to a native library with logging statements, so that you log before each call and after each return. If, after your program has crashed, the log has an enter message but no return message, that library call is suspect.
Is it possible to have this strange behaviour produced by another means?
Yes, there are any number of other strange means.
Using up memory or some other resource might be one explanation.
Finally, how to debug and track which lib can be the root cause?
The logging described above should find this too, if the messages are specific about which library is being called. You can monitor the application in jconsole to see whether it is using up tons of memory or threads. Disable anything that can be disabled so you can eliminate it as part of the problem. If the problem goes away, enable things one at a time until the problem returns.
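The enter/return logging could be sketched like this. The names are illustrative, and Math.sqrt merely stands in for a call into a real native library. The important detail is flushing before the call: if the process dies inside the native code, buffered output would be lost along with it.

```java
import java.io.PrintStream;
import java.util.function.Supplier;

class NativeCallTrace {
    private final PrintStream out;

    NativeCallTrace(PrintStream out) {
        this.out = out;
    }

    /** Logs before and after the call, so a missing RETURN line marks the suspect. */
    <T> T traced(String callName, Supplier<T> call) {
        out.println("ENTER " + callName);
        out.flush(); // flush first: a hard crash would take buffered output with it
        try {
            T result = call.get();
            out.println("RETURN " + callName);
            return result;
        } catch (RuntimeException | Error e) {
            out.println("THROW " + callName + ": " + e);
            throw e;
        } finally {
            out.flush();
        }
    }

    public static void main(String[] args) {
        NativeCallTrace trace = new NativeCallTrace(System.err);
        // Math.sqrt stands in for a call into a third-party native library.
        Double root = trace.traced("Math.sqrt", () -> Math.sqrt(16.0));
        System.out.println(root);
    }
}
```

In a real application the PrintStream would typically be a dedicated file rather than System.err, so the trace survives even when console output is discarded.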

How can I be sure this is a native exit call as I don't have all the code of native libraries I am using?
Debug it.
Is it possible to have this strange behaviour produced by another means?
Hard to tell... Could be threading, could be memory leak, ...
Finally how to debug and track which lib can be the root cause?
Run Java with
-XX:+CreateMinidumpOnCrash
and you'll get a crash dump that you can analyze. Or use
-XX:+UseOSErrorReporting
to let Windows handle the crash (which will, for example, offer to attach a debugger, depending on what you have installed; it might also show a "Send to Microsoft" error report).

How do I debug Segfaults occurring in the JVM when it runs my code?

My Java application has started to crash regularly with a SIGSEGV and a dump of stack data and a load of information in a text file.
I have debugged C programs in gdb and I have debugged Java code from my IDE. I'm not sure how to approach C-like crashes in a running Java program.
I'm assuming I'm not looking at a JVM bug here. Other Java programs run just fine, and the JVM from Sun is probably more stable than my code. However, I have no idea how I could even cause segfaults with Java code. There definitely is enough memory available, and when I last checked in the profiler, heap usage was around 50% with occasional spikes around 80%. Are there any startup parameters I could investigate? What is a good checklist when approaching a bug like this?
Though I'm not so far able to reliably reproduce the event, it does not seem to occur entirely at random either, so testing is not completely impossible.
ETA: Some of the gory details
(I'm looking for a general approach, since the actual problem might be very specific. Still, there's some info I already collected and that may be of some value.)
A while ago, I had similar-looking trouble after upgrading my CI server (see here for more details), but that fix (setting -XX:MaxPermSize) did not help this time.
Further investigation revealed that in the crash log files the thread marked as "current thread" is never one of mine, but either one called "VMThread" or one called "GCTaskThread". If it's the latter, it is additionally marked with the comment "(exited)"; if it's the former, the GCTaskThread is not in the list. This makes me suppose that the problem might be around the end of a GC operation.

I'm assuming I'm not looking at a JVM bug here. Other Java programs run just fine, and the JVM from Sun is probably more stable than my code.
I don't think you should make that assumption. Without using JNI, you should not be able to write Java code that causes a SIGSEGV (although we know it happens). My point is, when it happens, it is either a bug in the JVM (not unheard of) or a bug in some JNI code. If you don't have any JNI in your own code, that doesn't mean that you aren't using some library that is, so look for that. When I have seen this kind of problem before, it was in an image manipulation library. If the culprit isn't in your own JNI code, you probably won't be able to 'fix' the bug, but you may still be able to work around it.
First, you should get an alternate JVM on the same platform and try to reproduce it. You can try one of these alternatives.
If you cannot reproduce it, it likely is a JVM bug. From that, you can either mandate a particular JVM or search the bug database, using what you know about how to reproduce it, and maybe get suggested workarounds. (Even if you can reproduce it, many JVM implementations are just tweaks on Oracle's Hotspot implementation, so it might still be a JVM bug.)
If you can reproduce it with an alternative JVM, the fault might be that you have some JNI bug. Look at what libraries you are using and what native calls they might be making. Sometimes there are alternative "pure Java" configurations or jar files for the same library or alternative libraries that do almost the same thing.
Good luck!

The following will almost certainly be useless unless you have native code. However, here goes.
Start java program in java debugger, with breakpoint well before possible sigsegv.
Use the ps command to obtain the processid of java.
gdb /usr/lib/jvm/sun-java6/bin/java processid
make sure that the gdb 'handle' command is set to stop on SIGSEGV
continue in the java debugger from the breakpoint.
wait for explosion.
Use gdb to investigate
If you've really managed to make the JVM take a sigsegv without any native code of your own, you are very unlikely to make any sense of what you will see next, and the best you can do is push a test case onto a bug report.

I found a good list at http://www.oracle.com/technetwork/java/javase/crashes-137240.html. As I'm getting the crashes during GC, I'll try switching between garbage collectors.
I tried switching between the serial and the parallel GC (the latter being the default on a 64-bit Linux server), this only changed the error message accordingly.
Reducing the max heap size from 16G to 10G after a fresh analysis in the profiler (which gave me a heap usage flattening out at 8G) did lead to a significantly lower "Virtual Memory" footprint (16G instead of 60), but I don't even know what that means, and The Internet says, it doesn't matter.
Currently, the JVM is running in client mode (using the -client startup option thus overriding the default of -server). So far, there's no crash, but the performance impact seems rather large.

If you have a corefile you could try running jstack on it, which would give you something a little more comprehensible - see http://download.oracle.com/javase/6/docs/technotes/tools/share/jstack.html, although if it's a bug in the gc thread it may not be all that helpful.

Check whether a crash in C code is what caused the Java crash. Use valgrind to find invalid memory accesses, and also cross-check the stack size.

Can I auto restart tomcat jvm on out of memory exception

I know that this is not "best practice", but I would like to know whether I can automatically restart Tomcat if my deployed app throws an OutOfMemoryError.

You can try to use the OnOutOfMemoryError JVM option
-XX:OnOutOfMemoryError="/yourscripts/tomcat-restart"
It is also possible to generate the heap dump for later analysis:
-XX:+HeapDumpOnOutOfMemoryError
Be careful with combining these two options. If you force killing the process in "tomcat-restart" the heap dump might not be complete.

I know this isn't what you asked, but have you tried looking through a heap dump to see where you may be leaking memory?
Some very useful tools for tracking down memory leaks:
jdk/bin/jmap -histo:live pid
This will give you a histogram of all live objects currently in the JVM. Look for any odd object counts. You'll have to know your application pretty well to be able to determine what object counts are odd.
jdk/bin/jmap -dump:live,file=heap.hprof pid
This will dump the entire heap of the JVM identified by pid. You can then use the great Eclipse Memory Analyzer to inspect it and find out who is holding on to references to your objects. Your two biggest friends in Eclipse Memory Analyzer are the histogram and right click -> references -> exclude weak/soft references, which shows what is referencing your object.
jconsole is of course another good tool.

Not easily, and definitely not through the JVM that just suffered the out-of-memory error. Your best bet would be some combination of the Tomcat status monitor coupled with cron scripts or similar scheduled system administration scripts: something that checks the status of the server and automatically stops and restarts the service if it has failed.
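The status check such a scheduled script performs could be as simple as the probe below, run as a separate process outside the Tomcat JVM. This is only a sketch: it assumes Tomcat's HTTP connector listens on localhost port 8080, and the class name is made up. An open port does not prove the application is healthy, so a real check would probably request an actual page as well.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

class PortProbe {
    /** Returns true if something accepts TCP connections on host:port within the timeout. */
    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed connector address; adjust to the server being watched.
        boolean up = isListening("localhost", 8080, 2000);
        System.out.println(up ? "UP" : "DOWN");
        // A non-zero exit code lets cron or a wrapper script trigger the restart.
        System.exit(up ? 0 : 1);
    }
}
```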

Unfortunately, when you kill the java process, your script will keep a reference to the Tomcat ports 8080, 8005 and 8009, and you will not be able to start Tomcat again from the same script. The only way that works for me is:
-XX:OnOutOfMemoryError="kill -9 %p", plus a separate cron job, monit, or something similar to make sure Tomcat is running again.
%p is the JVM's pid, which the JVM fills in for you.

Generally, no. The VM is in a bad state and cannot be completely trusted.
Typically, one can use a configurable wrapper process that starts and stops the "real" server VM you want. An example I've worked with is "Java Service Wrapper" from Tanuki Software http://wrapper.tanukisoftware.com/doc/english/download.jsp
I know there are others.
To guard against OOMs in the first place, there are ways to instrument modern VMs via interface beans to query the status of the heap and other memory structures. These can be used to, say, warn in a log or an email if some app specific operations are pushing some established limits.
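The "interface beans" are the java.lang.management MXBeans. A minimal sketch of a heap check that could run periodically and write a warning to the log is shown below. The 85% threshold and the class name are arbitrary assumptions, and a single heap-wide number is a crude measure, since usage naturally rises between garbage collections.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

class HeapWatch {
    /** Fraction of the heap in use, relative to max (or to committed when max is undefined). */
    static double usedFraction(MemoryUsage usage) {
        long limit = usage.getMax() > 0 ? usage.getMax() : usage.getCommitted();
        return limit > 0 ? (double) usage.getUsed() / limit : 0.0;
    }

    static boolean shouldWarn(MemoryUsage usage, double threshold) {
        return usedFraction(usage) >= threshold;
    }

    public static void main(String[] args) {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        double threshold = 0.85; // hypothetical limit for this sketch
        if (shouldWarn(heap, threshold)) {
            System.err.printf("WARNING: heap is %.0f%% full%n", usedFraction(heap) * 100);
        } else {
            System.out.printf("heap is %.0f%% full%n", usedFraction(heap) * 100);
        }
    }
}
```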

I use
-XX:OnOutOfMemoryError='pkill java;/usr/local/tomcat/bin/start.sh'

What about something like this? -XX:OnOutOfMemoryError="exec \`ps --no-heading -p $$ -o cmd\`"

Why does the Java VM update 25 crash with internal error

Since Java update 25, the VM occasionally crashes with an internal error. With earlier versions (before update 25) it was working fine. According to the release notes, the HotSpot compiler was modified in update 25. Does it produce defective code that causes the crash? It does not crash if the JIT compiler is turned off with -Xint. I filed a bug here http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7054478 .
How do I interpret the log file to find out where the crash occurs? I am not able to identify the lines in my Java code where it happens.

I searched the Bug Database for the string "Unexpected result from topLevelExceptionFilter", and there were three other hits. They all seem to be about unexpected exceptions in native code propagating back across the JNI boundary.
Is that clue relevant to your application?

The reason is that an internal assertion failed - the JVM was not in the state it expected to be. This is a good thing because it avoids propagating errors, but a bad thing because it doesn't tell you how to get around it.
If simple tricks like -client or -server don't help, then consider a different JVM.
IBM has a Windows JVM, but it is a bit tricky to get. The easiest for now would be a development package http://www.ibm.com/developerworks/java/jdk/eclipse/index.html
Oracle also has JRockit. http://www.oracle.com/technetwork/middleware/jrockit/index.html
This will allow you to keep working while Oracle has a look at your bug. It is low priority though, so it may take a while.

What can I do if a Java VM crashes repeatedly?

What is the best practice for resolving a Java VM crash if the following conditions are true:
No own or third-party native code; 100% pure Java.
The same program runs on many other systems without any problems.
PS: By VM crash I mean that the VM writes a dump file like hs_err_pid1234.log and terminates.

Read the hs_err_pid1234.log file (or whatever the error log file name is). There are usually clues in there. The next step depends on what you discover in the log.
Yes, it could be a bug in the specific version of the JVM implementation you are using, but I have also seen problems caused by memory fragmentation in the operating system. Windows, for example, is prone to pinning DLLs at inappropriate locations and, as a result, failing to allocate a contiguous block of memory when the JVM asks for it. Other out-of-memory problems can also manifest themselves through crash dumps of this type.

Update or replace your JVM. If you currently have the newest version, then try an older one; if you don't have the latest version, try updating to it. Maybe it's a known issue in your particular version?

Assuming the JVM version across machines is the same:
Figure out what is different about the machine where the JVM is crashing. Same OS and OS version? We have problems with JVMs crashing on a particular version of Red Hat for example. And we have also found some older Red Hat versions unable to cope with extra memory properly, resulting in running out of swap space. (Our solution was to upgrade RedHat).
Also, is the program doing exactly the same thing across machines? Is it accessing a shared filesystem? Is the file system mounted similarly on your machines (SMB/NFS etc)? Something must be different.
The log file should give you some idea of where the crash occurred (malloc for example).

Take a look at the stacktraces in the dump file, as it should tell you what was going on when the crash occurred.
As well as digging into the hs_err dump file, I'd also submit it to Sun or whoever made your JVM (I believe there are instructions on how to do so at the top of the file?). It can't hurt.

32-bit or 64-bit? How much RAM is in the client machine? Which processor? Which OS? See if there is any connection between the affected systems; a connection may lead to a clue. If all else fails, consider using different major/minor versions of the JVM. Also, if the problem has only just started, can you go back (via version control) to a point where the program didn't crash? Look through the hs_err log; you may get an idea of what caused the crash. It could be a version of some other client library the JVM uses. Lastly, run the program under a debugger or profiler, and maybe you'll see some symptoms before the crash (assuming you can reproduce it).
