How should I diagnose and prevent JVM crashes?

How should I diagnose and prevent JVM crashes? - java

What should I (as a Java programmer who doesn't know anything about JVM internals) do when I come across a JVM crash?
In particular, how would you produce a reproducible test case? What should I be searching for in Sun's (or IBM's) bug database? What information can I get from the log files produced (e.g. hs_err_pidXYZ.log)?

If the crashes occur only one one specific machine, run memtest. I've seen recurring JVM crashes only two times, and in both cases the culprit turned out to be a hardware problem, namely faulty RAM.

In my experience they are nearly always caused by native code using JNI, either mine or someone else's. If you can, try re-running without the native code to see if you can reproduce it.
Sometimes it is worth trying with the JIT compiler turned off, if your bug is easily reproducible.
As others have pointed out, faulty hardware can also cause this, I've seen it for both Memory and video cards (when the crash was in swing code). Try running whatever hardware diagnostics are most appropriate for your system.
As JVM crashes are rare I'd report them to Sun. This can be done at their bug database. Use category Java SE, Subcategory jvm_exact or jit.
Under Unix/Linux you might get a Core dump. Under windows the JVM will usually tell you where it has stored a log of what has happened. These files often given some hint, but will vary from JVM to JVM. Sun gives full details of these files on their website. or IBM the files can be analysed using the Java Core Analyzer and Java heapdump Analyzer from IBM's alphaworks.
Unfortunately Java debuggers in my experience tend to hurt more than help. However, attaching an OS specific debugger (eg Visual Studio) can help if you are familiar with reading C stack traces.
Trying to get a reproducible test case is hard. If you have a large amount of code that always (or nearly always) crashes it is easier, just slowly remove parts while it keeps crashing, getting the result as small as possible. If you have no reproducible test code at all then it is very difficult. I'd suggest getting hints from my numbered selection above.

Sun documents the details of the crash log here. There is also a nice tutorial written up here, if you want to get into the dirty details (it sounds like you don't, though)
However, as a commenter mentioned, a JVM crash is a pretty rare and serious event, and it might be worthwhile to call Sun or IBM professional support in this situation.

When an iBM JVM crashes, it might have written to the file /tmp/dump_locations in there it lists any heapdump or javacore files it has written.
These files can be analysed using the Java Core Analyzer and Java heapdump Analyzer from IBM's alphaworks.

There's a great page on the Oracle website to troubleshoot these types of problems.
Check out the relevant sections for:
Hung Processes (eg. jstack utility)
Post Mortem diagnostics

Related

Debugging the "safepoint" error - need theoretical OR practical to debugging JVM crashes?

We have an amazingly elusive jvm crash occuring on an Ubuntu server that runs on AWS.
Our JVM crashes while crawling a few web pages.
The crash occurs at line 308 of the "safepoint" cpp module. At the stage where a gauranteeArmed==0 statement occurs.
Our sysadmin has advised that , at the time of crashing, there are a massive amount of threads created by the JVM.
We have not reproduced this bug in other Linux or OSX boxes.
We use the Ning library to crawl a few
Web pages.
Related Posts
How do I investigate the cause of a JVM crash?
JBoss / HotSpot JVM crashing
In each of these posts a "safepoint" related crash which comes from "nowhere" was observed. Most interestingly, the first above post actually exhibits a JVM crash during network related events.
The cryptic nature of this bug leads me to believe that there is a bug related to thread creation and scheduling which is specific to our current version of Ubuntu with respect to the way java invokes some of its concurrency features, or some underlying library incompatibility that is highly idiosyncratic to our particular situation.
My Question(s)
My main question here is - what is the best method for debugging a JVM stack trace involving these "safepoints", and where can I get started learning about dealing with such errors ? There have been other questions along this line, but I have not seen a generic answer .
Secondary, any insight into aws, java, networking, and how Ubuntu might behave differently in the cloud would be useful here.

Try using the very latest JVM (6u32 or 7u4) and see if it's still reproducible. If you are on an older version, there's at least a decent chance it's already been fixed in the latest.

How do I debug Segfaults occurring in the JVM when it runs my code?

My Java application has started to crash regularly with a SIGSEGV and a dump of stack data and a load of information in a text file.
I have debugged C programs in gdb and I have debugged Java code from my IDE. I'm not sure how to approach C-like crashes in a running Java program.
I'm assuming I'm not looking at a JVM bug here. Other Java programs run just fine, and the JVM from Sun is probably more stable than my code. However, I have no idea how I could even cause segfaults with Java code. There definitely is enough memory available, and when I last checked in the profiler, heap usage was around 50% with occasional spikes around 80%. Are there any startup parameters I could investigate? What is a good checklist when approaching a bug like this?
Though I'm not so far able to reliably reproduce the event, it does not seem to occur entirely at random either, so testing is not completely impossible.
ETA: Some of the gory details
(I'm looking for a general approach, since the actual problem might be very specific. Still, there's some info I already collected and that may be of some value.)
A while ago, I had similar-looking trouble after upgrading my CI server (see here for more details), but that fix (setting -XX:MaxPermSize) did not help this time.
Further investigation revealed that in the crash log files the thread marked as "current thread" is never one of mine, but either one called "VMThread" or one called "GCTaskThread"- I f it's the latter, it is additionally marked with the comment "(exited)", if it's the former, the GCTaskThread is not in the list. This makes me suppose that the problem might be around the end of a GC operation.

I'm assuming I'm not looking at a JVM bug here. Other Java programs
run just fine, and the JVM from Sun is probably more stable than my
code.
I don't think you should make that assumption. Without using JNI, you should not be able to write Java code that causes a SIGSEGV (although we know it happens). My point is, when it happens, it is either a bug in the JVM (not unheard of) or a bug in some JNI code. If you don't have any JNI in your own code, that doesn't mean that you aren't using some library that is, so look for that. When I have seen this kind of problem before, it was in an image manipulation library. If the culprit isn't in your own JNI code, you probably won't be able to 'fix' the bug, but you may still be able to work around it.
First, you should get an alternate JVM on the same platform and try to reproduce it. You can try one of these alternatives.
If you cannot reproduce it, it likely is a JVM bug. From that, you can either mandate a particular JVM or search the bug database, using what you know about how to reproduce it, and maybe get suggested workarounds. (Even if you can reproduce it, many JVM implementations are just tweaks on Oracle's Hotspot implementation, so it might still be a JVM bug.)
If you can reproduce it with an alternative JVM, the fault might be that you have some JNI bug. Look at what libraries you are using and what native calls they might be making. Sometimes there are alternative "pure Java" configurations or jar files for the same library or alternative libraries that do almost the same thing.
Good luck!

The following will almost certainly be useless unless you have native code. However, here goes.
Start java program in java debugger, with breakpoint well before possible sigsegv.
Use the ps command to obtain the processid of java.
gdb /usr/lib/jvm/sun-java6/bin/java processid
make sure that the gdb 'handle' command is set to stop on SIGSEGV
continue in the java debugger from the breakpoint.
wait for explosion.
Use gdb to investigate
If you've really managed to make the JVM take a sigsegv without any native code of your own, you are very unlikely to make any sense of what you will see next, and the best you can do is push a test case onto a bug report.

I found a good list at http://www.oracle.com/technetwork/java/javase/crashes-137240.html. As I'm getting the crashes during GC, I'll try switching between garbage collectors.
I tried switching between the serial and the parallel GC (the latter being the default on a 64-bit Linux server), this only changed the error message accordingly.
Reducing the max heap size from 16G to 10G after a fresh analysis in the profiler (which gave me a heap usage flattening out at 8G) did lead to a significantly lower "Virtual Memory" footprint (16G instead of 60), but I don't even know what that means, and The Internet says, it doesn't matter.
Currently, the JVM is running in client mode (using the -client startup option thus overriding the default of -server). So far, there's no crash, but the performance impact seems rather large.

If you have a corefile you could try running jstack on it, which would give you something a little more comprehensible - see http://download.oracle.com/javase/6/docs/technotes/tools/share/jstack.html, although if it's a bug in the gc thread it may not be all that helpful.

Try to check whether c program carsh which have caused java crash.use valgrind to know invalid and also cross check stack size.

HPjmeter-like graphical tool to view -agentlib:hprof profiling output

What tools are available to view the output of the built-in JVM profiler? For example, I'm starting my JVM with:
-agentlib:hprof=cpu=times,thread=y,cutoff=0,format=a,file=someFile.hprof.txt
This generates output in the hprof ("JAVA PROFILE 1.0.1") format.
I have had success in the past using HPjmeter to view these output files in a reasonable way. However, for whatever reason the files that are generated using the current version of the Sun JVM fail to load in the current version of HPjmeter:
java.lang.NullPointerException
at com.hp.jmeter.f.jb.a(Unknown Source)
at com.hp.jmeter.f.a.a(Unknown Source)
at com.hp.c.a.j.z.run(Unknown Source)
Exception in thread "HPeprofDataFileReaderThread" java.lang.AssertionError: null pointer exception from loader
at com.hp.jmeter.f.a.a(Unknown Source)
at com.hp.c.a.j.z.run(Unknown Source)
(Why would they obfuscate the bytecode for a free product?!)
Two questions arise from this:
Does anyone know the cause of this HPjmeter error? (EDIT: Yes--see below)
What other tools exist to read hprof files? And why are there none from Sun (are there)?
I know the Eclipse TPTP and other tools can monitor JVMTI data on the fly, but I need a solution that can process the generated hprof files after the fact since the deployed machine only has a JRE (not a JDK) intalled.
EDIT: A very helpful HPjmeter developer replied to my question on an HP ITRC forum indicating that heap=dump needs to be included in the -agentlib options temporarily until a bug in HPjmeter is fixed. This information makes HPjmeter viable again, but I will still leave the question open to see if anyone knows of any other tools.
EDIT: As of version 4.0.00 of HPjmeter (available 05/2009) this bug has been fixed.

Your Kit Java Profiler is able to read hprof snapshots (I am not sure if only for memory profiling or for CPU as well). It is not free but is by far the best java profiler I ever used. It presents the results in a clear, intuitive way and performs well on large data sets. The documentation is also pretty good.

For viewing and analyzing the output of hprof=samples or hprof=cpu I have used PerfAnal with good results. The GUI is a bit spartan, but very useful.
PerfAnal is a free download (GPL, originally an example project in the book Java Programming on Linux).
See this article:
http://www.oracle.com/technetwork/articles/javase/perfanal-137231.html
for more information and the download.
Normally you can just run
java -jar PerfAnal.jar hprof.java.txt
You may need to fiddle with -Xmx for large hprof files.

I am not 100% sure it'll work (it sounds like it will) and I am not sure it'll show it in the format you want... but have you thought about the VisualVM?
I believe it'll open up the resulting file.

I have been using Eclipse Memory Analyzer for analyzing different performance problems successfully. First of all, install the tool as described in the project webpage in Eclipse.
After that, you can create a dump file knowing the pid of the jvm to be analyzed
jmap -dump:format=b,file=<filename>.hprof <jvm_pid>
Then just import the .hprof file in eclipse. It has some automatic reports that try (for me they usually do not work) to point out which could be the possible problems.
Edit:
Answering the comment: You are right, it is more like a leak finder for Java. For performance problems, I have played with JRat for small projects. It shows time comsumed per method, number of times a method is called, hierarchy of calls, etc. The only problem is that as far as I know, it does not support .hprof files. To use it, yo need to execute your program adding a VM argument
-javaagent:<path>/shiftone-jrat.jar
This will generate a directory with the profile captured by the tool. Then, execute
java -jar shiftone-jrat.jar
And open the trace. Even been a simple tool, I think it could be useful.

Possible causes of Java VM EXCEPTION_ACCESS_VIOLATION?

When a Java VM crashes with an EXCEPTION_ACCESS_VIOLATION and produces an hs_err_pidXXX.log file, what does that indicate? The error itself is basically a null pointer exception. Is it always caused by a bug in the JVM, or are there other causes like malfunctioning hardware or software conflicts?
Edit: there is a native component, this is an SWT application on win32.

Most of the times this is a bug in the VM.
But it can be caused by any native code (e.g. JNI calls).
The hs_err_pidXXX.log file should contain some information about where the problem happened.
You can also check the "Heap" section inside the file. Many of the VM bugs are caused by the garbage collection (expecially in older VMs). This section should show you if the garbage was running at the time of the crash. Also this section shows, if some sections of the heap are filled (the percentage numbers).
The VM is also much more likely to crash in a low memory situation than otherwise.

Answer found!
I had the same error and noticed that others who provided the contents of the pid log file were running 64 bit Windows. Just like me. At the end log file, it included the PATH statement. There I could see C:\Windows\SysWOW64 was incorrectly listed ahead of: %SystemRoot%\system32. Once I corrected it, the exception disappeared.

First thing you should do is upgrade your JVM to the latest you can.
Can you repeat the issue? Or does it seem to happen randomly? We recently had a problem where our JVM was crashing all over the place, at random times. Turns out it was a hardware problem. We put the drives in a new server and it completely went away.
Bottom line, the JVM should never crash, as the poster above mentioned if your not doing any JNI then my gut is that you have a hardware problem.

The cause of the problem will be documented in the hs_err* file, if you know what to look for. Take a look, and if it still isn't clear, consider posting the first 5 or 10 lines of the stack trace and other pertinent info (don't post the whole thing, there's tons of info in there that won't help - but you have to figure out which 1% is important :-) )

Are you using a Browser widget and executing javascript in the Browser widget? If so, then there are bugs in some versions of SWT that causes the JVM to crash in native code, in various Windows libraries.
Two examples (that I opened) are bug 217306 and bug 127960. These two bug reports are not the only bug reports of the JVM crashing in SWT, however.
If you aren't using the Browser widget then these suggestions won't help you. In that case, you can search for a list of SWT bugs causing a JVM crash. If none of those are your issue, then I highly recommend that you open a bug report with SWT.

I have the same problem with a JNLP application that I have been using for a long time and is pretty reliable. The problem started immediately after I upgraded from Windows 7 to Windows 10. According to my investigation, it is most likely a bug in Win 10.
The following is not a solution, but an ugly workaround. In jre/bin directory, there is javaws.exe. If I right-clicked /Properties/Compatibility and ticked Run this program as an administrator, the JNLP app started to work.
Please, be aware that this approach could cause security issues and use it only if you have no other option and 100% know what you are doing.

What can I do if a Java VM crashes repeatedly?

What is the best practice to solve a Java VM crash if the follow conditions are true:
No own or third party native code. 100% pure java
The same program run on many other system without any problems.
PS: With VM crash I means that the VM write a dump file like hs_err_pid1234.log and terminate.

Read the hs_err_pid1234.log file (or whatever the error log file name is). There are usually clues in there. The next step depends on what you discover in the log.
Yes, it could be a bug in the specific version of the JVM implementation you are using, but I have also seen problems caused by memory fragmentation in the operating system. Windows, for example, is prone to pin dlls at inappropriate locations, and fail to allocate a contiguous block of memory when the JVM asks for it as a result. Other out opf memory problems can also manifest themselves through crash dumps of this type.

Update or replace your JVM. If you currently have the newest version, then try an older one, or if you don't have the latest version, try updating to it. Maybe its a known issue in your particular version?

Assuming the JVM version across machines is the same:
Figure out what is different about the machine where the JVM is crashing. Same OS and OS version? We have problems with JVMs crashing on a particular version of Red Hat for example. And we have also found some older Red Hat versions unable to cope with extra memory properly, resulting in running out of swap space. (Our solution was to upgrade RedHat).
Also, is the program doing exactly the same thing across machines? Is it accessing a shared filesystem? Is the file system mounted similarly on your machines (SMB/NFS etc)? Something must be different.
The log file should give you some idea of where the crash occurred (malloc for example).

Take a look at the stacktraces in the dump file, as it should tell you what was going on when the crash occurred.
As well as digging into the hs_err dump file, I'd also submit it to Sun or whomever made your JVM (I believe there are instructions in how to do so at the top of the file?). It can't hurt.

32bit? 64bit? Amount of ram in client machine? processor? os? See if there is any connection between the systems. A connection may lead to a clue. If all else fails, consider using different major/minor versions of the JVM. Also, if the problem JUST started can you get to a time (via version control) where the program didn't crash? Look through the hs_err log, you may get an idea of what caused the crash. It could be a version of some other client library the JVM uses. Lastly, run the program in debug/profile and maybe you'll see some symptons before the crash (assuming you can duplicate it)

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.