JVM crash on hadoop reducer - java

I am running Java code on Hadoop, but I encounter this error:
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007f2ffe7e1904, pid=31718, tid=139843231057664
#
# JRE version: Java(TM) SE Runtime Environment (8.0_72-b15) (build 1.8.0_72-b15)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.72-b15 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V [libjvm.so+0x813904] PhaseIdealLoop::build_loop_late_post(Node*)+0x144
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /hadoop/nm-local-dir/usercache/ihradmin/appcache/application_1479451766852_3736/container_1479451766852_3736_01_000144/hs_err_pid31718.log
#
# Compiler replay data is saved as:
# /hadoop/nm-local-dir/usercache/ihradmin/appcache/application_1479451766852_3736/container_1479451766852_3736_01_000144/replay_pid31718.log
#
# If you would like to submit a bug report, please visit:
# http://bugreport.java.com/bugreport/crash.jsp
When I go to the node manager, all the logs have been aggregated since yarn.log-aggregation-enable is true, and hs_err_pid31718.log and replay_pid31718.log cannot be found.
Typically: 1) the JVM crashes after the reducer has been running for several minutes, 2) sometimes the automatic retry of the reducer succeeds, 3) some reducers succeed without any failure.
Hadoop version is 2.6.0, Java is Java 8. This is not a new environment; we have lots of jobs running on the cluster.
My questions:
Can I find hs_err_pid31718.log anywhere after YARN aggregates the logs and removes the folder? Or is there a setting to keep all the local logs so I can check hs_err_pid31718.log while YARN still aggregates the logs?
What are the common steps to narrow down the scope of the investigation? Since the JVM crashed, I cannot see any exception in the code. I have tried the arguments -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp, but no heap dump is produced on the host where the reduce tasks fail.
Thanks for any suggestion.

Answers
Use -XX:ErrorFile=<your preferred location>/hs_err_pid%p.log (the JVM expands %p to the process id) to write the hs_err file to a location of your choice.
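For a MapReduce job the flag has to reach the reducer's child JVM, not just the client. A minimal sketch of one way to do that from the job driver, assuming a standard Hadoop 2.x driver; the heap size and the /var/tmp path are placeholders you would adapt:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class DriverSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Append the error-file flag to the reducer JVM options. Keep whatever
        // heap settings you already use; -Xmx2048m here is only a placeholder.
        // /var/tmp is chosen because the container working directory is cleaned
        // up after the task, taking hs_err_pid*.log with it.
        conf.set("mapreduce.reduce.java.opts",
                "-Xmx2048m -XX:ErrorFile=/var/tmp/hs_err_pid%p.log");
        Job job = Job.getInstance(conf, "reduce-crash-repro");
        // ... set mapper, reducer, input and output paths as usual, then:
        // System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The same value can also be passed on the command line with -Dmapreduce.reduce.java.opts=... if the driver uses ToolRunner.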
The crash is due to JDK bug JDK-6675699. This has already been fixed in JDK 9, and backports are available in JDK 8 update 74 onwards.
You are using JDK 8 update 72.
Kindly upgrade to the latest version to avoid this crash.
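If the JDK cannot be upgraded right away, a commonly used stopgap for crashes inside the C2 compiler (the Oracle troubleshooting guide suggests this for crashes in compiled or compiler code) is to exclude the method being compiled, which the replay/hs_err log names, from JIT compilation. The class and method below are placeholders only:

-XX:CompileCommand=exclude,com/example/MyReducer.reduce

This flag can be appended to mapreduce.reduce.java.opts the same way as -XX:ErrorFile above; it costs some performance in that one method but avoids the crashing compilation.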

Related

Set ulimit when debugging with Eclipse

Sometimes my code crashes leaving me with an error file but no core dump because the latter is disabled.
Now it is suggested to set ulimit -c unlimited to allow core dumps.
If I were running the code from the console, it would be no issue to set ulimit before starting the Java application. But the error seems to be much more frequent when debugging in Eclipse than when running standalone. (As a standalone it crashed this way only once in hundreds of running hours; while debugging it crashed 4 times in the last two months, and a few times before that.)
Is there a way to tell Eclipse to set ulimit before launching a debug session?
And will a core dump help me find what causes the crash?
For completeness, I am working on macOS and my error files all start like this (only the pid changes between them):
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGILL (0x4) at pc=0x00007fffa08c144e, pid=617, tid=0x0000000000000307
#
# JRE version: Java(TM) SE Runtime Environment (8.0_181-b13) (build 1.8.0_181-b13)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.181-b13 mixed mode bsd-amd64 compressed oops)
# Problematic frame:
# C [AppKit+0x3a544e] -[NSApplication _crashOnException:]+0x6d
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# If you would like to submit a bug report, please visit:
# http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
I suspect (but have nothing to prove it except the moment when it crashes) that the crashes are related to the handling of BufferedImage within the Java code, but then again, the error log says that the error happened outside the JVM.

How to know the reason for JVM crashing with Segfault?

We are seeing the JVM crash at times with a segfault. The only error we see in the logs is as below.
Can anyone suggest something by looking at the error trace below?
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007fef7f1d3eb0, pid=42623, tid=0x00007feea62c8700
#
# JRE version: OpenJDK Runtime Environment (8.0_222-b10) (build 1.8.0_222-b10)
# Java VM: OpenJDK 64-Bit Server VM (25.222-b10 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# J 62683 C2 org.apache.ignite.internal.marshaller.optimized.OptimizedObjectOutputStream.writeObject0(Ljava/lang/Object;)V (331 bytes) # 0x00007fef7f1d3eb0 [0x00007fef7f1d3e00+0xb0]
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /tmp/hsperfdata_pvappuser/hs_err_pid42623.log
#
# If you would like to submit a bug report, please visit:
# http://bugreport.java.com/bugreport/crash.jsp
While trying to understand the reason for this crash with the Oracle JVM troubleshooting docs (https://docs.oracle.com/javase/8/docs/technotes/guides/troubleshoot/crashes001.html), this looks to be the case of 5.1.2 "Crash in Compiled Code", as the problematic frame is a Java frame (it has a "J").
Though we could not get much further than that, and we are also not sure when it happens; the only probable pattern is that it occurs after the JVM has been running for 5-6 days, so usually on a Friday.
We are using the openjdk-8 ("1.8.0_232") distribution provided by Red Hat, running on RHEL 6.10.
Looking forward to any leads on tracing this error.
The current stack frame has writeObject0 as the last called method. There is a naming convention that native methods' names end with 0, so check whether that method is indeed native.
If it is, it is probably written in C, an ancient unsafe language whose programs tend to crash in an uncontrolled way. This often leads to SIGSEGV.
In this case, that method is written in Java though.
As you were told in the error message, read hs_err_pid42623.log for further details. In that file you will find the registers and a few machine instructions around the code that crashed.
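A minimal sketch of that check using reflection, assuming the Ignite jar is on the classpath; the class and method names are copied from the "Problematic frame" line of the crash log:

import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

public class NativeCheck {
    public static void main(String[] args) throws Exception {
        // Class name taken from the "Problematic frame" line of the crash log.
        Class<?> cls = Class.forName(
                "org.apache.ignite.internal.marshaller.optimized.OptimizedObjectOutputStream");
        for (Method m : cls.getDeclaredMethods()) {
            if (m.getName().equals("writeObject0")) {
                // Modifier.isNative reports whether the method is implemented in native code.
                System.out.println(m + " -> native = " + Modifier.isNative(m.getModifiers()));
            }
        }
    }
}

Here the check would print native = false, which matches the "J" (JIT-compiled Java frame) marker in the crash log.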

How to prove that the error is not my Java agent or how to debug it?

One of our clients has contacted me complaining that he is getting the following JRE crash while using our Java agent.
According to the error (below), the crash is in native code, since the problematic frame is categorized as 'C'.
I've done some googling and it seems there are some open bugs around this issue which are quite similar and also involve Java agents. See the following links:
https://bugs.openjdk.java.net/browse/JDK-8094079
https://bugs.openjdk.java.net/browse/JDK-8041920
The issue is that the customer is reluctant to upgrade the JDK, since he mentions that he has other Java agents which are running without any issue.
Any suggestions on how to solve this issue?
For completeness, please see the error that he sent:
cat /opt/somecompany/apps/some-product-platform/some-product-name/hs_err_pid6697.log | grep sealights
7fe19d9b5000-7fe19d9d7000 r--s 00401000 ca:01 3539943 /opt/somecompany/apps/some-product-platform/some-product-name/sealights/sl-test-listener.jar
jvm_args: -javaagent:/opt/somecompany/apps/some-product-platform/hawtio/jolokia-jvm.jar=config=/opt/somecompany/apps/some-product-platform/some-product-name/conf/jolokia-agent.properties -javaagent:/opt/somecompany/apps/some-product-platform/some-product-name/agent/newrelic.jar -DNEWS_product_HOME=/opt/somecompany/apps/some-product-platform/some-product-name -Dsl.environmentName=Functional Tests DEV-INT -Dsl.customerId=myCustomer -Dsl.appName=ABB-product-name -Dsl.server=https://my-server.com -Dsl.includes=com.somecompany.* -javaagent:/opt/somecompany/apps/some-product-platform/some-product-name/sealights/sl-test-listener.jar -Dlog.dir=/opt/somecompany/apps/some-product-platform/logs -Dlog.threshold=debug
java_class_path (initial): some-product-name.jar:/opt/somecompany/apps/some-product-platform/hawtio/jolokia-jvm.jar:/opt/somecompany/apps/some-product-platform/some-product-name/agent/newrelic.jar:/opt/somecompany/apps/some-product-platform/some-product-name/sealights/sl-test-listener.jar
JVM crash message:
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x0000000000000055, pid=6697, tid=140604865455872
#
# JRE version: Java(TM) SE Runtime Environment (8.0_20-b26) (build 1.8.0_20-b26)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.20-b23 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C 0x0000000000000055
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /opt/somecompany/apps/some-product-platform/some-product-name/hs_err_pid6697.log
Compiled method (c1) 565518 13823 1 sun.invoke.util.ValueConversions::identity (2 bytes)
total in heap [0x00007fe1a76857d0,0x00007fe1a7685a48] = 632
relocation [0x00007fe1a76858f8,0x00007fe1a7685918] = 32
main code [0x00007fe1a7685920,0x00007fe1a7685980] = 96
stub code [0x00007fe1a7685980,0x00007fe1a7685a10] = 144
metadata [0x00007fe1a7685a10,0x00007fe1a7685a18] = 8
scopes data [0x00007fe1a7685a18,0x00007fe1a7685a20] = 8
scopes pcs [0x00007fe1a7685a20,0x00007fe1a7685a40] = 32
dependencies [0x00007fe1a7685a40,0x00007fe1a7685a48] = 8
#
# If you would like to submit a bug report, please visit:
# http://bugreport.sun.com/bugreport/crash.jsp
#
After asking for further details in the chat discussion, I assume that this error is related to the use of the ASM AdviceAdapter in combination with other agents in the same application. If another agent expects a specific structure of a class, using the adapter may reorder the local variable indices of the class in question in a way that the other agent does not anticipate. Since the VM skips verification of built-in classes, this results in a hard crash.
The solution would be to change the instrumentation to not use the AdviceAdapter.
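A minimal sketch of such a change, assuming an ASM-based agent; the hook owner and method names are hypothetical. Instead of extending AdviceAdapter (which builds on LocalVariablesSorter and therefore renumbers local variable slots), the entry callback is injected from a plain MethodVisitor, so existing locals keep the indices other agents may rely on:

import org.objectweb.asm.MethodVisitor;
import org.objectweb.asm.Opcodes;

// Delegating visitor that adds a call at method entry without remapping locals.
class EntryHookVisitor extends MethodVisitor {
    EntryHookVisitor(MethodVisitor next) {
        super(Opcodes.ASM5, next);
    }

    @Override
    public void visitCode() {
        super.visitCode();
        // Emit a static, no-arg call at the start of the method body.
        // "com/example/AgentHooks" and "onEnter" are placeholder names.
        mv.visitMethodInsn(Opcodes.INVOKESTATIC,
                "com/example/AgentHooks", "onEnter", "()V", false);
    }
}

Whether this is feasible depends on what the agent records: dropping AdviceAdapter also means giving up its onMethodEnter/onMethodExit hooks and handling constructors and method exits manually.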

SIGSEGV error on Tomcat 7 on CentOS with problematic frame libkrb5.so.3

I have Tomcat 7.47 running on Java 1.7.0_65. The server I am using is a virtual machine on VMware running CentOS 6.5, so it cannot be a hardware error.
I created an online chat tool using the WebSocket API, and my authentication is based on LDAP. The user information is also retrieved via LDAP, using Apache LDAP.
Everything works quite well, but once in a while the whole JVM crashes with a SIGSEGV error.
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007f778ad0a040, pid=12392, tid=140151424538368
#
# JRE version: OpenJDK Runtime Environment (7.0_65-b17) (build 1.7.0_65-mockbuild_2014_07_16_06_06-b00)
# Java VM: OpenJDK 64-Bit Server VM (24.65-b04 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C [libkrb5.so.3+0x29a040]
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# If you would like to submit a bug report, please include
# instructions on how to reproduce the bug and visit:
# http://icedtea.classpath.org/bugzilla
#
The full error.log can be found here
I already tried increasing the memory, without success. Next I tried changing the LDAP library, because I thought maybe my previous library, UnboundID, had some problem with libkrb5.so.3 on CentOS, but Apache LDAP doesn't work either.
I also tried running it with the same Java version and the same Tomcat version on an Ubuntu laptop, and there everything worked fine.
Is there any known bug with libkrb5.so.3 on CentOS?
