Datanode having trouble with JVM pausing - java

I am on CDH 5.1.2 and I am seeing this error, with one of the DataNodes pausing often. I see this in the logs:
WARN org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 12428ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=12707ms
Any idea why I am seeing this? Once in a while the HDFS capacity drops by one node.

GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=12707ms
You're experiencing a long GC pause with the CMS collector.
To investigate further, turn on GC logging via -Xloggc:<path to gc log file> -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+PrintGCDetails, and if you're on Java 7 also add -XX:+PrintGCCause.
GCViewer can help with visualizing the logs.
Once you've found the cause you can try adjusting CMS to avoid those pauses. For starters, there is the official CMS tuning guide.
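As a concrete sketch of where those flags could go on a plain Apache install, you could append them to HADOOP_DATANODE_OPTS in hadoop-env.sh (on CDH you would normally put them in the DataNode's Java options field in Cloudera Manager instead); the log path here is only an example:
export HADOOP_DATANODE_OPTS="$HADOOP_DATANODE_OPTS -Xloggc:/var/log/hadoop-hdfs/datanode-gc.log -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -XX:+PrintGCCause"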

We just encountered a very similar issue running CDH 5.3.2 where we were unable to successfully start the HDFS NameNode Service on our Hadoop Cluster.
At the time it was very puzzling, as we weren't observing any apparent ERRORs in /var/log/messages or /var/log/hadoop-hdfs/NAMENODE.log.out, other than WARN org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC).
After working with Cloudera Support we were able to determine that we were running into an OOM exception that wasn't being logged. As a general rule of thumb, take a look at your heap size configuration: for every 1 million blocks you should have at least 1GB of heap.
In our case, the resolution was as simple as increasing the Java heap size for the NameNode and Secondary NameNode services and restarting, as we had 1.5 million blocks but were still on the default 1GB heap. After increasing the Java heap size and restarting the HDFS services we were green across the board.
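By that rule of thumb, 1.5 million blocks calls for at least a 2GB heap. On CDH this is normally set through the NameNode and Secondary NameNode heap fields in Cloudera Manager; the hand-set equivalent in hadoop-env.sh (values illustrative) would look something like:
export HADOOP_NAMENODE_OPTS="-Xms2g -Xmx2g $HADOOP_NAMENODE_OPTS"
export HADOOP_SECONDARYNAMENODE_OPTS="-Xms2g -Xmx2g $HADOOP_SECONDARYNAMENODE_OPTS"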
Cheers!

Related

Java 17 process running out of memory on Kubernetes when memory potentially available

The goal is to understand what should be tuned in order for the Java process to stop restarting itself.
We have a Java Springboot backend application with Hazelcast running that restarts instead of garbage collecting.
Environment is:
Amazon Corretto 17.0.3
The only memory tuning parameter supplied is:
-XX:+UseContainerSupport -XX:MaxRAMPercentage=80.0
The memory limit in Kubernetes is 2Gi, so the JVM heap maximum works out to roughly 1.6Gi.
Graphs of memory usage:
The huge drop towards the end is where I performed a heap dump. Performing the dump led to a drastic decrease in memory usage (due to a full GC?).
The GC appears to be working against me here. If the memory dump is not performed, the container hits what appears to be a memory limit, is restarted by Kubernetes, and continues in this cycle. Are there tuning parameters that I have missed, or is this a clear memory leak (perhaps due to Hazelcast metrics, https://github.com/hazelcast/hazelcast/issues/16672)?
The JVM decides which garbage collector (GC) to use based on the amount of memory and CPU given to the application. By default it will use the Serial GC if the RAM is under 2GB or there are fewer than 2 CPU cores. For a Kubernetes server application the Serial GC is not a great choice: it runs in a single thread, it seems to wait until the heap is near the max limit to reclaim heap space, and it causes a lot of application pausing, which can lead to health check failures or scaling due to momentarily higher CPU usage.
What has worked best for us is to force the use of the G1 collector. It is a concurrent collector that runs side by side with your app and tries its best to minimize application pausing. I would suggest setting your CPU limit to at least 1 and setting your RAM limit to however much you think your application is going to use plus a little overhead. To force the G1 collector, add -XX:+UseG1GC to your java options.
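A minimal sketch of one way to pass that flag in Kubernetes, assuming you use the standard JAVA_TOOL_OPTIONS environment variable on the container (set e.g. in the Deployment spec; values illustrative):
JAVA_TOOL_OPTIONS="-XX:+UseG1GC -XX:MaxRAMPercentage=80.0"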

JVM Freeze is happening and hotspot logs shows that "ParallelGCFailedAllocation", "Revoke Bias" is taking more time

We are facing a peculiar problem in our clustered application. After running the system for some time, the application suddenly freezes and we can't find any clue as to what is causing this.
After enabling JVM hotspot logs we see that "ParallelGCFailedAllocation" and "RevokeBias" are taking more time.
Refer to the attached graph, which was plotted by parsing the hotspot logs and converting them to CSV.
The graph shows that at certain times "ParallelGCFailedAllocation" and "RevokeBias" spike and take around 13 seconds, which is not normal.
We are trying to find what is causing them to take so much time.
Does anybody have a clue on how to debug such an issue?
Environment details:
32-core machine running on a VMware hypervisor.
Heap Size: 12GB
RHEL 7 with Open JDK 8
Wow, you have about 2800 threads in your application; that's too much!
Your heap is also huge: 4GB in the young gen and 8GB in the old gen. What are you expecting in this case?
From the PrintSafepointStatistics output, you have no problems with safepoint sync; it is actually the VM operation that takes all the time.
You can disable biased locking (-XX:-UseBiasedLocking) and use a concurrent GC (CMS/G1) instead of the parallel old GC; maybe this will help you and reduce pauses a little, but the main problem is bad configuration and maybe code/design.
Use size-limited thread pools; ~2800 threads is too much.
12GB is a huge heap, and the young gen should not be so big.
Profiling your application (JFR, YourKit, JProfiler, VisualVM) can help you find allocation hotspots.
Eclipse MAT can also help you analyze the heap.
If you want to trace RevokeBias, add -XX:+TraceBiasedLocking (see the example below).
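As a hedged example of the flag changes suggested above (JDK 8 command-line options; tune and test for your own workload): to drop the RevokeBias pauses entirely and move to a concurrent collector you could run with
-XX:-UseBiasedLocking -XX:+UseG1GC
or, if you first want to see which locks are being revoked, keep biased locking on and add only
-XX:+TraceBiasedLocking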

Hadoop JvmPauseMonitor

Recently I came across an interesting scenario with Cloudera Hadoop and HDFS where we were unable to start our NameNode Service.
When attempting a restart of the HDFS services we were unable to successfully restart the NameNode Service in our cluster. Upon review of the logs, we did not observe any ERRORs but did see a few entries related to JvmPauseMonitor...
org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 5015ms
We were observing these entries in /var/log/hadoop-hdfs/NAMENODE.log.out and were not seeing any other errors, including in /var/log/messages.
CHECK YOUR JAVA HEAP SIZES
Ultimately, we were able to determine that we were running into a Java OOM Exception that wasn't being logged.
From a performance perspective, as a general rule, for every 1 million blocks in HDFS you should have configured at least 1GB of Java heap.
In our case, the resolution was as simple as increasing the Java Heap Size for the NameNode and Secondary NameNode Services and Restarting... as we had grown to 1.5 Million Blocks but were only using the default 1GB setting for the java heap size.
After increasing the Java Heap Size to at least 2GB and restarting the HDFS Services we were green across the board.
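As a hedged way of checking where a cluster stands against that rule of thumb, the total block count is reported by fsck (the exact output format varies between HDFS versions):
hdfs fsck / | grep -i 'Total blocks'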
Cheers!

Gridgain: java.lang.OutOfMemoryError: GC overhead limit exceeded

I'm trying to set up a Gridgain cluster with 2 servers.
Load data from a .csv file (1 million to 50 million records) into Gridgain using GridDataLoader.
Find the min, max, average, etc. from the loaded data.
When running as a standalone application in Eclipse I'm getting correct output.
But when making a cluster (2 nodes on the 2 servers + 1 node inside my Eclipse environment), I'm getting a java.lang.OutOfMemoryError: GC overhead limit exceeded error.
The configuration file I'm using is http://pastebin.com/LUa7gxbe
Changing eclipse.ini's Xmx property might solve the problem.
Change it to -Xmx3g
java.lang.OutOfMemoryError: GC overhead limit exceeded
This error happens when the system spends too much time executing garbage collection. There can be multiple causes, and it is highly dependent on your environment details. I don't know Gridgain. Given your complex environment, my first thought is VM tuning: if your application waits for the whole heap to be full before running garbage collection, that is your main problem.
A hint can be the -XX:-UseParallelGC JVM option (some documentation is available here), but that should be the default configuration in Gridgain. I don't know the proper way to configure VM options in your environment (some options seem to be related to the cache). According to the same doc, a slow network could induce low CPU usage; I guess heavy network traffic could induce high CPU usage (perhaps related to GC)? To ensure you have an appropriate VM configuration, could you check which options are actually applied at runtime?
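One hedged way to do that check on a running node (the pid is a placeholder for the Gridgain node's process id; jinfo ships with the JDK):
jinfo -flags <pid>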
Edit the bin/ggstart.sh script and set JVM_OPTS to a higher value.
The default is 1GB; change it to
JVM_OPTS="-Xms2g -Xmx2g -server -XX:+AggressiveOpts -XX:MaxPermSize=256m"
or higher

How can I track down a non-heap JVM memory leak in JBoss AS 5.1?

After upgrading to JBoss AS 5.1, running JRE 1.6_17, CentOS 5 Linux, the JRE process runs out of memory after about 8 hours (hits 3G max on a 32-bit system). This happens on both servers in the cluster under moderate load. Java heap usage settles down, but the overall JVM footprint just continues to grow. Thread count is very stable and maxes out at 370 threads with a thread stack size set at 128K.
The footprint of the JVM reaches 3G, then it dies with:
java.lang.OutOfMemoryError: requested 32756 bytes for ChunkPool::allocate. Out of swap space?
Internal Error (allocation.cpp:117), pid=8443, tid=1667668880
Error: ChunkPool::allocate
Current JVM memory args are:
-Xms1024m -Xmx1024m -XX:MaxPermSize=256m -XX:ThreadStackSize=128
Given these settings, I would expect the process footprint to settle in around 1.5G. Instead, it just keeps growing until it hits 3G.
It seems none of the standard Java memory tools can tell me what in the native side of the JVM is eating all this memory. (Eclipse MAT, jmap, etc). Pmap on the PID just gives me a bunch of [ anon ] allocations which don't really help much. This memory problem occurs when I have no JNI nor java.nio classes loaded, as far as I can tell.
How can I troubleshoot the native/internal side of the JVM to find out where all the non-heap memory is going?
Thank you! I am rapidly running out of ideas and restarting the app servers every 8 hours is not going to be a very good solution.
As #Thorbjørn suggested, profile your application.
If you need more memory, you could go for a 64bit kernel and JVM.
Attach with jvisualvm from the JDK to get an idea of what is going on; jvisualvm can attach to a running process.
Walton:
I had a similar issue and posted my question/findings in https://community.jboss.org/thread/152698 .
Please try adding -Djboss.vfs.forceCopy=false to the Java startup parameters to see if it helps.
WARNING: even if it cuts down the process size, you need to test more to make sure everything is all right.
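For reference, a hedged sketch of where that system property typically goes on JBoss AS 5.x is the JAVA_OPTS line in bin/run.conf (your installation may set JAVA_OPTS elsewhere):
JAVA_OPTS="$JAVA_OPTS -Djboss.vfs.forceCopy=false"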
