Recently I came across an interesting scenario with Cloudera Hadoop and HDFS where we were unable to start our NameNode Service.
When attempting a restart of the HDFS Services we were unable to successfully restart the NameNode Service in our cluster. Upon review of the logs, we did not observe any ERRORs but did see a few entries related to JvmPauseMonitor...
org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 5015ms
We were observing these entries in /var/log/hadoop-hdfs/NAMENODE.log.out and were not seeing any other errors anywhere else, including /var/log/messages.
CHECK YOUR JAVA HEAP SIZES
Ultimately, we were able to determine that we were running into a Java OOM Exception that wasn't being logged.
From a performance perspective, as a general rule, for every 1 Million Blocks in HDFS you should configure at least 1GB of Java Heap Size.
In our case, the resolution was as simple as increasing the Java Heap Size for the NameNode and Secondary NameNode Services and restarting... we had grown to 1.5 Million Blocks but were still using the default 1GB setting for the Java Heap Size.
After increasing the Java Heap Size to at least 2GB and restarting the HDFS Services we were green across the board.
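For reference, a rough sketch of where that heap setting lives on a plain Apache Hadoop install (on a Cloudera-managed cluster you would instead set the equivalent Java Heap Size fields for the NameNode and Secondary NameNode in Cloudera Manager; the 2g values below are just illustrative):
# hadoop-env.sh (illustrative values; size to your block count)
export HADOOP_NAMENODE_OPTS="-Xms2g -Xmx2g ${HADOOP_NAMENODE_OPTS}"
export HADOOP_SECONDARYNAMENODE_OPTS="-Xms2g -Xmx2g ${HADOOP_SECONDARYNAMENODE_OPTS}"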
Cheers!
Related
We have an enterprise application running on Java 8. The deployment environment is built & updated through Bitbucket pipelines. I have a graphic showing the high-level architecture of the environment. We have two app servers running identical configurations apart from some application specific environment variables.
It was all working well until a week ago when after a successful pipeline run, the 2 app instances on one of the servers stopped working with the following error:
There is insufficient memory for the Java Runtime Environment to continue.
Cannot create GC thread. Out of system resources.
Both instances are working fine on the other server; on this server, however, the containers fail to start.
Solutions Tried
The error accompanies the following information:
Possible reasons:
The system is out of physical RAM or swap space
The process is running with CompressedOops enabled, and the Java Heap may be blocking the growth of the native heap.
Possible solutions:
Reduce memory load on the system
Increase physical memory or swap space
Check if swap backing store is full
Decrease Java heap size (-Xmx/-Xms)
Decrease number of Java threads
Decrease Java thread stack sizes (-Xss)
Set larger code cache with -XX:ReservedCodeCacheSize=
We have tried:
Adding more swap space. The server has 8GB of RAM, and we have tried swap sizes from 4GB to 9GB.
Playing with the heap sizes (Xms & Xmx) from 128m to 4096m.
Increasing the RAM on this server to 16GB, while the other server still works fine on 8GB.
Here is what the memory & swap consumption looks like:
free -mh
total used free shared buff/cache available
Mem: 15Gi 378Mi 12Gi 1.0Mi 2.9Gi 14Gi
Swap: 9Gi 0B 9Gi
I have links to several related artifacts. These include the complete docker logs output and the output of docker info on the failing server and the operational server.
This is what docker ps -a gets us:
:~$ docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
d29747bf2ad3 :a7608a838625ae945bd0a06fea9451f8bf11ebe4 "catalina.sh run" 10 hours ago Exited (1) 10 hours ago jbbatch
0951b6eb5d42 :a7608a838625ae945bd0a06fea9451f8bf11ebe4 "catalina.sh run" 10 hours ago Exited (1) 10 hours ago jbapp
We are out of ideas right now, as we have tried almost all the solutions on Stack Overflow. What are we missing?
I see that your Docker image uses Ubuntu 22.04 LTS as its base. Recently base Java images were rebuilt on top of this LTS version, which caused a lot of issues on older Docker runtimes. Most likely this is what you're experiencing. It has nothing to do with memory, but rather with Docker incompatibility with a newer Linux version used as a base image.
Your operational server has Docker server version 20.10.10, while the failing server has version 20.10.9. The incompatibility issue was fixed in Docker 20.10.10. Some more technical details on the incompatibility issue are available here.
The solution would be to upgrade the failing server to at least Docker 20.10.10.
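For what it's worth, a quick sketch for comparing the Docker Engine versions on the two hosts (the upgrade command assumes Docker was installed from Docker's own apt repository; adjust for your distro):
# print only the Docker Engine (server) version on each host
docker version --format '{{.Server.Version}}'
# on a Debian/Ubuntu host using Docker's apt repository, upgrading might look like:
sudo apt-get update && sudo apt-get install --only-upgrade docker-ce docker-ce-cli containerd.io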
I had the same error.
Output of
# docker info
was:
....
Security Options:
seccomp
WARNING: You're not using the default seccomp profile
Profile: /etc/docker/seccomp.json
....
The issue was resolved by putting
security_opt:
- seccomp:unconfined
in the docker-compose.yml for the service and then removing and recreating the container (a minimal compose sketch follows the commands below):
docker rm <container_name>
docker-compose up -d <service_name>
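For context, a minimal sketch of where the option sits in docker-compose.yml (the service name and image here are placeholders, not taken from the question):
services:
  jbapp:                       # placeholder service name
    image: your-image:tag      # placeholder image
    security_opt:
      - seccomp:unconfined     # disables the custom seccomp profile for this service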
Maybe the same result could be achieved by tweaking /etc/docker/seccomp.json instead - I tried and failed.
I am using Hadoop on a single-node machine in Local Mode. When I run files sequentially, everything works fine. However, when I run mappers in parallel I get the error "org.apache.hadoop.mapred.LocalJobRunner$Job:run(560) | job_local131593030_0008 java.lang.Exception: java.lang.OutOfMemoryError: Java heap space". I have done some research on this and found two properties, "mapreduce.map.memory.mb" and "mapred.child.java.opts", that are used to increase the physical and heap memory for the mappers, but setting them did not work in my case for Local Mode. Could anyone please suggest how to fix this issue?
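(For context: since Local Mode runs map tasks inside the same JVM as the client via LocalJobRunner, the per-task memory properties above generally don't apply there; a commonly suggested workaround is raising the client JVM's heap instead. A sketch, where your-job.jar and YourDriver are placeholders:)
# Local Mode runs map tasks in the client JVM, so raise the client heap
export HADOOP_CLIENT_OPTS="-Xmx2g"
hadoop jar your-job.jar YourDriver input output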
Since moving to Tomcat8/Java8, now and then the Tomcat server is OOM-killed. OOM = Out-of-memory kill by the Linux kernel.
How can I prevent the Tomcat server from being OOM-killed?
Can this be the result of a memory leak? I guess I would get a normal Out-of-memory message, but no OOM-kill. Correct?
Should I change settings in the HEAP size?
Should I change settings for the MetaSpace size?
Knowing which Tomcat process is killed, how to retrieve info so that I can reconfigure the Tomcat server?
Firstly check that the oomkill isn't being triggered by another process in the system, or that the server isn't overloaded with other processes. It could be that Tomcat is being unfairly targeted by oomkill when some other greedy process is the culprit.
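A sketch of how to check that from the kernel logs (exact messages and log locations vary by distro):
# the kernel OOM killer logs which process it killed and why
dmesg | grep -i -E 'killed process|out of memory'
# or, on systemd-based distros:
journalctl -k | grep -i 'killed process'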
The heap's maximum size (-Xmx) should be set smaller than the physical RAM on the server. If it is more than this, then paging will cause desperately poor performance when garbage collecting.
If it's caused by the metaspace growing in an unbounded fashion, then you need to find out why that is happening. Simply setting the maximum size of metaspace will cause an outofmemory error once you reach the limit you've set. And raising the limit will be pointless, because eventually you'll hit any higher limit you set.
Run your application and, before it crashes (not easy of course, but you'll need to judge it), kill -3 the Tomcat process. Then analyse the heap and try to find out why the metaspace is growing so big. It's usually caused by dynamically loading classes. Is this something your application is doing? More likely, it's some framework doing this. (N.B. the OOM killer will kill -9 the Tomcat process, and you won't be able to run diagnostics after that, so you need to let the app run and intervene before this happens.)
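A rough sketch of capturing diagnostics before the kill happens (the pgrep pattern is an assumption for a single Tomcat instance; jcmd ships with the JDK):
PID=$(pgrep -f catalina)                               # assumes one Tomcat process
kill -3 "$PID"                                         # thread dump goes to catalina.out
jcmd "$PID" GC.class_histogram > class-histogram.txt   # loaded classes by count/size
jcmd "$PID" GC.heap_dump /tmp/tomcat-heap.hprof        # heap dump for offline analysis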
Check out also this question - there's an intriguing answer which claims that an obscure fix to an XML binding setting cleared the problem (highly questionable, but may be worth a try): java8 "java.lang.OutOfMemoryError: Metaspace"
Another very good solution is transforming your application into a Spring Boot JAR (Docker) application. Normally such an application has much lower memory consumption.
So, the steps to get huge improvements (if you can move to a Spring Boot application):
Migrate to Spring Boot application. In my case, this took 3 simple actions.
Use a light-weight base image. See below.
VERY IMPORTANT - use the Java memory balancing options. See the last line of the Dockerfile below. This reduced my running container RAM usage from over 650MB to ONLY 240MB. Running smoothly. So, SAVING over 400MB on 650MB!!
This is my Dockerfile:
FROM openjdk:8-jdk-alpine
ENV JAVA_APP_JAR your.jar
ENV AB_OFF true
EXPOSE 8080
ADD target/$JAVA_APP_JAR /deployments/
CMD ["java","-XX:+UnlockExperimentalVMOptions", "-XX:+UseCGroupMemoryLimitForHeap", "-jar","/deployments/your.jar"]
When I start elastic search, I see "Killed" on the console and the process ends. I am unable to get the elastic search process to start. What am I missing?
:~/elasticsearch-5.5.2/bin$ ./elasticsearch
Killed
If it is relevant, I am installing this on a VPS. I don't see any other error message - making it hard to debug.
As for configuring RAM usage by Elasticsearch:
Find the elasticsearch jvm.options file location, by default it is /etc/elasticsearch/jvm.options
Set the -Xms and -Xmx options in there to reflect the amount of RAM available on your VPS, as described here (see the example after these steps)
By default, elasticsearch tries to occupy 1Gb of RAM at start, so if your VPS has less than 1Gb of RAM you need to configure elasticsearch to use less RAM accordingly
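For example, on a small VPS the two lines in jvm.options might look like this (values are illustrative):
-Xms256m
-Xmx256m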
As an alternative to the file configuration above, you can try exporting the corresponding environment variable
export ES_JAVA_OPTS="-Xms256m -Xmx256m"
and then check if it helps
./elasticsearch
Regarding the exit state
Killed
It most frequently indicates OoM Killer activity, which frees RAM in an emergency so that Linux can survive running out of available memory. The OoM Killer, as its name suggests, sends a kill signal to one of the most memory-consuming user processes.
As for the VPS and its virtualization model, there may be custom container-based OoM settings in effect (check out the example for OpenVZ), so if you're 100% sure you've configured elasticsearch correctly and there is enough RAM to start its instance - contact your VPS provider to clarify possible limits (like 10% of RAM must always be free, or the OoM Killer is triggered otherwise)
Some debug approach to OoM Killer events is described in this answer
I am on CDH 5.1.2 and I am seeing this error, with one of the datanodes pausing often. I see this in the logs.
WARN org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 12428ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=12707ms
Any idea why I am seeing this? Once in a while the HDFS capacity drops by one node.
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=12707ms
You're experiencing a long GC pause with the CMS collector.
To investigate further you should turn on GC logging via -Xloggc:<path to gc log file> -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+PrintGCDetails and in case you're on java 7 also add -XX:+PrintGCCause.
GCViewer can help visualizing the logs.
Once you've found the cause you can try adjusting CMS to avoid those pauses. For starters, there is the official CMS tuning guide.
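As an illustration only (not taken from the guide), common starting points people experiment with include the following flags, tuned against your own GC logs:
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=70
-XX:+UseCMSInitiatingOccupancyOnly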
We just encountered a very similar issue running CDH 5.3.2 where we were unable to successfully start the HDFS NameNode Service on our Hadoop Cluster.
At the time it was very puzzling, as we weren't observing any apparent ERRORs in /var/log/messages or /var/log/hadoop-hdfs/NAMENODE.log.out other than WARN org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC)
After working with Cloudera Support we were able to determine that we were running into an OOM Exception that wasn't being logged... as a general rule of thumb take a look at the configuration of your Heap Sizes... for every 1 Million Blocks you should have at least 1GB of Heap Size.
In our case, the resolution was as simple as increasing the Java Heap Size for the NameNode and Secondary NameNode Services and restarting... we had 1.5 Million Blocks but were only using the default 1GB setting for heap size. After increasing the Java Heap Size and restarting the HDFS Services we were green across the board.
Cheers!