SGE h_vmem vs java -Xmx/-Xms

We have a couple of SGE clusters running various versions of RHEL at my work, and we're testing a new one with a newer Red Hat release. On the old cluster ("CentOS release 5.4"), I'm able to submit a job like the following one and it runs fine:
echo "java -Xms8G -Xmx8G -jar blah.jar ..." |qsub ... -l h_vmem=10G,virtual_free=10G ...
On the new cluster ("CentOS release 6.2 (Final)"), a job with those parameters fails due to running out of memory, and I have to raise h_vmem to 17G for it to succeed. The new nodes have about 3x the RAM of the old ones, and in testing I'm only putting in a couple of jobs at a time.
On the old cluster, if I set -Xms/-Xmx to N, I could use roughly N+1 for h_vmem. On the new cluster, jobs crash unless I set h_vmem to 2N+1.
I wrote a tiny Perl script that does nothing but progressively consume memory and periodically print the amount used until it crashes or reaches a limit. For that script, the h_vmem parameter causes a crash at the expected memory usage.
I've tried multiple versions of the JVM (1.6 and 1.7). If I omit h_vmem, the job works, but then things are riskier to run.
I've googled others who have seen similar issues, but found no resolutions.

The problem here appears to be a combination of the following factors:
The old cluster was RHEL5, and the new one is RHEL6.
RHEL6 includes an update to glibc that changes the way malloc reports memory usage of multi-threaded programs.
The JVM uses a multi-threaded garbage collector by default.
To fix the problem I've used a combination of the following (combined in the sketch below):
Export the MALLOC_ARENA_MAX environment variable to a small number (1-10) in the job script, i.e. include something like: export MALLOC_ARENA_MAX=1
Increase the SGE memory request moderately, by 10% or so
Explicitly set the number of Java GC threads to a low number by using java -XX:ParallelGCThreads=1 ...
Increase the SGE slot request, e.g. qsub -pe pthreads 2
Note that it's unclear whether setting MALLOC_ARENA_MAX all the way down to 1 is the right number, but low numbers worked well in my testing.
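Putting those together, a submission along the lines of this sketch reflects the fix (blah.jar and the trailing options are placeholders carried over from the original command; 11G is the old 10G request plus roughly 10%):
# cap glibc at one malloc arena so virtual memory tracks actual usage,
# use one parallel GC thread, and request two slots plus ~10% more memory
echo "export MALLOC_ARENA_MAX=1; java -XX:ParallelGCThreads=1 -Xms8G -Xmx8G -jar blah.jar ..." | qsub -pe pthreads 2 -l h_vmem=11G,virtual_free=11G ...
Depending on how your SGE consumables are configured, h_vmem may be counted per slot, so check whether the request gets multiplied by the pthreads count.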
Here are the links that led me to these conclusions:
https://www.ibm.com/developerworks/community/blogs/kevgrig/entry/linux_glibc_2_10_rhel_6_malloc_may_show_excessive_virtual_memory_usage?lang=en
What would cause a java process to greatly exceed the Xmx or Xss limit?
http://siddhesh.in/journal/2012/10/24/malloc-per-thread-arenas-in-glibc/
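As a quick check that arenas are the culprit (a diagnostic described in the IBM blog above): on 64-bit glibc each arena reserves a ~64MB region, which shows up in pmap as anonymous segments of about 65404K. Counting them before and after setting MALLOC_ARENA_MAX (pid illustrative):
# count ~64MB arena reservations in the process's address space
pmap -x <pid> | grep -c 65404
The count should drop sharply once MALLOC_ARENA_MAX is set to a small number.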

Related

Troubleshooting Java memory usage exception

I'm trying to troubleshoot a Java program that requires increasingly more memory until it cannot allocate any more and then it crashes.
EDIT More information about the program. The program is an indexer going through thousands of documents and indexing them for search. The documents are read from MongoDB and written back to MongoDB after some processing is performed. During the processing I'm using RocksDB (rocksdb-jni version 5.13.4 from Maven). There is some mention in this GitHub issue of RocksDB memory usage growing uncontrollably, but I'm not sure it's related.
Monitoring the process with visualvm results in the following plot:
but running htop on the machine shows totally different stats:
There is a difference of several GBs of memory that I'm unable to trace the source of.
The program is launched with the following VM arguments:
jvm_args: -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.port=<port> -Djava.rmi.server.hostname=<hostname> -Xmx12G -XX:+UseStringDeduplication
The system has 32GB of RAM and no swap. Of those 32GB, ~10GB are always taken by a tmpfs partition, ~8GB by MongoDB, and the remaining 12GB are assigned to the program. EDIT The visualvm screenshot above shows 20GB of heap size because it was from a previous run where I passed -Xmx20G; the behaviour, however, is the same whether I assign 12GB or 20GB to the heap. The behaviour also does not change if I remove the tmpfs partition, freeing 10 more GB of memory: it just takes longer, but eventually it runs out of memory.
I have no idea where this memory usage that is not shown in visualvm but appears in htop is coming from. What tools should I use to understand what is going on? The application is running on a remote server, so I would like a tool that only works in the console or can be configured to work remotely, like visualvm.
I always use JProfiler, but I hear JetBrains has a good one as well; both can connect remotely.
If possible I would try to create a (local) setup where you can freely test it.
In the RocksDB issue several possible solutions are mentioned; do they work?
RocksDB seems to need some configuration; how did you configure it?
I am not familiar with RocksDB, but I see there is some indexing and caching going on. How much data are you processing, and what index/caching configuration do you have? Are you sure this should fit in memory?
As far as I know, the mismatch is because JNI memory usage is not shown by default by most tools. There are flags to improve this, namely NativeMemoryTracking. I can't recall whether it will add the data to your visualvm overview as well.
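A minimal sketch of using it, assuming Java 8+ (the jar name and pid are illustrative; NMT itself adds some overhead):
# start the JVM with native memory tracking enabled
java -XX:NativeMemoryTracking=summary -Xmx12G -jar indexer.jar
# in another shell, ask the running JVM for a breakdown
jcmd <pid> VM.native_memory summary
The summary covers heap, thread stacks, GC structures, and metaspace, but memory malloc'd directly by native libraries such as RocksDB still won't show up in it.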
@karbos-538 I just noticed this is quite an old question; I hope the application is working by now. What part of the question is still relevant to you?

Is -XX:MaxRAMFraction=1 safe for production in a containerized environment?

Java 8/9 brought support for -XX:+UseCGroupMemoryLimitForHeap (with -XX:+UnlockExperimentalVMOptions). This sets -XX:MaxRAM to the cgroup memory limit. By default, the JVM allocates roughly 25% of the max RAM, because -XX:MaxRAMFraction defaults to 4.
Example:
MaxRAM = 1g
MaxRAMFraction = 4
JVM is allowed to allocate: MaxRAM / MaxRAMFraction = 1g / 4 = 256m
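One way to see what a given setup actually resolves to is to print the computed flags inside the container (a sketch; the image name and limit are illustrative):
# run a throwaway container with a 1g limit and print the resulting max heap
docker run -m 1g openjdk:8 java -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap -XX:+PrintFlagsFinal -version | grep -i maxheapsize
With the defaults described above, this reports roughly a quarter of the 1g limit.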
Using only 25% of the quota seems like waste for a deployment which (usually) consists of a single JVM process. So now people set -XX:MaxRAMFraction=1, so the JVM is theoretically allowed to use 100% of the MaxRAM.
For the 1g example, this often results in heap sizes around 900m. This seems a bit high - there is not a lot of free room for the JVM or other stuff like remote shells or out-of-process tasks.
So is this configuration (-XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap -XX:MaxRAMFraction=1) considered safe for prod or even best practice? Or should I still hand pick -Xmx, -Xms, -Xss and so on?
We did some simple testing which showed that setting -XX:MaxRAM=$QUOTA and -XX:MaxRAMFraction=1 results in killed containers under load. The JVM allocates more than 900M heap, which is way too much. -XX:MaxRAMFraction=2 seems safe(ish).
Keep in mind that you may want to leave headroom for other processes like getting a debug shell (docker exec) or diagnostics in the container.
Edit: we've written up what we've learned in detail in an article. Money quotes:
TL;DR:
Java memory management and configuration is still complex. Although the JVM can read cgroup memory limits and adapt memory usage accordingly since Java 9/8u131, it’s not a golden bullet. You need to know what -XX:+UseCGroupMemoryLimitForHeap does and you need to fine tune some parameters for every deployment. Otherwise you risk wasting resources and money or getting your containers killed at the worst time possible. -XX:MaxRAMFraction=1 is especially dangerous. Java 10+ brings a lot of improvements but still needs manual configuration. To be safe, load test your stuff.
and
The most elegant solution is to upgrade to Java 10+. Java 10 deprecates -XX:+UseCGroupMemoryLimitForHeap and introduces -XX:+UseContainerSupport, which supersedes it. It also introduces -XX:MaxRAMPercentage, which takes a value between 0 and 100. This allows fine-grained control of the amount of RAM the JVM is allowed to allocate. Since +UseContainerSupport is enabled by default, everything should work out of the box.
Edit #2: we've written a little bit more about -XX:+UseContainerSupport
Java 10 introduced +UseContainerSupport (enabled by default), which makes the JVM use sane defaults in a container environment. The feature has been backported to Java 8 as of 8u191, potentially allowing a huge percentage of Java deployments in the wild to properly configure their memory.
The recent oracle-jdk-8 (8u191) release brings the following options, which allow Docker container users to gain more fine-grained control over the amount of system memory used for the Java heap:
-XX:InitialRAMPercentage
-XX:MaxRAMPercentage
-XX:MinRAMPercentage
These options replace the deprecated Fraction forms (-XX:InitialRAMFraction, -XX:MaxRAMFraction, and -XX:MinRAMFraction).
See https://www.oracle.com/technetwork/java/javase/8u191-relnotes-5032181.html
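As a sketch of the percentage flags in use (the image and jar names are illustrative; with container support active, MaxRAM is taken from the container's memory limit):
# cap the heap at 75% of the container's 1g limit (requires 8u191+ or Java 10+)
docker run -m 1g my-image java -XX:MaxRAMPercentage=75.0 -jar app.jar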

Jenkins java.lang.OutOfMemoryError: GC overhead limit exceeded

I am currently working on creating a performance framework using Jenkins, executing the performance tests from Jenkins with the https://github.com/jmeter-maven-plugin/jmeter-maven-plugin plugin. The sanity test with a single user in this framework worked well, so I went ahead with an actual performance test of 200 users, and within 2 minutes received the error:
java.lang.OutOfMemoryError: GC overhead limit exceeded
I tried the following in jenkins.xml
<arguments>-Xrs -Xmx2048m -XX:MaxPermSize=512m -Dhudson.lifecycle=hudson.lifecycle.WindowsServiceLifecycle -jar "%BASE%\jenkins.war" --httpPort=8080 --prefix=/jenkins --webroot="%BASE%\war"</arguments>
but it didn't work. I also noted that whenever I increased the memory, the Jenkins service stopped, and I had to reduce the memory to 1GB for the service to restart.
I increased the memory for JMeter and Java as well, but it didn't help.
In the .jmx file, View Results Tree and every other listener are disabled, but the issue persists.
Since I am doing a POC, Jenkins is hosted on my laptop; high-level specs are as follows:
System Model: Latitude E7270; Processor: Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz (4 CPUs), ~2.5GHz; Memory: 8192MB RAM
Any help please?
The error about GC overhead implies that Jenkins is thrashing in Garbage Collection. This means it's probably spending more time doing Garbage Collection than doing useful work.
This situation normally comes about when the heap is too small for the application. With modern multi generational heap layouts it's difficult to say what exactly needs changing.
I would suggest you enable Verbose GC with the following options "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
Then follow the advice here: http://www.oracle.com/technetwork/articles/javase/gcportal-136937.html
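For a Jenkins instance launched directly, rather than through the Windows service wrapper, that might look like this sketch (the gc.log path is illustrative):
# run Jenkins with verbose GC logging written to a file for later analysis
java -Xmx2048m -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:gc.log -jar jenkins.war --httpPort=8080
Under the Windows service, the same flags can be added to the <arguments> line in jenkins.xml.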
A few points to note:
You are using the integrated Maven goal to run your JMeter tests. This uses Jenkins as the container to launch the tests, impacting not only your job but also other users of Jenkins.
It is better to defer the execution to a different client machine, such as a dedicated JMeter machine, which launches your tests with its own JVM parameters or the ones that you provide.
In summary,
1. Move the test execution out of Jenkins
2. Provide the output report as an input to your performance plug-in [this can also crash, since it will need more JVM memory when you process endurance-test results such as an 8-hour result file]
This way, your tests will have a better chance of scaling. Also, you haven't mentioned what type of scripting engine you are using. As per the JMeter documentation, JSR223 with Groovy has a memory leak. Please refer to
http://jmeter.apache.org/usermanual/component_reference.html#JSR223_Sampler
Try adding -Dgroovy.use.classvalue=true to see if that helps (provided you are using Groovy). If you are using Java 8, there is a high chance that JMeter is creating a unique class for every one of your scripts, growing the metaspace, which lives outside your JVM heap. In that case, restrict the metaspace and enable class unloading on a 64-bit JVM, e.g. -d64 -XX:+CMSClassUnloadingEnabled.
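A sketch of JMeter launched standalone with those constraints (file names and the 512m cap are illustrative; class unloading with this flag requires the CMS collector):
# cap metaspace and let CMS unload generated script classes
java -d64 -Dgroovy.use.classvalue=true -XX:MaxMetaspaceSize=512m -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -jar ApacheJMeter.jar -n -t test.jmx -l results.jtl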
Also, what is your new generation size (-XX:NewSize=1024m -XX:MaxNewSize=1024m)? Please note that JMeter loads all files permanently, and they go directly to the old generation, thereby shrinking the space available for the new generation.

ColdFusion Garbage Collection

We have 6 Windows Server 2008/IIS 7.5 ColdFusion 9.0.2 servers behind a round-robin load balancer. Each server is allocated 2GB for ColdFusion; the servers have 6GB of memory total. Garbage collection seems to be an issue across all the servers, but I'm not sure how to resolve it without recycling ColdFusion.
The graph below is the AVG/MAX memory for our 6 servers over the past few days. Each day the AVG memory increases. Eventually, the servers start queuing requests (because they can't process them fast enough) and we have no choice but to recycle.
The data in the graph was taken from 1m snapshots of FusionReactor across all 6 servers.
Our servers are using the following command line in jvm.config for ColdFusion:
java.args=-Xmx2G -server -Xms2g -Dsun.io.useCanonCaches=false
-XX:MaxPermSize=192m -XX:+UseParallelGC -Xbatch -Dcoldfusion.rootDir={application.home}/ -Djava.security.policy={application.home}/servers/cfusion/cfusion-ear/cfusion-war/WEB-INF/cfusion/lib/coldfusion.policy
-Djava.security.auth.policy={application.home}/servers/cfusion/cfusion-ear/cfusion-war/WEB-INF/cfusion/lib/neo_jaas.policy
I'm not sure if changing garbage collection parameters is the solution, and I know nothing about GC especially as it relates to ColdFusion.
I'm aware this may have something to do with the code on the site. It's a portal (something like fusebox) that hosts many different applications inside of it. There are NOT many uses of cfobject calls in the portal.
This is similar to this question: Coldfusion OutOfMemoryError (CF9/Wheels)
But let me highlight the ones that are relevant for this:
Make sure you are on at least CF 9.01hf4 or 9.02hf1, and run ColdFusion on Java 7 (see ColdFusion 9.01 on Java 7)
Bump up -XX:MaxPermSize=512m
Use -XX:+UseG1GC (see Is JDK 6u14 Garbage First (G1) garbage collector, suitable for JRun?)
Make sure the JVM can use 4GB
Every 100 to 1000 iterations, explicitly request a garbage collection (the JVM treats it as a suggestion)
Make your function silent
Make sure that variables in functions are scoped to var or local
Consider ORM
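Under those suggestions, the java.args line in jvm.config might become something like this sketch (only the heap sizes, MaxPermSize, and collector change; the remaining arguments, elided here, stay as in the original):
java.args=-Xmx4G -server -Xms4g -Dsun.io.useCanonCaches=false -XX:MaxPermSize=512m -XX:+UseG1GC -Xbatch -Dcoldfusion.rootDir={application.home}/ ...
Note that 4GB heaps on servers with 6GB of total memory leave little room for the OS and other processes, so the hardware allocation may need to grow accordingly.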

Limiting memory usage for Solr on Jetty

I have a memory-limited environment and I'm running Solr on Jetty with the following command:
java -jar -Xmx64M -Xmn32M -Xss512K start.jar
But the total memory consumption of the Solr instance (or Jetty) seems to be much higher than the heap limit I provide. The output of ps is:
ps -u buradayiz -o rss,etime,pid,command
155164 01:37:40 21989 java -jar -Xmx64M -Xmn32M -Xss512K start.jar
As you see, the RSS is over 150M. How can I avoid this situation? I just want to get a simple OutOfMemory exception when Solr/Jetty uses more memory than I let them.
I understand that there may be a difference between the heap limit I provide and the actual memory usage, but a difference factor of two (actually 2.5) seems a lot to me. I must be missing something.
Thanks.
There are a number of factors that contribute to memory usage beyond the heap specification.
A major one in your situation is the permanent generation. It's used to load classes for all the dependencies required to run the application, among a few other things. There's not much getting around a certain minimum for a given application, due to the classes it needs. You likely need around 64M (perhaps more) to run Solr on Jetty.
You can specify a maximum size to prevent the permanent generation from growing further, e.g. add -XX:MaxPermSize=64M to your command line. It's unlikely to help much, though, and it might even break things if more is required; usually almost all of it is used by classes that you need.
Another contributor to memory usage beyond the heap is the stack size per thread. Each thread in your case is going to consume 512K. You can probably specify 256K safely, although you probably don't have enough threads running to matter too much.
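Putting both suggestions together with the original command gives a sketch like this (the 64M MaxPermSize is the estimate above, not a verified minimum):
# heap capped at 64M, permgen at 64M, per-thread stacks halved to 256K
java -jar -Xmx64M -Xmn32M -Xss256K -XX:MaxPermSize=64M start.jar
Even then, expect RSS somewhat above the sum of those caps, since it also counts the JVM's own code, native allocations, and mapped files.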
I have the same problem, trying to run it in a limited environment (max 400MB RAM/VM size). This solution seems to get it running, at least.
