java process hangs on start

java process hangs on start - java

I have a server (40GB RAM) on which the java process hangs on start.
If I simply type "java" on the shell, it prints the help message and then never exits.
It appears that there are about 8GBs of RAM available. Any help would be appreciated.
This is what the output of top looks like:
Tasks: 297 total, 1 running, 296 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 19.4%sy, 0.0%ni, 79.5%id, 0.0%wa, 0.0%hi, 1.1%si, 0.0%st
Mem: 49556016k total, 41112432k used, 8443584k free, 286900k buffers
Swap: 97851904k total, 276044k used, 97575860k free, 23982784k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
13 root 15 -5 0 0 0 S 88 0.0 2302:14 ksoftirqd/3
25 root 15 -5 0 0 0 S 73 0.0 2782:56 ksoftirqd/7
4 root 15 -5 0 0 0 S 64 0.0 10223:40 ksoftirqd/0
4912 user1 20 0 1529m 211m 9.8m S 25 0.4 6510:25 java
13092 user2 20 0 6565m 2.6g 8472 S 18 5.6 3178:40 java
1 root 20 0 19428 860 420 S 0 0.0 9:32.65 init

java -version should exit almost immediately. If it doesn't its not installed correctly.
BTW Try installing Java 6 update 33 as update 20 is quite old.

Related

How do I decide on a suitable TLABSIZE setting for a Java application?

My Java application on an single cpu arm7 (32bit) device using Java 14 is occasionally crashing
after running under load for a number of hours, and is always failing in ThreadLocalAllocBuffer::resize()
A fatal error has been detected by the Java Runtime Environment:
#
SIGSEGV (0xb) at pc=0xb6cd515e, pid=1725, tid=1733
#
JRE version: OpenJDK Runtime Environment (14.0+36) (build 14+36)
Java VM: OpenJDK Client VM (14+36, mixed mode, serial gc, linux-arm)
Problematic frame:
V
#
No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
If you would like to submit a bug report, please visit:
https://bugreport.java.com/bugreport/crash.jsp
#
--------------- S U M M A R Y ------------
Command Line: -Duser.home=/mnt/app/share/log -Djdk.lang.Process.launchMechanism=vfork -Xms150m -Xmx900m -Dcom.mchange.v2.log.MLog=com.mchange.v2.log.jdk14logging.Jdk14MLog -Dorg.jboss.logging.provider=jdk -Djava.util.logging.config.class=com.jthink.songkong.logging.StandardLogging --add-opens=java.base/java.lang=ALL-UNNAMED lib/songkong-6.9.jar -r
Host: Marvell PJ4Bv7 Processor rev 1 (v7l), 1 cores, 1G, Buildroot 2014.11-rc1
Time: Fri Apr 24 19:36:54 2020 BST elapsed time: 37456 seconds (0d 10h 24m 16s)
--------------- T H R E A D ---------------
Current thread (0xb6582a30): VMThread "VM Thread" [stack: 0x7b716000,0x7b796000] [id=3625] _threads_hazard_ptr=0x7742f140
Stack: [0x7b716000,0x7b796000], sp=0x7b7946b0, free space=505k
Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [libjvm.so+0x48015e] ThreadLocalAllocBuffer::resize()+0x85
[error occurred during error reporting (printing native stack), id 0xb, SIGSEGV (0xb) at pc=0xb6b4ccae]
Now this must surely be bug in JVM, but as its not one of the standard Java platforms and I dont have a simple test case I cannot see it getting fixed anytime soon, so I am trying to workaround it. Its also worth noting that it crashed with ThreadLocalAllocBuffer::accumulate_statistics_before_gc() when I used Java 11 which is why I moved to Java 14 to try and resolve the issue.
As the the issue is with TLABs one solution is to disable TLABS with -XX:-UseTLAB but that makes the code run slower on an already slow machine.
So I think another solution is to disable resizing with -XX:-ResizeTLAB, but then I need to know work out a suitable size and specify that using -XX:TLABSize=N. But I am not sure what N actually represents and what would be a suitable size to set
I tried setting -XX:TLABSize=1000000 which seems to me to be quite large ?
I have some logging set with
-Xlog:tlab*=debug,tlab*=trace:file=gc.log:time:filecount=7,filesize=8M
but I don't really understand the output.
[2020-05-19T15:43:43.836+0100] ThreadLocalAllocBuffer::compute_size(132) returns 250132
[2020-05-19T15:43:43.837+0100] TLAB: fill thread: 0x0026d548 [id: 871] desired_size: 976KB slow allocs: 0 refill waste: 15624B alloc: 0.25725 1606KB refills: 1 waste 0.0% gc: 0B slow: 0B fast: 0B
[2020-05-19T15:43:43.853+0100] ThreadLocalAllocBuffer::compute_size(6) returns 250006
[2020-05-19T15:43:43.854+0100] TLAB: fill thread: 0xb669be48 [id: 32635] desired_size: 976KB slow allocs: 0 refill waste: 15624B alloc: 0.00002 0KB refills: 1 waste 0.0% gc: 0B slow: 0B fast: 0B
[2020-05-19T15:43:43.910+0100] ThreadLocalAllocBuffer::compute_size(4) returns 250004
[2020-05-19T15:43:43.911+0100] TLAB: fill thread: 0x76c1d6f8 [id: 917] desired_size: 976KB slow allocs: 0 refill waste: 15624B alloc: 0.91261 8085KB refills: 1 waste 0.0% gc: 0B slow: 0B fast: 0B
[2020-05-19T15:43:43.962+0100] ThreadLocalAllocBuffer::compute_size(2052) returns 252052
[2020-05-19T15:43:43.962+0100] TLAB: fill thread: 0x76e06f10 [id: 534] desired_size: 976KB slow allocs: 4 refill waste: 15688B alloc: 0.13977 1612KB refills: 2 waste 0.2% gc: 0B slow: 4520B fast: 0B
[2020-05-19T15:43:43.982+0100] ThreadLocalAllocBuffer::compute_size(28878) returns 278878
[2020-05-19T15:43:43.983+0100] TLAB: fill thread: 0x76e06f10 [id: 534] desired_size: 976KB slow allocs: 4 refill waste: 15624B alloc: 0.13977 1764KB refills: 3 waste 0.3% gc: 0B slow: 10424B fast: 0B
[2020-05-19T15:43:44.023+0100] ThreadLocalAllocBuffer::compute_size(4) returns 250004
[2020-05-19T15:43:44.023+0100] TLAB: fill thread: 0x7991df20 [id: 32696] desired_size: 976KB slow allocs: 0 refill waste: 15624B alloc: 0.00132 19KB refills: 1 waste 0.0% gc: 0B slow: 0B fast: 0B
Update
I reran with -XX:+HeapDumpOnOutOfMemoryError option added, and this time it showed:
java.lang.OutOfMemoryError: Java heap space
Dumping heap to java_pid1600.hprof ...
but then the dump itself failed with
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0xb6a81b9a, pid=1600, tid=1606
#
# JRE version: OpenJDK Runtime Environment (14.0+36) (build 14+36)
# Java VM: OpenJDK Client VM (14+36, mixed mode, serial gc, linux-arm)
# Problematic frame:
# V [libjvm.so+0x22eb9a] DumperSupport::dump_field_value(DumpWriter*, char, oopDesc*, int)+0x91
#
# No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /mnt/system/config/Apps/SongKong/songkong/hs_err_pid1600.log
#
# If you would like to submit a bug report, please visit:
# https://bugreport.java.com/bugreport/crash.jsp
I am not clear if the dump failed because of ulimit or soemthing else, but
java_pid1600.hprof was created but was empty
I was also monitoring the process with jstat -gc, and jstat -gcutil. I paste the end of the putput here, to me it does not look like there was a particular memory problem before the crash, although I am only checking every 5 seconds so maybe that is the issue ?
[root#N1-0247 bin]# ./jstat -gc 1600 5s
S0C S1C S0U S1U EC EU OC OU MC MU CCSC CCSU YGC YGCT FGC FGCT CGC CGCT GCT
........
30720.0 30720.0 0.0 0.0 245760.0 236647.2 614400.0 494429.2 50136.0 49436.9 0.0 0.0 5084 3042.643 155 745.523 - - 3788.166
30720.0 30720.0 0.0 28806.1 245760.0 244460.2 614400.0 506541.7 50136.0 49436.9 0.0 0.0 5085 3043.887 156 745.523 - - 3789.410
30720.0 30720.0 28760.4 0.0 245760.0 245760.0 614400.0 514809.7 50136.0 49437.2 0.0 0.0 5086 3044.895 157 751.204 - - 3796.098
30720.0 30720.0 0.0 231.1 245760.0 234781.8 614400.0 514809.7 50136.0 49437.2 0.0 0.0 5087 3044.895 157 755.042 - - 3799.936
30720.0 30720.0 0.0 0.0 245760.0 190385.5 614400.0 519650.7 50136.0 49449.6 0.0 0.0 5087 3045.905 159 758.890 - - 3804.795
30720.0 30720.0 0.0 0.0 245760.0 190385.5 614400.0 519650.7 50136.0 49449.6 0.0 0.0 5087 3045.905 159 758.890 - - 3804.795
[root#N1-0247 bin]# ./jstat -gc 1600 5s
S0 S1 E O M CCS YGC YGCT FGC FGCT CGC CGCT GCT
..............
99.70 0.00 100.00 75.54 98.56 - 5080 3037.321 150 724.674 - - 3761.995
0.00 29.93 99.30 75.55 98.56 - 5081 3038.403 151 728.584 - - 3766.987
0.00 100.00 99.30 75.94 98.56 - 5081 3039.405 152 728.584 - - 3767.989
100.00 0.00 99.14 76.14 98.56 - 5082 3040.366 153 734.088 - - 3774.454
0.00 96.58 99.87 78.50 98.57 - 5083 3041.366 154 737.960 - - 3779.325
56.99 0.00 100.00 78.50 98.58 - 5084 3041.366 154 741.880 - - 3783.246
0.00 0.00 96.29 80.47 98.61 - 5084 3042.643 155 745.523 - - 3788.166
0.00 93.77 99.47 82.44 98.61 - 5085 3043.887 156 745.523 - - 3789.410
93.62 0.00 100.00 83.79 98.61 - 5086 3044.895 157 751.204 - - 3796.098
0.00 0.76 95.53 83.79 98.61 - 5087 3044.895 157 755.042 - - 3799.936
0.00 0.00 77.47 84.58 98.63 - 5087 3045.905 159 758.890 - - 3804.795
0.00 0.00 77.47 84.58 98.63 - 5087 3045.905 159 758.890 - - 3804.795
Update Latest run
Configured gclogging, i get many
Pause Young (Allocation Failure)
errors, does this indicate I need to make the eden space larger?
[2020-05-29T14:00:22.668+0100] GC(44) Pause Young (GCLocker Initiated GC)
[2020-05-29T14:00:22.739+0100] GC(44) DefNew: 43230K(46208K)->4507K(46208K) Eden: 41088K(41088K)->0K(41088K) From: 2142K(5120K)->4507K(5120K)
[2020-05-29T14:00:22.739+0100] GC(44) Tenured: 50532K(102400K)->50532K(102400K)
[2020-05-29T14:00:22.740+0100] GC(44) Metaspace: 40054K(40536K)->40054K(40536K)
[2020-05-29T14:00:22.740+0100] GC(44) Pause Young (GCLocker Initiated GC) 91M->53M(145M) 72.532ms
[2020-05-29T14:00:22.741+0100] GC(44) User=0.07s Sys=0.00s Real=0.07s
[2020-05-29T14:00:25.196+0100] GC(45) Pause Young (Allocation Failure)
[2020-05-29T14:00:25.306+0100] GC(45) DefNew: 45595K(46208K)->2150K(46208K) Eden: 41088K(41088K)->0K(41088K) From: 4507K(5120K)->2150K(5120K)
[2020-05-29T14:00:25.306+0100] GC(45) Tenured: 50532K(102400K)->53861K(102400K)
[2020-05-29T14:00:25.307+0100] GC(45) Metaspace: 40177K(40664K)->40177K(40664K)
[2020-05-29T14:00:25.307+0100] GC(45) Pause Young (Allocation Failure) 93M->54M(145M) 111.252ms
[2020-05-29T14:00:25.308+0100] GC(45) User=0.08s Sys=0.02s Real=0.11s
[2020-05-29T14:00:29.248+0100] GC(46) Pause Young (Allocation Failure)
[2020-05-29T14:00:29.404+0100] GC(46) DefNew: 43238K(46208K)->4318K(46208K) Eden: 41088K(41088K)->0K(41088K) From: 2150K(5120K)->4318K(5120K)
[2020-05-29T14:00:29.405+0100] GC(46) Tenured: 53861K(102400K)->53861K(102400K)
[2020-05-29T14:00:29.405+0100] GC(46) Metaspace: 40319K(40792K)->40319K(40792K)
[2020-05-29T14:00:29.406+0100] GC(46) Pause Young (Allocation Failure) 94M->56M(145M) 157.614ms
[2020-05-29T14:00:29.406+0100] GC(46) User=0.07s Sys=0.00s Real=0.16s
[2020-05-29T14:00:36.466+0100] GC(47) Pause Young (Allocation Failure)
[2020-05-29T14:00:36.661+0100] GC(47) DefNew: 45406K(46208K)->5120K(46208K) Eden: 41088K(41088K)->0K(41088K) From: 4318K(5120K)->5120K(5120K)
[2020-05-29T14:00:36.662+0100] GC(47) Tenured: 53861K(102400K)->55125K(102400K)
[2020-05-29T14:00:36.662+0100] GC(47) Metaspace: 40397K(40920K)->40397K(40920K)
[2020-05-29T14:00:36.663+0100] GC(47) Pause Young (Allocation Failure) 96M->58M(145M) 196.531ms
[2020-05-29T14:00:36.663+0100] GC(47) User=0.09s Sys=0.01s Real=0.19s
[2020-05-29T14:00:40.523+0100] GC(48) Pause Young (Allocation Failure)
[2020-05-29T14:00:40.653+0100] GC(48) DefNew: 44274K(46208K)->2300K(46208K) Eden: 39154K(41088K)->0K(41088K) From: 5120K(5120K)->2300K(5120K)
[2020-05-29T14:00:40.653+0100] GC(48) Tenured: 55125K(102400K)->59965K(102400K)
[2020-05-29T14:00:40.654+0100] GC(48) Metaspace: 40530K(41048K)->40530K(41048K)
[2020-05-29T14:00:40.654+0100] GC(48) Pause Young (Allocation Failure) 97M->60M(145M) 131.365ms
[2020-05-29T14:00:40.655+0100] GC(48) User=0.11s Sys=0.01s Real=0.14s
[2020-05-29T14:00:43.936+0100] GC(49) Pause Young (Allocation Failure)
[2020-05-29T14:00:44.100+0100] GC(49) DefNew: 43388K(46208K)->5120K(46208K) Eden: 41088K(41088K)->0K(41088K) From: 2300K(5120K)->5120K(5120K)
Updated with gc analysis done by gceasy
Okay so this is useful I uploaded log to gceasy.org and it clearly shows that shortly before it crashed heap size was significantly higher and approaching the 900mb limit,even after a number of full gcs, so I think basically it ran out of heap space.
What is a little frustrating is I have the
-XX:+HeapDumpOnOutOfMemoryError
option enabled, but when it crashes it reports an issue trying to do create the dump file so I cannot get one.
And when I process the same file on Windows with the same setting for heap size it suceeds without failure, But Im goinf to run again ewith gclogging enabled and see if it reaches simailr levels even if it doesnt actually fall over.
Ran again (this is building on chnages made in previous run and doesnt show start of run) but to me the memory usage is higher but looks quite normal (sawtooth pattern) with no particular differenc ebefore the crash.
Update
With last run I reduced max heap from 900MB to 600MB, but I also monitored with vmstat, Yo can see clearly below where the applciation crashed but It doesn't seem we were approaching particularly ow memory at this point.
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
3 0 0 57072 7812 1174128 0 0 5360 0 211 558 96 4 0 0 0
1 0 0 55220 7812 1176184 0 0 2048 0 203 467 79 21 0 0 0
3 0 0 61296 7812 1169096 0 0 2036 44 193 520 96 4 0 0 0
2 0 0 59808 7812 1171144 0 0 2048 32 212 522 96 4 0 0 0
1 0 0 59436 7812 1171144 0 0 0 0 180 307 83 17 0 0 0
1 0 0 59436 7812 1171144 0 0 0 0 179 173 100 0 0 0 0
1 0 0 59436 7812 1171128 0 0 0 0 179 184 100 0 0 0 0
2 1 0 51764 7816 1158452 0 0 4124 52 190 490 80 20 0 0 0
3 0 0 63428 7612 1146388 0 0 20472 48 251 533 86 14 0 0 0
2 0 0 63428 7616 1146412 0 0 4 0 196 508 99 1 0 0 0
2 0 0 84136 7616 1146400 0 0 0 0 186 461 84 16 0 0 0
2 0 0 61436 7608 1148960 0 0 24601 0 325 727 77 23 0 0 0
4 0 0 60196 7648 1150204 0 0 1160 76 232 611 98 2 0 0 0
4 0 0 59204 7656 1151052 0 0 52 376 305 570 80 20 0 0 0
3 0 0 59204 7656 1151052 0 0 0 0 378 433 96 4 0 0 0
1 0 0 762248 7768 1151420 0 0 106 0 253 660 74 26 0 0 0
0 0 0 859272 8188 1151892 0 0 417 0 302 550 9 26 64 1 0
0 0 0 859272 8188 1151892 0 0 0 0 111 132 0 0 100 0 0

Based on your jstat data and their explanation here: https://docs.oracle.com/en/java/javase/11/tools/jstat.html#GUID-5F72A7F9-5D5A-4486-8201-E1D1BA8ACCB5
I would not expect OutOfMemoryError just yet from the HeapSpace based on the slow and steady rate of the Old Generation filling up and the small size of the from and to space (not that I know whether your application might allocate a huge array anytime soon) unless:
initial heap size (-Xms) is smaller than the max (-Xmx) and
Linux has overcomitted virtual memory
If you do overcommit (and who doesn't) maybe you should keep an eye on Linux with vmstat 1 or gathering data frequently for sar
But I do wonder why you are not using Garbage Collection logging with -Xlog:gc*:stderr or to a file with -Xlog:gc*:file= and maybe analyze that with https://gceasy.io/ as it is very low overhead (unless writing to the logfile is slow) and very precise. For more information on the logging syntax see: https://openjdk.java.net/jeps/158 and https://openjdk.java.net/jeps/271
java -Xlog:gc*:stderr -jar yourapp.jar
and analyze those logs with great ease with tools like these:
https://gceasy.io/
JClarity Censum
This should give similar information as jstack and more in realtime (as far as I know)

I think you may already be on the wrong track:
It is more likely that your process has a general problem with allocating memory than that there are two different bugs in two different Java versions.
Have you already checked whether the process has enough memory? A segmentation fault can also occur when the process runs out of memory. I would also check the configuration of the swap file. Years ago I got inexplicable segfaults with Java 8 also somewhere in a resize or allocation method. In my case the size of the OS's swap file was set to zero.
What error do you see on top of the error log file? You only copied the information of the single thread.
UPDATE
You definitely do not have a problem with GC. If GC would be overloaded you would some when get an java.lang.OutOfMemoryError with the message:
GC Overhead limit exceeded
GC tries to collect garbage but it also has CPU constraints. Concrete behavior depends on the actual GC implementation but usually garbage will accumulate (see your big OldGen) before the GC uses more CPU cycles. So an increased heap usage is completely normal as long as you do not get the mentioned OOM error.
The segmentation faults in the native code are an indicator that there's something wrong with accessing native memory. You even get segmentation faults when the JVM tries to generate a dump. This is an additional indicator for a general problem with accessing native memory.
What's still unanswered is whether you really have enough native memory for all the processes running on your host.
Linux's overcommitment of memory usually triggers the OOM killer. But there are situations where the OOM killer is not triggered (see the kernel documentation for details). In such cases it is possible that a process may die with a SIGSEGV. Like other native applications also the JVM makes use of mmap. Also the man pages of mmap mention that depending on the used parameters a SIGSEGV may occur upon a write if no physical memory is available.

vmstat pidstat result analysis for Java program

vmstat 1 100
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 0 0 2307452 283392 712136 0 0 2 2 17 21 0 1 99 0 0
3 0 0 2307436 283392 712136 0 0 0 0 10677 3455 21 43 35 0 0
4 0 0 2307436 283392 712136 0 0 0 0 10700 3620 22 42 36 0 0
3 0 0 2307436 283392 712136 0 0 0 0 10549 3523 21 43 36 0 0
pidstat -I -w -p 3809 2
PID cswch/s nvcswch/s Command
3809 0.00 0.00 java
3809 0.00 0.00 java
3809 0.00 0.00 java
I am doing a pressure test. The server program is a WebSocket server, which accepts 10,000 client connections. Each client connection sends a message to server every 2 seconds, and server responds a message to each client every 2 seconds.
My question is :
1) From vmstat 1 100, it seems that the cpu( sy is 42%, us is 21% around) is doing much system-level work instead of user-level work. So I think there are too much context switch for CPU.
However, from pidstat, the cswch/s and nvcswch/s are all 0 for the server program. I think this result means that there are not much context switch for CPU.
Could anybody help explain the result of the Linux server monitoring result?

pidstat is referencing to process 3809
vmstat is measuring all the processes of the system cause you set only the sampling frequency

How to put tomcat7 always online?

I have Ubuntu 14.04 cloud server with 512MB RAM on Digital Ocean and installed tomcat7 in order to accept my Java applications, also there is a wordpress site running on it with little accesses. So, I created a REST Web Service that needs to always be online because there are accesses by Android Apps. The problem is when I don't use the WS for sometime it goes down and I have to manually start tomcat again.
When I ask for tomcat' status I have the answer below:
Tomcat Servlet engine is not running, but pid file exists.
Here is a memory log of server in normal state:
total used free shared buffers cached
Mem: 490 480 9 64 6 119
-/+ buffers/cache: 354 135
Swap: 0 0 0
Top command:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
8835 www-data 20 0 363904 65472 37244 S 16.6 13.0 0:31.02 php5-fpm
12625 www-data 20 0 361052 63896 35704 S 8.3 12.7 0:13.30 php5-fpm
24655 mysql 20 0 891176 56332 1576 S 1.7 11.2 72:04.31 mysqld
11509 www-data 20 0 361696 65796 37168 S 1.3 13.1 0:16.99 php5-fpm
7 root 20 0 0 0 0 S 0.3 0.0 4:31.17 rcu_sched
28 root 20 0 0 0 0 S 0.3 0.0 0:44.41 kswapd0
123 root 20 0 0 0 0 S 0.3 0.0 3:26.29 jbd2/vda1-8
744 www-data 20 0 91112 2400 540 S 0.3 0.5 0:53.93 nginx
13305 tomcat7 20 0 1126588 144516 5792 S 0.3 28.8 0:44.17 java
14557 root 20 0 24820 1508 1100 R 0.3 0.3 0:00.07 top
1 root 20 0 33504 1504 120 S 0.0 0.3 1:59.18 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.29 kthreadd
3 root 20 0 0 0 0 S 0.0 0.0 0:03.83 ksoftirqd/0
5 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0H
8 root 20 0 0 0 0 R 0.0 0.0 4:37.10 rcuos/0
9 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcu_bh
10 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcuob/0
Using the jmap -heap in tomcat process i have these details:
using thread-local object allocation.
Concurrent Mark-Sweep GC
Heap Configuration:
MinHeapFreeRatio = 40
MaxHeapFreeRatio = 70
MaxHeapSize = 134217728 (128.0MB)
NewSize = 1310720 (1.25MB)
MaxNewSize = 44695552 (42.625MB)
OldSize = 5439488 (5.1875MB)
NewRatio = 2
SurvivorRatio = 8
PermSize = 21757952 (20.75MB)
MaxPermSize = 174063616 (166.0MB)
G1HeapRegionSize = 0 (0.0MB)
Heap Usage:
New Generation (Eden + 1 Survivor Space):
capacity = 2424832 (2.3125MB)
used = 280872 (0.26786041259765625MB)
free = 2143960 (2.0446395874023438MB)
11.583152977195946% used
Eden Space:
capacity = 2162688 (2.0625MB)
used = 242168 (0.23094940185546875MB)
free = 1920520 (1.8315505981445312MB)
11.197546756628787% used
From Space:
capacity = 262144 (0.25MB)
used = 38704 (0.0369110107421875MB)
free = 223440 (0.2130889892578125MB)
14.764404296875% used
To Space:
capacity = 262144 (0.25MB)
used = 0 (0.0MB)
free = 262144 (0.25MB)
0.0% used
concurrent mark-sweep generation:
capacity = 34521088 (32.921875MB)
used = 26207256 (24.993186950683594MB)
free = 8313832 (7.928688049316406MB)
75.91665708798054% used
Perm Generation:
capacity = 50319360 (47.98828125MB)
used = 43680848 (41.65730285644531MB)
free = 6638512 (6.3309783935546875MB)
86.8072407916158% used
16661 interned Strings occupying 2074936 bytes.
Does anybody know how to always put it online?

OK, if you have a 512 MB RAM server, and you have MySQL and PHP5 running, the JVM will probably have crashed with an OutOfMemory exception.
In the jmap output, the only important number is the free memory of the concurrent mark sweep generation, where you have only 7.9 MB free, which sounds very small for a web service.
Before it crashes, the JVM will also spend a lot of time trying to garbage collect, which could lead to the process becoming non-responsive, even before it crashes completely.
You could add 1GB of swap (IIRC, linux admins recommend swap = 2 x ram).
See e.g. http://www.prowebdev.us/2012/05/amazon-ec2-linux-micro-swap-space.html for AWS, will probably work on Digital Ocean, too.
The MySQL and PHP5 processes can probably swap out a lot of unused allocated memory. If that slows your applications down too much, you'll probably need some more RAM, or move the PHP and MySQL to different servers.

Max memory allocation on AWS EBS

I am using ElasticBeanstalk to host my Grails app.
As of now I am using m1.small instance which has 1.7 GiB of main Memory. My question is what is the MaxHeap and Max PermGen I can allocate to my instance? As of now my configuration looks like below
Initial JVM heap size: 512m
JVM command line options: -XX:+CMSClassUnloadingEnabled -XX:+CMSPermGenSweepingEnabled -XX:+UseConcMarkSweepGC
Maximum JVM heap size: 512m
Maximum JVM permanent generation size: 256m
Any suggestion for selecting the optimum numbers so that I can use max memory for my Tomcat and still have enough left for the OS itself?
Rephrasing the question what is the MAX out of 1.7 GiB can I allocate to something other than the OS(tomcat in this case)

You really need to profile your application stack.
It's easy to increase the size of the heap above the amount of physical memory by adding swap which increases Virtual memory.
For performance you want to find the right fit between instance and application.
Run your application stack in a vagrant instance first and measure it. Adapt until
you get consistent performance, but m1.small instances may be slowed down by others on the same physical host.
That said, you can run top, and sort on Virtual memory.
Example: (Compare VIRT, RES, and SHR columns)
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1220 root 20 0 308m 1316 868 S 0.0 0.3 0:00.43 VBoxService
1126 root 20 0 243m 1576 1040 S 0.0 0.3 0:00.02 rsyslogd
1324 root 20 0 114m 1380 764 S 0.0 0.3 0:00.84 crond
1966 vagrant 20 0 105m 1880 1536 S 0.0 0.4 0:00.12 bash
1962 root 20 0 95788 3740 2832 S 0.0 0.7 0:00.18 sshd
1965 vagrant 20 0 95788 1748 836 S 0.0 0.3 0:00.11 sshd
1323 postfix 20 0 78868 3280 2448 S 0.0 0.7 0:00.00 qmgr
1322 postfix 20 0 78800 3232 2408 S 0.0 0.6 0:00.00 pickup
1314 root 20 0 78720 3252 2400 S 0.0 0.6 0:00.01 master
1238 root 20 0 64116 1180 512 S 0.0 0.2 0:00.80 sshd
This is on a rather small vagrant instance:
$ cat /proc/meminfo
MemTotal: 502412 kB
MemFree: 350976 kB
Buffers: 27148 kB
Cached: 45668 kB
SwapCached: 0 kB
Active: 45240 kB
Inactive: 39616 kB
Active(anon): 12164 kB
Inactive(anon): 44 kB
Active(file): 33076 kB
Inactive(file): 39572 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 1048568 kB
SwapFree: 1048568 kB
Dirty: 4 kB
Writeback: 0 kB

Troubleshooting unbounded Java Resident Set Size(RSS) growth

I have a standalone Java application which has:
-Xmx1024m -Xms1024m -XX:MaxPermSize=256m -XX:PermSize=256m
Over the course of time it hogs more and more memory, starts to swap(and slow down) and eventually died a number of times(not OOM+dump, just died, nothing on /var/log/messages).
What I've tried so far:
Heap dumps: live objects take 200-300Mb out of 1G heap --> ok with heap
Number of live threads is rather constant(~60-70) --> ok with thread stacks
JMX stops answering at some point(mb it answers but timeout is lower)
Turn off swap - it dies faster
strace - seems everything slows down a bit, app still haven't died, and not sure for which things look there
Checking top: VIRT grows to 5.5Gb, RSS to 3.7 Gb
Checking vmstat(obviously we start to swap):
--------------------------procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
Sun Jul 22 16:10:26 2012: r b swpd free buff cache si so bi bo in cs us sy id wa st
Sun Jul 22 16:48:41 2012: 0 0 138652 2502504 40360 706592 1 0 169 21 1047 206 20 1 74 4 0
. . .
Sun Jul 22 18:10:59 2012: 0 0 138648 24816 58600 1609212 0 0 124 669 913 24436 43 22 34 2 0
Sun Jul 22 19:10:22 2012: 33 1 138644 33304 4960 1107480 0 0 100 536 810 19536 44 22 23 10 0
Sun Jul 22 20:10:28 2012: 54 1 213916 26928 2864 578832 3 360 100 710 639 12702 43 16 30 11 0
Sun Jul 22 21:10:43 2012: 0 0 629256 26116 2992 467808 84 176 278 1320 1293 24243 50 19 29 3 0
Sun Jul 22 22:10:55 2012: 4 0 772168 29136 1240 165900 203 94 435 1188 1278 21851 48 16 33 2 0
Sun Jul 22 23:10:57 2012: 0 1 2429536 26280 1880 169816 6875 6471 7081 6878 2146 8447 18 37 1 45 0
sar also shows steady system% growth = swapping:
15:40:02 CPU %user %nice %system %iowait %steal %idle
17:40:01 all 51.00 0.00 7.81 3.04 0.00 38.15
19:40:01 all 48.43 0.00 18.89 2.07 0.00 30.60
20:40:01 all 43.93 0.00 15.84 5.54 0.00 34.70
21:40:01 all 46.14 0.00 15.44 6.57 0.00 31.85
22:40:01 all 44.25 0.00 20.94 5.43 0.00 29.39
23:40:01 all 18.24 0.00 52.13 21.17 0.00 8.46
12:40:02 all 22.03 0.00 41.70 15.46 0.00 20.81
Checking pmap gaves the following largest contributors:
000000005416c000 1505760K rwx-- [ anon ]
00000000b0000000 1310720K rwx-- [ anon ]
00002aaab9001000 2079748K rwx-- [ anon ]
Trying to correlate addresses I've got from pmap from stuff dumped by strace gave me no matches
Adding more memory is not practical(just make problem appear later)
Switching JVM's is not possible(env is not under our control)
And the question is:
What else can I try to track down the problem's cause or try to work around it?

Something in your JVM is using an "unbounded" amount of non-Heap memory. Some possible candidates are:
Thread stacks.
Native heap allocated by some native code library.
Memory-mapped files.
The first possibility will show up as a large (and increasing) number of threads when you take a thread stack dump. (Just check it ... OK?)
The second one you can (probably) eliminate if your application (or some 3rd part library it uses) doesn't use any native libraries.
The third one you can eliminate if your application (or some 3rd part library it uses) doesn't use memory mapped files.
I would guess that the reason that you are not seeing OOME's is that your JVM is being killed by the Linux OOM killer. It is also possible that the JVM is bailing out in native code (e.g. due to a malloc failure not being handled properly), but I'd have thought that a JVM crash dump would be the more likely outcome ...

Problem was in a profiler library attached - it recorded CPU calls/allocation sites, thus required memory to store that.
So, human factor here :)

There is a known problem with Java and glibc >= 2.10 (includes Ubuntu >= 10.04, RHEL >= 6).
The cure is to set this env. variable:
export MALLOC_ARENA_MAX=4
There is an IBM article about setting MALLOC_ARENA_MAX
https://www.ibm.com/developerworks/community/blogs/kevgrig/entry/linux_glibc_2_10_rhel_6_malloc_may_show_excessive_virtual_memory_usage?lang=en
This blog post says
resident memory has been known to creep in a manner similar to a
memory leak or memory fragmentation.
search for MALLOC_ARENA_MAX on Google or SO for more references.
You might want to tune also other malloc options to optimize for low fragmentation of allocated memory:
# tune glibc memory allocation, optimize for low fragmentation
# limit the number of arenas
export MALLOC_ARENA_MAX=2
# disable dynamic mmap threshold, see M_MMAP_THRESHOLD in "man mallopt"
export MALLOC_MMAP_THRESHOLD_=131072
export MALLOC_TRIM_THRESHOLD_=131072
export MALLOC_TOP_PAD_=131072
export MALLOC_MMAP_MAX_=65536

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.