High iowait with java processes on linux

High iowait with java processes on linux - java

I have a concurrent system with many machines/nodes involved. Each machine run several JVMs doing different stuff. It is a "layered" architecture where each layer consists of many JVM running across the machines. Basically the top-layer JVM receives input from the outside via files, parses the input and sends it as many small records for "storage" in layer-two. Layer-two doesn't actually persist the data itself but actually persists it in layer-three (HBase and Solr) and HBase actually doesn't persist it itself either since it sends it to layer-four (HDFS) for persistence.
Most of the communication among the layers is synchronized so of course it ends up in a lot of threads waiting for lower layers to complete. But I would expect those waiting threads to be "free" wrt CPU usage.
I see a very high iowait (%wa in top) though - something like 80-90% iowait and only 10-20% sys/usr CPU usage. The system seems exhausted - slow to login via ssh and slow to respond to commands etc.
My question is if all those JVM threads waiting for lower layers to complete can cause this? Is it not supposed to be "free" waiting for responses (sockets). Does it matter with respect to this, whether the different layers uses blocking or non-blocking (NIO) io? Exactly in what situations does Linux count something as iowait (%wa in top)? When all threads in all JVMs on the machines are in a situation where it is waiting (counting because there is no other thread to run to do something meaningful in the meantime)? Or does threads waiting also count in %wa even though there are other processes ready to use the CPU for real processing?
I would really want to get a thorough explanation on how it works and how to interpret this high %wa. In the beginning I guessed that it counted as %wa when all threads where waiting, but that there where actually plenty of room for doing more, so I tried to increase the number of threads expecting to get more throughput, but that doesn't happen. So it is a real problem, not just a "visual" problem looking at top.
The output below is taken from a machine where only HBase and HDFS is running. It is on machines with HBase and/or HDFS that the problem i showing (most clearly)
--- jps ---
19498 DataNode
19690 HRegionServer
19327 SecondaryNameNode
---- typical top -------
top - 11:13:21 up 14 days, 18:20, 1 user, load average: 4.83, 4.50, 4.25
Tasks: 99 total, 1 running, 98 sleeping, 0 stopped, 0 zombie
Cpu(s): 14.1%us, 4.3%sy, 0.0%ni, 5.4%id, 74.8%wa, 0.0%hi, 1.3%si, 0.0%st
Mem: 7133800k total, 7099632k used, 34168k free, 55540k buffers
Swap: 487416k total, 248k used, 487168k free, 2076804k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+
COMMAND
19690 hbase 20 0 4629m 4.2g 9244 S 51 61.7 194:08.84 java
19498 hdfs 20 0 1030m 116m 9076 S 16 1.7 75:29.26 java
---- iostat -kd 1 ----
root#edrxen1-2:~# iostat -kd 1
Linux 2.6.32-29-server (edrxen1-2) 02/22/2012 _x86_64_ (2 CPU)
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
xvda 3.53 3.36 15.66 4279502 19973226
dm-0 319.44 6959.14 422.37 8876213913 538720280
dm-1 0.00 0.00 0.00 912 624
xvdb 229.03 6955.81 406.71 8871957888 518747772
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
xvda 0.00 0.00 0.00 0 0
dm-0 122.00 3852.00 0.00 3852 0
dm-1 0.00 0.00 0.00 0 0
xvdb 105.00 3252.00 0.00 3252 0
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
xvda 0.00 0.00 0.00 0 0
dm-0 57.00 1712.00 0.00 1712 0
dm-1 0.00 0.00 0.00 0 0
xvdb 78.00 2428.00 0.00 2428 0
--- iostat -x ---
Linux 2.6.32-29-server (edrxen1-2) 02/22/2012 _x86_64_ (2 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
8.06 0.00 3.29 65.14 0.08 23.43
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
xvda 0.00 0.74 0.35 3.18 6.72 31.32 10.78 0.11 30.28 6.24 2.20
dm-0 0.00 0.00 213.15 106.59 13866.95 852.73 46.04 1.29 14.41 2.83 90.58
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 8.00 0.00 5.78 1.12 0.00
xvdb 0.07 86.97 212.73 15.69 13860.27 821.42 64.27 2.44 25.21 3.96 90.47
--- free -o ----
total used free shared buffers cached
Mem: 7133800 7099452 34348 0 55612 2082364
Swap: 487416 248 487168

IO wait on Linux indicates that processes are blocked on uninterruptible I/O. In practice, it typically means that the process is performing disk access -- in this case, I'd guess one of the following:
hdfs is performing a lot of disk accesses, and it's making other disk access slow as a result. (Checking iostat -x may help, as it'll show an extra "%util" column which indicates what percentage of the time the disk is "busy".)
You're running low on system memory under load, and are ending up dipping into swap sometimes.

Related

How do I decide on a suitable TLABSIZE setting for a Java application?

My Java application on an single cpu arm7 (32bit) device using Java 14 is occasionally crashing
after running under load for a number of hours, and is always failing in ThreadLocalAllocBuffer::resize()
A fatal error has been detected by the Java Runtime Environment:
#
SIGSEGV (0xb) at pc=0xb6cd515e, pid=1725, tid=1733
#
JRE version: OpenJDK Runtime Environment (14.0+36) (build 14+36)
Java VM: OpenJDK Client VM (14+36, mixed mode, serial gc, linux-arm)
Problematic frame:
V
#
No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
If you would like to submit a bug report, please visit:
https://bugreport.java.com/bugreport/crash.jsp
#
--------------- S U M M A R Y ------------
Command Line: -Duser.home=/mnt/app/share/log -Djdk.lang.Process.launchMechanism=vfork -Xms150m -Xmx900m -Dcom.mchange.v2.log.MLog=com.mchange.v2.log.jdk14logging.Jdk14MLog -Dorg.jboss.logging.provider=jdk -Djava.util.logging.config.class=com.jthink.songkong.logging.StandardLogging --add-opens=java.base/java.lang=ALL-UNNAMED lib/songkong-6.9.jar -r
Host: Marvell PJ4Bv7 Processor rev 1 (v7l), 1 cores, 1G, Buildroot 2014.11-rc1
Time: Fri Apr 24 19:36:54 2020 BST elapsed time: 37456 seconds (0d 10h 24m 16s)
--------------- T H R E A D ---------------
Current thread (0xb6582a30): VMThread "VM Thread" [stack: 0x7b716000,0x7b796000] [id=3625] _threads_hazard_ptr=0x7742f140
Stack: [0x7b716000,0x7b796000], sp=0x7b7946b0, free space=505k
Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [libjvm.so+0x48015e] ThreadLocalAllocBuffer::resize()+0x85
[error occurred during error reporting (printing native stack), id 0xb, SIGSEGV (0xb) at pc=0xb6b4ccae]
Now this must surely be bug in JVM, but as its not one of the standard Java platforms and I dont have a simple test case I cannot see it getting fixed anytime soon, so I am trying to workaround it. Its also worth noting that it crashed with ThreadLocalAllocBuffer::accumulate_statistics_before_gc() when I used Java 11 which is why I moved to Java 14 to try and resolve the issue.
As the the issue is with TLABs one solution is to disable TLABS with -XX:-UseTLAB but that makes the code run slower on an already slow machine.
So I think another solution is to disable resizing with -XX:-ResizeTLAB, but then I need to know work out a suitable size and specify that using -XX:TLABSize=N. But I am not sure what N actually represents and what would be a suitable size to set
I tried setting -XX:TLABSize=1000000 which seems to me to be quite large ?
I have some logging set with
-Xlog:tlab*=debug,tlab*=trace:file=gc.log:time:filecount=7,filesize=8M
but I don't really understand the output.
[2020-05-19T15:43:43.836+0100] ThreadLocalAllocBuffer::compute_size(132) returns 250132
[2020-05-19T15:43:43.837+0100] TLAB: fill thread: 0x0026d548 [id: 871] desired_size: 976KB slow allocs: 0 refill waste: 15624B alloc: 0.25725 1606KB refills: 1 waste 0.0% gc: 0B slow: 0B fast: 0B
[2020-05-19T15:43:43.853+0100] ThreadLocalAllocBuffer::compute_size(6) returns 250006
[2020-05-19T15:43:43.854+0100] TLAB: fill thread: 0xb669be48 [id: 32635] desired_size: 976KB slow allocs: 0 refill waste: 15624B alloc: 0.00002 0KB refills: 1 waste 0.0% gc: 0B slow: 0B fast: 0B
[2020-05-19T15:43:43.910+0100] ThreadLocalAllocBuffer::compute_size(4) returns 250004
[2020-05-19T15:43:43.911+0100] TLAB: fill thread: 0x76c1d6f8 [id: 917] desired_size: 976KB slow allocs: 0 refill waste: 15624B alloc: 0.91261 8085KB refills: 1 waste 0.0% gc: 0B slow: 0B fast: 0B
[2020-05-19T15:43:43.962+0100] ThreadLocalAllocBuffer::compute_size(2052) returns 252052
[2020-05-19T15:43:43.962+0100] TLAB: fill thread: 0x76e06f10 [id: 534] desired_size: 976KB slow allocs: 4 refill waste: 15688B alloc: 0.13977 1612KB refills: 2 waste 0.2% gc: 0B slow: 4520B fast: 0B
[2020-05-19T15:43:43.982+0100] ThreadLocalAllocBuffer::compute_size(28878) returns 278878
[2020-05-19T15:43:43.983+0100] TLAB: fill thread: 0x76e06f10 [id: 534] desired_size: 976KB slow allocs: 4 refill waste: 15624B alloc: 0.13977 1764KB refills: 3 waste 0.3% gc: 0B slow: 10424B fast: 0B
[2020-05-19T15:43:44.023+0100] ThreadLocalAllocBuffer::compute_size(4) returns 250004
[2020-05-19T15:43:44.023+0100] TLAB: fill thread: 0x7991df20 [id: 32696] desired_size: 976KB slow allocs: 0 refill waste: 15624B alloc: 0.00132 19KB refills: 1 waste 0.0% gc: 0B slow: 0B fast: 0B
Update
I reran with -XX:+HeapDumpOnOutOfMemoryError option added, and this time it showed:
java.lang.OutOfMemoryError: Java heap space
Dumping heap to java_pid1600.hprof ...
but then the dump itself failed with
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0xb6a81b9a, pid=1600, tid=1606
#
# JRE version: OpenJDK Runtime Environment (14.0+36) (build 14+36)
# Java VM: OpenJDK Client VM (14+36, mixed mode, serial gc, linux-arm)
# Problematic frame:
# V [libjvm.so+0x22eb9a] DumperSupport::dump_field_value(DumpWriter*, char, oopDesc*, int)+0x91
#
# No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /mnt/system/config/Apps/SongKong/songkong/hs_err_pid1600.log
#
# If you would like to submit a bug report, please visit:
# https://bugreport.java.com/bugreport/crash.jsp
I am not clear if the dump failed because of ulimit or soemthing else, but
java_pid1600.hprof was created but was empty
I was also monitoring the process with jstat -gc, and jstat -gcutil. I paste the end of the putput here, to me it does not look like there was a particular memory problem before the crash, although I am only checking every 5 seconds so maybe that is the issue ?
[root#N1-0247 bin]# ./jstat -gc 1600 5s
S0C S1C S0U S1U EC EU OC OU MC MU CCSC CCSU YGC YGCT FGC FGCT CGC CGCT GCT
........
30720.0 30720.0 0.0 0.0 245760.0 236647.2 614400.0 494429.2 50136.0 49436.9 0.0 0.0 5084 3042.643 155 745.523 - - 3788.166
30720.0 30720.0 0.0 28806.1 245760.0 244460.2 614400.0 506541.7 50136.0 49436.9 0.0 0.0 5085 3043.887 156 745.523 - - 3789.410
30720.0 30720.0 28760.4 0.0 245760.0 245760.0 614400.0 514809.7 50136.0 49437.2 0.0 0.0 5086 3044.895 157 751.204 - - 3796.098
30720.0 30720.0 0.0 231.1 245760.0 234781.8 614400.0 514809.7 50136.0 49437.2 0.0 0.0 5087 3044.895 157 755.042 - - 3799.936
30720.0 30720.0 0.0 0.0 245760.0 190385.5 614400.0 519650.7 50136.0 49449.6 0.0 0.0 5087 3045.905 159 758.890 - - 3804.795
30720.0 30720.0 0.0 0.0 245760.0 190385.5 614400.0 519650.7 50136.0 49449.6 0.0 0.0 5087 3045.905 159 758.890 - - 3804.795
[root#N1-0247 bin]# ./jstat -gc 1600 5s
S0 S1 E O M CCS YGC YGCT FGC FGCT CGC CGCT GCT
..............
99.70 0.00 100.00 75.54 98.56 - 5080 3037.321 150 724.674 - - 3761.995
0.00 29.93 99.30 75.55 98.56 - 5081 3038.403 151 728.584 - - 3766.987
0.00 100.00 99.30 75.94 98.56 - 5081 3039.405 152 728.584 - - 3767.989
100.00 0.00 99.14 76.14 98.56 - 5082 3040.366 153 734.088 - - 3774.454
0.00 96.58 99.87 78.50 98.57 - 5083 3041.366 154 737.960 - - 3779.325
56.99 0.00 100.00 78.50 98.58 - 5084 3041.366 154 741.880 - - 3783.246
0.00 0.00 96.29 80.47 98.61 - 5084 3042.643 155 745.523 - - 3788.166
0.00 93.77 99.47 82.44 98.61 - 5085 3043.887 156 745.523 - - 3789.410
93.62 0.00 100.00 83.79 98.61 - 5086 3044.895 157 751.204 - - 3796.098
0.00 0.76 95.53 83.79 98.61 - 5087 3044.895 157 755.042 - - 3799.936
0.00 0.00 77.47 84.58 98.63 - 5087 3045.905 159 758.890 - - 3804.795
0.00 0.00 77.47 84.58 98.63 - 5087 3045.905 159 758.890 - - 3804.795
Update Latest run
Configured gclogging, i get many
Pause Young (Allocation Failure)
errors, does this indicate I need to make the eden space larger?
[2020-05-29T14:00:22.668+0100] GC(44) Pause Young (GCLocker Initiated GC)
[2020-05-29T14:00:22.739+0100] GC(44) DefNew: 43230K(46208K)->4507K(46208K) Eden: 41088K(41088K)->0K(41088K) From: 2142K(5120K)->4507K(5120K)
[2020-05-29T14:00:22.739+0100] GC(44) Tenured: 50532K(102400K)->50532K(102400K)
[2020-05-29T14:00:22.740+0100] GC(44) Metaspace: 40054K(40536K)->40054K(40536K)
[2020-05-29T14:00:22.740+0100] GC(44) Pause Young (GCLocker Initiated GC) 91M->53M(145M) 72.532ms
[2020-05-29T14:00:22.741+0100] GC(44) User=0.07s Sys=0.00s Real=0.07s
[2020-05-29T14:00:25.196+0100] GC(45) Pause Young (Allocation Failure)
[2020-05-29T14:00:25.306+0100] GC(45) DefNew: 45595K(46208K)->2150K(46208K) Eden: 41088K(41088K)->0K(41088K) From: 4507K(5120K)->2150K(5120K)
[2020-05-29T14:00:25.306+0100] GC(45) Tenured: 50532K(102400K)->53861K(102400K)
[2020-05-29T14:00:25.307+0100] GC(45) Metaspace: 40177K(40664K)->40177K(40664K)
[2020-05-29T14:00:25.307+0100] GC(45) Pause Young (Allocation Failure) 93M->54M(145M) 111.252ms
[2020-05-29T14:00:25.308+0100] GC(45) User=0.08s Sys=0.02s Real=0.11s
[2020-05-29T14:00:29.248+0100] GC(46) Pause Young (Allocation Failure)
[2020-05-29T14:00:29.404+0100] GC(46) DefNew: 43238K(46208K)->4318K(46208K) Eden: 41088K(41088K)->0K(41088K) From: 2150K(5120K)->4318K(5120K)
[2020-05-29T14:00:29.405+0100] GC(46) Tenured: 53861K(102400K)->53861K(102400K)
[2020-05-29T14:00:29.405+0100] GC(46) Metaspace: 40319K(40792K)->40319K(40792K)
[2020-05-29T14:00:29.406+0100] GC(46) Pause Young (Allocation Failure) 94M->56M(145M) 157.614ms
[2020-05-29T14:00:29.406+0100] GC(46) User=0.07s Sys=0.00s Real=0.16s
[2020-05-29T14:00:36.466+0100] GC(47) Pause Young (Allocation Failure)
[2020-05-29T14:00:36.661+0100] GC(47) DefNew: 45406K(46208K)->5120K(46208K) Eden: 41088K(41088K)->0K(41088K) From: 4318K(5120K)->5120K(5120K)
[2020-05-29T14:00:36.662+0100] GC(47) Tenured: 53861K(102400K)->55125K(102400K)
[2020-05-29T14:00:36.662+0100] GC(47) Metaspace: 40397K(40920K)->40397K(40920K)
[2020-05-29T14:00:36.663+0100] GC(47) Pause Young (Allocation Failure) 96M->58M(145M) 196.531ms
[2020-05-29T14:00:36.663+0100] GC(47) User=0.09s Sys=0.01s Real=0.19s
[2020-05-29T14:00:40.523+0100] GC(48) Pause Young (Allocation Failure)
[2020-05-29T14:00:40.653+0100] GC(48) DefNew: 44274K(46208K)->2300K(46208K) Eden: 39154K(41088K)->0K(41088K) From: 5120K(5120K)->2300K(5120K)
[2020-05-29T14:00:40.653+0100] GC(48) Tenured: 55125K(102400K)->59965K(102400K)
[2020-05-29T14:00:40.654+0100] GC(48) Metaspace: 40530K(41048K)->40530K(41048K)
[2020-05-29T14:00:40.654+0100] GC(48) Pause Young (Allocation Failure) 97M->60M(145M) 131.365ms
[2020-05-29T14:00:40.655+0100] GC(48) User=0.11s Sys=0.01s Real=0.14s
[2020-05-29T14:00:43.936+0100] GC(49) Pause Young (Allocation Failure)
[2020-05-29T14:00:44.100+0100] GC(49) DefNew: 43388K(46208K)->5120K(46208K) Eden: 41088K(41088K)->0K(41088K) From: 2300K(5120K)->5120K(5120K)
Updated with gc analysis done by gceasy
Okay so this is useful I uploaded log to gceasy.org and it clearly shows that shortly before it crashed heap size was significantly higher and approaching the 900mb limit,even after a number of full gcs, so I think basically it ran out of heap space.
What is a little frustrating is I have the
-XX:+HeapDumpOnOutOfMemoryError
option enabled, but when it crashes it reports an issue trying to do create the dump file so I cannot get one.
And when I process the same file on Windows with the same setting for heap size it suceeds without failure, But Im goinf to run again ewith gclogging enabled and see if it reaches simailr levels even if it doesnt actually fall over.
Ran again (this is building on chnages made in previous run and doesnt show start of run) but to me the memory usage is higher but looks quite normal (sawtooth pattern) with no particular differenc ebefore the crash.
Update
With last run I reduced max heap from 900MB to 600MB, but I also monitored with vmstat, Yo can see clearly below where the applciation crashed but It doesn't seem we were approaching particularly ow memory at this point.
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
3 0 0 57072 7812 1174128 0 0 5360 0 211 558 96 4 0 0 0
1 0 0 55220 7812 1176184 0 0 2048 0 203 467 79 21 0 0 0
3 0 0 61296 7812 1169096 0 0 2036 44 193 520 96 4 0 0 0
2 0 0 59808 7812 1171144 0 0 2048 32 212 522 96 4 0 0 0
1 0 0 59436 7812 1171144 0 0 0 0 180 307 83 17 0 0 0
1 0 0 59436 7812 1171144 0 0 0 0 179 173 100 0 0 0 0
1 0 0 59436 7812 1171128 0 0 0 0 179 184 100 0 0 0 0
2 1 0 51764 7816 1158452 0 0 4124 52 190 490 80 20 0 0 0
3 0 0 63428 7612 1146388 0 0 20472 48 251 533 86 14 0 0 0
2 0 0 63428 7616 1146412 0 0 4 0 196 508 99 1 0 0 0
2 0 0 84136 7616 1146400 0 0 0 0 186 461 84 16 0 0 0
2 0 0 61436 7608 1148960 0 0 24601 0 325 727 77 23 0 0 0
4 0 0 60196 7648 1150204 0 0 1160 76 232 611 98 2 0 0 0
4 0 0 59204 7656 1151052 0 0 52 376 305 570 80 20 0 0 0
3 0 0 59204 7656 1151052 0 0 0 0 378 433 96 4 0 0 0
1 0 0 762248 7768 1151420 0 0 106 0 253 660 74 26 0 0 0
0 0 0 859272 8188 1151892 0 0 417 0 302 550 9 26 64 1 0
0 0 0 859272 8188 1151892 0 0 0 0 111 132 0 0 100 0 0

Based on your jstat data and their explanation here: https://docs.oracle.com/en/java/javase/11/tools/jstat.html#GUID-5F72A7F9-5D5A-4486-8201-E1D1BA8ACCB5
I would not expect OutOfMemoryError just yet from the HeapSpace based on the slow and steady rate of the Old Generation filling up and the small size of the from and to space (not that I know whether your application might allocate a huge array anytime soon) unless:
initial heap size (-Xms) is smaller than the max (-Xmx) and
Linux has overcomitted virtual memory
If you do overcommit (and who doesn't) maybe you should keep an eye on Linux with vmstat 1 or gathering data frequently for sar
But I do wonder why you are not using Garbage Collection logging with -Xlog:gc*:stderr or to a file with -Xlog:gc*:file= and maybe analyze that with https://gceasy.io/ as it is very low overhead (unless writing to the logfile is slow) and very precise. For more information on the logging syntax see: https://openjdk.java.net/jeps/158 and https://openjdk.java.net/jeps/271
java -Xlog:gc*:stderr -jar yourapp.jar
and analyze those logs with great ease with tools like these:
https://gceasy.io/
JClarity Censum
This should give similar information as jstack and more in realtime (as far as I know)

I think you may already be on the wrong track:
It is more likely that your process has a general problem with allocating memory than that there are two different bugs in two different Java versions.
Have you already checked whether the process has enough memory? A segmentation fault can also occur when the process runs out of memory. I would also check the configuration of the swap file. Years ago I got inexplicable segfaults with Java 8 also somewhere in a resize or allocation method. In my case the size of the OS's swap file was set to zero.
What error do you see on top of the error log file? You only copied the information of the single thread.
UPDATE
You definitely do not have a problem with GC. If GC would be overloaded you would some when get an java.lang.OutOfMemoryError with the message:
GC Overhead limit exceeded
GC tries to collect garbage but it also has CPU constraints. Concrete behavior depends on the actual GC implementation but usually garbage will accumulate (see your big OldGen) before the GC uses more CPU cycles. So an increased heap usage is completely normal as long as you do not get the mentioned OOM error.
The segmentation faults in the native code are an indicator that there's something wrong with accessing native memory. You even get segmentation faults when the JVM tries to generate a dump. This is an additional indicator for a general problem with accessing native memory.
What's still unanswered is whether you really have enough native memory for all the processes running on your host.
Linux's overcommitment of memory usually triggers the OOM killer. But there are situations where the OOM killer is not triggered (see the kernel documentation for details). In such cases it is possible that a process may die with a SIGSEGV. Like other native applications also the JVM makes use of mmap. Also the man pages of mmap mention that depending on the used parameters a SIGSEGV may occur upon a write if no physical memory is available.

How to calculate Actual Memory usage for java using top linux?

I need to find out What is the actual memory consumed by Java process in Linux, the tools like visualVM/jconsole shows accurate but I must calculate actual memory used by JVM through top command.
I am looking at PID : 28169 if you look at top (linux) , it is saying 17.2g ( virtual ) , Res 10g , Shared : 15m . 10G is not possible since I Have given 6G jvmmax to this jvm process but if I use jvmtop it shows actuate results(matching with visualVM)
can someone shows me how to calculate actual usage of memory using top stats ?
Using JvmTop
JvmTop 0.8.0 alpha - 11:09:08, amd64, 12 cpus, Linux 2.6.32-57, load avg 0.00
http://code.google.com/p/jvmtop
PID 28169: com.gigaspaces.start.SystemBoot
ARGS: com.gigaspaces.start.services="GSC"
VMARGS: -XX:+AggressiveOpts -XX:MaxPermSize=256m -XX:+HeapDumpOnOutOfMemo[...]
VM: Oracle Corporation Java HotSpot(TM) 64-Bit Server VM 1.7.0_51
UP: 179:23m #THR: 90 #THRPEAK: 92 #THRCREATED: 3725 USER: evolv
GC-Time: 0: 2m #GC-Runs: 3353 #TotalLoadedClasses: 23107
CPU: 1.46% GC: 0.00% HEAP:4623m /10240m NONHEAP: 180m / 304m
TID NAME STATE CPU TOTALCPU BLOCKEDBY
3733 RMI TCP Connection(2210)-10.16 RUNNABLE 14.93% 0.00%
3734 JMX server connection timeout TIMED_WAITING 0.13% 0.00%
95 GS-directLoadJobListenerPollin TIMED_WAITING 0.12% 0.14%
94 GS-jobListenerPollingContainer TIMED_WAITING 0.11% 0.14%
3375 GS-jobListenerPollingContainer TIMED_WAITING 0.10% 0.55%
93 GS-jobListenerPollingContainer TIMED_WAITING 0.09% 0.14%
3377 GS-jobListenerPollingContainer TIMED_WAITING 0.09% 0.56%
81 GS-subJobCompleteListenerPolli TIMED_WAITING 0.09% 0.14%
3376 GS-jobListenerPollingContainer TIMED_WAITING 0.08% 0.54%
98 GS-stopJobListenerPollingConta TIMED_WAITING 0.08% 0.14%
Note: Only top 10 threads (according cpu load) are shown!
^C-bash-4.1$
using Top :
top - 11:15:30 up 18 days, 6:34, 1 user, load average: 0.00, 0.00, 0.00
Tasks: 306 total, 1 running, 304 sleeping, 1 stopped, 0 zombie
Cpu(s): 0.3%us, 0.1%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 16332776k total, 15913220k used, 419556k free, 316876k buffers
Swap: 4095996k total, 146452k used, 3949544k free, 3024048k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
28169 evolv 20 0 17.2g 10g 15m S 2.8 70.3 4493:54 java
28034 evolv 20 0 5690m 289m 7656 S 0.0 1.8 16:31.29 java
28006 evolv 20 0 5821m 286m 7952 S 0.5 1.8 18:16.50 java
2098 root 20 0 272m 145m 4016 S 0.3 0.9 46:51.51 splunkd
2163 root 20 0 128m 40m 1220 S 0.0 0.3 1:05.86 puppet
1879 root 20 0 244m 6660 5036 S 0.0 0.0 1:21.82 sssd_be

You can use jstat to view your process statistics.
Examples
jstat -gc [insert-pid-here]
The above would give you an overview of your GC heap.
other commands
jstat -gccapacity [insert-pid-here]
jstat -gcutil [insert-pid-here]

Java code run time difference b/w two different platforms

I have deployed Java code on two different servers.The code is doing File Writing operations.
On the local server ,parameters are :
uname -a
SunOS snmi5001 5.10 Generic_120011-14 sun4u sparc SUNW,SPARC-Enterprise
ulimit -a
time(seconds) unlimited
file(blocks) unlimited
data(kbytes) unlimited
stack(kbytes) 389296
coredump(blocks) unlimited
nofiles(descriptors) 20000
vmemory(kbytes) unlimited
Java Version:
java version "1.5.0_12"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_12-b04)
Java HotSpot(TM) Server VM (build 1.5.0_12-b04, mixed mode)
On a Different(lets say MIT) server :
uname -a
SunOS au11qapcwbtels2 5.10 Generic_147440-05 sun4u sparc SUNW,Sun-Fire-15000
ulimit -a
time(seconds) unlimited
file(blocks) unlimited
data(kbytes) unlimited
stack(kbytes) 8192
coredump(blocks) unlimited
nofiles(descriptors) 256
vmemory(kbytes) unlimited
java -version
java version "1.5.0_32"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_32-b05)
Java HotSpot(TM) Server VM (build 1.5.0_32-b05, mixed mode)
The problem is that the code is running signficatly slower on the MIT server.
Because of the difference in nofiles and stack for the two OS's ,i thought if i change the ulimit -s and ulimit -n it would make a difference.
I cannot change the parameters on MIT server without confirming the problem,so the decreased the ulimit parameters for the local server and retested.But code finished execution is same time.
I have no idea what difference between the OS parameters which could be causing this.
Any help is appreciated.I will post more paramters if anyone tells me what to look for.
EDIT:
For MIT Server
No of CPU: psrinfo -p
24
psrinfo -pv
The physical processor has 2 virtual processors (0 4)
UltraSPARC-IV+ (portid 0 impl 0x19 ver 0x24 clock 1800 MHz)
The physical processor has 2 virtual processors (1 5)
UltraSPARC-IV+ (portid 1 impl 0x19 ver 0x24 clock 1800 MHz)
The physical processor has 2 virtual processors (2 6)
UltraSPARC-IV+ (portid 2 impl 0x19 ver 0x24 clock 1800 MHz)
The physical processor has 2 virtual processors (3 7)
UltraSPARC-IV+ (portid 3 impl 0x19 ver 0x24 clock 1800 MHz)
The physical processor has 2 virtual processors (32 36)
UltraSPARC-IV+ (portid 32 impl 0x19 ver 0x24 clock 1800 MHz)
The physical processor has 2 virtual processors (33 37)
UltraSPARC-IV+ (portid 33 impl 0x19 ver 0x24 clock 1800 MHz)
The physical processor has 2 virtual processors (34 38)
UltraSPARC-IV+ (portid 34 impl 0x19 ver 0x24 clock 1800 MHz)
The physical processor has 2 virtual processors (35 39)
UltraSPARC-IV+ (portid 35 impl 0x19 ver 0x24 clock 1800 MHz)
The physical processor has 2 virtual processors (64 68)
UltraSPARC-IV+ (portid 64 impl 0x19 ver 0x24 clock 1800 MHz)
The physical processor has 2 virtual processors (65 69)
UltraSPARC-IV+ (portid 65 impl 0x19 ver 0x24 clock 1800 MHz)
The physical processor has 2 virtual processors (66 70)
UltraSPARC-IV+ (portid 66 impl 0x19 ver 0x24 clock 1800 MHz)
The physical processor has 2 virtual processors (67 71)
UltraSPARC-IV+ (portid 67 impl 0x19 ver 0x24 clock 1800 MHz)
The physical processor has 2 virtual processors (96 100)
UltraSPARC-IV+ (portid 96 impl 0x19 ver 0x24 clock 1800 MHz)
The physical processor has 2 virtual processors (97 101)
UltraSPARC-IV+ (portid 97 impl 0x19 ver 0x24 clock 1800 MHz)
The physical processor has 2 virtual processors (98 102)
UltraSPARC-IV+ (portid 98 impl 0x19 ver 0x24 clock 1800 MHz)
The physical processor has 2 virtual processors (99 103)
UltraSPARC-IV+ (portid 99 impl 0x19 ver 0x24 clock 1800 MHz)
The physical processor has 2 virtual processors (128 132)
UltraSPARC-IV+ (portid 128 impl 0x19 ver 0x24 clock 1800 MHz)
The physical processor has 2 virtual processors (129 133)
UltraSPARC-IV+ (portid 129 impl 0x19 ver 0x24 clock 1800 MHz)
The physical processor has 2 virtual processors (130 134)
UltraSPARC-IV+ (portid 130 impl 0x19 ver 0x24 clock 1800 MHz)
The physical processor has 2 virtual processors (131 135)
UltraSPARC-IV+ (portid 131 impl 0x19 ver 0x24 clock 1800 MHz)
The physical processor has 2 virtual processors (224 228)
UltraSPARC-IV+ (portid 224 impl 0x19 ver 0x24 clock 1800 MHz)
The physical processor has 2 virtual processors (225 229)
UltraSPARC-IV+ (portid 225 impl 0x19 ver 0x24 clock 1800 MHz)
The physical processor has 2 virtual processors (226 230)
UltraSPARC-IV+ (portid 226 impl 0x19 ver 0x24 clock 1800 MHz)
The physical processor has 2 virtual processors (227 231)
UltraSPARC-IV+ (portid 227 impl 0x19 ver 0x24 clock 1800 MHz)
kstat cpu_info :
module: cpu_info instance: 231
name: cpu_info231 class: misc
brand UltraSPARC-IV+
chip_id 227
clock_MHz 1800
core_id 231
cpu_fru hc:///component=SB7
cpu_type sparcv9
crtime 587.102844985
current_clock_Hz 1799843256
device_ID 9223937394446500460
fpu_type sparcv9
implementation UltraSPARC-IV+ (portid 227 impl 0x19 ver 0x24 clock 1800 MHz)
pg_id 48
snaptime 19846866.5310415
state on-line
state_begin 1334854522
For the Local server i could only get the kstat info :
module: cpu_info instance: 0
name: cpu_info0 class: misc
brand SPARC64-VI
chip_id 1024
clock_MHz 2150
core_id 0
cpu_fru hc:///component=/MBU_A/CPUM0
cpu_type sparcv9
crtime 288.5675516
device_ID 250691889836161
fpu_type sparcv9
implementation SPARC64-VI (portid 1024 impl 0x6 ver 0x93 clock 2150 MHz)
snaptime 207506.8330168
state on-line
state_begin 1354493257
module: cpu_info instance: 1
name: cpu_info1 class: misc
brand SPARC64-VI
chip_id 1024
clock_MHz 2150
core_id 0
cpu_fru hc:///component=/MBU_A/CPUM0
cpu_type sparcv9
crtime 323.4572206
device_ID 250691889836161
fpu_type sparcv9
implementation SPARC64-VI (portid 1024 impl 0x6 ver 0x93 clock 2150 MHz)
snaptime 207506.8336113
state on-line
state_begin 1354493292
Similarly total 59 instances .
Also the memory for local server : vmstat
kthr memory page disk faults cpu
r b w swap free re mf pi po fr de sr s0 s1 s4 s1 in sy cs us sy id
0 0 0 143845984 93159232 431 895 1249 30 29 0 2 6 0 -0 1 3284 72450 6140 11 3 86
The memory for the MIT server : vmstat
kthr memory page disk faults cpu
r b w swap free re mf pi po fr de sr m0 m1 m2 m3 in sy cs us sy id
0 0 0 180243376 184123896 81 786 248 15 15 0 0 3 14 -0 4 1854 7563 2072 1 1 98
df -h for MIT server:
Filesystem Size Used Available Capacity Mounted on
/dev/md/dsk/d0 7.9G 6.7G 1.1G 86% /
/devices 0K 0K 0K 0% /devices
ctfs 0K 0K 0K 0% /system/contract
proc 0K 0K 0K 0% /proc
mnttab 0K 0K 0K 0% /etc/mnttab
swap 171G 1.7M 171G 1% /etc/svc/volatile
objfs 0K 0K 0K 0% /system/object
sharefs 0K 0K 0K 0% /etc/dfs/sharetab
/platform/sun4u-us3/lib/libc_psr/libc_psr_hwcap2.so.1
7.9G 6.7G 1.1G 86% /platform/sun4u-us3/lib/libc_psr.so.1
/platform/sun4u-us3/lib/sparcv9/libc_psr/libc_psr_hwcap2.so.1
7.9G 6.7G 1.1G 86% /platform/sun4u-us3/lib/sparcv9/libc_psr.so.1
/dev/md/dsk/d3 7.9G 6.6G 1.2G 85% /var
swap 6.0G 56K 6.0G 1% /tmp
swap 171G 40K 171G 1% /var/run
swap 171G 0K 171G 0% /dev/vx/dmp
swap 171G 0K 171G 0% /dev/vx/rdmp
/dev/md/dsk/d5 2.0G 393M 1.5G 21% /home
/dev/vx/dsk/appdg/oravl
2.0G 17M 2.0G 1% /ora
/dev/md/dsk/d60 1.9G 364M 1.5G 19% /apps/stats
/dev/md/dsk/d4 16G 2.1G 14G 14% /var/crash
/dev/md/dsk/d61 1005M 330M 594M 36% /opt/controlm6
/dev/vx/dsk/appdg/oraproductvl
10G 2.3G 7.6G 24% /ora/product
/dev/md/dsk/d63 963M 1.0M 904M 1% /var/opt/app
/dev/vx/dsk/dmldg/appsdmlsvtvl
1.0T 130G 887G 13% /apps/dml/svt
/dev/vx/dsk/appdg/homeappusersvl
20G 19G 645M 97% /home/app/users
/dev/vx/dsk/dmldg/appsdmlmit2vl
20G 66M 20G 1% /apps/dml/mit2
/dev/vx/dsk/dmldg/datadmlmit2vl
1.9T 1.1T 773G 61% /data/dml/mit2
/dev/md/dsk/d62 9.8G 30M 9.7G 1% /usr/openv/netbackup/logs
df -h for local server :
Filesystem Size Used Available Capacity Mounted on
/dev/dsk/c0t0d0s0 20G 7.7G 12G 40% /
/devices 0K 0K 0K 0% /devices
ctfs 0K 0K 0K 0% /system/contract
proc 0K 0K 0K 0% /proc
mnttab 0K 0K 0K 0% /etc/mnttab
swap 140G 1.6M 140G 1% /etc/svc/volatile
objfs 0K 0K 0K 0% /system/object
fd 0K 0K 0K 0% /dev/fd
/dev/dsk/c0t0d0s5 9.8G 9.3G 483M 96% /var
swap 140G 504K 140G 1% /tmp
swap 140G 80K 140G 1% /var/run
swap 140G 0K 140G 0% /dev/vx/dmp
swap 140G 0K 140G 0% /dev/vx/rdmp
/dev/dsk/c0t0d0s6 9.8G 9.4G 403M 96% /opt
/dev/vx/dsk/eva8k/tlkhome
2.0G 66M 1.8G 4% /tlkhome
/dev/vx/dsk/eva8k/tlkuser4
48G 26G 20G 57% /tlkuser4
/dev/vx/dsk/eva8k/ST82
1.1G 17M 999M 2% /ST_A_82
/dev/vx/dsk/eva8k/tlkuser11
37G 37G 176M 100% /tlkuser11
/dev/vx/dsk/eva8k/oravl97
20G 12G 7.3G 63% /oravl97
/dev/vx/dsk/eva8k/tlkuser5
32G 23G 8.3G 74% /tlkuser5
/dev/vx/dsk/eva8k/mbtlkproj1
2.0G 18M 1.9G 1% /mbtlkproj1
/dev/vx/dsk/eva8k/Oravol98
38G 25G 12G 68% /oravl98
/dev/vx/dsk/eva8k_new/tlkuser15
57G 57G 0K 100% /tlkuser15
/dev/vx/dsk/eva8k/Oravol1
39G 16G 22G 42% /oravl01
/dev/vx/dsk/eva8k/Oravol99
30G 8.3G 20G 30% /oravl99
/dev/vx/dsk/eva8k/tlkuser9
18G 13G 4.8G 73% /tlkuser9
/dev/vx/dsk/eva8k/oravl08
32G 25G 6.3G 81% /oravl08
/dev/vx/dsk/eva8k/oravl07
46G 45G 1.2G 98% /oravl07
/dev/vx/dsk/eva8k/Oravol3
103G 90G 13G 88% /oravl03
/dev/vx/dsk/eva8k_new/tlkuser12
79G 79G 0K 100% /tlkuser12
/dev/vx/dsk/eva8k/Oravol4
88G 83G 4.3G 96% /oravl04
/dev/vx/dsk/eva8k/oravl999
10G 401M 9.0G 5% /oravl999
/dev/vx/dsk/eva8k_new/tlkuser14
54G 39G 15G 73% /tlkuser14
/dev/vx/dsk/eva8k/Oravol2
85G 69G 14G 84% /oravl02
/dev/vx/dsk/eva8k/sdkhome
1.0G 17M 944M 2% /sdkhome
/dev/vx/dsk/eva8k/tlkuser7
44G 36G 7.8G 83% /tlkuser7
/dev/vx/dsk/eva8k/tlkproj1
1.0G 17M 944M 2% /tlkproj1
/dev/vx/dsk/eva8k/tlkuser3
35G 29G 5.9G 84% /tlkuser3
/dev/vx/dsk/eva8k/tlkuser10
29G 29G 2.7M 100% /tlkuser10
/dev/vx/dsk/eva8k/oravl05
30G 29G 1.2G 97% /oravl05
/dev/vx/dsk/eva8k/oravl06
36G 34G 1.6G 96% /oravl06
/dev/vx/dsk/eva8k/tlkuser6
29G 27G 2.1G 93% /tlkuser6
/dev/vx/dsk/eva8k/tlkuser2
36G 30G 5.8G 84% /tlkuser2
/dev/vx/dsk/eva8k/tlkuser1
66G 49G 16G 75% /tlkuser1
/dev/vx/dsk/eva8k_new/tlkuser13
84G 77G 7.0G 92% /tlkuser13
/dev/vx/dsk/eva8k_new/tlkuser16
44G 37G 6.4G 86% /tlkuser16
/dev/vx/dsk/eva8k/db2
1.0G 593M 404M 60% /opt/db2V8.1
/dev/vx/dsk/eva8k/WebSphere6029
3.0G 2.2G 776M 75% /opt/WebSphere6029
/dev/vx/dsk/eva8k/websphere6
2.0G 88M 1.8G 5% /opt/websphere6
/dev/vx/dsk/eva8k/wli
4.0G 1.4G 2.5G 36% /opt/wli10gR3MP1
/dev/vx/dsk/eva8k/user
2.0G 19M 1.9G 1% /user/telstra/history
dvcinasdm3:/oracle_cdrom/data
576G 576G 206M 100% /oracle_cdrom
dvcinasdm2:/system_kits
822G 818G 4.2G 100% /system_kits
dvcinasdm2:/db_share 295G 283G 13G 96% /db_share
dvcinas2dm2:/system_data/data
315G 283G 32G 90% /system_data
dvcinas2dm2:/ossinfra/data
49G 18G 32G 36% /ossinfra
For local server the command : /usr/sbin/prtpicl -v | egrep "devfs-path|driver-name|subsystem-id" | nawk '/:subsystem-id/ { print $0; getline; print $0; getline; print $0; }' | nawk -F: '{ print $2 }' gives :
subsystem-id 0x13a1
devfs-path /pci#0,600000/pci#0/pci#8/pci#0/scsi#1
driver-name mpt
subsystem-id 0x1648
devfs-path /pci#0,600000/pci#0/pci#8/pci#0/network#2
driver-name bge
subsystem-id 0x1648
devfs-path /pci#0,600000/pci#0/pci#8/pci#0/network#2,1
driver-name bge
subsystem-id 0xfc11
devfs-path /pci#0,600000/pci#0/pci#8/pci#0,1/SUNW,emlxs#1
driver-name emlxs
subsystem-id 0x125e
devfs-path /pci#3,700000/network
driver-name e1000g
subsystem-id 0x125e
devfs-path /pci#3,700000/network
driver-name e1000g
subsystem-id 0x13a1
devfs-path /pci#10,600000/pci#0/pci#8/pci#0/scsi#1
driver-name mpt
subsystem-id 0x1648
devfs-path /pci#10,600000/pci#0/pci#8/pci#0/network
driver-name bge
subsystem-id 0x1648
devfs-path /pci#10,600000/pci#0/pci#8/pci#0/network
driver-name bge
subsystem-id 0xfc11
devfs-path /pci#10,600000/pci#0/pci#8/pci#0,1/SUNW,emlxs#1
driver-name emlxs
For MIT server it gives :
subsystem-id 0xfc00
devfs-path /pci#3d,600000/SUNW,emlxs#1
driver-name emlxs
subsystem-id 0xfc00
devfs-path /pci#3d,600000/SUNW,emlxs#1,1
driver-name emlxs
subsystem-id 0xfc00
devfs-path /pci#5d,600000/SUNW,emlxs#1
driver-name emlxs
subsystem-id 0xfc00
devfs-path /pci#5d,600000/SUNW,emlxs#1,1
driver-name emlxs
on the start of i/o consuming code,iostat -d c3t50001FE1502613A9d7 5 shows :
1161 37 134 0 0 0 0 0 0 329 24 2
3 2 3 0 0 0 0 0 0 554 71 10
195 26 6 0 0 0 0 0 0 853 108 19
37 6 4 0 0 0 0 0 0 1134 143 10
140 8 7 0 0 0 0 0 0 3689 86 7
173 24 85 0 0 0 0 0 0 9914 74 9
0 0 0 0 0 0 0 0 0 12323 114 2
13 9 41 0 0 0 0 0 0 10609 117 2
0 0 0 0 0 0 0 0 0 10746 72 2
sd0 sd1 sd4 ssd134
kps tps serv kps tps serv kps tps serv kps tps serv
1 0 3 0 0 0 0 0 0 11376 137 2
2 0 10 0 0 0 0 0 0 11980 157 3
231 39 14 0 0 0 0 0 0 10584 140 3
785 175 5 0 0 0 0 0 0 13503 170 2
9 4 32 0 0 0 0 0 0 11597 168 2
7 1 6 0 0 0 0 0 0 11555 106 2
On the MIT server iostat shows :
0.0 460.4 0.0 4029.2 0.4 0.6 0.9 1.2 2 11 c6t5006048452A79BD6d206
0.0 885.2 0.0 8349.3 0.5 0.8 0.6 0.9 3 24 c4t5006048452A79BD9d206
0.0 660.0 0.0 5618.8 0.5 0.7 0.7 1.0 2 18 c6t5006048452A79BD6d206
0.0 779.1 0.0 7408.6 0.3 0.7 0.4 0.8 2 21 c4t5006048452A79BD9d206
0.0 569.8 0.0 4893.9 0.3 0.5 0.5 1.0 2 15 c6t5006048452A79BD6d206
0.0 521.5 0.0 5433.6 0.2 0.5 0.3 0.9 1 16 c4t5006048452A79BD9d206
0.0 362.8 0.0 3134.8 0.2 0.4 0.6 1.1 1 10 c6t5006048452A79BD6d206
So,we can see that the kps for local server is much more than that of MIT server,during the time of max i/o operations.

Conclusions on the local and MIT server
A quick glance at your machines:
Local server is a small-chassis Sun Enterprise machine on SPARC VI, possibly a M4000. You are writing data on an external file system (called eva8k_new) over multipathed PCIe slots using a direct SCSI connection. This machine is 3-5 years old.
MIT server is a SunFire 15000 - an old, mainframe-class Solaris server. It has 12 dual-core UltraSPARC IV+ CPUs in the hardware partition that you are running in (the physical chassis can be logically split into several different hardware partitions which cannot see each other at all). You are writing to a SAN over a 1Gb/s or 2Gb/s fibre channel (the LUN might be called dmldg) on multipathed PCI slots. This machine is at least 7 years old, but the technology is 10 years old.
The storage system used on the local and MIT servers are both external. The performance of the storage is dependent on a number of factors including the I/O speed of the physical interface (PCI vs. PCIe) and the interconnect (1 or 2Gb/s fibre channel on the SunFire). This article explains how to get this information.
Theoretical performance problems
The performance of your application may be gated on one of several bottlenecks (assuming no code problems and network latencies/bottlenecks):
CPU: If your CPU were faster, you could get the application to go faster.
Single-threaded: Some applications are bottlenecked on a single thread, and so adding threads/cores does not improve performance.
Multi-thread capable: Sometimes, if the application is multi-threaded, adding more threads/cores can improve performance
Storage IO bandwidth or IOPS: The application is reading from or writing to storage system (including disks). Adding disks, changing RAID type, adding disk cache and other things may improve IO or IOPS; alternatively you might change to another storage subsystem.
IO bandwidth is the maximum amount of data that can pass in a given second, which may saturate first if streaming data to or from a disk
IOPS (IO operations per second) is the maximum number of IO commands (read or write) that can be processed per second. Typically this saturates first for processes that are searching for or in files, or (re)writing small chunks.
Looking at your issue, we can do a quick check:
If the issue is CPU, then:
You should see the CPU utilisation for the java process in top to be very high during program execution (90-99%)
The problem is not likely threading, because the SunFire MIT Server has a good number of cores available, therefore the problem is single-thread performance.
The UltraSPARC IV+ is quite a lot slower than the SPARC VI's. This is easily a noticeable drop, so this might be the reason the MIT server is slower
If the issue is IO, then:
You will see the CPU utilization for the java process in top to be low (probably 50% or lower, but possibly as high as 80% or so as a rule of thumb)
You will see the IO to the disk subsystem using iostat saturate - that is immediately rise to a fixed number and not really 'peak' over this number. The following options might be useful: iostat -d <disk> 5. The throughput value and number of operations/sec will be higher on the local server, and lower on the MIT server
You need to speak to the administrator to see if a faster storage system is available for the MIT server.
All the above is assuming that other processes on the servers are not interfering with the operation of your program - clearly another high-cpu process or one writing a lot to the same disk will affect the performance greatly.
Conclusions
From the CPU data you provide, there is no evidence of a CPU bottleneck.
From the iostat data you provide, as you comment, the IO on the SunFire is significantly below that of the local server. This is likely the result of the attached storage, namely at least one of:
Lower performance of PCI vs. PCIe in the local server
Probable 1Gb/s fibre channel slower than the (possibly faster) SCSI attached storage on the local server
Older and slower disks on the SunFire vs. the local attached storage
(Note that the same SAN appears connected to the local server, so this could be tested).
With clear evidence of a hardware being the cause of the performance difference, there is little that can be done.
Some things may improve the general performance of the application, though. It's a good idea to run a Java profiler on the application. Examples include Netbeans and JProfiler.
The profiler will identify which IO operations are the problem. You might be able to:
Generally improve the algorithm at the bottleneck
Use a caching layer to aggregate multiple write operations before writing once
If using the original Java I/O clases (in java.io), you could rewrite the application to use Java NIO
EDIT: Thoughts on a caching layer
Assumption: That the problematic IO operation is either repeatedly writing small chunks to disk and flushing them, or keeps performing random-access write-to-disk operations. Your application may already be streaming to disk efficiently, in which case caching would not be useful.
When you have an expensive or slow operation in an application, you will want to minimize the number of times it is invoked - ideally to the theoretical minimum which hopefully is 1. However your code may not be doing so - for example you are using an OutputStream and writing small chunks to it and flushing to disk. In this case, you may write each disk block (8k) many times, each time with just a little more data.
Instead, you could use a RAM cache to consolidate all the writes; when you know there will be no more writes to the block, then you write it exactly once to disk. For streaming, Java has the BufferedOutputStream for this for simplistic cases. When you obtain the FileOutputStream instance from the File, wrap the FileOutputStream in the BufferedOutputStream and use only the BufferedOutputStream.
If, however, you are performing true random-access writes (eg using a java.io.RandomAccessFile), and moving the file pointer with RandomAccessFile.seek(), you may want to consider writing a write cache in RAM. Precisely what this would look like depends wholly on your file data structure, but you might want to start with a block paging mechanism. Chapter 1 of Java NIO has an introduction to those concepts, but hopefully you either don't need to go there or you find a close match in the NIO API.

If you are concerned about performance, I wouldn't use such an old version of Java. It's quite likely that the OS calls and native code generated for one architecture is sub-optimal. I would expect the newer architecure to suffer.
Can you compare Java 7 between these machines?
The ulimit suggest the first machine has much more resources. Which model of CPUs and how much memory do the two machines have?

Troubleshooting unbounded Java Resident Set Size(RSS) growth

I have a standalone Java application which has:
-Xmx1024m -Xms1024m -XX:MaxPermSize=256m -XX:PermSize=256m
Over the course of time it hogs more and more memory, starts to swap(and slow down) and eventually died a number of times(not OOM+dump, just died, nothing on /var/log/messages).
What I've tried so far:
Heap dumps: live objects take 200-300Mb out of 1G heap --> ok with heap
Number of live threads is rather constant(~60-70) --> ok with thread stacks
JMX stops answering at some point(mb it answers but timeout is lower)
Turn off swap - it dies faster
strace - seems everything slows down a bit, app still haven't died, and not sure for which things look there
Checking top: VIRT grows to 5.5Gb, RSS to 3.7 Gb
Checking vmstat(obviously we start to swap):
--------------------------procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
Sun Jul 22 16:10:26 2012: r b swpd free buff cache si so bi bo in cs us sy id wa st
Sun Jul 22 16:48:41 2012: 0 0 138652 2502504 40360 706592 1 0 169 21 1047 206 20 1 74 4 0
. . .
Sun Jul 22 18:10:59 2012: 0 0 138648 24816 58600 1609212 0 0 124 669 913 24436 43 22 34 2 0
Sun Jul 22 19:10:22 2012: 33 1 138644 33304 4960 1107480 0 0 100 536 810 19536 44 22 23 10 0
Sun Jul 22 20:10:28 2012: 54 1 213916 26928 2864 578832 3 360 100 710 639 12702 43 16 30 11 0
Sun Jul 22 21:10:43 2012: 0 0 629256 26116 2992 467808 84 176 278 1320 1293 24243 50 19 29 3 0
Sun Jul 22 22:10:55 2012: 4 0 772168 29136 1240 165900 203 94 435 1188 1278 21851 48 16 33 2 0
Sun Jul 22 23:10:57 2012: 0 1 2429536 26280 1880 169816 6875 6471 7081 6878 2146 8447 18 37 1 45 0
sar also shows steady system% growth = swapping:
15:40:02 CPU %user %nice %system %iowait %steal %idle
17:40:01 all 51.00 0.00 7.81 3.04 0.00 38.15
19:40:01 all 48.43 0.00 18.89 2.07 0.00 30.60
20:40:01 all 43.93 0.00 15.84 5.54 0.00 34.70
21:40:01 all 46.14 0.00 15.44 6.57 0.00 31.85
22:40:01 all 44.25 0.00 20.94 5.43 0.00 29.39
23:40:01 all 18.24 0.00 52.13 21.17 0.00 8.46
12:40:02 all 22.03 0.00 41.70 15.46 0.00 20.81
Checking pmap gaves the following largest contributors:
000000005416c000 1505760K rwx-- [ anon ]
00000000b0000000 1310720K rwx-- [ anon ]
00002aaab9001000 2079748K rwx-- [ anon ]
Trying to correlate addresses I've got from pmap from stuff dumped by strace gave me no matches
Adding more memory is not practical(just make problem appear later)
Switching JVM's is not possible(env is not under our control)
And the question is:
What else can I try to track down the problem's cause or try to work around it?

Something in your JVM is using an "unbounded" amount of non-Heap memory. Some possible candidates are:
Thread stacks.
Native heap allocated by some native code library.
Memory-mapped files.
The first possibility will show up as a large (and increasing) number of threads when you take a thread stack dump. (Just check it ... OK?)
The second one you can (probably) eliminate if your application (or some 3rd part library it uses) doesn't use any native libraries.
The third one you can eliminate if your application (or some 3rd part library it uses) doesn't use memory mapped files.
I would guess that the reason that you are not seeing OOME's is that your JVM is being killed by the Linux OOM killer. It is also possible that the JVM is bailing out in native code (e.g. due to a malloc failure not being handled properly), but I'd have thought that a JVM crash dump would be the more likely outcome ...

Problem was in a profiler library attached - it recorded CPU calls/allocation sites, thus required memory to store that.
So, human factor here :)

There is a known problem with Java and glibc >= 2.10 (includes Ubuntu >= 10.04, RHEL >= 6).
The cure is to set this env. variable:
export MALLOC_ARENA_MAX=4
There is an IBM article about setting MALLOC_ARENA_MAX
https://www.ibm.com/developerworks/community/blogs/kevgrig/entry/linux_glibc_2_10_rhel_6_malloc_may_show_excessive_virtual_memory_usage?lang=en
This blog post says
resident memory has been known to creep in a manner similar to a
memory leak or memory fragmentation.
search for MALLOC_ARENA_MAX on Google or SO for more references.
You might want to tune also other malloc options to optimize for low fragmentation of allocated memory:
# tune glibc memory allocation, optimize for low fragmentation
# limit the number of arenas
export MALLOC_ARENA_MAX=2
# disable dynamic mmap threshold, see M_MMAP_THRESHOLD in "man mallopt"
export MALLOC_MMAP_THRESHOLD_=131072
export MALLOC_TRIM_THRESHOLD_=131072
export MALLOC_TOP_PAD_=131072
export MALLOC_MMAP_MAX_=65536

Does anyone here know a good, cross-platform way to get the process list?

Okay, I got into a conversation with a friend about Ada (I'm the local proponent here), and in his project he's having a pain trying to get Java (using JNI) to get the applications running on the client machine (only Windows, Mac, and Linux) to get a listing of applications.
I'm not familiar with Macs at all, and my Linux experience is mostly user-end within academia.
So, my question is this: does anyone know a good cross-platform way to get the process-list?
My solution would be to use a package spec with a general function returning the list in the manner the Java expects it and throw together three different bodies for each of the platforms which would get the process-list according to that system and compile the (resultant) three binaries for those targets individually.
Is there a [good] way to do it w/o resorting to three different versions?
(This is an Ada question, but Java solutions are welcome.)

Java has no cross-platform API to list running processes. ProcessBuilder may be used to excecute the ps command, as shown here and here. The (rough) equivalent in Ada would be GNAT.Os_Lib.Spawn in GNAT. I'm not sure about other implementations.

JavaSysMon can provide a list of running processes (as well as other system information) in a platform-independent manner. Currently it supports Mac OS X, Linux, Windows, and Solaris. As an added bonus, it is BSD licensed.
Wiki
JavaDocs

You were almost at the Ada solution. As you only want 1 procedure to execute & look at the system call (top/ps in linux/unix) i would suggest a separate procedure. This will live in its own directory, and only be referenced by the correct compilation (per os). As for the actual commands per os, that is not part of my answer.

Do you just mean to get a list of running processes?
If so, you can just Google the commands to get this (1) the name of the OS on which the program is running, then (2) run Runtime.getRuntime.exec(stringCommandToGetProcessList); to based on #1, and output the results.
You don't need a different Java binary for every OS. You only need one. Just Google the command to find the OS name/version, and the command to get the list of running applications.
You also don't need JNI to do this. Use the Runtime class to run commands as if they were on the command-line.
There's no cross-platform way to do it, because the commands are different on each OS. But since there's only three major OS's (maybe a dozen total that you want to support, in some crazy extreme example), then it's just a matter of making a list of the 12 different commands to do this.
On Macs, and many Linux versions, OS name/version:
$ uname -a
Darwin normalocity 10.8.0 Darwin Kernel Version 10.8.0: Tue Jun 7 16:32:41 PDT 2011; root:xnu-1504.15.3~1/RELEASE_X86_64 x86_64
Running processes (by highest usage):
$ top
Processes: 92 total, 5 running, 87 sleeping, 408 threads 20:38:35
Load Avg: 0.18, 0.20, 0.17 CPU usage: 7.26% user, 1.95% sys, 90.78% idle
SharedLibs: 6272K resident, 7300K data, 0B linkedit. MemRegions: 12204 total, 730M resident, 29M private, 393M shared.
PhysMem: 1076M wired, 1184M active, 1859M inactive, 4119M used, 4062M free.
VM: 207G vsize, 1041M framework vsize, 1851231(0) pageins, 603(0) pageouts.
Networks: packets: 1727104/1746M in, 984226/269M out. Disks: 295257/6745M read, 397634/15G written.
PID COMMAND %CPU TIME #TH #WQ #PORT #MRE RPRVT RSHRD RSIZE VPRVT VSIZE PGRP PPID STATE UID
12547 top 3.5 00:00.26 1/1 0 24 34 1208K 264K 1784K 17M 2378M 12547 12217 running 0
12217 bash 0.0 00:00.08 1 0 17 25 1328K 856K 1988K 17M 2378M 12217 12211 sleeping 502
12212 bash 0.0 00:00.08 1 0 17 25 1276K 856K 1980K 9688K 2378M 12212 12200 sleeping 502
12211 login 0.0 00:00.01 1 0 22 54 512K 312K 1648K 11M 2379M 12211 12196 sleeping 0
12202 bash 0.0 00:00.07 1 0 17 25 1276K 856K 1980K 9688K 2378M 12202 12199 sleeping 502
12201 bash 0.0 00:00.07 1 0 17 25 1276K 856K 1980K 9688K 2378M 12201 12198 sleeping 502
12200 login 0.0 00:00.01 1 0 22 54 512K 312K 1648K 11M 2379M 12200 12196 sleeping 0
12199 login 0.0 00:00.01 1 0 22 54 512K 312K 1648K 11M 2379M 12199 12196 sleeping 0
12198 login 0.0 00:00.01 1 0 22 54 512K 312K 1648K 11M 2379M 12198 12196 sleeping 0
12196 Terminal 33.9 00:01.84 5 1 114- 137 5736K+ 32M 23M+ 90M 2768M 12196 300 sleeping 502
11803- Google Chrom 0.0 04:06.79 7 1 99 365 45M 84M 79M 112M 1199M 11788 11788 sleeping 502
11800- Google Chrom 0.0 00:00.25 7 1 98 215 9632K 77M 23M 110M 1090M 11788 11788 sleeping 502
11799- Google Chrom 0.0 00:07.92 7 1 99 288 25M 82M 43M 109M 1108M 11788 11788 sleeping 502
11797- Google Chrom 0.0 00:01.49 7 1 99 316 27M 81M 48M 111M 1109M 11788 11788 sleeping 502
11796- Google Chrom 0.0 00:00.44 4 1 91 115 2824K 65M 8304K 96M 1012M 11788 11788 sleeping 502
11795- Google Chrom 0.0 00:00.96 7 1 98 215 9172K 77M 23M 111M 1091M 11788 11788 sleeping 502
11794- Google Chrom 0.0 00:07.64 8 1 100 294 20M 75M 36M 113M 1101M 11788 11788 sleeping 502
11793- Google Chrom 0.0 00:01.42 8 1 95 185 9732K 73M 24M 104M 1057M 11788 11788 sleeping 502
11788- Google Chrom 0.6 04:04.31 30 1 307 390 61M 110M 96M 254M 1298M 11788 300 sleeping 502
4328 ssh-agent 0.0 00:00.19 2 1 33 63 1300K 396K 2688K 59M 2420M 4328 300 sleeping 502
3855- Microsoft Of 0.0 00:36.14 4 1 121 337 12M 30M 22M 93M 1027M 3855 300 sleeping 502
492 AppleSpell 0.0 00:10.56 2 1 34 72 4608K 9028K 10M 88M 2469M 492 300 sleeping 502

Ada doesn't really have the concept of "processes" within the language. In fact, Ada code can run on platforms that do not support heavy processes at all (eg: Many smallish embedded platforms, like vxWorks).
That means you are going to have to use some kind of API (most likely supplied by your OS) to get that information.
If your OS supports POSIX, you may be able to use Posix bindings like Florist to get that info. There are full Unix subsystems available for Windows (Cygwin) and I believe MacOS is built on a flavor of Unix. So it might be possible to use Unix as sort of a lingua-franca so you can get your process info from a single (POSIX) API.
Now where Java is concerned, there are two issues: The Java language and the Java Platform (JVM). Java language fans like to conflate the two, but there are actually Ada compilers that target the JVM, and they can call all the same JVM API's that code written in the Java language can call. If there's one that allows Java programs to get a list of all the threads or processes that the JVM knows about, you could call that same routine from Ada too (if it is running under the JVM as well).

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.