On master node: binding to port 8040 java.net.BindException - java

I have a 2-node cluster(hadoop 2.6.0) with a master(as salve) and 1 slave in etc/hadoop/salve on the master machine.
After calling
start-yarn.sh
jps shows
:~$ jps
11821 SecondaryNameNode
11114 NameNode
12037 ResourceManager
11432 DataNode
12674 Jps
12363 NodeManager
and within a minute the NodeManager is gone with an BindException: Address already in use in the log file(pasted below). The port 8040 is open and used by ResoureManager.
log file on master
2014-12-19 17:21:25,185 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping NodeManager metrics system...
2014-12-19 17:21:25,186 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NodeManager metrics system stopped.
2014-12-19 17:21:25,186 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NodeManager metrics system shutdown complete.
2014-12-19 17:21:25,186 FATAL org.apache.hadoop.yarn.server.nodemanager.NodeManager: Error starting NodeManager
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.net.BindException: Problem binding to [0.0.0.0:8040] java.net.BindException: Address already in use; For more details see: http://wiki.apache.org/hadoop/BindException
at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:139)
at org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC.getServer(HadoopYarnProtoRPC.java:65)
at org.apache.hadoop.yarn.ipc.YarnRPC.getServer(YarnRPC.java:54)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.createServer(ResourceLocalizationService.java:356)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.serviceStart(ResourceLocalizationService.java:334)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceStart(ContainerManagerImpl.java:450)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:264)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:463)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:509)
Caused by: java.net.BindException: Problem binding to [0.0.0.0:8040] java.net.BindException: Address already in use; For more details see: http://wiki.apache.org/hadoop/BindException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:720)
at org.apache.hadoop.ipc.Server.bind(Server.java:424)
at org.apache.hadoop.ipc.Server$Listener.<init>(Server.java:573)
at org.apache.hadoop.ipc.Server.<init>(Server.java:2205)
at org.apache.hadoop.ipc.RPC$Server.<init>(RPC.java:931)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server.<init>(ProtobufRpcEngine.java:537)
at org.apache.hadoop.ipc.ProtobufRpcEngine.getServer(ProtobufRpcEngine.java:512)
at org.apache.hadoop.ipc.RPC$Builder.build(RPC.java:776)
at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.createServer(RpcServerFactoryPBImpl.java:169)
at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:132)
... 13 more
Caused by: java.net.BindException: Address already in use
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:444)
at sun.nio.ch.Net.bind(Net.java:436)
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214)
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
at org.apache.hadoop.ipc.Server.bind(Server.java:407)
... 21 more
2014-12-19 17:21:25,189 INFO org.apache.hadoop.yarn.server.nodemanager.NodeManager: SHUTDOWN_MSG
dfsreport
:~$ hdfs dfsadmin -report
Configured Capacity: 216116072448 (201.27 GB)
Present Capacity: 55109070848 (51.32 GB)
DFS Remaining: 55109021696 (51.32 GB)
DFS Used: 49152 (48 KB)
DFS Used%: 0.00%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
-------------------------------------------------
Live datanodes (2):
Name: 192.168.179.3:50010 (slave)
Hostname: xyz
Decommission Status : Normal
Configured Capacity: 83369902080 (77.64 GB)
DFS Used: 24576 (24 KB)
Non DFS Used: 63215587328 (58.87 GB)
DFS Remaining: 20154290176 (18.77 GB)
DFS Used%: 0.00%
DFS Remaining%: 24.17%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Name: 192.168.179.58:50010 (master)
Hostname: abc
Decommission Status : Normal
Configured Capacity: 132746170368 (123.63 GB)
DFS Used: 24576 (24 KB)
Non DFS Used: 97791414272 (91.08 GB)
DFS Remaining: 34954731520 (32.55 GB)
DFS Used%: 0.00%
DFS Remaining%: 26.33%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1

BindException - Address already in use - This is an indication that port 8040 is busy and is in use by someother process. You could overcome this by changing the default yarn port in file etc/hadoop/yarn-site.xml as below:
<property>
<name>yarn.nodemanager.localizer.address</name>
<value>192.168.179.58:10200</value>
</property>

Related

How to identify the optimum number of shuffle partition in Spark

I am running a spark structured streaming job (bounces every day) in EMR. I am getting an OOM error in my application after a few hours of execution and get killed. The following are my configurations and spark SQL code.
I am new to Spark and need your valuable input.
The EMR is having 10 instances with 16 core and 64GB memory.
Spark-Submit arguments:
num_of_executors: 17
executor_cores: 5
executor_memory: 19G
driver_memory: 30G
Job is reading input as micro-batches from a Kafka at an interval of 30seconds. Average number of rows read per batch is 90k.
spark.streaming.kafka.maxRatePerPartition: 4500
spark.streaming.stopGracefullyOnShutdown: true
spark.streaming.unpersist: true
spark.streaming.kafka.consumer.cache.enabled: true
spark.hadoop.fs.s3.maxRetries: 30
spark.sql.shuffle.partitions: 2001
Spark SQL aggregation code:
dataset.groupBy(functions.col(NAME),functions.window(functions.column(TIMESTAMP_COLUMN),30))
.agg(functions.concat_ws(SPLIT, functions.collect_list(DEPARTMENT)).as(DEPS))
.select(NAME,DEPS)
.map((row) -> {
Map<String, Object> map = Maps.newHashMap();
map.put(NAME, row.getString(0));
map.put(DEPS, row.getString(1));
return new KryoMapSerializationService().serialize(map);
}, Encoders.BINARY());
Some logs from the driver:
20/04/04 13:10:51 INFO TaskSetManager: Finished task 1911.0 in stage 1041.0 (TID 1052055) in 374 ms on <host> (executor 3) (1998/2001)
20/04/04 13:10:52 INFO TaskSetManager: Finished task 1925.0 in stage 1041.0 (TID 1052056) in 411 ms on <host> (executor 3) (1999/2001)
20/04/04 13:10:52 INFO TaskSetManager: Finished task 1906.0 in stage 1041.0 (TID 1052054) in 776 ms on <host> (executor 3) (2000/2001)
20/04/04 13:11:04 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 3.
20/04/04 13:11:04 INFO DAGScheduler: Executor lost: 3 (epoch 522)
20/04/04 13:11:04 INFO BlockManagerMasterEndpoint: Trying to remove executor 3 from BlockManagerMaster.
20/04/04 13:11:04 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(3, <host>, 38533, None)
20/04/04 13:11:04 INFO BlockManagerMaster: Removed 3 successfully in removeExecutor
20/04/04 13:11:04 INFO YarnAllocator: Completed container container_1582797414408_1814_01_000004 on host: <host> (state: COMPLETE, exit status: 143)
And by the way, I am using collectasList in my forEachBatch code
List<Event> list = dataset.select("value")
.selectExpr("deserialize(value) as rows")
.select("rows.*")
.selectExpr(NAME, DEPS)
.as(Encoders.bean(Event.class))
.collectAsList();
With these settings, you may be causing your own issues.
num_of_executors: 17
executor_cores: 5
executor_memory: 19G
driver_memory: 30G
You are basically creating extra containers here to have to shuffle between. Instead, start off with something like 10 executors, 15 cores, 60g memory. If that is working, then you can play these a bit to try and optimize performance. I usually try splitting my containers in half each step (but I also havent needed to do this since spark 2.0).
Let Spark SQL keep the default at 200. The more you break this up, the more math you make Spark do to calculate the shuffles. If anything, I'd try to go with the same number of parallelism as you have executors, so in this case just 10. When 2.0 came out, this is how you would tune hive queries.
Making the job complex to break up puts all the load on the master.
Using Datasets and Encoding are also generally not as performant as going with straight DataFrame operations. I have found great lifts in performance of factoring this out for dataframe operations.

Hadoop node is not active

I have 1 x master node and 1 x slave node setup.
My issue is when running the map reduce processing. The slave node doesn't seem working. Anyone can provide help on how to check, to change and ensure the slave is working?
The config files info can be found on the URL below too
https://drive.google.com/file/d/1ULEe6k2zYnfQDQUQIbz_xR29WgT1DJhB/view
Here are my observation
1) When i check the CPU resources utilization, The slaves doesn't seem working and CPU resources at 0% when running the map reduce job while the master at 44% CPU resources. refer to the attachment.
2) When i run the dfs report it show it has 2 live nodes but on the cluster web it show only 1. Refer to the attachment and below.
3) The total processing time of map reduce is same with or without the slave
-------------------------------------------------
Live datanodes (2):
Name: 192.168.249.128:9866 (node-master)
Hostname: localhost
Decommission Status : Normal
Configured Capacity: 20587741184 (19.17 GB)
DFS Used: 174785723 (166.69 MB)
Non DFS Used: 60308293 (57.51 MB)
DFS Remaining: 20352647168 (18.95 GB)
DFS Used%: 0.85%
DFS Remaining%: 98.86%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Tue Oct 23 11:17:39 PDT 2018
Last Block Report: Tue Oct 23 11:07:32 PDT 2018
Num of Blocks: 93
Name: 192.168.249.129:9866 (node1)
Hostname: localhost
Decommission Status : Normal
Configured Capacity: 20587741184 (19.17 GB)
DFS Used: 85743 (83.73 KB)
Non DFS Used: 33775889 (32.21 MB)
DFS Remaining: 20553879552 (19.14 GB)
DFS Used%: 0.00%
DFS Remaining%: 99.84%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Tue Oct 23 11:17:38 PDT 2018
Last Block Report: Tue Oct 23 11:03:59 PDT 2018
Num of Blocks: 4
You're showing datanodes with dfsreport, not nodemanagers that actually are processing the data. In the YARN UI, you will want to take note of the "Active Nodes" counter, which in your case is 1. That would make sense if the master is a namenode and resource manager while the slave would be a datanode and nodemanager.
Other than that, if you have a non splittable file, for example a ZIP, or your file is less than the block size (by default 128 MB), then only one mapper will process that. Plus, it's not guaranteed that mappers (or reducers) will be distributed evenly over all available resources
Outside of a learning environment, though, 40 GB of storage and 8 GB of RAM would be better spent on multi threading rather than distributed computing (or a proper database; i.e parse files and load them into a queryable store). Or use Spark or Pig, which don't require Hadoop, but are much easier to work with than MapReduce

Remote RPC client disassociated while doing operation on datasets with persisted datasets

While performing join or any operation with persisted datasets with other non-persisted datasets, Spark server throws Remote RPC client disassociated. Following is piece of code that causing issue.
Dataset<Row> dsTableA = sparkSession.read().format("jdbc").options(dbConfig)
.option("dbTable", "SELECT * FROM tableA").load().persist(StorageLevel.MEMORY_AND_DISK_SER());
Dataset<Row> dsTableB = sparkSession.read().format("jdbc").options(dbConfig)
.option("dbTable", "SELECT * FROM tableB").load().persist(StorageLevel.MEMORY_AND_DISK_SER());
Dataset<Row> anotherTableA = sparkSession.read().format("jdbc").options(dbConfig)
.option("dbTable", "SELECT * FROM tableC").load();
anotherTableA.write().format("json").save("/path/toJsonA"); // Working Fine - No use of persisted datasets
Dataset<Row> anotherTableB = sparkSession.read().format("jdbc").options(dbConfig)
.option("dbTable", "SELECT * FROM tableD").load();
dsTableA.createOrReplaceTempView("dsTableA");
dsTableB.createOrReplaceTempView("dsTableB");
anotherTableB.createOrReplaceTempView("anotherTableB");
Dataset<Row> joinedTable = sparkSession.sql("select atb.* from anotherTableB atb INNER JOIN dsTableA dsta ON atb.pid=dsta.pid LEFT JOIN dsTableB dstb ON atb.ssid=dstb.ssid");
joinedTable.write().format("json").save("/path/toJsonB");
// ERROR : Remote RPC client disassociated
// Working fine if Datasets dsTableA and dsTableB were not persisted
Part of logs
INFO TaskSetManager: Starting task 0.0 in stage 17.0 (TID 111, X.X.X.X, partition 0, PROCESS_LOCAL, 5342 bytes)
INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Launching task 111 on executor id: 0 hostname: X.X.X.X.
INFO BlockManagerInfo: Added broadcast_13_piece0 in memory on X.X.X.X:37153 (size: 12.9 KB, free: X.2 GB)
INFO BlockManagerInfo: Added broadcast_12_piece0 in memory on X.X.X.X:37153 (size: 52.0 KB, free: X.2 GB)
ERROR TaskSchedulerImpl: Lost executor 0 on X.X.X.X: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-12121212121211-0000/0 is now EXITED (Command exited with code 134)
WARN TaskSetManager: Lost task 0.0 in stage 17.0 (TID 111, X.X.X.X): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
INFO StandaloneSchedulerBackend: Executor app-12121212121211-0000/0 removed: Command exited with code 134
INFO DAGScheduler: Executor lost: 0 (epoch 8)
If Datasets dsTableA and dsTableB were not persisted, then everything works smoothly. But must have to use persisted datasets. So how to solve this problem?

Java and Linux OS hangs randomly for 24 minutes (Linux, arm, Debian, Java 7 ARM)

Background/context:
We are running Java application on one of CompuLab CoM:
https://compulab.co.il/products/computer-on-modules/cm-fx6/#overview
JVM version: Oracle Java 7 ARM 1.7.0_60
OS reference:
http://www.compulab.co.il/workspace/mediawiki/index.php5/CM-FX6_Linux
The application is not trivial: lots of threads, access to Ethernet (LAN), serial interface, GPRS/UMTS modem, access to Internet (ppp deamon), GPS, touch screen, database (SQLite), file system. In other words use OS resources extensively.....
We are observing that Java application (all of its threads) and OS basic functionality randomly hangs. I would say it is a Linux kernel bug but by killing the Java application it recovers and operates normally.
This state always takes exactly 24 minutes. Afterwards it recovers and behaves normally. Average rate of occurrence is once per 24-30 hours.
When it happens, externally invoked events like messages sent to application via Ethernet or serial interface are buffered (by OS probably) and all of them are processed immediately after it recovers.
When I establish SSH connection to device in advance, after it happens the connection is either blocked (all command are buffered and processed after it recovers - 24 mintes) or its working, than:
basic OS utilities does not work: "top" for example
jstack -F does not work, just hangs and does not produce any output
killing Java application by kill -9 PID released the OS and everything starts to operate normally
While it is in this state, the OS each time behaves differently. Other findings:
Basic network based utilities does not work (SSH, FTP) – can not
establish new connection to OS from another machine.
PING from another machine does work until I unplug an plug Ethernet
cable from device, sometimes PING than stops working
Sometimes OS system time hangs as well (not always), after 24 minutes
it continues delayed for 24 minutes.
New USB input devices (mouse, keyboard) can not be connected while in
that state (happens always).
Another strange thing:
A touch screen is used for interaction with a user (driver compiled as kernel module). And it works even while it is hung. Java application (GUI Swing) can handle events like pressing button so I can run some code behind button click handler.
It seems like all threads are blocked but Java Swing can handle some input events and our application precesses them until it needs to interact with already blocked threads or OS (run bash script on button click) or call sleep method. Than it hangs as well.
In other words, the Java application is hung ”partially” - can still handle something.
Already tried:
Tools for JVM remote debugging: Java Mission Control, VisualVM.
Connection was also established before it hung. Everything seemed OK
in terms of thread dump, heap dump etc. (I can send by e-mail). Even
the connection remained and I could see in thees tools that processor
usage dropped to 0 % for JVM.
jstack -F (via SSH): does not work, just hangs and does not produce
any output
I tried to run OS without the driver for touch screen and it still
happened.
I tried to run two parallel Java application. One of them was very
simple – just writing to log timestamps. And both of them hung.
I tried to run System.exit(0) in terms of button click handler while
app. and all threads hung and it does not worked (hung as well)
Questions:
Is it Linux kernel bug or JVM (its ARM implementation) bug?
Is Java (JVM) able to hang and block basic OS functionality (FTP, SSH, system time, other utilities)?
How can I further diagnose/debug this issue when basic utilities like jstack -F does not work?
Do you have any ideas what could be the cause of this issue and why it always recovers exactly after 24 minutes?
Update 1: 2014-07-10
Finally I manage to “catch” this weird state again. Here are my further findings.
Based on nos suggestion I tried run via ssh (established in advanced):
*strace -f -p PID*
Unfortunately the bash script command hung as well (same behavior like with jstack).
As far as the user limit (ulimit) and OS resources are concerned, bellow I report figures taken just after the system recovered from last hung. At that state it had been running for 24 hours and I can confirm that those figures remain roughly the same during long-term operation (no random peeks during operation). From my point of view, they are ok and application is not stepping over any resource or other limit in any way.
Java current heap
Used: 18 MB, Free: 12 MB, Total: 30 MB, Max: 230 MB
Java heap
root#cm-debian:~# /usr/lib/jvm/jdk1.7.0_60/bin/jmap -heap 3242
Attaching to process ID 3242, please wait...
Debugger attached successfully.
Client compiler detected.
JVM version is 24.60-b09
using thread-local object allocation.
Mark Sweep Compact GC
Heap Configuration:
MinHeapFreeRatio = 40
MaxHeapFreeRatio = 70
MaxHeapSize = 249561088 (238.0MB)
NewSize = 1048576 (1.0MB)
MaxNewSize = 4294836224 (4095.875MB)
OldSize = 4194304 (4.0MB)
NewRatio = 2
SurvivorRatio = 8
PermSize = 12582912 (12.0MB)
MaxPermSize = 67108864 (64.0MB)
G1HeapRegionSize = 0 (0.0MB)
Heap Usage:
New Generation (Eden + 1 Survivor Space):
capacity = 10092544 (9.625MB)
used = 6772088 (6.458366394042969MB)
free = 3320456 (3.1666336059570312MB)
67.09991058745942% used
Eden Space:
capacity = 9043968 (8.625MB)
used = 6620336 (6.3136444091796875MB)
free = 2423632 (2.3113555908203125MB)
73.2016743093297% used
From Space:
capacity = 1048576 (1.0MB)
used = 151752 (0.14472198486328125MB)
free = 896824 (0.8552780151367188MB)
14.472198486328125% used
To Space:
capacity = 1048576 (1.0MB)
used = 0 (0.0MB)
free = 1048576 (1.0MB)
0.0% used
tenured generation:
capacity = 22134784 (21.109375MB)
used = 17650936 (16.83324432373047MB)
free = 4483848 (4.276130676269531MB)
79.7429782915433% used
Perm Generation:
capacity = 19136512 (18.25MB)
used = 19023016 (18.141761779785156MB)
free = 113496 (0.10823822021484375MB)
99.40691386183647% used
9597 interned Strings occupying 729344 bytes.
top
top - 11:41:29 up 21:59, 2 users, load average: 1.51, 1.25, 1.22
Tasks: 93 total, 1 running, 92 sleeping, 0 stopped, 0 zombie
Cpu(s): 9.4%us, 8.0%sy, 0.0%ni, 82.5%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 966780k total, 273080k used, 693700k free, 27216k buffers
Swap: 0k total, 0k used, 0k free, 126352k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3242 root 20 0 398m 79m 11m S 23.6 8.4 346:16.82 java
3889 root 20 0 2804 1096 848 R 5.5 0.1 0:00.07 top
1 root 20 0 2124 688 596 S 0.0 0.1 0:02.92 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.03 kthreadd
3 root 20 0 0 0 0 S 0.0 0.0 0:14.32 ksoftirqd/0
5 root 20 0 0 0 0 S 0.0 0.0 0:00.14 kworker/u:0
6 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/0
7 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 khelper
java limits
root#cm-debian:~# java -XX:+PrintFlagsFinal -version | grep -iE 'HeapSize|PermSize|ThreadStackSize'
uintx AdaptivePermSizeWeight = 20 {product}
intx CompilerThreadStackSize = 0 {pd product}
uintx ErgoHeapSizeLimit = 0 {product}
uintx HeapSizePerGCThread = 67108864 {product}
uintx InitialHeapSize := 15468480 {product}
uintx LargePageHeapSizeThreshold = 134217728 {product}
uintx MaxHeapSize := 249561088 {product}
uintx MaxPermSize = 67108864 {pd product}
uintx PermSize = 12582912 {pd product}
intx ThreadStackSize = 320 {pd product}
intx VMThreadStackSize = 512 {pd product}
java version "1.7.0_60"
Java(TM) SE Runtime Environment (build 1.7.0_60-b19)
Java HotSpot(TM) Client VM (build 24.60-b09, mixed mode)
process limits
root#cm-debian:~# cat /proc/3242/limits
Limit Soft Limit Hard Limit Units
Max cpu time unlimited unlimited seconds
Max file size unlimited unlimited bytes
Max data size unlimited unlimited bytes
Max stack size 8388608 unlimited bytes
Max core file size 0 unlimited bytes
Max resident set unlimited unlimited bytes
Max processes unlimited unlimited processes
Max open files 8192 8192 files
Max locked memory 65536 65536 bytes
Max address space unlimited unlimited bytes
Max file locks unlimited unlimited locks
Max pending signals 16382 16382 signals
Max msgqueue size 819200 819200 bytes
Max nice priority 0 0
Max realtime priority 0 0
Max realtime timeout unlimited unlimited us
system memory info
root#cm-debian:~# cat /proc/meminfo
MemTotal: 966780 kB
MemFree: 694312 kB
Buffers: 27384 kB
Cached: 126364 kB
SwapCached: 0 kB
Active: 140748 kB
Inactive: 107684 kB
Active(anon): 94992 kB
Inactive(anon): 2064 kB
Active(file): 45756 kB
Inactive(file): 105620 kB
Unevictable: 0 kB
Mlocked: 0 kB
HighTotal: 524288 kB
HighFree: 301088 kB
LowTotal: 442492 kB
LowFree: 393224 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 0 kB
Writeback: 0 kB
AnonPages: 94692 kB
Mapped: 21220 kB
Shmem: 2376 kB
Slab: 13268 kB
SReclaimable: 5284 kB
SUnreclaim: 7984 kB
KernelStack: 960 kB
PageTables: 980 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 483388 kB
Committed_AS: 137260 kB
VmallocTotal: 286720 kB
VmallocUsed: 2928 kB
VmallocChunk: 283040 kB
root#cm-debian:~# vmstat -s
966780 K total memory
272468 K used memory
140776 K active memory
107712 K inactive memory
694312 K free memory
27392 K buffer memory
126404 K swap cache
0 K total swap
0 K used swap
0 K free swap
726963 non-nice user cpu ticks
0 nice user cpu ticks
621187 system cpu ticks
6371123 idle cpu ticks
3683 IO-wait cpu ticks
324 IRQ cpu ticks
2146 softirq cpu ticks
0 stolen cpu ticks
130871 pages paged in
97520 pages paged out
0 pages swapped in
0 pages swapped out
293822206 interrupts
494034482 CPU context switches
1412595732 boot time
3916 forks
threads
root#cm-debian:~# cat /proc/sys/kernel/pid_max
32768
root#cm-debian:~# cat /proc/sys/kernel/threads-max
15102
root#cm-debian:~# cat /proc/sys/vm/max_map_count
65530
root#cm-debian:~# ls -l /proc/3242/task/ | wc -l
33
root#cm-debian:~# ps huH p 3242 | wc -l
32
root#cm-debian:~# grep -s '^Threads' /proc/[0-9]*/status | awk '{ sum += $2; } END { print sum; }'
122
open files / file descriptors
root#cm-debian:~# ls -l /proc/3242/fd | wc -l
81
Update 2: 2014-13-10
This time I logged all Java threads stack traces while the OS was hung (as I stated previously, the touch screen and its events still works so I wrote stack traces to log file in terms of UI button handler).
From my point of view, all threads are in “correct” state (sleeping, waiting for UDP datagram etc..) and it is obvious that the hang is not caused by a Java application SW operation which would took 24 minutes.
10:49:42,293> [INFO ] THREAD stack traces:
****************************************
ID: 56, name: Mpg123AudioPlayer_PASSENGER_ctrlLoop
java.lang.Thread.sleep(Native Method)
java.lang.Thread.sleep(Thread.java:340)
java.util.concurrent.TimeUnit.sleep(TimeUnit.java:360)
epis5fcc.audio.mpg.MpgAudioOutputPlayer.ctrlLoop(MpgAudioOutputPlayer.java:169)
epis5fcc.audio.mpg.MpgAudioOutputPlayer.access$000(MpgAudioOutputPlayer.java:19)
epis5fcc.audio.mpg.MpgAudioOutputPlayer$1.run(MpgAudioOutputPlayer.java:88)
java.lang.Thread.run(Thread.java:745)
ID: 11, name: AWT-EventQueue-0
java.lang.Thread.getStackTrace(Thread.java:1589)
epis5fcc.domain.debug.ThreadStackTracesLogger.log(ThreadStackTracesLogger.java:30)
epis5fcc.ui.settings.FccRegistryScreen$7.actionPerformed(FccRegistryScreen.java:303)
javax.swing.AbstractButton.fireActionPerformed(AbstractButton.java:2018)
javax.swing.AbstractButton$Handler.actionPerformed(AbstractButton.java:2341)
javax.swing.DefaultButtonModel.fireActionPerformed(DefaultButtonModel.java:402)
javax.swing.DefaultButtonModel.setPressed(DefaultButtonModel.java:259)
javax.swing.plaf.basic.BasicButtonListener.mouseReleased(BasicButtonListener.java:252)
java.awt.Component.processMouseEvent(Component.java:6516)
javax.swing.JComponent.processMouseEvent(JComponent.java:3320)
java.awt.Component.processEvent(Component.java:6281)
java.awt.Container.processEvent(Container.java:2229)
java.awt.Component.dispatchEventImpl(Component.java:4872)
java.awt.Container.dispatchEventImpl(Container.java:2287)
java.awt.Component.dispatchEvent(Component.java:4698)
java.awt.LightweightDispatcher.retargetMouseEvent(Container.java:4832)
java.awt.LightweightDispatcher.processMouseEvent(Container.java:4492)
java.awt.LightweightDispatcher.dispatchEvent(Container.java:4422)
java.awt.Container.dispatchEventImpl(Container.java:2273)
java.awt.Window.dispatchEventImpl(Window.java:2719)
java.awt.Component.dispatchEvent(Component.java:4698)
java.awt.EventQueue.dispatchEventImpl(EventQueue.java:735)
java.awt.EventQueue.access$200(EventQueue.java:103)
java.awt.EventQueue$3.run(EventQueue.java:694)
java.awt.EventQueue$3.run(EventQueue.java:692)
java.security.AccessController.doPrivileged(Native Method)
java.security.ProtectionDomain$1.doIntersectionPrivilege(ProtectionDomain.java:76)
java.security.ProtectionDomain$1.doIntersectionPrivilege(ProtectionDomain.java:87)
java.awt.EventQueue$4.run(EventQueue.java:708)
java.awt.EventQueue$4.run(EventQueue.java:706)
java.security.AccessController.doPrivileged(Native Method)
java.security.ProtectionDomain$1.doIntersectionPrivilege(ProtectionDomain.java:76)
java.awt.EventQueue.dispatchEvent(EventQueue.java:705)
java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:242)
java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:161)
java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:150)
java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:146)
java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:138)
java.awt.EventDispatchThread.run(EventDispatchThread.java:91)
ID: 34, name: Mpg123AudioPlayer_DRIVER_ctrlLoop
java.lang.Thread.sleep(Native Method)
java.lang.Thread.sleep(Thread.java:340)
java.util.concurrent.TimeUnit.sleep(TimeUnit.java:360)
epis5fcc.audio.mpg.MpgAudioOutputPlayer.ctrlLoop(MpgAudioOutputPlayer.java:169)
epis5fcc.audio.mpg.MpgAudioOutputPlayer.access$000(MpgAudioOutputPlayer.java:19)
epis5fcc.audio.mpg.MpgAudioOutputPlayer$1.run(MpgAudioOutputPlayer.java:88)
java.lang.Thread.run(Thread.java:745)
ID: 26, name: IOTxUdpAccessLoop_IODispatchAccess
java.lang.Thread.sleep(Native Method)
jCommons.comm.io.access.IOUdpAccess.transmitLoop(IOUdpAccess.java:114)
jCommons.comm.io.access.IOAccessBase$2.run(IOAccessBase.java:50)
java.lang.Thread.run(Thread.java:745)
ID: 29, name: MasterLoop_main
java.lang.Thread.sleep(Native Method)
jCommons.master.MasterLoop.ctrlLoop(MasterLoop.java:87)
jCommons.master.MasterLoop.access$000(MasterLoop.java:11)
jCommons.master.MasterLoop$1.run(MasterLoop.java:58)
java.lang.Thread.run(Thread.java:745)
ID: 27, name: IORxSerialPortAccessPollLoop_IOModemAccess
java.lang.Thread.sleep(Native Method)
jCommons.comm.io.access.IOSerialPortAccessPoll.reciveLoop(IOSerialPortAccessPoll.java:256)
jCommons.comm.io.access.IOAccessBase$1.run(IOAccessBase.java:43)
java.lang.Thread.run(Thread.java:745)
ID: 31, name: UsbUpdateWatchService_ctrlLoop
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
java.util.concurrent.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:489)
java.util.concurrent.LinkedBlockingDeque.take(LinkedBlockingDeque.java:678)
sun.nio.fs.AbstractWatchService.take(AbstractWatchService.java:118)
jCommons.update.usb.UsbUpdateWatchService.ctrlLoop(UsbUpdateWatchService.java:107)
jCommons.update.usb.UsbUpdateWatchService.access$000(UsbUpdateWatchService.java:25)
jCommons.update.usb.UsbUpdateWatchService$1.run(UsbUpdateWatchService.java:75)
java.lang.Thread.run(Thread.java:745)
ID: 25, name: IORxUdpAccessLoop_IODispatchAccess
java.net.PlainDatagramSocketImpl.receive0(Native Method)
java.net.AbstractPlainDatagramSocketImpl.receive(AbstractPlainDatagramSocketImpl.java:145)
java.net.DatagramSocket.receive(DatagramSocket.java:786)
jCommons.comm.io.access.IOUdpAccess.reciveLoop(IOUdpAccess.java:175)
jCommons.comm.io.access.IOAccessBase$1.run(IOAccessBase.java:43)
java.lang.Thread.run(Thread.java:745)
ID: 2, name: Reference Handler
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:503)
java.lang.ref.Reference$ReferenceHandler.run(Reference.java:133)
ID: 30, name: VehicleCtrl_ctrlLoop
java.lang.Thread.sleep(Native Method)
epis5fcc.domain.vehicle.control.VehicleCtrl.ctrlLoop(VehicleCtrl.java:74)
jCommons.comm.protocol.ProtCtrlBase$1.run(ProtCtrlBase.java:24)
java.lang.Thread.run(Thread.java:745)
ID: 35, name: Mpg123AudioPlayer_INNER_ctrlLoop
java.lang.Thread.sleep(Native Method)
java.lang.Thread.sleep(Thread.java:340)
java.util.concurrent.TimeUnit.sleep(TimeUnit.java:360)
epis5fcc.audio.mpg.MpgAudioOutputPlayer.ctrlLoop(MpgAudioOutputPlayer.java:169)
epis5fcc.audio.mpg.MpgAudioOutputPlayer.access$000(MpgAudioOutputPlayer.java:19)
epis5fcc.audio.mpg.MpgAudioOutputPlayer$1.run(MpgAudioOutputPlayer.java:88)
java.lang.Thread.run(Thread.java:745)
ID: 21, name: IORxSerialPortAccessPollLoop_IOFccAccess
java.lang.Thread.sleep(Native Method)
jCommons.comm.io.access.IOSerialPortAccessPoll.reciveLoop(IOSerialPortAccessPoll.java:256)
jCommons.comm.io.access.IOAccessBase$1.run(IOAccessBase.java:43)
java.lang.Thread.run(Thread.java:745)
ID: 7, name: FileWatchdog
java.lang.Thread.sleep(Native Method)
org.apache.log4j.helpers.FileWatchdog.run(FileWatchdog.java:104)
ID: 8, name: Java2D Disposer
java.lang.Object.wait(Native Method)
java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:135)
java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:151)
sun.java2d.Disposer.run(Disposer.java:145)
java.lang.Thread.run(Thread.java:745)
ID: 17, name: com.google.inject.internal.util.$Finalizer
java.lang.Object.wait(Native Method)
java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:135)
java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:151)
com.google.inject.internal.util.$Finalizer.run(Finalizer.java:114)
ID: 10, name: AWT-XAWT
sun.awt.X11.XToolkit.waitForEvents(Native Method)
sun.awt.X11.XToolkit.run(XToolkit.java:541)
sun.awt.X11.XToolkit.run(XToolkit.java:505)
java.lang.Thread.run(Thread.java:745)
ID: 32, name: Thread-4
sun.nio.fs.LinuxWatchService.poll(Native Method)
sun.nio.fs.LinuxWatchService.access$600(LinuxWatchService.java:47)
sun.nio.fs.LinuxWatchService$Poller.run(LinuxWatchService.java:311)
java.lang.Thread.run(Thread.java:745)
ID: 28, name: IOTxSerialPortAccessPollLoop_IOModemAccess
java.lang.Thread.sleep(Native Method)
jCommons.comm.io.access.IOSerialPortAccessPoll.transmitLoop(IOSerialPortAccessPoll.java:187)
jCommons.comm.io.access.IOAccessBase$2.run(IOAccessBase.java:50)
java.lang.Thread.run(Thread.java:745)
ID: 14, name: DestroyJavaVM
ID: 22, name: IOTxSerialPortAccessPollLoop_IOFccAccess
java.lang.Thread.sleep(Native Method)
jCommons.comm.io.access.IOSerialPortAccessPoll.transmitLoop(IOSerialPortAccessPoll.java:187)
jCommons.comm.io.access.IOAccessBase$2.run(IOAccessBase.java:50)
java.lang.Thread.run(Thread.java:745)
ID: 19, name: TimerQueue
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
java.util.concurrent.DelayQueue.take(DelayQueue.java:220)
javax.swing.TimerQueue.run(TimerQueue.java:171)
java.lang.Thread.run(Thread.java:745)
ID: 12, name: AWT-Shutdown
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:503)
sun.awt.AWTAutoShutdown.run(AWTAutoShutdown.java:296)
java.lang.Thread.run(Thread.java:745)
ID: 23, name: IORxUdpAccessLoop_IOCityScrnAccess
java.net.PlainDatagramSocketImpl.receive0(Native Method)
java.net.AbstractPlainDatagramSocketImpl.receive(AbstractPlainDatagramSocketImpl.java:145)
java.net.DatagramSocket.receive(DatagramSocket.java:786)
jCommons.comm.io.access.IOUdpAccess.reciveLoop(IOUdpAccess.java:175)
jCommons.comm.io.access.IOAccessBase$1.run(IOAccessBase.java:43)
java.lang.Thread.run(Thread.java:745)
ID: 3, name: Finalizer
java.lang.Object.wait(Native Method)
java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:135)
java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:151)
java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:209)
ID: 4, name: Signal Dispatcher
ID: 52, name: pool-3-thread-1
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
ID: 24, name: IOTxUdpAccessLoop_IOCityScrnAccess
java.lang.Thread.sleep(Native Method)
jCommons.comm.io.access.IOUdpAccess.transmitLoop(IOUdpAccess.java:114)
jCommons.comm.io.access.IOAccessBase$2.run(IOAccessBase.java:50)
java.lang.Thread.run(Thread.java:745)
ID: 36, name: RemoteUpdateCtrl_ctrlLoop
java.lang.Thread.sleep(Native Method)
epis5fcc.domain.update.remote.RemoteUpdateCtrl.ctrlLoop(RemoteUpdateCtrl.java:94)
jCommons.comm.protocol.ProtCtrlBase$1.run(ProtCtrlBase.java:24)
java.lang.Thread.run(Thread.java:745)
ID: 55, name: Mpg123AudioPlayer_OUTER_ctrlLoop
java.lang.Thread.sleep(Native Method)
java.lang.Thread.sleep(Thread.java:340)
java.util.concurrent.TimeUnit.sleep(TimeUnit.java:360)
epis5fcc.audio.mpg.MpgAudioOutputPlayer.ctrlLoop(MpgAudioOutputPlayer.java:169)
epis5fcc.audio.mpg.MpgAudioOutputPlayer.access$000(MpgAudioOutputPlayer.java:19)
epis5fcc.audio.mpg.MpgAudioOutputPlayer$1.run(MpgAudioOutputPlayer.java:88)
java.lang.Thread.run(Thread.java:745)
This appear to be a problem related with GPT and local timers simultaneous use.
On Freescale community you can see one more question similar to yours and other from someone given some clarification.
For the resolution, apply this patch.
From the second post you can jump to kernel 3.10.17 from Fresscale or 3.13.3 from kernel.org
Currently I am trying the patch to see if resolves a similar problem.

CPU load with play framework

Since a few days, on a system which has been in development for about a year, I have a constant CPU load from the play! server. I have two servers, one active and one as a hot spare. In the past, the hot-spre server showed no load, or a neglectable load. But now it consumes a constant 50-110% CPU (using top on Linux).
Is there an easy way to find out what the cause it? I don't see this behavior on my MacBook when debugging (usually 0.1-1%).This is something that only happened in the past few days as far as I am aware.
This is a status print of the hot-spare. As can be seen no controllers are queried apart from the scheduled tasks (which do not perform on this server due to a flag, but are launched):
~ _ _
~ _ __ | | __ _ _ _| |
~ | '_ \| |/ _' | || |_|
~ | __/|_|\____|\__ (_)
~ |_| |__/
~
~ play! 1.2.4, http://www.playframework.org
~ framework ID is prod-frontend
~
~ Status from http://localhost:xxxx/#status,
~
Java:
~~~~~
Version: 1.6.0_26
Home: /usr/lib/jvm/java-6-sun-1.6.0.26/jre
Max memory: 64880640
Free memory: 11297896
Total memory: 29515776
Available processors: 2
Play framework:
~~~~~~~~~~~~~~~
Version: 1.2.4
Path: /opt/play
ID: prod-frontend
Mode: PROD
Tmp dir: /xxx/tmp
Application:
~~~~~~~~~~~~
Path: /xxx/server
Name: iDoms Server
Started at: 07/01/2012 12:05
Loaded modules:
~~~~~~~~~~~~~~
secure at /opt/play/modules/secure
paginate at /xxx/server/modules/paginate-0.14
Loaded plugins:
~~~~~~~~~~~~~~
0:play.CorePlugin [enabled]
100:play.data.parsing.TempFilePlugin [enabled]
200:play.data.validation.ValidationPlugin [enabled]
300:play.db.DBPlugin [enabled]
400:play.db.jpa.JPAPlugin [enabled]
450:play.db.Evolutions [enabled]
500:play.i18n.MessagesPlugin [enabled]
600:play.libs.WS [enabled]
700:play.jobs.JobsPlugin [enabled]
100000:play.plugins.ConfigurablePluginDisablingPlugin [enabled]
Threads:
~~~~~~~~
Thread[Reference Handler,10,system] WAITING
Thread[Finalizer,8,system] WAITING
Thread[Signal Dispatcher,9,system] RUNNABLE
Thread[net.sf.ehcache.CacheManager#449278d5,5,main] WAITING
Thread[Timer-0,5,main] TIMED_WAITING
Thread[com.mchange.v2.async.ThreadPoolAsynchronousRunner$PoolThread-#0,5,main] TIMED_WAITING
Thread[com.mchange.v2.async.ThreadPoolAsynchronousRunner$PoolThread-#1,5,main] TIMED_WAITING
Thread[com.mchange.v2.async.ThreadPoolAsynchronousRunner$PoolThread-#2,5,main] TIMED_WAITING
Thread[jobs-thread-1,5,main] TIMED_WAITING
Thread[jobs-thread-2,5,main] TIMED_WAITING
Thread[jobs-thread-3,5,main] TIMED_WAITING
Thread[New I/O server boss #1 ([id: 0x7065ec20, /0:0:0:0:0:0:0:0:9001]),5,main] RUNNABLE
Thread[DestroyJavaVM,5,main] RUNNABLE
Thread[New I/O server worker #1-3,5,main] RUNNABLE
Requests execution pool:
~~~~~~~~~~~~~~~~~~~~~~~~
Pool size: 0
Active count: 0
Scheduled task count: 0
Queue size: 0
Monitors:
~~~~~~~~
controllers.ReaderJob.doJob(), ms. -> 114 hits; 4.1 avg; 0.0 min; 463.0 max;
controllers.MediaCoderProcess.doJob(), ms. -> 4572 hits; 0.1 avg; 0.0 min; 157.0 max;
controllers.Bootstrap.doJob(), ms. -> 1 hits; 0.0 avg; 0.0 min; 0.0 max;
Datasource:
~~~~~~~~~~~
Jdbc url: jdbc:mysql://xxxx
Jdbc driver: com.mysql.jdbc.Driver
Jdbc user: xxxx
Jdbc password: xxxx
Min pool size: 1
Max pool size: 30
Initial pool size: 3
Checkout timeout: 5000
Jobs execution pool:
~~~~~~~~~~~~~~~~~~~
Pool size: 3
Active count: 0
Scheduled task count: 4689
Queue size: 3
Scheduled jobs (4):
~~~~~~~~~~~~~~~~~~~~~~~~~~
controllers.APNSFeedbackJob run every 24h. (has never run)
controllers.Bootstrap run at application start. (last run at 07/01/2012 12:05:32)
controllers.MediaCoderProcess run every 15s. (last run at 07/02/2012 07:10:46)
controllers.ReaderJob run every 600s. (last run at 07/02/2012 07:05:36)
Waiting jobs:
~~~~~~~~~~~~~~~~~~~~~~~~~~~
controllers.MediaCoderProcess will run in 2 seconds
controllers.APNSFeedbackJob will run in 17672 seconds
controllers.ReaderJob will run in 276 seconds
if your server is running under Linux, you may be hit by the Leap Second bug which appears last week-end.
This bug affects the Linux kernel (the Thread management), so application which uses threads (as the JVM, mysql etc...) may consume high load of CPU.
if you are using jdk 1.7 should be easy as they added this feature have look at my other related answer -> How to monitor the computer's cpu, memory, and disk usage in Java?

Categories