I am running a Spark Structured Streaming job (bounced every day) on EMR. After a few hours of execution the application hits an OOM error and gets killed. My configuration and Spark SQL code are below.
I am new to Spark and need your valuable input.
The EMR cluster has 10 instances, each with 16 cores and 64 GB of memory.
Spark-Submit arguments:
num_of_executors: 17
executor_cores: 5
executor_memory: 19G
driver_memory: 30G
The job reads input as micro-batches from Kafka at a 30-second interval. The average number of rows read per batch is 90k.
spark.streaming.kafka.maxRatePerPartition: 4500
spark.streaming.stopGracefullyOnShutdown: true
spark.streaming.unpersist: true
spark.streaming.kafka.consumer.cache.enabled: true
spark.hadoop.fs.s3.maxRetries: 30
spark.sql.shuffle.partitions: 2001
Spark SQL aggregation code:
dataset.groupBy(functions.col(NAME),
                functions.window(functions.column(TIMESTAMP_COLUMN), "30 seconds"))
       .agg(functions.concat_ws(SPLIT, functions.collect_list(DEPARTMENT)).as(DEPS))
       .select(NAME, DEPS)
       .map((MapFunction<Row, byte[]>) row -> {
           Map<String, Object> map = Maps.newHashMap();
           map.put(NAME, row.getString(0));
           map.put(DEPS, row.getString(1));
           return new KryoMapSerializationService().serialize(map);
       }, Encoders.BINARY());
Some logs from the driver:
20/04/04 13:10:51 INFO TaskSetManager: Finished task 1911.0 in stage 1041.0 (TID 1052055) in 374 ms on <host> (executor 3) (1998/2001)
20/04/04 13:10:52 INFO TaskSetManager: Finished task 1925.0 in stage 1041.0 (TID 1052056) in 411 ms on <host> (executor 3) (1999/2001)
20/04/04 13:10:52 INFO TaskSetManager: Finished task 1906.0 in stage 1041.0 (TID 1052054) in 776 ms on <host> (executor 3) (2000/2001)
20/04/04 13:11:04 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 3.
20/04/04 13:11:04 INFO DAGScheduler: Executor lost: 3 (epoch 522)
20/04/04 13:11:04 INFO BlockManagerMasterEndpoint: Trying to remove executor 3 from BlockManagerMaster.
20/04/04 13:11:04 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(3, <host>, 38533, None)
20/04/04 13:11:04 INFO BlockManagerMaster: Removed 3 successfully in removeExecutor
20/04/04 13:11:04 INFO YarnAllocator: Completed container container_1582797414408_1814_01_000004 on host: <host> (state: COMPLETE, exit status: 143)
And by the way, I am using collectAsList in my foreachBatch code:
List<Event> list = dataset.select("value")
.selectExpr("deserialize(value) as rows")
.select("rows.*")
.selectExpr(NAME, DEPS)
.as(Encoders.bean(Event.class))
.collectAsList();
With these settings, you may be causing your own issues.
num_of_executors: 17
executor_cores: 5
executor_memory: 19G
driver_memory: 30G
You are basically creating extra containers here that you then have to shuffle between. Instead, start off with something like 10 executors, 15 cores, 60g memory. If that is working, then you can play with these a bit to try and optimize performance. I usually try splitting my containers in half each step (but I also haven't needed to do this since Spark 2.0).
Let Spark SQL keep the default of 200 for spark.sql.shuffle.partitions. The more you break this up, the more work you make Spark do to calculate the shuffles. If anything, I'd go with the same degree of parallelism as you have executors, so in this case just 10. This is how you would tune Hive queries back when 2.0 came out.
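If you want to experiment with that value, it is a single setting on the session. A minimal sketch (the app name is just a placeholder; either pick one partition per executor or leave the default of 200):
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
        .appName("streaming-aggregation")              // placeholder app name
        .config("spark.sql.shuffle.partitions", "10")  // or leave it at the default of 200
        .getOrCreate();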
Making the job more complex just to break it up further puts all of the load on the master.
Using Datasets and Encoders is also generally not as performant as going with straight DataFrame operations. I have seen great lifts in performance from factoring typed operations out into DataFrame operations.
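As a rough illustration of what that refactor could look like (a sketch only, reusing the question's column constants and KryoMapSerializationService, and assuming a SparkSession named spark), the typed map can be pushed into a UDF so the pipeline stays in untyped DataFrame operations:
import java.util.HashMap;
import java.util.Map;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.api.java.UDF2;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.types.DataTypes;

// Register the question's Kryo serializer as a UDF.
spark.udf().register("serializeMap",
        (UDF2<String, String, byte[]>) (name, deps) -> {
            Map<String, Object> map = new HashMap<>();
            map.put(NAME, name);
            map.put(DEPS, deps);
            return new KryoMapSerializationService().serialize(map);
        },
        DataTypes.BinaryType);

// The aggregation itself stays entirely in untyped DataFrame operations.
Dataset<Row> out = dataset
        .groupBy(functions.col(NAME),
                 functions.window(functions.column(TIMESTAMP_COLUMN), "30 seconds"))
        .agg(functions.concat_ws(SPLIT, functions.collect_list(DEPARTMENT)).as(DEPS))
        .select(functions.callUDF("serializeMap",
                functions.col(NAME), functions.col(DEPS)).as("value"));
The serializer still runs once per row inside the UDF; the point is only that the surrounding plan stays in DataFrame operations rather than a typed map with an encoder.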
I have a setup with 1 master node and 1 slave node.
My issue is that when running MapReduce processing, the slave node doesn't seem to be doing any work. Can anyone help me check, and change whatever is needed, to make sure the slave is working?
The config file details can also be found at the URL below:
https://drive.google.com/file/d/1ULEe6k2zYnfQDQUQIbz_xR29WgT1DJhB/view
Here are my observations:
1) When I check CPU utilization while the MapReduce job is running, the slave doesn't seem to be working and sits at 0% CPU, while the master is at 44% CPU. Refer to the attachment.
2) When I run the DFS report it shows 2 live nodes, but the cluster web UI shows only 1. Refer to the attachment and the output below.
3) The total MapReduce processing time is the same with or without the slave.
-------------------------------------------------
Live datanodes (2):
Name: 192.168.249.128:9866 (node-master)
Hostname: localhost
Decommission Status : Normal
Configured Capacity: 20587741184 (19.17 GB)
DFS Used: 174785723 (166.69 MB)
Non DFS Used: 60308293 (57.51 MB)
DFS Remaining: 20352647168 (18.95 GB)
DFS Used%: 0.85%
DFS Remaining%: 98.86%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Tue Oct 23 11:17:39 PDT 2018
Last Block Report: Tue Oct 23 11:07:32 PDT 2018
Num of Blocks: 93
Name: 192.168.249.129:9866 (node1)
Hostname: localhost
Decommission Status : Normal
Configured Capacity: 20587741184 (19.17 GB)
DFS Used: 85743 (83.73 KB)
Non DFS Used: 33775889 (32.21 MB)
DFS Remaining: 20553879552 (19.14 GB)
DFS Used%: 0.00%
DFS Remaining%: 99.84%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Tue Oct 23 11:17:38 PDT 2018
Last Block Report: Tue Oct 23 11:03:59 PDT 2018
Num of Blocks: 4
You're showing DataNodes with the DFS report, not the NodeManagers that actually process the data. In the YARN UI, you will want to take note of the "Active Nodes" counter, which in your case is 1. That would make sense if the master is a NameNode and ResourceManager while the slave is a DataNode and NodeManager.
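If you'd rather check this from code than from the web UI, here is a minimal sketch using the YARN client API (it reads your cluster settings from yarn-site.xml on the classpath; the class name is just a placeholder). The running NodeManagers it lists correspond to the "Active Nodes" counter:
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListActiveNodes {
    public static void main(String[] args) throws Exception {
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration()); // picks up yarn-site.xml
        yarn.start();
        // One line per running NodeManager
        for (NodeReport node : yarn.getNodeReports(NodeState.RUNNING)) {
            System.out.println(node.getNodeId() + "  containers=" + node.getNumContainers());
        }
        yarn.stop();
    }
}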
Other than that, if you have a non-splittable file, for example a ZIP, or your file is smaller than the block size (128 MB by default), then only one mapper will process it. Plus, it's not guaranteed that mappers (or reducers) will be distributed evenly over all available resources.
Outside of a learning environment, though, 40 GB of storage and 8 GB of RAM would be better spent on multithreading rather than distributed computing (or on a proper database, i.e. parse the files and load them into a queryable store). Or use Spark or Pig, which don't require Hadoop but are much easier to work with than MapReduce.
I am using the following code to check whether InnoDB is enabled in a MySQL server, but I want to get the total number of disk writes made by MySQL. Please help with the following program.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class connectToInnoDb {
    public static void main(String[] args) {
        try {
            Class.forName("com.mysql.cj.jdbc.Driver");
            Connection con = DriverManager.getConnection(
                    "jdbc:mysql://localhost:3310/INFORMATION_SCHEMA", "root", "root");
            Statement stmt = con.createStatement();
            ResultSet rs = stmt.executeQuery("SELECT * FROM ENGINES");
            while (rs.next()) {
                // Compare strings with equalsIgnoreCase, not ==
                if (rs.getString(1).equalsIgnoreCase("InnoDB"))
                    System.out.println("Yes");
            }
            con.close();
        } catch (Exception e) {
            System.out.println(e);
        }
    }
}
You can get a lot of InnoDB information with SHOW ENGINE INNODB STATUS, including I/O counts:
mysql> SHOW ENGINE INNODB STATUS\G
...
--------
FILE I/O
--------
I/O thread 0 state: waiting for i/o request (insert buffer thread)
I/O thread 1 state: waiting for i/o request (log thread)
I/O thread 2 state: waiting for i/o request (read thread)
I/O thread 3 state: waiting for i/o request (read thread)
I/O thread 4 state: waiting for i/o request (read thread)
I/O thread 5 state: waiting for i/o request (read thread)
I/O thread 6 state: waiting for i/o request (write thread)
I/O thread 7 state: waiting for i/o request (write thread)
I/O thread 8 state: waiting for i/o request (write thread)
I/O thread 9 state: waiting for i/o request (write thread)
Pending normal aio reads: 0 [0, 0, 0, 0] , aio writes: 0 [0, 0, 0, 0] ,
ibuf aio reads: 0, log i/o's: 0, sync i/o's: 0
Pending flushes (fsync) log: 0; buffer pool: 0
431 OS file reads, 69 OS file writes, 53 OS fsyncs
0.00 reads/s, 0 avg bytes/read, 0.00 writes/s, 0.00 fsyncs/s
...
I see above there have been 69 OS file writes. The numbers above are small because I got this information from a sandbox MySQL instance running on my laptop, and it hasn't been running long.
As commented by JimGarrison above, most of the information reported by INNODB STATUS is also available to you as individual rows in the INFORMATION_SCHEMA.INNODB_METRICS table. This is much easier to use in Java than SHOW ENGINE INNODB STATUS and trying to parse the text.
mysql> SELECT * FROM INFORMATION_SCHEMA.INNODB_METRICS
WHERE NAME = 'os_data_writes'\G
NAME: os_data_writes
SUBSYSTEM: os
COUNT: 69
MAX_COUNT: 69
MIN_COUNT: NULL
AVG_COUNT: 0.0034979215248910067
COUNT_RESET: 69
MAX_COUNT_RESET: 69
MIN_COUNT_RESET: NULL
AVG_COUNT_RESET: NULL
TIME_ENABLED: 2017-12-22 10:27:50
TIME_DISABLED: NULL
TIME_ELAPSED: 19726
TIME_RESET: NULL
STATUS: enabled
TYPE: status_counter
COMMENT: Number of writes initiated (innodb_data_writes)
Read https://dev.mysql.com/doc/refman/5.7/en/innodb-information-schema-metrics-table.html
I won't show the Java code; you already know how to run a query and fetch the results. These statements can be run as SQL statements the same way you run SELECT queries.
mysql> SHOW GLOBAL STATUS LIKE 'Innodb%write%';
+-----------------------------------+-------+
| Variable_name | Value |
+-----------------------------------+-------+
| Innodb_buffer_pool_write_requests | 5379 |
| Innodb_data_pending_writes | 0 |
| Innodb_data_writes | 416 |
| Innodb_dblwr_writes | 30 |
| Innodb_log_write_requests | 1100 |
| Innodb_log_writes | 88 |
| Innodb_os_log_pending_writes | 0 |
| Innodb_truncated_status_writes | 0 |
+-----------------------------------+-------+
mysql> SHOW GLOBAL STATUS LIKE 'Uptime';
+---------------+--------+
| Variable_name | Value |
+---------------+--------+
| Uptime | 4807 | -- divide by this to get "per second"
+---------------+--------+
Note: "requests" include both writes that need to hit the disk and those that do not.
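If you want to read one of these counters from Java, it is the same JDBC pattern as the SELECT in the question. A minimal sketch (connection settings copied from the question; the class name is a placeholder):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class InnodbWrites {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                     "jdbc:mysql://localhost:3310/INFORMATION_SCHEMA", "root", "root");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SHOW GLOBAL STATUS LIKE 'Innodb_data_writes'")) {
            if (rs.next()) {
                // Result columns are Variable_name and Value
                System.out.println(rs.getString(1) + " = " + rs.getString(2));
            }
        }
    }
}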
I need to find out the actual memory consumed by a Java process on Linux. Tools like VisualVM/JConsole show accurate numbers, but I have to calculate the actual memory used by the JVM from the top command.
I am looking at PID 28169. In top (Linux) it shows 17.2g virtual, RES 10g, shared 15m. 10 GB should not be possible since I have given this JVM a 6 GB maximum heap, yet jvmtop shows accurate results (matching VisualVM).
Can someone show me how to calculate the actual memory usage from the top stats?
Using jvmtop:
JvmTop 0.8.0 alpha - 11:09:08, amd64, 12 cpus, Linux 2.6.32-57, load avg 0.00
http://code.google.com/p/jvmtop
PID 28169: com.gigaspaces.start.SystemBoot
ARGS: com.gigaspaces.start.services="GSC"
VMARGS: -XX:+AggressiveOpts -XX:MaxPermSize=256m -XX:+HeapDumpOnOutOfMemo[...]
VM: Oracle Corporation Java HotSpot(TM) 64-Bit Server VM 1.7.0_51
UP: 179:23m #THR: 90 #THRPEAK: 92 #THRCREATED: 3725 USER: evolv
GC-Time: 0: 2m #GC-Runs: 3353 #TotalLoadedClasses: 23107
CPU: 1.46% GC: 0.00% HEAP:4623m /10240m NONHEAP: 180m / 304m
TID NAME STATE CPU TOTALCPU BLOCKEDBY
3733 RMI TCP Connection(2210)-10.16 RUNNABLE 14.93% 0.00%
3734 JMX server connection timeout TIMED_WAITING 0.13% 0.00%
95 GS-directLoadJobListenerPollin TIMED_WAITING 0.12% 0.14%
94 GS-jobListenerPollingContainer TIMED_WAITING 0.11% 0.14%
3375 GS-jobListenerPollingContainer TIMED_WAITING 0.10% 0.55%
93 GS-jobListenerPollingContainer TIMED_WAITING 0.09% 0.14%
3377 GS-jobListenerPollingContainer TIMED_WAITING 0.09% 0.56%
81 GS-subJobCompleteListenerPolli TIMED_WAITING 0.09% 0.14%
3376 GS-jobListenerPollingContainer TIMED_WAITING 0.08% 0.54%
98 GS-stopJobListenerPollingConta TIMED_WAITING 0.08% 0.14%
Note: Only top 10 threads (according cpu load) are shown!
^C-bash-4.1$
Using top:
top - 11:15:30 up 18 days, 6:34, 1 user, load average: 0.00, 0.00, 0.00
Tasks: 306 total, 1 running, 304 sleeping, 1 stopped, 0 zombie
Cpu(s): 0.3%us, 0.1%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 16332776k total, 15913220k used, 419556k free, 316876k buffers
Swap: 4095996k total, 146452k used, 3949544k free, 3024048k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
28169 evolv 20 0 17.2g 10g 15m S 2.8 70.3 4493:54 java
28034 evolv 20 0 5690m 289m 7656 S 0.0 1.8 16:31.29 java
28006 evolv 20 0 5821m 286m 7952 S 0.5 1.8 18:16.50 java
2098 root 20 0 272m 145m 4016 S 0.3 0.9 46:51.51 splunkd
2163 root 20 0 128m 40m 1220 S 0.0 0.3 1:05.86 puppet
1879 root 20 0 244m 6660 5036 S 0.0 0.0 1:21.82 sssd_be
You can use jstat to view your process statistics.
Examples
jstat -gc [insert-pid-here]
The above would give you an overview of your GC heap.
Other commands
jstat -gccapacity [insert-pid-here]
jstat -gcutil [insert-pid-here]
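If you also want a cross-check from inside the JVM itself (rather than from top or jstat), the standard management beans report the same heap/non-heap numbers. A minimal sketch (the class name is a placeholder):
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public class MemCheck {
    public static void main(String[] args) {
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = mem.getHeapMemoryUsage();
        MemoryUsage nonHeap = mem.getNonHeapMemoryUsage();
        // These are the figures jstat/VisualVM report; top's RES also counts
        // thread stacks, the JIT code cache and other native allocations.
        System.out.printf("heap used=%dM committed=%dM max=%dM%n",
                heap.getUsed() >> 20, heap.getCommitted() >> 20, heap.getMax() >> 20);
        System.out.printf("non-heap used=%dM committed=%dM%n",
                nonHeap.getUsed() >> 20, nonHeap.getCommitted() >> 20);
    }
}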
UPDATES on Oct 25:
Now I found out what's causing the problem.
1) The child process kills itself; that's why strace/perf/auditctl cannot track it down.
2) The JNI call that creates the process is triggered from a Java thread. When that thread eventually dies, it also destroys the process it created.
3) In my code that forks and execve()s the child process, I have code to monitor parent process death and kill my child process with the following line: prctl( PR_SET_PDEATHSIG, SIGKILL ); My fault that I didn't pay special attention to this flag before, because it's considered a best practice in my other projects, where the child process is forked from the main thread.
4) If I comment out this line, the problem is gone. The original purpose was to kill the child process when the parent process is gone. Even without this flag, that is still the correct behavior; it seems to be the default behavior on the Ubuntu box.
5) Finally, I found it's a kernel bug, fixed in kernel version 3.4.0; my Ubuntu box from AWS runs kernel version 3.13.0-29-generic.
There are a couple of useful links to the issues:
a) http://www.linuxprogrammingblog.com/threads-and-fork-think-twice-before-using-them
b) prctl(PR_SET_PDEATHSIG, SIGNAL) is called on parent thread exit, not parent process exit.
c) https://bugzilla.kernel.org/show_bug.cgi?id=43300
UPDATES on Oct 15:
Thanks so much for all the suggestions. I am investigating from one area of the system to another. It's hard to find a reason.
I am wondering two things.
1) Why are powerful tools such as strace, auditctl and perf script not able to track down who issued the kill?
2) Does "+++ killed by SIGKILL +++" really mean it was killed by that signal?
ORIGINAL POST
I have a long-running C process launched from a Java application server on Ubuntu 12 through the JNI interface. The reason I use the JNI interface to start the process instead of Java's ProcessBuilder is performance: it's very inefficient for ProcessBuilder to do IPC, especially because the extra buffering introduces a very long delay.
Periodically it is mysteriously terminated by SIGKILL. The way I found out is through strace, which reports: "+++ killed by SIGKILL +++"
I checked the following:
It's not a crash.
It's not an OOM kill. There is nothing in dmesg, and my process uses only 3.3% of 1 GB of memory.
The Java layer didn't kill the process. I added a log statement to the JNI code that fires if it terminates the process, and no such log was written.
It's not a permission issue. I tried running as sudo and as a different user; in both cases the process gets killed.
If I run the process locally in a shell, everything works fine. What's more, in the C code of my long-running process I ignore SIGHUP. It only gets killed when it's running as a child process of the Java server.
The process is very CPU intensive, using 30% of a CPU. There are lots of voluntary and nonvoluntary context switches (nonvoluntary_ctxt_switches).
(NEW UPDATE) One important thing very likely related to why my process is killed: if the process is doing heavy lifting, it won't be killed; however, sometimes it's doing little CPU-intensive work, and when that happens, after a while (roughly 1 minute) it is killed. Its status is always S (sleeping) instead of R (running). It seems the OS decides to kill the process when it has been idle most of the time, and not when it has been busy.
I suspect Java's GC is the culprit; however, Java will never garbage collect a singleton object associated with JNI (my JNI object is tied to that singleton).
I am puzzled about why it's terminated. Does anyone have a good suggestion for how to track it down?
P.S.
On my Ubuntu box the ulimit -a result is:
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 7862
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 65535
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 7862
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
I tried increasing the limits, but that still does not solve the issue.
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) unlimited
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 65535
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) unlimited
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) unlimited
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
Here is the process status when I run cat /proc/$$$/status:
Name: mimi_coso
State: S (Sleeping)
Tgid: 2557
Ngid: 0
Pid: 2557
PPid: 2229
TracerPid: 0
Uid: 0 0 0 0
Gid: 0 0 0 0
FDSize: 256
Groups: 0
VmPeak: 146840 kB
VmSize: 144252 kB
VmLck: 0 kB
VmPin: 0 kB
VmHWM: 36344 kB
VmRSS: 34792 kB
VmData: 45728 kB
VmStk: 136 kB
VmExe: 116 kB
VmLib: 23832 kB
VmPTE: 292 kB
VmSwap: 0 kB
Threads: 1
SigQ: 0/7862
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000004
SigIgn: 0000000000011001
SigCgt: 00000001c00064ee
CapInh: 0000000000000000
CapPrm: 0000001fffffffff
CapEff: 0000001fffffffff
CapBnd: 0000001fffffffff
Seccomp: 0
Cpus_allowed: 7fff
Cpus_allowed_list: 0-14
Mems_allowed: 00000000,00000001
Mems_allowed_list: 0
voluntary_ctxt_switches: 16978
nonvoluntary_ctxt_switches: 52120
strace shows:
$ strace -p 22254 -s 80 -o /tmp/debug.lighttpd.txt
read(0, "SGI\0\1\0\0\0\1\0c\0\0\0\t\0\0T\1\2248\0\0\0\0'\1\0\0(\0\0"..., 512) = 113
read(0, "SGI\0\1\0\0\0\1\0\262\1\0\0\10\0\1\243\1\224L\0\0\0\0/\377\373\222D\231\214"..., 512) = 448
sendto(3, "<15>Oct 10 18:34:01 MixCoder[271"..., 107, MSG_NOSIGNAL, NULL, 0) = 107
write(1, "SGO\0\0\0\0 \272\1\0\0\t\0\1\253\1\243\273\0\0\0\0'\1\0\0\0\0\0\1\242"..., 454) = 454
sendto(3, "<15>Oct 10 18:34:01 MixCoder[271"..., 107, MSG_NOSIGNAL, NULL, 0) = 107
write(1, "SGO\0\0\0\0 \341\0\0\0\10\0\0\322\1\254Z\0\0\0\0/\377\373R\4\0\17\21!"..., 237) = 237
read(0, "SGI\0\1\0\0\0\1\0)\3\0\0\t\0\3\32\1\224`\0\0\0\0'\1\0\0\310\0\0"..., 512) = 512
read(0, "\344u\233\16\257\341\315\254\272\300\351\302\324\263\212\351\225\365\1\241\225\3+\276J\273\37R\234R\362z"..., 512) = 311
read(0, "SGI\0\1\0\0\0\1\0\262\1\0\0\10\0\1\243\1\224f\0\0\0\0/\377\373\222d[\210"..., 512) = 448
sendto(3, "<15>Oct 10 18:34:01 MixCoder[271"..., 107, MSG_NOSIGNAL, NULL, 0) = 107
write(1, "SGO\0\0\0\0 %!\0\0\t\0\0+\1\243\335\0\0\0\0\27\0\0\0\0\1B\300\36"..., 8497) = 8497
sendto(3, "<15>Oct 10 18:34:01 MixCoder[271"..., 107, MSG_NOSIGNAL, NULL, 0) = 107
write(1, "SGO\0\0\0\0 \341\0\0\0\10\0\0\322\1\254t\0\0\0\0/\377\373R\4\0\17\301\31"..., 237) = 237
read(0, "SGI\0\1\0\0\0\1\0\262\1\0\0\10\0\1\243\1\224\200\0\0\0\0/\377\373\222d/\200"..., 512) = 448
sendto(3, "<15>Oct 10 18:34:01 MixCoder[271"..., 107, MSG_NOSIGNAL, NULL, 0) = 107
write(1, "SGO\0\0\0\0 \341\0\0\0\10\0\0\322\1\254\216\0\0\0\0/\377\373R\4\0\17\361+"..., 237) = 237
read(0, "SGI\0\1\0\0\0\1\0\221\0\0\0\t\0\0\202\1\224\210\0\0\0\0'\1\0\0P\0\0"..., 512) = 159
read(0, unfinished ...)
+++ killed by SIGKILL +++
Assuming that you have root access on your machine, you can enable auditing of the kill(2) syscall to gather such information.
root # auditctl -a exit,always -F arch=b64 -S kill -F a1=9
root # auditctl -l
LIST_RULES: exit,always arch=3221225534 (0xc000003e) a1=9 (0x9) syscall=kill
root # sleep 99999 &
[2] 11688
root # kill -9 11688
root # ausearch -sc kill
time->Tue Oct 14 00:38:44 2014
type=OBJ_PID msg=audit(1413272324.413:441376): opid=11688 oauid=52872 ouid=0 oses=20 ocomm="sleep"
type=SYSCALL msg=audit(1413272324.413:441376): arch=c000003e syscall=62 success=yes exit=0 a0=2da8 a1=9 a2=0 a3=0 items=0 ppid=6107 pid=6108 auid=52872 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsg
id=0 tty=pts2 ses=20 comm="bash" exe="/bin/bash" key=(null)
The other way is to set up kernel tracing, which may be overkill when the audit system can do the same work.
Finally, I figured out the reason.
The child process kills itself, and it's a Linux kernel bug.
Details:
1) The child process kills itself; that's why strace/perf/auditctl cannot track it down.
2) The JNI call that creates the process is triggered from a Java thread. When that thread eventually dies, it also destroys the process it created.
3) In my code that forks and execve()s the child process, I have code to monitor parent process death and kill my child process with the following line: prctl( PR_SET_PDEATHSIG, SIGKILL ); I didn't pay special attention to this flag before, because it's considered a best practice in my other projects, where the child process is forked from the main thread.
4) If I comment out this line, the problem is gone. The original purpose was to kill the child process when the parent process is gone. Even without this flag, that is still the correct behavior; it seems to be the default behavior on the Ubuntu box.
5) From this article, https://bugzilla.kernel.org/show_bug.cgi?id=43300, it's a kernel bug fixed in kernel version 3.4.0; my Ubuntu box from AWS runs kernel version 3.13.0-29-generic.
My machine configuration:
===>Ubuntu 14.04 LTS
===>3.13.0-29-generic
Some useful links to the issues:
a) http://www.linuxprogrammingblog.com/threads-and-fork-think-twice-before-using-them
b) prctl(PR_SET_PDEATHSIG, SIGNAL) is called on parent thread exit, not parent process exit
c) https://bugzilla.kernel.org/show_bug.cgi?id=43300
As I mentioned earlier, the other option is kernel tracing, which can be done with the perf tool.
# apt-get install linux-tools-3.13.0-35-generic
# perf list | grep kill
syscalls:sys_enter_kill [Tracepoint event]
syscalls:sys_exit_kill [Tracepoint event]
syscalls:sys_enter_tgkill [Tracepoint event]
syscalls:sys_exit_tgkill [Tracepoint event]
syscalls:sys_enter_tkill [Tracepoint event]
syscalls:sys_exit_tkill [Tracepoint event]
# perf record -a -e syscalls:sys_enter_kill sleep 10
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 1.634 MB perf.data (~71381 samples) ]
// Open a new shell to kill.
$ sleep 9999 &
[1] 2387
$ kill -9 2387
[1]+ Killed sleep 9999
$ echo $$
9014
// Dump the trace in your original shell.
# perf script
...
bash 9014 [001] 1890350.544971: syscalls:sys_enter_kill: pid: 0x00000953, sig: 0x00000009