I'm working with OpenJDK 11 and a very simple Spring Boot application; almost the only thing it has is the Spring Boot Actuator enabled, so I can call /actuator/health etc.
I also have a very simple Kubernetes cluster on GCE, with just one pod running one container (containing this app, of course).
My configuration has some key points that I want to highlight. It has resource requests and limits:
resources:
  limits:
    memory: 600Mi
  requests:
    memory: 128Mi
And it has a readiness probe
readinessProbe:
  initialDelaySeconds: 30
  periodSeconds: 30
  httpGet:
    path: /actuator/health
    port: 8080
I'm also setting JVM_OPTS like this (which my program is obviously using):
env:
  - name: JVM_OPTS
    value: "-XX:MaxRAM=512m"
The problem
I launch this and it gets OOMKilled in about 3 hours every time!
I'm never calling anything myself; the only call is the readiness probe that Kubernetes makes every 30 seconds. Is that enough to exhaust the memory? I have also not implemented anything out of the ordinary, just a GET method that says hello world, along with all the Spring Boot imports needed for the actuators.
If I run kubectl top pod XXXXXX, I can actually see the memory usage gradually getting bigger and bigger.
I have tried a lot of different configurations, tips, etc., but nothing seems to work with a basic Spring Boot app.
Is there a way to actually hard limit the memory so that Java raises an OutOfMemoryError instead? Or to prevent this from happening?
Thanks in advance
EDIT: After 15h running
NAME READY STATUS RESTARTS AGE
pod/test-79fd5c5b59-56654 1/1 Running 4 15h
describe pod says...
State: Running
Started: Wed, 27 Feb 2019 10:29:09 +0000
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Wed, 27 Feb 2019 06:27:39 +0000
Finished: Wed, 27 Feb 2019 10:29:08 +0000
That last span of time is about 4 hours and had only 483 calls to /actuator/health. Apparently that was enough to make Java exceed the MaxRAM hint?
EDIT: Almost 17h
It's about to die again
$ kubectl top pod test-79fd5c5b59-56654
NAME CPU(cores) MEMORY(bytes)
test-79fd5c5b59-56654 43m 575Mi
EDIT: Losing all hope at 23h
NAME READY STATUS RESTARTS AGE
pod/test-79fd5c5b59-56654 1/1 Running 6 23h
describe pod:
State: Running
Started: Wed, 27 Feb 2019 18:01:45 +0000
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Wed, 27 Feb 2019 14:12:09 +0000
Finished: Wed, 27 Feb 2019 18:01:44 +0000
EDIT: A new finding
Last night I was doing some interesting reading:
https://developers.redhat.com/blog/2017/03/14/java-inside-docker/
https://banzaicloud.com/blog/java10-container-sizing/
https://medium.com/adorsys/jvm-memory-settings-in-a-container-environment-64b0840e1d9e
TL;DR: I decided to remove the memory limit and start the process again. The result was quite interesting (after about 11 hours running):
NAME CPU(cores) MEMORY(bytes)
test-84ff9d9bd9-77xmh 218m 1122Mi
So... WTH with that CPU? I was kind of expecting a big number for memory usage, but what is happening with the CPU?
The one thing I can think of is that the GC is running like crazy, thinking that MaxRAM is 512m while the process is using more than 1G. I'm wondering, is Java detecting ergonomics correctly? (I'm starting to doubt it.)
To test my theory, I set a limit of 512m and deployed the app that way. I found that from the start there is an unusual CPU load, which has to be the GC running very frequently:
kubectl create ...
limitrange/mem-limit-range created
pod/test created
kubectl exec -it test-64ccb87fd7-5ltb6 /usr/bin/free
total used free shared buff/cache available
Mem: 7658200 1141412 4132708 19948 2384080 6202496
Swap: 0 0 0
kubectl top pod ..
NAME CPU(cores) MEMORY(bytes)
test-64ccb87fd7-5ltb6 522m 283Mi
522m is too much vCPU, so my logical next step was to make sure I'm using the most appropriate GC for this case. I changed JVM_OPTS this way:
env:
  - name: JVM_OPTS
    value: "-XX:MaxRAM=512m -Xmx128m -XX:+UseSerialGC"
...
resources:
  requests:
    memory: 256Mi
    cpu: 0.15
  limits:
    memory: 700Mi
And that brings the vCPU usage back to a reasonable level. After kubectl top pod:
NAME CPU(cores) MEMORY(bytes)
test-84f4c7445f-kzvd5 13m 305Mi
Setting Xmx alongside MaxRAM obviously affects the JVM, but how is it not possible to control the amount of memory we have in virtualized containers? I know that the free command will report the host's available RAM, but OpenJDK should be using cgroups, right?
I'm still monitoring the memory ...
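One way to see what the JVM itself has concluded about its limits is to print them from inside the container. The following is a minimal sketch, not part of the original app: the class name JvmLimits is made up for illustration, and it only uses the standard Runtime API. If the reported max heap is far from what the flags suggest, the ergonomics are not doing what you expect.

```java
public class JvmLimits {
    /** Converts a byte count to whole mebibytes (rounding down). */
    public static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // Upper bound the heap may grow to (driven by -Xmx or the JVM's own ergonomics).
        System.out.println("max heap (MiB):       " + toMiB(rt.maxMemory()));
        // Heap memory the JVM currently holds.
        System.out.println("current heap (MiB):   " + toMiB(rt.totalMemory()));
        // Number of processors the JVM believes it may use.
        System.out.println("available processors: " + rt.availableProcessors());
    }
}
```

Note that this only covers the heap; thread stacks, metaspace and other native memory are not included in these numbers.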
EDIT: A new hope
I did two things. The first was to remove my container limit again, because I want to analyze how much it will grow. I also added a new flag, -XX:NativeMemoryTracking=summary, to see how the process is using native memory.
At the beginning everything was normal: the process started out consuming about 300MB according to kubectl top pod. I let it run for about 4 hours and then ...
kubectl top pod
NAME CPU(cores) MEMORY(bytes)
test-646864bc48-69wm2 54m 645Mi
Kind of expected, right? But then I checked the native memory usage:
jcmd <PID> VM.native_memory summary
Native Memory Tracking:
Total: reserved=2780631KB, committed=536883KB
- Java Heap (reserved=131072KB, committed=120896KB)
(mmap: reserved=131072KB, committed=120896KB)
- Class (reserved=203583KB, committed=92263KB)
(classes #17086)
( instance classes #15957, array classes #1129)
(malloc=2879KB #44797)
(mmap: reserved=200704KB, committed=89384KB)
( Metadata: )
( reserved=77824KB, committed=77480KB)
( used=76069KB)
( free=1411KB)
( waste=0KB =0.00%)
( Class space:)
( reserved=122880KB, committed=11904KB)
( used=10967KB)
( free=937KB)
( waste=0KB =0.00%)
- Thread (reserved=2126472KB, committed=222584KB)
(thread #2059)
(stack: reserved=2116644KB, committed=212756KB)
(malloc=7415KB #10299)
(arena=2413KB #4116)
- Code (reserved=249957KB, committed=31621KB)
(malloc=2269KB #9949)
(mmap: reserved=247688KB, committed=29352KB)
- GC (reserved=951KB, committed=923KB)
(malloc=519KB #1742)
(mmap: reserved=432KB, committed=404KB)
- Compiler (reserved=1913KB, committed=1913KB)
(malloc=1783KB #1343)
(arena=131KB #5)
- Internal (reserved=7798KB, committed=7798KB)
(malloc=7758KB #28415)
(mmap: reserved=40KB, committed=40KB)
- Other (reserved=32304KB, committed=32304KB)
(malloc=32304KB #3030)
- Symbol (reserved=20616KB, committed=20616KB)
(malloc=17475KB #212850)
(arena=3141KB #1)
- Native Memory Tracking (reserved=5417KB, committed=5417KB)
(malloc=347KB #4494)
(tracking overhead=5070KB)
- Arena Chunk (reserved=241KB, committed=241KB)
(malloc=241KB)
- Logging (reserved=4KB, committed=4KB)
(malloc=4KB #184)
- Arguments (reserved=17KB, committed=17KB)
(malloc=17KB #469)
- Module (reserved=286KB, committed=286KB)
(malloc=286KB #2704)
Wait, what? 2.1 GB reserved for threads, and 222 MB committed? What is this? I currently don't know, I just saw it...
I need time to understand why this is happening.
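A quick back-of-the-envelope check on the Thread section of that output (a sketch; the class name ThreadStackMath is made up for illustration): dividing the stack figures by the thread count gives roughly 1 MB reserved per thread, so the large reserved total mainly reflects the fact that there are 2059 threads.

```java
public class ThreadStackMath {
    /** Average kilobytes per thread, using integer division. */
    public static long perThreadKb(long totalKb, int threads) {
        return totalKb / threads;
    }

    public static void main(String[] args) {
        int threads = 2059;            // "thread #2059" from the NMT summary
        long reservedKb = 2116644L;    // "stack: reserved=2116644KB"
        long committedKb = 212756L;    // "stack: committed=212756KB"

        System.out.println("reserved per thread (KB):  " + perThreadKb(reservedKb, threads));
        System.out.println("committed per thread (KB): " + perThreadKb(committedKb, threads));
    }
}
```

With these inputs the reserved stack comes out at about 1027 KB per thread and the committed stack at about 103 KB per thread, so the interesting question is why the process has more than two thousand threads.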
I finally found my issue and I want to share it so others can benefit in some way from this.
As I found in my last edit, I had a thread problem that was causing all the memory consumption over time. Specifically, we were using an asynchronous method from a third-party library without properly taking care of those resources (in this case, ensuring that those calls ended correctly).
I was able to detect the issue because I used a memory limit on my Kubernetes deployment from the beginning (which is good practice in production environments), and then monitored my app's memory consumption very closely using tools like jstat, jcmd, visualvm, kill -3 and, most importantly, the -XX:NativeMemoryTracking=summary flag, which gave me a lot of detail in this regard.
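For anyone who wants to watch for this kind of leak from inside the application, a minimal sketch is to sample the live thread count through the standard ThreadMXBean. The class name ThreadCountProbe is hypothetical and this is not the code from my app; a count that keeps climbing between samples is the signal to look for.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class ThreadCountProbe {
    private static final ThreadMXBean THREADS = ManagementFactory.getThreadMXBean();

    /** Number of live threads (daemon and non-daemon) right now. */
    public static int liveThreads() {
        return THREADS.getThreadCount();
    }

    /** Highest live thread count seen since the JVM started or the peak was reset. */
    public static int peakThreads() {
        return THREADS.getPeakThreadCount();
    }

    public static void main(String[] args) throws InterruptedException {
        // Print a few samples; in a real service this would be logged or exported as a metric.
        for (int i = 0; i < 3; i++) {
            System.out.println("live=" + liveThreads() + " peak=" + peakThreads());
            Thread.sleep(1000);
        }
    }
}
```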
I have a setup with 1 x master node and 1 x slave node.
My issue is with running MapReduce processing: the slave node doesn't seem to be working. Can anyone help with how to check, change and ensure that the slave is working?
The config file info can also be found at the URL below:
https://drive.google.com/file/d/1ULEe6k2zYnfQDQUQIbz_xR29WgT1DJhB/view
Here are my observations:
1) When I check CPU utilization while the MapReduce job is running, the slave doesn't seem to be working: its CPU is at 0% while the master is at 44%. Refer to the attachment.
2) When I run the DFS report it shows 2 live nodes, but the cluster web UI shows only 1. Refer to the attachment and the output below.
3) The total processing time of the MapReduce job is the same with or without the slave.
-------------------------------------------------
Live datanodes (2):
Name: 192.168.249.128:9866 (node-master)
Hostname: localhost
Decommission Status : Normal
Configured Capacity: 20587741184 (19.17 GB)
DFS Used: 174785723 (166.69 MB)
Non DFS Used: 60308293 (57.51 MB)
DFS Remaining: 20352647168 (18.95 GB)
DFS Used%: 0.85%
DFS Remaining%: 98.86%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Tue Oct 23 11:17:39 PDT 2018
Last Block Report: Tue Oct 23 11:07:32 PDT 2018
Num of Blocks: 93
Name: 192.168.249.129:9866 (node1)
Hostname: localhost
Decommission Status : Normal
Configured Capacity: 20587741184 (19.17 GB)
DFS Used: 85743 (83.73 KB)
Non DFS Used: 33775889 (32.21 MB)
DFS Remaining: 20553879552 (19.14 GB)
DFS Used%: 0.00%
DFS Remaining%: 99.84%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Tue Oct 23 11:17:38 PDT 2018
Last Block Report: Tue Oct 23 11:03:59 PDT 2018
Num of Blocks: 4
You're showing datanodes with the DFS report, not the nodemanagers that actually process the data. In the YARN UI, take note of the "Active Nodes" counter, which in your case is 1. That would make sense if the master is a namenode and resource manager while the slave is a datanode and nodemanager.
Other than that, if you have a non-splittable file (for example a ZIP), or your file is smaller than the block size (128 MB by default), then only one mapper will process it. Also, it's not guaranteed that mappers (or reducers) will be distributed evenly over all available resources.
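To make the point about splits concrete, here is a rough sketch of the arithmetic. It is a simplification and not Hadoop's actual split logic, which also takes other settings into account; the class and method names are made up for illustration.

```java
public class SplitEstimate {
    /** Default HDFS block size mentioned above: 128 MB. */
    public static final long DEFAULT_BLOCK_SIZE = 128L * 1024 * 1024;

    /**
     * Rough estimate of how many mappers one input file gets:
     * one for a non-splittable file, otherwise one per block (rounded up).
     */
    public static long estimateMappers(long fileSizeBytes, long blockSizeBytes, boolean splittable) {
        if (fileSizeBytes <= 0) {
            return 0;
        }
        if (!splittable) {
            return 1;
        }
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes; // ceiling division
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        System.out.println(estimateMappers(100 * mb, DEFAULT_BLOCK_SIZE, true));   // small file
        System.out.println(estimateMappers(300 * mb, DEFAULT_BLOCK_SIZE, true));   // spans three blocks
        System.out.println(estimateMappers(1024 * mb, DEFAULT_BLOCK_SIZE, false)); // e.g. a ZIP
    }
}
```

With a single mapper there is nothing to distribute, so a second node cannot shorten the job no matter how it is configured.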
Outside of a learning environment, though, 40 GB of storage and 8 GB of RAM would be better spent on multithreading rather than distributed computing (or on a proper database, i.e. parse the files and load them into a queryable store). Or use Spark or Pig, which don't require Hadoop but are much easier to work with than MapReduce.
I am having a problem with tomcat since switching to a different package provider (bitnami -> official debian).
Someone seems to be hitting our servers with a request (with malicious intent):
59.111.29.6 - - [04/Feb/2017:16:17:58 +0000] "-" 400 -
where "-" is the request path, which coincides with
Feb 04, 2017 4:17:58 PM org.apache.coyote.http11.AbstractHttp11Processor process
INFO: Error parsing HTTP request header
Note: further occurrences of HTTP header parsing errors will be logged at DEBUG level.
which coincides with the increased CPU usage.
The server status shows the following:
JVM
Free memory: 355.58 MB Total memory: 833.13 MB Max memory: 2900.00 MB
Memory Pool     Type             Initial    Total      Maximum     Used
Eden Space      Heap memory      34.12 MB   229.93 MB  800.00 MB   12.47 MB (1%)
Survivor Space  Heap memory      4.25 MB    28.68 MB   100.00 MB   2.22 MB (2%)
Tenured Gen     Heap memory      85.37 MB   574.51 MB  2000.00 MB  462.84 MB (23%)
Code Cache      Non-heap memory  2.43 MB    7.00 MB    48.00 MB    6.89 MB (14%)
Perm Gen        Non-heap memory  128.00 MB  128.00 MB  512.00 MB   52.57 MB (10%)
"http-nio-8080"
Max threads: 200 Current thread count: 10 Current thread busy: 3 Keeped alive sockets count: 1
Max processing time: 301 ms Processing time: 71.068 s Request count: 10021 Error count: 2996 Bytes received: 0.00 MB Bytes sent: 3.18 MB
Stage  Time              B Sent  B Recv  Client (Forwarded)  Client (Actual)  VHost            Request
F      1486364749526 ms  0 KB    0 KB    185.40.4.169        185.40.4.169     ?                ? ? ?
F      1486364749526 ms  0 KB    0 KB    185.40.4.169        185.40.4.169     ?                ? ? ?
R      ?                 ?       ?       ?                   ?                ?
S      36 ms             0 KB    0 KB    106.51.39.130       106.51.39.130    104.197.119.177  GET /manager/status?org.apache.catalina.filters.CSRF_NONCE=072F9F6884D94C5D7B30D1D34CE61BD9 HTTP/1.1
R      ?                 ?       ?       ?                   ?                ?
P: Parse and prepare request S: Service F: Finishing R: Ready K: Keepalive
So it doesn't seem like an out of memory issue (but I could be wrong).
How can I stop someone from making the request in the first place to avoid the issues I'm facing? My webapp running on tomcat restricts HTTP methods to GET/POST, but how can I configure tomcat as a whole to restrict them?
I would advise you to obtain a thread dump of your server:
Find the PID of the Tomcat server using:
jps -l
Obtain a thread dump using:
kill -3 PID
or jstack PID
Then check the thread dump; you should find the thread that is hogging the CPU and the reason for it.
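If you cannot run those tools on the host, a similar (less detailed) picture can be produced from inside the JVM. This is only a sketch using the standard Thread.getAllStackTraces() call; the class name ThreadDumpSketch is made up, and the output is not in the same format as jstack.

```java
import java.util.Map;

public class ThreadDumpSketch {
    /** Builds a plain-text listing of every live thread with its state and stack frames. */
    public static String dump() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            Thread t = entry.getKey();
            sb.append('"').append(t.getName()).append("\" state=").append(t.getState()).append('\n');
            for (StackTraceElement frame : entry.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(dump());
    }
}
```

Threads that stay in the RUNNABLE state with the same stack across several dumps are the usual suspects for high CPU usage.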
As described at http://wiki.netbeans.org/Jemmy_Operators_Environment, the default value of ActionProducer.MaxActionTime is 10000 ms.
I need to increase it to 120000 ms, so I use the following code:
JemmyProperties.setCurrentTimeout("ActionProducer.MaxActionTime", 120000);
When the code is run in debug mode the value is 120000, but I still get the following error:
"Menu pushing: (JMenuItem with text "Modules", JMenuItem with text
"Corporate entity") (ActionProducer.MaxActionTime)" action has not been
produced in 60005 milliseconds
Is 60000 ms a maximum value for ActionProducer.MaxActionTime?
UPDATE:
Every instance of a class implementing org.netbeans.jemmy.Timeoutable can have its own timeout values, so I checked the timeout of the instance that generates the error:
menuBar.getTimeouts().getTimeout("ActionProducer.MaxActionTime")
but the result was the same: it is 120000 ms, and it is still failing at 60000 ms.
Despite the fact that the error message states "(ActionProducer.MaxActionTime)" action has not been produced in...", there is another timeout that governs this action time:
JMenuOperator.PushMenuTimeout
Even if I set:
JemmyProperties.setCurrentTimeout("JMenuOperator.PushMenuTimeout", 50);
The error is:
"Menu pushing: (JMenuItem with text "Modules", JMenuItem with text
"Corporate entity") (ActionProducer.MaxActionTime)" action has not been
produced in 51 milliseconds
So do not believe the Jemmy log messages; try to find the right timeout instead.
I have an application running on WebSphere Application Server 6.1.0.43, and I'm having slowdown issues when trying to invoke a remote service.
The slowdown is in the method findGroupAndGetConnection of the class OutboundConnectionCache.
According to the IBM APAR PK94494:
The delay occurs after the client-side JAX-RPC handler (if present) is invoked and before the actual SOAP message is sent to the provider.
Because the delay occurs in the IBM web services engine, this problem
can be difficult to detect.
A com.ibm.ws.webservices.engine.transport.*=all trace will show entries similar to these which repeat:
[8/19/09 18:08:29:658 GMT] 00000047 OutboundConne 1 Enter:
WSWS3595I: Current pool size: 25. Connections-in-use size: 0.
Configured pool size: 25
In addition, that same trace spec will show long delays in executing
the .findGroupAndGetConnection() method:
[8/19/09 18:08:03:428 GMT] 00000047 OutboundConne >
OutboundConnectionCache.findGroupAndGetConnection()
WAITING_THREADS_THRESHOLD is 5 Entry
[8/19/09 18:08:38:358 GMT] 00000047 OutboundConne <
OutboundConnectionCache.findGroupAndGetConnection() Exit
And they recommend the following:
Reduce the 'com.ibm.websphere.webservices.http.connectionPoolCleanUpTime' from
the default of 180 to 120 seconds
Increase the max connections 'com.ibm.websphere.webservices.http.maxConnection' property from
default of 25 to 50. This will also require increasing the web
container thread pool size to 100.
Before changing the default properties I decided to monitor the Web Container thread usage. I noticed that the maximum thread pool size (50) is never reached, but the minimum pool size (10) is reached very often, forcing connections to be destroyed and recreated.
Can running over the minimum pool size cause this slowdown? Should I increase the minimum pool size? Or is my problem something other than the HTTP outbound connection pool?