I have been having a problem with Tomcat since switching to a different package provider (Bitnami -> official Debian).
Someone seems to be hitting our servers with a request (with malicious intent):
59.111.29.6 - - [04/Feb/2017:16:17:58 +0000] "-" 400 -
where "-" is the request path, which coincides with
Feb 04, 2017 4:17:58 PM org.apache.coyote.http11.AbstractHttp11Processor process
INFO: Error parsing HTTP request header
Note: further occurrences of HTTP header parsing errors will be logged at DEBUG level.
which coincides with the increased CPU usage.
The server status shows the following:
JVM
Free memory: 355.58 MB, Total memory: 833.13 MB, Max memory: 2900.00 MB

Memory Pool      Type             Initial    Total      Maximum     Used
Eden Space       Heap memory      34.12 MB   229.93 MB  800.00 MB   12.47 MB (1%)
Survivor Space   Heap memory      4.25 MB    28.68 MB   100.00 MB   2.22 MB (2%)
Tenured Gen      Heap memory      85.37 MB   574.51 MB  2000.00 MB  462.84 MB (23%)
Code Cache       Non-heap memory  2.43 MB    7.00 MB    48.00 MB    6.89 MB (14%)
Perm Gen         Non-heap memory  128.00 MB  128.00 MB  512.00 MB   52.57 MB (10%)

"http-nio-8080"
Max threads: 200, Current thread count: 10, Current threads busy: 3, Keep-alive sockets count: 1
Max processing time: 301 ms, Processing time: 71.068 s, Request count: 10021, Error count: 2996, Bytes received: 0.00 MB, Bytes sent: 3.18 MB

Stage  Time              B Sent  B Recv  Client (Forwarded)  Client (Actual)  VHost            Request
F      1486364749526 ms  0 KB    0 KB    185.40.4.169        185.40.4.169     ?                ? ? ?
F      1486364749526 ms  0 KB    0 KB    185.40.4.169        185.40.4.169     ?                ? ? ?
R      ?                 ?       ?       ?                   ?                ?
S      36 ms             0 KB    0 KB    106.51.39.130       106.51.39.130    104.197.119.177  GET /manager/status?org.apache.catalina.filters.CSRF_NONCE=072F9F6884D94C5D7B30D1D34CE61BD9 HTTP/1.1
R      ?                 ?       ?       ?                   ?                ?

P: Parse and prepare request, S: Service, F: Finishing, R: Ready, K: Keepalive
So it doesn't seem to be an out-of-memory issue (but I could be wrong).
How can I stop someone from making these requests in the first place, to avoid the issues I'm facing? My webapp running on Tomcat restricts HTTP methods to GET/POST, but how can I configure Tomcat as a whole to restrict them?
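(For reference, a container-level restriction would look something like the sketch below, assuming it goes in the default conf/web.xml or the webapp's WEB-INF/web.xml: a security-constraint with an empty auth-constraint denies the listed verbs while leaving GET/POST untouched. Note that it would not stop the malformed "-" request above, which is rejected during header parsing before any method check.)

  <security-constraint>
    <web-resource-collection>
      <web-resource-name>blocked-methods</web-resource-name>
      <url-pattern>/*</url-pattern>
      <http-method>PUT</http-method>
      <http-method>DELETE</http-method>
      <http-method>TRACE</http-method>
      <http-method>OPTIONS</http-method>
    </web-resource-collection>
    <auth-constraint/>
  </security-constraint>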
I would advise you to obtain a thread dump of your server:
Isolate the PID of the Tomcat server using:
jps -l
Obtain a thread dump using:
kill -3 PID
or jstack PID
Then check the thread dump; you should find the thread that is hogging the CPU.
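For example, a minimal shell session (assuming the Tomcat JVM appears as org.apache.catalina.startup.Bootstrap in the jps output):

  PID=$(jps -l | grep org.apache.catalina.startup.Bootstrap | awk '{print $1}')
  jstack "$PID" > /tmp/tomcat-threads.txt    # dump goes to a file you can inspect
  # or, if jstack is not available, signal the JVM; the dump is written to catalina.out
  kill -3 "$PID"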
I'm working with OpenJDK 11 and a very simple Spring Boot application; about the only thing it has is the Spring Boot actuator enabled so I can call /actuator/health etc.
I also have a very simple Kubernetes cluster on GCE with just one pod running one container (containing this app, of course).
My configuration has some key points that I want to highlight: it has resource requests and limits
resources:
  limits:
    memory: 600Mi
  requests:
    memory: 128Mi
And it has a readiness probe
readinessProbe:
  initialDelaySeconds: 30
  periodSeconds: 30
  httpGet:
    path: /actuator/health
    port: 8080
I'm also setting JVM_OPTS like this (which my program picks up, obviously):
env:
  - name: JVM_OPTS
    value: "-XX:MaxRAM=512m"
The problem
I launch this and it gets OOMKilled in about 3 hours, every time!
I'm never calling anything myself; the only traffic is the readiness probe that Kubernetes makes every 30 seconds. Is that enough to exhaust the memory? I also haven't implemented anything out of the ordinary, just a GET method that says hello world, plus the Spring Boot imports needed for the actuators.
If I run kubectl top pod XXXXXX I can actually watch the memory usage gradually grow.
I have tried a lot of different configurations and tips, but nothing seems to work with a basic Spring Boot app.
Is there a way to actually hard-limit the memory so that Java can raise an OutOfMemoryError? Or to prevent this from happening?
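(A hedged sketch of one way to do that: cap the heap explicitly below the container limit and ask the JVM to exit on OOM, so the failure happens inside Java rather than via the cgroup OOM killer. The flag values are illustrative, not tuned, and -Xmx only bounds the heap; native allocations such as thread stacks and metaspace still live outside it, which turns out to matter later in this question.)

  env:
    - name: JVM_OPTS
      value: "-XX:MaxRAM=512m -Xmx400m -XX:+ExitOnOutOfMemoryError"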
Thanks in advance
EDIT: After 15h running
NAME READY STATUS RESTARTS AGE
pod/test-79fd5c5b59-56654 1/1 Running 4 15h
describe pod says...
State: Running
Started: Wed, 27 Feb 2019 10:29:09 +0000
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Wed, 27 Feb 2019 06:27:39 +0000
Finished: Wed, 27 Feb 2019 10:29:08 +0000
That last span of time is about 4 hours, and there were only 483 calls to /actuator/health; apparently that was enough to make Java exceed the MaxRAM hint?
EDIT: Almost 17h
it's about to die again
$ kubectl top pod test-79fd5c5b59-56654
NAME CPU(cores) MEMORY(bytes)
test-79fd5c5b59-56654 43m 575Mi
EDIT: losing hope at 23h
NAME READY STATUS RESTARTS AGE
pod/test-79fd5c5b59-56654 1/1 Running 6 23h
describe pod:
State: Running
Started: Wed, 27 Feb 2019 18:01:45 +0000
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Wed, 27 Feb 2019 14:12:09 +0000
Finished: Wed, 27 Feb 2019 18:01:44 +0000
EDIT: A new finding
Last night I was doing some interesting reading:
https://developers.redhat.com/blog/2017/03/14/java-inside-docker/
https://banzaicloud.com/blog/java10-container-sizing/
https://medium.com/adorsys/jvm-memory-settings-in-a-container-environment-64b0840e1d9e
TL;DR: I decided to remove the memory limit and start the process again; the result was quite interesting (after about 11 hours running):
NAME CPU(cores) MEMORY(bytes)
test-84ff9d9bd9-77xmh 218m 1122Mi
So... what is going on with that CPU? I was kind of expecting a big number for memory usage, but why the CPU?
The one thing I can think of is that the GC is running like crazy because it thinks MaxRAM is 512m while the process is using more than 1G. I'm wondering: is Java detecting its ergonomics correctly? (I'm starting to doubt it.)
To test my theory I set a limit of 512m and deployed the app that way, and I found that from the start there is an unusual CPU load, which has to be the GC running very frequently.
kubectl create ...
limitrange/mem-limit-range created
pod/test created
kubectl exec -it test-64ccb87fd7-5ltb6 /usr/bin/free
total used free shared buff/cache available
Mem: 7658200 1141412 4132708 19948 2384080 6202496
Swap: 0 0 0
kubectl top pod ..
NAME CPU(cores) MEMORY(bytes)
test-64ccb87fd7-5ltb6 522m 283Mi
522m is too much vCPU, so my logical next step was to make sure I was using the most appropriate GC for this case. I changed the JVM_OPTS this way:
env:
  - name: JVM_OPTS
    value: "-XX:MaxRAM=512m -Xmx128m -XX:+UseSerialGC"
...
resources:
  requests:
    memory: 256Mi
    cpu: 0.15
  limits:
    memory: 700Mi
And that brings the vCPU usage back to a reasonable level. After kubectl top pod:
NAME CPU(cores) MEMORY(bytes)
test-84f4c7445f-kzvd5 13m 305Mi
Messing with Xmx on top of MaxRAM obviously affects the JVM, but how can it not be possible to control the amount of memory we have in virtualized containers? I know the free command will report the host's available RAM, but OpenJDK should be using cgroups, right?
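A quick, hedged way to check what the JVM actually detected inside the container (the pod name is a placeholder):

  kubectl exec -it <pod-name> -- java -XshowSettings:vm -version
  # On OpenJDK 11 the reported max heap is derived from the cgroup memory limit;
  # with no container limit set, the cgroup limit is effectively the node's RAM,
  # so the JVM sizes itself from the host. -XX:MaxRAMPercentage=75.0 is the
  # container-aware alternative to a fixed -XX:MaxRAM value.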
I'm still monitoring the memory ...
EDIT: A new hope
I did two things: the first was to remove my container limit again, because I want to analyze how far it will grow; I also added a new flag to see how the process is using native memory: -XX:NativeMemoryTracking=summary.
At the beginning everything was normal; the process started out consuming around 300MB according to kubectl top pod, so I let it run for about 4 hours and then ...
kubectl top pod
NAME CPU(cores) MEMORY(bytes)
test-646864bc48-69wm2 54m 645Mi
Kind of expected, right? But then I checked the native memory usage:
jcmd <PID> VM.native_memory summary
Native Memory Tracking:
Total: reserved=2780631KB, committed=536883KB
- Java Heap (reserved=131072KB, committed=120896KB)
(mmap: reserved=131072KB, committed=120896KB)
- Class (reserved=203583KB, committed=92263KB)
(classes #17086)
( instance classes #15957, array classes #1129)
(malloc=2879KB #44797)
(mmap: reserved=200704KB, committed=89384KB)
( Metadata: )
( reserved=77824KB, committed=77480KB)
( used=76069KB)
( free=1411KB)
( waste=0KB =0.00%)
( Class space:)
( reserved=122880KB, committed=11904KB)
( used=10967KB)
( free=937KB)
( waste=0KB =0.00%)
- Thread (reserved=2126472KB, committed=222584KB)
(thread #2059)
(stack: reserved=2116644KB, committed=212756KB)
(malloc=7415KB #10299)
(arena=2413KB #4116)
- Code (reserved=249957KB, committed=31621KB)
(malloc=2269KB #9949)
(mmap: reserved=247688KB, committed=29352KB)
- GC (reserved=951KB, committed=923KB)
(malloc=519KB #1742)
(mmap: reserved=432KB, committed=404KB)
- Compiler (reserved=1913KB, committed=1913KB)
(malloc=1783KB #1343)
(arena=131KB #5)
- Internal (reserved=7798KB, committed=7798KB)
(malloc=7758KB #28415)
(mmap: reserved=40KB, committed=40KB)
- Other (reserved=32304KB, committed=32304KB)
(malloc=32304KB #3030)
- Symbol (reserved=20616KB, committed=20616KB)
(malloc=17475KB #212850)
(arena=3141KB #1)
- Native Memory Tracking (reserved=5417KB, committed=5417KB)
(malloc=347KB #4494)
(tracking overhead=5070KB)
- Arena Chunk (reserved=241KB, committed=241KB)
(malloc=241KB)
- Logging (reserved=4KB, committed=4KB)
(malloc=4KB #184)
- Arguments (reserved=17KB, committed=17KB)
(malloc=17KB #469)
- Module (reserved=286KB, committed=286KB)
(malloc=286KB #2704)
Wait, what? 2.1 GB reserved for threads, with 222 MB committed? What is this? I honestly don't know, I just saw it...
I need some time to try to understand why this is happening.
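A hedged aside: NMT can also diff against a baseline, which makes it easier to spot the category that keeps growing between two points in time:

  jcmd <PID> VM.native_memory baseline
  # ... wait while the RSS climbs ...
  jcmd <PID> VM.native_memory summary.diff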
I finally found my issue and I want to share it so others can benefit from it in some way.
As I found in my last edit, I had a thread problem that was causing all the memory consumption over time: specifically, we were using an asynchronous method from a third-party library without properly taking care of those resources (in this case, ensuring those calls were ending correctly).
I was able to detect the issue because I had used a memory limit on my Kubernetes deployment from the beginning (which is a good practice in production environments) and then I monitored my app's memory consumption very closely using tools like jstat, jcmd, VisualVM, kill -3, and most importantly the -XX:NativeMemoryTracking=summary flag, which gave me a great amount of detail in this regard.
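For illustration only (this is not the actual third-party code), the leak pattern looks roughly like this: a new executor, and therefore new platform threads with their own native stacks, gets created on every call and is never shut down, so NMT's Thread section keeps growing while the Java heap stays small:

  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;

  public class LeakyAsyncCaller {

      // Called on every request. The pool (and its non-daemon worker threads)
      // is never shut down, so each call leaks threads and their native stacks.
      public String callThirdParty() throws Exception {
          ExecutorService pool = Executors.newFixedThreadPool(4);
          return pool.submit(() -> "response from the async client").get();
          // missing: pool.shutdown() in a finally block, or better, one shared pool
      }
  }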
I'm facing this issue on some of my requests. A user's first call (or a call after some time has passed) takes up to 60 seconds to complete. After that, it takes milliseconds. It doesn't matter whether I make the call from a web browser or from SoapUI, it is always the same. I'm running a Java application behind Apache.
I ran it with Fiddler to see where the call spends most of its time, and it looks like it is between the server receiving the request and starting the response; every time it takes 61 seconds. Fiddler also shows the SSL handshake taking milliseconds in both cases.
I'm not sure where else to look. If anyone can shed some light it will be greatly appreciated.
First request (takes 1:01 min)
Request Count: 1
Bytes Sent: 683 (headers:683; body:0)
Bytes Received: 846 (headers:347; body:499)
ACTUAL PERFORMANCE
--------------
This traffic was captured on Friday, May 18, 2018.
ClientConnected: 16:21:17.244
ClientBeginRequest: 16:21:17.465
GotRequestHeaders: 16:21:17.465
ClientDoneRequest: 16:21:17.465
Determine Gateway: 0ms
DNS Lookup: 0ms
TCP/IP Connect: 0ms
HTTPS Handshake: 0ms
ServerConnected: 16:21:17.275
FiddlerBeginRequest: 16:21:17.466
ServerGotRequest: 16:21:17.466
ServerBeginResponse: 16:22:18.597
GotResponseHeaders: 16:22:18.597
ServerDoneResponse: 16:22:18.599
ClientBeginResponse: 16:22:18.599
ClientDoneResponse: 16:22:18.599
Overall Elapsed: 0:01:01.133
RESPONSE BYTES (by Content-Type)
--------------
application/json: 499
~headers~: 347
Subsequent requests (take 0.120 s)
Request Count: 1
Bytes Sent: 683 (headers:683; body:0)
Bytes Received: 846 (headers:347; body:499)
ACTUAL PERFORMANCE
--------------
This traffic was captured on Friday, May 18, 2018.
ClientConnected: 16:22:38.582
ClientBeginRequest: 16:22:38.607
GotRequestHeaders: 16:22:38.607
ClientDoneRequest: 16:22:38.607
Determine Gateway: 0ms
DNS Lookup: 0ms
TCP/IP Connect: 0ms
HTTPS Handshake: 0ms
ServerConnected: 16:22:38.589
FiddlerBeginRequest: 16:22:38.607
ServerGotRequest: 16:22:38.607
ServerBeginResponse: 16:22:38.726
GotResponseHeaders: 16:22:38.726
ServerDoneResponse: 16:22:38.727
ClientBeginResponse: 16:22:38.727
ClientDoneResponse: 16:22:38.727
Overall Elapsed: 0:00:00.120
RESPONSE BYTES (by Content-Type)
--------------
application/json: 499
~headers~: 347
This is what another first call, made from the browser, looks like:
Chrome Network Analysis
Thanks!!
So, I finally came across this post about LDAP:
Debugging a timeout with ldap auth on apache
which happened to be my case.
Adding LDAPConnectionPoolTTL 0 to my httpd.conf solved my issue.
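For anyone else hitting this, it is a one-line, server-level mod_ldap directive (a sketch; it assumes mod_ldap/mod_authnz_ldap are already loaded and handling the authentication):

  # httpd.conf
  LDAPConnectionPoolTTL 0   # stop reusing idle pooled LDAP connections that may have gone stale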
I am running a Spark job on a Hadoop cluster, and the job occasionally fails with this exception:
exception : Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.JavaMain], main() threw exception, begin > end in range (begin, end): (1494159709088, 1494159706071)
The job ran successfully on the rerun.
After searching on Google, it might be clock skew between the Oozie server host and the launcher host.
Is there a way I can check whether there is clock skew? Or how can I check the time on all the nodes to see whether they are in sync or not?
Thanks
ntptime command output :
ntp_gettime() returns code 0 (OK)
time dcb9b19b.a2328f64 Sun, May 7 2017 14:45:47.633, (.633584090),
maximum error 434990 us, estimated error 815 us, TAI offset 0
ntp_adjtime() returns code 0 (OK)
modes 0x0 (),
offset 176.871 us, frequency -25.666 ppm, interval 1 s,
maximum error 434990 us, estimated error 815 us,
status 0x2001 (PLL,NANO),
time constant 10, precision 0.001 us, tolerance 500 ppm,
ntpstat command output :
synchronised to NTP server (174.68.168.57) at stratum 3
time correct to within 77 ms
polling server every 1024 s
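For what it's worth, the two timestamps in the error are roughly 3 seconds apart (1494159709088 - 1494159706071 ≈ 3017 ms), far more than the 77 ms reported by ntpstat on this node, so comparing the other nodes is worth doing. A rough sketch, assuming passwordless SSH and a hypothetical host list:

  for h in master1 worker1 worker2; do
      ssh "$h" 'printf "%s  " "$(hostname)"; date +%s.%N; ntpstat | head -1'
  done
  # the epoch timestamps should agree to within a few milliseconds and every
  # node should report "synchronised to NTP server"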
I have an application running on WebSphere Application Server 6.1.0.43, and I'm having slowdown issues when trying to invoke a remote service.
The slowdown is in the findGroupAndGetConnection method of the OutboundConnectionCache class.
According to the IBM APAR PK94494:
The delay occurs after the client-side JAX-RPC handler (if present) is invoked and before the actual SOAP message is sent to the provider.
Because the delay occurs in the IBM web services engine, this problem
can be difficult to detect.
A com.ibm.ws.webservices.engine.transport.*=all trace will show entries similar to these which repeat:
[8/19/09 18:08:29:658 GMT] 00000047 OutboundConne 1 Enter:
WSWS3595I: Current pool size: 25. Connections-in-use size: 0.
Configured pool size: 25
In addition, that same trace spec will show long delays in executing
the .findGroupAndGetConnection() method:
[8/19/09 18:08:03:428 GMT] 00000047 OutboundConne >
OutboundConnectionCache.findGroupAndGetConnection()
WAITING_THREADS_THRESHOLD is 5 Entry
[8/19/09 18:08:38:358 GMT] 00000047 OutboundConne <
OutboundConnectionCache.findGroupAndGetConnection() Exit
And they recommend the following:
Reduce the 'com.ibm.websphere.webservices.http.connectionPoolCleanUpTime' from
the default of 180 to 120 seconds
Increase the max connections 'com.ibm.websphere.webservices.http.maxConnection' property from
default of 25 to 50. This will also require increasing the web
container thread pool size to 100.
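(For reference, the two settings named in the APAR are JVM custom properties; as far as I know, and this should be verified against the documentation for your WAS version, applying the recommended values amounts to adding these to the server's generic JVM arguments or its JVM custom properties panel:

  -Dcom.ibm.websphere.webservices.http.connectionPoolCleanUpTime=120
  -Dcom.ibm.websphere.webservices.http.maxConnection=50
)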
Before changing the default properties I decided to monitor the web container thread usage, and I noticed that the maximum thread pool size (50) is never reached, but the minimum pool size (10) is reached very often, forcing connections to be destroyed and recreated.
Could running over the minimum pool size cause this slowdown? Should I increase the minimum pool size? Or is my problem something other than the HTTP outbound connection pool?
I've implemented an HTTP service based on the HTTP server example as provided by the netty.io project.
When I execute a GET request to the service URL from command-line (wget) or from a browser, I receive a result as expected.
When I perform a load test using ApacheBench (ab -n 100000 -c 8 http://localhost:9000/path/to/service), I experience no errors (neither on the service side nor on the ab side) and see fair numbers for request processing duration.
Afterwards, I set up a test plan in JMeter having a thread group with 1 thread and a loop count of 2. I inserted an HTTP request sampler where I simply added the server name localhost, the port number 9000 and the path /path/to/service. Then I also added a View Results Tree and a Summary Report listener.
Finally, I executed the test plan and received one valid response and one error showing the following content:
Thread Name: Thread Group 1-1
Sample Start: 2015-06-04 09:23:12 CEST
Load time: 0
Connect Time: 0
Latency: 0
Size in bytes: 2068
Headers size in bytes: 0
Body size in bytes: 2068
Sample Count: 1
Error Count: 1
Response code: Non HTTP response code: org.apache.http.NoHttpResponseException
Response message: Non HTTP response message: The target server failed to respond
Response headers:
HTTPSampleResult fields:
ContentType:
DataEncoding: null
The associated exception found in response data tab showed the following content
org.apache.http.NoHttpResponseException: The target server failed to respond
at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:95)
at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:61)
at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:254)
at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:289)
at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:252)
at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:191)
at org.apache.jmeter.protocol.http.sampler.MeasuringConnectionManager$MeasuredConnection.receiveResponseHeader(MeasuringConnectionManager.java:201)
at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:300)
at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:127)
at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:715)
at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:520)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
at org.apache.jmeter.protocol.http.sampler.HTTPHC4Impl.executeRequest(HTTPHC4Impl.java:517)
at org.apache.jmeter.protocol.http.sampler.HTTPHC4Impl.sample(HTTPHC4Impl.java:331)
at org.apache.jmeter.protocol.http.sampler.HTTPSamplerProxy.sample(HTTPSamplerProxy.java:74)
at org.apache.jmeter.protocol.http.sampler.HTTPSamplerBase.sample(HTTPSamplerBase.java:1146)
at org.apache.jmeter.protocol.http.sampler.HTTPSamplerBase.sample(HTTPSamplerBase.java:1135)
at org.apache.jmeter.threads.JMeterThread.process_sampler(JMeterThread.java:434)
at org.apache.jmeter.threads.JMeterThread.run(JMeterThread.java:261)
at java.lang.Thread.run(Thread.java:745)
Since I have a similar service already running, which receives and processes web tracking data and shows no such errors, it might be a problem with my test plan or with JMeter... but I am not sure :-(
Did anyone experience similar behavior? Thanks in advance ;-)
Issue can be related to Keep-Alive management.
Read those:
https://bz.apache.org/bugzilla/show_bug.cgi?id=57921
https://wiki.apache.org/jmeter/JMeterSocketClosed
So your solution is one of the following.
If you're sure it's a keep-alive issue:
Try a JMeter nightly build (http://jmeter.apache.org/nightly.html):
Download the _bin and _lib files
Unpack the archives into the same directory structure
The other archives are not needed to run JMeter.
Then adapt the value of httpclient4.idletimeout.
A workaround is to increase the retry count or add a connection stale check, as described in:
https://wiki.apache.org/jmeter/JMeterSocketClosed
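A sketch of the relevant entries in user.properties (the values are illustrative; the idle timeout should be shorter than the server's keep-alive timeout so JMeter drops idle connections before the server does):

  # user.properties
  httpclient4.idletimeout=2000
  httpclient4.retrycount=1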