I'm currently investigating issues on the following system:
3.2 GHz 8-core machine, 24 GB ram
Debian 6.0.2
ulimit -n 4096
ulimit -Sn 4096
ulimit -Hn 65535
Tomcat 6.0.28
-Xmx20g
MySQL 5.0.51a (through hibernate and a few manual JDBC queries)
also pretty much room for caching
I'm testing the most common requests to the server remotely at 2000 requests per minute. The testing tool is the latest jMeter. The average response time is around 65 ms, the minimum is 35 ms and the maximum is 4000 ms (in rare cases, but that has its reason).
Judging from htop, the system specs are sufficient for at least 3 times as many requests per minute (avg. CPU: 25%, RAM: 5 of 22 GB). The server itself stays reachable the whole time (I ping it constantly while the test runs).
It is important to note that each request results in 3 additional requests to the local Tomcat, where the second one finally gets the required data and the last one is for statistics:
jMeter(1) -> RESTeasy-Service(2) -> ?-Service(2) -> Data-Service(2) -(new Thread)> Statistic-Service(2)
(1) is my jMeter test server and distant from (2), which is the tomcat server. Yes, the architecture might be a little weird, but that's not my fault. ^^
I switched the thread management to a pool in server.xml and set max threads to 1000 (up from the default 200) and idle threads to 10 (up from 4). What I noticed is that the number of concurrent threads almost never decreases; instead it seems to rise steadily up to Tomcat's maximum. htop reports 160 threads while Tomcat is stopped and about 460 when it is freshly started (the services seem to start a few...). After a few hours (sometimes less) of hitting the server with 2000 requests per minute, htop says there are 1400 tasks. This seems to be the point when I start to get timeouts in jMeter. As this is extremely time consuming, I did not watch it a thousand times and therefore can't guarantee this is the cause, but that's pretty much what happens.
Primary questions:
Math tells me that the number of concurrently used threads should never exceed about 600 (34 requests per second * 4 requests each * 4 seconds = 544; realistically even less, but an estimate of 600 should be fine). As far as I understand the idea of thread pooling, unused threads should be released and stopped when they have been idle for too long. Is there still a way I could end up with a thousand idling(?) threads? And is this OK?
Could a thread started manually in one of the request processors prevent the Tomcat threads from being released?
Shouldn't there be a log message telling me that Tomcat could not create/fetch a thread for a request?
Any other ideas? I've been working on this for far too long, and by now Tomcat exhausting its thread pool seems to be the only valid reason for these weird timeouts. But maybe somebody has another hint.
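For reference, the estimate from the first question can be reproduced with a few lines of Java. This is only a sketch of the arithmetic (essentially Little's law: threads in use = arrival rate * time each request holds a thread); the class and method names are made up for illustration.

```java
/** Rough upper bound on concurrently busy threads: rate * hops * holding time. */
class ThreadEstimate {

    /**
     * @param requestsPerMinute external request rate (2000 in the test above)
     * @param hopsPerRequest    requests per external request, internal calls included (1 + 3)
     * @param worstCaseSeconds  longest time a single request holds its thread (4 s)
     */
    static long maxBusyThreads(int requestsPerMinute, int hopsPerRequest, int worstCaseSeconds) {
        // Round the per-second rate up, as the question does (2000 / 60 = 33.3 -> 34).
        long perSecond = (requestsPerMinute + 59) / 60;
        return perSecond * hopsPerRequest * worstCaseSeconds;
    }

    public static void main(String[] args) {
        System.out.println(maxBusyThreads(2000, 4, 4)); // prints 544
    }
}
```

The bound only holds while no request ever takes longer than the assumed worst case; as soon as requests start waiting on each other, the holding time grows and so does the thread count.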
Thanks in advance especially if you can finally save me from this...
After hours and days of head-scratching, I found that the timeouts happen when Tomcat reaches its thread limit while we're in the middle of those 3 local connection openings. My guess is that once the limit is reached, each thread is waiting for another one to become available, which will not happen as long as the earlier ones do not finish. In German I'd call that a Teufelskreis (vicious circle). ^^
Whatever, the solution was to raise max threads to a ridiculously high number:
<Executor name="tomcatThreadPool" namePrefix="catalina-exec-" maxThreads="10000" minSpareThreads="10"/>
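The effect can be reproduced outside Tomcat. Below is a minimal sketch, assuming a fixed-size `ExecutorService` stands in for the request pool (this is not Tomcat's actual executor, and `PoolStarvationDemo` and `countTimeouts` are names invented for the example). Every "outer request" submits an "inner request" to the same pool and waits for it, just like the local service-to-service calls.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class PoolStarvationDemo {

    /** Runs nested requests on one shared pool and returns how many outer requests timed out. */
    static int countTimeouts(int poolSize, int outerRequests) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        // Make sure the outer requests are all running before any of them calls "itself".
        CountDownLatch allStarted = new CountDownLatch(Math.min(poolSize, outerRequests));
        List<Future<Boolean>> results = new ArrayList<>();
        for (int i = 0; i < outerRequests; i++) {
            results.add(pool.submit(() -> {
                allStarted.countDown();
                allStarted.await();
                // The inner request needs a free thread from the very same pool.
                Future<String> inner = pool.submit(() -> "data");
                try {
                    inner.get(500, TimeUnit.MILLISECONDS);
                    return true;
                } catch (TimeoutException e) {
                    return false; // no thread became free in time
                }
            }));
        }
        int timeouts = 0;
        for (Future<Boolean> result : results) {
            if (!result.get()) {
                timeouts++;
            }
        }
        pool.shutdownNow();
        return timeouts;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("pool of 2: " + countTimeouts(2, 2) + " timeouts");
        System.out.println("pool of 4: " + countTimeouts(4, 2) + " timeouts");
    }
}
```

With a pool of 2, both threads are held by outer requests and the inner ones can only run after the timeout. With a pool of 4 there is room for the nested calls. Raising maxThreads works for the same reason, but it only moves the limit further away.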
I know that this should not be the way to go, but unfortunately all of us here know that our architecture is somewhat impractical, and nobody has the time to change anything about it.
Hope it helps somebody. =)
I guess this issue requires an understanding of the underlying HTTP/1.1 keep-alive connections.
If you are using it for a REST web service, you probably want to set the maxKeepAliveRequests parameter in your connector configuration to 1.
<Connector port="8080" protocol="HTTP/1.1"
connectionTimeout="20000"
maxKeepAliveRequests="1"
redirectPort="8443" />
This setting can be found in your $CATALINA_HOME/conf/server.xml.
I am running a Tomcat 7.0.55 instance with a Spring REST service behind on Ubuntu 14.04LTS server. I am doing performance tests with Gatling. I have created a simulation using a front-end application that accesses the REST backend.
My config is:
Total RAM: 512MB, 1 CPU, JVM options: -Xms128m -Xmx312m -XX:PermSize=64m -XX:MaxPermSize=128m
The environment might not seem very powerful, but as long as I do not cross the limit of ~700 users (I process 90k requests in 7 minutes), all requests are processed successfully and very quickly.
I start having issues when there are too many connections at the same time. In the failing scenario there are around 120k requests in 7 minutes. Problems begin at around 800 concurrent users. Up to 600-700 users all goes fine, but beyond this limit I start getting exceptions:
java.util.concurrent.TimeoutException: Request timed out to /xxx.xxx.xxx.xxx:8080 of 60000 ms
at com.ning.http.client.providers.netty.timeout.TimeoutTimerTask.expire(TimeoutTimerTask.java:43) [async-http-client-1.8.12.jar:na]
at com.ning.http.client.providers.netty.timeout.RequestTimeoutTimerTask.run(RequestTimeoutTimerTask.java:43) [async-http-client-1.8.12.jar:na]
at org.jboss.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:556) [netty-3.9.2.Final.jar:na]
at org.jboss.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:632) [netty-3.9.2.Final.jar:na]
at org.jboss.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:369) [netty-3.9.2.Final.jar:na]
at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108) [netty-3.9.2.Final.jar:na]
at java.lang.Thread.run(Unknown Source) [na:1.7.0_55]
12:00:50.809 [WARN ] c.e.e.g.h.a.GatlingAsyncHandlerActor - Request 'request_47' failed : GatlingAsyncHandlerActor timed out
I thought this could be related to the small JVM. However, when I upgrade the environment to:
Total RAM: 2GB, 2CPUs, JVM options: -Xms1024m -Xmx1024m -XX:PermSize=128m -XX:MaxPermSize=256m
I still get very similar results. The difference in failed requests is insignificant.
I've been playing with the Tomcat connector settings with no effect. The current Tomcat settings are:
<Connector enableLookups="false" maxThreads="400" maxSpareThreads="200" minSpareThreads="60" maxConnections="8092" port="8080" protocol="org.apache.coyote.http11.Http11Protocol" connectionTimeout="20000" keepAliveTimeout="10000" redirectPort="8443" />
Manipulating the numbers of threads and connections and the keepAliveTimeout didn't help at all to get the 800 concurrent users to work without timeouts. I was planning to scale the app to handle at least 2k concurrent users, but so far I can see that vertical scaling and upgrading the environment gives me no results. I also do not see any memory issues in jvisualvm. The OS shouldn't be the limit, as the ulimits are set to either unlimited or high values. The DB is not a bottleneck, as all REST calls use internal caches.
It seems like Tomcat is not able to process more than 800 connected users in my case. Do you have any ideas on how these issues could be addressed? I would like to be able to scale up to at least 2k users and keep the failure rate as low as possible. I will appreciate any thoughts and tips on how I can work it out. If you need more details, please leave a comment.
Cheers
Adam
Did you increase the open file limit? Every connection consumes an open file descriptor.
You are probably hitting the limit on TCP connections given that you are creating so many in such a short time. By default Linux waits a while before cleaning up connections. After the test fails, run netstat -ant | grep WAIT | wc -l and see if you are close to 60,000. If so, that indicates you can do some tuning of the TCP stack. Try changing the following sysctl settings:
net.ipv4.tcp_keepalive_intvl = 15
net.ipv4.tcp_keepalive_probes = 5
net.ipv4.tcp_fin_timeout = 5
You can also try some other settings mentioned in this ServerFault question.
We're using Glassfish 3.0.1 and experiencing very long response times, on the order of 5 minutes for 25% of our POST/PUT requests. By the time the response comes back, the front-facing load balancer has timed out.
My theory is that the requests are queuing up and waiting for an available thread.
I think so because the access logs reveal that the requests take only a few seconds to complete, yet they are executed five minutes later than I'd expect.
Does anyone have any advice for debugging what is going on with the thread pools, or on what the optimum settings for them should be?
Is it required to do a thread dump periodically or will a one off dump be sufficient?
At first glance, this seems to have very little to do with the threadpools themselves. Without knowing much about the rest of your network setup, here are some things I would check:
Is there a dead/nonresponsive node in the load balancer pool? This can cause all requests to be tried against this node until they fail due to timeout before being redirected to the other node.
Is there some issue with initial connections between the load balancer and the Glassfish server? This can be slow or incorrect DNS lookups (though the server should cache results), a missing proxy, or some other network-related problem.
Have you checked that the clocks are synchronized between the machines? This could cause the logs to get out of sync. 5min is a pretty strange timeout period.
If all these come up empty, you may simply have an impedance mismatch between the load balancer and the web server and you may need to add webservers to handle the load. The load balancer should be able to give you plenty of stats on the traffic coming in and how it's stacking up.
Usually you get this behaviour if you have not configured enough worker threads in your server. Default values range from 15 to 100 threads in common web servers. However, if your application blocks the server's worker threads (e.g. by waiting for queries), the defaults are frequently far too low.
You can increase the number of workers up to 1000 without problems (make sure you are running on 64 bit). Also check the number of worker threads (sometimes referred to as 'max concurrent/open requests') of any in-between server (e.g. a proxy or an Apache forwarding via mod_proxy).
Another common pitfall is your software sending requests to itself (e.g. trying to reroute or forward a request) while blocking an incoming request.
Taking thread dumps is the best way to debug what is going on with the thread pools. Please take 3-4 thread dumps one after another, with a gap of 1-2 seconds between them.
From a thread dump you can find the number of worker threads by their names. Compare the dumps to find the long-running threads, i.e. those that show up in the same place in several of them.
You may use TDA tool (http://java.net/projects/tda/downloads/download/tda-bin-2.2.zip) for analyzing threaddumps.
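Dumps are normally taken from outside the process, e.g. with the JDK's jstack tool. If it is easier to collect the numbers from inside the application, a rough equivalent can be sketched with the standard `ThreadMXBean`. The class name and the thread-name prefix below are assumptions for illustration; use whatever prefix your worker threads actually carry in a real dump.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.Map;
import java.util.TreeMap;

class ThreadDumpSummary {

    /** Counts live threads whose name starts with the given prefix, grouped by thread state. */
    static Map<Thread.State, Integer> countByState(String namePrefix) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        Map<Thread.State, Integer> counts = new TreeMap<>();
        // false, false: skip lock details, names and states are enough here
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            if (info != null && info.getThreadName().startsWith(namePrefix)) {
                counts.merge(info.getThreadState(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws InterruptedException {
        // Take a few samples one or two seconds apart, as suggested above.
        for (int sample = 0; sample < 3; sample++) {
            System.out.println(countByState("http-thread-pool")); // hypothetical prefix
            Thread.sleep(1500);
        }
    }
}
```

If most worker threads are RUNNABLE or WAITING in the same place across all samples, that place is the one to look at in the full dump.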
When I send about 100 users to my web service, I get responses and the web service performs fine, but when I test with 1000 concurrent users, none of the requests get a reply.
I am using jmeter for testing.
When I send 1000 concurrent users, my Glassfish admin panel times out in the browser and only opens again after 4-5 minutes. The same happens for the WSDL URL.
I have tested my web service on our LAN and it works for 2000 queries without any issues.
Please help me find a solution.
Edit 1.0
Some more findings
Hi, on your recommendation I simply returned a string from the web service function call: no lookup, no DAO, nothing... just returning a string.
The thread pool is 2000, so no issues there.
Now when I ran jMeter with 1000 users, they ran much faster and I got responses for ~200 requests.
So this means that my PC running Windows 7 with an i5 processor and 4 GB RAM is outperforming a dedicated HostGator server with 4 GB RAM and an 8-core Xeon 5*** :(
This is not what I am paying $220 a month for...
Correct me if my finding is wrong: I tested my app on the LAN between two PCs locally and it can process 2000+ messages smoothly.
Edit 1.1
After a lot of reading and experimenting, I have come to the conclusion that network latency is responsible for this behavior.
I increased the bean pool size in Glassfish's admin panel, which raised the number of concurrent users to 300, but the issue arises again no matter how many beans I keep in the pool.
So, friends, the question is: please suggest some other settings I can change in Glassfish's admin panel to remove this issue at its root!
You need to add some performance logging for the various steps that your service performs. Does it do multiple steps? Is computation slow? Database access slow? Your connection pool not scale well? Do things need to be tweaked in the web server to allow for such high concurrency? You'll need to measure these things to find the bottlenecks so you can eliminate them.
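As a starting point, per-step timing does not need a framework. The following is a minimal sketch using only `System.nanoTime()`; `StepTimer` and the step names are hypothetical, and one instance is meant to be used per request (it is not thread-safe).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Collects the wall-clock time spent in named steps of a single request. */
class StepTimer {
    private final Map<String, Long> nanosByStep = new LinkedHashMap<>();

    /** Runs the step, records how long it took (even if it throws), and returns its result. */
    <T> T time(String step, Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            nanosByStep.merge(step, System.nanoTime() - start, Long::sum);
        }
    }

    /** Accumulated time per step in milliseconds, in the order the steps first ran. */
    Map<String, Long> millisByStep() {
        Map<String, Long> millis = new LinkedHashMap<>();
        nanosByStep.forEach((step, nanos) -> millis.put(step, nanos / 1_000_000));
        return millis;
    }

    // Stand-in for real work such as a query or serialization.
    private static String pause(long ms, String result) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return result;
    }

    public static void main(String[] args) {
        StepTimer timer = new StepTimer();
        timer.time("db lookup", () -> pause(30, "row"));
        timer.time("serialize", () -> pause(5, "xml"));
        System.out.println(timer.millisByStep());
    }
}
```

Logging this map at the end of every request under load shows quickly whether the time goes into computation, the database, or waiting for a pooled resource.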
I had the same problem on a server (with 200+ simultaneous users). I studied the official Glassfish tuning guide, but there is a very important parameter that doesn't appear in it. I used jMeter too, and in my case the response time increased exponentially while the server's CPU usage stayed low.
In the Glassfish admin console (configurations/server-config/Network config/thread pools/http-thread-pool) you can see how many users your server can handle. (The parameters are different in Glassfish 2 and 3.)
Max Queue Size: The maximum number of threads in the queue. A value of -1 indicates that there is no limit to the queue size.
Max Thread Pool Size: The maximum number of threads in the thread pool
Min Thread Pool Size: The minimum number of threads in the thread pool
Idle Thread Timeout: The maximum amount of time that a thread can remain idle in the pool. After this time expires, the thread is removed from the pool.
I recommend you to set Max Thread Pool Size to 100 or 200 to fix the problem.
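To get a feeling for how these four settings interact, here is a sketch with the JDK's own `ThreadPoolExecutor`. This is an analogy only: Glassfish's pool is a different implementation and may grow and queue differently, and `PoolSizing`, `newPool` and the small numbers are chosen just for the demonstration.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class PoolSizing {

    /** Maps min/max pool size, max queue size and idle timeout onto a JDK pool. */
    static ThreadPoolExecutor newPool(int minThreads, int maxThreads,
                                      int maxQueueSize, long idleTimeoutMillis) {
        return new ThreadPoolExecutor(minThreads, maxThreads,
                idleTimeoutMillis, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(maxQueueSize));
    }

    /** Fills the pool, lets it go idle, and returns {size under load, size after idling}. */
    static int[] busyAndIdleSizes() throws InterruptedException {
        ThreadPoolExecutor pool = newPool(2, 4, 2, 200);
        CountDownLatch release = new CountDownLatch(1);
        // 2 tasks start the minimum threads, 2 fill the queue, 2 force growth to the maximum.
        for (int i = 0; i < 6; i++) {
            pool.execute(() -> {
                try {
                    release.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        int busy = pool.getPoolSize();
        release.countDown();
        Thread.sleep(1000); // well past the 200 ms idle timeout
        int idle = pool.getPoolSize();
        pool.shutdown();
        return new int[] {busy, idle};
    }

    public static void main(String[] args) throws InterruptedException {
        int[] sizes = busyAndIdleSizes();
        System.out.println("under load: " + sizes[0] + ", after idling: " + sizes[1]);
    }
}
```

Under load the pool grows to its maximum; once the work is done and the idle timeout has passed, it falls back to the minimum.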
You can also set other JVM options, for example:
-Xmx/s/m
-server
-XX:ParallelGCThreads
-XX:+AggressiveOpts
I hope it helps.
I created a web service, both client and server, and wanted to do some performance testing. I set up jMeter with a sample test plan to exercise it. JBoss processed up to 3000 requests, but beyond 3000 some of the requests are not processed (they fail with "Can't open connection : Connection refused"). Where do I have to make changes to handle more than 10000 requests at the same time? Is it a JBoss issue or a system throughput limit?
jMeter config: 300 threads, 1 sec ramp-up and a loop count of 10.
System (Server Config) : Windows 7, 4G RAM
Where do I have to make changes to handle more than 10000 requests at the same time?
10 thousand concurrent requests in Tomcat (which I believe is used in JBoss) is quite a lot. In a typical setup (with the blocking I/O connector) you need one thread per HTTP connection. This is way too much for an ordinary JVM. On a 64-bit server machine one thread needs 1 MiB (check out the -Xss parameter), and you only have 4 GiB.
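Spelled out, and taking the 1 MiB per thread figure at face value (the real footprint depends on -Xss and on how much of each stack is actually touched), the arithmetic looks like this. The class name is made up for the example.

```java
/** Back-of-the-envelope stack memory needed for a thread-per-connection setup. */
class ThreadStackBudget {

    /** Total stack size in MiB for the given number of threads and per-thread stack in KiB. */
    static long stackMiB(int threads, int stackKiBPerThread) {
        return (long) threads * stackKiBPerThread / 1024;
    }

    public static void main(String[] args) {
        long needed = stackMiB(10_000, 1024);   // 10,000 threads at 1 MiB each
        long available = 4L * 1024;             // 4 GiB of RAM, in MiB
        System.out.println(needed + " MiB of stacks vs " + available + " MiB of RAM");
    }
}
```

That is 10000 MiB, roughly 9.8 GiB, for the stacks alone, before the heap and the OS get anything.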
Moreover, the number of context switches will kill your performance. You would need hundreds of cores to effectively handle all these connections. And if your request is I/O or database bound - you'll see a bottleneck elsewhere.
That being said you need a different approach. Either try out non-blocking I/O or asynchronous servlets (since 3.0) or... scale out. By default Tomcat can handle 100-200 concurrent connections (reasonable default) and a similar amount of connections are queued. Everything above that is rejected and you are probably experiencing that.
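The "handled, queued, rejected" behaviour can be imitated with a plain JDK pool. This is only an analogy: in Tomcat the waiting connections sit in the connector's accept backlog (see the maxThreads and acceptCount attributes) rather than in an in-JVM task queue, and the class and method names here are invented.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class BoundedAcceptDemo {

    /** Sends `requests` long-running tasks to a bounded pool and returns how many were rejected. */
    static int countRejected(int threads, int queueSize, int requests) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(threads, threads,
                0L, TimeUnit.MILLISECONDS, new ArrayBlockingQueue<>(queueSize));
        CountDownLatch release = new CountDownLatch(1);
        int rejected = 0;
        for (int i = 0; i < requests; i++) {
            try {
                // Each task holds its thread until the end, like a slow request.
                pool.execute(() -> {
                    try {
                        release.await();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            } catch (RejectedExecutionException e) {
                rejected++; // all threads busy and the queue is full
            }
        }
        release.countDown();
        pool.shutdown();
        return rejected;
    }

    public static void main(String[] args) {
        // 200 threads plus 200 queued slots against 500 simultaneous requests
        System.out.println(countRejected(200, 200, 500) + " rejected");
    }
}
```

With 200 threads and 200 queue slots, 100 of 500 simultaneous requests are turned away, which from the client's side looks much like the "Connection refused" errors in the question.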
See also
Advanced IO and Tomcat
Asynchronous Support in Servlet 3.0
There are two common problems that I can think of.
First, if you run JBoss on Linux as a normal user, you can run into 'Too many open files', if you did not edit the limits.conf file. See https://community.jboss.org/thread/155699. Each open socket counts as an 'open file' for Linux, so the OS could block your connections because of this.
Second, the maximum threadpool size for incoming connections is 200 by default. This limits the number of concurrent requests, i.e. requests that are in progress at the same time. If you have jmeter doing 300 threads, the jboss connector threadpool should be larger. You can find this in jboss6 in the jboss-web.sar/server.xml. Look for 'maxThreads' in the element: http://docs.jboss.org/jbossweb/latest/config/http.html.
200 is the recommended maximum for a single core CPU. Above that, the context switches start to give too much overhead, like Tomasz says. So for production use, only increase to 400 on a dual core, 800 on a quad core, etc.
When my Tomcat (6.0.20) maxThreads limit is reached, I get the expected error:
Maximum number of threads (XXX) created for connector with address null and port 80
After that, requests start hanging in the queue and eventually time out. So far, so good.
The problem is that when the load goes down, the server does not recover and stays paralysed forever instead of coming back to life.
Any hints?
Consider switching to NIO; then you don't need to worry about the technical requirement of 1 thread per connection. Without NIO, the limit is about 5K threads (5K HTTP connections), and then it blows up like that. With NIO, Java is able to manage multiple resources with a single thread, so the limit is much higher. The limit is practically the available heap memory; with about 2 GB you can go up to 20K connections.
Configuring Tomcat to use NIO is as simple as changing the protocol attribute of the <Connector> element in /conf/server.xml to "org.apache.coyote.http11.Http11NioProtocol".
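To illustrate what "multiple resources by a single thread" means, here is a small selector-based echo server using plain java.nio. It is a sketch of the general technique, not of Tomcat's NIO connector; the class name is made up, and the write loop is simplified (a real server would register for OP_WRITE instead of spinning).

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedSelectorException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

/** One thread, one selector, any number of connections. */
class SingleThreadEchoServer implements Runnable, AutoCloseable {
    private final Selector selector = Selector.open();
    private final ServerSocketChannel server = ServerSocketChannel.open();

    SingleThreadEchoServer() throws IOException {
        server.bind(new InetSocketAddress("127.0.0.1", 0)); // port 0: OS picks a free port
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);
    }

    int port() throws IOException {
        return ((InetSocketAddress) server.getLocalAddress()).getPort();
    }

    @Override
    public void run() {
        ByteBuffer buffer = ByteBuffer.allocate(1024);
        try {
            while (selector.isOpen()) {
                selector.select(); // blocks until at least one channel is ready
                if (!selector.isOpen()) {
                    break;
                }
                Iterator<SelectionKey> keys = selector.selectedKeys().iterator();
                while (keys.hasNext()) {
                    SelectionKey key = keys.next();
                    keys.remove();
                    if (!key.isValid()) {
                        continue;
                    }
                    if (key.isAcceptable()) {
                        SocketChannel client = server.accept();
                        if (client != null) {
                            client.configureBlocking(false);
                            client.register(selector, SelectionKey.OP_READ);
                        }
                    } else if (key.isReadable()) {
                        echo((SocketChannel) key.channel(), buffer);
                    }
                }
            }
        } catch (IOException | ClosedSelectorException e) {
            // selector closed or fatal I/O error: leave the loop
        }
    }

    private static void echo(SocketChannel client, ByteBuffer buffer) throws IOException {
        try {
            buffer.clear();
            if (client.read(buffer) < 0) {
                client.close(); // peer closed the connection
                return;
            }
            buffer.flip();
            while (buffer.hasRemaining()) {
                client.write(buffer); // simplification: spins if the socket buffer is full
            }
        } catch (IOException e) {
            client.close(); // drop only the broken connection
        }
    }

    @Override
    public void close() throws IOException {
        selector.close();
        server.close();
    }
}
```

Started with `new Thread(new SingleThreadEchoServer()).start()`, the one loop serves every connected client; an idle connection costs a registered channel rather than a parked thread.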
I think this may be a bug in Tomcat. According to this issue:
https://issues.apache.org/bugzilla/show_bug.cgi?id=48843
it should be fixed in Tomcat 6.0.27 and 5.5.30.