I have a very simple Java REST service. At lower traffic volumes, the service runs perfectly with response times of ~1ms and zero server backlog.
When traffic rises past a certain threshold, response times skyrocket from 1 ms to 2.0 seconds, the HTTP active session queue and open file counts spike, and the server performs unacceptably. I posted a metrics graph of a typical six-hour window where traffic starts low and goes above the problem threshold.
Any ideas on what could be causing this or how to diagnose further?
Your webapp uses a thread (borrowed from a thread pool) to serve each request.
Under heavy load, many threads are created. If the number of requests exceeds the capacity of the pool, the extra requests have to queue until a thread becomes available again.
If your service does not finish requests fast enough (especially if it does I/O, such as opening files), the wait time increases, which leads to slow responses.
The CPU also has to switch between many threads, so CPU usage will spike under load.
That's why people put a load balancer in front of several instances of the webapp. The load is distributed across the instances, which improves the end-user experience.
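A minimal sketch of that queueing, using java.util.concurrent rather than a real servlet container. The class and method names are made up for illustration, and the latch just stands in for slow I/O that keeps every worker busy:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class PoolBacklogDemo {

    /** Submits `burst` blocking tasks to a pool of `poolSize` threads and returns how many had to queue. */
    static int queuedAfterBurst(int poolSize, int burst) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                poolSize, poolSize, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
        CountDownLatch release = new CountDownLatch(1); // stands in for slow I/O
        try {
            for (int i = 0; i < burst; i++) {
                pool.execute(() -> {
                    try {
                        release.await(); // each "request" holds its thread until released
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            // Everything beyond the pool size is sitting in the queue, waiting for a thread.
            return pool.getQueue().size();
        } finally {
            release.countDown();
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("2 threads, 5 requests, queued: " + queuedAfterBurst(2, 5)); // 3
        System.out.println("8 threads, 5 requests, queued: " + queuedAfterBurst(8, 5)); // 0
    }
}
```

Once requests arrive faster than the workers finish them, that queue only grows, and every queued request adds its waiting time to the response time the client sees.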
The usual approach to diagnosis is to generate load with JMeter and investigate the results with Java VisualVM, Eclipse Memory Analyzer, and so on. I don't know whether you have tried that.
Related
I have a Java + Spring app that queries ElasticSearch using the Jest client (a poor choice, because it is poorly documented). ElasticSearch has response times of about 8-20 ms with 150 concurrent connections, but my app goes up to 900-1500 ms. A quick look at VisualVM tells me that processor usage is below 10%, and profiling tells me that 98% of the time the app is just waiting in the following method:
org.apache.http.pool.PoolEntryFuture.await()
which is part of Apache HttpCore, a dependency of Jest. I'm not limited by the number of threads Tomcat can run (the max is 200, and VisualVM says the maximum number of threads during the experiment was 174), so it's not waiting for free threads.
I think that the latency increase is excessive and I suspect that Jest is using an internal threadpool that has not enough threads to cope with all the requests, but I don't know.
Thoughts?
I think that the latency increase is excessive and I suspect that Jest is using an internal threadpool that has not enough threads to cope with all the requests...
In poking around the source real fast I see that you should be able to inject a ClientConfig into the Jest client factory.
The ClientConfig has the following setters which seem to impact the internal Apache http client connection manager:
clientConfig.maxTotalConnection(...);
clientConfig.defaultMaxTotalConnectionPerRoute(...);
clientConfig.maxTotalConnectionPerRoute(...);
Maybe tweaking some of those will give you more connections? Take a look at the JestClientFactory source to see what it is doing. We've definitely had to tweak those values in the past when making a large number of connections to the same server using HttpClient.
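As a rough back-of-the-envelope check on why the pool size matters: Apache HttpClient's pooling connection manager defaults to 2 connections per route. Whether your Jest setup actually runs with that default is an assumption on my part, and the class below is just illustrative arithmetic, not Jest or HttpClient code:

```java
final class ConnectionPoolEstimate {

    /**
     * Rough steady-state latency when `callers` threads share `connections`
     * pooled connections and each remote call takes `serviceMillis`.
     */
    static double saturatedLatencyMillis(int callers, int connections, double serviceMillis) {
        if (callers <= connections) {
            return serviceMillis; // nobody waits for a connection
        }
        // A saturated pool completes connections / serviceMillis calls per ms, so by
        // Little's law each caller's round trip (wait + call) is callers / throughput.
        return callers * serviceMillis / connections;
    }

    public static void main(String[] args) {
        // 150 callers and ~10 ms per call come from the question; 2 connections is the assumed default.
        System.out.println(saturatedLatencyMillis(150, 2, 10.0));   // 750.0
        System.out.println(saturatedLatencyMillis(150, 200, 10.0)); // 10.0
    }
}
```

With 8-20 ms per ElasticSearch call the same formula gives 600-1500 ms, which is in the range you are seeing, so a too-small connection pool is at least a plausible explanation for threads parked in PoolEntryFuture.await().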
I would test this with just one connection and see what the average response time is. With a single request in flight there is no contention for threads or other resources. Most likely the process is waiting on an external resource such as a database or a network service.
We're using Glassfish 3.0.1 and experiencing very long response times, on the order of 5 minutes for 25% of our POST/PUT requests. By the time the response comes back, the front-facing load balancer has timed out.
My theory is that the requests are queuing up and waiting for an available thread.
The reason I think this is that the access logs show the requests taking only a few seconds to complete, but the time at which they are executed is five minutes later than I'd expect.
Does anyone have any advice for debugging what is going on with the thread pools, or what the optimum settings for them should be?
Is it necessary to take thread dumps periodically, or will a one-off dump be sufficient?
At first glance, this seems to have very little to do with the threadpools themselves. Without knowing much about the rest of your network setup, here are some things I would check:
Is there a dead/nonresponsive node in the load balancer pool? This can cause all requests to be tried against this node until they fail due to timeout before being redirected to the other node.
Is there some issue with initial connections between the load balancer and the Glassfish server? This can be slow or incorrect DNS lookups (though the server should cache results), a missing proxy, or some other network-related problem.
Have you checked that the clocks are synchronized between the machines? This could cause the logs to get out of sync. Five minutes is a pretty strange timeout period.
If all these come up empty, you may simply have an impedance mismatch between the load balancer and the web server and you may need to add webservers to handle the load. The load balancer should be able to give you plenty of stats on the traffic coming in and how it's stacking up.
Usually you get this behaviour if you have not configured enough worker threads in your server. Default values range from 15 to 100 threads in common web servers. However, if your application blocks the server's worker threads (e.g. by waiting for queries), the defaults are frequently far too low.
You can increase the number of workers up to 1000 without problems (make sure you are running 64-bit). Also check the number of worker threads (sometimes referred to as 'max concurrent/open requests') of any in-between server (e.g. a proxy or an Apache forwarding via mod_proxy).
Another common pitfall is your software sending requests to itself (e.g. trying to reroute or forward a request) while blocking an incoming request.
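That last pitfall is easy to reproduce outside a server. The sketch below uses a plain ExecutorService as a stand-in for the server's worker pool (the names are hypothetical): a "request" submits a second request to the same pool and blocks on it, so with only one worker the inner request can never start.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class SelfCallStarvation {

    /** Returns true if a "request" that calls back into its own pool finishes within the timeout. */
    static boolean selfCallCompletes(int workerThreads, long timeoutMillis) {
        ExecutorService workers = Executors.newFixedThreadPool(workerThreads);
        try {
            Future<String> outer = workers.submit(() -> {
                // The outer request forwards to the same pool and blocks on the answer.
                Future<String> inner = workers.submit(() -> "inner done");
                return inner.get();
            });
            outer.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            return false; // the inner request never got a worker
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (ExecutionException e) {
            return false;
        } finally {
            workers.shutdownNow(); // interrupt a stuck outer task so the JVM can exit
        }
    }

    public static void main(String[] args) {
        System.out.println("1 worker:  " + selfCallCompletes(1, 300));  // false
        System.out.println("2 workers: " + selfCallCompletes(2, 2000)); // true
    }
}
```

On a real server the same thing happens once every worker is busy with an outer request: all of them wait for inner requests that are stuck in the queue behind them.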
Taking thread dumps is the best way to debug what is going on with the thread pools. Take 3-4 thread dumps one after another, with a gap of 1-2 seconds between each.
From a thread dump, you can find the number of worker threads by their names. Comparing the dumps shows you which threads are long-running.
You can use the TDA tool (http://java.net/projects/tda/downloads/download/tda-bin-2.2.zip) to analyze the thread dumps.
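If you would rather get a quick summary from inside the JVM, something along these lines counts threads by name prefix and state. It is only a sketch: the "demo-http-worker-" prefix is invented for the self-contained demo, and you would substitute whatever prefix your server's worker threads show in a dump.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.CountDownLatch;

final class ThreadDumpSummary {

    /** Counts live threads whose name starts with `prefix`, grouped by thread state. */
    static Map<Thread.State, Integer> statesByPrefix(String prefix) {
        Map<Thread.State, Integer> counts = new TreeMap<>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            if (t.getName().startsWith(prefix)) {
                counts.merge(t.getState(), 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Starts `n` blocked threads named like pool workers and returns how many the summary finds. */
    static int countDemoWorkers(int n) {
        String prefix = "demo-http-worker-" + System.nanoTime() + "-"; // unique per call
        CountDownLatch started = new CountDownLatch(n);
        CountDownLatch release = new CountDownLatch(1);
        for (int i = 0; i < n; i++) {
            Thread t = new Thread(() -> {
                started.countDown();
                try {
                    release.await(); // simulate a worker stuck waiting on something
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }, prefix + i);
            t.setDaemon(true);
            t.start();
        }
        try {
            started.await(); // make sure every demo thread is actually running
            return statesByPrefix(prefix).values().stream().mapToInt(Integer::intValue).sum();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return -1;
        } finally {
            release.countDown();
        }
    }

    public static void main(String[] args) {
        System.out.println("demo workers found: " + countDemoWorkers(3)); // 3
        System.out.println("all threads by state: " + statesByPrefix(""));
    }
}
```

Calling statesByPrefix a few times, a second or two apart, gives a rough picture of whether the workers are all busy or blocked; a full dump is still what you need to see where they are stuck.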
When I send about 100 users to my web service, I get responses and the web service performs fine, but when I test with 1000 concurrent users, none of the requests get a reply.
I am using JMeter for testing.
When I send 1000 concurrent users, my Glassfish admin panel times out in the browser and only opens after 4-5 minutes. The same happens for the WSDL URL.
I have tested my web service on our LAN and it works for 2000 queries without any issues.
Please help me find a solution.
Edit 1.0
Some more findings
On your recommendation, I simply returned a string from the web service function call: no lookup, no DAO, nothing, just returning a string.
The thread pool is 2000, so no issues there.
Now when I ran JMeter for 1000 users, they ran much faster and responses were returned for ~200 requests.
So this means that my PC running Windows 7 with an i5 processor and 4GB RAM is outperforming a dedicated HostGator server with 4GB RAM and an 8-core Xeon 5*** :(
This is not what I am paying $220 a month for...
Correct me if my finding is wrong: I tested my app on the LAN between two PCs locally and it can process 2000+ messages smoothly.
Edit 1.1
After a lot of reading and experiments, I have come to the conclusion that network latency is responsible for this behavior.
I increased the bean pool size in Glassfish's admin panel, and it helped raise the number of concurrent users to 300, but the issue arises again no matter how many beans I keep in the pool.
So, friends, the question is: please suggest some other settings I can change in Glassfish's admin panel to fix this issue at the root!
You need to add some performance logging for the various steps that your service performs. Does it do multiple steps? Is computation slow? Is database access slow? Is your connection pool not scaling well? Do things need to be tweaked in the web server to allow for such high concurrency? You'll need to measure these things to find the bottlenecks so you can eliminate them.
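A minimal sketch of what I mean by per-step timing. StepTimer and the step names are hypothetical; in a real service you would wrap your DAO calls, computation and so on, and write the report to your log. It is not thread-safe, so use one instance per request.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

final class StepTimer {
    private final Map<String, Long> nanosByStep = new LinkedHashMap<>();

    /** Runs one step, records how long it took, and passes its result through. */
    <T> T time(String step, Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            nanosByStep.merge(step, System.nanoTime() - start, Long::sum);
        }
    }

    /** Elapsed nanoseconds per step, in the order the steps first ran. */
    Map<String, Long> nanosByStep() {
        return nanosByStep;
    }

    /** One log-friendly line such as "loadUser=12ms render=3ms". */
    String report() {
        StringBuilder sb = new StringBuilder();
        nanosByStep.forEach((step, nanos) ->
                sb.append(step).append('=').append(nanos / 1_000_000).append("ms "));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        StepTimer timer = new StepTimer();
        // "loadUser" and "render" stand in for real steps (database access, computation, ...).
        String user = timer.time("loadUser", () -> "alice");
        String page = timer.time("render", () -> "<p>" + user + "</p>");
        System.out.println(page + " " + timer.report());
    }
}
```

Log the report line for every request under load and the slow step usually stands out quickly.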
I had the same problem on a server (with 200+ simultaneous users). I studied the official Glassfish tuning guide, but there is a very important parameter that doesn't appear in it. I used JMeter too, and in my case the response time increased exponentially while the server's CPU usage stayed low.
In the Glassfish admin console (configurations/server-config/Network config/thread pools/http-thread-pool) you can see how many users your server can handle. (The parameters are different in Glassfish 2 and 3.)
Max Queue Size: The maximum number of threads in the queue. A value of –1 indicates that there is no limit to the queue size.
Max Thread Pool Size: The maximum number of threads in the thread pool
Min Thread Pool Size: The minimum number of threads in the thread pool
Idle Thread Timeout: The maximum amount of time that a thread can remain idle in the pool. After this time expires, the thread is removed from the pool.
I recommend setting Max Thread Pool Size to 100 or 200 to fix the problem.
You can also set other JVM options, for example:
-Xmx/s/m
-server
-XX:ParallelGCThreads
-XX:+AggressiveOpts
I hope it helps.
I created a web service, both client and server, and wanted to do some performance testing. I tried JMeter with a sample test plan. JBoss processed up to 3000 requests, but with more than 3000 requests some of them are not processed (they fail with "Can't open connection: Connection refused"). Where do I have to make changes to handle more than 10000 requests at the same time? Is it a JBoss issue or a system throughput issue?
JMeter config: 300 threads, 1 sec ramp-up and 10 loops.
System (Server Config) : Windows 7, 4G RAM
Where do I have to make changes to handle more than 10000 requests at the same time?
10 thousand concurrent requests in Tomcat (which I believe is used in JBoss) is quite a lot. In a typical setup (with a blocking I/O connector) you need one thread per HTTP connection. This is way too much for an ordinary JVM. On a 64-bit server machine each thread needs 1 MiB of stack (see the -Xss parameter). And you only have 4 GiB.
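To put numbers on that, here is the arithmetic as a tiny sketch (the class name is made up). It assumes 1 MiB of stack per thread, and it counts reserved address space, which the OS commits lazily, so treat the result as an order-of-magnitude figure rather than exact RAM usage:

```java
final class ThreadStackBudget {

    /** Stack space reserved by `threads` threads, in MiB, for a given per-thread stack size in KiB. */
    static long stackReservationMiB(int threads, int stackKiBPerThread) {
        return (long) threads * stackKiBPerThread / 1024;
    }

    public static void main(String[] args) {
        long needed = stackReservationMiB(10_000, 1024); // 10,000 connections, 1 MiB stacks
        long available = 4L * 1024;                      // the 4 GiB machine from the question
        System.out.println(needed + " MiB of stack vs " + available + " MiB of RAM"); // 10000 vs 4096
    }
}
```

Even before the heap is counted, thread stacks alone would ask for more than twice the memory of the machine.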
Moreover, the number of context switches will kill your performance. You would need hundreds of cores to effectively handle all these connections. And if your request is I/O or database bound - you'll see a bottleneck elsewhere.
That being said, you need a different approach. Either try non-blocking I/O or asynchronous servlets (since Servlet 3.0) or... scale out. By default Tomcat can handle 100-200 concurrent connections (a reasonable default), and a similar number of connections are queued. Everything above that is rejected, which is probably what you are experiencing.
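The "handled, then queued, then rejected" behaviour can be imitated with a bounded ThreadPoolExecutor. This is only an analogy built on java.util.concurrent, not Tomcat's actual connector code, and the sizes are tiny so the demo stays cheap:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class BoundedPoolRejection {

    /** Sends `burst` blocking tasks at a pool with a bounded queue and counts the rejections. */
    static int rejectedInBurst(int maxThreads, int queueCapacity, int burst) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                maxThreads, maxThreads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity));
        CountDownLatch release = new CountDownLatch(1); // keeps every accepted task busy
        int rejected = 0;
        try {
            for (int i = 0; i < burst; i++) {
                try {
                    pool.execute(() -> {
                        try {
                            release.await();
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    });
                } catch (RejectedExecutionException e) {
                    rejected++; // all threads busy and the queue is full
                }
            }
            return rejected;
        } finally {
            release.countDown();
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // 2 running + 3 queued = 5 accepted, so 5 of the 10 are turned away.
        System.out.println(rejectedInBurst(2, 3, 10)); // 5
    }
}
```

From the client's side, a rejected connection is what shows up as "Connection refused" in JMeter.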
See also
Advanced IO and Tomcat
Asynchronous Support in Servlet 3.0
There are two common problems that I can think of.
First, if you run JBoss on Linux as a normal user, you can run into 'Too many open files', if you did not edit the limits.conf file. See https://community.jboss.org/thread/155699. Each open socket counts as an 'open file' for Linux, so the OS could block your connections because of this.
Second, the maximum thread pool size for incoming connections is 200 by default. This limits the number of concurrent requests, i.e. requests that are in progress at the same time. If you have JMeter running 300 threads, the JBoss connector thread pool should be larger. In JBoss 6 you can find this in jboss-web.sar/server.xml. Look for 'maxThreads' in the element: http://docs.jboss.org/jbossweb/latest/config/http.html.
200 is the recommended maximum for a single core CPU. Above that, the context switches start to give too much overhead, like Tomasz says. So for production use, only increase to 400 on a dual core, 800 on a quad core, etc.
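For the first point (open files), you can check from inside the JVM how close you are to the limit. A sketch, assuming a HotSpot/OpenJDK JVM on a Unix-like OS where the com.sun.management extension is available; elsewhere it simply returns null. The class name is made up:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import com.sun.management.UnixOperatingSystemMXBean;

final class OpenFileUsage {

    /** Returns {open, max} file descriptors for this JVM, or null where the JVM does not expose them. */
    static long[] read() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof UnixOperatingSystemMXBean) {
            UnixOperatingSystemMXBean unix = (UnixOperatingSystemMXBean) os;
            return new long[] { unix.getOpenFileDescriptorCount(), unix.getMaxFileDescriptorCount() };
        }
        return null; // e.g. on Windows
    }

    public static void main(String[] args) {
        long[] usage = read();
        if (usage == null) {
            System.out.println("file descriptor counts not available on this platform");
        } else {
            System.out.println("open file descriptors: " + usage[0] + " of " + usage[1]);
        }
    }
}
```

If the open count climbs toward the maximum while JMeter is running, the limits.conf setting is the thing to raise.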
I've seen several StackOverflow posts that discuss what tools to use to monitor web application performance, but none that talk about what metrics to focus on.
What web server metrics should be monitored, and which should have alerts set up?
Here are some I currently have in mind:
request timeouts (alerts)
requests queued (alerts)
time to first byte (may need to be monitored externally)
requests / second
Also, how can these be measured on a Java web application server?
You're off to a good start. I would monitor:
Total response time
Total bytes
Throughput (reqs/sec)
Server CPU overhead
Errors (by error code)
I would also alert on the following:
Application/page not responding
Excessive response time (this depends upon your app, you'll have to figure out the normal SLA)
Excessive throughput (this will alert you to a DOS attack so that you can take action)
50x errors (such as 500, 503, etc.)
Server CPU load factor excessive (again, you'll have to determine what typical is, and configure your tool to alert you when things are abnormal, another indicator of DOS or a runaway process)
Errors in log files (if your tools supports it, configure it to send alerts when errors/exceptions pop up in log files)
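As for measuring these inside a Java application, here is a minimal sketch of an in-process recorder. RequestMetrics is a hypothetical class; you would call record(...) from wherever your application sees each request finish (a servlet filter, for instance) and feed the numbers to whatever monitoring tool you use.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

final class RequestMetrics {
    private final LongAdder requests = new LongAdder();
    private final LongAdder serverErrors = new LongAdder();
    private final LongAdder totalMillis = new LongAdder();
    private final AtomicLong maxMillis = new AtomicLong();

    /** Call once per completed request with its HTTP status and total response time. */
    void record(int status, long elapsedMillis) {
        requests.increment();
        totalMillis.add(elapsedMillis);
        if (status >= 500) {
            serverErrors.increment(); // the 50x errors worth alerting on
        }
        maxMillis.accumulateAndGet(elapsedMillis, Math::max);
    }

    long requests() { return requests.sum(); }

    long serverErrors() { return serverErrors.sum(); }

    long maxMillis() { return maxMillis.get(); }

    double meanMillis() {
        long n = requests.sum();
        return n == 0 ? 0.0 : (double) totalMillis.sum() / n;
    }

    /** Throughput over a measurement window of the given length. */
    double requestsPerSecond(long windowMillis) {
        return requests.sum() * 1000.0 / windowMillis;
    }

    public static void main(String[] args) {
        RequestMetrics metrics = new RequestMetrics();
        metrics.record(200, 10);
        metrics.record(200, 30);
        metrics.record(503, 2000);
        System.out.println("mean=" + metrics.meanMillis() + "ms max=" + metrics.maxMillis()
                + "ms 5xx=" + metrics.serverErrors() + " rps=" + metrics.requestsPerSecond(1500));
    }
}
```

The counters are safe to update from many request threads at once. Alert thresholds (what counts as excessive response time or throughput) still have to come from your own baseline.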