Spring boot + tomcat 8.5 + mongoDB, AsyncRequestTimeoutException - java

I have created a spring boot web application and deployed war of the same to tomcat container.
The application connects to mongoDB using Async connections. I am using mongodb-driver-async library for that.
At startup everything works fine. But as soon as load increases, It shows following exception in DB connections:
org.springframework.web.context.request.async.AsyncRequestTimeoutException: null
at org.springframework.web.context.request.async.TimeoutDeferredResultProcessingInterceptor.handleTimeout(TimeoutDeferredResultProcessingInterceptor.java:42)
at org.springframework.web.context.request.async.DeferredResultInterceptorChain.triggerAfterTimeout(DeferredResultInterceptorChain.java:75)
at org.springframework.web.context.request.async.WebAsyncManager$5.run(WebAsyncManager.java:392)
at org.springframework.web.context.request.async.StandardServletAsyncWebRequest.onTimeout(StandardServletAsyncWebRequest.java:143)
at org.apache.catalina.core.AsyncListenerWrapper.fireOnTimeout(AsyncListenerWrapper.java:44)
at org.apache.catalina.core.AsyncContextImpl.timeout(AsyncContextImpl.java:131)
at org.apache.catalina.connector.CoyoteAdapter.asyncDispatch(CoyoteAdapter.java:157)
I am using following versions of software:
Spring boot -> 1.5.4.RELEASE
Tomcat (installed as standalone binary) -> apache-tomcat-8.5.37
Mongo DB version: v3.4.10
mongodb-driver-async: 3.4.2
As soon as I restart the tomcat service, everything starts working fine.
Please help, what could be the root cause of this issue.
P.S.: I am using DeferredResult and CompletableFuture to create Async REST API.
I have also tried using spring.mvc.async.request-timeout in application and configured asynTimeout in tomcat. But still getting same error.

It's probably obvious that Spring is timing out your requests and throwing AsyncRequestTimeoutException, which returns a 503 back to your client.
Now the question is, why is this happening? There are two possibilities.
These are legitimate timeouts. You mentioned that you only see the exceptions when the load on your server increases. So possibly your server just can't handle that load and its performance has degraded to the point where some requests can't complete before Spring times them out.
The timeouts are caused by your server failing to send a response to an asynchronous request due to a programming error, leaving the request open until Spring eventually times it out. It's easy for this to happen if your server doesn't handle exceptions well. If your server is synchronous, it's okay to be a little sloppy with exception handling because unhandled exceptions will propagate up to the server framework, which will send a response back to the client. But if you fail to handle an exception in some asynchronous code, that exception will be caught elsewhere (probably in some thread pool management code), and there's no way for that code to know that there's an asynchronous request waiting on the result of the operation that threw the exception.
It's hard to figure out what might be happening without knowing more about your application. But there are some things you could investigate.
First, try looking for resource exhaustion.
Is the garbage collector running all the time?
Are all CPUs pegged at 100%?
Is the OS swapping heavily?
If the database server is on a separate machine, is that machine showing signs of resource exhaustion?
How many connections are open to the database? If there is a connection pool, is it maxed out?
How many threads are running? If there are thread pools in the server, are they maxed out?
If something's at its limit then possibly it is the bottleneck that is causing your requests to time out.
Try setting spring.mvc.async.request-timeout to -1 and see what happens. Do you now get responses for every request, only slowly, or do some requests seem to hang forever? If it's the latter, that strongly suggests that there's a bug in your server that's causing it to lose track of requests and fail to send responses. (If setting spring.mvc.async.request-timeout appears to have no effect, then the next thing you should investigate is whether the mechanism you're using for setting the configuration actually works.)
A strategy that I've found useful in these cases is to generate a unique ID for each request and write the ID along with some contextual information every time the server either makes an asynchronous call or receives a response from an asynchronous call, and at various checkpoints within asynchronous handlers. If requests go missing, you can use the log information to figure out the request IDs and what the server was last doing with that request.
A similar strategy is to save each request ID into a map in which the value is an object that tracks when the request was started and what your server last did with that request. (In this case your server is updating this map at each checkpoint rather than, or in addition to, writing to the log.) You can set up a filter to generate the request IDs and maintain the map. If your filter sees the server send a 5xx response, you can log the last action for that request from the map.
Hope this helps!

Asynchroneus tasks are arranged in a queue(pool) which is processed in parallel depending on the number of threads allocated. Not all asynchroneus tasks are executed at the same time. Some of them are queued. In a such system getting AsyncRequestTimeoutException is normal behaviour.
If you are filling up the queues with asynchroneus tasks that are unable to execute under pressure. Increasing the timeout will only delay the problem. You should focus instead on the problem:
Reduce the execution time(through various optimizations) of asynchroneus task. This will relax the pooling of async tasks. It oviously requires coding.
Increase the number of CPUSs allocated in order to be able to run more efficiently the parallel tasks.
Increase the number of threads servicing the executor of the driver.
Mongo Async driver is using AsynchronousSocketChannel or Netty if Netty is found in the classpath. In order to increase the number of the worker threads servicing the async comunication you should use:
MongoClientSettings.builder()
.streamFactoryFactory(NettyStreamFactoryFactory(io.netty.channel.EventLoopGroup eventLoopGroup,
io.netty.buffer.ByteBufAllocator allocator))
.build();
where eventLoopGroup would be io.netty.channel.nio.NioEventLoopGroup(int nThreads))
on the NioEventLoopGroup you can set the number of threads servicing your async comunication
Read more about Netty configuration here https://mongodb.github.io/mongo-java-driver/3.2/driver-async/reference/connecting/connection-settings/

Related

Failed to connect to Tomcat server on ec2 instance

UPDATE:
My goal is to learn what factors could overwhelm my little tomcat server. And when some exception happens, what I could do to resolve or remediate it without switching my server to a better machine. This is not a real app in a production environment but just my own experiment (Besides some changes on the server-side, I may also do something on my client-side)
Both of my client and server are very simple: the server only checks the URL format and send 201 code if it is correct. Each request sent from my client only includes an easy JSON body. There is no database involved. The two machines (t2-micro) only run client and server respectively.
My client is OkHttpClient(). To avoid timeout exceptions, I already set timeout 1,000,000 milli secs via setConnectTimeout, setReadTimeout, and setWriteTimeout. I also go to $CATALINA/conf/server.xml on my server and set connectionTimeout = "-1"(infinite)
ORIGINAL POST:
I'm trying to stress out my server by having a client launching 3000+ threads sending HTTP requests to my server. Both of my client and server reside on different ec2 instances.
Initially, I encountered some timeout issues, but after I set the connection, read and write timeout to a bigger value, this exception has been resolved. However, with the same specification, I'm getting java.net.ConnectException: Failed to connect to my_host_ip:8080 exception. And I do not know its root cause. I'm new to multithreading and distributed system, can anyone please give me some insights of this exception?
Below is some screenshot of from my ec2:
1. Client:
2. Server:
Having gone through similar exercise in past I can say that there is no definitive answer to the problem of scaling.
Here are some general trouble shooting steps that may lead to more specific information. I would suggest trying out tests by tweaking a few parameters in each test and measure the changes in Cpu, logs etc.
Please provide what value you have put for the timeout. Increasing timeout could cause your server (or client) to run out of threads quickly (cause each thread can process for longer). Question the need for increasing timeout. Is there any processing that slows your server?
Check application logs, JVM usage, memory usage on the client and Server. There will be some hints there.
Your client seems to be hitting 99%+ and then come down. This implies that there could be a problem at the client side in that it maxes out during the test. Your might want to resize your client to be able to do more.
Look at open file handles. The number should be sufficiently high.
Tomcat has some limit on thread count to handle load. You can check this in server.xml and if required change it to handle more. Although cpu doesn't actually max out on server side so unlikely that this is the problem.
If you a database then check the performance of the database. Also check jdbc connect settings. There is thread and timeout config at jdbc level as well.
Is response compression set up on the Tomcat? It will give much better throughout on server especially if the data being sent back by each request is more than a few kbs.
--------Update----------
Based on update on question few more thoughts.
Since the application is fairly simple, the path in terms of stressing the server should be to start low and increase load in increments whilst monitoring various things (cpu, memory, JVM usage, file handle count, network i/o).
The increments of load should be spread over several runs.
Start with something as low as 100 parallel threads.
Record as much information as you can after each run and if the server holds up well, increase load.
Suggested increments 100, 200, 500, 1000, 1500, 2000, 2500, 3000.
At some level you will see that the server can no longer take it. That would be your breaking point.
As you increase load and monitor you will likely discover patterns that suggest tuning of specific parameters. Each tuning attempt should then be tested again the same level of multi threading. The improvement of available will be obvious from the monitoring.

How to setup RabbitMQ RPC in a web context

RabbitMQ RPC
I decided to use RabbitMQ RPC as described here.
My Setup
Incoming web requests (on Tomcat) will dispatch RPC requests over RabbitMQ to different services and assemble the results. I use one reply queue with one custom consumer that listens to all RPC responses and collects them with their correlation id in a simple hash map. Nothing fancy there.
This works great in a simple integration test on controller level.
Problem
When I try to do this in a web project deployed on Tomcat, Tomcat refuses to shut down. jstack and some debugging learned me a thread is spawn to listen for the RPC response and is blocking Tomcat from shutting down gracefully. I guess this is because the created thread is created on application level instead of request level and is not managed by Tomcat. When I set breakpoints in Servlet.destroy() or ServletContextListener.contextDestroyed(ServletContextEvent sce), they are not reached, so I see no way to manually clean things up.
Alternative
As an alternative, I could use a new reply queue (and simple QueueingConsumer) for each web request. I've tested this, it works and Tomcat shuts down as it should. But I'm wondering if this is the way to go.. Can a RabbitMQ cluster deal with thousands (or even millions) of short living queues/consumers? I can imagine queues aren't that big, but still.. constantly broadcasting to all cluster nodes.. the total memory footprint..
Question
So in short, is it wise do create a queue for each incoming web request or how should I setup RabbitMQ with one queue and consumer so Tomcat can shutdown gracefully?
I found a solution for my problem:
The Java client is creating his own threads. There is the possibility to add your own ExecutorService when creating a new connection. Doing so in the ServletContextListener.initialized() method, one can keep track of the ExecutorService and shut it down manually in the ServletContextListener.destroyed() method.
executorService.shutdown();
executorService.awaitTermination(20, TimeUnit.SECONDS);
I used Executors.newCachedThreadPool(); as the threads have many short executions, and they get cleaned up when being idle for more then 60s.
This is the link to the RabbitMQ Google group thread (thx to Michael Klishin for showing me the right direction)

Glassfish thread pool issues

We're using Glassfish 3.0.1 and experiencing very long response times; in the order of 5 minutes for 25% of our POST/PUT requests, by the time the response comes back the front facing load balancer has timed out.
My theory is that the requests are queuing up and waiting for an available thread.
The reason I think this is because the access logs reveal that the requests are taking a few seconds to complete however the time at which the requests are being executed are five minutes later than I'd expect.
Does anyone have any advice for debugging what is going on with the thread pools? or what the optimum settings should be for them?
Is it required to do a thread dump periodically or will a one off dump be sufficient?
At first glance, this seems to have very little to do with the threadpools themselves. Without knowing much about the rest of your network setup, here are some things I would check:
Is there a dead/nonresponsive node in the load balancer pool? This can cause all requests to be tried against this node until they fail due to timeout before being redirected to the other node.
Is there some issue with initial connections between the load balancer and the Glassfish server? This can be slow or incorrect DNS lookups (though the server should cache results), a missing proxy, or some other network-related problem.
Have you checked that the clocks are synchronized between the machines? This could cause the logs to get out of sync. 5min is a pretty strange timeout period.
If all these come up empty, you may simply have an impedance mismatch between the load balancer and the web server and you may need to add webservers to handle the load. The load balancer should be able to give you plenty of stats on the traffic coming in and how it's stacking up.
Usually you get this behaviour if you configured not enough worker threads in your server. Default values range from 15 to 100 threads in common webservers. However if your application blocks the server's worker threads (e.g. by waiting for queries) the defaults are way too low frequently.
You can increase the number of workers up to 1000 without problems (assure 64 bit). Also check the number of workerthreads (sometimes referred to as 'max concurrent/open requests') of any in-between server (e.g. a proxy or an apache forwarding via mod_proxy).
Another common pitfall is your software sending requests to itself (e.g. trying to reroute or forward a request) while blocking an incoming request.
Taking threaddump is the best way to debug what is going on with the threadpools. Please take 3-4 threaddumps one after another with 1-2 seconds gap between each threaddump.
From threaddump, you can find the number of worker threads by their name. Find out long running threads from the multiple threaddumps.
You may use TDA tool (http://java.net/projects/tda/downloads/download/tda-bin-2.2.zip) for analyzing threaddumps.

Threaded apache cxf clients and performance on high frequency requests

I have a relatively simple java service that fetches information from various SOAP webservices and does so using apache cxf 2.5.2 under the hood. The service launches 20 worker threads to churn through 1000-8000 requests every hour and each request could make 2-5 webservice calls depending on the nature of the request.
Setup
I am using connection pooling on the webservice connections
Connection Timeout is set to 2 seconds in order to realistically tackle the volume of requests efficiently.
All connections are going out through a http proxy.
20 Worker Threads
Grunty 16 cpu box
The problem is that I am starting to see 'connect time out' errors in the logs and quite a large number of them and it seems the the application service is also effecting the machines network performance as curl from the command line takes >5 seconds just establish a connection to the same webservices. However when I stop the service application, curl performance improves drastically to < 5ms
How have other people tackled this situation using CXF? did it work or did they switch to a different library? If you were to start from scratch how would you design for 'small payload high frequency' transactions?
Once we had the similar problem as yours that the request took very long time to complete. It is not CXF issue, every web services' stacks will operate long for very frequent request.
To solve this issue we implemented JMS EJB message driven bean. The flow were as follows: when the users send their request to web service, all requests were put into JMS queue so that response to users come very quickly and request is left to process at the background. Later the users were able to see their operations: if they are still send to process, if they are processing, if they are completed successfully or if they failed to complete for some reason.
If I had to design frequent transactions application, I would definitely use JMS for that.
Hope this helps.

How to force Apache Tomcat not to reuse a thread from the thread pool?

We've got a normal-ish server stack including BlazeDS, Tomcat and Hibernate.
We'd like to arrange things such that if certain errors (especially AssertionError) are thrown, the current thread is considered to be in an unknown state and won't be used for further HTTP requests. (Partly because we're storing some things, such as Hibernate transaction session, in thread-local storage. Now, we can catch throwables and make sure to roll back transactions and rethrow, but there's no guarantee about what other code may have left who knows what in the thread-local storage.)
Tomcat with the default thread pool behavior reuses threads. We tried specifying our own executor, which seems to be the most specific method of changing its thread pool behavior, but it doesn't always call Executor.execute() with a new task for each request. (Most likely it reuses the same execution context for all requests in the same HTTP connection.)
One option is to disable keepalive, so that there's only one request per HTTP connection, but that's ugly.
Anyway, I'd like to know. Is there a way to tell Tomcat not to reuse a thread, or to kill or exit the thread so that Tomcat is forced to create a new one?
(From the Tomcat source, it appears Tomcat will close the connection and abandon the task/thread after sending a HTTP 500 response, but I don't know how to get BlazeDS to generate a 500 response; that's another angle I'd like to know more about.)
I would strongly suggest simply getting rid of your use of thread-local storage, or at least coming up with a method to clear the thread-local storage when a request first enters the pipeline (with a <filter> in web.xml, for example).
Having to reconfigure something basic about Tomcat to get it to work with your app points to a code smell.

Categories