When I send a stop signal (either with kill -SIGINT <pid>, System.exit(0), or environment.getApplicationContext().getServer().stop()) to the application, it waits for the shutdownGracePeriod (30 seconds by default, or whatever I configure in the .yml file) and also stops accepting new requests. However, my requirement is to make the server wait for the ongoing requests to complete before stopping. An ongoing request may take 30 seconds or 30 minutes; it is unknown. Can somebody suggest a way to achieve this?
Note: I've referred to the links below but could not achieve this:
How to shutdown dropwizard application?
shutdownGracePeriod
We've used an in-app healthcheck combined with an external load balancer service and prestop scripts. The healthcheck is turned off by the prestop script, so the healthcheck reports the instance as unhealthy and the load balancer stops sending new requests to it (while existing ones keep being processed); only after the draining period is a stop signal sent to the application.
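For illustration, a minimal sketch of such a toggleable healthcheck in Dropwizard might look like the following (the class and method names are invented; how the prestop hook flips the flag, e.g. through an admin task or endpoint, depends on your setup):

import com.codahale.metrics.health.HealthCheck;
import java.util.concurrent.atomic.AtomicBoolean;

public class DrainAwareHealthCheck extends HealthCheck {

    private final AtomicBoolean draining = new AtomicBoolean(false);

    // Called by the prestop hook so the load balancer stops sending new traffic
    // while in-flight requests are allowed to finish.
    public void startDraining() {
        draining.set(true);
    }

    @Override
    protected Result check() {
        return draining.get()
                ? Result.unhealthy("draining: not accepting new traffic")
                : Result.healthy();
    }
}

It would be registered like any other healthcheck, e.g. environment.healthChecks().register("drain", healthCheck).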
Even this, though, has a specified time limit. I don't know how you would monitor requests that last an unknown amount of time.
My team maintains an application (written in Java) which processes long-running batch jobs. These jobs need to be run in a defined sequence, so the application starts a socket server on a pre-defined port to accept job execution requests. It keeps the socket open until the job completes (with success or failure). This way the job scheduler knows when one job ends, and upon successful completion it triggers the next job in the pre-defined sequence. If the job fails, the scheduler sends out an alert.
This is a setup we have had for over a decade. Some jobs run for a few minutes and others take a couple of hours (depending on the volume) to complete. The setup has worked without any issues.
Now, we need to move this application to a container (RedHat OpenShift Container Platform), and the infra policy in place allows only the default HTTPS port to be exposed. The scheduler sits outside OCP and cannot access any port other than the default HTTPS port.
In theory, we could use HTTPS, set the client timeout to a very large duration, and try to mimic the current TCP socket setup. But would this be reliable enough, given that the HTTP protocol is designed to serve short-lived requests?
There isn't a reliable way to keep a connection alive for a long period over the internet, because of the nodes (routers, load balancers, proxies, NAT gateways, etc.) that may be sitting between your client and server. They might drop the connection mid-stream under load, some will happily ignore your HTTP keep-alive request, and others have an internal maximum connection duration that will kill long-running TCP connections. You may find it works for you today, but there is no guarantee it will work for you tomorrow.
So you'll probably need to submit the job as a short-lived request and check the status via other means:
Push-based strategy: send a webhook URL as part of the job submission and have the server call it (possibly with retries) on job completion to notify interested parties.
Pull-based strategy: have the server return a job ID on submission, then have the client check the status periodically. Given the range of your job durations, you may want to implement this with some form of exponential backoff up to a certain limit: for example, first check after 2 seconds, then wait 4 seconds before the next check, then 8 seconds, and so on, up to a maximum interval you are happy to wait between checks. That way you find out about short job completions sooner without checking too frequently for long jobs.
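For the pull-based strategy, a minimal sketch of such a backoff loop might look like this (JobPoller and checkStatus are invented names; checkStatus stands in for your real HTTP status call):

import java.util.concurrent.TimeUnit;

public class JobPoller {

    // Placeholder for the real HTTP call to the status endpoint; returns true once the job has completed.
    private boolean checkStatus(String jobId) {
        throw new UnsupportedOperationException("call your status endpoint here");
    }

    // Poll with exponential backoff: 2 s, 4 s, 8 s, ... capped at maxDelaySeconds.
    public void waitForCompletion(String jobId) throws InterruptedException {
        long delaySeconds = 2;
        final long maxDelaySeconds = 300; // stop growing the interval at 5 minutes
        while (!checkStatus(jobId)) {
            TimeUnit.SECONDS.sleep(delaySeconds);
            delaySeconds = Math.min(delaySeconds * 2, maxDelaySeconds);
        }
    }
}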
When you worked with sockets and the TCP protocol you were in control of how long to keep connections open. With HTTP you are only in control of logical connections, not physical ones. The actual connections are managed by the OS, and usually IT people can configure all of those timeouts. By default, even when you close a logical connection the real connection is not closed immediately, in anticipation of the next communication; it is closed by the OS and is not controlled by your code. However, even if it does close, your next request transparently opens a new one, so it doesn't really matter whether it was closed or not; it should be transparent to your code. In short, I assume you can move to HTTP/HTTPS with no problems, but you will have to test and see. Also, for other options on server-to-client communication, you can look at my answer to this question: How to continues send data from backend to frontend when something changes
We have had bad experiences with long-standing HTTP/HTTPS connections. We used to schedule short jobs (only a couple of minutes) via HTTP and wait for them to finish and send a response. This worked fine until the jobs got longer (hours) and some network infrastructure closed the inactive connections. We ended up only submitting the request via HTTP, getting an immediate response, and then implementing polling to wait for the result. At the time, the migration was pretty quick for us, but since then we have migrated it even further to use "webhooks", e.g. allowing the processor of the job to signal its state back to the server using a known webhook address.
IMHO, you should upgrade your scheduler to use a REST API; WebSocket isn't effective in this scenario because the connection would be inactive most of the time.
The jobs can be short-lived or long-running. So, when a long-running job fails in the middle, how does the restart of the job happen? Does it start from the beginning again?
In a similar scenario, we had a database to keep track of the progress of the job (the number of records successfully processed), so the jobs could resume after a failure. With such a design, another web service can monitor the status of the job by looking at the database, and the main process is not impacted by constant polling from the client.
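As a rough sketch of the worker side of that design (the table and column names here are invented for illustration), the job periodically persists its progress so a separate status service, or a resumed run, can read it:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class JobProgressDao {

    // Assumed table: job_progress(job_id VARCHAR PRIMARY KEY, records_done BIGINT, status VARCHAR)
    public void recordProgress(Connection conn, String jobId, long recordsDone, String status)
            throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "UPDATE job_progress SET records_done = ?, status = ? WHERE job_id = ?")) {
            ps.setLong(1, recordsDone);
            ps.setString(2, status);
            ps.setString(3, jobId);
            ps.executeUpdate();
        }
    }
}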
I have created a Spring Boot web application and deployed its war to a Tomcat container.
The application connects to MongoDB using async connections; I am using the mongodb-driver-async library for that.
At startup everything works fine, but as soon as load increases, the following exception shows up in the DB connections:
org.springframework.web.context.request.async.AsyncRequestTimeoutException: null
at org.springframework.web.context.request.async.TimeoutDeferredResultProcessingInterceptor.handleTimeout(TimeoutDeferredResultProcessingInterceptor.java:42)
at org.springframework.web.context.request.async.DeferredResultInterceptorChain.triggerAfterTimeout(DeferredResultInterceptorChain.java:75)
at org.springframework.web.context.request.async.WebAsyncManager$5.run(WebAsyncManager.java:392)
at org.springframework.web.context.request.async.StandardServletAsyncWebRequest.onTimeout(StandardServletAsyncWebRequest.java:143)
at org.apache.catalina.core.AsyncListenerWrapper.fireOnTimeout(AsyncListenerWrapper.java:44)
at org.apache.catalina.core.AsyncContextImpl.timeout(AsyncContextImpl.java:131)
at org.apache.catalina.connector.CoyoteAdapter.asyncDispatch(CoyoteAdapter.java:157)
I am using the following software versions:
Spring Boot -> 1.5.4.RELEASE
Tomcat (installed as a standalone binary) -> apache-tomcat-8.5.37
MongoDB version: v3.4.10
mongodb-driver-async: 3.4.2
As soon as I restart the Tomcat service, everything starts working fine again.
Please help: what could be the root cause of this issue?
P.S.: I am using DeferredResult and CompletableFuture to create the async REST API.
I have also tried setting spring.mvc.async.request-timeout in the application configuration and asyncTimeout in Tomcat, but I am still getting the same error.
It's probably obvious that Spring is timing out your requests and throwing AsyncRequestTimeoutException, which returns a 503 back to your client.
Now the question is, why is this happening? There are two possibilities.
These are legitimate timeouts. You mentioned that you only see the exceptions when the load on your server increases. So possibly your server just can't handle that load and its performance has degraded to the point where some requests can't complete before Spring times them out.
The timeouts are caused by your server failing to send a response to an asynchronous request due to a programming error, leaving the request open until Spring eventually times it out. It's easy for this to happen if your server doesn't handle exceptions well. If your server is synchronous, it's okay to be a little sloppy with exception handling because unhandled exceptions will propagate up to the server framework, which will send a response back to the client. But if you fail to handle an exception in some asynchronous code, that exception will be caught elsewhere (probably in some thread pool management code), and there's no way for that code to know that there's an asynchronous request waiting on the result of the operation that threw the exception.
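As an illustration of the second case, here is a hedged sketch (not taken from your code) of a handler that bridges a CompletableFuture to a DeferredResult; if the error branch in whenComplete were missing, a failed future would leave the request open until Spring's timeout fires:

import java.util.concurrent.CompletableFuture;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.context.request.async.DeferredResult;

@RestController
public class ItemController {

    @GetMapping("/item")
    public DeferredResult<String> getItem() {
        DeferredResult<String> result = new DeferredResult<>();
        loadItemAsync().whenComplete((value, error) -> {
            if (error != null) {
                result.setErrorResult(error); // without this, a failure never completes the request
            } else {
                result.setResult(value);
            }
        });
        return result;
    }

    // Placeholder for the real asynchronous data access (e.g. the async Mongo call).
    private CompletableFuture<String> loadItemAsync() {
        return CompletableFuture.supplyAsync(() -> "item");
    }
}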
It's hard to figure out what might be happening without knowing more about your application. But there are some things you could investigate.
First, try looking for resource exhaustion.
Is the garbage collector running all the time?
Are all CPUs pegged at 100%?
Is the OS swapping heavily?
If the database server is on a separate machine, is that machine showing signs of resource exhaustion?
How many connections are open to the database? If there is a connection pool, is it maxed out?
How many threads are running? If there are thread pools in the server, are they maxed out?
If something's at its limit then possibly it is the bottleneck that is causing your requests to time out.
Try setting spring.mvc.async.request-timeout to -1 and see what happens. Do you now get responses for every request, only slowly, or do some requests seem to hang forever? If it's the latter, that strongly suggests that there's a bug in your server that's causing it to lose track of requests and fail to send responses. (If setting spring.mvc.async.request-timeout appears to have no effect, then the next thing you should investigate is whether the mechanism you're using for setting the configuration actually works.)
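If changing the property seems to have no effect, one way to rule out a configuration problem is to set the timeout programmatically; a small sketch using Spring MVC's standard async configurer (the -1 is for diagnosis only, not for production):

import org.springframework.context.annotation.Configuration;
import org.springframework.web.servlet.config.annotation.AsyncSupportConfigurer;
import org.springframework.web.servlet.config.annotation.WebMvcConfigurerAdapter;

@Configuration
public class AsyncTimeoutConfig extends WebMvcConfigurerAdapter {

    @Override
    public void configureAsyncSupport(AsyncSupportConfigurer configurer) {
        configurer.setDefaultTimeout(-1); // disable the async timeout while investigating
    }
}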
A strategy that I've found useful in these cases is to generate a unique ID for each request and write the ID along with some contextual information every time the server either makes an asynchronous call or receives a response from an asynchronous call, and at various checkpoints within asynchronous handlers. If requests go missing, you can use the log information to figure out the request IDs and what the server was last doing with that request.
A similar strategy is to save each request ID into a map in which the value is an object that tracks when the request was started and what your server last did with that request. (In this case your server is updating this map at each checkpoint rather than, or in addition to, writing to the log.) You can set up a filter to generate the request IDs and maintain the map. If your filter sees the server send a 5xx response, you can log the last action for that request from the map.
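A rough sketch of such a tracking filter (all names are illustrative; note that with async requests the final status is only visible on the ASYNC re-dispatch, so the filter needs to be registered for ASYNC dispatches as well):

import java.io.IOException;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletResponse;

public class RequestTrackingFilter implements Filter {

    // requestId -> last known action, updated at each checkpoint inside the handlers
    private static final Map<String, String> LAST_ACTION = new ConcurrentHashMap<>();

    public static void checkpoint(String requestId, String action) {
        LAST_ACTION.put(requestId, action);
    }

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        String requestId = (String) req.getAttribute("requestId");
        if (requestId == null) {
            requestId = UUID.randomUUID().toString();
            req.setAttribute("requestId", requestId);
            checkpoint(requestId, "received");
        }
        try {
            chain.doFilter(req, res);
        } finally {
            if (!req.isAsyncStarted()) { // final dispatch: the status is now known
                int status = ((HttpServletResponse) res).getStatus();
                if (status >= 500) {
                    System.err.println("Request " + requestId + " ended with " + status
                            + ", last action: " + LAST_ACTION.get(requestId));
                }
                LAST_ACTION.remove(requestId);
            }
        }
    }

    @Override
    public void init(FilterConfig filterConfig) {
    }

    @Override
    public void destroy() {
    }
}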
Hope this helps!
Asynchronous tasks are arranged in a queue (pool) which is processed in parallel depending on the number of threads allocated. Not all asynchronous tasks are executed at the same time; some of them are queued. In such a system, getting an AsyncRequestTimeoutException is normal behaviour.
If you are filling up the queues with asynchronous tasks that are unable to execute under pressure, increasing the timeout will only delay the problem. You should focus instead on the problem itself:
Reduce the execution time (through various optimizations) of the asynchronous tasks. This will relieve the pressure on the pool of async tasks. It obviously requires coding.
Increase the number of CPUs allocated in order to be able to run the parallel tasks more efficiently.
Increase the number of threads servicing the executor of the driver.
The Mongo async driver uses AsynchronousSocketChannel, or Netty if Netty is found on the classpath. In order to increase the number of worker threads servicing the async communication you should use:
int nThreads = 8; // worker threads for async I/O, tune to your load
EventLoopGroup eventLoopGroup = new NioEventLoopGroup(nThreads);
MongoClientSettings settings = MongoClientSettings.builder()
        .streamFactoryFactory(new NettyStreamFactoryFactory(
                eventLoopGroup, PooledByteBufAllocator.DEFAULT)) // any io.netty.buffer.ByteBufAllocator works here
        .build();
where eventLoopGroup is an io.netty.channel.nio.NioEventLoopGroup; its nThreads constructor argument sets the number of threads servicing your async communication.
Read more about Netty configuration here https://mongodb.github.io/mongo-java-driver/3.2/driver-async/reference/connecting/connection-settings/
My Tomcat server keeps processing some requests for more than 10 minutes. I stopped the client which had triggered those requests, but Tomcat still keeps processing them.
I have tried different settings for the connectionTimeout property in Tomcat's server.xml file, but it is not working.
I would like to know how to configure Tomcat so that it kills/stops processing requests which take longer than a certain time, like 10 seconds or 1 minute.
From The Apache Tomcat Connector - Generic HowTo
Timeouts
JK can also use a timeout on request replies. This timeout does not
measure the full processing time of the response. Instead it controls
how much time between consecutive response packets is allowed.
In most cases, this is what one actually wants. Consider for example
long running downloads. You would not be able to set an effective
global reply timeout, because downloads could last for many minutes.
Most applications though have limited processing time before starting
to return the response. For those applications you could set an
explicit reply timeout. Applications that do not harmonise with reply
timeouts are batch type applications, data warehouse and reporting
applications which are expected to observe long processing times.
If JK aborts waiting for a response, because a reply timeout fired,
there is no way to stop processing on the backend. Although you free
processing resources in your web server, the request will continue to
run on the backend - without any way to send back a result once the
reply timeout fired.
I have JBoss 5 with EJB3 beans deployed to it.
If a bean method execution takes a very long time (I checked with 2 hours), the client does not receive the answer when the EJB method execution finishes (whether with an exception or not).
The client stays blocked waiting for a response from the socket.
Why does that happen?
Most likely this is caused by a (stateful) router, packet filter, load balancer, SSL box or whatever sits in between: they just terminate the connection after a certain time of inactivity, and the real endpoints are not notified. Experience shows that it is normally out of your control to have suitable timeouts in each device.
Anyway, in your case, instead of curing the symptoms: a running request needs an open TCP connection and possibly blocks a thread, so consider changing the design of your system from synchronous to asynchronous:
Use polling here; checking every minute should be enough. So you have one function to submit a task, and another which returns "not yet ready" or "here is the result" (a minimal sketch of this split follows after these options).
Use JMS queues in your client to submit tasks and to receive results
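For the polling option, a minimal sketch of the submit/poll split could look like this (the names are invented; adapt them to your bean model):

import java.util.Map;
import java.util.UUID;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class TaskService {

    private final ExecutorService executor = Executors.newFixedThreadPool(4);
    private final Map<String, Future<String>> tasks = new ConcurrentHashMap<>();

    // Returns immediately with a task id; the long-running work continues in the background.
    public String submit(Callable<String> work) {
        String taskId = UUID.randomUUID().toString();
        tasks.put(taskId, executor.submit(work));
        return taskId;
    }

    // The client polls this (e.g. once a minute) until a result or an error is available.
    public String poll(String taskId) throws Exception {
        Future<String> future = tasks.get(taskId);
        if (future == null) {
            return "unknown task";
        }
        if (!future.isDone()) {
            return "not yet ready";
        }
        tasks.remove(taskId);
        return future.get(); // rethrows the task's exception if the work failed
    }
}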
We have two AppEngine (Java) apps. One of them uses URLFetch to call the other to create an appointment. In the receiver, we've added a feature where we use the Channel API to see if there are any open channels and let them know about the new data.
The URLFetch call is failing with a SocketTimeoutException. All the code in the receiver is executed (including notifying all open channels), but the calling app still gets a SocketTimeoutException. When I comment out the channel notification line, there is no error.
This happens only in the deployed app, not in dev mode. Also, the call doesn't come close to reaching the 60-second (or even the old 10-second) timeout allowed by URLFetch.
The default deadline for urlfetch is 5 seconds, so if your application takes more than 5 seconds to load and execute the handler, it will return a SocketTimeoutException.
As described in the documentation, you can set a longer deadline for your urlfetch call using setConnectTimeout or setReadTimeout
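For example, a hedged sketch using the standard java.net API, which App Engine backs with URLFetch (the URL here is a placeholder):

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class AppointmentClient {

    // Raise both timeouts (in milliseconds) before making the call.
    public int createAppointment() throws Exception {
        URL url = new URL("https://receiver-app.example.com/appointments"); // placeholder
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        connection.setConnectTimeout(60_000);
        connection.setReadTimeout(60_000);
        try (InputStream in = connection.getInputStream()) {
            return connection.getResponseCode();
        }
    }
}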
In addition, it is a good idea to move the API calls that can be deferred (i.e. not necessary to build the HTTP response) to a task queue (see the sketch after this list):
the deadline for a task queue request is longer (10 minutes instead of 60 seconds)
the task will be retried if it fails
the urlfetch timeout is longer there too (10 minutes)
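A small sketch of deferring the channel notification to a task queue (the worker URL is hypothetical and would need its own handler):

import com.google.appengine.api.taskqueue.Queue;
import com.google.appengine.api.taskqueue.QueueFactory;
import com.google.appengine.api.taskqueue.TaskOptions;

public class ChannelNotifier {

    // Enqueue the notification instead of doing it inline, so the handler that
    // serves the URLFetch call can return well before its deadline.
    public void notifyLater(String appointmentId) {
        Queue queue = QueueFactory.getDefaultQueue();
        queue.add(TaskOptions.Builder
                .withUrl("/tasks/notify-channels")
                .param("appointmentId", appointmentId));
    }
}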