I am having difficulty with one of my applications exposing a gRPC service. The issue we started with was that during rolling updates we were getting failures from the service due to unavailability, which was not expected because we have multiple nodes and deploy them one by one.
We are using Consul for service discovery, and our application logic is to start the gRPC server (by calling the start method on the Server class), wait one second, and then register with Consul. The conclusion we came to is that even after we call the server's start method, it takes some time before the server is actually ready to serve RPC calls. This delay is apparently longer than our 1s wait, so we register the service with Consul before the server is actually ready, hence the errors.
What I am looking for is a way to check the server's readiness before registering it with Consul, so that the server only receives RPC calls once it is actually ready. Does anyone know of a way to do this?
Sorry, it turned out the server was ready to take the requests. The failing requests were actually due to a timeout (set on the Envoy load balancer). The DB pool (Hikari) was initially taking more time to serve requests, hence the failures.
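For context, the kind of check I imagine looks roughly like the sketch below, assuming the standard gRPC health service (HealthStatusManager from the grpc-services artifact) is registered on the server; the host, port and polling interval are placeholders:
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import io.grpc.StatusRuntimeException;
import io.grpc.health.v1.HealthCheckRequest;
import io.grpc.health.v1.HealthCheckResponse;
import io.grpc.health.v1.HealthGrpc;

// Poll the server's own health endpoint until it reports SERVING, then register with Consul.
public static void awaitServing(String host, int port) throws InterruptedException {
    ManagedChannel probe = ManagedChannelBuilder.forAddress(host, port)
            .usePlaintext()
            .build();
    HealthGrpc.HealthBlockingStub health = HealthGrpc.newBlockingStub(probe);
    try {
        while (true) {
            try {
                HealthCheckResponse resp = health.check(
                        HealthCheckRequest.newBuilder().setService("").build());
                if (resp.getStatus() == HealthCheckResponse.ServingStatus.SERVING) {
                    return; // server is actually accepting RPCs; safe to register with Consul
                }
            } catch (StatusRuntimeException e) {
                // server not reachable or not ready yet; retry
            }
            Thread.sleep(100);
        }
    } finally {
        probe.shutdown();
    }
}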
Scenario:
I have implemented a gRPC client that translates HTTP JSON requests to Protobuf in order to connect to a gRPC server.
There are two variants of this:
As a global gateway filter, like NettyRoutingFilter
As an HTTP service that follows similar code.
I put both variants, the one without gRPC and the one with gRPC, under "ludicrous load" using this config:
localhost-direct-ludicrous-load:
  target: "http://localhost:28082"
  phases:
    - name: warmup
      duration: 10
      arrivalRate: 1
    - name: load
      duration: 100
      arrivalRate: 100
      maxVusers: 750
When I run without the gRPC calls I have no issues with the load test (a 5-node Docker Swarm cluster running t3a.larges). But with gRPC, the system gets bogged down by the "refresh" token call, which is the slowest and most load-generating part of the system.
Hypothesis:
What I am thinking is that I am probably doing things inefficiently on the client-call side, even though it is supposed to use Netty.
I tried parallel, boundedElastic, newParallel and newBoundedElastic; none of them seem able to handle the load cleanly.
Though thanks to Resilience4J, the moment I take the load off it starts functioning normally again, so at least the limiters are preventing connection timeouts.
If I narrow it down, the question is: what Scheduler is used for NettyRoutingFilter in Spring Cloud Gateway? That may be all I need to set my scheduler to in order to get the performance I am expecting.
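For reference, the kind of change I am experimenting with is moving the stub call onto an explicit Reactor scheduler; the generated stub, request and response types below are placeholders for my own classes:
import reactor.core.publisher.Mono;
import reactor.core.scheduler.Schedulers;

// Offload the (blocking) stub call onto boundedElastic so it does not tie up
// the gateway's Netty event loop threads.
Mono<TokenResponse> refresh(TokenServiceGrpc.TokenServiceBlockingStub stub, TokenRequest request) {
    return Mono.fromCallable(() -> stub.refresh(request))
            .subscribeOn(Schedulers.boundedElastic());
}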
I have been attempting to get gRPC's load balancing working in my Java application deployed to a Kubernetes cluster, but I have not had much success. There does not seem to be much documentation around this, but from examples online I can see that I should now be able to use .defaultLoadBalancingPolicy("round_robin") when setting up the ManagedChannel (in later versions of the gRPC Java library).
To be more specific, I am using version 1.34.1 of the GRPC Java libraries. I have created two Spring Boot (v2.3.4) applications, one called grpc-sender and one called grpc-receiver.
grpc-sender acts as a GRPC client and defines a (Netty) ManagedChannel as:
@Bean
public ManagedChannel greetingServiceManagedChannel() {
    String host = "grpc-receiver";
    int port = 6565;
    return NettyChannelBuilder.forAddress(host, port)
            .defaultLoadBalancingPolicy("round_robin")
            .usePlaintext()
            .build();
}
Then grpc-receiver acts as the GRPC server:
Server server = ServerBuilder.forPort(6565)
.addService(new GreetingServiceImpl()).build();
I am deploying these applications to a Kubernetes cluster (running locally in minikube for the time being), and I have created a Service for the grpc-receiver application as a headless service, so that GRPC load balancing can be achieved.
To test failed requests, I do two things:
kill one of the grpc-receiver pods during a test run - e.g. when I have requested grpc-sender to send, say, 5000 requests to grpc-receiver. grpc-sender does detect that the pod has been killed, refreshes its list of receiver pods, and routes future requests to the new pods. As expected, some of the requests that were in flight when the pod was killed fail with gRPC status UNAVAILABLE.
add some simple logic in grpc-receiver that generates a random number and, if that number is below, say, 0.2, returns gRPC status INTERNAL rather than OK (a rough sketch is shown after this list).
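The failure-injection logic in grpc-receiver is roughly this (the service method and message names are illustrative):
import java.util.concurrent.ThreadLocalRandom;
import io.grpc.Status;
import io.grpc.stub.StreamObserver;

@Override
public void greeting(GreetingRequest request, StreamObserver<GreetingResponse> responseObserver) {
    // Fail roughly 20% of calls to exercise the client's retry handling.
    if (ThreadLocalRandom.current().nextDouble() < 0.2) {
        responseObserver.onError(Status.INTERNAL
                .withDescription("injected failure for retry testing")
                .asRuntimeException());
        return;
    }
    responseObserver.onNext(GreetingResponse.newBuilder().setMessage("Hello").build());
    responseObserver.onCompleted();
}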
With both of the above, I can get a proportion of the requests during a test run to fail. Now I am trying to get gRPC's retry mechanism to work. From reading the sparse documentation, I am doing the following:
return NettyChannelBuilder.forAddress(host, port)
.defaultLoadBalancingPolicy("round_robin")
.enableRetry()
.maxRetryAttempts(10)
.usePlaintext().build();
However, this seems to have no effect and I cannot see that failed requests are retried at all.
I see that this is still marked as an @ExperimentalApi feature, so has it actually been implemented, and should it work as expected?
If so, is there something obvious I am missing? Anything else I need to do to get retries working?
Any documentation that explains how to do this in more detail?
Thanks very much in advance...
ManagedChannelBuilder.enableRetry().maxRetryAttempts(10) is not sufficient to make retries happen. Retry needs a service config with a RetryPolicy defined. One way is to set a default service config with a RetryPolicy; please see the retry example at https://github.com/grpc/grpc-java/tree/v1.35.0/examples
There's been some confusion on the javadoc of maxRetryAttempts(), and it's being clarified in https://github.com/grpc/grpc-java/pull/7803
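For reference, a default service config with a RetryPolicy, following the pattern in that example, looks roughly like this when plugged into the channel above (the service name is a placeholder for your own generated service's fully qualified name):
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Numbers in the service config must be doubles; durations are strings.
Map<String, Object> retryPolicy = new HashMap<>();
retryPolicy.put("maxAttempts", 5.0);
retryPolicy.put("initialBackoff", "0.5s");
retryPolicy.put("maxBackoff", "30s");
retryPolicy.put("backoffMultiplier", 2.0);
retryPolicy.put("retryableStatusCodes", Arrays.asList("UNAVAILABLE"));

Map<String, Object> name = new HashMap<>();
name.put("service", "greeting.GreetingService"); // placeholder: your service's full name

Map<String, Object> methodConfig = new HashMap<>();
methodConfig.put("name", Collections.singletonList(name));
methodConfig.put("retryPolicy", retryPolicy);

Map<String, Object> serviceConfig = new HashMap<>();
serviceConfig.put("methodConfig", Collections.singletonList(methodConfig));

return NettyChannelBuilder.forAddress(host, port)
        .defaultLoadBalancingPolicy("round_robin")
        .defaultServiceConfig(serviceConfig)
        .enableRetry()
        .usePlaintext()
        .build();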
Thanks very much @user675693! That worked perfectly :)
The working of maxRetryAttempts() is indeed a bit confusing.
From the documentation I can see that:
"maxAttempts MUST be specified and MUST be a JSON integer value greater than 1. Values greater than 5 are treated as 5 without being considered a validation error."
This refers to the maxAttempts in the service config. If we want more than 5 attempts, I can set maxRetryAttempts(10), for example, in my ManagedChannel setup:
return NettyChannelBuilder.forAddress(host, port)
.defaultLoadBalancingPolicy("round_robin")
.defaultServiceConfig(config)
.enableRetry()
.maxRetryAttempts(10)
.usePlaintext().build();
But for that setting to take effect I need to set it to 10 in the service config AND in the ManagedChannel setup code, otherwise only 5 attempts are performed. It's not clear from the Javadoc or the documentation, but that's what seems to happen from my testing.
Also, this retry functionality is marked as @ExperimentalApi. How mature is it? Is it suitable for use in production, and is it likely to change drastically?
I have created a Spring Boot web application and deployed its WAR to a Tomcat container.
The application connects to MongoDB using async connections; I am using the mongodb-driver-async library for that.
At startup everything works fine, but as soon as the load increases it shows the following exception in DB connections:
org.springframework.web.context.request.async.AsyncRequestTimeoutException: null
at org.springframework.web.context.request.async.TimeoutDeferredResultProcessingInterceptor.handleTimeout(TimeoutDeferredResultProcessingInterceptor.java:42)
at org.springframework.web.context.request.async.DeferredResultInterceptorChain.triggerAfterTimeout(DeferredResultInterceptorChain.java:75)
at org.springframework.web.context.request.async.WebAsyncManager$5.run(WebAsyncManager.java:392)
at org.springframework.web.context.request.async.StandardServletAsyncWebRequest.onTimeout(StandardServletAsyncWebRequest.java:143)
at org.apache.catalina.core.AsyncListenerWrapper.fireOnTimeout(AsyncListenerWrapper.java:44)
at org.apache.catalina.core.AsyncContextImpl.timeout(AsyncContextImpl.java:131)
at org.apache.catalina.connector.CoyoteAdapter.asyncDispatch(CoyoteAdapter.java:157)
I am using the following versions of software:
Spring boot -> 1.5.4.RELEASE
Tomcat (installed as standalone binary) -> apache-tomcat-8.5.37
Mongo DB version: v3.4.10
mongodb-driver-async: 3.4.2
As soon as I restart the tomcat service, everything starts working fine.
Please help; what could be the root cause of this issue?
P.S.: I am using DeferredResult and CompletableFuture to create Async REST API.
I have also tried setting spring.mvc.async.request-timeout in the application and configuring asyncTimeout in Tomcat, but I still get the same error.
It's probably obvious that Spring is timing out your requests and throwing AsyncRequestTimeoutException, which returns a 503 back to your client.
Now the question is, why is this happening? There are two possibilities.
These are legitimate timeouts. You mentioned that you only see the exceptions when the load on your server increases. So possibly your server just can't handle that load and its performance has degraded to the point where some requests can't complete before Spring times them out.
The timeouts are caused by your server failing to send a response to an asynchronous request due to a programming error, leaving the request open until Spring eventually times it out. It's easy for this to happen if your server doesn't handle exceptions well. If your server is synchronous, it's okay to be a little sloppy with exception handling because unhandled exceptions will propagate up to the server framework, which will send a response back to the client. But if you fail to handle an exception in some asynchronous code, that exception will be caught elsewhere (probably in some thread pool management code), and there's no way for that code to know that there's an asynchronous request waiting on the result of the operation that threw the exception.
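To make the second case concrete for the DeferredResult/CompletableFuture combination you mention, here is a hedged sketch (findItemAsync and the Item type are illustrative):
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.context.request.async.DeferredResult;

@GetMapping("/items/{id}")
public DeferredResult<ResponseEntity<Item>> getItem(@PathVariable String id) {
    DeferredResult<ResponseEntity<Item>> result = new DeferredResult<>();
    findItemAsync(id) // returns CompletableFuture<Item>
            .whenComplete((item, ex) -> {
                if (ex != null) {
                    // Without this branch, an exception in the async pipeline leaves the
                    // DeferredResult uncompleted and Spring eventually times the request
                    // out with AsyncRequestTimeoutException.
                    result.setErrorResult(ex);
                } else {
                    result.setResult(ResponseEntity.ok(item));
                }
            });
    return result;
}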
It's hard to figure out what might be happening without knowing more about your application. But there are some things you could investigate.
First, try looking for resource exhaustion.
Is the garbage collector running all the time?
Are all CPUs pegged at 100%?
Is the OS swapping heavily?
If the database server is on a separate machine, is that machine showing signs of resource exhaustion?
How many connections are open to the database? If there is a connection pool, is it maxed out?
How many threads are running? If there are thread pools in the server, are they maxed out?
If something's at its limit then possibly it is the bottleneck that is causing your requests to time out.
Try setting spring.mvc.async.request-timeout to -1 and see what happens. Do you now get responses for every request, only slowly, or do some requests seem to hang forever? If it's the latter, that strongly suggests that there's a bug in your server that's causing it to lose track of requests and fail to send responses. (If setting spring.mvc.async.request-timeout appears to have no effect, then the next thing you should investigate is whether the mechanism you're using for setting the configuration actually works.)
A strategy that I've found useful in these cases is to generate a unique ID for each request and write the ID along with some contextual information every time the server either makes an asynchronous call or receives a response from an asynchronous call, and at various checkpoints within asynchronous handlers. If requests go missing, you can use the log information to figure out the request IDs and what the server was last doing with that request.
A similar strategy is to save each request ID into a map in which the value is an object that tracks when the request was started and what your server last did with that request. (In this case your server is updating this map at each checkpoint rather than, or in addition to, writing to the log.) You can set up a filter to generate the request IDs and maintain the map. If your filter sees the server send a 5xx response, you can log the last action for that request from the map.
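A rough sketch of such a tracking filter (the attribute name and checkpoint wording are arbitrary):
import java.io.IOException;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;

public class RequestTrackingFilter implements Filter {

    // requestId -> last checkpoint recorded for that request
    private static final ConcurrentMap<String, String> LAST_ACTION = new ConcurrentHashMap<>();

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        String requestId = UUID.randomUUID().toString();
        req.setAttribute("requestId", requestId);
        checkpoint(requestId, "request received");
        // For async requests the chain returns as soon as async processing starts,
        // so completion is recorded where the DeferredResult is completed, not here.
        chain.doFilter(req, res);
    }

    // Async handlers call this at each checkpoint; dump the map periodically to
    // see what the server was last doing with any request that never completed.
    public static void checkpoint(String requestId, String action) {
        LAST_ACTION.put(requestId, action);
    }

    public static void complete(String requestId) {
        LAST_ACTION.remove(requestId);
    }

    @Override
    public void init(FilterConfig filterConfig) { }

    @Override
    public void destroy() { }
}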
Hope this helps!
Asynchronous tasks are arranged in a queue (pool) and processed in parallel depending on the number of threads allocated. Not all asynchronous tasks are executed at the same time; some of them are queued. In such a system, getting an AsyncRequestTimeoutException is normal behaviour.
If you are filling up the queues with asynchronous tasks that are unable to execute under pressure, increasing the timeout will only delay the problem. You should focus instead on the problem itself:
Reduce the execution time (through various optimizations) of the asynchronous tasks. This will relieve pressure on the async task pool. It obviously requires coding.
Increase the number of CPUs allocated in order to run the parallel tasks more efficiently.
Increase the number of threads servicing the executor of the driver.
The Mongo async driver uses AsynchronousSocketChannel, or Netty if Netty is found on the classpath. In order to increase the number of worker threads servicing the async communication you should use:
// nThreads controls the number of worker threads servicing async I/O
EventLoopGroup eventLoopGroup = new NioEventLoopGroup(nThreads);
MongoClientSettings.builder()
        .streamFactoryFactory(new NettyStreamFactoryFactory(eventLoopGroup, PooledByteBufAllocator.DEFAULT))
        .build();
Here nThreads on the io.netty.channel.nio.NioEventLoopGroup sets the number of threads servicing your async communication.
Read more about Netty configuration here https://mongodb.github.io/mongo-java-driver/3.2/driver-async/reference/connecting/connection-settings/
I have a Java-based server managed by a Kubernetes cluster. It's a distributed environment where the number of instances is set to 4 to handle millions of requests per minute.
The issue I am facing is that Kubernetes tries to rebalance the cluster and in the process kills a pod and moves it to another node, but pending HTTP GET and POST requests get lost.
What Kubernetes feature or architectural solution would let me retry requests that get stuck or fail?
UPDATE:
I have two configurations for kubernetes service:
LoadBalancer (with AWS ELB): for external-facing traffic
ClusterIP: for the internal microservice-based architecture
Kubernetes gives you the means to gracefully handle pod terminations via SIGTERM and preStop hooks. There are several articles on this, e.g. Graceful shutdown of pods with Kubernetes. In your Java app, you should listen for SIGTERM and gracefully shut down the server (most HTTP frameworks have this shutdown functionality built in).
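As a minimal illustration of the JVM side (the two server calls are placeholders for whatever your HTTP/gRPC framework actually exposes; Spring Boot, for instance, ships its own graceful-shutdown support):
import java.time.Duration;

// Kubernetes sends SIGTERM before killing the pod; the JVM surfaces it via shutdown hooks.
// Stop accepting new requests, drain in-flight ones, and keep the pod's
// terminationGracePeriodSeconds larger than the drain timeout used here.
Runtime.getRuntime().addShutdownHook(new Thread(() -> {
    server.stopAcceptingNewRequests();                                  // placeholder: framework-specific
    server.awaitTerminationOfInFlightRequests(Duration.ofSeconds(25));  // placeholder: framework-specific
}));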
The issue that I am facing is kubernetes tries to balance the cluster and in the process kills the pod and take it to another node
Now this sounds a little suspicious - in general K8s only evicts and reschedules pods on different nodes under specific circumstances, for example when a node is running out of resources to serve the pod. If your pods are frequently getting rescheduled, this is generally a sign that something else is happening, so you should probably determine the root cause (if you have resource limits set in your deployment spec make sure your service container isn't exceeding those - this is a common problem with JVM containers).
Finally, HTTP retries are inherently unsafe for non-idempotent requests (POST/PUT), so you can't just retry any failed request without knowing the logical implications. In any case, retries generally happen on the client side, not the server, so it's not a flag you can set in K8s to enable them.
A service mesh solves the particular issue you are facing.
There are several service meshes available. General features of a service mesh are:
Load balancing
Fine-grained traffic policies
Service discovery
Service monitoring
Tracing
Routing
Some service mesh implementations:
Istio
Envoy
Linkerd
Linkerd: https://linkerd.io/2/features/retries-and-timeouts/
For a gRPC service, client-side load balancing is used.
Channel creation
ManagedChannelBuilder.forTarget("host1:port,host2:port,host3:port")
        .nameResolverFactory(new CustomNameResolverProvider())
        .loadBalancerFactory(RoundRobinLoadBalancerFactory.getInstance())
        .usePlaintext(true)
        .build();
Use this channel to create stub.
Problem
If one of the services [host1] goes down, will the stub handle this scenario and stop sending further requests to [host1]?
As per documentation at https://grpc.io/blog/loadbalancing
A thick client approach means the load balancing smarts are
implemented in the client. The client is responsible for keeping track
of available servers, their workload, and the algorithms used for
choosing servers. The client typically integrates libraries that
communicate with other infrastructures such as service discovery, name
resolution, quota management, etc.
So is it the responsibility of the ManagedChannel class to maintain the list of active servers, or does application code need to maintain that list and create a new ManagedChannel instance each time with the current list of active servers?
Test Result
As per my test, if one of the services goes down there is no impact on load balancing and all requests are processed correctly.
So can it be assumed that either the stub or the ManagedChannel class handles the active server list?
An answer with documentation would be highly appreciated.
Load Balancers generally handle nodes going down. Even when managed by an external service, nodes can crash abruptly and Load Balancers want to avoid those nodes. So all Load Balancer implementations for gRPC I'm aware of avoid failing calls when a backend is down.
Pick First (the default) iterates through the addresses until one works. Round Robin only round-robins over working connections. So what you're describing should work fine.
I will note that your approach does have one downside: you can't change the servers while the process is running. Removing broken backends is one thing, but adding new working backends is another. If your load is ever too high, you may not be able to address it by adding more workers, because even if you add more workers your clients won't connect to them.
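If the backends sit behind a resolvable DNS name (for example a Kubernetes headless service), one possible way around the static host list is to let the built-in DNS name resolver supply and refresh the addresses instead of hard-coding them; the target below is a placeholder, and defaultLoadBalancingPolicy is the newer replacement for loadBalancerFactory in recent grpc-java versions:
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

// The dns scheme re-resolves the name, so newly added backends can eventually be
// picked up without rebuilding the channel.
ManagedChannel channel = ManagedChannelBuilder
        .forTarget("dns:///my-service.default.svc.cluster.local:6565")
        .defaultLoadBalancingPolicy("round_robin")
        .usePlaintext()
        .build();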