I set up Micrometer with Resilience4j following the Resilience4j documentation, with the address x in the code. The Micrometer monitoring backend is InfluxDB. During load testing we covered every scenario except shutting down the Influx server. In that case, Micrometer raises an internal connection error that quickly fills our log file. My question is: when the monitoring server (Influx here) becomes unavailable for any reason, is there an event or hook through which the error can be handled, so that only the Micrometer publishing is dropped and the rest of the system, Resilience4j included, keeps working?
In a load test at 400 TPS, when we cut off the Micrometer Influx monitoring service, an unhandled error occurs in the metrics publishing and the log file grows very large.
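The only blunt workarounds I know of are static ones; a sketch, assuming standard Spring Boot 2.x property names:

# application.properties
# Disable the Influx registry entirely:
management.metrics.export.influx.enabled=false
# Or keep exporting but silence the connection-error spam from failed publishes:
logging.level.io.micrometer.influx=OFF

Neither of these reacts dynamically when Influx goes down, which is why I am asking about an event or hook.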
I am working on a small project using Spring Boot, Kafka, and Spark. So far I have been able to create a Kafka producer in one project and a Spark-Kafka direct stream as a consumer.
I am able to see messages pass through and things seem to be working as intended. However, I have a REST endpoint in the project that runs the consumer. Whenever I disable the direct stream, the endpoint works fine; when the stream is running, Postman reports no response, and I see nothing in the server logs indicating that a request was ever received.
The Spark consumer is started by a bean at project launch. Is this keeping the normal server on localhost:8080 from being started?
Initially I was kicking off the StreamingContext by annotating it as a @Bean. I instead made the application implement CommandLineRunner and, in the overridden run method, called the method that kicks off the StreamingContext. That allowed the embedded server (Tomcat) to start and fixed the issue.
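For reference, a minimal sketch of that arrangement; createStreamingContext() here is a placeholder for however the stream is actually built, not the code from the project:

import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.springframework.boot.CommandLineRunner;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
public class ConsumerApplication implements CommandLineRunner {

    public static void main(String[] args) {
        SpringApplication.run(ConsumerApplication.class, args);
    }

    @Override
    public void run(String... args) throws Exception {
        // CommandLineRunner executes after the context (including the embedded
        // web server) has started, so blocking here no longer stalls startup.
        JavaStreamingContext ssc = createStreamingContext();
        ssc.start();
        ssc.awaitTermination();
    }

    private JavaStreamingContext createStreamingContext() {
        // placeholder: build the Spark-Kafka direct stream here
        throw new UnsupportedOperationException("wire up the stream here");
    }
}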
Does anyone know if it is possible to reserve threads/memory for a specific endpoint in a Spring Boot microservice?
I have one microservice that accepts HTTP requests via Spring MVC, and those requests trigger HTTP calls to a third-party system, which is sometimes partially degraded and responds very slowly. I can't reduce the timeout, because some calls are slow by nature.
I have the spring-boot-actuator /health endpoint enabled and use it as the container livenessProbe in a Kubernetes cluster. Sometimes, when the third-party system is degraded, the microservice stops responding on the /health endpoint and Kubernetes restarts my service.
This is because I'm using a blocking RestTemplate to make the HTTP calls, so each in-flight request holds a thread; under load the thread count keeps growing and the JVM starts to have memory problems.
I have thought about some solutions:
Implement a high-availability /health endpoint, reserve threads for it, or something like that.
Use an async HTTP client.
Implement a circuit breaker.
Configure custom timeouts per third-party endpoint that I'm using.
Create another small service (in Go) and deploy it in the same pod; this service would serve the liveness probe.
Migrate/refactor the services into smaller ones, possibly on other frameworks/languages like Vert.x, Go, etc.
What do you think?
The actuator health endpoint is very convenient with Spring Boot, almost too convenient in this context, as it does deeper health checks than you necessarily want in a liveness probe. For readiness you want deep checks, but not for liveness. The idea is that if the Pod is overwhelmed for a bit and fails readiness, it is withdrawn from load balancing and gets a breather; but if it fails liveness, it is restarted. So you want only minimal checks in liveness (see Should Health Checks call other App Health Checks). By using actuator health for both, there is no way for your busy Pods to get a breather, as they get killed first. And Kubernetes calls the HTTP endpoint periodically when performing both probes, which contributes further to your thread usage problem (do consider the periodSeconds on the probes).
For your case you could define a liveness command instead of an HTTP probe: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/#define-a-liveness-command. The command could just check that the Java process is running (so quite similar to your Go-based probe suggestion).
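A minimal sketch of such a probe (field names per the linked docs; the pidof check is just one simple way to verify the JVM process is alive):

livenessProbe:
  exec:
    command:
    - pidof
    - java
  initialDelaySeconds: 30
  periodSeconds: 10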
For many cases using the actuator for liveness would be fine (think of apps that hit a different constraint before running out of threads, which would be your case if you went async/non-blocking with the reactive stack). Yours is one where it can cause problems; the actuator's probing of dependencies like message brokers is another situation where you can get excessive restarts (in that case on first deploy).
I have a prototype just wrapping up for this same problem: Spring Boot permits 100% of the available threads to be filled up with public network requests, leaving the /health endpoint inaccessible to the AWS load balancer, which knocks the service offline thinking it's unhealthy. There's a difference between unhealthy and busy... and health is more than just a process running, a port listening, or some superficial check; it needs to be a "deep ping" that verifies the service and all its dependencies are operable in order to give a confident health-check response.
My approach to solving the problem is to add two new auto-wired components: the first configures Jetty with a fixed, configurable maximum number of threads (make sure your JVM is allocated enough memory to match), and the second keeps a counter of each request as it starts and completes, throwing an exception that maps to an HTTP 429 TOO MANY REQUESTS response when the count approaches a ceiling of maxThreads - reserveThreads. Then I can set reserveThreads to whatever I want, and since the /health endpoint is not bound by the request counter, it is always able to get in. A sketch of the counter is below.
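This is a hedged sketch of that counter, not the actual prototype code; the class name, the hard-coded numbers, and the servlet-filter wiring are all assumptions:

import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;
import javax.servlet.FilterChain;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.springframework.stereotype.Component;
import org.springframework.web.filter.OncePerRequestFilter;

@Component
public class ReserveThreadsFilter extends OncePerRequestFilter {

    private final AtomicInteger inFlight = new AtomicInteger();
    private final int maxThreads = 200;     // must match Jetty's configured maximum
    private final int reserveThreads = 10;  // headroom kept free for /health

    @Override
    protected void doFilterInternal(HttpServletRequest request, HttpServletResponse response,
            FilterChain chain) throws ServletException, IOException {
        // The health endpoint bypasses the counter so the load balancer can always get in.
        if (request.getRequestURI().startsWith("/health")) {
            chain.doFilter(request, response);
            return;
        }
        if (inFlight.incrementAndGet() > maxThreads - reserveThreads) {
            inFlight.decrementAndGet();
            response.sendError(429, "Too many concurrent requests");
            return;
        }
        try {
            chain.doFilter(request, response);
        } finally {
            inFlight.decrementAndGet();
        }
    }
}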
I was just searching around to figure out how others are solving this problem and found your question with the same issue, so far haven't seen anything else solid.
To configure Jetty thread settings via the application properties file:
http://jdpgrailsdev.github.io/blog/2014/10/07/spring_boot_jetty_thread_pool.html
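On recent Spring Boot versions the thread cap can also be set with properties alone; the exact names depend on the Boot version, so treat this as a sketch:

# Spring Boot 2.3+ (older versions used server.jetty.max-threads)
server.jetty.threads.max=200
server.jetty.threads.min=8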
Sounds like your microservice should still respond to /health checks while returning results from that third-party service it's calling.
I'd build an async HTTP server with Vert.x-Web and try a test before modifying your working code. Create two endpoints: the /health check and a /slow call that just sleeps for, say, 5 minutes before replying with "hello". Deploy that in minikube or your cluster and see if it's able to respond to health checks while sleeping on the other HTTP request.
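A minimal sketch of that experiment (assuming Vert.x 4; the 5-minute delay uses a timer so the event loop is never blocked):

import io.vertx.core.Vertx;
import io.vertx.ext.web.Router;

public class SlowServer {
    public static void main(String[] args) {
        Vertx vertx = Vertx.vertx();
        Router router = Router.router(vertx);

        // Always answers immediately, even while /slow requests are pending.
        router.get("/health").handler(ctx -> ctx.response().end("UP"));

        // Replies with "hello" after 5 minutes without holding a thread.
        router.get("/slow").handler(ctx ->
                vertx.setTimer(5 * 60 * 1000, id -> ctx.response().end("hello")));

        vertx.createHttpServer().requestHandler(router).listen(8080);
    }
}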
We are running microservices on Spring Boot (with embedded Tomcat) and Spring Cloud. That means service discovery, regular health checks, and services responding to those health checks... We also have a Spring Boot Admin server for monitoring, and we can see that all services are running OK. Currently running only in a test environment...
Some of our microservices are called quite rarely (let's assume once per two days), although the regular health checks continue. When the REST API of such a service is called after that long an idle time, the first request takes a very long time to process. That of course opens circuit breakers in request chains and causes errors... I see this behavior also when calling various endpoints through Spring Boot Admin (Threads list, Metrics).
As a summary, I have seen this behavior in calls to Spring Boot Admin metrics, thread info, and environment info, as well as in calls where the database is accessed through a Hikari data source or where a service tried to send email through an SMTP server.
My questions are:
Is it something related to setting of embedded server and its thread pool?
Or should I dive deep into other thread pools and connection pools touched by these requests?
Any ideas for diagnostics?
Many thanks
The problem was that there was not enough RAM to cover the whole heap of those applications: a wrong setting had been applied to multiple virtual machines, and part of the heap was actually swapping. The problem disappeared once the heap and RAM sizes were fixed.
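In other words, the maximum heap has to fit comfortably inside the VM's physical memory. Purely as an illustration (the numbers are placeholders, not a recommendation):

# e.g. on a VM with 4 GB of RAM, leave headroom for the OS and off-heap memory
java -Xms2g -Xmx2g -jar service.jar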
I have a Spring Boot application auto-discovering a downstream microservice in Consul by its serviceId.
Problem:
For some reason, some previously registered services in Consul (which are no longer running) are still returned during discovery.
So if I'm lucky, load balancing works through my RestTemplate, but sometimes I get a timeout because non-running services are returned.
Questions on best practices to handle this use case:
Is it possible to log the failing host/service, and not just the timeout?
error:
a ResourceAccessException: I/O error on GET request for http://SERVICE-NAME .... connection timeout
Is it possible to log, through the RestTemplate, which node was chosen when load balancing occurred? (see the sketch below)
Does this kind of logging make sense, or is it better to let the magic happen and wait until a circuit breaker is implemented?
thanks!
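For the node-logging question, one hedged idea (assuming the Spring Cloud LoadBalancerClient that backs a @LoadBalanced RestTemplate is on the classpath): resolve the instance explicitly and log it. This is a sketch, not a drop-in answer:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.cloud.client.ServiceInstance;
import org.springframework.cloud.client.loadbalancer.LoadBalancerClient;
import org.springframework.stereotype.Component;

@Component
public class InstanceLogger {

    private static final Logger log = LoggerFactory.getLogger(InstanceLogger.class);

    @Autowired
    private LoadBalancerClient loadBalancer;

    // Logs which instance the load balancer resolves for the given service id.
    public void logChosenInstance(String serviceId) {
        ServiceInstance instance = loadBalancer.choose(serviceId);
        if (instance == null) {
            log.warn("No instances of {} available in discovery", serviceId);
        } else {
            log.info("{} resolved to {}:{}", serviceId, instance.getHost(), instance.getPort());
        }
    }
}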
App Engine creates a deferred queue task (/_ah/queue/deferred) for every hit or request on our multitenant application's site, thereby creating and running a lot of queue tasks, which gives wrong results as well as excessive queue use leading to an exceeded quota.
The application sets the namespace, i.e. NamespaceManager.set(somenamespace), before datastore operations, as part of the multitenant setup. No queue-related code was written; the default queue is created automatically by App Engine, and with every operation multiple queue tasks are generated, which is causing the issues.
Thanks in advance. Any help is appreciated.
In my current application we use GAE Cloud Endpoints to connect the Android and web modules. In this application we have separated the datastore and memcache by namespace (multitenant application). What I observed is that for every invocation from the Android or web module there is a deferred-queue entry in the log file, and this is consuming my backend instance hours. What I am failing to understand is, since I am not using the task queue and have made no configuration for it, how/who is triggering the deferred queue, and what is the possible approach to resolving this issue?
The problem is that async-session-persistence is enabled in your appengine-web.xml.
When you enable this, your application writes the HTTP session data to the datastore. This is done using a deferred task queue, and when a deferred task fails, the system keeps retrying it.
So to fix this issue, you can remove this line from your appengine-web.xml:
<async-session-persistence enabled="true"/>
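Or, since the setting defaults to disabled, you can equivalently keep the element and turn it off:

<async-session-persistence enabled="false"/>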
Or
You can debug the underlying error so the session-persistence tasks stop failing and retrying.
Hope this helps. All the best.