I have a Kafka consumer written in Java and SpringBoot.
I am using WebFlux in order to make a call to trigger some actions on a third party server (and waiting the result of course).
This server has rate limit that is limiting me from making a lot of requests in a short time.
In order to prevent failures I intend to keep on trying calling the server using WebFlux backoff:
webClientBuilder.build()
.get()
...
.retryWhen(getRetryPolicyOnTooManyRequests())
...
private RetryBackoffSpec getRetryPolicyOnTooManyRequests() {
return Retry.backoff(20, Duration.ofSeconds(retryBackoffMinimumSeconds))
.filter(this::is429Error);
}
private boolean is429Error(Throwable throwable) {
return throwable instanceof WebClientResponseException
&& ((WebClientResponseException) throwable).getStatusCode() == HttpStatus.TOO_MANY_REQUESTS;
}
My questions are about the behavior I should expect from my kafka:
What will happen when I'll be backoffing one of my calls? Will I be blocking the thread? Will a new thread be opened to process another message?
If I got the default consumer configurations (max.poll.records=500, max.poll.interval.ms=30000) and my backoff time will get to 5 minutes will the kafka group be rebalanced?
If so, is there a smarter way to tackle this issue so I won't get rebalanced
each time, other than just putting a super high number in max.poll.interval.ms
Related
I implemented retry mechanism offered by resilience4j in my machine for a project that makes http calls asynchronously. I can see that the http calls are being retried properly. However, these calls are asynchronous and so multiple HTTP calls will be made and it is possible that a number of these calls will fall into a retry at the same time. However, at a time I am able to see that only 7-9 retries are attempted. My question is why is there a cap on this ? Is it possible to configure this ?
Lets say i have a method as (this is a pseudocode).
#Async
#Retry(name = "retryA",fallbackMethod = "fallbackRetry")
public ResponseObj getExternalHttpRepsonse(String payload){
ClientResponse resp = webUtils.postRequest(payload);
boolean validatePredicate = responsePredciate.test(resp);
if(!validatePredicate){
throw new PredicateValidationFailedException();
}
return new ResponseObj(resp);
}
I am seeing an output of 7-9 attempts 1s failed, attempt 2s failed in the logs continuously whenever failures occur. Why is this capped between 7-9 and not more than that ?
Some background
We are running a fairly simple application that handles subscriptions and are running into the limits of the external service. The solution is that we are introducing a queue and throttle the consumers of this queue to optimize the throughput.
For this we are using a Quarkus (2.7.5.Final) implementation and using quarkus-smallrye-reactive-messaging-rabbitmq connector provided by quarkus.io
Simplified implementation
rabbitmq-host=localhost
rabbitmq-port=5672
rabbitmq-username=guest
rabbitmq-password=guest
mp.messaging.incoming.subscriptions-in.connector=smallrye-rabbitmq
mp.messaging.incoming.subscriptions-in.queue.name=subscriptions
#Incoming("subscriptions-in")
public CompletionStage<Void> consume(Message<JsonObject> message) {
try {
Thread.sleep(1000);
return message.ack();
} catch (Exception e) {
return message.nack(e);
}
}
The problem
This only uses one worker thread and therefore the jobs are handles 1 by 1, ideally this application picks up as many jobs as there are worker threads available (in parallel), how can I make this work?
I tried
#Incoming("subscriptions-in")
#Blocking
Didn't change anything
#Incoming("subscriptions-in")
#NonBlocking
Didn't change anything
#Incoming("subscriptions-in")
#Blocking(ordered = false)
This made it split of into different worker threads, but ?detached? the job from the queue, so none of the messages got ack'd or nack'd
#Incoming("subscriptions-in-1")
..
#Incoming("subscriptions-in-2")
..
#Incoming("subscriptions-in-3")
These different channels seem to all work on the same worker thread (which is picked on startup)
The only way I currently see is to slim down the application and run one consumer thread each and just run 50 in parallel in kubernetes. This feels wrong and I can't believe there is no way to multithread at least some of the consuming.
Question
I am hopeful that I am missing a simple solution or am missing the concept of this RabbitMQ connector.
Is there anyway to get the #Incoming consumption to run in parallel?
Or is there a way in this Java implementation to increase the prefetch count? If so I can multithread them myself
The problem appeared in logs: Consumer failed to start in 60000 milliseconds; does the task executor have enough threads to support the container concurrency?
We try to open handlers for like 50 queues dynamically by SimpleMessageListenerContainer.addQueueNames(), then application is started. It consumes some messages, but the RabbitMQ admin panel shows that they are unacked. After a significant amount of time, messages are stacking up to 6 unacked messages (queue has fairly low amount of messages per minute) per queue, which sums up to 300 messages total, something happens and they all become consumed and acked. While messages are unacked, the container seems to be trying to load another consumer until it bumps into the limit.
We rely on AUTO acknowledgment mode now, when it was MANUAL, it was fine.
There are two questions:
What can be the reason for unacked messages? Is there any flushing mechanism that doesn't trigger often?
What do I do with "not enough threads" message?
Those two seem to be really related one to another.
Here's the setup:
#Bean
fun queueMessageListenerContainer(
connectionFactory: ConnectionFactory,
retryOperationsInterceptor: RetryOperationsInterceptor,
vehicleQueueListenerFactory: QueueListenerFactory,
): SimpleMessageListenerContainer {
return SimpleMessageListenerContainer().also {
it.connectionFactory = connectionFactory
it.setConsumerTagStrategy { queueName -> consumerTag(queueName) }
it.setMessageListener(vehicleQueueListenerFactory.create())
it.setConcurrentConsumers(2)
it.setMaxConcurrentConsumers(5)
it.setListenerId("queue-consumer")
it.setAdviceChain(retryOperationsInterceptor)
it.setRecoveryInterval(RABBIT_HEARTH_BEAT.toMillis())
//had 10-100 threads, didn't help
it.setTaskExecutor(rabbitConsumersExecutorService)
// AUTO suppose to set ack for the messages, right?
it.acknowledgeMode = AcknowledgeMode.AUTO
}
}
#Bean
fun connectionFactory(rabbitProperties: RabbitProperties): AbstractConnectionFactory {
val rabbitConnectionFactory = com.rabbitmq.client.ConnectionFactory().also { connectionFactory ->
connectionFactory.isAutomaticRecoveryEnabled = true
connectionFactory.isTopologyRecoveryEnabled = true
connectionFactory.networkRecoveryInterval = RABBIT_HEARTH_BEAT.toMillis()
connectionFactory.requestedHeartbeat = RABBIT_HEARTH_BEAT.toSeconds().toInt()
// was up to 100 connections, didn't help
connectionFactory.setSharedExecutor(rabbitConnectionExecutorService)
connectionFactory.host = rabbitProperties.host
connectionFactory.port = rabbitProperties.port ?: connectionFactory.port
}
return CachingConnectionFactory(rabbitConnectionFactory)
.also {
it.cacheMode = rabbitProperties.cache.connection.mode
it.connectionCacheSize = rabbitProperties.cache.connection.size
it.setConnectionNameStrategy { "simulation-gateway:${springProfiles.firstOrNull()}:event-consumer" }
}
}
class QueueListenerFactory {
fun create(){
return MessageListener {
try {
// no ack, rely on AUTO acknowledgement mode
handle()
} catch (e: Throwable) {
...
}
}
}
}
Okay, I figured out what the problem was. Basically, it couldn't start all of the queues consumers in time, since it not only is slow process for too slow for SimpleMessageListenerContainer, but also we tried to addQueueNames one by one.
userRepository.findAll()
.map { user -> queueName(user) }
.onEach { queueName ->
simpleContainerListener.addQueueNames(queueName)
}
But the following line of documentation for SimpleMessageListenerContainer remained unnoticed:
The existing consumers will be cancelled after they have processed any pre-fetched messages and new consumers will be created
Which means what actually happened is recreation of (1, 2, ... N) consumers. What made it even worse is that if the request comes from the API, we did exactly the same simpleContainerListener.addQueueNames(queueName) after handling the request, which recreated all of consumers after that.
Also, recreation of the consumers was the reason why AUTO acknowledgement didn't work: threads were hanging trying to build enough consumers before the timeout.
I fixed this by adding DirectMessageListenerContainer to handle recently added queues, which is blazing fast, compared to SimpleMessageListenerContainer for the particular case of adding just one new consumer.
DirectMessageListenerContainer(connectionFactory).also {
it.setConsumerTagStrategy { queueName -> consumerTag(queueName, RECENT_CONSUMER_TAG) }
it.setMessageListener(ListenerFactory.create())
it.setListenerId("queue-consumer-recent")
it.setAdviceChain(retryOperationsInterceptor)
it.setTaskExecutor(recentQueuesTaskExecutor)
it.acknowledgeMode = AcknowledgeMode.AUTO
}
The downside is DirectMessageListenerContainer using 1 thread per queue on every instance. This is exactly why I didn't want to use it in the first place, but using both DirectMessageListenerContainer for recent and SimpleContainerListener for already existing queues significantly reduces amount of thread required to handle those queues. As far as I understand, an overwhelming usage of DirectMessageListenerContainer will lead to OOM eventually, so the next step can be to transfer queues from direct to simple container listener in batches.
I want to handle Flux to limit concurrent HTTP requests made by List of Mono.
When some requests are done (received responses), then service requests another until the total count of waiting requests is 15.
A single request returns a list and triggers another request depending on the result.
At this point, I want to send requests with limited concurrency.
Because consumer side, too many HTTP requests make an opposite server in trouble.
I used flatMapMany like below.
public Flux<JsonNode> syncData() {
return service1
.getData(param1)
.flatMapMany(res -> {
List<Mono<JsonNode>> totalTask = new ArrayList<>();
Map<String, Object> originData = service2.getDataFromDB(param2);
res.withArray("data").forEach(row -> {
String id = row.get("id").asText();
if (originData.containsKey(id)) {
totalTask.add(service1.updateRequest(param3));
} else {
totalTask.add(service1.deleteRequest(param4));
}
originData.remove(id);
});
for (left) {
totalTask.add(service1.createRequest(param5));
}
return Flux.merge(totalTask);
});
}
void syncData() {
syncDataService.syncData().????;
}
I tried chaining .window(15), but it doesn't work. All the requests are sent simultaneously.
How can I handle Flux for my goal?
I am afraid Project Reactor doesn't provide any implementation of either rate or time limit.
However, you can find a bunch of 3rd party libraries that provide such functionality and are compatible with Project Reactor. As far as I know, resilience4-reactor supports that and is also compatible with Spring and Spring Boot frameworks.
The RateLimiterOperator checks if a downstream subscriber/observer can acquire a permission to subscribe to an upstream Publisher. If the rate limit would be exceeded, the RateLimiterOperator could either delay requesting data from the upstream or it can emit a RequestNotPermitted error to the downstream subscriber.
RateLimiter rateLimiter = RateLimiter.ofDefaults("name");
Mono.fromCallable(backendService::doSomething)
.transformDeferred(RateLimiterOperator.of(rateLimiter))
More about RateLimiter module itself here: https://resilience4j.readme.io/docs/ratelimiter
You can use limitRate on a Flux. you need to probably reformat your code a bit but see docs here: https://projectreactor.io/docs/core/release/api/reactor/core/publisher/Flux.html#limitRate-int-
flatMap takes a concurrency parameter: https://projectreactor.io/docs/core/release/api/reactor/core/publisher/Flux.html#flatMap-java.util.function.Function-int-
Mono<User> getById(int userId) { ... }
Flux.just(1, 2, 3, 4).flatMap(client::getById, 2)
will limit the number of concurrent requests to 2.
After create multiple consumers (using Kafka 0.9 java API) and each thread started, I'm getting the following exception
Consumer has failed with exception: org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed due to group rebalance
class com.messagehub.consumer.Consumer is shutting down.
org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed due to group rebalance
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator$OffsetCommitResponseHandler.handle(ConsumerCoordinator.java:546)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator$OffsetCommitResponseHandler.handle(ConsumerCoordinator.java:487)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:681)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:654)
at org.apache.kafka.clients.consumer.internals.RequestFuture$1.onSuccess(RequestFuture.java:167)
at org.apache.kafka.clients.consumer.internals.RequestFuture.fireSuccess(RequestFuture.java:133)
at org.apache.kafka.clients.consumer.internals.RequestFuture.complete(RequestFuture.java:107)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.onComplete(ConsumerNetworkClient.java:350)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:288)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.clientPoll(ConsumerNetworkClient.java:303)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:197)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:187)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:157)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.commitOffsetsSync(ConsumerCoordinator.java:352)
at org.apache.kafka.clients.consumer.KafkaConsumer.commitSync(KafkaConsumer.java:936)
at org.apache.kafka.clients.consumer.KafkaConsumer.commitSync(KafkaConsumer.java:905)
and then start consuming message normally, I would like to know what is causing this exception in order to fix it.
Try also to tweak the following parameters:
heartbeat.interval.ms - This tells Kafka wait the specified amount of milliseconds before it consider the consumer will be considered "dead"
max.partition.fetch.bytes - This will limit the amount of messages (up to) the consumer will receive when polling.
I noticed that the rebalancing occurs if the consumer does not commit to Kafka before the heartbeat times out. If the commit occurs after the messages are processed, the amount of time to process them will determine these parameters. So, decreasing the number of messages and increasing the heartbeat time will help to avoid rebalancing.
Also consider to use more partitions, so there will be more threads processing your data, even with less messages per poll.
I wrote this small application to make tests. Hope it helps.
https://github.com/ajkret/kafka-sample
UPDATE
Kafka 0.10.x now offers a new parameter to control the number of messages received:
- max.poll.records - The maximum number of records returned in a single call to poll().
UPDATE
Kafka offers a way to pause the queue. While the queue is paused, you can process the messages in a separated Thread, allowing you to call KafkaConsumer.poll() to send heartbeats. Then call KafkaConsumer.resume() after the processing is done. This way you mitigate the problems of causing rebalances due to not sending heartbeats. Here is an outline of what you can do :
while(true) {
ConsumerRecords records = consumer.poll(Integer.MAX_VALUE);
consumer.commitSync();
consumer.pause();
for(ConsumerRecord record: records) {
Future<Boolean> future = workers.submit(() -> {
// Process
return true;
});
while (true) {
try {
if (future.get(1, TimeUnit.SECONDS) != null) {
break;
}
} catch (java.util.concurrent.TimeoutException e) {
getConsumer().poll(0);
}
}
}
consumer.resume();
}
Two possible reasons -->
If there are any network failures, consumers cannot reach out to the broker and will throw this exception. But there were no network failures when these exceptions occurred.
As mentioned in the error trace, if too much time is spent on processing the message, the ConsumerCoordinator will lose the connection and the commit will fail. This is because of polling.
The values given here are the default Kafka consumer configuration values.
request.timeout.ms=40000
heartbeat.interval.ms=3000
max.poll.interval.ms=300000
max.poll.records=500
session.timeout.ms=10000
Solution -->
Reduced the max.poll.records to 100 but still, the exception was occurring some times. So changed the configurations as below;
request.timeout.ms=300000
heartbeat.interval.ms=1000
max.poll.interval.ms=900000
max.poll.records=100
session.timeout.ms=600000
Reduced the heartbeat interval so that broker will be updated frequently that the Consumer is active. And also increased the session timeout configurations.