Kafka adminClient throws TimeoutException - java

I have a health thread that checks the state of my Kafka cluster every 5 seconds from my worker application. Every now and then, however, I get a TimeoutException:
java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.TimeoutException: Aborted due to timeout.
at org.apache.kafka.common.internals.KafkaFutureImpl.wrapAndThrow(KafkaFutureImpl.java:45)
at org.apache.kafka.common.internals.KafkaFutureImpl.access$000(KafkaFutureImpl.java:32)
at org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:89)
at org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:260)
I also have tools to externally monitor my cluster (Cruise Control, Grafana) and none of them point to any problems in the cluster. Also, my worker application is constantly consuming messages and none seem to fail.
Why do I occasionally get this timeout? If the broker is not down, then I am thinking something in my configs is off. I set the timeout to 5 seconds, which seems like more than enough.
My AdminClient configs:
@Bean
public AdminClient adminClient() {
    return KafkaAdminClient.create(adminClientConfigs());
}

public Map<String, Object> adminClientConfigs() {
    Map<String, Object> props = new HashMap<>();
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, serverAddress);
    props.put(AdminClientConfig.REQUEST_TIMEOUT_MS_CONFIG, 5000);
    return props;
}
How I check the cluster (I then run logic on the broker list):
@Autowired
private AdminClient adminClient;

private void addCluster() throws ExecutionException, InterruptedException {
    adminClient.describeCluster().nodes().get().forEach(node -> brokers.add(node.host()));
}

2 things:
The default request timeout is 30 seconds. By setting it to a smaller value you increase the risk of timeouts for slow requests. Because you query the cluster every few seconds, even if only one request out of 1000 (0.1%) takes more than 5 seconds, you'll see several failures every day.
To investigate why some calls take longer, you can do several things:
Check the Kafka client logs. describeCluster() may need to initiate a new connection to the cluster. In that case, the client will also have to send an ApiVersionsRequest and, depending on your config, may establish a TLS connection and/or perform SASL authentication. If any of these happen, it should be clear in the client logs. (You may need to bump the log level a bit to see all of this.)
Check the broker request metrics. describeCluster() translates into a MetadataRequest sent to a broker. You can track the time requests take to be processed. See the metrics described in the docs, in your case especially: kafka.network:type=RequestMetrics,name=*,request=Metadata
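If the occasional slow request is acceptable, another option is to give just this health check a longer deadline instead of lowering the client-wide request.timeout.ms. A minimal sketch, reusing the autowired adminClient from the question and an illustrative 30-second per-call timeout:

private void addCluster() throws ExecutionException, InterruptedException {
    // Per-call timeout overrides the client-wide request timeout for this request only
    DescribeClusterOptions options = new DescribeClusterOptions().timeoutMs(30_000);
    adminClient.describeCluster(options).nodes().get().forEach(node -> brokers.add(node.host()));
}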

Related

WebFlux backoff and multi-threading in a kafka consumer flow

I have a Kafka consumer written in Java and Spring Boot.
I am using WebFlux to make a call that triggers some actions on a third-party server (and waits for the result, of course).
This server has a rate limit that prevents me from making many requests in a short time.
To avoid failures, I intend to keep retrying the call using WebFlux backoff:
webClientBuilder.build()
    .get()
    ...
    .retryWhen(getRetryPolicyOnTooManyRequests())
    ...

private RetryBackoffSpec getRetryPolicyOnTooManyRequests() {
    return Retry.backoff(20, Duration.ofSeconds(retryBackoffMinimumSeconds))
        .filter(this::is429Error);
}

private boolean is429Error(Throwable throwable) {
    return throwable instanceof WebClientResponseException
        && ((WebClientResponseException) throwable).getStatusCode() == HttpStatus.TOO_MANY_REQUESTS;
}
My questions are about the behavior I should expect from my Kafka consumer:
What will happen while I am backing off on one of my calls? Will I be blocking the thread? Will a new thread be opened to process another message?
If I keep the default consumer configurations (max.poll.records=500, max.poll.interval.ms=300000) and my backoff time reaches 5 minutes, will the Kafka consumer group be rebalanced?
If so, is there a smarter way to tackle this issue so I won't get rebalanced each time, other than just putting a super high number in max.poll.interval.ms?

Unable to execute HTTP request: Acquire operation took longer than the configured maximum time

I am running a batch job in AWS which consumes messages from an SQS queue and writes them to a Kafka topic using Akka. I've created an SqsAsyncClient with the following parameters:
private static SqsAsyncClient getSqsAsyncClient(final Config configuration, final String awsRegion) {
    var asyncHttpClientBuilder = NettyNioAsyncHttpClient.builder()
        .maxConcurrency(100)
        .maxPendingConnectionAcquires(10_000)
        .connectionMaxIdleTime(Duration.ofSeconds(60))
        .connectionTimeout(Duration.ofSeconds(30))
        .connectionAcquisitionTimeout(Duration.ofSeconds(30))
        .readTimeout(Duration.ofSeconds(30));
    return SqsAsyncClient.builder()
        .region(Region.of(awsRegion))
        .httpClientBuilder(asyncHttpClientBuilder)
        .endpointOverride(URI.create("https://sqs.us-east-1.amazonaws.com/000000000000"))
        .build();
}
private static SqsSourceSettings getSqsSourceSettings(final Config configuration) {
    final SqsSourceSettings sqsSourceSettings = SqsSourceSettings.create().withCloseOnEmptyReceive(false);
    if (configuration.hasPath(ConfigPaths.SqsSource.MAX_BATCH_SIZE)) {
        sqsSourceSettings.withMaxBatchSize(10);
    }
    if (configuration.hasPath(ConfigPaths.SqsSource.MAX_BUFFER_SIZE)) {
        sqsSourceSettings.withMaxBufferSize(1000);
    }
    if (configuration.hasPath(ConfigPaths.SqsSource.WAIT_TIME_SECS)) {
        sqsSourceSettings.withWaitTime(Duration.of(20, SECONDS));
    }
    return sqsSourceSettings;
}
But, whilst running my batch job I get the following AWS SDK exception:
software.amazon.awssdk.core.exception.SdkClientException: Unable to execute HTTP request: Acquire operation took longer than the configured maximum time. This indicates that a request cannot get a connection from the pool within the specified maximum time. This can be due to high request rate.
The exception still seems to occur even after I try tweaking the parameters mentioned here:
Consider taking any of the following actions to mitigate the issue: increase max connections, increase acquire timeout, or slowing the request rate. Increasing the max connections can increase client throughput (unless the network interface is already fully utilized), but can eventually start to hit operation system limitations on the number of file descriptors used by the process. If you already are fully utilizing your network interface or cannot further increase your connection count, increasing the acquire timeout gives extra time for requests to acquire a connection before timing out. If the connections doesn't free up, the subsequent requests will still timeout. If the above mechanisms are not able to fix the issue, try smoothing out your requests so that large traffic bursts cannot overload the client, being more efficient with the number of times you need to call AWS, or by increasing the number of hosts sending requests
Has anyone run into this issue before?
I encountered the same issue, and I ended up firing 100 async batch requests, then waiting for those 100 to complete before firing the next 100, and so on.
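A minimal sketch of that chunk-and-wait throttling idea, assuming an SqsAsyncClient, a queueUrl and a list of message bodies are in scope (the names and the single-message sendMessage call are illustrative, not from the original post):

// Fire requests in chunks of 100 and wait for each chunk to finish before starting the next,
// so the Netty connection pool is never asked for more than ~100 connections at once.
int chunkSize = 100;
for (int start = 0; start < messages.size(); start += chunkSize) {
    List<String> chunk = messages.subList(start, Math.min(start + chunkSize, messages.size()));
    List<CompletableFuture<SendMessageResponse>> inFlight = chunk.stream()
        .map(body -> sqsAsyncClient.sendMessage(b -> b.queueUrl(queueUrl).messageBody(body)))
        .collect(Collectors.toList());
    // Block until every request in this chunk has completed before firing the next chunk
    CompletableFuture.allOf(inFlight.toArray(new CompletableFuture[0])).join();
}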

Azure eventhub Kafka org.apache.kafka.common.errors.TimeoutException for some of the records

I have an ArrayList containing 80 to 100 records and I am trying to stream and send each individual record (POJO, not the entire list) to a Kafka topic (Event Hub). A cron job is scheduled to run every hour to send these records (POJOs) to the Event Hub.
I am able to see messages being sent to the Event Hub, but after 3 to 4 successful runs I get the following exception (several messages are sent and several fail with the exception below):
Expiring 14 record(s) for eventhubname: 30125 ms has passed since batch creation plus linger time
Following is the producer config used:
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
props.put(ProducerConfig.ACKS_CONFIG, "1");
props.put(ProducerConfig.RETRIES_CONFIG, "3");
Message retention period - 7
Partitions - 6
Using Spring Kafka (2.2.3) to send the events.
The method where the Kafka send happens is marked @Async:
@Async
protected void send() {
    kafkatemplate.send(record);
}
Expected - no exception to be thrown from Kafka
Actual - org.apache.kafka.common.errors.TimeoutException is thrown
Prakash - we have seen a number of issues where spiky producer patterns see batch timeouts.
The problem here is that the producer has two TCP connections that can go idle for > 4 mins - at that point, Azure load balancers close out the idle connections. The Kafka client is unaware that the connections have been closed, so it attempts to send a batch on a dead connection, which times out, at which point the retry kicks in.
Set connections.max.idle.ms to < 4 mins – this allows the Kafka client's network layer to gracefully handle connection close for the producer's message-sending TCP connection.
Set metadata.max.age.ms to < 4 mins – this is effectively a keep-alive for the producer metadata TCP connection.
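A minimal sketch of those two settings added to the producer properties shown above (the 3-minute value is just an example comfortably below the 4-minute idle cutoff):

// Keep both producer TCP connections from idling past the Azure load balancer's ~4 minute cutoff
props.put(ProducerConfig.CONNECTIONS_MAX_IDLE_MS_CONFIG, "180000"); // 3 minutes
props.put(ProducerConfig.METADATA_MAX_AGE_CONFIG, "180000");        // 3 minutes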
Feel free to reach out to the EH product team on Github, we are fairly good about responding to issues - https://github.com/Azure/azure-event-hubs-for-kafka
This exception indicates you are queueing records faster than they can be sent. Once a record is added to a batch, there is a time limit for sending that batch, to ensure it is sent within a specified duration. This is controlled by the producer configuration parameter request.timeout.ms. If the batch has been queued longer than the timeout limit, the exception is thrown and the records in that batch are removed from the send queue.
Please check the following for a similar issue; it might help:
Kafka producer TimeoutException: Expiring 1 record(s)
You can also check this link, when-does-the-apache-kafka-client-throw-a-batch-expired-exception/34794261#34794261, for more details about the batch expired exception.
Also implement a proper retry policy.
Note this does not account for any network issues on the sender side; with network issues you will not be able to send to the Event Hub at all.
Hope it helps.
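If batches legitimately need more time before expiring, one option (a sketch with an illustrative value, not a recommendation from the original answer) is to raise the producer's request timeout alongside the existing retries setting:

// Allow queued batches more time before they are expired (the default is 30000 ms)
props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, "60000");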

Kafka CommitFailedException consumer exception

After creating multiple consumers (using the Kafka 0.9 Java API) and starting each thread, I'm getting the following exception:
Consumer has failed with exception: org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed due to group rebalance
class com.messagehub.consumer.Consumer is shutting down.
org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed due to group rebalance
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator$OffsetCommitResponseHandler.handle(ConsumerCoordinator.java:546)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator$OffsetCommitResponseHandler.handle(ConsumerCoordinator.java:487)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:681)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:654)
at org.apache.kafka.clients.consumer.internals.RequestFuture$1.onSuccess(RequestFuture.java:167)
at org.apache.kafka.clients.consumer.internals.RequestFuture.fireSuccess(RequestFuture.java:133)
at org.apache.kafka.clients.consumer.internals.RequestFuture.complete(RequestFuture.java:107)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.onComplete(ConsumerNetworkClient.java:350)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:288)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.clientPoll(ConsumerNetworkClient.java:303)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:197)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:187)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:157)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.commitOffsetsSync(ConsumerCoordinator.java:352)
at org.apache.kafka.clients.consumer.KafkaConsumer.commitSync(KafkaConsumer.java:936)
at org.apache.kafka.clients.consumer.KafkaConsumer.commitSync(KafkaConsumer.java:905)
and then the consumer starts consuming messages normally. I would like to know what is causing this exception in order to fix it.
Also try tweaking the following parameters:
heartbeat.interval.ms - how often the consumer sends heartbeats to the group coordinator; if heartbeats stop arriving within the session timeout, the consumer is considered "dead" and a rebalance is triggered.
max.partition.fetch.bytes - limits the amount of data (and therefore, roughly, the number of messages) the consumer receives per partition when polling.
I noticed that rebalancing occurs if the consumer does not commit to Kafka before the session times out. If the commit happens after the messages are processed, the time needed to process them determines how these parameters should be set. So decreasing the number of messages per poll and increasing the session timeout will help to avoid rebalancing.
Also consider using more partitions, so there will be more threads processing your data, even with fewer messages per poll.
I wrote this small application to make tests. Hope it helps.
https://github.com/ajkret/kafka-sample
UPDATE
Kafka 0.10.x now offers a new parameter to control the number of messages received:
- max.poll.records - The maximum number of records returned in a single call to poll().
UPDATE
Kafka offers a way to pause the queue. While the queue is paused, you can process the messages in a separate thread while still calling KafkaConsumer.poll() to send heartbeats. Then call KafkaConsumer.resume() after the processing is done. This way you mitigate the problem of rebalances caused by not sending heartbeats. Here is an outline of what you can do:
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Integer.MAX_VALUE);
    consumer.commitSync();
    // Pause all assigned partitions so the poll() calls below return no records
    consumer.pause(consumer.assignment());
    for (ConsumerRecord<String, String> record : records) {
        // 'workers' is an ExecutorService doing the actual processing
        Future<Boolean> future = workers.submit(() -> {
            // Process
            return true;
        });
        while (true) {
            try {
                if (future.get(1, TimeUnit.SECONDS) != null) {
                    break;
                }
            } catch (java.util.concurrent.TimeoutException e) {
                // Keep polling while paused so heartbeats are sent and no rebalance is triggered
                consumer.poll(0);
            } catch (InterruptedException | ExecutionException e) {
                // Handle processing failure/interruption as appropriate
                break;
            }
        }
    }
    consumer.resume(consumer.assignment());
}
Two possible reasons -->
If there are any network failures, consumers cannot reach the broker and will throw this exception. But there were no network failures when these exceptions occurred.
As mentioned in the error trace, if too much time is spent processing the messages, the ConsumerCoordinator considers the consumer gone and the commit fails. This happens because the consumer is not calling poll() often enough.
The values given here are the default Kafka consumer configuration values.
request.timeout.ms=40000
heartbeat.interval.ms=3000
max.poll.interval.ms=300000
max.poll.records=500
session.timeout.ms=10000
Solution -->
I reduced max.poll.records to 100, but the exception still occurred sometimes. So I changed the configurations as below:
request.timeout.ms=300000
heartbeat.interval.ms=1000
max.poll.interval.ms=900000
max.poll.records=100
session.timeout.ms=600000
I reduced the heartbeat interval so that the broker is updated more frequently that the consumer is active, and I also increased the session timeout.
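A minimal sketch of those values as Java consumer properties (they mirror the list above; treat them as a starting point rather than universal recommendations):

Properties props = new Properties();
// Give slow processing more headroom before the group coordinator triggers a rebalance
props.put(ConsumerConfig.REQUEST_TIMEOUT_MS_CONFIG, "300000");
props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, "1000");
props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "900000");
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "100");
// Note: a session timeout this large also requires raising group.max.session.timeout.ms on the broker
props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "600000");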

How do I stop a Camel route when JVM gets to a certain heap size?

I am using Apache Camel to connect to various endpoints, including JMS topics, and write to a database. Sometimes the database connection fails (for whatever reason: a database issue, a network blip, etc.) and the messages from the topic subscriber start backing up. At a certain point, there are so many messages backed up waiting to be written to the database that the application throws an out-of-memory error. So far I understand all that.
The problem I have is the following: when the application is frantically trying to garbage collect before eventually giving up and accepting that it is out of memory, the application stops working but is still alive. This means that the topic subscriber is still seen as active by the JMS provider, but it is not reading anything off the topic, so the provider starts queueing up the messages. Eventually the provider also falls over when the maximum queue depth is exhausted.
How can I configure my application to either disconnect when reaching a certain heap usage, or kill itself completely much faster when running out of memory? I believe there are some JVM parameters that allow the application to kill itself much quicker when running out of memory, but I am wondering whether that is the best solution or whether there is another way.
First of all, I think you should use a JDBC connection pool that is capable of refreshing failed connections, so that you do not run into the described scenario in the first place, at least not if the DB/network issue is short-lived.
Next, I'd protect the message broker by applying producer flow control (at least that's what it is called in ActiveMQ), i.e. prevent message producers from submitting more messages once a certain memory threshold has been breached. If the thresholds are set correctly, that will prevent your message broker from falling over.
As for your original question: I'd use JMX to monitor the VM. If some metric, e.g. memory, breaches a threshold, then you can suspend or shut down the route or the whole Camel context via the MBeans Camel exposes.
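A minimal sketch of that check, assuming a camelContext reference and a hypothetical route id; it reads heap usage through the standard MemoryMXBean rather than a remote JMX connection:

MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
MemoryUsage heap = memory.getHeapMemoryUsage();
// Stop the route once heap usage crosses, say, 80% of the configured maximum
if (heap.getUsed() > heap.getMax() * 0.8) {
    try {
        camelContext.stopRoute("jmsToDatabaseRoute"); // hypothetical route id
    } catch (Exception e) {
        // log and decide whether to retry or shut the whole context down
    }
}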
You can control Camel routes (start/stop and suspend/resume) through the CamelContext, for example with the stopRoute(), startRoute(), suspendRoute() and resumeRoute() methods.
You can spin up a separate thread that monitors the current VM memory and stops the relevant route when a certain condition is met.
// Assumes 'threshold' (in bytes) and 'camelContext' are available in the enclosing scope
new Thread() {
    @Override
    public void run() {
        while (true) {
            try {
                long free = Runtime.getRuntime().freeMemory();
                ServiceStatus status = camelContext.getRouteStatus("yourRoute");
                boolean routeRunning = status != null && status.isStarted();
                if (free < threshold && routeRunning) {
                    camelContext.stopRoute("yourRoute");
                } else if (free > threshold && !routeRunning) {
                    camelContext.startRoute("yourRoute");
                }
                // Check every 10 seconds
                Thread.sleep(10000);
            } catch (Exception e) {
                // stopRoute()/startRoute()/sleep() can throw; log and keep monitoring
            }
        }
    }
}.start();
As commented in the other answer, relying on this is not particularly robust, but it is at least a little more robust than getting an OutOfMemoryError. Note that you need to .stop() the route; .suspend() does not deallocate resources, which means the connection with the queue provider stays open and the service looks like it is open for business.
You can also stop the route as part of the error handling of the route itself. This is possibly more robust, but it requires manual intervention to restart the route once the error is cleared, or a scheduled route that periodically checks whether the error condition still exists and restarts the route if it is gone. The thing to keep in mind is that you cannot stop a route from the same thread that is servicing the route at the time, so you need to spin up a separate thread that does the stopping. For example:
route("sample").from("jms://myqueue")
// Handle SQL Exceptions by shutting down the route
.onException(SQLException.class)
.process(new Processor() {
// This processor spawns a new thread that stops the current route
Thread stop;
#Override
public void process(final Exchange exchange) throws Exception {
if (stop == null) {
stop = new Thread() {
#Override
public void run() {
try {
// Stop the current route
exchange.getContext().stopRoute("sample");
} catch (Exception e) {}
}
};
}
// start the thread in background
stop.start();
}
})
.end()
// Standard route processors go here
.to(...);
