Kafka old consumer rebalance issue - java

In our system we're using an older version of Kafka (0.9.0.1) and the old Scala consumer API in a Tomcat application.
Everything works fine most of the time. However, when the servers running the consumers are heavily loaded by other tasks in the app, the consumers become unresponsive. As expected, this triggers a rebalance: the unresponsive consumer is removed from its partitions and other consumers take them over.
My question is whether there is an easy way for the consumer to re-register itself when it comes back up.
I know that the old consumers store the partition/consumer details in ZooKeeper, and I was thinking we could have a task that periodically checks whether our consumer is registered there and restarts the consumer if not, but I'm not sure what exactly we should check there. Can anyone point me to some documentation about the data stored in ZooKeeper by Kafka (haven't found anything in the official documentation sadly :( )?

Basically, what you want is fixed assignments, so that the consumer group never rebalances. Disabling automatic consumer rebalancing in the old Scala client, or raising the rebalance timeout to a much higher value, could also work, but I couldn't find a way to do that with the old Scala consumer.
However, it is possible to assign fixed topic/partitions when using the newer Java consumer, which is also available in that same 0.9 Kafka version. Look for Subscribing To Specific Partitions in the Javadocs:
https://kafka.apache.org/090/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html
Subscribing To Specific Partitions
In the previous examples we subscribed to the topics we were interested in and
let Kafka give our particular process a fair share of the partitions for those topics.
This provides a simple load balancing mechanism so multiple instances of our program
can divide up the work of processing records.
In this mode the consumer will just get the partitions it subscribes to
and if the consumer instance fails no attempt will be made to
rebalance partitions to other instances.
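With manual assignment Kafka no longer distributes partitions, so each application instance has to decide for itself which partitions it owns. The following is only a minimal sketch of one possible scheme (round-robin by instance index). The class name, the method name and the idea of giving each instance an index are assumptions for illustration, not something from the Kafka API. The resulting partition numbers would then be wrapped in TopicPartition objects and passed to KafkaConsumer.assign(...).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: statically decide which partitions one instance owns. */
class FixedPartitionPlan {

    /**
     * Returns the partition numbers owned by the given instance when the
     * partitions of a topic are spread round-robin over all instances.
     */
    static List<Integer> partitionsFor(int instanceIndex, int instanceCount, int partitionCount) {
        if (instanceCount <= 0 || instanceIndex < 0 || instanceIndex >= instanceCount) {
            throw new IllegalArgumentException("invalid instance index or count");
        }
        List<Integer> owned = new ArrayList<>();
        // Instance i takes partitions i, i + n, i + 2n, ... where n is the instance count.
        for (int p = instanceIndex; p < partitionCount; p += instanceCount) {
            owned.add(p);
        }
        return owned;
    }

    public static void main(String[] args) {
        // Example: three instances sharing a six-partition topic.
        for (int i = 0; i < 3; i++) {
            System.out.println("instance " + i + " owns " + partitionsFor(i, 3, 6));
        }
    }
}
```

The trade-off is the one the Javadoc describes: if an instance dies, nobody takes over its partitions until it is restarted.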

Related

Kafka consumer group not rebalancing when increasing partitions

I have a situation where in my dev environment, my Kafka consumer groups will rebalance and distribute partitions to consumer instances just fine after increasing the partition count of a subscribed topic.
However, when we deploy our product into its kubernetes environment, we aren't seeing the consumer groups rebalance after increasing the partition count of the topic. Kafka recognized the increase which can be seen from the server logs or describing the topic from the command line. However, the consumer groups won't rebalance and recognize the new partitions. From my local testing, kafka respects metadata.max.age.ms (default 5 mins). But in kubernetes the group never rebalances.
I don't know if it affects anything but we're using static membership.
The consumers are written in Java and use the standard Kafka Java library. No messages are flowing through Kafka, and adding messages doesn't help. I don't see anything special in the server or consumer configurations that differs from my dev environment. Is anyone aware of any configurations that may affect this behavior?
** Update **
The issue was only occurring for a new topic. At first, the consumer application was starting before the producer application (which is intended to create the topic), so the consumer was auto-creating the topic. In this scenario, the topic defaulted to 1 partition. When the producer application started, it updated the partition count per its configuration. After that update, we never saw a rebalance.
Next we tried disabling consumer auto topic creation to address this. This prevented the consumer application from auto-creating the topic on subscription. Yet even after the topic was created by the producer app, the consumer group was never rebalanced, so the consumer application would sit idle.
According to the documentation I've found, and testing in my dev environment, both of these situations should trigger a rebalance. For whatever reason we don't see that happen in our deployments. My temporary workaround was just to ensure that the topic is already created before allowing my consumers to subscribe. I don't like it, but it works for now. I suspect that the different behavior I'm seeing is due to my dev environment running a single Kafka broker vs. the Kubernetes deployments with a cluster, but that's just a guess.
By default, Kafka clients refresh topic metadata only every 5 minutes (metadata.max.age.ms), so they will not detect partition changes immediately, as you've noticed. The deployment method of your app shouldn't matter, as long as network requests are properly reaching the broker.
Also, check your partition assignment strategy to see if it's using sticky assignment. This will depend on which version of the client you're using, as the defaults changed around 2.7, I think.
"No messages are flowing through Kafka"
If there's no data on the new partitions, there's no real need to rebalance to consume from them.
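To experiment with the two settings mentioned in this answer, one option is to pin them explicitly instead of relying on client defaults. This is only a sketch: the class and method names are made up for illustration, the 30-second value is an arbitrary example, and RangeAssignor is just one of the assignors shipped with the Java client.

```java
import java.util.Properties;

/** Hypothetical helper: consumer properties with an explicit metadata refresh and assignor. */
class MetadataRefreshConfig {

    static Properties consumerProps(String bootstrapServers, String groupId, long metadataMaxAgeMs) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        props.setProperty("group.id", groupId);
        // How often the client forces a metadata refresh (client default: 5 minutes).
        props.setProperty("metadata.max.age.ms", Long.toString(metadataMaxAgeMs));
        // Pin the assignor so behavior does not depend on the client version's default.
        props.setProperty("partition.assignment.strategy",
                "org.apache.kafka.clients.consumer.RangeAssignor");
        return props;
    }

    public static void main(String[] args) {
        // Example values only; broker address and group name are placeholders.
        System.out.println(consumerProps("localhost:9092", "demo-group", 30_000L));
    }
}
```

A shorter refresh interval means new partitions are noticed sooner, at the cost of more metadata requests to the brokers.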

Kafka Spring Boot: receive only messages sent after the consumer application starts and ignore unprocessed messages

Currently, when the consumer application starts, it receives old messages that have not yet been processed by the KafkaListener. I only want to receive the latest messages, those arriving after the consumer application starts, and ignore the old ones. How can I do that?
This is a pretty brief introduction to your issue; it would be handy to show the versions of the libraries you are working with, your configuration, etc.
Nevertheless, if you do not want to receive old messages that have not been acknowledged before, you need to move the offset for your consumer group.
The offset is basically a pointer to the last successfully read item, so when the consumer is stopped, the offset stays there until the consumer starts reading again. That is why "old" messages are read. There are some answers in this thread, but it is difficult to answer completely without further information.
Set the consumer settings auto.offset.reset=latest and enable.auto.commit=false; then your app will always start reading from the very end of the Kafka topic and ignore whatever was written while the app was stopped (between restarts, for example). Note that auto.offset.reset only applies when the consumer group has no committed offset.
You could also add a random UUID to the group.id to ensure no other consumer would easily join that consumer group and "take away" events from your app.
The Kafka Consumer API also has a seekToEnd method that you can try.
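Put together, the settings suggested above might look like the sketch below. The class name, method name and group id prefix are hypothetical. In a Spring Boot application the same values would normally go into the spring.kafka.consumer.* properties rather than a hand-built Properties object.

```java
import java.util.Properties;
import java.util.UUID;

/** Hypothetical helper: consumer properties for "only messages from now on". */
class LatestOnlyConsumerConfig {

    static Properties latestOnlyProps(String bootstrapServers, String groupIdPrefix) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        // A fresh group id on every start means there is never a committed offset to resume from.
        props.setProperty("group.id", groupIdPrefix + "-" + UUID.randomUUID());
        // With no committed offset, start at the end of each partition.
        props.setProperty("auto.offset.reset", "latest");
        // Do not let the client commit offsets automatically.
        props.setProperty("enable.auto.commit", "false");
        return props;
    }

    public static void main(String[] args) {
        // Placeholder broker address and prefix.
        System.out.println(latestOnlyProps("localhost:9092", "my-app"));
    }
}
```

The random group id matters because a group that already has committed offsets resumes from them regardless of auto.offset.reset.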

RabbitMQ Delivery Acknowledgement Timeout

I am using a managed RabbitMQ cluster through AWS Amazon MQ. If the consumers finish their work quickly, everything works fine. However, in a few scenarios some consumers take more than 30 minutes to complete their processing.
In those scenarios, RabbitMQ deletes the consumer and makes the same messages visible in the queue again. Because of this, another consumer picks them up and starts processing. This happens in a loop, so the same transaction gets executed again and I am losing the consumer as well.
I am not using any AcknowledgeMode so I believe it's AUTO by default and it has 30 mins limit.
Is there any way to increase the Delivery Acknowledgement Timeout for AUTO mode?
Or please let me know if anyone has any other solutions for this.
Reply From AWS Support:
The consumer timeout is now configurable, but the change can be made only by the service team. The change will be permanent irrespective of any version.
So you may update RabbitMQ to the latest version; there is no need to stick with 3.8.11. Provide your broker details and the desired timeout, and they should be able to do it for you.
This is the response from AWS support.
From my understanding, I see that your workload is currently affected by the consumer_timeout parameter that was introduced in v3.8.15.
We have had a number of customers reach out about this. Unfortunately, the service team has confirmed that while they can manually edit rabbitmq.conf, the change will be overwritten on the next reboot or failover, so this is not a recommended solution. It would also mean that all security patching on brokers where a manual change is applied would have to be paused. Currently, the service does not support custom user configurations for RabbitMQ from this configuration file. The team has confirmed they are looking to address this in the future, but they are not able to provide an ETA on when this will be available.
From the RabbitMQ github, it seems this was added for quorum queues in v3.8.15 (https://github.com/rabbitmq/rabbitmq-server/releases/tag/v3.8.15 ), but seems to apply to all consumers (https://github.com/rabbitmq/rabbitmq-server/pull/2990 ).
Unfortunately, RabbitMQ itself does not support downgrades (https://www.rabbitmq.com/upgrade.html )
Thus the recommended workaround and safest action from the service team, as of now, is to create a new broker on an older version (3.8.11) and set auto minor version upgrade to false, so that it won't be upgraded.
Then export the configuration from the existing RabbitMQ instance, import it into the new instance, and use this instance going forward.

Access native KafkaConsumer in Camel RoutePolicy to change polling behaviour

I "monitor" the number of consecutive failures in my Camel processing pipeline with a Camel RoutePolicy.
When a threshold of failures is reached, I want to pause the processing for a configured amount of time because it probably means that the data from another system is not yet ready and therefore every message fails.
Since the source of my pipeline is a Kafka topic, I should not just stop the whole route because the broker would assume my consumer died and rebalance.
The best way to "pause" topic consumption seems to be to pause the KafkaConsumer (the native one, not Camel's). This way, the consumer continues to poll the broker, but it does not fetch any messages. Exactly what I need.
But can I access the native KafkaConsumer from the RoutePolicy context to call the pause and resume methods?
The spring-kafka listener containers expose these methods, it would be nice to use them from Camel too.
This is not yet supported; the two methods must be added to the camel-kafka consumer first.
There is also an existing issue for it: https://issues.apache.org/jira/browse/CAMEL-15106

Kafka-streams delay to kick rebalancing on consumer graceful shutdown

This is a follow-up on a previous question I posted regarding high latency in our Kafka Streams application (Kafka Streams rebalancing latency spikes on high throughput kafka-streams services).
As a quick reminder, our stateless service has very tight latency requirements and we are facing excessive latency (some messages are consumed more than 10 secs after being produced), especially when a consumer gracefully leaves the group.
After further investigation we found that, at least for small consumer groups, the rebalance itself takes less than 500ms. So we wondered: where does this huge latency (>10s) when removing one consumer come from?
We realized that it is the time between the consumer exiting gracefully and the rebalance kicking in.
The previous tests were executed with all-default configurations in both Kafka and the Kafka Streams application.
We changed the configurations to:
properties.put("max.poll.records", 50); // defaults to 1000 in Kafka Streams
properties.put("auto.offset.reset", "latest"); // consumer default is latest; Kafka Streams overrides it to earliest
properties.put("heartbeat.interval.ms", 1000);
properties.put("session.timeout.ms", 6000);
properties.put("group.initial.rebalance.delay.ms", 0); // note: this is a broker-side setting
properties.put("max.poll.interval.ms", 6000);
And the result is that the time for the rebalance to start dropped to a bit more than 5 secs.
We also tested killing a consumer non-gracefully with 'kill -9'; the time to trigger the rebalance is exactly the same.
So we have some questions:
- We expected the rebalance to be triggered right away when the consumer stops gracefully. Is that the expected behavior? Why isn't it happening in our tests?
- How can we reduce the time between a consumer gracefully exiting and the rebalance being triggered? What are the tradeoffs? More unneeded rebalances?
For more context, our Kafka version is 1.1.0 (looking at the libs we found, for example, kafka/kafka_2.11-1.1.0-cp1.jar; we installed Confluent Platform 4.1.0). On the consumer side, we are using Kafka Streams 2.1.0.
Thank you!
Kafka Streams does not send a "leave group request" when an instance is shut down gracefully -- this is on purpose. The goal is to avoid expensive rebalances if an instance is bounced (e.g., if one upgrades an application, or if one runs in a Kubernetes environment and a pod is quickly restarted automatically).
To achieve this, a non-public configuration is used. You can override the config via
props.put("internal.leave.group.on.close", true); // Streams' default is `false`
