Kafka consumer group not rebalancing when increasing partitions - java

I have a situation where in my dev environment, my Kafka consumer groups will rebalance and distribute partitions to consumer instances just fine after increasing the partition count of a subscribed topic.
However, when we deploy our product into its Kubernetes environment, we aren't seeing the consumer groups rebalance after increasing the partition count of the topic. Kafka recognizes the increase, which can be seen in the server logs or by describing the topic from the command line. However, the consumer groups won't rebalance and recognize the new partitions. From my local testing, Kafka respects metadata.max.age.ms (default 5 mins), but in Kubernetes the group never rebalances.
I don't know if it affects anything, but we're using static membership.
The consumers are written in Java and use the standard Kafka Java library. No messages are flowing through Kafka, and adding messages doesn't help. I don't see anything special in the server or consumer configurations that differs from my dev environment. Is anyone aware of any configurations that may affect this behavior?
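For reference, here is a minimal sketch of how the consumers are configured (broker address, group and instance names are placeholders):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

Properties props = new Properties();
props.put("bootstrap.servers", "kafka:9092");
props.put("group.id", "my-group");
props.put("group.instance.id", "consumer-0");   // static membership: stable id per instance
props.put("metadata.max.age.ms", "300000");     // topic metadata refresh interval, default 5 min
props.put("key.deserializer", StringDeserializer.class.getName());
props.put("value.deserializer", StringDeserializer.class.getName());
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Collections.singletonList("my-topic"));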
**Update**
The issue was only occurring for a new topic. At first, the consumer application was starting before the producer application (which is intended to create the topic), so the consumer was auto-creating the topic. In this scenario, the topic defaulted to 1 partition. When the producer application started, it updated the partition count per configuration. After that update, we never saw a rebalance.
Next we tried disabling consumer auto topic creation to address this. This prevented the consumer application from auto-creating the topic on subscription, yet even after the topic was created by the producer app, the consumer group never rebalanced, so the consumer application sat idle.
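For reference, this is the client-side setting we used to disable the consumer's auto topic creation (the broker-side auto.create.topics.enable is a separate switch):

props.put("allow.auto.create.topics", "false"); // consumer no longer auto-creates topics on subscribe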
According to the documentation I've found, and testing in my dev environment, both of these situations should trigger a rebalance. For whatever reason we don't see that happen in our deployments. My temporary workaround was to ensure that the topic is already created prior to allowing my consumers to subscribe. I don't like it, but it works for now. I suspect that the different behavior I'm seeing is due to my dev environment running a single Kafka broker vs. the Kubernetes deployments running a cluster, but that's just a guess.
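For completeness, a sketch of that workaround, assuming an AdminClient is available to the consumer application and the topic name is known up front (names are placeholders):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;

static void waitForTopic(Properties adminProps, String topic) throws Exception {
    try (AdminClient admin = AdminClient.create(adminProps)) {
        // Block until the producer application has created the topic with its
        // final partition count, so the first subscribe() already sees it.
        while (!admin.listTopics().names().get().contains(topic)) {
            Thread.sleep(1000L);
        }
    }
}

// before starting the consumer:
waitForTopic(adminProps, "my-topic");
consumer.subscribe(Collections.singletonList("my-topic"));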

Kafka clients refresh topic metadata only every 5 minutes by default (metadata.max.age.ms), so they will not detect partition changes immediately, as you've noticed. The deployment method of your app shouldn't matter, as long as network requests are properly reaching the brokers.
Also, check your partition assignment strategy to see if it's using sticky assignment. This will depend on what version of the client you're using, as the defaults changed around 2.7, I think.
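If you want new partitions to be picked up faster, here is a sketch of how the two settings mentioned above could be set explicitly (the values are just examples; CooperativeStickyAssignor requires a client that supports cooperative rebalancing, roughly 2.4+):

import org.apache.kafka.clients.consumer.CooperativeStickyAssignor;

props.put("metadata.max.age.ms", "30000");  // refresh topic metadata every 30 s instead of the default 5 min
props.put("partition.assignment.strategy", CooperativeStickyAssignor.class.getName());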
No messages are flowing through Kafka
If there's no data on the new partitions, there's no real need to rebalance to consume from them

Related

Kafka Consumer Reading Previous Records

I am facing a problem with my Kafka consumer. I have two Kafka brokers running with replication factor 2 for the topic. Every time a broker restarts and I restart my consumer service, it starts to read records which it has already read. E.g., before I restarted the consumer this was the state,
and the consumer was sitting idle, not receiving any messages, as it had read all of them.
I restarted my consumer, and all of a sudden it started receiving messages which it had processed previously; here is the offset situation now.
Also, what are LOG-END-OFFSET and LAG? It looks like these are something to consider here.
Note that it only happens when one of the brokers gets restarted due to Kubernetes shifting it to another node.
This is the topic configuration:
Based on the info you posted, a couple of things that immediately come to mind:
The first screenshot shows a lag of 182, which means the consumer either was not running, or it has some weird configuration that made it stop consuming. Was it possible one of the brokers was down when the consumer stopped consuming?
On restart, the consumer finally consumed all the remaining messages, because it now shows lag of 0. This is correct, expected Kafka behavior.
Make sure that the consumer group name is not changing between restarts. Some clients default to "randomized" consumer group names, which works as long as the consumer is not restarted.
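As a quick check, a sketch of the consumer settings that control both behaviours (the group name is a placeholder; auto.offset.reset only applies when the group has no committed offset):

props.put("group.id", "orders-consumer");   // fixed name, never randomized between restarts
props.put("enable.auto.commit", "true");    // or commit manually after records are processed
props.put("auto.offset.reset", "latest");   // only used when the group has no committed offset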

How to set up Kafka across multiple DCs

Scenario: Currently we have a primary cluster, and we have a producer and consumer which are working as expected. We have to implement a secondary Kafka DR cluster in another data center. I have a couple of ideas, but I am not sure how to proceed.
Question: How can we automate the producer switch-over from the primary cluster to the secondary cluster if the primary cluster/broker goes down?
Any sample code will be helpful.
You can use a load balancer in front of your producer. The load balancer can switch to the secondary cluster if the brokers in the primary aren't available.
You can also implement the failover without a load balancer.
In that case you have to implement exception handling in your code that triggers the reconnect yourself, so the producer is still able to ingest.
Consumers can subscribe to super topics (several topics). This can be done with a regular expression.
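For example, a sketch of a pattern subscription with the Java consumer (the topic prefix is just an example):

import java.util.regex.Pattern;

consumer.subscribe(Pattern.compile("orders\\..*")); // subscribes to every topic matching the pattern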
For the HA Scenario you need 2 Kafka Clusters and Mirrormaker 2.0.
The Failover happens on client side Producer / Consumer.
To your question:
If you have 3 brokers and 2 in-sync replicas are required, a maximum of 1 broker can fail. If 2 brokers fail, high availability is no longer guaranteed. This means that you can build exception handling which intercepts the "not enough replicas" error and, on that basis, carries out a reconnect.
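A sketch of that kind of failover logic, assuming NotEnoughReplicasException is the error you want to react to and that the DR cluster's bootstrap address is known (both are assumptions about your setup):

import java.util.Properties;
import java.util.concurrent.ExecutionException;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.errors.NotEnoughReplicasException;

class FailoverProducer {
    private final Properties props;
    private KafkaProducer<String, String> producer;

    FailoverProducer(Properties props) {
        this.props = props;
        this.producer = new KafkaProducer<>(props);
    }

    void send(ProducerRecord<String, String> record) throws Exception {
        try {
            producer.send(record).get();       // block so broker-side errors surface here
        } catch (ExecutionException e) {
            if (!(e.getCause() instanceof NotEnoughReplicasException)) {
                throw e;
            }
            // Primary can no longer satisfy min.insync.replicas: reconnect to the DR cluster.
            producer.close();
            props.put("bootstrap.servers", "dr-kafka:9092");
            producer = new KafkaProducer<>(props);
            producer.send(record).get();       // retry the failed record against DR
        }
    }
}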
If you use the scenario with the load balancer, make sure that the load balancer is configured in passthrough mode. In this way, the amount of code can be reduced.

AKHQ UI always shows a consumer lag of 1 for a JDBC sink connector even though all messages are consumed

We are using Kafka Streams (2.5.0) in a Java application (with exactly-once semantics) and a JDBC sink connector (UPSERT mode) to write data to a DB.
Flow:
Java Kafka Streams app --------> DB sink connector.
The AKHQ user interface always shows a lag of 1, even though all messages are valid and all of them are consumed. Is it because the connector has "isolation.level" set to "read_uncommitted" instead of "read_committed"? The lag is shown in the pic below. I have also seen
a Kafka bug (https://issues.apache.org/jira/browse/KAFKA-10683); is it related to this?
Sink connector consumer lag
Late answer, but I didn't know people asked questions about AKHQ on Stack Overflow.
This is a behaviour of Kafka transactions.
Kafka transactions are handled with a commit marker that will never be read by a Kafka consumer.
The consumer just can't know that the last offset is a commit marker.
So, since AKHQ is a simple Kafka consumer, it will always see this lag of 1.
You will see this for every application using Kafka transactions.
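For completeness, the consumer setting the question asks about. Note that even with read_committed the group's committed offset stops just before the transaction marker while the log-end offset is one past it, so monitoring tools will still report a lag of 1:

props.put("isolation.level", "read_committed"); // skip aborted records; does not remove the 1-offset gap left by the commit marker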

Kafka Streams delay to kick off rebalancing on consumer graceful shutdown

This is a follow up on a previous question I sent regarding high latency in our Kafka Streams; (Kafka Streams rebalancing latency spikes on high throughput kafka-streams services).
As a quick reminder, our stateless service has very tight latency requirements and we are facing problems with latency being too high (some messages are consumed more than 10 secs after being produced), especially when a consumer leaves the group gracefully.
After further investigation we have found out that at least for small consumer groups the rebalance is taking less than 500ms. So we thought, where is this huge latency when removing one consumer (>10s) coming from?
We realized that it is the time between the consumer exiting gracefully and the rebalance kicking in.
Those previous tests were executed with all-default configurations in both Kafka and the Kafka Streams application.
We changed the configurations to:
properties.put("max.poll.records", 50); // defaults to 1000 in kafkastreams
properties.put("auto.offset.reset", "latest"); // defaults to latest
properties.put("heartbeat.interval.ms", 1000);
properties.put("session.timeout.ms", 6000);
properties.put("group.initial.rebalance.delay.ms", 0);
properties.put("max.poll.interval.ms", 6000);
And the result is that the time for the rebalance to start dropped to a bit more than 5 secs.
We also tested killing a consumer non-gracefully with 'kill -9'; the result is that the time to trigger the rebalance is exactly the same.
So we have some questions:
- We expected that when the consumer stops gracefully the rebalance is triggered right away. Should that be the expected behavior? Why isn't it happening in our tests?
- How can we reduce the time between a consumer gracefully exiting and the rebalance being triggered? What are the tradeoffs? More unneeded rebalances?
For more context, our Kafka version is 1.1.0 (looking at the libs we found, for example, kafka/kafka_2.11-1.1.0-cp1.jar; we installed Confluent Platform 4.1.0). On the consumer side, we are using Kafka Streams 2.1.0.
Thank you!
Kafka Streams does not send a "leave group request" when an instance is shut down gracefully -- this is on purpose. The goal is to avoid expensive rebalances if an instance is bounced (e.g., if one upgrades an application, or if one runs in a Kubernetes environment and a pod is restarted quickly and automatically).
To achieve this, a non-public configuration is used. You can overwrite the config via:
props.put("internal.leave.group.on.close", true); // Streams' default is `false`

Kafka old consumer rebalance issue

In our system we're using an older version of kafka (0.9.0.1) and the old scala consumer API in a tomcat application.
Everything works fine most of the time, but sometimes, when the servers where the consumers run are heavily utilised by other tasks in the app, the consumers become unresponsive, which, as expected, triggers a rebalance; that consumer is removed from its partitions and other consumers are used.
My question is if there is an easy way for the consumer to re-register itself when it comes back up?
I know that the old consumers store the partition consumer details in ZooKeeper, and I was thinking we could have a task that periodically checks whether our consumer is registered there and restarts the consumer if not, but I'm not sure what exactly we should check there. Can anyone point me to some documentation about the data Kafka stores in ZooKeeper (I haven't found anything in the official documentation, sadly)?
Basically, what you want is fixed assignments, so that the consumer group never rebalances. If there were a way to disable automatic consumer rebalancing in the old Scala client, or maybe even to increase the rebalance timeout to a much higher value, that could also work, but I couldn't find how to do that with the old Scala consumer.
However, it is possible to assign fixed topics/partitions when using the newer Java consumer, which is also available in that same 0.9 Kafka version. Look for "Subscribing To Specific Partitions" in the Javadocs:
https://kafka.apache.org/090/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html
Subscribing To Specific Partitions
In the previous examples we subscribed to the topics we were interested in and let Kafka give our particular process a fair share of the partitions for those topics. This provides a simple load balancing mechanism so multiple instances of our program can divide up the work of processing records. In this mode the consumer will just get the partitions it subscribes to, and if the consumer instance fails no attempt will be made to rebalance partitions to other instances.
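A sketch of that manual-assignment mode with the 0.9+ Java consumer (topic name and partition numbers are placeholders); with assign() the group coordinator will never move these partitions to another instance:

import java.util.Arrays;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

TopicPartition p0 = new TopicPartition("my-topic", 0);
TopicPartition p1 = new TopicPartition("my-topic", 1);
consumer.assign(Arrays.asList(p0, p1)); // fixed assignment, no automatic rebalancing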
