Handle duplicate message consumed on Kafka Rebalance or application reload - java

Currently, I have a Spring Boot application running as 3 pods.
It has a Kafka consumer that processes a particular task for 20 minutes.
When Kafka rebalances during that time, the same message is consumed again. To handle this, I set a Redis key when the message first arrives; after a rebalance, the consumer checks whether that key exists and discards the event, because the original process is still running.
But now I have a scenario where a pod can be restarted at any time. When the application restarts and the same message is consumed again, I want it to be reprocessed, but because the Redis key exists the event is discarded even though the old process is no longer running.
I have to reprocess the message on application restart and discard it in case of a Kafka rebalance. How can I handle this scenario?

Avoid the rebalance by increasing max.poll.interval.ms so that it exceeds the 20-minute processing time.
Kafka is not really suited for this kind of long-running task; consider moving the task to a DB (or even Redis) when received from Kafka, and process from there.
Another solution would be to set max.poll.records to 1, then pause the listener container before running the task on another thread, and then resume the container when the job finishes.
Pausing the container will keep the consumer alive and avoid the rebalance.
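The pause-then-resume approach can be sketched roughly as follows. This is a minimal sketch, not Spring Kafka code: `Pausable` is a hypothetical stand-in for the listener container's pause()/resume() methods, and `PausedLongTask` and `onMessage` are illustrative names. It assumes max.poll.records is 1, so only one task is in flight at a time.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PausedLongTask {

    /** Hypothetical stand-in for the pause/resume part of a listener container. */
    public interface Pausable {
        void pause();
        void resume();
    }

    // Single worker thread: the long task runs here, not on the consumer thread.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    /**
     * Called from the listener for each record. Pauses the container first,
     * hands the task to the worker thread, and returns immediately so the
     * consumer thread can keep polling while paused.
     */
    public CompletableFuture<Void> onMessage(Pausable container, Runnable task) {
        container.pause();
        return CompletableFuture.runAsync(() -> {
            try {
                task.run();
            } finally {
                // Resume even if the task throws, so the container is never stuck paused.
                container.resume();
            }
        }, worker);
    }

    public void shutdown() {
        worker.shutdown();
    }
}
```

Because the listener returns right after submitting the task, the consumer thread keeps calling poll (and receives nothing while paused), which is what keeps it in the group.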

Related

Kafka consumer group not rebalancing when increasing partitions

I have a situation where in my dev environment, my Kafka consumer groups will rebalance and distribute partitions to consumer instances just fine after increasing the partition count of a subscribed topic.
However, when we deploy our product into its kubernetes environment, we aren't seeing the consumer groups rebalance after increasing the partition count of the topic. Kafka recognized the increase which can be seen from the server logs or describing the topic from the command line. However, the consumer groups won't rebalance and recognize the new partitions. From my local testing, kafka respects metadata.max.age.ms (default 5 mins). But in kubernetes the group never rebalances.
I don't know if it affects anything but we're using static membership.
The consumers are written in Java and use the standard Kafka Java library. No messages are flowing through Kafka, and adding messages doesn't help. I don't see anything special in the server or consumer configurations that differs from my dev environment. Is anyone aware of any configurations that may affect this behavior?
** Update **
The issue was only occurring for a new topic. At first, the consumer application was starting before the producer application (which is intended to create the topic), so the consumer was auto-creating the topic. In this scenario, the topic defaulted to 1 partition. When the producer application started, it updated the partition count per its configuration. After that update, we never saw a rebalance.
Next we tried disabling consumer auto topic creation to address this. This prevented the consumer application from auto creating the topic on subscription. Yet still after the topic was created by the producer app, the consumer group was never rebalanced, so the consumer application would sit idle.
According to the documentation I've found, and testing in my dev environment, both of these situations should trigger a rebalance. For whatever reason we don't see that happen in our deployments. My temporary workaround was just to ensure that the topic is already created before allowing my consumers to subscribe. I don't like it, but it works for now. I suspect that the different behavior I'm seeing is due to my dev environment running a single Kafka broker vs the Kubernetes deployments with a cluster, but that's just a guess.
Kafka defaults to update topic metadata only after 5 minutes, so will not detect partition changes immediately, as you've noticed. The deployment method of your app shouldn't matter, as long as network requests are properly reaching the broker.
Also, check your partition assignment strategy to see whether it uses sticky assignment. This depends on the client version you're using, as the defaults changed around 2.7, I think.
"No messages are flowing through Kafka"
If there's no data on the new partitions, there's no real need to rebalance to consume from them
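If waiting up to 5 minutes for new partitions to be noticed is too long, the refresh interval can be lowered on the consumer. A minimal sketch, assuming a plain Properties-based consumer configuration; the 30-second value and the class name are only illustrative assumptions, not a recommendation:

```java
import java.util.Properties;

public class MetadataRefreshConfig {

    /** Builds only the setting discussed above; a real consumer needs more properties. */
    public static Properties consumerProps() {
        Properties props = new Properties();
        // Default is 300000 (5 minutes); 30000 is an assumed example value.
        props.put("metadata.max.age.ms", "30000");
        return props;
    }
}
```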

Kafka Consumer I want it to poll the msg until consumer tell it to go to the next offset

I am building event-driven software in Java that listens to a Kafka topic and sends the messages from my application to another server. I want the Kafka consumer to keep polling the same message if my application was not able to send the data successfully to the second server. To do this, I enabled manual offset commits and only commit the offset when the message has been sent successfully to the second server, but the broker only resends the message if my application (the consumer) restarts. This is an issue since I don't want my application to restart. Let me know if you have any solutions to this issue.
You would need to track and manually seek the consumer to the last un-processed offset(s) for each topic partition. You may also want to pause() the consumer and halt any poll loop until each record is processed
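The tracking part of that advice can be sketched without the Kafka client. In this rough sketch, `Rec` is a hypothetical stand-in for a consumed record, `send` stands for the call to the second server, and `RewindTracker` is an illustrative name. The method returns, per partition, the first offset that failed, which is the position to seek back to before the next poll.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

public class RewindTracker {

    /** Minimal stand-in for a consumed record: partition number and offset. */
    public static final class Rec {
        public final int partition;
        public final long offset;

        public Rec(int partition, long offset) {
            this.partition = partition;
            this.offset = offset;
        }
    }

    /**
     * Processes a polled batch in order. For each partition, the first record
     * whose send fails is remembered, and later records of that partition are
     * skipped so ordering is preserved when they are re-read.
     */
    public static Map<Integer, Long> offsetsToRewind(List<Rec> batch, Predicate<Rec> send) {
        Map<Integer, Long> rewind = new LinkedHashMap<>();
        for (Rec rec : batch) {
            if (rewind.containsKey(rec.partition)) {
                continue; // an earlier record of this partition already failed
            }
            if (!send.test(rec)) {
                rewind.put(rec.partition, rec.offset);
            }
        }
        // With the real client, each entry would be passed to
        // KafkaConsumer.seek(TopicPartition, long) before polling again.
        return rewind;
    }
}
```

After seeking, the next poll returns records starting from the failed offset again, so no restart is needed to see the message a second time.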

kafka Consumer Reading Previous Records

I am facing a problem with my Kafka consumer. I have two Kafka brokers running, with replication factor 2 for the topic. Every time a broker restarts and I then restart my consumer service, it starts reading records that it has already read. For example, before I restarted the consumer, this was the state.
The consumer was sitting idle, not receiving any messages, as it had read all of them.
I restarted my consumer, and all of a sudden it started receiving messages that it had processed previously. Here is the offset situation now.
Also, what are LOG-END-OFFSET and LAG? They look like something to consider here.
Note that this only happens when one of the brokers gets restarted because Kubernetes moves it to another node.
This is the topic configuration.
Based on the info you posted, a couple of things that immediately come to mind:
The first screenshot shows a lag of 182, which means the consumer either was not running, or it has some weird configuration that made it stop consuming. Was it possible one of the brokers was down when the consumer stopped consuming?
On restart, the consumer finally consumed all the remaining messages, because it now shows lag of 0. This is correct, expected Kafka behavior.
Make sure that the consumer group name is not changing between restarts. Some clients default to "randomized" consumer group names, which works as long as the consumer is not restarted.
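On the LOG-END-OFFSET and LAG question: in the consumer group description, LAG for a partition is the log end offset minus the group's committed (current) offset. A tiny illustration; the offsets 1182 and 1000 are made-up numbers chosen only to reproduce a lag of 182, since the actual values from the screenshots are not shown:

```java
public class LagExample {

    /** Lag for one partition: records written to the log but not yet committed by the group. */
    public static long lag(long logEndOffset, long currentOffset) {
        return logEndOffset - currentOffset;
    }

    // Example with assumed offsets: lag(1182L, 1000L) returns 182,
    // and lag(1182L, 1182L) returns 0 once the consumer has caught up.
}
```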

Kafka-streams delay to kick rebalancing on consumer graceful shutdown

This is a follow-up to a previous question I asked regarding high latency in our Kafka Streams application (Kafka Streams rebalancing latency spikes on high throughput kafka-streams services).
As a quick reminder, our stateless service has very tight latency requirements, and we are seeing latency that is too high (some messages are consumed more than 10 secs after being produced), especially when a consumer leaves the group gracefully.
After further investigation we found that, at least for small consumer groups, the rebalance itself takes less than 500ms. So we wondered: where is this huge latency (>10s) when removing one consumer coming from?
We realized that it is the time between the consumer exiting gracefully and the rebalance kicking in.
The previous tests were executed with all-default configurations in both Kafka and the Kafka Streams application.
We changed the configurations to:
properties.put("max.poll.records", 50); // defaults to 1000 in kafkastreams
properties.put("auto.offset.reset", "latest"); // defaults to latest
properties.put("heartbeat.interval.ms", 1000);
properties.put("session.timeout.ms", 6000);
properties.put("group.initial.rebalance.delay.ms", 0);
properties.put("max.poll.interval.ms", 6000);
The result is that the time for the rebalance to start dropped to a bit more than 5 secs.
We also tested killing a consumer non-gracefully with 'kill -9'; the time to trigger the rebalance is exactly the same.
So we have some questions:
- We expected the rebalance to be triggered right away when the consumer stops gracefully. Is that the expected behavior? Why isn't it happening in our tests?
- How can we reduce the time between a consumer gracefully exiting and the rebalance being triggered? What are the tradeoffs? More unneeded rebalances?
For more context, our Kafka version is 1.1.0; judging by the libs we found (for example kafka/kafka_2.11-1.1.0-cp1.jar), we installed Confluent Platform 4.1.0. On the consumer side, we are using Kafka-streams 2.1.0.
Thank you!
Kafka Streams does not send a "leave group request" when an instance is shut down gracefully -- this is on purpose. The goal is to avoid expensive rebalances if an instance is bounced (e.g., if one upgrades an application, or if one runs in a Kubernetes environment and a pod is quickly restarted automatically).
To achieve this, a non public configuration is used. You can overwrite the config via
props.put("internal.leave.group.on.close", true); // Streams' default is `false`

Detecting when an EC2 instance will be shutting down (Java SDK)

I posted this on the AWS support forums but haven't received a response, so hoping you guys have an idea...
We have an auto-scaling group which boots up or terminates an instance based on current load. What I'd like to be able to do is detect, on my current EC2 instance, that it's about to be shut down, and finish my work.
To describe the situation in more detail: we have an auto-scaling group, and each instance reads content from a single SQS queue. Each instance runs multiple threads, and each thread reads from the same SQS queue and processes the data as needed.
I need to know when this instance will be about to shut down, so I can stop new threads from reading data, and block the shutdown until the remaining data has finished processing.
I'm not sure how I can do this in the Java SDK, and I'm worried my instances will be terminated without my data being processed correctly.
Thanks
Lee
When it wants to scale down, AWS Auto Scaling will terminate your EC2 instances without warning.
There's no way to allow any queue workers to drain before terminating.
If your workers are processing messages transactionally, and you're not deleting messages from SQS until after they have been successfully processed, then this shouldn't be a problem. The processing will stop when the instance is terminated, and the transaction will not commit. The message won't be deleted from the SQS queue, and can be picked up and processed by another worker later on.
The only kind of draining behavior it supports is HTTP connection draining from an ELB: "If connection draining is enabled for your load balancer, Auto Scaling waits for the in-flight requests to complete or for the maximum timeout to expire, whichever comes first, before terminating instances".
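The "delete only after successful processing" pattern described above can be sketched as follows. This is a simplified illustration, not AWS SDK code: `Queue` is a hypothetical interface (the real SQS API deletes by receipt handle rather than by message body), and `DeleteAfterProcess` and `handleOne` are illustrative names.

```java
import java.util.function.Consumer;

public class DeleteAfterProcess {

    /** Hypothetical stand-in for the two queue operations the worker needs. */
    public interface Queue {
        String receive();            // returns null when no message is available
        void delete(String message); // acknowledges that the message is done
    }

    /**
     * Receives one message and deletes it only if processing completed.
     * If processing throws, or the instance dies before delete is reached,
     * the message stays in the queue and can be picked up again later.
     */
    public static boolean handleOne(Queue queue, Consumer<String> processor) {
        String message = queue.receive();
        if (message == null) {
            return false;
        }
        try {
            processor.accept(message);
        } catch (RuntimeException e) {
            return false; // not deleted, so another worker can retry it
        }
        queue.delete(message);
        return true;
    }
}
```

With this ordering, a terminated instance costs at most a repeated processing attempt, so the processing itself should be safe to run more than once.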
