Getting duplicates in kafka consumer


I am writing a Java client for a Kafka consumer. I commit every message asynchronously before processing it. Still, I am receiving many duplicate messages during rebalances.
Can anyone explain the reason and how to avoid this?

Kafka Consumer does not provide exactly-once processing guarantees, even if you commit all messages synchronously.
The problem is that when you have finished processing a message and want to commit it, a rebalance can happen right before the commit. The commit is then never applied, and the already processed message will be reprocessed.
Because you use asynchronous commits, the number of duplicates increases, as the commit does not take effect immediately for each message. Hence, you can have many messages "in flight" that have finished processing but are not yet committed. On a rebalance, all of these "in-flight" messages will be reprocessed.
So committing synchronously will reduce the number of duplicates. However, duplicates cannot be avoided completely, because there is no exactly-once delivery guarantee in Kafka.
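Exactly-once semantics (idempotent producers and transactional messaging) were proposed in KIP-98 and have since shipped, starting with Kafka 0.11. They cover flows that read from and write back to Kafka, so a plain consumer with side effects outside Kafka still has to expect duplicates: https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging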
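To make the "processed but not yet committed" window concrete, here is a minimal sketch in plain Java. It is a toy model of one partition, not Kafka client code: the class CommitWindowDemo and its parameters are made up for illustration, and the rebalance is assumed to hit right after a record has been processed but before its commit.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a single partition; no Kafka client is involved. */
final class CommitWindowDemo {

    /**
     * The first owner processes records from offset 0 and commits after every
     * commitEvery processed records (1 models a commit after each record).
     * A rebalance hits right after the rebalanceAfter-th record was processed
     * but before any commit for it. The next owner resumes from the last
     * committed position. Assumes 1 <= rebalanceAfter <= totalRecords.
     * Returns every offset that was processed, duplicates included.
     */
    static List<Integer> run(int totalRecords, int commitEvery, int rebalanceAfter) {
        List<Integer> processed = new ArrayList<>();
        int committed = 0;   // next offset to read after a restart
        int sinceCommit = 0; // records processed since the last commit

        for (int offset = 0; offset < totalRecords; offset++) {
            processed.add(offset);
            if (processed.size() == rebalanceAfter) {
                break; // partition is revoked before the commit happens
            }
            if (++sinceCommit == commitEvery) {
                committed = offset + 1;
                sinceCommit = 0;
            }
        }

        // The new owner starts from the last committed position.
        for (int offset = committed; offset < totalRecords; offset++) {
            processed.add(offset);
        }
        return processed;
    }

    /** Number of processing steps that repeated an already processed offset. */
    static long duplicates(List<Integer> processed) {
        return processed.size() - processed.stream().distinct().count();
    }

    public static void main(String[] args) {
        System.out.println(duplicates(run(10, 1, 9))); // commit per record: prints 1
        System.out.println(duplicates(run(10, 5, 9))); // commit every 5 records: prints 4
    }
}
```

Committing after every record shrinks the window to a single record, but it never closes it, which matches the point that duplicates can be reduced but not eliminated this way.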

Related

How to handle session timeout while processing Kafka messages?

I am processing messages from Kafka in a standard processing loop:
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
        processMessage(record);
    }
}
What should I do if my Kafka Consumer gets into a timeout while processing the records? I mean the timeout controlled by the property session.timeout.ms
When this happens, my consumer should stop processing the records, because it would lose its partitions and the records that it processes could be already processed by another consumer. If the original consumer writes some processing results into a database, it could overwrite the records produced by the "new" consumer that got the partitions after my original consumer timed out.
I know about the ConsumerRebalanceListener, but from my understanding its method onPartitionsLost would only be called after I call the poll method from the consumer. Therefore this doesn't help me to stop the processing loop of the batch of records that I received from the previous poll.
I would expect that the heartbeat thread could notify me that it was not able to contact the broker and that we have a session timeout in the consumer, but there doesn't seem to be anything like that...
Am I missing something?

Adding this as an answer as it would be too long in a comment.
Kafka supports a few message-processing semantics:
At most once;
At least once; and
Exactly once.
You are describing wanting exactly-once semantics from Kafka (which, by the way, is the least common way of using Kafka). Producers also need to play nicely, because a producer that retries without idempotence enabled can write the same message more than once.
It's a lot more common to build services on at-least-once delivery. You can then receive (or process) the same message more than once, so you need a way to deduplicate (it's the same idea as idempotency in HTTP APIs). You need something unique in the message, and you need to record that this ID has already been processed. If the payload has nothing you can use for deduplication, you can add a header to the message and use that.
This is also useful in the scenario that you have to reset the offset, so the service can go through old messages without breaking.
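As a rough sketch of the deduplication idea, the wrapper below skips any message whose unique ID has been seen before. Everything here is hypothetical: DeduplicatingHandler is an illustrative name, the ID is assumed to come from the payload or a message header, and the in-memory set stands in for what would normally be a durable store (ideally written in the same transaction as the processing result).

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

/** Wraps a handler so that each message ID is processed at most once. */
final class DeduplicatingHandler {
    // Assumption: an in-memory set. It is lost on restart, so a real service
    // would keep the processed IDs in a database or similar durable store.
    private final Set<String> processedIds = ConcurrentHashMap.newKeySet();
    private final Consumer<String> delegate;

    DeduplicatingHandler(Consumer<String> delegate) {
        this.delegate = delegate;
    }

    /** Returns true if the payload was processed, false if the ID was a duplicate. */
    boolean handle(String messageId, String payload) {
        if (!processedIds.add(messageId)) {
            return false; // already seen: skip without side effects
        }
        try {
            delegate.accept(payload);
        } catch (RuntimeException e) {
            // Processing failed, so forget the ID and allow a redelivery to retry.
            processedIds.remove(messageId);
            throw e;
        }
        return true;
    }
}
```

With this in place, a redelivered record after a rebalance or an offset reset is recognised by its ID and dropped instead of being applied twice.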
I would suggest searching a bit for details on how to implement the above.
Here's a blog post from Confluent about exactly-once semantics, "Improved Robustness and Usability of Exactly-Once Semantics in Apache Kafka", and the Kafka docs explain the different semantics.
About the point of the ConsumerRebalanceListener, you don't need to do anything if you follow the solution of using idempotency in the consumer. Rebalances also happen when an app crashes, and in that scenario the service might have processed some records, but not committed them yet to Kafka.
A mini tip I give to everyone who is starting with Kafka: it looks simple from the outside, but it's a complex technology. Don't use it in production until you know the nitty-gritty details of how it works and have done a good amount of negative testing (unless you are OK with losing data).

What is the impact of delaying a manual Kafka offset commit?

We want to commit Kafka offsets manually to control data-loss events. But we might delay the manual commit, as we want to commit only after persisting to the data source.
I would like to learn how slowing down offset commits impacts Kafka's topics, parallelism, or partitions, if at all.

When you consume from a topic and the consumers belong to one consumer group, Kafka makes sure each partition is consumed by only one consumer in the group. So committing manually will not affect the other consumers, because they are consuming from other partitions.
But if you compare consumers of the same partition with enable.auto.commit=false and enable.auto.commit=true, the consumer with auto-commit enabled has relatively higher throughput. And if you don't need confirmation of your commits, use commitAsync, which gives better throughput than commitSync.
Generally, you call the commit API when you have finished processing all the messages in a batch, and you don't poll for new messages until the last offset in the batch is committed. This approach can affect throughput and latency, as can the number of messages returned when polling, so you can set up your application to commit less frequently.
But if you commit manually, messages can be consumed more than once when a consumer restarts or rebalances. Say you consume a message, write it to your database, and only then commit the offset to Kafka. If the consumer rebalances or restarts between the write and the commit, the offset is not committed and the message will be consumed again by another consumer in the same group.
For more information, please refer to:
consumer-tuning
how-to-commit-offsets

Delayed ACK in Spring Kafka

I'm using Spring and Spring Kafka for a batching service that collects data from Kafka until certain conditions are met, and then dumps the data.
I want to acknowledge the commits when the data leaves my service, but it could potentially sit in memory for 5-10 minutes.
Given that the Acknowledgement implementations in Spring Kafka hold on to the original record(s) it seems unreasonable to hold on to them until I dump my data given what that would do to memory utilization.
Is there any other way to acknowledge / commit offsets from Spring Kafka given just the partition/offset information?

You could use AckMode.TIME or AckMode.COUNT with an incredibly large ackTime or ackCount so the container will never do the ack.
Then, pass the Consumer<?, ?> into your listener method and do the offset commit yourself.
Note, however, that the consumer is not thread-safe so you must perform the commit on the listener thread.
Also, bear in mind that records are not individually ack'd, just the offset. You can't ack "out of order".
Also, you would likely need to increase the max.poll.interval.ms beyond its default (5 minutes) to avoid a rebalance of the partitions.
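One way to keep only partition/offset information, rather than the records themselves, is sketched below. BatchOffsets is a hypothetical helper, not part of Spring Kafka or the Kafka client, and it assumes a single topic so that the partition number alone identifies a partition. It remembers the highest offset seen per partition and, when the batch is dumped, returns the positions to commit (last offset + 1, since Kafka commits the next offset to read).

```java
import java.util.HashMap;
import java.util.Map;

/** Tracks the highest offset per partition for the batch currently held in memory. */
final class BatchOffsets {
    // Assumption: one topic, so the partition number is enough as a key.
    private final Map<Integer, Long> highestOffset = new HashMap<>();

    /** Call for every record added to the in-memory batch. */
    void add(int partition, long offset) {
        highestOffset.merge(partition, offset, Long::max);
    }

    /**
     * Call after the batch has been dumped. Returns partition -> offset to commit
     * (highest offset + 1) and resets the tracker for the next batch.
     */
    Map<Integer, Long> drainCommitPositions() {
        Map<Integer, Long> positions = new HashMap<>();
        highestOffset.forEach((partition, offset) -> positions.put(partition, offset + 1));
        highestOffset.clear();
        return positions;
    }
}
```

In the listener, the returned map would be converted to a Map<TopicPartition, OffsetAndMetadata> and passed to consumer.commitSync(...) on the listener thread. Because only the highest offset per partition is kept, this matches the constraint that offsets cannot be acknowledged out of order.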

When a consumer gets a message from a channel in RabbitMQ, where do the prefetched messages reside?

I have the following configuration for RabbitMQ:
prefetchCount:1
ack-mode:auto.
I have one exchange, one queue attached to that exchange, and one consumer attached to that queue. As I understand it, the following steps happen if the queue has multiple messages:
The queue writes a message to the channel.
As ack-mode is auto, the message is removed from the queue as soon as the queue writes it to the channel.
The message reaches the consumer, and the consumer starts working on it.
As the queue has received the acknowledgement for the previous message, it writes the next message to the channel.
Now, my question is: suppose the consumer is not finished with the previous message yet. What happens to the next message that the queue has written to the channel?
Also, suppose prefetchCount is 10 and I have just one consumer attached to the queue. Where will these 10 messages reside?

The scenario you have described is one that is mentioned in the documentation for RabbitMQ, and elaborated in this blog post. Specifically, if you set a sufficiently large prefetch count, and have a relatively small publish rate, your RabbitMQ server turns into a fancy network switch. When acknowledgement mode is set to automatic, prefetch limiting is effectively disabled, as there are never unacknowledged messages. With automatic acknowledgement, the message is acknowledged as soon as it is delivered. This is the same as having an arbitrarily large prefetch count.
With prefetch >1, the messages are stored within a buffer in the client library. The exact data structure will depend upon the client library used, but to my knowledge, all implementations store the messages in RAM. Further, with automatic acknowledgements, you have no way of knowing when a specific consumer actually read and processed a message.
So, there are a few takeaways here:
Prefetch limit is irrelevant with automatic acknowledgements, as there are never any unacknowledged messages, thus
Automatic acknowledgements don't make much sense when using a consumer
Sufficiently-large prefetch when auto-ack is off, or any use of autoack = on will result in the message broker not doing any queuing, and instead doing routing only.
Now, here's a little bit of expert opinion. I find the whole notion of a message broker that "pushes" messages out to be a little backwards, and for this very reason: it's difficult to configure properly, and it is unclear what the benefit is. A queue system is a natural fit for a pull-based system. The processor can ask the broker for the next message when it is done processing the current message. This approach will ensure that load is balanced naturally and the messages don't get lost when processors disconnect or get knocked out.
Therefore, my recommendation is to drop the use of consumers altogether and switch over to using basic.get.

Best Practice for resilience of messages across RabbitMQ queues

I am trying to understand the best use of RabbitMQ to satisfy the following problem.
As context I'm not concerned with performance in this use case (my peak TPS for this flow is 2 TPS) but I am concerned about resilience.
I have RabbitMQ installed in a cluster and ignoring dead letter queues the basic flow is I have a service receive a request, creates a persistent message which it queues, in a transaction, to a durable queue (at this point I'm happy the request is secured to disk). I then have another process listening for a message, which it reads (not using auto ack), does a bunch of stuff, writes a new message to a different exchange queue in a transaction (again now happy this message is secured to disk). Assuming the transaction completes successfully it manually acks the message back to the original consumer.
At this point my only failure scenario is a failure between the commit of the transaction that writes to my second queue and the return of the ack. This would lead to a message potentially being processed twice. Is there anything else I can do to plug this gap, or do I have to figure out a way of handling duplicate messages?
As a final bit of context the services are written in java so using the java client libs.
Paul Fitz.

First of all, I suggest you look at this guide, which has a lot of valid information on your topic.
From the RabbitMQ guide:
At the Producer
When using confirms, producers recovering from a channel or connection
failure should retransmit any messages for which an acknowledgement
has not been received from the broker. There is a possibility of
message duplication here, because the broker might have sent a
confirmation that never reached the producer (due to network failures,
etc). Therefore consumer applications will need to perform
deduplication or handle incoming messages in an idempotent manner.
At the Consumer
In the event of network failure (or a node crashing), messages can be
duplicated, and consumers must be prepared to handle them. If
possible, the simplest way to handle this is to ensure that your
consumers handle messages in an idempotent way rather than explicitly
deal with deduplication.
So, the point is that it is not possible in any way to guarantee that this failure scenario of yours will not happen. You will always have to deal with network failure, disk failure, put-something-here failure, etc.
What you have to do here is lean on the messaging architecture and, if possible, implement idempotency for your messages (which means that nothing goes wrong even if you process the message twice; check this).
If you can't, then you should keep some kind of "processed message" list (for example, you can put a GUID inside every message) and check this list every time you receive a message; you can simply discard the duplicates in this case.
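To illustrate what idempotent handling means in practice, here is a small sketch with made-up names (IdempotencyDemo, a stock map keyed by SKU). A message that carries an absolute value can be applied twice without harm, while a message that carries a delta cannot, and so would need the processed-message list described above.

```java
import java.util.HashMap;
import java.util.Map;

/** Contrasts an idempotent update with a non-idempotent one. */
final class IdempotencyDemo {
    final Map<String, Integer> stock = new HashMap<>();

    /** Idempotent: applying the same message twice leaves the same final state. */
    void applyAbsolute(String sku, int newLevel) {
        stock.put(sku, newLevel);
    }

    /** Not idempotent: every redelivery changes the result again. */
    void applyDelta(String sku, int delta) {
        stock.merge(sku, delta, Integer::sum);
    }

    public static void main(String[] args) {
        IdempotencyDemo demo = new IdempotencyDemo();

        // The same "set to 5" message delivered twice.
        demo.applyAbsolute("sku-1", 5);
        demo.applyAbsolute("sku-1", 5);
        System.out.println(demo.stock.get("sku-1")); // prints 5

        // The same "add 5" message delivered twice.
        demo.applyDelta("sku-2", 5);
        demo.applyDelta("sku-2", 5);
        System.out.println(demo.stock.get("sku-2")); // prints 10, although 5 was intended
    }
}
```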
To be more "theoretical", this post from Brave New Geek is very interesting:
Within the context of a distributed system, you cannot have
exactly-once message delivery.
Hope it helps :)
