I'm using Spring and Spring Kafka for a batching service that collects data from Kafka until certain conditions are met, and then dumps the data.
I want to commit the offsets only when the data leaves my service, but the data could sit in memory for 5-10 minutes.
The Acknowledgment implementations in Spring Kafka hold on to the original record(s), so keeping them around until I dump my data seems unreasonable given what that would do to memory utilization.
Is there any other way to acknowledge / commit offsets from Spring Kafka given just the partition/offset information?
You could use AckMode.TIME or AckMode.COUNT with a very large ackTime or ackCount so that the container never performs the ack itself.
Then, pass the Consumer<?, ?> into your listener method and do the offset commit yourself.
Note, however, that the consumer is not thread-safe so you must perform the commit on the listener thread.
Also, bear in mind that records are not acked individually; only an offset per partition is committed. You can't ack "out of order".
Also, you would likely need to increase the max.poll.interval.ms beyond its default (5 minutes) to avoid a rebalance of the partitions.
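To keep only partition/offset information in memory, as the question asks, something along these lines might work. This is a minimal sketch, not part of Spring Kafka: OffsetTracker and its method names are hypothetical. It reports, per partition, the lowest offset that is still in flight, which is the only position that can safely be committed because offsets cannot be acked out of order. The returned value would then be wrapped in an OffsetAndMetadata and passed to Consumer.commitSync(...) on the listener thread.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical helper: remembers offsets only, so the records themselves
// can be garbage collected once their data has been batched.
class OffsetTracker {
    // Offsets received but not yet dumped, per partition.
    private final Map<Integer, TreeSet<Long>> inFlight = new HashMap<>();
    // Highest offset received so far, per partition.
    private final Map<Integer, Long> highestSeen = new HashMap<>();

    // Call when a record arrives in the listener.
    void received(int partition, long offset) {
        inFlight.computeIfAbsent(partition, p -> new TreeSet<>()).add(offset);
        highestSeen.merge(partition, offset, Long::max);
    }

    // Call when the data belonging to this offset has left the service.
    void completed(int partition, long offset) {
        TreeSet<Long> open = inFlight.get(partition);
        if (open != null) {
            open.remove(offset);
        }
    }

    // Offset that is safe to commit (the next position to read from),
    // or -1 if nothing has been received for the partition.
    long committable(int partition) {
        Long highest = highestSeen.get(partition);
        if (highest == null) {
            return -1L;
        }
        TreeSet<Long> open = inFlight.get(partition);
        // Anything at or above the lowest in-flight offset must not be committed yet.
        return open.isEmpty() ? highest + 1 : open.first();
    }
}
```

With offsets 5, 6 and 7 received and only 6 completed, the committable position stays at 5; it moves forward only once 5 is completed as well.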
I am processing messages from Kafka in a standard processing loop:
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
        processMessage(record);
    }
}
What should I do if my Kafka Consumer gets into a timeout while processing the records? I mean the timeout controlled by the property session.timeout.ms
When this happens, my consumer should stop processing the records, because it would lose its partitions and the records that it processes could be already processed by another consumer. If the original consumer writes some processing results into a database, it could overwrite the records produced by the "new" consumer that got the partitions after my original consumer timed out.
I know about the ConsumerRebalanceListener, but from my understanding its method onPartitionsLost would only be called after I call the poll method from the consumer. Therefore this doesn't help me to stop the processing loop of the batch of records that I received from the previous poll.
I would expect that the heartbeat thread could notify me that it was not able to contact the broker and that we have a session timeout in the consumer, but there doesn't seem to be anything like that...
Am I missing something?
Adding this as an answer as it would be too long in a comment.
Kafka offers a few delivery semantics for processing messages:
At most once;
At least once; and
Exactly once.
You are describing exactly-once semantics, which, by the way, is the least common way of using Kafka. Producers also need to play nicely, because by default Kafka can produce the same message more than once.
It's far more common to build services around at-least-once delivery. With it you can receive (or process) the same message more than once, so you need a way to deduplicate (it's the same idea as idempotency in HTTP APIs). You need something unique in the message, and you need to record that that ID has already been processed. If the payload has nothing you can use to deduplicate, you can add a header to the message and use that.
This is also useful in the scenario that you have to reset the offset, so the service can go through old messages without breaking.
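As a rough illustration of the deduplication idea, here is a minimal sketch. IdempotentHandler and its methods are hypothetical names, and the in-memory set is only a stand-in: a real service would keep the processed IDs in a durable store, ideally updated together with the result of the processing itself.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical idempotent handler: a message ID is processed at most once.
class IdempotentHandler {
    // Stand-in for a durable store of processed IDs (assumption: IDs are unique per message).
    private final Set<String> processedIds = new HashSet<>();
    private int sideEffects = 0;

    // Returns true if the message was processed, false if it was a duplicate.
    boolean handle(String messageId, String payload) {
        if (!processedIds.add(messageId)) {
            return false; // already seen: skip the side effect
        }
        sideEffects++; // stand-in for the real work, e.g. a database write
        return true;
    }

    int sideEffectCount() {
        return sideEffects;
    }
}
```

Delivering the same ID twice, whether after a rebalance or an offset reset, then triggers the side effect only once.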
I would suggest searching a bit for details on how to implement the above.
Here's a blog post from Confluent about developing exactly-once semantics, "Improved Robustness and Usability of Exactly-Once Semantics in Apache Kafka", and the Kafka docs explain the different semantics.
About the point of the ConsumerRebalanceListener, you don't need to do anything if you follow the solution of using idempotency in the consumer. Rebalances also happen when an app crashes, and in that scenario the service might have processed some records, but not committed them yet to Kafka.
A mini tip I give to everyone who is starting with Kafka: it looks simple from the outside, but it's a complex technology. Don't use it in production until you know the nitty-gritty details of how it works and have done a good amount of negative testing (unless you are OK with losing data).
We want to commit Kafka offsets manually to guard against data loss. But we might delay the manual commit, because we want to do it only after persisting to the data source.
I would like to learn how delaying an offset commit impacts Kafka's topics, parallelism, and partitions, if at all.
When you consume from a topic and the consumer belongs to a consumer group, Kafka makes sure that each partition is consumed by only one consumer in that group. So committing manually will not affect the other consumers, because they are consuming from other partitions.
But if you compare consumers of the same partition with enable.auto.commit=false and enable.auto.commit=true, the throughput of the auto-commit consumer is relatively higher. And if you don't need confirmation of your commits, use commitAsync; it gives better throughput than commitSync.
Generally, you call the API when you have finished processing all the messages in a batch, and you don't poll for new messages until the last offset in the batch is committed. This approach can affect throughput and latency, as can the number of messages returned when polling, so you can set up your application to commit less frequently.
But if you commit manually, messages can be consumed more than once when the consumer restarts or rebalances. You consume a message, write it to your DB, and only then commit its offset to Kafka. If the consumer rebalances or restarts in between, that offset is never committed and the message will be re-consumed by another consumer in the same group.
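The duplicate scenario can be reproduced without a broker. The following is a toy model only, with hypothetical names (RedeliveryDemo, consumeBatch): the "log" is a list, the "database" is another list, and a crash is modelled by returning the old committed offset after the writes have already happened.

```java
import java.util.List;

// Toy model of process-then-commit; no Kafka classes involved.
class RedeliveryDemo {
    // Writes one batch starting at committedOffset into sink and returns the
    // new committed offset. If crashBeforeCommit is true, the batch is written
    // but the offset is not advanced, as if the consumer died before committing.
    static long consumeBatch(List<String> log, long committedOffset, int batchSize,
                             boolean crashBeforeCommit, List<String> sink) {
        int end = (int) Math.min(log.size(), committedOffset + batchSize);
        for (int i = (int) committedOffset; i < end; i++) {
            sink.add(log.get(i)); // stand-in for the database write
        }
        return crashBeforeCommit ? committedOffset : end;
    }
}
```

Running a batch with crashBeforeCommit set and then running it again from the returned offset writes the same records into the sink twice, which is exactly the case the consumer has to tolerate.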
For more information, please refer to:
consumer-tuning
how-to-commit-offsets
I am opening a Kafka producer with config properties:
KafkaProducer<String, MyValue> producer = new KafkaProducer<String, MyValue>(kafkaProperties);
I then send records synchronously (to avoid batching and to maintain the original message order):
// create a MyValue instance (omitted for simplicity)
// create myRecord from the topic name and myValue
producer.send(myRecord).get(); // block until the send has completed
producer.flush(); // send the message as soon as the record is available to the producer
My issue is that I have several records to send, and between sends I might have to wait a long time, from a few minutes to hours (for whatever reason, if only to explore and understand Kafka better).
I want to know how long the producer's connection to the cluster/bootstrap server stays alive. Is there any way I can configure this using the producer configuration?
(In-depth explanations will be greatly appreciated. Even if it has to go down to the TCP connection level, you are welcome.)
(Kafka consumers have a heartbeat concept. Do producers have a similar concept? A Google search for "kafka producer heartbeat.interval.ms" returned only results for the consumer.)
The KafkaProducer.send method is asynchronous. By default it adds records to an in-memory buffer and sends them in batches, so according to the docs the producer uses the connection when it sends a batch to the cluster:
The send() method is asynchronous. When called it adds the record to a buffer of pending record sends and immediately returns. This allows the producer to batch together individual records for efficiency.
The producer maintains buffers of unsent records for each partition. These buffers are of a size specified by the batch.size config. Making this larger can result in more batching, but requires more memory (since we will generally have one of these buffers for each active partition).
By default a buffer is available to send immediately even if there is additional unused space in the buffer. However if you want to reduce the number of requests you can set linger.ms to something greater than 0.
This will instruct the producer to wait up to that number of milliseconds before sending a request in hope that more records will arrive to fill up the same batch. This is analogous to Nagle's algorithm in TCP.
For example, in the code snippet above, likely all 100 records would be sent in a single request since we set our linger time to 1 millisecond. However this setting would add 1 millisecond of latency to our request waiting for more records to arrive if we didn't fill up the buffer.
Note that records that arrive close together in time will generally batch together even with linger.ms=0 so under heavy load batching will occur regardless of the linger configuration; however setting this to something larger than 0 can lead to fewer, more efficient requests when not under maximal load at the cost of a small amount of latency.
From the KafkaProducer.flush docs: invoking flush does not mean the producer sends each record to the cluster separately; it makes all buffered records immediately available to send:
Invoking this method makes all buffered records immediately available to send (even if linger.ms is greater than 0) and blocks on the completion of the requests associated with these records. The post-condition of flush() is that any previously sent record will have completed (e.g. Future.isDone() == true). A request is considered completed when it is successfully acknowledged according to the acks configuration you have specified or else it results in an error.
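For reference, the settings named in the quoted docs can be collected as plain properties. This is only a sketch: ProducerSettings and lowLatencyProps are hypothetical names, the values are illustrative rather than recommendations, and serializer settings are omitted. The connections.max.idle.ms entry is not mentioned in the quotes above; to my knowledge it is the client setting that controls when idle connections are closed, which is what the question asks about, but check it against the documentation for your client version.

```java
import java.util.Properties;

// Hypothetical helper that gathers producer settings as plain string properties.
class ProducerSettings {
    static Properties lowLatencyProps(String bootstrapServers) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        // Acknowledgement level that decides when a request counts as completed.
        props.setProperty("acks", "all");
        // Do not wait for more records before sending a batch.
        props.setProperty("linger.ms", "0");
        // Per-partition buffer size in bytes (illustrative value).
        props.setProperty("batch.size", "16384");
        // Assumption: idle connections are closed after this many milliseconds.
        props.setProperty("connections.max.idle.ms", "540000");
        return props;
    }
}
```

The resulting Properties object is the kind of value passed to the KafkaProducer constructor shown in the question, once the key and value serializers have been added.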
I'm currently struggling to write a Kafka consumer that can somehow schedule execution for a future time.
To summarize: I have a large data store (a .csv file) whose records contain two columns: timestamp and value. I'm trying to process these values based on their timestamps. The first record has to be consumed immediately; the next one should be processed in the future with a delay of 'current record timestamp - previous record timestamp' (the difference is not very big, just a few seconds, so the result will be in milliseconds), and so on.
So basically I can't find a way to implement a Kafka consumer that takes each record based on its timestamp and applies that exact delay. I just have to simulate these values, and they have to be inserted into the DB according to that delay to work properly.
I've tried working with threads and executors, but for big data that's not a suitable approach.
I tried creating dynamic topics on the producer side based on the timestamp, then subscribing to them and somehow processing them with a queue. It didn't work.
I expect Kafka to consume each record with the delay based on its timestamp.
I expect the kafka to consume each record with the delay based on timestamp
If you need a specific delay between messages, then Kafka is not the right solution.
When you send messages to Kafka, in most scenarios you go over the network, which can add its own unpredictable delay. Kafka runs as a separate process, and nobody can guarantee at which moment that process will be ready to receive the next message. The OS could suspend the process, GC could start, and so on. This adds another delay that nobody can predict.
Also, Kafka is not designed to act on the time at which a message was received. It is focused on message ordering, low latency, and high throughput, not on timing.
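If approximate timing is good enough, the delay from the question can be computed in the application instead of being expected from Kafka. A minimal sketch, assuming the timestamps are epoch milliseconds in file order (ReplayDelays and delaysMillis are hypothetical names):

```java
// Hypothetical helper: derives the wait before each record from its timestamp.
class ReplayDelays {
    // delays[0] is 0 (the first record is handled immediately);
    // delays[i] is the gap to the previous record, never negative.
    static long[] delaysMillis(long[] timestampsMillis) {
        long[] delays = new long[timestampsMillis.length];
        for (int i = 1; i < timestampsMillis.length; i++) {
            delays[i] = Math.max(0L, timestampsMillis[i] - timestampsMillis[i - 1]);
        }
        return delays;
    }
}
```

The code that inserts into the DB could then wait for delays[i] before handling record i, for example with Thread.sleep or a ScheduledExecutorService. The result is still subject to the scheduling and GC jitter described above.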
I have the following configuration for RabbitMQ:
prefetchCount: 1
ack-mode: auto
I have one exchange, one queue attached to that exchange, and one consumer attached to that queue. As I understand it, the following steps happen if the queue has multiple messages:
The queue writes data to a channel.
As ack-mode is auto, the message is removed from the queue as soon as the queue writes it to the channel.
The message reaches the consumer, and the consumer starts processing it.
As the queue has received the acknowledgement for the previous message, it writes the next message to the channel.
Now, my question is: suppose the consumer has not finished with the previous message yet. What happens to the next message that the queue has written to the channel?
Also, suppose prefetchCount is 10 and I have just one consumer attached to the queue. Where will these 10 messages reside?
The scenario you have described is one that is mentioned in the documentation for RabbitMQ, and elaborated in this blog post. Specifically, if you set a sufficiently large prefetch count, and have a relatively small publish rate, your RabbitMQ server turns into a fancy network switch. When acknowledgement mode is set to automatic, prefetch limiting is effectively disabled, as there are never unacknowledged messages. With automatic acknowledgement, the message is acknowledged as soon as it is delivered. This is the same as having an arbitrarily large prefetch count.
With prefetch >1, the messages are stored within a buffer in the client library. The exact data structure will depend upon the client library used, but to my knowledge, all implementations store the messages in RAM. Further, with automatic acknowledgements, you have no way of knowing when a specific consumer actually read and processed a message.
So, there are a few takeaways here:
Prefetch limit is irrelevant with automatic acknowledgements, as there are never any unacknowledged messages, thus
Automatic acknowledgements don't make much sense when using a consumer
A sufficiently large prefetch when auto-ack is off, or any use of auto-ack, will result in the message broker not doing any queuing, and instead doing routing only.
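The takeaways above can be illustrated with a toy model. This is not the RabbitMQ client API; PrefetchModel and its methods are hypothetical and only mimic the bookkeeping: the broker delivers while the number of unacknowledged messages is below the prefetch count, and with automatic acknowledgement nothing is ever unacknowledged.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model of prefetch limiting; no RabbitMQ classes involved.
class PrefetchModel {
    private final Deque<String> queue = new ArrayDeque<>();        // still on the broker
    private final Deque<String> clientBuffer = new ArrayDeque<>(); // delivered, held in client RAM
    private final int prefetchCount;
    private final boolean autoAck;
    private int unacked = 0;

    PrefetchModel(int prefetchCount, boolean autoAck) {
        this.prefetchCount = prefetchCount;
        this.autoAck = autoAck;
    }

    void publish(String message) {
        queue.addLast(message);
        deliver();
    }

    // The consumer finishes one message and, in manual mode, acknowledges it.
    void finishOne() {
        if (clientBuffer.pollFirst() == null) {
            return; // nothing was being processed
        }
        if (!autoAck) {
            unacked--;
        }
        deliver();
    }

    // Push messages to the client while the prefetch window allows it.
    private void deliver() {
        while (!queue.isEmpty() && (autoAck || unacked < prefetchCount)) {
            clientBuffer.addLast(queue.pollFirst());
            if (!autoAck) {
                unacked++;
            }
        }
    }

    int onBroker() {
        return queue.size();
    }

    int inClient() {
        return clientBuffer.size();
    }
}
```

With manual acks and a prefetch of 1, publishing three messages leaves two on the broker; with auto-ack, all three end up in the client buffer immediately, regardless of the prefetch value.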
Now, here's a little bit of expert opinion. I find the whole notion of a message broker that "pushes" messages out to be a little backwards, for this very reason: it's difficult to configure properly, and it is unclear what the benefit is. A queue system is a natural fit for a pull-based system. The processor can ask the broker for the next message when it is done processing the current message. This approach will ensure that load is balanced naturally and the messages don't get lost when processors disconnect or get knocked out.
Therefore, my recommendation is to drop the use of consumers altogether and switch over to using basic.get.