Is it possible to have a Kafka Streams app that runs through all the data in a topic and then exits?
Example: I'm producing data into topics based on date. The consumer gets kicked off by cron, runs through all the available data, and then... does what? I don't want it to sit and wait for more data. It should just assume it's all there and then exit gracefully.
Possible?
In Kafka Streams (as in other stream processing solutions), there is no "end of data", because it is stream processing in the first place, not batch processing.
Nevertheless, you could watch the "lag" of your Kafka Streams application and shut it down when there is no lag (lag is the number of messages not yet consumed).
For example, you can use bin/kafka-consumer-groups.sh to check the lag of your Streams application (the application ID is used as the consumer group ID). If you want to embed this in your Streams application, you can use kafka.admin.AdminClient to get consumer group information.
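The lag arithmetic itself is small. Below is a minimal sketch in plain Java, assuming you have already fetched the per-partition end offsets and the group's committed offsets by some means; the class and method names (LagCheck, totalLag, caughtUp) are hypothetical and not part of any Kafka API, and a partition without a committed offset is assumed to count from offset 0.

```java
import java.util.Map;

// Hypothetical helper, not a Kafka class: sums the lag of one consumer group.
final class LagCheck {

    // endOffsets: partition number -> log end offset (offset of the next record to be written)
    // committed:  partition number -> offset the group has committed
    // Assumption: a partition with no committed offset is counted from offset 0.
    static long totalLag(Map<Integer, Long> endOffsets, Map<Integer, Long> committed) {
        long lag = 0L;
        for (Map.Entry<Integer, Long> entry : endOffsets.entrySet()) {
            long consumed = committed.getOrDefault(entry.getKey(), 0L);
            // Never let a stale end offset produce a negative contribution.
            lag += Math.max(0L, entry.getValue() - consumed);
        }
        return lag;
    }

    // True when every partition has been fully consumed.
    static boolean caughtUp(Map<Integer, Long> endOffsets, Map<Integer, Long> committed) {
        return totalLag(endOffsets, committed) == 0L;
    }
}
```

A cron-driven job could evaluate caughtUp periodically and close the Streams application once it returns true.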
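You can create a consumer and, once it stops returning data, call consumer.close(). Or, if you want to poll again in the future, call consumer.pause(consumer.assignment()) and call consumer.resume(...) with the same partitions later.
One way to do this is within the consumer poll loop, such as:
ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
if (records.isEmpty()) {
    // Nothing arrived within the timeout: assume the topic is drained and stop.
    consumer.close();
}
Keep in mind that poll returns ConsumerRecords<K, V>, which implements Iterable<ConsumerRecord<K, V>>. An empty result only means that nothing arrived within the poll timeout, so choose a timeout long enough to cover the initial partition assignment.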
Related
I am processing messages from Kafka in a standard processing loop:
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
        processMessage(record);
    }
}
What should I do if my Kafka Consumer gets into a timeout while processing the records? I mean the timeout controlled by the property session.timeout.ms
When this happens, my consumer should stop processing the records, because it would lose its partitions and the records that it processes could be already processed by another consumer. If the original consumer writes some processing results into a database, it could overwrite the records produced by the "new" consumer that got the partitions after my original consumer timed out.
I know about the ConsumerRebalanceListener, but from my understanding its method onPartitionsLost would only be called after I call the poll method from the consumer. Therefore this doesn't help me to stop the processing loop of the batch of records that I received from the previous poll.
I would expect that the heartbeat thread could notify me that it was not able to contact the broker and that we have a session timeout in the consumer, but there doesn't seem to be anything like that...
Am I missing something?
Adding this as an answer as it would be too long in a comment.
Kafka offers a few delivery semantics for processing messages:
At most once;
At least once; and
Exactly once.
You are describing exactly-once semantics (which, by the way, is the least common way of using Kafka). Producers also need to play nicely, because a producer that retries without idempotence enabled can write the same message more than once.
It's a lot more common to build services that use the at-least-once mechanism. With it you can receive (or process) the same message more than once, so you need a way to deduplicate (it's the same idea as idempotency in HTTP APIs). You'll need something unique in the message, and you'll need to record that this id has already been processed. If the payload has nothing you can use for deduplication, you can add a header to the message and use that.
This is also useful in the scenario that you have to reset the offset, so the service can go through old messages without breaking.
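The deduplication idea above can be sketched as follows. This is only an illustration in plain Java: IdempotentHandler and its methods are hypothetical names, and the in-memory set is an assumption made to keep the example short. A real service would more likely record the processed ids in the same database (ideally the same transaction) as the processing result, so that they survive restarts and rebalances.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Consumer;

// Hypothetical handler that applies each message id at most once.
final class IdempotentHandler {
    // Assumption: ids are kept in memory; durable storage is needed in practice.
    private final Set<String> processedIds = new HashSet<>();
    private final Consumer<String> sideEffect;

    IdempotentHandler(Consumer<String> sideEffect) {
        this.sideEffect = sideEffect;
    }

    // Returns true if the payload was applied, false if the id was seen before.
    boolean handle(String messageId, String payload) {
        if (!processedIds.add(messageId)) {
            return false; // duplicate delivery: skip the side effect
        }
        sideEffect.accept(payload);
        return true;
    }
}
```

With this in place, a record that is redelivered after a rebalance or an offset reset is skipped instead of being written twice.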
I suggest searching a bit for details on how to implement the above.
Here's a blog post from Confluent about developing exactly-once semantics, "Improved Robustness and Usability of Exactly-Once Semantics in Apache Kafka", and the Kafka docs explain the different semantics.
About the point of the ConsumerRebalanceListener, you don't need to do anything if you follow the solution of using idempotency in the consumer. Rebalances also happen when an app crashes, and in that scenario the service might have processed some records, but not committed them yet to Kafka.
A mini tip I give to everyone who is starting with Kafka: it looks simple from the outside, but it's a complex technology. Don't use it in production until you know the nitty-gritty details of how it works and have done a good amount of negative testing (unless you are OK with losing data).
I have the following use case:
I have two Kafka topics: one is meant to be used as a stream of incoming messages to be processed, and the other is a store of records used to bootstrap the initial state of the application.
Is there a way to do the following:
Read all messages from the bootstrap topic when the application starts up and store every ConsumerRecord in memory to build the application's initial state.
Only after all of those messages have been read, allow the ConsumerRecords from the stream topic to be processed.
Because additional records may arrive on the state topic later, incorporate them into the application's state while the application is running, without having to restart it.
Thanks!
Start your bootstrap consumer first.
Read the other topic until a particular offset is reached (or, if you want to read to the end, keep reading until a poll returns no records, although this is not the best way). If you want to start at a particular offset every time, you have to use seek. Also use a unique consumer group id for this, since you want to read all the records. You might want to handle the rebalance case appropriately.
Then close that consumer and start the other stream consumer and process the data.
Using KTables with Kafka Streams might be better, but I am not familiar with them.
We have a requirement where we use Kafka Streams to read from a Kafka topic and then send the data over the network through a pool of sessions. However, network calls are sometimes a bit slow, and we frequently need to pause the stream to make sure we are not overloading the network. Currently, we capture data from the stream, hand it to an executor service, and then send it over the network through the session pool.
If the backlog in the executor service grows too large, we need to pause the stream for some time and resume it once the backlog has cleared. To achieve this pause mechanism, we currently close the stream and start it again once the backlog is cleared.
Is there any way we can pause the kafka stream?
If I understand you correctly, there is nothing special you need to do. You are talking about "back pressure" and Kafka Streams can handle it out of the box.
What you can do is put this data into a queue with a maximum size and feed the executor service from that queue. When the queue reaches its limit, there are two cases:
If your call to put data into the queue blocks with no time-out, there is nothing more you need to do. Just wait until the system is back online; your call returns, and processing resumes.
If your call to put data into the queue blocks with a time-out, just check the size of the queue and retry. Repeat this until the system is back online and your call succeeds.
The only caveat is that as long as your Streams application blocks, the internally used Kafka consumer client will not send any heartbeats to Kafka and might time out. Thus, you need to set the time-out configuration parameter higher than the expected maximum downtime of your external system.
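The bounded-queue hand-off described above can be sketched in plain Java as follows. BoundedHandoff and its method names are hypothetical, chosen only for this illustration; the sketch shows the time-out variant, and the no-time-out variant would call queue.put(record) instead.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical hand-off between the stream thread and the network senders.
final class BoundedHandoff {
    private final BlockingQueue<String> queue;

    BoundedHandoff(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    // Blocks for up to timeoutMillis while the queue is full.
    // Returns false if there was still no room, so the caller can retry.
    boolean submit(String record, long timeoutMillis) {
        try {
            return queue.offer(record, timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
    }

    // Called by the sender side; returns null when nothing is queued.
    String next() {
        return queue.poll();
    }

    // Current number of queued records.
    int backlog() {
        return queue.size();
    }
}
```

Because submit blocks the calling thread while the queue is full, the stream slows down on its own whenever the network senders fall behind, which is the back-pressure effect the answer describes.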
Another approach is to use the Processor API available in Kafka Streams, though it is not usually the recommended pattern.
Let me know if it helps!!
I'm currently struggling with a Kafka consumer that needs to schedule the processing of records for a future time.
To summarize: I have a large data store (a .csv file) whose records contain two columns: timestamp and value. I'm trying to process these values based on their timestamps. The first record has to be consumed immediately; the next one should be processed after a delay of 'current record timestamp - previous record timestamp' (the difference is not very big, just a few seconds, so the result will be in milliseconds), and so on.
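The delay rule described above is simple arithmetic, and it can be sketched in plain Java as follows. ReplayDelays and delaysMillis are hypothetical names, and the sketch assumes the timestamps are epoch milliseconds in ascending order.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that turns record timestamps into per-record delays.
final class ReplayDelays {

    // timestampsMillis: one entry per record, assumed to be in ascending order.
    // Returns the wait before each record: 0 for the first one, and the gap to
    // the previous record for every later one.
    static List<Long> delaysMillis(List<Long> timestampsMillis) {
        List<Long> delays = new ArrayList<>();
        for (int i = 0; i < timestampsMillis.size(); i++) {
            long delay = (i == 0)
                    ? 0L
                    : timestampsMillis.get(i) - timestampsMillis.get(i - 1);
            delays.add(Math.max(0L, delay)); // guard against out-of-order input
        }
        return delays;
    }
}
```

For timestamps 1000, 3000 and 3500 this yields delays of 0, 2000 and 500 milliseconds. Each delay could then be handed to a sleep or a scheduler before the record is inserted; the resulting timing would still only be approximate.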
So basically I can't find a way to implement a Kafka consumer that takes each record based on its timestamp and applies that exact delay. I just have to simulate these values, and they have to be inserted into the DB according to that delay to work properly.
I've tried working with threads and executors, but for big data that is not a suitable approach.
I tried to create dynamic topics on the producer side based on timestamp, then subscribe to them and somehow process them with a queue. It didn't work.
I expect Kafka to consume each record with the delay based on timestamp.
"I expect the kafka to consume each record with the delay based on timestamp"
If you need a specific delay between messages, then Kafka is not the right solution.
When you send messages to Kafka, in most scenarios you go over the network, which can add its own unpredictable delay. Kafka runs as a separate process, and nobody can guarantee when that process will be ready to receive the next message. The OS could suspend the process, GC could start, and so on. This adds another delay that nobody can predict.
Also, Kafka is not designed around the time at which a message was received. It is focused on message ordering, low latency, and high throughput, not on timing.
I'm converting a Java application from Kafka to Kinesis. This application runs forever. It sleeps for 30 seconds, then wakes up, runs some HBase queries, consumes and processes any new Kafka messages, then sleeps again.
This works fine in Kafka - that's exactly what the default Consumer does. However this is not the case in Kinesis. Consuming from the KCL requires the KCL consumer to be running at all times, which doesn't work for my needs. I need to be able to consume all new messages as required with a single method call.
The official documentation for the Kinesis Java API says:
You retrieve records from the stream on a per-shard basis. For each shard, and for each batch of records that you retrieve from that shard, you need to obtain a shard iterator.
and
If no records are returned, that means no data records are currently available from this shard at the sequence number referenced by the shard iterator. When this situation occurs, your application should wait for an amount of time
But I don't care about shards! I just want to get all messages since I last consumed, in one method call. And what if my app dies and needs to restart; how will it know where to resume?
Current code:
GetRecordsRequest getRecordsRequest = new GetRecordsRequest();
getRecordsRequest.setShardIterator(TRIM_HORIZON);
getRecordsRequest.setLimit(25);
GetRecordsResult result = client.getRecords(getRecordsRequest);
// Put the result into record list. The result can be empty.
records = result.getRecords();
EDIT
To be clearer, with Kafka I can run:
ConsumerRecords<String, String> records = this.consumer.poll(0);
to get all unconsumed messages. If my app dies and restarts, there's no problem, offsets are taken care of and I'll resume where I left off.
How do I do this in Kinesis?
To answer your question, you can use StockTradeRecordProcessor, which has an option to reset the stats, which in turn lets you consume only the new messages. Refer here to find the implementation of StockTradeRecordProcessor.
Note, however, that this method uses 60-second intervals for the reporting and checkpointing rate, not the 30 seconds your application demands.