I would like to use Kafka to perform bounded batch processing, where the program will know when it is processing the last record.
Batch:
Read a flat file
Send each line as a message to Kafka
Kafka Listener:
Consume messages from Kafka
Insert each record into the database
If it is the last record, mark the batch job as done in the database.
One way is probably to use a single Kafka partition, where FIFO (First In, First Out) ordering is guaranteed, and have the batch program send an isLastRecord flag with the final message.
However, this means processing is restricted to a single thread (a single consumer).
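For illustration, a minimal sketch of that single-partition approach, assuming a topic created with exactly one partition; the file path, topic name, and the way the flag is encoded in the value are all made up here:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Properties;

public class SinglePartitionBatchProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        List<String> lines = Files.readAllLines(Paths.get("batch-input.txt"));

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < lines.size(); i++) {
                boolean isLast = (i == lines.size() - 1);
                // Hypothetical payload format: the consumer parses the flag
                // and marks the batch job done when it sees isLastRecord=true.
                String value = lines.get(i) + "|isLastRecord=" + isLast;
                // Single-partition topic, so send order equals consumption order.
                producer.send(new ProducerRecord<>("batch-topic", value));
            }
        }
    }
}
```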
Question
Is there any way to achieve this with parallel-processing by leveraging multiple Kafka partitions?
If you need in-order guarantees per file, you are restricted to a single partition.
If you have multiple files, you could use different partitions for different files though.
If each line in the file is an insert into a database, I am wondering though if you need in-order guarantee in the first place, or if you can insert all records/lines in any order?
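If you do go the one-partition-per-file route, a rough sketch of how that could look: use the file name as the message key so all lines of one file hash to the same partition (with the default partitioner) and keep their relative order. The topic name and file handling below are assumptions:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

public class PerFilePartitionProducer {
    private final KafkaProducer<String, String> producer;

    public PerFilePartitionProducer(Properties props) {
        this.producer = new KafkaProducer<>(props);
    }

    public void sendFile(Path file) throws Exception {
        String key = file.getFileName().toString();
        for (String line : Files.readAllLines(file)) {
            // Same key -> same partition, so lines of one file stay in order
            // while different files can be processed by different consumers.
            producer.send(new ProducerRecord<>("lines-topic", key, line));
        }
    }
}
```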
A more fundamental question is: why do you need to put the data into Kafka first? Why not read the file and do the inserts directly?
Related
I'm working on a POC where the requirement is to consume messages in a batch from a JMS queue and pass them to a processor and writer in a chain, as in a Spring Batch step. Spring Batch provides JMSItemReader as out-of-the-box functionality, but it looks like it consumes messages one by one until there are no messages left in the queue or the receiver timeout is reached. In a chunk-based step, items are read into chunks, which are processed and then written within a transaction as a chunk into another datastore. Here JMSItemReader doesn't read items in a batch. Is there any solution in the Spring Batch world to consume messages in a batch from a queue to improve the overall performance of the application?
I went through a lot of documentation but didn't find an appropriate solution for this use case. I'd appreciate your help. Thank you.
I believe the BatchMessageListenerContainer is what you are looking for. It allows you to read messages in batches.
Note that this is not part of the standard library of readers/writers, but you can use it as-is or take inspiration from it and adapt it as needed.
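If you'd rather stay with the standard abstractions, another option is a custom ItemReader that drains up to a fixed number of messages per read via JmsTemplate. The class below is my own sketch, not a Spring Batch API; it assumes the receive timeout is configured on the template:

```java
import org.springframework.batch.item.ItemReader;
import org.springframework.jms.core.JmsTemplate;

import java.util.ArrayList;
import java.util.List;

// Each "item" handed to the processor/writer is a small batch of JMS payloads.
public class BatchingJmsReader implements ItemReader<List<Object>> {

    private final JmsTemplate jmsTemplate;
    private final int batchSize;

    public BatchingJmsReader(JmsTemplate jmsTemplate, int batchSize) {
        this.jmsTemplate = jmsTemplate;  // receiveTimeout should be set on the template
        this.batchSize = batchSize;
    }

    @Override
    public List<Object> read() {
        List<Object> batch = new ArrayList<>(batchSize);
        for (int i = 0; i < batchSize; i++) {
            Object payload = jmsTemplate.receiveAndConvert();  // returns null on timeout
            if (payload == null) {
                break;
            }
            batch.add(payload);
        }
        return batch.isEmpty() ? null : batch;  // null signals end of input to Spring Batch
    }
}
```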
I have the following use case:
I have two Kafka topics: one is meant to be used as a stream of incoming messages to be processed, and the other is a store of records meant to bootstrap the application's initial state.
Is there a way to do the following:
Read all messages from the bootstrap topic when the application starts up and store every ConsumerRecord from it in memory, to build the application's initial state
Only after all of those messages have been read, allow the ConsumerRecords from the stream topic to be processed
Since there may be additional records on the state topic later, incorporate them into the application's state while the application is running, without having to restart it
Thanks!
Start your bootstrap consumer first.
Read the bootstrap topic until a particular offset is reached, or, if you want to read to the end, keep polling until no records are returned (this is not the most reliable way!). If you want to start at a particular offset every time, you have to use a seek. Also use a unique consumer group id for this, since you want to read all the records. You might want to handle the rebalance case appropriately.
Then close that consumer, start the stream consumer, and process the data.
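A rough sketch of that bootstrap phase, using manual partition assignment (which sidesteps group rebalancing) and endOffsets() to detect when the state topic has been fully read; the topic name, serde configuration, and the handoff to the stream consumer are assumptions:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

public class StateBootstrapper {

    // Reads the whole state topic into memory before the stream consumer starts.
    public List<ConsumerRecord<String, String>> bootstrap(Properties props) {
        List<ConsumerRecord<String, String>> state = new ArrayList<>();
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Manual assignment: no consumer group rebalancing to worry about.
            List<TopicPartition> partitions = consumer.partitionsFor("state-topic").stream()
                    .map(p -> new TopicPartition(p.topic(), p.partition()))
                    .collect(Collectors.toList());
            consumer.assign(partitions);
            consumer.seekToBeginning(partitions);

            Map<TopicPartition, Long> endOffsets = consumer.endOffsets(partitions);

            // Keep polling until every partition's position has reached its end offset.
            while (endOffsets.entrySet().stream()
                    .anyMatch(e -> consumer.position(e.getKey()) < e.getValue())) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                records.forEach(state::add);
            }
        }
        return state;  // build the in-memory state from this, then start the stream consumer
    }
}
```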
Using KTables with Kafka Streams might be better, but I am not familiar with them.
I am basically trying to use the same Flink pipeline of transformations (with different input parameters to distinguish between real-time and batch modes) to run in both batch mode and real-time mode. I want to use the DataStream API, as most of my transformations depend on it.
My producer is Kafka, and the real-time pipeline works just fine. Now I want to build a batch pipeline with the exact same code, using different topics for batch and real-time mode. How does my batch processor know when to stop processing?
One way I thought of was to add an extra field to the producer record to say this is the last record. However, with multi-partitioned topics, ordering across partitions is not guaranteed (ordering within one partition is guaranteed though).
What is the best practice to design this?
PS: I don't want to use DataSet API.
You can use the DataStream API for batch processing without any issue. Basically, Flink will inject a barrier that marks the end of the stream, so that your application works on finite streams instead of infinite ones.
I am not sure if Kafka is the best solution for the problem to be completely honest.
Generally, when implementing KafkaDeserializationSchema you have the method isEndOfStream() that marks that the stream has finished. Perhaps you could inject an end marker into each partition and simply check whether all of the markers have been read, and then finish the stream. But this would require you to know the number of partitions beforehand.
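To make the idea concrete, here is a rough sketch of such a schema, assuming each partition receives a sentinel value (the literal string "END" here) and assuming a single source subtask; with higher parallelism each subtask only sees its assigned partitions, so the counting would have to be done per assigned partition instead:

```java
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.streaming.connectors.kafka.KafkaDeserializationSchema;
import org.apache.kafka.clients.consumer.ConsumerRecord;

import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicInteger;

public class EndMarkerDeserializationSchema implements KafkaDeserializationSchema<String> {

    private static final String END_MARKER = "END";  // assumed sentinel written by the producer

    private final int expectedPartitions;            // must be known beforehand
    private final AtomicInteger markersSeen = new AtomicInteger();

    public EndMarkerDeserializationSchema(int expectedPartitions) {
        this.expectedPartitions = expectedPartitions;
    }

    @Override
    public String deserialize(ConsumerRecord<byte[], byte[]> record) {
        String value = new String(record.value(), StandardCharsets.UTF_8);
        if (END_MARKER.equals(value)) {
            markersSeen.incrementAndGet();
        }
        return value;
    }

    @Override
    public boolean isEndOfStream(String nextElement) {
        // Stop once an end marker has been seen for every partition.
        return END_MARKER.equals(nextElement) && markersSeen.get() >= expectedPartitions;
    }

    @Override
    public TypeInformation<String> getProducedType() {
        return TypeInformation.of(String.class);
    }
}
```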
There is a program implemented using the producer-consumer pattern. The producer fetches data from the database based on a list of queries and puts it in an ArrayBlockingQueue. The consumer prepares an Excel report based on the data in the ArrayBlockingQueue. To increase performance, I want a dynamic number of producers and consumers. For example, when the producer is slow, have more producers; when the consumer is slow, have more consumers. How can I have dynamic producers and consumers?
If you do this, you must first ask yourself a couple of questions:
How will you make sure that multiple parallel producers put items in the queue in the correct order? This might or might not be possible - it depends on the kind of problem you are dealing with.
How will you make sure that multiple parallel consumers don't "steal" each other's items from the queue? Again, this depends on your problem; in some cases this might be desirable and in others it's forbidden. You didn't provide enough information, but typically if you prepare data for a report, you will need a single consumer that waits until the report data is complete.
Is this actually going to achieve any speedup? Did you actually measure that the bottleneck is I/O bound on the producer side, or are you just assuming? If the bottleneck is CPU-bound, you will not achieve anything.
So, assuming that you need complete data for the report (i.e. a single consumer, which needs the full data), that your data can be "sharded" into independent subsets, and that the bottleneck is in fact what you think it is, you could do it like this (a sketch follows below):
As multiple producers will be producing different parts of the results, they will not be sequential. So a list is not a good option; you would need a data structure where you store interim results and keep track of which ranges have been completed and which are still missing. Possibly, you could use one list per producer as a buffer and have a "merge" thread which writes to a single output list for the consumer.
You need to split the input data into several pieces (one per producer)
You need to somehow track the ordering and ensure that the consumer takes out pieces in the correct order
You can start the consumer the moment the first output piece comes out
You must stop the consumer when the last piece is produced.
In short, this is the kind of problem for which you should probably think about using something like MapReduce.
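If you do go down that road, a minimal sketch of the "shard the input, produce in parallel, consume in order" idea, using an ExecutorService and futures iterated in submission order; the query and report handling are placeholders:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelReportBuilder {

    // Placeholder for "fetch data from the db for one query".
    static List<String> runQuery(String query) {
        return List.of("rows for " + query);
    }

    public static void main(String[] args) throws Exception {
        List<String> queries = List.of("q1", "q2", "q3", "q4");

        // Multiple producers: one task per input shard (query).
        ExecutorService producers = Executors.newFixedThreadPool(4);
        List<Future<List<String>>> futures = new ArrayList<>();
        for (String query : queries) {
            futures.add(producers.submit(() -> runQuery(query)));
        }

        // Single consumer: iterate futures in submission order, so the report
        // sections come out in the original order even though the queries
        // ran in parallel.
        for (Future<List<String>> future : futures) {
            List<String> chunk = future.get();  // waits for this shard if not done yet
            // append `chunk` to the Excel report here
            System.out.println(chunk);
        }
        producers.shutdown();
    }
}
```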
I'm trying to solve the following problem with Kafka.
There is a topic; let's call it src-topic. I receive records from this topic from time to time. I would like to store those values in a KTable and emit the values stored in the KTable every 10 seconds to dst-topic. When I emit a value from this KTable for the first time, I want to append 1 to the record I emit. Every subsequent time I would like to append 0 to the emitted record.
I'm looking for a correct and preferably idiomatic solution to this issue.
One of the solutions I see is to emit a record with 1 appended when I ingest from src-topic, and then store the record with 0 appended in the KTable. Another thread would read from this KTable and emit the records regularly. The problem with this approach is that it has a race condition.
Any advice will be appreciated.
There is no straightforward way to do this. Note that a KTable is a changelog stream (it might have a table state internally -- not all KTables have a state -- but that's an implementation detail).
Thus, a KTable is a stream, and you cannot flush a stream. And because the state (if there is any) is internal, you cannot flush the state either.
You can only access the state via Interactive Queries, which also allow you to do a range scan. However, this will not emit anything downstream; it only exposes the data to the "non-Streams part" of your application.
I think you will need to use the low-level Processor API to get the result you want.
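As an illustration only, a rough sketch using transform() with a plain state store instead of a KTable, plus a wall-clock punctuator that emits the stored values every 10 seconds, appending "1" the first time a key is emitted and "0" afterwards. Topic names, store names, and serdes are assumptions:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.Stores;

import java.time.Duration;

public class PeriodicEmitExample {

    static class PeriodicEmitTransformer implements Transformer<String, String, KeyValue<String, String>> {
        private KeyValueStore<String, String> store;        // latest value per key
        private KeyValueStore<String, String> emittedOnce;  // marker: key was emitted before

        @Override
        @SuppressWarnings("unchecked")
        public void init(ProcessorContext context) {
            store = (KeyValueStore<String, String>) context.getStateStore("values-store");
            emittedOnce = (KeyValueStore<String, String>) context.getStateStore("emitted-store");
            context.schedule(Duration.ofSeconds(10), PunctuationType.WALL_CLOCK_TIME, timestamp -> {
                try (KeyValueIterator<String, String> it = store.all()) {
                    while (it.hasNext()) {
                        KeyValue<String, String> entry = it.next();
                        boolean first = emittedOnce.get(entry.key) == null;
                        // Append 1 on the first emission of a key, 0 on every later one.
                        context.forward(entry.key, entry.value + (first ? "1" : "0"));
                        emittedOnce.put(entry.key, "x");
                    }
                }
            });
        }

        @Override
        public KeyValue<String, String> transform(String key, String value) {
            store.put(key, value);  // just update state; emission happens in the punctuator
            return null;
        }

        @Override
        public void close() {}
    }

    public static void buildTopology(StreamsBuilder builder) {
        builder.addStateStore(Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore("values-store"), Serdes.String(), Serdes.String()));
        builder.addStateStore(Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore("emitted-store"), Serdes.String(), Serdes.String()));

        builder.<String, String>stream("src-topic")
               .transform(PeriodicEmitTransformer::new, "values-store", "emitted-store")
               .to("dst-topic");
    }
}
```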