I'm trying to solve the following problem with Kafka.
There is a topic; let's call it src-topic. I receive records from this topic from time to time. I would like to store those values in a KTable and emit the values stored in the KTable to dst-topic every 10 seconds. When I emit a value from this KTable for the first time, I want to append 1 to the record I emit. Every subsequent time, I would like to append 0 to the emitted record.
I'm looking for a correct and preferably idiomatic solution to this issue.
One of the solutions I see is to emit a record with 1 appended when I ingest from src-topic and then store the record with 0 appended in the KTable. Another thread would then read from this KTable and emit the records regularly. The problem with this approach is that it has a race condition.
Any advice will be appreciated.
There is no straightforward way to do this. Note that a KTable is a changelog stream (it might have a table state internally -- not all KTables have a state -- but that is an implementation detail).
Thus, a KTable is a stream, and you cannot flush a stream. And because the state (if there is any) is internal, you cannot flush the state either.
You can only access the state via Interactive Queries, which also allow you to do a range scan. However, this does not emit anything downstream; it only gives the data to the "non-Streams part" of your application.
I think you will need to use the low-level Processor API to get the result you want.
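For illustration, here is a rough sketch against the old (0.10.x-era) Processor API. The store name, the "flag:value" encoding inside the store, and the topology wiring are all assumptions, not your actual setup. Also note that in those versions punctuate() is driven by stream time (record timestamps), not wall-clock time, so the 10-second schedule only advances as new records arrive; newer releases added wall-clock punctuation.

import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.processor.Processor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

public class PeriodicEmitProcessor implements Processor<String, String> {

    private ProcessorContext context;
    private KeyValueStore<String, String> store;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        this.context = context;
        this.store = (KeyValueStore<String, String>) context.getStateStore("periodic-emit-store");
        context.schedule(10_000L); // ask for punctuate() roughly every 10 seconds
    }

    @Override
    public void process(String key, String value) {
        // Store the latest value; the "1:" prefix marks it as not yet emitted.
        store.put(key, "1:" + value);
    }

    @Override
    public void punctuate(long timestamp) {
        try (KeyValueIterator<String, String> it = store.all()) {
            while (it.hasNext()) {
                KeyValue<String, String> entry = it.next();
                String flag = entry.value.substring(0, 1);  // "1" on the first emit, "0" afterwards
                String payload = entry.value.substring(2);
                context.forward(entry.key, payload + flag); // goes to the sink wired to dst-topic
                if (flag.equals("1")) {
                    store.put(entry.key, "0:" + payload);   // subsequent emits append 0
                }
            }
        }
        context.commit();
    }

    @Override
    public void close() {}
}

You would wire this up with TopologyBuilder: a source on src-topic, this processor with the "periodic-emit-store" state store attached to it, and a sink on dst-topic.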
I'm doing a manual commitSync to Kafka, and I notice the API uses a simple Map and not a MultiMap, so I guess that every time I invoke
consumer.commitSync(Map.of(topicPartition, new OffsetAndMetadata(record.offset())))
it commits just a single offset for that partition.
Is there any chance to send two offsets for the same topicPartition in the same commitSync call?
It's a Map, so no, you cannot have multiple instances of the same topicPartition key.
The offset is a single number. If you were able to commit multiple, then your consumer (in the same group) would always have to start reading from the greatest of those values.
You can, however, commit offsets for several TopicPartitions in one commit call, or commit the same values under a different consumer group by using a differently configured Consumer instance.
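A minimal sketch of the former (the topic name is made up, and consumer, record0 and record1 are assumed to come from your poll loop). Note that, by convention, the committed offset is the offset of the next record you want to read, i.e. record.offset() + 1:

// One commitSync call carrying offsets for two partitions of the same topic.
// Only one offset per TopicPartition is possible, because TopicPartition is the map key.
Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
offsets.put(new TopicPartition("orders", 0), new OffsetAndMetadata(record0.offset() + 1));
offsets.put(new TopicPartition("orders", 1), new OffsetAndMetadata(record1.offset() + 1));
consumer.commitSync(offsets);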
I am trying to commit offsets from my Spark streaming job to Kafka using the following:
OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
// some time later, after outputs have completed
((CanCommitOffsets) stream.inputDStream()).commitAsync(offsetRanges);
as I got from this question:
Spark DStream from Kafka always starts at beginning
And this works fine; offsets are being committed. However, the problem is that this is asynchronous, which means that even after two more offset commits have been sent down the line, Kafka may still hold the offset from two commits earlier. If the consumer crashes at that point and I bring it back up, it starts reading messages which have already been processed.
Now, from other sources, like the comments section here:
https://dzone.com/articles/kafka-clients-at-most-once-at-least-once-exactly-o
I understood that there's no way to commit offsets synchronously from a Spark streaming job (though there is one if I use Kafka Streams). People instead suggest keeping the offsets in the database where you persist the end result of your calculations on the stream.
Now, my question is this:
If I DO store the currently read offset in my database, how do I start reading the stream from exactly that offset the next time?
I researched and found the answer to my question, so I'm posting it here for anyone else who might face the same problem:
Make a Map with org.apache.kafka.common.TopicPartition as the key and Long as the value. The TopicPartition constructor takes two arguments: the topic name and the partition from which you will be reading. The value in the Map is the offset from which you want to start reading the stream.
Map<TopicPartition, Long> startingOffset = new HashMap<>();
startingOffset.put(new TopicPartition("topic_name", 0), 3332980L);
Read the stream contents into an appropriate JavaInputDStream, providing the previously created Map as the offsets argument of ConsumerStrategies.Subscribe().
final JavaInputDStream<ConsumerRecord<String, String>> stream = KafkaUtils.createDirectStream(jssc,
LocationStrategies.PreferConsistent(), ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams, startingOffset));
I have an SQS queue which will receive a huge number of messages, and the messages keep coming.
And I have a use case where if the number of messages in a queue reaches X number (such as 1,000), the system needs to trigger an event to process 1,000 at a time.
The system will fire a series of triggers, each trigger covering a thousand messages.
For example, if we have 2300 messages in a queue, we expect 3 triggers to a lambda function, the first 2 triggers corresponding to 1,000 messages, and the last one will contain 300 messages.
I've been researching and see that a CloudWatch alarm can hook into the SQS metric NumberOfMessagesReceived and publish to SNS, but I don't know how to configure a chain of alarms for each block of 1,000 messages.
Please advise me on whether AWS can support this use case, or what customization we could make to achieve it.
So after going through some clarifications on the comments section with the OP, here's my answer (combined with #ChrisPollard's comment):
Achieving what you want with SQS alone is impossible, because every batch can only contain up to 10 messages. Since you need to process 1,000 messages at once, this is definitely a no-go.
#ChrisPollard suggested creating a new record in DynamoDB every time a new file is pushed to S3. This is a very good approach. Increment the partition key by 1 every time and trigger a Lambda through DynamoDB Streams. In your function, run a check against your partition key and, if it equals 1000, run a query against your DynamoDB table filtering the last 1,000 updated items (you'll need a Global Secondary Index on your CreatedAt field). Map these items (or use Projections) to create a very minimal JSON that contains only the necessary information. Something like:
[
{
"key": "my-amazing-key",
"bucket": "my-super-cool-bucket"
},
...
]
A JSON like this is only 87 bytes long (if you leave the square brackets out, because they won't be repeated, you're left with 83 bytes per entry). If you round that up to 100 bytes, you can still successfully send the whole batch as one SQS message, as 1,000 entries come to only around 100 KB of data, well under the 256 KB SQS message size limit.
Then have one Lambda function subscribe to your SQS queue and, finally, concatenate the 1,000 files.
Things to keep in mind:
Make sure you really create the createdAt field in DynamoDB. By the time it hits one thousand, new items could have been inserted, so this way you make sure you are reading the 1000 items that you expected.
In your Lambda check, just test batchId % 1000 == 0; this way you don't need to delete anything, saving DynamoDB operations (a sketch follows after this list).
Watch out for the execution time of your Lambda. Concatenating 1,000 files at once may take a while to run, so run a couple of tests and add a minute of overhead on top. That is, if it usually takes 5 minutes, set your function's timeout to 6 minutes.
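For what it's worth, here is a hypothetical sketch of that modulo check inside a DynamoDB Streams-triggered Lambda. The batchId attribute name is made up, and the exact model imports differ between aws-lambda-java-events versions, so treat this as an outline rather than drop-in code.

import java.util.Map;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.DynamodbEvent;

public class BatchTriggerHandler implements RequestHandler<DynamodbEvent, Void> {

    @Override
    public Void handleRequest(DynamodbEvent event, Context context) {
        for (DynamodbEvent.DynamodbStreamRecord record : event.getRecords()) {
            Map<String, AttributeValue> newImage = record.getDynamodb().getNewImage();
            if (newImage == null) {
                continue; // e.g. a REMOVE event carries no new image
            }
            long batchId = Long.parseLong(newImage.get("batchId").getN());
            if (batchId % 1000 == 0) {
                // Query the GSI on createdAt for the last 1,000 items,
                // build the minimal JSON shown above, and send it to SQS here.
            }
        }
        return null;
    }
}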
If you have new info to share I am happy to edit my answer.
You can add alarms at 1k, 2k, 3k, etc...but that seems clunky.
Is there a reason you're letting the messages batch up? You can make this trigger event-based (when a queue message is added fire my lambda) and get rid of the complications of batching them.
I handled a very similar situation recently: process A puts objects in an S3 bucket and, every time it does, it puts a message in SQS with the key and bucket details. I have a Lambda which is triggered every hour, but it could be any trigger, like your CloudWatch alarm. Here is what you can do on every trigger:
Read the messages from the queue. SQS only lets you read 10 messages at a time, so each time you read, keep adding them to a list in your Lambda. You also get a receipt handle for every message, which you can use to delete it. Repeat this process until you have read all 1,000 messages in your queue. Then you can perform whatever operations are required on your list and feed it to process B in a number of different ways, for example as a file in S3 and/or a new queue that process B can read from.
Alternate approach to reading messages: SQS only lets you read 10 messages at a time, but you can pass the optional parameter VisibilityTimeout: 60, which hides those messages from the queue for 60 seconds, and then keep reading until you don't see any more messages, all while adding them to a list in the Lambda for processing. This can be tricky, since you have to try out different visibility-timeout values based on how long it takes to read 1,000 messages. Once you know you have read all the messages, you can simply take the receipt handles and delete all of them. You could also purge the queue, but you might then delete messages that arrived during this process and were never read at least once. A rough sketch of this polling loop is below.
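A rough Java sketch of that loop (AWS SDK v1; the queue URL and the 1,000-message target are assumptions):

import java.util.ArrayList;
import java.util.List;
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;
import com.amazonaws.services.sqs.model.ReceiveMessageRequest;

public class DrainQueue {
    public static void main(String[] args) {
        AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
        String queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"; // made up

        List<Message> collected = new ArrayList<>();
        while (collected.size() < 1000) {
            ReceiveMessageRequest request = new ReceiveMessageRequest(queueUrl)
                    .withMaxNumberOfMessages(10)  // SQS hard limit per read
                    .withVisibilityTimeout(60)    // hide read messages for 60 seconds
                    .withWaitTimeSeconds(1);      // short long-poll to avoid a busy loop
            List<Message> batch = sqs.receiveMessage(request).getMessages();
            if (batch.isEmpty()) {
                break; // nothing visible right now
            }
            collected.addAll(batch);
        }

        // ... process 'collected' / hand it to process B, then delete via the receipt handles
        for (Message m : collected) {
            sqs.deleteMessage(queueUrl, m.getReceiptHandle());
        }
    }
}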
I would like to use Kafka to perform bounded batch processing, where the program will know when it is processing the last record.
Batch:
Reading a flat file
Send each line as message to Kafka
Kafka Listener:
Consumes message from Kafka
Insert record into database
If it is the last record, mark batch job as done in database.
One way is probably to use a single Kafka partition, assuming FIFO (First In, First Out) is guaranteed, and have the batch program send an isLastRecord flag.
However, this means the processing will be restricted to a single thread (a single consumer).
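A rough sketch of what I mean (the topic name, file name and flag encoding are just placeholders):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class FileToSinglePartition {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props);
             BufferedReader reader = new BufferedReader(new FileReader("input.txt"))) {
            String line;
            String previous = null;
            while ((line = reader.readLine()) != null) {
                if (previous != null) {
                    // Everything goes to partition 0 so ordering is preserved.
                    producer.send(new ProducerRecord<>("lines", 0, null, previous + "|isLastRecord=false"));
                }
                previous = line;
            }
            if (previous != null) {
                producer.send(new ProducerRecord<>("lines", 0, null, previous + "|isLastRecord=true"));
            }
        }
    }
}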
Question
Is there any way to achieve this with parallel-processing by leveraging multiple Kafka partitions?
If you need in-order guarantees per file, you are restricted to a single partition.
If you have multiple files, you could use different partitions for different files though.
If each line in the file is an insert into a database, I am wondering, though, whether you need an in-order guarantee in the first place, or whether you could insert all records/lines in any order.
A more fundamental question is: why do you need to put the data into Kafka first? Why not read the file and do the insert directly?
Assume that I have a topic with numerous partitions. I'm writing K/V data into it and want to aggregate said data in tumbling windows by key.
Assume that I've launched as many worker instances as I have partitions and each worker instance is running on a separate machine.
How would I go about ensuring that the resultant aggregations include all values for each key? I.e., I don't want each worker instance to have only some subset of the values.
Is this something that a StateStore would be used for? Does Kafka manage this on its own or do I need to come up with a method?
How would I go about ensuring that the resultant aggregations include all values for each key? I.e., I don't want each worker instance to have only some subset of the values.
In general, Kafka Streams ensures that all values for the same key will be processed by the same (and only one) stream task, which also means only one application instance (what you described as "worker instance") will process the values for that key. Note that an app instance may run 1+ stream tasks, but these tasks are isolated.
This behavior is achieved through the partitioning of the data, and Kafka Streams ensures that a partition is always processed by the same and only one stream task. The logical link to keys/values is that, in Kafka and Kafka Streams, a key is always sent to the same partition (there is a gotcha here, but I'm not sure whether it makes sense to go into details for the scope of this question), hence one particular partition -- among possibly many partitions -- contains all the values for the same key.
In some situations, such as when joining two streams A and B, you must ensure that the aggregation operates on the same key, so that data from both streams is co-located in the same stream task -- which, again, means ensuring that the relevant input stream partitions, and thus the matching keys (from A and B, respectively), are made available in the same stream task. A typical method you'd use here is selectKey(). Once that is done, Kafka Streams ensures that, for joining the two streams A and B as well as for creating the joined output stream, all values for the same key are processed by the same stream task and thus the same application instance.
Example:
Stream A has key userId with value { georegion }.
Stream B has key georegion with value { continent, description }.
Joining two streams only works (as of Kafka 0.10.0) when both streams use the same key. In this example, this means that you must re-key (and thus re-partition) stream A so that the resulting key is changed from userId to georegion. Otherwise, as of Kafka 0.10, you can't join A and B because data is not co-located in the stream task that is responsible for actually performing the join.
In this example, you could re-key/re-partition stream A via:
// Kafka 0.10.0.x (latest stable release as of Sep 2016)
A.map((userId, georegion) -> KeyValue.pair(georegion, userId)).through("rekeyed-topic")
// Upcoming versions of Kafka (not released yet)
A.map((userId, georegion) -> KeyValue.pair(georegion, userId))
The through() call is only required in Kafka 0.10.0 to actually trigger re-partitioning; later versions of Kafka will do this automatically for you (this upcoming functionality is already completed and available in Kafka trunk).
Is this something that a StateStore would be used for? Does Kafka manage this on its own or do I need to come up with a method?
In general, no. The behavior above is achieved through partitioning, not through state stores.
Sometimes state stores are involved because of the operations you have defined for a stream, which might explain why you were asking this question. For example, a windowing operation will require state to be managed, and thus a state store will be created behind the scenes. But your actual question -- "ensuring that the resultant aggregations include all values for each key" -- has nothing to do with state stores; it's about the partitioning behavior.
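To illustrate, a tumbling-window count like the one you describe might look as follows in a newer Kafka Streams DSL (2.x-era API; the topic name and window size are made up). The windowed state store is created and sharded per input partition automatically:

// All values for a given key land in the same partition, hence in the same
// stream task, so each windowed count is complete for its key.
StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> input = builder.stream("input-topic");
KTable<Windowed<String>, Long> counts = input
        .groupByKey()
        .windowedBy(TimeWindows.of(Duration.ofMinutes(5)))
        .count();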
With worker instance, I assume you mean a Kafka Streams application instance, right? (Because there is no master/worker pattern in Kafka Streams -- it's a library and not a framework -- we do not use the term "worker".)
If you want to co-locate data per key, you need to partition the data by key. Thus, either your data is already partitioned by key by your external producer when it gets written into the topic in the first place, or you explicitly set a new key within your Kafka Streams application (using, for example, selectKey() or map()) and re-distribute it via a call to through().
(The explicit call to through() will not be necessary in future releases, i.e., 0.10.1; Kafka Streams will re-distribute records automatically if necessary.)
If messages/records should be partitioned by key, the key must not be null. You can also change the partitioning scheme via the producer configuration partitioner.class (see https://kafka.apache.org/documentation.html#producerconfigs).
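A hypothetical example of setting that configuration (com.example.MyCustomPartitioner is a made-up class that would implement org.apache.kafka.clients.producer.Partitioner):

Properties producerConfig = new Properties();
producerConfig.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
producerConfig.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
producerConfig.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
producerConfig.put(ProducerConfig.PARTITIONER_CLASS_CONFIG, "com.example.MyCustomPartitioner"); // hypothetical partitioner
KafkaProducer<String, String> producer = new KafkaProducer<>(producerConfig);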
Partitioning is completely independent from StateStores, even if StateStores are usually used on top of partitioned data.