I have the following use case:
I have a Kinesis stream containing user data.
I want to read the Kinesis stream based on a user action.
Filter records based on user input, and keep filtering for some time period, let's say 5 minutes.
Keep returning these filtered batches to the user for those 5 minutes.
After the timeout, stop reading from Kinesis.
Question:
Is there a way of reading Kinesis on demand, without any lag, using the KCL or any other library? Let's say I have a KCL JVM app set up that is not currently reading; whenever it gets a user action, it just starts reading.
Similarly, it stops reading after some timeout or a further user action.
I can write logic that does that, but I would like to know if there is anything built into the KCL.
KCL does this for you - as long as the KCL application is running, it will continually read from the stream. Doesn't matter if most of the time your stream returns no data - it'll just keep running and doing nothing until there is data, at which point your processing code kicks in.
You can configure how long it waits when a read returns no records with KinesisClientLibConfiguration.idleTimeBetweenReadsInMillis - the default is 1 second. There are a lot of configuration options here to fine-tune the behavior as needed.
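For reference, here is a minimal sketch of wiring that up with KCL 1.x; the application name, stream name and MyRecordProcessorFactory are placeholders for your own values and record-processor factory:

import com.amazonaws.auth.DefaultAWSCredentialsProviderChain;
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream;
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibConfiguration;
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.Worker;

import java.util.UUID;

public class OnDemandConsumer {
    public static void main(String[] args) {
        KinesisClientLibConfiguration config =
                new KinesisClientLibConfiguration(
                        "my-kcl-app",                       // application (checkpoint table) name - placeholder
                        "user-data-stream",                 // Kinesis stream name - placeholder
                        new DefaultAWSCredentialsProviderChain(),
                        "worker-" + UUID.randomUUID())
                .withInitialPositionInStream(InitialPositionInStream.LATEST)
                .withIdleTimeBetweenReadsInMillis(1000L);   // wait time when a read returns no records

        Worker worker = new Worker.Builder()
                .recordProcessorFactory(new MyRecordProcessorFactory())  // your IRecordProcessorFactory
                .config(config)
                .build();
        worker.run();   // blocks; the worker keeps polling the stream until it is shut down
    }
}

As long as worker.run() is executing, the KCL keeps reading; starting and stopping on demand (your 5-minute window) would still be your own logic, for example shutting the worker down from another thread after the timeout.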
Now, if your streams will be empty a lot of the time, it may be more cost effective to use AWS Lambda to process your streams, which will process your records on-demand without needing to run (and pay for) hardware to continually do read operations from the stream.
Related
We have one requirement where we are using Kafka Streams to read from a Kafka topic and then send the data over the network through a pool of sessions. However, sometimes the network calls are a bit slow and we need to frequently pause the stream to ensure we are not overloading the network. Currently, we capture the data in a stream and hand it to an executor service, which then sends it over the network through the session pool.
If the backlog in the executor service is too high, we need to pause the stream for some time and then resume it once the backlog is cleared up. To achieve this pause mechanism, we are currently closing the stream and starting it again once the backlog is cleared up.
Is there any way we can pause the Kafka stream?
If I understand you correctly, there is nothing special you need to do. You are talking about "back pressure" and Kafka Streams can handle it out of the box.
What can be done is to put this data into a queue with some maximum size and use this queue to feed the executor service. Whenever the queue reaches some threshold, there are two options:
If your call to put data into the queue blocks with no time-out, there is nothing more you need to do. Just wait until the system is back online, your call returns, and processing resumes.
If your call to put data into the queue blocks with a time-out, just check the size of the queue and retry the put. Repeat this until the system is back online and your call succeeds.
The only caveat is that as long as your Streams application blocks, the internally used Kafka consumer client will not call poll() (and, on older clients, not send heartbeats) and might be dropped from the consumer group. Thus, you need to set the corresponding time-out configuration parameter (max.poll.interval.ms on current clients) higher than the expected maximum downtime of your external system.
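A minimal sketch of the blocking-queue approach, assuming a topic named input-topic, a queue capacity of 1,000 and a 10-minute max.poll.interval.ms (all of which you would adjust to your own setup):

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

import java.util.Properties;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BackPressureExample {
    public static void main(String[] args) {
        // Bounded hand-off queue between the Streams thread and the network-sender pool.
        BlockingQueue<String> handOff = new ArrayBlockingQueue<>(1_000);

        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("input-topic").foreach((key, value) -> {
            try {
                // put() blocks while the queue is full, which stalls the Streams thread
                // and naturally throttles consumption from Kafka.
                handOff.put(value);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "network-forwarder");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        // Allow the consumer to block longer than the worst expected downstream stall.
        props.put(StreamsConfig.consumerPrefix("max.poll.interval.ms"), "600000");

        new KafkaStreams(builder.build(), props).start();
    }
}

The executor-service threads drain handOff and make the network calls; as soon as they fall behind, put() blocks and the stream stops consuming, without ever being closed and restarted.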
Another approach is to use the Processor API available in Kafka Streams, though it is not the usually recommended pattern.
Let me know if it helps!!
I have an SQS queue which will receive a huge number of messages. The messages keep coming to the queue.
And I have a use case where, if the number of messages in the queue reaches some number X (such as 1,000), the system needs to trigger an event to process 1,000 at a time.
So the system will fire a series of triggers, each covering a thousand messages.
For example, if we have 2,300 messages in the queue, we expect 3 triggers to a Lambda function: the first 2 triggers corresponding to 1,000 messages each, and the last one containing 300 messages.
I'm researching and see that a CloudWatch alarm can hook up to the SQS metric "NumberOfMessagesReceived" and send to SNS. But I don't know how I can configure a set of alarms, one for each 1,000 messages.
Please advise me whether AWS can support this use case, or what customization we can make to achieve it.
So after going through some clarifications in the comments section with the OP, here's my answer (combined with #ChrisPollard's comment):
Achieving what you want with SQS alone is impossible, because every batch can only contain up to 10 messages. Since you need to process 1,000 messages at once, this is definitely a no-go.
#ChrisPollard suggested creating a new record in DynamoDB every time a new file is pushed to S3. This is a very good approach. Increment the partition key by 1 every time and trigger a Lambda through DynamoDB Streams. In your function, run a check against your partition key and, if it is a multiple of 1,000, run a query against your DynamoDB table filtering on the last 1,000 updated items (you'll need a Global Secondary Index on your CreatedAt field). Map these items (or use Projections) to create a very minimal JSON that contains only the necessary information. Something like:
[
{
"key": "my-amazing-key",
"bucket": "my-super-cool-bucket"
},
...
]
A JSON like this is only 87 bytes long (if you take the square brackets out of the game, because they won't be repeated, you're left with 83 bytes). If you round that up to 100 bytes per entry, 1,000 entries still add up to only around 100 KB, so you can successfully send them all as one SQS message (the limit is 256 KB).
Then have one Lambda function subscribe to your SQS queue and finally concatenate the 1,000 files.
Things to keep in mind:
Make sure you really create the CreatedAt field in DynamoDB. By the time the counter hits one thousand, new items could have been inserted, so this way you make sure you are reading the 1,000 items that you expected.
In your Lambda check, just test batchId % 1000 == 0 (see the sketch after this list); this way you don't need to delete anything, saving DynamoDB operations.
Watch out for the execution time of your Lambda. Concatenating 1,000 files at once may take a while to run, so I'd run a couple of tests and add 1 minute of overhead on top. I.e., if it usually takes 5 minutes, set your function's timeout to 6 minutes.
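As a rough sketch of that check and query (the table name file-events, index name by-created-at and the constant gsiPk partition key are assumptions; since a GSI query needs a partition key, a constant value plus CreatedAt as the sort key is one common way to fetch the newest N items):

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.QueryRequest;

import java.util.Collections;
import java.util.List;
import java.util.Map;

public class BatchCheck {

    private final AmazonDynamoDB dynamo = AmazonDynamoDBClientBuilder.defaultClient();

    // Called from the stream-triggered Lambda with the batchId of the item that was just inserted.
    public List<Map<String, AttributeValue>> onNewItem(long batchId) {
        if (batchId % 1000 != 0) {
            return Collections.emptyList();       // not at a 1,000 boundary yet, nothing to do
        }
        // Fetch the 1,000 most recently created items from the GSI, newest first.
        QueryRequest query = new QueryRequest()
                .withTableName("file-events")                 // placeholder table name
                .withIndexName("by-created-at")               // GSI: gsiPk (constant) + CreatedAt sort key
                .withKeyConditionExpression("gsiPk = :pk")
                .withExpressionAttributeValues(
                        Collections.singletonMap(":pk", new AttributeValue().withS("FILE")))
                .withScanIndexForward(false)                  // descending by CreatedAt
                .withLimit(1000);
        return dynamo.query(query).getItems();
    }
}

The returned items can then be mapped to the minimal JSON above and sent on to SQS.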
If you have new info to share I am happy to edit my answer.
You can add alarms at 1k, 2k, 3k, etc., but that seems clunky.
Is there a reason you're letting the messages batch up? You can make this trigger event-based (when a queue message is added fire my lambda) and get rid of the complications of batching them.
I handled a very similar situation recently: process A puts objects in an S3 bucket, and every time it does, it puts a message in SQS with the key and bucket details. I have a Lambda which is triggered every hour, but it can be any trigger, like your CloudWatch alarm. Here is what you can do on every trigger:
Read the messages from the queue. SQS allows you to read only 10 messages at a time, so every time you read a batch, keep adding the messages to some list in your Lambda. You also get a receipt handle for every message, which you can use to delete it. Repeat this process until you have read all 1,000 messages in your queue. Now you can perform whatever operations are required on your list and feed it to process B in a number of different ways, like a file in S3 and/or a new queue that process B can read from.
Alternate approach to reading messages: since SQS allows you to read only 10 messages at a time, you can send the optional parameter VisibilityTimeout: 60, which hides the received messages from the queue for 60 seconds, and you can keep reading recursively until you don't see any more messages in the queue, all while adding them to a list in the Lambda to process them. This can be tricky, since you have to try out different values for the visibility timeout based on how long it takes to read 1,000 messages. Once you know you have read all the messages, you can simply take the receipt handles and delete all of them. You can also purge the queue, but then you may delete some of the messages that came in during this process and were never read.
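A rough sketch of that read/collect/delete loop with the AWS SDK for Java v1 (the queue URL is a placeholder and the 60-second visibility timeout is an assumption you would tune to how long draining 1,000 messages actually takes):

import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;
import com.amazonaws.services.sqs.model.ReceiveMessageRequest;

import java.util.ArrayList;
import java.util.List;

public class DrainQueue {
    public static List<Message> drain(String queueUrl) {
        AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
        List<Message> collected = new ArrayList<>();
        while (true) {
            ReceiveMessageRequest request = new ReceiveMessageRequest(queueUrl)
                    .withMaxNumberOfMessages(10)   // SQS hard limit per call
                    .withVisibilityTimeout(60)     // hide already-read messages while we keep polling
                    .withWaitTimeSeconds(1);
            List<Message> batch = sqs.receiveMessage(request).getMessages();
            if (batch.isEmpty()) {
                break;                             // nothing visible left that we haven't seen
            }
            collected.addAll(batch);
        }
        // Once processing succeeds, delete each message by its receipt handle.
        for (Message m : collected) {
            sqs.deleteMessage(queueUrl, m.getReceiptHandle());
        }
        return collected;
    }
}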
I'm converting a Java application from Kafka to Kinesis. This application runs forever. It sleeps for 30 seconds, then wakes up, runs some HBase queries, consumes and processes any new Kafka messages, then sleeps again.
This works fine in Kafka - that's exactly what the default consumer does. However, this is not the case with Kinesis. Consuming via the KCL requires the KCL consumer to be running at all times, which doesn't work for my needs. I need to be able to consume all new messages as required with a single method call.
The official documentation for the Kinesis Java API says:
You retrieve records from the stream on a per-shard basis. For each shard, and for each batch of records that you retrieve from that shard, you need to obtain a shard iterator.
and
If no records are returned, that means no data records are currently available from this shard at the sequence number referenced by the shard iterator. When this situation occurs, your application should wait for an amount of time
But I don't care about shards! I just want to get all messages since I last consumed, in one method call. And what if my app dies and needs to restart; how will it know where to resume?
Current code:
// First obtain a shard iterator; TRIM_HORIZON starts at the oldest record in the shard.
GetShardIteratorRequest iteratorRequest = new GetShardIteratorRequest()
        .withStreamName(streamName)                     // my stream
        .withShardId(shardId)                           // the shard I'm reading
        .withShardIteratorType(ShardIteratorType.TRIM_HORIZON);
String shardIterator = client.getShardIterator(iteratorRequest).getShardIterator();
GetRecordsRequest getRecordsRequest = new GetRecordsRequest();
getRecordsRequest.setShardIterator(shardIterator);
getRecordsRequest.setLimit(25);
GetRecordsResult result = client.getRecords(getRecordsRequest);
// Put the result into record list. The result can be empty.
records = result.getRecords();
EDIT
To be clearer, with Kafka I can run:
ConsumerRecords<String, String> records = this.consumer.poll(0);
to get all unconsumed messages. If my app dies and restarts, there's no problem, offsets are taken care of and I'll resume where I left off.
How do I do this in Kinesis?
To answer your question, you can use StockTradeRecordProcessor, which has an option to reset the stats, which in turn enables consuming only the new messages. Refer here to find the implementation of StockTradeRecordProcessor.
On a harder note, this sample uses 60-second intervals for the reporting and checkpointing rate, not the 30 seconds your application demands.
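On the "how will it know where to resume" part of the question: whatever record processor you use (the StockTrade sample or your own), the KCL persists progress through checkpointing, so a restart resumes from the last checkpoint. A generic sketch of that pattern with the KCL 1.x v2 interfaces (class name and error handling are placeholders):

import com.amazonaws.services.kinesis.clientlibrary.interfaces.v2.IRecordProcessor;
import com.amazonaws.services.kinesis.clientlibrary.types.InitializationInput;
import com.amazonaws.services.kinesis.clientlibrary.types.ProcessRecordsInput;
import com.amazonaws.services.kinesis.clientlibrary.types.ShutdownInput;
import com.amazonaws.services.kinesis.model.Record;

public class MyRecordProcessor implements IRecordProcessor {

    @Override
    public void initialize(InitializationInput input) { }

    @Override
    public void processRecords(ProcessRecordsInput input) {
        for (Record record : input.getRecords()) {
            // handle the record (filter, transform, forward, ...)
        }
        try {
            // Persist progress; after a crash or restart, the KCL resumes from this checkpoint.
            input.getCheckpointer().checkpoint();
        } catch (Exception e) {
            // log and retry or skip, as appropriate for your application
        }
    }

    @Override
    public void shutdown(ShutdownInput input) { }
}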
The DynamoDB documentation says that there are shards and they need to be iterated over first; then, for each shard, you need to get its records.
The documentation also says:
(If you use the DynamoDB Streams Kinesis Adapter, this is handled for you: Your application will process the shards and stream records in the correct order, and automatically handle new or expired shards, as well as shards that split while the application is running. For more information, see Using the DynamoDB Streams Kinesis Adapter to Process Stream Records.)
OK, but I use Lambda, not Kinesis (or are they related to each other?), and if a Lambda function is attached to a DynamoDB stream, should I care about shards or not? Or should I just write Lambda code and expect that the AWS environment passes some records to that Lambda?
When using Lambda to consume a DynamoDB stream, the work of polling the API and keeping track of shards is all handled for you automatically. If your table has multiple shards, then multiple Lambda functions will be invoked. From your perspective as a developer, you just have to write the code for your Lambda function and the rest is taken care of for you.
In-order processing is still guaranteed by DynamoDB Streams, so with a single shard only one instance of your Lambda function will be invoked at a time. However, with multiple shards you may see multiple instances of your Lambda function running at the same time. This fan-out is transparent and may cause issues or lead to surprising behavior if you are not aware of it while coding your Lambda function.
For a deeper explanation of how this works I'd recommend the YouTube video AWS re:Invent 2016: Real-time Data Processing Using AWS Lambda (SVR301). While the focus is mostly on Kinesis Streams the same concepts for consuming DynamoDB Streams apply as the technology is nearly identical.
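For illustration, a minimal handler sketch assuming the aws-lambda-java-events library (the class name and logging are placeholders):

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.DynamodbEvent;
import com.amazonaws.services.lambda.runtime.events.DynamodbEvent.DynamodbStreamRecord;

public class StreamHandler implements RequestHandler<DynamodbEvent, Void> {

    @Override
    public Void handleRequest(DynamodbEvent event, Context context) {
        // Lambda hands you a batch of records from one shard, already in order;
        // shard discovery, polling and checkpointing are managed by the service.
        for (DynamodbStreamRecord record : event.getRecords()) {
            context.getLogger().log(record.getEventName() + ": " + record.getDynamodb());
        }
        return null;
    }
}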
We use DynamoDB to process close to a billion records every day, auto-expire those records, and send them to streams.
Everything is taken care of by AWS and we don't need to do anything, except configuring the stream (what type of image you want) and adding triggers.
The only fine-tuning we did:
When we got more data, we just increased the batch size to process faster and reduce the overhead on the number of calls to Lambda.
If you are using any external process to iterate over the stream, you might need to do the same.
Reference:
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html
Hope it helps.
Is it possible to have a Kafka Streams app that runs through all the data in a topic and then exits?
For example, I'm producing data into topics based on date. The consumer gets kicked off by cron, runs through all the available data and then... does what? I don't want it to sit and wait for more data. Just assume it's all there and then exit gracefully.
Possible?
In Kafka Streams (as in other stream processing solutions), there is no "end of data", because it is stream processing in the first place - not batch processing.
Nevertheless, you could watch the "lag" of your Kafka Streams application and shut it down if there is no lag (lag is the number of not-yet-consumed messages).
For example, you can use bin/kafka-consumer-groups.sh to check the lag of your Streams application (the application ID is used as consumer group ID). If you want to embed this in your Streams applications, you can use kafka.admin.AdminClient to get consumer group information.
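A sketch of embedding that lag check in Java; this uses the newer org.apache.kafka.clients.admin.AdminClient rather than the older kafka.admin.AdminClient, and the bootstrap-server address is a placeholder:

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

import java.util.Map;
import java.util.Properties;

public class LagCheck {

    // Total lag of the given Streams application (its application ID is the consumer group ID).
    public static long totalLag(String bootstrapServers, String applicationId) throws Exception {
        Properties adminProps = new Properties();
        adminProps.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);

        try (AdminClient admin = AdminClient.create(adminProps)) {
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets(applicationId)
                         .partitionsToOffsetAndMetadata().get();

            Properties consumerProps = new Properties();
            consumerProps.put("bootstrap.servers", bootstrapServers);
            consumerProps.put("key.deserializer", ByteArrayDeserializer.class.getName());
            consumerProps.put("value.deserializer", ByteArrayDeserializer.class.getName());

            try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(consumerProps)) {
                // Lag per partition = log-end offset minus the group's committed offset.
                Map<TopicPartition, Long> end = consumer.endOffsets(committed.keySet());
                return committed.entrySet().stream()
                        .mapToLong(e -> end.get(e.getKey()) - e.getValue().offset())
                        .sum();
            }
        }
    }
}

When totalLag returns 0 for your application ID, you can call KafkaStreams#close() and let the process exit.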
You can create a consumer and then, once it stops returning data, call consumer.close(). Or, if you want to poll again in the future, just call consumer.pause() and consumer.resume() later (both take the collection of partitions to pause or resume).
One way to do this is within the consumer poll loop, such as:
ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
if (records.isEmpty()) {
    consumer.close();
}
Keep in mind that poll returns ConsumerRecords<K,V>, which implements Iterable<ConsumerRecord<K,V>>.