Trigger Lambda by number of SQS messages


I have an SQS queue that will receive a huge number of messages. The messages keep coming to the queue.
I have a use case where, if the number of messages in the queue reaches X (such as 1,000), the system needs to trigger an event to process 1,000 at a time.
The system will fire a series of triggers, each carrying a thousand messages.
For example, if we have 2,300 messages in the queue, we expect 3 triggers to a Lambda function: the first 2 triggers with 1,000 messages each, and the last one with 300 messages.
From my research, a CloudWatch alarm can hook into the SQS metric "NumberOfMessagesReceived" and send to SNS. But I don't know how to configure a series of alarms, one for each 1,000 messages.
Please advise whether AWS supports this use case, or whether there is any customization we can make to achieve it.

So after going through some clarifications in the comments section with the OP, here's my answer (combined with @ChrisPollard's comment):
Achieving what you want with SQS is impossible, because every batch can only contain up to 10 messages. Since you need to process 1000 messages at once, this is definitely a no-go.
@ChrisPollard suggested creating a new record in DynamoDB every time a new file is pushed to S3. This is a very good approach. Increment the partition key by 1 every time and trigger a Lambda through DynamoDB Streams. In your function, run a check against your partition key and, if it equals 1000, run a query against your DynamoDB table filtering the last 1000 updated items (you'll need a Global Secondary Index on your CreatedAt field). Map these items (or use Projections) to create a very minimal JSON that contains only the necessary information. Something like:
[
{
"key": "my-amazing-key",
"bucket": "my-super-cool-bucket"
},
...
]
A JSON like this is only 87 bytes long (if you leave out the square brackets, which won't be repeated, you're left with 83 bytes). If you round it up to 100 bytes per item, 1,000 items come to only around 100 KB of data, so you can still successfully send them as one event to SQS.
Then have one Lambda function subscribe to your SQS queue and finally concatenate the 1,000 files.
Things to keep in mind:
Make sure you really create the createdAt field in DynamoDB. By the time it hits one thousand, new items could have been inserted, so this way you make sure you are reading the 1000 items that you expected.
In your Lambda check, just run batchId % 1000 == 0. This way you don't need to delete anything, saving DynamoDB operations.
Watch out for the execution time of your Lambda. Concatenating 1000 files at once may take a while to run, so I'd run a couple of tests and add 1 minute of headroom on top of the result. I.e., if it usually takes 5 minutes, set your function's timeout to 6 minutes.
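To make the modulo check and the payload-size estimate above concrete, here is a minimal sketch in plain Java. The class and method names are made up for illustration, the DynamoDB and SQS calls are left out entirely, and the JSON is built by hand on the assumption that keys and bucket names contain no characters that need escaping.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any AWS SDK.
final class BatchTrigger {
    static final int BATCH_SIZE = 1000;

    // True when the running counter has just completed a full batch.
    static boolean isBatchBoundary(long counter) {
        return counter > 0 && counter % BATCH_SIZE == 0;
    }

    // Builds the minimal JSON array of {key, bucket} objects.
    // Assumes the values need no JSON escaping.
    static String toPayload(List<String[]> keyBucketPairs) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < keyBucketPairs.size(); i++) {
            String[] pair = keyBucketPairs.get(i);
            if (i > 0) sb.append(',');
            sb.append("{\"key\":\"").append(pair[0])
              .append("\",\"bucket\":\"").append(pair[1]).append("\"}");
        }
        return sb.append(']').toString();
    }

    public static void main(String[] args) {
        List<String[]> items = new ArrayList<>();
        for (int i = 0; i < BATCH_SIZE; i++) {
            items.add(new String[] {"my-amazing-key-" + i, "my-super-cool-bucket"});
        }
        String payload = toPayload(items);
        int bytes = payload.getBytes(StandardCharsets.UTF_8).length;
        System.out.println("boundary at 2000? " + isBatchBoundary(2000));
        System.out.println("payload size in bytes: " + bytes);
    }
}
```

Running main is a quick way to check that a full batch of your real key names stays comfortably small before you send it as a single message.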
If you have new info to share I am happy to edit my answer.

You can add alarms at 1k, 2k, 3k, etc., but that seems clunky.
Is there a reason you're letting the messages batch up? You can make this trigger event-based (when a queue message is added fire my lambda) and get rid of the complications of batching them.

I handled a very similar situation recently. Process A puts objects in an S3 bucket, and every time it does, it puts a message in SQS with the key and bucket details. I have a Lambda which is triggered every hour, but it can be any trigger, like your CloudWatch alarm. Here is what you can do on every trigger:
Read the messages from the queue. SQS allows you to read only 10 messages at a time, so every time you read, add the messages to a list in your Lambda. You also get a receipt handle for every message, which you can use to delete it. Repeat this process until you have read all 1000 messages in your queue. Now you can perform whatever operations are required on your list and feed it to process B in a number of different ways, such as a file in S3 and/or a new queue that process B can read from.
Alternate approach to reading messages: SQS allows you to read only 10 messages at a time, but you can send the optional parameter 'VisibilityTimeout':60, which hides the received messages from the queue for 60 seconds. You can then keep reading until you don't see any more messages in the queue, all while adding them to a list in Lambda to process them. This can be tricky, since you have to try out different values for the visibility timeout based on how long it takes to read 1000 messages. Once you know you have read all the messages, you can use the receipt handles to delete all of them. You can also purge the queue, but you may delete messages that came in during this process and were never read.
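The read loop described above can be sketched without any AWS dependency. In this sketch the `receive` function is a hypothetical stand-in for a call that returns at most the requested number of messages (at most 10 per call); deleting by receipt handle is left out. Keep in mind that a real queue can return fewer messages than requested, or an empty response, even when messages remain, so treating an empty batch as "queue drained" is a simplification.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.IntFunction;

// Hypothetical helper; `receive` stands in for the real queue client.
final class QueueDrainer {
    static final int MAX_PER_RECEIVE = 10; // at most 10 messages per receive call

    // Keeps asking for up to 10 messages until `target` messages are
    // collected or the source returns an empty batch.
    static <T> List<T> drain(IntFunction<List<T>> receive, int target) {
        List<T> collected = new ArrayList<>();
        while (collected.size() < target) {
            int want = Math.min(MAX_PER_RECEIVE, target - collected.size());
            List<T> batch = receive.apply(want);
            if (batch.isEmpty()) {
                break; // assumption: an empty batch means nothing is left
            }
            collected.addAll(batch);
        }
        return collected;
    }

    public static void main(String[] args) {
        // In-memory queue with 2300 messages, as in the question's example.
        Deque<Integer> queue = new ArrayDeque<>();
        for (int i = 0; i < 2300; i++) queue.add(i);
        IntFunction<List<Integer>> receive = n -> {
            List<Integer> out = new ArrayList<>();
            while (out.size() < n && !queue.isEmpty()) out.add(queue.poll());
            return out;
        };
        System.out.println(drain(receive, 1000).size()); // 1000
        System.out.println(drain(receive, 1000).size()); // 1000
        System.out.println(drain(receive, 1000).size()); // 300
    }
}
```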

Related

Polling items from DynamoDB

AWS newbie here.
I have a DynamoDB table and 2+ nodes of Java apps reading from and writing to it. My use case is as follows: the app should fetch N items every X seconds based on a timestamp, process them, then remove them from the DB. Because the app may scale, other nodes might be reading from the DB at the same time, and I want to avoid processing the same items multiple times.
The question is: is there any way to implement something like a poll() method that fetches an item and immediately removes it (as an atomic operation), as if the table were a queue? As far as I have checked, the delete methods that DynamoDBMapper offers do not return the removed item's data.

Consistency is a weak spot of DDB, but that's the price to pay for its scalability.
You said it yourself, you're looking for a queue, so why not use one?
I suggest:
1) Create a Lambda that:
   a) Reads the items
   b) Publishes them to an SQS FIFO queue with message deduplication
   c) Deletes the items from the DB
2) Create an EventBridge schedule to run the Lambda every n minutes
3) Have your nodes poll that queue instead of DDB
For this to work you have to consider a few things regarding timings:
DDB will typically be consistent in under a second, but this isn't guaranteed.
SQS deduplication only works for 5 minutes.
EventBridge only supports minute level granularity, not seconds.
So you can run your Lambda as frequently as once a minute, but you can run your nodes as frequently (or infrequently) as you like.
If you run your Lambda less frequently than every 5 minutes then there is technically a chance of processing an item twice, but this is very unlikely to ever happen (technically this could still happen anyway if DDB took >10 minutes to be consistent, but again, extremely unlikely to ever happen).
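To illustrate why the 5-minute deduplication interval matters for the timing argument above, here is a deliberately simplified in-memory model of a deduplication window. It is not how SQS is implemented; the class name and the "first accepted time" bookkeeping are assumptions made only for this sketch.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

// Simplified model of a deduplication window; not an AWS API.
final class DedupWindow {
    private final Duration window;
    private final Map<String, Instant> acceptedAt = new HashMap<>();

    DedupWindow(Duration window) {
        this.window = window;
    }

    // Returns true if the message is accepted, false if it is dropped
    // because the same ID was accepted less than `window` ago.
    boolean offer(String dedupId, Instant now) {
        Instant seen = acceptedAt.get(dedupId);
        if (seen != null && Duration.between(seen, now).compareTo(window) < 0) {
            return false;
        }
        acceptedAt.put(dedupId, now);
        return true;
    }

    public static void main(String[] args) {
        DedupWindow fifo = new DedupWindow(Duration.ofMinutes(5));
        Instant t0 = Instant.parse("2024-01-01T00:00:00Z");
        System.out.println(fifo.offer("item-1", t0));                   // accepted
        System.out.println(fifo.offer("item-1", t0.plusSeconds(60)));   // dropped as duplicate
        System.out.println(fifo.offer("item-1", t0.plusSeconds(360)));  // accepted again
    }
}
```

The third call shows the risk: if the same item is published again after the window has passed, it goes through a second time.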

My understanding is that you want to read and delete an item atomically; however, that is not possible with DynamoDB.
What is possible is deleting the item and having its value returned, which is more like a delete-then-read. As you correctly pointed out, the Mapper client does not support ReturnValues, but the low-level clients do.
Key keyToDelete = new Key().withHashKeyElement(new AttributeValue("214141"));
DeleteItemRequest dir = new DeleteItemRequest()
    .withTableName("ABC")
    .withKey(keyToDelete)
    .withReturnValues("ALL_OLD");
More info here DeleteItemRequest
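As an in-memory analogy only (this is not DynamoDB code), the delete-then-read idea behaves like removing a key from a concurrent map: the remove returns the old value to exactly one caller, and any later caller gets nothing. With ALL_OLD a node could, in the same spirit, treat a non-empty result as "I claimed this item". The class below is hypothetical and exists just to show that pattern.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// In-memory analogy for "delete and return the old value"; not an AWS API.
final class ClaimByDelete {
    private final Map<String, String> table = new ConcurrentHashMap<>();

    void put(String key, String item) {
        table.put(key, item);
    }

    // Removes the item and returns its previous value.
    // For a given key, only one caller can get a non-empty result.
    Optional<String> poll(String key) {
        return Optional.ofNullable(table.remove(key));
    }

    public static void main(String[] args) {
        ClaimByDelete store = new ClaimByDelete();
        store.put("214141", "payload");
        System.out.println(store.poll("214141")); // Optional[payload]
        System.out.println(store.poll("214141")); // Optional.empty
    }
}
```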

How to dynamically apply scheduled kafka consumer based on topic?

I'm currently struggling with a Kafka consumer that can somehow be scheduled to execute at a future time.
To summarize: I have a large data store (a .csv file) whose records contain 2 columns: timestamp and value. I'm trying to process these values based on their timestamps. The first record has to be consumed instantly by Kafka; the next one should be processed in the future with a delay of 'current record timestamp - previous record timestamp' (not a very big difference, just a few seconds, so the result will be in milliseconds), and so on.
So basically I can't find a way to implement a Kafka consumer that takes each record based on its timestamp and applies that exact delay. I just have to simulate these values, and they have to be inserted into the DB according to that delay to work properly.
I've tried working with threads and executors, but for big data that's not a proper approach.
I tried to create dynamic topics on the producers based on timestamp, then subscribe to them and somehow process them with a queue. It didn't work.
I expect Kafka to consume each record with a delay based on its timestamp.

"I expect Kafka to consume each record with a delay based on its timestamp"
If you need a specific delay between messages, Kafka is not the right solution.
When you send messages to Kafka, in most scenarios you go over the network, which can add its own unpredictable delay. Kafka runs as a separate process, and nobody can guarantee at which moment this process will be ready to receive the next message: the OS could suspend the process, GC could start, etc. This adds another delay which nobody can predict.
Also, Kafka is not designed to operate on the time at which a message was received. It is focused on message ordering, low latency and high throughput, not on timing.
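Whatever ends up doing the replay, the delay the question describes ('current record timestamp - previous record timestamp') is simple to compute on the replaying side. A minimal sketch, assuming timestamps are epoch milliseconds in file order; the class name is made up, and clamping negative gaps to zero is my own choice for out-of-order rows:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for computing per-record replay delays.
final class ReplayDelays {
    // delay[0] = 0; delay[i] = ts[i] - ts[i-1], in milliseconds.
    static List<Long> delaysMillis(List<Long> timestampsMillis) {
        List<Long> delays = new ArrayList<>();
        for (int i = 0; i < timestampsMillis.size(); i++) {
            long gap = (i == 0)
                    ? 0L
                    : timestampsMillis.get(i) - timestampsMillis.get(i - 1);
            delays.add(Math.max(0L, gap)); // clamp out-of-order timestamps to 0
        }
        return delays;
    }

    public static void main(String[] args) {
        List<Long> timestamps = Arrays.asList(1000L, 3000L, 3500L);
        System.out.println(delaysMillis(timestamps)); // [0, 2000, 500]
    }
}
```

Each delay could then be handed to something like Thread.sleep or a ScheduledExecutorService before the insert into the DB, with the caveat from above that the actual timing will still only be approximate.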

Unable to consume all new Kinesis messages

I'm converting a Java application from Kafka to Kinesis. This application runs forever. It sleeps for 30 seconds, then wakes up, runs some HBase queries, consumes and processes any new Kafka messages, then sleeps again.
This works fine in Kafka - that's exactly what the default Consumer does. However this is not the case in Kinesis. Consuming from the KCL requires the KCL consumer to be running at all times, which doesn't work for my needs. I need to be able to consume all new messages as required with a single method call.
The official documentation for the Kinesis Java API says:
You retrieve records from the stream on a per-shard basis. For each shard, and for each batch of records that you retrieve from that shard, you need to obtain a shard iterator.
and
If no records are returned, that means no data records are currently available from this shard at the sequence number referenced by the shard iterator. When this situation occurs, your application should wait for an amount of time
But I don't care about shards! I just want to get all messages since I last consumed, in one method call. And what if my app dies and needs to restart; how will it know where to resume?
Current code:
GetRecordsRequest getRecordsRequest = new GetRecordsRequest();
getRecordsRequest.setShardIterator(TRIM_HORIZON);
getRecordsRequest.setLimit(25);
GetRecordsResult result = client.getRecords(getRecordsRequest);
// Put the result into record list. The result can be empty.
records = result.getRecords();
EDIT
To be clearer, with Kafka I can run:
ConsumerRecords<String, String> records = this.consumer.poll(0);
to get all unconsumed messages. If my app dies and restarts, there's no problem, offsets are taken care of and I'll resume where I left off.
How do I do this in Kinesis?

To answer your question, you can use StockTradeRecordProcessor, which has an option to reset the stats, which in turn lets you consume only the new messages. Refer here to find the implementation of StockTradeRecordProcessor.
Note, however, that this method uses 60-second intervals for the reporting and checkpointing rate, not the 30 seconds your application demands.
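On the "how will it know where to resume" part of the question, the usual idea is to remember the last sequence number consumed per shard and start after it on the next poll. Below is a rough in-memory sketch of that bookkeeping only. The `Source` interface is hypothetical and stands in for the stream client, sequence numbers are modelled as `long` for brevity, and the checkpoint map would have to be persisted somewhere durable in a real application. With the real API, starting after a stored sequence number corresponds to requesting a shard iterator of type AFTER_SEQUENCE_NUMBER.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of per-shard checkpointing; `Source` is a made-up abstraction.
final class CheckpointedPoller {
    static final class Rec {
        final long seq;
        final String data;
        Rec(long seq, String data) {
            this.seq = seq;
            this.data = data;
        }
    }

    interface Source {
        Set<String> shardIds();
        // Records in the shard with a sequence number greater than afterSeq.
        List<Rec> readAfter(String shardId, long afterSeq);
    }

    // Last consumed sequence number per shard (persist this in a real app).
    private final Map<String, Long> checkpoints;

    CheckpointedPoller(Map<String, Long> checkpoints) {
        this.checkpoints = checkpoints;
    }

    // One call returns everything that is new across all shards
    // and advances the checkpoints.
    List<String> pollAll(Source source) {
        List<String> out = new ArrayList<>();
        for (String shard : source.shardIds()) {
            long last = checkpoints.getOrDefault(shard, -1L);
            for (Rec r : source.readAfter(shard, last)) {
                out.add(r.data);
                last = r.seq;
            }
            checkpoints.put(shard, last);
        }
        return out;
    }
}
```

A restarted application would construct the poller from the persisted checkpoint map and carry on from there; this sketch ignores resharding and partial failures.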

AMQ Consumers choose which queue to process

Given the following scenario:
I have a system that creates, updates and deletes records. For each of these actions I need to do something (let's say write the events to a log, as a silly example). However, I need to process these events for each record in order, meaning I can't log the delete before I have done the create or any of the previous updates. I also can't log the update before I have logged the create.
I am investigating queues in order to preserve sequence. However, I don't really want RecordID_2 to be held up behind RecordID_14. The records do not need to be processed in sequence as much as the actions on each record do. Hence I don't think I can/should use one queue.
As I don't have hundreds of RecordID_XX active at the same time, I was thinking of having a queue for each RecordID_XX, so that if several updates come in for one RecordID, each event for that record is added to the same queue and processed in order (i.e. Create first, Update_1 after Create is complete, Update_2 after Update_1 is complete, etc.). If additional events for a different record come in, they are added to their own queue. If a queue is empty for a period of time it simply gets deleted. I realize that this may result in a queue getting one message and then being deleted because there were no updates before the idle timeout expired. (This does not seem at all efficient.)
Based on Andres T Finnell's excellent answer to this question.
I was thinking of doing the following
Producer (Web Service) -> Queue_Original <- Dispatcher -> RecordID_14
                                                       -> RecordID_2
                                                       -> RecordID_8
                                                       -> RecordID_15
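The dispatcher in the diagram above can be modelled in memory to make the per-record ordering concrete. This is only a sketch of the routing idea, not ActiveMQ code: the class and method names are invented, and an emptied queue is removed immediately rather than after an idle timeout.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// In-memory model of "one queue per record ID"; not a broker API.
final class RecordDispatcher {
    // One FIFO queue per record ID, created on first use.
    private final Map<String, Deque<String>> queues = new LinkedHashMap<>();

    void dispatch(String recordId, String event) {
        queues.computeIfAbsent(recordId, id -> new ArrayDeque<>()).add(event);
    }

    // Record IDs that currently have pending events.
    Set<String> destinations() {
        return queues.keySet();
    }

    // Next event for a record in arrival order, or null if there is none.
    // A queue that becomes empty is removed.
    String next(String recordId) {
        Deque<String> queue = queues.get(recordId);
        if (queue == null) {
            return null;
        }
        String event = queue.poll();
        if (queue.isEmpty()) {
            queues.remove(recordId);
        }
        return event;
    }
}
```

Events for RecordID_14 come out in the order they went in, regardless of how many events for other records were dispatched in between.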
Some of the "logging" may take a long time, so I want to be able to have a few consumers listening on these queues.
Let's say I have Consumer_1 and Consumer_2 (I may want to add Consumer_3 later to assist with growing load).
What I would like is for Consumer_1 to do a getDestinations()
where the broker will return [RecordID_14, RecordID_2, RecordID_8, RecordID_15]
Questions:
Is it possible for Consumer_1 to iterate through the list of queues returned from the broker looking for the first available queue that does not have a Consumer_X connected to it and begin processing the 1st message on this queue?
And then each subsequent Consumer to do the same until it finds the next queue without a Consumer connected to it?
Would Advisory-Messages be the thing to use here?
Am I going down the wrong path completely? Is there a better approach to handling this scenario?

Parallelism and Failover of a Sequential Data

Good time guys!
We have a pretty straightforward adapter application: once every 30 seconds it reads records from a database (which it can't write to) of one system, converts each of these records into an internal format, performs filtering, enrichment, ..., and, finally, transforms the resulting, let's say, entities into an XML format and sends them via JMS to another system. Nothing new.
Let's add some spice here: records in the database are sequential (meaning that their identifiers are generated by a sequence), and when it is time to read a new bunch of records, we get a last-processed-sequence-number -- which is stored in our internal database and updated each time the next record is processed (sent to JMS) -- and start reading from that record (+1).
The problem is that our customers gave us an NFR: processing of a bunch of read records must not last longer than 30 seconds. Since there are a lot of steps in the workflow (some of them pretty long running), since it is possible to get a pretty big number of records, and since we process them one by one, it can take more than 30 seconds.
Because of all the above I want to ask 2 questions:
1) Is there an approach to parallel processing of sequential data, maybe with one or several intermediate storages, or the Disruptor pattern, or CQRS-like, or notification-based, or ... that would work in such a system?
2) A general one. I need to store a last-processed-number and send an entity to JMS. If I save the number to the database and then some problem arises with JMS, on application restart my adapter will think that it successfully sent the entity, which is not true, and it will never be received. If I send an entity and after that try to save the number to the database and get an exception, on application restart reprocessing will be performed, which will lead to duplicates in JMS. I'm not sure whether XA transactions will help here, or some kind of last resource gambit...
Could somebody, please, share experience or ideas?
Thanks in advance!

1) 30 seconds is a long time and you can do a lot in that time esp with more than one CPU. Without specifics I can only say it is likely you can make it faster if you profile it and use more CPUs.
2) You can update the database before you send and listen to the JMS queue yourself to see it was received by the broker.

Dimitry - I don't know the details of your problem, so I'm just going to make a set of assumptions. I hope it will at least trigger an idea that leads to the solution.
Here goes:
Grab your list of items to process.
Store the last id (and maybe the starting id)
Process each item on a different thread (suggest using Tasks).
Record any failed item in a local failed queue.
When you grab the next bunch, ensure you process the failed queue first.
Have a way of determining a max number of retries and a way of moving/marking it as permanently failed.
Not sure if that was what you were after. NServiceBus has a retry process where the gap between each retry gets longer up to a point, then it is marked as failed.
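The steps above can be sketched roughly as follows. This is my own minimal illustration, not anything from NServiceBus: the names are invented, items are processed sequentially to keep it short (running each item on its own thread is left out), and there is no back-off between retries.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch of a local failed queue with a retry limit.
final class RetryingProcessor {
    private final int maxAttempts;
    private final Map<Long, Integer> attempts = new HashMap<>();
    private final Deque<Long> failedQueue = new ArrayDeque<>();
    private final List<Long> permanentlyFailed = new ArrayList<>();

    RetryingProcessor(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    // Processes previously failed items first, then the new bunch.
    // `worker` returns true when an item was processed successfully.
    void processBunch(List<Long> newIds, Predicate<Long> worker) {
        List<Long> work = new ArrayList<>(failedQueue);
        failedQueue.clear();
        work.addAll(newIds);
        for (Long id : work) {
            if (worker.test(id)) {
                attempts.remove(id);
                continue;
            }
            int tries = attempts.merge(id, 1, Integer::sum);
            if (tries >= maxAttempts) {
                attempts.remove(id);
                permanentlyFailed.add(id); // give up on this item
            } else {
                failedQueue.add(id); // retry with the next bunch
            }
        }
    }

    List<Long> failedQueue() {
        return new ArrayList<>(failedQueue);
    }

    List<Long> permanentlyFailed() {
        return new ArrayList<>(permanentlyFailed);
    }
}
```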

Folks, finally we ended up with the following solution. We implemented a kind of Actor Model. The idea is the following.
There are two main (internal) database tables for our application. Let's call them READ_DATA_INFO, which contains the last-read-record-number of the 'source' external system, and DUMPED_DATA, which stores metadata about each read record of the source system. This is how it all works: every n seconds (a configurable property) a service bus reads the last processed identifier of the source system and sends a request to the source system to get new records from it. If there are new records, they are wrapped in a DumpRecordBunchMessage message and sent to a DumpActor class. This class begins a transaction which comprises two operations: update the last-read-record-number (the READ_DATA_INFO table) and save metadata about each record (the DUMPED_DATA table). Each dumped record gets the 'NEW' status; when a record is successfully processed, it gets the 'COMPLETED' status, otherwise the 'FAILED' status. If the transaction commits successfully, each of those records is wrapped in a RecordMessage message class and sent to the next processing actor; otherwise those records are just skipped, and they will be reread after the next n seconds.
There are three interesting points:
an application's disaster recovery. What if our application is stopped somehow in the middle of processing? No problem: at application startup (in a @PostConstruct-marked method) we find all the records with the 'NEW' status in the DUMPED_DATA table and, with the help of the stored metadata, restore them from the source system.
parallel processing. After all records are successfully dumped, they become independent, which means that they can be processed in parallel. We introduced several mechanisms for parallelism and load balancing. The simplest one is a round robin approach. Each processing actor consists of a parent actor (load balancer) and a configurable set of its child actors (workers). When a new message arrives in the parent actor's queue, it dispatches it to the next worker.
duplicate record prevention. This is the most interesting one. Let's assume that we read data every 5 seconds. If there is an actor with a long running operation, it is possible to have several attempts to read from the source system's database starting from the same last-read-record number. Thus there would potentially be a lot of duplicate records dumped and processed. In order to prevent this we added a CAS-like check of DumpActor's messages: if the last-read-record from a message is equal to the one from the DUMPED_DATA table, this message should be processed (no messages were processed before it); otherwise this message is rejected. Rather simple, but powerful.
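For readers who want to see the CAS-like check in code, here is a small in-memory sketch of the idea. In our application the comparison happens against the database inside a transaction; the AtomicLong and the class and method names below are simplifications for illustration only.

```java
import java.util.concurrent.atomic.AtomicLong;

// In-memory illustration of the CAS-like duplicate check.
final class DumpGuard {
    private final AtomicLong lastRead;

    DumpGuard(long initialLastRead) {
        this.lastRead = new AtomicLong(initialLastRead);
    }

    // Accepts a bunch only if it was read starting from the current
    // last-read number, then advances the number to newLastRead.
    boolean tryAccept(long readFrom, long newLastRead) {
        return lastRead.compareAndSet(readFrom, newLastRead);
    }

    long lastRead() {
        return lastRead.get();
    }

    public static void main(String[] args) {
        DumpGuard guard = new DumpGuard(100);
        System.out.println(guard.tryAccept(100, 150)); // true: first read from 100
        System.out.println(guard.tryAccept(100, 150)); // false: duplicate read from 100
        System.out.println(guard.tryAccept(150, 180)); // true: next bunch
    }
}
```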
I hope this overview will help somebody. Have a good time!
