I am trying to build a process that invokes an AWS Lambda function, which then uses AWS SNS to send messages that trigger more Lambdas. Each triggered Lambda writes an output file to S3. The process is as depicted below -
My question is this: how can I know that all the Lambdas are done writing their files? I want to execute another process that collects all these files and merges them. I could think of two obvious ways -
Constantly monitor S3 until there are as many output files as SNS messages. Once the total count is reached, invoke the final merging Lambda.
Use a database as a synchronization point: write counts for that particular job/session and keep monitoring it until the count reaches the SNS message count.
Both solutions require constant polling, which I would like to avoid. I want to do this in an event-driven manner. I was hoping Amazon SQS would come to my rescue with some sort of "empty queue Lambda trigger", but SQS only supports triggering Lambdas on new messages. Is there any known way to achieve this in an event-driven manner in AWS? Your suggestions/comments/answers are much appreciated.
I would propose a couple of options here:
Step Functions:
This is a managed service for state machines. It's great for co-ordinating workflows.
Atomic Counting:
If you know the number of things in advance, you could initialize an Atomic Counter in DynamoDB and then atomically decrement it as work completes. Use DynamoDB Streams to trigger a Lambda invocation when the counter is mutated, and trigger your next phase (or end of work) when the counter hits zero. Note that whenever an application creates, updates, or deletes items in the table, DynamoDB Streams writes a stream record, so every mutation of the counter would trigger your Lambda. A minimal sketch of the decrement follows the notes below.
Note that DynamoDB Streams guarantees the following:
Each stream record appears exactly once in the stream.
For each item that is modified in a DynamoDB table, the stream records appear in the same sequence as the actual modifications to the item.
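A minimal sketch of the decrement, assuming the AWS SDK for Java v1 and a hypothetical table "job_counters" with partition key "jobId" and a numeric attribute "remaining" initialized to the number of expected files:

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.ReturnValue;
import com.amazonaws.services.dynamodbv2.model.UpdateItemRequest;
import com.amazonaws.services.dynamodbv2.model.UpdateItemResult;

import java.util.Collections;

public class JobCounter {

    private final AmazonDynamoDB dynamoDb = AmazonDynamoDBClientBuilder.defaultClient();

    // Atomically decrements the counter for a job and returns the remaining count.
    public long decrement(String jobId) {
        UpdateItemRequest request = new UpdateItemRequest()
                .withTableName("job_counters")                      // hypothetical table name
                .withKey(Collections.singletonMap("jobId", new AttributeValue(jobId)))
                .withUpdateExpression("ADD remaining :dec")         // ADD with -1 is an atomic decrement
                .withExpressionAttributeValues(
                        Collections.singletonMap(":dec", new AttributeValue().withN("-1")))
                .withReturnValues(ReturnValue.UPDATED_NEW);

        UpdateItemResult result = dynamoDb.updateItem(request);
        return Long.parseLong(result.getAttributes().get("remaining").getN());
    }
}

Each file-writer Lambda would call decrement() after writing its output file. Because UpdateItem returns the new value, the writer that observes zero could even invoke the merge Lambda directly, which would let you skip the Streams trigger entirely.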
AWS Step Functions (a managed state machine service) would be the obvious choice. AWS has some examples as starting points. I remember one being a looping state that you could probably apply to this use case.
Another idea off the top of my head...
Create an "Orchestration Lambda" that has the list of your files...
The Orchestration Lambda invokes a "File Writer Lambda" in a loop, passing the file info. invokeAsync(InvokeRequest request) returns a Future object, so the Orchestration Lambda can check the Future's state for completion.
Alternatively, the Orchestration Lambda can make a similar call to the "File Writer Lambda" but use the more flexible overload invokeAsync(InvokeRequest request, AsyncHandler asyncHandler). You can make an inner class that implements AsyncHandler and monitor completion there in the Orchestration Lambda; it is a little cleaner than polling all the Futures in a loop.
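A minimal sketch of that AsyncHandler variant, assuming the AWS SDK for Java v1 async client; the function name "file-writer-lambda" and the payload format are hypothetical:

import com.amazonaws.handlers.AsyncHandler;
import com.amazonaws.services.lambda.AWSLambdaAsync;
import com.amazonaws.services.lambda.AWSLambdaAsyncClientBuilder;
import com.amazonaws.services.lambda.model.InvokeRequest;
import com.amazonaws.services.lambda.model.InvokeResult;

import java.util.List;
import java.util.concurrent.CountDownLatch;

public class Orchestrator {

    private final AWSLambdaAsync lambda = AWSLambdaAsyncClientBuilder.defaultClient();

    public void run(List<String> filePayloads) throws InterruptedException {
        CountDownLatch remaining = new CountDownLatch(filePayloads.size());

        for (String payload : filePayloads) {
            InvokeRequest request = new InvokeRequest()
                    .withFunctionName("file-writer-lambda")   // hypothetical function name
                    .withPayload(payload);

            lambda.invokeAsync(request, new AsyncHandler<InvokeRequest, InvokeResult>() {
                @Override
                public void onSuccess(InvokeRequest req, InvokeResult result) {
                    remaining.countDown();
                }

                @Override
                public void onError(Exception e) {
                    remaining.countDown();      // count failures too; real code should retry or record them
                }
            });
        }

        remaining.await();                      // all writers finished (or failed)
        // ... invoke the merge Lambda here
    }
}

Keep in mind that the Orchestration Lambda is billed for the whole time it sits waiting on the latch, which is exactly the coordination cost that Step Functions takes off your hands.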
There are probably many ways to solve this problem, but those are two ideas.
Personally, I prefer the idea with "Step Functions".
But if you want to simplify your architecture, you could create a triggered Lambda function. Choose 'S3 trigger' on the left side of the Lambda function designer and configure it below.
Read more: Using AWS Lambda with Amazon S3
But in this case you have to create a more sophisticated Lambda function which checks that all the appropriate files have been uploaded to S3 and only then starts your merge.
The stated problem seems a suitable candidate for the Saga Pattern.
Basically, a Saga describes any long-running, distributed process.
As mentioned earlier, the AWS platform allows using Step Functions to implement a Saga, as described here.
Related
In my Spring Boot app, customers can submit files. Each customer's files are merged together by a scheduled task that runs every minute. The fact that the merging is performed by a scheduler has a number of drawbacks, e.g. it's difficult to write end-to-end tests, because in the test you have to wait for the scheduler to run before retrieving the result of the merge.
Because of this, I would like to use an event-based approach instead, i.e.
Customer submits a file
An event is published that contains this customer's ID
The merging service listens for these events and performs a merge operation for the customer in the event object
This would have the advantage of triggering the merge operation immediately after there is a file available to merge.
However, there are a number of problems with this approach which I would like some help with
Concurrency
The merging is a reasonably expensive operation. It can take up to 20 seconds, depending on how many files are involved. Therefore the merging will have to happen asynchronously, i.e. not as part of the same thread which publishes the merge event. Also, I don't want to perform multiple merge operations for the same customer concurrently in order to avoid the following scenario
Customer1 saves file2 triggering a merge operation2 for file1 and file2
A very short time later, customer1 saves file3 triggering merge operation3 for file1, file2, and file3
Merge operation3 completes saving merge-file3
Merge operation2 completes overwriting merge-file3 with merge-file2
To avoid this, I plan to process merge operations for the same customer in sequence using locks in the event listener, e.g.
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

import org.springframework.context.ApplicationListener;
import org.springframework.stereotype.Component;

@Component
public class MergeEventListener implements ApplicationListener<MergeEvent> {

    private final ConcurrentMap<String, Lock> customerLocks = new ConcurrentHashMap<>();

    @Override
    public void onApplicationEvent(MergeEvent event) {
        var customerId = event.getCustomerId();
        var customerLock = customerLocks.computeIfAbsent(customerId, key -> new ReentrantLock());
        customerLock.lock();
        try {
            mergeFileForCustomer(customerId);
        } finally {
            customerLock.unlock();   // release even if the merge throws
        }
    }

    private void mergeFileForCustomer(String customerId) {
        // implementation omitted
    }
}
Fault-Tolerance
How do I recover if for example the application shuts down in the middle of a merge operation or an error occurs during a merge operation?
One of the advantages of the scheduled approach is that it contains an implicit retry mechanism, because every time it runs it looks for customers with unmerged files.
Summary
I suspect my proposed solution may be re-implementing (badly) an existing technology for this type of problem, e.g. JMS. Is my proposed solution advisable, or should I use something like JMS instead? The application is hosted on Azure, so I can use any services it offers.
If my solution is advisable, how should I deal with fault-tolerance?
Regarding the concurrency part, I think the approach with locks would work fine, if the number of files submitted per customer (on a given timeframe) is small enough.
You can monitor the number of threads waiting for the lock over time to see if there is a lot of contention. If there is, then maybe you can accumulate a number of merge events (over a specific timeframe) and then run a parallel merge operation, which in fact leads to a solution similar to the one with the scheduler.
In terms of fault-tolerance, an approach based on a message queue would work (I haven't worked with JMS, but I see it's a Java API for message queues).
I would go with a cloud-based message queue (SQS, for example), simply for reliability. The approach would be:
Push merge events into the queue
The merging service reads one event at a time and starts the merge job
When the merge job is finished, the message is removed from the queue
That way, if something goes wrong during the merge process, the message stays in the queue and it will be read again when the app is restarted.
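A minimal sketch of that consumer loop, assuming SQS via the AWS SDK for Java v1; the queue URL and mergeFilesForCustomer(...) are hypothetical. The queue's visibility timeout should comfortably exceed the ~20 second merge duration so a message isn't redelivered while its merge is still running:

import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;
import com.amazonaws.services.sqs.model.ReceiveMessageRequest;

public class MergeQueueConsumer {

    private static final String QUEUE_URL =
            "https://sqs.eu-west-1.amazonaws.com/123456789012/merge-events"; // hypothetical

    private final AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();

    public void poll() {
        ReceiveMessageRequest request = new ReceiveMessageRequest(QUEUE_URL)
                .withMaxNumberOfMessages(1)   // one merge job at a time
                .withWaitTimeSeconds(20);     // long polling

        for (Message message : sqs.receiveMessage(request).getMessages()) {
            String customerId = message.getBody();
            mergeFilesForCustomer(customerId);                        // may take ~20s
            sqs.deleteMessage(QUEUE_URL, message.getReceiptHandle()); // delete only after success
        }
    }

    private void mergeFilesForCustomer(String customerId) {
        // implementation omitted
    }
}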
My thoughts on this matter, after some consideration.
I restricted possible solutions to what's available as Azure managed services, per the OP's specification.
Azure Blob Storage Function Trigger
Because this issue is about storing files, let's start by exploring Blob Storage with a trigger function that fires on file creation. According to the docs, Azure Functions can run for up to 230 seconds and have a default retry count of 5.
But this solution requires that files from a single customer arrive in a manner that does not cause concurrency issues, so let's set this solution aside for now.
Azure Queue Storage
Does not guarantee first-in-first-out (FIFO) ordered delivery, hence it does not meet the requirements.
Storage queues and Service Bus queues - compared and contrasted: https://learn.microsoft.com/en-us/azure/service-bus-messaging/service-bus-azure-and-service-bus-queues-compared-contrasted
Azure Service Bus
Azure Service Bus is a FIFO queue, and seems to meet the requirements.
https://learn.microsoft.com/en-us/azure/service-bus-messaging/service-bus-azure-and-service-bus-queues-compared-contrasted#compare-storage-queues-and-service-bus-queues
From the doc above, we see that large files are not suited as message payloads. To solve this, files may be stored in Azure Blob Storage, and the message will contain the information needed to find the file.
With Azure Service Bus and Azure Blob Storage selected, let's discuss implementation caveats.
Queue Producer
On AWS, the solution for the producer side would have been like this:
Dedicated end-point provides pre-signed URL to customer app
Customer app uploads file to S3
Lambda triggered by S3 object creation inserts message to queue
Unfortunately, Azure doesn't have a pre-signed URL equivalent yet (it has Shared Access Signatures, which are not equivalent), hence file uploads must be done through an endpoint which in turn stores the file in Azure Blob Storage. Since a file-upload endpoint is required anyway, it seems appropriate to let that endpoint also be responsible for inserting messages into the queue.
Queue Consumer
Because file merging takes a significant amount of time (~20 secs), it should be possible to scale out the consumer side. With multiple consumers, we'll have to make sure that a single customer is processed by no more than one consumer instance.
This can be solved by using message sessions: https://learn.microsoft.com/en-us/azure/service-bus-messaging/message-sessions
In order to achieve fault tolerance, the consumer should use peek-lock (as opposed to receive-and-delete) during the file merge and mark the message as completed when the merge is done. When the message is marked as completed, the consumer may also be responsible for removing superfluous files in Blob Storage.
Possible problems with both existing solution and future solution
If customer A starts uploading a huge file #1 and immediately afterwards starts uploading a small file #2, the upload of file #2 may complete before file #1 and cause an out-of-order situation.
I assume that this is an issue that is solved in the existing solution by some kind of locking mechanism or file-name convention.
Spring Boot with Kafka can solve your fault-tolerance problem.
Kafka supports the producer-consumer model: have the customer events posted to Kafka by a producer.
Configure Kafka with replication so that no events are lost.
Use consumers that invoke the merging service for each event.
Once the consumer has read the event for a customerId and the merge has completed, commit the offset.
If a failure occurs in the middle of merging, the offset is not committed, so the event can be read again when the application is started again.
If the merging service can detect a duplicate event from the given data, then reprocessing the same message should not cause any issue (Kafka promises single delivery of the event). Duplicate-event detection is a safety check for an event that was fully processed but whose offset failed to be committed to Kafka.
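A minimal sketch of that consumer, using the Apache Kafka Java client with auto-commit disabled so the offset is only committed after a successful merge; the topic name and the merge call are hypothetical:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class MergeEventConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "merge-service");
        props.put("enable.auto.commit", "false");   // commit manually after the merge succeeds
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("merge-events")); // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    mergeFilesForCustomer(record.value());
                }
                if (!records.isEmpty()) {
                    consumer.commitSync();  // never reached if the merge above threw
                }
            }
        }
    }

    private static void mergeFilesForCustomer(String customerId) {
        // implementation omitted
    }
}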
First, an event-based approach is correct for this scenario. You should use an external broker for pub-sub event messages.
Note that, by default, Spring publishes events synchronously.
Suppose that you have the following services:
App Service
Merge Service
CDC Service (change data capture)
Broker Service (Kafka, RabbitMQ,...)
The main flow is based on the "Outbox Pattern":
App Service saves the event message to an outbox message table
CDC Service watches the outbox table and publishes event messages from it to the Broker Service
Merge Service subscribes to the Broker Service and receives the event messages (messages arrive in order)
Merge Service performs the merge action
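A minimal sketch of the outbox table from step 1, expressed as a JPA entity; the column names are hypothetical. The important point is that this row is inserted in the same database transaction that records the submitted file, so an event can neither be lost nor published for a save that was rolled back:

import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.GenerationType;
import javax.persistence.Id;
import java.time.Instant;

@Entity
public class OutboxMessage {

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    private String aggregateType;   // e.g. "customer"
    private String aggregateId;     // the customer ID the merge service needs
    private String eventType;       // e.g. "FILE_SUBMITTED"
    private String payload;         // serialized event body
    private Instant createdAt = Instant.now();

    // getters/setters omitted
}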
You can use the Eventuate library for this flow.
Furthermore, you can apply DDD to your architecture, using the Axon framework for the CQRS pattern to publish domain events and process them.
Refer to:
Outbox pattern: https://microservices.io/patterns/data/transactional-outbox.html
It really sounds like a stream-processing or ETL tool may do for this job. When you are developing an app and you have some prioritisation/queuing/batching requirement, it is easy to see how you can build a solution with cron + a SQL database, with maybe a queue to decouple doing work from producing work.
This may very well be the easiest thing to build, as you have a lot of granularity and control with this approach. If you believe that you can in fact meet your requirements this way fairly quickly with low risk, you can do so.
There are software components which are more tailored to these tasks, but they do have some learning curves and depend on what PaaS or cloud you may be using. You'll get monitoring, scalability, availability and resiliency out of the box. An open-source or cloud service will take the burden of management off your hands.
What to use will also depend on your priorities and requirements. If you want to go the ETL approach, which is great at batching up jobs, you might want to use something like AWS Glue. If you want prioritization functionality you may want to use multiple queues; it really depends. You'll also want to monitor with a dashboard to see what wait time you should expect for your merge, regardless of the approach.
The DynamoDB documentation says that there are shards, which need to be iterated first; then, for each shard, you need to fetch its records.
The documentation also says:
(If you use the DynamoDB Streams Kinesis Adapter, this is handled for you: Your application will process the shards and stream records in the correct order, and automatically handle new or expired shards, as well as shards that split while the application is running. For more information, see Using the DynamoDB Streams Kinesis Adapter to Process Stream Records.)
OK, but I use Lambda, not Kinesis (or are they related to each other?). If a Lambda function is attached to a DynamoDB stream, should I care about shards or not? Or should I just write the Lambda code and expect that the AWS environment passes some records to that Lambda?
When using Lambda to consume a DynamoDB Stream, the work of polling the API and keeping track of shards is all handled for you automatically. If your table has multiple shards then multiple Lambda functions will be invoked. From your perspective as a developer, you just have to write the code for your Lambda function and the rest is taken care of for you.
In-order processing is still guaranteed by DynamoDB Streams, so with a single shard only one instance of your Lambda function will be invoked at a time. However, with multiple shards you may see multiple instances of your Lambda function running at the same time. This fan-out is transparent and may cause issues or lead to surprising behaviors if you are not aware of it while coding your Lambda function.
For a deeper explanation of how this works I'd recommend the YouTube video AWS re:Invent 2016: Real-time Data Processing Using AWS Lambda (SVR301). While the focus is mostly on Kinesis Streams the same concepts for consuming DynamoDB Streams apply as the technology is nearly identical.
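For illustration, a minimal sketch of the Lambda side, assuming the aws-lambda-java-events library: the function only receives a batch of records, with no shard handling at all:

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.DynamodbEvent;

public class StreamHandler implements RequestHandler<DynamodbEvent, Void> {

    @Override
    public Void handleRequest(DynamodbEvent event, Context context) {
        // Lambda hands you a batch of records; shards, checkpoints and polling are invisible here.
        for (DynamodbEvent.DynamodbStreamRecord record : event.getRecords()) {
            context.getLogger().log(record.getEventName() + ": " + record.getDynamodb().getKeys());
        }
        return null;
    }
}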
We use DynamoDB to process close to a billion records every day, auto-expire those records, and send them to streams.
Everything is taken care of by AWS; we don't need to do anything except configure the stream (what type of image you want) and add the trigger.
The only fine-tuning we did:
When we got more data, we increased the batch size to process records faster and reduce the overhead of a large number of calls to Lambda.
If you are using any external process to iterate over the stream, you might need to do the same.
Reference:
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html
Hope it helps.
It is known that AWS Lambda may reuse previously created handler objects, and it really does (see the FAQ):
Q: Will AWS Lambda reuse function instances?
To improve performance, AWS Lambda may choose to retain an instance of your function and reuse it to serve a subsequent request, rather than creating a new copy. Your code should not assume that this will always happen.
The question is regarding Java concurrency. If I have a class for a handler, say:
public class MyHandler {

    private Foo foo;

    public void handler(Map<String, String> request, Context context) {
        ...
    }
}
so, will it be thread-safe to access and work with the object variable foo here or not?
In other words: may AWS lambda use the same object concurrently for different calls?
EDIT: My function is triggered by an event-based source; specifically, it is invoked by an API Gateway method.
EDIT-2: This kind of question arises when you want to implement some kind of connection pool for external resources, so I want to keep the connection to the external resource as an object variable. It actually works as desired, but I'm afraid of concurrency problems.
EDIT-3: More specifically, I'm wondering: can instances of AWS Lambda handlers share a common heap (memory) or not? I have to specify this additional detail in order to prevent answers that just list obvious, commonly known facts about thread-safe Java objects.
May AWS lambda use same object concurrently for different calls?
Can instances of handlers of AWS lambda share common heap (memory) or not?
A strong, definite NO. Instances of handlers of AWS Lambda cannot even share files (in /tmp).
An AWS Lambda container may not be reused for two or more concurrently existing invocations of a Lambda function, since that would break the isolation requirement:
Q: How does AWS Lambda isolate my code?
Each AWS Lambda function runs in its own isolated environment, with its own resources and file system view.
The section "How Does AWS Lambda Run My Code? The Container Model" in the official description of how lambda functions work states:
After a Lambda function is executed, AWS Lambda maintains the container for some time in anticipation of another Lambda function invocation. In effect, the service freezes the container after a Lambda function completes, and thaws the container for reuse, if AWS Lambda chooses to reuse the container when the Lambda function is invoked again. This container reuse approach has the following implications:

Any declarations in your Lambda function code remain initialized, providing additional optimization when the function is invoked again. For example, if your Lambda function establishes a database connection, instead of reestablishing the connection, the original connection is used in subsequent invocations. You can add logic in your code to check if a connection already exists before creating one.

Each container provides some disk space in the /tmp directory. The directory content remains when the container is frozen, providing transient cache that can be used for multiple invocations. You can add extra code to check if the cache has the data that you stored.

Background processes or callbacks initiated by your Lambda function that did not complete when the function ended resume if AWS Lambda chooses to reuse the container. You should make sure any background processes or callbacks (in case of Node.js) in your code are complete before the code exits.
As you can see, there is absolutely no warning about race conditions between multiple concurrent invocations of a Lambda function when trying to take advantage of container reuse. The only note is "don't rely on it!".
Taking advantage of execution context reuse is definitely a recommended practice when working with AWS Lambda (see AWS Lambda Best Practices). But this does not apply to concurrent executions: for a concurrent execution a new container is created, and thus a new context. In short, for concurrent executions, if one handler changes a value the other won't see the new value.
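A minimal sketch of the usual way to exploit this without concurrency worries, using a DynamoDB client as a stand-in for any expensive connection (the client choice here is just an example): initialize it once per container and null-check it on each invocation. Within one container only one invocation runs at a time, so the field is never touched by two requests at once:

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.lambda.runtime.Context;

import java.util.Map;

public class MyHandler {

    // Survives between invocations served by the same (reused) container.
    private static AmazonDynamoDB client;

    public void handler(Map<String, String> request, Context context) {
        if (client == null) {                 // first invocation in this container
            client = AmazonDynamoDBClientBuilder.defaultClient();
        }
        // ... use client here; keep per-request state in local variables, not in fields
    }
}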
As I see it, there are no concurrency issues related to Lambda: only a single invocation "owns" a container at a time. A second invocation will get another container (or possibly have to wait until the first one becomes free).
BUT I didn't find any guarantee that Java memory-visibility issues cannot happen. In that case, changes made by the first invocation could remain invisible to the second one, or the changes of the first invocation could be written to RAM only after the changes made by the second invocation.
In most cases visibility issues are handled the same way as concurrency issues, so I would suggest developing the Lambda function to be thread-safe (or synchronized), at least as long as AWS doesn't give us a guarantee that they do something on their side to flush CPU state to memory after every invocation.
I am trying to understand when to use Akka Futures and found this article to be a little bit more helpful than the main Akka docs. So it looks like Akka Futures do exactly the same thing as Java 7 Futures. So I ask:
Outside the context of an actor system, what benefits do Akka Futures have over Java Futures? When to use each?
Within the context of an actor system, why ever use an Akka Future? Aren't all actor-to-actor messages asynchronous, concurrent and non-blocking?
Akka Futures implement an asynchronous way of communication, while Java 7 Futures implement a synchronous approach. Yes, they do the same thing - communication - but in quite different ways.
A producer-consumer pair can interact in two ways: synchronously and asynchronously. The synchronous way assumes the consumer has its own thread and performs a blocking operation to get the next produced message, e.g. BlockingQueue.take(). In the asynchronous approach, the consumer does not own a thread; it is just an object with at least two methods: one to store a message and one to process it. The producer calls the store method, just like it calls Queue.put(m) in the synchronous approach, but this method also initiates execution of the consumer's processing method on a common thread pool.
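To illustrate the two styles in plain Java (not the Akka API), here is a small sketch: the synchronous consumer owns a thread and blocks on take(), while the asynchronous consumer is just an object whose processing is scheduled on a shared pool when a message is stored:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SyncVsAsync {

    // Synchronous: the consumer thread blocks until a message arrives.
    static void synchronousConsumer(BlockingQueue<String> queue) throws InterruptedException {
        while (true) {
            String message = queue.take();   // blocking call; the thread is parked while waiting
            System.out.println("processing " + message);
        }
    }

    // Asynchronous: the "consumer" owns no thread; storing a message schedules its processing.
    static class AsyncConsumer {
        private final ExecutorService pool = Executors.newFixedThreadPool(4);

        void store(String message) {         // analogous to Queue.put(m)
            pool.submit(() -> System.out.println("processing " + message));
        }
    }

    public static void main(String[] args) {
        AsyncConsumer consumer = new AsyncConsumer();
        consumer.store("hello");             // returns immediately; processing happens on the pool
        consumer.pool.shutdown();            // let already-submitted work finish, then exit
    }
}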
UPDATE
As for the 2nd question (why ever use an Akka Future):
Creating a Future looks (and is) simpler than creating an Actor; code for a chain of Futures is more compact and easier to follow than the Actor equivalent.
Note, however, that a Future can pass only a single value (message), while an Actor can handle a sequence of messages. But sequences can be handled with Akka Streams. So the question arises: why ever use Akka Actors? I invite more experienced developers to answer this question. Generally, I think if your task can be solved with Futures, then use Futures; else if with Streams, use Streams; else if with Akka Actors, then use Actors; else look for another framework.
For the first part of your question, I agree with Alexei Kaigorodov's answer.
For the second part of your question:
It is useful to use a Future internally when actor responses need to be combined in a very specific way. For example, let's say that the Master actor needs to perform several blocking database queries and then aggregate their results, and so Master sends each query to a Worker and will then aggregate the responses. If the query results can be aggregated in any order (e.g. Master is just summing row counts or whatever) then it makes sense for Worker to send its results to Master via a callback. However, if the results need to be combined in a very specific order then it is easier for each Worker to immediately return a Future and for Master to then go about manipulating these Futures in the correct order. This could be done via callbacks as well, but then Master would need to figure out which query result is which to put them in the correct order and it will be much more difficult to optimize the code (e.g. if the results of query1 can be immediately aggregated with the results of query2 then by using a Future this logic can go directly into the dispatch code where the identities of all queries is already known, whereas using a callback would require Master to identify the query result and also determine if it can aggregate the query with any other query results that have been returned).
I'm using Java to create EC2 instances from within Eclipse. Now I would like to push parts of the application to these instances so that these can process whatever needs processing and then send the results back to my machine.
What I'm trying to do is something along the lines of:
assignWork() {
    workPerformed = workQueue;
    workPerInstance = workQueue / numberOfInstances;
    while (workPerformed > 0) {
        nextInstance.doWork(workPerformed, workPerInstance);
        workPerformed -= workPerInstance;
    }
}

doWork(start, end) {
    while (start > end) {
        // process stuff
        start--;
    }
}
This way I could control exactly how many AMIs to instantiate depending on the volume of work at hand. I could instantiate them, send them specific code to process, and then terminate them as soon as I receive the results.
Is this possible using just the AWS SDK for Java?
It is, but consider that...
If you have SLAs and they fall within the SQS limits (maximum of 4 days), you could consider publishing your task queues to SNS/SQS and using CloudWatch to track the number of needed instances.
If you have a clear division of roles (more like a workflow), the long-running tasks are not of much concern, and you can retry, also consider using AWS SWF instead. It goes a bit beyond an SQS/SNS combo, and I think it could fit nicely with CloudWatch (that's just a theory, I haven't looked further). The main con is the rather painful AWS Flow Framework for writing the workflow processes.
If your workload is predictable (say, around 5K processes to process today), meaning you have no need for real-time processing and you can batch those requests, then consider using Elastic MapReduce for this. Being Hadoop-based, it offers niceties such as being able to resize your cluster on demand, and the obvious benefit of not having any vendor lock-in at all.
Actually, if you want this managed and without many surprises, consider looking at options such as PiCloud and IronWorker. They were really made for situations just like the one you've just described.
If you have only a queue and EC2, you can surely automate that. It only depends on how tightly you want to coordinate these tasks, but I'm sure it's possible.
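For the queue-plus-EC2 route, a minimal sketch of launching and terminating workers with the AWS SDK for Java v1; the AMI ID, instance type and user-data script are hypothetical, and the AMI would have to contain (or bootstrap) your worker code:

import com.amazonaws.services.ec2.AmazonEC2;
import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
import com.amazonaws.services.ec2.model.Instance;
import com.amazonaws.services.ec2.model.RunInstancesRequest;
import com.amazonaws.services.ec2.model.TerminateInstancesRequest;

import java.util.Base64;
import java.util.List;
import java.util.stream.Collectors;

public class WorkerFleet {

    private final AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();

    public List<String> launch(int count) {
        String userData = "#!/bin/bash\n/opt/worker/run.sh\n";   // hypothetical bootstrap script

        RunInstancesRequest request = new RunInstancesRequest()
                .withImageId("ami-0123456789abcdef0")            // hypothetical worker AMI
                .withInstanceType("t3.micro")
                .withMinCount(count)
                .withMaxCount(count)
                .withUserData(Base64.getEncoder().encodeToString(userData.getBytes()));

        return ec2.runInstances(request).getReservation().getInstances().stream()
                .map(Instance::getInstanceId)
                .collect(Collectors.toList());
    }

    public void terminate(List<String> instanceIds) {
        ec2.terminateInstances(new TerminateInstancesRequest().withInstanceIds(instanceIds));
    }
}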