My system background:
I have a Scala application that reads events from various queue technologies, transforms the data and, as a final step, sends records to Kinesis Firehose.
My Firehose delivery stream currently has an S3 bucket as its destination, but I don't want to couple the solution to that, since it could change in the future.
Problem statement
AWS guarantees a monthly uptime percentage of at least 99.9% (which allows roughly 44 minutes of downtime in a month) - see the Service Level Agreement.
In addition, when quota limits are reached, putRecord returns a ServiceUnavailableException - see PutRecord.
I can't afford to lose events in this flow, so I'm looking for some way to handle Firehose unavailability.
We are thinking about an error handler that will (see the sketch after this list):
Publish to SQS each event that failed to be delivered to Firehose.
When Firehose recovers, SNS will trigger an AWS Lambda function.
The Lambda function will publish all events from SQS back into Firehose.
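A minimal sketch of the fallback publish, assuming the AWS SDK for Java v1 (the application is Scala, but the shape translates directly); the stream name and queue URL are hypothetical placeholders:

import com.amazonaws.services.kinesisfirehose.AmazonKinesisFirehose;
import com.amazonaws.services.kinesisfirehose.AmazonKinesisFirehoseClientBuilder;
import com.amazonaws.services.kinesisfirehose.model.PutRecordRequest;
import com.amazonaws.services.kinesisfirehose.model.Record;
import com.amazonaws.services.kinesisfirehose.model.ServiceUnavailableException;
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class FirehoseWithSqsFallback {
    // Hypothetical names; replace with your real stream and queue.
    private static final String STREAM_NAME = "my-delivery-stream";
    private static final String FALLBACK_QUEUE_URL =
            "https://sqs.us-east-1.amazonaws.com/123456789012/firehose-fallback";

    private final AmazonKinesisFirehose firehose = AmazonKinesisFirehoseClientBuilder.defaultClient();
    private final AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();

    public void publish(String event) {
        try {
            firehose.putRecord(new PutRecordRequest()
                    .withDeliveryStreamName(STREAM_NAME)
                    .withRecord(new Record().withData(
                            ByteBuffer.wrap(event.getBytes(StandardCharsets.UTF_8)))));
        } catch (ServiceUnavailableException e) {
            // Firehose is throttling or unavailable: park the event in SQS.
            // Base64-encode so arbitrary payloads survive SQS character restrictions.
            sqs.sendMessage(FALLBACK_QUEUE_URL,
                    Base64.getEncoder().encodeToString(event.getBytes(StandardCharsets.UTF_8)));
        }
    }
}

A production version would likely retry with exponential backoff before falling back, and use putRecordBatch for throughput; the Lambda on the recovery path can then drain the queue through the same publish path.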
My question
Is this solution good enough, and is there a best practice for handling errors when calling putRecord on Firehose?
Related
We currently have a use case where we want to process some messages at a later point in time, after certain conditions are met.
Is it possible to unacknowledge some Pub/Sub messages in an Apache Beam pipeline so that they become available again after the acknowledgement deadline expires and can be processed later?
You can't unack messages with Apache Beam. Once messages are correctly ingested into the pipeline, they are acked automatically.
You can keep them in the pipeline and reprocess them until the conditions are met, but you risk congestion and wasting Dataflow resources for nothing. It could be better to filter the messages upstream, for instance in a Cloud Function that nacks the messages that aren't valid and publishes the valid ones to a target Pub/Sub topic.
As an alternative to #guillaume's suggestion, you can also store the "to-be-processed-later" messages (in raw format) in storage mediums such as BigQuery or Cloud Bigtable. All the messages will be acked by the pipeline and then the segregation can be done inside the pipeline where the "valid" messages are processed as usual while the "invalid" messages are preserved in storage for future processing.
Once the processing conditions are satisfied, the "invalid" messages can be retrieved from the storage medium, processed, and then deleted from storage. This is a viable solution particularly if the "invalid" messages will be processed after Pub/Sub's message retention period, which is 7 days by default.
The above workflow is inspired by this section of the Google Cloud blog. I considered the "invalid" messages to be "bad" data.
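To make the in-pipeline segregation concrete, here is a minimal Beam (Java) sketch using tagged outputs; isValid is a hypothetical predicate, and the BigQuery/Bigtable write for the invalid branch is left out:

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TupleTagList;

public class SplitMessages {
    // Anonymous subclasses preserve the element type at runtime.
    static final TupleTag<String> VALID = new TupleTag<String>() {};
    static final TupleTag<String> INVALID = new TupleTag<String>() {};

    /** Hypothetical business condition deciding whether a message is processable now. */
    static boolean isValid(String msg) {
        return !msg.isEmpty();
    }

    static PCollectionTuple split(PCollection<String> messages) {
        return messages.apply("SplitValidInvalid", ParDo.of(new DoFn<String, String>() {
            @ProcessElement
            public void processElement(@Element String msg, MultiOutputReceiver out) {
                if (isValid(msg)) {
                    out.get(VALID).output(msg);   // main output: process as usual
                } else {
                    out.get(INVALID).output(msg); // side output: persist for later
                }
            }
        }).withOutputTags(VALID, TupleTagList.of(INVALID)));
    }
}

Downstream, split.get(VALID) and split.get(INVALID) go their separate ways: the former through the normal processing transforms, the latter to the storage medium of your choice.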
The Java application posts async jobs to AWS and gets back a JobID. When the async job is finished, a message will appear in an SQS queue with that JobID. Each JobID is handled by a different thread. Each of those threads also polls SQS for messages until it finds the message which contains its JobID. Additionally, the application is distributed into multiple services so there can't be a single SQS processor.
I saw that SQS returns a maximum of 10 messages and after they are returned, a visibility timeout is applied so that they are not re-sent to other consumers. However, my consumers are the threads that want to consume only a single message and let the rest be consumed by other threads. Should I set the visibility timeout to 0? Will this make it so all consumers get the same set of 10 messages on every request? What's the best way for each consumer to sift through all the messages and find the one it wants?
TL;DR: SQS has 100 messages and there are 100 consumers, one for each message. How should I go about having each consumer find the message it wants (based on a JobID).
EDIT: I know that this is not an appropriate usage of SQS and I'd be very glad to not use it at all but our main integration is with Amazon Textract for which it is mandatory to use SQS for its asynchronous operations. Each Textract request is processed by a different thread which means that they each need to get back a specific SQS message, consumers are not universal. Not to mention the possibility of a clustered environment for which I'd like to avoid having to do any synchronization...
EDIT 2: This is for an on-premises, Setup.exe based, dev-hands-off application where we want to minimize the amount of unneeded AWS services used (both for cost and for customer setup/maintenance reasons) as well as the use of external components, again to minimize customer deployment/maintenance/servers. I understand that we are living in the world of microservices but there are still applications that want to benefit from intelligent services without being cloud-native themselves.
This is not an appropriate architecture for using Amazon SQS. Your processes should not be trying to find a specific message from an Amazon SQS queue.
You should find a different architecture for this message-passing task. Some ideas:
Create a message in Amazon S3 with an 'expected' Key. Have each thread look for that object as a return message. (This is effectively using Amazon S3 as a key-value store.)
Have a single Lambda function retrieve messages from SQS and update a database (or S3 as above). Then, have the threads consult the database instead of SQS.
I think you need to put something between SQS and your threads, like a DynamoDB table. You could have a Lambda function that processes all the SQS messages and simply translates them into DynamoDB records. Your different threads could then easily check for the specific records they are interested in with a DynamoDB query (see the sketch below).
Just because Textract mandates that you use SQS doesn't mean the final step in your architecture has to read those messages directly from SQS. In this case SQS is just a message bus that can integrate with other services in AWS, and those services are your building blocks you can use to create the architecture you need.
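A rough sketch of that translation step, assuming the aws-lambda-java-events library and a hypothetical "textract-jobs" table keyed on jobId:

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.SQSEvent;

import java.util.HashMap;
import java.util.Map;

/** Translates Textract completion messages from SQS into DynamoDB records. */
public class SqsToDynamoHandler implements RequestHandler<SQSEvent, Void> {
    private final AmazonDynamoDB dynamo = AmazonDynamoDBClientBuilder.defaultClient();

    @Override
    public Void handleRequest(SQSEvent event, Context context) {
        for (SQSEvent.SQSMessage msg : event.getRecords()) {
            String jobId = extractJobId(msg.getBody());
            Map<String, AttributeValue> item = new HashMap<>();
            item.put("jobId", new AttributeValue(jobId));
            item.put("body", new AttributeValue(msg.getBody()));
            dynamo.putItem("textract-jobs", item);
        }
        return null;
    }

    /** Placeholder for parsing the JobID out of the Textract notification JSON. */
    private String extractJobId(String body) {
        throw new UnsupportedOperationException("parse JobId from the message body");
    }
}

Each worker thread then issues a GetItem on its own jobId instead of competing with other threads for SQS messages.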
I have a long-running AWS Lambda function that I execute from my webapp. Using the documentation [1], it works fine. However, this particular Lambda function does not return anything to the application; its output is saved to S3, and it runs for a long time (20-30s). Is there a way to trigger the Lambda without waiting for the return value, since I don't want to block my app while the Lambda is running? Right now I am using an ExecutorService as a queue to execute Lambda requests, since I have to wait for each invocation; when the app crashes or restarts, I lose jobs that are waiting to be executed.
[1] https://aws.amazon.com/blogs/developer/invoking-aws-lambda-functions-from-java/
Tracking status is not necessarily a difficult issue. Use a simple S3 "file exists" call after each job execution to know whether the Lambda is done.
However, as you've pointed out, you might lose job information at some point. To remove this issue, you need a persistence layer outside your JVM. A KV store would work: store (timestamp, jobId, status) records in a database, check them periodically from your web server, and update them only from the Lambda.
Alternatively, to reduce the end-to-end time frame further, a queuing mechanism would be better (unless you also want the full history of jobs, but that can be built alongside the queue). As mentioned in the comments, AWS offers many built-in solutions that can be used directly with Lambda, or you can run additional infrastructure like RabbitMQ / Redis to build a task event bus.
With that, Lambda becomes optional. You'd periodically pull events off into worker queues, whose consumers can either be very dumb passthroughs that invoke the Lambda, or do the work themselves directly. Combine this with ECS/EKS/EC2 autoscaling and it might actually run faster than Lambda, since you can scale in/out based on queue size. Then you write the output events to a success/error notification "channel" after the S3 file is written.
Back in the web server, you'll have to modify the code to listen for messages asynchronously on that channel; when you get a success message, you'll know you can access the S3 resources.
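On the "don't wait" part specifically: the AWS SDK for Java can invoke a Lambda asynchronously by setting the invocation type to Event, in which case the call returns immediately with HTTP 202 and no payload. A minimal sketch, with hypothetical function, bucket, and key names, combining this with the S3 "file exists" check suggested above:

import com.amazonaws.services.lambda.AWSLambda;
import com.amazonaws.services.lambda.AWSLambdaClientBuilder;
import com.amazonaws.services.lambda.model.InvocationType;
import com.amazonaws.services.lambda.model.InvokeRequest;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class FireAndForget {
    public static void main(String[] args) {
        AWSLambda lambda = AWSLambdaClientBuilder.defaultClient();

        // InvocationType.Event = asynchronous: Lambda queues the invocation
        // internally and the call returns immediately; nothing to wait on.
        lambda.invoke(new InvokeRequest()
                .withFunctionName("my-long-running-function")
                .withInvocationType(InvocationType.Event)
                .withPayload("{\"jobId\":\"123\"}"));

        // Later, poll S3 to see whether the job has produced its output.
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        boolean done = s3.doesObjectExist("my-output-bucket", "jobs/123/result.json");
        System.out.println("job finished: " + done);
    }
}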
I'm using AWS SDK for Java.
Imagine I create an RDS instance as described in the AWS documentation.
AmazonRDS client = AmazonRDSClientBuilder.standard().build();

CreateDBInstanceRequest request = new CreateDBInstanceRequest()
        .withDBInstanceIdentifier("mymysqlinstance")
        .withAllocatedStorage(5)
        .withDBInstanceClass("db.t2.micro")
        .withEngine("MySQL")
        .withMasterUsername("MyUser")
        .withMasterUserPassword("MyPassword");

DBInstance response = client.createDBInstance(request);
If I call response.getEndpoint() right after making the request, it returns null, because AWS is still creating the database. I need the endpoint once it becomes available, but I can't figure out how to get it.
Is there a way, using the AWS SDK, to be notified when the instance has finally been created?
You can use the RDS SNS notifications:
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_Events.html#USER_Events.Messages
Subscribing to Amazon RDS Event Notification
You can create an Amazon RDS event notification subscription so you can be notified when an event occurs for a given DB instance, DB snapshot, DB security group, or DB parameter group. The simplest way to create a subscription is with the RDS console. If you choose to create event notification subscriptions using the CLI or API, you must create an Amazon Simple Notification Service topic and subscribe to that topic with the Amazon SNS console or Amazon SNS API. You will also need to retain the Amazon Resource Name (ARN) of the topic because it is used when submitting CLI commands or API actions. For information on creating an SNS topic and subscribing to it, see Getting Started with Amazon SNS.
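If you'd rather not wire up SNS, a simple alternative is to poll describeDBInstances from the SDK until the status becomes "available". A crude sketch, reusing the identifier from the question:

import com.amazonaws.services.rds.AmazonRDS;
import com.amazonaws.services.rds.AmazonRDSClientBuilder;
import com.amazonaws.services.rds.model.DBInstance;
import com.amazonaws.services.rds.model.DescribeDBInstancesRequest;

public class WaitForRds {
    public static void main(String[] args) throws InterruptedException {
        AmazonRDS client = AmazonRDSClientBuilder.standard().build();
        DescribeDBInstancesRequest describe = new DescribeDBInstancesRequest()
                .withDBInstanceIdentifier("mymysqlinstance");

        while (true) {
            DBInstance instance = client.describeDBInstances(describe)
                    .getDBInstances().get(0);
            // The endpoint is only populated once the instance is "available".
            if ("available".equals(instance.getDBInstanceStatus())) {
                System.out.println("Endpoint: " + instance.getEndpoint().getAddress());
                break;
            }
            Thread.sleep(30_000); // creation typically takes several minutes
        }
    }
}

Recent versions of the v1 SDK also expose a waiters API (client.waiters().dBInstanceAvailable()) that encapsulates this polling loop, if your SDK version includes it.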
Disclaimer: Opinionated Answer
IMO creating infrastructure at runtime in code like this is the devil's work. CloudFormation stacks are the way to go here: they are much more modular, and you get some of the following benefits:
If you start creating more than one table per customer you will be able to logically group them into a stack and clean them up more easily as needed
If for some reason the creation of a resource fails you can see this very easily in the stack console
Management is much easier, as searching through stacks is done in a console already built for you
Updating a stack in AWS is much easier as well than updating tables individually
MOST IMPORTANT: If an error occurs, the stack functionality already has rollback and redundancy built in, and you control its behaviour. If something goes wrong in your code during your onboarding process, it will be a mess to clean up: what if one table succeeded and the other didn't? You will have to trawl through logs (if they exist) to find out what happened.
You can also combine this approach with something like AWS Pipelines or even AWS Simple Workflow Service to add custom steps to your onboarding process, e.g. run a Lambda function, send a notification when completed, wait for some payment. This builds on my last point: if the pipeline does fail, you will be able to see which step failed and why, and you will also be able to see if things time out.
Lastly, I want to advise caution in creating infrastructure per customer. It's much more work and adds a lot more ways in which things can break. Also make sure you put limits in place in AWS so you don't end up in a situation where your bill sky-rockets because of some bug creating infrastructure.
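To make the stack suggestion concrete, creating a per-customer stack from the AWS SDK for Java is only a few lines; the stack name and template loading below are hypothetical, and the template body is where the customer's tables would actually be declared:

import com.amazonaws.services.cloudformation.AmazonCloudFormation;
import com.amazonaws.services.cloudformation.AmazonCloudFormationClientBuilder;
import com.amazonaws.services.cloudformation.model.CreateStackRequest;

public class CreateCustomerStack {
    public static void main(String[] args) {
        AmazonCloudFormation cfn = AmazonCloudFormationClientBuilder.defaultClient();

        // One stack per customer groups all of that customer's resources,
        // so rollback and cleanup are handled by CloudFormation, not your code.
        cfn.createStack(new CreateStackRequest()
                .withStackName("customer-42-resources")   // hypothetical name
                .withTemplateBody(loadTemplate()));       // your CFN template JSON/YAML
    }

    /** Placeholder: load the CloudFormation template that declares the tables. */
    private static String loadTemplate() {
        throw new UnsupportedOperationException("load template from resources");
    }
}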
I am using the Java EWS API to connect my application to MS Exchange and read user email requests, which are then processed through the system workflow. The volume is low - at most 50 emails a day - but I am looking for an efficient and reliable mechanism to read from the Exchange server using the EWS API. Note also that once an email is processed we move it to a subfolder, so the Inbox only contains unprocessed requests.
As I understand it, the following schemes can currently be used to connect to the Exchange server and perform operations on the mailbox.
Polling - Connect to Exchange using the standard ExchangeService interface, find all new emails, and process them in sequence. The client has better control over failures and over synchronization between reads and moves to the processed folders. On the downside, the experience isn't real time, and connections are made to Exchange even when there isn't any activity.
Pull Notifications - Almost identical to the previous approach: subscribe to pull notifications with an interval and read emails from the Inbox whenever the timer fires. Pros and cons are similar to approach 1.
Push Notifications - Clients subscribe to the Exchange server to receive push notifications by registering for particular events and defining a callback mechanism (a client web service) to receive them. On the upside, the notifications are near real time and connections are made only when there are events. On the downside, subscriptions and the watermark need to be managed on the client side so that events aren't lost. I'm also not sure this is a reliable approach: what happens to messages that are already in the Inbox before the subscription is established? Will those events be replayed when the server starts? It's not clear.
Streaming Subscription - Clients establish a streaming connection and keep it open for a maximum of 30 minutes, during which Exchange notifies them of any registered events. Once the connection dies, it can be restored so the subscription stays alive. This seemed like the best approach until I learned that additional steps to sync folder items and maintain sync state are required at regular intervals so that events aren't missed across connects/disconnects.
Given my needs (read emails from the Exchange server reliably) and this analysis of the options, I feel that approach 1 is simple and more reliable, as it gives better control over the entire process. At the same time, I wanted to check with others who are familiar with the API, in case my understanding of the pros and cons is wrong.
I am open to any suggestions from the group that would make this better, as the intent is not to miss any email.
I'd go for the code simplicity of option 1 (sketched below). If you connect once a minute the load is very low (just a FindItem call returning nothing) and users experience it as almost instantaneous.
You're only handling 50 emails a day at most, so the wish to be 'instantaneous' is a bit contradictory (with that few requests, the user can surely wait a minute).
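For reference, a minimal sketch of option 1 with the ews-java-api library; the credentials, autodiscover address, polling interval, and the process/move step are all placeholders:

import microsoft.exchange.webservices.data.core.ExchangeService;
import microsoft.exchange.webservices.data.core.enumeration.misc.ExchangeVersion;
import microsoft.exchange.webservices.data.core.enumeration.property.WellKnownFolderName;
import microsoft.exchange.webservices.data.core.service.item.Item;
import microsoft.exchange.webservices.data.credential.WebCredentials;
import microsoft.exchange.webservices.data.search.FindItemsResults;
import microsoft.exchange.webservices.data.search.ItemView;

public class InboxPoller {
    public static void main(String[] args) throws Exception {
        ExchangeService service = new ExchangeService(ExchangeVersion.Exchange2010_SP2);
        service.setCredentials(new WebCredentials("user@example.com", "password"));
        service.autodiscoverUrl("user@example.com");

        // Poll once a minute; since the Inbox holds only unprocessed requests,
        // an empty FindItems result is a cheap no-op.
        while (true) {
            FindItemsResults<Item> results =
                    service.findItems(WellKnownFolderName.Inbox, new ItemView(50));
            for (Item item : results.getItems()) {
                item.load(); // fetch body and properties
                // process(item) and move it to the "processed" subfolder here
            }
            Thread.sleep(60_000);
        }
    }
}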