Processing specific AWS SQS messages in specific threads - java

The Java application posts async jobs to AWS and gets back a JobID. When the async job is finished, a message will appear in an SQS queue with that JobID. Each JobID is handled by a different thread. Each of those threads also polls SQS for messages until it finds the message which contains its JobID. Additionally, the application is distributed into multiple services so there can't be a single SQS processor.
I saw that SQS returns a maximum of 10 messages and after they are returned, a visibility timeout is applied so that they are not re-sent to other consumers. However, my consumers are the threads that want to consume only a single message and let the rest be consumed by other threads. Should I set the visibility timeout to 0? Will this make it so all consumers get the same set of 10 messages on every request? What's the best way for each consumer to sift through all the messages and find the one it wants?
TL;DR: SQS has 100 messages and there are 100 consumers, one for each message. How should I go about having each consumer find the message it wants (based on a JobID)?
EDIT: I know that this is not an appropriate usage of SQS and I'd be very glad to not use it at all but our main integration is with Amazon Textract for which it is mandatory to use SQS for its asynchronous operations. Each Textract request is processed by a different thread which means that they each need to get back a specific SQS message, consumers are not universal. Not to mention the possibility of a clustered environment for which I'd like to avoid having to do any synchronization...
EDIT 2: This is for an on-premises, Setup.exe based, dev-hands-off application where we want to minimize the amount of unneeded AWS services used (both for cost and for customer setup/maintenance reasons) as well as the use of external components, again to minimize customer deployment/maintenance/servers. I understand that we are living in the world of microservices but there are still applications that want to benefit from intelligent services without being cloud-native themselves.

This is not an appropriate architecture for using Amazon SQS. Your processes should not be trying to find a specific message from an Amazon SQS queue.
You should find a different architecture for this message-passing task. Some ideas:
Create a message in Amazon S3 with an 'expected' Key. Have each thread look for that object as a return message; a sketch follows this list. (This is effectively using Amazon S3 as a Key-Value Store.)
Have a single Lambda function retrieve messages from SQS and update a database (or S3 as above). Then, have the threads consult the database instead of SQS.
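For the S3 idea, a minimal sketch of a thread waiting for its 'return message' object (assuming the AWS SDK for Java v1; the bucket name and key layout here are illustrative, not prescribed):

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.S3Object;

    public class JobResultPoller {

        private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // Bucket and key layout are assumptions for illustration only.
        private static final String BUCKET = "textract-job-results";

        /** Blocks until the object for this JobID appears, then returns it. */
        public S3Object waitForResult(String jobId) throws InterruptedException {
            String key = "results/" + jobId + ".json";
            while (!s3.doesObjectExist(BUCKET, key)) {
                Thread.sleep(2000); // poll every 2 seconds
            }
            return s3.getObject(BUCKET, key);
        }
    }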

I think you need to put something in between SQS and your threads. Like a DynamoDB table. You could have a Lambda function that processes all the SQS messages and just translates them into DynamoDB records. Then your different threads could easily check for the specific records they are interested in using a DynamoDB query.
Just because Textract mandates that you use SQS doesn't mean the final step in your architecture has to read those messages directly from SQS. In this case SQS is just a message bus that can integrate with other services in AWS, and those services are your building blocks you can use to create the architecture you need.
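For example, a rough sketch of the per-thread lookup (assuming the AWS SDK for Java v1 and a hypothetical table named TextractJobs with JobId as its partition key, which the Lambda would populate):

    import java.util.Collections;
    import java.util.Map;

    import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
    import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
    import com.amazonaws.services.dynamodbv2.model.AttributeValue;
    import com.amazonaws.services.dynamodbv2.model.GetItemRequest;

    public class JobStatusChecker {

        private final AmazonDynamoDB dynamoDb = AmazonDynamoDBClientBuilder.defaultClient();

        /** Returns the item for this JobID, or null if the Lambda has not written it yet. */
        public Map<String, AttributeValue> findJobRecord(String jobId) {
            GetItemRequest request = new GetItemRequest()
                    .withTableName("TextractJobs") // table name is an assumption
                    .withKey(Collections.singletonMap("JobId", new AttributeValue(jobId)));
            return dynamoDb.getItem(request).getItem(); // null when the record is absent
        }
    }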

Related

Searching for error handling solution when firehose is down

My system background:
I have a Scala application that reads events from various queue technologies, transforms the data, and as a final step sends records to Kinesis Firehose.
My Firehose currently has an S3 bucket as its destination, but I don't want to bind the solution to that, since it could change in the future.
Problem statement
AWS guarantees a monthly uptime percentage of at least 99.9% (which still allows roughly 44 minutes of downtime per month) - see the Service Level Agreement.
In addition, when quota limits are reached, putRecord will return a ServiceUnavailableException - see PutRecord.
As part of the flow I can't afford to lose events, so I'm searching for some sort of solution for Firehose unavailability.
We are thinking about an error handler that will (a rough sketch follows these steps):
Publish to SQS each event that failed to be published to Firehose.
When Firehose is back up, SNS will trigger an AWS Lambda function.
The Lambda will publish all events from SQS into Firehose.
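A rough sketch of the error handler in the first step (shown with the AWS SDK for Java v1; the delivery stream name and queue URL are placeholders):

    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;

    import com.amazonaws.services.kinesisfirehose.AmazonKinesisFirehose;
    import com.amazonaws.services.kinesisfirehose.AmazonKinesisFirehoseClientBuilder;
    import com.amazonaws.services.kinesisfirehose.model.PutRecordRequest;
    import com.amazonaws.services.kinesisfirehose.model.Record;
    import com.amazonaws.services.kinesisfirehose.model.ServiceUnavailableException;
    import com.amazonaws.services.sqs.AmazonSQS;
    import com.amazonaws.services.sqs.AmazonSQSClientBuilder;

    public class FirehoseWithFallback {

        private final AmazonKinesisFirehose firehose = AmazonKinesisFirehoseClientBuilder.defaultClient();
        private final AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();

        // Stream name and queue URL are placeholders.
        private static final String STREAM = "my-delivery-stream";
        private static final String FALLBACK_QUEUE_URL =
                "https://sqs.eu-west-1.amazonaws.com/123456789012/firehose-fallback";

        public void publish(String event) {
            try {
                PutRecordRequest request = new PutRecordRequest()
                        .withDeliveryStreamName(STREAM)
                        .withRecord(new Record().withData(
                                ByteBuffer.wrap(event.getBytes(StandardCharsets.UTF_8))));
                firehose.putRecord(request);
            } catch (ServiceUnavailableException e) {
                // Firehose is throttling or unavailable: park the event in SQS
                // so the Lambda can replay it later.
                sqs.sendMessage(FALLBACK_QUEUE_URL, event);
            }
        }
    }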
My question
I'm wondering whether that solution is good enough, and whether there is a best practice for handling errors when trying to putRecord into Firehose?

Invoking AWS Lambda Functions from Java

I have a long-running AWS Lambda function that I am executing from my webapp. Using the documentation [1], it works fine; however, my problem is that this particular Lambda function does not return anything back to the application, its output is saved to S3, and it runs for a long time (20-30s). Is there a way to trigger the Lambda and not wait for the return value, since I don't want to wait/block my app while the Lambda is running? Right now I am using an ExecutorService as a queue to execute Lambda requests, since I have to wait for each invocation; when the app crashes or restarts, I lose jobs that are waiting to be executed.
[1] https://aws.amazon.com/blogs/developer/invoking-aws-lambda-functions-from-java/
Tracking status is not necessarily a difficult issue. Use a simple S3 "file exists" call after each job execution to know if the lambda is done.
However, as you've pointed out, you might lose job information at some point. To remove this issue, you need some persistence layer outside your JVM. A KV store would work: store some (timestamp, jobId, status) fields in a database, periodically check them from your web server, and only update them from the Lambda.
Alternatively, to reduce the end-to-end time frame further, a queuing mechanism would be better (unless you also want the full history of jobs, but that can be built alongside the queue). As mentioned in the comments, AWS offers many built-in solutions that can be used directly with Lambda, or you can add infrastructure like RabbitMQ / Redis to build a task event bus.
With that, Lambda becomes optional. You'd periodically pull events off into worker processes, which can either be very dumb passthroughs that just invoke the Lambda, or do the work themselves directly. Combine this with ECS/EKS/EC2 autoscaling and it might actually run faster than Lambda, since you can scale in/out based on queue size. Then you write the output events to a success/error notification "channel" after the S3 file is written.
Back in the web server, you'll have to modify the code to listen for messages asynchronously on that channel; when you get a success message, you'll know that you should be able to access the S3 resources.
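Putting the queue pieces together, a very rough sketch of such a "dumb passthrough" worker (AWS SDK for Java v1; the queue URLs and function name are placeholders, not your actual resources):

    import com.amazonaws.services.lambda.AWSLambda;
    import com.amazonaws.services.lambda.AWSLambdaClientBuilder;
    import com.amazonaws.services.lambda.model.InvokeRequest;
    import com.amazonaws.services.sqs.AmazonSQS;
    import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
    import com.amazonaws.services.sqs.model.Message;

    public class PassthroughWorker {

        // Placeholder queue URLs for the job queue and the result notification channel.
        private static final String JOB_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/jobs";
        private static final String NOTIFY_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/job-results";

        public static void main(String[] args) {
            AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
            AWSLambda lambda = AWSLambdaClientBuilder.defaultClient();

            while (true) {
                for (Message job : sqs.receiveMessage(JOB_QUEUE_URL).getMessages()) {
                    // Synchronous invoke: the Lambda writes its output to S3 before returning.
                    lambda.invoke(new InvokeRequest()
                            .withFunctionName("my-long-running-function") // placeholder name
                            .withPayload(job.getBody()));

                    // Tell the web server that the S3 output for this job is ready.
                    sqs.sendMessage(NOTIFY_QUEUE_URL, job.getBody());
                    sqs.deleteMessage(JOB_QUEUE_URL, job.getReceiptHandle());
                }
            }
        }
    }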

How to send events to all instances of the application in PCF

I am not able to find a way to send/broadcast a message to all application instances in Pivotal Cloud Foundry. How can we notify all app instances of some events? If we use an HTTP request, the PCF router will dispatch it to a single instance of the app. How can we solve this problem?
What @Florian said is probably the safer option, but if you want something quick and easy, you can send HTTP requests directly to an app instance by using the X-CF-APP-INSTANCE header. The format for the header is YOUR-APP-GUID:YOUR-INSTANCE-INDEX.
https://docs.cloudfoundry.org/concepts/http-routing.html#app-instance-routing
So given an app guid, you could iterate over the number of instances, say 0 to 5, and send an HTTP request to each one. Make sure to check the response to confirm that each one succeeded.
This also requires that you know the app guid for your app (i.e. cf app <name> --guid) and the number of instances of your app.
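For example, a quick sketch using Java 11's HttpClient (the app GUID, route, and instance count are placeholders you'd fill in):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class BroadcastToInstances {

        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            String appGuid = "94f018ad-xxxx-xxxx-xxxx-xxxxxxxxxxxx"; // from `cf app <name> --guid`
            int instanceCount = 5;                                   // known number of instances

            for (int i = 0; i < instanceCount; i++) {
                HttpRequest request = HttpRequest.newBuilder()
                        .uri(URI.create("https://my-app.example.com/api/event")) // placeholder route
                        .header("X-CF-APP-INSTANCE", appGuid + ":" + i)
                        .POST(HttpRequest.BodyPublishers.noBody())
                        .build();
                HttpResponse<String> response =
                        client.send(request, HttpResponse.BodyHandlers.ofString());
                if (response.statusCode() >= 300) {
                    System.err.println("Instance " + i + " returned " + response.statusCode());
                }
            }
        }
    }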
CF, out of the box, does not provide any event queue mechanism that apps can subscribe to.
What I would do (assuming you have two app instances, A and B):
Provide an event endpoint in your application code, e.g. POST /api/event (alternatively, if the event should arise from another app (e.g. another microservice), this one could directly send messages onto the queue)
All app instances are listening on an internal event queue for new events
Instance A receives the call from the CF router and processes it by issuing an event on the internal event queue; the instance does not react to the event yet.
When A publishes the event, both A and B receive it and process it accordingly.
Now, the internal event queue you can use depends highly on your deployment. On AWS you can probably use SQS or SNS or something similar. PCF, as far as I know, also provides a messaging system that would suit here as well, RabbitMQ. You could also use features of other services that allow you to subscribe to events, such as Redis (pub/sub commands) or similar.
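For instance, with Redis pub/sub a minimal sketch using the Jedis client could look like this (the channel name and connection details are illustrative): every instance subscribes at startup, and the instance that received the HTTP call publishes the event so all instances, including itself, handle it.

    import redis.clients.jedis.Jedis;
    import redis.clients.jedis.JedisPubSub;

    public class EventBroadcast {

        private static final String CHANNEL = "app-events"; // illustrative channel name

        /** Called on startup in every instance; blocks, so run it on a background thread. */
        public static void listen() {
            try (Jedis jedis = new Jedis("localhost", 6379)) {
                jedis.subscribe(new JedisPubSub() {
                    @Override
                    public void onMessage(String channel, String message) {
                        // Every instance (A and B) reacts to the event here.
                        System.out.println("Handling event: " + message);
                    }
                }, CHANNEL);
            }
        }

        /** Called by the instance that received the POST /api/event request. */
        public static void publish(String event) {
            try (Jedis jedis = new Jedis("localhost", 6379)) {
                jedis.publish(CHANNEL, event);
            }
        }
    }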
If you provide more concrete information about what you want to achieve, a more detailed answer would be possible, though.

Google pubsub flow control

I'm trying to implement a service which consumes a Google Pub/Sub subscription at its own pace. By that, I mean I need fine control over when I consume messages, i.e. get a batch of messages, pause for a while, not get more than X messages...
Using the Google client libraries I did not find a way to do it, as the MessageReceiver runs in its own thread and I don't have any control over what exactly happens.
Basically, being able to consume messages in a synchronous way should solve my issue.
Do you know how I can use the Google client libs synchronously? Or is there another way in the API that I missed?
You might try using setFlowControlSettings when you build your subscriber. In particular, you can use setMaxOutstandingElementCount or setMaxOutstandingRequestBytes to limit the messages sent to your MessageReceiver. When you have enough messages outstanding, i.e., messages for which you have not called Ack() or Nack(), to exceed these limits, then your MessageReceiver will not be called until messages have been acked or nacked.
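For example, a sketch with the Java client library (the project, subscription, and limits are placeholders):

    import com.google.api.gax.batching.FlowControlSettings;
    import com.google.cloud.pubsub.v1.MessageReceiver;
    import com.google.cloud.pubsub.v1.Subscriber;
    import com.google.pubsub.v1.ProjectSubscriptionName;

    public class ThrottledSubscriber {

        public static void main(String[] args) {
            ProjectSubscriptionName subscription =
                    ProjectSubscriptionName.of("my-project", "my-subscription"); // placeholders

            // Allow at most 10 unacked messages (or 10 MB) to be outstanding at once.
            FlowControlSettings flowControl = FlowControlSettings.newBuilder()
                    .setMaxOutstandingElementCount(10L)
                    .setMaxOutstandingRequestBytes(10L * 1024L * 1024L)
                    .build();

            MessageReceiver receiver = (message, consumer) -> {
                System.out.println("Received: " + message.getData().toStringUtf8());
                consumer.ack(); // until this is called, the message counts against the limits
            };

            Subscriber subscriber = Subscriber.newBuilder(subscription, receiver)
                    .setFlowControlSettings(flowControl)
                    .build();
            subscriber.startAsync().awaitRunning();
        }
    }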

Tool to send email from a DB

We are developing a webapp, written in Java/Groovy, that needs to send out emails. Currently, we are persisting each email to a database before we call the Java Mail APIs to send the mail to our SMTP server.
I want to send out email asynchronously. I want to persist the email and then have another process pick up the email and send it (and send it only once). Ideally, this process is running outside of my webapp.
Are there any tools that do that?
Update: This solution needs to prevent duplicate emails and it needs to handle spikes in email. I was hoping someone already wrote an offline email processor. (I'd rather not implement this myself.)
The suggestions to use a cron job to read the database are quite workable.
Another good approach here is to use a Java Message Service (JMS) message queue. These are persistent (backed up by a database) and reliable. You can have one or more producer programs enqueue messages with the relevant data in them, and then one or more consumers process the messages and dequeue them. All of this is set up for very high reliability, and you gain the flexibility of asynchronously decoupling the operations, which means during email spikes the message queue can grow larger until the consumers catch up with the spike. Another benefit is that the email goes out as soon as a consumer gets to it instead of on a timer. Plus, if you require high availability, you can have multiple consumers in case one goes down.
Check out Apache's ActiveMQ for a good open source implementation of JMS.
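As a quick sketch of the producer side with ActiveMQ (the broker URL and queue name are illustrative; a consumer would register a MessageListener on the same queue and call your existing mail-sending code):

    import javax.jms.Connection;
    import javax.jms.ConnectionFactory;
    import javax.jms.MessageProducer;
    import javax.jms.Queue;
    import javax.jms.Session;

    import org.apache.activemq.ActiveMQConnectionFactory;

    public class EmailQueueProducer {

        public static void main(String[] args) throws Exception {
            ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://localhost:61616");
            Connection connection = factory.createConnection();
            try {
                connection.start();
                Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
                Queue queue = session.createQueue("outgoing-emails"); // illustrative queue name

                MessageProducer producer = session.createProducer(queue);
                // In the webapp, this would carry the persisted email's id (or the full payload).
                producer.send(session.createTextMessage("email-id-42"));
            } finally {
                connection.close();
            }
        }
    }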
If you're using Linux/Unix you could create a cron job to run every few minutes which calls a program to grab the email from the database and send it out. You could also have a field in the database to indicate whether the message has been sent. The downside of this approach is that there may be a delay of a few minutes from when your webapp persists the email and when the cron job is run.
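The program that cron invokes could be as small as this sketch (plain JDBC; the connection details, table layout, and the sendViaSmtp helper are assumptions), where the sent flag plus the row lock is what prevents duplicate sends even if two runs overlap:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class EmailSenderJob {

        public static void main(String[] args) throws Exception {
            try (Connection db = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/app", "app", "secret")) { // placeholder connection
                db.setAutoCommit(false);

                // Lock unsent rows so a second overlapping run cannot pick them up too.
                try (PreparedStatement select = db.prepareStatement(
                             "SELECT id, recipient, subject, body FROM email WHERE sent = false FOR UPDATE");
                     PreparedStatement markSent = db.prepareStatement(
                             "UPDATE email SET sent = true WHERE id = ?")) {
                    ResultSet rs = select.executeQuery();
                    while (rs.next()) {
                        sendViaSmtp(rs.getString("recipient"), rs.getString("subject"), rs.getString("body"));
                        markSent.setLong(1, rs.getLong("id"));
                        markSent.executeUpdate();
                    }
                }
                db.commit();
            }
        }

        // Hypothetical helper wrapping the existing JavaMail code.
        private static void sendViaSmtp(String recipient, String subject, String body) {
            // javax.mail Transport.send(...) would go here.
        }
    }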
Set up a cron job and use scripts to query the DB and send out emails via sendmail.
On the off chance it's an Oracle DB, you can use the UTL_MAIL package to write PL/SQL to send the mail through your SMTP server. Then create a scheduled job to execute on your desired schedule.
http://download.oracle.com/docs/cd/B19306_01/appdev.102/b14258/u_mail.htm
Since you are already using Groovy, this might be an interesting tool to solve your problem:
http://gaq.sourceforge.net/
You could use Quartz, a scheduling library (similar to cron), to schedule a recurring task that reads the DB and sends the emails. If you're using Grails, there's a Quartz plugin that makes working with Quartz a bit more Groovy.
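A minimal Quartz setup for that recurring task might look like the sketch below (the job body would call the same database-reading/sending code as in the cron approach; the five-minute interval is just an example):

    import org.quartz.Job;
    import org.quartz.JobBuilder;
    import org.quartz.JobDetail;
    import org.quartz.JobExecutionContext;
    import org.quartz.Scheduler;
    import org.quartz.SimpleScheduleBuilder;
    import org.quartz.Trigger;
    import org.quartz.TriggerBuilder;
    import org.quartz.impl.StdSchedulerFactory;

    public class EmailScheduler {

        public static class SendPendingEmailsJob implements Job {
            @Override
            public void execute(JobExecutionContext context) {
                // Read unsent emails from the DB and send them, as in the cron-based approach.
            }
        }

        public static void main(String[] args) throws Exception {
            Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();

            JobDetail job = JobBuilder.newJob(SendPendingEmailsJob.class)
                    .withIdentity("sendPendingEmails")
                    .build();
            Trigger trigger = TriggerBuilder.newTrigger()
                    .startNow()
                    .withSchedule(SimpleScheduleBuilder.simpleSchedule()
                            .withIntervalInMinutes(5)
                            .repeatForever())
                    .build();

            scheduler.scheduleJob(job, trigger);
            scheduler.start();
        }
    }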
