I have a long-running AWS Lambda function that I execute from my web app. Following the documentation [1], it works fine. However, this particular Lambda function does not return anything to the application: its output is saved to S3, and it runs for a long time (20-30s). Is there a way to trigger the Lambda and not wait for the return value, since I don't want to block my app while the Lambda is running? Right now I am using an ExecutorService as a queue for Lambda requests, since I have to wait for each invocation to complete; when the app crashes or restarts, I lose the jobs that are still waiting to be executed.
[1] https://aws.amazon.com/blogs/developer/invoking-aws-lambda-functions-from-java/
Tracking status is not necessarily difficult. Use a simple S3 "file exists" call after each job execution to know whether the Lambda is done.
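For example, a minimal check with the AWS SDK for Java. The bucket name and key layout here are assumptions; substitute whatever your Lambda actually writes:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class JobStatusChecker {
    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

    // Returns true once the Lambda has written its output object to S3.
    public boolean isJobDone(String jobId) {
        return s3.doesObjectExist("my-output-bucket", "jobs/" + jobId + "/result.json");
    }
}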
However, as you've pointed out, you might lose job information at some point. To remove this issue, you need some persistence layer outside your JVM. A KV store would work: store (timestamp, jobId, status) records in a database, update them only from the Lambda, and check them periodically from your web server.
Alternatively, to reduce the end-to-end turnaround further, a queuing mechanism would be better (unless you also want the full history of jobs, though that can be built alongside the queue). As mentioned in the comments, AWS offers many built-in solutions that integrate directly with Lambda; otherwise you need additional infrastructure like RabbitMQ or Redis to build a task event bus.
With that, Lambda becomes optional. You'd effectively pull events off the queue into workers, which can either be very dumb pass-throughs that just invoke the Lambda, or do the work themselves directly. Combine this with ECS/EKS/EC2 autoscaling and it might actually run faster than Lambda, since you can scale in/out based on queue size. The workers then write output events to a success/error notification "channel" after the S3 file is written.
Back in the web server, you'll have to modify your code to listen asynchronously for messages on that channel; when a success message arrives, you know you can access the S3 resources.
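A rough sketch of such a listener, assuming the channel is an SQS queue (the queue URL is an assumption; long polling keeps the receive loop cheap):

import java.util.List;

import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;
import com.amazonaws.services.sqs.model.ReceiveMessageRequest;

public class CompletionListener implements Runnable {
    private final AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
    // The queue URL is an assumption; point it at your success/error channel.
    private final String queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/job-completions";

    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            // Long polling (20s) avoids hammering SQS with empty receives.
            List<Message> messages = sqs.receiveMessage(
                    new ReceiveMessageRequest(queueUrl).withWaitTimeSeconds(20)).getMessages();
            for (Message m : messages) {
                handleCompletion(m.getBody()); // e.g. mark the job done, fetch the S3 object
                sqs.deleteMessage(queueUrl, m.getReceiptHandle());
            }
        }
    }

    private void handleCompletion(String body) { /* application-specific */ }
}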
Related
The Java application posts async jobs to AWS and gets back a JobID. When the async job is finished, a message will appear in an SQS queue with that JobID. Each JobID is handled by a different thread. Each of those threads also polls SQS for messages until it finds the message which contains its JobID. Additionally, the application is distributed into multiple services so there can't be a single SQS processor.
I saw that SQS returns a maximum of 10 messages and after they are returned, a visibility timeout is applied so that they are not re-sent to other consumers. However, my consumers are the threads that want to consume only a single message and let the rest be consumed by other threads. Should I set the visibility timeout to 0? Will this make it so all consumers get the same set of 10 messages on every request? What's the best way for each consumer to sift through all the messages and find the one it wants?
TL;DR: SQS has 100 messages and there are 100 consumers, one for each message. How should I go about having each consumer find the message it wants (based on a JobID).
EDIT: I know that this is not an appropriate usage of SQS and I'd be very glad to not use it at all but our main integration is with Amazon Textract for which it is mandatory to use SQS for its asynchronous operations. Each Textract request is processed by a different thread which means that they each need to get back a specific SQS message, consumers are not universal. Not to mention the possibility of a clustered environment for which I'd like to avoid having to do any synchronization...
EDIT 2: This is for an on-premises, Setup.exe based, dev-hands-off application where we want to minimize the amount of unneeded AWS services used (both for cost and for customer setup/maintenance reasons) as well as the use of external components, again to minimize customer deployment/maintenance/servers. I understand that we are living in the world of microservices but there are still applications that want to benefit from intelligent services without being cloud-native themselves.
This is not an appropriate architecture for using Amazon SQS. Your processes should not be trying to find a specific message from an Amazon SQS queue.
You should find a different architecture for this message-passing task. Some ideas:
Create a message in Amazon S3 with an 'expected' key. Have each thread look for that object as a return message. (This is effectively using Amazon S3 as a key-value store.)
Have a single Lambda function retrieve messages from SQS and update a database (or S3 as above). Then, have the threads consult the database instead of SQS.
I think you need to put something in between SQS and your threads. Like a DynamoDB table. You could have a Lambda function that processes all the SQS messages and just translates them into DynamoDB records. Then your different threads could easily check for the specific records they are interested in using a DynamoDB query.
Just because Textract mandates that you use SQS doesn't mean the final step in your architecture has to read those messages directly from SQS. In this case SQS is just a message bus that can integrate with other services in AWS, and those services are your building blocks you can use to create the architecture you need.
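A minimal sketch of that translation Lambda, assuming a DynamoDB table named TextractJobs keyed on JobId. The body parsing here is deliberately naive; Textract's notification is JSON (wrapped in an SNS envelope), so use a real JSON parser in practice:

import java.util.HashMap;
import java.util.Map;

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.SQSEvent;

public class TextractSqsToDynamo implements RequestHandler<SQSEvent, Void> {
    private final AmazonDynamoDB dynamo = AmazonDynamoDBClientBuilder.defaultClient();

    public Void handleRequest(SQSEvent event, Context context) {
        for (SQSEvent.SQSMessage msg : event.getRecords()) {
            Map<String, AttributeValue> item = new HashMap<String, AttributeValue>();
            item.put("JobId", new AttributeValue(extractJobId(msg.getBody())));
            item.put("Body", new AttributeValue(msg.getBody()));
            dynamo.putItem("TextractJobs", item); // table name is an assumption
        }
        return null;
    }

    // Naive extraction for illustration only; prefer a JSON library.
    private String extractJobId(String json) {
        int start = json.indexOf("\"JobId\":\"");
        if (start < 0) return "unknown";
        start += 9;
        return json.substring(start, json.indexOf('"', start));
    }
}

Each waiting thread then does a GetItem on TextractJobs with its own JobID instead of competing for SQS messages.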
I'm using AWS SDK for Java.
Imagine I create an RDS instance as described in the AWS documentation.
AmazonRDS client = AmazonRDSClientBuilder.standard().build();

CreateDBInstanceRequest request = new CreateDBInstanceRequest()
        .withDBInstanceIdentifier("mymysqlinstance")
        .withAllocatedStorage(5)
        .withDBInstanceClass("db.t2.micro")
        .withEngine("MySQL")
        .withMasterUsername("MyUser")
        .withMasterUserPassword("MyPassword");

DBInstance response = client.createDBInstance(request);
If I call response.getEndpoint() right after making the request, it will return null, because AWS is still creating the database. I need to know this endpoint when it becomes available, but I can't figure out how to do it.
Is there a way, using the AWS SDK, to be notified when the instance was finally created?
You can use the RDS SNS notifications:
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_Events.html#USER_Events.Messages
Subscribing to Amazon RDS Event Notification
You can create an Amazon RDS event notification subscription so you can be notified when an event occurs for a given DB instance, DB snapshot, DB security group, or DB parameter group. The simplest way to create a subscription is with the RDS console. If you choose to create event notification subscriptions using the CLI or API, you must create an Amazon Simple Notification Service topic and subscribe to that topic with the Amazon SNS console or Amazon SNS API. You will also need to retain the Amazon Resource Name (ARN) of the topic because it is used when submitting CLI commands or API actions. For information on creating an SNS topic and subscribing to it, see Getting Started with Amazon SNS.
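If you want to create the subscription from code rather than the console, a sketch with the Java SDK might look like the following (the topic ARN and subscription name are assumptions, and the SNS topic must already exist):

import com.amazonaws.services.rds.AmazonRDS;
import com.amazonaws.services.rds.AmazonRDSClientBuilder;
import com.amazonaws.services.rds.model.CreateEventSubscriptionRequest;

AmazonRDS rds = AmazonRDSClientBuilder.defaultClient();
rds.createEventSubscription(new CreateEventSubscriptionRequest()
        .withSubscriptionName("mymysqlinstance-events")
        .withSnsTopicArn("arn:aws:sns:us-east-1:123456789012:rds-events") // pre-created topic
        .withSourceType("db-instance")
        .withEventCategories("creation", "availability")
        .withSourceIds("mymysqlinstance"));

If polling is acceptable instead of notifications, newer SDK versions also ship waiters (client.waiters().dBInstanceAvailable()) that simply block until the instance is up.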
Disclaimer: Opinionated Answer
IMO creating infrastructure at runtime in code like this is devil's work. Stacks are the way to go here: they're much more modular, and you will get some of the following benefits:
If you start creating more than one table per customer, you will be able to logically group them into a stack and clean them up more easily as needed
If for some reason the creation of a resource fails you can see this very easily in the stack console
Management is much easier: you can search through your stacks in a console that is already built for you
Updating a stack in AWS is much easier as well than updating tables individually
MOST IMPORTANT: If an error occurs, stacks already have rollback and redundancy functionality built in, and you control its behaviour. If something goes wrong in your code during your onboarding process, it will be a mess to clean up: what if one table succeeded and the other didn't? You will have to trawl through logs (if they exist) to find out what happened.
You can also combine this approach with something like AWS Pipelines or even AWS Simple Workflow Service to add custom steps to your onboarding process, e.g. run a Lambda function, send a notification when completed, wait for some payment. This builds on my last point: if the pipeline does fail, you will be able to see which step failed and why, and whether anything timed out.
Lastly, I want to advise caution about creating infrastructure per customer. It's much more work and adds a lot more ways in which things can break. Also make sure you set limits in AWS so that you don't end up in a situation where your bill sky-rockets because of some bug creating infrastructure.
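To make the stack suggestion concrete, a minimal sketch with the Java SDK; the stack naming scheme, the template file, and the loadTemplate helper are all hypothetical:

import com.amazonaws.services.cloudformation.AmazonCloudFormation;
import com.amazonaws.services.cloudformation.AmazonCloudFormationClientBuilder;
import com.amazonaws.services.cloudformation.model.CreateStackRequest;
import com.amazonaws.services.cloudformation.model.Parameter;

AmazonCloudFormation cfn = AmazonCloudFormationClientBuilder.defaultClient();
cfn.createStack(new CreateStackRequest()
        .withStackName("customer-" + customerId)                 // hypothetical naming scheme
        .withTemplateBody(loadTemplate("customer-tables.yaml"))  // hypothetical helper
        .withParameters(new Parameter()
                .withParameterKey("CustomerId")
                .withParameterValue(customerId)));

If anything in the template fails, CloudFormation rolls the whole stack back for you, which is exactly the cleanup guarantee described above.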
I'm building a real-time API handling 2 types of calls:
Updates,
Computation requests.
Internally, the updates are broadcasted among workers. The workers keep working data structures (such as hash-tables) in their RAM, and modify the contents as the updates are coming.
When a computation request comes, exactly one idle worker handles it, using multiple threads, working with the local copy in RAM.
I'm wondering whether I could migrate my current implementation to Storm. As I understand it, Storm is pretty real-time and could help me a lot with scalability and fault-tolerance.
Currently, I'm using UWSGI/Python to handle the API requests, and Java workers to do the computation. I'm thinking of putting the Java workers into the Storm topology as bolts. However, I'm not quite sure about the spouts.
As I understand it, I could use DRPC to handle the computation requests, just by connecting to a DRPC server from python. It is clearly written in the docs that DRPC can handle the whole life-cycle of the request-reply paradigm. But what about updates?
My question is: is it a good idea (or even possible?) to use DRPC to submit updates in a non-blocking manner, without waiting for replies (because there are no results)?
For non-blocking, asynchronous updates you should use a job server like Gearman.
This will let you submit a job without having to wait for any response. Gearman is used by Instagram to share photos to Facebook/Twitter whenever a user uploads a photo using the Instagram app.
I have a web application that takes in user requests and puts them into a MySQL database. A typical user request is serviced by a workflow that takes significant time to complete. To address this, I have an asynchronous processor that keeps listening to the MySQL table.
I have noticed that polling the MySQL table in an infinite loop causes a CPU usage spike on the box my application is deployed to, often rendering the box unusable.
I know that making the asynchronous process sleep for some time whenever there are no active requests in the MySQL database is an option, but I would like to keep that as a last resort.
Making this process synchronous is not an option, both because of the time the workflow for a single request takes and because the processing needs to be decoupled from the front end so the back end can evolve.
I would like to know if there is a smart way to trigger the asynchronous process so that I can avoid the CPU usage spike and still get optimal response time from the asynchronous processor.
Any advice would be appreciated.
Thanks
p1ng
An option would be to store the request in the database AND send some kind of event in your system (e.g. a JMS message, or by using java.util.concurrent constructs). The process that reads and executes the command can then be woken by this signal, fetch the data from the database, and process it as usual.
This way, you wouldn't waste CPU cycles polling for data that isn't available yet, and you would be more responsive because there is no polling delay.
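A minimal same-JVM sketch using java.util.concurrent (if the web layer and the processor run in different JVMs, you'd need JMS or similar instead):

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class RequestSignaler {
    // Shared between the web layer and the background processor (same JVM).
    private final BlockingQueue<Long> pendingIds = new LinkedBlockingQueue<Long>();

    // Web layer: after the INSERT commits, publish the request id.
    public void requestStored(long requestId) {
        pendingIds.offer(requestId);
    }

    // Background processor: blocks without burning CPU until work arrives.
    public void processLoop() throws InterruptedException {
        while (true) {
            long requestId = pendingIds.take(); // wakes up only when signaled
            // fetch the row by id from MySQL and run the workflow...
        }
    }
}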
You can make your asynchronous process read from a TCP socket or something similar, so that it just blocks on I/O. Then, from your primary process, you send a message to the asynchronous process once it has updated the table. It may also be possible to send the message from a trigger in the database.
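A rough sketch of that socket variant (the port number is an arbitrary choice):

import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

public class WakeupListener {
    public static void main(String[] args) throws IOException {
        ServerSocket wakeups = new ServerSocket(9099); // port is an arbitrary choice
        while (true) {
            Socket s = wakeups.accept(); // sleeps in the kernel, no CPU burned
            s.close();                   // the connection itself is the signal
            // drain all pending rows from the MySQL table here...
        }
    }
}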
I would generally not recommend a polling-based approach for checking the table. What happens when you have to poll for different events on different schedules? If you envision a need for multiple events in the future, I would suggest looking into message queues for asynchronous tasks.
But if you have to go with a polling-based approach for now: I don't fully understand your reasoning against letting the asynchronous process sleep for some time. Why would you want your process to consume all the CPU doing nothing but spinning in an infinite loop? Why not have your async process run at specific intervals? You can make the polling interval configurable.
You can use the FutureTask Java API to do that. See that blog entry.
Or perhaps just new Thread(new YourRunnable()).start(), with some state variable in YourRunnable so you know whether your task has finished.
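A small sketch of the FutureTask variant; runWorkflow() here is a hypothetical stand-in for your long-running work:

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.FutureTask;

FutureTask<String> task = new FutureTask<String>(new Callable<String>() {
    public String call() throws Exception {
        return runWorkflow(); // hypothetical long-running unit of work
    }
});
ExecutorService pool = Executors.newSingleThreadExecutor();
pool.execute(task);

// Later, from the polling side:
if (task.isDone()) {
    try {
        String result = task.get(); // will not block once isDone() is true
    } catch (Exception e) {
        // the task threw; inspect e.getCause()
    }
}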
I have a shell script which I'd like to trigger from a J2EE web app.
The script does lots of things - processing, FTPing, etc - it's a legacy thing.
It takes a long time to run.
I'm wondering what is the best approach to this. I want a user to be able to click on a link, trigger the script, and display a message to the user saying that the script has started. I'd like the HTTP request/response cycle to be instantaneous, irrespective of the fact that my script takes a long time to run.
I can think of three options:
Spawn a new thread during the processing of the user's click. However, I don't think this is compliant with the J2EE spec.
Send some output down the HTTP response stream and commit it before triggering the script. This gives the illusion that the HTTP request/response cycle has finished, but actually the thread processing the request is still sitting there waiting for the shell script to finish. So I've basically hijacked the container's HTTP processing thread for my own purposes.
Create a wrapper script which starts my main script in the background. This would let the request/response cycle finish normally in the container.
All the above would be using a servlet and Runtime.getRuntime().exec().
This is running on Solaris using Oracle's OC4J app server, on Java 1.4.2.
Please does anyone have any opinions on which is the least hacky solution and why?
Or does anyone have a better approach? We've got Quartz available, but we don't want to have to reimplement the shell script as a Java process.
Thanks.
You mentioned Quartz, so let's go for an option #4 (which is, of course, IMO the best):
Use Quartz Scheduler and a org.quartz.jobs.NativeJob
PS: The biggest problem may be to find documentation and this is the best source I've been able to find: How to use NativeJob?
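A sketch with the Quartz 1.x API (the script path is an assumption; setting PROP_WAIT_FOR_PROCESS to false makes it fire-and-forget):

import java.util.Date;

import org.quartz.JobDetail;
import org.quartz.Scheduler;
import org.quartz.SchedulerException;
import org.quartz.SimpleTrigger;
import org.quartz.impl.StdSchedulerFactory;
import org.quartz.jobs.NativeJob;

public class ScriptLauncher {
    public void launch() throws SchedulerException {
        Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();
        scheduler.start();

        JobDetail job = new JobDetail("legacyScript", Scheduler.DEFAULT_GROUP, NativeJob.class);
        job.getJobDataMap().put(NativeJob.PROP_COMMAND, "/opt/legacy/process.sh"); // your script
        job.getJobDataMap().put(NativeJob.PROP_WAIT_FOR_PROCESS, false);           // fire and forget

        // One-shot trigger that fires immediately; the servlet thread returns at once.
        scheduler.scheduleJob(job, new SimpleTrigger("runNow", Scheduler.DEFAULT_GROUP, new Date()));
    }
}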
I'd go with option 3, especially if you don't actually need to know when the script finishes (or have some other way of finding out other than waiting for the process to end).
Option 1 wastes a thread that's just going to be sitting around waiting for the script to finish. Option 2 seems like a bad idea. I wouldn't hijack servlet container threads.
Is it necessary for your application to evaluate output from the script you are starting, or is this a simple fire-and-forget job? If it's not required, you can 'abuse' the fact that Runtime.getRuntime().exec() will return immediately with the process continuing to run in the background. If you actually wanted to wait for the script/process to finish, you would have to invoke waitFor() on the Process object returned by exec().
If the process you are starting writes anything to stdout or stderr, be sure to redirect these to either log files or /dev/null; otherwise the process will block after a while, since stdout and stderr are exposed as InputStreams with limited buffering capabilities through the Process object.
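A common way to drain those streams, sketched for the fire-and-forget case (the script path is an assumption):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

public class StreamGobbler extends Thread {
    private final InputStream in;

    public StreamGobbler(InputStream in) { this.in = in; }

    public void run() {
        try {
            BufferedReader reader = new BufferedReader(new InputStreamReader(in));
            while (reader.readLine() != null) {
                // log or discard; the point is just to keep the buffer drained
            }
        } catch (IOException ignored) {
            // the process exited; nothing left to drain
        }
    }
}

// Usage (inside your servlet, which handles the IOException from exec()):
Process p = Runtime.getRuntime().exec("/opt/legacy/process.sh");
new StreamGobbler(p.getInputStream()).start(); // stdout
new StreamGobbler(p.getErrorStream()).start(); // stderr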
My approach to this would probably be something like the following:
Set up an ExecutorService within the servlet to perform the actual execution.
Create an implementation of Callable with an appropriate return type, that wraps the actual script execution (using Runtime.exec()) to translate Java input variables to shell script arguments, and the script output to an appropriate Java object.
When a request comes in, create an appropriate Callable object, submit it to the executor service and put the resulting Future somewhere persistent (e.g. user's session, or UID-keyed map returning the key to the user for later lookups, depending on requirements). Then immediately send an HTTP response to the user implying that the script was started OK (including the lookup key if required).
Add some mechanism for the user to poll the progress of their task, returning either a "still running" response, a "failed" response or a "succeeded + result" response depending on the state of the Future that you just looked up.
It's a bit hand-wavy, but depending on how your webapp is structured you can probably fit these general components in somewhere.
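A condensed sketch of those components; the names are illustrative, and stream handling and error reporting are omitted for brevity:

import java.util.Map;
import java.util.UUID;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ScriptRunner {
    private final ExecutorService executor = Executors.newFixedThreadPool(4);
    private final Map<String, Future<Integer>> jobs = new ConcurrentHashMap<String, Future<Integer>>();

    // Submits the script and returns a key the user can poll with.
    public String submit(final String[] command) {
        String key = UUID.randomUUID().toString();
        jobs.put(key, executor.submit(new Callable<Integer>() {
            public Integer call() throws Exception {
                Process p = Runtime.getRuntime().exec(command);
                // remember to drain stdout/stderr as discussed above
                return Integer.valueOf(p.waitFor()); // the script's exit code
            }
        }));
        return key;
    }

    // Poll endpoint logic: null = unknown key; otherwise check isDone().
    public Future<Integer> status(String key) {
        return jobs.get(key);
    }
}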
If your HTTP response / the user does not need to see the output of the script or be aware of when it completes, then your best option is to launch the script via some sort of wrapper, as you mention, so that it runs outside the servlet container environment entirely. That way you absolve yourself from having to manage threads within the container, hijack a thread as you mention, and so on.
Only if the user needs to be informed of when the script completes and/or monitor the script's output would I consider options 1 or 2.
For the second option, you can use a servlet, and after you've responded to the HTTP request you can use java.lang.Runtime.exec() to execute your script. For some of the problems and pitfalls of using it, I'd also recommend that you look here: http://www.javaworld.com/javaworld/jw-12-2000/jw-1229-traps.html
The most robust solution for asynchronous backend processes is, IMO, a message queue. I recently implemented this using a Spring-embedded ActiveMQ broker, rigging up a producing bean and a consuming bean. When a job needs to be started, my code calls the producer, which puts a message on the queue. The consumer is subscribed to the queue and gets kicked into action by the message, on a separate thread. This approach neatly separates the UI from the queueing mechanism (via the producer) and from the asynchronous process (handled by the consumer).
Note this was a Java 5, Spring-configured environment running on a Tomcat server on developer machines, and deployed to Weblogic on the test/production machines.
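Sketching the two beans (the queue name is an assumption; the JmsTemplate and a DefaultMessageListenerContainer pointing JobConsumer at the same queue are wired up in the Spring configuration):

import javax.jms.JMSException;
import javax.jms.Message;
import javax.jms.MessageListener;
import javax.jms.TextMessage;

import org.springframework.jms.core.JmsTemplate;

public class JobProducer {
    private JmsTemplate jmsTemplate; // injected by Spring; broker URL configured there

    public void startJob(String jobPayload) {
        // Returns immediately; the consumer picks this up on its own thread.
        jmsTemplate.convertAndSend("jobs.queue", jobPayload);
    }

    public void setJmsTemplate(JmsTemplate jmsTemplate) { this.jmsTemplate = jmsTemplate; }
}

class JobConsumer implements MessageListener {
    public void onMessage(Message message) {
        try {
            String payload = ((TextMessage) message).getText();
            // run the long workflow here, completely decoupled from the UI
        } catch (JMSException e) {
            // handle/log
        }
    }
}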
Your problem stems from the fact that you are trying to go against the 'single response per request' model of J2EE and have the end user's page update dynamically as the backend task executes.
Unless you want to go down the road of introducing an Ajax-based solution, you will have to force the rendered page in the user's browser to 'poll' the server for information periodically until the back-end task completes.
This can be achieved by:
When the J2EE container receives the request, spawn a thread which takes a reference to the session object (which will be used to write the output of your script)
Initialize the response servlet to write an HTML page containing a JavaScript function that reloads the page from the server at regular intervals (every 10 seconds or so).
On each request, poll the session object and display the output stored by the thread spawned in step 1.
[Clean-up logic can be added to delete the stored content from the session once the thread completes, if needed; you can also set additional flags in the session to mark state transitions in the execution of your script.]
This is one way to achieve what you want. It isn't the most elegant of approaches, but that is essentially down to needing to update your page content asynchronously from the server within a request/response model.
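A stripped-down sketch of step 3, using a meta refresh instead of hand-written JavaScript; the session attribute names are hypothetical and must match whatever the spawned thread writes:

import java.io.IOException;

import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.servlet.http.HttpSession;

public class ProgressServlet extends HttpServlet {
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        HttpSession session = req.getSession();
        String output = (String) session.getAttribute("scriptOutput"); // written by the spawned thread
        Boolean done = (Boolean) session.getAttribute("scriptDone");

        resp.setContentType("text/html");
        if (done == null || !done.booleanValue()) {
            // Keep polling every 10 seconds until the thread flags completion.
            resp.getWriter().println("<meta http-equiv='refresh' content='10'>");
        }
        resp.getWriter().println("<pre>" + (output == null ? "Starting..." : output) + "</pre>");
    }
}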
There are other ways to achieve this, but it really depends on how inflexible your constraints are. I have heard of Direct Web Remoting (although I haven't played with it yet); it might be worth taking a look at Developing Applications Using Reverse-Ajax.