Deploying code to Amazon EC2 instance - java

I'm using Java to create EC2 instances from within Eclipse. Now I would like to push parts of the application to these instances so that these can process whatever needs processing and then send the results back to my machine.
What I'm trying to do is something along the lines of:
assignWork(){
workPerformed = workQueue;
workPerInstance = workQueue/numberOfInstances;
while(workQueue > 0){
netxInstance.doWork(workPerformed,workPerInstance);
workPerformer -= workPerInstance;
}
}
doWork(start, end){
while(start>end){
//process stuff
start--;
}
}
This way I could control exactly how many AMI's to instantiates depending on the volume of work at hand. I could instantiate them, send them specific code to process and then terminate them as soon as I receive the results.
Is this possible just using the AWS JDK?

It is, but consider that...
If you have SLAs, and they fall within SQS Limits (Maximum 4 Days), you could consider publishing your task queues into SNS/SQS, and use CloudWatch to track the number of needed instances.
If you have a clear division of roles (more like a workflow), and the long-running tasks are not of much concern and you can retry, also consider using AWS SWF instead. It goes a bit beyond of a SQS/SNS Combo, and I think it could fit nicely with CloudWatch (thats just a theory, I haven't looked further). Cons are the extreme assh*le AWS Flow Framework for writing the Workflow Processes
If your workload is predictable (say, around 5K processes to process today), meaning you have no need for real-time and you can batch those requests, then consider using Elastic MapReduce for this. Being Hadoop-based, it offers some such niceties, such as being able to resize your cluster on demand, and the obvious case of not having any vendor lock in at all.
Actually, if you want that manage and without many surprises, consider looking at options such as PiCloud and IronWorker. They were really made for situations just like the one you've just described.
If you have only a Queue and EC2, you can surely automate that. It only depends on how badly you want to coordinate these tasks, but I'm sure its possible.

Related

Stream processing architecture

I am in the process of designing a system where there's a main stream of objects and there are multiple workers which produces some result from that object. Finally, there is some special/unique worker (sort of a "sink", in terms of graph theory) which takes all the results, and process them to some final object which is written to some DB.
It is possible for a worker to be dependent on the result of some other workers (hence, waiting for their results)
Now, I'm facing several problems:
It could be that one worker is much slower than another. How do you deal with that? Adding more workers (= scaling) of the slower type? (maybe dynamically)
Suppose W_B is dependent on W_A. If W_B is down for some reason then the flow will stop and the system will stop working. So I'd like the system to bypass this worker, somehow.
Moreover, how do the final worker decide when to operate on the set of results? Suppose it has the results of A and B but lacking the result of C. It may be that C is down or it's just very slow at the moment. How can it make a decision?
It is worth mentioning that it's not a realtime application but rather an offline processing system (i.e. you may access the DB and alter a record), but at the same time, it has to deal with relatively large amount of objects in an "high pace".
Regarding technologies,
I'm developing the system with Java but I'm not bounded to a specific technology.
I'd be glad if you could help me with the general design of the system.
Thanks a lot!
As Peter said, it really depends on the use case. Some general remarks though:
If a worker is slower than the other, maybe create more instances of that type; eg Kubernetes allows dynamic Node creation, and Kafka allows to partition a topic so more than one instance can read off and process it.
If B depends on A and A is down, B can't work and that's it. Maybe restart A? Maybe you can do a regular health check on it.
If the final worker needs the results of A, B and C, how would it process without C being available? If it can, it can store the results of A and B, install a timer, and if that goes off without C having arrived, continue.
Some additional thoughts:
If you mean to say that some subtasks of the overall application are quicker to execute than others, then it can be a good idea to slice up the application so that each worker is doing a bit of everything -- in other words, a share of the quick work and a share of the slow work. But if you mean to say that some machines are slower than others, then you could run fewer workers on the slow machines, and more on the faster ones, so as to balance things so that each worker has roughly the same resources.
You might want to decouple your architecture with some sort of durable queueing between the workers.
It's common to use heartbeats with timeouts and restarts.
Distributed stream processing quickly becomes very complex. Your life will be much easier if you build on top a stream processing framework that provides high availability and exactly-once semantics out of the box.

How does Apache Curator DistributedQueue's lockPath work?

I have a bunch of independent pieces of work that I need processes to perform. These pieces of work can be performed in any order, and they last long enough that processes sometimes fail when work is being performed.
I need to coordinate the assignment of these pieces of work, and Curator's DistributedQueue seems like it is almost what I want. I don't need the ordering it provides, though, so I am curious what level of overhead I am paying for that assuming I decline to have a single consumer (ie each process just consumes from the queue).
My main concern is how the lockPath() function on the queue builder actually works. I need the functionality it provides, because it is possible for processes to fail and I need to not be dropping the jobs they were supposed to be doing. But what I don't want is for only one process to be able to do any work at a time. If I use lockPath(), will the queue block for other processes while a process is consuming a message?
Also, if the queue seems like an unreasonable approach, is there another tool available to achieve what I want, or would I have to roll my own? I want to stay within the Curator / ZK environment but am open to alternatives within that.
(Note: I'm the main author of Apache Curator)
The documentation needs to be improved. The lock is used to make the queue entry retry-able. i.e. the entry in the queue is not removed until the consumer finishes. The lock assures that only 1 process is acting on the entry. If you don't care about dropping queue entries on failure you don't need to use the lock. With or without the lock, though, each consumer that you run processes queue entries. So, if you want to have concurrent processing of the queue you'd run multiple consumers (in the same JVM or in separate JVMs - it doesn't matter).
Here's a workflow engine I wrote that uses the Curator queue to do distributed work. Feel free to use it as it is open source: http://nirmataoss.github.io/workflow/

Serving single HTTP request with multiple threads

Angular 4 application sends a list of records to a Java spring MVC application that has been deployed in Websphere 8 Servlet container. The list is then inserted into to a temp table. After the batch insert, a procedure call is made in order to do some calculations and return results. Depending on the size of the list that was inserted into temp table it may take anywhere between: 3000ms( N ~ 500 ), 6000ms( N ~ 1000 ), 50,000+ms ( N > 2000 ).
My asendach would be to create chunks of data and simultaneously send them to database for processing. After threads (Futures) return results I would aggregate them and return back to the client. To sum up, I would split a synchronous call into multiple asynchronous processes(simultaneously executed) and return back to the client over the same thread that initiated HTTP call - landed into my controller.
Everything would be fine and I would not be asking this questions if a more experienced colleague of mine was not strongly disagreeing with this approach. His reasoning is that using this approach is prone to exceptions due to thread interrupts / timeouts / semaphores and so on. Hi is going as far as saying that multithreading should be avoided within a web container because it can crash the Servlet container in case it runs out of threads.
He proposes that we should have the browser send multiple AJAX requests and aggregates/present data in chunks.
Can you please help me understand which approach is better and why?
I would say that your approach is much better.
Threads created by application logic aren't application container threads and limited only by operating system. While each AJAX request uses a thread from application container. So the second approach reduces throughput and increases the possibility of reaching application container limit while and the first one not. Performance also should be considered because it's much cheaper to create a thread than to send a request over network. Plus each network requests uses additional resources for authentication/authorization/encryption etc.
It's definetely harder to write correct multithread code and it can easily prone to errors. However it shouldn't stop you from doing it because concurrency can significantly increase your performance. It's pretty straightforward to handle interrupts and timeouts using Future and you for sure don't need semaphores here.
Exposing this logic to client looks like breaking of encapsulation. Imagine that you use rest api which forces you to send multiple request by splitting you data in chunks. What chunk size should i use? How to deal with timeouts/interrupts? How many requests should i sent? etc. You will have almost the same challenges in both approaches, but it's much easier to deal with them using specially designed for this libraries like ExecutorService and Future.

AWS Lambda Performance issues

I use aws api gateway integrated with aws lambda(java), but I'm seeing some serious problems in this approach. The concept of removing the server and having your app scaled out of the box is really nice but here are the problem I'm facing. My lambda is doing 2 simple things- validate the payload received from the client and then send it to a kinesis stream for further processing from another lambda(you will ask why I don't send directly to the stream and only use 1 lambda for all of the operations. Let's just say that I want to separate the logic and have a layer of abstraction and also be able to tell the client that he's sending invalid data.).
In the implementation of the lambda I integrated the spring DI. So far so good. I started making performance testing. I simulated 50 concurrent users making 4 requests each with 5 seconds between the requests. So what happened- In the lambda's coldstart I initialize the spring's application context but it seems that having so many simultaneous requests when the lambda was not started is doing some strange things. Here's a screenshot of the times the context was initialized for.
What we can see from the screenshot is that the times for initializing the context have big difference. My assumption of what happening is that when so many requests are received and there's no "active" lambda it initializes a lambda container for every one of them and in the same time it "blocks" some of them(the ones with the big times of 18s) until the others already started are ready. So maybe it has some internal limit of the containers it can start at the same time. The problem is that if you don't have equally distributed traffic this will happen from time to time and some of the requests will timeout. We don't want this to happen.
So next thing was to do some tests without spring container as my thought was "ok, the initialization is heavy, let's just make plain old java objects initialization". And unfortunatelly the same thing happened(maybe just reduced the 3s container initialization for some of the requests). Here is a more detailed screenshot of the test data:
So I logged the whole lambda execution time(from construction to the end), the kinesis client initialization and the actual sending of the data to the stream as these are the heaviest operations in the lambda. We still have these big times of 18s or something but the interesting thing is that the times are somehow proportional. So if the whole lambda takes 18s, around 7-8s is the client initialization and 6-7 for sending the data to the stream and 4-5 seconds left for the other operations in the lambda which for the moment is only validation. On the other hand if we take one of the small times(which means that it reuses an already started lambda),i.e. 820ms, it takes 100ms for the kinesis client initialization and 340 for the data sending and 400ms for the validation. So this pushes me again to the thoughts that internally it makes some sleeps because of some limits. The next screenshot is showing what is happening on the next round of requests when the lamda is already started:
So we don't have this big times, yes we still have some relatively big delta in some of the request(which for me is also strange), but the things looks much better.
So I'm looking for a clarification from someone who knows actually what is happening under the hood, because this is not a good behavior for a serious application which is using the cloud because of it's "unlimited" possibilities.
And another question is related to another limit of the lambda-200 concurrent invocations in all lambdas within an account in a region. For me this is also a big limitation for a big application with lots of traffic. So as my business case in the moment(I don't know for the future) is more or less fire and forget the request. And I'm starting to think of changing the logic in the way that the gateway sends the data directly to the stream and the other lambda is taking care of the validation and the further processing. Yes, I'm loosing the current abstraction(which I don't need at the moment) but I'm increasing the application availability many times. What do you think?
The lambda execution time spikes to 18s because AWS launches new containers w/ your code to handle the incoming requests. The bootstrap time is ~18s.
Assigning more RAM can significantly improve the performance of your lambda function, because you have more RAM, CPU and networking throughput!
And another question is related to another limit of the lambda-200 concurrent invocations in all lambdas within an account in a region.
You can ask to the AWS Support to increase that limit. I asked to increase that limit to 10,000 invocation/second and the AWS Support did it quickly!
You can proxy straight to the Kinesis stream via API Gateway. You would lose some control in terms of validation and transformation, but you won't have the cold start latency that you're seeing from Lambda.
You can use the API Gateway mapping template to transform the data and if validation is important, you could potentially do that at the processing Lambda on the other side of the stream.

Patterns/Principles for thread-safe queues and "master/worker" program in Java

I have a problem which I believe is the classic master/worker pattern, and I'm seeking advice on implementation. Here's what I currently am thinking about the problem:
There's a global "queue" of some sort, and it is a central place where "the work to be done" is kept. Presumably this queue will be managed by a kind of "master" object. Threads will be spawned to go find work to do, and when they find work to do, they'll tell the master thing (whatever that is) to "add this to the queue of work to be done".
The master, perhaps on an interval, will spawn other threads that actually perform the work to be done. Once a thread completes its work, I'd like it to notify the master that the work is finished. Then, the master can remove this work from the queue.
I've done a fair amount of thread programming in Java in the past, but it's all been prior to JDK 1.5 and consequently I am not familiar with the appropriate new APIs for handling this case. I understand that JDK7 will have fork-join, and that that might be a solution for me, but I am not able to use an early-access product in this project.
The problems, as I see them, are:
1) how to have the "threads doing the work" communicate back to the master telling them that their work is complete and that the master can now remove the work from the queue
2) how to efficiently have the master guarantee that work is only ever scheduled once. For example, let's say this queue has a million items, and it wants to tell a worker to "go do these 100 things". What's the most efficient way of guaranteeing that when it schedules work to the next worker, it gets "the next 100 things" and not "the 100 things I've already scheduled"?
3) choosing an appropriate data structure for the queue. My thinking here is that the "threads finding work to do" could potentially find the same work to do more than once, and they'd send a message to the master saying "here's work", and the master would realize that the work has already been scheduled and consequently should ignore the message. I want to ensure that I choose the right data structure such that this computation is as cheap as possible.
Traditionally, I would have done this in a database, in sort of a finite-state-machine manner, working "tasks" through from start to complete. However, in this problem, I don't want to use a database because of the high volume and volatility of the queue. In addition, I'd like to keep this as light-weight as possible. I don't want to use any app server if that can be avoided.
It is quite likely that this problem I'm describing is a common problem with a well-known name and accepted set of solutions, but I, with my lowly non-CS degree, do not know what this is called (i.e. please be gentle).
Thanks for any and all pointers.
As far as I understand your requirements, you need ExecutorService. ExecutorService have
submit(Callable task)
method which return value is Future. Future is a blocking way to communicate back from worker to master. You could easily expand this mechanism to work is asynchronous manner. And yes, ExecutorService also maintaining work queue like ThreadPoolExecutor. So you don't need to bother about scheduling, in most cases. java.util.concurrent package already have efficient implementations of thread safe queue (ConcurrentLinked queue - nonblocking, and LinkedBlockedQueue - blocking).
Check out java.util.concurrent in the Java library.
Depending on your application it might be as simple as cobbling together some blocking queue and a ThreadPoolExecutor.
Also, the book Java Concurrency in Practice by Brian Goetz might be helpful.
First, why do you want to hold the items after a worker started doing them? Normally, you would have a queue of work and a worker takes items out of this queue. This would also solve the "how can I prevent workers from getting the same item"-problem.
To your questions:
1) how to have the "threads doing the
work" communicate back to the master
telling them that their work is
complete and that the master can now
remove the work from the queue
The master could listen to the workers using the listener/observer pattern
2) how to efficiently have the master
guarantee that work is only ever
scheduled once. For example, let's say
this queue has a million items, and it
wants to tell a worker to "go do these
100 things". What's the most efficient
way of guaranteeing that when it
schedules work to the next worker, it
gets "the next 100 things" and not
"the 100 things I've already
scheduled"?
See above. I would let the workers pull the items out of the queue.
3) choosing an appropriate data
structure for the queue. My thinking
here is that the "threads finding work
to do" could potentially find the same
work to do more than once, and they'd
send a message to the master saying
"here's work", and the master would
realize that the work has already been
scheduled and consequently should
ignore the message. I want to ensure
that I choose the right data structure
such that this computation is as cheap
as possible.
There are Implementations of a blocking queue since Java 5
Don't forget Jini and Javaspaces. What you're describing sounds very like the classic producer/consumer pattern that space-based architectures excel at.
A producer will write the jobs into the space. 1 or more consumers will take out jobs (under a transaction) and work on that in parallel, and then write the results back. Since it's under a transaction, if a problem occurs the job is made available again for another consumer .
You can scale this trivially by adding more consumers. This works especially well when the consumers are separate VMs and you scale across the network.
If you are open to the idea of Spring, then check out their Spring Integration project. It gives you all the queue/thread-pool boilerplate out of the box and leaves you to focus on the business logic. Configuration is kept to a minimum using #annotations.
btw, the Goetz is very good.
This doesn't sound like a master-worker problem, but a specialized client above a threadpool. Given that you have a lot of scavenging threads and not a lot of processing units, it may be worthwhile simply doing a scavaging pass and then a computing pass. By storing the work items in a Set, the uniqueness constraint will remove duplicates. The second pass can submit all of the work to an ExecutorService to perform the process in parallel.
A master-worker model generally assumes that the data provider has all of the work and supplies it to the master to manage. The master controls the work execution and deals with distributed computation, time-outs, failures, retries, etc. A fork-join abstraction is a recursive rather than iterative data provider. A map-reduce abstraction is a multi-step master-worker that is useful in certain scenarios.
A good example of master-worker is for trivially parallel problems, such as finding prime numbers. Another is a data load where each entry is independant (validate, transform, stage). The need to process a known working set, handle failures, etc. is what makes a master-worker model different than a thread-pool. This is why a master must be in control and pushes the work units out, whereas a threadpool allows workers to pull work from a shared queue.

Categories