How to have different machines running different tasks? - java

I have an application with five different tasks. Each of the five tasks runs at a different time of day. I will be deploying this application on 4 different machines.
Until now I have been running all five tasks on a single machine, choosing the leader among the four machines with Apache ZooKeeper. But with this approach the other three machines sit idle, so I was wondering whether there is a way to have different machines run different tasks. That is, each of the four machines would run some of the five tasks, but no two machines would run the same task.
Can anyone provide an example of how I would do this?
Update:-
I don't have any dependency between different tasks. They all are independent of each other.

I would have five nodes: pending_tasks, completed_tasks, running_tasks, workers, and queues. pending_tasks is the node that holds tasks, both new ones and the ones that will be re-triggered due to failures in worker nodes. completed_tasks holds the details of completed tasks. running_tasks holds the tasks that are assigned to workers. In a PoC implementation I did once, I used XML-encoded POJOs to store the tasks' details. Nodes in pending_tasks, completed_tasks, and running_tasks are all persistent nodes.
workers holds ephemeral nodes that represent available workers. Since they are ephemeral, the disappearance of one of these nodes signals a worker failure. queues is directly related to workers: there is a node in queues for each node in workers. The nodes in queues hold the tasks assigned to each of the workers.
Now, you need a master. The master is responsible for three things: i) watching pending_tasks for new tasks; ii) watching workers to register new queues when new workers arrive, and to put tasks back in pending_tasks when workers go missing; and iii) publishing the results of the tasks in completed_tasks (when I did this PoC, the results would go through a publish/subscribe notification mechanism). Besides that, the master must perform some clean-up at start-up, given that workers might fail during the master's downtime.
The master algorithm is the following:
at (start-up) {
  for (q -> nodesOf(/queues)) {
    if q.name not in nodesOf(/workers) {
      for (t -> nodesOf(/queues/q.name)) {
        create /pending_tasks/t.name
        delete /running_tasks/t.name
        delete /queues/q.name/t.name
      }
      delete /queues/q.name
    }
  }
  for (t -> nodesOf(/completed_tasks)) {
    publish the result
    delete /completed_tasks/t.name
  }
}
watch (/workers) {
  case c: Created => register the new worker queue
  case d: Deleted => transaction {
    for (t -> nodesOf(/queues/d.name)) {
      create /pending_tasks/t.name
      delete /running_tasks/t.name
      delete /queues/d.name/t.name
    }
    delete /queues/d.name
  }
}
watch (/pending_tasks) {
  case c: Created => transaction {
    create /running_tasks/c.name
    create persistent node in one of the workers' queues (e.g., /queues/worker_0/c.name)
    delete /pending_tasks/c.name
  }
}
watch (/completed_tasks) {
  case c: Created =>
    publish the result
    delete /completed_tasks/c.name
}
The worker algorithm is the following:
at (start-up) {
  create /queues/this.name
  create an ephemeral node /workers/this.name
}
watch (/queues/this.name) {
  case c: Created =>
    perform the task
    transaction {
      create /completed_tasks/c.name with the result
      delete /queues/this.name/c.name
      delete /running_tasks/c.name
    }
}
Some notes on what I had in mind with this design. First, at any given time, no two tasks targeting the same computation were to run. Therefore, I named the tasks after the computation they performed. So, if two different clients requested the same computation, only one would succeed, since only one would be able to create the node under /pending_tasks. Likewise, if the task is already running, the creation of the node under /running_tasks would fail and no new task would be dispatched.
Second, there might be arbitrary failures in both masters and workers, and no task will be lost. If a worker fails, the watch on delete events in /workers will trigger the reassignment of its tasks. If the master fails and any number of workers fail before a new master is in place, the start-up procedure will move their tasks back to /pending_tasks and publish any pending results.
Third, I might have forgotten some corner case since I have no access to this PoC implementation anymore. I'll be glad to discuss any issue.
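The bookkeeping above can be tried out without a ZooKeeper ensemble. The following is only a minimal in-memory sketch in Python: the TaskBoard class and its method names are hypothetical, plain sets and dicts stand in for the znodes, and picking the least-loaded worker is an assumption (the design above only says "one of the workers' queues"). It shows duplicate task names being rejected and a lost worker's tasks going back to pending.

```python
class TaskBoard:
    """In-memory stand-in for the znode tree described above (hypothetical)."""

    def __init__(self):
        self.pending = set()    # /pending_tasks
        self.running = set()    # /running_tasks
        self.completed = {}     # /completed_tasks: task name -> result
        self.queues = {}        # /queues/<worker>: set of task names
        self.workers = set()    # /workers (ephemeral nodes in ZooKeeper)

    def submit(self, name):
        # Creating a node that already exists fails, so duplicates are rejected.
        if name in self.pending or name in self.running:
            return False
        self.pending.add(name)
        return True

    def worker_joined(self, worker):
        self.workers.add(worker)
        self.queues.setdefault(worker, set())

    def assign_pending(self):
        # Master: move each pending task into one live worker's queue.
        # Least-loaded selection is an assumption made for this sketch.
        for name in sorted(self.pending):
            if not self.workers:
                return
            target = min(sorted(self.workers), key=lambda w: len(self.queues[w]))
            self.running.add(name)
            self.queues[target].add(name)
            self.pending.discard(name)

    def worker_lost(self, worker):
        # Master: the ephemeral node vanished, so requeue everything it held.
        self.workers.discard(worker)
        for name in self.queues.pop(worker, set()):
            self.running.discard(name)
            self.pending.add(name)

    def complete(self, worker, name, result):
        # Worker: publish the result and clear the bookkeeping entries.
        self.completed[name] = result
        self.queues[worker].discard(name)
        self.running.discard(name)


board = TaskBoard()
for machine in ("m1", "m2", "m3", "m4"):
    board.worker_joined(machine)
for task in ("task1", "task2", "task3", "task4", "task5"):
    board.submit(task)
board.assign_pending()   # five tasks spread over four machines, none duplicated
board.worker_lost("m1")  # m1's tasks return to pending
board.assign_pending()   # and are picked up by the surviving machines
```

In a real deployment each of these methods would be a ZooKeeper operation (ideally a multi-op transaction), and the master would react to watches rather than being called directly.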

Related

Architecturing Airflow DAG that needs contextual throttling

I have a group of job units (workers) that I want to run as a DAG
Group1 has 10 workers and each worker does multiple table extracts from a DB. Note that each worker maps to a single DB instance and each worker needs to successfully deal with 100 tables in total before it can successfully mark itself as complete
Group1 has a limitation that says no more than 5 tables across all those 10 workers should be consumed at a time. For example:
Worker1 is extracting 2 tables
Worker2 is extracting 2 tables
Worker3 is extracting 1 table
Worker4...Worker10 need to wait until Worker1...Worker3 relinquish their threads
Worker4...Worker10 can pick up tables as soon as threads in step1 free up
As each worker completes all of its 100 tables, it proceeds to step2 without waiting. Step2 has no concurrency limits
I should be able to create a single node, Group1, that handles the throttling, and also have 10 independent worker nodes so I can restart any one of them if it fails
I have tried explaining this in the following diagram:
If any of the workers fails, I can restart it without affecting the other workers. It still uses the same thread pool from Group1, so the concurrency limits are enforced
Group1 would complete once all elements of step1 and step2 are complete
Step2 doesn't have any concurrency measures
How do I implement such a hierarchy in Airflow for a Spring Boot Java application?
Is it possible to design this kind of DAG using Airflow constructs and dynamically tell the Java application how many tables it can extract at a time? For instance, if all workers except Worker1 are finished, Worker1 can now use all 5 available threads while everything else proceeds to step2.
These constraints cannot be modeled as a directed acyclic graph, and thus cannot be implemented in Airflow exactly as described. However, they can be modeled as queues, and thus could be implemented with a job queue framework. Here are your two options:
Implement suboptimally as airflow DAG:
from airflow.models import DAG
from airflow.operators.subdag_operator import SubDagOperator
# Executors that inherit from BaseExecutor take a parallelism parameter
from wherever import SomeExecutor, SomeOperator
# Table load jobs are done with parallelism 5
load_tables = SubDagOperator(subdag=DAG("load_tables"), executor=SomeExecutor(parallelism=5))
# Each table load must be its own job, or must be split into sets of tables of predetermined size, such that num_tables_per_job * parallelism = 5
for table in tables:
    # Tasks are attached to the sub-DAG wrapped by the operator, not to the operator itself
    load_table = SomeOperator(task_id=f"load_table_{table}", dag=load_tables.subdag)
# Jobs done afterwards are done with higher parallelism
afterwards = SubDagOperator(
    subdag=DAG("afterwards"), executor=SomeExecutor(parallelism=high_parallelism)
)
for job in jobs:
    afterward_job = SomeOperator(task_id=f"job_{job}", dag=afterwards.subdag)
# After _all_ table load jobs are complete, start the jobs that should be done afterwards
load_tables >> afterwards
The suboptimal aspect here is that, for the first half of the DAG, the cluster will be underutilized by high_parallelism - 5.
Implement optimally with job queue:
# This is pseudocode, but could be easily adapted to a framework like Celery
# You need two queues
# The table load queue should be initialized with the job items
table_load_queue = Queue(initialize_with_tables)
# The queue for jobs to do afterwards starts empty
afterwards_queue = Queue()
def worker():
    # Work while there's at least one item in either queue
    while not table_load_queue.empty() or not afterwards_queue.empty():
        working_on_table_load = [w.is_working_table_load for w in scheduler.active()]
        # Work table loads if we haven't reached capacity, otherwise work the jobs afterwards
        if sum(working_on_table_load) < 5:
            is_working_table_load = True
            task = table_load_queue.dequeue()
        else:
            is_working_table_load = False
            task = afterwards_queue.dequeue()
        if task:
            after = work(task)
            if is_working_table_load:
                # After working a table load, create the job to work afterwards
                afterwards_queue.enqueue(after)
# Use all the parallelism available
scheduler.start(worker, num_workers=high_parallelism)
Using this approach, the cluster won't be underutilized.
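For anyone who wants to experiment with the two-queue idea locally, here is a small runnable sketch using only the Python standard library. It is not the pseudocode above made literal: instead of asking a scheduler which workers are loading tables, it swaps in a threading.Semaphore with 5 permits to cap concurrent table loads. The function name run_pipeline, the worker count of 8, and the sleep standing in for an extract are all assumptions for illustration.

```python
import queue
import threading
import time


def run_pipeline(tables, num_workers=8, load_limit=5):
    """Load every table (at most load_limit at once), then run its follow-up job."""
    table_load_queue = queue.Queue()
    afterwards_queue = queue.Queue()
    for table in tables:
        table_load_queue.put(table)
    total = len(tables)

    load_slots = threading.Semaphore(load_limit)  # caps concurrent table loads
    stats_lock = threading.Lock()
    stats = {"active_loads": 0, "max_active_loads": 0,
             "loads_done": 0, "afterwards_done": 0}

    def load_table(table):
        with stats_lock:
            stats["active_loads"] += 1
            stats["max_active_loads"] = max(stats["max_active_loads"],
                                            stats["active_loads"])
        time.sleep(0.001)  # stand-in for the real extract
        with stats_lock:
            stats["active_loads"] -= 1
            stats["loads_done"] += 1

    def run_afterwards(job):
        with stats_lock:
            stats["afterwards_done"] += 1

    def worker():
        while True:
            with stats_lock:
                if stats["afterwards_done"] == total:
                    return  # every table was loaded and post-processed
            table = None
            if load_slots.acquire(blocking=False):
                try:
                    table = table_load_queue.get_nowait()
                except queue.Empty:
                    load_slots.release()  # nothing left to load; free the slot
            if table is not None:
                try:
                    load_table(table)
                    afterwards_queue.put(table)  # schedule the follow-up job
                finally:
                    load_slots.release()
                continue
            try:
                job = afterwards_queue.get(timeout=0.01)
            except queue.Empty:
                continue  # nothing ready yet; loop and re-check
            run_afterwards(job)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return stats


result = run_pipeline([f"table_{i}" for i in range(20)])
print(result["max_active_loads"], result["loads_done"], result["afterwards_done"])
```

Workers that cannot get a load slot fall through to the follow-up queue, so the extra parallelism is used as soon as any follow-up work exists.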

How is back-and-forth communication established in MPI?

For example, I have a root process which sends some computations to be completed by worker processes. But because I have a limited number (4) of processes, I have to share the workload among all of them, so I send multiple times. The workaround that I have found is this:
int me = MPI.COMM_WORLD.Rank();
if (me == 0) {
    sendToWorkers(); // Sends more than once to workers.
}
else {
    while (true) { // Wait indefinitely, accept data received from root process and work on it.
        MPI.COMM_WORLD.Recv(Buf, 0, Buf.length, MPI.INT, 0, 0);
        doTask(Buf);
    }
}
Now the problem arises that I want to send data that has completed processing back to the root process but I can't do another while(true);. I am sure there must be a much more elegant way to accomplish this.
EDIT 1: The reason why I want to send to root process is because it is cleaner. However, alternatively I can just print computed solutions from the worker processes but the output is all mangled up due to interleaving. Declaring the print method to be synchronized doesn't work.
One simple solution: at the end of task distribution, the master sends a "FINISH/STOP/END" message (any custom message indicating that there are no more tasks) to all workers. A worker that receives the finish message exits its loop and sends its results back to the master. The master can then run a loop over the total number of tasks and wait for those results.
From the example shown, this is a typical master-worker use case. When you send a task to a worker using MPI_Send(), there is a corresponding MPI_Recv() in the worker process. After receiving a task, the worker performs doTask(Buf) and then goes back to the top of the loop. So, to summarise, a rank receives a new task only after it has finished computing the previously received one, right? In that case, the master process can also wait for a reply from any finished task and send new tasks based on that. You might consider that approach. If your doTask uses threads, this becomes more complicated: each worker node then has to keep track of its own tasks, and after all the tasks are completed, the master should start a loop and wait for the results.
Or you can use a multithreaded implementation, with separate threads for sending and receiving in the master.
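The stop-message pattern from the first suggestion can be illustrated without an MPI installation. In the Python sketch below, threads play the role of ranks and queue.Queue objects stand in for Send/Recv, which is purely an assumption to keep the example runnable; do_task, worker, and master are hypothetical names. Each worker collects its results, and only after seeing the STOP sentinel does it send them back, while the master performs exactly one receive per task it handed out.

```python
import queue
import threading

STOP = object()  # sentinel playing the role of the "FINISH/STOP/END" message


def do_task(value):
    return value * value  # stand-in for the real computation


def worker(rank, inbox, outbox):
    done = []
    while True:
        task = inbox.get()        # stands in for a blocking Recv from rank 0
        if task is STOP:
            break                 # no more work: leave the receive loop
        done.append((task, do_task(task)))
    for task, value in done:
        outbox.put((rank, task, value))  # stands in for a Send back to rank 0


def master(tasks, num_workers=3):
    inboxes = [queue.Queue() for _ in range(num_workers)]
    results = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(rank + 1, inboxes[rank], results))
        for rank in range(num_workers)
    ]
    for thread in threads:
        thread.start()
    for index, task in enumerate(tasks):
        inboxes[index % num_workers].put(task)  # round-robin distribution
    for inbox in inboxes:
        inbox.put(STOP)                         # tell every worker to finish
    collected = [results.get() for _ in tasks]  # one receive per task sent
    for thread in threads:
        thread.join()
    return sorted(collected, key=lambda item: item[1])


for rank, task, value in master([1, 2, 3, 4, 5, 6, 7]):
    print(f"rank {rank} computed {task} -> {value}")
```

Because only the master prints, the interleaved-output problem mentioned in the question does not arise.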

How do I send messages to idle akka actors?

I have an actor called a TaskRunner. The tasks can take up to 1 minute to run. Because of the library I use there can only be one actor per jvm/node. I have 1000 of these nodes across various machines.
I would like to distribute tasks to these nodes using various rules but the most important one is:
Never queue tasks in a TaskRunner node's mailbox, always wait until a TaskRunner is free before sending it a task
The way I was thinking of doing this is to have an actor on another node (let's call this the Scheduler actor) listen to registrations from the TaskRunner nodes and keep an internal state of what has been sent where.
Presumably if I did this I could only ever have one instance of this Scheduler actor because if there was more than one they wouldn't know which TaskRunner nodes were currently busy and so we would get tasks in the queue.
Does this mean I should be using a Cluster Singleton for the Scheduler actor?
Is there a better way to achieve my goal?
I would say you need:
a dispatcher actor (cluster singleton) that sends each task to an actor from a pool of idle actors
your TaskRunner actor should have two states: running and idle. In the idle state it should register itself with the dispatcher actor at regular intervals (notifying it that it is idle). At regular intervals, because the dispatcher may lose its state when a node shuts down and the singleton moves to another node.
the dispatcher itself keeps a list of idle actors. When a new task needs to be done and the list is not empty, a worker is taken from the list and the task is sent to it (the worker could be removed from the list immediately, but it is safer to wait for an Ack to be sure the task was taken for processing, and to re-send to another worker if the Ack times out)
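The dispatcher's bookkeeping in the list above is independent of Akka, so it can be sketched as plain Python. This is only an illustration under stated assumptions: Dispatcher, register_idle, task_finished, and the send callback are hypothetical names, the send callback stands in for telling a TaskRunner actor, and the Ack/timeout handling mentioned above is left out to keep it short.

```python
from collections import deque


class Dispatcher:
    """Hands each task to an idle worker; never queues work at a busy one."""

    def __init__(self, send):
        self._send = send        # callable(worker, task); hypothetical transport
        self._idle = deque()     # workers known to be idle, oldest first
        self._backlog = deque()  # tasks waiting at the dispatcher for a free worker
        self._in_flight = {}     # worker -> task it is currently running

    def register_idle(self, worker):
        # Workers re-register periodically; ignore repeats and busy workers.
        if worker in self._in_flight or worker in self._idle:
            return
        self._idle.append(worker)
        self._drain()

    def task_finished(self, worker):
        # The worker reported completion, so it is idle again.
        self._in_flight.pop(worker, None)
        self.register_idle(worker)

    def submit(self, task):
        self._backlog.append(task)
        self._drain()

    def _drain(self):
        # Pair waiting tasks with idle workers, one task per worker.
        while self._idle and self._backlog:
            worker = self._idle.popleft()
            task = self._backlog.popleft()
            self._in_flight[worker] = task
            self._send(worker, task)


sent = []
dispatcher = Dispatcher(lambda worker, task: sent.append((worker, task)))
dispatcher.submit("a")              # no idle worker yet: held by the dispatcher
dispatcher.register_idle("runner-1")
dispatcher.submit("b")              # runner-1 is busy: held again
dispatcher.register_idle("runner-2")
print(sent)
```

Tasks wait in the dispatcher's own backlog, so a TaskRunner's mailbox never holds more than the task it is about to run.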
Given your requirement, rather than building everything from scratch, you might want to consider adapting Lightbend's distributed-workers template which employs a pull model. It primarily consists of 1) a master cluster singleton that maintains state of workers, and, 2) an actor system of workers which register and pull work from the master singleton actor.
I adapted a repurposed version of the template for an R&D project in the past and it delivered the work-pulling functionality as advertised. Note that the template uses the retired Activator (which can be easily detached from the main code or replaced with sbt). It also does distributed pub-sub and uses a persistence journal, which you can elect to exclude if not needed. Its source code is available at GitHub.
Referring to your approach of a singleton master and multiple workers, there can be a situation where your master is overloaded with tasks to schedule, which may increase the time it takes to schedule tasks to the workers.
So instead of making the master a cluster singleton, you can have multiple masters, each with a subset of workers assigned to it.
The distribution of work to the different masters can be done through cluster sharding, based on a sharding key.
Akka provides cluster sharding; you can refer to its documentation.
And to make your masters fault tolerant, you can implement them as persistent actors.

RabbitMQ how to split jobs to tasks and handle results

I have the following use case on a Spring-based Web application:
I need to apply the Competing Consumers EIP with the following twists: the messages in the queue are actually split tasks belonging to the same job. Therefore, I need to properly track when all tasks of a job get completed and their completion status in order to save the scenario either as COMPLETED or FAILED, log the outcome and notify by e.g. e-mail the users accordingly
So, given the requirements I described above, my question is:
Can this be done with RabbitMQ and if yes how?
I created a quick gist to show a very crude example of how one could do it. In this example, there is one producer, 2 consumers, and 2 queues: the producer publishes to the "SEND" queue, which the consumers consume, and the consumers publish to the "RECV" queue, which the producer consumes.
Bear in mind this is a pretty crude example, as the producer in this case simply sends one job (a random number of tasks between 0 and 5) and blocks until the job is done. A way around this would be to store each job id and its number of tasks in a Map, and on every reply compare the number of completed tasks reported for that job id against it.
What you are trying to do is beyond the scope of RabbitMQ. RabbitMQ is for sending and receiving messages, with the ability to queue them.
It can't track your job's tasks for you.
You will need a "Job Storage" service. Whenever a consumer finishes a task, it updates the Job Storage service, marking the task as done. The Job Storage service knows how many tasks are in the job, and when the last task is done, it marks the job as succeeded. This service is also where you implement all your other business logic, such as when to treat a job as failed.
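The counting logic of such a service is small enough to sketch in Python. Everything here is hypothetical: the JobStorage class, its method names, the in-memory dict in place of a real database, and the rule that a single failed task fails the whole job are assumptions made for illustration. A consumer would call report() after processing each message.

```python
class JobStorage:
    """Tracks per-job task counts; a real service would persist this state."""

    def __init__(self):
        self._jobs = {}  # job_id -> {"total": int, "done": int, "failed": int}

    def create_job(self, job_id, total_tasks):
        # Called by the producer before it publishes the job's tasks.
        self._jobs[job_id] = {"total": total_tasks, "done": 0, "failed": 0}

    def report(self, job_id, ok):
        # Called by a consumer once it has finished (or failed) one task.
        job = self._jobs[job_id]
        job["done" if ok else "failed"] += 1
        return self.status(job_id)

    def status(self, job_id):
        job = self._jobs[job_id]
        if job["done"] + job["failed"] < job["total"]:
            return "RUNNING"
        # Assumed policy: any failed task fails the whole job.
        return "FAILED" if job["failed"] else "COMPLETED"


storage = JobStorage()
storage.create_job("job-1", total_tasks=3)
storage.report("job-1", ok=True)
storage.report("job-1", ok=True)
final = storage.report("job-1", ok=True)
print(final)  # the point at which to log the outcome and e-mail the users
```

With several consumers reporting concurrently, the increments would need to be atomic, for example a transactional update in the backing database.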

flink - cluster not using cluster

I've set up a 3 node cluster that was distributing tasks (steps? jobs?) pretty evenly until the most recent one, which has been assigned entirely to one machine.
Topology (do we still use this term for flink?):
kafka (3 topics on different feeds) -> flatmap -> union -> map
Is there something about this setup that would tell the cluster manager to put everything on one machine?
Also - what are the 'not set' values in the image? Some step I've missed? Or some to-be-implemented UI feature?
It is actually on purpose that Flink schedules your job on a single TaskManager. To understand why, let me quickly explain Flink's resource scheduling algorithm.
First of all, in the Flink world a slot can accommodate more than one task (a task being a parallel instance of an operator). In fact, it can accommodate one parallel instance of each operator. The reason for this is that Flink executes not only streaming jobs but also batch jobs in a streaming fashion. By streaming fashion I mean that Flink brings all operators of your dataflow graph online so that intermediate results can be streamed directly to downstream operators, where they are consumed. By default Flink tries to combine one task of each operator in one slot.
When Flink schedules the tasks to the different slots, it tries to co-locate the tasks with their inputs to avoid unnecessary network communication. For sources, the co-location depends on the implementation. For file-based sources, for example, Flink tries to assign local file input splits to the different tasks.
So if we apply this to your job, we see the following. You have three different sources with parallelism 1. All sources belong to the same resource sharing group, thus the single task of each operator will be deployed to the same slot. The initial slot is randomly chosen from the available instances (actually it depends on the order of the TaskManager registration at the JobManager) and then filled up. Let's say the chosen slot is on machine node1.
Next we have the three flat map operators, which have a parallelism of 2. Here again, one of the two sub-tasks of each flat map operator can be deployed to the same slot that already accommodates the three sources. The second sub-task, however, has to be placed in a new slot. When this happens, Flink tries to choose a free slot that is co-located with a slot in which one of the task's inputs is deployed (again to reduce network communication). Since only one slot of node1 is occupied and thus 31 are still free, it will also deploy the 2nd sub-task of each flatMap operator to node1.
The same now applies to the tumbling window reduce operation. Flink tries to co-locate all the tasks of the window operator with its inputs. Since all of its inputs run on node1 and node1 has enough free slots to accommodate the 6 sub-tasks of the window operator, they will be scheduled to node1. It's important to note that 1 window task will run in the slot which contains the three sources and one task of each flatMap operator.
I hope this explains why Flink only uses the slots of a single machine for the execution of your job.
The problem is that you are building a global window on an unkeyed (ungrouped) stream, so the window has to run on one machine.
Maybe you can also express your application logic differently so that you can group the stream.
The "(not set)" part is probably an issue in Flink's DataStream API, which is not setting default operator names.
Jobs implemented against the DataSet API will look like this:
