I have a group of job units (workers) that I want to run as a DAG
Group1 has 10 workers and each worker does multiple table extracts from a DB. Note that each worker maps to a single DB instance and each worker needs to successfully deal with 100 tables in total before it can successfully mark itself as complete
Group1 has a limitation that says no more than 5 tables across all those 10 workers should be consumed at a time. For example:
Worker1 is extracting 2 tables
Worker2 is extracting 2 tables
Worker3 is extracting 1 table
Worker4...Worker10 need to wait until Worker1...Worker3 relinquish their threads
Worker4...Worker10 can pick up tables as soon as threads in step1 free up
As each worker completes all the 100 tables, it proceeds to step2 without waiting. Step2 has no concurrency limits
I should be able to create a single node Group1 that handles the throttling, and also have 10 independent worker nodes so I can restart any one of them if it fails
I have tried explaining this in the following diagram:
If any of the workers fails, I can restart it without affecting the other workers. It still uses the same thread pool from Group1, so the concurrency limits are enforced
Group1 would complete once all elements of step1 and step2 are complete
Step2 doesn't have any concurrency measures
How do I implement such a hierarchy in Airflow for a Spring Boot Java application?
Is it possible to design this kind of DAG using Airflow constructs and dynamically tell the Java application how many tables it can extract at a time? For instance, if all workers except Worker1 are finished, Worker1 can now use all 5 available threads while everything else proceeds to step2.
These constraints cannot be modeled as a directed acyclic graph, and thus cannot be implemented in Airflow exactly as described. However, they can be modeled as queues, and thus could be implemented with a job queue framework. Here are your two options:
Implement suboptimally as airflow DAG:
from airflow.models import DAG
from airflow.operators.subdag_operator import SubDagOperator
# Executors that inherit from BaseExecutor take a parallelism parameter
from wherever import SomeExecutor, SomeOperator
# Table load jobs are done with parallelism 5
load_tables = SubDagOperator(subdag=DAG("load_tables"), executor=SomeExecutor(parallelism=5))
# Each table load must be its own job, or must be split into sets of tables of predetermined size, such that num_tables_per_job * parallelism = 5
for table in tables:
    # Tasks belong to the sub-DAG wrapped by the operator, not to the operator itself
    load_table = SomeOperator(task_id=f"load_table_{table}", dag=load_tables.subdag)
# Jobs done afterwards are done with higher parallelism
afterwards = SubDagOperator(
    subdag=DAG("afterwards"), executor=SomeExecutor(parallelism=high_parallelism)
)
for job in jobs:
    afterward_job = SomeOperator(task_id=f"job_{job}", dag=afterwards.subdag)
# After _all_ table load jobs are complete, start the jobs that should be done afterwards
load_tables >> afterwards
The suboptimal aspect here is that, for the first half of the DAG, the cluster will be underutilized by high_parallelism - 5.
Implement optimally with job queue:
# This is pseudocode, but could be easily adapted to a framework like Celery
# You need two queues
# The table load queue should be initialized with the job items
table_load_queue = Queue(initialize_with_tables)
# The queue for jobs to do afterwards starts empty
afterwards_queue = Queue()
def worker():
    # Work while there's at least one item in either queue
    while not table_load_queue.empty() or not afterwards_queue.empty():
        working_on_table_load = [w.is_working_table_load for w in scheduler.active()]
        # Work table loads if we haven't reached capacity, otherwise work the jobs afterwards
        if sum(working_on_table_load) < 5:
            is_working_table_load = True
            task = table_load_queue.dequeue()
        else:
            is_working_table_load = False
            task = afterwards_queue.dequeue()
        if task:
            after = work(task)
            if is_working_table_load:
                # After working a table load, create the job to work afterwards
                afterwards_queue.enqueue(after)
# Use all the parallelism available
scheduler.start(worker, num_workers=high_parallelism)
Using this approach, the cluster won't be underutilized.
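To make the shared cap concrete, here is a minimal single-process sketch that swaps the two-queue scheduler for a counting semaphore. It is not Airflow or Celery code: the `Group` class, `extract_table`, and `run_group` are hypothetical names, the sleep stands in for a real table extract, and a cap shared across separate JVMs would need an external coordinator that this sketch does not cover.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class Group:
    """Hypothetical stand-in for Group1: owns the shared table-extract limit."""

    def __init__(self, limit):
        self.limit = limit
        self.slots = threading.BoundedSemaphore(limit)  # shared by all workers
        self.lock = threading.Lock()
        self.in_flight = 0      # extracts running right now
        self.peak = 0           # highest concurrency observed
        self.step2_done = []    # workers that reached step2

    def extract_table(self, worker_id, table_id):
        # Every extract, from any worker, must hold one of the shared slots.
        with self.slots:
            with self.lock:
                self.in_flight += 1
                self.peak = max(self.peak, self.in_flight)
            time.sleep(0.001)  # placeholder for the real extract
            with self.lock:
                self.in_flight -= 1

    def run_worker(self, worker_id, num_tables):
        # A worker may use up to `limit` threads, so the last worker still
        # running can take every slot once the others have finished.
        with ThreadPoolExecutor(max_workers=self.limit) as pool:
            list(pool.map(lambda t: self.extract_table(worker_id, t),
                          range(num_tables)))
        # step2 starts as soon as this worker's tables are done, uncapped.
        with self.lock:
            self.step2_done.append(worker_id)


def run_group(num_workers=10, num_tables=100, limit=5):
    group = Group(limit)
    with ThreadPoolExecutor(max_workers=num_workers) as workers:
        futures = [workers.submit(group.run_worker, w, num_tables)
                   for w in range(1, num_workers + 1)]
        for future in futures:
            future.result()  # re-raise any worker failure
    return group
```

Calling `run_group()` with the defaults mirrors the numbers in the question (10 workers, 100 tables each, at most 5 extracts at once); `group.peak` records the highest concurrency that was actually reached.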
I have data stored in HDFS and N Postgres shards (separate instances).
During data retrieval, we pick the exact shard for an SQL request based on the Java calculation COLUMN_STRING_DATA.hashCode() % N.
We have a Spark setup with 5 workers, each with 32 CPUs; CPUs per task is 2.
I'm struggling with uploading data to the shards because the number of database connections is limited by the Database engineering team. Each worker tries to upload data to every shard, so I hit the allowed connection limit on the PG shards. As a result, I can't fully utilize the workers or the PG instances.
So I would like to somehow bind a worker to work with a unique, limited group of PG shards.
spark.read()
    .option("header", "true")
    .orc(pathToHdfs)
    .withColumn("hash_group", functions.hash(new Column("search")).mod(PG_SHARDS_SIZE))
functions.hash uses the Murmur3 hash algorithm, whereas during data retrieval from PG I choose the shard based on Java String.hashCode() % N. I googled, and I'm able to write custom functions, so that shouldn't be a problem.
The question is how, after creating a new column with Integer values in the range [1..PG_SHARDS_SIZE], I can assign Worker #1 to work only with values [1,2] of column hash_group, Worker #2 with values [3,4], and so on.
Maybe there are other approaches to binding a worker to an exact group of PG shards?
I have a 1-to-1 Worker-Executor mapping, so binding an Executor to a group of PG shards works for me as well.
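As a rough illustration of the shard arithmetic described above, the sketch below reimplements Java's `String.hashCode()` in Python and maps shards to workers in fixed-size groups. The function names and the grouping rule (`shard // shards_per_worker`) are assumptions made for illustration, not Spark APIs. Note that Python's `%` always returns a non-negative result for a positive divisor (like Java's `Math.floorMod`), whereas Java's `%` can return a negative value for a negative hash, so the retrieval side has to use the same convention.

```python
def java_string_hashcode(s: str) -> int:
    """Reproduce Java's String.hashCode(): h = 31*h + unit over UTF-16 code units,
    with 32-bit signed overflow."""
    h = 0
    data = s.encode("utf-16-be")  # two bytes per UTF-16 code unit
    for i in range(0, len(data), 2):
        unit = (data[i] << 8) | data[i + 1]
        h = (31 * h + unit) & 0xFFFFFFFF
    # Reinterpret the unsigned 32-bit value as a signed Java int.
    return h - 0x100000000 if h >= 0x80000000 else h


def shard_for(value: str, num_shards: int) -> int:
    # Assumption: shards are indexed 0..num_shards-1 using a floor modulo.
    return java_string_hashcode(value) % num_shards


def worker_for_shard(shard: int, shards_per_worker: int) -> int:
    # Hypothetical grouping: worker 0 owns shards [0, k), worker 1 owns [k, 2k), ...
    return shard // shards_per_worker


# Example: with 8 shards and 2 shards per worker, each row is routed to one of 4 workers.
rows = ["abc", "hello", "search term"]
routing = {r: worker_for_shard(shard_for(r, 8), 2) for r in rows}
```

With a column computed this way, repartitioning by the worker index rather than by the raw shard index would keep each group of shards on one partition; whether a given partition lands on a specific executor is up to Spark's scheduler and is not guaranteed by this sketch.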
I have a queue with jobs that goes to different executor pools depending on the type of jobs. The queue is in a DB table and contains jobs from different clients with priorities, etc. I'm omitting some details irrelevant to the question.
At some point different clients put many jobs in the queue at the same time with the same priority, for example about 15-20'000 jobs.
With the current implementation, jobs are fetched using Hibernate with the following criteria; again, I'm omitting some restrictions for simplicity.
Calendar cal = Calendar.getInstance();
cal.add(Calendar.MINUTE, -minutes);
Criteria c = getSession().createCriteria(QueueEntry.class)
    .add(Restrictions.eq("processing", false))
    .add(Restrictions.or(Restrictions.ge("serverTimestamp", cal.getTime()), Restrictions.ge("sentTimestamp", cal.getTime())))
    .add(Restrictions.lt("attemps", attemps))
    .addOrder(Order.asc("priority"))
    .addOrder(Order.asc("serverTimestamp"))
    .setMaxResults(limit);
In the current situation, if client A inserts 15k tasks at 10:00:00 and client B inserts 3k tasks at 10:00:05 (5 seconds later) with the same priority, B's tasks will be fetched and executed after A's.
I need to balance fetched jobs between the clients (there's a "client" column in the queue table). For example, if the throughput is 10 tasks/sec, I want to get 5 of A's tasks and 5 of B's. When there are no more tasks for client B, I want to get 10 of A's tasks.
Is there some easy way or trick to do this with the query? The DB is Postgres.
I don't think you will be able to do it by modifying your existing Criteria or with just a single query. To prevent client starvation you would have to create separate resource pools for each client, which is the approach taken by the Fair Scheduler for Hadoop:
The fair scheduler organizes jobs into pools, and divides resources fairly between these pools. By default, there is a separate pool for each user, so that each user gets an equal share of the cluster. It is also possible to set a job's pool based on the user's Unix group or any jobconf property. Within each pool, jobs can be scheduled using either fair sharing or first-in-first-out (FIFO) scheduling.
You can run a query to get a list of distinct clients with the total number of awaiting jobs for each. Then divide the global job limit by the distinct client count and fetch the awaiting jobs for each client in a separate query.
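The division step can be sketched as follows. This is only the arithmetic, not Hibernate code: `split_limit` is a hypothetical helper that takes the per-client counts of awaiting jobs (as returned by the first query) and the global limit, and returns how many rows to fetch per client. Handing unused share from clients with few jobs to the remaining clients is an assumption added here so that the full limit is used when possible.

```python
def split_limit(pending, limit):
    """Divide `limit` fetch slots fairly among clients.

    pending: dict mapping client -> number of awaiting jobs
    limit:   global number of jobs to fetch in this round
    Returns a dict mapping client -> number of jobs to fetch for that client.
    """
    quotas = {client: 0 for client in pending}
    remaining = limit
    active = {client for client, count in pending.items() if count > 0}
    while remaining > 0 and active:
        # Equal share for every client that still has unfetched jobs (at least 1).
        share = max(remaining // len(active), 1)
        for client in sorted(active):
            if remaining == 0:
                break
            take = min(share, pending[client] - quotas[client], remaining)
            quotas[client] += take
            remaining -= take
        # Clients whose jobs are all covered drop out; their unused share
        # is redistributed in the next pass.
        active = {c for c in active if quotas[c] < pending[c]}
    return quotas


# The scenario from the question: a limit of 10 with two busy clients.
print(split_limit({"A": 15000, "B": 3000}, 10))
```

Each resulting quota would then become the `setMaxResults` value of a per-client query that adds a restriction on the "client" column.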
This question is an extension of another SO question of mine. Since that doesn't look possible, I am trying to execute chunks in parallel for parallel / partitioned slave steps.
The article says that just specifying SimpleAsyncTaskExecutor as the task executor for a step will start executing chunks in parallel.
@Bean
public Step masterLuceneIndexerStep() throws Exception{
    return stepBuilderFactory.get("masterLuceneIndexerStep")
        .partitioner(slaveLuceneIndexerStep())
        .partitioner("slaveLuceneIndexerStep", partitioner())
        .gridSize(Constants.PARTITIONER_GRID_SIZE)
        .taskExecutor(simpleAsyntaskExecutor)
        .build();
}
@Bean
public Step slaveLuceneIndexerStep()throws Exception{
    return stepBuilderFactory.get("slaveLuceneIndexerStep")
        .<IndexerInputVO,IndexerOutputVO> chunk(Constants.INDEXER_STEP_CHUNK_SIZE)
        .reader(luceneIndexReader(null))
        .processor(luceneIndexProcessor())
        .writer(luceneIndexWriter(null))
        .listener(luceneIndexerStepListener)
        .listener(lichunkListener)
        .throttleLimit(Constants.THROTTLE_LIMIT)
        .build();
}
If I specify .taskExecutor(simpleAsyntaskExecutor) for the slave step, the job fails. The line .taskExecutor(simpleAsyntaskExecutor) in the master step works OK, but chunks run serially while the partitioned steps run in parallel.
Is it possible to parallelize chunks of slaveLuceneIndexerStep()?
Basically, each chunk is writing Lucene indices to a single directory in sequential fashion and I want to further parallelize index writing process within each directory since Lucene IndexWriter is thread-safe.
I am able to launch parallel chunks from within a partitioned slave step as follows:
1. I first made my reader, processor and writer thread-safe so that those components can participate in parallel chunks without concurrency issues.
2. I kept SimpleAsyncTaskExecutor as the task executor for the master step, since slave steps are long-running and I wish to start exactly N threads at a point in time. I control N by setting the concurrencyLimit of the task executor.
3. Then I set a ThreadPoolTaskExecutor as the task executor for the slave step. This pool is shared by all slave steps as a common pool, so I set its core pool size to at least N so that each slave step gets at least one thread and starvation doesn't happen. You can increase this thread pool size as per system capacity; I used a thread pool since chunks are shorter-running processes.
Using a thread pool also handles a case specific to my application: my partitioning is by client_id, so when smaller clients are done, the same threads automatically get reused by bigger clients. This evens out the asymmetry created by client_id partitioning, since the amount of data to be processed varies a lot between clients.
The master step's task executor simply starts all slave step threads and goes into the WAITING state while slave step chunks get processed by the thread pool specified in the slave step.
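The two-level arrangement described above can be mimicked outside Spring Batch to see why the shared pool helps with uneven partitions. The sketch below is a loose Python analogy, not Spring Batch code: one executor plays the role of the master step's task executor (one thread per partition, capped at `max_partitions`), and a second, shared executor plays the role of the slave step's chunk pool. All names are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_partitions(partitions, max_partitions, chunk_pool_size, chunk_size=3):
    """partitions: dict mapping client_id -> list of items to process."""
    processed = []
    lock = threading.Lock()

    def process_chunk(client_id, chunk):
        # Stand-in for read/process/write of one chunk; must be thread-safe.
        with lock:
            processed.extend((client_id, item) for item in chunk)

    # Shared pool for chunks: threads freed by small clients serve big ones.
    with ThreadPoolExecutor(max_workers=chunk_pool_size) as chunk_pool:

        def run_partition(client_id, items):
            chunks = [items[i:i + chunk_size]
                      for i in range(0, len(items), chunk_size)]
            futures = [chunk_pool.submit(process_chunk, client_id, c)
                       for c in chunks]
            for future in futures:
                future.result()  # the partition thread only waits here
            return client_id

        # Separate pool for partitions, so waiting partition threads never
        # occupy the threads that chunks need.
        with ThreadPoolExecutor(max_workers=max_partitions) as partition_pool:
            done = list(partition_pool.map(lambda kv: run_partition(*kv),
                                           partitions.items()))
    return done, processed
```

Keeping the partition threads and the chunk threads in separate pools is what avoids starvation in this analogy: a partition thread that is blocked waiting for its chunks never takes a slot away from the chunk pool.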
I've set up a 3-node cluster that was distributing tasks (steps? jobs?) pretty evenly until the most recent job, which has been assigned entirely to one machine.
Topology (do we still use this term for flink?):
kafka (3 topics on different feeds) -> flatmap -> union -> map
Is there something about this setup that would tell the cluster manager to put everything on one machine?
Also - what are the 'not set' values in the image? Some step I've missed? Or some to-be-implemented UI feature?
It is actually on purpose that Flink schedules your job on a single TaskManager. In order to understand it let me quickly explain Flink's resource scheduling algorithm.
First of all, in the Flink world a slot can accommodate more than one task (parallel instance of an operator). In fact, it can accommodate one parallel instance of each operator. The reason for this is that Flink executes not only streaming jobs in a streaming fashion but also batch jobs. By streaming fashion I mean that Flink brings all operators of your dataflow graph online so that intermediate results can be streamed directly to downstream operators where they are consumed. By default Flink tries to combine one task of each operator in one slot.
When Flink schedules the tasks to the different slots, then it tries to co-locate the tasks with their inputs to avoid unnecessary network communication. For sources, the co-location depends on the implementation. For file-based sources, for example, Flink tries to assign local file input splits to the different tasks.
So if we apply this to your job, then we see the following. You have three different sources with parallelism 1. All sources belong to the same resource sharing group, thus the single task of each operator will be deployed to the same slot. The initial slot is randomly chosen from the available instances (actually it depends on the order of the TaskManager registration at the JobManager) and then filled up. Let's say the chosen slot is on machine node1.
Next we have the three flat map operators, which have a parallelism of 2. Here again one of the two sub-tasks of each flat map operator can be deployed to the same slot which already accommodates the three sources. The second sub-task, however, has to be placed in a new slot. When this happens Flink tries to choose a free slot which is co-located with a slot in which one of the task's inputs is deployed (again to reduce network communication). Since only one slot of node1 is occupied and thus 31 are still free, it will deploy the 2nd sub-task of each flatMap operator also to node1.
The same now applies to the tumbling window reduce operation. Flink tries to co-locate all the tasks of the window operator with its inputs. Since all of its inputs run on node1 and node1 has enough free slots to accommodate 6 sub-tasks of the window operator, they will be scheduled to node1. It's important to note that 1 window task will run in the slot which contains the three sources and one task of each flatMap operator.
I hope this explains why Flink only uses the slots of a single machine for the execution of your job.
The problem is that you are building a global window on an unkeyed (ungrouped) stream, so the window has to run on one machine.
Maybe you can also express your application logic differently so that you can group the stream.
The "(not set)" part is probably an issue in Flink's DataStream API, which is not setting default operator names.
Jobs implemented against the DataSet API will look like this:
I have an application in which I have five different tasks. Each of those five tasks runs at a different time on a particular day. I will be deploying this application on 4 different machines.
Until now I was running all five tasks on a single machine, choosing the leader among those four machines using Apache ZooKeeper. But with this approach the other three machines sit idle, so I was wondering whether there is any way to have different machines running different tasks. Meaning each of those four machines runs some of those five tasks, but no two machines run the same task.
Can anyone provide an example on how would I do this?
Update:-
I don't have any dependency between different tasks. They all are independent of each other.
I would have five nodes: pending_tasks, completed_tasks, running_tasks, workers, and queues. pending_tasks is the node that holds tasks, both new ones and the ones that will be re-triggered due to failures in worker nodes. completed_tasks holds the details of completed tasks. running_tasks holds the tasks that are assigned to workers. In a PoC implementation I did once, I used XML-encoded POJOs to store the tasks' details. Nodes in pending_tasks, completed_tasks, and running_tasks are all persistent nodes.
workers holds ephemeral nodes that represent available workers. Given they are ephemeral, these nodes signal failures in the workers. queues is directly related to workers: there is a node in queues for each node in workers. The nodes in queues are used to hold the tasks assigned for each of the workers.
Now, you need a master. The master is responsible for three things: i) watch pending_tasks for new tasks; ii) watch workers to register new queues when new workers arrive, and to put tasks back in pending_tasks when workers go missing; and iii) publish the result of the tasks in completed_tasks (when I did this PoC, the result would go through a publish/subscribe notification mechanism). Besides that, the master must perform some clean-up at start-up, given that workers might fail during the master's downtime.
The master algorithm is the following:
at (start-up) {
    for (q -> /queues) {
        if q.name not in nodesOf(/workers) {
            for (t -> nodesOf(/queues/q.name)) {
                create /pending_tasks/t.name
                delete /running_tasks/t.name
                delete /queues/q.name/t.name
            }
            delete /queues/q.name
        }
    }
    for (t -> nodesOf(/completed_tasks)) {
        publish the result
        delete /completed_tasks/t.name
    }
}
watch (/workers) {
    case c: Created => register the new worker queue
    case d: Deleted => transaction {
        for (t -> nodesOf(/queues/d.name)) {
            create /pending_tasks/t.name
            delete /running_tasks/t.name
            delete /queues/d.name/t.name
        }
        delete /queues/d.name
    }
}
watch (/pending_tasks) {
    case c: Created => transaction {
        create /running_tasks/c.name
        create persistent node in one of the workers' queues (e.g., /queues/worker_0/c.name)
        delete /pending_tasks/c.name
    }
}
watch (/completed_tasks) {
    case c: Created =>
        publish the result
        delete /completed_tasks/c.name
}
The worker algorithm is the following:
at (start-up) {
    create /queues/this.name
    create an ephemeral node /workers/this.name
}
watch (/queues/this.name) {
    case c: Created =>
        perform the task
        transaction {
            create /completed_tasks/c.name with the result
            delete /queues/this.name/c.name
            delete /running_tasks/c.name
        }
}
Some notes on what I had in mind with this design. First, at any given time, no two tasks targeting the same computation were to run. Therefore, I named the tasks after the computation in question. So, if two different clients requested the same computation, only one would succeed, since only one would be able to create the /pending_tasks node. Likewise, if the task is already running, the creation of the /running_tasks/ node would fail and no new task would be dispatched.
Second, there might be arbitrary failures in both masters and workers and no task will be lost. If a worker fails, the watch on delete events in /workers will trigger the reassignment of tasks. If a master fails and any number of workers fail before a new master is in place, the start-up procedure will move tasks back to /pending_tasks and publish any pending results.
Third, I might have forgotten some corner case since I have no access to this PoC implementation anymore. I'll be glad to discuss any issue.
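To make the bookkeeping easier to follow, here is a small in-memory model of the master's state transitions. It is not a ZooKeeper client and does not use watches or transactions: sets and dicts stand in for the znodes, the class and method names are hypothetical, and picking the least-loaded queue is an assumption (the design above only says "one of the workers' queues").

```python
class TaskBoard:
    """In-memory stand-in for /pending_tasks, /running_tasks, /completed_tasks and /queues."""

    def __init__(self):
        self.pending = set()
        self.running = set()
        self.completed = {}
        self.queues = {}  # worker name -> set of task names assigned to it

    def submit(self, task):
        # Mirrors node creation failing when a task with the same name already exists.
        if task in self.pending or task in self.running:
            return False
        self.pending.add(task)
        return True

    def worker_joined(self, worker):
        # Mirrors "case c: Created => register the new worker queue".
        self.queues.setdefault(worker, set())

    def assign_pending(self):
        # Mirrors the /pending_tasks watch: move each task to running and to a queue.
        for task in sorted(self.pending):
            if not self.queues:
                return
            # Assumption: choose the queue with the fewest tasks (ties broken by name).
            worker = min(self.queues, key=lambda w: (len(self.queues[w]), w))
            self.pending.discard(task)
            self.running.add(task)
            self.queues[worker].add(task)

    def worker_lost(self, worker):
        # Mirrors "case d: Deleted": put the worker's tasks back in pending.
        for task in self.queues.pop(worker, set()):
            self.running.discard(task)
            self.pending.add(task)

    def complete(self, worker, task, result):
        # Mirrors the worker's final transaction.
        self.queues[worker].discard(task)
        self.running.discard(task)
        self.completed[task] = result
```

With five tasks and four workers, `assign_pending` spreads the tasks across the queues so that no task is in two queues at once, and `worker_lost` followed by another `assign_pending` moves a failed machine's tasks to the survivors.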