Assign tasks to exact workers in Spark - Java

I have data stored in HDFS and N Postgres shards (separate instances).
During data retrieval, we pick the exact shard for a SQL request based on a Java COLUMN_STRING_DATA.hashCode() % N calculation.
We have a Spark setup with 5 workers; each has 32 CPUs, and CPUs per task is 2.
I'm struggling with uploading data to the shards because the number of database connections is limited by the database engineering team. Each worker tries to upload data to every shard, which is why I'm limited by the allowed connections to the PG shards and can't fully utilize either the workers or the PG instances.
So I would like to somehow bind each worker to a unique, limited group of PG shards.
Dataset<Row> df = spark.read()
    .option("header", "true")
    .orc(pathToHdfs)
    .withColumn("hash_group", functions.hash(new Column("search")).mod(PG_SHARDS_SIZE));
functions.hash uses the Murmur3 algorithm, while during data retrieval from PG I choose the shard based on Java String.hashCode() % N. I googled, and I'm able to write a custom function to match, so that shouldn't be a problem.
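Something along these lines is what I have in mind for the custom function (a rough sketch; the UDF name is made up). It reuses String.hashCode() so the write-side grouping matches the shard chosen at read time.

import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.types.DataTypes;

// Rough sketch: a UDF that mirrors the retrieval-side expression exactly
// (note that Java % keeps the sign of hashCode(), just like the read path).
spark.udf().register(
    "javaHashMod",
    (UDF1<String, Integer>) s -> s.hashCode() % PG_SHARDS_SIZE,
    DataTypes.IntegerType);

// ...and then, instead of functions.hash(...).mod(...):
// .withColumn("hash_group", functions.callUDF("javaHashMod", functions.col("search")))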
The question is: after creating the new column with integer values in the range [1..PG_SHARDS_SIZE], how can I assign Worker #1 to work only with values [1, 2] of the hash_group column, Worker #2 with values [3, 4], and so on?
Maybe there are other approaches to bind a worker to an exact group of PG shards?
I have a 1-to-1 mapping of Worker to Executor, so binding an Executor to a group of PG shards works for me as well.
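What I'm picturing is something along these lines (a rough sketch; the connection helpers are made up). Strictly speaking it binds each task, rather than a whole worker, to a single shard at a time, but that also keeps the connection count per shard bounded.

import org.apache.spark.api.java.function.ForeachPartitionFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.functions;

import java.sql.Connection;
import java.util.HashMap;
import java.util.Map;

// One partition per hash_group value, so each task talks to (at most) one or two shards.
Dataset<Row> byShard = df.repartition(PG_SHARDS_SIZE, functions.col("hash_group"));

byShard.foreachPartition((ForeachPartitionFunction<Row>) rows -> {
    // Hash partitioning can co-locate a couple of groups in one partition,
    // so keep one connection per group actually seen here (usually just one).
    Map<Integer, Connection> connections = new HashMap<>();
    try {
        while (rows.hasNext()) {
            Row row = rows.next();
            int group = row.getAs("hash_group");
            Connection conn = connections.computeIfAbsent(group, g -> openShardConnection(g)); // made-up helper
            writeRow(conn, row);                                                               // made-up insert helper
        }
    } finally {
        for (Connection c : connections.values()) {
            c.close();
        }
    }
});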

Related

Architecting an Airflow DAG that needs contextual throttling

I have a group of job units (workers) that I want to run as a DAG
Group1 has 10 workers and each worker does multiple table extracts from a DB. Note that each worker maps to a single DB instance and each worker needs to successfully deal with 100 tables in total before it can mark itself as complete
Group1 has a limitation that says no more than 5 tables across all those 10 workers should be consumed at a time. For example:
Worker1 is extracting 2 tables
Worker2 is extracting 2 tables
Worker3 is extracting 1 table
Worker4...Worker10 need to wait until Worker1...Worker3 relinquish the threads
Worker4...Worker10 can pick up tables as soon as threads in step1 free up
As each worker completes all the 100 tables, it proceeds to step2 without waiting. Step2 has no concurrency limits
I should be able to create a single node Group1 that caters to the throttling, and also have
10 independent worker nodes so that I can restart them in case any one of them fails
I have tried explaining this in the following diagram:
If any of the workers fails, I can restart it without affecting the other workers. It still uses the same thread pool from Group1, so the concurrency limits are enforced
Group1 would complete once all elements of step1 and step2 are complete
Step2 doesn't have any concurrency measures
How do I implement such a hierarchy in Airflow for a Spring Boot Java application?
Is it possible to design this kind of DAG using Airflow constructs and dynamically tell the Java application how many tables it can extract at a time? For instance, if all workers except Worker1 are finished, Worker1 can now use all 5 available threads while everything else proceeds to step2.
These constraints cannot be modeled as a directed acyclic graph, and thus cannot be implemented in Airflow exactly as described. However, they can be modeled as queues, and thus could be implemented with a job queue framework. Here are your two options:
Implement suboptimally as an Airflow DAG:
from airflow.models import DAG
from airflow.operators.subdag_operator import SubDagOperator
# Executors that inherit from BaseExecutor take a parallelism parameter
from wherever import SomeExecutor, SomeOperator

# Table load jobs are done with parallelism 5
load_tables_dag = DAG("load_tables")
load_tables = SubDagOperator(
    task_id="load_tables", subdag=load_tables_dag, executor=SomeExecutor(parallelism=5)
)

# Each table load must be its own job, or must be split into sets of tables of
# predetermined size, such that num_tables_per_job * parallelism = 5
for table in tables:
    load_table = SomeOperator(task_id=f"load_table_{table}", dag=load_tables_dag)

# Jobs done afterwards are done with higher parallelism
afterwards_dag = DAG("afterwards")
afterwards = SubDagOperator(
    task_id="afterwards", subdag=afterwards_dag, executor=SomeExecutor(parallelism=high_parallelism)
)
for job in jobs:
    afterward_job = SomeOperator(task_id=f"job_{job}", dag=afterwards_dag)

# After _all_ table load jobs are complete, start the jobs that should be done afterwards
load_tables >> afterwards
The suboptimal aspect here is that, for the first half of the DAG, the cluster will be underutilized by high_parallelism - 5.
Implement optimally with a job queue:
# This is pseudocode, but could be easily adapted to a framework like Celery

# You need two queues
# The table load queue should be initialized with the job items
table_load_queue = Queue(initialize_with_tables)

# The queue for jobs to do afterwards starts empty
afterwards_queue = Queue()

def worker():
    # Work while there's at least one item in either queue
    while not table_load_queue.empty() or not afterwards_queue.empty():
        working_on_table_load = [w.is_working_table_load for w in scheduler.active()]

        # Work table loads if we haven't reached capacity, otherwise work the jobs afterwards
        if sum(working_on_table_load) < 5:
            is_working_table_load = True
            task = table_load_queue.dequeue()
        else:
            is_working_table_load = False
            task = afterwards_queue.dequeue()

        if task:
            after = work(task)
            if is_working_table_load:
                # After working a table load, create the job to work afterwards
                afterwards_queue.enqueue(after)

# Use all the parallelism available
scheduler.start(worker, num_workers=high_parallelism)
Using this approach, the cluster won't be underutilized.

Data processing balance

I have a queue with jobs that go to different executor pools depending on the type of job. The queue is in a DB table and contains jobs from different clients with priorities, etc. I'm omitting some details irrelevant to the question.
At some point different clients put many jobs in the queue at the same time with the same priority, for example about 15,000-20,000 jobs.
With the current implementation, jobs are fetched using Hibernate with this Criteria query; again, I'm omitting some restrictions for simplicity.
Calendar cal = Calendar.getInstance();
cal.add(Calendar.MINUTE, -minutes);
Criteria c = getSession().createCriteria(QueueEntry.class)
        .add(Restrictions.eq("processing", false))
        .add(Restrictions.or(Restrictions.ge("serverTimestamp", cal.getTime()), Restrictions.ge("sentTimestamp", cal.getTime())))
        .add(Restrictions.lt("attemps", attemps))
        .addOrder(Order.asc("priority"))
        .addOrder(Order.asc("serverTimestamp"))
        .setMaxResults(limit);
In the current situation, if client A inserts 15k tasks at 10:00:00 and client B inserts 3k tasks at 10:00:05 (5 seconds later) with the same priority, B's tasks will be fetched and executed after A's.
I need to balance the fetched jobs between the clients (there's a "client" column in the queue table) - for example, if the throughput is 10 tasks/sec, to get 5 of A's tasks and 5 of B's. When there are no more tasks for client B, to get 10 of A's tasks.
Is there some easy way or trick to do this with the query? The DB is Postgres.
I don't think you will be able to do it by modifying your existing Criteria or with just a single query. To prevent client starvation you would have to create separate resource pools for each client, which is the approach taken by the Fair Scheduler for Hadoop:
The fair scheduler organizes jobs into pools, and divides resources fairly between these pools. By default, there is a separate pool for each user, so that each user gets an equal share of the cluster. It is also possible to set a job's pool based on the user's Unix group or any jobconf property. Within each pool, jobs can be scheduled using either fair sharing or first-in-first-out (FIFO) scheduling.
You can run a query to get the list of distinct clients with the total number of awaiting jobs. Based on the distinct client count, divide the global job limit and fetch the awaiting jobs for each client in a separate query.
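A rough sketch of that two-step fetch, in the same (legacy) Criteria style as the question - the entity and the "client"/"processing" property names come from the question, and the even split of the limit is an assumption:

import java.util.ArrayList;
import java.util.List;
import org.hibernate.criterion.Order;
import org.hibernate.criterion.Projections;
import org.hibernate.criterion.Restrictions;

// Step 1: which clients currently have awaiting jobs?
List<String> clients = getSession().createCriteria(QueueEntry.class)
        .add(Restrictions.eq("processing", false))
        .setProjection(Projections.distinct(Projections.property("client")))
        .list();

// Step 2: split the global limit across those clients and fetch each slice separately.
int perClient = Math.max(1, limit / Math.max(1, clients.size()));
List<QueueEntry> batch = new ArrayList<>();
for (String client : clients) {
    batch.addAll(getSession().createCriteria(QueueEntry.class)
            .add(Restrictions.eq("processing", false))
            .add(Restrictions.eq("client", client))
            .addOrder(Order.asc("priority"))
            .addOrder(Order.asc("serverTimestamp"))
            .setMaxResults(perClient)
            .list());
}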

Ideal way of using thread pool

I have a web application in which I fetch some ids from Elasticsearch. After fetching the ids, I divide them into batches of 1000 each and query the database for more information about those ids. I have to query 4 different database tables to fetch all the information I want for a particular id. To optimize this I used a fixed thread pool of size 10, held in a singleton object, to process the batches of 1000. From each of these threads I then create a fixed thread pool of size 4 to query the 4 tables in parallel. Now, since it is a web application, it receives requests in parallel from different users, and the single pool of size 10 shared by all requests becomes a bottleneck. So I just want to know what the ideal way of parallelizing the tasks here would be.
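For reference, a simplified sketch of the current setup I described (all helper names are made up):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Shared across all web requests: one pool of 10 for the id batches.
ExecutorService batchPool = Executors.newFixedThreadPool(10);

// ids: the list fetched from Elasticsearch; partition() is a made-up helper
// that splits it into batches of 1000.
for (List<String> batch : partition(ids, 1000)) {
    batchPool.submit(() -> {
        // Per batch: a pool of 4, one thread per table.
        ExecutorService tablePool = Executors.newFixedThreadPool(4);
        try {
            List<Future<?>> lookups = new ArrayList<>();
            lookups.add(tablePool.submit(() -> queryTable1(batch)));  // made-up per-table lookups
            lookups.add(tablePool.submit(() -> queryTable2(batch)));
            lookups.add(tablePool.submit(() -> queryTable3(batch)));
            lookups.add(tablePool.submit(() -> queryTable4(batch)));
            for (Future<?> f : lookups) {
                f.get();                                              // wait for all four tables
            }
        } finally {
            tablePool.shutdown();
        }
        return null;  // makes the lambda a Callable, so the checked exceptions from get() are allowed
    });
}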

How to parallelize multiple requests to MongoDB?

I am using a single standalone MongoDB server with no special topology like replication or sharding. Currently I have an issue where MongoDB does not support more than 500 parallel requests. Note that I am using only one instance of MongoClient, and the remaining threads are used for inserts. I am using the Java executor framework to create the threads, and these threads are used to insert data into a collection [all inserts go into the same collection].
You should queue the requests before you issue them towards the database. There is no use requesting 500 things from your database in parallel. Remember that a single request comes with some costs, memory-wise, locking-wise and so on. You are actually wasting resources by asking your database for too much at once - and I mean this request-wise, not data-wise.
So use a queue (or several) and pool up the requests. From that pool you feed your worker threads (let's say 5 or 10 are enough) and that's it.
Take a look at the Future interface in the java.util.concurrent package. Using asynchronous processing here looks like the approach with the highest throughput and the lowest resource impact.
But check the MongoDB driver first. I would not be surprised if they have already implemented it this way. If that is the case, you just have to limit yourself by using a queue so that only, let's say, 10 or 100 requests at a time are handled by the database driver. Do some performance checks while tweaking the number of actual requests sent to the database.
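A minimal sketch of the queue-and-worker-pool idea, assuming the synchronous Java driver (the connection string, database, and collection names are placeholders): the fixed pool size is the throttle, and the executor's internal queue holds the pending inserts.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ThrottledInserts {
    public static void main(String[] args) throws Exception {
        MongoClient client = MongoClients.create("mongodb://localhost:27017"); // placeholder URI
        MongoCollection<Document> coll = client.getDatabase("test").getCollection("events");

        // Only 10 inserts are in flight at any time; everything else waits in the executor's queue.
        ExecutorService pool = Executors.newFixedThreadPool(10);
        List<Future<?>> pending = new ArrayList<>();
        for (int i = 0; i < 100_000; i++) {
            Document doc = new Document("n", i);
            pending.add(pool.submit(() -> { coll.insertOne(doc); }));
        }
        for (Future<?> f : pending) {
            f.get(); // block until the insert completes, surfacing any failure
        }
        pool.shutdown();
        client.close();
    }
}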

Best way to distribute database read jobs among Java threads

I have a MySQL database with a large number of rows.
I want to initialize multiple Threads (each with its own database connection) in Java and read/print the data simultaneously.
How can I partition the data between multiple threads so that no two threads read the same record? What strategies can be used?
It depends on what kind of work your threads are going to do. For example, I usually execute a single SELECT for some kind of large dataset, add tasks to a thread-safe task queue and submit workers which pick up tasks from the queue to process. I usually write to the DB without synchronisation, but that depends on the size of the unit of work and on DB constraints (like unique keys etc). Works like a charm.
The other method would be to simply run multiple threads and let them work on their own. I strongly advise against using some fancy LIMIT/OFFSET scheme, however. It still requires the DB to fetch MORE data rows than it will actually return from the query.
EDIT:
As you have added a comment saying that you consume the same data, then yes, my solution is what you are looking for (see the sketch below):
Get the dataset with a single query
Add the data to a queue
Launch your threads (via executors or new threads)
Pick data from the queue and process it
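A bare-bones sketch of those four steps (connection details, the query, and the row handling are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class QueueFedWorkers {
    private static final String POISON = "__done__";

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(10_000); // bounded, so the producer can't outrun the workers
        int workers = 8;                                                 // assumption
        ExecutorService pool = Executors.newFixedThreadPool(workers);

        // Workers: pick tasks off the queue until they see the poison pill.
        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                try {
                    for (String row = queue.take(); !POISON.equals(row); row = queue.take()) {
                        process(row);                                    // placeholder for the real unit of work
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // Producer: one SELECT over the whole dataset, rows pushed onto the queue.
        try (Connection conn = DriverManager.getConnection("jdbc:mysql://localhost/db", "user", "pass"); // placeholder
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT id, payload FROM big_table")) {                      // placeholder query
            while (rs.next()) {
                queue.put(rs.getLong("id") + ":" + rs.getString("payload"));
            }
        }
        for (int i = 0; i < workers; i++) {
            queue.put(POISON);                                           // one pill per worker
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    private static void process(String row) {
        System.out.println(row);                                         // stand-in for real processing
    }
}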
If the large dataset has an integer primary key, then one approach would be as follows (see the sketch below):
Get the count of rows using the same select query.
Divide the entire dataset into an equal number of partitions.
Assign each partition to a thread. Each thread will have its own select query with a primary key value range as a constraint.
Note the following issues with this approach:
You fire (number of threads + 1) queries at the database, so performance might be a problem.
All the partitions may not be equal (as there will be some ids which were deleted).
This approach is simple and makes sure that a row is strictly processed by only one thread.
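A rough sketch of that range split (connection details, table and column names are invented; it uses MIN/MAX of the key rather than the row count, so the ranges line up with actual id values):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Split the primary-key range into equal chunks, one per thread.
int threads = 8;                                                   // assumption
long minId, maxId;
try (Connection conn = DriverManager.getConnection("jdbc:mysql://localhost/db", "user", "pass"); // placeholder
     Statement st = conn.createStatement();
     ResultSet rs = st.executeQuery("SELECT MIN(id), MAX(id) FROM big_table")) {                 // placeholder table
    rs.next();
    minId = rs.getLong(1);
    maxId = rs.getLong(2);
}
long span = (maxId - minId) / threads + 1;
ExecutorService pool = Executors.newFixedThreadPool(threads);
for (int i = 0; i < threads; i++) {
    long lo = minId + i * span;
    long hi = Math.min(lo + span, maxId + 1);
    // Each thread opens its own connection and runs:
    //   SELECT ... FROM big_table WHERE id >= lo AND id < hi
    pool.submit(() -> readRange(lo, hi));                          // readRange is a made-up worker method
}
pool.shutdown();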
You can use a singleton class to keep track of the rows that have already been read, so every thread can get the next row number from that singleton.
Alternatively, you can use a static AtomicInteger variable in a common class. Each thread calls its getAndIncrement method, and that way you can partition the data between the threads.
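A tiny sketch of the AtomicInteger variant (class and method names are made up):

import java.util.concurrent.atomic.AtomicInteger;

// Every worker thread claims the next row index atomically,
// so no two threads ever process the same record.
class RowCounter {
    private static final AtomicInteger NEXT = new AtomicInteger(0);

    static int claimNext() {
        return NEXT.getAndIncrement();
    }
}

// In each worker thread: int row = RowCounter.claimNext(); then fetch and process that row.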
