Data processing balance - java

I have a queue of jobs that go to different executor pools depending on the job type. The queue is in a DB table and contains jobs from different clients with priorities, etc. I'm omitting some details irrelevant to the question.
At some point different clients put many jobs into the queue at the same time with the same priority, for example about 15,000-20,000 jobs.
With the current implementation, jobs are fetched using Hibernate with this Criteria (again, I'm omitting some restrictions for simplicity):
Calendar cal = Calendar.getInstance();
cal.add(Calendar.MINUTE, -minutes);
Criteria c = getSession().createCriteria(QueueEntry.class)
.add(Restrictions.eq("processing", false))
.add(Restrictions.or(Restrictions.ge("serverTimestamp", cal.getTime()), Restrictions.ge("sentTimestamp", cal.getTime())))
.add(Restrictions.lt("attemps", attemps))
.addOrder(Order.asc("priority"))
.addOrder(Order.asc("serverTimestamp"))
.setMaxResults(limit);
In the current situation, if client A inserts 15k tasks at 10:00:00 and client B inserts 3k tasks at 10:00:05 (5 seconds later) with the same priority, B's tasks will only be fetched and executed after all of A's.
I need to balance the fetched jobs between the clients (there's a "client" column in the queue table) - for example, if the throughput is 10 tasks/sec, fetch 5 of A's tasks and 5 of B's. When there are no more tasks for client B, fetch 10 of A's tasks.
Is there some easy way or trick to do this with the query? The DB is Postgres.

I don't think you will be able to do it by modifying your existing Criteria or with just a single query. To prevent client starvation you would have to create separate resource pools for each client, which is the approach taken by the Fair Scheduler for Hadoop:
The fair scheduler organizes jobs into pools, and divides resources fairly between these pools. By default, there is a separate pool for each user, so that each user gets an equal share of the cluster. It is also possible to set a job's pool based on the user's Unix group or any jobconf property. Within each pool, jobs can be scheduled using either fair sharing or first-in-first-out (FIFO) scheduling.
You can run a query to get the list of distinct clients with the total number of awaiting jobs. Based on the distinct client count, divide the global job limit and fetch the awaiting jobs for each client in a separate query, as sketched below.
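For illustration, here is a rough sketch of that two-step approach using the same Hibernate Criteria style as in the question. It assumes the "client" column maps to a String property; the timestamp restrictions are omitted for brevity and the variable names are made up for the example:
// Sketch only: split the global limit evenly across the clients that currently have awaiting jobs
List<String> clients = getSession().createCriteria(QueueEntry.class)
    .add(Restrictions.eq("processing", false))
    .add(Restrictions.lt("attemps", attemps))
    .setProjection(Projections.distinct(Projections.property("client")))
    .list();

if (!clients.isEmpty()) {
    int perClientLimit = Math.max(1, limit / clients.size());
    List<QueueEntry> fetched = new ArrayList<QueueEntry>();
    for (String client : clients) {
        fetched.addAll(getSession().createCriteria(QueueEntry.class)
            .add(Restrictions.eq("processing", false))
            .add(Restrictions.eq("client", client))
            .add(Restrictions.lt("attemps", attemps))
            .addOrder(Order.asc("priority"))
            .addOrder(Order.asc("serverTimestamp"))
            .setMaxResults(perClientLimit)
            .list());
    }
}
When a client has fewer awaiting jobs than its share, the other clients' next fetch simply picks up the slack, which gives you the "10 of A's tasks once B is empty" behaviour over successive fetch cycles.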

Related

Architecting an Airflow DAG that needs contextual throttling

I have a group of job units (workers) that I want to run as a DAG
Group1 has 10 workers and each worker does multiple table extracts from a DB. Note that each worker maps to a single DB instance and each worker needs to successfully process 100 tables in total before it can mark itself as complete.
Group1 has a limitation that says no more than 5 tables across all those 10 workers should be consumed at a time. For example:
Worker1 is extracting 2 tables
Worker2 is extracting 2 tables
Worker3 is extracting 1 table
Worker4...Worker10 need to wait until Worker1...Worker3 relinquish the threads
Worker4...Worker10 can pick up tables as soon as threads in step1 free up
As each worker completes all the 100 tables, it proceeds to step2 without waiting. Step2 has no concurrency limits
I should be able to create a single node Group1 that caters to the throttling and also have
10 independent worker nodes so I can restart any one of them in case it fails
I have tried explaining this in the following diagram:
If any of the workers fails, I can restart it without affecting the other workers. It still uses the same thread pool from Group1, so the concurrency limits are enforced
Group1 would complete once all elements of step1 and step2 are complete
Step2 doesn't have any concurrency measures
How do I implement such a hierarchy in Airflow for a Spring Boot Java application?
Is it possible to design this kind of DAG using Airflow constructs and dynamically tell the Java application how many tables it can extract at a time? For instance, if all workers except Worker1 are finished, Worker1 can now use all 5 available threads while everything else proceeds to step2.
These constraints cannot be modeled as a directed acyclic graph, and thus cannot be implemented in Airflow exactly as described. However, they can be modeled as queues, and thus could be implemented with a job queue framework. Here are your two options:
Implement suboptimally as airflow DAG:
from airflow.models import DAG
from airflow.operators.subdag_operator import SubDagOperator
# Executors that inherit from BaseExecutor take a parallelism parameter
from wherever import SomeExecutor, SomeOperator

# Table load jobs are done with parallelism 5
load_tables_subdag = DAG("load_tables")
load_tables = SubDagOperator(subdag=load_tables_subdag, executor=SomeExecutor(parallelism=5))

# Each table load must be its own job, or must be split into sets of tables of
# predetermined size, such that num_tables_per_job * parallelism = 5
for table in tables:
    load_table = SomeOperator(task_id=f"load_table_{table}", dag=load_tables_subdag)

# Jobs done afterwards run with higher parallelism
afterwards_subdag = DAG("afterwards")
afterwards = SubDagOperator(
    subdag=afterwards_subdag, executor=SomeExecutor(parallelism=high_parallelism)
)
for job in jobs:
    afterward_job = SomeOperator(task_id=f"job_{job}", dag=afterwards_subdag)

# After _all_ table load jobs are complete, start the jobs that should be done afterwards
load_tables >> afterwards
The suboptimal aspect here is that, for the first half of the DAG, the cluster will be underutilized by high_parallelism - 5.
Implement optimally with job queue:
# This is pseudocode, but could be easily adapted to a framework like Celery
# You need two queues
# The table load queue should be initialized with the job items
table_load_queue = Queue(initialize_with_tables)
# The queue for jobs to do afterwards starts empty
afterwards_queue = Queue()

def worker():
    # Work while there's at least one item in either queue
    while not table_load_queue.empty() or not afterwards_queue.empty():
        working_on_table_load = [w.is_working_table_load for w in scheduler.active()]
        # Work table loads if we haven't reached capacity, otherwise work the jobs afterwards
        if sum(working_on_table_load) < 5:
            is_working_table_load = True
            task = table_load_queue.dequeue()
        else:
            is_working_table_load = False
            task = afterwards_queue.dequeue()
        if task:
            after = work(task)
            if is_working_table_load:
                # After working a table load, create the job to work afterwards
                afterwards_queue.enqueue(after)

# Use all the parallelism available
scheduler.start(worker, num_workers=high_parallelism)
Using this approach, the cluster won't be underutilized.

Calling a web-service in a loop in Java web application

We have an external SOAP based web-service which provides information regarding a customer's gift card balance when presented with an Id. This Id is stored in our database.
The requirement is to find out the balance for all such customers who have this Id flagged and then send them an email. This logic is supposed to be run as a scheduled job once every alternate day.
When we queried the DB, we found out that there are more than 5000 such customers who have this Id flagged. Unfortunately, the web-service will NOT accept a list of Ids, and can only give information about a single customer in one network call.
Now, our doubt is whether it will be a good idea to loop through 5000 Ids and call the web-service in this loop as many times.
As a test run, when we called the web-service for 500 Ids it completed in 3.7 minutes, and for 1000 Ids in 7.25 minutes. By this measure, we can guesstimate that 5000 Ids should take roughly 40 minutes.
Our web-application is JavaEE 6 stack and DB is Oracle.
Is there a better way to do this ? Any suggestions are welcome.
Thanks.
If you could write a deterministic function that takes the customer id as input and gives you a number from 0 to 47, representing one of the 48 hours in the 2-day cycle of sending these email alerts, you could shard the email sending and convert it into a job that runs every hour.
I know that is changing the requirements a bit, but there isn't much difference between sending a batch every 2 days and a smaller batch every hour. Each customer who remains on your list would continue to get emails every 2 days.
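For illustration, here is a minimal sketch of such a deterministic bucketing function, assuming numeric customer ids (the class and method names are made up for the example):
import java.time.Instant;

public final class EmailShard {

    private static final int BUCKETS = 48; // 48 hourly buckets in a 2-day cycle

    // Deterministic: the same customer id always maps to the same hourly bucket
    public static int bucketFor(long customerId) {
        return (int) Math.floorMod(customerId, (long) BUCKETS);
    }

    // The hourly job only processes customers whose bucket matches the current hour of the cycle
    public static boolean isDueThisHour(long customerId, Instant now) {
        long hoursSinceEpoch = now.getEpochSecond() / 3600;
        int currentBucket = (int) (hoursSinceEpoch % BUCKETS);
        return bucketFor(customerId) == currentBucket;
    }
}
Each hourly run then only emails roughly 1/48 of the flagged customers, so every individual batch stays small while each customer still gets an email every 2 days.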
Another possibility is to send queries to the web service in a multi-threaded manner.
The web service provider should really think about changing their interface.
Unfortunately, the web-service will NOT accept a list of Ids, and can only give information about a single customer in one network call.
You should really get in contact with the service provider to find a suitable solution.
As a workaround, if multiple concurrent invocations are allowed by the SOAP WS, you could invoke the WS from multiple threads.
To achieve that, create a Runnable or a Callable implementation that performs the invocation to the WS with a specific id.
For example, to perform 10 concurrent invocations of the WS with Callable and ExecutorService, you could do something like this:
MyWs myWs = ...; // web service stub
List<Long> ids = ...; // ids to search
List<Callable<Double>> callables = ids.stream()
.map(id -> (Callable<Double>) () -> myWs.getBalance(id))
.collect(Collectors.toList());
ExecutorService executorService = Executors.newFixedThreadPool(10);
List<Future<Double>> balanceFutures = executorService.invokeAll(callables); // throws InterruptedException
executorService.shutdown();
Of course adjust the number of invocations according to the CPU of the machine that runs the JVM.

How to limit the number of running tasks on the Akka scheduler?

I have an application in the Java Play Framework where the user can run multiple tasks at the same time, and they can take a long time to finish. I thought that I could use actorSystem.scheduler() in order to do that. However, I've run a few tests and found out that the user can run up to 4 tasks at the same time; otherwise the tasks would take more resources than my server could provide. So, is there a way to limit the number of tasks running at the same time on the Akka scheduler?
If you want to globally limit the concurrent tasks, you could set the akka max pool size to that number. Information about configuration is available here: https://playframework.com/documentation/2.5.x/JavaAkka
Specifically, there is a setting:
akka.actor.default-dispatcher.fork-join-executor.pool-size-max = 64
which you can set to the maximum number of tasks you want to run concurrently. This is the number of threads that will be used.
Use a pool (which can be a LinkedBlockingQueue) to store the scheduler objects.
When a user request comes in, try to take an instance from the pool, otherwise wait. That way you control the maximum number of schedulers in use in your system, as sketched below.
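A minimal sketch of that idea with plain java.util.concurrent, using a fixed number of permits handed out from a queue; the TaskLimiter class and runLimited method are illustrative names, not part of Play or Akka:
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Caps how many long-running tasks can execute at the same time
public class TaskLimiter {

    private final BlockingQueue<Object> permits = new LinkedBlockingQueue<>();

    public TaskLimiter(int maxConcurrentTasks) {
        for (int i = 0; i < maxConcurrentTasks; i++) {
            permits.add(new Object());
        }
    }

    public void runLimited(Runnable task) throws InterruptedException {
        Object permit = permits.take(); // blocks while all slots are already in use
        try {
            task.run();
        } finally {
            permits.add(permit); // give the slot back when the task finishes
        }
    }
}
A single shared new TaskLimiter(4) then gives the "at most 4 tasks at a time" behaviour regardless of how the tasks are scheduled.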

Java, Quartz and multiple tasks triggered at certain times saved in a database

I'm building a system where users can set a future date (down to hours and minutes) in a calendar. At that date a trigger fires a certain task, unique for every user.
Every user can set a different date. The system will have 10k+ users from the start, and a user can create more than one trigger.
So assuming I have 10k users and each user creates on average 3 triggers => 30k triggers with 30k different dates.
All dates are saved in a database.
I'm new to Quartz; can this be done in a more optimized way?
I was thinking about making a task run every minute that fetches the tasks that are supposed to run in the next hour and removes them from the database.
Do you have any better ideas? Has anyone used Quartz for a large number of triggers?
You have the schedule backed by the database. If I understand the idea, you want Quartz to load all the upcoming tasks and execute them in the future.
This is a problematic approach:
Synchronization issues: I assume that users can edit, remove and add new tasks to the database. You would have to periodically ask the database to refresh the state of the Quartz jobs, remove some jobs, edit other jobs, etc. This may not be trivial. The state of the program would be a long-living cache which needs to be synchronized often.
Performance and scalability issues: even if the proposed solution may be ok for 30k tasks, it may not be ok for 70k or 700k tasks. Your approach is not easy to scale - adding a new machine would require an additional layer of synchronization to decide which machine should actually execute which job (as all of them have all the tasks).
What I would propose:
Add the "stage" to the Tasks table (new, queued, running, finished, failed)
divide your solution into several components. (Initially they can run on a single machine but it will be easy to scale)
Components:
Task Finder: executed periodically (once every few seconds). Scans the database for tasks that are "new" and due soon. Sends the tasks found to the Message Queue and marks them as "queued" in the db. Marking as "queued" has to be done carefully, as there can be multiple Task Finders; see the claim-step sketch at the end of this answer. (In addition, it may find tasks that were marked as "queued" or "running" more than N minutes ago and are not "finished" or "failed" - these probably need to be re-run.)
Message Queue: connector between the Task Finder and the Task Executor.
Task Executor: listens to the Message Queue and processes the tasks it receives. Marks a task as "running" initially and as "finished" or "failed" later on.
With this approach you can have:
multiple Task Executors on multiple machines
multiple Task Finders on multiple machines
even if one of the Task Finders or Executors fails, it will not be a Single Point of Failure. Some of the tasks will be delayed, but they will be picked up and run afterwards.
This may not address all the scenarios but would be a good starting point.
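For illustration, a minimal sketch of the Task Finder's claim step mentioned above, assuming a relational tasks table with id, stage and due_time columns (the table and column names are illustrative). The conditional UPDATE ensures that only one Task Finder can move a given task from "new" to "queued":
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

public class TaskFinder {

    // Returns the ids of the tasks this finder managed to claim
    public List<Long> claimDueTasks(Connection con, int lookAheadSeconds) throws SQLException {
        List<Long> candidates = new ArrayList<>();
        String select = "SELECT id FROM tasks WHERE stage = 'new' AND due_time <= ?";
        try (PreparedStatement ps = con.prepareStatement(select)) {
            ps.setTimestamp(1, Timestamp.from(Instant.now().plusSeconds(lookAheadSeconds)));
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    candidates.add(rs.getLong(1));
                }
            }
        }
        // The stage check in the WHERE clause makes the claim atomic: if another
        // Task Finder already queued the task, zero rows are updated and we skip it.
        List<Long> claimed = new ArrayList<>();
        String update = "UPDATE tasks SET stage = 'queued' WHERE id = ? AND stage = 'new'";
        try (PreparedStatement upd = con.prepareStatement(update)) {
            for (Long id : candidates) {
                upd.setLong(1, id);
                if (upd.executeUpdate() == 1) {
                    claimed.add(id); // only claimed tasks are sent to the message queue
                }
            }
        }
        return claimed;
    }
}
Each claimed id is then published to the Message Queue; tasks that could not be claimed were simply picked up by another Task Finder.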
I don't see why you need Quartz here at all. As far as I remember, Quartz is best used to schedule backend internal processes, not user-defined tasks obtained from the db.
Just process the trigger as it is created: save a row to your tasks table with a start_date based on the trigger, and every second select all incomplete tasks with start_date < sysdate. If the job is repeating, calculate the next execution time and insert a new task row / update the previous one accordingly.
As Sam pointed out there are some nice topics addressing the same problem:
Quartz Performance
Quartz FAQ
In a system like the one mentioned, handling this number of triggers should mostly not be a problem. But in my experience it is better to create something like a "JobChecker". If you let your users create their own triggers, it could really break Quartz in some cases. For example, if 5000 users create an event for the exact same time, Quartz will have a hard time handling them correctly. (It is not a situation that is likely to occur often, but it is possible, as your specification does not exclude it.) Quartz has difficulties only when a lot of triggers should be fired at the same time.
My recommendation for this problem is to create one job that runs every hour/minute etc. and handles all the user-set events, similar to a cron job in bash. With this kind of processing your system will be pretty stable even if the number of "triggers" increases dramatically. Basically your line of thought is correct if you strive for scalability.
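A minimal sketch of such a single recurring "JobChecker" with the Quartz 2.x API. The JobChecker class, its query and the exact cron expression are illustrative; on each run it would load the user events due in the next interval from the database and dispatch them:
import org.quartz.CronScheduleBuilder;
import org.quartz.Job;
import org.quartz.JobBuilder;
import org.quartz.JobDetail;
import org.quartz.JobExecutionContext;
import org.quartz.Scheduler;
import org.quartz.Trigger;
import org.quartz.TriggerBuilder;
import org.quartz.impl.StdSchedulerFactory;

// Hypothetical job: replaces the 30k individual user triggers
class JobChecker implements Job {
    @Override
    public void execute(JobExecutionContext context) {
        // SELECT ... FROM user_events WHERE due_time <= now + 1 minute AND processed = false
        // then run/dispatch each event and mark it processed
    }
}

public class JobCheckerSetup {
    public static void main(String[] args) throws Exception {
        Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();

        JobDetail checker = JobBuilder.newJob(JobChecker.class)
                .withIdentity("jobChecker")
                .build();

        // One single recurring trigger (every minute) instead of 30k user triggers
        Trigger everyMinute = TriggerBuilder.newTrigger()
                .withIdentity("jobCheckerTrigger")
                .withSchedule(CronScheduleBuilder.cronSchedule("0 * * * * ?"))
                .build();

        scheduler.scheduleJob(checker, everyMinute);
        scheduler.start();
    }
}
With this layout Quartz only ever manages one trigger, and the number of user events only affects how much work each JobChecker run does.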

How to gracefully handle thousands of Quartz misfires?

We have an application that needs to
nightly reprocess large amounts of data, and
reprocess large amounts of data on demand.
In both of these cases, around 10,000 Quartz jobs get spawned and then run. In the nightly case, we have one Quartz cron job that spawns the 10,000 jobs, each of which individually does the work of processing the data.
The issue is that we are running with around 30 threads, so naturally the Quartz jobs misfire, and continue to misfire until everything is processed. The processing can take up to 6 hours. Each of these 10,000 jobs pertains to a specific domain object, can be processed in parallel, and is completely independent of the others. Each job can take a variable amount of time (from half a second to a minute).
My question is:
Is there a better way to do this?
If not, what is the best way for us to schedule/set up our Quartz jobs so that a minimal amount of time is spent thrashing and dealing with misfires?
A note about our architecture: we are running two clusters with three nodes apiece. The version of Quartz is a bit old (2.0.1), and clustering is enabled in the quartz.properties file.
In both of these cases, around 10,000 quartz jobs get spawned
No need to spawn new Quartz jobs. Quartz is a scheduler, not a task manager.
For the nightly reprocess, you need only that one Quartz cron job to invoke some service responsible for managing and running the 10,000 tasks. In the "on demand" scenario, Quartz shouldn't be involved at all - just invoke that service directly.
How does the service manage 10,000 tasks?
Typically, when only one JVM is available, you'd just use some ExecutorService. Here, since you have 6 nodes at your fingertips, you can easily use Hazelcast. Hazelcast is a Java library that enables you to cluster your nodes, sharing resources efficiently with each other. Hazelcast has a straightforward solution for distributing your ExecutorService, called the Distributed Executor Service. It's as easy as creating a Hazelcast IExecutorService and submitting the task to one or all members. Here's an example from the documentation for invoking on a single member:
Callable<String> task = new Echo(input); // Echo is just some Callable
HazelcastInstance hz = Hazelcast.newHazelcastInstance();
IExecutorService executorService = hz.getExecutorService("default");
Future<String> future = executorService.submitToMember(task, member); // member is one of hz.getCluster().getMembers()
String echoResult = future.get();
I would do this by making use of a queue (RabbitMQ/ActiveMQ). The cron job (or whatever your on-demand trigger is) populates the queue with messages representing the 10,000 work instructions (i.e. the instruction to reprocess the data for a given domain object).
On each of your nodes you have a pool of executors which pull from the queue and carry out the work instructions. With this solution, each executor is kept as busy as possible while there are still work items on the queue, so the overall processing is accomplished as quickly as possible.
The best way is to use a cluster of Quartz instances. This will share the jobs between many cluster nodes:
http://quartz-scheduler.org/documentation/quartz-2.x/configuration/ConfigJDBCJobStoreClustering
I would use a scheduled Quartz job to initiate the 10k tasks, but have it do so by appending task details to a JMS queue (10k messages). That queue is monitored by a message-driven bean (Java EE EJB MDB). The MDB can run simultaneously on multiple nodes in your cluster, and each node can run multiple instances... don't reinvent the wheel for distributing the task load: let Java EE do it.
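A minimal sketch of such an MDB. The queue JNDI name (jms/reprocessQueue) and the message payload format (the domain object id as text) are illustrative, and the exact activation-config property names can vary between containers and JMS versions:
import javax.ejb.ActivationConfigProperty;
import javax.ejb.MessageDriven;
import javax.jms.JMSException;
import javax.jms.Message;
import javax.jms.MessageListener;
import javax.jms.TextMessage;

// The container controls how many instances of this MDB run concurrently on each
// node, so the 10k messages are worked off in parallel across the cluster.
@MessageDriven(activationConfig = {
        @ActivationConfigProperty(propertyName = "destinationType", propertyValue = "javax.jms.Queue"),
        @ActivationConfigProperty(propertyName = "destinationLookup", propertyValue = "jms/reprocessQueue")
})
public class ReprocessMdb implements MessageListener {

    @Override
    public void onMessage(Message message) {
        try {
            // Each message carries the id of one domain object to reprocess
            String domainObjectId = ((TextMessage) message).getText();
            reprocess(domainObjectId);
        } catch (JMSException e) {
            throw new RuntimeException(e); // let the container redeliver the message
        }
    }

    private void reprocess(String domainObjectId) {
        // ... the actual reprocessing work for this domain object ...
    }
}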
