We have an application that needs to:
reprocess large amounts of data nightly, and
reprocess large amounts of data on demand.
In both of these cases, around 10,000 quartz jobs get spawned and then run. In the nightly case, we have one quartz cron job that spawns the 10,000 jobs, each of which individually does the work of processing the data.
The issue we have is that we are running with around 30 threads, so naturally the quartz jobs misfire, and they continue to misfire until everything is processed. The processing can take up to 6 hours. Each of the 10,000 jobs pertains to a specific domain object; the jobs can be processed in parallel and are completely independent. Each job can take a variable amount of time (from half a second to a minute).
My question is:
Is there a better way to do this?
If not, what is the best way for us to schedule/set up our quartz jobs so that a minimal amount of time is spent thrashing and dealing with misfires?
A note about our architecture: we are running two clusters with three nodes apiece. The version of quartz is a bit old (2.0.1), and clustering is enabled in the quartz.properties file.
In both of these cases, around 10,000 quartz jobs get spawned
No need to spawn new quartz jobs. Quartz is a scheduler - not a task manager.
In the nightly reprocess - you need only that one quartz cron job to invoke some service responsible for managing and running the 10,000 tasks. In the "on demand" scenario, quartz shouldn't be involved at all. Just invoke that service directly.
How does the service manage 10,000 tasks?
Typically, when only one JVM is available, you'd just use some ExecutorService. Here, since you have 6 nodes at your disposal, you can easily use Hazelcast. Hazelcast is a Java library that lets you cluster your nodes and share resources efficiently between them. Hazelcast has a straightforward way of distributing your ExecutorService, called the Distributed Executor Service. It's as easy as creating a Hazelcast ExecutorService and submitting tasks to one, several, or all members. Here's an example based on the documentation for invoking a task on a single member:
HazelcastInstance hz = Hazelcast.newHazelcastInstance();
IExecutorService executorService = hz.getExecutorService("default");
Member member = hz.getCluster().getMembers().iterator().next(); // pick a cluster member to run on
Callable<String> task = new Echo(input); // Echo is just some Callable<String>
Future<String> future = executorService.submitToMember(task, member);
String echoResult = future.get(); // blocks until the task has completed on that member
I would do this by making use of a queue (RabbitMQ/ActiveMQ). The cron job (or whatever your on-demand trigger is) populates the queue with messages representing the 10,000 work instructions (i.e. the instruction to reprocess the data for a given domain object).
On each of your nodes you have a pool of executors which pull from the queue and carry out the work instructions. This keeps every executor as busy as possible while there are still work items on the queue, so the overall processing completes as quickly as possible.
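To make this concrete, here is a minimal sketch using the RabbitMQ Java client; the queue name "reprocess.work" and the reprocess(...) helper are my own placeholders, not names from the question:

import com.rabbitmq.client.*;

Connection conn = new ConnectionFactory().newConnection();
Channel channel = conn.createChannel();
channel.queueDeclare("reprocess.work", true, false, false, null); // durable work queue

// Producer side (the nightly cron job or the on-demand trigger): one message per domain object
for (int i = 0; i < 10_000; i++) {
    channel.basicPublish("", "reprocess.work",
            MessageProperties.PERSISTENT_TEXT_PLAIN, String.valueOf(i).getBytes());
}

// Consumer side (run on every node): prefetch 1 so each executor takes one message at a time
channel.basicQos(1);
channel.basicConsume("reprocess.work", false, (tag, delivery) -> {
    reprocess(new String(delivery.getBody())); // placeholder for the actual per-object work
    channel.basicAck(delivery.getEnvelope().getDeliveryTag(), false);
}, tag -> { });

With manual acks and a prefetch of 1, a message that a worker dies on is simply redelivered to another worker.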
The best way is to use a cluster of Quartz instances. This will share the jobs between the cluster nodes:
http://quartz-scheduler.org/documentation/quartz-2.x/configuration/ConfigJDBCJobStoreClustering
I would use a scheduled quartz job to initiate the 10k tasks, but have it do so by appending the task details to a JMS queue (10k messages). That queue is monitored by a message-driven bean (a Java EE EJB MDB). The MDB can run simultaneously on multiple nodes in your cluster, and each node can run multiple instances... don't reinvent the wheel for distributing task load: let Java EE do it.
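For illustration, a bare-bones MDB for this could look like the following; the destination name and the reprocess(...) call are placeholders:

import javax.ejb.ActivationConfigProperty;
import javax.ejb.MessageDriven;
import javax.jms.*;

@MessageDriven(activationConfig = {
    @ActivationConfigProperty(propertyName = "destinationLookup", propertyValue = "jms/reprocessQueue"),
    @ActivationConfigProperty(propertyName = "destinationType", propertyValue = "javax.jms.Queue")
})
public class ReprocessMdb implements MessageListener {
    public void onMessage(Message message) {
        try {
            reprocess(((TextMessage) message).getText()); // placeholder for the per-object work
        } catch (JMSException e) {
            throw new RuntimeException(e); // rollback: the container will redeliver the message
        }
    }
    private void reprocess(String domainObjectId) { /* ... */ }
}

The container's MDB pool size then caps the concurrency on each node, replacing the 30-thread quartz pool as the throttle.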
Related
I have an actor called a TaskRunner. Its tasks can take up to 1 minute to run. Because of the library I use, there can only be one such actor per JVM/node. I have 1000 of these nodes across various machines.
I would like to distribute tasks to these nodes using various rules but the most important one is:
Never queue tasks in a TaskRunner node's mailbox, always wait until a TaskRunner is free before sending it a task
The way I was thinking of doing this is to have an actor on another node (let's call it the Scheduler actor) listen for registrations from the TaskRunner nodes and keep internal state of what has been sent where.
Presumably, if I did this, I could only ever have one instance of this Scheduler actor, because if there were more than one they wouldn't know which TaskRunner nodes were currently busy, and we would end up with queued tasks.
Does this mean I should be using a Cluster Singleton for the Scheduler actor?
Is there a better way to achieve my goal?
I would say you need:
a dispatcher actor (a cluster singleton) that sends tasks to actors from a pool of idle workers
your TaskRunner actor should have two states: running and idle. While idle, it should regularly re-register itself with the dispatcher actor (notifying it that it is idle). Regularly, because the dispatcher can lose its state when a node shuts down and the singleton moves to another node.
the dispatcher itself keeps the list of idle actors. When a new task needs to be done and the list is not empty, a worker is taken from the list and the task is sent to it (the worker could be removed from the list immediately, but it is safer to wait for an Ack to be sure the task has been accepted, and to re-send it to another worker if the Ack times out); see the sketch below
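A minimal sketch of such a dispatcher with Akka's Java API; the message classes are hypothetical, and the cluster-singleton wiring, Acks and periodic re-registration are left out:

import akka.actor.AbstractActor;
import akka.actor.ActorRef;
import java.util.ArrayDeque;
import java.util.Deque;

public class Dispatcher extends AbstractActor {
    public static final class RegisterIdle {} // sent by a TaskRunner whenever it is idle
    public static final class Task {
        public final String payload;
        public Task(String payload) { this.payload = payload; }
    }

    private final Deque<ActorRef> idleWorkers = new ArrayDeque<>();
    private final Deque<Task> pendingTasks = new ArrayDeque<>();

    @Override
    public Receive createReceive() {
        return receiveBuilder()
            .match(RegisterIdle.class, m -> {
                if (!pendingTasks.isEmpty()) {
                    getSender().tell(pendingTasks.poll(), getSelf()); // put the worker straight to work
                } else if (!idleWorkers.contains(getSender())) {
                    idleWorkers.add(getSender()); // remember it as idle
                }
            })
            .match(Task.class, task -> {
                if (idleWorkers.isEmpty()) {
                    pendingTasks.add(task); // hold the task; never queue in a busy worker's mailbox
                } else {
                    idleWorkers.poll().tell(task, getSelf());
                }
            })
            .build();
    }
}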
Given your requirement, rather than building everything from scratch, you might want to consider adapting Lightbend's distributed-workers template which employs a pull model. It primarily consists of 1) a master cluster singleton that maintains state of workers, and, 2) an actor system of workers which register and pull work from the master singleton actor.
I adapted a repurposed version of the template for an R&D project in the past, and it delivered the work-pulling functionality as advertised. Note that the template uses the retired Activator (which can easily be detached or replaced with sbt in the main code). It also does distributed pub-sub and persistence journaling, which you can exclude if not needed. Its source code is available on GitHub.
Referring to your approach of a singleton master and multiple workers: there can be a situation where your master is overloaded with tasks to schedule, which may result in more time to schedule the tasks to the workers.
So instead of making the master a cluster singleton, you can have multiple masters, each with a subset of the workers assigned to it.
The distribution of work to the different masters can be done through cluster sharding, based on a sharding key.
Akka provides cluster sharding; you can refer to that.
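As a rough sketch with Akka's classic Java sharding API; the TaskEnvelope type, its fields, the Master actor class and the shard count are all assumptions:

import akka.actor.ActorRef;
import akka.actor.Props;
import akka.cluster.sharding.ClusterSharding;
import akka.cluster.sharding.ClusterShardingSettings;
import akka.cluster.sharding.ShardRegion;

ShardRegion.MessageExtractor extractor = new ShardRegion.MessageExtractor() {
    public String entityId(Object message) {
        return ((TaskEnvelope) message).shardingKey; // hypothetical envelope carrying the key
    }
    public Object entityMessage(Object message) {
        return ((TaskEnvelope) message).task;
    }
    public String shardId(Object message) {
        return String.valueOf(Math.floorMod(entityId(message).hashCode(), 10)); // e.g. 10 master shards
    }
};

ActorRef masters = ClusterSharding.get(system).start(
        "master", Props.create(Master.class), ClusterShardingSettings.create(system), extractor);

Tasks sent to the masters shard region are then routed to whichever master owns the corresponding shard.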
And to make your masters fault tolerant, you can always use persistent actors as masters.
I have the following use case in a Spring-based web application:
I need to apply the Competing Consumers EIP with the following twist: the messages in the queue are actually split tasks belonging to the same job. Therefore, I need to properly track when all tasks of a job have completed, along with their completion status, in order to save the job either as COMPLETED or FAILED, log the outcome, and notify the users accordingly, e.g. by e-mail.
So, given the requirements I described above, my question is:
Can this be done with RabbitMQ and if yes how?
I created a quick gist to show a very crude example of how one could do it. In this example there is one producer, two consumers, and two queues: one ("SEND") that the producer publishes to and the consumers consume from, and, vice versa, one ("RECV") that the consumers publish to and the producer consumes from.
Now bear in mind this is a pretty crude example: the producer simply sends one job (a random number of tasks between 0 and 5) and blocks until the job is done. A way around this would be to store each job id and its number of tasks in a Map, and on every completion report check the number of tasks done for that job id.
What you are trying to do is beyond the scope of RabbitMQ. RabbitMQ is for sending and receiving messages, with the ability to queue them.
It can't track your job tasks for you.
You will need a "Job Storage" service. Whenever a consumer finishes a task, it updates the Job Storage service, marking the task as done. The Job Storage service knows how many tasks are in each job, and when the last task is done, it marks the job as succeeded. In this service you will also implement the rest of your business logic, such as when to treat a job as failed.
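A bare-bones in-memory version of such a service could look like this; all names are illustrative, and in production you would back it with a database so the state survives restarts:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

public class JobStorageService {
    private final ConcurrentMap<String, AtomicInteger> remainingTasks = new ConcurrentHashMap<>();

    // Called when a job is split into tasks: remember how many there are
    public void jobCreated(String jobId, int taskCount) {
        remainingTasks.put(jobId, new AtomicInteger(taskCount));
    }

    // Called by a consumer after each task; returns true when the whole job is done
    public boolean taskCompleted(String jobId) {
        if (remainingTasks.get(jobId).decrementAndGet() == 0) {
            remainingTasks.remove(jobId); // last task: mark COMPLETED/FAILED, log, send the e-mail
            return true;
        }
        return false;
    }
}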
I'm building a system where users can set a future date (down to hours and minutes) in a calendar. At that date a trigger calls a certain task, unique for every user.
Every user can set a different date. The system will have 10k+ users from the start, and a user can create more than one trigger.
So assuming I have 10k users and each user creates on average 3 triggers => 30k triggers with 30k different dates.
All dates are saved in a database.
I'm new to quartz; can this be done in a more optimized way?
I was thinking about making a task run every minute that gets the tasks that are supposed to run in the next hour and removes them from the database.
Do you have any better ideas? Has anyone used quartz for a large number of triggers?
You have the schedule backed by the database. If I understand the idea, you want quartz to load all the upcoming tasks and execute them in the future.
This is a problematic approach:
Synchronization issues: I assume that users can edit, remove and add new tasks to the database. You would have to periodically ask the database to refresh the state of the quartz jobs: remove some jobs, edit others, etc. This may not be trivial. The state of the program would be a long-lived cache which needs to be synchronised often.
Performance and scalability issues: even if the proposed solution is OK for 30k tasks, it may not be OK for 70k or 700k. In your approach it is not easy to scale: adding a new machine would require an additional layer of synchronisation to decide which machine should actually execute which job (as all of them hold all the tasks).
What I would propose:
Add the "stage" to the Tasks table (new, queued, running, finished, failed)
divide your solution into several components. (Initially they can run on a single machine but it will be easy to scale)
Components:
Task Finder: executed periodically (once every few seconds). Scans the database for tasks that are "new" and due soon, sends the tasks found to the Message Queue, and marks them as "queued" in the db. Marking tasks as "queued" has to be done carefully, as there can be multiple Task Finders; see the JDBC sketch after this list. (As an addition, it may also find tasks that were marked as "queued" or "running" more than N minutes ago but are not "finished" or "failed" - these probably need to be re-run.)
Message Queue: the connector between the Task Finder and the Task Executor.
Task Executor: listens to the Message Queue and processes the tasks it receives. Marks each task as "running" when it starts and as "finished" or "failed" when done.
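For the "queued" marking, a conditional UPDATE lets multiple Task Finders compete safely: whichever finder actually changes the row wins the claim and is the only one allowed to enqueue the task. A JDBC sketch, reusing the table and stage names proposed above:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

boolean claim(Connection db, long taskId) throws SQLException {
    try (PreparedStatement ps = db.prepareStatement(
            "UPDATE tasks SET stage = 'queued' WHERE id = ? AND stage = 'new'")) {
        ps.setLong(1, taskId);
        return ps.executeUpdate() == 1; // exactly one row changed: this finder won the claim
    }
}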
With this approach you can have:
multiple Task Executors on multiple machines
multiple Task Finders on multiple machines
even if one of the Task Finders or Executors fails, it is not a single point of failure. Some of the tasks will be delayed, but they will be picked up and run afterwards.
This may not address all the scenarios but would be a good starting point.
I don't see why you need quartz here at all. As far as I remember, quartz is best used to schedule backend internal processes, not user-defined tasks obtained from the db.
Just process the trigger as it is created: save a row to your tasks table with a start_date based on the trigger, and every second select all incomplete tasks with start_date < sysdate. If the job is repeating, calculate the next execution time and insert a new task row / update the previous one accordingly.
As Sam pointed out there are some nice topics addressing the same problem:
Quartz Performance
Quartz FAQ
In a system like the one mentioned, handling this number of triggers should mostly not be a problem. But in my experience it is better to create something like a "JobChecker". If you enable your users to create their own triggers, it could really break Quartz in some cases. For example, if 5000 users create an event for the exact same time, Quartz will have a hard time handling them correctly. (It is not a situation that will occur often, but it is possible, as your specification does not exclude it.) Quartz has difficulties only when a lot of triggers have to be fired at the same time.
My recommendation for this problem is to create one job that runs every hour/minute etc. and handles all the user-set events. This is similar to a cron job in bash. With this kind of processing your system will stay pretty stable even if the number of "triggers" increases dramatically. Basically your line of thought is correct if you strive for scalability.
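With the Quartz 2.x API, wiring up that single recurring checker could look like this; the JobCheckerJob class (an org.quartz.Job that scans the database for due events) and the cron expression are illustrative:

import org.quartz.*;
import org.quartz.impl.StdSchedulerFactory;

Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();

JobDetail checker = JobBuilder.newJob(JobCheckerJob.class)
        .withIdentity("jobChecker")
        .build();
Trigger everyMinute = TriggerBuilder.newTrigger()
        .withIdentity("jobCheckerTrigger")
        .withSchedule(CronScheduleBuilder.cronSchedule("0 * * * * ?")) // fire at second 0 of every minute
        .build();

scheduler.scheduleJob(checker, everyMinute);
scheduler.start();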
I'm looking into Java Batch and found a couple of good examples, but I'm not sure whether independent jobs will run concurrently/in parallel or sequentially. If they run in parallel or concurrently, how do I set a limit on the concurrency? Further, is there any queue maintained internally - I know about the JobRepository, but in what fashion will jobs be picked?
Note: this is not about defining flows with a split to run concurrently on multiple threads, nor about partitioning the input in order to process ranges of data in parallel.
I have to write a heavy-load system with a pretty easy task to do, so I decided to split this task across multiple workers in different locations (or clouds). To communicate I want to use a RabbitMQ queue.
In my system there will be two kinds of software nodes: schedulers and workers. Schedulers take user input from queue_input, split it into smaller tasks and put these smaller tasks into workers_queue. Workers read this queue and 'do the thing'. I used round-robin load balancing here - and all works pretty well, as long as no worker crashes. When one does, I lose the information about task completion (it's not allowed to do a single operation twice; each task contains a pack of 50 iterations of the worker code with different data).
I'm considering something like a technical_queue - another channel for scheduler-worker communication - and I wonder how to design it in a good way. I used the tutorials from the rabbitmq page, so my worker thread looks like:
while (true) {
    message = consume(QUEUE, ...);
    handle(message); // do 50 simple tasks in a loop for the data in the message
}
How can I handle the second queue? Another thread with some while(true) {} loop, or is there a better solution? Maybe I should reuse the existing queue with a topic exchange? (I wanted an independent channel of communication while handling the task, which may take some time.)
You should probably take a look at spring-amqp (doc). I hate to tell you to add a layer, but that Spring library takes care of the threading issues and thread management with its SimpleMessageListenerContainer. Each container attaches to one queue, and you can specify the number of threads (i.e. workers) per queue.
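A minimal sketch of that setup; the queue name matches the one in the question, while the thread count and the handle(...) call are placeholders:

import org.springframework.amqp.core.MessageListener;
import org.springframework.amqp.rabbit.connection.CachingConnectionFactory;
import org.springframework.amqp.rabbit.listener.SimpleMessageListenerContainer;

CachingConnectionFactory connectionFactory = new CachingConnectionFactory("localhost");
SimpleMessageListenerContainer container = new SimpleMessageListenerContainer(connectionFactory);
container.setQueueNames("workers_queue"); // the queue this pool of workers drains
container.setConcurrentConsumers(10);     // number of worker threads for this queue
MessageListener listener = message -> handle(message); // placeholder for your 50-iteration handler
container.setMessageListener(listener);
container.start();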
Alternatively you can make your own using an ExecutorService, but you will probably end up rewriting what SimpleMessageListenerContainer does. You could also just launch (via the OS or batch scripts) more processes, which will add more consumers to each queue.
As far as queue topology is concerned, it is entirely dependent on business logic/concerns and generally less on performance needs. More often you have more queues for business reasons and more workers for performance reasons, but if a queue gets backed up with the same type of message, consider giving that message type its own queue. What you're describing sounds like two queues, with multiple consumers on your worker queue.
Other than the threading issue and queue topology I'm not entirely sure what else you are asking.
I would recommend you create a second queue consumer:
consumer1 -> queue_process
consumer2 -> queue_process
Both consumers should listen to the same queue.
Greetings, I hope this helps.