Quartz scheduler maintenance and performance overheads

Quartz scheduler maintenance and performance overheads - java

We are currently evaluating quartz-scheduler to use in our project. For our use case, we need only one time trigger to be fired at some point in future, it need not to be a repeatable or cron trigger.
So in my POC, I'm creating a new simple one time trigger when business event occurs. I can see in clustered environment (using JDBC store of quartz), triggers are being balanced/distributed among multiple nodes.
Desired behaviour is observed from POC, but, from performance standpoint, how expensive will it be if we create a new one time trigger each time when we run at scale. From my understanding, one bottleneck would be bloating of database with triggers, possible solution for database cleanup is to add a background task to cleanup old triggers.
I am interested in hearing about experiences and pain points on maintaining scheduler with our design and any inputs for improving design.

You can safely use one-time triggers and they will be automatically removed by Quartz after they have fired. What happens is that Quartz checks all triggers and determines if these triggers are going to fire at some point in the future. If they do not, Quartz simply removes them from the store because it makes no sense to keep them.
A somewhat similar principle applies to jobs. If a job has no associated triggers, Quartz automatically removes it from the store unless the job has the durability flag set to true.
So in your case, you will probably want to register a bunch of durable jobs and then your app will create one-time triggers for these jobs on as needed basis. The jobs will remain in the store and the triggers will be automatically cleaned up when they are done.

Related

Distributed, synchronized batch processing

In our current Java project, we need to batch process a huge set of records. Once, this processing is done, it must start again and process all records again. This processing must be parallelized as well as distributed among multiple nodes.
The records itself are stored in a database. Using some id range (e.g. 1-10000) for identifying a batch would be sufficient.
From a high level perspective, I see the following steps:
A sub task processes one batch of records.
A master task checks if any sub task is still running. If not, create one sub task for each batch of records.
We use MongoDB quite heavily and thought of persisting sub tasks in it. Then, each node can pick up sub tasks that are not done yet, does the processing and marks the record as done. Once there are no undone subtasks, the master task creates all the sub tasks again. This would probably work, but we are looking for a solution in which we don't need to do the heavy synchronization work ourselves.
Could this be a possible use-case for akka?
Can akka-persistence be used to synchronize the processing among different nodes?
Are there any other Java/JVM frameworks suited for this job?

Your question is way too broad for SO's format. Plase read this guide in the future before asking, and don't ask your group members to vote your question up just to inflate what is obviously an ill-posed question ( ͡° ͜ʖ ͡°).
Anyways:
1) Yes, you can implement your requirements in Akka. In particular, since you mentioned multiple nodes, you are looking at the akka-cluster module (for inter-node communication), and you might also need akka-cluster-sharding (in case you want to keep all data in memory beside during processing).
2) No, I would strongly not reccomend that. While you could technically force your problem into using akka-persistence for synchronizing the tasks, the goal of akka-persistence is simply to make an actor's state persistent. Akka itself in its basic form is enough for handling all your synchronization issues. Simply have a master actor create a worker for every subtask and monitor its completion.
3) Yes. Note that the answer to this question is always yes no matter which job.

java, quartz and multiple tasks triggered at certain times saved in a database

I'm building a system where users can set a future date(down to hours and minutes) in calendar. At that date a trigger is calling a certain task, unique for every user.
Every user can set a different date. The system will have 10k+ from the start and a user can create more than one trigger.
So assuming I have 10k users each user create on average 3 triggers => 30k triggers with 30k different dates.
All dates are saved in a database.
I'm new to quartz, can this be done in a more optimized way?
I was thinking about making a task run every minute that will get the tasks that will suppose to run in the next hour and remove them from database.
Do you have any better ideas? Did someone used quartz for a large number of triggers.

You have the schedule backed in the database. If I understand the idea - you want the quartz to load all the upcoming tasks to execute them in the future.
This is problematic approach:
Synchronization Issues: I assume that users can edit, remove and add new tasks to the database. You would have to periodically ask the database to refresh the state of the quartz jobs, remove some jobs, edit other jobs etc. This may not be trivial. The state of the program would be a long living cache which needs to be synchronised often.
Performance and scalability issues: Even if proposed solution may be ok for 30K tasks it may not be ok for 70k or 700k tasks. In your approach it's not easy to scale - adding new machine would require additional layer of synchronisation - which machine should actually execute which job (as all of them have all the tasks).
What I would propose:
Add the "stage" to the Tasks table (new, queued, running, finished, failed)
divide your solution into several components. (Initially they can run on a single machine but it will be easy to scale)
Components:
Task Finder: Executed periodically (once every few seconds). Scans the database for tasks that are "new", and due soon. Sends the tasks found to Message Queue and marks the task as "queued" in the db. Marking as "queued" has to be done carefully as there can be multiple "task finders". (As an addition it may find the tasks that have been marked as "queued" or "running" more than N minutes ago and are not "finished" nor "canceled" - probably need to re-run these)
Message Queue: Connector between Taks Finder and Task Executor.
Task Executor: Listens to the Message Queue and process the tasks that it received. Marks the tasks as "running" initially and "finished" or "failed" later on.
With this approach you can have:
multiple Task Executors on multiple machines
multiple Task Schedulers on multiple machines
even if one of the Task Schedulers or Executors will fail it will not be Single Point of Failure. Some of the tasks will be delayed but it will be picked up and run afterwards.
This may not address all the scenarios but would be a good starting point.

I don't see why you need quartz here at all. As far as I remember, quartz is best used to schedule backend internal processes, not user-defined tasks obtained from db.
Just process the trigger as it is created, save a row to your tasks table with start_date based on the trigger and every second select all incomplete tasks with start_date< sysdate. If the job is repeating, calculate next execution time and insert new task row / update previous accordingly.

As Sam pointed out there are some nice topics addressing the same problem:
Quartz Performance
Quartz FAQ
In a system like the mentioned it should not a problem mostly to handle this amount of triggers. But according to my experiance it is a better way to create something like a "JobChecker". If you enable your users to create own triggers it could really break Quartz in some cases. For example if 5000 user creates an event to the exact same time, Quartz will have a hard time to handle them correctly. (It is not likely a situation that will occur often, but it is possible as your specification does not excludes it.) Quartz has difficulties only when a lot of triggers should be fired at the same time.
My recommendation to this problem is to create one job that is running in every hour/minute etc and that should handle every user set events. This way is simmilar to a cron job in bash. With this kind of processing your system will be pretty stable even if the number of "triggers" increases dramatically. Basically your line of thought is correct if you thrive for scalability.

Which java concurrent to use for cleaning DB (on demand/scheduled)

I have a thread cleaner in my code that is being created if the DB capacity was exceeded, the capacity is checked on every insertion to the DB. I would like to add more functionality to this cleaner and clean also when number of files exceeding, lets say 10000 files. The new functionality should run scheduled.
I want to be able to clean the DB in 2 ways:
1. On demand.
2. Scheduled, every day on X hour.
Which concurrent java class to use?
How can I make sure that the same thread will be used by the 2 ways above?

Code that would perform cleanup of DB should be completely separated out of scheduling (single responsibility principle), so that you could execute it at any time from some other code.
As for scheduling, I would suggest you looking at Quartz scheduler, and get familiar with CRON so that you could extract it to properties to have possibility to change scheduling trigger without modifying your code.
You should synchronize your code so that no more than one cleanup gets performed at the same time, this should be easy with standard synchronize.
If you wish to make it very simple and don't want to add new dependencies, you can go with standard Java solution: Timer. Timer#scheduleAtFixedRate can provide fixed rate execution. Which means you'll have to add extra code whenever new requirements will show up (e.g., don't schedule at weekend).

Two threads reading from the same table:how do i make both thread not to read the same set of data from the TASKS table

I have a tasks thread running in two separate instances of tomcat.
The Task threads concurrently reads (using select) TASKS table on certain where condition and then does some processing.
Issue is ,sometimes both the threads pick the same task , because of which the task is executed twice.
My question is how do i make both thread not to read the same set of data from the TASKS table

It is just because your code(which is accessing data base)DAO function is not synchronized.Make it synchronized,i think your problem will be solved.

If the TASKS table you mention is a database table then I would use Transaction isolation.
As a suggestion, within a trasaction, set an attribute of the TASK table to some unique identifiable value if not set. Commit the tracaction. If all is OK then the task has be selected by the thread.
I haven't come across this usecase so treat my suggestion with catuion.

I think you need to see some information how does work with any enterprise job scheduler, for example with Quartz

For your use case there is a better tool for the job - and that's messaging. You are persisting items that need to be worked on, and then attempting to synchronise access between workers. There are a number of issues that you would need to resolve in making this work - in general updating a table and selecting from it should not be mixed (it locks), so storing state there doesn't work; neither would synchronization in your Java code, as that wouldn't survive a server restart.
Using the JMS API with a message broker like ActiveMQ, you would publish a message to a queue. This message would contain the details of the task to be executed. The message broker would persist this somewhere (either in its own message store, or a database). Worker threads would then subscribe to the queue on the message broker, and each message would only be handed off to one of them. This is quite a powerful model, as you can have hundreds of message consumers all acting on tasks so it scales nicely. You can also make this as resilient as it needs to be, so tasks can survive both Tomcat and broker restarts.

Whether the database can provide graceful management of this will depend largely on whether it is using strict two-phase locking (S2PL) or multi-version concurrency control (MVCC) techniques to manage concurrency. Under MVCC reads don't block writes, and vice versa, so it is very possible to manage this with relatively simple logic. Under S2PL you would spend too much time blocking for the database to be a good mechanism for managing this, so you would probably want to look at external mechanisms. Of course, an external mechanism can work regardless of the database, it's just not really necessary with MVCC.
Databases using MVCC are PostgreSQL, Oracle, MS SQL Server (in certain configurations), InnoDB (except at the SERIALIZABLE isolation level), and probably many others. (These are the ones I know of off-hand.)
I didn't pick up any clues in the question as to which database product you are using, but if it is PostgreSQL you might want to consider using advisory locks. http://www.postgresql.org/docs/current/interactive/explicit-locking.html#ADVISORY-LOCKS I suspect many of the other products have some similar mechanism.

I think you need have some variable (column) where you keep last modified date of rows. Your threads can read same set of data with same modified date limitation.
Edit:
I did not see "not to read"
In this case you need have another table TaskExecutor (taskId , executorId) , and when some thread runs task you put data to TaskExecutor; and when you start another thread it just checks that task is already executing or not (Select ... from RanTask where taskId = ...).
Нou also need to take care of isolation level for transaсtions.

EJB timer performance

I am trying to decide if use a java-ee timer in my application or not. The server I am using is Weblogic 10.3.2
The need is: After one hour of a call to an async webservice from an EJB, if the async callback method has not been called it is needed to execute some actions. The information regarding if the callback method has been called and the date of the execution of the call is stored in database.
The two possibilities I see are:
Using a batch process that every half hour looks for all the calls that have been more than one hour without response and execute the needed actions.
Create a timer of one hour after every single call to the ws and in the #Timeout method check if the answer has come and if it has not, execute the required actions.
From a pure programming point of view, it looks easier and cleaner the second one, but I am worry of the performance issues I could have if let's say there are 100.000 Timer created at a single moment.
Any thoughts?

You would be better off having a more specialized process. The real problem is the 100,000 issue. It would depend on how long your actions take.
Because its easy to see that each second, the EJB timer would fire up 30 threads to process all of the current pending jobs, since that's how it works.
Also timers are persistent, so your EJB managed timer table will be saving and deleting 30 rows per second (60 total), this is assuming 100K transactions/hour.
So, that's an lot of work happening very quickly. I can easily see the system simply "falling behind" and never catching up.
A specialized process would be much lighter weight, could perhaps batch the action calls (call 5 actions per thread instead of one per thread), etc. It would be nice if you didn't have to persist the timer events, but that is what it is. You could almost easily simply append the timer events to a file for safety, and keep them in memory. On system restart, you can reload that file, and then roll the file (every hour create a new file, delete the older file after it's all been consumed, etc.). That would save a lot of DB traffic, but you could lose the transactional nature of the DB.
Anyway, I don't think you want to use the EJB Timer for this, I don't think it's really designed for this amount of traffic. But you can always test it and see. Make sure you test restarting your container see how well it works with 100K pending timer jobs in its table.

All depends of what is used by the container. e.g. JBoss uses Quartz Scheduler to implement EJB timer functionality. Quartz is pretty good when you have around 100 000 timer instances.

#Pau: why u need to create a timer for every call made...instead u can have a single timer thread created at start up of application which runs after every half-hour(configurable) period of time and looks in your Database for all web services calls whose response have not been received and whose requested time is past 1 hour. And for selected records, in for loop, it can execute required action.
Well above design may not be useful if you have time critical activity to be performed.
If you have spring framework in your application, you may also look up its timer services.http://static.springsource.org/spring/docs/1.2.9/reference/scheduling.html

Maybe you could use some of these ideas:
Where I'm at, we've built a cron-like scheduler which is powered by a single timer. When the timer fires the system checks which crons need to run using a Quartz CronTrigger. Generally these crons have a lot of work to do, and the way we handle that is each cron spins its individual tasks off as JMS messages, then MDBs handle the messages. Currently this runs on a single Glassfish instance and as our task load increases, we should be able to scale this up with a cluster so multiple nodes are processing the jms messages. We balance the jms message processing load for each type of task by setting the max-pool-size in glassfish-ejb-jar.xml (also known as sun-ejb-jar.xml).
Building a system like this and getting all the details right isn't trivial, but it's proving really effective.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.