We are using the Quartz scheduler in a clustered environment (two nodes in a cluster, pointing to a single Oracle database). Currently we have two jobs, each of which runs roughly once an hour.
We have a separate database schema for the quartz jobs. What we've noticed is that quartz checks the database every 15 seconds (default value for clusterCheckinInterval).
We don't like this and would like to make it less frequent. What we have in mind is a 1-minute interval, but most example configurations set clusterCheckinInterval to 20000.
Can somebody please recommend a suitable value for clusterCheckinInterval?
From the Quartz documentation:
org.quartz.jobStore.clusterCheckinInterval
Set the frequency (in milliseconds) at which this instance "checks-in" with the other instances of the cluster. Affects the quickness of detecting failed instances.
In a Quartz cluster, clusterCheckinInterval determines how quickly the cluster reacts to a node failure (as far as Quartz jobs are concerned). The smaller the interval, the sooner a failed node is detected. The other nodes use this value to check whether recoverable jobs were running on a failed node; if so, Quartz tries to re-run them.
In general the default value is good enough, but you have to take into consideration the frequency of jobs and the effect that a missed job run can cause.
If you have a number of jobs that must run every second, then set the interval to 1000 (milliseconds).
If you have jobs that run every second, but it is not critical that every run happens, then 5-15 seconds is good enough (depending on the fault tolerance of the system).
If you have hour-long jobs that run a few times a day, you can raise the interval to as much as 60 seconds.
In my opinion, 20-30 database requests per minute does not count as "load", so I would set it to 2 or 3 seconds (2000 or 3000 milliseconds).
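To put the trade-off above in numbers, here is a minimal sketch that converts a candidate interval into check-ins per minute. The class and method names are made up for illustration, and it assumes one check-in per node per interval; the number of SQL statements behind each check-in is not modelled.

```java
/** Back-of-the-envelope numbers for candidate clusterCheckinInterval values. */
final class CheckinIntervalEstimate {

    /** Check-ins per minute issued by a single node for the given interval. */
    static double checkinsPerMinute(long intervalMillis) {
        if (intervalMillis <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        return 60_000.0 / intervalMillis;
    }

    /** Check-ins per minute across all nodes of the cluster. */
    static double clusterCheckinsPerMinute(long intervalMillis, int nodes) {
        return checkinsPerMinute(intervalMillis) * nodes;
    }

    public static void main(String[] args) {
        // Candidate values mentioned in this thread, in milliseconds.
        long[] candidates = {2_000L, 3_000L, 15_000L, 20_000L, 60_000L};
        for (long ms : candidates) {
            System.out.printf("%d ms: %.1f per node, %.1f for a 2-node cluster%n",
                    ms, checkinsPerMinute(ms), clusterCheckinsPerMinute(ms, 2));
        }
    }
}
```

The other side of the trade-off is detection time: a failed node can only be noticed after at least one interval has passed, so a 60-second interval means recovery of its jobs starts roughly a minute or more after the failure.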
I'm using Java Quartz (2.3.1) with Postgres as the job store, and I have 3-4 machines all running the scheduler (horizontal scaling). I want the RDS database to act as the source of truth: if I have a job scheduled to repeat every hour, I want it to run on any one of the machines. I don't care which one it runs on, as long as exactly one machine is triggered in that hour.
This works really well most of the time, but I recently had a trigger that runs once every hour, and about once every two days I see two of my machines getting triggered. I noticed that my isClustered property was false, and I have now set it to true, but I'm not sure how this would help: if this were the problem, the issue would happen 100% of the time rather than rarely. Could anyone tell me what I should look into to actually fix this?
org.quartz.jobStore.isClustered = true
ensures proper database row locks are applied to a trigger before it is picked up. If that property is false, both instances can pick up the same trigger (a race condition) before either has changed the trigger's status. Because it is a race, it only shows up occasionally rather than on every firing.
http://www.quartz-scheduler.org/documentation/quartz-1.8.6/configuration/ConfigJDBCJobStoreClustering.html
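As a rough illustration of the setting above, the sketch below assembles the clustering-related properties in code. The property keys are standard Quartz configuration keys; the class name, the method, and the values passed in are placeholders, and the data source settings are deliberately left out.

```java
import java.util.Properties;

/** Builds the clustering-related part of a Quartz configuration (data source settings omitted). */
final class ClusteredQuartzProperties {

    static Properties build(String instanceName, long checkinIntervalMillis) {
        Properties p = new Properties();
        // Every node of one cluster must use the same instanceName...
        p.setProperty("org.quartz.scheduler.instanceName", instanceName);
        // ...and a distinct instanceId; AUTO lets Quartz generate one per node.
        p.setProperty("org.quartz.scheduler.instanceId", "AUTO");
        p.setProperty("org.quartz.jobStore.class",
                "org.quartz.impl.jdbcjobstore.JobStoreTX");
        // Assumption: a Postgres job store, as in the question.
        p.setProperty("org.quartz.jobStore.driverDelegateClass",
                "org.quartz.impl.jdbcjobstore.PostgreSQLDelegate");
        p.setProperty("org.quartz.jobStore.isClustered", "true");
        p.setProperty("org.quartz.jobStore.clusterCheckinInterval",
                Long.toString(checkinIntervalMillis));
        return p;
    }
}
```

The resulting Properties object can be handed to Quartz's StdSchedulerFactory, which has a constructor accepting Properties. The Quartz clustering documentation also asks that the clocks of all nodes be kept synchronized, which is worth checking when firings are occasionally duplicated.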
I am testing a certain "functionality" that happens after log in.
The test case is 500 users exercising that functionality within 5 minutes.
I can add a synchronising timer after the login to ensure all 500 threads have logged in, but then all 500 "functionality" tasks run at once rather than being spread over 5 minutes, which will crash the app (it thinks there's a DDoS attack and shuts down).
Right now, I am handling this by adding some think time after login, to slow logins down to a stable, predictable rate, so that "functionality" starts at each thread's turn, as scheduled by: the main scheduler + the login response time + the think time...
But that's a bit fuzzy.
Is there a way to "ramp up" tasks once already running?
I can think of two options.
The first one is to use random times. Pick a value in the range [0, 300) seconds, or [0, 300000) milliseconds, and sleep the thread for that random time.
This approach can be a little more realistic, because in some seconds of the interval no threads start and in others 2-3 do. It should still be well balanced overall, since you won't make all the requests at the start.
The second one is to start the threads uniformly. During setup (login, before firing the threads) you can use something like an AtomicInteger, initialized with new AtomicInteger(0), and call getAndIncrement() to assign each thread a position in the range [0, 500). Then, when you fire the threads, sleep 300000.0 * id / 500.0 milliseconds before executing the task/request.
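Both options can be sketched in a few lines of plain Java. The class and method names here are invented for illustration; the 500 threads and the 5-minute window come from the question.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicInteger;

/** Two ways to spread thread start times over a window. */
final class StartDelays {

    // Hands out positions 0, 1, 2, ... to threads during setup.
    private static final AtomicInteger NEXT_POSITION = new AtomicInteger(0);

    /** Option 1: a random delay in [0, windowMillis). */
    static long randomDelayMillis(long windowMillis) {
        return ThreadLocalRandom.current().nextLong(windowMillis);
    }

    /** Option 2, step 1: each thread claims a unique position. */
    static int claimPosition() {
        return NEXT_POSITION.getAndIncrement();
    }

    /** Option 2, step 2: evenly spaced delay for that position. */
    static long uniformDelayMillis(int position, int totalThreads, long windowMillis) {
        return windowMillis * position / totalThreads;
    }
}
```

Each thread would then call Thread.sleep with its delay before running the task, for example `Thread.sleep(StartDelays.uniformDelayMillis(StartDelays.claimPosition(), 500, 300_000L))`, which starts one thread every 600 ms.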
By default JMeter executes requests as fast as it can; you can "throttle" execution to a desired throughput (requests per minute) using the Constant Throughput Timer. For 500 executions spread over 5 minutes, the target would be 100 per minute across all threads.
An example Test Plan would look like:
Thread Group
Login
Synchronizing Timer
Functionality
Constant Throughput Timer
Constant Throughput Timer follows the JMeter Scoping Rules, so you can apply it either to a single sampler or to a group of samplers.
I have a non-concurrent Quartz job running on 6 application server instances. At a high level, the job walks through a DB table and processes and updates whichever rows have expired. I am seeing behavior from the job that I cannot explain.
The job is configured to be triggered every 15 minutes, but since a single run can span multiple days, each of these 15-minute triggers should be suppressed by the lock already held by the running job instance.
So the ideal behavior is: the job starts running on one of the 6 server instances and completes a single iteration over the DB table in, let us say, 3 days. Meanwhile, Quartz tries to start another run every 15 minutes, but since the lock is already held, it should not succeed. When the first run finishes after 3 days, the Quartz scheduler should succeed in starting another run within 15 minutes of the first run's end time.
In reality, however, the job has run on some days and not on others. Sometimes the gap is as long as 8-10 days. I am unable to explain this.
The closest theory I can think of is that during a particular run the server instance got killed (due to a deployment/redeployment), so Quartz did not get a chance to release the shared lock. All attempts to acquire the lock for the next run then keep failing until the orphaned lock expires. The moment it expires, a new run kicks in.
My question is: what are the possible explanations for this and, more importantly, how can I debug it? Any pointers to Quartz documentation on lock management for non-concurrent jobs would be helpful.
I use DisallowConcurrentExecution annotation for non-concurrency.
I'm building a system where users can set a future date (down to hours and minutes) in a calendar. At that date a trigger calls a certain task, unique to every user.
Every user can set a different date. The system will have 10k+ users from the start, and a user can create more than one trigger.
So assuming I have 10k users and each user creates 3 triggers on average => 30k triggers with 30k different dates.
All dates are saved in a database.
I'm new to Quartz; can this be done in a more optimized way?
I was thinking about a task that runs every minute, fetches the tasks that are supposed to run in the next hour, and removes them from the database.
Do you have any better ideas? Has anyone used Quartz for a large number of triggers?
You have the schedule backed by the database. If I understand the idea, you want Quartz to load all the upcoming tasks and execute them in the future.
This is a problematic approach:
Synchronization Issues: I assume that users can edit, remove and add new tasks to the database. You would have to periodically ask the database to refresh the state of the quartz jobs, remove some jobs, edit other jobs etc. This may not be trivial. The state of the program would be a long living cache which needs to be synchronised often.
Performance and scalability issues: Even if the proposed solution is OK for 30k tasks, it may not be for 70k or 700k tasks. This approach is not easy to scale: adding a new machine would require an additional layer of synchronisation to decide which machine should actually execute which job (as all of them have all the tasks).
What I would propose:
Add a "stage" column to the Tasks table (new, queued, running, finished, failed).
Divide your solution into several components. (Initially they can run on a single machine, but it will be easy to scale.)
Components:
Task Finder: Executed periodically (once every few seconds). Scans the database for tasks that are "new", and due soon. Sends the tasks found to Message Queue and marks the task as "queued" in the db. Marking as "queued" has to be done carefully as there can be multiple "task finders". (As an addition it may find the tasks that have been marked as "queued" or "running" more than N minutes ago and are not "finished" nor "canceled" - probably need to re-run these)
Message Queue: Connector between Task Finder and Task Executor.
Task Executor: Listens to the Message Queue and processes the tasks it receives. Marks each task as "running" initially and as "finished" or "failed" later on.
With this approach you can have:
multiple Task Executors on multiple machines
multiple Task Finders on multiple machines
even if one of the Task Finders or Executors fails, it will not be a Single Point of Failure. Some tasks will be delayed, but they will be picked up and run afterwards.
This may not address all the scenarios but would be a good starting point.
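The components above can be sketched in memory as follows. This is only an illustration under stated assumptions: an in-process list stands in for the Tasks table, a BlockingQueue stands in for the message queue, and compareAndSet stands in for a conditional database update (something like "set stage to queued where stage is still new"), which is what keeps two Task Finders from queueing the same task. All class and method names are invented.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicReference;

final class TaskPipelineSketch {

    enum Stage { NEW, QUEUED, RUNNING, FINISHED, FAILED }

    static final class Task {
        final String id;
        final long dueAtMillis;
        final Runnable work;
        final AtomicReference<Stage> stage = new AtomicReference<>(Stage.NEW);

        Task(String id, long dueAtMillis, Runnable work) {
            this.id = id;
            this.dueAtMillis = dueAtMillis;
            this.work = work;
        }
    }

    /** Task Finder: queues every NEW task due within the horizon; returns how many were queued. */
    static int findAndQueue(List<Task> table, long nowMillis, long horizonMillis,
                            BlockingQueue<Task> queue) {
        int queued = 0;
        for (Task t : table) {
            boolean dueSoon = t.dueAtMillis <= nowMillis + horizonMillis;
            // Only one finder can win the NEW -> QUEUED transition for a given task.
            if (dueSoon && t.stage.compareAndSet(Stage.NEW, Stage.QUEUED)) {
                queue.add(t);
                queued++;
            }
        }
        return queued;
    }

    /** Task Executor: processes everything currently on the queue; returns how many were processed. */
    static int drain(BlockingQueue<Task> queue) {
        int processed = 0;
        Task t;
        while ((t = queue.poll()) != null) {
            t.stage.set(Stage.RUNNING);
            try {
                t.work.run();
                t.stage.set(Stage.FINISHED);
            } catch (RuntimeException e) {
                t.stage.set(Stage.FAILED);
            }
            processed++;
        }
        return processed;
    }
}
```

In a real deployment the finder would run on a timer and the executor would block on the queue instead of draining it once; re-queueing tasks stuck in "queued" or "running" for too long is left out of the sketch.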
I don't see why you need quartz here at all. As far as I remember, quartz is best used to schedule backend internal processes, not user-defined tasks obtained from db.
Just process the trigger as it is created, save a row to your tasks table with start_date based on the trigger and every second select all incomplete tasks with start_date< sysdate. If the job is repeating, calculate next execution time and insert new task row / update previous accordingly.
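A minimal sketch of that polling idea, with an in-memory list standing in for the tasks table. The class, field, and method names are hypothetical; a real implementation would issue the select and update against the database, and this sketch updates the existing row rather than inserting a new one.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

final class DueTaskPoller {

    static final class TaskRow {
        final String id;
        Instant startDate;
        final Duration repeatEvery; // null for a one-shot task
        boolean complete;

        TaskRow(String id, Instant startDate, Duration repeatEvery) {
            if (repeatEvery != null && (repeatEvery.isZero() || repeatEvery.isNegative())) {
                throw new IllegalArgumentException("repeat interval must be positive");
            }
            this.id = id;
            this.startDate = startDate;
            this.repeatEvery = repeatEvery;
        }
    }

    /** Stands in for: select all incomplete tasks with start_date earlier than now. */
    static List<TaskRow> selectDue(List<TaskRow> rows, Instant now) {
        List<TaskRow> due = new ArrayList<>();
        for (TaskRow r : rows) {
            if (!r.complete && r.startDate.isBefore(now)) {
                due.add(r);
            }
        }
        return due;
    }

    /** After a run: one-shot tasks are completed, repeating tasks move to their next future slot. */
    static void reschedule(TaskRow row, Instant now) {
        if (row.repeatEvery == null) {
            row.complete = true;
            return;
        }
        Instant next = row.startDate;
        while (!next.isAfter(now)) {
            next = next.plus(row.repeatEvery);
        }
        row.startDate = next;
    }
}
```

A scheduled loop would call selectDue once per second, run each returned task, and then call reschedule on it.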
As Sam pointed out there are some nice topics addressing the same problem:
Quartz Performance
Quartz FAQ
In a system like the one described, handling this number of triggers should mostly not be a problem. But in my experience it is better to create something like a "JobChecker". If you let your users create their own triggers, it could really break Quartz in some cases. For example, if 5000 users create an event for the exact same time, Quartz will have a hard time handling them correctly. (This is not likely to occur often, but it is possible, as your specification does not exclude it.) Quartz only has difficulties when a lot of triggers must fire at the same time.
My recommendation is to create one job that runs every hour/minute etc. and handles all user-set events. This is similar to a cron job. With this kind of processing your system will be pretty stable even if the number of "triggers" increases dramatically. Basically your line of thought is correct if you strive for scalability.
I'm working on a job dispatcher for appengine, and the default scheduler always winds up firing up 3-4 instances that do all the work, plus some overflow instances that might take thousands of tasks, or only a couple, and then sit there burning CPU doing nothing.
My task involves processing jobs for many different sized domains, sometimes there's huge throughput, and other times it's one user with 10,000 models to update; if I turn the normal appengine task scheduler loose, it fails in two ways: 1) backends never shut down, and when memory hits the cap, java gc makes an instance thrash and act like it's almost a zombie yet never shut down {and still take/hold jobs}, and 2) many domains have a single user that takes far longer than all the others to process, and this keeps a backend alive long after the rest of the domain has finished.
These tasks must run throughout the day, and it takes multiple backends to handle fanout, so I can't just dump them all on a B8 and call it a day. We need a dispatcher to manage how tasks get allocated to backends.
Now, I don't want to pay datastore ops on every task just to save a few minutes of cpu time, so my plan of attack {please critique} is to use a static ConcurrentHashMap in RAM, start each run() in a try, have every deferred task put its [hashcode, startTime] in at startup and remove(hashcode) in a finally. There will be one such map per backend instance that's running jobs, wrapped in a method, BackendCounter.addToLiveMap(this); its .size() serves as a running total of how many jobs are alive on that backend {with timestamp to detect zombie jobs that run >10 minutes}. The job dispatcher can fire off a worker thread per instance to monitor how many jobs, excluding itself, are running in that instance, and keep a ranked list in memcache of which instances have how many tasks alive. If one instance drops below a threshold of X live tasks, pick an overflow instance to defer to, then have the method BackendCounter.addToLiveMap(this) throw an exception I can catch to tell jobs to just schedule themselves to a new instance {ChangeInstanceException#getNewTarget()}. This way I can prevent barely-used instances from getting new jobs so they have a chance to shut down, paying only for some memcache ops, and fanout only pays a write and a delete to the static map.
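To make the plan concrete, here is a rough sketch of what such a counter could look like. BackendCounter, addToLiveMap, ChangeInstanceException and getNewTarget are the names from the description above; everything else (setRedirectTarget, removeFromLiveMap, runTracked, zombieCount, and the use of a String for the target) is an assumption made for illustration. The memcache ranking and the monitoring thread are not shown.

```java
import java.util.concurrent.ConcurrentHashMap;

/** Thrown to tell a job to reschedule itself on another instance. */
final class ChangeInstanceException extends Exception {
    private final String newTarget;

    ChangeInstanceException(String newTarget) {
        super("redirect to " + newTarget);
        this.newTarget = newTarget;
    }

    String getNewTarget() {
        return newTarget;
    }
}

/** Per-instance registry of live jobs: identity hash code -> start time in millis. */
final class BackendCounter {
    private static final ConcurrentHashMap<Integer, Long> LIVE = new ConcurrentHashMap<>();
    // Set by the dispatcher's monitor thread; null means "accept new jobs here".
    private static volatile String redirectTarget;

    static void setRedirectTarget(String target) {
        redirectTarget = target;
    }

    static void addToLiveMap(Object job) throws ChangeInstanceException {
        String target = redirectTarget;
        if (target != null) {
            throw new ChangeInstanceException(target);
        }
        // Identity hash codes can collide in rare cases; acceptable for a rough count.
        LIVE.put(System.identityHashCode(job), System.currentTimeMillis());
    }

    static void removeFromLiveMap(Object job) {
        LIVE.remove(System.identityHashCode(job));
    }

    static int liveCount() {
        return LIVE.size();
    }

    /** Jobs that have been alive longer than maxAgeMillis as of nowMillis. */
    static long zombieCount(long nowMillis, long maxAgeMillis) {
        return LIVE.values().stream().filter(start -> nowMillis - start > maxAgeMillis).count();
    }

    /** The try/finally pattern: returns null if the job ran here, else the redirect target. */
    static String runTracked(Object job, Runnable body) {
        try {
            addToLiveMap(job);
            body.run();
            return null;
        } catch (ChangeInstanceException e) {
            return e.getNewTarget();
        } finally {
            removeFromLiveMap(job);
        }
    }
}
```

One thing the sketch makes visible: the finally block runs even when the job was redirected before being registered, so the remove must tolerate a missing key, which ConcurrentHashMap.remove does.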
That takes care of problem two, which is the instance-hour killer. As for problem one, which is how to prevent one instance {usually instance 0 and 1} from hitting peak memory and start turning toward the dark side, I am torn between two options.
On the one hand, since the call to BackendCounter.addToLiveMap(this) is already expected to throw ChangeInstanceException, I can simply check memory there:
if (((float)Runtime.getRuntime().freeMemory() / Runtime.getRuntime().totalMemory())<0.1) throw new ChangeInstanceException(getOverflowInstance());
This naive approach will simply tell any instance approaching its memory limit to send all new work elsewhere.
On the other hand, I could keep instance 0 and 1 for handling overflow {and toggle between which of the two gets new jobs to give them chances to shut down}, then send the fanout to instances 2+, which will only run until they drop to say, 10 or 15 jobs in parallel. The fanout is pretty consistent, and only takes a couple minutes, so instances 2, 3 and, at most, 4, will need to turn on, and be given time to turn off while a different instance gets hit with more load.
The only thing I'm afraid of is jobs bouncing from one instance to another, which can probably be overcome with a redirect limit in a header, beyond which ChangeInstanceException is no longer thrown.
Any thoughts or advice are greatly appreciated.