I use the java.util.ExecutorService to handle tasks, sometimes with only one worker. Not I'd like to add something like idle tasks, to preload data from the database and similar stuff while nothing is happening and the user has selected some item.
My first idea was to just add it as a task when the user selects something, because when the user starts an interaction with the selection, the data are needed and have to be loaded either way.
The problem with this approach is that when the user selects another item without doing something with the first selection, then there is this huge task in the Executor which only makes everything slower.
Any simple ideas how I could start something like that? I really don't want to build a huge management class to handle it and classify tasks or stuff like that.
So what about using PriorityBlockingQueue? Keep your tasks in that queue, give idle tasks low priority, so that they are always on the end of the queue. Implement your pool's runnables so that they simply take the highest priority task from the queue and execute it.
To be sure that executing idle tasks will be replaced by more important ones, you can implement them to be executed in short chunks and placed back in the queue after each chunk is finished. If something more important was placed in the queue in the meantime, it will be taken next, if not, idle task will be fetched once again from the queue.
Related
I'm building a system where users can set a future date(down to hours and minutes) in calendar. At that date a trigger is calling a certain task, unique for every user.
Every user can set a different date. The system will have 10k+ from the start and a user can create more than one trigger.
So assuming I have 10k users each user create on average 3 triggers => 30k triggers with 30k different dates.
All dates are saved in a database.
I'm new to quartz, can this be done in a more optimized way?
I was thinking about making a task run every minute that will get the tasks that will suppose to run in the next hour and remove them from database.
Do you have any better ideas? Did someone used quartz for a large number of triggers.
You have the schedule backed in the database. If I understand the idea - you want the quartz to load all the upcoming tasks to execute them in the future.
This is problematic approach:
Synchronization Issues: I assume that users can edit, remove and add new tasks to the database. You would have to periodically ask the database to refresh the state of the quartz jobs, remove some jobs, edit other jobs etc. This may not be trivial. The state of the program would be a long living cache which needs to be synchronised often.
Performance and scalability issues: Even if proposed solution may be ok for 30K tasks it may not be ok for 70k or 700k tasks. In your approach it's not easy to scale - adding new machine would require additional layer of synchronisation - which machine should actually execute which job (as all of them have all the tasks).
What I would propose:
Add the "stage" to the Tasks table (new, queued, running, finished, failed)
divide your solution into several components. (Initially they can run on a single machine but it will be easy to scale)
Components:
Task Finder: Executed periodically (once every few seconds). Scans the database for tasks that are "new", and due soon. Sends the tasks found to Message Queue and marks the task as "queued" in the db. Marking as "queued" has to be done carefully as there can be multiple "task finders". (As an addition it may find the tasks that have been marked as "queued" or "running" more than N minutes ago and are not "finished" nor "canceled" - probably need to re-run these)
Message Queue: Connector between Taks Finder and Task Executor.
Task Executor: Listens to the Message Queue and process the tasks that it received. Marks the tasks as "running" initially and "finished" or "failed" later on.
With this approach you can have:
multiple Task Executors on multiple machines
multiple Task Schedulers on multiple machines
even if one of the Task Schedulers or Executors will fail it will not be Single Point of Failure. Some of the tasks will be delayed but it will be picked up and run afterwards.
This may not address all the scenarios but would be a good starting point.
I don't see why you need quartz here at all. As far as I remember, quartz is best used to schedule backend internal processes, not user-defined tasks obtained from db.
Just process the trigger as it is created, save a row to your tasks table with start_date based on the trigger and every second select all incomplete tasks with start_date< sysdate. If the job is repeating, calculate next execution time and insert new task row / update previous accordingly.
As Sam pointed out there are some nice topics addressing the same problem:
Quartz Performance
Quartz FAQ
In a system like the mentioned it should not a problem mostly to handle this amount of triggers. But according to my experiance it is a better way to create something like a "JobChecker". If you enable your users to create own triggers it could really break Quartz in some cases. For example if 5000 user creates an event to the exact same time, Quartz will have a hard time to handle them correctly. (It is not likely a situation that will occur often, but it is possible as your specification does not excludes it.) Quartz has difficulties only when a lot of triggers should be fired at the same time.
My recommendation to this problem is to create one job that is running in every hour/minute etc and that should handle every user set events. This way is simmilar to a cron job in bash. With this kind of processing your system will be pretty stable even if the number of "triggers" increases dramatically. Basically your line of thought is correct if you thrive for scalability.
I'm implementing an event listener, querying new items to process, by creationTime in ascending order.
I deal with multithreading.
My current workflow is:
Querying a batch of items (let's say 50) containing the "New" flag.
Looping through those items, and for each item, updating its status to "InProgress".
For each item, still within the loop, start the corresponding process, detached in a thread (using Akka Actors in my case).
As soon as a process is fully completed, update the item's flag to "Consumed".
I set a polling frequency of 3 seconds, that obviously may involve query of new items BEFORE the current retrieved items are being fully processed (due to multithreading), with the flag "Consumed" set.
Only the querying is single-threaded, otherwise it would lead to retrieve duplicates.
I wonder if the step 2 is essential: updating each item with "InProgress" flag.
Indeed, it would slow down the whole.
I thought about skipping this step but to ensure that futures queries don't retrieve items that are currently being processed (let's imagine a very long computation), I would NOT start the next retrieval query as soon as the whole batch is processed.
Basically, my query step would wait for workers to finish their current jobs.
Obviously, this would make sense if the kind of jobs are similar in computation time.
What is a good practice of polling database while dealing with multithreaded computation?
I was going through the javadocs and source code for drainTo method present in BlockingQueue interface and LinkedBlockingQueue implementation of the same. My understanding of this method after looking at the source (JDK7), is that the calling thread actually submits a Collection and afterwards acquires a takeLock(), which blocks other consumers. After that till the count of max elements, the items of the nodes are removed from the queue and put in a collection.
What I could appreciate is that it saves the threads from acquiring locks again and again, but pardon my limited knowledge, I could not appreciate the need for the same in real world examples. Could some one please share some real world examples where drainTo behavior is observable ?
Well, I used it in real life code and it looked quite natural to me: a background database thread creates items and puts them into a queue in a loop until either the end of data is reached or a stop signal is detected. On the first item a UI updater is launched using EventQueue.invokeLater. Due to the asynchronous nature and some overhead in this invokeLater mechanism, it will take some time until the UI updater comes to the point where it queries the queue and most likely more than one item may be available.
So it will use drainTo to get all items that are available at this specific point and update a ListDataModel which produces a single event for the added interval. The next update can be triggered using another invokeLater or using a Timer. So drainTo has the semantic of “gimme all items arrived since the last call” here.
On the other hand, polling the queue for single items could lead to a situation that producer and consumer are blocking each other for a short time and every time the consumer asks for a new item, another item is available due to the fact that the consumer has been blocked just long enough for the producer to create and put a new item. So you have to implement your own time limit to avoid blocking the UI thread too long in this case. Using drainTo once and release the event handling thread afterwards is much easier.
I have many threads performing different operations on object and when nearly 50% of the task finished then I want to serialize everything(might be I want to shut down my machine ).
When I come back then I want to start from the point where I had left.
How can we achieve?
This is like saving state of objects of any game while playing.
Normally we save the state of the object and retrieve back. But here we are storing its process's count/state.
For example:
I am having a thread which is creating salary excel sheet for 50 thousand employee.
Other thread is creating appraisal letters for same 50 thousand employee.
Another thread is writing "Happy New Year" e-mail to 50 thousand employee.
so imagine multiple operations.
Now I want to shut down in between 50% of task finishes. say 25-30 thousand employee salary excel-sheet have been written and appraisal letters done for 25-30 thousand and so on.
When I will come back next day then I want to start the process from where I had left.
This is like resume.
I'm not sure if this might help, but you can achieve this if the threads communicate via in-memory queues.
To serialize the whole application, what you need to do is to disable the consumption of the queues, and when all the threads are idle you'll reach a "safe-point" where you can serialize the whole state. You'll need to keep track of all the threads you spawn, to know if they are in are idle.
You might be able to do this with another technology (maybe a java agent?) that freezes the JVM and allows you to dump the whole state, but I don't know if this exists.
well its not much different than saving state of object.
just maintain separate queues for different kind of inputs. and on every launch (1st launch or relaunch) check those queues, if not empty resume your 'stopped process' by starting new process but with remaining data.
say for ex. an app is sending messages, and u quit the app with 10 msg remaining. Have a global queue, which the app's senderMethod will check on every launch. so in this case it will have 10msg in pending queue, so it will continue sending remaining msgs.
Edit:
basically, for all resumable process' say pr1, pr2....prN, maintain queue of inputs, say q1, q2..... qN. queue should remove processed elements, to contain only pending inputs. as soon as u suspend system. store these queues, and on relaunching restore them. have a common routine say resumeOperation, which will call all resumable process (pr1, pr2....prN). So it will trigger the execution of methods with non-0 queues. which in tern replicate resuming behavior.
Java provides the java.io.Serializable interface to indicate serialization support in classes.
You don't provide much information about the task, so it's difficult to give an answer.
One way to think about a task is in terms of a general algorithm which can split in several steps. Each of these steps in turn are tasks themselves, so you should see a pattern here.
By cutting down each algorithms in small pieces until you cannot divide further you get a pretty good idea of where your task can be interrupted and recovered later.
The result of a task can be:
a success: the task returns a value of the expected type
a failure: somehow, something didn't turn right while doing computation
an interrupted computation: the work wasn't finished, but it may be resumed later, and the return value is the state of the task
(Note that the later case could be considered a subcase of a failure, it's up to you to organize your protocol as you see fit).
Depending on how you generate the interruption event (will it be a message passed from the main thread to the worker threads? Will it be an exception?), that event will have to bubble within the task tree, and trigger each task to evaluate if its work can be resumed or not, and then provide a serialized version of itself to the larger task containing it.
I don't think serialization is the correct approach to this problem. What you want is persistent queues, which you remove an item from when you've processed it. Every time you start the program you just start processing the queue from the beginning. There are numerous ways of implementing a persistent queue, but a database comes to mind given the scale of your operations.
i have the following situation:
Read data from database
do work "calculation"
write result to database
I have a thread that reads from the database and puts the generated objects into a BlockingQueue. These objects are extremely heavy weight hence the queue to limit amount of objects in memory.
A multiple threads take objects from the Queue, performs work and put the results in a second queue.
The final thread takes results from second queue and saves result to database.
The problem is how to prevent deadlocks, eg. the "calculation threads" need to know when no more objects will be put into the queue.
Currently I achieve this by passing a references of the threads (callable) to each other and checking thread.isDone() before a poll or offer and then if the element is null. I also check the size of the queue, as long as there are elements in it, the must be consumed. Using take or put leads to deadlocks.
Is there a simpler way to achieve this?
One of the ways to accomplish would be to put a "dummy" or "poison" message as the last message on the queue when you are sure that no more tasks are going to arrive on the queue.. for example after putting the message related to the last row of the db query. So the producer puts a dummy message on the queue, the consumer on receiving this dummy message knows that no more meaningful work is expected in this batch.
Maybe you should take a look at CompletionService
It is designed to combine executor and a queue functionality in one.
Tasks which completed execution will be available from the completions service via
completionServiceInstance.take()
You can then again use another executor for 3. i.e. fill DB with the results, which you will feed with the results taken from the completionServiceInstance.