Concurrent periodic task running - Java

I'm trying to find the best solution for running periodic tasks in parallel. Requirements:
Java (Spring w/o Hibernate).
Tasks are managed by a front-end application and stored in a MySQL DB (fields: id, frequency (in seconds), <other attributes/settings about task scenario>). -- Something like crontab, only with a frequency (seconds) field instead of minutes/hours/days/months/days of week.
I'm thinking about:
TaskImporter thread polling Tasks from DB (via TasksDAO.findToProcess()) and submitting them to queue.
java.util.concurrent.ThreadPoolExecutor running tasks (from queue) in parallel.
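The importer/executor split above can be sketched roughly as follows (a minimal sketch: TasksDAO is stood in for by a Supplier so the wiring is visible; the real DAO would do the SELECT, and the pool sizes are illustrative):

```java
import java.util.List;
import java.util.concurrent.*;
import java.util.function.Supplier;

// One polling component feeding a ThreadPoolExecutor, as described above.
public class TaskImporter {
    private final ExecutorService pool = new ThreadPoolExecutor(
            4, 8, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<>());

    // One polling pass: fetch due tasks (via the DAO), hand each to the pool.
    public int pollOnce(Supplier<List<Runnable>> findToProcess) {
        List<Runnable> due = findToProcess.get();
        due.forEach(pool::submit);
        return due.size();
    }

    public void shutdown() throws InterruptedException {
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }
}
```

In the real application, `pollOnce` would run in a loop on the TaskImporter thread with a short sleep between passes.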
The most tricky part of this architecture is TasksDAO.findToProcess():
How do I know which tasks are due to run right now?
I'm thinking about a next_run field on Task, which would be populated (UPDATE tasks SET next_run = TIMESTAMPADD(SECOND, frequency, NOW()) WHERE id = ?) straight after selection (SELECT * FROM tasks WHERE next_run IS NULL OR next_run <= NOW() FOR UPDATE). The problem: I'd have to run lots of UPDATEs for lots of SELECTed tasks (one UPDATE per task, or a bulk UPDATE), plus there are concurrency problems (see below).
Ability to run several concurrent processing applications (cloud), using/polling the same DB.
All of the concurrent processing applications must run a given task only once. I'd have to block the SELECTs of all other apps until app A finishes updating next_run for all of its selected tasks. The problem: locking the production table (used by the front-end app) would slow things down. A table mirror?
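One way to avoid holding a lock across the SELECT and the per-row UPDATEs is an atomic "claim" step: each worker first tags a batch of due rows with its own token in a single UPDATE, then reads back only the rows it claimed. A rough sketch under assumed table/column names (claimed_by is hypothetical), with a pure helper mirroring the next_run arithmetic:

```java
import java.time.Instant;

// Claim-then-fetch pattern: no lock spans two statements, so several
// worker apps can poll the same table without blocking each other.
public class TaskClaimer {

    // Step 1: atomically claim up to 100 due tasks and advance next_run.
    // (MySQL's TIMESTAMPADD takes the interval before the datetime.)
    static final String CLAIM_SQL =
        "UPDATE tasks SET claimed_by = ?, " +
        "       next_run = TIMESTAMPADD(SECOND, frequency, NOW()) " +
        " WHERE claimed_by IS NULL " +
        "   AND (next_run IS NULL OR next_run <= NOW()) " +
        " LIMIT 100";

    // Step 2: fetch only what this worker claimed (then clear claimed_by
    // once the tasks have been submitted to the executor).
    static final String FETCH_SQL =
        "SELECT * FROM tasks WHERE claimed_by = ?";

    // Pure Java equivalent of the TIMESTAMPADD expression above.
    public static Instant nextRun(Instant now, int frequencySeconds) {
        return now.plusSeconds(frequencySeconds);
    }
}
```

Each worker would pass a unique token (e.g. hostname + PID) as the `claimed_by` parameter, so two workers can never fetch the same task.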
I love simple and clean solutions and believe there's a better way to implement this processing application. Do you see any? :)
Thanks in advance.
EDIT: Using Quartz as a scheduler/executor is not an option because of syncing latency. The front-end app is not in Java, so it cannot interact with Quartz except via a Webservice-oriented solution, which is not an option either, because the front-end app has more data associated with the previously mentioned Tasks and needs direct read/write access to all data in the DB.

I would suggest using a scheduling API like Quartz rather than a home-grown implementation.
It provides a lot of APIs and convenience for implementing your logic, and you will also have better control over jobs.
http://www.quartz-scheduler.org/
http://www.quartz-scheduler.org/docs/tutorial/index.html

Related

Legacy purge job hangs due to multi cascade

We initially missed migrating a legacy scheduled "purge" job (Java-based) to the cloud. Now that we have done so, the job always hangs, due to its original design of cascading deletes (or even regular ones) across 15 or so tables for each user identity.
This job runs well for a few users, but because of the initial miss we ended up with thousands of users that need purging (with associated records in multiple tables). Hence the first run causes the job to run for hours, and it finally hangs.
A few approaches were tried (creating indexes, using a chunk size of 50, etc.), but none of them has worked so far.
Because this job works well for a few users (the likely scenario going forward), we are considering creating some kind of script/mechanism to delete users in small batches (of, say, 5), iteratively, and having it executed by a DBA. Once this is complete (all applicable users are purged), we would re-enable the legacy purge job with its original design, which should work for deleting the few users accumulating going forward.
Appreciate any suggestions/thoughts.
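The batching idea described above might be sketched like this (a hypothetical sketch only: the actual JDBC deletes per table are elided, and the batch size is illustrative). The key point is that each small batch is committed on its own, so a failure only rolls back a handful of users rather than hours of work:

```java
import java.util.ArrayList;
import java.util.List;

// Purge users in small committed batches so no single transaction has to
// cascade-delete thousands of identities at once.
public class BatchPurge {

    // Split the backlog of user ids into fixed-size batches (e.g. 5).
    public static List<List<Long>> batches(List<Long> userIds, int size) {
        List<List<Long>> out = new ArrayList<>();
        for (int i = 0; i < userIds.size(); i += size) {
            out.add(userIds.subList(i, Math.min(i + size, userIds.size())));
        }
        return out;
    }

    // For each batch: delete child rows table by table, then the user rows,
    // then commit, so progress survives a hang or restart.
    // (The per-table DELETE statements are omitted here.)
}
```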

Running a Scheduled Job Only Once Across Multiple Instances

I have a scheduled job that runs at the end of every month. After running, it saves some data to the database.
When I scale the app (for example to 2 instances), both instances run the scheduled job, both save the data, and at the end of the day my database contains duplicate data.
So I want the scheduled job to run only once, regardless of the number of instances in the cloud.
In my project, I have maintained a database table to hold a lock for each job which needs to be executed only once in the cluster.
When a job gets triggered, it first tries to acquire the lock from the database; only if it gets the lock is it executed. If it fails to acquire the lock, the job is not executed.
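The acquire-then-run pattern can be illustrated with an in-memory analogue (an assumed sketch, not the poster's actual code): the first instance to register the lock key wins, the others skip the job. With a real database you would replace the map with an INSERT into a lock table (or SELECT ... FOR UPDATE) inside a transaction:

```java
import java.util.concurrent.ConcurrentHashMap;

// In-memory stand-in for the database lock table described above.
public class JobLock {
    private final ConcurrentHashMap<String, String> locks = new ConcurrentHashMap<>();

    // Returns true only for the caller that acquired the lock.
    public boolean tryAcquire(String jobKey, String instanceId) {
        return locks.putIfAbsent(jobKey, instanceId) == null;
    }

    // Release after the job finishes (or after a timeout, for crashed holders).
    public void release(String jobKey) {
        locks.remove(jobKey);
    }
}
```

A real implementation also needs a lock expiry, so a crashed instance does not hold the lock forever.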
You can also look at the clustering feature of Quartz job.
http://www.quartz-scheduler.org/documentation/2.4.0-SNAPSHOT/introduction.html
I agree with the comments. If you can utilize a scheduler that's going to be your best, most flexible option. In addition, a scheduler should be executing your job as a "task" on Cloud Foundry. The task will only run on one instance, so you won't need to worry about how many instances your application is using (the two are separate in that regard).
If you're using Pivotal Cloud Foundry/Tanzu Cloud Foundry there is a scheduler you can ask your operations team to install. I don't know about other variants of CF, but I assume there are other schedulers.
https://network.pivotal.io/products/p-scheduler/
If using a scheduler is not an option then this is a concern you'll need to handle in your application. The solution of using a shared lock is a good one, but there is also a little trick you can do on Cloud Foundry that I feel is a little simpler.
When your application runs, certain environment variables are set by the platform. There is one called INSTANCE_INDEX which has a number indicating the instance on which the app is running. It's zero-based, so your first app instance will be running on instance zero, the second instance one, etc.
In your code, simply look at the instance index and see if it's zero. If the index is non-zero, have your task end without doing anything. If it's zero, then let the task proceed and do its work. The task will execute on every application instance, but it will only do work on the first instance. It's an easy way to guarantee something like a database migration or background job only runs once.
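A minimal sketch of that guard (assuming the INSTANCE_INDEX environment variable described above; newer Cloud Foundry stacks also expose the same value as CF_INSTANCE_INDEX):

```java
// Only the app instance with index zero does the work.
public class InstanceGuard {

    // Pure helper: should this instance run the job?
    public static boolean isPrimary(String instanceIndex) {
        // Treat a missing variable (e.g. running locally) as instance zero.
        return instanceIndex == null || "0".equals(instanceIndex);
    }

    public static boolean isPrimary() {
        return isPrimary(System.getenv("INSTANCE_INDEX"));
    }
}
```

The scheduled method would then start with `if (!InstanceGuard.isPrimary()) return;`.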
One final option would be to use multiple processes. This is a feature of Cloud Foundry that enables you to have different processes running, like your web process and a background worker process.
https://docs.cloudfoundry.org/devguide/multiple-processes.html
The interesting thing about this feature is that you can scale the different processes independently of each other. Thus you could have as many web processes running, but only one background worker which would guarantee that your background process only runs once.
That said, the downside of this approach is that you end up with separate containers for each process, and the background process would need to keep running. The foundation expects it to be a long-running process, not a finite-duration batch job. You could get around this by wrapping your periodic task in a loop or something else that keeps the process running forever.
I wouldn't really recommend this option but I wanted to throw it out there just in case.
You can use the @SnapLock annotation on your method, which guarantees that the task only runs once. See the documentation in this repo: https://github.com/luismpcosta/snap-scheduler
Example:
Import the Maven dependency:
<dependency>
  <groupId>io.opensw.scheduler</groupId>
  <artifactId>snap-scheduler-core</artifactId>
  <version>0.3.0</version>
</dependency>
After importing the Maven dependency, you'll need to create the required tables.
Finally, see below how to annotate a method with @SnapLock to guarantee that it runs only once:
import io.opensw.scheduler.core.annotations.SnapLock;
...
@SnapLock(key = "UNIQUE_TASK_KEY", time = 60)
@Scheduled(fixedRate = 30000)
public void reportCurrentTime() {
    ...
}
With this approach you also get an audit trail of task executions.

How to use GAE Appstats to monitor the task that is running in task queue?

I have a use case where I need to monitor the number of datastore operations and also the time consumed by a particular block of Java code that is running in the task queue.
As of now, I am using Appstats and it is not displaying the operations performed by the task (Java Code) that is running in the Task Queue. Also, I need to know the execution time of particular code blocks in the same code in the task queue by using some kind of monitors.
Please suggest and advise if I could use some other tools for the above requirement.

How do I create a short-lived, single task Google Compute Engine instance?

Question: How do I create a lightweight on-demand instance, preconfigured with Java 8 and my code, that pulls a task from a task queue, executes the memory-intensive task, and shuts itself down? (on-demand, high-memory, medium-CPU, single-task executors)
History: I was successfully using Google App Engine Task Queue in Java for "bursty" processing of relatively rare events - maybe once a week someone would submit a form, the form creates ~10 tasks, the system would chew up some memory and CPU cycles thinking about the tasks for a few minutes, save the results, and the webpage would be polling the backend for completion. It worked great within Google App Engine - Auto scaling would remove all idle instances, Task Queues would handle getting the processing done, I'd make sure not to overload things by setting the max-concurrent-requests=1, and life was good!
But then my tasks got too memory intensive for instance-class: F4_1G 😢 I'd love to pick something with more memory, but that isn't an option. So I need to figure something out.
I think my best bet is to spin up a generic instance using the API com.google.api.services.compute.model.Instance but get stopped there. I'm so spoiled with how easy the Task Queue was to build that I'd hate to get lost in the weeds just to get the higher memory instance - I don't need a cluster, and don't need any sort of reliability!
Is this a docker container thing?
Is it going to be hard auth-wise to pull from the Pull Queue outside of GAE?
Is it crazy to spin up/down an instance (container?) for each task if a task is ~10 minutes?
I found some similar questions, but no answers that quite fit:
How to Consume App Engine Task Queue Via Compute Engine Instance
How do I integrate google app engine, taks queue and google compute engine?
I would have a read about GAE modules. These can be set to use basic scaling, so an instance gets created on demand and then expires some time later, as configured by you in your appengine-web.xml using something such as:
<basic-scaling>
<max-instances>2</max-instances>
<idle-timeout>5m</idle-timeout>
</basic-scaling>
If the module processes requests from a task queue, then it has 10 minutes to get its job done, which is probably ample for many tasks.

How to deal with a search task which takes more time than usual in Spring 3.0

I am looking for ideas on how to deal with a search-related task which takes more time than usual (in human terms, more than 3 seconds).
I have to query multiple sources, sift through information for the first time and then cache it in the DB for later quick return.
The context of the project is J2EE, Spring and Hibernate (on top of SpringROO)
The possible solutions I could think of
-On the webpage let the user know that task is running in background, if possible give them a queue number or waiting time. Refresh the page via a controller which basically checks if the task is done, then when its done (ie the search result is prepared and stored in DB) then just forward to a new controller and fetch the result from the DB
-The background tasks could be done with Spring Task executor. I am not sure if it is easy to give a measure of how long it would take. It would probably be a bad idea to let all the search terms run concurrently, so some sort of pooling will be a good idea.
-Another option to use background tasks is to use JMS. This is perhaps a solution with more control (retries etc)
-Spring batch also comes to mind
Please suggest how you would do it. I would greatly appreciate a semi-detailed+ description. The sources of info can be many and may have to be queried sequentially, so it can take up to 4-5 minutes for the results to form. It is also possible that such tasks run automatically in the background without user intervention (i.e., to update from the sources).
From a user perspective, I use AJAX. The default web page contains some kind of "Busy" indicator. When the AJAX request completes, the busy indicator is replaced with the result.
In the background, request handlers are already multi-threaded, so you can simply format the default result, close and flush the output, and do the processing in the current thread. You should put something in the session or DB to make sure that no one can start the same heavy process a second time.
Running task pools in a web container is possible but there are some caveats, especially how to synchronize startup/shutdown: Do you want your web server to "hang" during shutdown while some thread is busy collecting your results? Also the additional load should be considered. It might be better to use JMS and offload the strain to a second server dedicated to build the search results.
Such a system will scale much better if your searches start to become a burden. It also makes it trivial to automate the process by writing a small program which posts searches in the JMS queue.
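The executor-based variant of the pattern above might look like this (an illustrative sketch; names and pool size are assumptions). The request handler submits the heavy search to a bounded pool and returns immediately; the AJAX endpoint polls by id until the Future is done:

```java
import java.util.concurrent.*;

// Bounded pool of background searches, polled by id from an AJAX endpoint.
public class SearchService {
    private final ExecutorService pool = Executors.newFixedThreadPool(4);
    private final ConcurrentHashMap<String, Future<String>> running = new ConcurrentHashMap<>();

    // Start the search unless one with the same id is already in flight
    // (this is the "no second heavy process" guard mentioned above).
    public void start(String searchId, Callable<String> search) {
        running.computeIfAbsent(searchId, id -> pool.submit(search));
    }

    // Called by the AJAX poller; null means "still busy".
    public String resultIfDone(String searchId) throws Exception {
        Future<String> f = running.get(searchId);
        if (f == null || !f.isDone()) return null;
        running.remove(searchId);
        return f.get();
    }

    public void shutdown() {
        pool.shutdown();
    }
}
```

Note the shutdown concern from the answer above: the servlet context listener should call `shutdown()` so the container does not hang on undeploy.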
I've solved this problem in the past doing something like this:
When the user initiates a long running task, I open a popup window that displays the task status. The task status includes a name and estimated time to complete
This task is also stored in my "app" (this can be stored in the DB, session, or application context), so the user can continue doing other things on my web app while having an easy way to navigate back to the running task.
I stored my tasks in a DB, so I could manage what happens on startup and shutdown of the web app. This requires storing the progress of the task in the DB.
The tricky part is displaying results to the user. If you use the method I've described, you'll need to store results in the DB, session, or application context.
This system I've described is pretty heavyweight, and may be overkill for your application.
In response to the comment
"so what do you use to do the background computing. I have asked this before"
I use java.util.concurrent. A lot of this depends on the nature of your application. Is the task (or steps in the task) idempotent? How critical is it that it run to completion? If you have a non-idempotent task that must run to completion, I would say you generally must record every piece of work you do, and you must do that piece of work within a transaction. For example, if one of your tasks is to email a list of people (this is definitely not idempotent) you would do the emailing in a "transaction" (I'm using the term lightly here) and store your progress after each transaction is complete.
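The "record every piece of work" idea can be sketched as follows (a minimal in-memory analogue; in a real app the progress store would be a DB table written in the same transaction as the work itself). Each unit, e.g. one email recipient, is marked done as it completes, so a restarted run skips already-finished units instead of redoing them:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// A non-idempotent task made safely resumable by tracking completed units.
public class ResumableJob {
    private final Set<String> done; // in production: a DB-backed progress table

    public ResumableJob(Set<String> progressStore) {
        this.done = progressStore;
    }

    // Returns the units actually processed this run.
    public List<String> run(List<String> units) {
        List<String> processed = new ArrayList<>();
        for (String u : units) {
            if (done.contains(u)) continue; // already handled in an earlier run
            // ... do the non-idempotent work for u (e.g. send the email) ...
            done.add(u);                    // "commit" progress for this unit
            processed.add(u);
        }
        return processed;
    }
}
```

Re-running the job after a crash then only touches the units whose progress record is missing.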
