Launch spring batch jobs from control-m - java

I developed jobs in Spring Batch to replace a data loading process that was previously done using bash scripts.
My company's scheduler of choice is control-m. The old bash scripts were triggered from control-m on file arrival using a file watcher.
For reasons beyond my control, we still need to use control-m. Using Spring Boot or any other framework tied to a webserver is not a possibility.
The safe approach seems to be to package the Spring Batch application as a jar and trigger the job from Control-M using "java -jar", but this doesn't seem like the right way considering we have 20+ jobs.
I was wondering if it's possible to start the app once (like a daemon) and communicate with it using JMS or some other approach. That way we wouldn't need to spawn multiple JVMs (considering jobs might run simultaneously).
I'm open to different solutions. Feel free to teach me the best way to solve this use case.

The safe approach seems to be to package the Spring Batch application as a jar and trigger the job from Control-M using "java -jar", but this doesn't seem like the right way considering we have 20+ jobs.
IMO, running jobs on demand is the way to go, because it is the most efficient way to use resources. Having a JVM running 24/7 and making it run a batch job once in a while is a waste of resources, as the JVM will be idle between scheduled runs. If your concern is the packaging side of things, you can package all jobs in a single jar and use the Spring Boot property spring.batch.job.names at launch time to specify which job to run.
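For the single-jar route, a plain Spring Batch launcher (no Spring Boot, matching the constraint above) can pick the job to run from a command-line argument. This is only a sketch; `BatchConfig` and the job bean name are assumptions:

```java
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.context.annotation.AnnotationConfigApplicationContext;

// Hypothetical single entry point for all 20+ jobs: Control-M runs
// "java -jar batch.jar <jobBeanName>" and this main resolves the job bean.
public class JobRunnerMain {
    public static void main(String[] args) throws Exception {
        int exitCode;
        try (AnnotationConfigApplicationContext ctx =
                 new AnnotationConfigApplicationContext(BatchConfig.class)) { // assumed @Configuration class
            JobLauncher launcher = ctx.getBean(JobLauncher.class);
            Job job = ctx.getBean(args[0], Job.class); // e.g. "importCustomersJob"
            JobExecution execution = launcher.run(job, new JobParametersBuilder()
                    .addLong("run.id", System.currentTimeMillis()) // unique parameters per run
                    .toJobParameters());
            exitCode = execution.getStatus().isUnsuccessful() ? 1 : 0;
        }
        System.exit(exitCode); // Control-M sees success/failure via the exit code
    }
}
```

With Spring Boot, the equivalent would be `java -jar batch.jar --spring.batch.job.names=importCustomersJob` with no custom main at all.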
I was wondering if it's possible to trigger the app once (like a deamon) and communicate with it using JMS or any other approach. In this way we wouldn't need to spawn multiple jvms (considering jobs might run simultaneously).
I would recommend exposing a REST endpoint in your JVM and implementing a controller that launches batch jobs on demand. You can find an example in the Running Jobs from within a Web Container section. In this case, the job name and its parameters could be passed in as request parameters.
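A minimal sketch of such a controller, assuming Spring MVC plus a `JobRegistry` populated with your jobs; the path and parameter names are illustrative:

```java
import java.util.Map;
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.configuration.JobRegistry;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.web.bind.annotation.*;

@RestController
public class JobLaunchController {

    private final JobLauncher jobLauncher;
    private final JobRegistry jobRegistry;

    public JobLaunchController(JobLauncher jobLauncher, JobRegistry jobRegistry) {
        this.jobLauncher = jobLauncher;
        this.jobRegistry = jobRegistry;
    }

    // e.g. POST /jobs/importCustomersJob?input.file=/data/customers.csv
    @PostMapping("/jobs/{name}")
    public String launch(@PathVariable String name,
                         @RequestParam Map<String, String> params) throws Exception {
        Job job = jobRegistry.getJob(name);
        JobParametersBuilder builder = new JobParametersBuilder();
        params.forEach(builder::addString); // request params become job parameters
        JobExecution execution = jobLauncher.run(job, builder.toJobParameters());
        return execution.getStatus().toString();
    }
}
```

A Control-M job is then reduced to a single HTTP call against the long-running JVM.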
Another way is to use a combination of Spring Batch and Spring Integration to launch jobs using JMS requests (JobLaunchRequest). This approach is explained in detail, with code examples, in the Launching Batch Jobs through Messages section.
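A sketch of the JMS route, assuming `spring-batch-integration` and `spring-integration-jms` are on the classpath; the queue name is an assumption, and on Spring Integration 5.x you would use `IntegrationFlows.from(...)` and `javax.jms` instead:

```java
import jakarta.jms.ConnectionFactory;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.batch.integration.launch.JobLaunchingGateway;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.integration.dsl.IntegrationFlow;
import org.springframework.integration.jms.dsl.Jms;

@Configuration
public class JobLaunchFlowConfig {

    // Listens on a queue for JobLaunchRequest payloads (job + parameters)
    // and hands them to the JobLauncher; the resulting JobExecution reply
    // is discarded via nullChannel.
    @Bean
    public IntegrationFlow jobLaunchFlow(ConnectionFactory connectionFactory,
                                         JobLauncher jobLauncher) {
        return IntegrationFlow
                .from(Jms.messageDrivenChannelAdapter(connectionFactory)
                        .destination("batch.job.requests")) // assumed queue name
                .handle(new JobLaunchingGateway(jobLauncher))
                .channel("nullChannel")
                .get();
    }
}
```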

In addition to the helpful answer from Mahmoud, Control-M isn't great with daemons. Sure, you can launch a daemon, but anything running for a substantial length of time (i.e. several weeks and beyond) is prone to error, and you can often end up with daemons that are still running but that Control-M is no longer "aware" of (e.g. if the system has issues that cause the Control-M Agent to assume the daemon job has failed, it then launches another one).
When I had no other methods available I used to add daemons as batch jobs in Control-M but there was an overhead in additional jobs that checked for multiple daemons, did the stop/starts and various housekeeping tasks. Best avoided if possible.

Related

How to submit multiple Spark applications in parallel without spawning separate JVMs?

The problem is that you need to launch a separate JVM to create a separate session with a different amount of RAM per job.
How can I submit a few Spark applications simultaneously without manually spawning separate JVMs?
My app runs on a single server, within a single JVM. That appears to be a problem with Spark's session-per-JVM paradigm. The Spark paradigm says:
1 JVM => 1 app => 1 session => 1 context => 1 RAM/executors/cores config
I'd like to have different configurations per Spark application without launching extra JVMs manually. Configurations:
spark.executor.cores
spark.executor.memory
spark.dynamicAllocation.maxExecutors
spark.default.parallelism
Use case
You have started a long-running job, say 4-5 hours to complete. The job runs within a session with configs spark.executor.memory=28GB, spark.executor.cores=2. Now you want to launch a 5-10 second job on user demand, without waiting 4-5 hours. This tiny job needs 1GB of RAM. What would you do? Submit the tiny job on behalf of the long-running job's session? Then it will claim 28GB.
What I've found
Spark allows you to configure the number of cores and executors only at the session level. Spark scheduler pools let you slice and dice only the number of cores, not RAM or executors, right?
Spark Job Server. But it doesn't support Spark newer than 2.0, so it's not an option for me, although it does actually solve the problem for versions older than 2.0. The Spark JobServer feature list says "Separate JVM per SparkContext for isolation (EXPERIMENTAL)", which means spawning a new JVM per context.
Mesos fine-grained mode is deprecated
This hack, but it's too risky to use it in production.
Hidden Apache Spark REST API for job submission, read this and this. There is definitely a way to specify executor memory and cores there, but still, what is the behavior when submitting two jobs with different configs? As I understand it, this is a Java REST client for it.
Livy. Not familiar with it, but it looks like they have a Java API only for batch submission, which is not an option for me.
With a use case, this is much clearer now. There are two possible solutions:
If you require shared data between those jobs, use the FAIR scheduler and a (REST) frontend (as SparkJobServer, Livy, etc. do). You don't need to use SparkJobServer either; it should be relatively easy to code if you have a fixed scope. I've seen projects go in that direction. All you need is an event loop and a way to translate your incoming queries into Spark queries. In a way, I would expect there to be demand for a library to cover this use case, since it's pretty much always the first thing you have to build when you work on a Spark-based application/framework.
In this case, you can size your executors according to your hardware, and Spark will manage the scheduling of your jobs. With YARN's dynamic resource allocation, YARN will also free resources (kill executors) should your framework/app be idle.
For more information, read here: http://spark.apache.org/docs/latest/job-scheduling.html
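As a sketch of the shared-driver idea in Java: one SparkSession with FAIR scheduling, where each query is submitted from its own thread and tagged with a scheduler pool (the pool names here are assumptions, to be defined in fairscheduler.xml). Note this divides cores between concurrent jobs, while executor memory remains a single config for the whole application:

```java
import org.apache.spark.sql.SparkSession;

public class SharedDriver {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("shared-driver")
                .config("spark.scheduler.mode", "FAIR")
                .getOrCreate();

        // The 4-5 hour job, tagged with an assumed "batch" pool
        new Thread(() -> {
            spark.sparkContext().setLocalProperty("spark.scheduler.pool", "batch");
            // ... long-running query ...
        }).start();

        // The 5-10 second user query, tagged with an assumed "interactive" pool
        new Thread(() -> {
            spark.sparkContext().setLocalProperty("spark.scheduler.pool", "interactive");
            // ... short query ...
        }).start();
    }
}
```

With this setup the short query is scheduled fairly alongside the long one instead of waiting behind it.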
If you don't need shared data, use YARN (or another resource manager) to assign resources in a fair manner to both jobs. YARN has a fair scheduling mode, and you can set the resource demands per application. If you think this suits you, but you need shared data, then you might want to think about using Hive or Alluxio to provide a data interface. In this case you would run two spark-submits, and maintain multiple drivers in the cluster. Building additional automation around spark-submit can help you make this less annoying and more transparent to end users. This approach is also high-latency, since resource allocation and SparkSession initialization take up a more or less constant amount of time.
tl;dr I'd say it's not possible.
A Spark application is at least one JVM and it's at spark-submit time when you specify the requirements of the single JVM (or a bunch of JVMs that act like executors).
If however you want to have different JVM configurations without launching separate JVMs, that does not seem possible (even outside Spark but assuming JVM is in use).

Quartz Scheduler - to run in Tomcat or application jar?

We have a web application that receives incoming data via RESTful web services running on Jersey/Tomcat/Apache/PostgreSQL. Separately from this web-service application, we have a number of repeating and scheduled tasks that need to be carried out. For example, purging different types of data at different intervals, pulling data from external systems on varying schedules, and generating reports on specified days and times.
So, after reading up on Quartz Scheduler, I see that it seems like a great fit.
My question is: should I design my Quartz-based scheduling application to run in Tomcat (via QuartzInitializerListener), or build it into a standalone application to run as a Linux daemon (e.g., via Apache Commons Daemon or the Tanuki Java Service Wrapper)?
On the one hand, it strikes me as counterintuitive to use Tomcat to host an application that is not geared towards receiving http calls. On the other hand, I haven't used Apache Commons Daemon or the Java Service Wrapper before, so maybe running inside Tomcat is the path of least resistance.
Are there any significant benefits or dangers with either approach that I should be aware of? Our core modules already take care of data access, logging, etc., so I don't see that those services are much of a factor either way.
Our scheduling will be data driven, so our Quartz-based scheduler will read the relevant data from PostgreSQL. However, if we run the scheduling application within Tomcat, is it possible/reasonable to send messages to our application via http calls to Tomcat? Finally, fwiw, since our jobs will be driven by our existing application data, I don't see any need for the Quartz JDBCJobStore.
To run a standalone Java application as a Linux daemon, simply end the java command with an ampersand (&) so that it runs in the background, and put it in an Upstart script, for example.
As for the design: in this case I would go for whatever is easier to maintain. And it looks like running an app in Tomcat is already familiar. One benefit that comes to mind is that configuration files (for the database for example) can be shared/re-used so that only one set of configuration files needs to be maintained.
However, if you think the scheduled tasks can have a significant impact on resource usage, then you might want to run the tasks on a separate (virtual) machine. Since the timing of the tasks is data driven, it is hard to predict the exact load; e.g. it could happen that all the different tasks are executed at the same time (worst case/highest load scenario).

Also consider the complexity of the software for the scheduled tasks and the related risk of nasty bugs: if you think there is a low chance of nasty bugs, then running the tasks in Tomcat next to the web service is a good option; if not, run the tasks as a separate application.

Lastly, consider the infrastructure in general: production-line systems (providing (a continuous flow of) data processing critical to business) should be separate from non-production-line systems. E.g. if the reports are created an hour later than usual and the business is largely unaffected, then this is non-production-line. But if the web service goes down and business is (immediately) affected, then this is production-line. Purging data and pulling updates is a bit gray: it depends on what happens if these tasks are not performed, or performed later.

Open source Java Job Scheduler with: remoting, load balancing, failover, dependency DAG?

I'm looking for an open source Java job scheduler that allows submitting different kinds of jobs (not only FLOP-intensive ones) and distributes them across many machines. It should also monitor the jobs and retry them on different nodes should any job fail or a slave node crash. I would also appreciate load balancing similar to OpenMP or MPI. Ideally you should be able to pass in a job dependency graph, and the jobs would be processed in topological order, with parallelization where possible.
The closest match I know of is Quartz, but it only allows scheduling single jobs by time, and there are no remoting, failover, load balancing, or dependency handling capabilities.
Such a solution could be built on top of Quartz and a MOM server (e.g. ActiveMQ), but I'd like to be sure there is nothing out there first before building this up.
Probably a MapReduce port to Java would also do.
You should look at grid computing frameworks such as HTCondor, Hadoop (map/reduce), JPPF or GridGain, this is what they were made for.
Quartz does have support for clustering. Check this.

Quartz or simple POJO

I'm writing a Java-based app (not a web app) and it should be able to run standalone without any container. The tasks it carries out are below:
Windows scheduler fires off either Quartz or a simple POJO
pick up file(s) around midnight
import the data into the DB
move the files over from the original destination to another drive
Now, the dilemma I'm having is that I've been reading around and it appears Quartz needs a web container to function.
Is that correct, and what would be the most simple and durable solution?
Regarding your question: Quartz does not need a web container; it can be run in any Java application. See the Quartz Quickstart Guide for how to configure Quartz.
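A minimal standalone sketch, assuming only the Quartz jar on the classpath; the job class and cron expression are placeholders for the midnight file import:

```java
import org.quartz.*;
import org.quartz.impl.StdSchedulerFactory;

public class StandaloneQuartz {

    // The work: pick up files, import into the DB, move them -- stubbed here.
    public static class FileImportJob implements Job {
        @Override
        public void execute(JobExecutionContext context) {
            System.out.println("importing files...");
        }
    }

    public static void main(String[] args) throws SchedulerException {
        Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();
        JobDetail job = JobBuilder.newJob(FileImportJob.class)
                .withIdentity("fileImport").build();
        Trigger trigger = TriggerBuilder.newTrigger()
                .withSchedule(CronScheduleBuilder.cronSchedule("0 0 0 * * ?")) // daily at midnight
                .build();
        scheduler.scheduleJob(job, trigger);
        scheduler.start(); // the JVM now stays up and Quartz fires the job itself
    }
}
```

Note that scheduler.start() returns immediately; Quartz's non-daemon worker threads keep the JVM alive so the trigger can fire.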
If you use Quartz, the Windows scheduler shouldn't be necessary, but this implies that your Java application is running constantly.
I think Quartz has the advantage that you can configure your application in one place and do not need to consider OS-specific scheduling. Furthermore, Quartz makes you independent of the OS-specific scheduling mechanism.
But: all these advantages are irrelevant if your application is not running all the time.
On the other hand, if you want a fire-and-forget style application that runs, does its work, and then quits again, you will be on the safe side delegating the task of scheduling to the operating system your application runs on.
So, for this specific context, I think using the operating system's scheduling mechanism is the better option.
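If you go the OS-scheduler route, the app can stay a plain main that runs, does its work, and exits. A sketch using only the JDK; the directories and the DB import step are placeholders:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Fire-and-forget: the OS scheduler launches this main around midnight,
// it processes whatever files are waiting, archives them, then exits.
public class NightlyImport {
    public static void main(String[] args) throws IOException {
        Path inbox = Paths.get(args[0]);   // e.g. D:\inbox
        Path archive = Paths.get(args[1]); // e.g. E:\archive
        try (DirectoryStream<Path> files = Files.newDirectoryStream(inbox)) {
            for (Path file : files) {
                importIntoDb(file); // placeholder for the real DB load
                Files.move(file, archive.resolve(file.getFileName()),
                        StandardCopyOption.REPLACE_EXISTING);
            }
        }
        // main returns, the JVM exits: nothing lingers between runs
    }

    static void importIntoDb(Path file) {
        System.out.println("imported " + file.getFileName());
    }
}
```

Since the process exits after each run, the OS scheduler remains the single source of truth about when work happens.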

Is it possible to run a cron job in a web application?

In a java web application (servlets/spring mvc), using tomcat, is it possible to run a cron job type service?
e.g. every 15 minutes, purge the log database.
Can you do this in a way that is container independent, or it has to be run using tomcat or some other container?
Please specify whether the method is guaranteed to run at a specific time, or runs every 15 minutes but may be reset, etc., if the application recycles (that's how it is in .NET if you use timers).
As documented in Chapter 23. Scheduling and Thread Pooling, Spring has scheduling support through integration classes for the Timer and the Quartz Scheduler (http://www.quartz-scheduler.org/). For simple needs, I'd recommend going with the JDK Timer.
Note that Java schedulers are usually used to trigger Java business oriented jobs. For sysadmin tasks (like the example you gave), you should really prefer cron and traditional admin tools (bash, etc).
If you're using Spring, you can use the built-in Quartz or Timer hooks. See http://static.springsource.org/spring/docs/2.5.x/reference/scheduling.html
It will be container-specific. You can do it in Java with Quartz or just using Java's scheduling concurrent utils (ScheduledExecutorService) or as an OS-level cron job.
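A sketch of the JDK-only option with ScheduledExecutorService; the period is parameterized here purely so it can be exercised quickly, and the purge body is a stub:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Container-independent purge timer using only the JDK. The schedule lives
// in memory, so -- like .NET timers -- it resets if the webapp recycles.
public class LogPurger {

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    final AtomicInteger runs = new AtomicInteger(); // run counter, for illustration

    // For the real task: start(TimeUnit.MINUTES.toMillis(15))
    public void start(long periodMillis) {
        scheduler.scheduleAtFixedRate(this::purge, periodMillis, periodMillis,
                TimeUnit.MILLISECONDS);
    }

    public void stop() { // e.g. call from ServletContextListener.contextDestroyed
        scheduler.shutdownNow();
    }

    void purge() {
        runs.incrementAndGet();
        // delete old rows from the log database here
    }
}
```

In a webapp you would start this from a ServletContextListener and stop it on shutdown, which keeps it independent of any particular container.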
Every 15 minutes seems extreme. Generally I'd also advise you only to truncate/delete log files that are no longer being written to (and they're generally rolled over overnight).
Jobs are batch-oriented, triggered either manually or cron-style (as you seem to want).
Still, I don't get the relation between a webapp and a cron-style job. The only webapp use case I can think of is that you want an HTTP endpoint to trigger a job (but this opposes your statement about being 'cron-style').
Generally, use a dedicated framework that solves the 'batch jobs' problem area. I can recommend Quartz.
