quartz cluster mode only runs one task - java

I have two Quartz apps that must run in cluster mode, so I have two jars. When I run the two jars (java -jar), only one process seems to do any work; the other appears to be on standby and only begins to work when I kill the first one. I need both processes to run in cluster mode.
This is my config:
private Properties getProperties() {
    final Properties quartzProperties = new Properties();
    quartzProperties.put("org.quartz.jobStore.class", "org.quartz.impl.jdbcjobstore.JobStoreTX");
    quartzProperties.put("org.quartz.jobStore.isClustered", "true");
    quartzProperties.put("org.quartz.jobStore.tablePrefix", "QRTZ_");
    quartzProperties.put("org.quartz.jobStore.driverDelegateClass", "org.quartz.impl.jdbcjobstore.StdJDBCDelegate");
    quartzProperties.put("org.quartz.threadPool.class", "org.quartz.simpl.SimpleThreadPool");
    quartzProperties.put("org.quartz.threadPool.threadCount", "25");
    quartzProperties.put("org.quartz.scheduler.instanceId", "AUTO");
    quartzProperties.put("org.quartz.scheduler.instanceName", "qrtz");
    quartzProperties.put("org.quartz.threadPool.threadPriority", "5");
    quartzProperties.put("org.quartz.jobStore.clusterCheckinInterval", "10000");
    quartzProperties.put("org.quartz.jobStore.useProperties", "false");
    quartzProperties.put("org.quartz.jobStore.dataSource", "quartzDS");
    quartzProperties.put("org.quartz.dataSource.quartzDS.URL", environment.getRequiredProperty("org.quartz.dataSource.quartzDS.URL"));
    quartzProperties.put("org.quartz.dataSource.quartzDS.user", environment.getRequiredProperty("org.quartz.dataSource.quartzDS.user"));
    quartzProperties.put("org.quartz.dataSource.quartzDS.password", environment.getRequiredProperty("org.quartz.dataSource.quartzDS.password"));
    quartzProperties.put("org.quartz.dataSource.quartzDS.maxConnections", "5");
    quartzProperties.put("org.quartz.dataSource.quartzDS.validationQuery", "select 0 from dual");
    quartzProperties.put("org.quartz.dataSource.quartzDS.driver", environment.getRequiredProperty("org.quartz.dataSource.quartzDS.driver"));
    return quartzProperties;
}

TL;DR: Your problem comes from Quartz Scheduler itself, and there is no way to change this behaviour.
To understand why, you need to know how Quartz behaves in cluster mode. Let's take your case as an example.
You start your two apps, each running a Quartz instance that synchronizes through the database. Each job you schedule is stored in the database along with processing data such as "last time the job ran" and "last instance that ran the job". Each Quartz instance regularly scans the database for jobs to fire and fires as many of them as it can.
The thing is, if you don't have enough load, one of your nodes will always scan the database before the other one and take all the work.
To see your other instance working, you have to shut down or stand by the first one, or increase the load on the cluster.
The only thing you can configure here is the size of each node's thread pool: see http://www.quartz-scheduler.org/documentation/quartz-2.x/configuration/ConfigJDBCJobStoreClustering.html
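For example, shrinking each node's thread pool caps how many jobs a single instance can run at once, which tends to spread work across the cluster once the load exceeds one node's capacity. A minimal sketch against the config above (the value is illustrative, not a recommendation):
// Illustrative value: a smaller per-node pool prevents one instance
// from absorbing the entire load by itself.
quartzProperties.put("org.quartz.threadPool.threadCount", "5");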

Related

Spark: Job restart and retries

Suppose you have Spark with the Standalone cluster manager. You opened a Spark session with some configs and want to launch SomeSparkJob 40 times in parallel with different arguments.
Questions
How to set the retry count on job failures?
How to restart jobs programmatically on failure? This could be useful if jobs fail due to a lack of resources. Then I could launch, one by one, all the jobs that require extra resources.
How to restart the Spark application on job failure? This could be useful if a job lacks resources even when it's launched on its own. Then, to change cores, CPU, etc. configs, I would need to relaunch the application in the Standalone cluster manager.
My workarounds
1) I'm pretty sure the 1st point is possible, since it's possible in Spark local mode. I just don't know how to do that in standalone mode.
2-3) It's possible to attach a listener to the Spark context, like spark.sparkContext().addSparkListener(new SparkListener() {...}). But it seems SparkListener lacks failure callbacks.
There is also a bunch of methods with very poor documentation. I've never used them, but perhaps they could help solve my problem:
spark.sparkContext().dagScheduler().runJob();
spark.sparkContext().runJob()
spark.sparkContext().submitJob()
spark.sparkContext().taskScheduler().submitTasks();
spark.sparkContext().dagScheduler().handleJobCancellation();
spark.sparkContext().statusTracker()
You can use SparkLauncher and control the flow.
import org.apache.spark.launcher.SparkLauncher;

public class MyLauncher {
    public static void main(String[] args) throws Exception {
        Process spark = new SparkLauncher()
                .setAppResource("/my/app.jar")
                .setMainClass("my.spark.app.Main")
                .setMaster("local")
                .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
                .launch();
        spark.waitFor();
    }
}
See API for more details.
Since it creates a Process, you can check the process status and retry, e.g. using:
public boolean isAlive()
If the process is not alive, you can start it again; see the Process API for more details.
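For instance, a minimal retry sketch along those lines (the app resource, main class, and retry limit are illustrative assumptions):
import org.apache.spark.launcher.SparkLauncher;

public class RetryingLauncher {
    public static void main(String[] args) throws Exception {
        final int maxRetries = 3; // illustrative limit
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            Process spark = new SparkLauncher()
                    .setAppResource("/my/app.jar")     // assumed path
                    .setMainClass("my.spark.app.Main") // assumed class
                    .setMaster("local")
                    .launch();
            int exitCode = spark.waitFor(); // block until the app finishes
            if (exitCode == 0) {
                return; // success, no retry needed
            }
            System.err.println("Attempt " + attempt + " failed with exit code " + exitCode);
        }
    }
}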
Hopefully this gives a high-level idea of how to achieve what you mentioned in your question. There could be more ways to do the same thing, but I thought I'd share this approach.
Cheers!
Check your spark.sql.broadcastTimeout and spark.broadcast.blockSize properties and try increasing them.
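A hedged sketch of raising both values when building the session (the numbers are illustrative, not recommendations):
import org.apache.spark.sql.SparkSession;

public class BroadcastTuning {
    public static void main(String[] args) {
        // Illustrative values; tune them for your workload.
        SparkSession spark = SparkSession.builder()
                .appName("broadcast-tuning-example")
                .master("local[*]")
                .config("spark.sql.broadcastTimeout", "600") // seconds; default is 300
                .config("spark.broadcast.blockSize", "8m")   // default is 4m
                .getOrCreate();
        spark.stop();
    }
}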

How to decrease heartbeat time of slave nodes in Hadoop

I am working on AWS EMR.
I want to get information about a dead task node as soon as possible, but with Hadoop's default settings the heartbeat is shared only every 10 minutes.
This is the default key-value pair in mapred-default: mapreduce.jobtracker.expire.trackers.interval : 600000ms
I tried to modify the default value to 6000ms using this link.
After that, whenever I terminate an EC2 machine in the EMR cluster, I still cannot see the state change that fast (i.e. within 6 seconds).
Resource manager REST API - http://MASTER_DNS_NAME:8088/ws/v1/cluster/nodes
Questions:
1. What is the command to check the mapreduce.jobtracker.expire.trackers.interval value in a running EMR (Hadoop) cluster?
2. Is this the right key to use to get the state change? If not, please suggest another solution.
3. What is the difference between the DECOMMISSIONING vs DECOMMISSIONED vs LOST node states in the Resource Manager UI?
Update
I tried a number of times, but it shows ambiguous behaviour: sometimes a node moves to the DECOMMISSIONING/DECOMMISSIONED state, and sometimes it moves directly to the LOST state after 10 minutes.
I need a quick state change so that I can trigger an event from it.
Here is my sample code:
List<Configuration> configurations = new ArrayList<Configuration>();

Configuration mapredSiteConfiguration = new Configuration();
mapredSiteConfiguration.setClassification("mapred-site");
Map<String, String> mapredSiteConfigurationMapper = new HashMap<String, String>();
mapredSiteConfigurationMapper.put("mapreduce.jobtracker.expire.trackers.interval", "7000");
mapredSiteConfiguration.setProperties(mapredSiteConfigurationMapper);

Configuration hdfsSiteConfiguration = new Configuration();
hdfsSiteConfiguration.setClassification("hdfs-site");
Map<String, String> hdfsSiteConfigurationMapper = new HashMap<String, String>();
hdfsSiteConfigurationMapper.put("dfs.namenode.decommission.interval", "10");
hdfsSiteConfiguration.setProperties(hdfsSiteConfigurationMapper);

Configuration yarnSiteConfiguration = new Configuration();
yarnSiteConfiguration.setClassification("yarn-site");
Map<String, String> yarnSiteConfigurationMapper = new HashMap<String, String>();
yarnSiteConfigurationMapper.put("yarn.resourcemanager.nodemanagers.heartbeat-interval-ms", "5000");
yarnSiteConfiguration.setProperties(yarnSiteConfigurationMapper);

configurations.add(mapredSiteConfiguration);
configurations.add(hdfsSiteConfiguration);
configurations.add(yarnSiteConfiguration);
These are the settings I changed in AWS EMR (Hadoop under the hood) to reduce the time for the state change from RUNNING to another state (DECOMMISSIONING/DECOMMISSIONED/LOST).
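For context, a hedged sketch of how such a configuration list is typically attached to the cluster at creation time with the AWS SDK for Java (the rest of the request setup is assumed, and the name is illustrative):
// Assumes "configurations" is the List<Configuration> built above
// and the remaining request fields are configured elsewhere.
RunJobFlowRequest request = new RunJobFlowRequest()
        .withName("my-cluster")
        .withConfigurations(configurations);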
You can use "hdfs getconf". Please refer to this post Get a yarn configuration from commandline
These links give info about node manager health-check and the properties you have to check:
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html
https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/NodeManager.html
Refer "yarn.resourcemanager.nodemanagers.heartbeat-interval-ms" in the below link:
https://hadoop.apache.org/docs/r2.7.1/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
Your queries are answered in this link:
https://issues.apache.org/jira/browse/YARN-914
Refer the "attachments" and "sub-tasks" area.
In simple terms, if the currently running application master and task containers gets shut-down properly (and/or re-initiated in different other nodes) then the node manager is said to be DECOMMISSIONED (gracefully), else it is LOST.
Update:
"dfs.namenode.decommission.interval" is for HDFS data node decommissioning, it does not matter if you are concerned only about node manager.
In exceptional cases, data node need not be a compute node.
Try yarn.nm.liveness-monitor.expiry-interval-ms (default 600000 - that is why you reported that the state changed to LOST in 10 minutes, set it to a smaller value as you require) instead of mapreduce.jobtracker.expire.trackers.interval.
You have set "yarn.resourcemanager.nodemanagers.heartbeat-interval-ms" as 5000, which means, the heartbeat goes to resource manager once in 5 seconds, whereas the default is 1000. Set it to a smaller value as you require.
1. hdfs getconf -confKey mapreduce.jobtracker.expire.trackers.interval
2. As mentioned in the other answer, yarn.resourcemanager.nodemanagers.heartbeat-interval-ms should be set based on your network; if your network has high latency, you should set a bigger value.
3. It's in DECOMMISSIONING when there are running containers and it's waiting for them to complete so that those nodes can be decommissioned.
It's in LOST when it's stuck in this process for too long; this state is reached after the set timeout has passed and the decommissioning of the node(s) couldn't be completed.
It's in DECOMMISSIONED when the decommissioning of the node(s) completes.
Reference: Resize a Running Cluster
For YARN NodeManager decommissioning, you can manually adjust the time a node waits for decommissioning by setting yarn.resourcemanager.decommissioning.timeout inside /etc/hadoop/conf/yarn-site.xml; this setting is dynamically propagated.

submit a job from eclipse to a running cluster on amazon EMR

I want to add jobs from my Java code in Eclipse to a running EMR cluster, to save startup time (creating EC2 instances, bootstrapping...).
I know how to start a new cluster from Java code, but it terminates after all the jobs are done.
RunJobFlowRequest runFlowRequest = new RunJobFlowRequest()
        .withName("Some name")
        .withInstances(instances)
        // .withBootstrapActions(bootstrapActions)
        .withJobFlowRole("EMR_EC2_DefaultRole")
        .withServiceRole("EMR_DefaultRole")
        .withSteps(firstJobStep, secondJobStep, thirdJobStep)
        .withLogUri("s3n://path/to/logs");

// Run the jobs
RunJobFlowResult runJobFlowResult = mapReduce.runJobFlow(runFlowRequest);
String jobFlowId = runJobFlowResult.getJobFlowId();
You have to set the KeepJobFlowAliveWhenNoSteps parameter to TRUE; otherwise the cluster will be terminated after executing all the steps. With this property set, the cluster will stay in the WAITING state after executing all the steps.
Add .withKeepJobFlowAliveWhenNoSteps(true) to the existing code.
Refer to this doc for further details.
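Once the cluster stays alive, later jobs can be submitted to it as additional steps. A minimal sketch with the same AWS SDK for Java client (fourthJobStep is an illustrative StepConfig, built like the steps in the question):
// Assumes "mapReduce" and "jobFlowId" from the question's code, plus a
// StepConfig named fourthJobStep; both names are illustrative.
AddJobFlowStepsRequest addStepsRequest = new AddJobFlowStepsRequest()
        .withJobFlowId(jobFlowId)
        .withSteps(fourthJobStep);
mapReduce.addJobFlowSteps(addStepsRequest);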

Non Persistent Quartz scheduler in java

We have an application deployed in a clustered environment. Every 5 minutes our application sends a ping operation to all other applications connected to it. We use a non-persistent Quartz scheduler to do this work.
The problem is that in the clustered environment only one node performs this activity (the ping operation). Are there any references or sample code for this? (This is a plain servlet application.)
Since all your nodes work in a cluster, every job runs on just a single machine (the most idle one). That is the whole point of clustering. But you want every machine to run the job independently, without being aware of the other cluster nodes. Basically, you don't need Quartz clustering at all!
It is enough to use ScheduledExecutorService#scheduleAtFixedRate():
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import static java.util.concurrent.TimeUnit.MINUTES;

final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);
final Runnable pinger = new Runnable() {
    public void run() {
        // send PING
    }
};
scheduler.scheduleAtFixedRate(pinger, 5, 5, MINUTES);
Just run this code on every machine and use Quartz where you need it.

Scheduled processes running twice simultaneously in Openbravo (using Quartz)

I'm not quite sure whether this is more of an Openbravo issue or more of a Quartz issue. We have some manual processes that run on schedules via Openbravo ProcessRequest objects (OB v2.50MP24), but the processes appear to run twice, at exactly the same time. Openbravo extends the Quartz platform for its scheduling. I've tried to resolve the issue on my own by ensuring that my process classes extend this class:
import java.util.List;

import org.openbravo.dal.service.OBDal;
import org.openbravo.model.ad.ui.ProcessRequest;
import org.openbravo.scheduling.ProcessBundle;
import org.openbravo.service.db.DalBaseProcess;

public abstract class RBDDalProcess extends DalBaseProcess {

    @Override
    protected void doExecute(ProcessBundle bundle) throws Exception {
        org.quartz.Scheduler sched = org.openbravo.scheduling.OBScheduler
                .getInstance().getScheduler();
        int runCount = 0;
        synchronized (sched) {
            List<org.quartz.JobExecutionContext> currentlyExecutingJobs =
                    (List<org.quartz.JobExecutionContext>) sched.getCurrentlyExecutingJobs();
            for (org.quartz.JobExecutionContext jec : currentlyExecutingJobs) {
                ProcessRequest processRequest = OBDal.getInstance().get(
                        ProcessRequest.class, jec.getJobDetail().getName());
                if (processRequest == null)
                    continue;
                String processClass = processRequest.getProcess().getJavaClassName();
                if (bundle.getProcessClass().getCanonicalName().equals(processClass)) {
                    runCount++;
                }
            }
        }
        if (runCount > 1) {
            System.out.println("Process "
                    + bundle.getProcessClass().getSimpleName()
                    + " is already running. Cancelling.");
            return;
        }
        doRun(bundle);
    }

    protected abstract void doRun(ProcessBundle bundle);
}
This worked fine when I tested it by requesting the process to run immediately twice at the same time: one of the two cancelled. However, it's not working for the scheduled processes. I have System.out.println calls set up to log when the processes start, and the logs show each line of the output twice, one right after the other.
I have a sneaking suspicion that the processes are running in two completely different threads that don't know about each other's executions; however, I'm not sure how to verify that or, if I am correct, what to do about it. I've already verified that there is only one instance of each ProcessRequest object stored in the database.
Has anyone else experienced this, know why the processes might be running twice, or know what I can do to prevent them from running simultaneously?
The most common reasons for a double job execution are the following:
Your application is deployed in a clustered environment and you have not configured Quartz to run in a clustered environment.
Your application is deployed more than once. There are many cases where an application gets deployed twice, especially on a Tomcat server. As a consequence, the QuartzInitializerListener is invoked twice and the jobs are executed twice. If you use Tomcat and you define contexts explicitly in server.xml, you should turn off automatic application deployment or specify deployIgnore: having both autoDeploy set to true and a context element in server.xml causes the application to be deployed twice. Set autoDeploy to false or remove the context element from server.xml.
Your application has been redeployed without unscheduling the currently running processes.
I hope this helps you.
Quartz uses a thread pool for job execution. So, as you suspect, RBDDalProcess will probably have separate instances in separate threads, and the counter check will fail.
One thing you can do is list the jobs registered in the Scheduler (you can get the Scheduler using the OB API as: OBScheduler.getScheduler()):
import static org.quartz.impl.matchers.GroupMatcher.groupEquals;

import org.quartz.JobKey;

// enumerate each job group
for (String group : sched.getJobGroupNames()) {
    // enumerate each job in the group
    for (JobKey jobKey : sched.getJobKeys(groupEquals(group))) {
        System.out.println("Found job identified by: " + jobKey);
    }
}
If you see the same job added twice, check out org.quartz.spi.JobFactory and the org.quartz.Scheduler.setJobFactory method for controlling job instantiation.
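As a hedged sketch (Quartz 2.x API; the logging is illustrative), a custom JobFactory like the following can make duplicate instantiations visible:
import org.quartz.Job;
import org.quartz.Scheduler;
import org.quartz.SchedulerException;
import org.quartz.spi.JobFactory;
import org.quartz.spi.TriggerFiredBundle;

public class LoggingJobFactory implements JobFactory {
    @Override
    public Job newJob(TriggerFiredBundle bundle, Scheduler scheduler) throws SchedulerException {
        // Log every instantiation so a double execution shows up in the output.
        System.out.println("Instantiating job: " + bundle.getJobDetail().getKey());
        try {
            return bundle.getJobDetail().getJobClass().newInstance();
        } catch (Exception e) {
            throw new SchedulerException("Failed to instantiate job", e);
        }
    }
}

// Registered once at startup: sched.setJobFactory(new LoggingJobFactory());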
Also make sure you have only one entry for this process in the 'Report and Process' table in Openbravo.
I have used DalBaseProcess in Openbravo 3.0 and I cannot confirm the behavior you're describing. With this in mind, it would probably be a good idea to check the reported bugs for Openbravo v2.50MP24 and Quartz, or post a thread in the Openbravo Forge forums describing your problem.
