submit a job from eclipse to a running cluster on amazon EMR

submit a job from eclipse to a running cluster on amazon EMR - java

I want to add jobs from my java code in eclipse to a running cluster of EMR for saving startup time (creating ec2, bootstrapping...).
I know how to run a new cluster from java code but it's terminating after all jobs are done.
RunJobFlowRequest runFlowRequest = new RunJobFlowRequest()
.withName("Some name")
.withInstances(instances)
// .withBootstrapActions(bootstrapActions)
.withJobFlowRole("EMR_EC2_DefaultRole")
.withServiceRole("EMR_DefaultRole")
.withSteps(firstJobStep, secondJobStep, thirdJobStep)
.withLogUri("s3n://path/to/logs");
// Run the jobs
RunJobFlowResult runJobFlowResult = mapReduce
.runJobFlow(runFlowRequest);
String jobFlowId = runJobFlowResult.getJobFlowId();

You have to set KeepJobFlowAliveWhenNoSteps parameter to TRUE, otherwise the cluster will be terminated after executing all the steps. If this property is set, the cluster will continue in waiting state after executing all the steps.
Add .withKeepJobFlowAliveWhenNoSteps(true) to the existing code.
Refer this doc for further details.

Related

How to prevent Kubernetes cronjob running concurrently when I trigger it manually?

I have set the Kubernetes cronJob to prevent concurrent runs like here using parallelism: 1, concurrencyPolicy: Forbid, and parallelism: 1. However, when I try to create a cronJob manually I am allowed to do that.
$ kubectl get cronjobs
...
$ kubectl create job new-cronjob-1642417446000 --from=cronjob/original-cronjob-name
job.batch/new-cronjob-1642417446000 created
$ kubectl create job new-cronjob-1642417446001 --from=cronjob/original-cronjob-name
job.batch/new-cronjob-1642417446001 created
I was expecting that a new cronjob would not be created. Or it could be created and fail with a state that references the concurrencyPolicy. If the property concurrencyPolicy is part of the CronJob spec, not the PodSpec, it should prevent a new job to be created. Why it does not?
apiVersion: batch/v1beta1
kind: CronJob
metadata:
name: cronjob-name
annotations:
argocd.argoproj.io/sync-wave: "1"
spec:
schedule: "0 * * * *"
suspend: false
successfulJobsHistoryLimit: 1
failedJobsHistoryLimit: 3
concurrencyPolicy: Forbid
jobTemplate:
spec:
parallelism: 1
completions: 1
backoffLimit: 3
template:
spec:
restartPolicy: Never
After reading the official documentation about the kubectl create -f I didn't find a way to prevent that. Is this behavior expected? If it is, I think I should check inside my Docker image (app written in Java) if there is already a cronjob running. How would I do that?

The concurrencyPolicy: Forbidden spec only prevents concurrent pod creations and executions of the same CronJob. It does not apply across separate CronJobs even though they effectively execute the same commands using the same Docker image. As far as Kubernetes is concerned, they are different jobs. If this was Java, what Kubernetes does is if (stringA == stringB) and not if (stringA.equals(stringB)).
If it is, I think I should check inside my Docker image (app written in Java) if there is already a cronjob running. How would I do that?
One way of using that is to use distributed lock mechanism using separate component such as Redis. Here is the link to the guide to utilize Java Redis library redisson for that purpose: https://github.com/redisson/redisson/wiki/8.-distributed-locks-and-synchronizers. Below is code sample taken from that page:
RLock lock = redisson.getLock("myLock");
// wait for lock aquisition up to 100 seconds
// and automatically unlock it after 10 seconds
boolean res = lock.tryLock(100, 10, TimeUnit.SECONDS);
if (res) {
// do operation
} else {
// some other execution is doing it so just chill
}

Stop specific running Kettle Job in java

How would it be possible to stop a specific running job in Kettle?
I'm using the following code:
KettleEnvironment.init();
JobMeta jobmeta = new JobMeta(C://Users//Admin//DBTOOL//EDW_Testing_Tool - 1.8(VersionUpgraded)//data-integration//Regress_bug//Start_Validation.kjb,
null);
Job job = new Job(null, jobmeta);
job.initializeVariablesFrom(null);
job.setVariable("Internal.Job.Filename.Directory", Constants.JOB_EXECUTION_KJB_FILE_PATH);
job.setVariable("jobId", jobId.toString());
job.getJobMeta().setInternalKettleVariables(job);
job.stopAll();
How would I ensure that the job which I want to stop is getting stopped and it is not executed after setting the flag?
I'm using rest api to stop the job and i'm not able to get the job Object.
if i'm using CarteSingleton and store the object in map i'm not able to execute the job it gives driver error could not connect to database(eg:-jtds) url not working.

I am not finding evidence of NodeInitializationAction for Dataproc having run

I am specifying a NodeInitializationAction for Dataproc as follows:
ClusterConfig clusterConfig = new ClusterConfig();
clusterConfig.setGceClusterConfig(...);
clusterConfig.setMasterConfig(...);
clusterConfig.setWorkerConfig(...);
List<NodeInitializationAction> initActions = new ArrayList<>();
NodeInitializationAction action = new NodeInitializationAction();
action.setExecutableFile("gs://mybucket/myExecutableFile");
initActions.add(action);
clusterConfig.setInitializationActions(initActions);
Then later:
Cluster cluster = new Cluster();
cluster.setProjectId("wide-isotope-147019");
cluster.setConfig(clusterConfig);
cluster.setClusterName("cat");
Then finally, I invoke the dataproc.create operation with the cluster. I can see the cluster being created, but when I ssh into the master machine ("cat-m" in us-central1-f), I see no evidence of the script I specified having been copied over or run.
So this leads to my questions:
What should I expect in terms of evidence? (edit: I found the script itself in /etc/google-dataproc/startup-scripts/dataproc-initialization-script-0).
Where does the script get invoked from? I know it runs as the user root, but beyond that, I am not sure where to find it. I did not find it in the root directory.
At what point does the Operation returned from the Create call change from "CREATING" to "RUNNING"? Does this happen before or after the script gets invoked, and does it matter if the exit code of the script is non-zero?
Thanks in advance.

Dataproc makes a number of guarantees about init actions:
each script should be downloaded and stored locally in:
/etc/google-dataproc/startup-scripts/dataproc-initialization-script-0
the output of the script will be captured in a "staging bucket" (either the bucket specified via --bucket option, or a Dataproc auto-generated bucket). Assuming your cluster is named my-cluster, if you describe master instance via gcloud compute instances describe my-cluster-m, the exact location is in dataproc-agent-output-directory metadata key
Cluster may not enter RUNNING state (and Operation may not complete) until all init actions execute on all nodes. If init action exits with non-zero code, or init action exceeds specified timeout, it will be reported as such
similarly if you resize a cluster, we guarantee that new workers do not join cluster until each worker is fully configured in isolation
if you still don't belive me :) inspect Dataproc agent log in /var/log/google-dataproc-agent-0.log and look for entries from BootstrapActionRunner

How to decrease heartbeat time of slave nodes in Hadoop

I am working on AWS EMR.
I want to get the information of died task node as soon as possible. But as per default setting in hadoop, heartbeat is shared after every 10 minutes.
This is the default key-value pair in mapred-default - mapreduce.jobtracker.expire.trackers.interval : 600000ms
I tried to modify default value to 6000ms using - this link
After that, whenever I terminate any ec2 machine from EMR cluster, I am not able to see state change that fast.(in 6 seconds)
Resource manager REST API - http://MASTER_DNS_NAME:8088/ws/v1/cluster/nodes
Questions-
What is the command to check the mapreduce.jobtracker.expire.trackers.interval value in running EMR cluster(Hadoop cluster)?
Is this the right key I am using to get the state change ? If it is not, please suggest any other solution.
What is the difference between DECOMMISSIONING vs DECOMMISSIONED vs LOST state of nodes in Resource manager UI ?
Update
I tried numbers of times, but it is showing ambiguous behaviour. Sometimes, it moved to DECOMMISSIONING/DECOMMISIONED state, and sometime it directly move to LOST state after 10 minutes.
I need a quick state change, so that I can trigger some event.
Here is my sample code -
List<Configuration> configurations = new ArrayList<Configuration>();
Configuration mapredSiteConfiguration = new Configuration();
mapredSiteConfiguration.setClassification("mapred-site");
Map<String, String> mapredSiteConfigurationMapper = new HashMap<String, String>();
mapredSiteConfigurationMapper.put("mapreduce.jobtracker.expire.trackers.interval", "7000");
mapredSiteConfiguration.setProperties(mapredSiteConfigurationMapper);
Configuration hdfsSiteConfiguration = new Configuration();
hdfsSiteConfiguration.setClassification("hdfs-site");
Map<String, String> hdfsSiteConfigurationMapper = new HashMap<String, String>();
hdfsSiteConfigurationMapper.put("dfs.namenode.decommission.interval", "10");
hdfsSiteConfiguration.setProperties(hdfsSiteConfigurationMapper);
Configuration yarnSiteConfiguration = new Configuration();
yarnSiteConfiguration.setClassification("yarn-site");
Map<String, String> yarnSiteConfigurationMapper = new HashMap<String, String>();
yarnSiteConfigurationMapper.put("yarn.resourcemanager.nodemanagers.heartbeat-interval-ms", "5000");
yarnSiteConfiguration.setProperties(yarnSiteConfigurationMapper);
configurations.add(mapredSiteConfiguration);
configurations.add(hdfsSiteConfiguration);
configurations.add(yarnSiteConfiguration);
This is the settings that I changed into AWS EMR (internally Hadoop) to reduce the time between state change from RUNNING to other state(DECOMMISSIONING/DECOMMISIONED/LOST).

You can use "hdfs getconf". Please refer to this post Get a yarn configuration from commandline
These links give info about node manager health-check and the properties you have to check:
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html
https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/NodeManager.html
Refer "yarn.resourcemanager.nodemanagers.heartbeat-interval-ms" in the below link:
https://hadoop.apache.org/docs/r2.7.1/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
Your queries are answered in this link:
https://issues.apache.org/jira/browse/YARN-914
Refer the "attachments" and "sub-tasks" area.
In simple terms, if the currently running application master and task containers gets shut-down properly (and/or re-initiated in different other nodes) then the node manager is said to be DECOMMISSIONED (gracefully), else it is LOST.
Update:
"dfs.namenode.decommission.interval" is for HDFS data node decommissioning, it does not matter if you are concerned only about node manager.
In exceptional cases, data node need not be a compute node.
Try yarn.nm.liveness-monitor.expiry-interval-ms (default 600000 - that is why you reported that the state changed to LOST in 10 minutes, set it to a smaller value as you require) instead of mapreduce.jobtracker.expire.trackers.interval.
You have set "yarn.resourcemanager.nodemanagers.heartbeat-interval-ms" as 5000, which means, the heartbeat goes to resource manager once in 5 seconds, whereas the default is 1000. Set it to a smaller value as you require.

hdfs getconf -confKey mapreduce.jobtracker.expire.trackers.interval
As mentioned in the other answer:
yarn.resourcemanager.nodemanagers.heartbeat-interval-ms should be set based on your network, if your network has high latency, you should set a bigger value.
3.
Its in DECOMMISSIONING when there are running containers and its waiting for them to complete so that those nodes can be decommissioned.
Its in LOST when its stuck in this process for too long. This state is reached after the set timeout is passed and decommissioning of node(s) couldn't be completed.
DECOMMISSIONED is when the decommissioning of the node(s) completes.
Reference : Resize a Running Cluster
For YARN NodeManager decommissioning, you can manually adjust the time
a node waits for decommissioning by setting
yarn.resourcemanager.decommissioning.timeout inside
/etc/hadoop/conf/yarn-site.xml; this setting is dynamically
propagated.

quartz cluster mode only runs one task

I have two quartz apps that must run in cluster mode so I have two jars. When I run those two jars (java -jar) only one process seems to be working, the other seems to be in standby and does nothing and only begins to work when I kill the other process. I need the two processes to run in cluster mode.
This is my config:
private Properties getProperties() {
final Properties quartzProperties = new Properties();
quartzProperties.put("org.quartz.jobStore.class", "org.quartz.impl.jdbcjobstore.JobStoreTX");
quartzProperties.put("org.quartz.jobStore.isClustered", "true");
quartzProperties.put("org.quartz.jobStore.tablePrefix", "QRTZ_");
quartzProperties.put("org.quartz.jobStore.driverDelegateClass", "org.quartz.impl.jdbcjobstore.StdJDBCDelegate");
quartzProperties.put("org.quartz.threadPool.class", "org.quartz.simpl.SimpleThreadPool");
quartzProperties.put("org.quartz.threadPool.threadCount", "25");
quartzProperties.put("org.quartz.scheduler.instanceId", "AUTO");
quartzProperties.put("org.quartz.scheduler.instanceName", "qrtz");
quartzProperties.put("org.quartz.threadPool.threadPriority", "5");
quartzProperties.put("org.quartz.jobStore.clusterCheckinInterval","10000");
quartzProperties.put("org.quartz.jobStore.useProperties", "false");
quartzProperties.put("org.quartz.jobStore.dataSource", "quartzDS");
quartzProperties.put("org.quartz.dataSource.quartzDS.URL", environment.getRequiredProperty("org.quartz.dataSource.quartzDS.URL"));
quartzProperties.put("org.quartz.dataSource.quartzDS.user", environment.getRequiredProperty("org.quartz.dataSource.quartzDS.user"));
quartzProperties.put("org.quartz.dataSource.quartzDS.password", environment.getRequiredProperty("org.quartz.dataSource.quartzDS.password"));
quartzProperties.put("org.quartz.dataSource.quartzDS.maxConnections", "5");
quartzProperties.put("org.quartz.dataSource.quartzDS.validationQuery", "select 0 from dual");
quartzProperties.put("org.quartz.dataSource.quartzDS.driver", environment.getRequiredProperty("org.quartz.dataSource.quartzDS.driver"));
return quartzProperties;
}

TL;TR : Your problem comes from Quartz Scheduler itself and there is no way to change its behaviour.
To make you understand why, I have to explain you how Quartz cluster mode behaves. We will take your case as example.
You start your two apps which each run a Quartz instance that synchronize through a database. Each jobs you are scheduling is stored in the database with processing data like "last time the job run", "last instance that run the job", etc. Each Quartz instance regularly scans the database for jobs to fire and fires as much jobs it cans.
The things is, if you don't have enough load, one of your node will always scans the database before the other one and take all the load.
To see your other instance working, you have to shutdown or standby the first one or increase the load of the cluster.
The only thing you can configure on this is the size of the thread pool of each node : See http://www.quartz-scheduler.org/documentation/quartz-2.x/configuration/ConfigJDBCJobStoreClustering.html

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.