I have an application that only implements a Map function (no Reduce).
I'm creating 1000 jobs, each with a unique PrefixFilter.
Example:
public void startNewScan(String prefix, long endTime) throws Exception {
    Job job = new Job(conf, "MyJob");
    job.setNumReduceTasks(0);

    Scan scan = new Scan();
    scan.setTimeRange(0, endTime);
    scan.addColumn(Bytes.toBytes("col"), Bytes.toBytes("Value"));
    scan.setFilter(new PrefixFilter(prefix.getBytes()));

    TableMapReduceUtil.initTableMapperJob(tableName, scan, ExtractMapper.class, ImmutableBytesWritable.class, Result.class, job);
    job.waitForCompletion(true);
}
Now, I don't want to wait for completion, because executing 1000 jobs one after the other would take forever. Creating a thread for each job is also not an option.
Is there anything built in for this use case?
Something like a JobsPool that accepts all the jobs and has its own waitForCompletion for all of them.
Use:
job.submit();
From the Javadoc: "Submit the job to the cluster and return immediately."
I'm asking for help: I have jobs scheduled through Quartz in a job-shop application, and everything works; some of the jobs call procedures on Oracle. I would like to put a job into a completed state when a procedure returns a certain result. I can pause the job, but I cannot mark it as completed.
if (!result.substring(0, 4).equalsIgnoreCase("null")) {
    fase_1.callJobResult("JOB_RESULT", key.getName().toString(), key.getGroup().toString(), strJobex + ":" + parameter, result);
} else {
    // the job has completed its work, so here I want to mark it as completed
    try {
        fase_1.callJobResult("JOB_RESULT", key.getName().toString(), key.getGroup().toString(), strJobex + ":" + parameter,
                "SUCCESS. The job: " + key.getName().toString().toUpperCase() + " job completed");
        System.out.println(result);

        SchedulerFactory factory = new StdSchedulerFactory();
        Scheduler scheduler = factory.getScheduler();
        JobKey jobKey = new JobKey(key.getName().toString(), key.getGroup().toString());
        // scheduler.pauseJob(jobKey); // pausing works, for example

        Trigger trigger = jobExecutionContext.getTrigger();
        // calling the listener by hand does not change the job's state in the scheduler
        TriggerListner tl = new TriggerListner();
        tl.triggerComplete(trigger, jobExecutionContext, CompletedExecutionInstruction.SET_TRIGGER_COMPLETE);
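For context, Quartz has no built-in "completed" state for a job. One possible sketch (my assumption about the intent, not part of the original code) is to unschedule or delete the job once the procedure reports success, so it never fires again; these calls need a try/catch for SchedulerException:

// Sketch only: stop the job from firing again once the procedure reports success.
Scheduler scheduler = jobExecutionContext.getScheduler();           // reuse the running scheduler
JobKey jobKey = jobExecutionContext.getJobDetail().getKey();

scheduler.unscheduleJob(jobExecutionContext.getTrigger().getKey()); // remove just this trigger...
// scheduler.deleteJob(jobKey);                                     // ...or delete the job and all of its triggers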
I have a job that consists of different JobSteps.
I want to trigger a batch of these JobSteps (JobStep1 | JobStep2 | JobStep3) together (run with an AsyncTaskExecutor in different threads),
and a last JobStep (JobStep4) when the other JobSteps are completed.
So I created a different Flow for every JobStep and put them in one Flow with an AsyncTaskExecutor.
I also made a single Flow for the last JobStep. The desired flow is:
JobStep1 | JobStep2 | JobStep3 when COMPLETED
JobStep 4
The code below represents my implementation:
Flow flowJob1= new FlowBuilder<Flow>(jobStep.getName()).from((JobStep)jobStep1).end();
Flow flowJob2= new FlowBuilder<Flow>(jobStep.getName()).from((JobStep)jobStep2).end();
Flow flowJob3= new FlowBuilder<Flow>(jobStep.getName()).from((JobStep)jobStep3).end();
Flow flowJob4= new FlowBuilder<Flow>(jobStep.getName()).from((JobStep)jobStep4).end();
Flow splitFlow = new FlowBuilder<Flow>("splitflow").split(new SimpleAsyncTaskExecutor()).add(flowJob1,flowJob2,flowJob3).build();
And then for the job creation I use this function:
JobFlowBuilder jobFlowBuilder = jobBuilderFactory.get(jobName).repository(jobRepository)
.start((Flow)splitFlow);
jobFlowBuilder.next(flowJob4);
FlowJobBuilder flowJobBuilder= jobFlowBuilder.build();
Job parentJob = flowJobBuilder.build();
return parentJob;
The problem is:
that the main Job doesn't wait for all the JobSteps (running in different threads) to complete before running the next JobStep. Is there any Spring Batch configuration I should use to solve this problem?
You'll want to combine JobStep 1-3 into a single FlowStep. Then you'd use a regular SimpleJobBuilder to build your job.
Flow flowJob1= new FlowBuilder<Flow>(jobStep.getName()).from((JobStep)jobStep1).end();
Flow flowJob2= new FlowBuilder<Flow>(jobStep.getName()).from((JobStep)jobStep2).end();
Flow flowJob3= new FlowBuilder<Flow>(jobStep.getName()).from((JobStep)jobStep3).end();
// Don't need this
// Flow flowJob4= new FlowBuilder<Flow>(jobStep.getName()).from((JobStep)jobStep4).end();
Flow splitFlow = new FlowBuilder<Flow>("splitflow").split(new SimpleAsyncTaskExecutor()).add(flowJob1,flowJob2,flowJob3).build();
FlowStep flowStep = new FlowStep(splitFlow);
SimpleJobBuilder jobBuilder = new JobBuilder(yourJobName)
        .repository(jobRepository) // needed when building the job without a JobBuilderFactory
        .start(flowStep);
Job job = jobBuilder.next(jobStep4).build();
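If you are already injecting a JobBuilderFactory, as in the question, an equivalent sketch under that assumption would be:

// The factory wires the job repository for you.
Job parentJob = jobBuilderFactory.get(jobName)
        .start(flowStep)   // the FlowStep wrapping the split of flowJob1-3
        .next(jobStep4)
        .build();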
Can you provide more information on how you are creating the jobStep?
TL;DR:
@Bean
public Step jobStepJobStep1(JobLauncher jobLauncher) {
return this.stepBuilderFactory.get("jobStepJobStep1")
.job(job())
.launcher(jobLauncher)
.parametersExtractor(jobParametersExtractor())
.build();
}
Try removing the launcher from the jobStep definition. It is working for me.
I guess this could be the issue from this post.
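In other words, a sketch of the same bean with the launcher dropped (my reading of that suggestion, not verified against your setup):

@Bean
public Step jobStepJobStep1() {
    return this.stepBuilderFactory.get("jobStepJobStep1")
            .job(job())
            // no .launcher(...) here; the step should fall back to a default launcher built from the job repository
            .parametersExtractor(jobParametersExtractor())
            .build();
}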
Background
My workflow is:
1. Run Job1, Job2, Job3 in parallel. I provide an executor via split(), and all of these jobs are actually JobSteps.
2. Regardless of 1)'s outcome, execute an EndJob.
@Bean
public Job dataSync(JobRepository jobRepository, PlatformTransactionManager transactionManager) {
Step pricingJobStep = new JobStepBuilder(new StepBuilder("pricingJobStep"))
.job(pricingJob())
.launcher(jobLauncher)
.repository(jobRepository)
.transactionManager(transactionManager)
.build();
Flow pricingFlow = new FlowBuilder<Flow>("pricing-flow").start(
pricingJobStep
).build();
Step referenceJobStep = new JobStepBuilder(new StepBuilder("referenceJobStep"))
.job(referenceJob())
.launcher(jobLauncher)
.repository(jobRepository)
.transactionManager(transactionManager)
.build();
Flow referenceFlow = new FlowBuilder<Flow>("reference-flow").start(
referenceJobStep
).build();
Step tradeJobStep = new JobStepBuilder(new StepBuilder("tradeJobStep"))
.job(tradeJob())
.launcher(jobLauncher)
.repository(jobRepository)
.transactionManager(transactionManager)
.build();
Flow tradeFlow = new FlowBuilder<Flow>("trade-flow").start(
tradeJobStep
).build();
SimpleAsyncTaskExecutor simpleAsyncTaskExecutor = new SimpleAsyncTaskExecutor("ETL-EXEC");
Flow etlFlow = new FlowBuilder<Flow>("etl-flow")
.split(simpleAsyncTaskExecutor)
.add(pricingFlow,referenceFlow,tradeFlow)
.build();
return jobBuilderFactory.get("data-sync")
.start(etlFlow)
.on("COMPLETED")
.to(finalStep())
.from(etlFlow)
.on("FAILED")
.to(finalStep())
.end().build();
}
I am running this job via @Scheduled.
When I ran this with a launcher injected into each jobStep, the jobSteps were all being launched as separate jobs. The executor I assigned to the split() was only used at the step level.
In the logs below, ETL-EXEC2 is the executor I assigned to the split.
Within the split, each flow is actually another job, so those jobs get executed by the executor assigned to the job launcher, i.e. cTaskExecutor-2.
2019-10-24 00:34:06.218 INFO 28776 --- [ ETL-EXEC2] o.s.batch.core.job.SimpleStepHandler : Executing step: [referenceJobStep]
2019-10-24 00:34:06.359 INFO 28776 --- [cTaskExecutor-2] o.s.b.c.l.support.SimpleJobLauncher : Job: [SimpleJob: [name=reference-job]] launched with the following parameters: [{name=0}]
2019-10-24 00:34:06.449 INFO 28776 --- [cTaskExecutor-2] o.s.batch.core.job.SimpleStepHandler : Executing step: [reference-etl]
My guess is that this is the reason why I was unable to have an end task that waits for all the jobSteps: they are launched on different executors, and the flow has no control over them.
I am using an ExecutorService to generate files from a database, using JDBC and core Java to write the table data into files.
After creating the ExecutorService with 10 threads, I submit 60 tasks in a for loop to generate 60 files in parallel. This works fine with small data and tables with few columns. But for a huge file, or a table with many columns, the thread working on that table stops without writing anything to the log, while the other threads complete.
ExecutorService executor = Executors.newFixedThreadPool(THREAD_COUNT);
for (String filename : filenames) {
EachFileThread worker = new EachFileThread(destdir, converter,
filename, this);
executor.execute(worker);
}
executor.shutdown();
Inside EachFileThread I read the XML to get the columns and the table, build a query, execute it, format the data, and write it to a file:
forTable = (FileData) converter.convertFromXMLToObject(filename + ".xml");
String query = getQuery(forTable);
statement = connection.createStatement(ResultSet.TYPE_SCROLL_SENSITIVE, ResultSet.CONCUR_READ_ONLY);
resultSet = statement.executeQuery(query);
resultSet.setFetchSize(3000);
// formats the data from the DB and then writes it to a file
WriteData(resultSet, filepath, forTable);
The problem is that you are not waiting for all the jobs to finish their work. As @msandiford suggested in the comments, you should add a call to awaitTermination(..) after calling shutdown(), as in the sample shutdownAndAwaitTermination() method in the ExecutorService Javadoc: https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ExecutorService.html
For example you can try to do it like so:
ExecutorService executor = Executors.newFixedThreadPool(THREAD_COUNT);
for (String filename : filenames) {
EachFileThread worker = new EachFileThread(destdir, converter, filename, this);
executor.execute(worker);
}
executor.shutdown();
try {
// Wait a while for existing tasks to terminate
if (!executor.awaitTermination(60, TimeUnit.SECONDS)) {
executor.shutdownNow(); // Cancel currently executing tasks
// Wait a while for tasks to respond to being cancelled
if (!executor.awaitTermination(60, TimeUnit.SECONDS))
System.err.println("Executor did not terminate");
}
} catch (InterruptedException ie) {
// (Re-)Cancel if current thread also interrupted
executor.shutdownNow();
// Preserve interrupt status
Thread.currentThread().interrupt();
}
I'm new to Hadoop, so I have some doubts about what to do in the following case.
I have an algorithm that includes multiple runs of different jobs and sometimes multiple runs of a single job (in a loop).
How should I achieve this: using Oozie, or using Java code? I was looking through the Mahout code and found this in the ClusterIterator class's iterateMR function:
public static void iterateMR(Configuration conf, Path inPath, Path priorPath, Path outPath, int numIterations)
throws IOException, InterruptedException, ClassNotFoundException {
ClusteringPolicy policy = ClusterClassifier.readPolicy(priorPath);
Path clustersOut = null;
int iteration = 1;
while (iteration <= numIterations) {
conf.set(PRIOR_PATH_KEY, priorPath.toString());
String jobName = "Cluster Iterator running iteration " + iteration + " over priorPath: " + priorPath;
Job job = new Job(conf, jobName);
job.setMapOutputKeyClass(IntWritable.class);
job.setMapOutputValueClass(ClusterWritable.class);
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(ClusterWritable.class);
job.setInputFormatClass(SequenceFileInputFormat.class);
job.setOutputFormatClass(SequenceFileOutputFormat.class);
job.setMapperClass(CIMapper.class);
job.setReducerClass(CIReducer.class);
FileInputFormat.addInputPath(job, inPath);
clustersOut = new Path(outPath, Cluster.CLUSTERS_DIR + iteration);
priorPath = clustersOut;
FileOutputFormat.setOutputPath(job, clustersOut);
job.setJarByClass(ClusterIterator.class);
if (!job.waitForCompletion(true)) {
throw new InterruptedException("Cluster Iteration " + iteration + " failed processing " + priorPath);
}
ClusterClassifier.writePolicy(policy, clustersOut);
FileSystem fs = FileSystem.get(outPath.toUri(), conf);
iteration++;
if (isConverged(clustersOut, conf, fs)) {
break;
}
}
Path finalClustersIn = new Path(outPath, Cluster.CLUSTERS_DIR + (iteration - 1) + Cluster.FINAL_ITERATION_SUFFIX);
FileSystem.get(clustersOut.toUri(), conf).rename(clustersOut, finalClustersIn);
}
So, they have a loop in which they run MR jobs. Is this a good approach? I know that Oozie is used for DAGs and can be used with other components, such as Pig, but should I consider using it for something like this?
What if I want to run a clustering algorithm multiple times, say with a specific driver: should I do that in a loop, or using Oozie?
Thanks
If you are looking to run MapReduce jobs only, then you can consider the following options:
1. Chain MR jobs using the MapReduce JobControl API (a rough sketch follows the driver-class example below):
http://hadoop.apache.org/docs/r2.5.0/api/org/apache/hadoop/mapreduce/lib/jobcontrol/JobControl.html
2. Submit multiple MR jobs from a single driver class:
Job job1 = new Job(getConf());
// ... configure job1 ...
job1.waitForCompletion(true);
if (job1.isSuccessful()) {
    // start another job with a different Mapper
    // change the config as needed
    Job job2 = new Job(getConf());
}
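As for option 1, a rough JobControl sketch (the job names, paths and the dependency between the two jobs are illustrative; the classes come from org.apache.hadoop.mapreduce.lib.jobcontrol, plus java.util.Arrays, and the constructors and Thread.sleep throw checked exceptions that the enclosing method must declare or handle):

Job first = new Job(getConf(), "first-job");
// ... configure mapper, input and output paths for 'first' ...
Job second = new Job(getConf(), "second-job");
// ... configure 'second'; its input is typically the output of 'first' ...

ControlledJob controlledFirst = new ControlledJob(first, null);
ControlledJob controlledSecond = new ControlledJob(second, Arrays.asList(controlledFirst)); // runs only after 'first'

JobControl jobControl = new JobControl("my-job-chain");
jobControl.addJob(controlledFirst);
jobControl.addJob(controlledSecond);

// JobControl implements Runnable, so drive it from its own thread and wait for all jobs to finish.
Thread controllerThread = new Thread(jobControl);
controllerThread.start();
while (!jobControl.allFinished()) {
    Thread.sleep(5000);
}
jobControl.stop();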
If you have a complex DAG, or your workflow involves multiple ecosystem tools such as Hive or Pig, then Oozie suits this well.
I have Quartz coded as follows and the first job runs perfectly:
JobDetail jd = null;
CronTrigger ct = null;
jd = new JobDetail("Job1", "Group1", Job1.class);
ct = new CronTrigger("cronTrigger1","Group1","0/5 * * * * ?");
scheduler.scheduleJob(jd, ct);
jd = new JobDetail("Job2", "Group2", Job2.class);
ct = new CronTrigger("cronTrigger2","Group2","0/20 * * * * ?");
scheduler.scheduleJob(jd, ct);
But I'm finding that Job2, which is a completely separate job from Job1, will not execute.
The scheduler is started using a listener in Java. I've also tried using scheduler.addJob(jd, true); but nothing changes. I'm running Java through a JVM on Windows 7.
How do you know the job does not run? If you substitute Job1.class in place of Job2.class, does it still fail? What if you swap the order in which they're added to the scheduler, or only leave Job2? Or if you strip Job2 down to only printing a message to the console?
I suspect Job2 execution dies with an exception.
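One quick way to confirm that (a sketch, assuming you can edit Job2): wrap the body of execute() in a try/catch and log whatever is thrown, so an exception escaping the job does not go unnoticed:

public class Job2 implements Job {
    @Override
    public void execute(JobExecutionContext context) throws JobExecutionException {
        try {
            // ... the real work of Job2 ...
        } catch (Exception e) {
            e.printStackTrace();                 // or log it properly
            throw new JobExecutionException(e);  // let Quartz record the failure as well
        }
    }
}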