I'm new to Hadoop, so I have some doubts about what to do in the following case.
I have an algorithm that includes multiple runs of different jobs and sometimes multiple runs of a single job (in a loop).
How should I achieve this: using Oozie, or using Java code? I was looking through the Mahout code and in the ClusterIterator class I found this:
public static void iterateMR(Configuration conf, Path inPath, Path priorPath, Path outPath, int numIterations)
    throws IOException, InterruptedException, ClassNotFoundException {
  ClusteringPolicy policy = ClusterClassifier.readPolicy(priorPath);
  Path clustersOut = null;
  int iteration = 1;
  while (iteration <= numIterations) {
    conf.set(PRIOR_PATH_KEY, priorPath.toString());
    String jobName = "Cluster Iterator running iteration " + iteration + " over priorPath: " + priorPath;
    Job job = new Job(conf, jobName);
    job.setMapOutputKeyClass(IntWritable.class);
    job.setMapOutputValueClass(ClusterWritable.class);
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(ClusterWritable.class);
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    job.setMapperClass(CIMapper.class);
    job.setReducerClass(CIReducer.class);
    FileInputFormat.addInputPath(job, inPath);
    clustersOut = new Path(outPath, Cluster.CLUSTERS_DIR + iteration);
    priorPath = clustersOut;
    FileOutputFormat.setOutputPath(job, clustersOut);
    job.setJarByClass(ClusterIterator.class);
    if (!job.waitForCompletion(true)) {
      throw new InterruptedException("Cluster Iteration " + iteration + " failed processing " + priorPath);
    }
    ClusterClassifier.writePolicy(policy, clustersOut);
    FileSystem fs = FileSystem.get(outPath.toUri(), conf);
    iteration++;
    if (isConverged(clustersOut, conf, fs)) {
      break;
    }
  }
  Path finalClustersIn = new Path(outPath, Cluster.CLUSTERS_DIR + (iteration - 1) + Cluster.FINAL_ITERATION_SUFFIX);
  FileSystem.get(clustersOut.toUri(), conf).rename(clustersOut, finalClustersIn);
}
So, they have a loop in which they run MR jobs. Is this a good approach? I know that Oozie is used for DAGs and can be used with other components such as Pig, but should I consider using it for something like this?
What if I want to run a clustering algorithm multiple times (using a specific driver): should I do that in a loop, or using Oozie?
Thanks
If you are looking to run MapReduce jobs only, then you can consider the following options:
Chain MR jobs using the MapReduce JobControl API (a minimal sketch is shown after the example below):
http://hadoop.apache.org/docs/r2.5.0/api/org/apache/hadoop/mapreduce/lib/jobcontrol/JobControl.html
Submit multiple MR jobs from a single driver class:
Job job1 = new Job(getConf());
job1.waitForCompletion(true);
if (job1.isSuccessful()) {
    // start another job with a different Mapper.
    // change config
    Job job2 = new Job(getConf());
}
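For the first option, here is a minimal JobControl sketch. The two-step chain, the job names and the runChain() method are placeholders of mine, not taken from your setup; the point is only to show how dependencies and polling work:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

public static void runChain(Configuration conf) throws Exception {
    ControlledJob step1 = new ControlledJob(conf);
    step1.setJob(Job.getInstance(conf, "step-1")); // set mapper/reducer, input/output paths here

    ControlledJob step2 = new ControlledJob(conf);
    step2.setJob(Job.getInstance(conf, "step-2"));
    step2.addDependingJob(step1);                  // step-2 runs only after step-1 succeeds

    JobControl control = new JobControl("my-chain");
    control.addJob(step1);
    control.addJob(step2);

    new Thread(control).start();                   // JobControl implements Runnable
    while (!control.allFinished()) {
        Thread.sleep(5000);                        // poll until every managed job has finished
    }
    control.stop();
}
For an iterative algorithm like the Mahout one above, you would build the next ControlledJob inside your loop instead of declaring them up front.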
If you have a complex DAG or a workflow involving multiple ecosystem tools such as Hive or Pig, then Oozie suits well.
We are creating a Kubernetes Job using the Java Kubernetes client API (v5.12.2), as shown below.
I am stuck in two places. Could someone please help with this?
podList.getItems().size() in the code snippet below sometimes returns zero, even though I can see that the pod gets created, along with the pods of other existing jobs.
How do I specify a particular label for the job's pod?
KubernetesClient kubernetesClient = new DefaultKubernetesClient();
String namespace = System.getenv(POD_NAMESPACE);
String jobName = TextUtils.concatenateToString("flatten" + Constants.HYPHEN + flattenId);
Job jobRequest = createJob(flattenId, authValue);
var jobResult = kubernetesClient.batch().v1().jobs().inNamespace(namespace)
        .create(jobRequest);

PodList podList = kubernetesClient.pods().inNamespace(namespace)
        .withLabel("job-name", jobName).list();

// Wait for pod to complete
var pods = podList.getItems().size();
var terminalPodStatus = List.of("succeeded", "failed");
_LOGGER.info("pods created size:" + pods);
if (pods > 0) {
    // returns zero some times.
    var k8sPod = podList.getItems().get(0);
    var podName = k8sPod.getMetadata().getName();
    kubernetesClient.pods().inNamespace(namespace).withName(podName)
            .waitUntilCondition(pod -> {
                var podPhase = pod.getStatus().getPhase();
                // some logic
                return terminalPodStatus.contains(podPhase.toLowerCase());
            }, JOB_TIMEOUT, TimeUnit.MINUTES);
    kubernetesClient.close();
}
private Job createJob(String flattenId, String authValue) {
    return new JobBuilder()
            .withApiVersion(API_VERSION)
            .withNewMetadata().withName(jobName)
            .withLabels(labels)
            .endMetadata()
            .withNewSpec()
            .withTtlSecondsAfterFinished(300)
            .withBackoffLimit(0)
            .withNewTemplate()
            .withNewMetadata().withAnnotations(LINKERD_INJECT_ANNOTATIONS)
            .endMetadata()
            .withNewSpec()
            .withServiceAccount(Constants.TEST_SERVICEACCOUNT)
            .addNewContainer()
            .addAllToEnv(envVars)
            .withImage(System.getenv(BUILD_JOB_IMAGE))
            .withName("test")
            .withCommand("/bin/bash", "-c", "java -jar test.jar")
            .endContainer()
            .withRestartPolicy(RESTART_POLICY_NEVER)
            .endSpec()
            .endTemplate()
            .endSpec()
            .build();
}
Pods are not created instantly as a consequence of creating a Job: the Job controller needs to become active and create the pods accordingly. Depending on the load on your control plane and the number of Job instances, you may need to wait more or less time.
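One simple workaround is therefore to retry the same label query until the Job controller has created the pod. A rough sketch, reusing the kubernetesClient, namespace and jobName variables from your snippet (the one-minute deadline is arbitrary, and InterruptedException handling is left to the caller):
PodList podList = kubernetesClient.pods().inNamespace(namespace)
        .withLabel("job-name", jobName).list();
long deadline = System.currentTimeMillis() + TimeUnit.MINUTES.toMillis(1);
while (podList.getItems().isEmpty() && System.currentTimeMillis() < deadline) {
    TimeUnit.SECONDS.sleep(1); // give the Job controller time to create the pod
    podList = kubernetesClient.pods().inNamespace(namespace)
            .withLabel("job-name", jobName).list();
}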
I'm writing a small Java program which uses PsExec.exe, launched from cmd via ProcessBuilder, to copy and install an application on networked PCs (the number of PCs that need to be installed can vary from 5 to 50).
The program works fine if I launch ProcessBuilder for each PC sequentially.
However, to speed things up, I would like to implement some form of multithreading which would allow me to install 5 PCs at a time concurrently (one "batch" of 5 ProcessBuilder processes until all PCs have been installed).
I was thinking of using a fixed thread pool in combination with the Callable interface (each execution of PsExec returns a value which indicates whether the execution was successful and which I have to evaluate).
The code used for the ProcessBuilder is:
// Start iterating over all PCs in the list:
for (String pc : pcList)
{
    counter++;
    logger.info("Starting the installation of remote pc: " + pc);
    updateMessage("Starting the installation of remote pc: " + pc);
    int exitVal = 99;
    logger.debug("Exit Value set to 99");
    try
    {
        ProcessBuilder pB = new ProcessBuilder();
        pB.command("cmd", "/c",
                "\"" + psExecPath + "\"" + " \\\\" + pc + userName + userPassword + " -c" + " -f" + " -h" + " -n 60 " +
                "\"" + forumViewerPath + "\"" + " -q " + forumAddress + remotePath + "-overwrite");
        logger.debug(pB.command().toString());
        pB.redirectError();
        Process p = pB.start();
        InputStream stErr = p.getErrorStream();
        InputStreamReader esr = new InputStreamReader(stErr);
        BufferedReader bre = new BufferedReader(esr);
        String line = null;
        line = bre.readLine();
        while (line != null)
        {
            if (!line.equals(""))
                logger.info(line);
            line = bre.readLine();
        }
        exitVal = p.waitFor();
    } catch (IOException ex)
    {
        logger.info("Exception occurred during installation of PC: \n" + pc + "\n " + ex);
        notInstalledPc.add(pc);
    }
    if (exitVal != 0)
    {
        notInstalledPc.add(pc);
        ret = exitVal;
        updateMessage("");
        updateMessage("The remote pc: " + pc + " was not installed");
        logger.info("The remote pc: " + pc + " was not installed. The error message returned was: \n" + getError(exitVal) + "\nProcess exit code was: " + exitVal);
    }
    else
    {
        updateMessage("");
        updateMessage("The remote pc: " + pc + " was successfully installed");
        logger.info("The remote pc: " + pc + " was successfully installed");
    }
Now I've read some info on how to implement Callable, and I would like to wrap my ProcessBuilder in a Callable and then submit all the tasks for running in the for loop.
Am I on the right track?
You can surely do that. I suppose you want to use Callable instead of Runnable to get the result of your exitVal?
It doesn't seem like you have any shared data between your threads, so I think you should be fine. Since you even know how many Callables you are going to create, you could build a collection of your Callables and then do
List<Future<SomeType>> results = pool.invokeAll(collection);
This would allow for easier handling of your results. Probably the most important thing you need to figure out when deciding whether or not to use a thread pool is what to do if your program terminates while threads are still running: do you HAVE to finish what you're doing in the threads, do you need seamless handling of errors, etc.
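As a rough sketch (buildPsExecCommand(pc) is a stand-in for the command list you already assemble in your loop, and checked-exception handling is reduced to a throws clause):
// needs: java.io.*, java.util.*, java.util.concurrent.*
List<Integer> installAll(List<String> pcList) throws InterruptedException, ExecutionException {
    List<Callable<Integer>> tasks = new ArrayList<>();
    for (String pc : pcList) {
        tasks.add(() -> {
            ProcessBuilder pB = new ProcessBuilder(buildPsExecCommand(pc)); // placeholder helper
            pB.redirectErrorStream(true);
            Process p = pB.start();
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
                reader.lines().forEach(logger::info); // drain output so PsExec cannot block
            }
            return p.waitFor();                       // exit value becomes the Callable's result
        });
    }

    ExecutorService pool = Executors.newFixedThreadPool(5);
    List<Future<Integer>> futures = pool.invokeAll(tasks); // blocks until every task has finished
    pool.shutdown();

    List<Integer> exitValues = new ArrayList<>();
    for (Future<Integer> f : futures) {
        exitValues.add(f.get()); // invokeAll preserves order, so index i matches pcList.get(i)
    }
    return exitValues;
}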
Check out the Java thread pools doc: https://docs.oracle.com/javase/tutorial/essential/concurrency/pools.html
or search the web; there are tons of posts/blogs about when (or when not) to use thread pools.
But it seems like you're on the right track!
Thank you for your reply! It definitely put me on the right track. I ended up implementing it like this:
ThreadPoolExecutor executor = (ThreadPoolExecutor) Executors.newFixedThreadPool(5); //NEW
List<Future<List<String>>> resultList = new ArrayList<>();

updateMessage("Starting the installation of all remote pc entered...");

// Start iterating over all PCs in the list:
for (String pc : pcList)
{
    counter++;
    logger.debug("Starting the installation of remote pc: " + pc);
    psExe p = new psExe(pc);
    Future<List<String>> result = executor.submit(p); //NEW
    resultList.add(result);
}

for (Future<List<String>> future : resultList)
{.......
In the last for loop I read the results of my operations and write them on screen, or act according to the result returned.
I still have a couple of questions, as it is not really clear to me:
1 - If I have 20 PCs and submit all the Callable tasks to the pool in my first for loop, do I understand correctly that only 5 threads will be started (thread pool size = 5), but all the tasks will already be created and queued, and as soon as a running thread completes and returns its result the next task will automatically start, until all PCs are finished?
2 - What is the difference (advantage) of using invokeAll() as you suggested, compared to the method I used (submit() inside the for loop)?
Thank you once more for your help... I really love this Java stuff!! ;-)
Is it possible to get information about already executed (finished) jobs? I browsed the Javadocs and learned how to fetch JobDetails etc., but I can't find a way to learn about jobs that have already been executed (and finished).
Any hints?
You can get the next trigger time using the code below and compare it with the current time; if the next fire time is in the past, then the job has already executed:
Scheduler scheduler = new StdSchedulerFactory().getScheduler();
for (String groupName : scheduler.getJobGroupNames()) {
    for (JobKey jobKey : scheduler.getJobKeys(GroupMatcher.jobGroupEquals(groupName))) {
        String jobName = jobKey.getName();
        String jobGroup = jobKey.getGroup();

        // get the job's triggers
        List<Trigger> triggers = (List<Trigger>) scheduler.getTriggersOfJob(jobKey);
        Date nextFireTime = triggers.get(0).getNextFireTime();
        Date currTime = new Date();

        if (currTime.after(nextFireTime)) {
            System.out.println("[jobName] : " + jobName + " [groupName] : "
                    + jobGroup + " - has already executed");
        }
    }
}
If you want to keep a detailed history of all executions of your jobs, then you simply have to implement something that records this information yourself. You can use listeners for this purpose.
Depending on what exactly you're trying to accomplish, you may use JobListeners, TriggerListeners or SchedulerListeners.
For 'global' JobListeners:
<initialize JobListeners>
public void jobWasExecuted(JobExecutionContext context, JobExecutionException jobException) {
    try
    {
        jobKey = context.getJobDetail().getKey();
        schedulerName = context.getScheduler().getSchedulerName();
        jobName = jobKey.getName();
        groupName = jobKey.getGroup();

        // execution start
        Date startDate = context.getFireTime();

        // execution time
        long runTime = context.getJobRunTime();

        // execution end
        long endDateM = startDate.getTime() + runTime;
        Date endDate = new Date(endDateM);

        // get more information here
    }
    catch (Exception e)
    {
        e.printStackTrace();
    }
}
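For completeness, the <initialize JobListeners> step above would roughly look like this (a sketch; exception handling is omitted, and MyJobHistoryListener is just a placeholder name for whatever JobListener implementation contains the jobWasExecuted() method shown):
import org.quartz.Scheduler;
import org.quartz.impl.StdSchedulerFactory;
import org.quartz.impl.matchers.EverythingMatcher;

Scheduler scheduler = new StdSchedulerFactory().getScheduler();
// "Global" listener: receives jobWasExecuted() callbacks for every job.
scheduler.getListenerManager().addJobListener(
        new MyJobHistoryListener(),      // placeholder for your JobListener implementation
        EverythingMatcher.allJobs());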
Note: Please be wary of the performance impact listeners can cause. As mentioned in the Quartz docs:
One thing that CAN slow down Quartz itself is using a lot of listeners (TriggerListeners, JobListeners, and SchedulerListeners). The time spent in each listener obviously adds into the time spent "processing" a job's execution, outside of the actual execution of the job.
This doesn't mean that you should be terrified of using listeners, it just means that you should use them judiciously - don't create a bunch of "global" listeners if you can really make more specialized ones. Also don't do "expensive" things in the listeners, unless you really need to. Also be mindful that many plug-ins (such as the "history" plugin) are actually listeners.
I have a plain text file with possibly millions of lines which needs custom parsing, and I want to load it into an HBase table as fast as possible (using Hadoop or the HBase Java client).
My current solution is based on a MapReduce job without the Reduce part. I use FileInputFormat to read the text file so that each line is passed to the map method of my Mapper class. At this point the line is parsed to form a Put object which is written to the context. Then, TableOutputFormat takes the Put object and inserts it to table.
This solution yields an average insertion rate of 1,000 rows per second, which is less than what I expected. My HBase setup is in pseudo distributed mode on a single server.
One interesting thing is that during insertion of 1,000,000 rows, 25 Mappers (tasks) are spawned but they run serially (one after another); is this normal?
Here is the code for my current solution:
public static class CustomMap extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

    protected void map(LongWritable key, Text value, Context context) throws IOException {
        Map<String, String> parsedLine = parseLine(value.toString());
        Put row = new Put(Bytes.toBytes(parsedLine.get(keys[1])));
        for (String currentKey : parsedLine.keySet()) {
            row.add(Bytes.toBytes(currentKey), Bytes.toBytes(currentKey), Bytes.toBytes(parsedLine.get(currentKey)));
        }
        try {
            context.write(new ImmutableBytesWritable(Bytes.toBytes(parsedLine.get(keys[1]))), row);
        } catch (InterruptedException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}
public int run(String[] args) throws Exception {
    if (args.length != 2) {
        return -1;
    }

    conf.set("hbase.mapred.outputtable", args[1]);

    // I got these conf parameters from a presentation about Bulk Load
    conf.set("hbase.hstore.blockingStoreFiles", "25");
    conf.set("hbase.hregion.memstore.block.multiplier", "8");
    conf.set("hbase.regionserver.handler.count", "30");
    conf.set("hbase.regions.percheckin", "30");
    conf.set("hbase.regionserver.globalMemcache.upperLimit", "0.3");
    conf.set("hbase.regionserver.globalMemcache.lowerLimit", "0.15");

    Job job = new Job(conf);
    job.setJarByClass(BulkLoadMapReduce.class);
    job.setJobName(NAME);
    TextInputFormat.setInputPaths(job, new Path(args[0]));
    job.setInputFormatClass(TextInputFormat.class);
    job.setMapperClass(CustomMap.class);
    job.setOutputKeyClass(ImmutableBytesWritable.class);
    job.setOutputValueClass(Put.class);
    job.setNumReduceTasks(0);
    job.setOutputFormatClass(TableOutputFormat.class);
    job.waitForCompletion(true);
    return 0;
}

public static void main(String[] args) throws Exception {
    Long startTime = Calendar.getInstance().getTimeInMillis();
    System.out.println("Start time : " + startTime);

    int errCode = ToolRunner.run(HBaseConfiguration.create(), new BulkLoadMapReduce(), args);

    Long endTime = Calendar.getInstance().getTimeInMillis();
    System.out.println("End time : " + endTime);
    System.out.println("Duration milliseconds: " + (endTime - startTime));

    System.exit(errCode);
}
I've gone through a process that is probably very similar to yours of attempting to find an efficient way to load data from MR into HBase. What I found to work is using HFileOutputFormat as the OutputFormatClass of the MR job.
Below is the basis of the code I use to generate the job, and the Mapper map function which writes out the data. This was fast. We don't use it anymore, so I don't have numbers on hand, but it was around 2.5 million records in under a minute.
Here is the (stripped down) function I wrote to generate the job for my MapReduce process to put data into HBase:
private Job createCubeJob(...) {
    // Build and Configure Job
    Job job = new Job(conf);

    job.setJobName(jobName);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class);
    job.setMapperClass(HiveToHBaseMapper.class); // Custom Mapper
    job.setJarByClass(CubeBuilderDriver.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(HFileOutputFormat.class);

    TextInputFormat.setInputPaths(job, hiveOutputDir);
    HFileOutputFormat.setOutputPath(job, cubeOutputPath);

    Configuration hConf = HBaseConfiguration.create(conf);
    hConf.set("hbase.zookeeper.quorum", hbaseZookeeperQuorum);
    hConf.set("hbase.zookeeper.property.clientPort", hbaseZookeeperClientPort);

    HTable hTable = new HTable(hConf, tableName);
    HFileOutputFormat.configureIncrementalLoad(job, hTable);
    return job;
}
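One detail that is easy to miss and is not shown in the snippet above: HFileOutputFormat only writes HFiles to the output path, so once the job finishes they still have to be loaded into the table. A sketch of that step, reusing hConf, cubeOutputPath and hTable from the method above (exception handling omitted; the completebulkload command-line tool does the same thing):
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

LoadIncrementalHFiles loader = new LoadIncrementalHFiles(hConf);
loader.doBulkLoad(cubeOutputPath, hTable); // moves the generated HFiles into the table's regions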
This is my map function from the HiveToHBaseMapper class (slightly edited).
public void map(WritableComparable key, Writable val, Context context)
        throws IOException, InterruptedException {
    try {
        Configuration config = context.getConfiguration();
        String[] strs = val.toString().split(Constants.HIVE_RECORD_COLUMN_SEPARATOR);
        String family = config.get(Constants.CUBEBUILDER_CONFIGURATION_FAMILY);
        String column = strs[COLUMN_INDEX];
        double value = Double.parseDouble(strs[VALUE_INDEX]);
        String sKey = generateKey(strs, config);
        byte[] bKey = Bytes.toBytes(sKey);
        Put put = new Put(bKey);
        put.add(Bytes.toBytes(family), Bytes.toBytes(column), (value <= 0)
                ? Bytes.toBytes(Double.MIN_VALUE)
                : Bytes.toBytes(value));
        ImmutableBytesWritable ibKey = new ImmutableBytesWritable(bKey);
        context.write(ibKey, put);

        context.getCounter(CubeBuilderContextCounters.CompletedMapExecutions).increment(1);
    }
    catch (Exception e) {
        context.getCounter(CubeBuilderContextCounters.FailedMapExecutions).increment(1);
    }
}
I'm pretty sure this isn't going to be a copy & paste solution for you. Obviously the data I was working with here didn't need any custom processing (that was done in an MR job before this one). The main thing I want to provide out of this is the HFileOutputFormat. The rest is just an example of how I used it. :)
I hope it gets you onto a solid path to a good solution. :)
One interesting thing is that during insertion of 1,000,000 rows, 25 Mappers (tasks) are spawned but they run serially (one after another); is this normal?
The mapreduce.tasktracker.map.tasks.maximum parameter, which defaults to 2, determines the maximum number of map tasks that can run in parallel on a node. Unless it is changed, you should see 2 map tasks running simultaneously on each node.
I have an application that only implements a Map function.
I'm creating 1000 jobs, each with a unique PrefixFilter.
Example:
public void startNewScan(String prefix, long endTime) throws Exception {
    Job job = new Job(conf, "MyJob");
    job.setNumReduceTasks(0);

    Scan scan = new Scan();
    scan.setTimeRange(0, endTime);
    scan.addColumn(Bytes.toBytes("col"), Bytes.toBytes("Value"));
    scan.setFilter(new PrefixFilter(prefix.getBytes()));

    TableMapReduceUtil.initTableMapperJob(tableName, scan, ExtractMapper.class, ImmutableBytesWritable.class, Result.class, job);
    job.waitForCompletion(true);
}
Now, I don't want to wait for completion, because executing 1000 jobs sequentially would take forever. Creating a thread for each job is also not an option.
Is there anything built in for this use case?
Something like a JobsPool that accepts all the jobs and has its own waitForCompletion for all of them.
Use:
job.submit();
"Submit the job to the cluster and return immediately."