How to submit a MapReduce job with the YARN API in Java

I want to submit my MR job using the YARN Java API. I tried to do it as described in WritingYarnApplications, but I don't know what to add to the amContainer. Below is the code I have written:
package org.apache.hadoop.examples;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.protocolrecords.GetNewApplicationResponse;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.util.Records;
import org.mortbay.util.ajax.JSON;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class YarnJob {
    private static Logger logger = LoggerFactory.getLogger(YarnJob.class);

    public static void main(String[] args) throws Throwable {
        Configuration conf = new Configuration();
        YarnClient client = YarnClient.createYarnClient();
        client.init(conf);
        client.start();
        System.out.println(JSON.toString(client.getAllQueues()));
        System.out.println(JSON.toString(client.getConfig()));
        //System.out.println(JSON.toString(client.getApplications()));
        System.out.println(JSON.toString(client.getYarnClusterMetrics()));
        YarnClientApplication app = client.createApplication();
        GetNewApplicationResponse appResponse = app.getNewApplicationResponse();
        ApplicationId appId = appResponse.getApplicationId();
        // Create launch context for app master
        ApplicationSubmissionContext appContext = Records.newRecord(ApplicationSubmissionContext.class);
        // set the application id
        appContext.setApplicationId(appId);
        // set the application name
        appContext.setApplicationName("test");
        // Set the queue to which this application is to be submitted in the RM
        appContext.setQueue("default");
        // Set up the container launch context for the application master
        ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
        //amContainer.setLocalResources();
        //amContainer.setCommands();
        //amContainer.setEnvironment();
        appContext.setAMContainerSpec(amContainer);
        appContext.setResource(Resource.newInstance(1024, 1));
        appContext.setApplicationType("MAPREDUCE");
        // Submit the application to the applications manager
        client.submitApplication(appContext);
        //client.stop();
    }
}
I can run a MapReduce job properly from the command-line interface:
hadoop jar wordcount.jar org.apache.hadoop.examples.WordCount /user/admin/input /user/admin/output/
But how can I submit this wordcount job via the YARN Java API?

You do not use the YarnClient to submit the job; instead, use the MapReduce APIs to submit the job. See this link for an example.
However, if you need more control over the job, such as getting the completion status or the progress of the mapper and reducer phases, you can use
job.submit();
instead of
job.waitForCompletion(true)
You can then use job.mapProgress() and job.reduceProgress() to get the status. There are lots of methods on the Job object which you can explore.
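A minimal sketch of that approach, reusing the mapper and reducer classes that ship with the Hadoop examples WordCount and the paths from your command line (treat this as an illustration rather than a drop-in driver; it assumes the cluster's configuration files are on the classpath so the job is submitted to YARN rather than run locally):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.examples.WordCount;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SubmitWordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/user/admin/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/admin/output"));

        job.submit(); // returns immediately, unlike waitForCompletion(true)
        while (!job.isComplete()) {
            System.out.printf("map %.0f%% reduce %.0f%%%n",
                    job.mapProgress() * 100, job.reduceProgress() * 100);
            Thread.sleep(2000);
        }
        System.exit(job.isSuccessful() ? 0 : 1);
    }
}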
As for your query about:
hadoop jar wordcount.jar org.apache.hadoop.examples.WordCount /user/admin/input /user/admin/output/
What's happening here is that you are running the driver program contained in wordcount.jar. Instead of doing "java -jar wordcount.jar" you are using "hadoop jar wordcount.jar" (you could just as well use "yarn jar wordcount.jar"). Hadoop/YARN sets up the necessary additional classpath entries compared to the plain java -jar command. This executes the main() of your driver program, which lives in the class org.apache.hadoop.examples.WordCount, as specified in the command.
You can check out the source here: Source for WordCount class
The only reason I would assume you want to submit a job via YARN is to integrate it with some kind of service which kicks off MapReduce2 jobs on certain events.
For this you can always have your driver's main() look something like this:
public class MyMapReduceDriver extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        /******/
        int errCode = ToolRunner.run(conf, new MyMapReduceDriver(), args);
        System.exit(errCode);
    }

    @Override
    public int run(String[] args) throws Exception {
        while (true) {
            try {
                runMapReduceJob();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    private void runMapReduceJob() throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");
        /******/
        job.submit();
        // Poll the job state and print the map/reduce progress
        while (job.getJobState() == JobStatus.State.RUNNING || job.getJobState() == JobStatus.State.PREP) {
            Thread.sleep(1000);
            System.out.println(" Map: " + StringUtils.formatPercent(job.mapProgress(), 0)
                    + " Reducer: " + StringUtils.formatPercent(job.reduceProgress(), 0));
        }
    }
}
Hope this helps.

Related

Create a Java API that will manually trigger already-created Kubernetes jobs

I have a job already running in Kubernetes which is scheduled for 4 hours. But I need to write a Java API so that whenever I want to run the job I just need to call this API and it runs the job.
Please help me solve this requirement.
There are two ways: either you run your application in a Pod which creates the Job for you, or you write a Java API so that when you hit its endpoint, it creates the Job at that time.
For creating the Job, you can use a Java Kubernetes client library, for example the fabric8 kubernetes-client (used in the example below) or the official Kubernetes Java client.
package io.fabric8.kubernetes.examples;

import io.fabric8.kubernetes.api.model.PodList;
import io.fabric8.kubernetes.api.model.batch.v1.Job;
import io.fabric8.kubernetes.api.model.batch.v1.JobBuilder;
import io.fabric8.kubernetes.client.ConfigBuilder;
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.Collections;
import java.util.concurrent.TimeUnit;

/*
 * Creates a simple run to complete job that computes π to 2000 places and prints it out.
 */
public class JobExample {
    private static final Logger logger = LoggerFactory.getLogger(JobExample.class);

    public static void main(String[] args) {
        final ConfigBuilder configBuilder = new ConfigBuilder();
        if (args.length > 0) {
            configBuilder.withMasterUrl(args[0]);
        }
        try (KubernetesClient client = new DefaultKubernetesClient(configBuilder.build())) {
            final String namespace = "default";
            final Job job = new JobBuilder()
                .withApiVersion("batch/v1")
                .withNewMetadata()
                    .withName("pi")
                    .withLabels(Collections.singletonMap("label1", "maximum-length-of-63-characters"))
                    .withAnnotations(Collections.singletonMap("annotation1", "some-very-long-annotation"))
                .endMetadata()
                .withNewSpec()
                    .withNewTemplate()
                        .withNewSpec()
                            .addNewContainer()
                                .withName("pi")
                                .withImage("perl")
                                .withArgs("perl", "-Mbignum=bpi", "-wle", "print bpi(2000)")
                            .endContainer()
                            .withRestartPolicy("Never")
                        .endSpec()
                    .endTemplate()
                .endSpec()
                .build();

            logger.info("Creating job pi.");
            client.batch().v1().jobs().inNamespace(namespace).createOrReplace(job);

            // Get all pods created by the job
            PodList podList = client.pods().inNamespace(namespace).withLabel("job-name", job.getMetadata().getName()).list();

            // Wait for the pod to complete
            client.pods().inNamespace(namespace).withName(podList.getItems().get(0).getMetadata().getName())
                .waitUntilCondition(pod -> pod.getStatus().getPhase().equals("Succeeded"), 1, TimeUnit.MINUTES);

            // Print the job's log
            String joblog = client.batch().v1().jobs().inNamespace(namespace).withName("pi").getLog();
            logger.info(joblog);
        } catch (KubernetesClientException e) {
            logger.error("Unable to create job", e);
        }
    }
}
Option 2:
You can also apply a YAML file, using the official Kubernetes Java client:
// Assumed imports (package names for recent versions of the official client):
// io.kubernetes.client.openapi.ApiClient, io.kubernetes.client.openapi.Configuration,
// io.kubernetes.client.openapi.ApiResponse, io.kubernetes.client.openapi.apis.BatchV1Api,
// io.kubernetes.client.openapi.models.V1Job,
// io.kubernetes.client.util.ClientBuilder, io.kubernetes.client.util.Yaml
ApiClient client = ClientBuilder.cluster().build(); // create an in-cluster client
Configuration.setDefaultApiClient(client);
BatchV1Api api = new BatchV1Api(client);
V1Job job = (V1Job) Yaml.load(new File("<YAML file path>.yaml")); // load the Job from a static YAML file
ApiResponse<V1Job> response = api.createNamespacedJobWithHttpInfo("default", job, "true", null, null);
I had the same question, since my team and I needed to develop a web application that makes it possible for any user to start a new execution of our jobs.
I have a job already running in Kubernetes which is scheduled for 4 hours.
If I'm not mistaken, it's not possible to schedule a Job on Kubernetes; you need to create a Job from a CronJob, and that's our case.
We have several CronJobs scheduled to start throughout the day, but sometimes they also need to be started again, after an error or something else.
After some research, I decided to use the Kubernetes-client library.
When I needed to trigger a job manually, I used to use the kubectl CLI: kubectl create job batch-demo-job --from=cronjob/batch-demo-cronjob -n ns-batch-demo, so I was looking for a way to do the same thing programmatically.
According to an issue opened on the Kubernetes-client GitHub, it is not possible to do that directly; you need to fetch your CronJob and then use its spec to create your Job.
So I made a POC and it works as expected; it follows the same logic, but in a more friendly way.
In this example, I just need the CronJob spec to get the volume tag.
fun createJobFromACronJob(namespace: String) {
    val client = Config.defaultClient()
    Configuration.setDefaultApiClient(client)
    try {
        val api = BatchV1Api(client)
        val cronJob = api.readNamespacedCronJob("$namespace-cronjob", namespace, "true")
        val job = api.createNamespacedJob(namespace, createJobSpec(cronJob), "true", null, null, null)
    } catch (ex: ApiException) {
        System.err.println("Exception when calling BatchV1Api#createNamespacedJob")
        System.err.println("Status Code: ${ex.code}")
        System.err.println("Reason: ${ex.responseBody}")
        System.err.println("Response Header: ${ex.responseHeaders}")
        ex.printStackTrace()
    }
}

private fun createJobSpec(cronJob: V1CronJob): V1Job {
    val namespace = cronJob.metadata!!.namespace!!
    return V1Job()
        .apiVersion("batch/v1")
        .kind("Job")
        .metadata(
            V1ObjectMeta()
                .name("$namespace-job")
                .namespace(namespace)
                .putLabelsItem("app.kubernetes.io/team", "Jonas-pangare")
                .putLabelsItem("app.kubernetes.io/name", namespace.uppercase())
                .putLabelsItem("app.kubernetes.io/part-of", "SINC")
                .putLabelsItem("app.kubernetes.io/tier", "batch")
                .putLabelsItem("app.kubernetes.io/managed-by", "kubectl")
                .putLabelsItem("app.kubernetes.io/built-by", "sinc-monitoracao")
        )
        .spec(
            V1JobSpec()
                .template(
                    podTemplate(cronJob, namespace)
                )
                .backoffLimit(0)
        )
}

private fun podTemplate(cronJob: V1CronJob, namespace: String): V1PodTemplateSpec {
    return V1PodTemplateSpec()
        .spec(
            V1PodSpec()
                .restartPolicy("Never")
                .addContainersItem(
                    V1Container()
                        .name(namespace)
                        .image(namespace)
                        .imagePullPolicy("Never")
                        .addEnvItem(V1EnvVar().name("TZ").value("America/Sao_Paulo"))
                        .addEnvItem(V1EnvVar().name("JOB_NAME").value("helloWorldJob"))
                )
                .volumes(cronJob.spec!!.jobTemplate.spec!!.template.spec!!.volumes)
        )
}
You can also use the Kubernetes client from fabric8; it's great too, and easier to use.
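For comparison, a rough fabric8 equivalent of the POC above might look like the sketch below (untested; it assumes a fabric8 version that exposes CronJob under client.batch().v1().cronjobs(), and the namespace and names are placeholders taken from the kubectl example above):
import io.fabric8.kubernetes.api.model.batch.v1.CronJob;
import io.fabric8.kubernetes.api.model.batch.v1.Job;
import io.fabric8.kubernetes.api.model.batch.v1.JobBuilder;
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClient;

public class JobFromCronJob {
    public static void main(String[] args) {
        String namespace = "ns-batch-demo"; // placeholder
        try (KubernetesClient client = new DefaultKubernetesClient()) {
            // Fetch the existing CronJob
            CronJob cronJob = client.batch().v1().cronjobs()
                    .inNamespace(namespace)
                    .withName("batch-demo-cronjob") // placeholder
                    .get();
            // Build a one-off Job that reuses the CronJob's job template
            // (containers, volumes, env, etc.), similar to
            // `kubectl create job --from=cronjob/batch-demo-cronjob`
            Job job = new JobBuilder()
                    .withNewMetadata()
                        .withName("batch-demo-job-manual") // placeholder
                        .withNamespace(namespace)
                    .endMetadata()
                    .withSpec(cronJob.getSpec().getJobTemplate().getSpec())
                    .build();
            client.batch().v1().jobs().inNamespace(namespace).createOrReplace(job);
        }
    }
}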

How do I submit a job to a Flink cluster using Java code?

I have already uploaded a fat jar containing my application code to the /lib folder of all nodes in my Flink cluster. I am trying to start the Flink job from a separate java application, but can't find a good way to do so.
The closest thing to a solution that I have currently found is the Monitoring Rest API which has a run job API. However, this only allows you to run jobs submitted via the job upload function.
I have seen the ClusterClient.java in the flink-client module, but could not see any examples of how I might use this.
Any examples of how someone has submitted jobs successfully through java code would be greatly appreciated!
You can use RestClusterClient to run a PackagedProgram which points to your Flink job. If your job accepts some arguments, you can pass them.
Here is an example for a standalone cluster running on localhost:8081:
// import org.apache.flink.api.common.JobSubmissionResult;
// import org.apache.flink.client.deployment.StandaloneClusterId;
// import org.apache.flink.client.program.PackagedProgram;
// import org.apache.flink.client.program.rest.RestClusterClient;
// import org.apache.flink.configuration.Configuration;
// import org.apache.flink.configuration.JobManagerOptions;
// import org.apache.flink.configuration.RestOptions;
String clusterHost = "localhost";
int clusterPort = 8081;
Configuration config = new Configuration();
config.setString(JobManagerOptions.ADDRESS, clusterHost);
config.setInteger(RestOptions.PORT, clusterPort);
String jarFilePath = "/opt/flink/examples/streaming/SocketWindowWordCount.jar";
String[] args = new String[]{ "--port", "9000" };
PackagedProgram packagedProgram = new PackagedProgram(new File(jarFilePath), args);
RestClusterClient<StandaloneClusterId> client =
new RestClusterClient<StandaloneClusterId>(config, StandaloneClusterId.getInstance());
int parallelism = 1;
JobSubmissionResult result = client.run(packagedProgram, parallelism);
This seems to work for version 1.10
private static final int PARALLELISM = 8;
private static final Configuration FLINK_CONFIG = new Configuration();
void foo() throws Exception {
FLINK_CONFIG.setString(JobManagerOptions.ADDRESS, "localhost");
FLINK_CONFIG.setInteger(RestOptions.PORT, 8081);
FLINK_CONFIG.setInteger(RestOptions.RETRY_MAX_ATTEMPTS, 3);
RestClusterClient<StandaloneClusterId> flinkClient = new RestClusterClient<>(FLINK_CONFIG, StandaloneClusterId.getInstance());
String jar = "/path/to/jar";
String[] args = new String[]{"..."};
PackagedProgram program = PackagedProgram.newBuilder()
.setJarFile(new File(jar))
.setArguments(args)
.build();
JobGraph jobGraph = PackagedProgramUtils.createJobGraph(program, FLINK_CONFIG, PARALLELISM, false);
JobID jobId = flinkClient.submitJob(jobGraph).get();
...
}
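If you also need to wait for the submitted job to finish, something like the following could be appended inside foo() (a sketch only, not verified against every release; JobStatus lives in org.apache.flink.runtime.jobgraph in Flink 1.10 and in org.apache.flink.api.common in later versions):
// Poll the cluster until the job reaches a globally terminal state
// (FINISHED, FAILED or CANCELED), then close the client.
JobStatus status = flinkClient.getJobStatus(jobId).get();
while (!status.isGloballyTerminalState()) {
    Thread.sleep(1_000);
    status = flinkClient.getJobStatus(jobId).get();
}
System.out.println("Job " + jobId + " finished with status " + status);
flinkClient.close();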

Using AWS CLI from batch script appears to prevent Process.waitFor from completing

I am trying to run AWS-CLI from a batch script to sync files with S3, then automatically close the cmd window.
In all my batch scripts that don't involve the AWS CLI, the Process.waitFor method causes the cmd window to exit automatically when the process finishes, but this is not the case when I have an AWS CLI command in there.
The S3 Sync will finish and I will be left with an open cmd window, and the program will not continue until I manually close it.
Is there something special I need to do in order to make Process.waitFor work in this case, or otherwise automatically close the cmd window upon script completion?
This question is unique because the command normally returns just fine, but does not in the specific case of using the AWS CLI.
You're probably not reading the process output, so it's blocked trying to write to stdout.
This works for me:
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.concurrent.CompletableFuture;

public class S3SyncProcess {
    public static void main(String[] args) throws IOException, InterruptedException {
        // sync dir
        Process process = Runtime.getRuntime().exec(
            new String[] {"aws", "s3", "sync", "dir", "s3://my.bucket"}
        );
        CompletableFuture.runAsync(() -> pipe(process.getInputStream(), System.out));
        CompletableFuture.runAsync(() -> pipe(process.getErrorStream(), System.err));
        // Wait for exit
        System.exit(process.waitFor());
    }

    private static void pipe(InputStream in, OutputStream out) {
        int c;
        try {
            while ((c = in.read()) != -1) {
                out.write(c);
            }
        } catch (IOException e) {
            // ignore
        }
    }
}
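Alternatively, if you do not need to capture the output in Java at all, ProcessBuilder can hand the child's streams straight to the parent console, which also avoids the blocked-pipe problem (a minimal sketch, assuming the same aws s3 sync command as above):
import java.io.IOException;

public class S3SyncInherit {
    public static void main(String[] args) throws IOException, InterruptedException {
        // inheritIO() connects the child's stdin/stdout/stderr to this JVM's streams,
        // so the pipe buffers can never fill up and block the AWS CLI.
        Process process = new ProcessBuilder("aws", "s3", "sync", "dir", "s3://my.bucket")
                .inheritIO()
                .start();
        System.exit(process.waitFor());
    }
}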

During an HBase scan with MapReduce, the number of reducers is always one

I do an HBase scan in the Mapper, then the Reducer writes the results to HDFS.
The number of records output by the mapper is roughly 1,000,000,000.
The problem is that the number of reducers is always one, even though I have set -Dmapred.reduce.tasks=100, so the reduce phase is very slow.
// edit at 2016-12-04 by 祝方泽
the code of my main class:
public class GetUrlNotSent2SpiderFromHbase extends Configured implements Tool {

    public int run(String[] arg0) throws Exception {
        Configuration conf = getConf();
        Job job = new Job(conf, conf.get("mapred.job.name"));
        String input_table = conf.get("input.table");
        job.setJarByClass(GetUrlNotSent2SpiderFromHbase.class);

        Scan scan = new Scan();
        scan.setCaching(500);
        scan.setCacheBlocks(false);
        scan.addColumn(Bytes.toBytes("info"), Bytes.toBytes("sitemap_type"));
        scan.addColumn(Bytes.toBytes("info"), Bytes.toBytes("is_send_to_spider"));

        TableMapReduceUtil.initTableMapperJob(
                input_table,
                scan,
                GetUrlNotSent2SpiderFromHbaseMapper.class,
                Text.class,
                Text.class,
                job);
        /*job.setMapperClass(GetUrlNotSent2SpiderFromHbaseMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);*/
        job.setReducerClass(GetUrlNotSent2SpiderFromHbaseReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        if (job.waitForCompletion(true) && job.isSuccessful()) {
            return 0;
        }
        return -1;
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        int res = ToolRunner.run(conf, new GetUrlNotSent2SpiderFromHbase(), args);
        System.exit(res);
    }
}
Here is the script to run this MapReduce job:
table="xxx"
output="yyy"
sitemap_type="zzz"
JOBCONF=""
JOBCONF="${JOBCONF} -Dmapred.job.name=test_for_scan_hbase"
JOBCONF="${JOBCONF} -Dinput.table=$table"
JOBCONF="${JOBCONF} -Dmapred.output.dir=$output"
JOBCONF="${JOBCONF} -Ddemand.sitemap.type=$sitemap_type"
JOBCONF="${JOBCONF} -Dyarn.app.mapreduce.am.command-opts='-Xmx8192m'"
JOBCONF="${JOBCONF} -Dyarn.app.mapreduce.am.resource.mb=9216"
JOBCONF="${JOBCONF} -Dmapreduce.map.java.opts='-Xmx1536m'"
JOBCONF="${JOBCONF} -Dmapreduce.map.memory.mb=2048"
JOBCONF="${JOBCONF} -Dmapreduce.reduce.java.opts='-Xmx1536m'"
JOBCONF="${JOBCONF} -Dmapreduce.reduce.memory.mb=2048"
JOBCONF="${JOBCONF} -Dmapred.reduce.tasks=100"
JOBCONF="${JOBCONF} -Dmapred.job.priority=VERY_HIGH"
hadoop fs -rmr $output
hadoop jar get_url_not_sent_2_spider_from_hbase_hourly.jar hourly.GetUrlNotSent2SpiderFromHbase $JOBCONF
echo "===== scan HBase finished ====="
I set job.setNumReduceTasks(100); in code, it worked.
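For reference, the working fix boils down to a single call in the driver's run() method, right after the Job is created; reading the count from the configuration (a sketch, optional) keeps it adjustable from the command line:
// In run(), after creating the Job and before waitForCompletion():
job.setNumReduceTasks(conf.getInt("mapred.reduce.tasks", 100));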
Since you mentioned that only one reducer is running, that is the obvious reason why the reduce phase is very slow.
A uniform way to see which configuration properties are actually applied to a job (you can call this for every job you execute to check that parameters are passed correctly): add the method below to your job driver; it prints the configuration entries applied from all possible sources, whether from -D or somewhere else. Call it in the driver program before your job is submitted:
public static void printConfigApplied(Configuration conf) {
    try {
        conf.writeXml(System.out);
    } catch (final IOException e) {
        e.printStackTrace();
    }
}
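A possible call site in the driver above, just before submission (a sketch):
// In run(), right before job.waitForCompletion(true):
printConfigApplied(job.getConfiguration());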
If the dump does not show your -D options, that proves your system properties are not being applied from the command line, i.e. the way you are passing them is not correct.
Since job.setNumReduceTasks(100) works when set programmatically, I strongly suspect the lines below, where your system properties are not passed correctly to the driver:
Configuration conf = getConf();
Job job = new Job(conf, conf.get("mapred.job.name"));
Change this to follow the linked example.

How to execute JavaFX Tasks and Services in a sequential manner

In my Controller class I have to execute several IO commands (e.g. SSH and RCP commands with some parameter values) in a sequential manner. Each of these commands takes some amount of time to execute.
I have to update the UI when each command starts executing.
Then, depending on the execution result (success or failure), I have to update the UI again.
Then the next command has to be executed with the same steps.
Execution of each command depends on the result of the previous command. As an example:
for (IOCommand command : commandsList) {
    // Update the UI before starting the command execution
    messageTextArea.append("Command " + command.getType() + " Started");
    boolean result = commandExecutor(command);
    if (result) {
        // Update the UI after successful execution
        messageTextArea.append("Command " + command.getType() + " Successfully Executed");
        // Then go to the next command execution
    } else {
        // Update the UI after failed execution
        messageTextArea.append("Command " + command.getType() + " Failed");
        // Fix the issue and re-execute
        commandReExecutor(command);
    }
}
To accomplish this gradual UI update I have to use JavaFX Task or Service features (otherwise the application will hang until all commands have finished and the UI will be updated all at once). But due to the nature of concurrency I cannot figure out how to execute these commands with a Task or Service in a sequential manner (one after another, not all at once). How can I address this problem? Thanks in advance.
I had the exact same requirement in a project, and it can be done with Task and Service. You just need a correct implementation.
A few notes:
1. Always start a background task using a Service or Platform.runLater.
2. If you want to update the UI, it must be done from the Task or Service (for example via updateMessage() / updateProgress()).
3. Bind the progress property of the task to that of a ProgressBar for smooth updates.
4. Similarly, bind the text property of a Label to the message property of a task for smooth updates of the status or anything else (see the sketch after these notes).
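For example, the bindings from notes 3 and 4 could look like this (a minimal sketch; progressBar, statusLabel and the ProcessExecutor task defined further below are assumed to exist in your controller):
// Bind UI controls to the task's properties; calls to updateProgress()/updateMessage()
// inside the Task then reach the UI safely on the FX Application Thread.
Task<Integer> task = new ProcessExecutor("installTomcat.bat", "tomcat7");
progressBar.progressProperty().bind(task.progressProperty());
statusLabel.textProperty().bind(task.messageProperty());
new Thread(task).start(); // or submit it to an ExecutorService, as shown at the end of this answer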
To execute external commands (shell scripts, etc.) I've written the following class:
package utils;

import controller.ProgressController;
import java.io.BufferedReader;
import java.io.File;
import java.io.InputStreamReader;
import java.util.Map;
import java.util.logging.Level;
import java.util.logging.Logger;
import javafx.concurrent.Task;
import main.Installer;

public class ProcessExecutor extends Task<Integer> {

    Logger logger = Logger.getLogger("ProcessExecutor");
    File dir;
    String[] cmd;
    String cmds;
    int exitCode = -1;
    boolean NextStepExists = false;
    Task nextStep;

    public ProcessExecutor(String... cmd) {
        this.cmd = cmd;
        this.dir = new File(System.getProperty("user.dir"));
        this.nextStep = null;
        NextStepExists = false;
    }

    public ProcessExecutor(Task nextStep, String... cmd) {
        this.cmd = cmd;
        this.dir = new File(System.getProperty("user.dir"));
        this.nextStep = nextStep;
        NextStepExists = true;
    }

    public ProcessExecutor(Task nextStep, File dir, String... cmd) {
        this.cmd = cmd;
        this.dir = dir;
        this.nextStep = nextStep;
        NextStepExists = true;
    }

    @Override
    protected final Integer call() {
        cmds = new String();
        for (String i : cmd)
            cmds += i + " "; // just to log the cmd array
        try {
            logger.info("Starting new process with cmd > " + cmds);
            ProcessBuilder processBuilder = new ProcessBuilder(cmd);
            processBuilder.directory(dir);
            processBuilder.redirectErrorStream(true);
            Map<String, String> env = processBuilder.environment();
            // create custom environment
            env.put("JAVA_HOME", "/opt/jdk1.7.0_45/");
            Process pr = processBuilder.start();
            BufferedReader in = new BufferedReader(new InputStreamReader(pr.getInputStream()));
            String line = in.readLine();
            while (line != null) {
                logger.log(Level.FINE, line);
                ProgressController.instance.printToConsole(line);
                line = in.readLine();
            }
            // With redirectErrorStream(true) the error stream is already merged into
            // stdout above; this loop is kept for completeness and reads from er.
            BufferedReader er = new BufferedReader(new InputStreamReader(pr.getErrorStream()));
            String erLine = er.readLine();
            while (erLine != null) {
                logger.log(Level.FINE, erLine);
                ProgressController.instance.printToConsole(erLine);
                erLine = er.readLine();
            }
            exitCode = pr.waitFor();
            exitCode = pr.exitValue();
            logger.info("Exit Value=" + exitCode);
            updateMessage("Completed Process");
            if (exitCode != 0 && exitCode != 1) {
                logger.info("Failed to execute process commands >" + cmds + " with exit code=" + exitCode);
                failed();
            } else {
                logger.info("PE succeeded()");
                if (NextStepExists)
                    Installer.pool.submit(nextStep);
                succeeded();
            }
        } catch (Exception e) {
            logger.log(Level.SEVERE, "Exception: Failed to execute process commands >" + cmds, e);
            updateMessage(e.getMessage());
        }
        return new Integer(exitCode);
    }

    @Override
    public void failed() {
        super.failed();
        logger.log(Level.SEVERE, "Failed to execute process commands >" + cmds + "; ExitCode=" + exitCode);
    }
}
This class uses ProcessBuilder to create the required environment for the new process.
It waits for the process to finish using process.waitFor(), and the working directory of the process can be set using processBuilder.directory(dir). In order to execute only a single Task<> at any time, use a java.util.concurrent.ExecutorService:
public ExecutorService pool=Executors.newSingleThreadExecutor();
pool.submit(new ProcessExecutor("installTomcat.bat","tomcat7"));
pool.submit(new ProcessExecutor("installPostgres.bat","postgresql","5432"));
In this way you can execute batch files one after another. Executors.newSingleThreadExecutor() takes care of executing a single task at a time and queuing newly submitted tasks. I've written a generalized working example of sequential execution here:
github. It is a NetBeans JavaFX project and a generalized, stripped-down version of a project. Hope this helps.
