During an HBase scan with MapReduce, the number of reducers is always one - java

I run an HBase scan in the Mapper, and the Reducer writes the result to HDFS.
The mapper outputs roughly 1,000,000,000 records.
The problem is that the number of reducers is always one, even though I have set -Dmapred.reduce.tasks=100, so the reduce phase is very slow.
// edit at 2016-12-04 by 祝方泽
The code of my main class:
public class GetUrlNotSent2SpiderFromHbase extends Configured implements Tool {
public int run(String[] arg0) throws Exception {
Configuration conf = getConf();
Job job = new Job(conf, conf.get("mapred.job.name"));
String input_table = conf.get("input.table");
job.setJarByClass(GetUrlNotSent2SpiderFromHbase.class);
Scan scan = new Scan();
scan.setCaching(500);
scan.setCacheBlocks(false);
scan.addColumn(Bytes.toBytes("info"), Bytes.toBytes("sitemap_type"));
scan.addColumn(Bytes.toBytes("info"), Bytes.toBytes("is_send_to_spider"));
TableMapReduceUtil.initTableMapperJob(
input_table,
scan,
GetUrlNotSent2SpiderFromHbaseMapper.class,
Text.class,
Text.class,
job);
/*job.setMapperClass(GetUrlNotSent2SpiderFromHbaseMapper.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);*/
job.setReducerClass(GetUrlNotSent2SpiderFromHbaseReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
if (job.waitForCompletion(true) && job.isSuccessful()) {
return 0;
}
return -1;
}
public static void main(String[] args) throws Exception {
Configuration conf = HBaseConfiguration.create();
int res = ToolRunner.run(conf, new GetUrlNotSent2SpiderFromHbase(), args);
System.exit(res);
}
}
Here is the script to run this MapReduce job:
table="xxx"
output="yyy"
sitemap_type="zzz"
JOBCONF=""
JOBCONF="${JOBCONF} -Dmapred.job.name=test_for_scan_hbase"
JOBCONF="${JOBCONF} -Dinput.table=$table"
JOBCONF="${JOBCONF} -Dmapred.output.dir=$output"
JOBCONF="${JOBCONF} -Ddemand.sitemap.type=$sitemap_type"
JOBCONF="${JOBCONF} -Dyarn.app.mapreduce.am.command-opts='-Xmx8192m'"
JOBCONF="${JOBCONF} -Dyarn.app.mapreduce.am.resource.mb=9216"
JOBCONF="${JOBCONF} -Dmapreduce.map.java.opts='-Xmx1536m'"
JOBCONF="${JOBCONF} -Dmapreduce.map.memory.mb=2048"
JOBCONF="${JOBCONF} -Dmapreduce.reduce.java.opts='-Xmx1536m'"
JOBCONF="${JOBCONF} -Dmapreduce.reduce.memory.mb=2048"
JOBCONF="${JOBCONF} -Dmapred.reduce.tasks=100"
JOBCONF="${JOBCONF} -Dmapred.job.priority=VERY_HIGH"
hadoop fs -rmr $output
hadoop jar get_url_not_sent_2_spider_from_hbase_hourly.jar hourly.GetUrlNotSent2SpiderFromHbase $JOBCONF
echo "===== scan HBase finished ====="
When I set job.setNumReduceTasks(100); in the code, it worked.
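A minimal sketch of keeping the reducer count configurable while still setting it explicitly in the driver. It is plain Java so it runs without Hadoop on the classpath; the class ReducerCount and its two helpers are hypothetical names, and the parsing only imitates the -Dkey=value form used in the script above (it is not Hadoop's own option parser). mapreduce.job.reduces is the newer name of mapred.reduce.tasks, so the sketch checks both.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustration only: this is not Hadoop's GenericOptionsParser. */
final class ReducerCount {

    /** Collects "-Dkey=value" tokens into a map, ignoring everything else. */
    static Map<String, String> parseDefines(String[] args) {
        Map<String, String> defines = new LinkedHashMap<>();
        for (String arg : args) {
            int eq = arg.indexOf('=');
            if (arg.startsWith("-D") && eq > 2) {
                defines.put(arg.substring(2, eq), arg.substring(eq + 1));
            }
        }
        return defines;
    }

    /** Looks up the newer key first, then the older one, then the fallback. */
    static int resolve(Map<String, String> conf, int fallback) {
        for (String key : new String[] {"mapreduce.job.reduces", "mapred.reduce.tasks"}) {
            String value = conf.get(key);
            if (value != null) {
                try {
                    return Integer.parseInt(value.trim());
                } catch (NumberFormatException ignored) {
                    // Not a number: fall through to the next key
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        int reducers = resolve(parseDefines(args), 1);
        // In the real driver this value would go to job.setNumReduceTasks(reducers)
        System.out.println("reducers=" + reducers);
    }
}
```

In the actual driver the same idea would read the value from the Hadoop Configuration (for example with conf.getInt) and pass it to job.setNumReduceTasks, so the job no longer depends on the property being picked up implicitly.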

Only one reducer is running, which is the obvious reason the reduce phase is so slow.
A uniform way to see which configuration properties are applied to a job (you can call it for every job you run to check that parameters are passed correctly):
Add the method below to your job driver to print the configuration entries applied from all possible sources, whether from -D or somewhere else. Call it in the driver program before the job is submitted:
public static void printConfigApplied(Configuration conf) {
    try {
        // Dump every effective property as XML to stdout
        conf.writeXml(System.out);
    } catch (final IOException e) {
        e.printStackTrace();
    }
}
If your values are missing from that output, the properties passed on the command line with -Dxxx are not being applied, so the way you are passing them is not correct, whereas setting them programmatically works.
Since job.setNumReduceTasks works, I strongly suspect the lines below, where your properties are not passed correctly to the driver.
Configuration conf = getConf();
Job job = new Job(conf, conf.get("mapred.job.name"));
change this to the example in this

Related

Create a Java API that manually triggers already-created Kubernetes jobs

I have a job already running in Kubernetes which is scheduled for 4 hours. But I need to write a Java API so that whenever I want to run the job, I can just call this API and it runs the job.
Please help me solve this requirement.
There are two ways: either run an application in a Pod that creates the Job for you, or write a Java API that creates the Job when its endpoint is hit.
For creation, you can use the Java Kubernetes client library.
Example - Click here
Java client - Click here
package io.fabric8.kubernetes.examples;
import io.fabric8.kubernetes.api.model.PodList;
import io.fabric8.kubernetes.api.model.batch.v1.Job;
import io.fabric8.kubernetes.api.model.batch.v1.JobBuilder;
import io.fabric8.kubernetes.client.ConfigBuilder;
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.util.Collections;
import java.util.concurrent.TimeUnit;
/*
* Creates a simple run to complete job that computes π to 2000 places and prints it out.
*/
public class JobExample {
private static final Logger logger = LoggerFactory.getLogger(JobExample.class);
public static void main(String[] args) {
final ConfigBuilder configBuilder = new ConfigBuilder();
if (args.length > 0) {
configBuilder.withMasterUrl(args[0]);
}
try (KubernetesClient client = new DefaultKubernetesClient(configBuilder.build())) {
final String namespace = "default";
final Job job = new JobBuilder()
.withApiVersion("batch/v1")
.withNewMetadata()
.withName("pi")
.withLabels(Collections.singletonMap("label1", "maximum-length-of-63-characters"))
.withAnnotations(Collections.singletonMap("annotation1", "some-very-long-annotation"))
.endMetadata()
.withNewSpec()
.withNewTemplate()
.withNewSpec()
.addNewContainer()
.withName("pi")
.withImage("perl")
.withArgs("perl", "-Mbignum=bpi", "-wle", "print bpi(2000)")
.endContainer()
.withRestartPolicy("Never")
.endSpec()
.endTemplate()
.endSpec()
.build();
logger.info("Creating job pi.");
client.batch().v1().jobs().inNamespace(namespace).createOrReplace(job);
// Get All pods created by the job
PodList podList = client.pods().inNamespace(namespace).withLabel("job-name", job.getMetadata().getName()).list();
// Wait for pod to complete
client.pods().inNamespace(namespace).withName(podList.getItems().get(0).getMetadata().getName())
.waitUntilCondition(pod -> pod.getStatus().getPhase().equals("Succeeded"), 1, TimeUnit.MINUTES);
// Print Job's log
String joblog = client.batch().v1().jobs().inNamespace(namespace).withName("pi").getLog();
logger.info(joblog);
} catch (KubernetesClientException e) {
logger.error("Unable to create job", e);
}
}
}
Option 2
You can also load a static YAML file and create the Job from it:
ApiClient client = ClientBuilder.cluster().build(); //create in-cluster client
Configuration.setDefaultApiClient(client);
BatchV1Api api = new BatchV1Api(client);
V1Job job = new V1Job();
job = (V1Job) Yaml.load(new File("<YAML file path>.yaml")); //apply static yaml file
ApiResponse<V1Job> response = api.createNamespacedJobWithHttpInfo("default", job, "true", null, null);
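Either client can sit behind a small HTTP endpoint, which is the "hit an endpoint and the Job is created" idea from the start of this answer. The sketch below uses only the JDK's built-in HttpServer; the class name JobTriggerServer, the path /jobs/run and port 8080 are assumptions for illustration, and the Runnable stands in for whichever job-creation code you use.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

final class JobTriggerServer {

    /** Picks the HTTP status for one request; runs the job creation only on POST. */
    static int handle(String method, Runnable createJob) {
        if (!"POST".equals(method)) {
            return 405; // only POST may trigger a job
        }
        try {
            createJob.run();
            return 202; // accepted: the Job was requested, not necessarily finished
        } catch (RuntimeException e) {
            return 500; // job creation failed
        }
    }

    /** Starts a server that triggers createJob on POST /jobs/run (hypothetical path). */
    static HttpServer start(int port, Runnable createJob) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/jobs/run", exchange -> {
            int status = handle(exchange.getRequestMethod(), createJob);
            byte[] body = ("status " + status + "\n").getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(status, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder action: replace with the real Kubernetes client call
        start(8080, () -> System.out.println("would create the Kubernetes Job here"));
    }
}
```

Returning 202 rather than 200 is a design choice: the endpoint only requests the Job, and the caller would check its completion separately.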
I had the same question, since my team and I needed to develop a web application that lets any user start a new execution of our jobs.
I have a job already running in Kubernetes which is scheduled for 4 hours.
If I'm not mistaken, it's not possible to schedule a Job on Kubernetes; you need to create a Job from a CronJob, which is our case.
We have several CronJobs scheduled to start throughout the day, but we also need to start them again after an error or for some other reason.
After some research, I decided to use the Kubernetes-client library.
When I needed to trigger a job manually, I used the kubectl CLI: kubectl create job batch-demo-job --from=cronjob/batch-demo-cronjob -n ns-batch-demo , so I was looking for a way to do the same from code.
According to an issue opened on the Kubernetes-client GitHub, that is not possible directly: you need to look up your CronJob, then use its spec to create your Job.
So I built a POC and it works as expected; it follows the same logic, but in a friendlier way.
In this example, I only need the CronJob spec to get the volumes.
fun createJobFromACronJob(namespace: String) {
val client = Config.defaultClient()
Configuration.setDefaultApiClient(client)
try {
val api = BatchV1Api(client)
val cronJob = api.readNamespacedCronJob("$namespace-cronjob", namespace, "true")
val job = api.createNamespacedJob(namespace, createJobSpec(cronJob), "true", null, null, null)
} catch (ex: ApiException) {
System.err.println("Exception when calling BatchV1Api#createNamespacedJob")
System.err.println("Status Code: ${ex.code}")
System.err.println("Reason: ${ex.responseBody}")
System.err.println("Response Header: ${ex.responseHeaders}")
ex.printStackTrace()
}
}
private fun createJobSpec(cronJob: V1CronJob): V1Job {
val namespace = cronJob.metadata!!.namespace!!
return V1Job()
.apiVersion("batch/v1")
.kind("Job")
.metadata(
V1ObjectMeta()
.name("$namespace-job")
.namespace(namespace)
.putLabelsItem("app.kubernetes.io/team", "Jonas-pangare")
.putLabelsItem("app.kubernetes.io/name", namespace.uppercase())
.putLabelsItem("app.kubernetes.io/part-of", "SINC")
.putLabelsItem("app.kubernetes.io/tier", "batch")
.putLabelsItem("app.kubernetes.io/managed-by", "kubectl")
.putLabelsItem("app.kubernetes.io/built-by", "sinc-monitoracao")
)
.spec(
V1JobSpec()
.template(
podTemplate(cronJob, namespace)
)
.backoffLimit(0)
)
}
private fun podTemplate(cronJob: V1CronJob, namespace: String): V1PodTemplateSpec {
return V1PodTemplateSpec()
.spec(
V1PodSpec()
.restartPolicy("Never")
.addContainersItem(
V1Container()
.name(namespace)
.image(namespace)
.imagePullPolicy("Never")
.addEnvItem(V1EnvVar().name("TZ").value("America/Sao_Paulo"))
.addEnvItem(V1EnvVar().name("JOB_NAME").value("helloWorldJob"))
)
.volumes(cronJob.spec!!.jobTemplate.spec!!.template.spec!!.volumes)
)
}
You can also use the Kubernetes client from Fabric8; it's great too, and easier to use.
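One thing to watch when triggering repeatedly: Job names must be unique within a namespace, so a fixed name such as "$namespace-job" would collide on the second run while the first Job still exists. A minimal sketch of deriving a fresh name per trigger follows; the class ManualJobName and the "-manual-" suffix are hypothetical, and the 63-character cap is an assumption based on the DNS-label limit that Job names are recommended to respect.

```java
import java.time.Instant;

final class ManualJobName {

    /** Assumed upper bound for a Job name (DNS label length). */
    static final int MAX_LEN = 63;

    /** Builds "<cronJobName>-manual-<epochSeconds>", truncating the base if needed. */
    static String from(String cronJobName, Instant now) {
        if (cronJobName == null || cronJobName.isEmpty()) {
            throw new IllegalArgumentException("cronJobName must not be empty");
        }
        String suffix = "-manual-" + now.getEpochSecond();
        int keep = Math.min(cronJobName.length(), MAX_LEN - suffix.length());
        String base = cronJobName.substring(0, keep);
        // Truncation may leave a trailing '-' or '.', which would read badly before the suffix
        while (base.endsWith("-") || base.endsWith(".")) {
            base = base.substring(0, base.length() - 1);
        }
        return base + suffix;
    }

    public static void main(String[] args) {
        System.out.println(from("batch-demo-cronjob", Instant.now()));
    }
}
```

The result could be passed to the metadata name of the Job being built. Two triggers within the same second would still collide, so a random suffix may be a better fit for busy systems.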

How do I submit a job to a Flink cluster using Java code?

I have already uploaded a fat jar containing my application code to the /lib folder of all nodes in my Flink cluster. I am trying to start the Flink job from a separate java application, but can't find a good way to do so.
The closest thing to a solution that I have currently found is the Monitoring Rest API which has a run job API. However, this only allows you to run jobs submitted via the job upload function.
I have seen the ClusterClient.java in the flink-client module, but could not see any examples of how I might use this.
Any examples of how someone has submitted jobs successfully through java code would be greatly appreciated!
You can use RestClusterClient to run a PackagedProgram which points to your Flink job. If your job accepts some arguments, you can pass them.
Here is an example for a standalone cluster running on localhost:8081 :
// import org.apache.flink.api.common.JobSubmissionResult;
// import org.apache.flink.client.deployment.StandaloneClusterId;
// import org.apache.flink.client.program.PackagedProgram;
// import org.apache.flink.client.program.rest.RestClusterClient;
// import org.apache.flink.configuration.Configuration;
// import org.apache.flink.configuration.JobManagerOptions;
// import org.apache.flink.configuration.RestOptions;
String clusterHost = "localhost";
int clusterPort = 8081;
Configuration config = new Configuration();
config.setString(JobManagerOptions.ADDRESS, clusterHost);
config.setInteger(RestOptions.PORT, clusterPort);
String jarFilePath = "/opt/flink/examples/streaming/SocketWindowWordCount.jar";
String[] args = new String[]{ "--port", "9000" };
PackagedProgram packagedProgram = new PackagedProgram(new File(jarFilePath), args);
RestClusterClient<StandaloneClusterId> client =
new RestClusterClient<StandaloneClusterId>(config, StandaloneClusterId.getInstance());
int parallelism = 1;
JobSubmissionResult result = client.run(packagedProgram, parallelism);
This seems to work for version 1.10
private static final int PARALLELISM = 8;
private static final Configuration FLINK_CONFIG = new Configuration();
void foo() throws Exception {
FLINK_CONFIG.setString(JobManagerOptions.ADDRESS, "localhost");
FLINK_CONFIG.setInteger(RestOptions.PORT, 8081);
FLINK_CONFIG.setInteger(RestOptions.RETRY_MAX_ATTEMPTS, 3);
RestClusterClient<StandaloneClusterId> flinkClient = new RestClusterClient<>(FLINK_CONFIG, StandaloneClusterId.getInstance());
String jar = "/path/to/jar";
String[] args = new String[]{"..."};
PackagedProgram program = PackagedProgram.newBuilder()
.setJarFile(new File(jar))
.setArguments(args)
.build();
JobGraph jobGraph = PackagedProgramUtils.createJobGraph(program, FLINK_CONFIG, PARALLELISM, false);
JobID jobId = flinkClient.submitJob(jobGraph).get();
...
}
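For comparison, the REST route mentioned in the question (run a jar that was previously uploaded) can be driven from plain Java 11+ without any Flink dependency. This is only a sketch of building the request: the class FlinkRunRequest is hypothetical, the jar id is assumed to be whatever the upload step reported, and the entryClass and parallelism body fields follow my reading of the Flink REST API, so check them against the documentation for your version.

```java
import java.net.URI;
import java.net.http.HttpRequest;

final class FlinkRunRequest {

    /** JSON body for the run call; entryClass is restricted to a plain Java class name. */
    static String body(String entryClass, int parallelism) {
        if (!entryClass.matches("[A-Za-z_$][A-Za-z0-9_$]*(\\.[A-Za-z_$][A-Za-z0-9_$]*)*")) {
            throw new IllegalArgumentException("not a class name: " + entryClass);
        }
        return "{\"entryClass\":\"" + entryClass + "\",\"parallelism\":" + parallelism + "}";
    }

    /** Builds POST <baseUrl>/jars/<jarId>/run; jarId is assumed to be URL-safe. */
    static HttpRequest build(String baseUrl, String jarId, String entryClass, int parallelism) {
        URI uri = URI.create(baseUrl + "/jars/" + jarId + "/run");
        return HttpRequest.newBuilder(uri)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body(entryClass, parallelism)))
                .build();
    }

    public static void main(String[] args) {
        // Hypothetical jar id and entry class, for illustration only
        HttpRequest request = build("http://localhost:8081", "example_job.jar", "com.example.Main", 1);
        System.out.println(request.method() + " " + request.uri());
    }
}
```

Sending it would be a matter of passing the request to java.net.http.HttpClient.send and reading the job id from the response.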

How to submit a MapReduce job with the YARN API in Java

I want to submit my MR job using the YARN Java API. I tried to do it as in WritingYarnApplications, but I don't know what to add to amContainer. Below is the code I have written:
package org.apache.hadoop.examples;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.protocolrecords.GetNewApplicationResponse;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.util.Records;
import org.mortbay.util.ajax.JSON;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class YarnJob {
private static Logger logger = LoggerFactory.getLogger(YarnJob.class);
public static void main(String[] args) throws Throwable {
Configuration conf = new Configuration();
YarnClient client = YarnClient.createYarnClient();
client.init(conf);
client.start();
System.out.println(JSON.toString(client.getAllQueues()));
System.out.println(JSON.toString(client.getConfig()));
//System.out.println(JSON.toString(client.getApplications()));
System.out.println(JSON.toString(client.getYarnClusterMetrics()));
YarnClientApplication app = client.createApplication();
GetNewApplicationResponse appResponse = app.getNewApplicationResponse();
ApplicationId appId = appResponse.getApplicationId();
// Create launch context for app master
ApplicationSubmissionContext appContext = Records.newRecord(ApplicationSubmissionContext.class);
// set the application id
appContext.setApplicationId(appId);
// set the application name
appContext.setApplicationName("test");
// Set the queue to which this application is to be submitted in the RM
appContext.setQueue("default");
// Set up the container launch context for the application master
ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
//amContainer.setLocalResources();
//amContainer.setCommands();
//amContainer.setEnvironment();
appContext.setAMContainerSpec(amContainer);
appContext.setResource(Resource.newInstance(1024, 1));
appContext.setApplicationType("MAPREDUCE");
// Submit the application to the applications manager
client.submitApplication(appContext);
//client.stop();
}
}
I can run the MapReduce job properly from the command line:
hadoop jar wordcount.jar org.apache.hadoop.examples.WordCount /user/admin/input /user/admin/output/
But how can I submit this wordcount job with the YARN Java API?
You do not use the YARN client to submit the job; instead, use the MapReduce APIs to submit it. See this link for Example
However, if you need more control over the job, such as getting the completion status, Mapper phase status, Reducer phase status, etc., you can use
job.submit();
Instead of
job.waitForCompletion(true)
You can use the functions job.mapProgress() and job.reduceProgress() to get the status. The Job object has many more functions that you can explore.
As for your query about
hadoop jar wordcount.jar org.apache.hadoop.examples.WordCount /user/admin/input /user/admin/output/
What's happening here is that you are running your driver program, which is packaged in wordcount.jar. Instead of running "java -jar wordcount.jar", you are using "hadoop jar wordcount.jar"; you can also use "yarn jar wordcount.jar". Compared to the java -jar command, Hadoop/YARN sets up the additional classpath entries that are needed. This executes the main() of your driver program, which lives in the class org.apache.hadoop.examples.WordCount, as specified in the command.
You can check out the source here: Source for WordCount class
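Because the hadoop jar command is what sets up the classpath, one simple way to start the same driver from another Java process is to spawn that command. The following is a sketch under that assumption; HadoopJarLauncher is a hypothetical class, and it presumes the hadoop executable is on the PATH of the machine running it.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class HadoopJarLauncher {

    /** Builds: hadoop jar <jar> <mainClass> <jobArgs...> */
    static List<String> command(String jar, String mainClass, String... jobArgs) {
        List<String> cmd = new ArrayList<>(Arrays.asList("hadoop", "jar", jar, mainClass));
        cmd.addAll(Arrays.asList(jobArgs));
        return cmd;
    }

    /** Runs the command, forwarding its output, and returns the exit code. */
    static int run(List<String> cmd) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(cmd).inheritIO().start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        List<String> cmd = command("wordcount.jar", "org.apache.hadoop.examples.WordCount",
                "/user/admin/input", "/user/admin/output/");
        System.out.println(String.join(" ", cmd));
        // Uncomment on a machine where `hadoop` is installed:
        // System.exit(run(cmd));
    }
}
```

Passing the arguments as a list avoids shell quoting problems, at the cost of only seeing the job's progress through the forwarded console output.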
The only reason I would assume you want to submit the job via YARN is to integrate it with some kind of service that kicks off MapReduce2 jobs on certain events.
For this, your driver's main() can always look something like this.
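import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobStatus;
import org.apache.hadoop.util.StringUtils;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyMapReduceDriver extends Configured implements Tool {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        /******/
        int errCode = ToolRunner.run(conf, new MyMapReduceDriver(), args);
        System.exit(errCode);
    }

    @Override
    public int run(String[] args) throws Exception {
        // Loops forever, submitting one job after another
        while (true) {
            try {
                runMapReduceJob();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    private void runMapReduceJob()
            throws IOException, InterruptedException, ClassNotFoundException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        /******/
        job.submit();
        // Poll and print progress while the job is preparing or running
        while (job.getJobState() == JobStatus.State.RUNNING
                || job.getJobState() == JobStatus.State.PREP) {
            Thread.sleep(1000);
            System.out.println(" Map: " + StringUtils.formatPercent(job.mapProgress(), 0)
                    + " Reducer: " + StringUtils.formatPercent(job.reduceProgress(), 0));
        }
    }
}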
Hope this helps.

HDFS Permission denied

I'm trying to start a MapReduce job from java. But when I try to submit the job I get Permission Denied exception. I'm able to run hdfs dfs -ls / from command line without any error. But it doesn't work from my java program.
Here's my code
public static void main(String[] args) {
Configuration conf=new Configuration();
conf.set("mapreduce.map.class","org.apache.hadoop.conf.TestMapper");
conf.set("mapreduce.reduce.class","org.apache.hadoop.conf.TestReducer");
conf.set("mapreduce.framework.name","yarn");
conf.set("hadoop.security.group.mapping","org.apache.hadoop.security.ShellBasedUnixGroupsMapping");
conf.set("fs.default.name","hdfs://master:9000");
conf.set("dfs.permission","false");
conf.set("yarn.nodemanager.aux-services","mapreduce_shuffle");
conf.set("yarn.resourcemanager.resource-tracker.address","master:8025");
conf.set("yarn.resourcemanager.scheduler.address","master:8030");
conf.set("yarn.resourcemanager.address","master:8040");
conf.set("yarn.nodemanager.localizer.address","master:8060");
Job job=null;
try {
job = Job.getInstance(conf, "Test Map Reduce");
job.setJarByClass(RunJob.class);
job.setOutputKeyClass(LongWritable.class);
job.setOutputValueClass(Text.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
TextInputFormat.setInputPaths(job, new Path("/input.txt"));
TextOutputFormat.setOutputPath(job, new Path("/output"));
job.submit();
} catch (Exception e) {
e.printStackTrace();
}
}
But I get the following exception
org.apache.hadoop.security.AccessControlException: Permission denied: user=manthosh, access=EXECUTE, inode="/tmp":hduser:supergroup:drwxrwx---
The solution here doesn't work.
What am I missing?
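The exception message itself carries the pieces needed to see why access is refused: the path /tmp is owned by hduser, its group is supergroup, and its mode is drwxrwx---. A minimal sketch of that owner/group/other check follows, assuming (as the denial suggests) that manthosh is neither the owner nor a member of supergroup. PermissionCheck is a hypothetical helper, not an HDFS API, and it handles only plain r, w, x and - mode characters.

```java
import java.util.Collections;
import java.util.Set;

final class PermissionCheck {

    /**
     * Applies the owner / group / other rule to a mode string such as "drwxrwx---".
     * access is one of 'r', 'w', 'x'.
     */
    static boolean allows(String mode, String owner, String group,
                          String user, Set<String> userGroups, char access) {
        int bit = "rwx".indexOf(access);
        if (mode.length() != 10 || bit < 0) {
            throw new IllegalArgumentException("unsupported mode or access");
        }
        int offset;
        if (user.equals(owner)) {
            offset = 1; // owner triplet
        } else if (userGroups.contains(group)) {
            offset = 4; // group triplet
        } else {
            offset = 7; // everyone else
        }
        return mode.charAt(offset + bit) == access;
    }

    public static void main(String[] args) {
        // Values copied from the exception; the user's group membership is an assumption
        boolean ok = allows("drwxrwx---", "hduser", "supergroup",
                "manthosh", Collections.singleton("manthosh"), 'x');
        System.out.println(ok); // prints false
    }
}
```

Under that rule the "other" triplet is ---, so the job client running as manthosh cannot traverse /tmp, whereas the same check for hduser or for a member of supergroup would pass.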

Issue with file creation while running a topology on a remote cluster using Storm

I have created a topology that should read from a file and write to a new file. My program runs properly in a local cluster, but when I submit it to a remote cluster I get no error, yet the file is not created. Below is my code to submit the topology to the remote cluster:
public static void main(String[] args) {
final Logger logger = LoggingService.getLogger(FileToFileTopology.class.getName());
try{
Properties prop =new Properties();
prop.load(new FileInputStream(args[0]+"/connection.properties"));
LoggingService.generateAppender("storm_etl",prop, "");
logger.info("inside main method...." +args.length);
System.out.println("inside main sys out");
Config conf= new Config();
conf.setDebug(false);
conf.put(Config.TOPOLOGY_MAX_SPOUT_PENDING,1);
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("file-reader",new FileReaderSpout(args[1]));
builder.setBolt("file-writer",new WriteToFileBolt(args[1]),2).shuffleGrouping("file-reader");
logger.info("submitting topology");
StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
}
catch(Exception e){
System.out.println("inside catch");
logger.info("inside catch"+e.getMessage());
logger.error("inside error", e);
e.printStackTrace();
}
}
I have also used log4j to create my own log file for my topology; the log file gets created, but it contains no errors. Please help.
I had the same issue with Hortonworks 2.2. It happened because of permissions.
Even if you submit to the cluster as the root user, the storm jar command executes the topology as the 'storm' user. It can read the file from the source, but it can't write, because it doesn't have the necessary rights.
Give the destination folder where you want to write the file all permissions.
chmod 777 -R /user/filesfolder
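To confirm that this is the cause before opening up permissions, the writing code can log which OS user it actually runs as and whether the target directory is writable. This is a small sketch using only the JDK; WriteAccessProbe is a hypothetical helper that could be called from the bolt's setup code, and the path in main is the one from the chmod command above.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class WriteAccessProbe {

    /** Describes the current OS user and whether it can write into dir. */
    static String describe(Path dir) {
        String user = System.getProperty("user.name");
        boolean writable = Files.isDirectory(dir) && Files.isWritable(dir);
        return "user=" + user + " dir=" + dir + " writable=" + writable;
    }

    public static void main(String[] args) {
        System.out.println(describe(Paths.get("/user/filesfolder")));
    }
}
```

If the line reports the storm user with writable=false, granting that user (or its group) write access to the folder is a narrower fix than chmod 777.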
