unable to set mapreduce.job.reduces through generic option parser - java

hadoop jar MapReduceTryouts-1.jar invertedindex.simple.MyDriver -D mapreduce.job.reduces=10 /user/notprabhu2/Input/potter/ /user/notprabhu2/output
I have been trying in vain to set the number of reducers through the -D option provided by GenericOptionsParser, but it does not seem to work and I have no idea why.
I tried -D mapreduce.job.reduces=10 (with a space after -D) and also
-Dmapreduce.job.reduces=10 (without a space after -D), but neither seems to work.
In my driver class I have implemented the Tool interface.
package invertedindex.simple;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        Job job = Job.getInstance(conf);
        job.setJarByClass(MyDriver.class);

        Path outputPath = new Path(args[1]);
        outputPath.getFileSystem(getConf()).delete(outputPath, true);

        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        TextInputFormat.addInputPath(job, new Path(args[0]));
        TextOutputFormat.setOutputPath(job, outputPath);

        job.setNumReduceTasks(3);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new Configuration(), new MyDriver(), args);
        System.exit(exitCode);
    }
}
Since I have explicitly set the number of reducers to 3 in my driver code, I always end up with 3 reducers.
I am using CDH 5.4.7 which has Hadoop 2.6.0 on a 2 node cluster on Google Compute Engine.

Figured it out. It turned out to be so silly, but I'm still posting the answer just in case someone else makes the same silly mistake.
It seems the job.setNumReduceTasks(3); line in my driver class takes precedence over -D mapreduce.job.reduces=10 on the command line.
When I removed the job.setNumReduceTasks(3); line from my code, everything worked fine.
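For reference, a minimal sketch of what the fixed run() boils down to (the same setup as in the driver above, just without the hard-coded reducer count):

public int run(String[] args) throws Exception {
    // getConf() already contains any -D overrides applied by GenericOptionsParser.
    Job job = Job.getInstance(getConf());
    job.setJarByClass(MyDriver.class);
    // ... mapper, reducer, input/output formats and paths exactly as before ...
    // No job.setNumReduceTasks(...) call here, so -D mapreduce.job.reduces=10
    // (or the value from mapred-site.xml) determines how many reducers run.
    return job.waitForCompletion(true) ? 0 : 1;
}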

You can also set the number of reducers with the mapreduce.job.reduces property in mapred-site.xml, which the code picks up from the Configuration:
<property>
    <name>mapreduce.job.reduces</name>
    <value>5</value>
</property>
Then relaunch the Hadoop processes.

Related

Copy JSON file from Local to HDFS

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class HdfsWriter extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        //String localInputPath = args[0];
        Path outputPath = new Path(args[0]); // argument for the HDFS output location
        Configuration conf = getConf();
        FileSystem fs = FileSystem.get(conf);
        OutputStream os = fs.create(outputPath);
        // The local data set is read through a buffered input stream.
        InputStream is = new BufferedInputStream(new FileInputStream("/home/acadgild/acadgild.txt"));
        // Copy the data set from the input stream to the output stream (copyBytes closes both streams).
        IOUtils.copyBytes(is, os, conf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int returnCode = ToolRunner.run(new HdfsWriter(), args);
        System.exit(returnCode);
    }
}
I need to move the data from local to HDFS.
I got the above code from another blog and it's not working. Can anyone help me with this?
I also need to parse the JSON using MR, group it by DateTime, and move it to HDFS.
MapReduce is a distributed job-processing framework, so for each mapper "local" means the local filesystem of the node that mapper is running on.
What you want is to read from the local filesystem of a given node, put the data onto HDFS, and then process it via MapReduce.
There are multiple tools available for copying from the local filesystem of one node to HDFS, for example (a Java alternative is sketched below):
hdfs dfs -put localPath hdfsPath (shell)
Flume
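If you prefer to do the copy from Java rather than the shell, a minimal sketch using the FileSystem API looks like this (the paths and class name are placeholders, not from the original post):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyToHdfs {
    public static void main(String[] args) throws Exception {
        // Assumes core-site.xml/hdfs-site.xml are on the classpath so that
        // FileSystem.get() resolves to the HDFS cluster rather than the local FS.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Placeholder paths: replace with your own local file and HDFS target.
        fs.copyFromLocalFile(new Path("/tmp/data.json"), new Path("/user/hduser/data.json"));
        fs.close();
    }
}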

MapReduce set input and output

I have a file and the following driver code:
import java.io.IOException;
import java.nio.file.Paths;
import java.util.*;

import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;
import org.apache.hadoop.util.*;

public class ViewCount extends Configured implements Tool {

    public static void main(String args[]) throws Exception {
        int res = ToolRunner.run(new ViewCount(), args);
        System.exit(res);
    }

    public int run(String[] args) throws Exception {
        //Path inputPath = new Path(args[0]);
        Path inputPath = Paths.get("C:/WorkSpace/input.txt");
        Path outputPath = Paths.get("C:/WorkSpace/output.txt");
        Configuration conf = getConf();
        Job job = new Job(conf, this.getClass().toString());
I am trying to run the app on Windows. How can I set inputPath and outputPath? The method I use now doesn't work. Before I had
Path inputPath = new Path(args[0]);
Path outputPath = new Path(args[1]);
and I had to go to the command line. Now I want to run the app from the IDE.
I'm getting:
Required:
org.apache.hadoop.fs.Path
Found:
java.nio.file.Path
For Eclipse, you can set the arguments under Run -> Run Configurations -> Arguments.
It should be the same in IntelliJ.
The error tells you that it is expecting an org.apache.hadoop.fs.Path, but instead it receives a java.nio.file.Path.
This means that you should change the second import of your code to
org.apache.hadoop.fs.Path. IDE import suggestions can be wrong sometimes ;)
Change the import and then use the method that you already had to add the input and output path. Those arguments are given in Eclipse by right-clicking the project -> Run As -> Run Configurations -> Arguments. The two paths should be whitespace separated. Apply and run!
For the next executions, just run the project.
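To make the fix concrete, here is a minimal sketch of the corrected part of run() (the Windows paths mirror the question and are only placeholders):

// Use org.apache.hadoop.fs.Path (already covered by the org.apache.hadoop.fs.* import),
// not java.nio.file.Path/Paths.
Path inputPath = new Path("C:/WorkSpace/input.txt");   // or new Path(args[0]) when using Run Configurations
Path outputPath = new Path("C:/WorkSpace/output");     // or new Path(args[1]); the output directory must not exist yet
Configuration conf = getConf();
Job job = new Job(conf, this.getClass().toString());
FileInputFormat.addInputPath(job, inputPath);
FileOutputFormat.setOutputPath(job, outputPath);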

Error in running MapReduce job in Eclipse from Windows

I have a pseudo-distributed Hadoop setup on a Linux machine. I have done a few examples in Eclipse (also installed on that Linux machine) and they worked fine. Now I want to run MapReduce jobs through Eclipse (installed on a Windows machine) and access the HDFS that is already present on my Linux machine. I have written the following driver code:
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Windows_Driver extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        int exitcode = ToolRunner.run(new Windows_Driver(), args);
        System.exit(exitcode);
    }

    @Override
    public int run(String[] arg0) throws Exception {
        JobConf conf = new JobConf(Windows_Driver.class);
        conf.set("fs.defaultFS", "hdfs://<Ip address>:50070");

        FileInputFormat.setInputPaths(conf, new Path("sample"));
        FileOutputFormat.setOutputPath(conf, new Path("sam"));

        conf.setMapperClass(Win_Mapper.class);
        conf.setMapOutputKeyClass(Text.class);
        conf.setMapOutputValueClass(Text.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);

        JobClient.runJob(conf);
        return 0;
    }
}
And the Mapper code:
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class Win_Mapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {

    @Override
    public void map(LongWritable key, Text value, OutputCollector<Text, Text> o, Reporter arg3) throws IOException {
        ...
        o.collect(... , ...);
    }
}
When I run this, I get the following error:
SEVERE: PriviledgedActionException as:miracle cause:java.io.IOException: Failed to set permissions of path: \tmp\hadoop-miracle\mapred\staging\miracle1262421749\.staging to 0700
Exception in thread "main" java.io.IOException: Failed to set permissions of path: \tmp\hadoop-miracle\mapred\staging\miracle1262421749\.staging to 0700
at org.apache.hadoop.fs.FileUtil.checkReturnValue(FileUtil.java:691)
at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:664)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:514)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:349)
at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:193)
at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:126)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:942)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
at Windows_Driver.run(Windows_Driver.java:41)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at Windows_Driver.main(Windows_Driver.java:16)
How can I rectify the error? And how can I access my HDFS remotely from Windows?
The submit() method on the Job creates an internal JobSubmitter instance, and that does all the validations, including input path and output path availability, file/directory creation permissions, and so on. During the different phases of the MR job it creates temporary directories under which it puts the temporary files. The temporary directory is taken from core-site.xml via the property hadoop.tmp.dir. The issue on your system seems to be that the temporary directory is /tmp/ and the user running the MR job doesn't have permission to change its rwx status to 700. Provide the appropriate permissions and rerun the job.
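If you want to try the suggested fix without editing core-site.xml, one option (an illustrative sketch; the directory name is a placeholder, not from the original post) is to point hadoop.tmp.dir at a directory your Windows user owns, right where the JobConf is created:

// Illustrative sketch: redirect the temp/staging area to a user-writable directory.
// "C:/hadoop-tmp" is a placeholder; use any directory the submitting user has full rights on.
JobConf conf = new JobConf(Windows_Driver.class);
conf.set("fs.defaultFS", "hdfs://<Ip address>:50070");
conf.set("hadoop.tmp.dir", "C:/hadoop-tmp");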

Executing a jar through ant clean dist - appears under a different name?

I'm working with Hadoop and need to create a .jar file combining all of my classes in the /src folder. Every time I try to create it, it appears as WordCount.jar instead of Twitter.jar, which is what I have stated in my code below:
import java.util.Arrays;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Twitter {

    public static void runJob(String[] input, String output) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf);
        job.setJarByClass(Twitter.class);
        job.setReducerClass(TwitterReducer.class);
        job.setMapperClass(TwitterMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);

        Path outputPath = new Path(output);
        FileInputFormat.setInputPaths(job, StringUtils.join(input, ","));
        FileOutputFormat.setOutputPath(job, outputPath);
        outputPath.getFileSystem(conf).delete(outputPath, true);

        job.waitForCompletion(true);
    }

    public static void main(String[] args) throws Exception {
        runJob(Arrays.copyOfRange(args, 0, args.length - 1), args[args.length - 1]);
    }
}
Therefore I am unsure what is wrong. The files in the .jar itself are exactly the same as in the /src folder.
The name of the jar file has nothing to do with the name of a class in it. You need to check the Ant buildfile, specifically the target that creates the jar. The Ant task that creates the jar file is usually the jar task, and the name of the file can be specified via its destfile attribute.
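For illustration only (this is a generic sketch, not the asker's actual buildfile; the target and directory names are assumptions), a jar target that controls the output name might look like this:

<!-- Hypothetical Ant target: the jar file name comes from destfile, not from any class name. -->
<target name="dist" depends="compile">
    <jar destfile="dist/Twitter.jar" basedir="build/classes">
        <manifest>
            <attribute name="Main-Class" value="Twitter"/>
        </manifest>
    </jar>
</target>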

NoClassDefFoundError in wordcount program

I'm running the Hadoop wordcount program, but it is giving me an error like "NoClassDefFoundError".
Command for running:
hadoop -jar /home/user/Pradeep/sample.jar hdp_java.WordCount /user/hduser/ana.txt /user/hduser/prout
Exception in thread "main" java.lang.NoClassDefFoundError: WordCount
Caused by: java.lang.ClassNotFoundException: WordCount
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
Could not find the main class: WordCount. Program will exit.
I've created the program in Eclipse and then exported it as a jar file.
Eclipse code:
package hdp_java;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "wordcount");

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);
    }
}
Can anyone tell me where I am wrong?
You need to tell the Hadoop job which jar to use, like so:
job.setJarByClass(WordCount.class);
Also be sure to add any dependencies to both the HADOOP_CLASSPATH and -libjars upon submitting a job like in the following examples:
Use the following to add all the jar dependencies from (for example) current and lib directories:
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:`echo *.jar`:`echo lib/*.jar | sed 's/ /:/g'`
Bear in mind that when starting a job through hadoop jar you'll need to also pass it the jars of any dependencies through use of -libjars. I like to use:
hadoop jar <jar> <class> -libjars `echo ./lib/*.jar | sed 's/ /,/g'` [args...]
NOTE: The sed commands require a different delimiter character; the HADOOP_CLASSPATH is : separated and the -libjars need to be , separated.
Add this line in your code:
job.setJarByClass(WordCount.class);
If it still doesn't work, export this job as a jar and add it to itself as an external jar, and see if it works.
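To make the fix concrete, here is a minimal sketch of the main() from the question with the missing call added (everything else unchanged):

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setJarByClass(WordCount.class);   // tells Hadoop which jar to ship to the cluster

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.waitForCompletion(true);
}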
