I am running a MapReduce job with the following code. The job runs without errors, but the output shows 0 0. I suspect this is due to the TryParseInt() method: it was undefined at first, so I used the IDE's quick fix to create it. Can anyone check whether the code is correct, especially the TryParseInt method, and suggest how to get this program to run successfully?
input looks like :
Thanks in advance.
import java.io.IOException;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.io.LongWritable;
public class MaxPubYear {
public static class MaxPubYearMapper extends Mapper<LongWritable , Text, IntWritable,Text>
{
public void map(LongWritable key, Text value , Context context)
throws IOException, InterruptedException
{
String delim = "\t";
Text valtosend = new Text();
String tokens[] = value.toString().split(delim);
if (tokens.length == 2)
{
valtosend.set(tokens[0] + ";"+ tokens[1]);
context.write(new IntWritable(1), valtosend);
}
}
}
public static class MaxPubYearReducer extends Reducer<IntWritable ,Text, Text, IntWritable>
{
public void reduce(IntWritable key, Iterable<Text> values , Context context) throws IOException, InterruptedException
{
int maxiValue = Integer.MIN_VALUE;
String maxiYear = "";
for(Text value:values) {
String token[] = value.toString().split(";");
if(token.length == 2 && TryParseInt(token[1]).intValue()> maxiValue)
{
maxiValue = TryParseInt(token[1]);
maxiYear = token[0];
}
}
context.write(new Text(maxiYear), new IntWritable(maxiValue));
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job Job = new Job(conf, "Maximum Publication year");
Job.setJarByClass(MaxPubYear.class);
Job.setOutputKeyClass(Text.class);
Job.setOutputValueClass(IntWritable.class);
Job.setMapOutputKeyClass(IntWritable.class);
Job.setMapOutputValueClass(Text.class);
Job.setMapperClass(MaxPubYearMapper.class);
Job.setReducerClass(MaxPubYearReducer.class);
FileInputFormat.addInputPath(Job,new Path(args[0]));
FileOutputFormat.setOutputPath(Job,new Path(args[1]));
System.exit(Job.waitForCompletion(true)?0:1);
}
public static Integer TryParseInt(String string) {
// TODO Auto-generated method stub
return(0);
}
}
The errors mean exactly what they say: for the three 'could not be resolved to a type' errors, you probably forgot to import the right classes. Error 2 simply means there is no method TryParseInt(String) in the class MaxPubYear.MaxPubYearReducer; you have to create one there.
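For illustration, here is a minimal sketch of what such a helper could look like. The quick-fix stub in the question always returns 0, so every parsed value becomes 0, which would explain the 0 in the output. The class name TryParseIntSketch and the choice to return null for unparseable input are assumptions made for this example, not something taken from the original code.

```java
public class TryParseIntSketch {

    // Hypothetical helper: returns the parsed value, or null when the
    // text is missing or is not a valid base-10 integer.
    public static Integer tryParseInt(String text) {
        if (text == null) {
            return null;
        }
        try {
            return Integer.valueOf(text.trim());
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(tryParseInt("2002"));  // a valid number
        System.out.println(tryParseInt(" 17 "));  // surrounding whitespace is trimmed
        System.out.println(tryParseInt("n/a"));   // not a number, so null
    }
}
```

With this design the caller has to check for null before comparing, for example by parsing once into a local variable inside the reducer loop and skipping the record when the result is null.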
I'm practicing MapReduce and I have an Amazon .tsv file containing a list of reviews, each with a product rating. One product has many reviews, with one rating per review. The reviews also have other data such as user_id, product_name, review_title, etc. I want to use MapReduce on this file to generate output with 3 columns: product ID, total number of reviews, and the average rating of the product.
Link to the file I'm using for testing (it's sample_us.tsv):
https://gofile.io/?c=wLsv0y
So far I have written the following, but I am getting several errors. Please let me know if you see any fixes or better logic that could achieve the same goal. I am using Hadoop, by the way.
Mapper:
package stubs;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class ReviewMapper extends Mapper<LongWritable, Text, Text, IntWritable>
{
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException
{
int productIndex = 3; //index for productID
int ratingIndex = 7; //index for ratingID
String input = value.toString();
String [] line = input.split("\\t");
String productID = line[productIndex];
String ratingVal = line[ratingIndex];
if((productID.length() > 0) && (ratingVal.length() == 1))
{
int starRating = Integer.valueOf(ratingVal);
context.write(new Text(productID), new IntWritable(starRating));
}
}
}
And then my Reducer:
package stubs;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class ReviewReducer extends Reducer<Text, IntWritable, Text, Text> {
@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException
{
int reviewCount = 0;
int combineRating = 0;
for(IntWritable value : values)
{
reviewCount++;
combineRating += value.get();
}
int avgRating = (combineRating/reviewCount);
String reviews = Integer.toString(reviewCount);
String ratings = Integer.toString(avgRating);
String result = reviews+ "\t" +ratings;
context.write(key, new Text(result));
}
}
Lastly the Driver:
package stubs;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
public class AvgRatingReviews {
public static void main(String[] args) throws Exception {
if (args.length != 2) {
System.out.printf("Usage: AvgRatingReviews <input dir> <output dir>\n");
System.exit(-1);
}
Job job = new Job();
job.setJarByClass(AvgRatingReviews.class);
job.setJobName("Review Results");
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(ReviewMapper.class);
job.setReducerClass(ReviewReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
boolean success = job.waitForCompletion(true);
System.exit(success ? 0 : 1);
}
}
I want to use Linux to practice Hadoop MapReduce examples.
I have written code for my project and I get some warning messages when I compile it, and I cannot run it. I have tried several approaches, such as ignoring the warnings, but I am still unable to run it. The code is below.
import java.io.IOException;
import java.util.StringTokenizer;
import java.util.Iterator;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.Mapper.Context;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.StringUtils;
public class Average{
public static class Map extends Mapper<Object, Text, Text, IntWritable> {
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
StringTokenizer script = new StringTokenizer(line, "\n");
while (script.hasMoreTokens()) {
StringTokenizer scriptLine = new StringTokenizer(script.nextToken());
Text Name = new Text(scriptLine.nextToken());
int Score = Integer.parseInt(scriptLine.nextToken());
context.write(Name, new IntWritable(Score));
}
}
}
public static class Reduce extends Reducer<Text,IntWritable,Text,IntWritable> {
public void reduce(Text key, Iterable<IntWritable> value, Context context) throws IOException, InterruptedException{
int numerator = 0;
int denominator = 0;
int avg = 0;
for (IntWritable score : value) {
numerator += score.get();
denominator++;
}
avg = numerator/denominator;
context.write(key, new IntWritable(avg));
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
Path dst_path = new Path (otherArgs[1]);
FileSystem hdfs = dst_path.getFileSystem(conf);
if (hdfs.exists(dst_path)){
hdfs.delete(dst_path, true);
};
Job job = new Job(conf, "Average");
job.setJarByClass(Average.class);
job.setMapperClass(Map.class);
job.setCombinerClass(Reduce.class);
job.setReducerClass(Reduce.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
I want to run the code described in this tutorial in order to customize the output format in Hadoop. More precisely, the tutorial shows two java files:
WordCount: is the word count java application (similar to the WordCount v1.0 of the MapReduce Tutorial in this link)
XMLOutputFormat: java class that extends FileOutputFormat and implements the method to customize the output.
Well, what I did was take the WordCount v1.0 from the MapReduce Tutorial (instead of the WordCount shown in the tutorial), add job.setOutputFormatClass(XMLOutputFormat.class); to the driver, and run the Hadoop app this way:
/usr/local/hadoop/bin/hadoop com.sun.tools.javac.Main WordCount.java && jar cf wc.jar WordCount*.class && /usr/local/hadoop/bin/hadoop jar wc.jar WordCount /home/luis/Desktop/mytest/input/ ./output_folder
note: /home/luis/Desktop/mytest/input/ and ./output_folder are the input and output folders, respectively.
Unfortunately, the terminal shows me the following error:
WordCount.java:57: error: cannot find symbol
job.setOutputFormatClass(XMLOutputFormat.class);
^
symbol: class XMLOutputFormat
location: class WordCount
1 error
Why? WordCount.java and XMLOutputFormat.java are stored in the same folder.
The following is my code.
WordCount code:
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {
public static class TokenizerMapper
extends Mapper<Object, Text, Text, IntWritable>{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
}
public static class IntSumReducer
extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values,
Context context
) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setOutputFormatClass(XMLOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
XMLOutputFormat code:
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class XMLOutputFormat extends FileOutputFormat<Text, IntWritable> {
protected static class XMLRecordWriter extends RecordWriter<Text, IntWritable> {
private DataOutputStream out;
public XMLRecordWriter(DataOutputStream out) throws IOException{
this.out = out;
out.writeBytes("<Output>\n");
}
private void writeStyle(String xml_tag,String tag_value) throws IOException {
out.writeBytes("<"+xml_tag+">"+tag_value+"</"+xml_tag+">\n");
}
public synchronized void write(Text key, IntWritable value) throws IOException {
out.writeBytes("<record>\n");
this.writeStyle("key", key.toString());
this.writeStyle("value", value.toString());
out.writeBytes("</record>\n");
}
public synchronized void close(TaskAttemptContext job) throws IOException {
try {
out.writeBytes("</Output>\n");
} finally {
out.close();
}
}
}
public RecordWriter<Text, IntWritable> getRecordWriter(TaskAttemptContext job) throws IOException {
String file_extension = ".xml";
Path file = getDefaultWorkFile(job, file_extension);
FileSystem fs = file.getFileSystem(job.getConfiguration());
FSDataOutputStream fileOut = fs.create(file, false);
return new XMLRecordWriter(fileOut);
}
}
You need to either add package testpackage; at the beginning of your WordCount class
or
import testpackage.XMLOutputFormat; in your WordCount class.
Being in the same directory does not imply that they are in the same package.
We first need to add the XMLOutputFormat.jar file to HADOOP_CLASSPATH so that the driver code can find it, and then pass it with the -libjars option so that it is added to the classpath of the map and reduce JVMs.
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/abc/xyz/XMLOutputFormat.jar
yarn jar wordcount.jar com.sample.test.Wordcount \
-libjars /path/to/XMLOutputFormat.jar \
/lab/mr/input /lab/output/output
I have used one mapper, one reducer, and one combiner class, but I am getting the error below:
java.io.IOException: wrong value class: class org.apache.hadoop.io.Text is not class org.apache.hadoop.io.IntWritable
at org.apache.hadoop.mapred.IFile$Writer.append(IFile.java:199)
at org.apache.hadoop.mapred.Task$CombineOutputCollector.collect(Task.java:1307)
at org.apache.hadoop.mapred.Task$NewCombinerRunner$OutputConverter.write(Task.java:1623)
at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
at org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.write(WrappedReducer.java:105)
at BookPublished1$Combine.reduce(BookPublished1.java:47)
at BookPublished1$Combine.reduce(BookPublished1.java:1)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1644)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1618)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1467)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:699)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:769)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
My entire program looks like below:
import java.io.IOException;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.fs.Path;
public class BookPublished1 {
public static class Map extends Mapper<LongWritable,Text,Text,IntWritable>{
public void map(LongWritable key, Text value,Context context)
throws IOException,InterruptedException {
String line = value.toString();
String [] strYear = line.split(";");
context.write(new Text(strYear[3]), new IntWritable(1));
}
}
public static class Combine extends Reducer<Text,IntWritable,Text,Text>{
public void reduce(Text key, Iterable<IntWritable> values,Context context)
throws IOException,InterruptedException {
int sum=0;
// TODO Auto-generated method stub
for(IntWritable x: values)
{
sum+=x.get();
}
context.write(new Text("BookSummary"), new Text(key + "_"+ sum));
}
}
public static class Reduce extends Reducer<Text,Text,Text,FloatWritable>{
public void reduce(Text key, Iterable<Text> values,Context context)throws IOException,InterruptedException
{
Long publishYear =0L, max=Long.MAX_VALUE;
Text publishYear1 = null,maxYear=null;
Long publishValue= 0L;
String compositeString;
String compositeStringArray[];
// TODO Auto-generated method stub
for(Text x: values)
{
compositeString = x.toString();
compositeStringArray = compositeString.split("_");
publishYear1=new Text(compositeStringArray[0]);
publishValue=new Long(compositeStringArray[1]);
if(publishValue > max){
max=publishValue;
maxYear=publishYear1;
}
}
Text keyText= new Text("max" + " ( " + maxYear.toString() + ") : ");
context.write(keyText, new FloatWritable(max));
}
}
public static void main(String[] args) throws Exception {
Configuration conf= new Configuration();
Job job = new Job(conf,"BookPublished");
job.setJarByClass(BookPublished1.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setCombinerClass(Combine.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(FloatWritable.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
Path outputPath = new Path(args[1]);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
outputPath.getFileSystem(conf).delete(outputPath);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Please help me resolve this.
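The output types of a combiner must match the output types of the mapper. Hadoop makes no guarantees about how many times the combiner is applied, or that it is applied at all. That is the problem in your case: the stack trace shows the failure at the point where Combine.reduce writes a Text value while the map output value class is IntWritable.
When the combiner is skipped, values from the map (<Text, IntWritable>) go directly to the reducer, where types <Text, Text> are expected.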
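To make the rule concrete, here is a small plain-Java simulation (no Hadoop dependency) of a job whose combiner has the same input and output types as the map output. Strings and Integers stand in for Text and IntWritable, and the class and method names are made up for this sketch. Because the combiner emits the same kind of pair it consumes, the reducer sees valid input whether the combiner runs or not.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CombinerTypesSketch {

    // Stand-in for a summing combiner or reducer. Input and output pairs are
    // both (String, Integer), mirroring map output types <Text, IntWritable>.
    static List<Map.Entry<String, Integer>> sumByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            sums.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return new ArrayList<>(sums.entrySet());
    }

    // Simulates one job: the map step emits (year, 1) for every record,
    // the combiner may or may not run, and the reducer sums per key.
    static Map<String, Integer> runJob(List<String> years, boolean combinerRuns) {
        List<Map.Entry<String, Integer>> mapOutput = new ArrayList<>();
        for (String year : years) {
            mapOutput.add(new AbstractMap.SimpleEntry<>(year, 1));
        }
        List<Map.Entry<String, Integer>> reducerInput =
                combinerRuns ? sumByKey(mapOutput) : mapOutput;
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : sumByKey(reducerInput)) {
            result.put(pair.getKey(), pair.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> years = Arrays.asList("1999", "2001", "1999", "2001", "2001");
        System.out.println(runJob(years, false)); // combiner skipped
        System.out.println(runJob(years, true));  // combiner applied once
    }
}
```

Both calls print the same per-year counts. In the job from the question, this suggests keeping the combiner as a plain <Text, IntWritable> to <Text, IntWritable> sum and moving the "BookSummary" regrouping and the search for the maximum into a later step, for example a second job.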
Goal:
I want to be able to specify the number of mappers used on an input file
Equivalently, I want to specify the number of lines of a file each mapper will take
Simple example:
For an input file of 10 lines (of unequal length; example below), I want there to be 2 mappers -- each mapper will thus process 5 lines.
This is
an arbitrary example file
of 10 lines.
Each line does
not have to be
of
the same
length or contain
the same
number of words
This is what I have:
(I have it so that each mapper produces one "<map,1>" key-value pair ... so that it will then be summed in the reducer)
package org.myorg;
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.InputFormat;
public class Test {
// produce one "<map,1>" pair per mapper
public static class Map extends Mapper<Object, Text, Text, IntWritable>{
private final static IntWritable one = new IntWritable(1);
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
context.write(new Text("map"), one);
}
}
// reduce by taking a sum
public static class Red extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job1 = Job.getInstance(conf, "pass01");
job1.setJarByClass(Test.class);
job1.setMapperClass(Map.class);
job1.setCombinerClass(Red.class);
job1.setReducerClass(Red.class);
job1.setOutputKeyClass(Text.class);
job1.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job1, new Path(args[0]));
FileOutputFormat.setOutputPath(job1, new Path(args[1]));
// // Attempt#1
// conf.setInt("mapreduce.input.lineinputformat.linespermap", 5);
// job1.setInputFormatClass(NLineInputFormat.class);
// // Attempt#2
// NLineInputFormat.setNumLinesPerSplit(job1, 5);
// job1.setInputFormatClass(NLineInputFormat.class);
// // Attempt#3
// conf.setInt(NLineInputFormat.LINES_PER_MAP, 5);
// job1.setInputFormatClass(NLineInputFormat.class);
// // Attempt#4
// conf.setInt("mapreduce.input.fileinputformat.split.minsize", 234);
// conf.setInt("mapreduce.input.fileinputformat.split.maxsize", 234);
System.exit(job1.waitForCompletion(true) ? 0 : 1);
}
}
The above code, using the above example data, will produce
map 10
I want the output to be
map 2
where the first mapper will do something with the first 5 lines, and the second mapper will do something with the second 5 lines.
You could use NLineInputFormat.
With NLineInputFormat, you can specify exactly how many lines go to each mapper.
E.g., if your file has 500 lines and you set the number of lines per mapper to 10, you get 50 mappers
(instead of one, assuming the file is smaller than an HDFS block).
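The arithmetic behind that example can be sketched in plain Java. This assumes a single input file and that the last split simply receives whatever lines remain; the class and method names are made up for the illustration.

```java
public class SplitCountSketch {

    // Number of splits (and therefore map tasks) for one file when every
    // split holds at most linesPerSplit lines: a ceiling division.
    static int numSplits(int totalLines, int linesPerSplit) {
        if (totalLines < 0 || linesPerSplit <= 0) {
            throw new IllegalArgumentException("totalLines must be >= 0 and linesPerSplit > 0");
        }
        return (totalLines + linesPerSplit - 1) / linesPerSplit;
    }

    public static void main(String[] args) {
        System.out.println(numSplits(500, 10)); // 50, the example above
        System.out.println(numSplits(10, 5));   // 2, the 10-line file from the question
        System.out.println(numSplits(11, 5));   // 3, the last split holds a single line
    }
}
```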
EDIT:
Here is an example for using NLineInputFormat:
Mapper Class:
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class MapperNLine extends Mapper<LongWritable, Text, LongWritable, Text> {
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
context.write(key, value);
}
}
Driver class:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class Driver extends Configured implements Tool {
@Override
public int run(String[] args) throws Exception {
if (args.length != 2) {
System.out
.printf("Two parameters are required for DriverNLineInputFormat- <input dir> <output dir>\n");
return -1;
}
Job job = Job.getInstance(getConf()); // the Job constructors are deprecated in Hadoop 2+
job.setJobName("NLineInputFormat example");
job.setJarByClass(Driver.class);
job.setInputFormatClass(NLineInputFormat.class);
NLineInputFormat.addInputPath(job, new Path(args[0]));
job.getConfiguration().setInt("mapreduce.input.lineinputformat.linespermap", 5);
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(MapperNLine.class);
job.setNumReduceTasks(0);
boolean success = job.waitForCompletion(true);
return success ? 0 : 1;
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new Configuration(), new Driver(), args);
System.exit(exitCode);
}
}
With the input you provided, the output of the sample Mapper above is written to two files, since 2 mappers are initialized:
part-m-00001
0 This is
8 an arbitrary example file
34 of 10 lines.
47 Each line does
62 not have to be
part-m-00002
77 of
80 the same
89 length or contain
107 the same
116 number of words
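The numbers in the first column are the LongWritable keys, which are the byte offsets at which each line starts in the input file. As a sanity check, the following plain-Java sketch recomputes them from the ten example lines, assuming UTF-8 text and a single '\n' terminator per line (the class and method names are made up for the illustration).

```java
import java.nio.charset.StandardCharsets;

public class LineOffsetSketch {

    // Byte offset of the start of each line, assuming every line is
    // terminated by exactly one '\n' byte.
    static long[] lineOffsets(String[] lines) {
        long[] offsets = new long[lines.length];
        long position = 0;
        for (int i = 0; i < lines.length; i++) {
            offsets[i] = position;
            position += lines[i].getBytes(StandardCharsets.UTF_8).length + 1; // +1 for '\n'
        }
        return offsets;
    }

    public static void main(String[] args) {
        String[] lines = {
            "This is", "an arbitrary example file", "of 10 lines.",
            "Each line does", "not have to be", "of", "the same",
            "length or contain", "the same", "number of words"
        };
        long[] offsets = lineOffsets(lines);
        for (int i = 0; i < lines.length; i++) {
            System.out.println(offsets[i] + "\t" + lines[i]);
        }
    }
}
```

The computed offsets are 0, 8, 34, 47, 62, 77, 80, 89, 107 and 116, which match the keys in the two part files above.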