I am trying to write a word count program using the MapReduce Hadoop technology. What I need to do is develop an Indexed Word Count application that counts the number of occurrences of each word in each file in a given input file set. The file set is present in an Amazon S3 bucket. It will also count the total occurrences of each word. I have attached the code that counts the occurrences of the words in the given file set. After this I need to print which word occurs in which file, along with the number of occurrences of the word in that particular file.
I know it's a bit complex, but any help would be appreciated.
Map.java
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
public class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
private String pattern= "^[a-z][a-z0-9]*$";
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
InputSplit inputSplit = context.getInputSplit();
String fileName = ((FileSplit) inputSplit).getPath().getName();
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
String stringWord = word.toString().toLowerCase();
if (stringWord.matches(pattern)){
context.write(new Text(stringWord), one);
}
}
}
}
Reduce.java
import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
public class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
context.write(key, new IntWritable(sum));
}
}
WordCount.java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class WordCount {
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = new Job(conf, "WordCount");
job.setJarByClass(WordCount.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setNumReduceTasks(3);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
}
}
In the mapper, create a custom writable TextPair as the output key; it holds the filename and the word from your file, with 1 as the value.
Mapper output:
<K,V> ==> <MytextpairWritable, new IntWritable(1)>
You can get the filename in mapper with below snippet.
FileSplit fileSplit = (FileSplit)context.getInputSplit();
String filename = fileSplit.getPath().getName();
Then pass these to the constructor of the custom writable class in the context.write. Something like this.
context.write(new MytextpairWritable(filename,word),new IntWritable(1));
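The answer does not show the custom writable itself, so here is a minimal, untested sketch of what MytextpairWritable could look like (the class name comes from the snippet above; everything else is an assumption):
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
public class MytextpairWritable implements WritableComparable<MytextpairWritable> {
    private Text filename = new Text();
    private Text word = new Text();
    // Hadoop needs a no-arg constructor to create instances by reflection
    public MytextpairWritable() {
    }
    public MytextpairWritable(String filename, String word) {
        this.filename.set(filename);
        this.word.set(word);
    }
    @Override
    public void write(DataOutput out) throws IOException {
        // serialize both parts, in a fixed order
        filename.write(out);
        word.write(out);
    }
    @Override
    public void readFields(DataInput in) throws IOException {
        // deserialize in the same order as write()
        filename.readFields(in);
        word.readFields(in);
    }
    @Override
    public int compareTo(MytextpairWritable other) {
        // sort by filename first, then by word
        int cmp = filename.compareTo(other.filename);
        return cmp != 0 ? cmp : word.compareTo(other.word);
    }
    @Override
    public int hashCode() {
        // HashPartitioner uses this, so equal keys must produce equal hashes
        return filename.hashCode() * 163 + word.hashCode();
    }
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof MytextpairWritable)) {
            return false;
        }
        MytextpairWritable other = (MytextpairWritable) o;
        return filename.equals(other.filename) && word.equals(other.word);
    }
    @Override
    public String toString() {
        // controls how TextOutputFormat prints the key, e.g. "File1,hello"
        return filename + "," + word;
    }
}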
On the reducer side, just sum up the values, so that for each file you get how many occurrences there are of a particular word. The reducer code would be something like this.
public class Reduce extends Reducer<MytextpairWritable, IntWritable, MytextpairWritable, IntWritable> {
    public void reduce(MytextpairWritable key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
Your output will be something like this.
File1,hello,2
File2,hello,3
File3,hello,1
I am new to the MapReduce topic and still in the learning phase. I thank you in advance for the help and further tips. In the context of an exercise at the university I have the following problem:
From a CSV file (listed below as an example) I want to calculate the average order_demand for every single product_code.
The code shown below ("FrequencyMapper" and "FrequencyReducer") is running on my server, and I think I currently have a display problem with the output.
Since I am making my first attempts with MapReduce, I am grateful for any help.
Listed below are the mapper, reducer and driver codes.
Example of the Dataset (csv-file)
Product_Code,Warehouse,Product_Category,Date,Order_Demand
Product_0993,Whse_J,Category_028,2012/7/27,100
Product_0979,Whse_J,Category_028,2012/6/5,500
Product_0979,Whse_E,Category_028,2012/11/29,500
Product_1157,Whse_E,Category_006,2012/6/4,160000
Product_1159,Whse_A,Category_006,2012/7/17,50000
My goal for example:
Product_0979 500
Product_1157 105000
...
FrequencyMapper.java:
package ma.test.a02;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class FrequencyMapper
extends Mapper<LongWritable, Text, Text, IntWritable> {
@Override
public void map(LongWritable offset, Text lineText, Context context)
throws IOException, InterruptedException {
String line = lineText.toString();
if(line.contains("Product")) {
String productcode = line.split(",")[0];
float orderDemand = Float.parseFloat(line.split(",")[4]);
context.write(new Text(productcode), new IntWritable((int) orderDemand));
}
}
}
FrequencyReducer.java:
package ma.test.a02;
import java.io.IOException;
import javax.xml.soap.Text;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;
public class FrequencyReducer extends Reducer< Text , IntWritable , IntWritable , FloatWritable > {
public void reduce( IntWritable productcode, Iterable<IntWritable> orderDemands, Context context)
throws IOException, InterruptedException {
float averageDemand = 0;
float count = 0;
for ( IntWritable orderDemand : orderDemands) {
averageDemand +=orderDemand.get();
count +=1;
}
float result = averageDemand / count;
context.write(productcode, new FloatWritable (result));
}
}
Frequency.java (Driver):
package ma.test.a02;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class Frequency {
public static void main(String[] args) throws Exception {
if (args.length != 2) {
System.err.println("Usage: Average <input path> <output path>");
System.exit(-1);
}
// create a Hadoop job and set the main class
Job job = Job.getInstance();
job.setJarByClass(Frequency.class);
job.setJobName("MA-Test Average");
// set the input and output path
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
// set the Mapper and Reducer class
job.setMapperClass(FrequencyMapper.class);
job.setReducerClass(FrequencyReducer.class);
// specify the type of the output
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(FloatWritable.class);
// run the job
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Tip 1: In the mapper you have filtered lines that contain "VOLUME" in the following line:
if(line.contains("VOLUME")) {
}
But no line contains "VOLUME", so the reducer gets no input!
Tip 2: Your reducer output value is FloatWritable, so you should use this line in your runner (the Frequency class):
job.setOutputValueClass(FloatWritable.class);
instead of this one:
job.setOutputValueClass(IntWritable.class);
Tip 3: In the reducer, change this line:
public class FrequencyReducer extends Reducer<Text, IntWritable, IntWritable, FloatWritable>
to this one:
public class FrequencyReducer extends Reducer<Text, IntWritable, Text, FloatWritable>
Also add these lines to the Frequency class:
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
Tip 4: The first line of your CSV file, which describes its structure, will cause a problem. Reject that line by putting the following at the start of your map method:
if(line.contains("Product_Code,Warehouse")) {
return;
}
Tip 5: In the real program, make sure you have a plan for strings that cannot be parsed to an integer in orderDemand.
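For example (a hedged sketch, not part of the original tips), the parsing inside the mapper's if block could be guarded so that a malformed row is skipped rather than failing the task:
// inside map(), replacing the direct Integer.valueOf(...) call
String[] fields = line.split(",");
if (fields.length < 5) {
    return;                        // malformed row: skip it
}
int orderDemand;
try {
    orderDemand = Integer.parseInt(fields[4].trim());
} catch (NumberFormatException e) {
    return;                        // Order_Demand is not a valid integer: skip the row
}
context.write(new Text(fields[0].trim()), new IntWritable(orderDemand));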
At the end your mapper will be:
public class FrequencyMapper
extends Mapper<LongWritable, Text, Text, IntWritable> {
@Override
public void map(LongWritable offset, Text lineText, Context context)
throws IOException, InterruptedException {
String line = lineText.toString();
if (line.contains("Product_Code,Warehouse")) {
return;
}
if (line.contains("Product")) {
String productcode = line.split(",")[0].trim();
int orderDemand = Integer.valueOf(line.split(",")[4].trim());
context.write(new Text(productcode), new IntWritable(orderDemand));
}
}
}
And here is your reducer:
public class FrequencyReducer extends Reducer<Text, IntWritable , Text, FloatWritable > {
public void reduce( Text productcode, Iterable<IntWritable> orderDemands, Context context)
throws IOException, InterruptedException {
float averageDemand = 0;
float count = 0;
for ( IntWritable orderDemand : orderDemands) {
averageDemand +=orderDemand.get();
count +=1;
}
float result = averageDemand / count;
context.write(productcode, new FloatWritable (result));
}
}
And here is your runner:
public class Frequency {
public static void main(String[] args) throws Exception {
if (args.length != 2) {
System.err.println("Usage: Average <input path> <output path>");
System.exit(-1);
}
// create a Hadoop job and set the main class
Job job = Job.getInstance();
job.setJarByClass(Frequency.class);
job.setJobName("MA-Test Average");
// set the input and output path
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
// set the Mapper and Reducer class
job.setMapperClass(FrequencyMapper.class);
job.setReducerClass(FrequencyReducer.class);
// specify the type of the output
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(FloatWritable.class);
// run the job
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
I want to run the code described in this tutorial in order to customize the output format in Hadoop. More precisely, the tutorial shows two java files:
WordCount: the word count Java application (similar to WordCount v1.0 of the MapReduce Tutorial in this link)
XMLOutputFormat: a Java class that extends FileOutputFormat and implements the method to customize the output.
Well, what I did was take the WordCount v1.0 of the MapReduce Tutorial (instead of using the WordCount shown in the tutorial), add job.setOutputFormatClass(XMLOutputFormat.class); in the driver, and execute the Hadoop app this way:
/usr/local/hadoop/bin/hadoop com.sun.tools.javac.Main WordCount.java && jar cf wc.jar WordCount*.class && /usr/local/hadoop/bin/hadoop jar wc.jar WordCount /home/luis/Desktop/mytest/input/ ./output_folder
note: /home/luis/Desktop/mytest/input/ and ./output_folder are the input and output folders, respectively.
Unfortunately, the terminal shows me the following error:
WordCount.java:57: error: cannot find symbol
job.setOutputFormatClass(XMLOutputFormat.class);
^
symbol: class XMLOutputFormat
location: class WordCount
1 error
Why? WordCount.java and XMLOutputFormat.java are stored in the same folder.
The following is my code.
WordCount code:
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {
public static class TokenizerMapper
extends Mapper<Object, Text, Text, IntWritable>{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
}
public static class IntSumReducer
extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values,
Context context
) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setOutputFormatClass(XMLOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
XMLOutputFormat code:
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class XMLOutputFormat extends FileOutputFormat<Text, IntWritable> {
protected static class XMLRecordWriter extends RecordWriter<Text, IntWritable> {
private DataOutputStream out;
public XMLRecordWriter(DataOutputStream out) throws IOException{
this.out = out;
out.writeBytes("<Output>\n");
}
private void writeStyle(String xml_tag,String tag_value) throws IOException {
out.writeBytes("<"+xml_tag+">"+tag_value+"</"+xml_tag+">\n");
}
public synchronized void write(Text key, IntWritable value) throws IOException {
out.writeBytes("<record>\n");
this.writeStyle("key", key.toString());
this.writeStyle("value", value.toString());
out.writeBytes("</record>\n");
}
public synchronized void close(TaskAttemptContext job) throws IOException {
try {
out.writeBytes("</Output>\n");
} finally {
out.close();
}
}
}
public RecordWriter<Text, IntWritable> getRecordWriter(TaskAttemptContext job) throws IOException {
String file_extension = ".xml";
Path file = getDefaultWorkFile(job, file_extension);
FileSystem fs = file.getFileSystem(job.getConfiguration());
FSDataOutputStream fileOut = fs.create(file, false);
return new XMLRecordWriter(fileOut);
}
}
You need to either add package testpackage; at the beginning of your WordCount class
or
import testpackage.XMLOutputFormat; in your WordCount class.
Just because they are in the same directory doesn't mean they are in the same package.
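For example (assuming XMLOutputFormat.java declares package testpackage; as in the answer above; the package name is only an illustration), the top of WordCount.java could use either option:
// Option 1: put WordCount into the same package as XMLOutputFormat
package testpackage;
// Option 2: keep WordCount where it is and import the class explicitly instead
// import testpackage.XMLOutputFormat;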
We will need to add the XMLOutputFormat.jar file to HADOOP_CLASSPATH first so that the driver code can find it, and pass it with the -libjars option so it is added to the classpath of the map and reduce JVMs.
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/abc/xyz/XMLOutputFormat.jar
yarn jar wordcount.jar com.sample.test.Wordcount
-libjars /path/to/XMLOutputFormat.jar
/lab/mr/input /lab/output/output
I want to implement a string matching (Boyer-Moore) algorithm using Hadoop. I just started using Hadoop, so I have no idea how to write a Hadoop program in Java.
All the sample programs that I have seen so far are word counting examples, and I couldn't find any sample programs for string matching.
I tried searching for tutorials that teach how to write Hadoop applications in Java but couldn't find any. Can you suggest some tutorials where I can learn how to write Hadoop applications in Java?
Thanks in advance.
I haven't tested the code below, but it should get you started.
I have used the BoyerMoore implementation available here
What the code below does:
The goal is to search for a pattern in an input document. The BoyerMoore class is initialized in the setup method using the pattern set in the configuration.
The mapper receives one line at a time and uses the BoyerMoore instance to find the pattern. If a match is found, we write it out using the context.
There is no need for a reducer here. If the pattern is found multiple times in different mappers, the output will have multiple offsets (one per mapper).
package hadoop.boyermoore;
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class BoyerMooreImpl {
public static class TokenizerMapper
extends Mapper<Object, Text, Text, IntWritable>{
private BoyerMoore boyerMoore;
private static IntWritable offset;
private Text offsetFound = new Text("offset");
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
String line = itr.nextToken();
int offset1 = boyerMoore.search(line);
if (line.length() != offset1) {
offset = new IntWritable(offset1);
context.write(offsetFound,offset);
}
}
}
@Override
public final void setup(Context context) {
if (boyerMoore == null)
boyerMoore = new BoyerMoore(context.getConfiguration().get("pattern"));
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
conf.set("pattern","your_pattern_here");
Job job = Job.getInstance(conf, "BoyerMoore");
job.setJarByClass(BoyerMooreImpl.class);
job.setMapperClass(TokenizerMapper.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
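The BoyerMoore class itself is not reproduced in the answer (the link points to an external implementation). For completeness, here is a simplified, untested sketch that matches the way the class is used above: the constructor takes the pattern, and search(txt) returns the offset of the first match, or txt.length() when there is no match (which is what the "line.length() != offset1" check in the mapper relies on). It implements only the bad-character rule and assumes extended-ASCII input:
package hadoop.boyermoore;
public class BoyerMoore {
    private static final int R = 256;   // alphabet size (extended ASCII); an assumption of this sketch
    private final int[] right;          // rightmost index of each character in the pattern
    private final String pattern;
    public BoyerMoore(String pattern) {
        this.pattern = pattern;
        this.right = new int[R];
        for (int c = 0; c < R; c++) {
            right[c] = -1;              // character does not occur in the pattern
        }
        for (int j = 0; j < pattern.length(); j++) {
            right[pattern.charAt(j)] = j;
        }
    }
    // Returns the offset of the first occurrence of the pattern in txt,
    // or txt.length() if there is no match.
    public int search(String txt) {
        int m = pattern.length();
        int n = txt.length();
        int skip;
        for (int i = 0; i <= n - m; i += skip) {
            skip = 0;
            for (int j = m - 1; j >= 0; j--) {
                if (pattern.charAt(j) != txt.charAt(i + j)) {
                    skip = Math.max(1, j - right[txt.charAt(i + j)]);
                    break;
                }
            }
            if (skip == 0) {
                return i;               // match found at offset i
            }
        }
        return n;                       // no match
    }
}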
I don't know if this is the correct implementation to run an algorithm in parallel, but this is what I figured out,
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;
import org.apache.hadoop.util.*;
public class StringMatching extends Configured implements Tool {
public static void main(String args[]) throws Exception {
long start = System.currentTimeMillis();
int res = ToolRunner.run(new StringMatching(), args);
long end = System.currentTimeMillis();
System.exit((int)(end-start));
}
public int run(String[] args) throws Exception {
Path inputPath = new Path(args[0]);
Path outputPath = new Path(args[1]);
Configuration conf = getConf();
Job job = new Job(conf, this.getClass().toString());
FileInputFormat.setInputPaths(job, inputPath);
FileOutputFormat.setOutputPath(job, outputPath);
job.setJobName("StringMatching");
job.setJarByClass(StringMatching.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(Map.class);
job.setCombinerClass(Reduce.class);
job.setReducerClass(Reduce.class);
return job.waitForCompletion(true) ? 0 : 1;
}
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
@Override
public void map(LongWritable key, Text value,
Mapper.Context context) throws IOException, InterruptedException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
context.write(word, one);
}
}
}
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
BoyerMoore bm = new BoyerMoore();
boolean flag = bm.findPattern(key.toString().trim().toLowerCase(), "abc");
if(flag){
context.write(key, new IntWritable(1));
}else{
context.write(key, new IntWritable(0));
}
}
}
}
I'm using AWS (Amazon Web Services), so I can select from the console the number of nodes I want my program to run on simultaneously. So I'm assuming that the map and reduce methods I have used should be enough to run the Boyer-Moore string matching algorithm in parallel.
I use the code below to get output results like (Key, Value):
Apple 12
Bee 345
Cat 123
What I want is to sort in descending order by the value (345) and place the value before the key (Value, Key):
345 Bee
123 Cat
12 Apple
I found there is something called "secondary sort"; not going to lie, I'm so lost. I tried to change context.write(key, result); but failed miserably. I'm new to Hadoop and not sure how to start tackling this problem. Any recommendation would be appreciated. Which function do I need to change? Or which class do I need to modify?
Here are my classes:
package org.apache.hadoop.examples;
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class WordCount {
public static class TokenizerMapper
extends Mapper<Object, Text, Text, IntWritable>{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
}
public static class IntSumReducer
extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values,
Context context
) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
if (otherArgs.length < 2) {
System.err.println("Usage: wordcount <in> [<in>...] <out>");
System.exit(2);
}
Job job = new Job(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
for (int i = 0; i < otherArgs.length - 1; ++i) {
FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
}
FileOutputFormat.setOutputPath(job,
new Path(otherArgs[otherArgs.length - 1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
You have been able to do the word count correctly.
You will need a second job to perform the second requirement: the descending sort and the swapping of key and value (a sketch follows after these points).
Use DecreasingComparator as the sort comparator
Use InverseMapper to swap keys and values
Use the identity reducer, i.e. Reducer.class - with the identity reducer no aggregation happens (each value is output individually for its key)
Set the number of reduce tasks to 1, or use TotalOrderPartitioner
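Putting those points together, here is an untested sketch of that second job. It assumes the word-count job is changed to write its output with SequenceFileOutputFormat so the <Text, IntWritable> types are preserved; class and path names are illustrative, and a small descending comparator for IntWritable is written out explicitly in place of DecreasingComparator:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.InverseMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class SortByCount {
    // Sorts IntWritable keys in descending order by swapping the arguments
    // passed to the standard IntWritable comparator.
    public static class DescendingIntComparator extends IntWritable.Comparator {
        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            return super.compare(b, a);
        }
        @Override
        public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
            return super.compare(b2, s2, l2, b1, s1, l1);
        }
    }
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job sortJob = Job.getInstance(conf, "sort word counts");
        sortJob.setJarByClass(SortByCount.class);
        // args[0] = output directory of the word-count job (written as a SequenceFile
        // of <Text word, IntWritable count>), args[1] = final sorted output directory
        sortJob.setInputFormatClass(SequenceFileInputFormat.class);
        FileInputFormat.addInputPath(sortJob, new Path(args[0]));
        FileOutputFormat.setOutputPath(sortJob, new Path(args[1]));
        // InverseMapper swaps <word, count> into <count, word>
        sortJob.setMapperClass(InverseMapper.class);
        // Identity reducer: emits each pair unchanged; one reducer keeps a single sorted file
        sortJob.setReducerClass(Reducer.class);
        sortJob.setNumReduceTasks(1);
        sortJob.setOutputKeyClass(IntWritable.class);
        sortJob.setOutputValueClass(Text.class);
        // Sort the counts in descending order during the shuffle
        sortJob.setSortComparatorClass(DescendingIntComparator.class);
        System.exit(sortJob.waitForCompletion(true) ? 0 : 1);
    }
}
This assumes the first job adds job.setOutputFormatClass(SequenceFileOutputFormat.class); if it keeps plain text output, you would need a small custom mapper that parses each "word<TAB>count" line instead of InverseMapper.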
I'm using Hadoop Map/Reduce with Java.
Suppose I have completed a whole map/reduce job. Is there any way I can repeat the whole map/reduce part only, without ending the job? I mean, I DON'T want to use any chaining of different jobs; I only want the map/reduce part to repeat.
Thank you!
I am more familiar with the Hadoop streaming APIs, but the approach should translate to the native APIs.
In my understanding, what you are trying to do is run several iterations of the same map() and reduce() operations on the input data.
Let's say your initial map() input data comes from the file input.txt and the output file is output+{iteration}.txt (where iteration is the loop count, iteration = [0, # of iterations)).
In the second invocation of map()/reduce(), your input file is output+{iteration} and the output file becomes output+{iteration+1}.txt.
Let me know if this is not clear; I can conjure up a quick example and post a link here.
EDIT: For Java, I modified the Hadoop wordcount example to run multiple times:
package com.rorlig;
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCountJob {
public static class TokenizerMapper
extends Mapper<Object, Text, Text, IntWritable>{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
}
public static class IntSumReducer
extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values,
Context context
) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
if (args.length != 3) {
System.err.println("Usage: wordcount <in> <out> <iterations>");
System.exit(2);
}
int iterations = new Integer(args[2]);
Path inPath = new Path(args[0]);
Path outPath = null;
for (int i = 0; i<iterations; ++i){
outPath = new Path(args[1]+i);
Job job = new Job(conf, "word count");
job.setJarByClass(WordCountJob.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, inPath);
FileOutputFormat.setOutputPath(job, outPath);
job.waitForCompletion(true);
inPath = outPath;
}
}
}
Hope this helps