Splitting output files with respect to the input file in MapReduce - Java

Can somebody suggest what's wrong with the following code?
Can you help me get the output below using this MapReduce program?
The code actually runs fine, but the output is not as expected: it is written to two files, yet the records keep swapping between the Name file and the Age file.
Input File:
Name:A
Age:28
Name:B
Age:25
Name:K
Age:20
Name:P
Age:18
Name:Ak
Age:11
Name:N
Age:14
Name:Kr
Age:26
Name:Ra
Age:27
And my output should be split into a Name file and an Age file.
Name File:
Name:A
Name:B
Name:K
Name:P
Name:Ak
Name:N
Name:Kr
Name:Ra
Age File:
Age:28
Age:25
Age:20
Age:18
Age:11
Age:14
Age:26
Age:27
My code:
MyMapper.java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
public class MyMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {
public void map(LongWritable key, Text value,OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
String [] dall=value.toString().split(":");
output.collect(new Text(dall[0]),new Text(dall[1]));
}
}
MyReducer.java:
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
public class MyReducer extends MapReduceBase implements Reducer<Text, Text, Text, Text> {
public void reduce(Text key, Iterator<Text> values,OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
while (values.hasNext()) {
output.collect(new Text(key),new Text(values.next()));
}
}
}
MultiFileOutput.java:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.*;
public class MultiFileOutput extends MultipleTextOutputFormat<Text, Text>{
protected String generateFileNameForKeyValue(Text key, Text value,String name) {
//return new Path(key.toString(), name).toString();
return key.toString();
}
protected Text generateActualKey(Text key, Text value) {
//return new Text(key.toString());
return null;
}
}
MyDriver.java:
import java.io.IOException;
import java.lang.Exception;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
public class MyDriver{
public static void main(String[] args) throws Exception,IOException {
Configuration mycon=new Configuration();
JobConf conf = new JobConf(mycon,MyDriver.class);
//JobConf conf = new JobConf(MyDriver.class);
conf.setJobName("Splitting");
conf.setMapperClass(MyMapper.class);
conf.setReducerClass(MyReducer.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(MultiFileOutput.class);
conf.setOutputKeyClass(Text.class);
conf.setMapOutputKeyClass(Text.class);
//conf.setOutputValueClass(Text.class);
conf.setMapOutputValueClass(Text.class);
FileInputFormat.setInputPaths(conf,new Path(args[0]));
FileOutputFormat.setOutputPath(conf,new Path(args[1]));
JobClient.runJob(conf);
//System.err.println(JobClient.runJob(conf));
}
}
Thank you.

OK, this is a bit more complicated use case than a simple word count :)
What you need is a composite key and a partitioner, and to set the number of reducers to 2.
Your composite key could be a Text (a concatenation such as Name|A or Age|28) or a custom Writable that has two instance variables holding the type (Name or Age) and the value.
In the mapper you create that Text or custom Writable and set it as the output key; the value can just be the person's name or age.
Create a partitioner (which implements org.apache.hadoop.mapred.Partitioner). In the getPartition method you decide, based on the key, which reducer each record goes to.
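A minimal sketch of such a partitioner, assuming the map output key is a Text of the form Name|A or Age|28 (the class name TypePartitioner is just an illustration):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Sends all "Name|..." keys to reducer 0 and all "Age|..." keys to reducer 1,
// so each record type ends up in its own output file.
public class TypePartitioner implements Partitioner<Text, Text> {

    @Override
    public void configure(JobConf job) {
        // no configuration needed
    }

    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        String type = key.toString().split("\\|")[0];
        return ("Name".equals(type) ? 0 : 1) % numPartitions;
    }
}

In the driver you would then register it with conf.setPartitionerClass(TypePartitioner.class) and call conf.setNumReduceTasks(2).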
Hope this helps.

Related

Hadoop MapReduce Output for Maximum

I am currently using Eclipse and Hadoop to create a mapper and reducer that find the maximum total cost in an airline data set.
The total cost is a decimal value and the airline carrier is text.
The dataset I used can be found at the following link:
https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/236265/dft-flights-data-2011.csv
When I export the jar file and run it in Hadoop,
I get the following message: ls: "output" : No such file or directory.
Can anyone help me correct the code, please?
My code is below.
Mapper:
package org.myorg;
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class MaxTotalCostMapper extends Mapper<LongWritable, Text, Text, DoubleWritable>
{
private final static DoubleWritable totalcostWritable = new DoubleWritable(0);
private Text AirCarrier = new Text();
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException
{
String[] line = value.toString().split(",");
AirCarrier.set(line[8]);
double totalcost = Double.parseDouble(line[2].trim());
totalcostWritable.set(totalcost);
context.write(AirCarrier, totalcostWritable);
}
}
Reducer:
package org.myorg;
import java.io.IOException;
import java.util.ArrayList;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class MaxTotalCostReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable>
{
ArrayList<Double> totalcostList = new ArrayList<Double>();
@Override
public void reduce(Text key, Iterable<DoubleWritable> values, Context context)
throws IOException, InterruptedException
{
double maxValue=0.0;
for (DoubleWritable value : values)
{
maxValue = Math.max(maxValue, value.get());
}
context.write(key, new DoubleWritable(maxValue));
}
}
Main:
package org.myorg;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class MaxTotalCost
{
public static void main(String[] args) throws Exception
{
Configuration conf = new Configuration();
if (args.length != 2)
{
System.err.println("Usage: MaxTotalCost<input path><output path>");
System.exit(-1);
}
Job job;
job=Job.getInstance(conf, "Max Total Cost");
job.setJarByClass(MaxTotalCost.class);
FileInputFormat.addInputPath(job, new Path(args[1]));
FileOutputFormat.setOutputPath(job, new Path(args[2]));
job.setMapperClass(MaxTotalCostMapper.class);
job.setReducerClass(MaxTotalCostReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(DoubleWritable.class);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
ls: "output" : No such file or directory
You have no HDFS user directory. Your code isn't even making it into the Mapper or Reducer; that error typically arises when the job is being set up, at
FileOutputFormat.setOutputPath(job, new Path(args[2]));
Run hdfs dfs -ls and see whether you get any errors. If you do, create a directory under /user that matches your current user (for example, hdfs dfs -mkdir -p /user/<yourusername>).
Otherwise, change your output directory to something like /tmp/max.
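If it helps, you can also check from Java where a relative output path would end up; a small sketch (assuming the cluster configuration is on the classpath; the class name OutputPathCheck is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OutputPathCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // A relative path like "output" is resolved against the HDFS working
        // directory, which defaults to the home directory /user/<current user>.
        Path out = fs.makeQualified(new Path("output"));
        System.out.println("Output would be written under: " + out);
        System.out.println("Home directory exists: " + fs.exists(fs.getHomeDirectory()));
    }
}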

Getting Type mismatch in key from map: expected org.apache.hadoop.io.Text, received org.apache.hadoop.io.LongWritable error

I am trying to run a mapper/reducer in Java using Eclipse. My code is below.
Driver code:
package com.hadoop.training.criccount;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Set;
import javax.lang.model.SourceVersion;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import com.hadoop.training.hadooputility.HadoopUtility;
public class CricClickDriver extends Configured implements Tool{
public int run (String args[]) throws Exception
{
Configuration config =HadoopUtility.INSTANCE.pseudomode();
Job job = new Job(config, "No of clicks by location");
job.setJarByClass(CricClickDriver.class);
job.setInputFormatClass(TextInputFormat.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(LongWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(LongWritable.class);
job.setMapperClass(CricClickMapper.class);
job.setReducerClass(CricClickReducer.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
int exitcode = job.waitForCompletion(true)? 0:1;
return exitcode;
}
}
Mapper Code:
package com.hadoop.training.criccount;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
@SuppressWarnings("unused")
public class CricClickMapper extends Mapper<LongWritable, Text,Text, LongWritable>{
public void CricketClick(LongWritable key, Text value, Context output) throws IOException, InterruptedException
{
String Line= value.toString();
String part[]=Line.split(" ");
if(part[0].contains("BAN"))
{
output.write(new Text(part[1]),new LongWritable(Long.parseLong(part[2])));
}
}
}
Reducer Code:
package com.hadoop.training.criccount;
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class CricClickReducer extends Reducer<Text, LongWritable, Text, LongWritable>{
public void reduce(Text key, Iterable<LongWritable> value, Context output) throws IOException, InterruptedException
{
int sum=0;
for(LongWritable val:value){
sum +=val.get();
}
output.write(key, new LongWritable(sum));
}
}
I am getting the following error:
Type mismatch in key from map
I tried to debug but could not find the root cause. I need some help with this.
Your Mapper class definition does not look like the one mentioned in the documentation.
public class Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
Or is it just not visible in the code you posted here?
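For reference, the framework only invokes a mapper method that overrides map with the documented signature; here is a minimal sketch of the mapper from the question written that way (the names mirror the question's code):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CricClickMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    // The framework calls map(); a method with any other name (such as
    // CricketClick) is never invoked, so the default identity mapper runs and
    // emits the LongWritable file offset as the key, causing the type mismatch.
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] part = value.toString().split(" ");
        if (part[0].contains("BAN")) {
            context.write(new Text(part[1]), new LongWritable(Long.parseLong(part[2])));
        }
    }
}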

Hadoop - MapReduce

I've been trying to solve a simple Map/Reduce problem in which I count the words from some input files and then emit their frequency as one key and their word length as the other. The mapper emits one every time a new word is read from the file, and then all identical words are grouped together to get their final count. As output I'd like to see, for each word length, which word is the most frequent.
This is as far as we've gotten (my team and I):
This is the WordCountMapper class
import java.io.IOException;
import java.util.ArrayList;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
public class WordCountMapper extends MapReduceBase implements
Mapper<LongWritable, Text, Text, CompositeGroupKey> {
private final IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value,
OutputCollector<Text, CompositeGroupKey> output, Reporter reporter)
throws IOException {
String line = value.toString();
StringTokenizer itr = new StringTokenizer(line.toLowerCase());
while(itr.hasMoreTokens()) {
word.set(itr.nextToken());
CompositeGroupKey gky = new CompositeGroupKey(1, word.getLength());
output.collect(word, gky);
}
}
}
This is the WordCountReducer class:
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import com.sun.xml.internal.bind.CycleRecoverable.Context;
public class WordCountReducer extends MapReduceBase
implements Reducer<Text, CompositeGroupKey, Text, CompositeGroupKey> {
@Override
public void reduce(Text key, Iterator<CompositeGroupKey> values,
OutputCollector<Text, CompositeGroupKey> output, Reporter reporter)
throws IOException {
int sum = 0;
int length = 0;
while (values.hasNext()) {
CompositeGroupKey value = (CompositeGroupKey) values.next();
sum += (Integer) value.getCount(); // process value
length = (Integer) key.getLength();
}
CompositeGroupKey cgk = new CompositeGroupKey(sum,length);
output.collect(key, cgk);
}
}
This is the WordCount class:
import java.util.ArrayList;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobStatus;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.StringUtils;
public class WordCount {
public static void main(String[] args) {
JobClient client = new JobClient();
JobConf conf = new JobConf(WordCount.class);
// specify output types
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(CompositeGroupKey.class);
conf.setMapOutputKeyClass(Text.class);
conf.setMapOutputValueClass(CompositeGroupKey.class);
// specify input and output dirs
FileInputFormat.addInputPath(conf, new Path("input"));
FileOutputFormat.setOutputPath(conf, new Path("output16"));
// specify a mapper
conf.setMapperClass(WordCountMapper.class);
// specify a reducer
conf.setReducerClass(WordCountReducer.class);
conf.setCombinerClass(WordCountReducer.class);
client.setConf(conf);
try {
JobClient.runJob(conf);
} catch (Exception e) {
e.printStackTrace();
}
}
}
And this is the CompositeGroupKey class:
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableUtils;
public class CompositeGroupKey implements WritableComparable<CompositeGroupKey> {
int count;
int length;
public CompositeGroupKey(int c, int l) {
this.count = c;
this.length = l;
}
public void write(DataOutput out) throws IOException {
WritableUtils.writeVInt(out, count);
WritableUtils.writeVInt(out, length);
}
public void readFields(DataInput in) throws IOException {
this.count = WritableUtils.readVInt(in);
this.length = WritableUtils.readVInt(in);
}
public int compareTo(CompositeGroupKey pop) {
return 0;
}
public int getCount() {
return this.count;
}
public int getLength() {
return this.length;
}
}
Right now I get this error:
java.lang.RuntimeException: java.lang.NoSuchMethodException: CompositeGroupKey.<init>()
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:80)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:62)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
at org.apache.hadoop.mapred.Task$ValuesIterator.readNextValue(Task.java:738)
at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:678)
at org.apache.hadoop.mapred.Task$CombineValuesIterator.next(Task.java:757)
at WordCountReducer.reduce(WordCountReducer.java:24)
at WordCountReducer.reduce(WordCountReducer.java:1)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.combineAndSpill(MapTask.java:904)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:785)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:698)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:228)
at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2209)
Caused by: java.lang.NoSuchMethodException: CompositeGroupKey.<init>()
at java.lang.Class.getConstructor0(Unknown Source)
at java.lang.Class.getDeclaredConstructor(Unknown Source)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:74)
I know the coding's not that good, but right now we don't have any idea where we went wrong, so any help would be welcome!
You have to provide an empty default constructor in your key class CompositeGroupKey. It is used for serialization.
Just add:
public CompositeGroupKey() {
}
Whenever you see an exception like the one given below
java.lang.RuntimeException: java.lang.NoSuchMethodException: CompositeGroupKey.<init>()
it points to a problem with object instantiation, which means one of the constructors is missing: either the default constructor or a parameterised constructor.
The moment you write a parameterised constructor, the compiler no longer generates the default constructor unless you declare it explicitly.
The answer given by Ruslan Ostafiichuk is enough to answer your query; I just added a few more points to make things clearer.
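Putting the two answers together, the key class from the question only needs the extra no-arg constructor; everything else stays as it was:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableUtils;

public class CompositeGroupKey implements WritableComparable<CompositeGroupKey> {
    int count;
    int length;

    // Empty default constructor: Hadoop's ReflectionUtils.newInstance() needs
    // it when deserializing the value during the combine/reduce phase.
    public CompositeGroupKey() {
    }

    public CompositeGroupKey(int c, int l) {
        this.count = c;
        this.length = l;
    }

    public void write(DataOutput out) throws IOException {
        WritableUtils.writeVInt(out, count);
        WritableUtils.writeVInt(out, length);
    }

    public void readFields(DataInput in) throws IOException {
        this.count = WritableUtils.readVInt(in);
        this.length = WritableUtils.readVInt(in);
    }

    public int compareTo(CompositeGroupKey pop) {
        return 0;
    }

    public int getCount() {
        return this.count;
    }

    public int getLength() {
        return this.length;
    }
}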

Hadoop MapReduce error while parsing CSV

I'm getting the following error in the map function while parsing a CSV file.
14/07/15 19:40:05 INFO mapreduce.Job: Task Id : attempt_1403602091361_0018_m_000001_2, Status : FAILED
Error: java.lang.ArrayIndexOutOfBoundsException: 4
at com.test.mapreduce.RetailCustomerAnalysis_2$MapClass.map(RetailCustomerAnalysis_2.java:55)
at com.test.mapreduce.RetailCustomerAnalysis_2$MapClass.map(RetailCustomerAnalysis_2.java:1)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:429)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
The map function is given below
package com.test.mapreduce;
import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Iterator;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class RetailCustomerAnalysis_2 extends Configured implements Tool {
public static class MapClass extends MapReduceBase
implements Mapper<Text, Text, Text, Text> {
private Text key1 = new Text();
private Text value1 = new Text();
public void map(Text key, Text value,
OutputCollector<Text, Text> output,
Reporter reporter) throws IOException {
String line = value.toString();
String[] split = line.split(",");
key1.set(split[0].trim());
/* line no 55 where the error is occurring */
value1.set(split[4].trim());
output.collect(key1, value1);
}
}
public int run(String[] args) throws Exception {
Configuration conf = getConf();
JobConf job = new JobConf(conf, RetailCustomerAnalysis_2.class);
Path in = new Path(args[0]);
Path out = new Path(args[1]);
FileInputFormat.setInputPaths(job, in);
FileOutputFormat.setOutputPath(job, out);
job.setJobName("RetailCustomerAnalysis_2");
job.setMapperClass(MapClass.class);
job.setReducerClass(Reduce.class);
job.setInputFormat(KeyValueTextInputFormat.class);
job.setOutputFormat(TextOutputFormat.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
// job.set("key.value.separator.in.input.line", ",");
JobClient.runJob(job);
return 0;
}
public static void main(String[] args) throws Exception {
int res = ToolRunner.run(new Configuration(), new RetailCustomerAnalysis_2(), args);
System.exit(res);
}
}
The sample input used to run this code is as follows:
PRAVEEN,4002012,Kids,02GK,7/4/2010
PRAVEEN,400201,TOY,020383,14/04/2014
I'm running the application using the following command and inputs:
yarn jar RetailCustomerAnalysis_2.jar com.test.mapreduce.RetailCustomerAnalysis_2 /hduser/input5 /hduser/output5
Add a check to see whether the input line has all the fields defined, and skip it in the map function otherwise. The code would be something like this in the new API (noOfFields being the expected number of columns):
if(split.length!=noOfFields){
return;
}
Additionally, if you are interested, you can set up a Hadoop counter to record how many rows in total did not contain all the required fields in the CSV file:
if(split.length!=noOfFields){
context.getCounter(MTJOB.DISCARDED_ROWS_DUE_MISSING_FIELDS)
.increment(1);
return;
}
split[] has elements split[0], split[1], split[2] and split[3] only.
In the case of KeyValueTextInputFormat, the first string before the separator is considered the key and the rest of the line is considered the value; a byte separator (a comma, whitespace, etc.) is used to split every record into key and value.
In your code the first string before the first comma is taken as the key and the rest of the line as the value. Since you then split only the value, there are just 4 strings in it, so the string array runs from split[0] to split[3] only, never split[4].
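A quick illustration of that explanation, assuming the key/value separator is set to "," (the commented-out line in the driver); the class name SplitDemo is just for the example:

public class SplitDemo {
    public static void main(String[] args) {
        // What the mapper receives for the first sample record when
        // key.value.separator.in.input.line is ","
        String key = "PRAVEEN";
        String value = "4002012,Kids,02GK,7/4/2010";

        String[] split = value.split(",");
        System.out.println(split.length); // 4, so valid indices are 0..3
        System.out.println(split[3]);     // 7/4/2010
        // split[4] throws ArrayIndexOutOfBoundsException: 4, the error at line 55
    }
}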
Any suggestions or corrections are welcome.

My input file is being read twice by the mapper in Hadoop MapReduce

I am facing a problem while writing a MapReduce program: my input file is being read twice. I have already gone through the answer to "why is my sequence file being read twice in my hadoop mapper class?", but unfortunately it did not help.
My Mapper class is:
package com.siddu.mapreduce.csv;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
public class SidduCSVMapper extends MapReduceBase implements Mapper<LongWritable,Text,Text,IntWritable>
{
IntWritable one = new IntWritable(1);
@Override
public void map(LongWritable key, Text line,
OutputCollector<Text, IntWritable> output, Reporter report)
throws IOException
{
String lineCSV= line.toString();
String[] tokens = lineCSV.split(";");
output.collect(new Text(tokens[2]), one);
}
}
And My Reducer class is:
package com.siddu.mapreduce.csv;
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
public class SidduCSVReducer extends MapReduceBase implements Reducer<Text,IntWritable,Text,IntWritable>
{
@Override
public void reduce(Text key, Iterator<IntWritable> inputFrmMapper,
OutputCollector<Text, IntWritable> output, Reporter reporter)
throws IOException
{
System.out.println("In reducer the key is:"+key.toString());
int relationOccurance=0;
while(inputFrmMapper.hasNext())
{
IntWritable intWriteOb = inputFrmMapper.next();
int val = intWriteOb.get();
relationOccurance += val;
}
output.collect(key, new IntWritable(relationOccurance));
}
}
And finally My Driver class is:
package com.siddu.mapreduce.csv;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
public class SidduCSVMapReduceDriver
{
public static void main(String[] args)
{
JobClient client = new JobClient();
JobConf conf = new JobConf(com.siddu.mapreduce.csv.SidduCSVMapReduceDriver.class);
conf.setJobName("Siddu CSV Reader 1.0");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(com.siddu.mapreduce.csv.SidduCSVMapper.class);
conf.setReducerClass(com.siddu.mapreduce.csv.SidduCSVReducer.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
client.setConf(conf);
try
{
JobClient.runJob(conf);
}
catch(Exception e)
{
e.printStackTrace();
}
}
}
You should be aware that Hadoop spawns multiple attempts of a task, usually two for each mapper. If you see your log output twice, that is probably the reason.
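If you want to rule out speculative (duplicate) task attempts, they can be switched off in the driver; a small sketch against the old JobConf API used in the question (the class name SpeculationSettings is hypothetical, and whether speculation is actually the cause depends on your cluster settings):

import org.apache.hadoop.mapred.JobConf;

public class SpeculationSettings {
    // Disable speculative execution so each input split is processed by a
    // single map attempt and a single reduce attempt only.
    public static void disableSpeculation(JobConf conf) {
        conf.setMapSpeculativeExecution(false);
        conf.setReduceSpeculativeExecution(false);
    }
}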
