Hadoop MapReduce error while parsing CSV - java

I'm getting the following error in the map function while parsing a CSV file.
14/07/15 19:40:05 INFO mapreduce.Job: Task Id : attempt_1403602091361_0018_m_000001_2, Status : FAILED
Error: java.lang.ArrayIndexOutOfBoundsException: 4
at com.test.mapreduce.RetailCustomerAnalysis_2$MapClass.map(RetailCustomerAnalysis_2.java:55)
at com.test.mapreduce.RetailCustomerAnalysis_2$MapClass.map(RetailCustomerAnalysis_2.java:1)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:429)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
The map function is given below
package com.test.mapreduce;
import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Iterator;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class RetailCustomerAnalysis_2 extends Configured implements Tool {
public static class MapClass extends MapReduceBase
implements Mapper<Text, Text, Text, Text> {
private Text key1 = new Text();
private Text value1 = new Text();
public void map(Text key, Text value,
OutputCollector<Text, Text> output,
Reporter reporter) throws IOException {
String line = value.toString();
String[] split = line.split(",");
key1.set(split[0].trim());
/* line no 55 where the error is occurring */
value1.set(split[4].trim());
output.collect(key1, value1);
}
}
public int run(String[] args) throws Exception {
Configuration conf = getConf();
JobConf job = new JobConf(conf, RetailCustomerAnalysis_2.class);
Path in = new Path(args[0]);
Path out = new Path(args[1]);
FileInputFormat.setInputPaths(job, in);
FileOutputFormat.setOutputPath(job, out);
job.setJobName("RetailCustomerAnalysis_2");
job.setMapperClass(MapClass.class);
job.setReducerClass(Reduce.class);
job.setInputFormat(KeyValueTextInputFormat.class);
job.setOutputFormat(TextOutputFormat.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
// job.set("key.value.separator.in.input.line", ",");
JobClient.runJob(job);
return 0;
}
public static void main(String[] args) throws Exception {
int res = ToolRunner.run(new Configuration(), new RetailCustomerAnalysis_2(), args);
System.exit(res);
}
}
Sample Input used to run this code is as follows
PRAVEEN,4002012,Kids,02GK,7/4/2010
PRAVEEN,400201,TOY,020383,14/04/2014
I'm running the application using the following command and Inputs.
yarn jar RetailCustomerAnalysis_2.jar com.test.mapreduce.RetailCustomerAnalysis_2 /hduser/input5 /hduser/output5

Add a check to see whether the input line has all fields defined, and skip the record in the map function if it doesn't. The code would be something like this in the new API:
if(split.length!=noOfFields){
return;
}
Additionally, if you are interested, you can set up a Hadoop counter to track how many rows in total didn't contain all required fields in the CSV file:
if(split.length!=noOfFields){
context.getCounter(MTJOB.DISCARDED_ROWS_DUE_MISSING_FIELDS)
.increment(1);
return;
}
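For reference, the guard can be exercised outside Hadoop entirely. This is a minimal standalone sketch of the same check (the field count of 5 matches the sample input; the names here are illustrative, not from the post):

```java
// Standalone sketch of the recommended guard: skip rows that do not have
// the expected number of comma-separated fields.
public class RowGuard {
    static final int NO_OF_FIELDS = 5;

    // Returns the last field, or null for rows that should be discarded
    // (a real mapper would `return` early and increment a counter instead).
    public static String lastField(String line) {
        String[] split = line.split(",");
        if (split.length != NO_OF_FIELDS) {
            return null; // counter increment would go here
        }
        return split[NO_OF_FIELDS - 1].trim();
    }

    public static void main(String[] args) {
        System.out.println(lastField("PRAVEEN,4002012,Kids,02GK,7/4/2010")); // 7/4/2010
        System.out.println(lastField("PRAVEEN,4002012,Kids"));               // null
    }
}
```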

split[] has elements split[0], split[1], split[2] and split[3] only

With KeyValueTextInputFormat, the first string before the separator is treated as the key and the rest of the line as the value. A single-byte separator (comma, tab, etc.) divides the key from the value in every record.
In your code the first string before the first comma is taken as the key and the rest of the line as the value. When you split the value, it contains only four strings, so the array runs from split[0] to split[3]; there is no split[4].
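Assuming the comma separator is actually configured for KeyValueTextInputFormat (the commented-out key.value.separator.in.input.line setting in the driver), the behaviour can be reproduced in plain Java:

```java
// What the mapper receives with KeyValueTextInputFormat and a comma
// separator: key = text before the first comma, value = the rest of the
// line. Splitting the value therefore yields only 4 fields (indices 0..3).
public class SplitDemo {
    public static void main(String[] args) {
        String value = "4002012,Kids,02GK,7/4/2010"; // the part after "PRAVEEN,"
        String[] split = value.split(",");
        System.out.println(split.length);  // 4, so split[4] is out of bounds
        System.out.println(split[3]);      // 7/4/2010, the last valid index
    }
}
```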
Any suggestions or corrections are welcomed.

Related

Hadoop MapReduce Output for Maximum

I am currently using Eclipse and Hadoop to create a mapper and reducer to find Maximum Total Cost of an Airline Data Set.
So the Total Cost is Decimal Value and Airline Carrier is Text.
The dataset I used can be found in the following weblink:
https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/236265/dft-flights-data-2011.csv
When I export the jar file in Hadoop,
I am getting the following message: ls: "output" : No such file or directory.
Can anyone help me correct the code please?
My code is below.
Mapper:
package org.myorg;
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class MaxTotalCostMapper extends Mapper<LongWritable, Text, Text, DoubleWritable>
{
private final static DoubleWritable totalcostWritable = new DoubleWritable(0);
private Text AirCarrier = new Text();
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException
{
String[] line = value.toString().split(",");
AirCarrier.set(line[8]);
double totalcost = Double.parseDouble(line[2].trim());
totalcostWritable.set(totalcost);
context.write(AirCarrier, totalcostWritable);
}
}
Reducer:
package org.myorg;
import java.io.IOException;
import java.util.ArrayList;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class MaxTotalCostReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable>
{
ArrayList<Double> totalcostList = new ArrayList<Double>();
@Override
public void reduce(Text key, Iterable<DoubleWritable> values, Context context)
throws IOException, InterruptedException
{
double maxValue=0.0;
for (DoubleWritable value : values)
{
maxValue = Math.max(maxValue, value.get());
}
context.write(key, new DoubleWritable(maxValue));
}
}
Main:
package org.myorg;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class MaxTotalCost
{
public static void main(String[] args) throws Exception
{
Configuration conf = new Configuration();
if (args.length != 2)
{
System.err.println("Usage: MaxTotalCost<input path><output path>");
System.exit(-1);
}
Job job;
job=Job.getInstance(conf, "Max Total Cost");
job.setJarByClass(MaxTotalCost.class);
FileInputFormat.addInputPath(job, new Path(args[1]));
FileOutputFormat.setOutputPath(job, new Path(args[2]));
job.setMapperClass(MaxTotalCostMapper.class);
job.setReducerClass(MaxTotalCostReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(DoubleWritable.class);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
ls: "output" : No such file or directory
You have no HDFS user directory. Your code isn't even making it into the Mapper or Reducer; the error typically arises when the job tries to resolve the output path:
FileOutputFormat.setOutputPath(job, new Path(args[2]));
Run an hdfs dfs -ls, see if you get any errors. If so, make a directory under /user that matches your current user.
Otherwise, change your output directory to something like /tmp/max
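A likely second problem in the posted driver (an observation, not from the original answer): it validates args.length != 2 but then reads args[1] and args[2]. With exactly two arguments the valid indices are 0 and 1, so args[2] would throw ArrayIndexOutOfBoundsException. A sketch of the corrected indexing, with the Hadoop calls omitted so it stands alone:

```java
public class ArgsDemo {
    // Returns {input, output}, or null when the usage is wrong.
    public static String[] resolvePaths(String[] args) {
        if (args.length != 2) {
            return null; // a real driver would print usage and exit
        }
        // Input is args[0] and output is args[1]; the posted code
        // used args[1] and args[2] instead.
        return new String[] { args[0], args[1] };
    }

    public static void main(String[] args) {
        String[] paths = resolvePaths(new String[] { "in", "out" });
        System.out.println(paths[0] + " -> " + paths[1]); // in -> out
    }
}
```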

Cassandra Hadoop MapReduce : java.lang.ClassCastException: java.util.HashMap cannot be cast to java.nio.ByteBuffer

I'm trying to create a MapReduce job with Apache Cassandra. The input data comes from Cassandra and the output goes to Cassandra too.
The program selects all data from a table called tweetstore and then inserts the count of rows that contain a given user name.
This is the main class of the mapreduce job :
package com.cassandra.hadoop;
import java.io.*;
import java.lang.*;
import java.util.*;
import java.nio.ByteBuffer;
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ColumnFamilyOutputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.hadoop.cql3.*;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.SliceRange;
import org.apache.cassandra.thrift.IndexExpression;
import org.apache.cassandra.thrift.IndexOperator;
import org.apache.cassandra.utils.ByteBufferUtil;
public class App
{
static final String KEYSPACE_NAME = "tweet_cassandra_map_reduce";
static final String INPUT_COLUMN_FAMILY = "tweetstore";
static final String OUTPUT_COLUMN_FAMILY = "tweetcount";
static final String COLUMN_NAME = "user";
public static void main( String[] args ) throws IOException, InterruptedException, ClassNotFoundException
{
Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
Job job = new Job(conf, "tweet count");
job.setJarByClass(App.class);
// mapper configuration.
job.setMapperClass(TweetMapper.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setInputFormatClass(ColumnFamilyInputFormat.class);
// Reducer configuration
job.setReducerClass(TweetAggregator.class);
job.setOutputKeyClass(ByteBuffer.class);
job.setOutputValueClass(List.class);
job.setOutputFormatClass(ColumnFamilyOutputFormat.class);
// Cassandra input column family configuration
ConfigHelper.setInputRpcPort(job.getConfiguration(), "9160");
ConfigHelper.setInputInitialAddress(job.getConfiguration(), "localhost");
ConfigHelper.setInputPartitioner(job.getConfiguration(), "Murmur3Partitioner");
ConfigHelper.setInputColumnFamily(job.getConfiguration(), KEYSPACE_NAME, INPUT_COLUMN_FAMILY);
job.setInputFormatClass(ColumnFamilyInputFormat.class);
SlicePredicate slicePredicate = new SlicePredicate();
slicePredicate.setSlice_range(new SliceRange(ByteBufferUtil.EMPTY_BYTE_BUFFER, ByteBufferUtil.EMPTY_BYTE_BUFFER, false, Integer.MAX_VALUE));
// Prepare index expression.
IndexExpression ixpr = new IndexExpression();
ixpr.setColumn_name(ByteBufferUtil.bytes(COLUMN_NAME));
ixpr.setOp(IndexOperator.EQ);
ixpr.setValue(ByteBufferUtil.bytes(otherArgs.length > 0 && !StringUtils.isBlank(otherArgs[0])?otherArgs[0]: "mevivs"));
List<IndexExpression> ixpressions = new ArrayList<IndexExpression>();
ixpressions.add(ixpr);
ConfigHelper.setInputRange(job.getConfiguration(), ixpressions);
ConfigHelper.setInputSlicePredicate(job.getConfiguration(), slicePredicate);
// Cassandra output family configuration.
ConfigHelper.setOutputRpcPort(job.getConfiguration(), "9160");
ConfigHelper.setOutputInitialAddress(job.getConfiguration(), "localhost");
ConfigHelper.setOutputPartitioner(job.getConfiguration(), "Murmur3Partitioner");
ConfigHelper.setOutputColumnFamily(job.getConfiguration(), KEYSPACE_NAME, OUTPUT_COLUMN_FAMILY);
job.setOutputFormatClass(ColumnFamilyOutputFormat.class);
job.getConfiguration().set("row_key", "key");
job.waitForCompletion(true);
}
}
The mapper code
package com.cassandra.hadoop;
import java.io.*;
import java.lang.*;
import java.util.*;
import java.nio.ByteBuffer;
import java.util.SortedMap;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.cassandra.db.Column;
import org.apache.cassandra.utils.ByteBufferUtil;
public class TweetMapper extends Mapper<ByteBuffer, SortedMap<ByteBuffer, Column>, Text, IntWritable>
{
static final String COLUMN_NAME = App.COLUMN_NAME;
private final static IntWritable one = new IntWritable(1);
/* (non-Javadoc)
* #see org.apache.hadoop.mapreduce.Mapper#map(KEYIN, VALUEIN, org.apache.hadoop.mapreduce.
Mapper.Context)
*/
public void map(ByteBuffer key, SortedMap<ByteBuffer, Column> columns, Context context) throws IOException, InterruptedException
{
Column column = columns.get(ByteBufferUtil.bytes(COLUMN_NAME));
String value = ByteBufferUtil.string(column.value());
context.write(new Text(value), one);
}
}
The reducer code :
package com.cassandra.hadoop;
import java.io.IOException;
import java.util.*;
import java.lang.*;
import java.io.*;
import java.nio.ByteBuffer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.cassandra.thrift.Column;
import org.apache.cassandra.thrift.Mutation;
import org.apache.cassandra.thrift.ColumnOrSuperColumn;
import org.apache.cassandra.db.marshal.Int32Type;
public class TweetAggregator extends Reducer<Text,IntWritable, Map<String,ByteBuffer>, List<ByteBuffer>>
{
private static Map<String,ByteBuffer> keys = new HashMap<>();
public void reduce(Text word, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException
{
int sum = 0;
for (IntWritable val : values)
sum += val.get();
System.out.println("writing");
keys.put("key", ByteBufferUtil.bytes(word.toString()));
context.write(keys, getBindVariables(word, sum));
}
private List<ByteBuffer> getBindVariables(Text word, int sum)
{
List<ByteBuffer> variables = new ArrayList<ByteBuffer>();
variables.add(Int32Type.instance.decompose(sum));
return variables;
}
}
The problem: when I try to execute the job with the hadoop command, this error appears at the reduce step:
15/02/14 16:53:13 WARN hadoop.AbstractColumnFamilyInputFormat: ignoring jobKeyRange specified without start_key
15/02/14 16:53:14 INFO mapred.JobClient: Running job: job_201502141652_0001
15/02/14 16:53:15 INFO mapred.JobClient: map 0% reduce 0%
15/02/14 16:53:20 INFO mapred.JobClient: map 66% reduce 0%
15/02/14 16:53:22 INFO mapred.JobClient: map 100% reduce 0%
15/02/14 16:53:28 INFO mapred.JobClient: map 100% reduce 33%
15/02/14 16:53:30 INFO mapred.JobClient: Task Id : attempt_201502141652_0001_r_000000_0, Status : FAILED
java.lang.ClassCastException: java.util.HashMap cannot be cast to java.nio.ByteBuffer
at org.apache.cassandra.hadoop.ColumnFamilyRecordWriter.write(ColumnFamilyRecordWriter.java:50)
at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:588)
at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
at com.cassandra.hadoop.TweetAggregator.reduce(TweetAggregator.java:40)
at com.cassandra.hadoop.TweetAggregator.reduce(TweetAggregator.java:20)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:650)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
attempt_201502141652_0001_r_000000_0: writing
Any help please! Thanks
It looks like your job setup declares ByteBuffer as the reducer's output key class, rather than the Map<> your reducer actually emits. Try changing this in your job setup
job.setOutputKeyClass(ByteBuffer.class);
to this
job.setOutputKeyClass(Map.class);
(class literals cannot carry generic parameters, so Map.class is as specific as the language allows). In any case, the types passed to the job.set... calls need to align with the generic type arguments of your Mapper and Reducer, so check that they all match.

Hadoop - MapReduce

I've been trying to solve a simple Map/Reduce problem: counting words from some input files, then having their frequency as one key and their word length as the other key. The mapper emits a one every time a new word is read from the file, and the same words are then grouped together to get their final count. As output I'd like to see, for each word length, what the most frequent word is.
This is as far as we've gotten (me and my team):
This is the WordCountMapper class
import java.io.IOException;
import java.util.ArrayList;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
public class WordCountMapper extends MapReduceBase implements
Mapper<LongWritable, Text, Text, CompositeGroupKey> {
private final IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value,
OutputCollector<Text, CompositeGroupKey> output, Reporter reporter)
throws IOException {
String line = value.toString();
StringTokenizer itr = new StringTokenizer(line.toLowerCase());
while(itr.hasMoreTokens()) {
word.set(itr.nextToken());
CompositeGroupKey gky = new CompositeGroupKey(1, word.getLength());
output.collect(word, gky);
}
}
}
This is wordcountreducer class:
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import com.sun.xml.internal.bind.CycleRecoverable.Context;
public class WordCountReducer extends MapReduceBase
implements Reducer<Text, CompositeGroupKey, Text, CompositeGroupKey> {
@Override
public void reduce(Text key, Iterator<CompositeGroupKey> values,
OutputCollector<Text, CompositeGroupKey> output, Reporter reporter)
throws IOException {
int sum = 0;
int length = 0;
while (values.hasNext()) {
CompositeGroupKey value = (CompositeGroupKey) values.next();
sum += (Integer) value.getCount(); // process value
length = (Integer) key.getLength();
}
CompositeGroupKey cgk = new CompositeGroupKey(sum,length);
output.collect(key, cgk);
}
}
This is the class wordcount
import java.util.ArrayList;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobStatus;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.StringUtils;
public class WordCount {
public static void main(String[] args) {
JobClient client = new JobClient();
JobConf conf = new JobConf(WordCount.class);
// specify output types
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(CompositeGroupKey.class);
conf.setMapOutputKeyClass(Text.class);
conf.setMapOutputValueClass(CompositeGroupKey.class);
// specify input and output dirs
FileInputFormat.addInputPath(conf, new Path("input"));
FileOutputFormat.setOutputPath(conf, new Path("output16"));
// specify a mapper
conf.setMapperClass(WordCountMapper.class);
// specify a reducer
conf.setReducerClass(WordCountReducer.class);
conf.setCombinerClass(WordCountReducer.class);
client.setConf(conf);
try {
JobClient.runJob(conf);
} catch (Exception e) {
e.printStackTrace();
}
}
}
And this is the groupcompositekey
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableUtils;
public class CompositeGroupKey implements WritableComparable<CompositeGroupKey> {
int count;
int length;
public CompositeGroupKey(int c, int l) {
this.count = c;
this.length = l;
}
public void write(DataOutput out) throws IOException {
WritableUtils.writeVInt(out, count);
WritableUtils.writeVInt(out, length);
}
public void readFields(DataInput in) throws IOException {
this.count = WritableUtils.readVInt(in);
this.length = WritableUtils.readVInt(in);
}
public int compareTo(CompositeGroupKey pop) {
return 0;
}
public int getCount() {
return this.count;
}
public int getLength() {
return this.length;
}
}
Right now I get this error:
java.lang.RuntimeException: java.lang.NoSuchMethodException: CompositeGroupKey.<init>()
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:80)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:62)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
at org.apache.hadoop.mapred.Task$ValuesIterator.readNextValue(Task.java:738)
at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:678)
at org.apache.hadoop.mapred.Task$CombineValuesIterator.next(Task.java:757)
at WordCountReducer.reduce(WordCountReducer.java:24)
at WordCountReducer.reduce(WordCountReducer.java:1)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.combineAndSpill(MapTask.java:904)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:785)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:698)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:228)
at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2209)
Caused by: java.lang.NoSuchMethodException: CompositeGroupKey.<init>()
at java.lang.Class.getConstructor0(Unknown Source)
at java.lang.Class.getDeclaredConstructor(Unknown Source)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:74)
I know the coding's not that good, but right now we don't have any idea where we went wrong, so any help would be welcome!
You have to provide an empty default constructor in your key class CompositeGroupKey. It is used for serialization.
Just add:
public CompositeGroupKey() {
}
Whenever you see an exception like the one given below
java.lang.RuntimeException: java.lang.NoSuchMethodException: CompositeGroupKey.<init>()
it is a problem with object instantiation: a required constructor is not present, either the default constructor or a parameterised one.
The moment you write a parameterised constructor, the compiler suppresses the default constructor unless you declare it explicitly.
The answer given by Ruslan Ostafiichuk is enough to answer your query, but I added some more points to make things clearer.
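To see why the no-arg constructor matters, here is a minimal stand-in for the Writable pattern that can run without Hadoop on the classpath: plain DataStreams replace Hadoop's WritableUtils, but the reflective instantiation step is the same one ReflectionUtils performs.

```java
import java.io.*;

// Minimal stand-in for the Writable pattern: Hadoop instantiates key/value
// classes reflectively through the no-arg constructor and then calls
// readFields() to populate the fields.
class CompositeGroupKey {
    int count;
    int length;

    public CompositeGroupKey() {}  // the constructor Hadoop's reflection needs

    public CompositeGroupKey(int c, int l) { count = c; length = l; }

    void write(DataOutput out) throws IOException {
        out.writeInt(count);
        out.writeInt(length);
    }

    void readFields(DataInput in) throws IOException {
        count = in.readInt();
        length = in.readInt();
    }
}

public class RoundTrip {
    // Serialize, then rebuild via reflection; without the no-arg constructor
    // the getDeclaredConstructor() call throws NoSuchMethodException.
    static CompositeGroupKey roundTrip(CompositeGroupKey original) throws Exception {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        original.write(new DataOutputStream(buf));
        CompositeGroupKey copy =
            CompositeGroupKey.class.getDeclaredConstructor().newInstance();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
        return copy;
    }

    public static void main(String[] args) throws Exception {
        CompositeGroupKey copy = roundTrip(new CompositeGroupKey(3, 7));
        System.out.println(copy.count + " " + copy.length); // 3 7
    }
}
```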

MapReduce: How to get mapper to process multiple lines?

Goal:
I want to be able to specify the number of mappers used on an input file
Equivalently, I want to specify the number of line of a file each mapper will take
Simple example:
For an input file of 10 lines (of unequal length; example below), I want there to be 2 mappers -- each mapper will thus process 5 lines.
This is
an arbitrary example file
of 10 lines.
Each line does
not have to be
of
the same
length or contain
the same
number of words
This is what I have:
(I have it so that each mapper produces one "<map,1>" key-value pair ... so that it will then be summed in the reducer)
package org.myorg;
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.InputFormat;
public class Test {
// produce one "<map,1>" pair per mapper
public static class Map extends Mapper<Object, Text, Text, IntWritable>{
private final static IntWritable one = new IntWritable(1);
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
context.write(new Text("map"), one);
}
}
// reduce by taking a sum
public static class Red extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job1 = Job.getInstance(conf, "pass01");
job1.setJarByClass(Test.class);
job1.setMapperClass(Map.class);
job1.setCombinerClass(Red.class);
job1.setReducerClass(Red.class);
job1.setOutputKeyClass(Text.class);
job1.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job1, new Path(args[0]));
FileOutputFormat.setOutputPath(job1, new Path(args[1]));
// // Attempt#1
// conf.setInt("mapreduce.input.lineinputformat.linespermap", 5);
// job1.setInputFormatClass(NLineInputFormat.class);
// // Attempt#2
// NLineInputFormat.setNumLinesPerSplit(job1, 5);
// job1.setInputFormatClass(NLineInputFormat.class);
// // Attempt#3
// conf.setInt(NLineInputFormat.LINES_PER_MAP, 5);
// job1.setInputFormatClass(NLineInputFormat.class);
// // Attempt#4
// conf.setInt("mapreduce.input.fileinputformat.split.minsize", 234);
// conf.setInt("mapreduce.input.fileinputformat.split.maxsize", 234);
System.exit(job1.waitForCompletion(true) ? 0 : 1);
}
}
The above code, using the above example data, will produce
map 10
I want the output to be
map 2
where the first mapper will do something with the first 5 lines, and the second mapper will do something with the second 5 lines.
You could use NLineInputFormat.
With NLineInputFormat functionality, you can specify exactly how many lines should go to a mapper.
E.g. if your file has 500 lines and you set the number of lines per mapper to 10, you get 50 mappers
(instead of one, assuming the file is smaller than an HDFS block).
EDIT:
Here is an example for using NLineInputFormat:
Mapper Class:
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class MapperNLine extends Mapper<LongWritable, Text, LongWritable, Text> {
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
context.write(key, value);
}
}
Driver class:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class Driver extends Configured implements Tool {
@Override
public int run(String[] args) throws Exception {
if (args.length != 2) {
System.out
.printf("Two parameters are required for DriverNLineInputFormat- <input dir> <output dir>\n");
return -1;
}
Job job = new Job(getConf());
job.setJobName("NLineInputFormat example");
job.setJarByClass(Driver.class);
job.setInputFormatClass(NLineInputFormat.class);
NLineInputFormat.addInputPath(job, new Path(args[0]));
job.getConfiguration().setInt("mapreduce.input.lineinputformat.linespermap", 5);
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(MapperNLine.class);
job.setNumReduceTasks(0);
boolean success = job.waitForCompletion(true);
return success ? 0 : 1;
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new Configuration(), new Driver(), args);
System.exit(exitCode);
}
}
With the input you provided the output from the above sample Mapper would be written to two files as 2 Mappers get initialized :
part-m-00001
0 This is
8 an arbitrary example file
34 of 10 lines.
47 Each line does
62 not have to be
part-m-00002
77 of
80 the same
89 length or contain
107 the same
116 number of words

Splitting Files w.r.t input file MapReduce

Can somebody suggest what's wrong in the following code?
Can you help me get the output below using this MapReduce program?
The code actually runs fine, but the output is not as expected: it is generated in two files, yet records end up swapped between the Name.txt file and the Age.txt file.
Input File:
Name:A
Age:28
Name:B
Age:25
Name:K
Age:20
Name:P
Age:18
Name:Ak
Age:11
Name:N
Age:14
Name:Kr
Age:26
Name:Ra
Age:27
And my output should split into Name and Age
Name File:
Name:A
Name:B
Name:K
Name:P
Name:Ak
Name:N
Name:Kr
Name:Ra
Age File:
Age:28
Age:25
Age:20
Age:18
Age:11
Age:14
Age:26
Age:27
My Code :
MyMapper.java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
public class MyMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {
public void map(LongWritable key, Text value,OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
String [] dall=value.toString().split(":");
output.collect(new Text(dall[0]),new Text(dall[1]));
}
}
MyReducer.Java:
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
public class MyReducer extends MapReduceBase implements Reducer<Text, Text, Text, Text> {
public void reduce(Text key, Iterator<Text> values,OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
while (values.hasNext()) {
output.collect(new Text(key),new Text(values.next()));
}
}
}
MultiFileOutput.java:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.*;
public class MultiFileOutput extends MultipleTextOutputFormat<Text, Text>{
protected String generateFileNameForKeyValue(Text key, Text value,String name) {
//return new Path(key.toString(), name).toString();
return key.toString();
}
protected Text generateActualKey(Text key, Text value) {
//return new Text(key.toString());
return null;
}
}
MyDriver.java:
import java.io.IOException;
import java.lang.Exception;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
public class MyDriver{
public static void main(String[] args) throws Exception,IOException {
Configuration mycon=new Configuration();
JobConf conf = new JobConf(mycon,MyDriver.class);
//JobConf conf = new JobConf(MyDriver.class);
conf.setJobName("Splitting");
conf.setMapperClass(MyMapper.class);
conf.setReducerClass(MyReducer.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(MultiFileOutput.class);
conf.setOutputKeyClass(Text.class);
conf.setMapOutputKeyClass(Text.class);
//conf.setOutputValueClass(Text.class);
conf.setMapOutputValueClass(Text.class);
FileInputFormat.setInputPaths(conf,new Path(args[0]));
FileOutputFormat.setOutputPath(conf,new Path(args[1]));
JobClient.runJob(conf);
//System.err.println(JobClient.runJob(conf));
}
}
ThankYou
Ok this is a bit more complicated use case than simple word count :)
So what you need is a complex key and a partitioner, with the number of reducers set to 2.
Your complex key could be a Text (a concatenation such as Name|A or Age|28) or a custom Writable (with two instance variables holding the type, Name or Age, and the value).
In the mapper you create the Text or custom Writable and set it as the output key; the value can be just the name of the person or his age.
Create a partitioner (which implements org.apache.hadoop.mapred.Partitioner). In the getPartition method you basically decide based on your key which reducer it will go to.
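The routing decision above can be sketched in isolation. This is an illustrative example (the composite-key format and class name are assumptions, not from the post); in a real job this logic would live in getPartition() of a class implementing org.apache.hadoop.mapred.Partitioner<Text, Text>, with the number of reducers set to 2. Plain Strings keep the sketch testable standalone.

```java
// Route composite keys such as "Name|A" or "Age|28" so that all Name
// records reach reducer 0 and all Age records reach reducer 1.
public class FieldPartitioner {
    public static int getPartition(String key, int numPartitions) {
        return (key.startsWith("Name") ? 0 : 1) % numPartitions;
    }

    public static void main(String[] args) {
        System.out.println(getPartition("Name|A", 2)); // 0
        System.out.println(getPartition("Age|28", 2)); // 1
    }
}
```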
Hope this helps.
