I am having trouble chaining two mapreduce jobs

The first MapReduce job is:
map ( key, line ):
    read 2 long integers from the line into the variables key2 and value2
    emit (key2,value2)
reduce ( key, nodes ):
    count = 0
    for n in nodes
        count++
    emit(key,count)
The second MapReduce job is:
map ( node, count ):
    emit(count,1)
reduce ( key, values ):
    sum = 0
    for v in values
        sum += v
    emit(key,sum)
The code I wrote for this is:
import java.io.IOException;
import java.util.Scanner;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class Graph extends Configured implements Tool{
@Override
public int run( String[] args ) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "job1");
job.setJobName("job1");
job.setJarByClass(Graph.class);
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(IntWritable.class);
job.setMapOutputKeyClass(IntWritable.class);
job.setMapOutputValueClass(IntWritable.class);
job.setMapperClass(MyMapper.class);
job.setReducerClass(MyReducer.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.setInputPaths(job,new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path("job1"));
job.waitForCompletion(true);
Job job2 = Job.getInstance(conf, "job2");
job2.setJobName("job2");
job2.setOutputKeyClass(IntWritable.class);
job2.setOutputValueClass(IntWritable.class);
job2.setMapOutputKeyClass(IntWritable.class);
job2.setMapOutputValueClass(IntWritable.class);
job2.setMapperClass(MyMapper1.class);
job2.setReducerClass(MyReducer1.class);
job2.setInputFormatClass(TextInputFormat.class);
job2.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.setInputPaths(job2,new Path("job1"));
FileOutputFormat.setOutputPath(job2,new Path(args[1]));
job2.waitForCompletion(true);
return 0;
}
public static void main ( String[] args ) throws Exception {
ToolRunner.run(new Configuration(),new Graph(),args);
}
public static class MyMapper extends Mapper<Object,Text,IntWritable,IntWritable> {
@Override
public void map ( Object key, Text value, Context context )
throws IOException, InterruptedException {
Scanner s = new Scanner(value.toString()).useDelimiter(",");
int key2 = s.nextInt();
int value2 = s.nextInt();
context.write(new IntWritable(key2),new IntWritable(value2));
s.close();
}
}
public static class MyReducer extends Reducer<IntWritable,IntWritable,IntWritable,IntWritable> {
@Override
public void reduce ( IntWritable key, Iterable<IntWritable> values, Context context )
throws IOException, InterruptedException {
int count = 0;
for (IntWritable v: values) {
count++;
};
context.write(key,new IntWritable(count));
}
}
public static class MyMapper1 extends Mapper<IntWritable, IntWritable,IntWritable,IntWritable >{
@Override
public void map(IntWritable node, IntWritable count, Context context )
throws IOException, InterruptedException {
context.write(count, new IntWritable(1));
}
}
public static class MyReducer1 extends Reducer<IntWritable,IntWritable,IntWritable,IntWritable> {
@Override
public void reduce ( IntWritable key, Iterable<IntWritable> values, Context context )
throws IOException, InterruptedException {
int sum = 0;
for (IntWritable v: values) {
sum += v.get();
};
context.write(key,new IntWritable(sum));
//System.out.println("job 2"+sum);
}
}
}
I have tried to implement that pseudocode; args[0] is the input and args[1] is the output. When I run the code, I get the output of job1 but not that of job2.
What seems to be wrong?
I think I am not passing the output of job1 to job2 properly.

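Instead of the relative path "job1" in
FileOutputFormat.setOutputPath(job, new Path("job1"));
use an explicit intermediate location:
String temporary="home/xxx/...."; //store result here
FileOutputFormat.setOutputPath(job, new Path(temporary));
and point the input of job2 at the same location:
FileInputFormat.setInputPaths(job2, new Path(temporary));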
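
A further thing that may be worth checking, assuming job1 writes through TextOutputFormat as in the question: its output files hold one "node<TAB>count" line per key, and TextInputFormat hands the second mapper a byte offset and that line as text, not two IntWritable values. A mapper declared as Mapper<IntWritable, IntWritable, ...> would then not match the records it receives. The parsing and counting that the second job has to perform can be sketched in plain Java without any Hadoop classes; the class and method names below are hypothetical and only illustrate the logic.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch (no Hadoop): what the second job has to do with the
// text lines written by the first job. All names are illustrative only.
class DegreeHistogram {

    // Second "map" step: a line of job1 output looks like "node<TAB>count".
    // Only the count is needed, because it becomes the new key.
    static int parseCount(String line) {
        String[] fields = line.split("\t");
        return Integer.parseInt(fields[1].trim());
    }

    // Second "reduce" step: for each count, add up the 1s emitted by the map step.
    static Map<Integer, Integer> histogram(List<String> job1Lines) {
        Map<Integer, Integer> result = new TreeMap<>();
        for (String line : job1Lines) {
            result.merge(parseCount(line), 1, Integer::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        // Two nodes with count 2 and one node with count 5.
        List<String> job1Lines = Arrays.asList("1\t2", "2\t2", "3\t5");
        System.out.println(histogram(job1Lines));
    }
}
```

In Hadoop terms this would correspond to declaring the second mapper with a LongWritable key and a Text value and parsing the value as above. Reading the intermediate data with KeyValueTextInputFormat, or writing it as a sequence file, are alternatives that may be worth looking at.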

Related

Analyzing multiple input files and output only one file containing one final result

I do not have a great understanding of MapReduce. What I need is a one-line result from the analysis of several input files. Currently, my result contains one line per input file: with 3 input files, I get one output file containing 3 lines, one result per input. Since I sort the results, I need to write only the first one to the HDFS file. My code is below:
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class WordLength {
public static class Map extends Mapper<Object, Text, LongWritable, Text> {
// private final static IntWritable one = new IntWritable(1);
int max = Integer.MIN_VALUE;
private Text word = new Text();
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString(); // takes one line (a sentence) from the file
StringTokenizer tokenizer = new StringTokenizer(line); // splits the sentence into words
while (tokenizer.hasMoreTokens()) {
String s= tokenizer.nextToken();
int val = s.length();
if(val>max) {
max=val;
word.set(s);
}
}
}
public void cleanup(Context context) throws IOException, InterruptedException {
context.write(new LongWritable(max), word);
}
}
public static class IntSumReducer
extends Reducer<LongWritable,Text,Text,LongWritable> {
private IntWritable result = new IntWritable();
int max=-100;
public void reduce(LongWritable key, Iterable<Text> values,
Context context
) throws IOException, InterruptedException {
context.write(new Text("longest"), key);
//context.write(new Text("longest"),key);
System.err.println(key);
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(Map.class);
job.setSortComparatorClass(LongWritable.DecreasingComparator.class);
//job.setCombinerClass(IntSumReducer.class);
job.setNumReduceTasks(1);
job.setReducerClass(IntSumReducer.class);
job.setMapOutputKeyClass(LongWritable.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(LongWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
}
It finds the length of the longest word in each input and prints it. But I need to find the longest length across all input files and print only one line.
So the output is:
longest 11
longest 10
longest 8
I want it to contain only:
longest 11
Thanks

I changed my code for finding the longest word length. Now it prints only "longest 11". If you know a better way, please feel free to correct my solution, as I am eager to learn the best options.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Mapper.Context;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class WordCount {
public static class Map extends Mapper<Object, Text, Text, LongWritable> {
// private final static IntWritable one = new IntWritable(1);
int max = Integer.MIN_VALUE;
private Text word = new Text();
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString(); // takes one line (a sentence) from the file
StringTokenizer tokenizer = new StringTokenizer(line); // splits the sentence into words
while (tokenizer.hasMoreTokens()) {
String s= tokenizer.nextToken();
int val = s.length();
if(val>max) {
max=val;
word.set(s);
context.write(word,new LongWritable(val));
}
}
}
}
public static class IntSumReducer
extends Reducer<Text,LongWritable,Text,LongWritable> {
private LongWritable result = new LongWritable();
long max=-100;
public void reduce(Text key, Iterable<LongWritable> values,
Context context
) throws IOException, InterruptedException {
// int sum = -1;
for (LongWritable val : values) {
if(val.get()>max) {
max=val.get();
}
}
result.set(max);
}
public void cleanup(Context context) throws IOException, InterruptedException {
context.write(new Text("longest"),result );
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(Map.class);
// No LongWritable sort comparator here: the map output keys are now Text, and the reducer tracks the maximum itself.
// job.setCombinerClass(IntSumReducer.class);
job.setNumReduceTasks(1);
job.setReducerClass(IntSumReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(LongWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(LongWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
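
One possible way to think about the computation, sketched in plain Java without Hadoop classes: each input is reduced to a single local maximum, and one final step takes the maximum of those values. The class and method names are hypothetical, and the sketch only illustrates the logic, not the job setup.

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

// Plain-Java sketch (no Hadoop) of "local maximum per input, then one global maximum".
class LongestWord {

    // "Map" side: length of the longest whitespace-separated token in one input.
    static int localMax(List<String> lines) {
        int max = 0;
        for (String line : lines) {
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                max = Math.max(max, tokenizer.nextToken().length());
            }
        }
        return max;
    }

    // "Reduce" side: a single call sees every local maximum and keeps the largest.
    static int globalMax(List<Integer> localMaxima) {
        int max = 0;
        for (int m : localMaxima) {
            max = Math.max(max, m);
        }
        return max;
    }

    public static void main(String[] args) {
        int a = localMax(Arrays.asList("a short line", "another one"));
        int b = localMax(Arrays.asList("extraordinary words"));
        System.out.println("longest " + globalMax(Arrays.asList(a, b)));
    }
}
```

In a real job this would amount to emitting every local maximum under the same constant key, so that a single reduce call receives all of them and writes exactly one line.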

hadoop mapreduce to generate substrings of different lengths

Using Hadoop MapReduce, I am writing code to get substrings of different lengths. For example, given the string "ZYXCBA" and length 3 (the input text file contains the line "3 ZYXCBA"), my code has to return all possible substrings of length 3 ("ZYX","YXC","XCB","CBA"), of length 4 ("ZYXC","YXCB","XCBA") and finally of length 5 ("ZYXCB","YXCBA").
In map phase I did the following:
key = length of substrings I want
value = "ZYXCBA".
So mapper output is
3,"ZYXCBA"
4,"ZYXCBA"
5,"ZYXCBA"
In the reducer I take the string ("ZYXCBA") and key 3 to get all substrings of length 3. The same happens for 4 and 5. The results are concatenated into one string, so the output of reduce should be:
3 "ZYX YXC XCB CBA"
4 "ZYXC YXCB XCBA"
5 "ZYXCB YXCBA"
I am running my code using following command:
hduser@Ganesh:~/Documents$ hadoop jar Saishingles.jar hadoopshingles.Saishingles Behara/Shingles/input Behara/Shingles/output
My code is as shown below:
package hadoopshingles;
import java.io.IOException;
//import java.util.ArrayList;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class Saishingles{
public static class shinglesmapper extends Mapper<Object, Text, IntWritable, Text>{
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
String str = new String(value.toString());
String[] list = str.split(" ");
int x = Integer.parseInt(list[0]);
String val = list[1];
int M = val.length();
int X = M-1;
for(int z = x; z <= X; z++)
{
context.write(new IntWritable(z), new Text(val));
}
}
}
public static class shinglesreducer extends Reducer<IntWritable,Text,IntWritable,Text> {
public void reduce(IntWritable key, Text value, Context context
) throws IOException, InterruptedException {
int z = key.get();
String str = new String(value.toString());
int M = str.length();
int Tz = M - z;
String newvalue = "";
for(int position = 0; position <= Tz; position++)
{
newvalue = newvalue + " " + str.substring(position,position + z);
}
context.write(new IntWritable(z),new Text(newvalue));
}
}
public static void main(String[] args) throws Exception {
GenericOptionsParser parser = new GenericOptionsParser(args);
Configuration conf = parser.getConfiguration();
String[] otherArgs = parser.getRemainingArgs();
if (otherArgs.length != 2)
{
System.err.println("Usage: Saishingles <inputFile> <outputDir>");
System.exit(2);
}
Job job = Job.getInstance(conf, "Saishingles");
job.setJarByClass(hadoopshingles.Saishingles.class);
job.setMapperClass(shinglesmapper.class);
//job.setCombinerClass(shinglesreducer.class);
job.setReducerClass(shinglesreducer.class);
//job.setMapOutputKeyClass(IntWritable.class);
//job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Output of reduce instead of returning
3 "ZYX YXC XCB CBA"
4 "ZYXC YXCB XCBA"
5 "ZYXCB YXCBA"
it's returning
3 "ZYXCBA"
4 "ZYXCBA"
5 "ZYXCBA"
That is, it gives the same output as the mapper. I don't know why this is happening. Please help me resolve this; thanks in advance.

You can achieve this without running a reducer at all. Your map/reduce logic is the wrong way round: the transformation should be done in the mapper.
Reduce - In this phase the reduce(WritableComparable, Iterator, OutputCollector, Reporter) method is called for each <key, (list of values)> pair in the grouped inputs.
Your reduce signature, public void reduce(IntWritable key, Text value, Context context), should be public void reduce(IntWritable key, Iterable<Text> values, Context context). Because the original method takes a single Text instead of an Iterable<Text>, it does not override Reducer.reduce, so Hadoop runs the default identity reduce and the mapper output passes through unchanged.
Also, change the last line of the reduce method from context.write(new IntWritable(z),new Text(newvalue)); to context.write(key,new Text(newvalue)); since you already have the IntWritable key from the mapper, there is no need to create a new one.
with given input:
3 "ZYXCBA"
4 "ZYXCBA"
5 "ZYXCBA"
The job will output (the order of the substrings within a line is not guaranteed):
3 "CBA XCB YXC ZYX"
4 "XCBA YXCB ZYXC"
5 "YXCBA ZYXCB"
MapReduceJob:
import java.io.IOException;
import java.util.ArrayList;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.Reducer.Context;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class SubStrings{
public static class SubStringsMapper extends Mapper<Object, Text, IntWritable, Text> {
@Override
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
String [] values = value.toString().split(" ");
int len = Integer.parseInt(values[0].trim());
String str = values[1].replaceAll("\"", "").trim();
// Slide a window of width len over the whole string; the last start index is str.length() - len.
for(int i = 0; i + len <= str.length(); i++)
{
context.write(new IntWritable(len), new Text(str.substring(i, i + len)));
}
}
}
public static class SubStringsReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
public void reduce(IntWritable key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
String str="\""; //adding starting quotes
for(Text value: values)
str += " " + value;
str=str.replace("\" ", "\"") + "\""; //adding ending quotes
context.write(key, new Text(str));
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "get-possible-strings-by-length");
job.setJarByClass(SubStrings.class);
job.setMapperClass(SubStringsMapper.class);
job.setReducerClass(SubStringsReducer.class);
job.setMapOutputKeyClass(IntWritable.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
FileSystem fs = null;
Path dstFilePath = new Path(args[1]);
try {
fs = dstFilePath.getFileSystem(conf);
if (fs.exists(dstFilePath))
fs.delete(dstFilePath, true);
} catch (IOException e1) {
e1.printStackTrace();
}
job.waitForCompletion(true);
}
}
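
The sliding-window step at the heart of this job can be checked outside Hadoop. The following is a minimal plain-Java sketch with hypothetical class and method names: the start index runs while i + len <= str.length(), which yields str.length() - len + 1 windows.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch (no Hadoop) of the substring generation.
class Shingles {

    // Returns every contiguous substring of exactly len characters, left to right.
    static List<String> substringsOfLength(String str, int len) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + len <= str.length(); i++) {
            out.add(str.substring(i, i + len));
        }
        return out;
    }

    public static void main(String[] args) {
        String str = "ZYXCBA";
        // Lengths 3, 4 and 5, as in the question.
        for (int len = 3; len < str.length(); len++) {
            System.out.println(len + " \"" + String.join(" ", substringsOfLength(str, len)) + "\"");
        }
    }
}
```

For "ZYXCBA" and length 3 this produces the four substrings listed in the question, in left-to-right order; a length greater than the string length simply produces an empty list.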

Word Merge in hadoop

I would like to merge (concatenate) strings using Hadoop: the mapper groups the words, and the reducer concatenates the values that share a common key.
Below is my code for the map-reduce job.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class mr2 {
// mapper class
public static class TokenizerMapper extends Mapper<Text, Text, Text, Text>{
private Text word = new Text(); // key
private Text value_of_key = new Text(); // value
public void map(Text key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
String IndexAndCategory = "";
String value_of_the_key = "";
StringTokenizer itr = new StringTokenizer(line);
// key creation
IndexAndCategory += itr.nextToken() + " ";
IndexAndCategory += itr.nextToken() + " ";
// value creation
value_of_the_key += itr.nextToken() + ":";
value_of_the_key += itr.nextToken() + " ";
// key and value
word.set(IndexAndCategory);
value_of_key.set(value_of_the_key);
// write key-value pair
context.write(word, (Text)value_of_key);
}
}
// reducer class
public static class IntSumReducer extends Reducer<Text,Text,Text,Text> {
//private IntWritable result = new IntWritable();
private Text values_of_key = new Text();
@Override
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
String values_ = "";
for (Text val : values) {
values_ += val.toString();
}
values_of_key.set(values_);
context.write(key, values_of_key);
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "mr2");
job.setJarByClass(mr2.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setNumReduceTasks(1);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
The input to mapper is in the below format.
1 A this 2
1 A the 1
3 B is 1
The mapper process this into the below format and gives to reducer
1 A this:2
1 A the:1
3 B is:1
The reduce then reduces the given input into below format.
1 A this:2 the:1
3 B is:1
I used word count as a basic template and modified it to process Text (String) values, but when I execute the code above I get the following error.
Error: java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.Text
at mr2$TokenizerMapper.map(mr2.java:17)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:784)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
It seems to be expecting LongWritable. Any help in solving this issue is appreciated.

If you're reading a text file, the mapper must be defined as
public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, Text>{
So the map method should look like this
@Override
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

The problem was that the main function did not specify the output types of the mapper, so the reducer expected the default types as input. For more details, refer to this post.
Changed input type to Object from Text.
public static class TokenizerMapper extends Mapper<Object, Text, Text, Text>{
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
Adding the following lines solved the issue.
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
The following is the complete working code.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.io.LongWritable;
public class mr2 {
// mapper class
public static class TokenizerMapper extends Mapper<Object, Text, Text, Text>{
private Text word = new Text(); // key
private Text value_of_key = new Text(); // value
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
String IndexAndCategory = "";
String value_of_the_key = "";
StringTokenizer itr = new StringTokenizer(line);
// key creation
IndexAndCategory += itr.nextToken() + " ";
IndexAndCategory += itr.nextToken() + " ";
// value creation
value_of_the_key += itr.nextToken() + ":";
value_of_the_key += itr.nextToken() + " ";
// key and value
word.set(IndexAndCategory);
value_of_key.set(value_of_the_key);
// write key-value pair
context.write(word, value_of_key);
}
}
// reducer class
public static class IntSumReducer extends Reducer<Text,Text,Text,Text> {
//private IntWritable result = new IntWritable();
private Text values_of_key = new Text();
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
String values_ = "";
for (Text val : values) {
values_ += val.toString();
}
values_of_key.set(values_);
context.write(key, values_of_key);
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "mr2");
job.setJarByClass(mr2.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setNumReduceTasks(1);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
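
The grouping that this job performs can be tried out without a cluster. The following plain-Java sketch uses hypothetical names and assumes that every input line has exactly four whitespace-separated tokens, as in the sample input; it builds the same "index category" key and "word:count" value as the mapper and joins the values per key as the reducer does.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Plain-Java sketch (no Hadoop) of the key/value construction and the merge step.
class WordMerge {

    static Map<String, String> merge(List<String> lines) {
        // LinkedHashMap keeps the keys in first-seen order for readable output.
        Map<String, String> grouped = new LinkedHashMap<>();
        for (String line : lines) {
            StringTokenizer itr = new StringTokenizer(line);
            String key = itr.nextToken() + " " + itr.nextToken();   // e.g. "1 A"
            String value = itr.nextToken() + ":" + itr.nextToken(); // e.g. "this:2"
            // Append to the existing value for this key, separated by a space.
            grouped.merge(key, value, (a, b) -> a + " " + b);
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("1 A this 2", "1 A the 1", "3 B is 1");
        for (Map.Entry<String, String> e : merge(input).entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

Unlike this sketch, a real job gives no guarantee about the order in which the values of one key reach the reducer, so "this:2 the:1" may also come out as "the:1 this:2".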

two output files for two different tasks using same input file in hadoop

I am new to Hadoop, so I have taken up some practice tasks. I have a CSV file containing a company's data, and I want to find the names of the visitors and the names of the visitees in it.
Here is my code, which so far finds only the visitors. My output has to be a file with the top 20 visitors and the count of each, followed by the top 20 visitees with their counts.
The first, last and middle name of the visitor are in columns 0, 1 and 2, and the name of the visitee is in column 20.
package dataset;
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class Data {
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
private Text word = new Text();
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String[] str=value.toString().split(",");
String wo=str[1]+" "+str[2]+" "+str[0];
// System.out.println(str[0]);
word.set(wo);
context.write(word, new IntWritable(1));
}
}
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
TreeMap<IntWritable,Text> T=new TreeMap<>();
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
T.put(new IntWritable(sum),new Text(key));
if(T.size()>20)
{
System.out.println(T.firstKey());
T.remove(T.firstKey());
}
}
protected void cleanup(Context context) throws IOException, InterruptedException
{
for (IntWritable k : T.keySet()) {
System.out.println(k);
context.write(T.get(k),k);
}
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
}
}
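
One detail to watch in the reducer above: a TreeMap keyed by the count holds only one name per count, so visitors with equal counts overwrite each other. A plain-Java sketch of a top-N selection that keeps ties follows; the class and method names are hypothetical, and the input map stands in for the per-name sums computed in reduce.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch (no Hadoop) of selecting the N most frequent names.
class TopVisitors {

    static List<Map.Entry<String, Integer>> topN(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // Highest count first; equal counts are ordered by name so the result is deterministic.
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return new ArrayList<>(entries.subList(0, Math.min(n, entries.size())));
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("visitor one", 3);
        counts.put("visitor two", 3);
        counts.put("visitor three", 1);
        for (Map.Entry<String, Integer> e : topN(counts, 2)) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

The same selection could be run once on visitor names and once on visitee names. For writing the two lists to separate files from one job, Hadoop's MultipleOutputs class may be worth looking into.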

Hadoop error .ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.Text

My program is as follows:
public static class MapClass extends Mapper<Text, Text, Text, LongWritable> {
public void map(Text key, Text value, Context context) throws IOException, InterruptedException {
// your map code goes here
String[] fields = value.toString().split(",");
for(String str : fields) {
context.write(new Text(str), new LongWritable(1L));
}
}
}
public int run(String args[]) throws Exception {
Job job = new Job();
job.setJarByClass(TopOS.class);
job.setMapperClass(MapClass.class);
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setJobName("TopOS");
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(LongWritable.class);
job.setNumReduceTasks(0);
boolean success = job.waitForCompletion(true);
return success ? 0 : 1;
}
public static void main(String args[]) throws Exception {
int ret = ToolRunner.run(new TopOS(), args);
System.exit(ret);
}
}
My data looks like:
123456,Windows,6.1,6394829384232,343534353,23432,23434343,12322
123456,OSX,10,6394829384232,23354353,23432,23434343,63635
123456,Windows,6.0,5396459384232,343534353,23432,23434343,23635
123456,Windows,6.0,6393459384232,343534353,23432,23434343,33635
Why am I getting the following error? How can I get around this?
Hadoop : java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.Text

From my point of view there is just a small error in your code.
Because you are using a flat text file as input, the key class is fixed to LongWritable (which you don't need or use) and the value class is Text.
Setting the key class in your Mapper to Object, to underline that you don't use it, gets rid of the error.
Here is my slightly modified code.
package org.woopi.stackoverflow.q22853574;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.fs.Path;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
public class MapReduceJob {
public static class MapClass extends Mapper<Object, Text, Text, LongWritable> {
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
// your map code goes here
String[] fields = value.toString().split(",");
for(String str : fields) {
context.write(new Text(str), new LongWritable(1L));
}
}
}
public int run(String args[]) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
job.setJarByClass(MapReduceJob.class);
job.setMapperClass(MapClass.class);
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setJobName("MapReduceJob");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(LongWritable.class);
job.setNumReduceTasks(0);
job.setInputFormatClass(TextInputFormat.class);
boolean success = job.waitForCompletion(true);
return success ? 0 : 1;
}
public static void main(String args[]) throws Exception {
MapReduceJob j = new MapReduceJob();
int ret = j.run(args);
System.exit(ret);
}
}
I hope this helps.
Martin
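
The LongWritable key that TextInputFormat supplies is the byte offset at which each line starts in the file, which is why it is usually ignored. A small plain-Java sketch of those offsets, with hypothetical names and assuming UTF-8 content with "\n" line endings:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch (no Hadoop) of the per-line byte offsets used as map input keys.
class LineOffsets {

    // Returns the starting byte offset of every line in the given content.
    static List<Long> offsets(String fileContent) {
        List<Long> result = new ArrayList<>();
        long offset = 0;
        for (String line : fileContent.split("\n")) {
            result.add(offset);
            // Advance by the encoded line length plus one byte for the newline.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return result;
    }

    public static void main(String[] args) {
        String content = "123456,Windows,6.1\n123456,OSX,10\n";
        System.out.println(offsets(content));
    }
}
```

The mapper's value parameter carries the line itself, so the fields of each record come from splitting the value, never from the key.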

Can you use
//Set the key class for the job output data.
job.setOutputKeyClass(Class<?> theClass)
//Set the value class for job outputs
job.setOutputValueClass(Class<?> theClass)
instead of setMapOutputKeyClass and setMapOutputValueClass methods.
