Hadoop MapReduce to generate substrings of different lengths - Java

Using Hadoop MapReduce, I am writing code to get substrings of different lengths. For example, given the string "ZYXCBA" and length 3 (in a text file I give the input as "3 ZYXCBA"), my code has to return all possible substrings of length 3 ("ZYX", "YXC", "XCB", "CBA"), length 4 ("ZYXC", "YXCB", "XCBA"), and finally length 5 ("ZYXCB", "YXCBA").
In the map phase I did the following:
key = length of substrings I want
value = "ZYXCBA".
So mapper output is
3,"ZYXCBA"
4,"ZYXCBA"
5,"ZYXCBA"
In the reduce phase I take the string ("ZYXCBA") and key 3 to get all substrings of length 3. The same occurs for 4 and 5. The results are concatenated into a single string, so the output of reduce should be:
3 "ZYX YXC XCB CBA"
4 "ZYXC YXCB XCBA"
5 "ZYXCB YXCBA"
I am running my code using the following command:
hduser@Ganesh:~/Documents$ hadoop jar Saishingles.jar hadoopshingles.Saishingles Behara/Shingles/input Behara/Shingles/output
My code is as shown below:
package hadoopshingles;
import java.io.IOException;
//import java.util.ArrayList;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class Saishingles{
public static class shinglesmapper extends Mapper<Object, Text, IntWritable, Text>{
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
String str = new String(value.toString());
String[] list = str.split(" ");
int x = Integer.parseInt(list[0]);
String val = list[1];
int M = val.length();
int X = M-1;
for(int z = x; z <= X; z++)
{
context.write(new IntWritable(z), new Text(val));
}
}
}
public static class shinglesreducer extends Reducer<IntWritable,Text,IntWritable,Text> {
public void reduce(IntWritable key, Text value, Context context
) throws IOException, InterruptedException {
int z = key.get();
String str = new String(value.toString());
int M = str.length();
int Tz = M - z;
String newvalue = "";
for(int position = 0; position <= Tz; position++)
{
newvalue = newvalue + " " + str.substring(position,position + z);
}
context.write(new IntWritable(z),new Text(newvalue));
}
}
public static void main(String[] args) throws Exception {
GenericOptionsParser parser = new GenericOptionsParser(args);
Configuration conf = parser.getConfiguration();
String[] otherArgs = parser.getRemainingArgs();
if (otherArgs.length != 2)
{
System.err.println("Usage: Saishingles <inputFile> <outputDir>");
System.exit(2);
}
Job job = Job.getInstance(conf, "Saishingles");
job.setJarByClass(hadoopshingles.Saishingles.class);
job.setMapperClass(shinglesmapper.class);
//job.setCombinerClass(shinglesreducer.class);
job.setReducerClass(shinglesreducer.class);
//job.setMapOutputKeyClass(IntWritable.class);
//job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Instead of returning
3 "ZYX YXC XCB CBA"
4 "ZYXC YXCB XCBA"
5 "ZYXCB YXCBA"
the output of reduce is
3 "ZYXCBA"
4 "ZYXCBA"
5 "ZYXCBA"
i.e., it gives the same output as the mapper. I don't know why this is happening. Please help me resolve this, and thanks in advance for helping ;) :) :)

You can achieve this without even running a reducer. Your map/reduce logic is wrong: the transformation should be done in the Mapper.
Reduce - In this phase the reduce(WritableComparable, Iterator, OutputCollector, Reporter) method is called for each <key, (list of values)> pair in the grouped inputs.
In your code, the reduce signature public void reduce(IntWritable key, Text value, Context context)
should be public void reduce(IntWritable key, Iterable<Text> values, Context context)
Also, change the last line of the reduce method from context.write(new IntWritable(z),new Text(newvalue)); to context.write(key,new Text(newvalue)); - you already have the IntWritable key from the mapper, so I wouldn't create a new one.
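For reference, a minimal sketch of the question's own reducer with just the signature and the write call fixed (keeping the rest of the original logic) might look like this; the full rewrite further below moves the work into the mapper instead:
// Sketch only: iterate the grouped values instead of taking a single Text.
public static class shinglesreducer extends Reducer<IntWritable, Text, IntWritable, Text> {
    @Override
    public void reduce(IntWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        int z = key.get();
        StringBuilder newvalue = new StringBuilder();
        for (Text value : values) { // here each length key receives the full input string
            String str = value.toString();
            for (int position = 0; position <= str.length() - z; position++) {
                newvalue.append(" ").append(str.substring(position, position + z));
            }
        }
        context.write(key, new Text(newvalue.toString().trim()));
    }
}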
With the given input:
3 "ZYXCBA"
4 "ZYXCBA"
5 "ZYXCBA"
the MapReduce job will output:
3 "XCB YXC ZYX"
4 "XCBA YXCB ZYXC"
5 "YXCBA ZYXCB"
MapReduceJob:
import java.io.IOException;
import java.util.ArrayList;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.Reducer.Context;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class SubStrings{
public static class SubStringsMapper extends Mapper<Object, Text, IntWritable, Text> {
@Override
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
String [] values = value.toString().split(" ");
int len = Integer.parseInt(values[0].trim());
String str = values[1].replaceAll("\"", "").trim();
int endindex = len;
for(int i = 0; i + len <= str.length(); i++)
{
endindex = i + len;
context.write(new IntWritable(len), new Text(str.substring(i, endindex)));
}
}
}
public static class SubStringsReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
public void reduce(IntWritable key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
String str="\""; //adding starting quotes
for(Text value: values)
str += " " + value;
str=str.replace("\" ", "\"") + "\""; //adding ending quotes
context.write(key, new Text(str));
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "get-possible-strings-by-length");
job.setJarByClass(SubStrings.class);
job.setMapperClass(SubStringsMapper.class);
job.setReducerClass(SubStringsReducer.class);
job.setMapOutputKeyClass(IntWritable.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
FileSystem fs = null;
Path dstFilePath = new Path(args[1]);
try {
fs = dstFilePath.getFileSystem(conf);
if (fs.exists(dstFilePath))
fs.delete(dstFilePath, true);
} catch (IOException e1) {
e1.printStackTrace();
}
job.waitForCompletion(true);
}
}

Related

I am having trouble chaining two MapReduce jobs

The first map-reduce is:
map ( key, line ):
read 2 long integers from the line into the variables key2 and value2
emit (key2,value2)
reduce ( key, nodes ):
count = 0
for n in nodes
count++
emit(key,count)
The second Map-Reduce is:
map ( node, count ):
emit(count,1)
reduce ( key, values ):
sum = 0
for v in values
sum += v
emit(key,sum)
The code I wrote for this is:
import java.io.IOException;
import java.util.Scanner;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class Graph extends Configured implements Tool{
@Override
public int run( String[] args ) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "job1");
job.setJobName("job1");
job.setJarByClass(Graph.class);
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(IntWritable.class);
job.setMapOutputKeyClass(IntWritable.class);
job.setMapOutputValueClass(IntWritable.class);
job.setMapperClass(MyMapper.class);
job.setReducerClass(MyReducer.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.setInputPaths(job,new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path("job1"));
job.waitForCompletion(true);
Job job2 = Job.getInstance(conf, "job2");
job2.setJobName("job2");
job2.setOutputKeyClass(IntWritable.class);
job2.setOutputValueClass(IntWritable.class);
job2.setMapOutputKeyClass(IntWritable.class);
job2.setMapOutputValueClass(IntWritable.class);
job2.setMapperClass(MyMapper1.class);
job2.setReducerClass(MyReducer1.class);
job2.setInputFormatClass(TextInputFormat.class);
job2.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.setInputPaths(job2,new Path("job1"));
FileOutputFormat.setOutputPath(job2,new Path(args[1]));
job2.waitForCompletion(true);
return 0;
}
public static void main ( String[] args ) throws Exception {
ToolRunner.run(new Configuration(),new Graph(),args);
}
public static class MyMapper extends Mapper<Object,Text,IntWritable,IntWritable> {
@Override
public void map ( Object key, Text value, Context context )
throws IOException, InterruptedException {
Scanner s = new Scanner(value.toString()).useDelimiter(",");
int key2 = s.nextInt();
int value2 = s.nextInt();
context.write(new IntWritable(key2),new IntWritable(value2));
s.close();
}
}
public static class MyReducer extends Reducer<IntWritable,IntWritable,IntWritable,IntWritable> {
@Override
public void reduce ( IntWritable key, Iterable<IntWritable> values, Context context )
throws IOException, InterruptedException {
int count = 0;
for (IntWritable v: values) {
count++;
};
context.write(key,new IntWritable(count));
}
}
public static class MyMapper1 extends Mapper<IntWritable, IntWritable,IntWritable,IntWritable >{
@Override
public void map(IntWritable node, IntWritable count, Context context )
throws IOException, InterruptedException {
context.write(count, new IntWritable(1));
}
}
public static class MyReducer1 extends Reducer<IntWritable,IntWritable,IntWritable,IntWritable> {
@Override
public void reduce ( IntWritable key, Iterable<IntWritable> values, Context context )
throws IOException, InterruptedException {
int sum = 0;
for (IntWritable v: values) {
sum += v.get();
};
context.write(key,new IntWritable(sum));
//System.out.println("job 2"+sum);
}
}
}
I have tried to implement that pseudocode; args[0] is the input and args[1] is the output. When I run the code, I get the output of job1 but not that of job2.
What seems to be wrong?
I think I am not passing the output of job1 to job2 properly.
Instead of the hard-coded path "job1" in
FileOutputFormat.setOutputPath(job, new Path("job1"));
use this instead:
String temporary = "home/xxx/...."; //store the result here
FileOutputFormat.setOutputPath(job, new Path(temporary));
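For illustration, here is a sketch of the whole run() method with the two jobs chained through an explicit intermediate directory. The intermediate path name is hypothetical, and note that because job2 reads the text output of job1, MyMapper1 would also have to accept (LongWritable, Text) and parse the line, which is not shown here:
// Sketch only: chain job1 -> job2 through an explicit intermediate directory.
@Override
public int run(String[] args) throws Exception {
    Configuration conf = getConf();
    Path input = new Path(args[0]);
    Path output = new Path(args[1]);
    Path intermediate = new Path(args[1] + "_tmp"); // hypothetical temp dir

    Job job1 = Job.getInstance(conf, "job1");
    job1.setJarByClass(Graph.class);
    job1.setMapperClass(MyMapper.class);
    job1.setReducerClass(MyReducer.class);
    job1.setMapOutputKeyClass(IntWritable.class);
    job1.setMapOutputValueClass(IntWritable.class);
    job1.setOutputKeyClass(IntWritable.class);
    job1.setOutputValueClass(IntWritable.class);
    FileInputFormat.setInputPaths(job1, input);
    FileOutputFormat.setOutputPath(job1, intermediate);
    if (!job1.waitForCompletion(true)) {
        return 1; // stop the chain if the first job fails
    }

    Job job2 = Job.getInstance(conf, "job2");
    job2.setJarByClass(Graph.class);
    job2.setMapperClass(MyMapper1.class); // would need to parse the Text lines
    job2.setReducerClass(MyReducer1.class);
    job2.setMapOutputKeyClass(IntWritable.class);
    job2.setMapOutputValueClass(IntWritable.class);
    job2.setOutputKeyClass(IntWritable.class);
    job2.setOutputValueClass(IntWritable.class);
    FileInputFormat.setInputPaths(job2, intermediate);
    FileOutputFormat.setOutputPath(job2, output);
    int rc = job2.waitForCompletion(true) ? 0 : 1;

    // Clean up the intermediate directory once job2 has finished with it.
    intermediate.getFileSystem(conf).delete(intermediate, true);
    return rc;
}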

Analyzing multiple input files and output only one file containing one final result

I do not have a great understanding of MapReduce. What I need to achieve is a single-line result from the analysis of a few input files. Currently, my result contains one line per input file, so if I have 3 input files, the output file contains 3 lines: one result per input. Since I sort the results, I need to write only the first result to the HDFS file. My code is below:
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class WordLength {
public static class Map extends Mapper<Object, Text, LongWritable, Text> {
// private final static IntWritable one = new IntWritable(1);
int max = Integer.MIN_VALUE;
private Text word = new Text();
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString(); //cumleni goturur file dan, 1 line i
StringTokenizer tokenizer = new StringTokenizer(line); //cumleni sozlere bolur
while (tokenizer.hasMoreTokens()) {
String s= tokenizer.nextToken();
int val = s.length();
if(val>max) {
max=val;
word.set(s);
}
}
}
public void cleanup(Context context) throws IOException, InterruptedException {
context.write(new LongWritable(max), word);
}
}
public static class IntSumReducer
extends Reducer<LongWritable,Text,Text,LongWritable> {
private IntWritable result = new IntWritable();
int max=-100;
public void reduce(LongWritable key, Iterable<Text> values,
Context context
) throws IOException, InterruptedException {
context.write(new Text("longest"), key);
//context.write(new Text("longest"),key);
System.err.println(key);
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(Map.class);
job.setSortComparatorClass(LongWritable.DecreasingComparator.class);
//job.setCombinerClass(IntSumReducer.class);
job.setNumReduceTasks(1);
job.setReducerClass(IntSumReducer.class);
job.setMapOutputKeyClass(LongWritable.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(LongWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
It finds the longest word length for each input and prints it out, but I need to find the longest length among all input files and print only one line.
So the output is:
longest 11
longest 10
longest 8
I want it to contain only:
longest 11
Thanks
I changed my code for finding the longest word length. Now it prints only longest 11. If you have a better way, please feel free to correct my solution, as I am eager to learn the best options.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Mapper.Context;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class WordCount {
public static class Map extends Mapper<Object, Text, Text, LongWritable> {
// private final static IntWritable one = new IntWritable(1);
int max = Integer.MIN_VALUE;
private Text word = new Text();
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString(); //cumleni goturur file dan, 1 line i
StringTokenizer tokenizer = new StringTokenizer(line); //cumleni sozlere bolur
while (tokenizer.hasMoreTokens()) {
String s= tokenizer.nextToken();
int val = s.length();
if(val>max) {
max=val;
word.set(s);
context.write(word,new LongWritable(val));
}
}
}
}
public static class IntSumReducer
extends Reducer<Text,LongWritable,Text,LongWritable> {
private LongWritable result = new LongWritable();
long max=-100;
public void reduce(Text key, Iterable<LongWritable> values,
Context context
) throws IOException, InterruptedException {
// int sum = -1;
for (LongWritable val : values) {
if(val.get()>max) {
max=val.get();
}
}
result.set(max);
}
public void cleanup(Context context) throws IOException, InterruptedException {
context.write(new Text("longest"),result );
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(Map.class);
job.setSortComparatorClass(LongWritable.DecreasingComparator.class);
// job.setCombinerClass(IntSumReducer.class);
job.setNumReduceTasks(1);
job.setReducerClass(IntSumReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(LongWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(LongWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}

MAPREDUCE error: method write in interface TaskInputOutputContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT> cannot be applied to given types

package br.edu.ufam.anibrata;
import java.io.*;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.StringTokenizer;
import java.util.Arrays;
import java.util.HashSet;
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MapFileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.log4j.Logger;
import org.kohsuke.args4j.CmdLineException;
import org.kohsuke.args4j.CmdLineParser;
import org.kohsuke.args4j.Option;
import org.kohsuke.args4j.ParserProperties;
import tl.lin.data.array.ArrayListWritable;
import tl.lin.data.pair.PairOfStringInt;
import tl.lin.data.pair.PairOfWritables;
import br.edu.ufam.data.Dataset;
import com.google.gson.JsonSyntaxException;
public class BuildIndexWebTables extends Configured implements Tool {
private static final Logger LOG = Logger.getLogger(BuildIndexWebTables.class);
public static void main(String[] args) throws Exception
{
ToolRunner.run(new BuildIndexWebTables(), args);
}
@Override
public int run(String[] argv) throws Exception {
// Creates a new job configuration for this Hadoop job.
Args args = new Args();
CmdLineParser parser = new CmdLineParser(args, ParserProperties.defaults().withUsageWidth(100));
try
{
parser.parseArgument(argv);
}
catch (CmdLineException e)
{
System.err.println(e.getMessage());
parser.printUsage(System.err);
return -1;
}
Configuration conf = getConf();
conf.setBoolean("mapreduce.map.output.compress", true);
conf.setBoolean("mapreduce.map.output.compress", true);
conf.set("mapreduce.map.failures.maxpercent", "10");
conf.set("mapreduce.max.map.failures.percent", "10");
conf.set("mapred.max.map.failures.percent", "10");
conf.set("mapred.map.failures.maxpercent", "10");
conf.setBoolean("mapred.compress.map.output", true);
conf.set("mapred.map.output.compression.codec", "org.apache.hadoop.io.compress.SnappyCodec");
conf.setBoolean("mapreduce.map.output.compress", true);
/*String inputPrefixes = args[0];
String outputFile = args[1];*/
Job job = Job.getInstance(conf);
/*FileInputFormat.addInputPath(job, new Path(inputPrefixes));
FileOutputFormat.setOutputPath(job, new Path(outputFile));*/
FileInputFormat.setInputPaths(job, new Path(args.input));
FileOutputFormat.setOutputPath(job, new Path(args.output));
FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job,org.apache.hadoop.io.compress.GzipCodec.class);
job.setMapperClass(BuildIndexWebTablesMapper.class);
job.setReducerClass(BuildIndexWebTablesReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(PairOfWritables.class);
//job.setOutputFormatClass(MapFileOutputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
/*job.setOutputFormatClass(TextOutputFormat.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(NullWritable.class);
job.setOutputValueClass(Text.class);*/
job.setJarByClass(BuildIndexWebTables.class);
job.setNumReduceTasks(args.numReducers);
//job.setNumReduceTasks(500);
FileInputFormat.setInputPaths(job, new Path(args.input));
FileOutputFormat.setOutputPath(job, new Path(args.output));
System.out.println(Arrays.deepToString(FileInputFormat.getInputPaths(job)));
// Delete the output directory if it exists already.
Path outputDir = new Path(args.output);
FileSystem.get(getConf()).delete(outputDir, true);
long startTime = System.currentTimeMillis();
job.waitForCompletion(true);
System.out.println("Job Finished in " + (System.currentTimeMillis() - startTime) / 1000.0 + " seconds");
return 0;
}
private BuildIndexWebTables() {}
public static class Args
{
#Option(name = "-input", metaVar = "[path]", required = true, usage = "input path")
public String input;
#Option(name = "-output", metaVar = "[path]", required = true, usage = "output path")
public String output;
#Option(name = "-reducers", metaVar = "[num]", required = false, usage = "number of reducers")
public int numReducers = 1;
}
public static class BuildIndexWebTablesMapper extends Mapper<LongWritable, Text, Text, Text> {
//public static final Log log = LogFactory.getLog(BuildIndexWebTablesMapper.class);
private static final Text WORD = new Text();
private static final Text OPVAL = new Text();
@Override
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
// Log to stdout file
System.out.println("Map key : TEST");
//log to the syslog file
//log.info("Map key "+ key);
/*if(log.isDebugEanbled()){
log.debug("Map key "+ key);
}*/
Dataset ds;
String pgTitle; // Table page title
List<String> tokens = new ArrayList<String>(); // terms for frequency and other data
ds = Dataset.fromJson(value.toString()); // Get all text values from the json corpus
String[][] rel = ds.getRelation(); // Extract relation from the first json
int numCols = rel.length; // Number of columns in the relation
String[] attributes = new String[numCols]; // To store attributes for the relation
for (int j = 0; j < numCols; j++) { // Attributes of the relation
attributes[j] = rel[j][0];
}
int numRows = rel[0].length; //Number of rows of the relation
//dsTabNum = ds.getTableNum(); // Gets the table number from json
// Reads terms from relation and stores in tokens
for (int i = 0; i < numRows; i++ ){
for (int j = 0; j < numCols; j++ ){
String w = rel[i][j].toLowerCase().replaceAll("(^[^a-z]+|[^a-z]+$)", "");
if (w.length() == 0)
continue;
else {
w = w + "|" + pgTitle + "." + j + "|" + i; // Concatenate the term/PageTitle.Column number/row number in term
tokens.add(w);
}
}
}
// Emit postings.
for (String token : tokens){
String[] tokenPart = token.split("|", -2); // Split based on "|", -2(any negative) to split multiple times.
String newkey = tokenPart[0] + "|" + tokenPart[1];
WORD.set(newkey); // Emit term as key
//String valstr = Arrays.toString(Arrays.copyOfRange(tokenPart, 2, tokenPart.length)); // Emit rest of the string as value
String valstr = tokenPart[2];
OPVAL.set(valstr);
context.write(WORD,OPVAL);
}
}
}
public static class BuildIndexWebTablesReducer extends Reducer<Text, Text, Text, Text> {
private static final Text TERM = new Text();
private static final IntWritable TF = new IntWritable();
private String PrevTerm = null;
private int termFrequency = 0;
@Override
protected void reduce(Text key, Iterable<Text> textval, Context context) throws IOException, InterruptedException {
Iterator<Text> iter = textval.iterator();
IntWritable tnum = new IntWritable();
ArrayListWritable<IntWritable> postings = new ArrayListWritable<IntWritable>();
PairOfStringInt relColInfo = new PairOfStringInt();
PairOfWritables keyVal = new PairOfWritables<PairOfStringInt, ArrayListWritable<IntWritable>>();
if((!key.toString().equals(PrevTerm)) && (PrevTerm != null)) {
String[] parseKey = PrevTerm.split("|", -2);
TERM.set(parseKey[0]);
relColInfo.set(parseKey[1],termFrequency);
keyVal.set(relColInfo, postings);
context.write(TERM, keyVal);
termFrequency = 0;
postings.clear();
}
PrevTerm = key.toString();
while (iter.hasNext()) {
int tupleset = Integer.parseInt(iter.next().toString());
tnum.set(tupleset);
postings.add(tnum);
termFrequency++;
}
}
}
}
I am getting the error below during compilation.
[ERROR] Failed to execute goal
org.apache.maven.plugins:maven-compiler-plugin:2.3.2:compile
(default-compile) on project projeto-final: Compilation failure
[ERROR]
/home/cloudera/topicosBD-pis/topicosBD-pis/projeto-final/src/main/java/br/edu/ufam/anibrata/BuildIndexWebTables.java:[278,11]
error: method write in interface
TaskInputOutputContext cannot be
applied to given types;
The exact line where this occurs is "context.write(TERM, keyVal);". This code has some dependencies that are based on my local machine, though. I am stuck at this error since I am not getting any idea about it anywhere. Could someone help me understand the origin of the issue and how it can be tackled? I am pretty new to Hadoop / MapReduce.
I have tried toggling the output format class between job.setOutputFormatClass(MapFileOutputFormat.class); and
job.setOutputFormatClass(TextOutputFormat.class);, both of them throwing the same error. I am using "mvn clean package" to compile.
Any help is appreciated very much.
Thanks in advance.
As I can see, you are trying to write to the context a key (TERM) of type Text and a value (keyVal) of type PairOfWritables, but your reducer class extends Reducer with VALUEOUT (the last type parameter) of type Text. You should change VALUEOUT to the proper type.
In your case:
public static class BuildIndexWebTablesReducer extends Reducer<Text, Text, Text, PairOfWritables>
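A minimal sketch of how the corrected generic parameter lines up with the write call (the body logic from the question is elided; PairOfWritables and the other helper types come from the question's tl.lin.data imports):
// Sketch only: the reducer's VALUEOUT must match the second argument of context.write().
public static class BuildIndexWebTablesReducer extends Reducer<Text, Text, Text, PairOfWritables> {
    @Override
    protected void reduce(Text key, Iterable<Text> textval, Context context)
            throws IOException, InterruptedException {
        Text term = new Text();
        PairOfWritables<PairOfStringInt, ArrayListWritable<IntWritable>> keyVal =
                new PairOfWritables<PairOfStringInt, ArrayListWritable<IntWritable>>();
        // ... build term and keyVal as in the question ...
        context.write(term, keyVal); // compiles now: VALUEOUT is PairOfWritables
    }
}
The driver in the question already declares the matching output value class with job.setOutputValueClass(PairOfWritables.class), so no change is needed there.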

Global variable's value doesn't change after for loop

I am developing a Hadoop project. I want to find the customers of a certain day and then write those with the max consumption on that day. In my reducer class, for some reason, the global variable max doesn't change its value after the for loop.
EDIT: I want to find the customers with max consumption on a certain day. I have managed to find the customers on the date I want, but I am facing a problem in my Reducer class. Here is the code:
EDIT #2: I already know that the values (consumption) are natural numbers, so in my output file I want only the customers of a certain day with the max consumption.
EDIT #3: My input file consists of a lot of data. It has three columns: the customer's id, the timestamp (yyyy-MM-dd HH:mm:ss), and the consumption.
Driver class
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class alicanteDriver {
public static void main(String[] args) throws Exception {
long t_start = System.currentTimeMillis();
long t_end;
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "Alicante");
job.setJarByClass(alicanteDriver.class);
job.setMapperClass(alicanteMapperC.class);
//job.setCombinerClass(alicanteCombiner.class);
job.setPartitionerClass(alicantePartitioner.class);
job.setNumReduceTasks(2);
job.setReducerClass(alicanteReducerC.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path("/alicante_1y.txt"));
FileOutputFormat.setOutputPath(job, new Path("/alicante_output"));
job.waitForCompletion(true);
t_end = System.currentTimeMillis();
System.out.println((t_end-t_start)/1000);
}
}
Mapper class
import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class alicanteMapperC extends
Mapper<LongWritable, Text, Text, IntWritable> {
String Customer = new String();
SimpleDateFormat ft = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
Date t = new Date();
IntWritable Consumption = new IntWritable();
int counter = 0;
// new vars
int max = 0;
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
Date d2 = null;
try {
d2 = ft.parse("2013-07-01 01:00:00");
} catch (ParseException e1) {
// TODO Auto-generated catch block
e1.printStackTrace();
}
if (counter > 0) {
String line = value.toString();
StringTokenizer itr = new StringTokenizer(line, ",");
while (itr.hasMoreTokens()) {
Customer = itr.nextToken();
try {
t = ft.parse(itr.nextToken());
} catch (ParseException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
Consumption.set(Integer.parseInt(itr.nextToken()));
//sort out as many values as possible
if(Consumption.get() > max) {
max = Consumption.get();
}
//find customers in a certain date
if (t.compareTo(d2) == 0 && Consumption.get() == max) {
context.write(new Text(Customer), Consumption);
}
}
}
counter++;
}
}
Reducer class
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import com.google.common.collect.Iterables;
public class alicanteReducerC extends
Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int max = 0; //this var
// declaration of Lists
List<Text> l1 = new ArrayList<Text>();
List<IntWritable> l2 = new ArrayList<IntWritable>();
for (IntWritable val : values) {
if (val.get() > max) {
max = val.get();
}
l1.add(key);
l2.add(val);
}
for (int i = 0; i < l1.size(); i++) {
if (l2.get(i).get() == max) {
context.write(key, new IntWritable(max));
}
}
}
}
Some values of the input file:
C11FA586148,2013-07-01 01:00:00,3
C11FA586152,2015-09-01 15:22:22,3
C11FA586168,2015-02-01 15:22:22,1
C11FA586258,2013-07-01 01:00:00,5
C11FA586413,2013-07-01 01:00:00,5
C11UA487446,2013-09-01 15:22:22,3
C11UA487446,2013-07-01 01:00:00,3
C11FA586148,2013-07-01 01:00:00,4
The output should be:
C11FA586258 5
C11FA586413 5
I have searched the forums for a couple of hours, and still can't find the issue. Any ideas?
Here is the refactored code: you can pass/change the specific date of consumption. The date filtering is now done in the mapper, so the reducer only has to pick the max. My first answer queried the max consumption from the input; this one queries the consumption for a user-provided date. The setup method reads the user-provided value of mapper.maxConsumption.date and makes it available to the map method. The cleanup method in the reducer scans all max-consumption customers and writes only the final max from the input (i.e., 5 in this case).
run as:
hadoop jar maxConsumption.jar -Dmapper.maxConsumption.date="2013-07-01 01:00:00" Data/input.txt output/maxConsupmtion5
#input:
C11FA586148,2013-07-01 01:00:00,3
C11FA586152,2015-09-01 15:22:22,3
C11FA586168,2015-02-01 15:22:22,1
C11FA586258,2013-07-01 01:00:00,5
C11FA586413,2013-07-01 01:00:00,5
C11UA487446,2013-09-01 15:22:22,3
C11UA487446,2013-07-01 01:00:00,3
C11FA586148,2013-07-01 01:00:00,4
#output:
C11FA586258 5
C11FA586413 5
import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.LinkedHashMap;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class maxConsumption extends Configured implements Tool{
public static class DataMapper extends Mapper<Object, Text, Text, IntWritable> {
SimpleDateFormat ft = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
Date dateInFile, filterDate;
int lineno=0;
private final static Text customer = new Text();
private final static IntWritable consumption = new IntWritable();
private final static Text maxConsumptionDate = new Text();
public void setup(Context context) {
Configuration config = context.getConfiguration();
maxConsumptionDate.set(config.get("mapper.maxConsumption.date"));
}
public void map(Object key, Text value, Context context) throws IOException, InterruptedException{
try{
lineno++;
filterDate = ft.parse(maxConsumptionDate.toString());
//map data from line/file
String[] fields = value.toString().split(",");
customer.set(fields[0].trim());
dateInFile = ft.parse(fields[1].trim());
consumption.set(Integer.parseInt(fields[2].trim()));
if(dateInFile.equals(filterDate)) //only send to reducer if date filter matches....
context.write(new Text(customer), consumption);
}catch(Exception e){
System.err.println("Invaid Data at line: " + lineno + " Error: " + e.getMessage());
}
}
}
public static class DataReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
LinkedHashMap<String, Integer> maxConsumption = new LinkedHashMap<String,Integer>();
@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int max=0;
System.out.print("reducer received: " + key + " [ ");
for(IntWritable value: values){
System.out.print( value.get() + " ");
if(value.get() > max)
max=value.get();
}
System.out.println( " ]");
System.out.println(key.toString() + " max is " + max);
maxConsumption.put(key.toString(), max);
}
@Override
protected void cleanup(Context context)
throws IOException, InterruptedException {
int max=0;
//first find the max from reducer
for (String key : maxConsumption.keySet()){
System.out.println("cleaup customer : " + key.toString() + " consumption : " + maxConsumption.get(key)
+ " max: " + max);
if(maxConsumption.get(key) > max)
max=maxConsumption.get(key);
}
System.out.println("final max is: " + max);
//write only the max value from map
for (String key : maxConsumption.keySet()){
if(maxConsumption.get(key) == max)
context.write(new Text(key), new IntWritable(maxConsumption.get(key)));
}
}
}
public static void main(String[] args) throws Exception {
int res = ToolRunner.run(new Configuration(), new maxConsumption(), args);
System.exit(res);
}
public int run(String[] args) throws Exception {
if (args.length != 2) {
System.err.println("Usage: -Dmapper.maxConsumption.date=\"2013-07-01 01:00:00\" <in> <out>");
System.exit(2);
}
Configuration conf = this.getConf();
Job job = Job.getInstance(conf, "get-max-consumption");
job.setJarByClass(maxConsumption.class);
job.setMapperClass(DataMapper.class);
job.setReducerClass(DataReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
FileSystem fs = null;
Path dstFilePath = new Path(args[1]);
try {
fs = dstFilePath.getFileSystem(conf);
if (fs.exists(dstFilePath))
fs.delete(dstFilePath, true);
} catch (IOException e1) {
e1.printStackTrace();
}
return job.waitForCompletion(true) ? 0 : 1;
}
}
Probably all the values that reach your reducer are below 0. Try the minimum value to check whether your variable changes:
int max = Integer.MIN_VALUE;
Based on what you say, the output should be only 0s (in this case the max value in the reducer is 0) or no output at all (all values less than 0). Also, look at this:
context.write(key, new IntWritable());
it should be
context.write(key, new IntWritable(max));
EDIT: I just saw your Mapper class; it has a lot of problems. The following code skips the first line in every mapper. Why?
if (counter > 0) {
I guess you are getting something like this, right? "customer, 2013-07-01 01:00:00, 2, ..." If that is the case and you are already filtering values, you should declare your max variable as local rather than at mapper scope, because it would carry over across multiple customers.
There are a lot of open questions around this; you could explain your input for each mapper and what you want to do.
EDIT 2: Based on your answer, I would try this:
import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class AlicanteMapperC extends Mapper<LongWritable, Text, Text, IntWritable> {
private final int max = 5;
private SimpleDateFormat ft = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
Date t = null;
Date filterDate = null;
String[] line = value.toString().split(",");
String customer = line[0];
try {
t = ft.parse(line[1]);
filterDate = ft.parse("2013-07-01 01:00:00");
} catch (ParseException e) {
throw new RuntimeException("something wrong with the date! " + line[1]);
}
Integer consumption = Integer.parseInt(line[2]);
//find customers on a certain date
if (t.compareTo(filterDate) == 0 && consumption == max) {
context.write(new Text(customer), new IntWritable(consumption));
}
}
}
and a reducer simple enough to emit one record per customer:
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import com.google.common.collect.Iterables;
public class AlicanteReducerC extends
Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
//We already know that it is 5
context.write(key, new IntWritable(5));
//If you want something different, for example to report customers with different values, you could iterate over the values like this:
//for (IntWritable val : values) {
// context.write(key, new IntWritable(val.get()));
//}
}
}

Word merge in Hadoop

Currently I would like to merge or concatenate two strings using Hadoop, where the mapper function groups the words and the reducer concatenates the values based on a common key.
Below is my code for the map-reduce job.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class mr2 {
// mapper class
public static class TokenizerMapper extends Mapper<Text, Text, Text, Text>{
private Text word = new Text(); // key
private Text value_of_key = new Text(); // value
public void map(Text key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
String IndexAndCategory = "";
String value_of_the_key = "";
StringTokenizer itr = new StringTokenizer(line);
// key creation
IndexAndCategory += itr.nextToken() + " ";
IndexAndCategory += itr.nextToken() + " ";
// value creation
value_of_the_key += itr.nextToken() + ":";
value_of_the_key += itr.nextToken() + " ";
// key and value
word.set(IndexAndCategory);
value_of_key.set(value_of_the_key);
// write key-value pair
context.write(word, (Text)value_of_key);
}
}
// reducer class
public static class IntSumReducer extends Reducer<Text,Text,Text,Text> {
//private IntWritable result = new IntWritable();
private Text values_of_key = new Text();
@Override
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
String values_ = "";
for (Text val : values) {
values_ += val.toString();
}
values_of_key.set(values_);
context.write(key, values_of_key);
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "mr2");
job.setJarByClass(mr2.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setNumReduceTasks(1);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
The input to the mapper is in the format below:
1 A this 2
1 A the 1
3 B is 1
The mapper processes this into the format below and passes it to the reducer:
1 A this:2
1 A the:1
3 B is:1
The reducer then reduces the given input into the format below:
1 A this:2 the:1
3 B is:1
I used word count as a basic template and modified it to process Text (String), but when I execute the code above I get the error below.
Error: java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.Text
at mr2$TokenizerMapper.map(mr2.java:17)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:784)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
It is expecting LongWritable. Any help to solve this issue is appreciated.
If you're reading a text file, the mapper must be defined as
public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, Text>{
So the map method should look like this
@Override
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
The problem was that in the main function I was not specifying what the output of the mapper is, so the reducer was expecting the default one as input. For more details refer to this post.
I changed the input type to Object from Text:
public static class TokenizerMapper extends Mapper<Object, Text, Text, Text>{
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
Adding the following lines solved the issue.
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
The following is the complete working code.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.io.LongWritable;
public class mr2 {
// mapper class
public static class TokenizerMapper extends Mapper<Object, Text, Text, Text>{
private Text word = new Text(); // key
private Text value_of_key = new Text(); // value
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
String IndexAndCategory = "";
String value_of_the_key = "";
StringTokenizer itr = new StringTokenizer(line);
// key creation
IndexAndCategory += itr.nextToken() + " ";
IndexAndCategory += itr.nextToken() + " ";
// value creation
value_of_the_key += itr.nextToken() + ":";
value_of_the_key += itr.nextToken() + " ";
// key and value
word.set(IndexAndCategory);
value_of_key.set(value_of_the_key);
// write key-value pair
context.write(word, value_of_key);
}
}
// reducer class
public static class IntSumReducer extends Reducer<Text,Text,Text,Text> {
//private IntWritable result = new IntWritable();
private Text values_of_key = new Text();
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
String values_ = "";
for (Text val : values) {
values_ += val.toString();
}
values_of_key.set(values_);
context.write(key, values_of_key);
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "mr2");
job.setJarByClass(mr2.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setNumReduceTasks(1);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
