Join with Hadoop in Java [closed]

I've been working with Hadoop for a short time and am trying to implement a join in Java. It doesn't matter whether it's map-side or reduce-side; I took the reduce-side join since it was supposed to be easier to implement. I know that Java is not the best choice for joins, aggregations etc. and that I should rather pick Hive or Pig, which I have done already. However, I'm working on a research project and have to use all three languages in order to deliver a comparison.
Anyway, I have two input files with different structures. One is key|value and the other one is key|value1;value2;value3;value4. One record from each input file looks like the following:
Input1: 1;2010-01-10T00:00:01
Input2: 1;23;Blue;2010-01-11T00:00:01;9999-12-31T23:59:59
I followed the example in the Hadoop: The Definitive Guide book, but it didn't work for me. I'm posting my Java files here so you can see what's wrong.
public class LookupReducer extends Reducer<TextPair,Text,Text,Text> {
private String result = "";
private String msisdn;
private String attribute, product;
private long trans_dt_long, start_dt_long, end_dt_long;
private String trans_dt, start_dt, end_dt;
@Override
public void reduce(TextPair key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
context.progress();
//value without key to remember
Iterator<Text> iter = values.iterator();
for (Text val : values) {
Text recordNoKey = val; //new Text(iter.next());
String valSplitted[] = recordNoKey.toString().split(";");
//if the input is coming from CDR set corresponding values
if(key.getSecond().toString().equals(CDR.CDR_TAG))
{
trans_dt = recordNoKey.toString();
trans_dt_long = dateToLong(recordNoKey.toString());
}
//if the input is coming from Attributes set corresponding values
else if(key.getSecond().toString().equals(Attribute.ATT_TAG))
{
attribute = valSplitted[0];
product = valSplitted[1];
start_dt = valSplitted[2];
start_dt_long = dateToLong(valSplitted[2]);
end_dt = valSplitted[3];
end_dt_long = dateToLong(valSplitted[3]);
}
Text record = val; //iter.next();
//System.out.println("RECORD: " + record);
Text outValue = new Text(recordNoKey.toString() + ";" + record.toString());
if(start_dt_long < trans_dt_long && trans_dt_long < end_dt_long)
{
//concat output columns
result = attribute + ";" + product + ";" + trans_dt;
context.write(key.getFirst(), new Text(result));
System.out.println("KEY: " + key);
}
}
}
private static long dateToLong(String date){
DateFormat formatter = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
Date parsedDate = null;
try {
parsedDate = formatter.parse(date);
} catch (ParseException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
long dateInLong = parsedDate.getTime();
return dateInLong;
}
public static class TextPair implements WritableComparable<TextPair> {
private Text first;
private Text second;
public TextPair(){
set(new Text(), new Text());
}
public TextPair(String first, String second){
set(new Text(first), new Text(second));
}
public TextPair(Text first, Text second){
set(first, second);
}
public void set(Text first, Text second){
this.first = first;
this.second = second;
}
public Text getFirst() {
return first;
}
public void setFirst(Text first) {
this.first = first;
}
public Text getSecond() {
return second;
}
public void setSecond(Text second) {
this.second = second;
}
@Override
public void readFields(DataInput in) throws IOException {
// TODO Auto-generated method stub
first.readFields(in);
second.readFields(in);
}
@Override
public void write(DataOutput out) throws IOException {
// TODO Auto-generated method stub
first.write(out);
second.write(out);
}
@Override
public int hashCode(){
return first.hashCode() * 163 + second.hashCode();
}
@Override
public boolean equals(Object o){
if(o instanceof TextPair)
{
TextPair tp = (TextPair) o;
return first.equals(tp.first) && second.equals(tp.second);
}
return false;
}
@Override
public String toString(){
return first + ";" + second;
}
@Override
public int compareTo(TextPair tp) {
// TODO Auto-generated method stub
int cmp = first.compareTo(tp.first);
if(cmp != 0)
return cmp;
return second.compareTo(tp.second);
}
public static class FirstComparator extends WritableComparator {
protected FirstComparator(){
super(TextPair.class, true);
}
@Override
public int compare(WritableComparable comp1, WritableComparable comp2){
TextPair pair1 = (TextPair) comp1;
TextPair pair2 = (TextPair) comp2;
int cmp = pair1.getFirst().compareTo(pair2.getFirst());
if(cmp != 0)
return cmp;
return -pair1.getSecond().compareTo(pair2.getSecond());
}
}
public static class GroupComparator extends WritableComparator {
protected GroupComparator()
{
super(TextPair.class, true);
}
@Override
public int compare(WritableComparable comp1, WritableComparable comp2)
{
TextPair pair1 = (TextPair) comp1;
TextPair pair2 = (TextPair) comp2;
return pair1.compareTo(pair2);
}
}
}
}
public class Joiner extends Configured implements Tool {
public static final String DATA_SEPERATOR =";"; //Define the symbol for separating the output data
public static final String NUMBER_OF_REDUCER = "1"; //Define the number of the used reducer jobs
public static final String COMPRESS_MAP_OUTPUT = "true"; //if the output from the mapping process should be compressed, set COMPRESS_MAP_OUTPUT = "true" (if not set it to "false")
public static final String
USED_COMPRESSION_CODEC = "org.apache.hadoop.io.compress.SnappyCodec"; //set the used codec for the data compression
public static final boolean JOB_RUNNING_LOCAL = true; //if you run the Hadoop job on your local machine, you have to set JOB_RUNNING_LOCAL = true
//if you run the Hadoop job on the Telefonica Cloud, you have to set JOB_RUNNING_LOCAL = false
public static final String OUTPUT_PATH = "/home/hduser"; //set the folder, where the output is saved. Only needed, if JOB_RUNNING_LOCAL = false
public static class KeyPartitioner extends Partitioner<TextPair, Text> {
@Override
public int getPartition(TextPair key, Text value, int numPartitions) {
System.out.println("numPartitions: " + numPartitions);
return (key.getFirst().hashCode() & Integer.MAX_VALUE) % numPartitions;
}
}
private static Configuration hadoopconfig() {
Configuration conf = new Configuration();
conf.set("mapred.textoutputformat.separator", DATA_SEPERATOR);
conf.set("mapred.compress.map.output", COMPRESS_MAP_OUTPUT);
//conf.set("mapred.map.output.compression.codec", USED_COMPRESSION_CODEC);
conf.set("mapred.reduce.tasks", NUMBER_OF_REDUCER);
return conf;
}
@Override
public int run(String[] args) throws Exception {
// TODO Auto-generated method stub
if ((args.length != 3) && (JOB_RUNNING_LOCAL)) {
System.err.println("Usage: Lookup <CDR-inputPath> <Attribute-inputPath> <outputPath>");
System.exit(2);
}
//starting the Hadoop job
Configuration conf = hadoopconfig();
Job job = new Job(conf, "Join cdrs and attributes");
job.setJarByClass(Joiner.class);
MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, CDRMapper.class);
MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, AttributeMapper.class);
//FileInputFormat.addInputPath(job, new Path(otherArgs[0])); //expecting a folder instead of a file
if(JOB_RUNNING_LOCAL)
FileOutputFormat.setOutputPath(job, new Path(args[2]));
else
FileOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH));
job.setPartitionerClass(KeyPartitioner.class);
job.setGroupingComparatorClass(TextPair.FirstComparator.class);
job.setReducerClass(LookupReducer.class);
job.setMapOutputKeyClass(TextPair.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
return job.waitForCompletion(true) ? 0 : 1;
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new Joiner(), args);
System.exit(exitCode);
}
}
public class Attribute {
public static final String ATT_TAG = "1";
public static class AttributeMapper
extends Mapper<LongWritable, Text, TextPair, Text>{
private static Text values = new Text();
//private Object output = new Text();
@Override
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
//partition the input line by the separator semicolon
String[] attributes = value.toString().split(";");
String valuesInString = "";
if(attributes.length != 5)
System.err.println("Input column number not correct. Expected 5, provided " + attributes.length
+ "\n" + "Check the input file");
if(attributes.length == 5)
{
//setting the values with the input values read above
valuesInString = attributes[1] + ";" + attributes[2] + ";" + attributes[3] + ";" + attributes[4];
values.set(valuesInString);
//writing out the key and value pair
context.write( new TextPair(new Text(String.valueOf(attributes[0])), new Text(ATT_TAG)), values);
}
}
}
}
public class CDR {
public static final String CDR_TAG = "0";
public static class CDRMapper
extends Mapper<LongWritable, Text, TextPair, Text>{
private static Text values = new Text();
private Object output = new Text();
@Override
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
//partition the input line by the separator semicolon
String[] cdr = value.toString().split(";");
//setting the values with the input values read above
values.set(cdr[1]);
//output = CDR_TAG + cdr[1];
//writing out the key and value pair
context.write( new TextPair(new Text(String.valueOf(cdr[0])), new Text(CDR_TAG)), values);
}
}
}
I would be glad if anyone could at least post a link to a tutorial or a simple example where such a join is implemented. I searched a lot, but either the code was not complete or there was not enough explanation.

To be honest, I have no idea what your code is trying to do, but that's probably because I'd do it in a different way and I'm not familiar with the APIs you're using.
I would start from scratch as follows:
Create a mapper to read one of the files. For each line read, write a key-value pair to the context. The key is a Text created from the record's key field, and the value is another Text created by concatenating a "1" with the entire input record.
Create another mapper for the other file. This mapper acts just like the first mapper, but the value is a Text created by concatenating a "2" with the entire input record.
Write a reducer to do the join. The reduce() method will get all records written for a specific key. You can tell which input file a record came from (and therefore its data format) by checking whether the value starts with a "1" or a "2". Once you know whether you have one record type, the other, or both, you can write whatever logic you need to merge the data from the one or two records.
By the way, you use the MultipleInputs class to configure more than one mapper in your job/driver class.
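A minimal sketch of that approach might look like the following. The class names and the join logic here are illustrative, not taken from the code above; it also assumes at most one record per key per file, so a 1-to-N join would need to collect lists instead:
public static class FileOneMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx) throws IOException, InterruptedException {
        // split into the key and the rest of the record, then tag the value with "1"
        String[] parts = line.toString().split(";", 2);
        ctx.write(new Text(parts[0]), new Text("1" + parts[1]));
    }
}
public static class FileTwoMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx) throws IOException, InterruptedException {
        // same as above, but tag the value with "2"
        String[] parts = line.toString().split(";", 2);
        ctx.write(new Text(parts[0]), new Text("2" + parts[1]));
    }
}
public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx) throws IOException, InterruptedException {
        String left = null, right = null;
        for (Text val : values) {
            String tagged = val.toString();
            // the first character is the tag, the rest is the original record
            if (tagged.startsWith("1")) left = tagged.substring(1);
            else right = tagged.substring(1);
        }
        if (left != null && right != null) {
            ctx.write(key, new Text(left + ";" + right)); // inner join
        }
    }
}
The two mappers are wired up with MultipleInputs.addInputPath(...), exactly as in the driver above.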

Related

Map Reduce - How to group and aggregate multiple attributes in a single job

I am currently struggling a bit with MapReduce.
I have the following dataset:
1,John,Computer
2,Anne,Computer
3,John,Mobile
4,Julia,Mobile
5,Jack,Mobile
6,Jack,TV
7,John,Computer
8,Jack,TV
9,Jack,TV
10,Anne,Mobile
11,Anne,Computer
12,Julia,Mobile
Now I want to apply MapReduce with grouping and aggregation on this dataset, so that the output not only shows how many times each person bought something, but also which product that person ordered most often.
So the output should look like:
John 3 Computer
Anne 3 Mobile
Jack 4 TV
Julia 2 Mobile
My current implementation of the mapper and reducer looks like this; it correctly returns how many orders each individual made, but I am really clueless how to get the desired output.
static class CountMatchesMapper extends Mapper<Object,Text,Text,IntWritable> {
@Override
protected void map(Object key, Text value, Context ctx) throws IOException, InterruptedException {
String row = value.toString();
String[] row_part = row.split(",");
try {
ctx.write(new Text(row_part[1]), new IntWritable(1));
}
catch (IOException e) {
}
catch (InterruptedException e) {
}
}
}
static class CountMatchesReducer extends Reducer<Text,IntWritable,Text,IntWritable> {
@Override
protected void reduce(Text key, Iterable<IntWritable> values, Context ctx) throws IOException, InterruptedException {
int i = 0;
for (IntWritable value : values) i += value.get();
try{
ctx.write(key, new IntWritable(i));
}
catch (IOException e) {
}
catch (InterruptedException e) {
}
}
}
I would really appreciate any efficient solution and help.
Thanks in advance!
If I understand correctly what you want, I think the 2nd output line should be:
Anne 3 Computer
based on the input. Anne has bought 3 products in total: 2 Computers and 1 Mobile.
I have here a very basic and simplistic approach, which doesn't take edge cases etc. into account, but it could give you some direction:
static class CountMatchesMapper extends Mapper<LongWritable, Text, Text, Text> {
private Text outputKey = new Text();
private Text outputValue = new Text();
@Override
protected void map(LongWritable key, Text value, Context ctx) throws IOException, InterruptedException {
String row = value.toString();
String[] row_part = row.split(",");
outputKey.set(row_part[1]);
outputValue.set(row_part[2]);
ctx.write(outputKey, outputValue);
}
}
static class CountMatchesReducer extends Reducer<Text, Text, Text, NullWritable> {
private Text output = new Text();
@Override
protected void reduce(Text key, Iterable<Text> values, Context ctx) throws IOException, InterruptedException {
HashMap<String, Integer> productCounts = new HashMap<>();
int totalProductsBought = 0;
for (Text value : values) {
String productBought = value.toString();
int count = 0;
if (productCounts.containsKey(productBought)) {
count = productCounts.get(productBought);
}
productCounts.put(productBought, count + 1);
totalProductsBought += 1;
}
String topProduct = getTopProductForPerson(productCounts);
output.set(key.toString() + " " + totalProductsBought + " " + topProduct);
ctx.write(output, NullWritable.get());
}
private String getTopProductForPerson(Map<String, Integer> productCounts) {
String topProduct = "";
int maxCount = 0;
for (Map.Entry<String, Integer> productCount : productCounts.entrySet()) {
if (productCount.getValue() > maxCount) {
maxCount = productCount.getValue();
topProduct = productCount.getKey();
}
}
return topProduct;
}
}
The above will give the output that you described.
If you want a proper solution that scales, then you probably need a composite key and a custom GroupComparator; that way you will be able to add a Combiner as well and make it much more efficient. However, the approach above should work for an average case. A rough sketch of the composite-key direction follows.
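This is only an illustration of the idea, not a drop-in solution; it mirrors the TextPair/FirstComparator pattern from the first question above. The mapper would emit a (person, product) composite key, the partitioner and grouping comparator would group by person only, and the secondary sort on product would let the reducer count consecutive identical products without a HashMap:
// Hypothetical composite key: primary sort on person, secondary sort on product.
public static class PersonProduct implements WritableComparable<PersonProduct> {
    private Text person = new Text();
    private Text product = new Text();

    public void set(String p, String prod) { person.set(p); product.set(prod); }
    public Text getPerson() { return person; }

    @Override public void write(DataOutput out) throws IOException {
        person.write(out);
        product.write(out);
    }
    @Override public void readFields(DataInput in) throws IOException {
        person.readFields(in);
        product.readFields(in);
    }
    @Override public int compareTo(PersonProduct o) {
        int cmp = person.compareTo(o.person);
        return cmp != 0 ? cmp : product.compareTo(o.product);
    }
}
// Grouping comparator: one reduce() call per person, products arriving sorted.
public static class PersonGroupComparator extends WritableComparator {
    protected PersonGroupComparator() { super(PersonProduct.class, true); }
    @Override public int compare(WritableComparable a, WritableComparable b) {
        return ((PersonProduct) a).getPerson().compareTo(((PersonProduct) b).getPerson());
    }
}
A partitioner that hashes only the person field is needed as well, much like the KeyPartitioner in the first question.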

Reduce does not start after map completes

Below is the code for my implementation of a simple MapReduce job using a custom WritableComparable.
public class MapReduceKMeans {
public static class MapReduceKMeansMapper extends
Mapper<Object, Text, SongDataPoint, Text> {
public void map(Object key, Text value, Context context)
throws InterruptedException, IOException {
String str = value.toString();
// Reading Line one by one from the input CSV.
String split[] = str.split(",");
String trackId = split[0];
String title = split[1];
String artistName = split[2];
SongDataPoint songDataPoint =
new SongDataPoint(new Text(trackId), new Text(title),
new Text(artistName));
context.write(songDataPoint, new Text());
}
}
public static class MapReduceKMeansReducer extends
Reducer<SongDataPoint, Text, Text, NullWritable> {
public void reduce(SongDataPoint key, Iterable<Text> values,
Context context) throws IOException, InterruptedException {
StringBuilder sb = new StringBuilder();
sb.append(key.getTrackId()).append("\t").
append(key.getTitle()).append("\t")
.append(key.getArtistName()).append("\t");
String write = sb.toString();
context.write(new Text(write), NullWritable.get());
}
}
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf, args)
.getRemainingArgs();
if (otherArgs.length != 2) {
System.err
.println("Usage:<CsV Out Path> <Final Out Path>");
System.exit(2);
}
Job job = new Job(conf, "Song Data Trial");
job.setJarByClass(MapReduceKMeans.class);
job.setMapperClass(MapReduceKMeansMapper.class);
job.setReducerClass(MapReduceKMeansReducer.class);
job.setOutputKeyClass(SongDataPoint.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
When I debug, my code reads all the rows in the CSV file, but it does not enter the reduce job at all. I also made use of SongDataPoint as my custom Writable. Its code is below.
public class SongDataPoint implements WritableComparable<SongDataPoint> {
Text trackId;
Text title;
Text artistName;
public SongDataPoint() {
this.trackId = new Text();
this.title = new Text();
this.artistName = new Text();
}
public SongDataPoint(Text trackId, Text title, Text artistName) {
this.trackId = trackId;
this.title = title;
this.artistName = artistName;
}
@Override
public void readFields(DataInput in) throws IOException {
this.trackId.readFields(in);
this.title.readFields(in);
this.artistName.readFields(in);
}
@Override
public void write(DataOutput out) throws IOException {
}
public Text getTrackId() {
return trackId;
}
public void setTrackId(Text trackId) {
this.trackId = trackId;
}
public Text getTitle() {
return title;
}
public void setTitle(Text title) {
this.title = title;
}
public Text getArtistName() {
return artistName;
}
public void setArtistName(Text artistName) {
this.artistName = artistName;
}
@Override
public int compareTo(SongDataPoint o) {
// TODO Auto-generated method stub
int compare = getTrackId().compareTo(o.getTrackId());
return compare;
}
}
Any help is appreciated. Thanks.
Your output key class as per the driver is SongDataPoint.class and your output value class is Text.class, but you are actually writing Text as the key and NullWritable as the value in the reducer. You should also specify the mapper output classes as follows:
job.setMapOutputKeyClass(SongDataPoint.class);
job.setMapOutputValueClass(Text.class);
My write method in my custom Writable class was left blank by mistake. Writing the proper code in it solved the problem:
public void write(DataOutput out) throws IOException {
this.trackId.write(out);
this.title.write(out);
this.artistName.write(out);
}

Generating a number within a range using JSON

How can we generate a number within a range using JSON? For example, we have to generate a number between 0 and 50; how can we perform this in Java using JSON?
This is my Json Data
{
"rand": {
"type': "number",
"minimum": 0,
"exclusiveMinimum": false,
"maximum": 50,
"exclusiveMaximum": true
}
}
This is what I have tried in Java
public class JavaApplication1 {
public static void main(String[] args) {
try {
for (int i=0;i<5;i++)
{
FileInputStream fileInputStream = new FileInputStream("C://users/user/Desktop/V.xls");
HSSFWorkbook workbook = new HSSFWorkbook(fileInputStream);
HSSFSheet worksheet = workbook.getSheet("POI Worksheet");
HSSFRow row1 = worksheet.getRow(0);
HSSFCell cellE1 = row1.getCell((short) 4); // column E; this declaration was missing from the pasted code
String e1Val = cellE1.getStringCellValue();
HSSFCell cellF1 = row1.getCell((short) 5);
System.out.println("E1: " + e1Val);
JSONObject obj = new JSONObject();
obj.put("value", e1Val);
System.out.print(obj + "\n");
ObjectMapper mapper = new ObjectMapper(); // also missing from the pasted code
Map<String,Object> c_data = mapper.readValue(e1Val, Map.class);
System.out.println(c_data);
}
} catch (FileNotFoundException e) {
} catch (IOException e) {
}
}
}
The JSON data is stored in an Excel sheet; I am reading it from there in the Java program.
Get a JSON reader like GSON.
Read the JSON into an equivalent object like:
public class rand{
private String type;
private int minimum;
private boolean exclusiveMinimum;
private int maximum;
private boolean exclusiveMaximum;
//this standard-constructor is needed for the JsonReader
public rand(){
}
//Getter for all Values
}
After reading in your JSON you can access your data via the getter methods.
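For instance, a minimal sketch of that with Gson (the wrapper class is an assumption to match the root object of the JSON; the getters are the ones sketched above):
import com.google.gson.Gson;

// Wrapper matching the root JSON object; the field name must equal the "rand" key.
public class RandWrapper {
    rand rand;
    public rand getRand() { return rand; }
}

// Usage:
Gson gson = new Gson();
RandWrapper wrapper = gson.fromJson(jsonString, RandWrapper.class);
int min = wrapper.getRand().getMinimum();
int max = wrapper.getRand().getMaximum();
// minimum is inclusive and maximum exclusive, per the schema above
int value = new java.util.Random().nextInt(max - min) + min;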
I think that Jackson may be of help here.
I suggest that you create a data model in Java that reflects the JSON. This can be along the lines of:
// This is the root object. It contains the input data (RandomizerInput) and a
// generate-function that is used for generating new random ints.
public class RandomData {
private RandomizerInput input;
@JsonCreator
public RandomData(@JsonProperty("rand") final RandomizerInput input) {
this.input = input;
}
#JsonProperty("rand")
public RandomizerInput getInput() {
return input;
}
#JsonProperty("generated")
public int generateRandomNumber() {
int max = input.isExclusiveMaximum()
? input.getMaximum() - 1 : input.getMaximum();
int min = input.isExclusiveMinimum()
? input.getMinimum() + 1 : input.getMinimum();
return new Random().nextInt((max - min) + 1) + min;
}
}
// This is the input data (pretty much what is described in the question).
public class RandomizerInput {
private final boolean exclusiveMaximum;
private final boolean exclusiveMinimum;
private final int maximum;
private final int minimum;
private final String type;
@JsonCreator
public RandomizerInput(
@JsonProperty("type") final String type,
@JsonProperty("minimum") final int minimum,
@JsonProperty("exclusiveMinimum") final boolean exclusiveMinimum,
@JsonProperty("maximum") final int maximum,
@JsonProperty("exclusiveMaximum") final boolean exclusiveMaximum) {
this.type = type; // Not really used...
this.minimum = minimum;
this.exclusiveMinimum = exclusiveMinimum;
this.maximum = maximum;
this.exclusiveMaximum = exclusiveMaximum;
}
public int getMaximum() {
return maximum;
}
public int getMinimum() {
return minimum;
}
public String getType() {
return type;
}
public boolean isExclusiveMaximum() {
return exclusiveMaximum;
}
public boolean isExclusiveMinimum() {
return exclusiveMinimum;
}
}
To use these classes the ObjectMapper from Jackson can be used like this:
public static void main(String... args) throws IOException {
String json =
"{ " +
"\"rand\": { " +
"\"type\": \"number\", " +
"\"minimum\": 0, " +
"\"exclusiveMinimum\": false, " +
"\"maximum\": 50, " +
"\"exclusiveMaximum\": true " +
"} " +
"}";
// Create the mapper
ObjectMapper mapper = new ObjectMapper();
// Convert JSON to POJO
final RandomData randomData = mapper.readValue(json, RandomData.class);
// Either you can get the random this way...
final int random = randomData.generateRandomNumber();
// Or, you can serialize the whole thing as JSON....
String str = mapper.writeValueAsString(randomData);
// Output is:
// {"rand":{"type":"number","minimum":0,"exclusiveMinimum":false,"maximum":50,"exclusiveMaximum":true},"generated":21}
System.out.println(str);
}
The actual generation of a random number is based on this SO question.

Map Reduce job generating empty output file

My program is generating an empty output file. Can anyone please suggest where I am going wrong?
Any help will be highly appreciated. I tried job.setNumReduceTasks(0) since I am not using a reducer, but the output file is still empty.
public static class PrizeDisMapper extends Mapper<LongWritable, Text, Text, Pair>{
int rating = 0;
Text CustID;
IntWritable r;
Text MovieID;
public void map(LongWritable key, Text line, Context context
) throws IOException, InterruptedException {
String line1 = line.toString();
String [] fields = line1.split(":");
if(fields.length > 1)
{
String Movieid = fields[0];
String line2 = fields[1];
String [] splitline = line2.split(",");
String Custid = splitline[0];
int rate = Integer.parseInt(splitline[1]);
r = new IntWritable(rate);
CustID = new Text(Custid);
MovieID = new Text(Movieid);
Pair P = new Pair();
context.write(MovieID,P);
}
else
{
return;
}
}
}
public static class IntSumReducer extends Reducer<Text,Pair,Text,Pair> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<Pair> values,
Context context
) throws IOException, InterruptedException {
for (Pair val : values) {
context.write(key, val);
}
}
public class Pair implements Writable
{
String key;
int value;
public void write(DataOutput out) throws IOException {
out.writeInt(value);
out.writeChars(key);
}
public void readFields(DataInput in) throws IOException {
key = in.readUTF();
value = in.readInt();
}
public void setVal(String aKey, int aValue)
{
key = aKey;
value = aValue;
}
}
}
Main class:
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
if (otherArgs.length != 2) {
System.err.println("Usage: wordcount <in> <out>");
System.exit(2);
}
Job job = new Job(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setInputFormatClass (TextInputFormat.class);
FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Pair.class);
System.exit(job.waitForCompletion(true) ? 0 : 1);
Thanks @Pathmanaban Palsamy and @Chris Gerken for your suggestions. I have modified the code as per your suggestions, but I am still getting an empty output file. Can anyone please suggest the configurations in my main class for input and output? Do I need to specify the Pair class as input to the mapper, and how?
I'm guessing the reduce method should be declared as
public void reduce(Text key, Iterable<Pair> values,
Context context
) throws IOException, InterruptedException
You get passed an Iterable (an object from which you can get an Iterator) which you use to iterate over all of the values that were mapped to the given key.
Since no reducer is required, I suspect the lines below:
Pair P = new Pair();
context.write(MovieID,P);
An empty Pair would be the issue.
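In other words, populate the Pair before writing it; for example, using the setVal method and the fields already parsed in the question's mapper (whether these are the intended fields is a guess):
Pair P = new Pair();
P.setVal(Custid, rate); // fill the writable instead of emitting it empty
context.write(MovieID, P);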
Also, please check that your driver class sets the correct key class and value class, like:
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Pair.class);

Performing multiple computations with Hadoop Map Reduce

I have a MapReduce program for finding the min/max for 2 separate properties for each year. This works, for the most part, using a single-node cluster in Hadoop. Here is my current setup:
public class MaxTemperatureReducer extends
Reducer<Text, Stats, Text, Stats> {
private Stats result = new Stats();
@Override
public void reduce(Text key, Iterable<Stats> values, Context context)
throws IOException, InterruptedException {
int maxValue = Integer.MIN_VALUE;
int minValue = Integer.MAX_VALUE;
int sum = 0;
for (Stats value : values) {
result.setMaxTemp(Math.max(maxValue, value.getMaxTemp()));
result.setMinTemp(Math.min(minValue, value.getMinTemp()));
result.setMaxWind(Math.max(maxValue, value.getMaxWind()));
result.setMinWind(Math.min(minValue, value.getMinWind()));
sum += value.getCount();
}
result.setCount(sum);
context.write(key, result);
}
}
public class MaxTemperatureMapper extends
Mapper<Object, Text, Text, Stats> {
private static final int MISSING = 9999;
private Stats outStat = new Stats();
@Override
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
String[] split = value.toString().split("\\s+");
String year = split[2].substring(0, 4);
int airTemperature;
airTemperature = (int) Float.parseFloat(split[3]);
outStat.setMinTemp((float)airTemperature);
outStat.setMaxTemp((float)airTemperature);
outStat.setMinWind(Float.parseFloat(split[12]));
outStat.setMaxWind(Float.parseFloat(split[14]));
outStat.setCount(1);
context.write(new Text(year), outStat);
}
}
public class MaxTemperatureDriver extends Configured implements Tool {
public int run(String[] args) throws Exception {
if (args.length != 2) {
System.err
.println("Usage: MaxTemperatureDriver <input path> <outputpath>");
System.exit(-1);
}
Job job = new Job();
job.setJarByClass(MaxTemperatureDriver.class);
job.setJobName("Max Temperature");
job.setMapperClass(MaxTemperatureMapper.class);
job.setCombinerClass(MaxTemperatureReducer.class);
job.setReducerClass(MaxTemperatureReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Stats.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
boolean success = job.waitForCompletion(true);
return success ? 0 : 1;
}
public static void main(String[] args) throws Exception {
MaxTemperatureDriver driver = new MaxTemperatureDriver();
int exitCode = ToolRunner.run(driver, args);
System.exit(exitCode);
}
}
Currently it only prints the min/max for the temperature and wind speed for each year. I am sure it is a simple change, but I cannot find an answer anywhere. I want to try and find the top 5 min/max values for each year. Any suggestions?
Let me assume the following signature for your Stats class.
/* the Stats class needs to be a Writable; the below is just a demo */
public class Stats {
public float getTemp() {
return temp;
}
public void setTemp(float temp) {
this.temp = temp;
}
public float getWind() {
return wind;
}
public void setWind(float wind) {
this.wind = wind;
}
private float temp;
private float wind;
}
With this, let us change the reducer as below.
SortedSet<Float> tempSetMax = new TreeSet<Float>();
SortedSet<Float> tempSetMin = new TreeSet<Float>();
SortedSet<Float> windSetMin = new TreeSet<Float>();
SortedSet<Float> windSetMax = new TreeSet<Float>();
// 'values' here is the Iterable<Stats> parameter passed into reduce()
for (Stats value : values) {
float temp = value.getTemp();
float wind = value.getWind();
if (tempSetMax.size() < 5) {
tempSetMax.add(temp);
} else {
float currentMinValue = tempSetMax.first();
if (temp > currentMinValue) {
tempSetMax.remove(currentMinValue);
tempSetMax.add(temp);
}
}
if (tempSetMin.size() < 5) {
tempSetMin.add(temp);
} else {
float currentMaxValue = tempSetMin.last();
if (temp < currentMaxValue) {
tempSetMin.remove(currentMaxValue);
tempSetMin.add(temp);
}
}
if (windSetMin.size() < 5) {
windSetMin.add(wind);
} else {
float currentMaxValue = windSetMin.last();
if (wind < currentMaxValue) {
windSetMin.remove(currentMaxValue);
windSetMin.add(wind);
}
}
if (windSetMax.size() < 5) {
windSetMax.add(wind);
} else {
float currentMinValue = windSetMax.first();
if (wind > currentMinValue) {
windSetMax.remove(currentMinValue);
windSetMax.add(wind);
}
}
}
Now you can write each set's toString() to the context, or you can create a custom Writable. Please change Stats according to your requirements; it needs to be a Writable. The above just demonstrates the example flow.
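For example, a minimal way to emit everything at the end of the reduce() sketched above (assuming the job is configured with Text as the reducer's output value class):
// Emit the four top-5 sets as one tab-separated Text value per year.
StringBuilder sb = new StringBuilder();
sb.append("top5MaxTemp=").append(tempSetMax)
  .append("\ttop5MinTemp=").append(tempSetMin)
  .append("\ttop5MaxWind=").append(windSetMax)
  .append("\ttop5MinWind=").append(windSetMin);
context.write(key, new Text(sb.toString()));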
Here is the code from the MR Design Patterns Book to get the top 10. There is also code for other MR design patterns in the same GitHub location.
