Hadoop secondary sorting - java

I am trying to implement a secondary sort, following this example: https://www.safaribooksonline.com/library/view/data-algorithms/9781491906170/ch01.html
But my problem is different. I have a list of products with the year and month and the price, like this:
201505011000######PEN DRIVE00951
201505011000######PEN DRIVE00952
201505011000######PEN DRIVE00458
201505011000######PEN DRIVE00459
201505011000#######NOTEBOOK11470
201605011000#######NOTEBOOK21471
201705011000#######NOTEBOOK21472
201705011000###GAVETA DE HD01472
201703011000###GAVETA DE HD01473
201705011000###GAVETA DE HD01474
For example, 201505 represents the year and month, the product name comes after the # signs, and at the end the price 01470 represents 14.70.
What I need to do is get the lowest price for each product and show the year and month of that price. But I don't know how to do that; all I can show so far is the lowest price and the product.
Here is my program:
MAPPER
public class GroupMR {
public static class GroupMapper extends Mapper<LongWritable, Text, Product, IntWritable> {
Product prdt = new Product();
Text cntText = new Text();
Text YearMonthText = new Text();
IntWritable price = new IntWritable();
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
String produto = line.substring(13, 27); // product name
produto = produto.substring(produto.lastIndexOf("#")+1);
String ano = line.substring(0, 6);
int valor = Integer.parseInt(line.substring(27, 32));
cntText.set(new Text(produto));
YearMonthText.set(ano);
price.set(valor);
Product prdt = new Product(cntText, YearMonthText);
context.write(prdt, price);
}
}
REDUCER
public static class GroupReducer extends Reducer<Product, IntWritable, Product, IntWritable> {
public void reduce(Product key, Iterator<IntWritable> values, Context context) throws IOException,
InterruptedException {
int minValue = Integer.MAX_VALUE;
while (values.hasNext()) {
minValue = Math.min(minValue,values.next().get());
}
context.write(key, new IntWritable(minValue));
}
}
COMPARABLE
private static class Product implements WritableComparable<Product> {
Text Product;
Text YearMonth;
public Product(Text Product, Text YearMonth) {
this.Product = Product;
this.YearMonth = YearMonth;
}
public Product() {
this.Product = new Text();
this.YearMonth = new Text();
}
public void write(DataOutput out) throws IOException {
this.Product.write(out);
this.YearMonth.write(out);
}
public void readFields(DataInput in) throws IOException {
this.Product.readFields(in);
this.YearMonth.readFields(in);
}
public int compareTo(Product pric) {
if (pric == null)
return 0;
int intcnt = Product.compareTo(pric.Product);
return intcnt;
}
@Override
public String toString() {
return Product.toString() + " DATA: " + YearMonth.toString();
}
}
DRIVER
public static void main(String[] args)
throws IOException, ClassNotFoundException, InterruptedException {
FileUtils.deleteDirectory(new File("/Local/data/output"));
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "GroupMR");
job.setJarByClass(GroupMR.class);
job.setMapperClass(GroupMapper.class);
job.setReducerClass(GroupReducer.class);
job.setOutputKeyClass(Product.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[1]));
FileOutputFormat.setOutputPath(job, new Path(args[2]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
RESULT
201605011000######PEN DRIVE00950
201505011000######PEN DRIVE00951
201505011000######PEN DRIVE00952
201505011000######PEN DRIVE00458
201505011000######PEN DRIVE00459
201505011000#######NOTEBOOK11470
201605011000#######NOTEBOOK21471
201705011000#######NOTEBOOK21472
201705011000###GAVETA DE HD01472
201703011000###GAVETA DE HD01473
201705011000###GAVETA DE HD01474
I think the problem is in the reduce and in the compareTo, but I have no idea how to fix them. Could someone help me with this?
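For reference, here is a minimal sketch (not the original code) of one way to keep the year/month of the minimum price: carry both the year/month and the price in the map value instead of only the price, and pick the minimum in the reducer. Note also that the reduce above takes an Iterator rather than an Iterable, so it never overrides Hadoop's reduce(KEY, Iterable, Context) and the default identity reduce runs, which may explain why the result looks like the raw input. The sketch assumes the mapper is changed to emit context.write(prdt, new Text(ano + "," + valor)):
// Sketch only: reducer that tracks the minimum price together with the year/month it came from.
public static class GroupReducer extends Reducer<Product, Text, Text, Text> {
@Override
public void reduce(Product key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
int minPrice = Integer.MAX_VALUE;
String minYearMonth = "";
for (Text value : values) {
String[] parts = value.toString().split(","); // "yearMonth,price"
int price = Integer.parseInt(parts[1]);
if (price < minPrice) {
minPrice = price;
minYearMonth = parts[0];
}
}
context.write(new Text(key.Product.toString()), new Text(minYearMonth + "\t" + minPrice));
}
}
The driver would then also need job.setMapOutputKeyClass(Product.class), job.setMapOutputValueClass(Text.class), and Text for the final output key and value classes.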

Related

Hadoop: How to start 2 Mappers and 2 Reducers

I'm trying to develop a Hadoop app. I want to start 2 mappers and 2 reducers in my main method, but I keep getting a cast error, which brings me to ask: how can I do this?
Mapper1:
@SuppressWarnings("javadoc")
public class IntervallMapper1 extends Mapper<LongWritable, Text, Text, LongWritable> {
private static Logger logger = Logger.getLogger(IntervallMapper1.class.getName());
private static Category categoriy;
private static Value value;
private String[] values = new String[4];
private final static LongWritable one = new LongWritable(1);
@Override
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
if(!this.categoriy.valueIsMissing(value.toString())){ // air pressure and wind speed present...
this.logger.info("Key: " + values[0] + values[1]);
values = this.value.getValues(value.toString());
context.write(new Text(values[0] + values[1]), this.one); // station-date as key and value = 1
}
}
}
Reducer1:
@SuppressWarnings("javadoc")
public class IntervallReducer1 extends Reducer<Text, LongWritable, Text, LongWritable> {
private static Logger logger = Logger.getLogger(IntervallReducer1.class.getName());
private String key = null;
private static LongWritable result = new LongWritable();
private long sum;
@Override
protected void reduce(Text key, Iterable<LongWritable> values, Context context)
throws IOException, InterruptedException {
for (LongWritable value : values) {
if(this.key == null){
logger.info("Erster Durchlauf");
System.out.println("---> " + value.get());
sum = value.get();
this.key = key.toString().substring(0, 10);
} else if (key.toString().contains(this.key)) { // TODO: key.toString().substring(0, 10)
logger.info("Key bereit vorhanden");
System.out.println("---> " + sum);
sum += value.get();
} else { // if the key is not already present
logger.info("Key nicht vorhanden");
result.set(sum);
logger.info("Value: " + sum);
context.write(new Text(this.key), result);
this.key = key.toString().substring(0, 10);
sum = value.get();
}
}
}
}
Mapper2:
@SuppressWarnings("javadoc")
public class IntervallMapper1 extends Mapper<LongWritable, Text, Text, LongWritable> {
private static Logger logger = Logger.getLogger(IntervallMapper1.class.getName());
private static Category categoriy;
private static Value value;
private String[] values = new String[4];
private final static LongWritable one = new LongWritable(1);
@Override
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
if(!this.categoriy.valueIsMissing(value.toString())){ // air pressure and wind speed present...
this.logger.info("Key: " + values[0] + values[1]);
values = this.value.getValues(value.toString());
context.write(new Text(values[0] + values[1]), this.one); // station-date as key and value = 1
}
}
}
Main:
@SuppressWarnings("javadoc")
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
Job job = Job.getInstance(new Configuration());
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(LongWritable.class);
job.setMapperClass(IntervallMapper1.class);
// job.setCombinerClass(IntervallReducer1.class);
job.setReducerClass(IntervallReducer1.class);
job.setMapperClass(IntervallMapper2.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setJarByClass(IntervallStart.class);
job.waitForCompletion(true);
}
Error:
Error: java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.Text
at ncdcW03.IntervallMapper2.map(IntervallMapper2.java:1)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
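A single Job takes only one mapper class and one reducer class; calling setMapperClass twice simply replaces the first mapper with the second. One common way to use two mappers and two reducers is to chain two jobs, where the second job reads the output of the first. This is only a sketch against the classes above; the intermediate path and the IntervallReducer2 class are assumptions:
Configuration conf = new Configuration();
Job job1 = Job.getInstance(conf, "interval pass 1");
job1.setJarByClass(IntervallStart.class);
job1.setMapperClass(IntervallMapper1.class);
job1.setReducerClass(IntervallReducer1.class);
job1.setOutputKeyClass(Text.class);
job1.setOutputValueClass(LongWritable.class);
FileInputFormat.setInputPaths(job1, new Path(args[0]));
Path intermediate = new Path(args[1] + "_pass1"); // hypothetical intermediate directory
FileOutputFormat.setOutputPath(job1, intermediate);
if (!job1.waitForCompletion(true)) System.exit(1);
Job job2 = Job.getInstance(conf, "interval pass 2");
job2.setJarByClass(IntervallStart.class);
job2.setMapperClass(IntervallMapper2.class); // must accept the LongWritable/Text input of TextInputFormat
job2.setReducerClass(IntervallReducer2.class); // hypothetical second reducer
job2.setOutputKeyClass(Text.class);
job2.setOutputValueClass(LongWritable.class);
FileInputFormat.setInputPaths(job2, intermediate);
FileOutputFormat.setOutputPath(job2, new Path(args[1]));
System.exit(job2.waitForCompletion(true) ? 0 : 1);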

MapReduce - FloatArrayWritable printing address

I have a MapReduce program whose reduce method outputs a Text as the key and a FloatArrayWritable as the value. However, the values are printed as an object address instead of the values from the toString() method.
The output I am getting is:
IYE marketDataPackage.MarketData#69204998
IYE marketDataPackage.MarketData#69204998
The output should be:
IYE 38.89, 38.50, etc.
Could someone please point out the error in my code? Thanks.
public static class Map extends Mapper<LongWritable, Text, Text, MarketData> {
private Text symbol = new Text();
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
StringTokenizer tokenizer2 = new StringTokenizer(tokenizer.nextToken().toString(), ",");
symbol.set(tokenizer2.nextToken());
context.write(symbol, new MarketData(tokenizer2.nextToken(), Float.parseFloat(tokenizer2.nextToken())));
}
}
}
public static class Reduce extends Reducer<Text, FloatWritable, Text, FloatArrayWritable> {
public void reduce(Text key, Iterable<MarketData> values, Context context) throws IOException, InterruptedException, ParseException {
Calendar today = Calendar.getInstance();
today.add(Calendar.DAY_OF_MONTH, -45);
Calendar testDate = Calendar.getInstance();
SimpleDateFormat sdf = new SimpleDateFormat("yyyy/m/d");
List<FloatWritable> prices = new ArrayList<FloatWritable>();
for (MarketData m : values) {
testDate.setTime(sdf.parse(m.getTradeDate()));
if (testDate.after(today)) {
prices.add(new FloatWritable(m.getPrice()));
}
}
context.write(key, new FloatArrayWritable(prices.toArray(new FloatWritable[prices.size()])));
}
}
public static void main(String[] args) {
Configuration conf = new Configuration();
Job job = new Job(conf, "Security_Closing_Prices");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(MarketData.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
}
FloatArrayWritable class:
public class FloatArrayWritable extends ArrayWritable {
public FloatArrayWritable() {
super(FloatWritable.class);
}
public FloatArrayWritable(FloatWritable[] values) {
super(FloatWritable.class, values);
}
@Override
public FloatWritable[] get() {
return (FloatWritable[]) super.get();
}
@Override
public String toString() {
FloatWritable[] values = get();
String prices = "";
for (FloatWritable f : values) {
prices = prices + f.toString() + ", ";
}
if (prices != null && !prices.isEmpty()) {
prices = prices.substring(0, prices.length() - 2);
}
return prices;
}
}
The MarketData class should override toString(). You don't provide code for that class, but I suspect that it doesn't.
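A minimal sketch of such an override, assuming MarketData exposes the getTradeDate() and getPrice() accessors used in the reducer above:
// Hypothetical sketch inside MarketData; it only illustrates overriding toString().
@Override
public String toString() {
return getTradeDate() + " " + getPrice();
}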

Reduce does not start after map completes

Below is the code for my implementation of a simple MapReduce job using a custom WritableComparable.
public class MapReduceKMeans {
public static class MapReduceKMeansMapper extends
Mapper<Object, Text, SongDataPoint, Text> {
public void map(Object key, Text value, Context context)
throws InterruptedException, IOException {
String str = value.toString();
// Reading Line one by one from the input CSV.
String split[] = str.split(",");
String trackId = split[0];
String title = split[1];
String artistName = split[2];
SongDataPoint songDataPoint =
new SongDataPoint(new Text(trackId), new Text(title),
new Text(artistName));
context.write(songDataPoint, new Text());
}
}
public static class MapReduceKMeansReducer extends
Reducer<SongDataPoint, Text, Text, NullWritable> {
public void reduce(SongDataPoint key, Iterable<Text> values,
Context context) throws IOException, InterruptedException {
StringBuilder sb = new StringBuilder();
sb.append(key.getTrackId()).append("\t").
append(key.getTitle()).append("\t")
.append(key.getArtistName()).append("\t");
String write = sb.toString();
context.write(new Text(write), NullWritable.get());
}
}
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf, args)
.getRemainingArgs();
if (otherArgs.length != 2) {
System.err
.println("Usage:<CsV Out Path> <Final Out Path>");
System.exit(2);
}
Job job = new Job(conf, "Song Data Trial");
job.setJarByClass(MapReduceKMeans.class);
job.setMapperClass(MapReduceKMeansMapper.class);
job.setReducerClass(MapReduceKMeansReducer.class);
job.setOutputKeyClass(SongDataPoint.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
When I debug, my code reads all the rows in the CSV file but it never enters the reduce phase at all.
I have also used SongDataPoint as my custom writable.
Its code is as below.
public class SongDataPoint implements WritableComparable<SongDataPoint> {
Text trackId;
Text title;
Text artistName;
public SongDataPoint() {
this.trackId = new Text();
this.title = new Text();
this.artistName = new Text();
}
public SongDataPoint(Text trackId, Text title, Text artistName) {
this.trackId = trackId;
this.title = title;
this.artistName = artistName;
}
@Override
public void readFields(DataInput in) throws IOException {
this.trackId.readFields(in);
this.title.readFields(in);
this.artistName.readFields(in);
}
@Override
public void write(DataOutput out) throws IOException {
}
public Text getTrackId() {
return trackId;
}
public void setTrackId(Text trackId) {
this.trackId = trackId;
}
public Text getTitle() {
return title;
}
public void setTitle(Text title) {
this.title = title;
}
public Text getArtistName() {
return artistName;
}
public void setArtistName(Text artistName) {
this.artistName = artistName;
}
@Override
public int compareTo(SongDataPoint o) {
// TODO Auto-generated method stub
int compare = getTrackId().compareTo(o.getTrackId());
return compare;
}
}
Any help is appreciated. Thanks.
Your output key class as per the driver is SongDataPoint.class and your output value class is Text.class, but you are actually writing Text as the key and NullWritable as the value in the reducer.
You should also specify the mapper output key and value classes as follows:
job.setMapOutputKeyClass(SongDataPoint.class);
job.setMapOutputValueClass(Text.class);
My write method in my custom Writable class was left blank by mistake (shown below); writing the proper code in it solved the problem.
public void write(DataOutput out) throws IOException {
}
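Presumably the fixed version mirrors readFields and serializes the same fields in the same order, something along these lines (a sketch, not the poster's actual code):
@Override
public void write(DataOutput out) throws IOException {
this.trackId.write(out);
this.title.write(out);
this.artistName.write(out);
}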

Map Reduce job generating empty output file

The program is generating an empty output file. Can anyone please suggest where I am going wrong?
Any help will be highly appreciated. I tried job.setNumReduceTask(0) since I am not using a reducer, but the output file is still empty.
public static class PrizeDisMapper extends Mapper<LongWritable, Text, Text, Pair>{
int rating = 0;
Text CustID;
IntWritable r;
Text MovieID;
public void map(LongWritable key, Text line, Context context
) throws IOException, InterruptedException {
String line1 = line.toString();
String [] fields = line1.split(":");
if(fields.length > 1)
{
String Movieid = fields[0];
String line2 = fields[1];
String [] splitline = line2.split(",");
String Custid = splitline[0];
int rate = Integer.parseInt(splitline[1]);
r = new IntWritable(rate);
CustID = new Text(Custid);
MovieID = new Text(Movieid);
Pair P = new Pair();
context.write(MovieID,P);
}
else
{
return;
}
}
}
public static class IntSumReducer extends Reducer<Text,Pair,Text,Pair> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<Pair> values,
Context context
) throws IOException, InterruptedException {
for (Pair val : values) {
context.write(key, val);
}
}
public class Pair implements Writable
{
String key;
int value;
public void write(DataOutput out) throws IOException {
out.writeInt(value);
out.writeChars(key);
}
public void readFields(DataInput in) throws IOException {
key = in.readUTF();
value = in.readInt();
}
public void setVal(String aKey, int aValue)
{
key = aKey;
value = aValue;
}
Main class:
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
if (otherArgs.length != 2) {
System.err.println("Usage: wordcount <in> <out>");
System.exit(2);
}
Job job = new Job(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setInputFormatClass (TextInputFormat.class);
FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Pair.class);
System.exit(job.waitForCompletion(true) ? 0 : 1);
Thanks @Pathmanaban Palsamy and @Chris Gerken for your suggestions. I have modified the code as per your suggestions but am still getting an empty output file. Can anyone please suggest the configuration for input and output in my main class? Do I need to specify the Pair class as input to the mapper, and if so, how?
I'm guessing the reduce method should be declared as
public void reduce(Text key, Iterable<Pair> values,
Context context
) throws IOException, InterruptedException
You get passed an Iterable (an object from which you can get an Iterator) which you use to iterate over all of the values that were mapped to the given key.
Since no reducer is required, I suspect the lines below:
Pair P = new Pair();
context.write(MovieID,P);
Writing an empty Pair would be the issue.
Also, please check that your driver class sets the correct key class and value class, like:
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Pair.class);
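As a sketch, the mapper could populate the Pair before writing it, using the setVal method from the Pair class above (variable names taken from the mapper in the question):
Pair P = new Pair();
P.setVal(Custid, rate); // fill the pair instead of writing it empty
context.write(MovieID, P);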

Performing multiple computations with Hadoop Map Reduce

I have a MapReduce program for finding the min/max of 2 separate properties for each year. This works, for the most part, on a single-node Hadoop cluster. Here is my current setup:
public class MaxTemperatureReducer extends
Reducer<Text, Stats, Text, Stats> {
private Stats result = new Stats();
@Override
public void reduce(Text key, Iterable<Stats> values, Context context)
throws IOException, InterruptedException {
int maxValue = Integer.MIN_VALUE;
int minValue = Integer.MAX_VALUE;
int sum = 0;
for (Stats value : values) {
result.setMaxTemp(Math.max(maxValue, value.getMaxTemp()));
result.setMinTemp(Math.min(minValue, value.getMinTemp()));
result.setMaxWind(Math.max(maxValue, value.getMaxWind()));
result.setMinWind(Math.min(minValue, value.getMinWind()));
sum += value.getCount();
}
result.setCount(sum);
context.write(key, result);
}
}
public class MaxTemperatureMapper extends
Mapper<Object, Text, Text, Stats> {
private static final int MISSING = 9999;
private Stats outStat = new Stats();
@Override
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
String[] split = value.toString().split("\\s+");
String year = split[2].substring(0, 4);
int airTemperature;
airTemperature = (int) Float.parseFloat(split[3]);
outStat.setMinTemp((float)airTemperature);
outStat.setMaxTemp((float)airTemperature);
outStat.setMinWind(Float.parseFloat(split[12]));
outStat.setMaxWind(Float.parseFloat(split[14]));
outStat.setCount(1);
context.write(new Text(year), outStat);
}
}
public class MaxTemperatureDriver extends Configured implements Tool {
public int run(String[] args) throws Exception {
if (args.length != 2) {
System.err
.println("Usage: MaxTemperatureDriver <input path> <outputpath>");
System.exit(-1);
}
Job job = new Job();
job.setJarByClass(MaxTemperatureDriver.class);
job.setJobName("Max Temperature");
job.setMapperClass(MaxTemperatureMapper.class);
job.setCombinerClass(MaxTemperatureReducer.class);
job.setReducerClass(MaxTemperatureReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Stats.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
boolean success = job.waitForCompletion(true);
return success ? 0 : 1;
}
public static void main(String[] args) throws Exception {
MaxTemperatureDriver driver = new MaxTemperatureDriver();
int exitCode = ToolRunner.run(driver, args);
System.exit(exitCode);
}
}
Currently it only prints the min/max for the temperature and wind speed for each year. I am sure it is a simple change, but I cannot find an answer anywhere. I want to find the top 5 min/max values for each year. Any suggestions?
Let me assume the following signature for your Stats class.
/* the Stats class needs to be a Writable; the below is just a demo */
public class Stats {
public float getTemp() {
return temp;
}
public void setTemp(float temp) {
this.temp = temp;
}
public float getWind() {
return wind;
}
public void setWind(float wind) {
this.wind = wind;
}
private float temp;
private float wind;
}
With this, let us change the reducer as below.
SortedSet<Float> tempSetMax = new TreeSet<Float>();
SortedSet<Float> tempSetMin = new TreeSet<Float>();
SortedSet<Float> windSetMin = new TreeSet<Float>();
SortedSet<Float> windSetMax = new TreeSet<Float>();
// iterate over the reducer's Iterable<Stats> values parameter
for (Stats value : values) {
float temp = value.getTemp();
float wind = value.getWind();
if (tempSetMax.size() < 5) {
tempSetMax.add(temp);
} else {
float currentMinValue = tempSetMax.first();
if (temp > currentMinValue) {
tempSetMax.remove(currentMinValue);
tempSetMax.add(temp);
}
}
if (tempSetMin.size() < 5) {
tempSetMin.add(temp);
} else {
float currentMaxValue = tempSetMin.last();
if (temp < currentMaxValue) {
tempSetMin.remove(currentMaxValue);
tempSetMin.add(temp);
}
}
if (windSetMin.size() < 5) {
windSetMin.add(wind);
} else {
float currentMaxValue = windSetMin.last();
if (wind < currentMaxValue) {
windSetMin.remove(currentMaxValue);
windSetMin.add(wind);
}
}
if (windSetMax.size() < 5) {
windSetMax.add(wind);
} else {
float currentMinValue = windSetMax.first();
if (wind > currentMinValue) {
windSetMax.remove(currentMinValue);
windSetMax.add(wind);
}
}
}
Now you can write the toString() of each set to the context, or you can create a custom Writable. In my code, please adapt Stats to your requirements; it needs to be a Writable. The above is just to demonstrate the example flow.
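For example, a quick sketch that emits the collected sets as plain Text (which would mean changing the reducer's output value type from Stats to Text):
// Sketch: emit the top-5 sets as a single Text value per year.
context.write(key, new Text("maxTemps=" + tempSetMax + " minTemps=" + tempSetMin + " maxWinds=" + windSetMax + " minWinds=" + windSetMin));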
Here is the code from the MR Design Patterns Book to get the top 10. There is also code for other MR design patterns in the same GitHub location.
