I'm new to Hadoop, and i'm trying to do a MapReduce program, to count the max first two occurrencise of lecters by date (grouped by month). So my input is of this kind :
2017-06-01 , A, B, A, C, B, E, F
2017-06-02 , Q, B, Q, F, K, E, F
2017-06-03 , A, B, A, R, T, E, E
2017-07-01 , A, B, A, C, B, E, F
2017-07-05 , A, B, A, G, B, G, G
so, i'm expeting as result of this MapReducer program, something like :
2017-06, A:4, E:4
2017-07, A:4, B:4
public class ArrayGiulioTest {
public static Logger logger = Logger.getLogger(ArrayGiulioTest.class);
public static class CustomMap extends Mapper<LongWritable, Text, Text, TextWritable> {
private Text word = new Text();
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
TextWritable array = new TextWritable();
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line, ",");
String dataAttuale = tokenizer.nextToken().substring(0,
line.lastIndexOf("-"));
Text tmp = null;
Text[] tmpArray = new Text[tokenizer.countTokens()];
int i = 0;
while (tokenizer.hasMoreTokens()) {
String prod = tokenizer.nextToken(",");
word.set(dataAttuale);
tmp = new Text(prod);
tmpArray[i] = tmp;
i++;
}
array.set(tmpArray);
context.write(word, array);
}
}
public static class CustomReduce extends Reducer<Text, TextWritable, Text, Text> {
public void reduce(Text key, Iterator<TextWritable> values,
Context context) throws IOException, InterruptedException {
MapWritable map = new MapWritable();
Text txt = new Text();
while (values.hasNext()) {
TextWritable array = values.next();
Text[] tmpArray = (Text[]) array.toArray();
for(Text t : tmpArray) {
if(map.get(t)!= null) {
IntWritable val = (IntWritable) map.get(t);
map.put(t, new IntWritable(val.get()+1));
} else {
map.put(t, new IntWritable(1));
}
}
}
Set<Writable> set = map.keySet();
StringBuffer str = new StringBuffer();
for(Writable k : set) {
str.append("key: " + k.toString() + " value: " + map.get(k) + "**");
}
txt.set(str.toString());
context.write(key, txt);
}
}
public static void main(String[] args) throws Exception {
long inizio = System.currentTimeMillis();
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "countProduct");
job.setJarByClass(ArrayGiulioTest.class);
job.setMapperClass(CustomMap.class);
//job.setCombinerClass(CustomReduce.class);
job.setReducerClass(CustomReduce.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(TextWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
long fine = System.currentTimeMillis();
logger.info("**************************************End" + (End-Start));
System.exit(1);
}
}
and i've implemented my custom TextWritable in this way :
public class TextWritable extends ArrayWritable {
public TextWritable() {
super(Text.class);
}
}
..so when i run my MapReduce program i obtain a result of this kind
2017-6 wordcount.TextWritable#3e960865
2017-6 wordcount.TextWritable#3e960865
it's obvious that my reducer it doesn't works. It seems the output from my Mapper
Any idea? And someone can says if is the right path to the solution?
Here Console Log (Just for information, my input file has 6 rows instead of 5)
*I obtain the same result starting MapReduce problem under eclipse(mono JVM) or using Hadoop with Hdfs
File System Counters
FILE: Number of bytes read=1216
FILE: Number of bytes written=431465
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
Map-Reduce Framework
Map input records=6
Map output records=6
Map output bytes=214
Map output materialized bytes=232
Input split bytes=97
Combine input records=0
Combine output records=0
Reduce input groups=3
Reduce shuffle bytes=232
Reduce input records=6
Reduce output records=6
Spilled Records=12
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=0
Total committed heap usage (bytes)=394264576
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=208
File Output Format Counters
Bytes Written=1813
I think you're trying to do too much work in the Mapper. You only need to group the dates (which it seems you aren't formatting them correctly anyway based on your expected output).
The following approach is going to turn these lines, for example
2017-07-01 , A, B, A, C, B, E, F
2017-07-05 , A, B, A, G, B, G, G
Into this pair for the reducer
2017-07 , ("A,B,A,C,B,E,F", "A,B,A,G,B,G,G")
In other words, you gain no real benefit by using an ArrayWritable, just keep it as text.
So, the Mapper would look like this
class CustomMap extends Mapper<LongWritable, Text, Text, Text> {
private final Text key = new Text();
private final Text output = new Text();
#Override
protected void map(LongWritable offset, Text value, Context context) throws IOException, InterruptedException {
int separatorIndex = value.find(",");
final String valueStr = value.toString();
if (separatorIndex < 0) {
System.err.printf("mapper: not enough records for %s", valueStr);
return;
}
String dateKey = valueStr.substring(0, separatorIndex).trim();
String tokens = valueStr.substring(1 + separatorIndex).trim().replaceAll("\\p{Space}", "");
SimpleDateFormat fmtFrom = new SimpleDateFormat("yyyy-MM-dd");
SimpleDateFormat fmtTo = new SimpleDateFormat("yyyy-MM");
try {
dateKey = fmtTo.format(fmtFrom.parse(dateKey));
key.set(dateKey);
} catch (ParseException ex) {
System.err.printf("mapper: invalid key format %s", dateKey);
return;
}
output.set(tokens);
context.write(key, output);
}
}
And then the reducer can build a Map that collects and counts the values from the value strings. Again, writing out only Text.
class CustomReduce extends Reducer<Text, Text, Text, Text> {
private final Text output = new Text();
#Override
protected void reduce(Text date, Iterable<Text> values, Context context) throws IOException, InterruptedException {
Map<String, Integer> keyMap = new TreeMap<>();
for (Text v : values) {
String[] keys = v.toString().trim().split(",");
for (String key : keys) {
if (!keyMap.containsKey(key)) {
keyMap.put(key, 0);
}
keyMap.put(key, 1 + keyMap.get(key));
}
}
output.set(mapToString(keyMap));
context.write(date, output);
}
private String mapToString(Map<String, Integer> map) {
StringBuilder sb = new StringBuilder();
String delimiter = ", ";
for (Map.Entry<String, Integer> entry : map.entrySet()) {
sb.append(
String.format("%s:%d", entry.getKey(), entry.getValue())
).append(delimiter);
}
sb.setLength(sb.length()-delimiter.length());
return sb.toString();
}
}
Given your input, I got this
2017-06 A:4, B:4, C:1, E:4, F:3, K:1, Q:2, R:1, T:1
2017-07 A:4, B:4, C:1, E:1, F:1, G:3
The main problem is about the sign of the reduce method :
I was writing : public void reduce(Text key, Iterator<TextWritable> values,
Context context)
instead of
public void reduce(Text key, Iterable<ArrayTextWritable> values,
This is the reason why i obtain my Map output instead of my Reduce otuput
I have the following error when I try to pass an IntWritable from my mapper to my reducer:
INFO mapreduce.Job: Task Id : attempt_1413976354988_0009_r_000000_1, Status : FAILED
Error: java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be cast to org.apache.hadoop.hbase.client.Mutation
This is my mapper:
public class testMapper extends TableMapper<Object, Object>
{
public void map(ImmutableBytesWritable rowKey, Result columns, Context context) throws IOException, InterruptedException
{
try
{
// get rowKey and convert it to string
String inKey = new String(rowKey.get());
// set new key having only date
String oKey = inKey.split("#")[0];
// get sales column in byte format first and then convert it to
// string (as it is stored as string from hbase shell)
byte[] bSales = columns.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("sales"));
String sSales = new String(bSales);
Integer sales = new Integer(sSales);
// emit date and sales values
context.write(new ImmutableBytesWritable(oKey.getBytes()), new IntWritable(sales));
}
This is the reducer:
public class testReducer extends TableReducer<Object, Object, Object>
{
public void reduce(ImmutableBytesWritable key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException
{
try
{
int sum = 0;
// loop through different sales vales and add it to sum
for (IntWritable sales : values)
{
Integer intSales = new Integer(sales.toString());
sum += intSales;
}
// create hbase put with rowkey as date
Put insHBase = new Put(key.get());
// insert sum value to hbase
insHBase.add(Bytes.toBytes("cf1"), Bytes.toBytes("sum"), Bytes.toBytes(sum));
// write data to Hbase table
context.write(null, insHBase);
and the driver:
public class testDriver
{
public static void main(String[] args) throws Exception
{
Configuration conf = new Configuration();
// define scan and define column families to scan
Scan scan = new Scan();
scan.addFamily(Bytes.toBytes("cf1"));
Job job = Job.getInstance(conf);
job.setJarByClass(testDriver.class);
// define input hbase table
TableMapReduceUtil.initTableMapperJob("test1", scan, testMapper.class, ImmutableBytesWritable.class, IntWritable.class, job);
// define output table
TableMapReduceUtil.initTableReducerJob("test2", testReducer.class, job);
job.waitForCompletion(true);
}
}
context.write(null, insHBase);
The problem is that you are writing the Put out to the context and hbase is expecting an IntWritable.
You should write the outputs out to the context and let Hbase take charge of storing them. Hase is expecting to store an IntWritable but you are handing it a Put operation which extends Mutation.
The workflow for Hbase is you would configure where to put the outputs in your Configuration and then simply write the outputs out to the context. You shouldn't have to do any manual Put operations in your reducer.
I wrote one Hadoop word count program which takes TextInputFormat input and is supposed to output word count in avro format.
Map-Reduce job is running fine but output of this job is readable using unix commands such as more or vi. I was expecting this output be unreadable as avro output is in binary format.
I have used mapper only, reducer is not present. I just want to experiment with avro so I am not worried about memory or stack overflow. Following the the code of mapper
public class WordCountMapper extends Mapper<LongWritable, Text, AvroKey<String>, AvroValue<Integer>> {
private Map<String, Integer> wordCountMap = new HashMap<String, Integer>();
#Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String[] keys = value.toString().split("[\\s-*,\":]");
for (String currentKey : keys) {
int currentCount = 1;
String currentToken = currentKey.trim().toLowerCase();
if(wordCountMap.containsKey(currentToken)) {
currentCount = wordCountMap.get(currentToken);
currentCount++;
}
wordCountMap.put(currentToken, currentCount);
}
System.out.println("DEBUG : total number of unique words = " + wordCountMap.size());
}
#Override
protected void cleanup(Context context) throws IOException, InterruptedException {
for (Map.Entry<String, Integer> currentKeyValue : wordCountMap.entrySet()) {
AvroKey<String> currentKey = new AvroKey<String>(currentKeyValue.getKey());
AvroValue<Integer> currentValue = new AvroValue<Integer>(currentKeyValue.getValue());
context.write(currentKey, currentValue);
}
}
}
and driver code is as follows :
public int run(String[] args) throws Exception {
Job avroJob = new Job(getConf());
avroJob.setJarByClass(AvroWordCount.class);
avroJob.setJobName("Avro word count");
avroJob.setInputFormatClass(TextInputFormat.class);
avroJob.setMapperClass(WordCountMapper.class);
AvroJob.setInputKeySchema(avroJob, Schema.create(Type.INT));
AvroJob.setInputValueSchema(avroJob, Schema.create(Type.STRING));
AvroJob.setMapOutputKeySchema(avroJob, Schema.create(Type.STRING));
AvroJob.setMapOutputValueSchema(avroJob, Schema.create(Type.INT));
AvroJob.setOutputKeySchema(avroJob, Schema.create(Type.STRING));
AvroJob.setOutputValueSchema(avroJob, Schema.create(Type.INT));
FileInputFormat.addInputPath(avroJob, new Path(args[0]));
FileOutputFormat.setOutputPath(avroJob, new Path(args[1]));
return avroJob.waitForCompletion(true) ? 0 : 1;
}
I would like to know how do avro output looks like and what am I doing wrong in this program.
Latest release of Avro library includes an updated example of their ColorCount example adopted for MRv2. I suggest you to look at it, use the same pattern as they use in Reduce class or just extend AvroMapper. Please note that using Pair class instead of AvroKey+AvroValue is also essential for running Avro on Hadoop.
This may seem like a stupid question, but I fail to see the problem in my types in my mapreduce code for hadoop
As stated in the question the problem is that it is expecting IntWritable but I'm passing it a Text object in the collector.collect of the reducer.
My job configuration has the following mapper output classes:
conf.setMapOutputKeyClass(IntWritable.class);
conf.setMapOutputValueClass(IntWritable.class);
And the following reducer output classes:
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
My mapping class has the following definition:
public static class Reduce extends MapReduceBase implements Reducer<IntWritable, IntWritable, Text, IntWritable>
with the required function:
public void reduce(IntWritable key, Iterator<IntWritable> values, OutputCollector<Text,IntWritable> output, Reporter reporter)
And then it fails when I call:
output.collect(new Text(),new IntWritable());
I'm fairly new to map reduce but all the types seem to match, it compiles but then fails on that line saying its expecting an IntWritable as the key for the reduce class. If it matters I'm using 0.21 version of Hadoop
Here is my map class:
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, IntWritable, IntWritable> {
private IntWritable node = new IntWritable();
private IntWritable edge = new IntWritable();
public void map(LongWritable key, Text value, OutputCollector<IntWritable, IntWritable> output, Reporter reporter) throws IOException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
node.set(Integer.parseInt(tokenizer.nextToken()));
edge.set(Integer.parseInt(tokenizer.nextToken()));
if(node.get() < edge.get())
output.collect(node, edge);
}
}
}
and my reduce class:
public static class Reduce extends MapReduceBase implements Reducer<IntWritable, IntWritable, Text, IntWritable> {
IntWritable $ = new IntWritable(Integer.MAX_VALUE);
Text keyText = new Text();
public void reduce(IntWritable key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
ArrayList<IntWritable> valueList = new ArrayList<IntWritable>();
//outputs original edge pair as key and $ for value
while (values.hasNext()) {
IntWritable value = values.next();
valueList.add(value);
keyText.set(key.get() + ", " + value.get());
output.collect(keyText, $);
}
//outputs all the 2 length pairs
for(int i = 0; i < valueList.size(); i++)
for(int j = i+1; i < valueList.size(); j++)
output.collect(new Text(valueList.get(i).get() + ", " + valueList.get(j).get()), key);
}
}
and my job configuration:
JobConf conf = new JobConf(Triangles.class);
conf.setJobName("mapred1");
conf.setMapOutputKeyClass(IntWritable.class);
conf.setMapOutputValueClass(IntWritable.class);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(Map.class);
conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path("mapred1"));
JobClient.runJob(conf);
Your problem is that you set the Reduce class as a combiner
conf.setCombinerClass(Reduce.class);
Combiners run in the map phase and they need to emit the same key/value type (IntWriteable, IntWritable in your case)
remove this line and you should be ok
I am trying to write a simple Map Reduce program using Hadoop which will give me the month which is most prone to flu. I am using the google flu trends dataset which can be found here http://www.google.org/flutrends/data.txt.
I have written both the Mapper and the reducer as shown below
public class MaxFluPerMonthMapper extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
private static final Log LOG =
LogFactory.getLog(MaxFluPerMonthMapper.class);
#Override
protected void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String row = value.toString();
LOG.debug("Received row " + row);
List<String> columns = Arrays.asList(row.split(","));
String date = columns.get(0);
SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");
int month = 0;
try {
Calendar calendar = Calendar.getInstance();
calendar.setTime(sdf.parse(date));
month = calendar.get(Calendar.MONTH);
} catch (ParseException e) {
e.printStackTrace();
}
for (int i = 1; i < columns.size(); i++) {
String fluIndex = columns.get(i);
if (StringUtils.isNotBlank(fluIndex) && StringUtils.isNumeric(fluIndex)) {
LOG.info("Writing key " + month + " and value " + fluIndex);
context.write(new IntWritable(month), new IntWritable(Integer.valueOf(fluIndex)));
}
}
}
}
Reducer
public class MaxFluPerMonthReducer extends Reducer<IntWritable, IntWritable, Text, IntWritable> {
private static final Log LOG =
LogFactory.getLog(MaxFluPerMonthReducer.class);
#Override
protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
LOG.info("Received key " + key.get());
int sum = 0;
for (IntWritable intWritable : values) {
sum += intWritable.get();
}
int month = key.get();
String monthString = new DateFormatSymbols().getMonths()[month];
context.write(new Text(monthString), new IntWritable(sum));
}
}
With these Mapper and Reducer shown above I am getting the following output
January 545419
February 528022
March 436348
April 336759
May 346482
June 309795
July 312966
August 307346
September 322359
October 428346
November 461195
December 480078
What I want is just a single output giving me January 545419
How can I achieve this? by storing state in reducer or there is anyother solution to it? or my mapper and reducer are wrong for the question I am asking on this dataset?
The problem is that the Reducer has no idea about the other keys (by design). It would be possible to set up another Reducer to find the maximum value given all the data from your current reducer. However, this is overkill since you know you will only have 12 records you have to process, and setting up another Reducer will have more overhead than just running a serial script.
I would suggest writing some other script to process your text output.
You can add one more MapReduce step.
Mapper something like this:
public class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
#Override
protected void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
// emit only first row
if (key == 0)
{
String row = value.toString();
String[] values = row.split("\t");
context.write(new Text(values[0]), new Text(values[1]));
}
}
}
Reducer must emit all it's input (which will be only one record) directly to output. Number of mappers and reducers should be set to one. If your MapReduce job uses more then one reducer you should use one another intermediate MapReduce step to combine results to one file after your MapReduce job.
But this way seems not very effective.