I'm new to Hadoop, and I'm trying to write a MapReduce program that finds the two most frequent letters per date, grouped by month. My input looks like this:
2017-06-01 , A, B, A, C, B, E, F
2017-06-02 , Q, B, Q, F, K, E, F
2017-06-03 , A, B, A, R, T, E, E
2017-07-01 , A, B, A, C, B, E, F
2017-07-05 , A, B, A, G, B, G, G
So, as the result of this MapReduce program, I'm expecting something like:
2017-06, A:4, E:4
2017-07, A:4, B:4
public class ArrayGiulioTest {
public static Logger logger = Logger.getLogger(ArrayGiulioTest.class);
public static class CustomMap extends Mapper<LongWritable, Text, Text, TextWritable> {
private Text word = new Text();
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
TextWritable array = new TextWritable();
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line, ",");
String dataAttuale = tokenizer.nextToken().substring(0,
line.lastIndexOf("-"));
Text tmp = null;
Text[] tmpArray = new Text[tokenizer.countTokens()];
int i = 0;
while (tokenizer.hasMoreTokens()) {
String prod = tokenizer.nextToken(",");
word.set(dataAttuale);
tmp = new Text(prod);
tmpArray[i] = tmp;
i++;
}
array.set(tmpArray);
context.write(word, array);
}
}
public static class CustomReduce extends Reducer<Text, TextWritable, Text, Text> {
public void reduce(Text key, Iterator<TextWritable> values,
Context context) throws IOException, InterruptedException {
MapWritable map = new MapWritable();
Text txt = new Text();
while (values.hasNext()) {
TextWritable array = values.next();
Text[] tmpArray = (Text[]) array.toArray();
for(Text t : tmpArray) {
if(map.get(t)!= null) {
IntWritable val = (IntWritable) map.get(t);
map.put(t, new IntWritable(val.get()+1));
} else {
map.put(t, new IntWritable(1));
}
}
}
Set<Writable> set = map.keySet();
StringBuffer str = new StringBuffer();
for(Writable k : set) {
str.append("key: " + k.toString() + " value: " + map.get(k) + "**");
}
txt.set(str.toString());
context.write(key, txt);
}
}
public static void main(String[] args) throws Exception {
long inizio = System.currentTimeMillis();
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "countProduct");
job.setJarByClass(ArrayGiulioTest.class);
job.setMapperClass(CustomMap.class);
//job.setCombinerClass(CustomReduce.class);
job.setReducerClass(CustomReduce.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(TextWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
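boolean success = job.waitForCompletion(true);
long fine = System.currentTimeMillis();
// Elapsed time in milliseconds: end (fine) minus start (inizio)
logger.info("**************************************End" + (fine - inizio));
System.exit(success ? 0 : 1);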
}
}
and I've implemented my custom TextWritable this way:
public class TextWritable extends ArrayWritable {
public TextWritable() {
super(Text.class);
}
}
When I run my MapReduce program, I get a result like this:
2017-6 wordcount.TextWritable@3e960865
2017-6 wordcount.TextWritable@3e960865
Clearly my reducer doesn't work: this looks like the output of my Mapper.
Any ideas? And can someone tell me whether this is the right path to the solution?
Here is the console log (just for information, my input file has 6 rows instead of 5).
*I get the same result whether I start the MapReduce program under Eclipse (single JVM) or run it on Hadoop with HDFS.
File System Counters
FILE: Number of bytes read=1216
FILE: Number of bytes written=431465
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
Map-Reduce Framework
Map input records=6
Map output records=6
Map output bytes=214
Map output materialized bytes=232
Input split bytes=97
Combine input records=0
Combine output records=0
Reduce input groups=3
Reduce shuffle bytes=232
Reduce input records=6
Reduce output records=6
Spilled Records=12
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=0
Total committed heap usage (bytes)=394264576
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=208
File Output Format Counters
Bytes Written=1813
I think you're trying to do too much work in the Mapper. You only need to group by date (and, judging by your expected output, you aren't formatting the dates correctly anyway).
The following approach is going to turn these lines, for example
2017-07-01 , A, B, A, C, B, E, F
2017-07-05 , A, B, A, G, B, G, G
Into this pair for the reducer
2017-07 , ("A,B,A,C,B,E,F", "A,B,A,G,B,G,G")
In other words, you gain no real benefit from using an ArrayWritable; just keep it as Text.
So, the Mapper would look like this
class CustomMap extends Mapper<LongWritable, Text, Text, Text> {
private final Text key = new Text();
private final Text output = new Text();
@Override
protected void map(LongWritable offset, Text value, Context context) throws IOException, InterruptedException {
int separatorIndex = value.find(",");
final String valueStr = value.toString();
if (separatorIndex < 0) {
System.err.printf("mapper: not enough records for %s", valueStr);
return;
}
String dateKey = valueStr.substring(0, separatorIndex).trim();
String tokens = valueStr.substring(1 + separatorIndex).trim().replaceAll("\\p{Space}", "");
SimpleDateFormat fmtFrom = new SimpleDateFormat("yyyy-MM-dd");
SimpleDateFormat fmtTo = new SimpleDateFormat("yyyy-MM");
try {
dateKey = fmtTo.format(fmtFrom.parse(dateKey));
key.set(dateKey);
} catch (ParseException ex) {
System.err.printf("mapper: invalid key format %s", dateKey);
return;
}
output.set(tokens);
context.write(key, output);
}
}
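The parsing step of that mapper can be tried without a cluster. The following is only a sketch under my own assumptions: MonthKeyParser and parse are hypothetical names, and it swaps SimpleDateFormat for java.time, which rejects out-of-range dates such as 2017-13-01 instead of rolling them over.

```java
import java.time.LocalDate;
import java.time.YearMonth;
import java.time.format.DateTimeParseException;

// Hypothetical helper, not part of the answer's mapper: splits one input line
// into a "yyyy-MM" key and a comma-joined value with all whitespace removed.
class MonthKeyParser {

    // Returns {monthKey, tokens}, or null when the line has no comma or the
    // date is not a valid yyyy-MM-dd date (a mapper would skip such a record).
    static String[] parse(String line) {
        int separatorIndex = line.indexOf(',');
        if (separatorIndex < 0) {
            return null;
        }
        String datePart = line.substring(0, separatorIndex).trim();
        String tokens = line.substring(separatorIndex + 1).replaceAll("\\s", "");
        try {
            // LocalDate.parse expects ISO-8601 dates such as 2017-07-05.
            YearMonth month = YearMonth.from(LocalDate.parse(datePart));
            return new String[] { month.toString(), tokens };
        } catch (DateTimeParseException ex) {
            return null;
        }
    }

    public static void main(String[] args) {
        String[] pair = parse("2017-07-05 , A, B, A, G, B, G, G");
        System.out.println(pair[0] + " -> " + pair[1]); // 2017-07 -> A,B,A,G,B,G,G
    }
}
```

In a real mapper the null case would correspond to logging the record and returning early, as the code above does.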
And then the reducer can build a Map that collects and counts the values from the value strings. Again, writing out only Text.
class CustomReduce extends Reducer<Text, Text, Text, Text> {
private final Text output = new Text();
@Override
protected void reduce(Text date, Iterable<Text> values, Context context) throws IOException, InterruptedException {
Map<String, Integer> keyMap = new TreeMap<>();
for (Text v : values) {
String[] keys = v.toString().trim().split(",");
for (String key : keys) {
if (!keyMap.containsKey(key)) {
keyMap.put(key, 0);
}
keyMap.put(key, 1 + keyMap.get(key));
}
}
output.set(mapToString(keyMap));
context.write(date, output);
}
private String mapToString(Map<String, Integer> map) {
StringBuilder sb = new StringBuilder();
String delimiter = ", ";
for (Map.Entry<String, Integer> entry : map.entrySet()) {
sb.append(
String.format("%s:%d", entry.getKey(), entry.getValue())
).append(delimiter);
}
sb.setLength(sb.length()-delimiter.length());
return sb.toString();
}
}
Given your input, I got this
2017-06 A:4, B:4, C:1, E:4, F:3, K:1, Q:2, R:1, T:1
2017-07 A:4, B:4, C:1, E:1, F:1, G:3
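That output lists every letter, while the question asks only for the two most frequent per month. One way to trim it, sketched here in plain Java so it can be run without Hadoop, is to sort the counted entries by value before formatting. TopTwo, count and topN are hypothetical names, and breaking ties alphabetically is my assumption (for 2017-06 the letters A, B and E all occur 4 times, so some tie rule is needed).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TopTwo {

    // Counts comma-separated tokens across all value strings of one month.
    static Map<String, Integer> count(List<String> values) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String v : values) {
            for (String token : v.split(",")) {
                counts.merge(token.trim(), 1, Integer::sum);
            }
        }
        return counts;
    }

    // Formats the n most frequent entries as "A:4, B:4".
    // Assumption: ties are broken alphabetically by key.
    static String topN(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder())
                .thenComparing(Map.Entry.<String, Integer>comparingByKey()));
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            if (i > 0) {
                sb.append(", ");
            }
            sb.append(entries.get(i).getKey()).append(':').append(entries.get(i).getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> june = List.of("A,B,A,C,B,E,F", "Q,B,Q,F,K,E,F", "A,B,A,R,T,E,E");
        System.out.println(topN(count(june), 2)); // A:4, B:4
    }
}
```

Inside the reducer, this would amount to replacing the call to mapToString(keyMap) with something like topN(keyMap, 2).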
The main problem is the signature of the reduce method.
I was writing: public void reduce(Text key, Iterator<TextWritable> values,
Context context)
instead of
public void reduce(Text key, Iterable<ArrayTextWritable> values,
With Iterator the method does not override Reducer.reduce, so Hadoop runs the default reduce, which writes every value through unchanged. This is why I got my Map output instead of my Reduce output.
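The effect can be reproduced without Hadoop. In this sketch BaseReducer is a hypothetical stand-in for Hadoop's Reducer (its run method plays the role of the framework); none of these class names come from the real API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Stand-in for Hadoop's Reducer: the "framework" only ever calls the Iterable
// version, whose default implementation passes each value through unchanged.
class BaseReducer {
    final List<String> output = new ArrayList<>();

    protected void reduce(String key, Iterable<String> values) {
        for (String v : values) {
            output.add(key + "\t" + v);
        }
    }

    // Plays the role of the framework calling reduce once per key.
    final void run(String key, List<String> values) {
        reduce(key, values);
    }
}

// Iterator instead of Iterable: this is an overload, so run() never reaches it.
class WrongReducer extends BaseReducer {
    public void reduce(String key, Iterator<String> values) {
        output.add(key + "\tREDUCED");
    }
}

// Matching parameter types: @Override compiles and run() dispatches here.
class RightReducer extends BaseReducer {
    @Override
    protected void reduce(String key, Iterable<String> values) {
        output.add(key + "\tREDUCED");
    }
}

class SignatureDemo {
    public static void main(String[] args) {
        WrongReducer wrong = new WrongReducer();
        wrong.run("2017-06", Arrays.asList("a", "b"));
        System.out.println(wrong.output.size()); // 2: values passed through untouched

        RightReducer right = new RightReducer();
        right.run("2017-06", Arrays.asList("a", "b"));
        System.out.println(right.output.size()); // 1: the custom reduce ran
    }
}
```

Putting @Override on WrongReducer.reduce would turn the mistake into a compile error, which is a cheap way to catch it early.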
I am having trouble with a MapReduce Job. My map function does run and it produces the desired output. However, the reduce function does not run. It seems like the function never gets called. I am using Text as keys and Text as values. But I don't think that this causes the problem.
The input file is formatted as follows:
2015-06-06,2015-06-06,40.80239868164062,-73.93379211425781,40.72591781616211,-73.98358154296875,7.71,35.72
2015-06-06,2015-06-06,40.71020126342773,-73.96302032470703,40.72967529296875,-74.00226593017578,3.11,2.19
2015-06-05,2015-06-05,40.68404388427734,-73.97597503662109,40.67932510375977,-73.95581817626953,1.13,1.29
...
I want to extract the second date of a line as Text and use it as key for the reduce. The value for the key will be a combination of the last two float values in the same line.
i.e.: 2015-06-06 7.71 35.72
2015-06-06 9.71 66.72
So that the value part can be viewed as two columns separated by a blank.
That actually works and I get an output file with many same keys but different values.
Now I want to sum both float columns for each key, so that after the reduce I get a date as key with the summed columns as value.
Problem: reduce does not run.
See the code below:
Mapper
public class Aggregate {
public static class EarnDistMapper extends Mapper<Object, Text, Text, Text> {
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
String [] splitResult = value.toString().split(",");
String dropOffDate = "";
String compEarningDist = "";
//dropoffDate at pos 1 as key
dropOffDate = splitResult[1];
//distance at pos length-2 and earnings at pos length-1 as values separated by space
compEarningDist = splitResult[splitResult.length -2] + " " + splitResult[splitResult.length-1];
context.write(new Text(dropOffDate), new Text(compEarningDist));
}
}
Reducer
public static class EarnDistReducer extends Reducer<Text,Text,Text,Text> {
public void reduce(Text key, Iterator<Text> values, Context context) throws IOException, InterruptedException {
float sumDistance = 0;
float sumEarnings = 0;
String[] splitArray;
while (values.hasNext()){
splitArray = values.next().toString().split("\\s+");
//distance first
sumDistance += Float.parseFloat(splitArray[0]);
sumEarnings += Float.parseFloat(splitArray[1]);
}
//combine result to text
context.write(key, new Text(Float.toString(sumDistance) + " " + Float.toString(sumEarnings)));
}
}
Job
public static void main(String[] args) throws Exception{
// TODO
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "Taxi dropoff");
job.setJarByClass(Aggregate.class);
job.setMapperClass(EarnDistMapper.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setCombinerClass(EarnDistReducer.class);
job.setReducerClass(EarnDistReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Thank you for your help!!
You have the signature of the reduce method wrong. You have:
public void reduce(Text key, Iterator<Text> values, Context context) {
It should be:
public void reduce(Text key, Iterable<Text> values, Context context) {
I am trying to convert Text to String in my reduce function, but it's not working. The same logic worked perfectly in the map function, but in the reduce function it throws this error: java.lang.ArrayIndexOutOfBoundsException: 1
My Map code is like this
public static class OutDegreeMapper2
extends Mapper<Object, Text, Text, Text>
{
private Text word = new Text();
private Text word2 = new Text();
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException
{
String oneLine = value.toString();
String[] parts = oneLine.split("\t");
word.set(parts[0]);
String join = parts[1]+",from2";
word2.set(join);
context.write(word, word2);
}
}
My reduce function is like this
public static class OutDegreeReducer
extends Reducer<Text,Text,Text,Text>
{
private Text word = new Text();
String merge ="";
public void reduce(Text key, Iterable<Text> values,
Context context
) throws IOException, InterruptedException
{
for(Text val:values)
{
String[] x = val.toString().split(",");
if(x[1].contains("from2")){
merge+= x[0];
}
}
word.set(merge);
context.write(key, word);
}
}
Kindly tell me why split is working in map function but not in reducer?
Very likely here
String[] parts = oneLine.split("\t");
word.set(parts[0]);
String join = parts[1]+",from2";
or here
String[] x = val.toString().split(",");
if(x[1].contains("from2")){
merge+= x[0];
}
Reading x[1] or parts[1] throws the ArrayIndexOutOfBoundsException when the string contains no "," or "\t", because split then returns an array with a single element.
I suggest checking the length of the array before accessing element 1.
The stack trace should tell you which line is throwing the exception.
Instead of
if(x[1].contains("from2")){
merge+= x[0];
}
Do this:
if(x.length > 1 && x[1].contains("from2")){
merge+= x[0];
}
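Here is a small stand-alone sketch of that guard, written in plain Java so it runs without Hadoop. GuardedSplit and mergeTagged are hypothetical names. It also keeps the accumulator local: in the reducer from the question, merge is an instance field that is never reset, so if the same reducer instance handles several keys, text from earlier keys would leak into later ones.

```java
import java.util.Arrays;
import java.util.List;

class GuardedSplit {

    // Joins the first field of every value tagged "from2", skipping values
    // without a second field instead of indexing past the end of the array.
    static String mergeTagged(List<String> values) {
        StringBuilder merge = new StringBuilder(); // local, so each call starts empty
        for (String val : values) {
            String[] x = val.split(",");
            // Arrays expose a length field, not a length() method.
            if (x.length > 1 && x[1].contains("from2")) {
                merge.append(x[0]);
            }
        }
        return merge.toString();
    }

    public static void main(String[] args) {
        // "b" has no comma, so split returns a one-element array and it is skipped.
        System.out.println(mergeTagged(Arrays.asList("a,from2", "b", "c,from1", "d,from2"))); // ad
    }
}
```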
I am trying to create a variation of the Hadoop word count program that reads multiple files in a directory and outputs the frequency of each word. The thing is, I want it to output a word followed by the name of the file it came from and its frequency in that file. For example:
word1
( file1, 10)
( file2, 3)
( file3, 20)
So for word1 (say the word "and"), it finds it 10 times in file1, 3 times in file2, etc. Right now it is outputting only a key-value pair:
StringTokenizer itr = new StringTokenizer(chapter);
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
I can get the file name by
String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
But I do not understand how to format the way I want. I've been looking into OutputCollector, but I am unsure of how to use it exactly.
EDIT: This is my mapper and reducer
public static class TokenizerMapper
extends Mapper<Object, Text, Text, Text>{
private Text word = new Text();
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
//Take out all non letters and make all lowercase
String chapter = value.toString();
chapter = chapter.toLowerCase();
chapter = chapter.replaceAll("[^a-z]"," ");
//This is the file name
String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
StringTokenizer itr = new StringTokenizer(chapter);
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, new Text(fileName)); //
}
}
}
public static class IntSumReducer
extends Reducer<Text,Text,Text,Text> {
public void reduce(Text key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
Map<String, Integer> files = new HashMap<String, Integer>();
for (Text val : values) {
if (files.containsKey(val.toString())) {
files.put(val.toString(), files.get(val.toString())+1);
} else {
files.put(val.toString(), 1);
}
}
String outputString="";
for (String file : files.keySet()) {
outputString = outputString + "\n<" + file + ", " + files.get(file) + ">"; //files.get(file)
}
context.write(key, new Text(outputString));
}
}
This is outputting for the word "a" for example:
a
(
(chap02, 53), 1)
(
(chap18, 50), 1)
I am unsure why it's making a key-value pair the key for a value of 1 for each entry.
I don't think you need a custom output format at all for this. So long as you pass the filename along to the reducer, you should be able to do this simply by modifying the String that you use within a TextOutputFormat type operation. Explanation is below.
In the mapper, get the file name and emit it as the value for each word, as below
String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
context.write(word, new Text(fileName));
Then in the reducer do something like the following:
public void reduce(Text key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
Map<String, Integer> files = new HashMap<String, Integer>();
for (Text val : values) {
if (files.containsKey(val.toString())) {
files.put(val.toString(), files.get(val.toString()) + 1);
} else {
files.put(val.toString(), 1);
}
}
String outputString = key.toString();
for (String file : files.keySet()) {
outputString += "\n( " + file + ", " + files.get(file) + ")";
}
context.write(key, new Text(outputString));
}
This reducer appends "\n" to the beginning of every line, in order to force the display formatting to be exactly what you want.
This seems much simpler than writing your own outputformat.
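One possible explanation for the nested "((chap02, 53), 1)" output in the question, assuming the driver (which is not shown) also registers the reducer as a combiner the way the stock WordCount example does: the combiner's formatted string is then fed back into the same reduce logic and counted as if it were a file name. The sketch below reproduces that in plain Java. PerFileCount is a hypothetical name, and it only mirrors the counting and formatting step, without the Hadoop types and without the leading key.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PerFileCount {

    // Counts how often each value occurs and formats one line per value,
    // mirroring the counting and formatting step of the reducer.
    static String reduce(List<String> values) {
        Map<String, Integer> files = new LinkedHashMap<>();
        for (String val : values) {
            files.merge(val, 1, Integer::sum);
        }
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, Integer> e : files.entrySet()) {
            out.append("\n( ").append(e.getKey()).append(", ").append(e.getValue()).append(")");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Fed directly by the mapper: one file name per occurrence of the word.
        String once = reduce(Arrays.asList("chap02", "chap02", "chap18"));
        System.out.println(once);

        // Fed by its own output, as would happen if it also ran as the combiner:
        // the whole formatted string becomes a single "file name" with count 1.
        String twice = reduce(Arrays.asList(once));
        System.out.println(twice);
    }
}
```

If that is what is happening, dropping the job.setCombinerClass(...) call, or using a combiner whose output has the same shape as the mapper's output, should be enough to avoid the nesting.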
I wrote one Hadoop word count program which takes TextInputFormat input and is supposed to output word count in avro format.
Map-Reduce job is running fine but output of this job is readable using unix commands such as more or vi. I was expecting this output be unreadable as avro output is in binary format.
I have used a mapper only; no reducer is present. I just want to experiment with Avro, so I am not worried about memory or stack overflow. The following is the code of the mapper:
public class WordCountMapper extends Mapper<LongWritable, Text, AvroKey<String>, AvroValue<Integer>> {
private Map<String, Integer> wordCountMap = new HashMap<String, Integer>();
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String[] keys = value.toString().split("[\\s-*,\":]");
for (String currentKey : keys) {
int currentCount = 1;
String currentToken = currentKey.trim().toLowerCase();
if(wordCountMap.containsKey(currentToken)) {
currentCount = wordCountMap.get(currentToken);
currentCount++;
}
wordCountMap.put(currentToken, currentCount);
}
System.out.println("DEBUG : total number of unique words = " + wordCountMap.size());
}
@Override
protected void cleanup(Context context) throws IOException, InterruptedException {
for (Map.Entry<String, Integer> currentKeyValue : wordCountMap.entrySet()) {
AvroKey<String> currentKey = new AvroKey<String>(currentKeyValue.getKey());
AvroValue<Integer> currentValue = new AvroValue<Integer>(currentKeyValue.getValue());
context.write(currentKey, currentValue);
}
}
}
and driver code is as follows :
public int run(String[] args) throws Exception {
Job avroJob = new Job(getConf());
avroJob.setJarByClass(AvroWordCount.class);
avroJob.setJobName("Avro word count");
avroJob.setInputFormatClass(TextInputFormat.class);
avroJob.setMapperClass(WordCountMapper.class);
AvroJob.setInputKeySchema(avroJob, Schema.create(Type.INT));
AvroJob.setInputValueSchema(avroJob, Schema.create(Type.STRING));
AvroJob.setMapOutputKeySchema(avroJob, Schema.create(Type.STRING));
AvroJob.setMapOutputValueSchema(avroJob, Schema.create(Type.INT));
AvroJob.setOutputKeySchema(avroJob, Schema.create(Type.STRING));
AvroJob.setOutputValueSchema(avroJob, Schema.create(Type.INT));
FileInputFormat.addInputPath(avroJob, new Path(args[0]));
FileOutputFormat.setOutputPath(avroJob, new Path(args[1]));
return avroJob.waitForCompletion(true) ? 0 : 1;
}
I would like to know what Avro output looks like and what I am doing wrong in this program.
The latest release of the Avro library includes an updated version of its ColorCount example adapted for MRv2. I suggest you look at it and either use the same pattern as its Reduce class or just extend AvroMapper. Please note that using the Pair class instead of AvroKey + AvroValue is also essential for running Avro on Hadoop.
When I launch my MapReduce program, I get this error:
java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.BytesWritable
at nflow.hadoop.flow.analyzer.Calcul$Calcul_Mapper.map(Calcul.java:1)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
The code of the mapper:
public static class Calcul_Mapper extends Mapper<LongWritable, BytesWritable, Text, Text>{
String delimiter="|";
long interval = 60*60 ;
Calendar cal;
public void map(LongWritable key, BytesWritable value, Context context) throws IOException, InterruptedException {
byte[] value_bytes = value.getBytes();
if(value_bytes.length < FlowWritable.MIN_PKT_SIZE + FlowWritable.PCAP_HLEN) return;
EZBytes eb = new EZBytes(value_bytes.length);
eb.PutBytes(value_bytes, 0, value_bytes.length);
// C2S key ==> protocol | srcIP | dstIP | sPort |dPort
long sys_uptime = Bytes.toLong(eb.GetBytes(FlowWritable.PCAP_ETHER_IP_UDP_HLEN+4,4));
long timestamp = Bytes.toLong(eb.GetBytes(FlowWritable.PCAP_ETHER_IP_UDP_HLEN+8,4))*1000000
+ Bytes.toLong(BinaryUtils.flipBO(eb.GetBytes(FlowWritable.PCAP_ETHER_IP_UDP_HLEN+12, 4),4));
int count = eb.GetShort(FlowWritable.PCAP_ETHER_IP_UDP_HLEN+2);
FlowWritable fw;
byte[] fdata = new byte[FlowWritable.FLOW_LEN];
int cnt_flows = 0;
int pos = FlowWritable.PCAP_ETHER_IP_UDP_HLEN+FlowWritable.CFLOW_HLEN;
try{
while(cnt_flows++ < count){
fw = new FlowWritable();
fdata = eb.GetBytes(pos, FlowWritable.FLOW_LEN);
if(fw.parse(sys_uptime, timestamp, fdata)){
context.write(new Text("Packet"), new Text(Integer.toString(1)));
context.write(new Text("Byte"), new Text(Integer.toString(1)));
context.write(new Text("Flow"), new Text(Integer.toString(1)));
context.write(new Text("srcPort"), new Text(Integer.toString(fw.getSrcport())));
context.write(new Text("dstPort"), new Text(Integer.toString(fw.getDstport())));
context.write(new Text("srcAddr"), new Text(fw.getSrcaddr()));
context.write(new Text("dstAddr"), new Text(fw.getDstaddr()));
}else{
}
pos += FlowWritable.FLOW_LEN;
}
} catch (NumberFormatException e) {
}
}
}
Does someone know what's wrong please?
Can you please check your Job configuration? Check those particularly:
conf.setOutputKeyClass(Something.class);
conf.setOutputValueClass(Something.class);
And by the way, since your keys are always fixed to a constant; you don't need to create them for each emit from your map function.
And I think it would be much better to have a custom key object that groups everything together. For this you need to extend ObjectWritable and implement WritableComparable.
Your writing/emitting looks highly suspicious to me.
Is your job receiving input from a plain file? If so your input value type should be Text rather than BytesWritable.
public static class Calcul_Mapper extends Mapper<LongWritable, Text, Text, Text>