I want to be able to set different separators for the key/value pairs that I receive in the map function of my MR job.
For example my text file might have:
John-23
Mary-45
Scott-13
and in my map function I want the key to be John and the value to be 23, and so on for each line.
Then if I set the output separator using
conf.set("mapreduce.textoutputformat.separator", "-");
Will the reducer pick up the key as everything before the first '-' and the value as everything after it, or do I need to make changes to the reducer as well?
Thanks
Reading
If you use org.apache.hadoop.mapreduce.lib.input.TextInputFormat, you can simply use String#split in the Mapper.
@Override
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    String[] keyValue = value.toString().split("-");
    // would emit John -> 23 as Text
    context.write(new Text(keyValue[0]), new Text(keyValue[1]));
}
Writing
If you output it that way:
Text key = new Text("John");
LongWritable value = new LongWritable(23);
// of course key and value can come from the reduce method itself,
// I just want to illustrate the types
context.write(key, value);
Yes, the TextOutputFormat takes care of writing that in your desired format:
John-23
The only trap I came across in Hadoop 2.x (YARN), as already answered here, is that the property was renamed to mapreduce.output.textoutputformat.separator.
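For reference, a minimal driver sketch setting the Hadoop 2.x property name mentioned above (the rest of the job setup is assumed to be unchanged):
Configuration conf = new Configuration();
// Hadoop 2.x (YARN) name of the TextOutputFormat separator property
conf.set("mapreduce.output.textoutputformat.separator", "-");
Job job = Job.getInstance(conf);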
Related
I am working with Hadoop 2.6.4 and I am trying to implement a stripes mapper for word co-occurrences. I am running into an issue when using the MapWritable class: every new key/value I add to the map replaces every other key in the map with itself.
For example, let's say I have a sentence like
"This is a sentence with two a letters"
On the first pass, I am looking at the co-occurrences for the word "This", so the expected mapper output would be
<is,1>
<a,2>
<sentence,1>
<with,1>
<two,1>
<letters,1>
But what actually happens is that on each iteration, as the subsequent words are added, ALL keys/values are replaced with the last key that was added. The actual result I am seeing is the following.
<letters,1>
<letters,1>
<letters,1>
<letters,1>
<letters,1>
<letters,1>
I have created a method to convert a HashMap to a MapWritable, and this is where the issue occurs. Here is the code I am using. I added print statements to make sure the values I am adding are correct (they are), and then printed the keys to see what happens as I add them. That is how I saw that every existing key is replaced each time a new one is added.
According to all documentation I have looked at, I am using MapWritable.put() properly, and it should simply be adding to the map or updating the value, as it would with a generic HashMap. I am at a loss as to what is causing this.
public static MapWritable toMapWritable(HashMap<String, Integer> map) {
    MapWritable mw = new MapWritable();
    Text key = new Text();
    IntWritable val = new IntWritable();
    for (String it : map.keySet()) {
        key.set(it.toString());
        System.out.println("Setting Key: " + key.toString());
        val.set(map.get(it));
        System.out.println("Setting Value: " + map.get(key.toString()));
        mw.put(key, val);
        for (Writable itw : mw.keySet()) {
            System.out.println("Actual mw Key " + itw.toString());
        }
    }
    return mw;
}
You are calling key.set() repeatedly, but you have only allocated one Text object. This is basically what you are doing:
Text key = new Text();
key.set("key1");
key.set("key2");
System.out.println(key); // prints 'key2'
I believe you might be implementing the common pattern of reusing objects in a Map/Reduce job. However, that hinges upon calling context.write(). For instance:
private Text word = new Text();
private IntWritable count = new IntWritable(1);

public void map(LongWritable offset, Text line, Context context) throws IOException, InterruptedException {
    for (String s : line.toString().split(" ")) {
        word.set(s);
        context.write(word, count); // Text gets serialized here
    }
}
In the above example, the Map/Reduce framework serializes that Text to bytes and saves them behind the scenes. That's why you are free to reuse the Text object. MapWritable, however, does not do the same thing: you need to create new key and value objects each time.
MapWritable mw = new MapWritable();
mw.put(new Text("key1"), new Text("value1"));
mw.put(new Text("key2"), new Text("value2"));
I will be doing the following on a much bigger file; for now, I have an example input file with the following values.
1000,SMITH,JERRY
1001,JOHN,TIA
1002,TWAIN,MARK
1003,HARDY,DENNIS
1004,CHILD,JACK
1005,CHILD,NORTON
1006,DAVIS,JENNY
1007,DAVIS,KAREN
1008,MIKE,JOHN
1009,DENNIS,SHERIN
Now what I am doing is running a MapReduce job to encrypt the last name of each record and write the output back, using the mapper partition number as the key and the modified text as the value.
So the output from the mapper will be:
0 1000,Mj4oJyk=,JERRY
0 1001,KzwpPQ,TIA
0 1002,NSQgOi8,MARK
0 1003,KTIzNzg,DENNIS
0 1004,IjsoPyU,JACK
0 1005,IjsoPyU,NORTON
0 1006,JTI3OjI,JENNY
0 1007,JTI3OjI,KAREN
0 1008,LDoqNg,JOHN
0 1009,JTYvPSgg,SHERIN
I don't want any sorting to be done. I also use a reducer because, with a larger file, there will be multiple mappers and, without a reducer, multiple output files would be written; so I use a single reducer to merge the values from all mappers and write them to a single file.
Now the input values arrive at the reducer in reversed order, not in the order they were emitted by the mapper. It looks like the following:
1009,JTYvPSgg,SHERIN
1008,LDoqNg==,JOHN
1007,JTI3OjI=,KAREN
1006,JTI3OjI=,JENNY
1005,IjsoPyU=,NORTON
1004,IjsoPyU=,JACK
1003,KTIzNzg=,DENNIS
1002,NSQgOi8=,MARK
1001,KzwpPQ==,TIA
1000,Mj4oJyk=,JERRY
Why is it reversing the order, and how can I maintain the same order as from the mapper? Any suggestions would be helpful.
EDIT 1:
The driver code is:
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
job.setJobName("encrypt");
job.setJarByClass(TestDriver.class);
job.setMapperClass(TestMap.class);
job.setNumReduceTasks(1);
job.setReducerClass(TestReduce.class);
job.setMapOutputKeyClass(IntWritable.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(hdfsInputPath));
FileOutputFormat.setOutputPath(job, new Path(hdfsOutputPath));
System.exit(job.waitForCompletion(true) ? 0 : 1);
The mapper code is:
inputValues = value.toString().split(",");
stringBuilder = new StringBuilder();
TaskID taskId = context.getTaskAttemptID().getTaskID();
int partition = taskId.getId();
// the mask(inputValues) method encrypts the input values and writes them to stringBuilder in the appropriate format
mask(inputValues);
context.write(new IntWritable(partition), new Text(stringBuilder.toString()));
The reducer code is:
for (Text value : values) {
    context.write(new Text(value), null);
}
The base idea of MapReduce is that the order in which things are done is irrelevant.
So you cannot (and do not need to) control the order in which:
- the input records go through the mappers;
- the keys and related values go through the reducers.
The only thing you can control is the order in which the values are placed in the iterator that is made available in the reducer.
For that you can use the Object key to maintain the order of values.
The LongWritable part (the key) is the position of the line in the file (not the line number, but the byte offset from the start of the file).
You can use that part to keep track of which line was first.
Then your mapper code will be changed to
protected void map(Object key, Text value, Mapper<Object, Text, LongWritable, Text>.Context context)
        throws IOException, InterruptedException {
    inputValues = value.toString().split(",");
    stringBuilder = new StringBuilder();
    // the mask(inputValues) method encrypts the input values and writes them to stringBuilder in the appropriate format
    mask(inputValues);
    // use the byte offset of the input line as the key so the original file order is preserved
    context.write(new LongWritable(((LongWritable) key).get()), new Text(stringBuilder.toString()));
}
Note: you can change all IntWritable occurrences to LongWritable in your code, but be careful.
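Since the mapper now emits LongWritable keys, the map output key class set in the driver from the question would need to change accordingly; a minimal sketch:
job.setMapOutputKeyClass(LongWritable.class);
job.setMapOutputValueClass(Text.class);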
inputValues = value.toString().split(",");
stringBuilder = new StringBuilder();
TaskID taskId = context.getTaskAttemptID().getTaskID();
// preserve the numeric id from the record for sorting
IntWritable idNumber = new IntWritable(Integer.parseInt(inputValues[0]));
// the mask(inputValues) method encrypts the input values and writes them to stringBuilder in the appropriate format
mask(inputValues);
context.write(idNumber, new Text(stringBuilder.toString()));
I made some assumptions because you did not include the full mapper code. I assumed that inputValues is a String array, based on the toString() output. The first element of that array should be the numeric value from your input, but it is a String at that point; you must convert it back to an IntWritable to match what your mapper emits (IntWritable, Text). The Hadoop framework sorts by key, and with an IntWritable key it sorts in ascending order. The code you provided uses the task ID, and from reading the API (https://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/mapred/TaskAttemptID.html#getTaskID()) it was unclear whether that would order your values as you desire. To control the order of the output, I would recommend taking the first element of your string array and converting it to an IntWritable. I don't know whether this conflicts with your intent to mask the inputValues.
EDIT
To follow up on your comment: you can simply multiply the partition by -1; this will cause the Hadoop framework to reverse the order.
int partition = -1 * taskId.getId();
I am very new to Hadoop. I have written a MapReduce program which parses an input file and extracts a specific pattern as the key along with its value.
I can easily reduce it, and the final output is a file with pairs of keys and values.
public class EReducer extends MapReduceBase implements Reducer<Text, Text, Text, Text> {

    private Text outputKey1 = new Text();
    private Text outputValue1 = new Text();

    public void reduce(Text equipKey1, Iterator<Text> equipValues1,
            OutputCollector<Text, Text> results1, Reporter reporter1) throws IOException {
        String output1 = "";
        while (equipValues1.hasNext()) {
            Text equi = equipValues1.next();
            output1 = output1 + equi.toString();
        }
        outputKey1.set(equipKey1.toString());
        outputValue1.set(output1);
        results1.collect(outputKey1, outputValue1);
    }
}
The problem is that, at the start of the file, I need to show the total number of keys and the total number of values for a particular key as an aggregate.
Key: Date
Value: Happenings.
something like
12/03/2013 CMB ate pizza
He went to a mall
He met his friend
There were 3 happenings in total on the date 12/03/2013.
Likewise, there will be a set of dates and happenings.
Finally, I should show that there were "this number of actions" on the date "date":
there were 3 actions on the date 12/03/2013
etc.
How can I achieve this?
Any help would be appreciated!
Not sure if this is a direct answer, but I would not store aggregates along with the output. Consider Pig to compute the aggregates; it fits this use case well.
Also, I did not fully understand the "start of file" requirement. A reducer task can have more than one key (and its values) to work with, so your file "part-r-00000" would look like
12/01/2012 something something1 something2
12/02/2012 abc abc1 abc2
But I would lean towards storing just the data emitted from the reducer, without aggregating it, and using Pig to run through it to get the counts you need (you would have to implement your own UDF to parse your events, which is very simple).
Just a possible snippet:
a = LOAD '/path/to/mroutput' as (dt:chararray, evdata:chararray);
b = foreach a generate dt, com.something.EVParser(evdata) as numberofevents;
store b into '/path/to/aggregateddata';
I have inputs from two sources:
map output from the first source, in the form:
output.collect(new StockKey(new Text(x + " " + id), new Text(id2)), new Text(data));
map output from the second source, in the form:
output.collect(new StockKey(new Text(x + " " + id), new Text("1")), new Text(data));
Job conf:
conf.setPartitionerClass(CustomPartitioner.class);
conf.setValueGroupingComparatorClass(StockKeyGroupingComparator.class);
where StockKey is a custom class holding two Text fields (symbol and timestamp);
Constructor:
public StockKey() {
    this.symbol = new Text();
    this.timestamp = new Text();
}
Grouping comparator:
public class StockKeyGroupingComparator extends WritableComparator {

    protected StockKeyGroupingComparator() {
        super(StockKey.class, true);
    }

    public int compare(WritableComparable w1, WritableComparable w2) {
        StockKey k1 = (StockKey) w1;
        StockKey k2 = (StockKey) w2;

        Text x1 = new Text(k1.getSymbol());
        Text x2 = new Text(k2.getSymbol());

        return x1.compareTo(x2);
    }
}
But I'm not receiving the map output values from one of the inputs; only the map output values from the other input reach the reducer. I want the records which share the same symbol, i.e. new Text(x + " " + id), from both map outputs to be grouped to the same reducer. I am stuck here.
Please help!
To do this you need a Partitioner which fits in as follows:
- Your mappers output a bunch of records as key/value pairs.
- For each record, the partitioner is passed the key, the value, and the number of reducers. The partitioner decides which reducer will handle the record.
- The records are shipped off to their respective partitions (reducers).
- The GroupingComparator is run to decide which key/value pairs get grouped into an iterable for a single call to the reducer() method.
- and so on...
I think the default partitioner is choosing the reducer partition for each record based on the entire value of your key (that's the default behavior). But you want records grouped by only part of the key (just the symbol and not the symbol and timestamp). So you need to write a partitioner that does this and specify/configure it in the driver class.
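A minimal sketch of such a partitioner, assuming the old mapred API used in the question and the getSymbol() accessor already used by your grouping comparator:
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class CustomPartitioner implements Partitioner<StockKey, Text> {

    @Override
    public void configure(JobConf job) {
        // no configuration needed
    }

    @Override
    public int getPartition(StockKey key, Text value, int numPartitions) {
        // partition on the symbol only, so records sharing a symbol
        // from both map outputs land on the same reducer
        return (key.getSymbol().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}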
Once you do that, your grouping comparator should help group the records as you intended.
EDIT: random thoughts
You might make things easier on yourself if you moved the timestamp to the value, making the key simple (just the symbol) and the value complex (timestamp and value). Then you wouldn't need a partitioner or a grouping comparator.
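A rough sketch of that simpler layout, reusing the fields from your collect calls (x, id, id2, and data as in the question):
// first source: the symbol alone as the key; the former second key component (id2) moves into the value
output.collect(new Text(x + " " + id), new Text(id2 + "," + data));
// second source: same key shape, its own marker in the value
output.collect(new Text(x + " " + id), new Text("1," + data));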
You didn't say either way, but you did use the MultipleInputs class, right? That's the only way to invoke two or more mappers for the same job.
While writing test automation, I was required to leverage the APIs provided by the developers, and these APIs accept a HashMap as an argument. The test code involves calling several such APIs with a HashMap as the parameter, as shown below.
Map<String, String> testMap = new HashMap<String, String>();

void setName() {
    testMap.put("firstName", "James");
    testMap.put("lastName", "Bond");
    String fullName = devApi1.submitMap(testMap);
    testMap.put("realName", fullName);
}

void setAddress() {
    testMap.put("city", "London");
    testMap.put("country", "Britain");
    testMap.put("studio", "Hollywood");
    testMap.put("firstName", "");
    testMap.put("person", testMap.get("realName"));
    devApi2.submitMap(testMap);
}
However, the requirement is to print the testMap in both the setName and setAddress functions, but the map should print only those elements (key/value pairs), each on its own line, which were set in the respective function. I mean setName should print the 2 elements in the map which were set before the submitMap API was invoked, and similarly setAddress should print the 5 elements which were set before submitMap was invoked.
setName Output must be:
The data used for firstName is James.
The data used for lastName is Bond
setAddress Output must be:
The data used for city is London.
The data used for country is Britain.
The data used for studio is Hollywood.
The data used for firstName is null.
The data used for person is James Bond
Any help to achieve this?
I would probably write a helper function that would add items to the map and do the printing.
public static <K, V> void add(Map<K, V> map, K key, V value) {
    System.out.println(String.format("The data used for \"%s\" is \"%s\"", key, value));
    map.put(key, value);
}
If you need to print different messages, you could either use different helper functions or pass the format string as an argument.
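For instance, a minimal sketch of a variant that takes the format string as a parameter:
public static <K, V> void add(Map<K, V> map, String format, K key, V value) {
    // format is expected to contain two %s placeholders, e.g. "The data used for %s is %s"
    System.out.println(String.format(format, key, value));
    map.put(key, value);
}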
I would create a method which takes a comma-separated list of keys as an argument and prints only those.
Something like:
public void printKeys(Map<String, String> map, String csKeys) {
    for (String key : csKeys.split(",")) {
        if (map.containsKey(key)) {
            System.out.println("The data used for " + key + " is " + map.get(key));
        }
    }
}
and you can invoke it like:
printKeys(testMap, "firstName,lastName");
printKeys(testMap, "city,country,studio");
You'd better create a copy of your testMap when you invoke the submitMap method, since you don't have a flag to indicate which key-value pairs should be printed.
You could do it like
Map<String, String> printObj = new HashMap<String, String>();

void setName() {
    testMap.put("firstName", "James");
    testMap.put("lastName", "Bond");
    String fullName = devApi1.submitMap(testMap);
    printObj.putAll(testMap);
    testMap.put("realName", fullName);
}
Then print the printObj instead of testMap.
From your comments on my earlier answer, it seems you don't want the values put in one method to be displayed in the second one...
That can be done easily; just place
testMap.clear();
at the beginning of every method.
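A minimal sketch of where that call would go, using the setName method from the question:
void setName() {
    testMap.clear(); // drop anything left over from earlier calls
    testMap.put("firstName", "James");
    testMap.put("lastName", "Bond");
    String fullName = devApi1.submitMap(testMap);
    testMap.put("realName", fullName);
}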