I will be doing the following on a much bigger file; for now, I have an example input file with the following values.
1000,SMITH,JERRY
1001,JOHN,TIA
1002,TWAIN,MARK
1003,HARDY,DENNIS
1004,CHILD,JACK
1005,CHILD,NORTON
1006,DAVIS,JENNY
1007,DAVIS,KAREN
1008,MIKE,JOHN
1009,DENNIS,SHERIN
What I am doing now is running a MapReduce job to encrypt the last name of each record and write the result back out. I am using the mapper's partition number as the key and the modified text as the value.
So the output from the mapper will be:
0 1000,Mj4oJyk=,JERRY
0 1001,KzwpPQ,TIA
0 1002,NSQgOi8,MARK
0 1003,KTIzNzg,DENNIS
0 1004,IjsoPyU,JACK
0 1005,IjsoPyU,NORTON
0 1006,JTI3OjI,JENNY
0 1007,JTI3OjI,KAREN
0 1008,LDoqNg,JOHN
0 1009,JTYvPSgg,SHERIN
I don't want any sorting to be done. I also use a reducer because, with a larger file, there will be multiple mappers, and with no reducer, multiple output files would be written. So I use a single reducer to merge the values from all mappers and write them to a single file.
Now the input values arrive at the reducer in the reverse of the order emitted by the mapper, like the following:
1009,JTYvPSgg,SHERIN
1008,LDoqNg==,JOHN
1007,JTI3OjI=,KAREN
1006,JTI3OjI=,JENNY
1005,IjsoPyU=,NORTON
1004,IjsoPyU=,JACK
1003,KTIzNzg=,DENNIS
1002,NSQgOi8=,MARK
1001,KzwpPQ==,TIA
1000,Mj4oJyk=,JERRY
Why is it reversing the order, and how can I maintain the same order as the mapper? Any suggestions would be helpful.
EDIT 1:
The driver code is:
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
job.setJobName("encrypt");
job.setJarByClass(TestDriver.class);
job.setMapperClass(TestMap.class);
job.setNumReduceTasks(1);
job.setReducerClass(TestReduce.class);
job.setMapOutputKeyClass(IntWritable.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(hdfsInputPath));
FileOutputFormat.setOutputPath(job, new Path(hdfsOutputPath));
System.exit(job.waitForCompletion(true) ? 0 : 1);
The mapper code is:
inputValues = value.toString().split(",");
stringBuilder = new StringBuilder();
TaskID taskId = context.getTaskAttemptID().getTaskID();
int partition = taskId.getId();
// the mask(inputValues) method encrypts the input values and writes them to stringBuilder in the appropriate format
mask(inputValues);
context.write(new IntWritable(partition), new Text(stringBuilder.toString()));
The reducer code is:
for(Text value : values) {
context.write(new Text(value), null);
}
The base idea of MapReduce is that the order in which things are done is irrelevant.
So you cannot (and do not need to) control the order in which the input records go through the mappers, or the order in which the keys and their related values go through the reducers.
The only thing you can control is the order in which the values are placed in the iterator that is made available in the reducer.
For that you can use the Object key to maintain the order of values.
The LongWritable part (the key) is the position of the line in the file (not the line number, but the byte offset from the start of the file).
You can use that part to keep track of which line was first.
Then your mapper code would change to:
protected void map(Object key, Text value, Mapper<Object, Text, LongWritable, Text>.Context context)
        throws IOException, InterruptedException {
    inputValues = value.toString().split(",");
    stringBuilder = new StringBuilder();
    // the mask(inputValues) method encrypts the input values and writes them to stringBuilder in the appropriate format
    mask(inputValues);
    // the incoming key is the byte offset of the line, so emitting it as the output key preserves the file order
    context.write(new LongWritable(((LongWritable) key).get()), new Text(stringBuilder.toString()));
}
Note: you can change all IntWritable to LongWritable in your code but be careful.
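For completeness, here is a minimal sketch of the driver changes that would go with this mapper, assuming the rest of the driver from EDIT 1 stays the same; only the map output key type changes:
job.setMapperClass(TestMap.class);
job.setNumReduceTasks(1);
job.setReducerClass(TestReduce.class);
// the map output key is now the byte offset of the input line, so it must be LongWritable
job.setMapOutputKeyClass(LongWritable.class);
job.setMapOutputValueClass(Text.class);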
inputValues = value.toString().split(",");
stringBuilder = new StringBuilder();
// preserve the id number from the record as the key so the framework sorts on it
IntWritable idNumber = new IntWritable(Integer.parseInt(inputValues[0]));
// the mask(inputValues) method encrypts the input values and writes them to stringBuilder in the appropriate format
mask(inputValues);
context.write(idNumber, new Text(stringBuilder.toString()));
I made some assumptions because you did not post the full code of the mapper. I assumed that inputValues is a String array, based on the toString() output. The first element of that array should be the id number from your input, but it is now a String, so you must convert it back to an IntWritable to match what your mapper emits (IntWritable, Text). The Hadoop framework sorts by key, and with a key of type IntWritable it will sort in ascending order.

The code you provided uses the task ID, and from reading the API (https://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/mapred/TaskAttemptID.html#getTaskID()) it was unclear whether that would give your values the order you want. To control the order of the output, I would recommend using the first element of your String array converted to an IntWritable. I don't know whether this conflicts with your intent to mask the inputValues.
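If you go with the id-based key, a minimal reducer sketch along these lines (assuming a single reduce task, as in your driver) would write the records back out in ascending id order:
protected void reduce(IntWritable key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    // the ids arrive in ascending order, so writing the values straight through
    // restores the original file order
    for (Text value : values) {
        context.write(value, null);
    }
}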
EDIT
To follow up on your comment: you can simply multiply the partition by -1; this will cause the Hadoop framework to reverse the order.
int partition = -1*taskId.getId();
Related
I have a csv file that consists of data in this format:
id, name, surname, morecolumns
5, John, Lok, more
2, John2, Lok2, more
1, John3, Lok3, more
etc..
I want to sort my csv file using the id as key and store the sorted results in another file.
Here is what I've done so far in order to create JavaPairs of (id, rest_of_line):
SparkConf conf = new SparkConf().setAppName.....;
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> file = sc.textFile("inputfile.csv");
// extract the header
JavaRDD<String> lines = file.filter(s -> !s.equals(header));
// create JavaPairs
JavaPairRDD<Integer, String> pairRdd = lines.mapToPair(
new PairFunction<String, Integer, String>() {
public Tuple2<Integer, String> call(final String line) {
String str = line.split(",", 2)[0];
String str2 = line.split(",", 2)[1];
int id = Integer.parseInt(str);
return new Tuple2(id, str2);
}
});
// sort and save the output
pairRdd.sortByKey(true, 1);
pairRdd.coalesce(1).saveAsTextFile("sorted.csv");
This works when I have small files. However, when I use bigger files, the output is not sorted properly. I think this happens because the sorting takes place on different nodes, so merging the results from all the nodes doesn't give the expected output.
So, the question is: how can I sort my csv file using the id as key and store the sorted results in another file?
The coalesce method is probably the one to blame, as it apparently does not contractually guarantee the ordering of the resulting RDD (see Which operations preserve RDD order?). So if you avoid such a coalesce, the resulting output files will be ordered.
As you want a single csv file, you could retrieve the results from whatever filesystem you're using, taking care of their actual order, and merge them. For example, if you're using HDFS (as stated by @PinoSan), this can be done using the command hdfs dfs -getmerge <hdfs-output-dir> <local-file.csv>.
As pointed out by @mauriciojost, you should not do the coalesce.
Instead, a better way to do this is pairRdd.sortByKey(true, pairRdd.getNumPartitions()).saveAsTextFile(path), so that the maximum possible work is carried out on the partitions that hold the data.
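Putting the two suggestions together, a sketch of the corrected tail of the code could look like this (note that sortByKey returns a new RDD, so its result has to be captured; "sorted-output" is just an illustrative path, and getNumPartitions() assumes a reasonably recent Spark version):
JavaPairRDD<Integer, String> sorted = pairRdd.sortByKey(true, pairRdd.getNumPartitions());
sorted.saveAsTextFile("sorted-output");
// then merge the part files in order outside Spark, e.g. on HDFS:
// hdfs dfs -getmerge sorted-output sorted.csv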
I want to be able to set different separators for the key/value pairs which I receive in the map function of my MR job.
For example my text file might have:
John-23
Mary-45
Scott-13
and in my map function I want the key to be John and the value to be 23 etc for each element.
Then if I set the output separator using
conf.set("mapreduce.textoutputformat.separator", "-");
Will the reducer pick up the key as everything up to the first '-' and the value as everything after that, or do I need to make changes to the reducer as well?
Thanks
Reading
In case you use the org.apache.hadoop.mapreduce.lib.input.TextInputFormat, you can simply use a String#split in the Mapper.
#Override
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String[] keyValue = value.toString().split("-");
// would emit John -> 23 as a text
context.write(new Text(keyValue[0]), new Text(keyValue[1]));
}
Writing
In case you output it that way:
Text key = new Text("John");
LongWritable value = new LongWritable(23);
// of course key and value can come from the reduce method itself,
// I just want to illustrate the types
context.write(key, value);
Yes, the TextOutputFormat takes care of writing that in your desired format:
John-23
The only trap I came across in Hadoop 2.x (YARN), already answered here, is that the property was renamed to mapreduce.output.textoutputformat.separator.
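So a minimal sketch of the driver-side configuration under Hadoop 2.x would be (everything apart from the property name is standard boilerplate):
Configuration conf = new Configuration();
// Hadoop 2.x / YARN name for the separator TextOutputFormat writes between key and value
conf.set("mapreduce.output.textoutputformat.separator", "-");
Job job = Job.getInstance(conf);
job.setOutputFormatClass(TextOutputFormat.class);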
I am very new to Hadoop. I have written a MapReduce program which parses an input file and extracts a specific pattern as the key along with its value.
I can easily reduce it, and the final output is a file with pairs of keys and values.
public class EReducer extends MapReduceBase implements Reducer<Text, Text, Text, Text>
{
private Text outputKey1 = new Text();
private Text outputValue1 = new Text();
public void reduce(Text equipKey1, Iterator<Text> equipValues1,
OutputCollector<Text, Text> results1, Reporter reporter1) throws IOException {
String output1 = "";
while(equipValues1.hasNext())
{
Text equi= equipValues1.next();
output1 = output1 + equi.toString();
}
outputKey1.set(equipKey1.toString());
outputValue1.set(output1);
results1.collect(outputKey1, outputValue1);
}
}
The problem is, at the start of the file I need to show the total number of keys and the total number of values for a particular key as an aggregate.
Key: Date
Value: Happenings.
something like
12/03/2013 CMB ate pizza
He went to a mall
He met his friend
There were 3 happenings in total on the date 12/03/2013.
Likewise, there will be a set of dates and happenings.
Finally I should show: there were "this number of actions" on the date "date", e.g.
there were 3 actions on the date 12/03/2013
etc.
How can I achieve this?
Any help would be appreciated!
Not sure if this is the direct answer, but I would not store aggregates along with the output. Consider Pig to get the aggregates; it fits well for this use case.
Also, I did not understand the "start of file" question. A reducer task could have more than one key and its values to work with, so your file "part-r-00000" would look like:
12/01/2012 something something1 something2
12/02/2012 abc abc1 abc2
But I would lean towards storing just the data emitted from the reducer, without aggregating it, and using Pig to run through it to get the count you need (you would have to implement your own UDF to parse your events, which is very simple).
Just a possible snippet:
a = LOAD '/path/to/mroutput' as (dt:chararray, evdata:chararray);
b = foreach a generate dt, com.something.EVParser(evdata) as numberofevents;
store b into '/path/to/aggregateddata';
I have inputs from two sources:
map output in the form,
output.collect(new StockKey(new Text(x + " " + id), new Text(id2)), new Text(data));
map output in the form,
output.collect(new StockKey(new Text(x + " " + id), new Text("1")), new Text(data));
Job conf:
conf.setPartitionerClass(CustomPartitioner.class);
conf.setOutputValueGroupingComparator(StockKeyGroupingComparator.class);
where StockKey is a custom key class made up of two Text fields (symbol and timestamp).
Constructor:
public StockKey(){
this.symbol = new Text();
this.timestamp = new Text();
}
Grouping comparator:
public class StockKeyGroupingComparator extends WritableComparator {
protected StockKeyGroupingComparator() {
super(StockKey.class, true);
}
public int compare(WritableComparable w1, WritableComparable w2){
StockKey k1 = (StockKey)w1;
StockKey k2 = (StockKey)w2;
Text x1 = new Text(k1.getSymbol());
Text x2 = new Text(k2.getSymbol());
return x1.compareTo(x2);
}
}
But I'm not receiving the map output values from both inputs; only one map output's values reach the reducer. I want the records which share the symbol, viz. new Text(x+" "+id), and are common to both map outputs to be grouped to the same reducer. I am stuck here.
Please help!
To do this you need a Partitioner which fits in as follows:
Your mappers output a bunch of records as key/value pairs
For each record, the partitioner is passed the key, the value and the number of reducers. The partitioner decides which reducer will handle the record
The records are shipped off to their respective partitions (reducers)
The GroupingComparator is run to decide which key/value pairs get grouped into an iterable for a single call to the reduce() method
and so on...
I think the default partitioner is choosing the reducer partition for each record based on the entire value of your key (that's the default behavior). But you want records grouped by only part of the key (just the symbol and not the symbol and timestamp). So you need to write a partitioner that does this and specify/configure it in the driver class.
Once you do that, your grouping comparator should help group the records as you intended.
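A sketch of what such a partitioner could look like with the old mapred API you are using, assuming StockKey exposes the getSymbol() accessor already used in your grouping comparator:
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class CustomPartitioner implements Partitioner<StockKey, Text> {

    @Override
    public void configure(JobConf job) {
        // no configuration needed
    }

    @Override
    public int getPartition(StockKey key, Text value, int numPartitions) {
        // partition on the symbol only, so records sharing a symbol always
        // reach the same reducer regardless of the second field
        return (key.getSymbol().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}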
EDIT: random thoughts
You might make things easier on yourself if you moved the timestamp to the value, making the key simple (just the symbol) and the value complex (timestamp and value). Then you wouldn't need a partitioner or a grouping comparator.
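For instance, the map output could then look roughly like this (a sketch; the timestamp variable and the tab separator are only illustrative):
// simple key: just the symbol; complex value: the timestamp plus the original data
output.collect(new Text(x + " " + id), new Text(timestamp + "\t" + data));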
You didn't say either way, but you did use the MultipleInputs class, right? That's the only way to invoke two or more mappers for the same job.
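For reference, wiring two mappers into one job with MultipleInputs looks roughly like this (the paths and mapper class names are only placeholders):
// each input path gets its own mapper class within the same job
MultipleInputs.addInputPath(conf, new Path("input1"), TextInputFormat.class, FirstMapper.class);
MultipleInputs.addInputPath(conf, new Path("input2"), TextInputFormat.class, SecondMapper.class);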
I am trying to find a way to "loop" my reducer, for example:
for(String document: tempFrequencies.keySet())
{
if(list.get(0).equals(document))
{
testMap.put(key.toString(), DF.format(tfIDF));
}
}
// This allows me to create a HashMap which I plan to write out to the context as filename = key, then all of the term weights = value (a list I can parse out in the next job)
The code currently runs through the entire reduce and gives me what I want for list.get(0), but the problem is that once it has finished that entire reduce, I need it to start again for list.get(1), and so on. Any ideas on how to loop the reduce phase after it has finished?
Nest the for loop
for (int i = 0; i < number_of_times; i++) {
    // your code
}
Replace the 0 with i.
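Applied to your snippet, that would look roughly like this (a sketch, keeping your variable names):
for (int i = 0; i < list.size(); i++) {
    for (String document : tempFrequencies.keySet()) {
        if (list.get(i).equals(document)) {
            testMap.put(key.toString(), DF.format(tfIDF));
        }
    }
}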
You can use the key-tag-value technique.
In the mapper, emit (key, 0, value) for list values and (key, 1, value) for documents (?). In the reducer, values will be grouped by key and tag, and sorted by tag for each key. You should write your own grouping comparator (and a custom partitioner).
P.S. I am using the same technique for graph processing. I can provide sample code after the weekend.
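In the meantime, a rough sketch of what such a tagged composite key could look like (the class and field names are only illustrative; a grouping comparator and partitioner would look only at the key part, while compareTo below also orders by the tag):
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

public class TaggedKey implements WritableComparable<TaggedKey> {
    private Text key = new Text();
    private IntWritable tag = new IntWritable();

    public void set(String k, int t) {
        key.set(k);
        tag.set(t);
    }

    public Text getKey() {
        return key;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        key.write(out);
        tag.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        key.readFields(in);
        tag.readFields(in);
    }

    @Override
    public int compareTo(TaggedKey other) {
        // sort by key first, then by tag, so tag 0 values come before tag 1 values
        int cmp = key.compareTo(other.key);
        return cmp != 0 ? cmp : tag.compareTo(other.tag);
    }
}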