Outputting single file for partitioner - java

I am trying to get as many reducers as the number of keys.
public class CustomPartitioner extends Partitioner<Text, Text>
{
    public int getPartition(Text key, Text value, int numReduceTasks)
    {
        System.out.println("In CustomP");
        return (key.toString().hashCode()) % numReduceTasks;
    }
}
Driver class
job6.setMapOutputKeyClass(Text.class);
job6.setMapOutputValueClass(Text.class);
job6.setOutputKeyClass(NullWritable.class);
job6.setOutputValueClass(Text.class);
job6.setMapperClass(LastMapper.class);
job6.setReducerClass(LastReducer.class);
job6.setPartitionerClass(CustomPartitioner.class);
job6.setInputFormatClass(TextInputFormat.class);
job6.setOutputFormatClass(TextOutputFormat.class);
But I am getting output in a single file.
Am I doing anything wrong?

You cannot control the number of reducers without specifying it :-). Even then, there is no guarantee that every key lands on a different reducer, because you do not know in advance how many distinct keys the input data contains, and your hash partition function may return the same number for two distinct keys. If you want to achieve this, you will have to know the number of distinct keys in advance and modify your partition function accordingly.

You need to specify a number of reduce tasks equal to the number of keys, and you also need to return the partition based on your key in the partitioner class. For example, if your input has keys such as wood, Masonry, Reinforced Concrete, and so on, then your getPartition method would look like this:
@Override
public int getPartition(Text key, PairWritable value, int numReduceTasks) {
    // PairWritable is the custom value type used in this example;
    // getone() returns the field that drives the partitioning
    String s = value.getone();
    if (numReduceTasks == 0) {
        return 0;
    }
    if (s.equalsIgnoreCase("wood")) {
        return 0;
    }
    if (s.equalsIgnoreCase("Masonry")) {
        return 1 % numReduceTasks;
    }
    if (s.equalsIgnoreCase("Reinforced Concrete")) {
        return 2 % numReduceTasks;
    }
    if (s.equalsIgnoreCase("Reinforced Masonry")) {
        return 3 % numReduceTasks;
    }
    return 4 % numReduceTasks;
}
The corresponding output will be collected in the respective reducers. Try running from the CLI instead of Eclipse.

You haven't configured the number of reducers to run.
You can configure it using the API below:
job.setNumReduceTasks(10); // change the number according to your cluster
You can also set it while executing from the command line:
-D mapred.reduce.tasks=10
Hope this helps.

Veni, you need to chain the tasks as below:
Mapper1 --> Reducer --> Mapper2 (a post-processing mapper which creates a file for each key)
Mapper2's InputFormat should be NLineInputFormat, so that there is one mapper per key in the reducer's output, and each mapper's output becomes a separate file for its key.
Mapper1 and the Reducer are your existing MR job.
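A rough sketch of what the post-processing job's driver could look like (the job7 variable, the Driver and LastMapper2 classes, the conf object, and the paths are assumptions for illustration; NLineInputFormat is configured to hand one line, i.e. one key, to each mapper):
Job job7 = Job.getInstance(conf, "one file per key");
job7.setJarByClass(Driver.class);              // hypothetical driver class
job7.setMapperClass(LastMapper2.class);        // identity-style post-processing mapper
job7.setNumReduceTasks(0);                     // map-only: each mapper writes its own part file
job7.setInputFormatClass(NLineInputFormat.class);
NLineInputFormat.setNumLinesPerSplit(job7, 1); // one line (one key) per mapper
job7.setOutputKeyClass(NullWritable.class);
job7.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job7, new Path("job6-output"));      // output of the existing MR job
FileOutputFormat.setOutputPath(job7, new Path("per-key-output"));
job7.waitForCompletion(true);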
Hope this helps.
Cheers
Nag

Related

How to append two values to one key using Redis Spring Data

I am using Redis to cache data in my project, but I have a problem. I have a Student model and write methods that put it into Redis. The first method, findStudentOneWeek, finds students for one week and puts them into the cache.
public void findStudentOneWeek(List<Student> students1) {
redistemplate.opsForHash().put("Student", student.getId(), List<Customers>);
}
The second method, findStudentOneDay, finds students for one day.
public void findStudentOneDay(List<Student> students2) {
redistemplate.opsForHash().put("Student", student.getId(), List<Customers>);
}
But I want the total of users over 8 days. That means I want to keep the single key Student, but with a new value equal to the total from findStudentOneWeek plus the total from findStudentOneDay. I don't know how to do this and can't find a method for it. I know Redis's put method, but it removes the old value and saves the new one, which is not what I want; I want the combined total.
Firstly, I assume a typo: the List<Customers> argument should be the list of students, e.g.:
redistemplate.opsForHash().put("Student", student.getId(), students1);
Spring Data's HashOperations works on a similar principle to HashMap: both let you look up a value by key (and a hash key, in the case of HashOperations). Read the current value, add it to the new list, and put the combined list back into the template:
List<Student> students = (List<Student>) redistemplate.opsForHash().get("Student", student.getId());
students2.addAll(students);
redistemplate.opsForHash().put("Student", student.getId(), students2);
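Put together, a minimal sketch of the combined update might look like the following (assuming redistemplate stores List<Student> values under the Student hash; the method and parameter names are made up for illustration):
@SuppressWarnings("unchecked")
public void appendStudents(Student student, List<Student> newStudents) {
    // read whatever is already cached for this student id
    List<Student> existing =
            (List<Student>) redistemplate.opsForHash().get("Student", student.getId());
    if (existing != null) {
        newStudents.addAll(existing); // merge the old entries into the new list
    }
    // overwrite the hash entry with the combined list
    redistemplate.opsForHash().put("Student", student.getId(), newStudents);
}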

Return debug information from Hadoop

I'm writing a Java application to run a MapReduce job on Hadoop. I've set up some local variables in my mapper/reducer classes but I'm not able to return the information to the main Java application. For example, if I set up a variable inside my Mapper class:
private static int nErrors = 0;
Each time I process a line from the input file, I increment the error count if the data is not formatted correctly. Finally, I define a get function for the errors and call this after my job is complete:
public static int GetErrors()
{
return nErrors;
}
But when I print out the errors at the end:
System.out.println("Errors = " + UPMapper.GetErrors());
This always returns "0" no matter what I do! If I start with nErrors = 12;, then the final value is 12. Is it possible to get information from the MapReduce functions like this?
UPDATE
Based on the suggestion from Binary Nerd, I implemented some Hadoop counters:
// Define this enumeration in your main class
public static enum MyStats
{
MAP_GOOD_RECORD,
MAP_BAD_RECORD
}
Then inside the mapper:
if (SomeCheckOnTheInputLine())
{
// This record is good
context.getCounter(MyStats.MAP_GOOD_RECORD).increment(1);
}
else
{
// This record has failed in some way...
context.getCounter(MyStats.MAP_BAD_RECORD).increment(1);
}
Then in the output stream from Hadoop I see:
MAP_BAD_RECORD=11557
MAP_GOOD_RECORD=8676
Great! But the question still stands, how do I get those counter values back into the main Java application?
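For what it's worth, counter values can also be read back in the driver once the job has finished, roughly like this (a sketch assuming job is the Job instance used to submit the work):
job.waitForCompletion(true);
Counters counters = job.getCounters();
long good = counters.findCounter(MyStats.MAP_GOOD_RECORD).getValue();
long bad  = counters.findCounter(MyStats.MAP_BAD_RECORD).getValue();
System.out.println("Good records = " + good + ", bad records = " + bad);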

Hadoop Custom Partitioner not behaving according to the logic

Based on the example here, which works, I have tried the same on my dataset.
Sample Dataset:
OBSERVATION;2474472;137176;
OBSERVATION;2474473;137176;
OBSERVATION;2474474;137176;
OBSERVATION;2474475;137177;
Considering each line as a string, my Mapper output is:
key -> string[2], value -> the whole string.
My Partitioner code:
@Override
public int getPartition(Text key, Text value, int reducersDefined) {
    String keyStr = key.toString();
    if(keyStr == "137176") {
        return 0;
    } else {
        return 1 % reducersDefined;
    }
}
In my dataset most IDs are 137176. Reducers declared: 2. I expect two output files, one for 137176 and a second for the remaining IDs. I am getting two output files, but the IDs are evenly distributed across both output files. What's going wrong in my program?
Explicitly set in the Driver class that you want to use your custom Partitioner, using job.setPartitionerClass(YourPartitioner.class);. If you don't do that, the default HashPartitioner is used.
Change the String comparison from == to .equals(), i.e. change if(keyStr == "137176") { to if(keyStr.equals("137176")) {.
To save some time, it may be faster to declare a Text variable once at the beginning of the partitioner, like Text KEY = new Text("137176");, and then compare the input key directly against KEY (again with equals()) instead of converting it to a String every time. But perhaps those are equivalent. So, what I suggest is:
Text KEY = new Text("137176");

@Override
public int getPartition(Text key, Text value, int reducersDefined) {
    return key.equals(KEY) ? 0 : 1 % reducersDefined;
}
Another suggestion: if the network load is heavy, emit the map output key as a VIntWritable and change the Partitioner accordingly.
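For illustration only, a partitioner along those lines might look like this (assuming the mapper is changed to emit VIntWritable keys):
@Override
public int getPartition(VIntWritable key, Text value, int reducersDefined) {
    // 137176 goes to reducer 0, everything else to reducer 1
    return key.get() == 137176 ? 0 : 1 % reducersDefined;
}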

Spark - save RDD to multiple files as output

I have a JavaRDD<Model> which I need to write as more than one file, each with a different layout [one or two fields in the RDD will differ between layouts].
When I use saveAsTextFile(), it calls the toString() method of Model, which means the same layout is written to every output.
Currently I iterate the RDD with a map transformation and return a different model with the other layout, so that I can use the saveAsTextFile() action to write a different output file.
Just because one or two fields are different, I have to iterate over the entire RDD again, create a new RDD, and then save it as an output file.
For example:
Current RDD with fields:
RoleIndicator, Name, Age, Address, Department
Output File 1:
Name, Age, Address
Output File 2:
RoleIndicator, Name, Age, Department
Is there any optimal solution for this?
Regards,
Shankar
You want to use foreach, not collect.
You should define your function as an actual named class that implements VoidFunction. Create instance variables for both files, and add a close() method that closes them. Your call() implementation will write whatever you need.
Remember to call close() on your function object after you're done.
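A very rough sketch of that idea; the file paths and the getters on Model (getName(), getAge(), getAddress(), getRoleIndicator(), getDepartment()) are assumptions, and with foreach each executor writes to its own local copy of the files, so treat this as an outline of the approach rather than a drop-in implementation:
import java.io.FileWriter;
import java.io.PrintWriter;
import org.apache.spark.api.java.function.VoidFunction;

class TwoLayoutWriter implements VoidFunction<Model> {
    private transient PrintWriter file1;
    private transient PrintWriter file2;

    @Override
    public void call(Model m) throws Exception {
        if (file1 == null) { // lazily open the files where the function runs
            file1 = new PrintWriter(new FileWriter("layout1.txt", true));
            file2 = new PrintWriter(new FileWriter("layout2.txt", true));
        }
        // Output file 1: Name, Age, Address
        file1.println(m.getName() + "," + m.getAge() + "," + m.getAddress());
        // Output file 2: RoleIndicator, Name, Age, Department
        file2.println(m.getRoleIndicator() + "," + m.getName() + "," + m.getAge() + "," + m.getDepartment());
    }

    public void close() {
        if (file1 != null) { file1.close(); }
        if (file2 != null) { file2.close(); }
    }
}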
It is possible with a pair RDD.
A pair RDD can be stored in multiple files in a single pass by using a custom Hadoop output format:
rdd.saveAsHadoopFile(path, Text.class, Text.class, FileGroupingTextOutputFormat.class, jobConf);
public class FileGroupingTextOutputFormat extends MultipleTextOutputFormat<Text, Text> {

    @Override
    protected Text generateActualKey(Text key, Text value) {
        return new Text(); // the key is not written into the output file
    }

    @Override
    protected Text generateActualValue(Text key, Text value) {
        return value;
    }

    // returns a dynamic file name based on each RDD element;
    // getSomeField() is a placeholder for however you extract the grouping field from the value
    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        return value.getSomeField() + "-" + name;
    }
}
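As a hedged usage sketch (models as a JavaRDD<Model>, the /out path, and the getRoleIndicator() getter are all assumptions): key each record, carry the full record as a Text value, and save through the grouping format above, which derives the file name from the value.
JavaPairRDD<Text, Text> pairs = models.mapToPair(m ->
        new Tuple2<>(new Text(m.getRoleIndicator()), new Text(m.toString())));
pairs.saveAsHadoopFile("/out", Text.class, Text.class,
        FileGroupingTextOutputFormat.class, new JobConf());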

Using MapReduce to analyze log file

Here is a log file:
2011-10-26 06:11:35 user1 210.77.23.12
2011-10-26 06:11:45 user2 210.77.23.17
2011-10-26 06:11:46 user3 210.77.23.12
2011-10-26 06:11:47 user2 210.77.23.89
2011-10-26 06:11:48 user2 210.77.23.12
2011-10-26 06:11:52 user3 210.77.23.12
2011-10-26 06:11:53 user2 210.77.23.12
...
I want to use MapReduce to sort by the number of log entries for the third field (user), in descending order. In other words, I want the result to be displayed as:
user2 4
user3 2
user1 1
Now I have two questions:
By default, MapReduce will split the log file on spaces and line breaks, but I only need the third field of each line; I don't care about fields such as 2011-10-26, 06:11:35, and 210.77.23.12. How do I make MapReduce skip them and pick up only the user field?
By default, MapReduce sorts the result by key rather than by value. How do I make MapReduce sort the result by value (the number of log entries)?
Thank you.
For your first question: you should probably pass the whole line to the mapper, keep just the third token, and emit (user, 1) every time.
public class AnalyzeLogs
{
    public static class FindFriendMapper extends Mapper<Object, Text, Text, IntWritable> {

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException
        {
            String tempStrings[] = value.toString().split(" "); // the log fields are space-separated
            context.write(new Text(tempStrings[2]), new IntWritable(1));
        }
    }
For your second question, I believe you cannot avoid a second MR job after that (I cannot think of any other way). The reducer of the first job will simply aggregate the values and emit a sum for each key, sorted by key, which is not yet what you need.
So you pass the output of that job as input to a second MR job. The objective of this job is to do a somewhat special sort by value before handing records to the reducers (which will do absolutely nothing).
Our Mapper for the second job will be the following:
public static class SortLogsMapper extends Mapper<Object, Text, Text, NullWritable> {

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException
    {
        context.write(value, NullWritable.get()); // the whole input line becomes the output key
    }
}
As you can see, this mapper does not really process its input at all; instead, the whole input line becomes the output key, so the key is in key1 value1 format (for example user2 4).
What remains to be done is to tell the framework to sort based on value1 rather than the whole key1 value1. So we implement a custom SortComparator:
public static class LogDescComparator extends WritableComparator
{
    protected LogDescComparator()
    {
        super(Text.class, true);
    }

    @Override
    public int compare(WritableComparable w1, WritableComparable w2)
    {
        Text t1 = (Text) w1;
        Text t2 = (Text) w2;
        String[] t1Items = t1.toString().split(" "); // the first job's output may separate key and value with a tab instead
        String[] t2Items = t2.toString().split(" ");
        String t1Value = t1Items[1];
        String t2Value = t2Items[1];
        // compare the "real" value part of our synthetic key, in descending order
        int comp = t2Value.compareTo(t1Value);
        return comp;
    }
}
You can set your custom comparator with job.setSortComparatorClass(LogDescComparator.class);
The reducer of this job should do nothing. However, if the job runs without a reduce phase, the sorting of the mapper keys will not happen (and we need it). So set an identity reducer for the second MR job: it performs no reduction but still ensures that the mapper's synthetic keys are sorted in the way we specified.
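Tying it together, a minimal sketch of the second job's driver might look like this (the conf variable, class names, and paths are assumptions based on the snippets above):
Job sortJob = Job.getInstance(conf, "sort users by log count");
sortJob.setJarByClass(AnalyzeLogs.class);
sortJob.setMapperClass(SortLogsMapper.class);
sortJob.setReducerClass(Reducer.class);                  // identity reducer: keeps the sort/shuffle phase
sortJob.setSortComparatorClass(LogDescComparator.class);
sortJob.setOutputKeyClass(Text.class);
sortJob.setOutputValueClass(NullWritable.class);
FileInputFormat.addInputPath(sortJob, new Path("job1-output"));
FileOutputFormat.setOutputPath(sortJob, new Path("job2-output"));
sortJob.waitForCompletion(true);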
