Using MapReduce to analyze a log file - Java

Here is a log file:
2011-10-26 06:11:35 user1 210.77.23.12
2011-10-26 06:11:45 user2 210.77.23.17
2011-10-26 06:11:46 user3 210.77.23.12
2011-10-26 06:11:47 user2 210.77.23.89
2011-10-26 06:11:48 user2 210.77.23.12
2011-10-26 06:11:52 user3 210.77.23.12
2011-10-26 06:11:53 user2 210.77.23.12
...
I want to use MapReduce to count the logins by the third field (user) on each line and sort by that count in descending order. In other words, I want the result to be displayed as:
user2 4
user3 2
user1 1
Now I have two questions:
By default, MapReduce will split the log file on spaces and line breaks, but I only need the third field on each line; I don't care about fields such as 2011-10-26, 06:11:35, 210.77.23.12. How do I make MapReduce skip them and pick up only the user field?
By default, MapReduce will sort the result by the key instead of the value. How do I make MapReduce sort the result by value (login count)?
Thank you.

For your first question:
You should probably pass the whole line to the mapper, keep only the third token, and emit (user, 1) every time.
public class AnalyzeLogs
{
    public static class FindFriendMapper extends Mapper<Object, Text, Text, IntWritable> {

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException
        {
            // The log line is space separated: date, time, user, IP
            String[] tempStrings = value.toString().split(" ");
            context.write(new Text(tempStrings[2]), new IntWritable(1));
        }
    }
For your second question, I believe you cannot avoid a second MR job after that (I cannot think of any other way). The reducer of the first job will just aggregate the values and give a sum for each key, sorted by key, which is not yet what you need.
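For completeness, the reducer of that first job can be a plain summing reducer, something like the sketch below (the class name is just a placeholder):

    public static class SumLogsReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            // add up the 1s emitted by the mapper for this user
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);   // e.g. "user2  4"
        }
    }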
So, you pass the output of this job as input to a second MR job. The objective of this job is to do a somewhat special sort by value before passing the records to the reducers (which will do absolutely nothing).
Our Mapper for the second job will be the following:
public static class SortLogsMapper extends Mapper<Object, Text, Text, NullWritable> {

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException
    {
        // The input line is already "user count", so emit the whole line as the key
        context.write(value, NullWritable.get());
    }
}
As you can see, this mapper does not emit a meaningful value at all. Instead, we have created a synthetic key that contains our value (the key is in "key1 value1" format).
What remains now is to tell the framework to sort based on value1 and not on the whole "key1 value1" string. So we will implement a custom sort comparator:
public static class LogDescComparator extends WritableComparator
{
    protected LogDescComparator()
    {
        super(Text.class, true);
    }

    @Override
    public int compare(WritableComparable w1, WritableComparable w2)
    {
        Text t1 = (Text) w1;
        Text t2 = (Text) w2;
        String[] t1Items = t1.toString().split(" "); // the synthetic key is "user count"
        String[] t2Items = t2.toString().split(" ");
        int t1Value = Integer.parseInt(t1Items[1]);
        int t2Value = Integer.parseInt(t2Items[1]);
        // Compare the "real" value part of our synthetic key numerically, in descending order
        return Integer.compare(t2Value, t1Value);
    }
}
You can set your custom comparator with: job.setSortComparatorClass(LogDescComparator.class);
The reducer of this job should do nothing. However, if the job has no reduce phase at all, the mapper keys will not be sorted (and we need that sort). So set an identity reducer (in the new API the base Reducer class already behaves as identity) for the second MR job: it performs no reduction but still ensures the mapper's synthetic keys are sorted in the way we specified.
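Putting the second job together, its driver could be wired up roughly like this (the input/output paths and the jar class are placeholders, not from the question):

    Job sortJob = Job.getInstance(new Configuration(), "sort logs by count");
    sortJob.setJarByClass(AnalyzeLogs.class);
    sortJob.setMapperClass(SortLogsMapper.class);
    sortJob.setReducerClass(Reducer.class);              // the base Reducer class acts as identity
    sortJob.setSortComparatorClass(LogDescComparator.class);
    sortJob.setNumReduceTasks(1);                        // a single reducer gives one globally sorted file
    sortJob.setOutputKeyClass(Text.class);
    sortJob.setOutputValueClass(NullWritable.class);
    FileInputFormat.addInputPath(sortJob, new Path("job1-output"));     // output of the first job
    FileOutputFormat.setOutputPath(sortJob, new Path("final-output"));
    sortJob.waitForCompletion(true);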

Related

Hadoop Custom Partitioner not behaving according to the logic

Based on the example here, this works. I have tried the same on my dataset.
Sample Dataset:
OBSERVATION;2474472;137176;
OBSERVATION;2474473;137176;
OBSERVATION;2474474;137176;
OBSERVATION;2474475;137177;
Treating each line as a string, my Mapper output is:
key -> string[2], value -> the whole string.
My Partitioner code:
@Override
public int getPartition(Text key, Text value, int reducersDefined) {
    String keyStr = key.toString();
    if (keyStr == "137176") {
        return 0;
    } else {
        return 1 % reducersDefined;
    }
}
In my data set most IDs are 137176. Reducers declared: 2. I expect two output files, one for 137176 and a second for the remaining IDs. I'm getting two output files, but the IDs are evenly distributed across both. What's going wrong in my program?
Explicitly set in the driver that you want to use your custom Partitioner, with job.setPartitionerClass(YourPartitioner.class);. If you don't do that, the default HashPartitioner is used.
Change the String comparison from == to .equals(), i.e., change if(keyStr == "137176") { to if(keyStr.equals("137176")) {.
To save some time, it will perhaps be faster to declare a new Text variable at the beginning of the partitioner, like this: Text KEY = new Text("137176"); and then, without converting your input key to a String every time, just compare it with the KEY variable (again using the equals() method). But perhaps those are equivalent. So, what I suggest is:
Text KEY = new Text("137176");

@Override
public int getPartition(Text key, Text value, int reducersDefined) {
    return key.equals(KEY) ? 0 : 1 % reducersDefined;
}
Another suggestion: if the network load is heavy, emit the map output key as a VIntWritable and change the Partitioner accordingly.
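If you follow that suggestion, the partitioner would change along these lines (a sketch; it assumes the map output value stays Text and the class name is made up):

    public static class IdPartitioner extends Partitioner<VIntWritable, Text> {

        @Override
        public int getPartition(VIntWritable key, Text value, int reducersDefined) {
            // ints can be compared directly, no String conversion or equals() needed
            return key.get() == 137176 ? 0 : 1 % reducersDefined;
        }
    }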

Manipulating a user input string in MapReduce

I am beginning to use the Hadoop variant of MapReduce and therefore have zero clue about the ins and outs. I understand how conceptually it's supposed to work.
My problem is to find a specific search string within a bunch of files I have been provided. I am not worried about reading the files - that's sorted. But how would you go about asking for input? Would you ask within the JobConf section of the program? If so, how would I pass the string into the job?
If it's within the map() function, how would you go about implementing it? Wouldn't it just ask for a search string every time the map() function is called?
Here's the main method and JobConf() section that should give you an idea:
public static void main(String[] args) throws IOException {
    // This produces an output file in which each line contains a separate word followed by
    // the total number of occurrences of that word in all the input files.
    JobConf job = new JobConf();
    FileInputFormat.setInputPaths(job, new Path("input"));
    FileOutputFormat.setOutputPath(job, new Path("output"));

    // Output from reducer maps words to counts.
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);

    // The output of the mapper is a map from words (including duplicates) to the value 1.
    job.setMapperClass(InputMapper.class);

    // The output of the reducer is a map from unique words to their total counts.
    job.setReducerClass(CountWordsReducer.class);

    JobClient.runJob(job);
}
And the map() function:
public void map(LongWritable key, Text value, OutputCollector<Text, LongWritable> output, Reporter reporter) throws IOException {
    // The key is the character offset within the file of the start of the line, ignored.
    // The value is a line from the file.
    // This is me trying to hard-code it. I would prefer an explanation on how to get interactive input!
    String inputString = "data";
    String line = value.toString();
    // Only tokenize lines that actually contain the search string;
    // otherwise the scanner is never advanced and the loop never ends.
    if (line.contains(inputString)) {
        Scanner scanner = new Scanner(line);
        while (scanner.hasNext()) {
            output.collect(new Text(scanner.next()), new LongWritable(1));
        }
        scanner.close();
    }
}
I am led to believe that I don't need a reducer stage for this problem. Any advice/explanations much appreciated!
JobConf class is an extension of Configuration class, and thus, you can set custom properties:
JobConf job = new JobConf();
job.set("inputString", "data");
...
Then, as stated in the documentation for the Mapper: "Mapper implementations can access the JobConf for the job via the JobConfigurable.configure(JobConf) and initialize themselves." This means you have to override that method within your Mapper in order to get the desired parameter:
private static String inputString;

public void configure(JobConf job) {
    inputString = job.get("inputString");
}
Anyway, this is using the old API. With the new one it is easier to access the configuration since the context (and thus the configuration) is passed to the map method as an argument.
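For reference, with the new org.apache.hadoop.mapreduce API the same idea looks roughly like this (the property name and mapper class are illustrative, assuming the driver called job.getConfiguration().set("inputString", "data") before submitting):

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class SearchMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

        private String inputString;

        @Override
        protected void setup(Context context) {
            // Read the search string that the driver stored in the configuration
            inputString = context.getConfiguration().get("inputString", "data");
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            if (value.toString().contains(inputString)) {
                context.write(new Text(inputString), new LongWritable(1));
            }
        }
    }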

Outputting single file for partitioner

I am trying to get as many reducers as the number of keys.
public class CustomPartitioner extends Partitioner<Text, Text>
{
    public int getPartition(Text key, Text value, int numReduceTasks)
    {
        System.out.println("In CustomP");
        return (key.toString().hashCode()) % numReduceTasks;
    }
}
Driver class
job6.setMapOutputKeyClass(Text.class);
job6.setMapOutputValueClass(Text.class);
job6.setOutputKeyClass(NullWritable.class);
job6.setOutputValueClass(Text.class);
job6.setMapperClass(LastMapper.class);
job6.setReducerClass(LastReducer.class);
job6.setPartitionerClass(CustomPartitioner.class);
job6.setInputFormatClass(TextInputFormat.class);
job6.setOutputFormatClass(TextOutputFormat.class);
But I am getting output in a single file.
Am I doing anything wrong?
You cannot control the number of reducers without specifying it :-). Even then, there is no guarantee that every key lands on a different reducer, because you do not know how many distinct keys the input data contains, and your hash partition function may return the same number for two distinct keys. If you want to achieve your solution, you'll have to know the number of distinct keys in advance and modify your partition function accordingly.
You need to set the number of reduce tasks equal to the number of keys, and you also need to return the partitions based on your keys in the partitioner class. For example, if your input has four keys (here: wood, Masonry, Reinforced Concrete, etc.) then your getPartition method would look like this:
public int getPartition(Text key, PairWritable value, int numReduceTasks) {
    String s = value.getone();
    if (numReduceTasks == 0) {
        return 0;
    }
    if (s.equalsIgnoreCase("wood")) {
        return 0;
    }
    if (s.equalsIgnoreCase("Masonry")) {
        return 1 % numReduceTasks;
    }
    if (s.equalsIgnoreCase("Reinforced Concrete")) {
        return 2 % numReduceTasks;
    }
    if (s.equalsIgnoreCase("Reinforced Masonry")) {
        return 3 % numReduceTasks;
    } else {
        return 4 % numReduceTasks;
    }
}
}
The corresponding output will be collected in the respective reducers. Try running from the command line instead of Eclipse.
You haven't configured the number of reducers to run.
You can configure it using the API below:
job.setNumReduceTasks(10); // change the number according to your cluster
Also, you can set it while executing from the command line:
-D mapred.reduce.tasks=10
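Note that -D properties are only picked up if the driver runs through GenericOptionsParser, which usually means using ToolRunner; a minimal sketch of such a driver (class and job names are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class MyDriver extends Configured implements Tool {

        @Override
        public int run(String[] args) throws Exception {
            // getConf() already contains any -D properties parsed from the command line
            Job job = Job.getInstance(getConf(), "my job");
            job.setJarByClass(MyDriver.class);
            // ... set mapper, reducer, partitioner, input and output paths as before ...
            return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            System.exit(ToolRunner.run(new Configuration(), new MyDriver(), args));
        }
    }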
Hope this helps.
Veni, you need to chain the tasks as below:
Mapper1 --> Reducer --> Mapper2 (a post-processing mapper which creates a file for each key)
Mapper 2's InputFormat should be NLineInputFormat, so that for each key in the reducer's output there is a corresponding mapper, and each mapper's output becomes a separate file for that key.
Mapper 1 and the Reducer are your existing MR job.
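A rough sketch of how the second (post-processing) job could be configured, assuming the first job wrote to job6-output; all class names and paths here are illustrative:

    Job job7 = Job.getInstance(new Configuration(), "one file per key");
    job7.setJarByClass(MyDriver.class);                   // MyDriver is a placeholder
    job7.setInputFormatClass(NLineInputFormat.class);
    NLineInputFormat.setNumLinesPerSplit(job7, 1);        // one reducer-output line per mapper
    job7.setMapperClass(PerKeyFileMapper.class);          // hypothetical mapper that writes each line back out
    job7.setNumReduceTasks(0);                            // map-only post-processing step
    FileInputFormat.addInputPath(job7, new Path("job6-output"));
    FileOutputFormat.setOutputPath(job7, new Path("per-key-output"));
    job7.waitForCompletion(true);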
Hope this helps.

How to use MultipleOutputs<KEYOUT,VALUEOUT> for writing output data to multiple outputs

I am new to Hadoop and MapReduce and have been trying to write output to multiple files based on keys. Could anyone please provide a clear idea or a Java code snippet example on how to use it? My mapper is working exactly fine and after the shuffle, keys and the corresponding values are obtained as expected. Thanks!
What I am trying to do is output only a few records from the input file to a new file.
Thus the new output file shall contain only those required records, ignoring the rest of the irrelevant records.
This would work fine even if I don't use MultipleTextOutputFormat.
The logic I implemented in the mapper is as follows:
public static class MapClass extends Mapper<LongWritable, Text, Text, Text> {

    Text kword = new Text();
    Text vword = new Text();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String[] parts = line.split(" ");
        kword.set(parts[4]);   // the fifth field becomes the key
        vword.set(line);       // the whole line is the value
        context.write(kword, vword);
    }
}
Input to reduce is like this:
[key1]--> [value1, value2, ...]
[key2]--> [value1, value2, ...]
[key3]--> [value1, value2, ...] & so on
My interest is in [key2] --> [value1, value2, ...], ignoring the other keys and their corresponding values. Please help me out with the reducer.
Using MultipleOutputs lets you emit records to multiple files, but only to a pre-defined number/type of files, not to an arbitrary number of files, and not with an on-the-fly decision on the file name according to the key/value.
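To illustrate what that looks like with the new-API MultipleOutputs (all names below are placeholders, not from the question): register a named output in the driver and write to it from the reducer, keeping only the key you care about.

    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

    public class FilterReducer extends Reducer<Text, Text, Text, Text> {

        private MultipleOutputs<Text, Text> mos;

        @Override
        protected void setup(Context context) {
            mos = new MultipleOutputs<Text, Text>(context);
        }

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            if (key.toString().equals("key2")) {          // keep only the key of interest
                for (Text value : values) {
                    mos.write("filtered", key, value);    // lands in files named filtered-r-*
                }
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            mos.close();
        }
    }

On the driver side you would also call MultipleOutputs.addNamedOutput(job, "filtered", TextOutputFormat.class, Text.class, Text.class); before submitting the job.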
You may create your own OutputFormat by extending org.apache.hadoop.mapred.lib.MultipleTextOutputFormat. Your OutputFormat class can then decide the output file name, as well as the folder, according to the key/value emitted by the reducer. This can be achieved as follows:
package oddjob.hadoop;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

public class MultipleTextOutputFormatByKey extends MultipleTextOutputFormat<Text, Text> {

    /**
     * Use the key as part of the path for the final output file.
     */
    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String leaf) {
        return new Path(key.toString(), leaf).toString();
    }

    /**
     * When actually writing the data, discard the key since it is already in
     * the file path.
     */
    @Override
    protected Text generateActualKey(Text key, Text value) {
        return null;
    }
}
For more info read here.
PS: You will need to use the old mapred API to achieve that, as the newer API does not support MultipleTextOutputFormat yet! Refer to this.
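Driver-side, with the old mapred API, wiring in the custom format is roughly as follows (a sketch; the rest of the job setup is assumed and MyDriver is a placeholder):

    JobConf conf = new JobConf(MyDriver.class);
    // ... mapper, reducer and input path configured as usual ...
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    conf.setOutputFormat(MultipleTextOutputFormatByKey.class);
    FileOutputFormat.setOutputPath(conf, new Path("output"));
    JobClient.runJob(conf);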

MapReduce - WritableComparables

I’m new to both Java and Hadoop. I’m trying a very simple program to get Frequent pairs.
e.g.
Input: My name is Foo. Foo is student.
Intermediate Output:
Map:
(my, name): 1
(name ,is): 1
(is, Foo): 2 // (is, Foo) = (Foo, is)
(is, student): 1
So finally it should report that the most frequent pair is (is, Foo).
Pseudo code looks like this:
Map(Key: line_num, value: line)
words = split_words(line)
for each w in words:
for each neighbor x:
emit((w, x), 1)
Here my key is not a single word, it's a pair. While going through documentation, I read that for each new key we have to implement WritableComparable.
So I'm confused about that. If someone can explain about this class, that would be great. Not sure it’s really true. Then I can figure out on my own how to do that!
I don't want any code, neither mapper nor anything else... I just want to understand what this WritableComparable does. Which method of WritableComparable actually compares keys? I can see equals and compareTo, but I cannot find any explanation of them. Please, no code! Thanks.
EDIT 1:
In compareTo I return 0 for pair (a, b) = (b, a), but they still do not go to the same reducer. Is there any way, in the compareTo method, to reset key (b, a) to (a, b), or to generate a totally new key?
EDIT 2:
I don't know about generating a new key, but changing the logic in compareTo worked fine! Thanks everyone!
WritableComparable is an interface that makes the class that implements it be two things: Writable, meaning it can be written to and read from your network via serialization, etc. This is necessary if you're going to use it as a key or value so that it can be sent between Hadoop nodes. And Comparable, which means that methods must be provided that show how one object of the given class can be compared to another. This is used when the Reducer organizes by key.
This interface is necessary when you want to create your own object to be a key. And you'd need to create your own InputFormat as opposed to using one of the ones that come with Hadoop. This can be rather difficult (from my experience), especially if you're new to both Java and Hadoop.
So if I were you, I wouldn't bother with that as there's a much simpler way. I would use TextInputFormat, which is conveniently both the default InputFormat and pretty easy to use and understand. You could simply emit each key as a Text object, which is pretty similar to a string. There is a caveat though; like you mentioned, "is Foo" and "Foo is" need to be evaluated as the same key. So with every pair of words you pull out, sort them alphabetically (using the String.compareTo method) before passing them on as a key. That way you're guaranteed to have no repeats.
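A minimal sketch of that idea inside map(), assuming adjacent-word pairs and a Text/IntWritable map output:

    // Inside map(): emit each adjacent pair with the two words in alphabetical order
    String[] words = value.toString().split("\\s+");
    for (int i = 0; i < words.length - 1; i++) {
        String a = words[i];
        String b = words[i + 1];
        // ("Foo", "is") and ("is", "Foo") both become the key "Foo is"
        String pairKey = (a.compareTo(b) <= 0) ? a + " " + b : b + " " + a;
        context.write(new Text(pairKey), new IntWritable(1));
    }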
Here is a mapper class for your problem; the frequent-pair-of-words logic is not implemented. I guess you were not looking for that.
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MR {

    public static class Mapper extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, LongWritable>
    {
        public static int check(String keyCheck)
        {
            // logic to check whether the key is frequent or not
            return 0;
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            Map<String, Integer> keyMap = new HashMap<String, Integer>();
            String line = value.toString();
            String[] words = line.split(" ");
            for (int i = 0; i < (words.length - 1); i++)
            {
                String mapkeyString = words[i] + "," + words[i + 1];
                // Logic to check whether mapkeyString is frequent or not.
                int count = check(mapkeyString);
                keyMap.put(mapkeyString, count);
            }
            Set<Entry<String, Integer>> entries = keyMap.entrySet();
            for (Entry<String, Integer> entry : entries)
            {
                context.write(new Text(entry.getKey()), new LongWritable(entry.getValue()));
            }
        }
    }

    public static class Reduce extends Reducer<Text, LongWritable, Text, Text>
    {
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
        }
    }

    public static void main(String[] args) {
        Configuration configuration = new Configuration();
        try {
            Job job = new Job(configuration, "Word Job");
            job.setJarByClass(MR.class);
            job.setMapperClass(Mapper.class);
            job.setReducerClass(Reduce.class);
            // Declare the map and job output types explicitly,
            // otherwise the framework defaults cause a key/value type mismatch.
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(LongWritable.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            job.setInputFormatClass(TextInputFormat.class);
            FileInputFormat.setInputPaths(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            job.waitForCompletion(true);
        } catch (IOException e) {
            e.printStackTrace();
        } catch (InterruptedException e) {
            e.printStackTrace();
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        }
    }
}
