Manipulating a user input string in MapReduce - java

I am beginning to use the Hadoop variant of MapReduce and therefore have zero clue about the ins and outs. I understand how conceptually it's supposed to work.
My problem is to find a specific search string within a bunch of files I have been provided. I am not interested in the files themselves - that part is sorted. But how would you go about asking for user input? Would you ask for it in the JobConf section of the program? If so, how would I pass the string into the job?
If it's within the map() function, how would you go about implementing it? Wouldn't it just ask for a search string every time the map() function is called?
Here's the main method and JobConf() section that should give you an idea:
public static void main(String[] args) throws IOException {
// This produces an output file in which each line contains a separate word followed by
// the total number of occurrences of that word in all the input files.
JobConf job = new JobConf();
FileInputFormat.setInputPaths(job, new Path("input"));
FileOutputFormat.setOutputPath(job, new Path("output"));
// Output from reducer maps words to counts.
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(LongWritable.class);
// The output of the mapper is a map from words (including duplicates) to the value 1.
job.setMapperClass(InputMapper.class);
// The output of the reducer is a map from unique words to their total counts.
job.setReducerClass(CountWordsReducer.class);
JobClient.runJob(job);
}
And the map() function:
public void map(LongWritable key, Text value, OutputCollector<Text, LongWritable> output, Reporter reporter) throws IOException {
// The key is the character offset within the file of the start of the line, ignored.
// The value is a line from the file.
//This is me trying to hard-code it. I would prefer an explanation on how to get interactive input!
String inputString = "data";
String line = value.toString();
Scanner scanner = new Scanner(line);
while (scanner.hasNext()) {
String token = scanner.next(); // always advance the scanner, otherwise the loop never terminates
if (line.contains(inputString)) {
output.collect(new Text(token), new LongWritable(1));
}
}
scanner.close();
}
I am led to believe that I don't need a reducer stage for this problem. Any advice/explanations much appreciated!

JobConf class is an extension of Configuration class, and thus, you can set custom properties:
JobConf job = new JobConf();
job.set("inputString", "data");
...
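In a real program the value would usually come from the command line rather than being hard-coded; for example (assuming the search term is passed as the first program argument):
job.set("inputString", args[0]);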
Then, as stated in the documentation for the Mapper: "Mapper implementations can access the JobConf for the job via the JobConfigurable.configure(JobConf) and initialize themselves." This means you have to implement that method in your Mapper in order to retrieve the desired parameter:
private static String inputString;
public void configure(JobConf job) {
inputString = job.get("inputString");
}
Anyway, this is using the old API. With the new one it is easier to access the configuration since the context (and thus the configuration) is passed to the map method as an argument.
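For illustration, a minimal sketch of a mapper using the new org.apache.hadoop.mapreduce API might look like this (the class name and the default value are just assumptions):

public class SearchMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    private String inputString;

    @Override
    protected void setup(Context context) {
        // Read the custom property once per task from the job configuration.
        inputString = context.getConfiguration().get("inputString", "data");
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit the line with a count of 1 whenever it contains the search term.
        if (value.toString().contains(inputString)) {
            context.write(value, new LongWritable(1));
        }
    }
}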

Related

Why can we reuse Text object in WordCount example

After seeing the Hadoop WordCount example, I cannot understand why we can reuse the Text object instead of creating a new one for each write operation context.write(...).
public class WordCount {
public static class TokenizerMapper
extends Mapper<Object, Text, Text, IntWritable>{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
// set other String in Text object
word.set(itr.nextToken());
context.write(word, one);
}
}
// ...
My question is: if there is only one Text object in each map task, then after we change its content using word.set(...), won't the previously emitted key/value pair be affected, since its key uses the same Text object whose content has now changed?
Did I miss something? Thanks in advance for correcting me.
Reusing objects is good practice because it avoids creating many short-lived objects. Hence, the map() method repeatedly populates and reuses the word and one objects and passes them to context.write(word, one).
context.write() generates an output key/value pair, and the Hadoop framework takes care of serializing the data at the moment context.write() is called. Hence, you can safely reuse the objects within the map() method.
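The stock WordCount reducer relies on the same guarantee; a sketch of that familiar IntSumReducer pattern:

public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();  // reused across calls, just like 'word'

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        // Safe to reuse 'result': it is serialized inside context.write().
        context.write(key, result);
    }
}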

Using MapReduce to analyze log file

Here is a log file:
2011-10-26 06:11:35 user1 210.77.23.12
2011-10-26 06:11:45 user2 210.77.23.17
2011-10-26 06:11:46 user3 210.77.23.12
2011-10-26 06:11:47 user2 210.77.23.89
2011-10-26 06:11:48 user2 210.77.23.12
2011-10-26 06:11:52 user3 210.77.23.12
2011-10-26 06:11:53 user2 210.77.23.12
...
I want to use MapReduce to count the logins per user (the third field of each line) and sort the result in descending order. In other words, I want the result to be displayed as:
user2 4
user3 2
user1 1
Now I have two questions:
By default, MapReduce will split the log file on spaces and line breaks, but I only need the third field of each line; that is, I don't care about fields such as 2011-10-26, 06:11:35, or 210.77.23.12. How can I make MapReduce omit them and pick up only the user field?
By default, MapReduce will sort the result by the key instead of the value. How can I make MapReduce sort the result by value (the number of logins)?
Thank you.
For your first question:
You should probably pass the whole line to the mapper, keep only the third token for mapping, and emit (user, 1) every time:
public class AnalyzeLogs
{
public static class FindFriendMapper extends Mapper<Object, Text, Text, IntWritable> {
public void map(Object key, Text value, Context context) throws IOException, InterruptedException
{
// Fields are space-separated; the third one (index 2) is the user.
String[] tempStrings = value.toString().split(" ");
context.write(new Text(tempStrings[2]), new IntWritable(1));
}
}
For your second question, I believe you cannot avoid a second MR job after that (I cannot think of any other way). The reducer of the first job will just aggregate the values and emit a sum for each key, sorted by key, which is not yet what you need.
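A sum reducer for that first job could look roughly like this (a sketch; the class name is arbitrary):

public static class CountLoginsReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();   // add up the 1s emitted by the mapper for this user
        }
        context.write(key, new IntWritable(sum));
    }
}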
So, you pass the output of this job as input to this second MR job. The objective of this job is to do a somewhat special sorting by value before passing to the reducers (which will do absolutely nothing).
Our Mapper for the second job will be the following:
public static class SortLogsMapper extends Mapper<Object, Text, Text, NullWritable> {
public void map(Object key, Text value, Context context) throws IOException, InterruptedException
{
context.write(value, NullWritable.get());
}
}
As you can see, this mapper does not emit any real value at all. Instead, the whole input line becomes the key, so our key is in key1 value1 format (for example user2 4).
What remains to be done is to tell the framework to sort based on value1 rather than on the whole key1 value1 string. So we will implement a custom SortComparator:
public static class LogDescComparator extends WritableComparator
{
protected LogDescComparator()
{
super(Text.class, true);
}
@Override
public int compare(WritableComparable w1, WritableComparable w2)
{
Text t1 = (Text) w1;
Text t2 = (Text) w2;
String[] t1Items = t1.toString().split("\t"); // TextOutputFormat separates key and value with a tab by default; adjust if yours differs
String[] t2Items = t2.toString().split("\t");
String t1Value = t1Items[1];
String t2Value = t2Items[1];
// Compare the "real" value part of our synthetic key, in descending order.
// Note: comparing the counts as plain strings only orders correctly while they have the
// same number of digits; parse them with Integer.parseInt() for a true numeric ordering.
int comp = t2Value.compareTo(t1Value);
return comp;
}
}
You can set your custom comparator with job.setSortComparatorClass(LogDescComparator.class);
The reducer of this job should do nothing. However, you must not set the number of reduce tasks to zero (a map-only job skips the sort phase, and we need the sorting of the mapper keys). So set an identity reducer for the second job (IdentityReducer in the old API, or simply the base Reducer class in the new one); it performs no reduction but still ensures that the mapper's synthetic keys are sorted in the way we specified.
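Putting the second job together in the driver would then look roughly like this (a sketch; paths and the job name are placeholders):

Job sortJob = Job.getInstance(new Configuration(), "sort logs by count");
sortJob.setJarByClass(AnalyzeLogs.class);
sortJob.setMapperClass(SortLogsMapper.class);
sortJob.setReducerClass(Reducer.class);             // identity reducer: keeps the sort phase, changes nothing
sortJob.setSortComparatorClass(LogDescComparator.class);
sortJob.setOutputKeyClass(Text.class);
sortJob.setOutputValueClass(NullWritable.class);
FileInputFormat.addInputPath(sortJob, new Path("firstJobOutput"));  // output directory of the first job
FileOutputFormat.setOutputPath(sortJob, new Path("sortedOutput"));
sortJob.waitForCompletion(true);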

Compare files string by string

I have two files:
Grader.getFileInfo("data\\studentSubmissionA.txt");
Grader.teacherFiles("data\\TeacherListA.txt");
Both contain a list of math problems, but the TeacherList is unsolved in order to check that the StudentSubmission was not altered from the original version.
studentSubmission is sent to the Grader class and the method currently looks like this:
public static void getFileInfo(String fileName)
throws FileNotFoundException {
Scanner in = new Scanner(new File(fileName));
while (in.hasNext()) {
String fileContent = in.nextLine();
}
}
and the TeacherFiles method looks like
public static void teacherFiles(String teacherFiles)
throws FileNotFoundException{
Scanner in = new Scanner(new File(teacherFiles));
while (in.hasNext()){
String teacherContent = in.nextLine();
String line = teacherContent.substring(0, teacherContent.indexOf('='));
}
}
I don't know how to get the data from these methods into another method in order to compare it, since it comes from a file and I would have to put something in the method signature to pass it along, and that doesn't work.
I tried putting everything in one method, but that was a bust as well.
I don't know where to go from here.
And unfortunately, I can't use try/catch or arrays.
Is it possible to send the .substring(0, teacherContent.indexOf('=')) result between the methods?
Something like line = teacherFiles(teacherContent.substring(0, teacherContent.indexOf('='))); Is that possible?
Think in more general terms. Observe that your two methods, getFileInfo and teacherFiles, are essentially the same except for a few nuances. So why not find a good way of merging the two pieces of functionality and handle the nuances outside of them?
It is logical that you cannot use arrays here: you need to know the number of elements of an array before you initialize it, but you only learn how many lines there are while you read the file. So an array would be either overkill (you allocate 1000 elements and use only 10) or insufficient (you allocate 10 elements but need 1000). Since you do not know the number of rows in advance, you need a different data structure for this task.
So create the following method:
public static List<String> readFile(String filePath) throws FileNotFoundException {
Scanner s = new Scanner(new File(filePath));
List<String> list = new ArrayList<String>();
while (s.hasNextLine()) {
list.add(s.nextLine()); // read whole lines, matching the nextLine() logic in your own methods
}
s.close();
return list;
}
Then use this method to read the student file and the teacher file, store the results into two separate List<String> variables, and iterate through them to compare the entries as you like. Again, think in more general terms.
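For example, the comparison could then be driven by something like this (a sketch; exactly what counts as "altered" is up to you):

public static void main(String[] args) throws FileNotFoundException {
    List<String> student = readFile("data\\studentSubmissionA.txt");
    List<String> teacher = readFile("data\\TeacherListA.txt");
    for (int i = 0; i < teacher.size() && i < student.size(); i++) {
        // Compare the problem part of each line, i.e. everything before the '='.
        String teacherLine = teacher.get(i);
        String problem = teacherLine.substring(0, teacherLine.indexOf('='));
        if (!student.get(i).startsWith(problem)) {
            System.out.println("Line " + (i + 1) + " was altered: " + student.get(i));
        }
    }
}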

How to use MultipleOutputs<KEYOUT,VALUEOUT> for writing output data to multiple outputs

I am new to Hadoop and MapReduce and have been trying to write output to multiple files based on keys. Could anyone please provide a clear idea or a Java code snippet example of how to use it? My mapper is working fine, and after the shuffle, keys and the corresponding values are obtained as expected. Thanks!
What I am trying to do is output only a few records from the input file to a new file.
Thus the new output file shall contain only those required records, ignoring the rest of the irrelevant records.
This would work fine even if I don't use MultipleTextOutputFormat.
The logic I implemented in the mapper is as follows:
public static class MapClass extends Mapper<LongWritable, Text, Text, Text> {
StringBuilder emitValue = null;
StringBuilder emitKey = null;
Text kword = new Text();
Text vword = new Text();
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String[] parts;
String line = value.toString();
parts = line.split(" ");
kword.set(parts[4]);
vword.set(line);
context.write(kword, vword);
}
}
Input to reduce is like this:
[key1]--> [value1, value2, ...]
[key2]--> [value1, value2, ...]
[key3]--> [value1, value2, ...] & so on
My interest is only in [key2] --> [value1, value2, ...], ignoring the other keys and their corresponding values. Please help me out with the reducer.
Using MultipleOutputs lets you emit records to multiple files, but only to a pre-defined number and type of files, not to an arbitrary number of files, and not with the file name decided on the fly from the key/value.
You can create your own OutputFormat by extending org.apache.hadoop.mapred.lib.MultipleTextOutputFormat. Such an OutputFormat class lets you decide the output file name, and even the folder, based on the key/value emitted by the reducer. This can be achieved as follows:
package oddjob.hadoop;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;
public class MultipleTextOutputFormatByKey extends MultipleTextOutputFormat<Text, Text> {
/**
* Use they key as part of the path for the final output file.
*/
@Override
protected String generateFileNameForKeyValue(Text key, Text value, String leaf) {
return new Path(key.toString(), leaf).toString();
}
/**
* When actually writing the data, discard the key since it is already in
* the file path.
*/
@Override
protected Text generateActualKey(Text key, Text value) {
return null;
}
}
For more info read here.
PS: You will need to use the old mapred API to achieve that, as the newer API does not yet support MultipleTextOutputFormat.
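For reference, wiring the custom format into an old-API driver might look roughly like this (a sketch; the paths are placeholders, and with the old API the mapper must implement org.apache.hadoop.mapred.Mapper, so the new-API MapClass above would have to be ported):

JobConf conf = new JobConf();
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);
conf.setOutputFormat(MultipleTextOutputFormatByKey.class);   // the custom format from above
FileInputFormat.setInputPaths(conf, new Path("input"));
FileOutputFormat.setOutputPath(conf, new Path("output"));
JobClient.runJob(conf);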

replace a string segment from input stream

I am trying to receive a huge text file as an InputStream and want to replace one string segment with another string. I am quite confused about how to do it. It works if I convert the whole input stream to a String, but I don't want that, because some of the content gets lost. Can anyone please help me with how to do it?
e.g.
If I have a file whose contents are "This is the test string which needs to be modified", I want to accept this string as an input stream and modify the contents to "This is the test string which is modified" (by replacing 'needs to be' with 'is').
public static void main(String[] args) {
String string = "This is the test string which needs to be modified";
InputStream inpstr = new ByteArrayInputStream(string.getBytes());
//Code to do
}
In this I want the output as: This is the test string which is modified
Thanking you in advance.
If the text to be changed will always fit in one logical line then, as I stated in a comment, I'd go with simple line reading (if applicable), using something like:
public class InputReader {
public static void main(String[] args) throws IOException
{
String string = "This is the test string which needs to be modified";
InputStream inpstr = new ByteArrayInputStream(string.getBytes());
BufferedReader rdr = new BufferedReader(new InputStreamReader(inpstr));
String buf = null;
while ((buf = rdr.readLine()) != null) {
// Apply the substitution on each line and build the output
String result = buf.replace("needs to be", "is");
System.out.println(result);
}
}
}
However, I've always liked using inheritance, so I'd define this somewhere:
class MyReader extends BufferedReader {
public MyReader(Reader in)
{
super(in);
}
@Override
public String readLine() throws IOException {
String lBuf = super.readLine();
// Perform matching & substitution on the line that was read
if (lBuf != null) {
lBuf = lBuf.replace("needs to be", "is");
}
return lBuf;
}
}
And use MyReader in place of standard BufferedReader keeping the substitution hidden inside the readLine method.
Pros: substitution logic is in a specified Reader, code is pretty standard.
Cons: it hides the substitution logic to the caller (sometimes this is also a pro, still it depends on usage case)
HTH
Maybe I understood you wrong, but I think you could build a small state machine: use a small string buffer (a "stack") to collect text and check the replacement condition against it.
If the collected buffer can no longer match your condition, just flush it to the output and start collecting again.
If the buffer is still a partial match for your condition, keep collecting.
If the buffer fully matches your condition, make the modification and flush the modified text to the output.
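A rough sketch of that buffering idea (the class and method names are my own, and the naive prefix check is good enough for a short search string):

import java.io.*;

public class StreamReplace {

    // Copy 'in' to 'out', replacing every occurrence of 'target' with 'replacement'
    // without ever holding more than target.length() characters in memory.
    static void replaceStreaming(Reader in, Writer out, String target, String replacement)
            throws IOException {
        StringBuilder buf = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            buf.append((char) c);
            if (buf.toString().equals(target)) {
                out.write(replacement);          // full match: emit the replacement instead
                buf.setLength(0);
            } else {
                // Shift characters out of the buffer until what is left could
                // still grow into the target.
                while (buf.length() > 0 && !target.startsWith(buf.toString())) {
                    out.write(buf.charAt(0));
                    buf.deleteCharAt(0);
                }
            }
        }
        out.write(buf.toString());               // flush whatever is left at the end
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        String string = "This is the test string which needs to be modified";
        InputStream inpstr = new ByteArrayInputStream(string.getBytes());
        Writer out = new OutputStreamWriter(System.out);
        replaceStreaming(new InputStreamReader(inpstr), out, "needs to be", "is");
        // prints: This is the test string which is modified
    }
}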
