After seeing the Hadoop WordCount example, I cannot understand why we can reuse the Text object instead of creating a new one for each write operation "context.write(...)".
public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        // set another String in the Text object
        word.set(itr.nextToken());
        context.write(word, one);
      }
    } // ...
My question is: if there is only one Text object in each map task, then after we change its content with word.set(...), won't the previously emitted key-value pair be affected, since the key uses the same Text object and its content has now changed?
Did I miss something? Thanks in advance for correcting me.
Reusing objects is good practice because it avoids creating many new objects. Hence, context.write(word, one) in the map() method repeatedly populates and reuses the word and one objects.
context.write() generates an output key/value pair, and the Hadoop framework takes care of serializing the data at the moment context.write() is called. Since the bytes are copied out right away, you can safely reuse the objects within the map() method.
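To make this concrete, here is a minimal sketch (not from the original answer) showing that the same Text instance can be written twice with different contents; each context.write() serializes the current bytes, so the pair emitted first is not affected by the later set():

public static class ReuseDemoMapper extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        word.set("first");
        context.write(word, one);   // the bytes for "first" are copied out here
        word.set("second");
        context.write(word, one);   // does not change the pair already written above
    }
}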
I couldn't find a better title (feel free to edit it if you find a better one), but the use case is the following. I have two lists of constants. One contains the constants I use in my application; the other contains the different constants that are sent to me via a CSV file (along with data).
To give a rough example: in the CSV file there is a field called "id of the client". In my application, I want to use a field called "clientId". So I basically need to create a static link between the two constants, so that I can easily switch from one to the other depending on what I need to achieve.
I've thought about creating a static Map<String, String> of values, but I figured there might be better solutions.
Thanks!
EDIT: changed the title to "N" constants instead of 2, because a HashMap no longer seems to be an option in that case.
You can use the double brace initializer idiom to keep the map initialization close to the map declaration, so it is not so "ugly", e.g.:
static Map<String, String> someMap = new HashMap<String, String>() {{
    put("one", "two");
    put("three", "four");
}};
Beware that without the static modifier each anonymous class (one is created in this example) holds a reference to the enclosing object, and if you hand a reference to this map to some other class it will prevent the enclosing instance from being garbage collected.
Fortunately, there is hope for us with a Java update: Java 9 will bring the very handy Map.of() to help us do this more safely.
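For reference, a small sketch of what that could look like on Java 9 or later, using the same example keys and values as above:

// Java 9+: a compact, immutable map, with no anonymous class involved
static Map<String, String> someMap = Map.of(
        "one", "two",
        "three", "four");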
The best way to separate the mapping from your application code is to use a properties file in which you define your mapping.
For example, you could have a csv-mapping.properties file in the root of your resources and load it with the following code:
final Properties properties = new Properties();
properties.load( this.getClass().getResourceAsStream( "/csv-mapping.properties" ) );
This will work just like a Map, with the added separation of code from configuration.
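For completeness, a rough sketch of how the lookup itself could then look (the key below is just an assumed example; note that spaces in a property key have to be escaped in the file, e.g. id\ of\ the\ client=clientId):

final Properties properties = new Properties();
properties.load( this.getClass().getResourceAsStream( "/csv-mapping.properties" ) );

// maps the CSV constant to the application constant
String applicationField = properties.getProperty( "id of the client" );  // -> "clientId"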
There are many methods you can use to easily solve this type of problem.
One way is to use a properties file, or any file containing key-value pairs.
Here is code that reads such a properties file through a ResourceBundle:
import java.util.ResourceBundle;

public class ReadingPropertiesFile {

    public static void main(String[] args) {
        ResourceBundle messages = ResourceBundle.getBundle("msg");
        System.out.println(messages.getString("ID"));
    }
}
The msg.properties file contains the values:
ID = ClientID.
PRODUCT_ID = prod_ID
The output of the program is ClientID.
You could also read from a simple text file, or use the map as you are doing now, but I would suggest using the properties file.
One good option would be to use an enum to create such a mapping between multiple constants and a single common-sense value, e.g.:
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public enum MappingEnum {

    CLIENT_ID("clientId", "id of the client", "clientId", "IdOfTheClient"),
    CLIENT_NAME("clientName", "name of the client", "clientName");

    private final String commonSenseName;
    private final Set<String> aliases;

    private MappingEnum(String commonSenseName, String... aliases) {
        this.commonSenseName = commonSenseName;
        this.aliases = Collections.unmodifiableSet(new HashSet<String>(Arrays.asList(aliases)));
    }

    public static MappingEnum fromAlias(String alias) {
        for (MappingEnum mappingEnum : values()) {
            if (mappingEnum.getAliases().contains(alias)) {
                return mappingEnum;
            }
        }
        throw new RuntimeException("No MappingEnum for mapping: " + alias);
    }

    public Set<String> getAliases() {
        return aliases;
    }

    public String getCommonSenseName() {
        return commonSenseName;
    }
}
and then you can use it like:
String columnName = "id of the client";
String targetFieldName = MappingEnum.fromAlias(columnName).getCommonSenseName();
I am beginning to use the Hadoop variant of MapReduce and therefore have zero clue about the ins and outs. I understand how it is supposed to work conceptually.
My problem is to find a specific search string within a bunch of files I have been provided. I am not interested in the files themselves - that part is sorted. But how would you go about asking for the input? Would you ask for it within the JobConf section of the program? If so, how would I pass the string into the job?
If it's within the map() function, how would you go about implementing it? Wouldn't it just ask for a search string every time the map() function is called?
Here's the main method and JobConf() section that should give you an idea:
public static void main(String[] args) throws IOException {
    // This produces an output file in which each line contains a separate word followed by
    // the total number of occurrences of that word in all the input files.
    JobConf job = new JobConf();
    FileInputFormat.setInputPaths(job, new Path("input"));
    FileOutputFormat.setOutputPath(job, new Path("output"));

    // Output from reducer maps words to counts.
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);

    // The output of the mapper is a map from words (including duplicates) to the value 1.
    job.setMapperClass(InputMapper.class);

    // The output of the reducer is a map from unique words to their total counts.
    job.setReducerClass(CountWordsReducer.class);

    JobClient.runJob(job);
}
And the map() function:
public void map(LongWritable key, Text value, OutputCollector<Text, LongWritable> output, Reporter reporter) throws IOException {
    // The key is the character offset within the file of the start of the line, ignored.
    // The value is a line from the file.

    // This is me trying to hard-code it. I would prefer an explanation on how to get interactive input!
    String inputString = "data";
    String line = value.toString();
    Scanner scanner = new Scanner(line);
    while (scanner.hasNext()) {
        if (line.contains(inputString)) {
            String line1 = scanner.next();
            output.collect(new Text(line1), new LongWritable(1));
        }
    }
    scanner.close();
}
I am led to believe that I don't need a reducer stage for this problem. Any advice/explanations much appreciated!
The JobConf class is an extension of the Configuration class, and thus you can set custom properties:
JobConf job = new JobConf();
job.set("inputString", "data");
...
Then, as stated in the documentation for the Mapper: "Mapper implementations can access the JobConf for the job via the JobConfigurable.configure(JobConf) and initialize themselves." This means you have to implement such a method within your Mapper in order to read the desired parameter:
private static String inputString;

public void configure(JobConf job) {
    inputString = job.get("inputString");
}
Anyway, this is using the old API. With the new one it is easier to access the configuration since the context (and thus the configuration) is passed to the map method as an argument.
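For example (a sketch, reusing the same hypothetical "inputString" property), with the new API you set the property on the Configuration backing the Job and override setup() instead of configure():

// Driver side (new API)
Configuration conf = new Configuration();
conf.set("inputString", "data");
Job job = Job.getInstance(conf);

// Mapper side: read the property once per task from the Context's Configuration
private String inputString;

@Override
protected void setup(Context context) {
    inputString = context.getConfiguration().get("inputString");
}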
I have just started learning Hadoop and am still experimenting and trying to understand things. I am really curious about the usage of the OutputCollector class's collect() method: all the examples I have found so far call this method only once. Is the cost of calling this method really high (since it writes the output to a file)? While thinking about different scenarios, I have run into a situation where I need to call it more than once. Below is the relevant code snippet:
public static class Reduce extends MapReduceBase implements
        Reducer<IntWritable, Text, Text, NullWritable> {

    public void reduce(IntWritable key, Iterator<Text> values,
            OutputCollector<Text, NullWritable> output, Reporter reporter)
            throws IOException {
        Text outData = null;
        while (values.hasNext()) {
            outData = new Text();
            outData.set(values.next().toString());
            output.collect(outData, NullWritable.get());
        }
    }
}
Here the values object contains a large number of records that the mapper has emitted based on some filtering condition, and I need to write those records to the output file. Alternatively, I could use the approach given below:
public static class Reduce extends MapReduceBase implements
        Reducer<IntWritable, Text, Text, NullWritable> {

    public void reduce(IntWritable key, Iterator<Text> values,
            OutputCollector<Text, NullWritable> output, Reporter reporter)
            throws IOException {
        StringBuilder sb = new StringBuilder();
        while (values.hasNext()) {
            sb.append(values.next().toString() + "\r\n ");
        }
        Text outData = new Text();
        outData.set(sb.toString());
        output.collect(outData, NullWritable.get());
    }
}
Both approaches work fine on my single-node setup for a large input data set of up to 400k records, with the values object containing around 70k records. I want to ask which approach is better, and also whether the code above will behave well on a multi-node cluster. Any help appreciated. Thanks.
In the end it boils down to how much data (in terms of size in bytes) you write.
Both solutions have some size overhead: in the first example you write multiple strings and pay the constant overhead of serializing the length of each string; in the other solution you write the same amount of overhead in the form of your line separators.
So in terms of bytes both are equal, and collecting the data should not be significantly slower in either solution.
A very different part of your problem is memory usage. Think of a very large iteration of values: your StringBuilder will be inefficient because of the resize operations and all the memory it uses. The collect method is smarter and spills to disk if the write buffer fills up. On the other hand, if you have plenty of available memory and you want to write a single huge record in one go, this might be just as efficient as setting the write buffer to a similar size.
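As a side note (not part of the original answer), if you stay with the record-per-collect() approach you can also cut down on allocations by reusing a single Text instance, in the spirit of the WordCount question above; a rough sketch:

public static class Reduce extends MapReduceBase implements
        Reducer<IntWritable, Text, Text, NullWritable> {

    // reused across all collect() calls; collect() serializes the current
    // contents immediately, so overwriting the object afterwards is safe
    private final Text outData = new Text();

    public void reduce(IntWritable key, Iterator<Text> values,
            OutputCollector<Text, NullWritable> output, Reporter reporter)
            throws IOException {
        while (values.hasNext()) {
            outData.set(values.next().toString());
            output.collect(outData, NullWritable.get());
        }
    }
}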
Here is a log file:
2011-10-26 06:11:35 user1 210.77.23.12
2011-10-26 06:11:45 user2 210.77.23.17
2011-10-26 06:11:46 user3 210.77.23.12
2011-10-26 06:11:47 user2 210.77.23.89
2011-10-26 06:11:48 user2 210.77.23.12
2011-10-26 06:11:52 user3 210.77.23.12
2011-10-26 06:11:53 user2 210.77.23.12
...
I want to use MapReduce to sort by the number of log entries per user (the third field of each line), in descending order. In other words, I want the result to be displayed as:
user2 4
user3 2
user1 1
Now I have two questions:
By default, MapReduce will split the log file on spaces and carriage returns, but I only need the third field of each line; that is, I don't care about fields such as 2011-10-26, 06:11:35 and 210.77.23.12. How do I tell MapReduce to omit them and pick up only the user field?
By default, MapReduce sorts the result by key instead of by value. How do I make MapReduce sort the result by value (the number of log entries)?
Thank you.
For your first question:
You should probably pass the whole line to the mapper and just keep the third token for the mapping, emitting (user, 1) every time.
public class AnalyzeLogs
{
    public static class FindFriendMapper extends Mapper<Object, Text, Text, IntWritable> {

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException
        {
            // the log fields are space-separated; the third token is the user
            String[] tempStrings = value.toString().split(" ");
            context.write(new Text(tempStrings[2]), new IntWritable(1));
        }
    }
For your second question, I believe you cannot avoid having a second MR job after that (I cannot think of any other way). The reducer of the first job will just aggregate the values and emit a sum for each key, sorted by key, which is not yet what you need.
So you pass the output of this first job as input to the second MR job. The objective of this job is to do a somewhat special sort by value before passing the records to the reducers (which will do absolutely nothing).
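For completeness, a rough sketch of what that first job's summing reducer could look like (not spelled out in the original answer; the class name is just illustrative), following the usual word-count pattern:

public static class SumLogsReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable total = new IntWritable();

    @Override
    public void reduce(Text user, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        total.set(sum);
        // emits e.g. "user2    4"; this output becomes the input of the second job
        context.write(user, total);
    }
}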
Our Mapper for the second job will be the following:
public static class SortLogsMapper extends Mapper<Object, Text, Text, NullWritable> {

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException
    {
        context.write(value, NullWritable.get());
    }
}
As you can see, this mapper does not emit a real value at all (only a NullWritable). Instead, we have created a key that contains our value (our key is in key1 value1 format).
What remains to be done now is to tell the framework that it should sort based on value1 and not on the whole key1 value1. So we will implement a custom SortComparator:
public static class LogDescComparator extends WritableComparator
{
    protected LogDescComparator()
    {
        super(Text.class, true);
    }

    @Override
    public int compare(WritableComparable w1, WritableComparable w2)
    {
        Text t1 = (Text) w1;
        Text t2 = (Text) w2;
        String[] t1Items = t1.toString().split(" "); // the delimiter between key1 and value1; adjust if it is a tab
        String[] t2Items = t2.toString().split(" ");
        String t1Value = t1Items[1];
        String t2Value = t2Items[1];
        int comp = t2Value.compareTo(t1Value); // compare the "real" value part of our synthetic key, in descending order
        return comp;
    }
}
You can set your custom comparator with: job.setSortComparatorClass(LogDescComparator.class);
The reducer of this job should do nothing. However, if we don't set a reducer, the sorting of the mapper keys will not be done (and we need that). So you need to set an identity reducer for your second MR job, so that it performs no reduction but still ensures that the mapper's synthetic keys are sorted in the way we specified.
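Putting the second job together, the driver could look roughly like this (a sketch with new-API class names; the paths are placeholders, and using the base Reducer class gives you the identity behaviour):

Configuration conf = new Configuration();
Job sortJob = Job.getInstance(conf, "sort log counts by value");
sortJob.setJarByClass(AnalyzeLogs.class);

sortJob.setMapperClass(SortLogsMapper.class);
sortJob.setReducerClass(Reducer.class);            // the base Reducer acts as an identity reducer
sortJob.setSortComparatorClass(LogDescComparator.class);

sortJob.setOutputKeyClass(Text.class);
sortJob.setOutputValueClass(NullWritable.class);

FileInputFormat.addInputPath(sortJob, new Path("first-job-output"));   // placeholder path
FileOutputFormat.setOutputPath(sortJob, new Path("final-output"));     // placeholder path

System.exit(sortJob.waitForCompletion(true) ? 0 : 1);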
I am new to Hadoop and MapReduce and have been trying to write output to multiple files based on keys. Could anyone please provide a clear idea or a Java code snippet example of how to do it? My mapper is working exactly fine, and after the shuffle, keys and the corresponding values are obtained as expected. Thanks!
What I am trying to do is output only a few records from the input file to a new file.
Thus the new output file shall contain only those required records, ignoring the rest of the irrelevant records.
This would work fine even if I don't use MultipleTextOutputFormat.
The logic I implemented in the mapper is as follows:
public static class MapClass extends
        Mapper<LongWritable, Text, Text, Text> {

    StringBuilder emitValue = null;
    StringBuilder emitKey = null;
    Text kword = new Text();
    Text vword = new Text();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] parts;
        String line = value.toString();
        parts = line.split(" ");
        kword.set(parts[4].toString());
        vword.set(line.toString());
        context.write(kword, vword);
    }
}
Input to reduce is like this:
[key1]--> [value1, value2, ...]
[key2]--> [value1, value2, ...]
[key3]--> [value1, value2, ...] & so on
My interest is only in [key2]--> [value1, value2, ...], ignoring the other keys and their corresponding values. Please help me out with the reducer.
Using MultipleOutputs lets you emit records to multiple files, but only to a pre-defined number/type of files, not to an arbitrary number of files and not with an on-the-fly decision on the file name based on the key/value.
You can create your own OutputFormat by extending org.apache.hadoop.mapred.lib.MultipleTextOutputFormat. Your OutputFormat class can then decide the output file name, as well as the folder, according to the key/value emitted by the reducer. This can be achieved as follows:
package oddjob.hadoop;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

public class MultipleTextOutputFormatByKey extends MultipleTextOutputFormat<Text, Text> {

    /**
     * Use the key as part of the path for the final output file.
     */
    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String leaf) {
        return new Path(key.toString(), leaf).toString();
    }

    /**
     * When actually writing the data, discard the key since it is already in
     * the file path.
     */
    @Override
    protected Text generateActualKey(Text key, Text value) {
        return null;
    }
}
For more info read here.
PS: You will need to use the old mapred API to achieve that, as the newer API does not support MultipleTextOutputFormat yet! Refer to this.
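To illustrate how this could be wired up with the old mapred API (a sketch; apart from the OutputFormat above, the class names and paths are placeholders, and your mapper and reducer must then implement the old org.apache.hadoop.mapred interfaces):

JobConf conf = new JobConf(MyDriver.class);                // placeholder driver class
conf.setJobName("split output by key");

conf.setMapperClass(MyOldApiMapper.class);                 // placeholder old-API mapper
conf.setReducerClass(FilterReducer.class);                 // placeholder reducer that keeps only the wanted key

conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);

// plug in the custom output format so each key gets its own folder/file
conf.setOutputFormat(MultipleTextOutputFormatByKey.class);

FileInputFormat.setInputPaths(conf, new Path("input"));    // placeholder path
FileOutputFormat.setOutputPath(conf, new Path("output"));  // placeholder path

JobClient.runJob(conf);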