MapReduce - WritableComparables - java

I’m new to both Java and Hadoop. I’m trying a very simple program to find frequent pairs.
e.g.
Input: My name is Foo. Foo is student.
Intermediate output:
Map:
(my, name): 1
(name, is): 1
(is, Foo): 2 // (is, Foo) = (Foo, is)
(is, student): 1
So in the end it should report that the most frequent pair is (is, Foo).
Pseudocode looks like this:
Map(key: line_num, value: line)
    words = split_words(line)
    for each w in words:
        for each neighbor x:
            emit((w, x), 1)
Here my key is not a single word, it’s a pair. While going through the documentation, I read that for each new key type we have to implement WritableComparable.
So I'm confused about that. If someone could explain this class, that would be great; I'm not sure the claim is really true. Then I can figure out on my own how to do it!
I don't want any code, neither a mapper nor anything else; I just want to understand what WritableComparable does. Which method of WritableComparable actually compares keys? I can see equals and compareTo, but I cannot find any explanation of them. Please, no code! Thanks
EDIT 1:
In compareTo I return 0 for the pair (a, b) = (b, a), but the two keys still don't go to the same reducer. Is there any way, in the compareTo method, to reset the key (b, a) to (a, b), or to generate a totally new key?
EDIT 2:
I don't know about generating a new key, but changing the logic in compareTo worked fine! Thanks, everyone!

WritableComparable is an interface that makes the class implementing it two things. It is Writable, meaning it can be written to and read from the network via serialization; this is necessary if you're going to use it as a key or value so that it can be sent between Hadoop nodes. And it is Comparable, meaning methods must be provided that show how one object of the given class compares to another; this is used when the framework sorts and groups keys for the Reducer.
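For illustration, a minimal sketch of such a key class might look like the following: a hypothetical WordPair that stores its two words in alphabetical order, so that (a, b) and (b, a) become the same key, along the lines of what EDIT 2 ended up doing.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// Hypothetical pair key: the two words are stored in alphabetical order,
// so (a, b) and (b, a) serialize, hash and compare identically.
public class WordPair implements WritableComparable<WordPair> {

    private String first = "";
    private String second = "";

    public WordPair() {} // no-arg constructor required by Hadoop

    public WordPair(String a, String b) {
        if (a.compareTo(b) <= 0) { first = a; second = b; }
        else                     { first = b; second = a; }
    }

    @Override
    public void write(DataOutput out) throws IOException { // Writable: serialize
        out.writeUTF(first);
        out.writeUTF(second);
    }

    @Override
    public void readFields(DataInput in) throws IOException { // Writable: deserialize
        first = in.readUTF();
        second = in.readUTF();
    }

    @Override
    public int compareTo(WordPair o) { // Comparable: used to sort and group keys
        int c = first.compareTo(o.first);
        return c != 0 ? c : second.compareTo(o.second);
    }

    @Override
    public int hashCode() { // used by the default HashPartitioner
        return first.hashCode() * 31 + second.hashCode();
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof WordPair)) return false;
        WordPair w = (WordPair) o;
        return first.equals(w.first) && second.equals(w.second);
    }
}
Note that the default HashPartitioner routes keys to reducers by hashCode(), not by compareTo(), which is why canonicalizing the order (rather than only returning 0 in compareTo) is what sends (a, b) and (b, a) to the same reducer.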
This interface is necessary when you want to create your own object to be a key. And you'd need to create your own InputFormat, as opposed to using one of the ones that come with Hadoop. This can be rather difficult (in my experience), especially if you're new to both Java and Hadoop.
So if I were you, I wouldn't bother with that, as there's a much simpler way. I would use TextInputFormat, which is conveniently both the default InputFormat and pretty easy to use and understand. You could simply emit each key as a Text object, which is pretty similar to a string. There is a caveat though; as you mentioned, "is Foo" and "Foo is" need to be evaluated as the same key. So with every pair of words you pull out, sort them alphabetically (with String.compareTo) before passing them as a key. That way you're guaranteed to have no repeats.
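A minimal sketch of that idea inside a map() method (the variable names and the comma separator are mine, not from the post):
// Inside a Mapper<LongWritable, Text, Text, LongWritable>: emit each adjacent word
// pair as a Text key, with the two words sorted alphabetically so that "is Foo"
// and "Foo is" collapse onto the same key.
String[] words = value.toString().split("\\s+");
for (int i = 0; i < words.length - 1; i++) {
    String a = words[i];
    String b = words[i + 1];
    String pair = (a.compareTo(b) <= 0) ? a + "," + b : b + "," + a;
    context.write(new Text(pair), new LongWritable(1));
}
A reducer then only has to sum the LongWritable counts per Text key.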

Here is a mapper class for your problem.
The frequent-pair logic is not implemented; I guess you were not looking for that.
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MR {

    public static class Mapper extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, LongWritable> {

        public static int check(String keyCheck) {
            // logic to check whether the key is frequent or not
            return 0;
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            Map<String, Integer> keyMap = new HashMap<String, Integer>();
            String line = value.toString();
            String[] words = line.split(" ");
            for (int i = 0; i < words.length - 1; i++) {
                String mapKeyString = words[i] + "," + words[i + 1];
                // logic to check whether mapKeyString is frequent or not
                int count = check(mapKeyString);
                keyMap.put(mapKeyString, count);
            }
            Set<Entry<String, Integer>> entries = keyMap.entrySet();
            for (Entry<String, Integer> entry : entries) {
                context.write(new Text(entry.getKey()), new LongWritable(entry.getValue()));
            }
        }
    }

    public static class Reduce extends Reducer<Text, LongWritable, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
        }
    }

    public static void main(String[] args) {
        Configuration configuration = new Configuration();
        try {
            Job job = new Job(configuration, "Word Job");
            job.setJarByClass(MR.class);
            job.setMapperClass(Mapper.class);
            job.setReducerClass(Reduce.class);
            job.setInputFormatClass(TextInputFormat.class);
            // declare the map and job output types so the framework does not fall back to the defaults
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(LongWritable.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.setInputPaths(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            job.waitForCompletion(true);
        } catch (IOException e) {
            e.printStackTrace();
        } catch (InterruptedException e) {
            e.printStackTrace();
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        }
    }
}

Related

The implementation of the AbstractCassandraTupleSink is not serializable

I created a program to count words in Wikipedia edits. It works without any errors. Then I created a Cassandra table with two columns, "word (text)" and "count (bigint)". The problem arises when I want to insert the words and counts into the Cassandra table. My program is the following:
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.cassandra.CassandraSink;
import org.apache.flink.streaming.connectors.wikiedits.WikipediaEditEvent;
import org.apache.flink.streaming.connectors.wikiedits.WikipediaEditsSource;
import org.apache.flink.util.Collector;

public class WordCount_in_cassandra {

    public static void main(String[] args) throws Exception {
        // check input parameters
        final ParameterTool params = ParameterTool.fromArgs(args);

        // set up the execution environment
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // make parameters available in the web interface
        env.getConfig().setGlobalJobParameters(params);

        DataStream<String> text = env.addSource(new WikipediaEditsSource()).map(WikipediaEditEvent::getTitle);

        DataStream<Tuple2<String, Integer>> counts =
                // split up the lines into pairs (2-tuples) containing: (word, 1)
                text.flatMap(new Tokenizer())
                // group by the tuple field "0" and sum up tuple field "1"
                .keyBy(0).sum(1);

        // emit result
        if (params.has("output")) {
            counts.writeAsText(params.get("output"));
        } else {
            System.out.println("Printing result to stdout. Use --output to specify output path.");
            counts.print();
            CassandraSink.addSink(counts)
                    .setQuery("INSERT INTO mar1.examplewordcount(word, count) values (?, ?);")
                    .setHost("127.0.0.1")
                    .build();
        }

        // execute program
        env.execute("Streaming WordCount");
    }

    public static final class Tokenizer implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
            // normalize and split the line
            String[] tokens = value.toLowerCase().split("\\W+");
            // emit the pairs
            for (String token : tokens) {
                if (token.length() > 0) {
                    out.collect(new Tuple2<>(token, 1));
                }
            }
        }
    }
}
After running this code I got this error:
Exception in thread "main" org.apache.flink.api.common.InvalidProgramException: The implementation of the AbstractCassandraTupleSink is not serializable. The object probably contains or references non serializable fields.
I searched a lot but I could not find any solution for it. Would you please tell me how I can solve the issue?
Thank you in advance.
I tried to replicate your problem, but I didn't get the serialization issue. Because I don't have a Cassandra cluster running, it fails in the open() call instead; but that happens after serialization, as open() is called when the operator is started by the TaskManager. So it feels like something may be wrong with your dependencies, such that the wrong class is somehow being used for the actual Cassandra sink.
BTW, it's always helpful to include context for your error, e.g. what version of Flink you're using, whether you're running this from an IDE or on a cluster, etc.
Just FYI, here are the Flink jars on my classpath...
flink-java/1.7.0/flink-java-1.7.0.jar
flink-core/1.7.0/flink-core-1.7.0.jar
flink-annotations/1.7.0/flink-annotations-1.7.0.jar
force-shading/1.7.0/force-shading-1.7.0.jar
flink-metrics-core/1.7.0/flink-metrics-core-1.7.0.jar
flink-shaded-asm/5.0.4-5.0/flink-shaded-asm-5.0.4-5.0.jar
flink-streaming-java_2.12/1.7.0/flink-streaming-java_2.12-1.7.0.jar
flink-runtime_2.12/1.7.0/flink-runtime_2.12-1.7.0.jar
flink-queryable-state-client-java_2.12/1.7.0/flink-queryable-state-client-java_2.12-1.7.0.jar
flink-shaded-netty/4.1.24.Final-5.0/flink-shaded-netty-4.1.24.Final-5.0.jar
flink-shaded-guava/18.0-5.0/flink-shaded-guava-18.0-5.0.jar
flink-hadoop-fs/1.7.0/flink-hadoop-fs-1.7.0.jar
flink-shaded-jackson/2.7.9-5.0/flink-shaded-jackson-2.7.9-5.0.jar
flink-clients_2.12/1.7.0/flink-clients_2.12-1.7.0.jar
flink-optimizer_2.12/1.7.0/flink-optimizer_2.12-1.7.0.jar
flink-streaming-scala_2.12/1.7.0/flink-streaming-scala_2.12-1.7.0.jar
flink-scala_2.12/1.7.0/flink-scala_2.12-1.7.0.jar
flink-shaded-asm-6/6.2.1-5.0/flink-shaded-asm-6-6.2.1-5.0.jar
flink-test-utils_2.12/1.7.0/flink-test-utils_2.12-1.7.0.jar
flink-test-utils-junit/1.7.0/flink-test-utils-junit-1.7.0.jar
flink-runtime_2.12/1.7.0/flink-runtime_2.12-1.7.0-tests.jar
flink-queryable-state-runtime_2.12/1.7.0/flink-queryable-state-runtime_2.12-1.7.0.jar
flink-connector-cassandra_2.12/1.7.0/flink-connector-cassandra_2.12-1.7.0.jar
flink-connector-wikiedits_2.12/1.7.0/flink-connector-wikiedits_2.12-1.7.0.jar
See "How to debug serializable exception in Flink?"; it might help. The exception happens when a non-serializable object is assigned to a field of a serializable one.
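Whether or not that is the cause here, the usual remedy in Flink user code is to mark the offending field transient and create it in open() of a rich function, so it is never shipped with the serialized operator. A minimal sketch, using the JDK's MessageDigest (which is not Serializable) as a stand-in for whatever non-serializable object the failing operator holds:
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Base64;

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

public class HashingMapper extends RichMapFunction<String, String> {

    // transient: excluded when Flink serializes the operator to ship it to the TaskManagers
    private transient MessageDigest digest;

    @Override
    public void open(Configuration parameters) throws Exception {
        // created on the TaskManager after deserialization, not on the client
        digest = MessageDigest.getInstance("SHA-256");
    }

    @Override
    public String map(String value) {
        byte[] hash = digest.digest(value.getBytes(StandardCharsets.UTF_8));
        return Base64.getEncoder().encodeToString(hash);
    }
}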

How to remove “Null” key from HashMap<String, String>?

According to Java, HashMap allows null as a key. My client said:
Use HashMap only, not others like Hashtable, ConcurrentHashMap, etc. Write the logic in such a way that the HashMap doesn't
contain null as a key anywhere in my overall product logic.
I have options like:
Create a wrapper class around HashMap and use it everywhere.
import java.util.HashMap;

public class WHashMap<T, K> extends HashMap<T, K> {

    @Override
    public K put(T key, K value) {
        if (key != null) {
            return super.put(key, value);
        }
        return null;
    }
}
I suggested another option, like removing the null key manually or disallowing it at each call site. That is also not acceptable, since the same check would be repeated everywhere.
Let me know if I missed any other, better approach.
Another option is to use HashMap with null allowed, as per the Java standard.
Let me know what a good approach is to handle such a case.
Change your put method implementation as follows
@Override
public K put(T key, K value) {
    if (key == null) {
        throw new NullPointerException("Key must not be null.");
    }
    return super.put(key, value);
}
Your code is a reasonable way to create a HashMap that can't contain a null key (though it's not perfect: what happens if someone calls putAll and passes in a map with a null key?); but I don't think that's what your client is asking for. Rather, I think your client is just saying that (s)he wants you to create a HashMap that doesn't contain a null key (even though it can). As in, (s)he just wants you to make sure that nothing in your program logic will ever put a null key in the map.
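For completeness, here is a sketch of how the wrapper could also guard putAll, which HashMap does not route through put(); this addition is mine, not part of the original answer:
import java.util.HashMap;
import java.util.Map;

public class WHashMap<T, K> extends HashMap<T, K> {

    @Override
    public K put(T key, K value) {
        if (key == null) {
            throw new NullPointerException("Key must not be null.");
        }
        return super.put(key, value);
    }

    @Override
    public void putAll(Map<? extends T, ? extends K> m) {
        // reject the whole batch if any key is null, before mutating the map
        for (T key : m.keySet()) {
            if (key == null) {
                throw new NullPointerException("Key must not be null.");
            }
        }
        super.putAll(m);
    }
}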

What is the correct way to handle IOException in a comparator?

I'm using a comparator to sort files by modified date. Those files are to be processed for contained data and deleted after. I have to make sure the files are in sequential order before processing them. Below is what I have so far.
public class FileComparator implements Comparator<Path> {
    @Override
    public int compare(Path o1, Path o2) {
        try {
            return Files.getLastModifiedTime(o1).compareTo(Files.getLastModifiedTime(o2));
        } catch (IOException exception) {
            // write to error log
            // ***
        }
    }
}
*** This is where I'm stuck. I have to return an int because compare requires it but I don't want to return zero and have false equivalency when it fails.
I tried restructuring the code but then if getLastModifiedTime() fails, o1Modified and o2Modified will be null.
public class FileComparator implements Comparator<Path> {
    @Override
    public int compare(Path o1, Path o2) {
        FileTime o1Modified = null;
        FileTime o2Modified = null;
        try {
            o1Modified = Files.getLastModifiedTime(o1);
            o2Modified = Files.getLastModifiedTime(o2);
        } catch (IOException exception) {
            // write to error log
        }
        return o1Modified.compareTo(o2Modified);
    }
}
Is there any standard way to handle situations like this?
I think the way to tackle this particular situation is by calling Files.getLastModifiedTime() on each path exactly once, storing the results and then using the stored results during the sorting.
This has several benefits:
It cleanly solves the IOException problem.
It doesn't repeatedly and unnecessarily perform the same costly I/O operations on the same files.
It ensures consistent ordering even if the last modification time of a file changes midway through the sort (see, for example, "Comparison method violates its general contract!" for how this could subtly break your code).
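A minimal sketch of that approach, assuming the paths are already collected in a List<Path> (the FileSorter class and method names are illustrative, not from the answer):
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public final class FileSorter {

    // Reads each file's modification time exactly once, then sorts on the cached values.
    public static void sortByModifiedTime(List<Path> files) throws IOException {
        Map<Path, FileTime> times = new HashMap<>();
        for (Path p : files) {
            times.put(p, Files.getLastModifiedTime(p)); // any IOException surfaces here, not mid-sort
        }
        files.sort(Comparator.comparing(times::get));
    }
}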
Throw a runtime exception wrapping the IOException:
try {
    return Files.getLastModifiedTime(o1).compareTo(Files.getLastModifiedTime(o2));
} catch (IOException e) {
    throw new UncheckedIOException("impossible to get last modified dates, so can't compare", e);
}
Note however that the modified time could change during the sort, which would make your comparator incorrect: it wouldn't respect its contract anymore. So a better approach would be to iterate through your paths first, and wrap them into some TimedPath object which would store the last modification time, and then sort those TimedPath objects.
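A rough sketch of such a TimedPath wrapper (the class is the answerer's suggestion; this particular shape is an assumption):
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;

// Captures the modification time once, so sorting never does I/O and never
// sees the time change mid-sort.
public final class TimedPath implements Comparable<TimedPath> {

    private final Path path;
    private final FileTime lastModified;

    public TimedPath(Path path) throws IOException {
        this.path = path;
        this.lastModified = Files.getLastModifiedTime(path);
    }

    public Path getPath() {
        return path;
    }

    @Override
    public int compareTo(TimedPath other) {
        return lastModified.compareTo(other.lastModified);
    }
}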

Hadoop: Implement a nested for loop in MapReduce [Java]

I am trying to implement a statistical formula that requires comparing a datapoint with all other possible datapoints. For example my dataset is something like:
10.22
15.77
16.55
9.88
I need to go through this file like:
for (i = 0; i < data.length(); i++)
    for (j = 0; j < data.length(); j++)
        Sum += (data[i] + data[j])
Basically, when I get each line through my map function, I need to execute some instructions against the rest of the file in the reducer, like in a nested for loop.
Now, I have tried using the DistributedCache and some form of ChainMapper, but to no avail. Any idea of how I can go about doing this would be really appreciated. Even an out-of-the-box way will be helpful.
You need to override the run method of the Reducer class.
public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKey()) {
        // this corresponds to the "i" of your first loop
        Text currentKey = context.getCurrentKey();
        Iterable<VALUEIN> currentValues = context.getValues();
        if (context.nextKey()) {
            // here you can get the next key and its values, corresponding to the "j" of your second loop
        }
    }
    cleanup(context);
}
Or, if you don't have a reducer, you can do the same in the Mapper by overriding its run method:
public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKeyValue()) {
        // context.nextKeyValue(), if invoked again here, gives you the next key/value,
        // which is the same as what you are looking for in the second loop
    }
    cleanup(context);
}
Let me know if this helps.

Using MapReduce to analyze log file

Here is a log file:
2011-10-26 06:11:35 user1 210.77.23.12
2011-10-26 06:11:45 user2 210.77.23.17
2011-10-26 06:11:46 user3 210.77.23.12
2011-10-26 06:11:47 user2 210.77.23.89
2011-10-26 06:11:48 user2 210.77.23.12
2011-10-26 06:11:52 user3 210.77.23.12
2011-10-26 06:11:53 user2 210.77.23.12
...
I want to use MapReduce to count the log entries by the third field (the user) and sort the results in descending order of that count. In other words, I want the result to be displayed as:
user2 4
user3 2
user1 1
Now I have two questions:
By default, MapReduce will split the log file on spaces and line breaks, but I only need the third field of each line; that is, I don't care about fields such as 2011-10-26, 06:11:35, and 210.77.23.12. How do I get MapReduce to omit them and pick up the user field?
By default, MapReduce will sort the result by key instead of by value. How do I get MapReduce to sort the result by value (the number of log entries)?
Thank you.
For your first question:
You should probably pass the whole line to the mapper, keep only the third token, and map (user, 1) every time.
public class AnalyzeLogs {

    public static class FindFriendMapper extends Mapper<Object, Text, Text, IntWritable> {

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            String[] tempStrings = value.toString().split(" "); // the log fields are separated by spaces
            context.write(new Text(tempStrings[2]), new IntWritable(1));
        }
    }
For your second question, I believe you cannot avoid having a second MR job after that (I cannot think of any other way). So the reducer of the first job will just aggregate the values and give a sum for each key, sorted by key, which is not yet what you need.
So, you pass the output of this job as input to this second MR job. The objective of this job is to do a somewhat special sorting by value before passing to the reducers (which will do absolutely nothing).
Our Mapper for the second job will be the following:
public static class SortLogsMapper extends Mapper<Object, Text, Text, NullWritable> {

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        context.write(value, NullWritable.get());
    }
}
As you can see, this mapper does not emit any meaningful value at all. Instead, we have created a key that contains our value (our key is in "key1 value1" format).
What remains to be done now is to tell the framework that it should sort based on value1 and not on the whole "key1 value1" string. So we will implement a custom SortComparator:
public static class LogDescComparator extends WritableComparator {

    protected LogDescComparator() {
        super(Text.class, true);
    }

    @Override
    public int compare(WritableComparable w1, WritableComparable w2) {
        Text t1 = (Text) w1;
        Text t2 = (Text) w2;
        String[] t1Items = t1.toString().split(" "); // probably it's a " "
        String[] t2Items = t2.toString().split(" ");
        String t1Value = t1Items[1];
        String t2Value = t2Items[1];
        // compare using the "real" value part of our synthetic key, in descending order
        return t2Value.compareTo(t1Value);
    }
}
You can set your custom comparator with job.setSortComparatorClass(LogDescComparator.class);
The reducer of this job should do nothing. However, if we don't set a reduce phase, the sorting of the mapper keys will not be done (and we need that). So you need an identity reducer for your second MR job: it performs no reduction, but it still ensures that the mapper's synthetic keys are sorted in the way we specified.
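A sketch of how the second job's driver might be wired up, assuming SortLogsMapper and LogDescComparator are nested in AnalyzeLogs as above; the driver class, paths, and remaining settings are assumptions, not part of the original answer:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SortLogsDriver {
    public static void main(String[] args) throws Exception {
        Job sortJob = Job.getInstance(new Configuration(), "Sort logs by count");
        sortJob.setJarByClass(SortLogsDriver.class);
        sortJob.setMapperClass(AnalyzeLogs.SortLogsMapper.class);
        // sort the synthetic "user count" keys by their count part, descending
        sortJob.setSortComparatorClass(AnalyzeLogs.LogDescComparator.class);
        // the base org.apache.hadoop.mapreduce.Reducer is an identity reducer, so leaving
        // the reducer class unset (with one reduce task) keeps the sorted order while doing no reduction
        sortJob.setNumReduceTasks(1);
        sortJob.setOutputKeyClass(Text.class);
        sortJob.setOutputValueClass(NullWritable.class);
        FileInputFormat.setInputPaths(sortJob, new Path(args[0]));  // output directory of the first job
        FileOutputFormat.setOutputPath(sortJob, new Path(args[1])); // final output directory
        sortJob.waitForCompletion(true);
    }
}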
