I am trying to implement a statistical formula that requires comparing each data point with every other data point. For example, my dataset looks something like this:
10.22
15.77
16.55
9.88
I need to go through this file like:
for (i = 0; i < data.length; i++)
    for (j = 0; j < data.length; j++)
        sum += data[i] + data[j]
Basically, when I get each line through my map function, I need to execute some instructions against the rest of the file in the reducer, as in a nested for loop.
I have tried using the DistributedCache and some form of ChainMapper, but to no avail. Any idea of how I can go about doing this would be really appreciated; even an out-of-the-box approach would be helpful.
You need to override the run method of the Reducer class.
public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKey()) {
        // This key and its values correspond to i in your outer loop
        Text currentKey = context.getCurrentKey();
        Iterable<VALUEIN> currentValues = context.getValues();
        if (context.nextKey()) {
            // Here you can get the next key and its values, corresponding to j in your inner loop
        }
    }
    cleanup(context);
}
Or, if you don't have a reducer, you can do the same in the Mapper by overriding its run method:
public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKeyValue()) {
        /* Invoking context.nextKeyValue() again here gives you the next
           key/value pair, which is what you are looking for in the second loop */
    }
    cleanup(context);
}
Let me know if this helps.
I am new to Hadoop.
I am trying to use MapReduce to get the min and max monthly precipitation value for each year.
Here is what one year of the data set looks like:
Product code,Station number,Year,Month,Monthly Precipitation Total (millimetres),Quality
IDCJAC0001,023000,1839,01,11.5,Y
IDCJAC0001,023000,1839,02,11.4,Y
IDCJAC0001,023000,1839,03,20.8,Y
IDCJAC0001,023000,1839,04,10.5,Y
IDCJAC0001,023000,1839,05,4.8,Y
IDCJAC0001,023000,1839,06,90.4,Y
IDCJAC0001,023000,1839,07,54.2,Y
IDCJAC0001,023000,1839,08,97.4,Y
IDCJAC0001,023000,1839,09,41.4,Y
IDCJAC0001,023000,1839,10,40.8,Y
IDCJAC0001,023000,1839,11,113.2,Y
IDCJAC0001,023000,1839,12,8.9,Y
And this is the result I get for the year 1839:
1839 1.31709005E9 1.3172928E9
Obviously, the result does not match the original data, but I cannot figure out why this happens.
Your code has multiple issues.
(1) In MinMaxExposure you write doubles but read ints. You also use the Double type (implying that you care about nulls) but do not handle nulls in serialization/deserialization. If you really need nulls, you should write something like this:
// write
out.writeBoolean(value != null);
if (value != null) {
out.writeDouble(value);
}
// read
if (in.readBoolean()) {
value = in.readDouble();
} else {
value = null;
}
If you do not need to store nulls, replace Double with double.
(2) In the map function you wrap your code in an IOException catch block, which doesn't make sense. If the input data has records in an incorrect format, you will most probably get a NullPointerException or NumberFormatException from Double.parseDouble(), and you do not handle those.
Checking for nulls after you have called parseDouble also doesn't make sense, since parseDouble returns a primitive and never yields null.
(3) You pass the map key to the reducer as Text. I would recommend passing the year as an IntWritable (and configuring your job with job.setMapOutputKeyClass(IntWritable.class);). A sketch of such a mapper follows the reducer code below.
(4) maxExposure must be handled similarly to minExposure in the reducer code. Currently you just return the value of the last record.
Your logic for finding the min and max exposure in the reducer seems off: you set maxExposure twice and never check whether it is actually the maximum. I'd go with:
public void reduce(Text key, Iterable<MinMaxExposure> values,
        Context context) throws IOException, InterruptedException {
    // Double.MIN_VALUE is the smallest positive double, so it is not a safe
    // starting point for a maximum; start from the infinities instead.
    double minExposure = Double.POSITIVE_INFINITY;
    double maxExposure = Double.NEGATIVE_INFINITY;
    for (MinMaxExposure val : values) {
        if (val.getMinExposure() < minExposure) {
            minExposure = val.getMinExposure();
        }
        if (val.getMaxExposure() > maxExposure) {
            maxExposure = val.getMaxExposure();
        }
    }
    MinMaxExposure resultRow = new MinMaxExposure();
    resultRow.setMinExposure(minExposure);
    resultRow.setMaxExposure(maxExposure);
    context.write(key, resultRow);
}
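A minimal sketch of the mapper side for point (3), assuming the CSV layout shown in the question and the MinMaxExposure setters used above (field positions and setter names come from the question, not from a verified API); the reducer's key type would then become IntWritable as well:
public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    String[] fields = value.toString().split(",");
    // Skip the header row and anything too short to parse.
    if (fields.length < 5 || "Year".equals(fields[2])) {
        return;
    }
    int year = Integer.parseInt(fields[2]);
    double precipitation = Double.parseDouble(fields[4]);
    MinMaxExposure exposure = new MinMaxExposure();
    exposure.setMinExposure(precipitation);
    exposure.setMaxExposure(precipitation);
    context.write(new IntWritable(year), exposure);
}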
I created a program to count words in Wikipedia edits. It works without any errors. Then I created a Cassandra table with two columns, "word" (text) and "count" (bigint). The problem appears when I want to insert the words and counts into the Cassandra table. My program is the following:
public class WordCount_in_cassandra {
public static void main(String[] args) throws Exception {
// Checking input parameters
final ParameterTool params = ParameterTool.fromArgs(args);
// set up the execution environment
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// make parameters available in the web interface
env.getConfig().setGlobalJobParameters(params);
DataStream<String> text=env.addSource(new WikipediaEditsSource()).map(WikipediaEditEvent::getTitle);
DataStream<Tuple2<String, Integer>> counts =
// split up the lines in pairs (2-tuples) containing: (word,1)
text.flatMap(new Tokenizer())
// group by the tuple field "0" and sum up tuple field "1"
.keyBy(0).sum(1);
// emit result
if (params.has("output")) {
counts.writeAsText(params.get("output"));
} else {
System.out.println("Printing result to stdout. Use --output to specify output path.");
counts.print();
CassandraSink.addSink(counts)
.setQuery("INSERT INTO mar1.examplewordcount(word, count) values (?, ?);")
.setHost("127.0.0.1")
.build();
}
// execute program
env.execute("Streaming WordCount");
}//main
public static final class Tokenizer implements FlatMapFunction<String, Tuple2<String, Integer>> {
@Override
public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
// normalize and split the line
String[] tokens = value.toLowerCase().split("\\W+");
// emit the pairs
for (String token : tokens) {
if (token.length() > 0) {
out.collect(new Tuple2<>(token, 1));
}
}
}
}
}
After running this code I got this error:
Exception in thread "main" org.apache.flink.api.common.InvalidProgramException: The implementation of the AbstractCassandraTupleSink is not serializable. The object probably contains or references non serializable fields.
I searched a lot but could not find any solution for it. Would you please tell me how I can solve this issue?
Thank you in advance.
I tried to replicate your problem, but I didn't get the serialization issue. Because I don't have a Cassandra cluster running, it fails in the open() call instead, but that happens after serialization, as open() is called when the operator is started by the TaskManager. So it feels like something may be wrong with your dependencies, such that the wrong class is somehow being used for the actual Cassandra sink.
BTW, it's always helpful to include context for your error - e.g. what version of Flink, are you running this from an IDE or on a cluster, etc.
Just FYI, here are the Flink jars on my classpath...
flink-java/1.7.0/flink-java-1.7.0.jar
flink-core/1.7.0/flink-core-1.7.0.jar
flink-annotations/1.7.0/flink-annotations-1.7.0.jar
force-shading/1.7.0/force-shading-1.7.0.jar
flink-metrics-core/1.7.0/flink-metrics-core-1.7.0.jar
flink-shaded-asm/5.0.4-5.0/flink-shaded-asm-5.0.4-5.0.jar
flink-streaming-java_2.12/1.7.0/flink-streaming-java_2.12-1.7.0.jar
flink-runtime_2.12/1.7.0/flink-runtime_2.12-1.7.0.jar
flink-queryable-state-client-java_2.12/1.7.0/flink-queryable-state-client-java_2.12-1.7.0.jar
flink-shaded-netty/4.1.24.Final-5.0/flink-shaded-netty-4.1.24.Final-5.0.jar
flink-shaded-guava/18.0-5.0/flink-shaded-guava-18.0-5.0.jar
flink-hadoop-fs/1.7.0/flink-hadoop-fs-1.7.0.jar
flink-shaded-jackson/2.7.9-5.0/flink-shaded-jackson-2.7.9-5.0.jar
flink-clients_2.12/1.7.0/flink-clients_2.12-1.7.0.jar
flink-optimizer_2.12/1.7.0/flink-optimizer_2.12-1.7.0.jar
flink-streaming-scala_2.12/1.7.0/flink-streaming-scala_2.12-1.7.0.jar
flink-scala_2.12/1.7.0/flink-scala_2.12-1.7.0.jar
flink-shaded-asm-6/6.2.1-5.0/flink-shaded-asm-6-6.2.1-5.0.jar
flink-test-utils_2.12/1.7.0/flink-test-utils_2.12-1.7.0.jar
flink-test-utils-junit/1.7.0/flink-test-utils-junit-1.7.0.jar
flink-runtime_2.12/1.7.0/flink-runtime_2.12-1.7.0-tests.jar
flink-queryable-state-runtime_2.12/1.7.0/flink-queryable-state-runtime_2.12-1.7.0.jar
flink-connector-cassandra_2.12/1.7.0/flink-connector-cassandra_2.12-1.7.0.jar
flink-connector-wikiedits_2.12/1.7.0/flink-connector-wikiedits_2.12-1.7.0.jar
"How to debug serializable exception in Flink?" might help. This typically happens because you are putting a non-serializable field into a serializable class.
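If it does turn out that one of your functions holds a non-serializable object, the usual fix is not to capture that object when the job graph is built but to create it in open() on the worker. A minimal sketch; NonSerializableClient is a hypothetical placeholder, not a real Flink or Cassandra class:
public class EnrichingMapper extends RichMapFunction<String, String> {
    // transient: this field is not shipped with the serialized function
    private transient NonSerializableClient client;

    @Override
    public void open(Configuration parameters) {
        // Built on the TaskManager, after the function has been deserialized
        client = new NonSerializableClient("127.0.0.1");
    }

    @Override
    public String map(String value) {
        return client.enrich(value);
    }
}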
Hi, I am using BLAS to do some math computation in Spark. I have two JavaPairRDDs that both have a Double[] field, and I want to calculate the dot product as follows:
userPairRDD.cartesian(itemPairRDD).mapToPair(
new PairFunction<Tuple2<Tuple2<String, Double[]>, Tuple2<String, Double[]>>, String, ItemAndWeight>() {
@Override
public Tuple2<String, ItemAndWeight> call(Tuple2<Tuple2<String, Double[]>, Tuple2<String, Double[]>> tuple2Tuple2Tuple2) throws Exception {
BLAS.getInstance().ddot("......");
.......
}
}
)
My question is: in my call() I invoke BLAS.getInstance() every time, which might be inefficient. Can I create only one BLAS object outside call() and use that same object to do ddot()?
Is there anything to take care of since this is a distributed program? Thanks in advance.
You don't need a shared variable in this case. BLAS.getInstance() just returns a static/singleton instance, so there is nothing inefficient here.
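If you still want to hoist the lookup out of call() for readability, here is a small sketch, assuming you are using netlib-java's com.github.fommil.netlib.BLAS (which is what the getInstance() call suggests). Note that ddot expects primitive double[] arrays, so the boxed Double[] from your RDDs would need to be unboxed first:
public class DotProductHelper {
    // getInstance() already returns a singleton, so caching it in a static
    // field only saves the lookup call per record.
    private static final com.github.fommil.netlib.BLAS BLAS_INSTANCE =
            com.github.fommil.netlib.BLAS.getInstance();

    // Dot product of two equally sized vectors with unit stride.
    public static double dot(double[] u, double[] v) {
        return BLAS_INSTANCE.ddot(u.length, u, 1, v, 1);
    }
}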
I thought this would be simple to implement, but it's starting to be a pain.
I've got an ArrayWritable subclass like so:
public class VertexDistanceArrayWritable extends ArrayWritable {
public VertexDistanceArrayWritable() {
super(VertexDistanceWritable.class);
}
public VertexDistanceArrayWritable(VertexDistanceWritable[] v) {
super(VertexDistanceWritable.class, v);
}
}
And a Writable subclass like so:
public class VertexDistanceWritable implements Writable {
//Implements write, readFields, and some custom functions that aren't used yet
}
In my Giraph compute function, messages are VertexDistanceArrayWritables. I want to iterate through every VertexDistanceWritable in every message (VertexDistanceArrayWritable). Here is my compute function:
@Override
public void compute(Vertex<Text, MapWritable, FloatWritable> vertex,
Iterable<VertexDistanceArrayWritable> messages) throws IOException {
for(VertexDistanceArrayWritable message : messages) {
for(VertexDistanceWritable distEntry : message) {
//Do stuff with distEntry
}
}
//do other stuff
vertex.voteToHalt();
}
When I compile the code, I get this error:
for-each not applicable to expression type
for(VertexDistanceWritable distEntry : message) {
required: array or java.lang.Iterable
found: VertexDistanceArrayWritable
So now I have a problem. I want to iterate over the arrayWritable sub-class.
I've tried the following:
Changing that line to for(VertexDistanceWritable distEntry : message.toArray()), which tells me that for-each is not applicable to type Object (required: array or java.lang.Iterable, found: Object).
Changing that line to for(VertexDistanceWritable distEntry : message.get()), which gives me the error: incompatible types -- required: VertexDistanceWritable, found: Writable. This is the strangest problem -- VertexDistanceWritable implements Writable, so shouldn't this work fine?
Writing my own custom "get_foo()" function for VertexDistanceArrayWritable, which returns values as a VertexDistanceWritable[]. Of course, values is private, and according to the documentation it has no getter other than get(), which I'm already having problems with.
I just want a way to iterate over my VertexDistanceArrayWritable class. Is this even possible in Hadoop? It has to be, right? I should be able to iterate over a bunch of elements I made in an array, no? It seems like pretty darn basic stuff.
After about 30 minutes of experimenting and googling, I found a clue here. It's sort of cheesy, but it seems to compile correctly: basically, iterate over Writables and then cast each one to my custom writable.
for (VertexDistanceArrayWritable message : messages) {
    for (Writable distWritable : message.get()) {
        VertexDistanceWritable distEntry = (VertexDistanceWritable) distWritable;
        // do other stuff
    }
}
I haven't yet confirmed whether it works correctly; I will update and confirm my answer once I can make sure it works.
Edit: it works. It might require a copy constructor, since I had one for VertexDistanceWritable, but I never checked that.
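An alternative that keeps the cast in one place is to add a typed accessor to VertexDistanceArrayWritable itself; getDistances() below is a name I'm introducing for illustration, not an existing method:
// Inside VertexDistanceArrayWritable: wrap ArrayWritable's get() and cast
// each element back to the concrete type the array was built from.
public VertexDistanceWritable[] getDistances() {
    Writable[] raw = get();
    VertexDistanceWritable[] typed = new VertexDistanceWritable[raw.length];
    for (int i = 0; i < raw.length; i++) {
        typed[i] = (VertexDistanceWritable) raw[i];
    }
    return typed;
}
The compute loop then becomes for (VertexDistanceWritable distEntry : message.getDistances()) { ... }.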
I’m new to both Java and Hadoop. I’m trying a very simple program to get Frequent pairs.
e.g.
Input: My name is Foo. Foo is student.
Intermediate Output:
Map:
(my, name): 1
(name ,is): 1
(is, Foo): 2 // (is, Foo) = (Foo, is)
(is, student): 1
So finally it should report that the most frequent pair is (is, Foo).
Pseudo code looks like this:
Map(key: line_num, value: line)
    words = split_words(line)
    for each w in words:
        for each neighbor x of w:
            emit((w, x), 1)
Here my key is not a single word, it's a pair. While going through the documentation, I read that for each new key type we have to implement WritableComparable, and I'm confused about that; I'm not sure it's really true. If someone could explain this interface, that would be great; then I can figure out on my own how to do it.
I don't want any code, neither the mapper nor anything else; I just want to understand what WritableComparable does. Which method of WritableComparable actually compares keys? I can see equals and compareTo, but I cannot find any explanation of them. Please no code! Thanks.
EDIT 1:
In compareTo I return 0 for pairs where (a, b) = (b, a), but they still don't go to the same reducer. Is there any way, in the compareTo method, to reset the key (b, a) to (a, b), or to generate a totally new key?
EDIT 2:
I don't know about generating a new key, but changing the logic in compareTo worked fine! Thanks everyone!
WritableComparable is an interface that makes the class implementing it two things at once. It is Writable, meaning it can be written to and read from the network via serialization; this is necessary if you're going to use it as a key or value, so that it can be sent between Hadoop nodes. And it is Comparable, which means it must provide a compareTo method that shows how one object of the class compares to another; this is used when the framework sorts and groups keys for the Reducer.
This interface is necessary when you want to create your own object to serve as a key. You may also need to create your own InputFormat instead of using one of the ones that come with Hadoop. This can be rather difficult (in my experience), especially if you're new to both Java and Hadoop.
So if I were you, I wouldn't bother with that, as there's a much simpler way. I would use TextInputFormat, which is conveniently both the default InputFormat and pretty easy to use and understand. You could simply emit each key as a Text object, which is pretty similar to a string. There is a caveat though: as you mentioned, "is Foo" and "Foo is" need to be evaluated as the same key. So for every pair of words you pull out, sort the two words alphabetically (using String.compareTo) before combining them into the key. That way you're guaranteed to have no repeats.
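A minimal sketch of that idea inside a mapper body (the variable names are illustrative, not from your code):
// Order the two words before building the key so that ("Foo", "is") and
// ("is", "Foo") collapse to the same Text key.
String first = words[i];
String second = words[i + 1];
String pairKey = first.compareTo(second) <= 0
        ? first + "," + second
        : second + "," + first;
context.write(new Text(pairKey), new IntWritable(1));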
Here is a mapper class for your problem; the logic for finding the most frequent pair of words is not implemented, since I guess you were not looking for that.
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MR {
public static class Mapper extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, LongWritable>
{
public static int check (String keyCheck)
{
// logic to check whether the key is frequent or not
return 0;
}
@Override
protected void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
Map< String, Integer> keyMap=new HashMap<String, Integer>();
String line=value.toString();
String[] words=line.split(" ");
for(int i=0;i<(words.length-1);i++)
{
String mapkeyString=words[i]+","+words[i+1];
// Logic to check whether mapKeyString is frequent or not.
int count =check(mapkeyString);
keyMap.put(mapkeyString, count);
}
Set<Entry<String,Integer>> entries=keyMap.entrySet();
for(Entry<String, Integer> entry:entries)
{
context.write(new Text(entry.getKey()), new LongWritable(entry.getValue()));
}
}
}
public static class Reduce extends Reducer<Text, LongWritable, Text, Text>
{
protected void reduce(Text key, Iterable<LongWritable> Values,
Context context)
throws IOException, InterruptedException {
}
}
public static void main(String[] args) {
Configuration configuration=new Configuration();
try {
Job job=new Job(configuration, "Word Job");
job.setMapperClass(Mapper.class);
job.setReducerClass(Reduce.class);
// Declare the map output and job output types; without these the job
// falls back to the defaults and fails with a type mismatch at runtime.
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(LongWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setInputFormatClass(TextInputFormat.class);
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (InterruptedException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (ClassNotFoundException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}