FlatMap function on a cogrouped RDD - Java

I am trying to use a flatMap function on a cogrouped RDD which has the signature:
JavaPairRDD<String, Tuple2<Iterable<Row>, Iterable<Row>>>
my flatmap function is as follows:
static FlatMapFunction<Tuple2<String, Tuple2<Iterable<Row>, Iterable<Row>>>, Row> setupF =
    new FlatMapFunction<Tuple2<String, Tuple2<Iterable<Row>, Iterable<Row>>>, Row>() {
        @Override
        public Iterable<Row> call(Tuple2<String, Tuple2<Iterable<Row>, Iterable<Row>>> row) {
        }
    };
But I am getting a compilation error. I am sure it must be a syntactic issue which I am not able to understand.
Full Code:
JavaPairRDD<String, Tuple2<Iterable<Row>, Iterable<Row>>> coGroupedRDD = rdd1.cogroup(rdd2);
JavaRDD<Row> jd = coGroupedRDD.flatmap(setupF);
static FlatMapFunction<Tuple2<String, Tuple2<Iterable<Row>, Iterable<Row>>>, Row> setupF =
    new FlatMapFunction<Tuple2<String, Tuple2<Iterable<Row>, Iterable<Row>>>, Row>() {
        @Override
        public Iterable<Row> call(Tuple2<String, Tuple2<Iterable<Row>, Iterable<Row>>> row) {
            //logic
        }
    };
Error:
The method flatmap(FlatMapFunction<Tuple2<String,Tuple2<Iterable<Row>,Iterable<Row>>>,Row>) is undefined for the type JavaPairRDD<String,Tuple2<Iterable<Row>,Iterable<Row>>>

A wild guess here: maybe the reason is that you wrote your code against the Spark 1.6 API but are actually using a Spark 2.0 dependency? The API differs between these two releases.
Spark 1.6 API FlatMapFunction method signature:
Iterable<R> call(T t)
Spark 2.0 API FlatMapFunction method signature:
Iterator<R> call(T t)
So try changing your code to this:
new FlatMapFunction<Tuple2<String, Tuple2<Iterable<Row>, Iterable<Row>>>, Row>() {
    @Override
    public Iterator<Row> call(Tuple2<String, Tuple2<Iterable<Row>, Iterable<Row>>> row) {
        //...
    }
};
or using the Java 8 lambda version:
coGroupedRDD
    .flatMap(t -> {
        List<Row> result = new ArrayList<>();
        //...use t._1, t._2._1, t._2._2 to construct the result list
        return result.iterator();
    });
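For completeness, here is a minimal compilable sketch assuming Spark 2.0+. The body simply emits every Row from both sides of the cogroup, which you would replace with your own logic; note also that the method on JavaPairRDD is flatMap, with a capital M, not flatmap as in the posted call.
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.sql.Row;
import scala.Tuple2;

public class CoGroupFlatMapSketch {
    // Spark 2.0+ signature: call(...) returns an Iterator, not an Iterable
    static FlatMapFunction<Tuple2<String, Tuple2<Iterable<Row>, Iterable<Row>>>, Row> setupF =
        new FlatMapFunction<Tuple2<String, Tuple2<Iterable<Row>, Iterable<Row>>>, Row>() {
            @Override
            public Iterator<Row> call(Tuple2<String, Tuple2<Iterable<Row>, Iterable<Row>>> grouped) {
                List<Row> result = new ArrayList<>();
                for (Row r : grouped._2._1) {   // rows from rdd1 for this key
                    result.add(r);
                }
                for (Row r : grouped._2._2) {   // rows from rdd2 for this key
                    result.add(r);
                }
                return result.iterator();
            }
        };

    static JavaRDD<Row> apply(JavaPairRDD<String, Tuple2<Iterable<Row>, Iterable<Row>>> coGroupedRDD) {
        return coGroupedRDD.flatMap(setupF);   // note: flatMap, not flatmap
    }
}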

Related

Spark UDF written in Java Lambda raises ClassCastException

Here's the exception:
java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to ... of type org.apache.spark.sql.api.java.UDF2 in instance of ...
If I don't implement the UDF by Lambda expression, it's ok. Like:
private UDF2 funUdf = new UDF2<String, String, String>() {
    @Override
    public String call(String a, String b) throws Exception {
        return fun(a, b);
    }
};
dataset.sparkSession().udf().register("Fun", funUdf, DataTypes.StringType);
functions.callUDF("Fun", functions.col("a"), functions.col("b"));
I am running in local mode, so this answer will not help: https://stackoverflow.com/a/28367602/4164722
Why? How can I fix it?
This is a working solution:
UDF1 myUDF = new UDF1<String, String>() {
    public String call(final String str) throws Exception {
        return str + "A";
    }
};
sparkSession.udf().register("Fun", myUDF, DataTypes.StringType);
Dataset<Row> rst = sparkSession.read().format("text").load("myFile");
rst = rst.withColumn("nameA", functions.callUDF("Fun", functions.col("name")));
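For the asker's original two-argument case, the same anonymous-class pattern should carry over; here is a sketch (fun(a, b) stands for the asker's own helper and is not defined here, and the output column name is just an example):
UDF2<String, String, String> funUdf = new UDF2<String, String, String>() {
    @Override
    public String call(String a, String b) throws Exception {
        return fun(a, b);   // the asker's own helper, not defined here
    }
};
dataset.sparkSession().udf().register("Fun", funUdf, DataTypes.StringType);
Dataset<Row> withFun = dataset.withColumn("funResult",
        functions.callUDF("Fun", functions.col("a"), functions.col("b")));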

Spark: Serialization not working with Aggregate

I have this class (in Java), which I want to use in Spark (1.6):
public class Aggregation {
private Map<String, Integer> counts;
public Aggregation() {
counts = new HashMap<String, Integer>();
}
public Aggregation add(Aggregation ia) {
String key = buildCountString(ia);
addKey(key);
return this;
}
private void addKey(String key, int cnt) {
if(counts.containsKey(key)) {
counts.put(key, counts.get(key) + cnt);
}
else {
counts.put(key, cnt);
}
}
private void addKey(String key) {
addKey(key, 1);
}
public Aggregation merge(Aggregation agg) {
for(Entry<String, Integer> e: agg.counts.entrySet()) {
this.addKey(e.getKey(), e.getValue());
}
return this;
}
private String buildCountString(Aggregation rec) {
...
}
}
When starting Spark I enabled Kryo and registered this class (in Scala):
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.registerKryoClasses(Array(classOf[Aggregation]))
And I want to use it with Spark aggregate like this (Scala):
rdd.aggregate(new InteractionAggregation)((agg, rec) => agg.add(rec), (a, b) => a.merge(b) )
Somehow this raises a "Task not serializable" exception.
But when I use the class with map and reduce, everything works fine:
val rdd2= interactionObjects.map( _ => new InteractionAggregation())
rdd2.reduce((a,b) => a.merge(b))
println(rdd2.count())
Do you have an idea why the error occurs with aggregate but not with map/reduce?
Thanks and regards!
Your Aggregation class should implement Serializable. When you call aggregate, the zero value (your new InteractionAggregation() object) is captured in the task closure that the driver sends to the workers, and closures go through Java serialization regardless of your Kryo settings, so a non-serializable class raises "Task not serializable". With map/reduce the objects are created on the executors themselves, which is why that path works.
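A minimal sketch of the fix (same fields and methods as the class above, omitted here):
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

public class Aggregation implements Serializable {
    private static final long serialVersionUID = 1L;

    private Map<String, Integer> counts = new HashMap<>();

    // ... add / addKey / merge / buildCountString as in the question ...
}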

Mapping an RDD with several comma-separated fields in Spark

I am new to Spark and I am going over a tutorial where a line with several fields is parsed with Scala; the Scala code is like this:
val pass = lines.map(_.split(",")).
  map(pass => (pass(15), pass(7).toInt)).
  reduceByKey(_ + _)
where pass is data received from socketTextStream (it's Spark Streaming). I am new to Spark and want to use Java to get the same result. I have declared a JavaReceiverInputDStream using:
JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);
I came up with two possible solutions:
using flatMap:
JavaDStream<String> words = lines.flatMap(
    new FlatMapFunction<String, String>() {
        @Override
        public Iterable<String> call(String x) {
            return Arrays.asList(x.split(","));
        }
    });
But it doesn't seem right, since the result breaks the CSV into words without any order.
Using map (compilation error). This looks like the appropriate solution, but I am not able to extract fields 15 and 7 using:
JavaDStream<List<String>> words = lines.map(
    new Function<String, List<String>>() {
        public List<String> call(String s) {
            return Arrays.asList(s.split(","));
        }
    });
This idea fails when I try to map List<String> => Tuple2<String, Int>.
The mapping code is:
JavaPairDStream<String, Integer> pairs = words.map(
    new PairFunction<List<String>, String, Integer>() {
        public Tuple2<String, Integer> call(List<String> s) throws Exception {
            return new Tuple2(s.get(15), 6);
        }
    });
The error:
method map in class org.apache.spark.streaming.api.java.AbstractJavaDStreamLike<T,This,R> cannot be applied to given types;
[ERROR] required: org.apache.spark.api.java.function.Function<java.util.List<java.lang.String>,R>
[ERROR] found: <anonymous org.apache.spark.api.java.function.PairFunction<java.util.List<java.lang.String>,java.lang.String,java.lang.Integer>>
[ERROR] reason: no instance(s) of type variable(s) R exist so that argument type <anonymous org.apache.spark.api.java.function.PairFunction<java.util.List<java.lang.String>,java.lang.String,java.lang.Integer>> conforms to formal parameter type org.apache.spark.api.java.
Any suggestions on this?
Use this code. It will extract the required fields from the String.
JavaDStream<String> lines = { ..... };
JavaPairDStream<String, Integer> pairs = lines.mapToPair(
    new PairFunction<String, String, Integer>() {
        @Override
        public Tuple2<String, Integer> call(String t) throws Exception {
            String[] words = t.split(",");
            return new Tuple2<String, Integer>(words[15], Integer.parseInt(words[7]));
        }
    });
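To also mirror the reduceByKey(_ + _) step from the Scala tutorial, here is a sketch that sums the per-key counts (it assumes the pairs stream built above and uses org.apache.spark.api.java.function.Function2):
JavaPairDStream<String, Integer> counts = pairs.reduceByKey(
    new Function2<Integer, Integer, Integer>() {
        @Override
        public Integer call(Integer a, Integer b) throws Exception {
            return a + b;   // sum the values for each key, like _ + _ in Scala
        }
    });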

sortBy in JavaRDD

I'm using Spark with Java, and I want to sort my map. In fact, I have a JavaPairRDD like this:
JavaPairRDD<String, Integer> rebondCountURL = session_rebond_2.mapToPair(
    new PairFunction<Tuple2<String, String>, String, String>() {
        @Override
        public Tuple2<String, String> call(Tuple2<String, String> stringStringTuple2) throws Exception {
            return new Tuple2<String, String>(stringStringTuple2._2, stringStringTuple2._1);
        }
    }).groupByKey().mapToPair(
    new PairFunction<Tuple2<String, Iterable<String>>, String, Integer>() {
        @Override
        public Tuple2<String, Integer> call(Tuple2<String, Iterable<String>> stringIterableTuple2) throws Exception {
            Iterable<String> strings = stringIterableTuple2._2;
            List<String> b = new ArrayList<String>();
            for (String s : strings) {
                b.add(s);
            }
            return new Tuple2<String, Integer>(stringIterableTuple2._1, b.size());
        }
    });
And I want to sort this JavaRDD using sortBy (in order to sort by the Integer).
Can you please help me do it?
Thank you in advance.
You need to create a function which extracts the sorting key from each element. Example from our code:
final JavaRDD<Something> stage2 = stage1.sortBy(
    new Function<Something, Long>() {
        private static final long serialVersionUID = 1L;

        @Override
        public Long call(Something value) throws Exception {
            return value.getTime();
        }
    }, true, 1);
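Applied to the asker's JavaPairRDD<String, Integer>, one way (a sketch, since sortBy itself lives on JavaRDD) is to swap the pair so the count becomes the key, sort with sortByKey, and swap back:
JavaPairRDD<Integer, String> byCount = rebondCountURL.mapToPair(
    new PairFunction<Tuple2<String, Integer>, Integer, String>() {
        @Override
        public Tuple2<Integer, String> call(Tuple2<String, Integer> t) throws Exception {
            return t.swap();   // (url, count) -> (count, url)
        }
    });

JavaPairRDD<String, Integer> sortedDesc = byCount.sortByKey(false).mapToPair(
    new PairFunction<Tuple2<Integer, String>, String, Integer>() {
        @Override
        public Tuple2<String, Integer> call(Tuple2<Integer, String> t) throws Exception {
            return t.swap();   // back to (url, count), now ordered by count descending
        }
    });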
Just a tip related to sortBy(): if you want to sort a set of user-defined objects, say Point, then implement the Comparable<Point> interface in the Point class and override the compareTo() method, where you can write your own sorting logic. After this, the sortBy function will take care of the sorting.
Note: your Point class must also implement the java.io.Serializable interface, or else you will encounter a NotSerializableException.
This is code based on @Vignesh's suggestion. You can sort with any custom implementation of Comparator. It is cleaner to write the comparator separately and use a reference to it in the Spark code:
rdd -> {
    JavaRDD<MaxProfitDto> result = rdd
        .keyBy(Recommendations.profitAsKey)
        .sortByKey(new CryptoVolumeComparator())
        .values();
So, the comparator looks like this:
import java.io.Serializable;
import java.math.BigDecimal;
import java.util.Comparator;
import models.CryptoDto;
import scala.Tuple2;
public class CryptoVolumeComparator implements Comparator<Tuple2<BigDecimal, CryptoDto>>, Serializable {
    private static final long serialVersionUID = 1L;

    @Override
    public int compare(Tuple2<BigDecimal, CryptoDto> v1, Tuple2<BigDecimal, CryptoDto> v2) {
        return v2._1().compareTo(v1._1());
    }
}

Anonymous class does not have an argument

I am learning Apache Spark. Given the Spark implementation in Java below, I am confused about some of its details.
public class JavaWordCount {
    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("Usage: JavaWordCount <master> <file>");
            System.exit(1);
        }
        JavaSparkContext ctx = new JavaSparkContext(args[0], "JavaWordCount",
                System.getenv("SPARK_HOME"), System.getenv("SPARK_EXAMPLES_JAR"));
        JavaRDD<String> lines = ctx.textFile(args[1], 1);
        JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
            public Iterable<String> call(String s) {
                return Arrays.asList(s.split(" "));
            }
        });
        JavaPairRDD<String, Integer> ones = words.map(new PairFunction<String, String, Integer>() {
            public Tuple2<String, Integer> call(String s) {
                return new Tuple2<String, Integer>(s, 1);
            }
        });
        JavaPairRDD<String, Integer> counts = ones.reduceByKey(new Function2<Integer, Integer, Integer>() {
            public Integer call(Integer i1, Integer i2) {
                return i1 + i2;
            }
        });
        List<Tuple2<String, Integer>> output = counts.collect();
        for (Tuple2 tuple : output) {
            System.out.println(tuple._1 + ": " + tuple._2);
        }
        System.exit(0);
    }
}
As I understand it, at the lines.flatMap(...) call an anonymous FlatMapFunction class is passed in as an argument. Then what does the String s mean? It seems that no String s is actually created and passed in, so how can the FlatMapFunction<String, String>(){} class work if no specific argument is passed into it?
The anonymous class instance you're passing is overriding the call(String s) method. Whatever is receiving this anonymous class instance is something that wants to make use of that call() method during its execution: it will be (somehow) constructing strings and passing them (directly or indirectly) to the call() method of whatever you've passed in.
So the fact that you're not invoking the method you've defined isn't a worry: something else is doing so.
This is a common use case for anonymous inner classes. A method m() expects to be passed something that implements the Blah interface, and the Blah interface has a frobnicate(String s) method in it. So we call it with
m(new Blah() {
    public void frobnicate(String s) {
        // exciting code goes here to do something with s
    }
});
and the m method will now be able to take this instance that implements Blah, and invoke frobnicate() on it.
Perhaps m looks like this:
public void m(Blah b) {
    b.frobnicate("whatever");
}
Now the frobnicate() method that we wrote in our inner class is being invoked, and as it runs, the parameter s will be set to "whatever".
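Putting the pieces together, here is a self-contained runnable version of this hypothetical Blah example (the names come from the answer above, not from Spark):
interface Blah {
    void frobnicate(String s);
}

public class AnonymousClassDemo {
    static void m(Blah b) {
        b.frobnicate("whatever");   // the caller decides which String to pass
    }

    public static void main(String[] args) {
        m(new Blah() {
            public void frobnicate(String s) {
                System.out.println("got: " + s);   // prints "got: whatever"
            }
        });
    }
}
Spark does essentially the same thing with your FlatMapFunction: it invokes call(s) once for each line in the RDD.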
All you are doing here is passing a FlatMapFunction as an argument to the flatMap method; your passed FlatMapFunction overrides call(String s):
JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
    public Iterable<String> call(String s) {
        return Arrays.asList(s.split(" "));
    }
});
The code implementing lines.flatMap could, conceptually, look like this (in reality Spark applies your function to every element of the RDD):
public JavaRDD<String> flatMap(FlatMapFunction<String, String> map)
{
    String str = "some string";
    Iterable<String> it = map.call(str);
    // do stuff with 'it'
    // return a JavaRDD<String>
}