Here's the exception:
java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to ... of type org.apache.spark.sql.api.java.UDF2 in instance of ...
If I don't implement the UDF with a lambda expression, it works fine. Like this:
private UDF2 funUdf = new UDF2<String, String, String>() {
@Override
public String call(String a, String b) throws Exception {
return fun(a, b);
}
};
dataset.sparkSession().udf().register("Fun", funUdf, DataTypes.StringType);
functions.callUDF("Fun", functions.col("a"), functions.col("b"));
I am running in local mode, so this answer will not help: https://stackoverflow.com/a/28367602/4164722
Why? How can I fix it?
This is a working solution:
UDF1 myUDF = new UDF1<String, String>() {
public String call(final String str) throws Exception {
return str+"A";
}
};
sparkSession.udf().register("Fun", myUDF, DataTypes.StringType);
Dataset<Row> rst = sparkSession.read().format("text").load("myFile");
rst.withColumn("nameA", functions.callUDF("Fun", functions.col("name")));
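For completeness, here is a minimal sketch applying the same anonymous-class approach to the two-argument UDF from the question (fun, dataset, and the columns a and b are the asker's own; the funResult column name is only illustrative):

UDF2<String, String, String> funUdf = new UDF2<String, String, String>() {
    @Override
    public String call(String a, String b) throws Exception {
        return fun(a, b);
    }
};
dataset.sparkSession().udf().register("Fun", funUdf, DataTypes.StringType);
Dataset<Row> result = dataset.withColumn("funResult",
        functions.callUDF("Fun", functions.col("a"), functions.col("b")));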
Code snippet:
class Scratch {
Map<ActionType, SomeConsumer<DocumentPublisher, String, String>> consumerMapping = Map.of(
ActionType.REJECT, DocumentPublisher::rejectDocument,
ActionType.ACCEPT, DocumentPublisher::acceptDocument,
ActionType.DELETE, DocumentPublisher::deleteDocument);
private void runProcess(DocumentAction action) {
DocumentPublisher documentPublisher = DocumentPublisherFactory.getDocumentPublisher(action.getType());
SomeConsumer<DocumentPublisher, String, String> consumer = consumerMapping.get(action.getType());
consumer.apply(documentPublisher, "documentName", "testId1");
}
private interface DocumentPublisher {
void rejectDocument(String name, String textId);
void acceptDocument(String name, String textId);
void deleteDocument(String name, String textId);
}
}
Which type of functional interface can I use instead of SomeConsumer? The main issue here is that these are not static methods, and I will only know the target object at runtime.
I tried to use BiConsumer, but it tells me that I cannot refer to a non-static method in this way.
From your usage here:
consumer.apply(documentPublisher, "documentName", "testId1");
It is quite clear that the consumer consumes 3 things, so it's not a BiConsumer. You'd need a TriConsumer, which isn't available in the standard library.
You can write such a functional interface yourself though:
interface TriConsumer<T1, T2, T3> {
void accept(T1 a, T2 b, T3 c);
}
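With that interface in place, a sketch of how the map from the question would be declared (everything else unchanged from the asker's snippet):

Map<ActionType, TriConsumer<DocumentPublisher, String, String>> consumerMapping = Map.of(
        ActionType.REJECT, DocumentPublisher::rejectDocument,
        ActionType.ACCEPT, DocumentPublisher::acceptDocument,
        ActionType.DELETE, DocumentPublisher::deleteDocument);

// inside runProcess, only the interface name changes:
consumerMapping.get(action.getType()).accept(documentPublisher, "documentName", "testId1");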
If the only generic parameters that you are ever going to give it are <DocumentPublisher, String, String>, I think you should name it something specific to your application, such as DocumentPublisherAction:
interface DocumentPublisherAction {
void perform(DocumentPublisher publisher, String name, String textId);
}
Map<ActionType, DocumentPublisherAction> consumerMapping = Map.of(
ActionType.REJECT, DocumentPublisher::rejectDocument,
ActionType.ACCEPT, DocumentPublisher::acceptDocument,
ActionType.DELETE, DocumentPublisher::deleteDocument);
private void runProcess(DocumentAction action) {
DocumentPublisher documentPublisher = DocumentPublisherFactory.getDocumentPublisher(action.getType());
DocumentPublisherAction consumer = consumerMapping.get(action.getType());
consumer.perform(documentPublisher, "documentName", "testId1");
}
I am trying to use a flatmap function on the cogrouped RDD, which has the signature:
JavaPairRDD<String, Tuple2<Iterable<Row>, Iterable<Row>>>
My flatmap function is as follows:
static FlatMapFunction<Tuple2<String, Tuple2<Iterable<Row>, Iterable<Row>>>,Row> setupF = new FlatMapFunction<Tuple2<String, Tuple2<Iterable<Row>, Iterable<Row>>>,Row>() {
@Override
public Iterable<Row> call(Tuple2<String, Tuple2<Iterable<Row>, Iterable<Row>>> row) {
}};
But I am getting a compilation error. I am sure it must be a syntactic issue that I am not able to understand.
Full Code:
JavaPairRDD<String, Tuple2<Iterable<Row>, Iterable<Row>>> coGroupedRDD = rdd1.cogroup(rdd2);
JavaRDD<Row> jd = coGroupedRDD.flatmap(setupF);
static FlatMapFunction<Tuple2<String, Tuple2<Iterable<Row>, Iterable<Row>>>,Row> setupF = new FlatMapFunction<Tuple2<String, Tuple2<Iterable<Row>, Iterable<Row>>>,Row>() {
@Override
public Iterable<Row> call(Tuple2<String, Tuple2<Iterable<Row>, Iterable<Row>>> row) {
//logic
}};
Error:
The method flatmap(FlatMapFunction<Tuple2<String,Tuple2<Iterable<Row>,Iterable<Row>>>,Row>) is undefined for the type JavaPairRDD<String,Tuple2<Iterable<Row>,Iterable<Row>>>
A wild guess here: maybe the reason is that you wrote your code against the Spark 1.6 API but actually use a Spark 2.0 dependency? The API differs between these two releases.
Spark 1.6 API FlatMapFunction method signature:
Iterable<R> call(T t)
Spark 2.0 API FlatMapFunction method signature:
Iterator<R> call(T t)
So try changing your code to this:
new FlatMapFunction<Tuple2<String, Tuple2<Iterable<Row>, Iterable<Row>>>, Row>() {
@Override
public Iterator<Row> call(Tuple2<String, Tuple2<Iterable<Row>, Iterable<Row>>> row) {
//...
}
};
or, using the Java 8 lambda version:
coGroupedRDD
.flatMap(t -> {
List<Row> result = new ArrayList<>();
//...use t._1, t._2._1, t._2._2 to construct the result list
return result.iterator();
});
I want to load a CSV into a JavaPairRDD, using a value in the row as the key and the row itself as the value.
I have a CSV with lines like this:
a,1,1,2
b,1,1,2
a,2,2,3
b,2,2,3
I have a java object that represents these rows like this:
public class FactData implements Serializable{
public String key;
public int m1;
public int m2;
public int m3;
}
I'm currently getting to the pairRDD like this:
JavaRDD<FactData> lines = sc.textFile("test.csv").map(line -> FactData.fromFileLine(line));
JavaPairRDD<String, Iterable<FactData>> groupBy = lines.groupBy(row -> row.getId());
But I am wondering if there is a faster/better way to do this? Something like:
JavaPairRDD<String,Iterable<FactData>> groupedLines = sc.textFile("test.csv").flatMapToPair(new PairFlatMapFunction<String, String, Iterable<FactData>>() {
@Override
public Iterator<Tuple2<String, Iterable<FactData>>> call(String s) throws Exception {
//WHAT GOES IN HERE?
return null;
}
});
Any ideas appreciated.
Why don't you use keyBy?
Let's say you want the first value of each line as the key and the whole line as the value.
Then you can do this simply:
JavaRDD<String> lines = context.textFile("test.csv");
JavaPairRDD<String, String> newLines = lines.keyBy(new Function<String,String>(){
@Override
public String call(String arg0) throws Exception {
return arg0.split(",")[0];
}
});
If you want to collect it as a Map, maybe you can do this:
JavaPairRDD<String, Iterable<String>> newLines = lines.keyBy(new Function<String,String>(){
@Override
public String call(String arg0) throws Exception {
return arg0.split(",")[0];
}
}).mapValues(new Function<String, Iterable<String>>(){
@Override
public Iterable<String> call(String arg0) throws Exception {
return Arrays.asList(arg0.split(","));
}
});
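If you want to keep your FactData objects rather than raw strings, a minimal sketch along the same lines (assuming the FactData.fromFileLine method and public key field from your question):

JavaRDD<FactData> facts = sc.textFile("test.csv").map(line -> FactData.fromFileLine(line));
JavaPairRDD<String, FactData> byKey = facts.keyBy(row -> row.key);

Note that keyBy only pairs each record with its key; if you really need an Iterable<FactData> per key, you still need a groupByKey (and the shuffle it implies), so this is not necessarily faster than your original groupBy.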
I would like to make the resolution of Map values lazy, so I was thinking about providing a Supplier with a toString function. But the code below does not compile:
A default method cannot override a method from java.lang.Object
Does anyone have an idea how to solve this in a neat way?
@FunctionalInterface
private static interface ToStringSupplier extends Supplier<String>
{
default public String toString() {
return get();
}
}
The reason I want this is so that my consumers (which are in another repository) can first update their code:
From:
String value = (String) map.get(key);
To:
String value = map.get(key).toString();
After which I can change the implementation to a lazy approach:
From:
String value = expensiveCalculation();
map.put(key,value);
To:
Supplier<String> supplier = () -> expensiveCalculation();
map.put(key, supplier);
I found the code below works fine for my problem:
private static Object getToString(Supplier<String> s) {
return new Object()
{
@Override
public String toString() {
return s.get();
}
};
}
Supplier<String> supplier = () -> expensiveCalculation();
map.put(key, getToString(supplier));
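A usage sketch of that helper, assuming the map is declared as Map<String, Object> and expensiveCalculation() is the method from the question:

Map<String, Object> map = new HashMap<>();
map.put(key, getToString(() -> expensiveCalculation()));

// Nothing expensive has run yet; the calculation only happens when a
// consumer asks for the string:
String value = map.get(key).toString();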
As Louis Wasserman mentioned in the comments section of the question, it is not possible to override an instance method with a default one. It can be done with a new class that delegates the #toString call to the #get method of the provided supplier.
Here's how it can be done:
import java.util.Map;
import java.util.function.Supplier;
class Scratch {
public static final class ToStringSupplier implements Supplier<String> {
private final Supplier<String> supplier;
public ToStringSupplier(Supplier<String> supplier) {
if (supplier == null) {
throw new NullPointerException();
}
this.supplier = supplier;
}
@Override
public String toString() {
System.out.println("Invoked ToStringSupplier#toString.");
return get();
}
@Override
public String get() {
System.out.println("Invoked ToStringSupplier#get.");
return supplier.get();
}
}
public static void main(String[] args) {
final var supplier = new ToStringSupplier(() -> {
System.out.println("Invoked Supplier#get.");
return "The result of calculations.";
});
final var key = "key";
final var map = Map.of(key, supplier);
System.out.println("The map has been built.");
final var calculationResult = map.get(key).toString();
System.out.println(calculationResult);
System.out.flush();
}
}
The output is:
The map has been built.
Invoked ToStringSupplier#toString.
Invoked ToStringSupplier#get.
Invoked Supplier#get.
The result of calculations.
default is a reserved word used in switch statements.
You probably want to use abstract.
I am learning Apache Spark. Given the Java implementation of word count in Spark below, I am confused about some of its details.
public class JavaWordCount {
public static void main(String[] args) throws Exception {
if (args.length < 2) {
System.err.println("Usage: JavaWordCount <master> <file>");
System.exit(1);
}
JavaSparkContext ctx = new JavaSparkContext(args[0], "JavaWordCount",
System.getenv("SPARK_HOME"), System.getenv("SPARK_EXAMPLES_JAR"));
JavaRDD<String> lines = ctx.textFile(args[1], 1);
JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
public Iterable<String> call(String s) {
return Arrays.asList(s.split(" "));
}
});
JavaPairRDD<String, Integer> ones = words.map(new PairFunction<String, String, Integer>() {
public Tuple2<String, Integer> call(String s) {
return new Tuple2<String, Integer>(s, 1);
}
});
JavaPairRDD<String, Integer> counts = ones.reduceByKey(new Function2<Integer, Integer, Integer>() {
public Integer call(Integer i1, Integer i2) {
return i1 + i2;
}
});
List<Tuple2<String, Integer>> output = counts.collect();
for (Tuple2 tuple : output) {
System.out.println(tuple._1 + ": " + tuple._2);
}
System.exit(0);
}
}
According to my comprehension, starting at the lines.flatMap(...) call, an anonymous FlatMapFunction class is passed into lines.flatMap() as an argument. Then what does the String s mean? It seems that no String s is ever created and passed in, so how does the FlatMapFunction<String, String>(){} class work when no specific arguments are passed to it?
The anonymous class instance you're passing is overriding the call(String s) method. Whatever is receiving this anonymous class instance is something that wants to make use of that call() method during its execution: it will be (somehow) constructing strings and passing them (directly or indirectly) to the call() method of whatever you've passed in.
So the fact that you're not invoking the method you've defined isn't a worry: something else is doing so.
This is a common use case for anonymous inner classes. A method m() expects to be passed something that implements the Blah interface, and the Blah interface has a frobnicate(String s) method in it. So we call it with
m(new Blah() {
public void frobnicate(String s) {
//exciting code goes here to do something with s
}
});
and the m method will now be able to take this instance that implements Blah, and invoke frobnicate() on it.
Perhaps m looks like this:
public void m(Blah b) {
b.frobnicate("whatever");
}
Now the frobnicate() method that we wrote in our inner class is being invoked, and as it runs, the parameter s will be set to "whatever".
All you are doing here is passing a FlatMapFunction as an argument to the flatMap method; the FlatMapFunction you pass overrides call(String s):
JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>()
{
public Iterable<String> call(String s)
{
return Arrays.asList(s.split(" "));
}
});
The code implementing lines.flatMap could look like this for instance:
public JavaRDD<String> flatMap(FlatMapFunction<String, String> map)
{
String str = "some string";
Iterable<String> it = map.call(str);
// do stuff with 'it'
// return a JavaRDD<String>
}
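For what it's worth, the same word-count flatMap reads a little more directly as a Java 8 lambda (a sketch against the older Spark API used in the question, where call returns an Iterable). Spark supplies each line of the RDD as the parameter s; you never invoke the function yourself:

JavaRDD<String> words = lines.flatMap(s -> Arrays.asList(s.split(" ")));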