I'm trying to create a custom transformer in Spark 2.4.0. Saving it works fine. However, when I try to load it, I get the following error:
java.lang.NoSuchMethodException: TestTransformer.<init>(java.lang.String)
at java.lang.Class.getConstructor0(Class.java:3082)
at java.lang.Class.getConstructor(Class.java:1825)
at org.apache.spark.ml.util.DefaultParamsReader.load(ReadWrite.scala:496)
at org.apache.spark.ml.util.MLReadable$class.load(ReadWrite.scala:380)
at TestTransformer$.load(<console>:40)
... 31 elided
This suggests to me that it can't find my transformer's constructor, which doesn't really make sense to me.
MCVE:
import org.apache.spark.sql.{Dataset, DataFrame}
import org.apache.spark.sql.types.{StructType}
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable}
class TestTransformer(override val uid: String) extends Transformer with DefaultParamsWritable {
  def this() = this(Identifiable.randomUID("TestTransformer"))

  override def transform(df: Dataset[_]): DataFrame = {
    val columns = df.columns
    df.select(columns.head, columns.tail: _*)
  }

  override def transformSchema(schema: StructType): StructType = {
    schema
  }

  override def copy(extra: ParamMap): TestTransformer = defaultCopy[TestTransformer](extra)
}

object TestTransformer extends DefaultParamsReadable[TestTransformer] {
  override def load(path: String): TestTransformer = super.load(path)
}
val transformer = new TestTransformer("test")
transformer.write.overwrite().save("test_transformer")
TestTransformer.load("test_transformer")
Running this (I'm using a Jupyter notebook) leads to the above error. I've tried compiling and running it as a .jar file, with no difference.
What puzzles me is that the equivalent PySpark code works fine:
from pyspark.sql import SparkSession, DataFrame
from pyspark.ml import Transformer
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
class TestTransformer(Transformer, DefaultParamsWritable, DefaultParamsReadable):
    def transform(self, df: DataFrame) -> DataFrame:
        return df

TestTransformer().save('test_transformer')
TestTransformer.load('test_transformer')
How can I make a custom Spark transformer that can be saved and loaded?
I can reproduce your problem in spark-shell.
Trying to find the source of the problem, I looked into the DefaultParamsReadable and DefaultParamsReader sources, and I could see they use Java reflection.
https://github.com/apache/spark/blob/v2.4.0/mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala
lines 495-496
val instance =
cls.getConstructor(classOf[String]).newInstance(metadata.uid).asInstanceOf[Params]
I think Scala REPLs and Java reflection aren't good friends: the REPL compiles every class you define inside synthetic wrapper objects, which changes the constructor signatures.
If you run this snippet (after yours):
new TestTransformer().getClass.getConstructors
you'll get the following output:
res1: Array[java.lang.reflect.Constructor[_]] = Array(public TestTransformer($iw), public TestTransformer($iw,java.lang.String))
It is true: TestTransformer.<init>(java.lang.String) doesn't exist; every generated constructor takes the $iw wrapper instance as an extra argument.
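You can reproduce the failing lookup directly; this is essentially the call Spark makes (a minimal sketch):
// In the REPL this throws NoSuchMethodException, because the generated
// constructors all take the $iw wrapper instance as an extra argument:
classOf[TestTransformer].getConstructor(classOf[String])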
I found two workarounds:
Compiling your code with sbt, creating a jar, and then including it in spark-shell with :require worked for me. (You mentioned you tried a jar, though I don't know how you loaded it.)
Pasting the code into spark-shell with :paste -raw worked fine as well; see the sketch after the link below. I suppose -raw prevents the REPL from doing its shenanigans to your classes.
See: https://docs.scala-lang.org/overviews/repl/overview.html
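For example, the second workaround looks like this in a spark-shell session (a sketch, using the definitions from your MCVE):
scala> :paste -raw
// paste the TestTransformer class and its companion object here,
// then press Ctrl-D so they are compiled as top-level definitions

scala> val transformer = new TestTransformer("test")
scala> transformer.write.overwrite().save("test_transformer")
scala> TestTransformer.load("test_transformer")  // no NoSuchMethodException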
I'm not sure how you can adapt either of these to Jupyter, but I hope this info is useful to you.
NOTE: I actually used spark-shell in spark 2.4.1
Related
I am using IntelliJ for the first time to write a Kotlin program for Windows.
I need to read a file, so I used code from a sample site:
import java.io.File
import java.io.InputStream
fun main(args: Array<String>) {
    val inputStream: InputStream = File("bezkoder.txt").inputStream()
    val lineList = mutableListOf<String>()
    inputStream.bufferedReader().useLines { lines -> lines.forEach { lineList.add(it) } }
    lineList.forEach { println("> " + it) }
}
The thing is that it doesn't recognise the imports for the Java classes.
I guess it is something in my setup, but I have no idea where to look and haven't managed to find an answer.
This is my SDK setup screen
You can't use Java SDK classes in Kotlin/Native projects, only in Kotlin/JVM ones.
Please also make sure you have a valid JDK configuration in the project. See this answer for more details.
If you want to read a file in Kotlin/Native, see the CsvParser.kt example.
I'm trying to get the min, max, and mean of some Cassandra/Spark data, but I need to do it with Java.
import org.apache.spark.sql.DataFrame;
import static org.apache.spark.sql.functions.*;
DataFrame df = sqlContext.read()
        .format("org.apache.spark.sql.cassandra")
        .option("table", "someTable")
        .option("keyspace", "someKeyspace")
        .load();

df.groupBy(col("keyColumn"))
        .agg(min("valueColumn"), max("valueColumn"), avg("valueColumn"))
        .show();
EDITED to show the working version above.
Make sure to put quotes (") around someTable and someKeyspace.
Just import your data as a DataFrame and apply required aggregations:
import org.apache.spark.sql.DataFrame;
import static org.apache.spark.sql.functions.*;
DataFrame df = sqlContext.read()
        .format("org.apache.spark.sql.cassandra")
        .option("table", someTable)
        .option("keyspace", someKeyspace)
        .load();

df.groupBy(col("keyColumn"))
        .agg(min("valueColumn"), max("valueColumn"), avg("valueColumn"))
        .show();
where someTable and someKeyspace store table name and keyspace respectively.
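For reference, a hedged Scala equivalent of the same read-and-aggregate (same someTable/someKeyspace placeholders, same SQLContext-era API as the Java snippet):
import org.apache.spark.sql.functions.{avg, col, max, min}

// Read the Cassandra table into a DataFrame and aggregate per key.
val df = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .option("table", someTable)
  .option("keyspace", someKeyspace)
  .load()

df.groupBy(col("keyColumn"))
  .agg(min("valueColumn"), max("valueColumn"), avg("valueColumn"))
  .show()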
I suggest checking out https://github.com/datastax/spark-cassandra-connector/tree/master/spark-cassandra-connector-demos, which contains demos in both Scala and the equivalent Java.
You can also check out http://spark.apache.org/documentation.html, which has tons of examples that you can flip between Scala, Java, and Python versions.
I'm almost 100% certain that between those two links you'll find exactly what you're looking for.
If there's anything you're still having trouble with after that, feel free to update your question with a more specific error/problem.
In general,
compile the Scala file:
$ scalac Main.scala
then inspect the Java-level signatures of the resulting Main.class with javap (it disassembles the class rather than producing Java source):
$ javap Main
More info is available at the following URL:
http://alvinalexander.com/scala/scala-class-to-decompiled-java-source-code-classes
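For illustration, with a hypothetical Main.scala containing a simple object, the output looks roughly like this:
// Main.scala (hypothetical example)
object Main {
  def main(args: Array[String]): Unit = println("hello")
}

// $ scalac Main.scala
// $ javap Main
// Compiled from "Main.scala"
// public final class Main {
//   public static void main(java.lang.String[]);
// }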
I'm trying to write a UDF for Hadoop Hive that parses user agents. The following code works fine on my local machine, but on Hadoop I'm getting:
org.apache.hadoop.hive.ql.metadata.HiveException: Unable to execute method public java.lang.String MyUDF .evaluate(java.lang.String) throws org.apache.hadoop.hive.ql.metadata.HiveException on object MyUDF#64ca8bfb of class MyUDF with arguments {All Occupations:java.lang.String} of size 1',
Code:
import java.io.IOException;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.*;
import com.decibel.uasparser.OnlineUpdater;
import com.decibel.uasparser.UASparser;
import com.decibel.uasparser.UserAgentInfo;
public class MyUDF extends UDF {
    public String evaluate(String i) {
        UASparser parser = null;
        parser = new UASparser();
        String key = "";
        OnlineUpdater update = new OnlineUpdater(parser, key);
        UserAgentInfo info = null;
        info = parser.parse(i);
        return info.getDeviceType();
    }
}
Some facts that come to mind which I should mention:
I'm compiling in Eclipse with "Export runnable JAR file" and the "extract required libraries into generated JAR" option
I'm uploading this "fat jar" file with Hue
The minimal working example I managed to run:
public String evaluate(String i) {
    return "hello" + i.toString();
}
I guess the problem lies somewhere around the library I'm using (downloaded from https://udger.com), but I have no idea where.
Any suggestions?
Thanks, Michal
It could be a few things. Best thing is to check the logs, but here's a list of a few quick things you can check in a minute.
The jar does not contain all dependencies. I am not sure how Eclipse builds a runnable jar, but it may not include all of them. You can run
jar tf your-udf-jar.jar
to see what was included. You should see classes from com.decibel.uasparser. If not, you have to build the jar with the appropriate dependencies (usually you do that using Maven).
Different version of the JVM. If you compile with JDK 8 and the cluster runs JDK 7, it would also fail.
Hive version. Sometimes the Hive APIs change slightly, enough to be incompatible. Probably not the case here, but make sure to compile the UDF against the same versions of Hadoop and Hive that you have in the cluster.
You should always check whether info is null after the call to parse().
It looks like the library uses a key, meaning that it actually gets data from an online service (udger.com), so it may not work without a valid key. Even more important, the library updates itself online, contacting the online service for each record. This means, looking at the code, that it will create one updater thread per record. You should change the code to do the update only once, in the constructor, like the following:
public class MyUDF extends UDF {
    UASparser parser = new UASparser();

    public MyUDF() {
        super();
        String key = "PUT YOUR KEY HERE";
        // update only once, when the UDF is instantiated
        OnlineUpdater update = new OnlineUpdater(parser, key);
    }

    public String evaluate(String i) {
        UserAgentInfo info = parser.parse(i);
        if (info != null) return info.getDeviceType();
        // you want it to return null if the string is unparseable;
        // otherwise one bad record will stop your processing with an exception
        else return null;
    }
}
But to know for sure, you have to look at the logs: the YARN logs, but also the Hive logs on the machine you're submitting the job from (probably in /var/log/hive, though it depends on your installation).
Such a problem can probably be solved with the following steps:
Override the method UDF.getRequiredJars(), making it return an HDFS file path list whose values are determined by where you put the xxx_lib folder (below) in HDFS. Note that the list must contain each jar's full HDFS path string, such as hdfs://yourcluster/some_path/xxx_lib/some.jar.
Export your UDF code with the "Runnable JAR file" exporting wizard (choose "copy required libraries into a sub-folder next to the generated JAR"). This results in an xxx.jar and a lib folder xxx_lib next to xxx.jar.
Put xxx.jar and the xxx_lib folder into HDFS, matching the paths you returned in the first step.
Create the UDF using: add jar ${the-xxx.jar-hdfs-path}; create function your-function as '${qualified name of udf class}';
Try it. I tested this and it works.
I am trying to consult a Prolog file as a module, since JPL does not support multiple Prolog VMs.
In the swipl console, I can do something like this successfully:
?- consult(mod1:'data/load.pro') .
In Java (well, it is actually Scala, but both run on top of the JVM), I can consult the file directly without issue:
scala> import jpl._
scala> val q = new Query("consult", Array[Term](new Atom("data/load.pl")))
scala> q.query()
...
true
However, when I try to consult the file as a module, I always get an exception:
scala> val q = new Query("consult", Array[Term](new Atom("mod1:data/load.pl")))
scala> q.query()
jpl.PrologExcepion: PrologException: error(existence_error(source_sink, 'mod1:data/load.pl'), _0)
at jpl.Query.get1(Query.java:336)
at jpl.Query.hasMoreSolutions(Query.java:258)
at jpl.Query.oneSolution(Query.java:688)
at jpl.Query.query(Query.java:747)
at .<init>(<console>:15)
at .<clinit>(<console>)
....
Can anybody point me to the correct way of consulting a Prolog file as a module in JPL? Thanks!
I think you can move the module qualification onto the consult goal itself, and of course that also allows you to pass in the full path of your source file:
val q = new Query("mod1:consult('full_path_to/load.pl')")
I researched and looked into the PlayN game framework and I liked it a lot. I program in Scala and don't actually know Java, but that's not usually a problem since they work together great.
I've set up a basic project in Eclipse and imported all the libraries and dependencies. I even translated over the base Maven project code. Here are the two files:
Zeitgeist.scala
package iris.zeit.core
import playn.core.PlayN._
import playn.core.Game
import playn.core.Image
import playn.core.ImageLayer
class Zeitgeist extends Game {
  override def init() {
    var bgImage: Image = assets().getImage("images/bg.png")
    var bgLayer: ImageLayer = graphics().createImageLayer(bgImage)
    graphics().rootLayer().add(bgLayer)
  }

  override def paint(alpha: Float) {
    // painting stuffs
  }

  override def update(delta: Float) {
  }

  override def updateRate(): Int = {
    25
  }
}
Main.scala
package iris.zeit.desktop
import playn.core.PlayN
import playn.java.JavaPlatform
import iris.zeit.core.Zeitgeist
object Main {
  def main(args: Array[String]) {
    var platform: JavaPlatform = JavaPlatform.register()
    platform.assets().setPathPrefix("resources")
    PlayN.run(new Zeitgeist())
  }
}
The cool thing is it works! A window comes up perfectly. The only problem is I can't seem to load images. With the line assets().getImage("images/bg.png") above, it prints:
Could not load image: resources/images/bg.png [error=java.io.FileNotFoundException: resources/images/bg.png]
I've played around with the location of my resources folder to no avail. I was even able to find bg.png myself with java.io.File. Am I doing something wrong? Is there something I'm forgetting?
Looking at the code of JavaAssetsManager, it looks like it is trying to load a classpath resource, not a file. So you should check that your images are actually on the classpath at the path you give ("resources/images/bg.png").
Alternatively, you can use getRemoteImage and pass a file URL. As you succeeded in using java.io.File, you can get the URL with the File's toURI method (toURL is deprecated).
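A sketch of that alternative, assuming getRemoteImage accepts a URL string (the file path is whatever worked for you with java.io.File):
import java.io.File

// Resolve the file ourselves and hand PlayN a URL; go through toURI,
// since File.toURL is deprecated.
val url = new File("resources/images/bg.png").toURI.toURL
val bgImage = assets().getRemoteImage(url.toString)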
This almost certainly doesn't work because you're doing this:
platform.assets().setPathPrefix("resources")
That means you're saying your source folder looks like this:
src/main/resources/resources/images/bg.png
src/main/resources/resources/images/pea.png
src/main/resources/resources/images
I imagine it actually looks like one of these:
src/main/resources/assets/images/bg.png <-- 'assets' is the default prefix
src/main/resources/assets/images/pea.png
src/main/resources/assets/images
or:
src/main/resources/images/bg.png <-- You have failed to put a subfolder prefix in
src/main/resources/images/pea.png
src/main/resources/images
You can either do this, if you have no prefix:
plat.assets().setPathPrefix("")
Or just put your files in the assets sub-folder inside the resources folder.
It's worth noting that the current implementation calls:
getClass().getClassLoader().getResource(...)
Not:
getClass().getResource(...)
The difference is covered elsewhere, but the tldr is that plat.assets.getImage("images/pea.png") will work, but plat.assets.getImage("/images/pea.png") will not.
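A quick sketch of that difference (assuming pea.png sits under images/ at the classpath root):
// ClassLoader.getResource treats every path as relative to the classpath
// root and does not strip a leading '/':
getClass.getClassLoader.getResource("images/pea.png")   // found
getClass.getClassLoader.getResource("/images/pea.png")  // null

// Class.getResource resolves relative paths against the class's package
// and treats a leading '/' as the classpath root:
getClass.getResource("/images/pea.png")                 // found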