I have Java code that converts a JavaRDD to a Dataset and saves it to HDFS:
Dataset<User> userDataset = sqlContext.createDataset(userRdd.rdd(), Encoders.bean(User.class));
userDataset.write().json("some_path");
The User class is defined in Scala:
case class User(val name: Name, val address: Seq[Address]) extends Serializable
case class Name(firstName: String, lastName: Option[String])
case class Address(address: String)
The code compiles and runs successfully and the file is saved to HDFS, but the output file has an empty schema:
val users = spark.read.json("some_path")
users.count // 100,000, the same as userRdd
users.printSchema // users: org.apache.spark.sql.DataFrame = []
Why is Encoders.bean not working in this case?
Encoders.bean does not support Scala case classes; Encoders.product does. Encoders.product takes a TypeTag as a parameter, but a TypeTag cannot be initialized from Java, so I created a Scala object to provide one:
import scala.reflect.runtime.universe._
object MyTypeTags {
  val UserTypeTag: TypeTag[User] = typeTag[User]
}
Then, in the Java code:
Dataset<User> userDataset = sqlContext.createDataset(userRdd.rdd(), Encoders.product(MyTypeTags.UserTypeTag()));
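Alternatively, you can keep the reflection plumbing entirely on the Scala side by exposing a ready-made encoder instead of a TypeTag. A minimal sketch (MyEncoders is a name I'm introducing here, not part of the original code):

import org.apache.spark.sql.{Encoder, Encoders}

// Hypothetical helper: build the encoder in Scala so the Java side
// never has to touch scala.reflect at all.
object MyEncoders {
  val userEncoder: Encoder[User] = Encoders.product[User]
}

The Java call then becomes sqlContext.createDataset(userRdd.rdd(), MyEncoders.userEncoder()).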
I am creating a plugin to host a UDF, an http_get function that receives the HTTP address path, the query parameters, and the request headers. It is implemented in Scala:
@ScalarFunction(value = "http_get", deterministic = true)
@Description("Returns the result of an Http Get request")
@SqlType(value = StandardTypes.VARCHAR)
def httpGetFromArrayMap(
    @SqlType(StandardTypes.VARCHAR) httpAddress: Slice,
    @SqlType(constants.STRING_MAP) parameters: ImmutableMap[Slice, Slice],
    @SqlNullable @SqlType(constants.STRING_MAP) headers: ImmutableMap[Slice, Slice],
): String = {
  val stringHeaders = castSliceMap(headers)
  val stringParams = castSliceMap(parameters)
  val request = Http(httpAddress.toStringUtf8).headers(stringHeaders).params(stringParams)
  val stringResponse = request.asString.body
  stringResponse
}
When running on Trino, it raises the following exception:
io.trino.spi.TrinoException: Exact implementation of http_get do not match expected java types.
The problem is: what is the corresponding Java type for map(varchar,varchar)?
I've tried many:
Scala Map[String, String]
Java Map<String, String>
Java Map<Slice, Slice>
ImmutableMap<String, String>
ImmutableMap<Slice, Slice>
I can't find any example of a plugin implementing a function that receives a map.
Any help is appreciated. Thanks
Disclaimer: I'm not familiar at all with Trino UDFs.
According to the examples you can find in the Trino repository, the type to use for a SqlType("map(x,y)") parameter is io.trino.spi.block.Block.
The Block can then be manipulated to extract its content as if it were a regular Map.
See for instance: https://github.com/trinodb/trino/blob/master/core/trino-main/src/main/java/io/trino/operator/scalar/MathFunctions.java#L1340
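To make that concrete, here is a minimal Scala sketch of how a map(varchar,varchar) argument could be unpacked, assuming a Trino version where the map arrives as a single Block with keys at even positions and values at odd positions (verify the exact contract for your version against the linked examples); the UDF parameter would then be declared as Block instead of ImmutableMap[Slice, Slice]:

import io.trino.spi.block.Block
import io.trino.spi.`type`.VarcharType.VARCHAR

// Sketch only: walk the interleaved key/value positions of a map block
// and collect them into a plain Scala Map[String, String].
def blockToStringMap(block: Block): Map[String, String] =
  (0 until block.getPositionCount by 2).map { i =>
    VARCHAR.getSlice(block, i).toStringUtf8 -> VARCHAR.getSlice(block, i + 1).toStringUtf8
  }.toMap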
I am using the AWS DynamoDB Akka Persistence plugin (https://github.com/akka/akka-persistence-dynamodb), which doesn't have a read journal API (Akka Persistence Query) like Cassandra.
I can write journal data to DynamoDB; the event column is a Base64-encoded serialized Java object. My next task is to build CQRS using AWS Lambda or the AWS Java API to read DynamoDB, which has to convert the event data to a human-readable format.
Event data:
rO0ABXNyAD9jb20uY2Fwb25lLmJhbmsuYWN0b3JzLlBlcnNpc3RlbnRCYW5rQWNjb3VudCRCYW5rQWNjb3VudENyZWF0ZWQrGoMniq0AywIAAUwAC2JhbmtBY2NvdW50dAA6TGNvbS9jYXBvbmUvYmFuay9hY3RvcnMvUGVyc2lzdGVudEJhbmtBY2NvdW50JEJhbmtBY2NvdW50O3hwc3IAOGNvbS5jYXBvbmUuYmFuay5hY3RvcnMuUGVyc2lzdGVudEJhbmtBY2NvdW50JEJhbmtBY2NvdW5011CikshX3ysCAAREAAdiYWxhbmNlTAAIY3VycmVuY3l0ABJMamF2YS9sYW5nL1N0cmluZztMAAJpZHEAfgAETAAEdXNlcnEAfgAEeHBAj0AAAAAAAHQAA0VVUnQAJDM5M2M2NmRiLTJhYmItNDEwNS04NWUyLWMwZjc3MzExMDNlM3QAB3JjYXJkaW4=
I want to know how to convert the above Java object string value to a human-readable format. I tried using Java's ObjectInputStream, but I think I am doing something wrong.
Scala example:
val eventData:String = "rO0ABXNyAD9jb20uY2Fwb25lLmJhbmsuYWN0b3JzLlBlcnNpc3RlbnRCYW5rQWNjb3VudCRCYW5rQWNjb3VudENyZWF0ZWQrGoMniq0AywIAAUwAC2JhbmtBY2NvdW50dAA6TGNvbS9jYXBvbmUvYmFuay9hY3RvcnMvUGVyc2lzdGVudEJhbmtBY2NvdW50JEJhbmtBY2NvdW50O3hwc3IAOGNvbS5jYXBvbmUuYmFuay5hY3RvcnMuUGVyc2lzdGVudEJhbmtBY2NvdW50JEJhbmtBY2NvdW5011CikshX3ysCAAREAAdiYWxhbmNlTAAIY3VycmVuY3l0ABJMamF2YS9sYW5nL1N0cmluZztMAAJpZHEAfgAETAAEdXNlcnEAfgAEeHBAj0AAAAAAAHQAA0VVUnQAJDM5M2M2NmRiLTJhYmItNDEwNS04NWUyLWMwZjc3MzExMDNlM3QAB3JjYXJkaW4="
??? (and then what? How do I convert the above string value to a human-readable format?)
Thanks
Sri
OK, I was able to deserialize the object string data and convert it to JSON. Below is an example:
object DeserializeData extends App {
  import java.io.ByteArrayInputStream
  import java.io.ObjectInputStream
  import java.util.Base64
  import com.google.gson.Gson

  val base64encodedString = "rO0ABXNyAD9jb20uY2Fwb25lLmJhbmsuYWN0b3JzLlBlcnNpc3RlbnRCYW5rQWNjb3VudCRCYW5rQWNjb3VudENyZWF0ZWQrGoMniq0AlM3QAB3JjYXJkaW4="
  println("Base64 encoded string: " + base64encodedString)

  // Decode the Base64 text back into the raw serialized bytes
  val base64decodedBytes = Base64.getDecoder.decode(base64encodedString)
  val in = new ByteArrayInputStream(base64decodedBytes)
  val obin = new ObjectInputStream(in)
  val `object` = obin.readObject

  println("Deserialised data: \n" + `object`.toString)
  // You could also try...
  println("Object class is " + `object`.getClass.toString)

  // Convert the deserialized object to JSON with Gson
  val json = new Gson()
  val resp = json.toJson(`object`)
  println(resp)
}
A feature to read the AWS DynamoDB journal is now implemented, so there is no need for this kind of clunky code: https://github.com/akka/akka-persistence-dynamodb/pull/114/files. Thank you, Lightbend.
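With that merged, reading should go through the standard Akka Persistence Query API rather than manual deserialization. A hedged sketch of what that looks like; the plugin id "dynamodb-read-journal" is my assumption, so check the plugin docs for the real identifier:

import akka.actor.ActorSystem
import akka.persistence.query.PersistenceQuery
import akka.persistence.query.scaladsl.CurrentEventsByPersistenceIdQuery
import akka.stream.scaladsl.Sink

implicit val system: ActorSystem = ActorSystem("bank")

// Look up the read journal by its configuration id (assumed name here).
val readJournal = PersistenceQuery(system)
  .readJournalFor[CurrentEventsByPersistenceIdQuery]("dynamodb-read-journal")

// Replay the current events of one persistent actor as a stream.
readJournal
  .currentEventsByPersistenceId("account-1", 0L, Long.MaxValue)
  .runWith(Sink.foreach(envelope => println(envelope.event)))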
In Python you can simply pass a NumPy array to predict() to get predictions from your model. What is the equivalent in Java with a SavedModelBundle?
Python
model = tf.keras.models.Sequential([
    # layers go here
])
model.compile(...)
model.fit(x_train, y_train)
predictions = model.predict(x_test_maxabs)  # <= This line
Java
SavedModelBundle model = SavedModelBundle.load(path, "serve");
model.predict() // ????? // What does it take as in input? Tensor?
TensorFlow Python automatically converts your NumPy array to a tf.Tensor. In TensorFlow Java, you manipulate tensors directly.
Now, the SavedModelBundle does not have a predict method. You need to obtain the session and run it, using the SessionRunner and feeding it input tensors.
For example, based on the next generation of TF Java (https://github.com/tensorflow/java), your code ends up looking like this (note that I'm making a lot of assumptions here about x_test_maxabs, since your code sample does not explain clearly where it comes from):
try (SavedModelBundle model = SavedModelBundle.load(path, "serve")) {
    try (Tensor<TFloat32> input = TFloat32.tensorOf(...);
         Tensor<TFloat32> output = model.session()
             .runner()
             .feed("input_name", input)
             .fetch("output_name")
             .run()
             .get(0)
             .expect(TFloat32.DTYPE)) {
        float prediction = output.data().getFloat();
        System.out.println("prediction = " + prediction);
    }
}
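As for the TFloat32.tensorOf(...) placeholder, one way to build the input from plain float arrays is via the ndarray helpers; a sketch (note that in some earlier releases StdArrays lives under org.tensorflow.tools.ndarray instead of org.tensorflow.ndarray):

import org.tensorflow.ndarray.StdArrays
import org.tensorflow.types.TFloat32

// Sketch only: copy a 2-D float array into an NdArray and wrap it in a tensor.
val features = Array(Array(0.2f, 0.5f, 0.8f))
val input = TFloat32.tensorOf(StdArrays.ndCopyOf(features))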
If you are not sure of the names of the input/output tensors in your graph, you can obtain them programmatically by looking at the signature definition:
model.metaGraphDef().getSignatureDefMap().get("serving_default")
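For example, a small sketch that prints every input and output name from the serving signature (using the protobuf accessors generated for MetaGraphDef):

import scala.collection.JavaConverters._

// Sketch only: map logical signature keys to graph tensor names like "input_name:0",
// which are the names the session runner's feed/fetch calls expect.
val sig = model.metaGraphDef().getSignatureDefMap.get("serving_default")
sig.getInputsMap.asScala.foreach { case (key, info) => println(s"input  $key -> ${info.getName}") }
sig.getOutputsMap.asScala.foreach { case (key, info) => println(s"output $key -> ${info.getName}") }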
You can try Deep Java Library (DJL).
DJL internally uses TensorFlow Java and provides a high-level API to make inference easy:
Criteria<Image, Classifications> criteria =
    Criteria.builder()
        .setTypes(Image.class, Classifications.class)
        .optModelUrls("https://example.com/squeezenet.zip")
        .optTranslator(ImageClassificationTranslator
            .builder().addTransform(new ToTensor()).build())
        .build();

try (ZooModel<Image, Classifications> model = ModelZoo.load(criteria);
     Predictor<Image, Classifications> predictor = model.newPredictor()) {
    Image image = ImageFactory.getInstance().fromUrl("https://myimage.jpg");
    Classifications result = predictor.predict(image);
}
Check out the GitHub repo: https://github.com/awslabs/djl
There is a blog post: https://towardsdatascience.com/detecting-pneumonia-from-chest-x-ray-images-e02bcf705dd6
And a demo project: https://github.com/aws-samples/djl-demo/blob/master/pneumonia-detection/README.md
In the 0.3.1 API:
val model: SavedModelBundle = SavedModelBundle.load("path/to/model", "serve")
val inputTensor = TFloat32.tensorOf(...)
val function: ConcreteFunction = model.function(Signature.DEFAULT_KEY)
val result: Tensor = function.call(inputTensor)
// You can cast the result to the type you expect; the type of the returned tensor
// can be checked via the signature: model.function("serving_default").signature().toString()
After you get a result Tensor of any subtype, you can iterate over its values. In my example, I had a TFloat32 with shape (1, 56), so I found the max value via result.getFloat(0, idx).
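For instance, a sketch of that max-value scan over a (1, n) TFloat32 (assuming the 0.3.x NdArray accessors):

import org.tensorflow.types.TFloat32

// Sketch only: find the index of the largest value in a (1, n) tensor.
def argmax(result: TFloat32): Long = {
  val n = result.shape().size(1)
  (0L until n).maxBy(idx => result.getFloat(0L, idx))
}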
Recently I was writing some simple Scala code to analyse COVID-19 data. I call an API and map the structure of the JSON response onto my Scala case classes. If I do not apply a filter, the code works as expected; when I try to apply a filter, it does not work. I am a little confused about why.
case class RegionData(region: Option[String], totalInfected: Option[String], recovered: Option[String], deceased: Option[String])
case class Data(activeCases: String, recovered: String, deaths: String, totalCases: String, sourceUrl: String, lastUpdatedAtApify: String, readMe: String, regionData: Option[List[RegionData]])
These are my case classes for Scala.
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import scala.io.Source

object CovidAnalytics {
  val objectMapper: ObjectMapper = new ObjectMapper()
  objectMapper.registerModule(DefaultScalaModule)

  def main(args: Array[String]): Unit = {
    val getData: String = Source.fromURL("https://api.apify.com/v2/datasets/58a4VXwBBF0HtxuQa/items?format=json&clean=1").mkString
    val data: List[Data] = objectMapper.readValue(getData, classOf[List[Data]])
    // This is working
    println(data)
    val filter = data.filter(e => e.deaths != "")
    // This is not working (confused!!!)
    println(filter)
  }
}
Exception in thread "main" java.lang.ClassCastException:
scala.collection.immutable.HashMap$HashTrieMap cannot be cast to com.example.analytics.Data
  at com.example.analytics.CovidAnalytics$$anonfun$1.apply(CovidAnalytics.scala:18)
  at scala.collection.TraversableLike$$anonfun$filterImpl$1.apply(TraversableLike.scala:248)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:247)
  at scala.collection.TraversableLike$class.filter(TraversableLike.scala:259)
  at scala.collection.AbstractTraversable.filter(Traversable.scala:104)
  at com.example.analytics.CovidAnalytics$.main(CovidAnalytics.scala:18)
  at com.example.analytics.CovidAnalytics.main(CovidAnalytics.scala)
This is happening because of type erasure: classOf[List[Data]] carries no element-type information at runtime, so Jackson deserializes each element into a generic map rather than a Data, and the cast only fails later when the filter touches the elements. You can replace List with Array, whose element type is reified on the JVM.
val getData: String = Source.fromURL("https://api.apify.com/v2/datasets/58a4VXwBBF0HtxuQa/items?format=json&clean=1").mkString
val data = objectMapper.readValue(getData, classOf[Array[Data]]).toList //.toList if you need it as List
println(data)
val filter = data.filter(e => e.deaths != "")
println(filter)
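If you want to keep List, an alternative sketch is to hand Jackson a TypeReference, which preserves the element type that classOf erases:

import com.fasterxml.jackson.core.`type`.TypeReference

// Alternative sketch: the anonymous TypeReference subclass captures
// List[Data]'s element type, so Jackson builds Data instances, not maps.
val dataAsList: List[Data] =
  objectMapper.readValue(getData, new TypeReference[List[Data]] {})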
I have a request that looks like the following:
package pricing
import scala.beans.BeanProperty
class Request(@BeanProperty var name: String, @BeanProperty var surname: String) {
  def this() = this(name = "defName", surname = "defSurname")
}
The handler is as follows:
package pricing
import com.amazonaws.services.lambda.runtime.{Context, RequestHandler}
import scala.collection.JavaConverters
import spray.json._
class ApiGatewayHandler extends RequestHandler[Request, ApiGatewayResponse] {
  import DefaultJsonProtocol._

  def handleRequest(input: Request, context: Context): ApiGatewayResponse = {
    val headers = Map("x-foo" -> "coucou")
    val msg = "Hello " + input.name
    val message = Map[String, String]("message" -> msg)
    ApiGatewayResponse(
      200,
      message.toJson.toString(),
      JavaConverters.mapAsJavaMap[String, Object](headers),
      true
    )
  }
}
which has been documented as:
functions:
  pricing:
    handler: pricing.ApiGatewayHandler
    events:
      - http:
          path: pricing/test
          method: get
          documentation:
            summary: "submit your name and surname, the API says hi"
            description: ".. well, the summary is pretty exhaustive"
            requestBody:
              description: "Send over name and surname"
            queryParams:
              - name: "name"
                description: "your 1st name"
              - name: "surname"
                description: ".. guess .. "
            methodResponses:
              - statusCode: "200"
                responseHeaders:
                  - name: "x-foo"
                    description: "you can foo in here"
                responseBody:
                  description: "You'll see a funny message here"
                responseModels:
                  "application/json": "HelloWorldResponse"
Well, this is a copy-and-paste from one of the tutorials, and it is not working.
I guess that BeanProperty refers to properties of the body object; that is what I gather from the example here.
What if I would like to use query strings?
One attempt was:
package pricing
import scala.beans.BeanProperty
import spray.json._
abstract class ApiGatewayGetRequest(
    @BeanProperty httpMethod: String,
    @BeanProperty headers: Map[String, String],
    @BeanProperty queryStringParameters: Map[String, String])

abstract class ApiGatewayPostRequest(
    @BeanProperty httpMethod: String,
    @BeanProperty headers: Map[String, String],
    @BeanProperty queryStringParameters: Map[String, String])

class HelloWorldRequest(
    @BeanProperty httpMethod: String,
    @BeanProperty headers: Map[String, String],
    @BeanProperty queryStringParameters: Map[String, String]
) extends ApiGatewayGetRequest(httpMethod, headers, queryStringParameters) {

  private def getParam(param: String): String =
    queryStringParameters get param match {
      case Some(s) => s
      case None    => "default_" + param
    }

  def name: String = getParam("name")
  def surname: String = getParam("surname")

  def this() = this("GET", Map.empty, Map.empty)
}
Which results in:
{
  "message": "Hello default_name"
}
suggesting that the class has been initialized with an empty map in place of the queryStringParameters, which were in fact submitted correctly:
Mon Sep 25 20:45:22 UTC 2017 : Endpoint request body after transformations:
{"resource":"/pricing/test","path":"/pricing/test","httpMethod":"GET","headers":null,"queryStringParameters":{"name":"ciao", "surname":"bonjour"},"pathParameters":null,"stageVariables":null,
...
Note:
I am following this path because I feel it would be convenient and expressive to replace the Map in @BeanProperty queryStringParameters: Map[String, String] with a type T, for example:
case class Person(@BeanProperty val name: String, @BeanProperty val surname: String)
However, the code above treats {"name":"ciao", "surname":"bonjour"} as a String, without figuring out that it should deserialize that String.
EDIT
I have also tried replacing the Scala Map with a java.util.Map[String, String], without success.
By default, Serverless enables proxy integration between the lambda and API Gateway. What this means for you is that API Gateway is going to pass an object containing all the metadata about the request into your handler, as you have noticed:
Mon Sep 25 20:45:22 UTC 2017 : Endpoint request body after transformations: {"resource":"/pricing/test","path":"/pricing/test","httpMethod":"GET","headers":null,"queryStringParameters":{"name":"ciao", "surname":"bonjour"},"pathParameters":null,"stageVariables":null, ...
This clearly doesn't map to your model which has just the fields name and surname in it. There are several ways you could go about solving this.
1. Adapt your model
Your attempt with the HelloWorldRequest class does actually work if you make your class a proper POJO by making the fields mutable (and thus creating the setters for them):
class HelloWorldRequest(
    @BeanProperty var httpMethod: String,
    @BeanProperty var headers: java.util.Map[String, String],
    @BeanProperty var queryStringParameters: java.util.Map[String, String]
) extends ApiGatewayGetRequest(httpMethod, headers, queryStringParameters) {
AWS Lambda documentation states:
The get and set methods are required in order for the POJOs to work with AWS Lambda's built in JSON serializer.
Also keep in mind that Scala's Map is not supported.
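If you prefer Scala types inside the handler, you can accept the java.util.Map in the POJO and convert at the boundary; a small sketch (the helper name is mine):

import scala.collection.JavaConverters._

// Sketch only: null-safe bridge from the Java map Lambda populates
// to an immutable Scala Map for the rest of the handler code.
def queryParams(raw: java.util.Map[String, String]): Map[String, String] =
  Option(raw).map(_.asScala.toMap).getOrElse(Map.empty)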
2. Use a custom request template
If you don't need the metadata, then instead of changing your model you can make API Gateway pass only the data you need into the lambda using mapping templates.
In order to do this, you need to tell Serverless to use plain lambda integration (instead of proxy) and specify a custom request template.
Amazon API Gateway documentation has an example request template which is almost perfect for your problem. Tailoring it a little bit, we get:
functions:
  pricing:
    handler: pricing.ApiGatewayHandler
    events:
      - http:
          path: pricing/test
          method: get
          integration: lambda
          request:
            template:
              application/json: |
                #set($params = $input.params().querystring)
                {
                #foreach($paramName in $params.keySet())
                  "$paramName" : "$util.escapeJavaScript($params.get($paramName))"
                  #if($foreach.hasNext),#end
                #end
                }
This template will make a JSON out of the query string parameters, and it will now be the input of the lambda:
Endpoint request body after transformations: { "name" : "ciao" }
Which maps properly to your model.
Note that disabling proxy integration also changes the response format. You will notice that now your API returns your response model directly:
{"statusCode":200,"body":"{\"message\":\"Hello ciao\"}","headers":{"x-foo":"coucou"},"base64Encoded":true}
You can fix this by either modifying your code to return only the body, or by adding a custom response template:
response:
  template: $input.path('$.body')
This will transform the output into what you expect, but will blatantly ignore the statusCode and headers. You would need to make a more complex response configuration to handle those.
3. Do the mapping yourself
Instead of extending RequestHandler and letting AWS Lambda map the JSON to a POJO, you can instead extend RequestStreamHandler, which will provide you an InputStream and an OutputStream, so you can do the (de)serialization with the JSON serializer of your choice.