I am writing a small UDF
val transform = udf((x: Array[Byte]) => {
val mapper = new ObjectMapper() with ScalaObjectMapper
val stream: InputStream = new ByteArrayInputStream(x);
val obs = new ObjectInputStream(stream)
val stock = mapper.readValue(obs, classOf[util.Hashtable[String, String]])
stock
})
wherein I get the error:
java.lang.UnsupportedOperationException: Schema for type java.util.Hashtable[String,String] is not supported
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:809)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:740)
at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56)
at org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:926)
at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:739)
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:736)
at org.apache.spark.sql.functions$.udf(functions.scala:3898)
... 59 elided
Can anyone help me understand why this error occurs?
The error just means that Spark does not know how to handle Java Hashtables. We can reproduce it with this simple UDF:
val gen = udf(() => new java.util.Hashtable[String, String]())
Spark tries to create a DataType (to put in a Spark schema) from a java.util.Hashtable, which it does not know how to do. Spark does understand Scala maps, though. Indeed, the following code
val gen2 = udf(() => Map("a" -> "b"))
spark.range(1).select(gen2()).show()
yields
+--------+
| UDF()|
+--------+
|[a -> b]|
+--------+
To fix the first UDF, and yours as well, you can convert the Hashtable to a Scala map. Converting a HashMap is easy with JavaConverters; I do not know of an equally direct way for a Hashtable, but you can do it like this:
import collection.JavaConverters._
val gen3 = udf(() => {
val table = new java.util.Hashtable[String, String]()
table.put("a", "b")
Map(table.entrySet.asScala.toSeq.map(x => x.getKey -> x.getValue) :_*)
})
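Applied to the UDF from the question, a hedged sketch could look like the following. It keeps the Jackson deserialization but drops the ObjectInputStream wrapper (Jackson can read from the plain InputStream) and returns a Scala Map instead of the Hashtable; the jackson-module-scala import path may differ depending on your version.
import java.io.ByteArrayInputStream
import java.util
import scala.collection.JavaConverters._
import org.apache.spark.sql.functions.udf
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.ScalaObjectMapper // path varies by jackson-module-scala version

val transform = udf((x: Array[Byte]) => {
  val mapper = new ObjectMapper() with ScalaObjectMapper
  val stream = new ByteArrayInputStream(x)
  val table = mapper.readValue(stream, classOf[util.Hashtable[String, String]])
  // Return a Scala Map so Spark can derive a MapType schema for the UDF result
  Map(table.entrySet.asScala.toSeq.map(e => e.getKey -> e.getValue): _*)
})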
I am playing with Graal for running JavaScript as a guest language, and would like to know if there is a way to use the JavaScript Array.map functionality on a host (Java) object or proxy. Demo Kotlin code follows, but it should be close enough to Java.
fun main() {
val context = Context.newBuilder().build()
val javaOutputList = mutableListOf<Integer>()
val javaList = listOf(2, 2, 3, 4, 5)
val proxyJavaList = ProxyArray.fromList(javaList)
context.polyglotBindings.apply {
putMember("javaOutputList", javaOutputList)
putMember("javaList", javaList)
putMember("proxyJavaList", proxyJavaList)
}
val script = """
var javaOutputList = Polyglot.import('javaOutputList');
var javaList = Polyglot.import('javaList');
var proxyJavaList = Polyglot.import('proxyJavaList');
var abc = [1, 2, 3];
abc.forEach(x => javaOutputList.add(x)); // WORKS
//abc.map(x => x + 1) // WORKS
//javaList.map(x => x + 1) // DOES NOT WORK (map not a method on list)
proxyJavaList.map(x => x + 1) // DOES NOT WORK (message not supported: INVOKE)
""".trimIndent()
val result = context.eval("js", script)
val resultList = result.`as`(List::class.java)
println("result: $resultList")
println("javaOutputList: $javaOutputList")
}
Using ProxyArray looked the most promising to me, but I still couldn't get it to work. Is this functionality expected to be supported?
EDIT: with the accepted answer the code works; here is the change, for anyone interested:
val context = Context.newBuilder()
//.allowExperimentalOptions(true) // doesn't seem to be needed
.option("js.experimental-foreign-object-prototype", "true")
.build()
The root of the problem is that array-like non-JavaScript objects do not have Array.prototype on their prototype chain by default. So, Array.prototype.map is not accessible using javaList.map/proxyJavaList.map syntax.
You can either invoke Array.prototype.map directly, like Array.prototype.map.call(javaList, x => x+1), or you can use the experimental option js.experimental-foreign-object-prototype=true (which we added recently) that puts Array.prototype on the prototype chain of all array-like objects. javaList.map/proxyJavaList.map will then be available.
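To illustrate the first workaround, here is a hedged sketch of a host program, written in Scala against the same org.graalvm.polyglot API as the Kotlin demo; allowAllAccess(true) is assumed so the guest lambda can read elements of the host list.
import org.graalvm.polyglot.Context

object MapCallDemo {
  def main(args: Array[String]): Unit = {
    val context = Context.newBuilder("js")
      .allowAllAccess(true) // lets the JS callback read elements of the host list
      .build()

    val javaList = java.util.Arrays.asList(1, 2, 3, 4, 5)
    context.getBindings("js").putMember("javaList", javaList)

    // Invoke Array.prototype.map directly on the array-like host object
    val result = context.eval("js", "Array.prototype.map.call(javaList, x => x + 1)")

    for (i <- 0L until result.getArraySize)
      println(result.getArrayElement(i)) // 2, 3, 4, 5, 6
  }
}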
This question is the same as the one posted here. It has an accepted answer for Scala, but I need to implement the same thing in Java.
How to select a subset of fields from an array column in Spark?
import org.apache.spark.sql.Row
case class Record(id: String, size: Int)
val dropUseless = udf((xs: Seq[Row]) => xs.map{
case Row(id: String, size: Int, _) => Record(id, size)
})
df.select(dropUseless($"subClasss"))
I have tried to implement the above in Java but couldn't get it working. I'd appreciate any help. Thanks.
this.spark.udf().register("dropUseless",
(UDF1<Seq<Row>, Seq<Row>>) rows -> {
Seq<Row> seq = JavaConversions
.asScalaIterator(
JavaConversions.seqAsJavaList(rows)
.stream()
.map((Row t) -> RowFactory.create(new Object[] {t.getAs("id"), t.getAs("size")})
).iterator())
.toSeq();
return seq;
}, DataTypes.createStructType(Arrays.asList(
DataTypes.createStructField("id", DataTypes.StringType, false),
DataTypes.createStructField("size", DataTypes.IntegerType, true))
)
);
Assuming you have a DataFrame (df), you can use native SQL to extract a new DataFrame (ndf) containing the results you want.
Try this :
df.registerTempTable("df");
DataFrame ndf = sqlContext.sql("SELECT ..... FROM df WHERE ...");
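As one possible concrete form of that SQL for the original task, here is a hedged sketch (shown in Scala for brevity, but the SQL itself is identical from Java; spark is the SparkSession). It assumes Spark 2.4+ for the transform higher-order function and an array-of-struct column named subClasss whose elements contain id and size fields.
// Keep only the id and size fields of each element in the array column
df.createOrReplaceTempView("df")
val ndf = spark.sql(
  "SELECT transform(subClasss, x -> named_struct('id', x.id, 'size', x.size)) AS subClasss FROM df")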
I would love it if someone could guide me on converting a Scala (or Java) ResultSet to a Spark DataFrame.
I cannot use this notation:
val jdbcDF = spark.read
.format("jdbc")
.option("url", "jdbc:mysql://XXX-XX-XXX-XX-XX.compute-1.amazonaws.com:3306/")
.option("dbtable", "pg_partner")
.option("user", "XXX")
.option("password", "XXX")
.load()
So before referring me to this similar question, please take that into account.
The reason I cannot use that notation is that I need a JDBC option that is not available in the Spark version I am using (2.2.0): the "queryTimeout" option was only added in Spark 2.4, so I need to set it at the JDBC level and work from the ResultSet.
Any help will be appreciated.
Thank you in advance!
A working example against a public MySQL source:
import java.util.Properties
import org.apache.spark.rdd.JdbcRDD
import java.sql.{Connection, DriverManager, ResultSet}
import spark.implicits._ // implicits from the active SparkSession (named spark, as in spark-shell)
val url = "jdbc:mysql://mysql-rfam-public.ebi.ac.uk:4497/Rfam"
val username = "rfamro"
val password = ""
val myRDD = new JdbcRDD( sc, () => DriverManager.getConnection(url, username, password), "select rfam_id, noise_cutoff from family limit ?, ?", 1, 100, 10,
r => r.getString("rfam_id") + ", " + r.getString("noise_cutoff"))
val DF = myRDD.toDF
DF.show
returns:
+-------------------+
| value|
+-------------------+
| 5_8S_rRNA, 41.9|
| U1, 39.9|
| U2, 45.9|
| tRNA, 28.9|
| Vault, 33.9|
| U12, 52.9|
....
....
Give this a try (I haven't tried it, but it should work with slight modification):
import java.sql.ResultSet
import org.apache.spark.sql.DataFrame
// assuming ResultSet comprises rows of (String, Int)
def resultSetToDataFrame(resultSet: ResultSet): DataFrame = {
val resultSetAsList: List[(String, Int)] = new Iterator[(String, Int)] {
override def hasNext: Boolean = resultSet.next()
override def next(): (String, Int) = {
// column labels can also be used instead of the 1-based column indices
(resultSet.getString(1), resultSet.getInt(2))
}
}.toStream.toList
import spark.implicits._ // requires a SparkSession named spark in scope (e.g. in spark-shell)
val listAsDataFrame: DataFrame = resultSetAsList.toDF("column_name_1", "column_name_2")
listAsDataFrame
}
References:
SQL ResultSet to Scala List
Creating Spark DataFrame manually
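Since the motivation was the missing "queryTimeout" option in Spark 2.2, here is a hedged usage sketch: obtain the ResultSet yourself via plain JDBC, set the timeout on the Statement, and then convert it with the function above. The URL, credentials, table and column names are placeholders.
import java.sql.DriverManager

val connection = DriverManager.getConnection("jdbc:mysql://host:3306/db", "user", "password")
try {
  val statement = connection.createStatement()
  statement.setQueryTimeout(60) // seconds; the option missing from Spark 2.2's JDBC reader
  val resultSet = statement.executeQuery("SELECT some_string_col, some_int_col FROM some_table")
  val df = resultSetToDataFrame(resultSet)
  df.show()
} finally {
  connection.close()
}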
I'm trying to start a data warehouse project; this is what I would like my schema to look like:
table: event_log
schema:
-> info
-> user_id: "xyz"
-> user_properties // <- I want this to be array like
-> 0
-> key: "name"
-> value
-> int_value: null
-> string_value: "osp"
...
-> 1 // and it goes on
The problem is that I don't know how to programmatically define this array-like structure.
I took the idea from here:
https://www.youtube.com/watch?v=pxNrkjBeHpw
Here is my code so far (Kotlin, using the Java Google Cloud library):
val tableId = TableId.of(datasetName, tableName)
// First part, general field
val generalInfoFields = ArrayList<Field>()
generalInfoFields.add(Field.of("user_id", LegacySQLTypeName.STRING))
generalInfoFields.add(Field.of("user_properties", {ARRAY LIKE TYPE??}))
val general_info = Field.of("general_info", LegacySQLTypeName.RECORD, FieldList.of(generalInfoFields))
// Combine fields and create table
val tableSchema = Schema.of(general_info)
val tableDefinition = StandardTableDefinition.of(tableSchema)
val tableInfo = TableInfo.newBuilder(tableId, tableDefinition).build()
val table = bigquery.create(tableInfo)
log.info("dataset created " + dataset.datasetId.dataset)
Any help would be greatly appreciated
To define an array in a BigQuery schema you need to use the Field.Mode.REPEATED modifier. Check the official docs.
Your code will look something like this:
val arrayField = Field.newBuilder("user_properties", LegacySQLTypeName.RECORD, FieldList.of(<record nested fields here>))
.setMode(Field.Mode.REPEATED).build()
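For illustration, a hedged sketch of those nested fields (written in Scala, but using the same Java client classes as the Kotlin code above), mirroring the key / value / int_value / string_value layout from the desired schema:
import com.google.cloud.bigquery.{Field, FieldList, LegacySQLTypeName}

// "value" is itself a record holding int_value / string_value
val valueField = Field.of("value", LegacySQLTypeName.RECORD,
  FieldList.of(
    Field.of("int_value", LegacySQLTypeName.INTEGER),
    Field.of("string_value", LegacySQLTypeName.STRING)))

// user_properties: a REPEATED record of (key, value) -- the array-like part
val userProperties = Field.newBuilder("user_properties", LegacySQLTypeName.RECORD,
    FieldList.of(Field.of("key", LegacySQLTypeName.STRING), valueField))
  .setMode(Field.Mode.REPEATED)
  .build()
This userProperties field could then be added to generalInfoFields in place of the {ARRAY LIKE TYPE??} placeholder from the question.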
Spark 2.0 (final) with Scala 2.11.8. The following super simple code yields the compilation error Error:(17, 45) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
import org.apache.spark.sql.SparkSession
case class SimpleTuple(id: Int, desc: String)
object DatasetTest {
val dataList = List(
SimpleTuple(5, "abc"),
SimpleTuple(6, "bcd")
)
def main(args: Array[String]): Unit = {
val sparkSession = SparkSession.builder.
master("local")
.appName("example")
.getOrCreate()
val dataset = sparkSession.createDataset(dataList)
}
}
Spark Datasets require Encoders for the data type that is about to be stored. For common types (atomics, product types) there is a number of predefined encoders available, but you have to import them first from SparkSession.implicits to make it work:
val sparkSession: SparkSession = ???
import sparkSession.implicits._
val dataset = sparkSession.createDataset(dataList)
Alternatively, you can directly provide an explicit
import org.apache.spark.sql.{Encoder, Encoders}
val dataset = sparkSession.createDataset(dataList)(Encoders.product[SimpleTuple])
or implicit
implicit val enc: Encoder[SimpleTuple] = Encoders.product[SimpleTuple]
val dataset = sparkSession.createDataset(dataList)
Encoder for the stored type.
Note that Encoders also provides a number of predefined Encoders for atomic types, and Encoders for complex ones can be derived with ExpressionEncoder.
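For example, a minimal sketch of deriving one with ExpressionEncoder (note that it lives in the internal org.apache.spark.sql.catalyst package, so it is not a stable public API):
import org.apache.spark.sql.Encoder
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

// Derive an Encoder for the case class and make it implicitly available
implicit val simpleTupleEncoder: Encoder[SimpleTuple] = ExpressionEncoder[SimpleTuple]()
val dataset = sparkSession.createDataset(dataList)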
Further reading:
For custom objects which are not covered by built-in encoders see How to store custom objects in Dataset?
For Row objects you have to provide Encoder explicitly as shown in Encoder error while trying to map dataframe row to updated row
Also note that the case class must be defined outside of the main object; see https://stackoverflow.com/a/34715827/3535853
For other users (your placement is correct), note that it's also important that the case class is defined outside of the object scope. So:
Fails:
object DatasetTest {
case class SimpleTuple(id: Int, desc: String)
val dataList = List(
SimpleTuple(5, "abc"),
SimpleTuple(6, "bcd")
)
def main(args: Array[String]): Unit = {
val sparkSession = SparkSession.builder
.master("local")
.appName("example")
.getOrCreate()
val dataset = sparkSession.createDataset(dataList)
}
}
Add the implicits, still fails with the same error:
object DatasetTest {
case class SimpleTuple(id: Int, desc: String)
val dataList = List(
SimpleTuple(5, "abc"),
SimpleTuple(6, "bcd")
)
def main(args: Array[String]): Unit = {
val sparkSession = SparkSession.builder
.master("local")
.appName("example")
.getOrCreate()
import sparkSession.implicits._
val dataset = sparkSession.createDataset(dataList)
}
}
Works:
case class SimpleTuple(id: Int, desc: String)
object DatasetTest {
val dataList = List(
SimpleTuple(5, "abc"),
SimpleTuple(6, "bcd")
)
def main(args: Array[String]): Unit = {
val sparkSession = SparkSession.builder
.master("local")
.appName("example")
.getOrCreate()
import sparkSession.implicits._
val dataset = sparkSession.createDataset(dataList)
}
}
Here's the relevant bug: https://issues.apache.org/jira/browse/SPARK-13540, so hopefully it will be fixed in the next release of Spark 2.
(Edit: Looks like that bugfix is actually in Spark 2.0.0... So I'm not sure why this still fails).
To clarify, with an answer to my own question: if the goal is to define a simple literal Spark DataFrame, rather than use Scala tuples and implicit conversion, the simpler route is to use the Spark API directly, like this:
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import scala.collection.JavaConverters._
val simpleSchema = StructType(
StructField("a", StringType) ::
StructField("b", IntegerType) ::
StructField("c", IntegerType) ::
StructField("d", IntegerType) ::
StructField("e", IntegerType) :: Nil)
val data = List(
Row("001", 1, 0, 3, 4),
Row("001", 3, 4, 1, 7),
Row("001", null, 0, 6, 4),
Row("003", 1, 4, 5, 7),
Row("003", 5, 4, null, 2),
Row("003", 4, null, 9, 2),
Row("003", 2, 3, 0, 1)
)
val df = spark.createDataFrame(data.asJava, simpleSchema)