BigQuery, how to define an array-like field programmatically? - java

I'm trying to start a datawarehouse project, this is what I would like my schema to look like:
table: event_log
schema:
-> info
   -> user_id: "xyz"
   -> user_properties // <- I want this to be array-like
      -> 0
         -> key: "name"
         -> value
            -> int_value: null
            -> string_value: "osp"
      ...
      -> 1 // and it goes on
The problem is I don't know how to programmatically define this array-like structure.
I took the idea from here:
https://www.youtube.com/watch?v=pxNrkjBeHpw
Here is my code so far (Kotlin, using the Java Google Cloud library):
val tableId = TableId.of(datasetName, tableName)
// First part, general field
val generalInfoFields = ArrayList<Field>()
generalInfoFields.add(Field.of("user_id", LegacySQLTypeName.STRING))
generalInfoFields.add(Field.of("user_properties", {ARRAY LIKE TYPE??}))
val general_info = Field.of("general_info", LegacySQLTypeName.RECORD, FieldList.of(generalInfoFields))
// Combine fields and create table
val tableSchema = Schema.of(general_info)
val tableDefinition = StandardTableDefinition.of(tableSchema)
val tableInfo = TableInfo.newBuilder(tableId, tableDefinition).build()
val table = bigquery.create(tableInfo)
log.info("dataset created " + dataset.datasetId.dataset)
Any help would be greatly appreciated

To define an array in a BigQuery schema you need to use the Field.Mode.REPEATED modifier. Check the official docs.
Your code will look something like this:
val arrayField = Field.newBuilder("user_properties", LegacySQLTypeName.RECORD, FieldList.of(<record nested fields here>))
    .setMode(Field.Mode.REPEATED).build()
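For completeness, here is a fuller sketch that assembles the whole schema from the question; the nested field names (key, value, int_value, string_value) follow the question's layout and are not mandated by BigQuery:
val userProperties = Field.newBuilder(
    "user_properties", LegacySQLTypeName.RECORD,
    Field.of("key", LegacySQLTypeName.STRING),
    Field.of(
        "value", LegacySQLTypeName.RECORD,
        Field.of("int_value", LegacySQLTypeName.INTEGER),
        Field.of("string_value", LegacySQLTypeName.STRING)
    )
)
    .setMode(Field.Mode.REPEATED) // REPEATED is what makes the field array-like
    .build()

val generalInfo = Field.of(
    "general_info", LegacySQLTypeName.RECORD,
    Field.of("user_id", LegacySQLTypeName.STRING),
    userProperties
)
val tableInfo = TableInfo.newBuilder(
    TableId.of(datasetName, tableName),
    StandardTableDefinition.of(Schema.of(generalInfo))
).build()
val table = bigquery.create(tableInfo)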


Scala Error for Hashtable[String, String]

I am writing a small UDF:
import java.io.{ByteArrayInputStream, InputStream, ObjectInputStream}
import java.util
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper
import org.apache.spark.sql.functions.udf

val transform = udf((x: Array[Byte]) => {
  val mapper = new ObjectMapper() with ScalaObjectMapper
  val stream: InputStream = new ByteArrayInputStream(x)
  val obs = new ObjectInputStream(stream)
  val stock = mapper.readValue(obs, classOf[util.Hashtable[String, String]])
  stock
})
When I run it, I get this error:
java.lang.UnsupportedOperationException: Schema for type java.util.Hashtable[String,String] is not supported
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:809)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:740)
at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56)
at org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:926)
at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:739)
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:736)
at org.apache.spark.sql.functions$.udf(functions.scala:3898)
... 59 elided
Can anyone help me understand why this is happening?
The error you get just means that Spark does not understand Java hash tables. We can reproduce your error with this simple UDF:
val gen = udf(() => new java.util.Hashtable[String, String]())
Spark tries to create a DataType (to put in a Spark schema) from a java.util.Hashtable, which it does not know how to do. Spark does understand Scala maps though. Indeed, the following code
val gen2 = udf(() => Map("a" -> "b"))
spark.range(1).select(gen2()).show()
yields
+--------+
| UDF()|
+--------+
|[a -> b]|
+--------+
To fix the first UDF (and yours, by the way), you can convert the Hashtable to a Scala map. Converting a HashMap is easy with JavaConverters. I do not know of an equally easy way for a Hashtable, but you can do it this way:
import collection.JavaConverters._

val gen3 = udf(() => {
  val table = new java.util.Hashtable[String, String]()
  table.put("a", "b")
  Map(table.entrySet.asScala.toSeq.map(x => x.getKey -> x.getValue): _*)
})

How to fetch a complex model from the result of a select statement using fetchGroups and map

I'm using jOOQ with Kotlin and I want to write a statement that fetches data from a query joining a couple of tables (example attached).
The problem I'm facing is that I want to map the result to my complex model, which consists of one-to-many and also many-to-many relationships.
As far as I know, I can use the fetchGroups operation in jOOQ to group the records somehow, but I still can't figure out how to get the result into my model.
My model:
data class MicroserviceDto(
    val microservice_id: Long = 1,
    val microservice_name: String? = "",
    val endpoint: String? = "",
    val mappings: String? = "",
    val solutionDefinitionMinimalDtoList: List<SolutionDefinitionDto> = emptyList(),
    val projectFileDtoList: List<ProjectFileDto> = emptyList()
)
data class SolutionDefinitionDto(
    val solution_definition_id: Long = 0L,
    val solution_definition_name: String = "",
    val solutionId: String = "",
    val solutionVersion: String = ""
)
data class ProjectFileDto(
    val project_file_id: Long = 1,
    val model: String = "",
    val relativePath: String = "",
    val fileContentDtoList: List<FileContentDto> = emptyList()
)
data class FileContentDto(
    val file_content_id: Long = 1,
    val content: ByteArray = ByteArray(0)
)
Link to my schema diagram: Database Diagram visualization
Explanation of the diagram:
Microservice has a many-to-many relationship with SolutionDefinition
ProjectFile has a one-to-many relationship with Microservice
ProjectFile has a one-to-many relationship with SolutionDefinition
FileContent has a one-to-many relationship with ProjectFile
I've created a view to represent my desired query with all tables and the join statements between them.
Here is the View:
CREATE OR REPLACE VIEW Microservice_Metadata_by_Microservice_Id AS
select
    # microservice
    M.id as `microservice_id`,
    M.name as `microservice_name`,
    M.mappings,
    M.endpoint,
    # solution definition
    SD.id as `solution_definition_id`,
    SD.name as `solution_definition_name`,
    SD.solutionId,
    SD.solutionVersion,
    # project file of microservice
    PF.id as `project_file_id`,
    PF.relativePath,
    PF.model,
    # file content data of project file
    FC.id as `file_content_id`,
    FC.content
from Microservice M
# get project file
left join Microservice_SolutionDefinition MSD
    on MSD.microserviceId = M.id
left join ProjectFile PF
    on PF.microserviceId = M.id
# get data content
left join FileContent FC
    on PF.id = FC.projectFileId
# get solutions of microservice
left join SolutionDefinition SD
    on SD.id = MSD.solutionDefinitionId;
How can I implement a jOOQ DSL query that maps the result set to my data model?
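One way to approach this is to select from the view and use fetchGroups to bucket the flat rows by microservice id, then fold each bucket into the nested DTOs with ordinary Kotlin collection operations. Below is a minimal sketch under the assumption that jOOQ's code generator produced a table reference for the view (V) whose fields are named after the view's column aliases; all of those generated names are hypothetical:
// All generated names (V and its fields) are assumptions about what the
// jOOQ code generator would produce for the view defined above.
val V = MICROSERVICE_METADATA_BY_MICROSERVICE_ID

val dtos: List<MicroserviceDto> = dsl
    .selectFrom(V)
    .fetchGroups(V.MICROSERVICE_ID) // Map<Long, Result<Record>>
    .map { (id, rows) ->
        val first = rows.first()
        MicroserviceDto(
            microservice_id = id,
            microservice_name = first[V.MICROSERVICE_NAME],
            endpoint = first[V.ENDPOINT],
            mappings = first[V.MAPPINGS],
            // many-to-many: de-duplicate the solution definitions fanned out by the joins
            solutionDefinitionMinimalDtoList = rows
                .filter { it[V.SOLUTION_DEFINITION_ID] != null }
                .distinctBy { it[V.SOLUTION_DEFINITION_ID] }
                .map {
                    SolutionDefinitionDto(
                        solution_definition_id = it[V.SOLUTION_DEFINITION_ID],
                        solution_definition_name = it[V.SOLUTION_DEFINITION_NAME],
                        solutionId = it[V.SOLUTIONID],
                        solutionVersion = it[V.SOLUTIONVERSION]
                    )
                },
            // one-to-many: group the flat rows per project file, then per file content
            projectFileDtoList = rows
                .filter { it[V.PROJECT_FILE_ID] != null }
                .groupBy { it[V.PROJECT_FILE_ID] }
                .map { (pfId, pfRows) ->
                    ProjectFileDto(
                        project_file_id = pfId,
                        model = pfRows.first()[V.MODEL],
                        relativePath = pfRows.first()[V.RELATIVEPATH],
                        fileContentDtoList = pfRows
                            .filter { it[V.FILE_CONTENT_ID] != null }
                            .distinctBy { it[V.FILE_CONTENT_ID] }
                            .map { FileContentDto(it[V.FILE_CONTENT_ID], it[V.CONTENT]) }
                    )
                }
        )
    }
The distinctBy calls matter because the left joins fan the result out to one row per combination of solution definition and file content, so each nested collection has to be de-duplicated on its own id.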

Graal embedded JavaScript in Java, how to call map on a list/array from Java? Is it possible?

I am playing with Graal for running JavaScript as a guest language, and would like to know if there is a way to use the JavaScript Array.map functionality on a host (Java) object or proxy. Demo Kotlin code follows, but it should be close enough to Java code.
fun main() {
    val context = Context.newBuilder().build()
    val javaOutputList = mutableListOf<Integer>()
    val javaList = listOf(2, 2, 3, 4, 5)
    val proxyJavaList = ProxyArray.fromList(javaList)
    context.polyglotBindings.apply {
        putMember("javaOutputList", javaOutputList)
        putMember("javaList", javaList)
        putMember("proxyJavaList", proxyJavaList)
    }
    val script = """
        var javaOutputList = Polyglot.import('javaOutputList');
        var javaList = Polyglot.import('javaList');
        var proxyJavaList = Polyglot.import('proxyJavaList');
        var abc = [1, 2, 3];
        abc.forEach(x => javaOutputList.add(x)); // WORKS
        //abc.map(x => x + 1) // WORKS
        //javaList.map(x => x + 1) // DOES NOT WORK (map not a method on list)
        proxyJavaList.map(x => x + 1) // DOES NOT WORK (message not supported: INVOKE)
    """.trimIndent()
    val result = context.eval("js", script)
    val resultList = result.`as`(List::class.java)
    println("result: $resultList")
    println("javaOutputList: $javaOutputList")
}
Using ProxyArray looked the most promising to me, but I still couldn't get it to work. Is this functionality expected to be supported?
EDIT: with the accepted answer the code works; here is the change, for those interested:
val context = Context.newBuilder()
    //.allowExperimentalOptions(true) // doesn't seem to be needed
    .option("js.experimental-foreign-object-prototype", "true")
    .build()
The root of the problem is that array-like non-JavaScript objects do not have Array.prototype on their prototype chain by default, so Array.prototype.map is not accessible via the javaList.map/proxyJavaList.map syntax.
You can either invoke Array.prototype.map directly, as in Array.prototype.map.call(javaList, x => x + 1), or you can use the experimental option js.experimental-foreign-object-prototype=true (which we added recently); it puts Array.prototype on the prototype chain of all array-like objects, and javaList.map/proxyJavaList.map will be available then.
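For example, the first workaround drops straight into the question's script unchanged (a sketch; it reuses the javaList binding from the question, so the expected output assumes that list):
// Sketch of the first workaround: invoke Array.prototype.map directly,
// so nothing needs to be on the host object's prototype chain.
val script = """
    var javaList = Polyglot.import('javaList');
    Array.prototype.map.call(javaList, x => x + 1);
""".trimIndent()
val result = context.eval("js", script)
println(result.`as`(List::class.java)) // expected: [3, 3, 4, 5, 6]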

Unable to perform Ignite SQL query over [CustomKey, CustomValue] cache in Scala

I am trying to setup a distributed cache using Apache Ignite with Scala.
After setting up the cache, I am able to put and get items knowing the key, but SQL queries of any type returns always a cursor with null iterator.
Here is how I set up my cache (please note that this is done before Ignition.start):
def setupTelemetryCache(): CacheConfiguration[TelemetryKey, TelemetryValue] = {
  val dataRegionName = "persistent-region"
  val cacheName = "telemetry-cache"
  // This object is required to perform SQL queries over custom key object
  val queryEntity = new QueryEntity("TelemetryKey", "TelemetryValue")
  val fields: util.LinkedHashMap[String, String] = new util.LinkedHashMap[String, String]
  fields.put("deviceId", classOf[String].getName)
  fields.put("metricName", classOf[String].getName)
  fields.put("timestamp", classOf[String].getName)
  queryEntity.setFields(fields)
  val keyFields: util.HashSet[String] = new util.HashSet[String]()
  keyFields.add("deviceId")
  keyFields.add("metricName")
  keyFields.add("timestamp")
  queryEntity.setKeyFields(keyFields)
  queryEntity.setIndexes(Collections.emptyList[QueryIndex]())
  new CacheConfiguration()
    .setName(cacheName)
    .setDataRegionName(dataRegionName)
    .setCacheMode(CacheMode.PARTITIONED) // Data is split among nodes
    .setBackups(1) // each partition has 1 backup
    .setIndexedTypes(classOf[String], classOf[TelemetryKey]) // Index by ID
    .setWriteSynchronizationMode(CacheWriteSynchronizationMode.FULL_ASYNC) // Faster, clients do not wait for cache
    // synchronization. Consistency issues?
    .setAtomicityMode(CacheAtomicityMode.TRANSACTIONAL) // Allows transactional query
    .setQueryEntities(Collections.singletonList(queryEntity))
}
And this is the code of my TelemetryKey:
case class TelemetryKey private (
  @(AffinityKeyMapped @field)
  @(QuerySqlField @field)(index = true)
  deviceId: String,
  @(QuerySqlField @field)(index = false)
  metricName: String,
  @(QuerySqlField @field)(index = true)
  timestamp: String) extends Serializable
And TelemetryValue:
class TelemetryValue private(valueType: ValueTypes.Value, doubleValue: Option[Double],
                             stringValue: Option[String],
                             longValue: Option[Long]) extends Serializable
A sample SQL query I need could be "Select * from CACHE where deviceId = 'dev1234'", and I expect to receive all the Cache.Entry[TelemetryKey, TelemetryValue] with the same deviceId.
Here is how I perform the query:
private def sqlQuery(query: SqlQuery[TelemetryKey, TelemetryValue]):
QueryCursor[Cache.Entry[TelemetryKey, TelemetryValue]] = {
cache.query(query)
}
def getEntries(ofDeviceId: String):
QueryCursor[Cache.Entry[TelemetryKey, TelemetryValue]] = {
val q = new SqlQuery[TelemetryKey, TelemetryValue](classOf[TelemetryKey], "deviceId = ?")
sqlQuery(q.setArgs(ofDeviceId))
}
Even when changing the body of the query, I receive a cursor object which is empty. I cannot even perform a "Select *" query.
Thanks for the help
There are two ways to configure indexes and queryable fields.
Annotation based configuration
Your key and value classes need to be annotated with @QuerySqlField as follows.
case class TelemetryKey private (
  @(AffinityKeyMapped @field)
  @(QuerySqlField @field)(index = true)
  deviceId: String,
  @(QuerySqlField @field)(index = false)
  metricName: String,
  @(QuerySqlField @field)(index = true)
  timestamp: String) extends Serializable
After indexed and queryable fields are defined, they have to be registered in the SQL engine along with the object types they belong to.
new CacheConfiguration()
  .setName(cacheName)
  .setDataRegionName(dataRegionName)
  .setCacheMode(CacheMode.PARTITIONED)
  .setBackups(1)
  .setIndexedTypes(classOf[TelemetryKey], classOf[TelemetryValue])
  .setWriteSynchronizationMode(CacheWriteSynchronizationMode.FULL_ASYNC)
  .setAtomicityMode(CacheAtomicityMode.TRANSACTIONAL)
UPD:
One more thing that should be fixed is your SqlQuery: it has to be constructed with the value class, not the key class.
def getEntries(ofDeviceId: String):
    QueryCursor[Cache.Entry[TelemetryKey, TelemetryValue]] = {
  val q = new SqlQuery[TelemetryKey, TelemetryValue](classOf[TelemetryValue], "deviceId = ?")
  sqlQuery(q.setArgs(ofDeviceId))
}
QueryEntity based approach
val queryEntity = new QueryEntity(classOf[TelemetryKey], classOf[TelemetryValue])

new CacheConfiguration()
  .setName(cacheName)
  .setDataRegionName(dataRegionName)
  .setCacheMode(CacheMode.PARTITIONED)
  .setBackups(1)
  .setWriteSynchronizationMode(CacheWriteSynchronizationMode.FULL_ASYNC)
  .setAtomicityMode(CacheAtomicityMode.TRANSACTIONAL)
  .setQueryEntities(Collections.singletonList(queryEntity))
Long story short, you should supply full JVM class names to QueryEntity.
As in:
val queryEntity = new QueryEntity("com.pany.telemetry.TelemetryKey",
"com.pany.telemetry.TelemetryValue") // or e.g. TelemetryKey.class.getName()
Ignite needs these to distinguish the various types that can be stored in one cache; it's not decorative, there has to be an exact match.
Better yet? Use setIndexedTypes() instead of setQueryEntities(). It allows you to pass classes instead of Strings and it will scan annotations, which you already have.

Typesafe config: copy a key-value from one config to another

Suppose I have two config files:
val config1: Config = ...
val config2: Config = ...
and I want to copy the key-value pair corresponding to the key someKey from config1 to config2. The key-value pair looks like the following:
someKey: ["someVal", "someVal2"]
This is what I did first:
val config3 = config2.withValue("someKey",
  ConfigValueFactory.fromIterable(config1.getStringList("someKey")))
which is very ugly. I also tried the following, but it gives an error saying the value has type LIST rather than OBJECT:
val config3 = config2.withFallback(config1.getConfig("someKey"))
Any ideas for how to do this in a cleaner way?
What about this:
val c1: Config = ConfigFactory.parseString("x.a = 3 \n x.b = 'bbb' \n x.c = [1, 2, 3]")
val c2: Config = ConfigFactory.parseString("x.a = 4")
println(c1)
println("-----------")
println(c2)
println(c1.getInt("x.a"))
println(c2.withValue("x.c", c1.getList("x.c")))
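If this comes up often, a small generic helper along the same lines may be worth sketching (Kotlin here; copyKey is a hypothetical name, not part of the library). Config.getValue returns the raw ConfigValue for any path, so it sidesteps the LIST-vs-OBJECT mismatch entirely:
import com.typesafe.config.Config

// Copy one key from one Config to another; works for lists, objects and
// scalars alike, because getValue is untyped.
fun copyKey(key: String, from: Config, to: Config): Config =
    to.withValue(key, from.getValue(key))

val config3 = copyKey("someKey", config1, config2)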
