I have a SQLite database with columns saved as JSON; some are plain arrays and some are arrays of objects.
The data isn't too big, around 1 million rows in one table and another 6 million in another. I would like to improve query speed and extract this data into something indexed and more manageable.
The problem is that Spark treats the JSON columns as BigDecimal and I don't know why or how to solve this; I found some things but nothing helped.
Caused by: java.sql.SQLException: Bad value for type BigDecimal : [56641575300, 56640640900, 56640564100, 56640349700, 18635841800, 54913035400, 6505719940, 56641287800, 7102147726, 57202227222, 57191928343, 18633330200, 57193578904, 7409778074, 7409730079, 55740247200, 56641355300, 18635857700, 57191972388, 54912606500, 6601960745, 57191972907, 56641923500, 56640256300, 54911965100, 45661930800, 55474245300, 7409541556, 7409694518, 56641363000, 56519446200, 6504106170, 57191975866, 56640736700, 55463741500, 56640319300, 56640861000, 54911965000, 56561401800, 6504731849, 24342836300, 7402491855, 22950414800, 6507741522, 6504199636, 7102381436, 57191895642, 18634536800, 57196623329, 7005988322, 56013334500, 18634278500, 57191983462, 7409545828, 57204194408, 56641031400, 56641436400, 6504659572, 36829162100, 24766932600, 8256434300]
at org.sqlite.jdbc3.JDBC3ResultSet.getBigDecimal(JDBC3ResultSet.java:196)
What I tried is to load the SQLite driver and then open the database with SQLContext:
df = sqlContext.read.format('jdbc').options(url='jdbc:sqlite:../cache/iconic.db', dbtable='coauthors', driver='org.sqlite.JDBC').load()
After Spark complained about the column type, I tried to cast it to a string so it could be further parsed as JSON:
from pyspark.sql.functions import from_json
from pyspark.sql.types import ArrayType, IntegerType, StringType
schema = ArrayType(IntegerType())
df.withColumn('co_list', from_json(df['co_list'].cast(StringType()), schema))
But this throws the same error, as it didn't change anything.
I also tried to set the table schema from the start, but it seems PySpark doesn't let me do this:
df = sqlContext.read.schema([...]).format('jdbc')...
# Throws
pyspark.sql.utils.AnalysisException: 'jdbc does not allow user-specified schemas.;'
The rows look like this
# First table
1 "[{""surname"": ...}]" "[[{""frequency"": ""58123"", ...}]]" 74072 14586 null null null "{""affiliation-url"":}" "[""SOCI""]" null 0 0 1
# Second table
505 "[{""surname"": ""Blondel"" ...}, {""surname"": ""B\u0153ge"" ..}, ...]" "1999-12-01" 21 null null null 0
Hope there is a way.
Found the solution: the database should be loaded using the JDBC reader, and to customize how the columns are cast you pass a customSchema property in the connection properties.
Here is the solution:
connectionProperties = {
    "customSchema": 'id INT, co_list STRING, last_page INT, saved INT',
    "driver": 'org.sqlite.JDBC'
}
df = sqlContext.read.jdbc(url='jdbc:sqlite:../cache/iconic.db', table='coauthors', properties=connectionProperties)
This way you control how Spark internally maps the columns of the database table.
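With co_list now arriving as a plain JSON string, the parsing step the question was aiming for works with from_json. A minimal sketch of the whole flow in Scala (assuming Spark 2.4+, where from_json accepts an array schema; LongType is used because sample IDs such as 56641575300 do not fit in an Int):
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{ArrayType, LongType}
// spark is the SparkSession (e.g. in spark-shell).
// Same load as above, expressed in Scala: customSchema makes co_list a STRING.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:sqlite:../cache/iconic.db")
  .option("dbtable", "coauthors")
  .option("driver", "org.sqlite.JDBC")
  .option("customSchema", "id INT, co_list STRING, last_page INT, saved INT")
  .load()
// Parse the JSON string into a typed array column.
val parsed = df.withColumn("co_list", from_json(col("co_list"), ArrayType(LongType)))
parsed.printSchema()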
Related
My input was a Kafka stream with only one value, which is comma-separated. It looks like this:
"id,country,timestamp"
I already split the dataset so that I have something like the following structured stream:
Dataset<Row> words = df
.selectExpr("CAST (value AS STRING)")
.as(Encoders.STRING())
.withColumn("id", split(col("value"), ",").getItem(0))
.withColumn("country", split(col("value"), ",").getItem(1))
.withColumn("timestamp", split(col("value"), ",").getItem(2));
+----+---------+----------+
|id |country |timestamp |
+----+---------+----------+
|2922|de |1231231232|
|4195|de |1231232424|
|6796|fr |1232412323|
+----+---------+----------+
Now I have a dataset with 3 columns, and I want to use the entries of each row in a custom function, e.g.
Dataset<String> names = words.map(row -> {
    // do something with every entry of each row, e.g.
    Person person = new Person(id, country, timestamp);
    String name = person.getName();
    return name;
});
In the end I want to sink out a comma-separated String again.
A DataFrame has a schema, so you can't just call a map function on it without defining a new schema.
You can either convert to an RDD and use a map, or use a DataFrame map with an encoder.
Another option is to use Spark SQL with user-defined functions; you can read about those as well.
If your use case is really as simple as you are showing, doing something like:
var nameRdd = words.rdd.map(x => {f(x)})
which seems like all you need.
If you still want a DataFrame, you can use something like:
val schema = StructType(Seq[StructField](StructField(dataType = StringType, name = s"name")))
val rddToDf = nameRdd.map(name => Row.apply(name))
val df = sparkSession.createDataFrame(rddToDf, schema)
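The "DataFrame map with an encoder" route mentioned above looks roughly like this; a sketch that borrows words and the Person class from the question (the body is a placeholder for your own logic):
import org.apache.spark.sql.{Dataset, Encoders, Row}
// The explicit encoder tells Spark the result type, so no separate schema is needed.
val rowToName: Row => String = row => {
  val person = new Person(row.getAs[String]("id"),
                          row.getAs[String]("country"),
                          row.getAs[String]("timestamp"))
  person.getName
}
val names: Dataset[String] = words.map(rowToName)(Encoders.STRING)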
P.S. DataFrame === Dataset[Row]
If you have a custom function that is not available by composing functions in the existing Spark API [1], then you can either drop down to the RDD level (as @Ilya suggested) or use a UDF [2].
Typically I'll try to use the Spark API functions on a DataFrame whenever possible, as they are generally the best optimized.
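For instance, if the goal is just to sink the row back out as one comma-separated string (as in the question above), built-in functions alone cover it; a sketch against the words dataset from the question:
import org.apache.spark.sql.functions.{col, concat_ws}
// No UDF or RDD round-trip needed: concat_ws joins the columns with a separator.
val out = words.select(concat_ws(",", col("id"), col("country"), col("timestamp")) as "value")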
If that's not possible I will construct a UDF:
import org.apache.spark.sql.functions.{col, udf}
val squared = udf((s: Long) => s * s)
display(spark.range(1, 20).select(squared(col("id")) as "id_squared"))
In your case you need to pass multiple columns to your UDF; you can pass them comma-separated, e.g. squared(col("col_a"), col("col_b")).
Since you are writing your UDF in Scala it should be pretty efficient, but keep in mind that with Python there is generally extra latency due to data movement between the JVM and Python.
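A quick sketch of that multi-column variant for the id/country/timestamp case (the string concatenation stands in for whatever the custom Person/getName logic really does):
import org.apache.spark.sql.functions.{col, udf}
// A UDF taking three columns; Spark passes the column values as the function arguments.
val toName = udf((id: String, country: String, timestamp: String) => s"$id,$country,$timestamp")
val named = words.select(toName(col("id"), col("country"), col("timestamp")) as "name")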
[1]https://spark.apache.org/docs/latest/api/scala/index.html#package
[2]https://docs.databricks.com/spark/latest/spark-sql/udf-scala.html
I have data in a JavaPairRDD in the format
JavaPairRDD<Tuple2<String, Tuple2<String,String>>>
I tried using the code below:
Encoder<Tuple2<String, Tuple2<String,String>>> encoder2 =
Encoders.tuple(Encoders.STRING(), Encoders.tuple(Encoders.STRING(),Encoders.STRING()));
Dataset<Row> userViolationsDetails = spark.createDataset(JavaPairRDD.toRDD(MY_RDD),encoder2).toDF("value1","value2");
But how do I generate a Dataset with 3 columns? The output of the above code gives me data in 2 columns. Any pointers / suggestions?
Try running printSchema - you will see that value2 is a complex type.
With that information, you can write:
Dataset<Row> uvd = userViolationsDetails.selectExpr("value1", "value2._1 as value2", "value2._2 as value3")
value2._1 means the first element of the tuple inside the current value2 field. We overwrite the value2 field so that it holds one value only.
Note that this will work after https://issues.apache.org/jira/browse/SPARK-24548 is merged to the master branch. Currently there is a bug in Spark where the tuple is converted to a struct with two fields both named value.
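For reference, the same flattening via the Scala toDF route (which does not go through the Java tuple encoder): a tuple-valued column becomes a struct with fields _1 and _2 that selectExpr can pull apart:
// assumes a SparkSession named spark (as in spark-shell)
import spark.implicits._
val nested = Seq(("a", ("b", "c"))).toDF("value1", "value2")
// value2 is struct<_1: string, _2: string>, so its fields can be promoted to top-level columns.
val flat = nested.selectExpr("value1", "value2._1 as value2", "value2._2 as value3")
flat.show()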
Hello friends, I have a complicated issue; please help me solve it. Let me explain the problem:
Cassandra table
DeviceID | companyId | data
abc123 | xyz1 | {"Temperature":{"x":"67.824"},"Humidity":117.828,"vibration":{"y":"2.276","x":"72.995"},"date":1487641956515}
The data column's data type is text.
Now I am trying to get the data in Java using this query:
ResultSet results = inSession.execute("SELECT data from keyspace.tablename where companyid = 4d91f767-2312-4f32-a25a-4674e8bae244 limit 24");
and I am getting the data successfully, like this...
{"Temperature":{"x":"67.824"},"Humidity":117.828,"vibration":{"y":"2.276","x":"72.995"},"date":1487641956515}{"Temperature":{"x":"73.981"},"Humidity":58.561,"vibration":{"y":"87.482","x":"87.131"},"date":1487641951512}{"Temperature":{"x":"62.747"},"Humidity":88.611,"vibration":{"y":"137.792","x":"36.363"},"date":1487641946512}{"Temperature":{"x":"36.072"},"Humidity":55.819,"vibration":{"y":"60.062","x":"2.779"},"date":1487641941508}{"Temperature":{"x":"36.724"},"Humidity":68.209,"vibration":{"y":"49.323","x":"64.822"},"date":1487641936507}{"Temperature":{"x":"31.777"},"Humidity":131.955,"vibration":{"y":"68.690","x":"6.737"},"date":1487641931503}{"Temperature":{"x":"41.768"},"Humidity":81.847,"vibration":{"y":"74.360","x":"60.438"},"date":1487641926499}{"Temperature":{"x":"49.538"},"Humidity":57.258,"vibration":{"y":"34.688","x":"81.397"},"date":1487641921496}{"Temperature":{"x":"98.013"},"Humidity":61.1,"vibration":{"y":"121.482","x":"93.721"},"date":1487641916492}{"Temperature":{"x":"98.307"},"Humidity":63.377,"vibration":{"y":"106.067","x":"98.968"},"date":1487641911487}{"Temperature":{"x":"92.119"},"Humidity":70.677,"vibration":{"y":"66.953","x":"59.440"},"date":1487641906481}{"Temperature":{"x":"41.627"},"Humidity":73.739,"vibration":{"y":"54.557","x":"82.876"},"date":1487641901475}{"Temperature":{"x":"74.684"},"Humidity":125.163,"vibration":{"y":"77.522","x":"96.560"},"date":1487641896471}{"Temperature":{"x":"50.228"},"Humidity":53.3,"vibration":{"y":"58.011","x":"26.710"},"date":1487641891468}{"Temperature":{"x":"61.710"},"Humidity":75.869,"vibration":{"y":"67.637","x":"69.842"},"date":1487641886465}{"Temperature":{"x":"61.908"},"Humidity":43.106,"vibration":{"y":"6.975","x":"15.009"},"date":1487641881461}{"Temperature":{"x":"75.157"},"Humidity":61.452,"vibration":{"y":"39.608","x":"58.490"},"date":1487826732069}{"Temperature":{"x":"77.562"},"Humidity":65.951,"vibration":{"y":"102.782","x":"24.761"},"date":1487826731069}{"Temperature":{"x":"60.483"},"Humidity":57.307,"vibration":{"y":"96.702","x":"86.667"},"date":1487826730068}{"Temperature":{"x":"85.893"},"Humidity":58.953,"vibration":{"y":"49.167","x":"86.790"},"date":1487826729067}{"Temperature":{"x":"84.073"},"Humidity":142.27,"vibration":{"y":"94.980","x":"65.363"},"date":1487826728065}{"Temperature":{"x":"81.733"},"Humidity":145.871,"vibration":{"y":"81.889","x":"57.215"},"date":1487826727064}{"Temperature":{"x":"41.944"},"Humidity":139.18,"vibration":{"y":"62.525","x":"74.986"},"date":1487826726063}{"Temperature":{"x":"85.298"},"Humidity":80.534,"vibration":{"y":"47.796","x":"74.527"},"date":1487826725062}
Now my problem is that I need to create a list or array keyed by field, e.g. a Temperature[24] array or a list of Temperature objects. Is it possible? How?
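One possible way (a sketch, not the only one) is to parse each row's JSON data string with a JSON library such as Jackson and collect the Temperature values into a list; results is the ResultSet from the query above, and the Jackson dependency is an assumption:
import com.fasterxml.jackson.databind.ObjectMapper
import scala.collection.JavaConverters._
val mapper = new ObjectMapper()
// Temperature.x is stored as a string in the sample payloads; asDouble() converts it.
val temperatures: List[Double] = results.all().asScala.toList.map { row =>
  val json = mapper.readTree(row.getString("data"))
  json.path("Temperature").path("x").asDouble()
}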
I use Spark 2.1.1.
I have the following Dataset<Row> ds1:
name | ratio | count // column names
"hello" | 1.56 | 34
(ds1.isStreaming gives true)
and I am trying to generate Dataset<String> ds2. In other words, when I write to a Kafka sink I want to write something like this:
{"name": "hello", "ratio": 1.56, "count": 34}
I have tried something like this: df2.toJSON().writeStream().foreach(new KafkaSink()).start(), but then it gives the following error:
Queries with streaming sources must be executed with writeStream.start()
There are to_json and json_tuple, however I am not sure how to leverage them here.
I tried the following using the json_tuple() function:
Dataset<String> df4 = df3.select(json_tuple(new Column("result"), " name", "ratio", "count")).as(Encoders.STRING());
and I get the following error:
cannot resolve 'result' given input columns: [name, ratio, count];;
tl;dr Use the struct function followed by to_json (as toJSON was broken for streaming datasets due to SPARK-17029, which got fixed just 20 days ago).
Quoting the scaladoc of struct:
struct(colName: String, colNames: String*): Column Creates a new struct column that composes multiple input columns.
Given that you use the Java API, you have 4 different variants of the struct function, too:
public static Column struct(Column... cols) Creates a new struct column.
With the to_json function your case is covered:
public static Column to_json(Column e) Converts a column containing a StructType into a JSON string with the specified schema.
The following is Scala code (translating it to Java is your home exercise):
val ds1 = Seq(("hello", 1.56, 34)).toDF("name", "ratio", "count")
val recordCol = to_json(struct("name", "ratio", "count")) as "record"
scala> ds1.select(recordCol).show(truncate = false)
+----------------------------------------+
|record |
+----------------------------------------+
|{"name":"hello","ratio":1.56,"count":34}|
+----------------------------------------+
I've also given your solution a try (with Spark 2.3.0-SNAPSHOT built today) and it seems to work perfectly.
val fromKafka = spark.
readStream.
format("kafka").
option("subscribe", "topic1").
option("kafka.bootstrap.servers", "localhost:9092").
load.
select('value cast "string")
fromKafka.
toJSON. // <-- JSON conversion
writeStream.
format("console"). // using console sink
start
format("kafka") was added in SPARK-19719 and is not available in 2.1.0.
I am using spark-sql-2.4.1v with Java 1.8, and the Kafka versions spark-sql-kafka-0-10_2.11_2.4.3 and kafka-clients_0.10.0.0.
StreamingQuery queryComapanyRecords =
comapanyRecords
.writeStream()
.format("kafka")
.option("kafka.bootstrap.servers",KAFKA_BROKER)
.option("topic", "in_topic")
.option("auto.create.topics.enable", "false")
.option("key.serializer","org.apache.kafka.common.serialization.StringDeserializer")
.option("value.serializer", "com.spgmi.ca.prescore.serde.MessageRecordSerDe")
.option("checkpointLocation", "/app/chkpnt/" )
.outputMode("append")
.start();
queryLinkingMessageRecords.awaitTermination();
This gives the error:
Caused by: org.apache.spark.sql.AnalysisException: Required attribute 'value' not found;
at org.apache.spark.sql.kafka010.KafkaWriter$$anonfun$6.apply(KafkaWriter.scala:71)
at org.apache.spark.sql.kafka010.KafkaWriter$$anonfun$6.apply(KafkaWriter.scala:71)
at scala.Option.getOrElse(Option.scala:121)
I tried to fix it as below, but I am unable to send the value, which is a Java bean in my case.
StreamingQuery queryComapanyRecords =
comapanyRecords.selectExpr("CAST(company_id AS STRING) AS key", "to_json(struct(\"company_id\",\"fiscal_year\",\"fiscal_quarter\")) AS value")
.writeStream()
.format("kafka")
.option("kafka.bootstrap.servers",KAFKA_BROKER)
.option("topic", "in_topic")
.start();
So is there any way in Java to handle/send this value (i.e. a Java bean as the record)?
Kafka data source requires a specific schema for reading (loading) and writing (saving) datasets.
Quoting the official documentation (highlighting the most important field / column):
Each row in the source has the following schema:
...
value binary
...
In other words, you have Kafka records in the value column when reading from a Kafka topic, and you have to make the data you want to save to a Kafka topic available in the value column as well.
In other words, whatever is or is going to be in Kafka is in the value column. The value column is where you "store" business records (the data).
On to your question:
How to write selected columns to Kafka topic?
You should "pack" the selected columns together so they can all together be part of the value column. to_json standard function is a good fit so the selected columns are going to be a JSON message.
Example
Let me give you an example.
Don't forget to start a Spark application or spark-shell with the Kafka data source. Mind the versions of Scala (2.11 or 2.12) and Spark (e.g. 2.4.4).
spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.4
Let's start by creating a sample dataset. Any multiple-field dataset would work.
val ns = Seq((0, "zero")).toDF("id", "name")
scala> ns.show
+---+----+
| id|name|
+---+----+
| 0|zero|
+---+----+
If we tried to write the dataset to a Kafka topic, it would error out because the value column is missing. That's what you faced initially.
scala> ns.write.format("kafka").option("topic", "in_topic").save
org.apache.spark.sql.AnalysisException: Required attribute 'value' not found;
at org.apache.spark.sql.kafka010.KafkaWriter$.$anonfun$validateQuery$6(KafkaWriter.scala:71)
at scala.Option.getOrElse(Option.scala:138)
...
You have to come up with a way to "pack" multiple fields (columns) together and make the result available as the value column. The struct and to_json standard functions will do it.
val vs = ns.withColumn("value", to_json(struct("id", "name")))
scala> vs.show(truncate = false)
+---+----+----------------------+
|id |name|value |
+---+----+----------------------+
|0 |zero|{"id":0,"name":"zero"}|
+---+----+----------------------+
Saving to a Kafka topic should now be a breeze.
vs.write.format("kafka").option("topic", "in_topic").save
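Applied to the streaming case in the question, the same packing might look like the following sketch (assuming comapanyRecords is the streaming Dataset with company_id, fiscal_year and fiscal_quarter columns; KAFKA_BROKER, the topic and the checkpoint path are copied from the question):
import org.apache.spark.sql.functions.{col, struct, to_json}
// Pack the bean's fields into a JSON "value" column (plus a string "key"), then write the stream.
val packed = comapanyRecords.select(
  col("company_id").cast("string").as("key"),
  to_json(struct(col("company_id"), col("fiscal_year"), col("fiscal_quarter"))).as("value"))
val query = packed.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", KAFKA_BROKER)
  .option("topic", "in_topic")
  .option("checkpointLocation", "/app/chkpnt/")
  .start()
query.awaitTermination()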