I'm trying to validate the data types of a DataFrame by running DESCRIBE as an SQL query, but the datetime column always comes back as string.
First, I tried the following code:
SparkSession sparkSession = new SparkSession.Builder().getOrCreate();
Dataset<Row> df = sparkSession.read().option("header", "true").option("inferSchema", "true").format("csv").load("/user/data/*_ecs.csv");
try {
    df.createTempView("data");
    Dataset<Row> sqlDf = sparkSession.sql("DESCRIBE data");
    sqlDf.show(300, false);
} catch (AnalysisException e) {
    e.printStackTrace();
}
Output:
+-----------------+---------+-------+
|col_name |data_type|comment|
+-----------------+---------+-------+
|id |int |null |
|symbol |string |null |
|datetime |string |null |
|side |string |null |
|orderQty |int |null |
|price |double |null |
+-----------------+---------+-------+
I also tried a custom schema, but in that case I get an exception when executing any query other than DESCRIBE:
SparkSession sparkSession = new SparkSession.Builder().getOrCreate();
Dataset<Row> df = sparkSession.read().option("header", "true").schema(customSchema).format("csv").load("/user/data/*_ecs.csv");
try {
    df.createTempView("trade_data");
    Dataset<Row> sqlDf = sparkSession.sql("DESCRIBE trade_data");
    sqlDf.show(300, false);
} catch (AnalysisException e) {
    e.printStackTrace();
}
Output:
+--------+---------+-------+
|col_name|data_type|comment|
+--------+---------+-------+
|datetime|timestamp|null |
|price |double |null |
|orderQty|double |null |
+--------+---------+-------+
But if I run any other query, I get the exception below:
Dataset<Row> sqlDf = sparkSession.sql("select DATE(datetime), avg(price), avg(orderQty) from trade_data group by datetime");
java.lang.IllegalArgumentException
at java.sql.Date.valueOf(Date.java:143)
at org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:137)
How can this be solved? Why is inferSchema not working?
You can find more on this in SPARK-19228: https://issues.apache.org/jira/browse/SPARK-19228. In the current version of Spark, the CSV reader's schema inference leaves date/time columns as string.
If you don't want to submit your own schema, one way would be this:
Dataset<Row> df = sparkSession.read().format("csv").option("header","true").option("inferschema", "true").load("example.csv");
df.printSchema(); // check output - 1
df.createOrReplaceTempView("df");
Dataset<Row> df1 = sparkSession.sql("select * , Date(datetime) as datetime_d from df").drop("datetime");
df1.printSchema(); // check output - 2
====================================
output - 1:
root
|-- id: integer (nullable = true)
|-- symbol: string (nullable = true)
|-- datetime: string (nullable = true)
|-- side: string (nullable = true)
|-- orderQty: integer (nullable = true)
|-- price: double (nullable = true)
output - 2:
root
|-- id: integer (nullable = true)
|-- symbol: string (nullable = true)
|-- side: string (nullable = true)
|-- orderQty: integer (nullable = true)
|-- price: double (nullable = true)
|-- datetime_d: date (nullable = true)
I'd choose this method if the number of fields to cast is not high.
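The same cast can also be done with the DataFrame API instead of SQL. This is only a sketch in Scala (the code above is Java, but the idea is identical), assuming sparkSession is your SparkSession and the datetime strings are in a format to_date understands:
import org.apache.spark.sql.functions.{col, to_date}

// inferSchema still leaves datetime as string, so cast it explicitly afterwards.
val csvDf = sparkSession.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("example.csv")

val casted = csvDf
  .withColumn("datetime_d", to_date(col("datetime")))
  .drop("datetime")

casted.printSchema() // datetime_d: date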
If you want to submit your own schema:
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

List<StructField> fields = new ArrayList<>();
fields.add(DataTypes.createStructField("datetime", DataTypes.TimestampType, true));
fields.add(DataTypes.createStructField("price", DataTypes.DoubleType, true));
fields.add(DataTypes.createStructField("orderQty", DataTypes.DoubleType, true));
StructType schema = DataTypes.createStructType(fields);

Dataset<Row> df = sparkSession.read().format("csv").option("header", "true").schema(schema).load("example.csv");
df.printSchema(); // output - 1
df.createOrReplaceTempView("df");
Dataset<Row> df1 = sparkSession.sql("select * , Date(datetime) as datetime_d from df").drop("datetime");
df1.printSchema(); // output - 2
======================================
output - 1:
root
|-- datetime: timestamp (nullable = true)
|-- price: double (nullable = true)
|-- orderQty: double (nullable = true)
output - 2:
root
|-- price: double (nullable = true)
|-- orderQty: double (nullable = true)
|-- datetime_d: date (nullable = true)
Since this again just casts the column from timestamp to date, I don't see much benefit in this method, but I'm leaving it here in case it's useful later.
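One more hedged note on the IllegalArgumentException from Date.valueOf in the question: with a user-supplied timestamp schema, Spark has to parse the CSV's datetime strings itself, and that typically fails when they are not in the default yyyy-MM-dd HH:mm:ss form. In that case, passing an explicit timestampFormat when reading may help. A Scala sketch; the pattern below is an assumption and must be adjusted to the actual data:
import org.apache.spark.sql.types.{DoubleType, StructField, StructType, TimestampType}

val schema = StructType(Seq(
  StructField("datetime", TimestampType, nullable = true),
  StructField("price", DoubleType, nullable = true),
  StructField("orderQty", DoubleType, nullable = true)
))

// timestampFormat tells the CSV reader how to parse the datetime strings.
val df = sparkSession.read
  .format("csv")
  .option("header", "true")
  .option("timestampFormat", "yyyy-MM-dd HH:mm:ss") // assumed pattern
  .schema(schema)
  .load("example.csv")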
I have a DataFrame where the tags column contains different key -> value pairs. I am trying to extract the value where key = "name" and put that information into a new DataFrame.
The initial df has the following schema:
root
|-- id: long (nullable = true)
|-- type: string (nullable = true)
|-- tags: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- nds: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- ref: long (nullable = true)
|-- members: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- type: string (nullable = true)
| | |-- ref: long (nullable = true)
| | |-- role: string (nullable = true)
|-- visible: boolean (nullable = true)
And I want a new DataFrame newdf with this schema:
root
|-- place: string (nullable = true)
|-- num_evacuees: string (nullable = true)
How should I write this filter? I tried many approaches, starting with an ordinary filter, but every time the result is an empty DataFrame. For example:
val newdf = df.filter($"tags"("key") contains "name")
val newdf = df.where(places("tags")("key") === "name")
I tried many more methods, but none of them worked. What is the proper way to do this filter?
You can achieve the result you want with:
import org.apache.spark.sql.functions.{array, array_contains, col, explode, lit, map_filter}
import spark.implicits._

val df = Seq(
  (1L, Map("sf" -> "100")),
  (2L, Map("ny" -> "200"))
).toDF("id", "tags")

val resultDf = df
  .select(explode(map_filter(col("tags"), (k, _) => k === "ny")))
  .withColumnRenamed("key", "place")
  .withColumnRenamed("value", "num_evacuees")

resultDf.printSchema
resultDf.show
Which will show:
root
|-- place: string (nullable = false)
|-- num_evacuees: string (nullable = true)
+-----+------------+
|place|num_evacuees|
+-----+------------+
| ny| 200|
+-----+------------+
The key idea is to use map_filter to select the entries you want from the map; explode then turns the map into two columns (key and value), which you can rename so the DataFrame matches your specification.
The example above filters on a single key just to demonstrate the idea. The lambda function used by map_filter can be as complex as necessary; its signature, map_filter(expr: Column, f: (Column, Column) => Column): Column, shows that as long as you return a Column it will be happy.
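For instance, a predicate that looks at both the key and the value works just as well. A small sketch on the same toy df, with made-up conditions:
// Keep entries whose key starts with "n" or whose value is exactly "100".
val complexDf = df.select(
  explode(map_filter(col("tags"), (k, v) => k.startsWith("n") || v === "100"))
)
complexDf.show(false)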
If you wanted to filter a large number of entries you could do something like:
val resultDf = df
  .withColumn("filterList", array(lit("sf"), lit("place_n"))) // lit() makes these literal keys rather than column references
  .select(explode(map_filter(col("tags"), (k, _) => array_contains(col("filterList"), k))))
The idea is to extract the keys of the map column (tags), then use array_contains to check for a key called "name".
import org.apache.spark.sql.functions._
val newdf = df.filter(array_contains(map_keys($"tags"), "name"))
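For example, on a small sample DataFrame (assumed data, purely for illustration), only rows whose tags map actually contains a "name" key survive the filter:
import org.apache.spark.sql.functions.{array_contains, map_keys}
import spark.implicits._

val sample = Seq(
  (1L, Map("name" -> "city hall", "place" -> "A")),
  (2L, Map("place" -> "B"))
).toDF("id", "tags")

// Only the first row has a "name" key, so only it is kept.
sample.filter(array_contains(map_keys($"tags"), "name")).show(false)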
I have a dataframe with the following schema:
root
|-- id: long (nullable = true)
|-- type: string (nullable = true)
|-- tags: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- lat: long (nullable = true)
|-- lon: long (nullable = true)
|-- nds: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- ref: long (nullable = true)
|-- members: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- type: string (nullable = true)
| | |-- ref: long (nullable = true)
| | |-- role: string (nullable = true)
I want to create a new DataFrame res by selecting specific data from the tags column. I need the values for the keys place and population. The new DataFrame should have the following schema:
val schema = StructType(
Array(
StructField("place", StringType),
StructField("population", LongType)
)
)
I literally have no idea how to do this. I tried to replicate the first DataFrame and then select the columns, but that didn't work. Does anyone have a solution?
You can apply the desired key directly to a column of type map to extract its value, then cast and rename the columns as you wish, as follows:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.LongType
val result = dataframe.select(
col("tags")("place").as("place"),
col("tags")("population").cast(LongType).as("population")
)
With the following tags column:
+------------------------------------------------+
|tags |
+------------------------------------------------+
|{place -> A, population -> 32, another_key -> X}|
|{place -> B, population -> 64, another_key -> Y}|
+------------------------------------------------+
You get the following result:
+-----+----------+
|place|population|
+-----+----------+
|A |32 |
|B |64 |
+-----+----------+
Having the following schema:
root
|-- place: string (nullable = true)
|-- population: long (nullable = true)
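For reference, a sample DataFrame like the one above can be built as follows (a sketch; the literal values are made up to match the table shown):
import spark.implicits._

val dataframe = Seq(
  Map("place" -> "A", "population" -> "32", "another_key" -> "X"),
  Map("place" -> "B", "population" -> "64", "another_key" -> "Y")
).toDF("tags")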
Let's call your original DataFrame df. You can extract the information you want like this:
import org.apache.spark.sql.functions.col
val data = df
.select("tags")
.where(
df("tags")("key") isin (List("place", "population"): _*)
)
.select(
col("tags")("value")
)
.collect()
.toList
This will give you a List[Row] which can be converted to another dataframe with your schema
import scala.collection.JavaConversions.seqAsJavaList
sparkSession.createDataFrame(seqAsJavaList[Row](data), schema)
Given the following simplified input:
import spark.implicits._

val df = Seq(
(1L, Map("place" -> "home", "population" -> "1", "name" -> "foo")),
(2L, Map("place" -> "home", "population" -> "4", "name" -> "foo")),
(3L, Map("population" -> "3")),
(4L, Map.empty[String, String])
).toDF("id", "tags")
You can select the values by using map_filter to reduce the map to only the key you want, then calling map_values to get those entries. map_values returns an array, so you need explode_outer to flatten the data. We use explode_outer here because some entries may have neither place nor population, or only one of the two. Once the data is in a form that is easy to work with, we just select the fields we want in the desired structure.
I've left the id column in so when you run the example you can see that we don't drop entries with missing data.
import org.apache.spark.sql.functions.{array, col, explode_outer, map_filter, map_values, struct}
import org.apache.spark.sql.types.LongType

val r = df.select(
col("id"),
explode_outer(map_values(map_filter(col("tags"), (k,_) => k === "place"))) as "place",
map_values(map_filter(col("tags"), (k,_) => k === "population")) as "population"
).withColumn("population", explode_outer(col("population")))
.select(
col("id"),
array(
struct(
col("place"),
col("population") cast LongType as "population"
) as "place_and_population"
) as "data"
)
Gives:
root
|-- id: long (nullable = false)
|-- data: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- place: string (nullable = true)
| | |-- population: long (nullable = true)
+---+--------------+
| id| data|
+---+--------------+
| 1| [{home, 1}]|
| 2| [{home, 4}]|
| 3| [{null, 3}]|
| 4|[{null, null}]|
+---+--------------+
I'm trying to write a test case for a program.
For that, I'm reading a CSV file that has data in the following format.
account_number,struct_data
123456789,{"key1":"value","key2":"value2","keyn":"valuen"}
987678909,{"key1":"value0","key2":"value20","keyn":"valuen0"}
some hundreds of such rows.
I need to read the second column as a struct. But I'm getting the error
struct type expected, string type found
I tried casting it to StructType, but then I got the error "StringType cannot be converted to StructType".
Should I change the way my CSV is? What else can I do?
I wrote my solution in Scala Spark; it might give some insight into your problem.
scala> val sdf = """{"df":[{"actNum": "1234123", "strType": [{"key1": "value1", "key2": "value2"}]}]}"""
sdf: String = {"df":[{"actNum": "1234123", "strType": [{"key1": "value1", "key2": "value2"}]}]}
scala> val erdf = spark.read.json(Seq(sdf).toDS).toDF().withColumn("arr", explode($"df")).select("arr.*")
erdf: org.apache.spark.sql.DataFrame = [actNum: string, strType: array<struct<key1:string,key2:string>>]
scala> erdf.show()
+-------+-----------------+
| actNum| strType|
+-------+-----------------+
|1234123|[[value1,value2]]|
+-------+-----------------+
scala> erdf.printSchema
root
|-- actNum: string (nullable = true)
|-- strType: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key1: string (nullable = true)
| | |-- key2: string (nullable = true)
If all of the JSON records have the same schema, you can define it and use Spark's from_json() function to accomplish your task.
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.StructType
import spark.implicits._
val df = Seq(
(123456789, "{\"key1\":\"value\",\"key2\":\"value2\",\"keyn\":\"valuen\"}"),
(987678909, "{\"key1\":\"value0\",\"key2\":\"value20\",\"keyn\":\"valuen0\"}")
).toDF("account_number", "struct_data")
val schema = new StructType()
.add($"key1".string)
.add($"key2".string)
.add($"keyn".string)
val df2 = df.withColumn("st", from_json($"struct_data", schema))
df2.printSchema
df2.show(false)
This snippet results in this output:
root
|-- account_number: integer (nullable = false)
|-- struct_data: string (nullable = true)
|-- st: struct (nullable = true)
| |-- key1: string (nullable = true)
| |-- key2: string (nullable = true)
| |-- keyn: string (nullable = true)
+--------------+---------------------------------------------------+------------------------+
|account_number|struct_data |st |
+--------------+---------------------------------------------------+------------------------+
|123456789 |{"key1":"value","key2":"value2","keyn":"valuen"} |[value,value2,valuen] |
|987678909 |{"key1":"value0","key2":"value20","keyn":"valuen0"}|[value0,value20,valuen0]|
+--------------+---------------------------------------------------+------------------------+
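If you would rather not write the schema by hand, a hedged alternative (assuming Spark 2.4 or later) is to let schema_of_json derive it from one representative record:
import org.apache.spark.sql.functions.{from_json, lit, schema_of_json}

// Derive the struct schema from a single sample JSON string.
val sampleJson = """{"key1":"value","key2":"value2","keyn":"valuen"}"""
val df3 = df.withColumn("st", from_json($"struct_data", schema_of_json(lit(sampleJson))))
df3.printSchema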
Dataset<Row> ds = spark.read().option("multiLine", true).option("mode", "PERMISSIVE").json("/user/administrador/prueba_diario.txt").toDF();
ds.printSchema();
Dataset<Row> ds2 = ds.select("articles").toDF();
ds2.printSchema();
spark.sql("drop table if exists table1");
ds2.write().saveAsTable("table1");
I have JSON in this format:
root
|-- articles: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- author: string (nullable = true)
| | |-- content: string (nullable = true)
| | |-- description: string (nullable = true)
| | |-- publishedAt: string (nullable = true)
| | |-- source: struct (nullable = true)
| | | |-- id: string (nullable = true)
| | | |-- name: string (nullable = true)
| | |-- title: string (nullable = true)
| | |-- url: string (nullable = true)
| | |-- urlToImage: string (nullable = true)
|-- status: string (nullable = true)
|-- totalResults: long (nullable = true)
I want to save the articles array as a Hive table whose columns are the array element's fields. Example of the Hive table I want:
author (string)
content (string)
description (string)
publishedat (string)
source (struct<id:string,name:string>)
title (string)
url (string)
urltoimage (string)
The problem is that the table is saved with just one column, named articles, and all the content ends up inside that single column.
A bit convoluted, but I found this one to work:
import org.apache.spark.sql.functions._
ds.select(explode(col("articles")).as("exploded")).select("exploded.*").toDF()
I tested it on
{
"articles": [
{
"author": "J.K. Rowling",
"title": "Harry Potter and the goblet of fire"
},
{
"author": "George Orwell",
"title": "1984"
}
]
}
and it returned (after collecting it into an array)
result = {Arrays$ArrayList#13423} size = 2
0 = {GenericRowWithSchema#13425} "[J.K. Rowling,Harry Potter and the goblet of fire]"
1 = {GenericRowWithSchema#13426} "[George Orwell,1984]"
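To then persist the flattened result the way the question intends, you can reuse the saveAsTable call from the question's own code. A Scala sketch, assuming ds is the Dataset read from the JSON file:
import org.apache.spark.sql.functions.{col, explode}

val flattened = ds.select(explode(col("articles")).as("exploded")).select("exploded.*")

// Each struct field (author, content, description, ...) becomes its own Hive column.
flattened.write.mode("overwrite").saveAsTable("table1")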
I'm loading a Parquet file as a Spark Dataset. I can query it and create new Datasets from the query. Now I would like to add a new column ("hashkey") to the Dataset and generate its values (e.g. md5sum(nameValue)). How can I achieve this?
public static void main(String[] args) {
SparkConf sparkConf = new SparkConf();
sparkConf.setAppName("Hello Spark");
sparkConf.setMaster("local");
SparkSession spark = SparkSession.builder().appName("Java Spark SQL basic example")
.config("spark.master", "local").config("spark.sql.warehouse.dir", "file:///C:\\spark_warehouse")
.getOrCreate();
Dataset<org.apache.spark.sql.Row> df = spark.read().parquet("meetup.parquet");
df.show();
df.createOrReplaceTempView("tmpview");
Dataset<Row> namesDF = spark.sql("SELECT * FROM tmpview where name like 'Spark-%'");
namesDF.show();
}
The output looks like this:
+-------------+-----------+-----+---------+--------------------+
| name|meetup_date|going|organizer| topics|
+-------------+-----------+-----+---------+--------------------+
| Spark-H20| 2016-01-01| 50|airisdata|[h2o, repeated sh...|
| Spark-Avro| 2016-01-02| 60|airisdata| [avro, usecases]|
|Spark-Parquet| 2016-01-03| 70|airisdata| [parquet, usecases]|
+-------------+-----------+-----+---------+--------------------+
Just add the Spark SQL md5 function to your query:
Dataset<Row> namesDF = spark.sql("SELECT *, md5(name) as modified_name FROM tmpview where name like 'Spark-%'");
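The same thing can be done without going through SQL, via the DataFrame API. A short Scala sketch; namesDF is the Dataset from the question, and the Java version would call org.apache.spark.sql.functions.md5 the same way:
import org.apache.spark.sql.functions.{col, md5}

// Adds a "hashkey" column containing the MD5 hex digest of the name column.
val withHash = namesDF.withColumn("hashkey", md5(col("name")))
withHash.show(false)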
Dataset<Row> ds = sqlContext.read()
.format("com.databricks.spark.csv")
.option("inferSchema", "true")
.option("header", "true")
.option("delimiter","|")
.load("/home/cloudera/Desktop/data.csv");
ds.printSchema();
will print this:
root
|-- ReferenceValueSet_Id: integer (nullable = true)
|-- ReferenceValueSet_Name: string (nullable = true)
|-- Code_Description: string (nullable = true)
|-- Code_Type: string (nullable = true)
|-- Code: string (nullable = true)
|-- CURR_FLAG: string (nullable = true)
|-- REC_CREATE_DATE: timestamp (nullable = true)
|-- REC_UPDATE_DATE: timestamp (nullable = true)
Dataset<Row> df1 = ds.withColumn("Key", functions.lit(1));
df1.printSchema();
After adding the code above, one column with a constant value is appended:
root
|-- ReferenceValueSet_Id: integer (nullable = true)
|-- ReferenceValueSet_Name: string (nullable = true)
|-- Code_Description: string (nullable = true)
|-- Code_Type: string (nullable = true)
|-- Code: string (nullable = true)
|-- CURR_FLAG: string (nullable = true)
|-- REC_CREATE_DATE: timestamp (nullable = true)
|-- REC_UPDATE_DATE: timestamp (nullable = true)
|-- Key: integer (nullable = true)
You can see that a column named Key has been added to the dataset. If you want the new column to hold the values of an existing column instead of a constant, you can use the code below.
Dataset<Row> df1 = ds.withColumn("Key", functions.lit(ds.col("Code")));
df1.printSchema();
df1.show();
Now it will put whatever values are in the Code column into the newly added column named Key.
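If the goal is a hash rather than a copy of an existing column, withColumn also accepts arbitrary expressions, so you can hash one or several columns at once. A Scala sketch; the column names are taken from the schema above, and ds is assumed to be the same DataFrame read in Scala:
import org.apache.spark.sql.functions.{col, concat_ws, md5}

// Hash the concatenation of two columns, separated by "|", into the new Key column.
val hashed = ds.withColumn("Key", md5(concat_ws("|", col("Code"), col("ReferenceValueSet_Name"))))
hashed.show(false)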