How to convert a DataFrame of strings to a DataFrame with a defined schema - java

I have a DataFrame of strings, with each string being a JSON record. I want to convert it to a DataFrame with a defined schema.
{"StartTime":1649424816686069,"StatusCode":200,"HTTPMethod":"GET","HTTPUserAgent":"Jakarta Commons-HttpClient/3.1"}|
{"StartTime":164981846249877,"StatusCode":200,"HTTPMethod":"GET","HTTPUserAgent":"Jakarta Commons-HttpClient/3.1"}|
{"StartTime":16498172424241095,"StatusCode":200,"HTTPMethod":"GET","HTTPUserAgent":"Jakarta Commons-HttpClient/3.1"}|
Here is my input schema:
Input.printSchema
input: org.apache.spark.sql.DataFrame = [value: string]
root
|-- value: string (nullable = true)
The desired output is something like this:
root
|-- StartTime: integer (nullable = true)
|-- StatusCode: integer (nullable = true)
|-- HTTPMethod: string (nullable = true)
|-- HTTPUserAgent: string (nullable = true)
I tried defining a struct schema and creating a DataFrame from it, but that throws an ArrayIndexOutOfBoundsException:
spark.createDataFrame(input,simpleSchema).show
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 116.0 failed 4 times, most recent failure: Lost task 0.3 in stage 116.0 (TID 17471, ip-10-0-62-29.ec2.internal, executor 1030): java.lang.RuntimeException: Error while encoding: java.lang.ArrayIndexOutOfBoundsException: 1
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 0, Channel), StringType), true, false) AS Channel#947

scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._
scala> df.printSchema
root
|-- value: string (nullable = true)
scala> df.show(false)
+--------------------------------------------------------------------------------------------------------------------+
|value |
+--------------------------------------------------------------------------------------------------------------------+
|{"StartTime":1649424816686069,"StatusCode":200,"HTTPMethod":"GET","HTTPUserAgent":"Jakarta Commons-HttpClient/3.1"} |
|{"StartTime":164981846249877,"StatusCode":200,"HTTPMethod":"GET","HTTPUserAgent":"Jakarta Commons-HttpClient/3.1"} |
|{"StartTime":16498172424241095,"StatusCode":200,"HTTPMethod":"GET","HTTPUserAgent":"Jakarta Commons-HttpClient/3.1"}|
+--------------------------------------------------------------------------------------------------------------------+
scala> val sch = spark.read.json(df.select("value").as[String].distinct).schema
sch: org.apache.spark.sql.types.StructType = StructType(StructField(HTTPMethod,StringType,true), StructField(HTTPUserAgent,StringType,true), StructField(StartTime,LongType,true), StructField(StatusCode,LongType,true))
scala> val df1 = df.withColumn("jsonData", from_json(col("value"), sch, Map.empty[String, String])).select(col("jsonData.*"))
df1: org.apache.spark.sql.DataFrame = [HTTPMethod: string, HTTPUserAgent: string ... 2 more fields]
scala> df1.show(false)
+----------+------------------------------+-----------------+----------+
|HTTPMethod|HTTPUserAgent |StartTime |StatusCode|
+----------+------------------------------+-----------------+----------+
|GET |Jakarta Commons-HttpClient/3.1|1649424816686069 |200 |
|GET |Jakarta Commons-HttpClient/3.1|164981846249877 |200 |
|GET |Jakarta Commons-HttpClient/3.1|16498172424241095|200 |
+----------+------------------------------+-----------------+----------+
scala> df1.printSchema
root
|-- HTTPMethod: string (nullable = true)
|-- HTTPUserAgent: string (nullable = true)
|-- StartTime: long (nullable = true)
|-- StatusCode: long (nullable = true)
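Since the question asks for Java, here is a minimal sketch of the same approach with the Java API. It assumes Spark 2.2+ (for json(Dataset<String>)), a SparkSession named spark, and an input Dataset<Row> named input with a single string column called value; the names are illustrative.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.StructType;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.from_json;

// Infer the schema by letting the JSON reader scan the distinct value strings once.
StructType sch = spark.read()
        .json(input.select("value").as(Encoders.STRING()).distinct())
        .schema();

// Parse each JSON string with the inferred schema and flatten the resulting struct.
Dataset<Row> parsed = input
        .withColumn("jsonData", from_json(col("value"), sch))
        .select(col("jsonData.*"));

parsed.printSchema();
parsed.show(false);
If you already know the schema, you can build it by hand with new StructType().add(...) instead of inferring it and skip the extra pass over the data.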

Related

Create a new dataframe (with different schema) from selected information from another dataframe

I have a dataframe where the tags column contains different key->value pairs. I am trying to filter out the value information where key = name; the filtered information should be put into a new dataframe.
The initial df has the following schema:
root
|-- id: long (nullable = true)
|-- type: string (nullable = true)
|-- tags: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- nds: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- ref: long (nullable = true)
|-- members: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- type: string (nullable = true)
| | |-- ref: long (nullable = true)
| | |-- role: string (nullable = true)
|-- visible: boolean (nullable = true)
And I want a new dataframe newdf with the schema:
root
|-- place: string (nullable = true)
|-- num_evacuees: string (nullable = true)
How should I do the filter? I tried a lot of methods, starting with a plain filter, but each time the result is an empty dataframe. For example:
val newdf = df.filter($"tags"("key") contains "name")
val newdf = df.where(places("tags")("key") === "name")
I tried many more methods, but none of them worked.
How should I write the proper filter?
You can achieve the result you want with:
import org.apache.spark.sql.functions._

val df = Seq(
  (1L, Map("sf" -> "100")),
  (2L, Map("ny" -> "200"))
).toDF("id", "tags")

val resultDf = df
  .select(explode(map_filter(col("tags"), (k, _) => k === "ny")))
  .withColumnRenamed("key", "place")
  .withColumnRenamed("value", "num_evacuees")

resultDf.printSchema
resultDf.show
Which will show:
root
|-- place: string (nullable = false)
|-- num_evacuees: string (nullable = true)
+-----+------------+
|place|num_evacuees|
+-----+------------+
| ny| 200|
+-----+------------+
The key idea is to use map_filter to select the entries you want from the map; explode then turns the map into two columns (key and value), which you can rename so the DataFrame matches your specification.
The above example assumes you want to get a single value to demonstrate the idea. The lambda function used by map_filter can be as complex as necessary. Its signature map_filter(expr: Column, f: (Column, Column) => Column): Column shows that as long as you return a Column it will be happy.
If you wanted to filter a large number of entries you could do something like:
val resultDf = df
  .withColumn("filterList", array(lit("sf"), lit("place_n")))
  .select(explode(map_filter(col("tags"), (k, _) => array_contains(col("filterList"), k))))
The idea is to extract the keys of the map column (tags), then use array_contains to check for a key called "name".
import org.apache.spark.sql.functions._
val newdf = df.filter(array_contains(map_keys($"tags"), "name"))

Create new dataframe from selected information from another dataframe

I have a dataframe with the following schema:
root
|-- id: long (nullable = true)
|-- type: string (nullable = true)
|-- tags: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- lat: long (nullable = true)
|-- lon: long (nullable = true)
|-- nds: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- ref: long (nullable = true)
|-- members: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- type: string (nullable = true)
| | |-- ref: long (nullable = true)
| | |-- role: string (nullable = true)
I want to create a new dataframe res where I select specific data from the column tags. I need the values from key=place and key=population. The new dataframe should have the following schema:
val schema = StructType(
  Array(
    StructField("place", StringType),
    StructField("population", LongType)
  )
)
I have literally no idea how to do this. I tried to replicate the first dataframe and then select the columns, but that didn't work.
Does anyone have a solution?
You can directly apply the desired key to the map-typed column to extract its value, then cast and rename the columns as you wish, as follows:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.LongType
val result = dataframe.select(
  col("tags")("place").as("place"),
  col("tags")("population").cast(LongType).as("population")
)
With the following tags column:
+------------------------------------------------+
|tags |
+------------------------------------------------+
|{place -> A, population -> 32, another_key -> X}|
|{place -> B, population -> 64, another_key -> Y}|
+------------------------------------------------+
You get the following result:
+-----+----------+
|place|population|
+-----+----------+
|A |32 |
|B |64 |
+-----+----------+
Having the following schema:
root
|-- place: string (nullable = true)
|-- population: long (nullable = true)
Let's call your original dataframe df. You can extract the information you want like this:
import org.apache.spark.sql.functions.col
val data = df
.select("tags")
.where(
df("tags")("key") isin (List("place", "population"): _*)
)
.select(
col("tags")("value")
)
.collect()
.toList
This will give you a List[Row] which can be converted to another dataframe with your schema
import scala.collection.JavaConversions.seqAsJavaList
sparkSession.createDataFrame(seqAsJavaList[Row](data), schema)
Given the following simplified input:
val df = Seq(
  (1L, Map("place" -> "home", "population" -> "1", "name" -> "foo")),
  (2L, Map("place" -> "home", "population" -> "4", "name" -> "foo")),
  (3L, Map("population" -> "3")),
  (4L, Map.empty[String, String])
).toDF("id", "tags")
You can select the values by using map_filter to reduce the map to only the key you want, then calling map_values to get those entries. map_values returns an array, so you need explode_outer to flatten the data; we use explode_outer because some entries may have neither place nor population, or only one of the two. Once the data is in a form we can easily work with, we just select the fields we want in the desired structure.
I've left the id column in so when you run the example you can see that we don't drop entries with missing data.
val r = df.select(
col("id"),
explode_outer(map_values(map_filter(col("tags"), (k,_) => k === "place"))) as "place",
map_values(map_filter(col("tags"), (k,_) => k === "population")) as "population"
).withColumn("population", explode_outer(col("population")))
.select(
col("id"),
array(
struct(
col("place"),
col("population") cast LongType as "population"
) as "place_and_population"
) as "data"
)
Gives:
root
|-- id: long (nullable = false)
|-- data: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- place: string (nullable = true)
| | |-- population: long (nullable = true)
+---+--------------+
| id| data|
+---+--------------+
| 1| [{home, 1}]|
| 2| [{home, 4}]|
| 3| [{null, 3}]|
| 4|[{null, null}]|
+---+--------------+

Reading CSV files containing a struct type in Spark using Java

I'm trying to write a test case for a program.
For that, I'm reading a CSV file that has data in the following format.
account_number,struct_data
123456789,{"key1":"value","key2":"value2","keyn":"valuen"}
987678909,{"key1":"value0","key2":"value20","keyn":"valuen0"}
There are some hundreds of such rows.
I need to read the second column as a struct, but I'm getting the error
struct type expected, string type found
I tried casting it to StructType, but then I get the error "StringType cannot be converted to StructType".
Should I change the way my CSV is structured? What else can I do?
My solution is in Scala Spark; it might give some insight into your query.
scala> val sdf = """{"df":[{"actNum": "1234123", "strType": [{"key1": "value1", "key2": "value2"}]}]}"""
sdf: String = {"df":[{"actNum": "1234123", "strType": [{"key1": "value1", "key2": "value2"}]}]}
scala> val erdf = spark.read.json(Seq(sdf).toDS).toDF().withColumn("arr", explode($"df")).select("arr.*")
erdf: org.apache.spark.sql.DataFrame = [actNum: string, strType: array<struct<key1:string,key2:string>>]
scala> erdf.show()
+-------+-----------------+
| actNum| strType|
+-------+-----------------+
|1234123|[[value1,value2]]|
+-------+-----------------+
scala> erdf.printSchema
root
|-- actNum: string (nullable = true)
|-- strType: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key1: string (nullable = true)
| | |-- key2: string (nullable = true)
If all of the JSON records have the same schema, you can define it and use Spark's from_json() function to accomplish your task.
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.StructType

val df = Seq(
  (123456789, "{\"key1\":\"value\",\"key2\":\"value2\",\"keyn\":\"valuen\"}"),
  (987678909, "{\"key1\":\"value0\",\"key2\":\"value20\",\"keyn\":\"valuen0\"}")
).toDF("account_number", "struct_data")

val schema = new StructType()
  .add($"key1".string)
  .add($"key2".string)
  .add($"keyn".string)

val df2 = df.withColumn("st", from_json($"struct_data", schema))
df2.printSchema
df2.show(false)
This snippet results in this output:
root
|-- account_number: integer (nullable = false)
|-- struct_data: string (nullable = true)
|-- st: struct (nullable = true)
| |-- key1: string (nullable = true)
| |-- key2: string (nullable = true)
| |-- keyn: string (nullable = true)
+--------------+---------------------------------------------------+------------------------+
|account_number|struct_data |st |
+--------------+---------------------------------------------------+------------------------+
|123456789 |{"key1":"value","key2":"value2","keyn":"valuen"} |[value,value2,valuen] |
|987678909 |{"key1":"value0","key2":"value20","keyn":"valuen0"}|[value0,value20,valuen0]|
+--------------+---------------------------------------------------+------------------------+
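The same from_json approach can also be written against the Java API, which is what the question asks for. A rough sketch, assuming an existing SparkSession named spark and an illustrative file path; adjust the options to match your CSV.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.from_json;

// Read the CSV; struct_data arrives as a plain string column.
Dataset<Row> df = spark.read()
        .option("header", "true")
        .csv("/path/to/file.csv"); // illustrative path

// Declare the schema of the JSON held in struct_data.
StructType jsonSchema = new StructType()
        .add("key1", DataTypes.StringType)
        .add("key2", DataTypes.StringType)
        .add("keyn", DataTypes.StringType);

// Parse the string column into a proper struct column.
Dataset<Row> parsed = df.withColumn("st", from_json(col("struct_data"), jsonSchema));
parsed.printSchema();
parsed.show(false);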

Spark Dataframe datatype as String

I'm trying to validate the data types of a DataFrame by running describe as an SQL query, but every time I get datetime as string.
1. First I tried the code below:
SparkSession sparkSession = new SparkSession.Builder().getOrCreate();
Dataset<Row> df = sparkSession.read().option("header", "true").option("inferschema", "true")
        .format("csv").load("/user/data/*_ecs.csv");
try {
    df.createTempView("data");
    Dataset<Row> sqlDf = sparkSession.sql("Describe data");
    sqlDf.show(300, false);
} catch (Exception e) {
    e.printStackTrace();
}
Output:
+-----------------+---------+-------+
|col_name |data_type|comment|
+-----------------+---------+-------+
|id |int |null |
|symbol |string |null |
|datetime |string |null |
|side |string |null |
|orderQty |int |null |
|price |double |null |
+-----------------+---------+-------+
I also tried a custom schema, but in that case I get an exception when I execute any query other than describe table:
SparkSession sparkSession = new SparkSession.Builder().getOrCreate();
Dataset<Row> df = sparkSession.read().option("header", "true").schema(customeSchema)
        .format("csv").load("/use/data/*_ecs.csv");
try {
    df.createTempView("trade_data");
    Dataset<Row> sqlDf = sparkSession.sql("Describe trade_data");
    sqlDf.show(300, false);
} catch (Exception e) {
    e.printStackTrace();
}
Output:
+--------+---------+-------+
|col_name|data_type|comment|
+--------+---------+-------+
|datetime|timestamp|null |
|price |double |null |
|orderQty|double |null |
+--------+---------+-------+
But if I try any other query, I get the exception below:
Dataset<Row> sqlDf=sparkSession.sql("select DATE(datetime),avg(price),avg(orderQty) from data group by datetime");
java.lang.IllegalArgumentException
at java.sql.Date.valueOf(Date.java:143)
at org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:137)
How can this be solved?
Why is inferSchema not working?
You can find more about this at this link: https://issues.apache.org/jira/browse/SPARK-19228
In short, date/time columns are parsed as String in the current version of Spark.
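In practice that means converting the column yourself after reading. A minimal sketch with the Java DataFrame API, where df is the DataFrame read with inferSchema from the question; it assumes Spark 2.2+ for to_timestamp and that the datetime strings are in a format it can parse (otherwise pass an explicit format string as the second argument).
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.to_timestamp;

// datetime was inferred as string, so cast it explicitly after the read.
Dataset<Row> typed = df.withColumn("datetime", to_timestamp(col("datetime")));
typed.printSchema(); // datetime should now show up as timestamp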
If you don't want to submit your own schema, one way would be this:
Dataset<Row> df = sparkSession.read().format("csv").option("header","true").option("inferschema", "true").load("example.csv");
df.printSchema(); // check output - 1
df.createOrReplaceTempView("df");
Dataset<Row> df1 = sparkSession.sql("select * , Date(datetime) as datetime_d from df").drop("datetime");
df1.printSchema(); // check output - 2
====================================
output - 1:
root
|-- id: integer (nullable = true)
|-- symbol: string (nullable = true)
|-- datetime: string (nullable = true)
|-- side: string (nullable = true)
|-- orderQty: integer (nullable = true)
|-- price: double (nullable = true)
output - 2:
root
|-- id: integer (nullable = true)
|-- symbol: string (nullable = true)
|-- side: string (nullable = true)
|-- orderQty: integer (nullable = true)
|-- price: double (nullable = true)
|-- datetime_d: date (nullable = true)
I'd choose this method if the number of fields to cast is not high.
If you want to submit your own schema:
List<org.apache.spark.sql.types.StructField> fields = new ArrayList<>();
fields.add(DataTypes.createStructField("datetime", DataTypes.TimestampType, true));
fields.add(DataTypes.createStructField("price",DataTypes.DoubleType,true));
fields.add(DataTypes.createStructField("orderQty",DataTypes.DoubleType,true));
StructType schema = DataTypes.createStructType(fields);
Dataset<Row> df = sparkSession.read().format("csv").option("header", "true").schema(schema).load("example.csv");
df.printSchema(); // output - 1
df.createOrReplaceTempView("df");
Dataset<Row> df1 = sparkSession.sql("select * , Date(datetime) as datetime_d from df").drop("datetime");
df1.printSchema(); // output - 2
======================================
output - 1:
root
|-- datetime: timestamp (nullable = true)
|-- price: double (nullable = true)
|-- orderQty: double (nullable = true)
output - 2:
root
|-- price: double (nullable = true)
|-- orderQty: double (nullable = true)
|-- datetime_d: date (nullable = true)
Since it is again casting the column from timestamp to Date, I don't see much use for this method, but I'm still putting it here in case it's useful to you later.
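One more thing worth checking: the IllegalArgumentException from java.sql.Date.valueOf with the custom schema can mean the CSV's datetime strings don't match Spark's default timestamp format. A hedged sketch of declaring the format up front, reusing sparkSession and the schema built above; the pattern shown is only an assumption, replace it with whatever your file actually contains.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Supplying timestampFormat lets the CSV reader populate the TimestampType
// field declared in the custom schema instead of failing while parsing.
Dataset<Row> df = sparkSession.read()
        .format("csv")
        .option("header", "true")
        .option("timestampFormat", "yyyy-MM-dd HH:mm:ss") // assumed pattern
        .schema(schema)
        .load("example.csv");
df.printSchema();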

Adding a new column to dataframe in Spark SQL using Java API and JavaRDD<Row>

I am trying to create a new dataframe (in Spark SQL 1.6.2) after applying a mapPartitions function, as follows:
FlatMapFunction<Iterator<Row>, Row> mapPartitonstoTTF = rows -> {
    List<Row> mappedRows = new ArrayList<Row>();
    while (rows.hasNext()) {
        Row row = rows.next();
        Row mappedRow = RowFactory.create(row.getDouble(0), row.getString(1), row.getLong(2), row.getDouble(3),
                row.getInt(4), row.getString(5), row.getString(6), row.getInt(7), row.getInt(8), row.getString(9), 0L);
        mappedRows.add(mappedRow);
    }
    return mappedRows;
};
JavaRDD<Row> sensorDataDoubleRDD=oldsensorDataDoubleDF.toJavaRDD().mapPartitions(mapPartitonstoTTF);
StructType oldSchema=oldsensorDataDoubleDF.schema();
StructType newSchema =oldSchema.add("TTF",DataTypes.LongType,false);
System.out.println("The new schema is: ");
newSchema.printTreeString();
System.out.println("The old schema is: ");
oldSchema.printTreeString();
DataFrame sensorDataDoubleDF=hc.createDataFrame(sensorDataDoubleRDD, newSchema);
sensorDataDoubleDF.show();
As seen above, I am adding a new LongType column with a value of 0 to the rows using the RowFactory.create() function.
However, I get an exception at the line running sensorDataDoubleDF.show(), as follows:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 117 in stage 26.0 failed 4 times, most recent failure: Lost task 117.3 in stage 26.0 (TID 3249, AUPER01-01-20-08-0.prod.vroc.com.au): scala.MatchError: 1435766400001 (of class java.lang.Long)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:295)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:294)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401)
at org.apache.spark.sql.SQLContext$$anonfun$6.apply(SQLContext.scala:492)
at org.apache.spark.sql.SQLContext$$anonfun$6.apply(SQLContext.scala:492)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1882)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1882)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
The old schema is
root
|-- data_quality: double (nullable = false)
|-- data_sensor: string (nullable = true)
|-- data_timestamp: long (nullable = false)
|-- data_valueDouble: double (nullable = false)
|-- day: integer (nullable = false)
|-- dpnode: string (nullable = true)
|-- dsnode: string (nullable = true)
|-- month: integer (nullable = false)
|-- year: integer (nullable = false)
|-- nodeid: string (nullable = true)
|-- nodename: string (nullable = true)
The new schema is like the one above, with the addition of a TTF column of LongType:
root
|-- data_quality: double (nullable = false)
|-- data_sensor: string (nullable = true)
|-- data_timestamp: long (nullable = false)
|-- data_valueDouble: double (nullable = false)
|-- day: integer (nullable = false)
|-- dpnode: string (nullable = true)
|-- dsnode: string (nullable = true)
|-- month: integer (nullable = false)
|-- year: integer (nullable = false)
|-- nodeid: string (nullable = true)
|-- nodename: string (nullable = true)
|-- TTF: long (nullable = false)
I would appreciate any help in figuring out where I am making a mistake.
You have 11 columns in the old schema but you are mapping only 10. Add row.getString(10) to the RowFactory.create call:
Row mappedRow = RowFactory.create(row.getDouble(0), row.getString(1), row.getLong(2), row.getDouble(3), row.getInt(4),
        row.getString(5), row.getString(6), row.getInt(7), row.getInt(8), row.getString(9), row.getString(10), 0L);
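As a side note, if the goal is only to append a constant TTF column, the whole mapPartitions round trip can be skipped. A sketch using withColumn with the names from the question; keep the row-mapping approach if TTF will later hold computed values.
import org.apache.spark.sql.DataFrame;
import static org.apache.spark.sql.functions.lit;

// Adds a TTF column filled with 0L without rebuilding every row by hand.
DataFrame sensorDataDoubleDF = oldsensorDataDoubleDF.withColumn("TTF", lit(0L));
sensorDataDoubleDF.show();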
