Adding a column to a Spark dataset and transforming data - Java

I'm loading a Parquet file as a Spark dataset. I can query it and create new datasets from the query. Now I would like to add a new column to the dataset ("hashkey") and generate its values (e.g. md5sum(nameValue)). How can I achieve this?
public static void main(String[] args) {
    SparkConf sparkConf = new SparkConf();
    sparkConf.setAppName("Hello Spark");
    sparkConf.setMaster("local");

    SparkSession spark = SparkSession.builder()
            .appName("Java Spark SQL basic example")
            .config("spark.master", "local")
            .config("spark.sql.warehouse.dir", "file:///C:\\spark_warehouse")
            .getOrCreate();

    Dataset<org.apache.spark.sql.Row> df = spark.read().parquet("meetup.parquet");
    df.show();

    df.createOrReplaceTempView("tmpview");
    Dataset<Row> namesDF = spark.sql("SELECT * FROM tmpview where name like 'Spark-%'");
    namesDF.show();
}
The output looks like this:
+-------------+-----------+-----+---------+--------------------+
| name|meetup_date|going|organizer| topics|
+-------------+-----------+-----+---------+--------------------+
| Spark-H20| 2016-01-01| 50|airisdata|[h2o, repeated sh...|
| Spark-Avro| 2016-01-02| 60|airisdata| [avro, usecases]|
|Spark-Parquet| 2016-01-03| 70|airisdata| [parquet, usecases]|
+-------------+-----------+-----+---------+--------------------+

Just add the Spark SQL md5 function to your query:
Dataset<Row> namesDF = spark.sql("SELECT *, md5(name) as modified_name FROM tmpview where name like 'Spark-%'");
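If you prefer the Dataset API over SQL, roughly the same thing can be done with withColumn and the built-in md5 function. A minimal sketch, assuming the namesDF dataset from the question (the hashkey column name comes from the question; withHash is just an illustrative variable name):
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.md5;

// Derive the "hashkey" column the question asks for from the name column.
Dataset<Row> withHash = namesDF.withColumn("hashkey", md5(col("name")));
withHash.show(false);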

Another option is withColumn, shown here starting from a CSV dataset:
Dataset<Row> ds = sqlContext.read()
        .format("com.databricks.spark.csv")
        .option("inferSchema", "true")
        .option("header", "true")
        .option("delimiter", "|")
        .load("/home/cloudera/Desktop/data.csv");
ds.printSchema();
This will print:
root
|-- ReferenceValueSet_Id: integer (nullable = true)
|-- ReferenceValueSet_Name: string (nullable = true)
|-- Code_Description: string (nullable = true)
|-- Code_Type: string (nullable = true)
|-- Code: string (nullable = true)
|-- CURR_FLAG: string (nullable = true)
|-- REC_CREATE_DATE: timestamp (nullable = true)
|-- REC_UPDATE_DATE: timestamp (nullable = true)
Dataset<Row> df1 = ds.withColumn("Key", functions.lit(1));
df1.printSchema();
After adding the code above, a new column with a constant value is appended:
root
|-- ReferenceValueSet_Id: integer (nullable = true)
|-- ReferenceValueSet_Name: string (nullable = true)
|-- Code_Description: string (nullable = true)
|-- Code_Type: string (nullable = true)
|-- Code: string (nullable = true)
|-- CURR_FLAG: string (nullable = true)
|-- REC_CREATE_DATE: timestamp (nullable = true)
|-- REC_UPDATE_DATE: timestamp (nullable = true)
|-- Key: integer (nullable = true)
You can see that a column named Key has been added to the dataset.
If you want the new column to copy an existing column instead of holding a constant value, you can use the code below.
Dataset<Row> df1 = ds.withColumn("Key", ds.col("Code"));
df1.printSchema();
df1.show();
Now the newly added column named Key will contain whatever values are in the Code column.
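The same withColumn pattern also covers the hash column from the original question. A hedged sketch, assuming the key should be derived from one or more existing columns (the column names below come from the schema printed above, and the "|" separator is purely illustrative):
import static org.apache.spark.sql.functions.concat_ws;
import static org.apache.spark.sql.functions.md5;

// Build a "hashkey" column by hashing the concatenation of a few existing columns.
Dataset<Row> withKey = ds.withColumn("hashkey",
        md5(concat_ws("|", ds.col("ReferenceValueSet_Name"), ds.col("Code"))));
withKey.printSchema();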

Related

Create a new dataframe (with different schema) from selected information from another dataframe

I have a dataframe where the tags column contains different key->value pairs. I am trying to extract the value entries whose key is "name"; the filtered information should be put into a new dataframe.
The initial df has the following schema:
root
|-- id: long (nullable = true)
|-- type: string (nullable = true)
|-- tags: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- nds: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- ref: long (nullable = true)
|-- members: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- type: string (nullable = true)
| | |-- ref: long (nullable = true)
| | |-- role: string (nullable = true)
|-- visible: boolean (nullable = true)
And I want a newdf of schema:
root
|-- place: string (nullable = true)
|-- num_evacuees string (nullable = true)
How should I do the filter? I tried a lot of methods, attempting at least a plain filter, but every time the result was an empty dataframe. For example:
val newdf = df.filter($"tags"("key") contains "name")
val newdf = df.where(places("tags")("key") === "name")
I tried several more methods, but none of them worked.
How should I write the proper filter?
You can achieve the result you want with:
val df = Seq(
  (1L, Map("sf" -> "100")),
  (2L, Map("ny" -> "200"))
).toDF("id", "tags")

val resultDf = df
  .select(explode(map_filter(col("tags"), (k, _) => k === "ny")))
  .withColumnRenamed("key", "place")
  .withColumnRenamed("value", "num_evacuees")

resultDf.printSchema
resultDf.show
Which will show:
root
|-- place: string (nullable = false)
|-- num_evacuees: string (nullable = true)
+-----+------------+
|place|num_evacuees|
+-----+------------+
| ny| 200|
+-----+------------+
The key idea is to use map_filter to select the entries you want from the map; explode then turns the map into two columns (key and value), which you can rename so the DataFrame matches your specification.
The above example assumes you want a single value, just to demonstrate the idea. The lambda function used by map_filter can be as complex as necessary; its signature, map_filter(expr: Column, f: (Column, Column) => Column): Column, shows that as long as you return a Column it will be happy.
If you wanted to filter a large number of entries you could do something like:
val resultDf = df
  .withColumn("filterList", array(lit("sf"), lit("place_n")))
  .select(explode(map_filter(col("tags"), (k, _) => array_contains(col("filterList"), k))))
The idea is to extract the keys of the map column (tags), then use array_contains to check for a key called "name".
import org.apache.spark.sql.functions._
val newdf = df.filter(array_contains(map_keys($"tags"), "name"))
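Since the surrounding thread is Java-oriented, roughly the same filter can be written with the Java API as well. A sketch, assuming a Dataset<Row> named df with the tags map column and Spark 2.3+ (where map_keys is available):
import static org.apache.spark.sql.functions.array_contains;
import static org.apache.spark.sql.functions.map_keys;

// Keep only the rows whose tags map contains a key called "name".
Dataset<Row> newdf = df.filter(array_contains(map_keys(df.col("tags")), "name"));
newdf.show(false);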

Reading CSV files containing struct type in Spark using Java

I'm trying to write a test case for a program.
For that, I'm reading a CSV file that has data in the following format.
account_number,struct_data
123456789,{"key1":"value","key2":"value2","keyn":"valuen"}
987678909,{"key1":"value0","key2":"value20","keyn":"valuen0"}
There are some hundreds of such rows.
I need to read the second column as a struct, but I'm getting the error
struct type expected, string type found
I tried casting it as StructType, but then I get the error "StringType cannot be converted to StructType".
Should I change the format of my CSV? What else can I do?
My solution is in Scala Spark; it might give some insight into your query.
scala> val sdf = """{"df":[{"actNum": "1234123", "strType": [{"key1": "value1", "key2": "value2"}]}]}"""
sdf: String = {"df":[{"actNum": "1234123", "strType": [{"key1": "value1", "key2": "value2"}]}]}
scala> val erdf = spark.read.json(Seq(sdf).toDS).toDF().withColumn("arr", explode($"df")).select("arr.*")
erdf: org.apache.spark.sql.DataFrame = [actNum: string, strType: array<struct<key1:string,key2:string>>]
scala> erdf.show()
+-------+-----------------+
| actNum| strType|
+-------+-----------------+
|1234123|[[value1,value2]]|
+-------+-----------------+
scala> erdf.printSchema
root
|-- actNum: string (nullable = true)
|-- strType: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key1: string (nullable = true)
| | |-- key2: string (nullable = true)
If all of the JSON records have the same schema, you can define it and use Spark's from_json() function to accomplish your task.
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.StructType
import spark.implicits._  // assumes a SparkSession named spark is in scope (as in spark-shell)

val df = Seq(
  (123456789, "{\"key1\":\"value\",\"key2\":\"value2\",\"keyn\":\"valuen\"}"),
  (987678909, "{\"key1\":\"value0\",\"key2\":\"value20\",\"keyn\":\"valuen0\"}")
).toDF("account_number", "struct_data")

val schema = new StructType()
  .add($"key1".string)
  .add($"key2".string)
  .add($"keyn".string)

val df2 = df.withColumn("st", from_json($"struct_data", schema))
df2.printSchema
df2.show(false)
This snippet results in this output:
root
|-- account_number: integer (nullable = false)
|-- struct_data: string (nullable = true)
|-- st: struct (nullable = true)
| |-- key1: string (nullable = true)
| |-- key2: string (nullable = true)
| |-- keyn: string (nullable = true)
+--------------+---------------------------------------------------+------------------------+
|account_number|struct_data |st |
+--------------+---------------------------------------------------+------------------------+
|123456789 |{"key1":"value","key2":"value2","keyn":"valuen"} |[value,value2,valuen] |
|987678909 |{"key1":"value0","key2":"value20","keyn":"valuen0"}|[value0,value20,valuen0]|
+--------------+---------------------------------------------------+------------------------+
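Since the question itself asks about Java, here is a hedged Java sketch of the same from_json approach (assuming Spark 2.1+ and an existing SparkSession named spark); the sample rows are built in code, as in the Scala answer, instead of being parsed out of the CSV:
import java.util.Arrays;
import java.util.List;
import static org.apache.spark.sql.functions.from_json;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Sample rows standing in for the CSV contents.
StructType inputSchema = new StructType()
        .add("account_number", DataTypes.IntegerType)
        .add("struct_data", DataTypes.StringType);
List<Row> rows = Arrays.asList(
        RowFactory.create(123456789, "{\"key1\":\"value\",\"key2\":\"value2\",\"keyn\":\"valuen\"}"),
        RowFactory.create(987678909, "{\"key1\":\"value0\",\"key2\":\"value20\",\"keyn\":\"valuen0\"}"));
Dataset<Row> df = spark.createDataFrame(rows, inputSchema);

// Define the schema of the JSON held in struct_data and parse it into a struct column.
StructType jsonSchema = new StructType()
        .add("key1", DataTypes.StringType)
        .add("key2", DataTypes.StringType)
        .add("keyn", DataTypes.StringType);
Dataset<Row> parsed = df.withColumn("st", from_json(df.col("struct_data"), jsonSchema));
parsed.printSchema();
parsed.show(false);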

Persist output of window function in db with Spark dataframe in java

When the following snippet executes:
Dataset<Row> ds1 = ds.groupBy(
        functions.window(ds.col("datetime"), windowLength, slidingLength).as("datetime"),
        ds.col("symbol").as("Ticker"))
    .agg(
        functions.mean("volume").as("volume"),
        functions.mean("price").as("Price"),
        functions.first("price").plus(functions.last("price")).divide(value).as("Mid_Point"),
        functions.max("price").as("High"),
        functions.min("price").as("Low"),
        functions.first("price").as("Open"),
        functions.last("price").as("Close"))
    .sort(functions.asc("datetime"));
ds1.printSchema();
Output:
|-- datetime: struct (nullable = true)
| |-- start: timestamp (nullable = true)
| |-- end: timestamp (nullable = true)
|-- Ticker: string (nullable = true)
|-- volume: double (nullable = true)
|-- Price: double (nullable = true)
|-- Mid_Point: double (nullable = true)
|-- High: double (nullable = true)
|-- Low: double (nullable = true)
|-- Open: double (nullable = true)
|-- Close: double (nullable = true)
Now, when I try to save it into a CSV file, I get an error saying the CSV writer is unable to resolve datetime as a timestamp.
Error:
cannot resolve 'CAST(`datetime` AS TIMESTAMP)' due to data type mismatch: cannot cast StructType(StructField(start,TimestampType,true), StructField(end,TimestampType,true)) to TimestampType
Does anyone have any idea about that?
Apply the datetime cast to the column itself rather than to the sliding window:
ds.col("datetime").as("datetime")

Spark Dataframe datatype as String

I'm trying to validate the data types of a DataFrame by running DESCRIBE as an SQL query, but the datetime column always comes back as string.
First, I tried the code below:
SparkSession sparkSession=new SparkSession.Builder().getOrCreate();
Dataset<Row> df=sparkSession.read().option("header","true").option("inferschema","true").format("csv").load("/user/data/*_ecs.csv");
try {
df.createTempView("data");
Dataset<Row> sqlDf=sparkSession.sql("Describe data");
sqlDf.show(300,false);
Output:
+-----------------+---------+-------+
|col_name |data_type|comment|
+-----------------+---------+-------+
|id |int |null |
|symbol |string |null |
|datetime |string |null |
|side |string |null |
|orderQty |int |null |
|price |double |null |
+-----------------+---------+-------+
I also tried a custom schema, but in that case I get an exception when executing any query other than DESCRIBE:
SparkSession sparkSession = new SparkSession.Builder().getOrCreate();
Dataset<Row> df = sparkSession.read()
        .option("header", "true")
        .schema(customeSchema)
        .format("csv")
        .load("/use/data/*_ecs.csv");
try {
df.createTempView("trade_data");
Dataset<Row> sqlDf=sparkSession.sql("Describe trade_data");
sqlDf.show(300,false);
Output:
+--------+---------+-------+
|col_name|data_type|comment|
+--------+---------+-------+
|datetime|timestamp|null |
|price |double |null |
|orderQty|double |null |
+--------+---------+-------+
But if I try any other query, I get the exception below:
Dataset<Row> sqlDf=sparkSession.sql("select DATE(datetime),avg(price),avg(orderQty) from data group by datetime");
java.lang.IllegalArgumentException
at java.sql.Date.valueOf(Date.java:143)
at org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:137)
How can this be solved?
Why is inferSchema not working?
You can find more on this in https://issues.apache.org/jira/browse/SPARK-19228: with the current version of Spark, inferSchema parses date-type columns as String.
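A common workaround (a sketch, not part of the original answer) is to keep inferSchema for the other columns and cast the datetime string explicitly after the CSV is loaded, assuming df is the dataset read in the question and its format is one Spark's cast can parse (e.g. yyyy-MM-dd HH:mm:ss):
import static org.apache.spark.sql.functions.col;

// Cast the inferred string column to a proper timestamp after reading.
Dataset<Row> typed = df.withColumn("datetime", col("datetime").cast("timestamp"));
typed.printSchema();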
If you don't want to submit your own schema, one way would be this:
Dataset<Row> df = sparkSession.read().format("csv").option("header","true").option("inferschema", "true").load("example.csv");
df.printSchema(); // check output - 1
df.createOrReplaceTempView("df");
Dataset<Row> df1 = sparkSession.sql("select * , Date(datetime) as datetime_d from df").drop("datetime");
df1.printSchema(); // check output - 2
====================================
output - 1:
root
|-- id: integer (nullable = true)
|-- symbol: string (nullable = true)
|-- datetime: string (nullable = true)
|-- side: string (nullable = true)
|-- orderQty: integer (nullable = true)
|-- price: double (nullable = true)
output - 2:
root
|-- id: integer (nullable = true)
|-- symbol: string (nullable = true)
|-- side: string (nullable = true)
|-- orderQty: integer (nullable = true)
|-- price: double (nullable = true)
|-- datetime_d: date (nullable = true)
I'd choose this method if the number of fields to cast is not high.
If you want to submit your own schema:
List<org.apache.spark.sql.types.StructField> fields = new ArrayList<>();
fields.add(DataTypes.createStructField("datetime", DataTypes.TimestampType, true));
fields.add(DataTypes.createStructField("price",DataTypes.DoubleType,true));
fields.add(DataTypes.createStructField("orderQty",DataTypes.DoubleType,true));
StructType schema = DataTypes.createStructType(fields);
Dataset<Row> df = sparkSession.read().format("csv").option("header", "true").schema(schema).load("example.csv");
df.printSchema(); // output - 1
df.createOrReplaceTempView("df");
Dataset<Row> df1 = sparkSession.sql("select * , Date(datetime) as datetime_d from df").drop("datetime");
df1.printSchema(); // output - 2
======================================
output - 1:
root
|-- datetime: timestamp (nullable = true)
|-- price: double (nullable = true)
|-- orderQty: double (nullable = true)
output - 2:
root
|-- price: double (nullable = true)
|-- orderQty: double (nullable = true)
|-- datetime_d: date (nullable = true)
Since it again casts the column from timestamp to date, I don't see much use in this method, but I'm leaving it here in case it's useful to you later.

Adding a new column to dataframe in Spark SQL using Java API and JavaRDD<Row>

I am trying to create a new dataframe (in SparkSQL 1.6.2) after applying a mapPartitions function as follows:
FlatMapFunction<Iterator<Row>, Row> mapPartitonstoTTF = rows -> {
    List<Row> mappedRows = new ArrayList<Row>();
    while (rows.hasNext()) {
        Row row = rows.next();
        Row mappedRow = RowFactory.create(row.getDouble(0), row.getString(1), row.getLong(2), row.getDouble(3),
                row.getInt(4), row.getString(5), row.getString(6), row.getInt(7), row.getInt(8), row.getString(9), 0L);
        mappedRows.add(mappedRow);
    }
    return mappedRows;
};
JavaRDD<Row> sensorDataDoubleRDD=oldsensorDataDoubleDF.toJavaRDD().mapPartitions(mapPartitonstoTTF);
StructType oldSchema=oldsensorDataDoubleDF.schema();
StructType newSchema =oldSchema.add("TTF",DataTypes.LongType,false);
System.out.println("The new schema is: ");
newSchema.printTreeString();
System.out.println("The old schema is: ");
oldSchema.printTreeString();
DataFrame sensorDataDoubleDF=hc.createDataFrame(sensorDataDoubleRDD, newSchema);
sensorDataDoubleDF.show();
As seen above, I am adding a new LongType column with a value of 0 to the rows using the RowFactory.create() function.
However, I get an exception at the line running sensorDataDoubleDF.show(), as follows:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 117 in stage 26.0 failed 4 times, most recent failure: Lost task 117.3 in stage 26.0 (TID 3249, AUPER01-01-20-08-0.prod.vroc.com.au): scala.MatchError: 1435766400001 (of class java.lang.Long)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:295)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:294)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401)
at org.apache.spark.sql.SQLContext$$anonfun$6.apply(SQLContext.scala:492)
at org.apache.spark.sql.SQLContext$$anonfun$6.apply(SQLContext.scala:492)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1882)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1882)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
The old schema is
root
|-- data_quality: double (nullable = false)
|-- data_sensor: string (nullable = true)
|-- data_timestamp: long (nullable = false)
|-- data_valueDouble: double (nullable = false)
|-- day: integer (nullable = false)
|-- dpnode: string (nullable = true)
|-- dsnode: string (nullable = true)
|-- month: integer (nullable = false)
|-- year: integer (nullable = false)
|-- nodeid: string (nullable = true)
|-- nodename: string (nullable = true)
The new schema is the same as above, with the addition of a TTF column as LongType:
root
|-- data_quality: double (nullable = false)
|-- data_sensor: string (nullable = true)
|-- data_timestamp: long (nullable = false)
|-- data_valueDouble: double (nullable = false)
|-- day: integer (nullable = false)
|-- dpnode: string (nullable = true)
|-- dsnode: string (nullable = true)
|-- month: integer (nullable = false)
|-- year: integer (nullable = false)
|-- nodeid: string (nullable = true)
|-- nodename: string (nullable = true)
|-- TTF: long (nullable = false)
I appreciate any help figuring out where I am making a mistake.
You have 11 columns in the old schema but you are mapping only 10. Add row.getString(10) to the RowFactory.create call:
Row mappedRow= RowFactory.create(row.getDouble(0),row.getString(1),row.getLong(2),row.getDouble(3),row.getInt(4),row.getString(5),
row.getString(6),row.getInt(7),row.getInt(8),row.getString(9),row.getString(10),0L);
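If all you need is a constant placeholder value, as in the snippet above, a simpler sketch for Spark 1.6 that avoids the RDD round-trip and the manual schema handling would be:
// Append a constant long column directly on the DataFrame.
DataFrame sensorDataDoubleDF = oldsensorDataDoubleDF.withColumn("TTF", functions.lit(0L));
sensorDataDoubleDF.show();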
