Create new dataframe from selected information from another dataframe - java

I have a dataframe with the following schema:
root
|-- id: long (nullable = true)
|-- type: string (nullable = true)
|-- tags: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- lat: Long (nullable = true)
|-- lon: Long (nullable = true)
|-- nds: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- ref: long (nullable = true)
|-- members: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- type: string (nullable = true)
| | |-- ref: long (nullable = true)
| | |-- role: string (nullable = true)
I want to create a new dataframe res where I select specific data from the column tags. I need the values from key=place and key=population. The new dataframe should have the following schema:
val schema = StructType(
Array(
StructField("place", StringType),
StructField("population", LongType)
)
)
I have literally no idea how to do this. I tried to replicate the first dataframe and then select the columns, but that didn't work.
Does anyone have a solution?

You can directly apply the desired key to the map-typed column to extract its value, then cast and rename the columns as follows:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.LongType
val result = dataframe.select(
col("tags")("place").as("place"),
col("tags")("population").cast(LongType).as("population")
)
With the following tags column:
+------------------------------------------------+
|tags |
+------------------------------------------------+
|{place -> A, population -> 32, another_key -> X}|
|{place -> B, population -> 64, another_key -> Y}|
+------------------------------------------------+
You get the following result:
+-----+----------+
|place|population|
+-----+----------+
|A |32 |
|B |64 |
+-----+----------+
Having the following schema:
root
|-- place: string (nullable = true)
|-- population: long (nullable = true)
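If you also want to drop the rows where tags has neither place nor population, a small follow-up sketch on the result value defined above (na.drop("all", ...) keeps a row as long as at least one of the listed columns is non-null):
// Keep only rows where at least one of the two extracted columns is non-null
val nonEmpty = result.na.drop("all", Seq("place", "population"))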

Let's call your original dataframe df. You can extract the information you want like this:
import org.apache.spark.sql.functions.col
val data = df
.select("tags")
.where(
df("tags")("key") isin (List("place", "population"): _*)
)
.select(
col("tags")("value")
)
.collect()
.toList
This will give you a List[Row], which can be converted to another dataframe with your schema:
import org.apache.spark.sql.Row
import scala.collection.JavaConversions.seqAsJavaList
sparkSession.createDataFrame(seqAsJavaList[Row](data), schema)

Given the following simplified input:
val df = Seq(
(1L, Map("place" -> "home", "population" -> "1", "name" -> "foo")),
(2L, Map("place" -> "home", "population" -> "4", "name" -> "foo")),
(3L, Map("population" -> "3")),
(4L, Map.empty[String, String])
).toDF("id", "tags")
You can select the values using map_filter to restrict the map to only the key you want, then call map_values to get those entries. map_values returns an array, so you need to use explode_outer to flatten the data. We use explode_outer here because some entries may have neither place nor population, or only one of the two. Once the data is in a form we can easily work with, we just select the fields we want in the desired structure.
I've left the id column in so when you run the example you can see that we don't drop entries with missing data.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.LongType
val r = df.select(
col("id"),
explode_outer(map_values(map_filter(col("tags"), (k,_) => k === "place"))) as "place",
map_values(map_filter(col("tags"), (k,_) => k === "population")) as "population"
).withColumn("population", explode_outer(col("population")))
.select(
col("id"),
array(
struct(
col("place"),
col("population") cast LongType as "population"
) as "place_and_population"
) as "data"
)
Gives:
root
|-- id: long (nullable = false)
|-- data: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- place: string (nullable = true)
| | |-- population: long (nullable = true)
+---+--------------+
| id| data|
+---+--------------+
| 1| [{home, 1}]|
| 2| [{home, 4}]|
| 3| [{null, 3}]|
| 4|[{null, null}]|
+---+--------------+

Related

Create a new dataframe (with different schema) from selected information from another dataframe

I have a dataframe where the tags column contains different key->value pairs. I am trying to filter out the value information where key=name. The filtered information should be put in a new dataframe.
The initial df has the following schema:
root
|-- id: long (nullable = true)
|-- type: string (nullable = true)
|-- tags: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- nds: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- ref: long (nullable = true)
|-- members: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- type: string (nullable = true)
| | |-- ref: long (nullable = true)
| | |-- role: string (nullable = true)
|-- visible: boolean (nullable = true)
And I want a newdf with the schema:
root
|-- place: string (nullable = true)
|-- num_evacuees: string (nullable = true)
How should I do the filter? I tried a lot of methods, starting with at least a normal filter, but every time the result of the filter is an empty dataframe. For example:
val newdf = df.filter($"tags"("key") contains "name")
val newdf = df.where(places("tags")("key") === "name")
I tried a lot more methods, but none of them has worked.
How should I do the filter properly?
You can achieve the result you want with:
import org.apache.spark.sql.functions._
val df = Seq(
(1L, Map("sf" -> "100")),
(2L, Map("ny" -> "200"))
).toDF("id", "tags")
val resultDf = df
.select(explode(map_filter(col("tags"), (k, _) => k === "ny")))
.withColumnRenamed("key", "place")
.withColumnRenamed("value", "num_evacuees")
resultDf.printSchema
resultDf.show
Which will show:
root
|-- place: string (nullable = false)
|-- num_evacuees: string (nullable = true)
+-----+------------+
|place|num_evacuees|
+-----+------------+
| ny| 200|
+-----+------------+
The key idea is to use map_filter to select the fields you want from the map, then explode turns the map into two columns (key and value), which you can then rename to make the DataFrame match your specification.
The above example assumes you want to get a single value to demonstrate the idea. The lambda function used by map_filter can be as complex as necessary. Its signature map_filter(expr: Column, f: (Column, Column) => Column): Column shows that as long as you return a Column it will be happy.
If you wanted to filter a large number of entries you could do something like:
val resultDf = df
.withColumn("filterList", array(lit("sf"), lit("place_n")))
.select(explode(map_filter(col("tags"), (k, _) => array_contains(col("filterList"), k))))
The idea is to extract the keys of the map column (tags), then use array_contains to check for a key called "name".
import org.apache.spark.sql.functions._
val newdf = df.filter(array_contains(map_keys($"tags"), "name"))

Reading CSV files contains struct type in Spark using Java

I'm trying to write a test case for a program.
For that, I'm reading a CSV file that has data in the following format.
account_number,struct_data
123456789,{"key1":"value","key2":"value2","keyn":"valuen"}
987678909,{"key1":"value0","key2":"value20","keyn":"valuen0"}
There are some hundreds of such rows.
I need to read the second column as a struct. But I'm getting the error
struct type expected, string type found
I tried casting it to StructType, but then I get the error "StringType cannot be converted to StructType".
Should I change the way my CSV is? What else can I do?
I gave my solution in Scala Spark; it might give some insight into your query:
scala> val sdf = """{"df":[{"actNum": "1234123", "strType": [{"key1": "value1", "key2": "value2"}]}]}"""
sdf: String = {"df":[{"actNum": "1234123", "strType": [{"key1": "value1", "key2": "value2"}]}]}
scala> val erdf = spark.read.json(Seq(sdf).toDS).toDF().withColumn("arr", explode($"df")).select("arr.*")
erdf: org.apache.spark.sql.DataFrame = [actNum: string, strType: array<struct<key1:string,key2:string>>]
scala> erdf.show()
+-------+-----------------+
| actNum| strType|
+-------+-----------------+
|1234123|[[value1,value2]]|
+-------+-----------------+
scala> erdf.printSchema
root
|-- actNum: string (nullable = true)
|-- strType: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key1: string (nullable = true)
| | |-- key2: string (nullable = true)
If all of the JSON records have the same schema, you can define it and use Spark's from_json() function to accomplish your task.
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.StructType
val df = Seq(
(123456789, "{\"key1\":\"value\",\"key2\":\"value2\",\"keyn\":\"valuen\"}"),
(987678909, "{\"key1\":\"value0\",\"key2\":\"value20\",\"keyn\":\"valuen0\"}")
).toDF("account_number", "struct_data")
val schema = new StructType()
.add($"key1".string)
.add($"key2".string)
.add($"keyn".string)
val df2 = df.withColumn("st", from_json($"struct_data", schema))
df2.printSchema
df2.show(false)
This snippet results in this output:
root
|-- account_number: integer (nullable = false)
|-- struct_data: string (nullable = true)
|-- st: struct (nullable = true)
| |-- key1: string (nullable = true)
| |-- key2: string (nullable = true)
| |-- keyn: string (nullable = true)
+--------------+---------------------------------------------------+------------------------+
|account_number|struct_data |st |
+--------------+---------------------------------------------------+------------------------+
|123456789 |{"key1":"value","key2":"value2","keyn":"valuen"} |[value,value2,valuen] |
|987678909 |{"key1":"value0","key2":"value20","keyn":"valuen0"}|[value0,value20,valuen0]|
+--------------+---------------------------------------------------+------------------------+
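If you then want the parsed fields as top-level columns (for example, to assert on them in a test), a small follow-up on the df2 above:
// Flatten the parsed struct into individual columns
val flat = df2.select("account_number", "st.*")
flat.printSchema
// root
//  |-- account_number: integer (nullable = false)
//  |-- key1: string (nullable = true)
//  |-- key2: string (nullable = true)
//  |-- keyn: string (nullable = true)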

How to create a Dataframe from existing Dataframe and make specific fields as Struct type?

I need to create a DataFrame from an existing DataFrame, changing the schema as well.
I have a DataFrame like:
+-----+--------+----------+
|Id   |Position|playerName|
+-----+--------+----------+
|10125|Forward |Messi     |
|10126|Forward |Ronaldo   |
|10127|Midfield|Xavi      |
|10128|Midfield|Neymar    |
+-----+--------+----------+
and I created this using the case class given below:
case class caseClass (
Id: Int = 0,
Position : String = "" ,
playerName : String = ""
)
Now I need to put both playerName and Position under a struct type, i.e., I need to create another DataFrame with the schema:
root
|-- Id: int (nullable = true)
|-- playerDetails: struct (nullable = true)
| |-- playername: string (nullable = true)
| |-- Position: string (nullable = true)
I wrote the following code to create a new dataframe, referring to the link
https://medium.com/#mrpowers/adding-structtype-columns-to-spark-dataframes-b44125409803
myschema was:
List(
StructField("Id", IntegerType, true),
StructField("Position",StringType, true),
StructField("playerName", StringType,true)
)
I tried the following code:
val newDF = spark.createDataFrame(
spark.sparkContext.parallelize(data),
myschema
)
but I can't make it happen.
I saw the similar question Change schema of existing dataframe, but I can't understand the solution.
Is there any solution for directly implementing StructType inside the case class, so that I don't need to make my own schema for creating struct type values?
Function "struct" can be used:
// data
val playersDF = Seq(
(10125, "Forward", "Messi"),
(10126, "Forward", "Ronaldo"),
(10127, "Midfield", "Xavi"),
(10128, "Midfield", "Neymar")
).toDF("Id", "Position", "playerName")
// action
val playersStructuredDF = playersDF.select($"Id", struct("playerName", "Position").as("playerDetails"))
// display
playersStructuredDF.printSchema()
playersStructuredDF.show(false)
Output:
root
|-- Id: integer (nullable = false)
|-- playerDetails: struct (nullable = false)
| |-- playerName: string (nullable = true)
| |-- Position: string (nullable = true)
+-----+------------------+
|Id |playerDetails |
+-----+------------------+
|10125|[Messi, Forward] |
|10126|[Ronaldo, Forward]|
|10127|[Xavi, Midfield] |
|10128|[Neymar, Midfield]|
+-----+------------------+
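If you would rather express the nested schema with case classes, as the question asks, a rough sketch is shown below; the names Player, PlayerDetails and PlayerRecord are made up for the example and are not from the original post:
import spark.implicits._
// Nested case classes: the inner one becomes a struct column in the schema
case class PlayerDetails(playerName: String, Position: String)
case class Player(Id: Int, Position: String, playerName: String)
case class PlayerRecord(Id: Int, playerDetails: PlayerDetails)
val structuredDS = playersDF.as[Player]
.map(p => PlayerRecord(p.Id, PlayerDetails(p.playerName, p.Position)))
structuredDS.printSchema() // Id plus a playerDetails struct of playerName and Position
Note that in compiled code (as opposed to the spark-shell) the case classes should be defined at the top level so that Spark can derive encoders for them.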

Save table in hive with java spark sql from json array

Dataset<Row> ds = spark.read().option("multiLine", true).option("mode", "PERMISSIVE").json("/user/administrador/prueba_diario.txt").toDF();
ds.printSchema();
Dataset<Row> ds2 = ds.select("articles").toDF();
ds2.printSchema();
spark.sql("drop table if exists table1");
ds2.write().saveAsTable("table1");
I have this JSON format:
root
|-- articles: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- author: string (nullable = true)
| | |-- content: string (nullable = true)
| | |-- description: string (nullable = true)
| | |-- publishedAt: string (nullable = true)
| | |-- source: struct (nullable = true)
| | | |-- id: string (nullable = true)
| | | |-- name: string (nullable = true)
| | |-- title: string (nullable = true)
| | |-- url: string (nullable = true)
| | |-- urlToImage: string (nullable = true)
|-- status: string (nullable = true)
|-- totalResults: long (nullable = true)
I want to save the articles array as a Hive table with each of the array element's fields as a column.
Example of the Hive table that I want:
author (string)
content (string)
description (string)
publishedat (string)
source (struct<id:string,name:string>)
title (string)
url (string)
urltoimage (string)
The problem is that it is saving the table with just one column named articles, and all the content is inside this single column.
A bit convoluted, but I found this one to work:
import org.apache.spark.sql.functions._
ds.select(explode(col("articles")).as("exploded")).select("exploded.*").toDF()
I tested it on
{
"articles": [
{
"author": "J.K. Rowling",
"title": "Harry Potter and the goblet of fire"
},
{
"author": "George Orwell",
"title": "1984"
}
]
}
and it returned (after collecting it into an array)
result = {Arrays$ArrayList#13423} size = 2
0 = {GenericRowWithSchema#13425} "[J.K. Rowling,Harry Potter and the goblet of fire]"
1 = {GenericRowWithSchema#13426} "[George Orwell,1984]"
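To actually write that flattened result to Hive, which is what the question asks for, a possible follow-up (shown in Scala; the equivalent Java API calls are the same, and this assumes the SparkSession was built with Hive support enabled):
import org.apache.spark.sql.functions.{col, explode}
// One row per article, one column per struct field (author, content, description, ...)
val flattened = ds.select(explode(col("articles")).as("exploded")).select("exploded.*")
spark.sql("drop table if exists table1")
flattened.write.saveAsTable("table1")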

I am using Apache Spark to parse JSON files. How do I get a nested key from JSON files, whether it's an array or a nested key?

I have multiple JSON files which hold JSON data. The JSON structure looks like this:
{
"Name":"Vipin Suman",
"Email":"vpn2330#gmail.com",
"Designation":"Trainee Programmer",
"Age":22 ,
"location":
{"City":
{
"Pin":324009,
"City Name":"Ahmedabad"
},
"State":"Gujarat"
},
"Company":
{
"Company Name":"Elegant",
"Domain":"Java"
},
"Test":["Test1","Test2"]
}
I tried this
String jsonFilePath = "/home/vipin/workspace/Smarten/jsonParsing/Employee/Employee-03.json";
String[] jsonFiles = jsonFilePath.split(",");
Dataset<Row> people = sparkSession.read().json(jsonFiles);
The schema I am getting for this is:
root
|-- Age: long (nullable = true)
|-- Company: struct (nullable = true)
| |-- Company Name: string (nullable = true)
| |-- Domain: string (nullable = true)
|-- Designation: string (nullable = true)
|-- Email: string (nullable = true)
|-- Name: string (nullable = true)
|-- Test: array (nullable = true)
| |-- element: string (containsNull = true)
|-- location: struct (nullable = true)
| |-- City: struct (nullable = true)
| | |-- City Name: string (nullable = true)
| | |-- Pin: long (nullable = true)
| |-- State: string (nullable = true)
I am getting this view of the table:
+---+--------------+------------------+-----------------+-----------+--------------+--------------------+
|Age| Company| Designation| Email| Name| Test| location|
+---+--------------+------------------+-----------------+-----------+--------------+--------------------+
| 22|[Elegant,Java]|Trainee Programmer|vpn2330#gmail.com|Vipin Suman|[Test1, Test2]|[[Ahmedabad,32400...|
+---+--------------+------------------+-----------------+-----------+--------------+--------------------+
I want the result as:
Age | Company Name | Domain| Designation | Email | Name | Test | City Name | Pin | State |
22 | Elegant MicroWeb | Java | Programmer | vpn2330#gmail.com | Vipin Suman | Test1 | Ahmedabad | 324009 | Gujarat
22 | Elegant MicroWeb | Java | Programmer | vpn2330#gmail.com | Vipin Suman | Test2 | Ahmedabad | 324009 |
How can I get the table in the above format? I tried out everything. I am new to Apache Spark; can anyone help me out?
I suggest you do your work in Scala, which is better supported by Spark. You can use the "select" API to select specific columns, use alias to rename a column, and refer to this page on how to select complex data formats (https://databricks.com/blog/2017/02/23/working-complex-data-formats-structured-streaming-apache-spark-2-1.html).
Based on your desired result, you also need to use the "explode" API (see Flattening Rows in Spark).
In Scala it could be done like this:
people.select(
$"Age",
$"Company.*",
$"Designation",
$"Email",
$"Name",
explode($"Test"),
$"location.City.*",
$"location.State")
Unfortunately, the following code in Java would fail:
people.select(
people.col("Age"),
people.col("Company.*"),
people.col("Designation"),
people.col("Email"),
people.col("Name"),
explode(people.col("Test")),
people.col("location.City.*"),
people.col("location.State"));
You can use selectExpr instead though:
people.selectExpr(
"Age",
"Company.*",
"Designation",
"Email",
"Name",
"EXPLODE(Test) AS Test",
"location.City.*",
"location.State");
PS:
You can pass the path to a directory (or directories) instead of the list of JSON files in sparkSession.read().json(jsonFiles);.
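For example, a minimal sketch using the directory from the question's path:
// Reads every JSON file found under the directory in a single call
val people = sparkSession.read.json("/home/vipin/workspace/Smarten/jsonParsing/Employee/")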
