I have two DataFrames (Dataset<Row>) with the same columns, but the fields of the structs inside the array column are in a different order.
df1:
root
|-- root: string (nullable = false)
|-- array_nested: array (nullable = false)
| |-- element: struct (containsNull = true)
| | |-- array_id: integer (nullable = false)
| | |-- array_value: string (nullable = false)
+----+------------+
|root|array_nested|
+----+------------+
|One |[[1, 1-One]]|
+----+------------+
df2:
root
|-- root: string (nullable = false)
|-- array_nested: array (nullable = false)
| |-- element: struct (containsNull = true)
| | |-- array_value: string (nullable = false)
| | |-- array_id: integer (nullable = false)
+----+------------+
|root|array_nested|
+----+------------+
|Two |[[2-Two, 2]]|
+----+------------+
I want to make the schemas the same, but my approach generates an extra layer of array:
List<Column> updatedStructNames = new ArrayList<>();
updatedStructNames.add(col("array_nested.array_id"));
updatedStructNames.add(col("array_nested.array_value"));
Column[] updatedStructNameArray = updatedStructNames.toArray(new Column[0]);
Dataset<Row> df3 = df2.withColumn("array_nested", array(struct(updatedStructNameArray)));
It generates a schema like this:
root
|-- root: string (nullable = false)
|-- array_nested: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- array_id: array (nullable = false)
| | | |-- element: integer (containsNull = true)
| | |-- array_value: array (nullable = false)
| | | |-- element: string (containsNull = true)
+----+----------------+
|root|array_nested |
+----+----------------+
|Two |[[[2], [2-Two]]]|
+----+----------------+
How can I achieve the same schema?
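Selecting array_nested.array_id on an array of structs returns an array of all the ids, which is why your approach nests an extra array. You can use the transform function (a higher-order function available in Spark SQL since 2.4) to rebuild each struct element of the array_nested column with its fields in the desired order: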
Dataset<Row> df3 = df2.withColumn(
    "array_nested",
    expr("transform(array_nested, x -> struct(x.array_id as array_id, x.array_value as array_value))")
);
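If the struct has many fields, the expression string can be generated instead of typed by hand. The following is only a sketch in plain Java (no Spark dependency): the class name TransformExprBuilder and the method reorderStructFields are hypothetical helpers, and it assumes simple field names that need no backtick quoting. The resulting string is what you would pass to expr(...).

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class TransformExprBuilder {
    // Builds a Spark SQL expression string that rebuilds every struct in an
    // array column with its fields in the given order.
    // Assumption: field names are plain identifiers (no quoting needed).
    public static String reorderStructFields(String arrayColumn, List<String> orderedFields) {
        String fields = orderedFields.stream()
                .map(f -> "x." + f + " as " + f)
                .collect(Collectors.joining(", "));
        return "transform(" + arrayColumn + ", x -> struct(" + fields + "))";
    }

    public static void main(String[] args) {
        // Prints the same expression used in the answer above.
        System.out.println(
                reorderStructFields("array_nested", Arrays.asList("array_id", "array_value")));
    }
}
```

With the field order taken from df1, the same string can be applied to df2 so that both schemas line up before a union.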
Scenario:
I read two XML files, specifying a schema on load.
In the schema, one of the tags is mandatory. One of the XML files is missing that mandatory tag.
Now, when I do the following, I expect the row from the XML with the missing mandatory tag to be filtered out.
dataset = dataset.filter(functions.col("mandatoryColumnNameInSchema").isNotNull());
When I count the rows of the dataset in the code, I get a count of 2 (the 2 input XMLs), but when I try to print the dataset via the show() method, I get a NullPointerException (NPE).
When I debugged the line above and ran the following, I got 0 as the count.
dataset.filter(functions.col("mandatoryColumnNameInSchema").isNotNull()).count();
Question:
Can anyone please answer the questions below and confirm my understanding?
1. Why is the Spark Dataset not filtering out the row that does not have the mandatory column?
2. Why is there no exception from count() but there is one from show()?
For 2, I believe count() just counts the number of rows without looking into their contents.
show(), on the other hand, iterates through the struct fields to print their values, and errors out when it does not find the mandatory column.
P.S. If I make the mandatory column optional, everything works fine.
Edit:
Providing the read options, as requested.
To load the data I am executing the following:
Dataset<Row> dataset = sparkSession.read().format("com.databricks.spark.xml")
        .option("header", "true")
        .option("inferSchema", "false")
        .option("rowTag", rowTag) // rowTag is the "body" tag in the XML
        .option("failFast", "true")
        .option("mode", "FAILFAST")
        .schema(schema)
        .load(XMLfilePath);
Providing samples as requested
Schema:
root
|-- old: struct (nullable = true)
| |-- _beyond: string (nullable = true)
| |-- lot: struct (nullable = true)
| | |-- _VALUE: string (nullable = true)
| | |-- _chose: string (nullable = true)
| |-- real: struct (nullable = true)
| | |-- _eat: string (nullable = true)
| | |-- kill: struct (nullable = true)
| | | |-- _VALUE: double (nullable = true)
| | | |-- _top: string (nullable = true)
| | |-- tool: struct (nullable = true)
| | | |-- _VALUE: string (nullable = true)
| | | |-- _affect: string (nullable = true)
|-- porch: struct (nullable = true)
| |-- _account: string (nullable = true)
| |-- cast: string (nullable = true)
| |-- vegetable: struct (nullable = true)
| | |-- leg: struct (nullable = true)
| | | |-- _VALUE: double (nullable = true)
| | | |-- _nose: string (nullable = true)
| | |-- now: struct (nullable = true)
| | | |-- _gravity: string (nullable = true)
| | | |-- chief: struct (nullable = true)
| | | | |-- _VALUE: long (nullable = true)
| | | | |-- _further: string (nullable = true)
| | | |-- field: string (nullable = true)
Sample XML:
<?xml version="1.0" encoding="UTF-8" ?>
<root>
  <body>
    <porch account="something">
      <vegetable>
        <now gravity="wide">
          <field>box</field>
          <chief further="satisfied">-1889614487</chief>
        </now>
        <leg nose="angle">912658017.229279</leg>
      </vegetable>
      <cast>clear</cast>
    </porch>
    <old beyond="continent">
      <real eat="term">
        <kill top="plates">-1623084908.8669372</kill>
        <tool affect="pond">today</tool>
      </real>
      <lot chose="swung">promised</lot>
    </old>
  </body>
</root>
Schema in JSON format:
{"type":"struct","fields":[{"name":"old","type":{"type":"struct","fields":[{"name":"_beyond","type":"string","nullable":true,"metadata":{}},{"name":"lot","type":{"type":"struct","fields":[{"name":"_VALUE","type":"string","nullable":true,"metadata":{}},{"name":"_chose","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"real","type":{"type":"struct","fields":[{"name":"_eat","type":"string","nullable":true,"metadata":{}},{"name":"kill","type":{"type":"struct","fields":[{"name":"_VALUE","type":"double","nullable":true,"metadata":{}},{"name":"_top","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"tool","type":{"type":"struct","fields":[{"name":"_VALUE","type":"string","nullable":true,"metadata":{}},{"name":"_affect","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"porch","type":{"type":"struct","fields":[{"name":"_account","type":"string","nullable":true,"metadata":{}},{"name":"cast","type":"string","nullable":true,"metadata":{}},{"name":"vegetable","type":{"type":"struct","fields":[{"name":"leg","type":{"type":"struct","fields":[{"name":"_VALUE","type":"double","nullable":true,"metadata":{}},{"name":"_nose","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"now","type":{"type":"struct","fields":[{"name":"_gravity","type":"string","nullable":true,"metadata":{}},{"name":"chief","type":{"type":"struct","fields":[{"name":"_VALUE","type":"long","nullable":true,"metadata":{}},{"name":"_further","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"field","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}}]}
My scenario can be reproduced by making the element "old" non-nullable (nullable = false) in the schema and removing the corresponding tag from the XML.
Dataset<Row> ds = spark.read().option("multiLine", true).option("mode", "PERMISSIVE").json("/user/administrador/prueba_diario.txt").toDF();
ds.printSchema();
Dataset<Row> ds2 = ds.select("articles").toDF();
ds2.printSchema();
spark.sql("drop table if exists table1");
ds2.write().saveAsTable("table1");
I have JSON with this schema:
root
|-- articles: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- author: string (nullable = true)
| | |-- content: string (nullable = true)
| | |-- description: string (nullable = true)
| | |-- publishedAt: string (nullable = true)
| | |-- source: struct (nullable = true)
| | | |-- id: string (nullable = true)
| | | |-- name: string (nullable = true)
| | |-- title: string (nullable = true)
| | |-- url: string (nullable = true)
| | |-- urlToImage: string (nullable = true)
|-- status: string (nullable = true)
|-- totalResults: long (nullable = true)
I want to save the articles array as a Hive table, with one column per field of the array elements.
Example of the Hive table that I want:
author (string)
content (string)
description (string)
publishedat (string)
source (struct<id:string,name:string>)
title (string)
url (string)
urltoimage (string)
The problem is that the table is saved with just one column named articles, and all the content ends up inside that single column.
A bit convoluted, but I found this one to work:
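import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.explode;
// explode() yields one row per element of "articles"; "exploded.*" promotes the struct fields to columns
Dataset<Row> ds2 = ds.select(explode(col("articles")).as("exploded")).select("exploded.*");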
I tested it on
{
  "articles": [
    {
      "author": "J.K. Rowling",
      "title": "Harry Potter and the goblet of fire"
    },
    {
      "author": "George Orwell",
      "title": "1984"
    }
  ]
}
and it returned (after collecting it into an array)
result = {Arrays$ArrayList#13423} size = 2
0 = {GenericRowWithSchema#13425} "[J.K. Rowling,Harry Potter and the goblet of fire]"
1 = {GenericRowWithSchema#13426} "[George Orwell,1984]"
I have multiple JSON files that contain JSON data. The JSON structure looks like this:
{
  "Name": "Vipin Suman",
  "Email": "vpn2330#gmail.com",
  "Designation": "Trainee Programmer",
  "Age": 22,
  "location": {
    "City": {
      "Pin": 324009,
      "City Name": "Ahmedabad"
    },
    "State": "Gujarat"
  },
  "Company": {
    "Company Name": "Elegant",
    "Domain": "Java"
  },
  "Test": ["Test1", "Test2"]
}
I tried this:
String jsonFilePath = "/home/vipin/workspace/Smarten/jsonParsing/Employee/Employee-03.json";
String[] jsonFiles = jsonFilePath.split(",");
Dataset<Row> people = sparkSession.read().json(jsonFiles);
The schema I get for this is:
root
|-- Age: long (nullable = true)
|-- Company: struct (nullable = true)
| |-- Company Name: string (nullable = true)
| |-- Domain: string (nullable = true)
|-- Designation: string (nullable = true)
|-- Email: string (nullable = true)
|-- Name: string (nullable = true)
|-- Test: array (nullable = true)
| |-- element: string (containsNull = true)
|-- location: struct (nullable = true)
| |-- City: struct (nullable = true)
| | |-- City Name: string (nullable = true)
| | |-- Pin: long (nullable = true)
| |-- State: string (nullable = true)
The resulting table looks like this:
+---+--------------+------------------+-----------------+-----------+--------------+--------------------+
|Age| Company| Designation| Email| Name| Test| location|
+---+--------------+------------------+-----------------+-----------+--------------+--------------------+
| 22|[Elegant,Java]|Trainee Programmer|vpn2330#gmail.com|Vipin Suman|[Test1, Test2]|[[Ahmedabad,32400...|
+---+--------------+------------------+-----------------+-----------+--------------+--------------------+
I want the result as:
Age | Company Name     | Domain | Designation | Email             | Name        | Test  | City Name | Pin    | State
22  | Elegant MicroWeb | Java   | Programmer  | vpn2330#gmail.com | Vipin Suman | Test1 | Ahmedabad | 324009 | Gujarat
22  | Elegant MicroWeb | Java   | Programmer  | vpn2330#gmail.com | Vipin Suman | Test2 | Ahmedabad | 324009 |
How can I get the table in the above format? I have tried everything. I am new to Apache Spark; can anyone help me out?
I suggest you do this work in Scala, which is better supported by Spark. You can use the "select" API to select specific columns and alias to rename a column. For how to select from complex data formats, see https://databricks.com/blog/2017/02/23/working-complex-data-formats-structured-streaming-apache-spark-2-1.html
Based on your desired result, you also need the "explode" API (see Flattening Rows in Spark).
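To see why explode produces the two rows in the desired output, here is a plain-Java model of its row semantics. It is not Spark code: the class ExplodeModel and its explode method are hypothetical, and only three of the columns are modelled to keep it short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ExplodeModel {
    // Plain-Java model (not Spark code) of what explode does to one input row:
    // one output row per array element, with the other column values repeated.
    public static List<String> explode(String name, String state, List<String> tests) {
        List<String> rows = new ArrayList<>();
        for (String test : tests) {
            rows.add(name + " | " + test + " | " + state);
        }
        // An empty array yields no rows at all, as with Spark's explode.
        return rows;
    }

    public static void main(String[] args) {
        explode("Vipin Suman", "Gujarat", Arrays.asList("Test1", "Test2"))
                .forEach(System.out::println);
    }
}
```

Because rows with an empty array disappear, it is worth checking whether every record actually has a non-empty Test array before relying on the exploded result.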
In Scala it could be done like this:
people.select(
  $"Age",
  $"Company.*",
  $"Designation",
  $"Email",
  $"Name",
  explode($"Test"),
  $"location.City.*",
  $"location.State")
Unfortunately, the following code in Java would fail:
people.select(
    people.col("Age"),
    people.col("Company.*"),
    people.col("Designation"),
    people.col("Email"),
    people.col("Name"),
    explode(people.col("Test")),
    people.col("location.City.*"),
    people.col("location.State"));
You can use selectExpr instead though:
people.selectExpr(
    "Age",
    "Company.*",
    "Designation",
    "Email",
    "Name",
    "EXPLODE(Test) AS Test",
    "location.City.*",
    "location.State");
PS:
You can pass the path to a directory (or to several directories) instead of a list of JSON files to sparkSession.read().json(jsonFiles).
I have this nested schema in a JavaSchemaRDD:
root
|-- ProductInfo: struct (nullable = true)
| |-- Features: string (nullable = true)
| |-- ImgURL: string (nullable = true)
| |-- Name: string (nullable = true)
| |-- Price: string (nullable = true)
| |-- ProductID: string (nullable = true)
|-- Reviews: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- Author: string (nullable = true)
| | |-- Content: string (nullable = true)
| | |-- Date: string (nullable = true)
| | |-- Overall: string (nullable = true)
| | |-- ReviewID: string (nullable = true)
| | |-- Title: string (nullable = true)
|-- _corrupt_record: string (nullable = true)
I would like to select the Name of the product based on the Overall rating of the product.
I wrote the following:
JavaSchemaRDD variable = sqlContext.sql("SELECT ProductInfo.Name FROM Table" + "WHERE Reviews.element.Overall=5.0" + "ORDER BY c");
But it seems there is a mistake.
What would be the right one?
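One thing that can be checked without Spark is the query string itself. The sketch below (the class QueryConcatDemo and its methods are hypothetical) shows that the concatenation as written has no spaces between the clauses, and one way to join them safely. It only illustrates the string problem. Whether the filter on the nested Reviews array and the ORDER BY column are valid is a separate question that it does not address.

```java
class QueryConcatDemo {
    // The concatenation exactly as written in the question.
    public static String asWritten() {
        return "SELECT ProductInfo.Name FROM Table" + "WHERE Reviews.element.Overall=5.0" + "ORDER BY c";
    }

    // Joins clauses with a single space so that keywords do not run together.
    public static String joinClauses(String... clauses) {
        return String.join(" ", clauses);
    }

    public static void main(String[] args) {
        // "Table" and "WHERE" are fused, as are "5.0" and "ORDER".
        System.out.println(asWritten());
        System.out.println(joinClauses(
                "SELECT ProductInfo.Name FROM Table",
                "WHERE Reviews.element.Overall=5.0",
                "ORDER BY c"));
    }
}
```

Printing the final string before passing it to sqlContext.sql(...) is a quick way to spot this kind of problem.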