Dataset<Row> ds = spark.read().option("multiLine", true).option("mode", "PERMISSIVE").json("/user/administrador/prueba_diario.txt").toDF();
ds.printSchema();
Dataset<Row> ds2 = ds.select("articles").toDF();
ds2.printSchema();
spark.sql("drop table if exists table1");
ds2.write().saveAsTable("table1");
I have JSON with this schema:
root
|-- articles: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- author: string (nullable = true)
| | |-- content: string (nullable = true)
| | |-- description: string (nullable = true)
| | |-- publishedAt: string (nullable = true)
| | |-- source: struct (nullable = true)
| | | |-- id: string (nullable = true)
| | | |-- name: string (nullable = true)
| | |-- title: string (nullable = true)
| | |-- url: string (nullable = true)
| | |-- urlToImage: string (nullable = true)
|-- status: string (nullable = true)
|-- totalResults: long (nullable = true)
I want to save the articles array as a Hive table, with one column per field of the array's elements.
Example of the Hive table that I want:
author (string)
content (string)
description (string)
publishedat (string)
source (struct<id:string,name:string>)
title (string)
url (string)
urltoimage (string)
The problem is that the table is saved with a single column named articles, and all the content ends up inside that one column.
A bit convoluted, but I found this one to work:
import static org.apache.spark.sql.functions.*;
ds.select(explode(col("articles")).as("exploded")).select("exploded.*").toDF()
I tested it on
{
"articles": [
{
"author": "J.K. Rowling",
"title": "Harry Potter and the goblet of fire"
},
{
"author": "George Orwell",
"title": "1984"
}
]
}
and it returned (after collecting it into an array)
result = {Arrays$ArrayList#13423} size = 2
0 = {GenericRowWithSchema#13425} "[J.K. Rowling,Harry Potter and the goblet of fire]"
1 = {GenericRowWithSchema#13426} "[George Orwell,1984]"
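To see why the two-step select gives one column per field, here is a minimal plain-Java sketch of what explode followed by select("exploded.*") does to the test data above. It only models the data flow with lists and maps. The names ExplodeModel, article, sampleInput and explodeAndExpand are made up for illustration and are not Spark API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of explode + select("exploded.*"); this is NOT Spark API.
final class ExplodeModel {

    // Builds one "struct" with the two fields used in the test JSON above.
    static Map<String, String> article(String author, String title) {
        Map<String, String> struct = new LinkedHashMap<>();
        struct.put("author", author);
        struct.put("title", title);
        return struct;
    }

    // One input row whose single column "articles" holds an array of structs.
    static List<List<Map<String, String>>> sampleInput() {
        List<Map<String, String>> articles = Arrays.asList(
                article("J.K. Rowling", "Harry Potter and the goblet of fire"),
                article("George Orwell", "1984"));
        return Arrays.asList(articles);
    }

    // explode: every array element becomes its own output row.
    // select("exploded.*"): the struct's fields become the row's columns.
    static List<List<String>> explodeAndExpand(List<List<Map<String, String>>> rows) {
        List<List<String>> out = new ArrayList<>();
        for (List<Map<String, String>> articles : rows) {
            for (Map<String, String> struct : articles) {
                out.add(new ArrayList<>(struct.values()));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (List<String> row : explodeAndExpand(sampleInput())) {
            System.out.println(row);
        }
    }
}
```

In this model one input row with a two-element array becomes two flat rows, so a table saved from the exploded result should get author and title as separate columns rather than a single articles column.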
I have two DataFrames (Dataset<Row>) with the same columns, but the fields of the structs inside an array column are in a different order.
df1:
root
|-- root: string (nullable = false)
|-- array_nested: array (nullable = false)
| |-- element: struct (containsNull = true)
| | |-- array_id: integer (nullable = false)
| | |-- array_value: string (nullable = false)
+----+------------+
|root|array_nested|
+----+------------+
|One |[[1, 1-One]]|
+----+------------+
df2:
root
|-- root: string (nullable = false)
|-- array_nested: array (nullable = false)
| |-- element: struct (containsNull = true)
| | |-- array_value: string (nullable = false)
| | |-- array_id: integer (nullable = false)
+----+------------+
|root|array_nested|
+----+------------+
|Two |[[2-Two, 2]]|
+----+------------+
I want to make the schemas the same, but my approach generates an extra layer of array:
List<Column> updatedStructNames = new ArrayList<>();
updatedStructNames.add(col("array_nested.array_id"));
updatedStructNames.add(col("array_nested.array_value"));
Column[] updatedStructNameArray = updatedStructNames.toArray(new Column[0]);
Dataset<Row> df3 = df2.withColumn("array_nested", array(struct(updatedStructNameArray)));
It generates a schema like this:
root
|-- root: string (nullable = false)
|-- array_nested: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- array_id: array (nullable = false)
| | | |-- element: integer (containsNull = true)
| | |-- array_value: array (nullable = false)
| | | |-- element: string (containsNull = true)
+----+----------------+
|root|array_nested |
+----+----------------+
|Two |[[[2], [2-Two]]]|
+----+----------------+
How can I achieve the same schema?
You can use the transform function to rebuild each struct element of the array_nested column:
Dataset<Row> df3 = df2.withColumn(
"array_nested",
expr("transform(array_nested, x -> struct(x.array_id as array_id, x.array_value as array_value))")
);
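A plain-Java sketch of why the first attempt adds a layer, assuming (as the generated schema in the question shows) that reading a field through an array of structs yields an array of that field's values. It models both approaches with lists and maps only. StructReorderModel and its method names are hypothetical and not part of Spark.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of the two approaches; this is NOT Spark API.
final class StructReorderModel {

    // df2's single array element: struct<array_value, array_id> (note the field order).
    static List<Map<String, Object>> sampleArrayNested() {
        Map<String, Object> element = new LinkedHashMap<>();
        element.put("array_value", "2-Two");
        element.put("array_id", 2);
        List<Map<String, Object>> list = new ArrayList<>();
        list.add(element);
        return list;
    }

    // Models col("array_nested.<field>"): reading a field through an array
    // yields an ARRAY of that field's values, one per element.
    static List<Object> pluck(List<Map<String, Object>> array, String field) {
        List<Object> values = new ArrayList<>();
        for (Map<String, Object> element : array) {
            values.add(element.get(field));
        }
        return values;
    }

    // Models array(struct(col("array_nested.array_id"), col("array_nested.array_value"))):
    // one struct whose fields are arrays, wrapped in a new outer array.
    static List<Map<String, Object>> wholeColumnRebuild(List<Map<String, Object>> array) {
        Map<String, Object> struct = new LinkedHashMap<>();
        struct.put("array_id", pluck(array, "array_id"));
        struct.put("array_value", pluck(array, "array_value"));
        List<Map<String, Object>> out = new ArrayList<>();
        out.add(struct);
        return out;
    }

    // Models transform(array_nested, x -> struct(x.array_id, x.array_value)):
    // each element is rebuilt on its own, so field types stay scalar.
    static List<Map<String, Object>> perElementRebuild(List<Map<String, Object>> array) {
        List<Map<String, Object>> out = new ArrayList<>();
        for (Map<String, Object> element : array) {
            Map<String, Object> struct = new LinkedHashMap<>();
            struct.put("array_id", element.get("array_id"));
            struct.put("array_value", element.get("array_value"));
            out.add(struct);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(wholeColumnRebuild(sampleArrayNested()));
        System.out.println(perElementRebuild(sampleArrayNested()));
    }
}
```

The whole-column version mirrors the [[[2], [2-Two]]] shape from the question, while the per-element version keeps one struct per original element with the fields in df1's order.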
Scenario:
I have read two XML files, specifying a schema on load.
In the schema, one of the tags is mandatory. One of the XML files is missing that mandatory tag.
When I do the following, I expect the XML with the missing mandatory tag to be filtered out.
dataset = dataset.filter(functions.col("mandatoryColumnNameInSchema").isNotNull());
When I count the rows of the dataset, I get 2 (the 2 input XML files), but when I try to print the dataset via the show() method, I get an NPE.
When I debug the above line and run the following, I get 0 as the count.
dataset.filter(functions.col("mandatoryColumnNameInSchema").isNotNull()).count();
Question:
Can anyone please answer the questions below or confirm my understanding?
1. Why is the Spark Dataset not filtering out the row that is missing the mandatory column?
2. Why is there no exception in count() but there is one in show()?
For 2, I believe count just counts the number of rows without looking at their contents.
For show, the iterator actually goes through the struct fields to print their values, and when it does not find the mandatory column, it errors out.
P.S. If I make the mandatory column optional, everything works fine.
Edit:
Providing the read options, as requested.
To load the data I execute the following:
Dataset<Row> dataset = sparkSession.read().format("com.databricks.spark.xml")
.option("header", "true")
.option("inferSchema", "false")
.option("rowTag", rowTag)//rowTag is "body" tag in the XML
.option("failFast", "true")
.option("mode", "FAILFAST")
.schema(schema)
.load(XMLfilePath);
Providing samples as requested
Schema:
root
|-- old: struct (nullable = true)
| |-- _beyond: string (nullable = true)
| |-- lot: struct (nullable = true)
| | |-- _VALUE: string (nullable = true)
| | |-- _chose: string (nullable = true)
| |-- real: struct (nullable = true)
| | |-- _eat: string (nullable = true)
| | |-- kill: struct (nullable = true)
| | | |-- _VALUE: double (nullable = true)
| | | |-- _top: string (nullable = true)
| | |-- tool: struct (nullable = true)
| | | |-- _VALUE: string (nullable = true)
| | | |-- _affect: string (nullable = true)
|-- porch: struct (nullable = true)
| |-- _account: string (nullable = true)
| |-- cast: string (nullable = true)
| |-- vegetable: struct (nullable = true)
| | |-- leg: struct (nullable = true)
| | | |-- _VALUE: double (nullable = true)
| | | |-- _nose: string (nullable = true)
| | |-- now: struct (nullable = true)
| | | |-- _gravity: string (nullable = true)
| | | |-- chief: struct (nullable = true)
| | | | |-- _VALUE: long (nullable = true)
| | | | |-- _further: string (nullable = true)
| | | |-- field: string (nullable = true)
Sample XML:
<?xml version="1.0" encoding="UTF-8" ?>
<root>
<body>
<porch account="something">
<vegetable>
<now gravity="wide">
<field>box</field>
<chief further="satisfied">-1889614487</chief>
</now>
<leg nose="angle">912658017.229279</leg>
</vegetable>
<cast>clear</cast>
</porch>
<old beyond="continent">
<real eat="term">
<kill top="plates">-1623084908.8669372</kill>
<tool affect="pond">today</tool>
</real>
<lot chose="swung">promised</lot>
</old>
</body>
</root>
Schema in JSON format:
{"type":"struct","fields":[{"name":"old","type":{"type":"struct","fields":[{"name":"_beyond","type":"string","nullable":true,"metadata":{}},{"name":"lot","type":{"type":"struct","fields":[{"name":"_VALUE","type":"string","nullable":true,"metadata":{}},{"name":"_chose","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"real","type":{"type":"struct","fields":[{"name":"_eat","type":"string","nullable":true,"metadata":{}},{"name":"kill","type":{"type":"struct","fields":[{"name":"_VALUE","type":"double","nullable":true,"metadata":{}},{"name":"_top","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"tool","type":{"type":"struct","fields":[{"name":"_VALUE","type":"string","nullable":true,"metadata":{}},{"name":"_affect","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"porch","type":{"type":"struct","fields":[{"name":"_account","type":"string","nullable":true,"metadata":{}},{"name":"cast","type":"string","nullable":true,"metadata":{}},{"name":"vegetable","type":{"type":"struct","fields":[{"name":"leg","type":{"type":"struct","fields":[{"name":"_VALUE","type":"double","nullable":true,"metadata":{}},{"name":"_nose","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"now","type":{"type":"struct","fields":[{"name":"_gravity","type":"string","nullable":true,"metadata":{}},{"name":"chief","type":{"type":"struct","fields":[{"name":"_VALUE","type":"long","nullable":true,"metadata":{}},{"name":"_further","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"field","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}}]}
My scenario can be reproduced by making the element "old" nullable = false and removing that tag from the XML.
I'm trying to perform JavaRDD actions on Row data, but I'm not able to parse or iterate over the JavaRDD<Row> data.
Schema:
root
|-- categories: array (nullable = true)
| |-- element: string (containsNull = true)
|-- discount: long (nullable = true)
|-- expiration: string (nullable = true)
|-- id: long (nullable = true)
|-- maxCashback: string (nullable = true)
|-- minTicket: long (nullable = true)
|-- name: string (nullable = true)
|-- rules: struct (nullable = true)
| |-- cardRequired: boolean (nullable = true)
| |-- cardType: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- usageLimit: long (nullable = true)
| |-- vendor: array (nullable = true)
| | |-- element: string (containsNull = true)
Data:
+--------------------+--------+----------+---+-----------+---------+--------------------+--------------------+
| categories|discount|expiration| id|maxCashback|minTicket| name| rules|
+--------------------+--------+----------+---+-----------+---------+--------------------+--------------------+
| [Movie, Event]| null|31-03-2018| 1| 100| 1|ICICI Bank Credit...|[true,WrappedArra...|
| [Movie]| 10|30-11-2017| 2| 100| 2|RBL Credit Card O...|[true,WrappedArra...|
| [Movie]| null|30-11-2017| 3| 150| 2|SBI RUPAY PLATINU...|[true,WrappedArra...|
| [Movie]| null|31-10-2017| 4| 150| 2|IDEA Select Prepa...|[true,WrappedArra...|
|[Movie, Event, Sp...| 10|31-10-2017| 5| 150| 1|Mobikwik Wallet O...|[true,WrappedArra...|
|[Movie, Event, Sp...| null| null| 6| {}| 1| Payback Point|[null,WrappedArra...|
+--------------------+--------+----------+---+-----------+---------+--------------------+--------------------+
Code Snippet:
JavaRDD<Row> applicableOffers = offers.toJavaRDD();
applicableOffers.foreach((a)->{
int fieldNoTicket = a.fieldIndex("minTicket");
int fieldNoCashback=a.fieldIndex("maxCashback");
int fieldNoDiscount=a.fieldIndex("discount");
System.out.println("a : " +a);
});
Output:
a : [WrappedArray(Movie, Event),null,31-03-2018,1,100,1,ICICI Bank Credit Card Offer,[true,WrappedArray(Credit),null,WrappedArray(ICICI)]]
a : [WrappedArray(Movie),10,30-11-2017,2,100,2,RBL Credit Card Offer,[true,WrappedArray(Credit),15,WrappedArray(RBL)]]
a : [WrappedArray(Movie),null,30-11-2017,3,150,2,SBI RUPAY PLATINUM DEBIT CARD OFFER,[true,WrappedArray(Platinum Debit),null,WrappedArray(SBI)]]
a : [WrappedArray(Movie),null,31-10-2017,4,150,2,IDEA Select Prepaid Offer,[true,WrappedArray(SIM),null,WrappedArray(IDEA)]]
a : [WrappedArray(Movie, Event, Sports),10,31-10-2017,5,150,1,Mobikwik Wallet Offer,[true,WrappedArray(eWallet),null,WrappedArray(Mobikwik)]]
a : [WrappedArray(Movie, Event, Sports),null,null,6,{},1,Payback Point,[null,WrappedArray(Credit, Debit),null,WrappedArray(ICICI,SBI,Canara)]]
All I need to do is run an action that calculates the discount for an order of USD 1000 and outputs the value and name of each offer, using Apache Spark in Java.
I managed to find a workaround: use fieldIndex(colName) to get the index, then getString(index) or getLong(index) to access the items.
int orderValue = 1000; // order value: USD 1000
applicableOffers.foreach((a) -> {
// Look up column positions by name
int nameIdx = a.fieldIndex("name");
int discountIdx = a.fieldIndex("discount");
String offerName = a.getString(nameIdx);
// discount is nullable in the schema and getLong fails on a null value;
// treating a missing discount as 0 is an assumption
long discount = a.isNullAt(discountIdx) ? 0L : a.getLong(discountIdx);
System.out.println("Offer:" + offerName + " Total:" + computeCashBack(orderValue, discount));
});
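The workaround calls computeCashBack but never shows it, so here is a purely hypothetical sketch of what such a helper might look like. It assumes that discount is a percentage of the order value and that maxCashback is an upper bound. Neither is stated in the question, and the "{}" value in the data suggests maxCashback is not always numeric. CashbackSketch and capCashBack are illustrative names.

```java
// Hypothetical helper: the answer above calls computeCashBack but never shows it.
final class CashbackSketch {

    // ASSUMPTION: "discount" is a percentage of the order value.
    static long computeCashBack(long orderValue, long discountPercent) {
        return orderValue * discountPercent / 100;
    }

    // ASSUMPTION: maxCashback caps the result; a null value or non-numeric
    // text such as "{}" is treated as "no cap".
    static long capCashBack(long cashBack, String maxCashback) {
        if (maxCashback == null) {
            return cashBack;
        }
        try {
            return Math.min(cashBack, Long.parseLong(maxCashback.trim()));
        } catch (NumberFormatException e) {
            return cashBack;
        }
    }

    public static void main(String[] args) {
        long raw = computeCashBack(1000, 10);         // 10% of an order of 1000
        System.out.println(capCashBack(raw, "150"));  // apply the maxCashback cap
    }
}
```

Keeping the cap in a separate method means the unparseable "{}" case is handled in one place instead of inside the foreach lambda.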
I have multiple JSON files containing JSON data. The JSON structure looks like this:
{
"Name":"Vipin Suman",
"Email":"vpn2330#gmail.com",
"Designation":"Trainee Programmer",
"Age":22 ,
"location":
{"City":
{
"Pin":324009,
"City Name":"Ahmedabad"
},
"State":"Gujarat"
},
"Company":
{
"Company Name":"Elegant",
"Domain":"Java"
},
"Test":["Test1","Test2"]
}
I tried this
String jsonFilePath = "/home/vipin/workspace/Smarten/jsonParsing/Employee/Employee-03.json";
String[] jsonFiles = jsonFilePath.split(",");
Dataset<Row> people = sparkSession.read().json(jsonFiles);
The schema I get for this is:
root
|-- Age: long (nullable = true)
|-- Company: struct (nullable = true)
| |-- Company Name: string (nullable = true)
| |-- Domain: string (nullable = true)
|-- Designation: string (nullable = true)
|-- Email: string (nullable = true)
|-- Name: string (nullable = true)
|-- Test: array (nullable = true)
| |-- element: string (containsNull = true)
|-- location: struct (nullable = true)
| |-- City: struct (nullable = true)
| | |-- City Name: string (nullable = true)
| | |-- Pin: long (nullable = true)
| |-- State: string (nullable = true)
The table view I get is:
+---+--------------+------------------+-----------------+-----------+--------------+--------------------+
|Age| Company| Designation| Email| Name| Test| location|
+---+--------------+------------------+-----------------+-----------+--------------+--------------------+
| 22|[Elegant,Java]|Trainee Programmer|vpn2330#gmail.com|Vipin Suman|[Test1, Test2]|[[Ahmedabad,32400...|
+---+--------------+------------------+-----------------+-----------+--------------+--------------------+
The result I want is:
Age | Company Name | Domain| Designation | Email | Name | Test | City Name | Pin | State |
22 | Elegant MicroWeb | Java | Programmer | vpn2330#gmail.com | Vipin Suman | Test1 | Ahmedabad | 324009 | Gujarat
22 | Elegant MicroWeb | Java | Programmer | vpn2330#gmail.com | Vipin Suman | Test2 | Ahmedabad | 324009 |
How can I get the table in the above format? I have tried everything. I am new to Apache Spark; can anyone help me out?
I suggest doing this in Scala, which is better supported by Spark. You can use the select API to pick specific columns and alias to rename a column; for how to select from complex data formats, see https://databricks.com/blog/2017/02/23/working-complex-data-formats-structured-streaming-apache-spark-2-1.html
Based on your desired result, you also need the explode API (see "Flattening Rows in Spark").
In Scala it could be done like this:
people.select(
$"Age",
$"Company.*",
$"Designation",
$"Email",
$"Name",
explode($"Test"),
$"location.City.*",
$"location.State")
Unfortunately, the following code in Java would fail:
people.select(
people.col("Age"),
people.col("Company.*"),
people.col("Designation"),
people.col("Email"),
people.col("Name"),
explode(people.col("Test")),
people.col("location.City.*"),
people.col("location.State"));
You can use selectExpr instead though:
people.selectExpr(
"Age",
"Company.*",
"Designation",
"Email",
"Name",
"EXPLODE(Test) AS Test",
"location.City.*",
"location.State");
PS:
You can pass the path to the directory or directories instead of the list of JSON files in sparkSession.read().json(jsonFiles);.
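To make the intended result of the selectExpr call concrete, here is a small plain-Java sketch that builds the flattened rows for the sample record by hand: struct fields are lifted to top-level columns (the ".*" expressions) and one row is emitted per element of Test (the EXPLODE). It is only a model of the output shape. FlattenModel, flatten and sample are made-up names, not Spark API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of the selectExpr result; this is NOT Spark API.
final class FlattenModel {

    // Returns one flat row per element of "Test"; all other columns repeat.
    static List<Map<String, Object>> flatten(
            long age, Map<String, Object> company, String designation, String email,
            String name, List<String> test, Map<String, Object> city, String state) {
        List<Map<String, Object>> rows = new ArrayList<>();
        for (String t : test) {                 // EXPLODE(Test) AS Test
            Map<String, Object> row = new LinkedHashMap<>();
            row.put("Age", age);
            row.putAll(company);                // Company.*
            row.put("Designation", designation);
            row.put("Email", email);
            row.put("Name", name);
            row.put("Test", t);
            row.putAll(city);                   // location.City.*
            row.put("State", state);            // location.State
            rows.add(row);
        }
        return rows;
    }

    // The sample employee record from the question.
    static List<Map<String, Object>> sample() {
        Map<String, Object> company = new LinkedHashMap<>();
        company.put("Company Name", "Elegant");
        company.put("Domain", "Java");
        Map<String, Object> city = new LinkedHashMap<>();
        city.put("City Name", "Ahmedabad");
        city.put("Pin", 324009L);
        return flatten(22L, company, "Trainee Programmer", "vpn2330#gmail.com",
                "Vipin Suman", Arrays.asList("Test1", "Test2"), city, "Gujarat");
    }

    public static void main(String[] args) {
        for (Map<String, Object> row : sample()) {
            System.out.println(row);
        }
    }
}
```

Note that in this model a record with an empty Test list produces no rows at all, which is worth keeping in mind when exploding arrays that may be empty.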
I have this nested schema in a JavaSchemaRDD:
root
|-- ProductInfo: struct (nullable = true)
| |-- Features: string (nullable = true)
| |-- ImgURL: string (nullable = true)
| |-- Name: string (nullable = true)
| |-- Price: string (nullable = true)
| |-- ProductID: string (nullable = true)
|-- Reviews: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- Author: string (nullable = true)
| | |-- Content: string (nullable = true)
| | |-- Date: string (nullable = true)
| | |-- Overall: string (nullable = true)
| | |-- ReviewID: string (nullable = true)
| | |-- Title: string (nullable = true)
|-- _corrupt_record: string (nullable = true)
I would like to select the Name of the product based on the Overall rating of the product.
I wrote the following:
JavaSchemaRDD variable = sqlContext.sql("SELECT ProductInfo.Name FROM Table" + "WHERE Reviews.element.Overall=5.0" + "ORDER BY c");
But it seems there is a mistake.
What would be the right one?
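One problem that can be checked without Spark at all is the string concatenation: the three fragments are joined with no spaces between them. The sketch below prints what the query string actually contains. QueryStringCheck is just an illustrative name, and the second method only fixes the spacing; whether Reviews.element.Overall and ORDER BY c are valid for this schema is a separate question that it does not address.

```java
// Shows what the concatenated query string from the question actually contains.
final class QueryStringCheck {

    // The concatenation exactly as written in the question.
    static String asWritten() {
        return "SELECT ProductInfo.Name FROM Table" + "WHERE Reviews.element.Overall=5.0" + "ORDER BY c";
    }

    // Same fragments with a separating space between clauses.
    // Only the spacing is changed; the clauses themselves are left untouched.
    static String withSpaces() {
        return String.join(" ",
                "SELECT ProductInfo.Name FROM Table",
                "WHERE Reviews.element.Overall=5.0",
                "ORDER BY c");
    }

    public static void main(String[] args) {
        System.out.println(asWritten());   // clauses run together: "...TableWHERE...5.0ORDER BY c"
        System.out.println(withSpaces());
    }
}
```

Printing the final string before handing it to sqlContext.sql is a cheap way to catch this kind of mistake.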