Related
I have two data-frame (Dataset<Row>) with the same columns, but different order array of structs.
df1:
root
|-- root: string (nullable = false)
|-- array_nested: array (nullable = false)
| |-- element: struct (containsNull = true)
| | |-- array_id: integer (nullable = false)
| | |-- array_value: string (nullable = false)
+----+------------+
|root|array_nested|
+----+------------+
|One |[[1, 1-One]]|
+----+------------+
df2:
root
|-- root: string (nullable = false)
|-- array_nested: array (nullable = false)
| |-- element: struct (containsNull = true)
| | |-- array_value: string (nullable = false)
| | |-- array_id: integer (nullable = false)
+----+------------+
|root|array_nested|
+----+------------+
|Two |[[2-Two, 2]]|
+----+------------+
I want make the schema the same, but when I try my approach it generates and extra later of array:
List<Column> updatedStructNames = new ArrayList<>();
updatedStructNames.add(col("array_nested.array_id"));
updatedStructNames.add(col("array_nested.array_value"));
Column[] updatedStructNameArray = updatedStructNames.toArray(new Column[0]);
Dataset<Row> df3 = df2.withColumn("array_nested", array(struct(updatedStructNameArray)));
It will get generate schema like this:
root
|-- root: string (nullable = false)
|-- array_nested: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- array_id: array (nullable = false)
| | | |-- element: integer (containsNull = true)
| | |-- array_value: array (nullable = false)
| | | |-- element: string (containsNull = true)
+----+----------------+
|root|array_nested |
+----+----------------+
|Two |[[[2], [2-Two]]]|
+----+----------------+
How can I achieve the same schema?
You can use transform function to update the struct elements of array_nested column:
Dataset < Row > df3 = df2.withColumn(
"array_nested",
expr("transform(array_nested, x -> struct(x.array_id as array_id, x.array_value as array_value))")
);
Scenario:
I have read two XML files via specifying a schema on load.
In the schema, One of the tags is mandatory. One XML is missing that mandatory tag.
Now, when I do the following, I am expecting the XML with the missing mandatory tag to be filtered out.
dataset = dataset.filter(functions.col("mandatoryColumnNameInSchema").isNotNull());
In the code when I try to count the Rows of the dataset, I am getting the count as 2 (2 Input XMLS) but when I try to print the dataset via show() method, I am getting a NPE.
When I debugged the above line and do the following, I get 0 as the count.
dataset.filter(functions.col("mandatoryColumnNameInSchema").isNotNull()).count();
Question:
Can anyone please answer the questions/affirm my understanding below
Why is spark Dataset not filtering the Row which does not have a mandatory column?
Why there are no exception in the count but in show method?
For 2, I believe the count is just counting no of Rows without looking into the contents.
For show, the iterator actually goes through the Struct Fields to print their values and when it does not find the mandatory column, it errors out.
P.S. If I make the mandatory column optional, all is working fine.
Edit:
Providing Reading options as requested
For loading the data I am executing the following
Dataset<Row> dataset = sparkSession.read().format("com.databricks.spark.xml")
.option("header", "true")
.option("inferSchema", "false")
.option("rowTag", rowTag)//rowTag is "body" tag in the XML
.option("failFast", "true")
.option("mode", "FAILFAST")
.schema(schema)
.load(XMLfilePath);
Providing samples as requested
Schema:
root
|-- old: struct (nullable = true)
| |-- _beyond: string (nullable = true)
| |-- lot: struct (nullable = true)
| | |-- _VALUE: string (nullable = true)
| | |-- _chose: string (nullable = true)
| |-- real: struct (nullable = true)
| | |-- _eat: string (nullable = true)
| | |-- kill: struct (nullable = true)
| | | |-- _VALUE: double (nullable = true)
| | | |-- _top: string (nullable = true)
| | |-- tool: struct (nullable = true)
| | | |-- _VALUE: string (nullable = true)
| | | |-- _affect: string (nullable = true)
|-- porch: struct (nullable = true)
| |-- _account: string (nullable = true)
| |-- cast: string (nullable = true)
| |-- vegetable: struct (nullable = true)
| | |-- leg: struct (nullable = true)
| | | |-- _VALUE: double (nullable = true)
| | | |-- _nose: string (nullable = true)
| | |-- now: struct (nullable = true)
| | | |-- _gravity: string (nullable = true)
| | | |-- chief: struct (nullable = true)
| | | | |-- _VALUE: long (nullable = true)
| | | | |-- _further: string (nullable = true)
| | | |-- field: string (nullable = true)
Sample XML:
<?xml version="1.0" encoding="UTF-8" ?>
<root>
<body>
<porch account="something">
<vegetable>
<now gravity="wide">
<field>box</field>
<chief further="satisfied">-1889614487</chief>
</now>
<leg nose="angle">912658017.229279</leg>
</vegetable>
<cast>clear</cast>
</porch>
<old beyond="continent">
<real eat="term">
<kill top="plates">-1623084908.8669372</kill>
<tool affect="pond">today</tool>
</real>
<lot chose="swung">promised</lot>
</old>
</body>
</root>
Schema in JSON format:
{"type":"struct","fields":[{"name":"old","type":{"type":"struct","fields":[{"name":"_beyond","type":"string","nullable":true,"metadata":{}},{"name":"lot","type":{"type":"struct","fields":[{"name":"_VALUE","type":"string","nullable":true,"metadata":{}},{"name":"_chose","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"real","type":{"type":"struct","fields":[{"name":"_eat","type":"string","nullable":true,"metadata":{}},{"name":"kill","type":{"type":"struct","fields":[{"name":"_VALUE","type":"double","nullable":true,"metadata":{}},{"name":"_top","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"tool","type":{"type":"struct","fields":[{"name":"_VALUE","type":"string","nullable":true,"metadata":{}},{"name":"_affect","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"porch","type":{"type":"struct","fields":[{"name":"_account","type":"string","nullable":true,"metadata":{}},{"name":"cast","type":"string","nullable":true,"metadata":{}},{"name":"vegetable","type":{"type":"struct","fields":[{"name":"leg","type":{"type":"struct","fields":[{"name":"_VALUE","type":"double","nullable":true,"metadata":{}},{"name":"_nose","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"now","type":{"type":"struct","fields":[{"name":"_gravity","type":"string","nullable":true,"metadata":{}},{"name":"chief","type":{"type":"struct","fields":[{"name":"_VALUE","type":"long","nullable":true,"metadata":{}},{"name":"_further","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"field","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}}]}
My scenario can be reproduced by making the element "old" as nullable = false, and removing the tag from the XML
Dataset<Row> ds = spark.read().option("multiLine", true).option("mode", "PERMISSIVE").json("/user/administrador/prueba_diario.txt").toDF();
ds.printSchema();
Dataset<Row> ds2 = ds.select("articles").toDF();
ds2.printSchema();
spark.sql("drop table if exists table1");
ds2.write().saveAsTable("table1");
I have this json format
root
|-- articles: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- author: string (nullable = true)
| | |-- content: string (nullable = true)
| | |-- description: string (nullable = true)
| | |-- publishedAt: string (nullable = true)
| | |-- source: struct (nullable = true)
| | | |-- id: string (nullable = true)
| | | |-- name: string (nullable = true)
| | |-- title: string (nullable = true)
| | |-- url: string (nullable = true)
| | |-- urlToImage: string (nullable = true)
|-- status: string (nullable = true)
|-- totalResults: long (nullable = true)
I want to save the array articles as a hive's table with the arrays format
example of hive table that i want:
author (string)
content (string)
description (string)
publishedat (string)
source (struct<id:string,name:string>)
title (string)
url (string)
urltoimage (string)
The problem is that is saving the table just with one column named article and the contend is inside in this only column
A bit convoluted, but I found this one to work:
import org.apache.spark.sql.functions._
ds.select(explode(col("articles")).as("exploded")).select("exploded.*").toDF()
I tested it on
{
"articles": [
{
"author": "J.K. Rowling",
"title": "Harry Potter and the goblet of fire"
},
{
"author": "George Orwell",
"title": "1984"
}
]
}
and it returned (after collecting it into an array)
result = {Arrays$ArrayList#13423} size = 2
0 = {GenericRowWithSchema#13425} "[J.K. Rowling,Harry Potter and the goblet of fire]"
1 = {GenericRowWithSchema#13426} "[George Orwell,1984]"
I have a Dataset with Schema as below
root
|-- collectorId: string (nullable = true)
|-- generatedAt: long (nullable = true)
|-- managedNeId: string (nullable = true)
|-- neAlert: struct (nullable = true)
| |-- advisory: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- equipmentType: string (nullable = true)
| | | |-- headlineName: string (nullable = true)
| |-- fieldNotice: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- caveat: string (nullable = true)
| | | |-- distributionCode: string (nullable = true)
| |-- hwEoX: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- bulletinName: string (nullable = true)
| | | |-- equipmentType: string (nullable = true)
| |-- swEoX: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- bulletinHeadline: string (nullable = true)
| | | |-- equipmentType: string (nullable = true)
|-- partyId: string (nullable = true)
|-- recordType: string (nullable = true)
|-- sourceNeId: string (nullable = true)
|-- sourcePartyId: string (nullable = true)
|-- sourceSubPartyId: string (nullable = true)
|-- wfid: string (nullable = true)
I want to get the fields inside the "element". In order to do this I have done an explode on the array to flatten this.
Dataset<Row> alert = spark.read().option("multiLine", true).option("mode", "PERMISSIVE").json("C:\\Users\\LearningAndDevelopment\\\\merge\\data1\\sample.json");
Seq<String> droppedColumns = scala.collection.JavaConversions.asScalaBuffer(Arrays.asList("neAlert"));
Dataset<Row> alertjson = alert.withColumn("exploded_advisory", explode(col("neAlert.advisory"))).withColumn("exploded_fn", explode(col("neAlert.fieldNotice"))).withColumn("exploded_swEoX", explode(col("neAlert.swEoX"))).withColumn("exploded_hwEox", explode(col("neAlert.hwEoX"))).drop(droppedColumns);
alertjson.printSchema();
I got the final JSON as below
root
|-- collectorId: string (nullable = true)
|-- generatedAt: long (nullable = true)
|-- managedNeId: string (nullable = true)
|-- partyId: string (nullable = true)
|-- recordType: string (nullable = true)
|-- sourceNeId: string (nullable = true)
|-- sourcePartyId: string (nullable = true)
|-- sourceSubPartyId: string (nullable = true)
|-- wfid: string (nullable = true)
|-- exploded_advisory: struct (nullable = true)
| |-- equipmentType: string (nullable = true)
| |-- headlineName: string (nullable = true)
|-- exploded_fn: struct (nullable = true)
| |-- caveat: string (nullable = true)
| |-- distributionCode: string (nullable = true)
|-- exploded_swEoX: struct (nullable = true)
| |-- bulletinHeadline: string (nullable = true)
| |-- equipmentType: string (nullable = true)
|-- exploded_hwEox: struct (nullable = true)
| |-- bulletinName: string (nullable = true)
| |-- equipmentType: string (nullable = true)
But, the above method created all duplicate records flattened with data in the first element of each JSON array. Each array can have so many elements. How can i flatten the JSON arrays without loosing the data integrity.
You can select the nested json with . dot operator first and use explode for each nested field.
Dataset<Row> alertjson = alert
.withColumn("exploded_advisory", explode(col("neAlert.advisory")))
.withColumn("exploded_fn", explode(col("neAlert.fieldNotice")))
.withColumn("exploded_swEoX", explode(col("neAlert.swEoX")))
.withColumn("exploded_hwEox", explode(col("neAlert.hwEoX")));
If you want each field explode as individual then you have to explode separately which created multiple dataframes
// for advisory
Dataset<Row> alertjson = alert
.withColumn("exploded_advisory", explode(col("neAlert.advisory")))
DataSet<Row> fieldNorice = alert
.withColumn("exploded_fn", explode(col("neAlert.fieldNotice")))
Drop the unrequired columns and should work.
I'm trying to perform JavaRDD actions on Row type data. But I'm not able to parse or iterate JavaRDD< Row> Data
Schema:
root
|-- categories: array (nullable = true)
| |-- element: string (containsNull = true)
|-- discount: long (nullable = true)
|-- expiration: string (nullable = true)
|-- id: long (nullable = true)
|-- maxCashback: string (nullable = true)
|-- minTicket: long (nullable = true)
|-- name: string (nullable = true)
|-- rules: struct (nullable = true)
| |-- cardRequired: boolean (nullable = true)
| |-- cardType: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- usageLimit: long (nullable = true)
| |-- vendor: array (nullable = true)
| | |-- element: string (containsNull = true)
Data:
+--------------------+--------+----------+---+-----------+---------+--------------------+--------------------+
| categories|discount|expiration| id|maxCashback|minTicket| name| rules|
+--------------------+--------+----------+---+-----------+---------+--------------------+--------------------+
| [Movie, Event]| null|31-03-2018| 1| 100| 1|ICICI Bank Credit...|[true,WrappedArra...|
| [Movie]| 10|30-11-2017| 2| 100| 2|RBL Credit Card O...|[true,WrappedArra...|
| [Movie]| null|30-11-2017| 3| 150| 2|SBI RUPAY PLATINU...|[true,WrappedArra...|
| [Movie]| null|31-10-2017| 4| 150| 2|IDEA Select Prepa...|[true,WrappedArra...|
|[Movie, Event, Sp...| 10|31-10-2017| 5| 150| 1|Mobikwik Wallet O...|[true,WrappedArra...|
|[Movie, Event, Sp...| null| null| 6| {}| 1| Payback Point|[null,WrappedArra...|
+--------------------+--------+----------+---+-----------+---------+--------------------+--------------------+
Code Snippet:
JavaRDD<Row> applicableOffers = offers.toJavaRDD();
applicableOffers.foreach((a)->{
int fieldNoTicket = a.fieldIndex("minTicket");
int filedNoCashback=a.fieldIndex("maxCashback");
int fieldNoDiscount=a.fieldIndex("discount");
System.out.println("a : " +a);
});
Output:
a : [WrappedArray(Movie, Event),null,31-03-2018,1,100,1,ICICI Bank Credit Card Offer,[true,WrappedArray(Credit),null,WrappedArray(ICICI)]]
a : [WrappedArray(Movie),10,30-11-2017,2,100,2,RBL Credit Card Offer,[true,WrappedArray(Credit),15,WrappedArray(RBL)]]
a : [WrappedArray(Movie),null,30-11-2017,3,150,2,SBI RUPAY PLATINUM DEBIT CARD OFFER,[true,WrappedArray(Platinum Debit),null,WrappedArray(SBI)]]
a : [WrappedArray(Movie),null,31-10-2017,4,150,2,IDEA Select Prepaid Offer,[true,WrappedArray(SIM),null,WrappedArray(IDEA)]]
a : [WrappedArray(Movie, Event, Sports),10,31-10-2017,5,150,1,Mobikwik Wallet Offer,[true,WrappedArray(eWallet),null,WrappedArray(Mobikwik)]]
a : [WrappedArray(Movie, Event, Sports),null,null,6,{},1,Payback Point,[null,WrappedArray(Credit, Debit),null,WrappedArray(ICICI,SBI,Canara)]]
All I need to do is run an action which calculates the discount for USD 1000 and outputs value and name of an offer in Apache Spark Java.
I managed to find a workaround. By using fieldIndex(colName) to capture the index followed by getLong(index) to access the items.
int orderValue=1000; // USD 1000 is order value
applicableOffers.foreach((a) -> {
int name = a.fieldIndex("name");
int discount = a.fieldIndex("discount");
String offerName = a.getString(name);
Long discount = a.getLong(discount);
System.out.println("Offer:" + offerName + " Total:" + computeCashBack(orderValue,discount));
});