The schema of my input JSON DataFrame looks like the following:
company: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- address: struct (nullable = true)
| | | |-- city: string (nullable = true)
| | | |-- county: string (nullable = true)
| | | |-- latitude: string (nullable = true)
| | | |-- line1: string (nullable = true)
| | | |-- line2: string (nullable = true)
| | | |-- longitude: string (nullable = true)
| | | |-- postalCode: long (nullable = true)
| | | |-- state: struct (nullable = true)
| | | | |-- code: int (nullable = true)
| | | | |-- name: string (nullable = true)
| | | |-- stateOtherDescription: string (nullable = true)
| | |-- addressSourceOther: string (nullable = true)
| | |-- addressSourceType: struct (nullable = true)
| | | |-- code: int (nullable = true)
| | | |-- name: string (nullable = true)
| | |-- reasons: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- improve: string (nullable = true)
| | | | |-- far: string (nullable = true)
| | | | |-- home: string (nullable = true)
I want to flatten it dynamically using Spark with Java. Can someone help me with this?
Take a look at the de.stefanfrings/parsing/JsonSlurper class in http://stefanfrings.de/bfUtilities/bfUtilities.zip. It reads a JSON document and creates a flat HashMap from the content.
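To illustrate what "a flat HashMap" means here, the following is a minimal plain-Java sketch, not the JsonSlurper implementation. It assumes the JSON has already been parsed into nested Map and List objects; the class name FlattenSketch, the dotted-key and [index] naming scheme, and the sample values are all made up for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class FlattenSketch {
    // Recursively walks nested maps and lists and stores each leaf value under a dotted key.
    static void flatten(String prefix, Object node, Map<String, Object> out) {
        if (node instanceof Map) {
            for (Map.Entry<?, ?> e : ((Map<?, ?>) node).entrySet()) {
                String key = prefix.isEmpty() ? String.valueOf(e.getKey()) : prefix + "." + e.getKey();
                flatten(key, e.getValue(), out);
            }
        } else if (node instanceof List) {
            List<?> list = (List<?>) node;
            for (int i = 0; i < list.size(); i++) {
                // Array elements get an index suffix (an arbitrary naming choice).
                flatten(prefix + "[" + i + "]", list.get(i), out);
            }
        } else {
            out.put(prefix, node); // leaf value (string, number, boolean or null)
        }
    }

    public static Map<String, Object> flatten(Map<String, Object> root) {
        Map<String, Object> out = new LinkedHashMap<>();
        flatten("", root, out);
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample that mirrors a small part of the company/address schema.
        Map<String, Object> state = new LinkedHashMap<>();
        state.put("code", 5);
        state.put("name", "CA");
        Map<String, Object> address = new LinkedHashMap<>();
        address.put("city", "Fresno");
        address.put("state", state);
        Map<String, Object> company = new LinkedHashMap<>();
        company.put("address", address);
        Map<String, Object> root = new LinkedHashMap<>();
        root.put("company", Arrays.asList(company));
        System.out.println(flatten(root));
    }
}
```

Because the walk is driven by the data rather than by a fixed list of field names, it adapts to whatever nesting the document has, which is the "dynamic" part of the question.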
I have two DataFrames (Dataset<Row>) with the same columns, but the fields of the structs inside the array column are in a different order.
df1:
root
|-- root: string (nullable = false)
|-- array_nested: array (nullable = false)
| |-- element: struct (containsNull = true)
| | |-- array_id: integer (nullable = false)
| | |-- array_value: string (nullable = false)
+----+------------+
|root|array_nested|
+----+------------+
|One |[[1, 1-One]]|
+----+------------+
df2:
root
|-- root: string (nullable = false)
|-- array_nested: array (nullable = false)
| |-- element: struct (containsNull = true)
| | |-- array_value: string (nullable = false)
| | |-- array_id: integer (nullable = false)
+----+------------+
|root|array_nested|
+----+------------+
|Two |[[2-Two, 2]]|
+----+------------+
I want to make the schemas the same, but my approach generates an extra layer of array:
List<Column> updatedStructNames = new ArrayList<>();
updatedStructNames.add(col("array_nested.array_id"));
updatedStructNames.add(col("array_nested.array_value"));
Column[] updatedStructNameArray = updatedStructNames.toArray(new Column[0]);
Dataset<Row> df3 = df2.withColumn("array_nested", array(struct(updatedStructNameArray)));
It generates a schema like this:
root
|-- root: string (nullable = false)
|-- array_nested: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- array_id: array (nullable = false)
| | | |-- element: integer (containsNull = true)
| | |-- array_value: array (nullable = false)
| | | |-- element: string (containsNull = true)
+----+----------------+
|root|array_nested |
+----+----------------+
|Two |[[[2], [2-Two]]]|
+----+----------------+
How can I achieve the same schema?
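You can use the transform function (a higher-order function available in Spark SQL 2.4 and later) to rebuild the struct elements of the array_nested column in the desired field order:
import static org.apache.spark.sql.functions.expr;

Dataset<Row> df3 = df2.withColumn(
    "array_nested",
    expr("transform(array_nested, x -> struct(x.array_id as array_id, x.array_value as array_value))")
);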
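The difference between the two approaches can be sketched in plain Java without Spark. Selecting a struct field from an array of structs (as col("array_nested.array_id") does) yields an array of that field's values, which is where the extra array layer comes from; transform instead rebuilds each element on its own. The sketch below models only the per-element rebuild, using Map as a stand-in for a struct; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ReorderStructSketch {
    // Rebuilds every element with its fields in the requested order,
    // analogous to transform(array_nested, x -> struct(...)).
    public static List<Map<String, Object>> reorder(List<Map<String, Object>> elements,
                                                    List<String> fieldOrder) {
        List<Map<String, Object>> result = new ArrayList<>();
        for (Map<String, Object> element : elements) {
            Map<String, Object> rebuilt = new LinkedHashMap<>();
            for (String field : fieldOrder) {
                rebuilt.put(field, element.get(field));
            }
            result.add(rebuilt);
        }
        return result;
    }

    public static void main(String[] args) {
        // One element in df2's field order: array_value first, array_id second.
        Map<String, Object> element = new LinkedHashMap<>();
        element.put("array_value", "2-Two");
        element.put("array_id", 2);
        List<Map<String, Object>> reordered =
            reorder(Arrays.asList(element), Arrays.asList("array_id", "array_value"));
        System.out.println(reordered);
    }
}
```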
Scenario:
I read two XML files, specifying a schema on load.
In the schema, one of the tags is mandatory. One of the XML files is missing that mandatory tag.
When I do the following, I expect the XML with the missing mandatory tag to be filtered out.
dataset = dataset.filter(functions.col("mandatoryColumnNameInSchema").isNotNull());
When I count the Rows of the dataset, I get 2 (the 2 input XMLs), but when I print the dataset via the show() method, I get a NullPointerException.
When I debug the line above and evaluate the following, I get 0 as the count.
dataset.filter(functions.col("mandatoryColumnNameInSchema").isNotNull()).count();
Question:
Can anyone please answer the questions below and confirm my understanding?
1. Why is the Spark Dataset not filtering out the Row that lacks the mandatory column?
2. Why is there no exception in count() but one in show()?
For 2, I believe count() just counts the number of Rows without looking at their contents.
For show(), the iterator actually goes through the struct fields to print their values, and it errors out when it does not find the mandatory column.
P.S. If I make the mandatory column optional, all is working fine.
Edit:
Providing Reading options as requested
To load the data I execute the following:
Dataset<Row> dataset = sparkSession.read().format("com.databricks.spark.xml")
.option("header", "true")
.option("inferSchema", "false")
.option("rowTag", rowTag)//rowTag is "body" tag in the XML
.option("failFast", "true")
.option("mode", "FAILFAST")
.schema(schema)
.load(XMLfilePath);
Providing samples as requested
Schema:
root
|-- old: struct (nullable = true)
| |-- _beyond: string (nullable = true)
| |-- lot: struct (nullable = true)
| | |-- _VALUE: string (nullable = true)
| | |-- _chose: string (nullable = true)
| |-- real: struct (nullable = true)
| | |-- _eat: string (nullable = true)
| | |-- kill: struct (nullable = true)
| | | |-- _VALUE: double (nullable = true)
| | | |-- _top: string (nullable = true)
| | |-- tool: struct (nullable = true)
| | | |-- _VALUE: string (nullable = true)
| | | |-- _affect: string (nullable = true)
|-- porch: struct (nullable = true)
| |-- _account: string (nullable = true)
| |-- cast: string (nullable = true)
| |-- vegetable: struct (nullable = true)
| | |-- leg: struct (nullable = true)
| | | |-- _VALUE: double (nullable = true)
| | | |-- _nose: string (nullable = true)
| | |-- now: struct (nullable = true)
| | | |-- _gravity: string (nullable = true)
| | | |-- chief: struct (nullable = true)
| | | | |-- _VALUE: long (nullable = true)
| | | | |-- _further: string (nullable = true)
| | | |-- field: string (nullable = true)
Sample XML:
<?xml version="1.0" encoding="UTF-8" ?>
<root>
<body>
<porch account="something">
<vegetable>
<now gravity="wide">
<field>box</field>
<chief further="satisfied">-1889614487</chief>
</now>
<leg nose="angle">912658017.229279</leg>
</vegetable>
<cast>clear</cast>
</porch>
<old beyond="continent">
<real eat="term">
<kill top="plates">-1623084908.8669372</kill>
<tool affect="pond">today</tool>
</real>
<lot chose="swung">promised</lot>
</old>
</body>
</root>
Schema in JSON format:
{"type":"struct","fields":[{"name":"old","type":{"type":"struct","fields":[{"name":"_beyond","type":"string","nullable":true,"metadata":{}},{"name":"lot","type":{"type":"struct","fields":[{"name":"_VALUE","type":"string","nullable":true,"metadata":{}},{"name":"_chose","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"real","type":{"type":"struct","fields":[{"name":"_eat","type":"string","nullable":true,"metadata":{}},{"name":"kill","type":{"type":"struct","fields":[{"name":"_VALUE","type":"double","nullable":true,"metadata":{}},{"name":"_top","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"tool","type":{"type":"struct","fields":[{"name":"_VALUE","type":"string","nullable":true,"metadata":{}},{"name":"_affect","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"porch","type":{"type":"struct","fields":[{"name":"_account","type":"string","nullable":true,"metadata":{}},{"name":"cast","type":"string","nullable":true,"metadata":{}},{"name":"vegetable","type":{"type":"struct","fields":[{"name":"leg","type":{"type":"struct","fields":[{"name":"_VALUE","type":"double","nullable":true,"metadata":{}},{"name":"_nose","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"now","type":{"type":"struct","fields":[{"name":"_gravity","type":"string","nullable":true,"metadata":{}},{"name":"chief","type":{"type":"struct","fields":[{"name":"_VALUE","type":"long","nullable":true,"metadata":{}},{"name":"_further","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"field","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}}]}
My scenario can be reproduced by declaring the element "old" as nullable = false in the schema and removing that tag from the XML.
I have a Dataset with the schema below:
root
|-- collectorId: string (nullable = true)
|-- generatedAt: long (nullable = true)
|-- managedNeId: string (nullable = true)
|-- neAlert: struct (nullable = true)
| |-- advisory: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- equipmentType: string (nullable = true)
| | | |-- headlineName: string (nullable = true)
| |-- fieldNotice: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- caveat: string (nullable = true)
| | | |-- distributionCode: string (nullable = true)
| |-- hwEoX: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- bulletinName: string (nullable = true)
| | | |-- equipmentType: string (nullable = true)
| |-- swEoX: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- bulletinHeadline: string (nullable = true)
| | | |-- equipmentType: string (nullable = true)
|-- partyId: string (nullable = true)
|-- recordType: string (nullable = true)
|-- sourceNeId: string (nullable = true)
|-- sourcePartyId: string (nullable = true)
|-- sourceSubPartyId: string (nullable = true)
|-- wfid: string (nullable = true)
I want to get the fields inside each "element". To do this, I exploded the arrays to flatten them:
Dataset<Row> alert = spark.read().option("multiLine", true).option("mode", "PERMISSIVE").json("C:\\Users\\LearningAndDevelopment\\\\merge\\data1\\sample.json");
Seq<String> droppedColumns = scala.collection.JavaConversions.asScalaBuffer(Arrays.asList("neAlert"));
Dataset<Row> alertjson = alert.withColumn("exploded_advisory", explode(col("neAlert.advisory"))).withColumn("exploded_fn", explode(col("neAlert.fieldNotice"))).withColumn("exploded_swEoX", explode(col("neAlert.swEoX"))).withColumn("exploded_hwEox", explode(col("neAlert.hwEoX"))).drop(droppedColumns);
alertjson.printSchema();
I got the final schema below:
root
|-- collectorId: string (nullable = true)
|-- generatedAt: long (nullable = true)
|-- managedNeId: string (nullable = true)
|-- partyId: string (nullable = true)
|-- recordType: string (nullable = true)
|-- sourceNeId: string (nullable = true)
|-- sourcePartyId: string (nullable = true)
|-- sourceSubPartyId: string (nullable = true)
|-- wfid: string (nullable = true)
|-- exploded_advisory: struct (nullable = true)
| |-- equipmentType: string (nullable = true)
| |-- headlineName: string (nullable = true)
|-- exploded_fn: struct (nullable = true)
| |-- caveat: string (nullable = true)
| |-- distributionCode: string (nullable = true)
|-- exploded_swEoX: struct (nullable = true)
| |-- bulletinHeadline: string (nullable = true)
| |-- equipmentType: string (nullable = true)
|-- exploded_hwEox: struct (nullable = true)
| |-- bulletinName: string (nullable = true)
| |-- equipmentType: string (nullable = true)
However, this method produced many duplicate records: the rows are flattened with the data of the first element of each JSON array repeated. Each array can have many elements. How can I flatten the JSON arrays without losing data integrity?
You can select the nested JSON fields with the dot (.) operator first and then use explode on each nested array.
Dataset<Row> alertjson = alert
.withColumn("exploded_advisory", explode(col("neAlert.advisory")))
.withColumn("exploded_fn", explode(col("neAlert.fieldNotice")))
.withColumn("exploded_swEoX", explode(col("neAlert.swEoX")))
.withColumn("exploded_hwEox", explode(col("neAlert.hwEoX")));
If you want each array exploded independently, you have to explode them separately, which creates multiple DataFrames:
// for advisory
Dataset<Row> alertjson = alert
    .withColumn("exploded_advisory", explode(col("neAlert.advisory")));
// for fieldNotice
Dataset<Row> fieldNotice = alert
    .withColumn("exploded_fn", explode(col("neAlert.fieldNotice")));
Drop the columns you do not need and it should work.
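The duplicate rows reported in the question follow from how chained explodes combine: each explode emits one output row per array element for every existing row, so the row count becomes the product of the array sizes. The plain-Java sketch below models that behaviour without Spark; the class name and the sample values are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ExplodeSketch {
    // Models chained explode calls: every existing row is repeated once
    // per element of the next array, giving a cross product.
    public static List<List<String>> explodeAll(List<List<String>> arrays) {
        List<List<String>> rows = new ArrayList<>();
        rows.add(new ArrayList<>()); // start from a single input row
        for (List<String> array : arrays) {
            List<List<String>> next = new ArrayList<>();
            for (List<String> row : rows) {
                for (String value : array) {
                    List<String> extended = new ArrayList<>(row);
                    extended.add(value);
                    next.add(extended);
                }
            }
            rows = next;
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> advisory = Arrays.asList("adv1", "adv2");
        List<String> fieldNotice = Arrays.asList("fn1", "fn2", "fn3");
        // 2 advisories x 3 field notices = 6 rows, each advisory repeated 3 times.
        for (List<String> row : explodeAll(Arrays.asList(advisory, fieldNotice))) {
            System.out.println(row);
        }
    }
}
```

Exploding each array in its own DataFrame, as suggested above, keeps the row count equal to the size of that one array.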
I'm trying to perform JavaRDD actions on Row data, but I'm not able to parse or iterate over the JavaRDD<Row>.
Schema:
root
|-- categories: array (nullable = true)
| |-- element: string (containsNull = true)
|-- discount: long (nullable = true)
|-- expiration: string (nullable = true)
|-- id: long (nullable = true)
|-- maxCashback: string (nullable = true)
|-- minTicket: long (nullable = true)
|-- name: string (nullable = true)
|-- rules: struct (nullable = true)
| |-- cardRequired: boolean (nullable = true)
| |-- cardType: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- usageLimit: long (nullable = true)
| |-- vendor: array (nullable = true)
| | |-- element: string (containsNull = true)
Data:
+--------------------+--------+----------+---+-----------+---------+--------------------+--------------------+
| categories|discount|expiration| id|maxCashback|minTicket| name| rules|
+--------------------+--------+----------+---+-----------+---------+--------------------+--------------------+
| [Movie, Event]| null|31-03-2018| 1| 100| 1|ICICI Bank Credit...|[true,WrappedArra...|
| [Movie]| 10|30-11-2017| 2| 100| 2|RBL Credit Card O...|[true,WrappedArra...|
| [Movie]| null|30-11-2017| 3| 150| 2|SBI RUPAY PLATINU...|[true,WrappedArra...|
| [Movie]| null|31-10-2017| 4| 150| 2|IDEA Select Prepa...|[true,WrappedArra...|
|[Movie, Event, Sp...| 10|31-10-2017| 5| 150| 1|Mobikwik Wallet O...|[true,WrappedArra...|
|[Movie, Event, Sp...| null| null| 6| {}| 1| Payback Point|[null,WrappedArra...|
+--------------------+--------+----------+---+-----------+---------+--------------------+--------------------+
Code Snippet:
JavaRDD<Row> applicableOffers = offers.toJavaRDD();
applicableOffers.foreach((a)->{
int fieldNoTicket = a.fieldIndex("minTicket");
int fieldNoCashback = a.fieldIndex("maxCashback");
int fieldNoDiscount=a.fieldIndex("discount");
System.out.println("a : " +a);
});
Output:
a : [WrappedArray(Movie, Event),null,31-03-2018,1,100,1,ICICI Bank Credit Card Offer,[true,WrappedArray(Credit),null,WrappedArray(ICICI)]]
a : [WrappedArray(Movie),10,30-11-2017,2,100,2,RBL Credit Card Offer,[true,WrappedArray(Credit),15,WrappedArray(RBL)]]
a : [WrappedArray(Movie),null,30-11-2017,3,150,2,SBI RUPAY PLATINUM DEBIT CARD OFFER,[true,WrappedArray(Platinum Debit),null,WrappedArray(SBI)]]
a : [WrappedArray(Movie),null,31-10-2017,4,150,2,IDEA Select Prepaid Offer,[true,WrappedArray(SIM),null,WrappedArray(IDEA)]]
a : [WrappedArray(Movie, Event, Sports),10,31-10-2017,5,150,1,Mobikwik Wallet Offer,[true,WrappedArray(eWallet),null,WrappedArray(Mobikwik)]]
a : [WrappedArray(Movie, Event, Sports),null,null,6,{},1,Payback Point,[null,WrappedArray(Credit, Debit),null,WrappedArray(ICICI,SBI,Canara)]]
All I need to do is run an action that calculates the discount for an order of USD 1000 and outputs the value and name of each offer, using Apache Spark with Java.
I managed to find a workaround: use fieldIndex(colName) to get the column index, then getString(index) or getLong(index) to read the value.
int orderValue = 1000; // order value in USD
applicableOffers.foreach((a) -> {
    // Look up the column positions by name, then read the typed values.
    int nameIdx = a.fieldIndex("name");
    int discountIdx = a.fieldIndex("discount");
    String offerName = a.getString(nameIdx);
    // discount is nullable; getLong fails on null, so fall back to 0 here.
    long discount = a.isNullAt(discountIdx) ? 0L : a.getLong(discountIdx);
    System.out.println("Offer:" + offerName + " Total:" + computeCashBack(orderValue, discount));
});
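The computeCashBack helper is not shown in the answer. A minimal sketch of what it might look like is given below, under the assumption (not stated in the original) that discount is a percentage of the order value and that a missing discount means no cashback. The class name is hypothetical.

```java
class CashbackSketch {
    // Assumed rule: cashback = orderValue * discount / 100, and 0 when the discount is absent.
    public static long computeCashBack(int orderValue, Long discountPercent) {
        if (discountPercent == null) {
            return 0L;
        }
        return (long) orderValue * discountPercent / 100;
    }

    public static void main(String[] args) {
        int orderValue = 1000; // order value in USD
        System.out.println("Offer:RBL Credit Card Offer Total:" + computeCashBack(orderValue, 10L));
        System.out.println("Offer:Payback Point Total:" + computeCashBack(orderValue, null));
    }
}
```

The real rule may also involve the maxCashback and minTicket columns; those are left out because the original does not say how they apply.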
I have this nested schema format in JavaSchemaRDD
root
|-- ProductInfo: struct (nullable = true)
| |-- Features: string (nullable = true)
| |-- ImgURL: string (nullable = true)
| |-- Name: string (nullable = true)
| |-- Price: string (nullable = true)
| |-- ProductID: string (nullable = true)
|-- Reviews: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- Author: string (nullable = true)
| | |-- Content: string (nullable = true)
| | |-- Date: string (nullable = true)
| | |-- Overall: string (nullable = true)
| | |-- ReviewID: string (nullable = true)
| | |-- Title: string (nullable = true)
|-- _corrupt_record: string (nullable = true)
I would like to select the Name of the product based on its Overall rating.
I wrote the following:
JavaSchemaRDD variable = sqlContext.sql("SELECT ProductInfo.Name FROM Table" + "WHERE Reviews.element.Overall=5.0" + "ORDER BY c");
But it seems there is a mistake.
What would be the right one?
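One problem is visible without running Spark: the string literals are concatenated without separating spaces, so the query text contains "TableWHERE" and "5.0ORDER". The plain-Java sketch below only demonstrates that concatenation issue; whether the rest of the query (the Reviews.element.Overall path and the ORDER BY c column) is accepted depends on the Spark version and the registered table, which this sketch does not address. The class and method names are hypothetical.

```java
class QueryStringSketch {
    // The query exactly as written in the question: no spaces between the pieces.
    public static String originalQuery() {
        return "SELECT ProductInfo.Name FROM Table" + "WHERE Reviews.element.Overall=5.0" + "ORDER BY c";
    }

    // Same pieces with a trailing space added so the keywords stay separate.
    public static String spacedQuery() {
        return "SELECT ProductInfo.Name FROM Table " + "WHERE Reviews.element.Overall=5.0 " + "ORDER BY c";
    }

    public static void main(String[] args) {
        System.out.println(originalQuery());
        System.out.println(spacedQuery());
    }
}
```

Printing the query string before passing it to sqlContext.sql is a quick way to catch this kind of mistake.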