Problem Description
I am writing a simple JSON analyzer in order to perform syntax analysis of JSON strings. I receive JSON strings with arbitrary structure and I want to export their syntactic structure. As a result, I want to get a tree that describes the format of the JSON string (keys, values, arrays, etc.) and the type of every element. I have already found the syntax definition of JSON (shown below):
object
    {}
    { members }
members
    pair
    pair , members
pair
    string : value
array
    []
    [ elements ]
elements
    value
    value , elements
value
    string
    number
    object
    array
    true
    false
    null
Example
JSON String:
{"widget": {
"null": null,
"window": {
153: "This is string",
"boolean": true,
"int": 500,
"float": 5.555
}
}}
And I want to get something like:
{ KEY_STR : {
    KEY_STR : null
    KEY_ARRAY : {
        KEY_INT: VALUE_STR,
        KEY_STR: VALUE_BOOL,
        KEY_STR: VALUE_INT,
        KEY_STR: VALUE_FLOAT
    }
}}
I am using Java with the GSON library.
How I want to use it
I am interested in exporting the abstract tree so that I can create my own messages automatically.
My Question
I have started to implement this using JsonParser: I parse the JSON object and then determine the type of every key and value. But I am wondering whether I am on the right track or just reinventing the wheel. Does anything already exist that exports the abstract syntax tree, or should I implement it myself?
I have no idea what JsonParser will export. But in general, parsing something, exporting an AST in some external form from the AST data structure, reading that exported form back in, and then extracting values from it seems like just a lot of overhead goo to build and maintain.
What you should do is build the JSON parser into your application, parse the JSON to an AST data structure, and simply process that AST structure directly.
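Since the question mentions GSON, here is a minimal sketch of that "process the tree directly" approach (my own illustration, not part of the original answer; it assumes a Gson version that provides the static JsonParser.parseString, and the class and method names are made up for the example). It walks the parsed JsonElement tree and prints a type structure similar to the KEY_STR / VALUE_INT output the question asks for:
import com.google.gson.JsonElement;
import com.google.gson.JsonParser;
import com.google.gson.JsonPrimitive;
import java.util.Map;

public class JsonTypeTree {

    public static void main(String[] args) {
        String json = "{\"widget\": {\"null\": null, \"int\": 500, \"float\": 5.555}}";
        describe(JsonParser.parseString(json), 0);
    }

    // Recursively print the syntactic structure of a parsed JSON element.
    static void describe(JsonElement e, int depth) {
        String pad = indent(depth);
        if (e.isJsonObject()) {
            System.out.println(pad + "{");
            for (Map.Entry<String, JsonElement> entry : e.getAsJsonObject().entrySet()) {
                System.out.println(pad + "  KEY_STR \"" + entry.getKey() + "\" :");
                describe(entry.getValue(), depth + 2);
            }
            System.out.println(pad + "}");
        } else if (e.isJsonArray()) {
            System.out.println(pad + "[");
            for (JsonElement item : e.getAsJsonArray()) {
                describe(item, depth + 1);
            }
            System.out.println(pad + "]");
        } else if (e.isJsonNull()) {
            System.out.println(pad + "VALUE_NULL");
        } else {
            JsonPrimitive p = e.getAsJsonPrimitive();
            if (p.isBoolean()) {
                System.out.println(pad + "VALUE_BOOL");
            } else if (p.isString()) {
                System.out.println(pad + "VALUE_STR");
            } else if (p.getAsString().matches("-?\\d+")) {
                // rough heuristic: no decimal point or exponent means integer
                System.out.println(pad + "VALUE_INT");
            } else {
                System.out.println(pad + "VALUE_FLOAT");
            }
        }
    }

    static String indent(int depth) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < depth; i++) sb.append("  ");
        return sb.toString();
    }
}
Note that Gson models every object key as a string, so distinguishing KEY_INT from KEY_STR as in the example above would require an extra check on the key text itself.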
Frankly, JSON is simple enough that you could write your own recursive descent parser to parse JSON and build the AST, leading you back to the first solution. See https://stackoverflow.com/a/2336769/120163
If you absolutely insist on exporting it, you can find tools that will do that off the shelf.
Our DMS Software Reengineering Toolkit will do that, although it might be a bit heavyweight for this kind of application.
One of the nice things about JSON is the simple grammar. Here's the grammar that DMS uses:
-- JSON.atg: JSON domain grammar for DMS
-- Copyright (C) 2011-2018 Semantic Designs, Inc.; All Rights Reserved
--
-- Update history:
-- Date Initials Changes Made
-- 2011/09/02 CW Created
--
-- Note that dangling commas in lists are off-spec
-- but I (CW) hate dealing with them
--
-- I'm not sure if JSON is supposed to have more than one entity per file
-- but I'm allowing it for robustness
--
-- TODO: name/value pair lists should be associative-commutative
JSON_text = ;
JSON_text = JSON_text object ;
JSON_text = JSON_text array ;
-- unordered set of name/value pairs
-- should be able to use an associative-commutative property directive
object = '{' name_value_pair_list '}' ;
-- empty production is for empty list, but will also allow multiple commas
name_value_pair_list = ;
name_value_pair_list = name_value_pair_list ',' ;
name_value_pair_list = name_value_pair_list name_value_pair ;
name_value_pair = STRING ':' value ;
-- ordered collection of values
array = '[' value_list ']' ;
value_list = value ;
value_list = value_list ',' value ;
value_list = value_list ',' value ',' ;
value = STRING ;
value = NUMBER_INT ;
value = NUMBER_FLOAT ;
value = object ;
value = array ;
value = 'true' ;
value = 'false' ;
value = 'null' ;
Yes, it almost exactly matches the abstract grammar provided by the OP.
Now, with that, you can ask DMS to parse a file and export its AST with this command:
run ..\DomainParser +AST ..\..\..\Examples\One.js
For the JSON file One.js, containing this text:
{
"from": "http://json.org/example.html"
}
{
"glossary": {
"title": "example glossary",
"GlossDiv": {
"title": "S",
"GlossList": {
"GlossEntry": {
"ID": "SGML",
"SortAs": "SGML",
"GlossTerm": "Standard Generalized Markup Language",
"Acronym": "SGML",
"Abbrev": "ISO 8879:1986",
"GlossDef": {
"para": "A meta-markup language, used to create markup languages such as DocBook.",
"GlossSeeAlso": ["GML", "XML"]
},
"GlossSee": "markup"
}
}
}
}
}
<rest of file snipped>
The parser produces an S-expression:
(JSON_text#JSON=2#59406e0^0 Line 1 Column 1 File C:/DMS/Domains/JSON/Examples/One.js
(JSON_text#JSON=2#2199f60^1#59406e0:1 Line 1 Column 1 File C:/DMS/Domains/JSON/Examples/One.js
(JSON_text#JSON=2#21912a0^1#2199f60:1 Line 1 Column 1 File C:/DMS/Domains/JSON/Examples/One.js
(JSON_text#JSON=2#593df00^1#21912a0:1 Line 1 Column 1 File C:/DMS/Domains/JSON/Examples/One.js
|(JSON_text#JSON=2#593d420^1#593df00:1 Line 1 Column 1 File C:/DMS/Domains/JSON/Examples/One.js
| (JSON_text#JSON=2#593c580^1#593d420:1 Line 1 Column 1 File C:/DMS/Domains/JSON/Examples/One.js
| (JSON_text#JSON=1#593bec0^1#593c580:1 Line 1 Column 1 File C:/DMS/Domains/JSON/Examples/One.js)JSON_text
| (object#JSON=4#593c560^1#593c580:2 Line 1 Column 1 File C:/DMS/Domains/JSON/Examples/One.js
| (name_value_pair_list#JSON=7#593c520^1#593c560:1 Line 2 Column 5 File C:/DMS/Domains/JSON/Examples/One.js
| |(name_value_pair_list#JSON=5#593c420^1#593c520:1 Line 2 Column 5 File C:/DMS/Domains/JSON/Examples/One.js)name_value_pair_list
| |(name_value_pair#JSON=8#593c4e0^1#593c520:2 Line 2 Column 5 File C:/DMS/Domains/JSON/Examples/One.js
| | (STRING#JSON=24#593c400^1#593c4e0:1[`from'] Line 2 Column 5 File C:/DMS/Domains/JSON/Examples/One.js)STRING
| | (STRING#JSON=24#593c480^1#593c4e0:2[`http://json.org/example.html'] Line 2 Column 13 File C:/DMS/Domains/JSON/Examples/One.js)STRING
| |)name_value_pair#593c4e0
| )name_value_pair_list#593c520
| )object#593c560
| )JSON_text#593c580
| (object#JSON=4#593d400^1#593d420:2 Line 5 Column 1 File C:/DMS/Domains/JSON/Examples/One.js
| (name_value_pair_list#JSON=7#593d3c0^1#593d400:1 Line 6 Column 5 File C:/DMS/Domains/JSON/Examples/One.js
| (name_value_pair_list#JSON=5#593c5c0^1#593d3c0:1 Line 6 Column 5 File C:/DMS/Domains/JSON/Examples/One.js)name_value_pair_list
| (name_value_pair#JSON=8#593d380^1#593d3c0:2 Line 6 Column 5 File C:/DMS/Domains/JSON/Examples/One.js
| |(STRING#JSON=24#593c5a0^1#593d380:1[`glossary'] Line 6 Column 5 File C:/DMS/Domains/JSON/Examples/One.js)STRING
| |(object#JSON=4#593d360^1#593d380:2 Line 6 Column 17 File C:/DMS/Domains/JSON/Examples/One.js
| | (name_value_pair_list#JSON=7#593d340^1#593d360:1 Line 7 Column 9 File C:/DMS/Domains/JSON/Examples/One.js
| | (name_value_pair_list#JSON=6#593c720^1#593d340:1 Line 7 Column 9 File C:/DMS/Domains/JSON/Examples/One.js
| | (name_value_pair_list#JSON=7#593c6c0^1#593c720:1 Line 7 Column 9 File C:/DMS/Domains/JSON/Examples/One.js
| | |(name_value_pair_list#JSON=5#593c600^1#593c6c0:1 Line 7 Column 9 File C:/DMS/Domains/JSON/Examples/One.js)name_value_pair_list
| | |(name_value_pair#JSON=8#593c640^1#593c6c0:2 Line 7 Column 9 File C:/DMS/Domains/JSON/Examples/One.js
| | | (STRING#JSON=24#593c5e0^1#593c640:1[`title'] Line 7 Column 9 File C:/DMS/Domains/JSON/Examples/One.js)STRING
| | | (STRING#JSON=24#593c620^1#593c640:2[`example glossary'] Line 7 Column 18 File C:/DMS/Domains/JSON/Examples/One.js)STRING
| | |)name_value_pair#593c640
| | )name_value_pair_list#593c6c0
| | )name_value_pair_list#593c720
| | (name_value_pair#JSON=8#593d320^1#593d340:2 Line 8 Column 17 File C:/DMS/Domains/JSON/Examples/One.js
| | (STRING#JSON=24#593c700^1#593d320:1[`GlossDiv'] Line 8 Column 17 File C:/DMS/Domains/JSON/Examples/One.js)STRING
| | (object#JSON=4#593d300^1#593d320:2 Line 8 Column 29 File C:/DMS/Domains/JSON/Examples/One.js
| | |(name_value_pair_list#JSON=7#593d2e0^1#593d300:1 Line 9 Column 13 File C:/DMS/Domains/JSON/Examples/One.js
| | | (name_value_pair_list#JSON=6#593c880^1#593d2e0:1 Line 9 Column 13 File C:/DMS/Domains/JSON/Examples/One.js
| | | (name_value_pair_list#JSON=7#593c820^1#593c880:1 Line 9 Column 13 File C:/DMS/Domains/JSON/Examples/One.js
| | | (name_value_pair_list#JSON=5#593c760^1#593c820:1 Line 9 Column 13 File C:/DMS/Domains/JSON/Examples/One.js)name_value_pair_list
| | | (name_value_pair#JSON=8#593c7e0^1#593c820:2 Line 9 Column 13 File C:/DMS/Domains/JSON/Examples/One.js
| | | |(STRING#JSON=24#593c740^1#593c7e0:1[`title'] Line 9 Column 13 File C:/DMS/Domains/JSON/Examples/One.js)STRING
| | | |(STRING#JSON=24#593c780^1#593c7e0:2[`S'] Line 9 Column 22 File C:/DMS/Domains/JSON/Examples/One.js)STRING
| | | )name_value_pair#593c7e0
| | | )name_value_pair_list#593c820
| | | )name_value_pair_list#593c880
| | | (name_value_pair#JSON=8#593d2c0^1#593d2e0:2 Line 10 Column 25 File C:/DMS/Domains/JSON/Examples/One.js
| | | (STRING#JSON=24#593c860^1#593d2c0:1[`GlossList'] Line 10 Column 25 File C:/DMS/Domains/JSON/Examples/One.js)STRING
| | | (object#JSON=4#593d2a0^1#593d2c0:2 Line 10 Column 38 File C:/DMS/Domains/JSON/Examples/One.js
| | | (name_value_pair_list#JSON=7#593d280^1#593d2a0:1 Line 11 Column 17 File C:/DMS/Domains/JSON/Examples/One.js
| | | |(name_value_pair_list#JSON=5#593c8c0^1#593d280:1 Line 11 Column 17 File C:/DMS/Domains/JSON/Examples/One.js)name_value_pair_list
| | | |(name_value_pair#JSON=8#593d260^1#593d280:2 Line 11 Column 17 File C:/DMS/Domains/JSON/Examples/One.js
| | | | (STRING#JSON=24#593c8a0^1#593d260:1[`GlossEntry'] Line 11 Column 17 File C:/DMS/Domains/JSON/Examples/One.js)STRING
| | | | (object#JSON=4#593d240^1#593d260:2 Line 11 Column 31 File C:/DMS/Domains/JSON/Examples/One.js
| | | | (name_value_pair_list#JSON=7#593d200^1#593d240:1 Line 12 Column 21 File C:/DMS/Domains/JSON/Examples/One.js
| | | | (name_value_pair_list#JSON=6#593d160^1#593d200:1 Line 12 Column 21 File C:/DMS/Domains/JSON/Examples/One.js
| | | | |(name_value_pair_list#JSON=7#593d120^1#593d160:1 Line 12 Column 21 File C:/DMS/Domains/JSON/Examples/One.js
| | | | | (name_value_pair_list#JSON=6#593cde0^1#593d120:1 Line 12 Column 21 File C:/DMS/Domains/JSON/Examples/One.js
| | | | | (name_value_pair_list#JSON=7#593cd60^1#593cde0:1 Line 12 Column 21 File C:/DMS/Domains/JSON/Examples/One.js
| | | | | (name_value_pair_list#JSON=6#593cca0^1#593cd60:1 Line 12 Column 21 File C:/DMS/Domains/JSON/Examples/One.js
| | | | | |(name_value_pair_list#JSON=7#593cc60^1#593cca0:1 Line 12 Column 21 File C:/DMS/Domains/JSON/Examples/One.js
| | | | | | (name_value_pair_list#JSON=6#593cc00^1#593cc60:1 Line 12 Column 21 File C:/DMS/Domains/JSON/Examples/One.js
| | | | | | (name_value_pair_list#JSON=7#593cb80^1#593cc00:1 Line 12 Column 21 File C:/DMS/Domains/JSON/Examples/One.js
| | | | | | (name_value_pair_list#JSON=6#593cb00^1#593cb80:1 Line 12 Column 21 File C:/DMS/Domains/JSON/Examples/One.js
| | | | | | |(name_value_pair_list#JSON=7#593cac0^1#593cb00:1 Line 12 Column 21 File C:/DMS/Domains/JSON/Examples/One.js
| | | | | | | (name_value_pair_list#JSON=6#593ca60^1#593cac0:1 Line 12 Column 21 File C:/DMS/Domains/JSON/Examples/One.js
| | | | | | | (name_value_pair_list#JSON=7#593ca00^1#593ca60:1 Line 12 Column 21 File C:/DMS/Domains/JSON/Examples/One.js
| | | | | | | (name_value_pair_list#JSON=5#593c900^1#593ca00:1 Line 12 Column 21 File C:/DMS/Domains/JSON/Examples/One.js)name_value_pair_list
| | | | | | | (name_value_pair#JSON=8#593c9c0^1#593ca00:2 Line 12 Column 21 File C:/DMS/Domains/JSON/Examples/One.js
| | | | | | | |(STRING#JSON=24#593c8e0^1#593c9c0:1[`ID'] Line 12 Column 21 File C:/DMS/Domains/JSON/Examples/One.js)STRING
| | | | | | | |(STRING#JSON=24#593c920^1#593c9c0:2[`SGML'] Line 12 Column 27 File C:/DMS/Domains/JSON/Examples/One.js)STRING
| | | | | | | )name_value_pair#593c9c0
| | | | | | | )name_value_pair_list#593ca00
| | | | | | | )name_value_pair_list#593ca60
| | | | | | | (name_value_pair#JSON=8#593caa0^1#593cac0:2 Line 13 Column 41 File C:/DMS/Domains/JSON/Examples/One.js
| | | | | | | (STRING#JSON=24#593ca40^1#593caa0:1[`SortAs'] Line 13 Column 41 File C:/DMS/Domains/JSON/Examples/One.js)STRING
| | | | | | | (STRING#JSON=24#593ca80^1#593caa0:2[`SGML'] Line 13 Column 51 File C:/DMS/Domains/JSON/Examples/One.js)STRING
| | | | | | | )name_value_pair#593caa0
| | | | | | |)name_value_pair_list#593cac0
I've truncated the output because nobody really wants to see the whole tree. Now, there's a lot of "extra" stuff in the tree, such as node locations and source line numbers, which can all be easily eliminated or ignored.
Related
I want to select different row values for each row from different columns using some complex rule.
For example I have this data set:
+----------+---+---+---+
| Column A | 1 | 2 | 3 |
+ -------- +---+---+---+
| User 1 | A | H | O |
| User 2 | B | L | J |
| User 3 | A | O | N |
| User 4 | F | S | E |
| User 5 | S | G | V |
+----------+---+---+---+
I want to get something like this:
+----------+---+---+---+---+
| Column A | 1 | 2 | 3 | F |
+ -------- +---+---+---+---+
| User 1 | A | H | O | O |
| User 2 | B | L | J | J |
| User 3 | A | O | N | A |
| User 4 | F | S | E | E |
| User 5 | S | G | V | S |
+----------+---+---+---+---+
The values for column F are selected using a complex rule for which the when function is not applicable. If there are 1000 columns to select from, can I make a UDF do this?
I already tried making a UDF that stores the string of the column name to select the value from, so it can be used to access that column's row value. For example, I tried storing the value 233 (the result of the complex rule) for row 100 and then using it as a column name (column 233) to access that column's value for row 100. However, I never got it to work.
We have a messaging logs table, and we use this table to provide a search UI which lets users search messages by id, status, auditor, or date. The audit table looks like this:
+-----------+----------+---------+---------------------+
| messageId | auditor | status | timestamp |
+-----------+----------+---------+---------------------+
| 10 | program1 | Failed | 2020-08-01 10:00:00 |
| 11 | program2 | success | 2020-08-01 10:01:10 |
| 12 | program3 | Failed | 2020-08-01 10:01:15 |
+-----------+----------+---------+---------------------+
Since a given date range can match many messages, we added pagination to the query. Now, as a new feature, we are adding another table with a one-to-many relation, which contains tags describing the possible reasons for a failure. The failure_tags table looks like this:
+-----------+----------+-------+--------+
| messageId | auditor | type | cause |
+-----------+----------+-------+--------+
| 10 | program1 | type1 | cause1 |
| 10 | program1 | type1 | cause2 |
| 10 | program1 | type2 | cause3 |
+-----------+----------+-------+--------+
Now a general search query for status = 'Failed', using a left join with the other table, retrieves 4 rows:
+-----------+----------+-------+--------+---------------------+
| messageId | auditor | type | cause | timestamp |
+-----------+----------+-------+--------+---------------------+
| 10 | program1 | type1 | cause1 | 2020-08-01 10:00:00 |
| 10 | program1 | type1 | cause2 | 2020-08-01 10:00:00 |
| 10 | program1 | type2 | cause3 | 2020-08-01 10:00:00 |
| 12 | program3 | | | 2020-08-01 10:01:15 |
+-----------+----------+-------+--------+---------------------+
Since the 3 rows for messageId 10 belong to the same message, the requirement is to merge them into one element in the JSON response, so the response will have only 2 elements:
[
{
"messageId": "10",
"auditor": "program1",
"failures": [
{
"type": "type1",
"cause": [
"cause1",
"cause2"
]
},
{
"type": "type2",
"cause": [
"cause3"
]
}
],
"date": "2020-08-01 10:00:00"
},
{
"messageId": "12",
"auditor": "program3",
"failures": [],
"date": "2020-08-01 10:01:15"
}
]
Because of this merge, a pagination request for 10 elements may end up with fewer than 10 results after fetching from the database and merging.
The one solution I could think of is: after merging, if there are fewer results than the page size, run the search again, repeat the combining, and take the top 10 elements. Is there a better solution that gets all the results in one query instead of going to the DB twice or more?
We use plain Spring JDBC, not JPA.
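One way to keep this to a single round trip (a sketch under my own assumptions: Spring's JdbcTemplate, a dialect that supports LIMIT/OFFSET, and illustrative class and method names; the table and column names come from the question) is to paginate the audit table in a derived table first, so the page size applies to messages rather than to joined rows, and only then LEFT JOIN failure_tags:
import java.util.List;
import java.util.Map;
import org.springframework.jdbc.core.JdbcTemplate;

public class FailedMessageSearch {

    private final JdbcTemplate jdbcTemplate;

    public FailedMessageSearch(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    // Page over audit rows first, then LEFT JOIN failure_tags, so the page
    // size applies to messages rather than to the joined (multiplied) rows.
    public List<Map<String, Object>> findFailedPage(int pageSize, int offset) {
        String sql =
            "SELECT a.messageId, a.auditor, a.timestamp, f.type, f.cause " +
            "FROM (SELECT messageId, auditor, status, timestamp " +
            "        FROM audit WHERE status = ? " +
            "       ORDER BY timestamp LIMIT ? OFFSET ?) a " +
            "LEFT JOIN failure_tags f " +
            "  ON f.messageId = a.messageId AND f.auditor = a.auditor";
        return jdbcTemplate.queryForList(sql, "Failed", pageSize, offset);
    }
}
The rows of one page are then grouped by messageId in Java to build the {messageId, auditor, failures[], date} elements shown above.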
I have a task to look into a database (SAP iDoc) that has specific values in it, derived from segments. At the end of the mapping I have to export an XML that has a subcomponent which can have more than one row. My problem is that we have a component with two values that are separated by a qualifier.
Every transaction looks like so:
+----------+-----------+--------+
| QUALF_1 | BETRG_dc | DOCNUM |
+----------+-----------+--------+
| 001 | 20 | xxxxxx |
| 001 | 22 | xxxxxx |
+----------+-----------+--------+
+---------+-----------+-----------+
| QUALF_2 | BETRG_pr | DOCNUM |
+---------+-----------+-----------+
| 013 | 30 | xxxxxx |
| 013 | 40 | xxxxxx |
+---------+-----------+-----------+
My problem is that when the two sources are joined with the built-in transformations, we get a Cartesian product, like so:
+---------+-----------+-----------+
| DOCNUM | BETRG_dc | BETRG_pr |
+---------+-----------+-----------+
| xxxxxx | 20 | 30 |
| xxxxxx | 20 | 40 |
| xxxxxx | 22 | 30 |
| xxxxxx | 22 | 40 |
+---------+-----------+-----------+
As you can see only the first and last rows are correct.
The problem comes from the fact that if BETRG_dc is 0, the whole segment is not sent, so a Filter transformation fails.
What I found out is that the segment numbers of QUALF_1 and QUALF_2 are sequential. So QUALF_1 is, for example, 48 and QUALF_2 is 49.
Can you help me create a Java transformation that adds a row for a missing QUALF_1?
Here is a table of requirements:
+-------+-------+---------------+
| QUALF | BETRG | SegmentNumber |
+-------+-------+---------------+
| 013 | 20 | 48 |
| 001 | 150 | 49 |
| 013 | 15 | 57 |
| 001 | 600 | 58 |
+-------+-------+---------------+
I want the transformation to look at the data and, if we have a source like this:
+-------+-------+---------------+
| QUALF | BETRG | SegmentNumber |
+-------+-------+---------------+
| 001 | 150 | 49 |
| 013 | 15 | 57 |
| 001 | 600 | 58 |
+-------+-------+---------------+
To go ahead and insert a row with segment number 48 and a BETRG value of "0".
I have tried every transformation I can think of.
The expected output should be like this:
+-------+-------+---------------+
| QUALF | BETRG | SegmentNumber |
+-------+-------+---------------+
| 013 | 0 | 48 |
| 001 | 150 | 49 |
| 013 | 15 | 57 |
| 001 | 600 | 58 |
+-------+-------+---------------+
You should join both tables in a Joiner transformation.
Use a left (master) outer join and then take it into a target. Then map the BETRG column from the right table to the target, and the rest of the columns from the left table.
What happens is that whenever there is no match, BETRG will be empty. Take it into an Expression transformation, check whether the value is null or empty, and change it to 0 or whatever value you wish.
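For the null-or-empty check, a minimal sketch (using the same port names that appear in the snippets below; this is an illustration in Java, not Informatica's expression syntax) could look like this:
// After the left outer join, BETRG is null or empty when no detail row matched;
// default it to "0" before writing the output port.
if (BETRG == null || BETRG.trim().isEmpty()) {
    BETRG_out = "0";
} else {
    BETRG_out = BETRG;
}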
Here is what I have created, but unfortunately for now it works on a row level only and not on the whole data set. I am working on making the code run properly:
QUALF_out = QUALF;
BETRG_out= BETRG;
SegmentNumber_out= SegmentNumber;
if(QUALF.equals("001"))
{
segment_new=(SegmentNumber - 1);
}
int colCount=1;
myList.add(SegmentNumber);
System.out.println("SegmentNumber_out: " + segment_new);
if(Arrays.asList(myList).contains(segment_new)){
QUALF_out = QUALF;
BETRG_out= BETRG;
SegmentNumber_out= SegmentNumber;
QUALF_out="013";
BETRG_out="0";
SegmentNumber_out=segment_new;
generateRow();
} else {
QUALF_out = QUALF;
BETRG_out= BETRG;
SegmentNumber_out= SegmentNumber;
generateRow();
}
Here is what works:
// Import Packages tab
import java.util.*;

// Helper Code tab: keeps the QUALF/segment/document combinations seen so far
private ArrayList<String> myList2 = new ArrayList<String>();

// On Input Row tab
QUALF_out = QUALF;
BETRG_out = BETRG;
SegmentNumber_out = SegmentNumber;
DOCNUM = DOCNUM;
// key identifying this qualifier for the current parent segment and document
array_for_search = QUALF + ParentSegmentNumber + DOCNUM;
myList2.add(array_for_search);
System.out.println("myList: " + myList2);
System.out.println("Array: " + myList2.contains("910" + ParentSegmentNumber + DOCNUM));
// if no row with qualifier 910 has been seen for this parent segment/document,
// emit the missing row with BETRG defaulted to "0"
if (!myList2.contains("910" + ParentSegmentNumber + DOCNUM)) {
    QUALF_out = "910";
    BETRG_out = "0";
}
generateRow();
I have a DataFrame (Java) with the following simple schema. Here is an example instance:
+----+-----+---------+
| id | key | Value   |
+----+-----+---------+
| 01 | A   | John    |
| 01 | B   | Nick    |
| 02 | A   | Mary    |
| 02 | B   | Kathy   |
| 02 | C   | Sabrina |
| 03 | B   | George  |
+----+-----+---------+
I would like to transform it to the following:
+----+------+--------+---------+
| id | A    | B      | C       |
+----+------+--------+---------+
| 01 | John | Nick   | null    |
| 02 | Mary | Kathy  | Sabrina |
| 03 | null | George | null    |
+----+------+--------+---------+
I tried the pivot operator (because that is what this actually is), but it only partially worked: once the values A, B, and C become columns, the contents of those columns can only be numeric.
Dataset<Row> pivotTest2 = pivotTest.groupBy(col("id")).pivot("key").count();
What I would actually like is to put the value of the Value column in place of the count, i.e., something of the form .select(col("Value")); even .max("Value") would work, but I cannot use it since Value is not a numeric column.
The following should work for you. Using first as the aggregation keeps the string Value instead of a numeric count:
import static org.apache.spark.sql.functions.*;
Dataset<Row> pivotTest2 = pivotTest.groupBy(col("id")).pivot("key").agg(first("Value"));
pivotTest2.show(false);
which should give you
+---+----+------+-------+
|id |A |B |C |
+---+----+------+-------+
|01 |John|Nick |null |
|03 |null|George|null |
|02 |Mary|Kathy |Sabrina|
+---+----+------+-------+
I have an Apache Spark DataFrame in the following format:
| ID | groupId | phaseName |
|----|-----------|-----------|
| 10 | someHash1 | PhaseA |
| 11 | someHash1 | PhaseB |
| 12 | someHash1 | PhaseB |
| 13 | someHash2 | PhaseX |
| 14 | someHash2 | PhaseY |
Each row represents a phase that happens in a procedure that consists of several of these phases. The ID column represents a sequential order of phases and the groupId column shows which phases belong together.
I want to add a new column to the DataFrame: previousPhaseName. This column should indicate the previous distinct phase from the same procedure. The first phase of a procedure (the one with the minimum ID) will have null as its previous phase. When a phase occurs twice or more, the second (third, ...) occurrence will have the same previousPhaseName. For example:
df =
| ID | groupId | phaseName | prevPhaseName |
|----|-----------|-----------|---------------|
| 10 | someHash1 | PhaseA | null |
| 11 | someHash1 | PhaseB | PhaseA |
| 12 | someHash1 | PhaseB | PhaseA |
| 13 | someHash2 | PhaseX | null |
| 14 | someHash2 | PhaseY | PhaseX |
I am not sure how to implement this. My first approach would be:
create a second empty dataframe df2
for each row in df:
    find the row with groupId = row.groupId, ID < row.ID, and maximum ID
    add this row to df2
join df and df2
Partial Solution using Window Functions
I used window functions to aggregate the name of the previous phase, the number of previous occurrences (not necessarily consecutive) of the current phase in the group, and whether the current and previous phase names are equal:
WindowSpec windowSpecPrev = Window
.partitionBy(df.col("groupId"))
.orderBy(df.col("ID"));
WindowSpec windowSpecCount = Window
.partitionBy(df.col("groupId"), df.col("phaseName"))
.orderBy(df.col("ID"))
.rowsBetween(Long.MIN_VALUE, 0);
df = df
    .withColumn("prevPhase", functions.lag("phaseName", 1).over(windowSpecPrev))
    .withColumn("phaseCount", functions.count("phaseName").over(windowSpecCount))
    .withColumn("prevSame", when(col("prevPhase").equalTo(col("phaseName")), 1).otherwise(0));
df =
| ID | groupId | phaseName | prevPhase | phaseCount | prevSame |
|----|-----------|-----------|-------------|------------|----------|
| 10 | someHash1 | PhaseA | null | 1 | 0 |
| 11 | someHash1 | PhaseB | PhaseA | 1 | 0 |
| 12 | someHash1 | PhaseB | PhaseB | 2 | 1 |
| 13 | someHash2 | PhaseX | null | 1 | 0 |
| 14 | someHash2 | PhaseY | PhaseX | 1 | 0 |
This is still not what I wanted to achieve, but it is good enough for now.
Further Ideas
To get the name of the previous distinct phase I see three possibilities that I have not investigated thoroughly:
Implement my own lag function that does not take an offset but recursively checks the previous row until it finds a value different from the current row. (Though I don't think it is possible to use custom analytic window functions in Spark SQL.)
Find a way to dynamically set the offset of the lag function according to the value of phaseCount. (That may fail if the previous occurrences of the same phaseName do not appear in a single sequence.)
Use a UserDefinedAggregateFunction over the window that stores the ID and phaseName of the first given input and looks for the highest ID with a different phaseName.
I was able to solve this problem in the following way:
Get the (ordinary) previous phase.
Introduce a new id that groups phases that occur in sequential order (with the help of this answer). This takes two steps: first, check whether the current and previous phase names are equal and assign a seqCount value accordingly; second, compute a cumulative sum over this value.
Assign the previous phase of the first row of a sequential group to all its members.
Implementation
WindowSpec specGroup = Window.partitionBy(col("groupId"))
.orderBy(col("ID"));
WindowSpec specSeqGroupId = Window.partitionBy(col("groupId"))
.orderBy(col("ID"))
.rowsBetween(Long.MIN_VALUE, 0);
WindowSpec specPrevDiff = Window.partitionBy(col("groupId"), col("seqGroupId"))
.orderBy(col("ID"))
.rowsBetween(Long.MIN_VALUE, 0);
df.withColumn("prevPhase", coalesce(lag("phaseName", 1).over(specGroup), lit("NO_PREV")))
.withColumn("seqCount", when(col("prevPhase").equalTo(col("phaseName")).or(col("prevPhase").equalTo("NO_PREV")),0).otherwise(1))
.withColumn("seqGroupId", sum("seqCount").over(specSeqGroupId))
.withColumn("prevDiff", first("prevPhase").over(specPrevDiff));
Result
df =
| ID | groupId   | phaseName | prevPhase | seqCount | seqGroupId | prevDiff |
|----|-----------|-----------|-----------|----------|------------|----------|
| 10 | someHash1 | PhaseA    | NO_PREV   | 0        | 0          | NO_PREV  |
| 11 | someHash1 | PhaseB    | PhaseA    | 1        | 1          | PhaseA   |
| 12 | someHash1 | PhaseB    | PhaseB    | 0        | 1          | PhaseA   |
| 13 | someHash2 | PhaseX    | NO_PREV   | 0        | 0          | NO_PREV  |
| 14 | someHash2 | PhaseY    | PhaseX    | 1        | 1          | PhaseX   |
Any suggestions, especially regarding the efficiency of these operations, are appreciated.
I guess you can use Spark window (row frame) functions. Check the API documentation and the following post:
https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
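For illustration, a minimal sketch of such a window in the Java API (assuming the column names from the question; lag only gives the immediately preceding phaseName per groupId, which is the starting point rather than the full distinct-previous-phase logic shown above):
import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;

public class PrevPhaseExample {
    public static Dataset<Row> addPrevPhase(Dataset<Row> df) {
        // one window per procedure, ordered by the sequential ID
        WindowSpec byGroup = Window.partitionBy(col("groupId")).orderBy(col("ID"));
        // phaseName of the immediately preceding row (null for the first row of a group)
        return df.withColumn("prevPhaseName", lag(col("phaseName"), 1).over(byGroup));
    }
}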