Apache Spark find first different preceding row in Dataframe - java

I have an Apache Spark Dataframe of the following format
| ID | groupId | phaseName |
|----|-----------|-----------|
| 10 | someHash1 | PhaseA |
| 11 | someHash1 | PhaseB |
| 12 | someHash1 | PhaseB |
| 13 | someHash2 | PhaseX |
| 14 | someHash2 | PhaseY |
Each row represents a phase that happens in a procedure that consists of several of these phases. The ID column represents a sequential order of phases and the groupId column shows which phases belong together.
I want to add a new column to the dataframe: previousPhaseName. This column should indicate the previous different phase from the same procedure. The first phase of a procedure (the one with the minimum ID) will have null as its previous phase. When a phase occurs twice or more, the second (third, ...) occurrence will have the same previousPhaseName. For example:
df =
| ID | groupId | phaseName | prevPhaseName |
|----|-----------|-----------|---------------|
| 10 | someHash1 | PhaseA | null |
| 11 | someHash1 | PhaseB | PhaseA |
| 12 | someHash1 | PhaseB | PhaseA |
| 13 | someHash2 | PhaseX | null |
| 14 | someHash2 | PhaseY | PhaseX |
I am not sure how to implement this. My first approach would be:
create a second empty dataframe df2
for each row in df:
find the row with groupId = row.groupId, ID < row.ID, and maximum id
add this row to df2
join df1 and df2
Partial Solution using Window Functions
I used window functions to compute the name of the previous phase, the number of previous occurrences (not necessarily in a row) of the current phase in the group, and whether the current and previous phase names are equal:
WindowSpec windowSpecPrev = Window
.partitionBy(df.col("groupId"))
.orderBy(df.col("ID"));
WindowSpec windowSpecCount = Window
.partitionBy(df.col("groupId"), df.col("phaseName"))
.orderBy(df.col("ID"))
.rowsBetween(Long.MIN_VALUE, 0);
df
.withColumn("prevPhase", functions.lag("phaseName", 1).over(windowSpecPrev))
.withColumn("phaseCount", functions.count("ID").over(windowSpecCount))
.withColumn("prevSame", when(col("prevPhase").equalTo(col("phaseName")),1).otherwise(0));
df =
| ID | groupId | phaseName | prevPhase | phaseCount | prevSame |
|----|-----------|-----------|-------------|------------|----------|
| 10 | someHash1 | PhaseA | null | 1 | 0 |
| 11 | someHash1 | PhaseB | PhaseA | 1 | 0 |
| 12 | someHash1 | PhaseB | PhaseB | 2 | 1 |
| 13 | someHash2 | PhaseX | null | 1 | 0 |
| 14 | someHash2 | PhaseY | PhaseX | 1 | 0 |
This is still not what I wanted to achieve but good enough for now
Further Ideas
To get the name of the previous distinct phase, I see three possibilities that I have not investigated thoroughly:
Implement a custom lag function that does not take an offset but recursively checks the previous row until it finds a value that differs from the current row. (Though I don't think it's possible to use custom analytic window functions in Spark SQL.)
Find a way to dynamically set the offset of the lag function according to the value of phaseCount. (That may fail if the previous occurrences of the same phaseName do not appear in a single sequence.)
Use a UserDefinedAggregateFunction over the window that stores the ID and phaseName of the first given input and searches for the highest ID with a different phaseName.

I was able to solve this problem in the following way:
Get the (ordinary) previous phase.
Introduce a new id that groups phases that occur in sequential order. (With help of this answer). This takes two steps. First checking whether the current and previous phase names are equal and assigning a groupCount value accordingly. Second computing a cumulative sum over this value.
Assign the previous phase of the first row of a sequential group to all its members.
Implementation
WindowSpec specGroup = Window.partitionBy(col("groupId"))
.orderBy(col("ID"));
WindowSpec specSeqGroupId = Window.partitionBy(col("groupId"))
.orderBy(col("ID"))
.rowsBetween(Long.MIN_VALUE, 0);
WindowSpec specPrevDiff = Window.partitionBy(col("groupId"), col("seqGroupId"))
.orderBy(col("ID"))
.rowsBetween(Long.MIN_VALUE, 0);
df.withColumn("prevPhase", coalesce(lag("phaseName", 1).over(specGroup), lit("NO_PREV")))
.withColumn("seqCount", when(col("prevPhase").equalTo(col("phaseName")).or(col("prevPhase").equalTo("NO_PREV")),0).otherwise(1))
.withColumn("seqGroupId", sum("seqCount").over(specSeqGroupId))
.withColumn("prevDiff", first("prevPhase").over(specPrevDiff));
Result
df =
| ID | groupId | phaseName | prevPhase | seqCount | seqGroupId | prevDiff |
|----|-----------|-----------|-----------|----------|------------|----------|
| 10 | someHash1 | PhaseA | NO_PREV | 0 | 0 | NO_PREV |
| 11 | someHash1 | PhaseB | PhaseA | 1 | 1 | PhaseA |
| 12 | someHash1 | PhaseB | PhaseB | 0 | 1 | PhaseA |
| 13 | someHash2 | PhaseX | NO_PREV | 0 | 0 | NO_PREV |
| 14 | someHash2 | PhaseY | PhaseX | 1 | 1 | PhaseX |
Any suggestions, especially in terms of the efficiency of these operations, are appreciated.
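As a cross-check of the expected prevDiff values, the same logic can be written as a single ordered pass in plain Java, independent of Spark. This is only a sketch: the class and method names are hypothetical, and it assumes the rows are already sorted by ID.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PrevDistinctPhase {

    /**
     * Each row is {groupId, phaseName}; rows are assumed to be sorted by ID.
     * Returns, per row, the most recent different phase in the same group, or null.
     */
    static List<String> prevDistinct(List<String[]> rows) {
        Map<String, String> currentPhase = new HashMap<>();  // groupId -> phase of the current run
        Map<String, String> prevDiffOfRun = new HashMap<>(); // groupId -> phase that preceded the current run
        List<String> result = new ArrayList<>();
        for (String[] row : rows) {
            String groupId = row[0];
            String phase = row[1];
            String last = currentPhase.get(groupId);
            if (last != null && !last.equals(phase)) {
                // The phase changed, so a new run starts and remembers its predecessor.
                prevDiffOfRun.put(groupId, last);
            }
            currentPhase.put(groupId, phase);
            result.add(prevDiffOfRun.get(groupId));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[] {"someHash1", "PhaseA"},
                new String[] {"someHash1", "PhaseB"},
                new String[] {"someHash1", "PhaseB"},
                new String[] {"someHash2", "PhaseX"},
                new String[] {"someHash2", "PhaseY"});
        // prints [null, PhaseA, PhaseA, null, PhaseX]
        System.out.println(prevDistinct(rows));
    }
}
```

The window-function version has to express "remember the predecessor of the current run" through seqGroupId because a window cannot carry state from row to row the way this loop does.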

I guess you can use Spark window (row frame) functions. Check the API documentation and the following post.
https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html

Related

Select row value from a different column in Spark Java using some rule

I want to select different row values for each row from different columns using some complex rule.
For example I have this data set:
+----------+---+---+---+
| Column A | 1 | 2 | 3 |
+ -------- +---+---+---+
| User 1 | A | H | O |
| User 2 | B | L | J |
| User 3 | A | O | N |
| User 4 | F | S | E |
| User 5 | S | G | V |
+----------+---+---+---+
I want to get something like this:
+----------+---+---+---+---+
| Column A | 1 | 2 | 3 | F |
+ -------- +---+---+---+---+
| User 1 | A | H | O | O |
| User 2 | B | L | J | J |
| User 3 | A | O | N | A |
| User 4 | F | S | E | E |
| User 5 | S | G | V | S |
+----------+---+---+---+---+
The values in column F are chosen by a complex rule for which the when function is not applicable. If there are 1000 columns to select from, can I make a UDF do this?
I already tried making a UDF that returns the name of the column to select the value from, so that the name could be used to read that column's value in the same row. For example, I stored the value 233 (the result of the complex rule) for row 100 and then tried to use it as a column name (column 233) to read that column's value in row 100. However, I never got it to work.
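One possible way to think about this: a column name cannot be resolved per row, but if the candidate columns are first packed into a single array (in Spark, for instance, with functions.array), the rule can return a position and the value can be read from that array inside one function. The plain-Java sketch below shows only that per-row step. The class name, the method name, and the rule itself are hypothetical placeholders, since the question does not state the real rule.

```java
import java.util.function.ToIntFunction;

class ColumnPicker {

    /** Applies the rule to the row's values and returns the value at the chosen position. */
    static String pick(String[] values, ToIntFunction<String[]> rule) {
        int index = rule.applyAsInt(values);
        if (index < 0 || index >= values.length) {
            return null; // the rule pointed outside the available columns
        }
        return values[index];
    }

    public static void main(String[] args) {
        // Placeholder rule, NOT the asker's rule: take the last column when the
        // first value is "A", otherwise take the first column.
        ToIntFunction<String[]> rule = v -> v[0].equals("A") ? v.length - 1 : 0;

        System.out.println(pick(new String[] {"A", "H", "O"}, rule)); // prints O
        System.out.println(pick(new String[] {"S", "G", "V"}, rule)); // prints S
    }
}
```

Wrapped in a UDF that receives the array column, the body would be the same two steps: compute the index, then read the element.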

JSprit not using closer vehicle, due to time issues

Single-job schedule with two vehicles. One vehicle starts close to the job, the other starts far from the job. Seems it should prefer to use the closer vehicle, as there's a cost-per-distance. But it uses the farther one, if there's a non-zero setCostPerWaitingTime(). Why?
public void testUseCloserVehicleWhenCostsAreSet() throws Exception {
VehicleType type = VehicleTypeImpl.Builder.newInstance("generic")
.setCostPerDistance(0.017753)
//.setCostPerTransportTime(1.0)
.setCostPerWaitingTime(1.0)
.build();
double serviceTime = 420.0;
Location pointA = Location.newInstance(100.0, 100.0);
Location pointB = Location.newInstance(100.0, 200.0);
Location closeToPointA = Location.newInstance(110.0, 110.0);
Location farFromPointA = Location.newInstance(500.0, 110.0);
VehicleRoutingProblem vrp = VehicleRoutingProblem.Builder
.newInstance()
.setFleetSize(VehicleRoutingProblem.FleetSize.FINITE)
.addVehicle(VehicleImpl.Builder.newInstance("CloseBy")
.setType(type)
.setStartLocation(closeToPointA)
.build())
.addVehicle(VehicleImpl.Builder.newInstance("FarAway")
.setType(type)
.setStartLocation(farFromPointA)
.build())
.addJob(Shipment.Builder.newInstance("123")
.setPickupLocation(pointA)
.setPickupServiceTime(serviceTime)
.setDeliveryLocation(pointB)
.setDeliveryServiceTime(serviceTime)
.setPickupTimeWindow(new TimeWindow(36000.0, 36360.0))
.setDeliveryTimeWindow(new TimeWindow(36360.0, 36720.0))
.setMaxTimeInVehicle(720.0)
.build())
.build();
VehicleRoutingAlgorithm algorithm = Jsprit.Builder.newInstance(vrp)
.buildAlgorithm();
VehicleRoutingProblemSolution bestSolution = Solutions.bestOf(algorithm.searchSolutions());
SolutionPrinterWithTimes.print(vrp, bestSolution, SolutionPrinterWithTimes.Print.VERBOSE);
System.out.flush();
assertEquals("CloseBy", bestSolution.getRoutes().iterator().next().getVehicle().getId());
}
Result:
+----------------------------------------------------------+
| solution |
+---------------+------------------------------------------+
| indicator | value |
+---------------+------------------------------------------+
| costs | 35616.03246830352 |
| noVehicles | 1 |
| unassgndJobs | 0 |
+----------------------------------------------------------+
+--------------------------------------------------------------------------------------------------------------------------------+
| detailed solution |
+---------+----------------------+-----------------------+-----------------+-----------------+-----------------+-----------------+
| route | vehicle | activity | job | arrTime | endTime | costs |
+---------+----------------------+-----------------------+-----------------+-----------------+-----------------+-----------------+
| 1 | FarAway | start | - | undef | 0 | 0 |
| 1 | FarAway | pickupShipment | 123 | 400 | 36420 | 35607 |
| 1 | FarAway | deliverShipment | 123 | 36520 | 36940 | 35609 |
| 1 | FarAway | end | - | 37350 | undef | 35616 |
+--------------------------------------------------------------------------------------------------------------------------------+
junit.framework.ComparisonFailure:
Expected :CloseBy
Actual :FarAway
I suspect it has something to do with the vehicle arriving at 400 for a job that can't start until 36000. Is there a way to prevent that, so the vehicle starts only as early as needed to reach the first job? Does setCostPerWaitingTime do something other than what I think?
Here's a comparison of the job with only the CloseBy vehicle.
+--------------------------------------------------------------------------------------------------------------------------------+
| detailed solution |
+---------+----------------------+-----------------------+-----------------+-----------------+-----------------+-----------------+
| route | vehicle | activity | job | arrTime | endTime | costs |
+---------+----------------------+-----------------------+-----------------+-----------------+-----------------+-----------------+
| 1 | FarAway | start | - | undef | 0 | 0 |
| 1 | FarAway | pickupShipment | 123 | 400 | 36420 | 35607 |
| 1 | FarAway | deliverShipment | 123 | 36520 | 36940 | 35609 |
| 1 | FarAway | end | - | 37350 | undef | 35616 |
+--------------------------------------------------------------------------------------------------------------------------------+
+--------------------------------------------------------------------------------------------------------------------------------+
| detailed solution |
+---------+----------------------+-----------------------+-----------------+-----------------+-----------------+-----------------+
| route | vehicle | activity | job | arrTime | endTime | costs |
+---------+----------------------+-----------------------+-----------------+-----------------+-----------------+-----------------+
| 1 | CloseBy | start | - | undef | 0 | 0 |
| 1 | CloseBy | pickupShipment | 123 | 14 | 36420 | 35986 |
| 1 | CloseBy | deliverShipment | 123 | 36520 | 36940 | 35988 |
| 1 | CloseBy | end | - | 37031 | undef | 35989 |
+--------------------------------------------------------------------------------------------------------------------------------+
I think the problem is that the CloseBy vehicle arrives sooner, so it pays higher wait costs, while the other vehicle is driving during that time so pays less in wait costs. This would be mitigated if the vehicle didn't start until it needed to, but I'm unsure how to set that up.
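That suspicion can be checked with a little arithmetic. The sketch below is not jsprit code. It assumes Euclidean distance, a speed of one distance unit per time unit, and a departure at time 0, which is consistent with the arrival times 14 and 400 in the printed solutions. The class and method names are hypothetical.

```java
class WaitingCostCheck {

    /** Cost accumulated until the pickup can start: driven distance plus waiting for the window to open. */
    static double costToPickup(double startX, double startY,
                               double pickupX, double pickupY,
                               double windowStart,
                               double costPerDistance,
                               double costPerWaitingTime) {
        double distance = Math.hypot(startX - pickupX, startY - pickupY);
        double arrival = distance; // assumption: departure at time 0, unit speed
        double waiting = Math.max(0.0, windowStart - arrival);
        return distance * costPerDistance + waiting * costPerWaitingTime;
    }

    public static void main(String[] args) {
        double closeBy = costToPickup(110.0, 110.0, 100.0, 100.0, 36000.0, 0.017753, 1.0);
        double farAway = costToPickup(500.0, 110.0, 100.0, 100.0, 36000.0, 0.017753, 1.0);
        System.out.printf("CloseBy: %.0f%n", closeBy); // prints CloseBy: 35986
        System.out.printf("FarAway: %.0f%n", farAway); // prints FarAway: 35607
    }
}
```

Both numbers match the costs shown at the pickup activity in the two solutions, so under these assumptions the extra 386 time units of driving replace waiting that costs 1.0 per unit with distance that costs only about 0.018 per unit.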

Java code to write a missing value in a mapping Informatica PowerCenter

I have a task to take a look in a database (SAP iDoc) that has specific values in it derived by segments. I have to export an xml at the end of the mapping that has a subcomponent that can have more than one row. My problem is that we have a component that has two values that are separated by a qualifier.
Every transaction looks like so:
+----------+-----------+--------+
| QUALF_1 | BETRG_dc | DOCNUM |
+----------+-----------+--------+
| 001 | 20 | xxxxxx |
| 001 | 22 | xxxxxx |
+----------+-----------+--------+
+---------+-----------+-----------+
| QUALF_2 | BETRG_pr | DOCNUM |
+---------+-----------+-----------+
| 013 | 30 | xxxxxx |
| 013 | 40 | xxxxxx |
+---------+-----------+-----------+
My problem is that when the two are joined with the built-in transformations, we get a Cartesian product, like so:
+---------+-----------+-----------+
| DOCNUM | BETRG_dc | BETRG_pr |
+---------+-----------+-----------+
| xxxxxx | 20 | 30 |
| xxxxxx | 20 | 40 |
| xxxxxx | 22 | 30 |
| xxxxxx | 22 | 40 |
+---------+-----------+-----------+
As you can see only the first and last rows are correct.
The problem comes from the fact that if BETRG_dc is 0 the whole segment is not being sent so a filter transformation fails.
What I found out is that the segment numbers of QUALF_1 and QUALF_2 are sequential. So if QUALF_1 is, for example, 48, then QUALF_2 is 49.
Can you help me create a Java transformation that adds a row for a missing QUALF_1?
Here is a table of requirements:
+-------+-------+---------------+
| QUALF | BETRG | SegmentNumber |
+-------+-------+---------------+
| 013 | 20 | 48 |
| 001 | 150 | 49 |
| 013 | 15 | 57 |
| 001 | 600 | 58 |
+-------+-------+---------------+
I want the transformation to take a look and if we have a source like this:
+-------+-------+---------------+
| QUALF | BETRG | SegmentNumber |
+-------+-------+---------------+
| 001 | 150 | 49 |
| 013 | 15 | 57 |
| 001 | 600 | 58 |
+-------+-------+---------------+
To go ahead and insert a row with the segment id 48 and a value for BETRG of "0".
I have tried every transformation I can.
The expected output should be like this:
+-------+-------+---------------+
| QUALF | BETRG | SegmentNumber |
+-------+-------+---------------+
| 013 | 0 | 48 |
| 001 | 150 | 49 |
| 013 | 15 | 57 |
| 001 | 600 | 58 |
+-------+-------+---------------+
You should join both of the tables in a Joiner transformation.
Use a left (master) outer join and then take it into a target. Then map the BETRG column from the right table to the target, and the rest of the columns from the left table.
Whenever there is no match, BETRG will be empty. Take it into an Expression transformation, check whether the value is null or empty, and change it to 0 or whatever value you wish.
Here is what I have created, but unfortunately for now it works on a row level only and not on the whole data. I am working on making the code run properly:
QUALF_out = QUALF;
BETRG_out= BETRG;
SegmentNumber_out= SegmentNumber;
if(QUALF.equals("001"))
{
segment_new=(SegmentNumber - 1);
}
int colCount=1;
myList.add(SegmentNumber);
System.out.println("SegmentNumber_out: " + segment_new);
if(Arrays.asList(myList).contains(segment_new)){
QUALF_out = QUALF;
BETRG_out= BETRG;
SegmentNumber_out= SegmentNumber;
QUALF_out="013";
BETRG_out="0";
SegmentNumber_out=segment_new;
generateRow();
} else {
QUALF_out = QUALF;
BETRG_out= BETRG;
SegmentNumber_out= SegmentNumber;
generateRow();
}
Here is what works:
import java.util.*;
private ArrayList<String> myList2 = new ArrayList<String>();
QUALF_out = QUALF;
BETRG_out = BETRG;
SegmentNumber_out = SegmentNumber;
DOCNUM = DOCNUM;
array_for_search = QUALF + ParentSegmentNumber + DOCNUM ;
myList2.add(array_for_search);
System.out.println("myList: " + myList2);
System.out.println("Array: " + myList2.contains("910" + ParentSegmentNumber + DOCNUM));
if(!myList2.contains("910" + ParentSegmentNumber + DOCNUM)){
QUALF_out="910";
BETRG_out="0";
}
generateRow();
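For reference, the requirement from the table of requirements (insert a 013 row with BETRG 0 in front of every 001 row whose preceding segment number is missing) can be sketched in plain Java outside PowerCenter. The Row class, the method name, and the assumption that all rows of a document are available at once are illustrative and not part of the Java transformation API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class MissingSegmentFiller {

    /** Hypothetical in-memory representation of one input row. */
    static final class Row {
        final String qualf;
        final String betrg;
        final int segment;

        Row(String qualf, String betrg, int segment) {
            this.qualf = qualf;
            this.betrg = betrg;
            this.segment = segment;
        }

        @Override
        public String toString() {
            return qualf + "|" + betrg + "|" + segment;
        }
    }

    /** Returns the rows with a 013/"0" row inserted before each 001 row that lacks a 013 row at segment - 1. */
    static List<Row> fill(List<Row> rows) {
        // First pass: remember the segment numbers of all 013 rows.
        Set<Integer> segmentsOf013 = new HashSet<>();
        for (Row r : rows) {
            if ("013".equals(r.qualf)) {
                segmentsOf013.add(r.segment);
            }
        }
        // Second pass: emit each row, preceded by a filler row where the 013 partner is missing.
        List<Row> out = new ArrayList<>();
        for (Row r : rows) {
            if ("001".equals(r.qualf) && !segmentsOf013.contains(r.segment - 1)) {
                out.add(new Row("013", "0", r.segment - 1));
            }
            out.add(r);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Row> source = Arrays.asList(
                new Row("001", "150", 49),
                new Row("013", "15", 57),
                new Row("001", "600", 58));
        // prints [013|0|48, 001|150|49, 013|15|57, 001|600|58]
        System.out.println(fill(source));
    }
}
```

The two passes are the reason a purely row-level transformation cannot do this: the decision for a 001 row depends on whether a 013 row exists anywhere in the same document.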

Iterating through lists of list in java

I have two tables User and Roles with one-to-one relation as below.
User
_________________________________________
| Id | user_name | full_name | creator |
_________________________________________
| 1 | a | A | a |
| 2 | b | B | a |
| 3 | c | C | a |
| 4 | d | D | c |
| 5 | e | E | c |
| 6 | f | F | e |
| 7 | g | G | e |
| 8 | h | H | e |
| 9 | i | I | e |
|10 | j | J | i |
_________________________________________
Roles
_______________________________________
| id | user_mgmt | others | user_id |
_______________________________________
| 1 | 1 | 1 | 1 |
| 2 | 0 | 1 | 2 |
| 3 | 1 | 0 | 3 |
| 4 | 0 | 1 | 4 |
| 5 | 1 | 1 | 5 |
| 6 | 0 | 1 | 6 |
| 7 | 0 | 0 | 7 |
| 8 | 0 | 0 | 8 |
| 9 | 1 | 0 | 9 |
________________________________________
The Roles table has boolean columns, so if a user has the user_mgmt role, he can add many users (how many users a user can add is not fixed). I want to fetch all users created by a user and by his child users (a is the parent, c is a child of a, e is a child of c, ...).
Here is my code to fetch users
public void loadUsers(){
List<Users> users = new ArrayList<>();
String creator = user.getUserName();
List<Users> createdUsers = userService.getUsersByCreator(creator);
for(Users user : createdUsers) {
Roles role = createdUsers.getRoles();
if(role.isEmpMgnt()){
users.add(user);
loadUsers();
}
}
}
This gives me a stack overflow error. If I don't call loadUsers() recursively, it returns only a single child result. Is there any solution to this? Thanks in advance for any help.
This gives you a stack overflow error because a has creator a, so you have an infinite loop for user a. For a, you should set creator to null or skip self-references in code.
Also, you should pass the current user into the loadUsers() method and read only the users created by it, like
public void loadUsers(String creator)
and only process users created by that creator. Here
String creator = user.getUserName();
what is user? You should use creator. The question is how you obtain the initial creator. Probably the initial creator should be the user whose creator is null.
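A minimal sketch of that advice, using an in-memory map in place of userService.getUsersByCreator (the map, the class name, and the method names are hypothetical). The creator is passed down explicitly, and a visited set skips self-references such as a being its own creator. The role check from the question is left out for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class UserTree {

    /** Returns every user created directly or indirectly by root, excluding root itself. */
    static List<String> descendants(String root, Map<String, List<String>> createdBy) {
        List<String> result = new ArrayList<>();
        Set<String> visited = new HashSet<>();
        visited.add(root); // guards against the self-reference a -> a
        collect(root, createdBy, visited, result);
        return result;
    }

    private static void collect(String creator, Map<String, List<String>> createdBy,
                                Set<String> visited, List<String> result) {
        for (String child : createdBy.getOrDefault(creator, Collections.<String>emptyList())) {
            if (visited.add(child)) { // add() returns false when the user was already seen
                result.add(child);
                collect(child, createdBy, visited, result); // recurse with the child as the new creator
            }
        }
    }

    public static void main(String[] args) {
        // creator -> users created, taken from the User table in the question
        Map<String, List<String>> createdBy = new HashMap<>();
        createdBy.put("a", Arrays.asList("a", "b", "c"));
        createdBy.put("c", Arrays.asList("d", "e"));
        createdBy.put("e", Arrays.asList("f", "g", "h", "i"));
        createdBy.put("i", Arrays.asList("j"));

        // prints [b, c, d, e, f, g, h, i, j]
        System.out.println(descendants("a", createdBy));
    }
}
```

The visited set also protects against longer cycles, which setting the creator of a to null alone would not.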

Select data from specific year

I need a solution for my problem here.
I got 2 tables, assetdetail and assetcondition. Here is the structure of those tables.
assetdetail
-----------------------------------------------------------
| sequenceindex | assetcode | assetname | acquisitionyear |
-----------------------------------------------------------
| 1 | 110 | Car | 2012-06-30 |
| 2 | 111 | Bus | 2013-02-12 |
assetcondition
--------------------------------------------------------------------------
|sequenceindex | indexassetdetail | fiscalyear | assetamount | assetprice |
---------------------------------------------------------------------------
| 1 | 1 | 2012 | 1 | 20000000 |
| 2 | 1 | 2013 | 1 | 15000000 |
| 3 | 2 | 2013 | 1 | 25000000 |
And I want the result to look like this:
------------------------
assetname | assetprice |
------------------------
Car | 20000000 |
Bus | 25000000 |
Note: using "SELECT WHERE fiscalyear = "
Without explaining how your tables are linked one can only guess. Here's the query I came up with.
select assetdetail.assetname,
sum( assetcondition.assetprice )
from assetdetail
inner join assetcondition
on assetcondition.indexassetdetail = assetdetail.sequenceindex
where assetcondition.fiscalyear = 2013
group by assetdetail.assetname;
I don't fully understand your query from a logical point of view. In any case, the operator you have to use is JOIN.
I don't know if the SQL that follows is what you want.
Select assetname, assetprice
From assetdetail as ad join assetcondition as ac on (ad.sequenceindex = ac.sequenceindex)
Where fiscalyear = '2013'
Not quite sure if it is what you're looking for, but I guess what you want is a JOIN:
SELECT
assetdetail.assetname, assetcondition.assetprice
FROM
assetdetail
JOIN
assetcondition
ON
assetdetail.sequenceindex = assetcondition.sequenceindex
WHERE
YEAR(acquisitionyear) = '2013'
