I need a solution for my problem here.
I have 2 tables, assetdetail and assetcondition. Here is the structure of those tables:
assetdetail
-----------------------------------------------------------
| sequenceindex | assetcode | assetname | acquisitionyear |
-----------------------------------------------------------
| 1             | 110       | Car       | 2012-06-30      |
| 2             | 111       | Bus       | 2013-02-12      |
assetcondition
----------------------------------------------------------------------------
| sequenceindex | indexassetdetail | fiscalyear | assetamount | assetprice |
----------------------------------------------------------------------------
| 1             | 1                | 2012       | 1           | 20000000   |
| 2             | 1                | 2013       | 1           | 15000000   |
| 3             | 2                | 2013       | 1           | 25000000   |
And I want the result to be like this:
--------------------------
| assetname | assetprice |
--------------------------
| Car       | 20000000   |
| Bus       | 25000000   |
Note: using "SELECT WHERE fiscalyear = "
Without explaining how your tables are linked, one can only guess. Here's the query I came up with:
select assetdetail.assetname,
sum( assetcondition.assetprice )
from assetdetail
inner join assetcondition
on assetcondition.indexassetdetail = assetdetail.sequenceindex
where assetcondition.fiscalyear = 2013
group by assetdetail.assetname;
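Note that with the sample data above, fiscalyear = 2013 returns Car | 15000000 and Bus | 25000000; the 20000000 Car price in your desired output is the fiscalyear 2012 row, so double-check which year you actually want to filter on.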
I haven't understood your query from a logical point of view. In any case, the operator you have to use is JOIN.
I don't know if the SQL that follows is what you want.
Select assetname, assetprice
From assetdetail as ad join assetcondition as ac on (ad.sequenceindex = ac.indexassetdetail)
Where fiscalyear = '2013'
Not quite sure if it is what you're looking for, but I guess what you want is a JOIN:
SELECT
assetdetail.assetname, assetcondition.assetprice
FROM
assetdetail
JOIN
assetcondition
ON
assetdetail.sequenceindex = assetcondition.indexassetdetail
WHERE
assetcondition.fiscalyear = 2013
I am using this select:
SELECT
asl.id, asl.outstanding_principal as outstandingPrincipal, the_date as theDate, asl.interest_rate as interestRate, asl.interest_payment as interestPayment, asl.principal_payment as principalPayment,
asl.total_payment as totalPayment, asl.actual_delta as actualDelta, asl.outstanding_usd as outstandingUsd, asl.disbursement, asl.floating_index_rate as floatingIndexRate,
asl.upfront_fee as upfrontFee, asl.commitment_fee as commitmentFee, asl.other_fee as otherFee, asl.withholding_tax as withholdingTax, asl.default_fee as defaultFee,
asl.prepayment_fee as prepaymentFee, asl.total_out_flows as totalOutFlows, asl.net_flows as netFlows, asl.modified, asl.new_row as newRow, asl.interest_payment_modified as
interestPaymentModified, asl.date, asl.amortization_schedule_initial_id as amortizationScheduleInitialId, asl.tranche_id as trancheId, asl.user_id as userId, tr.local_currency_id as localCurrencyId,
f.facility_id
FROM
GENERATE_SERIES
(
(SELECT MIN(ams.date) FROM amortization_schedules ams),
(SELECT MAX(ams.date) + INTERVAL '1' MONTH FROM amortization_schedules ams),
'1 MONTH'
) AS tab (the_date)
FULL JOIN amortization_schedules asl on to_char(the_date, 'yyyy-mm') = to_char(asl.date, 'yyyy-mm')
LEFT JOIN tranches tr ON asl.tranche_id = tr.id
LEFT JOIN facilities f on tr.facility_id = f.id
In this select, I'm using generate_series to get each month, since there are no records in the database for every month. But this select gives me superfluous results. I use this select in my Spring Boot application, and I need all the data but only, for example, for a certain facility_id. When I insert a condition
WHERE f.id = :id and tr.tranche_number_id = :trancheNumberId
my generate_series stops working (as I understand it, because of the conditions I put on the query) and instead of 30 rows, I get only 3.
How do I keep the ability to generate theDate by month, while still being able to select specific IDs?
I tried different options.
With this option:
FULL JOIN amortization_schedules asl on to_char(the_date, 'yyyy-mm') = to_char(asl.date, 'yyyy-mm')
| id | outstandingprincipal | thedate |
-------------------------------------------------------------------
| 1 | 10000 | 2022-05-16 00:00:00.000000 |
| 2 | 50000 | 2023-05-16 00:00:00.000000 |
| 3 | 0 | 2024-05-16 00:00:00.000000 |
In this case, it does not work correctly, since months are not generated and only three rows are displayed (if it is to_char(the_date, 'yyyy-MM') = to_char(asl.date, 'yyyy-MM')).
If I change it to to_char(the_date, 'yyyy') = to_char(asl.date, 'yyyy'), then the generation works, but not correctly, because it only matches on the year.
| id | outstandingprincipal | thedate |
-------------------------------------------------------------------
| 1 | 10000 | 2022-05-16 00:00:00.000000 |
| 1 | 10000 | 2022-06-16 00:00:00.000000 |
| 1 | 10000 | 2022-06-16 00:00:00.000000 |
| 1 | 10000 | 2022-07-16 00:00:00.000000 |
... ... ....
| 1 | 10000 | 2022-12-16 00:00:00.000000 |
| 2 | 50000 | 2023-01-16 00:00:00.000000 |
| 2 | 50000 | 2023-02-16 00:00:00.000000 |
| 2 | 50000 | 2023-03-16 00:00:00.000000 |
| 2 | 50000 | 2023-04-16 00:00:00.000000 |
... ... ....
| 3 | 0 | 2024-01-16 00:00:00.000000 |
but it should be:
| id | outstandingprincipal | thedate |
-------------------------------------------------------------------
| 1 | 10000 | 2022-05-16 00:00:00.000000 |
| 1 | 10000 | 2022-06-16 00:00:00.000000 |
| 1 | 10000 | 2022-06-16 00:00:00.000000 |
| 1 | 10000 | 2022-07-16 00:00:00.000000 |
... ... ....
| 1 | 10000 | 2023-04-16 00:00:00.000000 |
| 2 | 50000 | 2023-05-16 00:00:00.000000 |
| 2 | 50000 | 2023-06-16 00:00:00.000000 |
| 2 | 50000 | 2023-07-16 00:00:00.000000 |
| 2 | 50000 | 2023-08-16 00:00:00.000000 |
... ... ....
| 3 | 0 | 2024-05-16 00:00:00.000000 |
| 3 | 0 | 2024-06-16 00:00:00.000000 |
| 3 | 0 | 2024-07-16 00:00:00.000000 |
I'm making a few intuitive leaps here, so if something looks off it might be because I don't have the entire picture.
From what I can tell you want the amortization schedule starting from the "date" for each ID and then going out a specific amount of time. I am guessing it is not truly the max date in that entire table, and that it varies by ID. In your example you went out one year, so for now I'm going with that.
You can use a generate_series inline, which will explode out each row. I believe something like this will give you the output you seek:
with schedule as (
select
id,
generate_series (date, date + interval '1 year', interval '1 month')::date as dt
from
amortization_schedules
)
select
asl.id, s.dt, asl.outstanding_principal
from
amortization_schedules asl
join schedule s on asl.id = s.id
JOIN tranches tr ON asl.tranche_id = tr.id
JOIN facilities f on tr.facility_id = f.id
WHERE
f.id = :id and
tr.tranche_number_id = :trancheNumberId
Is there another field that tells, by id, when the payments should end, or one that will let us derive it (number of payments, payment end date, etc.)?
One final note. If you use [left] outer joins and a where clause, as below:
LEFT JOIN tranches tr ON asl.tranche_id = tr.id
LEFT JOIN facilities f on tr.facility_id = f.id
WHERE
f.id = :id and
tr.tranche_number_id = :trancheNumberId
You have effectively nullified the "left" and made these inner joins. In this case, get rid of "left," not because it will return wrong results but because it misleads. You are saying those fields must have those specific values, which means they must first exist. That's an inner join.
If you truly wanted these as left joins, this would have been more appropriate, but I don't think this is what you meant:
LEFT JOIN tranches tr ON
asl.tranche_id = tr.id and
tr.tranche_number_id = :trancheNumberId
LEFT JOIN facilities f on
tr.facility_id = f.id and
f.id = :id
I have two tables
table-1
|stdid | stdname |
|-------|---------|
|1 | raghav |
|2 | sowmya |
|3 | kiran |
table-2
| skillid | stdname | skill |
|---------|---------|--------|
| 1 | raghav | java |
| 2 | raghav | c |
| 3 | raghav | c++ |
| 4       | sowmya  | python |
| 5 | sowmya | c++ |
| 6 | kiran | c |
I want output like:
raghav java,c,c++.
sowmya python,c++.
kiran c.
How can I join those two tables like this and store the result in an ArrayList?
Does ArrayList accept array variables? If yes, how can I approach it?
Join the tables and then aggregate by name:
SELECT t1.stdname, GROUP_CONCAT(COALESCE(t2.skill, 'NA')) AS skills
FROM Table1 t1
LEFT JOIN Table2 t2
ON t2.stdname = t1.stdname
GROUP BY
t1.stdname;
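To answer the ArrayList part: yes, an ArrayList can hold arrays (for example ArrayList<String[]>). Below is a minimal JDBC sketch of running the query above and collecting each row into a list; the connection URL, user, and password are placeholders you would replace with your own.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

public class SkillLoader {
    public static void main(String[] args) throws Exception {
        String sql = "SELECT t1.stdname, GROUP_CONCAT(COALESCE(t2.skill, 'NA')) AS skills "
                   + "FROM Table1 t1 LEFT JOIN Table2 t2 ON t2.stdname = t1.stdname "
                   + "GROUP BY t1.stdname";
        // Placeholder connection details -- replace with your own database.
        try (Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/school", "user", "password");
             PreparedStatement ps = con.prepareStatement(sql);
             ResultSet rs = ps.executeQuery()) {
            // An ArrayList whose elements are String arrays: {name, skills}.
            List<String[]> rows = new ArrayList<>();
            while (rs.next()) {
                rows.add(new String[] { rs.getString("stdname"), rs.getString("skills") });
            }
            rows.forEach(r -> System.out.println(r[0] + " " + r[1]));
        }
    }
}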
I'm trying to join two datasets
Ds1
+-------------+-----------------+-----------+
| countryName|countryPopulation|countrySize|
+-------------+-----------------+-----------+
| China| 1210004992| 9596960|
| India| 952107712| 3287590|
| UnitedStates| 266476272| 9372610|
| Indonesia| 206611600| 1919440|
| Brazil| 162661216| 8511965|
| Russia| 148178480| 17075200|
| Pakistan| 129275664| 803940|
| Japan| 125449704| 377835|
| Bangladesh| 123062800| 144000|
| Nigeria| 103912488| 923770|
| Mexico| 95772464| 1972550|
| Germany| 83536112| 356910|
| Philippines| 74480848| 300000|
| Vietnam| 73976976| 329560|
| Iran| 66094264| 1648000|
| Egypt| 63575108| 1001450|
| Turkey| 62484480| 780580|
| Thailand| 58851356| 514000|
|UnitedKingdom| 58489976| 244820|
| France| 58317448| 547030|
+-------------+-----------------+-----------+
Ds2:
+------------+-----------------+-----------+
| countryName|countryPopulation|countrySize|
+------------+-----------------+-----------+
| China| 1210004992| 9596960|
| India| 952107712| 3287590|
|UnitedStates| 266476272| 9372610|
| Indonesia| 206611600| 1919440|
| Brazil| 162661216| 8511965|
| Russia| 148178480| 17075200|
| Pakistan| 129275664| 803940|
| Japan| 125449704| 377835|
| Bangladesh| 123062800| 144000|
| Nigeria| 103912488| 923770|
| Germany| 83536112| 356910|
| Vietnam| 73976976| 329560|
| Iran| 66094264| 1648000|
| Thailand| 58851356| 514000|
| France| 58317448| 547030|
| Italy| 57460272| 301230|
| Ethiopia| 57171664| 1127127|
| Ukraine| 50864008| 603700|
| Zaire| 46498540| 2345410|
| Burma| 45975624| 678500|
+------------+-----------------+-----------+
When I perform the below operation, I get the output:
Dataset<Row> ds3 = ds2.filter(ds2.col("countryPopulation").cast("int").$greater(100000))
.join(ds1, ds1.col("countrySize")
.equalTo(ds2.col("countrySize")));
ds3.show();
But when I do the below operation, I'm getting an error:
Dataset<Row> ds3 = ds2.filter(ds2.col("countryPopulation").cast("int").$greater(100000))
.join(ds1, ds1.col("countrySize").cast(DataTypes.IntegerType)
.equalTo(ds2.col("countrySize").cast(DataTypes.IntegerType)), "inner");
ds3.show();
Error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Detected implicit cartesian product for INNER join between logical plans
Project [country#6.name AS countryName#2, country#6.population AS countryPopulation#3, country#6.area AS countrySize#4]
+- Filter (isnotnull(country#6) && (Contains(country#6.name, a) && ((cast(country#6.population as int) > 100000) && (cast(country#6.area as int) = cast(country#6.area as int)))))
+- Generate explode(countries#0.country), [0], false, t, [country#6]
+- Relation[countries#0] json
Could you please tell me how I should cast and join at the same time? And why am I getting this error?
What is the meaning of "Detected implicit cartesian product for INNER join between logical plans" in the error?
I have seen a cartesian join happen when the join condition contains a function call with parameters from both data frames, something like df1.join(df2, aFunction(df1.column, df2.column)). I don't see exactly that here, but I suspect something similar is going on.
Try the below to get the casts applied in a select rather than in the join condition.
Dataset<Row> ds1_1 = ds1.select(col("countrySize").cast(DataTypes.IntegerType).as("countrySize")); // add all columns here
Dataset<Row> ds2_1 = ds2.select(col("countrySize").cast(DataTypes.IntegerType).as("countrySize"), ds2.col("countryPopulation").cast("int").as("countryPopulation")); // add all columns here
Dataset<Row> ds3 = ds2_1.filter(ds2_1.col("countryPopulation").$greater(100000))
.join(ds1_1, ds1_1.col("countrySize")
.equalTo(ds2_1.col("countrySize")), "inner");
ds3.show();
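As a side note on the error itself: Spark raises "Detected implicit cartesian product" when the optimizer decides the join condition does not actually constrain the two sides. If you ever do want such a join, you can opt in explicitly (assuming spark here is your SparkSession):
// Allow cartesian products explicitly; in Spark 3.x this is already the default.
spark.conf().set("spark.sql.crossJoin.enabled", "true");
But in your case the select-then-join rewrite above should avoid the cartesian plan altogether.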
I have two tables User and Roles with one-to-one relation as below.
User
_________________________________________
| Id | user_name | full_name | creator |
_________________________________________
| 1 | a | A | a |
| 2 | b | B | a |
| 3 | c | C | a |
| 4 | d | D | c |
| 5 | e | E | c |
| 6 | f | F | e |
| 7 | g | G | e |
| 8 | h | H | e |
| 9 | i | I | e |
|10 | j | J | i |
_________________________________________
Roles
_______________________________________
| id | user_mgmt | others | user_id |
_______________________________________
| 1 | 1 | 1 | 1 |
| 2 | 0 | 1 | 2 |
| 3 | 1 | 0 | 3 |
| 4 | 0 | 1 | 4 |
| 5 | 1 | 1 | 5 |
| 6 | 0 | 1 | 6 |
| 7 | 0 | 0 | 7 |
| 8 | 0 | 0 | 8 |
| 9 | 1 | 0 | 9 |
________________________________________
The Roles table has boolean columns, so if a user has the user_mgmt role he can add many users (how many users can be added by a user is not fixed). I want to fetch all users created by a user and his child users (a is the parent, c is a child of a, and e is a child of c, ...).
Here is my code to fetch users
public void loadUsers(){
List<Users> users = new ArrayList<>();
String creator = user.getUserName();
List<Users> createdUsers = userService.getUsersByCreator(creator);
for(Users user : createdUsers) {
Roles role = user.getRoles();
if(role.isEmpMgnt()){
users.add(user);
loadUsers();
}
}
}
This gives me a stack overflow error. If I don't call loadUsers() recursively, it returns only a single child result. So is there any solution to this? Thanks for any help in advance.
This gives you a stack overflow error because a has creator a, so you have an infinite loop for user a. For a you should set the creator to null, or skip self-references in code.
Also, you should pass the current user into the loadUsers() method and read only the users created by it, like
public void loadUsers(String creator)
and only process users created by that creator. Here
String creator = user.getUserName();
what is user? You should use creator. The question is how you obtain the initial creator. Probably the initial creator should be the user whose creator is null.
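A minimal sketch of what that could look like, assuming the accessors from your code (getUserName(), getRoles(), isEmpMgnt()) and a userService lookup by creator; the self-reference check is what breaks the a -> a cycle:
public List<Users> loadUsers(String creator) {
    List<Users> result = new ArrayList<>();
    for (Users user : userService.getUsersByCreator(creator)) {
        // Skip self-references such as user "a" created by "a",
        // otherwise the recursion never terminates.
        if (user.getUserName().equals(creator)) {
            continue;
        }
        result.add(user);
        Roles role = user.getRoles();
        if (role != null && role.isEmpMgnt()) {
            // Descend into the users this user has created.
            result.addAll(loadUsers(user.getUserName()));
        }
    }
    return result;
}
Call it once with the root creator, e.g. loadUsers("a") (or with null if you make the root's creator null as suggested above).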
I have an Apache Spark Dataframe of the following format
| ID | groupId | phaseName |
|----|-----------|-----------|
| 10 | someHash1 | PhaseA |
| 11 | someHash1 | PhaseB |
| 12 | someHash1 | PhaseB |
| 13 | someHash2 | PhaseX |
| 14 | someHash2 | PhaseY |
Each row represents a phase that happens in a procedure that consists of several of these phases. The ID column represents a sequential order of phases and the groupId column shows which phases belong together.
I want to add a new column to the dataframe: previousPhaseName. This column should indicate the previous different phase from the same procedure. The first phase of a process (the one with the minimum ID) will have null as its previous phase. When a phase occurs twice or more, the second (third, ...) occurrence will have the same previousPhaseName. For example:
df =
| ID | groupId | phaseName | prevPhaseName |
|----|-----------|-----------|---------------|
| 10 | someHash1 | PhaseA | null |
| 11 | someHash1 | PhaseB | PhaseA |
| 12 | someHash1 | PhaseB | PhaseA |
| 13 | someHash2 | PhaseX | null |
| 14 | someHash2 | PhaseY | PhaseX |
I am not sure how to implement this. My first approach would be:
create a second empty dataframe df2
for each row in df:
find the row with groupId = row.groupId, ID < row.ID, and maximum id
add this row to df2
join df1 and df2
Partial Solution using Window Functions
I used window functions to aggregate the name of the previous phase, the number of previous occurrences (not necessarily in a row) of the current phase in the group, and whether the current and previous phase names are equal:
WindowSpec windowSpecPrev = Window
.partitionBy(df.col("groupId"))
.orderBy(df.col("ID"));
WindowSpec windowSpecCount = Window
.partitionBy(df.col("groupId"), df.col("phaseName"))
.orderBy(df.col("ID"))
.rowsBetween(Long.MIN_VALUE, 0);
df
.withColumn("prevPhase", functions.lag("phaseName", 1).over(windowSpecPrev))
.withColumn("phaseCount", functions.count("phaseId").over(windowSpecCount))
.withColumn("prevSame", when(col("prevPhase").equalTo(col("phaseName")),1).otherwise(0))
df =
| ID | groupId | phaseName | prevPhase | phaseCount | prevSame |
|----|-----------|-----------|-------------|------------|----------|
| 10 | someHash1 | PhaseA | null | 1 | 0 |
| 11 | someHash1 | PhaseB | PhaseA | 1 | 0 |
| 12 | someHash1 | PhaseB | PhaseB | 2 | 1 |
| 13 | someHash2 | PhaseX | null | 1 | 0 |
| 14 | someHash2 | PhaseY | PhaseX | 1 | 0 |
This is still not quite what I wanted to achieve, but it is good enough for now.
Further Ideas
To get the name of the previous distinct phase, I see three possibilities that I have not investigated thoroughly:
Implement my own lag function that does not take an offset but recursively checks the previous line until it finds a value that is different from the given line. (Though I don't think it's possible to use custom analytic window functions in Spark SQL.)
Find a way to dynamically set the offset of the lag function according to the value of phaseCount. (That may fail if the previous occurrences of the same phaseName do not appear in a single sequence)
Use a UserDefinedAggregateFunction over the window that stores the ID and phaseName of the first given input and seeks for the highest ID with different phaseName.
I was able to solve this problem in the following way:
Get the (ordinary) previous phase.
Introduce a new id that groups phases that occur in sequential order (with the help of this answer). This takes two steps: first, checking whether the current and previous phase names are equal and assigning a seqCount value accordingly; second, computing a cumulative sum over this value.
Assign the previous phase of the first row of a sequential group to all its members.
Implementation
WindowSpec specGroup = Window.partitionBy(col("groupId"))
.orderBy(col("ID"));
WindowSpec specSeqGroupId = Window.partitionBy(col("groupId"))
.orderBy(col("ID"))
.rowsBetween(Long.MIN_VALUE, 0);
WindowSpec specPrevDiff = Window.partitionBy(col("groupId"), col("seqGroupId"))
.orderBy(col("ID"))
.rowsBetween(Long.MIN_VALUE, 0);
df.withColumn("prevPhase", coalesce(lag("phaseName", 1).over(specGroup), lit("NO_PREV")))
.withColumn("seqCount", when(col("prevPhase").equalTo(col("phaseName")).or(col("prevPhase").equalTo("NO_PREV")),0).otherwise(1))
.withColumn("seqGroupId", sum("seqCount").over(specSeqGroupId))
.withColumn("prevDiff", first("prevPhase").over(specPrevDiff));
Result
df =
| ID | groupId | phaseName | prevPhase | seqCount | seqGroupId | prevDiff |
|----|-----------|-----------|-----------|----------|------------|----------|
| 10 | someHash1 | PhaseA | NO_PREV | 0 | 0 | NO_PREV |
| 11 | someHash1 | PhaseB | PhaseA | 1 | 1 | PhaseA |
| 12 | someHash1 | PhaseB | PhaseA | 0 | 1 | PhaseA |
| 13 | someHash2 | PhaseX | NO_PREV | 0 | 0 | NO_PREV |
| 14 | someHash2 | PhaseY | PhaseX | 1 | 1 | PhaseX |
Any suggestions, especially in terms of the efficiency of these operations, are appreciated.
I guess you can use Spark window (row frame) functions. Check the API documentation and the following post:
https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
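For reference, a minimal sketch of the lag-over-window pattern that post describes, in the Java API and with the column names from the question (the partial solution above builds on exactly this primitive):
import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;

// Previous row's phaseName within each group, ordered by ID.
WindowSpec w = Window.partitionBy(col("groupId")).orderBy(col("ID"));
df.withColumn("prevPhaseName", lag(col("phaseName"), 1).over(w)).show();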