Java/Spark - Group by weighted avg aggregation

Data:
id | sector | balance
---------------------------
1 | restaurant | 20000
2 | restaurant | 20000
3 | auto | 10000
4 | auto | 10000
5 | auto | 10000
I am looking to load this into Spark as a DataFrame and calculate the group-by balance sums, but I also have to calculate each group's balance % against the total balance (sum(balance) over all ids).
How can I accomplish this?

To get the % against the total you could use the DoubleRDDFunctions:
// total balance across all rows (sum comes from DoubleRDDFunctions)
val totalBalance = data.map(_._3.toDouble).sum()
// each row's balance as a percentage of the total
val percentageRow = data.map(d => d._3 * 100 / totalBalance)
// per-sector balance sums as a percentage of the total
val percentageGroup = data.map(d => (d._2, d._3))
  .reduceByKey((x, y) => x + y)
  .mapValues(sumGroup => sumGroup * 100 / totalBalance)
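Since the question asks for a DataFrame in Java, here is a minimal sketch of the same computation with the Dataset API; the input file name, the local session setup, and the column names are assumptions based on the sample data above:

import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class BalanceShares {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[*]").appName("balance-shares").getOrCreate();

        // df has columns id, sector, balance, as in the sample data
        Dataset<Row> df = spark.read()
                .option("header", "true")
                .csv("balances.csv"); // hypothetical input file

        // grand total of balance over all ids
        double totalBalance = df.agg(sum(col("balance").cast("double")))
                .first().getDouble(0);

        // per-sector sums plus each sector's share of the total
        df.groupBy("sector")
          .agg(sum(col("balance").cast("double")).alias("balance_sum"))
          .withColumn("balance_pct",
                  col("balance_sum").multiply(100).divide(totalBalance))
          .show();
    }
}

The same share could also be computed in a single pass with a window function (a sum over an unpartitioned window), but for a small grand total like this, collecting it into a double first is simpler to read.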

Related

Stream to filter and sum different fields depending on different conditions

I have a list:
ID | Product Code | Product Type | Transaction Amt | Charges Amt | Charges ID
1 | 001 | 001 | 10.00 | 0.01 | 001
2 | 001 | 001 | 11.00 | 0.01 | 001
3 | 002 | 001 | 12.00 | 0.01 | 002
I want to have this result:
ID | Product Code | Product Type | Transaction Amt | Charges Amt | Charges ID
1 | 001 | 001 | 21.00 | 0.01 | 001
3 | 002 | 001 | 12.00 | 0.01 | 002
I want to sum up the transaction amount and the charges, but when the charges ID is the same, the charges should only be counted once.
Below is the code to sum up based on product code and type:
Map<String, Transaction> map = txs.stream()
        .map(s -> {
            s.setTotalCount(1);
            return s;
        })
        .collect(Collectors.toMap(
                f -> f.getProductCode() + f.getProductType(),
                Function.identity(),
                (s, a) -> new FeeAllocationTransactionModel(
                        s.getProductCode(),
                        s.getProductType(),
                        s.getTranzAmt().add(a.getTranzAmt()),
                        s.getCharges().add(a.getFeeCharges()),
                        s.getTotalCount() + 1)));
List<Transaction> reduced = new ArrayList<>(map.values());
From the code above I got:
ID | Product Code | Product Type | Transaction Amt | Charges Amt | Charges ID
1 | 001 | 001 | 21.00 | **0.02** | 001
3 | 002 | 001 | 12.00 | 0.01 | 002
You could check the charges ID in the merge function you pass to Collectors.toMap:
(s, a) -> {
    BigDecimal charges = s.getChargesId().equals(a.getChargesId())
            ? s.getCharges()
            : s.getCharges().add(a.getFeeCharges());
    return new FeeAllocationTransactionModel(
            s.getProductCode(),
            s.getProductType(),
            s.getTranzAmt().add(a.getTranzAmt()),
            charges,
            s.getTotalCount() + 1);
}
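For reference, here is a self-contained sketch of that merge-by-charges-ID idea, with a simplified record standing in for the real model classes (the Tx type and its accessor names are stand-ins, not the asker's actual types):

import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Simplified stand-in for the real transaction model.
record Tx(String productCode, String productType, BigDecimal tranzAmt,
          BigDecimal charges, String chargesId) {}

public class MergeByChargesId {
    public static void main(String[] args) {
        List<Tx> txs = List.of(
                new Tx("001", "001", new BigDecimal("10.00"), new BigDecimal("0.01"), "001"),
                new Tx("001", "001", new BigDecimal("11.00"), new BigDecimal("0.01"), "001"),
                new Tx("002", "001", new BigDecimal("12.00"), new BigDecimal("0.01"), "002"));

        Map<String, Tx> merged = txs.stream().collect(Collectors.toMap(
                t -> t.productCode() + t.productType(),
                Function.identity(),
                (s, a) -> new Tx(
                        s.productCode(),
                        s.productType(),
                        s.tranzAmt().add(a.tranzAmt()),
                        // identical charges IDs are counted only once
                        s.chargesId().equals(a.chargesId())
                                ? s.charges()
                                : s.charges().add(a.charges()),
                        s.chargesId())));

        new ArrayList<>(merged.values()).forEach(System.out::println);
        // prints the 21.00 / 0.01 row for 001-001 and the 12.00 / 0.01 row for 002-001
    }
}

One caveat: the merge function only ever sees the two entries being merged, so if a group can contain more than two distinct charges IDs, collecting the IDs into a Set and summing one charge per ID would be more robust.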

Is it possible to use a sequence generator in this situation?

I have the following table structure:
ITEM
| ID(Auto Inc) | ORG_ID(FK to ORG) | ITEM_ID               |
|--------------|-------------------|-----------------------|
| 1            | 1                 | 1 (Initial Val for A) |
| 2            | 2                 | 1 (Initial Val for B) |
| 3            | 1                 | 2 (Incremented for A) |
ORG
| ID | NAME |
|------|-----------|
| 1 | A |
| 2 | B |
Is there any way to use a generator to manage the item_id column? It is not the id column of the ITEM table; the business requirement is to keep item_id sequential per org.
You may try to insert using the following query:
INSERT INTO item (org_id, item_id)
SELECT #org_id, COALESCE((SELECT 1 + MAX(item_id)
                          FROM item
                          WHERE org_id = #org_id), 1)
where #org_id is the value to be inserted.
The problem: it may insert duplicates when concurrent insertions occur.
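A sketch of one way to make that race-safe from Java, assuming a UNIQUE constraint on (org_id, item_id) and plain JDBC (the class, method, and retry count are hypothetical; only the table and column names come from the question):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class ItemInserter {
    // Retries the MAX(item_id)+1 insert until it wins the race.
    // The UNIQUE constraint turns a concurrent duplicate into an
    // SQLException instead of silently corrupting the numbering.
    static void insertItem(Connection conn, long orgId) throws SQLException {
        String sql = "INSERT INTO item (org_id, item_id) "
                   + "SELECT ?, COALESCE((SELECT 1 + MAX(item_id) FROM item WHERE org_id = ?), 1)";
        SQLException last = null;
        for (int attempt = 0; attempt < 5; attempt++) {
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setLong(1, orgId);
                ps.setLong(2, orgId);
                ps.executeUpdate();
                return; // success
            } catch (SQLException e) {
                last = e; // likely a unique-key violation from a concurrent insert
            }
        }
        throw last;
    }
}

Alternatives are a per-org counter row updated under SELECT ... FOR UPDATE, or a database-side sequence per org where the engine supports it.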

Join two entry rows (start time and end time) as a single row in Talend

I have data coming from an MS SQL database; it concerns the working hours of employees.
The problem is that the start time and the end time are stored as two different entries: when the employee arrives he scans his badge, and this is recorded as the arrival time; when he leaves he scans again, and this is recorded as the departure time.
One column distinguishes the start time from the end time (the CodeNr column: B1 = StartTime, B2 = EndTime).
So this is what my table looks like (sample data below). I need this data as a single entry per day, in Talend or from the database, with the start and end time in one row. What should I use to achieve this (ideally in Talend, or in MS SQL if that is simpler)?
CREATE TABLE EmployeeWorkLoad (
    EmployeeNr bigint,
    Year int,
    Month int,
    Day int,
    Hour int,
    Minute int,
    CodeNr char(2)
)
INSERT INTO EmployeeWorkLoad (EmployeeNr, Year, Month, Day, Hour, Minute, CodeNr)
VALUES (1, 2020, 1, 4, 8, 30, 'B1'),
       (1, 2020, 1, 4, 16, 45, 'B2'),
       (1, 2020, 1, 6, 8, 15, 'B1'),
       (1, 2020, 1, 6, 16, 45, 'B2'),
       (2, 2020, 3, 2, 8, 10, 'B1'),
       (2, 2020, 3, 2, 16, 5, 'B2')
GO
6 rows affected
WITH CTE AS (
    SELECT EmployeeNr, Year, Month, Day,
           MAX(CASE WHEN CodeNr = 'B1' THEN Hour END) AS StartHour,
           MAX(CASE WHEN CodeNr = 'B1' THEN Minute END) AS StartMinute,
           MAX(CASE WHEN CodeNr = 'B2' THEN Hour END) AS EndHour,
           MAX(CASE WHEN CodeNr = 'B2' THEN Minute END) AS EndMinute
    FROM EmployeeWorkLoad
    GROUP BY EmployeeNr, Year, Month, Day
)
SELECT *,
       -- borrow an hour when the end minutes are smaller than the start minutes
       EndHour - StartHour - IIF(EndMinute < StartMinute, 1, 0) AS DurationHour,
       IIF(EndMinute < StartMinute, EndMinute + 60, EndMinute) - StartMinute AS DurationMinute
FROM CTE
GO
EmployeeNr | Year | Month | Day | StartHour | StartMinute | EndHour | EndMinute | DurationHour | DurationMinute
---------: | ---: | ----: | --: | --------: | ----------: | ------: | --------: | -----------: | -------------:
1 | 2020 | 1 | 4 | 8 | 30 | 16 | 45 | 8 | 15
1 | 2020 | 1 | 6 | 8 | 15 | 16 | 45 | 8 | 30
2 | 2020 | 3 | 2 | 8 | 10 | 16 | 5 | 7 | 55
db<>fiddle here

Spark and non-denormalized tables

I know Spark works much better with denormalized tables, where all the needed data is in one row. I am wondering, when that is not the case, whether there is a way to retrieve data from previous or next rows.
Example:
Formula:
value = (value from 2 years ago) + (current year value) / (value from 2 years ahead)
Table
+-------+-----+
| YEAR|VALUE|
+-------+-----+
| 2015| 100 |
| 2016| 34 |
| 2017| 32 |
| 2018| 22 |
| 2019| 14 |
| 2020| 42 |
| 2021| 88 |
+-------+-----+
Dataset<Row> dataset = ...;
Dataset<Result> results = dataset.map(row -> {
    int currentValue = Integer.valueOf(row.getAs("VALUE"));            // 2019
    // nonsense code, just to exemplify
    int twoYearsBackValue = Integer.valueOf(row[???].getAs("VALUE"));  // 2016
    int twoYearsAheadValue = Integer.valueOf(row[???].getAs("VALUE")); // 2021
    double resultValue = twoYearsBackValue + currentValue / twoYearsAheadValue;
    return new Result(2019, resultValue);
});
Result[] collected = results.collect();
Is it possible to grab these values (which belong to other rows) without changing the table format (no denormalization, no pivots, ...) and without collecting the data, or does that go totally against Spark/big data principles?
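Spark's window functions can express exactly this kind of previous/next-row access without denormalizing and without collecting. A minimal Java sketch, assuming the YEAR/VALUE dataset from the question (the added column names are illustrative):

import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;

// Order rows by YEAR; lag/lead then reach 2 rows back and ahead.
// A window without partitionBy moves all rows to one partition,
// which is acceptable for a small table like this one.
WindowSpec byYear = Window.orderBy("YEAR");

Dataset<Row> results = dataset
        .withColumn("two_back", lag(col("VALUE"), 2).over(byYear))
        .withColumn("two_ahead", lead(col("VALUE"), 2).over(byYear))
        // value = (2 years ago) + (current) / (2 years ahead), per the formula
        .withColumn("result",
                col("two_back").plus(col("VALUE").divide(col("two_ahead"))));

results.show();

Rows near the edges (2015, 2016, 2020, 2021) get null for the missing neighbor, so their result is null as well; everything stays distributed and no collect() is needed.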

Calculate/Determine Hours for Night Shift in MySQL

Here is the table of the employees' logs, and what I want is to generate the time in and time out of each employee (the two screenshots are not reproduced here; the sample datetimes appear in the outputs below).
Can anyone help me with this? Any logic or algorithm is welcome.
This is one way, and it will work for any day or night shift, provided the first (minimum) datetime of each pair represents the IN scan.
Rextester Sample
select t.enno,
       max(datetime) as time_out,
       min(datetime) as time_in,
       time_to_sec(timediff(max(datetime), min(datetime))) / 3600 as No_of_hours
from
(
    SELECT floor(@row1 := @row1 + 0.5) as day,
           t.*
    FROM Table4356 t,
         (SELECT @row1 := 0.5) r1
    order by t.datetime
) t
group by t.day, t.enno
;
Output
+------+---------------------+---------------------+-------------+
| enno | time_out | time_in | No_of_hours |
+------+---------------------+---------------------+-------------+
| 6 | 16.05.2017 06:30:50 | 15.05.2017 18:30:50 | 12.0000 |
| 6 | 17.05.2017 05:30:50 | 16.05.2017 18:10:50 | 11.3333 |
+------+---------------------+---------------------+-------------+
Explanation:
SELECT floor(@row1 := @row1 + 0.5) as day,
       t.*
FROM Table4356 t,
     (SELECT @row1 := 0.5) r1
order by t.datetime
This query increments @row1 by 0.5 on every row, so you get 1, 1.5, 2, 2.5, and so on. Taking just the integer part with floor turns that into the sequence 1, 1, 2, 2, which pairs up consecutive scans. So this query gives you this output:
+-----+------+---------------------+
| day | enno | datetime |
+-----+------+---------------------+
| 1 | 6 | 15.05.2017 18:30:50 |
| 1 | 6 | 16.05.2017 06:30:50 |
| 2 | 6 | 16.05.2017 18:10:50 |
| 2 | 6 | 17.05.2017 05:30:50 |
+-----+------+---------------------+
Now you can group by day and enno and take the max and min times.
