Count of Txns in spark within a group - java

I have the below dataframe in Spark:
+---------+--------------+-------+------------+---------+------+------+
| id| txnId|account| date| idl| type|amount|
+---------+--------------+-------+------------+---------+------+------+
| 153| 0000004512 | 30095| 11272020| 30| debit| 1000|
| 153| 0000004512 | 30096| 11272020| 0|credit| 200|
| 145| 0000004513 | 30095| 11272020| 0| debit| 4000|
| 135| 0000004512 | 30096| 11272020| 0|credit| 2000|
| 153| 0000004512 | 30097| 11272020| 0| debit| 1000|
| 145| 0000004514 | 30094| 11272020| 0| debit| 1000|
+---------+--------------+-------+------------+---------+------+------+
I need to group the rows by id, txnId and type, and add another column with the count of rows in each group. For example, the output should be:
+---------+--------------+-------+------------+---------+------+------+
|       id|         txnId|account|        date|      idl|  type| count|
+---------+--------------+-------+------------+---------+------+------+
|      153|    0000004512|  30095|    11272020|       30| debit|     2|
|      153|    0000004512|  30097|    11272020|        0| debit|     2|
|      153|    0000004512|  30096|    11272020|        0|credit|     1|
|      145|    0000004513|  30095|    11272020|        0| debit|     2|
|      145|    0000004514|  30094|    11272020|        0| debit|     2|
|      135|    0000004512|  30096|    11272020|        0|credit|     1|
+---------+--------------+-------+------------+---------+------+------+
Here is the logic I tried, but it is not working:
WindowSpec windowSpec = Window.partitionBy("id","txnId","type").orderBy("id");
Column roworder = rank().over(windowSpec).as("rank");
Dataset<Row> df1 = df.select(df.col("*"),roworder);
Dataset<Row> df2 = df1.withColumn("count",sum(agg(df1.col("id"),1))

You don't need the rank function to achieve what you want:
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;
import static org.apache.spark.sql.functions.*;

WindowSpec windowSpec = Window.partitionBy("id", "txnId", "type").orderBy("id");
Dataset<Row> df2 = df.withColumn("count", count("*").over(windowSpec));
This gives the result:
+---+----------+-------+--------+---+------+------+-----+
| id| txnId|account| date|idl| type|amount|count|
+---+----------+-------+--------+---+------+------+-----+
|145|0000004513| 30095|11272020| 0| debit| 4000| 1|
|145|0000004514| 30094|11272020| 0| debit| 1000| 1|
|135|0000004512| 30096|11272020| 0|credit| 2000| 1|
|153|0000004512| 30095|11272020| 30| debit| 1000| 2|
|153|0000004512| 30097|11272020| 0| debit| 1000| 2|
|153|0000004512| 30096|11272020| 0|credit| 200| 1|
+---+----------+-------+--------+---+------+------+-----+
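As an aside, the orderBy("id") in the window spec is redundant for a plain count, since every row in a partition shares the same id; partitionBy alone gives the same totals. A minimal sketch:
WindowSpec countSpec = Window.partitionBy("id", "txnId", "type");
Dataset<Row> counted = df.withColumn("count", count("*").over(countSpec));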

Related

spark with column and aggregate function dropping other columns in the dataset

I have the below data frame, which I have grouped by id, txnId and date:
+---------+--------------+-------+------------+---------+------+------+
| id| txnId|account| date| idl| type|amount|
+---------+--------------+-------+------------+---------+------+------+
| 153| 0000004512 | 30095| 11272020| 30| debit| 1000|
| 153| 0000004512 | 30096| 11272020| 0|credit| 200|
| 145| 0000004513 | 30095| 11272020| 0| debit| 4000|
| 135| 0000004512 | 30096| 11272020| 0|credit| 2000|
| 153| 0000004512 | 30097| 11272020| 0| debit| 1000|
| 145| 0000004514 | 30094| 11272020| 0| debit| 1000|
+---------+--------------+-------+------------+---------+------+------+
so after grouping, the output is
+---------+--------------+-------+------------+---------+------+------+
| id| txnId|account| date| idl| type|amount|
+---------+--------------+-------+------------+---------+------+------+
| 153| 0000004512| 30095| 11272020| 30| debit| 1000|
| 153| 0000004512| 30096| 11272020| 0|credit| 200|
| 153| 0000004512| 30097| 11272020| 0| debit| 1000|
| 153| 0000004512| 30097| 11272020| 0|credit| 500|
| 145| 0000004513| 30095| 11272020| 0| debit| 4000|
| 145| 0000004514| 30094| 11272020| 0| debit| 1000|
| 135| 0000004512| 30096| 11272020| 0|credit| 2000|
+---------+--------------+-------+------------+---------+------+------+
I need to add two more columns to the data frame holding the total amount by type (credit or debit) for each group. The output should look like:
+---------+--------------+-------+-----------+---------+------+------+-----------+----------+
| id| txnId|account| date| idl| type|amount|totalcredit|totaldebit|
+---------+--------------+-------+-----------+---------+------+------+-----------+----------+
| 153| 0000004512| 30095| 11272020| 30| debit| 1000| 0| 2000|
| 153| 0000004512| 30096| 11272020| 0|credit| 200| 700| 0|
| 153| 0000004512| 30097| 11272020| 0| debit| 1000| 0| 2000|
| 153| 0000004512| 30097| 11272020| 0|credit| 500| 700| 0|
| 145| 0000004513| 30095| 11272020| 0| debit| 4000| 0| 4000|
| 145| 0000004514| 30094| 11272020| 0|credit| 1000| 1000| 0|
| 135| 0000004512| 30096| 11272020| 0|credit| 2000| 2000| 0|
+---------+--------------+-------+-----------+---------+------+------+-----------+----------+
I have written the below code to add the new column:
Dataset<Row> df3 = df2.where(df2.col("type").equalTo("credit"))
    .groupBy("type")
    .agg(sum("amount")).withColumnRenamed("sum(amount)", "totalcredit");
but it drops the other columns from the dataset. How do I preserve the other columns?
You want to use conditional sum aggregation over a Window partitioned by id:
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;
import static org.apache.spark.sql.functions.*;
WindowSpec w = Window.partitionBy("id");
Dataset<Row> df3 = df2
    .withColumn(
        "totalcredit",
        when(
            col("type").equalTo("credit"),
            sum(when(col("type").equalTo("credit"), col("amount"))).over(w)
        ).otherwise(0))
    .withColumn(
        "totaldebit",
        when(
            col("type").equalTo("debit"),
            sum(when(col("type").equalTo("debit"), col("amount"))).over(w)
        ).otherwise(0));
df3.show();
//+---+-----+-------+--------+---+------+------+-----------+----------+
//| id|txnId|account| date|idl| type|amount|totalcredit|totaldebit|
//+---+-----+-------+--------+---+------+------+-----------+----------+
//|145| 4513| 30095|11272020| 0| debit| 4000| 0| 5000|
//|145| 4514| 30094|11272020| 0| debit| 1000| 0| 5000|
//|135| 4512| 30096|11272020| 0|credit| 2000| 2000| 0|
//|153| 4512| 30095|11272020| 30| debit| 1000| 0| 2000|
//|153| 4512| 30096|11272020| 0|credit| 200| 700| 0|
//|153| 4512| 30097|11272020| 0| debit| 1000| 0| 2000|
//|153| 4512| 30097|11272020| 0|credit| 500| 700| 0|
//+---+-----+-------+--------+---+------+------+-----------+----------+
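If the per-id partitions are very large, the same result can come from aggregating once and joining back; a sketch under the same column names (the intermediate names ctotal/dtotal are illustrative):
Dataset<Row> totals = df2.groupBy("id").agg(
    sum(when(col("type").equalTo("credit"), col("amount")).otherwise(0)).alias("ctotal"),
    sum(when(col("type").equalTo("debit"), col("amount")).otherwise(0)).alias("dtotal"));

Dataset<Row> withTotals = df2.join(totals, "id")
    .withColumn("totalcredit", when(col("type").equalTo("credit"), col("ctotal")).otherwise(0))
    .withColumn("totaldebit", when(col("type").equalTo("debit"), col("dtotal")).otherwise(0))
    .drop("ctotal", "dtotal");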

How to do data pre-processing in Spark in this case

I made the following dataset with Scala.
+--------------------+---+
| text| docu_no|
+--------------------+---+
|서울,NNP 시내,NNG 한,M...| 1|
|최저,NNG 임금,NNG 때문,...| 2|
|왜,MAG 시급,NNG 만,JX...| 3|
|지금,MAG 경제,NNG 가,J...| 4|
|임대료,NNG 폭리,NNG 내리...| 5|
|모든,MM 문제,NNG 를,JK...| 6|
|니,NP 들,XSN 이,JKS ...| 7|
|실제,NNG 자영업,NNG 자,...| 8|
+--------------------+---+
I want to make a DTM (document-term matrix) for analysis. For example:
docu_no|서울|시내|한|최저|임금|지금|폭리|...
      1|   1|   1|  1|   0|   0|   0|   0
      2|   0|   0|  0|   1|   1|   1|   1
For this, I thought of pre-processing as follows:
+--------------------+-----+-------+
|                text|count|docu_no|
+--------------------+-----+-------+
|            서울,NNP |    1|      1|
|            시내,NNG |    1|      1|
|             한,M... |    1|      1|
|            최저,NNG |    1|      2|
|            임금,NNG |    1|      2|
|             때문,...|    1|      2|
After I build this (as an RDD or Dataset), I can use groupBy and pivot to get the result I want, but that part is too difficult for me. If you have ideas, please let me know.
val data = List(("A", 1),("B", 2),("C", 3),("E", 4),("F", 5))
val df = sc.parallelize(data).toDF("text","doc_no")
df.show()
+----+------+
|text|doc_no|
+----+------+
| A| 1|
| B| 2|
| C| 3|
| E| 4|
| F| 5|
+----+------+
import org.apache.spark.sql.functions._
df.groupBy($"doc_no").pivot("text").agg(count("doc_no")).show()
+------+---+---+---+---+---+
|doc_no| A| B| C| E| F|
+------+---+---+---+---+---+
| 1| 1| 0| 0| 0| 0|
| 2| 0| 1| 0| 0| 0|
| 3| 0| 0| 1| 0| 0|
| 4| 0| 0| 0| 1| 0|
| 5| 0| 0| 0| 0| 1|
+------+---+---+---+---+---+
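The remaining piece is the pre-processing itself: getting from one row per document to one row per token before the pivot. A sketch of that step with the Java API (the splitting logic carries over to Scala directly), assuming a Dataset<Row> df with the text and docu_no columns from the question; the token column name is illustrative:
import static org.apache.spark.sql.functions.*;

// One row per token: split the document on spaces and explode,
// then keep only the word before the comma (dropping the POS tag).
Dataset<Row> tokens = df
    .withColumn("token", explode(split(col("text"), " ")))
    .withColumn("token", split(col("token"), ",").getItem(0));

// Pivot into the document-term matrix; absent terms become 0.
Dataset<Row> dtm = tokens.groupBy("docu_no")
    .pivot("token")
    .agg(count("token"))
    .na().fill(0);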

Spark Java API, Dataset Manipulation?

I'm new to the Spark Java API. My dataset contains two columns (account, Lib). I want to display the accounts that have different Lib values. In fact my dataset is something like this.
ds1
+---------+------------+
| account| Lib |
+---------+------------+
| 222222 | bbbb |
| 222222 | bbbb |
| 222222 | bbbb |
| | |
| 333333 | aaaa |
| 333333 | bbbb |
| 333333 | cccc |
| | |
| 444444 | dddd |
| 444444 | dddd |
| 444444 | dddd |
| | |
| 555555 | vvvv |
| 555555 | hhhh |
| 555555 | vvvv |
I want to get ds2 like this:
+---------+------------+
| account| Lib |
+---------+------------+
| | |
| 333333 | aaaa |
| 333333 | bbbb |
| 333333 | cccc |
| | |
| 555555 | vvvv |
| 555555 | hhhh |
If groups are small you can use window functions (countDistinct is not supported over a window, hence the approximate variant):
import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.expressions.Window;

Dataset<Row> result = df
    .withColumn("cnt", approx_count_distinct("Lib").over(Window.partitionBy("account")))
    .where(col("cnt").gt(1));
If groups are large:
Dataset<Row> counts = df
    .groupBy("account")
    .agg(countDistinct("Lib").alias("cnt"))
    .where(col("cnt").gt(1));

Dataset<Row> result = df.join(
    counts,
    df.col("account").equalTo(counts.col("account")),
    "leftsemi");
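If an exact count is needed without the join, one alternative (assuming each account's distinct Lib set fits comfortably in memory) is to collect the distinct values per window and count them:
Dataset<Row> exact = df
    .withColumn("cnt", size(collect_set("Lib").over(Window.partitionBy("account"))))
    .where(col("cnt").gt(1));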

Game tree not working correctly in Java. Is the addChild() function OK?

import java.util.ArrayList;
import java.util.List;

public class Tree
{
    private Board board;
    private List<Tree> children;
    private Tree parent;

    public Tree(Board board1)
    {
        this.board = board1;
        this.children = new ArrayList<Tree>();
    }

    public Tree(Tree t1)
    {
    }

    public Tree createTree(Tree tree, boolean isHuman, int depth)
    {
        Player play1 = new Player();
        ArrayList<Board> potBoards = new ArrayList<Board>(play1.potentialMoves(tree.board, isHuman));
        if (board.gameEnd() || depth == 0)
        {
            return null;
        }
        //Tree oldTree = new Tree(board);
        for (int i = 0; i < potBoards.size() - 1; i++)
        {
            Tree newTree = new Tree(potBoards.get(i));
            createTree(newTree, !isHuman, depth - 1);
            tree.addChild(newTree);
        }
        return tree;
    }

    private Tree addChild(Tree child)
    {
        Tree childNode = new Tree(child);
        childNode.parent = this;
        this.children.add(childNode);
        return childNode;
    }
}
Hi there. I'm trying to make a game tree that will be handled by minimax in the future. I think the error is either in the addChild function or in potentialMoves. potentialMoves returns all the potential moves a player or the computer can make; for example, in Othello the first player can make any of the following moves for the first turn:
0 1 2 3 4 5 6 7
+-+-+-+-+-+-+-+-+
0| | | | | | | | |
+-+-+-+-+-+-+-+-+
1| | | | | | | | |
+-+-+-+-+-+-+-+-+
2| | | |b| | | | |
+-+-+-+-+-+-+-+-+
3| | | |b|b| | | |
+-+-+-+-+-+-+-+-+
4| | | |b|w| | | |
+-+-+-+-+-+-+-+-+
5| | | | | | | | |
+-+-+-+-+-+-+-+-+
6| | | | | | | | |
+-+-+-+-+-+-+-+-+
7| | | | | | | | |
+-+-+-+-+-+-+-+-+
0 1 2 3 4 5 6 7
+-+-+-+-+-+-+-+-+
0| | | | | | | | |
+-+-+-+-+-+-+-+-+
1| | | | | | | | |
+-+-+-+-+-+-+-+-+
2| | | | | | | | |
+-+-+-+-+-+-+-+-+
3| | |b|b|b| | | |
+-+-+-+-+-+-+-+-+
4| | | |b|w| | | |
+-+-+-+-+-+-+-+-+
5| | | | | | | | |
+-+-+-+-+-+-+-+-+
6| | | | | | | | |
+-+-+-+-+-+-+-+-+
7| | | | | | | | |
+-+-+-+-+-+-+-+-+
0 1 2 3 4 5 6 7
+-+-+-+-+-+-+-+-+
0| | | | | | | | |
+-+-+-+-+-+-+-+-+
1| | | | | | | | |
+-+-+-+-+-+-+-+-+
2| | | | | | | | |
+-+-+-+-+-+-+-+-+
3| | | |w|b| | | |
+-+-+-+-+-+-+-+-+
4| | | |b|b|b| | |
+-+-+-+-+-+-+-+-+
5| | | | | | | | |
+-+-+-+-+-+-+-+-+
6| | | | | | | | |
+-+-+-+-+-+-+-+-+
7| | | | | | | | |
+-+-+-+-+-+-+-+-+
0 1 2 3 4 5 6 7
+-+-+-+-+-+-+-+-+
0| | | | | | | | |
+-+-+-+-+-+-+-+-+
1| | | | | | | | |
+-+-+-+-+-+-+-+-+
2| | | | | | | | |
+-+-+-+-+-+-+-+-+
3| | | |w|b| | | |
+-+-+-+-+-+-+-+-+
4| | | |b|b| | | |
+-+-+-+-+-+-+-+-+
5| | | | |b| | | |
+-+-+-+-+-+-+-+-+
6| | | | | | | | |
+-+-+-+-+-+-+-+-+
7| | | | | | | | |
+-+-+-+-+-+-+-+-+
potentialMoves does not permanently change the board that is being played on; it returns an ArrayList.
I have this in my main:
Tree gameTree = new Tree(boardOthello);
Tree pickTree = gameTree.createTree(gameTree, true, 2);
Does the addChild() function look ok or is there something else I'm missing in my code?
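One likely culprit, judging from the code alone (a hedged reading, not a confirmed diagnosis): addChild wraps the incoming node in the empty Tree(Tree) copy constructor, so the node actually stored has a null board and a null children list, and the subtree built by createTree is discarded. A sketch of the fix:
// Store the child node directly; the empty Tree(Tree) constructor
// left board and children uninitialized.
private Tree addChild(Tree child)
{
    child.parent = this;
    this.children.add(child);
    return child;
}
Two smaller issues worth checking in createTree: the loop bound i < potBoards.size() - 1 skips the last potential move (use i < potBoards.size()), and the end-of-game test reads the enclosing instance's board rather than the node being expanded (use tree.board.gameEnd()).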

Insert variables, arrays into MySQL database using Java

I've successfully inserted data into the database by just writing in the data that I need. Now I'm trying to insert variables and arrays that hold the data. This was kind of a shot in the dark because I had no idea how to do it; I just guessed. I get no syntax errors, so I thought I was doing well, but it doesn't work. I just need to know the exact syntax for this.
for (int i = 0; i < ReadingFile.altitudeList.size(); i++) {
    for (int j = 0; j < ReadingFile.temperatureList.size(); j++) {
        for (int k = 0; k < ReadingFile.velocityList.size(); k++) {
            for (int x = 0; x < ReadingFile.latList.size(); x++) {
                for (int y = 0; y < ReadingFile.longList.size(); y++) {
                    stat.execute("INSERT INTO TrailTracker VALUES(id,ReadingFile.date,ReadingFile.distance, ReadingFile.timeElapsed, ReadingFile.startTime,"
                        + "ReadingFile.temperatureList[j], ReadingFile.velocityList[k], ReadingFile.altitudeList[i], ReadingFile.latList[x],"
                        + "ReadingFile.longList[y])");
                }
            }
        }
    }
}
You can't insert variables or arrays into a database. You can only insert data, i.e. the values of your variables or arrays.
A PreparedStatement is the way to go. It would look something like this:
int a = 1;
Date b = new Date();
String c = "hello world";

PreparedStatement stmt = conn.prepareStatement("INSERT INTO MyTable VALUES (?,?,?)");
stmt.setInt(1, a);
stmt.setDate(2, new java.sql.Date(b.getTime()));
stmt.setString(3, c);
stmt.execute();
Note that it doesn't look like you have correctly designed your table to match your data. Your ReadingFile seems to have 5 Lists and you need to figure out how the values in these lists relate to each other. Your current logic with 5 nested loops is almost certainly not what you want. It results in a highly denormalised structure.
For example, say you had a ReadingFile object with an id of 1, a date of 20/1/2011, a distance of 10, a time elapsed of 20 and a start time of 30, and each of the lists had two values:
- temperature 21, 23
- velocity 51, 52
- altitude 1000, 2000
- lat 45.1, 47.2
- long 52.3, 58.4
Then your nested loops would insert data into your table like this:
+--+---------+--------+-----------+---------+-----------+--------+--------+----+----+
|id| date|distance|timeElapsed|startTime|temperature|velocity|altitude| lat|long|
+--+---------+--------+-----------+---------+-----------+--------+--------+----+----+
| 1|20.1.2011| 10| 20| 30| 21| 51| 1000|45.1|52.3|
| 1|20.1.2011| 10| 20| 30| 21| 51| 1000|45.1|58.4|
| 1|20.1.2011| 10| 20| 30| 21| 51| 1000|47.2|52.3|
| 1|20.1.2011| 10| 20| 30| 21| 51| 1000|47.2|58.4|
| 1|20.1.2011| 10| 20| 30| 21| 52| 1000|45.1|52.3|
| 1|20.1.2011| 10| 20| 30| 21| 52| 1000|45.1|58.4|
| 1|20.1.2011| 10| 20| 30| 21| 52| 1000|47.2|52.3|
| 1|20.1.2011| 10| 20| 30| 21| 52| 1000|47.2|58.4|
| 1|20.1.2011| 10| 20| 30| 23| 51| 1000|45.1|52.3|
| 1|20.1.2011| 10| 20| 30| 23| 51| 1000|45.1|58.4|
| 1|20.1.2011| 10| 20| 30| 23| 51| 1000|47.2|52.3|
| 1|20.1.2011| 10| 20| 30| 23| 51| 1000|47.2|58.4|
| 1|20.1.2011| 10| 20| 30| 23| 52| 1000|45.1|52.3|
| 1|20.1.2011| 10| 20| 30| 23| 52| 1000|45.1|58.4|
| 1|20.1.2011| 10| 20| 30| 23| 52| 1000|47.2|52.3|
| 1|20.1.2011| 10| 20| 30| 23| 52| 1000|47.2|58.4|
| 1|20.1.2011| 10| 20| 30| 21| 51| 2000|45.1|52.3|
| 1|20.1.2011| 10| 20| 30| 21| 51| 2000|45.1|58.4|
| 1|20.1.2011| 10| 20| 30| 21| 51| 2000|47.2|52.3|
| 1|20.1.2011| 10| 20| 30| 21| 51| 2000|47.2|58.4|
| 1|20.1.2011| 10| 20| 30| 21| 52| 2000|45.1|52.3|
| 1|20.1.2011| 10| 20| 30| 21| 52| 2000|45.1|58.4|
| 1|20.1.2011| 10| 20| 30| 21| 52| 2000|47.2|52.3|
| 1|20.1.2011| 10| 20| 30| 21| 52| 2000|47.2|58.4|
| 1|20.1.2011| 10| 20| 30| 23| 51| 2000|45.1|52.3|
| 1|20.1.2011| 10| 20| 30| 23| 51| 2000|45.1|58.4|
| 1|20.1.2011| 10| 20| 30| 23| 51| 2000|47.2|52.3|
| 1|20.1.2011| 10| 20| 30| 23| 51| 2000|47.2|58.4|
| 1|20.1.2011| 10| 20| 30| 23| 52| 2000|45.1|52.3|
| 1|20.1.2011| 10| 20| 30| 23| 52| 2000|45.1|58.4|
| 1|20.1.2011| 10| 20| 30| 23| 52| 2000|47.2|52.3|
| 1|20.1.2011| 10| 20| 30| 23| 52| 2000|47.2|58.4|
+--+---------+--------+-----------+---------+-----------+--------+--------+----+----+
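If the five lists are actually parallel readings (one entry per index; that is an assumption about your data, as is every column position and field type below), a single loop with a batched PreparedStatement avoids the cartesian blow-up:
String sql = "INSERT INTO TrailTracker VALUES (?,?,?,?,?,?,?,?,?,?)";
PreparedStatement ps = conn.prepareStatement(sql);
for (int i = 0; i < ReadingFile.altitudeList.size(); i++) {
    ps.setInt(1, id);
    ps.setString(2, ReadingFile.date);                  // adjust setters to your real field types
    ps.setDouble(3, ReadingFile.distance);
    ps.setDouble(4, ReadingFile.timeElapsed);
    ps.setDouble(5, ReadingFile.startTime);
    ps.setDouble(6, ReadingFile.temperatureList.get(i));
    ps.setDouble(7, ReadingFile.velocityList.get(i));
    ps.setDouble(8, ReadingFile.altitudeList.get(i));
    ps.setDouble(9, ReadingFile.latList.get(i));
    ps.setDouble(10, ReadingFile.longList.get(i));
    ps.addBatch();                                      // queue this row
}
ps.executeBatch();                                      // insert all queued rows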
This would be an invalid query. You need to go for a PreparedStatement.
So I figured out the easiest way to do what I needed using a while loop:
PreparedStatement prstat = conn.prepareStatement(insertStr);
while (temp != sampleSize) {
    prstat.setInt(1, id);
    prstat.setInt(7, v.get(temp));
    prstat.executeUpdate();
    temp++;
}
temp is initially set to zero and increments while inserting elements from the ArrayList into the database until it equals sampleSize (sampleSize = v.size();), so the loop knows it has reached the end of the list. Thanks for everyone's help!
