I have the below DataFrame in Spark:
+---------+--------------+-------+------------+---------+------+------+
| id| txnId|account| date| idl| type|amount|
+---------+--------------+-------+------------+---------+------+------+
| 153| 0000004512 | 30095| 11272020| 30| debit| 1000|
| 153| 0000004512 | 30096| 11272020| 0|credit| 200|
| 145| 0000004513 | 30095| 11272020| 0| debit| 4000|
| 135| 0000004512 | 30096| 11272020| 0|credit| 2000|
| 153| 0000004512 | 30097| 11272020| 0| debit| 1000|
| 145| 0000004514 | 30094| 11272020| 0| debit| 1000|
+---------+--------------+-------+------------+---------+------+------+
I need to group the rows based on id, txnId and type, and add another column with the counts.
For example, the output should be:
+---------+--------------+-------+------------+---------+------+------+
|       id|         txnId|account|        date|      idl|  type| count|
+---------+--------------+-------+------------+---------+------+------+
|      153|    0000004512|  30095|    11272020|       30| debit|     2|
|      153|    0000004512|  30097|    11272020|        0| debit|     2|
|      153|    0000004512|  30096|    11272020|        0|credit|     1|
|      145|    0000004513|  30095|    11272020|        0| debit|     2|
|      145|    0000004514|  30094|    11272020|        0| debit|     2|
|      135|    0000004512|  30096|    11272020|        0|credit|     1|
+---------+--------------+-------+------------+---------+------+------+
Here is the logic I tried:
WindowSpec windowSpec = Window.partitionBy("id","txnId","type").orderBy("id");
Column roworder = rank().over(windowSpec).as("rank");
Dataset<Row> df1 = df.select(df.col("*"),roworder);
Dataset<Row> df2 = df1.withColumn("count",sum(agg(df1.col("id"),1))
But this is not working.
You don't need the rank function to achieve this; a count over the window is enough:
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;
import static org.apache.spark.sql.functions.*;
// No ordering is needed: we want the total count over each (id, txnId, type) partition.
WindowSpec windowSpec = Window.partitionBy("id", "txnId", "type");
Dataset<Row> df2 = df.withColumn("count", count("*").over(windowSpec));
This gives the result:
+---+----------+-------+--------+---+------+------+-----+
| id| txnId|account| date|idl| type|amount|count|
+---+----------+-------+--------+---+------+------+-----+
|145|0000004513| 30095|11272020| 0| debit| 4000| 1|
|145|0000004514| 30094|11272020| 0| debit| 1000| 1|
|135|0000004512| 30096|11272020| 0|credit| 2000| 1|
|153|0000004512| 30095|11272020| 30| debit| 1000| 2|
|153|0000004512| 30097|11272020| 0| debit| 1000| 2|
|153|0000004512| 30096|11272020| 0|credit| 200| 1|
+---+----------+-------+--------+---+------+------+-----+
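As an aside: a plain groupBy would give the same counts, but it collapses each group to a single row, losing account, date, idl and amount; the window version above is what lets you keep them. For contrast, a minimal sketch of the collapsed form:
// One row per (id, txnId, type); the per-row columns are gone.
Dataset<Row> grouped = df.groupBy("id", "txnId", "type").count();
grouped.show();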
I have the below data frame, which I have grouped by id, txnId and date:
+---------+--------------+-------+------------+---------+------+------+
| id| txnId|account| date| idl| type|amount|
+---------+--------------+-------+------------+---------+------+------+
| 153| 0000004512 | 30095| 11272020| 30| debit| 1000|
| 153| 0000004512 | 30096| 11272020| 0|credit| 200|
| 145| 0000004513 | 30095| 11272020| 0| debit| 4000|
| 135| 0000004512 | 30096| 11272020| 0|credit| 2000|
| 153| 0000004512 | 30097| 11272020| 0| debit| 1000|
| 145| 0000004514 | 30094| 11272020| 0| debit| 1000|
+---------+--------------+-------+------------+---------+------+------+
So after grouping, the output is:
+---------+--------------+-------+------------+---------+------+------+
| id| txnId|account| date| idl| type|amount|
+---------+--------------+-------+------------+---------+------+------+
| 153| 0000004512| 30095| 11272020| 30| debit| 1000|
| 153| 0000004512| 30096| 11272020| 0|credit| 200|
| 153| 0000004512| 30097| 11272020| 0| debit| 1000|
| 153| 0000004512| 30097| 11272020| 0|credit| 500|
| 145| 0000004513| 30095| 11272020| 0| debit| 4000|
| 145| 0000004514| 30094| 11272020| 0| debit| 1000|
| 135| 0000004512| 30096| 11272020| 0|credit| 2000|
+---------+--------------+-------+------------+---------+------+------+
I need to add two more columns to the data frame, each holding the total of amounts of type credit or debit for that group. The output should look like:
+---------+--------------+-------+-----------+---------+------+------+-----------+----------+
| id| txnId|account| date| idl| type|amount|totalcredit|totaldebit|
+---------+--------------+-------+-----------+---------+------+------+-----------+----------+
| 153| 0000004512| 30095| 11272020| 30| debit| 1000| 0| 2000|
| 153| 0000004512| 30096| 11272020| 0|credit| 200| 700| 0|
| 153| 0000004512| 30097| 11272020| 0| debit| 1000| 0| 2000|
| 153| 0000004512| 30097| 11272020| 0|credit| 500| 700| 0|
| 145| 0000004513| 30095| 11272020| 0| debit| 4000| 0| 4000|
| 145| 0000004514| 30094| 11272020| 0|credit| 1000| 1000| 0|
| 135| 0000004512| 30096| 11272020| 0|credit| 2000| 2000| 0|
+---------+--------------+-------+-----------+---------+------+------+-----------+----------+
I have written the below code to add the new column:
Dataset<Row> df3 = df2.where(df2.col("type").equalTo("credit"))
.groupBy("type")
.agg(sum("amount")).withColumnRenamed("sum(amount)", "totalcredit");
but it is dropping the other columns from the dataset. How do I preserve the other columns?
You want to use conditional sum aggregation over a Window partitioned by id:
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;
import static org.apache.spark.sql.functions.*;
WindowSpec w = Window.partitionBy("id");
Dataset<Row> df3 = df2.withColumn(
    "totalcredit",
    when(
        col("type").equalTo("credit"),
        sum(when(col("type").equalTo("credit"), col("amount"))).over(w)
    ).otherwise(0)
).withColumn(
    "totaldebit",
    when(
        col("type").equalTo("debit"),
        sum(when(col("type").equalTo("debit"), col("amount"))).over(w)
    ).otherwise(0)
);
df3.show();
//+---+-----+-------+--------+---+------+------+-----------+----------+
//| id|txnId|account| date|idl| type|amount|totalcredit|totaldebit|
//+---+-----+-------+--------+---+------+------+-----------+----------+
//|145| 4513| 30095|11272020| 0| debit| 4000| 0| 5000|
//|145| 4514| 30094|11272020| 0| debit| 1000| 0| 5000|
//|135| 4512| 30096|11272020| 0|credit| 2000| 2000| 0|
//|153| 4512| 30095|11272020| 30| debit| 1000| 0| 2000|
//|153| 4512| 30096|11272020| 0|credit| 200| 700| 0|
//|153| 4512| 30097|11272020| 0| debit| 1000| 0| 2000|
//|153| 4512| 30097|11272020| 0|credit| 500| 700| 0|
//+---+-----+-------+--------+---+------+------+-----------+----------+
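In case the nested when reads oddly: the inner sum(when(...)) on its own already computes the per-id total for one type across the whole partition, because rows of the other type contribute null and sum() skips nulls; the outer when just zeroes that total out on rows of the other type, producing the 0s in the output above. For example, this column alone would carry the credit total on every row of the partition:
// Per-id credit total on every row; debit rows contribute null, which sum() ignores.
Column creditTotal = sum(when(col("type").equalTo("credit"), col("amount"))).over(w);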
I made the following dataset with Scala.
+--------------------+-------+
|                text|docu_no|
+--------------------+-------+
|서울,NNP 시내,NNG 한,M...|      1|
|최저,NNG 임금,NNG 때문,...|      2|
|왜,MAG 시급,NNG 만,JX...|      3|
|지금,MAG 경제,NNG 가,J...|      4|
|임대료,NNG 폭리,NNG 내리...|      5|
|모든,MM 문제,NNG 를,JK...|      6|
|니,NP 들,XSN 이,JKS ...|      7|
|실제,NNG 자영업,NNG 자,...|      8|
+--------------------+-------+
I want to make a DTM (document-term matrix) for analysis. For example:
docu_no|서울|시내| 한|최저|임금|지금|폭리 ...
      1|  1|  1|  1|  0|  0|  0|  0
      2|  0|  0|  0|  1|  1|  1|  1
For this, I thought of pre-processing it as follows:
+---------+-----+-------+
|     text|count|docu_no|
+---------+-----+-------+
|  서울,NNP|    1|      1|
|  시내,NNG|    1|      1|
|   한,M...|    1|      1|
|  최저,NNG|    1|      2|
|  임금,NNG|    1|      2|
|  때문,...|    1|      2|
+---------+-----+-------+
After I make this (as an RDD or Dataset), if I use groupBy and pivot I should get the result I want, but it is too difficult for me. If you have any ideas, please let me know.
val data = List(("A", 1),("B", 2),("C", 3),("E", 4),("F", 5))
val df = sc.parallelize(data).toDF("text","doc_no")
df.show()
+----+------+
|text|doc_no|
+----+------+
| A| 1|
| B| 2|
| C| 3|
| E| 4|
| F| 5|
+----+------+
import org.apache.spark.sql.functions._
df.groupBy($"doc_no").pivot("text").agg(count("doc_no")).show()
+------+---+---+---+---+---+
|doc_no| A| B| C| E| F|
+------+---+---+---+---+---+
| 1| 1| 0| 0| 0| 0|
| 2| 0| 1| 0| 0| 0|
| 3| 0| 0| 1| 0| 0|
| 4| 0| 0| 0| 1| 0|
| 5| 0| 0| 0| 0| 1|
+------+---+---+---+---+---+
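To get from your raw text column to the one-token-per-row shape you sketched, you can split each line on spaces, explode, and strip the POS tag before pivoting. A sketch in the Spark Java API used elsewhere on this page (the Scala version is nearly identical), assuming tokens are space-separated and each token has the form word,TAG:
import static org.apache.spark.sql.functions.*;
// One row per (word, docu_no): split each line into tokens, explode them,
// then keep only the part before the comma (dropping the POS tag).
Dataset<Row> tokens = df
    .withColumn("token", explode(split(col("text"), " ")))
    .withColumn("word", split(col("token"), ",").getItem(0));
// Document-term matrix: one row per document, one count column per word.
Dataset<Row> dtm = tokens.groupBy("docu_no").pivot("word").count().na().fill(0);
dtm.show();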
I'm new to the Spark Java API. My dataset contains two columns (account, Lib). I want to display the accounts that have different Libs. My dataset is something like this:
ds1
+---------+------------+
| account| Lib |
+---------+------------+
| 222222 | bbbb |
| 222222 | bbbb |
| 222222 | bbbb |
| | |
| 333333 | aaaa |
| 333333 | bbbb |
| 333333 | cccc |
| | |
| 444444 | dddd |
| 444444 | dddd |
| 444444 | dddd |
| | |
| 555555 | vvvv |
| 555555 | hhhh |
| 555555  | vvvv       |
+---------+------------+
I want to get ds2 like this:
+---------+------------+
| account| Lib |
+---------+------------+
| | |
| 333333 | aaaa |
| 333333 | bbbb |
| 333333 | cccc |
| | |
| 555555 | vvvv |
| 555555  | hhhh       |
+---------+------------+
If groups are small you can use window functions:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
df
  .withColumn("cnt", approx_count_distinct("Lib").over(Window.partitionBy("account")))
  .where(col("cnt") > 1)
If groups are large:
df.join(
df
.groupBy("account")
.agg(countDistinct("Lib").alias("cnt")).where(col("cnt") > 1),
Seq("account"),
"leftsemi"
)
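Since the question mentions being new to the Spark Java API: the large-groups variant translates roughly like this (an untested sketch, with ds1 named as in your example):
import static org.apache.spark.sql.functions.*;
// Accounts that have more than one distinct Lib...
Dataset<Row> multiLib = ds1
    .groupBy("account")
    .agg(countDistinct("Lib").alias("cnt"))
    .where(col("cnt").gt(1));
// ...then keep only the ds1 rows whose account appears in that set.
Dataset<Row> ds2 = ds1.join(
    multiLib,
    ds1.col("account").equalTo(multiLib.col("account")),
    "leftsemi");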
import java.util.ArrayList;
import java.util.List;
public class Tree
{
private Board board;
private List<Tree> children;
private Tree parent;
public Tree(Board board1)
{
this.board = board1;
this.children = new ArrayList<Tree>();
}
public Tree(Tree t1)
{
}
public Tree createTree(Tree tree, boolean isHuman, int depth)
{
Player play1 = new Player();
ArrayList<Board> potBoards = new ArrayList<Board>(play1.potentialMoves(tree.board, isHuman));
if (board.gameEnd() || depth == 0)
{
return null;
}
//Tree oldTree = new Tree(board);
for (int i = 0; i < potBoards.size() - 1; i++)
{
Tree newTree = new Tree(potBoards.get(i));
createTree(newTree, !isHuman, depth - 1);
tree.addChild(newTree);
}
return tree;
}
private Tree addChild(Tree child)
{
Tree childNode = new Tree(child);
childNode.parent = this;
this.children.add(childNode);
return childNode;
}
}
Hi there. I'm trying to make a game tree that will be handled by minimax in the future. I think the error is either in the addChild function or in potentialMoves. potentialMoves returns all the potential moves a player or the computer can make. For example, in Othello a player can go to any of the following
0 1 2 3 4 5 6 7
+-+-+-+-+-+-+-+-+
0| | | | | | | | |
+-+-+-+-+-+-+-+-+
1| | | | | | | | |
+-+-+-+-+-+-+-+-+
2| | | |b| | | | |
+-+-+-+-+-+-+-+-+
3| | | |b|b| | | |
+-+-+-+-+-+-+-+-+
4| | | |b|w| | | |
+-+-+-+-+-+-+-+-+
5| | | | | | | | |
+-+-+-+-+-+-+-+-+
6| | | | | | | | |
+-+-+-+-+-+-+-+-+
7| | | | | | | | |
+-+-+-+-+-+-+-+-+
 0 1 2 3 4 5 6 7
+-+-+-+-+-+-+-+-+
0| | | | | | | | |
+-+-+-+-+-+-+-+-+
1| | | | | | | | |
+-+-+-+-+-+-+-+-+
2| | | | | | | | |
+-+-+-+-+-+-+-+-+
3| | |b|b|b| | | |
+-+-+-+-+-+-+-+-+
4| | | |b|w| | | |
+-+-+-+-+-+-+-+-+
5| | | | | | | | |
+-+-+-+-+-+-+-+-+
6| | | | | | | | |
+-+-+-+-+-+-+-+-+
7| | | | | | | | |
+-+-+-+-+-+-+-+-+
0 1 2 3 4 5 6 7
+-+-+-+-+-+-+-+-+
0| | | | | | | | |
+-+-+-+-+-+-+-+-+
1| | | | | | | | |
+-+-+-+-+-+-+-+-+
2| | | | | | | | |
+-+-+-+-+-+-+-+-+
3| | | |w|b| | | |
+-+-+-+-+-+-+-+-+
4| | | |b|b|b| | |
+-+-+-+-+-+-+-+-+
5| | | | | | | | |
+-+-+-+-+-+-+-+-+
6| | | | | | | | |
+-+-+-+-+-+-+-+-+
7| | | | | | | | |
+-+-+-+-+-+-+-+-+
0 1 2 3 4 5 6 7
+-+-+-+-+-+-+-+-+
0| | | | | | | | |
+-+-+-+-+-+-+-+-+
1| | | | | | | | |
+-+-+-+-+-+-+-+-+
2| | | | | | | | |
+-+-+-+-+-+-+-+-+
3| | | |w|b| | | |
+-+-+-+-+-+-+-+-+
4| | | |b|b| | | |
+-+-+-+-+-+-+-+-+
5| | | | |b| | | |
+-+-+-+-+-+-+-+-+
6| | | | | | | | |
+-+-+-+-+-+-+-+-+
7| | | | | | | | |
+-+-+-+-+-+-+-+-+
for the first turn. potentialMoves does not permanently change the board that is being played on; it returns an ArrayList of the resulting boards.
I have this in my main:
Tree gameTree = new Tree(boardOthello);
Tree pickTree = gameTree.createTree(gameTree, true, 2);
Does the addChild() function look ok or is there something else I'm missing in my code?
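One thing stands out: the copy constructor Tree(Tree t1) has an empty body, so every node built inside addChild has a null board and a null children list, and the fully built newTree you pass in is discarded in favour of that empty wrapper. Also, i < potBoards.size() - 1 skips the last potential move. A minimal sketch of addChild that links the already-built child directly (assuming Board and Player otherwise behave as you describe):
// Attach the child node as-is instead of wrapping it in an empty copy.
private Tree addChild(Tree child)
{
    child.parent = this;
    this.children.add(child);
    return child;
}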
I've successfully inserted data into the database by writing the values in directly. Now I'm trying to insert variables and arrays that hold the data. This was a shot in the dark because I had no idea how to do it; I just guessed. I get no syntax errors, so I thought I was doing well, but it doesn't compile... I just need to know the exact syntax for doing this.
for (int i = 0; i < ReadingFile.altitudeList.size(); i++) {
    for (int j = 0; j < ReadingFile.temperatureList.size(); j++) {
        for (int k = 0; k < ReadingFile.velocityList.size(); k++) {
            for (int x = 0; x < ReadingFile.latList.size(); x++) {
                for (int y = 0; y < ReadingFile.longList.size(); y++) {
                    stat.execute("INSERT INTO TrailTracker VALUES(id,ReadingFile.date,ReadingFile.distance, ReadingFile.timeElapsed, ReadingFile.startTime,"
                        + "ReadingFile.temperatureList[j], ReadingFile.velocityList[k], ReadingFile.altitudeList[i], ReadingFile.latList[x],"
                        + "ReadingFile.longList[y])");
                }
            }
        }
    }
}
You can't insert variables or arrays into a database; you can only insert data, i.e. the values of your variables or arrays.
A PreparedStatement is the way to go. It would look something like this:
int a = 1;
Date b = new Date();
String c = "hello world";
PreparedStatement stmt = conn.prepareStatement("INSERT INTO MyTable VALUES (?,?,?)");
stmt.setInt(1, a);
stmt.setDate(2, new java.sql.Date(b.getTime()));
stmt.setString(3, c);
stmt.execute();
Note that it doesn't look like you have correctly designed your table to match your data. Your ReadingFile seems to have 5 Lists and you need to figure out how the values in these lists relate to each other. Your current logic with 5 nested loops is almost certainly not what you want. It results in a highly denormalised structure.
For example, say you had a ReadingFile object with an id of 1, date of 20/1/2011, distance of 10, time elapsed of 20 and start time of 30. Then each of the lists had two values:
- temperature 21, 23
- velocity 51, 52
- altitude 1000, 2000
- lat 45.1, 47.2
- long 52.3, 58.4
Then your nested loops would insert data into your table like this:
+--+---------+--------+-----------+---------+-----------+--------+--------+----+----+
|id| date|distance|timeElapsed|startTime|temperature|velocity|altitude| lat|long|
+--+---------+--------+-----------+---------+-----------+--------+--------+----+----+
| 1|20.1.2011| 10| 20| 30| 21| 51| 1000|45.1|52.3|
| 1|20.1.2011| 10| 20| 30| 21| 51| 1000|45.1|58.4|
| 1|20.1.2011| 10| 20| 30| 21| 51| 1000|47.2|52.3|
| 1|20.1.2011| 10| 20| 30| 21| 51| 1000|47.2|58.4|
| 1|20.1.2011| 10| 20| 30| 21| 52| 1000|45.1|52.3|
| 1|20.1.2011| 10| 20| 30| 21| 52| 1000|45.1|58.4|
| 1|20.1.2011| 10| 20| 30| 21| 52| 1000|47.2|52.3|
| 1|20.1.2011| 10| 20| 30| 21| 52| 1000|47.2|58.4|
| 1|20.1.2011| 10| 20| 30| 23| 51| 1000|45.1|52.3|
| 1|20.1.2011| 10| 20| 30| 23| 51| 1000|45.1|58.4|
| 1|20.1.2011| 10| 20| 30| 23| 51| 1000|47.2|52.3|
| 1|20.1.2011| 10| 20| 30| 23| 51| 1000|47.2|58.4|
| 1|20.1.2011| 10| 20| 30| 23| 52| 1000|45.1|52.3|
| 1|20.1.2011| 10| 20| 30| 23| 52| 1000|45.1|58.4|
| 1|20.1.2011| 10| 20| 30| 23| 52| 1000|47.2|52.3|
| 1|20.1.2011| 10| 20| 30| 23| 52| 1000|47.2|58.4|
| 1|20.1.2011| 10| 20| 30| 21| 51| 2000|45.1|52.3|
| 1|20.1.2011| 10| 20| 30| 21| 51| 2000|45.1|58.4|
| 1|20.1.2011| 10| 20| 30| 21| 51| 2000|47.2|52.3|
| 1|20.1.2011| 10| 20| 30| 21| 51| 2000|47.2|58.4|
| 1|20.1.2011| 10| 20| 30| 21| 52| 2000|45.1|52.3|
| 1|20.1.2011| 10| 20| 30| 21| 52| 2000|45.1|58.4|
| 1|20.1.2011| 10| 20| 30| 21| 52| 2000|47.2|52.3|
| 1|20.1.2011| 10| 20| 30| 21| 52| 2000|47.2|58.4|
| 1|20.1.2011| 10| 20| 30| 23| 51| 2000|45.1|52.3|
| 1|20.1.2011| 10| 20| 30| 23| 51| 2000|45.1|58.4|
| 1|20.1.2011| 10| 20| 30| 23| 51| 2000|47.2|52.3|
| 1|20.1.2011| 10| 20| 30| 23| 51| 2000|47.2|58.4|
| 1|20.1.2011| 10| 20| 30| 23| 52| 2000|45.1|52.3|
| 1|20.1.2011| 10| 20| 30| 23| 52| 2000|45.1|58.4|
| 1|20.1.2011| 10| 20| 30| 23| 52| 2000|47.2|52.3|
| 1|20.1.2011| 10| 20| 30| 23| 52| 2000|47.2|58.4|
+--+---------+--------+-----------+---------+-----------+--------+--------+----+----+
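If, as seems likely, the five lists are parallel (entry i of the temperature, velocity, altitude, lat and long lists all belong to reading i), a single loop over one index with a batched PreparedStatement is probably what you want. A sketch under that assumption; the column order and setter types are illustrative guesses, not taken from your actual schema:
PreparedStatement stmt = conn.prepareStatement(
    "INSERT INTO TrailTracker VALUES (?,?,?,?,?,?,?,?,?,?)");
for (int i = 0; i < ReadingFile.altitudeList.size(); i++) {
    stmt.setInt(1, id);
    stmt.setDate(2, new java.sql.Date(ReadingFile.date.getTime()));
    stmt.setDouble(3, ReadingFile.distance);
    stmt.setDouble(4, ReadingFile.timeElapsed);
    stmt.setDouble(5, ReadingFile.startTime);
    stmt.setDouble(6, ReadingFile.temperatureList.get(i)); // one reading per row
    stmt.setDouble(7, ReadingFile.velocityList.get(i));
    stmt.setDouble(8, ReadingFile.altitudeList.get(i));
    stmt.setDouble(9, ReadingFile.latList.get(i));
    stmt.setDouble(10, ReadingFile.longList.get(i));
    stmt.addBatch(); // queue the row; execute everything once below
}
stmt.executeBatch();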
That would be an invalid query. You need to use a PreparedStatement.
So I figured out the easiest way to do what I needed, using a while loop:
PreparedStatement prstat = conn.prepareStatement(insertStr); // assign the statement once
while (temp != sampleSize) {
    prstat.setInt(1, id);
    prstat.setInt(7, v.get(temp));
    prstat.executeUpdate();
    temp++;
}
temp is initially set to zero and increments while inserting elements from the ArrayList into the database, until it equals sampleSize (sampleSize = v.size();) so that it knows it has reached the end of the list. Thanks for everyone's help!