How to do data pre-processing in Spark in this case - Java

I made the following dataset with Scala.
+--------------------+---+
| text| docu_no|
+--------------------+---+
|서울,NNP 시내,NNG 한,M...| 1|
|최저,NNG 임금,NNG 때문,...| 2|
|왜,MAG 시급,NNG 만,JX...| 3|
|지금,MAG 경제,NNG 가,J...| 4|
|임대료,NNG 폭리,NNG 내리...| 5|
|모든,MM 문제,NNG 를,JK...| 6|
|니,NP 들,XSN 이,JKS ...| 7|
|실제,NNG 자영업,NNG 자,...| 8|
I want to build a document-term matrix (DTM) for analysis.
For example:
docu_no|서울|시내|한|최저|임금|지금|폭리 ...
1 1 1 1 0 0 0 0
2 0 0 0 1 1 1 1
To get there, I thought of pre-processing the data as follows.
+--------------------+---+
| text|count |docu_no
+--------------------+---+
|서울,NNP | 1| 1
|시내,NNG | 1| 1
|한,M. | 1| 1
|최저,NNG | 1| 2
|임금,NNG| 1| 2
|때문,...| 1| 2
Once I have this (as an RDD or Dataset), a groupBy followed by a pivot should give me the result I want, but I cannot work out how to do it. If you have any ideas, please let me know.

val data = List(("A", 1),("B", 2),("C", 3),("E", 4),("F", 5))
val df = sc.parallelize(data).toDF("text","doc_no")
df.show()
+----+------+
|text|doc_no|
+----+------+
| A| 1|
| B| 2|
| C| 3|
| E| 4|
| F| 5|
+----+------+
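import org.apache.spark.sql.functions._
// pivot can leave absent (doc_no, text) combinations as null; na.fill(0) turns them into explicit zeros
df.groupBy($"doc_no").pivot("text").agg(count("doc_no")).na.fill(0).show()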
+------+---+---+---+---+---+
|doc_no| A| B| C| E| F|
+------+---+---+---+---+---+
| 1| 1| 0| 0| 0| 0|
| 2| 0| 1| 0| 0| 0|
| 3| 0| 0| 1| 0| 0|
| 4| 0| 0| 0| 1| 0|
| 5| 0| 0| 0| 0| 1|
+------+---+---+---+---+---+
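The pivot above starts from a table that already has one token per row. The step the question is missing, turning each document string into (docu_no, word) pairs before counting, can be modeled in plain Java as below. This is only an illustrative sketch, not Spark code: the class DtmSketch and its method names are hypothetical, the sample tokens are ASCII placeholders, and the input is assumed to be space-separated word,TAG pairs as in the question's table. In Spark the same three steps would typically be expressed with split, explode and groupBy(...).pivot(...).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: a plain-Java model of "split, explode, pivot".
final class DtmSketch {

    // Split one document of "word,TAG word,TAG ..." pairs into its words.
    static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        for (String pair : text.trim().split("\\s+")) {
            if (pair.isEmpty()) continue;
            // Use the last comma so that a token which is itself "," survives.
            int comma = pair.lastIndexOf(',');
            out.add(comma < 0 ? pair : pair.substring(0, comma));
        }
        return out;
    }

    // Count every word per document: docNo -> (word -> count).
    static Map<Integer, Map<String, Integer>> counts(Map<Integer, String> docs) {
        Map<Integer, Map<String, Integer>> result = new LinkedHashMap<>();
        for (Map.Entry<Integer, String> doc : docs.entrySet()) {
            Map<String, Integer> perDoc = new LinkedHashMap<>();
            for (String w : words(doc.getValue())) {
                perDoc.merge(w, 1, Integer::sum);
            }
            result.put(doc.getKey(), perDoc);
        }
        return result;
    }

    // Vocabulary in first-seen order; these become the DTM columns.
    static List<String> vocabulary(Map<Integer, Map<String, Integer>> counts) {
        Set<String> vocab = new LinkedHashSet<>();
        for (Map<String, Integer> perDoc : counts.values()) {
            vocab.addAll(perDoc.keySet());
        }
        return new ArrayList<>(vocab);
    }

    // One DTM row: the count of each vocabulary word in a document, 0 if absent.
    static int[] row(Map<String, Integer> perDoc, List<String> vocab) {
        int[] r = new int[vocab.size()];
        for (int i = 0; i < r.length; i++) {
            r[i] = perDoc.getOrDefault(vocab.get(i), 0);
        }
        return r;
    }

    public static void main(String[] args) {
        Map<Integer, String> docs = new LinkedHashMap<>();
        docs.put(1, "seoul,NNP city,NNG one,MM");
        docs.put(2, "minimum,NNG wage,NNG city,NNG");
        Map<Integer, Map<String, Integer>> c = counts(docs);
        List<String> vocab = vocabulary(c);
        System.out.println("docu_no " + vocab);
        for (Map.Entry<Integer, Map<String, Integer>> e : c.entrySet()) {
            System.out.println(e.getKey() + " " + Arrays.toString(row(e.getValue(), vocab)));
        }
    }
}
```

The sketch keeps the per-document counts in "long" form first and only widens them into rows at the end, which is the same order of operations as exploding tokens and then pivoting.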

Related

Count of Txns in spark within a group

I have the below dataframe in spark
+---------+--------------+-------+------------+---------+------+------+
| id| txnId|account| date| idl| type|amount|
+---------+--------------+-------+------------+---------+------+------+
| 153| 0000004512 | 30095| 11272020| 30| debit| 1000|
| 153| 0000004512 | 30096| 11272020| 0|credit| 200|
| 145| 0000004513 | 30095| 11272020| 0| debit| 4000|
| 135| 0000004512 | 30096| 11272020| 0|credit| 2000|
| 153| 0000004512 | 30097| 11272020| 0| debit| 1000|
| 145| 0000004514 | 30094| 11272020| 0| debit| 1000|
+---------+--------------+-------+------------+---------+------+------+
I need to group the rows by id, txnId and type, and add another column holding the count for each group.
For example, the output should be:
+---------+--------------+-------+------------+---------+------+------+
| id| txnId|account| date| idl| type|count |
+---------+--------------+-------+------------+---------+------+------+
| 153| 0000004512 | 30095| 11272020| 30| debit| 2|
| 153| 0000004512 | 30097| 11272020| 0| debit| 2|
| 153| 0000004512 | 30096| 11272020| 0|credit| 1|
| 145| 0000004513 | 30095| 11272020| 0| debit| 2|
| 145| 0000004514 | 30094| 11272020| 0| debit| 2|
| 135| 0000004512 | 30096| 11272020| 0|credit| 1|
+---------+--------------+-------+------------+---------+------+------+
Here is the logic I tried but it is not working
WindowSpec windowSpec = Window.partitionBy("id","txnId","type").orderBy("id");
Column roworder = rank().over(windowSpec).as("rank");
Dataset<Row> df1 = df.select(df.col("*"),roworder);
Dataset<Row> df2 = df1.withColumn("count",sum(agg(df1.col("id"),1))
But this is not working
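You don't need the rank function to achieve what you want; a count over the window is enough:
// Attach to every row the number of rows in its (id, txnId, type) partition
WindowSpec windowSpec = Window.partitionBy("id","txnId","type").orderBy("id");
Dataset<Row> df2 = df.withColumn("count",count("*").over(windowSpec));
This gives the following result: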
+---+----------+-------+--------+---+------+------+-----+
| id| txnId|account| date|idl| type|amount|count|
+---+----------+-------+--------+---+------+------+-----+
|145|0000004513| 30095|11272020| 0| debit| 4000| 1|
|145|0000004514| 30094|11272020| 0| debit| 1000| 1|
|135|0000004512| 30096|11272020| 0|credit| 2000| 1|
|153|0000004512| 30095|11272020| 30| debit| 1000| 2|
|153|0000004512| 30097|11272020| 0| debit| 1000| 2|
|153|0000004512| 30096|11272020| 0|credit| 200| 1|
+---+----------+-------+--------+---+------+------+-----+
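What distinguishes a window count from a plain groupBy is that no rows are collapsed: every input row survives and simply carries the size of its partition. The plain-Java sketch below models that behavior without Spark. The class PartitionCount and the String[] row layout {id, txnId, type} are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of count("*") over a window partitioned by (id, txnId, type).
final class PartitionCount {

    // Returns, for each input row in order, the number of rows sharing its key.
    static List<Integer> countPerRow(List<String[]> rows) {
        // Pass 1: size of every partition.
        Map<List<String>, Integer> sizes = new HashMap<>();
        for (String[] r : rows) {
            sizes.merge(key(r), 1, Integer::sum);
        }
        // Pass 2: attach the partition size to each row; nothing is dropped.
        List<Integer> counts = new ArrayList<>();
        for (String[] r : rows) {
            counts.add(sizes.get(key(r)));
        }
        return counts;
    }

    // The partition key: the first three fields of the row.
    private static List<String> key(String[] r) {
        return Arrays.asList(r[0], r[1], r[2]);
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"153", "0000004512", "debit"});
        rows.add(new String[] {"153", "0000004512", "credit"});
        rows.add(new String[] {"153", "0000004512", "debit"});
        System.out.println(countPerRow(rows));
    }
}
```

With the key defined as (id, txnId, type), the two rows for id 145 fall into different partitions because their txnId values differ, which is why the result above shows a count of 1 for each of them.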

How to maintain the order of the data while selecting the distinct values of column from Dataset

I have a Dataset as below,
+------+------+---------------+
| col1 | col2 | sum(costs) |
+------+------+---------------+
| 1 | a | 3555204326.27 |
| 4 | b | 22273491.72 |
| 5 | c | 219175.00 |
| 3 | a | 219175.00 |
| 2 | c | 75341433.37 |
+------+------+---------------+
I need to select the distinct values of col1, and the resulting dataset should keep the order 1, 4, 5, 3, 2 (the order in which these values appear in the initial dataset). However, the order gets shuffled. Is there any way to keep the same order as the initial dataset? Any suggestion in Spark/SQL would be fine.
This dataset can be created in Spark with the following:
df = sqlCtx.createDataFrame(
[(1, 'a', 355.27), (4, 'b', 222.98), (5, 'c', 275.00), (3, 'a', 25.00),
(2, 'c', 753.37)], ('Col1', 'col2', 'cost'))
You can add another column containing the index of each row, then sort on that column after "distinct". Here is an example:
import org.apache.spark.sql.functions._
val df = Seq(1, 4, 4, 5, 2)
.toDF("a")
.withColumn("id", monotonically_increasing_id())
df.show()
// +---+---+
// | a| id|
// +---+---+
// | 1| 0|
// | 4| 1|
// | 4| 2|
// | 5| 3|
// | 2| 4|
// +---+---+
df.dropDuplicates("a").sort("id").show()
// +---+---+
// | a| id|
// +---+---+
// | 1| 0|
// | 4| 1|
// | 5| 3|
// | 2| 4|
// +---+---+
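Note that to apply distinct on one specific column you can use dropDuplicates; if you want to control which row is kept when there are duplicates, use groupBy instead. Also note that monotonically_increasing_id only guarantees increasing, unique values; they are consecutive (0, 1, 2, ...) only within a single partition, as in this small example.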
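The idea behind the index column can be illustrated without Spark. The sketch below is a plain-Java model under stated assumptions (the class OrderedDistinct and its method are hypothetical names, and the column is modeled as a simple list): each value is tagged with its original position, duplicates are collapsed by keeping the smallest position, and the survivors are sorted by that position. In Spark, aggregating the minimum index in a groupBy would play the role of deciding which duplicate survives.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of "add index, deduplicate, sort by index, drop index".
final class OrderedDistinct {

    // Returns the distinct values of the input, ordered by first occurrence.
    static <T> List<T> distinctInOriginalOrder(List<T> values) {
        // Tag each value with its position and keep the smallest position per value.
        // A HashMap is used on purpose: like a shuffle, it does not preserve order.
        Map<T, Integer> firstIndex = new HashMap<>();
        for (int i = 0; i < values.size(); i++) {
            firstIndex.putIfAbsent(values.get(i), i);
        }
        // Restore the original order by sorting on the stored position,
        // then return only the values (the index is dropped).
        List<T> result = new ArrayList<>(firstIndex.keySet());
        result.sort((a, b) -> Integer.compare(firstIndex.get(a), firstIndex.get(b)));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(distinctInOriginalOrder(Arrays.asList(1, 4, 4, 5, 2)));
    }
}
```

The sort only needs the stored positions to be increasing, not consecutive, which is the same property the index column provides in the Spark version.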
Assuming you are trying to remove the duplicates in col2 (as there are none in col1), the final result would be:
+----+----+---------------+
|col1|col2| sum|
+----+----+---------------+
| 1| a|3.55520432627E9|
| 4| b| 2.227349172E7|
| 5| c| 219175.0|
+----+----+---------------+
You could add an index column like:
df = df.withColumn("__idx", monotonically_increasing_id());
Then apply all the transformations you want, sort on the index, and drop it, as in:
df = df.dropDuplicates("col2").orderBy("__idx").drop("__idx");
In other words:
Step 1: load the data:
+----+----+---------------+
|col1|col2| sum|
+----+----+---------------+
| 1| a|3.55520432627E9|
| 4| b| 2.227349172E7|
| 5| c| 219175.0|
| 3| a| 219175.0|
| 2| c| 7.534143337E7|
+----+----+---------------+
Step 2: add the index:
+----+----+---------------+-----+
|col1|col2| sum|__idx|
+----+----+---------------+-----+
| 1| a|3.55520432627E9| 0|
| 4| b| 2.227349172E7| 1|
| 5| c| 219175.0| 2|
| 3| a| 219175.0| 3|
| 2| c| 7.534143337E7| 4|
+----+----+---------------+-----+
Step 3: transformations (here remove the dups in col2) and remove the __idx column:
+----+----+---------------+
|col1|col2| sum|
+----+----+---------------+
| 1| a|3.55520432627E9|
| 4| b| 2.227349172E7|
| 5| c| 219175.0|
+----+----+---------------+
The Java code could be:
package net.jgp.books.spark.ch12.lab990_others;
import static org.apache.spark.sql.functions.monotonically_increasing_id;
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
/**
* Keeping the order of rows during transformations.
*
* @author jgp
*/
public class KeepingOrderApp {
/**
* main() is your entry point to the application.
*
* @param args
*/
public static void main(String[] args) {
KeepingOrderApp app = new KeepingOrderApp();
app.start();
}
/**
* The processing code.
*/
private void start() {
// Creates a session on a local master
SparkSession spark = SparkSession.builder()
.appName("Splitting a dataframe to collect it")
.master("local")
.getOrCreate();
Dataset<Row> df = createDataframe(spark);
df.show();
df = df.withColumn("__idx", monotonically_increasing_id());
df.show();
df = df.dropDuplicates("col2").orderBy("__idx").drop("__idx");
df.show();
}
private static Dataset<Row> createDataframe(SparkSession spark) {
StructType schema = DataTypes.createStructType(new StructField[] {
DataTypes.createStructField(
"col1",
DataTypes.IntegerType,
false),
DataTypes.createStructField(
"col2",
DataTypes.StringType,
false),
DataTypes.createStructField(
"sum",
DataTypes.DoubleType,
false) });
List<Row> rows = new ArrayList<>();
rows.add(RowFactory.create(1, "a", 3555204326.27));
rows.add(RowFactory.create(4, "b", 22273491.72));
rows.add(RowFactory.create(5, "c", 219175.0));
rows.add(RowFactory.create(3, "a", 219175.0));
rows.add(RowFactory.create(2, "c", 75341433.37));
return spark.createDataFrame(rows, schema);
}
}
You could add an index column to your table and then use ORDER BY on it in your SQL query.
I believe you need to rewrite your query to use GROUP BY instead of DISTINCT, as suggested in the answer to "SQL: How to keep rows order with DISTINCT?".

How to get the size of result generated using concat_ws?

I am performing groupBy on COL1 and getting the concatenated list of COL2 using concat_ws. How can I get the count of values in that list? Here's my code:
Dataset<Row> ds = df.groupBy("COL1").agg(org.apache.spark.sql.functions
.concat_ws(",",org.apache.spark.sql.functions.collect_list("COL2")).as("sample"));
Use the size function.
size(e: Column): Column Returns length of array or map.
The following example is in Scala and I am leaving it to you to convert it to Java, but the general idea is exactly the same regardless of the programming language.
val input = spark.range(4)
.withColumn("COL1", $"id" % 2)
.select($"COL1", $"id" as "COL2")
scala> input.show
+----+----+
|COL1|COL2|
+----+----+
| 0| 0|
| 1| 1|
| 0| 2|
| 1| 3|
+----+----+
val s = input
.groupBy("COL1")
.agg(
concat_ws(",", collect_list("COL2")) as "concat",
size(collect_list("COL2")) as "size") // <-- size
scala> s.show
+----+------+----+
|COL1|concat|size|
+----+------+----+
| 0| 0,2| 2|
| 1| 1,3| 2|
+----+------+----+
In Java that'd be as follows. Thanks Krishna Prasad for sharing the code with the SO/Spark community!
Dataset<Row> ds = df.groupBy("COL1").agg(
org.apache.spark.sql.functions.concat_ws(",",org.apache.spark.sql.functions.collect_list("COL2")).as("sample"),
org.apache.spark.sql.functions.size(org.apache.spark.sql.functions.collect_list("COL2")).as("size"));

Transform Spark Dataset - count and merge multiple rows by ID

After some data processing, I end up with this Dataset:
Dataset<Row> counts //ID,COUNT,DAY_OF_WEEK
Now I want to transform it into the following format and save it as CSV:
ID,COUNT_DoW1, ID,COUNT_DoW2, ID,COUNT_DoW3,..ID,COUNT_DoW7
One approach I can think of is:
JavaPairRDD<Long, Map<Integer, Integer>> r = counts.toJavaRDD().mapToPair(...)
JavaPairRDD<Long, Map<Integer, Integer>> merged = r.reduceByKey(...);
Here each element is a pair of "ID" and a collection of size 7.
Once I have the JavaPairRDD, I can store it as CSV. Is there a simpler approach for this transformation that does not require converting to an RDD?
You can use the struct function to build a pair from the count and the day, and then do a groupBy with collect_list.
Something like this (Scala, but you can easily convert it to Java):
df.groupBy("ID").agg(collect_list(struct("COUNT","DAY")))
Next, write a UDF that extracts the relevant element. Use withColumn in a loop to copy the ID (df.withColumn("id2",col("id"))),
then create a UDF that extracts the count element at position i and apply it for every position, and finally do the same for the day.
If you keep the columns in the order you want and drop the irrelevant ones, you get what you asked for.
You can also use pivot (again in Scala, but it should be easy to convert to Java):
df.show()
>>+---+---+---+
>>| id|cnt|day|
>>+---+---+---+
>>|333| 31| 1|
>>|333| 32| 2|
>>|333|133| 3|
>>|333| 34| 4|
>>|333| 35| 5|
>>|333| 36| 6|
>>|333| 37| 7|
>>|222| 41| 4|
>>|111| 11| 1|
>>|111| 22| 2|
>>|111| 33| 3|
>>|111| 44| 4|
>>|111| 55| 5|
>>|111| 66| 6|
>>|111| 77| 7|
>>|222| 21| 1|
>>+---+---+---+
val df2 = df.withColumn("all",struct('id, 'cnt, 'day))
val res = df2.groupBy("id").pivot("day").agg(first('all).as("bla")).select("1.*","2.*","3.*", "4.*", "5.*", "6.*", "7.*")
res.show()
>>+---+---+---+----+----+----+----+----+----+---+---+---+----+----+----+----+----+----+----+----+----+
>>| id|cnt|day| id| cnt| day| id| cnt| day| id|cnt|day| id| cnt| day| id| cnt| day| id| cnt| day|
>>+---+---+---+----+----+----+----+----+----+---+---+---+----+----+----+----+----+----+----+----+----+
>>|333| 31| 1| 333| 32| 2| 333| 133| 3|333| 34| 4| 333| 35| 5| 333| 36| 6| 333| 37| 7|
>>|222| 21| 1|null|null|null|null|null|null|222| 41| 4|null|null|null|null|null|null|null|null|null|
>>|111| 11| 1| 111| 22| 2| 111| 33| 3|111| 44| 4| 111| 55| 5| 111| 66| 6| 111| 77| 7|
>>+---+---+---+----+----+----+----+----+----+---+---+---+----+----+----+----+----+----+----+----+----+
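Both approaches aim at one output row per ID with a slot for each day of the week. As a plain-Java model of that target shape (not Spark code), the sketch below groups the rows by ID and renders one CSV line per ID. The class WeekPivot, the long[] row layout {id, count, dayOfWeek}, and the choice to keep the ID but leave the count empty for a missing day are all assumptions made for illustration; the pivot output above shows nulls for both in that case.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of the "one row per ID, seven day slots" target shape.
final class WeekPivot {

    // Groups {id, count, dayOfWeek} rows into id -> counts, where slot 0 is day 1.
    static Map<Long, Integer[]> pivot(List<long[]> rows) {
        Map<Long, Integer[]> byId = new TreeMap<>();
        for (long[] r : rows) {
            long id = r[0];
            int count = (int) r[1];
            int day = (int) r[2]; // assumed to be in 1..7
            byId.computeIfAbsent(id, k -> new Integer[7])[day - 1] = count;
        }
        return byId;
    }

    // Renders one CSV line per ID as repeated "id,count" pairs, one per day.
    // A day without data keeps the id and leaves the count empty.
    static List<String> toCsv(Map<Long, Integer[]> byId) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<Long, Integer[]> e : byId.entrySet()) {
            StringBuilder sb = new StringBuilder();
            for (Integer c : e.getValue()) {
                if (sb.length() > 0) {
                    sb.append(',');
                }
                sb.append(e.getKey()).append(',').append(c == null ? "" : c.toString());
            }
            lines.add(sb.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        List<long[]> rows = new ArrayList<>();
        rows.add(new long[] {222, 21, 1});
        rows.add(new long[] {222, 41, 4});
        for (String line : toCsv(pivot(rows))) {
            System.out.println(line);
        }
    }
}
```

Fixing the number of slots at seven up front mirrors what the select("1.*", ..., "7.*") does in the pivot version: the output layout does not depend on which days happen to be present for a given ID.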

OrientDB Java API

I'm trying to work with OrientDB using the Java API. My aim is to add some vertices to a freshly created database.
First of all, I created a connection to the database:
graph = new OrientGraphFactory("PLOCAL:localhost/affiliates", "admin", "admin").getNoTx();
Next, in a loop, I created a number of vertices:
private Vertex createVertex(Company company) {
Vertex vertex = graph.addVertex(company);
//vertex.setProperty("id", company.getId());
if (company.getName() != null) {
vertex.setProperty("name", company.getName());
}
if (company.getInn() != null) {
vertex.setProperty("inn", company.getInn());
}
if (company.getOgrn() != null) {
vertex.setProperty("ogrn", company.getOgrn());
}
return vertex;
}
After that (in the same session) I counted the created vertices like this:
int vertexCount = 0;
// Iterate once with for-each; calling vertices.iterator() in the loop condition creates a new iterator on every check
for (Vertex vertex : graph.getVertices()) {
vertexCount++;
}
log.info("vertexCount = " + vertexCount);
I found the number of vertices I expected (100 vertices).
After that, I tried to look at those vertices in the OrientDB console and saw nothing:
orientdb {db=affiliates}> classes
CLASSES
+----+-----------+-------------+------------+-----+
|# |NAME |SUPER-CLASSES|CLUSTERS |COUNT|
+----+-----------+-------------+------------+-----+
|0 |_studio | |_studio(12) | 1|
|1 |E | |e(10) | 0|
|2 |OFunction | |ofunction(6)| 0|
|3 |OIdentity | |- | 0|
|4 |ORestricted| |- | 0|
|5 |ORole |[OIdentity] |orole(4) | 3|
|6 |OSchedule | |oschedule(8)| 0|
|7 |OSequence | |osequence(7)| 0|
|8 |OTriggered | |- | 0|
|9 |OUser |[OIdentity] |ouser(5) | 3|
|10 |V | |v(9) | 0|
+----+-----------+-------------+------------+-----+
| |TOTAL | | | 7|
+----+-----------+-------------+------------+-----+
What have I missed?
