Transform Spark Dataset - count and merge multiple rows by ID - Java

After some data processing, I end up with this Dataset:
Dataset<Row> counts //ID,COUNT,DAY_OF_WEEK
Now I want to transform it into the following format and save it as CSV:
ID,COUNT_DoW1, ID,COUNT_DoW2, ID,COUNT_DoW3,..ID,COUNT_DoW7
One approach I can think of is:
JavaPairRDD<Long, Map<Integer, Integer>> r = counts.toJavaRDD().mapToPair(...)
JavaPairRDD<Long, Map<Integer, Integer>> merged = r.reduceByKey(...);
where it is a pair of "ID" and a list of size 7.
After I get the JavaPairRDD, I can store it as CSV. Is there a simpler approach for this transformation without converting it to an RDD?

You can use the struct function to construct a pair from cnt and day and then do a groupBy with collect_list.
Something like this (Scala, but you can easily convert it to Java):
df.groupBy("ID").agg(collect_list(struct("COUNT","DAY")))
Now you can write a UDF which extracts the relevant columns. Do a withColumn in a loop to copy the ID (df.withColumn("id2", col("id"))), then create a UDF which extracts the count element at position i and run it for every position, and lastly do the same for the day.
If you keep the column order you want and drop the irrelevant columns, you get exactly what you asked for.
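For reference, a minimal Java sketch of the groupBy/collect_list step (assuming the counts Dataset from the question; untested):
import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Group by ID and collect the (COUNT, DAY_OF_WEEK) pairs into a single array column.
Dataset<Row> grouped = counts.groupBy("ID")
    .agg(collect_list(struct("COUNT", "DAY_OF_WEEK")).as("pairs"));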
You can also work with the pivot command (again in Scala, but you should be able to easily convert it to Java):
df.show()
>>+---+---+---+
>>| id|cnt|day|
>>+---+---+---+
>>|333| 31| 1|
>>|333| 32| 2|
>>|333|133| 3|
>>|333| 34| 4|
>>|333| 35| 5|
>>|333| 36| 6|
>>|333| 37| 7|
>>|222| 41| 4|
>>|111| 11| 1|
>>|111| 22| 2|
>>|111| 33| 3|
>>|111| 44| 4|
>>|111| 55| 5|
>>|111| 66| 6|
>>|111| 77| 7|
>>|222| 21| 1|
>>+---+---+---+
val df2 = df.withColumn("all", struct('id, 'cnt, 'day))
val res = df2.groupBy("id").pivot("day").agg(first('all).as("bla")).select("1.*", "2.*", "3.*", "4.*", "5.*", "6.*", "7.*")
res.show()
>>+---+---+---+----+----+----+----+----+----+---+---+---+----+----+----+----+----+----+----+----+----+
>>| id|cnt|day| id| cnt| day| id| cnt| day| id|cnt|day| id| cnt| day| id| cnt| day| id| cnt| day|
>>+---+---+---+----+----+----+----+----+----+---+---+---+----+----+----+----+----+----+----+----+----+
>>|333| 31| 1| 333| 32| 2| 333| 133| 3|333| 34| 4| 333| 35| 5| 333| 36| 6| 333| 37| 7|
>>|222| 21| 1|null|null|null|null|null|null|222| 41| 4|null|null|null|null|null|null|null|null|null|
>>|111| 11| 1| 111| 22| 2| 111| 33| 3|111| 44| 4| 111| 55| 5| 111| 66| 6| 111| 77| 7|
>>+---+---+---+----+----+----+----+----+----+---+---+---+----+----+----+----+----+----+----+----+----+
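A rough Java equivalent of the pivot approach might look like this (a sketch under the same id/cnt/day column names as the example above; the output path is a placeholder):
import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Wrap the three columns in a struct, pivot on day, keep the first struct per cell,
// then flatten each pivoted struct back into id/cnt/day columns.
Dataset<Row> withStruct = df.withColumn("all", struct(col("id"), col("cnt"), col("day")));
Dataset<Row> res = withStruct.groupBy("id")
    .pivot("day")
    .agg(first(col("all")))
    .select("1.*", "2.*", "3.*", "4.*", "5.*", "6.*", "7.*");

// Write as CSV (placeholder path; you may need to rename the duplicate columns first).
res.write().option("header", "true").csv("/tmp/counts_by_day");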

Related

Create a GraphX graph in Java Spark using a 2 column dataframe

I have a 2-column data frame as shown below.
+---+--------+
| src| dst|
+---+--------+
| 1| 2|
| 2| 3|
| 1| 3|
| 4| 3|
| 4| 5|
| 2| 5|
| 7| 8|
| 9| 8|
+---+--------+
I am trying to create a graph from this edge DataFrame. I am able to create the graph manually using a List<Edge<String>>. I want to know if there is any simpler way to create a graph from the edge DataFrame.
List<Edge<String>> edges = new ArrayList<>();
edges.add(new Edge<String>(1, 2, ""));
edges.add(new Edge<String>(2, 3, ""));
edges.add(new Edge<String>(1, 3, ""));
edges.add(new Edge<String>(4, 3, ""));
edges.add(new Edge<String>(4, 5, ""));
edges.add(new Edge<String>(2, 5, ""));
//sc is the Java Spark Context
JavaRDD<Edge<String>> edgeRDD = sc.parallelize(edges, 1);
ClassTag<String> stringTag = scala.reflect.ClassTag$.MODULE$.apply(String.class);
Graph<String, String> graph = Graph.fromEdges(edgeRDD.rdd(), "",StorageLevel.MEMORY_ONLY(), StorageLevel.MEMORY_ONLY(), stringTag, stringTag);
What would be the easiest way to create a graph from the above-mentioned data frame?
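One possible direction (a hedged sketch, not a verified answer): map the edge DataFrame, here called edgeDf, directly to an RDD of Edge objects instead of building the list by hand:
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.graphx.Edge;
import org.apache.spark.graphx.Graph;
import org.apache.spark.storage.StorageLevel;
import scala.reflect.ClassTag;

ClassTag<String> stringTag = scala.reflect.ClassTag$.MODULE$.apply(String.class);

// Map each (src, dst) row straight to an Edge; use row.getInt(...) with a cast
// instead of getLong(...) if the columns are stored as integers.
JavaRDD<Edge<String>> edgeRDD = edgeDf.toJavaRDD()
    .map(row -> new Edge<String>(row.getLong(0), row.getLong(1), ""));

Graph<String, String> graph = Graph.fromEdges(edgeRDD.rdd(), "",
    StorageLevel.MEMORY_ONLY(), StorageLevel.MEMORY_ONLY(), stringTag, stringTag);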

3 LEFT JOINs in Spark SQL with the Java API

I have 3 Datasets originating from 3 tables:
Dataset<TABLE1> bbdd_one = map.get("TABLE1").as(Encoders.bean(TABLE1.class)).alias("TABLE1");
Dataset<TABLE2> bbdd_two = map.get("TABLE2").as(Encoders.bean(TABLE2.class)).alias("TABLE2");
Dataset<TABLE3> bbdd_three = map.get("TABLE3").as(Encoders.bean(TABLE3.class)).alias("TABLE3");
and I want to do a triple left join on them and write the result to an output .parquet file.
The sql JOIN statement is similar to this:
SELECT one.field, ........, two.field ....., three.field, ... four.field
FROM TABLE1 one
LEFT JOIN TABLE2 two ON two.field = one.field
LEFT JOIN TABLE3 three ON three.field = one.field AND three.field = one.field
LEFT JOIN TABLE3 four ON four.field = one.field AND four.field = one.otherfield
WHERE one.field = 'whatever'
How can I do this with the Java API? Is it possible? I did an example with only one join, but with 3 it seems difficult.
PS: My other join with the Java API is:
Dataset<TJOINED> ds_joined = ds_table1
.join(ds_table2,
JavaConversions.asScalaBuffer(Arrays.asList("fieldInCommon1", "fieldInCommon2", "fieldInCommon3", "fieldInCommon4"))
.seq(),
"inner")
.select("a lot of fields", ... "more fields")
.as(Encoders.bean(TJOINED.class));
Thanks!
Have you tried chaining join statements?
I don't often code in Java, so this is just a guess:
Dataset<TJOINED> ds_joined = ds_table1
.join(
ds_table2,
JavaConversions.asScalaBuffer(Arrays.asList(...)).seq(),
"left"
)
.join(
ds_table3,
JavaConversions.asScalaBuffer(Arrays.asList(...)).seq(),
"left"
)
.join(
ds_table4,
JavaConversions.asScalaBuffer(Arrays.asList(...)).seq(),
"left"
)
.select(...)
.as(Encoders.bean(TJOINED.class))
Update: If my understanding is correct, ds_table3 and ds_table4 are the same and they are joined on different fields. Then maybe this updated answer, which is given in Scala since it's what I'm used to working with, might achieve what you want. Here's the full working example:
import spark.implicits._
case class TABLE1(f1: Int, f2: Int, f3: Int, f4: Int, f5:Int)
case class TABLE2(f1: Int, f2: Int, vTable2: Int)
case class TABLE3(f3: Int, f4: Int, vTable3: Int)
val one = spark.createDataset[TABLE1](Seq(TABLE1(1,2,3,4,5), TABLE1(1,3,4,5,6)))
//one.show()
//+---+---+---+---+---+
//| f1| f2| f3| f4| f5|
//+---+---+---+---+---+
//| 1| 2| 3| 4| 5|
//| 1| 3| 4| 5| 6|
//+---+---+---+---+---+
val two = spark.createDataset[TABLE2](Seq(TABLE2(1,2,20)))
//two.show()
//+---+---+-------+
//| f1| f2|vTable2|
//+---+---+-------+
//| 1| 2| 20|
//+---+---+-------+
val three = spark.createDataset[TABLE3](Seq(TABLE3(3,4,20), TABLE3(3,5,50)))
//three.show()
//+---+---+-------+
//| f3| f4|vTable3|
//+---+---+-------+
//| 3| 4| 20|
//| 3| 5| 50|
//+---+---+-------+
val result = one
.join(two, Seq("f1", "f2"), "left")
.join(three, Seq("f3", "f4"), "left")
.join(
three.withColumnRenamed("f4", "f5").withColumnRenamed("vTable3", "vTable4"),
Seq("f3", "f5"),
"left"
)
//result.show()
//+---+---+---+---+---+-------+-------+-------+
//| f3| f5| f4| f1| f2|vTable2|vTable3|vTable4|
//+---+---+---+---+---+-------+-------+-------+
//| 3| 5| 4| 1| 2| 20| 20| 50|
//| 4| 6| 5| 1| 3| null| null| null|
//+---+---+---+---+---+-------+-------+-------+
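Translated back to the Java API from the question, the same rename-and-rejoin idea could be sketched as follows (untested; the field names are the ones from the Scala example, not the real tables):
import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import scala.collection.JavaConversions;

Dataset<Row> result = one
    .join(two,
        JavaConversions.asScalaBuffer(Arrays.asList("f1", "f2")).seq(), "left")
    .join(three,
        JavaConversions.asScalaBuffer(Arrays.asList("f3", "f4")).seq(), "left")
    // Join TABLE3 a second time on different key columns by renaming them first.
    .join(three.withColumnRenamed("f4", "f5").withColumnRenamed("vTable3", "vTable4"),
        JavaConversions.asScalaBuffer(Arrays.asList("f3", "f5")).seq(), "left");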

How to maintain the order of the data while selecting the distinct values of a column from a Dataset

I have a Dataset as below,
+------+------+---------------+
| col1 | col2 | sum(costs) |
+------+------+---------------+
| 1 | a | 3555204326.27 |
| 4 | b | 22273491.72 |
| 5 | c | 219175.00 |
| 3 | a | 219175.00 |
| 2 | c | 75341433.37 |
+------+------+---------------+
I need to select the distinct values of col1, and my resultant dataset should have the order 1, 4, 5, 3, 2 (the order in which these values appear in the initial dataset). But the order is getting shuffled. Is there any way to maintain the same order as the initial dataset? Any suggestion in Spark/SQL would be fine.
This dataset can be obtained by the sequence below in Spark:
df = sqlCtx.createDataFrame(
    [(1, 'a', 355.27), (4, 'b', 222.98), (5, 'c', 275.00), (3, 'a', 25.00),
     (2, 'c', 753.37)], ('Col1', 'col2', 'cost'))
You can add another column containing the index of each row, then sort on that column after "distinct". Here is an example:
import org.apache.spark.sql.functions._
import spark.implicits._ // needed for toDF
val df = Seq(1, 4, 4, 5, 2)
.toDF("a")
.withColumn("id", monotonically_increasing_id())
df.show()
// +---+---+
// | a| id|
// +---+---+
// | 1| 0|
// | 4| 1|
// | 4| 2|
// | 5| 3|
// | 2| 4|
// +---+---+
df.dropDuplicates("a").sort("id").show()
// +---+---+
// | a| id|
// +---+---+
// | 1| 0|
// | 4| 1|
// | 5| 3|
// | 2| 4|
// +---+---+
Note that to do a distinct on one specific column you can use dropDuplicates; if you want to control which row to keep in case of duplicates, use groupBy.
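For example, a small hypothetical Java sketch over the same a/id columns that keeps the earliest row per value of a by grouping instead of dropDuplicates:
import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// For each value of "a", keep the smallest index seen, then restore the original order.
Dataset<Row> firstPerKey = df.groupBy("a")
    .agg(min("id").as("id"))
    .orderBy("id");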
Assuming you are trying to remove the duplicates in col2 (as there are none in col1), so that the final result would be:
+----+----+---------------+
|col1|col2| sum|
+----+----+---------------+
| 1| a|3.55520432627E9|
| 4| b| 2.227349172E7|
| 5| c| 219175.0|
+----+----+---------------+
You could add an index column like:
df = df.withColumn("__idx", monotonically_increasing_id());
Then do all the transformations you want, and then drop it, like in:
df = df.dropDuplicates("col2").orderBy("__idx").drop("__idx");
This would mean:
Step 1: load the data and stuff:
+----+----+---------------+
|col1|col2| sum|
+----+----+---------------+
| 1| a|3.55520432627E9|
| 4| b| 2.227349172E7|
| 5| c| 219175.0|
| 3| a| 219175.0|
| 2| c| 7.534143337E7|
+----+----+---------------+
Step 2: add the index:
+----+----+---------------+-----+
|col1|col2| sum|__idx|
+----+----+---------------+-----+
| 1| a|3.55520432627E9| 0|
| 4| b| 2.227349172E7| 1|
| 5| c| 219175.0| 2|
| 3| a| 219175.0| 3|
| 2| c| 7.534143337E7| 4|
+----+----+---------------+-----+
Step 3: transformations (here remove the dups in col2) and remove the __idx column:
+----+----+---------------+
|col1|col2| sum|
+----+----+---------------+
| 1| a|3.55520432627E9|
| 4| b| 2.227349172E7|
| 5| c| 219175.0|
+----+----+---------------+
The Java code could be:
package net.jgp.books.spark.ch12.lab990_others;
import static org.apache.spark.sql.functions.monotonically_increasing_id;
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
/**
* Keeping the order of rows during transformations.
*
* @author jgp
*/
public class KeepingOrderApp {
/**
* main() is your entry point to the application.
*
* @param args
*/
public static void main(String[] args) {
KeepingOrderApp app = new KeepingOrderApp();
app.start();
}
/**
* The processing code.
*/
private void start() {
// Creates a session on a local master
SparkSession spark = SparkSession.builder()
.appName("Splitting a dataframe to collect it")
.master("local")
.getOrCreate();
Dataset<Row> df = createDataframe(spark);
df.show();
df = df.withColumn("__idx", monotonically_increasing_id());
df.show();
df = df.dropDuplicates("col2").orderBy("__idx").drop("__idx");
df.show();
}
private static Dataset<Row> createDataframe(SparkSession spark) {
StructType schema = DataTypes.createStructType(new StructField[] {
DataTypes.createStructField(
"col1",
DataTypes.IntegerType,
false),
DataTypes.createStructField(
"col2",
DataTypes.StringType,
false),
DataTypes.createStructField(
"sum",
DataTypes.DoubleType,
false) });
List<Row> rows = new ArrayList<>();
rows.add(RowFactory.create(1, "a", 3555204326.27));
rows.add(RowFactory.create(4, "b", 22273491.72));
rows.add(RowFactory.create(5, "c", 219175.0));
rows.add(RowFactory.create(3, "a", 219175.0));
rows.add(RowFactory.create(2, "c", 75341433.37));
return spark.createDataFrame(rows, schema);
}
}
You could add an index column in your DB and then use ORDER BY id in your SQL query.
I believe you need to rewrite your query and use GROUP BY instead of DISTINCT, as this answer suggests: SQL: How to keep rows order with DISTINCT?

How to do data pre-processing in Spark in this case

I made the following dataset with Scala.
+--------------------+---+
| text| docu_no|
+--------------------+---+
|서울,NNP 시내,NNG 한,M...| 1|
|최저,NNG 임금,NNG 때문,...| 2|
|왜,MAG 시급,NNG 만,JX...| 3|
|지금,MAG 경제,NNG 가,J...| 4|
|임대료,NNG 폭리,NNG 내리...| 5|
|모든,MM 문제,NNG 를,JK...| 6|
|니,NP 들,XSN 이,JKS ...| 7|
|실제,NNG 자영업,NNG 자,...| 8|
I want to make a DTM (document-term matrix) for analysis.
For example
docu_no|서울|시내|한|최저|임금|지금|폭리 ...
1 1 1 1 0 0 0 0
2 0 0 0 1 1 1 1
For this, I thought of pre-processing it as follows.
+---------+-----+-------+
|     text|count|docu_no|
+---------+-----+-------+
| 서울,NNP|    1|      1|
| 시내,NNG|    1|      1|
|    한,M.|    1|      1|
| 최저,NNG|    1|      2|
| 임금,NNG|    1|      2|
| 때문,...|    1|      2|
+---------+-----+-------+
After I make this (RDD or Dataset), if I use groupBy and pivot, I will get the results that I want. But it is too difficult for me. If you have any ideas, please let me know.
val data = List(("A", 1),("B", 2),("C", 3),("E", 4),("F", 5))
val df = sc.parallelize(data).toDF("text","doc_no")
df.show()
+----+------+
|text|doc_no|
+----+------+
| A| 1|
| B| 2|
| C| 3|
| E| 4|
| F| 5|
+----+------+
import org.apache.spark.sql.functions._
df.groupBy($"doc_no").pivot("text").agg(count("doc_no")).show()
+------+---+---+---+---+---+
|doc_no| A| B| C| E| F|
+------+---+---+---+---+---+
| 1| 1| 0| 0| 0| 0|
| 2| 0| 1| 0| 0| 0|
| 3| 0| 0| 1| 0| 0|
| 4| 0| 0| 0| 1| 0|
| 5| 0| 0| 0| 0| 1|
+------+---+---+---+---+---+

How to get the size of result generated using concat_ws?

I am performing groupBy on COL1 and getting the concatenated list of COL2 using concat_ws. How can I get the count of values in that list? Here's my code:
Dataset<Row> ds = df.groupBy("COL1").agg(org.apache.spark.sql.functions
.concat_ws(",",org.apache.spark.sql.functions.collect_list("COL2")).as("sample"));
Use the size function.
size(e: Column): Column Returns length of array or map.
The following example is in Scala, and I am leaving it to you to convert it to Java, but the general idea is exactly the same regardless of the programming language.
val input = spark.range(4)
.withColumn("COL1", $"id" % 2)
.select($"COL1", $"id" as "COL2")
scala> input.show
+----+----+
|COL1|COL2|
+----+----+
| 0| 0|
| 1| 1|
| 0| 2|
| 1| 3|
+----+----+
val s = input
.groupBy("COL1")
.agg(
concat_ws(",", collect_list("COL2")) as "concat",
size(collect_list("COL2")) as "size") // <-- size
scala> s.show
+----+------+----+
|COL1|concat|size|
+----+------+----+
| 0| 0,2| 2|
| 1| 1,3| 2|
+----+------+----+
In Java that'd be as follows. Thanks Krishna Prasad for sharing the code with the SO/Spark community!
Dataset<Row> ds = df.groupBy("COL1").agg(
    org.apache.spark.sql.functions.concat_ws(",", org.apache.spark.sql.functions.collect_list("COL2")).as("sample"),
    org.apache.spark.sql.functions.size(org.apache.spark.sql.functions.collect_list("COL2")).as("size"));
