3 LEFT-JOIN in Spark SQL with API JAVA - java

I have 3 dataset in origin from 3 tables:
Dataset<TABLE1> bbdd_one = map.get("TABLE1").as(Encoders.bean(TABLE1.class)).alias("TABLE1");
Dataset<TABLE2> bbdd_two = map.get("TABLE2").as(Encoders.bean(TABLE2.class)).alias("TABLE2");
Dataset<TABLE3> bbdd_three = map.get("TABLE3").as(Encoders.bean(TABLE3.class)).alias("TABLE3");
and I want to do a triple left-join on it and write it in an output .parquet
The sql JOIN statement is similar to this:
SELECT one.field, ........, two.field ....., three.field, ... four.field
FROM TABLE1 one
LEFT JOIN TABLE2 two ON two.field = one.field
LEFT JOIN TABLE3 three ON three.field = one.field AND three.field = one.field
LEFT JOIN TABLE3 four ON four.field = one.field AND four.field = one.otherfield
WHERE one.field = 'whatever'
How can do this with JAVA API? Is it possible? I did an example with only one join but with 3 seems difficult.
PS: My other join with JAVA API is:
Dataset<TJOINED> ds_joined = ds_table1
.join(ds_table2,
JavaConversions.asScalaBuffer(Arrays.asList("fieldInCommon1", "fieldInCommon2", "fieldInCommon3", "fieldInCommon4"))
.seq(),
"inner")
.select("a lot of fields", ... "more fields")
.as(Encoders.bean(TJOINED.class));
Thanks!

Have you tried chaining join statements?
I don't often code in Java so this is just a guess
Dataset<TJOINED> ds_joined = ds_table1
.join(
ds_table2,
JavaConversions.asScalaBuffer(Arrays.asList(...)).seq(),
"left"
)
.join(
ds_table3,
JavaConversions.asScalaBuffer(Arrays.asList(...)).seq(),
"left"
)
.join(
ds_table4,
JavaConversions.asScalaBuffer(Arrays.asList(...)).seq(),
"left"
)
.select(...)
.as(Encoders.bean(TJOINED.class))
Update: If my understanding is correct, ds_table3 and ds_table4 are the same and they are joined on different field. Then maybe this updated answer, which is given in Scala since it's what I'm used to working with, might achieve what you want. Here's the full working example:
import spark.implicits._
case class TABLE1(f1: Int, f2: Int, f3: Int, f4: Int, f5:Int)
case class TABLE2(f1: Int, f2: Int, vTable2: Int)
case class TABLE3(f3: Int, f4: Int, vTable3: Int)
val one = spark.createDataset[TABLE1](Seq(TABLE1(1,2,3,4,5), TABLE1(1,3,4,5,6)))
//one.show()
//+---+---+---+---+---+
//| f1| f2| f3| f4| f5|
//+---+---+---+---+---+
//| 1| 2| 3| 4| 5|
//| 1| 3| 4| 5| 6|
//+---+---+---+---+---+
val two = spark.createDataset[TABLE2](Seq(TABLE2(1,2,20)))
//two.show()
//+---+---+-------+
//| f1| f2|vTable2|
//+---+---+-------+
//| 1| 2| 20|
//+---+---+-------+
val three = spark.createDataset[TABLE3](Seq(TABLE3(3,4,20), TABLE3(3,5,50)))
//three.show()
//+---+---+-------+
//| f3| f4|vTable3|
//+---+---+-------+
//| 3| 4| 20|
//| 3| 5| 50|
//+---+---+-------+
val result = one
.join(two, Seq("f1", "f2"), "left")
.join(three, Seq("f3", "f4"), "left")
.join(
three.withColumnRenamed("f4", "f5").withColumnRenamed("vTable3", "vTable4"),
Seq("f3", "f5"),
"left"
)
//result.show()
//+---+---+---+---+---+-------+-------+-------+
//| f3| f5| f4| f1| f2|vTable2|vTable3|vTable4|
//+---+---+---+---+---+-------+-------+-------+
//| 3| 5| 4| 1| 2| 20| 20| 50|
//| 4| 6| 5| 1| 3| null| null| null|
//+---+---+---+---+---+-------+-------+-------+

Related

Create a GraphX graph in Java Spark using a 2 column dataframe

I have a 2 column data frame as mentioned below.
+---+--------+
| src| dst|
+---+--------+
| 1| 2|
| 2| 3|
| 1| 3|
| 4| 3|
| 4| 5|
| 2| 5|
| 7| 8|
| 9| 8|
+---+--------+
I am trying to create a graph from this Edge DF. I am able to create the graph manual using a List<Edge<String>> . I want to know if there is any simpler way to create a graph from the Edge DF.
List<Edge<String>> edges = new ArrayList<>();
edges.add(new Edge<String>(1, 2, ""));
edges.add(new Edge<String>(2, 3, ""));
edges.add(new Edge<String>(1, 3, ""));
edges.add(new Edge<String>(4, 3, ""));
edges.add(new Edge<String>(4, 5, ""));
edges.add(new Edge<String>(2, 5, ""));
//sc is the Java Spark Context
JavaRDD<Edge<String>> edgeRDD = sc.parallelize(edges, 1);
ClassTag<String> stringTag = scala.reflect.ClassTag$.MODULE$.apply(String.class);
Graph<String, String> graph = Graph.fromEdges(edgeRDD.rdd(), "",StorageLevel.MEMORY_ONLY(), StorageLevel.MEMORY_ONLY(), stringTag, stringTag);
What would be the easiest way to create a graph from the above mentioned data frame

How to maintain the order of the data while selecting the distinct values of column from Dataset

I have a Dataset as below,
+------+------+---------------+
| col1 | col2 | sum(costs) |
+------+------+---------------+
| 1 | a | 3555204326.27 |
| 4 | b | 22273491.72 |
| 5 | c | 219175.00 |
| 3 | a | 219175.00 |
| 2 | c | 75341433.37 |
+------+------+---------------+
I need to select the distinct values of the col1 and my resultant dataset should have the order as 1, 4, 5, 3, 2 (the order in which these values are available in initial dataset). But the order is getting shuffled. Is there any way to maintain the same order as the intital dataset. Any suggestion in Spark/SQL could be fine.
This dataset can be obtained by below sequence in spark.
df = sqlCtx.createDataFrame(
[(1, a, 355.27), (4, b, 222.98), (5, c, 275.00), (3, a, 25.00),
(2, c, 753.37)], ('Col1', 'col2', 'cost'));
You can add another column containing the index of each row, then sort on that column after "distinct". Here is an example:
import org.apache.spark.sql.functions._
val df = Seq(1, 4, 4, 5, 2)
.toDF("a")
.withColumn("id", monotonically_increasing_id())
df.show()
// +---+---+
// | a| id|
// +---+---+
// | 1| 0|
// | 4| 1|
// | 4| 2|
// | 5| 3|
// | 2| 4|
// +---+---+
df.dropDuplicates("a").sort("id").show()
// +---+---+
// | a| id|
// +---+---+
// | 1| 0|
// | 4| 1|
// | 5| 3|
// | 2| 4|
// +---+---+
Note that to do distinct on 1 specific column, you can use dropDuplicates, if you want to control which row you want to take in case of duplicate then use groupBy.
Assuming you are trying to remote the duplicates in col2 (as there are none in col1), so that the final result would be:
+----+----+---------------+
|col1|col2| sum|
+----+----+---------------+
| 1| a|3.55520432627E9|
| 4| b| 2.227349172E7|
| 5| c| 219175.0|
+----+----+---------------+
You could add an index column like:
df = df.withColumn("__idx", monotonically_increasing_id());
Then do all the transformations you want, and then drop it, like in:
df = df.dropDuplicates("col2").orderBy("__idx").drop("__idx");
This would mean do:
Step 1: load the data and stuff:
+----+----+---------------+
|col1|col2| sum|
+----+----+---------------+
| 1| a|3.55520432627E9|
| 4| b| 2.227349172E7|
| 5| c| 219175.0|
| 3| a| 219175.0|
| 2| c| 7.534143337E7|
+----+----+---------------+
Step 2: add the index:
+----+----+---------------+-----+
|col1|col2| sum|__idx|
+----+----+---------------+-----+
| 1| a|3.55520432627E9| 0|
| 4| b| 2.227349172E7| 1|
| 5| c| 219175.0| 2|
| 3| a| 219175.0| 3|
| 2| c| 7.534143337E7| 4|
+----+----+---------------+-----+
Step 3: transformations (here remove the dups in col2) and remove the __idx column:
+----+----+---------------+
|col1|col2| sum|
+----+----+---------------+
| 1| a|3.55520432627E9|
| 4| b| 2.227349172E7|
| 5| c| 219175.0|
+----+----+---------------+
The Java code could be:
package net.jgp.books.spark.ch12.lab990_others;
import static org.apache.spark.sql.functions.monotonically_increasing_id;
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
/**
* Keeping the order of rows during transformations.
*
* #author jgp
*/
public class KeepingOrderApp {
/**
* main() is your entry point to the application.
*
* #param args
*/
public static void main(String[] args) {
KeepingOrderApp app = new KeepingOrderApp();
app.start();
}
/**
* The processing code.
*/
private void start() {
// Creates a session on a local master
SparkSession spark = SparkSession.builder()
.appName("Splitting a dataframe to collect it")
.master("local")
.getOrCreate();
Dataset<Row> df = createDataframe(spark);
df.show();
df = df.withColumn("__idx", monotonically_increasing_id());
df.show();
df = df.dropDuplicates("col2").orderBy("__idx").drop("__idx");
df.show();
}
private static Dataset<Row> createDataframe(SparkSession spark) {
StructType schema = DataTypes.createStructType(new StructField[] {
DataTypes.createStructField(
"col1",
DataTypes.IntegerType,
false),
DataTypes.createStructField(
"col2",
DataTypes.StringType,
false),
DataTypes.createStructField(
"sum",
DataTypes.DoubleType,
false) });
List<Row> rows = new ArrayList<>();
rows.add(RowFactory.create(1, "a", 3555204326.27));
rows.add(RowFactory.create(4, "b", 22273491.72));
rows.add(RowFactory.create(5, "c", 219175.0));
rows.add(RowFactory.create(3, "a", 219175.0));
rows.add(RowFactory.create(2, "c", 75341433.37));
return spark.createDataFrame(rows, schema);
}
}
You could add an index column to your DB and then in your SQL request make an ORDER BY id
I believe you need to reformat your query and use group by instead of distinct like this answer suggests SQL: How to keep rows order with DISTINCT?

How to get the size of result generated using concat_ws?

I am performing groupBy on COL1 and getting the concatenated list of COL2 using concat_ws. How can I get the count of values in that list? Here's my code:
Dataset<Row> ds = df.groupBy("COL1").agg(org.apache.spark.sql.functions
.concat_ws(",",org.apache.spark.sql.functions.collect_list("COL2")).as("sample"));
Use size function.
size(e: Column): Column Returns length of array or map.
The following example is in Scala and am leaving it to you to convert it to Java, but the general idea is exactly the same regardless of the programming language.
val input = spark.range(4)
.withColumn("COL1", $"id" % 2)
.select($"COL1", $"id" as "COL2")
scala> input.show
+----+----+
|COL1|COL2|
+----+----+
| 0| 0|
| 1| 1|
| 0| 2|
| 1| 3|
+----+----+
val s = input
.groupBy("COL1")
.agg(
concat_ws(",", collect_list("COL2")) as "concat",
size(collect_list("COL2")) as "size") // <-- size
scala> s.show
+----+------+----+
|COL1|concat|size|
+----+------+----+
| 0| 0,2| 2|
| 1| 1,3| 2|
+----+------+----+
In Java that'd be as follows. Thanks Krishna Prasad for sharing the code with the SO/Spark community!
Dataset<Row> ds = df.groupBy("COL1").agg(
org.apache.spark.sql.functions.concat_ws(",",org.apache.spark.sql.functions.collect_list("‌​COL2")).as("sample")‌​,
org.apache.spark.sql.functions.size(org.apache.spark.sql.functions.collect_list("COL2‌​")).as("size"));

how to apply dictionary key to value project to a column in dataset in spark?

Newbie here on spark... how can I use a column in spark dataset ask key to get some values and add the values as new column to the dataset?
In python, we have something like:
df.loc[:,'values'] = df.loc[:,'key'].apply(lambda x: D.get(x))
where D is a function in python defined earlier.
how can I do this in spark using Java? thank you.
Edit:
for example:
I have a following dataset df:
A
1
3
6
0
8
I want to create a weekday column based on the following dictionary:
D[1] = "Monday"
D[2] = "Tuesday"
D[3] = "Wednesday"
D[4] = "Thursday"
D[5] = "Friday"
D[6] = "Saturday"
D[7] = "Sunday"
and add the column back to my dataset df:
A days
1 Monday
3 Wednesday
6 Saturday
0 Sunday
8 NULL
This is just an example, column A could be anything other than integers of course.
You can use df.withColumn to return a new df with the new column values and the previous values of df.
create a udf function (user defined functions) to apply the dictionary mapping.
here's a reproducible example:
>>> from pyspark.sql.types import StringType
>>> from pyspark.sql.functions import udf
>>> df = spark.createDataFrame([{'A':1,'B':5},{'A':5,'B':2},{'A':1,'B':3},{'A':5,'B':4}], ['A','B'])
>>> df.show()
+---+---+
| A| B|
+---+---+
| 1| 5|
| 5| 2|
| 1| 3|
| 5| 4|
+---+---+
>>> d = {1:'x', 2:'y', 3:'w', 4:'t', 5:'z'}
>>> mapping_func = lambda x: d.get(x)
>>> df = df.withColumn('values',udf(mapping_func, StringType())("A"))
>>> df.show()
+---+---+------+
| A| B|values|
+---+---+------+
| 1| 5| x|
| 5| 2| z|
| 1| 3| x|
| 5| 4| z|
+---+---+------+

Transform Spark Datset - count and merge multiple rows by ID

After some data processing, I end up with this Dataset:
Dataset<Row> counts //ID,COUNT,DAY_OF_WEEK
Now I want to transform this to this format and save as CSV:
ID,COUNT_DoW1, ID,COUNT_DoW2, ID,COUNT_DoW3,..ID,COUNT_DoW7
I can think of one approach of:
JavaPairRDD<Long, Map<Integer, Integer>> r = counts.toJavaRDD().mapToPair(...)
JavaPairRDD<Long, Map<Integer, Integer>> merged = r.reduceByKey(...);
Where its a pair of "ID" and List of size 7.
After I get JavaPairRDD, I can store it in csv. Is there a simpler approach for this transformation without converting it to an RDD?
You can use the struct function to construct a pair from cnt and day and then do a groupby with collect_list.
Something like this (scala but you can easily convert to java):
df.groupBy("ID").agg(collect_list(struct("COUNT","DAY")))
Now you can write a UDF which extracts the relevant column. So you simply do a withColumn in a loop to simply copy the ID (df.withColumn("id2",col("id")))
then you create a UDF which extracts the count element from position i and run it on all columns and lastly the same on day.
If you keep the order you want and drop irrelevant columns you would get what you asked for.
You can also work with the pivot command (again in scala but you should be able to easily convert to java):
df.show()
>>+---+---+---+
>>| id|cnt|day|
>>+---+---+---+
>>|333| 31| 1|
>>|333| 32| 2|
>>|333|133| 3|
>>|333| 34| 4|
>>|333| 35| 5|
>>|333| 36| 6|
>>|333| 37| 7|
>>|222| 41| 4|
>>|111| 11| 1|
>>|111| 22| 2|
>>|111| 33| 3|
>>|111| 44| 4|
>>|111| 55| 5|
>>|111| 66| 6|
>>|111| 77| 7|
>>|222| 21| 1|
>>+---+---+---+
val df2 = df.withColumn("all",struct('id, 'cnt' 'day))
val res = .groupBy("id").pivot("day").agg(first('all).as("bla")).select("1.*","2.*","3.*", "4.*", "5.*", "6.*", "7.*")
res.show()
>>+---+---+---+----+----+----+----+----+----+---+---+---+----+----+----+----+----+----+----+----+----+
>>| id|cnt|day| id| cnt| day| id| cnt| day| id|cnt|day| id| cnt| day| id| cnt| day| id| cnt| day|
>>+---+---+---+----+----+----+----+----+----+---+---+---+----+----+----+----+----+----+----+----+----+
>>|333| 31| 1| 333| 32| 2| 333| 133| 3|333| 34| 4| 333| 35| 5| 333| 36| 6| 333| 37| 7|
>>|222| 21| 1|null|null|null|null|null|null|222| 41| 4|null|null|null|null|null|null|null|null|null|
>>|111| 11| 1| 111| 22| 2| 111| 33| 3|111| 44| 4| 111| 55| 5| 111| 66| 6| 111| 77| 7|
>>+---+---+---+----+----+----+----+----+----+---+---+---+----+----+----+----+----+----+----+----+----+

Categories