Spark2 dataframe get difference of two tables, with some exclusion - java

Let's say I have two tables, tableA and tableB, with the same schema. I'd like to get the rows that mismatch between the two tables for the same primary key, while excluding certain column values from the comparison.
If the ref column in either table contains xx, it should be treated as matching the other table's value. Can anyone help me with this in Java? I have a hard time reading Scala code.
tableA:
+--+----+----+---+
|id|name|type|ref|
+--+----+----+---+
| 1| aaa|   a| a1|
| 2| bbb|   b| xx|
| 3| ccc|   c| c3|
| 4| ddd|   d| d4|
| 6| fff|   f| f6|
| 7| ggg|   0| g7|
+--+----+----+---+
tableB:
+--+----+----+---+
|id|name|type|ref|
+--+----+----+---+
| 1| aaa|   a| a1|
| 2| bbb|   b| b2|
| 3| ccc|   c| xx|
| 5| eee|   e| e5|
| 6| fff|   f|f66|
| 7| ggg|   g| g7|
+--+----+----+---+
Expected results:
+--+------+------+-------------+
|id| name | type| ref |
+--+------+------+-------------+
| 6|fff | f| [f6 ->f66]|
| 7|ggg | 0| [0 -> g7 ]|
+--+------+------+-------------+
This seems to work, but I'm not confident in it:
Dataset<Row> join = data1.join(data2,
    data1.col("id").equalTo(data2.col("id"))
        .and(data1.col("name").equalTo(data2.col("name")))
        .and(data1.col("type").equalTo(data2.col("type"))
            .and(data1.col("ref").equalTo(data2.col("ref"))
                .or(data1.col("ref").equalTo(lit("xx")))
                .or(data2.col("ref").equalTo(lit("xx"))))),
    "left_semi");

You can use a UDF (user-defined function) for this once you have joined the two dataframes, as shown below (in Scala):
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf
import sparkSession.sqlContext.implicits._
val df1 = Seq((1, "aaa", "a", "a1"), (2, "bbb", "b", "xx"), (3, "ccc", "c", "c3"), (4, "ddd", "d", "d4"), (6, "fff", "f", "f6")).toDF("id", "name", "type", "ref")
val df2 = Seq((1, "aaa", "a", "a1"), (2, "bbb", "b", "b2"), (3, "ccc", "c", "xx"), (4, "ddd", "d", "d4"), (6, "fff", "f", "f66")).toDF("id", "name", "type", "ref")
// Returns "ref1 -> ref2" when the refs differ and neither is the wildcard "xx"; otherwise null.
val diffCondition: UserDefinedFunction = udf {
  (ref1: String, ref2: String) =>
    if (ref1 != ref2 && ref1 != "xx" && ref2 != "xx") s"$ref1 -> $ref2" else null
}
df1.join(df2, Seq("id", "name", "type"))
  .withColumn("difference", diffCondition(df1("ref"), df2("ref")))
  .filter("difference is not null")
  .show()
The output is below. Because the join keys include name and type, a row whose name or type differs, or whose id exists in only one table, drops out of the inner join and is not reported:
+---+----+----+---+---+----------+
| id|name|type|ref|ref|difference|
+---+----+----+---+---+----------+
| 6| fff| f| f6|f66| f6 -> f66|
+---+----+----+---+---+----------+
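Since the question asks for Java, here is a minimal plain-Java sketch of the comparison rule that the UDF encodes, with no Spark dependency. The names RefDiff and difference are made up for illustration, and the null handling is an assumption that goes beyond the original answer. Wiring this into a Spark job (as a Java UDF or as column expressions) is not shown.

```java
import java.util.Objects;

final class RefDiff {
    // Wildcard marker taken from the question's sample data.
    static final String WILDCARD = "xx";

    /**
     * Returns "left -> right" when the two ref values count as a mismatch,
     * or null when they are equal or either side is the wildcard.
     */
    static String difference(String left, String right) {
        boolean wildcard = WILDCARD.equals(left) || WILDCARD.equals(right);
        // Objects.equals is null-safe, so a null ref never throws here.
        if (wildcard || Objects.equals(left, right)) {
            return null;
        }
        return left + " -> " + right;
    }

    public static void main(String[] args) {
        System.out.println(difference("f6", "f66")); // f6 -> f66
        System.out.println(difference("xx", "b2"));  // null
        System.out.println(difference("a1", "a1"));  // null
    }
}
```

With the sample tables, ids 2 and 3 come back as null because one side is xx, and id 6 is reported as a difference.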

Related

replace one column values with another Spark Java

I have a dataframe df1 of the format
+------+------+------+
| Col1 | Col2 | Col3 |
+------+------+------+
| A | z | m |
| B | w | n |
| C | x | o |
| A | z | n |
| A | p | o |
+------+------+------+
and another dataframe df2 of the format
+------+------+
| Col1 | Col2 |
+------+------+
| 0-A | 0-z |
| 1-B | 3-w |
| 2-C | 1-x |
| | 2-P |
+------+------+
I am trying to replace the values in Col1 and Col2 of df1 with values from df2 using Spark Java.
The end dataframe df3 should look like this.
+------+------+------+
| Col1 | Col2 | Col3 |
+------+------+------+
| 0-A | 0-z | m |
| 1-B | 3-w | n |
| 2-C | 1-x | o |
| 0-A | 0-z | n |
| 0-A | 2-p | o |
+------+------+------+
I am trying to replace all the values in Col1 and Col2 of df1 with the values from Col1 and Col2 of df2.
Is there any way to achieve this with the Spark Java DataFrame API?
My initial idea was the following:
String pattern1="\\p{L}+(?: \\p{L}+)*$";
df1=df1.join(df2, df1.col("col1").equalTo(regexp_extract(df2.col("col1"),pattern1,1)),"left-semi");
Replace your last join with the joins below (Scala syntax; in Java, use equalTo instead of ===):
import org.apache.spark.sql.functions.{col, regexp_extract}
// Matches the trailing run of letters, for example "A" in "0-A".
val pattern = "\\p{L}+(?: \\p{L}+)*$"
df1.alias("x")
  .join(df2.alias("y").select(col("y.Col1").alias("newCol1")),
    col("x.Col1") === regexp_extract(col("newCol1"), pattern, 0), "left")
  .withColumn("Col1", col("newCol1"))
  .join(df2.alias("z").select(col("z.Col2").alias("newCol2")),
    col("x.Col2") === regexp_extract(col("newCol2"), pattern, 0), "left")
  .withColumn("Col2", col("newCol2"))
  .drop("newCol1", "newCol2")
  .show(false)
+----+----+----+
|Col1|Col2|Col3|
+----+----+----+
|2-C |1-x |o |
|0-A |0-z |m |
|0-A |0-z |n |
|0-A |2-p |o |
|1-B |3-w |n |
+----+----+----+
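To see what the regular expression in the join condition extracts, here is a small plain-Java check using java.util.regex, with no Spark involved. TrailingLetters and extract are hypothetical names; returning an empty string when nothing matches mirrors how regexp_extract behaves.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TrailingLetters {
    // Same pattern as in the join condition: one or more runs of letters,
    // separated by single spaces, anchored at the end of the string.
    static final Pattern TRAILING = Pattern.compile("\\p{L}+(?: \\p{L}+)*$");

    /** Returns the trailing letters of value, or "" when there are none. */
    static String extract(String value) {
        Matcher m = TRAILING.matcher(value);
        // group(0) is the whole match; the pattern has no capturing group,
        // so there is no group 1 to ask for.
        return m.find() ? m.group(0) : "";
    }

    public static void main(String[] args) {
        System.out.println(extract("0-A")); // A
        System.out.println(extract("3-w")); // w
    }
}
```

The pattern has no capturing group, so the whole match (group 0) is the value to compare against df1, as in the joins above.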

Custom sort order on a Spark dataframe/dataset

I have a web service built around Spark that, based on a JSON request, builds a series of dataframe/dataset operations.
These operations involve multiple joins, filters, etc. that would change the ordering of the values in the columns. This final data set could have rows to the scale of millions.
Preferably without converting it to an RDD, is there any way to apply custom sorts on some columns of the final dataset, based on the order of elements passed in as lists?
The original dataframe is of the form
+----------+----------+
| Column 1 | Column 2 |
+----------+----------+
| Val 1 | val a |
+----------+----------+
| Val 2 | val b |
+----------+----------+
| val 3 | val c |
+----------+----------+
After a series of transformations are performed, the dataframe ends up looking like this.
+----------+----------+----------+----------+
| Column 1 | Column 2 | Column 3 | Column 4 |
+----------+----------+----------+----------+
| Val 2 | val b | val 999 | val 900 |
+----------+----------+----------+----------+
| Val 1 | val c | val 100 | val 9$## |
+----------+----------+----------+----------+
| val 3 | val a | val 2## | val $##8 |
+----------+----------+----------+----------+
I now need to sort on multiple columns based on the order of values passed in as array lists.
For example:
Col1 values order = [val 1, val 3, val 2]
Col3 values order = [100, 2##, 999]
Custom sorting works by creating a column for sorting. It does not need to be a visible column inside the dataframe. I can show it using PySpark.
Initial df:
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [(1, 'a', 'A'),
     (2, 'a', 'B'),
     (3, 'a', 'C'),
     (4, 'b', 'A'),
     (5, 'b', 'B'),
     (6, 'b', 'C'),
     (7, 'c', 'A'),
     (8, 'c', 'B'),
     (9, 'c', 'C')],
    ['id', 'c1', 'c2']
)
Custom sorting on 1 column:
from itertools import chain
order = {'b': 1, 'a': 2, 'c': 3}
sort_col = F.create_map([F.lit(x) for x in chain(*order.items())])[F.col('c1')]
df = df.sort(sort_col)
df.show()
# +---+---+---+
# | id| c1| c2|
# +---+---+---+
# | 5| b| B|
# | 6| b| C|
# | 4| b| A|
# | 1| a| A|
# | 2| a| B|
# | 3| a| C|
# | 7| c| A|
# | 8| c| B|
# | 9| c| C|
# +---+---+---+
On 2 columns:
from itertools import chain
order1 = {'b': 1, 'a': 2, 'c': 3}
order2 = {'B': 1, 'C': 2, 'A': 3}
sort_col1 = F.create_map([F.lit(x) for x in chain(*order1.items())])[F.col('c1')]
sort_col2 = F.create_map([F.lit(x) for x in chain(*order2.items())])[F.col('c2')]
df = df.sort(sort_col1, sort_col2)
df.show()
# +---+---+---+
# | id| c1| c2|
# +---+---+---+
# | 5| b| B|
# | 6| b| C|
# | 4| b| A|
# | 2| a| B|
# | 3| a| C|
# | 1| a| A|
# | 8| c| B|
# | 9| c| C|
# | 7| c| A|
# +---+---+---+
Or as a function:
from itertools import chain
def cust_sort(col: str, order: dict):
    return F.create_map([F.lit(x) for x in chain(*order.items())])[F.col(col)]
df = df.sort(
    cust_sort('c1', {'b': 1, 'a': 2, 'c': 3}),
    cust_sort('c2', {'B': 1, 'C': 2, 'A': 3})
)
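The same idea (turn the requested order into a rank, then sort by that rank) can be sketched in plain Java for readers who do not use PySpark. This is not Spark code: CustomOrder, rankOf and sort are hypothetical names, rows are modelled as String arrays, and values missing from an order list are placed last, which is a choice of this sketch only.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class CustomOrder {
    /** Maps each value to its position in the requested order. */
    static Map<String, Integer> rankOf(List<String> order) {
        Map<String, Integer> rank = new HashMap<>();
        for (int i = 0; i < order.size(); i++) {
            rank.put(order.get(i), i);
        }
        return rank;
    }

    /** Sorts rows by column 0 using order1, then by column 1 using order2. */
    static List<String[]> sort(List<String[]> rows, List<String> order1, List<String> order2) {
        Map<String, Integer> r1 = rankOf(order1);
        Map<String, Integer> r2 = rankOf(order2);
        // Assumption of this sketch: values not listed in an order sort last.
        Comparator<String[]> byRank = Comparator
            .<String[]>comparingInt(row -> r1.getOrDefault(row[0], Integer.MAX_VALUE))
            .thenComparingInt(row -> r2.getOrDefault(row[1], Integer.MAX_VALUE));
        List<String[]> sorted = new ArrayList<>(rows); // leave the input untouched
        sorted.sort(byRank);
        return sorted;
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"a", "A"});
        rows.add(new String[] {"b", "C"});
        rows.add(new String[] {"b", "B"});
        for (String[] row : sort(rows, List.of("b", "a", "c"), List.of("B", "C", "A"))) {
            System.out.println(row[0] + " " + row[1]);
        }
    }
}
```

The rank map plays the role of the create_map column: it exists only for sorting and never becomes part of the data.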

How to maintain the order of the data while selecting the distinct values of column from Dataset

I have a Dataset as below,
+------+------+---------------+
| col1 | col2 | sum(costs) |
+------+------+---------------+
| 1 | a | 3555204326.27 |
| 4 | b | 22273491.72 |
| 5 | c | 219175.00 |
| 3 | a | 219175.00 |
| 2 | c | 75341433.37 |
+------+------+---------------+
I need to select the distinct values of col1, and the resulting dataset should keep the order 1, 4, 5, 3, 2 (the order in which these values appear in the initial dataset). But the order is getting shuffled. Is there any way to maintain the same order as the initial dataset? Any suggestion in Spark/SQL would be fine.
A similar dataset can be created in Spark as follows:
df = sqlCtx.createDataFrame(
    [(1, 'a', 355.27), (4, 'b', 222.98), (5, 'c', 275.00), (3, 'a', 25.00),
     (2, 'c', 753.37)], ('Col1', 'col2', 'cost'))
You can add another column containing the index of each row, then sort on that column after "distinct". Here is an example:
import org.apache.spark.sql.functions._
val df = Seq(1, 4, 4, 5, 2)
.toDF("a")
.withColumn("id", monotonically_increasing_id())
df.show()
// +---+---+
// | a| id|
// +---+---+
// | 1| 0|
// | 4| 1|
// | 4| 2|
// | 5| 3|
// | 2| 4|
// +---+---+
df.dropDuplicates("a").sort("id").show()
// +---+---+
// | a| id|
// +---+---+
// | 1| 0|
// | 4| 1|
// | 5| 3|
// | 2| 4|
// +---+---+
Note that dropDuplicates is the way to deduplicate on specific columns, but it does not let you choose which of the duplicate rows is kept. If you need to control that, use groupBy with an aggregate instead.
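As a plain-Java analogue of "index, deduplicate, sort by index", the sketch below keeps the first row seen for each key and returns the rows in their original order. It illustrates the idea only and is not Spark code; OrderedDistinct and firstPerKey are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

final class OrderedDistinct {
    /** Keeps the first row for each key, preserving the order of first appearance. */
    static <T, K> List<T> firstPerKey(List<T> rows, Function<T, K> key) {
        // LinkedHashMap remembers insertion order, so no explicit index is needed.
        Map<K, T> firstSeen = new LinkedHashMap<>();
        for (T row : rows) {
            firstSeen.putIfAbsent(key.apply(row), row);
        }
        return new ArrayList<>(firstSeen.values());
    }

    public static void main(String[] args) {
        List<Integer> values = Arrays.asList(1, 4, 4, 5, 2);
        System.out.println(firstPerKey(values, v -> v)); // [1, 4, 5, 2]
    }
}
```

Unlike this in-memory version, a distributed dataset has no inherent row order, which is why the Spark approach needs the explicit index column.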
Assuming you are trying to remove the duplicates in col2 (as there are none in col1), the final result would be:
+----+----+---------------+
|col1|col2| sum|
+----+----+---------------+
| 1| a|3.55520432627E9|
| 4| b| 2.227349172E7|
| 5| c| 219175.0|
+----+----+---------------+
You could add an index column like:
df = df.withColumn("__idx", monotonically_increasing_id());
Then do all the transformations you want, and then drop it, like in:
df = df.dropDuplicates("col2").orderBy("__idx").drop("__idx");
Step by step, this means:
Step 1: load the data:
+----+----+---------------+
|col1|col2| sum|
+----+----+---------------+
| 1| a|3.55520432627E9|
| 4| b| 2.227349172E7|
| 5| c| 219175.0|
| 3| a| 219175.0|
| 2| c| 7.534143337E7|
+----+----+---------------+
Step 2: add the index:
+----+----+---------------+-----+
|col1|col2| sum|__idx|
+----+----+---------------+-----+
| 1| a|3.55520432627E9| 0|
| 4| b| 2.227349172E7| 1|
| 5| c| 219175.0| 2|
| 3| a| 219175.0| 3|
| 2| c| 7.534143337E7| 4|
+----+----+---------------+-----+
Step 3: transformations (here remove the dups in col2) and remove the __idx column:
+----+----+---------------+
|col1|col2| sum|
+----+----+---------------+
| 1| a|3.55520432627E9|
| 4| b| 2.227349172E7|
| 5| c| 219175.0|
+----+----+---------------+
The Java code could be:
package net.jgp.books.spark.ch12.lab990_others;
import static org.apache.spark.sql.functions.monotonically_increasing_id;
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
/**
 * Keeping the order of rows during transformations.
 *
 * @author jgp
 */
public class KeepingOrderApp {
  /**
   * main() is your entry point to the application.
   *
   * @param args
   */
  public static void main(String[] args) {
    KeepingOrderApp app = new KeepingOrderApp();
    app.start();
  }
  /**
   * The processing code.
   */
  private void start() {
    // Creates a session on a local master
    SparkSession spark = SparkSession.builder()
        .appName("Splitting a dataframe to collect it")
        .master("local")
        .getOrCreate();
    Dataset<Row> df = createDataframe(spark);
    df.show();
    df = df.withColumn("__idx", monotonically_increasing_id());
    df.show();
    df = df.dropDuplicates("col2").orderBy("__idx").drop("__idx");
    df.show();
  }
  private static Dataset<Row> createDataframe(SparkSession spark) {
    StructType schema = DataTypes.createStructType(new StructField[] {
        DataTypes.createStructField(
            "col1",
            DataTypes.IntegerType,
            false),
        DataTypes.createStructField(
            "col2",
            DataTypes.StringType,
            false),
        DataTypes.createStructField(
            "sum",
            DataTypes.DoubleType,
            false) });
    List<Row> rows = new ArrayList<>();
    rows.add(RowFactory.create(1, "a", 3555204326.27));
    rows.add(RowFactory.create(4, "b", 22273491.72));
    rows.add(RowFactory.create(5, "c", 219175.0));
    rows.add(RowFactory.create(3, "a", 219175.0));
    rows.add(RowFactory.create(2, "c", 75341433.37));
    return spark.createDataFrame(rows, schema);
  }
}
You could add an index column to your DB and then ORDER BY id in your SQL query.
I believe you need to reformulate your query and use GROUP BY instead of DISTINCT, as suggested in the answer to "SQL: How to keep rows order with DISTINCT?"

Spark Dataframe: Select distinct rows

I tried two ways to find distinct rows from parquet but it doesn't seem to work.
Attempt 1:
Dataset<Row> df = sqlContext.read().parquet("location.parquet").distinct();
But it throws:
Cannot have map type columns in DataFrame which calls set operations
(intersect, except, etc.),
but the type of column canvasHashes is map<string,string>;;
Attempt 2:
I tried running a SQL query:
Dataset<Row> df = sqlContext.read().parquet("location.parquet");
df.createOrReplaceTempView("df");
Dataset<Row> landingDF = sqlContext.sql("SELECT distinct on timestamp * from df");
The error I get:
== SQL ==
SELECT distinct on timestamp * from df
-----------------------------^^^
Is there a way to get distinct records while reading parquet files? Any read option I can use.
The problem you face is explicitly stated in the exception message: because MapType columns are neither hashable nor orderable, they cannot be used as part of a grouping or partitioning expression.
Your SQL attempt is not logically equivalent to distinct on a Dataset. If you want to deduplicate data based on a set of compatible columns, you should use dropDuplicates:
df.dropDuplicates("timestamp")
which would be equivalent to
SELECT timestamp, first(c1) AS c1, first(c2) AS c2, ..., first(cn) AS cn,
first(canvasHashes) AS canvasHashes
FROM df GROUP BY timestamp
Unfortunately, if your goal is an actual DISTINCT, it won't be so easy. One possible solution is to leverage Scala* Map hashing. You could define a Scala UDF like this:
spark.udf.register("scalaHash", (x: Map[String, String]) => x.##)
and then use it in your Java code to derive a column that can be passed to dropDuplicates:
df
  .selectExpr("*", "scalaHash(canvasHashes) AS hash_of_canvas_hashes")
  .dropDuplicates(
    // All columns excluding canvasHashes / hash_of_canvas_hashes
    "timestamp", "c1", "c2", ..., "cn",
    // Hash used as surrogate of canvasHashes
    "hash_of_canvas_hashes"
  )
with SQL equivalent
SELECT
timestamp, c1, c2, ..., cn, -- All columns excluding canvasHashes
first(canvasHashes) AS canvasHashes
FROM df GROUP BY
timestamp, c1, c2, ..., cn -- All columns excluding canvasHashes
* Please note that java.util.Map with its hashCode won't work, as hashCode is not consistent.
1) If you want distinct values based on specific columns, select those columns and call distinct:
val df = sc.parallelize(Array((1, 2), (3, 4), (1, 6))).toDF("no", "age")
scala> df.show
+---+---+
| no|age|
+---+---+
| 1| 2|
| 3| 4|
| 1| 6|
+---+---+
val distinctValuesDF = df.select(df("no")).distinct
scala> distinctValuesDF.show
+---+
| no|
+---+
| 1|
| 3|
+---+
2) If you want unique rows across all columns, use dropDuplicates:
scala> val df = sc.parallelize(Array((1, 2), (3, 4),(3, 4), (1, 6))).toDF("no", "age")
scala> df.show
+---+---+
| no|age|
+---+---+
| 1| 2|
| 3| 4|
| 3| 4|
| 1| 6|
+---+---+
scala> df.dropDuplicates().show()
+---+---+
| no|age|
+---+---+
| 1| 2|
| 3| 4|
| 1| 6|
+---+---+
Yes, the syntax is incorrect, it should be:
Dataset<Row> landingDF = sqlContext.sql("SELECT distinct * from df");

How to create a new column in a Spark DataFrame based on a second DataFrame (Java)?

I have two Spark DataFrames. The first has two columns, id and Tag. The second has an id column but is missing the Tag. The first DataFrame is essentially a dictionary in which each id appears once, while in the second DataFrame an id may appear several times. I need to create a new column in the second DataFrame that holds the Tag as a function of the id in each row. I think this can be done by converting to RDDs first, but I thought there must be a more elegant way using DataFrames (in Java). Example: given a df1 row (id: 0, Tag: "A"), a df2 row 1 (id: 0, Tag: null) and a df2 row 2 (id: 0, Tag: "B"), I need a Tag column in the resulting DataFrame df3 that equals df1(id=0) = "A" if the df2 Tag was null, but keeps the original Tag if it is not null. This gives df3 row 1 (id: 0, Tag: "A") and df3 row 2 (id: 0, Tag: "B"). I hope the example is clear.
| ID | No. | Tag | new Tag Col |
| 1 | 10002 | A | A |
| 2 | 10003 | B | B |
| 1 | 10004 | null | A |
| 2 | 10005 | null | B |
All you need here is a left outer join and coalesce:
import org.apache.spark.sql.functions.coalesce
val df = sc.parallelize(Seq(
(1, 10002, Some("A")), (2, 10003, Some("B")),
(1, 10004, None), (2, 10005, None)
)).toDF("id", "no", "tag")
val lookup = sc.parallelize(Seq(
(1, "A"), (2, "B")
)).toDF("id", "tag")
df.join(lookup, df.col("id").equalTo(lookup.col("id")), "leftouter")
.withColumn("new_tag", coalesce(df.col("tag"), lookup.col("tag")))
The Java version should be almost identical.
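For readers who want the rule itself spelled out, here is a minimal plain-Java sketch of what the join plus coalesce computes for each row. It is not the Spark Java version of the answer; TagFallback and newTag are hypothetical names, and the lookup is modelled as an in-memory Map.

```java
import java.util.HashMap;
import java.util.Map;

final class TagFallback {
    /**
     * Mirrors coalesce(df.tag, lookup.tag): keep the row's own tag when present,
     * otherwise fall back to the tag registered for the id.
     */
    static String newTag(int id, String tag, Map<Integer, String> lookup) {
        // lookup.get returns null for an unknown id, just as a left outer join
        // leaves the lookup column null when there is no matching row.
        return tag != null ? tag : lookup.get(id);
    }

    public static void main(String[] args) {
        Map<Integer, String> lookup = new HashMap<>();
        lookup.put(1, "A");
        lookup.put(2, "B");
        System.out.println(newTag(1, null, lookup)); // A
        System.out.println(newTag(2, "B", lookup));  // B
    }
}
```

An existing tag always wins over the lookup, which matches the requirement to keep the original Tag when it is not null.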
