I tried two ways to get distinct rows from a Parquet file, but neither seems to work.
Attempt 1:
Dataset<Row> df = sqlContext.read().parquet("location.parquet").distinct();
But it throws:
Cannot have map type columns in DataFrame which calls set operations
(intersect, except, etc.),
but the type of column canvasHashes is map<string,string>;;
Attempt 2:
I also tried running a SQL query:
Dataset<Row> df = sqlContext.read().parquet("location.parquet");
df.createOrReplaceTempView("df");
Dataset<Row> landingDF = sqlContext.sql("SELECT distinct on timestamp * from df");
The error I get:
== SQL ==
SELECT distinct on timestamp * from df
-----------------------------^^^
Is there a way to get distinct records while reading Parquet files? Is there any read option I can use?
The problem you face is explicitly stated in the exception message: because MapType columns are neither hashable nor orderable, they cannot be used as part of a grouping or partitioning expression.
Your SQL attempt is not logically equivalent to distinct on a Dataset. If you want to deduplicate data based on a set of compatible columns, you should use dropDuplicates:
df.dropDuplicates("timestamp")
which would be equivalent to
SELECT timestamp, first(c1) AS c1, first(c2) AS c2, ..., first(cn) AS cn,
first(canvasHashes) AS canvasHashes
FROM df GROUP BY timestamp
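dropDuplicates("timestamp") keeps one row per timestamp, with no guarantee about which one. The keep-one-row-per-key semantics can be sketched in plain Java, no Spark required (the sketch keeps the first occurrence for determinism; the class and method names are my own, illustrative only):

```java
import java.util.*;

public class DropDupSketch {
    // Plain-Java sketch of dropDuplicates("timestamp") semantics:
    // keep one row per key (here: the first seen), drop the rest.
    // Note: Spark does not guarantee which duplicate survives.
    public static List<String[]> dedupByKey(List<String[]> rows) {
        Set<String> seen = new HashSet<>();
        List<String[]> kept = new ArrayList<>();
        for (String[] row : rows) {
            if (seen.add(row[0])) {  // row[0] plays the role of "timestamp"
                kept.add(row);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[]{"t1", "a"},
            new String[]{"t1", "b"},
            new String[]{"t2", "c"});
        for (String[] r : dedupByKey(rows)) {
            System.out.println(r[0] + " " + r[1]);
        }
        // t1 a
        // t2 c
    }
}
```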
Unfortunately, if your goal is an actual DISTINCT it won't be so easy. One possible solution is to leverage Scala* Map hashing. You could define a Scala UDF like this:
spark.udf.register("scalaHash", (x: Map[String, String]) => x.##)
and then use it in your Java code to derive a column that can be used with dropDuplicates:
df
  .selectExpr("*", "scalaHash(canvasHashes) AS hash_of_canvas_hashes")
  .dropDuplicates(
    // All columns excluding canvasHashes / hash_of_canvas_hashes
    "timestamp", "c1", "c2", ..., "cn",
    // Hash used as a surrogate for canvasHashes
    "hash_of_canvas_hashes"
  )
with the SQL equivalent:
SELECT
timestamp, c1, c2, ..., cn, -- All columns excluding canvasHashes
first(canvasHashes) AS canvasHashes
FROM df GROUP BY
timestamp, c1, c2, ..., cn -- All columns excluding canvasHashes
* Please note that java.util.Map with its hashCode won't work, as hashCode is not consistent.
1) If you want distinct values based on specific columns, use select followed by distinct:
val df = sc.parallelize(Array((1, 2), (3, 4), (1, 6))).toDF("no", "age")
scala> df.show
+---+---+
| no|age|
+---+---+
| 1| 2|
| 3| 4|
| 1| 6|
+---+---+
val distinctValuesDF = df.select(df("no")).distinct
scala> distinctValuesDF.show
+---+
| no|
+---+
| 1|
| 3|
+---+
2) If you want unique rows across all columns, use dropDuplicates:
scala> val df = sc.parallelize(Array((1, 2), (3, 4),(3, 4), (1, 6))).toDF("no", "age")
scala> df.show
+---+---+
| no|age|
+---+---+
| 1| 2|
| 3| 4|
| 3| 4|
| 1| 6|
+---+---+
scala> df.dropDuplicates().show()
+---+---+
| no|age|
+---+---+
| 1| 2|
| 3| 4|
| 1| 6|
+---+---+
Yes, the syntax is incorrect; it should be:
Dataset<Row> landingDF = sqlContext.sql("SELECT distinct * from df");
I have the below Java Spark dataset/dataframe.
Col_1 Col_2 Col_3 ...
A 1 1
A 1 NULL
B 2 2
B 2 3
C 1 NULL
There are close to 25 columns in this dataset, and I have to remove records that are duplicated on Col_1. If one of the duplicate records has a NULL, the NULL row has to be removed (as in the case of Col_1 = A). If there are multiple valid values, as for Col_1 = B, only one valid row (Col_2 = 2 and Col_3 = 2) should be retained. If there is only one record and it has a NULL, as for Col_1 = C, it has to be retained.
Expected Output:
Col_1 Col_2 Col_3 ...
A 1 1
B 2 2
C 1 NULL
What I tried so far:
I tried using groupBy and collect_set with sort_array and array_remove, but that removes the NULLs altogether, even when there is only one row.
How can I achieve the expected output in Java Spark?
This is how you can do it using Spark DataFrames:
import org.apache.spark.sql.functions.{coalesce, col, lit, min, struct}
import spark.implicits._ // needed for .toDF on a local Seq

val rows = Seq(
  ("A", 1, Some(1)),
  ("A", 1, Option.empty[Int]),
  ("B", 2, Some(2)),
  ("B", 2, Some(3)),
  ("C", 1, Option.empty[Int]))
  .toDF("Col_1", "Col_2", "Col_3")
rows.show()
+-----+-----+-----+
|Col_1|Col_2|Col_3|
+-----+-----+-----+
| A| 1| 1|
| A| 1| null|
| B| 2| 2|
| B| 2| 3|
| C| 1| null|
+-----+-----+-----+
val deduped = rows.groupBy(col("Col_1"))
  .agg(
    min(
      struct(
        coalesce(col("Col_3"), lit(Int.MaxValue)).as("null_maxed"),
        col("Col_2"),
        col("Col_3"))).as("argmax"))
  .select(col("Col_1"), col("argmax.Col_2"), col("argmax.Col_3"))
deduped.show()
+-----+-----+-----+
|Col_1|Col_2|Col_3|
+-----+-----+-----+
| B| 2| 2|
| C| 1| null|
| A| 1| 1|
+-----+-----+-----+
What's happening here is that you group by Col_1 and then take the minimum of a composite struct of Col_3 and Col_2, where NULLs in Col_3 have been replaced with the maximum integer value so they don't affect the ordering. We then select the original Col_2 and Col_3 from the resulting struct. I realise this is Scala, but the syntax for Java should be very similar.
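The struct-ordering trick can be illustrated without Spark. Below is a minimal Java sketch using a two-element Integer[] for (Col_2, Col_3) and an explicit comparator in place of Spark's struct ordering (the class and method names are my own, illustrative only):

```java
import java.util.*;

public class ArgMinSketch {
    // Plain-Java analogue of min(struct(null_maxed, Col_2, Col_3)):
    // order rows by (Col_3 with null mapped to Integer.MAX_VALUE,
    // then Col_2), so a null row loses to any non-null row.
    public static Integer[] best(List<Integer[]> rows) {
        return Collections.min(rows, Comparator
            .<Integer[]>comparingInt(r -> r[1] == null ? Integer.MAX_VALUE : r[1])
            .thenComparingInt(r -> r[0]));
    }

    public static void main(String[] args) {
        // Col_1 = B group: two non-null rows, smallest Col_3 wins
        List<Integer[]> groupB = Arrays.asList(
            new Integer[]{2, 2}, new Integer[]{2, 3});
        // Col_1 = A group: the non-null row beats the null row
        List<Integer[]> groupA = Arrays.asList(
            new Integer[]{1, 1}, new Integer[]{1, null});
        System.out.println(Arrays.toString(best(groupB)));  // [2, 2]
        System.out.println(Arrays.toString(best(groupA)));  // [1, 1]
    }
}
```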
I have two dataframes, DF1 and DF2, with id as the unique column.
DF2 may contain new records and updated values for existing records of DF1. When we merge the two dataframes, the result should include the new records and the updated values, while old records that were not updated should come through as they are.
Input example:
id name
10 abc
20 tuv
30 xyz
and
id name
10 abc
20 pqr
40 lmn
When I merge these two dataframes, I want the result as:
id name
10 abc
20 pqr
30 xyz
40 lmn
Use an outer join followed by a coalesce. In Scala:
import org.apache.spark.sql.functions.coalesce
import spark.implicits._

val df1 = Seq((10, "abc"), (20, "tuv"), (30, "xyz")).toDF("id", "name")
val df2 = Seq((10, "abc"), (20, "pqr"), (40, "lmn")).toDF("id", "name")

df1.select($"id", $"name".as("old_name"))
  .join(df2, Seq("id"), "outer")
  .withColumn("name", coalesce($"name", $"old_name"))
  .drop("old_name")
coalesce returns the first non-null value among its arguments, which in this case gives:
+---+----+
| id|name|
+---+----+
| 20| pqr|
| 40| lmn|
| 10| abc|
| 30| xyz|
+---+----+
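The coalesce semantics are easy to check without Spark: it simply returns its first non-null argument. A minimal plain-Java sketch (the generic helper below is illustrative, not part of Spark's API):

```java
public class CoalesceSketch {
    // coalesce(name, old_name): the first non-null argument wins.
    public static <T> T coalesce(T... values) {
        for (T v : values) {
            if (v != null) return v;
        }
        return null;
    }

    public static void main(String[] args) {
        // id 20 exists in both frames: df2's "pqr" wins
        System.out.println(coalesce("pqr", "tuv"));  // pqr
        // id 30 exists only in df1: fall back to the old name
        System.out.println(coalesce(null, "xyz"));   // xyz
    }
}
```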
Alternatively, drop the rows of df1 that df2 replaces with a left-anti join, then union df2:
df1.join(df2, Seq("id"), "leftanti").union(df2).show
+---+----+
| id|name|
+---+----+
| 30| xyz|
| 10| abc|
| 20| pqr|
| 40| lmn|
+---+----+
I am new to Spark and using Java. How do I do the equivalent of the SQL below:
select id, sum(thistransaction::decimal) , date_part('month', transactiondisplaydate::timestamp) as month from history group by id, month
Data Set looks like this:
ID, Spend, DateTime
468429,13.3,2017-09-01T11:43:16.999
520003,84.34,2017-09-01T11:46:49.999
520003,46.69,2017-09-01T11:24:34.999
520003,82.1,2017-09-01T11:45:19.999
468429,20.0,2017-09-01T11:40:14.999
468429,20.0,2017-09-01T11:38:16.999
520003,26.57,2017-09-01T12:46:25.999
468429,20.0,2017-09-01T12:25:04.999
468429,20.0,2017-09-01T12:25:04.999
520003,20.25,2017-09-01T12:24:51.999
The desired outcome is the average weekly spend per customer.
This is more a question of which flow to use than of the nuts and bolts of getting the data loaded.
This should work:
// requires: import static org.apache.spark.sql.functions.*;
df = df.withColumn("Month", col("DateTime").substr(6, 2));
df = df.groupBy(col("ID"), col("Month")).agg(sum("Spend"));
df.show();
Which will produce
+------+-----+----------+
| ID|Month|sum(Spend)|
+------+-----+----------+
|520003| 09| 259.95|
|468429| 09| 93.3|
+------+-----+----------+
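One detail worth calling out: Spark's Column.substr(6, 2) is 1-based, so it slices characters 6-7 of the ISO timestamp, i.e. the month. In plain Java, where String.substring is 0-based, the same slice looks like this (a small illustrative sketch; the class name is my own):

```java
public class MonthSubstr {
    // Spark's Column.substr(6, 2) is 1-based (characters 6..7);
    // Java's String.substring is 0-based, so the same slice is (5, 7).
    public static String month(String isoTimestamp) {
        return isoTimestamp.substring(5, 7);
    }

    public static void main(String[] args) {
        System.out.println(month("2017-09-01T11:43:16.999"));  // 09
    }
}
```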
Here is another version, in Scala, which also computes the weekly figures (note that some of the sample dates have been spread across the month so that several weeks show up):
val df = Seq(
  (468429, 13.3, "2017-09-01T11:43:16.999"),
  (520003, 84.34, "2017-09-01T11:46:49.999"),
  (520003, 46.69, "2017-09-01T11:24:34.999"),
  (520003, 82.1, "2017-09-01T11:45:19.999"),
  (468429, 20.0, "2017-09-01T11:40:14.999"),
  (468429, 20.0, "2017-09-12T11:38:16.999"),
  (520003, 26.57, "2017-09-22T12:46:25.999"),
  (468429, 20.0, "2017-09-01T12:25:04.999"),
  (468429, 20.0, "2017-09-17T12:25:04.999"),
  (520003, 20.25, "2017-09-01T12:24:51.999")
).toDF("id", "spend", "datetime")
import org.apache.spark.sql.functions._
import spark.implicits._ // for the 'symbol and $"col" syntax

val df2 = df.select('id, 'datetime,
  date_format($"datetime", "M").name("month"),
  date_format($"datetime", "W").name("week"), 'spend)
// Monthly
df2.groupBy('id, 'month)
  .agg(sum('spend).as("spend_sum"), avg('spend).as("spend_avg"))
  .select('id, 'month, 'spend_sum, 'spend_avg).show()
+------+-----+---------+------------------+
| id|month|spend_sum| spend_avg|
+------+-----+---------+------------------+
|520003| 9| 259.95|51.989999999999995|
|468429| 9| 93.3| 18.66|
+------+-----+---------+------------------+
// Weekly
df2.groupBy('id, 'month, 'week)
  .agg(sum('spend).as("spend_sum"), avg('spend).as("spend_avg"))
  .select('id, 'month, 'week, 'spend_sum, 'spend_avg).show()
+------+-----+----+---------+------------------+
| id|month|week|spend_sum| spend_avg|
+------+-----+----+---------+------------------+
|520003| 9| 4| 26.57| 26.57|
|468429| 9| 3| 20.0| 20.0|
|520003| 9| 1| 233.38| 58.345|
|468429| 9| 4| 20.0| 20.0|
|468429| 9| 1| 53.3|17.766666666666666|
+------+-----+----+---------+------------------+
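The week numbers above come from the "W" (week-of-month) date pattern. The same field can be computed with plain java.time, assuming US-style Sunday-start weeks, which matches the output shown (the class name is my own, illustrative only):

```java
import java.time.LocalDateTime;
import java.time.temporal.WeekFields;
import java.util.Locale;

public class WeekOfMonth {
    // date_format(col, "W") yields week-of-month; java.time exposes
    // the same field. Locale.US gives Sunday-start weeks, which
    // reproduces the week numbers in the table above.
    public static int weekOfMonth(String isoTimestamp) {
        return LocalDateTime.parse(isoTimestamp)
                .get(WeekFields.of(Locale.US).weekOfMonth());
    }

    public static void main(String[] args) {
        System.out.println(weekOfMonth("2017-09-01T11:43:16.999"));  // 1
        System.out.println(weekOfMonth("2017-09-22T12:46:25.999"));  // 4
    }
}
```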
I am performing groupBy on COL1 and getting the concatenated list of COL2 using concat_ws. How can I get the count of values in that list? Here's my code:
Dataset<Row> ds = df.groupBy("COL1").agg(org.apache.spark.sql.functions
.concat_ws(",",org.apache.spark.sql.functions.collect_list("COL2")).as("sample"));
Use the size function.
size(e: Column): Column — returns the length of an array or map.
The following example is in Scala; I am leaving it to you to convert it to Java, but the general idea is exactly the same regardless of the programming language.
val input = spark.range(4)
.withColumn("COL1", $"id" % 2)
.select($"COL1", $"id" as "COL2")
scala> input.show
+----+----+
|COL1|COL2|
+----+----+
| 0| 0|
| 1| 1|
| 0| 2|
| 1| 3|
+----+----+
val s = input
.groupBy("COL1")
.agg(
concat_ws(",", collect_list("COL2")) as "concat",
size(collect_list("COL2")) as "size") // <-- size
scala> s.show
+----+------+----+
|COL1|concat|size|
+----+------+----+
| 0| 0,2| 2|
| 1| 1,3| 2|
+----+------+----+
In Java that would be as follows. Thanks to Krishna Prasad for sharing the code with the SO/Spark community!
Dataset<Row> ds = df.groupBy("COL1").agg(
org.apache.spark.sql.functions.concat_ws(",",org.apache.spark.sql.functions.collect_list("COL2")).as("sample"),
org.apache.spark.sql.functions.size(org.apache.spark.sql.functions.collect_list("COL2")).as("size"));
Newbie here on Spark... how can I use a column in a Spark Dataset as a key to look up some values and add those values as a new column to the Dataset?
In python, we have something like:
df.loc[:,'values'] = df.loc[:,'key'].apply(lambda x: D.get(x))
where D is a Python dict defined earlier.
How can I do this in Spark using Java? Thank you.
Edit:
for example:
I have the following dataset df:
A
1
3
6
0
8
I want to create a weekday column based on the following dictionary:
D[1] = "Monday"
D[2] = "Tuesday"
D[3] = "Wednesday"
D[4] = "Thursday"
D[5] = "Friday"
D[6] = "Saturday"
D[7] = "Sunday"
and add the column back to my dataset df:
A days
1 Monday
3 Wednesday
6 Saturday
0 Sunday
8 NULL
This is just an example, column A could be anything other than integers of course.
You can use df.withColumn to return a new df with the new column values alongside the previous columns of df.
Create a UDF (user-defined function) to apply the dictionary mapping.
Here's a reproducible example, in PySpark:
>>> from pyspark.sql.types import StringType
>>> from pyspark.sql.functions import udf
>>> df = spark.createDataFrame([{'A':1,'B':5},{'A':5,'B':2},{'A':1,'B':3},{'A':5,'B':4}], ['A','B'])
>>> df.show()
+---+---+
| A| B|
+---+---+
| 1| 5|
| 5| 2|
| 1| 3|
| 5| 4|
+---+---+
>>> d = {1:'x', 2:'y', 3:'w', 4:'t', 5:'z'}
>>> mapping_func = lambda x: d.get(x)
>>> df = df.withColumn('values',udf(mapping_func, StringType())("A"))
>>> df.show()
+---+---+------+
| A| B|values|
+---+---+------+
| 1| 5| x|
| 5| 2| z|
| 1| 3| x|
| 5| 4| z|
+---+---+------+
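The lookup semantics the UDF relies on can be sketched in plain Java: a map get returns null for a missing key, and that null is exactly what ends up as NULL in the new column. Using the answer's toy dict (class and method names are my own, illustrative only):

```java
import java.util.HashMap;
import java.util.Map;

public class DictLookup {
    // The answer's toy mapping; Map.get returns null for a missing
    // key, which is what the UDF surfaces as NULL in the new column.
    static final Map<Integer, String> D = new HashMap<>();
    static {
        D.put(1, "x"); D.put(2, "y"); D.put(3, "w");
        D.put(4, "t"); D.put(5, "z");
    }

    public static String lookup(int key) {
        return D.get(key);  // null when the key is absent
    }

    public static void main(String[] args) {
        for (int a : new int[]{1, 5, 9}) {
            System.out.println(a + " -> " + lookup(a));
        }
        // 1 -> x
        // 5 -> z
        // 9 -> null
    }
}
```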