Apache Spark GroupBy / Aggregate - java

New to Spark and using Java. How do I do the equivalent of the SQL below:
select id, sum(thistransaction::decimal) , date_part('month', transactiondisplaydate::timestamp) as month from history group by id, month
Data Set looks like this:
ID, Spend, DateTime
468429,13.3,2017-09-01T11:43:16.999
520003,84.34,2017-09-01T11:46:49.999
520003,46.69,2017-09-01T11:24:34.999
520003,82.1,2017-09-01T11:45:19.999
468429,20.0,2017-09-01T11:40:14.999
468429,20.0,2017-09-01T11:38:16.999
520003,26.57,2017-09-01T12:46:25.999
468429,20.0,2017-09-01T12:25:04.999
468429,20.0,2017-09-01T12:25:04.999
520003,20.25,2017-09-01T12:24:51.999
The desired outcome is average weekly spend by customer.
This is more to do with which flow to use than the nuts and bolts of getting the data loaded etc.

This should work:
import static org.apache.spark.sql.functions.*;

// Extract the month ("MM") from the ISO timestamp string, then sum spend per ID and month
df = df.withColumn("Month", col("DateTime").substr(6, 2));
df = df.groupBy(col("ID"), col("Month")).agg(sum("Spend"));
df.show();
Which will produce
+------+-----+----------+
| ID|Month|sum(Spend)|
+------+-----+----------+
|520003| 09| 259.95|
|468429| 09| 93.3|
+------+-----+----------+
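Since the desired outcome is average weekly spend per customer, here is a possible extension of the snippet above (an untested sketch; it reuses the static functions import and assumes Spend is numeric and DateTime is an ISO-8601 string):
Dataset<Row> weekly = df
    .withColumn("Week", weekofyear(col("DateTime").cast("timestamp")))
    .groupBy(col("ID"), col("Week"))
    .agg(sum("Spend").as("weekly_spend"));      // total spend per customer per week
Dataset<Row> avgWeekly = weekly
    .groupBy(col("ID"))
    .agg(avg("weekly_spend").as("avg_weekly_spend")); // average of those weekly totals
avgWeekly.show();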

Here is another version, in Scala:
import spark.implicits._ // assuming a SparkSession named spark; needed for toDF and the ' column syntax

val df = Seq(
(468429,13.3,"2017-09-01T11:43:16.999"),
(520003,84.34,"2017-09-01T11:46:49.999"),
(520003,46.69,"2017-09-01T11:24:34.999"),
(520003,82.1,"2017-09-01T11:45:19.999"),
(468429,20.0,"2017-09-01T11:40:14.999"),
(468429,20.0,"2017-09-12T11:38:16.999"),
(520003,26.57,"2017-09-22T12:46:25.999"),
(468429,20.0,"2017-09-01T12:25:04.999"),
(468429,20.0,"2017-09-17T12:25:04.999"),
(520003,20.25,"2017-09-01T12:24:51.999")
).toDF("id","spend","datetime")
import org.apache.spark.sql.functions._
val df2 = df.select('id,'datetime,date_format($"datetime", "M").name("month"),
date_format($"datetime", "W").name("week"),'spend)
// Monthly
df2.groupBy('id,'month).agg(sum('spend).as("spend_sum"),avg('spend).as("spend_avg")).
select('id,'month,'spend_sum,'spend_avg).show()
+------+-----+---------+------------------+
| id|month|spend_sum| spend_avg|
+------+-----+---------+------------------+
|520003| 9| 259.95|51.989999999999995|
|468429| 9| 93.3| 18.66|
+------+-----+---------+------------------+
// Weekly
df2.groupBy('id,'month,'week).agg(sum('spend).as("spend_sum"),avg('spend).as("spend_avg")).
select('id,'month,'week,'spend_sum,'spend_avg).show()
+------+-----+----+---------+------------------+
| id|month|week|spend_sum| spend_avg|
+------+-----+----+---------+------------------+
|520003| 9| 4| 26.57| 26.57|
|468429| 9| 3| 20.0| 20.0|
|520003| 9| 1| 233.38| 58.345|
|468429| 9| 4| 20.0| 20.0|
|468429| 9| 1| 53.3|17.766666666666666|
+------+-----+----+---------+------------------+

Related

Is there a way to get the value from a column at a specific row and put it in the next row?

I have Data that looks the following
ID Sensor No
1 specificSensor 1
2 1234 null
3 1234 null
4 specificSensor 2
5 2345 null
6 2345 null
7
...
I need an output format like this
ID Sensor No
1 specificSensor 1
2 1234 1
3 1234 1
4 specificSensor 2
5 2345 2
6 2345 2
7
...
I'm using Apache Spark in Java.
After that, the data is processed using groupBy and pivot.
I'm thinking of something like
df.withColumn("No", functions.when(df.col("Sensor").equalTo("specificSensor"), functions.monotonically_increasing_id())
//this works as I need it
.otherwise(WHEN NULL THEN VALUE ABOVE);
I don't know if this is feasible.
Help appreciated, thanks a lot!
A dataframe with sensor ID ranges can be created and then joined to the original dataframe:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val df = Seq((1, "specificSensor", Some(1)),
(2, "1234", None),
(3, "1234", None),
(4, "specificSensor", Some(2)),
(5, "2345", None),
(6, "2345", None))
.toDF("ID", "Sensor", "No")
val idWindow = Window.orderBy("ID")
val sensorsRange = df
.where($"Sensor" === "specificSensor")
.withColumn("nextId", coalesce(lead($"id", 1).over(idWindow), lit(Long.MaxValue)))
sensorsRange.show(false)
val joinColumn = $"d.ID" > $"s.id" && $"d.ID" < $"s.nextId"
val result =
df.alias("d")
.join(sensorsRange.alias("s"), joinColumn, "left")
.select($"d.ID", $"d.Sensor", coalesce($"d.No", $"s.No").alias("No"))
Output (sensorsRange first, then the final result):
+---+--------------+---+-------------------+
|ID |Sensor |No |nextId |
+---+--------------+---+-------------------+
|1 |specificSensor|1 |4 |
|4 |specificSensor|2 |9223372036854775807|
+---+--------------+---+-------------------+
+---+--------------+---+
|ID |Sensor |No |
+---+--------------+---+
|1 |specificSensor|1 |
|2 |1234 |1 |
|3 |1234 |1 |
|4 |specificSensor|2 |
|5 |2345 |2 |
|6 |2345 |2 |
+---+--------------+---+
Using last aggregation with ignoreNulls over an ordered window does the trick
df.select(
$"ID",
$"Sensor",
last($"No", ignoreNulls = true) over Window.orderBy($"ID") as "No")
.show()
Output:
+---+--------------+---+
| ID| Sensor| No|
+---+--------------+---+
| 1|specificSensor| 1|
| 2| 1234| 1|
| 3| 1234| 1|
| 4|specificSensor| 2|
| 5| 2345| 2|
| 6| 2345| 2|
+---+--------------+---+
P.S. I have no working Java setup right now but should be easy to translate
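For reference, a possible Java translation of the snippet above (an untested sketch, assuming a static import of org.apache.spark.sql.functions):
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;

// Carry the last non-null "No" forward over rows ordered by ID
WindowSpec byId = Window.orderBy(col("ID"));
df.select(
        col("ID"),
        col("Sensor"),
        last(col("No"), true).over(byId).alias("No"))  // true = ignoreNulls
  .show();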

Spark Dataframe: Select distinct rows

I tried two ways to find distinct rows from parquet but neither seems to work.
Attempt 1:
Dataset<Row> df = sqlContext.read().parquet("location.parquet").distinct();
But it throws:
Cannot have map type columns in DataFrame which calls set operations
(intersect, except, etc.),
but the type of column canvasHashes is map<string,string>;;
Attempt 2:
I tried running SQL queries:
Dataset<Row> df = sqlContext.read().parquet("location.parquet");
df.createOrReplaceTempView("df");
Dataset<Row> landingDF = sqlContext.sql("SELECT distinct on timestamp * from df");
The error I get:
== SQL ==
SELECT distinct on timestamp * from df
-----------------------------^^^
Is there a way to get distinct records while reading parquet files? Is there any read option I can use?
The problem you face is explicitly stated in the exception message - because MapType columns are neither hashable nor orderable, they cannot be used as part of a grouping or partitioning expression.
Your take on the SQL solution is not logically equivalent to distinct on a Dataset. If you want to deduplicate data based on a set of compatible columns, you should use dropDuplicates:
df.dropDuplicates("timestamp")
which would be equivalent to
SELECT timestamp, first(c1) AS c1, first(c2) AS c2, ..., first(cn) AS cn,
first(canvasHashes) AS canvasHashes
FROM df GROUP BY timestamp
Unfortunately, if your goal is an actual DISTINCT, it won't be so easy. One possible solution is to leverage Scala* Map hashing. You could define a Scala UDF like this:
spark.udf.register("scalaHash", (x: Map[String, String]) => x.##)
and then use it in your Java code to derive a column that can be passed to dropDuplicates:
df
.selectExpr("*", "scalaHash(canvasHashes) AS hash_of_canvas_hashes")
.dropDuplicates(
// All columns excluding canvasHashes / hash_of_canvas_hashes
"timestamp", "c1", "c2", ..., "cn"
// Hash used as surrogate of canvasHashes
"hash_of_canvas_hashes"
)
with SQL equivalent
SELECT
timestamp, c1, c2, ..., cn, -- All columns excluding canvasHashes
first(canvasHashes) AS canvasHashes
FROM df GROUP BY
timestamp, c1, c2, ..., cn -- All columns excluding canvasHashes
* Please note that java.util.Map with its hashCode won't work, as hashCode is not consistent.
1) If you want distinct values based on specific columns, you can use select with distinct:
val df = sc.parallelize(Array((1, 2), (3, 4), (1, 6))).toDF("no", "age")
scala> df.show
+---+---+
| no|age|
+---+---+
| 1| 2|
| 3| 4|
| 1| 6|
+---+---+
val distinctValuesDF = df.select(df("no")).distinct
scala> distinctValuesDF.show
+---+
| no|
+---+
| 1|
| 3|
+---+
2) If you want unique rows across all columns, use dropDuplicates:
scala> val df = sc.parallelize(Array((1, 2), (3, 4),(3, 4), (1, 6))).toDF("no", "age")
scala> df.show
+---+---+
| no|age|
+---+---+
| 1| 2|
| 3| 4|
| 3| 4|
| 1| 6|
+---+---+
scala> df.dropDuplicates().show()
+---+---+
| no|age|
+---+---+
| 1| 2|
| 3| 4|
| 1| 6|
+---+---+
Yes, the syntax is incorrect, it should be:
Dataset<Row> landingDF = sqlContext.sql("SELECT distinct * from df");

How to get the size of result generated using concat_ws?

I am performing groupBy on COL1 and getting the concatenated list of COL2 using concat_ws. How can I get the count of values in that list? Here's my code:
Dataset<Row> ds = df.groupBy("COL1").agg(org.apache.spark.sql.functions
.concat_ws(",",org.apache.spark.sql.functions.collect_list("COL2")).as("sample"));
Use the size function:
size(e: Column): Column Returns length of array or map.
The following example is in Scala and I am leaving it to you to convert it to Java, but the general idea is exactly the same regardless of the programming language.
val input = spark.range(4)
.withColumn("COL1", $"id" % 2)
.select($"COL1", $"id" as "COL2")
scala> input.show
+----+----+
|COL1|COL2|
+----+----+
| 0| 0|
| 1| 1|
| 0| 2|
| 1| 3|
+----+----+
val s = input
.groupBy("COL1")
.agg(
concat_ws(",", collect_list("COL2")) as "concat",
size(collect_list("COL2")) as "size") // <-- size
scala> s.show
+----+------+----+
|COL1|concat|size|
+----+------+----+
| 0| 0,2| 2|
| 1| 1,3| 2|
+----+------+----+
In Java that'd be as follows. Thanks Krishna Prasad for sharing the code with the SO/Spark community!
Dataset<Row> ds = df.groupBy("COL1").agg(
    org.apache.spark.sql.functions.concat_ws(",", org.apache.spark.sql.functions.collect_list("COL2")).as("sample"),
    org.apache.spark.sql.functions.size(org.apache.spark.sql.functions.collect_list("COL2")).as("size"));

how to apply dictionary key to value project to a column in dataset in spark?

Newbie here on Spark... how can I use a column in a Spark dataset as a key to look up some values and add those values as a new column to the dataset?
In python, we have something like:
df.loc[:,'values'] = df.loc[:,'key'].apply(lambda x: D.get(x))
where D is a dictionary defined earlier in Python.
How can I do this in Spark using Java? Thank you.
Edit:
for example:
I have a following dataset df:
A
1
3
6
0
8
I want to create a weekday column based on the following dictionary:
D[1] = "Monday"
D[2] = "Tuesday"
D[3] = "Wednesday"
D[4] = "Thursday"
D[5] = "Friday"
D[6] = "Saturday"
D[7] = "Sunday"
and add the column back to my dataset df:
A days
1 Monday
3 Wednesday
6 Saturday
0 Sunday
8 NULL
This is just an example; column A could be anything other than integers, of course.
You can use df.withColumn to return a new df with the new column values and the previous values of df.
Create a udf (user-defined function) to apply the dictionary mapping.
Here's a reproducible example (in PySpark):
>>> from pyspark.sql.types import StringType
>>> from pyspark.sql.functions import udf
>>> df = spark.createDataFrame([{'A':1,'B':5},{'A':5,'B':2},{'A':1,'B':3},{'A':5,'B':4}], ['A','B'])
>>> df.show()
+---+---+
| A| B|
+---+---+
| 1| 5|
| 5| 2|
| 1| 3|
| 5| 4|
+---+---+
>>> d = {1:'x', 2:'y', 3:'w', 4:'t', 5:'z'}
>>> mapping_func = lambda x: d.get(x)
>>> df = df.withColumn('values',udf(mapping_func, StringType())("A"))
>>> df.show()
+---+---+------+
| A| B|values|
+---+---+------+
| 1| 5| x|
| 5| 2| z|
| 1| 3| x|
| 5| 4| z|
+---+---+------+
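A rough Java equivalent of the same idea (an untested sketch; the UDF name dayName is illustrative, it assumes a SparkSession named spark, and that column A holds integers):
import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;
import java.util.HashMap;
import java.util.Map;

// Look up each value of column "A" in a plain Java Map via a UDF
Map<Integer, String> days = new HashMap<>();
days.put(1, "Monday");
days.put(2, "Tuesday");
days.put(3, "Wednesday");
days.put(4, "Thursday");
days.put(5, "Friday");
days.put(6, "Saturday");
days.put(7, "Sunday");

spark.udf().register("dayName",
    (UDF1<Integer, String>) days::get,   // returns null for keys not in the map
    DataTypes.StringType);

df.withColumn("days", callUDF("dayName", col("A"))).show();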

Transform Spark Dataset - count and merge multiple rows by ID

After some data processing, I end up with this Dataset:
Dataset<Row> counts //ID,COUNT,DAY_OF_WEEK
Now I want to transform this to this format and save as CSV:
ID,COUNT_DoW1, ID,COUNT_DoW2, ID,COUNT_DoW3,..ID,COUNT_DoW7
I can think of one approach of:
JavaPairRDD<Long, Map<Integer, Integer>> r = counts.toJavaRDD().mapToPair(...)
JavaPairRDD<Long, Map<Integer, Integer>> merged = r.reduceByKey(...);
where it's a pair of "ID" and a list of size 7.
After I get the JavaPairRDD, I can store it as CSV. Is there a simpler approach for this transformation without converting it to an RDD?
You can use the struct function to construct a pair from cnt and day and then do a groupby with collect_list.
Something like this (Scala, but you can easily convert it to Java):
df.groupBy("ID").agg(collect_list(struct("COUNT","DAY")))
Now you can write a UDF which extracts the relevant column. Do a withColumn in a loop to copy the ID (df.withColumn("id2", col("id"))), then create a UDF which extracts the count element at position i and run it for each position, and lastly do the same for the day. If you keep the order you want and drop the irrelevant columns, you get what you asked for.
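A minimal Java sketch of that collect_list(struct(...)) step (assuming the Dataset is called counts with columns ID, COUNT and DAY_OF_WEEK, as in the question):
import static org.apache.spark.sql.functions.*;

// One row per ID with a list of (COUNT, DAY_OF_WEEK) structs
Dataset<Row> grouped = counts
    .groupBy("ID")
    .agg(collect_list(struct(col("COUNT"), col("DAY_OF_WEEK"))).as("counts_by_day"));
grouped.show(false);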
You can also work with the pivot command (again in Scala, but you should be able to easily convert it to Java):
df.show()
>>+---+---+---+
>>| id|cnt|day|
>>+---+---+---+
>>|333| 31| 1|
>>|333| 32| 2|
>>|333|133| 3|
>>|333| 34| 4|
>>|333| 35| 5|
>>|333| 36| 6|
>>|333| 37| 7|
>>|222| 41| 4|
>>|111| 11| 1|
>>|111| 22| 2|
>>|111| 33| 3|
>>|111| 44| 4|
>>|111| 55| 5|
>>|111| 66| 6|
>>|111| 77| 7|
>>|222| 21| 1|
>>+---+---+---+
val df2 = df.withColumn("all", struct('id, 'cnt, 'day))
val res = df2.groupBy("id").pivot("day").agg(first('all).as("bla")).select("1.*","2.*","3.*", "4.*", "5.*", "6.*", "7.*")
res.show()
>>+---+---+---+----+----+----+----+----+----+---+---+---+----+----+----+----+----+----+----+----+----+
>>| id|cnt|day| id| cnt| day| id| cnt| day| id|cnt|day| id| cnt| day| id| cnt| day| id| cnt| day|
>>+---+---+---+----+----+----+----+----+----+---+---+---+----+----+----+----+----+----+----+----+----+
>>|333| 31| 1| 333| 32| 2| 333| 133| 3|333| 34| 4| 333| 35| 5| 333| 36| 6| 333| 37| 7|
>>|222| 21| 1|null|null|null|null|null|null|222| 41| 4|null|null|null|null|null|null|null|null|null|
>>|111| 11| 1| 111| 22| 2| 111| 33| 3|111| 44| 4| 111| 55| 5| 111| 66| 6| 111| 77| 7|
>>+---+---+---+----+----+----+----+----+----+---+---+---+----+----+----+----+----+----+----+----+----+
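A possible Java translation of the pivot approach (an untested sketch; it assumes a Dataset<Row> df with columns id, cnt and day as in the Scala example, and that day always runs 1 through 7):
import static org.apache.spark.sql.functions.*;

// Pivot on day, keep the whole (id, cnt, day) struct per cell,
// then flatten the seven structs back into columns
Dataset<Row> df2 = df.withColumn("all", struct(col("id"), col("cnt"), col("day")));
Dataset<Row> res = df2.groupBy("id")
    .pivot("day")
    .agg(first(col("all")))
    .select("1.*", "2.*", "3.*", "4.*", "5.*", "6.*", "7.*");
res.show();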
