How to merge two dataframes in Spark Java/Scala based on a column?

I have two dataframes, DF1 and DF2, with id as the unique column.
DF2 may contain new records and updated values for existing records of DF1. When we merge the two dataframes, the result should include the new records, the existing records with their updated values, and the remaining old records unchanged.
Input example:
id name
10 abc
20 tuv
30 xyz
and
id name
10 abc
20 pqr
40 lmn
When I merge these two dataframes, I want the result as:
id name
10 abc
20 pqr
30 xyz
40 lmn

Use an outer join followed by a coalesce. In Scala:
import org.apache.spark.sql.functions.coalesce
import spark.implicits._

val df1 = Seq((10, "abc"), (20, "tuv"), (30, "xyz")).toDF("id", "name")
val df2 = Seq((10, "abc"), (20, "pqr"), (40, "lmn")).toDF("id", "name")

df1.select($"id", $"name".as("old_name"))
  .join(df2, Seq("id"), "outer")                       // keep rows from both sides
  .withColumn("name", coalesce($"name", $"old_name"))  // prefer df2's value, fall back to df1's
  .drop("old_name")
coalesce returns the first non-null value among its arguments, which in this case gives:
+---+----+
| id|name|
+---+----+
| 20| pqr|
| 40| lmn|
| 10| abc|
| 30| xyz|
+---+----+
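The same approach can also be written against the Java API. A minimal sketch (not tested; it assumes df1 and df2 are Dataset&lt;Row&gt; with the columns above, and uses JavaConverters as one common way to build the Seq that the usingColumns join overload expects):
import static org.apache.spark.sql.functions.coalesce;
import static org.apache.spark.sql.functions.col;

import java.util.Collections;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import scala.collection.JavaConverters;

// keep df1's name around as old_name, outer-join on id, prefer df2's name
Dataset<Row> merged = df1.withColumnRenamed("name", "old_name")
    .join(df2,
          JavaConverters.asScalaBufferConverter(Collections.singletonList("id")).asScala().toSeq(),
          "outer")
    .withColumn("name", coalesce(col("name"), col("old_name")))
    .drop("old_name");
merged.show();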

Alternatively, keep only the rows of df1 whose id does not appear in df2 (a left-anti join) and union df2 back in:
df1.join(df2, Seq("id"), "leftanti").union(df2).show
+---+----+
| id|name|
+---+----+
| 30| xyz|
| 10| abc|
| 20| pqr|
| 40| lmn|
+---+----+

Related

update row value using another table in spark java

I have 2 datasets in Java Spark. The first one contains id, name, age, and the second one has the same columns. I need to compare the values (name and id) and, if they match, update the age with the new age from dataset2.
I tried all possible ways, but I found that Spark in Java does not have many resources for this, and none of the approaches worked.
This is one I tried:
dataset1.createOrReplaceTempView("updatesTable");
dataset2.createOrReplaceTempView("carsTheftsFinal2");
updatesNew.show();
Dataset<Row> test = spark.sql( "ALTER carsTheftsFinal2 set carsTheftsFinal2.age = updatesTable.age from updatesTable where carsTheftsFinal2.id = updatesTable.id AND carsTheftsFinal2.name = updatesTable.name ");
test.show(12);
and this is the error :
Exception in thread "main"
org.apache.spark.sql.catalyst.parser.ParseException: no viable
alternative at input 'ALTER carsTheftsFinal2'(line 1, pos 6)
I have a hint that I can use a join to do the update without an UPDATE statement (Spark's Java API does not provide UPDATE).
Assume that we have ds1 with this data:
+---+-----------+---+
| id| name|age|
+---+-----------+---+
| 1| Someone| 18|
| 2| Else| 17|
| 3|SomeoneElse| 14|
+---+-----------+---+
and ds2 with this data:
+---+----------+---+
| id| name|age|
+---+----------+---+
| 1| Someone| 14|
| 2| Else| 18|
| 3|NotSomeone| 14|
+---+----------+---+
According to your expected result, the final table would be:
+---+-----------+---+
| id| name|age|
+---+-----------+---+
| 3|SomeoneElse| 14| <-- not modified, same as ds1
| 1| Someone| 14| <-- modified, was 18
| 2| Else| 18| <-- modified, was 17
+---+-----------+---+
This is achieved with the following transformations. First, we rename ds2's age column to age2:
val renamedDs2 = ds2.withColumnRenamed("age", "age2")
Then:
// join the main dataset with the renamed ds2; now we have both age and age2
ds1.join(renamedDs2, Seq("id", "name"), "left")
  // overwrite age: if age2 from ds2 is not null, take it, otherwise keep the original age
  .withColumn("age",
    when(col("age2").isNotNull, col("age2")).otherwise(col("age"))
  )
  // finally, drop age2
  .drop("age2")
Hope this does what you want!
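Since the question targets the Java API, here is a rough Java sketch of the same join-and-overwrite idea (untested; ds1 and ds2 are assumed to be Dataset&lt;Row&gt; as above, and the Seq of join columns is built via JavaConverters):
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.when;

import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import scala.collection.JavaConverters;

// rename ds2's age so it does not clash with ds1's age after the join
Dataset<Row> renamedDs2 = ds2.withColumnRenamed("age", "age2");

Dataset<Row> updated = ds1
    // left join on id and name, keeping every row of ds1
    .join(renamedDs2,
          JavaConverters.asScalaBufferConverter(Arrays.asList("id", "name")).asScala().toSeq(),
          "left")
    // take age2 where ds2 had a matching row, otherwise keep the original age
    .withColumn("age", when(col("age2").isNotNull(), col("age2")).otherwise(col("age")))
    .drop("age2");
updated.show();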

Iterating rows of a Spark Dataset and applying operations in Java API

New to Spark (2.4.x) and using the Java API (not Scala!!!)
I have a Dataset that I've read in from a CSV file. It has a schema (named columns) like so:
id (integer) | name (string) | color (string) | price (double) | enabled (boolean)
An example row:
23 | "hotmeatballsoup" | "blue" | 3.95 | true
There are many (tens of thousands of) rows in the dataset. I would like to write an expression, using the proper Java/Spark API, that goes through each row and applies the following two operations to it:
If the price is null, default it to 0.00; and then
If the color column value is "red", add 2.55 to the price
Since I'm so new to Spark I'm not even sure where to begin! My best attempt thus far is definitely wrong, but it's at least a starting point I guess:
Dataset csvData = sparkSession.read()
.format("csv")
.load(fileToLoad.getAbsolutePath());
// ??? get rows somehow
Seq<Seq<String>> csvRows = csvData.getRows(???, ???);
// now how to loop through rows???
for (Seq<String> row : csvRows) {
// how apply two operations specified above???
if (row["price"] == null) {
row["price"] = 0.00;
}
if (row["color"].equals("red")) {
row["price"] = row["price"] + 2.55;
}
}
Can someone help nudge me in the right direction here?
You could use the Spark SQL API to achieve this. Null values could also be replaced using .fill() from DataFrameNaFunctions (a sketch of that variant is at the end of this answer). Otherwise you could convert the DataFrame to a Dataset and do these steps in .map, but the SQL API is simpler and more efficient in this case. Given this input:
+---+---------------+-----+-----+-------+
| id| name|color|price|enabled|
+---+---------------+-----+-----+-------+
| 23|hotmeatballsoup| blue| 3.95| true|
| 24| abc| red| 1.0| true|
| 24| abc| red| null| true|
+---+---------------+-----+-----+-------+
Import the SQL functions statically before the class declaration:
import static org.apache.spark.sql.functions.*;
Using the DataFrame API:
df.select(
    col("id"), col("name"), col("color"),
    when(col("color").equalTo("red").and(col("price").isNotNull()), col("price").plus(2.55))
        .when(col("color").equalTo("red").and(col("price").isNull()), 2.55)
        .otherwise(col("price")).as("price"),
    col("enabled")
).show();
Or using a temp view and a SQL query:
df.createOrReplaceTempView("df");
spark.sql("select id,name,color, case when color = 'red' and price is not null then (price + 2.55) when color = 'red' and price is null then 2.55 else price end as price, enabled from df").show();
Output:
+---+---------------+-----+-----+-------+
| id| name|color|price|enabled|
+---+---------------+-----+-----+-------+
| 23|hotmeatballsoup| blue| 3.95| true|
| 24| abc| red| 3.55| true|
| 24| abc| red| 2.55| true|
+---+---------------+-----+-----+-------+
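As mentioned at the top of this answer, the null handling could instead be pushed into .fill() from DataFrameNaFunctions, which leaves only the "red" surcharge for the when. A small sketch of that variant (same df as above):
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.when;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// replace null prices with 0.00 first...
Dataset<Row> withDefaults = df.na().fill(0.00, new String[]{"price"});

// ...then only the "red" surcharge is left to apply
withDefaults
    .withColumn("price",
        when(col("color").equalTo("red"), col("price").plus(2.55)).otherwise(col("price")))
    .show();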

Is there a way to get the value from a column at a specific row and put it in the next row?

I have data that looks like the following:
ID Sensor No
1 specificSensor 1
2 1234 null
3 1234 null
4 specificSensor 2
5 2345 null
6 2345 null
7
...
I need an output format like this
ID Sensor No
1 specificSensor 1
2 1234 1
3 1234 1
4 specificSensor 2
5 2345 2
6 2345 2
7
...
I'm using Apache Spark in Java.
After that, the data is processed using groupBy and pivot.
I'm thinking of something like
df.withColumn("No", functions.when(df.col("Sensor").equalTo("specificSensor"), functions.monotonically_increasing_id())
//this works as I need it
.otherwise(WHEN NULL THEN VALUE ABOVE);
I don't know whether this is feasible at all.
Help appreciated, thanks a lot!
A dataframe with the sensor ID ranges can be created and then joined to the original dataframe:
val df = Seq((1, "specificSensor", Some(1)),
(2, "1234", None),
(3, "1234", None),
(4, "specificSensor", Some(2)),
(5, "2345", None),
(6, "2345", None))
.toDF("ID", "Sensor", "No")
val idWindow = Window.orderBy("ID")
val sensorsRange = df
.where($"Sensor" === "specificSensor")
.withColumn("nextId", coalesce(lead($"id", 1).over(idWindow), lit(Long.MaxValue)))
sensorsRange.show(false)
val joinColumn = $"d.ID" > $"s.id" && $"d.ID" < $"s.nextId"
val result =
df.alias("d")
.join(sensorsRange.alias("s"), joinColumn, "left")
.select($"d.ID", $"d.Sensor", coalesce($"d.No", $"s.No").alias("No"))
Output:
+---+--------------+---+-------------------+
|ID |Sensor |No |nextId |
+---+--------------+---+-------------------+
|1 |specificSensor|1 |4 |
|4 |specificSensor|2 |9223372036854775807|
+---+--------------+---+-------------------+
+---+--------------+---+
|ID |Sensor |No |
+---+--------------+---+
|1 |specificSensor|1 |
|2 |1234 |1 |
|3 |1234 |1 |
|4 |specificSensor|2 |
|5 |2345 |2 |
|6 |2345 |2 |
+---+--------------+---+
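Since the question uses the Java API, a rough Java translation of the range-join approach above might look like this (a sketch, not tested; df is assumed to be a Dataset&lt;Row&gt; with the ID/Sensor/No columns):
import static org.apache.spark.sql.functions.coalesce;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.lead;
import static org.apache.spark.sql.functions.lit;

import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;

// rows that start a sensor block, each paired with the ID where the next block starts
Dataset<Row> sensorsRange = df
    .where(col("Sensor").equalTo("specificSensor"))
    .withColumn("nextId",
        coalesce(lead(col("ID"), 1).over(Window.orderBy("ID")), lit(Long.MAX_VALUE)));

// every other row falls into exactly one (id, nextId) range
Column joinColumn = col("d.ID").gt(col("s.ID")).and(col("d.ID").lt(col("s.nextId")));

Dataset<Row> result = df.alias("d")
    .join(sensorsRange.alias("s"), joinColumn, "left")
    .select(col("d.ID"), col("d.Sensor"), coalesce(col("d.No"), col("s.No")).alias("No"));
result.show();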
Using the last aggregation with ignoreNulls over an ordered window does the trick:
df.select(
    $"ID",
    $"Sensor",
    last($"No", ignoreNulls = true) over Window.orderBy($"ID") as "No")
  .show()
Output:
+---+--------------+---+
| ID| Sensor| No|
+---+--------------+---+
| 1|specificSensor| 1|
| 2| 1234| 1|
| 3| 1234| 1|
| 4|specificSensor| 2|
| 5| 2345| 2|
| 6| 2345| 2|
+---+--------------+---+
P.S. I have no working Java setup right now, but this should be easy to translate.
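For reference, a sketch of the same thing in the Java API (untested; df is the Dataset&lt;Row&gt; from the question):
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.last;

import org.apache.spark.sql.expressions.Window;

// last(..., ignoreNulls = true) over an ID-ordered window, as in the Scala version
df.select(
        col("ID"),
        col("Sensor"),
        last(col("No"), true).over(Window.orderBy(col("ID"))).as("No"))
    .show();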

Spark Dataframe: Select distinct rows

I tried two ways to get distinct rows from a Parquet file, but neither seems to work.
Attempt 1:
Dataset<Row> df = sqlContext.read().parquet("location.parquet").distinct();
But this throws:
Cannot have map type columns in DataFrame which calls set operations
(intersect, except, etc.),
but the type of column canvasHashes is map<string,string>;;
Attempt 2:
I tried running a SQL query:
Dataset<Row> df = sqlContext.read().parquet("location.parquet");
df.createOrReplaceTempView("df");
Dataset<Row> landingDF = sqlContext.sql("SELECT distinct on timestamp * from df");
The error I get:
== SQL ==
SELECT distinct on timestamp * from df
-----------------------------^^^
Is there a way to get distinct records while reading Parquet files? Is there any read option I can use?
The problem you face is explicitly stated in the exception message: because MapType columns are neither hashable nor orderable, they cannot be used as part of a grouping or partitioning expression.
Your SQL attempt is not logically equivalent to distinct on a Dataset. If you want to deduplicate data based on a set of compatible columns, you should use dropDuplicates:
df.dropDuplicates("timestamp")
which would be equivalent to
SELECT timestamp, first(c1) AS c1, first(c2) AS c2, ..., first(cn) AS cn,
first(canvasHashes) AS canvasHashes
FROM df GROUP BY timestamp
Unfortunately, if your goal is an actual DISTINCT, it won't be so easy. One possible solution is to leverage Scala* Map hashing. You could define a Scala UDF like this:
spark.udf.register("scalaHash", (x: Map[String, String]) => x.##)
and then use it in your Java code to derive a column that can be used with dropDuplicates:
df
    .selectExpr("*", "scalaHash(canvasHashes) AS hash_of_canvas_hashes")
    .dropDuplicates(
        // All columns excluding canvasHashes / hash_of_canvas_hashes
        "timestamp", "c1", "c2", ..., "cn",
        // Hash used as surrogate of canvasHashes
        "hash_of_canvas_hashes"
    )
with SQL equivalent
SELECT
timestamp, c1, c2, ..., cn, -- All columns excluding canvasHashes
first(canvasHashes) AS canvasHashes
FROM df GROUP BY
timestamp, c1, c2, ..., cn -- All columns excluding canvasHashes
* Please note that java.util.Map with its hashCode won't work, as hashCode is not consistent.
1) If you want distinct values based on specific columns, you can use this:
val df = sc.parallelize(Array((1, 2), (3, 4), (1, 6))).toDF("no", "age")
scala> df.show
+---+---+
| no|age|
+---+---+
| 1| 2|
| 3| 4|
| 1| 6|
+---+---+
val distinctValuesDF = df.select(df("no")).distinct
scala> distinctValuesDF.show
+---+
| no|
+---+
| 1|
| 3|
+---+
2) If you want unique rows across all columns, use dropDuplicates:
scala> val df = sc.parallelize(Array((1, 2), (3, 4),(3, 4), (1, 6))).toDF("no", "age")
scala> df.show
+---+---+
| no|age|
+---+---+
| 1| 2|
| 3| 4|
| 3| 4|
| 1| 6|
+---+---+
scala> df.dropDuplicates().show()
+---+---+
| no|age|
+---+---+
| 1| 2|
| 3| 4|
| 1| 6|
+---+---+
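Since the original question uses the Java API, the same calls look nearly identical there; a quick sketch (assuming df is a Dataset&lt;Row&gt;):
import static org.apache.spark.sql.functions.col;

// distinct values of a single column
df.select(col("no")).distinct().show();

// unique rows across all columns
df.dropDuplicates().show();

// unique rows based on a subset of columns
df.dropDuplicates("no").show();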
Yes, the syntax is incorrect, it should be:
Dataset<Row> landingDF = sqlContext.sql("SELECT distinct * from df");

how to apply a dictionary key-to-value mapping to a column in a dataset in spark?

Newbie here on Spark... how can I use a column in a Spark dataset as a key to look up some values and add those values as a new column to the dataset?
In Python, we have something like:
df.loc[:,'values'] = df.loc[:,'key'].apply(lambda x: D.get(x))
where D is a dictionary defined earlier in Python.
How can I do this in Spark using Java? Thank you.
Edit:
For example, I have the following dataset df:
A
1
3
6
0
8
I want to create a weekday column based on the following dictionary:
D[1] = "Monday"
D[2] = "Tuesday"
D[3] = "Wednesday"
D[4] = "Thursday"
D[5] = "Friday"
D[6] = "Saturday"
D[7] = "Sunday"
and add the column back to my dataset df:
A days
1 Monday
3 Wednesday
6 Saturday
0 Sunday
8 NULL
This is just an example; column A could be something other than integers, of course.
You can use df.withColumn to return a new df with the new column values plus the previous columns of df. Create a UDF (user defined function) to apply the dictionary mapping. Here's a reproducible example in PySpark (a Java sketch follows after it):
>>> from pyspark.sql.types import StringType
>>> from pyspark.sql.functions import udf
>>> df = spark.createDataFrame([{'A':1,'B':5},{'A':5,'B':2},{'A':1,'B':3},{'A':5,'B':4}], ['A','B'])
>>> df.show()
+---+---+
| A| B|
+---+---+
| 1| 5|
| 5| 2|
| 1| 3|
| 5| 4|
+---+---+
>>> d = {1:'x', 2:'y', 3:'w', 4:'t', 5:'z'}
>>> mapping_func = lambda x: d.get(x)
>>> df = df.withColumn('values',udf(mapping_func, StringType())("A"))
>>> df.show()
+---+---+------+
| A| B|values|
+---+---+------+
| 1| 5| x|
| 5| 2| z|
| 1| 3| x|
| 5| 4| z|
+---+---+------+
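Since the question asks for Java, here is one way the same dictionary lookup could be sketched with a Java UDF (untested; spark is assumed to be the SparkSession, df is the dataset from the question, column A is assumed to be an integer column, and keys missing from the map come back as null):
import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

import java.util.HashMap;
import java.util.Map;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

// the weekday dictionary from the question
Map<Integer, String> days = new HashMap<>();
days.put(1, "Monday");
days.put(2, "Tuesday");
days.put(3, "Wednesday");
days.put(4, "Thursday");
days.put(5, "Friday");
days.put(6, "Saturday");
days.put(7, "Sunday");

// register the lookup as a UDF and apply it to column A
spark.udf().register("dayName", (UDF1<Integer, String>) days::get, DataTypes.StringType);
df.withColumn("days", callUDF("dayName", col("A"))).show();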
