I have two datasets in Java Spark. The first one contains id, name, and age; the second one has the same columns. I need to compare the values (id and name), and where they match, update the age with the new age from dataset2.
I have tried every approach I could find, but Spark resources for Java are scarce and none of the approaches worked.
This is one I tried:
dataset1.createOrReplaceTempView("updatesTable");
dataset2.createOrReplaceTempView("carsTheftsFinal2");
updatesNew.show();
Dataset<Row> test = spark.sql( "ALTER carsTheftsFinal2 set carsTheftsFinal2.age = updatesTable.age from updatesTable where carsTheftsFinal2.id = updatesTable.id AND carsTheftsFinal2.name = updatesTable.name ");
test.show(12);
and this is the error:
Exception in thread "main"
org.apache.spark.sql.catalyst.parser.ParseException: no viable
alternative at input 'ALTER carsTheftsFinal2'(line 1, pos 6)
I have a hint that I can use a join to update without an UPDATE statement (Java Spark does not provide UPDATE).
Assume that we have ds1 with this data:
+---+-----------+---+
| id| name|age|
+---+-----------+---+
| 1| Someone| 18|
| 2| Else| 17|
| 3|SomeoneElse| 14|
+---+-----------+---+
and ds2 with this data:
+---+----------+---+
| id| name|age|
+---+----------+---+
| 1| Someone| 14|
| 2| Else| 18|
| 3|NotSomeone| 14|
+---+----------+---+
According to your expected result, the final table would be:
+---+-----------+---+
| id| name|age|
+---+-----------+---+
| 3|SomeoneElse| 14| <-- not modified, same as ds1
| 1| Someone| 14| <-- modified, was 18
| 2| Else| 18| <-- modified, was 17
+---+-----------+---+
This is achieved with the following transformations. First, we rename ds2's age column to age2:
import org.apache.spark.sql.functions.{col, when}

val renamedDs2 = ds2.withColumnRenamed("age", "age2")
Then:
// we join main dataset with the renamed ds2, now we have age and age2
ds1.join(renamedDs2, Seq("id", "name"), "left")
// we overwrite age, if age2 is not null from ds2, take it, otherwise leave age
.withColumn("age",
when(col("age2").isNotNull, col("age2")).otherwise(col("age"))
)
// finally, we drop age2
.drop("age2")
Hope this does what you want!
I have a partitioned Hive table described as
CREATE TABLE IF NOT EXISTS TRADES
(
TRADE_ID STRING,
TRADE_DATE INT,
-- ...
)
PARTITIONED BY (BUSINESS_DATE INT)
STORED AS PARQUET;
When I insert the data from a Spark-based Java application as
try (SparkSession sparkSession = SparkSession.builder()
.config(new SparkConf())
.enableHiveSupport()
// .config("hive.exec.dynamic.partition", "true")
// .config("hive.exec.dynamic.partition.mode", "nonstrict")
// .config("spark.sql.hive.convertMetastoreParquet", "false")
.getOrCreate()) {
dataset.select(columns(dataset, businessDate))
.write()
.format("parquet")
.option("compression", "snappy")
.mode(SaveMode.Append)
.insertInto("TRADES"));
}
//...
private Column[] columns(Dataset<Row> dataset, LocalDate businessDate) {
List<Column> columns = new ArrayList<>();
for (String column : appConfig.getColumns()) {
columns.add(dataset.col(column));
}
columns.add(lit(dateToInteger(businessDate)).as("BUSINESS_DATE"));
return columns.toArray(new Column[0]);
}
it fails with the following exception:
23/01/27 10:39:26 ERROR metadata.Hive: Exception when loading partition with parameters partPath=hdfs://path-to-trades/.hive-staging_hive_2023-01-27_10-38-29_374_384898966661095068-1/-ext-10000/BUSINESS_DATE=20221230, table=trades, partSpec={business_date=, BUSINESS_DATE=20221230}, replace=false, listBucketingEnabled=false, isAcid=false, hasFollowingStatsTask=false
org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Partition spec is incorrect. {business_date=, BUSINESS_DATE=20221230})
at org.apache.hadoop.hive.ql.metadata.Hive.loadPartitionInternal(Hive.java:1662)
at org.apache.hadoop.hive.ql.metadata.Hive.lambda$loadDynamicPartitions$4(Hive.java:1970)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: MetaException(message:Partition spec is incorrect. {business_date=, BUSINESS_DATE=20221230})
at org.apache.hadoop.hive.metastore.Warehouse.makePartName(Warehouse.java:329)
at org.apache.hadoop.hive.metastore.Warehouse.makePartPath(Warehouse.java:312)
at org.apache.hadoop.hive.ql.metadata.Hive.genPartPathFromTable(Hive.java:1751)
at org.apache.hadoop.hive.ql.metadata.Hive.loadPartitionInternal(Hive.java:1607)
... 5 more
I fixed this issue by specifying lower case for the partitioning column, both in the DDL:
CREATE TABLE IF NOT EXISTS TRADES
(
TRADE_ID STRING,
TRADE_DATE INT,
-- ...
)
PARTITIONED BY (business_date INT) -- <-- lower case!
STORED AS PARQUET;
and in the Java code:
columns.add(lit(dateToInteger(businessDate)).as("business_date")); // <-- lower case!
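For reference, a more generic variant of the same workaround (only a sketch, I have not verified it) would be to lower-case all column names right before the write instead of touching each literal:
// Hypothetical helper (untested): lower-case every column name so the Dataset
// schema matches the lower-cased column names in the Hive metastore.
private Dataset<Row> lowerCaseColumns(Dataset<Row> ds) {
    Dataset<Row> lowered = ds;
    for (String name : ds.columns()) {
        lowered = lowered.withColumnRenamed(name, name.toLowerCase());
    }
    return lowered;
}

// usage, replacing the select(...) line above:
// lowerCaseColumns(dataset.select(columns(dataset, businessDate))).write()...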
I suspect there is a way to configure Spark to use Hive's casing (or vice versa), or to make them case-insensitive. I tried uncommenting the properties for the SparkSession shown above, but no luck.
Does anyone know how to implement this properly? I use Spark 2.4.0 and Hive 2.1.1.
P.S. Output of sparkSession.sql("describe formatted TRADES").show(false):
+--------------------+--------------------+-------+
| col_name| data_type|comment|
+--------------------+--------------------+-------+
| ROW_INDEX| int| null|
| OP_GENESIS_FEED_ID| string| null|
| business_date| int| null|
|# Partition Infor...| | |
| # col_name| data_type|comment|
| business_date| int| null|
| | | |
|# Detailed Table ...| | |
| Database| managed| |
| Table| trades| |
| Owner| managed| |
| Created Time|Mon Jan 30 19:59:...| |
| Last Access|Thu Jan 01 02:00:...| |
| Created By|Spark 2.4.0-cdh6.2.1| |
| Type| MANAGED| |
| Provider| hive| |
| Table Properties|[transient_lastDd...| |
| Location|....................| |
| Serde Library|org.apache.hadoop...| |
| InputFormat|org.apache.hadoop...| |
| OutputFormat|org.apache.hadoop...| |
| Storage Properties|[serialization.fo...| |
| Partition Provider| Catalog| |
+--------------------+--------------------+-------+
New to Spark (2.4.x) and using the Java API (not Scala!!!)
I have a Dataset that I've read in from a CSV file. It has a schema (named columns) like so:
id (integer) | name (string) | color (string) | price (double) | enabled (boolean)
An example row:
23 | "hotmeatballsoup" | "blue" | 3.95 | true
There are many (tens of thousands) rows in the dataset. I would like to write an expression using the proper Java/Spark API, that scrolls through each row and applies the following two operations on each row:
If the price is null, default it to 0.00; and then
If the color column value is "red", add 2.55 to the price
Since I'm so new to Spark I'm not even sure where to begin! My best attempt thus far is definitely wrong, but it's at least a starting point, I guess:
Dataset csvData = sparkSession.read()
.format("csv")
.load(fileToLoad.getAbsolutePath());
// ??? get rows somehow
Seq<Seq<String>> csvRows = csvData.getRows(???, ???);
// now how to loop through rows???
for (Seq<String> row : csvRows) {
// how apply two operations specified above???
if (row["price"] == null) {
row["price"] = 0.00;
}
if (row["color"].equals("red")) {
row["price"] = row["price"] + 2.55;
}
}
Can someone help nudge me in the right direction here?
You could use the Spark SQL API to achieve this. Null values could also be replaced using .fill() from DataFrameNaFunctions. Alternatively, you could convert the DataFrame to a Dataset and do these steps in a .map, but the SQL API is better and more efficient in this case.
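For example, null prices could be defaulted up front with DataFrameNaFunctions (a minimal sketch, assuming the DataFrame variable is named df):
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// replace nulls in the price column with 0.00
Dataset<Row> withDefaults = df.na().fill(0.00, new String[] {"price"});
Given this sample data: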
+---+---------------+-----+-----+-------+
| id| name|color|price|enabled|
+---+---------------+-----+-----+-------+
| 23|hotmeatballsoup| blue| 3.95| true|
| 24| abc| red| 1.0| true|
| 24| abc| red| null| true|
+---+---------------+-----+-----+-------+
Import the SQL functions before the class declaration:
import static org.apache.spark.sql.functions.*;
Using the SQL API:
df.select(
col("id"), col("name"), col("color"),
when(col("color").equalTo("red").and(col("price").isNotNull()), col("price").plus(2.55))
.when(col("color").equalTo("red").and(col("price").isNull()), 2.55)
.otherwise(col("price")).as("price")
,col("enabled")
).show();
Or using a temp view and a SQL query:
df.createOrReplaceTempView("df");
spark.sql("select id,name,color, case when color = 'red' and price is not null then (price + 2.55) when color = 'red' and price is null then 2.55 else price end as price, enabled from df").show();
Output:
+---+---------------+-----+-----+-------+
| id| name|color|price|enabled|
+---+---------------+-----+-----+-------+
| 23|hotmeatballsoup| blue| 3.95| true|
| 24| abc| red| 3.55| true|
| 24| abc| red| 2.55| true|
+---+---------------+-----+-----+-------+
I have two dataframes, DF1 and DF2, with id as the unique column.
DF2 may contain new records and updated values for existing records of DF1. When we merge the two dataframes, the result should include the new records, the updated values for existing records, and the remaining old records unchanged.
Input example:
id name
10 abc
20 tuv
30 xyz
and
id name
10 abc
20 pqr
40 lmn
When I merge these two dataframes, I want the result as:
id name
10 abc
20 pqr
30 xyz
40 lmn
Use an outer join followed by a coalesce. In Scala:
import org.apache.spark.sql.functions.coalesce
import spark.implicits._ // assumes a SparkSession named spark

val df1 = Seq((10, "abc"), (20, "tuv"), (30, "xyz")).toDF("id", "name")
val df2 = Seq((10, "abc"), (20, "pqr"), (40, "lmn")).toDF("id", "name")
df1.select($"id", $"name".as("old_name"))
.join(df2, Seq("id"), "outer")
.withColumn("name", coalesce($"name", $"old_name"))
.drop("old_name")
coalesce returns the first non-null value, which in this case gives:
+---+----+
| id|name|
+---+----+
| 20| pqr|
| 40| lmn|
| 10| abc|
| 30| xyz|
+---+----+
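A rough Java equivalent (an untested sketch; it uses an explicit join condition plus a select with coalesce, since the using-columns join overload expects a Scala Seq):
import static org.apache.spark.sql.functions.coalesce;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Dataset<Row> merged = df1
        // full outer join keeps ids that exist only in df1 or only in df2
        .join(df2, df1.col("id").equalTo(df2.col("id")), "outer")
        .select(
            // prefer df2's values, fall back to df1's for ids missing in df2
            coalesce(df2.col("id"), df1.col("id")).as("id"),
            coalesce(df2.col("name"), df1.col("name")).as("name"));

merged.show();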
df1.join(df2, Seq("id"), "leftanti").union(df2).show
+---+----+
| id|name|
+---+----+
| 30| xyz|
| 10| abc|
| 20| pqr|
| 40| lmn|
+---+----+
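A rough Java sketch of the same idea (untested; note that union matches columns by position, so both dataframes must share the same column order):
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Dataset<Row> mergedAlt = df1
        // keep only df1 rows whose id does not appear in df2
        .join(df2, df1.col("id").equalTo(df2.col("id")), "leftanti")
        // then append all of df2
        .union(df2);
mergedAlt.show();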
I have a Dataset structure in Spark with two columns, one called user the other called category. Such that the table looks some what like this:
+---------------+---------------+
| user| category|
+---------------+---------------+
| garrett| syncopy|
| garrison| musictheory|
| marta| sheetmusic|
| garrett| orchestration|
| harold| chopin|
| marta| russianmusic|
| niko| piano|
| james| sheetmusic|
| manny| violin|
| charles| gershwin|
| dawson| cello|
| bob| cello|
| george| cello|
| george| americanmusic|
| bob| personalcompos|
| george| sheetmusic|
| fred| sheetmusic|
| bob| sheetmusic|
| garrison| sheetmusic|
| george| musictheory|
+---------------+---------------+
only showing top 20 rows
Each row in the table is unique, but a user and category can appear multiple times. The objective is to count the number of users that two categories share. For example, cello and americanmusic share a user named george, and musictheory and sheetmusic share the users george and garrison. The goal is to get the number of distinct shared users for every pair of the n categories, meaning there are at most n squared edges between categories. I partially understand how to do this operation, but I am struggling a little bit to convert my thoughts to Spark Java.
My thinking is that I need to do a self-join on user to get a table that would be structured like this:
+---------------+---------------+---------------+
| user| category| category|
+---------------+---------------+---------------+
| garrison| musictheory| sheetmusic|
| george| musictheory| sheetmusic|
| garrison| musictheory| musictheory|
| george| musictheory| musictheory|
| garrison| sheetmusic| musictheory|
| george| sheetmusic| musictheory|
+---------------+---------------+---------------+
The self join operation in Spark (Java code) is not difficult:
Dataset<Row> newDataset = allUsersToCategories.join(allUsersToCategories, "user");
This is getting somewhere; however, I get mappings to the same category, as in rows 3 and 4 of the above example, and I get backwards mappings where the categories are reversed, which essentially double counts each user interaction, as in rows 5 and 6 of the above example.
What I believe I need to do is add some sort of conditional to my join, something along the lines of X < Y, so that equal categories and duplicates get thrown away. Finally, I need to count the number of distinct rows for the n squared combinations, where n is the number of categories.
Could somebody please explain how to do this in Spark and specifically Spark Java since I am a little unfamiliar with the Scala syntax?
Thanks for the help.
I'm not sure if I understand your requirements correctly, but I will try to help.
According to my understanding, the expected result for the above data should look like below. If that's not right, please let me know and I will try to make the required modifications.
+--------------+--------------+-+
|_1 |_2 |
+--------------+--------------+-+
|personalcompos|sheetmusic |1|
|cello |musictheory |1|
|americanmusic |cello |1|
|cello |sheetmusic |2|
|cello |personalcompos|1|
|russianmusic |sheetmusic |1|
|americanmusic |sheetmusic |1|
|americanmusic |musictheory |1|
|musictheory |sheetmusic |2|
|orchestration |syncopy |1|
+--------------+--------------+-+
In this case you can solve your problem with the Scala code below:
allUsersToCategories
.groupByKey(_.user)
.flatMapGroups{case (user, userCategories) =>
val categories = userCategories.map(uc => uc.category).toSeq
for {
c1 <- categories
c2 <- categories
if c1 < c2
} yield (c1, c2)
}
.groupByKey(x => x)
.count()
.show()
If you need a symmetric result, you can just change the if statement in the flatMapGroups transformation to if c1 != c2.
Please note that in the above example I used the Dataset API; for test purposes it was created with the code below:
case class UserCategory(user: String, category: String)
import session.implicits._ // provides the Encoder for UserCategory (assumes the SparkSession variable is named session)

val allUsersToCategories = session.createDataset(Seq(
UserCategory("garrett", "syncopy"),
UserCategory("garrison", "musictheory"),
UserCategory("marta", "sheetmusic"),
UserCategory("garrett", "orchestration"),
UserCategory("harold", "chopin"),
UserCategory("marta", "russianmusic"),
UserCategory("niko", "piano"),
UserCategory("james", "sheetmusic"),
UserCategory("manny", "violin"),
UserCategory("charles", "gershwin"),
UserCategory("dawson", "cello"),
UserCategory("bob", "cello"),
UserCategory("george", "cello"),
UserCategory("george", "americanmusic"),
UserCategory("bob", "personalcompos"),
UserCategory("george", "sheetmusic"),
UserCategory("fred", "sheetmusic"),
UserCategory("bob", "sheetmusic"),
UserCategory("garrison", "sheetmusic"),
UserCategory("george", "musictheory")
))
I was trying to provide an example in Java, but I don't have any experience with Java + Spark and it is too time-consuming for me to migrate the above example from Scala to Java...
I found the answer a couple of hours ago using Spark SQL:
Dataset<Row> connectionsPerSharedUser = spark.sql("SELECT a.user as user, "
+ "a.category as categoryOne, "
+ "b.category as categoryTwo "
+ "FROM allTable as a INNER JOIN allTable as b "
+ "ON a.user = b.user AND a.user < b.user");
This will then create a Dataset with three columns: user, categoryOne, and categoryTwo. Each row is unique and indicates that the user exists in both categories.
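To then get the shared-user count per category pair, one possible follow-up (just a sketch reusing the Dataset above) is to group by the two category columns and count distinct users:
import static org.apache.spark.sql.functions.countDistinct;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Dataset<Row> sharedUserCounts = connectionsPerSharedUser
        .groupBy("categoryOne", "categoryTwo")
        .agg(countDistinct("user").as("sharedUsers"));
sharedUserCounts.show();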
The following is the list of the different kinds of books that customers read in a library. The values are stored as powers of 2 in a column called bookType.
I need to fetch the list of books for the combinations of persons who read
only Novel, or only Fairytale, or only BedTime, or both Novel + Fairytale,
from the database using a logical-operation query.
Fetch the list for the following combinations:
person who reads only a novel (stored in DB as 1)
person who reads both a novel and a fairy tale (stored in DB as 1 + 2 = 3)
person who reads all three, i.e. novel + fairy tale + bed time (stored in DB as 1 + 2 + 4 = 7)
These sums are stored in the database in a column called bookType (marked in red in the figure).
How can I fetch the above list using a MySQL query?
From the example, I need to fetch users such as the novel readers (bookType values 1, 3, 5, 7).
The heart of this question is conversion of decimal to binary, and MySQL has a function to do just that: CONV(num, from_base, to_base).
In this case from_base would be 10 and to_base would be 2.
I would wrap this in a UDF.
So given
MariaDB [sandbox]> select id,username
-> from users
-> where id < 8;
+----+----------+
| id | username |
+----+----------+
| 1 | John |
| 2 | Jane |
| 3 | Ali |
| 6 | Bruce |
| 7 | Martha |
+----+----------+
5 rows in set (0.00 sec)
MariaDB [sandbox]> select * from t;
+------+------------+
| id | type |
+------+------------+
| 1 | novel |
| 2 | fairy Tale |
| 3 | bedtime |
+------+------------+
3 rows in set (0.00 sec)
This UDF
drop function if exists book_type;
delimiter //
CREATE DEFINER=`root`@`localhost` FUNCTION `book_type`(
`indec` int
)
RETURNS varchar(255) CHARSET latin1
LANGUAGE SQL
NOT DETERMINISTIC
CONTAINS SQL
SQL SECURITY DEFINER
COMMENT ''
begin
declare tempstring varchar(100);
declare outstring varchar(100);
declare book_types varchar(100);
declare bin_position int;
declare str_length int;
declare checkit int;
set tempstring = reverse(lpad(conv(indec,10,2),4,0));
set str_length = length(tempstring);
set checkit = 0;
set bin_position = 0;
set book_types = '';
looper: while bin_position < str_length do
set bin_position = bin_position + 1;
set outstring = substr(tempstring,bin_position,1);
if outstring = 1 then
set book_types = concat(book_types,(select trim(type) from t where id = bin_position),',');
end if;
end while;
set outstring = book_types;
return outstring;
end //
delimiter ;
Results in
+----+----------+---------------------------+
| id | username | book_type(id) |
+----+----------+---------------------------+
| 1 | John | novel, |
| 2 | Jane | fairy Tale, |
| 3 | Ali | novel,fairy Tale, |
| 6 | Bruce | fairy Tale,bedtime, |
| 7 | Martha | novel,fairy Tale,bedtime, |
+----+----------+---------------------------+
5 rows in set (0.00 sec)
Note the loop in the UDF that walks through the binary string; the positions of the 1s relate to the ids in the lookup table.
I leave it to you to code for errors and tidy up.