Iterating rows of a Spark Dataset and applying operations in Java API - java

New to Spark (2.4.x) and using the Java API (not Scala!!!)
I have a Dataset that I've read in from a CSV file. It has a schema (named columns) like so:
id (integer) | name (string) | color (string) | price (double) | enabled (boolean)
An example row:
23 | "hotmeatballsoup" | "blue" | 3.95 | true
There are many (tens of thousands) rows in the dataset. I would like to write an expression using the proper Java/Spark API, that scrolls through each row and applies the following two operations on each row:
If the price is null, default it to 0.00; and then
If the color column value is "red", add 2.55 to the price
Since I'm so new to Spark I'm not even sure where to begin! My best attempt thus far is definitely wrong, but it's at least a starting point I guess:
Dataset csvData = sparkSession.read()
    .format("csv")
    .load(fileToLoad.getAbsolutePath());

// ??? get rows somehow
Seq<Seq<String>> csvRows = csvData.getRows(???, ???);

// now how to loop through rows???
for (Seq<String> row : csvRows) {
    // how apply two operations specified above???
    if (row["price"] == null) {
        row["price"] = 0.00;
    }
    if (row["color"].equals("red")) {
        row["price"] = row["price"] + 2.55;
    }
}
Can someone help nudge me in the right direction here?

You could use the Spark SQL API to achieve this. Null values can also be replaced using .fill() from DataFrameNaFunctions. Otherwise you could convert the DataFrame to a Dataset and do these steps in .map, but the SQL API is better and more efficient in this case. Given this input:
+---+---------------+-----+-----+-------+
| id| name|color|price|enabled|
+---+---------------+-----+-----+-------+
| 23|hotmeatballsoup| blue| 3.95| true|
| 24| abc| red| 1.0| true|
| 24| abc| red| null| true|
+---+---------------+-----+-----+-------+
import sql functions before class declaration:
import static org.apache.spark.sql.functions.*;
sql api:
df.select(
    col("id"), col("name"), col("color"),
    when(col("color").equalTo("red").and(col("price").isNotNull()), col("price").plus(2.55))
        .when(col("color").equalTo("red").and(col("price").isNull()), 2.55)
        .otherwise(col("price")).as("price"),
    col("enabled")
).show();
or using temp view and sql query:
df.createOrReplaceTempView("df");
spark.sql("select id,name,color, case when color = 'red' and price is not null then (price + 2.55) when color = 'red' and price is null then 2.55 else price end as price, enabled from df").show();
output:
+---+---------------+-----+-----+-------+
| id| name|color|price|enabled|
+---+---------------+-----+-----+-------+
| 23|hotmeatballsoup| blue| 3.95| true|
| 24| abc| red| 3.55| true|
| 24| abc| red| 2.55| true|
+---+---------------+-----+-----+-------+
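If you only need the null-to-0.00 default on its own, the .fill() method from DataFrameNaFunctions mentioned above handles that step directly; a minimal Java sketch, assuming the same df and column names as in the example:
// default null prices to 0.00, then add 2.55 where color is "red"
Dataset<Row> withDefaults = df.na().fill(0.00, new String[] {"price"});
Dataset<Row> result = withDefaults.withColumn("price",
        when(col("color").equalTo("red"), col("price").plus(2.55))
            .otherwise(col("price")));
result.show();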

Related

update row value using another table in spark java

I have 2 Datasets in Java Spark. The first one contains id, name, and age; the second one has the same columns. I need to compare the values (name and id) and, where they match, update the age in the first dataset with the new age from dataset2.
I tried every approach I could find, but there aren't many resources on Spark for Java and none of the approaches worked.
This is one I tried:
dataset1.createOrReplaceTempView("updatesTable");
dataset2.createOrReplaceTempView("carsTheftsFinal2");
updatesNew.show();
Dataset<Row> test = spark.sql( "ALTER carsTheftsFinal2 set carsTheftsFinal2.age = updatesTable.age from updatesTable where carsTheftsFinal2.id = updatesTable.id AND carsTheftsFinal2.name = updatesTable.name ");
test.show(12);
and this is the error :
Exception in thread "main"
org.apache.spark.sql.catalyst.parser.ParseException: no viable
alternative at input 'ALTER carsTheftsFinal2'(line 1, pos 6)
I have a hint: I can use a join to do the update without an UPDATE statement (Spark SQL in Java does not provide UPDATE).
Assume that we have ds1 with this data:
+---+-----------+---+
| id| name|age|
+---+-----------+---+
| 1| Someone| 18|
| 2| Else| 17|
| 3|SomeoneElse| 14|
+---+-----------+---+
and ds2 with this data:
+---+----------+---+
| id| name|age|
+---+----------+---+
| 1| Someone| 14|
| 2| Else| 18|
| 3|NotSomeone| 14|
+---+----------+---+
According to your expected result, the final table would be:
+---+-----------+---+
| id| name|age|
+---+-----------+---+
| 3|SomeoneElse| 14| <-- not modified, same as ds1
| 1| Someone| 14| <-- modified, was 18
| 2| Else| 18| <-- modified, was 17
+---+-----------+---+
This is achieved with the following transformations. First, we rename ds2's age column to age2:
val renamedDs2 = ds2.withColumnRenamed("age", "age2")
Then:
// we join main dataset with the renamed ds2, now we have age and age2
ds1.join(renamedDs2, Seq("id", "name"), "left")
  // we overwrite age, if age2 is not null from ds2, take it, otherwise leave age
  .withColumn("age",
    when(col("age2").isNotNull, col("age2")).otherwise(col("age"))
  )
  // finally, we drop age2
  .drop("age2")
Hope this does what you want!
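Since the question asks for the Java API, here is a rough Java equivalent of the same transformation (a sketch, assuming ds1 and ds2 are Dataset<Row>, and using JavaConverters to build the Scala Seq of join columns):
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.when;

import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import scala.collection.JavaConverters;

// rename ds2's age to age2 so it does not clash after the join
Dataset<Row> renamedDs2 = ds2.withColumnRenamed("age", "age2");

// the multi-column join overload expects a Scala Seq of column names
scala.collection.Seq<String> joinCols =
        JavaConverters.asScalaBufferConverter(Arrays.asList("id", "name")).asScala();

Dataset<Row> result = ds1.join(renamedDs2, joinCols, "left")
        // overwrite age: take age2 from ds2 when it is not null, otherwise keep age
        .withColumn("age", when(col("age2").isNotNull(), col("age2")).otherwise(col("age")))
        // finally, drop the helper column
        .drop("age2");

result.show();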

Attempting to count unique users between two categories in Spark

I have a Dataset structure in Spark with two columns, one called user the other called category. Such that the table looks some what like this:
+---------------+---------------+
| user| category|
+---------------+---------------+
| garrett| syncopy|
| garrison| musictheory|
| marta| sheetmusic|
| garrett| orchestration|
| harold| chopin|
| marta| russianmusic|
| niko| piano|
| james| sheetmusic|
| manny| violin|
| charles| gershwin|
| dawson| cello|
| bob| cello|
| george| cello|
| george| americanmusic|
| bob| personalcompos|
| george| sheetmusic|
| fred| sheetmusic|
| bob| sheetmusic|
| garrison| sheetmusic|
| george| musictheory|
+---------------+---------------+
only showing top 20 rows
Each row in the table is unique, but a user and category can appear multiple times. The objective is to count the number of users that two categories share. For example, cello and americanmusic share a user named george, and musictheory and sheetmusic share the users george and garrison. The goal is to get the number of distinct users between n categories, meaning there are at most n squared edges between categories. I understand partially how to do this operation but I am struggling a little bit converting my thoughts to Spark Java.
My thinking is that I need to do a self-join on user to get a table that would be structured like this:
+---------------+---------------+---------------+
| user| category| category|
+---------------+---------------+---------------+
| garrison| musictheory| sheetmusic|
| george| musictheory| sheetmusic|
| garrison| musictheory| musictheory|
| george| musictheory| musictheory|
| garrison| sheetmusic| musictheory|
| george| sheetmusic| musictheory|
+---------------+---------------+---------------+
The self join operation in Spark (Java code) is not difficult:
Dataset<Row> newDataset = allUsersToCategories.join(allUsersToCategories, "user");
This is getting somewhere, however I get mappings to the same category, as in rows 3 and 4 of the above example, and I get backwards mappings where the categories are reversed, which essentially double counts each user interaction, like in rows 5 and 6 of the above example.
What I believe I need to do is have some sort of conditional in my join, something along the lines of X < Y, so that equal categories and duplicates get thrown away. Finally I then need to count the number of distinct rows for the n squared combinations, where n is the number of categories.
Could somebody please explain how to do this in Spark and specifically Spark Java since I am a little unfamiliar with the Scala syntax?
Thanks for the help.
I'm not sure if I understand your requirements correctly, but I will try to help.
According to my understanding, the expected result for the above data should look like below. If it's not right, please let me know and I will try to make the required modifications.
+--------------+--------------+-+
|_1 |_2 |
+--------------+--------------+-+
|personalcompos|sheetmusic |1|
|cello |musictheory |1|
|americanmusic |cello |1|
|cello |sheetmusic |2|
|cello |personalcompos|1|
|russianmusic |sheetmusic |1|
|americanmusic |sheetmusic |1|
|americanmusic |musictheory |1|
|musictheory |sheetmusic |2|
|orchestration |syncopy |1|
+--------------+--------------+-+
In this case you can solve your problem with below Scala code:
allUsersToCategories
.groupByKey(_.user)
.flatMapGroups{case (user, userCategories) =>
val categories = userCategories.map(uc => uc.category).toSeq
for {
c1 <- categories
c2 <- categories
if c1 < c2
} yield (c1, c2)
}
.groupByKey(x => x)
.count()
.show()
If you need a symmetric result you can just change the if statement in the flatMapGroups transformation to if c1 != c2.
Please note that in the above example I used the Dataset API, which for test purposes was created with the code below:
case class UserCategory(user: String, category: String)
val allUsersToCategories = session.createDataset(Seq(
  UserCategory("garrett", "syncopy"),
  UserCategory("garrison", "musictheory"),
  UserCategory("marta", "sheetmusic"),
  UserCategory("garrett", "orchestration"),
  UserCategory("harold", "chopin"),
  UserCategory("marta", "russianmusic"),
  UserCategory("niko", "piano"),
  UserCategory("james", "sheetmusic"),
  UserCategory("manny", "violin"),
  UserCategory("charles", "gershwin"),
  UserCategory("dawson", "cello"),
  UserCategory("bob", "cello"),
  UserCategory("george", "cello"),
  UserCategory("george", "americanmusic"),
  UserCategory("bob", "personalcompos"),
  UserCategory("george", "sheetmusic"),
  UserCategory("fred", "sheetmusic"),
  UserCategory("bob", "sheetmusic"),
  UserCategory("garrison", "sheetmusic"),
  UserCategory("george", "musictheory")
))
I was trying to provide an example in Java, but I don't have any experience with Java + Spark and it is too time consuming for me to migrate the above example from Scala to Java...
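For reference, a rough Java translation of the above Scala (a sketch, assuming allUsersToCategories is a Dataset<Row> with string columns user and category; it counts the pairs with groupBy instead of groupByKey):
import org.apache.spark.api.java.function.FlatMapGroupsFunction;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import scala.Tuple2;
import java.util.ArrayList;
import java.util.List;

Dataset<Tuple2<String, String>> pairs = allUsersToCategories
        .groupByKey((MapFunction<Row, String>) row -> row.getAs("user"), Encoders.STRING())
        .flatMapGroups((FlatMapGroupsFunction<String, Row, Tuple2<String, String>>) (user, rows) -> {
            // collect this user's categories, then emit every ordered pair with c1 < c2
            List<String> categories = new ArrayList<>();
            rows.forEachRemaining(r -> categories.add(r.getAs("category")));
            List<Tuple2<String, String>> out = new ArrayList<>();
            for (String c1 : categories) {
                for (String c2 : categories) {
                    if (c1.compareTo(c2) < 0) {
                        out.add(new Tuple2<>(c1, c2));
                    }
                }
            }
            return out.iterator();
        }, Encoders.tuple(Encoders.STRING(), Encoders.STRING()));

// the tuple columns are named _1 and _2 by default
pairs.groupBy("_1", "_2").count().show();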
I found the answer a couple of hours ago using spark sql:
Dataset<Row> connectionsPerSharedUser = spark.sql("SELECT a.user as user, "
    + "a.category as categoryOne, "
    + "b.category as categoryTwo "
    + "FROM allTable as a INNER JOIN allTable as b "
    + "ON a.user = b.user AND a.category < b.category");
This will then create a Dataset with three columns user, categoryOne, and categoryTwo. Each row will be unique and will indicate when the user exists in both categories.
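To get the per-pair user counts described in the question, you could then group this result and count it, for example:
connectionsPerSharedUser.groupBy("categoryOne", "categoryTwo").count().show();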

MySQL query to fetch list of data using logical operations

The following are the different kinds of books that customers read in a library. The values are stored as powers of 2 in a column called bookType.
I need to fetch the list of persons who read
only Novel, or only Fairytale, or only BedTime, or both Novel + Fairytale,
from the database with a logical (bitwise) query.
Fetch list for the following combinations :
person who reads only novel(Stored in DB as 1)
person who reads both novel and fairy tale(Stored in DB as 1+2 = 3)
person who reads all the three i.e {novel + fairy tale + bed time} (stored in DB as 1+2+4 = 7)
The sum of these values is stored in the database in a column called BookType (marked in red in the figure).
How can I fetch the above list using a MySQL query?
From the example, I need to fetch users such as novel readers (bookType 1, 3, 5, 7).
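To make the encoding concrete, here is a small illustrative Java sketch of the assumed bit flags (novel = 1, fairy tale = 2, bedtime = 4); it only illustrates why novel readers are 1, 3, 5 and 7:
// assumed flags: each book type is a power of two, so a reader's bookType
// is the sum of the types they read
final int NOVEL = 1, FAIRY_TALE = 2, BEDTIME = 4;

int bookType = 7;                                    // novel + fairy tale + bedtime

boolean readsNovel = (bookType & NOVEL) != 0;        // true for bookType 1, 3, 5, 7
boolean readsOnlyNovel = bookType == NOVEL;          // true only for bookType 1
boolean readsNovelAndFairyTale = bookType == (NOVEL | FAIRY_TALE); // true only for 3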
The heart of this question is conversion of decimal to binary, and MySQL has a function to do just that: CONV(num, from_base, to_base).
In this case from_base would be 10 and to_base would be 2.
I would wrap this in a UDF.
So given
MariaDB [sandbox]> select id,username
-> from users
-> where id < 8;
+----+----------+
| id | username |
+----+----------+
| 1 | John |
| 2 | Jane |
| 3 | Ali |
| 6 | Bruce |
| 7 | Martha |
+----+----------+
5 rows in set (0.00 sec)
MariaDB [sandbox]> select * from t;
+------+------------+
| id | type |
+------+------------+
| 1 | novel |
| 2 | fairy Tale |
| 3 | bedtime |
+------+------------+
3 rows in set (0.00 sec)
This UDF
drop function if exists book_type;
delimiter //
CREATE DEFINER=`root`#`localhost` FUNCTION `book_type`(
`indec` int
)
RETURNS varchar(255) CHARSET latin1
LANGUAGE SQL
NOT DETERMINISTIC
CONTAINS SQL
SQL SECURITY DEFINER
COMMENT ''
begin
declare tempstring varchar(100);
declare outstring varchar(100);
declare book_types varchar(100);
declare bin_position int;
declare str_length int;
declare checkit int;
set tempstring = reverse(lpad(conv(indec,10,2),4,0));
set str_length = length(tempstring);
set checkit = 0;
set bin_position = 0;
set book_types = '';
looper: while bin_position < str_length do
set bin_position = bin_position + 1;
set outstring = substr(tempstring,bin_position,1);
if outstring = 1 then
set book_types = concat(book_types,(select trim(type) from t where id = bin_position),',');
end if;
end while;
set outstring = book_types;
return outstring;
end //
delimiter ;
Results in
+----+----------+---------------------------+
| id | username | book_type(id) |
+----+----------+---------------------------+
| 1 | John | novel, |
| 2 | Jane | fairy Tale, |
| 3 | Ali | novel,fairy Tale, |
| 6 | Bruce | fairy Tale,bedtime, |
| 7 | Martha | novel,fairy Tale,bedtime, |
+----+----------+---------------------------+
5 rows in set (0.00 sec)
Note the loop in the UDF that walks through the binary string; the positions of the 1s relate to the ids in the lookup table.
I leave it to you to code for errors and tidy up.

Removing null elements and keeping non-null elements together on a list in jasper reports

I am using JRBeanCollectionDataSource as the datasource for a subreport. Each record in the list contains elements with either null or non-null values. This is my POJO:
public class PayslipDtl {
    private String earningSalaryHeadName;
    private double earningSalaryHeadAmount;
    private String deductionSalaryHeadName;
    private double deductionSalaryHeadAmount;
    String type;

    public PayslipDtl(String salaryHeadName,
            double salaryHeadAmount, String type) {
        if (type.equalsIgnoreCase("Earning")) {
            earningSalaryHeadName = salaryHeadName;
            earningSalaryHeadAmount = salaryHeadAmount;
        } else {
            deductionSalaryHeadName = salaryHeadName;
            deductionSalaryHeadAmount = salaryHeadAmount;
        }
    }

    //getters and setters
}
Based on the "type", the list is populated as such: {"Basic", 4755, null, 0.0}, {"HRA", 300, null, 0.0}, {null, 0.0, "Employee PF", 925}, {"Medical Allowance", 900, null, 0.0} and so on...
After setting isBlankWhenNull to true and using "Print when" expression, the record is displayed as such:
|Earning |Amount|Deduction |Amount|
--------------------|------|---------------------|------|
| Basic | 4755 | | |
| HRA | 300 | | |
| | | Employee PF | 925 |
| Medical Allowance | 900 | | |
| Fuel Reimbursement| 350 | | |
| | | Loan | 1000 |
---------------------------------------------------------
I want it to be displayed as such:
|Earning |Amount|Deduction |Amount|
--------------------|------|---------------------|------|
| Basic | 4755 | Employee PF | 925 |
| HRA | 300 | Loan | 1000 |
| Medical Allowance | 900 | | |
| Fuel Reimbursement| 350 | | |
---------------------------------------------------------
Setting isRemoveLineWhenBlank to true doesn't work since it is not the entire row which is blank but only a subset of elements of a row that is null.
Is it possible in Jasper?
I am using iReport Designer 5.0.1 with compatibility set to JasperReports3.5.1.
Use a List component for the deduction/amount; here you have a video tutorial on how to do this.
The deduction and amount fields in the list component then need the options Blank when null and Remove line when blank.
If this still gives you blank lines, try putting both fields in a frame inside the list and set those options on the frame too.
The only good solution is to create a separate table:
table employeeED:
    srno int,
    Earning varchar(50),
    EarnAmount double,
    Deduction varchar(50),
    DedAmount double
Then insert all earnings on the earning side and update all deductions on the deduction side.
// pseudocode made concrete: this assumes a JDBC Connection `con` and a
// scrollable ResultSet `rs` over the payslip rows (columns: type, earning, eamt)
int i = 1;
rs.beforeFirst();
while (rs.next()) {
    if (rs.getString("type").equalsIgnoreCase("Earning")) {
        try (PreparedStatement ps = con.prepareStatement(
                "INSERT INTO employeeED (srno, Earning, EarnAmount) VALUES (?, ?, ?)")) {
            ps.setInt(1, i++);
            ps.setString(2, rs.getString("earning"));
            ps.setDouble(3, rs.getDouble("eamt"));
            ps.executeUpdate();
        }
    }
}
int j = 1;
rs.beforeFirst();
while (rs.next()) {
    if (rs.getString("type").equalsIgnoreCase("Deduction")) {
        try (PreparedStatement ps = con.prepareStatement(
                "UPDATE employeeED SET Deduction = ?, DedAmount = ? WHERE srno = ?")) {
            ps.setString(1, rs.getString("earning"));
            ps.setDouble(2, rs.getDouble("eamt"));
            ps.setInt(3, j++);
            ps.executeUpdate();
        }
    }
}
Then use the employeeED table as the datasource.
100% working.
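A minimal sketch of that last step, assuming the report query selects from employeeED and `con` is the same JDBC connection (these names are assumptions, not from the original answer):
import java.util.HashMap;
import java.util.Map;
import net.sf.jasperreports.engine.JasperFillManager;
import net.sf.jasperreports.engine.JasperPrint;

// fill the compiled report from the connection that holds the employeeED table
Map<String, Object> params = new HashMap<>();
JasperPrint print = JasperFillManager.fillReport(jasperReport, params, con);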

Delete in sqlite certain rows with specific data that are relatively old in a table using java

I want to delete, for example, the first 3 (oldest) rows that have color1 as blue.
Example data set:
_id | name | surname | color1 | color2
1 | mark | jacobs | blue | green
2 | tony | hilo | black | red
13 | lisa | xyz | blue | green
4 | andre | qwerty | blue | green
9 | laura | abc | black | red
14 | kerr | jacobs | blue | green
I want to use execSQL rather than db.delete.
Which method is preferable?
And what should my code look like?
I will be using this inside Eclipse in an Android app.
db.execSQL("DELETE FROM MyTable WHERE _id IN " +
"(SELECT _id FROM MyTable WHERE color1 = ? ORDER BY _id LIMIT 3)",
new Object[] { "blue" });
execSQL is perfectly fine to use, especially when the command is so complex that using delete would require even more complex code.
It is NOT advisable to use execSQL for this or any SELECT/INSERT/UPDATE/DELETE operation, as execSQL does not return anything, such as errors or the rows affected by the query.
Instead, although it takes a little longer to write out:
Cursor c = db.query(table, new String[]{"_id"}, "color1" + "=?",
        new String[]{"blue"}, null, null, "_id ASC", "3");
String ids = "";
String qs = "";
for (c.moveToFirst(); !c.isAfterLast(); c.moveToNext()) {
    ids += c.getInt(c.getColumnIndex("_id")) + ",";
    qs += "?,";
}
c.close();
ids = !ids.equals("") ? ids.substring(0, ids.length() - 1) : ids;
qs = !qs.equals("") ? qs.substring(0, qs.length() - 1) : qs;
db.delete(table, "_id IN (" + qs + ")", ids.split(","));
Here's the reference for why execSQL is bad for this situation:
http://developer.android.com/reference/android/database/sqlite/SQLiteDatabase.html#execSQL(java.lang.String, java.lang.Object[])
DELETE FROM table WHERE _id IN
(SELECT _id FROM table ORDER BY _id ASC LIMIT 3);
