Handling comma-delimited columns with a dependency on another column in a Spark dataset - Java

I have the below Spark dataframe/dataset.
column_1  column_2     column_3  column_4
A,B       NameA,NameB  F         NameF
C         NameC        NULL      NULL
NULL      NULL         D,E       NameD,NULL
G         NULL         H         NameH
I         NameI        J         NULL
All four of the above columns are comma-delimited. I have to convert this into a new dataframe/dataset that has only 2 columns and no comma delimiters. The value in column_1 and its corresponding name in column_2 should be written to the output; similarly for column_3 and column_4. If both column_1 and column_2 are null, they are not required in the output.
Expected output:
out_column_1  out_column_2
A             NameA
B             NameB
F             NameF
C             NameC
D             NameD
E             NULL
G             NULL
H             NameH
I             NameI
J             NULL
Is there a way to achieve this in Java Spark without using UDFs?

Scala solution - I think it should work in Java too. Basically, just handle column_1/column_2 separately from column_3/column_4, and union the results. Lots of wrangling with arrays.
// maybe replace this with Dataset<Row> result = ... in Java
val result = df.select(
    split(col("column_1"), ",").alias("column_1"),
    split(col("column_2"), ",").alias("column_2")
  ).filter(
    "column_1 is not null"
  ).select(
    explode(
      arrays_zip(
        col("column_1"),
        coalesce(col("column_2"), array(lit(null)))
      )
    )
  ).select(
    "col.*"
  ).union(
    df.select(
      split(col("column_3"), ",").alias("column_3"),
      split(col("column_4"), ",").alias("column_4")
    ).filter(
      "column_3 is not null"
    ).select(
      explode(
        arrays_zip(
          col("column_3"),
          coalesce(col("column_4"), array(lit(null)))
        )
      )
    ).select("col.*")
  ).toDF(
    "out_column_1", "out_column_2"
  )
result.show
+------------+------------+
|out_column_1|out_column_2|
+------------+------------+
| A| NameA|
| B| NameB|
| C| NameC|
| G| null|
| I| NameI|
| F| NameF|
| D| NameD|
| E| null|
| H| NameH|
| J| null|
+------------+------------+
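A rough Java translation of the same pipeline (a sketch, untested; assumes df is the input Dataset<Row> and a static import of org.apache.spark.sql.functions):

import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Sketch: split each column pair, drop rows where the value column is null,
// zip the value/name arrays, explode, and union the two halves.
static Dataset<Row> flattenPairs(Dataset<Row> df) {
    Dataset<Row> first = df
        .select(
            split(col("column_1"), ",").alias("column_1"),
            split(col("column_2"), ",").alias("column_2"))
        .filter("column_1 is not null")
        .select(explode(arrays_zip(
            col("column_1"),
            coalesce(col("column_2"), array(lit(null))))))
        .select("col.*");
    Dataset<Row> second = df
        .select(
            split(col("column_3"), ",").alias("column_3"),
            split(col("column_4"), ",").alias("column_4"))
        .filter("column_3 is not null")
        .select(explode(arrays_zip(
            col("column_3"),
            coalesce(col("column_4"), array(lit(null))))))
        .select("col.*");
    return first.union(second).toDF("out_column_1", "out_column_2");
}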

Related

Multiple comma delimited values to separate row in Spark Java

I have the below dataset.
Column_1 is comma-separated, while Column_2 and Column_3 are colon-separated. All are string columns.
Every comma-separated value from Column_1 should become a separate row, with the equivalent value from Column_2 or Column_3 populated alongside it. Only one of Column_2 and Column_3 will be populated; they are never both populated at the same time.
If the number of values in Column_1 doesn't match the number of equivalent values in Column_2 or Column_3, then null has to be populated (see Column_1 = I,J and K,L).
Column_1  Column_2  Column_3
A,B,C,D   NULL      N1:N2:N3:N4
E,F       N5:N6     NULL
G         NULL      N7
H         NULL      NULL
I,J       NULL      N8
K,L       N9        NULL
I have to convert the delimited values into rows as below.
Column_1  Column_2
A         N1
B         N2
C         N3
D         N4
E         N5
F         N6
G         N7
H         NULL
I         N8
J         NULL
K         N9
L         NULL
Is there a way to achieve this with the Java Spark API without using UDFs?
Scala solution... it should be similar in Java. You can combine Column_2 and Column_3 using coalesce, split each side on its delimiter, use arrays_zip to pair up the elements, and explode the result into rows.
df.select(
    explode(
      arrays_zip(
        split(col("Column_1"), ","),
        coalesce(
          split(coalesce(col("Column_2"), col("Column_3")), ":"),
          array()
        )
      )
    ).alias("result")
  ).select(
    "result.*"
  ).toDF(
    "Column_1", "Column_2"
  ).show
+--------+--------+
|Column_1|Column_2|
+--------+--------+
| A| N1|
| B| N2|
| C| N3|
| D| N4|
| E| N5|
| F| N6|
| G| N7|
| H| null|
| I| N8|
| J| null|
| K| N9|
| L| null|
+--------+--------+
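For the Java API, the same expression carries over almost one-to-one (a sketch, untested; assumes df is the input Dataset<Row> and a static import of org.apache.spark.sql.functions):

// Sketch: coalesce the two name columns, split both sides, zip and explode.
Dataset<Row> result = df.select(
        explode(arrays_zip(
            split(col("Column_1"), ","),
            coalesce(
                split(coalesce(col("Column_2"), col("Column_3")), ":"),
                array())))
        .alias("result"))
    .select("result.*")
    .toDF("Column_1", "Column_2");
result.show();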
Here's another way: using the transform function you can iterate over the elements of Column_1 and build a map that you explode later:
df.withColumn(
    "mappings",
    split(coalesce(col("Column_2"), col("Column_3")), ":")
  ).selectExpr(
    "explode(transform(split(Column_1, ','), (x, i) -> map(x, mappings[i]))) as mappings"
  ).selectExpr(
    "explode(mappings) as (Column_1, Column_2)"
  ).show()
//+--------+--------+
//|Column_1|Column_2|
//+--------+--------+
//| A| N1|
//| B| N2|
//| C| N3|
//| D| N4|
//| E| N5|
//| F| N6|
//| G| N7|
//| H| null|
//| I| N8|
//| J| null|
//| K| N9|
//| L| null|
//+--------+--------+
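Because the logic lives in SQL expression strings, this variant needs almost no changes for Java (a sketch, untested; df is the input Dataset<Row>):

// Sketch: build the mappings array once, then let Spark SQL's transform/map/explode
// pair each Column_1 element with the mapping at the same index.
Dataset<Row> exploded = df
    .withColumn("mappings", split(coalesce(col("Column_2"), col("Column_3")), ":"))
    .selectExpr("explode(transform(split(Column_1, ','), (x, i) -> map(x, mappings[i]))) as mappings")
    .selectExpr("explode(mappings) as (Column_1, Column_2)");
exploded.show();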

Java Spark remove duplicates/nulls and preserve order

I have the below Java Spark dataset/dataframe.
Col_1  Col_2  Col_3  ...
A      1      1
A      1      NULL
B      2      2
B      2      3
C      1      NULL
There are close to 25 columns in this dataset, and I have to remove records that are duplicated on Col_1. If the second record is NULL, then the NULL record has to be removed (as in the case of Col_1 = A). If there are multiple valid values, as in the case of Col_1 = B, then only one valid record (Col_2 = 2 and Col_3 = 2) should be retained each time. If there is only one record and it has a null, as in the case of Col_1 = C, it has to be retained.
Expected Output:
Col_1  Col_2  Col_3  ...
A      1      1
B      2      2
C      1      NULL
What I tried so far:
I tried using groupBy and collect_set with sort_array and array_remove, but it removes the nulls altogether even when there is only one row.
How can I achieve the expected output in Java Spark?
This is how you can do it using Spark DataFrames:
import org.apache.spark.sql.functions.{coalesce, col, lit, min, struct}

val rows = Seq(
    ("A", 1, Some(1)),
    ("A", 1, Option.empty[Int]),
    ("B", 2, Some(2)),
    ("B", 2, Some(3)),
    ("C", 1, Option.empty[Int]))
  .toDF("Col_1", "Col_2", "Col_3")
rows.show()
+-----+-----+-----+
|Col_1|Col_2|Col_3|
+-----+-----+-----+
| A| 1| 1|
| A| 1| null|
| B| 2| 2|
| B| 2| 3|
| C| 1| null|
+-----+-----+-----+
val deduped = rows.groupBy(col("Col_1"))
  .agg(
    min(
      struct(
        coalesce(col("Col_3"), lit(Int.MaxValue)).as("null_maxed"),
        col("Col_2"),
        col("Col_3"))).as("argmax"))
  .select(col("Col_1"), col("argmax.Col_2"), col("argmax.Col_3"))
deduped.show()
+-----+-----+-----+
|Col_1|Col_2|Col_3|
+-----+-----+-----+
| B| 2| 2|
| C| 1| null|
| A| 1| 1|
+-----+-----+-----+
What's happening here is that you group by Col_1 and then take the minimum of a composite struct of Col_3 and Col_2, in which nulls in Col_3 have been replaced with the maximum integer value so that rows with real values sort first. We then select the original Col_2 and Col_3 from the resulting struct. I realise this is in Scala, but the syntax for Java should be very similar.
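A Java sketch of the same aggregation (untested; assumes rows is the input Dataset<Row> and a static import of org.apache.spark.sql.functions):

// Sketch: group by Col_1 and take the min of a struct whose leading field
// maps nulls in Col_3 to Integer.MAX_VALUE, so rows with real values win.
Dataset<Row> deduped = rows.groupBy(col("Col_1"))
    .agg(min(struct(
            coalesce(col("Col_3"), lit(Integer.MAX_VALUE)).as("null_maxed"),
            col("Col_2"),
            col("Col_3")))
        .as("argmax"))
    .select(col("Col_1"), col("argmax.Col_2"), col("argmax.Col_3"));
deduped.show();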

How to merge two dataframes in Spark Java/Scala based on a column?

I have two dataframes, DF1 and DF2, with id as the unique column.
DF2 may contain new records and updated values for existing records of DF1. When we merge the two dataframes, the result should include the new records, the records with updated values, and the remaining old records as they are.
Input example:
id  name
10  abc
20  tuv
30  xyz
and
id  name
10  abc
20  pqr
40  lmn
When I merge these two dataframes, I want the result as:
id  name
10  abc
20  pqr
30  xyz
40  lmn
Use an outer join followed by a coalesce. In Scala:
val df1 = Seq((10, "abc"), (20, "tuv"), (30, "xyz")).toDF("id", "name")
val df2 = Seq((10, "abc"), (20, "pqr"), (40, "lmn")).toDF("id", "name")
df1.select($"id", $"name".as("old_name"))
  .join(df2, Seq("id"), "outer")
  .withColumn("name", coalesce($"name", $"old_name"))
  .drop("old_name")
coalesce picks the first non-null value, which in this case returns:
+---+----+
| id|name|
+---+----+
| 20| pqr|
| 40| lmn|
| 10| abc|
| 30| xyz|
+---+----+
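In Java, the only awkward part is passing the Scala Seq of join columns; one way to do it (a sketch, untested; assumes df1 and df2 are Dataset<Row> and a static import of org.apache.spark.sql.functions):

import java.util.Collections;
import scala.collection.JavaConverters;

// Sketch: outer join on id, then prefer df2's name and fall back to df1's.
Dataset<Row> merged = df1.withColumnRenamed("name", "old_name")
    .join(df2,
        JavaConverters.asScalaBufferConverter(Collections.singletonList("id"))
            .asScala().toSeq(),
        "outer")
    .withColumn("name", coalesce(col("name"), col("old_name")))
    .drop("old_name");
merged.show();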
Alternatively, an anti-join keeps only the rows of df1 whose ids are absent from df2, and df2 is then appended unchanged:
df1.join(df2, Seq("id"), "leftanti").union(df2).show
+---+----+
| id|name|
+---+----+
| 30| xyz|
| 10| abc|
| 20| pqr|
| 40| lmn|
+---+----+

Find the number in list which is not present in Database column

I have a requirement where I need to find the first number in an ArrayList that is not present in a database (Oracle) column.
The scenario is like this:
Table 1:
From the above table I am thinking of making 3 lists:
List<Integer> lst1 = new ArrayList<Integer>();
List<String> lst2 = new ArrayList<String>();
List<String> lst3 = new ArrayList<String>();
lst1 -> [0,1,2,3,4,5,6....1000]
lst2 -> [a0,a1,a2,a3,a4,a5,a6....a1000]
lst3 -> [b0,b1,b2,b3,b4,b5,b6....b1000]
As of now the lists contain approximately 1000 values each, in serial order.
Now I have a database table as below.
How can I match the lists against the Range column?
I need to find which values from the lists are not present in this table.
For example, looking at "lst1", the first available value which is not there in the table is "1", then the next available value is "3", then 5, 6, and so on.
Similarly, for "lst2" the first missing value is "a3".
Is there any way to do this?
The query below obtains the prefix and the numeric values of range_start and range_end.
For simplicity of the examples I limited the ranges to 0-5.
SELECT lstname,
       regexp_substr( rangestart, '[^0-9]+') AS Prefix,
       regexp_substr( rangestart, '[0-9]+') AS r_start,
       regexp_substr( rangeend, '[0-9]+') AS r_end
FROM table_1
LSTNAME |PREFIX |R_START |R_END |
--------|-------|--------|------|
Lst1 | |0 |5 |
Lst2 |a |0 |5 |
Lst3 |b |0 |5 |
The query below generates all the values for the ranges, using the above query as a subquery.
Please note that CROSS JOIN LATERAL works on Oracle 12c and later; if you are using an earlier version then this query must be rewritten.
SELECT lstname, prefix || val AS val
FROM (
    SELECT lstname,
           regexp_substr( rangestart, '[^0-9]+') AS Prefix,
           regexp_substr( rangestart, '[0-9]+') AS r_start,
           regexp_substr( rangeend, '[0-9]+') AS r_end
    FROM table_1
) x
CROSS JOIN LATERAL (
    SELECT LEVEL - 1 + x.r_start AS val
    FROM dual
    CONNECT BY LEVEL <= x.r_end - x.r_start + 1
)
LSTNAME |VAL |
--------|----|
Lst1 |0 |
Lst1 |1 |
Lst1 |2 |
Lst1 |3 |
Lst1 |4 |
Lst1 |5 |
Lst2 |a0 |
Lst2 |a1 |
Lst2 |a2 |
Lst2 |a3 |
Lst2 |a4 |
Lst2 |a5 |
Lst3 |b0 |
Lst3 |b1 |
Lst3 |b2 |
Lst3 |b3 |
Lst3 |b4 |
Lst3 |b5 |
And now say that table_2 contains the following values:
SELECT * FROM table_2
RANGE |
------|
3 |
a0 |
a1 |
a5 |
b3 |
b4 |
b5 |
To find the missing values, just LEFT JOIN the above query to this table and keep only the rows that found no match.
Please note that I am quoting "RANGE" as the column name in this example because RANGE is a reserved word in Oracle.
SELECT lstname, val
FROM (
    SELECT lstname, prefix || val AS val
    FROM (
        SELECT lstname,
               regexp_substr( rangestart, '[^0-9]+') AS Prefix,
               regexp_substr( rangestart, '[0-9]+') AS r_start,
               regexp_substr( rangeend, '[0-9]+') AS r_end
        FROM table_1
    ) x
    CROSS JOIN LATERAL (
        SELECT LEVEL - 1 + x.r_start AS val
        FROM dual
        CONNECT BY LEVEL <= x.r_end - x.r_start + 1
    )
) XX
LEFT JOIN table_2 t2
  ON t2."RANGE" = xx.val
WHERE t2."RANGE" IS NULL
ORDER BY 1, 2;
LSTNAME |VAL |
--------|----|
Lst1 |0 |
Lst1 |1 |
Lst1 |2 |
Lst1 |4 |
Lst1 |5 |
Lst2 |a2 |
Lst2 |a3 |
Lst2 |a4 |
Lst3 |b0 |
Lst3 |b1 |
Lst3 |b2 |
This version of the subquery emulates a lateral join and should work on Oracle 10g, but I have not tested it:
SELECT lstname,
       prefix || column_value AS val
FROM (
    SELECT lstname,
           regexp_substr( rangestart, '[^0-9]+') AS Prefix,
           regexp_substr( rangestart, '[0-9]+') AS r_start,
           regexp_substr( rangeend, '[0-9]+') AS r_end
    FROM table_1
) x
CROSS JOIN table(cast(multiset(
    SELECT LEVEL - 1 + x.r_start AS val
    FROM dual
    CONNECT BY LEVEL <= x.r_end - x.r_start + 1
) as sys.OdciNumberList)) q
;

MySQL query to fetch list of data using logical operations

The following is the list of different kinds of books that customers read in a library. The values are stored as powers of 2 in a column called bookType.
I need to fetch the list of persons by the combinations of what they read:
only Novel, or only Fairytale, or only BedTime, or both Novel + Fairytale,
from the database using a query with logical operations.
Fetch the list for the following combinations:
person who reads only novels (stored in DB as 1)
person who reads both novels and fairy tales (stored in DB as 1+2 = 3)
person who reads all three, i.e. {novel + fairy tale + bed time} (stored in DB as 1+2+4 = 7)
The sum for each person is stored in the database in a column called BookType (marked in red in the figure).
How can I fetch the above list using a MySQL query?
For example, I need to fetch users such as novel readers (bookType values 1, 3, 5, 7).
The heart of this question is the conversion of decimal to binary, and MySQL has a function to do just that: CONV(num, from_base, to_base).
In this case from_base would be 10 and to_base would be 2.
I would wrap this in a UDF.
So given
MariaDB [sandbox]> select id,username
-> from users
-> where id < 8;
+----+----------+
| id | username |
+----+----------+
| 1 | John |
| 2 | Jane |
| 3 | Ali |
| 6 | Bruce |
| 7 | Martha |
+----+----------+
5 rows in set (0.00 sec)
MariaDB [sandbox]> select * from t;
+------+------------+
| id | type |
+------+------------+
| 1 | novel |
| 2 | fairy Tale |
| 3 | bedtime |
+------+------------+
3 rows in set (0.00 sec)
This UDF:
drop function if exists book_type;
delimiter //
CREATE DEFINER=`root`@`localhost` FUNCTION `book_type`(
    `indec` int
)
RETURNS varchar(255) CHARSET latin1
LANGUAGE SQL
NOT DETERMINISTIC
CONTAINS SQL
SQL SECURITY DEFINER
COMMENT ''
begin
    declare tempstring varchar(100);
    declare outstring varchar(100);
    declare book_types varchar(100);
    declare bin_position int;
    declare str_length int;
    declare checkit int;
    -- reverse the padded binary string so position 1 holds the least significant bit
    set tempstring = reverse(lpad(conv(indec,10,2),4,0));
    set str_length = length(tempstring);
    set checkit = 0;
    set bin_position = 0;
    set book_types = '';
    looper: while bin_position < str_length do
        set bin_position = bin_position + 1;
        set outstring = substr(tempstring,bin_position,1);
        if outstring = 1 then
            -- the bit position matches the id in the lookup table t
            set book_types = concat(book_types,(select trim(type) from t where id = bin_position),',');
        end if;
    end while;
    set outstring = book_types;
    return outstring;
end //
delimiter ;
Results in
+----+----------+---------------------------+
| id | username | book_type(id) |
+----+----------+---------------------------+
| 1 | John | novel, |
| 2 | Jane | fairy Tale, |
| 3 | Ali | novel,fairy Tale, |
| 6 | Bruce | fairy Tale,bedtime, |
| 7 | Martha | novel,fairy Tale,bedtime, |
+----+----------+---------------------------+
5 rows in set (0.00 sec)
Note how the loop in the UDF walks through the binary string, and how the positions of the 1s correspond to the ids in the lookup table.
I leave it to you to code for errors and tidy up.
