I have the dataset below.
Column_1 is comma-separated, and Column_2 and Column_3 are colon-separated. All are string columns.
Every comma-separated value from Column_1 should become a separate row, and the corresponding value from Column_2 or Column_3 should be populated alongside it. For a given row, either Column_2 or Column_3 is populated, never both at the same time.
If the number of values in Column_1 doesn't match the number of values in Column_2 or Column_3, the missing entries should be populated with null (see the rows with Column_1 = I,J and K,L).
Column_1 Column_2 Column_3
A,B,C,D NULL N1:N2:N3:N4
E,F N5:N6 NULL
G NULL N7
H NULL NULL
I,J NULL N8
K,L N9 NULL
I have to convert the delimited values into rows as below.
Column_1 Column_2
A N1
B N2
C N3
D N4
E N5
F N6
G N7
H NULL
I N8
J NULL
K N9
L NULL
Is there a way to achieve this with the Java Spark API without using UDFs?
Here is a Scala solution; it should be very similar in Java. You can combine Column_2 and Column_3 using coalesce, split each part with the appropriate delimiter, use arrays_zip to pair the values by position, and explode the result into rows.
df.select(
  explode(
    arrays_zip(
      split(col("Column_1"), ","),
      coalesce(
        split(coalesce(col("Column_2"), col("Column_3")), ":"),
        array()
      )
    )
  ).alias("result")
).select(
  "result.*"
).toDF(
  "Column_1", "Column_2"
).show
+--------+--------+
|Column_1|Column_2|
+--------+--------+
| A| N1|
| B| N2|
| C| N3|
| D| N4|
| E| N5|
| F| N6|
| G| N7|
| H| null|
| I| N8|
| J| null|
| K| N9|
| L| null|
+--------+--------+
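Since the question asks for the Java API, here is a rough, untested Java sketch of the same approach (assuming df is a Dataset<Row> with the columns above):
import static org.apache.spark.sql.functions.*;

// Zip Column_1's values with whichever of Column_2/Column_3 is populated, then explode into rows.
df.select(
      explode(arrays_zip(
          split(col("Column_1"), ","),
          coalesce(
              split(coalesce(col("Column_2"), col("Column_3")), ":"),
              array())
      )).alias("result"))
  .select("result.*")
  .toDF("Column_1", "Column_2")
  .show();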
Here's another way: using the transform function you can iterate over the elements of Column_1, build a map from each element to its corresponding value, and explode the maps afterwards:
df.withColumn(
  "mappings",
  split(coalesce(col("Column_2"), col("Column_3")), ":")
).selectExpr(
  "explode(transform(split(Column_1, ','), (x, i) -> map(x, mappings[i]))) as mappings"
).selectExpr(
  "explode(mappings) as (Column_1, Column_2)"
).show()
//+--------+--------+
//|Column_1|Column_2|
//+--------+--------+
//| A| N1|
//| B| N2|
//| C| N3|
//| D| N4|
//| E| N5|
//| F| N6|
//| G| N7|
//| H| null|
//| I| N8|
//| J| null|
//| K| N9|
//| L| null|
//+--------+--------+
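This transform-based variant translates to Java almost verbatim, since the core logic lives in selectExpr SQL strings (untested sketch; like the original, it needs Spark 2.4+ for transform):
import static org.apache.spark.sql.functions.*;

// Build the array of values from whichever of Column_2/Column_3 is populated,
// then map each Column_1 element to its value by position and explode the maps.
df.withColumn(
      "mappings",
      split(coalesce(col("Column_2"), col("Column_3")), ":"))
  .selectExpr(
      "explode(transform(split(Column_1, ','), (x, i) -> map(x, mappings[i]))) as mappings")
  .selectExpr("explode(mappings) as (Column_1, Column_2)")
  .show();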
I have the Spark dataframe/dataset below.
column_1 column_2 column_3 column_4
A,B NameA,NameB F NameF
C NameC NULL NULL
NULL NULL D,E NameD,NULL
G NULL H NameH
I NameI J NULL
All four columns above are comma-delimited. I have to convert this into a new dataframe/dataset that has only 2 columns, without any comma delimiters. Each value in column_1 and its corresponding name in column_2 should be written to the output, and similarly for column_3 and column_4. If both column_1 and column_2 are null, they are not required in the output.
Expected output:
out_column_1 out_column_2
A NameA
B NameB
F NameF
C NameC
D NameD
E NULL
G NULL
H NameH
I NameI
J NULL
Is there a way to achieve this in Java Spark without using UDFs?
Here is a Scala solution; I think it should work the same way in Java. Basically, handle column_1/column_2 separately from column_3/column_4 and union the results. There is a fair amount of wrangling with arrays.
// maybe replace this with Dataset<Row> result = ... in Java
val result = df.select(
  split(col("column_1"), ",").alias("column_1"),
  split(col("column_2"), ",").alias("column_2")
).filter(
  "column_1 is not null"
).select(
  explode(
    arrays_zip(
      col("column_1"),
      coalesce(col("column_2"), array(lit(null)))
    )
  )
).select(
  "col.*"
).union(
  df.select(
    split(col("column_3"), ",").alias("column_3"),
    split(col("column_4"), ",").alias("column_4")
  ).filter(
    "column_3 is not null"
  ).select(
    explode(
      arrays_zip(
        col("column_3"),
        coalesce(col("column_4"), array(lit(null)))
      )
    )
  ).select("col.*")
).toDF(
  "out_column_1", "out_column_2"
)
result.show
+------------+------------+
|out_column_1|out_column_2|
+------------+------------+
| A| NameA|
| B| NameB|
| C| NameC|
| G| null|
| I| NameI|
| F| NameF|
| D| NameD|
| E| null|
| H| NameH|
| J| null|
+------------+------------+
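For the Java API, an untested sketch of the same union-of-two-halves approach could look like this (df is the input Dataset<Row>; this is a translation of the Scala above, not verified code):
import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Handle column_1/column_2 first, then column_3/column_4, and union the two halves.
Dataset<Row> result = df
    .select(
        split(col("column_1"), ",").alias("column_1"),
        split(col("column_2"), ",").alias("column_2"))
    .filter("column_1 is not null")
    .select(explode(arrays_zip(
        col("column_1"),
        coalesce(col("column_2"), array(lit(null)))
    )))
    .select("col.*")
    .union(df
        .select(
            split(col("column_3"), ",").alias("column_3"),
            split(col("column_4"), ",").alias("column_4"))
        .filter("column_3 is not null")
        .select(explode(arrays_zip(
            col("column_3"),
            coalesce(col("column_4"), array(lit(null)))
        )))
        .select("col.*"))
    .toDF("out_column_1", "out_column_2");

result.show();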
I have the Java Spark dataset/dataframe below.
Col_1 Col_2 Col_3 ...
A 1 1
A 1 NULL
B 2 2
B 2 3
C 1 NULL
There are close to 25 columns in this dataset, and I have to remove the records that are duplicated on Col_1. If one of the duplicate records has a NULL, the NULL record has to be removed (as in the case of Col_1 = A). If there are multiple valid values, as in the case of Col_1 = B, then only one of them should be retained (here Col_2 = 2 and Col_3 = 2). If there is only one record and it contains a NULL, as in the case of Col_1 = C, it has to be retained.
Expected Output:
Col_1 Col_2 Col_3 ...
A 1 1
B 2 2
C 1 NULL
What I tried so far:
I tried using groupBy and collect_set with sort_array and array_remove, but that removes the nulls altogether, even when there is only one row.
How can I achieve the expected output in Java Spark?
This is how you can do it using Spark DataFrames:
import org.apache.spark.sql.functions.{coalesce, col, lit, min, struct}

val rows = Seq(
  ("A", 1, Some(1)),
  ("A", 1, Option.empty[Int]),
  ("B", 2, Some(2)),
  ("B", 2, Some(3)),
  ("C", 1, Option.empty[Int]))
  .toDF("Col_1", "Col_2", "Col_3")
rows.show()
+-----+-----+-----+
|Col_1|Col_2|Col_3|
+-----+-----+-----+
| A| 1| 1|
| A| 1| null|
| B| 2| 2|
| B| 2| 3|
| C| 1| null|
+-----+-----+-----+
val deduped = rows.groupBy(col("Col_1"))
  .agg(
    min(
      struct(
        coalesce(col("Col_3"), lit(Int.MaxValue)).as("null_maxed"),
        col("Col_2"),
        col("Col_3"))).as("argmax"))
  .select(col("Col_1"), col("argmax.Col_2"), col("argmax.Col_3"))
deduped.show()
+-----+-----+-----+
|Col_1|Col_2|Col_3|
+-----+-----+-----+
| B| 2| 2|
| C| 1| null|
| A| 1| 1|
+-----+-----+-----+
What's happening here is that you group by Col_1 and then take the minimum of a composite struct of Col_3 and Col_2, with nulls in Col_3 replaced by the maximum integer value so they don't win the ordering. We then select the original Col_2 and Col_3 from the resulting struct. I realise this is in Scala, but the syntax for Java should be very similar.
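As a rough, untested Java sketch of the same aggregation (rows being the Dataset<Row> shown above):
import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Per Col_1, keep the struct that sorts first once nulls in Col_3 are pushed to the end.
Dataset<Row> deduped = rows.groupBy(col("Col_1"))
    .agg(min(struct(
        coalesce(col("Col_3"), lit(Integer.MAX_VALUE)).as("null_maxed"),
        col("Col_2"),
        col("Col_3"))).as("argmax"))
    .select(col("Col_1"), col("argmax.Col_2"), col("argmax.Col_3"));

deduped.show();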
I have a web service built around Spark that, based on a JSON request, builds a series of dataframe/dataset operations.
These operations involve multiple joins, filters, etc. that change the ordering of the values in the columns. The final dataset could have rows on the scale of millions.
Preferably without converting it to an RDD, is there any way to apply custom sorts on some columns of the final dataset, based on the order of elements passed in as Lists?
The original dataframe is of the form
+----------+----------+
| Column 1 | Column 2 |
+----------+----------+
| Val 1 | val a |
+----------+----------+
| Val 2 | val b |
+----------+----------+
| val 3 | val c |
+----------+----------+
After a series of transformations are performed, the dataframe ends up looking like this.
+----------+----------+----------+----------+
| Column 1 | Column 2 | Column 3 | Column 4 |
+----------+----------+----------+----------+
| Val 2 | val b | val 999 | val 900 |
+----------+----------+----------+----------+
| Val 1 | val c | val 100 | val 9$## |
+----------+----------+----------+----------+
| val 3 | val a | val 2## | val $##8 |
+----------+----------+----------+----------+
I now need to apply a sort on multiple columns based on the order of the values passed in as ArrayLists.
For example:
Col1 values order = [val 1, val 3, val 2]
Col3 values order = [100, 2##, 999]
Custom sorting works by creating a column to sort on. It does not need to remain a visible column in the dataframe. I can show it using PySpark.
Initial df:
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [(1, 'a', 'A'),
     (2, 'a', 'B'),
     (3, 'a', 'C'),
     (4, 'b', 'A'),
     (5, 'b', 'B'),
     (6, 'b', 'C'),
     (7, 'c', 'A'),
     (8, 'c', 'B'),
     (9, 'c', 'C')],
    ['id', 'c1', 'c2']
)
Custom sorting on 1 column:
from itertools import chain
order = {'b': 1, 'a': 2, 'c': 3}
sort_col = F.create_map([F.lit(x) for x in chain(*order.items())])[F.col('c1')]
df = df.sort(sort_col)
df.show()
# +---+---+---+
# | id| c1| c2|
# +---+---+---+
# | 5| b| B|
# | 6| b| C|
# | 4| b| A|
# | 1| a| A|
# | 2| a| B|
# | 3| a| C|
# | 7| c| A|
# | 8| c| B|
# | 9| c| C|
# +---+---+---+
On 2 columns:
from itertools import chain
order1 = {'b': 1, 'a': 2, 'c': 3}
order2 = {'B': 1, 'C': 2, 'A': 3}
sort_col1 = F.create_map([F.lit(x) for x in chain(*order1.items())])[F.col('c1')]
sort_col2 = F.create_map([F.lit(x) for x in chain(*order2.items())])[F.col('c2')]
df = df.sort(sort_col1, sort_col2)
df.show()
# +---+---+---+
# | id| c1| c2|
# +---+---+---+
# | 5| b| B|
# | 6| b| C|
# | 4| b| A|
# | 2| a| B|
# | 3| a| C|
# | 1| a| A|
# | 8| c| B|
# | 9| c| C|
# | 7| c| A|
# +---+---+---+
Or as a function:
from itertools import chain

def cust_sort(col: str, order: dict):
    return F.create_map([F.lit(x) for x in chain(*order.items())])[F.col(col)]

df = df.sort(
    cust_sort('c1', {'b': 1, 'a': 2, 'c': 3}),
    cust_sort('c2', {'B': 1, 'C': 2, 'A': 3})
)
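If the service is written in Java, a rough equivalent sketch (untested) uses functions.map, the Java/Scala counterpart of create_map, to build a literal rank map and Column.apply to look up the sort key; the column names c1/c2 follow the PySpark example above:
import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.Column;

// Literal value -> rank maps; values missing from a map become null sort keys.
Column c1Order = map(lit("b"), lit(1), lit("a"), lit(2), lit("c"), lit(3));
Column c2Order = map(lit("B"), lit(1), lit("C"), lit(2), lit("A"), lit(3));

// Sort on the looked-up ranks; the helper columns never need to be added to the dataframe.
df = df.sort(c1Order.apply(col("c1")), c2Order.apply(col("c2")));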
I have a Dataset as below:
+------+------+---------------+
| col1 | col2 | sum(costs) |
+------+------+---------------+
| 1 | a | 3555204326.27 |
| 4 | b | 22273491.72 |
| 5 | c | 219175.00 |
| 3 | a | 219175.00 |
| 2 | c | 75341433.37 |
+------+------+---------------+
I need to select the distinct values of col1, and my resulting dataset should keep the order 1, 4, 5, 3, 2 (the order in which these values appear in the initial dataset). But the order is getting shuffled. Is there any way to maintain the same order as the initial dataset? Any suggestion in Spark/SQL would be fine.
This dataset can be obtained by the sequence below in Spark.
df = sqlCtx.createDataFrame(
    [(1, 'a', 355.27), (4, 'b', 222.98), (5, 'c', 275.00), (3, 'a', 25.00),
     (2, 'c', 753.37)], ('Col1', 'col2', 'cost'))
You can add another column containing the index of each row, then sort on that column after "distinct". Here is an example:
import org.apache.spark.sql.functions._

val df = Seq(1, 4, 4, 5, 2)
  .toDF("a")
  .withColumn("id", monotonically_increasing_id())
df.show()
// +---+---+
// | a| id|
// +---+---+
// | 1| 0|
// | 4| 1|
// | 4| 2|
// | 5| 3|
// | 2| 4|
// +---+---+
df.dropDuplicates("a").sort("id").show()
// +---+---+
// | a| id|
// +---+---+
// | 1| 0|
// | 4| 1|
// | 5| 3|
// | 2| 4|
// +---+---+
Note that to do a distinct on one specific column you can use dropDuplicates; if you want to control which row is kept in case of duplicates, use groupBy, as in the sketch below.
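A rough, untested Java sketch of that groupBy variant for the toy dataframe above, keeping the earliest row (smallest id) per value of a:
import static org.apache.spark.sql.functions.*;

// Keep one row per value of "a" (the one with the smallest id), then restore the original order.
df.groupBy(col("a"))
  .agg(min(col("id")).as("id"))
  .orderBy("id")
  .show();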
Assuming you are trying to remove the duplicates in col2 (as there are none in col1), so that the final result would be:
+----+----+---------------+
|col1|col2| sum|
+----+----+---------------+
| 1| a|3.55520432627E9|
| 4| b| 2.227349172E7|
| 5| c| 219175.0|
+----+----+---------------+
You could add an index column like:
df = df.withColumn("__idx", monotonically_increasing_id());
Then do all the transformations you want, and then drop it, like in:
df = df.dropDuplicates("col2").orderBy("__idx").drop("__idx");
In other words:
Step 1: load the data:
+----+----+---------------+
|col1|col2| sum|
+----+----+---------------+
| 1| a|3.55520432627E9|
| 4| b| 2.227349172E7|
| 5| c| 219175.0|
| 3| a| 219175.0|
| 2| c| 7.534143337E7|
+----+----+---------------+
Step 2: add the index:
+----+----+---------------+-----+
|col1|col2| sum|__idx|
+----+----+---------------+-----+
| 1| a|3.55520432627E9| 0|
| 4| b| 2.227349172E7| 1|
| 5| c| 219175.0| 2|
| 3| a| 219175.0| 3|
| 2| c| 7.534143337E7| 4|
+----+----+---------------+-----+
Step 3: transformations (here remove the dups in col2) and remove the __idx column:
+----+----+---------------+
|col1|col2| sum|
+----+----+---------------+
| 1| a|3.55520432627E9|
| 4| b| 2.227349172E7|
| 5| c| 219175.0|
+----+----+---------------+
The Java code could be:
package net.jgp.books.spark.ch12.lab990_others;

import static org.apache.spark.sql.functions.monotonically_increasing_id;

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

/**
 * Keeping the order of rows during transformations.
 *
 * @author jgp
 */
public class KeepingOrderApp {

  /**
   * main() is your entry point to the application.
   *
   * @param args
   */
  public static void main(String[] args) {
    KeepingOrderApp app = new KeepingOrderApp();
    app.start();
  }

  /**
   * The processing code.
   */
  private void start() {
    // Creates a session on a local master
    SparkSession spark = SparkSession.builder()
        .appName("Splitting a dataframe to collect it")
        .master("local")
        .getOrCreate();

    Dataset<Row> df = createDataframe(spark);
    df.show();

    df = df.withColumn("__idx", monotonically_increasing_id());
    df.show();

    df = df.dropDuplicates("col2").orderBy("__idx").drop("__idx");
    df.show();
  }

  private static Dataset<Row> createDataframe(SparkSession spark) {
    StructType schema = DataTypes.createStructType(new StructField[] {
        DataTypes.createStructField(
            "col1",
            DataTypes.IntegerType,
            false),
        DataTypes.createStructField(
            "col2",
            DataTypes.StringType,
            false),
        DataTypes.createStructField(
            "sum",
            DataTypes.DoubleType,
            false) });

    List<Row> rows = new ArrayList<>();
    rows.add(RowFactory.create(1, "a", 3555204326.27));
    rows.add(RowFactory.create(4, "b", 22273491.72));
    rows.add(RowFactory.create(5, "c", 219175.0));
    rows.add(RowFactory.create(3, "a", 219175.0));
    rows.add(RowFactory.create(2, "c", 75341433.37));

    return spark.createDataFrame(rows, schema);
  }
}
You could add an index column to your DB table and then add an ORDER BY id to your SQL request.
I believe you need to rework your query and use GROUP BY instead of DISTINCT, as this answer suggests: SQL: How to keep rows order with DISTINCT?
Newbie here on Spark... how can I use a column of a Spark dataset as a key to look up values and add those values as a new column to the dataset?
In Python, we have something like:
df.loc[:,'values'] = df.loc[:,'key'].apply(lambda x: D.get(x))
where D is a Python dictionary defined earlier.
How can I do this in Spark using Java? Thank you.
Edit:
For example:
I have the following dataset df:
A
1
3
6
0
8
I want to create a weekday column based on the following dictionary:
D[1] = "Monday"
D[2] = "Tuesday"
D[3] = "Wednesday"
D[4] = "Thursday"
D[5] = "Friday"
D[6] = "Saturday"
D[7] = "Sunday"
and add the column back to my dataset df:
A days
1 Monday
3 Wednesday
6 Saturday
0 Sunday
8 NULL
This is just an example; column A could of course be something other than integers.
You can use df.withColumn to return a new DataFrame with the new column added alongside the previous columns of df.
Create a udf (user-defined function) to apply the dictionary mapping.
Here's a reproducible example:
>>> from pyspark.sql.types import StringType
>>> from pyspark.sql.functions import udf
>>> df = spark.createDataFrame([{'A':1,'B':5},{'A':5,'B':2},{'A':1,'B':3},{'A':5,'B':4}], ['A','B'])
>>> df.show()
+---+---+
| A| B|
+---+---+
| 1| 5|
| 5| 2|
| 1| 3|
| 5| 4|
+---+---+
>>> d = {1:'x', 2:'y', 3:'w', 4:'t', 5:'z'}
>>> mapping_func = lambda x: d.get(x)
>>> df = df.withColumn('values',udf(mapping_func, StringType())("A"))
>>> df.show()
+---+---+------+
| A| B|values|
+---+---+------+
| 1| 5| x|
| 5| 2| z|
| 1| 3| x|
| 5| 4| z|
+---+---+------+