Join Spark datasets using Java

I have 2 datasets which I am trying to combine:
Dataset1 (machine):
String machineID;
List<Integer> machineCat; // e.g. (100, 200, 300)
Dataset2 (car):
String carID;
List<Integer> carCat; // e.g. (30, 200, 100, 300)
I basically need to take each item of machineCat from dataset1 and check whether it is contained in carCat of dataset2. If it matches, combine the 2 datasets as below:
final dataset:
machineID,machineCat(100),carID,carCat(100)
machineID,machineCat(200),carID,carCat(200)
machineID,machineCat(300),carID,carCat(300)
Any help on how to do this using a dataset join in Java would be appreciated.
I am looking at an option with array_contains (something like below):
machine.foreachPartition((ForeachPartitionFunction<Machine>) iterator -> {
    while (iterator.hasNext()) {
        Machine machine = iterator.next();
        machine.getmachineCat().stream().filter(cat -> {
            LOG.info("matched");
            spark.sql(
                "select * from machineDataset m"
                    + " join"
                    + " carDataset c "
                    + "where array_contains(m.machineCat,cat)");
            return true;
        });
    }
});

import static org.apache.spark.sql.functions.*; // before main class

Machine machine = new Machine("m1", Arrays.asList(100, 200, 300));
Car car = new Car("c1", Arrays.asList(30, 200, 100, 300));

Dataset<Row> mDF = spark.createDataFrame(Arrays.asList(machine), Machine.class);
mDF.show();
Dataset<Row> cDF = spark.createDataFrame(Arrays.asList(car), Car.class);
cDF.show();
output:
+---------------+---------+
| machineCat|machineId|
+---------------+---------+
|[100, 200, 300]| m1|
+---------------+---------+
+-------------------+-----+
| carCat|catId|
+-------------------+-----+
|[30, 200, 100, 300]| c1|
+-------------------+-----+
then
Dataset<Row> mDF2 = mDF.select(col("machineId"), explode(col("machineCat")).as("machineCat"));
Dataset<Row> cDF2 = cDF.select(col("catId"), explode(col("carCat")).as("carCat"));
Dataset<Row> joinedDF = mDF2.join(cDF2).where(mDF2.col("machineCat").equalTo(cDF2.col("carCat")));
Dataset<Row> finalDF = joinedDF.select(col("machineId"), array(col("machineCat")), col("catId"), array(col("carCat")));
finalDF.show();
and finally:
+---------+-----------------+-----+-------------+
|machineId|array(machineCat)|catId|array(carCat)|
+---------+-----------------+-----+-------------+
| m1| [100]| c1| [100]|
| m1| [200]| c1| [200]|
| m1| [300]| c1| [300]|
+---------+-----------------+-----+-------------+
root
|-- machineId: string (nullable = true)
|-- array(machineCat): array (nullable = false)
| |-- element: integer (containsNull = true)
|-- catId: string (nullable = true)
|-- array(carCat): array (nullable = false)
| |-- element: integer (containsNull = true)
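For completeness, the array_contains approach mentioned in the question can also be used directly as the join condition. This is only a sketch, assuming the same mDF and cDF built above (and the static import of org.apache.spark.sql.functions): explode the machine side only and keep the full carCat array on the matched rows.
// Sketch, not part of the original answer: join on array membership instead of exploding both sides.
Dataset<Row> mExploded = mDF.select(col("machineId"), explode(col("machineCat")).as("machineCat"));
Dataset<Row> joinedByContains = mExploded.join(cDF, expr("array_contains(carCat, machineCat)"));
joinedByContains.show();
Each machine category found anywhere in carCat produces one row, with carCat kept as the original array rather than a single matched element.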

Related

Create a new dataframe (with different schema) from selected information from another dataframe

I have a dataframe where the tags column contains different key->value pairs. I am trying to filter out the value information where key=name. The filtered information should be put in a new dataframe.
The initial df has the following schema:
root
|-- id: long (nullable = true)
|-- type: string (nullable = true)
|-- tags: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- nds: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- ref: long (nullable = true)
|-- members: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- type: string (nullable = true)
| | |-- ref: long (nullable = true)
| | |-- role: string (nullable = true)
|-- visible: boolean (nullable = true)
And I want a newdf of schema:
root
|-- place: string (nullable = true)
|-- num_evacuees: string (nullable = true)
How should I do the filter? I tried a lot of methods, at least trying to get a normal filter to work, but every time the result is an empty dataframe. For example:
val newdf = df.filter($"tags"("key") contains "name")
val newdf = df.where(places("tags")("key") === "name")
I tried a lot more methods, but none of them has worked.
How should I do the proper filter?
You can achieve the result you want with:
val df = Seq(
  (1L, Map("sf" -> "100")),
  (2L, Map("ny" -> "200"))
).toDF("id", "tags")

val resultDf = df
  .select(explode(map_filter(col("tags"), (k, _) => k === "ny")))
  .withColumnRenamed("key", "place")
  .withColumnRenamed("value", "num_evacuees")

resultDf.printSchema
resultDf.show
Which will show:
root
|-- place: string (nullable = false)
|-- num_evacuees: string (nullable = true)
+-----+------------+
|place|num_evacuees|
+-----+------------+
| ny| 200|
+-----+------------+
The key idea is to use map_filter to select the entries you want from the map; explode then turns the map into two columns (key and value), which you can rename to make the DataFrame match your specification.
The above example assumes you want to get a single value to demonstrate the idea. The lambda function used by map_filter can be as complex as necessary. Its signature map_filter(expr: Column, f: (Column, Column) => Column): Column shows that as long as you return a Column it will be happy.
If you wanted to filter a large number of entries you could do something like:
val resultDf = df
  .withColumn("filterList", array(lit("sf"), lit("place_n"))) // lit() so these are literal key names, not column references
  .select(explode(map_filter(col("tags"), (k, _) => array_contains(col("filterList"), k))))
The idea is to extract the keys of the map column (tags), then use array_contains to check for a key called "name".
import org.apache.spark.sql.functions._
val newdf = df.filter(array_contains(map_keys($"tags"), "name"))

Create new dataframe from selected information from another dataframe

I have a dataframe with the following schema:
root
|-- id: long (nullable = true)
|-- type: string (nullable = true)
|-- tags: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- lat: Long (nullable = true)
|-- lon: Long (nullable = true)
|-- nds: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- ref: long (nullable = true)
|-- members: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- type: string (nullable = true)
| | |-- ref: long (nullable = true)
| | |-- role: string (nullable = true)
I want to create a new dataframe res where I select specific data from the column tags. I need the values from key=place and key=population. The new dataframe should have the following schema:
val schema = StructType(
  Array(
    StructField("place", StringType),
    StructField("population", LongType)
  )
)
I have literally no idea how to do this. I tried to replicate the first dataframe and then select the columns, but that didn't work.
Does anyone have a solution?
You can directly apply the desired key on a column of type map to extract its value, then cast and rename the column as you wish, as follows:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.LongType

val result = dataframe.select(
  col("tags")("place").as("place"),
  col("tags")("population").cast(LongType).as("population")
)
With the following tags column:
+------------------------------------------------+
|tags |
+------------------------------------------------+
|{place -> A, population -> 32, another_key -> X}|
|{place -> B, population -> 64, another_key -> Y}|
+------------------------------------------------+
You get the following result:
+-----+----------+
|place|population|
+-----+----------+
|A |32 |
|B |64 |
+-----+----------+
Having the following schema:
root
|-- place: string (nullable = true)
|-- population: long (nullable = true)
Let's call your original dataframe df. You can extract the information you want like this:
import org.apache.spark.sql.functions.col

val data = df
  .select("tags")
  .where(
    df("tags")("key") isin (List("place", "population"): _*)
  )
  .select(
    col("tags")("value")
  )
  .collect()
  .toList
This will give you a List[Row] which can be converted to another dataframe with your schema
import org.apache.spark.sql.Row
import scala.collection.JavaConversions.seqAsJavaList

sparkSession.createDataFrame(seqAsJavaList[Row](data), schema)
Given the following simplified input:
val df = Seq(
  (1L, Map("place" -> "home", "population" -> "1", "name" -> "foo")),
  (2L, Map("place" -> "home", "population" -> "4", "name" -> "foo")),
  (3L, Map("population" -> "3")),
  (4L, Map.empty[String, String])
).toDF("id", "tags")
You want to select the values using map_filter to restrict the map to only the key you want, then call map_values to get those entries. map_values returns an array, so you need to use explode_outer to flatten the data. We use explode_outer here because some entries may have neither place nor population, or only one of the two. Once the data is in a form we can easily work with, we just select the fields we want in the desired structure.
I've left the id column in so when you run the example you can see that we don't drop entries with missing data.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.LongType

val r = df.select(
    col("id"),
    explode_outer(map_values(map_filter(col("tags"), (k, _) => k === "place"))) as "place",
    map_values(map_filter(col("tags"), (k, _) => k === "population")) as "population"
  )
  .withColumn("population", explode_outer(col("population")))
  .select(
    col("id"),
    array(
      struct(
        col("place"),
        col("population") cast LongType as "population"
      ) as "place_and_population"
    ) as "data"
  )
Gives:
root
|-- id: long (nullable = false)
|-- data: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- place: string (nullable = true)
| | |-- population: long (nullable = true)
+---+--------------+
| id| data|
+---+--------------+
| 1| [{home, 1}]|
| 2| [{home, 4}]|
| 3| [{null, 3}]|
| 4|[{null, null}]|
+---+--------------+

A dataset is loaded from a CSV file with a schema declaring the fields as not nullable; printSchema() then describes the fields as nullable

I load a csv file with Apache Spark.
Dataset<Row> csv = session.read().schema(schema()).format("csv")
    .option("header", "true").option("delimiter", ";")
    .load("myFile.csv").selectExpr("*");
For it, I provided a schema:
public StructType schema(boolean renamed) {
    StructType schema = new StructType();
    schema = schema.add("CODGEO", StringType, false)
        .add("P16_POP1564", DoubleType, false)
        .add("P16_POP1524", DoubleType, false)
        .add("P16_POP2554", DoubleType, false)
        .add("P16_POP5564", DoubleType, false)
        .add("P16_H1564", DoubleType, false)
        ....
    return schema;
}
The dataset is loaded. A printSchema() on it displays this on the console:
root
|-- CODGEO: string (nullable = true)
|-- P16_POP1564: double (nullable = true)
|-- P16_POP1524: double (nullable = true)
|-- P16_POP2554: double (nullable = true)
|-- P16_POP5564: double (nullable = true)
|-- P16_H1564: double (nullable = true)
...
But every field is tagged as nullable = true, even though I explicitly asked for each of them to be not nullable.
What's the problem?
For me, it is working well.
By default, an empty string ("") is considered null while reading CSV.
Test 1: Dataset having null & schema nullable=false
String data = "id Col_1 Col_2 Col_3 Col_4 Col_5\n" +
    "1 A B C D E\n" +
    "2 X Y Z P \"\"";

List<String> list = Arrays.stream(data.split(System.lineSeparator()))
    .map(s -> s.replaceAll("\\s+", ","))
    .collect(Collectors.toList());

List<StructField> fields = Arrays.stream("id Col_1 Col_2 Col_3 Col_4 Col_5".split("\\s+"))
    .map(s -> new StructField(s, DataTypes.StringType, false, Metadata.empty()))
    .collect(Collectors.toList());

Dataset<Row> df1 = spark.read()
    .schema(new StructType(fields.toArray(new StructField[fields.size()])))
    .option("header", true)
    .option("sep", ",")
    .csv(spark.createDataset(list, Encoders.STRING()));
df1.show();
df1.printSchema();
Output-
java.lang.NullPointerException
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.createExternalRow_0_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.apply(Unknown Source)
at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collectFromPlan$1.apply(Dataset.scala:3387)
at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collectFromPlan$1.apply(Dataset.scala:3384)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
Conclusion- Expected behaviour PASS
Test 2: Dataset having null & schema nullable=true
String data = "id Col_1 Col_2 Col_3 Col_4 Col_5\n" +
    "1 A B C D E\n" +
    "2 X Y Z P \"\"";

List<String> list = Arrays.stream(data.split(System.lineSeparator()))
    .map(s -> s.replaceAll("\\s+", ","))
    .collect(Collectors.toList());

List<StructField> fields = Arrays.stream("id Col_1 Col_2 Col_3 Col_4 Col_5".split("\\s+"))
    .map(s -> new StructField(s, DataTypes.StringType, true, Metadata.empty()))
    .collect(Collectors.toList());

Dataset<Row> df1 = spark.read()
    .schema(new StructType(fields.toArray(new StructField[fields.size()])))
    .option("header", true)
    .option("sep", ",")
    .csv(spark.createDataset(list, Encoders.STRING()));
df1.show();
df1.printSchema();
Output-
+---+-----+-----+-----+-----+-----+
| id|Col_1|Col_2|Col_3|Col_4|Col_5|
+---+-----+-----+-----+-----+-----+
| 1| A| B| C| D| E|
| 2| X| Y| Z| P| null|
+---+-----+-----+-----+-----+-----+
root
|-- id: string (nullable = true)
|-- Col_1: string (nullable = true)
|-- Col_2: string (nullable = true)
|-- Col_3: string (nullable = true)
|-- Col_4: string (nullable = true)
|-- Col_5: string (nullable = true)
Conclusion- Expected behaviour PASS
Test 3: Dataset without null & schema nullable=true
String data1 = "id Col_1 Col_2 Col_3 Col_4 Col_5\n" +
    "1 A B C D E\n" +
    "2 X Y Z P F";

List<String> list1 = Arrays.stream(data1.split(System.lineSeparator()))
    .map(s -> s.replaceAll("\\s+", ","))
    .collect(Collectors.toList());

List<StructField> fields1 = Arrays.stream("id Col_1 Col_2 Col_3 Col_4 Col_5".split("\\s+"))
    .map(s -> new StructField(s, DataTypes.StringType, true, Metadata.empty()))
    .collect(Collectors.toList());

Dataset<Row> df2 = spark.read()
    .schema(new StructType(fields1.toArray(new StructField[fields1.size()])))
    .option("header", true)
    .option("sep", ",")
    .csv(spark.createDataset(list1, Encoders.STRING()));
df2.show();
df2.printSchema();
Output-
+---+-----+-----+-----+-----+-----+
| id|Col_1|Col_2|Col_3|Col_4|Col_5|
+---+-----+-----+-----+-----+-----+
| 1| A| B| C| D| E|
| 2| X| Y| Z| P| F|
+---+-----+-----+-----+-----+-----+
root
|-- id: string (nullable = true)
|-- Col_1: string (nullable = true)
|-- Col_2: string (nullable = true)
|-- Col_3: string (nullable = true)
|-- Col_4: string (nullable = true)
|-- Col_5: string (nullable = true)
Conclusion- Expected behaviour PASS
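To double-check which nullability flags the reader actually recorded (the same flags printSchema() reports), you can inspect the loaded schema programmatically. A small sketch, assuming the df2 dataset from the last test:
import org.apache.spark.sql.types.StructField;

// Print each field name together with the nullable flag Spark actually applied.
for (StructField field : df2.schema().fields()) {
    System.out.println(field.name() + " nullable=" + field.nullable());
}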

How to recursively call FlatMapFunction<Row,Row>?

I have a Dataset df, read using spark.read().json
Its schema is something like the following:
root
|-- items: struct (nullable = true)
| |-- item: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- batch-id: long (nullable = true)
| | | |-- id: string (nullable = true)
| | | |-- name: string (nullable = true)
| | | |-- type: string (nullable = true)
I want to use a FlatMapFunction to get a dataset with the inner schema (id, name, type).
I want to do something like the following:
df.flatMap(mapperFunction(), RowEncoder.apply(someSchema));
public static FlatMapFunction<Row, Row> mapperFunction() {
    return row -> {
        Row r1 = row.getAs("items");
        List<Row> r2 = r1.getList(0); // This will explode the column
        StructType schema = r2.get(0).schema();
        // I know List doesn't have a flatMap method; I want to know what can be done here
        return r2.flatMap(mapperFunction(), RowEncoder.apply(schema));
    };
}
There are several options:
Option 1: use explode
The easiest way to flatten the data structure would be to use explode instead of a flatMap call:
Starting with the data
{"items": {"item": [{"batch-id":1,"id":"id1","name":"name1","type":"type1"},{"batch-id":2,"id":"id2","name":"name2","type":"type2"}]}}
the code
df.withColumn("exploded", explode(col("items.item"))).select("exploded.*").show();
prints
+--------+---+-----+-----+
|batch_id| id| name| type|
+--------+---+-----+-----+
| 1|id1|name1|type1|
| 2|id2|name2|type2|
+--------+---+-----+-----+
Option 2: use flatMap
If the flatMap call is required (for example to add more logic into the mapping), this code prints the same result:
df.flatMap(mapperFunction(), Encoders.bean(Data.class)).show();
with the mapping function
private static FlatMapFunction<Row, Data> mapperFunction() {
    return row -> {
        Row r1 = row.getAs("items");
        List<Row> r2 = r1.getList(0); // This will explode the column
        return r2.stream().map(entry -> {
            Data d = new Data();
            d.setBatch_id(entry.getLong(0));
            d.setId(entry.getString(1));
            d.setName(entry.getString(2));
            d.setType(entry.getString(3));
            return d;
        }).iterator();
    };
}
and the Data bean
public static class Data implements Serializable {
    private long batch_id;
    private String id;
    private String name;
    private String type;
    // getters and setters
}
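If you would rather keep Row objects and a RowEncoder, as in the question, here is a minimal sketch along the same lines; deriving innerSchema via a throwaway explode/select is my assumption, not part of the original answer:
import static org.apache.spark.sql.functions.*;
import java.util.List;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.catalyst.encoders.RowEncoder;
import org.apache.spark.sql.types.StructType;

// Derive the schema of the inner "item" elements once, then reuse it for the encoder.
StructType innerSchema = df.select(explode(col("items.item")).as("e")).select("e.*").schema();

Dataset<Row> flattened = df.flatMap((FlatMapFunction<Row, Row>) row -> {
    Row items = row.getAs("items");
    List<Row> elements = items.getList(0); // the "item" array
    return elements.iterator();            // each element Row already matches innerSchema
}, RowEncoder.apply(innerSchema));
flattened.show();
The result is the same flattened rows as above, only as a Dataset<Row> instead of a Dataset<Data>.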

How to create a Dataframe from existing Dataframe and make specific fields as Struct type?

I need to create a DataFrame from an existing DataFrame, in which I need to change the schema as well.
I have a DataFrame like:
+-----------+----------+-------------+
|Id         |Position  |playerName   |
+-----------+----------+-------------+
|10125      |Forward   |Messi        |
|10126      |Forward   |Ronaldo      |
|10127      |Midfield  |Xavi         |
|10128      |Midfield  |Neymar       |
+-----------+----------+-------------+
and I created this using the case class given below:
case class caseClass (
  Id: Int = 0,
  Position: String = "",
  playerName: String = ""
)
Now I need to make both playerName and Position fall under a struct type.
i.e., I need to create another DataFrame with the schema:
root
|-- Id: int (nullable = true)
|-- playerDetails: struct (nullable = true)
| |-- playerName: string (nullable = true)
| |-- Position: string (nullable = true)
I wrote the following code to create a new dataframe by referring to the link
https://medium.com/#mrpowers/adding-structtype-columns-to-spark-dataframes-b44125409803
myschema was:
List(
  StructField("Id", IntegerType, true),
  StructField("Position", StringType, true),
  StructField("playerName", StringType, true)
)
I tried the following code:
spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  myschema
)
but I can't make it happen.
I saw the similar question
Change schema of existing dataframe, but I can't understand the solution.
Is there any solution to directly implement StructType inside the case class, so that I don't need to make my own schema for creating struct-type values?
Function "struct" can be used:
// data
val playersDF = Seq(
  (10125, "Forward", "Messi"),
  (10126, "Forward", "Ronaldo"),
  (10127, "Midfield", "Xavi"),
  (10128, "Midfield", "Neymar")
).toDF("Id", "Position", "playerName")

// action
val playersStructuredDF = playersDF.select($"Id", struct("playerName", "Position").as("playerDetails"))

// display
playersStructuredDF.printSchema()
playersStructuredDF.show(false)
Output:
root
|-- Id: integer (nullable = false)
|-- playerDetails: struct (nullable = false)
| |-- playerName: string (nullable = true)
| |-- Position: string (nullable = true)
+-----+------------------+
|Id |playerDetails |
+-----+------------------+
|10125|[Messi, Forward] |
|10126|[Ronaldo, Forward]|
|10127|[Xavi, Midfield] |
|10128|[Neymar, Midfield]|
+-----+------------------+
