I have data in a JavaPairRDD in the format
JavaPairRDD<Tuple2<String, Tuple2<String,String>>>
I tried using the code below:
Encoder<Tuple2<String, Tuple2<String,String>>> encoder2 =
Encoders.tuple(Encoders.STRING(), Encoders.tuple(Encoders.STRING(),Encoders.STRING()));
Dataset<Row> userViolationsDetails = spark.createDataset(JavaPairRDD.toRDD(MY_RDD),encoder2).toDF("value1","value2");
But how do I generate a Dataset with 3 columns? The output of the above code gives me data in only 2 columns. Any pointers / suggestions?
Try running printSchema - you will see that value2 is a complex type.
With that information, you can write:
Dataset<Row> uvd = userViolationsDetails.selectExpr("value1", "value2._1 as value2", "value2._2 as value3");
value2._1 means the first element of the tuple inside the current "value2" field. We overwrite the value2 field so that it holds only one value.
Note that this will work after https://issues.apache.org/jira/browse/SPARK-24548 is merged to the master branch. Currently there is a bug in Spark where the tuple is converted to a struct with two fields both named value.
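Putting the pieces together, here is a minimal Java sketch of the full flow (reusing MY_RDD, encoder2 and the column names from the question; the _1/_2 field names assume the SPARK-24548 fix mentioned above):

Dataset<Row> userViolationsDetails =
    spark.createDataset(JavaPairRDD.toRDD(MY_RDD), encoder2).toDF("value1", "value2");

userViolationsDetails.printSchema(); // value2 shows up as a struct (complex type)

// Flatten the nested tuple into three top-level columns
Dataset<Row> uvd = userViolationsDetails.selectExpr(
    "value1",
    "value2._1 as value2",
    "value2._2 as value3");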
My CSV file data: one column is HeaderText (6 rows) and the other is accountBtn (4 rows):
accountBtn,HeaderText
New Case,Type
New Note,Phone
New Contact,Website
,Account Owner
,Account Site
,Industry
When I read the file with the code below:
* def csvData = read('../TestData/Button.csv')
* def expectedButton = karate.jsonPath(csvData,"$..accountBtn")
* def eHeaderTest = karate.jsonPath(csvData,"$..HeaderText")
the data set generated by the code is: ["New Case","New Note","New Contact","","",""]
My expected data set is: ["New Case","New Note","New Contact"]
Any idea how this can be handled?
That's how it is in Karate, and it shouldn't be a concern since you are just using it as data to drive a test. You can run a transform to convert empty strings to null if required: https://stackoverflow.com/a/56581365/143475
Otherwise, please consider contributing code to make Karate better!
The other option is to use JSON as a data-source instead of CSV: https://stackoverflow.com/a/47272108/143475
I have a column whose type Value is defined like below:
val Value: ArrayType = ArrayType(
new StructType()
.add("unit", StringType)
.add("value", StringType)
)
and data like this
[[unit1, 25], [unit2, 77]]
[[unit2, 100], [unit1, 40]]
[[unit2, 88]]
[[unit1, 33]]
I know Spark SQL can use functions.explode to turn the data into multiple rows, but what I want is to explode it into multiple columns (or to keep the one column but with 2 items for the rows that have only 1 item),
so the end result looks like below:
unit1 unit2
25 77
40 100
value1 88
33 value2
How could I achieve this?
Addition after the initial post and update:
I want to get result like this (this is more like my final goal).
transformed-column
[[unit1, 25], [unit2, 77]]
[[unit2, 104], [unit1, 40]]
[[unit1, value1], [unit2, 88]]
[[unit1, 33],[unit2,value2]]
where value1 is the result of applying some kind of map/conversion function to [unit2, 88];
similarly, value2 is the result of applying the same map/conversion function to [unit1, 33].
I solved this problem using map_from_entries as suggested by @jxc, and then used a UDF to convert the map of 1 item into a map of 2 items, using business logic to convert between the 2 units.
One thing to note is that the map returned by map_from_entries is a Scala map, so if you use Java you need to make sure the UDF method takes a Scala map.
P.S. Maybe I did not have to use map_from_entries; instead, maybe I could have made the UDF take an array of StructType.
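For reference, a minimal Java sketch of the map_from_entries step (requires Spark 2.4+; df stands for the DataFrame holding the Value column, and the business-logic conversion UDF is left out since it depends on your units):

import static org.apache.spark.sql.functions.*;

// Turn the array of {unit, value} structs into a map keyed by unit
Dataset<Row> withMap = df.withColumn("unit_map", map_from_entries(col("Value")));

// Pull individual units out as top-level columns (null where a unit is missing)
Dataset<Row> byUnit = withMap.select(
    col("unit_map").getItem("unit1").alias("unit1"),
    col("unit_map").getItem("unit2").alias("unit2"));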
I am trying to append a dataset to an empty dataset in a loop,
but the resultant dataset is always empty.
I tried executing just Line 1 (commented in the code) to eliminate the intermediate variable from the loop, but I still got an empty failedRows dataset.
Dataset<Row> failedRows = sparkSession.createDataFrame(new ArrayList<>(), itemsDS.schema());
failedRows.count();
Dataset<Row> filteredDs;
for (String tagName : mandatoryTagsList) {
    //failedRows.union(itemsDS.filter(functions.col(tagName).isNull())); //Line 1
    filteredDs = itemsDS.filter(functions.col(tagName).isNull());
    if (filteredDs.count() > 0) {
        failedRows.union(filteredDs); //Line 2
        failedRows.count();
    }
}
Does anybody know why exactly the union is not generating the desired results?
You need to save the result to a new variable each time.
A Dataset, like all distributed collections in Spark, is immutable.
failedRows = failedRows.union(filteredDs);//Line 2
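For completeness, a hedged sketch of the loop from the question with that reassignment applied (same variable names as above):

Dataset<Row> failedRows = sparkSession.createDataFrame(new ArrayList<>(), itemsDS.schema());
for (String tagName : mandatoryTagsList) {
    Dataset<Row> filteredDs = itemsDS.filter(functions.col(tagName).isNull());
    // union returns a new Dataset; keep the result, otherwise failedRows never grows
    failedRows = failedRows.union(filteredDs);
}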
My input was a Kafka stream with only one value, which is comma-separated. It looks like this:
"id,country,timestamp"
I already split the dataset so that I have something like the following structured stream:
Dataset<Row> words = df
    .selectExpr("CAST(value AS STRING)")
    .as(Encoders.STRING())
    .withColumn("id", split(col("value"), ",").getItem(0))
    .withColumn("country", split(col("value"), ",").getItem(1))
    .withColumn("timestamp", split(col("value"), ",").getItem(2));
+----+---------+----------+
|id |country |timestamp |
+----+---------+----------+
|2922|de |1231231232|
|4195|de |1231232424|
|6796|fr |1232412323|
+----+---------+----------+
Now I have a dataset with 3 columns, and I want to use the entries of each row in a custom function, e.g.:
Dataset<String> names = words.map(row -> {
    // do something with every entry of each row, e.g.
    Person person = new Person(id, country, timestamp);
    String name = person.getName();
    return name;
});
In the end I want to sink out a comma-separated String again.
A DataFrame has a schema, so you can't just call a map function on it without defining a new schema.
You can either convert to an RDD and use a map, or use a DataFrame map with an encoder.
Another option, I think, is to use Spark SQL with user-defined functions; you can read about it.
If your use case is really as simple as you are showing, doing something like:
var nameRdd = words.rdd.map(x => {f(x)})
which seems like all you need.
If you still want a DataFrame you can use something like:
val schema = StructType(Seq[StructField](StructField(dataType = StringType, name = s"name")))
val rddToDf = nameRdd.map(name => Row.apply(name))
val df = sparkSession.createDataFrame(rddToDf, schema)
P.S. DataFrame === Dataset[Row]
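Since the question is in Java, here is a hedged sketch of the "Dataset map with encoder" route (Person is the hypothetical class from the question; the key pieces are MapFunction and Encoders.STRING()):

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Encoders;

Dataset<String> names = words.map(
    (MapFunction<Row, String>) row -> {
        String id = row.getAs("id");
        String country = row.getAs("country");
        String timestamp = row.getAs("timestamp");
        Person person = new Person(id, country, timestamp); // hypothetical class from the question
        return person.getName();
    },
    Encoders.STRING());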
If you have a custom function that is not available by composing functions in the existing Spark API[1], then you can either drop down to the RDD level (as @Ilya suggested), or use a UDF[2].
Typically I'll try to use the Spark API functions on a DataFrame whenever possible, as they will generally be the best optimized.
If that's not possible I will construct a UDF:
import org.apache.spark.sql.functions.{col, udf}
val squared = udf((s: Long) => s * s)
display(spark.range(1, 20).select(squared(col("id")) as "id_squared"))
In your case you need to pass multiple columns to your UDF; you can pass them comma-separated, e.g. squared(col("col_a"), col("col_b")).
Since you are writing your UDF in Scala it should be pretty efficient, but keep in mind that if you use Python there will in general be extra latency due to data movement between the JVM and Python.
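If you prefer to stay in Java, a hedged sketch of a multi-column UDF along those lines (the name toName and its body are placeholders, spark being the SparkSession; UDF3, callUDF and DataTypes are standard Spark Java API):

import org.apache.spark.sql.api.java.UDF3;
import org.apache.spark.sql.types.DataTypes;
import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

spark.udf().register("toName",
    (UDF3<String, String, String, String>) (id, country, timestamp) ->
        id + "," + country + "," + timestamp, // placeholder for your real logic
    DataTypes.StringType);

Dataset<Row> result = words.select(
    callUDF("toName", col("id"), col("country"), col("timestamp")).as("name"));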
[1] https://spark.apache.org/docs/latest/api/scala/index.html#package
[2] https://docs.databricks.com/spark/latest/spark-sql/udf-scala.html
I use Spark 2.1.1.
I have the following Dataset<Row> ds1:
name | ratio | count // column names
"hello" | 1.56 | 34
(ds1.isStreaming gives true)
and I am trying to generate Dataset<String> ds2. In other words, when I write to a Kafka sink I want to write something like this:
{"name": "hello", "ratio": 1.56, "count": 34}
I have tried something like df2.toJSON().writeStream().foreach(new KafkaSink()).start(), but then it gives the following error:
Queries with streaming sources must be executed with writeStream.start()
There are to_json and json_tuple, however I am not sure how to leverage them here.
I tried the following using the json_tuple() function:
Dataset<String> df4 = df3.select(json_tuple(new Column("result"), " name", "ratio", "count")).as(Encoders.STRING());
and I get the following error:
cannot resolve 'result' given input columns: [name, ratio, count];;
tl;dr Use the struct function followed by to_json (as toJSON was broken for streaming datasets due to SPARK-17029, which got fixed just 20 days ago).
Quoting the scaladoc of struct:
struct(colName: String, colNames: String*): Column Creates a new struct column that composes multiple input columns.
Given that you use the Java API, you have 4 different variants of the struct function, too:
public static Column struct(Column... cols) Creates a new struct column.
With the to_json function your case is covered:
public static Column to_json(Column e) Converts a column containing a StructType into a JSON string with the specified schema.
The following is a Scala code (translating it to Java is your home exercise):
val ds1 = Seq(("hello", 1.56, 34)).toDF("name", "ratio", "count")
val recordCol = to_json(struct("name", "ratio", "count")) as "record"
scala> ds1.select(recordCol).show(truncate = false)
+----------------------------------------+
|record |
+----------------------------------------+
|{"name":"hello","ratio":1.56,"count":34}|
+----------------------------------------+
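For the Java side of that home exercise, a hedged sketch of the same select using the Java API (column names as in the question; the single JSON string column can then be used as the Kafka value):

import org.apache.spark.sql.Encoders;
import static org.apache.spark.sql.functions.*;

Dataset<String> ds2 = ds1
    .select(to_json(struct(col("name"), col("ratio"), col("count"))).as("value"))
    .as(Encoders.STRING()); // one JSON string per row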
I've also given your solution a try (with Spark 2.3.0-SNAPSHOT built today) and it seems to work perfectly.
val fromKafka = spark.
readStream.
format("kafka").
option("subscribe", "topic1").
option("kafka.bootstrap.servers", "localhost:9092").
load.
select('value cast "string")
fromKafka.
toJSON. // <-- JSON conversion
writeStream.
format("console"). // using console sink
start
format("kafka") was added in SPARK-19719 and is not available in 2.1.0.