I am trying to append a dataset to an empty dataset in a loop, but the resulting dataset is always empty.
I tried to eliminate the intermediate variable from the loop by executing just the commented Line 1, but I still got an empty failedRows dataset.
Dataset<Row> failedRows = sparkSession.createDataFrame(new ArrayList<>(), itemsDS.schema());
failedRows.count();
Dataset<Row> filteredDs;
for (String tagName : mandatoryTagsList) {
    // failedRows.union(itemsDS.filter(functions.col(tagName).isNull())); // Line 1
    filteredDs = itemsDS.filter(functions.col(tagName).isNull());
    if (filteredDs.count() > 0) {
        failedRows.union(filteredDs); // Line 2
        failedRows.count();
    }
}
Does anybody know why the union is not generating the desired results?
You need to assign the result back to a variable each time. Datasets, like all distributed collections in Spark, are immutable, so union returns a new Dataset instead of modifying the one it is called on.
failedRows = failedRows.union(filteredDs); // Line 2
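For completeness, a minimal sketch of the corrected loop, using the same names as in the question:
Dataset<Row> failedRows = sparkSession.createDataFrame(new ArrayList<>(), itemsDS.schema());

for (String tagName : mandatoryTagsList) {
    Dataset<Row> filteredDs = itemsDS.filter(functions.col(tagName).isNull());
    // Reassign: union returns a new Dataset and does not modify failedRows in place.
    failedRows = failedRows.union(filteredDs);
}
The if (filteredDs.count() > 0) guard can also be dropped, since unioning an empty Dataset adds no rows and the extra count() just forces an unnecessary job.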
My CSV file data: one column is HeaderText (6 rows) and the other is accountBtn (4 rows):
accountBtn,HeaderText
New Case,Type
New Note,Phone
New Contact,Website
,Account Owner
,Account Site
,Industry
When I'm reading the file with the below code:
* def csvData = read('../TestData/Button.csv')
* def expectedButton = karate.jsonPath(csvData,"$..accountBtn")
* def eHeaderTest = karate.jsonPath(csvData,"$..HeaderText")
The data set generated by the code is: ["New Case","New Note","New Contact","","",""]
My expected data set is: ["New Case","New Note","New Contact"]
Any idea how this can be handled?
That's how it is in Karate and it shouldn't be a concern since you are just using it as data to drive a test. You can run a transform to convert empty strings to null if required: https://stackoverflow.com/a/56581365/143475
Else, please consider contributing code to make Karate better!
The other option is to use JSON as a data-source instead of CSV: https://stackoverflow.com/a/47272108/143475
My input is a Kafka stream with only one value, which is comma-separated. It looks like this.
"id,country,timestamp"
I already split the dataset so that I have something like the following structured stream:
Dataset<Row> words = df
.selectExpr("CAST (value AS STRING)")
.as(Encoders.STRING())
.withColumn("id", split(col("value"), ",").getItem(0))
.withColumn("country", split(col("value"), ",").getItem(1))
.withColumn("timestamp", split(col("value"), ",").getItem(2));
+----+---------+----------+
|id |country |timestamp |
+----+---------+----------+
|2922|de |1231231232|
|4195|de |1231232424|
|6796|fr |1232412323|
+----+---------+----------+
Now I have a dataset with 3 columns, and I want to use the entries of each row in a custom function, e.g.:
Dataset<String> names = words.map(row -> {
    // do something with every entry of each row, e.g.
    Person person = new Person(id, country, timestamp);
    String name = person.getName();
    return name;
});
In the end I want to sink out a comma-separated String again.
A DataFrame has a schema, so you can't just call a map function on it without defining a new schema.
You can either convert to an RDD and use a map, or use a DataFrame/Dataset map with an encoder.
Another option is to use Spark SQL with user-defined functions; you can read about those.
If your use case is really as simple as you are showing, doing something like:
var nameRdd = words.rdd.map(x => { f(x) })
seems like all you need.
If you still want a DataFrame, you can use something like:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val schema = StructType(Seq(StructField("name", StringType)))
val rddToDf = nameRdd.map(name => Row(name))
val df = sparkSession.createDataFrame(rddToDf, schema)
P.S. DataFrame === Dataset[Row]
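Since the question is in Java, a minimal sketch of the "Dataset map with an encoder" option could look like this (Person is the questioner's own class and is assumed to have this constructor and a getName() method):
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

// Map each Row to a String, supplying an explicit encoder for the result type.
Dataset<String> names = words.map(
    (MapFunction<Row, String>) row -> {
        Person person = new Person(
            row.getString(row.fieldIndex("id")),
            row.getString(row.fieldIndex("country")),
            row.getString(row.fieldIndex("timestamp")));
        return person.getName();
    },
    Encoders.STRING());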
If you have a custom function that is not available by composing functions in the existing Spark API[1], then you can either drop down to the RDD level (as @Ilya suggested) or use a UDF[2].
Typically I'll try to use the Spark API functions on a DataFrame whenever possible, as they generally will be the best optimized.
If that's not possible, I will construct a UDF:
import org.apache.spark.sql.functions.{col, udf}
val squared = udf((s: Long) => s * s)
display(spark.range(1, 20).select(squared(col("id")) as "id_squared"))
In your case you need to pass multiple columns to your UDF; you can pass them comma-separated, e.g. squared(col("col_a"), col("col_b")).
Since you are writing your UDF in Scala it should be pretty efficient, but keep in mind that if you use Python, there will in general be extra latency due to data movement between the JVM and Python.
[1]https://spark.apache.org/docs/latest/api/scala/index.html#package
[2]https://docs.databricks.com/spark/latest/spark-sql/udf-scala.html
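For the questioner's Java code, a multi-column UDF could be sketched roughly as follows (the UDF name buildName and the Person class are illustrative assumptions, not part of the Spark API):
import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.api.java.UDF3;
import org.apache.spark.sql.types.DataTypes;

// Register a UDF that takes the three string columns and returns a string.
spark.udf().register("buildName",
    (UDF3<String, String, String, String>) (id, country, timestamp) ->
        new Person(id, country, timestamp).getName(),
    DataTypes.StringType);

// Apply it by passing the columns comma-separated, as described above.
Dataset<Row> names = words.select(
    callUDF("buildName", col("id"), col("country"), col("timestamp")).as("name"));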
I have data in a JavaPairRDD in the format
JavaPairRDD<String, Tuple2<String, String>>
I tried using the below code:
Encoder<Tuple2<String, Tuple2<String,String>>> encoder2 =
Encoders.tuple(Encoders.STRING(), Encoders.tuple(Encoders.STRING(),Encoders.STRING()));
Dataset<Row> userViolationsDetails = spark.createDataset(JavaPairRDD.toRDD(MY_RDD),encoder2).toDF("value1","value2");
But how do I generate a Dataset with 3 columns? The output of the above code gives me data in 2 columns. Any pointers/suggestions?
Try running printSchema - you will see that value2 is a complex type.
Having that information, you can write:
Dataset<Row> uvd = userViolationsDetails.selectExpr("value1", "value2._1 as value2", "value2._2 as value3");
value2._1 means the first element of the tuple inside the current "value2" field. We overwrite the value2 field so that it holds a single value only.
Note that this will work only after https://issues.apache.org/jira/browse/SPARK-24548 is merged to the master branch. Currently there is a bug in Spark where the tuple is converted to a struct with two fields both named value.
I am using Spark version 2.11 and I am doing only 3 basic operations in my application:
taking records from the database: 2.2 million
checking which records from a file (5,000) are present in the database (2.2 million) using contains
writing matched records to a CSV file
But these 3 operations take almost 20 minutes. If I do the same operations in SQL, it takes less than 1 minute.
I started using Spark because it should yield results very fast, but it is taking too much time. How can I improve performance?
Step 1: taking records from the database.
Properties connectionProperties = new Properties();
connectionProperties.put("user", "test");
connectionProperties.put("password", "test##");
String query = "(SELECT * from items)";
dataFileContent = spark.read().jdbc("jdbc:oracle:thin:@//172.20.0.11/devad", query, connectionProperties);
Step 2: checking which records of file A (5k) are present in file B (2M) using contains
Dataset<Row> NewSet=source.join(target,target.col("ItemIDTarget").contains(source.col("ItemIDSource")),"inner");
Step 3: writing matched records to a CSV file
NewSet.repartition(1).select("*")
.write().format("com.databricks.spark.csv")
.option("delimiter", ",")
.option("header", "true")
.option("treatEmptyValuesAsNulls", "true")
.option("nullValue", "")
.save(fileAbsolutePath);
To improve the performance I have tried several things:
caching,
data serialization
    set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"),
shuffle tuning
    sqlContext.setConf("spark.sql.shuffle.partitions", "10"),
data structure tuning
    -XX:+UseCompressedOops,
but none of these approaches has yielded better performance.
Increasing performance is mostly about improving parallelism.
Parallelism depends on the number of partitions in the RDD.
Make sure your Dataset/DataFrame/RDD has neither too many nor too few partitions.
Please check the suggestions below on where you can improve your code. I'm more comfortable with Scala, so I am providing some of the suggestions in Scala.
Step 1:
Make sure you have control over the connections you make to the database by specifying numPartitions.
Number of connections = number of partitions.
Below I just assigned 10 to num_partitions; you will have to tune this value to get more performance.
int num_partitions = 10;

Properties connectionProperties = new Properties();
connectionProperties.put("user", "test");
connectionProperties.put("password", "test##");
connectionProperties.put("partitionColumn", "hash_code");

String query = "(SELECT mod(A.id, " + num_partitions + ") as hash_code, A.* from items A)";

dataFileContent = spark.read()
    .jdbc("jdbc:oracle:thin:@//172.20.0.11/devad",
          query,           // dbtable
          "hash_code",     // columnName
          0,               // lowerBound
          num_partitions,  // upperBound
          num_partitions,  // numPartitions
          connectionProperties);
You can check the Spark JDBC documentation for how numPartitions works.
Step 2:
Dataset<Row> NewSet = source.join(target,
target.col("ItemIDTarget").contains(source.col("ItemIDSource")),
"inner");
Since one of the tables/dataframes has only 5k records (a small amount of data), you can use a broadcast join as mentioned below.
import org.apache.spark.sql.functions.broadcast
val joined_df = largeTableDF.join(broadcast(smallTableDF), "key")
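In the questioner's Java code, assuming source is the small (5k) side, the same idea could be sketched roughly like this:
import static org.apache.spark.sql.functions.broadcast;

// Broadcast the small side so the join avoids shuffling the 2.2M-row table.
Dataset<Row> newSet = target.join(
    broadcast(source),
    target.col("ItemIDTarget").contains(source.col("ItemIDSource")),
    "inner");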
Step 3:
Use coalesce to decrease the number of partitions, since it avoids a full shuffle.
NewSet.coalesce(1).select("*")
.write().format("com.databricks.spark.csv")
.option("delimiter", ",")
.option("header", "true")
.option("treatEmptyValuesAsNulls", "true")
.option("nullValue", "")
.save(fileAbsolutePath);
Hope my answer helps you.
I'm using Apache Spark 2 to tokenize some text.
Dataset<Row> regexTokenized = regexTokenizer.transform(data);
It returns an array of strings.
Dataset<Row> words = regexTokenized.select("words");
Sample data looks like this:
+--------------------+
| words|
+--------------------+
|[very, caring, st...|
|[the, grand, cafe...|
|[i, booked, a, no...|
|[wow, the, places...|
|[if, you, are, ju...|
Now, I want to get all the unique words. I tried a couple of filter, flatMap, map functions and reduce, but I couldn't figure it out because I'm new to Spark.
Based on @Haroun Mohammedi's answer, I was able to figure it out in Java.
import static org.apache.spark.sql.functions.explode;

Dataset<Row> uniqueWords = regexTokenized.select(explode(regexTokenized.col("words"))).distinct();
uniqueWords.show();
I'm coming from Scala, but I do believe there's a similar way in Java.
I think in this case you have to use the explode method in order to transform your data into a Dataset of words.
This code should give you the desired results:
import org.apache.spark.sql.functions.{col, explode}

val dsWords = regexTokenized.select(explode(col("words")))
val dsUniqueWords = dsWords.distinct()
For information about the explode method, please refer to the official documentation.
Hope it helps.