I have a "Dataset(Row)" as below
+-----+--------------+
|val | history |
+-----+--------------+
|500 |[a=456, a=500]|
|800 |[a=456, a=500]|
|784 |[a=456, a=500]|
+-----+--------------+
Here val is "String" and history is an "string array". I'm trying to add the content in val column to the history column, so that my dataset looks like :
+-----+---------------------+
|val | history |
+-----+---------------------+
|500 |[a=456, b=500, c=500]|
|800 |[a=456, b=500, c=800]|
|784 |[a=456, b=500, c=784]|
+-----+---------------------+
A similar question is discussed here https://stackoverflow.com/a/49685271/2316771 , but I don't know Scala and couldn't create a similar Java solution.
Please help me to achieve this in Java.
In Spark 2.4 (not before), you can use the concat function to concatenate two arrays. In your case, you could do something like:
df.withColumn("val2", concat(lit("c="), col("val")))
  .select(concat(col("history"), array(col("val2"))));
NB: the first use of concat concatenates strings; the second concatenates arrays. array(col("val2")) creates an array with a single element.
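For reference, a fuller Java sketch of the same approach (assuming Spark 2.4+ and static imports from org.apache.spark.sql.functions; column names are taken from the question):
import static org.apache.spark.sql.functions.*;

// append "c=<val>" to the existing history array, keeping the original column name
Dataset<Row> result = df.withColumn("history",
    concat(col("history"), array(concat(lit("c="), col("val")))));
result.show(false);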
I coded a solution, but I'm not sure whether it can be optimized further:
// needs org.apache.spark.sql.catalyst.encoders.RowEncoder and scala.collection.JavaConversions
dataset.map((MapFunction<Row, Row>) row -> {
    Seq<String> seq = row.getAs("history");
    ArrayList<String> list = new ArrayList<>(JavaConversions.seqAsJavaList(seq));
    list.add("c=" + row.getAs("val"));
    return RowFactory.create(row.getAs("val"), list.toArray(new String[0]));
}, RowEncoder.apply(schema)); // schema is the StructType of the result row
I have the following test case where I want to pass '00:00:00.0' (date_suffix) for one example and not for the other.
However, using this approach it also appends a space in the first example, which has no date_suffix,
so it results in something like this:
// I need to get rid of the last space (after /17) in example 1.
example1. "1996/06/17 "
example2. "1996/06/17 00:00:00.0"
--
Then Some case:
| birthdate |
| 1996/06/17 <date_suffix> |
| 1987-11-08 <date_suffix> |
| 1998-07-20 <date_suffix> |
#example1
Examples:
| date_suffix |
| |
#example2
Examples:
| date_suffix |
| 00:00:00.0 |
What you want to do is not possible in Gherkin.
However, it seems like you are testing a date parser or validation tool through some other component.
By adding the time stamp to the date, you're adding incidental details to your scenario. It is not immediately apparent what these test, and they may be overlooked in the future.
Consider instead testing the parser/validator separately and directly.
Once you have confidence that the date parser works correctly, use a list of mixed dates for your current scenario, some with and some without the suffix.
Use a trim function to eliminate the trailing space:
example = yourString.trim();
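For instance, a minimal sketch of the idea inside a step definition (date and dateSuffix are hypothetical variables holding the two parts from the Examples table):
// build the birthdate, then trim so the trailing space disappears when dateSuffix is empty
String birthdate = (date + " " + dateSuffix).trim();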
My input was a Kafka stream with only one value, which is comma-separated. It looks like this:
"id,country,timestamp"
I have already split the dataset so that I have something like the following structured stream:
Dataset<Row> words = df
.selectExpr("CAST (value AS STRING)")
.as(Encoders.STRING())
.withColumn("id", split(col("value"), ",").getItem(0))
.withColumn("country", split(col("value"), ",").getItem(1))
.withColumn("timestamp", split(col("value"), ",").getItem(2));
+----+---------+----------+
|id |country |timestamp |
+----+---------+----------+
|2922|de |1231231232|
|4195|de |1231232424|
|6796|fr |1232412323|
+----+---------+----------+
Now I have a dataset with three columns, and I want to use the entries of each row in a custom function, e.g.:
Dataset<String> names = words.map(row -> {
    // do something with every entry of each row, e.g.
    Person person = new Person(id, country, timestamp);
    String name = person.getName();
    return name;
});
In the end I want to output a comma-separated String again.
A DataFrame has a schema, so you can't just call a map function on it without defining a new schema.
You can either convert to an RDD and use a map, or use a DataFrame map with an encoder.
Another option is to use Spark SQL with user-defined functions; you can read about those in the docs.
If your use case is really as simple as you are showing, something like this (in Scala) is probably all you need:
val nameRdd = words.rdd.map(x => f(x))
If you still want a DataFrame, you can use something like:
val schema = StructType(Seq(StructField("name", StringType)))
val rddToDf = nameRdd.map(name => Row.apply(name))
val df = sparkSession.createDataFrame(rddToDf, schema)
P.S. In Spark 2.x, DataFrame is just an alias for Dataset[Row].
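Since the question is in Java, a rough sketch of the encoder-based variant looks like this (Person and the column names are taken from the question; treat it as an illustration, not a drop-in solution):
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Encoders;

Dataset<String> names = words.map((MapFunction<Row, String>) row -> {
    // Person is the question's own class; its constructor is assumed to take three Strings
    Person person = new Person(row.getAs("id"), row.getAs("country"), row.getAs("timestamp"));
    return person.getName();
}, Encoders.STRING()); // the encoder tells Spark how to serialize the mapped result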
If you have a custom function that is not available by composing functions in the existing Spark API [1], then you can either drop down to the RDD level (as @Ilya suggested) or use a UDF [2].
Typically I'll try to use the Spark API functions on a DataFrame whenever possible, as they are generally the best optimized.
If that's not possible, I will construct a UDF:
import org.apache.spark.sql.functions.{col, udf}
val squared = udf((s: Long) => s * s)
display(spark.range(1, 20).select(squared(col("id")) as "id_squared"))
In your case you need to pass multiple columns to your UDF; you can pass them comma separated, e.g. squared(col("col_a"), col("col_b")).
Since you are writing your UDF in Scala it should be pretty efficient, but keep in mind that if you use Python there will generally be extra latency due to data movement between the JVM and Python.
[1]https://spark.apache.org/docs/latest/api/scala/index.html#package
[2]https://docs.databricks.com/spark/latest/spark-sql/udf-scala.html
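In Java, registering and calling a multi-column UDF looks roughly like this (a sketch assuming a SparkSession named spark; the UDF name is made up, and Person and the column names come from the question):
import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;
import org.apache.spark.sql.api.java.UDF3;
import org.apache.spark.sql.types.DataTypes;

// register a UDF that builds the name from the three columns
spark.udf().register("toName",
    (UDF3<String, String, String, String>) (id, country, timestamp) ->
        new Person(id, country, timestamp).getName(),
    DataTypes.StringType);

Dataset<Row> named = words.withColumn("name",
    callUDF("toName", col("id"), col("country"), col("timestamp")));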
I'm using Apache Spark 2 to tokenize some text.
Dataset<Row> regexTokenized = regexTokenizer.transform(data);
It returns an array of strings.
Dataset<Row> words = regexTokenized.select("words");
Sample data looks like this:
+--------------------+
| words|
+--------------------+
|[very, caring, st...|
|[the, grand, cafe...|
|[i, booked, a, no...|
|[wow, the, places...|
|[if, you, are, ju...|
Now, I want to get all the unique words. I tried a couple of filters, flatMap, map functions and reduce, but I couldn't figure it out because I'm new to Spark.
based on the #Haroun Mohammedi answer, I was able to figure it out in Java.
Dataset<Row> uniqueWords = regexTokenized.select(explode(regexTokenized.col("words"))).distinct();
uniqueWords.show();
I'm coming from scala but I do believe that there's a similar way in Java.
I think in this case you have to use the explode method in order to transform your data into a Dataset of words.
This code should give you the desired results :
import org.apache.spark.sql.functions.{col, explode}

val dsWords = regexTokenized.select(explode(col("words")))
val dsUniqueWords = dsWords.distinct()
For information about the explode method, please refer to the official documentation.
Hope it helps.
I use Spark 2.1.1.
I have the following Dataset<Row> ds1:
name | ratio | count // column names
"hello" | 1.56 | 34
(ds1.isStreaming gives true)
and I am trying to generate a Dataset<String> ds2. In other words, when I write to a Kafka sink I want to write something like this:
{"name": "hello", "ratio": 1.56, "count": 34}
I have tried something like df2.toJSON().writeStream().foreach(new KafkaSink()).start(), but then it gives the following error:
Queries with streaming sources must be executed with writeStream.start()
There are to_json and json_tuple, but I am not sure how to leverage them here.
I tried the following using the json_tuple() function:
Dataset<String> df4 = df3.select(json_tuple(new Column("result"), " name", "ratio", "count")).as(Encoders.STRING());
and I get the following error:
cannot resolve 'result' given input columns: [name, ratio, count];;
tl;dr Use the struct function followed by to_json (toJSON was broken for streaming datasets due to SPARK-17029, which was fixed just 20 days ago).
Quoting the scaladoc of struct:
struct(colName: String, colNames: String*): Column Creates a new struct column that composes multiple input columns.
Given you use the Java API, you have 4 different variants of the struct function, too:
public static Column struct(Column... cols) Creates a new struct column.
With the to_json function your case is covered:
public static Column to_json(Column e) Converts a column containing a StructType into a JSON string with the specified schema.
The following is Scala code (translating it to Java is your home exercise):
val ds1 = Seq(("hello", 1.56, 34)).toDF("name", "ratio", "count")
val recordCol = to_json(struct("name", "ratio", "count")) as "record"
scala> ds1.select(recordCol).show(truncate = false)
+----------------------------------------+
|record |
+----------------------------------------+
|{"name":"hello","ratio":1.56,"count":34}|
+----------------------------------------+
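A rough Java translation of the snippet above (an untested sketch; it assumes the same ds1 with name, ratio and count columns and static imports from org.apache.spark.sql.functions):
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.struct;
import static org.apache.spark.sql.functions.to_json;

// pack the three columns into a struct and serialize it to a JSON string
Dataset<Row> record = ds1.select(
    to_json(struct(col("name"), col("ratio"), col("count"))).as("record"));
record.show(false);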
I've also given your solution a try (with Spark 2.3.0-SNAPSHOT built today) and it seems to work perfectly.
val fromKafka = spark.
readStream.
format("kafka").
option("subscribe", "topic1").
option("kafka.bootstrap.servers", "localhost:9092").
load.
select('value cast "string")
fromKafka.
toJSON. // <-- JSON conversion
writeStream.
format("console"). // using console sink
start
format("kafka") was added in SPARK-19719 and is not available in 2.1.0.
I'm currently working on a Talend job. I need to load data from an Excel file into an Oracle 11g database.
I can't figure out how to split a field of my Excel input file within Talend and load the resulting values into the database.
For example, I've got a field like this:
toto:12;tata:1;titi:15
And I need to load it into a table, for example grade:
| name | grade |
|------|-------|
| toto | 12    |
| titi | 15    |
| tata | 1     |
Thanks in advance.
In a Talend job, you can use tFileInputExcel to read your Excel file, and then tNormalize to split your special column into individual rows with a separator of ";". After that, use tExtractDelimitedFields with a separator of ":" to split the normalized column into name and grade columns. Then you can use a tOracleOutput component to write the result to the database.
While this solution is more verbose than the Java snippet suggested by AlexR, it has the advantage that it stays within Talend's graphical programming model.
for(String pair : str.split(";")) {
String[] kv = pair.split(":");
// at this point you have separated values
String name = kv[0];
String grade = kv[1];
dbInsert(name, grade);
}
Now you have to implement dbInsert(). Do it either using JDBC or using any higher-level tool (e.g. Hibernate, iBatis, JDO, JPA, etc.).
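A minimal JDBC sketch of dbInsert (the table and column names are taken from the example above; the Connection is passed in and error handling is simplified for brevity):
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

static void dbInsert(Connection conn, String name, String grade) throws SQLException {
    // insert one (name, grade) pair into the grade table
    try (PreparedStatement ps = conn.prepareStatement(
            "INSERT INTO grade (name, grade) VALUES (?, ?)")) {
        ps.setString(1, name);
        ps.setString(2, grade);
        ps.executeUpdate();
    }
}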