spark schema rdd to RDD - java

I would like to do word count in spark , I created a rdd using spark sql to extract distinct tweets from data set.
I would like to use split function on top of RDD but its not allowing me to do so.
Error:-
valuse split is not a member of org.apache.spark.sql.SchemaRdd
Spark Code that doesn't work to do word count:-
val disitnct_tweets=hiveCtx.sql("select distinct(text) from tweets_table where text <> ''")
val distinct_tweets_List=sc.parallelize(List(distinct_tweets))
//tried split on both the rdd disnt worked
distinct_tweets.flatmap(line => line.split(" ")).map(word => (word,1)).reduceByKey(_+_)
distinct_tweets_List.flatmap(line => line.split(" ")).map(word => (word,1)).reduceByKey(_+_)
But when I output the data from sparksql to a file and load it again and run split it works.
Example Code that works:-
val distinct_tweets=hiveCtx.sql("select dsitinct(text) from tweets_table where text <> ''")
val distinct_tweets_op=distinct_tweets.collect()
val rdd=sc.parallelize(distinct_tweets_op)
rdd.saveAsTextFile("/home/cloudera/bdp/op")
val textFile=sc.textFile("/home/cloudera/bdp/op/part-00000")
val counts=textFile.flatMap(line => line.split(" ")).map(word => (word,1)).reduceByKey(_+_)
counts.SaveAsTextFile("/home/cloudera/bdp/wordcount")
I need a answer instead of writing to file and loading again to do my split function is there a work around to make split function work
Thanks

First, we should not do collect() and then parallelize to create RDD; that will make driver busy/down.
Rather,
val distinct_tweets=hiveCtx.sql("select dsitinct(text) from tweets_table where text <> ''")
val distinct_tweets_op=distinct_tweets.map(x => x.mkstring)
[considering that, you select only single column in the query - distinct(text)]
now distinct_tweets_op is simply a RDD.
So, loop over this RDD; and you are good to apply split("") function on each string in that RDD.

Found the answer , its three step process to convert data frame or spark.sql.row.RDD to plain RDD.
sc.parallelize(List())
map to string
val distinct_tweets=hiveCtx.sql(" select distinct(text) from tweets_table where text <> ''")
val distinct_tweets_op=distinct_tweets.collect()
val distinct_tweets_list=sc.parallelize(List(distinct_tweets_op))
val distinct_tweets_string=distinct_tweets.map(x=>x.toString)
val test_kali=distinct_tweets_string.flatMap(line =>line.split(" ")).map(word => (word,1)).reduceByKey(_+_).sortBy {case (key,value) => -value}.map { case (key,value) => Array(key,value).mkString(",") }
test_kali.collect().foreach(println)
case class kali_test(text: String)
val test_kali_op=test_kali.map(_.split(" ")).map(p => kali_test(p(0)))
test_kali_op.registerTempTable("kali_test")
hiveCtx.sql(" select * from kali_test limit 10 ").collect().foreach(println)
This way I don't need to load in a file , I can do my operations on fly.
Thanks
Sri

The main reason your first one failed was this line:
val distinct_tweets_List=sc.parallelize(List(distinct_tweets))
That's a completely useless line in Spark, and worse than useless -- as you saw it tanks your system.
You want to avoid doing collect(), which creates an Array and returns it to the Driver application. Instead, you want to leave the objects as RDDs as long as possible, and return as little as possible data to the Driver (like the keys and the counts after being reduced).
But to answer your basic question, the following will take a DataFrame consisting of a single StringType column and convert it to an RDD[String]:
val myRdd = myDf.rdd.map(_.getString(0))
And though SchemaRDDs aren't around anymore, I believe the following will convert SchemaRDD with a single String column to a plain RDD[String]:
val myRdd = mySchemaRdd.map(_.getString(0))

Related

Spark java filter isin method or something else?

I have around 2 billions of rows in my cassandra database which I filter with the isin method based on an experimentlist with 4827 Strings, as shown below. However, I noticed that after the distinct command I have only 4774 unique rows. Any ideas why 53 are missing? Does the isin method has a threshold/limitations? I have double and triple checked the experimentlist, it does have 4827 Strings, and also the other 53 strings do exist in the database as I can query them with cqlsh. Any help much appreciated!
Dataset<Row> df1 = sp.read().format("org.apache.spark.sql.cassandra")
.options(new HashMap<String, String>() {
{
put("keyspace", "mdb");
put("table", "experiment");
}
})
.load().select(col("experimentid")).filter(col("experimentid").isin(experimentlist.toArray()));
List<String> tmplist=df1.distinct().as(Encoders.STRING()).collectAsList();
System.out.println("tmplist "+tmplist.size());
Regarding the actual question about "missing data" - there could be problems when your cluster has missing writes, and repair isn't done regularly. Spark Cassandra Connector (SCC) reads data with consistency level LOCAL_ONE, and may hit nodes without all data. You can try to set consistency level to LOCAL_QUORUM (via --conf spark.cassandra.input.consistency.level=LOCAL_QUORUM), for example, and repeat the experiment, although it's better to make sure that data is repaired.
Another problem that you have is that you're using the .isin function - it's translating into a query SELECT ... FROM table WHERE partition_key IN (list). See the execution plan:
scala> import org.apache.spark.sql.cassandra._
import org.apache.spark.sql.cassandra._
scala> val data = spark.read.cassandraFormat("m1", "test").load()
data: org.apache.spark.sql.DataFrame = [id: int, m: map<int,string>]
scala> data.filter($"id".isin(Seq(1,2,3,4):_*)).explain
== Physical Plan ==
*Scan org.apache.spark.sql.cassandra.CassandraSourceRelation [id#169,m#170] PushedFilters: [*In(id, [1,2,3,4])], ReadSchema: struct<id:int,m:map<int,string>>
This query is very inefficient, and put an additional load to the node that performs query. In the SCC 2.5.0, there are some optimizations around that, but it's better to use so-called "Direct Join" that was also introduced in the SCC 2.5.0, so SCC will perform requests to specific partition keys in parallel - that's more effective and put the less load to the nodes. You can use it as following (the only difference that I have it as "DSE Direct Join", while in OSS SCC it's printed as "Cassandra Direct Join"):
scala> val toJoin = Seq(1,2,3,4).toDF("id")
toJoin: org.apache.spark.sql.DataFrame = [id: int]
scala> val joined = toJoin.join(data, data("id") === toJoin("id"))
joined: org.apache.spark.sql.DataFrame = [id: int, id: int ... 1 more field]
scala> joined.explain
== Physical Plan ==
DSE Direct Join [id = id#189] test.m1 - Reading (id, m) Pushed {}
+- LocalTableScan [id#189]
This direct join optimization needs to be explicitly enabled as described in the documentation.

How to access the entries in every row and apply custom functions?

My input was a kafka-stream with only one value which is comma-separated. It looks like this.
"id,country,timestamp"
I already splitted the dataset so that i have something like the following structured stream
Dataset<Row> words = df
.selectExpr("CAST (value AS STRING)")
.as(Encoders.STRING())
.withColumn("id", split(col("value"), ",").getItem(0))
.withColumn("country", split(col("value"), ",").getItem(1))
.withColumn("timestamp", split(col("value"), ",").getItem(2));
+----+---------+----------+
|id |country |timestamp |
+----+---------+----------+
|2922|de |1231231232|
|4195|de |1231232424|
|6796|fr |1232412323|
+----+---------+----------+
Now I have a dataset with 3 columns. Now i want to use the entries in each row in a custom function e.g.
Dataset<String> words.map(row -> {
//do something with every entry of each row e.g.
Person person = new Person(id, country, timestamp);
String name = person.getName();
return name;
};
In the end i want to sink out again a comma-separated String.
Data frame has a schema so you cant just call a map function on it without defining a new schema.
You can either cast to RDD and use a map , or use a DF map with encoder.
Another option is I think you can use spark SQL with user defined functions, you can read about it.
If your use case is really simple as you are showing, doing something like :
var nameRdd = words.rdd.map(x => {f(x)})
which seems like is all you need
if you still want a dataframe you can use something like:
val schema = StructType(Seq[StructField](StructField(dataType = StringType, name = s"name")))
val rddToDf = nameRdd.map(name => Row.apply(name))
val df = sparkSession.createDataFrame(rddToDf, schema)
P.S dataframe === dataset
If you have a custom function that is not available by composing functions in the existing spark API[1], then you can either drop down to the RDD level (as #Ilya suggested), or use a UDF[2].
Typically I'll try to use the spark API functions on a dataframe whenever possible, as they generally will be the best optimized.
If thats not possible I will construct a UDF:
import org.apache.spark.sql.functions.{col, udf}
val squared = udf((s: Long) => s * s)
display(spark.range(1, 20).select(squared(col("id")) as "id_squared"))
In your case you need to pass multiple columns to your UDF, you can pass them in comma separated squared(col("col_a"), col("col_b")).
Since you are writing your UDF in Scala it should be pretty efficient, but keep in mind if you use Python, in general there will be extra latency due to data movements between JVM and Python.
[1]https://spark.apache.org/docs/latest/api/scala/index.html#package
[2]https://docs.databricks.com/spark/latest/spark-sql/udf-scala.html

JavaPairRDD to Dataset<Row> in SPARK

I have data in JavaPairRDD in format
JavaPairdRDD<Tuple2<String, Tuple2<String,String>>>
I tried using below code
Encoder<Tuple2<String, Tuple2<String,String>>> encoder2 =
Encoders.tuple(Encoders.STRING(), Encoders.tuple(Encoders.STRING(),Encoders.STRING()));
Dataset<Row> userViolationsDetails = spark.createDataset(JavaPairRDD.toRDD(MY_RDD),encoder2).toDF("value1","value2");
But how to generate Dataset with 3 columns ??? As output of above code gives me data in 2 columns. Any pointers / suggestion ???
Try to run printSchema - you will see, that value2 is a complex type.
Having such information, you can write:
Dataset<Row> uvd = userViolationsDetails.selectExpr("value1", "value2._1 as value2", "value2._2 as value3")
value2._1 means first element of a tuple inside current "value2" field. We overwrite value2 field to have one value only
Note that this will work after https://issues.apache.org/jira/browse/SPARK-24548 is merged to master branch. Currently there is a bug in Spark and tuple is converted to struct with two fields named value

How to Convert DataSet<Row> to DataSet of JSON messages to write to Kafka?

I use Spark 2.1.1.
I have the following DataSet<Row> ds1;
name | ratio | count // column names
"hello" | 1.56 | 34
(ds1.isStreaming gives true)
and I am trying to generate DataSet<String> ds2. other words when I write to a kafka sink I want to write something like this
{"name": "hello", "ratio": 1.56, "count": 34}
I have tried something like this df2.toJSON().writeStream().foreach(new KafkaSink()).start() but then it gives the following error
Queries with streaming sources must be executed with writeStream.start()
There are to_json and json_tuple however I am not sure how to leverage them here ?
I tried the following using json_tuple() function
Dataset<String> df4 = df3.select(json_tuple(new Column("result"), " name", "ratio", "count")).as(Encoders.STRING());
and I get the following error:
cannot resolve 'result' given input columns: [name, ratio, count];;
tl;dr Use struct function followed by to_json (as toJSON was broken for streaming datasets due to SPARK-17029 that got fixed just 20 days ago).
Quoting the scaladoc of struct:
struct(colName: String, colNames: String*): Column Creates a new struct column that composes multiple input columns.
Given you use Java API you have 4 different variants of struct function, too:
public static Column struct(Column... cols) Creates a new struct column.
With to_json function your case is covered:
public static Column to_json(Column e) Converts a column containing a StructType into a JSON string with the specified schema.
The following is a Scala code (translating it to Java is your home exercise):
val ds1 = Seq(("hello", 1.56, 34)).toDF("name", "ratio", "count")
val recordCol = to_json(struct("name", "ratio", "count")) as "record"
scala> ds1.select(recordCol).show(truncate = false)
+----------------------------------------+
|record |
+----------------------------------------+
|{"name":"hello","ratio":1.56,"count":34}|
+----------------------------------------+
I've also given your solution a try (with Spark 2.3.0-SNAPSHOT built today) and it seems it works perfectly.
val fromKafka = spark.
readStream.
format("kafka").
option("subscribe", "topic1").
option("kafka.bootstrap.servers", "localhost:9092").
load.
select('value cast "string")
fromKafka.
toJSON. // <-- JSON conversion
writeStream.
format("console"). // using console sink
start
format("kafka") was added in SPARK-19719 and is not available in 2.1.0.

How can I extract the values that don't match when joining two RDD's in Spark?

I have two sets of RDD's that look like this:
rdd1 = [(12, abcd, lmno), (45, wxyz, rstw), (67, asdf, wert)]
rdd2 = [(12, abcd, lmno), (87, whsh, jnmk), (45, wxyz, rstw)]
I need to create a new RDD that has all the values found in rdd2 that don't have corresponding matches in rdd1. So the created RDD should contain the following data:
rdd3 = [(87, whsh, jnmk)]
Does anyone know how to accomplish this?
You can do a full join and then create 2 new RDDs.
Select where both tables had records
Select where RDD2 PK is present and RDD1 is null
You'll need to first convert them to KV rdds. Sample code below:
rdd3 = rdd1.fullJoin(rdd2).filter(x => x._3.exists).map(x => (x._1, x._3.get))
(Yes, there is a more idiomatic way to get the option but this should work)

Categories