Spark converting a Dataset to RDD - java

I have a Dataset[String] and need to convert it to an RDD[String]. How?
Note: I've recently migrated from Spark 1.6 to Spark 2.0. Some of my clients were expecting an RDD, but now Spark gives me a Dataset.

As stated in the Scala API documentation, you can call .rdd on your Dataset:
val myRdd: RDD[String] = ds.rdd

A Dataset is a strongly typed DataFrame, so both Dataset and DataFrame can use .rdd to convert to an RDD.
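Since the question title mentions Java, here is a minimal sketch of the same conversion with the Java API (assuming an existing Dataset<String> named ds):
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.rdd.RDD;

// Scala-style RDD, as returned by Dataset.rdd()
RDD<String> rdd = ds.rdd();

// Java-friendly wrapper, usually what Java callers expect
JavaRDD<String> javaRdd = ds.toJavaRDD(); // equivalent to ds.javaRDD()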

Related

How to convert a JSON string to a specific DataFrame (Dataset) on Spark?

I want to convert a JSON string to a specific DataFrame on Spark.
Spark can easily produce an automatically generated DataFrame:
Dataset<Row> df = sparkSession.read().json(jsonRdd); // jsonRdd is a JavaRDD<String>
but the result is not what I want, because the automatically generated DataFrame contains only struct and array types,
while the target Hive table contains fields whose data type is map, so I can't write the Dataset directly to HDFS.
I can provide the right StructType and the JSON, and I want to get a Row based on the StructType I provide, with an API like the following:
Row transJsonStrToSpecificRow(StructType specificStruct, String json)
Does anyone have a solution?
Thanks!
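One possible approach (not from the thread above, just a hedged sketch): Spark's DataFrameReader lets you supply the StructType explicitly, so map-typed fields declared in the schema are honored while the JSON strings are parsed. The schema fields and variable names below are assumptions:
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Example schema containing a map-typed field (placeholder field names)
StructType schema = new StructType()
        .add("id", DataTypes.StringType)
        .add("attributes", DataTypes.createMapType(DataTypes.StringType, DataTypes.StringType));

// jsonRdd is the JavaRDD<String> holding the raw JSON documents
Dataset<Row> rows = sparkSession.read().schema(schema).json(jsonRdd);
Each resulting Row then follows the provided StructType, which is essentially the transJsonStrToSpecificRow behaviour applied to the whole RDD.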

Can I convert an RDD<POJO> to a DataFrame in a way that lets me write these POJOs to a table whose column names match the POJO's attribute names?

According to a reply to Convert Spark DataFrame to Pojo Object, I've learned that a DataFrame is an alias for Dataset<Row>.
I currently compute a JavaPairRDD<CityCode, CityStatistics> where CityStatistics is a POJO with getters and setters such as getCityCode(), getCityName(), getActivityCode(), getNumberOfSalaried(), getNumberOfCompanies()...
A Liquibase script has created a statistics table where those fields (CITYCODE, CITYNAME, ACTIVITYCODE...) exist. I just have to write the records.
What is the clean way (or, before that: is there any clean way) to do something like the following from my JavaPairRDD<CityCode, CityStatistics> citiesStatisticsRDD?
citiesStatisticsRDD.values() => Dataset<CityStatistics> => Dataset<Row> (= DataFrame) => write over a JDBC connection through a DataFrame method?
Thanks!
First you have to convert the JavaPairRDD to an RDD, because .createDataset() accepts RDD<T>, not JavaRDD<T>. JavaRDD is a wrapper around RDD meant to make calls from Java code easier; it holds the RDD internally, which can be accessed with .rdd():
JavaRDD<CityStatistics> cityRDD = citiesStatisticsRDD.map(x -> x._2);
Dataset<CityStatistics> cityDS = sqlContext.createDataset(cityRDD.rdd(), Encoders.bean(CityStatistics.class));
Now, if you want the whole citiesStatisticsRDD converted to a Dataset: convert the JavaPairRDD to an RDD and then use encoders:
Dataset<Row> cityDS = sqlContext.createDataset(citiesStatisticsRDD.values().rdd(), Encoders.bean(CityStatistics.class)).toDF();
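Once you have the Dataset<Row> (i.e. the DataFrame), writing it over JDBC could look like the following sketch; the JDBC URL, target table name, and credentials are placeholders:
import java.util.Properties;
import org.apache.spark.sql.SaveMode;

Properties props = new Properties();
props.setProperty("user", "dbUser");         // placeholder credentials
props.setProperty("password", "dbPassword"); // placeholder credentials

cityDS.write()
      .mode(SaveMode.Append) // the table itself was already created by the Liquibase script
      .jdbc("jdbc:postgresql://host:5432/mydb", "STATISTICS", props);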

Querying a file in memory

I have a comma-separated file which I want to load into memory and query as if it were a database. I've come across many concepts/names but am not sure which is the right fit, e.g. embedded DB, in-memory database (Apache Ignite, etc.). How can I achieve that?
I recommend working with Apache Spark: you can load your file and then query it using Spark SQL, as follows:
val df = spark.read.format("csv").option("header", "true").load("csvfile.csv")
// Select only the "user_id" column
df.select("user_id").show()
See the link for more information.
If you are using Apache Spark 1.6, the code would be:
HiveContext hqlContext = new HiveContext(sparkContext);
DataFrame df = hqlContext.read().format("com.databricks.spark.csv")
        .option("inferSchema", "true")
        .option("header", "true")
        .load(csvpath);
df.registerTempTable("tableName");
And then you can query the table.
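For example (a sketch; user_id is borrowed from the snippet above and the aggregate is only an illustration):
DataFrame result = hqlContext.sql("SELECT user_id, COUNT(*) AS cnt FROM tableName GROUP BY user_id");
result.show();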

How to read Avro data and put the Avro fields into DataFrame columns using Spark Structured Streaming 2.1.1?

Code from the website:
Dataset<Row> ds = sparkSession.readStream()
        .format("com.databricks.spark.avro")
        .option("avroSchema", schema.toString())
        .load();
Since Avro messages have fields, I am wondering how to put each field of the message into a separate column of a Dataset/DataFrame. Does the above code automatically do that?
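The thread does not answer this, but as a hedged sketch: if the Avro payload ends up in a single struct column (the column name value and the field names below are assumptions), its fields can be expanded into separate top-level columns with a star path, or selected individually:
import static org.apache.spark.sql.functions.col;

// Flatten every field of the struct column into its own column
Dataset<Row> flattened = ds.select("value.*");

// Or pick individual fields explicitly (placeholder field names)
Dataset<Row> selected = ds.select(col("value.userId"), col("value.eventType"));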

Flink Table API not able to convert DataSet to DataStream

I am using the Flink Table API with Java, and I want to convert a DataSet to a DataStream. Here is my code:
TableEnvironment tableEnvironment = new TableEnvironment();
Table tab1 = table.where("related_value < 2014").select("related_value,ref_id");
DataSet<MyClass> ds2 = tableEnvironment.toDataSet(tab1, MyClass.class);
DataStream<MyClass> d = tableEnvironment.toDataStream(tab1, MyClass.class);
But when I try to execute this program, it throws the following exception:
org.apache.flink.api.table.ExpressionException: Invalid Root for JavaStreamingTranslator: Root(ArraySeq((related_value,Double), (ref_id,String))). Did you try converting a Table based on a DataSet to a DataStream or vice-versa?
I want to know how we can convert a DataSet to a DataStream using the Flink Table API.
Another thing I want to know: for pattern matching there is the Flink CEP library available, but is it feasible to use the Flink Table API for pattern matching?
Flink's Table API was not designed to convert a DataSet into a DataStream and vice versa. It is not possible to do that with the Table API and there is also no other way to do it with Flink at the moment.
Unifying the DataStream and DataSet APIs (handling batch processing as a special case of streaming, i.e., as bounded streams) is on the long-term roadmap of Flink.
You cannot convert to the DataStream API when using a plain TableEnvironment; you must create a StreamTableEnvironment to convert from a Table to a DataStream, something like this:
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
final EnvironmentSettings fsSettings = EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build();
final StreamTableEnvironment fsTableEnv = StreamTableEnvironment.create(env, fsSettings);
DataStream<MyClass> finalRes = fsTableEnv.toAppendStream(tableHere, MyClass.class); // tableHere is a Table created with fsTableEnv
Hope this helps somehow.
Kind regards!
