How to convert a JSON string to a specific DataFrame (Dataset) on Spark? - Java

I want to convert a JSON string to a specific DataFrame on Spark.
Spark can easily produce an automatically generated DataFrame:
Dataset<Row> row = sparkSession.read().json(javaRdd); // javaRdd is a JavaRDD<String>
but the result is not what I want, because the automatically generated DataFrame contains only struct and array types, while the target Hive table contains fields whose data type is map, so I can't write the Dataset to HDFS directly.
I can provide the right StructType and the JSON, and I want to get a Row based on the StructType I provide. The API would look like this:
Row transJsonStrToSpecificRow(StructType specificStruct, String json)
Does anyone have a solution?
Thanks!
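There is no built-in transJsonStrToSpecificRow, but here is a minimal sketch of one way to get map-typed columns, assuming Spark 2.2+ and hypothetical field names id and attributes: pass your own StructType to the reader, so Spark skips inference and parses JSON objects as maps wherever the schema says map.
import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Hypothetical schema: "attributes" is declared as map<string,string>,
// matching the Hive column type, instead of the inferred struct
StructType schema = new StructType()
        .add("id", DataTypes.StringType)
        .add("attributes", DataTypes.createMapType(DataTypes.StringType, DataTypes.StringType));

Dataset<String> jsonDs = sparkSession.createDataset(
        Arrays.asList("{\"id\":\"1\",\"attributes\":{\"color\":\"red\"}}"),
        Encoders.STRING());

// Supplying the schema skips inference, so "attributes" is parsed as a map
Dataset<Row> rows = sparkSession.read().schema(schema).json(jsonDs);
rows.printSchema(); // attributes: map<string,string>
The resulting rows then match the provided StructType, so they can be written to the Hive table without the struct/map mismatch.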

Related

How to iterate Dataset<Row> and print each attribute value in Java

I have loaded a parquet file into a Dataset<Row> in Java, and I want to iterate it row by row and read the value of every attribute in each row.
I have got this far:
Dataset<Row> df = sparkSession.read().format("parquet").load(location);
df.foreach((ForeachFunction<Row>) row -> {
    System.out.println(row);
});
Is there any function in Java to read the attributes of a given row?
PS: I am using Java 11 and Spark 2.4.0.
Well, it's a little tricky in Java: get() doesn't take a column name directly, but we can get a value by its index. There is also a function that returns the field index for a given column name. Combining both, we can iterate over the attributes of a Dataset<Row>.
Sample code:
Dataset<Row> df = sparkSession.read().format("parquet").load(location);
df.foreach((ForeachFunction<Row>) row -> {
    // fieldIndex(...) resolves the column name to its positional index
    System.out.println((String) row.get(row.fieldIndex("attribute_name")));
});
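If you want every attribute rather than one known column, here is a small extension of the same idea (a sketch; it assumes the rows carry their schema, which holds for rows read from a DataFrame):
df.foreach((ForeachFunction<Row>) row -> {
    // Walk the row's schema and print each attribute generically
    for (String field : row.schema().fieldNames()) {
        System.out.println(field + " = " + row.get(row.fieldIndex(field)));
    }
});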

Get a single column's values as a flat list in Apache Spark using Java

I am new to Java and Apache Spark and am trying to figure out how to get the values of a single column from a Dataset in Spark as a flat list.
Dataset<Row> sampleData = sparkSession.read()
.....
.option("query", "SELECT COLUMN1, column2 from table1")
.load();
List<Row> columnsList = sampleData.select("COLUMN1")
.where(sampleData.col("COLUMN1").isNotNull()).collectAsList();
String result = StringUtils.join(columnsList, ", ");
// Result I am getting is
[15230321], [15306791], [15325784], [15323326], [15288338], [15322001], [15307950], [15298286], [15327223]
// What I want is:
15230321, 15306791......
How do I achieve this in Spark using Java?
Spark rows can be converted to String via Encoders:
List<String> result = sampleData.select("COLUMN1").as(Encoders.STRING()).collectAsList();
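To get the exact comma-separated string from the question, the collected list can then be joined with plain Java (String.join is standard since Java 8):
String joined = String.join(", ", result); // 15230321, 15306791, ...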
I am posting the answer in Scala; you can convert it to Java, as online tools are available.
Also, I am not building the String result the way you specified, because that would require creating the table and running your query, so I am replicating the problem variable directly:
import org.apache.spark.sql.Row
val a = List(Row("123"), Row("222"), Row("333"))
Printing a gives me
List([123], [222], [333])
So apply a simple map operation along with the mkString method to flatten the List:
a.map(x => x.mkString(","))
gives
List(123, 222, 333), which I assume is what you expect.
Let me know if this sorts out your issue.
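For reference, here is a hypothetical Java translation of the Scala snippet above; Row.mkString and RowFactory.create are the Spark API calls it relies on:
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;

// Replicate the problem variable directly, as in the Scala answer
List<Row> a = Arrays.asList(RowFactory.create("123"), RowFactory.create("222"), RowFactory.create("333"));

// Row.mkString(sep) flattens each single-column Row to its value
List<String> flat = a.stream()
        .map(r -> r.mkString(","))
        .collect(Collectors.toList());

System.out.println(flat); // [123, 222, 333]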

Can I convert an RDD<POJO> to a DataFrame in a way that lets me write these POJOs to a table whose attribute names match the POJO's?

According to a reply to Convert Spark DataFrame to Pojo Object, I've learned that a DataFrame is an alias of Dataset<Row>.
I currently compute a JavaPairRDD<CityCode, CityStatistics>, where CityStatistics is a POJO with getters and setters for members like getCityCode(), getCityName(), getActivityCode(), getNumberOfSalaried(), getNumberOfCompanies()...
A Liquibase script has created a statistics table where those fields (CITYCODE, CITYNAME, ACTIVITYCODE...) exist. I just have to write the records.
What is the (or, before that: is there any) clean way to do something like the following from my JavaPairRDD<CityCode, CityStatistics> citiesStatisticsRDD?
citiesStatisticsRDD.values() => Dataset<CityStatistics> => Dataset<Row> (= DataFrame) => write over a JDBC connection through a DataFrame method?
Thanks!
First you have to convert the JavaPairRDD to an RDD, because .createDataset() accepts RDD<T>, not JavaRDD<T>. JavaRDD is a wrapper around RDD that makes calls from Java code easier; it holds the RDD internally, which can be accessed using .rdd():
JavaRDD<CityStatistics> cityRDD = citiesStatisticsRDD.map(x -> x._2);
Dataset<CityStatistics> cityDS = sqlContext.createDataset(cityRDD.rdd(), Encoders.bean(CityStatistics.class));
Now, if you want the whole citiesStatisticsRDD converted to a Dataset, convert the JavaPairRDD to an RDD and then use encoders:
Dataset<Row> cityDF = sqlContext.createDataset(citiesStatisticsRDD.values().rdd(), Encoders.bean(CityStatistics.class)).toDF();
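To finish the pipeline from the question, the resulting DataFrame can be written over JDBC. A sketch, with placeholder connection URL, table name, and credentials:
import java.util.Properties;
import org.apache.spark.sql.SaveMode;

// All connection details below are placeholders for your environment
Properties props = new Properties();
props.setProperty("user", "dbuser");
props.setProperty("password", "dbpassword");

cityDF.write()
        .mode(SaveMode.Append)
        .jdbc("jdbc:postgresql://localhost:5432/mydb", "statistics", props);
Since the Dataset was built with Encoders.bean(CityStatistics.class), its column names come from the POJO's getters, so they line up with the CITYCODE/CITYNAME columns as long as the database matches identifiers case-insensitively.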

How to read Avro data and put the Avro fields into dataframe Columns using Spark Structured Streaming 2.1.1?

Code from the website
Dataset<Row> ds = sparkSession.readStream()
    .format("com.databricks.spark.avro")
    .option("avroSchema", schema.toString())
    .load();
Since Avro messages have fields, I am wondering how to put each field of the message into a separate column of a Dataset/DataFrame. Does the above code do that automatically?
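In general, spark-avro maps top-level Avro record fields to DataFrame columns, so no extra step should be needed; a quick way to check, with hypothetical field names:
// Sketch only: "id" and "payload" are assumed field names from the Avro schema
ds.printSchema(); // each top-level Avro field should appear as its own column
Dataset<Row> fields = ds.select("id", "payload");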

Get column name (Meta Data) in Talend

I'm trying to export data and metadata from a MySQL database to JSON.
My JSON output needs to have this structure:
{ "classifier":[
{
"name":"Frequency",
"value":"75 kHz"
},
{
"name":"depth",
"value":"100 m"
} ]}
Frequency here represents a column name, and 75 kHz is the value of that column for a specific row.
I'm using Talend Data Integration to do this, and I can get the data, but I can't figure out how to get the metadata. Do I have to enter it myself, or is there an easier way?
You cannot export the metadata of a JSON file from MySQL, because MySQL provides structured data; you have to create the JSON structure independently, either from an existing file or manually. The easiest way is to create a sample file like the one used in your question. See Talend Help.
