Convert Spark DataFrame to Pojo Object

Convert Spark DataFrame to Pojo Object - java

Please see below code:
//Create Spark Context
SparkConf sparkConf = new SparkConf().setAppName("TestWithObjects").setMaster("local");
JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf);
//Creating RDD
JavaRDD<Person> personsRDD = javaSparkContext.parallelize(persons);
//Creating SQL context
SQLContext sQLContext = new SQLContext(javaSparkContext);
DataFrame personDataFrame = sQLContext.createDataFrame(personsRDD, Person.class);
personDataFrame.show();
personDataFrame.printSchema();
personDataFrame.select("name").show();
personDataFrame.registerTempTable("peoples");
DataFrame result = sQLContext.sql("SELECT * FROM peoples WHERE name='test'");
result.show();
After this I need to convert the DataFrame - 'result' to Person Object or List. Thanks in advance.

DataFrame is simply a type alias of Dataset[Row] . These operations are also referred as “untyped transformations” in contrast to “typed transformations” that come with strongly typed Scala/Java Datasets.
The conversion from Dataset[Row] to Dataset[Person] is very simple in spark
DataFrame result = sQLContext.sql("SELECT * FROM peoples WHERE name='test'");
At this point, Spark converts your data into DataFrame = Dataset[Row], a collection of generic Row object, since it does not know the exact type.
// Create an Encoders for Java beans
Encoder<Person> personEncoder = Encoders.bean(Person.class);
Dataset<Person> personDF = result.as(personEncoder);
personDF.show();
Now, Spark converts the Dataset[Row] -> Dataset[Person] type-specific Scala / Java JVM object, as dictated by the class Person.
Please refer to below link provided by databricks for further details
https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html

A DataFrame is stored as Rows, so you can use the methods there to cast from untyped to typed. Take a look at the get methods.

If someone looking for conversion of json string column in Dataset<Row> to Dataset<PojoClass>
Sample pojo: Testing
#Data
public class Testing implements Serializable {
private String name;
private String dept;
}
In the above code #Data is an annotation from Lombok to generate getters and setters for this Testing class.
Actual conversion logic in Spark
#Test
void shouldConvertJsonStringToPojo() {
var sparkSession = SparkSession.builder().getOrCreate();
var structType = new StructType(new StructField[] {
new StructField("employee", DataTypes.StringType, false, Metadata.empty()),
});
var ds = sparkSession.createDataFrame(new ArrayList<>(
Arrays.asList(RowFactory.create(new Object[]{"{ \"name\": \"test\", \"dept\": \"IT\"}"}))), structType);
var objectMapper = new ObjectMapper();
var bean = Encoders.bean(Testing.class);
var testingDataset = ds.map((MapFunction<Row, Testing>) row -> {
var dept = row.<String>getAs("employee");
return objectMapper.readValue(dept, Testing.class);
}, bean);
assertEquals("test", testingDataset.head().getName());
}

Related

How to access Java Spark Broadcast variable?

i am trying to broadcast a Dataset in spark in order to access it from within a map function. The first print statement returns the first line of the broadcasted dataset as expected. Unfortunately, the second print statement does not return a result. The execution simply hangs at this point.
Any idea what I'm doing wrong?
Broadcast<JavaRDD<Row>> broadcastedTrainingData = this.javaSparkContext.broadcast(trainingData.toJavaRDD());
System.out.println("Data:" + broadcastedTrainingData.value().first());
JavaRDD<Row> rowRDD = this.javaSparkContext.parallelize(stringAsList).map((Integer row) -> {
System.out.println("Data (map):" + broadcastedTrainingData.value().first());
return RowFactory.create(row);
});
The following pseudocode hightlights what i want to achieve. My main goal is to broadcast the training dataset, so i can use it from within a map function.
public Dataset<Row> getWSSE(Dataset<Row> trainingData, int clusterRange) {
StructType structType = new StructType();
structType = structType.add("ClusterAm", DataTypes.IntegerType, false);
structType = structType.add("Cost", DataTypes.DoubleType, false);
List<Integer> stringAsList = new ArrayList<>();
for (int clusterAm = 2; clusterAm < clusterRange + 2; clusterAm++) {
stringAsList.add(clusterAm);
}
Broadcast<Dataset> broadcastedTrainingData = this.javaSparkContext.broadcast(trainingData);
System.out.println("Data:" + broadcastedTrainingData.value().first());
JavaRDD<Row> rowRDD = this.javaSparkContext.parallelize(stringAsList).map((Integer row) -> RowFactory.create(row));
StructType schema = DataTypes.createStructType(new StructField[]{DataTypes.createStructField("ClusterAm", DataTypes.IntegerType, false)});
Dataset wsse = sqlContext.createDataFrame(rowRDD, schema).toDF();
wsse.show();
ExpressionEncoder<Row> encoder = RowEncoder.apply(structType);
Dataset result = wsse.map(
(MapFunction<Row, Row>) row -> RowFactory.create(row.getAs("ClusterAm"), new KMeans().setK(row.getAs("ClusterAm")).setSeed(1L).fit(broadcastedTrainingData.value()).computeCost(broadcastedTrainingData.value())),
encoder);
result.show();
broadcastedTrainingData.destroy();
return wsse;
}

DataSet<Row> trainingData = ...<Your dataset>;
//Creating the broadcast variable. No need to write classTag code by hand
// use akka.japi.Util which is available
Broadcast<Dataset<Row>> broadcastedTrainingData = spark.sparkContext()
.broadcast(trainingData, akka.japi.Util.classTag(DataSet.class));
//Here is the catch.When you are iterating over a Dataset,
//Spark will actally run it in distributed mode. So if you try to accees
//Your object directly (e.g. trainingData) it would be null .
//Cause you didn't ask spark to explicitly send tha outside variable to
//each machine where you are running this for each parallelly.
//So you need to use Broadcast variable.(Most common use of Broadcast)
someSparkDataSet.foreach((row) -> {
DataSet<Row> recieveBrdcast = broadcastedTrainingData.value();
...
...
})

convert RDD to Dataset in Java Spark

I have an RDD, i need to convert it into a Dataset, i tried:
Dataset<Person> personDS = sqlContext.createDataset(personRDD, Encoders.bean(Person.class));
the above line throws the error,
cannot resolve method createDataset(org.apache.spark.api.java.JavaRDD
Main.Person, org.apache.spark.sql.Encoder T)
however, i can convert to Dataset after converting to Dataframe. the below code works:
Dataset<Row> personDF = sqlContext.createDataFrame(personRDD, Person.class);
Dataset<Person> personDS = personDF.as(Encoders.bean(Person.class));

.createDataset() accepts RDD<T> not JavaRDD<T>. JavaRDD is a wrapper around RDD inorder to make calls from java code easier. It contains RDD internally and can be accessed using .rdd(). The following can create a Dataset:
Dataset<Person> personDS = sqlContext.createDataset(personRDD.rdd(), Encoders.bean(Person.class));

on your rdd use .toDS() you will get a dataset.
Let me know if it helps. Cheers.

In addition to accepted answer, if you want to create a Dataset<Row> instead of Dataset<Person> in Java, please try like this:
StructType yourStruct = ...; //Create your own structtype based on individual field types
Dataset<Row> personDS = sqlContext.createDataset(personRDD.rdd(), RowEncoder.apply(yourStruct));

StructType schema = new StructType()
.add("Id", DataTypes.StringType)
.add("Name", DataTypes.StringType)
.add("Country", DataTypes.StringType);
Dataset<Row> dataSet = sqlContext.createDataFrame(yourJavaRDD, schema);
Be carefull with schema variable, not always easy to predict what datatype you need to use, sometimes it's better to use just StringType for all columns

Spark error when convert JavaRDD to DataFrame: java.util.Arrays$ArrayList is not a valid external type for schema of array<string>

I am using Spark 2.1.0. For the following code, which read a text file and convert the content to DataFrame, then feed into a Word2Vector model:
SparkSession spark = SparkSession.builder().appName("word2vector").getOrCreate();
JavaRDD<String> lines = spark.sparkContext().textFile("input.txt", 10).toJavaRDD();
JavaRDD<List<String>> lists = lines.map(new Function<String, List<String>>(){
public List<String> call(String line){
List<String> list = Arrays.asList(line.split(" "));
return list;
}
});
JavaRDD<Row> rows = lists.map(new Function<List<String>, Row>() {
public Row call(List<String> list) {
return RowFactory.create(list);
}
});
StructType schema = new StructType(new StructField[] {
new StructField("text", new ArrayType(DataTypes.StringType, true), false, Metadata.empty())
});
Dataset<Row> input = spark.createDataFrame(rows, schema);
input.show(3);
Word2Vec word2Vec = new Word2Vec().setInputCol("text").setOutputCol("result").setVectorSize(100).setMinCount(0);
Word2VecModel model = word2Vec.fit(input);
Dataset<Row> result = model.transform(input);
It throws an exception
java.lang.RuntimeException: Error while encoding: java.util.Arrays$ArrayList is not a valid external type for
schema of array
which happens at line input.show(3) , so the createDataFrame() is causing the exception because Arrays.asList() returns an Arrays$ArrayList which is not supported here. However the Spark Official Documentation has the following code:
List<Row> data = Arrays.asList(
RowFactory.create(Arrays.asList("Hi I heard about Spark".split(" "))),
RowFactory.create(Arrays.asList("I wish Java could use case classes".split(" "))),
RowFactory.create(Arrays.asList("Logistic regression models are neat".split(" ")))
);
StructType schema = new StructType(new StructField[]{
new StructField("text", new ArrayType(DataTypes.StringType, true), false, Metadata.empty())
});
Dataset<Row> documentDF = spark.createDataFrame(data, schema);
which works just fine. If Arrays$ArrayList is not supported, how come this code is working? The difference is I am converting a JavaRDD<Row> to DataFrame but the official documentation is converting a List<Row> to DataFrame. I believe Spark Java API has an overloaded method createDataFrame() which takes a JavaRDD<Row> and convert it to a DataFrame based on the provided schema. I am so confused about why it is not working. Can anyone help?

I encountered the same issue several days ago and the only way to solve this problem is the use an array of array. Why ? Here is the response:
An ArrayType is wrapper for Scala Arrays which correspond one-to-one to Java arrays. Java ArrayList is not mapped by default to Scala Array so that's why you get the exception:
java.util.Arrays$ArrayList is not a valid external type for schema of array
Hence, passing directly a String[] sould have work:
RowFactory.create(line.split(" "))
But since create takes as input an Object list as a row may have a columns list, the String[] get interpreted to a list of String columns. That's why a double array of String is required:
RowFactory.create(new String[][] {line.split(" ")})
However, still the mystery of constructing a DataFrame from a Java List of rows in the spark documentation. This is because the SparkSession.createDataFrame function version that takes as first parameter java.util.List of rows makes special type checks and converts so that it converts all Java Iterable (so ArrayList) to a Scala Array.
However, the SparkSession.createDataFrame that takes JavaRDD maps directly the row content to the DataFrame.
To wrap-up, this is the correct version:
SparkSession spark = SparkSession.builder().master("local[*]").appName("Word2Vec").getOrCreate();
SparkContext sc = spark.sparkContext();
sc.setLogLevel("WARN");
JavaRDD<String> lines = sc.textFile("input.txt", 10).toJavaRDD();
JavaRDD<Row> rows = lines.map(new Function<String, Row>(){
public Row call(String line){
return RowFactory.create(new String[][] {line.split(" ")});
}
});
StructType schema = new StructType(new StructField[] {
new StructField("text", new ArrayType(DataTypes.StringType, true), false, Metadata.empty())
});
Dataset<Row> input = spark.createDataFrame(rows, schema);
input.show(3);
Hope this solves your problem.

It's exactly as the error says. ArrayList is not the equivalent of Scala's array. You should use a normal array (i.e String[]) instead.

for me below is working fine
JavaRDD<Row> rowRdd = rdd.map(r -> RowFactory.create(r.split(",")));

Load csv data with partition in spark 2.0

In Spark 2.0, i have the following method which loads the data into dataset
public Dataset<AccountingData> GetDataFrameFromTextFile()
{ // The schema is encoded in a string
String schemaString = "id firstname lastname accountNo";
// Generate the schema based on the string of schema
List<StructField> fields = new ArrayList<>();
for (String fieldName : schemaString.split("\t")) {
StructField field = DataTypes.createStructField(fieldName, DataTypes.StringType, true);
fields.add(field);
}
StructType schema = DataTypes.createStructType(fields);
return sparksession.read().schema(schema)
.option("mode", "DROPMALFORMED")
.option("sep", "|")
.option("ignoreLeadingWhiteSpace", true)
.option("ignoreTrailingWhiteSpace ", true)
.csv("D:\\HadoopDirectory\Employee.txt").as(Encoders.bean(Employee.class));
}
and in my driver code, Map operation is called on the dataset
Dataset<Employee> rowDataset = ad.GetDataFrameFromTextFile();
Dataset<String> map = rowDataset.map(new MapFunction<Employee, String>() {
#Override
public String call(Employee emp) throws Exception {
return TraverseRuleByADRow(emp);
}
},Encoders.STRING());
When i run the driver program in spark local mode with 8 cores on my laptop, i see 8 partitions split the input file.May i know whether there is a way to load the file in more than 8 partitions, say 100 or 1000 partitions?
I know this is achievable if the source data is from sql server table via jdbc.
sparksession.read().format("jdbc").option("url", urlCandi).option("dbtable", tableName).option("partitionColumn", partitionColumn).option("lowerBound", String.valueOf(lowerBound))
.option("upperBound", String.valueOf(upperBound))
.option("numPartitions", String.valueOf(numberOfPartitions))
.load().as(Encoders.bean(Employee.class));
Thanks

Use repartition() method from Dataset. According to Scaladoc, there is no option to set number of paritions while reading

Deeplearning4j to spark pipeline: Convert a String type to org.apache.spark.mllib.linalg.VectorUDT

I have a sentiment analysis program to predict whether a given movie review is positive or negative using recurrent neutral network. I'm using Deeplearning4j deep learning library for that program. Now I need to add that program to apache spark pipeline.
When doing it, I have a class MovieReviewClassifier which extends org.apache.spark.ml.classification.ProbabilisticClassifier and I have to add an instance of that class to the pipeline. The features which are needed to build the model are entered to the program using setFeaturesCol(String s) method. The features I add are in String format since they are a set of strings used for sentiment analysis. But the features should be in the form org.apache.spark.mllib.linalg.VectorUDT. Is there a way to convert the strings to Vector UDT?
I have attached my code for pipeline implementation below:
public class RNNPipeline {
final static String RESPONSE_VARIABLE = "s";
final static String INDEXED_RESPONSE_VARIABLE = "indexedClass";
final static String FEATURES = "features";
final static String PREDICTION = "prediction";
final static String PREDICTION_LABEL = "predictionLabel";
public static void main(String[] args) {
SparkConf sparkConf = new SparkConf();
sparkConf.setAppName("test-client").setMaster("local[2]");
sparkConf.set("spark.driver.allowMultipleContexts", "true");
JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf);
SQLContext sqlContext = new SQLContext(javaSparkContext);
// ======================== Import data ====================================
DataFrame dataFrame = sqlContext.read().format("com.databricks.spark.csv")
.option("inferSchema", "true")
.option("header", "true")
.load("/home/RNN3/WordVec/training.csv");
// Split in to train/test data
double [] dataSplitWeights = {0.7,0.3};
DataFrame[] data = dataFrame.randomSplit(dataSplitWeights);
// ======================== Preprocess ===========================
// Encode labels
StringIndexerModel labelIndexer = new StringIndexer().setInputCol(RESPONSE_VARIABLE)
.setOutputCol(INDEXED_RESPONSE_VARIABLE)
.fit(data[0]);
// Convert indexed labels back to original labels (decode labels).
IndexToString labelConverter = new IndexToString().setInputCol(PREDICTION)
.setOutputCol(PREDICTION_LABEL)
.setLabels(labelIndexer.labels());
// ======================== Train ========================
MovieReviewClassifier mrClassifier = new MovieReviewClassifier().setLabelCol(INDEXED_RESPONSE_VARIABLE).setFeaturesCol("Review");
// Fit the pipeline for training..setLabelCol.setLabelCol.setLabelCol.setLabelCol
Pipeline pipeline = new Pipeline().setStages(new PipelineStage[] { labelIndexer, mrClassifier, labelConverter});
PipelineModel pipelineModel = pipeline.fit(data[0]);
}
}
Review is the feature column which contains strings to be predicted as positive or negative.
I get the following error when I execute the code:
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Column Review must be of type org.apache.spark.mllib.linalg.VectorUDT#f71b0bce but was actually StringType.
at scala.Predef$.require(Predef.scala:233)
at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
at org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.scala:50)
at org.apache.spark.ml.Predictor.validateAndTransformSchema(Predictor.scala:71)
at org.apache.spark.ml.Predictor.transformSchema(Predictor.scala:116)
at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:167)
at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:167)
at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:108)
at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:167)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:62)
at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:121)
at RNNPipeline.main(RNNPipeline.java:82)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)

According to its documentation
User-defined type for Vector which allows easy interaction with SQL via DataFrame.
And the fact that in the ML library
DataFrame supports many basic and structured types; see the Spark SQL datatype reference for a list of supported types. In addition to the types listed in the Spark SQL guide, DataFrame can use ML Vector types.
and the fact you are asked for org.apache.spark.sql.types.UserDefinedType<Vector>
You can probably get away by passing either a DenseVector or SparseVector, created from your String.
The conversion from String ("Review" ??? ) to a Vector depends on how you have organized your data.

The way to convert String type to verctor UDT is using word2vec. I have to add an word2vec object to the spark pipeline to do the conversion.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Convert Spark DataFrame to Pojo Object - java

A DataFrame is stored as Rows, so you can use the methods there to cast from untyped to typed. Take a look at the get methods.

Related

How to access Java Spark Broadcast variable?

convert RDD to Dataset in Java Spark

Spark error when convert JavaRDD to DataFrame: java.util.Arrays$ArrayList is not a valid external type for schema of array<string>

Load csv data with partition in spark 2.0

Deeplearning4j to spark pipeline: Convert a String type to org.apache.spark.mllib.linalg.VectorUDT

Categories

Resources