Spark scala schema not enforced on load [duplicate]

Spark scala schema not enforced on load [duplicate] - java

This question already has an answer here:
Spark DataFrame Schema Nullable Fields
(1 answer)
Closed 4 years ago.
I was trying out a Spark scala example on the spark shell which is as follows;
val myManualSchema = StructType(Array(StructField("DEST_COUNTRY_NAME", StringType, true),
StructField("ORIGIN_COUNTRY_NAME",StringType
, true), StructField("count", LongType, nullable=false)))
val dfNew = spark.read.format("json").schema(myManualSchema).load("/test.json")
dfNew.printSchema()
The output I got was as follows;
root
|-- DEST_COUNTRY_NAME: string (nullable = true)
|-- ORIGIN_COUNTRY_NAME: string (nullable = true)
|-- count: long (nullable = true)
I was expecting the count column to be nullable=false but it does not seem to be enforced. However, when I create a new DataFrame from this one and set the schema there, it works. This is what I did;
val dfSchemaTest = spark.createDataFrame(dfNew.rdd,myManualSchema)
scala> dfSchemaTest.printSchema()
root
|-- DEST_COUNTRY_NAME: string (nullable = true)
|-- ORIGIN_COUNTRY_NAME: string (nullable = true)
|-- count: long (nullable = false)
I would appreciate if someone can point my error in the following for not enforcing the schema when done at the time of loading the data file.

There is nothing, that can be done, as nullability is enforced by file format. It's just what spark does - if the datasource cannot ensure, that the column is not null, neither can DataFrame while read.

Related

Can we read a parquet file and partition file in java arrow similar to pyarrow?

I have been trying to implement below pyarrow code in java but could not find anything.
can you please suggest is it even possible to implement below code in java arrow or is there any alternative library to achieve this
table1 = pq.read_table('/Users/some-user/Downloads/' + file_name + '.parquet')
ds.write_dataset(table1, base_dir='/Users/some-user/hive', partitioning=['column'], partitioning_flavor='hive', max_partitions=10000, format='parquet', use_threads=True, existing_data_behavior='delete_matching')

For Arrow Java side, you could use Dataset module that offer reads capabilities of parquet files (write support, base on PR opened, it is under development).
For Spark side, you could use this Github example about how do you could implement that. Base on that examples, your code could be something like this:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
public class SparkRecipe {
public static void main(String[] args) {
SparkSession spark = SparkSession
.builder()
.appName("RW-with-partition")
.config("spark.master", "local")
.getOrCreate();
// File at: https://github.com/apache/spark/blob/a92ef00145b264013e11de12f2c7cee62c28198d/examples/src/main/resources/users.parquet
Dataset<Row> usersDF = spark.read().load("src/main/resources/parquet/users.parquet");
usersDF.printSchema();
/*
root
|-- name: string (nullable = true)
|-- favorite_color: string (nullable = true)
|-- favorite_numbers: array (nullable = true)
| |-- element: integer (containsNull = true)
*/
usersDF.show();
/*
+------+--------------+----------------+
| name|favorite_color|favorite_numbers|
+------+--------------+----------------+
|Alyssa| null| [3, 9, 15, 20]|
| Ben| red| []|
+------+--------------+----------------+
*/
usersDF
.write()
.partitionBy("favorite_color")
.format("parquet")
.save("src/main/resources/parquet/partbycolo/names.parquet");
}
}
Please let us know if this work on your side.

Reference 'unit' is ambiguous, could be: unit, unit

I'm trying to load all incoming parquet files from an S3 Bucket, and process them with delta-lake. I'm getting an exception.
val df = spark.readStream().parquet("s3a://$bucketName/")
df.select("unit") //filter data!
.writeStream()
.format("delta")
.outputMode("append")
.option("checkpointLocation", checkpointFolder)
.start(bucketProcessed) //output goes in another bucket
.awaitTermination()
It throws an exception, because "unit" is ambiguous.
I've tried debugging it. For some reason, it finds "unit" twice.
What is going on here? Could it be an encoding issue?
edit:
This is how I create the spark session:
val spark = SparkSession.builder()
.appName("streaming")
.master("local")
.config("spark.hadoop.fs.s3a.endpoint", endpoint)
.config("spark.hadoop.fs.s3a.access.key", accessKey)
.config("spark.hadoop.fs.s3a.secret.key", secretKey)
.config("spark.hadoop.fs.s3a.path.style.access", true)
.config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", 2)
.config("spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored", true)
.config("spark.sql.caseSensitive", true)
.config("spark.sql.streaming.schemaInference", true)
.config("spark.sql.parquet.mergeSchema", true)
.orCreate
edit2:
output from df.printSchema()
2020-10-21 13:15:33,962 [main] WARN org.apache.spark.sql.execution.datasources.DataSource - Found duplicate column(s) in the data schema and the partition schema: `unit`;
root
|-- unit: string (nullable = true)
|-- unit: string (nullable = true)

Reading the same data like this...
val df = spark.readStream().parquet("s3a://$bucketName/*")
...solves the issue. For whatever reason. I would love to know why... :(

How to combine the two rows of a dataset into a single row in spark using java

I am reading the transactions from a kafka topic in json format. then
i applied some transformations to get the aggregations based on the
txn_status . Below is the schema.
root |-- window: struct (nullable = true) | |-- start: timestamp
(nullable = true) | |-- end: timestamp (nullable = true) |--
txn_status: string (nullable = true) |-- count: long (nullable =
false)
My batch output is like below after applying grouping for the given
window. [![enter image description here][1]][1]
but i want the output like below json format.
{
“start_end_time”: “28/12/2018 11:32:00.000”,
“count_Total” : 6
“count_RCVD” : 5,
“count_FAILED”: 1
}
> how to combine two rows in a spark dataset.
>
>
> [1]: https://i.stack.imgur.com/sCJuX.jpg

As per the image you have shown, I have created a data frame or a temp table and provided the solution for your question.
Scala Code:
case class txn_rec(txn_status: String, count: Int, start_end_time: String)
var txDf=sc.parallelize(Array(new txn_rec("FAIL",9,"2019-03-08 016:40:00, 2019-03-08 016:57:00"),
new txn_rec("RCVD",161,"2019-03-08 016:40:00, 2019-03-08 016:57:00"))).toDF
txDf.createOrReplaceTempView("temp")
var resDF=spark.sql("select start_end_time, (select sum(count) from temp) as total_count , (select count from temp where txn_status='RCVD') as rcvd_count,(select count from temp where txn_status='FAIL') as failed_count from temp group by start_end_time")
resDF.show
resDF.toJSON.collectAsList.toString
You can see the output as shown in the screen shot.

Why does LogisticRegression fail with "IllegalArgumentException: org.apache.spark.ml.linalg.VectorUDT#3bfc3ba7"?

I am trying to run simple logistic regression program in spark.
I am getting this error: I tried to include various libs for solving the problem but it is not solving the problem.
java.lang.IllegalArgumentException: requirement failed: Column pmi
must be of type org.apache.spark.ml.linalg.VectorUDT#3bfc3ba7 but was
actually DoubleType.
This is my dataSet csv
abc,pmi,sv,h,rh,label
0,4.267034,5,1.618187,5.213683,T
0,4.533071,24,3.540976,5.010458,F
0,6.357766,7,0.440152,5.592032,T
0,4.694365,1,0,6.953864,T
0,3.099447,2,0.994779,7.219463,F
0,1.482493,20,3.221419,7.219463,T
0,4.886681,4,0.919705,5.213683,F
0,1.515939,20,3.92588,6.329699,T
0,2.756057,9,2.841345,6.727063,T
0,3.341671,13,3.022361,5.601656,F
0,4.509981,7,1.538982,6.716471,T
0,4.039118,17,3.206316,6.392757,F
0,3.862023,16,3.268327,4.080564,F
0,5.026574,1,0,6.254859,T
0,3.186627,19,1.880978,8.466048,T
1,6.036507,8,1.376031,4.080564,F
1,5.026574,1,0,6.254859,T
1,-0.936022,23,2.78176,5.601656,F
1,6.435599,3,1.298795,3.408575,T
1,4.769222,3,1.251629,7.201824,F
1,3.190702,20,3.294354,6.716471,F
This is the Edited Code:
import java.io.IOException;
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.classification.LogisticRegressionModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.ml.linalg.VectorUDT;
import org.apache.spark.ml.feature.VectorAssembler;
public class Sp_LogistcRegression {
public void trainLogisticregression(String path, String model_path) throws IOException {
//SparkConf conf = new SparkConf().setAppName("Linear Regression Example");
// JavaSparkContext sc = new JavaSparkContext(conf);
SparkSession spark = SparkSession.builder().appName("Sp_LogistcRegression").master("local[6]").config("spark.driver.memory", "3G").getOrCreate();
Dataset<Row> training = spark
.read()
.option("header", "true")
.option("inferSchema","true")
.csv(path);
String[] myStrings = {"abc",
"pmi", "sv", "h", "rh", "label"};
VectorAssembler VA = new VectorAssembler().setInputCols(myStrings ).setOutputCol("label");
Dataset<Row> transform = VA.transform(training);
LogisticRegression lr = new LogisticRegression().setMaxIter(1000).setRegParam(0.3);
LogisticRegressionModel lrModel = lr.fit( transform);
lrModel.save(model_path);
spark.close();
}
}
This is the test.
import java.io.File;
import java.io.IOException;
import org.junit.Test;
public class Sp_LogistcRegressionTest {
Sp_LogistcRegression spl =new Sp_LogistcRegression ();
#Test
public void test() throws IOException {
String filename = "datas/seg-large.csv";
ClassLoader classLoader = getClass().getClassLoader();
File file1 = new File(classLoader.getResource(filename).getFile());
spl. trainLogisticregression( file1.getAbsolutePath(), "/tmp");
}
}
UPDATE
As per your suggestion, I removed the string value attribute form the the dataset, which is label. Now, I get following error.
java.lang.IllegalArgumentException: Field "features" does not exist.
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)
at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
at scala.collection.AbstractMap.getOrElse(Map.scala:58)
at org.apache.spark.sql.types.StructType.apply(StructType.scala:263)
at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:40)
at org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.scala:51)

TL;DR Use VectorAssembler transformer.
Spark MLlib's LogisticRegression requires features column to be of type VectorUDT (as the error message says).
In your Spark application, you read the dataset from a CSV file and the field you use for features is of different type.
Please note that I can use Spark MLlib not necessarily explain what Machine Learning as a field of study would recommend in this case.
My recommendation would then be to use a transformer that would map the column to match the requirements of LogisticRegression.
A quick glance at the known transformers in Spark MLlib 2.1.1 gives me VectorAssembler.
A feature transformer that merges multiple columns into a vector column.
That's exactly what you need.
(I use Scala and I leave rewriting the code to Java as your home exercise)
val training: DataFrame = ...
// the following are to show that we're on the same page
val lr = new LogisticRegression().setFeaturesCol("pmi")
scala> lr.fit(training)
java.lang.IllegalArgumentException: requirement failed: Column pmi must be of type org.apache.spark.ml.linalg.VectorUDT#3bfc3ba7 but was actually IntegerType.
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
at org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.scala:51)
at org.apache.spark.ml.classification.Classifier.org$apache$spark$ml$classification$ClassifierParams$$super$validateAndTransformSchema(Classifier.scala:58)
at org.apache.spark.ml.classification.ClassifierParams$class.validateAndTransformSchema(Classifier.scala:42)
at org.apache.spark.ml.classification.ProbabilisticClassifier.org$apache$spark$ml$classification$ProbabilisticClassifierParams$$super$validateAndTransformSchema(ProbabilisticClassifier.scala:53)
at org.apache.spark.ml.classification.ProbabilisticClassifierParams$class.validateAndTransformSchema(ProbabilisticClassifier.scala:37)
at org.apache.spark.ml.classification.LogisticRegression.org$apache$spark$ml$classification$LogisticRegressionParams$$super$validateAndTransformSchema(LogisticRegression.scala:278)
at org.apache.spark.ml.classification.LogisticRegressionParams$class.validateAndTransformSchema(LogisticRegression.scala:265)
at org.apache.spark.ml.classification.LogisticRegression.validateAndTransformSchema(LogisticRegression.scala:278)
at org.apache.spark.ml.Predictor.transformSchema(Predictor.scala:144)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:100)
... 48 elided
"Houston, we've got a problem." Let's fix it by using VectorAssembler then.
import org.apache.spark.ml.feature.VectorAssembler
val vecAssembler = new VectorAssembler().
setInputCols(Array("pmi")).
setOutputCol("features")
val features = vecAssembler.transform(training)
scala> features.show
+---+--------+
|pmi|features|
+---+--------+
| 5| [5.0]|
| 24| [24.0]|
+---+--------+
scala> features.printSchema
root
|-- pmi: integer (nullable = true)
|-- features: vector (nullable = true)
Whoohoo! We've got features column of the proper vector type! Are we done?
Yes. In my case however as I use spark-shell for the experimentation, it won't work right away since lr uses a wrong pmi column (i.e. of incorrect type).
scala> lr.fit(features)
java.lang.IllegalArgumentException: requirement failed: Column pmi must be of type org.apache.spark.ml.linalg.VectorUDT#3bfc3ba7 but was actually IntegerType.
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
at org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.scala:51)
at org.apache.spark.ml.classification.Classifier.org$apache$spark$ml$classification$ClassifierParams$$super$validateAndTransformSchema(Classifier.scala:58)
at org.apache.spark.ml.classification.ClassifierParams$class.validateAndTransformSchema(Classifier.scala:42)
at org.apache.spark.ml.classification.ProbabilisticClassifier.org$apache$spark$ml$classification$ProbabilisticClassifierParams$$super$validateAndTransformSchema(ProbabilisticClassifier.scala:53)
at org.apache.spark.ml.classification.ProbabilisticClassifierParams$class.validateAndTransformSchema(ProbabilisticClassifier.scala:37)
at org.apache.spark.ml.classification.LogisticRegression.org$apache$spark$ml$classification$LogisticRegressionParams$$super$validateAndTransformSchema(LogisticRegression.scala:278)
at org.apache.spark.ml.classification.LogisticRegressionParams$class.validateAndTransformSchema(LogisticRegression.scala:265)
at org.apache.spark.ml.classification.LogisticRegression.validateAndTransformSchema(LogisticRegression.scala:278)
at org.apache.spark.ml.Predictor.transformSchema(Predictor.scala:144)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:100)
... 48 elided
Let's fix lr to use features column.
Please note that features column is the default so I simply create a new instance of LogisticRegression (I could also use setInputCol).
val lr = new LogisticRegression()
// it works but I've got no label column (with 0s and 1s and hence the issue)
// the main issue was fixed though, wasn't it?
scala> lr.fit(features)
java.lang.IllegalArgumentException: Field "label" does not exist.
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:266)
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:266)
at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
at scala.collection.AbstractMap.getOrElse(Map.scala:59)
at org.apache.spark.sql.types.StructType.apply(StructType.scala:265)
at org.apache.spark.ml.util.SchemaUtils$.checkNumericType(SchemaUtils.scala:71)
at org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.scala:53)
at org.apache.spark.ml.classification.Classifier.org$apache$spark$ml$classification$ClassifierParams$$super$validateAndTransformSchema(Classifier.scala:58)
at org.apache.spark.ml.classification.ClassifierParams$class.validateAndTransformSchema(Classifier.scala:42)
at org.apache.spark.ml.classification.ProbabilisticClassifier.org$apache$spark$ml$classification$ProbabilisticClassifierParams$$super$validateAndTransformSchema(ProbabilisticClassifier.scala:53)
at org.apache.spark.ml.classification.ProbabilisticClassifierParams$class.validateAndTransformSchema(ProbabilisticClassifier.scala:37)
at org.apache.spark.ml.classification.LogisticRegression.org$apache$spark$ml$classification$LogisticRegressionParams$$super$validateAndTransformSchema(LogisticRegression.scala:278)
at org.apache.spark.ml.classification.LogisticRegressionParams$class.validateAndTransformSchema(LogisticRegression.scala:265)
at org.apache.spark.ml.classification.LogisticRegression.validateAndTransformSchema(LogisticRegression.scala:278)
at org.apache.spark.ml.Predictor.transformSchema(Predictor.scala:144)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:100)
... 48 elided
Using Multiple Columns
After the first version of the question has been updated, another issue has turned up.
scala> va.transform(training)
java.lang.IllegalArgumentException: Data type StringType is not supported.
at org.apache.spark.ml.feature.VectorAssembler$$anonfun$transformSchema$1.apply(VectorAssembler.scala:121)
at org.apache.spark.ml.feature.VectorAssembler$$anonfun$transformSchema$1.apply(VectorAssembler.scala:117)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at org.apache.spark.ml.feature.VectorAssembler.transformSchema(VectorAssembler.scala:117)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
at org.apache.spark.ml.feature.VectorAssembler.transform(VectorAssembler.scala:54)
... 48 elided
The reason is that VectorAssembler accepts the following input column types: all numeric types, boolean type, and vector type. It means that one of the columns used for VectorAssembler is of StringType type.
In your case the column is label since it's of StringType. See the schema.
scala> training.printSchema
root
|-- bc: integer (nullable = true)
|-- pmi: double (nullable = true)
|-- sv: integer (nullable = true)
|-- h: double (nullable = true)
|-- rh: double (nullable = true)
|-- label: string (nullable = true)
Remove it from your columns to use for VectorAssembler and the error goes away.
If however this or any other column should be included but is of incorrect type, you have to cast it appropriately (provided it is possible by the values the column holds). Use cast method for this.
cast(to: String): Column Casts the column to a different data type, using the canonical string representation of the type. The supported types are: string, boolean, byte, short, int, long, float, double, decimal, date, timestamp.
The error message should include the column name(s), but currently it does not so I filed [SPARK-21285 VectorAssembler should report the column name when data type used is not supported|https://issues.apache.org/jira/browse/SPARK-21285] to fix it. Vote for it if you think it's worth to have in the upcoming Spark version.

How to parse complex JSON Data in spark streaming in Java

We are developing an IOT application
We get following data stream from each of the device we want to run analysis for,
[{"t":1481368346000,"sensors":[{"s":"s1","d":"+149.625"},{"s":"s2","d":"+23.062"},{"s":"s3","d":"+16.375"},{"s":"s4","d":"+235.937"},{"s":"s5","d":"+271.437"},{"s":"s6","d":"+265.937"},{"s":"s7","d":"+295.562"},{"s":"s8","d":"+301.687"}]}]
At primary level I am able to get schema using spark java code as follows,
root
|-- sensors: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- d: string (nullable = true)
| | |-- s: string (nullable = true)
|-- t: long (nullable = true)
Code I have written is,
JavaDStream<String> json = directKafkaStream.map(new Function<Tuple2<String,String>, String>() {
public String call(Tuple2<String,String> message) throws Exception {
return message._2();
};
});
SQLContext sqlContext = spark.sqlContext();
json.foreachRDD(new VoidFunction<JavaRDD<String>>() {
#Override
public void call(JavaRDD<String> jsonRecord) throws Exception {
Dataset<Row> row = sqlContext.read().json(jsonRecord).toDF();
row.createOrReplaceTempView("MyTable");
row.printSchema();
row.show();
Dataset<Row> sensors = row.select("sensors");
sensors.createOrReplaceTempView("sensors");
sensors.printSchema();
sensors.show();
}
});
This gives me and error as "org.apache.spark.sql.AnalysisException: cannot resolve 'sensors' given input columns: [];"
I am beginner with spark and analytics and not able to find any good example in java for parsing nested json.
What I am trying to achieve is and might need suggestions on from experts here is,
I am going to extract each sensor value and then going to run Regression analysis using sparkML library of spark. This will help me to find out what trend is occuring in each sensor stream as well as I want to detect failueres using that data.
I am not sure which should be the best way to do this and any guidance, links and info would be really helpful.

Here is how your json.foreachRDD should look like.
json.foreachRDD(new VoidFunction<JavaRDD<String>>() {
#Override
public void call(JavaRDD<String> rdd) {
if(!rdd.isEmpty()){
Dataset<Row> data = spark.read().json(rdd).select("sensors");
data.printSchema();
data.show(false);
//DF in table
Dataset<Row> df = data.select( org.apache.spark.sql.functions.explode(org.apache.spark.sql.functions.col("sensors"))).toDF("sensors").select("sensors.s","sensors.d");
df.show(false);
}
}
});
For regression analysis sample, you can refer JavaRandomForestRegressorExample.java at https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/ml/JavaRandomForestRegressorExample.java
For real time data analysis using spark machine learning and spark streaming, you can refer below articles.
https://www.mapr.com/blog/monitoring-real-time-uber-data-using-spark-machine-learning-streaming-and-kafka-api-part-1
https://www.mapr.com/blog/real-time-credit-card-fraud-detection-apache-spark-and-event-streaming

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Spark scala schema not enforced on load [duplicate] - java

There is nothing, that can be done, as nullability is enforced by file format. It's just what spark does - if the datasource cannot ensure, that the column is not null, neither can DataFrame while read.

Related

Can we read a parquet file and partition file in java arrow similar to pyarrow?

Reference 'unit' is ambiguous, could be: unit, unit

How to combine the two rows of a dataset into a single row in spark using java

Why does LogisticRegression fail with "IllegalArgumentException: org.apache.spark.ml.linalg.VectorUDT#3bfc3ba7"?

How to parse complex JSON Data in spark streaming in Java

Categories

Resources