Java equivalent implementation of withColumn in Spark - java

I am trying to use the functions which are available in org.apache.spark.sql.functions.
When I use them in SQL, as in
Dataset<Row> dfSelect = sqlContext.sql(
    "SELECT unix_timestamp(concat(Date, ' ', regexp_replace(Time, '[.]', ':'))) AS TIMESTAMP, `NMHC(GT)` FROM airQuality");
these functions work fine. But when I use the
Dataset<Row> org.apache.spark.sql.Dataset.withColumn(String colName, Column col)
method in Java, I have implemented it as below, but it gives an error.
Dataset<Row> df = spark.read().format("csv")
.option("dateFormat", "dd/MM/yyyy")
.option("timeFormat", "hh.mm.ss")
.option("mode", "PERMISSIVE")
.option("inferSchema", true)
.option("header", true)
.schema(schema)
.load("src/main/resources/AirQualityUCI/AirQualityUCI.csv");
df.createOrReplaceTempView("airQuality");
df.withColumn("DateStamp",unix_timestamp(concat(df.col("Date"),col(" "),regexp_replace(df.col("Time"),"[.]",":"))));
The error is:
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '` `' given input columns: [Time, Date];;
'Project [Date#0, Time#1, unix_timestamp(concat(Date#0, ' , regexp_replace(Time#1, [.], :)), yyyy-MM-dd HH:mm:ss) AS DateStamp#32]
+- Relation[Date#0,Time#1] csv
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)

Your issue probably lies in the concat:
concat(df.col("Date"),col(" "),regexp_replace(df.col("Time"),"[.]",":"))
More precisely, inside the col(" "), which instructs the SQL engine to find a column (hence the col function) whose name is " " (a space character). Of course, no such column exists, which is why you get an error saying there is no such column:
cannot resolve '` `' given input columns: [Time, Date];;
If what you want, as I suspect, is a blank character inside your concatenation, you can express that with a literal column value, which is lit(" ") in Spark.
Which would give:
concat(df.col("Date"),lit(" "),regexp_replace(df.col("Time"),"[.]",":"))
In any case, my advice when dealing with such errors is to simplify your expression until it works, thus identifying what is at fault.
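Also note that withColumn returns a new Dataset rather than modifying df in place, so the result has to be assigned to keep the column. A minimal sketch of the corrected call; the unix_timestamp format string is an assumption based on the dd/MM/yyyy and hh.mm.ss options used when reading the CSV:
import static org.apache.spark.sql.functions.concat;
import static org.apache.spark.sql.functions.lit;
import static org.apache.spark.sql.functions.regexp_replace;
import static org.apache.spark.sql.functions.unix_timestamp;

// withColumn does not mutate df; keep the returned Dataset
Dataset<Row> withStamp = df.withColumn("DateStamp",
    unix_timestamp(
        concat(df.col("Date"), lit(" "), regexp_replace(df.col("Time"), "[.]", ":")),
        "dd/MM/yyyy HH:mm:ss")); // assumed format; adjust to the actual Date/Time layout
withStamp.show(false);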

Try this.
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.regexp_replace;
import static org.apache.spark.sql.functions.concat;
import static org.apache.spark.sql.functions.unix_timestamp;
import static org.apache.spark.sql.functions.lit;
//Display date and time
df.withColumn("DateTime",concat(col("Date"),lit(" "),
regexp_replace(col("Time"),"[.]",":"))).show(false);
//Display unix timestamp
df.withColumn("DateTimeUnix",unix_timestamp(concat(col("Date"),lit(" "),
regexp_replace(col("Time"),"[.]",":")),"dd/MM/yyyy HH:mm:ss")).show(false);

Related

Group by inside otherwise clause, Spark Java

I have this process in Spark Java (an IntelliJ app) where I have a problem that I don't know how to resolve yet. First I declare the dataset:
private static final String CONTRA1 = "contra1";
query = "select contra1, ..., eadfinal, , ..., data_date" + FROM + dbSchema + TBLNAME " + WHERE fech = '" + fechjmCto2 + "' AND s1emp=49";
Dataset<Row> jmCto2 = sql.sql(query);
Then I have the calculations; I analyze some fields to assign some literal values. My problem is in the aggregate function:
Dataset<Row> contrCapOk1 = contrCapOk.join(jmCto2,
contrCapOk.col(CONTRA1).equalTo(jmCto2.col(CONTRA1)),LEFT)
.select(contrCapOk.col("*"),
jmCto2.col("ind"),
functions.when(jmCto2.col(CONTRA1).isNull(),functions.lit(NUEVES))
.when(jmCto2.col("ind").equalTo("N"),functions.lit(UNOS))
.otherwise(jmCto2.groupBy(CONTRA1).agg(functions.sum(jmCto2.col("eadfinal")))).as("EAD"),
What I want is to compute the sum in the otherwise part. But when I execute it, the cluster gives me this message in the log:
User class threw exception: java.lang.RuntimeException: Unsupported literal type class org.apache.spark.sql.Dataset [contra1: int, sum(eadfinal): decimal(33,6)]
at line 211, the otherwise line.
Do you know what the problem could be?
Thanks.
You cannot use groupBy and an aggregation function inside a column expression: otherwise() expects a Column (or a literal value), not a Dataset. To do what you want, you have to use a window.
For your case, you can define the following window:
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;
...
WindowSpec window = Window
.partitionBy(CONTRA1)
.rangeBetween(Window.unboundedPreceding(), Window.unboundedFollowing());
Where
partitionBy is the equivalent of groupBy for aggregation
rangeBetween determines which rows of the partition will be used by the aggregation function; here we take all rows
And then you use this window when calling your aggregation function, as follows:
import org.apache.spark.sql.functions;
...
Dataset<Row> contrCapOk1 = contrCapOk.join(
jmCto2,
contrCapOk.col(CONTRA1).equalTo(jmCto2.col(CONTRA1)),
LEFT
)
.select(
contrCapOk.col("*"),
jmCto2.col("ind"),
functions.when(jmCto2.col(CONTRA1).isNull(), functions.lit(NUEVES))
.when(jmCto2.col("ind").equalTo("N"), functions.lit(UNOS))
.otherwise(functions.sum(jmCto2.col("eadfinal")).over(window))
.as("EAD")
);

Java Spark: How to get value from a column which is JSON formatted string for entire dataset?

Need some help here. I am trying to read data from Hive/CSV. There is a column whose type is string and whose value is a JSON-formatted string. It is something like this:
| Column Name A |
|----------------------------------------------------------|
|"{"key":{"data":{"key_1":{"key_A":[123]},"key_2":[456]}}}"|
How can I get the value of key_2 and insert it to a new column?
I tried to create a new function to get the value via Gson:
private BigDecimal getValue(final String columnValue){
    JsonObject jsonObject = JsonParser.parseString(columnValue).getAsJsonObject();
    return jsonObject.get("key").getAsJsonObject().get("data").getAsJsonObject().get("key_2").getAsJsonArray().get(0).getAsBigDecimal();
}
But how can I apply this method to the whole dataset?
I was trying to achieve something like this:
Dataset<Row> ds = sourceDataSet.withColumn("New_column", getValue(sourceDataSet.col("Column Name A")));
But it cannot be done as the data types are different...
Could you please give any suggestions?
Thx!
------------------Update---------------------
As @Mck suggested, I used get_json_object.
As my value contains " characters:
"{"key":{"data":{"key_1":{"key_A":[123]},"key_2":[456]}}}"
I used substring to remove the " and make the new string like this:
{"key":{"data":{"key_1":{"key_A":[123]},"key_2":[456]}}}
Code for the substring:
Dataset<Row> dsA = sourceDataSet.withColumn("Column Name A", expr("substring(`Column Name A`, 2, length(`Column Name A`))"));
I used dsA.show() and confirmed the dataset looks correct.
Then I used the following code to try to do it:
Dataset<Row> ds = dsA.withColumn("New_column",get_json_object(dsA.col("Column Name A"), "$.key.data.key_2[0]"));
which returns null.
However, if the data is this:
{"key":{"data":{"key_2":[456]}}}
I can get value 456.
Any suggestions why I get null?
Thx for the help!
You most likely get null because substring(`Column Name A`, 2, length(`Column Name A`)) only drops the leading quote; the trailing " is still there, so the string is not valid JSON. Trim both quotes and use get_json_object:
ds.withColumn(
"New_column",
get_json_object(
col("Column Name A").substr(lit(2), length(col("Column Name A")) - 2),
"$.key.data.key_2[0]")
).show(false)
+----------------------------------------------------------+----------+
|Column Name A |New_column|
+----------------------------------------------------------+----------+
|"{"key":{"data":{"key_1":{"key_A":[123]},"key_2":[456]}}}"|456 |
+----------------------------------------------------------+----------+
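If you prefer the Gson approach from the question, you can also wrap it in a UDF and apply it with withColumn. A minimal sketch, assuming a SparkSession named spark, the column name from the question, Gson on the classpath, and DecimalType(38, 6) as an example return type:
import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;
import java.math.BigDecimal;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;

// UDF that strips the surrounding quotes and extracts key_2[0] as a decimal
UDF1<String, BigDecimal> extractKey2 = (String raw) -> {
    String json = raw.substring(1, raw.length() - 1); // drop the leading and trailing "
    JsonObject obj = JsonParser.parseString(json).getAsJsonObject();
    return obj.getAsJsonObject("key").getAsJsonObject("data")
        .getAsJsonArray("key_2").get(0).getAsBigDecimal();
};
spark.udf().register("extractKey2", extractKey2, DataTypes.createDecimalType(38, 6));
Dataset<Row> ds = sourceDataSet.withColumn("New_column", callUDF("extractKey2", col("Column Name A")));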

Using a custom UDF with withColumn in a Spark Dataset<Row>; java.lang.String cannot be cast to org.apache.spark.sql.Row

I have a JSON file containing many fields. I read the file using Spark's Dataset in Java.
Spark version 2.2.0
Java JDK 1.8.0_121
Below is the code.
SparkSession spark = SparkSession
.builder()
.appName("Java Spark SQL basic example")
.config("spark.some.config.option", "some-value")
.master("local")
.getOrCreate();
Dataset<Row> df = spark.read().json("jsonfile.json");
I would like to use withColumn function with a custom UDF to add a new column.
UDF1 someudf = new UDF1<Row,String>(){
public String call(Row fin) throws Exception{
String some_str = fin.getAs("String");
return some_str;
}
};
spark.udf().register( "some_udf", someudf, DataTypes.StringType );
df.withColumn( "procs", callUDF( "some_udf", col("columnx") ) ).show();
I get a cast error when I run the above code.
java.lang.String cannot be cast to org.apache.spark.sql.Row
Questions:
1 - Is reading into a dataset of rows the only option? I can convert the df into a df of strings, but then I will not be able to select fields.
2 - I tried but failed to define a user-defined data type; I was not able to register the UDF with this custom data type. Do I need user-defined data types here?
3 - And the main question: how can I cast from String to Row?
Part of the log is copied below:
Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.spark.sql.Row
at Risks.readcsv$1.call(readcsv.java:1)
at org.apache.spark.sql.UDFRegistration$$anonfun$27.apply(UDFRegistration.scala:512)
... 16 more
Caused by: org.apache.spark.SparkException: Failed to execute user defined function($anonfun$27: (string) => string)
Your help will be greatly appreciated.
You are getting that exception because the UDF executes on the column's data type, which here is String, not Row. Suppose we have a Dataset<Row> ds with two columns, col1 and col2, both of String type, and we want to convert the value of col2 to uppercase using a UDF.
We can register and call the UDF like below.
spark.udf().register("toUpper", toUpper, DataTypes.StringType);
ds.select(col("*"),callUDF("toUpper", col("col2"))).show();
Or using withColumn:
ds.withColumn("Upper",callUDF("toUpper", col("col2"))).show();
And the UDF should be like below.
private static UDF1 toUpper = new UDF1<String, String>() {
public String call(final String str) throws Exception {
return str.toUpperCase();
}
};
Improving what @abaghel wrote:
If you use the following import
import org.apache.spark.sql.functions;
then, using withColumn, the code should be as follows:
ds.withColumn("Upper",functions.callUDF("toUpper", ds.col("col2"))).show();

Why does LogisticRegression fail with "IllegalArgumentException: org.apache.spark.ml.linalg.VectorUDT#3bfc3ba7"?

I am trying to run a simple logistic regression program in Spark.
I am getting this error; I tried to include various libraries to solve the problem, but that did not help.
java.lang.IllegalArgumentException: requirement failed: Column pmi
must be of type org.apache.spark.ml.linalg.VectorUDT#3bfc3ba7 but was
actually DoubleType.
This is my dataset CSV:
abc,pmi,sv,h,rh,label
0,4.267034,5,1.618187,5.213683,T
0,4.533071,24,3.540976,5.010458,F
0,6.357766,7,0.440152,5.592032,T
0,4.694365,1,0,6.953864,T
0,3.099447,2,0.994779,7.219463,F
0,1.482493,20,3.221419,7.219463,T
0,4.886681,4,0.919705,5.213683,F
0,1.515939,20,3.92588,6.329699,T
0,2.756057,9,2.841345,6.727063,T
0,3.341671,13,3.022361,5.601656,F
0,4.509981,7,1.538982,6.716471,T
0,4.039118,17,3.206316,6.392757,F
0,3.862023,16,3.268327,4.080564,F
0,5.026574,1,0,6.254859,T
0,3.186627,19,1.880978,8.466048,T
1,6.036507,8,1.376031,4.080564,F
1,5.026574,1,0,6.254859,T
1,-0.936022,23,2.78176,5.601656,F
1,6.435599,3,1.298795,3.408575,T
1,4.769222,3,1.251629,7.201824,F
1,3.190702,20,3.294354,6.716471,F
This is the Edited Code:
import java.io.IOException;
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.classification.LogisticRegressionModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.ml.linalg.VectorUDT;
import org.apache.spark.ml.feature.VectorAssembler;
public class Sp_LogistcRegression {
public void trainLogisticregression(String path, String model_path) throws IOException {
//SparkConf conf = new SparkConf().setAppName("Linear Regression Example");
// JavaSparkContext sc = new JavaSparkContext(conf);
SparkSession spark = SparkSession.builder().appName("Sp_LogistcRegression").master("local[6]").config("spark.driver.memory", "3G").getOrCreate();
Dataset<Row> training = spark
.read()
.option("header", "true")
.option("inferSchema","true")
.csv(path);
String[] myStrings = {"abc",
"pmi", "sv", "h", "rh", "label"};
VectorAssembler VA = new VectorAssembler().setInputCols(myStrings ).setOutputCol("label");
Dataset<Row> transform = VA.transform(training);
LogisticRegression lr = new LogisticRegression().setMaxIter(1000).setRegParam(0.3);
LogisticRegressionModel lrModel = lr.fit( transform);
lrModel.save(model_path);
spark.close();
}
}
This is the test.
import java.io.File;
import java.io.IOException;
import org.junit.Test;
public class Sp_LogistcRegressionTest {
Sp_LogistcRegression spl =new Sp_LogistcRegression ();
@Test
public void test() throws IOException {
String filename = "datas/seg-large.csv";
ClassLoader classLoader = getClass().getClassLoader();
File file1 = new File(classLoader.getResource(filename).getFile());
spl.trainLogisticregression(file1.getAbsolutePath(), "/tmp");
}
}
UPDATE
As per your suggestion, I removed the string-valued attribute from the dataset, which is label. Now I get the following error:
java.lang.IllegalArgumentException: Field "features" does not exist.
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)
at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
at scala.collection.AbstractMap.getOrElse(Map.scala:58)
at org.apache.spark.sql.types.StructType.apply(StructType.scala:263)
at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:40)
at org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.scala:51)
TL;DR Use the VectorAssembler transformer.
Spark MLlib's LogisticRegression requires the features column to be of type VectorUDT (as the error message says).
In your Spark application, you read the dataset from a CSV file and the field you use for features is of a different type.
Please note that I can explain how to use Spark MLlib, not necessarily what Machine Learning as a field of study would recommend in this case.
My recommendation would then be to use a transformer that would map the column to match the requirements of LogisticRegression.
A quick glance at the known transformers in Spark MLlib 2.1.1 gives me VectorAssembler.
A feature transformer that merges multiple columns into a vector column.
That's exactly what you need.
(I use Scala and I leave rewriting the code to Java as your home exercise)
val training: DataFrame = ...
// the following are to show that we're on the same page
val lr = new LogisticRegression().setFeaturesCol("pmi")
scala> lr.fit(training)
java.lang.IllegalArgumentException: requirement failed: Column pmi must be of type org.apache.spark.ml.linalg.VectorUDT#3bfc3ba7 but was actually IntegerType.
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
at org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.scala:51)
at org.apache.spark.ml.classification.Classifier.org$apache$spark$ml$classification$ClassifierParams$$super$validateAndTransformSchema(Classifier.scala:58)
at org.apache.spark.ml.classification.ClassifierParams$class.validateAndTransformSchema(Classifier.scala:42)
at org.apache.spark.ml.classification.ProbabilisticClassifier.org$apache$spark$ml$classification$ProbabilisticClassifierParams$$super$validateAndTransformSchema(ProbabilisticClassifier.scala:53)
at org.apache.spark.ml.classification.ProbabilisticClassifierParams$class.validateAndTransformSchema(ProbabilisticClassifier.scala:37)
at org.apache.spark.ml.classification.LogisticRegression.org$apache$spark$ml$classification$LogisticRegressionParams$$super$validateAndTransformSchema(LogisticRegression.scala:278)
at org.apache.spark.ml.classification.LogisticRegressionParams$class.validateAndTransformSchema(LogisticRegression.scala:265)
at org.apache.spark.ml.classification.LogisticRegression.validateAndTransformSchema(LogisticRegression.scala:278)
at org.apache.spark.ml.Predictor.transformSchema(Predictor.scala:144)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:100)
... 48 elided
"Houston, we've got a problem." Let's fix it by using VectorAssembler then.
import org.apache.spark.ml.feature.VectorAssembler
val vecAssembler = new VectorAssembler().
setInputCols(Array("pmi")).
setOutputCol("features")
val features = vecAssembler.transform(training)
scala> features.show
+---+--------+
|pmi|features|
+---+--------+
| 5| [5.0]|
| 24| [24.0]|
+---+--------+
scala> features.printSchema
root
|-- pmi: integer (nullable = true)
|-- features: vector (nullable = true)
Woohoo! We've got a features column of the proper vector type! Are we done?
Yes. In my case, however, as I use spark-shell for the experimentation, it won't work right away, since lr still uses the wrong pmi column (i.e. of incorrect type).
scala> lr.fit(features)
java.lang.IllegalArgumentException: requirement failed: Column pmi must be of type org.apache.spark.ml.linalg.VectorUDT#3bfc3ba7 but was actually IntegerType.
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
at org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.scala:51)
at org.apache.spark.ml.classification.Classifier.org$apache$spark$ml$classification$ClassifierParams$$super$validateAndTransformSchema(Classifier.scala:58)
at org.apache.spark.ml.classification.ClassifierParams$class.validateAndTransformSchema(Classifier.scala:42)
at org.apache.spark.ml.classification.ProbabilisticClassifier.org$apache$spark$ml$classification$ProbabilisticClassifierParams$$super$validateAndTransformSchema(ProbabilisticClassifier.scala:53)
at org.apache.spark.ml.classification.ProbabilisticClassifierParams$class.validateAndTransformSchema(ProbabilisticClassifier.scala:37)
at org.apache.spark.ml.classification.LogisticRegression.org$apache$spark$ml$classification$LogisticRegressionParams$$super$validateAndTransformSchema(LogisticRegression.scala:278)
at org.apache.spark.ml.classification.LogisticRegressionParams$class.validateAndTransformSchema(LogisticRegression.scala:265)
at org.apache.spark.ml.classification.LogisticRegression.validateAndTransformSchema(LogisticRegression.scala:278)
at org.apache.spark.ml.Predictor.transformSchema(Predictor.scala:144)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:100)
... 48 elided
Let's fix lr to use the features column.
Please note that the features column is the default, so I simply create a new instance of LogisticRegression (I could also use setFeaturesCol).
val lr = new LogisticRegression()
// it works but I've got no label column (with 0s and 1s and hence the issue)
// the main issue was fixed though, wasn't it?
scala> lr.fit(features)
java.lang.IllegalArgumentException: Field "label" does not exist.
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:266)
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:266)
at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
at scala.collection.AbstractMap.getOrElse(Map.scala:59)
at org.apache.spark.sql.types.StructType.apply(StructType.scala:265)
at org.apache.spark.ml.util.SchemaUtils$.checkNumericType(SchemaUtils.scala:71)
at org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.scala:53)
at org.apache.spark.ml.classification.Classifier.org$apache$spark$ml$classification$ClassifierParams$$super$validateAndTransformSchema(Classifier.scala:58)
at org.apache.spark.ml.classification.ClassifierParams$class.validateAndTransformSchema(Classifier.scala:42)
at org.apache.spark.ml.classification.ProbabilisticClassifier.org$apache$spark$ml$classification$ProbabilisticClassifierParams$$super$validateAndTransformSchema(ProbabilisticClassifier.scala:53)
at org.apache.spark.ml.classification.ProbabilisticClassifierParams$class.validateAndTransformSchema(ProbabilisticClassifier.scala:37)
at org.apache.spark.ml.classification.LogisticRegression.org$apache$spark$ml$classification$LogisticRegressionParams$$super$validateAndTransformSchema(LogisticRegression.scala:278)
at org.apache.spark.ml.classification.LogisticRegressionParams$class.validateAndTransformSchema(LogisticRegression.scala:265)
at org.apache.spark.ml.classification.LogisticRegression.validateAndTransformSchema(LogisticRegression.scala:278)
at org.apache.spark.ml.Predictor.transformSchema(Predictor.scala:144)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:100)
... 48 elided
Using Multiple Columns
After the first version of the question was updated, another issue turned up.
scala> va.transform(training)
java.lang.IllegalArgumentException: Data type StringType is not supported.
at org.apache.spark.ml.feature.VectorAssembler$$anonfun$transformSchema$1.apply(VectorAssembler.scala:121)
at org.apache.spark.ml.feature.VectorAssembler$$anonfun$transformSchema$1.apply(VectorAssembler.scala:117)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at org.apache.spark.ml.feature.VectorAssembler.transformSchema(VectorAssembler.scala:117)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
at org.apache.spark.ml.feature.VectorAssembler.transform(VectorAssembler.scala:54)
... 48 elided
The reason is that VectorAssembler accepts the following input column types: all numeric types, boolean type, and vector type. It means that one of the columns passed to VectorAssembler is of StringType.
In your case that column is label, since it's of StringType. See the schema:
scala> training.printSchema
root
|-- bc: integer (nullable = true)
|-- pmi: double (nullable = true)
|-- sv: integer (nullable = true)
|-- h: double (nullable = true)
|-- rh: double (nullable = true)
|-- label: string (nullable = true)
Remove it from the columns you pass to VectorAssembler and the error goes away.
If, however, this or any other column should be included but is of an incorrect type, you have to cast it appropriately (provided the values the column holds allow it). Use the cast method for this:
cast(to: String): Column Casts the column to a different data type, using the canonical string representation of the type. The supported types are: string, boolean, byte, short, int, long, float, double, decimal, date, timestamp.
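For example, a minimal Java sketch of casting a column before assembling; the column name and target type are only illustrative:
import static org.apache.spark.sql.functions.col;
// example of Column.cast: convert the integer sv column to double (illustrative)
Dataset<Row> casted = training.withColumn("sv", col("sv").cast("double"));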
The error message should include the column name(s), but currently it does not, so I filed SPARK-21285 (VectorAssembler should report the column name when the data type used is not supported, https://issues.apache.org/jira/browse/SPARK-21285) to fix it. Vote for it if you think it's worth having in an upcoming Spark version.
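For completeness, a minimal Java sketch of the same fix applied to the code from the question; this is only a sketch, assuming the CSV columns shown above, with the string label indexed via StringIndexer instead of being fed to VectorAssembler:
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.classification.LogisticRegressionModel;
import org.apache.spark.ml.feature.StringIndexer;
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// assemble only the numeric columns into the default "features" column
VectorAssembler assembler = new VectorAssembler()
    .setInputCols(new String[] {"abc", "pmi", "sv", "h", "rh"})
    .setOutputCol("features");
Dataset<Row> features = assembler.transform(training);

// turn the string label (T/F) into a numeric column for LogisticRegression
Dataset<Row> indexed = new StringIndexer()
    .setInputCol("label")
    .setOutputCol("labelIndex")
    .fit(features)
    .transform(features);

LogisticRegression lr = new LogisticRegression()
    .setLabelCol("labelIndex")
    .setMaxIter(1000)
    .setRegParam(0.3);
LogisticRegressionModel lrModel = lr.fit(indexed);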

How to output values without brackets in Spark?

I want to store the dataframe as plain values, but what I get are values with brackets. The code:
val df = sqlContext.read.format("orc").load(filename)
//I skip the processes here, just shows as an example
df.rdd.saveAsTextFile(outputPath)
The data is:
[40fc4ab12a174bf4]
[5572a277df472931]
[5fbce7c5c854996b]
[b4283abd92ea904]
[2f486994064f6875]
What I want is :
40fc4ab12a174bf4
5572a277df472931
5fbce7c5c854996b
b4283abd92ea904
2f486994064f6875
Use spark-csv to write data:
df.write
.format("com.databricks.spark.csv")
.option("header", "false")
.save(outputPath)
Or, using the RDD, just get the first value from each Row:
df.rdd.map(l => l.get(0)).saveAsTextFile(outputPath)
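If the single column is already a string, another option is to select just that column and write it out as plain text. A minimal Java sketch, assuming the column is named value (replace with the actual column name):
// writes one value per line, with no Row brackets (requires a single string column)
df.select("value").write().text(outputPath);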
