Spark 2 / Java 8 / Cassandra 2
I am trying to read some data from Cassandra and then run a GROUP BY query in Spark. I have only 2 columns in my DataFrame:
transdate (Date), origin (String)
Dataset<Row> maxOrigindate = sparks.sql("SELECT origin, transdate, COUNT(*) AS cnt FROM origins GROUP BY (origin,transdate) ORDER BY cnt DESC LIMIT 1");
I get this error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: expression 'origins.`origin`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value)
The GROUP BY issue was solved by removing the parentheses in the GROUP BY clause, as shown below.
Complete code (I am trying to get the maximum number of transactions on a date for an origin/location):
JavaRDD<TransByDate> originDateRDD = javaFunctions(sc).cassandraTable("trans", "trans_by_date", CassandraJavaUtil.mapRowTo(TransByDate.class))
.select(CassandraJavaUtil.column("origin"), CassandraJavaUtil.column("trans_date").as("transdate")).limit((long)100) ;
Dataset<Row> originDF = sparks.createDataFrame(originDateRDD, TransByDate.class);
String[] columns = originDF.columns();
System.out.println("originDF columns: " + columns[0] + " " + columns[1]); // prints: transdate origin
originDF.createOrReplaceTempView("origins");
Dataset<Row> maxOrigindate = sparks.sql("SELECT origin, transdate, COUNT(*) AS cnt FROM origins GROUP BY origin,transdate ORDER BY cnt DESC LIMIT 1");
List list = maxOrigindate.collectAsList(); // exception thrown here
int j = list.size();
originDF columns: transdate origin
public static class TransByDate implements Serializable {
private String origin;
private Date transdate;
public TransByDate() { }
public TransByDate (String origin, Date transdate) {
this.origin = origin;
this.transdate= transdate;
}
public String getOrigin() { return origin; }
public void setOrigin(String origin) { this.origin = origin; }
public Date getTransdate() { return transdate; }
public void setTransdate(Date transdate) { this.transdate = transdate; }
}
Schema
root
|-- transdate: struct (nullable = true)
| |-- date: integer (nullable = false)
| |-- day: integer (nullable = false)
| |-- hours: integer (nullable = false)
| |-- minutes: integer (nullable = false)
| |-- month: integer (nullable = false)
| |-- seconds: integer (nullable = false)
| |-- time: long (nullable = false)
| |-- timezoneOffset: integer (nullable = false)
| |-- year: integer (nullable = false)
|-- origin: string (nullable = true)
Exception
ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 12)
scala.MatchError: Sun Jan 01 00:00:00 PST 2012 (of class java.util.Date)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:256)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:251)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
....
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 12, localhost): scala.MatchError: Sun Jan 01 00:00:00 PST 2012 (of class java.util.Date)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:256)
...
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441)
...
at org.apache.spark.sql.Dataset$$anonfun$collectAsList$1.apply(Dataset.scala:2184)
at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2559)
at org.apache.spark.sql.Dataset.collectAsList(Dataset.scala:2184)
at spark.SparkTest.sqlMaxCount(SparkTest.java:244) -> List list = maxOrigindate.collectAsList();
Caused by: scala.MatchError: Sun Jan 01 00:00:00 PST 2012 (of class java.util.Date)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:256)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:251)
You are getting the error below:
Caused by: scala.MatchError: Sun Jan 01 00:00:00 PST 2012 (of class java.util.Date)
This error occurs because Spark SQL supports the java.sql.Date type, not java.util.Date. Please check the Spark SQL documentation on supported data types. You can also refer to SPARK-2562.
Change the query to:
Dataset<Row> maxOrigindate = sparks.sql(
        "SELECT origin, transdate, COUNT(*) AS cnt FROM origins"
        + " GROUP BY origin, transdate ORDER BY cnt DESC LIMIT 1");
This will work.
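For completeness, here is a minimal sketch of the bean with transdate switched to java.sql.Date (an assumption following from the MatchError above; the rest of the pipeline stays unchanged):
public static class TransByDate implements Serializable {
    private String origin;
    private java.sql.Date transdate; // java.sql.Date instead of java.util.Date

    public TransByDate() { }

    public TransByDate(String origin, java.sql.Date transdate) {
        this.origin = origin;
        this.transdate = transdate;
    }

    public String getOrigin() { return origin; }
    public void setOrigin(String origin) { this.origin = origin; }
    public java.sql.Date getTransdate() { return transdate; }
    public void setTransdate(java.sql.Date transdate) { this.transdate = transdate; }
}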
Related
I have a DataFrame of strings, with each string being a JSON element. I want to convert it to a DataFrame with the JSON fields as columns.
{"StartTime":1649424816686069,"StatusCode":200,"HTTPMethod":"GET","HTTPUserAgent":"Jakarta Commons-HttpClient/3.1"}|
{"StartTime":164981846249877,"StatusCode":200,"HTTPMethod":"GET","HTTPUserAgent":"Jakarta Commons-HttpClient/3.1"}|
{"StartTime":16498172424241095,"StatusCode":200,"HTTPMethod":"GET","HTTPUserAgent":"Jakarta Commons-HttpClient/3.1"}|
Here is my input schema:
Input.printSchema
input: org.apache.spark.sql.DataFrame = [value: string]
root
|-- value: string (nullable = true)
Desired is something like this:
root
|-- StartTime: integer (nullable = true)
|-- StatusCode: integer (nullable = true)
|-- HTTPMethod: string (nullable = true)
|-- HTTPUserAgent: string (nullable = true)
I tried creating a StructType schema and building a DataFrame from it, but it throws an ArrayIndexOutOfBoundsException.
spark.createDataFrame(input,simpleSchema).show
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 116.0 failed 4 times, most recent failure: Lost task 0.3 in stage 116.0 (TID 17471, ip-10-0-62-29.ec2.internal, executor 1030): java.lang.RuntimeException: Error while encoding: java.lang.ArrayIndexOutOfBoundsException: 1
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 0, Channel), StringType), true, false) AS Channel#947
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._
scala> df.printSchema
root
|-- value: string (nullable = true)
scala> df.show(false)
+--------------------------------------------------------------------------------------------------------------------+
|value |
+--------------------------------------------------------------------------------------------------------------------+
|{"StartTime":1649424816686069,"StatusCode":200,"HTTPMethod":"GET","HTTPUserAgent":"Jakarta Commons-HttpClient/3.1"} |
|{"StartTime":164981846249877,"StatusCode":200,"HTTPMethod":"GET","HTTPUserAgent":"Jakarta Commons-HttpClient/3.1"} |
|{"StartTime":16498172424241095,"StatusCode":200,"HTTPMethod":"GET","HTTPUserAgent":"Jakarta Commons-HttpClient/3.1"}|
+--------------------------------------------------------------------------------------------------------------------+
scala> val sch = spark.read.json(df.select("value").as[String].distinct).schema
sch: org.apache.spark.sql.types.StructType = StructType(StructField(HTTPMethod,StringType,true), StructField(HTTPUserAgent,StringType,true), StructField(StartTime,LongType,true), StructField(StatusCode,LongType,true))
scala> val df1 = df.withColumn("jsonData", from_json(col("value"), sch, Map.empty[String, String])).select(col("jsonData.*"))
df1: org.apache.spark.sql.DataFrame = [HTTPMethod: string, HTTPUserAgent: string ... 2 more fields]
scala> df1.show(false)
+----------+------------------------------+-----------------+----------+
|HTTPMethod|HTTPUserAgent |StartTime |StatusCode|
+----------+------------------------------+-----------------+----------+
|GET |Jakarta Commons-HttpClient/3.1|1649424816686069 |200 |
|GET |Jakarta Commons-HttpClient/3.1|164981846249877 |200 |
|GET |Jakarta Commons-HttpClient/3.1|16498172424241095|200 |
+----------+------------------------------+-----------------+----------+
scala> df1.printSchema
root
|-- HTTPMethod: string (nullable = true)
|-- HTTPUserAgent: string (nullable = true)
|-- StartTime: long (nullable = true)
|-- StatusCode: long (nullable = true)
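Since the surrounding questions in this thread use the Java API, here is a rough Java equivalent of the same approach (a sketch, not tested; df and spark mirror the names in the Scala snippet above):
import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.types.StructType;

// Infer the schema from the distinct JSON strings, then parse the value column with from_json.
StructType sch = spark.read()
        .json(df.select("value").as(Encoders.STRING()).distinct())
        .schema();
Dataset<Row> df1 = df
        .withColumn("jsonData", from_json(col("value"), sch))
        .select(col("jsonData.*"));
df1.printSchema();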
I have 2 datasets which I am trying to combine:
Dataset1 (machine):
String machineID;
List<Integer> machineCat; // e.g. (100, 200, 300)
Dataset2 (car):
String carID;
List<Integer> carCat; // e.g. (30, 200, 100, 300)
I basically need to take each item of the machineCat list from dataset1 and check whether it is contained in the carCat list of dataset2. If it matches, combine the 2 datasets as below:
final dataset:
machineID,machineCat(100),carID,carCat(100)
machineID,machineCat(200),carID,carCat(200)
machineID,machineCat(300),carID,carCat(300)
Any help on how to do this using a Dataset join in Java would be appreciated.
I am looking at an option with array_contains (something like below):
machine.foreachPartition((ForeachPartitionFunction<Machine>) iterator -> {
while (iterator.hasNext()) {
Machine machine = iterator.next();
machine.getmachineCat().stream().filter(cat -> {
LOG.info("matched");
spark.sql(
"select * from machineDataset m"
+ " join"
+ " carDataset c "
+ "where array_contains(m.machineCat,cat)");
return true;
});
}
});
import static org.apache.spark.sql.functions.*; // before main class
Machine machine = new Machine("m1",Arrays.asList(100,200,300));
Car car = new Car("c1", Arrays.asList(30,200,100,300));
Dataset<Row> mDF= spark.createDataFrame(Arrays.asList(machine), Machine.class);
mDF.show();
Dataset<Row> cDF= spark.createDataFrame(Arrays.asList(car), Car.class);
cDF.show();
output:
+---------------+---------+
| machineCat|machineId|
+---------------+---------+
|[100, 200, 300]| m1|
+---------------+---------+
+-------------------+-----+
| carCat|catId|
+-------------------+-----+
|[30, 200, 100, 300]| c1|
+-------------------+-----+
then
Dataset<Row> mDF2 = mDF.select(col("machineId"),explode(col("machineCat")).as("machineCat"));
Dataset<Row> cDF2 = cDF.select(col("catId"),explode(col("carCat")).as("carCat"));
Dataset<Row> joinedDF = mDF2.join(cDF2).where(mDF2.col("machineCat").equalTo(cDF2.col("carCat")));
Dataset<Row> finalDF = joinedDF.select(col("machineId"),array(col("machineCat")), col("catId"),array(col("carCat")) );
finalDF.show();
and finally:
+---------+-----------------+-----+-------------+
|machineId|array(machineCat)|catId|array(carCat)|
+---------+-----------------+-----+-------------+
| m1| [100]| c1| [100]|
| m1| [200]| c1| [200]|
| m1| [300]| c1| [300]|
+---------+-----------------+-----+-------------+
root
|-- machineId: string (nullable = true)
|-- array(machineCat): array (nullable = false)
| |-- element: integer (containsNull = true)
|-- catId: string (nullable = true)
|-- array(carCat): array (nullable = false)
| |-- element: integer (containsNull = true)
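If you want to stay closer to the array_contains idea from the question, one alternative (just a sketch, not tested) is to explode only the machine side and use array_contains on the intact carCat column as the join condition; it relies on the same static import of org.apache.spark.sql.functions:
Dataset<Row> mExploded = mDF.select(col("machineId"), explode(col("machineCat")).as("mCat"));
// array_contains(carCat, mCat) keeps every pair where the machine category appears in carCat
Dataset<Row> joined = mExploded.join(cDF, expr("array_contains(carCat, mCat)"));
joined.show();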
I am using spark-sql-2.3.1v and Kafka with Java 8 in my project.
I am trying to convert the byte[] received from the topic into a Dataset on the Kafka consumer side.
Here are the details.
I have
class Company{
String companyName;
Integer companyId;
}
I defined its schema as:
public static final StructType companySchema = new StructType()
        .add("companyName", DataTypes.StringType)
        .add("companyId", DataTypes.IntegerType);
But the message is defined as:
class Message{
private List<Company> companyList;
private String messageId;
}
So I tried to define the message schema as:
StructType messageSchema = new StructType()
.add("companyList", DataTypes.createArrayType(companySchema , false),false)
.add("messageId", DataTypes.StringType);
I sent the Message to the Kafka topic as byte[] using serialization.
I successfully received the message byte[] at the consumer,
which I am now trying to convert into a Dataset. How do I do it?
Dataset<Row> messagesDs = kafkaReceivedStreamDs.select(from_json(col("value").cast("string"), messageSchema ).as("messages")).select("messages.*");
messagesDs.printSchema();
root
|-- companyList: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- companyName: string (nullable = true)
| | |-- companyId: integer (nullable = true)
|-- messageId: string (nullable = true)
Dataset<Row> comapanyListDs = messagesDs.select(explode_outer(col("companyList")));
comapanyListDs.printSchema();
root
|-- col: struct (nullable = true)
| |-- companyName: string (nullable = true)
| |-- companyId: integer (nullable = true)
Dataset<Company> comapanyDs = comapanyListDs.as(Encoders.bean(Company.class));
I am getting this error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'companyName' given input columns: [col];
How do I get the Dataset<Company> records?
Your struct column got named "col" when exploding.
Since your bean class doesn't have a "col" attribute, it fails with the mentioned error:
Exception in thread "main" org.apache.spark.sql.AnalysisException:
cannot resolve 'companyName' given input columns: [col];
You can do the following select to get the relevant fields as plain columns, something like this:
Dataset<Row> comapanyListDs = messagesDs.select(explode_outer(col("companyList"))).
select(col("col.companyName").as("companyName"),col("col.companyId").as("companyId"));
I haven't tested the syntax, but your next step should work as soon as you have plain columns extracted from the struct for each row.
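Putting it together, a sketch of the full flow (untested; it assumes Company is a public JavaBean with getters/setters for companyName and companyId, as Encoders.bean requires):
import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.Encoders;

Dataset<Company> companyDs = messagesDs
        .select(explode_outer(col("companyList")).as("company"))
        .select(col("company.companyName").as("companyName"),
                col("company.companyId").as("companyId"))
        .as(Encoders.bean(Company.class));
companyDs.show();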
I'm trying to validate the data types of a DataFrame by running DESCRIBE as an SQL query, but every time I get datetime as a string.
First I tried with the code below:
SparkSession sparkSession=new SparkSession.Builder().getOrCreate();
Dataset<Row> df=sparkSession.read().option("header","true").option("inferschema","true").format("csv").load("/user/data/*_ecs.csv");
try {
    df.createTempView("data");
    Dataset<Row> sqlDf = sparkSession.sql("Describe data");
    sqlDf.show(300, false);
} catch (AnalysisException e) {
    e.printStackTrace();
}
Output:
+-----------------+---------+-------+
|col_name |data_type|comment|
+-----------------+---------+-------+
|id |int |null |
|symbol |string |null |
|datetime |string |null |
|side |string |null |
|orderQty |int |null |
|price |double |null |
+-----------------+---------+-------+
I also tried a custom schema, but in that case I get an exception when I execute any query other than DESCRIBE:
SparkSession sparkSession = new SparkSession.Builder().getOrCreate();
Dataset<Row> df = sparkSession.read().option("header","true").schema(customeSchema).format("csv").load("/use/data/*_ecs.csv");
try {
    df.createTempView("trade_data");
    Dataset<Row> sqlDf = sparkSession.sql("Describe trade_data");
    sqlDf.show(300, false);
} catch (AnalysisException e) {
    e.printStackTrace();
}
Output:
+--------+---------+-------+
|col_name|data_type|comment|
+--------+---------+-------+
|datetime|timestamp|null |
|price |double |null |
|orderQty|double |null |
+--------+---------+-------+
But if I try any other query, I get the exception below:
Dataset<Row> sqlDf=sparkSession.sql("select DATE(datetime),avg(price),avg(orderQty) from data group by datetime");
java.lang.IllegalArgumentException
at java.sql.Date.valueOf(Date.java:143)
at org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:137)
How can this be solved?
Why is inferSchema not working?
You can find more on this in SPARK-19228: https://issues.apache.org/jira/browse/SPARK-19228
Date-type columns are parsed as String in the current version of Spark.
If you don't want to submit your own schema, one way would be this:
Dataset<Row> df = sparkSession.read().format("csv").option("header","true").option("inferschema", "true").load("example.csv");
df.printSchema(); // check output - 1
df.createOrReplaceTempView("df");
Dataset<Row> df1 = sparkSession.sql("select * , Date(datetime) as datetime_d from df").drop("datetime");
df1.printSchema(); // check output - 2
====================================
output - 1:
root
|-- id: integer (nullable = true)
|-- symbol: string (nullable = true)
|-- datetime: string (nullable = true)
|-- side: string (nullable = true)
|-- orderQty: integer (nullable = true)
|-- price: double (nullable = true)
output - 2:
root
|-- id: integer (nullable = true)
|-- symbol: string (nullable = true)
|-- side: string (nullable = true)
|-- orderQty: integer (nullable = true)
|-- price: double (nullable = true)
|-- datetime_d: date (nullable = true)
I'd choose this method if the number of fields to cast is not high.
If you want to submit your own schema:
List<org.apache.spark.sql.types.StructField> fields = new ArrayList<>();
fields.add(DataTypes.createStructField("datetime", DataTypes.TimestampType, true));
fields.add(DataTypes.createStructField("price",DataTypes.DoubleType,true));
fields.add(DataTypes.createStructField("orderQty",DataTypes.DoubleType,true));
StructType schema = DataTypes.createStructType(fields);
Dataset<Row> df = sparkSession.read().format("csv").option("header", "true").schema(schema).load("example.csv");
df.printSchema(); // output - 1
df.createOrReplaceTempView("df");
Dataset<Row> df1 = sparkSession.sql("select * , Date(datetime) as datetime_d from df").drop("datetime");
df1.printSchema(); // output - 2
======================================
output - 1:
root
|-- datetime: timestamp (nullable = true)
|-- price: double (nullable = true)
|-- orderQty: double (nullable = true)
output - 2:
root
|-- price: double (nullable = true)
|-- orderQty: double (nullable = true)
|-- datetime_d: date (nullable = true)
Since it again casts the column from timestamp to date, I don't see much use for this method, but I'm putting it here in case it's useful to you later.
I am trying to create a new DataFrame (in Spark SQL 1.6.2) after applying a mapPartitions function, as follows:
FlatMapFunction<Iterator<Row>,Row> mapPartitonstoTTF=rows->
{
List<Row> mappedRows=new ArrayList<Row>();
while(rows.hasNext())
{
Row row=rows.next();
Row mappedRow= RowFactory.create(row.getDouble(0),row.getString(1),row.getLong(2),row.getDouble(3),row.getInt(4),row.getString(5),
row.getString(6),row.getInt(7),row.getInt(8),row.getString(9),0L);
mappedRows.add(mappedRow);
}
return mappedRows;
};
JavaRDD<Row> sensorDataDoubleRDD=oldsensorDataDoubleDF.toJavaRDD().mapPartitions(mapPartitonstoTTF);
StructType oldSchema=oldsensorDataDoubleDF.schema();
StructType newSchema =oldSchema.add("TTF",DataTypes.LongType,false);
System.out.println("The new schema is: ");
newSchema.printTreeString();
System.out.println("The old schema is: ");
oldSchema.printTreeString();
DataFrame sensorDataDoubleDF=hc.createDataFrame(sensorDataDoubleRDD, newSchema);
sensorDataDoubleDF.show();
As seen above, I am adding a new LongType column with a value of 0 to the rows using the RowFactory.create() function.
However, I get an exception at the line running sensorDataDoubleDF.show(), as follows:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 117 in stage 26.0 failed 4 times, most recent failure: Lost task 117.3 in stage 26.0 (TID 3249, AUPER01-01-20-08-0.prod.vroc.com.au): scala.MatchError: 1435766400001 (of class java.lang.Long)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:295)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:294)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401)
at org.apache.spark.sql.SQLContext$$anonfun$6.apply(SQLContext.scala:492)
at org.apache.spark.sql.SQLContext$$anonfun$6.apply(SQLContext.scala:492)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1882)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1882)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
The old schema is
root
|-- data_quality: double (nullable = false)
|-- data_sensor: string (nullable = true)
|-- data_timestamp: long (nullable = false)
|-- data_valueDouble: double (nullable = false)
|-- day: integer (nullable = false)
|-- dpnode: string (nullable = true)
|-- dsnode: string (nullable = true)
|-- month: integer (nullable = false)
|-- year: integer (nullable = false)
|-- nodeid: string (nullable = true)
|-- nodename: string (nullable = true)
The new schema is the same as above, with the addition of a TTF column of type LongType:
root
|-- data_quality: double (nullable = false)
|-- data_sensor: string (nullable = true)
|-- data_timestamp: long (nullable = false)
|-- data_valueDouble: double (nullable = false)
|-- day: integer (nullable = false)
|-- dpnode: string (nullable = true)
|-- dsnode: string (nullable = true)
|-- month: integer (nullable = false)
|-- year: integer (nullable = false)
|-- nodeid: string (nullable = true)
|-- nodename: string (nullable = true)
|-- TTF: long (nullable = false)
I would appreciate any help figuring out where I am making a mistake.
You have 11 columns in the old schema but you are mapping only 10. Add row.getString(10) to the RowFactory.create call:
Row mappedRow= RowFactory.create(row.getDouble(0),row.getString(1),row.getLong(2),row.getDouble(3),row.getInt(4),row.getString(5),
row.getString(6),row.getInt(7),row.getInt(8),row.getString(9),row.getString(10),0L);
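As a side note, if the only goal is to append a constant long column, a simpler sketch (untested) skips the mapPartitions step entirely and lets Spark keep the schema in sync:
import static org.apache.spark.sql.functions.lit;

// Appends a TTF column filled with 0L; no manual Row rebuilding or schema handling needed.
DataFrame sensorDataDoubleDF = oldsensorDataDoubleDF.withColumn("TTF", lit(0L));
sensorDataDoubleDF.show();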