Spark 2 / Java 8 / Cassandra 2
I am trying to read some data from Cassandra and then run a GROUP BY query in Spark. I have only 2 columns in my DataFrame:
transdate (Date), origin (String)
Dataset<Row> maxOrigindate = sparks.sql("SELECT origin, transdate, COUNT(*) AS cnt FROM origins GROUP BY (origin,transdate) ORDER BY cnt DESC LIMIT 1");
I get this error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: expression 'origins.`origin`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value)
The GROUP BY issue was solved by removing the parentheses in the GROUP BY clause, as shown below.
Complete code (I am trying to get the maximum number of transactions on a date for an origin/location):
JavaRDD<TransByDate> originDateRDD = javaFunctions(sc).cassandraTable("trans", "trans_by_date", CassandraJavaUtil.mapRowTo(TransByDate.class))
.select(CassandraJavaUtil.column("origin"), CassandraJavaUtil.column("trans_date").as("transdate")).limit((long)100) ;
Dataset<Row> originDF = sparks.createDataFrame(originDateRDD, TransByDate.class);
String[] columns = originDF.columns();
System.out.println("originDF columns: " + columns[0] + " " + columns[1]); // prints: transdate origin
originDF.createOrReplaceTempView("origins");
Dataset<Row> maxOrigindate = sparks.sql("SELECT origin, transdate, COUNT(*) AS cnt FROM origins GROUP BY origin,transdate ORDER BY cnt DESC LIMIT 1");
List list = maxOrigindate.collectAsList(); // exception thrown here
int j = list.size();
originDF columns: transdate origin
public static class TransByDate implements Serializable {
private String origin;
private Date transdate;
public TransByDate() { }
public TransByDate (String origin, Date transdate) {
this.origin = origin;
this.transdate= transdate;
}
public String getOrigin() { return origin; }
public void setOrigin(String origin) { this.origin = origin; }
public Date getTransdate() { return transdate; }
public void setTransdate(Date transdate) { this.transdate = transdate; }
}
Schema
root
|-- transdate: struct (nullable = true)
| |-- date: integer (nullable = false)
| |-- day: integer (nullable = false)
| |-- hours: integer (nullable = false)
| |-- minutes: integer (nullable = false)
| |-- month: integer (nullable = false)
| |-- seconds: integer (nullable = false)
| |-- time: long (nullable = false)
| |-- timezoneOffset: integer (nullable = false)
| |-- year: integer (nullable = false)
|-- origin: string (nullable = true)
Exception
ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 12)
scala.MatchError: Sun Jan 01 00:00:00 PST 2012 (of class java.util.Date)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:256)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:251)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
....
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 12, localhost): scala.MatchError: Sun Jan 01 00:00:00 PST 2012 (of class java.util.Date)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:256)
...
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441)
...
at org.apache.spark.sql.Dataset$$anonfun$collectAsList$1.apply(Dataset.scala:2184)
at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2559)
at org.apache.spark.sql.Dataset.collectAsList(Dataset.scala:2184)
at spark.SparkTest.sqlMaxCount(SparkTest.java:244) -> List list = maxOrigindate.collectAsList();
Caused by: scala.MatchError: Sun Jan 01 00:00:00 PST 2012 (of class java.util.Date)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:256)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:251)
You are getting the error below:
Caused by: scala.MatchError: Sun Jan 01 00:00:00 PST 2012 (of class java.util.Date)
This error occurs because Spark SQL supports the java.sql.Date type, not java.util.Date. Please check the Spark SQL documentation on supported data types. You can also refer to SPARK-2562.
Change the query to:
Dataset<Row> maxOrigindate = sparks.sql(
        "SELECT origin, transdate, COUNT(*) AS cnt FROM origins"
        + " GROUP BY origin, transdate ORDER BY cnt DESC LIMIT 1");
This will work.
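For completeness, here is a minimal sketch of the bean with transdate switched to java.sql.Date (an assumption following from the MatchError above; the rest of the pipeline stays unchanged):
public static class TransByDate implements Serializable {
    private String origin;
    private java.sql.Date transdate; // java.sql.Date instead of java.util.Date

    public TransByDate() { }

    public TransByDate(String origin, java.sql.Date transdate) {
        this.origin = origin;
        this.transdate = transdate;
    }

    public String getOrigin() { return origin; }
    public void setOrigin(String origin) { this.origin = origin; }
    public java.sql.Date getTransdate() { return transdate; }
    public void setTransdate(java.sql.Date transdate) { this.transdate = transdate; }
}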
Related
I have a DataFrame of strings, with each string being a JSON element. I want to convert it to a DataFrame with the JSON fields as columns.
{"StartTime":1649424816686069,"StatusCode":200,"HTTPMethod":"GET","HTTPUserAgent":"Jakarta Commons-HttpClient/3.1"}|
{"StartTime":164981846249877,"StatusCode":200,"HTTPMethod":"GET","HTTPUserAgent":"Jakarta Commons-HttpClient/3.1"}|
{"StartTime":16498172424241095,"StatusCode":200,"HTTPMethod":"GET","HTTPUserAgent":"Jakarta Commons-HttpClient/3.1"}|
Here is my input schema:
Input.printSchema
input: org.apache.spark.sql.DataFrame = [value: string]
root
|-- value: string (nullable = true)
Desired is something like this:
root
|-- StartTime: integer (nullable = true)
|-- StatusCode: integer (nullable = true)
|-- HTTPMethod: string (nullable = true)
|-- HTTPUserAgent: string (nullable = true)
I tried creating a StructType schema and building a DataFrame from it, but it throws an ArrayIndexOutOfBoundsException.
spark.createDataFrame(input,simpleSchema).show
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 116.0 failed 4 times, most recent failure: Lost task 0.3 in stage 116.0 (TID 17471, ip-10-0-62-29.ec2.internal, executor 1030): java.lang.RuntimeException: Error while encoding: java.lang.ArrayIndexOutOfBoundsException: 1
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 0, Channel), StringType), true, false) AS Channel#947
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._
scala> df.printSchema
root
|-- value: string (nullable = true)
scala> df.show(false)
+--------------------------------------------------------------------------------------------------------------------+
|value |
+--------------------------------------------------------------------------------------------------------------------+
|{"StartTime":1649424816686069,"StatusCode":200,"HTTPMethod":"GET","HTTPUserAgent":"Jakarta Commons-HttpClient/3.1"} |
|{"StartTime":164981846249877,"StatusCode":200,"HTTPMethod":"GET","HTTPUserAgent":"Jakarta Commons-HttpClient/3.1"} |
|{"StartTime":16498172424241095,"StatusCode":200,"HTTPMethod":"GET","HTTPUserAgent":"Jakarta Commons-HttpClient/3.1"}|
+--------------------------------------------------------------------------------------------------------------------+
scala> val sch = spark.read.json(df.select("value").as[String].distinct).schema
sch: org.apache.spark.sql.types.StructType = StructType(StructField(HTTPMethod,StringType,true), StructField(HTTPUserAgent,StringType,true), StructField(StartTime,LongType,true), StructField(StatusCode,LongType,true))
scala> val df1 = df.withColumn("jsonData", from_json(col("value"), sch, Map.empty[String, String])).select(col("jsonData.*"))
df1: org.apache.spark.sql.DataFrame = [HTTPMethod: string, HTTPUserAgent: string ... 2 more fields]
scala> df1.show(false)
+----------+------------------------------+-----------------+----------+
|HTTPMethod|HTTPUserAgent |StartTime |StatusCode|
+----------+------------------------------+-----------------+----------+
|GET |Jakarta Commons-HttpClient/3.1|1649424816686069 |200 |
|GET |Jakarta Commons-HttpClient/3.1|164981846249877 |200 |
|GET |Jakarta Commons-HttpClient/3.1|16498172424241095|200 |
+----------+------------------------------+-----------------+----------+
scala> df1.printSchema
root
|-- HTTPMethod: string (nullable = true)
|-- HTTPUserAgent: string (nullable = true)
|-- StartTime: long (nullable = true)
|-- StatusCode: long (nullable = true)
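Since the surrounding questions in this thread use the Java API, here is a rough Java equivalent of the same approach (a sketch, not tested; df and spark mirror the names in the Scala snippet above):
import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.types.StructType;

// Infer the schema from the distinct JSON strings, then parse the value column with from_json.
StructType sch = spark.read()
        .json(df.select("value").as(Encoders.STRING()).distinct())
        .schema();
Dataset<Row> df1 = df
        .withColumn("jsonData", from_json(col("value"), sch))
        .select(col("jsonData.*"));
df1.printSchema();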
I have 2 datasets which I am trying to combine:
Dataset1 (machine):
String machineID;
List<Integer> machineCat; // e.g. (100, 200, 300)
Dataset2 (car):
String carID;
List<Integer> carCat; // e.g. (30, 200, 100, 300)
I basically need to take each item of the machineCat list from dataset1 and check whether it is contained in the carCat list of dataset2. If it matches, combine the 2 datasets as below:
final dataset:
machineID,machineCat(100),carID,carCat(100)
machineID,machineCat(200),carID,carCat(200)
machineID,machineCat(300),carID,carCat(300)
Any help on how to do this using a Dataset join in Java would be appreciated.
I am looking at an option with array_contains (something like below):
machine.foreachPartition((ForeachPartitionFunction<Machine>) iterator -> {
while (iterator.hasNext()) {
Machine machine = iterator.next();
machine.getmachineCat().stream().filter(cat -> {
LOG.info("matched");
spark.sql(
"select * from machineDataset m"
+ " join"
+ " carDataset c "
+ "where array_contains(m.machineCat,cat)");
return true;
});
}
});
import static org.apache.spark.sql.functions.*; // before main class
Machine machine = new Machine("m1",Arrays.asList(100,200,300));
Car car = new Car("c1", Arrays.asList(30,200,100,300));
Dataset<Row> mDF= spark.createDataFrame(Arrays.asList(machine), Machine.class);
mDF.show();
Dataset<Row> cDF= spark.createDataFrame(Arrays.asList(car), Car.class);
cDF.show();
output:
+---------------+---------+
| machineCat|machineId|
+---------------+---------+
|[100, 200, 300]| m1|
+---------------+---------+
+-------------------+-----+
| carCat|catId|
+-------------------+-----+
|[30, 200, 100, 300]| c1|
+-------------------+-----+
then
Dataset<Row> mDF2 = mDF.select(col("machineId"),explode(col("machineCat")).as("machineCat"));
Dataset<Row> cDF2 = cDF.select(col("catId"),explode(col("carCat")).as("carCat"));
Dataset<Row> joinedDF = mDF2.join(cDF2).where(mDF2.col("machineCat").equalTo(cDF2.col("carCat")));
Dataset<Row> finalDF = joinedDF.select(col("machineId"),array(col("machineCat")), col("catId"),array(col("carCat")) );
finalDF.show();
and finally:
+---------+-----------------+-----+-------------+
|machineId|array(machineCat)|catId|array(carCat)|
+---------+-----------------+-----+-------------+
| m1| [100]| c1| [100]|
| m1| [200]| c1| [200]|
| m1| [300]| c1| [300]|
+---------+-----------------+-----+-------------+
root
|-- machineId: string (nullable = true)
|-- array(machineCat): array (nullable = false)
| |-- element: integer (containsNull = true)
|-- catId: string (nullable = true)
|-- array(carCat): array (nullable = false)
| |-- element: integer (containsNull = true)
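If you want to stay closer to the array_contains idea from the question, one alternative (just a sketch, not tested) is to explode only the machine side and use array_contains on the intact carCat column as the join condition; it relies on the same static import of org.apache.spark.sql.functions:
Dataset<Row> mExploded = mDF.select(col("machineId"), explode(col("machineCat")).as("mCat"));
// array_contains(carCat, mCat) keeps every pair where the machine category appears in carCat
Dataset<Row> joined = mExploded.join(cDF, expr("array_contains(carCat, mCat)"));
joined.show();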
I am using spark-sql-2.3.1v and Kafka with Java 8 in my project.
I am trying to convert the byte[] received from the topic into a Dataset on the Kafka consumer side.
Here are the details.
I have
class Company{
String companyName;
Integer companyId;
}
I defined its schema as:
public static final StructType companySchema = new StructType()
        .add("companyName", DataTypes.StringType)
        .add("companyId", DataTypes.IntegerType);
But the message is defined as:
class Message{
private List<Company> companyList;
private String messageId;
}
So I tried to define the message schema as:
StructType messageSchema = new StructType()
.add("companyList", DataTypes.createArrayType(companySchema , false),false)
.add("messageId", DataTypes.StringType);
I sent the Message to the Kafka topic as byte[] using serialization.
I successfully received the message byte[] at the consumer,
which I am now trying to convert into a Dataset. How do I do it?
Dataset<Row> messagesDs = kafkaReceivedStreamDs.select(from_json(col("value").cast("string"), messageSchema ).as("messages")).select("messages.*");
messagesDs.printSchema();
root
|-- companyList: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- companyName: string (nullable = true)
| | |-- companyId: integer (nullable = true)
|-- messageId: string (nullable = true)
Dataset<Row> comapanyListDs = messagesDs.select(explode_outer(col("companyList")));
comapanyListDs.printSchema();
root
|-- col: struct (nullable = true)
| |-- companyName: string (nullable = true)
| |-- companyId: integer (nullable = true)
Dataset<Company> comapanyDs = comapanyListDs.as(Encoders.bean(Company.class));
I am getting this error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'companyName' given input columns: [col];
How do I get the Dataset<Company> records?
Your struct column got named "col" when exploding.
Since your bean class doesn't have a "col" attribute, it fails with the mentioned error:
Exception in thread "main" org.apache.spark.sql.AnalysisException:
cannot resolve 'companyName' given input columns: [col];
You can do the following select to get the relevant fields as plain columns, something like this:
Dataset<Row> comapanyListDs = messagesDs.select(explode_outer(col("companyList"))).
select(col("col.companyName").as("companyName"),col("col.companyId").as("companyId"));
I haven't tested the syntax, but your next step should work as soon as you have plain columns extracted from the struct for each row.
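Putting it together, a sketch of the full flow (untested; it assumes Company is a public JavaBean with getters/setters for companyName and companyId, as Encoders.bean requires):
import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.Encoders;

Dataset<Company> companyDs = messagesDs
        .select(explode_outer(col("companyList")).as("company"))
        .select(col("company.companyName").as("companyName"),
                col("company.companyId").as("companyId"))
        .as(Encoders.bean(Company.class));
companyDs.show();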
I'm trying to validate the data types of a DataFrame by running DESCRIBE as an SQL query, but every time I get datetime as a string.
First I tried with the code below:
SparkSession sparkSession=new SparkSession.Builder().getOrCreate();
Dataset<Row> df=sparkSession.read().option("header","true").option("inferschema","true").format("csv").load("/user/data/*_ecs.csv");
try {
    df.createTempView("data");
    Dataset<Row> sqlDf = sparkSession.sql("Describe data");
    sqlDf.show(300, false);
} catch (AnalysisException e) {
    e.printStackTrace();
}
Output:
+-----------------+---------+-------+
|col_name |data_type|comment|
+-----------------+---------+-------+
|id |int |null |
|symbol |string |null |
|datetime |string |null |
|side |string |null |
|orderQty |int |null |
|price |double |null |
+-----------------+---------+-------+
I also tried a custom schema, but in that case I get an exception when I execute any query other than DESCRIBE:
SparkSession sparkSession = new SparkSession.Builder().getOrCreate();
Dataset<Row> df = sparkSession.read().option("header","true").schema(customeSchema).format("csv").load("/use/data/*_ecs.csv");
try {
    df.createTempView("trade_data");
    Dataset<Row> sqlDf = sparkSession.sql("Describe trade_data");
    sqlDf.show(300, false);
} catch (AnalysisException e) {
    e.printStackTrace();
}
Output:
+--------+---------+-------+
|col_name|data_type|comment|
+--------+---------+-------+
|datetime|timestamp|null |
|price |double |null |
|orderQty|double |null |
+--------+---------+-------+
But if I try any other query, I get the exception below:
Dataset<Row> sqlDf=sparkSession.sql("select DATE(datetime),avg(price),avg(orderQty) from data group by datetime");
java.lang.IllegalArgumentException
at java.sql.Date.valueOf(Date.java:143)
at org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:137)
How can this be solved?
Why is inferSchema not working?
You can find more on this in SPARK-19228: https://issues.apache.org/jira/browse/SPARK-19228
Date-type columns are parsed as String in the current version of Spark.
If you don't want to submit your own schema, one way would be this:
Dataset<Row> df = sparkSession.read().format("csv").option("header","true").option("inferschema", "true").load("example.csv");
df.printSchema(); // check output - 1
df.createOrReplaceTempView("df");
Dataset<Row> df1 = sparkSession.sql("select * , Date(datetime) as datetime_d from df").drop("datetime");
df1.printSchema(); // check output - 2
====================================
output - 1:
root
|-- id: integer (nullable = true)
|-- symbol: string (nullable = true)
|-- datetime: string (nullable = true)
|-- side: string (nullable = true)
|-- orderQty: integer (nullable = true)
|-- price: double (nullable = true)
output - 2:
root
|-- id: integer (nullable = true)
|-- symbol: string (nullable = true)
|-- side: string (nullable = true)
|-- orderQty: integer (nullable = true)
|-- price: double (nullable = true)
|-- datetime_d: date (nullable = true)
I'd choose this method if the number of fields to cast is not high.
If you want to submit your own schema:
List<org.apache.spark.sql.types.StructField> fields = new ArrayList<>();
fields.add(DataTypes.createStructField("datetime", DataTypes.TimestampType, true));
fields.add(DataTypes.createStructField("price",DataTypes.DoubleType,true));
fields.add(DataTypes.createStructField("orderQty",DataTypes.DoubleType,true));
StructType schema = DataTypes.createStructType(fields);
Dataset<Row> df = sparkSession.read().format("csv").option("header", "true").schema(schema).load("example.csv");
df.printSchema(); // output - 1
df.createOrReplaceTempView("df");
Dataset<Row> df1 = sparkSession.sql("select * , Date(datetime) as datetime_d from df").drop("datetime");
df1.printSchema(); // output - 2
======================================
output - 1:
root
|-- datetime: timestamp (nullable = true)
|-- price: double (nullable = true)
|-- orderQty: double (nullable = true)
output - 2:
root
|-- price: double (nullable = true)
|-- orderQty: double (nullable = true)
|-- datetime_d: date (nullable = true)
Since it again casts the column from timestamp to date, I don't see much use for this method, but I'm putting it here in case it's useful to you later.
I am trying to create a new DataFrame (in Spark SQL 1.6.2) after applying a mapPartitions function, as follows:
FlatMapFunction<Iterator<Row>,Row> mapPartitonstoTTF=rows->
{
List<Row> mappedRows=new ArrayList<Row>();
while(rows.hasNext())
{
Row row=rows.next();
Row mappedRow= RowFactory.create(row.getDouble(0),row.getString(1),row.getLong(2),row.getDouble(3),row.getInt(4),row.getString(5),
row.getString(6),row.getInt(7),row.getInt(8),row.getString(9),0L);
mappedRows.add(mappedRow);
}
return mappedRows;
};
JavaRDD<Row> sensorDataDoubleRDD=oldsensorDataDoubleDF.toJavaRDD().mapPartitions(mapPartitonstoTTF);
StructType oldSchema=oldsensorDataDoubleDF.schema();
StructType newSchema =oldSchema.add("TTF",DataTypes.LongType,false);
System.out.println("The new schema is: ");
newSchema.printTreeString();
System.out.println("The old schema is: ");
oldSchema.printTreeString();
DataFrame sensorDataDoubleDF=hc.createDataFrame(sensorDataDoubleRDD, newSchema);
sensorDataDoubleDF.show();
As seen above, I am adding a new LongType column with a value of 0 to the rows using the RowFactory.create() function.
However, I get an exception at the line running sensorDataDoubleDF.show(), as follows:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 117 in stage 26.0 failed 4 times, most recent failure: Lost task 117.3 in stage 26.0 (TID 3249, AUPER01-01-20-08-0.prod.vroc.com.au): scala.MatchError: 1435766400001 (of class java.lang.Long)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:295)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:294)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401)
at org.apache.spark.sql.SQLContext$$anonfun$6.apply(SQLContext.scala:492)
at org.apache.spark.sql.SQLContext$$anonfun$6.apply(SQLContext.scala:492)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1882)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1882)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
The old schema is
root
|-- data_quality: double (nullable = false)
|-- data_sensor: string (nullable = true)
|-- data_timestamp: long (nullable = false)
|-- data_valueDouble: double (nullable = false)
|-- day: integer (nullable = false)
|-- dpnode: string (nullable = true)
|-- dsnode: string (nullable = true)
|-- month: integer (nullable = false)
|-- year: integer (nullable = false)
|-- nodeid: string (nullable = true)
|-- nodename: string (nullable = true)
The new schema is the same as above, with the addition of a TTF column of type LongType:
root
|-- data_quality: double (nullable = false)
|-- data_sensor: string (nullable = true)
|-- data_timestamp: long (nullable = false)
|-- data_valueDouble: double (nullable = false)
|-- day: integer (nullable = false)
|-- dpnode: string (nullable = true)
|-- dsnode: string (nullable = true)
|-- month: integer (nullable = false)
|-- year: integer (nullable = false)
|-- nodeid: string (nullable = true)
|-- nodename: string (nullable = true)
|-- TTF: long (nullable = false)
I would appreciate any help figuring out where I am making a mistake.
You have 11 columns in the old schema but you are mapping only 10. Add row.getString(10) to the RowFactory.create call:
Row mappedRow= RowFactory.create(row.getDouble(0),row.getString(1),row.getLong(2),row.getDouble(3),row.getInt(4),row.getString(5),
row.getString(6),row.getInt(7),row.getInt(8),row.getString(9),row.getString(10),0L);
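As a side note, if the only goal is to append a constant long column, a simpler sketch (untested) skips the mapPartitions step entirely and lets Spark keep the schema in sync:
import static org.apache.spark.sql.functions.lit;

// Appends a TTF column filled with 0L; no manual Row rebuilding or schema handling needed.
DataFrame sensorDataDoubleDF = oldsensorDataDoubleDF.withColumn("TTF", lit(0L));
sensorDataDoubleDF.show();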