Spark Dataset API giving a different result compared to the DataFrame API - Java

I am using Spark 2.1 and have one Hive table in ORC format; the following is the schema:
col_name data_type
tuid string
puid string
ts string
dt string
source string
peer string
# Partition Information
# col_name data_type
dt string
source string
peer string
# Detailed Table Information
Database: test
Owner: test
Create Time: Tue Nov 22 15:25:53 GMT 2016
Last Access Time: Thu Jan 01 00:00:00 GMT 1970
Location: hdfs://apps/hive/warehouse/nis.db/dmp_puid_tuid
Table Type: MANAGED
Table Parameters:
transient_lastDdlTime 1479828353
SORTBUCKETCOLSPREFIX TRUE
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Compressed: No
Storage Desc Parameters:
serialization.format 1
When I apply a filter on this table using a partition column, it works fine and reads only the specific partitions.
val puid = spark.read.table("nis.dmp_puid_tuid")
.as(Encoders.bean(classOf[DmpPuidTuid]))
.filter( """peer = "AggregateKnowledge" and dt = "20170403"""")
and this is my physical plan for this query
== Physical Plan ==
HiveTableScan [tuid#1025, puid#1026, ts#1027, dt#1022, source#1023, peer#1024], MetastoreRelation nis, dmp_puid_tuid, [isnotnull(peer#1024), isnotnull(dt#1022),
(peer#1024 = AggregateKnowledge), (dt#1022 = 20170403)]
but when I use the code below, it reads the entire data into Spark:
val puid = spark.read.table("nis.dmp_puid_tuid")
.as(Encoders.bean(classOf[DmpPuidTuid]))
.filter( tp => tp.getPeer().equals("AggregateKnowledge") && Integer.valueOf(tp.getDt()) >= 20170403)
Physical plan for the above query:
== Physical Plan ==
*Filter <function1>.apply
+- HiveTableScan [tuid#1058, puid#1059, ts#1060, dt#1055, source#1056, peer#1057], MetastoreRelation nis, dmp_puid_tuid
Note: DmpPuidTuid is a Java bean class.

When you pass a Scala function to filter, you prevent the Spark optimizer from seeing which columns of the Dataset are actually used (the optimizer does not try to look inside the compiled code of the function). If you instead pass a column expression, such as col("peer") === "AggregateKnowledge" && col("dt").cast(IntegerType) >= 20170403, the optimizer can see which columns are actually required and adjust the plan accordingly.
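As a minimal sketch (assuming the same table and the same DmpPuidTuid bean as above), the column-expression form would look like this, which lets Catalyst push the partition filters into the HiveTableScan:

import org.apache.spark.sql.Encoders
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.IntegerType

// Column expressions stay visible to the optimizer, so the filters on the
// partition columns peer and dt can be pruned at scan time.
val puid = spark.read.table("nis.dmp_puid_tuid")
  .as(Encoders.bean(classOf[DmpPuidTuid]))
  .filter(col("peer") === "AggregateKnowledge" && col("dt").cast(IntegerType) >= 20170403)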

Related

In Scala, how do I create a column of date arrays of monthly dates between a start and end date?

In Spark Scala, I am trying to create a column that contains an array of monthly dates between a start and an end date (inclusive).
For example, if we have 2018-02-07 and 2018-04-28, the array should contain [2018-02-01, 2018-03-01, 2018-04-01].
Besides the monthly version I would also like to create a quarterly version, i.e. [2018-1, 2018-2].
Example Input Data:
id startDate endDate
1_1 2018-02-07 2018-04-28
1_2 2018-05-06 2018-05-31
2_1 2017-04-13 2017-04-14
Expected (monthly) Output 1:
id startDate endDate dateRange
1_1 2018-02-07 2018-04-28 [2018-02-01, 2018-03-01, 2018-04-01]
1_1 2018-05-06 2018-05-31 [2018-05-01]
2_1 2017-04-13 2017-04-14 [2017-04-01]
Ultimate expected (monthly) output 2:
id Date
1_1 2018-02-01
1_1 2018-03-01
1_1 2018-04-01
1_2 2018-05-01
2_1 2017-04-01
I have Spark 2.1.0.167, Scala 2.10.6, and Java HotSpot 1.8.0_172.
I have tried to implement several answers to similar (day-level) questions on here, but I am struggling with getting a monthly/quarterly version to work.
The code below creates an array from startDate and endDate and explodes it. However, I need to explode a column that contains all the monthly (or quarterly) dates in between.
val df1 = df.select($"id", $"startDate", $"endDate")
  // This just creates an array of the start and end dates
  .withColumn("start_end_array", array($"startDate", $"endDate"))
  .withColumn("start_end_array", explode($"start_end_array"))
Thank you for any leads.
case class MyData(id: String, startDate: String, endDate: String, list: List[String])
val inputData = Seq(("1_1", "2018-02-07", "2018-04-28"), ("1_2", "2018-05-06", "2018-05-31"), ("2_2", "2017-04-13", "2017-04-14"))
inputData.map(x => {
  import java.time.temporal._
  import java.time._
  val startDate = LocalDate.parse(x._2)
  val endDate = LocalDate.parse(x._3)
  // count whole months between the first-of-month boundaries so partial months are kept
  val diff = ChronoUnit.MONTHS.between(startDate.withDayOfMonth(1), endDate.withDayOfMonth(1))
  var result = List[String]()
  for (index <- diff.toInt to 0 by -1) {
    // plusMonths rolls over year boundaries; toString yields the ISO yyyy-MM-dd form
    result = startDate.withDayOfMonth(1).plusMonths(index).toString :: result
  }
  MyData(x._1, x._2, x._3, result)
}).foreach(println)
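To get from there to the second (exploded) output, one possible follow-up is to keep the mapped records instead of printing them and turn them into a Dataset. A sketch, assuming a SparkSession named spark is in scope and that the mapped Seq[MyData] above is bound to a value called expanded (both names are assumptions):

import org.apache.spark.sql.functions.explode
import spark.implicits._

// One row per (id, month): explode the per-record month list built above.
val monthly = expanded.toDS().select($"id", explode($"list").as("Date"))
monthly.show()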

Apache Flink 1.5.2: Rowtime timestamp is null

I am running some queries with the following code:
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
DataStream<Row> ds = SourceHelp.builder().env(env).consumer010(MyKafka.builder().build().kafkaWithWaterMark2())
.rowTypeInfo(MyRowType.builder().build().typeInfo())
.build().source4();
//,proctime.proctime,rowtime.rowtime
String sql1 = "select a,b,max(rowtime)as rowtime from user_device group by a,b";
DataStream<Row> ds2 = TableHelp.builder().tableEnv(tableEnv).tableName("user_device").fields("a,b,rowtime.rowtime")
.rowTypeInfo(MyRowType.builder().build().typeInfo13())
.sql(sql1).in(ds).build().result();
ds2.print();
// String sql2 = "select a,count(b) as b from user_device2 group by a";
String sql2 = "select a,count(b) as b,HOP_END(rowtime,INTERVAL '5' SECOND,INTERVAL '30' SECOND) as c from user_device2 group by HOP(rowtime, INTERVAL '5' SECOND, INTERVAL '30' SECOND),a";
DataStream<Row> ds3 = TableHelp.builder().tableEnv(tableEnv).tableName("user_device2").fields("a,b,rowtime.rowtime")
.rowTypeInfo(MyRowType.builder().build().typeInfo14())
.sql(sql2).in(ds2).build().result();
ds3.print();
env.execute("test");
Note: for sql1, I use the max function on rowtime; it does not work, and the following exception is thrown:
Exception in thread "main" org.apache.flink.runtime.client.JobExecutionException: java.lang.RuntimeException: Rowtime timestamp is null. Please make sure that a proper TimestampAssigner is defined and the stream environment uses the EventTime time characteristic.
    at org.apache.flink.runtime.minicluster.MiniCluster.executeJobBlocking(MiniCluster.java:625)
    at org.apache.flink.streaming.api.environment.LocalStreamEnvironment.execute(LocalStreamEnvironment.java:123)
    at com.aicaigroup.water.WaterTest.testRowtimeWithMoreSqls5(WaterTest.java:158)
    at com.aicaigroup.water.WaterTest.main(WaterTest.java:20)
Caused by: java.lang.RuntimeException: Rowtime timestamp is null. Please make sure that a proper TimestampAssigner is defined and the stream environment uses the EventTime time characteristic.
    at DataStreamSourceConversion$24.processElement(Unknown Source)
    at org.apache.flink.table.runtime.CRowOutputProcessRunner.processElement(CRowOutputProcessRunner.scala:67)
    at org.apache.flink.streaming.api.operators.ProcessOperator.processElement(ProcessOperator.java:66)
    at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:558)
    at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:533)
    at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:513)
    at org.apache.flink.streaming.runtime.tasks.OperatorChain$BroadcastingOutputCollector.collect(OperatorChain.java:628)
    at org.apache.flink.streaming.runtime.tasks.OperatorChain$BroadcastingOutputCollector.collect(OperatorChain.java:581)
    at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:679)
    at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:657)
    at org.apache.flink.streaming.api.operators.TimestampedCollector.collect(TimestampedCollector.java:51)
    at com.aicaigroup.TableHelp$1.processElement(TableHelp.java:42)
    at com.aicaigroup.TableHelp$1.processElement(TableHelp.java:39)
    at org.apache.flink.streaming.api.operators.ProcessOperator.processElement(ProcessOperator.java:66)
    at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:558)
    at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:533)
    at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:513)
    at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:679)
    at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:657)
    at org.apache.flink.streaming.api.operators.StreamMap.processElement(StreamMap.java:41)
    at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:558)
    at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:533)
    at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:513)
    at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:679)
    at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:657)
    at org.apache.flink.streaming.api.operators.TimestampedCollector.collect(TimestampedCollector.java:51)
    at org.apache.flink.table.runtime.aggregate.GroupAggProcessFunction.processElement(GroupAggProcessFunction.scala:151)
    at org.apache.flink.table.runtime.aggregate.GroupAggProcessFunction.processElement(GroupAggProcessFunction.scala:39)
    at org.apache.flink.streaming.api.operators.LegacyKeyedProcessOperator.processElement(LegacyKeyedProcessOperator.java:88)
    at org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:202)
    at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:104)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:306)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:703)
    at java.lang.Thread.run(Thread.java:748)
2018-09-17 09:51:53.679 [Kafka 0.10 Fetcher for Source: Custom Source -> Map -> from: (a, b, rowtime) -> select: (a, b, CAST(rowtime) AS rowtime) (2/8)] INFO o.a.kafka.clients.consumer.internals.AbstractCoordinator - Discovered coordinator 172.16.11.91:9092 (id: 2147483647 rack: null) for group test.
Then I tried updating sql1 to "select a,b,rowtime from user_device", and it works. So how can I fix the error? The first SQL should use group by, and the second SQL should use rowtime with a time window. Thanks.
I started with Flink 1.6 and ran into a similar problem to yours. I solved it with these steps (a sketch of the first two follows below):
1. Use assignTimestampsAndWatermarks; the default BoundedOutOfOrdernessTimestampExtractor implementation is enough. You need to write the extractTimestamp function to extract the timestamp value and declare the allowed out-of-orderness interval in the constructor.
2. Append ,proctime.proctime,rowtime.rowtime at the end of the fields (I'm using fromDataStream in Flink 1.6 to convert the stream to a table).
3. If you want to use an existing field as the rowtime, for example when the data source fields are "a,clicktime,c", you can declare "a,clicktime.rowtime,c".
Hope this helps.
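A minimal sketch of the first two steps, using the Flink 1.5/1.6 Scala API (the field index, the 5-second bound, and the table/field names are assumptions; ds is assumed to be the source DataStream[Row] and tableEnv a StreamTableEnvironment, and the Java API is analogous):

import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.table.api.scala._
import org.apache.flink.types.Row

env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

// Extract the event timestamp (assumed here to be epoch millis in field 2)
// and allow 5 seconds of out-of-orderness.
val withTimestamps: DataStream[Row] = ds.assignTimestampsAndWatermarks(
  new BoundedOutOfOrdernessTimestampExtractor[Row](Time.seconds(5)) {
    override def extractTimestamp(row: Row): Long =
      row.getField(2).asInstanceOf[Long]
  })

// Declaring the last field as rowtime.rowtime exposes it as the table's event-time
// attribute, so windowed SQL such as HOP(rowtime, ...) can use it.
tableEnv.registerDataStream("user_device", withTimestamps, 'a, 'b, 'rowtime.rowtime)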

What is 'no viable alternative at input' for spark sql?

I have a DF with a startTimeUnix column (of type Number in Mongo) that contains epoch timestamps. I want to query the DF on this column, but I want to pass an EST datetime. I went through multiple hoops to test the following in spark-shell:
val df = Seq(("1", "1523937600000"), ("2", "1523941200000"),("3","1524024000000")).toDF("id", "unix")
df.filter($"unix" > java.time.ZonedDateTime.parse("04/17/2018 01:00:00", java.time.format.DateTimeFormatter.ofPattern ("MM/dd/yyyy HH:mm:ss").withZone ( java.time.ZoneId.of("America/New_York"))).toEpochSecond()*1000).collect()
Output:
Array([3,1524024000000])
Since the java.time functions work, I am passing the same thing to spark-submit, where, while retrieving the data from Mongo, the filter query goes like this:
startTimeUnix < (java.time.ZonedDateTime.parse(${LT}, java.time.format.DateTimeFormatter.ofPattern('MM/dd/yyyyHHmmss').withZone(java.time.ZoneId.of('America/New_York'))).toEpochSecond()*1000) AND startTimeUnix > (java.time.ZonedDateTime.parse(${GT}, java.time.format.DateTimeFormatter.ofPattern('MM/dd/yyyyHHmmss').withZone(java.time.ZoneId.of('America/New_York'))).toEpochSecond()*1000)
However, I keep getting the following error:
Caused by: org.apache.spark.sql.catalyst.parser.ParseException:
no viable alternative at input '(java.time.ZonedDateTime.parse(04/18/2018000000, java.time.format.DateTimeFormatter.ofPattern('MM/dd/yyyyHHmmss').withZone('(line 1, pos 138)
== SQL ==
startTimeUnix < (java.time.ZonedDateTime.parse(04/18/2018000000, java.time.format.DateTimeFormatter.ofPattern('MM/dd/yyyyHHmmss').withZone(java.time.ZoneId.of('America/New_York'))).toEpochSecond()*1000).toString() AND startTimeUnix > (java.time.ZonedDateTime.parse(04/17/2018000000, java.time.format.DateTimeFormatter.ofPattern('MM/dd/yyyyHHmmss').withZone(java.time.ZoneId.of('America/New_York'))).toEpochSecond()*1000).toString()
at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:217)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:114)
at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseExpression(ParseDriver.scala:43)
at org.apache.spark.sql.Dataset.filter(Dataset.scala:1315)
Somewhere I read that the error means a mismatched data type. I tried applying toString to the output of the date conversion, with no luck.
The filter string is handed to Spark's SQL expression parser, which cannot parse Java method calls such as java.time.ZonedDateTime.parse(...); that is what the "no viable alternative at input" error is reporting. You can use Spark DataFrame functions instead:
scala> val df = Seq(("1", "1523937600000"), ("2", "1523941200000"),("3","1524024000000")).toDF("id", "unix")
df: org.apache.spark.sql.DataFrame = [id: string, unix: string]
scala> df.filter($"unix" > unix_timestamp()*1000).collect()
res5: Array[org.apache.spark.sql.Row] = Array([3,1524024000000])
scala> df.withColumn("unixinEST"
,from_utc_timestamp(
from_unixtime(unix_timestamp()),
"EST"))
.show()
+---+-------------+-------------------+
| id| unix| unixinEST|
+---+-------------+-------------------+
| 1|1523937600000|2018-04-18 06:13:19|
| 2|1523941200000|2018-04-18 06:13:19|
| 3|1524024000000|2018-04-18 06:13:19|
+---+-------------+-------------------+
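Another option, building directly on the snippet that already works in spark-shell: compute the epoch value in Scala first and pass only the resulting number to filter, so the SQL parser never sees the java.time call (the cutoff string below is just the example value from the question):

import java.time.{ZoneId, ZonedDateTime}
import java.time.format.DateTimeFormatter

// Evaluate the java.time conversion on the driver; only the numeric literal
// ends up in the filter expression.
val cutoffMillis = ZonedDateTime.parse(
    "04/17/2018 01:00:00",
    DateTimeFormatter.ofPattern("MM/dd/yyyy HH:mm:ss").withZone(ZoneId.of("America/New_York"))
  ).toEpochSecond * 1000

df.filter($"unix" > cutoffMillis).collect()
// or, if the condition has to stay a SQL string:
df.filter(s"unix > $cutoffMillis").collect()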

Obtain Master public DNS value from AWS EMR Cluster using the Java SDK

I need to obtain the master public DNS value via the Java SDK. The only information that I'll have at the start of the application is the ClusterName which is static.
Thus far I've been able to pull out all the other information that I need except this, and this, unfortunately, is vital for the application to be a success.
This is the code that I'm currently working with:
List<ClusterSummary> summaries = clusters.getClusters();
for (ClusterSummary cs: summaries) {
if (cs.getName().equals("test") && WHITELIST.contains(cs.getStatus().getState())) {
ListInstancesResult instances = emr.listInstances(new ListInstancesRequest().withClusterId(cs.getId()));
clusterHostName = instances.getInstances().get(0).toString();
jobFlowId = cs.getId();
}
}
I've removed the get for PublicIpAddress as I wanted the full toString for testing. I should be clear that this method does give me the DNS names that I need, but I have no way of differentiating between them.
If my EMR cluster has 4 machines, I don't know which position in the list that instance will be at. For my basic trial I've only got two machines, 1 master and a worker; .get(0) has returned the values for both the master and the worker on successive runs.
The information that I'm able to obtain from these is below. The only option I can see at the moment is to use the 'ReadyDateTime' as an identifier, as the master 'should' always be ready first, but this feels hacky and I was hoping for a cleaner solution.
{Id: id,
Ec2InstanceId: id,
PublicDnsName: ec2-54--143.compute-1.amazonaws.com,
PublicIpAddress: 54..143,
PrivateDnsName: ip-10--158.ec2.internal,
PrivateIpAddress: 10..158,
Status: {State: RUNNING,StateChangeReason: {},
Timeline: {CreationDateTime: Tue Feb 21 09:18:08 GMT 2017,
ReadyDateTime: Tue Feb 21 09:25:11 GMT 2017,}},
InstanceGroupId: id,
EbsVolumes: []}
{Id: id,
Ec2InstanceId: id,
PublicDnsName: ec2-54--33.compute-1.amazonaws.com,
PublicIpAddress: 54..33,
PrivateDnsName: ip-10--95.ec2.internal,
PrivateIpAddress: 10..95,
Status: {State: RUNNING,StateChangeReason: {},
Timeline: {CreationDateTime: Tue Feb 21 09:18:08 GMT 2017,
ReadyDateTime: Tue Feb 21 09:22:48 GMT 2017,}},
InstanceGroupId: id
EbsVolumes: []}
Don't use ListInstances. Instead, use DescribeCluster, which returns as one of the fields MasterPublicDnsName.
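For illustration, a bare sketch of that call (written in Scala against the Java SDK v1; the client and the cluster id are assumed to come from the surrounding code, e.g. cs.getId() from the loop above):

import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce
import com.amazonaws.services.elasticmapreduce.model.DescribeClusterRequest

// DescribeCluster exposes the master's public DNS directly, so there is no need
// to guess which entry returned by ListInstances is the master.
def masterPublicDns(emr: AmazonElasticMapReduce, clusterId: String): String =
  emr.describeCluster(new DescribeClusterRequest().withClusterId(clusterId))
    .getCluster
    .getMasterPublicDnsName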
To expand on what was mentioned by Jonathon:
AmazonEC2Client ec2 = new AmazonEC2Client(cred);
DescribeInstancesResult describeInstancesResult = ec2.describeInstances(new DescribeInstancesRequest().withInstanceIds(clusterInstanceIds));
List<Reservation> reservations = describeInstancesResult.getReservations();
for (Reservation res : reservations) {
for (GroupIdentifier group : res.getGroups()) {
if (group.getGroupName().equals("ElasticMapReduce-master")) { // yaaaaaaaaah, Wahay!
masterDNS = res.getInstances().get(0).getPublicDnsName();
}
}
}
Below is the working code to get the public DNS name.
AWSCredentials credentials_profile = new DefaultAWSCredentialsProviderChain().getCredentials();
AmazonElasticMapReduceClient emr = new AmazonElasticMapReduceClient(credentials_profile);
Region euWest1 = Region.getRegion(Regions.US_EAST_1);
emr.setRegion(euWest1);
DescribeClusterFunction fun = new DescribeClusterFunction(emr);
DescribeClusterResult res = fun.apply(new DescribeClusterRequest().withClusterId(clusterId));
String publicDNSName = res.getCluster().getMasterPublicDnsName();

dbf CREATE TABLE throws java.sql.SQLException: Syntax error: Stopped parse at

I have a dbf file, and I can see in the viewer that the types of the interesting fields are L (I suppose it is a Logical type) and M (I suppose it's a Memo type).
I am trying to recreate the dbf template using dbf_jdbc, with a table like this:
private static final String TABLE = "create table SAMPLE ( "
+ " SM Logical, "
+ " PRIM MEMO " + ")";
...
String url = "jdbc:DBF:/C:\\TEST";
Connection dbfConn = null;
PreparedStatement ps = null;
...
// instantiate it
Class.forName( "com.hxtt.sql.dbf.DBFDriver" ).newInstance();
dbfConn = DriverManager.getConnection( url, properties );
Statement stmt = dbfConn.createStatement();
stmt.executeUpdate(TABLE);
But I'm getting the following errors:
java.sql.SQLException: Syntax error: Stopped parse at MEMO
java.sql.SQLException: Syntax error: Stopped parse at LOGICAL
The reason is the type names, because when I use varchar, everything is fine.
Dbf_jdbc version (from jar manifest file):
Manifest-Version: 1.0
Created-By: HXTT Version Robot
Main-Class: com.hxtt.sql.admin.Admin
Name: com/hxtt/sql/dbf/
Specification-Title: HXTT DBF JDBC 3.0 Package
Implementation-Title: com.hxtt.sql.dbf
Specification-Version: 4.2.056 on April 01, 2009
Specification-Vendor: Hongxin Technology & Trade Ltd.
Comment: JDBC 3.0 Package for Xbase database
Implementation-Version: 4.2.056 on April 01, 2009
Implementation-Vendor: Hongxin Technology & Trade Ltd.
Implementation-URL: http://www.hxtt.com/dbf.html
Name: com/hxtt/sql/admin/
Specification-Title: HXTT Database Admin
Implementation-Title: com.hxtt.sql.admin
Specification-Vendor: Hongxin Technology & Trade Ltd.
Specification-Version: 0.5 on April 01, 2009
Comment: HXTT Database Admin
Implementation-Version: 0.5 on April 01, 2009
Implementation-Vendor: Hongxin Technology & Trade Ltd.
Implementation-URL: http://www.hxtt.com/dbf/dbadmin.html
So my question is: which SQL types should I use so that I can create the dbf template from code, and so that when I open the file in a dbf viewer I see the letters M and L as the type short names?
private static final String TABLE = "create table SAMPLE ( "
+ " SM BIT, "
+ " PRIM longvarchar" + ")";
See "SQL Data Types for Create Table" at http://www.hxtt.com/dbf/sqlsyntax.html#createtable
I could not find the reason for the problem with dbf_jdbc, so I used the javadbf framework to create the template instead. The following example illustrates it:
File file = new File( filePathName );
DBFWriter dbfWriter = new DBFWriter( file );
dbfWriter.setCharactersetName( "cp866" );
DBFField[] fields = new DBFField[ 29 ];
fields[ 0 ] = new DBFField();
fields[ 0 ].setDataType( DBFField.FIELD_TYPE_L );
fields[ 0 ].setName( "SM" );
...
fields[ 19 ] = new DBFField();
fields[ 19 ].setDataType( DBFField.FIELD_TYPE_M );
fields[ 19 ].setName( "PRIM" );
I don't know about the Java-based JDBC driver, but an implied abbreviated version is to just use "L" or "M" respectively:
create table SAMPLE ( SM L, PRIM M )
Additionally, for some other types:
C(?) = character (?=length of character based field)
I = integer
D = date (only date portion)
T = date/time
B(?) = double(?=decimal precision -- ex: B(3) = up to 3 decimals )
dBase III files support:
Char name C(40)
Date birth D
Logical member L
Memo desc M
Numeric rate N(6, 2)
The first letter of the type is what you want to use.
Additionally, other dbf formats allow:
Currency price Y (note Y, not C)
DateTime appt T (note T, not D)
Double mass B (note B, not D)
Float (same as Numeric)
General bin_data G
Integer age I
Picture photo P
Currency, Double, Integer, General, and Picture all store the data as binary, while the others store the data as text.
