I have a DataFrame with a startTimeUnix column (of type Number in Mongo) that contains epoch timestamps. I want to query the DataFrame on this column, but I want to pass an EST datetime. I went through multiple hoops to test the following in spark-shell:
val df = Seq(("1", "1523937600000"), ("2", "1523941200000"),("3","1524024000000")).toDF("id", "unix")
df.filter($"unix" > java.time.ZonedDateTime.parse("04/17/2018 01:00:00", java.time.format.DateTimeFormatter.ofPattern ("MM/dd/yyyy HH:mm:ss").withZone ( java.time.ZoneId.of("America/New_York"))).toEpochSecond()*1000).collect()
Output:
= Array([3,1524024000000])
Since the java.time functions work, I pass the same expression to spark-submit, where the filter query used while retrieving the data from Mongo looks like this:
startTimeUnix < (java.time.ZonedDateTime.parse(${LT}, java.time.format.DateTimeFormatter.ofPattern('MM/dd/yyyyHHmmss').withZone(java.time.ZoneId.of('America/New_York'))).toEpochSecond()*1000) AND startTimeUnix > (java.time.ZonedDateTime.parse(${GT}, java.time.format.DateTimeFormatter.ofPattern('MM/dd/yyyyHHmmss').withZone(java.time.ZoneId.of('America/New_York'))).toEpochSecond()*1000)
However, I keep getting the following error:
Caused by: org.apache.spark.sql.catalyst.parser.ParseException:
no viable alternative at input '(java.time.ZonedDateTime.parse(04/18/2018000000, java.time.format.DateTimeFormatter.ofPattern('MM/dd/yyyyHHmmss').withZone('(line 1, pos 138)
== SQL ==
startTimeUnix < (java.time.ZonedDateTime.parse(04/18/2018000000, java.time.format.DateTimeFormatter.ofPattern('MM/dd/yyyyHHmmss').withZone(java.time.ZoneId.of('America/New_York'))).toEpochSecond()*1000).toString() AND startTimeUnix > (java.time.ZonedDateTime.parse(04/17/2018000000, java.time.format.DateTimeFormatter.ofPattern('MM/dd/yyyyHHmmss').withZone(java.time.ZoneId.of('America/New_York'))).toEpochSecond()*1000).toString()
at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:217)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:114)
at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseExpression(ParseDriver.scala:43)
at org.apache.spark.sql.Dataset.filter(Dataset.scala:1315)
Somewhere I read that this error indicates a mismatched data type. I tried applying toString to the output of the date conversion, with no luck.
You can use Spark DataFrame functions. The string you pass to filter is parsed by Spark's SQL parser, which does not understand java.time method calls; that is where the ParseException comes from.
scala> val df = Seq(("1", "1523937600000"), ("2", "1523941200000"),("3","1524024000000")).toDF("id", "unix")
df: org.apache.spark.sql.DataFrame = [id: string, unix: string]
scala> df.filter($"unix" > unix_timestamp()*1000).collect()
res5: Array[org.apache.spark.sql.Row] = Array([3,1524024000000])
scala> df.withColumn("unixinEST",
         from_utc_timestamp(
           from_unixtime(unix_timestamp()),
           "EST"))
         .show()
+---+-------------+-------------------+
| id| unix| unixinEST|
+---+-------------+-------------------+
| 1|1523937600000|2018-04-18 06:13:19|
| 2|1523941200000|2018-04-18 06:13:19|
| 3|1524024000000|2018-04-18 06:13:19|
+---+-------------+-------------------+
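If the right-hand side of the comparison is a constant, another option is to compute the epoch value once on the driver (with the same java.time calls that already work) and pass only the resulting number to Spark, so the SQL parser never sees the java.time expression. A minimal sketch in the same spark-shell session, using the example data above:
import java.time.{ZonedDateTime, ZoneId}
import java.time.format.DateTimeFormatter

// Parse the EST datetime on the driver; only the numeric result reaches Spark.
val fmt = DateTimeFormatter.ofPattern("MM/dd/yyyy HH:mm:ss")
  .withZone(ZoneId.of("America/New_York"))
val gtMillis = ZonedDateTime.parse("04/17/2018 01:00:00", fmt).toEpochSecond * 1000

df.filter($"unix" > gtMillis).collect()
// The same numeric value can be interpolated into the Mongo filter string,
// e.g. s"startTimeUnix > $gtMillis", instead of the java.time expression.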
I am using the function below to list directories. It works in Azure Databricks, but when I add it to an IntelliJ project, it cannot resolve the "union" keyword. Do I need to import anything here?
def listLeafDirectories(path: String): Array[String] =
  dbutils.fs.ls(path).map(file => {
    // Work around double encoding bug
    val path = file.path.replace("%25", "%").replace("%25", "%")
    if (file.isDir) listLeafDirectories(path)
    else Array[String](path.substring(0, path.lastIndexOf("/") + 1))
  }).reduceOption(_ union _).getOrElse(Array()).distinct
The same code executes successfully in an ADB (Azure Databricks) notebook.
Sorry, my mistake.
In the screenshot highlighting the error, the function has a recurse flag which I was not passing in the recursive call (within the same function).
So the version below works:
def listDirectories(dir: String, recurse: Boolean): Array[String] = {
  dbutils.fs.ls(dir).map(file => {
    val path = file.path.replace("%25", "%").replace("%25", "%")
    if (file.isDir) listDirectories(path, recurse)
    else Array[String](path.substring(0, path.lastIndexOf("/") + 1))
  }).reduceOption(_ union _).getOrElse(Array()).distinct
}
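For reference, a minimal usage sketch (the path below is just a placeholder):
// Hypothetical path; replace with your own directory or mount point
val leafDirs = listDirectories("dbfs:/mnt/raw/", recurse = true)
leafDirs.foreach(println)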
I asked a somewhat similar question earlier today. Here it is.
In short: I need to do record linkage between two large datasets (1.6M and 6M records). I was going to use Spark, thinking the Cartesian product I was warned about would not be such a big problem. But it is. It hit performance so hard that the linkage process didn't finish in 7 hours.
Is there another library/framework/tool for doing this more effectively? Or could the performance of the solution below be improved?
The code I ended up with:
object App {
def left(col: Column, n: Int) = {
assert(n > 0)
substring(col, 1, n)
}
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder()
.master("local[4]")
.appName("MatchingApp")
.getOrCreate()
import spark.implicits._
val a = spark.read
.format("csv")
.option("header", true)
.option("delimiter", ";")
.load("/home/helveticau/workstuff/a.csv")
.withColumn("FULL_NAME", concat_ws(" ", col("FIRST_NAME"), col("LAST_NAME")))
.withColumn("BIRTH_DATE", to_date(col("BIRTH_DATE"), "yyyy-MM-dd"))
val b = spark.read
.format("csv")
.option("header", true)
.option("delimiter", ";")
.load("/home/helveticau/workstuff/b.txt")
.withColumn("FULL_NAME", concat_ws(" ", col("FIRST_NAME"), col("LAST_NAME")))
.withColumn("BIRTH_DATE", to_date(col("BIRTH_DATE"), "dd.MM.yyyy"))
// #formatter:off
val condition = a
.col("FULL_NAME").contains(b.col("FIRST_NAME"))
.and(a.col("FULL_NAME").contains(b.col("LAST_NAME")))
.and(a.col("BIRTH_DATE").equalTo(b.col("BIRTH_DATE"))
.or(a.col("STREET").startsWith(left(b.col("STR"), 3))))
// #formatter:on
val startMillis = System.currentTimeMillis();
val res = a.join(b, condition, "left_outer")
val count = res
.filter(col("B_ID").isNotNull)
.count()
println(s"Count: $count")
val executionTime = Duration.ofMillis(System.currentTimeMillis() - startMillis)
println(s"Execution time: ${executionTime.toMinutes}m")
}
}
Probably the condition is too complicated, but it must be that way.
You may improve the performance of your current solution by slightly changing the logic of how you perform the linkage:
First, perform an inner join of the a and b dataframes on the columns you know match. In your case, these seem to be the LAST_NAME and FIRST_NAME columns.
Then, filter the resulting dataframe with your specific complex conditions; in your case, birth dates are equal or the street matches.
Finally, if you also need to keep the records that were not linked, perform a right join with the a dataframe.
Your code could be rewritten as follows:
import org.apache.spark.sql.functions.{col, substring, to_date}
import org.apache.spark.sql.SparkSession
import java.time.Duration
object App {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder()
.master("local[4]")
.appName("MatchingApp")
.getOrCreate()
val a = spark.read
.format("csv")
.option("header", true)
.option("delimiter", ";")
.load("/home/helveticau/workstuff/a.csv")
.withColumn("BIRTH_DATE", to_date(col("BIRTH_DATE"), "yyyy-MM-dd"))
val b = spark.read
.format("csv")
.option("header", true)
.option("delimiter", ";")
.load("/home/helveticau/workstuff/b.txt")
.withColumn("BIRTH_DATE", to_date(col("BIRTH_DATE"), "dd.MM.yyyy"))
val condition = a.col("BIRTH_DATE").equalTo(b.col("BIRTH_DATE"))
.or(a.col("STREET").startsWith(substring(b.col("STR"), 1, 3)))
val startMillis = System.currentTimeMillis();
val res = a.join(b, Seq("LAST_NAME", "FIRST_NAME"))
.filter(condition)
// The two following lines are optional if you only want to keep records with a non-null B_ID
.select("B_ID", "A_ID")
.join(a, Seq("A_ID"), "right_outer")
val count = res
.filter(col("B_ID").isNotNull)
.count()
println(s"Count: $count")
val executionTime = Duration.ofMillis(System.currentTimeMillis() - startMillis)
println(s"Execution time: ${executionTime.toMinutes}m")
}
}
So you avoid the Cartesian product at the price of two joins instead of one.
Example
With file a.csv containing the following data:
"A_ID";"FIRST_NAME";"LAST_NAME";"BIRTH_DATE";"STREET"
10;John;Doe;1965-10-21;Johnson Road
11;Rebecca;Davis;1977-02-27;Lincoln Road
12;Samantha;Johns;1954-03-31;Main Street
13;Roger;Penrose;1987-12-25;Oxford Street
14;Robert;Smith;1981-08-26;Canergie Road
15;Britney;Stark;1983-09-27;Alshire Road
And b.txt having the following data:
"B_ID";"FIRST_NAME";"LAST_NAME";"BIRTH_DATE";"STR"
29;John;Doe;21.10.1965;Johnson Road
28;Rebecca;Davis;28.03.1986;Lincoln Road
27;Shirley;Iron;30.01.1956;Oak Street
26;Roger;Penrose;25.12.1987;York Street
25;Robert;Dayton;26.08.1956;Canergie Road
24;Britney;Stark;22.06.1962;Algon Road
res dataframe will be:
+----+----+----------+---------+----------+-------------+
|A_ID|B_ID|FIRST_NAME|LAST_NAME|BIRTH_DATE|STREET |
+----+----+----------+---------+----------+-------------+
|10 |29 |John |Doe |1965-10-21|Johnson Road |
|11 |28 |Rebecca |Davis |1977-02-27|Lincoln Road |
|12 |null|Samantha |Johns |1954-03-31|Main Street |
|13 |26 |Roger |Penrose |1987-12-25|Oxford Street|
|14 |null|Robert |Smith |1981-08-26|Canergie Road|
|15 |null|Britney |Stark |1983-09-27|Alshire Road |
+----+----+----------+---------+----------+-------------+
Note: if your FIRST_NAME and LAST_NAME columns do not match exactly, you can try to normalize them with Spark's built-in functions (as in the sketch after this list), for instance:
trim to remove spaces at the start and end of the string
lower to convert the column to lower case (and thus ignore case in the comparison)
What really matters is to have the maximum number of columns that match exactly.
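A hedged sketch of such a normalization of the join keys before the inner join (normalizeKeys, aNorm and bNorm are illustrative names; column names as in the example above):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lower, trim}

// Normalize the join keys on both sides so that case and stray spaces
// do not prevent an exact match on FIRST_NAME / LAST_NAME.
def normalizeKeys(df: DataFrame): DataFrame = df
  .withColumn("FIRST_NAME", lower(trim(col("FIRST_NAME"))))
  .withColumn("LAST_NAME", lower(trim(col("LAST_NAME"))))

val aNorm = normalizeKeys(a)
val bNorm = normalizeKeys(b)
val joined = aNorm.join(bNorm, Seq("LAST_NAME", "FIRST_NAME"))
The complex condition and the optional right join can then be applied as in the rewritten code above, rebuilding the condition against aNorm and bNorm.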
I have a big list of strings (140,866 elements) which takes some time to compute. Once computed, I want to use this list in a UDF or in a map over my DataFrame. I followed some tutorials and found this example:
val states = List("NY","New York","CA","California","FL","Florida")
val countries = Map(("USA","United States of America"),("IN","India"))
val broadcastStates = spark.sparkContext.broadcast(states)
val broadcastCountries = spark.sparkContext.broadcast(countries)
val data = Seq(("James","Smith","USA","CA"),
("Michael","Rose","USA","NY"),
("Robert","Williams","USA","CA"),
("Maria","Jones","USA","FL")
)
val columns = Seq("firstname","lastname","country","state")
import spark.sqlContext.implicits._
val df = data.toDF(columns:_*)
val df2 = df.map(row=>{
val country = row.getString(2)
val state = row.getString(3)
val fullCountry = broadcastCountries.value.get(country).get
val fullState = broadcastStates.value(0)
(row.getString(0),row.getString(1),fullCountry,fullState)
}).toDF(columns:_*)
df2.show(false)
which works fine.
But when I try to use my own list, I get this error:
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:403)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:393)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2326)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:850)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:849)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:849)
at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:630)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:156)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:283)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:375)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3389)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2550)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2550)
at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3369)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2550)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2764)
at org.apache.spark.sql.Dataset.getRows(Dataset.scala:254)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:291)
at org.apache.spark.sql.Dataset.show(Dataset.scala:753)
at org.apache.spark.sql.Dataset.show(Dataset.scala:730)
... 54 elided
Caused by: java.io.NotSerializableException: org.apache.spark.SparkContext
Serialization stack:
- object not serializable (class: org.apache.spark.SparkContext, value: org.apache.spark.SparkContext@6ebc6ccc)
I get my list with
val myList = spark.read.option("header",true)
.csv(NER_PATH_S3)
.na.drop()
.filter(col("label") =!= "VALUE" )
.groupBy("keyword")
.agg(sum("n_occurences").alias("n_occurences"))
.filter(col("n_occurences") > 2)
.filter($"keyword".rlike("[^0-9]+"))
.select("keyword")
.collect()
.map(x => x(0).toString)
.toList
val myListBroadcast = sc.broadcast(myList)
I made sure it has exactly the same type as in the example, and I also tried reducing the size of my list by slicing it.
In my opinion, instead of using
sc.broadcast(myList)
you can use
spark.sparkContext.broadcast(myList)
and that should work.
I had faced a similar issue, and when I changed the code to what I have suggested, it worked like a charm.
Happy Learning.
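For illustration, a minimal sketch of the suggested change, reusing the example DataFrame df from the question (the lookup against the broadcast list is hypothetical):
import spark.implicits._

// Broadcast through the SparkSession's SparkContext
val myListBroadcast = spark.sparkContext.broadcast(myList)

val df2 = df.map { row =>
  // Only the broadcast handle is captured by the closure;
  // the SparkContext itself never leaves the driver.
  val keywords = myListBroadcast.value
  val state = row.getString(3)
  (row.getString(0), row.getString(1), state, keywords.contains(state))
}.toDF("firstname", "lastname", "state", "inList")

df2.show(false)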
In Spark Scala, I am trying to create a column that contains an array of monthly dates between a start and an end date (inclusive).
For example, if we have 2018-02-07 and 2018-04-28, the array should contain [2018-02-01, 2018-03-01, 2018-04-01].
Besides the monthly version I would also like to create a quarterly version, i.e. [2018-1, 2018-2].
Example Input Data:
id startDate endDate
1_1 2018-02-07 2018-04-28
1_2 2018-05-06 2018-05-31
2_1 2017-04-13 2017-04-14
Expected (monthly) Output 1:
id startDate endDate dateRange
1_1 2018-02-07 2018-04-28 [2018-02-01, 2018-03-01, 2018-04-01]
1_1 2018-05-06 2018-05-31 [2018-05-01]
2_1 2017-04-13 2017-04-14 [2017-04-01]
Ultimate expected (monthly) output 2:
id Date
1_1 2018-02-01
1_1 2018-03-01
1_1 2018-04-01
1_2 2018-05-01
2_1 2017-04-01
I have Spark 2.1.0.167, Scala 2.10.6, and Java HotSpot 1.8.0_172.
I have tried to implement several answers to similar (day-level) questions on here, but I am struggling with getting a monthly/quarterly version to work.
The code below creates an array from startDate and endDate and explodes it. However, I need to explode a column that contains all the monthly (or quarterly) dates in between.
val df1 = df.select($"id", $"startDate", $"endDate").
  // This just creates an array of start and end date
  withColumn("start_end_array", array($"startDate", $"endDate")).
  withColumn("start_end_array", explode($"start_end_array"))
Thank you for any leads.
case class MyData(id: String, startDate: String, endDate: String, list: List[String])
val inputData = Seq(("1_1", "2018-02-07", "2018-04-28"), ("1_2", "2018-05-06", "2018-05-31"), ("2_2", "2017-04-13", "2017-04-14"))
inputData.map(x => {
  import java.time.temporal._
  import java.time._
  // Work on the first day of each month so the month difference is exact
  val startDate = LocalDate.parse(x._2).withDayOfMonth(1)
  val endDate = LocalDate.parse(x._3).withDayOfMonth(1)
  val diff = ChronoUnit.MONTHS.between(startDate, endDate)
  var result = List[String]()
  for (index <- 0 to diff.toInt) {
    // e.g. 2018-02-01, 2018-03-01, ... (plusMonths handles year roll-over)
    result = startDate.plusMonths(index).toString :: result
  }
  MyData(x._1, x._2, x._3, result.reverse)
}).foreach(println)
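If the result is needed as a DataFrame column (expected output 1) and then as one row per month (expected output 2), one option on Spark 2.1, which has no built-in sequence function for dates, is to wrap the same month-stepping logic in a UDF and explode the resulting array. A sketch, assuming the input DataFrame from the question and a spark-shell session:
import java.time.LocalDate
import java.time.temporal.ChronoUnit
import org.apache.spark.sql.functions.{col, explode, udf}
import spark.implicits._

// First day of every month between startDate and endDate (inclusive)
val monthRange = udf { (start: String, end: String) =>
  val s = LocalDate.parse(start).withDayOfMonth(1)
  val e = LocalDate.parse(end).withDayOfMonth(1)
  val n = ChronoUnit.MONTHS.between(s, e).toInt
  (0 to n).map(i => s.plusMonths(i).toString).toList
}

val df = Seq(
  ("1_1", "2018-02-07", "2018-04-28"),
  ("1_2", "2018-05-06", "2018-05-31"),
  ("2_1", "2017-04-13", "2017-04-14")
).toDF("id", "startDate", "endDate")

val withRange = df.withColumn("dateRange", monthRange(col("startDate"), col("endDate")))  // output 1
withRange.select(col("id"), explode(col("dateRange")).as("Date")).show(false)             // output 2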
I am using Spark 2.1 and have a Hive table in ORC format; the schema follows:
col_name data_type
tuid string
puid string
ts string
dt string
source string
peer string
# Partition Information
# col_name data_type
dt string
source string
peer string
# Detailed Table Information
Database: test
Owner: test
Create Time: Tue Nov 22 15:25:53 GMT 2016
Last Access Time: Thu Jan 01 00:00:00 GMT 1970
Location: hdfs://apps/hive/warehouse/nis.db/dmp_puid_tuid
Table Type: MANAGED
Table Parameters:
transient_lastDdlTime 1479828353
SORTBUCKETCOLSPREFIX TRUE
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Compressed: No
Storage Desc Parameters:
serialization.format 1
When I apply a filter on this table using the partition columns, it works fine and reads only the specific partitions.
val puid = spark.read.table("nis.dmp_puid_tuid")
.as(Encoders.bean(classOf[DmpPuidTuid]))
.filter( """peer = "AggregateKnowledge" and dt = "20170403"""")
This is the physical plan for that query:
== Physical Plan ==
HiveTableScan [tuid#1025, puid#1026, ts#1027, dt#1022, source#1023, peer#1024], MetastoreRelation nis, dmp_puid_tuid, [isnotnull(peer#1024), isnotnull(dt#1022),
(peer#1024 = AggregateKnowledge), (dt#1022 = 20170403)]
But when I use the code below, it reads the entire data into Spark:
val puid = spark.read.table("nis.dmp_puid_tuid")
.as(Encoders.bean(classOf[DmpPuidTuid]))
.filter( tp => tp.getPeer().equals("AggregateKnowledge") && Integer.valueOf(tp.getDt()) >= 20170403)
Physical plan for the above DataFrame:
== Physical Plan ==
*Filter <function1>.apply
+- HiveTableScan [tuid#1058, puid#1059, ts#1060, dt#1055, source#1056, peer#1057], MetastoreRelation nis, dmp_puid_tuid
Note: DmpPuidTuid is a Java bean class.
When you pass a Scala function to filter, you prevent the Spark optimizer from seeing which columns of the dataset are actually used (because the optimizer does not try to look inside the compiled code of the function). If you pass a column expression instead, such as col("peer") === "AggregateKnowledge" && col("dt").cast(IntegerType) >= 20170403, the optimizer will be able to see which columns are actually required and adjust the plan accordingly.
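For instance, a column-based version of the second filter, which should let Spark push the partition predicates into the Hive table scan; a sketch, assuming dt can be cast to an integer as in the question:
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.IntegerType

val puid = spark.read.table("nis.dmp_puid_tuid")
  // Column expressions stay visible to the Catalyst optimizer,
  // so the partition filters can be applied in the table scan.
  .filter(col("peer") === "AggregateKnowledge" && col("dt").cast(IntegerType) >= 20170403)
  .as(Encoders.bean(classOf[DmpPuidTuid]))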