Effective record linkage - java

I asked a somewhat similar question earlier today; here it is.
In short: I need to do record linkage for two large datasets (1.6M and 6M records). I was going to use Spark, thinking that the Cartesian product I was warned about would not be such a big problem. But it is. It hit performance so hard that the linkage process hadn't finished after 7 hours.
Is there another library/framework/tool for doing this more efficiently? Or could the performance of the solution below be improved?
The code I ended up with:
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, concat_ws, substring, to_date}
import java.time.Duration

object App {

  def left(col: Column, n: Int) = {
    assert(n > 0)
    substring(col, 1, n)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("MatchingApp")
      .getOrCreate()
    import spark.implicits._

    val a = spark.read
      .format("csv")
      .option("header", true)
      .option("delimiter", ";")
      .load("/home/helveticau/workstuff/a.csv")
      .withColumn("FULL_NAME", concat_ws(" ", col("FIRST_NAME"), col("LAST_NAME")))
      .withColumn("BIRTH_DATE", to_date(col("BIRTH_DATE"), "yyyy-MM-dd"))

    val b = spark.read
      .format("csv")
      .option("header", true)
      .option("delimiter", ";")
      .load("/home/helveticau/workstuff/b.txt")
      .withColumn("FULL_NAME", concat_ws(" ", col("FIRST_NAME"), col("LAST_NAME")))
      .withColumn("BIRTH_DATE", to_date(col("BIRTH_DATE"), "dd.MM.yyyy"))

    // #formatter:off
    val condition = a
      .col("FULL_NAME").contains(b.col("FIRST_NAME"))
      .and(a.col("FULL_NAME").contains(b.col("LAST_NAME")))
      .and(a.col("BIRTH_DATE").equalTo(b.col("BIRTH_DATE"))
        .or(a.col("STREET").startsWith(left(b.col("STR"), 3))))
    // #formatter:on

    val startMillis = System.currentTimeMillis()
    val res = a.join(b, condition, "left_outer")
    val count = res
      .filter(col("B_ID").isNotNull)
      .count()
    println(s"Count: $count")

    val executionTime = Duration.ofMillis(System.currentTimeMillis() - startMillis)
    println(s"Execution time: ${executionTime.toMinutes}m")
  }
}
The condition is probably too complicated, but it has to be that way.

You can improve the performance of your current solution by slightly changing the logic of how you perform the linkage:
First, perform an inner join of the a and b dataframes on the columns that you know match. In your case, those seem to be the LAST_NAME and FIRST_NAME columns.
Then filter the resulting dataframe with your specific complex conditions; in your case, birth dates are equal or the street matches.
Finally, if you also need to keep the unlinked records, perform a right join with the a dataframe.
Your code could be rewritten as follows:
import org.apache.spark.sql.functions.{col, substring, to_date}
import org.apache.spark.sql.SparkSession
import java.time.Duration
object App {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("MatchingApp")
      .getOrCreate()

    val a = spark.read
      .format("csv")
      .option("header", true)
      .option("delimiter", ";")
      .load("/home/helveticau/workstuff/a.csv")
      .withColumn("BIRTH_DATE", to_date(col("BIRTH_DATE"), "yyyy-MM-dd"))

    val b = spark.read
      .format("csv")
      .option("header", true)
      .option("delimiter", ";")
      .load("/home/helveticau/workstuff/b.txt")
      .withColumn("BIRTH_DATE", to_date(col("BIRTH_DATE"), "dd.MM.yyyy"))

    val condition = a.col("BIRTH_DATE").equalTo(b.col("BIRTH_DATE"))
      .or(a.col("STREET").startsWith(substring(b.col("STR"), 1, 3)))

    val startMillis = System.currentTimeMillis()
    val res = a.join(b, Seq("LAST_NAME", "FIRST_NAME"))
      .filter(condition)
      // the two following lines are optional if you only want to keep records with a non-null B_ID
      .select("B_ID", "A_ID")
      .join(a, Seq("A_ID"), "right_outer")

    val count = res
      .filter(col("B_ID").isNotNull)
      .count()
    println(s"Count: $count")

    val executionTime = Duration.ofMillis(System.currentTimeMillis() - startMillis)
    println(s"Execution time: ${executionTime.toMinutes}m")
  }
}
This way you avoid the Cartesian product, at the price of two joins instead of only one.
Example
With file a.csv containing the following data:
"A_ID";"FIRST_NAME";"LAST_NAME";"BIRTH_DATE";"STREET"
10;John;Doe;1965-10-21;Johnson Road
11;Rebecca;Davis;1977-02-27;Lincoln Road
12;Samantha;Johns;1954-03-31;Main Street
13;Roger;Penrose;1987-12-25;Oxford Street
14;Robert;Smith;1981-08-26;Canergie Road
15;Britney;Stark;1983-09-27;Alshire Road
And b.txt having the following data:
"B_ID";"FIRST_NAME";"LAST_NAME";"BIRTH_DATE";"STR"
29;John;Doe;21.10.1965;Johnson Road
28;Rebecca;Davis;28.03.1986;Lincoln Road
27;Shirley;Iron;30.01.1956;Oak Street
26;Roger;Penrose;25.12.1987;York Street
25;Robert;Dayton;26.08.1956;Canergie Road
24;Britney;Stark;22.06.1962;Algon Road
The res dataframe will be:
+----+----+----------+---------+----------+-------------+
|A_ID|B_ID|FIRST_NAME|LAST_NAME|BIRTH_DATE|STREET |
+----+----+----------+---------+----------+-------------+
|10 |29 |John |Doe |1965-10-21|Johnson Road |
|11 |28 |Rebecca |Davis |1977-02-27|Lincoln Road |
|12 |null|Samantha |Johns |1954-03-31|Main Street |
|13 |26 |Roger |Penrose |1987-12-25|Oxford Street|
|14 |null|Robert |Smith |1981-08-26|Canergie Road|
|15 |null|Britney |Stark |1983-09-27|Alshire Road |
+----+----+----------+---------+----------+-------------+
Note: if your FIRST_NAME and LAST_NAME columns are not exactly the same, you can try to make them match with Spark's built-in functions, for instance:
trim to remove spaces at start and end of string
lower to transform the column to lower case (and thus ignore case in comparison)
What is really important is to have the maximum number of columns that exactly match.
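For instance, a minimal sketch of that normalization, reusing the a and b dataframes from above (column names as in the example):
import org.apache.spark.sql.functions.{col, lower, trim}
// Trim and lower-case the join keys on both sides before the equi-join,
// so that "John " and "john" are treated as the same value.
val aNorm = a
  .withColumn("FIRST_NAME", lower(trim(col("FIRST_NAME"))))
  .withColumn("LAST_NAME", lower(trim(col("LAST_NAME"))))
val bNorm = b
  .withColumn("FIRST_NAME", lower(trim(col("FIRST_NAME"))))
  .withColumn("LAST_NAME", lower(trim(col("LAST_NAME"))))
val joined = aNorm.join(bNorm, Seq("LAST_NAME", "FIRST_NAME"))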

Related

How to efficiently WordCount for each file?

I have tens of thousands of files in a directory demotxt, like:
demotxt/
  aa.txt
    this is aaa1
    this is aaa2
    this is aaa3
  bb.txt
    this is bbb1
    this is bbb2
    this is bbb3
    this is bbb4
  cc.txt
    this is ccc1
    this is ccc2
I would like to efficiently produce a word count for each .txt file in this directory with Spark 2.4 (Scala or Python).
# target result is:
aa.txt: (this,3), (is,3), (aaa1,1), (aaa2,1), (aaa3,1)
bb.txt: (this,4), (is,4), (bbb1,1), (bbb2,1), (bbb3,1), (bbb4,1)
cc.txt: (this,2), (is,2), (ccc1,1), (ccc2,1)
The code might look something like this:
def dealWithOneFile(path2File):
    res = wordcountFor(path2File)
    saveResultToDB(res)

sc.wholeTextFiles(rootDir).map(dealWithOneFile)
It seems that with sc.textFile(".../demotxt/") Spark loads all the files at once, which may cause memory issues; it also treats all the files as one, which is not what I want.
So how should I do this? Many thanks!
Here is an approach. It can be done with DataFrames or RDDs; since you do not state a preference, I show the RDD way, in Scala, using Databricks.
It's hard to explain, but it works. Try it with some input.
%scala
import org.apache.spark.sql.functions.input_file_name
import spark.implicits._

val paths = Seq("/FileStore/tables/fff_1.txt", "/FileStore/tables/fff_2.txt")
val rdd = spark.read.format("text").load(paths: _*).select(input_file_name, $"value").as[(String, String)].rdd
val rdd2 = rdd.flatMap(x => x._2.split("\\s+").map(y => ((x._1, y), 1)))
val rdd3 = rdd2.reduceByKey(_ + _).map { case (x, y) => (x._1, (x._2, y)) }
rdd3.collect

val rdd4 = rdd3.groupByKey()
rdd4.collect
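If you prefer to stay with DataFrames rather than drop to the RDD API, here is a minimal sketch of the same per-file word count (the input path is illustrative):
import org.apache.spark.sql.functions.{col, explode, input_file_name, split}

// One row per (file, word), then a plain groupBy/count per file.
val words = spark.read.text("/FileStore/tables/demotxt/*.txt")
  .select(input_file_name().as("file"), explode(split(col("value"), "\\s+")).as("word"))
val countsPerFile = words.groupBy("file", "word").count()
countsPerFile.show(false)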

What does 'no viable alternative at input' mean in Spark SQL?

I have a DF with a startTimeUnix column (of type Number in Mongo) that contains epoch timestamps. I want to query the DF on this column, but I want to pass an EST datetime. I went through multiple hoops to test the following in spark-shell:
val df = Seq(("1", "1523937600000"), ("2", "1523941200000"),("3","1524024000000")).toDF("id", "unix")
df.filter($"unix" > java.time.ZonedDateTime.parse("04/17/2018 01:00:00", java.time.format.DateTimeFormatter.ofPattern ("MM/dd/yyyy HH:mm:ss").withZone ( java.time.ZoneId.of("America/New_York"))).toEpochSecond()*1000).collect()
Output:
= Array([3,1524024000000])
Since the java.time functions work, I pass the same expression to spark-submit, where, while retrieving the data from Mongo, the filter query looks like this:
startTimeUnix < (java.time.ZonedDateTime.parse(${LT}, java.time.format.DateTimeFormatter.ofPattern('MM/dd/yyyyHHmmss').withZone(java.time.ZoneId.of('America/New_York'))).toEpochSecond()*1000) AND startTimeUnix > (java.time.ZonedDateTime.parse(${GT}, java.time.format.DateTimeFormatter.ofPattern('MM/dd/yyyyHHmmss').withZone(java.time.ZoneId.of('America/New_York'))).toEpochSecond()*1000)
However, I keep getting the following error:
Caused by: org.apache.spark.sql.catalyst.parser.ParseException:
no viable alternative at input '(java.time.ZonedDateTime.parse(04/18/2018000000, java.time.format.DateTimeFormatter.ofPattern('MM/dd/yyyyHHmmss').withZone('(line 1, pos 138)
== SQL ==
startTimeUnix < (java.time.ZonedDateTime.parse(04/18/2018000000, java.time.format.DateTimeFormatter.ofPattern('MM/dd/yyyyHHmmss').withZone(java.time.ZoneId.of('America/New_York'))).toEpochSecond()*1000).toString() AND startTimeUnix > (java.time.ZonedDateTime.parse(04/17/2018000000, java.time.format.DateTimeFormatter.ofPattern('MM/dd/yyyyHHmmss').withZone(java.time.ZoneId.of('America/New_York'))).toEpochSecond()*1000).toString()
at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:217)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:114)
at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseExpression(ParseDriver.scala:43)
at org.apache.spark.sql.Dataset.filter(Dataset.scala:1315)
Somewhere I read that this error indicates a mismatched data type. I tried applying toString to the output of the date conversion, with no luck.
The filter string you pass is handed to Spark's SQL parser (see AbstractSqlParser.parseExpression in the stack trace), which does not understand arbitrary Java/Scala expressions such as java.time.ZonedDateTime.parse(...); that is what 'no viable alternative at input' is complaining about. You can use Spark DataFrame functions instead.
scala> val df = Seq(("1", "1523937600000"), ("2", "1523941200000"),("3","1524024000000")).toDF("id", "unix")
df: org.apache.spark.sql.DataFrame = [id: string, unix: string]
scala> df.filter($"unix" > unix_timestamp()*1000).collect()
res5: Array[org.apache.spark.sql.Row] = Array([3,1524024000000])
scala> df.withColumn("unixinEST"
,from_utc_timestamp(
from_unixtime(unix_timestamp()),
"EST"))
.show()
+---+-------------+-------------------+
| id| unix| unixinEST|
+---+-------------+-------------------+
| 1|1523937600000|2018-04-18 06:13:19|
| 2|1523941200000|2018-04-18 06:13:19|
| 3|1524024000000|2018-04-18 06:13:19|
+---+-------------+-------------------+
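Alternatively, if you need to filter on a specific EST datetime rather than the current time, a minimal sketch (the bound value is illustrative) is to compute the epoch bound on the driver with java.time, exactly as in your spark-shell test, and pass the resulting number into a Column-based filter instead of embedding Java code in the SQL filter string:
import java.time.ZonedDateTime
import java.time.ZoneId
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.functions.col

// The parsing happens on the driver; Spark only ever sees a plain numeric literal.
val fmt = DateTimeFormatter.ofPattern("MM/dd/yyyy HH:mm:ss").withZone(ZoneId.of("America/New_York"))
val ltMillis = ZonedDateTime.parse("04/18/2018 00:00:00", fmt).toEpochSecond * 1000
df.filter(col("unix") < ltMillis).collect()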

Tf Idf over a set of documents to find words relevance

I have 2 books in txt format (6000+ lines each). Using Python, I would like to associate each word with its relevance (using the tf-idf algorithm) and order the words in descending order.
I tried this code:
# -*- coding: utf-8 -*-
from __future__ import division, unicode_literals
import math
from textblob import TextBlob as tb

def tf(word, blob):
    return blob.words.count(word) / len(blob.words)

def n_containing(word, bloblist):
    return sum(1 for blob in bloblist if word in blob)

def idf(word, bloblist):
    return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))

def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)

document1 = tb("FULL BOOK1 TEST")
document2 = tb("FULL BOOK2 TEST")

bloblist = [document1, document2]
for i, blob in enumerate(bloblist):
    with open("result.txt", 'w') as textfile:
        print("Top words in document {}".format(i + 1))
        scores = {word: tfidf(word, blob, bloblist) for word in blob.words}
        sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
        for word, score in sorted_words:
            textfile.write("Word: {}, TF-IDF: {}".format(word, round(score, 5)) + "\n")
which I found here: https://stevenloria.com/tf-idf/ (with some changes). But it takes a lot of time, and after a few minutes it crashes with TypeError: coercing to Unicode: need string or buffer, float found.
Why?
I also tried calling this Java program from Python: https://github.com/mccurdyc/tf-idf/. The program runs, but the output is incorrect: many words that should have a high relevance are instead given 0 relevance.
Is there a way to fix that Python code?
Or can you suggest another tf-idf implementation that correctly does what I want?

Split dataset based on column values in spark

I am trying to split a Dataset into different Datasets based on the contents of the Manufacturer column. It is very slow. Please suggest a way to improve the code so that it executes faster and uses less Java code.
List<Row> lsts = countsByAge.collectAsList();
for (Row lst : lsts) {
    String man = lst.toString();
    man = man.replaceAll("[\\p{Ps}\\p{Pe}]", "");
    Dataset<Row> DF = src.filter("Manufacturer='" + man + "'");
    DF.show();
}
The code, input, and output datasets are shown below.
package org.sparkexample;

import org.apache.parquet.filter2.predicate.Operators.Column;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.RelationalGroupedDataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.SparkSession;
import java.util.Arrays;
import java.util.List;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;

public class GroupBy {

    public static void main(String[] args) {
        System.setProperty("hadoop.home.dir", "C:\\winutils");
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("SparkJdbcDs").setMaster("local[*]"));
        SQLContext sqlContext = new SQLContext(sc);
        SparkSession spark = SparkSession.builder().appName("split datasets").getOrCreate();
        sc.setLogLevel("ERROR");

        Dataset<Row> src = sqlContext.read()
            .format("com.databricks.spark.csv")
            .option("header", "true")
            .load("sample.csv");

        Dataset<Row> unq_manf = src.select("Manufacturer").distinct();
        List<Row> lsts = unq_manf.collectAsList();
        for (Row lst : lsts) {
            String man = lst.toString();
            man = man.replaceAll("[\\p{Ps}\\p{Pe}]", "");
            Dataset<Row> DF = src.filter("Manufacturer='" + man + "'");
            DF.show();
        }
    }
}
Input Table
+------+------------+--------------------+---+
|ItemID|Manufacturer| Category name|UPC|
+------+------------+--------------------+---+
| 804| ael|Brush & Broom Han...|123|
| 805| ael|Wheel Brush Parts...|124|
| 813| ael| Drivers Gloves|125|
| 632| west| Pipe Wrenches|126|
| 804| bil| Masonry Brushes|127|
| 497| west| Power Tools Other|128|
| 496| west| Power Tools Other|129|
| 495| bil| Hole Saws|130|
| 499| bil| Battery Chargers|131|
| 497| west| Power Tools Other|132|
+------+------------+--------------------+---+
Output
+------------+
|Manufacturer|
+------------+
| ael|
| west|
| bil|
+------------+
+------+------------+--------------------+---+
|ItemID|Manufacturer| Category name|UPC|
+------+------------+--------------------+---+
| 804| ael|Brush & Broom Han...|123|
| 805| ael|Wheel Brush Parts...|124|
| 813| ael| Drivers Gloves|125|
+------+------------+--------------------+---+
+------+------------+-----------------+---+
|ItemID|Manufacturer| Category name|UPC|
+------+------------+-----------------+---+
| 632| west| Pipe Wrenches|126|
| 497| west|Power Tools Other|128|
| 496| west|Power Tools Other|129|
| 497| west|Power Tools Other|132|
+------+------------+-----------------+---+
+------+------------+----------------+---+
|ItemID|Manufacturer| Category name|UPC|
+------+------------+----------------+---+
| 804| bil| Masonry Brushes|127|
| 495| bil| Hole Saws|130|
| 499| bil|Battery Chargers|131|
+------+------------+----------------+---+
You have two choices in this case:
First, collect the unique manufacturer values and then map over the resulting array:
val df = Seq(("HP", 1), ("Brother", 2), ("Canon", 3), ("HP", 5)).toDF("k", "v")
val brands = df.select("k").distinct.collect.flatMap(_.toSeq)
val BrandArray = brands.map(brand => df.where($"k" <=> brand))
BrandArray.foreach { x =>
  x.show()
  println("---------------------------------------")
}
You can also save the data frame partitioned by the column you split on (k in the example above):
df.write.partitionBy("k").saveAsTable("parquet")
Instead of splitting the dataset/dataframe by manufacturer, it might be optimal to write the dataframe using manufacturer as the partition key, if you need to query by manufacturer frequently.
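For example, a minimal sketch in Scala syntax (the output path is illustrative) that writes the questioner's src dataset with one sub-directory per manufacturer:
// Each manufacturer ends up in its own directory, e.g. .../by_manufacturer/Manufacturer=ael/,
// and no data is collected to the driver.
src.write
  .partitionBy("Manufacturer")
  .mode("overwrite")
  .parquet("/tmp/by_manufacturer")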
In case you still want separate dataframes based on one of the column values, one approach using PySpark and Spark 2.0+ could be:
from pyspark.sql import functions as F

df = spark.read.csv("sample.csv", header=True)

# collect the list of manufacturers
manufacturers = df.select('Manufacturer').distinct().collect()

# loop over the manufacturers, filter df by each one and write it separately
for m in manufacturers:
    df1 = df.where(F.col('Manufacturer') == m[0])
    df1[.repartition(repartition_col)].write.parquet(<write_path>, [write_mode])

How to save a list of case classes in Scala

I have a case class named Rdv:
case class Rdv(
  id: Option[Int],
  nom: String,
  prénom: String,
  sexe: Int,
  telPortable: String,
  telBureau: String,
  telPrivé: String,
  siteRDV: String,
  typeRDV: String,
  libelléRDV: String,
  numRDV: String,
  étape: String,
  dateRDV: Long,
  heureRDVString: String,
  statut: String,
  orderId: String)
and I would like to save a list of such elements on disk, and reload them later.
I tried with Java classes (ObjectOutputStream, FileOutputStream, ObjectInputStream, FileInputStream), but I get an error in the retrieval step: the statement
val n2 = ois.readObject().asInstanceOf[List[Rdv]]
always fails with a ClassNotFoundException: Rdv, although the correct path is given in the imports.
Do you know a workaround to save such an object?
Please provide a little piece of code!
Thanks,
Olivier
PS: I have the same error while using the Marshal class, as in this code:
object Application extends Controller {

  def index = Action {
    //implicit val Rdv2Writes = Json.writes[rdv2]

    def rdvTordv2(rdv: Rdv): rdv2 = new rdv2(
      rdv.nom,
      rdv.prénom,
      rdv.dateRDV,
      rdv.heureRDVString,
      rdv.telPortable,
      rdv.telBureau,
      rdv.telPrivé,
      rdv.siteRDV,
      rdv.typeRDV,
      rdv.libelléRDV,
      rdv.orderId,
      rdv.statut)

    val n = variables.manager.liste_locale
    val out = new FileOutputStream("out")
    out.write(Marshal.dump(n))
    out.close

    val in = new FileInputStream("out")
    val bytes = Stream.continually(in.read).takeWhile(-1 !=).map(_.toByte).toArray
    val bar: List[Rdv] = Marshal.load[List[Rdv]](bytes) // <--------------
    val n3 = bar.map(rdv => rdvTordv2(rdv))
    println("n3:" + n3.size)
    Ok(views.html.Application.olivier2(n3))
  }
}
The error occurs at the line marked with the arrow. It seems that the conversion to the type List[Rdv] encounters problems, but why? Is it a Play-related problem?
OK, there's a problem with Play:
I created a new Scala project with this code:
object Test1 extends App {
  // for testing purposes
  case class Person(name: String, age: Int)
  val liste_locale = List(new Person("paul", 18))

  val n = liste_locale
  val out = new FileOutputStream("out")
  out.write(Marshal.dump(n))
  out.close

  val in = new FileInputStream("out")
  val bytes = Stream.continually(in.read).takeWhile(-1 !=).map(_.toByte).toArray
  val bar: List[Person] = Marshal.load[List[Person]](bytes)
  println(s"bar:size=${bar.size}")
}
and the output is correct ("bar:size=1").
Then I modified my previous code in the Play project, in the controller class, like this:
object Application extends Controller {

  def index = Action {
    // for testing purposes
    case class Person(name: String, age: Int)
    val liste_locale = List(new Person("paul", 18))

    val n = liste_locale
    val out = new FileOutputStream("out")
    out.write(Marshal.dump(n))
    out.close

    val in = new FileInputStream("out")
    val bytes = Stream.continually(in.read).takeWhile(-1 !=).map(_.toByte).toArray
    val bar: List[Person] = Marshal.load[List[Person]](bytes)
    println(s"bar:size=${bar.size}")
    Ok(views.html.Application.olivier2(Nil))
  }
}
and I get an error saying:
play.api.Application$$anon$1: Execution exception[[ClassNotFoundException: controllers.Application$$anonfun$index$1$Person$3]]
Does anyone have the answer?
Edit: I thought the error could come from sbt, so I modified build.scala like this:
import sbt._
import Keys._
import play.Project._

object ApplicationBuild extends Build {

  val appName = "sms_play_2"
  val appVersion = "1.0-SNAPSHOT"

  val appDependencies = Seq(
    // Add your project dependencies here,
    jdbc,
    anorm,
    "com.typesafe.slick" % "slick_2.10" % "2.0.0",
    "com.github.nscala-time" %% "nscala-time" % "0.6.0",
    "org.xerial" % "sqlite-jdbc" % "3.7.2",
    "org.quartz-scheduler" % "quartz" % "2.2.1",
    "com.esotericsoftware.kryo" % "kryo" % "2.22",
    "io.argonaut" %% "argonaut" % "6.0.2")

  val mySettings = Seq(
    (javaOptions in run) ++= Seq("-Dconfig.file=conf/dev.conf"))

  val playCommonSettings = Seq(
    Keys.fork := true)

  val main = play.Project(appName, appVersion, appDependencies)
    .settings(
      Keys.fork in run := true,
      resolvers += Resolver.sonatypeRepo("snapshots"))
    .settings(mySettings: _*)
    .settings(playCommonSettings: _*)
}
but without success; the error is still there (class Person not found).
Can you help me?
Scala Pickling has reasonable momentum, and the approach has many advantages (much of the heavy lifting is done at compile time). There is a pluggable serialization mechanism, and formats like JSON are supported.
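A minimal sketch, based on the scala-pickling README (exact imports and behavior depend on the pickling version you use):
import scala.pickling.Defaults._
import scala.pickling.json._

case class Person(name: String, age: Int)

// Pickle a list to JSON and unpickle it back; pkl.value is the JSON string,
// which can be written to and read from disk with ordinary java.io/java.nio code.
val pkl = List(Person("paul", 18)).pickle
println(pkl.value)
val restored = pkl.unpickle[List[Person]]
println(restored) // List(Person(paul,18))
As a side note, the name controllers.Application$$anonfun$index$1$Person$3 in your error suggests Person was compiled as a local class inside the Action block; defining the case class at the top level, as in this sketch, tends to avoid that kind of class-lookup problem regardless of the serialization library.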
