Spark broadcast: big list fails to serialize - Java

I have a big list of strings (140866 elements) which takes some time to compute. Once computed, I want to use this list in a UDF or in a map over my DataFrame. I followed some tutorials and found this example:
val states = List("NY","New York","CA","California","FL","Florida")
val countries = Map(("USA","United States of America"),("IN","India"))
val broadcastStates = spark.sparkContext.broadcast(states)
val broadcastCountries = spark.sparkContext.broadcast(countries)
val data = Seq(("James","Smith","USA","CA"),
("Michael","Rose","USA","NY"),
("Robert","Williams","USA","CA"),
("Maria","Jones","USA","FL")
)
val columns = Seq("firstname","lastname","country","state")
import spark.sqlContext.implicits._
val df = data.toDF(columns:_*)
val df2 = df.map(row => {
  val country = row.getString(2)
  val state = row.getString(3)
  val fullCountry = broadcastCountries.value.get(country).get
  val fullState = broadcastStates.value(0)
  (row.getString(0), row.getString(1), fullCountry, fullState)
}).toDF(columns:_*)
df2.show(false)
which works fine.
But when I try to use my own list, I get this error:
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:403)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:393)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2326)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:850)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:849)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:849)
at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:630)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:156)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:283)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:375)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3389)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2550)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2550)
at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3369)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2550)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2764)
at org.apache.spark.sql.Dataset.getRows(Dataset.scala:254)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:291)
at org.apache.spark.sql.Dataset.show(Dataset.scala:753)
at org.apache.spark.sql.Dataset.show(Dataset.scala:730)
... 54 elided
Caused by: java.io.NotSerializableException: org.apache.spark.SparkContext
Serialization stack:
- object not serializable (class: org.apache.spark.SparkContext, value: org.apache.spark.SparkContext@6ebc6ccc)
I get my list with:
val myList = spark.read.option("header", true)
  .csv(NER_PATH_S3)
  .na.drop()
  .filter(col("label") =!= "VALUE")
  .groupBy("keyword")
  .agg(sum("n_occurences").alias("n_occurences"))
  .filter(col("n_occurences") > 2)
  .filter($"keyword".rlike("[^0-9]+"))
  .select("keyword")
  .collect()
  .map(x => x(0).toString)
  .toList
val myListBroadcast = sc.broadcast(myList)
I made sure the list has exactly the same type as in the example, and I also tried reducing its size by slicing it.

Instead of using
sc.broadcast(myList)
you can use
spark.sparkContext.broadcast(myList)
and that should work.
I had faced a similar issue, and when I changed the code as suggested it worked like a charm.
Happy learning.
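For completeness, here is a minimal sketch of how the broadcast list might then be used from a UDF; the DataFrame df and its string column keyword are hypothetical stand-ins for whatever data the list is applied to:
import org.apache.spark.sql.functions.{col, udf}

// Broadcast through the SparkSession's SparkContext, as suggested above.
val myListBroadcast = spark.sparkContext.broadcast(myList)

// The UDF closes over the broadcast handle only, not the driver-side list itself.
val isKeyword = udf((word: String) => myListBroadcast.value.contains(word))

df.withColumn("is_keyword", isKeyword(col("keyword"))).show(false)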

Related

Flink does not checkpoint, and BucketingSink leaves files in pending state, when source is generated from Collection

I'm trying to generate some test data using a collection and write that data to S3. Flink doesn't seem to do any checkpointing at all when I do this, but it does checkpoint when the source comes from S3.
For example, this DOES checkpoint and leaves output files in a completed state:
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
env.setMaxParallelism(128)
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
env.enableCheckpointing(2000L)
env.setStateBackend(new RocksDBStateBackend("s3a://my_bucket/simple_job/rocksdb_checkpoints"))
val lines: DataStream[String] = {
  val path = "s3a://my_bucket/simple_job/in"
  env
    .readFile(
      inputFormat = new TextInputFormat(new Path(path)),
      filePath = path,
      watchType = FileProcessingMode.PROCESS_CONTINUOUSLY,
      interval = 5000L
    )
}
val sinkFunction: BucketingSink[String] =
  new BucketingSink[String]("s3a://my_bucket/simple_job/out")
    .setBucketer(new DateTimeBucketer("yyyy-MM-dd--HHmm"))
lines.addSink(sinkFunction)
env.execute()
Meanwhile, this DOES NOT checkpoint, and leaves files in a .pending state even after the job has finished:
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
env.setMaxParallelism(128)
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
env.enableCheckpointing(2000L)
env.setStateBackend(new RocksDBStateBackend("s3a://my_bucket/simple_job/rocksdb_checkpoints"))
val lines: DataStream[String] = env.fromCollection((1 to 100).map(_.toString))
val sinkFunction: BucketingSink[String] =
  new BucketingSink[String]("s3a://my_bucket/simple_job/out")
    .setBucketer(new DateTimeBucketer("yyyy-MM-dd--HHmm"))
lines.addSink(sinkFunction)
env.execute()
It turns out that this is because of this ticket: https://issues.apache.org/jira/browse/FLINK-2646. It simply comes about because the stream from the collection finishes before the app ever has time to make a single checkpoint.
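Until that ticket is resolved, one commonly used workaround (not part of the answer above, just a sketch reusing the env and imports from the snippets) is to wrap the test data in a custom SourceFunction that keeps running after emitting all elements, so the job stays alive long enough for checkpoints to fire and the pending files to be finalized:
import org.apache.flink.streaming.api.functions.source.SourceFunction

// Hypothetical helper: emits the given elements, then idles until cancelled,
// which keeps the job running so that periodic checkpoints can complete.
class KeepAliveCollectionSource(elements: Seq[String]) extends SourceFunction[String] {
  @volatile private var running = true

  override def run(ctx: SourceFunction.SourceContext[String]): Unit = {
    elements.foreach(ctx.collect)
    while (running) Thread.sleep(100)
  }

  override def cancel(): Unit = running = false
}

val lines: DataStream[String] =
  env.addSource(new KeepAliveCollectionSource((1 to 100).map(_.toString)))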

Spark foreach has no effect on Java Map [duplicate]

Below is code for getting the list of file names in a zipped file:
def getListOfFilesInRepo(zipFileRDD: RDD[(String, PortableDataStream)]): List[String] = {
  val zipInputStream = zipFileRDD.values.map(x => new ZipInputStream(x.open))
  val filesInZip = new ArrayBuffer[String]()
  var ze: Option[ZipEntry] = None
  zipInputStream.foreach(stream => {
    do {
      ze = Option(stream.getNextEntry)
      ze.foreach { ze =>
        if (ze.getName.endsWith("java") && !ze.isDirectory()) {
          var fileName: String = ze.getName.substring(ze.getName.lastIndexOf("/") + 1, ze.getName.indexOf(".java"))
          filesInZip += fileName
        }
      }
      stream.closeEntry()
    } while (ze.isDefined)
    println(filesInZip.toList.length) // prints 889 (correct)
  })
  println(filesInZip.toList.length) // prints 0 (WHY..?)
  filesInZip.toList
}
I execute the above code in the following manner:
scala> val zipFileRDD = sc.binaryFiles("./handsOn/repo~apache~storm~14135470~false~Java~master~2210.zip")
zipFileRDD: org.apache.spark.rdd.RDD[(String, org.apache.spark.input.PortableDataStream)] = ./handsOn/repo~apache~storm~14135470~false~Java~master~2210.zip BinaryFileRDD[17] at binaryFiles at <console>:25
scala> getListOfFilesInRepo(zipRDD)
889
0
res12: List[String] = List()
Why am I not getting 889, and instead getting 0?
It happens because filesInZip is not shared between workers. foreach operates on a local copy of filesInZip, and when it finishes this copy is simply discarded and garbage collected. If you want to keep the results, you should use a transformation (most likely flatMap) and return the collected, aggregated values.
def listFiles(stream: PortableDataStream): TraversableOnce[String] = ???
zipInputStream.flatMap(listFiles)
You can learn more from the Understanding closures section of the Spark programming guide.
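One possible way to flesh out that sketch, assuming each PortableDataStream wraps a zip archive and mirroring the filtering logic from the question (an illustrative sketch, not the canonical implementation):
import java.util.zip.ZipInputStream
import org.apache.spark.input.PortableDataStream

def listFiles(stream: PortableDataStream): TraversableOnce[String] = {
  val zis = new ZipInputStream(stream.open)
  try {
    Iterator
      .continually(Option(zis.getNextEntry))
      .takeWhile(_.isDefined)
      .flatten
      .collect {
        case ze if ze.getName.endsWith("java") && !ze.isDirectory =>
          ze.getName.substring(ze.getName.lastIndexOf("/") + 1, ze.getName.indexOf(".java"))
      }
      .toList // materialize before the stream is closed
  } finally zis.close()
}

val fileNames: List[String] = zipFileRDD.values.flatMap(listFiles).collect().toList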

Parsing JSON objects into a java.util.List using Scala

I'm trying to take the JSON output of an analysis tool, and turn it into a Java list.
I'm doing this with Scala, and while it doesn't sound that hard, I've run into trouble.
So I was hoping the following code would do it:
def returnWarnings(json: JSONObject): java.util.List[String] = {
  var bugs = new ArrayBuffer[String]
  for (warning <- json.getJSONArray("warnings")) { //might need to add .asScala
    bugs += (warning.getString("warning_type") + ": " + warning.getString("message"))
  }
  val rval: java.util.List[String] = bugs.asJava
  rval
}
This block produces two errors when I try to compile it:
Error:(18, 42) value foreach is not a member of org.json.JSONArray
for (warning <- json.getJSONArray("warnings")){ //might need to add .asScala
^
and
Error:(21, 49) value asJava is not a member of scala.collection.mutable.ArrayBuffer[String]
val rval: java.util.List[String] = bugs.asJava
^
I don't know what's wrong with my for loop.
EDIT: with a bit more reading, I figured out what was up with the loop. see https://stackoverflow.com/a/6376083/5843840
The second error is especially baffling, because as far as I can tell it should work. It is really similar to the code from this documentation
scala> val jul: java.util.List[Int] = ArrayBuffer(1, 2, 3).asJava
jul: java.util.List[Int] = [1, 2, 3]
You should try the following:
import scala.collection.JavaConverters._

def returnWarnings(input: JSONObject): java.util.List[String] = {
  val warningsArray = input.getJSONArray("warnings")
  val output = (0 until warningsArray.length).map { i =>
    val warning = warningsArray.getJSONObject(i)
    warning.getString("warning_type") + ": " + warning.getString("message")
  }
  output.asJava
}
That final conversion could be done implicitly (without invoking .asJava), by importing scala.collection.JavaConversions._
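A hypothetical usage sketch of that returnWarnings, assuming the analysis tool's output is available as a JSON string (the payload below is made up):
import org.json.JSONObject

val report = new JSONObject(
  """{"warnings": [{"warning_type": "SQL Injection", "message": "possible injection"}]}""")

val warnings: java.util.List[String] = returnWarnings(report)
println(warnings) // [SQL Injection: possible injection]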

Substitute for thread executor pool in Scala

My application requires multiple threads running in parallel, fetching data from various HDFS nodes. For that I am using a thread executor pool and forking threads.
Forking at:
val pathSuffixList = fileStatuses.getOrElse("FileStatus", List[Any]()).asInstanceOf[List[Map[String, Any]]]
pathSuffixList.foreach(block => {
  ConsumptionExecutor.execute(new Consumption(webHdfsUri, block))
})
My class Consumption:
class Consumption(webHdfsUri: String, block: Map[String, Any]) extends Runnable {
  override def run(): Unit = {
    val uriSplit = webHdfsUri.split("\\?")
    val fileOpenUri = uriSplit(0) + "/" + block.getOrElse("pathSuffix", "").toString + "?op=OPEN"
    val inputStream = new URL(fileOpenUri).openStream()
    val datumReader = new GenericDatumReader[Void]()
    val dataStreamReader = new DataFileStream(inputStream, datumReader)
    // val schema = dataStreamReader.getSchema()
    val dataIterator = dataStreamReader.iterator()
    while (dataIterator.hasNext) {
      println(" data : " + dataStreamReader.next())
    }
  }
}
ConsumptionExecutor:
object ConsumptionExecutor {
  val counter: AtomicLong = new AtomicLong()
  val executionContext: ExecutorService = Executors.newCachedThreadPool(new ThreadFactory {
    def newThread(r: Runnable): Thread = {
      val thread: Thread = new Thread(r)
      thread.setName("ConsumptionExecutor-" + counter.incrementAndGet())
      thread
    }
  })
  executionContext.asInstanceOf[ThreadPoolExecutor].setMaximumPoolSize(200)

  def execute(trigger: Runnable) {
    executionContext.execute(trigger)
  }
}
However, I want to use Akka Streams / Akka actors, so that I don't need to set a fixed thread pool size and Akka takes care of everything.
I am pretty new to Akka and the concepts of streams and actors. Can someone give me some leads, in the form of sample code, that fit my use case?
Thanks in advance!
An idea would be to create a (subclass) instance of ActorPublisher for each HDFS node that you are reading from, and then Merge them in as multiple Sources in a FlowGraph.
Something like this pseudo-code, where the details of the ActorPublisher sources are left out:
val g = PartialFlowGraph { implicit b =>
  import FlowGraphImplicits._
  val in1 = actorSource1
  val in2 = actorSource2
  // etc.
  val out = UndefinedSink[T]
  val merge = Merge[T]
  in1 ~> merge ~> out
  in2 ~> merge
  // etc.
}
This can be improved for a collection of actor sources by just iterating over them and adding an edge to the merge for each one, but this gives the idea.
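If a newer Akka Streams API is available (2.4+ assumed here), the same merge can be expressed more compactly with Source.combine; the per-node sources below are placeholders for whatever actually reads from each HDFS node:
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Merge, Sink, Source}

implicit val system = ActorSystem("hdfs-consumption")
implicit val materializer = ActorMaterializer()

// Placeholder sources; in practice each would wrap the WebHDFS read for one node.
val nodeSources: List[Source[String, _]] =
  List("node1", "node2", "node3").map(node => Source.single(s"data from $node"))

// Merge all per-node sources into a single stream and consume it.
val merged: Source[String, _] = nodeSources match {
  case s1 :: s2 :: rest => Source.combine(s1, s2, rest: _*)(Merge(_))
  case single :: Nil    => single
  case Nil              => Source.empty[String]
}

merged.runWith(Sink.foreach(println))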

How to save a list of case classes in Scala

I have a case class named Rdv:
case class Rdv(
  id: Option[Int],
  nom: String,
  prénom: String,
  sexe: Int,
  telPortable: String,
  telBureau: String,
  telPrivé: String,
  siteRDV: String,
  typeRDV: String,
  libelléRDV: String,
  numRDV: String,
  étape: String,
  dateRDV: Long,
  heureRDVString: String,
  statut: String,
  orderId: String)
and I would like to save a list of such elements on disk, and reload them later.
I tried with Java classes (ObjectOutputStream, FileOutputStream, ObjectInputStream, FileInputStream), but I get an error in the retrieval step: the statement
val n2 = ois.readObject().asInstanceOf[List[Rdv]]
always fails with an error (ClassNotFoundException: Rdv), although the correct path is given in the imports.
Do you know a workaround to save such an object?
Please provide a little piece of code!
Thanks,
Olivier
PS: I have the same error while using the Marshal class, as in this code:
object Application extends Controller {
  def index = Action {
    //implicit val Rdv2Writes = Json.writes[rdv2]
    def rdvTordv2(rdv: Rdv): rdv2 = new rdv2(
      rdv.nom,
      rdv.prénom,
      rdv.dateRDV,
      rdv.heureRDVString,
      rdv.telPortable,
      rdv.telBureau,
      rdv.telPrivé,
      rdv.siteRDV,
      rdv.typeRDV,
      rdv.libelléRDV,
      rdv.orderId,
      rdv.statut)
    val n = variables.manager.liste_locale
    val out = new FileOutputStream("out")
    out.write(Marshal.dump(n))
    out.close
    val in = new FileInputStream("out")
    val bytes = Stream.continually(in.read).takeWhile(-1 !=).map(_.toByte).toArray
    val bar: List[Rdv] = Marshal.load[List[Rdv]](bytes) <--------------
    val n3 = bar.map(rdv => rdvTordv2(rdv))
    println("n3:" + n3.size)
    Ok(views.html.Application.olivier2(n3))
  }
}
The error occurs on the line marked with the arrow.
It seems that the conversion to the type List[Rdv] encounters problems, but why? Is it a Play!-related problem?
OK, there is a problem with Play:
I created a new scala project with this code:
object Test1 extends App {
  // for testing purposes
  case class Person(name: String, age: Int)
  val liste_locale = List(new Person("paul", 18))
  val n = liste_locale
  val out = new FileOutputStream("out")
  out.write(Marshal.dump(n))
  out.close
  val in = new FileInputStream("out")
  val bytes = Stream.continually(in.read).takeWhile(-1 !=).map(_.toByte).toArray
  val bar: List[Person] = Marshal.load[List[Person]](bytes)
  println(s"bar:size=${bar.size}")
}
and the display is good ("bar:size=1").
Then I modified my previous code in the Play project, in the controller class, like this:
object Application extends Controller {
  def index = Action {
    // for testing purposes
    case class Person(name: String, age: Int)
    val liste_locale = List(new Person("paul", 18))
    val n = liste_locale
    val out = new FileOutputStream("out")
    out.write(Marshal.dump(n))
    out.close
    val in = new FileInputStream("out")
    val bytes = Stream.continually(in.read).takeWhile(-1 !=).map(_.toByte).toArray
    val bar: List[Person] = Marshal.load[List[Person]](bytes)
    println(s"bar:size=${bar.size}")
    Ok(views.html.Application.olivier2(Nil))
  }
}
and I have an error saying:
play.api.Application$$anon$1: Execution exception[[ClassNotFoundException: controllers.Application$$anonfun$index$1$Person$3]]
Is there anyone who has the answer?
Edit: I thought the error could come from sbt, so I modified build.scala like this:
import sbt._
import Keys._
import play.Project._

object ApplicationBuild extends Build {
  val appName = "sms_play_2"
  val appVersion = "1.0-SNAPSHOT"
  val appDependencies = Seq(
    // Add your project dependencies here,
    jdbc,
    anorm,
    "com.typesafe.slick" % "slick_2.10" % "2.0.0",
    "com.github.nscala-time" %% "nscala-time" % "0.6.0",
    "org.xerial" % "sqlite-jdbc" % "3.7.2",
    "org.quartz-scheduler" % "quartz" % "2.2.1",
    "com.esotericsoftware.kryo" % "kryo" % "2.22",
    "io.argonaut" %% "argonaut" % "6.0.2")
  val mySettings = Seq(
    (javaOptions in run) ++= Seq("-Dconfig.file=conf/dev.conf"))
  val playCommonSettings = Seq(
    Keys.fork := true)
  val main = play.Project(appName, appVersion, appDependencies).settings(
    Keys.fork in run := true,
    resolvers += Resolver.sonatypeRepo("snapshots")).settings(mySettings: _*)
    .settings(playCommonSettings: _*)
}
but without success; the error is still there (class Person not found).
Can you help me?
Scala Pickling has reasonable momentum and the approach has many advantages (lots of the heavy lifting is done at compile time). There is a pluggable serialization mechanism, and formats like JSON are supported.
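A minimal round-trip sketch based on the Scala Pickling README (version 0.10.x assumed; the case class and values are illustrative). Persisting would just mean writing the pickled JSON string to a file and reading it back:
import scala.pickling.Defaults._
import scala.pickling.json._

case class Person(name: String, age: Int)

object PicklingSketch extends App {
  val people = List(Person("paul", 18))

  val pickled = people.pickle        // a JSONPickle wrapping a JSON string
  val json: String = pickled.value   // plain String, suitable for writing to a file

  val restored = pickled.unpickle[List[Person]]
  println(s"restored size = ${restored.size}")
}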
