I am trying to write a Spark connector to pull AVRO messages off a RabbitMQ message queue. When decoding the AVRO messages, a NoSuchMethodError occurs, but only when running in Spark.
I could not reproduce the Spark code exactly outside of Spark, but I believe the two examples below are sufficiently similar, and I think this is the smallest code that reproduces the scenario.
I've removed all the connection parameters, both because the information is private and because the connection does not appear to be the issue.
Spark code:
package simpleexample

import org.apache.spark.SparkConf
import org.apache.spark.streaming.rabbitmq.distributed.RabbitMQDistributedKey
import org.apache.spark.streaming.rabbitmq.models.ExchangeAndRouting
import org.apache.spark.streaming.rabbitmq.RabbitMQUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.storage.StorageLevel
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import com.sksamuel.avro4s._
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import com.rabbitmq.client.QueueingConsumer.Delivery
import java.util.HashMap

case class AttributeTuple(attrName: String, attrValue: String)

// AVRO Schema for Events
case class DeviceEvent(
  tenantName: String,
  groupName: String,
  subgroupName: String,
  eventType: String,
  eventSource: String,
  deviceTypeName: String,
  deviceId: Int,
  timestamp: Long,
  attribute: AttributeTuple
)

object RabbitMonitor {
  def main(args: Array[String]) {
    println("start")
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("RabbitMonitor")
    val ssc = new StreamingContext(sparkConf, Seconds(60))

    def parseArrayEvent(delivery: Delivery): Seq[DeviceEvent] = {
      val in = new ByteArrayInputStream(delivery.getBody())
      val input = AvroInputStream.binary[DeviceEvent](in)
      input.iterator.toSeq
    }

    val params: Map[String, String] = Map(
      /* many rabbit connection parameters */
      "maxReceiveTime" -> "60000" // 60s
    )

    val distributedKey = Seq(
      RabbitMQDistributedKey(
        /* queue name */,
        new ExchangeAndRouting(/* exchange name */, /* routing key */),
        params
      )
    )

    var events = RabbitMQUtils.createDistributedStream[Seq[DeviceEvent]](ssc, distributedKey, params, parseArrayEvent)
    events.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
Non-Spark code:
package simpleexample

import com.thenewmotion.akka.rabbitmq._
import akka.actor._
// avoid name collision with rabbitmq channel
import scala.concurrent.{Channel => BasicChannel}
import scala.concurrent.ExecutionContext.Implicits.global
import com.sksamuel.avro4s._
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}

object Test extends App {
  implicit val system = ActorSystem()
  val factory = new ConnectionFactory()
  /* Set connection parameters */
  val exchange: String = /* exchange name */
  val connection: ActorRef = system.actorOf(ConnectionActor.props(factory), "rabbitmq")

  def setupSubscriber(channel: Channel, self: ActorRef) {
    val queue = channel.queueDeclare().getQueue
    channel.queueBind(queue, exchange, /* routing key */)
    val consumer = new DefaultConsumer(channel) {
      override def handleDelivery(consumerTag: String, envelope: Envelope, properties: BasicProperties, body: Array[Byte]) {
        val in = new ByteArrayInputStream(body)
        val input = AvroInputStream.binary[DeviceEvent](in)
        val result = input.iterator.toSeq
        println(result)
      }
    }
    channel.basicConsume(queue, true, consumer)
  }

  connection ! CreateChannel(ChannelActor.props(setupSubscriber), Some("eventSubscriber"))

  scala.concurrent.Future {
    def loop(n: Long) {
      Thread.sleep(1000)
      if (n < 30) {
        loop(n + 1)
      }
    }
    loop(0)
  }
}
Non-Spark Output (the last line is a successfully decoded update):
drex@drexThinkPad:~/src/scala/so-repro/connector/target/scala-2.11$ scala project.jar
[INFO] [03/02/2017 14:11:06.899] [default-akka.actor.default-dispatcher-4] [akka://default/deadLetters] Message [com.thenewmotion.akka.rabbitmq.ChannelCreated] from Actor[akka://default/user/rabbitmq#-889215077] to Actor[akka://default/deadLetters] was not delivered. [1] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
[INFO] [03/02/2017 14:11:07.337] [default-akka.actor.default-dispatcher-3] [akka://default/user/rabbitmq] akka://default/user/rabbitmq connected to amqp://<rabbit info>
[INFO] [03/02/2017 14:11:07.509] [default-akka.actor.default-dispatcher-4] [akka://default/user/rabbitmq/eventSubscriber] akka://default/user/rabbitmq/eventSubscriber connected
Stream(DeviceEvent(int,na,d01,deviceAttrUpdate,device,TestDeviceType,33554434,1488492704421,AttributeTuple(temperature,60)), ?)
Spark Output:
drex@drexThinkPad:~/src/scala/so-repro/connector/target/scala-2.11$ spark-submit ./project.jar --class RabbitMonitor
start
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/03/02 14:20:15 INFO SparkContext: Running Spark version 2.1.0
17/03/02 14:20:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/03/02 14:20:16 WARN Utils: Your hostname, drexThinkPad resolves to a loopback address: 127.0.1.1; using 192.168.1.11 instead (on interface wlp3s0)
17/03/02 14:20:16 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
17/03/02 14:20:16 INFO SecurityManager: Changing view acls to: drex
17/03/02 14:20:16 INFO SecurityManager: Changing modify acls to: drex
17/03/02 14:20:16 INFO SecurityManager: Changing view acls groups to:
17/03/02 14:20:16 INFO SecurityManager: Changing modify acls groups to:
17/03/02 14:20:16 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(drex); groups with view permissions: Set(); users with modify permissions: Set(drex); groups with modify permissions: Set()
17/03/02 14:20:16 INFO Utils: Successfully started service 'sparkDriver' on port 34701.
17/03/02 14:20:16 INFO SparkEnv: Registering MapOutputTracker
17/03/02 14:20:16 INFO SparkEnv: Registering BlockManagerMaster
17/03/02 14:20:16 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
17/03/02 14:20:16 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
17/03/02 14:20:16 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-5cbb13bf-78fe-4227-81b3-1afea40f899a
17/03/02 14:20:16 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
17/03/02 14:20:16 INFO SparkEnv: Registering OutputCommitCoordinator
17/03/02 14:20:16 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/03/02 14:20:16 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.1.11:4040
17/03/02 14:20:16 INFO SparkContext: Added JAR file:/home/drex/src/scala/so-repro/connector/target/scala-2.11/./project.jar at spark://192.168.1.11:34701/jars/project.jar with timestamp 1488493216614
17/03/02 14:20:16 INFO Executor: Starting executor ID driver on host localhost
17/03/02 14:20:16 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 33276.
17/03/02 14:20:16 INFO NettyBlockTransferService: Server created on 192.168.1.11:33276
17/03/02 14:20:16 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
17/03/02 14:20:16 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 192.168.1.11, 33276, None)
17/03/02 14:20:16 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.1.11:33276 with 366.3 MB RAM, BlockManagerId(driver, 192.168.1.11, 33276, None)
17/03/02 14:20:16 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 192.168.1.11, 33276, None)
17/03/02 14:20:16 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 192.168.1.11, 33276, None)
17/03/02 14:20:17 INFO RabbitMQDStream: Duration for remembering RDDs set to 60000 ms for org.apache.spark.streaming.rabbitmq.distributed.RabbitMQDStream@546621c4
17/03/02 14:20:17 INFO RabbitMQDStream: Slide time = 60000 ms
17/03/02 14:20:17 INFO RabbitMQDStream: Storage level = Memory Deserialized 1x Replicated
17/03/02 14:20:17 INFO RabbitMQDStream: Checkpoint interval = null
17/03/02 14:20:17 INFO RabbitMQDStream: Remember interval = 60000 ms
17/03/02 14:20:17 INFO RabbitMQDStream: Initialized and validated org.apache.spark.streaming.rabbitmq.distributed.RabbitMQDStream@546621c4
17/03/02 14:20:17 INFO ForEachDStream: Slide time = 60000 ms
17/03/02 14:20:17 INFO ForEachDStream: Storage level = Serialized 1x Replicated
17/03/02 14:20:17 INFO ForEachDStream: Checkpoint interval = null
17/03/02 14:20:17 INFO ForEachDStream: Remember interval = 60000 ms
17/03/02 14:20:17 INFO ForEachDStream: Initialized and validated org.apache.spark.streaming.dstream.ForEachDStream@49c6ddef
17/03/02 14:20:17 INFO RecurringTimer: Started timer for JobGenerator at time 1488493260000
17/03/02 14:20:17 INFO JobGenerator: Started JobGenerator at 1488493260000 ms
17/03/02 14:20:17 INFO JobScheduler: Started JobScheduler
17/03/02 14:20:17 INFO StreamingContext: StreamingContext started
17/03/02 14:21:00 INFO JobScheduler: Added jobs for time 1488493260000 ms
17/03/02 14:21:00 INFO JobScheduler: Starting job streaming job 1488493260000 ms.0 from job set of time 1488493260000 ms
17/03/02 14:21:00 INFO SparkContext: Starting job: print at RabbitMonitor.scala:94
17/03/02 14:21:00 INFO DAGScheduler: Got job 0 (print at RabbitMonitor.scala:94) with 1 output partitions
17/03/02 14:21:00 INFO DAGScheduler: Final stage: ResultStage 0 (print at RabbitMonitor.scala:94)
17/03/02 14:21:00 INFO DAGScheduler: Parents of final stage: List()
17/03/02 14:21:00 INFO DAGScheduler: Missing parents: List()
17/03/02 14:21:00 INFO DAGScheduler: Submitting ResultStage 0 (RabbitMQRDD[0] at createDistributedStream at RabbitMonitor.scala:93), which has no missing parents
17/03/02 14:21:00 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 2.7 KB, free 366.3 MB)
17/03/02 14:21:00 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1752.0 B, free 366.3 MB)
17/03/02 14:21:00 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.1.11:33276 (size: 1752.0 B, free: 366.3 MB)
17/03/02 14:21:00 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:996
17/03/02 14:21:00 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (RabbitMQRDD[0] at createDistributedStream at RabbitMonitor.scala:93)
17/03/02 14:21:00 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
17/03/02 14:21:00 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, ANY, 7744 bytes)
17/03/02 14:21:00 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
17/03/02 14:21:00 INFO Executor: Fetching spark://192.168.1.11:34701/jars/project.jar with timestamp 1488493216614
17/03/02 14:21:00 INFO TransportClientFactory: Successfully created connection to /192.168.1.11:34701 after 23 ms (0 ms spent in bootstraps)
17/03/02 14:21:00 INFO Utils: Fetching spark://192.168.1.11:34701/jars/project.jar to /tmp/spark-92b6ff6a-b120-4fd0-ba46-a450eff80636/userFiles-c0a334f3-68fc-495f-8ccd-cfe90e6d0bf8/fetchFileTemp2710654534934784726.tmp
17/03/02 14:21:00 INFO Executor: Adding file:/tmp/spark-92b6ff6a-b120-4fd0-ba46-a450eff80636/userFiles-c0a334f3-68fc-495f-8ccd-cfe90e6d0bf8/project.jar to class loader
<removing rabbit queue connection parameters>
17/03/02 14:21:02 INFO RabbitMQRDD: Receiving data in Partition 0 from
</removing rabbit queue connection parameters>
17/03/02 14:21:50 WARN BlockManager: Putting block rdd_0_0 failed due to an exception
17/03/02 14:21:50 WARN BlockManager: Block rdd_0_0 could not be removed as it was not found on disk or in memory
17/03/02 14:21:50 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.NoSuchMethodError: shapeless.Lazy.map(Lscala/Function1;)Lshapeless/Lazy;
at com.sksamuel.avro4s.SchemaFor$.recordBuilder(SchemaFor.scala:447)
at simpleexample.RabbitMonitor$$anon$3.<init>(RabbitMonitor.scala:70)
at simpleexample.RabbitMonitor$.simpleexample$RabbitMonitor$$parseArrayEvent$1(RabbitMonitor.scala:70)
at simpleexample.RabbitMonitor$$anonfun$15.apply(RabbitMonitor.scala:93)
at simpleexample.RabbitMonitor$$anonfun$15.apply(RabbitMonitor.scala:93)
at org.apache.spark.streaming.rabbitmq.distributed.RabbitMQRDD$RabbitMQRDDIterator$$anonfun$5.apply(RabbitMQRDD.scala:209)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.streaming.rabbitmq.distributed.RabbitMQRDD$RabbitMQRDDIterator.processDelivery(RabbitMQRDD.scala:209)
at org.apache.spark.streaming.rabbitmq.distributed.RabbitMQRDD$RabbitMQRDDIterator.getNext(RabbitMQRDD.scala:194)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:215)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:957)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:948)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:888)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:948)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:694)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
17/03/02 14:21:50 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): java.lang.NoSuchMethodError: shapeless.Lazy.map(Lscala/Function1;)Lshapeless/Lazy;
at com.sksamuel.avro4s.SchemaFor$.recordBuilder(SchemaFor.scala:447)
at simpleexample.RabbitMonitor$$anon$3.<init>(RabbitMonitor.scala:70)
at simpleexample.RabbitMonitor$.simpleexample$RabbitMonitor$$parseArrayEvent$1(RabbitMonitor.scala:70)
at simpleexample.RabbitMonitor$$anonfun$15.apply(RabbitMonitor.scala:93)
at simpleexample.RabbitMonitor$$anonfun$15.apply(RabbitMonitor.scala:93)
at org.apache.spark.streaming.rabbitmq.distributed.RabbitMQRDD$RabbitMQRDDIterator$$anonfun$5.apply(RabbitMQRDD.scala:209)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.streaming.rabbitmq.distributed.RabbitMQRDD$RabbitMQRDDIterator.processDelivery(RabbitMQRDD.scala:209)
at org.apache.spark.streaming.rabbitmq.distributed.RabbitMQRDD$RabbitMQRDDIterator.getNext(RabbitMQRDD.scala:194)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:215)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:957)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:948)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:888)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:948)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:694)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
17/03/02 14:21:50 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
17/03/02 14:21:50 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
17/03/02 14:21:50 INFO TaskSchedulerImpl: Cancelling stage 0
build.sbt:
retrieveManaged := true

lazy val sparkVersion = "2.1.0"

scalaVersion in ThisBuild := "2.11.8"

lazy val rabbit = (project in file("rabbit-plugin")).settings(
  name := "Spark Streaming RabbitMQ Receiver",
  homepage := Some(url("https://github.com/Stratio/RabbitMQ-Receiver")),
  description := "RabbitMQ-Receiver is a library that allows the user to read data with Apache Spark from RabbitMQ.",
  exportJars := true,
  assemblyJarName in assembly := "rabbit.jar",
  test in assembly := {},
  moduleName := "spark-rabbitmq",
  organization := "com.stratio.receive",
  version := "0.6.0",
  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
    "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",
    "org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
    "com.typesafe.akka" %% "akka-actor" % "2.4.11",
    "com.rabbitmq" % "amqp-client" % "3.6.6",
    "joda-time" % "joda-time" % "2.8.2",
    "com.github.sstone" %% "amqp-client" % "1.5" % Test,
    "org.scalatest" %% "scalatest" % "2.2.2" % Test,
    "org.scalacheck" %% "scalacheck" % "1.11.3" % Test,
    "junit" % "junit" % "4.12" % Test,
    "com.typesafe.akka" %% "akka-testkit" % "2.4.11" % Test
  )
)

lazy val root = (project in file("connector")).settings(
  name := "Connector from Rabbit to Kafka queue",
  description := "",
  exportJars := true,
  test in assembly := {},
  assemblyJarName in assembly := "project.jar",
  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
    "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",
    "com.thenewmotion" %% "akka-rabbitmq" % "3.0.0",
    "org.apache.kafka" % "kafka_2.10" % "0.10.1.1",
    "com.sksamuel.avro4s" %% "avro4s-core" % "1.6.4"
  )
) dependsOn rabbit
I am also using sbt-assembly to put together a "fat jar" for Spark (addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.4")), and the command sbt assembly produces the jar used in both examples above. I'm running Spark 2.1.0.
I'm relatively new to the Spark / Scala ecosystem, so hopefully this is just a problem with my build settings. It makes no sense to me that shapeless would be unavailable in Spark.
I had the same issue myself; I'm just adding more details for others facing it.
Error
Everything works fine until I deploy to the cluster. Then I get:
Exception in thread "main" java.lang.NoSuchMethodError: 'shapeless.DefaultSymbolicLabelling shapeless.DefaultSymbolicLabelling$.instance(shapeless.HList)'
Root Cause
Following the stack trace, I can tell it is related to the circe library. So I run the dependency tree query (make sure you have addDependencyTreePlugin in your ~/.sbt/1.0/plugins/plugins.sbt file; see the snippet at the end of this section):
❯ sbt "whatDependsOn com.chuusai shapeless_2.12"
[info] welcome to sbt 1.6.2 (Amazon.com Inc. Java 1.8.0_332)
[info] com.chuusai:shapeless_2.12:2.3.7 [S]
[info] +-io.circe:circe-generic_2.12:0.14.1 [S]
[info] +-***
But if I run the same query in the "provided" scope, I get:
❯ sbt provided:"whatDependsOn com.chuusai shapeless_2.12"
[info] welcome to sbt 1.6.2 (Amazon.com Inc. Java 1.8.0_332)
[info] com.chuusai:shapeless_2.12:2.3.3 [S]
[info] +-org.scalanlp:breeze_2.12:1.0 [S]
[info] +-org.apache.spark:spark-mllib-local_2.12:3.1.3
[info] | +-org.apache.spark:spark-graphx_2.12:3.1.3
[info] | | +-org.apache.spark:spark-mllib_2.12:3.1.3
[info] | | +-***
[info] | |
[info] | +-org.apache.spark:spark-mllib_2.12:3.1.3
[info] | +-***
[info] |
[info] +-org.apache.spark:spark-mllib_2.12:3.1.3
[info] +-***
As you can see, the instance function present in version 2.3.7 does not exist in version 2.3.3 (it was added in version 2.3.5):
https://javadoc.io/static/com.chuusai/shapeless_2.12/2.3.3/shapeless/DefaultSymbolicLabelling$.html
https://javadoc.io/static/com.chuusai/shapeless_2.12/2.3.7/shapeless/DefaultSymbolicLabelling$.html
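For reference, the whatDependsOn task above comes from sbt's built-in dependency-tree support (sbt 1.4+); enabling it only needs this single line in the plugins file mentioned earlier:
// ~/.sbt/1.0/plugins/plugins.sbt (global) or project/plugins.sbt (per build)
addDependencyTreePlugin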
Didn't work
Adding the dependency didn't fix my issue.
val CirceVersion = "0.14.1"
val ShapelessVersion = "2.3.7" // Circe 0.14.1 uses 2.3.7; Spark 3.1.3 uses 2.3.3
val SparkVersion = "3.1.3"
lazy val CirceDeps: Seq[ModuleID] = Seq(
  "io.circe" %% "circe-generic" % CirceVersion,
  /* Shapeless is one of the Spark dependencies. As Spark is provided, it is not included in the uber jar.
   * Adding the dependency explicitly to make sure we have the correct version at run-time.
   */
  "com.chuusai" %% "shapeless" % ShapelessVersion
)
I keep this in my code for documentation purposes only.
What worked
The main fix is actually to rename (shade) the shapeless library (see my comments in the code below and the answer I accepted).
/** Shapeless is one of the Spark dependencies. At run-time they clash, and Spark's shapeless package takes
 * precedence. This results in a run-time error, as shapeless 2.3.7 and 2.3.3 are not fully compatible.
 * Here, we are renaming the library so both can co-exist at run-time: Spark uses its own version and Circe also
 * uses its own version.
 */
// noinspection SbtDependencyVersionInspection
lazy val shadingRules: Def.Setting[Seq[ShadeRule]] =
  assembly / assemblyShadeRules := Seq(
    ShadeRule
      .rename("shapeless.**" -> "shadeshapless.@1")
      .inLibrary("com.chuusai" % "shapeless_2.12" % Dependencies.ShapelessVersion)
      .inProject,
    ShadeRule
      .rename("shapeless.**" -> "shadeshapless.@1")
      .inLibrary("io.circe" % "circe-generic_2.12" % Dependencies.CirceVersion)
      .inProject
  )
Update 2022-08-20
Based on @denis-arnaud's comment, here is a simpler version from pureconfig:
assembly / assemblyShadeRules := Seq(ShadeRule.rename("shapeless.**" -> "new_shapeless.@1").inAll)
I guess the simple one works for most situations. The more complex one is good when there are different versions of shapeless on the classpath and you'd like to rename them per library.
zero323 has the right answer as far as I can tell. Spark 2.1.0 has a dependency that itself depends on Shapeless 2.0.0.
This problem could be solved in one of two ways: import the dependency that pulls in shapeless and shade shapeless, or use a different Avro library. I went with the latter solution.
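For completeness, the shading route in the original build would look roughly like the rules from the answer above, applied in the connector project's assembly settings. This is an untested sketch for sbt-assembly, not the configuration I actually used:
// Rename shapeless inside the fat jar so the copy pulled in by avro4s
// cannot clash with the older shapeless already on Spark's classpath.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("shapeless.**" -> "shadeshapeless.@1").inAll
)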
Related
I am running a Spark structured streaming job (it is bounced every day) in EMR. After a few hours of execution my application hits an OOM error and gets killed. The following are my configuration and Spark SQL code.
I am new to Spark and need your valuable input.
The EMR cluster has 10 instances, each with 16 cores and 64 GB of memory.
Spark-Submit arguments:
num_of_executors: 17
executor_cores: 5
executor_memory: 19G
driver_memory: 30G
The job reads input as micro-batches from Kafka at an interval of 30 seconds. The average number of rows read per batch is 90k.
spark.streaming.kafka.maxRatePerPartition: 4500
spark.streaming.stopGracefullyOnShutdown: true
spark.streaming.unpersist: true
spark.streaming.kafka.consumer.cache.enabled: true
spark.hadoop.fs.s3.maxRetries: 30
spark.sql.shuffle.partitions: 2001
Spark SQL aggregation code:
dataset.groupBy(functions.col(NAME), functions.window(functions.column(TIMESTAMP_COLUMN), "30 seconds"))
       .agg(functions.concat_ws(SPLIT, functions.collect_list(DEPARTMENT)).as(DEPS))
       .select(NAME, DEPS)
       .map((row) -> {
           Map<String, Object> map = Maps.newHashMap();
           map.put(NAME, row.getString(0));
           map.put(DEPS, row.getString(1));
           return new KryoMapSerializationService().serialize(map);
       }, Encoders.BINARY());
Some logs from the driver:
20/04/04 13:10:51 INFO TaskSetManager: Finished task 1911.0 in stage 1041.0 (TID 1052055) in 374 ms on <host> (executor 3) (1998/2001)
20/04/04 13:10:52 INFO TaskSetManager: Finished task 1925.0 in stage 1041.0 (TID 1052056) in 411 ms on <host> (executor 3) (1999/2001)
20/04/04 13:10:52 INFO TaskSetManager: Finished task 1906.0 in stage 1041.0 (TID 1052054) in 776 ms on <host> (executor 3) (2000/2001)
20/04/04 13:11:04 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 3.
20/04/04 13:11:04 INFO DAGScheduler: Executor lost: 3 (epoch 522)
20/04/04 13:11:04 INFO BlockManagerMasterEndpoint: Trying to remove executor 3 from BlockManagerMaster.
20/04/04 13:11:04 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(3, <host>, 38533, None)
20/04/04 13:11:04 INFO BlockManagerMaster: Removed 3 successfully in removeExecutor
20/04/04 13:11:04 INFO YarnAllocator: Completed container container_1582797414408_1814_01_000004 on host: <host> (state: COMPLETE, exit status: 143)
And by the way, I am using collectAsList in my foreachBatch code:
List<Event> list = dataset.select("value")
        .selectExpr("deserialize(value) as rows")
        .select("rows.*")
        .selectExpr(NAME, DEPS)
        .as(Encoders.bean(Event.class))
        .collectAsList();
With these settings, you may be causing your own issues.
num_of_executors: 17
executor_cores: 5
executor_memory: 19G
driver_memory: 30G
You are basically creating extra containers here that you then have to shuffle between. Instead, start off with something like 10 executors, 15 cores, and 60g of memory. If that works, then you can tweak these a bit to try to optimize performance. I usually try splitting my containers in half each step (but I also haven't needed to do this since Spark 2.0).
Let Spark SQL keep the default at 200. The more you break this up, the more math you make Spark do to calculate the shuffles. If anything, I'd try to go with the same parallelism as the number of executors you have, so in this case just 10 (see the sketch below). This is how you would tune Hive queries when 2.0 came out.
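For illustration, that starting point could be expressed as configuration roughly like this (a sketch using the numbers above; in practice these are usually passed as spark-submit arguments):
import org.apache.spark.SparkConf

// Fewer, larger executors, and shuffle parallelism close to the executor count
val conf = new SparkConf()
  .set("spark.executor.instances", "10")
  .set("spark.executor.cores", "15")
  .set("spark.executor.memory", "60g")
  .set("spark.sql.shuffle.partitions", "10") // or simply leave it at the default of 200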
Making the job more complex just to break it up puts all the load on the master.
Using Datasets and encoders is also generally not as performant as straight DataFrame operations. I have found great lifts in performance by factoring such code out into DataFrame operations.
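To make that last point concrete, here is a rough Scala sketch of the same aggregation kept in untyped DataFrame/Column form; the column names are the question's placeholders, and the final Kryo serialization step would still need an encoder:
import org.apache.spark.sql.functions._

// Untyped Column expressions only; no bean or Kryo encoders until the final serialize/write step
val aggregated = dataset
  .groupBy(col("name"), window(col("timestamp"), "30 seconds"))
  .agg(concat_ws("|", collect_list(col("department"))).as("deps"))
  .select("name", "deps")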
I have some code written in Java which uses Apache Spark, and I want to disable all Log4j log message levels (ERROR, WARN, etc.) and keep only the INFO-level ones that contain a specific string. In other words, I have these logs:
19/04/21 19:09:40 INFO Instrumentation: [e10c0eb5] {"seed":26,"impurity":"entropy","featuresCol":"indexedFeatures","maxDepth":5,"labelCol":"indexedLabel","numTrees":10}
19/04/21 19:09:40 INFO Instrumentation: [752ad4c3] {"seed":26,"impurity":"entropy","featuresCol":"indexedFeatures","maxDepth":5,"labelCol":"indexedLabel","numTrees":12}
19/04/21 19:09:40 INFO Instrumentation: [d9d09329] {"seed":26,"impurity":"entropy","featuresCol":"indexedFeatures","maxDepth":5,"labelCol":"indexedLabel","numTrees":11}
19/04/21 19:09:40 INFO SparkContext: Starting job: take at DecisionTreeMetadata.scala:112
19/04/21 19:09:40 INFO SparkContext: Starting job: take at DecisionTreeMetadata.scala:112
19/04/21 19:09:40 INFO SparkContext: Starting job: take at DecisionTreeMetadata.scala:112
19/04/21 19:09:40 INFO DAGScheduler: Got job 5 (take at DecisionTreeMetadata.scala:112) with 1 output partitions
19/04/21 19:09:40 INFO DAGScheduler: Final stage: ResultStage 6 (take at DecisionTreeMetadata.scala:112)
19/04/21 19:09:40 INFO DAGScheduler: Parents of final stage: List()
19/04/21 19:09:40 INFO DAGScheduler: Missing parents: List()
I want to keep only those that start with "INFO Instrumentation".
I have this sample code:
/*Logger.getLogger("org").setLevel(Level.OFF);
Logger.getLogger("akka").setLevel(Level.OFF);*/

SparkSession sparkSession = new SparkSession
        .Builder()
        .appName("Random Forest Classifier")
        .master("local[*]")
        .config("spark.ui.port", "40000")
        .getOrCreate();
I want to change the first two commented lines in order to apply my filter. Is that even possible, and if so, how do I do it?
I have solved my problem. It turned out that I do not need any filter or appender: I just disable all the logs for both "org" and "akka", then enable only the INFO level for the class "org.apache.spark.ml.util", like this:
Logger.getLogger("org").setLevel(Level.OFF);
Logger.getLogger("akka").setLevel(Level.OFF);
Logger.getLogger("org.apache.spark.ml.util").setLevel(Level.INFO);
Thanks for your help.
You can add a filter to the appenders whose messages you wish to ignore. The filter implementation rejects messages that do not conform to a given predicate, such as the following:
import org.apache.log4j.Level;
import org.apache.log4j.spi.Filter;
import org.apache.log4j.spi.LoggingEvent;

public class MyLog4jFilter extends Filter {

    /**
     * Custom filter to only log INFO events with the 'Instrumentation:' prefix in their message
     */
    @Override
    public int decide(LoggingEvent event) {
        if (event.getLevel() == Level.INFO && event.getRenderedMessage().trim().startsWith("Instrumentation:"))
            return ACCEPT;
        else
            return DENY;
    }
}
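To actually take effect, the filter still has to be attached to the appender(s) that print these messages. A minimal sketch (in Scala, like the rest of this post's Scala snippets), assuming MyLog4jFilter above is placed in an importable package (here hypothetically called myfilters) and that the relevant appenders hang off the root logger:
import org.apache.log4j.{Appender, Logger}

// Attach the custom filter to every appender registered on the root logger
val appenders = Logger.getRootLogger.getAllAppenders
while (appenders.hasMoreElements) {
  appenders.nextElement().asInstanceOf[Appender].addFilter(new myfilters.MyLog4jFilter())
}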
References:
Log4j Levels Example – Order, Priority, Custom Filters
Log4j - LoggingEvent
Log4j manual (v1.2)
In this piece of code, the length of the ListBuffer items is shown correctly at comment 1, but the code at comment 2 never executes. Why does this happen?
val conf = new SparkConf().setAppName("app").setMaster("local")
val sc = new SparkContext(conf)
var wktReader: WKTReader = new WKTReader()

val dataSet = sc.textFile("dataSet.txt")
val items = new ListBuffer[String]()

dataSet.foreach { e =>
  items += e
  println("len = " + items.length) //1. here length is ok
}

println("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!")
items.foreach { x => print(x) } //2. this code never executes
Logs are here:
16/11/20 01:16:52 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/11/20 01:16:52 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.56.1:4040
16/11/20 01:16:53 INFO Executor: Starting executor ID driver on host localhost
16/11/20 01:16:53 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 58608.
16/11/20 01:16:53 INFO NettyBlockTransferService: Server created on 192.168.56.1:58608
16/11/20 01:16:53 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 192.168.56.1, 58608)
16/11/20 01:16:53 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.56.1:58608 with 347.1 MB RAM, BlockManagerId(driver, 192.168.56.1, 58608)
16/11/20 01:16:53 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 192.168.56.1, 58608)
Starting app
16/11/20 01:16:57 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 139.6 KB, free 347.0 MB)
16/11/20 01:16:58 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 15.9 KB, free 346.9 MB)
16/11/20 01:16:58 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.56.1:58608 (size: 15.9 KB, free: 347.1 MB)
16/11/20 01:16:58 INFO SparkContext: Created broadcast 0 from textFile at main.scala:25
16/11/20 01:16:58 INFO FileInputFormat: Total input paths to process : 1
16/11/20 01:16:58 INFO SparkContext: Starting job: foreach at main.scala:28
16/11/20 01:16:58 INFO DAGScheduler: Got job 0 (foreach at main.scala:28) with 1 output partitions
16/11/20 01:16:58 INFO DAGScheduler: Final stage: ResultStage 0 (foreach at main.scala:28)
16/11/20 01:16:58 INFO DAGScheduler: Parents of final stage: List()
16/11/20 01:16:58 INFO DAGScheduler: Missing parents: List()
16/11/20 01:16:58 INFO DAGScheduler: Submitting ResultStage 0 (dataSet.txt MapPartitionsRDD[1] at textFile at main.scala:25), which has no missing parents
16/11/20 01:16:58 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.3 KB, free 346.9 MB)
16/11/20 01:16:58 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2034.0 B, free 346.9 MB)
16/11/20 01:16:58 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.56.1:58608 (size: 2034.0 B, free: 347.1 MB)
16/11/20 01:16:58 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1012
16/11/20 01:16:59 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (dataSet.txt MapPartitionsRDD[1] at textFile at main.scala:25)
16/11/20 01:16:59 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
16/11/20 01:16:59 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0, PROCESS_LOCAL, 5427 bytes)
16/11/20 01:16:59 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
16/11/20 01:16:59 INFO HadoopRDD: Input split: file:/D:/dataSet.txt:0+291
16/11/20 01:16:59 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
16/11/20 01:16:59 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
16/11/20 01:16:59 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
16/11/20 01:16:59 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
16/11/20 01:16:59 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
len = 1
len = 2
len = 3
len = 4
len = 5
len = 6
len = 7
16/11/20 01:16:59 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 989 bytes result sent to driver
16/11/20 01:16:59 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 417 ms on localhost (1/1)
16/11/20 01:16:59 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/11/20 01:16:59 INFO DAGScheduler: ResultStage 0 (foreach at main.scala:28) finished in 0,456 s
16/11/20 01:16:59 INFO DAGScheduler: Job 0 finished: foreach at main.scala:28, took 0,795126 s
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
16/11/20 01:16:59 INFO SparkContext: Invoking stop() from shutdown hook
16/11/20 01:16:59 INFO SparkUI: Stopped Spark web UI at http://192.168.56.1:4040
16/11/20 01:16:59 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/11/20 01:16:59 INFO MemoryStore: MemoryStore cleared
16/11/20 01:16:59 INFO BlockManager: BlockManager stopped
16/11/20 01:16:59 INFO BlockManagerMaster: BlockManagerMaster stopped
16/11/20 01:16:59 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/11/20 01:16:59 INFO SparkContext: Successfully stopped SparkContext
16/11/20 01:16:59 INFO ShutdownHookManager: Shutdown hook called
16/11/20 01:16:59 INFO ShutdownHookManager: Deleting directory
Apache Spark doesn't provide shared memory; therefore, here:
dataSet.foreach { e =>
  items += e
  println("len = " + items.length) //1. here length is ok
}
you modify a local copy of items on the respective executor. The original items list defined on the driver is not modified. As a result, this:
items.foreach { x => print(x) }
executes, but there is nothing to print.
Please check Understanding closures
While it would not be recommended here, you could replace items with an accumulator:
val acc = sc.collectionAccumulator[String]("Items")
dataSet.foreach(e => acc.add(e))
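After the action has finished, the accumulated values can be read back on the driver. A minimal sketch, using the acc defined above (CollectionAccumulator.value returns a java.util.List):
import scala.collection.JavaConverters._

// Read the accumulator on the driver once the foreach action has completed
val collected: Seq[String] = acc.value.asScala.toSeq
println(collected)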
Spark runs closures on executors and returns the results to the driver, so the code above doesn't work as intended. If you need to add the elements from foreach, you need to collect the data on the driver and then add it to the current collection. But collecting the data is a bad idea when you have large data.
val items = new ListBuffer[String]()
val rdd = spark.sparkContext.parallelize(1 to 10, 4)
rdd.collect().foreach(data => items += data.toString())
println(items)
Output:
ListBuffer(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
While performing a join or any other operation between persisted datasets and non-persisted datasets, the Spark server throws "Remote RPC client disassociated". The following is the piece of code causing the issue.
Dataset<Row> dsTableA = sparkSession.read().format("jdbc").options(dbConfig)
        .option("dbTable", "SELECT * FROM tableA").load().persist(StorageLevel.MEMORY_AND_DISK_SER());

Dataset<Row> dsTableB = sparkSession.read().format("jdbc").options(dbConfig)
        .option("dbTable", "SELECT * FROM tableB").load().persist(StorageLevel.MEMORY_AND_DISK_SER());

Dataset<Row> anotherTableA = sparkSession.read().format("jdbc").options(dbConfig)
        .option("dbTable", "SELECT * FROM tableC").load();

anotherTableA.write().format("json").save("/path/toJsonA"); // Working fine - no use of persisted datasets

Dataset<Row> anotherTableB = sparkSession.read().format("jdbc").options(dbConfig)
        .option("dbTable", "SELECT * FROM tableD").load();

dsTableA.createOrReplaceTempView("dsTableA");
dsTableB.createOrReplaceTempView("dsTableB");
anotherTableB.createOrReplaceTempView("anotherTableB");

Dataset<Row> joinedTable = sparkSession.sql("select atb.* from anotherTableB atb INNER JOIN dsTableA dsta ON atb.pid=dsta.pid LEFT JOIN dsTableB dstb ON atb.ssid=dstb.ssid");

joinedTable.write().format("json").save("/path/toJsonB");
// ERROR: Remote RPC client disassociated
// Working fine if the datasets dsTableA and dsTableB are not persisted
Part of the logs:
INFO TaskSetManager: Starting task 0.0 in stage 17.0 (TID 111, X.X.X.X, partition 0, PROCESS_LOCAL, 5342 bytes)
INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Launching task 111 on executor id: 0 hostname: X.X.X.X.
INFO BlockManagerInfo: Added broadcast_13_piece0 in memory on X.X.X.X:37153 (size: 12.9 KB, free: X.2 GB)
INFO BlockManagerInfo: Added broadcast_12_piece0 in memory on X.X.X.X:37153 (size: 52.0 KB, free: X.2 GB)
ERROR TaskSchedulerImpl: Lost executor 0 on X.X.X.X: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-12121212121211-0000/0 is now EXITED (Command exited with code 134)
WARN TaskSetManager: Lost task 0.0 in stage 17.0 (TID 111, X.X.X.X): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
INFO StandaloneSchedulerBackend: Executor app-12121212121211-0000/0 removed: Command exited with code 134
INFO DAGScheduler: Executor lost: 0 (epoch 8)
If the datasets dsTableA and dsTableB are not persisted, everything works smoothly. But I must use persisted datasets. So how can I solve this problem?
I want to use foreachPartition to save data to my database, but I noticed that this function is not working.
RDD2.foreachRDD(new VoidFunction<JavaRDD<Object>>() {
    @Override
    public void call(JavaRDD<Object> t) throws Exception {
        t.foreachPartition(new VoidFunction<Iterator<Object>>() {
            @Override
            public void call(Iterator<Object> t) throws Exception {
                System.out.println("test");
            }
        });
    }
});
When I run this example, my Spark program gets blocked at these steps, without processing the other RDDs or even printing "test":
16/05/30 10:18:41 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1006
16/05/30 10:18:41 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
16/05/30 10:18:41 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 2946 bytes)
16/05/30 10:18:41 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
16/05/30 10:18:41 INFO SparkContext: Starting job: foreachPartition at BrokerSpout.java:265
16/05/30 10:18:41 INFO RecurringTimer: Started timer for BlockGenerator at time 1464596321600
-------------------------------------------
Time: 1464596321500 ms
-------------------------------------------
16/05/30 10:18:41 INFO ReceivedBlockTracker: Deleting batches ArrayBuffer()
16/05/30 10:18:41 INFO ReceiverTracker: Registered receiver for stream 0 from 10.25.30.41:59407
16/05/30 10:18:41 INFO InputInfoTracker: remove old batch metadata:
16/05/30 10:18:41 INFO ReceiverSupervisorImpl: Starting receiver
16/05/30 10:18:41 INFO ReceiverSupervisorImpl: Called receiver onStart
16/05/30 10:18:41 INFO ReceiverSupervisorImpl: Waiting for receiver to be stopped
16/05/30 10:18:42 INFO SparkContext: Starting job: foreachPartition at BrokerSpout.java:265
16/05/30 10:18:42 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
16/05/30 10:18:42 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
As you can see in my logging, it says that Spark is "Waiting for receiver to be stopped", but my receiver must not be stopped; otherwise, what would be the purpose of Spark Streaming if we have to stop the sender?