Spark : forEachPartition not working

I want to use foreachPartition to save data to my database, but I noticed that this function does not seem to work:
RDD2.foreachRDD(new VoidFunction<JavaRDD<Object>>() {
    @Override
    public void call(JavaRDD<Object> t) throws Exception {
        t.foreachPartition(new VoidFunction<Iterator<Object>>() {
            @Override
            public void call(Iterator<Object> t) throws Exception {
                System.out.println("test");
            }
        });
    }
});
When I run this example, my Spark program blocks at the following steps, without showing any other RDDs or even printing test:
16/05/30 10:18:41 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1006
16/05/30 10:18:41 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
16/05/30 10:18:41 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 2946 bytes)
16/05/30 10:18:41 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
16/05/30 10:18:41 INFO SparkContext: Starting job: foreachPartition at BrokerSpout.java:265
16/05/30 10:18:41 INFO RecurringTimer: Started timer for BlockGenerator at time 1464596321600
-------------------------------------------
Time: 1464596321500 ms
-------------------------------------------
16/05/30 10:18:41 INFO ReceivedBlockTracker: Deleting batches ArrayBuffer()
16/05/30 10:18:41 INFO ReceiverTracker: Registered receiver for stream 0 from 10.25.30.41:59407
16/05/30 10:18:41 INFO InputInfoTracker: remove old batch metadata:
16/05/30 10:18:41 INFO ReceiverSupervisorImpl: Starting receiver
16/05/30 10:18:41 INFO ReceiverSupervisorImpl: Called receiver onStart
16/05/30 10:18:41 INFO ReceiverSupervisorImpl: Waiting for receiver to be stopped
16/05/30 10:18:42 INFO SparkContext: Starting job: foreachPartition at BrokerSpout.java:265
16/05/30 10:18:42 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
16/05/30 10:18:42 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
As you can see in my log, Spark says it is "Waiting for receiver to be stopped". But my receiver must not be stopped; what would be the purpose of Spark Streaming if we had to stop the sender?

Related

apoc.periodic.iterate fails with exception: java.util.concurrent.RejectedExecutionException

I am trying to run the annotation function of GraphAware within Neo4j (see documentation here). I have a set of 5000 nodes (KnowledgeArticles) with textual data in the content property. To annotate those, I run the following query in Neo4j Desktop:
CALL apoc.periodic.iterate(
"MATCH (n:KnowledgeArticle) RETURN n",
"CALL ga.nlp.annotate({text: n.content, id: id(n)})
YIELD result MERGE (n)-[:HAS_ANNOTATED_TEXT]->(result)", {batchSize:1, iterateList:true})
After annotating approximately 200 to 300 KnowledgeArticles the database shuts down and provides the error:
Neo.ClientError.Procedure.ProcedureCallFailed: Failed to invoke procedure `apoc.periodic.iterate`: Caused by:
java.util.concurrent.RejectedExecutionException: Task
java.util.concurrent.FutureTask@373b81ee rejected from
java.util.concurrent.ThreadPoolExecutor@285a2901[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 288]
I have experimented using different values for batchSize or setting iterateList to false, but none of this helped.
I have also tried limiting the iterate call above to only 150 nodes. This works fine the first time I call it, but when I run it a second time it fails with the same error, again reporting around 200 to 300 completed tasks. The processor behind it thus seems to 'remember' the total number of tasks it has run since the database was first started.
Could you help me resolve this issue? I want to run the above query not necessarily from Neo4j Desktop, but eventually from Python with py2neo, using graph.run([iterate-query]). If there is any way of solving this from Python, that would be even better.
Thank you!
PS. The debug log provides the following output (as of the last few iterations of the annotation up until the shut down):
2019-05-21 12:46:10.359+0000 INFO [c.g.n.p.p.AnnotatedTextPersister] Start storing annotatedText 251906
2019-05-21 12:46:13.784+0000 INFO [c.g.n.p.p.AnnotatedTextPersister] end storing annotatedText 251906. It took: 3425
2019-05-21 12:46:13.786+0000 INFO [c.g.n.e.EventDispatcher] Notifying listeners for event {}
2019-05-21 12:46:13.788+0000 INFO [c.g.n.e.EventDispatcher] Notifying listeners for event {}
2019-05-21 12:46:13.800+0000 INFO [c.g.n.u.ProcessorUtils] Taking default pipeline from configuration : myPipeline
2019-05-21 12:46:13.868+0000 INFO [c.g.n.p.s.StanfordTextProcessor] Time for pipeline annotation (myPipeline): 67. Text length: 954
2019-05-21 12:46:13.869+0000 INFO [c.g.n.NLPManager] Time to annotate 68
2019-05-21 12:46:13.869+0000 INFO [c.g.n.e.EventDispatcher] Notifying listeners for event {}
2019-05-21 12:46:13.869+0000 INFO [c.g.n.p.p.AnnotatedTextPersister] Start storing annotatedText 251907
2019-05-21 12:46:15.848+0000 INFO [c.g.n.p.p.AnnotatedTextPersister] end storing annotatedText 251907. It took: 1978
2019-05-21 12:46:15.848+0000 INFO [c.g.n.e.EventDispatcher] Notifying listeners for event {}
2019-05-21 12:46:15.862+0000 INFO [c.g.n.e.EventDispatcher] Notifying listeners for event {}
2019-05-21 12:46:15.915+0000 INFO [c.g.n.u.ProcessorUtils] Taking default pipeline from configuration : myPipeline
2019-05-21 12:46:16.294+0000 INFO [c.g.n.p.s.StanfordTextProcessor] Time for pipeline annotation (myPipeline): 378. Text length: 2641
2019-05-21 12:46:16.295+0000 INFO [c.g.n.NLPManager] Time to annotate 379
2019-05-21 12:46:16.296+0000 INFO [c.g.n.e.EventDispatcher] Notifying listeners for event {}
2019-05-21 12:46:16.296+0000 INFO [c.g.n.p.p.AnnotatedTextPersister] Start storing annotatedText 251908
2019-05-21 12:46:16.421+0000 INFO [o.n.k.a.DatabaseAvailabilityGuard] Database graph.db is unavailable.
2019-05-21 12:46:17.018+0000 INFO [c.g.s.f.b.GraphAwareServerBootstrapper] stopped
2019-05-21 12:46:17.020+0000 INFO [o.n.g.f.GraphDatabaseFacadeFactory] Shutdown started
2019-05-21 12:46:17.149+0000 INFO [o.n.g.f.GraphDatabaseFacadeFactory] Shutting down 'graph.db' database.
2019-05-21 12:46:17.150+0000 INFO [o.n.g.f.GraphDatabaseFacadeFactory] Shutdown started
2019-05-21 12:46:17.164+0000 INFO [o.n.b.i.BackupServer] BackupServer communication server shutting down and unbinding from /127.0.0.1:6362
2019-05-21 12:46:17.226+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by database shutdown @ txId: 7720 checkpoint started...
2019-05-21 12:46:17.247+0000 INFO [o.n.k.i.s.c.CountsTracker] Rotated counts store at transaction 7720 to [/Users/{my.user.name}/Library/Application Support/Neo4j Desktop/Application/neo4jDatabases/database-e2babea7-0332-4c2c-bf1d-076d4feed49a/installation-3.5.4/data/databases/graph.db/neostore.counts.db.a], from [/Users/{my.user.name}/Library/Application Support/Neo4j Desktop/Application/neo4jDatabases/database-e2babea7-0332-4c2c-bf1d-076d4feed49a/installation-3.5.4/data/databases/graph.db/neostore.counts.db.b].
2019-05-21 12:46:17.644+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by database shutdown @ txId: 7720 checkpoint completed in 418ms
2019-05-21 12:46:17.647+0000 INFO [o.n.k.i.t.l.p.LogPruningImpl] No log version pruned, last checkpoint was made in version 3
2019-05-21 12:46:17.698+0000 INFO [o.n.i.d.DiagnosticsManager] --- STOPPING diagnostics START ---
2019-05-21 12:46:17.700+0000 INFO [o.n.i.d.DiagnosticsManager] --- STOPPING diagnostics END ---
2019-05-21 12:46:17.706+0000 INFO [c.g.r.BaseGraphAwareRuntime] Shutting down GraphAware Runtime...
2019-05-21 12:46:17.709+0000 INFO [c.g.r.m.BaseModuleManager] Shutting down module UIDM
2019-05-21 12:46:17.709+0000 INFO [c.g.r.m.BaseModuleManager] Shutting down module NLP
2019-05-21 12:46:17.712+0000 INFO [c.g.r.s.RotatingTaskScheduler] Terminating task scheduler...
2019-05-21 12:46:17.712+0000 INFO [c.g.r.s.RotatingTaskScheduler] Task scheduler terminated successfully.
2019-05-21 12:46:17.714+0000 INFO [c.g.r.BaseGraphAwareRuntime] GraphAware Runtime shut down.
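
A note on reading the error message quoted in the question: the text in square brackets is the standard state summary of a java.util.concurrent.ThreadPoolExecutor. "Terminated" means the pool had already been shut down when the task was submitted, and "completed tasks" counts everything that pool has run since it was created, not only the current call. The plain-Java sketch below reproduces such a message outside Neo4j. The class and method names are made up for illustration, and which pool APOC or GraphAware actually uses cannot be told from the logs above.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class, not part of Neo4j, APOC or GraphAware.
class TerminatedPoolDemo {

    // Runs `tasks` no-op tasks, shuts the pool down, then submits one more.
    // Returns the message of the resulting RejectedExecutionException.
    static String submitAfterShutdown(int tasks) {
        ThreadPoolExecutor pool = (ThreadPoolExecutor) Executors.newFixedThreadPool(2);
        for (int i = 0; i < tasks; i++) {
            pool.submit(() -> { });
        }
        pool.shutdown(); // already queued tasks still run to completion
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        try {
            pool.submit(() -> { }); // the pool is terminated, so this is rejected
            return "accepted";
        } catch (RejectedExecutionException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(submitAfterShutdown(5));
    }
}
```

Read this way, the exception may be a symptom of the database shutting down (the debug log shows graph.db becoming unavailable just before) rather than its cause, so the entries leading up to the shutdown are probably the better place to look.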

Iterate through Dataset<Row> in Spark using mapPartitions and change the row data in Java [duplicate]

In this piece of code, the length of the ListBuffer items is printed correctly at comment 1, but the code at comment 2 never seems to execute. Why does this happen?
val conf = new SparkConf().setAppName("app").setMaster("local")
val sc = new SparkContext(conf)
var wktReader: WKTReader = new WKTReader();
val dataSet = sc.textFile("dataSet.txt")
val items = new ListBuffer[String]()
dataSet.foreach { e =>
  items += e
  println("len = " + items.length) //1. here length is ok
}
println("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!")
items.foreach { x => print(x) } //2. this code does not seem to be executed
Logs are here:
16/11/20 01:16:52 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/11/20 01:16:52 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.56.1:4040
16/11/20 01:16:53 INFO Executor: Starting executor ID driver on host localhost
16/11/20 01:16:53 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 58608.
16/11/20 01:16:53 INFO NettyBlockTransferService: Server created on 192.168.56.1:58608
16/11/20 01:16:53 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 192.168.56.1, 58608)
16/11/20 01:16:53 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.56.1:58608 with 347.1 MB RAM, BlockManagerId(driver, 192.168.56.1, 58608)
16/11/20 01:16:53 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 192.168.56.1, 58608)
Starting app
16/11/20 01:16:57 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 139.6 KB, free 347.0 MB)
16/11/20 01:16:58 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 15.9 KB, free 346.9 MB)
16/11/20 01:16:58 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.56.1:58608 (size: 15.9 KB, free: 347.1 MB)
16/11/20 01:16:58 INFO SparkContext: Created broadcast 0 from textFile at main.scala:25
16/11/20 01:16:58 INFO FileInputFormat: Total input paths to process : 1
16/11/20 01:16:58 INFO SparkContext: Starting job: foreach at main.scala:28
16/11/20 01:16:58 INFO DAGScheduler: Got job 0 (foreach at main.scala:28) with 1 output partitions
16/11/20 01:16:58 INFO DAGScheduler: Final stage: ResultStage 0 (foreach at main.scala:28)
16/11/20 01:16:58 INFO DAGScheduler: Parents of final stage: List()
16/11/20 01:16:58 INFO DAGScheduler: Missing parents: List()
16/11/20 01:16:58 INFO DAGScheduler: Submitting ResultStage 0 (dataSet.txt MapPartitionsRDD[1] at textFile at main.scala:25), which has no missing parents
16/11/20 01:16:58 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.3 KB, free 346.9 MB)
16/11/20 01:16:58 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2034.0 B, free 346.9 MB)
16/11/20 01:16:58 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.56.1:58608 (size: 2034.0 B, free: 347.1 MB)
16/11/20 01:16:58 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1012
16/11/20 01:16:59 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (dataSet.txt MapPartitionsRDD[1] at textFile at main.scala:25)
16/11/20 01:16:59 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
16/11/20 01:16:59 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0, PROCESS_LOCAL, 5427 bytes)
16/11/20 01:16:59 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
16/11/20 01:16:59 INFO HadoopRDD: Input split: file:/D:/dataSet.txt:0+291
16/11/20 01:16:59 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
16/11/20 01:16:59 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
16/11/20 01:16:59 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
16/11/20 01:16:59 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
16/11/20 01:16:59 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
len = 1
len = 2
len = 3
len = 4
len = 5
len = 6
len = 7
16/11/20 01:16:59 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 989 bytes result sent to driver
16/11/20 01:16:59 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 417 ms on localhost (1/1)
16/11/20 01:16:59 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/11/20 01:16:59 INFO DAGScheduler: ResultStage 0 (foreach at main.scala:28) finished in 0,456 s
16/11/20 01:16:59 INFO DAGScheduler: Job 0 finished: foreach at main.scala:28, took 0,795126 s
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
16/11/20 01:16:59 INFO SparkContext: Invoking stop() from shutdown hook
16/11/20 01:16:59 INFO SparkUI: Stopped Spark web UI at http://192.168.56.1:4040
16/11/20 01:16:59 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/11/20 01:16:59 INFO MemoryStore: MemoryStore cleared
16/11/20 01:16:59 INFO BlockManager: BlockManager stopped
16/11/20 01:16:59 INFO BlockManagerMaster: BlockManagerMaster stopped
16/11/20 01:16:59 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/11/20 01:16:59 INFO SparkContext: Successfully stopped SparkContext
16/11/20 01:16:59 INFO ShutdownHookManager: Shutdown hook called
16/11/20 01:16:59 INFO ShutdownHookManager: Deleting directory

Apache Spark doesn't provide shared memory, so here:
dataSet.foreach { e =>
  items += e
  println("len = " + items.length) //1. here length is ok
}
you modify a local copy of items on the respective executor. The original items list defined on the driver is not modified. As a result, this:
items.foreach { x => print(x) }
executes, but there is nothing to print.
Please check "Understanding closures" in the Spark programming guide.
While it wouldn't be recommended here, you could replace items with an accumulator:
val acc = sc.collectionAccumulator[String]("Items")
dataSet.foreach(e => acc.add(e))
println(acc.value) // read the collected values on the driver, after the action has finished
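
To see why the driver's list stays empty, here is a plain-Java sketch that does not use Spark at all. It assumes, as the closures section of the Spark guide describes, that variables captured by a closure are serialized and that each task works on its own deserialized copy. The class name and shipToExecutor are hypothetical stand-ins for that step, not Spark APIs.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class: simulates shipping a captured variable to an executor.
class ClosureCopyDemo {

    // Deep-copies an object through Java serialization (serialize, then deserialize).
    @SuppressWarnings("unchecked")
    static <T extends Serializable> T shipToExecutor(T captured) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(captured);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (T) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns {size of the driver-side list, size of the executor-side copy}.
    static int[] run(List<String> lines) {
        ArrayList<String> driverItems = new ArrayList<>();
        ArrayList<String> executorItems = shipToExecutor(driverItems);
        for (String line : lines) {
            executorItems.add(line); // only the copy grows, like "len = 1, 2, ..." in the log
        }
        return new int[] { driverItems.size(), executorItems.size() };
    }

    public static void main(String[] args) {
        int[] sizes = run(List.of("a", "b", "c"));
        System.out.println("driver=" + sizes[0] + " executor=" + sizes[1]);
    }
}
```

The copy grows while the original stays untouched, which matches the question: the lengths printed inside foreach increase, yet the loop over items on the driver has nothing to print.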

Spark runs your code on executors and returns the results to the driver, so the code above doesn't work as intended. If you need to add the elements from foreach to a list on the driver, collect the data on the driver first and then append it to items. Keep in mind that collecting the data is a bad idea when the data set is large.
val items = new ListBuffer[String]()
val rdd = spark.sparkContext.parallelize(1 to 10, 4)
rdd.collect().foreach(data => items += data.toString())
println(items)
Output:
ListBuffer(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

MapReduce input from HTable on AWS timeout

I'm having a bit of trouble figuring out how to execute a simple MapReduce job whose input is sourced from an HTable, using emr-5.4.0.
When I ran it on EMR, it failed because of a timeout (emr-5.3.0 also failed).
I have done a bunch of Google searching to find out how to proceed, but couldn't find anything useful.
My process:
I created an EMR cluster with HBase. The versions are:
Amazon 2.7.3, Ganglia 3.7.2, HBase 1.3.0, Hive 2.1.1, Hue 3.11.0,
Phoenix 4.9.0
Following the sample from the manual (http://hbase.apache.org/book.html#mapreduce.example), I wrote my job like this:
public class TableMapTest3 {
    // TableMapper
    public static class MyMapper extends TableMapper<Text, Text> {
        protected void map(ImmutableBytesWritable key, Result inputValue, Context context)
                throws IOException, InterruptedException {
            String keyS = new String(key.get(), "UTF-8");
            String value = new String(inputValue.getValue(Bytes.toBytes("contents"), Bytes.toBytes("name")), "UTF-8");
            System.out.println("TokenizerMapper :" + value);
            context.write(new Text(keyS), new Text(value));
        }
    }
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        System.out.println("url:" + conf.get("fs.defaultFS"));
        System.out.println("hbase.zookeeper.quorum:" + conf.get("hbase.zookeeper.quorum"));
        Connection conn = ConnectionFactory.createConnection(conf);
        Admin admin = conn.getAdmin();
        String tableName = "TableMapTest";
        TableName tablename = TableName.valueOf(tableName);
        Table hTable = null;
        // check table exists
        if (admin.tableExists(tablename)) {
            System.out.println(tablename + " table existed...");
            hTable = conn.getTable(tablename);
            ResultScanner resultScanner = hTable.getScanner(new Scan());
            for (Result result : resultScanner) {
                Delete delete = new Delete(result.getRow());
                hTable.delete(delete);
            }
        } else {
            HTableDescriptor tableDesc = new HTableDescriptor(tablename);
            tableDesc.addFamily(new HColumnDescriptor("contents"));
            admin.createTable(tableDesc);
            System.out.println(tablename + " table created...");
            hTable = conn.getTable(tablename);
        }
        // insert data
        for (int i = 0; i < 20; i++) {
            Put put = new Put(Bytes.toBytes(String.valueOf(i)));
            put.addColumn(Bytes.toBytes("contents"), Bytes.toBytes("name"), Bytes.toBytes("value" + i));
            hTable.put(put);
        }
        hTable.close();
        // Hadoop
        Job job = Job.getInstance(conf, TableMapTest3.class.getSimpleName());
        job.setJarByClass(TableMapTest3.class);
        job.setOutputFormatClass(NullOutputFormat.class);
        Scan scan = new Scan();
        TableMapReduceUtil.initTableMapperJob(tableName, scan, MyMapper.class, Text.class, Text.class, job);
        System.out.println("TableMapTest result:" + job.waitForCompletion(true));
    }
}
I packaged my source into a jar and uploaded it to the cluster. Then I SSHed into the master and ran my job:
hadoop jar zz-0.0.1.jar com.ziki.zz.TableMapTest3
I got the following messages:
url:hdfs://ip-xxx.ap-northeast-1.compute.internal:8020
hbase.zookeeper.quorum:localhost
TableMapTest table created...
17/05/05 01:31:23 INFO impl.TimelineClientImpl: Timeline service address: http://ip-xxx.ap-northeast-1.compute.internal:8188/ws/v1/timeline/
17/05/05 01:31:23 INFO client.RMProxy: Connecting to ResourceManager at ip-xxx.ap-northeast-1.compute.internal/172.31.4.228:8032
17/05/05 01:31:24 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
17/05/05 01:31:31 INFO mapreduce.JobSubmitter: number of splits:1
17/05/05 01:31:31 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1493947058255_0001
17/05/05 01:31:33 INFO impl.YarnClientImpl: Submitted application application_1493947058255_0001
17/05/05 01:31:34 INFO mapreduce.Job: The url to track the job: http://ip-xxx.ap-northeast-1.compute.internal:20888/proxy/application_1493947058255_0001/
17/05/05 01:31:34 INFO mapreduce.Job: Running job: job_1493947058255_0001
17/05/05 01:31:57 INFO mapreduce.Job: Job job_1493947058255_0001 running in uber mode : false
17/05/05 01:31:57 INFO mapreduce.Job: map 0% reduce 0%
After a while, I got the error:
17/05/05 01:42:26 INFO mapreduce.Job: Task Id : attempt_1493947058255_0001_m_000000_0, Status : FAILED
AttemptID:attempt_1493947058255_0001_m_000000_0 Timed out after 600 secs
Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
17/05/05 01:52:56 INFO mapreduce.Job: Task Id : attempt_1493947058255_0001_m_000000_1, Status : FAILED
AttemptID:attempt_1493947058255_0001_m_000000_1 Timed out after 600 secs
Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
And some syslogs:
2017-05-05 01:31:59,664 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: task_1493947058255_0001_m_000000 Task Transitioned from SCHEDULED to RUNNING
2017-05-05 01:32:08,168 INFO [Socket Reader #1 for port 33348] SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for job_1493947058255_0001 (auth:SIMPLE)
2017-05-05 01:32:08,227 INFO [IPC Server handler 0 on 33348] org.apache.hadoop.mapred.TaskAttemptListenerImpl: JVM with ID : jvm_1493947058255_0001_m_000002 asked for a task
2017-05-05 01:32:08,231 INFO [IPC Server handler 0 on 33348] org.apache.hadoop.mapred.TaskAttemptListenerImpl: JVM with ID: jvm_1493947058255_0001_m_000002 given task: attempt_1493947058255_0001_m_000000_0
2017-05-05 01:42:25,382 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1493947058255_0001_m_000000_0: AttemptID:attempt_1493947058255_0001_m_000000_0 Timed out after 600 secs
2017-05-05 01:42:25,389 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1493947058255_0001_m_000000_0 TaskAttempt Transitioned from RUNNING to FAIL_CONTAINER_CLEANUP
2017-05-05 01:42:25,392 INFO [ContainerLauncher #1] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Processing the event EventType: CONTAINER_REMOTE_CLEANUP for container container_1493947058255_0001_01_000002 taskAttempt attempt_1493947058255_0001_m_000000_0
2017-05-05 01:42:25,392 INFO [ContainerLauncher #1] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: KILLING attempt_1493947058255_0001_m_000000_0
2017-05-05 01:42:25,394 INFO [ContainerLauncher #1] org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy: Opening proxy : ip-xxx.ap-northeast-1.compute.internal:8041
2017-05-05 01:42:25,457 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1493947058255_0001_m_000000_0 TaskAttempt Transitioned from FAIL_CONTAINER_CLEANUP to FAIL_TASK_CLEANUP
2017-05-05 01:42:25,458 INFO [CommitterEvent Processor #1] org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler: Processing the event EventType: TASK_ABORT
2017-05-05 01:42:25,460 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1493947058255_0001_m_000000_0 TaskAttempt Transitioned from FAIL_TASK_CLEANUP to FAILED
2017-05-05 01:42:25,495 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.util.RackResolver: Resolved ip-xxx.ap-northeast-1.compute.internal to /default-rack
2017-05-05 01:42:25,500 INFO [Thread-83] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: 1 failures on node ip-xxx.ap-northeast-1.compute.internal
2017-05-05 01:42:25,502 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1493947058255_0001_m_000000_1 TaskAttempt Transitioned from NEW to UNASSIGNED
2017-05-05 01:42:25,503 INFO [Thread-83] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Added attempt_1493947058255_0001_m_000000_1 to list of failed maps
2017-05-05 01:42:25,557 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: PendingReds:3 ScheduledMaps:1 ScheduledReds:0 AssignedMaps:1 AssignedReds:0 CompletedMaps:0 CompletedReds:0 ContAlloc:1 ContRel:0 HostLocal:1 RackLocal:0
2017-05-05 01:42:25,582 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: getResources() for application_1493947058255_0001: ask=1 release= 0 newContainers=0 finishedContainers=1 resourcelimit=<memory:1024, vCores:1> knownNMs=2
2017-05-05 01:42:25,582 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received completed container container_1493947058255_0001_01_000002
2017-05-05 01:42:25,583 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1493947058255_0001_m_000000_0: Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
I just use the default settings and run a simple job. Why did these errors happen?
If I'm missing anything, let me know!
Anyway, thanks for your help, I appreciate it!

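I found the answer here:
You can't use the default HBaseConfiguration, because it defaults to a localhost quorum (the job output above prints hbase.zookeeper.quorum:localhost). What you have to do is use the configuration that Amazon sets up for you, located in /etc/hbase/conf/hbase-site.xml.
The connection code looks like this:
import java.io.File;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.HBaseAdmin;
// Load the cluster's HBase settings instead of the localhost defaults
Configuration conf = new Configuration();
String hbaseSite = "/etc/hbase/conf/hbase-site.xml";
conf.addResource(new File(hbaseSite).toURI().toURL());
HBaseAdmin.checkHBaseAvailable(conf);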
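
Before resubmitting the job, it can help to confirm which ZooKeeper quorum that file really declares. The sketch below is a plain-JDK check, assuming only the standard Hadoop configuration layout of property elements with name and value children. The class name is made up, and it is a diagnostic aid rather than part of the fix.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: reads one property from a Hadoop-style configuration XML.
class HBaseSiteCheck {

    // Returns the value of property `name`, or null if the file does not define it.
    static String readProperty(InputStream xml, String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(xml);
            NodeList props = doc.getElementsByTagName("property");
            for (int i = 0; i < props.getLength(); i++) {
                Element prop = (Element) props.item(i);
                String key = prop.getElementsByTagName("name").item(0).getTextContent().trim();
                if (key.equals(name)) {
                    return prop.getElementsByTagName("value").item(0).getTextContent().trim();
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalStateException("cannot parse configuration XML", e);
        }
    }

    public static void main(String[] args) throws Exception {
        // Path taken from the answer above; pass another path as the first argument if needed.
        String path = args.length > 0 ? args[0] : "/etc/hbase/conf/hbase-site.xml";
        try (InputStream in = new FileInputStream(path)) {
            System.out.println("hbase.zookeeper.quorum = " + readProperty(in, "hbase.zookeeper.quorum"));
        }
    }
}
```

If this prints a real host name while the job still logs hbase.zookeeper.quorum:localhost, the job is most likely not picking the file up.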

NoSuchMethodError in shapeless seen only in Spark

I am trying to write a Spark connector to pull Avro messages off a RabbitMQ message queue. When decoding the Avro messages, a NoSuchMethodError occurs, but only when running in Spark.
I could not reproduce the Spark code exactly outside of Spark, but I believe the two examples are sufficiently similar. I think this is the smallest code that reproduces the same scenario.
I've removed all the connection parameters both because the information is private and the connection does not appear to be the issue.
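
A NoSuchMethodError is raised when a class found at run time does not have a method that the calling code was compiled against, so an error that appears only under spark-submit suggests that a different version of some class is being loaded there. One way to investigate, sketched here in plain Java under that assumption, is to print where the JVM loaded a given class from. The helper name is made up, and shapeless.Generic is only an example of a class name to probe.

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper: reports which jar or directory a class came from.
class WhichJar {

    // Returns the code source location of `className`, or a short note if unavailable.
    static String locationOf(String className) {
        try {
            Class<?> cls = Class.forName(className);
            CodeSource source = cls.getProtectionDomain().getCodeSource();
            return source == null ? "(bootstrap or unknown)" : String.valueOf(source.getLocation());
        } catch (ClassNotFoundException | LinkageError e) {
            return "(not on classpath)";
        }
    }

    public static void main(String[] args) {
        // Example class name only; pass the class you want to inspect as the first argument.
        String name = args.length > 0 ? args[0] : "shapeless.Generic";
        System.out.println(name + " loaded from " + locationOf(name));
    }
}
```

Calling the same lookup once in the standalone program and once inside the Spark job would show whether the two runs resolve the class from different jars.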
Spark code:
package simpleexample
import org.apache.spark.SparkConf
import org.apache.spark.streaming.rabbitmq.distributed.RabbitMQDistributedKey
import org.apache.spark.streaming.rabbitmq.models.ExchangeAndRouting
import org.apache.spark.streaming.rabbitmq.RabbitMQUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.storage.StorageLevel
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import com.sksamuel.avro4s._
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import com.rabbitmq.client.QueueingConsumer.Delivery
import java.util.HashMap
case class AttributeTuple(attrName: String, attrValue: String)
// AVRO Schema for Events
case class DeviceEvent(
  tenantName: String,
  groupName: String,
  subgroupName: String,
  eventType: String,
  eventSource: String,
  deviceTypeName: String,
  deviceId: Int,
  timestamp: Long,
  attribute: AttributeTuple
)
object RabbitMonitor {
  def main(args: Array[String]) {
    println("start")
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("RabbitMonitor")
    val ssc = new StreamingContext(sparkConf, Seconds(60))
    def parseArrayEvent(delivery: Delivery): Seq[DeviceEvent] = {
      val in = new ByteArrayInputStream(delivery.getBody())
      val input = AvroInputStream.binary[DeviceEvent](in)
      input.iterator.toSeq
    }
    val params: Map[String, String] = Map(
      /* many rabbit connection parameters */
      "maxReceiveTime" -> "60000" // 60s
    )
    val distributedKey = Seq(
      RabbitMQDistributedKey(
        /* queue name */,
        new ExchangeAndRouting(/* exchange name */, /* routing key */),
        params
      )
    )
    var events = RabbitMQUtils.createDistributedStream[Seq[DeviceEvent]](ssc, distributedKey, params, parseArrayEvent)
    events.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
Non-Spark code:
package simpleexample
import com.thenewmotion.akka.rabbitmq._
import akka.actor._
// avoid name collision with rabbitmq channel
import scala.concurrent.{Channel => BasicChannel}
import scala.concurrent.ExecutionContext.Implicits.global
import com.sksamuel.avro4s._
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
object Test extends App {
  implicit val system = ActorSystem()
  val factory = new ConnectionFactory()
  /* Set connection parameters*/
  val exchange: String = /* exchange name */
  val connection: ActorRef = system.actorOf(ConnectionActor.props(factory), "rabbitmq")
  def setupSubscriber(channel: Channel, self: ActorRef) {
    val queue = channel.queueDeclare().getQueue
    channel.queueBind(queue, exchange, /* routing key */)
    val consumer = new DefaultConsumer(channel) {
      override def handleDelivery(consumerTag: String, envelope: Envelope, properties: BasicProperties, body: Array[Byte]) {
        val in = new ByteArrayInputStream(body)
        val input = AvroInputStream.binary[DeviceEvent](in)
        val result = input.iterator.toSeq
        println(result)
      }
    }
    channel.basicConsume(queue, true, consumer)
  }
  connection ! CreateChannel(ChannelActor.props(setupSubscriber), Some("eventSubscriber"))
  scala.concurrent.Future {
    def loop(n: Long) {
      Thread.sleep(1000)
      if (n < 30) {
        loop(n + 1)
      }
    }
    loop(0)
  }
}
Non-Spark Output (the last line is a successfully decoded update):
drex@drexThinkPad:~/src/scala/so-repro/connector/target/scala-2.11$ scala project.jar
[INFO] [03/02/2017 14:11:06.899] [default-akka.actor.default-dispatcher-4] [akka://default/deadLetters] Message [com.thenewmotion.akka.rabbitmq.ChannelCreated] from Actor[akka://default/user/rabbitmq#-889215077] to Actor[akka://default/deadLetters] was not delivered. [1] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
[INFO] [03/02/2017 14:11:07.337] [default-akka.actor.default-dispatcher-3] [akka://default/user/rabbitmq] akka://default/user/rabbitmq connected to amqp://<rabbit info>
[INFO] [03/02/2017 14:11:07.509] [default-akka.actor.default-dispatcher-4] [akka://default/user/rabbitmq/eventSubscriber] akka://default/user/rabbitmq/eventSubscriber connected
Stream(DeviceEvent(int,na,d01,deviceAttrUpdate,device,TestDeviceType,33554434,1488492704421,AttributeTuple(temperature,60)), ?)
Spark Output:
drex@drexThinkPad:~/src/scala/so-repro/connector/target/scala-2.11$ spark-submit ./project.jar --class RabbitMonitor
start
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/03/02 14:20:15 INFO SparkContext: Running Spark version 2.1.0
17/03/02 14:20:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/03/02 14:20:16 WARN Utils: Your hostname, drexThinkPad resolves to a loopback address: 127.0.1.1; using 192.168.1.11 instead (on interface wlp3s0)
17/03/02 14:20:16 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
17/03/02 14:20:16 INFO SecurityManager: Changing view acls to: drex
17/03/02 14:20:16 INFO SecurityManager: Changing modify acls to: drex
17/03/02 14:20:16 INFO SecurityManager: Changing view acls groups to:
17/03/02 14:20:16 INFO SecurityManager: Changing modify acls groups to:
17/03/02 14:20:16 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(drex); groups with view permissions: Set(); users with modify permissions: Set(drex); groups with modify permissions: Set()
17/03/02 14:20:16 INFO Utils: Successfully started service 'sparkDriver' on port 34701.
17/03/02 14:20:16 INFO SparkEnv: Registering MapOutputTracker
17/03/02 14:20:16 INFO SparkEnv: Registering BlockManagerMaster
17/03/02 14:20:16 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
17/03/02 14:20:16 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
17/03/02 14:20:16 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-5cbb13bf-78fe-4227-81b3-1afea40f899a
17/03/02 14:20:16 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
17/03/02 14:20:16 INFO SparkEnv: Registering OutputCommitCoordinator
17/03/02 14:20:16 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/03/02 14:20:16 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.1.11:4040
17/03/02 14:20:16 INFO SparkContext: Added JAR file:/home/drex/src/scala/so-repro/connector/target/scala-2.11/./project.jar at spark://192.168.1.11:34701/jars/project.jar with timestamp 1488493216614
17/03/02 14:20:16 INFO Executor: Starting executor ID driver on host localhost
17/03/02 14:20:16 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 33276.
17/03/02 14:20:16 INFO NettyBlockTransferService: Server created on 192.168.1.11:33276
17/03/02 14:20:16 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
17/03/02 14:20:16 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 192.168.1.11, 33276, None)
17/03/02 14:20:16 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.1.11:33276 with 366.3 MB RAM, BlockManagerId(driver, 192.168.1.11, 33276, None)
17/03/02 14:20:16 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 192.168.1.11, 33276, None)
17/03/02 14:20:16 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 192.168.1.11, 33276, None)
17/03/02 14:20:17 INFO RabbitMQDStream: Duration for remembering RDDs set to 60000 ms for org.apache.spark.streaming.rabbitmq.distributed.RabbitMQDStream@546621c4
17/03/02 14:20:17 INFO RabbitMQDStream: Slide time = 60000 ms
17/03/02 14:20:17 INFO RabbitMQDStream: Storage level = Memory Deserialized 1x Replicated
17/03/02 14:20:17 INFO RabbitMQDStream: Checkpoint interval = null
17/03/02 14:20:17 INFO RabbitMQDStream: Remember interval = 60000 ms
17/03/02 14:20:17 INFO RabbitMQDStream: Initialized and validated org.apache.spark.streaming.rabbitmq.distributed.RabbitMQDStream@546621c4
17/03/02 14:20:17 INFO ForEachDStream: Slide time = 60000 ms
17/03/02 14:20:17 INFO ForEachDStream: Storage level = Serialized 1x Replicated
17/03/02 14:20:17 INFO ForEachDStream: Checkpoint interval = null
17/03/02 14:20:17 INFO ForEachDStream: Remember interval = 60000 ms
17/03/02 14:20:17 INFO ForEachDStream: Initialized and validated org.apache.spark.streaming.dstream.ForEachDStream@49c6ddef
17/03/02 14:20:17 INFO RecurringTimer: Started timer for JobGenerator at time 1488493260000
17/03/02 14:20:17 INFO JobGenerator: Started JobGenerator at 1488493260000 ms
17/03/02 14:20:17 INFO JobScheduler: Started JobScheduler
17/03/02 14:20:17 INFO StreamingContext: StreamingContext started
17/03/02 14:21:00 INFO JobScheduler: Added jobs for time 1488493260000 ms
17/03/02 14:21:00 INFO JobScheduler: Starting job streaming job 1488493260000 ms.0 from job set of time 1488493260000 ms
17/03/02 14:21:00 INFO SparkContext: Starting job: print at RabbitMonitor.scala:94
17/03/02 14:21:00 INFO DAGScheduler: Got job 0 (print at RabbitMonitor.scala:94) with 1 output partitions
17/03/02 14:21:00 INFO DAGScheduler: Final stage: ResultStage 0 (print at RabbitMonitor.scala:94)
17/03/02 14:21:00 INFO DAGScheduler: Parents of final stage: List()
17/03/02 14:21:00 INFO DAGScheduler: Missing parents: List()
17/03/02 14:21:00 INFO DAGScheduler: Submitting ResultStage 0 (RabbitMQRDD[0] at createDistributedStream at RabbitMonitor.scala:93), which has no missing parents
17/03/02 14:21:00 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 2.7 KB, free 366.3 MB)
17/03/02 14:21:00 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1752.0 B, free 366.3 MB)
17/03/02 14:21:00 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.1.11:33276 (size: 1752.0 B, free: 366.3 MB)
17/03/02 14:21:00 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:996
17/03/02 14:21:00 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (RabbitMQRDD[0] at createDistributedStream at RabbitMonitor.scala:93)
17/03/02 14:21:00 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
17/03/02 14:21:00 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, ANY, 7744 bytes)
17/03/02 14:21:00 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
17/03/02 14:21:00 INFO Executor: Fetching spark://192.168.1.11:34701/jars/project.jar with timestamp 1488493216614
17/03/02 14:21:00 INFO TransportClientFactory: Successfully created connection to /192.168.1.11:34701 after 23 ms (0 ms spent in bootstraps)
17/03/02 14:21:00 INFO Utils: Fetching spark://192.168.1.11:34701/jars/project.jar to /tmp/spark-92b6ff6a-b120-4fd0-ba46-a450eff80636/userFiles-c0a334f3-68fc-495f-8ccd-cfe90e6d0bf8/fetchFileTemp2710654534934784726.tmp
17/03/02 14:21:00 INFO Executor: Adding file:/tmp/spark-92b6ff6a-b120-4fd0-ba46-a450eff80636/userFiles-c0a334f3-68fc-495f-8ccd-cfe90e6d0bf8/project.jar to class loader
<removing rabbit queue connection parameters>
17/03/02 14:21:02 INFO RabbitMQRDD: Receiving data in Partition 0 from
</removing rabbit queue connection parameters>
17/03/02 14:21:50 WARN BlockManager: Putting block rdd_0_0 failed due to an exception
17/03/02 14:21:50 WARN BlockManager: Block rdd_0_0 could not be removed as it was not found on disk or in memory
17/03/02 14:21:50 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.NoSuchMethodError: shapeless.Lazy.map(Lscala/Function1;)Lshapeless/Lazy;
at com.sksamuel.avro4s.SchemaFor$.recordBuilder(SchemaFor.scala:447)
at simpleexample.RabbitMonitor$$anon$3.<init>(RabbitMonitor.scala:70)
at simpleexample.RabbitMonitor$.simpleexample$RabbitMonitor$$parseArrayEvent$1(RabbitMonitor.scala:70)
at simpleexample.RabbitMonitor$$anonfun$15.apply(RabbitMonitor.scala:93)
at simpleexample.RabbitMonitor$$anonfun$15.apply(RabbitMonitor.scala:93)
at org.apache.spark.streaming.rabbitmq.distributed.RabbitMQRDD$RabbitMQRDDIterator$$anonfun$5.apply(RabbitMQRDD.scala:209)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.streaming.rabbitmq.distributed.RabbitMQRDD$RabbitMQRDDIterator.processDelivery(RabbitMQRDD.scala:209)
at org.apache.spark.streaming.rabbitmq.distributed.RabbitMQRDD$RabbitMQRDDIterator.getNext(RabbitMQRDD.scala:194)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:215)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:957)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:948)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:888)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:948)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:694)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
17/03/02 14:21:50 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): java.lang.NoSuchMethodError: shapeless.Lazy.map(Lscala/Function1;)Lshapeless/Lazy;
at com.sksamuel.avro4s.SchemaFor$.recordBuilder(SchemaFor.scala:447)
at simpleexample.RabbitMonitor$$anon$3.<init>(RabbitMonitor.scala:70)
at simpleexample.RabbitMonitor$.simpleexample$RabbitMonitor$$parseArrayEvent$1(RabbitMonitor.scala:70)
at simpleexample.RabbitMonitor$$anonfun$15.apply(RabbitMonitor.scala:93)
at simpleexample.RabbitMonitor$$anonfun$15.apply(RabbitMonitor.scala:93)
at org.apache.spark.streaming.rabbitmq.distributed.RabbitMQRDD$RabbitMQRDDIterator$$anonfun$5.apply(RabbitMQRDD.scala:209)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.streaming.rabbitmq.distributed.RabbitMQRDD$RabbitMQRDDIterator.processDelivery(RabbitMQRDD.scala:209)
at org.apache.spark.streaming.rabbitmq.distributed.RabbitMQRDD$RabbitMQRDDIterator.getNext(RabbitMQRDD.scala:194)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:215)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:957)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:948)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:888)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:948)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:694)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
17/03/02 14:21:50 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
17/03/02 14:21:50 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
17/03/02 14:21:50 INFO TaskSchedulerImpl: Cancelling stage 0
build.sbt:
retrieveManaged := true
lazy val sparkVersion = "2.1.0"
scalaVersion in ThisBuild := "2.11.8"
lazy val rabbit = (project in file("rabbit-plugin")).settings(
  name := "Spark Streaming RabbitMQ Receiver",
  homepage := Some(url("https://github.com/Stratio/RabbitMQ-Receiver")),
  description := "RabbitMQ-Receiver is a library that allows the user to read data with Apache Spark from RabbitMQ.",
  exportJars := true,
  assemblyJarName in assembly := "rabbit.jar",
  test in assembly := {},
  moduleName := "spark-rabbitmq",
  organization := "com.stratio.receive",
  version := "0.6.0",
  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
    "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",
    "org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
    "com.typesafe.akka" %% "akka-actor" % "2.4.11",
    "com.rabbitmq" % "amqp-client" % "3.6.6",
    "joda-time" % "joda-time" % "2.8.2",
    "com.github.sstone" %% "amqp-client" % "1.5" % Test,
    "org.scalatest" %% "scalatest" % "2.2.2" % Test,
    "org.scalacheck" %% "scalacheck" % "1.11.3" % Test,
    "junit" % "junit" % "4.12" % Test,
    "com.typesafe.akka" %% "akka-testkit" % "2.4.11" % Test
  )
)
lazy val root = (project in file("connector")).settings(
  name := "Connector from Rabbit to Kafka queue",
  description := "",
  exportJars := true,
  test in assembly := {},
  assemblyJarName in assembly := "project.jar",
  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
    "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",
    "com.thenewmotion" %% "akka-rabbitmq" % "3.0.0",
    "org.apache.kafka" % "kafka_2.10" % "0.10.1.1",
    "com.sksamuel.avro4s" %% "avro4s-core" % "1.6.4"
  )
) dependsOn rabbit
I am also using sbt-assembly to put together a "fat jar" for Spark (addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.4")), and I run the command sbt assembly to produce the jar used in both examples above. I'm running Spark 2.1.0.
I'm relatively new to the Spark / Scala ecosystem so hopefully this is a problem with my build settings. It makes no sense that shapeless would be unavailable in Spark.

I ran into the same issue. I'm adding more details for others facing it.
Error
Everything works fine until I deploy to the cluster. Then I get:
Exception in thread "main" java.lang.NoSuchMethodError: 'shapeless.DefaultSymbolicLabelling shapeless.DefaultSymbolicLabelling$.instance(shapeless.HList)'
Root Cause
Following the stack trace, I could tell it was related to the circe library. I then inspected the dependency tree (make sure you have addDependencyTreePlugin in your ~/.sbt/1.0/plugins/plugins.sbt file):
❯ sbt "whatDependsOn com.chuusai shapeless_2.12"
[info] welcome to sbt 1.6.2 (Amazon.com Inc. Java 1.8.0_332)
[info] com.chuusai:shapeless_2.12:2.3.7 [S]
[info] +-io.circe:circe-generic_2.12:0.14.1 [S]
[info] +-***
But if I run the same query in the "provided" scope, I get:
❯ sbt provided:"whatDependsOn com.chuusai shapeless_2.12"
[info] welcome to sbt 1.6.2 (Amazon.com Inc. Java 1.8.0_332)
[info] com.chuusai:shapeless_2.12:2.3.3 [S]
[info] +-org.scalanlp:breeze_2.12:1.0 [S]
[info] +-org.apache.spark:spark-mllib-local_2.12:3.1.3
[info] | +-org.apache.spark:spark-graphx_2.12:3.1.3
[info] | | +-org.apache.spark:spark-mllib_2.12:3.1.3
[info] | | +-***
[info] | |
[info] | +-org.apache.spark:spark-mllib_2.12:3.1.3
[info] | +-***
[info] |
[info] +-org.apache.spark:spark-mllib_2.12:3.1.3
[info] +-***
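As you can see, circe pulls in shapeless 2.3.7, while Spark (in the provided scope) brings in shapeless 2.3.3. The instance function that exists in version 2.3.7 is not present in version 2.3.3 (it was added in version 2.3.5):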
https://javadoc.io/static/com.chuusai/shapeless_2.12/2.3.3/shapeless/DefaultSymbolicLabelling$.html
https://javadoc.io/static/com.chuusai/shapeless_2.12/2.3.7/shapeless/DefaultSymbolicLabelling$.html
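One way to confirm which copy of shapeless wins at run-time is to ask the JVM where a class was loaded from. The sketch below is only an illustration and not part of the original answer: ClassOrigin and its helper methods are hypothetical names, and it relies solely on the standard Class.getProtectionDomain().getCodeSource() API. Passing a class name such as shapeless.Lazy as the argument should print the jar or directory that supplied the class.

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper (not from the original post).
final class ClassOrigin {

    /** Returns the jar or directory a class was loaded from. */
    static String locationOf(Class<?> cls) {
        CodeSource source = cls.getProtectionDomain().getCodeSource();
        // Classes from the bootstrap loader (e.g. java.lang.String) have no code source.
        if (source == null || source.getLocation() == null) {
            return "<bootstrap>";
        }
        return source.getLocation().toString();
    }

    /** Same as locationOf, but looks the class up by name without initializing it. */
    static String locationOfName(String className) {
        try {
            Class<?> cls = Class.forName(className, false, ClassOrigin.class.getClassLoader());
            return locationOf(cls);
        } catch (ClassNotFoundException e) {
            return "<not on classpath>";
        }
    }

    public static void main(String[] args) {
        // Default to the class from the stack trace; pass another name to inspect it instead.
        String name = args.length > 0 ? args[0] : "shapeless.Lazy";
        System.out.println(name + " -> " + locationOfName(name));
    }
}
```

If the printed location is one of Spark's own jars rather than your assembly jar, that would suggest the older shapeless is taking precedence, which is consistent with the NoSuchMethodError above.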
What didn't work
Adding the shapeless dependency explicitly didn't fix my issue.
val CirceVersion = "0.14.1"
val ShapelessVersion = "2.3.7" // Circe 0.14.1 uses 2.3.7; Spark 3.1.3 uses 2.3.3
val SparkVersion = "3.1.3"
lazy val CirceDeps: Seq[ModuleID] = Seq(
  "io.circe" %% "circe-generic" % CirceVersion,
  /* Shapeless is one of the Spark dependencies. As Spark is provided, it is not included in the uber jar.
   * Adding the dependency explicitly to make sure we have the correct version at run-time
   */
  "com.chuusai" %% "shapeless" % ShapelessVersion
)
I keep this in my build for documentation purposes only.
What worked
The actual fix is to rename (shade) the Shapeless library, as explained in the code comments below. I took this approach from an answer to another question.
/** Shapeless is one of the Spark dependencies. At run-time, they clash and Spark's shapeless package takes
  * precedence. This results in a run-time error, as shapeless 2.3.7 and 2.3.3 are not fully compatible.
  * Here, we are renaming the library so both can co-exist at run-time: Spark uses its own version and Circe
  * uses its own version.
  */
// noinspection SbtDependencyVersionInspection
lazy val shadingRules: Def.Setting[Seq[ShadeRule]] =
  assembly / assemblyShadeRules := Seq(
    ShadeRule
      .rename("shapeless.**" -> "shadeshapless.@1")
      .inLibrary("com.chuusai" % "shapeless_2.12" % Dependencies.ShapelessVersion)
      .inProject,
    ShadeRule
      .rename("shapeless.**" -> "shadeshapless.@1")
      .inLibrary("io.circe" % "circe-generic_2.12" % Dependencies.CirceVersion)
      .inProject
  )
Update 2022-08-20
Based on @denis-arnaud's comment, here is a simpler version from pureconfig:
assembly / assemblyShadeRules := Seq(ShadeRule.rename("shapeless.**" -> "new_shapeless.@1").inAll)
I guess the simple one works for most situations. The more complex one is useful when there are different versions of shapeless on the classpath and you'd like to rename them separately (@1, @2, etc.).

zero323 has the right answer as far as I can tell. Spark 2.1.0 has a dependency that itself depends on Shapeless 2.0.0.
This problem can be solved in one of two ways: import the dependency that uses Shapeless and shade Shapeless, or use a different Avro library. I went with the latter solution.

Hadoop Pipes Wordcount example: NullPointerException in LocalJobRunner

I am trying to run the example in this tutorial about Hadoop Pipes:
Compilation succeeds. However, when the job runs it fails with a NullPointerException. I tried many approaches and read many similar questions, but wasn't able to find an actual solution to this problem.
Note: I am running on a single machine in a pseudo-distributed environment.
hadoop pipes -D hadoop.pipes.java.recordreader=true -D hadoop.pipes.java.recordwriters=true -input /input -output /output -program /bin/wordcount
DEPRECATED: Use of this script to execute mapred command is deprecated.
Instead use the mapred command for it.
15/02/18 01:09:02 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
15/02/18 01:09:02 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
15/02/18 01:09:02 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
15/02/18 01:09:03 WARN mapreduce.JobSubmitter: No job jar file set. User classes may not be found. See Job or Job#setJar(String).
15/02/18 01:09:04 INFO mapred.FileInputFormat: Total input paths to process : 1
15/02/18 01:09:04 INFO mapreduce.JobSubmitter: number of splits:1
15/02/18 01:09:04 INFO Configuration.deprecation: hadoop.pipes.java.recordreader is deprecated. Instead, use mapreduce.pipes.isjavarecordreader
15/02/18 01:09:04 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local143452495_0001
15/02/18 01:09:06 INFO mapred.LocalDistributedCacheManager: Localized hdfs://localhost:9000/bin/wordcount as file:/tmp/hadoop-abdulrahman/mapred/local/1424214545411/wordcount
15/02/18 01:09:06 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
15/02/18 01:09:06 INFO mapred.LocalJobRunner: OutputCommitter set in config null
15/02/18 01:09:06 INFO mapreduce.Job: Running job: job_local143452495_0001
15/02/18 01:09:06 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapred.FileOutputCommitter
15/02/18 01:09:06 INFO mapred.LocalJobRunner: Waiting for map tasks
15/02/18 01:09:06 INFO mapred.LocalJobRunner: Starting task: attempt_local143452495_0001_m_000000_0
15/02/18 01:09:06 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
15/02/18 01:09:06 INFO mapred.MapTask: Processing split: hdfs://localhost:9000/input/data.txt:0+68
15/02/18 01:09:07 INFO mapred.MapTask: numReduceTasks: 1
15/02/18 01:09:07 INFO mapreduce.Job: Job job_local143452495_0001 running in uber mode : false
15/02/18 01:09:07 INFO mapreduce.Job: map 0% reduce 0%
15/02/18 01:09:07 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
15/02/18 01:09:07 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
15/02/18 01:09:07 INFO mapred.MapTask: soft limit at 83886080
15/02/18 01:09:07 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
15/02/18 01:09:07 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
15/02/18 01:09:07 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
15/02/18 01:09:08 INFO mapred.LocalJobRunner: map task executor complete.
15/02/18 01:09:08 WARN mapred.LocalJobRunner: job_local143452495_0001
java.lang.Exception: java.lang.NullPointerException
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Caused by: java.lang.NullPointerException
at org.apache.hadoop.mapred.pipes.Application.<init>(Application.java:104)
at org.apache.hadoop.mapred.pipes.PipesMapRunner.run(PipesMapRunner.java:69)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/02/18 01:09:08 INFO mapreduce.Job: Job job_local143452495_0001 failed with state FAILED due to: NA
15/02/18 01:09:08 INFO mapreduce.Job: Counters: 0
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
at org.apache.hadoop.mapred.pipes.Submitter.runJob(Submitter.java:264)
at org.apache.hadoop.mapred.pipes.Submitter.run(Submitter.java:503)
at org.apache.hadoop.mapred.pipes.Submitter.main(Submitter.java:518)
Edit: I downloaded the Hadoop source code and tracked down where the exception is happening. It seems that the exception occurs in the initialization stage, so the code inside the mapper/reducer isn't really the problem.
The function in Hadoop that produces the exception is this one:
/** Run a set of tasks and waits for them to complete. */
private void runTasks(List<RunnableWithThrowable> runnables,
    ExecutorService service, String taskType) throws Exception {
  // Start populating the executor with work units.
  // They may begin running immediately (in other threads).
  for (Runnable r : runnables) {
    service.submit(r);
  }

  try {
    service.shutdown(); // Instructs queue to drain.

    // Wait for tasks to finish; do not use a time-based timeout.
    // (See http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6179024)
    LOG.info("Waiting for " + taskType + " tasks");
    service.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
  } catch (InterruptedException ie) {
    // Cancel all threads.
    service.shutdownNow();
    throw ie;
  }

  LOG.info(taskType + " task executor complete.");

  // After waiting for the tasks to complete, if any of these
  // have thrown an exception, rethrow it now in the main thread context.
  for (RunnableWithThrowable r : runnables) {
    if (r.storedException != null) {
      throw new Exception(r.storedException);
    }
  }
}
The problem, though, is that it stores the exception and rethrows it later, which makes it hard for me to find the actual source of the exception.
Any help?
Also, if you need me to post more details please let me know.
Thank you,
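On the wrapping concern raised above: new Exception(r.storedException) keeps the original throwable as its cause, which is why the log still shows a "Caused by" section pointing at Application.<init> (Application.java:104). Below is a minimal sketch of walking that cause chain programmatically. RootCause and rootCauseOf are hypothetical names used only for illustration and are not part of Hadoop.

```java
// Hypothetical helper (not part of Hadoop): finds the innermost cause of a wrapped exception.
final class RootCause {

    /** Follows getCause() links until the innermost throwable is reached. */
    static Throwable rootCauseOf(Throwable t) {
        Throwable current = t;
        // Stop at the end of the chain; also guard against a self-referential cause.
        while (current.getCause() != null && current.getCause() != current) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // Mimics runTasks: a stored task exception is rethrown wrapped in a new Exception.
        Throwable stored = new NullPointerException("simulated failure inside a task");
        Exception rethrown = new Exception(stored);

        Throwable root = rootCauseOf(rethrown);
        System.out.println(root.getClass().getName() + ": " + root.getMessage());
        // The first stack frame of the root cause is where the failure was raised.
        System.out.println("raised at " + root.getStackTrace()[0]);
    }
}
```

The same information is already available in the printed stack trace: the frames listed under "Caused by" belong to the original exception, not to the wrapper created in runTasks.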

So after a lot of research, I found out that the problem was actually caused by this line in pipes/Application.java (line 104):
byte[] password= jobToken.getPassword();
I changed the code and recompiled hadoop:
byte[] password = "no password".getBytes();
if (jobToken != null) {
  password = jobToken.getPassword();
}
I got this from here
This solved the problem and my program now runs, but I am facing another problem: the program hangs at map 0% reduce 0%.
I will open another topic for that question.
Thank you,
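The null guard from this answer can be isolated into a small helper so it can be exercised without rebuilding Hadoop. This is only a sketch: TokenLike is a hypothetical stand-in for the real job token type, PasswordFallback and passwordFor are made-up names, and the explicit UTF-8 charset is an assumption (the patch above uses the platform default via getBytes()).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the null guard; none of these names exist in Hadoop.
final class PasswordFallback {

    /** Stand-in for the job token; only the accessor used by the patch is modelled. */
    interface TokenLike {
        byte[] getPassword();
    }

    private static final byte[] DEFAULT_PASSWORD =
            "no password".getBytes(StandardCharsets.UTF_8);

    /** Returns the token's password, or a placeholder when no token is available. */
    static byte[] passwordFor(TokenLike jobToken) {
        if (jobToken == null) {
            // Return a copy so callers cannot modify the shared default.
            return DEFAULT_PASSWORD.clone();
        }
        return jobToken.getPassword();
    }

    public static void main(String[] args) {
        // With no token (the situation that triggered the NullPointerException) the placeholder is used.
        System.out.println(new String(passwordFor(null), StandardCharsets.UTF_8));
    }
}
```

Falling back to a fixed placeholder password avoids the crash in local mode, but whether that is acceptable for a secured cluster is a separate question that the original answer does not address.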
