Spark driver throws weird exceptions when Spark standalone cluster is not available - java

Spark version : 1.4.1
I am running the driver program below, which tries to create a SparkContext for a remote Spark standalone cluster. When the remote cluster is not available, Spark reports the following errors. What would be the reason for these exceptions (
ERROR OneForOneStrategy:
java.lang.NullPointerException
and
ERROR SparkContext: Error initializing SparkContext.
java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext
) and how can I eliminate them?
Note: when I run against a running local or remote standalone cluster, my application works fine.
Java Program
public class SparkContextTest {
    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {
        JavaSparkContext sc = null;
        try {
            SparkConf conf = new SparkConf();
            conf.setAppName("testABC");
            conf.set("spark.scheduler.mode", "FAIR");
            conf.setMaster("spark://remote-server:7077")
                .set("spark.driver.allowMultipleContexts", "false")
                .set("spark.executor.memory", "1g")
                .set("spark.driver.maxResultSize", "1g");
            sc = new JavaSparkContext(conf);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            if (null != sc) {
                sc.clearCallSite();
                sc.close();
            }
        }
    }
}
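Both errors appear to surface only after Spark itself has given up on the master ("All masters are unresponsive! Giving up." in the log below), so they come out of Spark's internal shutdown path rather than from this driver code. One thing the driver can do is avoid building the context at all when the master is clearly unreachable. A minimal sketch, assuming a plain TCP connect to remote-server:7077 is an acceptable availability probe (isMasterReachable is a hypothetical helper, not part of the original program):

    // Hypothetical pre-check: only build the JavaSparkContext if the master port accepts a TCP connection.
    private static boolean isMasterReachable(String host, int port, int timeoutMs) {
        try (java.net.Socket socket = new java.net.Socket()) {
            socket.connect(new java.net.InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (java.io.IOException e) {
            return false;
        }
    }

    // Possible usage inside main(), before "sc = new JavaSparkContext(conf);":
    // if (!isMasterReachable("remote-server", 7077, 2000)) {
    //     System.err.println("Spark master is not reachable, skipping context creation");
    //     return;
    // }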
Spark log
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data/downloads/spark-1.4.1-bin-hadoop2.6/lib/spark-assembly-1.4.1-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data/downloads/spark-1.4.1-bin-hadoop2.6/lib/spark-examples-1.4.1-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/12/09 17:11:23 INFO SparkContext: Running Spark version 1.4.1
15/12/09 17:11:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/12/09 17:11:24 WARN Utils: Your hostname, pesamara-mobl-vm1 resolves to a loopback address: 127.0.0.1; using 10.30.9.107 instead (on interface eth0)
15/12/09 17:11:24 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
15/12/09 17:11:25 INFO SecurityManager: Changing view acls to: pes
15/12/09 17:11:25 INFO SecurityManager: Changing modify acls to: pes
15/12/09 17:11:25 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(pes); users with modify permissions: Set(pes)
15/12/09 17:11:26 INFO Slf4jLogger: Slf4jLogger started
15/12/09 17:11:27 INFO Remoting: Starting remoting
15/12/09 17:11:27 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver#10.30.9.107:55740]
15/12/09 17:11:27 INFO Utils: Successfully started service 'sparkDriver' on port 55740.
15/12/09 17:11:27 INFO SparkEnv: Registering MapOutputTracker
15/12/09 17:11:27 INFO SparkEnv: Registering BlockManagerMaster
15/12/09 17:11:27 INFO DiskBlockManager: Created local directory at /tmp/spark-30d61b03-0b1c-4250-b68e-c2404c7884a8/blockmgr-3226ed7e-f8e5-40a2-bfb1-ffabb51cd0e0
15/12/09 17:11:28 INFO MemoryStore: MemoryStore started with capacity 491.5 MB
15/12/09 17:11:28 INFO HttpFileServer: HTTP File server directory is /tmp/spark-30d61b03-0b1c-4250-b68e-c2404c7884a8/httpd-7f2572c2-5677-446e-a80a-6f9d05ee2891
15/12/09 17:11:28 INFO HttpServer: Starting HTTP Server
15/12/09 17:11:28 INFO Utils: Successfully started service 'HTTP file server' on port 45047.
15/12/09 17:11:28 INFO SparkEnv: Registering OutputCommitCoordinator
15/12/09 17:11:28 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/12/09 17:11:28 INFO SparkUI: Started SparkUI at http://10.30.9.107:4040
15/12/09 17:11:29 INFO FairSchedulableBuilder: Created default pool default, schedulingMode: FIFO, minShare: 0, weight: 1
15/12/09 17:11:29 INFO AppClient$ClientActor: Connecting to master akka.tcp://sparkMaster#localhost2:7077/user/Master...
15/12/09 17:11:29 WARN AppClient$ClientActor: Could not connect to akka.tcp://sparkMaster#localhost2:7077: akka.remote.InvalidAssociation: Invalid address: akka.tcp://sparkMaster#localhost2:7077
15/12/09 17:11:29 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkMaster#localhost2:7077]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: localhost2: unknown error
15/12/09 17:11:49 INFO AppClient$ClientActor: Connecting to master akka.tcp://sparkMaster#localhost2:7077/user/Master...
15/12/09 17:11:49 WARN AppClient$ClientActor: Could not connect to akka.tcp://sparkMaster#localhost2:7077: akka.remote.InvalidAssociation: Invalid address: akka.tcp://sparkMaster#localhost2:7077
15/12/09 17:11:49 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkMaster#localhost2:7077]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: localhost2: unknown error
15/12/09 17:12:09 INFO AppClient$ClientActor: Connecting to master akka.tcp://sparkMaster#localhost2:7077/user/Master...
15/12/09 17:12:09 WARN AppClient$ClientActor: Could not connect to akka.tcp://sparkMaster#localhost2:7077: akka.remote.InvalidAssociation: Invalid address: akka.tcp://sparkMaster#localhost2:7077
15/12/09 17:12:09 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkMaster#localhost2:7077]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: localhost2: unknown error
15/12/09 17:12:29 ERROR SparkDeploySchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.
15/12/09 17:12:29 WARN SparkDeploySchedulerBackend: Application ID is not initialized yet.
15/12/09 17:12:29 INFO SparkUI: Stopped Spark web UI at http://10.30.9.107:4040
15/12/09 17:12:29 INFO DAGScheduler: Stopping DAGScheduler
15/12/09 17:12:29 INFO SparkDeploySchedulerBackend: Shutting down all executors
15/12/09 17:12:29 INFO SparkDeploySchedulerBackend: Asking each executor to shut down
15/12/09 17:12:29 ERROR OneForOneStrategy:
java.lang.NullPointerException
at org.apache.spark.deploy.client.AppClient$ClientActor$$anonfun$receiveWithLogging$1.applyOrElse(AppClient.scala:160)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59)
at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
at org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at org.apache.spark.deploy.client.AppClient$ClientActor.aroundReceive(AppClient.scala:61)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
15/12/09 17:12:29 INFO AppClient$ClientActor: Connecting to master akka.tcp://sparkMaster#localhost2:7077/user/Master...
15/12/09 17:12:29 WARN AppClient$ClientActor: Could not connect to akka.tcp://sparkMaster#localhost2:7077: akka.remote.InvalidAssociation: Invalid address: akka.tcp://sparkMaster#localhost2:7077
15/12/09 17:12:29 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkMaster#localhost2:7077]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: localhost2: unknown error
15/12/09 17:12:29 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 54184.
15/12/09 17:12:29 INFO NettyBlockTransferService: Server created on 54184
15/12/09 17:12:29 INFO BlockManagerMaster: Trying to register BlockManager
15/12/09 17:12:29 INFO BlockManagerMasterEndpoint: Registering block manager 10.30.9.107:54184 with 491.5 MB RAM, BlockManagerId(driver, 10.30.9.107, 54184)
15/12/09 17:12:29 INFO BlockManagerMaster: Registered BlockManager
15/12/09 17:12:30 ERROR SparkContext: Error initializing SparkContext.
java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext
at org.apache.spark.SparkContext.org$apache$spark$SparkContext$$assertNotStopped(SparkContext.scala:103)
at org.apache.spark.SparkContext.getSchedulingMode(SparkContext.scala:1503)
at org.apache.spark.SparkContext.postEnvironmentUpdate(SparkContext.scala:2007)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:543)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)
at org.apache.spark.examples.sql.SparkContextTest.main(SparkContextTest.java:32)
15/12/09 17:12:30 INFO SparkContext: SparkContext already stopped.
java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext
at org.apache.spark.SparkContext.org$apache$spark$SparkContext$$assertNotStopped(SparkContext.scala:103)
at org.apache.spark.SparkContext.getSchedulingMode(SparkContext.scala:1503)
at org.apache.spark.SparkContext.postEnvironmentUpdate(SparkContext.scala:2007)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:543)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)
at org.apache.spark.examples.sql.SparkContextTest.main(SparkContextTest.java:32)
15/12/09 17:12:30 INFO DiskBlockManager: Shutdown hook called
15/12/09 17:12:30 INFO Utils: path = /tmp/spark-30d61b03-0b1c-4250-b68e-c2404c7884a8/blockmgr-3226ed7e-f8e5-40a2-bfb1-ffabb51cd0e0, already present as root for deletion.
15/12/09 17:12:30 INFO Utils: Shutdown hook called
15/12/09 17:12:30 INFO Utils: Deleting directory /tmp/spark-30d61b03-0b1c-4250-b68e-c2404c7884a8/httpd-7f2572c2-5677-446e-a80a-6f9d05ee2891
15/12/09 17:12:30 INFO Utils: Deleting directory /tmp/spark-30d61b03-0b1c-4250-b68e-c2404c7884a8

Related

How to configure beam python sdk with spark in a kubernetes environment

TL;DR
How to configure Apache Beam pipeline options with "environment_type" = EXTERNAL or PROCESS?
Description
Currently, we have a standalone Spark cluster inside Kubernetes. Following this solution (and the setup), we launch a Beam pipeline by creating an embedded Spark job server on the Spark worker, which needs to run a Python SDK harness alongside it.
Apache Beam allows running the Python SDK harness in 4 different ways:
"DOCKER" - the default, and not possible inside a Kubernetes cluster (it would use "container inside container")
"LOOPBACK" - only for testing, not possible with more than 1 worker pod
"EXTERNAL" - the ideal setup: "just" create a sidecar container to run in the same pod as the Spark workers
"PROCESS" - execute a process in the Spark worker; not ideal, but it could work too.
Development
Using "External" - Implementing the spark worker with the python sdk on the same pod:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spark-worker
  labels:
    app: spark-worker
spec:
  selector:
    matchLabels:
      app: spark-worker
  template:
    metadata:
      labels:
        app: spark-worker
    spec:
      containers:
        - name: spark-worker
          image: spark-py-custom:latest
          imagePullPolicy: Never
          ports:
            - containerPort: 8081
              protocol: TCP
          command: ['/bin/bash', "-c", "--"]
          args: ["/start-worker.sh"]
          resources:
            requests:
              cpu: 4
              memory: "5Gi"
            limits:
              cpu: 4
              memory: "5Gi"
          volumeMounts:
            - name: spark-jars
              mountPath: "/tmp"
        - name: python-beam-sdk
          image: apachebeam/python3.7_sdk:latest
          command: ["/opt/apache/beam/boot", "--worker_pool"]
          ports:
            - containerPort: 50000
          resources:
            limits:
              cpu: "1"
              memory: "1Gi"
      volumes:
        - name: spark-jars
          persistentVolumeClaim:
            claimName: spark-jars
And then, if we execute the command
python3 wordcount.py \
--output ./data_test/counts \
--runner=SparkRunner \
--spark_submit_uber_jar \
--spark_job_server_jar=beam-runners-spark-job-server-2.28.0.jar \
--spark_master_url=spark://spark-master:7077 \
--spark_rest_url=http://spark-master:6066 \
--environment_type=EXTERNAL \
--environment_config=localhost:50000
The terminal gets stuck in the "RUNNING" state:
INFO:apache_beam.internal.gcp.auth:Setting socket default timeout to 60 seconds.
INFO:apache_beam.internal.gcp.auth:socket default timeout is 60.0 seconds.
INFO:oauth2client.client:Timeout attempting to reach GCE metadata service.
WARNING:apache_beam.internal.gcp.auth:Unable to find default credentials to use: The Application Default Credentials are not available. They are available if running in Google Compute Engine. Otherwise, the environment variable GOOGLE_APPLICATION_CREDENTIALS must be defined pointing to a file defining the credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information.
Connecting anonymously.
WARNING:root:Make sure that locally built Python SDK docker image has Python 3.7 interpreter.
INFO:root:Default Python SDK image for environment is apache/beam_python3.7_sdk:2.28.0
INFO:apache_beam.runners.portability.fn_api_runner.translations:==================== <function lift_combiners at 0x7fc360c0b8c8> ====================
INFO:apache_beam.runners.portability.fn_api_runner.translations:==================== <function sort_stages at 0x7fc360c0f048> ====================
INFO:apache_beam.runners.portability.abstract_job_service:Artifact server started on port 36369
INFO:apache_beam.runners.portability.abstract_job_service:Running job 'job-2448721e-e686-41d4-b924-5f8c5ae73ac2'
INFO:apache_beam.runners.portability.spark_uber_jar_job_server:Submitted Spark job with ID driver-20210305172421-0000
INFO:apache_beam.runners.portability.portable_runner:Job state changed to STOPPED
INFO:apache_beam.runners.portability.portable_runner:Job state changed to RUNNING
And in the spark worker log:
21/03/05 17:24:25 INFO ExecutorRunner: Launch command: "/usr/local/openjdk-8/bin/java" "-cp" "/opt/spark/conf/:/opt/spark/jars/*" "-Xmx1024M" "-Dspark.driver.port=45203" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler#spark-worker-64fd4ddd6-tqdrs:45203" "--executor-id" "0" "--hostname" "172.18.0.20" "--cores" "3" "--app-id" "app-20210305172425-0000" "--worker-url" "spark://Worker#172.18.0.20:44365"
And on the python sdk:
2021/03/05 17:19:52 Starting worker pool 1: python -m apache_beam.runners.worker.worker_pool_main --service_port=50000 --container_executable=/opt/apache/beam/boot
Starting worker with command ['/opt/apache/beam/boot', '--id=1-1', '--logging_endpoint=', '--artifact_endpoint=', '--provision_endpoint=', '--control_endpoint=']
2021/03/05 17:24:32 No logging endpoint provided.
Checking the spark worker stderr (on localhost 8081):
Spark Executor Command: "/usr/local/openjdk-8/bin/java" "-cp" "/opt/spark/conf/:/opt/spark/jars/*" "-Xmx1024M" "-Dspark.driver.port=45203" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler#spark-worker-64fd4ddd6-tqdrs:45203" "--executor-id" "0" "--hostname" "172.18.0.20" "--cores" "3" "--app-id" "app-20210305172425-0000" "--worker-url" "spark://Worker#172.18.0.20:44365"
========================================
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
21/03/05 17:24:26 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 230#spark-worker-64fd4ddd6-tqdrs
21/03/05 17:24:26 INFO SignalUtils: Registered signal handler for TERM
21/03/05 17:24:26 INFO SignalUtils: Registered signal handler for HUP
21/03/05 17:24:26 INFO SignalUtils: Registered signal handler for INT
21/03/05 17:24:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
21/03/05 17:24:27 INFO SecurityManager: Changing view acls to: root
21/03/05 17:24:27 INFO SecurityManager: Changing modify acls to: root
21/03/05 17:24:27 INFO SecurityManager: Changing view acls groups to:
21/03/05 17:24:27 INFO SecurityManager: Changing modify acls groups to:
21/03/05 17:24:27 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
21/03/05 17:24:27 INFO TransportClientFactory: Successfully created connection to spark-worker-64fd4ddd6-tqdrs/172.18.0.20:45203 after 50 ms (0 ms spent in bootstraps)
21/03/05 17:24:27 INFO SecurityManager: Changing view acls to: root
21/03/05 17:24:27 INFO SecurityManager: Changing modify acls to: root
21/03/05 17:24:27 INFO SecurityManager: Changing view acls groups to:
21/03/05 17:24:27 INFO SecurityManager: Changing modify acls groups to:
21/03/05 17:24:27 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
21/03/05 17:24:28 INFO TransportClientFactory: Successfully created connection to spark-worker-64fd4ddd6-tqdrs/172.18.0.20:45203 after 1 ms (0 ms spent in bootstraps)
21/03/05 17:24:28 INFO DiskBlockManager: Created local directory at /tmp/spark-bdffc2b3-f57a-42fa-a720-e22274b86b67/executor-f1eff7ca-d2cd-4ff4-b18b-c8d6a520f590/blockmgr-c61fb65f-ea97-4bd5-bf15-e0025845a251
21/03/05 17:24:28 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
21/03/05 17:24:28 INFO CoarseGrainedExecutorBackend: Connecting to driver: spark://CoarseGrainedScheduler#spark-worker-64fd4ddd6-tqdrs:45203
21/03/05 17:24:28 INFO WorkerWatcher: Connecting to worker spark://Worker#172.18.0.20:44365
21/03/05 17:24:28 INFO TransportClientFactory: Successfully created connection to /172.18.0.20:44365 after 1 ms (0 ms spent in bootstraps)
21/03/05 17:24:28 INFO WorkerWatcher: Successfully connected to spark://Worker#172.18.0.20:44365
21/03/05 17:24:28 INFO CoarseGrainedExecutorBackend: Successfully registered with driver
21/03/05 17:24:28 INFO Executor: Starting executor ID 0 on host 172.18.0.20
21/03/05 17:24:28 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 42561.
21/03/05 17:24:28 INFO NettyBlockTransferService: Server created on 172.18.0.20:42561
21/03/05 17:24:28 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
21/03/05 17:24:28 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(0, 172.18.0.20, 42561, None)
21/03/05 17:24:28 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(0, 172.18.0.20, 42561, None)
21/03/05 17:24:28 INFO BlockManager: Initialized BlockManager: BlockManagerId(0, 172.18.0.20, 42561, None)
Where it gets stuck forever.
Checking the source code of the Python SDK, we can see that "No logging endpoint provided." is fatal, and that it comes from the lack of configuration sent to the SDK worker (no logging/artifact/provision/control endpoints). If I try to add "--artifact_endpoint" to the python command, I get a gRPC error of failed communication, because the job server creates its own artifact endpoint. In this setup it would be necessary to configure all of these endpoints (probably as localhost, since the SDK and the worker are in the same pod) with fixed ports, but I can't find how to configure them. Checking SO, I can find a related issue, but in his case he gets the Python SDK configuration automatically (maybe a Spark runner issue?).
Using "PROCESS" - Trying to run the python SDK within a process, I built the python SDK with ./gradlew :sdks:python:container:py37:docker, copied the sdks/python/container/build/target/launcher/linux_amd64/boot executable to /python_sdk/boot inside the spark worker container and used the command:
python3 wordcount.py \
--output ./data_test/counts \
--runner=SparkRunner \
--spark_submit_uber_jar \
--spark_master_url=spark://spark-master:7077 \
--spark_rest_url=http://spark-master:6066 \
--environment_type=PROCESS \
--spark_job_server_jar=beam-runners-spark-job-server-2.28.0.jar \
--environment_config='{"os":"linux","arch":"x84_64","command":"/python_sdk/boot"}'
Resulting in "run time exception" in the terminal:
INFO:apache_beam.runners.portability.portable_runner:Job state changed to FAILED
Traceback (most recent call last):
File "wordcount.py", line 91, in <module>
run()
File "wordcount.py", line 86, in run
output | "Write" >> WriteToText(known_args.output)
File "/usr/local/lib/python3.7/dist-packages/apache_beam/pipeline.py", line 581, in __exit__
self.result.wait_until_finish()
File "/usr/local/lib/python3.7/dist-packages/apache_beam/runners/portability/portable_runner.py", line 608, in wait_until_finish
raise self._runtime_exception
RuntimeError: Pipeline job-95c13aa5-96ab-4d1d-bc68-7f9d203c8251 failed in state FAILED: unknown error
and checking the Spark worker stderr log again, I can see that the problem is java.lang.IllegalArgumentException: No filesystem found for scheme classpath, whose cause I don't know.
21/03/05 18:33:12 INFO Executor: Adding file:/opt/spark/work/app-20210305183309-0000/0/./javax.servlet-api-3.1.0.jar to class loader
21/03/05 18:33:12 INFO TorrentBroadcast: Started reading broadcast variable 0
21/03/05 18:33:12 INFO TransportClientFactory: Successfully created connection to spark-worker-89c5c4c87-5q45s/172.18.0.20:34783 after 1 ms (0 ms spent in bootstraps)
21/03/05 18:33:12 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 12.3 KB, free 366.3 MB)
21/03/05 18:33:12 INFO TorrentBroadcast: Reading broadcast variable 0 took 63 ms
21/03/05 18:33:12 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 32.5 KB, free 366.3 MB)
21/03/05 18:33:13 INFO MemoryStore: Block rdd_13_0 stored as values in memory (estimated size 16.0 B, free 366.3 MB)
21/03/05 18:33:13 INFO MemoryStore: Block rdd_17_0 stored as values in memory (estimated size 16.0 B, free 366.3 MB)
21/03/05 18:33:13 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 5427 bytes result sent to driver
21/03/05 18:33:14 ERROR SerializingExecutor: Exception while executing runnable org.apache.beam.vendor.grpc.v1p26p0.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed#5f917914
java.lang.IllegalArgumentException: No filesystem found for scheme classpath
at org.apache.beam.sdk.io.FileSystems.getFileSystemInternal(FileSystems.java:467)
at org.apache.beam.sdk.io.FileSystems.matchNewResource(FileSystems.java:537)
at org.apache.beam.runners.fnexecution.artifact.ArtifactRetrievalService.getArtifact(ArtifactRetrievalService.java:125)
at org.apache.beam.runners.fnexecution.artifact.ArtifactRetrievalService.getArtifact(ArtifactRetrievalService.java:99)
at org.apache.beam.model.jobmanagement.v1.ArtifactRetrievalServiceGrpc$MethodHandlers.invoke(ArtifactRetrievalServiceGrpc.java:327)
at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:172)
at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)
at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)
at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:40)
at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.Contexts$ContextualizedServerCallListener.onHalfClose(Contexts.java:86)
at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331)
at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:817)
at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
21/03/05 18:33:16 ERROR SerializingExecutor: Exception while executing runnable org.apache.beam.vendor.grpc.v1p26p0.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed#67fb2b2c
java.lang.IllegalArgumentException: No filesystem found for scheme classpath
at org.apache.beam.sdk.io.FileSystems.getFileSystemInternal(FileSystems.java:467)
at org.apache.beam.sdk.io.FileSystems.matchNewResource(FileSystems.java:537)
at org.apache.beam.runners.fnexecution.artifact.ArtifactRetrievalService.getArtifact(ArtifactRetrievalService.java:125)
at org.apache.beam.runners.fnexecution.artifact.ArtifactRetrievalService.getArtifact(ArtifactRetrievalService.java:99)
at org.apache.beam.model.jobmanagement.v1.ArtifactRetrievalServiceGrpc$MethodHandlers.invoke(ArtifactRetrievalServiceGrpc.java:327)
at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:172)
at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)
at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)
at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:40)
at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.Contexts$ContextualizedServerCallListener.onHalfClose(Contexts.java:86)
at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331)
at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:817)
at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
21/03/05 18:33:19 INFO ProcessEnvironmentFactory: Still waiting for startup of environment '/python_sdk/boot' for worker id 1-1
Probably some configuration parameters are missing.
Note
If I execute the command
python3 wordcount.py \
--output ./data_test/counts \
--runner=SparkRunner \
--spark_submit_uber_jar \
--spark_job_server_jar=beam-runners-spark-job-server-2.28.0.jar \
--spark_master_url=spark://spark-master:7077 \
--spark_rest_url=http://spark-master:6066 \
--environment_type=LOOPBACK
inside our Spark worker (having only one worker in the Spark cluster), we have a fully working Beam pipeline with these logs.
Using "External" - this definitely seems like a bug in Beam. The worker endpoints are supposed to be set up to use localhost; I don't think it is possible to configure them. I'm not sure why they would be missing; one educated guess is that the servers silently fail to start, leaving the endpoints empty. I filed a bug report (BEAM-11957) for this issue.
Using "Process" - The scheme classpath corresponds to ClassLoaderFileSystem. This file system is usually loaded using AutoService, which depends on ClassLoaderFileSystemRegistrar being present on the classpath (no relation to the name of the file system itself). The classpath of the job jar is based on spark_job_server_jar. Where are you getting your beam-runners-spark-job-server-2.28.0.jar from?

Apache Beam Spark runner side inputs causing SIGNAL TERM

I wish to use side inputs to pass some configuration to my pipeline; however, the driver commands a shutdown right after the PCollectionView has been created when running on my local Spark cluster (Spark 2.4.7, 1 master, 1 worker, running on localhost). This, however, works perfectly on the DirectRunner.
I have attempted to strip the code to its bare essentials (see below). Still the issue persists when running on the spark cluster. DirectRunner still works fine.
The spark-cluster does accept jobs, and I have successfully run a "hello-world" pipeline that completed without issue.
What is happening here?
Logs pasted below.
Environment:
------------
Beam: 2.25
SparkRunner: 2.25
Java version: 11.0.9-ea
Maven Compiler Source: 1.8
Maven Compiler Target: 1.8
Spark version: 2.4.7
// Pipeline
private static PipelineResult runPipeline(PipelineOptions options) {
    Pipeline p = Pipeline.create(options);
    PCollectionView<String> schema = p
        .apply("Dummy tabular schema builder", Create.of("This is a string"))
        .apply("Collect", View.asSingleton());
    p
        .apply("Hello world", Create.of("Hello world"))
        .apply("Side input test", ParDo.of(DummyFn.builder().setSchemaView(schema).build()).withSideInput("schema", schema))
        .apply(ConsoleIO.create());
    return p.run();
}
// Simple Fn that prints the side input
@AutoValue
public abstract class DummyFn extends DoFn<String, String> {
    private final static Logger LOG = LoggerFactory.getLogger(DummyFn.class);

    public static Builder builder() {
        return new org.odp.beam.io.fn.AutoValue_DummyFn.Builder();
    }

    public abstract PCollectionView<String> getSchemaView();

    @ProcessElement
    public void processElement(@Element String element,
                               OutputReceiver<String> out,
                               ProcessContext context) {
        String schema = context.sideInput(getSchemaView());
        LOG.warn(schema.toString());
        out.output(element.toUpperCase());
    }

    @AutoValue.Builder
    public abstract static class Builder {
        public abstract Builder setSchemaView(PCollectionView<String> value);
        public abstract DummyFn build();
    }
}
// Simple PTransform that prints the output of the toString-method
public class ConsoleIO<T> extends PTransform<PCollection<T>, PDone> {
    public static <T> ConsoleIO<T> create() {
        return new ConsoleIO();
    }

    @Override
    public PDone expand(PCollection<T> input) {
        input
            .apply("Print elements", ParDo.of(new PrintElementFn<T>()));
        return PDone.in(input.getPipeline());
    }

    static class PrintElementFn<T> extends DoFn<T, Void> {
        @DoFn.ProcessElement
        public void processElement(@Element T element, ProcessContext context) throws Exception {
            System.out.println(element.toString());
        }
    }
}
$ spark-submit \
--class org.odp.beam.extractors.CsvToCdfRawExtractor \
--verbose \
--driver-memory 4G \
--executor-memory 4G \
--total-executor-cores 4 \
--deploy-mode client \
--supervise \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.network.timeout=420000 \
--master spark://192.168.10.172:7077 \
target/beam-poc-0.1-shaded.jar \
--runner=SparkRunner
Using properties file: null
20/11/10 15:46:44 WARN Utils: Your hostname, localhost.localdomain resolves to a loopback address: 127.0.0.1; using 192.168.10.172 instead (on interface enp7s0)
20/11/10 15:46:44 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/home/tom/app/spark/spark-2.4.7-bin-hadoop2.7/jars/spark-unsafe_2.11-2.4.7.jar) to method java.nio.Bits.unaligned()
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Parsed arguments:
master spark://192.168.10.172:7077
deployMode client
executorMemory 4G
executorCores null
totalExecutorCores 4
propertiesFile null
driverMemory 4G
driverCores null
driverExtraClassPath null
driverExtraLibraryPath null
driverExtraJavaOptions null
supervise true
queue null
numExecutors null
files null
pyFiles null
archives null
mainClass org.odp.beam.extractors.CsvToCdfRawExtractor
primaryResource file:/home/tom/project/odf/beam-poc/target/beam-poc-0.1-shaded.jar
name org.odp.beam.extractors.CsvToCdfRawExtractor
childArgs [--runner=SparkRunner]
jars null
packages null
packagesExclusions null
repositories null
verbose true
Spark properties used, including those specified through
--conf and those from the properties file null:
(spark.network.timeout,420000)
(spark.driver.memory,4G)
(spark.dynamicAllocation.enabled,false)
20/11/10 15:46:45 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Main class:
org.odp.beam.extractors.CsvToCdfRawExtractor
Arguments:
--runner=SparkRunner
Spark config:
(spark.jars,file:/home/tom/project/odf/beam-poc/target/beam-poc-0.1-shaded.jar)
(spark.app.name,org.odp.beam.extractors.CsvToCdfRawExtractor)
(spark.cores.max,4)
(spark.network.timeout,420000)
(spark.driver.memory,4G)
(spark.submit.deployMode,client)
(spark.master,spark://192.168.10.172:7077)
(spark.executor.memory,4G)
(spark.dynamicAllocation.enabled,false)
Classpath elements:
file:/home/tom/project/odf/beam-poc/target/beam-poc-0.1-shaded.jar
log4j:WARN No appenders could be found for logger (org.apache.beam.sdk.options.PipelineOptionsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/11/10 15:46:46 INFO SparkContext: Running Spark version 2.4.7
20/11/10 15:46:47 INFO SparkContext: Submitted application: CsvToCdfRawExtractor
20/11/10 15:46:47 INFO SecurityManager: Changing view acls to: tom
20/11/10 15:46:47 INFO SecurityManager: Changing modify acls to: tom
20/11/10 15:46:47 INFO SecurityManager: Changing view acls groups to:
20/11/10 15:46:47 INFO SecurityManager: Changing modify acls groups to:
20/11/10 15:46:47 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(tom); groups with view permissions: Set(); users with modify permissions: Set(tom); groups with modify permissions: Set()
20/11/10 15:46:47 INFO Utils: Successfully started service 'sparkDriver' on port 35103.
20/11/10 15:46:47 INFO SparkEnv: Registering MapOutputTracker
20/11/10 15:46:47 INFO SparkEnv: Registering BlockManagerMaster
20/11/10 15:46:47 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
20/11/10 15:46:47 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
20/11/10 15:46:47 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-58419068-d0ad-45c9-b90b-92b659dee1c3
20/11/10 15:46:47 INFO MemoryStore: MemoryStore started with capacity 2.2 GB
20/11/10 15:46:47 INFO SparkEnv: Registering OutputCommitCoordinator
20/11/10 15:46:47 INFO Utils: Successfully started service 'SparkUI' on port 4040.
20/11/10 15:46:47 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://fedora:4040
20/11/10 15:46:47 INFO SparkContext: Added JAR file:/home/tom/project/odf/beam-poc/target/beam-poc-0.1-shaded.jar at spark://fedora:35103/jars/beam-poc-0.1-shaded.jar with timestamp 1605019607514
20/11/10 15:46:47 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://192.168.10.172:7077...
20/11/10 15:46:47 INFO TransportClientFactory: Successfully created connection to /192.168.10.172:7077 after 25 ms (0 ms spent in bootstraps)
20/11/10 15:46:47 INFO StandaloneSchedulerBackend: Connected to Spark cluster with app ID app-20201110154647-0020
20/11/10 15:46:47 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20201110154647-0020/0 on worker-20201109144752-192.168.10.172-45535 (192.168.10.172:45535) with 4 core(s)
20/11/10 15:46:47 INFO StandaloneSchedulerBackend: Granted executor ID app-20201110154647-0020/0 on hostPort 192.168.10.172:45535 with 4 core(s), 4.0 GB RAM
20/11/10 15:46:47 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 33169.
20/11/10 15:46:47 INFO NettyBlockTransferService: Server created on fedora:33169
20/11/10 15:46:47 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
20/11/10 15:46:47 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20201110154647-0020/0 is now RUNNING
20/11/10 15:46:47 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, fedora, 33169, None)
20/11/10 15:46:47 INFO BlockManagerMasterEndpoint: Registering block manager fedora:33169 with 2.2 GB RAM, BlockManagerId(driver, fedora, 33169, None)
20/11/10 15:46:47 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, fedora, 33169, None)
20/11/10 15:46:47 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, fedora, 33169, None)
20/11/10 15:46:47 INFO StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
20/11/10 15:46:48 INFO SparkRunner$Evaluator: Entering directly-translatable composite transform: 'Collect/Combine.GloballyAsSingletonView/Combine.globally(Singleton)'
20/11/10 15:46:48 INFO MetricsAccumulator: Instantiated metrics accumulator: MetricQueryResults()
20/11/10 15:46:48 INFO AggregatorsAccumulator: Instantiated aggregators accumulator:
20/11/10 15:46:48 INFO SparkRunner$Evaluator: Evaluating Read(CreateSource)
20/11/10 15:46:48 INFO SparkRunner$Evaluator: Entering directly-translatable composite transform: 'Collect/Combine.GloballyAsSingletonView/Combine.globally(Singleton)'
20/11/10 15:46:48 INFO SparkRunner$Evaluator: Evaluating Combine.globally(Singleton)
20/11/10 15:46:48 INFO SparkContext: Starting job: aggregate at GroupCombineFunctions.java:107
20/11/10 15:46:48 INFO DAGScheduler: Got job 0 (aggregate at GroupCombineFunctions.java:107) with 1 output partitions
20/11/10 15:46:48 INFO DAGScheduler: Final stage: ResultStage 0 (aggregate at GroupCombineFunctions.java:107)
20/11/10 15:46:48 INFO DAGScheduler: Parents of final stage: List()
20/11/10 15:46:48 INFO DAGScheduler: Missing parents: List()
20/11/10 15:46:48 INFO DAGScheduler: Submitting ResultStage 0 (Dummy tabular schema builder/Read(CreateSource).out Bounded[0] at RDD at SourceRDD.java:80), which has no missing parents
20/11/10 15:46:48 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 16.2 KB, free 2.2 GB)
20/11/10 15:46:48 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 6.8 KB, free 2.2 GB)
20/11/10 15:46:48 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on fedora:33169 (size: 6.8 KB, free: 2.2 GB)
20/11/10 15:46:48 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1184
20/11/10 15:46:48 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (Dummy tabular schema builder/Read(CreateSource).out Bounded[0] at RDD at SourceRDD.java:80) (first 15 tasks are for partitions Vector(0))
20/11/10 15:46:48 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
20/11/10 15:46:49 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (192.168.10.172:48382) with ID 0
20/11/10 15:46:49 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 192.168.10.172, executor 0, partition 0, PROCESS_LOCAL, 8546 bytes)
20/11/10 15:46:49 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.10.172:43781 with 2.2 GB RAM, BlockManagerId(0, 192.168.10.172, 43781, None)
20/11/10 15:46:50 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.10.172:43781 (size: 6.8 KB, free: 2.2 GB)
20/11/10 15:46:51 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 2056 ms on 192.168.10.172 (executor 0) (1/1)
20/11/10 15:46:51 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
20/11/10 15:46:51 INFO DAGScheduler: ResultStage 0 (aggregate at GroupCombineFunctions.java:107) finished in 3.091 s
20/11/10 15:46:51 INFO DAGScheduler: Job 0 finished: aggregate at GroupCombineFunctions.java:107, took 3.132405 s
20/11/10 15:46:51 INFO SparkRunner$Evaluator: Evaluating org.apache.beam.sdk.transforms.View$VoidKeyToMultimapMaterialization$VoidKeyToMultimapMaterializationDoFn#14924f41
20/11/10 15:46:51 INFO SparkRunner$Evaluator: Evaluating View.CreatePCollectionView
20/11/10 15:46:51 INFO SparkContext: Invoking stop() from shutdown hook
20/11/10 15:46:51 INFO SparkUI: Stopped Spark web UI at http://fedora:4040
20/11/10 15:46:51 INFO StandaloneSchedulerBackend: Shutting down all executors
20/11/10 15:46:51 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
20/11/10 15:46:51 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
20/11/10 15:46:51 INFO MemoryStore: MemoryStore cleared
20/11/10 15:46:51 INFO BlockManager: BlockManager stopped
20/11/10 15:46:51 INFO BlockManagerMaster: BlockManagerMaster stopped
20/11/10 15:46:51 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
20/11/10 15:46:51 INFO SparkContext: Successfully stopped SparkContext
20/11/10 15:46:51 INFO ShutdownHookManager: Shutdown hook called
20/11/10 15:46:51 INFO ShutdownHookManager: Deleting directory /tmp/spark-665a903f-22db-497e-989f-a5ca3e0635e2
20/11/10 15:46:51 INFO ShutdownHookManager: Deleting directory /tmp/spark-d4b5a04f-f6a3-48ff-b229-4eb966151d86
Spark Executor Command: "/usr/lib/jvm/java-11-openjdk-11.0.9.6-0.0.ea.fc33.x86_64/bin/java" "-cp" "/home/tom/app/spark/spark/conf/:/home/tom/app/spark/spark/jars/*" "-Xmx4096M" "-Dspark.driver.port=35103" "-Dspark.network.timeout=420000" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler#fedora:35103" "--executor-id" "0" "--hostname" "192.168.10.172" "--cores" "4" "--app-id" "app-20201110154647-0020" "--worker-url" "spark://Worker#192.168.10.172:45535"
========================================
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/11/10 15:46:48 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 360353#localhost.localdomain
20/11/10 15:46:48 INFO SignalUtils: Registered signal handler for TERM
20/11/10 15:46:48 INFO SignalUtils: Registered signal handler for HUP
20/11/10 15:46:48 INFO SignalUtils: Registered signal handler for INT
20/11/10 15:46:48 WARN Utils: Your hostname, localhost.localdomain resolves to a loopback address: 127.0.0.1; using 192.168.10.172 instead (on interface enp7s0)
20/11/10 15:46:48 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/home/tom/app/spark/spark-2.4.7-bin-hadoop2.7/jars/spark-unsafe_2.11-2.4.7.jar) to method java.nio.Bits.unaligned()
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
20/11/10 15:46:48 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/11/10 15:46:48 INFO SecurityManager: Changing view acls to: tom
20/11/10 15:46:48 INFO SecurityManager: Changing modify acls to: tom
20/11/10 15:46:48 INFO SecurityManager: Changing view acls groups to:
20/11/10 15:46:48 INFO SecurityManager: Changing modify acls groups to:
20/11/10 15:46:48 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(tom); groups with view permissions: Set(); users with modify permissions: Set(tom); groups with modify permissions: Set()
20/11/10 15:46:49 INFO TransportClientFactory: Successfully created connection to fedora/192.168.10.172:35103 after 54 ms (0 ms spent in bootstraps)
20/11/10 15:46:49 INFO SecurityManager: Changing view acls to: tom
20/11/10 15:46:49 INFO SecurityManager: Changing modify acls to: tom
20/11/10 15:46:49 INFO SecurityManager: Changing view acls groups to:
20/11/10 15:46:49 INFO SecurityManager: Changing modify acls groups to:
20/11/10 15:46:49 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(tom); groups with view permissions: Set(); users with modify permissions: Set(tom); groups with modify permissions: Set()
20/11/10 15:46:49 INFO TransportClientFactory: Successfully created connection to fedora/192.168.10.172:35103 after 4 ms (0 ms spent in bootstraps)
20/11/10 15:46:49 INFO DiskBlockManager: Created local directory at /tmp/spark-0e47fa97-8714-4e8e-950e-b1032fe36995/executor-e7667d04-198d-4144-8897-ddada0bfd1de/blockmgr-019262b3-4d3e-4158-b984-ff85c0846191
20/11/10 15:46:49 INFO MemoryStore: MemoryStore started with capacity 2.2 GB
20/11/10 15:46:49 INFO CoarseGrainedExecutorBackend: Connecting to driver: spark://CoarseGrainedScheduler#fedora:35103
20/11/10 15:46:49 INFO WorkerWatcher: Connecting to worker spark://Worker#192.168.10.172:45535
20/11/10 15:46:49 INFO TransportClientFactory: Successfully created connection to /192.168.10.172:45535 after 2 ms (0 ms spent in bootstraps)
20/11/10 15:46:49 INFO WorkerWatcher: Successfully connected to spark://Worker#192.168.10.172:45535
20/11/10 15:46:49 INFO CoarseGrainedExecutorBackend: Successfully registered with driver
20/11/10 15:46:49 INFO Executor: Starting executor ID 0 on host 192.168.10.172
20/11/10 15:46:49 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 43781.
20/11/10 15:46:49 INFO NettyBlockTransferService: Server created on 192.168.10.172:43781
20/11/10 15:46:49 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
20/11/10 15:46:49 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(0, 192.168.10.172, 43781, None)
20/11/10 15:46:49 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(0, 192.168.10.172, 43781, None)
20/11/10 15:46:49 INFO BlockManager: Initialized BlockManager: BlockManagerId(0, 192.168.10.172, 43781, None)
20/11/10 15:46:49 INFO CoarseGrainedExecutorBackend: Got assigned task 0
20/11/10 15:46:49 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
20/11/10 15:46:49 INFO Executor: Fetching spark://fedora:35103/jars/beam-poc-0.1-shaded.jar with timestamp 1605019607514
20/11/10 15:46:49 INFO TransportClientFactory: Successfully created connection to fedora/192.168.10.172:35103 after 2 ms (0 ms spent in bootstraps)
20/11/10 15:46:49 INFO Utils: Fetching spark://fedora:35103/jars/beam-poc-0.1-shaded.jar to /tmp/spark-0e47fa97-8714-4e8e-950e-b1032fe36995/executor-e7667d04-198d-4144-8897-ddada0bfd1de/spark-62556d02-a044-4c2c-8f97-c7f25ef3e337/fetchFileTemp6325880319900581024.tmp
20/11/10 15:46:49 INFO Utils: Copying /tmp/spark-0e47fa97-8714-4e8e-950e-b1032fe36995/executor-e7667d04-198d-4144-8897-ddada0bfd1de/spark-62556d02-a044-4c2c-8f97-c7f25ef3e337/2058038551605019607514_cache to /home/tom/app/spark/spark-2.4.7-bin-hadoop2.7/work/app-20201110154647-0020/0/./beam-poc-0.1-shaded.jar
20/11/10 15:46:50 INFO Executor: Adding file:/home/tom/app/spark/spark-2.4.7-bin-hadoop2.7/work/app-20201110154647-0020/0/./beam-poc-0.1-shaded.jar to class loader
20/11/10 15:46:50 INFO TorrentBroadcast: Started reading broadcast variable 0
20/11/10 15:46:50 INFO TransportClientFactory: Successfully created connection to fedora/192.168.10.172:33169 after 2 ms (0 ms spent in bootstraps)
20/11/10 15:46:50 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 6.8 KB, free 2.2 GB)
20/11/10 15:46:50 INFO TorrentBroadcast: Reading broadcast variable 0 took 112 ms
20/11/10 15:46:50 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 16.2 KB, free 2.2 GB)
20/11/10 15:46:51 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 6312 bytes result sent to driver
20/11/10 15:46:51 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
20/11/10 15:46:51 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM

Submitting Job to Remote Apache Spark Server

Apache Spark (v1.6.1) is started as a service on an Ubuntu machine (10.10.0.102), using ./start-all.sh.
Now I need to submit a job to this server remotely using the Java API.
Following is the Java client code, run from a different machine (10.10.0.95).
String mySqlConnectionUrl = "jdbc:mysql://localhost:3306/demo?user=sec&password=sec";
String jars[] = new String[] {"/home/.m2/repository/com/databricks/spark-csv_2.10/1.4.0/spark-csv_2.10-1.4.0.jar",
        "/home/.m2/repository/org/apache/commons/commons-csv/1.1/commons-csv-1.1.jar",
        "/home/.m2/repository/mysql/mysql-connector-java/6.0.2/mysql-connector-java-6.0.2.jar"};
SparkConf sparkConf = new SparkConf()
        .setAppName("sparkCSVWriter")
        .setMaster("spark://10.10.0.102:7077")
        .setJars(jars);
JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf);
SQLContext sqlContext = new SQLContext(javaSparkContext);
Map<String, String> options = new HashMap<>();
options.put("driver", "com.mysql.jdbc.Driver");
options.put("url", mySqlConnectionUrl);
options.put("dbtable", "(select p.FIRST_NAME from person p) as firstName");
DataFrame dataFrame = sqlContext.read().format("jdbc").options(options).load();
dataFrame.write()
        .format("com.databricks.spark.csv")
        .option("header", "true")
        .option("delimiter", "|")
        .option("quote", "\"")
        .option("quoteMode", QuoteMode.NON_NUMERIC.toString())
        .option("escape", "\\")
        .save("persons.csv");
Configuration hadoopConfiguration = javaSparkContext.hadoopConfiguration();
FileSystem hdfs = FileSystem.get(hadoopConfiguration);
FileUtil.copyMerge(hdfs, new Path("persons.csv"), hdfs, new Path("/home/persons1.csv"), true, hadoopConfiguration, new String());
As per the code, I need to convert RDBMS data to CSV/JSON using Spark. But when I run this client application, it is able to connect to the remote Spark server, yet the console continuously prints the following WARN message:
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
And on the server side, in the Spark UI under running applications > executor summary > stderr log, the following error is reported.
Exception in thread "main" java.io.IOException: Failed to connect to /192.168.56.1:53112
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:200)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:187)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:183)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused: /192.168.56.1:53112
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
... 1 more
But no IP address 192.168.56.1 is configured anywhere. So is there some configuration missing?
Actually, my client machine (10.10.0.95) is a Windows machine. When I tried to submit the Spark job from another Ubuntu machine (10.10.0.155), I was able to run the same Java client code successfully.
As I debugged in the Windows client environment, the following log is displayed when I submit the Spark job:
INFO Remoting: Starting remoting
INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem#192.168.56.1:61552]
INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 61552.
INFO MemoryStore: MemoryStore started with capacity 2.4 GB
INFO SparkEnv: Registering OutputCommitCoordinator
INFO Utils: Successfully started service 'SparkUI' on port 4044.
INFO SparkUI: Started SparkUI at http://192.168.56.1:4044
As per log line number 2, it registers the client with 192.168.56.1.
Meanwhile, on the Ubuntu client:
INFO Remoting: Starting remoting
INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem#10.10.0.155:42786]
INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 42786.
INFO MemoryStore: MemoryStore started with capacity 511.1 MB
INFO SparkEnv: Registering OutputCommitCoordinator
INFO Utils: Successfully started service 'SparkUI' on port 4040.
INFO SparkUI: Started SparkUI at http://10.10.0.155:4040
As per log line number 2, it registers the client with 10.10.0.155, which is the actual IP address.
If anybody finds out what the problem is with the Windows client, please let the community know.
[UPDATE]
I am running this whole environment in VirtualBox. The Windows machine is my host and Ubuntu is the guest, and Spark is installed on the Ubuntu machine. In the VirtualBox environment, VirtualBox installs an Ethernet adapter "VirtualBox Host-Only Network" with IPv4 address 192.168.56.1, and Spark registers this IP as the client IP instead of the actual IP address 10.10.0.95.
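If that diagnosis is right, one possible workaround (a sketch, not verified in this setup) is to tell Spark explicitly which address the driver should advertise, instead of letting it pick the VirtualBox host-only adapter: either export SPARK_LOCAL_IP on the Windows client, or set spark.driver.host in the client code, reusing the existing jars array from above:

    SparkConf sparkConf = new SparkConf()
            .setAppName("sparkCSVWriter")
            .setMaster("spark://10.10.0.102:7077")
            // Advertise the routable address of the Windows client instead of
            // the VirtualBox host-only adapter (192.168.56.1).
            .set("spark.driver.host", "10.10.0.95")
            .setJars(jars);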

Can anyone explain my Apache Spark Error SparkException: Job aborted due to stage failure

I have a simple Apache Spark app where I read files from HDFS and then pipe them to an external process. When I read a big amount of data (in my case the files total about 241 MB) and I don't specify a minimum number of partitions, or set the minimum to 4, I get the following error:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 6, ip-172-31-36-43.us-west-2.compute.internal): ExecutorLostFailure (executor 6 lost)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
When I specify the minimum number of partitions as 10 or above, I don't get this error. Can anyone tell me what's wrong and how to avoid it? I didn't get an error saying that the subprocess exited with an error code, so I think it's a problem with the Spark configuration.
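For reference, the minimum partition count is just the second argument to textFile; a minimal sketch using the Java API (the path and variable names are illustrative, and the same minPartitions parameter exists on SparkContext.textFile in Scala):

    // Reading with an explicit minimum partition count is the workaround described above.
    JavaRDD<String> lines = sc.textFile("hdfs:///path/to/input", 10);

    // Alternatively, repartition an already-created RDD before piping it to the external process.
    JavaRDD<String> repartitioned = lines.repartition(10);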
stderr from worker:
15/05/03 10:41:29 INFO executor.CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
15/05/03 10:41:30 INFO spark.SecurityManager: Changing view acls to: root
15/05/03 10:41:30 INFO spark.SecurityManager: Changing modify acls to: root
15/05/03 10:41:30 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
15/05/03 10:41:30 INFO slf4j.Slf4jLogger: Slf4jLogger started
15/05/03 10:41:30 INFO Remoting: Starting remoting
15/05/03 10:41:31 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://driverPropsFetcher#ip-172-31-36-43.us-west-2.compute.internal:46832]
15/05/03 10:41:31 INFO util.Utils: Successfully started service 'driverPropsFetcher' on port 46832.
15/05/03 10:41:31 INFO remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
15/05/03 10:41:31 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
15/05/03 10:41:31 INFO spark.SecurityManager: Changing view acls to: root
15/05/03 10:41:31 INFO spark.SecurityManager: Changing modify acls to: root
15/05/03 10:41:31 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
15/05/03 10:41:31 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
15/05/03 10:41:31 INFO slf4j.Slf4jLogger: Slf4jLogger started
15/05/03 10:41:31 INFO Remoting: Starting remoting
15/05/03 10:41:31 INFO util.Utils: Successfully started service 'sparkExecutor' on port 37039.
15/05/03 10:41:31 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkExecutor#ip-172-31-36-43.us-west-2.compute.internal:37039]
15/05/03 10:41:31 INFO util.AkkaUtils: Connecting to MapOutputTracker: akka.tcp://sparkDriver#ip-172-31-35-111.us-west-2.compute.internal:48730/user/MapOutputTracker
15/05/03 10:41:31 INFO util.AkkaUtils: Connecting to BlockManagerMaster: akka.tcp://sparkDriver#ip-172-31-35-111.us-west-2.compute.internal:48730/user/BlockManagerMaster
15/05/03 10:41:31 INFO storage.DiskBlockManager: Created local directory at /mnt/spark/spark-cbaf9bff-4d12-4847-9135-9667ba27dccb/spark-ad82597c-4b55-46fc-9063-5d1196d6e0b0/spark-e99f55c6-5bcb-4d1b-b014-aaec94fe6cc5/blockmgr-cda1922d-ea50-4630-a834-bfb637ecdaa0
15/05/03 10:41:31 INFO storage.DiskBlockManager: Created local directory at /mnt2/spark/spark-0c6c912f-3aa1-4c54-9970-7a75d22899e8/spark-71d64ae7-36bc-49e0-958e-e7e2c1432027/spark-56d9e077-4585-4fd7-8a48-5227943d9004/blockmgr-29c5d068-f19d-4f41-85fc-11960c77a8a3
15/05/03 10:41:31 INFO storage.MemoryStore: MemoryStore started with capacity 445.4 MB
15/05/03 10:41:32 INFO util.AkkaUtils: Connecting to OutputCommitCoordinator: akka.tcp://sparkDriver#ip-172-31-35-111.us-west-2.compute.internal:48730/user/OutputCommitCoordinator
15/05/03 10:41:32 INFO executor.CoarseGrainedExecutorBackend: Connecting to driver: akka.tcp://sparkDriver#ip-172-31-35-111.us-west-2.compute.internal:48730/user/CoarseGrainedScheduler
15/05/03 10:41:32 INFO worker.WorkerWatcher: Connecting to worker akka.tcp://sparkWorker#ip-172-31-36-43.us-west-2.compute.internal:54983/user/Worker
15/05/03 10:41:32 INFO worker.WorkerWatcher: Successfully connected to akka.tcp://sparkWorker#ip-172-31-36-43.us-west-2.compute.internal:54983/user/Worker
15/05/03 10:41:32 INFO executor.CoarseGrainedExecutorBackend: Successfully registered with driver
15/05/03 10:41:32 INFO executor.Executor: Starting executor ID 6 on host ip-172-31-36-43.us-west-2.compute.internal
15/05/03 10:41:32 INFO netty.NettyBlockTransferService: Server created on 33000
15/05/03 10:41:32 INFO storage.BlockManagerMaster: Trying to register BlockManager
15/05/03 10:41:32 INFO storage.BlockManagerMaster: Registered BlockManager
15/05/03 10:41:32 INFO util.AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver#ip-172-31-35-111.us-west-2.compute.internal:48730/user/HeartbeatReceiver
15/05/03 10:41:32 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 6
15/05/03 10:41:32 INFO executor.Executor: Running task 1.3 in stage 0.0 (TID 6)
15/05/03 10:41:32 INFO executor.Executor: Fetching http://172.31.35.111:34347/jars/proteinsApacheSpark-0.0.1.jar with timestamp 1430649374764
15/05/03 10:41:32 INFO util.Utils: Fetching http://172.31.35.111:34347/jars/proteinsApacheSpark-0.0.1.jar to /mnt/spark/spark-cbaf9bff-4d12-4847-9135-9667ba27dccb/spark-ad82597c-4b55-46fc-9063-5d1196d6e0b0/spark-08b3b4ce-960f-488f-99ea-bd66b3277207/fetchFileTemp3079113313084659984.tmp
15/05/03 10:41:32 INFO util.Utils: Copying /mnt/spark/spark-cbaf9bff-4d12-4847-9135-9667ba27dccb/spark-ad82597c-4b55-46fc-9063-5d1196d6e0b0/spark-08b3b4ce-960f-488f-99ea-bd66b3277207/9655652641430649374764_cache to /root/spark/work/app-20150503103615-0002/6/./proteinsApacheSpark-0.0.1.jar
15/05/03 10:41:32 INFO executor.Executor: Adding file:/root/spark/work/app-20150503103615-0002/6/./proteinsApacheSpark-0.0.1.jar to class loader
15/05/03 10:41:32 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 1
15/05/03 10:41:32 INFO storage.MemoryStore: ensureFreeSpace(17223) called with curMem=0, maxMem=467081625
15/05/03 10:41:32 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 16.8 KB, free 445.4 MB)
15/05/03 10:41:32 INFO storage.BlockManagerMaster: Updated info of block broadcast_1_piece0
15/05/03 10:41:32 INFO broadcast.TorrentBroadcast: Reading broadcast variable 1 took 274 ms
15/05/03 10:41:32 INFO storage.MemoryStore: ensureFreeSpace(22384) called with curMem=17223, maxMem=467081625
15/05/03 10:41:32 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 21.9 KB, free 445.4 MB)
15/05/03 10:41:33 INFO spark.CacheManager: Partition rdd_0_1 not found, computing it
15/05/03 10:41:33 INFO rdd.WholeTextFileRDD: Input split: Paths:/user/root/pepnovo3/largeinputfile2/largeinputfile2_45.mgf:0+2106005,/user/root/pepnovo3/largeinputfile2/largeinputfile2_46.mgf:0+2105954,/user/root/pepnovo3/largeinputfile2/largeinputfile2_47.mgf:0+2106590,/user/root/pepnovo3/largeinputfile2/largeinputfile2_48.mgf:0+2105696,/user/root/pepnovo3/largeinputfile2/largeinputfile2_49.mgf:0+2105891,/user/root/pepnovo3/largeinputfile2/largeinputfile2_5.mgf:0+2106283,/user/root/pepnovo3/largeinputfile2/largeinputfile2_50.mgf:0+2105559,/user/root/pepnovo3/largeinputfile2/largeinputfile2_51.mgf:0+2106403,/user/root/pepnovo3/largeinputfile2/largeinputfile2_52.mgf:0+2105535,/user/root/pepnovo3/largeinputfile2/largeinputfile2_53.mgf:0+2105615,/user/root/pepnovo3/largeinputfile2/largeinputfile2_54.mgf:0+2105861,/user/root/pepnovo3/largeinputfile2/largeinputfile2_55.mgf:0+2106100,/user/root/pepnovo3/largeinputfile2/largeinputfile2_56.mgf:0+2106265,/user/root/pepnovo3/largeinputfile2/largeinputfile2_57.mgf:0+2105768,/user/root/pepnovo3/largeinputfile2/largeinputfile2_58.mgf:0+2106180,/user/root/pepnovo3/largeinputfile2/largeinputfile2_59.mgf:0+2105751,/user/root/pepnovo3/largeinputfile2/largeinputfile2_6.mgf:0+2106247,/user/root/pepnovo3/largeinputfile2/largeinputfile2_60.mgf:0+2106133,/user/root/pepnovo3/largeinputfile2/largeinputfile2_61.mgf:0+2106224,/user/root/pepnovo3/largeinputfile2/largeinputfile2_62.mgf:0+2106415,/user/root/pepnovo3/largeinputfile2/largeinputfile2_63.mgf:0+2106408,/user/root/pepnovo3/largeinputfile2/largeinputfile2_64.mgf:0+2105702,/user/root/pepnovo3/largeinputfile2/largeinputfile2_65.mgf:0+2106268,/user/root/pepnovo3/largeinputfile2/largeinputfile2_66.mgf:0+2106149,/user/root/pepnovo3/largeinputfile2/largeinputfile2_67.mgf:0+2105846,/user/root/pepnovo3/largeinputfile2/largeinputfile2_68.mgf:0+2105408,/user/root/pepnovo3/largeinputfile2/largeinputfile2_69.mgf:0+2106172,/user/root/pepnovo3/largeinputfile2/largeinputfile2_7.mgf:0+2105517,/user/root/pepnovo3/largeinputfile2/largeinputfile2_70.mgf:0+2105980,/user/root/pepnovo3/largeinputfile2/largeinputfile2_71.mgf:0+2105651,/user/root/pepnovo3/largeinputfile2/largeinputfile2_72.mgf:0+2105936,/user/root/pepnovo3/largeinputfile2/largeinputfile2_73.mgf:0+2105966,/user/root/pepnovo3/largeinputfile2/largeinputfile2_74.mgf:0+2105456,/user/root/pepnovo3/largeinputfile2/largeinputfile2_75.mgf:0+2105786,/user/root/pepnovo3/largeinputfile2/largeinputfile2_76.mgf:0+2106151,/user/root/pepnovo3/largeinputfile2/largeinputfile2_77.mgf:0+2106284,/user/root/pepnovo3/largeinputfile2/largeinputfile2_78.mgf:0+2106163,/user/root/pepnovo3/largeinputfile2/largeinputfile2_79.mgf:0+2106233,/user/root/pepnovo3/largeinputfile2/largeinputfile2_8.mgf:0+2105885,/user/root/pepnovo3/largeinputfile2/largeinputfile2_80.mgf:0+2105979,/user/root/pepnovo3/largeinputfile2/largeinputfile2_81.mgf:0+2105888,/user/root/pepnovo3/largeinputfile2/largeinputfile2_82.mgf:0+2106546,/user/root/pepnovo3/largeinputfile2/largeinputfile2_83.mgf:0+2106322,/user/root/pepnovo3/largeinputfile2/largeinputfile2_84.mgf:0+2106017,/user/root/pepnovo3/largeinputfile2/largeinputfile2_85.mgf:0+2106242,/user/root/pepnovo3/largeinputfile2/largeinputfile2_86.mgf:0+2105543,/user/root/pepnovo3/largeinputfile2/largeinputfile2_87.mgf:0+2106556,/user/root/pepnovo3/largeinputfile2/largeinputfile2_88.mgf:0+2105637,/user/root/pepnovo3/largeinputfile2/largeinputfile2_89.mgf:0+2106130,/user/root/pepnovo3/largeinputfile2/largeinputfile2_9.mgf:0+2105634,/user/root/pepnovo3/largeinputfile2/largeinput
file2_90.mgf:0+2105731,/user/root/pepnovo3/largeinputfile2/largeinputfile2_91.mgf:0+2106401,/user/root/pepnovo3/largeinputfile2/largeinputfile2_92.mgf:0+2105736,/user/root/pepnovo3/largeinputfile2/largeinputfile2_93.mgf:0+2105688,/user/root/pepnovo3/largeinputfile2/largeinputfile2_94.mgf:0+2106436,/user/root/pepnovo3/largeinputfile2/largeinputfile2_95.mgf:0+2105609,/user/root/pepnovo3/largeinputfile2/largeinputfile2_96.mgf:0+2105525,/user/root/pepnovo3/largeinputfile2/largeinputfile2_97.mgf:0+2105603,/user/root/pepnovo3/largeinputfile2/largeinputfile2_98.mgf:0+2106211,/user/root/pepnovo3/largeinputfile2/largeinputfile2_99.mgf:0+2105928
15/05/03 10:41:33 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 0
15/05/03 10:41:33 INFO storage.MemoryStore: ensureFreeSpace(6906) called with curMem=39607, maxMem=467081625
15/05/03 10:41:33 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 6.7 KB, free 445.4 MB)
15/05/03 10:41:33 INFO storage.BlockManagerMaster: Updated info of block broadcast_0_piece0
15/05/03 10:41:33 INFO broadcast.TorrentBroadcast: Reading broadcast variable 0 took 15 ms
15/05/03 10:41:33 INFO storage.MemoryStore: ensureFreeSpace(53787) called with curMem=46513, maxMem=467081625
15/05/03 10:41:33 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 52.5 KB, free 445.3 MB)
15/05/03 10:41:33 WARN snappy.LoadSnappy: Snappy native library is available
15/05/03 10:41:33 INFO util.NativeCodeLoader: Loaded the native-hadoop library
15/05/03 10:41:33 INFO snappy.LoadSnappy: Snappy native library loaded
15/05/03 10:41:36 INFO storage.MemoryStore: ensureFreeSpace(252731448) called with curMem=100300, maxMem=467081625
15/05/03 10:41:36 INFO storage.MemoryStore: Block rdd_0_1 stored as values in memory (estimated size 241.0 MB, free 204.3 MB)
15/05/03 10:41:36 INFO storage.BlockManagerMaster: Updated info of block rdd_0_1
The answer is probably in the executor log, which is different from the worker log. Most likely it runs out of memory and either starts GC thrashing or dies from OOM. You could try running with more memory per executor if this is an option.
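If more memory per executor is an option, it can be requested when the SparkConf is built, or equivalently with the --executor-memory flag of spark-submit. The fragment below is purely illustrative (app name and value are placeholders, and it assumes org.apache.spark.SparkConf is imported):
// Fragment, illustrative only: the worker log above shows only ~445 MB of
// storage memory per executor, so a larger value may avoid the OOM.
SparkConf conf = new SparkConf()
        .setAppName("executorMemoryExample")
        .set("spark.executor.memory", "2g");
Raising the number of partitions, as the question already found, works toward the same end by spreading the same data over more, smaller tasks.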
Check your system's hard disk space, network, and memory. Spark writes files in $SPARK_HOME/work; sometimes the problem is a full hard disk, no free memory, or a network issue.
If there is any exception, you can see it in the Spark UI at your_machine:4040.

JBoss stuck at Remoting

I am new to JBoss deployment. I am using a 32-bit Java, Unix, JBoss 6 environment. While starting my application shell script X.sh, JBoss gets stuck at the Remoting service. I have spent a lot of time on this but didn't find any clue to resolve it. Please find the log below.
14:34:13,100 INFO [JMXKernel] Legacy JMX core initialized
14:34:24,603 INFO [AbstractServerConfig] JBoss Web Services - Native Server 3.4.1.GA
14:34:25,157 INFO [JSFImplManagementDeployer] Initialized 3 JSF configurations: [Mojarra-1.2, MyFaces-2.0, Mojarra-2.0]
14:34:32,683 WARNING [FileConfigurationParser] AIO wasn't located on this platform, it will fall back to using pure Java NIO. If your platform is Linux, install LibAIO to enable the AIO journal
14:34:37,911 INFO [mbean] Sleeping for 600 seconds
14:34:38,214 WARNING [FileConfigurationParser] AIO wasn't located on this platform, it will fall back to using pure Java NIO. If your platform is Linux, install LibAIO to enable the AIO journal
14:34:38,425 INFO [JMXConnector] starting JMXConnector on host 0.0.0.0:1090
14:34:38,560 INFO [MailService] Mail Service bound to java:/Mail
14:34:39,623 INFO [HornetQServerImpl] live server is starting..
14:34:39,705 INFO [JournalStorageManager] Using NIO Journal
14:34:39,730 WARNING [HornetQServerImpl] Security risk! It has been detected that the cluster admin user and password have not been changed from the installation default. Please see the HornetQ user guide, cluster chapter, for instructions on how to do this.
14:34:40,970 INFO [NettyAcceptor] Started Netty Acceptor version 3.2.1.Final-r2319 0.0.0.0:5455 for CORE protocol
14:34:40,971 INFO [NettyAcceptor] Started Netty Acceptor version 3.2.1.Final-r2319 0.0.0.0:5445 for CORE protocol
14:34:40,975 INFO [HornetQServerImpl] HornetQ Server version 2.1.2.Final (Colmeia, 120) started
14:34:41,040 INFO [WebService] Using RMI server codebase: http://esaxh036.hyd.lab.vignette.com:8083/
14:34:41,271 INFO [jbossatx] ARJUNA-32010 JBossTS Recovery Service (tag: JBOSSTS_4_14_0_Final) - JBoss Inc.
14:34:41,281 INFO [arjuna] ARJUNA-12324 Start RecoveryActivators
14:34:41,301 INFO [arjuna] ARJUNA-12296 ExpiredEntryMonitor running at Thu, 30 Oct 2014 14:34:41
14:34:41,323 INFO [arjuna] ARJUNA-12332 Failed to establish connection to server
14:34:41,348 INFO [arjuna] ARJUNA-12304 Removing old transaction status manager item 0:ffff0a601a3e:126a:5451fbf6:8
14:34:41,390 INFO [arjuna] ARJUNA-12310 Recovery manager listening on endpoint 0.0.0.0:4712
14:34:41,390 INFO [arjuna] ARJUNA-12344 RecoveryManagerImple is ready on port 4712
14:34:41,391 INFO [jbossatx] ARJUNA-32013 Starting transaction recovery manager
14:34:41,402 INFO [arjuna] ARJUNA-12163 Starting service com.arjuna.ats.arjuna.recovery.ActionStatusService on port 4713
14:34:41,403 INFO [arjuna] ARJUNA-12337 TransactionStatusManagerItem host: 0.0.0.0 port: 4713
14:34:41,425 INFO [arjuna] ARJUNA-12170 TransactionStatusManager started on port 4713 and host 0.0.0.0 with service com.arjuna.ats.arjuna.recovery.ActionStatusService
14:34:41,480 INFO [jbossatx] ARJUNA-32017 JBossTS Transaction Service (JTA version - tag: JBOSSTS_4_14_0_Final) - JBoss Inc.
14:34:41,549 INFO [arjuna] ARJUNA-12202 registering bean jboss.jta:type=ObjectStore.
14:34:41,764 INFO [AprLifecycleListener] The Apache Tomcat Native library which allows optimal performance in production environments was not found on the java.library.path: /home/IWSTU/JBOSS/jboss-6.0.0.Final/bin/native/lib
14:34:41,922 INFO [ModClusterService] Initializing mod_cluster 1.1.0.Final
14:34:41,935 INFO [TomcatDeployment] deploy, ctxPath=/invoker
14:34:42,364 INFO [RARDeployment] Required license terms exist, view vfs:/home/IWSTU/JBOSS/jboss-6.0.0.Final/server/XDomain/deploy/jboss-local-jdbc.rar/META-INF/ra.xml
14:34:42,382 INFO [RARDeployment] Required license terms exist, view vfs:/home/IWSTU/JBOSS/jboss-6.0.0.Final/server/XDomain/deploy/jboss-xa-jdbc.rar/META-INF/ra.xml
14:34:42,395 INFO [RARDeployment] Required license terms exist, view vfs:/home/IWSTU/JBOSS/jboss-6.0.0.Final/server/XDomain/deploy/jms-ra.rar/META-INF/ra.xml
14:34:42,410 INFO [HornetQResourceAdapter] HornetQ resource adaptor started
14:34:42,421 INFO [RARDeployment] Required license terms exist, view vfs:/home/IWSTU/JBOSS/jboss-6.0.0.Final/server/XDomain/deploy/mail-ra.rar/META-INF/ra.xml
14:34:42,439 INFO [RARDeployment] Required license terms exist, view vfs:/home/IWSTU/JBOSS/jboss-6.0.0.Final/server/XDomain/deploy/quartz-ra.rar/META-INF/ra.xml
14:34:42,544 INFO [SimpleThreadPool] Job execution threads will use class loader of thread: Thread-7
14:34:42,578 INFO [SchedulerSignalerImpl] Initialized Scheduler Signaller of type: class org.quartz.core.SchedulerSignalerImpl
14:34:42,579 INFO [QuartzScheduler] Quartz Scheduler v.1.8.3 created.
14:34:42,582 INFO [RAMJobStore] RAMJobStore initialized.
14:34:42,585 INFO [QuartzScheduler] Scheduler meta-data: Quartz Scheduler (v1.8.3) 'JBossQuartzScheduler' with instanceId 'NON_CLUSTERED'
Scheduler class: 'org.quartz.core.QuartzScheduler' - running locally.
NOT STARTED.
Currently in standby mode.
Number of jobs executed: 0
Using thread pool 'org.quartz.simpl.SimpleThreadPool' - with 10 threads.
Using job-store 'org.quartz.simpl.RAMJobStore' - which does not support persistence. and is not clustered.
14:34:42,585 INFO [StdSchedulerFactory] Quartz scheduler 'JBossQuartzScheduler' initialized from an externally opened InputStream.
14:34:42,586 INFO [StdSchedulerFactory] Quartz scheduler version: 1.8.3
14:34:42,586 INFO [QuartzScheduler] Scheduler JBossQuartzScheduler_$_NON_CLUSTERED started.
14:34:43,229 INFO [ConnectionFactoryBindingService] Bound
ConnectionManager 'jboss.jca:service=DataSourceBinding,name=DefaultDS' to JNDI name 'java:DefaultDS'
14:34:43,422 INFO [TomcatDeployment] deploy, ctxPath=/juddi
14:34:43,488 INFO [RegistryServlet] Loading jUDDI configuration.
14:34:43,494 INFO [RegistryServlet] Resources loaded from: /WEB-INF/juddi.properties
14:34:43,494 INFO [RegistryServlet] Initializing jUDDI components.
14:34:43,688 INFO [ConnectionFactoryBindingService] Bound ConnectionManager 'jboss.jca:service=ConnectionFactoryBinding,name=JmsXA' to JNDI name 'java:JmsXA'
14:34:43,738 INFO [ConnectionFactoryBindingService] Bound ConnectionManager 'jboss.jca:service=DataSourceBinding,name=OracleDS' to JNDI name 'java:OracleDS'
14:34:43,926 INFO [xnio] XNIO Version 2.1.0.CR2
14:34:43,937 INFO [nio] XNIO NIO Implementation Version 2.1.0.CR2
14:34:44,170 INFO [remoting] JBoss Remoting version 3.1.0.Beta2 (stuck here)
14:44:37,912 INFO [TicketMap] Start:
14:44:37,913 INFO [TicketMap] Complete:
14:44:37,930 INFO [mbean] Sleeping for 600 seconds
14:54:37,932 INFO [TicketMap] Start:
14:54:37,932 INFO [TicketMap] Complete:
14:54:37,944 INFO [mbean] Sleeping for 600 seconds
