I have a Spark cluster setup with one master and 3 workers.
I Use vagrant and Docker to start a cluster.
I'm trying to submit a Spark work from my local eclipse which would connect to the master, and allow me to execute it. So, here is the Spark Conf :
SparkConf conf = new SparkConf().setAppName("Simple Application").setMaster("spark://scale1.docker:7077");
When I run my application from eclipse on Master's UI, I can see one running application. All the workers are ALIVE, have 4 / 4 cores used, and have allocated 512 MB to the application.
The eclipse console will just print the same warning:
15/03/04 15:39:27 INFO BlockManagerMaster: Updated info of block broadcast_1_piece0
15/03/04 15:39:27 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:838
15/03/04 15:39:27 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (MappedRDD[2] at mapToPair at CountLines.java:35)
15/03/04 15:39:27 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
15/03/04 15:39:42 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
15/03/04 15:39:57 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
15/03/04 15:40:04 INFO AppClient$ClientActor: Executor updated: app-20150304143926-0001/1 is now EXITED (Command exited with code 1)
15/03/04 15:40:04 INFO SparkDeploySchedulerBackend: Executor app-20150304143926-0001/1 removed: Command exited with code 1
15/03/04 15:40:04 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 1
15/03/04 15:40:04 INFO AppClient$ClientActor: Executor added: app-20150304143926-0001/2 on worker-20150304140319-scale3.docker-55425 (scale3.docker:55425) with 4 cores
15/03/04 15:40:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20150304143926-0001/2 on hostPort scale3.docker:55425 with 4 cores, 512.0 MB RAM
15/03/04 15:40:04 INFO AppClient$ClientActor: Executor updated: app-20150304143926-0001/2 is now RUNNING
15/03/04 15:40:04 INFO AppClient$ClientActor: Executor updated: app-20150304143926-0001/2 is now LOADING
15/03/04 15:40:04 INFO AppClient$ClientActor: Executor updated: app-20150304143926-0001/0 is now EXITED (Command exited with code 1)
15/03/04 15:40:04 INFO SparkDeploySchedulerBackend: Executor app-20150304143926-0001/0 removed: Command exited with code 1
15/03/04 15:40:04 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 0
15/03/04 15:40:04 INFO AppClient$ClientActor: Executor added: app-20150304143926-0001/3 on worker-20150304140317-scale2.docker-60646 (scale2.docker:60646) with 4 cores
15/03/04 15:40:04 INFO SparkDeploySchedulerBackend: Granted executor ID app-20150304143926-0001/3 on hostPort scale2.docker:60646 with 4 cores, 512.0 MB RAM
15/03/04 15:40:04 INFO AppClient$ClientActor: Executor updated: app-20150304143926-0001/3 is now RUNNING
15/03/04 15:40:04 INFO AppClient$ClientActor: Executor updated: app-20150304143926-0001/3 is now LOADING
Reading Spark Documentation of Spark I have find this:
Because the driver schedules tasks on the cluster, it should be run
close to the worker nodes, preferably on the same local area network.
If you’d like to send requests to the cluster remotely, it’s better to
open an RPC to the driver and have it submit operations from nearby
than to run a driver far away from the worker nodes.
I think the problem is due to the driver that runs locally on my machine.
I am using Spark 1.2.0.
Is it possible to run application in eclipse and submit it to remote cluster using local driver? If so, what can I do?
remote dubugging is quite possible and it works fine with below option executed on edge node.
--driver-java-options -agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005
Debugging Spark Applications
you need not to speicify the master or anything. here is the sample command.
spark-submit --master yarn-client --class org.hkt.spark.jstest.javascalawordcount.JavaWordCount --driver-java-options -agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005 javascalawordcount-0.0.1-SNAPSHOT.jar
Related
TLDR;
How to configure Apache Beam pipelines options with "environment_type" = EXTERNAL or PROCESS?
Description
Currently, we have a standalone spark cluster inside Kubernetes, following this solution (and the setup) we launch a beam pipeline creating an embedded spark job server on the spark worker who needs to run a python SDK jointly.
Apache Beam allows running python SDK in 4 different ways:
"DOCKER" - Default and not possible inside a Kubernetes cluster (would use "container inside container")
"LOOPBACK" - Only for testing, not possible with more than 1 worker pod
"EXTERNAL" - Ideal setup, "just" create a sidecar container to run in the same pod as the spark workers
"PROCESS" - Execute a process in the spark worker, not ideal but could be too.
Development
Using "External" - Implementing the spark worker with the python sdk on the same pod:
apiVersion: apps/v1
kind: Deployment
metadata:
name: spark-worker
labels:
app: spark-worker
spec:
selector:
matchLabels:
app: spark-worker
template:
metadata:
labels:
app: spark-worker
spec:
containers:
- name: spark-worker
image: spark-py-custom:latest
imagePullPolicy: Never
ports:
- containerPort: 8081
protocol: TCP
command: ['/bin/bash',"-c","--"]
args: ["/start-worker.sh" ]
resources :
requests :
cpu : 4
memory : "5Gi"
limits :
cpu : 4
memory : "5Gi"
volumeMounts:
- name: spark-jars
mountPath: "/tmp"
- name: python-beam-sdk
image: apachebeam/python3.7_sdk:latest
command: ["/opt/apache/beam/boot", "--worker_pool"]
ports:
- containerPort: 50000
resources:
limits:
cpu: "1"
memory: "1Gi"
volumes:
- name: spark-jars
persistentVolumeClaim:
claimName: spark-jars
And them, if we execute the command
python3 wordcount.py \
--output ./data_test/counts \
--runner=SparkRunner \
--spark_submit_uber_jar \
--spark_job_server_jar=beam-runners-spark-job-server-2.28.0.jar \
--spark_master_url=spark://spark-master:7077 \
--spark_rest_url=http://spark-master:6066 \
--environment_type=EXTERNAL \
--environment_config=localhost:50000
We get a stuck terminal in the state of "RUNNING":
INFO:apache_beam.internal.gcp.auth:Setting socket default timeout to 60 seconds.
INFO:apache_beam.internal.gcp.auth:socket default timeout is 60.0 seconds.
INFO:oauth2client.client:Timeout attempting to reach GCE metadata service.
WARNING:apache_beam.internal.gcp.auth:Unable to find default credentials to use: The Application Default Credentials are not available. They are available if running in Google Compute Engine. Otherwise, the environment variable GOOGLE_APPLICATION_CREDENTIALS must be defined pointing to a file defining the credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information.
Connecting anonymously.
WARNING:root:Make sure that locally built Python SDK docker image has Python 3.7 interpreter.
INFO:root:Default Python SDK image for environment is apache/beam_python3.7_sdk:2.28.0
INFO:apache_beam.runners.portability.fn_api_runner.translations:==================== <function lift_combiners at 0x7fc360c0b8c8> ====================
INFO:apache_beam.runners.portability.fn_api_runner.translations:==================== <function sort_stages at 0x7fc360c0f048> ====================
INFO:apache_beam.runners.portability.abstract_job_service:Artifact server started on port 36369
INFO:apache_beam.runners.portability.abstract_job_service:Running job 'job-2448721e-e686-41d4-b924-5f8c5ae73ac2'
INFO:apache_beam.runners.portability.spark_uber_jar_job_server:Submitted Spark job with ID driver-20210305172421-0000
INFO:apache_beam.runners.portability.portable_runner:Job state changed to STOPPED
INFO:apache_beam.runners.portability.portable_runner:Job state changed to RUNNING
And in the spark worker log:
21/03/05 17:24:25 INFO ExecutorRunner: Launch command: "/usr/local/openjdk-8/bin/java" "-cp" "/opt/spark/conf/:/opt/spark/jars/*" "-Xmx1024M" "-Dspark.driver.port=45203" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler#spark-worker-64fd4ddd6-tqdrs:45203" "--executor-id" "0" "--hostname" "172.18.0.20" "--cores" "3" "--app-id" "app-20210305172425-0000" "--worker-url" "spark://Worker#172.18.0.20:44365"
And on the python sdk:
2021/03/05 17:19:52 Starting worker pool 1: python -m apache_beam.runners.worker.worker_pool_main --service_port=50000 --container_executable=/opt/apache/beam/boot
Starting worker with command ['/opt/apache/beam/boot', '--id=1-1', '--logging_endpoint=', '--artifact_endpoint=', '--provision_endpoint=', '--control_endpoint=']
2021/03/05 17:24:32 No logging endpoint provided.
Checking the spark worker stderr (on localhost 8081):
Spark Executor Command: "/usr/local/openjdk-8/bin/java" "-cp" "/opt/spark/conf/:/opt/spark/jars/*" "-Xmx1024M" "-Dspark.driver.port=45203" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler#spark-worker-64fd4ddd6-tqdrs:45203" "--executor-id" "0" "--hostname" "172.18.0.20" "--cores" "3" "--app-id" "app-20210305172425-0000" "--worker-url" "spark://Worker#172.18.0.20:44365"
========================================
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
21/03/05 17:24:26 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 230#spark-worker-64fd4ddd6-tqdrs
21/03/05 17:24:26 INFO SignalUtils: Registered signal handler for TERM
21/03/05 17:24:26 INFO SignalUtils: Registered signal handler for HUP
21/03/05 17:24:26 INFO SignalUtils: Registered signal handler for INT
21/03/05 17:24:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
21/03/05 17:24:27 INFO SecurityManager: Changing view acls to: root
21/03/05 17:24:27 INFO SecurityManager: Changing modify acls to: root
21/03/05 17:24:27 INFO SecurityManager: Changing view acls groups to:
21/03/05 17:24:27 INFO SecurityManager: Changing modify acls groups to:
21/03/05 17:24:27 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
21/03/05 17:24:27 INFO TransportClientFactory: Successfully created connection to spark-worker-64fd4ddd6-tqdrs/172.18.0.20:45203 after 50 ms (0 ms spent in bootstraps)
21/03/05 17:24:27 INFO SecurityManager: Changing view acls to: root
21/03/05 17:24:27 INFO SecurityManager: Changing modify acls to: root
21/03/05 17:24:27 INFO SecurityManager: Changing view acls groups to:
21/03/05 17:24:27 INFO SecurityManager: Changing modify acls groups to:
21/03/05 17:24:27 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
21/03/05 17:24:28 INFO TransportClientFactory: Successfully created connection to spark-worker-64fd4ddd6-tqdrs/172.18.0.20:45203 after 1 ms (0 ms spent in bootstraps)
21/03/05 17:24:28 INFO DiskBlockManager: Created local directory at /tmp/spark-bdffc2b3-f57a-42fa-a720-e22274b86b67/executor-f1eff7ca-d2cd-4ff4-b18b-c8d6a520f590/blockmgr-c61fb65f-ea97-4bd5-bf15-e0025845a251
21/03/05 17:24:28 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
21/03/05 17:24:28 INFO CoarseGrainedExecutorBackend: Connecting to driver: spark://CoarseGrainedScheduler#spark-worker-64fd4ddd6-tqdrs:45203
21/03/05 17:24:28 INFO WorkerWatcher: Connecting to worker spark://Worker#172.18.0.20:44365
21/03/05 17:24:28 INFO TransportClientFactory: Successfully created connection to /172.18.0.20:44365 after 1 ms (0 ms spent in bootstraps)
21/03/05 17:24:28 INFO WorkerWatcher: Successfully connected to spark://Worker#172.18.0.20:44365
21/03/05 17:24:28 INFO CoarseGrainedExecutorBackend: Successfully registered with driver
21/03/05 17:24:28 INFO Executor: Starting executor ID 0 on host 172.18.0.20
21/03/05 17:24:28 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 42561.
21/03/05 17:24:28 INFO NettyBlockTransferService: Server created on 172.18.0.20:42561
21/03/05 17:24:28 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
21/03/05 17:24:28 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(0, 172.18.0.20, 42561, None)
21/03/05 17:24:28 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(0, 172.18.0.20, 42561, None)
21/03/05 17:24:28 INFO BlockManager: Initialized BlockManager: BlockManagerId(0, 172.18.0.20, 42561, None)
Where it gets stuck forever.
Checking the source code of the python SDK we can see that "no logging endpoint provided" is fatal and it comes from the lack of configuration sent to him (no logging/artifact/provision/control endpoints). If I try to add the "--artifact_endpoint" to the python command I get grcp error of failed communication because the jobserver creates its own artifact endpoint. In this setup would be necessary to configure all these endpoints (probably as localhost as the SDK and the worker are in the same pod) with fixed ports but I can't find how to configure it. Checking SO I can find a related issue but in his case he gets the python SDK configurations automatically (maybe a spark runner issue?)
Using "PROCESS" - Trying to run the python SDK within a process, I built the python SDK with ./gradlew :sdks:python:container:py37:docker, copied the sdks/python/container/build/target/launcher/linux_amd64/boot executable to /python_sdk/boot inside the spark worker container and used the command:
python3 wordcount.py \
--output ./data_test/counts \
--runner=SparkRunner \
--spark_submit_uber_jar \
--spark_master_url=spark://spark-master:7077 \
--spark_rest_url=http://spark-master:6066 \
--environment_type=PROCESS \
--spark_job_server_jar=beam-runners-spark-job-server-2.28.0.jar \
--environment_config='{"os":"linux","arch":"x84_64","command":"/python_sdk/boot"}'
Resulting in "run time exception" in the terminal:
INFO:apache_beam.runners.portability.portable_runner:Job state changed to FAILED
Traceback (most recent call last):
File "wordcount.py", line 91, in <module>
run()
File "wordcount.py", line 86, in run
output | "Write" >> WriteToText(known_args.output)
File "/usr/local/lib/python3.7/dist-packages/apache_beam/pipeline.py", line 581, in __exit__
self.result.wait_until_finish()
File "/usr/local/lib/python3.7/dist-packages/apache_beam/runners/portability/portable_runner.py", line 608, in wait_until_finish
raise self._runtime_exception
RuntimeError: Pipeline job-95c13aa5-96ab-4d1d-bc68-7f9d203c8251 failed in state FAILED: unknown error
and checking again the spark stderr worker log I can see that the problem is java.lang.IllegalArgumentException: No filesystem found for scheme classpath which I don't know the reason.
21/03/05 18:33:12 INFO Executor: Adding file:/opt/spark/work/app-20210305183309-0000/0/./javax.servlet-api-3.1.0.jar to class loader
21/03/05 18:33:12 INFO TorrentBroadcast: Started reading broadcast variable 0
21/03/05 18:33:12 INFO TransportClientFactory: Successfully created connection to spark-worker-89c5c4c87-5q45s/172.18.0.20:34783 after 1 ms (0 ms spent in bootstraps)
21/03/05 18:33:12 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 12.3 KB, free 366.3 MB)
21/03/05 18:33:12 INFO TorrentBroadcast: Reading broadcast variable 0 took 63 ms
21/03/05 18:33:12 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 32.5 KB, free 366.3 MB)
21/03/05 18:33:13 INFO MemoryStore: Block rdd_13_0 stored as values in memory (estimated size 16.0 B, free 366.3 MB)
21/03/05 18:33:13 INFO MemoryStore: Block rdd_17_0 stored as values in memory (estimated size 16.0 B, free 366.3 MB)
21/03/05 18:33:13 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 5427 bytes result sent to driver
21/03/05 18:33:14 ERROR SerializingExecutor: Exception while executing runnable org.apache.beam.vendor.grpc.v1p26p0.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed#5f917914
java.lang.IllegalArgumentException: No filesystem found for scheme classpath
at org.apache.beam.sdk.io.FileSystems.getFileSystemInternal(FileSystems.java:467)
at org.apache.beam.sdk.io.FileSystems.matchNewResource(FileSystems.java:537)
at org.apache.beam.runners.fnexecution.artifact.ArtifactRetrievalService.getArtifact(ArtifactRetrievalService.java:125)
at org.apache.beam.runners.fnexecution.artifact.ArtifactRetrievalService.getArtifact(ArtifactRetrievalService.java:99)
at org.apache.beam.model.jobmanagement.v1.ArtifactRetrievalServiceGrpc$MethodHandlers.invoke(ArtifactRetrievalServiceGrpc.java:327)
at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:172)
at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)
at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)
at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:40)
at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.Contexts$ContextualizedServerCallListener.onHalfClose(Contexts.java:86)
at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331)
at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:817)
at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
21/03/05 18:33:16 ERROR SerializingExecutor: Exception while executing runnable org.apache.beam.vendor.grpc.v1p26p0.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed#67fb2b2c
java.lang.IllegalArgumentException: No filesystem found for scheme classpath
at org.apache.beam.sdk.io.FileSystems.getFileSystemInternal(FileSystems.java:467)
at org.apache.beam.sdk.io.FileSystems.matchNewResource(FileSystems.java:537)
at org.apache.beam.runners.fnexecution.artifact.ArtifactRetrievalService.getArtifact(ArtifactRetrievalService.java:125)
at org.apache.beam.runners.fnexecution.artifact.ArtifactRetrievalService.getArtifact(ArtifactRetrievalService.java:99)
at org.apache.beam.model.jobmanagement.v1.ArtifactRetrievalServiceGrpc$MethodHandlers.invoke(ArtifactRetrievalServiceGrpc.java:327)
at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:172)
at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)
at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)
at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:40)
at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.Contexts$ContextualizedServerCallListener.onHalfClose(Contexts.java:86)
at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331)
at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:817)
at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at org.apache.beam.vendor.grpc.v1p26p0.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
21/03/05 18:33:19 INFO ProcessEnvironmentFactory: Still waiting for startup of environment '/python_sdk/boot' for worker id 1-1
Probably is missing some configuration parameters.
Obs
If I execute the command
python3 wordcount.py \
--output ./data_test/counts \
--runner=SparkRunner \
--spark_submit_uber_jar \
--spark_job_server_jar=beam-runners-spark-job-server-2.28.0.jar \
--spark_master_url=spark://spark-master:7077 \
--spark_rest_url=http://spark-master:6066 \
--environment_type=LOOPBACK
inside our spark worker (having only one worker in the spark cluster) we have a full working beam pipeline with these logs.
Using "External" - this definitely seems like a bug in Beam. The worker endpoints are supposed to be set up to use localhost; I don't think it is possible to configure them. I'm not sure why they would be missing; one educated guess is that the servers silently fail to start, leaving the endpoints empty. I filed a bug report (BEAM-11957) for this issue.
Using "Process" - The scheme classpath corresponds to ClassLoaderFileSystem. This file system is usually loaded using AutoService, which depends on ClassLoaderFileSystemRegistrar being present on the classpath (no relation to the name of the file system itself). The classpath of the job jar is based on spark_job_server_jar. Where are you getting your beam-runners-spark-job-server-2.28.0.jar from?
i am working on apache spark and i am facing a very weird issue. One of the executors fail because of OOM and its shut down hooks clear all the storage (memory and disk) but apparently driver keeps submitting the failed tasks on the same executor due to PROCESS_LOCAL tasks.
Now that the storage on that machine is cleared, all the retried tasks also fail causing the whole stage to fail (after 4 retries)
What i don't understand is, how couldn't driver know that the executor is in shut down and will not be able to execute any tasks.
Configuration:
heartbeat interval: 60s
Network timeout: 600s
Logs to confirm that executor is accepting tasks post shutdown
20/09/29 20:26:32 ERROR [Executor task launch worker for task 138513] Executor: Exception in task 6391.0 in stage 17.0 (TID 138513)
java.lang.OutOfMemoryError: Java heap space
20/09/29 20:26:32 ERROR [Executor task launch worker for task 138513] SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker for task 138513,5,main]
java.lang.OutOfMemoryError: Java heap space
20/09/29 20:26:32 INFO [pool-8-thread-1] DiskBlockManager: Shutdown hook called
20/09/29 20:26:35 ERROR [Executor task launch worker for task 138295] Executor: Exception in task 6239.0 in stage 17.0 (TID 138295)
java.io.FileNotFoundException: /storage/1/spark/spark-ba168da6-dc11-4e15-bd95-1e58198c81e7/executor-8dea198c-741a-4733-8fbb-df57241acdd5/blockmgr-1fc6b30a-c24e-4bb2-a133-5e411cef810f/35/temp_shuffle_b5df90ac-78de-48e3-9c2d-891f8b2ce1fa (No such file or directory)
20/09/29 20:26:36 ERROR [Executor task launch worker for task 139484] Executor: Exception in task 6587.0 in stage 17.0 (TID 139484)
org.apache.spark.SparkException: Block rdd_3861_6587 was not found even though it's read-locked
20/09/29 20:26:42 WARN [Thread-2] ShutdownHookManager: ShutdownHook '$anon$2' timeout, java.util.concurrent.TimeoutException
java.util.concurrent.TimeoutException: null
at java.util.concurrent.FutureTask.get(FutureTask.java:205) ~[?:1.8.0_172]
20/09/29 20:26:44 ERROR [Executor task launch worker for task 140256] Executor: Exception in task 6576.3 in stage 17.0 (TID 140256)
java.io.FileNotFoundException: /storage/1/spark/spark-ba168da6-dc11-4e15-bd95-1e58198c81e7/executor-8dea198c-741a-4733-8fbb-df57241acdd5/blockmgr-1fc6b30a-c24e-4bb2-a133-5e411cef810f/30/rdd_3861_6576 (No such file or directory)
20/09/29 20:26:44 INFO [dispatcher-event-loop-0] Executor: Executor is trying to kill task 6866.1 in stage 17.0 (TID 140329), reason: stage cancelled
20/09/29 20:26:47 INFO [pool-8-thread-1] ShutdownHookManager: Shutdown hook called
20/09/29 20:26:47 DEBUG [pool-8-thread-1] Client: stopping client from cache: org.apache.hadoop.ipc.Client#3117bde5
20/09/29 20:26:47 DEBUG [pool-8-thread-1] Client: stopping client from cache: org.apache.hadoop.ipc.Client#3117bde5
20/09/29 20:26:47 DEBUG [pool-8-thread-1] Client: removing client from cache: org.apache.hadoop.ipc.Client#3117bde5
20/09/29 20:26:47 DEBUG [pool-8-thread-1] Client: stopping actual client because no more references remain: org.apache.hadoop.ipc.Client#3117bde5
20/09/29 20:26:47 DEBUG [pool-8-thread-1] Client: Stopping client
20/09/29 20:26:47 DEBUG [Thread-2] ShutdownHookManager: ShutdownHookManger complete shutdown
20/09/29 20:26:55 INFO [dispatcher-event-loop-14] CoarseGrainedExecutorBackend: Got assigned task 141510
20/09/29 20:26:55 INFO [Executor task launch worker for task 141510] Executor: Running task 545.1 in stage 26.0 (TID 141510)
(I have trimmed down the stack traces as those are just spark RDD shuffle read methods)
If we check the timestamps, then shutdown started at 20/09/29 20:26:32 and ended at 20/09/29 20:26:47, in between this duration, driver sent all the retried tasks to the same executor and they all failed causing stage cancellation.
can someone help me understand this behavior ? Also let me know if anything else is required
there is a configuration in spark, spark.task.maxFailures default is 4. so spark retry the task when it's failed. and the Taskrunner will update the task status to driver. and driver will forward this status to taskschedulerimpl. in your case, only executor OOM and DiskBlockManager shutdown, but the driver is alive. i think the executor is alive as well. and will retry the task with same taskSetManager. the failed times of this task arrive 4 times, this stage will be cancelled, and executor will be killed.
I have setup a cluster on 12 machine and the spark workers on slave machines could disassociate with the master everyday. That means they could looks work well for a period of time during a day, but then the slaves would all disassociate and then be shutdowned.
The log of the worker would look like below:
16/03/07 12:45:34.828 INFO Worker: Retrying connection to master (attempt # 1)
16/03/07 12:45:34.830 INFO Worker: Connecting to master host1:7077...
16/03/07 12:45:34.826 INFO Worker: Retrying connection to master (attempt # 2)
16/03/07 12:45:45.830 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[sparkWorker-akka.actor.default-dispatcher-2,5,main]
java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask#1c5651e9 rejected from java.util.concurrent.ThreadPoolExecutor#671ba687[Running, pool size = 1, active threads = 1, queued tasks = 0, completed tasks = 2]
at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2048)
...
16/03/07 12:45:45.853 Info ExecutorRunner: Killing process!
16/03/07 12:45:45.855 INFO ShutdownHookManager: Shutdown hook called
The log of the master would look like below:
16/03/07 12:45:45.878 INFO Master: 10.126.217.11:51502 got disassociated, removing it.
16/03/07 12:45:45.878 INFO Master: Removing worker worker-20160303035822-10.126.217.11-51502 on 10.126.217.11:51502
Information for the machines:
40 cores and 256GB memory per machine
spark version: 1.5.1
java version: 1.8.0_45
The spark application runs on this cluster with configuration below:
spark.cores.max=360
spark.executor.memory=32g
Is it a memory issue on slave or on master machine?
Or is it a network issue between slaves and master machine?
Or any other problems?
Please advise.
thanks
I'm running a single node application with Spark on a machine with 32 GB RAM.
More than 12GB of the memory is available at the time I'm running the applicaton.
But From the spark UI and logs, I see that it using 3.8GB of RAM (which is gradually decreased as the jobs run).
At this time this is logged, 5GB more memory is avilable. Where as Spark is using 3.8GB
UPDATE
I set these parameters in conf/spark-env.sh but still each time I run the application It is using exactly 3.8 GB
export SPARK_WORKER_MEMORY=6g
export SPARK_MEM=6g
export SPARK_DAEMON_MEMORY=6g
Log
2015-11-19 13:05:41,701 INFO org.apache.spark.SparkEnv.logInfo:59 - Registering MapOutputTracker
2015-11-19 13:05:41,716 INFO org.apache.spark.SparkEnv.logInfo:59 - Registering BlockManagerMaster
2015-11-19 13:05:41,735 INFO org.apache.spark.storage.DiskBlockManager.logInfo:59 - Created local directory at /usr/local/TC_SPARCDC_COM/temp/blockmgr-8513cd3b-ac03-4c0a-b291-65aba4cbc395
2015-11-19 13:05:41,746 INFO org.apache.spark.storage.MemoryStore.logInfo:59 - MemoryStore started with capacity 3.8 GB
2015-11-19 13:05:41,777 INFO org.apache.spark.HttpFileServer.logInfo:59 - HTTP File server directory is /usr/local/TC_SPARCDC_COM/temp/spark-b86380c2-4cbd-43d6-a3b7-aa03d9a05a84/httpd-ceaffbd0-eac4-447e-9d3f-c452627a28cb
2015-11-19 13:05:41,781 INFO org.apache.spark.HttpServer.logInfo:59 - Starting HTTP Server
2015-11-19 13:05:41,842 INFO org.spark-project.jetty.server.Server.doStart:272 - jetty-8.y.z-SNAPSHOT
2015-11-19 13:05:41,854 INFO org.spark-project.jetty.server.AbstractConnector.doStart:338 - Started SocketConnector#0.0.0.0:5279
2015-11-19 13:05:41,855 INFO org.apache.spark.util.Utils.logInfo:59 - Successfully started service 'HTTP file server' on port 5279.
2015-11-19 13:05:41,867 INFO org.apache.spark.SparkEnv.logInfo:59 - Registering OutputCommitCoordinator
2015-11-19 13:05:42,013 INFO org.spark-project.jetty.server.Server.doStart:272 - jetty-8.y.z-SNAPSHOT
2015-11-19 13:05:42,039 INFO org.spark-project.jetty.server.AbstractConnector.doStart:338 - Started SelectChannelConnector#0.0.0.0:4040
2015-11-19 13:05:42,039 INFO org.apache.spark.util.Utils.logInfo:59 - Successfully started service 'SparkUI' on port 4040.
2015-11-19 13:05:42,041 INFO org.apache.spark.ui.SparkUI.logInfo:59 - Started SparkUI at http://103.252.184.181:4040
2015-11-19 13:05:42,114 WARN org.apache.spark.metrics.MetricsSystem.logWarning:71 - Using default name DAGScheduler for source because spark.app.id is not set.
2015-11-19 13:05:42,117 INFO org.apache.spark.executor.Executor.logInfo:59 - Starting executor ID driver on host localhost
2015-11-19 13:05:42,307 INFO org.apache.spark.util.Utils.logInfo:59 - Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 31334.
2015-11-19 13:05:42,308 INFO org.apache.spark.network.netty.NettyBlockTransferService.logInfo:59 - Server created on 31334
2015-11-19 13:05:42,309 INFO org.apache.spark.storage.BlockManagerMaster.logInfo:59 - Trying to register BlockManager
2015-11-19 13:05:42,312 INFO org.apache.spark.storage.BlockManagerMasterEndpoint.logInfo:59 - Registering block manager localhost:31334 with 3.8 GB RAM, BlockManagerId(driver, localhost, 31334)
2015-11-19 13:05:42,313 INFO org.apache.spark.storage.BlockManagerMaster.logInfo:59 - Registered BlockManager
If you are using SparkSubmit you can use the --executor-memory and --driver-memory flags. Otherwise, change these configurations spark.executor.memory and spark.driver.memory either directly in your program or in spark-defaults.
Note that you should not set memory too high. As a rule of thumb, aim for ~75% of available memory. That will leave enough memory for other processes (like your OS) running on your machines.
It is correctly stated by #Glennie Helles Sindholt but setting driver flags while submitting jobs on a standalone machine won't affect the usage as the JVM has been already been initialized. Checkout this link of discussion:
How to set Apache Spark Executor memory
If you are using Spark submit command to submit a job following is an example for how to set parameters while submitting the job:
spark-submit --master spark://127.0.0.1:7077 \
--num-executors 2 \
--executor-cores 8 \
--executor-memory 3g \
--class <Class name> \
$JAR_FILE_NAME or path \
/path-to-input \
/path-to-output \
By varying the number of parameters in this you can see and understand how the usage of RAM is changing. Also, there is a utility named htop on Linux. It is useful to instantaneous usage of memory, CPU cores and Swap space to have an understanding of what is happening. To install htop, use the following:
sudo apt-get install htop
It will look something like this:
htop utility
For more information you can check out the following links:
https://spark.apache.org/docs/latest/configuration.html
I have set some MapReduce configuration in my main method as so
configuration.set("mapreduce.jobtracker.address", "localhost:54311");
configuration.set("mapreduce.framework.name", "yarn");
configuration.set("yarn.resourcemanager.address", "localhost:8032");
Now when I launch the mapreduce task, the process is tracked (I can see it in my cluster dashboard (the one listening on port 8088)), but the process never finishes. It remains blocked at the following line:
15/06/30 15:56:17 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/06/30 15:56:17 INFO client.RMProxy: Connecting to ResourceManager at localhost/127.0.0.1:8032
15/06/30 15:56:18 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
15/06/30 15:56:18 INFO input.FileInputFormat: Total input paths to process : 1
15/06/30 15:56:18 INFO mapreduce.JobSubmitter: number of splits:1
15/06/30 15:56:18 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1435241671439_0008
15/06/30 15:56:19 INFO impl.YarnClientImpl: Submitted application application_1435241671439_0008
15/06/30 15:56:19 INFO mapreduce.Job: The url to track the job: http://10.0.0.10:8088/proxy/application_1435241671439_0008/
15/06/30 15:56:19 INFO mapreduce.Job: Running job: job_1435241671439_0008
Someone has an idea?
Edit : in my yarn nodemanager log, I have this message
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Event EventType: KILL_CONTAINER sent to absent container container_1435241671439_0003_03_000001
2015-06-30 15:44:38,396 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Event EventType: KILL_CONTAINER sent to absent container container_1435241671439_0002_04_000001
Edit 2 :
I also have in the yarn manager log, some exception that happened sooner (for a precedent mapreduce call) :
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.net.BindException: Problem binding to [0.0.0.0:8040] java.net.BindException: Address already in use; For more details see:
Solution : I killed all the daemon processes and restarted again hadoop ! In fact, when I ran jps, I was still getting hadoop daemons though I had stopped them. This was a mismatch of HADOOP_PID_DIR
The default port of nodemanage of yarn is 8040. The error says that the port is already in use. Stop all the hadoop process, if you dont have data, may be format namenode once and try running the job again. From both of your edits, the issue is surely with node manager
Solution : I killed all the daemon processes and restarted again hadoop ! In fact, when I ran jps, I was still getting hadoop daemons though I had stopped them. This was related to a mismatch of HADOOP_PID_DIR