Remote RPC client disassociated when operating on persisted datasets - java

While performing a join or any other operation that combines persisted datasets with non-persisted datasets, Spark reports "Remote RPC client disassociated". The following piece of code causes the issue:
Dataset<Row> dsTableA = sparkSession.read().format("jdbc").options(dbConfig)
    .option("dbTable", "SELECT * FROM tableA").load().persist(StorageLevel.MEMORY_AND_DISK_SER());
Dataset<Row> dsTableB = sparkSession.read().format("jdbc").options(dbConfig)
    .option("dbTable", "SELECT * FROM tableB").load().persist(StorageLevel.MEMORY_AND_DISK_SER());
Dataset<Row> anotherTableA = sparkSession.read().format("jdbc").options(dbConfig)
    .option("dbTable", "SELECT * FROM tableC").load();
anotherTableA.write().format("json").save("/path/toJsonA"); // Working fine - no use of persisted datasets
Dataset<Row> anotherTableB = sparkSession.read().format("jdbc").options(dbConfig)
    .option("dbTable", "SELECT * FROM tableD").load();
Dataset<Row> joinedTable = sparkSession.sql("select atb.* from anotherTableB atb INNER JOIN dsTableA dsta ON LEFT JOIN dsTableB dstb ON atb.ssid=dstb.ssid");
// ERROR : Remote RPC client disassociated
// Working fine if Datasets dsTableA and dsTableB were not persisted
Part of the logs:
INFO TaskSetManager: Starting task 0.0 in stage 17.0 (TID 111, X.X.X.X, partition 0, PROCESS_LOCAL, 5342 bytes)
INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Launching task 111 on executor id: 0 hostname: X.X.X.X.
INFO BlockManagerInfo: Added broadcast_13_piece0 in memory on X.X.X.X:37153 (size: 12.9 KB, free: X.2 GB)
INFO BlockManagerInfo: Added broadcast_12_piece0 in memory on X.X.X.X:37153 (size: 52.0 KB, free: X.2 GB)
ERROR TaskSchedulerImpl: Lost executor 0 on X.X.X.X: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-12121212121211-0000/0 is now EXITED (Command exited with code 134)
WARN TaskSetManager: Lost task 0.0 in stage 17.0 (TID 111, X.X.X.X): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
INFO StandaloneSchedulerBackend: Executor app-12121212121211-0000/0 removed: Command exited with code 134
INFO DAGScheduler: Executor lost: 0 (epoch 8)
If the datasets dsTableA and dsTableB are not persisted, everything works smoothly, but I have to use persisted datasets. How can I solve this problem?


After enabling checkpoint - org.apache.flink.runtime.resourcemanager.exceptions.UnfulfillableSlotRequestException

I have enabled checkpointing in Flink 1.12.1 programmatically as below:
int duration = 10;
if (!environment.getCheckpointConfig().isCheckpointingEnabled()) {
    // Checkpoint every 60 s, with at least 30 s between checkpoints
    environment.enableCheckpointing(duration * 6 * 1000, CheckpointingMode.EXACTLY_ONCE);
    environment.getCheckpointConfig().setMinPauseBetweenCheckpoints(duration * 3 * 1000);
}
Flink Version: 1.12.1
state.backend: rocksdb
state.checkpoints.dir: file:///flink/
blob.server.port: 6124
jobmanager.rpc.port: 6123
parallelism.default: 2
queryable-state.proxy.ports: 6125
taskmanager.numberOfTaskSlots: 2
taskmanager.rpc.port: 6122
jobmanager.memory.process.size: 1600m
taskmanager.memory.process.size: 1728m
rest.port: 8081 6121
classloader.resolve-order: parent-first
execution.checkpointing.unaligned: false
execution.checkpointing.max-concurrent-checkpoints: 2
execution.checkpointing.interval: 60000
But the job fails with the following error:
Caused by: org.apache.flink.runtime.resourcemanager.exceptions.UnfulfillableSlotRequestException: Could not fulfill slot request 44ec308e34aa86629d2034a017b8ef91. Requested resource profile (ResourceProfile{UNKNOWN}) is unfulfillable.
If I remove or disable checkpointing, everything works normally. I need checkpointing because, if my pod restarts, the data handled by the earlier run is reset.
Can somebody point me to how this can be addressed?

SCDF: Error handling when pod failed to start

I'm working on a service that calls Spring Cloud Data Flow (SCDF) to spin up a new k8s pod for a Spring Batch job.
Map<String, String> properties = Map.of("testApp.cpu", cpu, "testApp.memory", memory);
log.info("Create task '{}' with definition '{}'", taskName, taskDefinition);
taskOperations.create(taskName, taskDefinition);
log.info("Launching task '{}' with properties {} and arguments '{}'", taskName, properties, args);
return taskOperations.launch(taskName, properties, args);
Everything works fine. The problem is that whenever we pull a non-existent image (e.g. due to some connection issue), the pod fails to start and we end up with pending tasks (with no batch jobs created whatsoever).
For example, we have tasks in the task_execution table (an SCDF table) with an empty end time,
but no related jobs in the batch_job_execution table.
This seems fine at first: since no pod is created, we don't consume any resources. But once the number of "pending jobs" reaches 20, we get the famous error:
Cannot launch task testApp. The maximum concurrent task executions is at its limit [20]
I'm trying to find a way to detect that the pod spin-up has failed (so that we can mark the task as failed), but to no avail.
Is there a way to detect that a task launch has failed when that task launches a new k8s pod?
Not sure if it is relevant, but we are using SCDF 1.7.3.RELEASE.
Output of kubectl describe for the failed pod:
Name: podname-lp2nyowgmm
Namespace: my-namespace
Priority: 1000
Priority Class Name: test-cluster-default
Node: some-ip.compute.internal/XX.XXX.XXX.XX
Start Time: Thu, 14 Jan 2021 18:47:52 +0700
Labels: role=spring-app
Annotations: arn:aws:iam::XXXXXXXXXXXX:role/svc-XXXX-XXX-XX-XXXX-X-XXX-XXX-XXXXXXXXXXXXXXXXXXXX eks.privileged
Status: Pending
Container ID:
Image: image_host:XXX/mysystem/myapp:notExist
Image ID:
Port: <none>
Host Port: <none>
State: Waiting
Reason: ErrImagePull
Ready: False
Restart Count: 0
cpu: 2
memory: 8Gi
cpu: 2
memory: 8Gi
ELASTIC_SEARCH_URL: elasticsearch
/var/run/secrets/ from default-token-xxxxx(ro)
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Type: Secret (a volume populated by a Secret)
SecretName: default-token-xxxxx
Optional: false
QoS Class: Guaranteed
Node-Selectors: <none>
Tolerations: op=Exists for 300s op=Exists for 300s
Type     Reason     Age                   From               Message
----     ------     ----                  ----               -------
Normal   Scheduled  3m22s                 default-scheduler  Successfully assigned my-namespace/podname-lp2nyowgmm to some-ip.compute.internal
Normal   Pulling    103s (x4 over 3m21s)  kubelet            Pulling image "image_host:XXX/mysystem/myapp:notExist"
Warning  Failed     102s (x4 over 3m19s)  kubelet            Failed to pull image "image_host:XXX/mysystem/myapp:notExist": rpc error: code = Unknown desc = Error response from daemon: manifest for image_host:XXX/mysystem/myapp:notExist not found: manifest unknown: manifest unknown
Warning  Failed     102s (x4 over 3m19s)  kubelet            Error: ErrImagePull
Normal   BackOff    88s (x6 over 3m19s)   kubelet            Back-off pulling image "image_host:XXX/mysystem/myapp:notExist"
Warning  Failed     73s (x7 over 3m19s)   kubelet            Error: ImagePullBackOff
1.7.3 is a very old release. We just released 2.7. The original logic used the task execution tables instead of the pod status. If the version you are using is subject to that, then it would explain what you are seeing. I strongly recommend an upgrade.
Thanks for the question. Looking at the source code, we don't include Pending pods when calculating the current number of executing tasks, so something else may be going on. 1) Could you run kubectl describe pod on a pod in this state and post the result (status details)? 2) Is the deployer configured to create a job for each task (false by default)?
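On the question of detecting the failed launch: one option outside SCDF is to look at the pod's own status, since the describe output above shows the container stuck in a Waiting state with reason ErrImagePull. The following is only a sketch of that check, assuming the pod is available as the JSON document that the Kubernetes API returns (for example from kubectl get pod -o json), where the reason sits under status.containerStatuses[].state.waiting.reason. The function name image_pull_failure and the sample failed_pod are hypothetical, and how the result would be used to close the row in task_execution is left open.

```python
from typing import Optional

# Waiting reasons that mean the image could not be pulled
# (both appear in the events of the failed pod above).
IMAGE_PULL_FAILURE_REASONS = {"ErrImagePull", "ImagePullBackOff"}


def image_pull_failure(pod: dict) -> Optional[str]:
    """Return the waiting reason if a container is stuck on an image pull, else None."""
    statuses = pod.get("status", {}).get("containerStatuses") or []
    for container in statuses:
        waiting = (container.get("state") or {}).get("waiting") or {}
        reason = waiting.get("reason")
        if reason in IMAGE_PULL_FAILURE_REASONS:
            return reason
    return None


# Hypothetical pod document mirroring the describe output in the question.
failed_pod = {
    "status": {
        "phase": "Pending",
        "containerStatuses": [
            {"name": "myapp", "state": {"waiting": {"reason": "ErrImagePull"}}}
        ],
    }
}

print(image_pull_failure(failed_pod))
```

A pod that stays Pending with one of these reasons for longer than some grace period could then be treated as a failed launch, so the task does not keep counting towards the concurrent execution limit.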

PySpark 3.0.1 Failing to run Distributed training in Tensorflow 2.1.0

I'm trying to train a simple fashion_mnist model in TensorFlow, as per the original TensorBoard API docs on hyperparameter tuning.
Currently, for testing purposes, I'm running in local mode, so master = 'local[*]'.
I have installed pyspark==3.0.1 and tensorflow==2.1.0. The following is what I'm trying to run:
# For a given set of hyperparameters, this runs the training and returns the model + accuracy which I'm looking for.
# This works when I run it without Spark.
def train(hparam) -> Tuple[Model, Any]:
    fashion_mnist = tf.keras.datasets.fashion_mnist
    (x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()
    x_train, x_test = x_train / 255.0, x_test / 255.0
    model = Sequential([
        Dense(hparam['num_units'], activation=tf.nn.relu),
        Dense(10, activation=tf.nn.softmax),
    ])
    # model.compile(...) goes here; the call is not shown in the post
    model.fit(x_train, y_train, epochs=1)  # Run with 1 epoch to speed things up for demo purposes
    _, accuracy = model.evaluate(x_test, y_test)
    return model, accuracy
Here's the Spark code I run:
if __name__ == '__main__':
    hp_nums = hp.HParam('num_units', hp.Discrete([16, 32]))
    hp_dropouts = hp.HParam('dropout', hp.RealInterval(0.1, 0.2))
    hp_opts = hp.HParam('optimizer', hp.Discrete(['adam', 'sgd']))
    all_hparams = []  # contains a list of different hparams
    for num_units in hp_nums.domain.values:
        for dropout_rate in (hp_dropouts.domain.min_value, hp_dropouts.domain.max_value):
            for optimizer in hp_opts.domain.values:
                hparams = {
                    'num_units': num_units,
                    'dropout': dropout_rate,
                    'optimizer': optimizer,
                }
                all_hparams.append(hparams)
    spark_sess = SparkSession.builder.master('local[*]').getOrCreate()
    # One partition per hyperparameter combination, so each combination trains in its own task
    res = spark_sess.sparkContext.parallelize(
        all_hparams, len(all_hparams)
    ).map(
        train  # above function
    ).collect()
    temp = 0.0
    best_model = None
    for model, acc in res:
        if acc > temp:
            best_model = model
            temp = acc  # remember the best accuracy seen so far
    print("best accuracy is -> " + str(temp))
This looks alright to me, and it works for any simple map-reduce job (like the basic examples), which makes me believe my environment is set up correctly.
My Environment:
java : Java 11.0.8 2020-07-14 LTS
python: Python 3.6.5
pyspark: 3.0.1
tensorflow: 2.1.0
Keras: 2.3.1
windows: 10 (if this really matters)
cores : 8 (i5 10th gen)
Memory: 6G
But when I run the above piece of code, I get the following error. I can see the training run, and it just stops after the first run completes:
59168/60000 [============================>.] - ETA: 0s - loss: 0.7350 - accuracy: 0.7471
60000/60000 [==============================] - 3s 42us/step - loss: 0.7331 - accuracy: 0.7477
20/12/05 14:03:57 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) Connection reset
at java.base/
at java.base/
at java.base/
at java.base/
20/12/05 14:03:57 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
20/12/05 14:03:57 INFO TaskSchedulerImpl: Cancelling stage 0
20/12/05 14:03:57 INFO TaskSchedulerImpl: Killing all running tasks in stage 0: Stage cancelled
20/12/05 14:03:57 INFO Executor: Executor is trying to kill task 1.0 in stage 0.0 (TID 1), reason: Stage cancelled
20/12/05 14:03:57 INFO TaskSchedulerImpl: Stage 0 was cancelled
20/12/05 14:03:57 INFO DAGScheduler: ResultStage 0 (collect at C:/Users/<>/<>/<>/<>) failed in 7.506 s due to Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, host.docker.internal, executor driver): Connection reset
at java.base/
at java.base/
at java.base/
at java.base/
at java.base/
at org.apache.spark.api.python.PythonRunner$$anon$
at org.apache.spark.api.python.PythonRunner$$anon$
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:456)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, host.docker.internal, executor driver): Connection reset
at java.base/
at java.base/
at java.base/
at java.base/
at java.base/
at org.apache.spark.api.python.PythonRunner$$anon$
at org.apache.spark.api.python.PythonRunner$$anon$
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:456)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator.foreach(Iterator.scala:941)
Driver stacktrace:
20/12/05 14:03:57 INFO DAGScheduler: Job 0 failed: collect at C:/<>/<>/<>/, took 7.541442 s
Traceback (most recent call last):
File "C:/<>/<>/<>/", line 68, in main
return res.collect()
File "C:\Users\<>\<>\<>\venv\lib\site-packages\pyspark\", line 889, in collect
sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
File "C:\Users\<>\<>\<>\venv\lib\site-packages\py4j\", line 1305, in __call__
answer, self.gateway_client, self.target_id,
File "C:\Users\<>\<>\<>\venv\lib\site-packages\pyspark\sql\", line 128, in deco
return f(*a, **kw)
File "C:\Users\<>\<>\<>\venv\lib\site-packages\py4j\", line 328, in get_return_value
format(target_id, ".", name), value)
The error is on line [...]. It only happens when I do [...]. If I comment it out and have something else there, it works perfectly fine. I'm unsure why it fails on [...].

How to identify the optimum number of shuffle partitions in Spark

I am running a Spark Structured Streaming job (restarted every day) on EMR. After a few hours of execution, the application hits an OOM error and gets killed. The following are my configurations and Spark SQL code.
I am new to Spark and need your valuable input.
The EMR cluster has 10 instances, each with 16 cores and 64 GB of memory.
Spark-Submit arguments:
num_of_executors: 17
executor_cores: 5
executor_memory: 19G
driver_memory: 30G
The job reads its input as micro-batches from Kafka at an interval of 30 seconds. The average number of rows read per batch is 90k.
spark.streaming.kafka.maxRatePerPartition: 4500
spark.streaming.stopGracefullyOnShutdown: true
spark.streaming.unpersist: true
spark.streaming.kafka.consumer.cache.enabled: true
spark.hadoop.fs.s3.maxRetries: 30
spark.sql.shuffle.partitions: 2001
Spark SQL aggregation code:
.agg(functions.concat_ws(SPLIT, functions.collect_list(DEPARTMENT)).as(DEPS))
.map((row) -> {
Map<String, Object> map = Maps.newHashMap();
map.put(NAME, row.getString(0));
map.put(DEPS, row.getString(1));
return new KryoMapSerializationService().serialize(map);
}, Encoders.BINARY());
Some logs from the driver:
20/04/04 13:10:51 INFO TaskSetManager: Finished task 1911.0 in stage 1041.0 (TID 1052055) in 374 ms on <host> (executor 3) (1998/2001)
20/04/04 13:10:52 INFO TaskSetManager: Finished task 1925.0 in stage 1041.0 (TID 1052056) in 411 ms on <host> (executor 3) (1999/2001)
20/04/04 13:10:52 INFO TaskSetManager: Finished task 1906.0 in stage 1041.0 (TID 1052054) in 776 ms on <host> (executor 3) (2000/2001)
20/04/04 13:11:04 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 3.
20/04/04 13:11:04 INFO DAGScheduler: Executor lost: 3 (epoch 522)
20/04/04 13:11:04 INFO BlockManagerMasterEndpoint: Trying to remove executor 3 from BlockManagerMaster.
20/04/04 13:11:04 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(3, <host>, 38533, None)
20/04/04 13:11:04 INFO BlockManagerMaster: Removed 3 successfully in removeExecutor
20/04/04 13:11:04 INFO YarnAllocator: Completed container container_1582797414408_1814_01_000004 on host: <host> (state: COMPLETE, exit status: 143)
And by the way, I am using collectAsList in my foreachBatch code:
List<Event> list ="value")
.selectExpr("deserialize(value) as rows")
.selectExpr(NAME, DEPS)
With these settings, you may be causing your own issues.
num_of_executors: 17
executor_cores: 5
executor_memory: 19G
driver_memory: 30G
You are basically creating extra containers here that then have to shuffle data between each other. Instead, start off with something like 10 executors, 15 cores, 60g memory. If that works, you can adjust these a bit to try to optimize performance. I usually try splitting my containers in half at each step (but I also haven't needed to do this since Spark 2.0).
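To put numbers on the two sizings, here is a rough sketch that adds up what each one requests and compares it with the cluster. It assumes the 16 cores and 64 GB from the question are per instance, and it ignores the driver's 30G as well as any YARN or memory overhead, so treat the totals as ballpark figures only. The Sizing class is just a helper for this illustration.

```python
from typing import NamedTuple


class Sizing(NamedTuple):
    executors: int
    cores_each: int
    memory_gb_each: int

    @property
    def total_cores(self) -> int:
        return self.executors * self.cores_each

    @property
    def total_memory_gb(self) -> int:
        return self.executors * self.memory_gb_each


# Assumption: each of the 10 EMR instances has 16 cores and 64 GB.
cluster = Sizing(executors=10, cores_each=16, memory_gb_each=64)
question = Sizing(executors=17, cores_each=5, memory_gb_each=19)
suggested = Sizing(executors=10, cores_each=15, memory_gb_each=60)

for name, sizing in (("question", question), ("suggested", suggested)):
    core_share = sizing.total_cores / cluster.total_cores
    memory_share = sizing.total_memory_gb / cluster.total_memory_gb
    print(f"{name}: {sizing.total_cores} cores ({core_share:.0%}), "
          f"{sizing.total_memory_gb} GB ({memory_share:.0%})")
```

Under those assumptions the original settings ask for 85 cores and 323 GB out of 160 cores and 640 GB, while the suggested starting point asks for 150 cores and 600 GB spread over fewer, larger executors.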
Let spark.sql.shuffle.partitions keep its default of 200. The more you break this up, the more work Spark has to do to calculate the shuffles. If anything, I'd try to match the parallelism to the number of executors, so in this case just 10. When 2.0 came out, this is how you would tune Hive queries.
A job that is complex to break up puts all the load on the master.
Datasets and Encoders are also generally not as performant as plain DataFrame operations. I have seen large performance gains from refactoring such code into DataFrame operations.

How to stop Spark RDD mapPartitions duplicating data in cluster?

We discovered the duplication because an external dependency was being called many more times than the actual size of the RDD. The result of the transformation is a consolidated RDD with size == input RDD size, which I suspect is why the problem is hard to notice and why there aren't many solutions documented online. We tried a couple of approaches, described below:
Approach 1: Using the mapPartitionsWithIndex function with preservesPartitioning = true. The result was the same: duplication of data.
Approach 2: Another Stack Overflow answer recommended calling rdd.cache() before the mapPartitions transformation, the reasoning being that the duplication came from lazy evaluation and that caching would stop the partitions from being recomputed. This seemed to resolve the issue, but caching led to other memory-related exceptions.
rdd = rdd.mapPartitionsWithIndex(partitionFunction, true);
Logs from the Spark worker nodes in a 4-node cluster, with an RDD of size 1000:
2019-08-11 13:48:47,487 [Executor task launch worker-0] INFO - Found 215 records in Partition 3
2019-08-11 13:48:47,634 [Executor task launch worker-0] INFO - Found 203 records in Partition 0
2019-08-11 13:49:44,472 [Executor task launch worker-0] INFO - Found 177 records in Partition 2
2019-08-11 13:49:46,252 [Executor task launch worker-0] INFO - Found 201 records in Partition 4
2019-08-11 13:49:44,537 [Executor task launch worker-0] INFO - Found 203 records in Partition 0
2019-08-11 13:48:47,170 [Executor task launch worker-0] INFO - Found 204 records in Partition 1
2019-08-11 13:48:47,410 [Executor task launch worker-0] INFO - Found 177 records in Partition 2
2019-08-11 13:49:16,875 [Executor task launch worker-0] INFO - Found 201 records in Partition 4
Total records across partitions: 1581
Any suggestions from the community on solving this issue? TIA
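As a quick cross-check of the log lines above, the following sketch tallies the "Found N records in Partition P" messages. The counts are copied from the logs with the timestamp prefix dropped; the variable names are only for illustration. A partition that is logged twice suggests its function ran twice (for example through a retry, speculative execution or a recomputation), though the logs alone do not say which.

```python
import re
from collections import Counter

# Messages copied from the worker logs above, in the same order.
LOG = """\
Found 215 records in Partition 3
Found 203 records in Partition 0
Found 177 records in Partition 2
Found 201 records in Partition 4
Found 203 records in Partition 0
Found 204 records in Partition 1
Found 177 records in Partition 2
Found 201 records in Partition 4
"""

pattern = re.compile(r"Found (\d+) records in Partition (\d+)")
entries = [(int(partition), int(count)) for count, partition in pattern.findall(LOG)]

runs = Counter(partition for partition, _ in entries)
total_logged = sum(count for _, count in entries)
# Count every partition once (repeated entries carry the same record count).
unique_total = sum(dict(entries).values())
repeated = sorted(partition for partition, seen in runs.items() if seen > 1)

print("total logged:", total_logged)
print("counted once per partition:", unique_total)
print("partitions logged more than once:", repeated)
```

The logged total is 1581, but counting each partition once gives exactly 1000: the extra 581 records are partitions 0, 2 and 4 being processed a second time, about a minute after the first pass.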
