How to identify the optimum number of shuffle partitions in Spark

I am running a Spark Structured Streaming job (restarted every day) on EMR. After a few hours of execution the application hits an OOM error and is killed. My configuration and Spark SQL code are below.
I am new to Spark and would appreciate your input.
The EMR cluster has 10 instances, each with 16 cores and 64 GB of memory.
Spark-Submit arguments:
num_of_executors: 17
executor_cores: 5
executor_memory: 19G
driver_memory: 30G
The job reads input from Kafka in micro-batches at an interval of 30 seconds. The average number of rows read per batch is 90k.
spark.streaming.kafka.maxRatePerPartition: 4500
spark.streaming.stopGracefullyOnShutdown: true
spark.streaming.unpersist: true
spark.streaming.kafka.consumer.cache.enabled: true
spark.hadoop.fs.s3.maxRetries: 30
spark.sql.shuffle.partitions: 2001
Spark SQL aggregation code:
dataset.groupBy(functions.col(NAME), functions.window(functions.column(TIMESTAMP_COLUMN), 30))
    .agg(functions.concat_ws(SPLIT, functions.collect_list(DEPARTMENT)).as(DEPS))
    .select(NAME, DEPS)
    .map((row) -> {
        Map<String, Object> map = Maps.newHashMap();
        map.put(NAME, row.getString(0));
        map.put(DEPS, row.getString(1));
        return new KryoMapSerializationService().serialize(map);
    }, Encoders.BINARY());
Some logs from the driver:
20/04/04 13:10:51 INFO TaskSetManager: Finished task 1911.0 in stage 1041.0 (TID 1052055) in 374 ms on <host> (executor 3) (1998/2001)
20/04/04 13:10:52 INFO TaskSetManager: Finished task 1925.0 in stage 1041.0 (TID 1052056) in 411 ms on <host> (executor 3) (1999/2001)
20/04/04 13:10:52 INFO TaskSetManager: Finished task 1906.0 in stage 1041.0 (TID 1052054) in 776 ms on <host> (executor 3) (2000/2001)
20/04/04 13:11:04 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 3.
20/04/04 13:11:04 INFO DAGScheduler: Executor lost: 3 (epoch 522)
20/04/04 13:11:04 INFO BlockManagerMasterEndpoint: Trying to remove executor 3 from BlockManagerMaster.
20/04/04 13:11:04 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(3, <host>, 38533, None)
20/04/04 13:11:04 INFO BlockManagerMaster: Removed 3 successfully in removeExecutor
20/04/04 13:11:04 INFO YarnAllocator: Completed container container_1582797414408_1814_01_000004 on host: <host> (state: COMPLETE, exit status: 143)
By the way, I am using collectAsList in my foreachBatch code:
List<Event> list = dataset.select("value")
    .selectExpr("deserialize(value) as rows")
    .select("rows.*")
    .selectExpr(NAME, DEPS)
    .as(Encoders.bean(Event.class))
    .collectAsList();

With these settings, you may be causing your own issues.
num_of_executors: 17
executor_cores: 5
executor_memory: 19G
driver_memory: 30G
You are basically creating extra containers that then have to shuffle data between each other. Instead, start with something like 10 executors, 15 cores, and 60g of memory each. If that works, you can adjust these values to optimize performance. I usually try splitting my containers in half at each step (though I haven't needed to do this since Spark 2.0).
Leave spark.sql.shuffle.partitions at its default of 200. The more partitions you create, the more work Spark has to do to compute the shuffles. If anything, I'd try setting the parallelism equal to the number of executors, so in this case just 10. When 2.0 came out, this is how you would tune Hive queries.
Splitting the job into too many pieces puts all the load on the master.
Using Datasets and Encoders is also generally less performant than using plain DataFrame operations. I have seen large performance gains from refactoring such code into DataFrame operations.
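To make the sizing comparison concrete, here is a rough sketch in plain Java (no Spark dependency) that multiplies out the numbers quoted in the question and in this answer. The class and method names are made up for illustration, and the arithmetic deliberately ignores executor memory overhead and any memory YARN reserves on each node.

```java
// Hypothetical helper: compares requested executor resources against raw cluster capacity.
final class ExecutorSizing {
    // Total cores requested across all executors.
    static int totalCores(int executors, int coresPerExecutor) {
        return executors * coresPerExecutor;
    }

    // Total executor heap requested across all executors, in GB.
    static int totalMemoryGb(int executors, int memoryGbPerExecutor) {
        return executors * memoryGbPerExecutor;
    }

    public static void main(String[] args) {
        // Cluster from the question: 10 instances, 16 cores and 64 GB each.
        System.out.println("cluster:  " + totalCores(10, 16) + " cores, "
                + totalMemoryGb(10, 64) + " GB");
        // Settings from the question: 17 executors, 5 cores, 19G each.
        System.out.println("question: " + totalCores(17, 5) + " cores, "
                + totalMemoryGb(17, 19) + " GB");
        // Starting point suggested above: 10 executors, 15 cores, 60g each.
        System.out.println("answer:   " + totalCores(10, 15) + " cores, "
                + totalMemoryGb(10, 60) + " GB");
    }
}
```

Keep in mind that each executor also needs off-heap overhead on top of its heap, and YARN does not hand out the full physical memory of a node, so the memory that can actually be granted per container is lower than the raw 64 GB.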

Related

PySpark 3.0.1 Failing to run Distributed training in Tensorflow 2.1.0

I'm trying to train a simple fashion_mnist model in TensorFlow, following the original TensorBoard API docs on hyperparameter tuning.
Currently, for testing purposes, I'm running in local mode, so master = 'local[*]'.
I have installed pyspark==3.0.1 and tensorflow==2.1.0. The following is what I'm trying to run:
# For a given hyper parameter, this will run the train & return the model + accuracy which I'm looking for.
# This works when I run without spark.
def train(hparam) -> Tuple[Model, Any]:
    fashion_mnist = fashion
    (x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()
    x_train, x_test = x_train / 255.0, x_test / 255.0
    model = Sequential([
        Flatten(),
        Dense(hparam['num_units'], activation=tf.nn.relu),
        Dropout(hparam['dropout']),
        Dense(10, activation=tf.nn.softmax),
    ])
    model.compile(
        optimizer=hparam['optimizer'],
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy'],
    )
    model.fit(x_train, y_train, epochs=1)  # Run with 1 epoch to speed things up for demo purposes
    _, accuracy = model.evaluate(x_test, y_test)
    return model, accuracy
Here's my spark code which I run.
if __name__ == '__main__':
    hp_nums = hp.HParam('num_units', hp.Discrete([16, 32]))
    hp_dropouts = hp.HParam('dropout', hp.RealInterval(0.1, 0.2))
    hp_opts = hp.HParam('optimizer', hp.Discrete(['adam', 'sgd']))
    all_params = []  # contains a list of different hparams
    for num_units in hp_nums.domain.values:
        for dropout_rate in (hp_dropouts.domain.min_value, hp_dropouts.domain.max_value):
            for optimizer in hp_opts.domain.values:
                hparams = {
                    'num_units': num_units,
                    'dropout': dropout_rate,
                    'optimizer': optimizer,
                }
                all_params.append(hparams)
    spark_sess = SparkSession.builder.master(
        'local[*]'
    ).appName(
        'LocalTraining'
    ).getOrCreate()
    res = spark_sess.sparkContext.parallelize(
        all_params, len(all_params)
    ).map(
        train  # above function
    ).collect()
    temp = 0.0
    best_model = None
    for model, acc in res:
        if acc > temp:
            temp = acc  # track the best accuracy seen so far
            best_model = model
    print("best accuracy is -> " + str(temp))
This looks fine to me, and the same setup works for any simple map-reduce job (such as the basic examples), which makes me believe my environment is set up correctly.
My Environment:
java : Java 11.0.8 2020-07-14 LTS
python: Python 3.6.5
pyspark: 3.0.1
tensorflow: 2.1.0
Keras: 2.3.1
windows: 10 (if this really matters)
cores : 8 (i5 10th gen)
Memory: 6G
But when I run the code above, I get the following error. I can see the training run, and then it just stops after one executor runs:
59168/60000 [============================>.] - ETA: 0s - loss: 0.7350 - accuracy: 0.7471
60000/60000 [==============================] - 3s 42us/step - loss: 0.7331 - accuracy: 0.7477
20/12/05 14:03:57 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.net.SocketException: Connection reset
at java.base/java.net.SocketInputStream.read(SocketInputStream.java:186)
at java.base/java.net.SocketInputStream.read(SocketInputStream.java:140)
at java.base/java.io.BufferedInputStream.fill(BufferedInputStream.java:252)
at java.base/java.io.BufferedInputStream.read(BufferedInputStream.java:271)
20/12/05 14:03:57 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
20/12/05 14:03:57 INFO TaskSchedulerImpl: Cancelling stage 0
20/12/05 14:03:57 INFO TaskSchedulerImpl: Killing all running tasks in stage 0: Stage cancelled
20/12/05 14:03:57 INFO Executor: Executor is trying to kill task 1.0 in stage 0.0 (TID 1), reason: Stage cancelled
20/12/05 14:03:57 INFO TaskSchedulerImpl: Stage 0 was cancelled
20/12/05 14:03:57 INFO DAGScheduler: ResultStage 0 (collect at C:/Users/<>/<>/<>/main.py:<>) failed in 7.506 s due to Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, host.docker.internal, executor driver): java.net.SocketException: Connection reset
at java.base/java.net.SocketInputStream.read(SocketInputStream.java:186)
at java.base/java.net.SocketInputStream.read(SocketInputStream.java:140)
at java.base/java.io.BufferedInputStream.fill(BufferedInputStream.java:252)
at java.base/java.io.BufferedInputStream.read(BufferedInputStream.java:271)
at java.base/java.io.DataInputStream.readInt(DataInputStream.java:392)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:628)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:621)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:456)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, host.docker.internal, executor driver): java.net.SocketException: Connection reset
at java.base/java.net.SocketInputStream.read(SocketInputStream.java:186)
at java.base/java.net.SocketInputStream.read(SocketInputStream.java:140)
at java.base/java.io.BufferedInputStream.fill(BufferedInputStream.java:252)
at java.base/java.io.BufferedInputStream.read(BufferedInputStream.java:271)
at java.base/java.io.DataInputStream.readInt(DataInputStream.java:392)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:628)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:621)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:456)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator.foreach(Iterator.scala:941)
Driver stacktrace:
20/12/05 14:03:57 INFO DAGScheduler: Job 0 failed: collect at C:/<>/<>/<>/main.py, took 7.541442 s
Traceback (most recent call last):
File "C:/<>/<>/<>/main.py", line 68, in main
return res.collect()
File "C:\Users\<>\<>\<>\venv\lib\site-packages\pyspark\rdd.py", line 889, in collect
sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
File "C:\Users\<>\<>\<>\venv\lib\site-packages\py4j\java_gateway.py", line 1305, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "C:\Users\<>\<>\<>\venv\lib\site-packages\pyspark\sql\utils.py", line 128, in deco
return f(*a, **kw)
File "C:\Users\<>\<>\<>\venv\lib\site-packages\py4j\protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
The error occurs at the model.fit() line. It only happens when I call model.fit(); if I comment it out and put something else there, everything works fine. I'm unsure why it fails on model.fit().

How to stop Spark RDD mapPartitions duplicating data in cluster?

We discovered the duplication because an external dependency was being called more times than the actual size of the RDD. The result of the transformation is a consolidated RDD whose size equals the input RDD size; I suspect this is why the problem is hard to notice and why there aren't many solutions documented online. We tried a couple of approaches, described below.
Approach 1: Using the mapPartitionsWithIndex function with preservesPartitioning = true. The result was the same: duplicated data.
Approach 2: Another Stack Overflow answer recommended calling rdd.cache() before the mapPartitions transformation, since the cause was lazy evaluation and cache can invoke an immediate transformation. This seemed to resolve the issue, but caching led to other memory-related exceptions.
rdd = rdd.mapPartitionsWithIndex(partitionFunction, true);
Results from the Spark worker nodes in a 4-node cluster, with an RDD of size 1000:
2019-08-11 13:48:47,487 [Executor task launch worker-0] INFO - Found 215 records in Partition 3
2019-08-11 13:48:47,634 [Executor task launch worker-0] INFO - Found 203 records in Partition 0
2019-08-11 13:49:44,472 [Executor task launch worker-0] INFO - Found 177 records in Partition 2
2019-08-11 13:49:46,252 [Executor task launch worker-0] INFO - Found 201 records in Partition 4
2019-08-11 13:49:44,537 [Executor task launch worker-0] INFO - Found 203 records in Partition 0
2019-08-11 13:48:47,170 [Executor task launch worker-0] INFO - Found 204 records in Partition 1
2019-08-11 13:48:47,410 [Executor task launch worker-0] INFO - Found 177 records in Partition 2
2019-08-11 13:49:16,875 [Executor task launch worker-0] INFO - Found 201 records in Partition 4
Total records across partitions: 1581
Any suggestions from the community on solving this issue? TIA
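As a quick sanity check on the log lines above, the following plain-Java sketch tallies the logged record counts per partition. It only re-adds the numbers already shown in the logs; the class and method names are hypothetical. Under the assumption that each log line corresponds to one run of the partition function, the tally suggests that partitions 0, 2 and 4 were each processed twice, which would account for the difference between the total and the RDD size.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper that re-adds the per-partition counts from the worker logs.
final class PartitionLogTally {
    // {partition, records} pairs in the order they appear in the logs above.
    static final int[][] LOGGED = {
        {3, 215}, {0, 203}, {2, 177}, {4, 201},
        {0, 203}, {1, 204}, {2, 177}, {4, 201}
    };

    // Sum of every logged count, including repeated partitions.
    static int totalLogged() {
        int sum = 0;
        for (int[] entry : LOGGED) {
            sum += entry[1];
        }
        return sum;
    }

    // Sum counting each partition only once (first occurrence wins).
    static int uniqueTotal() {
        Map<Integer, Integer> firstSeen = new LinkedHashMap<>();
        for (int[] entry : LOGGED) {
            firstSeen.putIfAbsent(entry[0], entry[1]);
        }
        int sum = 0;
        for (int count : firstSeen.values()) {
            sum += count;
        }
        return sum;
    }

    // Partitions that appear in more than one log line, in ascending order.
    static Set<Integer> repeatedPartitions() {
        Set<Integer> seen = new TreeSet<>();
        Set<Integer> repeated = new TreeSet<>();
        for (int[] entry : LOGGED) {
            if (!seen.add(entry[0])) {
                repeated.add(entry[0]);
            }
        }
        return repeated;
    }

    public static void main(String[] args) {
        System.out.println("total logged:  " + totalLogged());
        System.out.println("unique total:  " + uniqueTotal());
        System.out.println("repeated:      " + repeatedPartitions());
    }
}
```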

Hazelcast custom timeout for operations

I am using the configuration "hazelcast.operation.call.timeout.millis = 100" to time out Hazelcast operations.
However, at Hazelcast startup some map size operations time out because of this setting. I only want to time out the operations that happen after the map has loaded, which are basically map get operations. Is there a way to set a custom operation timeout just for those map.get() operations?
Is there any other way to get this done?
com.hazelcast.core.OperationTimeoutException: HDMapSizeOperation got rejected before execution due to not starting within the operation-call-timeout of: 100ms. Current time: 2017-05-15 11:41:47.503. Start time: 2017-05-15 11:41:44.189. Total elapsed time: 3314 ms. Invocation{op=com.hazelcast.map.impl.operation.HDMapSizeOperation{serviceName='hz:impl:mapService', identityHash=1941379381, partitionId=0, replicaIndex=0, callId=-24461, invocationTime=1494828707296 (2017-05-15 11:41:47.296), waitTimeout=-1, callTimeout=100, name=blockMap}, tryCount=250, tryPauseMillis=500, invokeCount=11, callTimeoutMillis=100, firstInvocationTimeMs=1494828704189, firstInvocationTime='2017-05-15 11:41:44.189', lastHeartbeatMillis=0, lastHeartbeatTime='1970-01-01 05:30:00.000', target=[192.168.2.204]:5701, pendingResponse={VOID}, backupsAcksExpected=0, backupsAcksReceived=0, connection=null}
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.newOperationTimeoutException(InvocationFuture.java:151)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolve(InvocationFuture.java:99)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveAndThrowIfException(InvocationFuture.java:75)
at com.hazelcast.spi.impl.AbstractInvocationFuture.get(AbstractInvocationFuture.java:155)
at com.hazelcast.spi.impl.operationservice.impl.InvokeOnPartitions.retryFailedPartitions(InvokeOnPartitions.java:143)
at com.hazelcast.spi.impl.operationservice.impl.InvokeOnPartitions.invoke(InvokeOnPartitions.java:73)
at com.hazelcast.spi.impl.operationservice.impl.OperationServiceImpl.invokeOnAllPartitions(OperationServiceImpl.java:371)
at com.hazelcast.map.impl.proxy.MapProxySupport.size(MapProxySupport.java:628)
at com.hazelcast.map.impl.proxy.MapProxyImpl.size(MapProxyImpl.java:102)
at it.XXXX.tbx.server.MapLoader.run(MapLoader.java:36)
Regards,
Tharinda

If you are trying to control how long you wait for the result of, e.g., a map.get, have a look at the asynchronous version, map.getAsync. It returns a future, so you can control how long you want to wait for a result.
Modifying the call timeout is not advised.
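The bounded-wait idea can be sketched with only the JDK, as below. Note that slowLookup is a hypothetical stand-in for map.getAsync(key), not a Hazelcast call, and the exact future type returned by getAsync depends on the Hazelcast version, so treat this as an illustration of the pattern rather than drop-in code.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class BoundedWaitSketch {
    // Hypothetical stand-in for map.getAsync(key): completes after delayMillis.
    static Future<String> slowLookup(String key, long delayMillis) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                Thread.sleep(delayMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return "value-for-" + key;
        });
    }

    // Waits at most waitMillis for the result; returns fallback if it is not ready in time.
    static String getOrDefault(Future<String> future, long waitMillis, String fallback) {
        try {
            return future.get(waitMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // give up on this lookup
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback;
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }

    public static void main(String[] args) {
        // Fast lookup with a generous wait, then a slow lookup with a 100 ms wait.
        System.out.println(getOrDefault(slowLookup("a", 10), 1000, "timed-out"));
        System.out.println(getOrDefault(slowLookup("b", 2000), 100, "timed-out"));
    }
}
```

With this approach the per-call wait is chosen by the caller, so the global operation call timeout can stay at its default and startup operations such as size() are not affected.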

Remote RPC client disassociated while doing operation on datasets with persisted datasets

While performing a join or any other operation between persisted datasets and non-persisted datasets, Spark throws "Remote RPC client disassociated". The following code causes the issue.
Dataset<Row> dsTableA = sparkSession.read().format("jdbc").options(dbConfig)
    .option("dbTable", "SELECT * FROM tableA").load().persist(StorageLevel.MEMORY_AND_DISK_SER());
Dataset<Row> dsTableB = sparkSession.read().format("jdbc").options(dbConfig)
    .option("dbTable", "SELECT * FROM tableB").load().persist(StorageLevel.MEMORY_AND_DISK_SER());
Dataset<Row> anotherTableA = sparkSession.read().format("jdbc").options(dbConfig)
    .option("dbTable", "SELECT * FROM tableC").load();
anotherTableA.write().format("json").save("/path/toJsonA"); // Working Fine - No use of persisted datasets
Dataset<Row> anotherTableB = sparkSession.read().format("jdbc").options(dbConfig)
    .option("dbTable", "SELECT * FROM tableD").load();
dsTableA.createOrReplaceTempView("dsTableA");
dsTableB.createOrReplaceTempView("dsTableB");
anotherTableB.createOrReplaceTempView("anotherTableB");
Dataset<Row> joinedTable = sparkSession.sql("select atb.* from anotherTableB atb INNER JOIN dsTableA dsta ON atb.pid=dsta.pid LEFT JOIN dsTableB dstb ON atb.ssid=dstb.ssid");
joinedTable.write().format("json").save("/path/toJsonB");
// ERROR : Remote RPC client disassociated
// Working fine if Datasets dsTableA and dsTableB were not persisted
Part of the logs:
INFO TaskSetManager: Starting task 0.0 in stage 17.0 (TID 111, X.X.X.X, partition 0, PROCESS_LOCAL, 5342 bytes)
INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Launching task 111 on executor id: 0 hostname: X.X.X.X.
INFO BlockManagerInfo: Added broadcast_13_piece0 in memory on X.X.X.X:37153 (size: 12.9 KB, free: X.2 GB)
INFO BlockManagerInfo: Added broadcast_12_piece0 in memory on X.X.X.X:37153 (size: 52.0 KB, free: X.2 GB)
ERROR TaskSchedulerImpl: Lost executor 0 on X.X.X.X: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-12121212121211-0000/0 is now EXITED (Command exited with code 134)
WARN TaskSetManager: Lost task 0.0 in stage 17.0 (TID 111, X.X.X.X): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
INFO StandaloneSchedulerBackend: Executor app-12121212121211-0000/0 removed: Command exited with code 134
INFO DAGScheduler: Executor lost: 0 (epoch 8)
If the datasets dsTableA and dsTableB are not persisted, everything works smoothly. But I have to use persisted datasets. How can I solve this problem?

Spark Job in YARN - Executors are not executing the tasks for long time

From the Spark UI, I can see that the executors have not been executing tasks for a long time.
When I look at stderr in the Executors tab, I see the logs below.
16/02/04 05:30:56 INFO storage.MemoryStore: Block broadcast_91 of size 153016 dropped from memory (free 6665239401)
16/02/04 06:11:20 WARN hdfs.DFSClient: Slow ReadProcessor read fields took 31337ms (threshold=30000ms); ack: seqno: 1240 status: SUCCESS status: SUCCESS status: SUCCESS downstreamAckTimeNanos: 4835789, targets: [DatanodeInfoWithStorage[10.25.36.18:1004,DS-f6e20cf7-0ccb-45aa-988f-f3310d5acf89,DISK], DatanodeInfoWithStorage[10.25.36.11:1004,DS-61ad0a2d-a6fd-402e-b0a1-61682d1755fb,DISK], DatanodeInfoWithStorage[10.25.36.5:1004,DS-c77503a2-0c7f-4b5c-8f4a-9c61cb4f18d7,DISK]]
I do not see any further logs for a long time, and I do not see any errors either. It just keeps running.
Has anyone faced the same problem? How can we improve this?
Update:
It actually took a long time in the saveAsTextFile() method.
