How to stop Spark RDD mapPartitions duplicating data in cluster? - java

We discovered the duplication because an external dependency was being called more times than the actual size of the RDD. The result of the transformation is a consolidated RDD whose size == input RDD size, which I suspect is why the problem is hard to detect and there aren't many solutions documented online. We tried a couple of approaches, mentioned below -
Approach 1: Using the mapPartitionsWithIndex function with preservesPartitioning = true; the result was the same - duplication of data.
Approach 2: Another Stack Overflow answer recommended calling rdd.cache() before the mapPartitions transformation, since the duplication was attributed to lazy evaluation causing recomputation, and caching lets subsequent actions reuse the computed partitions instead of recomputing them. This seemed to resolve the issue, but caching led to other memory-related exceptions.
rdd = rdd.mapPartitionsWithIndex(partitionFunction, true);
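For reference, below is a minimal, self-contained sketch of the caching idea from Approach 2 (the data, the partition count, and callExternalDependency are placeholders, not the actual job). The point is that once the RDD produced by mapPartitionsWithIndex is persisted, subsequent actions read the cached partitions instead of recomputing them and invoking the external dependency again.
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

public class MapPartitionsCachingSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("mapPartitions-caching-sketch").setMaster("local[4]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            List<Integer> data = IntStream.range(0, 1000).boxed().collect(Collectors.toList());
            JavaRDD<Integer> rdd = sc.parallelize(data, 5);

            JavaRDD<Integer> mapped = rdd
                    .mapPartitionsWithIndex((idx, it) -> {
                        List<Integer> out = new ArrayList<>();
                        while (it.hasNext()) {
                            out.add(callExternalDependency(it.next())); // the call we want exactly once per record
                        }
                        System.out.println("Found " + out.size() + " records in Partition " + idx);
                        return out.iterator();
                    }, true)
                    // Persisting the transformed RDD means later actions read these partitions
                    // from cache instead of re-running the partition function.
                    .persist(StorageLevel.MEMORY_AND_DISK());

            System.out.println("First action, count = " + mapped.count());
            System.out.println("Second action, count = " + mapped.count()); // external call is not re-invoked
        }
    }

    private static Integer callExternalDependency(Integer value) {
        return value; // placeholder for the real external dependency
    }
}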
Results from the Spark worker nodes in a 4-node cluster, with an RDD of size 1000:
2019-08-11 13:48:47,487 [Executor task launch worker-0] INFO - Found 215 records in Partition 3
2019-08-11 13:48:47,634 [Executor task launch worker-0] INFO - Found 203 records in Partition 0
2019-08-11 13:49:44,472 [Executor task launch worker-0] INFO - Found 177 records in Partition 2
2019-08-11 13:49:46,252 [Executor task launch worker-0] INFO - Found 201 records in Partition 4
2019-08-11 13:49:44,537 [Executor task launch worker-0] INFO - Found 203 records in Partition 0
2019-08-11 13:48:47,170 [Executor task launch worker-0] INFO - Found 204 records in Partition 1
2019-08-11 13:48:47,410 [Executor task launch worker-0] INFO - Found 177 records in Partition 2
2019-08-11 13:49:16,875 [Executor task launch worker-0] INFO - Found 201 records in Partition 4
Total records across partitions: 1581
Any suggestions from the community on solving this issue? TIA

Related

How to identify the optimum number of shuffle partitions in Spark

I am running a Spark Structured Streaming job (bounced every day) in EMR. I am getting an OOM error in my application after a few hours of execution, and it gets killed. The following are my configuration and Spark SQL code.
I am new to Spark and need your valuable input.
The EMR cluster has 10 instances, each with 16 cores and 64 GB memory.
Spark-Submit arguments:
num_of_executors: 17
executor_cores: 5
executor_memory: 19G
driver_memory: 30G
The job reads input as micro-batches from Kafka at an interval of 30 seconds. The average number of rows read per batch is 90k.
spark.streaming.kafka.maxRatePerPartition: 4500
spark.streaming.stopGracefullyOnShutdown: true
spark.streaming.unpersist: true
spark.streaming.kafka.consumer.cache.enabled: true
spark.hadoop.fs.s3.maxRetries: 30
spark.sql.shuffle.partitions: 2001
Spark SQL aggregation code:
dataset.groupBy(functions.col(NAME), functions.window(functions.col(TIMESTAMP_COLUMN), "30 seconds"))
       .agg(functions.concat_ws(SPLIT, functions.collect_list(DEPARTMENT)).as(DEPS))
       .select(NAME, DEPS)
       .map((row) -> {
           Map<String, Object> map = Maps.newHashMap();
           map.put(NAME, row.getString(0));
           map.put(DEPS, row.getString(1));
           return new KryoMapSerializationService().serialize(map);
       }, Encoders.BINARY());
Some logs from the driver:
20/04/04 13:10:51 INFO TaskSetManager: Finished task 1911.0 in stage 1041.0 (TID 1052055) in 374 ms on <host> (executor 3) (1998/2001)
20/04/04 13:10:52 INFO TaskSetManager: Finished task 1925.0 in stage 1041.0 (TID 1052056) in 411 ms on <host> (executor 3) (1999/2001)
20/04/04 13:10:52 INFO TaskSetManager: Finished task 1906.0 in stage 1041.0 (TID 1052054) in 776 ms on <host> (executor 3) (2000/2001)
20/04/04 13:11:04 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 3.
20/04/04 13:11:04 INFO DAGScheduler: Executor lost: 3 (epoch 522)
20/04/04 13:11:04 INFO BlockManagerMasterEndpoint: Trying to remove executor 3 from BlockManagerMaster.
20/04/04 13:11:04 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(3, <host>, 38533, None)
20/04/04 13:11:04 INFO BlockManagerMaster: Removed 3 successfully in removeExecutor
20/04/04 13:11:04 INFO YarnAllocator: Completed container container_1582797414408_1814_01_000004 on host: <host> (state: COMPLETE, exit status: 143)
And by the way, I am using collectAsList in my forEachBatch code:
List<Event> list = dataset.select("value")
.selectExpr("deserialize(value) as rows")
.select("rows.*")
.selectExpr(NAME, DEPS)
.as(Encoders.bean(Event.class))
.collectAsList();
With these settings, you may be causing your own issues.
num_of_executors: 17
executor_cores: 5
executor_memory: 19G
driver_memory: 30G
You are basically creating extra containers here to have to shuffle between. Instead, start off with something like 10 executors, 15 cores, and 60g memory each. If that works, then you can adjust these a bit to try to optimize performance. I usually try splitting my containers in half at each step (though I also haven't needed to do this since Spark 2.0).
Let Spark SQL keep spark.sql.shuffle.partitions at the default of 200. The more you break this up, the more work you make Spark do to calculate the shuffles. If anything, I'd try to use the same parallelism as the number of executors, so in this case just 10. When 2.0 came out, this is how you would tune Hive queries.
Breaking the job up into that many pieces also puts all of the scheduling load on the master.
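As a rough illustration only (the numbers come from the suggestion above and are a starting point, not values validated against this workload), here is the same starting point expressed as Spark properties; in practice these are usually passed as spark-submit arguments rather than set in code:
import org.apache.spark.sql.SparkSession;

public class TuningSketch {
    public static void main(String[] args) {
        // Typically passed as --num-executors, --executor-cores, --executor-memory instead.
        SparkSession spark = SparkSession.builder()
                .appName("streaming-aggregation")
                .config("spark.executor.instances", "10")      // fewer, larger containers
                .config("spark.executor.cores", "15")
                .config("spark.executor.memory", "60g")
                .config("spark.sql.shuffle.partitions", "200") // keep the default rather than 2001
                .getOrCreate();
        spark.stop();
    }
}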
Using Datasets and Encoders is also generally not as performant as sticking to straight DataFrame operations. I have seen great lifts in performance from refactoring this kind of code into DataFrame operations.
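For what it's worth, a rough sketch of what that refactoring could look like for the aggregation above, reusing the column-name constants from the question (the values assigned to them below are placeholders) and assuming the downstream consumer does not strictly need the Kryo-serialized binary payload:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.functions;

public class DataFrameAggregationSketch {
    // Placeholder values; the question defines these constants elsewhere.
    private static final String NAME = "name";
    private static final String DEPS = "deps";
    private static final String DEPARTMENT = "department";
    private static final String TIMESTAMP_COLUMN = "timestamp";
    private static final String SPLIT = ",";

    static Dataset<Row> aggregate(Dataset<Row> dataset) {
        // Untyped Column expressions stay inside Catalyst/Tungsten end to end,
        // instead of dropping into a typed map() with a hand-rolled Kryo serializer.
        return dataset
                .groupBy(functions.col(NAME),
                         functions.window(functions.col(TIMESTAMP_COLUMN), "30 seconds"))
                .agg(functions.concat_ws(SPLIT, functions.collect_list(functions.col(DEPARTMENT))).as(DEPS))
                .select(functions.col(NAME), functions.col(DEPS));
    }
}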

Spring Batch: Multithreaded step with AsyncItemProcessor doesn't run in parallel

TL;DR
Given a file with a million records, where there is a considerable amount of logic to be executed on each row, what is the fastest way to read the file and finish applying the logic to each row? I used a multi-threaded step with a file reader whose read method is synchronized, and also used an AsyncItemProcessor so that the records are processed in their own threads.
My expectation is that the AsyncItemProcessor should start as soon as it has a record from the reader to process. Each record should be processed in its own thread; however, this doesn't seem to be the case in my example below.
I have a step in my Spring Batch job that uses a TaskExecutor with 20 threads and a commit interval of 10000 to read a file. I am also using an AsyncItemProcessor and AsyncItemWriter, since the data processing can at times take longer than the time required to read a line from the file.
<step id="aggregationStep">
<tasklet throttle-limit="20" task-executor="taskExecutor">
<chunk reader="fileReader"
processor="asyncProcessor" writer="asyncWriter"
commit-interval="10000" />
</tasklet>
</step>
Where:
fileReader is a class that extends FlatFileItemReader; its read method is synchronized and simply calls super.read within it.
asyncProcessor is nothing but an AsyncItemProcessor bean that is passed each row from the file, groups it by a key, and stores it into a singleton bean that holds a Map<String, BigDecimal> object. In other words, the processor simply groups the file data by a few columns and stores this data in memory.
asyncWriter is nothing but an AsyncItemWriter that wraps a no-op ItemWriter. In other words, the job does not need to do any writing, since the processor itself does the aggregation and stores the data in memory (the Map).
Note that the AsyncItemProcessor has its own ThreadPoolTaskExecutor with corePoolSize=10 and maxPoolSize=20, and the step has its own ThreadPoolTaskExecutor with corePoolSize=20 and maxPoolSize=40.
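For illustration, a rough Java sketch of the async wiring described above (the Record type, its fields, and the aggregation map are placeholders, not the question's actual classes, which are wired in XML):
import java.math.BigDecimal;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.springframework.batch.integration.async.AsyncItemProcessor;
import org.springframework.batch.integration.async.AsyncItemWriter;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

public class AsyncWiringSketch {

    // Placeholder for one parsed file row.
    public static class Record {
        String key;
        BigDecimal amount;
    }

    // Placeholder for the singleton aggregation bean described in the question.
    static final Map<String, BigDecimal> totals = new ConcurrentHashMap<>();

    static AsyncItemProcessor<Record, Record> asyncProcessor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(10);   // corePoolSize=10, maxPoolSize=20 as in the question
        executor.setMaxPoolSize(20);
        executor.initialize();

        ItemProcessor<Record, Record> aggregating = item -> {
            totals.merge(item.key, item.amount, BigDecimal::add); // group by key, sum in memory
            return item;
        };

        AsyncItemProcessor<Record, Record> processor = new AsyncItemProcessor<>();
        processor.setDelegate(aggregating);     // delegate runs on the processor's own thread pool
        processor.setTaskExecutor(executor);
        return processor;
    }

    static AsyncItemWriter<Record> asyncWriter() {
        AsyncItemWriter<Record> writer = new AsyncItemWriter<>();
        writer.setDelegate(items -> { /* no-op: aggregation already happened in the processor */ });
        return writer;
    }
}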
With the above setup, my expectation was that the reading and processing would happen in parallel. Something like:
FileReader reads a record from the file and passes it to the processor
AsyncItemProcessor performs aggregation. Since it is an AsyncItemProcessor, the thread that called the process method should ideally be free to do other work?
Finally, the AsyncItemWriter would get the Future and extract the data, but do nothing since the delegate is a no-op ItemWriter.
But when I added some logs, I did not see what I expected:
2019-09-07 10:04:49 INFO FileReader:45 - Finished reading 2500 records
2019-09-07 10:04:49 INFO FileReader:45 - Finished reading 5000 records
2019-09-07 10:04:50 INFO FileReader:45 - Finished reading 7501 records
2019-09-07 10:04:50 INFO FileReader:45 - Finished reading 10000 records
2019-09-07 10:04:51 INFO FileReader:45 - Finished reading 12500 records
2019-09-07 10:04:51 INFO FileReader:45 - Finished reading 15000 records
... more such lines are printed until the entire file is read. Only after the file is read do I start seeing the processor do its work:
2019-09-07 10:06:53 INFO FileProcessor:83 - Finished processing 2500 records
2019-09-07 10:08:28 INFO FileProcessor:83 - Finished processing 5000 records
2019-09-07 10:10:04 INFO FileProcessor:83 - Finished processing 7500 records
2019-09-07 10:11:40 INFO FileProcessor:83 - Finished processing 10000 records
2019-09-07 10:13:16 INFO FileProcessor:83 - Finished processing 12500 records
2019-09-07 10:14:51 INFO FileProcessor:83 - Finished processing 15000 records
Bottom line: Why does the processor not kick in until the file has been fully read? No matter what ThreadPoolTaskExecutor parameters are used for the AsyncItemProcessor or for the entire step, the reading always completes first, and only then does the processing start.
This is how chunk-oriented processing works. The step reads X items into a buffer (where X is the commit-interval), and only then are processing and writing performed for that chunk. You can see that in the code of ChunkOrientedTasklet.
In a multi-threaded step, each chunk is processed by its own thread.
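To make that concrete, here is a simplified sketch of one chunk iteration (not the actual ChunkOrientedTasklet source): with commit-interval=10000, all read() calls for a chunk complete before the first process()/write() of that chunk starts, which is why the reader log lines appear before any processor log lines.
import java.util.ArrayList;
import java.util.List;

import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;

public class ChunkLoopSketch {

    // Simplified per-chunk flow; with AsyncItemProcessor/AsyncItemWriter the "processed"
    // items are Futures that the writer waits on at the end of the chunk.
    static <I, O> void processOneChunk(ItemReader<I> reader,
                                       ItemProcessor<I, O> processor,
                                       ItemWriter<O> writer,
                                       int commitInterval) throws Exception {
        List<I> chunk = new ArrayList<>();
        for (int i = 0; i < commitInterval; i++) {
            I item = reader.read();        // all the (synchronized) reads for this chunk happen first
            if (item == null) {
                break;                     // end of input
            }
            chunk.add(item);
        }
        List<O> processed = new ArrayList<>();
        for (I item : chunk) {
            processed.add(processor.process(item)); // only now does processing start
        }
        writer.write(processed);                    // then the whole chunk is written/committed
    }
}
In this setup each of the 20 step threads buffers up to 10000 items before its processing begins, so reading dominates the logs early on; a smaller commit-interval would interleave the two phases more.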

Remote RPC client disassociated while doing operation on datasets with persisted datasets

While performing a join or any other operation between persisted datasets and non-persisted datasets, the Spark server throws "Remote RPC client disassociated". Following is the piece of code causing the issue.
Dataset<Row> dsTableA = sparkSession.read().format("jdbc").options(dbConfig)
.option("dbTable", "SELECT * FROM tableA").load().persist(StorageLevel.MEMORY_AND_DISK_SER());
Dataset<Row> dsTableB = sparkSession.read().format("jdbc").options(dbConfig)
.option("dbTable", "SELECT * FROM tableB").load().persist(StorageLevel.MEMORY_AND_DISK_SER());
Dataset<Row> anotherTableA = sparkSession.read().format("jdbc").options(dbConfig)
.option("dbTable", "SELECT * FROM tableC").load();
anotherTableA.write().format("json").save("/path/toJsonA"); // Working Fine - No use of persisted datasets
Dataset<Row> anotherTableB = sparkSession.read().format("jdbc").options(dbConfig)
.option("dbTable", "SELECT * FROM tableD").load();
dsTableA.createOrReplaceTempView("dsTableA");
dsTableB.createOrReplaceTempView("dsTableB");
anotherTableB.createOrReplaceTempView("anotherTableB");
Dataset<Row> joinedTable = sparkSession.sql("select atb.* from anotherTableB atb INNER JOIN dsTableA dsta ON atb.pid=dsta.pid LEFT JOIN dsTableB dstb ON atb.ssid=dstb.ssid");
joinedTable.write().format("json").save("/path/toJsonB");
// ERROR : Remote RPC client disassociated
// Working fine if Datasets dsTableA and dsTableB were not persisted
Part of the logs:
INFO TaskSetManager: Starting task 0.0 in stage 17.0 (TID 111, X.X.X.X, partition 0, PROCESS_LOCAL, 5342 bytes)
INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Launching task 111 on executor id: 0 hostname: X.X.X.X.
INFO BlockManagerInfo: Added broadcast_13_piece0 in memory on X.X.X.X:37153 (size: 12.9 KB, free: X.2 GB)
INFO BlockManagerInfo: Added broadcast_12_piece0 in memory on X.X.X.X:37153 (size: 52.0 KB, free: X.2 GB)
ERROR TaskSchedulerImpl: Lost executor 0 on X.X.X.X: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-12121212121211-0000/0 is now EXITED (Command exited with code 134)
WARN TaskSetManager: Lost task 0.0 in stage 17.0 (TID 111, X.X.X.X): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
INFO StandaloneSchedulerBackend: Executor app-12121212121211-0000/0 removed: Command exited with code 134
INFO DAGScheduler: Executor lost: 0 (epoch 8)
If the datasets dsTableA and dsTableB are not persisted, then everything works smoothly. But I must use persisted datasets. So how do I solve this problem?

Spark Job in YARN - Executors are not executing tasks for a long time

I can see from the Spark UI that the executors have not been executing any tasks for a long time.
When I look at the stderr in the executors tab, I can see the logs below.
16/02/04 05:30:56 INFO storage.MemoryStore: Block broadcast_91 of size 153016 dropped from memory (free 6665239401)
16/02/04 06:11:20 WARN hdfs.DFSClient: Slow ReadProcessor read fields took 31337ms (threshold=30000ms); ack: seqno: 1240 status: SUCCESS status: SUCCESS status: SUCCESS downstreamAckTimeNanos: 4835789, targets: [DatanodeInfoWithStorage[10.25.36.18:1004,DS-f6e20cf7-0ccb-45aa-988f-f3310d5acf89,DISK], DatanodeInfoWithStorage[10.25.36.11:1004,DS-61ad0a2d-a6fd-402e-b0a1-61682d1755fb,DISK], DatanodeInfoWithStorage[10.25.36.5:1004,DS-c77503a2-0c7f-4b5c-8f4a-9c61cb4f18d7,DISK]]
I do not see any log output for a long time, and I do not see any errors either. It just keeps on running.
Has anyone faced the same problem? How can we improve this?
Update:
It actually takes a long time in the saveAsTextFile() method.

Hive Query failing with Heap Issue

Below is the Hive query I am running:
INSERT INTO TABLE temp.table_output
SELECT /*+ STREAMTABLE(tableB) */ c.column1 as client, a.column2 as testData,
CASE WHEN ca.updated_date IS NULL OR ca.updated_date = 'null' THEN null ELSE CONCAT(ca.updated_date, '+0000') END as update
FROM temp.tableA as a
INNER JOIN default.tableB as ca ON a.column5=ca.column2
INNER JOIN default.tableC as c ON ca.column3=c.column1 WHERE a.name='test';
TableB has 2.4 billion rows (140 GB); TableA and TableC each have around 200 million records.
The cluster consists of 3 Cassandra data nodes and 3 Analytics nodes (Hive on top of Cassandra), with 130 GB RAM on each node.
TableA, TableB, TableC are hive internal tables.
Hive cluster heap size is 12GB.
Can someone tell me why I am running into a heap issue when I run the Hive query, and why it fails to complete the task? It is the only job running on the Hive server.
The task fails with the error below:
Caused by: java.io.IOException: Read failed from file: cfs://172.31.x.x/tmp/hive-root/hive_2015-03-17_00-27-25_132_17376615815827139-1/-mr-10002/000049_0
at com.datastax.bdp.hadoop.cfs.CassandraInputStream.read(CassandraInputStream.java:178)
at java.io.DataInputStream.readFully(DataInputStream.java:195)
at java.io.DataInputStream.readFully(DataInputStream.java:169)
at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1508)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1486)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1475)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1470)
at org.apache.hadoop.mapred.SequenceFileRecordReader.<init>(SequenceFileRecordReader.java:43)
at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:59)
at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.<init>(CombineHiveRecordReader.java:65)
... 16 more
Caused by: java.io.IOException: org.apache.thrift.TApplicationException: Internal error processing get_remote_cfs_sblock
at com.datastax.bdp.hadoop.cfs.CassandraFileSystemThriftStore.retrieveSubBlock(CassandraFileSystemThriftStore.java:537)
at com.datastax.bdp.hadoop.cfs.CassandraSubBlockInputStream.subBlockSeekTo(CassandraSubBlockInputStream.java:145)
at com.datastax.bdp.hadoop.cfs.CassandraSubBlockInputStream.read(CassandraSubBlockInputStream.java:95)
at com.datastax.bdp.hadoop.cfs.CassandraInputStream.read(CassandraInputStream.java:159)
... 25 more
Caused by: org.apache.thrift.TApplicationException: Internal error processing get_remote_cfs_sblock
at org.apache.thrift.TApplicationException.read(TApplicationException.java:111)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:71)
at org.apache.cassandra.thrift.Dse$Client.recv_get_remote_cfs_sblock(Dse.java:271)
at org.apache.cassandra.thrift.Dse$Client.get_remote_cfs_sblock(Dse.java:254)
at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.datastax.bdp.util.CassandraProxyClient.invokeDseClient(CassandraProxyClient.java:655)
at com.datastax.bdp.util.CassandraProxyClient.invoke(CassandraProxyClient.java:631)
at com.sun.proxy.$Proxy5.get_remote_cfs_sblock(Unknown Source)
at com.datastax.bdp.hadoop.cfs.CassandraFileSystemThriftStore.retrieveSubBlock(CassandraFileSystemThriftStore.java:515)
... 28 more
Hive.log
2015-03-17 23:10:39,576 ERROR exec.Task (SessionState.java:printError(419)) - Examining task ID: task_201503171816_0036_r_000023 (and more) from job job_201503171816_0036
2015-03-17 23:10:39,579 ERROR exec.Task (SessionState.java:printError(419)) - Examining task ID: task_201503171816_0036_r_000052 (and more) from job job_201503171816_0036
2015-03-17 23:10:39,582 ERROR exec.Task (SessionState.java:printError(419)) - Examining task ID: task_201503171816_0036_m_000207 (and more) from job job_201503171816_0036
2015-03-17 23:10:39,585 ERROR exec.Task (SessionState.java:printError(419)) - Examining task ID: task_201503171816_0036_r_000087 (and more) from job job_201503171816_0036
2015-03-17 23:10:39,588 ERROR exec.Task (SessionState.java:printError(419)) - Examining task ID: task_201503171816_0036_m_000223 (and more) from job job_201503171816_0036
2015-03-17 23:10:39,591 ERROR exec.Task (SessionState.java:printError(419)) - Examining task ID: task_201503171816_0036_m_000045 (and more) from job job_201503171816_0036
2015-03-17 23:10:39,594 ERROR exec.Task (SessionState.java:printError(419)) - Examining task ID: task_201503171816_0036_m_000235 (and more) from job job_201503171816_0036
2015-03-17 23:10:39,597 ERROR exec.Task (SessionState.java:printError(419)) - Examining task ID: task_201503171816_0036_m_002140 (and more) from job job_201503171816_0036
2015-03-17 23:10:39,761 ERROR exec.Task (SessionState.java:printError(419)) -
Task with the most failures(4):
-----
Task ID:
task_201503171816_0036_m_000036
URL:
http://sjvtncasl064.mcafee.int:50030/taskdetails.jsp?jobid=job_201503171816_0036&tipid=task_201503171816_0036_m_000036
-----
Diagnostic Messages for this Task:
Error: Java heap space
2015-03-17 23:10:39,777 ERROR ql.Driver (SessionState.java:printError(419)) - FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
Most of the time, the Hadoop errors on the tracker side are not very descriptive - something like "there was a problem retrieving data from one of the nodes". To find out what really happened, you need to get the system.log plus the Hive and Hadoop task logs from each node, especially the one that did not return data in time, and look at what the problem was there at the time of the error. In OpsCenter you can actually click on the Hive job in progress, watch what is going on on each node, and then see what error caused your job interruption.
Here are some links that I found to be very useful. Some of these links are for older versions of DSE, but they still give a good start on how to optimize Hadoop operations and memory management.
http://www.datastax.com/dev/blog/tuning-dse-hadoop-map-reduce
http://www.datastax.com/documentation/datastax_enterprise/4.0/datastax_enterprise/ana/anaHivTune.html
https://support.datastax.com/entries/23459322-Tuning-memory-for-Hadoop-tasks
https://support.datastax.com/entries/23472546-Specifying-the-number-of-concurrent-tasks-per-node
You may also want to read this article. Sometimes, timeouts may be due to major Garbage collections.
HTH
