TL;DR
Given a file with a million records, where a considerable amount of logic has to be executed on each row, what is the fastest way to read the file and finish applying the logic to each row? I used a multi-threaded step with a file reader whose read method is synchronized, and also an AsyncItemProcessor so that each record is processed in its own thread.
My expectation is that the AsyncItemProcessor should start immediately as soon as it has a record from the reader to process, and that each record should be processed in its own thread. However, this doesn't seem to be the case in my example below.
I have a step in my Spring Batch job that uses a TaskExecutor with 20 threads and a commit interval of 10000 to read a file. I am also using an AsyncItemProcessor and AsyncItemWriter, since the data processing can at times take longer than reading a line from the file.
<step id="aggregationStep">
    <tasklet throttle-limit="20" task-executor="taskExecutor">
        <chunk reader="fileReader"
               processor="asyncProcessor" writer="asyncWriter"
               commit-interval="10000" />
    </tasklet>
</step>
Where:
fileReader is a class that extends FlatFileItemReader; its read method is synchronized and simply calls super.read within it.
asyncProcessor is nothing but an AsyncItemProcessor bean that is passed each row from the file, groups it by a key, and stores the result in a singleton bean holding a Map<String, BigDecimal>. In other words, the processor simply groups the file data by a few columns and keeps that data in memory.
asyncWriter is nothing but an AsyncItemWriter that wraps a no-op ItemWriter. In other words, the job does not need to do any writing, since the processor itself performs the aggregation and stores the data in memory (the Map).
Note that the AsyncItemProcessor has its own ThreadPoolTaskExecutor with corePoolSize=10 and maxPoolSize=20, while the step has its own ThreadPoolTaskExecutor with corePoolSize=20 and maxPoolSize=40. A sketch of these beans is shown below.
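Here is a minimal sketch of the three beans described above (Java config shown for brevity; class names, the FileRow item type, and the injected executor bean are placeholders, not the exact code from my job):

import org.springframework.batch.integration.async.AsyncItemProcessor;
import org.springframework.batch.integration.async.AsyncItemWriter;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.task.TaskExecutor;

// Reader whose read() is synchronized so that concurrent chunk threads
// do not corrupt the reader's internal state.
class SynchronizedFileReader extends FlatFileItemReader<FileRow> {
    @Override
    public synchronized FileRow read() throws Exception {
        return super.read();
    }
}

@Configuration
class AggregationStepBeans {

    // AsyncItemProcessor with its own executor (corePoolSize=10, maxPoolSize=20).
    @Bean
    AsyncItemProcessor<FileRow, FileRow> asyncProcessor(TaskExecutor processorExecutor) {
        AsyncItemProcessor<FileRow, FileRow> processor = new AsyncItemProcessor<>();
        processor.setDelegate(item -> {
            // group by key and accumulate into the shared Map<String, BigDecimal> here
            return item;
        });
        processor.setTaskExecutor(processorExecutor);
        return processor;
    }

    // AsyncItemWriter that unwraps the Futures and delegates to a no-op writer.
    @Bean
    AsyncItemWriter<FileRow> asyncWriter() {
        AsyncItemWriter<FileRow> writer = new AsyncItemWriter<>();
        writer.setDelegate(items -> { /* no-op */ });
        return writer;
    }
}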
With the above setup, my expectation was that the reading and processing would happen in parallel. Something like:
FileReader reads a record from the file and passes it to the processor
AsyncItemProcessor performs aggregation. Since it is an AsyncItemProcessor, the thread that called the process method should ideally be free to do other work?
Finally, the AsyncItemWriter would get the Future and extract the data, but do nothing since the delegate is a no-op ItemWriter.
But when I added some logs, I am not seeing what I expected:
2019-09-07 10:04:49 INFO FileReader:45 - Finished reading 2500 records
2019-09-07 10:04:49 INFO FileReader:45 - Finished reading 5000 records
2019-09-07 10:04:50 INFO FileReader:45 - Finished reading 7501 records
2019-09-07 10:04:50 INFO FileReader:45 - Finished reading 10000 records
2019-09-07 10:04:51 INFO FileReader:45 - Finished reading 12500 records
2019-09-07 10:04:51 INFO FileReader:45 - Finished reading 15000 records
... more such lines are printed until the entire file is read. Only after the file is read do I start seeing the processor doing its work:
2019-09-07 10:06:53 INFO FileProcessor:83 - Finished processing 2500 records
2019-09-07 10:08:28 INFO FileProcessor:83 - Finished processing 5000 records
2019-09-07 10:10:04 INFO FileProcessor:83 - Finished processing 7500 records
2019-09-07 10:11:40 INFO FileProcessor:83 - Finished processing 10000 records
2019-09-07 10:13:16 INFO FileProcessor:83 - Finished processing 12500 records
2019-09-07 10:14:51 INFO FileProcessor:83 - Finished processing 15000 records
Bottom line: why is the processor not kicking in until the file has been fully read? No matter what ThreadPoolTaskExecutor parameters I use for the AsyncItemProcessor or for the whole step, the reading always completes first and only then does the processing start.
This is how chunk-oriented processing works. The step reads X items into a buffer (where X is the commit-interval), and only then are processing and writing performed. You can see that in the code of ChunkOrientedTasklet.
In a multi-threaded step, each chunk is processed by one thread.
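Roughly speaking, each worker thread executes one chunk end to end. Here is a simplified sketch of that loop (an illustration of the chunk model, not the actual ChunkOrientedTasklet source):

import java.util.ArrayList;
import java.util.List;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;

// Simplified illustration of one chunk: the whole chunk is read first,
// and only then is each item processed and the chunk written.
static <I, O> void executeChunk(ItemReader<I> reader, ItemProcessor<I, O> processor,
                                ItemWriter<O> writer, int commitInterval) throws Exception {
    List<I> chunk = new ArrayList<>();
    for (int i = 0; i < commitInterval; i++) {   // 1. read up to commit-interval items
        I item = reader.read();                  //    (serialized by the synchronized reader)
        if (item == null) {
            break;                               //    end of input
        }
        chunk.add(item);
    }
    List<O> outputs = new ArrayList<>();
    for (I item : chunk) {                       // 2. only then process the chunk
        outputs.add(processor.process(item));
    }
    writer.write(outputs);                       // 3. finally write, and the transaction commits
}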
Related
Can anyone please tell me about this exception?
ERROR [kafka-producer-network-thread | producer-2] c.o.p.a.s.CalculatorAdapter [CalculatorAdapter.java:285]
Cannot send outgoingDto with decision id = 46d1-9491-123ce9c7a916 in kafka:
org.springframework.kafka.core.KafkaProducerException: Failed to send;
nested exception is org.apache.kafka.common.errors.TimeoutException:
Expiring 1 record(s) for save-request-0:604351 ms has passed since batch creation
at org.springframework.kafka.core.KafkaTemplate.lambda$buildCallback$4(KafkaTemplate.java:602)
at org.springframework.kafka.core.DefaultKafkaProducerFactory$CloseSafeProducer$1.onCompletion(DefaultKafkaProducerFactory.java:871)
at org.apache.kafka.clients.producer.KafkaProducer$InterceptorCallback.onCompletion(KafkaProducer.java:1356)
at org.apache.kafka.clients.producer.internals.ProducerBatch.completeFutureAndFireCallbacks(ProducerBatch.java:231)
at org.apache.kafka.clients.producer.internals.ProducerBatch.done(ProducerBatch.java:197)
at org.apache.kafka.clients.producer.internals.Sender.failBatch(Sender.java:676)
at org.apache.kafka.clients.producer.internals.Sender.sendProducerData(Sender.java:380)
at org.apache.kafka.clients.producer.internals.Sender.runOnce(Sender.java:323)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:239)
at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: org.apache.kafka.common.errors.TimeoutException:
Expiring 1 record(s) for save-request-0:604351 ms has passed since batch creation
I have been fighting with this for two weeks now.
I have gone through a bunch of suggested fixes, but none of them helped.
My program sends messages of about 60 kilobytes in size, but they do not reach the Kafka server.
The entire Java application log is filled with exceptions of this kind.
My guess is that filling the batch takes longer than the transaction allows, so the message is never sent.
// example
Properties props = new Properties();
...
props.put(ProducerConfig.BATCH_SIZE_CONFIG, 60000); // 60 KB
...
Producer producer = new KafkaProducer<>(props);
Check out these articles:
Kafka Producer Batch
Kafka Producer batch size
Batch size configuration
http://cloudurable.com/blog/kafka-tutorial-kafka-producer-advanced-java-examples/index.html
https://kafka.apache.org/26/javadoc/org/apache/kafka/clients/producer/ProducerConfig.html
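If the guess above is right and the batch never fills up before the timeout, these are the producer settings people usually adjust for this error. This is only a sketch; the broker address, serializers and values are placeholders you would tune for your setup (with spring-kafka these would go into the producer factory properties):

// Sketch only: typical settings adjusted for "ms has passed since batch creation" timeouts.
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.ByteArraySerializer;
import org.apache.kafka.common.serialization.StringSerializer;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
// Smaller batches (or a small linger) mean a batch is sent even if it never fills up.
props.put(ProducerConfig.BATCH_SIZE_CONFIG, 16384);          // default 16 KB
props.put(ProducerConfig.LINGER_MS_CONFIG, 5);                // flush partially filled batches quickly
// Give slow brokers more time before a batch expires in the accumulator.
props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 180000);
props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, 60000);

Producer<String, byte[]> producer = new KafkaProducer<>(props);

Note that a timeout of 604351 ms is far beyond the defaults, so this is usually a sign the broker is unreachable or overloaded rather than a pure sizing problem.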
I am running a Spark Structured Streaming job in EMR (it is bounced every day). After a few hours of execution my application hits an OOM error and gets killed. The following are my configurations and Spark SQL code.
I am new to Spark and need your valuable input.
The EMR cluster has 10 instances, each with 16 cores and 64 GB of memory.
Spark-Submit arguments:
num_of_executors: 17
executor_cores: 5
executor_memory: 19G
driver_memory: 30G
The job reads input as micro-batches from Kafka at an interval of 30 seconds. The average number of rows read per batch is 90k.
spark.streaming.kafka.maxRatePerPartition: 4500
spark.streaming.stopGracefullyOnShutdown: true
spark.streaming.unpersist: true
spark.streaming.kafka.consumer.cache.enabled: true
spark.hadoop.fs.s3.maxRetries: 30
spark.sql.shuffle.partitions: 2001
Spark SQL aggregation code:
dataset.groupBy(functions.col(NAME), functions.window(functions.column(TIMESTAMP_COLUMN), "30 seconds"))
       .agg(functions.concat_ws(SPLIT, functions.collect_list(DEPARTMENT)).as(DEPS))
       .select(NAME, DEPS)
       .map((row) -> {
           Map<String, Object> map = Maps.newHashMap();
           map.put(NAME, row.getString(0));
           map.put(DEPS, row.getString(1));
           return new KryoMapSerializationService().serialize(map);
       }, Encoders.BINARY());
Some logs from the driver:
20/04/04 13:10:51 INFO TaskSetManager: Finished task 1911.0 in stage 1041.0 (TID 1052055) in 374 ms on <host> (executor 3) (1998/2001)
20/04/04 13:10:52 INFO TaskSetManager: Finished task 1925.0 in stage 1041.0 (TID 1052056) in 411 ms on <host> (executor 3) (1999/2001)
20/04/04 13:10:52 INFO TaskSetManager: Finished task 1906.0 in stage 1041.0 (TID 1052054) in 776 ms on <host> (executor 3) (2000/2001)
20/04/04 13:11:04 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 3.
20/04/04 13:11:04 INFO DAGScheduler: Executor lost: 3 (epoch 522)
20/04/04 13:11:04 INFO BlockManagerMasterEndpoint: Trying to remove executor 3 from BlockManagerMaster.
20/04/04 13:11:04 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(3, <host>, 38533, None)
20/04/04 13:11:04 INFO BlockManagerMaster: Removed 3 successfully in removeExecutor
20/04/04 13:11:04 INFO YarnAllocator: Completed container container_1582797414408_1814_01_000004 on host: <host> (state: COMPLETE, exit status: 143)
And by the way, I am using collectAsList in my foreachBatch code:
List<Event> list = dataset.select("value")
.selectExpr("deserialize(value) as rows")
.select("rows.*")
.selectExpr(NAME, DEPS)
.as(Encoders.bean(Event.class))
.collectAsList();
With these settings, you may be causing your own issues.
num_of_executors: 17
executor_cores: 5
executor_memory: 19G
driver_memory: 30G
You are basically creating extra containers here that data has to be shuffled between. Instead, start off with something like 10 executors, 15 cores, and 60 GB of memory. If that works, then you can tweak these a bit to try to optimize performance. I usually try splitting my containers in half each step (but I also haven't needed to do this since Spark 2.0).
Let Spark SQL keep the default of 200 for spark.sql.shuffle.partitions. The more you break this up, the more work you make Spark do to plan the shuffles. If anything, I'd try to match the parallelism to the number of executors you have, so in this case just 10. This is how you would tune Hive queries when 2.0 came out.
Breaking the job up into that many pieces puts all of the load on the master.
Using Datasets and Encoders is also generally not as performant as plain DataFrame operations. I have seen big performance lifts from refactoring such code into DataFrame operations.
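For example, the per-row map with the Kryo serializer could hypothetically be replaced with built-in column functions so everything stays a DataFrame. This sketch assumes a JSON byte[] payload is an acceptable substitute for the Kryo-serialized map:

// Hypothetical DataFrame-only version of the aggregation above.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.functions;

Dataset<Row> result = dataset
    .groupBy(functions.col(NAME),
             functions.window(functions.col(TIMESTAMP_COLUMN), "30 seconds"))
    .agg(functions.concat_ws(SPLIT, functions.collect_list(DEPARTMENT)).as(DEPS))
    // serialize with a built-in function instead of a per-row map + Kryo service
    .select(functions.to_json(functions.struct(functions.col(NAME), functions.col(DEPS)))
                     .cast("binary")
                     .alias("value"));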
We discovered the duplication issue because an external dependency was being called more times than the actual size of the RDD. The result of the transformation is a consolidated RDD whose size equals the input RDD size, which I suspect is why the problem is hard to detect and why there aren't many solutions documented online. We tried a couple of approaches, mentioned below.
Approach 1: Using the mapPartitionsWithIndex function with preservesPartitioning = true; the result was the same, duplication of data.
Approach 2: Another Stack Overflow answer recommended calling rdd.cache() before the mapPartitions transformation, since the cause was lazy evaluation and caching forces the data to be evaluated once (sketched below). This seemed to resolve the issue, but caching led to other memory-related exceptions.
rdd = rdd.mapPartitionsWithIndex(partitionFunction, true);
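A sketch of Approach 2 as referenced above (assuming a JavaRDD and the same partitionFunction):

// Approach 2 sketch: cache the RDD so that, after the first action, re-evaluations
// reuse the cached partitions instead of re-running the lineage (and the external
// dependency calls) again, at the cost of extra memory.
rdd = rdd.cache();
rdd = rdd.mapPartitionsWithIndex(partitionFunction, true);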
Results from the Spark worker nodes in a 4-node cluster, with an RDD of size 1000:
2019-08-11 13:48:47,487 [Executor task launch worker-0] INFO - Found 215 records in Partition 3
2019-08-11 13:48:47,634 [Executor task launch worker-0] INFO - Found 203 records in Partition 0
2019-08-11 13:49:44,472 [Executor task launch worker-0] INFO - Found 177 records in Partition 2
2019-08-11 13:49:46,252 [Executor task launch worker-0] INFO - Found 201 records in Partition 4
2019-08-11 13:49:44,537 [Executor task launch worker-0] INFO - Found 203 records in Partition 0
2019-08-11 13:48:47,170 [Executor task launch worker-0] INFO - Found 204 records in Partition 1
2019-08-11 13:48:47,410 [Executor task launch worker-0] INFO - Found 177 records in Partition 2
2019-08-11 13:49:16,875 [Executor task launch worker-0] INFO - Found 201 records in Partition 4
Total records across partitions: 1581
Any suggestions from the community on solving this issue? TIA
From the Spark UI I can see that the executors have not been executing any tasks for a long time.
When I look at stderr in the executors tab, I see the logs below.
16/02/04 05:30:56 INFO storage.MemoryStore: Block broadcast_91 of size 153016 dropped from memory (free 6665239401)
16/02/04 06:11:20 WARN hdfs.DFSClient: Slow ReadProcessor read fields took 31337ms (threshold=30000ms); ack: seqno: 1240 status: SUCCESS status: SUCCESS status: SUCCESS downstreamAckTimeNanos: 4835789, targets: [DatanodeInfoWithStorage[10.25.36.18:1004,DS-f6e20cf7-0ccb-45aa-988f-f3310d5acf89,DISK], DatanodeInfoWithStorage[10.25.36.11:1004,DS-61ad0a2d-a6fd-402e-b0a1-61682d1755fb,DISK], DatanodeInfoWithStorage[10.25.36.5:1004,DS-c77503a2-0c7f-4b5c-8f4a-9c61cb4f18d7,DISK]]
I do not see any log output for a long time, and no errors either. It just keeps running.
Has anyone faced the same problem? How can we improve this?
Update:
It actually takes a long time in the saveAsTextFile() method.
We use Solr as our full-text search engine. We use multiple cores, and we have approximately five hundred million documents. We batch-update our index whenever we reach 1000 new records, without committing them, because we use auto-commit in Solr. Recently, when we batch-post new records, the CPU usage gets close to 100%.
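For reference, our indexing path looks roughly like this (a SolrJ sketch with placeholder host/core names; our real client code may differ):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

// Buffers documents and posts them in batches of 1000, with no explicit commit;
// Solr's auto-commit settings handle the commits.
void indexInBatches(List<SolrInputDocument> incomingDocs) throws SolrServerException, IOException {
    SolrClient solr = new HttpSolrClient.Builder("http://solr-host:8983/solr/core1").build();
    List<SolrInputDocument> buffer = new ArrayList<>();
    for (SolrInputDocument doc : incomingDocs) {
        buffer.add(doc);
        if (buffer.size() == 1000) {
            solr.add(buffer);
            buffer.clear();
        }
    }
}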
I dumped the high-CPU threads; most of them look like the following:
java.lang.Thread.State: RUNNABLE
at org.apache.lucene.search.ConjunctionTermScorer.doNext(ConjunctionTermScorer.java:64)
at org.apache.lucene.search.ConjunctionTermScorer.nextDoc(ConjunctionTermScorer.java:95)
at org.apache.lucene.search.Scorer.score(Scorer.java:63)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:605)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:297)
at org.apache.solr.search.SolrIndexSearcher.getDocSetNC(SolrIndexSearcher.java:1060)
at org.apache.solr.search.SolrIndexSearcher.getPositiveDocSet(SolrIndexSearcher.java:763)
at org.apache.solr.search.SolrIndexSearcher.getProcessedFilter(SolrIndexSearcher.java:880)
at org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1337)
at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1304)
at org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:395)
at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:410)
at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:208)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1797)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:637)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:343)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)