Optaplanner's benchmark warm up - OutOfMemory - java

While trying to test the solvers for my solution using a benchmark configuration, I encounter the following exception:
2021-12-22 15:24:37.328 WARN 22684 --- [ Test worker] c.o.b.i.D.singleBenchmarkRunnerException : The warm up singleBenchmarkRunner (Problem_0_Currently used_0) with random seed (null) failed.
java.lang.OutOfMemoryError: Java heap space
2021-12-22 15:24:37.329 WARN 22684 --- [ Test worker] c.o.b.i.D.singleBenchmarkRunnerException : The warm up singleBenchmarkRunner (Problem_0_Currently used_0) with random seed (null) failed.
java.lang.OutOfMemoryError: Java heap space
2021-12-22 15:24:37.330 WARN 22684 --- [ Test worker] c.o.b.i.D.singleBenchmarkRunnerException : The warm up singleBenchmarkRunner (Problem_0_Currently used_0) with random seed (null) failed.
java.lang.OutOfMemoryError: Java heap space
at java.base/java.lang.Long.valueOf(Long.java:1207)
at myrostering.solver.PEC.LambdaExtractorEC9F24820AB70C5865CE63ED29F967E9.apply(LambdaExtractorEC9F24820AB70C5865CE63ED29F967E9.java:69)
at myrostering.solver.PEC.LambdaExtractorEC9F24820AB70C5865CE63ED29F967E9.apply(LambdaExtractorEC9F24820AB70C5865CE63ED29F967E9.java:1)
at org.drools.model.functions.Function1$Impl.apply(Function1.java:35)
at org.drools.modelcompiler.constraints.LambdaReadAccessor.getValue(LambdaReadAccessor.java:42)
at org.drools.core.rule.Declaration.getValue(Declaration.java:258)
at org.drools.core.rule.Declaration.getValue(Declaration.java:253)
at org.drools.modelcompiler.constraints.BindingEvaluator.getArgument(BindingEvaluator.java:59)
at org.drools.modelcompiler.constraints.ConstraintEvaluator$InnerEvaluator.getArgument(ConstraintEvaluator.java:242)
at org.drools.modelcompiler.constraints.ConstraintEvaluator$InnerEvaluator$_2.evaluate(ConstraintEvaluator.java:309)
at org.drools.modelcompiler.constraints.ConstraintEvaluator.evaluate(ConstraintEvaluator.java:124)
at org.drools.modelcompiler.constraints.LambdaConstraint.isAllowedCachedLeft(LambdaConstraint.java:187)
at org.drools.core.common.SingleBetaConstraints.isAllowedCachedLeft(SingleBetaConstraints.java:132)
at org.drools.core.phreak.PhreakAccumulateNode.doLeftInserts(PhreakAccumulateNode.java:178)
at org.drools.core.phreak.PhreakAccumulateNode.doNode(PhreakAccumulateNode.java:89)
at org.drools.core.phreak.RuleNetworkEvaluator.switchOnDoBetaNode(RuleNetworkEvaluator.java:591)
at org.drools.core.phreak.RuleNetworkEvaluator.evalBetaNode(RuleNetworkEvaluator.java:558)
at org.drools.core.phreak.RuleNetworkEvaluator.evalNode(RuleNetworkEvaluator.java:385)
at org.drools.core.phreak.RuleNetworkEvaluator.innerEval(RuleNetworkEvaluator.java:345)
at org.drools.core.phreak.RuleNetworkEvaluator.outerEval(RuleNetworkEvaluator.java:181)
at org.drools.core.phreak.RuleNetworkEvaluator.evaluateNetwork(RuleNetworkEvaluator.java:139)
at org.drools.core.phreak.RuleExecutor.reEvaluateNetwork(RuleExecutor.java:235)
at org.drools.core.phreak.RuleExecutor.evaluateNetworkAndFire(RuleExecutor.java:91)
at org.drools.core.concurrent.AbstractRuleEvaluator.internalEvaluateAndFire(AbstractRuleEvaluator.java:33)
at org.drools.core.concurrent.SequentialRuleEvaluator.evaluateAndFire(SequentialRuleEvaluator.java:43)
at org.drools.core.common.DefaultAgenda.fireLoop(DefaultAgenda.java:753)
at org.drools.core.common.DefaultAgenda.internalFireAllRules(DefaultAgenda.java:700)
at org.drools.core.common.DefaultAgenda.fireAllRules(DefaultAgenda.java:692)
at org.drools.core.impl.StatefulKnowledgeSessionImpl.internalFireAllRules(StatefulKnowledgeSessionImpl.java:1225)
at org.drools.core.impl.StatefulKnowledgeSessionImpl.fireAllRules(StatefulKnowledgeSessionImpl.java:1216)
at org.drools.core.impl.StatefulKnowledgeSessionImpl.fireAllRules(StatefulKnowledgeSessionImpl.java:1200)
at org.optaplanner.core.impl.score.director.drools.DroolsScoreDirector.calculateScore(DroolsScoreDirector.java:105)
Here is the test class I ran:
@SpringBootTest(classes = MyApplication.class)
@EnableConfigurationProperties({ApplicationProperties.class, MyRosterProperties.class})
public class SolverBenchmarkTest {

    private PlannerBenchmarkFactory benchmarkFactory = PlannerBenchmarkFactory.createFromXmlResource(
            "myrostering/benchmark/benchmarkSolverConfig.xml");

    @Autowired
    MyRosterGenerator myRosterGenerator;

    @Test
    public void benchmarkBasicRostering() {
        MyRoster mr = myRosterGenerator.createMyRoster();
        PlannerBenchmark benchmark = benchmarkFactory.buildPlannerBenchmark(mr);
        benchmark.benchmarkAndShowReportInBrowser();
    }
}
Here is the benchmark configuration file:
<?xml version="1.0" encoding="UTF-8"?>
<plannerBenchmark xmlns="https://www.optaplanner.org/xsd/benchmark" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                  xsi:schemaLocation="https://www.optaplanner.org/xsd/benchmark https://www.optaplanner.org/xsd/benchmark/benchmark.xsd">
  <benchmarkDirectory>local/benchmark/data/my-roster</benchmarkDirectory>
  <parallelBenchmarkCount>AUTO</parallelBenchmarkCount>
  <warmUpSecondsSpentLimit>30</warmUpSecondsSpentLimit>
  <inheritedSolverBenchmark>
    <solver>
      <!-- This part of the solver configuration must be the same as the one used by the planner, otherwise, the benchmark test is pointless -->
      <moveThreadCount>4</moveThreadCount>
      <solutionClass>myrostering.domain.MyRoster</solutionClass>
      <entityClass>myrostering.domain.Assignment</entityClass>
      <scoreDirectorFactory>
        <scoreDrl>myrostering/solver/myRosteringScoreRules.drl</scoreDrl>
      </scoreDirectorFactory>
      <termination>
        <!-- Adding this secondsSpentLimit (contrary to no limit set for the planner) to avoid the benchmark running for too long -->
        <secondsSpentLimit>60</secondsSpentLimit>
        <bestScoreLimit>0hard/0medium/0soft</bestScoreLimit>
      </termination>
      <constructionHeuristic>
        <constructionHeuristicType>STRONGEST_FIT</constructionHeuristicType>
      </constructionHeuristic>
    </solver>
  </inheritedSolverBenchmark>
  <solverBenchmark>
    <name>Currently used</name>
    <solver>
      <localSearch>
        <unionMoveSelector>
          <moveListFactory>
            <cacheType>PHASE</cacheType>
            <moveListFactoryClass>myrostering.solver.move.factory.ChangeMoveFactory</moveListFactoryClass>
          </moveListFactory>
          <moveListFactory>
            <cacheType>PHASE</cacheType>
            <moveListFactoryClass>myrostering.solver.move.factory.SwapMoveFactory</moveListFactoryClass>
          </moveListFactory>
        </unionMoveSelector>
        <acceptor>
          <entityTabuSize>5</entityTabuSize>
          <simulatedAnnealingStartingTemperature>15000hard/10medium/1000soft</simulatedAnnealingStartingTemperature>
        </acceptor>
        <forager>
          <acceptedCountLimit>4</acceptedCountLimit>
        </forager>
      </localSearch>
    </solver>
  </solverBenchmark>
</plannerBenchmark>
Also, I'd like to add that we run solver.solve() without an issue - even though the dataset is quite large (150 to 300 MB for the file containing the serialized solution). So I'm a bit surprised that the benchmark fails on warm-up...
EDIT:
I've changed the configuration for these two parameters:
<parallelBenchmarkCount>1</parallelBenchmarkCount>
...
<secondsSpentLimit>600</secondsSpentLimit>
But I still got the following exception:
2022-01-03 10:53:49.850 INFO 21696 --- [nchmarkThread-1] o.d.c.kie.builder.impl.KieContainerImpl : Start creation of KieBase: defaultKieBase
2022-01-03 10:53:49.909 INFO 21696 --- [nchmarkThread-1] o.d.c.kie.builder.impl.KieContainerImpl : End creation of KieBase: defaultKieBase
2022-01-03 10:54:32.585 INFO 21696 --- [nchmarkThread-1] o.o.core.impl.solver.DefaultSolver : Solving started: time spent (41506), best score (-38295462hard/38260medium/3640soft), environment mode (REPRODUCIBLE), move thread count (4), random (JDK with seed 0).
2022-01-03 10:54:33.611 ERROR 21696 --- [nchmarkThread-1] o.o.core.impl.solver.thread.ThreadUtils : Multithreaded Local Search's ExecutorService didn't terminate within timeout (1 seconds).
2022-01-03 10:54:33.611 INFO 21696 --- [nchmarkThread-1] o.o.c.i.h.thread.MoveThreadRunner : Score calculation speed will be too low because move thread (0)'s destroy wasn't processed soon enough.
2022-01-03 10:54:33.611 INFO 21696 --- [nchmarkThread-1] o.o.c.i.h.thread.MoveThreadRunner : Score calculation speed will be too low because move thread (1)'s destroy wasn't processed soon enough.
2022-01-03 10:54:33.611 INFO 21696 --- [nchmarkThread-1] o.o.c.i.h.thread.MoveThreadRunner : Score calculation speed will be too low because move thread (2)'s destroy wasn't processed soon enough.
2022-01-03 10:54:33.611 INFO 21696 --- [nchmarkThread-1] o.o.c.i.h.thread.MoveThreadRunner : Score calculation speed will be too low because move thread (3)'s destroy wasn't processed soon enough.
2022-01-03 10:54:33.612 INFO 21696 --- [nchmarkThread-1] .c.i.c.DefaultConstructionHeuristicPhase : Construction Heuristic phase (0) ended: time spent (42533), best score (-38295462hard/38260medium/3640soft), score calculation speed (0/sec), step total (0).
2022-01-03 10:56:00.115 WARN 21696 --- [ Test worker] c.o.b.i.D.singleBenchmarkRunnerException : The subSingleBenchmarkRunner (Problem_0_Currently used_0) with random seed (null) failed.
java.lang.OutOfMemoryError: Java heap space
at java.base/java.util.Arrays.copyOf(Arrays.java:3480)
at java.base/java.util.ArrayList.grow(ArrayList.java:237)
at java.base/java.util.ArrayList.grow(ArrayList.java:244)
at java.base/java.util.ArrayList.add(ArrayList.java:454)
at java.base/java.util.ArrayList.add(ArrayList.java:467)
at be.myrostering.solver.move.factory.MySwapMoveFactory.createMoveList(MySwapMoveFactory.java:50)
at be.myrostering.solver.move.factory.MySwapMoveFactory.createMoveList(MySwapMoveFactory.java:30)
at org.optaplanner.core.impl.heuristic.selector.move.factory.MoveListFactoryToMoveSelectorBridge.constructCache(MoveListFactoryToMoveSelectorBridge.java:72)
at org.optaplanner.core.impl.heuristic.selector.common.SelectionCacheLifecycleBridge.phaseStarted(SelectionCacheLifecycleBridge.java:51)
at org.optaplanner.core.impl.phase.event.PhaseLifecycleSupport.firePhaseStarted(PhaseLifecycleSupport.java:37)
at org.optaplanner.core.impl.heuristic.selector.AbstractSelector.phaseStarted(AbstractSelector.java:50)
at org.optaplanner.core.impl.phase.event.PhaseLifecycleSupport.firePhaseStarted(PhaseLifecycleSupport.java:37)
at org.optaplanner.core.impl.heuristic.selector.AbstractSelector.phaseStarted(AbstractSelector.java:50)
at org.optaplanner.core.impl.localsearch.decider.LocalSearchDecider.phaseStarted(LocalSearchDecider.java:94)
at org.optaplanner.core.impl.localsearch.decider.MultiThreadedLocalSearchDecider.phaseStarted(MultiThreadedLocalSearchDecider.java:92)
at org.optaplanner.core.impl.localsearch.DefaultLocalSearchPhase.phaseStarted(DefaultLocalSearchPhase.java:141)
at org.optaplanner.core.impl.localsearch.DefaultLocalSearchPhase.solve(DefaultLocalSearchPhase.java:82)
at org.optaplanner.core.impl.solver.AbstractSolver.runPhases(AbstractSolver.java:99)
at org.optaplanner.core.impl.solver.DefaultSolver.solve(DefaultSolver.java:192)
at org.optaplanner.benchmark.impl.SubSingleBenchmarkRunner.call(SubSingleBenchmarkRunner.java:122)
at org.optaplanner.benchmark.impl.SubSingleBenchmarkRunner.call(SubSingleBenchmarkRunner.java:42)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
at java.base/java.lang.Thread.run(Thread.java:832)
2022-01-03 10:56:00.603 INFO 21696 --- [ Test worker] o.o.b.impl.report.BenchmarkReport : Generating benchmark report...
VERSION_2_3_31
java.lang.NoSuchFieldError: VERSION_2_3_31
at org.optaplanner.benchmark.impl.report.BenchmarkReport.writeHtmlOverviewFile(BenchmarkReport.java:828)
at org.optaplanner.benchmark.impl.report.BenchmarkReport.writeReport(BenchmarkReport.java:318)
at org.optaplanner.benchmark.impl.DefaultPlannerBenchmark.benchmarkingEnded(DefaultPlannerBenchmark.java:311)
at org.optaplanner.benchmark.impl.DefaultPlannerBenchmark.benchmark(DefaultPlannerBenchmark.java:100)
at org.optaplanner.benchmark.impl.DefaultPlannerBenchmark.benchmarkAndShowReportInBrowser(DefaultPlannerBenchmark.java:424)
On a final note, it appears that the problem might not be linked to OptaPlanner (because the out-of-memory is triggered in MySwapMoveFactory) - if so, I'll close this post. But it would still be odd that it works when running the solver but not the benchmark...

MoveListFactory scales badly, consuming a lot of memory and CPU.
Some kinds of moves have billions of possible moves. For example, a 3-swap on 10,000 shifts has 10,000³ = 1 trillion moves. That doesn't fit into a few GB of RAM and it takes ages to generate.
Use a MoveIteratorFactory instead and don't generate a list of moves; generate them Just In Time, just like the default selectors do. See the docs, and the sketch below.
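For illustration, a minimal sketch of a just-in-time swap factory, assuming the OptaPlanner 8.x MoveIteratorFactory signatures and the question's own (hypothetical here) MyRoster, Assignment and MySwapMove classes plus a getAssignmentList() accessor - adapt the names and generics to your actual domain and version:

import java.util.Iterator;
import java.util.List;
import java.util.Random;

import org.optaplanner.core.api.score.director.ScoreDirector;
import org.optaplanner.core.impl.heuristic.selector.move.factory.MoveIteratorFactory;

// Sketch only: MyRoster, Assignment and MySwapMove (assumed to implement Move<MyRoster>)
// come from the question's domain and are not defined here.
public class MySwapMoveIteratorFactory implements MoveIteratorFactory<MyRoster, MySwapMove> {

    @Override
    public long getSize(ScoreDirector<MyRoster> scoreDirector) {
        long n = scoreDirector.getWorkingSolution().getAssignmentList().size();
        return n * (n - 1) / 2; // number of unordered assignment pairs, never materialized
    }

    @Override
    public Iterator<MySwapMove> createOriginalMoveIterator(ScoreDirector<MyRoster> scoreDirector) {
        // Exhaustive ordering isn't needed for random local search; omitted in this sketch.
        throw new UnsupportedOperationException("Only random selection is supported");
    }

    @Override
    public Iterator<MySwapMove> createRandomMoveIterator(ScoreDirector<MyRoster> scoreDirector,
            Random workingRandom) {
        List<Assignment> assignments = scoreDirector.getWorkingSolution().getAssignmentList();
        return new Iterator<MySwapMove>() {
            @Override
            public boolean hasNext() {
                return assignments.size() >= 2;
            }

            @Override
            public MySwapMove next() {
                // One cheap move at a time, instead of a giant pre-built list of all pairs.
                Assignment left = assignments.get(workingRandom.nextInt(assignments.size()));
                Assignment right = assignments.get(workingRandom.nextInt(assignments.size()));
                return new MySwapMove(left, right);
            }
        };
    }
}

In the benchmark XML, the <moveListFactory> elements (and their PHASE cache) would then be replaced by <moveIteratorFactory> elements pointing at such classes, as described in the move selector section of the docs, so nothing has to be pre-generated and held in memory.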

Increase memory, for example with the VM option -Xmx4g.
Also note that parallelBenchmarkCount AUTO currently doesn't take into account that moveThreadCount is not NONE, so your benchmarks will not be accurate: if you have 16 cores, parallelBenchmarkCount AUTO will resolve to 8. With moveThreadCount 4 (+ 1 solver thread) per benchmark, you'd need 32+ cores but only have 16. This should probably be reported as an issue in OptaPlanner's JIRA for parallelBenchmarkCount AUTO.

Related

Spring application shuts itself down at 200 DB connections

On a legacy production application we were having an issue where the application crashed because it ran out of connections (the default was 100 connections). As a temporary solution we decided to increase the available connections to 500, but when the application reached 200 connections it just stopped itself, with no errors in the logs, just like a simple shutdown.
I added a couple of logs that are generated every 15 seconds to clearly see the behavior of the connections; these logs print the idle and active connections as well as the full object of the Datasource properties. Before the application shut down, the following logs were added:
Datasource idle connections: 0, active connections: 200
Datasource properties: org.apache.tomcat.jdbc.pool.DataSource#20b2475a{ConnectionPool[defaultAutoCommit=null; defaultReadOnly=null; defaultTransactionIsolation=-1; defaultCatalog=null; driverClassName=com.mysql.jdbc.Driver; maxActive=500; maxIdle=500; minIdle=10; initialSize=10; maxWait=30000; testOnBorrow=true; testOnReturn=false; timeBetweenEvictionRunsMillis=5000; numTestsPerEvictionRun=0; minEvictableIdleTimeMillis=60000; testWhileIdle=false; testOnConnect=false; password=********; url=jdbc:mysql://127.0.0.1:3306/db_name?createDatabaseIfNotExist=true; username=username; validationQuery=SELECT 1; validationQueryTimeout=-1; validatorClassName=null; validationInterval=3000; accessToUnderlyingConnectionAllowed=true; removeAbandoned=false; removeAbandonedTimeout=60; logAbandoned=false; connectionProperties=null; initSQL=null; jdbcInterceptors=null; jmxEnabled=true; fairQueue=true; useEquals=true; abandonWhenPercentageFull=0; maxAge=0; useLock=false; dataSource=null; dataSourceJNDI=null; suspectTimeout=0; alternateUsernameAllowed=false; commitOnReturn=false; rollbackOnReturn=false; useDisposableConnectionFacade=true; logValidationErrors=false; propagateInterruptState=false; ignoreExceptionOnPreLoad=false; }
After that, the application shut itself down and I found the following logs, with no errors before them:
2021-02-03 20:23:02.618 INFO 1 --- [ Thread-4] ationConfigEmbeddedWebApplicationContext : Closing org.springframework.boot.context.embedded.AnnotationConfigEmbeddedWebApplicationContext#8807e25: startup date [Wed Feb 03 19:49:09 GMT 2021]; root of context hierarchy
2021-02-03 20:23:02.623 INFO 1 --- [ Thread-4] o.s.c.support.DefaultLifecycleProcessor : Stopping beans in phase 0
2021-02-03 20:23:02.643 INFO 1 --- [ Thread-4] o.s.j.e.a.AnnotationMBeanExporter : Unregistering JMX-exposed beans on shutdown
2021-02-03 20:23:02.647 INFO 1 --- [ Thread-4] j.LocalContainerEntityManagerFactoryBean : Closing JPA EntityManagerFactory for persistence unit 'default'
A couple of relevant dependencies and their versions:
org.springframework:spring-webmvc:jar:4.3.6.RELEASE:compile
org.springframework.boot:spring-boot-starter-data-jpa:jar:1.5.1.RELEASE:compile
org.springframework.boot:spring-boot-starter-jdbc:jar:1.5.1.RELEASE:compile
org.apache.tomcat:tomcat-jdbc:jar:8.5.11:compile
org.hibernate:hibernate-core:jar:5.0.11.Final:compile
org.springframework.data:spring-data-jpa:jar:1.11.0.RELEASE:compile
org.springframework.boot:spring-boot-starter-web:jar:1.5.1.RELEASE:compile
org.liquibase:liquibase-core:jar:3.5.1:compile
org.liquibase.ext:liquibase-hibernate5:jar:3.6:compile
Finally, I'm asking for help to understand why the application shuts itself down, and how I could fix it so that it is able to reach 500 connections.

SpringBatch - Step no longer executing: Step already complete or not restartable

I have a single-step Spring Batch application. The job is as follows:
@Bean
public Job databaseCursorJob(@Qualifier("databaseCursorStep") Step exampleJobStep,
                             JobBuilderFactory jobBuilderFactory) {
    return jobBuilderFactory.get("databaseCursorJob")
            .incrementer(new RunIdIncrementer())
            .flow(exampleJobStep)
            .end()
            .build();
}
I start the job from a Spring Boot application. This afternoon, I attempted to add a second step to the job. Essentially as follows:
@Bean
public Job databaseCursorJob(@Qualifier("databaseCursorStep") Step exampleJobStep,
                             JobBuilderFactory jobBuilderFactory) {
    return jobBuilderFactory.get("databaseCursorJob")
            .incrementer(new RunIdIncrementer())
            .flow(exampleJobStep).next(partitionStep())
            .end()
            .build();
}
In other words, I just added the ".next(partitionStep())" call. However, ever since I did this, the job finishes without executing any step (see the shell output below). In fact, even after removing the second step and going back to the original job, it refuses to execute the step. Before attempting to add the second step, I never once encountered this problem. I have gone so far as restarting my VM and it still skips the step. I am rather dead in the water until I resolve this. Grateful for any insights, thanks.
2020-09-01 14:49:00.260 INFO 6913 --- [ main] o.s.b.w.embedded.tomcat.TomcatWebServer : Tomcat started on port(s): 8087 (http) with context path ''
2020-09-01 14:49:00.263 INFO 6913 --- [ main] f.p.r.Application : Started Application in 7.752 seconds (JVM running for 9.092)
2020-09-01 14:49:00.268 INFO 6913 --- [ main] o.s.b.a.b.JobLauncherCommandLineRunner : Running default command line with: []
2020-09-01 14:49:00.579 INFO 6913 --- [ main] o.s.b.c.l.support.SimpleJobLauncher : Job: [FlowJob: [name=databaseCursorJob]] launched with the following parameters: [{}]
2020-09-01 14:49:00.698 INFO 6913 --- [ main] o.s.batch.core.job.SimpleStepHandler : Step already complete or not restartable, so no action to execute: StepExecution: id=120, version=4, name=databaseCursorStep, status=COMPLETED, exitStatus=COMPLETED, readCount=1, filterCount=0, writeCount=1 readSkipCount=0, writeSkipCount=0, processSkipCount=0, commitCount=2, rollbackCount=0, exitDescription=
2020-09-01 14:49:00.730 INFO 6913 --- [ main] o.s.b.c.l.support.SimpleJobLauncher : Job: [FlowJob: [name=databaseCursorJob]] completed with the following parameters: [{}] and the following status: [COMPLETED]
My issue was that my job had no way to recover if there was an error or if it got stuck in an unknown state. The step was not "already complete"; it never completed. Its status was still "STARTED" and its exit code "UNKNOWN", because it never exited. Anyway, my job repository is not in memory but persisted to a local DB, which is why it never resolved itself even after restarting the VM (shame on me for not remembering this). So I was able to fix it by wiping out the job instance history, although that was a band-aid: I still have to fix my code to prevent it from happening again.
I also learned I could diagnose this by examining the job repository in the database (it's all there).
I really resolved this thanks to Mr Hassine, who responded above several times and pointed me in the right direction. The solution to prevent it in the future is indeed addressed in the link he provided in his first response: Spring Batch error (A Job Instance Already Exists) and RunIdIncrementer generates only once
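As an illustration of that kind of fix (a sketch, not the code from the linked answer), one common way to guarantee a fresh JobInstance on every run is to launch the job with an always-unique parameter; the JobRunner wrapper and the "run.timestamp" parameter name below are made up for the example:

import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;

// Hypothetical wrapper: inject the JobLauncher and the job, then launch with a unique parameter.
public class JobRunner {

    private final JobLauncher jobLauncher;
    private final Job databaseCursorJob;

    public JobRunner(JobLauncher jobLauncher, Job databaseCursorJob) {
        this.jobLauncher = jobLauncher;
        this.databaseCursorJob = databaseCursorJob;
    }

    public void runOnce() throws Exception {
        // A unique parameter value forces a brand-new JobInstance, so a previous
        // execution left in STARTED/UNKNOWN state can no longer block the step.
        JobParameters params = new JobParametersBuilder()
                .addLong("run.timestamp", System.currentTimeMillis())
                .toJobParameters();
        jobLauncher.run(databaseCursorJob, params);
    }
}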

How to identify the optimum number of shuffle partition in Spark

I am running a Spark Structured Streaming job (it bounces every day) in EMR. I am getting an OOM error in my application after a few hours of execution, and it gets killed. The following are my configurations and Spark SQL code.
I am new to Spark and need your valuable input.
The EMR cluster has 10 instances with 16 cores and 64 GB of memory each.
Spark-Submit arguments:
num_of_executors: 17
executor_cores: 5
executor_memory: 19G
driver_memory: 30G
The job reads input as micro-batches from Kafka at an interval of 30 seconds. The average number of rows read per batch is 90k.
spark.streaming.kafka.maxRatePerPartition: 4500
spark.streaming.stopGracefullyOnShutdown: true
spark.streaming.unpersist: true
spark.streaming.kafka.consumer.cache.enabled: true
spark.hadoop.fs.s3.maxRetries: 30
spark.sql.shuffle.partitions: 2001
Spark SQL aggregation code:
dataset.groupBy(functions.col(NAME), functions.window(functions.column(TIMESTAMP_COLUMN), "30 seconds"))
        .agg(functions.concat_ws(SPLIT, functions.collect_list(DEPARTMENT)).as(DEPS))
        .select(NAME, DEPS)
        .map((row) -> {
            Map<String, Object> map = Maps.newHashMap();
            map.put(NAME, row.getString(0));
            map.put(DEPS, row.getString(1));
            return new KryoMapSerializationService().serialize(map);
        }, Encoders.BINARY());
Some logs from the driver:
20/04/04 13:10:51 INFO TaskSetManager: Finished task 1911.0 in stage 1041.0 (TID 1052055) in 374 ms on <host> (executor 3) (1998/2001)
20/04/04 13:10:52 INFO TaskSetManager: Finished task 1925.0 in stage 1041.0 (TID 1052056) in 411 ms on <host> (executor 3) (1999/2001)
20/04/04 13:10:52 INFO TaskSetManager: Finished task 1906.0 in stage 1041.0 (TID 1052054) in 776 ms on <host> (executor 3) (2000/2001)
20/04/04 13:11:04 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 3.
20/04/04 13:11:04 INFO DAGScheduler: Executor lost: 3 (epoch 522)
20/04/04 13:11:04 INFO BlockManagerMasterEndpoint: Trying to remove executor 3 from BlockManagerMaster.
20/04/04 13:11:04 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(3, <host>, 38533, None)
20/04/04 13:11:04 INFO BlockManagerMaster: Removed 3 successfully in removeExecutor
20/04/04 13:11:04 INFO YarnAllocator: Completed container container_1582797414408_1814_01_000004 on host: <host> (state: COMPLETE, exit status: 143)
And by the way, I am using collectAsList in my foreachBatch code:
List<Event> list = dataset.select("value")
        .selectExpr("deserialize(value) as rows")
        .select("rows.*")
        .selectExpr(NAME, DEPS)
        .as(Encoders.bean(Event.class))
        .collectAsList();
With these settings, you may be causing your own issues.
num_of_executors: 17
executor_cores: 5
executor_memory: 19G
driver_memory: 30G
You are basically creating extra containers here to have to shuffle between. Instead, start off with something like 10 executors, 15 cores, 60g memory. If that is working, then you can play with these a bit to try and optimize performance. I usually try splitting my containers in half each step (but I also haven't needed to do this since Spark 2.0).
Let Spark SQL keep the default at 200. The more you break this up, the more math you make Spark do to calculate the shuffles. If anything, I'd try to go with the same parallelism as the number of executors you have, so in this case just 10. When 2.0 came out, this is how you would tune Hive queries.
Making the job more complex just to break up the work puts all the load on the master.
Using Datasets and Encoders is also generally not as performant as going with straight DataFrame operations. I have found great lifts in performance by factoring this out into DataFrame operations.
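For example, a sketch of the same aggregation kept as untyped DataFrame operations; the column-name constants below are placeholders standing in for the question's NAME, TIMESTAMP_COLUMN, DEPARTMENT, DEPS and SPLIT, and inside foreachBatch the result would be written out directly rather than pulled to the driver with collectAsList:

import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class AggregationSketch {

    // Placeholder column names mirroring the question's constants.
    private static final String NAME = "name";
    private static final String TIMESTAMP_COLUMN = "timestamp";
    private static final String DEPARTMENT = "department";
    private static final String DEPS = "deps";
    private static final String SPLIT = ",";

    // Same grouping/aggregation, but no typed map(...) and no BINARY/Kryo encoding,
    // so Catalyst can optimize the whole plan and rows stay off the driver.
    public static Dataset<Row> aggregate(Dataset<Row> dataset) {
        return dataset
                .groupBy(col(NAME), window(col(TIMESTAMP_COLUMN), "30 seconds"))
                .agg(concat_ws(SPLIT, collect_list(col(DEPARTMENT))).as(DEPS))
                .select(col(NAME), col(DEPS));
        // In foreachBatch, write this DataFrame out (e.g. .write().parquet(path))
        // instead of calling collectAsList() on it.
    }
}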

Poor performance on Grails 3.2.9 application startup

Recently I've been doing an update of an application from Grails 2.5.5 to Grails 3.2.9.
The application is serving ~3K RPM.
The issue I currently have is poor performance after application startup. Our normal release process works in the following way (assuming a 2-node setup):
1. 2 nodes with the service are running.
2. Turn off the first node.
3. Release the new version on the first node.
4. The service on the first node registers itself in Eureka and starts consuming requests.
5. (Repeat the same on the second node.)
Problems start to appear at the 4th step. The application responds quite slowly and timing is really inconsistent - some responses are in the expected time range, but some are really off normal.
Sample logs:
2017-09-01 08:03:38,594 INFO [request][http-nio-12345-exec-72][][]- END controller=98ms
2017-09-01 08:03:38,911 INFO [request][http-nio-12345-exec-101][][]- END controller=134ms
2017-09-01 08:03:38,948 INFO [request][http-nio-12345-exec-56][][]- END controller=211ms
2017-09-01 08:03:39,156 INFO [request][http-nio-12345-exec-82][][]- END controller=95ms
2017-09-01 08:03:39,124 INFO [request][http-nio-12345-exec-111][][]- END controller=98ms
2017-09-01 08:03:39,184 INFO [request][http-nio-12345-exec-110][][]- END controller=4099ms
2017-09-01 08:03:39,399 INFO [request][http-nio-12345-exec-46][][]- END controller=24ms
2017-09-01 08:03:39,428 INFO [request][http-nio-12345-exec-43][][]- END controller=191ms
2017-09-01 08:03:39,744 INFO [request][http-nio-12345-exec-83][][]- END controller=117ms
2017-09-01 08:03:40,335 INFO [request][http-nio-12345-exec-56][][]- END controller=483ms
2017-09-01 08:03:45,595 INFO [request][http-nio-12345-exec-110][][]- END controller=5623ms
2017-09-01 08:03:45,618 INFO [request][http-nio-12345-exec-83][][]- END controller=5274ms
2017-09-01 08:03:45,629 INFO [request][http-nio-12345-exec-144][][]- END controller=2007ms
2017-09-01 08:03:45,671 INFO [request][http-nio-12345-exec-119][][]- END controller=4591ms
As you can see, a few requests completed in under 100 ms while some took more than 5 seconds.
My assumption is that this happens due to the slow warm-up of the Grails 3 application and lazy class loading.
Things I've already done:
grails.gorm.autowire = false
grails.gorm.reactor.events = false
Delay registration of the service in Eureka for 30 seconds (to wait until the application is fully loaded)
The next thing which comes to my mind is to compile the project with the @CompileStatic annotation.

GC overhead while running pig job, after hadoop job ends

I'm running a very simple Pig script (Pig 0.14, Hadoop 2.4):
customers = load '/some/hdfs/path' using SomeUDFLoader();
customers2 = foreach (group customers by customer_id) generate FLATTEN(group) as customer_id, MIN(dw_customer.date) as date;
store customers2 into '/hdfs/output' using PigStorage(',');
This launches a map-reduce job of ~60000 mappers, and 999 reducers.
After the map-reduce job has finished its work (I know because the output has been written, and the job manager says the job has succeeded), there is a long pause and I get the following error in the Pig output:
2015-11-24 11:45:29,394 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at *********
2015-11-24 11:45:29,403 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2015-11-24 11:46:03,533 [Service Thread] INFO org.apache.pig.impl.util.SpillableMemoryManager - first memory handler call- Usage threshold init = 698875904(682496K) used = 520031456(507843K) committed = 698875904(682496K) max = 698875904(682496K)
2015-11-24 11:46:04,473 [Service Thread] INFO org.apache.pig.impl.util.SpillableMemoryManager - first memory handler call - Collection threshold init = 698875904(682496K) used = 575405920(561919K) committed = 698875904(682496K) max = 698875904(682496K)
2015-11-24 11:47:36,255 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2998: Unhandled internal error. GC overhead limit exceeded
The stack trace looks something like this (each time the exception is in another function):
Pig Stack Trace
---------------
ERROR 2998: Unhandled internal error. Java heap space
java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.mapreduce.v2.api.records.impl.pb.CounterGroupPBImpl.initCounters(CounterGroupPBImpl.java:136)
at org.apache.hadoop.mapreduce.v2.api.records.impl.pb.CounterGroupPBImpl.getAllCounters(CounterGroupPBImpl.java:121)
at org.apache.hadoop.mapreduce.TypeConverter.fromYarn(TypeConverter.java:240)
at org.apache.hadoop.mapreduce.TypeConverter.fromYarn(TypeConverter.java:367)
at org.apache.hadoop.mapreduce.TypeConverter.fromYarn(TypeConverter.java:388)
at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskReports(ClientServiceDelegate.java:448)
at org.apache.hadoop.mapred.YARNRunner.getTaskReports(YARNRunner.java:551)
at org.apache.hadoop.mapreduce.Job$3.run(Job.java:533)
at org.apache.hadoop.mapreduce.Job$3.run(Job.java:531)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)
at org.apache.hadoop.mapreduce.Job.getTaskReports(Job.java:531)
at org.apache.pig.backend.hadoop.executionengine.shims.HadoopShims.getTaskReports(HadoopShims.java:235)
at org.apache.pig.tools.pigstats.mapreduce.MRJobStats.addMapReduceStatistics(MRJobStats.java:352)
at org.apache.pig.tools.pigstats.mapreduce.MRPigStatsUtil.addSuccessJobStats(MRPigStatsUtil.java:233)
at org.apache.pig.tools.pigstats.mapreduce.MRPigStatsUtil.accumulateStats(MRPigStatsUtil.java:165)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:360)
at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:280)
at org.apache.pig.PigServer.launchPlan(PigServer.java:1390)
...
My set of SET statements in the Pig script:
SET mapreduce.map.java.opts '-server -Xmx6144m -Djava.net.preferIPv4Stack=true -Duser.timezone=UTC'
SET mapreduce.reduce.java.opts '-server -Xmx6144m -Djava.net.preferIPv4Stack=true -Duser.timezone=UTC'
SET mapreduce.map.memory.mb '8192'
SET mapreduce.reduce.memory.mb '8192'
SET mapreduce.map.speculative 'true'
SET mapreduce.reduce.speculative 'true'
SET mapreduce.jobtracker.maxtasks.perjob '100000'
SET mapreduce.job.split.metainfo.maxsize '-1'
Why is this happening, and how can I fix it?
Thanks in advance for any help.
It looks like this is caused in your ApplicationMaster, since you mention that the error is returned after the execution of all mappers/reducers. Try increasing the memory of the ApplicationMaster.
In a YARN cluster, you can use the following two properties to control the amount of memory available to your ApplicationMaster:
yarn.app.mapreduce.am.command-opts
yarn.app.mapreduce.am.resource.mb
Again, you could set -Xmx (in the former) to 75% of the resource.mb value.
Details regarding the parameters can be found here.
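For example (a sketch only, reusing the SET convention from the script above and an assumed 4 GB ApplicationMaster container - adjust the sizes to your cluster):

SET yarn.app.mapreduce.am.resource.mb '4096'
SET yarn.app.mapreduce.am.command-opts '-Xmx3072m'
-- 3072m is roughly 75% of the 4096 MB container, leaving headroom for non-heap memory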
