GC overhead while running pig job, after hadoop job ends - java

I'm running a very simple pig script (Pig 0.14, Hadoop 2.4):
customers = load '/some/hdfs/path' using SomeUDFLoader();
customers2 = foreach (group customers by customer_id) generate FLATTEN(group) as customer_id, MIN(customers.date) as date;
store customers2 into '/hdfs/output' using PigStorage(',');
This launches a map-reduce job with ~60000 mappers and 999 reducers.
After the map-reduce job has finished its work (I know because the output has been written and the job manager says the job has succeeded), there is a long pause and I get the following error in the pig output:
2015-11-24 11:45:29,394 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at *********
2015-11-24 11:45:29,403 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2015-11-24 11:46:03,533 [Service Thread] INFO org.apache.pig.impl.util.SpillableMemoryManager - first memory handler call- Usage threshold init = 698875904(682496K) used = 520031456(507843K) committed = 698875904(682496K) max = 698875904(682496K)
2015-11-24 11:46:04,473 [Service Thread] INFO org.apache.pig.impl.util.SpillableMemoryManager - first memory handler call - Collection threshold init = 698875904(682496K) used = 575405920(561919K) committed = 698875904(682496K) max = 698875904(682496K)
2015-11-24 11:47:36,255 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2998: Unhandled internal error. GC overhead limit exceeded
The stack trace looks something like this (each time the exception is thrown from a different function):
Pig Stack Trace
---------------
ERROR 2998: Unhandled internal error. Java heap space
java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.mapreduce.v2.api.records.impl.pb.CounterGroupPBImpl.initCounters(CounterGroupPBImpl.java:136)
at org.apache.hadoop.mapreduce.v2.api.records.impl.pb.CounterGroupPBImpl.getAllCounters(CounterGroupPBImpl.java:121)
at org.apache.hadoop.mapreduce.TypeConverter.fromYarn(TypeConverter.java:240)
at org.apache.hadoop.mapreduce.TypeConverter.fromYarn(TypeConverter.java:367)
at org.apache.hadoop.mapreduce.TypeConverter.fromYarn(TypeConverter.java:388)
at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskReports(ClientServiceDelegate.java:448)
at org.apache.hadoop.mapred.YARNRunner.getTaskReports(YARNRunner.java:551)
at org.apache.hadoop.mapreduce.Job$3.run(Job.java:533)
at org.apache.hadoop.mapreduce.Job$3.run(Job.java:531)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)
at org.apache.hadoop.mapreduce.Job.getTaskReports(Job.java:531)
at org.apache.pig.backend.hadoop.executionengine.shims.HadoopShims.getTaskReports(HadoopShims.java:235)
at org.apache.pig.tools.pigstats.mapreduce.MRJobStats.addMapReduceStatistics(MRJobStats.java:352)
at org.apache.pig.tools.pigstats.mapreduce.MRPigStatsUtil.addSuccessJobStats(MRPigStatsUtil.java:233)
at org.apache.pig.tools.pigstats.mapreduce.MRPigStatsUtil.accumulateStats(MRPigStatsUtil.java:165)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:360)
at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:280)
at org.apache.pig.PigServer.launchPlan(PigServer.java:1390)
...
My SET statements in the pig script:
SET mapreduce.map.java.opts '-server -Xmx6144m -Djava.net.preferIPv4Stack=true -Duser.timezone=UTC'
SET mapreduce.reduce.java.opts '-server -Xmx6144m -Djava.net.preferIPv4Stack=true -Duser.timezone=UTC'
SET mapreduce.map.memory.mb '8192'
SET mapreduce.reduce.memory.mb '8192'
SET mapreduce.map.speculative 'true'
SET mapreduce.reduce.speculative 'true'
SET mapreduce.jobtracker.maxtasks.perjob '100000'
SET mapreduce.job.split.metainfo.maxsize '-1'
Why is this happening, and how can I fix it?
Thanks in advance for any help.

It looks like this is happening in your ApplicationMaster, since you mention that the error is returned after all of the mappers/reducers have finished. Try increasing the memory available to the ApplicationMaster.
In a YARN cluster, you can use the following two properties to control the amount of memory available to your ApplicationMaster:
yarn.app.mapreduce.am.command-opts
yarn.app.mapreduce.am.resource.mb
As a rule of thumb, set -Xmx (in the former) to about 75% of the resource.mb value.
Details regarding the parameters can be found in the Hadoop mapred-default documentation.
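For example, following the SET convention already used in the question's script (the sizes are illustrative, not prescriptive; keep -Xmx at roughly 75% of resource.mb):
SET yarn.app.mapreduce.am.resource.mb '4096'
SET yarn.app.mapreduce.am.command-opts '-Xmx3072m'
Note, though, that the stack trace above is thrown in the Pig client itself (MRJobStats collecting task reports for all ~60000 tasks), so if raising the AM memory doesn't help, the Pig client heap (e.g. via the PIG_HEAPSIZE environment variable) is the other place to look.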

Related

Optaplanner's benchmark warm up - OutOfMemory

While trying to test the solution's solvers using a benchmark configuration, I encounter the following exception:
2021-12-22 15:24:37.328 WARN 22684 --- [ Test worker] c.o.b.i.D.singleBenchmarkRunnerException : The warm up singleBenchmarkRunner (Problem_0_Currently used_0) with random seed (null) failed.
java.lang.OutOfMemoryError: Java heap space
2021-12-22 15:24:37.329 WARN 22684 --- [ Test worker] c.o.b.i.D.singleBenchmarkRunnerException : The warm up singleBenchmarkRunner (Problem_0_Currently used_0) with random seed (null) failed.
java.lang.OutOfMemoryError: Java heap space
2021-12-22 15:24:37.330 WARN 22684 --- [ Test worker] c.o.b.i.D.singleBenchmarkRunnerException : The warm up singleBenchmarkRunner (Problem_0_Currently used_0) with random seed (null) failed.
java.lang.OutOfMemoryError: Java heap space
at java.base/java.lang.Long.valueOf(Long.java:1207)
at myrostering.solver.PEC.LambdaExtractorEC9F24820AB70C5865CE63ED29F967E9.apply(LambdaExtractorEC9F24820AB70C5865CE63ED29F967E9.java:69)
at myrostering.solver.PEC.LambdaExtractorEC9F24820AB70C5865CE63ED29F967E9.apply(LambdaExtractorEC9F24820AB70C5865CE63ED29F967E9.java:1)
at org.drools.model.functions.Function1$Impl.apply(Function1.java:35)
at org.drools.modelcompiler.constraints.LambdaReadAccessor.getValue(LambdaReadAccessor.java:42)
at org.drools.core.rule.Declaration.getValue(Declaration.java:258)
at org.drools.core.rule.Declaration.getValue(Declaration.java:253)
at org.drools.modelcompiler.constraints.BindingEvaluator.getArgument(BindingEvaluator.java:59)
at org.drools.modelcompiler.constraints.ConstraintEvaluator$InnerEvaluator.getArgument(ConstraintEvaluator.java:242)
at org.drools.modelcompiler.constraints.ConstraintEvaluator$InnerEvaluator$_2.evaluate(ConstraintEvaluator.java:309)
at org.drools.modelcompiler.constraints.ConstraintEvaluator.evaluate(ConstraintEvaluator.java:124)
at org.drools.modelcompiler.constraints.LambdaConstraint.isAllowedCachedLeft(LambdaConstraint.java:187)
at org.drools.core.common.SingleBetaConstraints.isAllowedCachedLeft(SingleBetaConstraints.java:132)
at org.drools.core.phreak.PhreakAccumulateNode.doLeftInserts(PhreakAccumulateNode.java:178)
at org.drools.core.phreak.PhreakAccumulateNode.doNode(PhreakAccumulateNode.java:89)
at org.drools.core.phreak.RuleNetworkEvaluator.switchOnDoBetaNode(RuleNetworkEvaluator.java:591)
at org.drools.core.phreak.RuleNetworkEvaluator.evalBetaNode(RuleNetworkEvaluator.java:558)
at org.drools.core.phreak.RuleNetworkEvaluator.evalNode(RuleNetworkEvaluator.java:385)
at org.drools.core.phreak.RuleNetworkEvaluator.innerEval(RuleNetworkEvaluator.java:345)
at org.drools.core.phreak.RuleNetworkEvaluator.outerEval(RuleNetworkEvaluator.java:181)
at org.drools.core.phreak.RuleNetworkEvaluator.evaluateNetwork(RuleNetworkEvaluator.java:139)
at org.drools.core.phreak.RuleExecutor.reEvaluateNetwork(RuleExecutor.java:235)
at org.drools.core.phreak.RuleExecutor.evaluateNetworkAndFire(RuleExecutor.java:91)
at org.drools.core.concurrent.AbstractRuleEvaluator.internalEvaluateAndFire(AbstractRuleEvaluator.java:33)
at org.drools.core.concurrent.SequentialRuleEvaluator.evaluateAndFire(SequentialRuleEvaluator.java:43)
at org.drools.core.common.DefaultAgenda.fireLoop(DefaultAgenda.java:753)
at org.drools.core.common.DefaultAgenda.internalFireAllRules(DefaultAgenda.java:700)
at org.drools.core.common.DefaultAgenda.fireAllRules(DefaultAgenda.java:692)
at org.drools.core.impl.StatefulKnowledgeSessionImpl.internalFireAllRules(StatefulKnowledgeSessionImpl.java:1225)
at org.drools.core.impl.StatefulKnowledgeSessionImpl.fireAllRules(StatefulKnowledgeSessionImpl.java:1216)
at org.drools.core.impl.StatefulKnowledgeSessionImpl.fireAllRules(StatefulKnowledgeSessionImpl.java:1200)
at org.optaplanner.core.impl.score.director.drools.DroolsScoreDirector.calculateScore(DroolsScoreDirector.java:105)
Here is the test class I ran:
@SpringBootTest(classes = MyApplication.class)
@EnableConfigurationProperties({ApplicationProperties.class, MyRosterProperties.class})
public class SolverBenchmarkTest {

    private PlannerBenchmarkFactory benchmarkFactory = PlannerBenchmarkFactory.createFromXmlResource(
            "myrostering/benchmark/benchmarkSolverConfig.xml");

    @Autowired
    MyRosterGenerator myRosterGenerator;

    @Test
    public void benchmarkBasicRostering() {
        MyRoster mr = myRosterGenerator.createMyRoster();
        PlannerBenchmark benchmark = benchmarkFactory.buildPlannerBenchmark(mr);
        benchmark.benchmarkAndShowReportInBrowser();
    }
}
Here is the benchmark configuration file:
<?xml version="1.0" encoding="UTF-8"?>
<plannerBenchmark xmlns="https://www.optaplanner.org/xsd/benchmark" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                  xsi:schemaLocation="https://www.optaplanner.org/xsd/benchmark https://www.optaplanner.org/xsd/benchmark/benchmark.xsd">
  <benchmarkDirectory>local/benchmark/data/my-roster</benchmarkDirectory>
  <parallelBenchmarkCount>AUTO</parallelBenchmarkCount>
  <warmUpSecondsSpentLimit>30</warmUpSecondsSpentLimit>
  <inheritedSolverBenchmark>
    <solver>
      <!-- This part of the solver configuration must be the same as the one used by the planner, otherwise, the benchmark test is pointless -->
      <moveThreadCount>4</moveThreadCount>
      <solutionClass>myrostering.domain.MyRoster</solutionClass>
      <entityClass>myrostering.domain.Assignment</entityClass>
      <scoreDirectorFactory>
        <scoreDrl>myrostering/solver/myRosteringScoreRules.drl</scoreDrl>
      </scoreDirectorFactory>
      <termination>
        <!-- Adding this secondsSpentLimit (contrary to no limit set for the planner) to avoid the benchmark running for too long -->
        <secondsSpentLimit>60</secondsSpentLimit>
        <bestScoreLimit>0hard/0medium/0soft</bestScoreLimit>
      </termination>
      <constructionHeuristic>
        <constructionHeuristicType>STRONGEST_FIT</constructionHeuristicType>
      </constructionHeuristic>
    </solver>
  </inheritedSolverBenchmark>
  <solverBenchmark>
    <name>Currently used</name>
    <solver>
      <localSearch>
        <unionMoveSelector>
          <moveListFactory>
            <cacheType>PHASE</cacheType>
            <moveListFactoryClass>myrostering.solver.move.factory.ChangeMoveFactory</moveListFactoryClass>
          </moveListFactory>
          <moveListFactory>
            <cacheType>PHASE</cacheType>
            <moveListFactoryClass>myrostering.solver.move.factory.SwapMoveFactory</moveListFactoryClass>
          </moveListFactory>
        </unionMoveSelector>
        <acceptor>
          <entityTabuSize>5</entityTabuSize>
          <simulatedAnnealingStartingTemperature>15000hard/10medium/1000soft</simulatedAnnealingStartingTemperature>
        </acceptor>
        <forager>
          <acceptedCountLimit>4</acceptedCountLimit>
        </forager>
      </localSearch>
    </solver>
  </solverBenchmark>
</plannerBenchmark>
Also, I'd like to add that we run solver.solve() without an issue, even though the dataset is quite large (150 to 300 MB for the file containing the serialized solution). So I'm a bit surprised that the benchmark fails on warm up...
EDIT:
I've changed the configuration for these two parameters:
<parallelBenchmarkCount>1</parallelBenchmarkCount>
...
<secondsSpentLimit>600</secondsSpentLimit>
But I still got the following exception:
2022-01-03 10:53:49.850 INFO 21696 --- [nchmarkThread-1] o.d.c.kie.builder.impl.KieContainerImpl : Start creation of KieBase: defaultKieBase
2022-01-03 10:53:49.909 INFO 21696 --- [nchmarkThread-1] o.d.c.kie.builder.impl.KieContainerImpl : End creation of KieBase: defaultKieBase
2022-01-03 10:54:32.585 INFO 21696 --- [nchmarkThread-1] o.o.core.impl.solver.DefaultSolver : Solving started: time spent (41506), best score (-38295462hard/38260medium/3640soft), environment mode (REPRODUCIBLE), move thread count (4), random (JDK with seed 0).
2022-01-03 10:54:33.611 ERROR 21696 --- [nchmarkThread-1] o.o.core.impl.solver.thread.ThreadUtils : Multithreaded Local Search's ExecutorService didn't terminate within timeout (1 seconds).
2022-01-03 10:54:33.611 INFO 21696 --- [nchmarkThread-1] o.o.c.i.h.thread.MoveThreadRunner : Score calculation speed will be too low because move thread (0)'s destroy wasn't processed soon enough.
2022-01-03 10:54:33.611 INFO 21696 --- [nchmarkThread-1] o.o.c.i.h.thread.MoveThreadRunner : Score calculation speed will be too low because move thread (1)'s destroy wasn't processed soon enough.
2022-01-03 10:54:33.611 INFO 21696 --- [nchmarkThread-1] o.o.c.i.h.thread.MoveThreadRunner : Score calculation speed will be too low because move thread (2)'s destroy wasn't processed soon enough.
2022-01-03 10:54:33.611 INFO 21696 --- [nchmarkThread-1] o.o.c.i.h.thread.MoveThreadRunner : Score calculation speed will be too low because move thread (3)'s destroy wasn't processed soon enough.
2022-01-03 10:54:33.612 INFO 21696 --- [nchmarkThread-1] .c.i.c.DefaultConstructionHeuristicPhase : Construction Heuristic phase (0) ended: time spent (42533), best score (-38295462hard/38260medium/3640soft), score calculation speed (0/sec), step total (0).
2022-01-03 10:56:00.115 WARN 21696 --- [ Test worker] c.o.b.i.D.singleBenchmarkRunnerException : The subSingleBenchmarkRunner (Problem_0_Currently used_0) with random seed (null) failed.
java.lang.OutOfMemoryError: Java heap space
at java.base/java.util.Arrays.copyOf(Arrays.java:3480)
at java.base/java.util.ArrayList.grow(ArrayList.java:237)
at java.base/java.util.ArrayList.grow(ArrayList.java:244)
at java.base/java.util.ArrayList.add(ArrayList.java:454)
at java.base/java.util.ArrayList.add(ArrayList.java:467)
at be.myrostering.solver.move.factory.MySwapMoveFactory.createMoveList(MySwapMoveFactory.java:50)
at be.myrostering.solver.move.factory.MySwapMoveFactory.createMoveList(MySwapMoveFactory.java:30)
at org.optaplanner.core.impl.heuristic.selector.move.factory.MoveListFactoryToMoveSelectorBridge.constructCache(MoveListFactoryToMoveSelectorBridge.java:72)
at org.optaplanner.core.impl.heuristic.selector.common.SelectionCacheLifecycleBridge.phaseStarted(SelectionCacheLifecycleBridge.java:51)
at org.optaplanner.core.impl.phase.event.PhaseLifecycleSupport.firePhaseStarted(PhaseLifecycleSupport.java:37)
at org.optaplanner.core.impl.heuristic.selector.AbstractSelector.phaseStarted(AbstractSelector.java:50)
at org.optaplanner.core.impl.phase.event.PhaseLifecycleSupport.firePhaseStarted(PhaseLifecycleSupport.java:37)
at org.optaplanner.core.impl.heuristic.selector.AbstractSelector.phaseStarted(AbstractSelector.java:50)
at org.optaplanner.core.impl.localsearch.decider.LocalSearchDecider.phaseStarted(LocalSearchDecider.java:94)
at org.optaplanner.core.impl.localsearch.decider.MultiThreadedLocalSearchDecider.phaseStarted(MultiThreadedLocalSearchDecider.java:92)
at org.optaplanner.core.impl.localsearch.DefaultLocalSearchPhase.phaseStarted(DefaultLocalSearchPhase.java:141)
at org.optaplanner.core.impl.localsearch.DefaultLocalSearchPhase.solve(DefaultLocalSearchPhase.java:82)
at org.optaplanner.core.impl.solver.AbstractSolver.runPhases(AbstractSolver.java:99)
at org.optaplanner.core.impl.solver.DefaultSolver.solve(DefaultSolver.java:192)
at org.optaplanner.benchmark.impl.SubSingleBenchmarkRunner.call(SubSingleBenchmarkRunner.java:122)
at org.optaplanner.benchmark.impl.SubSingleBenchmarkRunner.call(SubSingleBenchmarkRunner.java:42)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
at java.base/java.lang.Thread.run(Thread.java:832)
2022-01-03 10:56:00.603 INFO 21696 --- [ Test worker] o.o.b.impl.report.BenchmarkReport : Generating benchmark report...
VERSION_2_3_31
java.lang.NoSuchFieldError: VERSION_2_3_31
at org.optaplanner.benchmark.impl.report.BenchmarkReport.writeHtmlOverviewFile(BenchmarkReport.java:828)
at org.optaplanner.benchmark.impl.report.BenchmarkReport.writeReport(BenchmarkReport.java:318)
at org.optaplanner.benchmark.impl.DefaultPlannerBenchmark.benchmarkingEnded(DefaultPlannerBenchmark.java:311)
at org.optaplanner.benchmark.impl.DefaultPlannerBenchmark.benchmark(DefaultPlannerBenchmark.java:100)
at org.optaplanner.benchmark.impl.DefaultPlannerBenchmark.benchmarkAndShowReportInBrowser(DefaultPlannerBenchmark.java:424)
On a final note, it appears that the problem might not be linked to OptaPlanner (because the out-of-memory is triggered in MySwapMoveFactory) - if so, I'll close this post. But it would still be odd that it works when running the solver but not the benchmark...
MoveListFactory scales badly, consuming a lot of memory and CPU.
Some kinds of moves have billions of possible moves: for example, a 3-swap on 10000 shifts yields 10000^3 = 1 trillion moves. That doesn't fit into a few GB of RAM, and it takes ages to generate.
Use a MoveIteratorFactory instead and don't generate a list of moves, but generate them just in time, like the default selectors do. See the docs.
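A minimal sketch of that approach, assuming OptaPlanner 8.x and the question's domain classes (MyRoster with a hypothetical getAssignmentList(), Assignment); MySwapMove stands in for whatever move MySwapMoveFactory used to build. Each move is produced on demand, so no PHASE-scoped ArrayList like the one in the stack trace is ever grown:
import java.util.Iterator;
import java.util.List;
import java.util.Random;

import org.optaplanner.core.api.score.director.ScoreDirector;
import org.optaplanner.core.impl.heuristic.selector.move.factory.MoveIteratorFactory;

public class MySwapMoveIteratorFactory implements MoveIteratorFactory<MyRoster, MySwapMove> {

    @Override
    public long getSize(ScoreDirector<MyRoster> scoreDirector) {
        long n = scoreDirector.getWorkingSolution().getAssignmentList().size();
        return n * (n - 1) / 2; // number of distinct pairs; reported, never materialized
    }

    @Override
    public Iterator<MySwapMove> createOriginalMoveIterator(ScoreDirector<MyRoster> scoreDirector) {
        // Only needed for ORIGINAL selection order; local search uses the random iterator.
        throw new UnsupportedOperationException();
    }

    @Override
    public Iterator<MySwapMove> createRandomMoveIterator(ScoreDirector<MyRoster> scoreDirector,
            Random workingRandom) {
        List<Assignment> assignments = scoreDirector.getWorkingSolution().getAssignmentList();
        return new Iterator<MySwapMove>() {
            @Override
            public boolean hasNext() {
                return assignments.size() >= 2; // effectively endless stream of random pairs
            }

            @Override
            public MySwapMove next() {
                // Generate each swap just in time instead of pre-building a move list.
                Assignment left = assignments.get(workingRandom.nextInt(assignments.size()));
                Assignment right = assignments.get(workingRandom.nextInt(assignments.size()));
                return new MySwapMove(left, right);
            }
        };
    }
}
In the benchmark XML, each <moveListFactory> block would then become a <moveIteratorFactory> with a <moveIteratorFactoryClass>, and the PHASE cacheType goes away since nothing is cached.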
Alternatively, increase memory, for example with the VM option -Xmx4g.
Also note that parallelBenchmarkCount AUTO currently doesn't take into account that moveThreadCount is not NONE, so your benchmarks will not be accurate: if you have 16 cores, parallelBenchmarkCount AUTO resolves to 8, and with moveThreadCount 4 (+ 1 solver thread) you'd be using 32+ threads on only 16 cores. This should probably be reported as an issue in OptaPlanner's jira for parallelBenchmarkCount AUTO.
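Until that is fixed, a workaround is to size it by hand in the benchmark XML above, e.g. with 16 cores and moveThreadCount 4 (illustrative arithmetic, not a hard rule):
<parallelBenchmarkCount>3</parallelBenchmarkCount> <!-- ~16 cores / (4 move threads + 1 solver thread) -->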

Mongo Connection issue with error "state should be: open"

I am running an event in an Akka actor system, where multiple actors query MongoDB and retrieve data. Each actor queries for 1000 documents (each document is about 9 KB).
When running an event that fires 14 actors to query MongoDB for 13000 documents, I experienced the exception below and am not sure why. Has anyone experienced this before?
2020-04-14 19:17:28,818 [erp-writer-actor-system-akka.actor.default-dispatcher-378] ERROR c.a.s.c.m.GlobalContextMongoClientService- 76cd7a80-83ef-4389-885a-be9caed77449 - Exception occured while reading data from cursor
java.lang.IllegalStateException: state should be: open
at com.mongodb.assertions.Assertions.isTrue(Assertions.java:70)
at com.mongodb.connection.DefaultServer.getConnection(DefaultServer.java:84)
at com.mongodb.binding.ClusterBinding$ClusterBindingConnectionSource.getConnection(ClusterBinding.java:86)
at com.mongodb.operation.QueryBatchCursor.getMore(QueryBatchCursor.java:203)
at com.mongodb.operation.QueryBatchCursor.hasNext(QueryBatchCursor.java:103)
at com.mongodb.MongoBatchCursorAdapter.hasNext(MongoBatchCursorAdapter.java:46)
at com.xyz.smartconnect.commons.mongoclient.GlobalContextMongoClientService.findWorkers(GlobalContextMongoClientService.java:145)
at com.xyz.smartconnect.actors.QueryWorkersActor.lambda$createReceive$0(QueryWorkersActor.java:40)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
at akka.actor.Actor$class.aroundReceive(Actor.scala:513)
at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:132)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:519)
at akka.actor.ActorCell.invoke(ActorCell.scala:488)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
at akka.dispatch.Mailbox.run(Mailbox.scala:224)
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Suppressed: java.lang.IllegalStateException: state should be: open
at com.mongodb.assertions.Assertions.isTrue(Assertions.java:70)
at com.mongodb.connection.DefaultServer.getConnection(DefaultServer.java:84)
at com.mongodb.binding.ClusterBinding$ClusterBindingConnectionSource.getConnection(ClusterBinding.java:86)
at com.mongodb.operation.QueryBatchCursor.killCursor(QueryBatchCursor.java:261)
at com.mongodb.operation.QueryBatchCursor.close(QueryBatchCursor.java:147)
at com.mongodb.MongoBatchCursorAdapter.close(MongoBatchCursorAdapter.java:41)
at com.xyz.smartconnect.commons.mongoclient.GlobalContextMongoClientService.findWorkers(GlobalContextMongoClientService.java:149)
After running multiple tests and analyzing the logs carefully, I found the root cause. Below are the details.
While the application was using a cursor to query data from MongoDB, the connection was released/closed. 'state should be: open' is complaining about a released connection.
In my case, my application experienced an OutOfMemoryError, which caused Spring to dispose beans and release connections. Here is the timeline of log events for this issue.
Since this is a memory issue in my case, fixing the memory issue will fix the exception below for me.
2020-04-19 12:57:32,981 [xyz-actor-system-akka.actor.default-dispatcher-72] ERROR a.a.ActorSystemImpl- - 413f9298-ca92-4744-913b-59934e4ce831 - exception on LARS’ timer thread
java.lang.OutOfMemoryError: GC overhead limit exceeded
at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:269)
at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)
at java.lang.Thread.run(Thread.java:748)
2020-04-19 12:57:43,649 [Thread-19] INFO o.s.c.s.DefaultLifecycleProcessor- - - Stopping beans in phase 2147483647
2020-04-19 12:58:13,483 [Thread-19] INFO o.s.j.e.a.AnnotationMBeanExporter- - - Unregistering JMX-exposed beans on shutdown
2020-04-19 12:58:45,186 [localhost-startStop-2] INFO c.a.s.ApplicationContextListener- - - >>>>>>>>> Disposing beans
2020-04-19 12:59:00,182 [localhost-startStop-2] INFO c.a.s.c.SpringBeanDisposer- - - Mongo connections are released.
2020-04-19 12:59:09,591 [xyz-actor-system-akka.actor.default-dispatcher-73] ERROR c.a.s.c.m.GlobalContextMongoClientService- - 413f9298-ca92-4744-913b-59934e4ce831 - Exception occured while reading data from cursor
java.lang.IllegalStateException: state should be: open
at com.mongodb.assertions.Assertions.isTrue(Assertions.java:70)
at com.mongodb.connection.DefaultServer.getDescription(DefaultServer.java:114)
at com.mongodb.binding.ClusterBinding$ClusterBindingConnectionSource.getServerDescription(ClusterBinding.java:81)
at com.mongodb.operation.QueryBatchCursor.initFromCommandResult(QueryBatchCursor.java:251)
at com.mongodb.operation.QueryBatchCursor.getMore(QueryBatchCursor.java:207)
at com.mongodb.operation.QueryBatchCursor.hasNext(QueryBatchCursor.java:103)
at com.mongodb.MongoBatchCursorAdapter.hasNext(MongoBatchCursorAdapter.java:46)
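For reference, the failing read happens inside the driver's cursor iteration. A hedged sketch of the pattern using the current com.mongodb.client API (the connection string, database, collection and batch size are illustrative, not the poster's actual findWorkers code):
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoCursor;
import org.bson.Document;

public class CursorReadSketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> workers =
                    client.getDatabase("globalContext").getCollection("workers");
            // try-with-resources guarantees the cursor is closed deterministically,
            // but it cannot protect against the connection pool being shut down
            // underneath it (here, by the OOM-triggered bean disposal).
            try (MongoCursor<Document> cursor = workers.find().batchSize(1000).iterator()) {
                while (cursor.hasNext()) { // the getMore that throws fires here
                    Document doc = cursor.next();
                    // process the document ...
                }
            }
        }
    }
}
This matches the timeline above: hasNext() triggers a getMore on a connection that the shutdown sequence has already released, which is exactly where 'state should be: open' surfaces.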

SparkR out of memory error

I have a 2 node test cluster on AWS with spark-2.0.0-bin-hadoop2.7 installed.
This is the code I'm using to launch the cluster.
./spark-ec2 -k blah -i blah.pem -r us-west-1 -s 1 -t r3.2xlarge launch --copy-aws-credentials blah
Viewing port 8080 shows 58.8 GB (0.0 B used) of memory after running these two lines in RStudio.
Sys.setenv(SPARK_HOME="/root/spark")
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
When I run this call and refresh the page on port 8080, the memory usage changes to 58.8 GB (53.8 GB used).
sparkR.session(master = "spark://[ip]:7077",
               sparkHome = '/root/spark',
               enableHiveSupport = FALSE)
When I try to create a Spark data frame from an R data frame that should consume 0.04857268 GB of memory, I get this error:
acquisition <- as.DataFrame(orig)
17/11/04 14:27:23 WARN TaskSetManager: Stage 0 contains a task of very large size (166360 KB). The maximum recommended task size is 100 KB.
Exception in thread "dispatcher-event-loop-1" java.lang.OutOfMemoryError: Java heap space
I tried adding this but get the same error.
options(java.parameters = "-Xmx2048m")
install.packages("rJava")
library(rJava)
I'm stuck. I've spent three weekends googling this issue and can't figure it out.
Thanks.
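One thing worth noting about the last attempt: options(java.parameters = ...) only configures the JVM that rJava itself launches; it has no effect on Spark's driver or executor JVMs, which spark-submit starts separately. A hedged sketch of raising the driver heap through sparkR.session instead (the 4g value is illustrative):
sparkR.session(master = "spark://[ip]:7077",
               sparkHome = '/root/spark',
               enableHiveSupport = FALSE,
               sparkConfig = list(spark.driver.memory = "4g"))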

Pig Error 1066, Backend error : -1; NegativeArraySizeException; UDF, joda-time, HBase

I'm getting an exception from a Pig script and haven't been able to nail down the cause. I'm fairly new to Pig and have searched for various topics based on the exception I'm getting, but haven't been able to find anything meaningful. From the grunt shell and log I've looked for different variations of these: unable to read pigs manifest file; java.lang.NegativeArraySizeException: -1; ERROR 1066: Unable to open iterator for alias F. Backend error : -1.
I'm using Hadoop version 2.0.0-cdh4.6.0 and Pig version 0.11.0, running from the Grunt shell.
My Pig script reads a file, does some manipulation on the data (including calling a Java UDF), joins to an HBase table, then DUMPs the output. Pretty simple. I can DUMP the intermediate result (alias B) and the data looks fine.
I've tested the Java function from Pig using the same input file and have seen it return values as I'd expect, and I've also tested the function locally outside the Pig script. The Java function is given a number of days since 01-01-1900 and uses joda-time v2.7 to return a Datetime. Initially, the UDF accepted a tuple as input. I've tried changing the UDF input format to Byte and most recently String, casting to Datetime in Pig upon return, but am still getting the same error. When I change my Pig script merely to not call the UDF, it works fine.
The NegativeArray error sounds like the data is out of whack for the Dump, possibly from some kind of format issue, but I don't see how.
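For concreteness, the kind of UDF described would look roughly like this (a hedged reconstruction, not the poster's actual monutil.geoloc.GridIDtoDatetime; the id layout is inferred from the SUBSTRING calls in alias B below):
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.joda.time.LocalDate;

public class GridIDtoDatetime extends EvalFunc<String> {
    // Day counts in the input are relative to 01-01-1900 (per the question).
    private static final LocalDate EPOCH_1900 = new LocalDate(1900, 1, 1);

    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null; // Pig convention: null in, null out
        }
        String id = input.get(0).toString();
        // Chars 9+ hold the day count (mirroring SUBSTRING(id,9,20) in the script).
        int days = Integer.parseInt(id.substring(9).trim());
        // Return an ISO date string that Pig's ToDate(...) can parse.
        return EPOCH_1900.plusDays(days).toString("yyyy-MM-dd");
    }
}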
Pig script
A = LOAD 'tst2_SplitGroupMax.txt' using PigStorage(',')
as (id:bytearray, year:int, doy:int, month:int, dayOfMonth:int,
awh_minTemp:double, awh_maxTemp:double,
nws_minTemp:double, nws_maxTemp:double,
wxs_minTemp:double, wxs_maxTemp:double,
tcc_minTemp:double, tcc_maxTemp:double
) ;
register /import/pool2/home/NA1000APP-TPSDM/ejbles/Test-0.0.1-SNAPSHOT-jar-with-dependencies.jar;
B = FOREACH A GENERATE id as msmtid, SUBSTRING(id,0,8) as gridid, SUBSTRING(id,9,20) as msmt_days,
year, doy, month, dayOfMonth,
CONCAT(CONCAT(CONCAT((chararray)year,'-'),CONCAT((chararray)month,'-')),(chararray)dayOfMonth) as msmt_dt,
ToDate(monutil.geoloc.GridIDtoDatetime(id)) as func_msmt_dt,
awh_minTemp, awh_maxTemp,
nws_minTemp, nws_maxTemp,
wxs_minTemp, wxs_maxTemp,
tcc_minTemp, tcc_maxTemp
;
E = LOAD 'hbase://wxgrid_detail' using org.apache.pig.backend.hadoop.hbase.HBaseStorage
('loc:country, loc:fips, loc:l1 ,loc:l2, loc:latitude, loc:longitude',
'-loadKey=true -caster=HBaseBinaryConverter')
as (wxgrid:bytearray, country:chararray, fips:chararray, l1:chararray, l2:chararray,
latitude:double, longitude:double);
F = join B by gridid, E by wxgrid;
DUMP F; --- This is where I get the exception
Here's an excerpt from what's returned in the Grunt shell -
2015-06-15 12:23:24,204 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.
2015-06-15 12:23:24,205 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_201502081759_916870 has failed! Stop running all dependent jobs
2015-06-15 12:23:24,205 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2015-06-15 12:23:24,221 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR: -1
2015-06-15 12:23:24,221 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
2015-06-15 12:23:24,223 [main] WARN org.apache.pig.tools.pigstats.ScriptState - unable to read pigs manifest file
2015-06-15 12:23:24,224 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:

HadoopVersion  PigVersion  UserId  StartedAt  FinishedAt  Features
2.0.0-cdh4.6.0  na1000app-tpsdm  2015-06-15 12:22:39  2015-06-15 12:23:24  HASH_JOIN

Failed!

Failed Jobs: JobId  Alias  Feature  Message  Outputs
job_201502081759_916870  A,B,E,F  HASH_JOIN  Message: Job failed!  hdfs://nameservice1/tmp/temp-238648079/tmp-1338617620,

Input(s): Failed to read data from "hbase://wxgrid_detail"  Failed to read data from "hdfs://nameservice1/user/na1000app-tpsdm/tst2_SplitGroupMax.txt"

Output(s): Failed to produce result in "hdfs://nameservice1/tmp/temp-238648079/tmp-1338617620"

Counters: Total records written : 0  Total bytes written : 0  Spillable Memory Manager spill count : 0  Total bags proactively spilled: 0  Total records proactively spilled: 0

Job DAG: job_201502081759_916870

2015-06-15 12:23:24,224 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2015-06-15 12:23:24,234 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias F. Backend error : -1
Details at logfile: /import/pool2/home/NA1000APP-TPSDM/ejbles/pig_1434388844905.log
And here's the log -
Backend error message
---------------------
java.lang.NegativeArraySizeException: -1
at org.apache.hadoop.hbase.util.Bytes.readByteArray(Bytes.java:148)
at org.apache.hadoop.hbase.mapreduce.TableSplit.readFields(TableSplit.java:133)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:73)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:44)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit.readFields(PigSplit.java:233)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:73)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:44)
at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:356)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:640)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
at org.apache.hadoop.mapred.Child$4.run(Ch
Pig Stack Trace
---------------
ERROR 1066: Unable to open iterator for alias F. Backend error : -1
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias F. Backend error : -1
at org.apache.pig.PigServer.openIterator(PigServer.java:828)
at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:696)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:320)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:194)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:170)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
at org.apache.pig.Main.run(Main.java:538)
at org.apache.pig.Main.main(Main.java:157)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
Caused by: java.lang.NegativeArraySizeException: -1
at org.apache.hadoop.hbase.util.Bytes.readByteArray(Bytes.java:148)
at org.apache.hadoop.hbase.mapreduce.TableSplit.readFields(TableSplit.java:133)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:73)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:44)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit.readFields(PigSplit.java:233)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:73)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:44)
at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:356)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:640)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)

Launching jobs in a for loop

I am confronted with a weird problem. I have a mapreduce class which looks for patterns in a file (the pattern file goes into the DistributedCache). Now I wanted to reuse this class to run for 1000 pattern files. I just had to extend the pattern-matching class and override its main and run functions. In the run of the child class I modify the command-line arguments and feed them to the parent's run() function. Everything goes well up until iteration 45-50. Suddenly all tasktrackers start to fail until no progress is made. I checked the HDFS, but 70% of space is still left. Does anybody have any idea why launching 50 jobs one by one causes difficulties for hadoop?
@Override
public int run(String[] args) throws Exception {
    // -patterns patternsDIR input/ output/
    List<String> files = getFiles(args[1]);
    String inputDataset = args[2];
    String outputDir = args[3];
    for (int i = 0; i < files.size(); i++) {
        // modifyArgs presumably substitutes files.get(i) into the arguments (poster's helper)
        String[] newArgs = modifyArgs(args);
        super.run(newArgs);
    }
    return 0;
}
EDIT: Just checked the job logs, this is the first error occurring:
2013-11-12 09:03:01,665 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:hduser cause:java.io.IOException: java.lang.OutOfMemoryError: Java heap space
2013-11-12 09:03:32,971 INFO org.apache.hadoop.mapred.JobInProgress: Task 'attempt_201311120807_0053_m_000053_0' has completed task_201311120807_0053_m_000053 successfully.
2013-11-12 09:07:51,717 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:hduser cause:java.io.IOException: java.lang.OutOfMemoryError: Java heap space
2013-11-12 09:08:05,973 INFO org.apache.hadoop.mapred.JobInProgress: Task 'attempt_201311120807_0053_m_000128_0' has completed task_201311120807_0053_m_000128 successfully.
2013-11-12 09:08:16,571 INFO org.apache.hadoop.mapred.JobInProgress: Task 'attempt_201311120807_0053_m_000130_0' has completed task_201311120807_0053_m_000130 successfully.
2013-11-12 09:08:16,571 WARN org.apache.hadoop.hdfs.LeaseRenewer: Failed to renew lease for [DFSClient_NONMAPREDUCE_1595161181_30] for 30 seconds. Will retry shortly ...
2013-11-12 09:08:27,175 INFO org.apache.hadoop.mapred.JobInProgress: Task 'attempt_201311120807_0053_m_000138_0' has completed task_201311120807_0053_m_000138 successfully.
2013-11-12 09:08:25,241 ERROR org.mortbay.log: EXCEPTION
java.lang.OutOfMemoryError: Java heap space
2013-11-12 09:08:25,241 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 54311, call heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@7fcb9c0a, false, false, true, 9834) from 10.1.1.13:55028: error: java.io.IOException: java.lang.OutOfMemoryError: Java heap space
java.io.IOException: java.lang.OutOfMemoryError: Java heap space
at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:62)
at java.lang.StringBuilder.<init>(StringBuilder.java:97)
at org.apache.hadoop.util.StringUtils.escapeString(StringUtils.java:435)
at org.apache.hadoop.mapred.Counters.escape(Counters.java:768)
at org.apache.hadoop.mapred.Counters.access$000(Counters.java:52)
at org.apache.hadoop.mapred.Counters$Counter.makeEscapedCompactString(Counters.java:111)
at org.apache.hadoop.mapred.Counters$Group.makeEscapedCompactString(Counters.java:221)
at org.apache.hadoop.mapred.Counters.makeEscapedCompactString(Counters.java:648)
at org.apache.hadoop.mapred.JobHistory$MapAttempt.logFinished(JobHistory.java:2276)
at org.apache.hadoop.mapred.JobInProgress.completedTask(JobInProgress.java:2636)
at org.apache.hadoop.mapred.JobInProgress.updateTaskStatus(JobInProgress.java:1222)
at org.apache.hadoop.mapred.JobTracker.updateTaskStatuses(JobTracker.java:4471)
at org.apache.hadoop.mapred.JobTracker.processHeartbeat(JobTracker.java:3306)
at org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:3001)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:587)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1432)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1428)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:416)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1426)
2013-11-12 09:08:16,571 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 54311, call heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@3269c671, false, false, true, 9841) from 10.1.1.23:42125: error: java.io.IOException: java.lang.OutOfMemoryError: Java heap space
java.io.IOException: java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$Packet.<init>(DFSClient.java:2875)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.writeChunk(DFSClient.java:3806)
at org.apache.hadoop.fs.FSOutputSummer.writeChecksumChunk(FSOutputSummer.java:150)
at org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:132)
at org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:121)
at org.apache.hadoop.fs.FSOutputSummer.write1(FSOutputSummer.java:112)
at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:86)
at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:49)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:220)
at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:290)
at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:294)
at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:140)
at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:229)
at java.io.BufferedWriter.flush(BufferedWriter.java:253)
at java.io.PrintWriter.flush(PrintWriter.java:293)
at java.io.PrintWriter.checkError(PrintWriter.java:330)
at org.apache.hadoop.mapred.JobHistory.log(JobHistory.java:847)
at org.apache.hadoop.mapred.JobHistory$MapAttempt.logStarted(JobHistory.java:2225)
at org.apache.hadoop.mapred.JobInProgress.completedTask(JobInProgress.java:2632)
at org.apache.hadoop.mapred.JobInProgress.updateTaskStatus(JobInProgress.java:1222)
at org.apache.hadoop.mapred.JobTracker.updateTaskStatuses(JobTracker.java:4471)
at org.apache.hadoop.mapred.JobTracker.processHeartbeat(JobTracker.java:3306)
at org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:3001)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:587)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1432)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1428)
at java.security.AccessController.doPrivileged(Native Method)
And after that we see a bunch of:
2013-11-12 09:13:48,204 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201311120807_0053_m_000033_0: Lost task tracker: tracker_n144-06b.wall1.ilabt.iminds.be:localhost/127.0.0.1:47567
EDIT2: Some ideas?
The heap space error is kind of unexpected, since the mappers hardly require any memory.
I am calling the base class with super.run(); should I use a ToolRunner call for that instead? (See the sketch below.)
In every iteration a file with approximately 1000 words + scores is added to the DistributedCache; I am not sure whether I should reset the cache somewhere. (Every job in super.run() runs with job.waitForCompletion() - is the cache cleared then?)
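For reference, the ToolRunner variant would look roughly like this (a hedged sketch; PatternMatchJob stands in for the poster's pattern-matching Tool implementation, and the argument layout mirrors the -patterns convention above):
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;

public class PatternLoopDriver {
    public static void main(String[] args) throws Exception {
        // Hypothetical stand-in for the poster's getFiles(args[1]).
        List<String> patternFiles = Arrays.asList("patterns/p0001.txt", "patterns/p0002.txt");
        for (String patternFile : patternFiles) {
            // A fresh Configuration per job, so client-side per-job state
            // (DistributedCache entries, cached counters) cannot accumulate
            // across the ~1000 runs.
            String[] newArgs = {"-patterns", patternFile, args[2], args[3]};
            int exitCode = ToolRunner.run(new Configuration(), new PatternMatchJob(), newArgs);
            if (exitCode != 0) {
                throw new IllegalStateException("Job failed for pattern file " + patternFile);
            }
        }
    }
}
That said, the traces above are thrown inside the JobTracker itself (JobHistory logging and heartbeat handling in its IPC server), so the JobTracker daemon's own heap is another obvious suspect when 50 jobs' worth of job history and counters are retained.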
EDIT3:
@Donald: I haven't resized the memory for the hadoop daemons, so they should have a heap of 1 GB each. The map tasks have 800 MB of heap, of which 450 MB is used for io.sort.
@Chris: I haven't modified anything on the counters, I am using the regular ones. There are 1764 map tasks with 16 counters each, and the job itself will have another 20 or so. This might indeed add up after 50 consecutive jobs, but I would think the counters are not kept on the heap when running multiple consecutive jobs?
Extra information:
The map tasks are extremely fast; a task only takes 3-5 seconds, and I have jvm.reuse=-1. A map task processes a file with 10 records (the file is much smaller than the block size). Due to the small files I could consider making input files with 100 records to reduce the mapping overhead.
The first thing I tried was to add a unit reducer (1 reduce task) to reduce the number of files created in HDFS (otherwise there would be 1 per pattern and therefore 1000 per job, which might create overhead for the datanodes).
The number of records per job is rather low: I am looking for specific words in 1764 files, and the number of matches with one of the 1000 patterns is around 5000 map output records in total.
@All: Thanks for helping me out, guys!
