Apache Spark out of Java heap space: where does it happen?

I have a Java memory issue with Spark: the same application that works on my 8GB Mac crashes on my 72GB Ubuntu server...
I have changed settings in the conf file, but Spark does not seem to pick them up, so I wonder whether my issue is with the driver or the executors.
I set:
spark.driver.memory 20g
spark.executor.memory 20g
And whatever I do, the crash always happens at the same spot in the app, which makes me think it is a driver problem.
The exception I get is:
16/07/13 20:36:30 WARN TaskSetManager: Lost task 0.0 in stage 7.0 (TID 208, micha.nc.rr.com): java.lang.OutOfMemoryError: Java heap space
at java.nio.HeapCharBuffer.<init>(HeapCharBuffer.java:57)
at java.nio.CharBuffer.allocate(CharBuffer.java:335)
at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:810)
at org.apache.hadoop.io.Text.decode(Text.java:412)
at org.apache.hadoop.io.Text.decode(Text.java:389)
at org.apache.hadoop.io.Text.toString(Text.java:280)
at org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$org$apache$spark$sql$execution$datasources$json$JSONRelation$$createBaseRdd$1.apply(JSONRelation.scala:105)
at org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$org$apache$spark$sql$execution$datasources$json$JSONRelation$$createBaseRdd$1.apply(JSONRelation.scala:105)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)
at scala.collection.AbstractIterator.aggregate(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$23.apply(RDD.scala:1135)
at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$23.apply(RDD.scala:1135)
at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1136)
at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1136)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Any hint? Thanks
Update:
I added a small memory "dumper" to my app. At the beginning, it says:
** Free ......... 1,413,566
** Allocated .... 1,705,984
** Max .......... 16,495,104
**> Total free ... 16,202,686
Just before the crash, it says:
** Free ......... 1,461,633
** Allocated .... 1,786,880
** Max .......... 16,495,104
**> Total free ... 16,169,857
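For reference, such a dumper can be a handful of java.lang.Runtime calls. This is a minimal sketch (my own illustration, not the original code), assuming the figures above are in KB; note that "Total free" is free + (max - allocated), which matches the numbers shown:

// Minimal heap "dumper" sketch using java.lang.Runtime (values in KB).
Runtime rt = Runtime.getRuntime();
long freeKb = rt.freeMemory() / 1024;       // free space inside the allocated heap
long allocatedKb = rt.totalMemory() / 1024; // heap currently reserved by the JVM
long maxKb = rt.maxMemory() / 1024;         // the -Xmx ceiling
System.out.printf("** Free ......... %,d%n", freeKb);
System.out.printf("** Allocated .... %,d%n", allocatedKb);
System.out.printf("** Max .......... %,d%n", maxKb);
System.out.printf("**> Total free ... %,d%n", freeKb + (maxKb - allocatedKb));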

So for some reason I have not been able to get Spark to read the configuration file on the server side, but modifying my code as follows fixed it:
SparkConf conf = new SparkConf()
        .setAppName("app")
        .set("spark.executor.memory", "4g") // picked up at runtime, unlike the conf file
        .setMaster("spark://10.0.100.120:7077");
(Thanks to all the people who voted the question down, it is really motivating to come back here and post a solution).
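A plausible explanation for the driver side (my note, not from the original poster): in client mode the driver JVM is already running by the time a SparkConf set in code is evaluated, so spark.driver.memory has no effect when set programmatically; it has to be supplied at launch. A sketch, where app.Main and app.jar are hypothetical placeholders:

# Driver heap must be fixed before the JVM starts; executor memory can
# also be passed the same way instead of in code:
spark-submit --driver-memory 20g --conf spark.executor.memory=4g \
  --master spark://10.0.100.120:7077 --class app.Main app.jar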

Related

ElasticSearch - Inner Lucene file deletion hangs forever

I have been using an ElasticSearch cluster in my production environment for months now.
The cluster contains 2 nodes, which are Windows Server 2019 servers.
Sometimes a random node of this cluster suddenly gets stuck until I restart the Elastic service, which is impossible by simply restarting the Windows service; I need to kill the process to be able to restart it right after.
When I look at thread contention by calling the Elastic hot threads API, I get this:
0.0% (0s out of 500ms) cpu usage by thread 'threadDeathWatcher-2-1'
10/10 snapshots sharing following 4 elements
java.lang.Thread.sleep(Native Method)
io.netty.util.ThreadDeathWatcher$Watcher.run(ThreadDeathWatcher.java:152)
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
java.lang.Thread.run(Thread.java:748)
0.0% (0s out of 500ms) cpu usage by thread 'DestroyJavaVM'
unique snapshot (repeated 10 times)
0.0% (0s out of 500ms) cpu usage by thread 'elasticsearch[PRODUCTION_CRITQUE_2][refresh][T#1]'
10/10 snapshots sharing following 27 elements
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:251)
org.elasticsearch.index.engine.InternalEngine.refresh(InternalEngine.java:910)
org.elasticsearch.index.shard.IndexShard.refresh(IndexShard.java:632)
org.elasticsearch.action.admin.indices.refresh.TransportShardRefreshAction.shardOperationOnReplica(TransportShardRefreshAction.java:65)
org.elasticsearch.action.admin.indices.refresh.TransportShardRefreshAction.shardOperationOnReplica(TransportShardRefreshAction.java:38)
org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.onResponse(TransportReplicationAction.java:494)
org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.onResponse(TransportReplicationAction.java:467)
org.elasticsearch.index.shard.IndexShardOperationsLock.acquire(IndexShardOperationsLock.java:147)
org.elasticsearch.index.shard.IndexShard.acquireReplicaOperationLock(IndexShard.java:1673)
org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.doRun(TransportReplicationAction.java:566)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicaOperationTransportHandler.messageReceived(TransportReplicationAction.java:451)
org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicaOperationTransportHandler.messageReceived(TransportReplicationAction.java:441)
org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69)
org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1544)
org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)
0.0% (0s out of 500ms) cpu usage by thread 'elasticsearch[PRODUCTION_CRITQUE_2][refresh][T#2]'
10/10 snapshots sharing following 39 elements
sun.nio.fs.WindowsNativeDispatcher.DeleteFile0(Native Method)
sun.nio.fs.WindowsNativeDispatcher.DeleteFile(WindowsNativeDispatcher.java:114)
sun.nio.fs.WindowsFileSystemProvider.implDelete(WindowsFileSystemProvider.java:249)
sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:103)
java.nio.file.Files.delete(Files.java:1126)
org.apache.lucene.store.FSDirectory.privateDeleteFile(FSDirectory.java:373)
org.apache.lucene.store.FSDirectory.deleteFile(FSDirectory.java:335)
org.apache.lucene.store.FilterDirectory.deleteFile(FilterDirectory.java:62)
org.apache.lucene.store.FilterDirectory.deleteFile(FilterDirectory.java:62)
org.elasticsearch.index.store.Store$StoreDirectory.deleteFile(Store.java:700)
org.elasticsearch.index.store.Store$StoreDirectory.deleteFile(Store.java:705)
org.apache.lucene.store.LockValidatingDirectoryWrapper.deleteFile(LockValidatingDirectoryWrapper.java:38)
org.apache.lucene.index.IndexFileDeleter.deleteFile(IndexFileDeleter.java:723)
org.apache.lucene.index.IndexFileDeleter.deleteFiles(IndexFileDeleter.java:717)
org.apache.lucene.index.IndexFileDeleter.deleteNewFiles(IndexFileDeleter.java:693)
org.apache.lucene.index.IndexWriter.deleteNewFiles(IndexWriter.java:4965)
org.apache.lucene.index.DocumentsWriter$DeleteNewFilesEvent.process(DocumentsWriter.java:771)
org.apache.lucene.index.IndexWriter.processEvents(IndexWriter.java:5043)
org.apache.lucene.index.IndexWriter.processEvents(IndexWriter.java:5034)
org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:477)
org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:291)
org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:266)
org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:256)
org.apache.lucene.index.FilterDirectoryReader.doOpenIfChanged(FilterDirectoryReader.java:104)
org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:140)
org.apache.lucene.search.SearcherManager.refreshIfNeeded(SearcherManager.java:156)
org.apache.lucene.search.SearcherManager.refreshIfNeeded(SearcherManager.java:58)
org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:176)
org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:253)
org.elasticsearch.index.engine.InternalEngine.refresh(InternalEngine.java:910)
org.elasticsearch.index.shard.IndexShard.refresh(IndexShard.java:632)
org.elasticsearch.index.IndexService.maybeRefreshEngine(IndexService.java:690)
org.elasticsearch.index.IndexService.access$400(IndexService.java:92)
org.elasticsearch.index.IndexService$AsyncRefreshTask.runInternal(IndexService.java:832)
org.elasticsearch.index.IndexService$BaseAsyncTask.run(IndexService.java:743)
org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)
0.0% (0s out of 500ms) cpu usage by thread 'elasticsearch[PRODUCTION_CRITQUE_2][flush][T#4334]'
10/10 snapshots sharing following 16 elements
org.apache.lucene.index.IndexWriter.setLiveCommitData(IndexWriter.java:3116)
org.elasticsearch.index.engine.InternalEngine.commitIndexWriter(InternalEngine.java:1562)
org.elasticsearch.index.engine.InternalEngine.flush(InternalEngine.java:1063)
org.elasticsearch.index.shard.IndexShard.flush(IndexShard.java:780)
org.elasticsearch.indices.flush.SyncedFlushService.performPreSyncedFlush(SyncedFlushService.java:414)
org.elasticsearch.indices.flush.SyncedFlushService.access$1000(SyncedFlushService.java:70)
org.elasticsearch.indices.flush.SyncedFlushService$PreSyncedFlushTransportHandler.messageReceived(SyncedFlushService.java:696)
org.elasticsearch.indices.flush.SyncedFlushService$PreSyncedFlushTransportHandler.messageReceived(SyncedFlushService.java:692)
org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33)
org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69)
org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1544)
org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)
It seems that a file deletion is locking (deadlocking?) Elastic threads. I'm not deleting any index in production, so I guess it's an internal ElasticSearch/Lucene process: when the replica node synchronizes with the master node, it has to delete Lucene segments that no longer exist, or something like that.
I tried speaking with the Elastic development team, but in their opinion being stuck on a file deletion is an environment issue more than an Elastic issue, which is understandable actually.
I stopped the antivirus and backup processes on these servers, but I am still getting these locks at least about once a month.
How can Java's internal "DeleteFile" hang without returning any error? It just hangs forever, and the server does not seem to be under pressure at the same time.
If anyone has encountered this kind of issue, or has an idea to help me investigate, it would be awesome.
Thanks!
Looks like others are experiencing this:
https://discuss.elastic.co/t/massive-queue-in-refresh-thread-pool-on-a-single-node-causing-timeouts/280732/4
Have you looked at the Windows Event Viewer application logs to see if any Windows process gives any insight?
Looks like it is trying to remove old index files.
org.elasticsearch.index.shard.IndexShard.refresh(IndexShard.java:632)

Unable to load 25GB dataset in PySpark local mode with 56GB RAM free

I am having trouble loading and processing a 25GB Parquet dataset (of stackoverflow.com posts) on a single beefy machine in local mode with 12 cores/64GB of RAM.
I have more memory free on my machine, and allocated to PySpark, than the size of the Parquet dataset (let alone two columns of it), and yet I am unable to run any operation on the DataFrame once I load it. This is confusing, and I can't figure out what to do.
Specifically, I have a Parquet dataset that is 25GB:
$ du -sh data/stackoverflow/parquet/Posts.df.parquet
25G data/stackoverflow/parquet/Posts.df.parquet
I have a machine with 56GB of free RAM:
$ free -h
              total        used        free      shared  buff/cache   available
Mem:            62G        4.7G         56G         23M        1.7G         57G
Swap:           63G          0B         63G
I have configured PySpark to use 50GB of RAM (I have tried adjusting maxResultSize to no effect).
My configuration looks like this:
$ cat ~/spark/conf/spark-defaults.conf
spark.io.compression.codec org.apache.spark.io.SnappyCompressionCodec
spark.driver.memory 50g
spark.jars ...
spark.executor.cores 12
spark.driver.maxResultSize 20g
My environment looks like this:
$ cat ~/spark/conf/spark-env.sh
PYSPARK_PYTHON=python3
PYSPARK_DRIVER_PYTHON=python3
SPARK_WORKER_DIR=/nvm/spark/work
SPARK_LOCAL_DIRS=/nvm/spark/local
SPARK_WORKER_MEMORY=50g
SPARK_WORKER_CORES=12
I load the data like this:
$ pyspark
>>> posts = spark.read.parquet('data/stackoverflow/parquet/Posts.df.parquet')
It loads OK, but any operation, including running limit(10) on the DataFrame first, results in an out of heap space error.
>>> posts.limit(10)\
.select('_ParentId','_Body')\
.filter(posts._ParentId == 9915705)\
.show()
[Stage 1:> (0 + 12) / 195]19/06/30 17:26:13 ERROR Executor: Exception in task 7.0 in stage 1.0 (TID 8)
java.lang.OutOfMemoryError: Java heap space
19/06/30 17:26:13 ERROR Executor: Exception in task 3.0 in stage 1.0 (TID 4)
java.lang.OutOfMemoryError: Java heap space
19/06/30 17:26:13 ERROR Executor: Exception in task 5.0 in stage 1.0 (TID 6)
java.lang.OutOfMemoryError: Java heap space
at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
at org.apache.parquet.bytes.HeapByteBufferAllocator.allocate(HeapByteBufferAllocator.java:32)
at org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1166)
at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:805)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:301)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:256)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:159)
at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:181)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
19/06/30 17:26:13 ERROR Executor: Exception in task 10.0 in stage 1.0 (TID 11)
java.lang.OutOfMemoryError: Java heap space
at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
at org.apache.parquet.bytes.HeapByteBufferAllocator.allocate(HeapByteBufferAllocator.java:32)
at org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1166)
at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:805)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:301)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:256)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:159)
at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:181)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
19/06/30 17:26:13 ERROR Executor: Exception in task 6.0 in stage 1.0 (TID 7)
java.lang.OutOfMemoryError: Java heap space
19/06/30 17:26:13 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker for task 7,5,main]
java.lang.OutOfMemoryError: Java heap space
19/06/30 17:26:13 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker for task 11,5,main]
java.lang.OutOfMemoryError: Java heap space
...
The following will run, suggesting the problem is the _Body field (obviously the largest):
>>> posts.limit(10).select('_Id').show()
+---+
|_Id|
+---+
| 4|
| 6|
| 7|
| 9|
| 11|
| 12|
| 13|
| 14|
| 16|
| 17|
+---+
What am I to do? I could use EMR, but I would like to be able to load this dataset locally, and that seems an entirely reasonable thing to be able to do in this situation.
The default memory fraction for Spark's storage and execution is 0.6. Under your config that is 0.6 * 50GB = 30GB. But the representation of the data in memory may consume more space than the serialized on-disk version.
Please check the Memory Management section of the Spark documentation for more details.
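If you need to change that split, spark.memory.fraction is the relevant setting; a hedged illustration for spark-defaults.conf (the value is illustrative, not a recommendation):

# Raise the unified memory fraction from the 0.6 default so more of the
# 50g heap is available to execution and storage:
spark.memory.fraction 0.8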
You will need to set the Spark memory config when running the pyspark command:
pyspark --conf spark.driver.memory=50g --conf spark.executor.pyspark.memory=50g
Check this doc for the config to set.
You might also need to figure out the number of executors you need based on your hardware.
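For example, a hedged sketch (the core count is illustrative): capping local-mode parallelism limits how many tasks simultaneously hold Parquet row-group buffers on the heap, since in local mode everything runs inside the single driver JVM:

$ pyspark --master 'local[6]' --driver-memory 50g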

Spark on Mesos: executors failing with OOM errors

We are using Spark 2.0.2, managed by a DC/OS system, fetching data from a Kafka 1.0.0 messaging service and writing Parquet files to an HDFS system.
Everything was working fine, but when we increased the number of topics in Kafka, our Spark executors began to crash constantly with OOM errors:
java.lang.OutOfMemoryError: Java heap space
at org.apache.parquet.column.values.dictionary.IntList.initSlab(IntList.java:90)
at org.apache.parquet.column.values.dictionary.IntList.<init>(IntList.java:86)
at org.apache.parquet.column.values.dictionary.DictionaryValuesWriter.<init>(DictionaryValuesWriter.java:93)
at org.apache.parquet.column.values.dictionary.DictionaryValuesWriter$PlainDoubleDictionaryValuesWriter.<init>(DictionaryValuesWriter.java:422)
at org.apache.parquet.column.ParquetProperties.dictionaryWriter(ParquetProperties.java:139)
at org.apache.parquet.column.ParquetProperties.dictWriterWithFallBack(ParquetProperties.java:178)
at org.apache.parquet.column.ParquetProperties.getValuesWriter(ParquetProperties.java:203)
at org.apache.parquet.column.impl.ColumnWriterV1.<init>(ColumnWriterV1.java:83)
at org.apache.parquet.column.impl.ColumnWriteStoreV1.newMemColumn(ColumnWriteStoreV1.java:68)
at org.apache.parquet.column.impl.ColumnWriteStoreV1.getColumnWriter(ColumnWriteStoreV1.java:56)
at org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.<init>(MessageColumnIO.java:183)
at org.apache.parquet.io.MessageColumnIO.getRecordWriter(MessageColumnIO.java:375)
at org.apache.parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:109)
at org.apache.parquet.hadoop.InternalParquetRecordWriter.<init>(InternalParquetRecordWriter.java:99)
at org.apache.parquet.hadoop.ParquetWriter.<init>(ParquetWriter.java:217)
at org.apache.parquet.hadoop.ParquetWriter.<init>(ParquetWriter.java:175)
at org.apache.parquet.hadoop.ParquetWriter.<init>(ParquetWriter.java:146)
at org.apache.parquet.hadoop.ParquetWriter.<init>(ParquetWriter.java:113)
at org.apache.parquet.hadoop.ParquetWriter.<init>(ParquetWriter.java:87)
at org.apache.parquet.hadoop.ParquetWriter.<init>(ParquetWriter.java:62)
at org.apache.parquet.avro.AvroParquetWriter.<init>(AvroParquetWriter.java:47)
at npm.parquet.ParquetMeasurementWriter.ensureOpenWriter(ParquetMeasurementWriter.java:91)
at npm.parquet.ParquetMeasurementWriter.write(ParquetMeasurementWriter.java:75)
at npm.ingestion.spark.StagingArea$Measurements.store(StagingArea.java:100)
at npm.ingestion.spark.StagingArea$StagingAreaStorage.store(StagingArea.java:80)
at npm.ingestion.spark.StagingArea.add(StagingArea.java:40)
at npm.ingestion.spark.Kafka2HDFSPM$SubsetProcessor.sendToStagingArea(Kafka2HDFSPM.java:207)
at npm.ingestion.spark.Kafka2HDFSPM$SubsetProcessor.consumeRecords(Kafka2HDFSPM.java:193)
at npm.ingestion.spark.Kafka2HDFSPM$SubsetProcessor.process(Kafka2HDFSPM.java:169)
at npm.ingestion.spark.Kafka2HDFSPM$FetchSubsetsAndStore.call(Kafka2HDFSPM.java:133)
at npm.ingestion.spark.Kafka2HDFSPM$FetchSubsetsAndStore.call(Kafka2HDFSPM.java:111)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:218)
18/03/20 18:41:13 ERROR [Executor task launch worker-0] SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-0,5,main]
java.lang.OutOfMemoryError: Java heap space
at org.apache.parquet.column.values.dictionary.IntList.initSlab(IntList.java:90)
at org.apache.parquet.column.values.dictionary.IntList.<init>(IntList.java:86)
at org.apache.parquet.column.values.dictionary.DictionaryValuesWriter.<init>(DictionaryValuesWriter.java:93)
at org.apache.parquet.column.values.dictionary.DictionaryValuesWriter$PlainDoubleDictionaryValuesWriter.<init>(DictionaryValuesWriter.java:422)
at org.apache.parquet.column.ParquetProperties.dictionaryWriter(ParquetProperties.java:139)
at org.apache.parquet.column.ParquetProperties.dictWriterWithFallBack(ParquetProperties.java:178)
at org.apache.parquet.column.ParquetProperties.getValuesWriter(ParquetProperties.java:203)
at org.apache.parquet.column.impl.ColumnWriterV1.<init>(ColumnWriterV1.java:83)
at org.apache.parquet.column.impl.ColumnWriteStoreV1.newMemColumn(ColumnWriteStoreV1.java:68)
at org.apache.parquet.column.impl.ColumnWriteStoreV1.getColumnWriter(ColumnWriteStoreV1.java:56)
at org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.<init>(MessageColumnIO.java:183)
at org.apache.parquet.io.MessageColumnIO.getRecordWriter(MessageColumnIO.java:375)
at org.apache.parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:109)
at org.apache.parquet.hadoop.InternalParquetRecordWriter.<init>(InternalParquetRecordWriter.java:99)
at org.apache.parquet.hadoop.ParquetWriter.<init>(ParquetWriter.java:217)
at org.apache.parquet.hadoop.ParquetWriter.<init>(ParquetWriter.java:175)
at org.apache.parquet.hadoop.ParquetWriter.<init>(ParquetWriter.java:146)
at org.apache.parquet.hadoop.ParquetWriter.<init>(ParquetWriter.java:113)
at org.apache.parquet.hadoop.ParquetWriter.<init>(ParquetWriter.java:87)
at org.apache.parquet.hadoop.ParquetWriter.<init>(ParquetWriter.java:62)
at org.apache.parquet.avro.AvroParquetWriter.<init>(AvroParquetWriter.java:47)
at npm.parquet.ParquetMeasurementWriter.ensureOpenWriter(ParquetMeasurementWriter.java:91)
at npm.parquet.ParquetMeasurementWriter.write(ParquetMeasurementWriter.java:75)
at npm.ingestion.spark.StagingArea$Measurements.store(StagingArea.java:100)
at npm.ingestion.spark.StagingArea$StagingAreaStorage.store(StagingArea.java:80)
at npm.ingestion.spark.StagingArea.add(StagingArea.java:40)
at npm.ingestion.spark.Kafka2HDFSPM$SubsetProcessor.sendToStagingArea(Kafka2HDFSPM.java:207)
at npm.ingestion.spark.Kafka2HDFSPM$SubsetProcessor.consumeRecords(Kafka2HDFSPM.java:193)
at npm.ingestion.spark.Kafka2HDFSPM$SubsetProcessor.process(Kafka2HDFSPM.java:169)
at npm.ingestion.spark.Kafka2HDFSPM$FetchSubsetsAndStore.call(Kafka2HDFSPM.java:133)
at npm.ingestion.spark.Kafka2HDFSPM$FetchSubsetsAndStore.call(Kafka2HDFSPM.java:111)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:218)
We tried increasing the available executor memory and reviewed the code, but we couldn't find anything wrong.
Another piece of info: we are using RDDs in Spark.
Has anyone encountered a similar problem that has already been solved?
What is the heap configuration for the executor? By default, Java will autotune its heap according to the machine's memory. You need to change it to fit in your container with the -Xmx setting.
See this article about running Java in a container:
https://github.com/fabianenardon/docker-java-issues-demo/tree/master/memory-sample
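As a hedged illustration of that on Mesos (both properties exist in Spark 2.x; the values and jar name are made up):

# Bound the executor heap (-Xmx) and leave headroom for off-heap usage
# so the process fits inside its Mesos container allocation:
spark-submit --conf spark.executor.memory=3g \
             --conf spark.mesos.executor.memoryOverhead=512 \
             your-streaming-job.jar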

JMeter download of a 1.5GB file: out of memory exception

I'm running a JMX test plan from the command line:
JVM_ARGS="-Xms2048m -Xmx4096m -XX:NewSize=4096m -XX:MaxNewSize=4096m" && export JVM_ARGS && ./jmeter.sh -n -t ./jmeter-ec2.jmx -l ./scriptresults.jtl
but at some point I got an out of memory error. Going through jmeter.log,
I found this error:
ERROR o.a.j.JMeter: Uncaught exception: java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:3236) ~[?:1.8.0_91]
    at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118) ~[?:1.8.0_91]
    at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) ~[?:1.8.0_91]
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153) ~[?:1.8.0_91]
    at org.apache.jmeter.protocol.http.sampler.HTTPSamplerBase.readResponse(HTTPSamplerBase.java:1833) ~[ApacheJMeter_http.jar:3.3 r1808647]
    at org.apache.jmeter.protocol.http.sampler.HTTPAbstractImpl.readResponse(HTTPAbstractImpl.java:440) ~[ApacheJMeter_http.jar:3.3 r1808647]
    at org.apache.jmeter.protocol.http.sampler.HTTPHC4Impl.sample(HTTPHC4Impl.java:474) ~[ApacheJMeter_http.jar:3.3 r1808647]
    at org.apache.jmeter.protocol.http.sampler.HTTPSamplerProxy.sample(HTTPSamplerProxy.java:74) ~[ApacheJMeter_http.jar:3.3 r1808647]
    at org.apache.jmeter.protocol.http.sampler.HTTPSamplerBase.sample(HTTPSamplerBase.java:1189) ~[ApacheJMeter_http.jar:3.3 r1808647]
    at org.apache.jmeter.protocol.http.sampler.HTTPSamplerBase.sample(HTTPSamplerBase.java:1178) ~[ApacheJMeter_http.jar:3.3 r1808647]
    at org.apache.jmeter.threads.JMeterThread.executeSamplePackage(JMeterThread.java:498) ~[ApacheJMeter_core.jar:3.3 r1808647]
    at org.apache.jmeter.threads.JMeterThread.processSampler(JMeterThread.java:424) ~[ApacheJMeter_core.jar:3.3 r1808647]
    at org.apache.jmeter.threads.JMeterThread.run(JMeterThread.java:255) ~[ApacheJMeter_core.jar:3.3 r1808647]
    at java.lang.Thread.run(Thread.java:745) [?:1.8.0_91]
2018-01-26 02:03:55,731 INFO o.a.j.e.StandardJMeterEngine: Notifying test listeners of end of test
2018-01-26 02:03:55,732 INFO o.a.j.r.Summariser: summary = 0 in 00:00:00 = ******/s Avg: 0 Min: 9223372036854775807 Max: -9223372036854775808 Err: 0 (0.00%)
what I"M doing wrong here ? I cant solve it:(
Your JVM arguments are wrong, just keep:
-Xms2048m -Xmx4096m
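Applied to the command from the question, that would look like (a sketch, with the same test plan and output paths):

JVM_ARGS="-Xms2048m -Xmx4096m" && export JVM_ARGS && ./jmeter.sh -n -t ./jmeter-ec2.jmx -l ./scriptresults.jtl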
You don't say with how many threads this occurs, nor whether you're running in GUI or non-GUI mode, so:
Don't run in GUI mode, it's an anti-pattern
Ensure you have enough memory for your threads
Finally, you can reduce the memory impact of big responses by adapting this property in user.properties:
httpsampler.max_bytes_to_store_per_request
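For example, in user.properties (the 1 MB cap below is illustrative, my own choice, not a recommended value):

# Keep at most 1 MB of each response in memory:
httpsampler.max_bytes_to_store_per_request=1048576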
Another option is to compute only a hash of the response, by enabling the "Save response as MD5 hash?" option of the HTTP Request sampler (see http://jmeter.apache.org/usermanual/component_reference.html#HTTP_Request).
Well, given you have a 1.5 GB file, you will be able to have no more than 3 virtual users, which doesn't look like a "load test" to me.
If you are not interested in the downloaded file's content and just want to stress your server, you can consider switching to the JSR223 Sampler, which will send the request and discard the response data using the underlying Apache HttpComponents library methods. The relevant Groovy code would be something like:
import org.apache.http.client.methods.HttpGet
import org.apache.http.impl.client.HttpClientBuilder
import org.apache.http.util.EntityUtils

def client = HttpClientBuilder.create().build()
def get = new HttpGet('http://example.com')
def response = client.execute(get)
// Drain and release the response entity without keeping its content in memory
EntityUtils.consume(response.getEntity())
References:
HttpClient Tutorial
HttpClient Quick Start
Apache Groovy - Why and How You Should Use It

Error : java.lang.OutOfMemoryError: unable to create new native thread : gemfire

Please, before marking this as a duplicate, read this: I have gone through all the answers provided for this error, and nothing helped in my scenario.
I am doing a server migration where the same thing that works well on the 32-bit server runs out of memory on the 64-bit one.
I have a Windows service which internally points to an .exe that spawns the Java process. I am not sure what different behavior is causing this out of memory error on the 64-bit server (my Java version is 1.8.xx).
I have made all the possible memory improvements in the config file of my .exe, shown below:
#Java Additional Parameters
wrapper.java.additional.1=-XX:+UseConcMarkSweepGC
wrapper.java.additional.2=-XX:+UseParNewGC
wrapper.java.additional.3=-XX:ParallelGCThreads=8
wrapper.java.additional.4=-verbose:gc
# wrapper.java.additional.!!! should be sequence !!!=-Xloggc:D:\apps\Logs\gc.log
# wrapper.java.additional.!!! should be sequence !!!=-XX:+PrintGCDetails
# wrapper.java.additional.!!! should be sequence !!!=-XX:+PrintGCTimeStamps
wrapper.java.additional.5=-XX:MaxDirectMemorySize=128m
wrapper.java.additional.6=-XX:+HeapDumpOnOutOfMemoryError
wrapper.java.additional.7=-Dcom.sun.management.jmxremote.port=34001
wrapper.java.additional.8=-Dcom.sun.management.jmxremote.ssl=false
wrapper.java.additional.9=-Dcom.sun.management.jmxremote.authenticate=false
wrapper.java.additional.10=-XX:CMSInitiatingOccupancyFraction=55
wrapper.java.additional.11=-XX:NewSize=474m
wrapper.java.additional.12=-XX:MaxNewSize=474m
#wrapper.java.additional.13=-XX:PermSize=128m
#wrapper.java.additional.14=-XX:MaxPermSize=128m
wrapper.java.additional.15=-Xss128k
wrapper.java.additional.16=-XX:+CMSIncrementalMode
wrapper.java.additional.17=-XX:+UseCompressedOops
# Initial Java Heap Size (in MB)
wrapper.java.initmemory=1638
# Maximum Java Heap Size (in MB)
wrapper.java.maxmemory=1638
Still I am ending up with:
[severe 2016/10/24 06:27:46.192 java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Unknown Source)
at com.gemstone.gemfire.internal.SocketCreator.asyncClose(SocketCreator.java:688)
I have done some reading on the concept here:
Error reading
I am not much into Java things, but I tried everything from my side. Any help on this will be highly appreciated; I have spent a huge amount of time on this but have not been able to reach any conclusion.
Update:
So basically I could figure out that this problem was coming from excessive thread creation by GemFire, which exceeded the threshold of ~800 threads for the GemFire Java process.
Here the JConsole tool helped to calculate the thread count: I could see around 200-300 threads from different pools being created with no purpose, apart from the usual threads, and they have a description like:
Name: pool-9-thread-1
State: WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject#163b285
Total blocked: 0 Total waited: 2
Stack trace:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(Unknown Source)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(Unknown Source)
java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(Unknown Source)
java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(Unknown Source)
java.util.concurrent.ThreadPoolExecutor.getTask(Unknown Source)
java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
java.lang.Thread.run(Unknown Source)
I'll add more details if I can find more on this!
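A minimal hedged sketch for watching the live thread count from inside the JVM, the same figure JConsole displays (my own illustration, not part of the original setup):

// Live thread count via JMX, as shown in JConsole's Threads tab:
int live = java.lang.management.ManagementFactory.getThreadMXBean().getThreadCount();
System.out.println("Live threads: " + live);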
Update 2:
I managed to see all the threads created by GemFire using JConsole.
This number keeps increasing, and after a certain point in time I see the OOM issue. Is there any way I can stop this unnecessary thread creation and memory consumption?
