Netty resource leak even after release - java

Why is the following considered a leak?
2016-12-04 09:24:01,534 ERROR [epollEventLoopGroup-2-1] [io.netty.util.ResourceLeakDetector] - LEAK: ByteBuf.release() was not called before it's garbage-collected. See http://netty.io/wiki/reference-counted-objects.html for more information.
Recent access records: 5
#5:
io.netty.buffer.AdvancedLeakAwareByteBuf.release(AdvancedLeakAwareByteBuf.java:955)
com.example.network.listener.netty.PreprocessHandler.handle(PreprocessHandler.java:42)
com.example.network.listener.netty.UdpHandlerChain.handle(UdpHandlerChain.java:17)
com.example.network.listener.netty.UdpRequestExecutor$1.run(UdpRequestExecutor.java:89)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
#4:
io.netty.buffer.AdvancedLeakAwareByteBuf.readBytes(AdvancedLeakAwareByteBuf.java:495)
com.example.network.listener.netty.PreprocessHandler.handle(PreprocessHandler.java:39)
com.example.network.listener.netty.UdpHandlerChain.handle(UdpHandlerChain.java:17)
com.example.network.listener.netty.UdpRequestExecutor$1.run(UdpRequestExecutor.java:89)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
#3:
io.netty.buffer.AdvancedLeakAwareByteBuf.retain(AdvancedLeakAwareByteBuf.java:927)
io.netty.buffer.AdvancedLeakAwareByteBuf.retain(AdvancedLeakAwareByteBuf.java:35)
io.netty.util.ReferenceCountUtil.retain(ReferenceCountUtil.java:36)
io.netty.channel.DefaultAddressedEnvelope.retain(DefaultAddressedEnvelope.java:89)
io.netty.channel.socket.DatagramPacket.retain(DatagramPacket.java:67)
io.netty.channel.socket.DatagramPacket.retain(DatagramPacket.java:27)
io.netty.util.ReferenceCountUtil.retain(ReferenceCountUtil.java:36)
com.example.network.listener.netty.UdpRequestExecutor.channelRead(UdpRequestExecutor.java:71)
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:373)
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:359)
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:351)
io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334)
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:373)
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:359)
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926)
io.netty.channel.epoll.EpollDatagramChannel$EpollDatagramChannelUnsafe.epollInReady(EpollDatagramChannel.java:580)
io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:402)
io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:307)
io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:873)
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
java.lang.Thread.run(Thread.java:745)
#2:
Hint: 'UdpRequestExecutor#0' will handle the message from this point.
io.netty.channel.DefaultAddressedEnvelope.touch(DefaultAddressedEnvelope.java:117)
io.netty.channel.socket.DatagramPacket.touch(DatagramPacket.java:85)
io.netty.channel.socket.DatagramPacket.touch(DatagramPacket.java:27)
io.netty.channel.DefaultChannelPipeline.touch(DefaultChannelPipeline.java:107)
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356)
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:351)
io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334)
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:373)
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:359)
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926)
io.netty.channel.epoll.EpollDatagramChannel$EpollDatagramChannelUnsafe.epollInReady(EpollDatagramChannel.java:580)
io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:402)
io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:307)
io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:873)
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
java.lang.Thread.run(Thread.java:745)
#1:
Hint: 'DefaultChannelPipeline$HeadContext#0' will handle the message from this point.
io.netty.channel.DefaultAddressedEnvelope.touch(DefaultAddressedEnvelope.java:117)
io.netty.channel.socket.DatagramPacket.touch(DatagramPacket.java:85)
io.netty.channel.socket.DatagramPacket.touch(DatagramPacket.java:27)
io.netty.channel.DefaultChannelPipeline.touch(DefaultChannelPipeline.java:107)
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356)
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926)
io.netty.channel.epoll.EpollDatagramChannel$EpollDatagramChannelUnsafe.epollInReady(EpollDatagramChannel.java:580)
io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:402)
io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:307)
io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:873)
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
java.lang.Thread.run(Thread.java:745)
Created at:
io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271)
io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:179)
io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:170)
io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:131)
io.netty.channel.DefaultMaxMessagesRecvByteBufAllocator$MaxMessageHandle.allocate(DefaultMaxMessagesRecvByteBufAllocator.java:73)
io.netty.channel.RecvByteBufAllocator$DelegatingHandle.allocate(RecvByteBufAllocator.java:124)
io.netty.channel.epoll.EpollDatagramChannel$EpollDatagramChannelUnsafe.epollInReady(EpollDatagramChannel.java:544)
io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:402)
io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:307)
io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:873)
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
java.lang.Thread.run(Thread.java:745)
The last access is an explicit release()...
I'm using Netty 4.1.6.Final.

The exception message points to the Netty wiki. From that information and the traces, it looks like the buf.retain() call at com.example.network.listener.netty.UdpRequestExecutor.channelRead(UdpRequestExecutor.java:71) was wrong: the extra reference was never released, so at the time of garbage collection the buffer's reference count was still > 0, which is exactly what the leak detector reports. You should study the examples and the responsibility matrix on that wiki page to work with reference counts correctly.
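A minimal sketch of the usual fix for this shape of pipeline, assuming (as the traces suggest) that UdpRequestExecutor.channelRead hands the DatagramPacket to a worker pool that later calls release(); only the Netty types below are real, the handler class and its handle() method are illustrative. If the handler simply transfers ownership to the worker, no extra retain() is needed, just exactly one release() on every path:

import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;
import io.netty.channel.socket.DatagramPacket;
import io.netty.util.ReferenceCountUtil;
import java.util.concurrent.Executor;
import java.util.concurrent.RejectedExecutionException;

public class UdpRequestExecutor extends ChannelInboundHandlerAdapter {
    private final Executor workerPool;   // illustrative worker pool

    public UdpRequestExecutor(Executor workerPool) {
        this.workerPool = workerPool;
    }

    @Override
    public void channelRead(ChannelHandlerContext ctx, Object msg) {
        final DatagramPacket packet = (DatagramPacket) msg;
        try {
            // Ownership of the buffer moves to the worker; no retain() needed here.
            workerPool.execute(() -> {
                try {
                    handle(packet);                      // illustrative business logic
                } finally {
                    ReferenceCountUtil.release(packet);  // exactly one release per packet
                }
            });
        } catch (RejectedExecutionException e) {
            ReferenceCountUtil.release(packet);          // the worker never ran, release here
        }
    }

    private void handle(DatagramPacket packet) {
        // read the content, dispatch, etc.
    }
}

If the message must also be forwarded down the pipeline with ctx.fireChannelRead(msg), then a retain() before the hand-off is correct, but it has to be balanced by one additional release(); an unmatched retain() is exactly what leaves the reference count above zero at garbage-collection time, as in the report above.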

Related

rocketMQTemplate.asyncSend throws ConcurrentModificationException at MessageDecoder.messageProperties2String

I am using rocketmq-spring (version 2.1.0) to send messages. Sometimes I get a ConcurrentModificationException at org.apache.rocketmq.common.message.MessageDecoder.messageProperties2String(MessageDecoder.java:414); a detailed log follows. Thanks!
#[xx, 10.xx.52] INFO 2022-05-07 15:31:11.043 [XNIO-1 task-74, 29de7f06241a3313, 29de7f06241a3313] com.xx.common.IpProducerService.asyncSendMessage:45 - contentMap{refNo=xx, system=xx, ip=null, platformId=xx, userId=xxx}
#[fp, 10.xx.52] INFO 2022-05-07 15:31:11.043 [XNIO-1 task-74, 29de7f06241a3313, 29de7f06241a3313] com.xx.rocketmq.producer.RocketMqProducer.asyncInfo:19 - -=-=-= [Async Sending Message] -=-=-=
Topic = TOPIC_xx_xx
Tag =
MessageId = null
DelayLevel = 0
Content = {"refNo":"xx","system":"xx","platformId":"xx","userId":"xx"}
#[fp, 10.xx.52] ERROR 2022-05-07 15:31:11.044 [AsyncSenderExecutor_3, , ] com.xx.rocketmq.producer.ProduceCallBack.onException:32 - asyncSendMessage caused exception.
java.util.ConcurrentModificationException: null
at java.util.HashMap$HashIterator.nextNode(HashMap.java:1437)
at java.util.HashMap$EntryIterator.next(HashMap.java:1471)
at java.util.HashMap$EntryIterator.next(HashMap.java:1469)
at org.apache.rocketmq.common.message.MessageDecoder.messageProperties2String(MessageDecoder.java:414)
at org.apache.rocketmq.client.impl.producer.DefaultMQProducerImpl.sendKernelImpl(DefaultMQProducerImpl.java:790)
at org.apache.rocketmq.client.impl.producer.DefaultMQProducerImpl.sendDefaultImpl(DefaultMQProducerImpl.java:584)
at org.apache.rocketmq.client.impl.producer.DefaultMQProducerImpl.access$300(DefaultMQProducerImpl.java:97)
at org.apache.rocketmq.client.impl.producer.DefaultMQProducerImpl$4.run(DefaultMQProducerImpl.java:511)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
#[xx, 10.xx.52] INFO 2022-05-07 15:31:11.044 [AsyncSenderExecutor_3, , ] com.xx.common.IpProducerService.handleResult:49 - async produce status is F
The cause is using different rocketMQTemplate instances to send the same message. asyncSend runs on a thread pool inside rocketMQTemplate, and there are many threads there, so you cannot send the same message object through different rocketMQTemplates: the same message shares a single properties HashMap, which is not thread-safe.
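Following that reasoning, a minimal sketch of the safer pattern is to build a fresh message object for every send, so that no two in-flight sends share the same properties map. The class name, topic and callback bodies below mirror the names in the log but are illustrative only:

import org.apache.rocketmq.client.producer.SendCallback;
import org.apache.rocketmq.client.producer.SendResult;
import org.apache.rocketmq.spring.core.RocketMQTemplate;
import org.springframework.messaging.Message;
import org.springframework.messaging.support.MessageBuilder;

public class IpProducerService {
    private final RocketMQTemplate rocketMQTemplate;

    public IpProducerService(RocketMQTemplate rocketMQTemplate) {
        this.rocketMQTemplate = rocketMQTemplate;
    }

    public void asyncSendMessage(String content) {
        // A new Message per call: its internal properties HashMap is never shared
        // with another send running on the async sender thread pool.
        Message<String> message = MessageBuilder.withPayload(content).build();
        rocketMQTemplate.asyncSend("TOPIC_xx_xx", message, new SendCallback() {
            @Override
            public void onSuccess(SendResult sendResult) {
                // handle success
            }
            @Override
            public void onException(Throwable e) {
                // handle failure
            }
        });
    }
}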

Getting org.rocksdb.RocksDBException: bad entry in block

I'm using RocksDB to store data where the key is a string and the value is an integer. Recently my application threw the following exception while writing into RocksDB.
java.lang.Exception: org.rocksdb.RocksDBException: bad entry in block
at com.techspot.store.RocksStore.encodePacket(RocksStore.java:684) ~[techspot-encoder-dev.jar:?]
at com.techspot.store.workers.PacketEncoder.run(PacketEncoder.java:67) [techspot-encoder-dev.jar:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_212]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_212]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_212]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_212]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]
Caused by: org.rocksdb.RocksDBException: bad entry in block
at org.rocksdb.RocksDB.get(Native Method) ~[rocksdbjni-6.13.3.jar:?]
at org.rocksdb.RocksDB.get(RocksDB.java:1948) ~[rocksdbjni-6.13.3.jar:?]
at com.techspot.store.RocksStore.packetExists(RocksStore.java:402) ~[techspot-encoder-dev.jar:?]
at com.techspot.store.RocksStore.encodePacket(RocksStore.java:634) ~[techspot-encoder-dev.jar:?]
... 6 more
The error first started occurring when the number of entries was ~350 million and the database size was ~18 GB. This issue is hard to reproduce; I tried inserting almost 700 million entries but could not trigger it again. I'm using RocksDB version 6.13.3 with the following options:
import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.CompactionStyle;
import org.rocksdb.CompressionType;
import org.rocksdb.Options;

Options options = new Options();
BlockBasedTableConfig blockBasedTableConfig = new BlockBasedTableConfig();
blockBasedTableConfig.setBlockSize(16 * 1024);// 16Kb
options.setWriteBufferSize(64 * 1024 * 1024);// 64MB
options.setMaxWriteBufferNumber(8);
options.setMinWriteBufferNumberToMerge(1);
options.setTableCacheNumshardbits(8);
options.setLevelZeroSlowdownWritesTrigger(1000);
options.setLevelZeroStopWritesTrigger(2000);
options.setLevelZeroFileNumCompactionTrigger(1);
options.setCompressionType(CompressionType.LZ4_COMPRESSION);
options.setTableFormatConfig(blockBasedTableConfig);
options.setCompactionStyle(CompactionStyle.UNIVERSAL);
options.setCreateIfMissing(Boolean.TRUE);
options.setEnablePipelinedWrite(true);
options.setIncreaseParallelism(8);
Does anyone have any idea what might be the cause for this exception?

Random Exception : Futures timed out after Exception in Spark Jobs

I'm getting the following error when running a Spark job on Spark 2.0.
The error is random in nature and does not occur every time.
Once the tasks are created, most of them complete properly while a few hang and throw the following error after a while.
I have tried increasing the properties spark.executor.heartbeatInterval and spark.network.timeout, but to no avail.
17/07/23 20:46:35 WARN NettyRpcEndpointRef: Error sending message [message = Heartbeat(driver,[Lscala.Tuple2;@597e9d16,BlockManagerId(driver, 128.164.190.35, 38337))] in 1 attempts
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.executor.heartbeatInterval
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518)
at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547)
at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1857)
at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:190)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:81)
... 14 more
Yes, the problem was indeed due to GC, which was pausing the tasks; changing the default GC to G1GC reduced the problem. Thanks.
-XX:+UseG1GC
https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html
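For completeness, a small sketch of one way to apply that flag to the executors through Spark's standard extraJavaOptions property (the application code around it is illustrative):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class G1GcExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("g1gc-example")
                // Every executor JVM starts with G1GC; the driver can get the same
                // flag via spark.driver.extraJavaOptions on the spark-submit command line.
                .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... job code ...
        sc.stop();
    }
}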

gRPC OOME and NPE in simple JMH benchmark

I tried gRPC, but gRPC uses immutable proto-buf message objects, and I'm hitting a lot of OOM errors like:
Exception in thread "grpc-default-executor-68" java.lang.OutOfMemoryError: Direct buffer memory
at java.nio.Bits.reserveMemory(Bits.java:658)
at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:645)
at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:228)
at io.netty.buffer.PoolArena.allocate(PoolArena.java:204)
at io.netty.buffer.PoolArena.allocate(PoolArena.java:132)
at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:262)
at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:157)
at io.netty.buffer.AbstractByteBufAllocator.buffer(AbstractByteBufAllocator.java:93)
at io.grpc.netty.NettyWritableBufferAllocator.allocate(NettyWritableBufferAllocator.java:66)
at io.grpc.internal.MessageFramer.writeKnownLength(MessageFramer.java:182)
at io.grpc.internal.MessageFramer.writeUncompressed(MessageFramer.java:135)
at io.grpc.internal.MessageFramer.writePayload(MessageFramer.java:125)
at io.grpc.internal.AbstractStream.writeMessage(AbstractStream.java:165)
at io.grpc.internal.AbstractServerStream.writeMessage(AbstractServerStream.java:108)
at io.grpc.internal.ServerImpl$ServerCallImpl.sendMessage(ServerImpl.java:496)
at io.grpc.stub.ServerCalls$ResponseObserver.onNext(ServerCalls.java:241)
at play.bench.BenchGRPC$CounterImpl$1.onNext(BenchGRPC.java:194)
at play.bench.BenchGRPC$CounterImpl$1.onNext(BenchGRPC.java:191)
at io.grpc.stub.ServerCalls$2$1.onMessage(ServerCalls.java:191)
at io.grpc.internal.ServerImpl$ServerCallImpl$ServerStreamListenerImpl.messageRead(ServerImpl.java:546)
at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1.run(ServerImpl.java:417)
at io.grpc.internal.SerializingExecutor$TaskRunner.run(SerializingExecutor.java:154)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
io.grpc.StatusRuntimeException: CANCELLED
at io.grpc.Status.asRuntimeException(Status.java:430)
at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:266)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$3.run(ClientCallImpl.java:320)
at io.grpc.internal.SerializingExecutor$TaskRunner.run(SerializingExecutor.java:154)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I'm not sure whether this is caused by object creation. I gave 5 GB of memory to this process and it still OOMs; I need some help.
EDIT
I put my benchmark, proto, dependencies and an example in this gist. The problem is that memory usage climbs very high and sooner or later causes an OOME, and there is also a strange NPE:
SEVERE: Exception while executing runnable io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$2@312546d9
java.lang.NullPointerException
at io.netty.buffer.PoolChunk.initBufWithSubpage(PoolChunk.java:378)
at io.netty.buffer.PoolChunk.initBufWithSubpage(PoolChunk.java:369)
at io.netty.buffer.PoolArena.allocate(PoolArena.java:194)
at io.netty.buffer.PoolArena.allocate(PoolArena.java:132)
at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:262)
at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:157)
at io.netty.buffer.AbstractByteBufAllocator.buffer(AbstractByteBufAllocator.java:93)
at io.grpc.netty.NettyWritableBufferAllocator.allocate(NettyWritableBufferAllocator.java:66)
at io.grpc.internal.MessageFramer.writeKnownLength(MessageFramer.java:182)
at io.grpc.internal.MessageFramer.writeUncompressed(MessageFramer.java:135)
at io.grpc.internal.MessageFramer.writePayload(MessageFramer.java:125)
at io.grpc.internal.AbstractStream.writeMessage(AbstractStream.java:165)
at io.grpc.internal.AbstractServerStream.writeMessage(AbstractServerStream.java:108)
at io.grpc.internal.ServerImpl$ServerCallImpl.sendMessage(ServerImpl.java:496)
at io.grpc.stub.ServerCalls$ResponseObserver.onNext(ServerCalls.java:241)
at play.bench.BenchGRPCOOME$CounterImpl.inc(BenchGRPCOOME.java:150)
at play.bench.CounterServerGrpc$1.invoke(CounterServerGrpc.java:171)
at play.bench.CounterServerGrpc$1.invoke(CounterServerGrpc.java:166)
at io.grpc.stub.ServerCalls$1$1.onHalfClose(ServerCalls.java:154)
at io.grpc.internal.ServerImpl$ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerImpl.java:562)
at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$2.run(ServerImpl.java:432)
at io.grpc.internal.SerializingExecutor$TaskRunner.run(SerializingExecutor.java:154)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
The problem is that StreamObserver.onNext does not block, so there is no push-back when you write too much. There is an open issue for this. There needs to be a way for you to interact with flow control and be informed that you should slow your sending rate. For the client side, a workaround is to use ClientCall directly: you call Channel.newCall and then pay attention to isReady() and onReady(). For the server side there isn't an easy workaround.
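A rough sketch of that client-side workaround, assuming a streaming method descriptor from the generated stub (everything except the io.grpc types is illustrative): send only while isReady() is true and resume from onReady() once the transport has drained its buffers.

import io.grpc.CallOptions;
import io.grpc.Channel;
import io.grpc.ClientCall;
import io.grpc.Metadata;
import io.grpc.MethodDescriptor;
import java.util.Iterator;

final class FlowControlledSender {
    static <ReqT, RespT> void send(Channel channel,
                                   MethodDescriptor<ReqT, RespT> method,
                                   Iterator<ReqT> requests) {
        final ClientCall<ReqT, RespT> call = channel.newCall(method, CallOptions.DEFAULT);
        call.start(new ClientCall.Listener<RespT>() {
            private boolean halfClosed;

            @Override
            public void onReady() {
                // Write only while the transport can absorb more data; onReady()
                // fires again after the buffered messages have been flushed.
                while (call.isReady() && requests.hasNext()) {
                    call.sendMessage(requests.next());
                }
                if (!requests.hasNext() && !halfClosed) {
                    halfClosed = true;
                    call.halfClose();
                }
            }

            @Override
            public void onMessage(RespT response) {
                call.request(1);   // keep pulling responses
            }
        }, new Metadata());
        call.request(1);           // request the first response
    }
}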

Apache Storm: the entire job hung after 2 ~ 3 days for unknown reason

Recently I submitted a Storm (0.9.5) job written in Python (2.7.6) using the multi-lang protocol. The bolt class initially inherited from BasicBolt (with acking), and I had not set max.spout.pending.
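# Python spout and bolt classes (storm.py multi-lang protocol):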
class SnifferSpout(storm.Spout):
def __init__(self):
...
class MonitorBolt(storm.BasicBolt):
def __init__(self):
...
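// Java side: topology definition and submission: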
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("sniffer", new SnifferSpout(), 1);
builder.setBolt("relation", new MonitorBolt(), 3).shuffleGrouping("sniffer");
Config conf = new Config();
conf.setDebug(false);
conf.setNumWorkers(4);
StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
It worked, but the entire job began to hang after 2~3 days, and I did not know why.
I also found that in the first few hours the process latency was unbelievably high (in contrast to the execute latency), and that the process latency would sometimes climb very high (e.g. 36456 ms).
Moreover, in the log of one worker, I found
2015-10-31T12:14:30.784+0000 b.s.s.ShellSpout [ERROR] Halting process: ShellSpout died.
java.lang.RuntimeException: subprocess heartbeat timeout
at backtype.storm.spout.ShellSpout$SpoutHeartbeatTimerTask.run(ShellSpout.java:261) [storm-core-0.9.5.jar:0.9.5]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [na:1.7.0_79]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) [na:1.7.0_79]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) [na:1.7.0_79]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.7.0_79]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_79]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_79]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_79]
2015-10-31T12:14:30.795+0000 b.s.d.executor [ERROR]
java.lang.RuntimeException: subprocess heartbeat timeout
at backtype.storm.spout.ShellSpout$SpoutHeartbeatTimerTask.run(ShellSpout.java:261) [storm-core-0.9.5.jar:0.9.5]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [na:1.7.0_79]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) [na:1.7.0_79]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) [na:1.7.0_79]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.7.0_79]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_79]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_79]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_79]
I suspected that the issue was caused by OOM, so I checked memory after the job hung and found three Java processes, each consuming about 22% of main memory, while the Python processes each consumed only about 1.x% of memory.
I could not confirm that memory was the issue, so I tried removing acking by using Bolt instead of BasicBolt, together with setting max.spout.pending to 200.
Then something worse happened: the job consumed memory very fast (free memory dropped from 2.83 GB to 480 MB in about 10 minutes), and the executors restarted roughly every 10 minutes due to OOM.
Could anybody help find the root cause of why this happens?
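For reference, the max.spout.pending setting mentioned above is normally applied on the topology Config, along the lines of the sketch below; note that it is only honored when tuples are anchored and acked, so it has no effect once acking is removed.

conf.setMaxSpoutPending(200);   // cap on un-acked tuples in flight per spout task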
