On one of our platforms, HDFS namenode is shutting down with following error message every 1 or 3 days
FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(390)) - Error: flush failed for required journal (JournalAndStream(mgr=QJM to [<ip1>:<port>,<ip2>:<port>, etc], stream=QuorumOutputStream starting at txid 29873171))
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:109)
at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:525)
at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:385)
at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:55)
at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:521)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:710)
at org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync.run(FSEditLogAsync.java:188)
at java.lang.Thread.run(Thread.java:748)
Before this FATAL log we can see following kind of logs, on which we can detect a degradation of response time
WARN client.QuorumJournalManager (QuorumCall.java:waitFor(185)) - Waited 18014 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [<ip1>:<port>,<ip2>:<port>]
Have you already encountered this problem, and do you have any advices to fix it ?
We have already:
checked that our VMs are time synchronized
detected that when the problem occurs a burst of data on the network is in progress, without detecting the root cause yet
checked our network devices. Except a problem on a port which goes from UP state to DOWN state quickly that we are going to fix, the network seems correct
Thanks in advance
I have an odd issue with my Tomcat + Spring websocket application. When a user disconnects without sending a "closing" signal, due to power loss or a plug pull, the thread will block about 10 seconds later.
The thread blocks on this function :
org.springframework.web.socket.WebSocketSession.sendMessage(WebSocketMessage<?> wsm) throws IOException;
I have tried putting a line in my AppConfig to try and set a timeout of 3 seconds but it does not seem to work properly as the block seems to go on for upwards of 15 minutes before throwing an exception.
#Bean(name="servletServerContainerFactoryBean")
public int maxSessionIdleTimeout() {
return 3000;
}
Here is the stack trace after an eventual SocketTimeoutException
Step: 2304
SendB -> test isOpen -> sendMes -> Done -> Finished Send.
SendB -> test2 isOpen -> sendMes -> User closed connection during packet sending: s01
Propogating exception up!
java.io.IOException: java.net.SocketTimeoutException
at org.apache.tomcat.websocket.WsRemoteEndpointImplBase.sendMessageBlock(WsRemoteEndpointImplBase.java:315)
at org.apache.tomcat.websocket.WsRemoteEndpointImplBase.sendMessageBlock(WsRemoteEndpointImplBase.java:250)
at org.apache.tomcat.websocket.WsRemoteEndpointImplBase.sendPartialString(WsRemoteEndpointImplBase.java:223)
at org.apache.tomcat.websocket.WsRemoteEndpointBasic.sendText(WsRemoteEndpointBasic.java:49)
at org.springframework.web.socket.adapter.standard.StandardWebSocketSession.sendTextMessage(StandardWebSocketSession.java:197)
at org.springframework.web.socket.adapter.AbstractWebSocketSession.sendMessage(AbstractWebSocketSession.java:102)
at org.infpls.noxio.game.module.game.session.NoxioSession.sendPacket(NoxioSession.java:40)
at org.infpls.noxio.game.module.game.dao.lobby.GameLobby.step(GameLobby.java:117)
at org.infpls.noxio.game.module.game.dao.lobby.GameLobby$GameLoop.run(GameLobby.java:274)
Caused by: java.net.SocketTimeoutException
at org.apache.tomcat.websocket.server.WsRemoteEndpointImplServer.doWrite(WsRemoteEndpointImplServer.java:81)
at org.apache.tomcat.websocket.WsRemoteEndpointImplBase.writeMessagePart(WsRemoteEndpointImplBase.java:494)
at org.apache.tomcat.websocket.WsRemoteEndpointImplBase.sendMessageBlock(WsRemoteEndpointImplBase.java:309)
... 8 more
## CRITICAL ## Ejecting player: test2 :: Exception thrown on packet send...
java.io.IOException: java.net.SocketTimeoutException
at org.apache.tomcat.websocket.WsRemoteEndpointImplBase.sendMessageBlock(WsRemoteEndpointImplBase.java:315)
at org.apache.tomcat.websocket.WsRemoteEndpointImplBase.sendMessageBlock(WsRemoteEndpointImplBase.java:250)
at org.apache.tomcat.websocket.WsRemoteEndpointImplBase.sendPartialString(WsRemoteEndpointImplBase.java:223)
at org.apache.tomcat.websocket.WsRemoteEndpointBasic.sendText(WsRemoteEndpointBasic.java:49)
at org.springframework.web.socket.adapter.standard.StandardWebSocketSession.sendTextMessage(StandardWebSocketSession.java:197)
at org.springframework.web.socket.adapter.AbstractWebSocketSession.sendMessage(AbstractWebSocketSession.java:102)
at org.infpls.noxio.game.module.game.session.NoxioSession.sendPacket(NoxioSession.java:40)
at org.infpls.noxio.game.module.game.dao.lobby.GameLobby.step(GameLobby.java:117)
at org.infpls.noxio.game.module.game.dao.lobby.GameLobby$GameLoop.run(GameLobby.java:274)
Caused by: java.net.SocketTimeoutException
at org.apache.tomcat.websocket.server.WsRemoteEndpointImplServer.doWrite(WsRemoteEndpointImplServer.java:81)
at org.apache.tomcat.websocket.WsRemoteEndpointImplBase.writeMessagePart(WsRemoteEndpointImplBase.java:494)
at org.apache.tomcat.websocket.WsRemoteEndpointImplBase.sendMessageBlock(WsRemoteEndpointImplBase.java:309)
... 8 more
Finished Send. Step Finished.
Having threads be blocked for 15 minutes at a time is a major problem. Any info on why this happens and how to fix it would be greatly appreciated. Thank you!
Found an answer finally. It's actually a system setting.
Source
How many times to retry before deciding that something is wrong and
it is necessary to report this suspicion to network layer. Minimal RFC
value is 3, it is default, which corresponds to 3sec-8min depending on
RTO.
/proc/sys/net/ipv4/tcp_retries2
This question already has answers here:
What are possible reasons for receiving TimeoutException: Futures timed out after [n seconds] when working with Spark [duplicate]
(4 answers)
Closed 5 years ago.
I get the following error while running my spark streaming application, we have a large application running multiple stateful (with mapWithState) and stateless operations. It's getting difficult to isolate the error since spark itself hangs and the only error we see is in the spark log and not the application log itself.
The error happens only after abount 4-5 mins with a micro-batch interval of 10 seconds.
I am using Spark 1.6.1 on an ubuntu server with Kafka based input and output streams.
Please note it's not possible for me to provide the smallest possible code to re-create this bug as it does not occur in unit test-cases, and the application itself is very large
Any direction you can give to solve this issue will be helpful. Please let me know if I can provide any more information.
Error inline below:
[2017-07-11 16:15:15,338] ERROR Error cleaning broadcast 2211 (org.apache.spark.ContextCleaner)
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.askTimeout
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
at org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:136)
at org.apache.spark.broadcast.TorrentBroadcast$.unpersist(TorrentBroadcast.scala:228)
at org.apache.spark.broadcast.TorrentBroadcastFactory.unbroadcast(TorrentBroadcastFactory.scala:45)
at org.apache.spark.broadcast.BroadcastManager.unbroadcast(BroadcastManager.scala:77)
at org.apache.spark.ContextCleaner.doCleanupBroadcast(ContextCleaner.scala:233)
at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:189)
at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:180)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:180)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1180)
at org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:173)
at org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:68)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
Your exception message clearly says that its RPCTimeout due to default configuration of 120 seconds and adjust to optimal value as per your work load.
please see 1.6 configuration
your error messages org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds].
and
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76) confirms that.
For Better understanding please see the below code from
see RpcTimeout.scala
/**
* Wait for the completed result and return it. If the result is not available within this
* timeout, throw a [[RpcTimeoutException]] to indicate which configuration controls the timeout.
* #param awaitable the `Awaitable` to be awaited
* #throws RpcTimeoutException if after waiting for the specified time `awaitable`
* is still not ready
*/
def awaitResult[T](awaitable: Awaitable[T]): T = {
try {
Await.result(awaitable, duration)
} catch addMessageIfTimeout
}
}
Also see my answer in another context
I have a pretty complex java program, which doesn't terminate.
The eclipse debugger is showing a thread that can be suspended, but has no stack trace.
It is called "Thread-2".
The jstack -l output for this thread is:
"Thread-2" #17 prio=5 os_prio=0 tid=0x00007f1268002800 nid=0x3342 runnable [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
Locked ownable synchronizers:
- None
I added a breakpoint to Thread.start(), but I cannot find a thread called "Thread-2".
The thread just appears after the two "AWT-Event-Queue" threads were created.
I don't create any threads manually in my program.
After the main thread and all other threads exit, and the JFrame is disposed, the following threads still exist:
Thread [AWT-EventQueue-0] (Running)
Thread [Thread-2] (Running)
Thread [DestroyJavaVM] (Running)
When suspending the VM, the following threads exist:
Daemon System Thread [Signal Dispatcher] (Suspended)
Daemon System Thread [Finalizer] (Suspended)
Daemon System Thread [Reference Handler] (Suspended)
Daemon System Thread [Java2D Disposer] (Suspended)
Daemon System Thread [AWT-XAWT] (Suspended)
Thread [AWT-EventQueue-0] (Suspended)
Thread [Thread-2] (Suspended)
Thread [DestroyJavaVM] (Suspended)
How can I get more information about this thread, or allow it to terminate?
EDIT 1:
According to the Dependency Hierarchy of the eclipse pom.xml view, I use the following third party libraries:
guava 17.0 [compile]
hamcrest-core 1.3 [test]
junit 4.11 [test]
log4j-api 2.0-beta9 [compile]
log4j-core 2.0-beta9 [compile]
EDIT 2:
Adding breakpoints to all constructors of the thread class, as suggested in https://stackoverflow.com/a/35128213/577485, I see that Thread-0 and Thread-1 are created by log4j, but not Thread-2. It still just appears as before and no breakpoint triggers when it is constructed.
EDIT 3:
Now it's getting creepy. Not even the stop() method works, when invoked on the thread. I added it to the code given in https://stackoverflow.com/a/35128149/577485.
At least System.exit(int) still works. But as said in a comment, I don't want to use that.
EDIT 4:
Info about my system:
I'm running the newest stable version of Ubuntu 15.10 Wily. I have the security, updates and backports repos enabled.
My JVM version is:
java version "1.7.0_91"
OpenJDK Runtime Environment (IcedTea 2.6.3) (7u91-2.6.3-0ubuntu0.15.10.1)
OpenJDK 64-Bit Server VM (build 24.91-b01, mixed mode)
EDIT 5:
I executed the program with Java version jre-8u71-linux-x64 directly downloaded from java.com, but the error persists. jstack -l shows the same strange thread. Note that the program was still built with the older java version. Edit: After compiling it with java8u72 from oracle.com, I get the same behaviour.
EDIT 6:
I iterated over all the threads fields, here's the output. I cannot get any hint out of those fields, the thread doesn't even have a target.
name: [C#6b67034
priority: 5
threadQ: null
eetop: 140274638530560
single_step: false
daemon: false
stillborn: false
target: null
group: java.lang.ThreadGroup[name=main,maxpri=10]
contextClassLoader: null
inheritedAccessControlContext: java.security.AccessControlContext#0
threadInitNumber: 3
threadLocals: null
inheritableThreadLocals: null
stackSize: 0
nativeParkEventPointer: 0
tid: 17
threadSeqNumber: 20
threadStatus: 5
parkBlocker: null
blocker: null
blockerLock: java.lang.Object#16267862
MIN_PRIORITY: 1
NORM_PRIORITY: 5
MAX_PRIORITY: 10
EMPTY_STACK_TRACE: [Ljava.lang.StackTraceElement;#453da22c
SUBCLASS_IMPLEMENTATION_PERMISSION: ("java.lang.RuntimePermission" "enableContextClassLoaderOverride")
uncaughtExceptionHandler: null
defaultUncaughtExceptionHandler: null
threadLocalRandomSeed: 0
threadLocalRandomProbe: 0
threadLocalRandomSecondarySeed: 0
EDIT 7:
Added a watchpoint to the name field of Thread. It is only accessed by my analytics code, and seems to be never written...
EDIT 8:
jstack -F -m throws an error for my program:
Attaching to process ID 10973, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.71-b15
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at sun.tools.jstack.JStack.runJStackTool(JStack.java:140)
at sun.tools.jstack.JStack.main(JStack.java:106)
Caused by: java.lang.RuntimeException: Unable to deduce type of thread from address 0x00007ff68000c000 (expected type JavaThread, CompilerThread, ServiceThread, JvmtiAgentThread, or SurrogateLockerThread)
at sun.jvm.hotspot.runtime.Threads.createJavaThreadWrapper(Threads.java:169)
at sun.jvm.hotspot.runtime.Threads.first(Threads.java:153)
at sun.jvm.hotspot.tools.PStack.initJFrameCache(PStack.java:200)
at sun.jvm.hotspot.tools.PStack.run(PStack.java:71)
at sun.jvm.hotspot.tools.PStack.run(PStack.java:58)
at sun.jvm.hotspot.tools.PStack.run(PStack.java:53)
at sun.jvm.hotspot.tools.JStack.run(JStack.java:66)
at sun.jvm.hotspot.tools.Tool.startInternal(Tool.java:260)
at sun.jvm.hotspot.tools.Tool.start(Tool.java:223)
at sun.jvm.hotspot.tools.Tool.execute(Tool.java:118)
at sun.jvm.hotspot.tools.JStack.main(JStack.java:92)
... 6 more
Caused by: sun.jvm.hotspot.types.WrongTypeException: No suitable match for type of address 0x00007ff68000c000
at sun.jvm.hotspot.runtime.InstanceConstructor.newWrongTypeException(InstanceConstructor.java:62)
at sun.jvm.hotspot.runtime.VirtualConstructor.instantiateWrapperFor(VirtualConstructor.java:80)
at sun.jvm.hotspot.runtime.Threads.createJavaThreadWrapper(Threads.java:165)
... 16 more
The class name of the strange thread is java.lang.Thread.
I don't use any command line arguments for executing the program. Adding the -Dlog4j2.disable.jmx=true option gives the strange thread the name Thread-1.
I updated log4j to 2.5, and the strange thread now has name Thread-0, when the -Dlog4j2.disable.jmx=true option is given, and Thread-1, when it isn't.
EDIT 9:
Completely removed log4j, error persists. The thread is now called Thread-0.
Here's the project, if that helps.
I don't understand why the breakpoint on Thread.start() did not work, but you could also try intercepting the thread >>creation<< by setting a breakpoint on the Thread constructors, or on the (internal) Thread.init() method.
The fact that the thread has the name Thread-2 implies that it was created by one of the constructors that generates a default thread name. That suggests that it was not created by the JVM or standard Java class libraries. It also narrows down the constructors that could have been used to create it.
How can I get more information about this thread ...
I can't think of any way apart from setting breakpoints.
... or allow it to terminate?
If you can find where it is created, you should be able to use setDaemon(true) to mark it as a daemon thread. However, this needs to be done before the thread is started.
Another possibility would be to find the thread by traversing the ThreadGroup tree and then calling Thread.interrupt() on it. ( Thread.getAllStackTraces() is another way of tracking down the thread object. ) However, there is no guarantee that the thread will "respect" the interrupt and shut down.
Finally, you could just call System.exit(...).
UPDATE
I mentioned that the thread might not respect interrupt() and I'm not surprised that stop() doesn't work. (It is deprecated, and may not even be implemented on some platforms.)
However, if you have managed to implement code that actually finds the mystery thread, you could dig around to find either the Thread subclass, or the Runnable that it is instantiated with. If you can print out the fully qualified class name, that will give you a big clue as to where it comes from. (Assuming that you are still having no success with breakpoints, then you may need to use "nasty" reflection to extract the runnable from the thread's private target field.)
Not sure if this enough for you, but the following code will allow you to try to interrupt any Thread by its name:
//Set of current Threads
Set<Thread> setOfThread = Thread.getAllStackTraces().keySet();
//Iterate over set to find yours
for(Thread thread : setOfThread){
if (thread.getName().equals("Thread-2")) {
thread.interrupt();
break;
}
}
Also, take a look on this article from JavaSpecialists that tries to identify the creator of a Thread based on the fact that the constructor of Thread makes a call to the security manager. If we add a custom SecurityManager to our System, we may track the initiator of a Thread.
First off, you should change your _exit flag to volatile, since it's read from one thread (your main method), and written by another (JCF/Swing event handler), so it's possible your main thread isn't getting a "fresh" value. Specifically: the thread may be saving the field to a CPU register and not reloading it from memory as you're looping. 'volatile' will prevent that behavior:
private volatile boolean _exit;
However, based on your stack traces I don't think that's your problem, since we don't see your main method in there. But this should be done anyway, it's just good practice.
Assuming that doesn't fix it, I'm guessing your problem is that you have at least one other Window (besides AgentFrame) that isn't being disposed. The AWT thread won't stop until all Windows are disposed.
Put this at the end of your main method:
System.out.println(Arrays.toString(Window.getWindows()))
I'm guessing you're going to see more than just your DrawFrame there. If I were to guess, I'd say UISettingsFrame is in there, undisposed.
After running my application, i am getting this error after around 5 mins.
Even though i am returning the resource after use, i keep getting this.
I have built jedis-2.2.2-SNAPSHOT.jar from the jedis code base, since its not released yet
I had set the minIdle = 100, maxIdle=200 & maxActive=200. At the time of this exception, the connection count to redis was 122 from my application
redis.clients.jedis.exceptions.JedisConnectionException: Could not get a resource from the pool
at redis.clients.util.Pool.getResource(Pool.java:42)
Caused by: java.util.NoSuchElementException: Timeout waiting for idle object
at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:442)
at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:360)
at redis.clients.util.Pool.getResource(Pool.java:40)
... 6 more
Did you check that redis is still up & running ?
If not, investigate why it died.
try a redis-cli in a terminal if you can. "info" would give you more details.