SseEventSource.open() takes too long - java

I am using javax.ws.rs.sse.SseEventSource to listen to events sent by my API. Here is my implementation:
String bearer = getBearerToken();
Client client = ClientBuilder.newClient().register((ClientRequestFilter) clientRequestContext -> {
    MultivaluedMap<String, Object> headers = clientRequestContext.getHeaders();
    headers.add("Authorization", "Bearer " + bearer);
});
SseEventSource sseEventSource = SseEventSource.target(client.target(url)).build();
sseEventSource.register(this::onMessage, this::onError);
sseEventSource.open();
Everything works great except that sseEventSource.open() blocks for a substantial amount of time (15-40 s). When I print the thread state I get the following:
"main" #1 prio=5 os_prio=31 cpu=2616.95ms elapsed=34.80s tid=0x00007fbd38009800 nid=0x1703 waiting on condition [0x00007000046a7000]
java.lang.Thread.State: WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base#11.0.6/Native Method)
- parking to wait for <0x000000070d8cbe00> (a java.util.concurrent.CountDownLatch$Sync)
at java.util.concurrent.locks.LockSupport.park(java.base#11.0.6/LockSupport.java:194)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base#11.0.6/AbstractQueuedSynchronizer.java:885)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(java.base#11.0.6/AbstractQueuedSynchronizer.java:1039)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(java.base#11.0.6/AbstractQueuedSynchronizer.java:1345)
at java.util.concurrent.CountDownLatch.await(java.base#11.0.6/CountDownLatch.java:232)
at org.jboss.resteasy.plugins.providers.sse.client.SseEventSourceImpl$EventHandler.awaitConnected(SseEventSourceImpl.java:417)
at org.jboss.resteasy.plugins.providers.sse.client.SseEventSourceImpl.open(SseEventSourceImpl.java:174)
at org.jboss.resteasy.plugins.providers.sse.client.SseEventSourceImpl.open(SseEventSourceImpl.java:163)
at org.jboss.resteasy.plugins.providers.sse.client.SseEventSourceImpl.open(SseEventSourceImpl.java:158)
What is the thread waiting for? Is it expecting something to be published in the stream before proceeding, and if so, is it possible to configure it otherwise? Either way, my main question is: how can I make SseEventSource.open() return faster?
Update: I found out that SseEventSource.open() waits until the first message is received. This is unfortunate because the actual connection is established quickly, yet I still have to wait about 30 s for the method to return. So far I have not been able to find a good solution for this.
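One option I am considering is to at least stop open() from blocking startup by calling it on a separate thread; the wait still happens, just not on the caller. A minimal sketch (the executor and its lifecycle are my own additions, not part of the API usage above):
ExecutorService sseExecutor = Executors.newSingleThreadExecutor();
SseEventSource sseEventSource = SseEventSource.target(client.target(url)).build();
sseEventSource.register(this::onMessage, this::onError);
// open() still parks internally until the first event arrives,
// but the calling thread is free to continue immediately.
sseExecutor.submit(sseEventSource::open);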

Related

Fork Join pool hangs

From time to time the application hangs indefinitely.
The bug seems to sit in the following snippet:
ForkJoinPool pool = new ForkJoinPool(1); // parallelism = 1
List<String> entries = ...;
pool.submit(() -> {
    entries.stream().parallel().forEach(entry -> {
        // An I/O op.
        ...
    });
}).get();
Thread pool-4-thread-1 that executes the code freezes on get():
"pool-4-thread-1" #35 prio=5 os_prio=0 tid=0x00002b42e4013800 nid=0xb7d1 in Object.wait() [0x00002b427b72f000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.util.concurrent.ForkJoinTask.externalInterruptibleAwaitDone(ForkJoinTask.java:367)
- locked <0x00000000e08b68b8> (a java.util.concurrent.ForkJoinTask$AdaptedRunnableAction)
at java.util.concurrent.ForkJoinTask.get(ForkJoinTask.java:1001)
...other app methods
One could assume that the task passed to submit() simply takes too long.
But surprisingly there are no ForkJoinPool-N-worker-N occurrences in the thread dump, so it looks like the pool doesn't perform any computations at all!
How is that possible? If no tasks are executed by the pool, why does the pool-4-thread-1 thread wait inside get()?
P.S. I know that it's not recommended to execute I/O-related tasks in ForkJoinPool, but still interested in the root of the problem.
Update: when parallelism is set to a value greater than 1, no problems are detected.
Setting parallelism = N where N > 1 solved the problem.
Strange, but it seems there is some bug in ForkJoinPool similar to what is stated here.
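For completeness, the workaround in code (a sketch; the list contents and the I/O body are placeholders):
ForkJoinPool pool = new ForkJoinPool(2); // parallelism >= 2 avoids the hang
List<String> entries = Arrays.asList("a", "b", "c"); // placeholder data
pool.submit(() -> {
    entries.stream().parallel().forEach(entry -> {
        // the original I/O operation goes here
    });
}).get(); // no longer blocks forever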

“java.lang.OutOfMemoryError : unable to create new native Thread”

I am getting this error while trying to post messages to a queue using JmsTemplate.
I am pushing more than 10,000 messages to the queue in a for loop when I get this error.
I have a limit of 4096 threads on my system.
In the thread dumps I mostly see threads like the one below.
"SimpleAsyncTaskExecutor-36991" prio=10 tid=0x00007f857dc13000 nid=0x5a9 waiting on condition [0x00007f83f21e9000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x0000000748860208> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
at java.util.concurrent.ArrayBlockingQueue.take(ArrayBlockingQueue.java:374)
at org.apache.activemq.transport.FutureResponse.getResult(FutureResponse.java:40)
at org.apache.activemq.transport.ResponseCorrelator.request(ResponseCorrelator.java:87)
at org.apache.activemq.ActiveMQConnection.syncSendPacket(ActiveMQConnection.java:1394)
at org.apache.activemq.ActiveMQSession.syncSendPacket(ActiveMQSession.java:1925)
at org.apache.activemq.ActiveMQMessageProducer.<init>(ActiveMQMessageProducer.java:125)
at org.apache.activemq.ActiveMQSession.createProducer(ActiveMQSession.java:969)
at org.apache.activemq.jms.pool.PooledSession.getMessageProducer(PooledSession.java:395)
at org.apache.activemq.jms.pool.PooledSession.createProducer(PooledSession.java:359)
at org.springframework.jms.core.JmsTemplate.doCreateProducer(JmsTemplate.java:1044)
at org.springframework.jms.core.JmsTemplate.createProducer(JmsTemplate.java:1025)
at org.springframework.jms.core.JmsTemplate.doSend(JmsTemplate.java:598)
at org.springframework.jms.core.JmsTemplate$3.doInJms(JmsTemplate.java:569)
at org.springframework.jms.core.JmsTemplate.execute(JmsTemplate.java:491)
at org.springframework.jms.core.JmsTemplate.send(JmsTemplate.java:566)
at org.springframework.jms.core.JmsTemplate.convertAndSend(JmsTemplate.java:655)
at org.springframework.jms.core.JmsTemplate.convertAndSend(JmsTemplate.java:646)
at com.jato.frameworks.commons.util.JATOJMSProducer.sendMessageToQueue(JATOJMSProducer.java:59)
I suspect that the threads created to post messages to the queue are being created faster than they are being terminated.
Is there any way to solve this?
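If the sends really are being handed off to an executor that spawns a new thread per message (which is what the SimpleAsyncTaskExecutor-NNNNN thread names suggest), bounding the concurrency should stop the native-thread exhaustion. A rough sketch, with the queue name and pool size as placeholders:
ExecutorService senders = Executors.newFixedThreadPool(16); // bounded, reused threads
for (int i = 0; i < 10000; i++) {
    final String payload = "message-" + i;
    senders.submit(() -> jmsTemplate.convertAndSend("myQueue", payload));
}
senders.shutdown(); // let the queued sends drain, then release the threads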

How to determine where Java thread interrupts are coming from?

I've got a UI automation framework that launches tests using TestNG and runs through pages using Selenium/WebDriver. Oftentimes the pages I'm testing make AJAX calls that modify the DOM upon returning. In these cases I use Selenium explicit waits to declare a DOM condition that I want to be met before the automation can proceed (e.g. some button gets enabled).
Internally Selenium's FluentWait.until method handles this by polling the DOM for my ExpectedCondition every 500ms and calling Thread.sleep() in-between these checks.
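For context, the explicit waits in question look roughly like this (the locator and timeout are illustrative):
WebDriverWait wait = new WebDriverWait(driver, 30); // polls every 500 ms, for up to 30 s
wait.until(ExpectedConditions.elementToBeClickable(By.id("someButton")));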
When I run two tests back to back in a TestNG suite this works perfectly fine for the first test, but starts to fail with an InterruptedException about halfway through each subsequent test. This is consistent. The exceptions look like this:
Associated Throwable Type: class org.openqa.selenium.WebDriverException Associated Throwable Message: java.lang.InterruptedException: sleep interrupted
The strange thing is that there's no multi-threading going on here. I've disabled Selenium Grid, BrowserMob Proxy, and every other bit of code that could be conflicting. I've read both of these questions:
https://stackoverflow.com/questions/24495176/why-is-thread-sleep-being-interrupted - Closed for not providing enough detail, but one of the proposed answers states that one should override the Thread.interrupt method for debugging.
Who interrupts my thread? - Accepted answer also states that one should override the Thread.interrupt method for debugging.
My problem with this solution is that placing a breakpoint inside the existing Thread.interrupt method does not reveal any calls around the time that the thread is interrupted. This includes calls from all of my third party dependencies (i.e. TestNG and Selenium). Whatever is calling this thread interrupt appears to be external to my framework.
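For reference, the override those answers describe looks roughly like this; it only helps for threads you construct yourself, which is not the case for the TestNG worker being interrupted here (someRunnable is a placeholder):
Thread worker = new Thread(someRunnable, "traceable-worker") {
    @Override
    public void interrupt() {
        // Capture who is interrupting us before delegating to the normal behavior.
        new Exception("interrupt() called on " + getName()).printStackTrace();
        super.interrupt();
    }
};
worker.start();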
I've also tried calling Thread.currentThread().isInterrupted() at every point prior to the FluentWait.until call and it consistently returns false. I've even used IntelliJ's evaluate function to check isInterrupted inside the Selenium code itself. The thread is only being interrupted once the Thread.sleep call occurs inside FluentWait.until.
I've seen this happen on multiple Windows build servers as well as on my Macbook, so this does not appear to be machine specific.
I thought for a while that this might be caused by a TestNG timeout, but reducing the TestNG timeout in my suite yielded a different behavior than these interruptions.
Currently I'm working around this issue with the following code which swallows the exception and resumes the explicit wait:
public static boolean waitForElementStatus(Stuff)
{
    /* snip - setup for ExpectedCondition (change) */
    long startSeconds = new Date().getTime() / 1000;
    long currentSeconds = startSeconds;
    long remainingSeconds = maxElementStatusChangeSeconds;
    WebDriverWait waitForElement = new WebDriverWait(driver, maxElementStatusChangeSeconds);
    boolean changed = false;
    boolean firstWait = true; // If specified time is 0 we still want to check once.
    out:while(firstWait || remainingSeconds > 0)
    {
        firstWait = false;
        boolean exceptionThrown = false;
        try
        {
            waitForElement.until(change);
        }
        catch(Throwable t)
        {
            exceptionThrown = true;
            if(t.getCause() != null)
            {
                t = t.getCause(); // InterruptedException is wrapped inside a WebDriverException
            }
            if(t.getClass().equals(InterruptedException.class))
            {
                Thread.interrupted(); // clear interrupt status for this thread
                currentSeconds = new Date().getTime() / 1000;
                remainingSeconds = startSeconds + maxElementStatusChangeSeconds - currentSeconds;
                if(remainingSeconds > 0)
                {
                    String warning = String.format("Caught unidentified interrupt inside Selenium " +
                        "FluentWait.until call. Swallowing interrupt and repeating call with [%s] seconds " +
                        "remaining.", remainingSeconds);
                    CombinedLogger.warn(warning);
                    waitForElement = new WebDriverWait(driver, remainingSeconds);
                }
                else
                {
                    // If a timeout exception would have been thrown instead of the interruption then
                    // we'll allow the WebDriverWait to execute one last time so it can throw the
                    // timeout instead.
                    waitForElement = new WebDriverWait(driver, 0);
                }
            }
            else if(haltOnFailure) // for any other exception type such as TimeoutException
            {
                CombinedLogger.error(stuff + "...FAILURE(HALTING)", t);
                break out;
            }
            else // for any other exception type such as TimeoutException
            {
                CombinedLogger.info(stuff + "...failure(non-halting)");
                break out;
            }
        }
        if(!exceptionThrown)
        {
            changed = true;
            CombinedLogger.info(stuff + "...success ");
            break out;
        }
    }
    return changed;
}
This workaround does function, and fortunately these mystery interrupts are only occurring sporadically afterwards (they don't happen repeatedly), so the tests are able to proceed. However, I understand that swallowing InterruptedException is bad form. If possible, I'd like to determine where and why these interrupts are taking place so that I can put an end to them instead of using this hack.
Simply propagating the exceptions is not an option since these tests need to continue running instead of obediently crashing.
Are there any known utilities, JVM arguments, or libraries that I could use which would help me track down Java thread interruptions that are caused by code which is out of my control?
Update 12/10/2014: I've captured two thread dumps. One is from immediately before the interrupt and one is from immediately after it. The only difference between the two is the line number of the interrupted thread (it goes from the try block to the catch block after being interrupted). Not sure what this tells me, but here's the data:
Full thread dump (immediately before interrupt)
"TestNG#1359" prio=5 tid=0xc nid=NA runnable
java.lang.Thread.State: RUNNABLE
at org.openqa.selenium.support.ui.FluentWait.until(FluentWait.java:232)
/* snip - company stuff */
at sun.reflect.NativeMethodAccessorImpl.invoke0(NativeMethodAccessorImpl.java:-1)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:84)
at org.testng.internal.InvokeMethodRunnable.runOne(InvokeMethodRunnable.java:46)
at org.testng.internal.InvokeMethodRunnable.run(InvokeMethodRunnable.java:37)
at org.testng.internal.MethodInvocationHelper.invokeWithTimeoutWithNoExecutor(MethodInvocationHelper.java:240)
at org.testng.internal.MethodInvocationHelper.invokeWithTimeout(MethodInvocationHelper.java:229)
at org.testng.internal.Invoker.invokeMethod(Invoker.java:724)
at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:901)
at org.testng.internal.Invoker.invokeTestMethods(Invoker.java:1231)
at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:127)
at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:111)
at org.testng.TestRunner.privateRun(TestRunner.java:767)
at org.testng.TestRunner.run(TestRunner.java:617)
at org.testng.SuiteRunner.runTest(SuiteRunner.java:348)
at org.testng.SuiteRunner.access$000(SuiteRunner.java:38)
at org.testng.SuiteRunner$SuiteWorker.run(SuiteRunner.java:382)
at org.testng.internal.thread.ThreadUtil$2.call(ThreadUtil.java:64)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
"main#1" prio=5 tid=0x1 nid=NA waiting
java.lang.Thread.State: WAITING
at sun.misc.Unsafe.park(Unsafe.java:-1)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:422)
at java.util.concurrent.FutureTask.get(FutureTask.java:199)
at java.util.concurrent.AbstractExecutorService.invokeAll(AbstractExecutorService.java:289)
at org.testng.internal.thread.ThreadUtil.execute(ThreadUtil.java:72)
at org.testng.SuiteRunner.runInParallelTestMode(SuiteRunner.java:367)
at org.testng.SuiteRunner.privateRun(SuiteRunner.java:308)
at org.testng.SuiteRunner.run(SuiteRunner.java:254)
at org.testng.SuiteRunnerWorker.runSuite(SuiteRunnerWorker.java:52)
at org.testng.SuiteRunnerWorker.run(SuiteRunnerWorker.java:86)
at org.testng.TestNG.runSuitesSequentially(TestNG.java:1224)
at org.testng.TestNG.runSuitesLocally(TestNG.java:1149)
at org.testng.TestNG.run(TestNG.java:1057)
at org.testng.remote.RemoteTestNG.run(RemoteTestNG.java:111)
at org.testng.remote.RemoteTestNG.initAndRun(RemoteTestNG.java:204)
at org.testng.remote.RemoteTestNG.main(RemoteTestNG.java:175)
at org.testng.RemoteTestNGStarter.main(RemoteTestNGStarter.java:125)
"Thread-8#2432" daemon prio=5 tid=0x15 nid=NA runnable
java.lang.Thread.State: RUNNABLE
at java.io.FileInputStream.readBytes(FileInputStream.java:-1)
at java.io.FileInputStream.read(FileInputStream.java:272)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
- locked <0xe08> (a java.lang.UNIXProcess$ProcessPipeInputStream)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at org.apache.commons.exec.StreamPumper.run(StreamPumper.java:105)
at java.lang.Thread.run(Thread.java:745)
"Thread-7#2431" daemon prio=5 tid=0x14 nid=NA runnable
java.lang.Thread.State: RUNNABLE
at java.io.FileInputStream.readBytes(FileInputStream.java:-1)
at java.io.FileInputStream.read(FileInputStream.java:272)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
- locked <0xe09> (a java.lang.UNIXProcess$ProcessPipeInputStream)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at org.apache.commons.exec.StreamPumper.run(StreamPumper.java:105)
at java.lang.Thread.run(Thread.java:745)
"Thread-6#2424" prio=5 tid=0x13 nid=NA waiting
java.lang.Thread.State: WAITING
at java.lang.Object.wait(Object.java:-1)
at java.lang.Object.wait(Object.java:503)
at java.lang.UNIXProcess.waitFor(UNIXProcess.java:261)
at org.apache.commons.exec.DefaultExecutor.executeInternal(DefaultExecutor.java:347)
at org.apache.commons.exec.DefaultExecutor.access$200(DefaultExecutor.java:46)
at org.apache.commons.exec.DefaultExecutor$1.run(DefaultExecutor.java:188)
"process reaper#2008" daemon prio=10 tid=0x10 nid=NA runnable
java.lang.Thread.State: RUNNABLE
at java.lang.UNIXProcess.waitForProcessExit(UNIXProcess.java:-1)
at java.lang.UNIXProcess.access$500(UNIXProcess.java:54)
at java.lang.UNIXProcess$4.run(UNIXProcess.java:225)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
"ReaderThread#645" prio=5 tid=0xb nid=NA runnable
java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(SocketInputStream.java:-1)
at java.net.SocketInputStream.read(SocketInputStream.java:152)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:283)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:325)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:177)
- locked <0xe0b> (a java.io.InputStreamReader)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.fill(BufferedReader.java:154)
at java.io.BufferedReader.readLine(BufferedReader.java:317)
at java.io.BufferedReader.readLine(BufferedReader.java:382)
at org.testng.remote.strprotocol.BaseMessageSender$ReaderThread.run(BaseMessageSender.java:245)
"Finalizer#2957" daemon prio=8 tid=0x3 nid=NA waiting
java.lang.Thread.State: WAITING
at java.lang.Object.wait(Object.java:-1)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:135)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:151)
at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:189)
"Reference Handler#2958" daemon prio=10 tid=0x2 nid=NA waiting
java.lang.Thread.State: WAITING
at java.lang.Object.wait(Object.java:-1)
at java.lang.Object.wait(Object.java:503)
at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:133)
"Signal Dispatcher#2956" daemon prio=9 tid=0x4 nid=NA runnable
java.lang.Thread.State: RUNNABLE
There is not much that can be inferred from the thread dump about what caused the interrupt.
In reality you cannot rely on Thread.sleep() too much; it might be interrupted for known or unknown reasons, and in the latter case the OS might be the cause.
Thread.sleep() is one of the few methods that takes interrupts seriously. Since a thread cannot handle an InterruptedException while it is sleeping, you need to handle it after it wakes up.
What you are doing right now might not be a workaround but the way to go in cases where we cannot do without Thread.sleep().
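For reference, the usual way to handle it is to restore the interrupt status so that callers further up the stack can still see it (a sketch):
try {
    Thread.sleep(500);
} catch (InterruptedException e) {
    // Re-assert the interrupt flag that sleep() cleared, then decide whether to stop or retry.
    Thread.currentThread().interrupt();
}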
A bit outdated, but I had a similar problem and, with the help of your previously posted link (https://stackoverflow.com/a/2476246), I put a breakpoint into the Thread.interrupt() method.
It revealed that the interruption was made by the StoryManager.waitUntilAllDoneOrFailed() method, which triggers future.cancel() after the timeout set on the whole story.
My whole setup is:
page.getPageObject().withTimeoutOf(convertDuration(duration)).waitFor(by);
where duration is about 60 secs. (the minute is due to some async stuff)
and
configuredEmbedder().embedderControls().useStoryTimeouts("30");
And the stackTrace is:
at java.util.concurrent.FutureTask.cancel(FutureTask.java:174)
at org.jbehave.core.embedder.StoryManager.waitUntilAllDoneOrFailed(StoryManager.java:184)
at org.jbehave.core.embedder.StoryManager.performStories(StoryManager.java:121)
at org.jbehave.core.embedder.StoryManager.runStories(StoryManager.java:107)
and that later interrupts the Thread.sleep() call in ThucydidesFluentWait.doWait() (specifically, in the underlying Sleeper instance's sleep() method).
Increasing the story timeout, or properly setting the waitFor(...) timeout relative to the story timeout, solves the problem on my side.

Reading thread dump to debug an exhausted database connection pool

I have a web app being served by jetty + mysql. I'm running into an issue where my database connection pool gets exhausted, and all threads start blocking waiting for a connection. I've tried two database connection pool libraries: (1) bonecp (2) hikari. Both exhibit the same behavior with my app.
I've done several thread dumps when I see this state, and all the blocked threads are in this state (not picking on bonecp, I'm sure it's something on my end now):
"qtp1218743501-131" prio=10 tid=0x00007fb858295800 nid=0x669b waiting on condition [0x00007fb8cd5d3000]
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x0000000763f42d20> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
at com.jolbox.bonecp.DefaultConnectionStrategy.getConnectionInternal(DefaultConnectionStrategy.java:82)
at com.jolbox.bonecp.AbstractConnectionStrategy.getConnection(AbstractConnectionStrategy.java:90)
at com.jolbox.bonecp.BoneCP.getConnection(BoneCP.java:553)
at com.me.Foo.start(Foo.java:30)
...
I'm not sure where to go from here. I was thinking that I would see some stack traces in the thread dump where my code was stuck doing some lengthy operation, not waiting for a connection. For example, if my code looks like this:
public class Foo {
    public void start() {
        Connection conn = threadPool.getConnection();
        work(conn);
        conn.close();
    }

    public void work(Connection conn) {
        .. something lengthy like scan every row in the database etc ..
    }
}
I would expect one of the threads above to have a stack trace that shows it working away in the work() method:
...
at com.me.mycode.Foo.work()
at com.me.mycode.Foo.start()
but instead they're just all waiting for a connection:
...
at com.jolbox.bonecp.BoneCP.getConnection() // ?
at com.me.mycode.Foo.work()
at com.me.mycode.Foo.start()
Any thoughts on how to continue debugging would be great.
Some other background: the app operates normally for about 45 minutes, mem and thread dumps show nothing out of the ordinary. Then the condition is triggered and the thread count spikes up. I started thinking it might be some combination of sql statements the app is trying to perform which turn into some sort of lock on the mysql side, but again I would expect some of the threads in the stack traces above to show me that they're in that part of the code.
The thread dumps were taken using visualvm.
Thanks
Take advantage of the configuration options for the connection pool (see BoneCPConfig / HikariConfig). First of all, set a connection timeout (HikariCP connectionTimeout) and a leak-detection timeout (HikariCP leakDetectionThreshold; I could not find the counterpart in BoneCP). There may be more configuration options that dump stack traces when something is not quite right.
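For example, with HikariCP the relevant settings might look like this (the values are illustrative, not recommendations):
HikariConfig config = new HikariConfig();
config.setJdbcUrl("jdbc:mysql://localhost:3306/mydb"); // placeholder URL
config.setMaximumPoolSize(10);
config.setConnectionTimeout(30000);        // fail fast instead of parking forever
config.setLeakDetectionThreshold(60000);   // log a stack trace for connections held too long
HikariDataSource dataSource = new HikariDataSource(config);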
My guess is that your application does not always return a connection to the pool and after 45 minutes has no connection in the pool anymore (and thus blocks forever trying to get a connection from the pool). Treat a connection like opening/closing a file, i.e. always use try/finally:
public void start() {
    Connection conn = null;
    try {
        work(conn = dbPool.getConnection());
    } finally {
        if (conn != null) {
            conn.close();
        }
    }
}
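On Java 7 or later the same thing reads more compactly with try-with-resources, since Connection is AutoCloseable (exception handling elided, as above):
public void start() {
    try (Connection conn = dbPool.getConnection()) {
        work(conn); // conn.close() is called automatically, even if work() throws
    }
}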
Finally, both connection pools have options to allow JMX monitoring. You can use this to monitor for strange behavior in the pool.
I question the whole design.
If you have a blocking wait in multithreaded network I/O, you need a better implementation of the connection handling.
I suggest you take a look at non-blocking I/O (the java.nio channels package), or make your locks more granular.

SelectorImpl is BLOCKED

I have many clients, each sending about 1000 requests per second to the server. The server's CPU soon rises to 600% (8 cores) and stays in that state. When I print the process content with jstack, I find SelectorImpl in the BLOCKED state. The records are as follows:
nioEventLoopGroup-4-1 prio=10 tid=0x00007fef28001800 nid=0x1dbf waiting for monitor entry [0x00007fef9eec7000]
java.lang.Thread.State: BLOCKED (on object monitor)
at sun.nio.ch.EPollSelectorImpl.doSelect(Unknown Source)
- waiting to lock <0x00000000c01f1af8> (a java.lang.Object)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(Unknown Source)
- locked <0x00000000c01d9420> (a io.netty.channel.nio.SelectedSelectionKeySet)
- locked <0x00000000c01f1948> (a java.util.Collections$UnmodifiableSet)
- locked <0x00000000c01d92c0> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(Unknown Source)
at io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:635)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:319)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:101)
at java.lang.Thread.run(Unknown Source)
Does the high CPU have something to do with this? Another problem is that when I connect many clients, some clients fail to connect, and the relevant output is as follows:
"nioEventLoopGroup-4-1" prio=10 tid=0x00007fef28001800 nid=0x1dbf waiting for monitor entry [0x00007fef9eec7000]
java.lang.Thread.State: BLOCKED (on object monitor)
at sun.nio.ch.EPollSelectorImpl.doSelect(Unknown Source)
- waiting to lock <0x00000000c01f1af8> (a java.lang.Object)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(Unknown Source)
- locked <0x00000000c01d9420> (a io.netty.channel.nio.SelectedSelectionKeySet)
- locked <0x00000000c01f1948> (a java.util.Collections$UnmodifiableSet)
- locked <0x00000000c01d92c0> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(Unknown Source)
at io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:635)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:319)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:101)
at java.lang.Thread.run(Unknown Source)
The clients are created using a thread pool, and a connection timeout is set, but why are there frequent connection timeouts? Is the server the cause?
public void run() {
    System.out.println(tnum + " connecting...");
    try {
        Bootstrap bootstrap = new Bootstrap();
        bootstrap.group(group)
                 .channel(NioSocketChannel.class)
                 .option(ChannelOption.CONNECT_TIMEOUT_MILLIS, 30000)
                 .handler(loadClientInitializer);
        // Start the connection attempt.
        ChannelFuture future = bootstrap.connect(host, port);
        future.channel().attr(AttrNum).set(tnum);
        future.sync();
        if (future.isSuccess()) {
            System.out.println(tnum + " login success.");
            goSend(tnum, future.channel());
        } else {
            System.out.println(tnum + " login failed.");
        }
    } catch (Exception e) {
        XLog.error(e);
    } finally {
        // group.shutdownGracefully();
    }
}
Does the high CPU have something to do with this?
It might. I'd diagnose this problem in the following way (on a Linux box):
Find threads which are eating CPU
Using pidstat I'd find which threads are eating CPU and in what mode (user/kernel) time is spent.
$ pidstat -p [java-process-pid] -tu 1 | awk '$9 > 50'
This command shows threads eating at least 50% of CPU time. You can inspect what those threads are doing using jstack, VisualVM or Java Flight Recorder.
If the CPU-hungry threads and the BLOCKED threads are the same, the CPU usage probably has something to do with lock contention.
Find reason for connection timeout
Basically, you get a connection timeout when the two operating systems cannot finish the TCP handshake within the given time. There are several possible reasons for this:
Network link saturation. This can be diagnosed using sar -n DEV 1 and comparing the rxkB/s and txkB/s columns to your link's maximum throughput.
The server (Netty) doesn't respond with an accept() call within the given timeout. That thread can be BLOCKED or starved of CPU time. You can find which threads are calling accept() (and therefore finishing the TCP handshake) using strace -f -e trace=accept -p [java-pid], and after that check for possible reasons using pidstat/jstack.
You can also find the number of connection requests that have been received but not yet confirmed with netstat -an | grep -c SYN_RECV.
If you can elaborate more on what your Netty application is doing, that would be helpful. Regardless, please make sure you are closing the channels. Note the following from the Channel javadoc:
It is important to call close() or close(ChannelPromise) to release all resources once you are done with the Channel. This ensures all resources are released in a proper way, i.e. filehandles
If you are closing the channels, then the problem may be in the logic itself, such as running into infinite loops, which could explain the high CPU.
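For illustration, making sure each client channel is eventually closed could look like this inside the run() method from the question (goSend is the question's own method; the shutdown policy is an assumption):
ChannelFuture future = bootstrap.connect(host, port).sync();
Channel channel = future.channel();
try {
    goSend(tnum, channel);
} finally {
    channel.close().syncUninterruptibly(); // releases the file handle and the selector key
}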
