I'm using jdk1.6_20 on Linux 2.6. I am observing a behavior where the NIO Selector, after calling Selector.select(timeout) with a 5-second timeout, fails to wake up within that timeout and instead returns several seconds late (2~10 seconds). This seems to happen frequently during the first couple of minutes after application start-up and stabilizes later on. Since our server heartbeats with the client, the selector failing to wake up on time causes it to miss a heartbeat and the peer disconnects us.
Any help appreciated. Thanks.
From the Javadoc for Selector.select(long):
This method does not offer real-time guarantees: It schedules the
timeout as if by invoking the Object.wait(long) method.
Since application start-up can put a lot of stress on the system, this may lead to wake-up delays.
For a solution: Switch to Selector.selectNow() as a non-blocking operation and handle retries in your application code.
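A minimal sketch of such a polling loop built on selectNow(); the method name, poll interval, and the read handling here are hypothetical and only illustrate the non-blocking pattern:

import java.io.IOException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.util.Iterator;

// Hypothetical polling loop: selectNow() never blocks, so the application
// controls its own timing instead of relying on the selector's wake-up.
static void pollLoop(Selector selector) throws IOException, InterruptedException {
    while (!Thread.currentThread().isInterrupted()) {
        if (selector.selectNow() > 0) {                    // returns immediately
            Iterator<SelectionKey> it = selector.selectedKeys().iterator();
            while (it.hasNext()) {
                SelectionKey key = it.next();
                it.remove();
                if (key.isReadable()) {
                    // read from key.channel() here (application-specific)
                }
            }
        }
        Thread.sleep(50);  // poll interval; also a natural place to check heartbeat deadlines
    }
}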
It doesn't matter what the timeout is: as soon as a client connects, the selector should wake up immediately. So there is likely a more serious bug elsewhere.
fails to wake-up within the timeout(timeout=5 sec).
It's not supposed to 'wake up within the timeout'. It is supposed to wake up after the timeout expires. If you're supposed to send heartbeats every 5 seconds, a timeout of 5 seconds is too long; I would make it 2.5s in this case.
Hmm... actually the story doesn't stop there. We are not using incremental CMS, so during the concurrent phase it does not relinquish the CPU. We have 2 application servers on the same 16-core host, each with 4 parallel CMS threads in addition to roughly 45 to 60 application threads. CPU starvation is therefore the most likely cause, especially since every time the selector gets delayed it is by 100~200 milliseconds immediately after the concurrent-mark phase.
Visual VM shows FifoMessageDispatchChannel.dequeue() taking a lot of time. The Tomcat process is using around 100% of a processor core.
The most probable cause is that you are calling a consumer receive method with a very short wait but it is impossible to tell without more information. The dispatch channel simply checks a queue for data and if none present will block for a given timeout waiting for a signal to wake and check again or time out and return.
dequeue() is not actually taking much processor time, contrary to what the other answer suggests. As explained in an answer to another question, Self Time includes time spent doing things other than processing, such as waiting.
Self Time (CPU) and Total Time (CPU) only include time the method spends using the processor, and they are 0 for dequeue(). To find the methods using the processor most, sort by Self Time (CPU), as Bedla indicated.
For setting up timeouts while making REST calls we are supposed to specify both of these parameters, but I'm not sure why both are needed and what different purposes they serve. Also, what happens if we set only one of them, or set them to different values?
CONNECT_TIMEOUT is the amount of time the client will wait to establish the connection to the host. Once connected, READ_TIMEOUT is the amount of time allowed for the server to respond with all of the content for a given request.
How you set either one will depend on your requirements, but they can be different values. CONNECT_TIMEOUT should not require a large value, because it is only the time required to set up a socket connection with the server. 30 seconds should be ample time; frankly, if it does not complete within 10 seconds it is taking too long, and the server is likely hosed, or at least overloaded.
READ_TIMEOUT - this could be longer, especially if you know that the action/resource you requested takes a long time to process. You might set this as high as 60 seconds, or even several minutes. Again, this depends on how critical it is that you wait for confirmation that the process completed, and you'll weigh this against how quickly your system needs to respond on its end. If your client times out while waiting for the process to complete, that doesn't necessarily mean that the process stopped, it may keep on running until it is finished on the server (or at least, until it reaches the server's timeout).
If these calls are directly driving an interface, then you may want much lower times, as your users may not have the patience for such a delay. If it is called in a background or batch process, then longer times may be acceptable. This is up to you.
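As an illustration, here is a minimal sketch using java.net.HttpURLConnection, which exposes the same two knobs (this may not be the exact client library the question refers to, and the endpoint is a placeholder):

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper: the two timeouts guard different phases of the call.
static int fetchStatus(String endpoint) throws IOException {
    HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
    conn.setConnectTimeout(10000); // connect timeout: time allowed to establish the socket connection
    conn.setReadTimeout(60000);    // read timeout: time allowed for the server to deliver the response
    return conn.getResponseCode(); // either timeout expiring throws java.net.SocketTimeoutException
}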
I'm running a Java 7 Dropwizard app on a CentOS 6.4 server that basically serves as a layer on top of a data store (Cassandra) and does some additional processing. It also has an interface to Zookeeper using the Curator framework for some other stuff. This all works well and good most of the time, CPU and RAM load is never above 50% and usually about 10% and our response times are good.
My problem is that recently we've discovered that occasionally we get blips of about 1-2 seconds where seemingly all tasks scheduled via thread pools get delayed. We noticed this because of connection timeouts to Cassandra and session timeouts with Zookeeper. What we've done to narrow it down:
1. Used Wireshark and Boundary to make sure all network activity from our app was getting stalled, not just a single component. All network activity was stalling at the same time.
2. Wrote a quick little Python script to send timestamp strings to netcat on one of the servers we were seeing timeouts connecting to, to make sure it's not an overall network issue between the boxes. We saw all timestamps come through smoothly during periods where our app had timeouts.
3. Disabled hyperthreading on the server.
4. Checked garbage collection timing logs for the timeout periods. They were consistent and well under 1ms through the timeout periods.
5. Checked our CPU and RAM resources during the timeout periods. Again, consistent, and well under significant load.
6. Added an additional Dropwizard resource to our app for diagnostics that would send timestamp strings to netcat on another server, just like the Python script. In this case, we did see delays in the timestamps when we saw timeouts in our app. With half-second pings, we would generally see a whole second missing entirely, and then four pings in the next second, the extra two being the delayed pings from the previous second.
7. To remove the network from the equation, we changed the above to just write to the console and a local file instead of to the network. We saw the same results (delayed pings) with both of those.
8. Profiled and checked our thread pool settings to see if we were using too many OS threads. /proc/sys/kernel/threads-max is 190115 and we never get above 1000.
Code for #7 (#6 is identical except for using a Socket and PrintWriter in place of the FileWriter):
public void start() throws IOException {
    fileWriter = new FileWriter(this.fileName, false);
    executor = Executors.newSingleThreadScheduledExecutor();
    // Schedule this Runnable to write a timestamp every delayMillis milliseconds.
    executor.scheduleAtFixedRate(this, 0, this.delayMillis, TimeUnit.MILLISECONDS);
}

@Override
public synchronized void run() {
    try {
        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");
        Date now = new Date();
        String debugString = "ExecutorService test " + this.content + " : " + sdf.format(now) + "\n";
        fileWriter.write(debugString);
        fileWriter.flush();
    } catch (Exception e) {
        logger.error("Error running ExecutorService test: " + e.toString());
    }
}
So it seems like the Executor is scheduling the tasks to be run, but they're being delayed in starting (because the timestamps are delayed and there's no way the first two lines of the try block in the run method are delaying the task execution). Any ideas on what might cause this or other things we can try? Hopefully we won't get to the point where we start reverting the code until we find what change caused it...
TL;DR: Scheduled tasks are being delayed and we don't know why.
UPDATE 1: We modified the executor task to push timestamps every half-second into a ring buffer instead of writing straight out to a file, and then dump the buffer every 20 seconds. This removes I/O as a possible cause of blocking task execution while still giving us the same information. We still saw the same pattern of timestamps, so it appears the issue is not something in the task occasionally blocking the next execution of the task, but something in the task execution engine itself delaying execution for some reason.
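A minimal sketch of the ring-buffer variant described above; the field names and buffer size are hypothetical:

// Hypothetical ring-buffer variant: run() only records a timestamp in memory;
// a separate, less frequent task drains the buffer to the log or file.
private final long[] ring = new long[64];
private int index = 0;

public synchronized void run() {
    ring[index] = System.currentTimeMillis();   // no I/O on the scheduled path
    index = (index + 1) % ring.length;
}

public synchronized long[] snapshot() {
    return ring.clone();                        // dumped elsewhere every ~20 seconds
}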
When you use scheduleAtFixedRate, you're expressing a desire that your task should be run as close to that rate as possible. The executor will do its best to keep to it, but sometimes it can't.
You're using Executors.newSingleThreadScheduledExecutor(), so the executor only has a single thread to play with. If each execution of the task takes longer than the period you specified in your schedule, the executor won't be able to keep up, since the single thread may not have finished the previous run before the schedule kicks in to execute the next one. The result manifests itself as delays in the schedule. This seems a plausible explanation, since you say your real code is writing to a socket, which can easily block and throw your timing off kilter.
You can find out if this is indeed the case by adding more logging at the end of the run method (i.e. after the flush). If the IO is taking too long, you'll see that in the logs.
As a fix, you could consider using scheduleWithFixedDelay instead, which will add a delay between each execution of the task, so long-running tasks don't run into each other. Failing that, then you need to ensure that the socket write completes on time, allowing each subsequent task execution to start on schedule.
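A minimal sketch combining both suggestions, reusing the hypothetical fields from the question's snippet (executor, fileWriter, logger, content, delayMillis):

executor = Executors.newSingleThreadScheduledExecutor();
// Fixed delay: the next run starts delayMillis AFTER the previous one finishes,
// so a slow write can never make runs pile up behind one another.
executor.scheduleWithFixedDelay(this, 0, this.delayMillis, TimeUnit.MILLISECONDS);

@Override
public synchronized void run() {
    long start = System.currentTimeMillis();
    try {
        fileWriter.write("ExecutorService test " + this.content + " : " + start + "\n");
        fileWriter.flush();
    } catch (Exception e) {
        logger.error("Error running ExecutorService test: " + e.toString());
    } finally {
        // If the write/flush is slow, it will show up here.
        logger.info("run() took " + (System.currentTimeMillis() - start) + " ms");
    }
}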
The first step in diagnosing a liveness issue is usually taking a thread dump while the system is stalled and checking what the threads are doing. In your case, the executor threads would be of particular interest. Are they processing, or are they waiting for work?
If they are all processing, the executor service has run out of worker threads, and can only schedule new tasks once a current task has been completed. This may be caused by tasks temporarily taking longer to complete. The stack traces of the worker threads may yield a clue just what is taking longer.
If many worker threads are idle, you have found a bug in the JDK. Congratulations!
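One common way is jstack against the process; if attaching a tool is inconvenient, a rough in-process sketch (not a full thread dump) could be triggered from a diagnostics endpoint, for example:

import java.util.Map;

// Sketch: print the state and top stack frame of every live thread.
static void dumpThreads() {
    for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
        Thread t = e.getKey();
        StackTraceElement[] stack = e.getValue();
        System.out.println(t.getName() + " [" + t.getState() + "] "
                + (stack.length > 0 ? stack[0] : "<no frames>"));
    }
}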
The Google App Engine documentation says that tasks are limited to 10 minutes. However, when I run deferred tasks they die in 60 seconds. I couldn't find this mentioned anywhere.
Does it mean that Appengine deferred tasks are limited to 60 seconds, or maybe I am doing something wrong?
UPDATE: The first task is triggered from a request, but I am not waiting for it to return (and how could I anyway, there are no callbacks). The subsequent ones I trigger, kind of recursively, from within the task itself:
// withPayload here is assumed to be the static import
// com.google.appengine.api.taskqueue.TaskOptions.Builder.withPayload
DeferredTask df = new QuoteReader(params);
QueueFactory.getDefaultQueue().add(withPayload(df));
Many of them just work, but for the ones that reach the 1-minute limit I get ApiProxy$ApiDeadlineExceededException:
com.googlecode.objectify.cache.Pending completeAllPendingFutures: Error cleaning up pending Future: com.googlecode.objectify.cache.CachingAsyncDatastoreService$3#17f5ddc
java.util.concurrent.ExecutionException: com.google.apphosting.api.ApiProxy$ApiDeadlineExceededException: The API call datastore_v3.Get() took too long to respond and was cancelled.
Another thing I noticed: this affects other requests to that server happening at the same time, and those go down with DeadlineExceededException.
The error is coming from a Datastore operation that is exceeding 60s. It's not really related to Taskqueue deadlines as such. You are correct that they are 10 minutes (see here)
However, as per an old related issue (the limit may have changed to 60s since):
From Google: Even though offline requests can currently live up to 10 minutes (and background instances can live forever) datastore queries can still only live for 30 seconds.
It seems from the exception that your code completed and that the timeout actually occurs in Objectify, later in the request filters. I'd suggest you split up your data operations so datastore queries are quicker, and if necessary use .now() on your data operations so exceptions occur in your own code.
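A minimal sketch of forcing a write to complete synchronously, assuming an Objectify 4-style API; the Quote entity is hypothetical:

import static com.googlecode.objectify.ObjectifyService.ofy;

// Calling now() blocks until the async datastore call finishes, so a deadline
// surfaces here, inside your own code, rather than later when the Objectify
// cache filter flushes its pending futures.
Quote quote = new Quote(params);       // hypothetical entity
ofy().save().entity(quote).now();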
We are facing an unusual problem in our application: in the last month our application reached an unrecoverable state, and it recovered only after an application restart.
Background: Our application makes a DB query to fetch some information, and this database is hosted on a separate node.
Problematic case: When the thread dump was analyzed, we saw all the threads in the runnable state fetching data from the database, but they hadn't finished even after 20 minutes.
After the application restart, as expected, all threads recovered and CPU usage was normal.
Below is the thread dump:
"ThreadPool:2:47" prio=3 tid=0x0000000007334000 nid=0x5f runnable [0xfffffd7fe9f54000]
   java.lang.Thread.State: RUNNABLE
        at oracle.jdbc.driver.T2CStatement.t2cParseExecuteDescribe(Native Method)
        at oracle.jdbc.driver.T2CPreparedStatement.executeForDescribe(T2CPreparedStatement.java:518)
        at oracle.jdbc.driver.T2CPreparedStatement.executeForRows(T2CPreparedStatement.java:764)
        at ora
All threads in the same state.
Questions:
What could be the reason for this state?
How do we recover in this case?
It's probably waiting for network data from the database server. Java threads waiting (blocked) on I/O are described by the JVM as being in the state RUNNABLE even though from the program's point of view they're blocked.
As others mentioned already, native methods are always reported as RUNNABLE, since the JVM doesn't know or care what they are doing.
The Oracle drivers on the client side have no socket timeout by default. This means that if you have network issues, the client's low-level socket may get stuck there forever, resulting in a maxed-out connection pool. You could also check the network traffic towards the Oracle server to see whether it is transmitting any data at all.
When using the thin client, you can set oracle.jdbc.ReadTimeout, but I don't know how to do that for the thick (OCI) client you are using; I'm not familiar with it.
What to do? Research how to specify a read timeout for the thick ojdbc driver, and watch for timeout-related exceptions, which will clearly signal network issues. If you can change the source, you can wrap the calls and retry the session when you catch timeout-related SQLExceptions.
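For reference, a minimal sketch of setting the read timeout on the thin driver (the property value is in milliseconds; the URL and credentials below are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.Properties;

static Connection openWithReadTimeout() throws SQLException {
    Properties props = new Properties();
    props.setProperty("user", "scott");          // placeholder credentials
    props.setProperty("password", "tiger");
    // Abort reads that receive no data from the server within 60 seconds.
    props.setProperty("oracle.jdbc.ReadTimeout", "60000");
    return DriverManager.getConnection("jdbc:oracle:thin:@//dbhost:1521/ORCL", props); // placeholder URL
}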
To quickly address the issue, terminate the connection on the Oracle server manually.
It's also worth checking for session contention; maybe a query is blocking these sessions. If you find one, you'll see which database object causes the problem.
Does your code manually handle transactions? If so, maybe some of the code didn't commit() after changing data. Or maybe someone ran a data-modification query directly through PL/SQL or something and didn't commit, and that leaves all reading operations hung.
When you experienced that hang and the DB recovered, did you check whether some of the data had been rolled back? I'm asking because you said "It was recovered post application restart." When the JDBC driver changes data but doesn't commit and a timeout happens, the DB operation will be rolled back (this can differ based on the configuration, though).
Native methods always remain in the RUNNABLE state (ok, unless you change the state from the native method itself, but that doesn't count).
The method can be blocked on I/O, waiting on some other event, or just running a long CPU-intensive task... or an endless loop.
You can make your own pick.
how to recover under this case?
Drop the connection from the Oracle side.
Is the system or the JVM hanging?
If it is configurable and possible, reduce the number of threads / parallel connections.
The threads simply waste CPU cycles while waiting for I/O.
Yes, your CPU is unfortunately kept busy by the threads that are awaiting a response from the DB.