Appengine Deferred task limited to 60 seconds - java

In the Google App Engine documentation it says that tasks are limited to 10 minutes. However, when I run deferred tasks they die after 60 seconds. I couldn't find this mentioned anywhere.
Does it mean that Appengine deferred tasks are limited to 60 seconds, or maybe I am doing something wrong?
UPDATE: The first task is triggered from a request, but I am not waiting for it to return (and how could I anyway, there are no callbacks). The subsequent ones I trigger, somewhat recursively, from within the task itself:
// QuoteReader implements DeferredTask; withPayload is statically
// imported from TaskOptions.Builder.
DeferredTask df = new QuoteReader(params);
QueueFactory.getDefaultQueue().add(withPayload(df));
Many of them just work, but for the ones which reach the 1-minute limit I get ApiProxy$ApiDeadlineExceededException:
com.googlecode.objectify.cache.Pending completeAllPendingFutures: Error cleaning up pending Future: com.googlecode.objectify.cache.CachingAsyncDatastoreService$3#17f5ddc
java.util.concurrent.ExecutionException: com.google.apphosting.api.ApiProxy$ApiDeadlineExceededException: The API call datastore_v3.Get() took too long to respond and was cancelled.
Another thing I noticed: this affects other requests to that server happening at the same time, and those go down with DeadlineExceededException.

The error is coming from a Datastore operation that is exceeding 60s; it's not really related to task queue deadlines as such. You are correct that those are 10 minutes (see here).
However, as per an old related issue (maybe it changed to 60s since):
From Google: Even though offline requests can currently live up to 10 minutes (and background instances can live forever) datastore queries can still only live for 30 seconds.
It seems from the exception that your code completed, and it's Objectify (later, in the request filters) where the timeout actually occurs. I'd suggest you split up your data operations so datastore queries are quicker, and if necessary use .now() on your data operations so exceptions occur in your own code.

Related

RESTful: What is the difference between ClientProperties.CONNECT_TIMEOUT and ClientProperties.READ_TIMEOUT in Jersey?

For setting up timeouts while making REST calls we should specify both these parameters, but I'm not sure why both are needed and exactly what different purposes they serve. Also, what happens if we set only one of them, or both with different values?
CONNECT_TIMEOUT is the amount of time the client will wait to establish the connection to the host. Once connected, READ_TIMEOUT is the amount of time allowed for the server to respond with all of the content for a given request.
How you set either one will depend on your requirements, but they can be different values. CONNECT_TIMEOUT should not require a large value, because it is only the time required to setup a socket connection with the server. 30 seconds should be ample time - frankly if it is not complete within 10 seconds it is too long, and the server is likely hosed, or at least overloaded.
READ_TIMEOUT - this could be longer, especially if you know that the action/resource you requested takes a long time to process. You might set this as high as 60 seconds, or even several minutes. Again, this depends on how critical it is that you wait for confirmation that the process completed, and you'll weigh this against how quickly your system needs to respond on its end. If your client times out while waiting for the process to complete, that doesn't necessarily mean that the process stopped, it may keep on running until it is finished on the server (or at least, until it reaches the server's timeout).
If these calls are directly driving an interface, then you may want much lower times, as your users may not have the patience for such a delay. If it is called in a background or batch process, then longer times may be acceptable. This is up to you.
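In Jersey 2 you would typically set both as integer milliseconds, e.g. client.property(ClientProperties.CONNECT_TIMEOUT, 10000) and the same for READ_TIMEOUT. To keep the illustration dependency-free, here is the same two-timeout distinction sketched with the JDK's built-in HttpClient (Java 11+); the URL and durations are placeholders, not recommendations:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

public class TimeoutDemo {
    // Connect timeout: how long to wait for the TCP connection to be
    // established. Configured once, on the client.
    static HttpClient buildClient() {
        return HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(10)) // plays the CONNECT_TIMEOUT role
                .build();
    }

    // Response timeout: how long to wait for the server to deliver the
    // response. Configured per request.
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(60))        // plays the READ_TIMEOUT role
                .build();
    }
}
```

The two values are independent, which is why a short connect timeout can coexist with a much longer read timeout for slow-but-healthy endpoints.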

ExecutorService task execution intermittently delayed

I'm running a Java 7 Dropwizard app on a CentOS 6.4 server that basically serves as a layer on top of a data store (Cassandra) and does some additional processing. It also has an interface to Zookeeper using the Curator framework for some other stuff. This all works well and good most of the time, CPU and RAM load is never above 50% and usually about 10% and our response times are good.
My problem is that recently we've discovered that occasionally we get blips of about 1-2 seconds where seemingly all tasks scheduled via thread pools get delayed. We noticed this because of connection timeouts to Cassandra and session timeouts with Zookeeper. What we've done to narrow it down:
1. Used Wireshark and Boundary to make sure all network activity from our app was getting stalled, not just a single component. All network activity was stalling at the same time.
2. Wrote a quick little Python script to send timestamp strings to netcat on one of the servers we were seeing timeouts connecting to, to make sure it's not an overall network issue between the boxes. We saw all timestamps come through smoothly during periods where our app had timeouts.
3. Disabled hyperthreading on the server.
4. Checked garbage collection timing logs for the timeout periods. They were consistent and well under 1ms through the timeout periods.
5. Checked our CPU and RAM resources during the timeout periods. Again, consistent, and well under significant load.
6. Added an additional Dropwizard resource to our app for diagnostics that would send timestamp strings to netcat on another server, just like the Python script. In this case, we did see delays in the timestamps when we saw timeouts in our app. With half-second pings, we would generally see a whole second missing entirely, and then four pings in the next second, the extra two being the delayed pings from the previous second.
7. To remove the network from the equation, we changed the above to just write to the console and a local file instead of to the network. We saw the same results (delayed pings) with both of those.
8. Profiled and checked our thread pool settings to see if we were using too many OS threads. /proc/sys/kernel/threads-max is 190115 and we never get above 1000.
Code for #7 (#6 is identical except for using a Socket and PrintWriter in place of the FileWriter):
public void start() throws IOException {
    fileWriter = new FileWriter(this.fileName, false);
    executor = Executors.newSingleThreadScheduledExecutor();
    executor.scheduleAtFixedRate(this, 0, this.delayMillis, TimeUnit.MILLISECONDS);
}

@Override
public synchronized void run() {
    try {
        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");
        Date now = new Date();
        String debugString = "ExecutorService test " + this.content + " : " + sdf.format(now) + "\n";
        fileWriter.write(debugString);
        fileWriter.flush();
    } catch (Exception e) {
        logger.error("Error running ExecutorService test: " + e.toString());
    }
}
So it seems like the Executor is scheduling the tasks to be run, but they're being delayed in starting (because the timestamps are delayed and there's no way the first two lines of the try block in the run method are delaying the task execution). Any ideas on what might cause this or other things we can try? Hopefully we won't get to the point where we start reverting the code until we find what change caused it...
TL;DR: Scheduled tasks are being delayed and we don't know why.
UPDATE 1: We modified the executor task to push timestamps every half-second into a ring buffer instead of straight out to a file, and then dump the buffer every 20 seconds. This removes I/O as a possible cause of blocking task execution but still gives us the same info. From this, we still saw the same pattern of timestamps, from which it appears that the issue is not something in the task occasionally blocking the next execution of the task, but something in the task execution engine itself delaying execution for some reason.
When you use scheduleAtFixedRate, you're expressing a desire that your task be run as close to that rate as possible. The executor will do its best to keep to it, but sometimes it can't.
You're using Executors.newSingleThreadScheduledExecutor(), so the executor only has a single thread to play with. If each execution of the task takes longer than the period you specified in your schedule, then the executor won't be able to keep up, since the single thread may not have finished executing the previous run before the schedule kicks in to execute the next run. The result would manifest itself as delays in the schedule. This seems a plausible explanation, since you say your real code is writing to a socket. That can easily block and send your timing off kilter.
You can find out if this is indeed the case by adding more logging at the end of the run method (i.e. after the flush). If the IO is taking too long, you'll see that in the logs.
As a fix, you could consider using scheduleWithFixedDelay instead, which will add a delay between each execution of the task, so long-running tasks don't run into each other. Failing that, then you need to ensure that the socket write completes on time, allowing each subsequent task execution to start on schedule.
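The difference is easy to see with a small experiment. In this sketch (the timings are illustrative, not from the question), a task that takes about 100 ms is scheduled with a 50 ms fixed delay, so consecutive start times are separated by roughly task time plus delay rather than by the period alone:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class FixedDelayDemo {
    // Runs a ~100 ms task with a 50 ms fixed delay and returns the gaps
    // (in ms) between consecutive task START times.
    static List<Long> startGaps() throws InterruptedException {
        List<Long> starts = new CopyOnWriteArrayList<>();
        ScheduledExecutorService ex = Executors.newSingleThreadScheduledExecutor();
        ex.scheduleWithFixedDelay(() -> {
            starts.add(System.nanoTime());
            try { Thread.sleep(100); } catch (InterruptedException e) { }
        }, 0, 50, TimeUnit.MILLISECONDS);
        Thread.sleep(600);       // let a few executions happen
        ex.shutdownNow();
        List<Long> gaps = new ArrayList<>();
        for (int i = 1; i < starts.size(); i++) {
            gaps.add((starts.get(i) - starts.get(i - 1)) / 1_000_000);
        }
        return gaps;             // each gap is roughly taskTime + delay
    }
}
```

With scheduleAtFixedRate the executor would instead try to start a run every 50 ms and fall behind; with scheduleWithFixedDelay the 50 ms gap is measured from the end of the previous run, so runs never pile up.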
The first step to diagnose a liveness issue is usually taking a thread dump when the system is stalled, and check what the threads were doing. In your case, the executor threads would be of particular interest. Are they processing, or are they waiting for work?
If they are all processing, the executor service has run out of worker threads, and can only schedule new tasks once a current task has been completed. This may be caused by tasks temporarily taking longer to complete. The stack traces of the worker threads may yield a clue just what is taking longer.
If many worker threads are idle, you have found a bug in the JDK. Congratulations!

What is the best bucket size for a task queue filled with many deferred tasks in Google App Engine?

My Google App Engine application is adding a large number of deferred tasks to a task queue. The tasks are scheduled to run every x seconds. If I understand the bucket-size property b correctly, a high value would prevent the deferred tasks to run until b tasks have been added. However, there is a close-to-realtime requirement that the tasks run as scheduled. I do not want that the tasks are blocked until the bucket-size is reached. Instead they should run as close to their scheduled time as possible.
To support this use case, should I use a bucket-size of 1 and a rate of 500 (which is the current maximum rate)? Which other approaches exist to support this? Thanks!
The bucket size does not prevent tasks from running individually. It plays a different role.
Suppose you have an empty queue with rate of 500 tasks per second, and several hours where no tasks are added or started. Then suddenly a large number of tasks are added at once. How many of these tasks would you like started immediately? Set this number as your bucket size. For example, with a bucket size of 1000, 1000 tasks will be started immediately (then 500 per second going forward).
How does this work? The bucket is topped up by 500 tokens every second (the queue's rate), up to a maximum of the bucket size. When there are tasks available to start, they will only be started while the bucket is not empty, and one token is removed from the bucket as each task is started.
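A toy simulation of that token-bucket behavior might look like the following (the class and the rate/size numbers are hypothetical illustrations, not App Engine's actual implementation):

```java
public class TokenBucket {
    private final double ratePerSec;  // refill rate: the queue's "rate"
    private final double capacity;    // maximum tokens: the "bucket size"
    private double tokens;

    TokenBucket(double ratePerSec, double capacity) {
        this.ratePerSec = ratePerSec;
        this.capacity = capacity;
        this.tokens = capacity;       // starts full, like a long-idle queue
    }

    // Advance simulated time: top the bucket up, capped at capacity.
    void tick(double seconds) {
        tokens = Math.min(capacity, tokens + ratePerSec * seconds);
    }

    // A task may start only while the bucket is non-empty;
    // starting a task consumes one token.
    boolean tryStartTask() {
        if (tokens >= 1) { tokens -= 1; return true; }
        return false;
    }

    // How many of n simultaneously added tasks start immediately?
    int startBurst(int n) {
        int started = 0;
        while (started < n && tryStartTask()) started++;
        return started;
    }
}
```

With a rate of 500/s and a bucket size of 1000, a burst of 5000 tasks against a full bucket starts 1000 immediately, then about 500 more each second afterwards, matching the description above.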
You should NOT use task queues (TQ) for deferred tasks that must run close to real time on the assumption that the bucket/rate settings will ensure high throughput. There have been several discussion threads in Google Groups about infrequent delays in task start times that are minutes or more in length. Bucket size and rate have no effect on this: your TQ tasks will simply sit there while your high-throughput TQ is idle. To date I have not seen any explanation from Google as to why this occurs. Again, if you use TQs for close-to-real-time tasks you MUST handle, as an exceptional case, the infrequent times when your tasks will be delayed for minutes before starting. (I do this in fact, and have not yet been negatively affected, but you have to have code in place to handle a "result = delayed task".) My great hope is that with the new server/application testing underway, Google will find a way to kill this incredibly big issue with TQs (fingers crossed).

Scheduling tasks, making sure task is ever being executed

I have an application that checks a resource on the internet for new mails. If there are new mails, it does some processing on them. This means that depending on the amount of mails it might take just a few seconds to hours of processing.
Now the object/program that does the processing is already a singleton. So right now I already took care of there really only being 1 instance that's handling the checking and processing.
However I only have it running once now and I'd like to have it continuously running, checking for new mails more or less every 10 minutes or so to handle them in a timely manner.
I understand I can take care of this with Timer/TimerTask, or even better, I found a resource here: http://www.ibm.com/developerworks/java/library/j-schedule/index.html that uses Scheduler/SchedulerTask. But what I am afraid of is this: if I set it to run every 10 minutes and a previous session is already processing data, it will put the new task in a queue, waiting to be executed once the previous one is done. So what I'm afraid of is, for instance, the first run running for 5 hours and then, because it was busy all that time, launching 5*6-1=29 runs immediately after each other, checking for mails and/or doing some processing, without giving the server a break.
Does anyone know how I can solve this?
P.S. the way I have my application set up right now is I'm using a Java Servlet on my tomcat server that's launched upon server start where it creates a Singleton instance of my main program, then calls some method to do the fetching/processing. And what I want is to repeat that fetching/processing every "x" amount of time (10 minutes or so), making sure that really only 1 instance is doing this and that really after each run 10 minutes or so are given to rest.
Actually, Timer + TimerTask can deal with this pretty cleanly. If you schedule something with Timer.scheduleAtFixedRate(), you will notice that the docs say it will attempt to "make up" late events to maintain the long-term period of execution. However, this can be overcome by using TimerTask.scheduledExecutionTime(). The example therein lets you figure out if the task is too tardy to run, and you can just return instead of doing anything. This will, in effect, "clear the queue" of TimerTask executions.
Of note: TimerTask uses a single thread to execute, so it won't spawn two copies of your task side-by-side.
On the side note part, you don't have to process all 10k emails in the queue in a single run. I would suggest processing for a fixed amount of time using TimerTask.scheduledExecutionTime() to figure out how long you have, then returning. That keeps your process more limber, cleans up the stack between runs, and if you are doing aggregates, ensures that you don't have to rebuild too much data if, for example, the server is restarted in the middle of the task. But this recommendation is based on generalities, since I don't know what you're doing in the task :)
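The tardiness guard described above might be sketched like this (the threshold, period, and stall duration are illustrative placeholders; the first run deliberately blocks so that the queued catch-up runs become tardy and skip themselves):

```java
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.atomic.AtomicInteger;

public class TardyGuardDemo {
    static final long MAX_TARDINESS_MS = 100;

    // Returns {executedCount, skippedCount} after ~800 ms of scheduling.
    static int[] runDemo() throws InterruptedException {
        AtomicInteger executed = new AtomicInteger();
        AtomicInteger skipped = new AtomicInteger();
        Timer timer = new Timer();
        timer.scheduleAtFixedRate(new TimerTask() {
            boolean first = true;
            @Override public void run() {
                // If we're running long after we were supposed to, this is a
                // "make-up" execution: just return, clearing the backlog.
                if (System.currentTimeMillis() - scheduledExecutionTime() > MAX_TARDINESS_MS) {
                    skipped.incrementAndGet();
                    return;
                }
                executed.incrementAndGet();
                if (first) {
                    first = false;
                    // Simulate a long first run that makes later runs pile up.
                    try { Thread.sleep(400); } catch (InterruptedException e) { }
                }
            }
        }, 0, 50);
        Thread.sleep(800);
        timer.cancel();
        return new int[] { executed.get(), skipped.get() };
    }
}
```

After the 400 ms stall, the runs that were scheduled during it fire in a burst, see that they are hundreds of milliseconds tardy, and return immediately, so the application never does 29 back-to-back mail checks.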

Java NIO Selector Hang (jdk1.6_20)

I'm using jdk1.6_20 on Linux 2.6. I am observing a behavior where the NIO Selector, after calling Selector.select(timeout), fails to wake up within the timeout (timeout = 5 sec). It returns much later, with a delay of a couple of seconds (2-10 seconds). This seems to happen frequently during the first couple of minutes of application start-up and stabilizes later on. Since our server is heartbeating with the client, the selector failing to wake up on time causes it to miss heartbeats and the peer to disconnect us.
Any help appreciated. Thanks.
From the Javadoc for Selector.select(long):
This method does not offer real-time guarantees: It schedules the
timeout as if by invoking the Object.wait(long) method.
Since startup time for an application might put a lot of stress on a system, this may lead to wakeup-delays.
For a solution: Switch to Selector.selectNow() as a non-blocking operation and handle retries in your application code.
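A minimal sketch of that polling approach (the sleep interval and method name are illustrative placeholders, not a recommendation for production pacing):

```java
import java.io.IOException;
import java.nio.channels.Selector;

public class SelectNowDemo {
    // Polls with the non-blocking selectNow() and sleeps between polls,
    // so the wait is governed by the application loop instead of the
    // scheduling of select(timeout).
    static int pollOnce(Selector selector, long sleepMillis)
            throws IOException, InterruptedException {
        int ready = selector.selectNow();  // returns immediately, never blocks
        if (ready == 0) {
            Thread.sleep(sleepMillis);     // application-controlled retry pacing
        }
        return ready;
    }
}
```

The trade-off is that a polling loop burns more CPU and still depends on the OS honoring the sleep, so it trades one source of jitter for another.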
It doesn't matter what the timeout is; as soon as a client connects, the selector should wake up immediately. Therefore you have some more serious bug elsewhere.
fails to wake-up within the timeout(timeout=5 sec).
It's not supposed to 'wake up within the timeout'. It is supposed to wake up after the timeout expires. If you're supposed to send heartbeats within 5 seconds, a timeout of 5 seconds is too long. I would make it 2.5s in this case.
Hmm... actually the story doesn't stop there. We are not using incremental CMS, hence during the concurrent phase it is not relinquishing the CPU. We have 2 application servers on the same host with 16 cores, and each has 4 parallel CMS threads besides the application threads, of which there are roughly 45 to 60. Hence CPU starvation is the most likely cause, especially since we see that every time the selector gets delayed it is by 100-200 milliseconds, immediately after the concurrent-mark phase.
