I am using a GAE task queue to update bulk data in Datastore. The number of records is around 1-2M. To do this I scheduled a cron job and configured a queue this way:
<queue>
  <name>queueName</name>
  <rate>20/s</rate>
  <bucket-size>300</bucket-size>
  <retry-parameters>
    <task-retry-limit>1</task-retry-limit>
  </retry-parameters>
  <max-concurrent-requests>800</max-concurrent-requests>
</queue>
Each task does the following:
Fetch 1500 records from Datastore using a cursor.
If the next cursor exists, create a new task and push it onto the queue.
Process the 1500 fetched records, that is, update all 1500 and write them back to Datastore.
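As a quick sanity check on how many chained tasks this pattern should produce, here is a minimal sketch. The class and method names are hypothetical, and it assumes every task reads a full page of 1500 records except possibly the last one:

```java
final class TaskCountEstimate {
    // Number of chained tasks needed to walk `records` entities in pages of `pageSize`.
    static long tasksNeeded(long records, int pageSize) {
        return (records + pageSize - 1) / pageSize; // ceiling division
    }

    public static void main(String[] args) {
        System.out.println(tasksNeeded(1_000_000L, 1500)); // prints 667
        System.out.println(tasksNeeded(2_000_000L, 1500)); // prints 1334
    }
}
```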
The expected number of tasks added should be around 667, but I can only see 40 tasks in the logs.
In the logs, I can see that the 40 tasks are added to the queue within 40 sec. I'm not getting any errors in the logs.
Can anybody help me understand what is happening? Why am I not able to add all the tasks?
Thanks
In your approach the task enqueueing appears to be very tightly coupled with the task request processing, in the sense that the request for one such task in the queue needs to be processed to enqueue the next task. So you need to look at the rate-limiting factors your task processing may hit. The ones from your queue configuration are pretty generous, but there are others.
If you configured your app as threadsafe, and if your app design takes advantage of it, an instance of your app will be able to handle multiple requests concurrently, up to a maximum that depends on its max-concurrent-requests config and its processing latency. Without the threadsafe config that maximum is 1.
Once an instance hits the max number of task requests it can process concurrently, it won't start processing new tasks from the queue (so it won't execute the step that enqueues a new task) until it completes processing at least one of the tasks already in progress. The task enqueueing rate per app instance is thus effectively limited: each running instance can contribute to the overall number of tasks in the queue only with a number equal to the max number of tasks it can process in parallel.
But your app is configured for automatic scaling, so once you manage to quickly "fill up" all your running instances, the scheduler will start new instances for it. As new instances are started they will be able to process more of the tasks in the queue and thus also enqueue new tasks, contributing with the above-mentioned amount to the total number of tasks in the queue.
But this growth in the number of enqueued tasks can be much slower than it was before the instances hit their max processing rate: it takes the scheduler some time to measure how the new instances help with traffic and to determine whether more instances are needed. The overall growth in the number of tasks in the queue will have a "staircase" profile, with the height of a step being the max number of concurrent requests an instance can handle and the number of steps being the number of new instances started, plus 1.
Since you aren't seeing any actual task enqueuing errors I can only suspect that you're somehow hitting a rate limit in processing your enqueued tasks or somehow that processing completely stops. There can be many reasons for it, including, for example:
hitting your app's daily budget (most likely due to the number of instance-hours)
hitting automatic scaling limits
You'd have to investigate your app from this perspective to pinpoint the culprit.
Side note: I assume this is on GAE, not on the development server (which doesn't respect the task queue configs and most likely can't get even close to GAE's parallel processing capability).
Context:
I am designing an application which will be consuming messages from various Amazon SQS queues. (More than 25 queues)
For this, I am thinking of creating a library to consume messages from the queues, (call it MessageConsumer)
I want to dynamically allocate threads to receive/process messages from different queues, based on the traffic in each queue, to minimise waste of resources.
There are 2 ways I can go about it.
1) Have only one type of thread that polls queues, receives messages and processes those messages, with one common thread pool for all queues.
2) Have separate polling and worker threads.
In the second case, I will have a common worker thread pool and a constant number of pollers per queue.
Edit:
To elaborate on the second case:
I am planning to have 1 continuously running thread per queue to poke that queue for the amount of messages in it. Then have some logic to decide the number of polling threads required per queue based on the number of messages in each queue and priority of the queue.
I don't want polling threads running all the time because that may cause empty receives (sqs.receiveMessages()), so I will allocate the polling threads based on traffic.
The high traffic queues will have more polling threads and hence more jobs being submitted to worker thread pool.
Please suggest any improvements or point out flaws in this design.
The recommended process is:
Workers poll the queue using Long Polling (which means it will wait for a maximum of 20 seconds before returning an empty response)
They can request up to 10 messages per call to ReceiveMessage()
The worker processes the message(s)
The worker deletes the message from the queue
Repeat
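The loop above can be sketched as follows. This is only an illustration under stated assumptions: SimpleQueue and InMemoryQueue are hypothetical stand-ins invented here so the example runs without network access, and they are not the AWS SDK API. A real worker would call the SDK's receive and delete operations instead.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-in for a queue client. This is NOT the AWS SDK API.
interface SimpleQueue {
    List<String> receive(int maxMessages, int waitSeconds);
    void delete(String message);
}

// In-memory fake so the loop can be run without any network access.
final class InMemoryQueue implements SimpleQueue {
    private final Deque<String> visible = new ArrayDeque<>();
    int deleted = 0;
    int largestBatch = 0;

    void send(String message) { visible.add(message); }

    @Override
    public List<String> receive(int maxMessages, int waitSeconds) {
        // A real long poll would block up to waitSeconds; the fake returns at once.
        List<String> batch = new ArrayList<>();
        while (batch.size() < maxMessages && !visible.isEmpty()) {
            batch.add(visible.poll());
        }
        largestBatch = Math.max(largestBatch, batch.size());
        return batch;
    }

    @Override
    public void delete(String message) { deleted++; }
}

final class QueueWorker {
    static final int MAX_BATCH = 10;          // per-call limit mentioned above
    static final int LONG_POLL_SECONDS = 20;  // long-polling wait mentioned above

    // Receive, process, delete, repeat. Stops at the first empty response so the
    // sketch terminates; a real worker would simply poll again.
    static int drain(SimpleQueue queue, Consumer<String> handler) {
        int processed = 0;
        while (true) {
            List<String> batch = queue.receive(MAX_BATCH, LONG_POLL_SECONDS);
            if (batch.isEmpty()) {
                return processed;
            }
            for (String message : batch) {
                handler.accept(message);
                queue.delete(message);  // delete only after processing succeeded
                processed++;
            }
        }
    }
}
```

Deleting only after the handler returns means a message whose processing throws is never deleted by this loop, which is the behaviour you would want so that a real queue can redeliver it.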
If you wish to scale the number of workers, you can base this on the ApproximateNumberOfMessagesVisible metric in Amazon CloudWatch. If the number goes too high, add a worker. If it drops to zero (or below some threshold), remove a worker.
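A minimal sketch of that scaling rule, assuming the backlog figure has already been read from the metric. The class name, parameter names and thresholds are hypothetical; no CloudWatch call is made here:

```java
final class WorkerScaler {
    // Returns the worker count to aim for, given the current count and the
    // queue backlog (for example ApproximateNumberOfMessagesVisible).
    static int desiredWorkers(int current, long backlog, long scaleOutAbove,
                              long scaleInAtOrBelow, int minWorkers, int maxWorkers) {
        int target = current;
        if (backlog > scaleOutAbove) {
            target = current + 1;        // backlog too high: add a worker
        } else if (backlog <= scaleInAtOrBelow) {
            target = current - 1;        // backlog drained: remove a worker
        }
        // Never go outside the configured bounds.
        return Math.max(minWorkers, Math.min(maxWorkers, target));
    }
}
```

For example, desiredWorkers(3, 5000, 1000, 0, 1, 10) asks for 4 workers, while a backlog of 0 with the same settings asks for 2.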
It is probably easiest to have each worker only poll one queue.
There is no need for "pollers". The workers do the polling themselves. This way, you can scale the workers independently, without needing some central "polling" service trying to manage it all. Simply launch a new Amazon EC2 instance, launch some workers on it, and they start processing messages. When scaling in, just terminate the workers or even the instance -- again, no need to register/deregister workers with a central "polling" service.
I'm building a system where users can set a future date (down to hours and minutes) in a calendar. At that date a trigger calls a certain task, unique for every user.
Every user can set a different date. The system will have 10k+ users from the start and a user can create more than one trigger.
So assuming I have 10k users and each user creates on average 3 triggers => 30k triggers with 30k different dates.
All dates are saved in a database.
I'm new to quartz, can this be done in a more optimized way?
I was thinking about making a task run every minute that gets the tasks that are supposed to run in the next hour and removes them from the database.
Do you have any better ideas? Has anyone used Quartz for a large number of triggers?
You have the schedule backed by the database. If I understand the idea, you want Quartz to load all the upcoming tasks to execute them in the future.
This is a problematic approach:
Synchronization Issues: I assume that users can edit, remove and add new tasks to the database. You would have to periodically ask the database to refresh the state of the quartz jobs, remove some jobs, edit other jobs etc. This may not be trivial. The state of the program would be a long living cache which needs to be synchronised often.
Performance and scalability issues: Even if the proposed solution is OK for 30k tasks, it may not be OK for 70k or 700k tasks. Your approach is not easy to scale: adding a new machine would require an additional layer of synchronisation to decide which machine should actually execute which job (as all of them have all the tasks).
What I would propose:
Add a "stage" column to the Tasks table (new, queued, running, finished, failed).
Divide your solution into several components. (Initially they can run on a single machine, but it will be easy to scale.)
Components:
Task Finder: Executed periodically (once every few seconds). Scans the database for tasks that are "new", and due soon. Sends the tasks found to Message Queue and marks the task as "queued" in the db. Marking as "queued" has to be done carefully as there can be multiple "task finders". (As an addition it may find the tasks that have been marked as "queued" or "running" more than N minutes ago and are not "finished" nor "canceled" - probably need to re-run these)
Message Queue: Connector between Task Finder and Task Executor.
Task Executor: Listens to the Message Queue and processes the tasks it receives. Marks the tasks as "running" initially and "finished" or "failed" later on.
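The three components can be sketched in a single process as follows. This is an illustration only: all class and method names are hypothetical, an in-memory list stands in for the Tasks table, a BlockingQueue stands in for the message queue, and compareAndSet stands in for the careful "mark as queued" step (in a database this could be a conditional update that only succeeds while the stage is still "new").

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

enum Stage { NEW, QUEUED, RUNNING, FINISHED, FAILED }

final class ScheduledTask {
    final String id;
    final Instant dueAt;
    final AtomicReference<Stage> stage = new AtomicReference<>(Stage.NEW);

    ScheduledTask(String id, Instant dueAt) {
        this.id = id;
        this.dueAt = dueAt;
    }
}

final class TaskPipeline {
    // Stand-in for the Message Queue component.
    final BlockingQueue<ScheduledTask> messageQueue = new LinkedBlockingQueue<>();

    // Task Finder: claim tasks that are NEW and due within `lookAhead` of `now`.
    int findAndQueue(List<ScheduledTask> table, Instant now, Duration lookAhead) {
        Instant horizon = now.plus(lookAhead);
        int queued = 0;
        for (ScheduledTask task : table) {
            // Only one finder can win the NEW -> QUEUED transition for a task.
            if (!task.dueAt.isAfter(horizon)
                    && task.stage.compareAndSet(Stage.NEW, Stage.QUEUED)) {
                messageQueue.add(task);
                queued++;
            }
        }
        return queued;
    }

    // Task Executor: take one queued task, run it, and record the outcome.
    boolean executeNext(Consumer<ScheduledTask> work) {
        ScheduledTask task = messageQueue.poll();
        if (task == null) {
            return false; // nothing queued right now
        }
        task.stage.set(Stage.RUNNING);
        try {
            work.accept(task);
            task.stage.set(Stage.FINISHED);
        } catch (RuntimeException e) {
            task.stage.set(Stage.FAILED);
        }
        return true;
    }
}
```

Calling findAndQueue twice in a row returns 0 the second time for the same tasks, which is the property that lets several finders run side by side.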
With this approach you can have:
multiple Task Executors on multiple machines
multiple Task Finders on multiple machines
no Single Point of Failure: even if one of the Task Finders or Executors fails, some of the tasks will be delayed but they will be picked up and run afterwards.
This may not address all the scenarios but would be a good starting point.
I don't see why you need quartz here at all. As far as I remember, quartz is best used to schedule backend internal processes, not user-defined tasks obtained from db.
Just process the trigger as it is created, save a row to your tasks table with start_date based on the trigger, and every second select all incomplete tasks with start_date < sysdate. If the job is repeating, calculate the next execution time and insert a new task row or update the previous one accordingly.
As Sam pointed out there are some nice topics addressing the same problem:
Quartz Performance
Quartz FAQ
In a system like the one described, handling this number of triggers should mostly not be a problem. But in my experience it is better to create something like a "JobChecker". If you let your users create their own triggers, it could really break Quartz in some cases. For example, if 5000 users create an event for the exact same time, Quartz will have a hard time handling them correctly. (This situation is not likely to occur often, but it is possible as your specification does not exclude it.) Quartz has difficulties only when a lot of triggers should be fired at the same time.
My recommendation for this problem is to create one job that runs every hour/minute etc. and handles all user-set events. This is similar to a cron job in bash. With this kind of processing your system will be pretty stable even if the number of "triggers" increases dramatically. Basically your line of thought is correct if you strive for scalability.
My Google App Engine application is adding a large number of deferred tasks to a task queue. The tasks are scheduled to run every x seconds. If I understand the bucket-size property b correctly, a high value would prevent the deferred tasks from running until b tasks have been added. However, there is a close-to-realtime requirement that the tasks run as scheduled. I do not want the tasks to be blocked until the bucket-size is reached. Instead they should run as close to their scheduled time as possible.
To support this use case, should I use a bucket-size of 1 and a rate of 500 (which is the current maximum rate)? Which other approaches exist to support this? Thanks!
The bucket size does not prevent tasks from running individually. It plays a different role.
Suppose you have an empty queue with rate of 500 tasks per second, and several hours where no tasks are added or started. Then suddenly a large number of tasks are added at once. How many of these tasks would you like started immediately? Set this number as your bucket size. For example, with a bucket size of 1000, 1000 tasks will be started immediately (then 500 per second going forward).
How does this work? The bucket is topped up by 500 tokens every second (the queue's rate), up to a maximum of the bucket size. When tasks are available to start, they will only be started while the bucket is not empty, and one token is removed from the bucket as each task is started.
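The token-bucket behaviour described above can be sketched like this. It is a simplified model with hypothetical names, not App Engine's actual implementation: it assumes the bucket is full after a long idle period and that time advances in explicit steps.

```java
final class TokenBucket {
    private final double ratePerSecond; // tokens added per second (the queue's rate)
    private final double bucketSize;    // maximum number of tokens held
    private double tokens;

    TokenBucket(double ratePerSecond, double bucketSize) {
        this.ratePerSecond = ratePerSecond;
        this.bucketSize = bucketSize;
        this.tokens = bucketSize; // assumption: the queue has been idle, so the bucket is full
    }

    // Top up the bucket for `seconds` of elapsed time, capped at the bucket size.
    void advance(double seconds) {
        tokens = Math.min(bucketSize, tokens + ratePerSecond * seconds);
    }

    // Start as many of the `pending` tasks as there are tokens; one token per task.
    int start(int pending) {
        int started = (int) Math.min(pending, Math.floor(tokens));
        tokens -= started;
        return started;
    }
}
```

With new TokenBucket(500, 1000) and 5000 tasks arriving at once, start(5000) starts 1000 tasks immediately; after advance(1), the next call starts 500 more. With a bucket size of 1, only one task starts in the initial burst, but tasks are never held back waiting for the bucket to "fill".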
You should NOT use task queues (TQ) for deferred tasks that must run close to real time on the assumption that bucket and rate settings will guarantee high throughput. There have been several discussion threads in Google Groups about infrequent delays in task start times that are minutes or more in length. Bucket size and rate have no effect on this -- your TQ tasks will simply sit there while your high-throughput TQ is idle. To date I have not seen any explanation from Google as to why this occurs. Again, if you use TQs for close-to-real-time tasks you MUST handle as an exception the infrequent times when your tasks are delayed for minutes before starting. (I in fact do this, and have not yet been negatively affected, but you have to have code in place to handle a delayed task.) My great hope is that with the new server/application testing underway, Google will find an easy way to kill this incredibly big issue with TQs (fingers crossed).
I'm working on a job dispatcher for appengine, and the default scheduler always winds up firing up 3-4 instances that do all the work, some overflow instances that might take thousands of tasks, or only a couple and then sits there burning cpus doing nothing.
My task involves processing jobs for many different sized domains, sometimes there's huge throughput, and other times it's one user with 10,000 models to update; if I turn the normal appengine task scheduler loose, it fails in two ways: 1) backends never shut down, and when memory hits the cap, java gc makes an instance thrash and act like it's almost a zombie yet never shut down {and still take/hold jobs}, and 2) many domains have a single user that takes far longer than all the others to process, and this keeps a backend alive long after the rest of the domain has finished.
These tasks must run throughout the day, and it takes multiple backends to handle fanout, so I can't just dump them all on a B8 and call it a day; we need a dispatcher to manage how tasks get allocated to backends.
Now, I don't want to pay datastore ops on every task just to save a few minutes of cpu time, so my plan of attack {please critique} is to use a static ConcurrentHashMap in RAM, start each run() in a try, have every deferred task put its [hashcode, startTime] in at startup and remove(hashcode) in a finally. There will be one such map per backend instance that's running jobs, wrapped in a method, BackendCounter.addToLiveMap(this); its .size() serves as a running total of how many jobs are alive on that backend {with timestamp to detect zombie jobs that run >10 minutes}. The job dispatcher can fire off a worker thread per instance to monitor how many jobs, excluding itself, are running in that instance, and keep a ranked list in memcache of which instances have how many tasks alive. If one instance drops below a threshold of X live tasks, pick an overflow instance to defer to, then have the method BackendCounter.addToLiveMap(this) throw an exception I can catch to tell jobs to just schedule themselves to a new instance {ChangeInstanceException#getNewTarget()}. This way I can prevent barely-used instances from getting new jobs so they have a chance to shut down, paying only for some memcache ops, and fanout only pays a write and delete to the static map.
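A minimal sketch of the per-instance live-job map described above, to make the plan concrete. BackendCounter.addToLiveMap comes from the description; removeFromLiveMap, liveJobs, zombies and CountedJob are hypothetical names added for illustration, and the memcache ranking and ChangeInstanceException parts are left out.

```java
import java.util.concurrent.ConcurrentHashMap;

final class BackendCounter {
    // One static map per JVM, i.e. per backend instance: job hash -> start time (ms).
    private static final ConcurrentHashMap<Integer, Long> LIVE = new ConcurrentHashMap<>();

    static void addToLiveMap(Object job) {
        // identityHashCode can in principle collide; acceptable for an approximate count.
        LIVE.put(System.identityHashCode(job), System.currentTimeMillis());
    }

    static void removeFromLiveMap(Object job) {
        LIVE.remove(System.identityHashCode(job));
    }

    // Running total of jobs alive on this instance.
    static int liveJobs() {
        return LIVE.size();
    }

    // Jobs that have been alive longer than maxAgeMillis as of nowMillis.
    static long zombies(long nowMillis, long maxAgeMillis) {
        return LIVE.values().stream()
                .filter(start -> nowMillis - start > maxAgeMillis)
                .count();
    }
}

// Wraps the real work so registration and removal always happen in pairs.
final class CountedJob implements Runnable {
    private final Runnable work;

    CountedJob(Runnable work) {
        this.work = work;
    }

    @Override
    public void run() {
        BackendCounter.addToLiveMap(this);
        try {
            work.run();
        } finally {
            BackendCounter.removeFromLiveMap(this); // runs even if the work throws
        }
    }
}
```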
That takes care of problem two, which is the instance-hour killer. As for problem one, which is how to prevent one instance {usually instance 0 and 1} from hitting peak memory and start turning toward the dark side, I am torn between two options.
On the one hand, since the call to BackendCounter.addToLiveMap(this) is already expected to throw ChangeInstanceException, I can simply check memory there:
Runtime rt = Runtime.getRuntime();
// Redirect new work once less than 10% of the current heap is free.
if ((float) rt.freeMemory() / rt.totalMemory() < 0.1) throw new ChangeInstanceException(getOverflowInstance());
This naive approach will simply tell any instance approaching its memory limit to send all new work elsewhere.
On the other hand, I could keep instance 0 and 1 for handling overflow {and toggle between which of the two gets new jobs to give them chances to shut down}, then send the fanout to instances 2+, which will only run until they drop to say, 10 or 15 jobs in parallel. The fanout is pretty consistent, and only takes a couple minutes, so instances 2, 3 and, at most, 4, will need to turn on, and be given time to turn off while a different instance gets hit with more load.
The only thing I'm afraid of is jobs starting to bounce from one instance to another, which can probably be overcome with a redirect header limit to skip throwing ChangeInstanceException.
Any thoughts or advice are greatly appreciated.
Is it possible to limit the number of JMS receiver instances to a single instance? I.e. only process a single message from a queue at any one time?
The reason I ask is because I have a fairly intensive render type process to run for each message (potentially many thousands). I'd like to limit the execution of this code to a single instance at a time.
My application server is JBoss AS 6.0
You can configure the queue listener pool to have a single thread, so no more than one listener is handling requests, but this makes no sense to me.
The right answer is to tune the size of the thread pool to balance performance with memory requirements.
Many thousands? Per second, per minute, per hour? The rate at which they arrive, and the time each task takes, are both crucial. How much time, memory, CPU per request? Make sure you configure your queue to handle what could be a rather large backlog.
UPDATE: If ten messages arrive per second, and it takes 10 seconds for a single listener to process a message, then you'll need 101 listener threads to be able to keep up. (10 messages/second * 10 seconds means 100 messages arrive by the time the first listener finishes its 10 second task. The 101st listener will handle the 101st message, and subsequent listeners will finish in time to keep up.) If you need 1 MB of RAM per listener, you'll need 101 MB RAM just to process all the messages on one server. You'll need a similar estimate for CPU as well.
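For the steady-state part of that estimate, the arrival rate multiplied by the per-message processing time (Little's law) gives the average number of listeners busy at once. A minimal sketch with hypothetical class and method names; for the figures above it yields 100, one less than the 101 quoted, which counts an extra listener for the message arriving just as the first listener finishes:

```java
final class ListenerEstimate {
    // Average number of listeners busy at once: arrival rate x service time.
    static long listenersNeeded(double arrivalsPerSecond, double secondsPerMessage) {
        return (long) Math.ceil(arrivalsPerSecond * secondsPerMessage);
    }

    // Rough memory footprint if each listener needs a fixed amount of RAM.
    static long memoryMb(long listeners, long mbPerListener) {
        return listeners * mbPerListener;
    }

    public static void main(String[] args) {
        long listeners = listenersNeeded(10, 10);
        System.out.println(listeners);              // prints 100
        System.out.println(memoryMb(listeners, 1)); // prints 100
    }
}
```

The same calculation makes the single-listener case from the question easy to judge: one listener keeps up only while the arrival rate times the processing time stays at or below 1.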
It might be wise to think about multiple queues on multiple servers and load balancing between them if one server isn't sufficient.