I'm working on a project where there is a large input of data elements that need to be processed. The processing of each is independent of the others, and I need to return a result from each. What I'm doing now is creating a Callable task for each element to do the processing and using an ExecutorCompletionService to collect the Future results as the tasks complete.
I then have another thread that pulls the Future objects from the ExecutorCompletionService queue. This thread just spins in an infinite while loop and calls take(), which blocks until a Future shows up in the queue.
What I'm trying to do is avoid the scenario where the queue of Future objects grows faster than I can pull them off, so I'd like to put the task-creating thread to sleep if I get behind on processing the Future results.
The problem I'm running into is that I'm not able to find a way to see how many Future objects are in the ExecutorCompletionService queue. Is there a way to do this?
I could probably keep an external counter that I increment when a new task is created and decrement when a Future is processed, but that only gives me the number of outstanding tasks, not the number that are actually done. Any thoughts on the best way to tackle this?
You can pass in the queues the executor and the completion service use via their overloaded constructors. Since BlockingQueue implements Collection, you can just call size() on each queue. You will have one queue for completed Futures and another for the executor's pending work, so between those two you can tell how many tasks have been submitted and how many have completed.
You'll just need to hold on to those queues after you create them and pass them to whatever is watching their sizes.
http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ExecutorCompletionService.html shows the overloaded constructor
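For instance, a sketch along these lines (the class name, pool size, and backlog threshold are all illustrative, and String results stand in for your real result type):

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ThrottledProcessor {
    private static final int MAX_COMPLETED_BACKLOG = 1000; // illustrative threshold

    // Create both queues yourself so you can call size() on them later.
    private final BlockingQueue<Runnable> workQueue = new LinkedBlockingQueue<Runnable>();
    private final BlockingQueue<Future<String>> completionQueue =
            new LinkedBlockingQueue<Future<String>>();

    private final ExecutorService executor =
            new ThreadPoolExecutor(4, 4, 0L, TimeUnit.MILLISECONDS, workQueue);
    private final CompletionService<String> completionService =
            new ExecutorCompletionService<String>(executor, completionQueue);

    public void submit(final String element) throws InterruptedException {
        // Throttle the producer while the result consumer is behind.
        while (completionQueue.size() > MAX_COMPLETED_BACKLOG) {
            Thread.sleep(100);
        }
        completionService.submit(new Callable<String>() {
            public String call() { return element.toUpperCase(); } // stand-in processing
        });
    }

    public int submittedNotStarted() { return workQueue.size(); }       // waiting to run
    public int completedNotTaken()   { return completionQueue.size(); } // done, not yet taken
}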
Related
I have a single-threaded job which iterates over a collection of data and customizes the data. I want to divide the collection into small sublists and have each individual sublist processed in parallel. Should I use an array of threads (where the size of the array is the number of sublists created), or a thread pool?
On what basis are you going to divide the collection further? If your jobs/data are all of the same type, leave them in one collection and let threads from the thread pool pick up tasks from it and run them in parallel.
It is better to use a thread pool in any case, because it frees you from the low-level work of managing arrays of thread objects and gives you more flexibility.
You should use an ExecutorService instance in your code and choose the right type of it.
For example:
Executors.newCachedThreadPool - if your processing logic is simple enough and doesn't require much processing time per task (otherwise the unbounded pool may produce too many concurrent threads and cause failures).
Executors.newFixedThreadPool - if your processing logic is complex enough that you should limit the number of threads.
So, I think that you should:
Create the required ExecutorService in your consumer.
Go through your collection and submit a processing job (an instance of Callable for each element) to the executor. Save the returned futures in a List<Future<?>> instance.
Iterate through the futures (waiting for all tasks to complete), save the results in a new collection, send the results somewhere, and commit the Kafka offset, as sketched below.
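A minimal sketch of those steps, assuming String elements; trim() stands in for the real customization logic and the Kafka commit is elided:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BatchProcessor {
    public List<String> processAll(List<String> items)
            throws InterruptedException, ExecutionException {
        ExecutorService executor = Executors.newFixedThreadPool(8); // bounded pool
        List<Future<String>> futures = new ArrayList<Future<String>>();
        for (final String item : items) {
            futures.add(executor.submit(new Callable<String>() {
                public String call() { return item.trim(); } // your customization here
            }));
        }
        List<String> results = new ArrayList<String>();
        for (Future<String> f : futures) {
            results.add(f.get()); // blocks until that task completes
        }
        executor.shutdown();
        return results; // send the results onward, then commit the Kafka offset
    }
}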
I have a queue of Runnables that I would like to process using a thread pool. However, some of the tasks in the queue are related to each other (this could be expressed via a common hashCode() on the Runnable) and must not be processed concurrently. When taking a task off the queue, I would like to check that no currently executing task is related; if one is, I want to hold back the new task until the related one has completed, while allowing other, unrelated tasks to continue (maintaining high throughput is important).
I could do this with a customised ThreadPoolExecutor and BlockingQueue that track the hash codes of tasks in progress, but this is somewhat complex and unwieldy.
Is anyone aware of an elegant solution to this problem?
When we talk about processing asynchronous events using an ExecutorService, why does creating a new fixed thread pool involve the use of a LinkedBlockingQueue? The events which are arriving are not dependent on one another, so why use a queue at all, given that the consumer threads still contend for the take lock? Why doesn't the Executors class offer some hybrid data structure (such as a concurrent Map implementation) where there is no need for a take lock in most cases?
There is a very good reason why a thread pool executor works with a BlockingQueue (by the way, you are not obliged to use the LinkedBlockingQueue implementation; you can use other implementations of BlockingQueue). The queue should be blocking in order to suspend worker threads when there are no tasks to execute. This blocking is done by waiting on condition variables, so idle worker threads consume no CPU resources while the queue is empty.
If you used a non-blocking queue in the thread pool, how would worker threads poll for tasks to execute? They would have to implement some kind of polling, which needlessly wastes CPU resources (it would be "busy waiting").
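To see why, here is roughly the shape of each pool worker's loop (a simplification for illustration, not the actual ThreadPoolExecutor source):

import java.util.concurrent.BlockingQueue;

// Simplified pool worker: take() parks the thread on a condition variable,
// so an idle worker burns no CPU while the queue is empty.
public class Worker implements Runnable {
    private final BlockingQueue<Runnable> queue;

    public Worker(BlockingQueue<Runnable> queue) {
        this.queue = queue;
    }

    public void run() {
        try {
            while (true) {
                Runnable task = queue.take(); // blocks until a task arrives
                task.run();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // shutdown interrupts the worker
        }
    }
}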
UPDATE:
Ok, now I fully understand the use case. You still need a blocking collection anyway. The reason is basically the same: since you are implementing producer-consumer, you need a means for worker threads to wait for messages to arrive, and you simply can't do that without a mutex + condition variable (or simply a BlockingQueue).
Regarding the map: yes, I understand how you want to use it, but unfortunately there is no such implementation provided. Recently I solved a similar problem: I needed to group incoming tasks by some criterion and execute the tasks from each group serially. As a result I implemented my own GroupThreadPoolExecutor that does this grouping. The idea is simple: group incoming tasks into a map, and add a group's next task to the executor queue when the previous task from that group completes.
There is a big discussion here - I think it's relevant to your question.
There is one fixed thread pool (say, with size = 100) that I want to use for all tasks across my app.
It is used to limit server load.
Task = web crawler, that submits first job to thread pool.
That job can generate more jobs, and so on.
One job = one HTTP I/O request.
Problem
Suppose that there is only one executing task, and it has generated 10000 jobs.
Those jobs are now queued in the thread pool's queue, and all 100 threads are busy executing them.
Suppose that I now submit a second task.
The first job of the second task will be 10001st in the queue.
It will be executed only after the 10000 jobs the first task queued up.
So, this is a problem - I don't want the second task to wait so long to start its first job.
Idea
The first idea that comes to mind is to create a custom BlockingQueue and pass it to the thread pool constructor.
That queue will hold several blocking queues, one for each task.
Its take method will then choose a random queue and take an item from it.
My problem with this is that I don't see how to remove an empty queue from this list when its task is finished. Some or all workers could end up blocked in the take method, waiting for jobs from tasks that have already finished.
Is this the best way to solve this problem?
I was unable to find any patterns for it in books or on the Internet :(
Thank you!
I would use multiple queues and draw from a random queue that contains items. Alternatively, you could assign the queues priorities and draw from the highest-priority non-empty queue.
I would suggest using a single PriorityBlockingQueue and using the 'depth' of the recursive tasks to compute the priority. With a single queue, workers block when the queue is empty, and there is no need for randomization logic across multiple queues.
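A sketch of that idea (DepthTask is my name, and tracking the depth is left to the crawler). One gotcha: tasks must go in via execute() and be mutually Comparable, because submit() would wrap them in FutureTasks that a PriorityBlockingQueue cannot order:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class DepthTask implements Runnable, Comparable<DepthTask> {
    private final int depth;      // 0 for a task's first job, +1 per fan-out level
    private final Runnable work;

    public DepthTask(int depth, Runnable work) {
        this.depth = depth;
        this.work = work;
    }

    public void run() { work.run(); }

    // Shallower jobs win, so a newly submitted task's first job
    // doesn't wait behind thousands of deep crawl jobs.
    public int compareTo(DepthTask other) {
        return Integer.compare(this.depth, other.depth);
    }

    public static ThreadPoolExecutor newPool(int threads) {
        BlockingQueue<Runnable> queue = new PriorityBlockingQueue<Runnable>();
        return new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS, queue);
    }
}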
Are there any implementations of a thread pool (in Java) that ensures all tasks for the same logical ID are executed on the same thread?
The logic I'm after is if there is already a task being executed on a specific thread for a given logical ID, then new tasks with the same ID are scheduled on the same thread. If there are no threads executing a task for the same ID then any thread can be used.
This would allow tasks for unrelated IDs to be executed in parallel, but tasks for the same ID to be executed in serial and in the order submitted.
If not, are there any suggestions on how I might extend ThreadPoolExecutor to get this behaviour (if that's even possible)?
UPDATE
Having spent longer thinking about this, I don't actually require that tasks for the same logical ID get executed on the same thread, just that they don't get executed at the same time.
An example for this would be a system that processed orders for customers, where it was OK to process multiple orders at the same time, but not for the same customer (and all orders for the same customer had to be processed in order).
The approach I'm taking at the moment is to use a standard ThreadPoolExecutor with a customised BlockingQueue, and to wrap each Runnable with a custom wrapper. The wrapper's logic is:
Atomically attempt to add the ID to a concurrent 'running' set (backed by a ConcurrentHashMap) to see whether a task for the same ID is currently running
if the add fails, push the task back onto the front of the queue and return immediately
if it succeeds, carry on
Run the task
Remove the task's ID from the 'running' set
The queue's poll() methods then only return tasks whose IDs are not currently in the 'running' set.
The trouble with this is that I'm sure there are going to be a lot of corner cases that I haven't thought about, so it's going to require a lot of testing.
Create an array of executor services, each running one thread, and assign your queue entries to them by the hash code of your item ID. The array can be of any size, depending on how many threads you want to use at most.
This restricts what we can use of the executor service, but it still allows us to use its ability to shut down its only thread when no longer needed (with allowCoreThreadTimeOut(true)) and restart it as required. Also, all the queuing machinery will work without rewriting it.
The simplest idea could be this:
Have a fixed map of BlockingQueues. Use a hash mechanism to pick a queue based on the task ID; the hash algorithm should pick the same queue for the same ID. Start one thread for every queue; each thread will pick tasks from its own dedicated queue and execute them.
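A compact sketch of that idea (the class name and the sign-masked hash are my assumptions):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class HashedQueuePool {
    private final List<BlockingQueue<Runnable>> queues =
            new ArrayList<BlockingQueue<Runnable>>();

    public HashedQueuePool(int n) {
        for (int i = 0; i < n; i++) {
            final BlockingQueue<Runnable> q = new LinkedBlockingQueue<Runnable>();
            queues.add(q);
            Thread worker = new Thread(new Runnable() {
                public void run() {
                    try {
                        while (true) {
                            q.take().run(); // one thread per queue => per-id serial order
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
            worker.start();
        }
    }

    // The same id always hashes to the same queue, hence the same thread.
    public void submit(Object taskId, Runnable task) throws InterruptedException {
        int index = (taskId.hashCode() & 0x7fffffff) % queues.size();
        queues.get(index).put(task);
    }
}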
p.s. the appropriate solution depends strongly on the type of work you assign to the threads
UPDATE
Ok, how about this crazy idea, please bear with me :)
Say we have a ConcurrentHashMap which holds references id -> OrderQueue:
ID1->Q1, ID2->Q2, ID3->Q3, ...
Meaning that every id is now associated with its own queue. OrderQueue is a custom blocking queue with an additional boolean flag - isAssociatedWithWorkingThread.
There is also a regular BlockingQueue which we will call the amortizationQueue for now; you'll see its use later.
Next, we have N working threads. Every working thread has its own working queue, which is a BlockingQueue containing the ids associated with that thread.
When a new id comes in, we do the following:
create a new OrderQueue (isAssociatedWithWorkingThread=false)
put the task into the queue
put id -> OrderQueue into the map
put this OrderQueue into the amortizationQueue
When an update for an existing id comes in, we do the following:
pick the OrderQueue from the map
put the task into the queue
if isAssociatedWithWorkingThread == false:
    put this OrderQueue into the amortizationQueue
Every working thread does the following:
take the next id from its working queue
take the OrderQueue associated with this id from the map
take all tasks from this queue
execute them
mark isAssociatedWithWorkingThread=false for this OrderQueue
put this OrderQueue into the amortizationQueue
Pretty straightforward. Now to the fun part - work stealing :)
If at some point a working thread finds itself with an empty working queue, it does the following:
go to the pool of all working threads
pick one (say, the one with the longest working queue)
steal an id from *the tail* of that thread's working queue
put this id into its own working queue
continue with regular execution
And there is also one additional thread which does the amortization work:
while (true)
    take the next OrderQueue from the amortizationQueue
    if the queue is not empty and isAssociatedWithWorkingThread == false
        set isAssociatedWithWorkingThread = true
        pick any working thread and add the id to its working queue
I'd have to spend more time thinking about whether you can get away with an AtomicBoolean for the isAssociatedWithWorkingThread flag, or whether checking/changing this flag needs to be a blocking operation.
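For what it's worth, the non-blocking version of that check/claim would be a plain compare-and-set; a tiny sketch (the surrounding class is assumed):

import java.util.concurrent.atomic.AtomicBoolean;

public class OrderQueueFlag {
    private final AtomicBoolean isAssociatedWithWorkingThread = new AtomicBoolean(false);

    // Exactly one of several racing callers gets true and may hand
    // the OrderQueue's id to a working thread.
    public boolean tryClaim() {
        return isAssociatedWithWorkingThread.compareAndSet(false, true);
    }

    // Called by the working thread after draining the queue; the
    // amortization thread's recheck catches tasks added in between.
    public void release() {
        isAssociatedWithWorkingThread.set(false);
    }
}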
I had to deal with a similar situation recently.
I ended up with a design similar to yours. The only difference was that the 'running' structure was a map rather than a set: a map from ID to a queue of Runnables. When the wrapper around a task's runnable sees that its ID is present in the map, it adds the task's runnable to that ID's queue and returns immediately. Otherwise the ID is added to the map with an empty queue and the task is executed.
When a task is done, the wrapper checks the ID's queue again. If the queue is not empty, the next runnable is picked and executed. Otherwise the ID is removed from the map and we're done.
I'll leave shutdown and cancellation as an exercise for the reader :)
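For illustration, a sketch of that design (class and method names are mine; a single lock keeps the check-then-act on the map atomic):

import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.Executor;

public class KeyedSerialExecutor {
    // id -> queue of runnables waiting behind the one currently running
    private final Map<Object, Queue<Runnable>> running =
            new HashMap<Object, Queue<Runnable>>();
    private final Executor pool;

    public KeyedSerialExecutor(Executor pool) {
        this.pool = pool;
    }

    public void execute(Object id, Runnable task) {
        synchronized (running) {
            Queue<Runnable> pending = running.get(id);
            if (pending != null) {   // a task for this id is in flight:
                pending.add(task);   // queue behind it and return immediately
                return;
            }
            running.put(id, new ArrayDeque<Runnable>());
        }
        pool.execute(wrap(id, task));
    }

    private Runnable wrap(final Object id, final Runnable task) {
        return new Runnable() {
            public void run() {
                try {
                    task.run();
                } finally {
                    Runnable next;
                    synchronized (running) {
                        next = running.get(id).poll();
                        if (next == null) {
                            running.remove(id); // nothing queued: done with this id
                        }
                    }
                    if (next != null) {
                        pool.execute(wrap(id, next)); // chain the next task in order
                    }
                }
            }
        };
    }
}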
Our approach is similar to the one in the update to the original question. We have a wrapper class, a Runnable containing a queue (a LinkedTransferQueue), which we call a RunnableQueue. The runnable queue has this basic API:
public class RunnableQueue implements Runnable
{
    public RunnableQueue(String name, Executor executor);

    // Drains the internal queue, running queued tasks one by one.
    public void run();

    // Enqueues a runnable; re-posts this RunnableQueue to the executor if idle.
    public void execute(Runnable runnable);
}
When the user submits the first Runnable via the execute call, the RunnableQueue enqueues itself on the executor. Subsequent calls to execute get queued up inside the RunnableQueue. When the RunnableQueue gets executed by the thread pool (via its run method), it starts to "drain" the internal queue by executing the runnables serially, one by one. If execute is called on the RunnableQueue while it is executing, the new runnables simply get appended to the internal queue. Once the queue is drained, the run method completes and the RunnableQueue "leaves" the executor pool. Rinse, repeat.
We have other optimizations that do things like only let some number of runnables run (e.g. four) before the RunnableQueue re-posts itself to the executor pool.
The only really tricky bit (and it isn't that hard) is synchronizing around whether it is currently posted to the executor, so that it doesn't repost itself, or miss a post when it should.
Overall we find this works pretty well. The "ID" (semantic context) for us is the RunnableQueue itself. The code that needs this (i.e. a plugin) holds a reference to the RunnableQueue and not the executor pool, so it is forced to work exclusively through the RunnableQueue. This not only guarantees all accesses are serially sequenced (thread confinement) but lets the RunnableQueue "moderate" the plugin's job loading. Additionally, it requires no centralized management structure or other point of contention.
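For reference, here is a minimal sketch of the drain pattern described above, without the batching optimization (the scheduled flag is my name for the posted/not-posted state being synchronized on):

import java.util.Queue;
import java.util.concurrent.Executor;
import java.util.concurrent.LinkedTransferQueue;

public class RunnableQueue implements Runnable {
    private final Queue<Runnable> tasks = new LinkedTransferQueue<Runnable>();
    private final Executor executor;
    private boolean scheduled; // guarded by 'this': are we posted to the pool?

    public RunnableQueue(Executor executor) {
        this.executor = executor;
    }

    public void execute(Runnable runnable) {
        tasks.add(runnable);
        synchronized (this) {
            if (!scheduled) {          // first task in: post ourselves to the pool
                scheduled = true;
                executor.execute(this);
            }
        }
    }

    public void run() {
        while (true) {
            Runnable next = tasks.poll();
            if (next == null) {
                synchronized (this) {
                    if (tasks.isEmpty()) { // recheck under the lock: no missed posts
                        scheduled = false; // "leave" the pool until execute() again
                        return;
                    }
                }
                continue;              // a task slipped in; keep draining
            }
            next.run();
        }
    }
}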
I had to implement a similar solution, and the suggestion by h22 of creating an array of executor services seems the best approach to me, with one caveat: I take the modulus (%) of the ID (either the raw ID, assuming it is a long/int, or its hash code) relative to some desired maximum size and use the result as the new ID. That way I strike a balance between ending up with far too many executor service objects and still getting a good amount of concurrency in the processing.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ExecutorServiceRouter {

    private List<ExecutorService> services;
    private int size;

    public ExecutorServiceRouter(int size) {
        services = new ArrayList<ExecutorService>(size);
        this.size = size;
        for (int i = 0; i < size; i++) {
            services.add(Executors.newSingleThreadExecutor());
        }
    }

    public void route(long id, Runnable r) {
        services.get((int) (id % size)).execute(r);
    }

    public void shutdown() {
        for (ExecutorService service : services) {
            service.shutdown();
        }
    }
}
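Hypothetical usage. One caveat: if the raw ID or hash code can be negative, id % size can be negative too, so a real implementation may want to mask the sign bit, e.g. (int) ((id & Long.MAX_VALUE) % size):

public class RouterDemo {
    public static void main(String[] args) {
        ExecutorServiceRouter router = new ExecutorServiceRouter(8); // 8 serial lanes
        long customerId = 42L;
        router.route(customerId, new Runnable() {
            public void run() {
                // Every task routed with id 42 lands on the same
                // single-thread executor, so they run in submission order.
                System.out.println("processing order for customer 42");
            }
        });
        router.shutdown();
    }
}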
Extending ThreadPoolExecutor would be quite difficult. I would suggest you go with a producer-consumer system instead. Here is what I am suggesting.
You can create typical producer-consumer systems. Check out the code mentioned in this question.
Each of these systems will have a queue and a single consumer thread, which will process the tasks in the queue serially.
Now create a pool of such individual systems.
When you submit a task for a related ID, see whether there is already a system marked for that related ID which is currently processing tasks; if there is, submit the task to it.
If no system is processing tasks for that ID, mark a free system with this new related ID and submit the task to it.
This way, a single system caters to only one logical related ID at a time.
Here I am assuming that a related ID is a logical bunch of individual IDs, and that the producer-consumer systems are created for related IDs, NOT individual IDs.