Work/Task Stealing ThreadPoolExecutor - Java

In my project I am building a Java execution framework that receives work requests from a client. The work (of varying size) is broken down into a set of tasks which are then queued up for processing. There are separate queues to process each type of task, and each queue is associated with a ThreadPool. The ThreadPools are configured so that the overall performance of the engine is optimal.
This design helps us load balance the requests effectively, and large requests don't end up hogging the system resources. However, at times the solution becomes ineffective when some of the queues are empty and their respective thread pools sit idle.
To improve this I was thinking of implementing a work/task stealing technique so that a heavily loaded queue can get help from the other ThreadPools. However, this may require implementing my own Executor, as Java doesn't allow multiple queues to be associated with a ThreadPool and doesn't support the work-stealing concept out of the box.
I read about Fork/Join, but that doesn't seem like a fit for my needs. Any suggestions or alternative ways to build this solution would be very helpful.
Thanks
Andy

Executors.newWorkStealingPool
Java 8 has factory and utility methods for that in the Executors class: Executors.newWorkStealingPool
That is an implementation of a work-stealing thread pool which, I believe, is exactly what you want.
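A minimal usage sketch (the printing task is just a placeholder):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class WorkStealingDemo {
    public static void main(String[] args) throws InterruptedException {
        // Creates a work-stealing pool with parallelism equal to the
        // number of available processors.
        ExecutorService pool = Executors.newWorkStealingPool();

        for (int i = 0; i < 100; i++) {
            final int taskId = i;
            pool.submit(() -> System.out.println(
                    "task " + taskId + " on " + Thread.currentThread().getName()));
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}
```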

Have you considered the ForkJoinPool? The fork/join framework was implemented in a nicely modular fashion, so you can use just the work-stealing thread pool without using fork/join tasks at all.
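If you need more control than the Java 8 factory method offers, you can construct the pool directly. A sketch (the parallelism value is arbitrary; the point is that plain Runnables work fine):

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.TimeUnit;

public class AsyncFjpDemo {
    public static void main(String[] args) throws InterruptedException {
        // asyncMode = true selects FIFO scheduling for tasks that are never
        // joined, which suits event-style workloads like yours.
        ForkJoinPool pool = new ForkJoinPool(
                8,                                               // parallelism (arbitrary)
                ForkJoinPool.defaultForkJoinWorkerThreadFactory,
                null,                                            // no UncaughtExceptionHandler
                true);                                           // asyncMode

        pool.execute(() -> System.out.println(
                "ran on " + Thread.currentThread().getName()));

        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }
}
```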

You could implement a custom BlockingQueue (I think you mainly need to implement the offer() and take() methods) which is backed by a "primary" queue and zero or more secondary queues. take() would always take from the primary backing queue if it is non-empty; otherwise it can pull from the secondary queues.
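A rough sketch of that idea (a hypothetical class with only the two methods mentioned, so not a complete BlockingQueue implementation; take() falls back to a short timed poll rather than a fully wired-up cross-queue wakeup):

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Prefers its own backing queue, steals from the others when idle.
class StealingQueue<E> {
    private final BlockingQueue<E> primary = new LinkedBlockingQueue<>();
    private final List<BlockingQueue<E>> secondaries;  // the other pools' queues

    StealingQueue(List<BlockingQueue<E>> secondaries) {
        this.secondaries = secondaries;
    }

    public boolean offer(E e) {
        return primary.offer(e);               // producers always feed the primary
    }

    public E take() throws InterruptedException {
        while (true) {
            E e = primary.poll();              // prefer our own work
            if (e != null) return e;
            for (BlockingQueue<E> q : secondaries) {
                e = q.poll();                  // steal if a neighbor has work
                if (e != null) return e;
            }
            // Nothing anywhere: block briefly on the primary instead of spinning.
            e = primary.poll(50, TimeUnit.MILLISECONDS);
            if (e != null) return e;
        }
    }
}
```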
In fact, it may be better to have one pool where all workers have access to all the queues but "prefer" a specific queue. You can come up with your optimal work ratio by assigning different priorities to different workers. In a fully loaded system, your workers should be working at the optimal ratio. In an underloaded system, your workers should be able to help out with other queues.

Related

How does Akka benefit from ForkJoinPool?

The Akka docs state that the default dispatcher is a fork-join-executor because it "gives excellent performance in most cases".
I'm wondering why that is.
From the ForkJoinPool documentation:
A ForkJoinPool differs from other kinds of ExecutorService mainly by virtue of employing work-stealing: all threads in the pool attempt to find and execute tasks submitted to the pool and/or created by other active tasks (eventually blocking waiting for work if none exist). This enables (1) efficient processing when most tasks spawn other subtasks (as do most ForkJoinTasks), as well as (2) when many small tasks are submitted to the pool from external clients. Especially when setting asyncMode to true in constructors, ForkJoinPools may also be (3) appropriate for use with event-style tasks that are never joined.
At first, I guessed that Akka is not an example of case (1) because I can't figure out how Akka could be forking tasks; I mean, what would be the task that could be forked into many tasks?
I see each message as an independent task, which is why I think Akka is similar to case (2), where the messages are many small tasks being submitted (via ! and ?) to the ForkJoinPool.
The next question, although not strictly related to Akka, is: why can a use case that doesn't use fork and join (the main capabilities of ForkJoinPool that enable work-stealing) still benefit from a ForkJoinPool?
From Scalability of Fork Join Pool
We noticed that the number of context switches was abnormal, above 70000 per second.
That must be the problem, but what is causing it? Viktor came up with the qualified guess that it must be the task queue of the thread pool executor, since that is shared and the locks in the LinkedBlockingQueue could potentially generate the context switches when there is contention.
However, if it is true that Akka doesn't use ForkJoinTasks, then all tasks submitted by external clients are queued in the shared queue, so the contention should be the same as with a ThreadPoolExecutor.
So, my questions are:
Does Akka use ForkJoinTasks (case (1)), or is it related to case (2)?
Why is ForkJoinPool beneficial in case (2) if all the tasks submitted by external clients are pushed to a shared queue and no work-stealing happens?
What would be an example of "event-style tasks that are never joined" (case (3))?
Update
The correct answer is the one from johanandren; however, I want to add some highlights.
Akka doesn't use fork and join capabilities since AFAIK with the Actor model, or at least how we implement it, there isn't really a usecase for that (from johanandren's comment).
So my understanding that Akka is not an instance of case (1) was correct.
In my original question I said that all tasks submitted by external clients would be queued in the shared queue.
This was correct, but only for a previous version (JDK 7) of the FJP.
In JDK 8 the single submission queue was replaced by many "submission queues".
This answer explains this well:
Now, before (IIRC) JDK 7u12, ForkJoinPool had a single global submission queue. When worker threads ran out of local tasks, as well the tasks to steal, they got there and tried to see if external work is available. In this design, there is no advantage against a regular, say, ThreadPoolExecutor backed by ArrayBlockingQueue. [...]
Now, the external submission goes into one of the submission queues. Then, workers that have no work to munch on, can first look into the submission queue associated with a particular worker, and then wander around looking into the submission queues of others. One can call that "work stealing" too.
So this enabled work stealing in scenarios where fork and join aren't used. As Doug Lea says:
Substantially better throughput when lots of clients submit lots of tasks. (I've measured up to 60X speedups on micro-benchmarks). The idea is to treat external submitters in a similar way as workers -- using randomized queuing and stealing. (This required a big internal refactoring to disassociate work queues and workers.) This also greatly improves throughput when all tasks are async and submitted to the pool rather than forked, which becomes a reasonable way to structure actor frameworks, as well as many plain services that you might otherwise use ThreadPoolExecutor for.
There is another peculiarity of the FJP worth mentioning, taken from this comment:
4% is indeed not much for FJP. There's still a trade-off you make with FJP which you need to be aware of: FJP keeps threads spinning for a while to be able to handle just-in-time arriving work faster. This ensures good latency in many cases. Especially if your pool is overprovisioned, however, the trade-off is a bit of latency against more power consumption in almost-idle situations.
The FJP in Akka is run with asyncMode = true, so for the first question that is: external clients submitting short/small async workloads. Each submitted workload typically dispatches an actor to process one or a few messages from its inbox, but the pool is also used to execute Scala Future operations.
When a non-ForkJoinTask is scheduled to run on the FJP, it is adapted to a ForkJoinTask and enqueued just like any other. There isn't a single submission queue where tasks are queued (there was in an early version, JDK 7 perhaps); there are many, to avoid contention, and an idle thread can pick (steal) tasks from queues other than its own if its own is empty.
Note that by default we are currently running on a forked version of the Java 8 FJP, as we saw a significant decrease in throughput with the Java 9 FJP when it came out (it contains quite a few changes). Here's issue #21910 discussing that, if you are interested. Additionally, if you want to play around with benchmarking different pools, you can find a few *Pool benchmarks here: https://github.com/akka/akka/tree/master/akka-bench-jmh/src/main/scala/akka/actor
http://letitcrash.com/post/17607272336/scalability-of-fork-join-pool
Scalability of Fork Join Pool
Akka 2.0 message passing throughput scales way better on multi-core hardware than in previous versions, thanks to the new fork join executor developed by Doug Lea. One micro benchmark illustrates a 1100% increase in throughput!
...
http://cs.oswego.edu/pipermail/concurrency-interest/2012-January/008987.html
...
Highlights:
Substantially better throughput when lots of clients submit lots of tasks. (I've measured up to 60X speedups on microbenchmarks). The idea is to treat external submitters in a similar way as workers -- using randomized queuing and stealing. (This required a big internal refactoring to disassociate work queues and workers.) This also greatly improves throughput when all tasks are async and submitted to the pool rather than forked, which becomes a reasonable way to structure actor frameworks, as well as many plain services that you might otherwise use ThreadPoolExecutor for.
These improvements also lead to a less hostile stance about submitting possibly-blocking tasks. An added paragraph in the ForkJoinTask documentation provides some guidance (basically: we like them if they are small (even if numerous) and don't have dependencies).
...

Can we only use a BlockingQueue, or can other data structures serve as a ThreadPool task queue?

Hi, I am a newbie in concurrent programming with Java. In all the examples of concurrent programming I have seen, whenever a task queue was defined, people used different implementations of BlockingQueue.
Why only BlockingQueue? What are the advantages and disadvantages?
Why not any other data structure?
OK, I can't address exactly why the unspecified code you looked at uses certain data structures and not others. But blocking queues have nice properties. Holding only a fixed number of elements and forcing producers that would insert items over that limit to wait is actually a feature.
Limiting the queue size helps keep the application safe from a badly-behaved producer, which otherwise could fill the queue with entries until the application ran out of memory. Since it's obviously faster to insert a task into the task queue than it is to execute it, an executor is at risk of getting bombarded with work.
Also, making the producer wait applies back pressure to the system. That way the queue lets the producer know it's falling behind and can't accept more work. It's better for the producer to wait than to keep hammering the queue; back pressure lets the system degrade gracefully.
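With the standard ThreadPoolExecutor you can get both effects from a bounded queue plus a saturation policy. A sketch (the sizes are arbitrary; note that ThreadPoolExecutor rejects rather than blocks when its queue is full, so CallerRunsPolicy is used here to slow the producer down):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class BoundedPoolFactory {
    static ThreadPoolExecutor create() {
        return new ThreadPoolExecutor(
                4, 4,                               // core and max pool size (arbitrary)
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(100),      // at most 100 pending tasks
                // When the queue is full, run the task in the submitting thread,
                // which naturally throttles the producer (back pressure).
                new ThreadPoolExecutor.CallerRunsPolicy());
    }
}
```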
So you have a data structure that is easy to understand, has practical benefits for building applications and seems like a natural fit for a task queue. Of course people are going to use it.

Multiple SingleThreadExecutors for a given application...a good idea?

This question is about the fallout of using SingleThreadExecutor (JDK 1.6). Related questions have been asked and answered in this forum before, but I believe the situation I am facing is a bit different.
Various components of the application (let's call the components C1, C2, C3, etc.) generate (outbound) messages, mostly in response to (inbound) messages that they receive from other components. These outbound messages are kept in queues which are usually ArrayBlockingQueue instances - fairly standard practice perhaps. However, the outbound messages must be processed in the order they are added. I guess use of a SingleThreadExecutor is the obvious answer here. We end up having a 1:1 situation - one SingleThreadExecutor for one queue (which is dedicated to messages emanating from one component).
Now, the number of components (C1,C2,C3...) is unknown at a given moment. They will come into existence depending on the need of the users (and will be eventually disposed of too). We are talking about 200-300 such components at the peak load. Following the 1:1 design principle stated above, we are going to arrange for 200 SingleThreadExecutors. This is the source of my query here.
I am uncomfortable with the thought of having to create so many SingleThreadExecutors. I would rather try and use a pool of SingleThreadExecutors, if that makes sense and is plausible (any ready-made, seen-before classes/patterns?). I have read many posts on recommended use of SingleThreadExecutor here, but what about a pool of the same?
What do learned women and men here think? I would like to be directed, corrected or simply, admonished :-).
If your requirement is that the messages be processed in the order that they're posted, then you want one and only one SingleThreadExecutor. If you have multiple executors, then messages will be processed out-of-order across the set of executors.
If messages need only be processed in the order that they're received for a single producer, then it makes sense to have one executor per producer. If you try pooling executors, then you're going to have to put a lot of work into ensuring affinity between producer and executor.
Since you indicate that your producers will have defined lifetimes, one thing that you have to ensure is that you properly shut down your executors when they're done.
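A hypothetical sketch of that per-producer arrangement (the class and method names are made up):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// One SingleThreadExecutor per component, created lazily and
// shut down when the component is disposed of.
class OutboundDispatcher {
    private final Map<String, ExecutorService> executors = new ConcurrentHashMap<>();

    void post(String componentId, Runnable message) {
        executors.computeIfAbsent(componentId,
                id -> Executors.newSingleThreadExecutor())
                .execute(message);          // messages for one component stay ordered
    }

    void dispose(String componentId) {
        ExecutorService ex = executors.remove(componentId);
        if (ex != null) {
            ex.shutdown();                  // finishes queued messages, then exits
        }
    }
}
```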
Messaging and batch jobs are problems that have been solved time and time again. I suggest not attempting to solve them again. Instead, look into Quartz, which maintains thread pools, persists tasks in a database, etc. Or, maybe even better, look into JMS/ActiveMQ. But at the very least look into Quartz, if you have not already. Oh, and Spring makes working with Quartz so much easier...
I don't see any problem there. Essentially you have independent queues and each has to be drained sequentially; one thread for each is a natural design. Anything else you can come up with is essentially the same. As an example, when Java NIO first came out, frameworks were written trying to take advantage of it and get away from the thread-per-request model. In the end some authors admitted that, to provide a good programming model, they were just reimplementing threading all over again.
It's impossible to say whether 300 or even 3000 threads will cause any issues without knowing more about your application. I strongly recommend that you profile your application before adding more complexity.
The first thing to check is that the number of concurrently running threads is not much higher than the number of cores available to run them. The more active threads you have, the more time is wasted managing those threads (context switches are expensive) and the less work gets done.
The easiest way to limit the number of running threads is to use a semaphore: acquire it before starting work and release it after the work is done, as in the sketch below.
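A minimal sketch of that pattern (the permit count is just one sensible default):

```java
import java.util.concurrent.Semaphore;

class ThrottledRunner {
    // Allow at most as many concurrently running work units as there are cores.
    private final Semaphore permits =
            new Semaphore(Runtime.getRuntime().availableProcessors());

    void run(Runnable unit) throws InterruptedException {
        permits.acquire();          // wait for a free slot before starting work
        try {
            unit.run();
        } finally {
            permits.release();      // always free the slot, even if the work fails
        }
    }
}
```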
Unfortunately, limiting the number of running threads may not be enough. While it may help, the overhead may still be too great if the time spent per context switch is a major part of the total cost of one unit of work. In this scenario, often the most efficient way is to have a fixed number of queues: you get a queue from a global pool of queues when a component initializes, using an algorithm such as round-robin for queue selection.
If you are in one of those unfortunate cases where the most obvious solutions do not work, I would start with something relatively simple: one thread pool, one concurrent queue, a lock, a list of queues, and a temporary queue for each thread in the pool.
Posting work to the queue is simple: add the payload and the identity of the producer.
Processing is relatively straightforward as well. First, get the next item from the queue. Then acquire the lock. While holding the lock, check whether any other thread is running a task for the same producer. If not, register the current thread by adding a temporary queue to the list of queues; otherwise, add the task to that producer's existing temporary queue. Finally, release the lock. Now either run the task or poll for the next item and start over, depending on whether the current thread was the one registered to run tasks. After running a task, acquire the lock again and see if there is more work in the temporary queue. If not, remove the queue from the list; otherwise take the next task. Finally, release the lock and again choose whether to run the task or to start over.
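A simplified sketch of the same idea (a hypothetical class; it uses ConcurrentHashMap's atomic compute operations in place of the explicit lock and list of queues, but the behavior is the one described: tasks for one producer run one at a time, while any free pool thread can pick up any producer's work):

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executor;

// Per-producer serial execution on top of one shared pool.
class KeySerialExecutor {
    private final Executor pool;
    private final ConcurrentHashMap<String, Queue<Runnable>> perProducer =
            new ConcurrentHashMap<>();

    KeySerialExecutor(Executor pool) {
        this.pool = pool;
    }

    void submit(String producerId, Runnable task) {
        boolean[] schedule = {false};
        // Atomically enqueue, remembering whether this producer was idle.
        perProducer.compute(producerId, (id, q) -> {
            if (q == null) {
                q = new ArrayDeque<>();
                schedule[0] = true;
            }
            q.add(task);
            return q;
        });
        // Only the submission that created the queue schedules a drainer,
        // so at most one pool thread works for a given producer at a time.
        if (schedule[0]) {
            pool.execute(() -> drain(producerId));
        }
    }

    private void drain(String producerId) {
        while (true) {
            Runnable[] next = {null};
            // Atomically take the next task, retiring the queue when it is empty.
            perProducer.computeIfPresent(producerId, (id, q) -> {
                next[0] = q.poll();
                return next[0] == null ? null : q;
            });
            if (next[0] == null) {
                return;                    // queue retired; this drainer is done
            }
            next[0].run();
        }
    }
}
```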

What are the advantages of Blocking Queue in Java?

I am working on a project that uses a queue that keeps information about the messages that need to be sent to remote hosts. In that case one thread is responsible for putting information into the queue and another thread is responsible for getting information from the queue and sending it. The 2nd thread needs to check the queue for the information periodically.
But later I found that this is a reinvention of the wheel :) I could use a blocking queue for this purpose.
What are the other advantages of using a blocking queue for the above application? (e.g. performance, modifiability of the code, any special tricks, etc.)
The main advantage is that a BlockingQueue provides a correct, thread-safe implementation. Developers have implemented this feature themselves for years, but it is tricky to get right. Now the runtime has an implementation developed, reviewed, and maintained by concurrency experts.
The "blocking" nature of the queue has a couple of advantages. First, on adding elements, if the queue capacity is limited, memory consumption is limited as well. Also, if the queue consumers get too far behind producers, the producers are naturally throttled since they have to wait to add elements. When taking elements from the queue, the main advantage is simplicity; waiting forever is trivial, and correctly waiting for a specified time-out is only a little more complicated.
The key thing you eliminate with the blocking queue is 'polling'. This is where you say
In that case the 2nd thread needs to check the queue for the information periodically.
This can be very inefficient - using much unnecessary CPU time. It can also introduce unneeded latency.

Where should you use BlockingQueue Implementations instead of Simple Queue Implementations?

I think I shall reframe my question from
Where should you use BlockingQueue implementations instead of simple Queue implementations?
to
What are the advantages/disadvantages of BlockingQueue over Queue implementations, taking into consideration aspects like speed, concurrency, or other properties that vary, e.g. the time to access the last element.
I have used both kinds of queue. I know that a blocking queue is normally used in concurrent applications. I was writing a simple ByteBuffer pool where I needed a placeholder for ByteBuffer objects, and I needed the fastest thread-safe queue implementation. There are even List implementations like ArrayList which have constant access time for elements.
Can anyone discuss the pros and cons of BlockingQueue vs Queue vs List implementations?
Currently I am using an ArrayList to hold these ByteBuffer objects.
Which data structure should I use to hold these objects?
A limited-capacity BlockingQueue is also helpful if you want to throttle some sort of request. With an unbounded queue, a producer can get far ahead of the consumers. The tasks will eventually be performed (unless there are so many that they cause an OutOfMemoryError), but the producer may long since have given up, so the effort is wasted.
In situations like these, it may be better to signal a would-be producer that the queue is full, and to give up quickly with a failure. For example, the producer might be a web request, with a user that doesn't want to wait too long, and even though it won't consume many CPU cycles while waiting, it is using up limited resources like a socket and some memory. Giving up will give the tasks that have been queued already a better chance to finish in a timely manner.
Regarding the amended question, which I'm interpreting as, "What is a good collection for holding objects in a pool?"
An unbounded LinkedBlockingQueue is a good choice for many pools. However, depending on your pool management strategy, a ConcurrentLinkedQueue may work too.
In a pooling application, a blocking "put" is not appropriate. Controlling the maximum size of the queue is the job of the pool manager—it decides when to create or destroy resources for the pool. Clients of the pool borrow and return resources from the pool. Adding a new object, or returning a previously borrowed object to the pool should be fast, non-blocking operations. So, a bounded capacity queue is not a good choice for pools.
On the other hand, when retrieving an object from the pool, most applications want to wait until a resource is available. A "take" operation that blocks, at least temporarily, is much more efficient than a "busy wait"—repeatedly polling until a resource is available. The LinkedBlockingQueue is a good choice in this case. A borrower can block indefinitely with take, or limit the time it is willing to block with poll.
A less common case is when a client is not willing to block at all, but has the ability to create a resource for itself if the pool is empty. In that case, a ConcurrentLinkedQueue is a good choice. This is sort of a gray area where it would be nice to share a resource (e.g., memory) as much as possible, but speed is even more important. In the worst case, this degenerates to every thread having its own instance of the resource; then it would have been more efficient not to bother trying to share among threads.
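Putting those pieces together, a minimal pool sketch combining the timed take with fall-back creation (the buffer size and timeout are arbitrary):

```java
import java.nio.ByteBuffer;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// A ByteBuffer pool over an unbounded LinkedBlockingQueue.
class ByteBufferPool {
    private final BlockingQueue<ByteBuffer> free = new LinkedBlockingQueue<>();

    // Wait briefly for a buffer to be returned; otherwise create a new one.
    ByteBuffer borrow() throws InterruptedException {
        ByteBuffer buf = free.poll(100, TimeUnit.MILLISECONDS);
        return (buf != null) ? buf : ByteBuffer.allocate(8192);
    }

    // Returning is fast and non-blocking: offer() always succeeds on an
    // unbounded queue, and sizing policy belongs to the pool manager anyway.
    void giveBack(ByteBuffer buf) {
        buf.clear();
        free.offer(buf);
    }
}
```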
Both of these collections give good performance and ease of use in a concurrent application. For non-concurrent applications, an ArrayList is hard to beat. Even for collections that grow dynamically, the per-element overhead of a LinkedList allows an ArrayList with some empty slots to stay competitive memory-wise.
You would see BlockingQueue in multi-threaded situations. For example, you need to pass a BlockingQueue as a parameter to create a ThreadPoolExecutor using its constructor. Depending on the type of queue you pass in, the executor behaves differently.
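For instance, these two configurations differ only in the queue; this mirrors how Executors.newCachedThreadPool and Executors.newFixedThreadPool are built:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class QueueChoiceDemo {
    // SynchronousQueue has no capacity, so a task is handed directly to a
    // worker and the pool grows (up to maximumPoolSize) when none is idle.
    static final ExecutorService cachedLike = new ThreadPoolExecutor(
            0, Integer.MAX_VALUE, 60L, TimeUnit.SECONDS,
            new SynchronousQueue<>());

    // An unbounded LinkedBlockingQueue always accepts the task, so the pool
    // never grows beyond its core size of 4.
    static final ExecutorService fixedLike = new ThreadPoolExecutor(
            4, 4, 0L, TimeUnit.MILLISECONDS,
            new LinkedBlockingQueue<>());
}
```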
It is a Queue implementation that additionally supports operations that wait for the queue to become non-empty when retrieving an element, and wait for space to become available in the queue when storing an element.
If your use case requires the functionality described above, then use a BlockingQueue implementation.
