How Akka benefits from ForkJoinPool?

How Akka benefits from ForkJoinPool? - java

Akka docs states that default dispatcher is a fork-join-executor because it "gives excellent performance in most cases".
I'm wondering why is it?
From ForkJoinPool
A ForkJoinPool differs from other kinds of ExecutorService mainly by virtue of employing work-stealing: all threads in the pool attempt to find and execute tasks submitted to the pool and/or created by other active tasks (eventually blocking waiting for work if none exist). This enables (1) efficient processing when most tasks spawn other subtasks (as do most ForkJoinTasks), as well as (2) when many small tasks are submitted to the pool from external clients. Especially when setting asyncMode to true in constructors, ForkJoinPools may also be (3) appropriate for use with event-style tasks that are never joined.
At first, I guess that Akka is not an example of case (1) because I can't figure it out how Akka could be forking tasks, I mean, what would be the task that could be forked in many tasks?
I see each message as an independent task, that is why I think Akka is similar to case (2), where the messages are many small tasks being submitted (via ! and ?) to the ForkJoinPool.
The next question, although not strictly related to akka, will be, why a use case where fork and join (main capabilities of ForkJoinPool that allows work-stealing) are not being used still can be benefited by ForkJoinPool?
From Scalability of Fork Join Pool
We noticed that the number of context switches was abnormal, above 70000 per second.
That must be the problem, but what is causing it? Viktor came up with the qualified guess that it must be the task queue of the thread pool executor, since that is shared and the locks in the LinkedBlockingQueue could potentially generate the context switches when there is contention.
However, if it is true that Akka doesn't use ForkJoinTasks, all tasks submitted by external clients will be queued in the shared queue, so the contention should be the same as in ThreadPoolExecutor.
So, my questions are:
Akka uses ForkJoinTasks (case (1)) or is related to case (2)?
Why ForkJoinPool is beneficial in case (2) if all that tasks submitted by external clients will be pushed to a shared queue and no work-stealing will happen?
What would be an example of "with event-style tasks that are never joined" (case 3)?
Update
Correct answer is the one from johanandren, however I want to add some highlights.
Akka doesn't use fork and join capabilities since AFAIK with the Actor model, or at least how we implement it, there isn't really a usecase for that (from johanandren's comment).
So my understanding that Akka is not an instance of case (1) was correct.
In my original answer I said that all tasks submitted by external clients will be queued in the shared queue.
This was correct but only for a previous version (jdk7) of the FJP.
In jdk8 the single submission queue was replaced by many "submission queues".
This answer explains this well:
Now, before (IIRC) JDK 7u12, ForkJoinPool had a single global submission queue. When worker threads ran out of local tasks, as well the tasks to steal, they got there and tried to see if external work is available. In this design, there is no advantage against a regular, say, ThreadPoolExecutor backed by ArrayBlockingQueue. [...]
Now, the external submission goes into one of the submission queues. Then, workers that have no work to munch on, can first look into the submission queue associated with a particular worker, and then wander around looking into the submission queues of others. One can call that "work stealing" too.
So, this enabled work stealing in scenarios where fork join weren't used. As Doug Lea says
Substantially better throughput when lots of clients submit lots of tasks. (I've measured up to 60X speedups on micro-benchmarks). The idea is to treat external submitters in a similar way as workers -- using randomized queuing and stealing. (This required a big internal refactoring to disassociate work queues and workers.) This also greatly improves throughput when all tasks are async and submitted to the pool rather than forked, which becomes a reasonable way to structure actor frameworks, as well as many plain services that you might otherwise use ThreadPoolExecutor for.
There is another singularity that is worth mention it about FJP taken from this comment
4% is indeed not much for FJP. There's still a trade-off you do with FJP
which you need to be aware of: FJP keeps threads spinning for a while to be
able to handle just-in-time arriving work faster. This ensures good latency
in many cases. Especially if your pool is overprovisioned, however, the
trade-off is a bit of latency against more power consumption in almost-idle
situations.

The FJP in Akka is run with asyncMode = true so for the first question that is - having external clients submitting short/small async workloads. Each submitted workload is either dispatching an actor to process one or a few messages from its inbox but it is also used to execute Scala Future operations.
When a non-ForkJoinTask is scheduled to run on the FJP, it is adapted to a FJP and enqueued just like ForkJoinTasks. There's isn't a single submission where tasks are queued (there was in an early version, JDK7 perhaps), there are many, to avoid contention, and an idle thread can pick (steal) tasks from other queues than its own if that is empty.
Note that by default we are currently running on a forked version of the Java 8 FJP, as we saw significant decrease in throughput with the Java 9 FJP when that came (it contains quite a bit of changes). Here's the issue #21910 discussing that if you are interested. Additionally, if you want to play around with benchmarking different pools you can find a few *Pool benchmarks here: https://github.com/akka/akka/tree/master/akka-bench-jmh/src/main/scala/akka/actor

http://letitcrash.com/post/17607272336/scalability-of-fork-join-pool
Scalability of Fork Join Pool
Akka 2.0 message passing throughput scales way better on multi-core hardware than in previous versions, thanks to the new fork join executor developed by Doug Lea. One micro benchmark illustrates a 1100% increase in throughput!
...
http://cs.oswego.edu/pipermail/concurrency-interest/2012-January/008987.html
...
Highlights:
Substantially better throughput when lots of clients
submit lots of tasks. (I've measured up to 60X speedups
on microbenchmarks). The idea is to treat external submitters
in a similar way as workers -- using randomized queuing and
stealing. (This required a big internal refactoring to
disassociate work queues and workers.) This also greatly
improves throughput when all tasks are async and submitted
to the pool rather than forked, which becomes a reasonable
way to structure actor frameworks, as well as many plain
services that you might otherwise use ThreadPoolExecutor for.
These improvements also lead to a less hostile stance about
submitting possibly-blocking tasks. An added parag in
the ForkJoinTask documentation provides some guidance
(basically: we like them if they are small (even if numerous)
and don't have dependencies).
...

Related

Recommended size of a thread pool where several related tasks are submitted once

I am trying to make a program that will execute a variable number of possibly (but not certainly) computationally heavy tasks in parallel. These tasks (of Runnable type) will all be submitted at the same time and the thread pool should shut down once all these tasks are complete (in other words, the pool will only need to accept the initial tasks and nothing more).
In most of the answers that I found on this site, the question was about a server-based task (I am running my program on a decent desktop) or a pool that accepts tasks over irregular time intervals. In the questions that were not specific about the use, the answer was usually "it depends."
I have basically zero experience with threads, so I really do not know what is the optimal "thread count to task intensity" ratio.
For context, the program that I am working on deals with collections of matrices (represented by 3D arrays) where each matrix can contain up to 1000x1000 elements. One of the tasks may be to perform a convolution operation, and each task is an operation on one of the matrices in the collection.
Is there a recommendation for this specific type of problem?

The same that you hear when that question gets asked for a server: don't make assumptions, make experiments.
Try to identify (worst case: guess) the typical hardware setup that your users are running your software on. Then make sure you can do nicely automated performance testing. And then see what happens.
But thing is: that won't help much. You see, when you run your own server, you are (hopefully) in control about the workload that these machines are busy with. For a desktop setup, where remote users run your code on their boxes ... you have zero insights what else is running there. You might find that 16 threads are fine for 50% of the users. But the rest is maybe doing a lot of other things on their machines, and 16 is already way too much for them.
And that is the real crux. No matter what number you find "good to go" for a specific hardware configuration, you have no control about other workloads.
From that point of view, I would be pretty conservative. For a CPU intensive workload "too many" threads isn't helpful anyway, so go with the number of CPUs, or better number of cores as starting point.
Beyond that, what might be really helpful here: add some sort of "data gathering" to your application. Meaning: have it call home regularly, to tell you things like: "this is the hardware I am running on, I am using X threads, and the other workload on the system is Y". That might help you to get to some heuristics to adapt to the most important user setups. But be diligent about what data to collect. Define the questions you want to be answered upfront, and then pull the data you need to answer these questions.

If you workload is computationally intensive (CPU bound) you might want to look into ForkJoinPool which implements worker stealing.
A ForkJoinPool differs from other kinds of ExecutorService mainly by virtue of employing work-stealing: all threads in the pool attempt to find and execute tasks submitted to the pool and/or created by other active tasks (eventually blocking waiting for work if none exist). This enables efficient processing when most tasks spawn other subtasks (as do most ForkJoinTasks), as well as when many small tasks are submitted to the pool from external clients.

Best practices with Akka in Scala and third-party Java libraries

I need to use memcached Java API in my Scala/Akka code. This API gives you both synchronous and asynchronous methods. The asynchronous ones return java.util.concurrent.Future. There was a question here about dealing with Java Futures in Scala here How do I wrap a java.util.concurrent.Future in an Akka Future?. However in my case I have two options:
Using synchronous API and wrapping blocking code in future and mark blocking:
Future {
blocking {
cache.get(key) //synchronous blocking call
}
}
Using asynchronous Java API and do polling every n ms on Java Future to check if the future completed (like described in one of the answers above in the linked question above).
Which one is better? I am leaning towards the first option because polling can dramatically impact response times. Shouldn't blocking { } block prevent from blocking the whole pool?

I always go with the first option. But i am doing it in a slightly different way. I don't use the blocking feature. (Actually i have not thought about it yet.) Instead i am providing a custom execution context to the Future that wraps the synchronous blocking call. So it looks basically like this:
val ecForBlockingMemcachedStuff = ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(100)) // whatever number you think is appropriate
// i create a separate ec for each blocking client/resource/api i use
Future {
cache.get(key) //synchronous blocking call
}(ecForBlockingMemcachedStuff) // or mark the execution context implicit. I like to mention it explicitly.
So all the blocking calls will use a dedicated execution context (= Threadpool). So it is separated from your main execution context responsible for non blocking stuff.
This approach is also explained in a online training video for Play/Akka provided by Typesafe. There is a video in lesson 4 about how to handle blocking calls. It is explained by Nilanjan Raychaudhuri (hope i spelled it correctly), who is a well known author for Scala books.
Update: I had a discussion with Nilanjan on twitter. He explained what the difference between the approach with blocking and a custom ExecutionContext is. The blocking feature just creates a special ExecutionContext. It provides a naive approach to the question how many threads you will need. It spawns a new thread every time, when all the other existing threads in the pool are busy. So it is actually an uncontrolled ExecutionContext. It could create lots of threads and lead to problems like an out of memory error. So the solution with the custom execution context is actually better, because it makes this problem obvious. Nilanjan also added that you need to consider circuit breaking for the case this pool gets overloaded with requests.
TLDR: Yeah, blocking calls suck. Use a custom/dedicated ExecutionContext for blocking calls. Also consider circuit breaking.

The Akka documentation provides a few suggestions on how to deal with blocking calls:
In some cases it is unavoidable to do blocking operations, i.e. to put
a thread to sleep for an indeterminate time, waiting for an external
event to occur. Examples are legacy RDBMS drivers or messaging APIs,
and the underlying reason is typically that (network) I/O occurs under
the covers. When facing this, you may be tempted to just wrap the
blocking call inside a Future and work with that instead, but this
strategy is too simple: you are quite likely to find bottlenecks or
run out of memory or threads when the application runs under increased
load.
The non-exhaustive list of adequate solutions to the “blocking
problem” includes the following suggestions:
Do the blocking call within an actor (or a set of actors managed by a router), making sure to configure a thread pool which is either
dedicated for this purpose or sufficiently sized.
Do the blocking call within a Future, ensuring an upper bound on the number of such calls at any point in time (submitting an unbounded
number of tasks of this nature will exhaust your memory or thread
limits).
Do the blocking call within a Future, providing a thread pool with an upper limit on the number of threads which is appropriate for the
hardware on which the application runs.
Dedicate a single thread to manage a set of blocking resources (e.g. a NIO selector driving multiple channels) and dispatch events as they
occur as actor messages.
The first possibility is especially well-suited for resources which
are single-threaded in nature, like database handles which
traditionally can only execute one outstanding query at a time and use
internal synchronization to ensure this. A common pattern is to create
a router for N actors, each of which wraps a single DB connection and
handles queries as sent to the router. The number N must then be tuned
for maximum throughput, which will vary depending on which DBMS is
deployed on what hardware.

Work/Task Stealing ThreadPoolExecutor

In my project I am building a Java execution framework that receives work requests from a client. The work (varying size) is broken down in to a set of tasks and then queued up for processing. There are separate queues to process each type of task and each queue is associated with a ThreadPool. The ThreadPools are configured in a way such that the overall performance of the engine is optimal.
This design helps us load balance the requests effectively and large requests don't end up hogging the system resources. However at times the solution becomes ineffective when some of the queues are empty and their respective thread pools sitting idle.
To make this better I was thinking of implementing a work/task stealing technique so that the heavily loaded queue can get help from the other ThreadPools. However this may require implementing my own Executor as Java doesn't allow multiple queues to be associated with a ThreadPool and doesn't support the work stealing concept.
Read about Fork/Join but that doesn't seem like a fit for my needs. Any suggestions or alternative way to build this solution could be very helpful.
Thanks
Andy

Executors.newWorkStealingPool
Java 8 has factory and utility methods for that in the Executors class: Executors.newWorkStealingPool
That is an implementation of a work-stealing thread pool, I believe, is exactly what you want.

Have you considered the ForkJoinPool? The fork-join framework was implemented in a nice modular fashion so you can just use the work-stealing thread pool.

you could implement a custom BlockingQueue implementation (i think you mainly need to implement the offer() and take() methods) which is backed by a "primary" queue and 0 or more secondary queues. take would always take from the primary backing queue if non-empty, otherwise it can pull from the secondary queues.
in fact, it may be better to have 1 pool where all workers have access to all the queues, but "prefer" a specific queue. you can come up with your optimal work ratio by assigning different priorities to different workers. in a fully loaded system, your workers should be working at the optimal ratio. in an underloaded system, your workers should be able to help out with other queues.

Multiple SingleThreadExecutors for a given application...a good idea?

This question is about the fallouts of using SingleThreadExecutor (JDK 1.6). Related questions have been asked and answered in this forum before, but I believe the situation I am facing, is a bit different.
Various components of the application (let's call the components C1, C2, C3 etc.) generate (outbound) messages, mostly in response to messages (inbound) that they receive from other components. These outbound messages are kept in queues which are usually ArrayBlockingQueue instances - fairly standard practice perhaps. However, the outbound messages must be processed in the order they are added. I guess use of a SingleThreadExector is the obvious answer here. We end up having a 1:1 situation - one SingleThreadExecutor for one queue (which is dedicated to messages emanating from one component).
Now, the number of components (C1,C2,C3...) is unknown at a given moment. They will come into existence depending on the need of the users (and will be eventually disposed of too). We are talking about 200-300 such components at the peak load. Following the 1:1 design principle stated above, we are going to arrange for 200 SingleThreadExecutors. This is the source of my query here.
I am uncomfortable with the thought of having to create so many SingleThreadExecutors. I would rather try and use a pool of SingleThreadExecutors, if that makes sense and is plausible (any ready-made, seen-before classes/patterns?). I have read many posts on recommended use of SingleThreadExecutor here, but what about a pool of the same?
What do learned women and men here think? I would like to be directed, corrected or simply, admonished :-).

If your requirement is that the messages be processed in the order that they're posted, then you want one and only one SingleThreadExecutor. If you have multiple executors, then messages will be processed out-of-order across the set of executors.
If messages need only be processed in the order that they're received for a single producer, then it makes sense to have one executor per producer. If you try pooling executors, then you're going to have to put a lot of work into ensuring affinity between producer and executor.
Since you indicate that your producers will have defined lifetimes, one thing that you have to ensure is that you properly shut down your executors when they're done.

Messaging and batch jobs is something that has been solved time and time again. I suggest not attempting to solve it again. Instead, look into Quartz, which maintains thread pools, persisting tasks in a database etc. Or, maybe even better look into JMS/ActiveMQ. But, at the very least look into Quartz, if you have not already. Oh, and Spring makes working with Quartz so much easier...

I don't see any problem there. Essentially you have independent queues and each has to be drained sequentially, one thread for each is a natural design. Anything else you can come up with are essentially the same. As an example, when Java NIO first came out, frameworks were written trying to take advantage of it and get away from the thread-per-request model. In the end some authors admitted that to provide a good programming model they are just reimplementing threading all over again.

It's impossible to say whether 300 or even 3000 threads will cause any issues without knowing more about your application. I strongly recommend that you should profile your application before adding more complexity
The first thing that you should check is that number of concurrently running threads should not be much higher than number of cores available to run those threads. The more active threads you have, the more time is wasted managing those threads (context switch is expensive) and the less work gets done.
The easiest way to limit number of running threads is to use semaphore. Acquire semaphore before starting work and release it after the work is done.
Unfortunately limiting number of running threads may not be enough. While it may help, overhead may still be to great, if time spent per context switch is major part of total cost of one unit of work. In this scenario, often the most efficient way is to have fixed number of queues. You get queue from global pool of queues when component initializes using algorithm such as round-robin for queue selection.
If you are in one of those unfortunate cases where most obvious solutions do not work, I would start with something relatively simple: one thread pool, one concurrent queue, lock, list of queues and temporary queue for each thread in pool.
Posting work to queue is simple: add payload and identity of producer.
Processing is relatively straightforward as well. First you get get next item from queue. Then you acquire the lock. While you have lock in place, you check if any of other threads is running task for same producer. If not, you register thread by adding a temporary queue to list of queues. Otherwise you add task to existing temporary queue. Finally you release the lock. Now you either run the task or poll for next and start over depending on whether current thread was registered to run tasks. After running the task, you get lock again and see, if there is more work to be done in temporary queue. If not, remove queue from list. Otherwise get next task. Finally you release the lock. Again, you choose whether to run the task or to start over.

Patterns/Principles for thread-safe queues and "master/worker" program in Java

I have a problem which I believe is the classic master/worker pattern, and I'm seeking advice on implementation. Here's what I currently am thinking about the problem:
There's a global "queue" of some sort, and it is a central place where "the work to be done" is kept. Presumably this queue will be managed by a kind of "master" object. Threads will be spawned to go find work to do, and when they find work to do, they'll tell the master thing (whatever that is) to "add this to the queue of work to be done".
The master, perhaps on an interval, will spawn other threads that actually perform the work to be done. Once a thread completes its work, I'd like it to notify the master that the work is finished. Then, the master can remove this work from the queue.
I've done a fair amount of thread programming in Java in the past, but it's all been prior to JDK 1.5 and consequently I am not familiar with the appropriate new APIs for handling this case. I understand that JDK7 will have fork-join, and that that might be a solution for me, but I am not able to use an early-access product in this project.
The problems, as I see them, are:
1) how to have the "threads doing the work" communicate back to the master telling them that their work is complete and that the master can now remove the work from the queue
2) how to efficiently have the master guarantee that work is only ever scheduled once. For example, let's say this queue has a million items, and it wants to tell a worker to "go do these 100 things". What's the most efficient way of guaranteeing that when it schedules work to the next worker, it gets "the next 100 things" and not "the 100 things I've already scheduled"?
3) choosing an appropriate data structure for the queue. My thinking here is that the "threads finding work to do" could potentially find the same work to do more than once, and they'd send a message to the master saying "here's work", and the master would realize that the work has already been scheduled and consequently should ignore the message. I want to ensure that I choose the right data structure such that this computation is as cheap as possible.
Traditionally, I would have done this in a database, in sort of a finite-state-machine manner, working "tasks" through from start to complete. However, in this problem, I don't want to use a database because of the high volume and volatility of the queue. In addition, I'd like to keep this as light-weight as possible. I don't want to use any app server if that can be avoided.
It is quite likely that this problem I'm describing is a common problem with a well-known name and accepted set of solutions, but I, with my lowly non-CS degree, do not know what this is called (i.e. please be gentle).
Thanks for any and all pointers.

As far as I understand your requirements, you need ExecutorService. ExecutorService have
submit(Callable task)
method which return value is Future. Future is a blocking way to communicate back from worker to master. You could easily expand this mechanism to work is asynchronous manner. And yes, ExecutorService also maintaining work queue like ThreadPoolExecutor. So you don't need to bother about scheduling, in most cases. java.util.concurrent package already have efficient implementations of thread safe queue (ConcurrentLinked queue - nonblocking, and LinkedBlockedQueue - blocking).

Check out java.util.concurrent in the Java library.
Depending on your application it might be as simple as cobbling together some blocking queue and a ThreadPoolExecutor.
Also, the book Java Concurrency in Practice by Brian Goetz might be helpful.

First, why do you want to hold the items after a worker started doing them? Normally, you would have a queue of work and a worker takes items out of this queue. This would also solve the "how can I prevent workers from getting the same item"-problem.
To your questions:
1) how to have the "threads doing the
work" communicate back to the master
telling them that their work is
complete and that the master can now
remove the work from the queue
The master could listen to the workers using the listener/observer pattern
2) how to efficiently have the master
guarantee that work is only ever
scheduled once. For example, let's say
this queue has a million items, and it
wants to tell a worker to "go do these
100 things". What's the most efficient
way of guaranteeing that when it
schedules work to the next worker, it
gets "the next 100 things" and not
"the 100 things I've already
scheduled"?
See above. I would let the workers pull the items out of the queue.
3) choosing an appropriate data
structure for the queue. My thinking
here is that the "threads finding work
to do" could potentially find the same
work to do more than once, and they'd
send a message to the master saying
"here's work", and the master would
realize that the work has already been
scheduled and consequently should
ignore the message. I want to ensure
that I choose the right data structure
such that this computation is as cheap
as possible.
There are Implementations of a blocking queue since Java 5

Don't forget Jini and Javaspaces. What you're describing sounds very like the classic producer/consumer pattern that space-based architectures excel at.
A producer will write the jobs into the space. 1 or more consumers will take out jobs (under a transaction) and work on that in parallel, and then write the results back. Since it's under a transaction, if a problem occurs the job is made available again for another consumer .
You can scale this trivially by adding more consumers. This works especially well when the consumers are separate VMs and you scale across the network.

If you are open to the idea of Spring, then check out their Spring Integration project. It gives you all the queue/thread-pool boilerplate out of the box and leaves you to focus on the business logic. Configuration is kept to a minimum using #annotations.
btw, the Goetz is very good.

This doesn't sound like a master-worker problem, but a specialized client above a threadpool. Given that you have a lot of scavenging threads and not a lot of processing units, it may be worthwhile simply doing a scavaging pass and then a computing pass. By storing the work items in a Set, the uniqueness constraint will remove duplicates. The second pass can submit all of the work to an ExecutorService to perform the process in parallel.
A master-worker model generally assumes that the data provider has all of the work and supplies it to the master to manage. The master controls the work execution and deals with distributed computation, time-outs, failures, retries, etc. A fork-join abstraction is a recursive rather than iterative data provider. A map-reduce abstraction is a multi-step master-worker that is useful in certain scenarios.
A good example of master-worker is for trivially parallel problems, such as finding prime numbers. Another is a data load where each entry is independant (validate, transform, stage). The need to process a known working set, handle failures, etc. is what makes a master-worker model different than a thread-pool. This is why a master must be in control and pushes the work units out, whereas a threadpool allows workers to pull work from a shared queue.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.