ForkJoinPool stalls during invokeAll/join

I am trying to use a ForkJoinPool to parallelize my CPU-intensive calculations.
My understanding of a ForkJoinPool is that it continues to work as long as any task is available to be executed. Unfortunately, I frequently observe worker threads idling/waiting, so not all CPUs are kept busy. Sometimes I even observe additional worker threads.
I did not expect this, as I strictly tried to use non-blocking tasks.
My observation is very similar to the one described in "ForkJoinPool seems to waste a thread".
After a lot of debugging in ForkJoinPool, I have a guess:
I used invokeAll() to distribute work over a list of subtasks. After invokeAll() has finished executing the first task itself, it starts joining the other ones. This works fine as long as the next task to join is on top of the executing queue. Unfortunately, I submitted additional tasks asynchronously without joining them. I expected the ForkJoin framework to continue executing those tasks first and then turn back to joining any remaining tasks.
But it does not seem to work this way. Instead, the worker thread stalls in wait() until the task it is waiting for becomes ready (presumably executed by another worker thread). I did not verify this, but it seems to be a general flaw of calling join().
ForkJoinPool provides an asyncMode, but this is a global parameter and cannot be used for individual submissions. But I would like to see my asynchronously forked tasks executed soon.
So why does ForkJoinTask.doJoin() not simply execute any available task on top of its queue until the awaited task is ready (either executed by itself or stolen by others)?

Since nobody else seems to understand my question, I will try to explain what I found after some nights of debugging:
The current implementation of ForkJoinTask works well if all fork/join calls are strictly paired. Illustrating a fork by an opening bracket and a join by a closing one, a perfect binary fork-join pattern may look like this:
{([][]) ([][])} {([][]) ([][])}
If you use invokeAll(), you may also submit lists of subtasks like this:
{([][][][]) ([][][][]) ([][][][])}
What I did however looks like this pattern:
{([) ([)} ... ]]
You may argue this looks ill-formed or is a misuse of the fork-join framework. But the only constraint is that the tasks' completion dependencies are acyclic; otherwise you may run into a deadlock. As long as my [] tasks do not depend on the () tasks, I don't see any problem with it. The offending ]]'s just express that I do not wait for them explicitly; they may finish some day, and it does not matter to me (at that point).
Indeed, the current implementation is able to execute my interleaved tasks, but only by spawning additional helper threads, which is quite inefficient.
The flaw seems to be in the current implementation of join(): joining a ) expects to see its corresponding ( on top of its execution queue, but it finds a [ and is perplexed. Instead of simply executing [] to get rid of it, the current thread suspends (calling wait()) until someone else comes around to execute the unexpected task. This causes a drastic performance breakdown.
My primary intent was to put additional work onto the queue to prevent the worker thread from suspending if the queue runs empty. Unfortunately, the opposite happens :-(
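For comparison, here is a minimal sketch of the strictly paired pattern described above, where every join() targets the most recently forked task. The class name PairedSum and the threshold value are hypothetical and chosen only for illustration; this is not the asker's code.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical example: sums a long[] using strictly paired fork/join calls.
class PairedSum extends RecursiveTask<Long> {
    private static final int THRESHOLD = 1_000; // assumed cut-off for sequential work
    private final long[] data;
    private final int from;
    private final int to;

    PairedSum(long[] data, int from, int to) {
        this.data = data;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) {
            long sum = 0;
            for (int i = from; i < to; i++) {
                sum += data[i];
            }
            return sum;
        }
        int mid = (from + to) >>> 1;
        PairedSum left = new PairedSum(data, from, mid);
        PairedSum right = new PairedSum(data, mid, to);
        left.fork();                     // "(" : push the left half onto this worker's queue
        long rightSum = right.compute(); // process the right half in the current thread
        return left.join() + rightSum;   // ")" : join the task that was forked last
    }

    static long sum(long[] data) {
        return ForkJoinPool.commonPool().invoke(new PairedSum(data, 0, data.length));
    }

    public static void main(String[] args) {
        long[] data = new long[10_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = i;
        }
        System.out.println(sum(data));
    }
}
```

Because nothing else is forked between left.fork() and left.join(), the joined task is either still on top of the worker's own queue or has been stolen, which is the case the question reports as working well.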

You are dead right about join(). I wrote this article two years ago that points out the problem with join().
As I said there, the framework cannot execute newly submitted requests until it finishes the earlier ones. And each WorkThread cannot steal until its current request finishes, which results in the wait().
The additional threads you see are "continuation threads." Since join() eventually issues a wait(), these threads are needed so the entire framework doesn't stall.

You’re not using this framework for the very narrow purpose for which it was intended.
The framework started life as the experiment in the 2000 research paper. It’s been modified since then, but the basic design, fork-and-join on large arrays, remains the same. The basic purpose is to teach undergraduates how to walk down the leaves of a balanced tree. When people use it for anything other than simple array processing, weird things happen. What it is doing in Java 7 is beyond me, which is the purpose of the article.
The problems only get worse in Java8. There it’s the engine to drive all stream parallel work. Have a read in part two of that article. The lambda interest lists are filled with reports of thread stalls, stack overflow, and out of memory errors.
You use it at your own risk when you don’t use it for pure recursive decomposition of large data structures. Even then, the excessive threads it creates can cause havoc. I’m not going to pursue this discussion any further.

Related

Couldn't Spring Webflux or Non blocking pattern be bad for scaling

I get that with threads being nonblocking, we don't need to have Thread sprawl depending on N concurrent requests, but rather we put our tasks in a single event loop in our reactive web programming pattern.
Yes, that can help, but since the event loop is a queue, what if the first task to be processed blocks forever? Then the event loop will never progress, which means no more responses and no processing other than queueing more tasks. Yes, timeouts are probably possible, but I can't wrap my head around how the event loop can be a good solution.
Say you have 3 tasks that each take 3 seconds to wait for IO and execute, and they are all submitted to the event queue. Then they will still take 9 seconds in total to be processed once their IO has resolved. With blocking threads, this would have finished in 3 seconds, since they run concurrently.
Where I can see a benefit is if the event loop is not really a queue and, upon a signal that a task is ready to be processed, it dispatches that task for processing. In that case, though, the order of task execution is not maintained, and each task still has to run on a thread in order to tell when its IO has resolved.
Maybe I am not understanding the event loop and thread handling correctly. Can someone correct me please, because this Reactor pattern seems to possibly make things worse.
Lastly, for X requests in Spring Reactor, is only 1 thread created to run handlers instead of the traditional X threads? In that case, if someone accidentally wrote blocking code, doesn't that mean every subsequent request gets queued?

It is not a good idea to use the event loop for long running tasks. This is considered an anti-pattern. Usually it is merely used for quickly picking up imminent events, but not actually doing the work associated with these events if the work would block the event loop noticeably. You would want to use a separate thread pool for executing long running tasks. So the event loop would usually only initiate work using asynchronous and hence non-blocking structures (or actually doing the work only if it can be done very quickly) and pass the heavier and possibly blocking tasks to a separate thread pool (for CPU intensive computations) or to the operating system (such as data buffers to be sent over the network).
Also, don't be fooled by the fact that only one thread is dealing with the events; it is very fast and is usually enough even for demanding applications. Platforms like NodeJS or frameworks like Netty (used in Akka, Play framework, Apache Cassandra, etc.) use an event loop at their heart with great success. One should just be aware that performing blocking operations inside the event loop is generally a bad idea.
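The split between a fast event loop and a separate pool for blocking work can be sketched with plain java.util.concurrent classes. This is only an illustration under stated assumptions: a single-thread executor stands in for the event loop, blockingLookup is a hypothetical placeholder for slow I/O, and none of this is how Netty or WebFlux are implemented internally.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: quick steps run on the "event loop", blocking steps on a worker pool.
class EventLoopOffload {

    // Stand-in for a slow, blocking call (assumption for illustration).
    static String blockingLookup(String key) {
        try {
            Thread.sleep(200);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "value-for-" + key;
    }

    static String handle(String key) {
        ExecutorService eventLoop = Executors.newSingleThreadExecutor(); // plays the event loop
        ExecutorService workers = Executors.newFixedThreadPool(4);       // pool for blocking work
        try {
            CompletableFuture<String> response = CompletableFuture
                    .supplyAsync(() -> key.trim(), eventLoop)                  // quick step on the loop
                    .thenApplyAsync(EventLoopOffload::blockingLookup, workers) // blocking step off the loop
                    .thenApplyAsync(String::toUpperCase, eventLoop);           // quick step back on the loop
            return response.join();
        } finally {
            eventLoop.shutdown();
            workers.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(handle(" a "));
    }
}
```

While blockingLookup sleeps on a worker thread, the single event-loop thread is free to pick up other events, which is the property the answer relies on.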
Please have a look at some of these posts for more information:
The reactor pattern and non blocking IO
Unix Network Programming
Kotlin Webflux
Slightly off topic but still a very prominent example: Don't Block the Event Loop (NodeJS)

ForkJoinTask completion handler

I have a long-running calculation that I have split up with Java's ForkJoinTask.
Java's FutureTask provides a template method done(). Overriding this method allows for "registering a completion handler".
Is it possible to register a completion handler for a ForkJoinTask?
I am asking because I don't want to have blocking threads in my application - but my application will have a blocking thread as soon as I retrieve the calculation result via calls to result = ForkJoinPool.invoke(myForkJoinTask) or result = ForkJoinPool.submit(myForkJoinTask).get().

I think you mean "lock free" programming (http://en.wikipedia.org/wiki/Non-blocking_algorithm)? While FutureTask.get() possibly blocks the current thread (and thus leaves a CPU idle), ForkJoinTask.get() (or join) tries to keep the CPU busy.
This works well if you are able to split your problem into many small pieces (ForkJoinTasks). If one ForkJoinTask is internally waiting for the result of another task that is not ready yet, it tries to pick up some other work (tasks) from its ForkJoinPool and executes that in the meantime.
As long as all your tasks are CPU-bound, it works fine: all your CPUs are kept busy.
It does NOT work if any of your tasks waits for some external event (e.g. sending a REST call to the Mars rover). Also, the task dependencies should form a DAG; otherwise you may get a deadlock. But as long as you join only tasks that you forked earlier in the same task, it works well. It is even better if you join the task you forked last.
So it is not too bad to call get() or join() within/between your tasks.
You mentioned a completion handler to solve the problem. If you are implementing the ForkJoinTask yourself, you may have a look at RecursiveTask or even RecursiveAction. You implement compute(), and you can easily forward the result of each task to some collector at the end of your compute() function instead of returning it.
But you have to consider that your collector will be called concurrently! For adding values or counting completions, have a look at java.util.concurrent.atomic. Avoid using synchronized blocks; otherwise all your tasks have to wait at this single bottleneck and only one CPU keeps working.
I think propagating the results involves more problems than returning them (since the ForkJoinPool handles this). In addition, it becomes difficult to decide (and to communicate to the outside) at which point your final result is complete.
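The collector idea described above can be sketched as follows. This is a minimal illustration, assuming a LongAdder from java.util.concurrent.atomic as the shared collector; the class name CollectingAction and the threshold are hypothetical.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical example: each leaf task forwards its partial result to a shared,
// thread-safe collector instead of returning it.
class CollectingAction extends RecursiveAction {
    private static final int THRESHOLD = 1_000; // assumed cut-off for sequential work
    private final int[] data;
    private final int from;
    private final int to;
    private final LongAdder collector;

    CollectingAction(int[] data, int from, int to, LongAdder collector) {
        this.data = data;
        this.from = from;
        this.to = to;
        this.collector = collector;
    }

    @Override
    protected void compute() {
        if (to - from <= THRESHOLD) {
            long partial = 0;
            for (int i = from; i < to; i++) {
                partial += data[i];
            }
            collector.add(partial); // called concurrently by many workers, no synchronized block
            return;
        }
        int mid = (from + to) >>> 1;
        invokeAll(new CollectingAction(data, from, mid, collector),
                  new CollectingAction(data, mid, to, collector));
    }

    static long sum(int[] data) {
        LongAdder collector = new LongAdder();
        ForkJoinPool.commonPool().invoke(new CollectingAction(data, 0, data.length, collector));
        return collector.sum(); // invoke() returns only after the whole task tree has completed
    }

    public static void main(String[] args) {
        System.out.println(sum(new int[] {1, 2, 3, 4}));
    }
}
```

Note that the caller still needs some way to know when the collector holds the final value; here that is the return of invoke(), which illustrates the last point of the answer.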

Which one is better for performance to check another threads boolean in java

while(!anotherThread.isDone());
or
while(!anotherThread.isDone())
    Thread.sleep(5);

If you really need to wait for a thread to complete, use
anotherThread.join()
(You may want to consider specifying a timeout in the join call.)
You definitely shouldn't tight-loop like your first snippet does... and sleeping for 5ms is barely better.
If you can't use join (e.g. you're waiting for a task to complete rather than a whole thread) you should look at the java.util.concurrent package - chances are there's something which will meet your needs.
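As one possible illustration of the java.util.concurrent route, a CountDownLatch lets a thread wait for a task (rather than a whole thread) to finish without polling. The helper name runAndAwait and the timeout handling are assumptions made for this sketch.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: waits for a piece of work to finish using a latch instead of a polling loop.
class LatchWait {

    // Returns true if the work finished within the timeout, false otherwise.
    static boolean runAndAwait(Runnable work, long timeoutMillis) {
        CountDownLatch done = new CountDownLatch(1);
        Thread worker = new Thread(() -> {
            try {
                work.run();
            } finally {
                done.countDown(); // signal completion even if the work throws
            }
        });
        worker.start();
        try {
            // The waiting thread is parked here; it burns no CPU while waiting.
            return done.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        boolean finished = runAndAwait(() -> System.out.println("working"), 5_000);
        System.out.println("finished in time: " + finished);
    }
}
```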

IMHO, avoid using such logic altogether. Instead, perhaps implement some sort of notification system using property change listeners.

As others have said, it's better to just use join in this case. However, I'd like to generalize your question and ask the following:
In general, when a thread is waiting for an event that depends on another thread, is it better to:
1) Use a blocking mechanism (e.g. join, condition variable, etc.), or
2) Busy spin without sleep, or
3) Busy spin with sleep?
Now let's look at the implications of each case:
1) Using a blocking call will effectively take your thread off the CPU and not schedule it again until the expected event occurs. This is good for resource utilization (the thread would waste CPU cycles otherwise), but not very efficient if the event may occur very frequently and at small intervals (i.e. a context switch is much more time-consuming than the time it takes for the event to occur). It is generally good when the event will occur eventually, but you don't know how soon.
2) Here you are busy spinning, meaning that you are actively using the CPU without performing useful work. This is the opposite of case 1: it is useful when the event is expected to occur very soon, but otherwise it may occupy the CPU unnecessarily.
3) This case is a sort of trade-off. You are busy spinning, but at the same time allowing other threads to run by giving up the CPU. This is generally employed when you don't want to saturate the CPU, but the event is expected to occur soon and you want to be sure that you will still be there in almost real time to catch it when it occurs.

I would recommend utilizing the wait/notify mechanism that is built into all Java objects (or using the new Lock code in Java 5).
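Thread 1 (waiting for Thread 2)
// hold the lock while checking the condition, so a notify() cannot be missed
synchronized (thread2.lockObject) {
    while (!thread2.isDone()) {
        thread2.lockObject.wait();
    }
}
Thread 2
// finish work and set isDone = true while holding the lock, then notify T1
synchronized (thread2.lockObject) {
    thread2.lockObject.notify();
}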
'lockObject' is just a plain (Object lockObject = new Object()) -- all Java objects support the wait/notify calls.
After that last call to notify(), Thread1 will wake up, hit the top of the while, see that T2 is now done, and continue execution.
You should account for interrupt exceptions and the like, but using wait/notify is hugely helpful for scenarios like this.
If you use your existing code, with or without sleep, you are burning a huge number of cycles doing nothing... and that's never good.
ADDENDUM
I see a lot of comments saying to use join - if the executing thread you are waiting on will complete, then yes, use join. If you have two parallel threads that run at all times (e.g. a producer thread and a consumer) and they don't "complete", they just run in lock-step with each other, then you can use the wait/notify paradigm I provided above.

The second one.
Better though is to use the join() method of a thread to block the current thread until it is complete :).
EDIT:
I just realised that this only addresses the question as it applies to the two examples you gave, not the question in general (how to wait for a boolean value to be changed by another Thread, not necessarily for the other Thread to actually finish).
To answer the question in general: rather than using the methods you described, I would recommend the guarded block pattern as described here. This way, the waiting thread doesn't have to keep checking the condition itself and can just wait to be notified of the change. Hope this helps!

Have you considered: anotherThread.join() ? That will cause the current one to be 'parked' without any overhead until the other one terminates.

The second is better than the first, but neither is very good. You should use anotherThread.join() (or anotherThread.join(timeout)).

Neither, use join() instead:
anotherThread.join();
// anotherThread has finished executing.

Parallel-processing in Java; advice needed i.e. on Runnable/Callable interfaces

Assume that I have a set of objects that need to be analyzed in two different ways, both of which take a relatively long time and involve IO calls. I am trying to figure out how/if I could optimize this part of my software, especially by utilizing multiple processors (the machine I am sitting at, for example, is an 8-core i7 which almost never goes above 10% load during execution).
I am quite new to parallel programming or multi-threading (not sure what the right term is), so I have read some of the prior questions, paying particular attention to highly voted and informative answers. I am also in the process of going through the Oracle/Sun tutorial on concurrency.
Here's what I have thought out so far:
A thread-safe collection holds the objects to be analyzed
As soon as there are objects in the collection (they come a couple at a time from a series of queries), a thread per object is started
Each specific thread takes care of the initial pre-analysis preparations; and then calls on the analyses.
The two analyses are implemented as Runnables/Callables, and thus called on by the thread when necessary.
And my questions are:
Is this a reasonable scheme, if not, how would you go about doing this?
In order to make sure things don't get out of hand, should I implement a ThreadManager or something of that sort, which starts and stops threads and redistributes them when they are complete? For example, if I have 256 objects to be analyzed and 16 threads in total, the ThreadManager assigns the first finished thread to the 17th object to be analyzed, etc.
Is there a dramatic difference between Runnable/Callable other than the fact that Callable can return a result? Otherwise should I try to implement my own interface, in that case why?
Thanks,

You could use a BlockingQueue implementation to hold your objects and spawn your threads from there. This interface is based on the producer-consumer principle. The put() method will block if your queue is full until there is some more space and the take() method will block if the queue is empty until there are some objects again in the queue.
An ExecutorService can help you manage your pool of threads.
If you are awaiting a result from your spawned threads, then the Callable interface is a good choice, since you can start the computation early and carry on in your code with the pending results represented as Futures. As for the differences from the Runnable interface, from the Callable javadoc:
The Callable interface is similar to Runnable, in that both are designed for classes whose instances are potentially executed by another thread. A Runnable, however, does not return a result and cannot throw a checked exception.
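A minimal sketch of the Callable/Future approach on a fixed-size pool follows. The analyze method is a hypothetical placeholder for one of the long-running analyses, and the class name AnalysisRunner is likewise made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical example: runs one Callable per object on a fixed pool and collects the results.
class AnalysisRunner {

    // Placeholder analysis (assumption): here it just returns the length of the input.
    static int analyze(String item) {
        return item.length();
    }

    static List<Integer> analyzeAll(List<String> items, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Integer>> tasks = new ArrayList<>();
            for (String item : items) {
                tasks.add(() -> analyze(item));
            }
            List<Integer> results = new ArrayList<>();
            // invokeAll returns the futures in the same order as the submitted tasks.
            for (Future<Integer> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(analyzeAll(Arrays.asList("a", "bb", "ccc"), 2));
    }
}
```

The pool hands the next queued task to whichever thread becomes free, so no hand-written ThreadManager is needed for the 256-objects-on-16-threads case from the question.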
Some general things you need to consider in your quest for java concurrency:
Visibility between threads is not guaranteed by default. volatile, AtomicReference and the other classes in the java.util.concurrent.atomic package are your friends.
You need to carefully ensure atomicity of compound actions using synchronization and locks.

Your idea is basically sound. However, rather than creating threads directly, or indirectly through some kind of ThreadManager of your own design, use an Executor from Java's concurrency package. It does everything you need, and other people have already taken the time to write and debug it. An executor manages a queue of tasks, so you don't need to worry about providing the threadsafe queue yourself either.
There's no difference between Callable and Runnable except that the former returns a value (and may throw a checked exception). Executors will handle both, and treat them the same.
It's not clear to me whether you're planning to make the preparation step a separate task to the analyses, or fold it into one of them, with that task spawning the other analysis task halfway through. I can't think of any reason to strongly prefer one to the other, but it's a choice you should think about.

The Executors class provides factory methods for creating thread pools. Specifically, Executors#newFixedThreadPool(int nThreads) creates a thread pool with a fixed size that uses an unbounded queue. Also, if a thread terminates due to a failure, a new thread will take its place. So in your specific example of 256 tasks and 16 threads you would call
// create pool
ExecutorService threadPool = Executors.newFixedThreadPool(16);
// submit task.
Runnable task = new Runnable() {
    public void run() { /* analysis work goes here */ }
};
threadPool.submit(task);
The important question is determining the proper number of threads for your thread pool. See if this helps: Efficient Number of Threads

Sounds reasonable, but it's not as trivial to implement as it may seem.
Maybe you should check the jsr166y project.
That's probably the easiest solution to your problem.

what would make a single task executor stop processing tasks?

I'm using a java.util.concurrent.ExecutorService that I obtained by calling Executors.newSingleThreadExecutor(). This ExecutorService can sometimes stop processing tasks, even though it has not been shut down and continues to accept new tasks without throwing exceptions. Eventually, it builds up enough of a queue that my app shuts down with OutOfMemoryError exceptions.
The documentation seems to indicate that this single thread executor should survive task processing errors by firing up a new worker thread if necessary to replace one that has died. Am I missing something?

It sounds like you have two different issues:
1) You're over-feeding the work queue. You can't just keep stuffing new tasks into the queue, with no regard for the consumption rate of the task executors. You need to figure out some logic for knowing when to block new additions to the work queue.
2) Any uncaught exception in a task's thread can completely kill the thread. When that happens, the ExecutorService spins up a new thread to replace it. But that doesn't mean you can ignore whatever problem is causing the thread to die in the first place! Find those uncaught exceptions and catch them!
This is just a hunch (cuz there's not enough info in your post to know otherwise), but I don't think your problem is that the task executor stops processing tasks. My guess is that it just doesn't process tasks as fast as you're creating them. (And the fact that your tasks sometimes die prematurely is probably orthogonal to the problem.)
At least, that's been my experience working with thread pools and task executors.
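One way to get the "block new additions" behaviour mentioned in point 1 is a single-threaded ThreadPoolExecutor with a bounded queue and a rejection policy that pushes work back onto the submitter. This is a sketch under assumptions: the class name, the queue capacity and the choice of CallerRunsPolicy are illustrative, not something the question's code uses.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical example: a single worker thread with a bounded queue, so the queue cannot grow without limit.
class BoundedSingleExecutor {

    static ThreadPoolExecutor create(int queueCapacity) {
        return new ThreadPoolExecutor(
                1, 1,                                            // exactly one worker thread
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<Runnable>(queueCapacity), // bounded work queue
                new ThreadPoolExecutor.CallerRunsPolicy());      // when full, the submitter runs the task itself
    }

    // Submits taskCount trivial tasks and returns how many of them ran.
    static int runAll(int taskCount, int queueCapacity) {
        ThreadPoolExecutor executor = create(queueCapacity);
        AtomicInteger completed = new AtomicInteger();
        for (int i = 0; i < taskCount; i++) {
            executor.execute(completed::incrementAndGet);
        }
        executor.shutdown(); // tasks that are already queued still run
        try {
            executor.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return completed.get();
    }

    public static void main(String[] args) {
        System.out.println(runAll(1_000, 10));
    }
}
```

With CallerRunsPolicy the submitting thread is slowed down whenever the queue is full, which throttles the producer. The trade-off is that tasks are no longer guaranteed to run strictly in submission order on one thread, so this only fits if that ordering is not required.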
Okay, here's another possibility that sounds feasible based on your comment (that everything will run smoothly for hours until suddenly coming to a crashing halt)...
You might have a rare deadlock between your task threads. Most of the time, you get lucky, and the deadlock doesn't manifest itself. But occasionally, two or more of your task threads get into a state where they're waiting for the release of a lock held by the other thread. At that point, no more task processing can take place, and your work queue will pile up until you get the OutOfMemoryError.
Here's how I'd diagnose that problem:
Eliminate ALL shared state between your task threads. At first, this might require each task thread making a defensive copy of all shared data structures it requires. Once you've done that, it should be completely impossible to experience a deadlock.
At this point, gradually reintroduce the shared data structures, one at a time (with appropriate synchronization). Re-run your application after each tiny modification to test for the deadlock. When you get that crashing situation again, take a close look at the access patterns for the shared resource and determine whether you really need to share it.
As for me, whenever I write code that processes parallel tasks with thread pools and executors, I always try to eliminate ALL shared state between those tasks. As far as the application is concerned, they may as well be completely autonomous applications. Hunting down deadlocks is a drag, and in my experience, the best way to eliminate deadlocks is for each thread to have its own local state rather than sharing any state with other task threads.
Good luck!

My guess would be that your tasks are blocking indefinitely, rather than dying. Do you have evidence, such as a log statement at the end of your task, suggesting that your tasks are successfully completing?
This could be a deadlock, or an interaction with some external process that is blocking.

Although you don't leave enough detail to be sure, the first thing I'd try is to have your tasks catch "Exception" at the top level and log the message.
I know it doesn't seem right, but occasionally (depending on a lot of variables) I've worked on code where something in a thread throws an exception that is never logged, or just doesn't show up on the console, and yet the "executing" code exits its top-level loop or whatever code is causing your task to run.
I guess I'm just saying, make sure your tasks are not throwing an exception out.
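A minimal sketch of the "catch at the top level and log" advice: wrap every task before handing it to the executor. The wrapper name logged and the reporter callback are hypothetical; in real code the reporter would typically call your logging framework.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

// Hypothetical example: reports anything a task throws instead of letting it vanish.
class LoggingTasks {

    static Runnable logged(Runnable task, Consumer<Throwable> reporter) {
        return () -> {
            try {
                task.run();
            } catch (RuntimeException | Error e) {
                reporter.accept(e); // e.g. write the stack trace to your log
                throw e;            // rethrow so the executor still sees the failure
            }
        };
    }

    // Submits a failing task and returns the message seen by the reporter.
    static String demo() {
        AtomicReference<String> seen = new AtomicReference<>();
        ExecutorService executor = Executors.newSingleThreadExecutor();
        // With submit(), the exception is captured in the returned Future and is
        // never printed unless someone calls get() on that Future.
        executor.submit(logged(
                () -> { throw new IllegalStateException("boom"); },
                t -> seen.set(t.getMessage())));
        executor.shutdown();
        try {
            executor.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return seen.get();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```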
