I am working on a tutorial for my Java concurrency course. The objective is to use thread pools to compute prime numbers in parallel.
The design is based on the Sieve of Eratosthenes. It has an array of n bools, where n is the largest integer you are checking, and each element in the array represents one integer. True is prime, false is non prime, and the array is initially all true.
A thread pool is used with a fixed number of threads (we are supposed to experiment with the number of threads in the pool and observe the performance).
A thread is given a integer multiple to process. The thread then finds the first true element in the array that is not a multiple of thread's integer. The thread then creates a new thread on the thread pool which is given the found number.
After a new thread is formed, the existing thread then continues to set all multiples of it's integer in the array to false.
The main program thread starts the first thread with the integer '2', and then waits for all spawned threads to finish. It then spits out the prime numbers and the time taken to compute.
The issue I have is that the more threads there are in the thread pool, the slower it takes with 1 thread being the fastest. It should be getting faster not slower!
All the stuff on the internet about Java thread pools create n worker threads the main thread then wait for all threads to finish. The method I use is recursive as a worker can spawn more worker threads.
I would like to know what is going wrong, and if Java thread pools can be used recursively.
Your solution may run slower as threads are added for some of following problems:
Thread creation overheads: creating a thread is expensive.
Processor contention: if there are more threads than there are processors to execute them, some of threads will be suspended waiting for a free processor. The result is that the average processing rate for each thread drops. Also, the OS then needs to time-slice the threads, and that takes away time that would otherwise be used for "real" work.
Virtual memory contention: each thread needs memory for its stack. If your machine doesn't have enough physical memory for the workload, each new thread stack increases virtual memory contention which results in paging which slows things down
Cache contention: each thread will (presumably) be scanning a different section of the array, resulting in memory cache misses. This slows down memory accesses.
Lock contention: if your threads are all reading and updating a shared array and using synchronized and one lock object to control access to the array, you could be suffering from lock contention. If a single lock object is used, each thread will spend most of its time waiting to acquire the lock. The net result is that the computation is effectively serialized, and the overall processing rate drops to the rate of a single processor / thread.
The first four problems are inherent to multi-threading, and there are no real solutions ... apart from not creating too many threads and reusing the ones that you have already created. However, there are a number of ways to attack the lock contention problem. For example,
Recode the application so that each thread scans for multiple integers, but in its own section of the array. This will eliminate lock contention on the arrays, though you will then need a way to tell each thread what to do, and that needs to be designed with contention in mind.
Create an array of locks for different regions of the array, and have the threads pick the lock to used based on the region of the array they are operating on. You would still get contention, but on average you should get less contention.
Design and implement a lockless solution. This would entail DEEP UNDERSTANDING of the Java memory model. And it would be very difficult to prove / demonstrate that a lockless solution does not contain subtle concurrency flaws.
Finally, recursive creation of threads is probably a mistake, since it will make it harder to implement thread reuse and the anti-lock-contention measures.
How many processors are available on your system? If #threads > #processors, adding more threads is going to slow things down for a compute-bound task like this.
Remember no matter how many threads you start, they're still all sharing the same CPU(s). The more time the CPU spends switching between threads, the less time it can be doing actual work.
Also note that the cost of starting a thread is significant compared to the cost of checking a prime - you can probably do hundreds or thousands of multiplications in the time it takes to fire up 1 thread.
The key point of a thread pool is to keep a set of thread alive and re-use them to process tasks. Usually the pattern is to have a queue of tasks and randomly pick one thread from the pool to process it. If there is no free thread and the pool is full, just wait.
The problem you designed is not a good one to be solved by a thread pool, because you need threads to run in order. Correct me if I'm wrong here.
thread #1: set 2's multiple to false
thread #2: find 3, set 3's multiple to false
thread #3: find 5, set 5's multiple to false
thread #4: find 7, set 7's multiple to false
....
These threads need to be run in order and they're interleaving (how the runtime schedules them) matters.
For example, if thread #3 starts running before thread #1 sets "4" to false, it will find "4" and continue to reset 4's multiples. This ends up doing a lot of extra work, although the final result will be correct.
Restructure your program to create a fixed ThreadPoolExecutor in advance. Make sure you call ThreadPoolExecutor#prestartAllCoreThreads(). Have your main method submit a task for the integer 2. Each task will submit another task. Since you are using a thread pool, you won't be creating and terminating a bunch of threads, but instead allowing the same threads to take on new tasks as they become available. This will reduce on overall execution overhead.
You should discover that in this case the optimum number of threads is equal to the number of processors (P) on the machine. It is often the case that the optimum number of threads is P+1. This is because P+1 minimizes overhead from context switching while also minimizing loss from idle/blocking time.
Related
I recently had a question looking at the source code of ThreadPoolExecutor: If thread pool represents the reuse of existing threads to reduce the overhead of thread creation or destruction, why not reuse core threads in the initial phase? That is, when the current number of threads is less than the number of core threads, first check whether there are core threads that have completed the task, if so, reuse. Why not? Instead of creating a new thread before the number of core threads is reached, does this violate thread pool design principles?
The following is a partial comment on the addWorker() method in ThreadPoolExecutor
#param firstTask the task the new thread should run first (or null if none). Workers are created with an initial first task (in method execute()) to bypass queuing when there are fewer than corePoolSize threads (in which case we always start one), or when the queue is full (in which case we must bypass queue). Initially idle threads are usually created via prestartCoreThread or to replace other dying workers.
This was actually requested already: JDK-6452337. A core libraries developer has noted:
I like this idea, but ThreadPoolExecutor is already complicated enough.
Keep in mind that corePoolSize is an essential part of ThreadPoolExecutor and is saying how many workers are always active/idle at least. Reaching this number just naturally takes a very short time. You set corePoolSize according to your needs and it's expected that the workload will meet this number.
My assumption is that optimizing this "warm-up phase" – taking it for granted that this will actually increase efficiency – is not worth it. I can't quantify for you what additional complexity this optimization will bring, I'm not developing Java Core libraries, but I assume that it's not worth it.
You can think of it like that: The "warm-up phase" is constant while the thread pool will run for an undefined amount of time. In an ideal world, the initial phase actually should take no time at all, the workload should be there as you create the thread pool. So you are thinking about an optimization that optimizes something that is not the expected thread pool state.
The thread workers will have to be created at some point anyways. This optimization only delays the creation. Imagine you have a corePoolSize of 10, so there is the overhead of creating 10 threads at least. This overhead won't change if you do it later. Yes, resources are also taken later but here I'm asking if the thread pool is configured correctly in the first place: Is corePoolSize correct, does it meet the current workload?
Notice that ThreadPoolExecutor has methods like setCorePoolSize(int) and allowCoreThreadTimeOut(boolean) and more that allow you to configure the thread pool according to your needs.
I am trying to implement a divide-and-conquer solution to some large data. I use fork and join to break down things into threads. However I have a question regarding the fork mechanism: if I set my divide and conquer condition as:
#Override
protected SomeClass compute(){
if (list.size()<LIMIT){
//Do something here
...
}else{
//Divide the list and invoke sub-threads
SomeRecursiveTaskClass subWorker1 = new SomeRecursiveTaskClass(list.subList());
SomeRecursiveTaskClass subWorker2 = new SomeRecursiveTaskClass(list.subList());
invokeAll(subWorker1, subWorker2);
...
}
}
What will happen if there is not enough resource to invoke subWorker (e.g. not enough thread in pool)? Does Fork/Join framework maintains a pool size for available threads? Or should I add this condition into my divide-and-conquer logic?
Each ForkJoinPool has a configured target parallelism. This isn’t exactly matching the number of threads, i.e. if a worker thread is going to wait via a ManagedBlocker, the pool may start even more threads to compensate. The parallelism of the commonPool defaults to “number of CPU cores minus one”, so when incorporating the initiating non-pool thread as helper, the resulting parallelism will utilize all CPU cores.
When you submit more jobs than threads, they will be enqueued. Enqueuing a few jobs can help utilizing the threads, as not all jobs may run exactly the same time, so threads running out of work may steal jobs from other threads, but splitting the work too much may create an unnecessary overhead.
Therefore, you may use ForkJoinTask.getSurplusQueuedTaskCount() to get the current number of pending jobs that are unlikely to be stolen by other threads and split only when it is below a small threshold. As its documentation states:
This value may be useful for heuristic decisions about whether to fork other tasks. In many usages of ForkJoinTasks, at steady state, each worker should aim to maintain a small constant surplus (for example, 3) of tasks, and to process computations locally if this threshold is exceeded.
So this is the condition to decide whether to split your jobs further. Since this number reflects when idle threads steal your created jobs, it will cause balancing when the jobs have different CPU load. Also, it works the other way round, if the pool is shared (like the common pool) and threads are already busy, they will not pick up your jobs, the surplus count will stay high and you will automatically stop splitting then.
When I have hundreds of items to iterate through, and I have to do a computation-heavy operation to each one, I would take a "divide and conquer" approach. Essentially, I would take the processor count + 1, and divide those items into the same number of batches. And then I would execute each batch on a runnable in a cached thread pool. It seems to work well. My GUI task went from 20 seconds to 2 seconds, which is a much better experience for the user.
However, I was reading Brian Goetz' fine book on concurrency, and I noticed that for iterating through a list of items, he would take a totally different approach. He would kick off a Runnable for each item! Before, I always speculated this would be bad, especially on a cached thread pool which could create tons of threads. However each runnable would probably finish very quickly in the larger scope, and I understand the cached thread pool is very optimal for short tasks.
So which is the more accepted paradigm to iterate through computation-heavy items? Dividing into a fixed number of batches and giving each batch a runnable? Or kicking each item off in its own runnable? If the latter approach is optimal, is it okay to use a cached thread pool or is it better to use a bounded thread pool?
With batches you will always have to wait for the longest running batch (you are as fast as the slowest batch). "Divide and conquer" implies management overhead: doing administration for the dividing and monitoring the conquering.
Creating a task for each item is relative straightforward (no management), but you are right in that it may start hundreds of threads (unlikely, but it could happen) which will only slow things down (context switching) if the task does no/very few I/O and is mostly CPU intensive.
If the cached thread pool does not start hundreds of threads (see getLargestPoolSize), then by all means, use the cached thread pool. If too many threads are started then one alternative is to use a bounded thread pool. But a bounded thread pool needs some tuning/decisions: do you use an unbounded task queue or a bounded task queue with a CallerRunsPolicy for example?
On a side note: there is also the ForkJoinPool which is suitable for tasks that start sub-tasks.
I have a multi-threaded application which creates hundreds of threads on the fly. When the JVM has less memory available than necessary to create the next Thread, it's unable to create more threads. Every thread lives for 1-3 minutes. Is there a way, if I create a thread and don't start it, the application can be made to automatically start it when it has resources, and otherwise wait until existing threads die?
You're responsible for checking your available memory before allocating more resources, if you're running close to your limit. One way to do this is to use the MemoryUsage class, or use one of:
Runtime.getRuntime().totalMemory()
Runtime.getRuntime().freeMemory()
...to see how much memory is available. To figure out how much is used, of course, you just subtract total from free. Then, in your app, simply set a MAX_MEMORY_USAGE value that, when your app has used that amount or more memory, it stops creating more threads until the amount of used memory has dropped back below this threshold. This way you're always running with the maximum number of threads, and not exceeding memory available.
Finally, instead of trying to create threads without starting them (because once you've created the Thread object, you're already taking up the memory), simply do one of the following:
Keep a queue of things that need to be done, and create a new thread for those things as memory becomes available
Use a "thread pool", let's say a max of 128 threads, as all your "workers". When a worker thread is done with a job, it simply checks the pending work queue to see if anything is waiting to be done, and if so, it removes that job from the queue and starts work.
I ran into a similar issue recently and I used the NotifyingBlockingThreadPoolExecutor solution described at this site:
http://today.java.net/pub/a/today/2008/10/23/creating-a-notifying-blocking-thread-pool-executor.html
The basic idea is that this NotifyingBlockingThreadPoolExecutor will execute tasks in parallel like the ThreadPoolExecutor, but if you try to add a task and there are no threads available, it will wait. It allowed me to keep the code with the simple "create all the tasks I need as soon as I need them" approach while avoiding huge overhead of waiting tasks instantiated all at once.
It's unclear from your question, but if you're using straight threads instead of Executors and Runnables, you should be learning about java.util.concurrent package and using that instead: http://docs.oracle.com/javase/tutorial/essential/concurrency/executors.html
Just write code to do exactly what you want. Your question describes a recipe for a solution, just implement that recipe. Also, you should give serious thought to re-architecting. You only need a thread for things you want to do concurrently and you can't usefully do hundreds of things concurrently.
This is an alternative, lower level solution Then the above mentioed NotifyingBlocking executor - it is probably not as ideal but will be simple to implement
If you want alot of threads on standby, then you ultimately need a mechanism for them to know when its okay to "come to life". This sounds like a case for semaphores.
Make sure that each thread allocates no unnecessary memory before it starts working. Then implement as follows :
1) create n threads on startup of the application, stored in a queue. You can Base this n on the result of Runtime.getMemory(...), rather than hard coding it.
2) also, creat a semaphore with n-k permits. Again, base this onthe amount of memory available.
3) now, have each of n-k threads periodically check if the semaphore has permits, calling Thread.sleep(...) in between checks, for example.
4) if a thread notices a permit, then update the semaphore, and acquire the permit.
If this satisfies your needs, you can go on to manage your threads using a more sophisticated polling or wait/lock mechanism later.
How do I spawn threads to the maximum number possible assuming that each thread may take different time to complete. The idea is to spawn the maximum number of threads possible while not causing any to die.
E.g. While (spawnable) spawn more threads;
I am trying to spawn threads to make calls to ejb, I wish to spawn the maximum number possible to simulate a load while not causing the threads to go into out of memory exception.
Executors.newFixedThreadPool() or for finer control, create your own ThreadPoolExecutor.
There is no fixed answer. You need to tune the number of threads to your host capabilities.
In response to the memory issue, it is not only a matter of how many threads are there but also of what they do. It is not the same if they perform simple calls or have to deal with huge arrays.
Relative for performance, and supposing that your host is dedicated, a value of one thread per core is a minimum value. Given that they are going to call a remote system most of these threads will spend a time idle; depending of the proportion of idle time you can spawn more or less.
In essence, chech your host performance and tune your thread number in consequence.
The Executor framework has been cited here, and it's a wonderful tool indeed (Already +1'ed that answer).
But I believe what the OP wants is a Executors.newCachedThreadPool().
From the docs:
Creates a thread pool that creates new threads as needed, but will
reuse previously constructed threads when they are available
More on executors here