Design issue: is this doable only with producer/consumer?

I'm trying to speed up the indexing of my files with Lucene. For this, I created a worker, LuceneWorker, that does the job.
With the code below, the 'concurrent' execution becomes significantly slower. I think I know why: the list of futures grows so large that there is hardly any memory left to run yet another LuceneWorker task.
Q: Is there a way to limit the number of workers that go into the executor? In other words, if there are already 'n' futures, can I stop submitting and let the documents be indexed first?
My intuition is that I should build a producer/consumer with an ArrayBlockingQueue, but I'd like to know whether I'm right before I redesign it.
ExecutorService executor = Executors.newFixedThreadPool(cores);
List<Future<List<Document>>> futures = new ArrayList<Future<List<Document>>>(3);
for (File file : files)
{
    if (isFileIndexingOK(file))
    {
        System.out.println(file.getName());
        Future<List<Document>> future = executor.submit(new LuceneWorker(file, indexSearcher));
        futures.add(future);
    }
    else
    {
        System.out.println("NOT A VALID FILE FOR INDEXING: " + file.getName());
        continue;
    }
}
int index = 0;
for (Future<List<Document>> future : futures)
{
    try {
        List<Document> docs = future.get();
        for (Document doc : docs)
            writer.addDocument(doc);
    } catch (Exception exp)
    {
        //exp code comes here.
    }
}

If you want to limit the number of waiting jobs, use a ThreadPoolExecutor with a bounded queue like ArrayBlockingQueue. Also roll your own RejectedExecutionHandler so that the submitting thread waits for capacity in the queue. You cannot use the convenience methods in Executors for that as newFixedThreadPool uses an unbounded LinkedBlockingQueue.
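A minimal sketch of that idea, assuming a fixed pool of worker threads and a small queue capacity. The class name BoundedExecutorSketch, the helper names and the sizes are made up for illustration; only ThreadPoolExecutor, ArrayBlockingQueue and RejectedExecutionHandler come from the JDK. The handler makes the submitting thread wait for a free slot in the queue instead of throwing:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.RejectedExecutionHandler;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class BoundedExecutorSketch {

    /** Fixed-size pool whose execute/submit blocks the caller while the queue is full. */
    static ThreadPoolExecutor newBlockingPool(int threads, int queueCapacity) {
        RejectedExecutionHandler blockUntilSpace = (task, pool) -> {
            if (pool.isShutdown()) {
                throw new RejectedExecutionException("executor is shut down");
            }
            try {
                // Wait for a free slot instead of failing the submission.
                pool.getQueue().put(task);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new RejectedExecutionException("interrupted while waiting for queue space", e);
            }
        };
        return new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity), blockUntilSpace);
    }

    /** Pushes taskCount trivial tasks through a small bounded pool; returns how many ran. */
    static int runTasks(int taskCount) throws InterruptedException {
        ThreadPoolExecutor pool = newBlockingPool(2, 4);
        AtomicInteger done = new AtomicInteger();
        for (int i = 0; i < taskCount; i++) {
            pool.execute(done::incrementAndGet); // blocks here whenever 4 tasks are already waiting
        }
        pool.shutdown();                             // queued tasks still run after shutdown()
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return done.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runTasks(100));
    }
}
```

With this setup at most 2 tasks run and 4 wait at any time, so the number of pending results stays bounded. One caveat of this sketch: a task that is put on the queue while another thread shuts the pool down may never run, so shut down only after the submitting loop has finished.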

Depending on the standard input size and the complexity of the LuceneWorker class, I could imagine solving this problem at least partially using the Fork/Join framework. When using JDK 8's CountedCompleter implementation (included in jsr166y) I/O operations would not produce any problems.

Related

Parallelize a for loop in java

I have a for loop that is looping over a list of collections. Inside the loop some select/update queries are taking place on collection which are exclusive of the other collections. Since each collection has a lot of data to process on i would like to parallelize it.
The code snippet looks something like this:
//Some variables that are used within the for loop logic
for(String collection : collections) {
//Select queries on collection
//Update queries on collection
}
How can i achieve this in java?

You can use the parallelStream() method (since java 8):
collections.parallelStream().forEach((collection) -> {
//Select queries on collection
//Update queries on collection
});
More information about streams.
Another way to do it is using Executors :
try
{
    final ExecutorService exec = Executors.newFixedThreadPool(collections.size());
    for (final String collection : collections)
    {
        exec.submit(() -> {
            // Select queries on collection
            // Update queries on collection
        });
    }
    // Stop accepting new jobs; the jobs already submitted keep running.
    exec.shutdown();
    // Wait until the jobs are done (adjust the timeout to your workload).
    final boolean terminated = exec.awaitTermination(500, TimeUnit.MILLISECONDS);
    if (!terminated)
    {
        exec.shutdownNow();
    }
} catch (final InterruptedException e)
{
    e.printStackTrace();
}
This example is more powerful, since you can easily find out when the jobs are done, force termination, and more.

final int numberOfThreads = 32;
final ExecutorService executor = Executors.newFixedThreadPool(numberOfThreads);
// List to store the 'handles' (Futures) for all tasks:
final List<Future<MyResult>> futures = new ArrayList<>();
// Schedule one (parallel) task per String from "collections":
for(final String str : collections) {
futures.add(executor.submit(() -> { return doSomethingWith(str); }));
}
// Wait until all tasks have completed:
for ( Future<MyResult> f : futures ) {
MyResult aResult = f.get(); // Will block until the result of the task is available.
// Optionally do something with the result...
}
executor.shutdown(); // Release the threads held by the executor.
// At this point all tasks have ended and we can continue as if they were all executed sequentially
Adjust the numberOfThreads as needed to achieve the best throughput. More threads will tend to utilize the local CPU better, but may cause more overhead at the remote end. To get good local CPU utilization, you want to have (much) more threads than CPUs (/cores) so that, whenever one thread has to wait, e.g. for a response from the DB, another thread can be switched in to execute on the CPU.

There are a number of questions that you need to ask yourself to find the right answer:
If I have as many threads as the number of my CPU cores, would that be enough?
Using parallelStream() will give you roughly as many threads as you have CPU cores, because parallel streams run on the common ForkJoinPool by default.
Will parallelizing the loop give me a performance boost or is there a bottleneck on the DB?
You could spin up 100 threads, processing in parallel, but this doesn't mean that you will do things 100 times faster, if your DB or the network cannot handle the volume. DB locking can also be an issue here.
Do I need to process my data in a specific order?
If you have to process your data in a specific order, this may limit your choices. E.g. forEach() doesn't guarantee that the elements of your collection will be processed in a specific order, but forEachOrdered() does (with a performance cost).
Is my datasource capable of fetching data reactively?
There are cases when our datasource can provide data in the form of a stream. In that case, you can always process this stream using a technology such as RxJava or WebFlux. This would enable you to take a different approach on your problem.
Having said all the above, you can choose the approach (executors, RxJava, etc.) that best fits your purpose.
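To illustrate the ordering point above, here is a small sketch. The class name, method names and collection names are made up for illustration. It visits a list through a parallel stream once with forEachOrdered() and once with forEach():

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class OrderedForEachSketch {

    /** Visits the names via a parallel stream, but in encounter order. */
    static List<String> visitInOrder(List<String> names) {
        List<String> visited = Collections.synchronizedList(new ArrayList<>());
        names.parallelStream().forEachOrdered(visited::add);
        return visited;
    }

    /** Visits via forEach(): every element is seen, but the order is unspecified. */
    static List<String> visitUnordered(List<String> names) {
        List<String> visited = Collections.synchronizedList(new ArrayList<>());
        names.parallelStream().forEach(visited::add);
        return visited;
    }

    public static void main(String[] args) {
        List<String> collections = Arrays.asList("users", "orders", "invoices", "audit");
        System.out.println(visitInOrder(collections));   // same order as the input list
        System.out.println(visitUnordered(collections)); // same elements, order may differ
    }
}
```

If the select/update queries of one collection really are independent of the others, forEach() is usually the better fit, since forEachOrdered() gives up much of the parallelism.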

Any alternatives for linkedtransfer queue with size restrictions in java 7/8?

To implement the producer/consumer pattern, I have used LinkedTransferQueue.
Check the code below:
while (true) {
    String tmp = randomString();
    if (linkedTransferQueueString.size() < 10000) {
        linkedTransferQueueString.add(tmp);
    }
}
According to the docs, size() is an O(n) operation :(. So every time I add an element, it has to go through the whole collection.
Is there any other concurrent queue that has a size restriction?
I could not find one in the Java standard collections or in the Apache concurrent collections.

@OP: You have already accepted an answer, and it is correct, but you still raised a bounty, so I assume you are looking for the concept behind it. I will just shed light on that part.
Your issue is that you are not happy with an O(n) size operation, which means your solution needs one of the following:
the data structure should be able to tell you that the queue is full.
the size operation should return its result in constant time.
An O(n) size operation is uncommon, but because of the asynchronous nature of LinkedTransferQueue, the whole queue has to be traversed to count its elements. Most other queue implementations give you the size in constant time. But you really don't need this size check at all, so please keep reading.
If you have a hard dependency on the purpose of LinkedTransferQueue, i.e. you want to dequeue based on how long an element has been on the queue for some producer, then I don't think there is any alternative, other than something dirty like extending LinkedTransferQueue and tracking the number of elements yourself. That can quickly become a mess, and it may only give you an approximate result.
If you do not have a hard dependency on LinkedTransferQueue, you can use some flavor of BlockingQueue. Many of them let you create a "bounded" queue (which is what you need) in one way or another: for example, ArrayBlockingQueue is always bounded, and you can create a bounded LinkedBlockingQueue with new LinkedBlockingQueue(100). You can check the documentation for the other queues.
You can then use the queue's offer method, which returns false if the queue is full, and handle that case however you want. This way you need no explicit size check: offer returns a boolean indicating whether the element was successfully placed in the queue.
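The offer-based approach described above can be sketched as follows. The class and method names are hypothetical, and the capacity of 2 is only there to keep the demo short. The timed variant of offer waits a little for space before giving up:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

class BoundedOfferSketch {

    /** Tries to add every item without blocking; returns how many were accepted. */
    static int offerAll(BlockingQueue<String> queue, String... items) {
        int accepted = 0;
        for (String item : items) {
            if (queue.offer(item)) { // false means the queue is full; no size() call needed
                accepted++;
            }
        }
        return accepted;
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(2);
        System.out.println(offerAll(queue, "a", "b", "c")); // prints 2, the third item is refused

        // Timed variant: wait up to 50 ms for a free slot, then give up.
        boolean added = queue.offer("d", 50, TimeUnit.MILLISECONDS);
        System.out.println(added); // prints false, nobody is consuming
    }
}
```

If the producer should simply wait until there is room, put() does that without any timeout.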

BlockingQueue is
BlockingQueue implementations are thread-safe
[...]
A BlockingQueue may be capacity bounded.
and an ArrayBlockingQueue is
A bounded blocking queue backed by an array
Here's how you would write your example with it:
BlockingQueue<String> queue = new ArrayBlockingQueue<>(10000);
while (true) {
    String tmp = randomString();
    if (!queue.offer(tmp)) {
        // the limit was reached, item was not added
    }
}
Or for a simple producer/consumer example
public static void main(String[] args) {
// using a low limit so it doesn't take too long for the queue to fill
BlockingQueue<String> queue = new ArrayBlockingQueue<>(10);
Runnable producer = () -> {
if (!queue.offer(randomString())) {
System.out.println("queue was full!");
}
};
Runnable consumer = () -> {
try {
queue.take();
} catch (InterruptedException e) {
e.printStackTrace();
}
};
ScheduledExecutorService executor = Executors.newScheduledThreadPool(4);
// produce faster than consume so the queue becomes full eventually
executor.scheduleAtFixedRate(producer, 0, 100, TimeUnit.MILLISECONDS);
executor.scheduleAtFixedRate(consumer, 0, 200, TimeUnit.MILLISECONDS);
}

Have you tried ArrayBlockingQueue?
https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ArrayBlockingQueue.html
It has size restriction and concurrency.
Also, the size is O(1).
public int size() {
final ReentrantLock lock = this.lock;
lock.lock();
try {
return count;
} finally {
lock.unlock();
}
}

Have a look at BlockingQueue.
BlockingQueue is an interface in the java.util.concurrent package, and it has multiple implementations:
ArrayBlockingQueue
DelayQueue
LinkedBlockingQueue
PriorityBlockingQueue
SynchronousQueue

What is a better idiom for producer-consumers in Java?

I would like to read a file line by line, do something slow with each line that can easily be done in parallel, and write the result to a file line by line. I don't care about the order of the output. The input and output are so big they don't fit in memory. I would like to be able to set a hard limit on the number of threads running at the same time as well as the number of lines in memory.
The library I use for file IO (Apache Commons CSV) does not seem to offer synchronised file access so I don't think I can read from the same file or write to the same file from several threads at once. If that was possible I would create a ThreadPoolExecutor and feed it a task for each line, which would simply read the line, perform the calculation and write the result.
Instead, what I think I need is a single thread that does the parsing, a bounded queue for the parsed input lines, a thread pool with jobs that do the calculations, a bounded queue for the calculated output lines, and a single thread that does the writing. A producer, a lot of consumer-producers and a consumer if that makes sense.
What I have looks like this:
BlockingQueue<CSVRecord> inputQueue = new ArrayBlockingQueue<CSVRecord>(INPUT_QUEUE_SIZE);
BlockingQueue<String[]> outputQueue = new ArrayBlockingQueue<String[]>(OUTPUT_QUEUE_SIZE);
Thread parserThread = new Thread(() -> {
    while (inputFileIterator.hasNext()) {
        CSVRecord record = inputFileIterator.next();
        inputQueue.put(record); // blocks if queue is full
    }
});
// the job queue of the thread pool has to be bounded too, otherwise all
// the objects in the input queue will be given to jobs immediately and
// I'll run out of heap space
// source: https://stackoverflow.com/questions/2001086/how-to-make-threadpoolexecutors-submit-method-block-if-it-is-saturated
BlockingQueue<Runnable> jobQueue = new ArrayBlockingQueue<Runnable>(JOB_QUEUE_SIZE);
RejectedExecutionHandler rejectedExecutionHandler
    = new ThreadPoolExecutor.CallerRunsPolicy();
ExecutorService executorService
    = new ThreadPoolExecutor(
        NUMBER_OF_THREADS,
        NUMBER_OF_THREADS,
        0L,
        TimeUnit.MILLISECONDS,
        jobQueue,
        rejectedExecutionHandler
    );
Thread consumerBossThread = new Thread(() -> {
    while (!inputQueue.isEmpty() || parserThread.isAlive()) {
        CSVRecord record = inputQueue.take(); // blocks if queue is empty
        executorService.execute(() -> {
            String[] array = this.doStuff(record);
            outputQueue.put(array); // blocks if queue is full
        });
    }
    // getting here means that all CSV rows have been read and
    // added to the processing queue
    executorService.shutdown(); // do not accept any new tasks
    executorService.awaitTermination(Long.MAX_VALUE, TimeUnit.DAYS);
    // wait for existing tasks to finish
});
Thread writerThread = new Thread(() -> {
    while (!outputQueue.isEmpty() || consumerBossThread.isAlive()) {
        String[] outputRow = outputQueue.take(); // blocks if queue is empty
        outputFileWriter.printRecord((Object[]) outputRow);
    }
});
parserThread.start();
consumerBossThread.start();
writerThread.start();
// wait until writer thread has finished
writerThread.join();
I've left out the logging and exception handling so this looks a lot shorter than it is.
This solution works but I'm not happy with it. It seems hacky to have to create my own threads, check their isAlive(), create a Runnable within a Runnable, be forced to specify a timeout when I really just want to wait until all the workers have finished, etc. All in all it's a 100+ line method, or even several hundred lines of code if I make the Runnables their own classes, for what seems like a very basic pattern.
Is there a better solution? I'd like to make use of Java's libraries as much as possible, to help keep my code maintainable and in line with best practices. I would still like to know what it's doing under the hood, but I doubt that implementing all this myself is the best way to do it.
Update:
Better solution, after suggestions from the answers:
BlockingQueue<Runnable> jobQueue = new ArrayBlockingQueue<Runnable>(JOB_QUEUE_SIZE);
RejectedExecutionHandler rejectedExecutionHandler
= new ThreadPoolExecutor.CallerRunsPolicy();
ExecutorService executorService
= new ThreadPoolExecutor(
NUMBER_OF_THREADS,
NUMBER_OF_THREADS,
0L,
TimeUnit.MILLISECONDS,
jobQueue,
rejectedExecutionHandler
);
while (it.hasNext()) {
CSVRecord record = it.next();
executorService.execute(() -> {
String[] array = this.doStuff(record);
synchronized (writer) {
writer.printRecord((Object[]) array);
}
});
}

I would like to point out something first. I can think of three possible scenarios:
1.- For all the lines of a file, processing a line with the doStuff method takes longer than reading the same line from disk and parsing it.
2.- For all the lines of a file, processing a line with the doStuff method takes less time than, or as long as, reading the same line and parsing it.
3.- A mix of the first and second scenarios within the same file.
Your solution should be good for the first scenario, but not for the second or third. Also, you're not modifying the queues in a synchronized way. Moreover, if you're experiencing scenarios like number 2, you're wasting CPU cycles when there is no data to be sent to the output, or when there are no lines to be sent to the queue to be processed by doStuff, by spinning at:
while (!outputQueue.isEmpty() || consumerBossThread.isAlive()) {
Finally, regardless of which scenario you're experiencing, I would suggest using Monitor objects, which allow you to make specific threads wait until another thread notifies them that a certain condition is true and that they can become active again. With Monitor objects you will not waste CPU cycles.
For more information, see:
https://docs.oracle.com/javase/7/docs/api/javax/management/monitor/Monitor.html
EDIT: I've deleted the suggestion to use synchronized methods since, as you've pointed out, BlockingQueue's methods are thread-safe (or almost all of them are) and prevent race conditions.

Use ThreadPoolExecutor tied to a fixed size blocking queue and all of your complexity vanishes in a puff of JavaDoc.
Just have a single thread read the file and gorge the blocking queue, all the processing is done by the Executor.
Addenda:
You can either synchronize on your writer, or simply use yet another queue, and the processors fill that, and your single write thread consume the queue.
Synchronizing on the writer would most likely be the simplest way.
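A rough sketch of that setup, with in-memory stand-ins so it runs on its own. The process method is a hypothetical placeholder for the slow per-line work, and a plain Writer replaces the CSV printer; the class name and sizes are made up as well:

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Arrays;
import java.util.Iterator;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class LinePipelineSketch {

    // Hypothetical stand-in for the slow per-line work.
    static String process(String line) {
        return line.toUpperCase();
    }

    /** One reader (the calling thread) feeds a bounded pool; workers share one writer. */
    static void run(Iterator<String> lines, Writer out, int threads, int queueSize)
            throws InterruptedException {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueSize),
                // When the queue is full, the reader runs the task itself, which throttles reading.
                new ThreadPoolExecutor.CallerRunsPolicy());
        while (lines.hasNext()) {
            String line = lines.next();
            pool.execute(() -> {
                String result = process(line);
                synchronized (out) { // only one thread writes at a time
                    try {
                        out.write(result + "\n");
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                }
            });
        }
        pool.shutdown();                                              // no new tasks
        pool.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS); // wait for the rest
    }

    public static void main(String[] args) throws InterruptedException {
        StringWriter out = new StringWriter();
        run(Arrays.asList("a", "b", "c").iterator(), out, 2, 2);
        System.out.print(out); // the three processed lines, in no particular order
    }
}
```

The number of lines held in memory is bounded by the queue size plus the number of threads, and the number of worker threads is fixed, which covers both limits asked for in the question.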

Java Stream API: why the distinction between sequential and parallel execution mode?

From the Stream javadoc:
Stream pipelines may execute either sequentially or in parallel. This execution mode is a property of the stream. Streams are created with an initial choice of sequential or parallel execution.
My assumptions:
There is no functional difference between a sequential/parallel streams. Output is never affected by execution mode.
A parallel stream is always preferable, given appropriate number of cores and problem size to justify the overhead, due to the performance gains.
We want to write code once and run anywhere without having to care about the hardware (this is Java, after all).
Assuming these assumptions are valid (nothing wrong with a bit of meta-assumption), what's the value in having the execution mode exposed in the api?
It seems like you should just be able to declare a Stream, and the choice of sequential/parallel execution should be handled automagically in a layer below, either by library code or the JVM itself as a function of the cores available at runtime, the size of the problem, etc.
Sure, assuming parallel streams also work on a single core machine, perhaps just always using a parallel stream achieves this. But this is really ugly - why have explicit references to parallel streams in my code when it's the default option?
Even if there is a scenario where you'd deliberately want to hard code the use of a sequential stream - why is there not just a sub-interface SequentialStream for that purpose, rather than polluting Stream with an execution mode switch?

It seems like you should just be able to declare a Stream, and the choice of sequential/parallel execution should be handled automagically in a layer below, either by library code or the JVM itself as a function of the cores available at runtime, the size of the problem, etc.
The reality is that a) streams are a library, and have no special JVM magic, and b) you can't really design a library smart enough to automagically figure out what the right decision is in this particular case. There's no sensible way to estimate how costly a particular function will be without running it -- even if you could introspect its implementation, which you can't -- and now you're introducing a benchmark into every stream operation, trying to figure out if parallelizing it will be worth the cost of the parallelism overhead. That's just not practical, especially given that you don't know in advance how bad the parallelism overhead is, either.
A parallel stream is always preferable, given appropriate number of cores and problem size to justify the overhead, due to the performance gains.
Not always, in practice. Some tasks are just so small that they're not worth parallelizing, and parallelism does always have some overhead. (And frankly, most programmers tend to overestimate the usefulness of parallelism, slapping it everywhere when it's really hurting performance.)
In short, it's a hard enough problem that it has to be left to the programmer.
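For what it's worth, the switch itself is cheap to use. In this small sketch (the class and method names are made up) the same pipeline runs in either mode, and the caller, who knows the workload, makes the choice:

```java
import java.util.stream.IntStream;

class ExecutionModeSketch {

    /** Same pipeline either way; only the execution mode differs. */
    static long sumOfSquares(int upTo, boolean parallel) {
        IntStream numbers = IntStream.rangeClosed(1, upTo);
        if (parallel) {
            numbers = numbers.parallel(); // the mode is a property of the stream
        }
        return numbers.mapToLong(i -> (long) i * i).sum();
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(1_000, false)); // prints 333833500
        System.out.println(sumOfSquares(1_000, true));  // prints 333833500
    }
}
```

For a stateless, associative reduction like this one the result is identical in both modes. Whether the parallel run is faster is exactly what the library cannot know in advance.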

There's an interesting case in this question showing that a parallel stream can sometimes be orders of magnitude slower. In that particular example, the parallel version runs for ten minutes while the sequential one takes several seconds.

There is no functional difference between a sequential/parallel streams. Output is never affected by execution mode.
There is a difference between sequential and parallel stream execution: in the code below, the TEST_2 results show that the parallel execution is much faster than the sequential one.
A parallel stream is always preferable, given appropriate number of cores and problem size to justify the overhead, due to the performance gains.
Not really. If a task is too simple to be worth executing on parallel threads, we are simply adding overhead to our code; the TEST_1 results show this. Also note that if all the worker threads are busy with one parallel stream operation, other parallel stream operations elsewhere in your code will have to wait for them.
We want to write code once and run anywhere without having to care about the hardware (this is Java, after all).
Only the programmer knows whether a given task is worth executing in parallel or sequentially, irrespective of the CPUs. That is why the Java API exposes both options to the developer.
import java.util.ArrayList;
import java.util.List;
/*
* Performance test over internal(parallel/sequential) and external iterations.
* https://docs.oracle.com/javase/tutorial/collections/streams/parallelism.html
*
*
* Parallel computing involves dividing a problem into subproblems,
* solving those problems simultaneously (in parallel, with each subproblem running in a separate thread),
* and then combining the results of the solutions to the subproblems. Java SE provides the fork/join framework,
* which enables you to more easily implement parallel computing in your applications. However, with this framework,
* you must specify how the problems are subdivided (partitioned).
* With aggregate operations, the Java runtime performs this partitioning and combining of solutions for you.
*
* Limit the parallelism that the ForkJoinPool offers you. You can do it yourself by supplying the -Djava.util.concurrent.ForkJoinPool.common.parallelism=1,
* so that the pool size is limited to one and no gain from parallelization
*
* @see ForkJoinPool
* https://docs.oracle.com/javase/tutorial/essential/concurrency/forkjoin.html
*
* ForkJoinPool, that pool creates a fixed number of threads (default: number of cores) and
* will never create more threads (unless the application indicates a need for those by using managedBlock).
* * http://stackoverflow.com/questions/10797568/what-determines-the-number-of-threads-a-java-forkjoinpool-creates
*
*/
public class IterationThroughStream {
private static boolean found = false;
private static List<Integer> smallListOfNumbers = null;
public static void main(String[] args) throws InterruptedException {
// TEST_1
List<String> bigListOfStrings = new ArrayList<String>();
for(long i = 1L; i <= 1000000L; i++) {
bigListOfStrings.add("Counter no: "+ i);
}
System.out.println("Test Start");
System.out.println("-----------");
long startExternalIteration = System.currentTimeMillis();
externalIteration(bigListOfStrings);
long endExternalIteration = System.currentTimeMillis();
System.out.println("Time taken for externalIteration(bigListOfStrings) is :" + (endExternalIteration - startExternalIteration) + " , and the result found: "+ found);
long startInternalIteration = System.currentTimeMillis();
internalIteration(bigListOfStrings);
long endInternalIteration = System.currentTimeMillis();
System.out.println("Time taken for internalIteration(bigListOfStrings) is :" + (endInternalIteration - startInternalIteration) + " , and the result found: "+ found);
// TEST_2
smallListOfNumbers = new ArrayList<Integer>();
for(int i = 1; i <= 10; i++) {
smallListOfNumbers.add(i);
}
long startExternalIteration1 = System.currentTimeMillis();
externalIterationOnSleep(smallListOfNumbers);
long endExternalIteration1 = System.currentTimeMillis();
System.out.println("Time taken for externalIterationOnSleep(smallListOfNumbers) is :" + (endExternalIteration1 - startExternalIteration1));
long startInternalIteration1 = System.currentTimeMillis();
internalIterationOnSleep(smallListOfNumbers);
long endInternalIteration1 = System.currentTimeMillis();
System.out.println("Time taken for internalIterationOnSleep(smallListOfNumbers) is :" + (endInternalIteration1 - startInternalIteration1));
// TEST_3
Thread t1 = new Thread(IterationThroughStream :: internalIterationOnThread);
Thread t2 = new Thread(IterationThroughStream :: internalIterationOnThread);
Thread t3 = new Thread(IterationThroughStream :: internalIterationOnThread);
Thread t4 = new Thread(IterationThroughStream :: internalIterationOnThread);
t1.start();
t2.start();
t3.start();
t4.start();
Thread.sleep(30000);
}
private static boolean externalIteration(List<String> bigListOfStrings) {
found = false;
for(String s : bigListOfStrings) {
if(s.equals("Counter no: 1000000")) {
found = true;
}
}
return found;
}
private static boolean internalIteration(List<String> bigListOfStrings) {
found = false;
bigListOfStrings.parallelStream().forEach(
(String s) -> {
if(s.equals("Counter no: 1000000")){ //Have a breakpoint to look how many threads are spawned.
found = true;
}
}
);
return found;
}
private static boolean externalIterationOnSleep(List<Integer> smallListOfNumbers) {
found = false;
for(Integer s : smallListOfNumbers) {
try {
Thread.sleep(100);
} catch (Exception e) {
e.printStackTrace();
}
}
return found;
}
private static boolean internalIterationOnSleep(List<Integer> smallListOfNumbers) {
found = false;
smallListOfNumbers.parallelStream().forEach( //Removing parallelStream() will behave as single threaded (sequential access).
(Integer s) -> {
try {
Thread.sleep(100); //Have a breakpoint to look how many threads are spawned.
} catch (Exception e) {
e.printStackTrace();
}
}
);
return found;
}
public static void internalIterationOnThread() {
smallListOfNumbers.parallelStream().forEach(
(Integer s) -> {
try {
/*
* DANGEROUS
* This will tell you that if all the 7 FJP(Fork join pool) worker threads are blocked for one single thread (e.g. t1),
* then other normal three(t2 - t4) thread wont execute, will wait for FJP worker threads.
*/
Thread.sleep(100); //Have a breakpoint here.
} catch (Exception e) {
e.printStackTrace();
}
}
);
}
}

It seems like you should just be able to declare a Stream, and the choice of sequential/parallel execution should be handled automagically in a layer below, either by library code or the JVM itself as a function of the cores available at runtime, the size of the problem, etc.
To add to the already given answers:
That's a pretty bold assumption. Imagine simulating a board game for training some form of AI: it's pretty easy to parallelize the execution of different playthroughs. Just create a new instance and let it run on its own thread. As it doesn't share any state with another playthrough, you don't even have to consider multi-threading issues in your game logic. If, on the other hand, you parallelize the game logic itself, you get all sorts of multi-threading issues and most likely pay a steep price in complexity and even performance.
Having control over the behaviour of streams gives you (appropriately limited) flexibility which in and of itself is a key feature for good library design.

Weak performance of CyclicBarrier with many threads: Would a tree-like synchronization structure be an alternative?

Our application requires all worker threads to synchronize at a defined point. For this we use a CyclicBarrier, but it does not seem to scale well. With more than eight threads, the synchronization overhead seems to outweigh the benefits of multithreading. (However, I cannot support this with measurement data.)
EDIT: Synchronization happens very frequently, in the order of 100k to 1M times.
If synchronization of many threads is "hard", would it help building a synchronization tree? Thread 1 waits for 2 and 3, which in turn wait for 4+5 and 6+7, respectively, etc.; after finishing, threads 2 and 3 wait for thread 1, thread 4 and 5 wait for thread 2, etc..
1
| \
2   3
|\  |\
4 5 6 7
Would such a setup reduce synchronization overhead? I'd appreciate any advice.
See also this featured question: What is the fastest cyclic synchronization in Java (ExecutorService vs. CyclicBarrier vs. X)?

With more than eight threads, the synchronization overhead seems to outweigh the benefits of multithreading. (However, I cannot support this with measurement data.)
Honestly, there's your problem right there. Figure out a performance benchmark and prove that this is the problem, or risk spending hours / days solving the entirely wrong problem.

You are thinking about the problem in a subtly wrong way that tends to lead to very bad coding. You don't want to wait for threads, you want to wait for work to be completed.
Probably the most efficient way is a shared, waitable counter. When you make new work, increment the counter and signal the counter. When you complete work, decrement the counter. If there is no work to do, wait on the counter. If you drop the counter to zero, check if you can make new work.
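A minimal sketch of such a waitable counter, built on plain synchronized/wait/notifyAll. All names here are hypothetical, and this is only one possible reading of the description above (the JDK's Phaser offers related functionality):

```java
import java.util.concurrent.atomic.AtomicInteger;

class WorkCounterSketch {

    /** Counts outstanding pieces of work and lets threads wait until there are none. */
    static final class WorkCounter {
        private int pending;

        synchronized void workAdded() {
            pending++;
        }

        synchronized void workDone() {
            if (--pending == 0) {
                notifyAll(); // wake everyone waiting for the work to drain
            }
        }

        synchronized void awaitIdle() throws InterruptedException {
            while (pending > 0) { // loop guards against spurious wakeups
                wait();
            }
        }
    }

    /** Starts one thread per task and waits for the work, not for the threads. */
    static int runBatch(int tasks) throws InterruptedException {
        WorkCounter counter = new WorkCounter();
        AtomicInteger completed = new AtomicInteger();
        for (int i = 0; i < tasks; i++) {
            counter.workAdded(); // count the work before handing it off
            new Thread(() -> {
                try {
                    completed.incrementAndGet(); // stand-in for the real work
                } finally {
                    counter.workDone();
                }
            }).start();
        }
        counter.awaitIdle();
        return completed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runBatch(8)); // prints 8
    }
}
```

Because workAdded() is called before each piece of work is handed off, the counter cannot drop to zero while work is still about to start.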

If I understand correctly, you're trying to break your solution up into parts and solve them separately, but concurrently, right? Then have your current thread wait for those tasks? You want to use something like a fork/join pattern.
List<CustomThread> threads = new ArrayList<CustomThread>();
for (Something something : somethings) {
threads.add(new CustomThread(something));
}
for (CustomThread thread : threads) {
thread.start();
}
for (CustomThread thread : threads) {
thread.join(); // Blocks until thread is complete
}
List<Result> results = new ArrayList<Result>();
for (CustomThread thread : threads) {
results.add(thread.getResult());
}
// do something with results.
In Java 7, there's even further support via a fork/join pool. See ForkJoinPool and its trail, and use Google to find one of many other tutorials.
You can recurse on this concept to get the tree you want, just have the threads you create generate more threads in the exact same way.
Edit: I was under the impression that you wouldn't be creating that many threads, so this is better for your scenario. The example won't be horribly short, but it goes along the same vein as the discussion you're having in the other answer, that you can wait on jobs, not threads.
First, you need a Callable for your sub-jobs that takes an Input and returns a Result:
public class SubJob implements Callable<Result> {
    private final Input input;
    public SubJob(Input input) {
        this.input = input;
    }
    public Result call() {
        // Actually process input here and return a result
        return JobWorker.processInput(input);
    }
}
Then to use it, create an ExecutorService with a fix-sized thread pool. This will limit the number of jobs you're running concurrently so you don't accidentally thread-bomb your system. Here's your main job:
public class MainJob extends Thread {
    // Adjust the pool to the appropriate number of concurrent
    // threads you want running at the same time
    private static final ExecutorService pool = Executors.newFixedThreadPool(30);
    private final List<Input> inputs;
    public MainJob(List<Input> inputs) {
        super("MainJob");
        this.inputs = new ArrayList<Input>(inputs);
    }
    public void run() {
        CompletionService<Result> compService = new ExecutorCompletionService<Result>(pool);
        List<Result> results = new ArrayList<Result>();
        int submittedJobs = inputs.size();
        for (Input input : inputs) {
            // Starts the job when a thread is available
            compService.submit(new SubJob(input));
        }
        try {
            for (int i = 0; i < submittedJobs; i++) {
                // take() blocks until a job is completed; get() unwraps its result
                results.add(compService.take().get());
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // restore the flag and stop waiting
            return;
        } catch (ExecutionException e) {
            throw new RuntimeException(e.getCause()); // a sub-job failed
        }
        // Do something with results
    }
}
This will allow you to reuse threads instead of generating a bunch of new ones every time you want to run a job. The completion service will do the blocking while it waits for jobs to complete. Also note that the results list will be in order of completion.
You can also use Executors.newCachedThreadPool, which creates a pool with no upper limit (like using Integer.MAX_VALUE). It will reuse threads if one is available and create a new one if all the threads in the pool are running a job. This may be desirable later if you start encountering deadlocks (because there's so many jobs in the fixed thread pool waiting that sub jobs can't run and complete). This will at least limit the number of threads you're creating/destroying.
Lastly, you'll need to shutdown the ExecutorService manually, perhaps via a shutdown hook, or the threads that it contains will not allow the JVM to terminate.
Hope that helps/makes sense.

If you have a generation task (like the example of processing columns of a matrix) then you may be stuck with a CyclicBarrier. That is to say, if every single piece of work for generation 1 must be done in order to process any work for generation 2, then the best you can do is to wait for that condition to be met.
If there are thousands of tasks in each generation, then it may be better to submit all of those tasks to an ExecutorService (ExecutorService.invokeAll) and simply wait for the results to return before proceeding to the next step. The advantage of doing this is eliminating context switching and wasted time/memory from allocating hundreds of threads when the physical CPU is bounded.
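A sketch of that generational approach, assuming each task is an independent Callable. The doubling step is a made-up placeholder for the real per-task work, and the class and method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class GenerationSketch {

    /** Runs the given number of generations; each one doubles every value in parallel. */
    static List<Integer> run(List<Integer> start, int generations)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        try {
            List<Integer> current = start;
            for (int g = 0; g < generations; g++) {
                List<Callable<Integer>> tasks = new ArrayList<>();
                for (Integer value : current) {
                    tasks.add(() -> value * 2); // hypothetical unit of work
                }
                List<Integer> next = new ArrayList<>();
                // invokeAll returns only once every task of this generation has completed,
                // so it plays the role of the barrier.
                for (Future<Integer> f : pool.invokeAll(tasks)) {
                    next.add(f.get());
                }
                current = next;
            }
            return current;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run(Arrays.asList(1, 2, 3), 3)); // prints [8, 16, 24]
    }
}
```

The pool is sized to the number of cores here, so thousands of tasks per generation share a handful of threads instead of each getting its own.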
If your tasks are not generational but instead more of a tree-like structure in which only a subset need to be complete before the next step can occur on that subset, then you might want to consider a ForkJoinPool and you don't need Java 7 to do that. You can get a reference implementation for Java 6. This would be found under whatever JSR introduced the ForkJoinPool library code.
I also have another answer which provides a rough implementation in Java 6:
public class Fib implements Callable<Integer> {
int n;
Executor exec;
Fib(final int n, final Executor exec) {
this.n = n;
this.exec = exec;
}
/**
* {@inheritDoc}
*/
@Override
public Integer call() throws Exception {
if (n == 0 || n == 1) {
return n;
}
//Divide the problem
final Fib n1 = new Fib(n - 1, exec);
final Fib n2 = new Fib(n - 2, exec);
//FutureTask only allows run to complete once
final FutureTask<Integer> n2Task = new FutureTask<Integer>(n2);
//Ask the Executor for help
exec.execute(n2Task);
//Do half the work ourselves
final int partialResult = n1.call();
//Do the other half of the work if the Executor hasn't
n2Task.run();
//Return the combined result
return partialResult + n2Task.get();
}
}
Keep in mind that if you have divided the tasks up too much and the unit of work being done by each thread is too small, there will be negative performance impacts. For example, the above code is a terribly slow way to solve Fibonacci.
