Simple asynchronous I/O: many threads, one file

I have a scientific application which I usually run in parallel with xargs, but this scheme incurs repeated JVM start costs and neglects cached file I/O and the JIT compiler. I've already adapted the code to use a thread pool, but I'm stuck on how to save my output.
The program (i.e. one thread of the new program) reads two files, does some processing and then prints the result to standard output. Currently, I've dealt with output by having each thread add its result string to a BlockingQueue. Another thread takes from the queue and writes to a file, as long as a Boolean flag is true. Then I awaitTermination and set the flag to false, triggering the file to close and the program to exit.
My solution seems a little kludgey; what is the simplest and best way to accomplish this?
How should I write primary result data from many threads to a single file?
The answer doesn't need to be Java-specific if it is, for example, a broadly applicable method.
Update
I'm using "STOP" as the poison pill.
while (true) {
    String line = queue.take();
    if (line.equals("STOP")) {
        break;
    } else {
        output.write(line);
    }
}
output.close();
I manually start the queue-consuming thread, then add the jobs to the thread pool, wait for the jobs to finish and finally poison the queue and join the consumer thread.

That's really the way you want to do it: have the threads put their output on the queue and then have the writer drain it.
The only thing you might want to do to make things a little cleaner is, rather than checking a flag, simply put an "all done" token onto the queue that the writer can use to know that it's finished. That way there's no out-of-band signaling necessary.
That's trivial to do: you can use a well-known string, an enum, or simply a shared object.
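A minimal sketch of the queue-plus-token approach described above, assuming the results are strings. The class name PoisonPillWriter, the DONE sentinel and the process method are hypothetical stand-ins, and a StringWriter takes the place of the real output file.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

final class PoisonPillWriter {
    // Shared sentinel object, compared by identity (!=), so it can never
    // collide with a real result that happens to contain the same text.
    private static final String DONE = new String("DONE");

    // Hypothetical stand-in for the real per-job computation.
    static String process(int job) {
        return "result " + job + "\n";
    }

    static void runJobs(int jobs, int threads, Writer out) throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();

        // Single consumer: the only thread that ever touches the writer.
        Thread writer = new Thread(() -> {
            try {
                for (String line = queue.take(); line != DONE; line = queue.take()) {
                    out.write(line);
                }
                out.flush();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        writer.start();

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < jobs; i++) {
            final int job = i;
            pool.execute(() -> queue.add(process(job)));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);

        // All workers are done, so the sentinel is the last element on the queue.
        queue.add(DONE);
        writer.join();
    }

    public static void main(String[] args) throws InterruptedException {
        StringWriter out = new StringWriter();
        runJobs(5, 3, out);
        System.out.print(out);
    }
}
```

The lines come out in completion order, which is fine when the output has no required order.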

You could use an ExecutorService.
Submit a Callable that would perform the task and return the string after completion.
When submitting the Callable you get a Future back; store these references, e.g. in a List.
Then simply iterate through the Futures and get the Strings by calling Future#get.
This blocks until the task has completed; if it already has, the value is returned immediately.
Example:
ExecutorService exec = Executors.newFixedThreadPool(10);
List<Future<String>> tasks = new ArrayList<Future<String>>();
tasks.add(exec.submit(new Callable<String>() {
    public String call() {
        // do stuff
        return <yourString>;
    }
}));
// and so on for the other tasks
for (Future<String> task : tasks) {
    // get() throws InterruptedException and ExecutionException
    String result = task.get();
    // write to output
}
exec.shutdown();

Many threads processing, one thread writing, and a message queue between them is a good strategy. The only issue left to solve is knowing when all the work is finished. One way to do that is to count how many worker threads you started, and then count how many responses you have received. Something like this pseudocode:
int workers = 0
for each work item {
    workers++
    start the item's worker in a separate thread
}
while workers > 0 {
    take worker's response from a queue
    write response to file
    workers--
}
This approach also works if the workers can find more work items while they are executing. Just include any additional not-yet-processed work in the worker responses, and then increment the workers count and start worker threads as usual.
If each of the workers returns just one message, you can use Java's ExecutorService to execute Callable instances which return the result. ExecutorService's methods give access to Future instances from which you can get the result when the Callable has finished its work.
So you would first submit all the tasks to the ExecutorService and then loop over all the Futures and get their responses. That way you would write the responses in the order in which you check the futures, which can be different from the order in which they finish their work. If latency is not important, that shouldn't be a problem. Otherwise, a message queue (as mentioned above) might be more suitable.
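The counting pseudocode above could be sketched in Java roughly as follows. The class name CountingCollector and the "done: " processing step are hypothetical; the returned list stands in for the output file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

final class CountingCollector {
    static List<String> collect(List<String> workItems, int threads) throws InterruptedException {
        BlockingQueue<String> responses = new LinkedBlockingQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        int workers = 0;
        try {
            for (String item : workItems) {
                workers++;
                // Hypothetical work: each worker posts exactly one response.
                pool.execute(() -> responses.add("done: " + item));
            }
            List<String> written = new ArrayList<>();
            // Caveat: a worker that dies without posting a response would
            // leave this loop waiting forever.
            while (workers > 0) {
                written.add(responses.take()); // "write response to file"
                workers--;
            }
            return written;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(collect(Arrays.asList("a", "b", "c"), 2));
    }
}
```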

It's not clear if your output file has some defined order or if you just dump your data there. I assume it has no order.
I don't see why you need an extra thread for writing the output. Just synchronize the method that writes to the file and call it at the end of each thread.
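A small sketch of that idea, assuming unordered output. SharedOutput, writeLine and runThreads are hypothetical names, and a StringWriter stands in for the file.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

final class SharedOutput {
    private final Writer out;

    SharedOutput(Writer out) {
        this.out = out;
    }

    // Only one thread at a time can run this method on a given SharedOutput,
    // so the two write calls for one line are never split by another thread.
    synchronized void writeLine(String line) throws IOException {
        out.write(line);
        out.write('\n');
    }

    // Hypothetical driver: each thread "computes" a result and writes it itself.
    static String runThreads(int threadCount) throws InterruptedException {
        StringWriter sink = new StringWriter();
        SharedOutput output = new SharedOutput(sink);
        Thread[] threads = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++) {
            final int id = i;
            threads[i] = new Thread(() -> {
                try {
                    output.writeLine("result from thread " + id);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        return sink.toString();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.print(runThreads(4));
    }
}
```

The trade-off compared with a dedicated writer thread is that workers block each other while writing, which matters only if writing is slow relative to the computation.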

If you have many threads writing to the same file, the simplest thing to do is to write to that file in the task.
final PrintWriter out =
ExecutorService es =
for (int i = 0; i < tasks; i++)
    es.submit(new Runnable() {
        public void run() {
            performCalculations();
            // so only one thread can write to the file at a time.
            synchronized (out) {
                writeResults(out);
            }
        }
    });
es.shutdown();
es.awaitTermination(1, TimeUnit.HOURS);
out.close();

Related

How do you set up 2 thread pools that work next to each other?

I have a question about threadpooling.
This is the situation I have:
From a backend service we receive a list of pdf files.
One by one, the pages of these pdfs first need to be converted to .bmp files.
After this conversion these bmps need to be printed.
The problem is that both the converting and the printing tasks take a while to complete, and I want to make this process go quicker, because otherwise it would take quite a while before somebody sees something coming out of the printer.
A solution I thought of was to create 2 thread pools: one for the converting and one for the printing.
These would be my Threadpools:
ExecutorService convertPool = Executors.newFixedThreadPool(10);
ExecutorService printPool = Executors.newSingleThreadExecutor();
A convertPool with 10 threads to convert the pages of a pdf to bmps.
When this is done, the created bmps will be sent to the printPool. This is a single thread because only one print job can run at a time.
But now comes my question:
Say the convertPool has done its work on the first pdf and sent all the Future tasks to the printPool to be printed.
While the printPool is busy, I want the convertPool to start on the 2nd pdf already, so that when the printPool has finished printing the bmps from the first pdf, it can immediately start printing the bmps of the 2nd pdf, because these have already been created.
But how can I set this mechanism up? Can somebody help me with this?
Thanks!

You can use the same executor to run a Runnable for the printing job. In that Runnable you can reference a List of Futures produced by the Callables that actually process your PDFs. The list would of course need to be globally accessible.
You can then use a Semaphore with a limited number of permits, say 10. Whenever a Callable finishes, it releases a permit, which the Runnable tries to acquire beforehand, so the Runnable stays blocked while none is available.
NOTE: Every PDF-processing Callable acquires a permit at the start of its execution.
Your Runnable can try to acquire a permit before accessing any Future whose isDone() returns true, and release the permit at the end of each iteration.
Once all the tasks have finished, or when all 10 permits are available again, you can terminate the executor by calling shutdownNow(), which also stops the Runnable.
Hope it helps!

You are on the right track. You can submit as many conversion tasks as you like to your converter pool. It will execute them as soon as it can and queue those that can't be serviced by a thread yet. When each conversion task completes, it can submit itself to the print pool, which will also queue tasks as needed. This is a basic skeleton of what I mean:
class Conversion implements Runnable {
    Consumer<Conversion> onCompletion;
    Conversion(Consumer<Conversion> onCompletion) {
        this.onCompletion = onCompletion;
    }
    @Override
    public void run() {
        // ... conversion code. You could Thread.sleep()
        // here to simulate the conversion work taking up
        // some time.
        // (now we're done converting)
        onCompletion.accept(this);
    }
}
class Print implements Runnable {
    Print(Conversion c) {
        // ...
    }
    // ...run() method, etc.
}
// Example of submitting a conversion task to the executor
convertPool.submit(new Conversion(c -> printPool.submit(new Print(c))));
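For a version that can be run end to end, here is a rough sketch of the same hand-off between the two pools. The class name ConvertThenPrint is hypothetical, the "conversion" just appends ".bmp" to a page name, and "printing" appends to a list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class ConvertThenPrint {
    static List<String> run(List<String> pages) throws InterruptedException {
        ExecutorService convertPool = Executors.newFixedThreadPool(10);
        ExecutorService printPool = Executors.newSingleThreadExecutor();
        // Only the single print thread appends to this list.
        List<String> printed = new ArrayList<>();
        CountDownLatch allPrinted = new CountDownLatch(pages.size());

        for (String page : pages) {
            convertPool.execute(() -> {
                String bmp = page + ".bmp"; // stand-in for the real conversion
                // Hand the finished bitmap to the printer; it queues if the printer is busy.
                printPool.execute(() -> {
                    printed.add(bmp); // stand-in for the real printing
                    allPrinted.countDown();
                });
            });
        }

        allPrinted.await(); // every page has been converted and printed
        convertPool.shutdown();
        printPool.shutdown();
        return printed;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(Arrays.asList("page1", "page2", "page3")));
    }
}
```

Note that pages are printed in the order their conversions finish, which is not necessarily the order in which they were submitted.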

Writing from Future of CachedThreadPool. Is my implementation incorrect?

I need help with my multithreading code.
I have a Callable class which returns a value. I use a cachedThreadPool to submit ~60,000 tasks. I collect all the Futures in a List. After the ExecutorService has shut down, I loop through the list of Futures and write the returned values using a bufferedWriter. Is this the correct way to implement it?
ExecutorService execService = Executors.newCachedThreadPool();
List<Future<ValidationDataObject<String, Boolean>>> futureList = new ArrayList<>();
for (int i = 0; i < emailArrayList.size(); i++) {
    String emailAddress = emailArrayList.get(i);
    ValidateEmail validateEmail = new ValidateEmail(emailAddress);
    Future<ValidationDataObject<String, Boolean>> future =
        execService.submit(validateEmail);
    futureList.add(future);
}
execService.shutdown();
for (Future<ValidationDataObject<String, Boolean>> future : futureList) {
    ValidationDataObject<String, Boolean> validationObject = future.get();
    bufferedWriter.write(validationObject.getEmailAddress() + "|"
        + validationObject.getIsValid());
    bufferedWriter.newLine();
    bufferedWriter.flush();
}
if (execService.isTerminated()) bufferedWriter.close();
Should I use a synchronized block for the bufferedWriter? I think it doesn't need to be synchronized because I am using the bufferedWriter only from the main thread, right?

I have a cachedThreadPool to submit ~60,000 tasks.
Off the bat, a cached thread pool and 60k tasks is a red flag. A cached thread pool creates a new thread whenever no idle one is available, so this could start up to 60k threads, which I doubt you really want. You should use a fixed thread pool and vary the number of threads until you achieve a good balance of throughput versus overwhelming your server. Maybe start with 2x the number of CPUs and then vary it depending on the server load.
You might also consider using a fixed-size queue, which will limit the number of outstanding tasks, although 60k tasks is fine unless those objects are heavy.
I collect all the Futures in a List. After the ExecutorService has shut down, I loop through the list of Futures and write the returned values using a bufferedWriter. Is this the correct way to implement it?
Yes, that's a good pattern. You don't show the writer being created but it is certainly fine for the main thread to own that.
Should I use a synchronized block for the bufferedWriter? I think it doesn't need to be synchronized because I am using the bufferedWriter only from the main thread, right?
Right. No other threads are using it so that's fine. It is a very typical pattern to have a writer thread managing the output of a multi-threaded application.
One final comment is that you might want to look at the ExecutorCompletionService, which allows you to process the tasks as they finish instead of having to wait for them in order. You might require the output to be in order, in which case this isn't helpful, but it's good technology to know about anyway.
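As a rough illustration of that last suggestion, the sketch below collects results in completion order with java.util.concurrent.ExecutorCompletionService. The class name CompletionOrderDemo and the trivial contains("@") check are hypothetical stand-ins for the real validation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class CompletionOrderDemo {
    static List<String> validateAll(List<String> emails, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CompletionService<String> completion = new ExecutorCompletionService<>(pool);
        try {
            for (String email : emails) {
                // Hypothetical "validation": an address is valid if it contains '@'.
                completion.submit(() -> email + "|" + email.contains("@"));
            }
            List<String> lines = new ArrayList<>();
            for (int i = 0; i < emails.size(); i++) {
                // take() returns whichever task finished next, not the next submitted one.
                lines.add(completion.take().get());
            }
            return lines;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(validateAll(Arrays.asList("a@b.c", "nope"), 2));
    }
}
```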

Apart from the fact that executor.shutdown() will most likely not do what you believe it does (it simply stops the executor from accepting new tasks; it does not wait for the running tasks to terminate), your code looks fine.
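To actually wait for the tasks, shutdown() is usually paired with awaitTermination(). A minimal sketch follows, with a hypothetical class name and a short sleep as a stand-in for real work. One related caveat: isTerminated() can still return false right after the last Future.get() has returned, because the worker threads may not have exited yet, so it is a fragile condition for closing a writer.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class AwaitTerminationDemo {
    static int runAndWait(int tasks) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        AtomicInteger finished = new AtomicInteger();
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> {
                try {
                    Thread.sleep(10); // stand-in for real work
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                finished.incrementAndGet();
            });
        }
        pool.shutdown(); // no new tasks accepted; returns immediately
        // Blocks until every submitted task has finished or the timeout expires.
        if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
            throw new IllegalStateException("tasks still running after timeout");
        }
        return finished.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runAndWait(20) + " tasks finished");
    }
}
```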
You are right, there is no need for synchronization with respect to the writer, as you access it only single threaded.
There are things that can be improved, though. Firstly, you are not doing much exception handling: Future.get() throws an ExecutionException if the Callable throws an exception.
I'm not certain how much the execution times of your Callables vary. If they vary notably, consider the following case: say we submit Callables A, B and C, and receive FutA, FutB and FutC. Calling get() blocks until the computation behind the Future has finished. In your setting, you might be waiting for FutA to complete while FutB and FutC are already finished and ready for writing. In the worst case, the processing of A delays the writing for all 60000 tasks.
I think I would go for another approach, where every Callable gets a reference to the same ConcurrentLinkedQueue and, instead of returning the result via a Future, writes the result into that queue. In this scenario, the order of the results depends not on the order of the Callables but on the time at which they finish. Whether or not this results in a speedup depends on your setting (especially the time needed to write a result and the variation in execution times of the Callables).

How do I stop my command queue loop using so much CPU properly?

I have a while loop that checks whether an ArrayList containing commands for the program to execute is empty. If the list is not empty, it executes the commands; if it is empty, right now I just call Thread.sleep(1000) in the else branch. That leaves anything that interacts with it rather sluggish. Is there any way to get the thread it runs on to block until a new command is added? (It runs in its own thread, so that seems to be the best solution to me.) Or is there a better solution to this?

You can use wait() and notify() to have the threads that add something to the list inform the consumer thread that there is something to be read. However, this requires proper synchronization, etc.
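A minimal sketch of the wait()/notify() variant, to show what "proper synchronization" involves. The class name CommandQueue and its String commands are hypothetical. Both methods synchronize on the same object, and the consumer re-checks the condition in a loop because a waiting thread can wake up spuriously.

```java
import java.util.ArrayDeque;
import java.util.Queue;

final class CommandQueue {
    private final Queue<String> commands = new ArrayDeque<>();

    // Producer side: add a command and wake up any waiting consumer.
    synchronized void add(String command) {
        commands.add(command);
        notifyAll();
    }

    // Consumer side: block (without polling) until a command is available.
    synchronized String take() throws InterruptedException {
        while (commands.isEmpty()) {
            wait(); // releases the lock while waiting
        }
        return commands.remove();
    }

    public static void main(String[] args) throws InterruptedException {
        CommandQueue queue = new CommandQueue();
        Thread consumer = new Thread(() -> {
            try {
                System.out.println("executing " + queue.take());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();
        queue.add("reload");
        consumer.join();
    }
}
```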
But a better way to solve your problem is to use a BlockingQueue instead. BlockingQueue implementations are thread-safe, and dequeuing with take() blocks appropriately and wakes up when something is added. The LinkedBlockingQueue is a good class to use if you want your queue to be unbounded. The ArrayBlockingQueue can be used when you want a limited number of items to be stored in the queue (or a LinkedBlockingQueue with a capacity passed to the constructor). With a bounded queue, queue.put(...) blocks while the queue is full, whereas queue.add(...) throws an IllegalStateException.
BlockingQueue<Message> queue = new LinkedBlockingQueue<Message>();
...
// producer thread(s) add a message to the queue
queue.add(message);
...
// consumer(s) wait for a message to be added to the queue and then removes it
Message message = queue.take();
...
// you can also wait for certain amount of time, returns null on timeout
Message message = queue.poll(10, TimeUnit.MINUTES);
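A short sketch of how the insertion methods of a bounded BlockingQueue behave once it is full, following the BlockingQueue Javadoc: offer returns false, add throws IllegalStateException, and put blocks. The class and method names are hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

final class BoundedQueueDemo {
    // Returns a description of what each insertion method did on a full queue.
    static String insertIntoFullQueue() throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1);
        queue.put("first"); // the queue is now at capacity

        boolean offered = queue.offer("second");
        boolean offeredWithTimeout = queue.offer("second", 50, TimeUnit.MILLISECONDS);

        String addOutcome;
        try {
            queue.add("second");
            addOutcome = "added";
        } catch (IllegalStateException e) {
            addOutcome = "threw IllegalStateException";
        }

        // queue.put("second") would block here until a consumer calls take().
        return "offer=" + offered + ", timed offer=" + offeredWithTimeout + ", add " + addOutcome;
    }

    public static void main(String[] args) throws InterruptedException {
        // prints: offer=false, timed offer=false, add threw IllegalStateException
        System.out.println(insertIntoFullQueue());
    }
}
```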

Use a BlockingQueue<E> for your commands.
There's a very good example of how to use it in the link above.

A better solution is to use an ExecutorService. This combines a queue and a pool of threads.
// or use a thread pool with multiple threads.
ExecutorService executor = Executors.newSingleThreadExecutor();
// call as often as you like.
executor.submit(new Runnable() {
    @Override
    public void run() {
        process(string);
    }
});
// when finished
executor.shutdown();

Multithreading help in Java

I'm new to Java, and I need some help working on this program. This is a small part of a large class project, and I must use multithreading.
Here's what I want to do algorithmically:
while (there is still input left, store chunk of input in <chunk>)
{
    if there is not a free thread in my array then
        wait until a thread finishes
    else there is a free thread then
        apply the free thread to <chunk> (which will do something to chunk and output it).
        Note: The ordering of the chunks being output must be the same as input
}
So, the main things I don't know how to do:
How can I check whether or not there's a free thread in the array? I know that there is a function ThreadAlive, but it seems super inefficient to poll every single thread every time in my loop.
If there is no free thread, how can I wait until one has finished?
The ordering is important. How can I preserve the ordering in which the threads output? As in, the order of the output needs to match the order of the input. How can I guarantee this synchronization?
How do I even pass the chunk to my thread? Can I just use the Runnable interface to do this?
Any help with these four bullets is greatly appreciated. Since I'm a super noob, code samples would help significantly.
(side-note: Making an array of threads was just an idea of mine to handle the user defined number of threads. If you have a better way to handle this you're welcome to suggest it!)

Sounds like you basically have a producer/consumer model and can be solved with an ExecutorService and BlockingQueue. Here is a similar question with a similar answer:
producer/consumer work queues

As @altaiojok mentioned, you want to use an ExecutorService and BlockingQueue. The basic algorithm works like this:
ExecutorService executor = Executors.newFixedThreadPool(...); // or newCachedThreadPool, etc.
BlockingQueue<Future<?>> outputQueue = new LinkedBlockingQueue<Future<?>>();
// To be run by an input processing thread
void submitTasks() throws IOException {
    BufferedReader input = ... // assuming you have a file to read; this could be any method of input (file, keyboard, network, etc.)
    String line = input.readLine();
    while (line != null) {
        // futures are queued in input order, so output order matches input order
        outputQueue.add(executor.submit(new YourCallableImplementation(line)));
        line = input.readLine();
    }
}
// To be run by a different output processing thread
void processTaskOutput() {
    try {
        while (true) {
            Future<?> resultFuture = outputQueue.take();
            Object result = resultFuture.get(); // blocks until this task is done
            // process the output (write to file, send to network, print to screen, etc.)
        }
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    } catch (ExecutionException e) {
        // a task threw an exception; e.getCause() holds it
    }
}
I'll leave it to you to figure out how to implement Runnable to make the input and output thread as well as how to implement Callable for the tasks you need to process.
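One possible way to fill in those pieces is sketched below. It is an assumption-laden example, not the only way to do it: the class name OrderedPipeline and the END sentinel are hypothetical, upper-casing stands in for the real chunk processing, and the calling thread plays the role of the input thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;

final class OrderedPipeline {
    // Sentinel future marking the end of input; compared by identity.
    private static final Future<String> END = CompletableFuture.completedFuture(null);

    static List<String> run(List<String> chunks, int threads) throws InterruptedException {
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        BlockingQueue<Future<String>> outputQueue = new LinkedBlockingQueue<>();
        List<String> output = new ArrayList<>(); // written only by the output thread

        Thread outputThread = new Thread(() -> {
            try {
                for (Future<String> f = outputQueue.take(); f != END; f = outputQueue.take()) {
                    output.add(f.get()); // blocks until this particular chunk is done
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } catch (ExecutionException e) {
                throw new IllegalStateException("chunk failed", e.getCause());
            }
        });
        outputThread.start();

        // Futures are queued in input order, so results are consumed in input order
        // even when the chunks finish out of order.
        for (String chunk : chunks) {
            Callable<String> task = () -> chunk.toUpperCase();
            outputQueue.add(executor.submit(task));
        }
        outputQueue.add(END);
        executor.shutdown();
        outputThread.join();
        return output;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(Arrays.asList("a", "b", "c", "d"), 3)); // prints [A, B, C, D]
    }
}
```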

I would suggest using commons-pool which offers pooling of threads so you can easily limit the number of used threads and it also offers some other helper methods.
Concerning the ordering: have a look at the synchronized keyword.
And I would suggest to have a look at the java tutorial (the part about concurrency): http://download.oracle.com/javase/tutorial/essential/concurrency/index.html

Streams might come in handy:
List<Chunk> chunks = new ArrayList<>();
//....
Function<Chunk, String> toWeightInfo = (chunk) -> "weight = " + (chunk.size() * chunk.prio());
List<String> results = chunks.parallelStream()
    .map(toWeightInfo)
    .collect(Collectors.toList());
System.out.println(results);
The parallel stream uses the system's default "fork/join" thread pool, whose parallelism defaults to roughly the number of available logical CPUs, and processes your items in parallel. Collecting into a list also preserves the order of the results.
The parallel streams API hides all the complexity of assigning free threads to jobs and optimizations like work-stealing away from you. Just give it something to chew on and it will work its magic.
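A small self-contained check of the ordering claim, using integers instead of the Chunk type from the snippet above (the class and method names are hypothetical). Collecting an ordered parallel stream with Collectors.toList() keeps encounter order, whereas forEach on a parallel stream makes no such promise.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

final class ParallelOrderDemo {
    static List<String> squares(int n) {
        return IntStream.rangeClosed(1, n)
                .parallel()                           // work is split across the common pool
                .mapToObj(i -> i + "^2=" + (i * i))   // may run out of order
                .collect(Collectors.toList());        // results are assembled in encounter order
    }

    public static void main(String[] args) {
        // prints [1^2=1, 2^2=4, 3^2=9, 4^2=16, 5^2=25]
        System.out.println(squares(5));
    }
}
```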
If you need to use a thread pool of a custom size, please refer to the "Custom thread pool in Java 8 parallel stream" question.
You might also have a look at this good Java 8 Stream Tutorial.
If your case is rather complex and you're streaming chunks into your program, and you've got multiple stages of work, where some must be serial and some can be parallel and some depend on each other, you might have a look at the Disruptor framework from LMAX.
Kind regards

Use ExecutorCompletionService and Future<T>. Together they provide a threadpool based task framework that takes care of all your concerns.
How can I check whether or not there's a free thread in the array? I know that there is a function ThreadAlive, but it seems super inefficient to poll every single thread every time in my loop.
You don't have to. The executor will do this for you efficiently. You just have to submit tasks to it and sit back.
If there is no free thread, how can I wait until one has finished?
Again, you really don't have to. This is taken care of by the executor.
The ordering is important. How can I preserve the ordering in which the threads output? As in, the order of the output needs to match the order of the input. How can I guarantee this synchronization?
This is a concern. If you want the processed output (of chunks, in your words) to arrive in the same order as the chunks appear in the initial array, you have to address a few points:
Is it just the order of arrival of the results that matters, or does the processing of the tasks itself depend on the order? If it is the former, it is much easier to do, but if it is the latter, then you have problems (which I think are very hard things to start with, considering your admission of being new to Java, so I would recommend more learning on your part before attempting this).
Assuming it is the former case, what you can do is this: submit the chunks to the executor in order, and each submission will give you a handle (called a Future<Result>) to the task's output. Store these handles in an ordered queue, and when you want the results, call get() on these Futures. Note that if some task in the middle of the order takes a long time to complete, then the results of the following tasks will also be delayed.
How do I even pass the chunk to my thread? Can I just use the Runnable interface to do this?
Create one Callable instance per chunk, passing the chunk to the instance (for example through its constructor). This represents the task that you will submit() to the ExecutorService.
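A sketch of such a Callable, assuming string chunks. The class name ChunkTask and the "reverse the chunk" work are hypothetical. ExecutorService.invokeAll is used here because it returns the Futures in the same order as the submitted task list, which keeps the output in input order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ChunkTask implements Callable<String> {
    private final String chunk;

    // The chunk is handed to the task through its constructor.
    ChunkTask(String chunk) {
        this.chunk = chunk;
    }

    @Override
    public String call() {
        // Hypothetical processing: reverse the chunk.
        return new StringBuilder(chunk).reverse().toString();
    }

    static List<String> processInOrder(List<String> chunks, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<ChunkTask> tasks = new ArrayList<>();
            for (String chunk : chunks) {
                tasks.add(new ChunkTask(chunk));
            }
            List<String> results = new ArrayList<>();
            // invokeAll waits for all tasks and returns futures in task-list order.
            for (Future<String> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(processInOrder(Arrays.asList("abc", "de"), 2)); // prints [cba, ed]
    }
}
```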

a "simple" thread pool in java

I'm looking for a simple object that will hold my work threads and I need it to not limit the number of threads, and not keep them alive longer than needed.
But I do need it to have a method similar to an ExecutorService.shutdown();
(Waiting for all the active threads to finish but not accepting any new ones)
so maybe a threadpool isn't what I need, so I would love a push in the right direction.
(as they are meant to keep the threads alive)
Further clarification of intent:
Each thread is an upload of a file, and I have another process that modifies files, but it waits until the file has no uploads in progress by joining each of the threads. So when the threads are kept alive, that process is locked. (Each thread adds itself to a list for a specific file on creation, so I only join() threads that upload that specific file.)

One way to do what you want is to use a Callable with a Future that returns the File object of a completed upload. Then pass the Future into another Callable that checks Future.isDone() and spins until it returns true, and then does whatever you need to do to the file. Your use case is not unique and fits very neatly into the capabilities of the java.util.concurrent package.
One interesting class is ExecutorCompletionService, which does exactly what you want: it waits for results and then proceeds with an additional calculation.
A CompletionService that uses a supplied Executor to execute tasks. This class arranges that submitted tasks are, upon completion, placed on a queue accessible using take. The class is lightweight enough to be suitable for transient use when processing groups of tasks.
Usage Examples: Suppose you have a set of solvers for a certain problem, each returning a value of some type Result, and would like to run them concurrently, processing the results of each of them that return a non-null value, in some method use(Result r). You could write this as:
void solve(Executor e, Collection<Callable<Result>> solvers)
        throws InterruptedException, ExecutionException
{
    CompletionService<Result> ecs = new ExecutorCompletionService<Result>(e);
    for (Callable<Result> s : solvers) { ecs.submit(s); }
    int n = solvers.size();
    for (int i = 0; i < n; ++i)
    {
        Result r = ecs.take().get();
        if (r != null) { use(r); }
    }
}
You don't want an unbounded ExecutorService
You almost never want to allow unbounded thread pools, as they actually can limit the performance of your application if the number of threads gets out of hand.
Your domain is limited by disk or network I/O or both, so a small thread pool would be sufficient. You are not going to want to read from hundreds or thousands of incoming connections with a thread per connection.
If you are receiving more than a handful of concurrent uploads, part of your solution is to investigate the java.nio package and read about non-blocking I/O as well.

Is there a reason that you don't want to reuse threads? Seems to me that the simplest thing would be to use ExecutorService anyway and let it reuse threads.
