Using concurrent classes to process files in a directory in parallel


I am trying to figure out how to use the types from the java.util.concurrent package to parallelize processing of all the files in a directory.
I am familiar with the multiprocessing package in Python, which is very simple to use, so ideally I am looking for something similar:
public interface FictionalFunctor<T>{
void handle(T arg);
}
public class FictionalThreadPool {
public FictionalThreadPool(int threadCount){
...
}
public <T> FictionalThreadPoolMapResult<T> map(FictionalFunctor<T> functor, List<T> args){
// Executes the given functor on each and every arg from args in parallel. Returns, when
// all the parallel branches return.
// FictionalThreadPoolMapResult allows to abort the whole mapping process, at the least.
}
}
dir = getDirectoryToProcess();
pool = new FictionalThreadPool(10); // 10 threads in the pool
pool.map(new FictionalFunctor<File>(){
@Override
public void handle(File file){
// process the file
}
}, dir.listFiles());
I have a feeling that the types in java.util.concurrent allow me to do something similar, but I have absolutely no idea where to start.
Any ideas?
Thanks.
EDIT 1
Following the advice given in the answers, I have written something like this:
public void processAllFiles() throws IOException {
ExecutorService exec = Executors.newFixedThreadPool(6);
BlockingQueue<Runnable> tasks = new LinkedBlockingQueue<Runnable>(5); // Figured we can keep the contents of 6 files simultaneously.
exec.submit(new MyCoordinator(exec, tasks));
for (File file : dir.listFiles(getMyFilter())) {
try {
tasks.add(new MyTask(file));
} catch (IOException exc) {
System.err.println(String.format("Failed to read %s - %s", file.getName(), exc.getMessage()));
}
}
}
public class MyTask implements Runnable {
private final byte[] m_buffer;
private final String m_name;
public MyTask(File file) throws IOException {
m_name = file.getName();
m_buffer = Files.toByteArray(file);
}
@Override
public void run() {
// Process the file contents
}
}
private class MyCoordinator implements Runnable {
private final ExecutorService m_exec;
private final BlockingQueue<Runnable> m_tasks;
public MyCoordinator(ExecutorService exec, BlockingQueue<Runnable> tasks) {
m_exec = exec;
m_tasks = tasks;
}
@Override
public void run() {
while (true) {
Runnable task = m_tasks.remove();
m_exec.submit(task);
}
}
}
Here is how I thought the code works:
The files are read one after another.
Each file's contents are saved in a dedicated MyTask instance.
A blocking queue with a capacity of 5 holds the tasks. I count on the server being able to keep the contents of at most 6 files in memory at a time: 5 in the queue and one more fully initialized task waiting to enter the queue.
A special MyCoordinator task fetches the file tasks from the queue and dispatches them to the same pool.
OK, so there is a bug: more than 6 tasks can be created. Some will be submitted even though all the pool threads are busy. I plan to solve that later.
The problem is that it does not work at all. The MyCoordinator thread blocks on the first remove, which is fine. But it never unblocks, even though new tasks were placed in the queue. Can anyone tell me what I am doing wrong?

The thread pool you are looking for is the ExecutorService interface. You can create a fixed-size thread pool using Executors.newFixedThreadPool. This lets you easily implement a producer-consumer pattern, with the pool encapsulating all the queue and worker functionality for you:
ExecutorService exec = Executors.newFixedThreadPool(10);
You can then submit tasks in the form of objects whose type implements Runnable (or Callable if you want to also get a result):
class ThreadTask implements Runnable {
public void run() {
// task code
}
}
...
exec.submit(new ThreadTask());
// alternatively, using an anonymous type
exec.submit(new Runnable() {
public void run() {
// task code
}
});
A big word of advice on processing multiple files in parallel: if you have a single mechanical disk holding the files it's wise to use a single thread to read them one-by-one and submit each file to a thread pool task as above, for processing. Do not do the actual reading in parallel as it will degrade performance.
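To make that advice concrete, here is a minimal sketch, assuming the per-file work can be expressed as a function of the file's bytes. The class name SequentialReadDemo and the process method (which just counts newline bytes) are hypothetical placeholders, not part of the original answer. Note that this sketch does not limit how many file contents sit in memory at once.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class SequentialReadDemo {

    // Hypothetical stand-in for the real per-file work: counts newline bytes.
    static int process(byte[] contents) {
        int lines = 0;
        for (byte b : contents) {
            if (b == '\n') {
                lines++;
            }
        }
        return lines;
    }

    // Reads every file on the calling thread, processes the bytes on the pool,
    // and returns the results in the same order as the input list.
    static List<Integer> processAll(List<Path> files, int threads)
            throws IOException, InterruptedException, ExecutionException {
        ExecutorService exec = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (Path file : files) {
                byte[] contents = Files.readAllBytes(file); // single reader thread
                futures.add(exec.submit(() -> process(contents)));
            }
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> future : futures) {
                results.add(future.get()); // blocks until that task has finished
            }
            return results;
        } finally {
            exec.shutdown(); // lets the pool threads exit once the queue is drained
        }
    }
}
```

Collecting the Future objects and calling get() on each is what gives the "returns when all the parallel branches return" behaviour asked for in the question.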

A simpler solution than using ExecutorService is to implement your own producer-consumer scheme. Have one thread create tasks and put them on a LinkedBlockingQueue or ArrayBlockingQueue, and have worker threads take tasks from this queue and run them. You may need a special kind of task, named ExitTask, that forces a worker to exit. At the end of the job, if you have n workers, you need to add n ExitTasks to the queue.
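A rough sketch of such a hand-rolled scheme might look as follows. The names PoisonPillDemo, EXIT_TASK and runAll are made up for illustration. One detail worth noting: on a BlockingQueue, take() and put() are the operations that wait, whereas remove() and add() throw an exception when the queue is empty or full.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class PoisonPillDemo {

    // Sentinel task: a worker that takes it leaves its loop.
    static final Runnable EXIT_TASK = () -> { };

    // Runs all jobs on `workers` threads, feeding them through a bounded queue.
    static void runAll(List<Runnable> jobs, int workers, int capacity)
            throws InterruptedException {
        BlockingQueue<Runnable> queue = new ArrayBlockingQueue<>(capacity);
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < workers; i++) {
            Thread worker = new Thread(() -> {
                try {
                    while (true) {
                        Runnable task = queue.take(); // waits while the queue is empty
                        if (task == EXIT_TASK) {
                            return;
                        }
                        try {
                            task.run();
                        } catch (RuntimeException e) {
                            // keep the worker alive if a single job fails
                            System.err.println("Job failed: " + e);
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            worker.start();
            threads.add(worker);
        }
        for (Runnable job : jobs) {
            queue.put(job); // waits while the queue is full
        }
        for (int i = 0; i < workers; i++) {
            queue.put(EXIT_TASK); // one exit task per worker
        }
        for (Thread worker : threads) {
            worker.join();
        }
    }
}
```

The bounded capacity gives back-pressure: the producer cannot get more than `capacity` jobs ahead of the workers.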

Basically, what @Tudor said: use an ExecutorService. I wanted to expand on his code, and I always feel strange editing other people's posts. Here's a skeleton of what you would submit to the ExecutorService:
public class MyFileTask implements Runnable {
final File fileToProcess;
public MyFileTask(File file) {
fileToProcess = file;
}
public void run() {
// your code goes here, e.g.
handle(fileToProcess);
// if you prefer, implement Callable instead
}
}
See also my blog post here for some more details if you get stuck
Since processing Files often leads to IOExceptions, I'd prefer a Callable (which can throw a checked Exception) to a Runnable, but YMMV.
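As a sketch of the Callable variant, assuming the task is something that can fail with an IOException: the example below uses Files.size as a placeholder for the real work, and the names CallableFileTasks, sizeOf and run are invented for this illustration. An exception thrown inside call() resurfaces as the cause of an ExecutionException when Future.get() is called.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class CallableFileTasks {

    // Placeholder task: returns the file size; Files.size may throw IOException.
    static Callable<Long> sizeOf(Path file) {
        return () -> Files.size(file);
    }

    // Runs one task per file and reports either the result or the failure.
    static Map<Path, String> run(List<Path> files, int threads)
            throws InterruptedException {
        ExecutorService exec = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Long>> tasks = new ArrayList<>();
            for (Path file : files) {
                tasks.add(sizeOf(file));
            }
            // invokeAll waits for every task; futures come back in task order
            List<Future<Long>> futures = exec.invokeAll(tasks);
            Map<Path, String> report = new LinkedHashMap<>();
            for (int i = 0; i < files.size(); i++) {
                try {
                    report.put(files.get(i), futures.get(i).get() + " bytes");
                } catch (ExecutionException e) {
                    // the exception thrown by call() is available as the cause
                    report.put(files.get(i),
                            "failed: " + e.getCause().getClass().getSimpleName());
                }
            }
            return report;
        } finally {
            exec.shutdown();
        }
    }
}
```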

Related

How to manage threads in Spring TaskExecutor framework

I have a BlockingQueue of Runnables. I can simply execute all tasks using one of the TaskExecutor implementations, and all will be run in parallel.
However, some Runnables depend on others: they need to wait until another Runnable finishes before they can be executed.
The rule is quite simple: every Runnable has a code. Two Runnables with the same code cannot run simultaneously, but if the codes differ they should run in parallel.
In other words, all running Runnables need to have different codes; all "duplicates" should wait.
The problem is that there's no event or method that fires when a thread ends.
I can build such a notification into every Runnable, but I don't like this approach, because it will fire just before the thread ends, not after it has ended.
java.util.concurrent.ThreadPoolExecutor has an afterExecute method, but it needs to be overridden. Spring uses only the default implementation, so this method is ignored.
Even if I do that, it gets complicated, because I need to track two additional collections: the Runnables already executing (no implementation gives access to this information) and those postponed because they have a duplicate code.
I like the BlockingQueue approach because there's no polling; a thread simply activates when something new is in the queue. But maybe there's a better approach to managing such dependencies between Runnables, so that I should give up on BlockingQueue and use a different strategy?

If the number of different codes is not that large, the approach with a separate single thread executor for each possible code, offered by BarrySW19, is fine.
If the total number of threads becomes unacceptable, then instead of a single-thread executor we can use an actor (from Akka or a similar library):
public class WorkerActor extends UntypedActor {
public void onReceive(Object message) {
if (message instanceof Runnable) {
Runnable work = (Runnable) message;
work.run();
} else {
// report an error
}
}
}
As in the original solution, ActorRefs for WorkerActors are collected in a HashMap. When an ActorRef workerActorRef corresponding to the given code is obtained (retrieved or created), the Runnable job is submitted to execution with workerActorRef.tell(job).
If you don't want a dependency on an actor library, you can program WorkerActor from scratch:
public class WorkerActor implements Runnable, Executor {
Executor executor = ForkJoinPool.commonPool(); // or can be assigned in the constructor
LinkedBlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
boolean running = false;
public synchronized void execute(Runnable job) {
queue.add(job); // the queue is unbounded, so add() cannot fail; put() would force us to handle InterruptedException
if (!running) {
executor.execute(this); // execute this worker, not job!
running = true;
}
}
public void run() {
for (;;) {
Runnable work=null;
synchronized (this) {
work = queue.poll();
if (work==null) {
running = false;
return;
}
}
work.run();
}
}
}
When a WorkerActor worker corresponding to the given code is obtained (retrieved or created), the Runnable job is submitted to execution with worker.execute(job).

One alternate strategy which springs to mind is to have a separate single thread executor for each possible code. Then, when you want to submit a new Runnable you simply lookup the correct executor to use for its code and submit the job.
This may, or may not be a good solution depending on how many different codes you have. The main thing to consider would be that the number of concurrent threads running could be as high as the number of different codes you have. If you have many different codes this could be a problem.
Of course, you could use a Semaphore to restrict the number of concurrently running jobs; you would still create one thread per code, but only a limited number could actually execute at the same time. For example, this would serialise jobs by code, allowing up to three different codes to run concurrently:
public class MultiPoolExecutor {
private final Semaphore semaphore = new Semaphore(3);
private final ConcurrentMap<String, ExecutorService> serviceMap
= new ConcurrentHashMap<>();
public void submit(String code, Runnable job) {
ExecutorService executorService = serviceMap.computeIfAbsent(
code, (k) -> Executors.newSingleThreadExecutor());
executorService.submit(() -> {
semaphore.acquireUninterruptibly();
try {
job.run();
} finally {
semaphore.release();
}
});
}
}
Another approach would be to modify the Runnable to release a lock and check for jobs which could be run upon completion (so avoiding polling) - something like this example, which keeps all the jobs in a list until they can be submitted. The boolean latch ensures only one job for each code has been submitted to the thread pool at any one time. Whenever a new job arrives or a running one completes the code checks again for new jobs which can be submitted (the CodedRunnable is simply an extension of Runnable which has a code property).
public class SubmissionService {
private final ExecutorService executorService = Executors.newFixedThreadPool(5);
private final ConcurrentMap<String, AtomicBoolean> locks = new ConcurrentHashMap<>();
private final List<CodedRunnable> jobs = new ArrayList<>();
public void submit(CodedRunnable codedRunnable) {
synchronized (jobs) {
jobs.add(codedRunnable);
}
submitWaitingJobs();
}
private void submitWaitingJobs() {
synchronized (jobs) {
for(Iterator<CodedRunnable> iter = jobs.iterator(); iter.hasNext(); ) {
CodedRunnable nextJob = iter.next();
AtomicBoolean latch = locks.computeIfAbsent(
nextJob.getCode(), (k) -> new AtomicBoolean(false));
if(latch.compareAndSet(false, true)) {
iter.remove();
executorService.submit(() -> {
try {
nextJob.run();
} finally {
latch.set(false);
submitWaitingJobs();
}
});
}
}
}
}
}
The downside of this approach is that the code needs to scan through the entire list of waiting jobs after each task completes. Of course, you could make this more efficient - a completing task would actually only need to check for other jobs with the same code, so the jobs could be stored in a Map<String, List<Runnable>> structure instead to allow for faster processing.
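The map-based variant mentioned above could be sketched roughly like this. It is only an illustration under assumptions: the class name KeyedSerialExecutor is invented, and the per-code queues live in a plain HashMap guarded by synchronized methods. A code is present in the map exactly while one of its jobs is running, so a finishing job only has to look at the queue for its own code.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Executor;

class KeyedSerialExecutor {

    private final Executor pool;
    // code -> jobs waiting behind the currently running job with that code
    private final Map<String, Deque<Runnable>> waiting = new HashMap<>();

    KeyedSerialExecutor(Executor pool) {
        this.pool = pool;
    }

    synchronized void submit(String code, Runnable job) {
        Deque<Runnable> queue = waiting.get(code);
        if (queue != null) {
            queue.addLast(job); // a job with this code is running: wait behind it
            return;
        }
        waiting.put(code, new ArrayDeque<>()); // mark the code as busy
        pool.execute(wrap(code, job));
    }

    private Runnable wrap(String code, Runnable job) {
        return () -> {
            try {
                job.run();
            } finally {
                onDone(code);
            }
        };
    }

    // Called when a job finishes: start the next job with the same code, if any.
    private synchronized void onDone(String code) {
        Runnable next = waiting.get(code).pollFirst();
        if (next == null) {
            waiting.remove(code); // nothing pending: the code is free again
        } else {
            pool.execute(wrap(code, next));
        }
    }
}
```

Jobs with different codes are handed to the pool immediately, so they can run in parallel up to the pool size, while jobs sharing a code run one after another in submission order.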

Adding multi-threading possibility to a single-threaded all-files-in-directory iterator utility function

I have a function that serially (single-threaded-ly) iterates through a directory of files, changing all tab indentation to three-space indentation.
I'm using it as my first attempt at multi-threading. (Am most of the way through Java Concurrency in Practice...surprised it's eight years old now.)
In order to keep its current single-threaded functionality while adding the possibility of multi-threading, I'm thinking of changing the function to accept an additional Executor parameter. The original single-threaded function would then become a call to the new one, passing in a single-threaded executor.
Is this an appropriate way to go about it?

If you're using Java 8, I've found parallelStream to be about the easiest way to implement multi-threading
List<File> files = Arrays.asList(getDirectoryContents());
files.parallelStream().forEach( file -> processFile(file));
If you want to be able to change between single-threaded and multi-threaded, you could simply pass a boolean flag
List<File> files = Arrays.asList(getDirectoryContents());
if(multithreaded){
files.parallelStream().forEach( file -> processFile(file));
}else{
files.stream().forEach(file -> processFile(file));
}
I wish I could help with Java 7, but I went from Java 5 to 8 overnight. :) Java 8 is sooooooo worth it.

One way is as @Victor Sorokin suggests in his answer: wrap the processing of every file in a Runnable and then either submit to an Executor or just invoke run() from the main thread.
Another possibility is to always do the same wrapping in a Runnable and submit it to an always-given Executor.
Whether processing of each file is executed concurrently or not would depend on the given Executor's implementation.
For parallel processing, you could invoke your function passing it e.g. a ThreadPoolExecutor as an argument, whereas for sequential processing you could pass in a fake Executor, i.e. one that runs submitted tasks in the caller thread:
public class FakeExecutor implements Executor {
@Override
public void execute(Runnable task) {
task.run();
}
}
I believe this way is the most flexible approach.
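A possible shape for such a function is sketched below, with invented names (ExecutorParameterDemo, forEach, tabsToSpaces). Since the plain Executor interface offers no way to wait for completion, this sketch counts finished tasks with a CountDownLatch; that is an assumption of the example, not something the answer prescribes. The method reference Runnable::run plays the role of the fake same-thread executor.

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executor;
import java.util.function.Consumer;

class ExecutorParameterDemo {

    // Applies `action` to every item via the given executor and waits for all of them.
    static <T> void forEach(List<T> items, Consumer<T> action, Executor executor)
            throws InterruptedException {
        CountDownLatch remaining = new CountDownLatch(items.size());
        for (T item : items) {
            executor.execute(() -> {
                try {
                    action.accept(item);
                } finally {
                    remaining.countDown(); // count the task even if it failed
                }
            });
        }
        remaining.await();
    }

    // The original single-threaded entry point: same code, same-thread executor.
    static <T> void forEach(List<T> items, Consumer<T> action)
            throws InterruptedException {
        forEach(items, action, Runnable::run);
    }

    // Example per-item work from the question: tab indentation to three spaces.
    static String tabsToSpaces(String line) {
        return line.replace("\t", "   ");
    }
}
```

Callers that want parallelism pass a thread pool as the third argument; everyone else keeps calling the two-argument overload and gets the old sequential behaviour.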

The most straightforward way:
(The trickiest part) Make sure the code is thread-safe. Unfortunately, it's hard to give more concrete advice without seeing the actual code in question.
Wrap the code in a Runnable/Callable (either an anonymous class or an explicit class which implements Runnable/Callable).
This way you'll be able to either call your Runnable in the main thread (single-threaded version) or pass it to an Executor (multi-threaded version).

One way is to create a class that implements the Executor interface and executes your code in the main thread. Like this:
public class FileProcessor implements Runnable {
private final File file;
public FileProcessor(File file) {
this.file = file;
}
@Override
public void run() {
// do something with file
}
}
public class DirectoryManager {
private final Executor executor;
public DirectoryManager() {
executor = new Executor() {
@Override
public void execute(Runnable command) {
command.run();
}
};
}
public DirectoryManager(int numberOfThreads) {
executor = Executors.newFixedThreadPool(numberOfThreads);
}
public void process(List<File> files) {
for (File file : files) {
executor.execute(new FileProcessor(file));
}
}
}
and call it in your code like this
DirectoryManager directoryManager = new DirectoryManager();
directoryManager.process(lists);
// some other sync code
or this
DirectoryManager directoryManager = new DirectoryManager(5);
directoryManager.process(lists);
// some other async code

How to properly extend FutureTask

While coding a computation-heavy application, I tried to make use of the SwingWorker class to spread the load to multiple CPU cores. However, behaviour of this class proved to be somewhat strange: only one core seemed to be utilized.
When searching the internet, I found an excellent answer on this web (see Swingworker instances not running concurrently, answer by user268396) which -- in addition to the cause of the problem -- also mentions a possible solution:
What you can do to get around this is use an ExecutorService and post
FutureTasks on it. These will provide 99% of the SwingWorker API
(SwingWorker is a FutureTask derivative), all you have to do is set up
your Executor properly.
Being a Java beginner, I am not entirely sure how to do this properly. Not only do I need to pass some initial data to the FutureTask objects, I also need to get the results back, much as with SwingWorker. Any example code would therefore be much appreciated.
nvx
==================== EDIT ====================
After implementing the simple yet elegant solution mentioned in FutureTask that implements Callable, another issue has come up. If I use an ExecutorService to create individual threads, how do I execute specific code after a thread finished running?
I tried to override done() of the FutureTask object (see the code below) but I guess that the "show results" bit (or any GUI related stuff for that matter) should be done in the application's event dispatch thread (EDT). Therefore: how do I submit the runnable to the EDT?
package multicoretest;
import java.util.concurrent.*;
public class MultiCoreTest {
static int coresToBeUsed = 4;
static Future[] futures = new Future[coresToBeUsed];
public static void main(String[] args) {
ExecutorService execSvc = Executors.newFixedThreadPool(coresToBeUsed);
for (int i = 0; i < coresToBeUsed; i++) {
futures[i] = execSvc.submit(new Worker(i));
}
execSvc.shutdown();
// I do not want to block the thread (so that users can
// e.g. terminate the computation via GUI)
//execSvc.awaitTermination(Long.MAX_VALUE, TimeUnit.DAYS);
}
static class Worker implements Callable<String> {
private final FutureTask<String> futureTask;
private final int workerIdx;
public Worker(int idx) {
workerIdx = idx;
futureTask = new FutureTask<String>(this) {
@Override
protected void done() {
Runnable r = new Runnable() {
@Override
public void run() {
showResults(workerIdx);
}
};
r.run(); // Does not work => how do I submit the runnable
// to the application's event dispatch thread?
}
};
}
@Override
public String call() throws Exception {
String s = "";
for (int i = 0; i < 2e4; i++) {
s += String.valueOf(i) + " ";
}
return s;
}
final String get() throws InterruptedException, ExecutionException {
return futureTask.get();
}
void showResults(int idx) {
try {
System.out.println("Worker " + idx + ":" +
(String)futures[idx].get());
} catch (Exception e) {
System.err.println(e.getMessage());
}
}
}
}

A couple of points:
You rarely need to use FutureTask directly; just implement Callable or Runnable and submit the instance to an Executor.
In order to update the GUI when you are done, as the last step of your run()/call() method, use SwingUtilities.invokeLater() with the code to update the UI.
Note that you can still use SwingWorker: instead of calling execute(), submit the SwingWorker to your Executor.
If you need to process all results together when all threads are done before updating the GUI, then I would suggest:
have each worker stash its results into a thread-safe, shared list
the last worker to add results to the list should then do the post-processing work
the worker which did the post-processing work should then invoke SwingUtilities.invokeLater() with the final results
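One way to sketch the "compute in the pool, then update the GUI on the EDT" idea is with CompletableFuture, assuming Java 8 or later (which the question may predate). The names BackgroundThenUi, compute and computeThenShow are hypothetical, and the computation is a toy. The only Swing-specific part is that SwingUtilities::invokeLater can serve as an Executor, because it accepts a Runnable.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.function.Consumer;
import javax.swing.SwingUtilities;

class BackgroundThenUi {

    // Toy stand-in for the CPU-heavy work done by each worker.
    static String compute(int workerIdx) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 5; i++) {
            sb.append(i).append(' ');
        }
        return "Worker " + workerIdx + ": " + sb.toString().trim();
    }

    // An Executor that runs every task on the Swing event dispatch thread.
    static Executor edt() {
        return SwingUtilities::invokeLater;
    }

    // Runs compute() on `pool`, then hands the result to `show` on the `ui` executor.
    static CompletableFuture<Void> computeThenShow(
            int workerIdx, Executor pool, Executor ui, Consumer<String> show) {
        return CompletableFuture
                .supplyAsync(() -> compute(workerIdx), pool)
                .thenAcceptAsync(show, ui);
    }
}
```

In a Swing application the call might look like computeThenShow(i, pool, BackgroundThenUi.edt(), label::setText), where label is some JLabel; nothing blocks the calling thread, so the GUI stays responsive while the workers run.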

I tried to make use of the SwingWorker class to spread the load to
multiple CPU cores. However, behaviour of this class proved to be
somewhat strange: only one core seemed to be utilized.
It is hard to say without an SSCCE (a short, self-contained, compilable, runnable example).
SSCCE could be based on
SwingWorker is designed to create worker threads for a Swing GUI; more in this thread

Is there any executor in the Java concurrent package which guarantees that all tasks will be done in the order they were submitted?

A code sample for demonstration of the idea from the title:
executor.submit(runnable1);
executor.submit(runnable2);
I need to be sure that runnable1 will finish before runnable2 starts, and I haven't found any guarantee of such behavior in the executors documentation.
About the problem I'm solving:
I need to write lots of logs to a file. Each log entry requires a lot of precomputation (formatting and some other work). So I want to put each logging task on a kind of queue and process these tasks in a separate thread. And, of course, it's important to preserve the order of the log entries.

A single-threaded executor will perform all tasks in the order submitted. You would only use a thread pool with multiple threads if you wanted the tasks to be performed concurrently.
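For the logging case from the question, a minimal sketch could look like this. OrderedLogWriter and its methods are invented names, and an in-memory list stands in for the log file. The formatting runs on the writer thread, and Executors.newSingleThreadExecutor() runs the tasks one at a time in submission order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class OrderedLogWriter {

    private final ExecutorService writer = Executors.newSingleThreadExecutor();
    // Stand-in for the log file; only the writer thread appends to it.
    private final List<String> sink = new ArrayList<>();

    // Returns immediately; the (possibly expensive) formatting runs on the writer thread.
    void log(int id, Object payload) {
        writer.execute(() -> sink.add(format(id, payload)));
    }

    static String format(int id, Object payload) {
        return String.format("%05d %s", id, payload);
    }

    // Stops accepting entries, waits for the queued ones, and returns what was written.
    List<String> close() throws InterruptedException {
        writer.shutdown();
        if (!writer.awaitTermination(10, TimeUnit.SECONDS)) {
            throw new IllegalStateException("log writer did not finish in time");
        }
        return sink;
    }
}
```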
Adding tasks to a queue can be expensive in itself. You can use an Exchanger like this
http://vanillajava.blogspot.com/2011/09/exchange-and-gc-less-java.html?z#!/2011/09/exchange-and-gc-less-java.html
This avoids using a queue or creating objects.
An alternative which is faster is to use a memory-mapped file, which doesn't require a background thread (actually the OS is working in the background). This is much faster again. It supports sub-microsecond latencies and millions of messages per second.
https://github.com/peter-lawrey/Java-Chronicle

You could create a simple wrapper like the one below so that all your Runnables are executed in the same thread (i.e. sequentially), and submit that wrapper to the executor instead. That does not address the logging issue.
class MyRunnable implements Runnable {
private List<Runnable> runnables = new ArrayList<>();
public void add(Runnable r) {
runnables.add(r);
}
@Override
public void run() {
for (Runnable r : runnables) {
r.run();
}
}
}
//......
MyRunnable r = new MyRunnable();
r.add(runnable1);
r.add(runnable2);
executor.submit(r);

Presumably you are doing some post-analysis of the log file? Have you considered not caring about the order in which entries are written and re-ordering them offline later? You could allocate a unique id at submit time using a timestamp or an AtomicLong.
A code sketch (untested) would look like this:
import java.util.concurrent.atomic.AtomicLong;
class MyProcessor {
public void work() {
for (Object data : allData) {
executor.submit(new MySequencedRunnable(data));
}
}
}
class MySequencedRunnable implements Runnable {
private static final AtomicLong LOG_SEQUENCE_ID = new AtomicLong(0);
private final long sequenceId;
private final Object data;
MySequencedRunnable(Object data) {
// allocate the id here, at submit time, so that it reflects submission order
this.sequenceId = LOG_SEQUENCE_ID.incrementAndGet();
this.data = data;
}
public void run() {
LOGGER.log(sequenceId, data);
}
}
Also consider, if you're using something like log4j, using NDC or MDC to assist with the re-ordering.

java: combined multithreaded / singlethreaded task queue

I like the ExecutorService series of classes/interfaces. I don't have to worry about threads; I take in an ExecutorService instance and use it to schedule tasks, and if I want to use an 8-thread or 16-thread pool, well, great, I don't have to worry about that at all, it just happens depending on how the ExecutorService is setup. Hurray!
But what do I do if some of my tasks need to be executed in serial order? Ideally I would ask the ExecutorService to let me schedule these tasks on a single thread, but there doesn't seem to be any means of doing so.
edit: The tasks are not known ahead of time, they are an unlimited series of tasks that are erratically generated by events of various kinds (think random / unknown arrival process: e.g. clicks of a Geiger counter, or keystroke events).

You could write an implementation of Runnable that takes some tasks and executes them serially.
Something like:
public class SerialRunner implements Runnable {
private List<Runnable> tasks;
public SerialRunner(List<Runnable> tasks) {
this.tasks = tasks;
}
public void run() {
for (Runnable task: tasks) {
task.run();
}
}
}

I'm using a separate executor created with Executors.newSingleThreadExecutor() for tasks that I want to queue up and only run one at a time.
Another approach is to just compose several tasks and submit that one:
executor.submit(new Callable<Void>() {
public Void call() throws Exception { // a Callable, because myTaskN.call() may throw a checked Exception
myTask1.call();
myTask2.call();
myTask3.call();
return null;
}});
Though you might need to be more elaborate if you still want myTask2 to run even if myTask1 throws an Exception.

The way I do this is via some homegrown code that streams work onto different threads according what the task says its key is (this can be completely arbitrary or a meaningful value). Instead of offering to a Queue and having some other thread(s) taking work off it (or lodging work with the ExecutorService in your case and having the service maintain a threadpool that takes off the internal work queues), you offer a Pipelineable (aka a task) to the PipelineManager which locates the right queue for the key of that task and sticks the task onto that queue. There is assorted other code that manages the threads taking off the queues to ensure you always have 1 and only 1 thread taking off that queue in order to guarantee that all work offered to it for the same key will be executed serially.
Using this approach you could easily set aside certain keys for n sets of serial work while round robining over the remaining keys for the work that can go in any old order or alternatively you can keep certain pipes (threads) hot by judicious key selection.
This approach is not feasible for the JDK ExecutorService implementation because they're backed by a single BlockingQueue (at least a ThreadPoolExecutor is) and hence there's no way to say "do this work in any old order but this work must be serialised". I am assuming you want that of course in order to maintain throughput otherwise just stick everything onto a singleThreadExecutor as per danben's comment.
(edit)
What you could do instead, to maintain the same abstraction, is create your own implementation of ExecutorService that delegates to as many instances of ThreadPoolExecutor (or similar) as you need: one backed by n threads and one or more single-threaded instances. Something like the following (which is in no way working code, but hopefully you get the idea!):
public class PipeliningExecutorService<T extends Pipelineable> implements ExecutorService {
private Map<Key, ExecutorService> executors;
private ExecutorService generalPurposeExecutor;
// ExecutorService methods here, for example
@Override
public <T> Future<T> submit(Callable<T> task) {
Pipelineable pipelineableTask = convertTaskToPipelineable(task);
Key taskKey = pipelineableTask.getKey();
ExecutorService delegatedService = executors.get(taskKey);
if (delegatedService == null) delegatedService = generalPurposeExecutor;
return delegatedService.submit(task);
}
}
public interface Pipelineable<K,V> {
K getKey();
V getValue();
}
It's pretty ugly, for this purpose, that the ExecutorService methods are generic as opposed to the service itself which means you need some standard way to marshal whatever gets passed in into a Pipelineable and a fallback if you can't (e.g. throw it onto the general purpose pool).

hmm, I thought of something, not quite sure if this will work, but maybe it will (untested code). This skips over subtleties (exception handling, cancellation, fairness to other tasks of the underlying Executor, etc.) but is maybe useful.
class SequentialExecutorWrapper implements Runnable
{
final private ExecutorService executor;
// queue of tasks to execute in sequence
final private Queue<Runnable> taskQueue = new ConcurrentLinkedQueue<Runnable>();
// flag guarding poll() access to the task queue
final private AtomicBoolean taskInProcess = new AtomicBoolean(false);
public void submit(Runnable task)
{
// add task to the queue, try to run it now
taskQueue.offer(task);
if (!tryToRunNow())
{
// this object is running tasks on another thread
// do we need to try again or will the currently-running thread
// handle it? (depends on ordering between taskQueue.offer()
// and the tryToRunNow(), not sure if there is a problem)
}
}
public void run()
{
tryToRunNow();
}
private boolean tryToRunNow()
{
if (taskInProcess.compareAndSet(false, true))
{
// yay! I own the task queue!
try {
Runnable task = taskQueue.poll();
while (task != null)
{
task.run();
task = taskQueue.poll();
}
}
finally
{
taskInProcess.set(false);
}
return true;
}
else
{
return false;
}
}
}
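For comparison, a more compact variant of the same idea follows the SerialExecutor pattern shown in the Javadoc of java.util.concurrent.Executor. The sketch below is an adaptation of that pattern, not the code from this answer: tasks given to it run one at a time, in order, on whatever Executor it wraps, and any other work submitted directly to that underlying executor still runs concurrently.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.Executor;

class SerialExecutor implements Executor {

    private final Queue<Runnable> tasks = new ArrayDeque<>(); // guarded by "this"
    private final Executor delegate; // e.g. a shared thread pool
    private Runnable active;         // task currently handed to the delegate, or null

    SerialExecutor(Executor delegate) {
        this.delegate = delegate;
    }

    @Override
    public synchronized void execute(Runnable task) {
        tasks.add(() -> {
            try {
                task.run();
            } finally {
                scheduleNext(); // hand over the next queued task, if any
            }
        });
        if (active == null) {
            scheduleNext(); // nothing in flight: start right away
        }
    }

    private synchronized void scheduleNext() {
        active = tasks.poll();
        if (active != null) {
            delegate.execute(active);
        }
    }
}
```

Because tasks can be added at any time, this also fits the case from the question where the serial tasks arrive erratically rather than being known up front.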
