using parallelStream for independent tasks?


I have a list of tasks. Each task is independent of each other (they do not use results from each other).
When having 1000 tasks and using a sequential stream to process these tasks..
tasks.forEach(task -> {
    // long running task
    task.run();
    System.out.println("Thread: " + Thread.currentThread().getName());
});
..then, the second task is running AFTER the first task and so forth. The loop is running in blocking and sequential mode (second task is only done after first task is finished).
What is the best way to process each task in parallel?
Is this the best way?
tasks.parallelStream().forEach(task -> {
    // long running task
    task.run();
    System.out.println("Thread: " + Thread.currentThread().getName());
});
According to "Should I always use a parallel stream when possible?", parallel streams should generally be avoided. In my case the tasks are independent of each other, so I do not need the synchronization overhead that comes with parallelStream(). However, there seems to be no option to disable that overhead when using parallelStream(). Or is there?
Is there a better way for my use case than parallelStream()?

In Java 8, parallelStream() uses ForkJoinPool.commonPool(), a JVM-wide shared pool with a fixed number of threads that is better suited to work following the "divide and conquer" paradigm. In your case, since the tasks are all isolated, an ExecutorService may be a better fit.

A good solution for you can be to use CompletableFuture.allOf. Use it like this:
ExecutorService ex = /* whatever executor you want */;
CompletableFuture<Void> all = CompletableFuture.allOf(tasks.stream()
        .map(task -> CompletableFuture.runAsync(() -> task.run(), ex))
        .toArray(CompletableFuture[]::new));
In doing so, the tasks run asynchronously and the calling thread is not blocked; call all.join() if you need to wait for them. Note that toArray() without arguments returns an Object[], and casting that to CompletableFuture<Void>[] fails at runtime with a ClassCastException, so pass the array constructor to toArray instead.
ExecutorService.submit fires off the task, but calling get() on the returned Future blocks until the result is available. With CompletableFuture you can instead attach a callback (for example thenAccept or thenRun) that runs once the result is ready, so no thread has to block. This is useful when you want to act on a result after all the parallel tasks have finished.
Some more explanation can be found here.
Also, in your original question, you asked if it is a good idea to use parallelStream and my answer to that would be that it isn't a good idea because if there is a task that blocks the thread then you will have problems (assuming you have used parallelStream all over the place in your code).
Also, CompletableFuture can accept its own thread pool (which you can customize yourself) and run there. Notice the second argument to runAsync in the above code.
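A minimal runnable sketch of the allOf approach, assuming the tasks are plain Runnables. The pool size of 4, the counter workload, and the names AllOfExample, runAll and countCompleted are hypothetical and only there for illustration:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class AllOfExample {

    // Runs every task on the given executor and blocks until all have finished.
    static void runAll(List<Runnable> tasks, ExecutorService executor) {
        CompletableFuture<?>[] futures = tasks.stream()
                .map(task -> CompletableFuture.runAsync(task, executor))
                .toArray(CompletableFuture[]::new);
        CompletableFuture.allOf(futures).join();
    }

    // Hypothetical workload: n tasks that each increment a shared counter.
    static int countCompleted(int n) {
        AtomicInteger counter = new AtomicInteger();
        List<Runnable> tasks = IntStream.range(0, n)
                .<Runnable>mapToObj(i -> counter::incrementAndGet)
                .collect(Collectors.toList());
        ExecutorService executor = Executors.newFixedThreadPool(4);
        try {
            runAll(tasks, executor);
        } finally {
            executor.shutdown();
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println("completed: " + countCompleted(1000));
    }
}
```

The join() at the end is only needed if the caller has to wait; without it, the caller can continue and attach a callback to the combined future.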
If you simply want to run the tasks and don't care about their results, ExecutorService.invokeAll is a good way to do it. Note that invokeAll itself blocks until all tasks have completed. You can use it like this:
executorService.invokeAll(tasks.stream().map(task -> new Callable<Void>() {
            @Override
            public Void call() throws Exception {
                // run task;
                return null;
            }
        })
        .collect(Collectors.toList()));
But why do you want to use a CompletableFuture with your own ExecutorService in such a case?
One good reason is the fluent error handling. You can see some examples here and here
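As a rough illustration of the fluent error handling mentioned above, here is a small sketch using exceptionally. The class name, the failing task and the fallback value are assumptions made up for the example:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ErrorHandlingExample {

    // Runs a task that may fail and maps any failure to a fallback value.
    static String runWithFallback(boolean fail) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            return CompletableFuture
                    .supplyAsync(() -> {
                        if (fail) {
                            throw new IllegalStateException("task failed");
                        }
                        return "ok";
                    }, executor)
                    // Invoked only if the previous stage completed exceptionally.
                    .exceptionally(ex -> "fallback")
                    .join();
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(runWithFallback(false));
        System.out.println(runWithFallback(true));
    }
}
```

With a plain Future, the same handling would need a try/catch around get() on the waiting thread.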

Related

Java 8 concurrency in AWS lambda?

We have an AWS lambda function that needs to perform a few checks by calling remote services. As soon as one of them returns false, the lambda can return; otherwise, all the checks need to finish to make sure none returns false. Right now I am using a parallel stream to run the tasks, as they can run independently.
In a situation that may not be rare, the main thread returns while one of the tasks is still running on its thread, or its thread is blocked waiting for I/O, because short-circuiting has seen a false from another task. The AWS Lambda documentation says that all threads in Lambda are frozen when the main thread returns, and that they thaw once the lambda handles the next request. Will the busy/blocked thread keep working on the original task after being re-activated, or will it take on the new task for the current request?
Would really appreciate it if Lambda gurus can share some insights.

I hope I understood correctly. You want to perform parallel activities while waiting for them to finish.
I just read in StackOverflow a comment saying the following:
Streams is about data-parallelism; data parallel problems are CPU-bound, not IO-bound. It seems that you're simply looking to run a number of mostly unrelated IO-intensive tasks concurrently. Use a plain-old thread pool for that; your first example is an ideal candidate for ExecutorService.invokeAll()
Maybe ExecutorService can help.
I don't know how your code is being structured but I can propose something like this:
int processors = Runtime.getRuntime().availableProcessors();
ExecutorService executorService = Executors.newFixedThreadPool(processors);
List<Callable<Boolean>> services = getURLToCheck().parallelStream()
        .map(this::checkService)
        .collect(Collectors.toList());
try {
    List<Future<Boolean>> futures = executorService.invokeAll(services);
    // do your validation with the concurrent tasks.
} catch (InterruptedException e) {
    // Handle as you wish
}
Where also:
private List<URL> getURLToCheck() {
    // Fetch your URL from wherever :)
}
private Callable<Boolean> checkService(URL url) {
    // Logic to check the service
}
The Future interface has two key methods that may be useful for you: isDone() and get().
The first one indicates whether the task has finished, and the second one waits for it to finish, rethrowing any exception that occurred inside the task wrapped in an ExecutionException. Maybe you can combine those methods to do the validation. Having a quick think, I imagined a while loop where you ask whether each future has finished and, if so, read the validation result and break out of the loop if it is false. But I don't like it haha.
I hope I made myself clear, and I hope that helps. If not, I tried my best.
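One way to avoid the polling loop is ExecutorCompletionService, which hands back futures in completion order. This is a different technique from the invokeAll snippet above, sketched here under assumptions: the class and method names are hypothetical, the pool is capped at 8 threads arbitrarily, and the checks are stand-ins that return fixed values instead of calling remote services.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ShortCircuitChecks {

    // Returns false as soon as any check returns false, true if all pass.
    static boolean allPass(List<Callable<Boolean>> checks)
            throws InterruptedException, ExecutionException {
        int threads = Math.max(1, Math.min(checks.size(), 8));
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        ExecutorCompletionService<Boolean> completion =
                new ExecutorCompletionService<>(executor);
        List<Future<Boolean>> futures = new ArrayList<>();
        try {
            for (Callable<Boolean> check : checks) {
                futures.add(completion.submit(check));
            }
            for (int i = 0; i < checks.size(); i++) {
                // take() yields futures in completion order, not submission order.
                if (!completion.take().get()) {
                    return false;
                }
            }
            return true;
        } finally {
            // Interrupt whatever is still running so no work outlives the call.
            futures.forEach(f -> f.cancel(true));
            executor.shutdownNow();
        }
    }

    // Hypothetical stand-in for remote checks: each callable returns a fixed outcome.
    static boolean checkAll(boolean... outcomes) {
        List<Callable<Boolean>> checks = new ArrayList<>();
        for (boolean outcome : outcomes) {
            checks.add(() -> outcome);
        }
        try {
            return allPass(checks);
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(checkAll(true, false, true));
    }
}
```

Cancelling the remaining futures before returning is a deliberate choice here: it asks still-running checks to stop instead of leaving them behind when the handler returns, although a check only stops promptly if it responds to interruption.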

CompletableFuture runAsync vs new Thread

Context: I've read this SO thread discussing the differences between CompletableFuture and Thread.
But I'm trying to understand when should I use new Thread() instead of runAsync().
I understand that runAsync() is more efficient for short/one-time parallel tasks because my program might avoid the overhead of creating a brand new thread, but should I also consider it for long-running operations?
What are the factors that I should be aware of before choosing one over the other?
Thanks everyone.

The difference between using the low-level concurrency APIs (such as Thread) and others is not just about the kind of work that you want to get done, it's also about the programming model and also how easy they make it to configure the environment in which the task runs.
In general, it is advisable to use higher-level APIs, such as CompletableFuture instead of directly using Threads.
I understand that runAsync() is more efficient for short/one-time parallel tasks
Well, maybe, assuming you call runAsync, counting on it to use the fork-join pool. The runAsync method is overloaded with a method that allows one to specify the java.util.concurrent.Executor with which the asynchronous task is executed.
Using an ExecutorService for more control over the thread pool and using CompletableFuture.runAsync or CompletableFuture.supplyAsync with a specified executor service (or not) are both generally preferred to creating and running a Thread object directly.
There's nothing particularly for or against using the CompletableFuture API for long-running tasks. But the choice one makes to use Threads has other implications as well, among which:
The CompletableFuture gives a better API for programming reactively, without forcing us to write explicit synchronization code. (we don't get this when using threads directly)
The Future interface (which is implemented by CompletableFuture) gives other additional, obvious advantages.
So, in short: you can (and probably should, if the alternative being considered is the Thread API) use the CompletableFuture API for your long-running tasks. To better control thread pools, you can combine it with executor services.
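To make the comparison concrete, here is a small sketch that does the same kind of work once with a hand-managed Thread and once with runAsync on an explicit executor. The class name, the single-thread executor and the log entries are assumptions chosen for the example:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ThreadVsRunAsync {

    static List<String> run() {
        List<String> log = Collections.synchronizedList(new ArrayList<>());

        // Low-level API: create, start and join the thread by hand.
        Thread thread = new Thread(() -> log.add("thread"));
        thread.start();
        try {
            thread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }

        // Higher-level API: the same kind of work on a pool we control,
        // with a follow-up step chained on instead of explicit synchronization.
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            CompletableFuture
                    .runAsync(() -> log.add("runAsync"), executor)
                    .thenRun(() -> log.add("follow-up"))
                    .join();
        } finally {
            executor.shutdown();
        }
        return new ArrayList<>(log);
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```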

The main difference is that CompletableFuture runs your task on ForkJoinPool.commonPool() by default, whereas a thread that you create and start yourself runs as a single standalone thread, not on a thread pool. Also, if you want to execute some tasks in sequence but asynchronously, you can do it like this:
CompletableFuture.runAsync(() -> {
    System.out.println("On first task");
    System.out.println("Thread : " + Thread.currentThread());
}).thenRun(() -> {
    System.out.println("On second task");
});
Output:
On first task
Thread : Thread[ForkJoinPool.commonPool-worker-1,5,main]
On second task
If you run the above code, you can see which pool CompletableFuture is using.
Note: the threads in ForkJoinPool.commonPool() are daemon threads.

Writing from Future of CachedThreadPool. Is my implementation incorrect?

I need help with my multithreading code.
I have a callable class which returns a value. I have a cachedThreadPool to submit ~60,000 tasks. I collect all the Futures in a List. After the ExecutorService has shut down, I loop through the list of Futures and write the returned values using a bufferedWriter. Is this a correct way to implement it?
ExecutorService execService = Executors.newCachedThreadPool();
List<Future<ValidationDataObject<String, Boolean>>> futureList = new ArrayList<>();
for (int i = 0; i < emailArrayList.size(); i++) {
    String emailAddress = emailArrayList.get(i);
    ValidateEmail validateEmail = new ValidateEmail(emailAddress);
    Future<ValidationDataObject<String, Boolean>> future =
            execService.submit(validateEmail);
    futureList.add(future);
}
execService.shutdown();
for (Future<ValidationDataObject<String, Boolean>> future : futureList) {
    ValidationDataObject<String, Boolean> validationObject = future.get();
    bufferedWriter.write(validationObject.getEmailAddress() + "|"
            + validationObject.getIsValid());
    bufferedWriter.newLine();
    bufferedWriter.flush();
}
if (execService.isTerminated()) bufferedWriter.close();
Should I use a synchronized block for the bufferedWriter? I am thinking it doesn't need to be synchronized because I am using the bufferedWriter from the main thread, right?

I have a cachedThreadPool to submit ~60,000 tasks.
Off the bat, a cached thread pool and 60k tasks is a red flag. A cached pool creates a new thread whenever no idle one is available, so it may start up to 60k threads, which I doubt you really want. You should use a fixed thread pool and vary the number of threads until you achieve a good balance of throughput versus overwhelming your server. Maybe start with 2x the number of CPUs and then adjust it depending on the server load.
You might also consider using a fixed-size queue, which will limit the number of outstanding tasks, although 60k tasks is fine unless those objects are heavy.
I collect all the Futures in a List. After the ExecutorService has shut down, I loop through the list of Futures and write the returned values using a bufferedWriter. Is this a correct way to implement it?
Yes, that's a good pattern. You don't show the writer being created but it is certainly fine for the main thread to own that.
Should I use a synchronized block for the bufferedWriter? I am thinking it doesn't need to be synchronized because I am using the bufferedWriter from the main thread, right?
Right. No other threads are using it so that's fine. It is a very typical pattern to have a writer thread managing the output of a multi-thread application.
One final comment is that you might want to look at the ExecutorCompletionService, which allows you to process the tasks as they finish instead of having to wait for them in order. You might require the output to be in order, in which case this isn't helpful, but it's good technology to know about anyway.

Apart from the fact that executor.shutdown() will most likely not do what you believe it does (it simply stops the executor from accepting new tasks; it does not wait for all tasks to terminate), your code looks fine.
You are right, there is no need for synchronization with respect to the writer, as you access it from a single thread only.
There are things that can be improved, though. Firstly, you are not doing a lot of exception handling. Future.get() will throw an ExecutionException if the Callable hits an exception.
I'm not certain how large the deviations in execution time of your Callables are. Assuming there are notable deviations, look at the following case: say we submit Callables A, B and C, and you receive FutA, FutB and FutC. Calling the get methods will block until the calculation behind the Future is finished. In your setting, you might be waiting for FutA to complete while FutB and FutC are already finished and ready for writing. The worst case here is that processing of A delays writing for all 60000 tasks.
I think I would go for another approach, where every Callable gets a reference to the same ConcurrentLinkedQueue and, instead of returning the result via a Future, writes the result into that queue. In this scenario, the ordering of the results does not depend on the ordering of the Callables but on the time at which the Callables finish execution. Whether or not this results in a speedup depends on your setting (especially the time to write a result and the deviation in execution times of the Callables).
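On the shutdown() point at the start of this answer: if the caller really needs to wait for the pool to drain, shutdown() is usually paired with awaitTermination(). A minimal sketch, where the class name, the pool size of 4 and the 30-second timeout are arbitrary assumptions:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ShutdownExample {

    // Submits n short tasks, waits for the pool to drain, returns how many ran.
    static int runAndWait(int n) {
        AtomicInteger done = new AtomicInteger();
        ExecutorService executor = Executors.newFixedThreadPool(4);
        for (int i = 0; i < n; i++) {
            executor.execute(done::incrementAndGet);
        }
        // Stops accepting new tasks and returns immediately; queued tasks still run.
        executor.shutdown();
        try {
            // Blocks until all tasks have finished or the timeout elapses.
            if (!executor.awaitTermination(30, TimeUnit.SECONDS)) {
                executor.shutdownNow();
            }
        } catch (InterruptedException e) {
            executor.shutdownNow();
            Thread.currentThread().interrupt();
        }
        return done.get();
    }

    public static void main(String[] args) {
        System.out.println("tasks run: " + runAndWait(100));
    }
}
```

In the question's code, calling get() on every Future already waits for each task, so the explicit wait matters mainly when results are not collected through futures.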

ExecutorService.submit(Task) vs CompletableFuture.supplyAsync(Task, Executor)

To run some stuff in parallel or asynchronously I can use either an ExecutorService: <T> Future<T> submit(Runnable task, T result); or the CompletableFuture API: static <U> CompletableFuture<U> supplyAsync(Supplier<U> supplier, Executor executor);
(Let's assume I use the same Executor in both cases)
Besides the return type, Future vs. CompletableFuture, are there any notable differences? Or when should I use which?
And what are the differences if I use the CompletableFuture API with the default Executor (the method without an executor)?

Besides the return type, Future vs. CompletableFuture, are there any notable differences? Or when should I use which?
It's rather simple really. You use the Future when you want the executing thread to wait for async computation response. An example of this is with a parallel merge/sort. Sort left asynchronously, sort right synchronously, wait on left to complete (future.get()), merge results.
You use a CompletableFuture when you want some action executed with the result after completion, asynchronously from the requesting thread. For instance: I want to do some computation asynchronously and, when it is done, write the results to some system. The requesting thread may not need to wait on a result then.
You can mimic the above example in a single Future executable, but the CompletableFuture offers a more fluent interface with better error handling.
It really depends on what you want to do.
And what are the differences if I use the CompletableFuture API with the default Executor (the method without an executor)?
It will delegate to ForkJoinPool.commonPool(), whose parallelism defaults to the number of CPUs on your system minus one. If you are doing something IO intensive (reading and writing to the file system), you should supply your own thread pool, sized for that workload.
If it's CPU intensive, using the commonPool makes most sense.
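A short sketch of the two variants side by side. The pool size of 16, the thread name "io-worker" and the class name are assumptions picked for illustration, not recommendations:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ForkJoinPool;

public class ExecutorChoiceExample {

    // Returns the name of the thread that ran a supplyAsync task on a dedicated pool.
    static String threadNameOnIoPool() {
        ExecutorService ioPool = Executors.newFixedThreadPool(16, runnable -> {
            Thread t = new Thread(runnable, "io-worker");
            t.setDaemon(true);
            return t;
        });
        try {
            return CompletableFuture
                    .supplyAsync(() -> Thread.currentThread().getName(), ioPool)
                    .join();
        } finally {
            ioPool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("common pool parallelism: "
                + ForkJoinPool.commonPool().getParallelism());
        // No executor argument: the task runs on the default executor.
        System.out.println("default executor thread: "
                + CompletableFuture.supplyAsync(() -> Thread.currentThread().getName()).join());
        // Explicit executor: the task runs on the pool passed in.
        System.out.println("explicit executor thread: " + threadNameOnIoPool());
    }
}
```

The thread names printed for the default executor depend on the machine, so no expected output is shown for them.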

CompletableFuture has rich features like chaining multiple futures, combining the futures, executing some action after future is executed (both synchronously as well as asynchronously), etc.
However, CompletableFuture is no different from Future in terms of raw performance. Even when we combine multiple instances of CompletableFuture (using .thenCombine, with .join at the end), the invoking thread is still blocked while it waits in .get or .join for the result. I feel that, in terms of performance, this is not better than Future.
Please let me know if I am missing some aspect of performance here.

This clarified the difference between Future and CompletableFuture a bit more for me: Difference between Future and Promise
CompletableFuture is more like a promise.

Java parallelStream() with custom pool with caller work stealing?

Normally when one uses Java 8's parallelStream(), the result is execution via the default, common fork-join pool (i.e. ForkJoinPool.commonPool()).
That is clearly undesirable, however, if one has work that is far from CPU bound, e.g. may be waiting on IO much of the time. In such cases one will want to use a separate pool, sized according to other criteria (e.g. how much of the time the tasks are likely to be actually using the CPU).
There's no obvious means of getting parallelStream() to use a different pool, but there is a way as detailed here.
Unfortunately, that approach entails invoking the terminal operation on the parallel stream from a fork-join pool thread. The downside of this is that if the target-fork join pool is completely busy with existing work, the whole execution will wait on it while doing absolutely nothing. Thus the pool can become a bottleneck worse than single threaded execution. By contrast, when one uses parallelStream() in the "normal" fashion, ForkJoinPool.common.externalHelpComplete() or ForkJoinPool.common.tryExternalUnpush() are used to let the calling thread from outside the pool help in the processing.
Does anyone know of a way to both get parallelStream() to use a non-default fork-join pool and have a calling thread from outside the fork-join pool help in the processing of this work (but not the rest of the fork-join pool's work)?

You can use awaitQuiescence on the pool to help out. However, you can’t select which task(s) you will help with; it will just take the next pending task from the pool. Thus, if there are more pending tasks, you might end up executing those before getting to your own.
ForkJoinPool forkJoinPool = new ForkJoinPool(1);
// make all threads busy:
forkJoinPool.submit(() -> LockSupport.parkNanos(Long.MAX_VALUE));
// submit our task (may contain your stream operation)
ForkJoinTask<Thread> task = forkJoinPool.submit(() -> Thread.currentThread());
// help out
while (!task.isDone()) // use zero timeout to execute one task only
    forkJoinPool.awaitQuiescence(0, TimeUnit.NANOSECONDS);
System.out.println(Thread.currentThread() == task.get());
will print true.
whereas
ForkJoinPool forkJoinPool = new ForkJoinPool(1);
// make all threads busy:
forkJoinPool.submit(() -> LockSupport.parkNanos(Long.MAX_VALUE));
// overload:
forkJoinPool.submit(() -> LockSupport.parkNanos(Long.MAX_VALUE));
// submit our task (may contain your stream operation)
ForkJoinTask<Thread> task = forkJoinPool.submit(() -> Thread.currentThread());
// help out
while (!task.isDone())
    forkJoinPool.awaitQuiescence(0, TimeUnit.NANOSECONDS);
System.out.println(Thread.currentThread() == task.get());
will hang forever as it attempts to execute the second blocking task.
Nevertheless, it will let the initiating thread help processing the pool’s pending tasks which will raise the chance of its own task getting executed as long as there are no infinite tasks (the example above is extreme and only chosen for demonstration).
But note that the entire relationship between the Fork/Join framework and the Stream API is an implementation detail anyway.
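For context, the custom-pool trick the question refers to is commonly written along the following lines: the terminal operation is invoked from inside a task submitted to the chosen ForkJoinPool. This is a sketch under assumptions (the class name, method name and parallelism are hypothetical), and it relies on the implementation detail just mentioned, not on documented Stream API behaviour:

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ForkJoinPool;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class CustomPoolStream {

    // Sums the values with a parallel stream whose terminal operation is
    // started from a worker thread of a dedicated ForkJoinPool.
    static int sumInPool(List<Integer> values, int parallelism) {
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            Callable<Integer> job =
                    () -> values.parallelStream().mapToInt(Integer::intValue).sum();
            // The calling thread only waits here; it does not help with the work.
            return pool.submit(job).get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Integer> values = IntStream.rangeClosed(1, 100)
                .boxed()
                .collect(Collectors.toList());
        System.out.println(sumInPool(values, 4));
    }
}
```

The blocking get() is exactly the drawback the question describes; replacing it with the awaitQuiescence loop from this answer lets the caller help, with the caveats already noted.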
