How to aggregate results from making CompletableFuture calls in a loop?

I am just learning and trying to apply CompletableFuture to my problem statement. I have a list of items I am iterating over.
Prop is a class with only two attributes, prop1 and prop2, and their respective getters and setters.
List<Prop> result = new ArrayList<>();
for (Item item : items) {
    item.load();
    Prop temp = new Prop();
    // once the item is loaded, get its properties
    temp.setProp1(item.getProp1());
    temp.setProp2(item.getProp2());
    result.add(temp);
}
return result;
However, item.load() here is a blocking call. So I was thinking of using CompletableFuture, something like below:
for (Item item : items) {
    CompletableFuture<Prop> prop = CompletableFuture.supplyAsync(() -> {
        try {
            item.load();
            return item;
        } catch (Exception e) {
            logger.error("Error");
            return null;
        }
    }).thenApply(item1 -> {
        try {
            Prop temp = new Prop();
            // once the item is loaded, get its properties
            temp.setProp1(item1.getProp1());
            temp.setProp2(item1.getProp2());
            return temp;
        } catch (Exception e) {
            return null;
        }
    });
}
But I am not sure how I can wait for all the items to be loaded and then aggregate and return their result.
I may be completely wrong in the way of implementing CompletableFutures since this is my first attempt. Please pardon any mistake. Thanks in advance for any help.

There are two issues with your approach of using CompletableFuture.
First, you say item.load() is a blocking call, so the CompletableFuture’s default executor is not suitable for it, as it tries to achieve a level of parallelism matching the number of CPU cores. You could solve this by passing a different Executor to CompletableFuture’s asynchronous methods, but your load() method doesn’t return a value that your subsequent operations rely on. So the use of CompletableFuture complicates the design without a benefit.
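For reference, if you nevertheless wanted to stay with CompletableFuture, a minimal sketch could look like the following (my addition, not part of the original answer; the pool size of 10 is an arbitrary assumption for blocking I/O):
ExecutorService executor = Executors.newFixedThreadPool(10); // sized for blocking calls, not CPU cores
List<CompletableFuture<Prop>> futures = items.stream()
    .map(item -> CompletableFuture.supplyAsync(() -> {
        item.load();                      // blocking call runs on the dedicated pool
        Prop temp = new Prop();
        temp.setProp1(item.getProp1());
        temp.setProp2(item.getProp2());
        return temp;
    }, executor))
    .collect(Collectors.toList());
// join() waits for each future; mapping in list order preserves the original item order
List<Prop> result = futures.stream()
    .map(CompletableFuture::join)
    .collect(Collectors.toList());
executor.shutdown();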
You can perform the load() invocations asynchronously and wait for their completion just using an ExecutorService, followed by the loop as-is (without the already performed load() operation, of course):
ExecutorService es = Executors.newCachedThreadPool();
es.invokeAll(items.stream()
    .map(i -> Executors.callable(i::load))
    .collect(Collectors.toList()));
es.shutdown();
List<Prop> result = new ArrayList<>();
for (Item item : items) {
    Prop temp = new Prop();
    // once the item is loaded, get its properties
    temp.setProp1(item.getProp1());
    temp.setProp2(item.getProp2());
    result.add(temp);
}
return result;
You can control the level of parallelism through the choice of the executor, e.g. you could use Executors.newFixedThreadPool(numberOfThreads) instead of the unbounded thread pool.

Related

How to check when all CompletableFutures are done?

I have a Stream<Item> which I'm mapping to a CompletableFuture<ItemResult>.
What I'd like to do is to know when all the futures are completed.
One may suggest to:
collect all the futures to an array and use CompletableFuture.allOf(). This is somewhat problematic since there could be hundreds of thousands of items.
just continue with forEach(CompletableFuture::join). This is problematic too, as calling forEach with join will just block the stream, essentially making it serial processing and not concurrent.
inject a poisoned item at the end of the stream. This could work, but it's not that elegant in my view.
check if the executor queue is empty. This is quite limiting, because I might use more than one executor in the future. Also, the queue can be momentarily empty.
monitor the database instead and check the number of new items.
I feel like all the suggested solutions aren't good enough.
What is the appropriate way to monitor the futures?
Thanks
EDIT:
another (vague) idea I had in mind is to use a counter and wait for it to go down to zero. But again, I'd need to check that it's not just momentarily at zero...
Disclaimer: I'm not sure whether Phaser is the right tool here, and if yes, whether it's better to have one root with multiple children or to chain them like I'm proposing below, so feel free to correct me.
Here's one approach that uses Phaser.
A Phaser has a limited number of parties, so we need to create a new child Phaser if that limit is about to be reached:
private Phaser register(Phaser phaser) {
    if (phaser.getRegisteredParties() < 65534) {
        // warning: side-effect,
        // conflicts with AtomicReference#updateAndGet recommendation,
        // might not fit well if the Stream is parallel:
        phaser.register();
        return phaser;
    } else {
        return new Phaser(phaser, 1);
    }
}
Register each CompletableFuture against that Phaser chain, and deregister once done:
private void register(CompletableFuture<?> future, AtomicReference<Phaser> phaser) {
    Phaser registeredPhaser = phaser.updateAndGet(this::register);
    future
        .thenRun(registeredPhaser::arriveAndDeregister)
        .exceptionally(e -> {
            // log e?
            registeredPhaser.arriveAndDeregister();
            return null;
        });
}
}
Wait for all futures to be finished:
private <T> void await(Stream<CompletableFuture<T>> futures) {
    Phaser rootPhaser = new Phaser(1);
    AtomicReference<Phaser> phaser = new AtomicReference<>(rootPhaser);
    futures.forEach(future -> register(future, phaser));
    rootPhaser.arriveAndAwaitAdvance();
    rootPhaser.arriveAndDeregister();
}
Example:
ExecutorService executor = Executors.newFixedThreadPool(500);
// creating fake stream with 500,000 futures:
Stream<CompletableFuture<Integer>> stream = IntStream
    .rangeClosed(1, 500_000)
    .mapToObj(i -> CompletableFuture.supplyAsync(() -> {
        try {
            TimeUnit.MILLISECONDS.sleep(10);
            if (i % 50_000 == 0) {
                System.out.println(Thread.currentThread().getName() + ": " + i);
            }
            return i;
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }, executor));
// usage:
await(stream);
System.out.println("Done");
Outputs:
pool-1-thread-348: 50000
pool-1-thread-395: 100000
pool-1-thread-333: 150000
pool-1-thread-30: 200000
pool-1-thread-120: 250000
pool-1-thread-10: 300000
pool-1-thread-241: 350000
pool-1-thread-340: 400000
pool-1-thread-283: 450000
pool-1-thread-176: 500000
Done

Java ForkJoinPool with .forEach and .add

I have a List of TicketDTO objects, where every TicketDTO needs to go through a function that converts the data to TicketDataDTO. What I want here is to reduce the time it takes for this code to run, because when the list is bigger it takes a lot of time to convert, and that's unacceptable for fetching the data through a GET mapping. However, when I try to implement ForkJoinPool along with parallelStream() (code below) to get it done, my returned List is empty. Can someone tell me what I am doing wrong?
@Override
public List<TicketDataDTO> getOtrsTickets(String value, String startDate, String endDate, String product, String user) {
    // TODO Implement threads
    List<TicketDTO> tickets = ticketDao.findOtrsTickets(value, startDate, endDate, product, user);
    Stream<TicketDTO> ticketsStream = tickets.parallelStream();
    List<TicketDataDTO> data = new ArrayList<TicketDataDTO>();
    ForkJoinPool forkJoinPool = new ForkJoinPool(6);
    forkJoinPool.submit(() -> {
        try {
            ticketsStream.forEach(ticket -> data.add(createTicketData(ticket)));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    });
    forkJoinPool.shutdown();
    //ticketsStream.forEach(ticket -> data.add(createTicketData(ticket)));
    return data;
}
createTicketData is just a function with two for loops and one switch statement to create some new columns I need as an output.
In addition to calling shutdown() on the ForkJoinPool, you have to wait for its termination, like
forkJoinPool.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
If you do not wait for the termination, data will be returned before the threads have the chance to add their results to it.
See
How to wait for all threads to finish, using ExecutorService?
for more details
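Note also that the snippet mutates a plain ArrayList from multiple threads, which is itself not thread-safe. A hedged sketch of a variant that sidesteps both problems by collecting inside the submitted task (my addition; it assumes createTicketData touches no shared mutable state):
ForkJoinPool forkJoinPool = new ForkJoinPool(6);
try {
    // the task returns the complete list, so no shared list is mutated concurrently
    return forkJoinPool.submit(() ->
            tickets.parallelStream()
                    .map(this::createTicketData)
                    .collect(Collectors.toList()))
            .get(); // blocks until the parallel computation finishes
} catch (InterruptedException | ExecutionException e) {
    throw new RuntimeException(e);
} finally {
    forkJoinPool.shutdown();
}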

Parallelizing deserialization step

There is the following pipeline:
item is produced (the producer is external to the pipeline);
item is deserialized (JSON to Java object);
item is processed;
At the moment it all happens synchronously in a single thread:
while (producer.next()) {
    var item = gson.deserialize(producer.item());
    processItem(item);
}
Or schematically:
PRODUCER -> DESERIALIZATION -> CONSUMER
 (sync)         (sync)          (sync)
The observation is that the deserialization step has no side-effects and could therefore be parallelized, saving some wall-clock time.
The overall code should look like the following:
var pipeline = new Pipeline<Item>();
pipeline.setProducer(producer);
pipeline.setDeserialization(gson::deserialize);
pipeline.setConsumer(item -> {
    ...
});
pipeline.run();
Or schematically:
         -> DESERIALIZATION ->
         -> DESERIALIZATION ->
         -> DESERIALIZATION ->
PRODUCER ->      ...         -> CONSUMER
         -> DESERIALIZATION ->
         -> DESERIALIZATION ->
         -> DESERIALIZATION ->
 (sync)       (parallel)        (sync)
Important notice. Deserialized items should be produced:
synchronously;
in the same order the original producer produces encoded items.
Q. Is there a standardized way to code such a pipeline?
Try
while (producer.next()) {
    CompletableFuture.supplyAsync(() -> gson.deserialize(producer.item()))
        .thenAcceptAsync(item -> processItem(item));
}
One way you can achieve your pattern is to:
Construct a multi-threaded executor to process the decoding requests
Have a consumer queue; each time you submit an item to be decoded, also add the corresponding Future object to the consumer queue
Have a consumer thread sit waiting to take items off the queue [which therefore consumes them in the order they were posted], call the corresponding get() method [which waits for the item to be decoded]
So the 'consumer' would look like this:
BlockingQueue<Future<Item>> consumerQueue = new LinkedBlockingDeque<>();
Thread consumerThread = new Thread(() -> {
    try {
        while (true) {
            Future<Item> item = consumerQueue.take();
            try {
                // Get the next decoded item that's ready
                Item decodedItem = item.get();
                // 'Consume' the item
                ...
            } catch (ExecutionException ex) {
                // decoding failed - log/handle as required
            }
        }
    } catch (InterruptedException irr) {
        // shutdown requested
    }
});
consumerThread.start();
Meanwhile, the 'producer' end, with its multi-threaded 'decoder', would look like this:
ExecutorService decoder = Executors.newFixedThreadPool(4);
while (producer.hasNext()) {
    Item item = producer.next();
    // Submit the decode job for asynchronous processing
    Future<Item> p = decoder.submit(() -> {
        item.decode();
    }, item);
    // Also queue this decode job for future consumption once complete
    consumerQueue.add(p);
}
As a separate matter, I wonder if you will actually see much benefit in practice, since by insisting on consumption in the same order, you are inherently introducing a serial condition on the process. But technically, this is one way that you could achieve what you are after.
P.S. If you didn't want a separate consumer thread, then the same 'producer' thread could poll the queue for completed items and execute in line.
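A hedged sketch of that inline variant (my addition; it reuses the hypothetical producer/decode API from the snippets above, and the enclosing method is assumed to declare throws Exception to keep the get() calls short):
ExecutorService decoder = Executors.newFixedThreadPool(4);
Deque<Future<Item>> pending = new ArrayDeque<>();
while (producer.hasNext()) {
    Item raw = producer.next();
    pending.add(decoder.submit(() -> { raw.decode(); return raw; }));
    // consume whatever has already finished, strictly in submission order
    while (!pending.isEmpty() && pending.peek().isDone()) {
        processItem(pending.poll().get());
    }
}
// the producer is exhausted; drain the remaining decode jobs
while (!pending.isEmpty()) {
    processItem(pending.poll().get());
}
decoder.shutdown();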

How to prioritise waiting CompletableFutures by access time instead of creation time?

TL;DR: When several CompletableFutures are waiting to get executed, how can I prioritize those whose values I'm interested in?
I have a list of 10,000 CompletableFutures (which calculate the data rows for an internal report over the product database):
List<Product> products = ...;
List<CompletableFuture<DataRow>> dataRows = products
    .stream()
    .map(p -> CompletableFuture.supplyAsync(() -> calculateDataRowForProduct(p), singleThreadedExecutor))
    .collect(Collectors.toList());
Each takes around 50ms to complete, so the entire thing finishes in 500sec. (they all share the same DB connection, so cannot run in parallel).
Let's say I want to access the data row of the 9000th product:
dataRows.get(9000).join()
The problem is, all these CompletableFutures are executed in the order they have been created, not in the order they are accessed. Which means I have to wait 450sec for it to calculate stuff that at the moment I don't care about, to finally get to the data row I want.
Question:
Is there any way to change this behaviour, so that the Futures I try to access get priority over those I don't care about at the moment?
First thoughts:
I noticed that a ThreadPoolExecutor uses a BlockingQueue<Runnable> to queue up entries waiting for an available Thread.
So I thought about using a PriorityBlockingQueue, to change the priority of the Runnable when I access its CompletableFuture but:
PriorityBlockingQueue does not have a method to reprioritize an existing element, and
I need to figure out a way to get from the CompletableFuture to the corresponding Runnable entry in the queue.
Before I go further down this road: do you think this sounds like the correct approach? Have others ever had this kind of requirement? I tried to search for it, but found exactly nothing. Maybe CompletableFuture is not the correct way of doing this?
Background:
We have an internal report which displays 100 products per page. Initially we precalculated all DataRows for the report, which took way too long if someone has that many products.
So first optimization was to wrap the calculation in a memoized supplier:
List<Supplier<DataRow>> dataRows = products
    .stream()
    .map(p -> Suppliers.memoize(() -> calculateDataRowForProduct(p)))
    .collect(Collectors.toList());
This means that initial display of first 100 entries now takes 5sec instead of 500sec (which is great), but when the user switches to the next pages, it takes another 5sec for each single one of them.
So the idea is, while the user is staring at the first screen, why not precalculate the next pages in the background. Which leads me to my question above.
Interesting problem :)
One way is to roll out a custom FutureTask class to facilitate changing priorities of tasks dynamically.
DataRow and Product are both taken as just String here for simplicity.
import java.util.*;
import java.util.concurrent.*;

public class Testing {

    private static String calculateDataRowForProduct(String product) {
        try {
            // Dummy operation.
            Thread.sleep(200);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        System.out.println("Computation done for " + product);
        return "data row for " + product;
    }

    public static void main(String[] args) throws ExecutionException, InterruptedException {
        PriorityBlockingQueue<Runnable> customQueue = new PriorityBlockingQueue<Runnable>(1, new CustomRunnableComparator());
        ThreadPoolExecutor executor = new ThreadPoolExecutor(1, 1, 0L, TimeUnit.MILLISECONDS, customQueue);

        List<String> products = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            products.add("product" + i);
        }

        Map<Integer, PrioritizedFutureTask<String>> taskIndexMap = new HashMap<>();
        for (int i = 0; i < products.size(); i++) {
            String product = products.get(i);
            Callable<String> callable = () -> calculateDataRowForProduct(product);
            PrioritizedFutureTask<String> dataRowFutureTask = new PrioritizedFutureTask<>(callable, i);
            taskIndexMap.put(i, dataRowFutureTask);
            executor.execute(dataRowFutureTask);
        }

        List<Integer> accessOrder = new ArrayList<>();
        accessOrder.add(4);
        accessOrder.add(7);
        accessOrder.add(2);
        accessOrder.add(9);
        int priority = -1 * accessOrder.size();
        for (Integer nextIndex : accessOrder) {
            PrioritizedFutureTask<String> taskAtIndex = taskIndexMap.get(nextIndex);
            boolean removed = customQueue.remove(taskAtIndex); // keep the side-effect out of the assert
            assert removed;
            customQueue.offer(taskAtIndex.set_priority(priority++));
            // Now this task will be at the front of the thread pool queue.
            // Hence this task will execute next.
        }

        for (Integer nextIndex : accessOrder) {
            PrioritizedFutureTask<String> dataRowFutureTask = taskIndexMap.get(nextIndex);
            String dataRow = dataRowFutureTask.get();
            System.out.println("Data row for index " + nextIndex + " = " + dataRow);
        }
    }
}

class PrioritizedFutureTask<T> extends FutureTask<T> implements Comparable<PrioritizedFutureTask<T>> {

    private Integer _priority = 0;
    private Callable<T> callable;

    public PrioritizedFutureTask(Callable<T> callable, Integer priority) {
        super(callable);
        this.callable = callable;
        _priority = priority;
    }

    public Integer get_priority() {
        return _priority;
    }

    public PrioritizedFutureTask<T> set_priority(Integer priority) {
        _priority = priority;
        return this;
    }

    @Override
    public int compareTo(PrioritizedFutureTask<T> other) {
        if (other == null) {
            throw new NullPointerException();
        }
        return get_priority().compareTo(other.get_priority());
    }
}

class CustomRunnableComparator implements Comparator<Runnable> {
    @Override
    public int compare(Runnable task1, Runnable task2) {
        return ((PrioritizedFutureTask) task1).compareTo((PrioritizedFutureTask) task2);
    }
}
Output:
Computation done for product0
Computation done for product4
Data row for index 4 = data row for product4
Computation done for product7
Data row for index 7 = data row for product7
Computation done for product2
Data row for index 2 = data row for product2
Computation done for product9
Data row for index 9 = data row for product9
Computation done for product1
Computation done for product3
Computation done for product5
Computation done for product6
Computation done for product8
There is one more scope for optimization here.
The customQueue.remove(taskAtIndex) operation has O(n) time complexity with respect to the size of the queue (i.e. the total number of products).
It might not matter much if the number of products is small (<= 10^5), but it might result in a performance issue otherwise.
One solution is to extend PriorityBlockingQueue and add the ability to remove an element from the priority queue in O(log n) rather than O(n).
We can achieve that by keeping a hashmap inside the priority-queue structure, mapping each element to its index (or indices, in case of duplicates) in the underlying array.
Fortunately, I had already implemented such a heap in Python some time back.
If you have more questions on this optimization, it's probably better to ask a new question altogether.
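For illustration, here is a minimal Java sketch of that idea (my addition, not the answer's Python implementation; it assumes distinct, non-null elements and, unlike PriorityBlockingQueue, is not thread-safe):
import java.util.*;

// Minimal indexed min-heap: remove(item) runs in O(log n) because a side map
// tracks each element's current position in the backing array.
class IndexedHeap<T extends Comparable<T>> {
    private final List<T> heap = new ArrayList<>();
    private final Map<T, Integer> indexOf = new HashMap<>(); // assumes distinct elements

    public void add(T item) {
        heap.add(item);
        indexOf.put(item, heap.size() - 1);
        siftUp(heap.size() - 1);
    }

    public boolean remove(T item) {
        Integer i = indexOf.get(item);
        if (i == null) return false;
        int last = heap.size() - 1;
        swap(i, last);            // move the target to the end
        heap.remove(last);
        indexOf.remove(item);
        if (i != last) {          // restore heap order at position i
            siftUp(i);
            siftDown(i);
        }
        return true;
    }

    private void swap(int a, int b) {
        T ta = heap.get(a), tb = heap.get(b);
        heap.set(a, tb);
        heap.set(b, ta);
        indexOf.put(tb, a);
        indexOf.put(ta, b);
    }

    private void siftUp(int i) {
        while (i > 0) {
            int parent = (i - 1) / 2;
            if (heap.get(i).compareTo(heap.get(parent)) >= 0) break;
            swap(i, parent);
            i = parent;
        }
    }

    private void siftDown(int i) {
        int n = heap.size();
        while (true) {
            int left = 2 * i + 1, right = left + 1, smallest = i;
            if (left < n && heap.get(left).compareTo(heap.get(smallest)) < 0) smallest = left;
            if (right < n && heap.get(right).compareTo(heap.get(smallest)) < 0) smallest = right;
            if (smallest == i) break;
            swap(i, smallest);
            i = smallest;
        }
    }
}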
You could avoid submitting all of the tasks to the executor at the start; instead, only submit one background task, and when it finishes, submit the next. If you want the 9000th row, submit it immediately (if it has not already been submitted):
static class FutureDataRow {
    CompletableFuture<DataRow> future;
    int index;
    List<FutureDataRow> list;
    Product product;

    FutureDataRow(List<FutureDataRow> list, Product product) {
        this.list = list;
        index = list.size();
        list.add(this);
        this.product = product;
    }

    public DataRow get() {
        submit();
        return future.join();
    }

    private synchronized void submit() {
        if (future == null) future = CompletableFuture.supplyAsync(() ->
                calculateDataRowForProduct(product), singleThreadedExecutor);
    }

    private void background() {
        submit();
        if (index >= list.size() - 1) return;
        future.whenComplete((dr, t) -> list.get(index + 1).background());
    }
}
...
List<FutureDataRow> dataRows = new ArrayList<>();
products.forEach(p -> new FutureDataRow(dataRows, p));
dataRows.get(0).background();
If you want, you could also submit the next row inside the get method, if you expect that the user will navigate to the next page afterwards.
If you were instead using a multithreaded executor and you wanted to run multiple background tasks concurrently you could modify the background method to find the next unsubmitted task in the list and start it when the current background task has finished.
private synchronized boolean background() {
    if (future != null) return false;
    submit();
    future.whenComplete((dr, t) -> {
        for (int i = index + 1; i < list.size(); i++) {
            if (list.get(i).background()) return;
        }
    });
    return true;
}
You would also need to start the first n tasks in the background instead of just the first one.
int n = 8; // number of active background tasks
for (int i = 0; i < dataRows.size() && n > 0; i++) {
    if (dataRows.get(i).background()) n--;
}
To answer my own question...
There is a surprisingly simple (and surprisingly boring) solution to my problem. I have no idea why it took me three days to find it; I guess it required the right mindset, which you only have when walking along an endless tranquilizing beach, looking into the sunset on a quiet Sunday evening.
So, ah, it's a little bit embarrassing to write this, but when I need to fetch a certain value (say, for the 9000th product), and the future has not yet computed that value, I can, instead of somehow forcing the future to produce that value asap (by doing all this reprioritization and scheduling magic), I can, well, I can ... simply ... compute that value myself! Yes! Wait, what? Seriously, that's it?
It's something like this: if (!future.isDone()) { future.complete(supplier.get()); }
I just need to store the original Supplier alongside the CompletableFuture in some wrapper class. This is the wrapper class; it works like a charm, and all it needs is a better name:
public static class FuturizedMemoizedSupplier<T> implements Supplier<T> {

    private CompletableFuture<T> future;
    private Supplier<T> supplier;

    public FuturizedMemoizedSupplier(Supplier<T> supplier) {
        this.supplier = supplier;
        this.future = CompletableFuture.supplyAsync(supplier, singleThreadExecutor);
    }

    public T get() {
        // if the future is not yet completed, we just calculate the value ourselves, and set it into the future
        if (!future.isDone()) {
            future.complete(supplier.get());
        }
        supplier = null;
        return future.join();
    }
}
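Usage might then look like this (a sketch, assuming the same products list and singleThreadExecutor as before):
List<Supplier<DataRow>> dataRows = products.stream()
    .map(p -> new FuturizedMemoizedSupplier<DataRow>(() -> calculateDataRowForProduct(p)))
    .collect(Collectors.toList());
// later, when the user jumps ahead:
DataRow row = dataRows.get(9000).get(); // computed inline if the background thread hasn't reached it yet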
Now, I think, there is a small chance for a race condition here, which could lead to the supplier being executed twice. But actually, I don't care, it produces the same value anyway.
Afterthoughts:
I have no idea why I didn't think of this earlier. I was completely fixated on the idea that it has to be the CompletableFuture which calculates the value, that it has to run in one of these background threads, and whatnot, and, well, none of these mattered or were in any way a requirement.
I think this whole question is a classic example of "ask what problem you really want to solve" instead of coming up with a half-baked broken solution and asking how to fix that. In the end, I didn't care about CompletableFuture or any of its features at all; it was just the easiest way that came to my mind to run something in the background.
Thanks for your help!

Concurrent iteration and deletion from Set in Java

I have a pre-populated set of strings. I want to iterate over the items and, while iterating, I need to "do work" which might also remove the item from the set. I want to spawn a new thread for each item's "do work". Please note that only some items are removed from the set during "do work".
Now I have the following question:
Can I achieve this by simply using Collections.synchronizedSet(new HashSet())? I am guessing this will throw a ConcurrentModificationException, since I am removing items from the set while I am iterating over it. How can I achieve the above behavior efficiently, without consistency issues?
Thanks!
I would use an ExecutorService
ExecutorService es = Executors.newFixedThreadPool(n);
// Task is assumed to be a Callable<String> that returns the string to remove, or null
List<Future<String>> toRemove = new ArrayList<>();
for (String s : set)
    toRemove.add(es.submit(new Task(s)));
for (Future<String> future : toRemove) {
    String s = future.get();
    if (s != null)
        set.remove(s);
}
This avoids needing to access the collection in a multi-threaded way.
Use a master producer thread that will remove the elements from the collection and will feed them to consumer threads. The consumer threads have no need to "personally" remove the items.
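A sketch of one literal reading of that suggestion (my addition; needsWork and doWork are hypothetical, and shutdown/poison-pill handling is omitted):
BlockingQueue<String> workQueue = new LinkedBlockingQueue<>();
// master thread: the only thread that iterates over and mutates the set
new Thread(() -> {
    for (Iterator<String> it = set.iterator(); it.hasNext(); ) {
        String s = it.next();
        if (needsWork(s)) {  // hypothetical predicate
            it.remove();     // safe: removal goes through the iterator itself
            workQueue.add(s);
        }
    }
}).start();
// consumer threads only take from the queue; they never touch the set
for (int i = 0; i < 4; i++) {
    new Thread(() -> {
        try {
            while (true) doWork(workQueue.take()); // hypothetical worker
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }).start();
}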
Yes, a synchronized Set will still throw ConcurrentModificationException.
Try this:
Set<String> s = Collections.newSetFromMap(new ConcurrentHashMap<>());
A set backed by ConcurrentHashMap never throws ConcurrentModificationException, even when multiple threads are accessing and modifying it.
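For example, a quick sketch (shouldRemove is a hypothetical predicate; the set's iterator is weakly consistent, so removing during iteration is safe, though updates from other threads may or may not be visible to an in-progress iteration):
Set<String> set = Collections.newSetFromMap(new ConcurrentHashMap<>());
set.addAll(Arrays.asList("a", "b", "c"));
for (String item : set) {
    if (shouldRemove(item)) {
        set.remove(item);  // no ConcurrentModificationException here
    }
}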
The approach depends on the relation between the data in your set and the successful completion of the operation.
Remove from Set is independent of the result of task execution
If you don't care about the actual result of the thread execution, you can just go through the set and remove every item as you dispatch the task (you have some examples of that already)
Remove from Set only if task execution completed successfully
If the deletion from the set should be tied to the success of the execution, you could use Futures to collect information about the success of the task execution. That way, only successfully executed items will be deleted from the original set. There's no need to access the Set structure concurrently, as you can separate execution from the check using Futures and an ExecutorService. E.g.:
// This task will execute the job and,
// if successful, return the string used as context
class Task implements Callable<String> {

    final String target;

    Task(String s) {
        this.target = s;
    }

    @Override
    public String call() throws Exception {
        // do your stuff
        // throw an exception if failed
        return target;
    }
}
And this is how it's used:
ExecutorService executor;
Set<Callable<String>> myTasks = new HashSet<Callable<String>>();
for(String s: set) {
myTasks.add(new Task(s));
}
List<Future<String>> results = executor.invoqueAll(myTasks);
for (Future<String> result:results) {
try {
set.remove(result.get());
} catch (ExecutionException ee) {
// the task failed during execution - handle as required
} catch (CancellationException ce) {
// the task was cancelled - handle as required
}
}
