I am writing a program which uses a CompletionService to run threaded analyses on a bunch of different objects, where each "analysis" consists of taking in a string and doing some computation to give either true or false as an answer. My code looks essentially like this:
// tasks come from a different method and contain the strings + some other needed info
List<Future<Pair<Pieces,Boolean>>> futures = new ArrayList<>(tasks.size());
for (Task task : tasks) {
futures.add(executorCompletionService.submit(task));
}
ArrayList<Pair<Pieces, Boolean>> pairs = new ArrayList<>();
int toComplete = tasks.size();
int received = 0;
int failed = 0;
while (received < toComplete) {
Future<Pair<Pieces, Boolean>> resFuture = executorCompletionService.take();
received++;
Pair<Pieces, Boolean> res = resFuture.get();
if (!res.getValue()) failed++;
if (failed > 300) {
// My problem is here
}
pairs.add(res);
}
// return pairs and go on to do something else
In the marked section, my goal is to have it abandon the computation if over 300 strings have failed, such that I can move on to a new analysis, calling this method again with some different data. The problem is that since the same CompletionService is used again, if I do not somehow clear the queue, then the worker queue will keep growing as I keep adding more to it every time I use it (since after 300 failures there are likely still many unprocessed strings left).
I have tried to loop through the futures list and cancel all unfinished tasks using something like futures.forEach(future -> future.cancel(true)), however when I next call the method I get a java.util.concurrent.CancellationException when I try to call resFuture.get().
(Edit: It seems that even though I call forEach(future -> future.cancel(true)), this does not guarantee that the workerQueue is actually clear afterwards. I do not understand why this is. It almost seems as if it takes a while to clear the queue, and the code does not wait for this to happen before moving to the next analysis, so occasionally get will be called on a future which has been cancelled.)
I have also tried to do
while (received < toComplete) {
executorCompletionService.take();
received++;
}
to empty the queue, and while this works, it is barely faster than just running all of the analyses anyway, so it does little for efficiency.
My question is if there is a better way to empty the worker queue such that when I next call this code it is as if the CompletionService is new again.
Edit: Another method I have tried is simply recreating the completion service with executorCompletionService = new ExecutorCompletionService<>(executorService), which is slightly faster than my other solution but is still rather slow and definitely not good practice.
P.S.: I am also happy to accept any other ways of doing this; I am not attached to using a CompletionService, it has just been the easiest thing for what I've done so far.
This has since been resolved, but I have seen other similar questions with no good answer so here is my solution:
Previously, I was using an ExecutorService to create my ExecutorCompletionService(ExecutorService). I switched the ExecutorService to a ThreadPoolExecutor, and since the ExecutorService I was using is already a ThreadPoolExecutor behind the scenes, all method signatures can be fixed with just a cast. Using the ThreadPoolExecutor gives you much more freedom in the backend; specifically, you can call threadPoolExecutor.getQueue().clear(), which clears all tasks awaiting execution. Finally, I needed to make sure to "drain" the remaining running tasks, so my final cancelling code looked like this:
if (failed > maxFailures) {
    executorService.getQueue().clear();
    // drain results of tasks that were already running when the queue was cleared
    while (executorService.getActiveCount() > 0) {
        executorCompletionService.poll();
    }
}
At the end of this code block, the executor will be ready to run again.
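For context, a minimal sketch of how the executor and completion service can be wired up so that the cast mentioned above works (the pool size of 8 is arbitrary; Pair and Pieces are the types from the question):

import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;

// Executors.newFixedThreadPool returns a ThreadPoolExecutor under the hood,
// so the cast exposes getQueue() and getActiveCount().
ThreadPoolExecutor executorService =
        (ThreadPoolExecutor) Executors.newFixedThreadPool(8);
ExecutorCompletionService<Pair<Pieces, Boolean>> executorCompletionService =
        new ExecutorCompletionService<>(executorService);

Note that the drain loop above calls poll() without a timeout, so it spins while the in-flight tasks finish; CompletionService.poll(timeout, unit) can be used instead to avoid the busy wait.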
Related
I am trying to implement multi-threading in my Spring Boot app. I am just a beginner at multi-threading in Java, and after searching and reading articles on various pages, I need clarification on the following points. So:
As far as I see, I can use Thread, Runnable or CompletableFuture in order to implement multi threading in a Java app. CompletableFuture seems a newer and cleaner way, but Thread may have more advantages. So, should I stick to CompletableFuture or use all of them based on the scenario?
Basically I want to send 2 concurrent requests to the same service method by using CompletableFuture:
CompletableFuture<Integer> future1 = fetchAsync(1);
CompletableFuture<Integer> future2 = fetchAsync(2);
Integer result1 = future1.get();
Integer result2 = future2.get();
How can I send these requests concurrently and then return a result based on the following condition:
if the first result is not null, return result and stop process
if the first result is null, return the second result and stop process
How can I do this? Should I use CompletableFuture.anyOf() for that?
CompletableFuture is a tool which settles atop the Executor/ExecutorService abstraction, which has implementations dealing with Runnable and Thread. You usually have no reason to deal with Thread creation manually. If you find CompletableFuture unsuitable for a particular task you may try the other tools/abstractions first.
If you want to proceed with the first (in the sense of faster) non‑null result, you can use something like
CompletableFuture<Integer> future1 = fetchAsync(1);
CompletableFuture<Integer> future2 = fetchAsync(2);
Integer result = CompletableFuture.anyOf(future1, future2)
.thenCompose(i -> i != null?
CompletableFuture.completedFuture((Integer)i):
future1.thenCombine(future2, (a, b) -> a != null? a: b))
.join();
anyOf allows you to proceed with the first result, but regardless of its actual value. So to use the first non‑null result we need to chain another operation which will resort to thenCombine if the first result is null. This will only complete when both futures have been completed but at this point we already know that the faster result was null and the second is needed. The overall code will still result in null when both results were null.
Note that anyOf accepts arbitrarily typed futures and results in a CompletableFuture<Object>. Hence, i is of type Object and a type cast is needed. An alternative with full type safety would be
CompletableFuture<Integer> future1 = fetchAsync(1);
CompletableFuture<Integer> future2 = fetchAsync(2);
Integer result = future1.applyToEither(future2, Function.identity())
.thenCompose(i -> i != null?
CompletableFuture.completedFuture(i):
future1.thenCombine(future2, (a, b) -> a != null? a: b))
.join();
which requires us to specify a function which we do not need here, so this code resorts to Function.identity(). You could also just use i -> i to denote an identity function; that’s mostly a stylistic choice.
Note that most complications stem from the design that tries to avoid blocking threads by always chaining a dependent operation to be executed when the previous stage has been completed. The examples above follow this principle as the final join() call is only for demonstration purposes; you can easily remove it and return the future, if the caller expects a future rather than being blocked.
If you are going to perform the final blocking join() anyway, because you need the result value immediately, you can also use
Integer result = future1.applyToEither(future2, Function.identity()).join();
if(result == null) {
Integer a = future1.join(), b = future2.join();
result = a != null? a: b;
}
which might be easier to read and debug. This ease of use is the motivation behind the upcoming Virtual Threads feature. When an action is running on a virtual thread, you don't need to avoid blocking calls. So with this feature, if you still need to return a CompletableFuture without blocking your caller thread, you can use
CompletableFuture<Integer> resultFuture = future1.applyToEitherAsync(future2, r-> {
if(r != null) return r;
Integer a = future1.join(), b = future2.join();
return a != null? a: b;
}, Executors.newVirtualThreadPerTaskExecutor());
By requesting a virtual thread for the dependent action, we can use blocking join() calls within the function without hesitation which makes the code simpler, in fact, similar to the previous non-asynchronous variant.
In all cases, the code will provide the faster result if it is non‑null, without waiting for the completion of the second future. But it does not stop the evaluation of the unnecessary future. Stopping an already ongoing evaluation is not supported by CompletableFuture at all. You can call cancel(…) on it, but this will only set the completion state (result) of the future to "exceptionally completed with a CancellationException".
So whether you call cancel or not, the already ongoing evaluation will continue in the background and only its final result will be ignored.
This might be acceptable for some operations. If not, you would have to change the implementation of fetchAsync significantly. You could use an ExecutorService directly and submit an operation to get a Future which supports cancellation with interruption.
But it also requires the operation’s code to be sensitive to interruption, to have an actual effect:
When calling blocking operations, use those methods that may abort and throw an InterruptedException and do not catch-and-continue.
When performing a long-running, computationally intensive task, poll Thread.interrupted() occasionally and bail out when it returns true (a sketch of both points follows below).
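As a rough illustration (the pool size, task body and iteration count here are made up; this is not the actual fetchAsync implementation):

// requires java.util.concurrent.*
ExecutorService pool = Executors.newFixedThreadPool(2);

Future<Integer> future = pool.submit(() -> {
    int acc = 0;
    for (int i = 0; i < 1_000_000; i++) {
        if (Thread.interrupted()) {      // poll the interrupt flag occasionally
            return null;                 // bail out when cancelled
        }
        acc += i;                        // stand-in for the real computation
    }
    return acc;
});

// later, when the result is no longer needed:
future.cancel(true);                     // interrupts the worker thread if it is still running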
So, should I stick to CompletableFuture or use all of them based on the scenario?
Use the one that is most appropriate to the scenario. Obviously, we can't be more specific unless you explain the scenario.
There are various factors to take into account. For example:
Thread + Runnable doesn't have a natural way to wait for / return a result. (But it is not hard to implement.)
Repeatedly creating bare Thread objects is inefficient because thread creation is expensive. Thread pooling is better but you shouldn't implement a thread pool yourself.
Solutions that use an ExecutorService take care of thread pooling and allow you to use Callable and return a Future. But for a one-off async computation this might be overkill.
Solutions that involve CompletableFuture allow you to compose and combine asynchronous tasks. But if you don't need to do that, using CompletableFuture may be overkill.
As you can see ... there is no single correct answer for all scenarios.
Should I use CompletableFuture.anyOf() for that?
No. The logic of your example requires that you must have the result for future1 to determine whether or not you need the result for future2. So the solution is something like this:
Integer i1 = future1.get();
if (i1 == null) {
return future2.get();
} else {
future2.cancel(true);
return i1;
}
Note that the above works with plain Future as well as CompletableFuture. If you were using CompletableFuture because you thought that anyOf was the solution, then you didn't need to do that. Calling ExecutorService.submit(Callable) will give you a Future ...
It will be more complicated if you need to deal with exceptions thrown by the tasks and/or timeouts. In the former case, you need to catch ExecutionException and then extract its cause exception to get the exception thrown by the task.
There is also the caveat that the second computation may ignore the interrupt and continue on regardless.
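For illustration, a sketch of that handling with a timeout added (the timeout value and the recovery choices are arbitrary):

// requires java.util.concurrent.{TimeUnit, ExecutionException, TimeoutException}
Integer i1;
try {
    i1 = future1.get(5, TimeUnit.SECONDS);    // also waits at most 5 seconds
} catch (ExecutionException e) {
    Throwable cause = e.getCause();           // the exception thrown inside the task
    throw new IllegalStateException("task failed", cause);
} catch (TimeoutException e) {
    future1.cancel(true);                     // give up on the slow task
    i1 = null;
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();       // restore the interrupt flag
    i1 = null;
}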
So, should I stick to CompletableFuture or use all of them based on the scenario?
Well, they all have different purposes and you'll probably use them all either directly or indirectly:
Thread represents a thread and, while it can be subclassed, in most cases you shouldn't do so. Many frameworks maintain thread pools, i.e. they spin up several threads that can then take tasks from a task pool. This is done to reduce the overhead that thread creation brings, as well as to reduce the amount of contention (many threads and few CPU cores mean a lot of context switches, so you'd normally try to have fewer threads that just work on one task after another).
Runnable was one of the first interfaces to represent tasks that a thread can work on. Another is Callable, which has 2 major differences from Runnable: 1) it can return a value while Runnable's run() returns void, and 2) it can throw checked exceptions. Depending on your case you can use either, but since you want to get a result, you'll more likely use Callable.
CompletableFuture and Future are basically a way for cross-thread communication, i.e. you can use those to check whether the task is done already (non-blocking) or to wait for completion (blocking).
So in many cases it's like this:
you submit a Runnable or Callable to some executor
the executor maintains a pool of Threads to execute the tasks you submitted
the executor returns a Future (one implementation being CompletableFuture) for you to check on the status and results of the task without having to synchronize yourself (a minimal sketch of this flow follows below).
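A minimal, self-contained sketch of that flow (the task body and pool size are placeholders):

import java.util.concurrent.*;

public class SubmitExample {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4); // the pool of threads

        Callable<Integer> task = () -> 42;          // a task that produces a result

        Future<Integer> future = pool.submit(task); // hand the task to the executor
        Integer result = future.get();              // block until the result is available
        System.out.println(result);

        pool.shutdown();
    }
}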
However, there may be other cases where you directly provide a Runnable to a Thread or even subclass Thread but nowadays those are far less common.
How can I do this? Should I use CompletableFuture.anyOf() for that?
CompletableFuture.anyOf() wouldn't work since you'd not be able to determine which of the 2 you'd pass in was successful first.
Since you're interested in result1 first (which btw can't be null if the type is int) you basically want to do the following:
Integer result1 = future1.get(); //block until result 1 is ready
if( result1 != null ) {
return result1;
} else {
return future2.get(); //result1 was null so wait for result2 and return it
}
You'd not want to call future2.get() right away since that would block until both are done; instead you're first interested in future1 only, so if that produces a result you wouldn't have to wait for future2 to ever finish.
Note that the code above doesn't handle exceptional completions and there's also probably a more elegant way of composing the futures like you want but I don't remember it atm (if I do I'll add it as an edit).
Another note: you could call future2.cancel(true) if result1 isn't null, but I'd suggest you first check whether cancelling would even work (e.g. you'd have a hard time really cancelling a webservice request) and what the results of interrupting the service would be. If it's ok to just let it complete and ignore the result, that's probably the easier way to go.
I have a for loop which needs to execute 36000 times
for(int i=0;i<36000;i++)
{
}
Is it possible to use multiple threads in order to execute the loop faster?
Please suggest how to do this.
If you want a more explicit method, you can use thread pools with Thread, Callable or Runnable. See my answer here for examples:
Java : a method to do multiple calculations on arrays quickly
Thread won't naturally exit at end of run()
I do not recommend using Java's Fork/Join, as it is not as great as it was hyped to be; performance is pretty bad. Instead, I would use Java 8's map and parallel streams if you want to make it easy. You have several options using this method.
IntStream.range(1, 4)
.mapToObj(i -> "testing " + i)
.forEach(System.out::println);
You would want to call map( lambda ). Java 8 finally brings lambda functions. It is possible to feed the stream one huge list, but there will be a performance impact. IntStream.range will do what you want. Then you need to figure out which of the new functions you want to use like filter, map, count, sum, reduce, etc. You may have to tell it that you want it to be a parallel stream. See these links.
https://docs.oracle.com/javase/tutorial/collections/streams/parallelism.html
http://winterbe.com/posts/2014/07/31/java8-stream-tutorial-examples/
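Applied to the 36000-iteration loop from the question, a minimal parallel-stream sketch could look like this (doWork is a placeholder for the loop body, and iterations must be independent of each other):

// requires java.util.stream.IntStream
IntStream.range(0, 36000)
         .parallel()               // run iterations on the common fork/join pool
         .forEach(i -> doWork(i)); // doWork stands in for the real per-iteration work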
The classic method, which still has the best performance, is to do it yourself using a thread pool:
Basically, you would create a Runnable (does not return anything) or Callable (returns a result) object that will do some work on one of the threads in the pool. The pool will handle scheduling, which is great for us. Java has several options for the pool you use. You can create a Runnable/Callable in a loop, then submit it into the pool. The pool immediately returns a Future object that represents the task. You can add that Future to an ArrayList if you have many of these. After adding all the futures to the list, loop through them and call future.get(), which will wait for the end of execution. See the linked example above, which does not use a list but does everything else I said.
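A rough sketch of that pattern for the same loop (the pool size and task body are placeholders, and the surrounding method is assumed to declare throws Exception for future.get()):

// requires java.util.* and java.util.concurrent.*
ExecutorService pool =
        Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
List<Future<Integer>> futures = new ArrayList<>();

for (int i = 0; i < 36000; i++) {
    final int index = i;
    futures.add(pool.submit(() -> {
        return index * 2;          // placeholder for the real per-iteration work
    }));
}

for (Future<Integer> f : futures) {
    Integer result = f.get();      // waits for that task; may throw ExecutionException
    // use result
}
pool.shutdown();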
I have a program that processes a huge stream (not in the sense of java.util.stream, but rather InputStream) of data coming in through the network. The stream consists of objects, each having a sort of sub-stream identifier. Right now the whole processing is done in a single thread, but it takes a lot of CPU time and each sub-stream can easily be processed independently, so I'm thinking of multi-threading it.
However, each sub-stream requires keeping a lot of bulky state, including various buffers, hash maps and such. There is no particular reason to make it concurrent or synchronized since sub-streams are independent of each other. Moreover, each sub-stream requires that its objects are processed in the order they arrive, which means that there should probably be a single thread for each sub-stream (but possibly one thread processing multiple sub-streams).
I'm thinking of several approaches to this, but they are not quite elegant.
1. Create a single ThreadPoolExecutor for all tasks. Each task will contain the next object to process and the reference to a Processor instance which keeps all the state. That would ensure the necessary happens-before relationship, thus ensuring that the processing thread will see the up-to-date state for this sub-stream. This approach has no way to make sure that the next object of the same sub-stream will be processed in the same thread, as far as I can see. Moreover, it needs some guarantee that objects will be processed in the order they come in, which will require additional synchronization of the Processor objects, introducing unnecessary delays.
2. Create multiple single-thread executors manually and a sort of hash map that maps sub-stream identifiers to executors. This approach requires manual management of executors, creating or shutting them down as new sub-streams begin or end, and distributing the tasks between them accordingly.
3. Create a custom executor that processes a special subclass of tasks, each having a sub-stream ID. This executor would use the ID as a hint to execute the task on the same thread as the previous one with the same ID. However, I don't see an easy way to implement such an executor. Unfortunately, it doesn't seem possible to extend any of the existing executor classes, and implementing an executor from scratch is kind of overkill.
4. Create a single ThreadPoolExecutor, but instead of creating a task for each incoming object, create a single long-running task for each sub-stream that would block in a concurrent queue, waiting for the next object. Then put objects in queues according to their sub-stream IDs. This approach needs as many threads as there are sub-streams because the tasks will be blocked. The expected number of sub-streams is about 30-60, so that may be acceptable.
5. Alternatively, proceed as in 4, but limit the number of threads, assigning multiple sub-streams to a single task. This is sort of a hybrid between 2 and 4. As far as I can see, this is the best approach of these, but it still requires some sort of manual sub-stream distribution between tasks and some way to shut the extra tasks down as sub-streams end.
What would be the best way to ensure that each sub-stream is processed in its own thread without a lot of error-prone code? So that the following pseudo-code will work:
// loop {
Item next = stream.read();
int id = next.getSubstreamID();
Processor processor = getProcessor(id);
SubstreamTask task = new SubstreamTask(processor, next, id);
executor.submit(task); // This makes sure that the task will
// be executed in the same thread as the
// previous task with the same ID.
// } // loop
I suggest having an array of single-threaded executors. If you can devise a consistent hashing strategy for sub-streams, you can map sub-streams to individual threads, e.g.
final ExecutorService[] es = ...
public void submit(int id, Runnable run) {
es[(id & 0x7FFFFFFF) % es.length].submit(run);
}
The key could be a String or a long, as long as it identifies the sub-stream. If you know a particular sub-stream is very expensive, you could assign it a dedicated thread.
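For a String key, the same idea might look like this (purely illustrative, using the es array from above):

public void submit(String id, Runnable run) {
    es[(id.hashCode() & 0x7FFFFFFF) % es.length].submit(run);
}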
The solution I finally chose looks like this:
private final Executor[] streamThreads
= new Executor[Runtime.getRuntime().availableProcessors()];
{
for (int i = 0; i < streamThreads.length; ++i) {
streamThreads[i] = Executors.newSingleThreadExecutor();
}
}
private final ConcurrentHashMap<SubstreamId, Integer>
threadById = new ConcurrentHashMap<>();
This code determines which executor to use:
Message msg = in.readNext();
SubstreamId msgSubstream = msg.getSubstreamId();
int exe = threadById.computeIfAbsent(msgSubstream,
id -> findBestExecutor());
streamThreads[exe].execute(() -> {
// processing goes here
});
And the findBestExecutor() function is this:
private int findBestExecutor() {
// Thread index -> substream count mapping:
final int[] loads = new int[streamThreads.length];
for (int thread : threadById.values()) {
++loads[thread];
}
// return the index of the minimum load
return IntStream.range(0, streamThreads.length)
.reduce((i, j) -> loads[i] <= loads[j] ? i : j)
.orElse(0);
}
This is, of course, not very efficient, but note that this function is only called when a new sub-stream shows up (which happens several times every few hours, so it's not a big deal in my case). My real code looks a bit more complicated because I have a way to determine whether two sub-streams are likely to finish simultaneously, and if they are, I try to assign them to different threads in order to maintain even load after they do finish. But since I never mentioned this detail in the question, I guess it doesn't belong to the answer either.
I'm trying to use parallel streams to call an API endpoint to get some data back. I am using an ArrayList<String> and sending each String to a method that uses it in making a call to my API. I have set up parallel streams to call a method that will call the endpoint and marshall the data that comes back. The problem for me is that when viewing this in htop I see ALL the cores on the db server light up the second I hit this method, then as the first group finishes I only see 1 or 2 cores light up. My issue here is that I think I am truly getting the result I want for the first set of calls only, and then from monitoring it looks like the rest of the calls get made one at a time.
I think it may have something to do with the recursion but I'm not 100% sure.
private void generateObjectMap(Integer count){
ArrayList<String> myList = getMyList();
myList.parallelStream().forEach(f -> performApiRequest(f,count));
}
private void performApiRequest(String myString,Integer count){
if(count < 10) {
TreeMap<Integer,TreeMap<Date,MyObj>> tempMap = new TreeMap();
try {
tempMap = myJson.getTempMap(myRestClient.executeGet(myString));
} catch(SocketTimeoutException e) {
count += 1;
performApiRequest(myString,count);
}
...
} else {
System.exit(1);
}
}
This seems an unusual use for parallel streams. In general the idea is that you are informing the JVM that the operations on the stream are truly independent and can run in any order, in one thread or multiple. The results will subsequently be reduced or collected as part of the stream. The important point to remember here is that side effects are undefined (which is why variables used in streams need to be final or effectively final) and you shouldn't rely on how the JVM organises execution of the operations.
I can imagine the following being a reasonable usage:
list.parallelStream().map(item -> getDataUsingApi(item))
.collect(Collectors.toList());
Where the api returns data which is then handed to downstream operations with no side effects.
So in conclusion, if you want tight control over how the API calls are executed, I would recommend that you not use parallel streams for this. Traditional Thread instances, possibly with a ThreadPoolExecutor, will serve you much better for this.
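As a sketch of that approach applied to the code in the question (the pool size of 8 is arbitrary; getMyList and performApiRequest are the methods from the question):

// requires java.util.* and java.util.concurrent.*
private void generateObjectMap(Integer count) throws Exception {
    ArrayList<String> myList = getMyList();
    ExecutorService pool = Executors.newFixedThreadPool(8);   // explicit, bounded parallelism
    List<Future<?>> futures = new ArrayList<>();

    for (String s : myList) {
        futures.add(pool.submit(() -> performApiRequest(s, count)));
    }
    for (Future<?> f : futures) {
        f.get();   // wait for each request; task failures surface as ExecutionException
    }
    pool.shutdown();
}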
I'm trying to loop over a Java iterator concurrently, but am having troubles with the best way to do this.
Here is what I have where I don't try to do anything concurrently.
Long l;
Iterator<Long> i = getUserIDs();
while (i.hasNext()) {
l = i.next();
someObject.doSomething(l);
anotheObject.doSomething(l);
}
There should be no race conditions between the things I'm doing on the non iterator objects, so I'm not too worried about that. I'd just like to speed up how long it takes to loop through the iterator by not doing it sequentially.
Thanks in advance.
One solution is to use an executor to parallelise your work.
Simple example:
ExecutorService executor = Executors.newCachedThreadPool();
Iterator<Long> i = getUserIDs();
while (i.hasNext()) {
final Long l = i.next();
Runnable task = new Runnable() {
public void run() {
someObject.doSomething(l);
anotheObject.doSomething(l);
}
};
executor.submit(task);
}
executor.shutdown();
This will submit a task for each item in the iterator; the cached thread pool creates new threads as needed and reuses idle ones to do the work. You can tune how many threads are used by using a different factory method on the Executors class, or subdivide the work as you see fit (e.g. a different Runnable for each of the method calls).
I can offer two possible approaches:
Use a thread pool and dispatch the items received from the iterator to a set of processing threads. This will not accelerate the iterator operations themselves, since those would still happen in a single thread, but it will parallelize the actual processing.
Depending on how the iteration is created, you might be able to split the iteration process into multiple segments, each to be processed by a separate thread via a different Iterator object. For an example, have a look at the List.subList(int fromIndex, int toIndex) and List.listIterator(int index) methods.
This would allow the iterator operations to happen in parallel, but it is not always possible to segment the iteration like this, usually due to the simple fact that the items to be iterated over are not immediately available.
As a bonus trick, if the iteration operations are expensive or slow, such as those required to access a database, you might see a throughput improvement if you separate them out to a separate thread that will use the iterator to fill in a BlockingQueue. The dispatcher thread will then only have to access the queue, without waiting on the iterator object to retrieve the next item.
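A rough sketch of that bonus trick (the queue capacity is arbitrary, the loop's termination/poison-pill handling is omitted for brevity, and the surrounding method is assumed to handle InterruptedException from take()):

// requires java.util.concurrent.*
BlockingQueue<Long> queue = new ArrayBlockingQueue<>(1024);

// filler thread: drains the (slow) iterator into the queue
Thread filler = new Thread(() -> {
    try {
        Iterator<Long> it = getUserIDs();
        while (it.hasNext()) {
            queue.put(it.next());        // blocks if consumers fall behind
        }
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    }
});
filler.start();

// dispatcher: takes items from the queue instead of touching the iterator
while (true) {
    Long l = queue.take();               // blocks until an item is available
    someObject.doSomething(l);           // or submit a task to the executor shown above
    anotheObject.doSomething(l);
}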
The most important advice in this case is this: "Use your profiler", usually to be followed by "Do not optimise prematurely". By using a profiler, such as VisualVM, you should be able to ascertain the exact cause of any performance issues, without taking shots in the dark.
If you are using Java 7, you can use the new fork/join; see the tutorial.
Not only does it automatically split the tasks among the threads, but if some thread finishes its tasks earlier than the others, it "steals" tasks from them.
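For completeness, a minimal fork/join sketch (the array, threshold and per-element work are made up; it is not tied to the iterator from the question):

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

class RangeSum extends RecursiveTask<Long> {
    private final long[] data;
    private final int from, to;

    RangeSum(long[] data, int from, int to) {
        this.data = data; this.from = from; this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= 1_000) {                 // small enough: compute directly
            long sum = 0;
            for (int i = from; i < to; i++) sum += data[i];
            return sum;
        }
        int mid = (from + to) >>> 1;
        RangeSum left = new RangeSum(data, from, mid);
        RangeSum right = new RangeSum(data, mid, to);
        left.fork();                              // schedule the left half asynchronously
        return right.compute() + left.join();     // do the right half here, then wait for left
    }
}

// usage: long total = new ForkJoinPool().invoke(new RangeSum(values, 0, values.length));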