Question about potential race condition with ParallelStream of Lists - java

I came across this piece of code that uses Java streams, specifically parallelStream(), to collect some data from an Oracle database. See below, where in this case:
range = some list of input Id
rangeLimit = 1000
rangeLimitedFunction = some function that queries a DB for some content
ForkJoinPool threadPool = new ForkJoinPool(Math.min(Runtime.getRuntime().availableProcessors(), parallelism));
try {
    Optional<C> res = threadPool.submit(new Callable<Optional<C>>() {
        @Override
        public Optional<C> call() throws Exception {
            return splitByLimit(range, rangeLimit).parallelStream()
                .map(rangeLimitedFunction::apply)
                .reduce((list, items) -> {
                    list.addAll(items);
                    return list;
                });
        }
    }).get();
From what I understand, this is how it works:
Split range into chunks of 1000 to feed into the function
Process each chunk in a thread to return some results
Aggregate the results to a list of POJOs
My question is around a potential race condition imposed by trying to reduce into a single list. Is it not possible for many of these threads to be trying to add content to the resulting list and potentially corrupt data?

That depends largely on the implementation of List that's used in this case.
That said, this piece of code would be better off using flatMap and a collector, to leverage the thread-safety of parallel streams' collect machinery and avoid potential pitfalls from non-thread-safe list implementations; see the sketch below.
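A minimal sketch of that flatMap-and-collector version, reusing the question's splitByLimit and rangeLimitedFunction (Item is a hypothetical stand-in for the POJO type):
// Each chunk maps to a list of results; flatMap flattens them into one stream,
// and the collector accumulates safely even in parallel, so no shared mutable
// list is ever touched by multiple threads.
List<Item> res = splitByLimit(range, rangeLimit).parallelStream()
    .flatMap(chunk -> rangeLimitedFunction.apply(chunk).stream())
    .collect(Collectors.toList());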
Also, parallel streams don't offer much benefit for I/O operations. They target processor-heavy operations and usually only pay off if there are more than roughly 15,000 (IIRC) operations (that is, stream iterations times CPU-heavy stream operations), which is kind of rare.

ParallelStream with filter chaining

Does filter chaining change the outcome if I use parallelStream() instead of stream()?
I tried with a few thousand records, and the output appeared consistent over a few iterations. But since this involves threads (and I could not find enough relevant material that talks about this combination), I want to make doubly sure that a parallel stream does not impact the output of filter chaining in any way. Example code:
List<Element> list = myList.parallelStream()
    .filter(element -> element.getId() > 10)
    .filter(element -> element.getName().contains("something"))
    .collect(Collectors.toList());
Short answer: No.
The filter operation, as documented, expects a non-interfering and stateless predicate to apply to each element to determine whether it should be included in the new stream.
A few aspects you should consider here:
With the exception of concurrent collections (whatever you choose myList in the existing code to be):
For most data sources, preventing interference means ensuring that the data source is not modified at all during the execution of the stream pipeline.
The state of the data sources: myList and its elements are not mutated within your filter operations.
Note also that attempting to access mutable state from behavioral parameters presents you with a bad choice with respect to safety and performance;
Moreover, think about it: what in your filter operations would be impacted by multiple threads? Given the current code, nothing functionally; as long as both operations are executed, you get a consistent result regardless of which thread(s) execute them.
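To make the non-interference requirement concrete, here is a minimal sketch (using String elements in place of the question's Element type) contrasting a safe, stateless predicate with an interfering one:
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

public class InterferenceDemo {
    public static void main(String[] args) {
        List<String> myList = new ArrayList<>(
                List.of("something-11", "other-12", "something-13"));

        // Safe: stateless, non-interfering predicates; the result is
        // consistent whether run via stream() or parallelStream().
        List<String> filtered = myList.parallelStream()
                .filter(s -> s.contains("something"))
                .collect(Collectors.toList());
        System.out.println(filtered); // [something-11, something-13]

        // NOT safe (don't do this): the predicate interferes by mutating the
        // stream's source while the pipeline runs; the result is undefined
        // and may throw ConcurrentModificationException.
        // myList.parallelStream()
        //       .filter(s -> { myList.remove(s); return true; })
        //       .collect(Collectors.toList());
    }
}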

Parallelize a for loop in java

I have a for loop that is looping over a list of collections. Inside the loop, some select/update queries take place on each collection, exclusive of the other collections. Since each collection has a lot of data to process, I would like to parallelize it.
The code snippet looks something like this:
//Some variables that are used within the for loop logic
for(String collection : collections) {
//Select queries on collection
//Update queries on collection
}
How can I achieve this in Java?
You can use the parallelStream() method (since Java 8):
collections.parallelStream().forEach((collection) -> {
    // Select queries on collection
    // Update queries on collection
});
More information about streams is available in the java.util.stream package documentation.
Another way to do it is using Executors:
try
{
    final ExecutorService exec = Executors.newFixedThreadPool(collections.size());
    for (final String collection : collections)
    {
        exec.submit(() -> {
            // Select queries on collection
            // Update queries on collection
        });
    }
    // Stop accepting new tasks so the executor can actually reach the
    // terminated state; without this, awaitTermination always times out.
    exec.shutdown();
    // We want to wait until the jobs are done.
    final boolean terminated = exec.awaitTermination(500, TimeUnit.MILLISECONDS);
    if (terminated == false)
    {
        exec.shutdownNow();
    }
} catch (final InterruptedException e)
{
    e.printStackTrace();
}
This example is more powerful since you can easily know when the job is done, force termination, and more.
final int numberOfThreads = 32;
final ExecutorService executor = Executors.newFixedThreadPool(numberOfThreads);
// List to store the 'handles' (Futures) for all tasks:
final List<Future<MyResult>> futures = new ArrayList<>();
// Schedule one (parallel) task per String from "collections":
for (final String str : collections) {
    futures.add(executor.submit(() -> { return doSomethingWith(str); }));
}
// Wait until all tasks have completed:
for (Future<MyResult> f : futures) {
    MyResult aResult = f.get(); // Will block until the result of the task is available.
    // Optionally do something with the result...
}
executor.shutdown(); // Release the threads held by the executor.
// At this point all tasks have ended and we can continue as if they were all executed sequentially
Adjust the numberOfThreads as needed to achieve the best throughput. More threads will tend to utilize the local CPU better, but may cause more overhead at the remote end. To get good local CPU utilization, you want many more threads than CPUs (or cores) so that, whenever one thread has to wait, e.g. for a response from the DB, another thread can be switched in to execute on the CPU.
There are a number of questions that you need to ask yourself to find the right answer:
If I have as many threads as the number of my CPU cores, would that be enough?
Using parallelStream() will give you as many threads as your CPU cores.
Will parallelizing the loop give me a performance boost or is there a bottleneck on the DB?
You could spin up 100 threads, processing in parallel, but this doesn't mean that you will do things 100 times faster, if your DB or the network cannot handle the volume. DB locking can also be an issue here.
Do I need to process my data in a specific order?
If you have to process your data in a specific order, this may limit your choices. E.g. forEach() doesn't guarantee that the elements of your collection will be processed in a specific order, but forEachOrdered() does (with a performance cost); see the sketch after this list.
Is my datasource capable of fetching data reactively?
There are cases when our datasource can provide data in the form of a stream. In that case, you can always process this stream using a technology such as RxJava or WebFlux. This would enable you to take a different approach on your problem.
Having said all the above, you can choose the approach (executors, RxJava, etc.) that best fits your purpose.
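As a small illustration of the ordering point above (my sketch, not from the original answer), forEach and forEachOrdered differ only in the encounter-order guarantee:
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class OrderingDemo {
    public static void main(String[] args) {
        List<Integer> items = IntStream.range(0, 10).boxed().collect(Collectors.toList());

        // May print 0..9 in any order: forEach gives no ordering guarantee.
        items.parallelStream().forEach(i -> System.out.print(i + " "));
        System.out.println();

        // Always prints 0 1 2 ... 9: forEachOrdered respects encounter order,
        // at some cost to parallel throughput.
        items.parallelStream().forEachOrdered(i -> System.out.print(i + " "));
        System.out.println();
    }
}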

Does a stateful map operation of an ordered stream process elements in a deterministic way?

I'm reading about java streams API and I encountered the following here:
The operation forEachOrdered processes elements in the order specified by the stream, regardless of whether the stream is executed in serial or parallel. However, when a stream is executed in parallel, the map operation processes elements of the stream specified by the Java runtime and compiler. Consequently, the order in which the lambda expression e -> { parallelStorage.add(e); return e; } adds elements to the List parallelStorage can vary every time the code is run. For deterministic and predictable results, ensure that lambda expression parameters in stream operations are not stateful.
I tested the following code and in fact, it works as mentioned:
public class MapOrdering {
    public static void main(String[] args) {
        List<String> serialStorage = new ArrayList<>();
        System.out.println("Serial stream:");
        int j = 0;
        List<String> listOfIntegers = new ArrayList<>();
        for (int i = 0; i < 10; i++) listOfIntegers.add(String.valueOf(i));
        listOfIntegers.stream().parallel().map(e -> {
            serialStorage.add(e.concat(String.valueOf(j)));
            return e;
        }).forEachOrdered(k -> System.out.println(k));
        /*
        // Don't do this! It uses a stateful lambda expression.
        .map(e -> { serialStorage.add(e); return e; })*/
        for (String s : serialStorage) System.out.println(s);
    }
}
output
Serial stream:
0
1
2
3
4
5
6
7
8
9
null
null
80
90
50
40
30
00
Questions:
1. The output changes every time I run this. How do I make sure that the stateful map operation is executed in order?
2. map is an intermediate operation, and it does not start processing elements until the terminal operation commences. Since the terminal operation is ordered, why is the map operation unordered, and why does it tend to change results every time when working with a stateful operation?
You got lucky to see that serialStorage has all the elements that you think it will; after all, you are adding multiple elements from multiple threads to a non-thread-safe collection, ArrayList. You could have easily seen nulls or a List that does not have all the elements. But even when you use a List that is thread-safe, there is absolutely no order that you can rely on in that List.
This is explicitly mentioned in the documentation under side-effects, and intermediate operations should be side effect-free.
Basically there are two orderings: processing order (intermediate operations) and encounter order. The latter is preserved (if the stream has one to begin with and the intermediate operations don't break it; for example, unordered or sorted).
Processing order is not specified, meaning all intermediate operations will process elements in whatever order they feel like. Encounter order (the one you see from a terminal operation) will preserve the initial order.
But even terminal operations don't have to preserve the initial order, for example forEach vs forEachOrdered or when you collect to a Set; of course read the documentation, it usually states clearly this aspect.
I would like to answer your 2 questions, while adding to this other answer...
The output changes every time I run this. How do I write code that processes a stateful map operation in an ordered way?
Stateful map operations are discouraged and you shouldn't use them, even for sequential streams. If you want that behaviour, you'd better use an imperative approach.
map is an intermediate operation, and it does not start processing elements until the terminal operation commences. Since the terminal operation is ordered, why is the map operation unordered, and why does it tend to change results every time when working with a stateful operation?
Only forEachOrdered respects encounter order of elements; intermediate operations (such as map) are not compelled to do so. For a parallel stream, this means that intermediate operations are allowed to be executed in any order by the pipeline, thus taking advantage of parallelism.
However, bear in mind that providing a stateful argument to an intermediate operation (i.e. a stateful mapper function to the map operation) when the stream is parallel would require you to manually synchronize the state kept by that argument (i.e. you would need to use a synchronized view of the list, implement some locking mechanism, etc.). This would in turn affect performance negatively, since (as stated in the docs) you'd risk having contention undermine the parallelism you are seeking to benefit from.
Edit: for a terminal operation like forEachOrdered, parallelism would usually bring little benefit, since many times it needs to do some internal processing to comply with the requirement of respecting encounter order, i.e. buffer the elements.
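As a minimal sketch of that synchronization caveat (my sketch; parallelStorage echoes the tutorial's example): a synchronized view keeps the list internally consistent, but it still gives no ordering guarantee, and the contention works against the parallelism:
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.IntStream;

public class SynchronizedStateDemo {
    public static void main(String[] args) {
        // Thread-safe view: no lost updates or nulls, but NO ordering guarantee.
        List<String> parallelStorage = Collections.synchronizedList(new ArrayList<>());
        IntStream.range(0, 10).boxed()
                .parallel()
                .map(e -> {
                    parallelStorage.add(String.valueOf(e)); // safe, but unordered
                    return e;
                })
                .forEachOrdered(System.out::println); // prints 0..9 in order
        System.out.println(parallelStorage);          // element order varies per run
    }
}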

ParallelStreams in java

I'm trying to use parallel streams to call an API endpoint to get some data back. I am using an ArrayList<String> and sending each String to a method that uses it in making a call to my API. I have setup parallel streams to call a method that will call the endpoint and marshall the data that comes back. The problem for me is that when viewing this in htop I see ALL the cores on the db server light up the second I hit this method ... then as the first group finish I see 1 or 2 cores light up. My issue here is that I think I am truly getting the result I want ... for the first set of calls only and then from monitoring it looks like the rest of the calls get made one at a time.
I think it may have something to do with the recursion but I'm not 100% sure.
private void generateObjectMap(Integer count) {
    ArrayList<String> myList = getMyList();
    myList.parallelStream().forEach(f -> performApiRequest(f, count));
}

private void performApiRequest(String myString, Integer count) {
    if (count < 10) {
        TreeMap<Integer, TreeMap<Date, MyObj>> tempMap = new TreeMap<>();
        try {
            tempMap = myJson.getTempMap(myRestClient.executeGet(myString));
        } catch (SocketTimeoutException e) {
            count += 1;
            performApiRequest(myString, count);
        }
        ...
    } else {
        System.exit(1);
    }
}
This seems an unusual use for parallel streams. In general, the idea is that you are informing the JVM that the operations on the stream are truly independent and can run in any order, in one thread or multiple. The results will subsequently be reduced or collected as part of the stream. The important point to remember here is that side effects are undefined (which is why variables used inside streams need to be final or effectively final), and you shouldn't be relying on how the JVM organises execution of the operations.
I can imagine the following being a reasonable usage:
list.parallelStream().map(item -> getDataUsingApi(item))
    .collect(Collectors.toList());
Where the api returns data which is then handed to downstream operations with no side effects.
So, in conclusion, if you want tight control over how the API calls are executed, I would recommend not using parallel streams for this. Traditional Thread instances, possibly with a ThreadPoolExecutor, will serve you much better; see the sketch below.
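A minimal sketch of that alternative (performApiRequest is a hypothetical stand-in for the question's method; the pool size is an arbitrary assumption): an explicit ThreadPoolExecutor caps how many API calls run concurrently, independent of the common ForkJoinPool that parallelStream() uses:
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ControlledApiCalls {
    public static void main(String[] args) throws InterruptedException {
        List<String> myList = List.of("a", "b", "c");

        // Exactly two API calls in flight at any time; the rest queue up.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                2, 2, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());

        for (String s : myList) {
            pool.execute(() -> performApiRequest(s, 0));
        }
        pool.shutdown();                            // no new tasks accepted
        pool.awaitTermination(1, TimeUnit.MINUTES); // wait for the queue to drain
    }

    // Hypothetical stand-in for the question's performApiRequest.
    static void performApiRequest(String myString, Integer count) {
        System.out.println("calling API for " + myString);
    }
}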

Can a Collector's combiner function ever be used on sequential streams?

Sample program:
public final class CollectorTest
{
    private CollectorTest()
    {
    }

    private static <T> BinaryOperator<T> nope()
    {
        return (t, u) -> { throw new UnsupportedOperationException("nope"); };
    }

    public static void main(final String... args)
    {
        final Collector<Integer, ?, List<Integer>> c
            = Collector.of(ArrayList::new, List::add, nope());
        IntStream.range(0, 10_000_000).boxed().collect(c);
    }
}
To simplify matters here, there is no finishing transformation (finisher), so the resulting code is quite simple.
Now, IntStream.range() produces a sequential stream. I simply box the results into Integers and then my contrived Collector collects them into a List<Integer>. Pretty simple.
And no matter how many times I run this sample program, the UnsupportedOperationException never hits, which means my dummy combiner is never called.
I kind of expected this, but then I have already misunderstood streams enough that I have to ask the question...
Can a Collector's combiner ever be called when the stream is guaranteed to be sequential?
A careful reading of the streams implementation code in ReduceOps.java reveals that the combine function is called only when a ReduceTask completes, and ReduceTask instances are used only when evaluating a pipeline in parallel. Thus, in the current implementation, the combiner is never called when evaluating a sequential pipeline.
There is nothing in the specification that guarantees this, however. A Collector is an interface that makes requirements on its implementations, and there are no exemptions granted for sequential streams. Personally, I find it difficult to imagine why sequential pipeline evaluation might need to call the combiner, but someone with more imagination than me might find a clever use for it, and implement it. The specification allows for it, and even though today's implementation doesn't do it, you still have to think about it.
This should not be surprising. The design center of the streams API is to support parallel execution on an equal footing with sequential execution. Of course, it is possible for a program to observe whether it is being executed sequentially or in parallel. But the design of the API is to support a style of programming that allows either.
If you're writing a collector and you find that it's impossible (or inconvenient, or difficult) to write an associative combiner function, leading you to want to restrict your stream to sequential execution, maybe this means you're heading in the wrong direction. It's time to step back a bit and think about approaching the problem a different way.
A common reduction-style operation that doesn't require an associative combiner function is called fold-left. The main characteristic is that the fold function is applied strictly left-to-right, proceeding one at a time. I'm not aware of a way to parallelize fold-left.
When people try to contort collectors the way we've been talking about, they're usually looking for something like fold-left. The Streams API doesn't have direct API support for this operation, but it's pretty easy to write. For example, suppose you want to reduce a list of strings using this operation: repeat the first string and then append the second. It's pretty easy to demonstrate that this operation isn't associative:
List<String> list = Arrays.asList("a", "b", "c", "d", "e");
System.out.println(list.stream()
    .collect(StringBuilder::new,
             (a, b) -> a.append(a.toString()).append(b),
             (a, b) -> a.append(a.toString()).append(b))); // BROKEN -- NOT ASSOCIATIVE
Run sequentially, this produces the desired output:
aabaabcaabaabcdaabaabcaabaabcde
But when run in parallel, it might produce something like this:
aabaabccdde
Since it "works" sequentially, we could enforce this by calling sequential() and back this up by having the combiner throw an exception. In addition, the supplier must be called exactly once. There's no way to combine the intermediate results, so if the supplier is called twice, we're already in trouble. But since we "know" the supplier is called only once in sequential mode, most people don't worry about this. In fact, I've seen people write "suppliers" that return some existing object instead of creating a new one, in violation of the supplier contract.
In this use of the 3-arg form of collect(), we have two out of the three functions breaking their contracts. Shouldn't this be telling us to do things a different way?
The main work here is being done by the accumulator function. To accomplish a fold-style reduction, we can apply this function in a strict left-to-right order using forEachOrdered(). We have to do a bit of setup and finishing code before and after, but that's no problem:
StringBuilder a = new StringBuilder();
list.parallelStream()
    .forEachOrdered(b -> a.append(a.toString()).append(b));
System.out.println(a.toString());
Naturally, this works fine in parallel, though the performance benefits of running in parallel may be somewhat negated by the ordering requirements of forEachOrdered().
In summary, if you find yourself wanting to do a mutable reduction but lacking an associative combiner function, leading you to restrict your stream to sequential execution, recast the problem as a fold-left operation and use forEachOrdered() with your accumulator function.
As observed in previous comments from @MarkoTopolnik and @Duncan, there is no guarantee that Collector.combiner() is called in sequential mode to produce the reduced result. In fact, the Javadoc is a little subjective on this point, which can lead to a misinterpretation.
(...) A parallel implementation would partition the input, create a result container for each partition, accumulate the contents of each partition into a subresult for that partition, and then use the combiner function to merge the subresults into a combined result.
According to NoBlogDefFound, the combiner is used only in parallel mode. See the partial quotation below:
combiner() is used to join two accumulators together into one. It is used when collector is executed in parallel, splitting input Stream and collecting parts independently first.
To show this issue more clearly, I rewrote the first code with two approaches (sequential and parallel).
public final class CollectorTest
{
    private CollectorTest()
    {
    }

    private static <T> BinaryOperator<T> nope()
    {
        return (t, u) -> { throw new UnsupportedOperationException("nope"); };
    }

    public static void main(final String... args)
    {
        final Collector<Integer, ?, List<Integer>> c =
            Collector.of(ArrayList::new, List::add, nope());

        // sequential approach
        Stream<Integer> sequential = IntStream
            .range(0, 10_000_000)
            .boxed();
        System.out.println("isParallel:" + sequential.isParallel());
        sequential.collect(c);

        // parallel approach
        Stream<Integer> parallel = IntStream
            .range(0, 10_000_000)
            .parallel()
            .boxed();
        System.out.println("isParallel:" + parallel.isParallel());
        parallel.collect(c);
    }
}
After running this code we can get the output:
isParallel:false
isParallel:true
Exception in thread "main" java.lang.UnsupportedOperationException: nope
at com.stackoverflow.lambda.CollectorTest.lambda$nope$0(CollectorTest.java:18)
at com.stackoverflow.lambda.CollectorTest$$Lambda$3/2001049719.apply(Unknown Source)
at java.util.stream.ReduceOps$3ReducingSink.combine(ReduceOps.java:174)
at java.util.stream.ReduceOps$3ReducingSink.combine(ReduceOps.java:160)
So, according to this result, we can infer that the Collector's combiner is called only during parallel execution.
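For contrast, a minimal sketch (my addition) where the same pipeline gets a working combiner that merges partial lists, so the parallel run completes instead of throwing:
final Collector<Integer, ?, List<Integer>> ok = Collector.of(
    ArrayList::new,
    List::add,
    (left, right) -> { left.addAll(right); return left; }); // associative merge

List<Integer> merged = IntStream
    .range(0, 10_000_000)
    .parallel()
    .boxed()
    .collect(ok);
System.out.println(merged.size()); // 10000000: partitions merged via the combiner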
