ParallelStream with filter chaining

ParallelStream with filter chaining - java

Does filter chaining change the outcome if i use parallelStream() instead of stream() ?
I tried with a few thousand records, and the output appeared consistent over a few iterations. But since this involves threads,(and I could not find enough relevant material that talks about this combination) I want to make doubly sure that parallel stream does not impact the output of filter chaining in any way. Example code:
List<Element> list = myList.parallelStream()
.filter(element -> element.getId() > 10)
.filter(element -> element.getName().contains("something"))
.collect(Collectors.toList());

Short answer: No.
The filter operation as documented expects a non-interferening and stateless predicate to apply to each element to determine if it should be included as part of the new stream.
Few aspects that you shall consider for that are -
With an exception to concurrent collections(what do you choose as myList in the existing code to be) -
For most data sources, preventing interference means ensuring that the
data source is not modified at all during the execution of the stream
pipeline.
The state of the data sources (myList and its elements within your filter operations are not mutated)
Note also that attempting to access mutable state from behavioral
parameters presents you with a bad choice with respect to safety and
performance;
Moreover, think around it, what is it in your filter operation that would be impacted by multiple threads. Given the current code, nothing functionally, as long as both the operations are executed, you would get a consistent result regardless of the thread(s) executing them.

Related

Getting ArrayIndexOutOfBoundsException when using parallel stream

I am ending up with occasional array index out of bounds exception when using the following code . Any leads ? The size of the array is always approximately around 29-30.
logger.info("devicetripmessageinfo size :{}",deviceMessageInfoList.size());
deviceMessageInfoList.parallelStream().forEach(msg->{
if(msg!=null && msg.getMessageVO()!=null)
{
DeviceTripMessageInfo currentDevTripMsgInfo =
(DeviceTripMessageInfo) msg.getMessageVO();
if(currentDevTripMsgInfo.getValueMap()!=null)
{mapsList.add(currentDevTripMsgInfo.getValueMap());}
}
});
j
ava.lang.ArrayIndexOutOfBoundsException: null
at java.base/jdk.internal.reflect.GeneratedConstructorAccessor26.newInstance(Unknown Source)
at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
at java.base/java.util.concurrent.ForkJoinTask.getThrowableException(ForkJoinTask.java:603)
at java.base/java.util.concurrent.ForkJoinTask.reportException(ForkJoinTask.java:678)
at java.base/java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:737)
at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:159)
at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:173)
at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:497)
at java.base/java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:661)
at com.*.*.*.*.worker.*.process(*.java:96)
at com.*.jms.consumer.JMSWorker.processList(JMSWorker.java:279)
at com.*.jms.consumer.JMSWorker.process(JMSWorker.java:244)
at com.*.jms.consumer.JMSWorker.processMessages(JMSWorker.java:200)
at com.*.jms.consumer.JMSWorker.run(JMSWorker.java:136)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.ArrayIndexOutOfBoundsException: null

Summary
The problem is that ArrayList is by design not safe for modification by multiple threads concurrently, but the parallel stream is writing to the list from multiple threads. A good solution is to switch to an idiomatic stream implementation:
List msgList = deviceMessageInfoList.parallelStream() // Declare generic type, e.g. List<Map<String, Object>>
.filter(Objects::nonNull)
.map(m -> (DeviceTripMessageInfo) m.getMessageVO())
.filter(Objects::nonNull)
.map(DeviceTripMessageInfo::getValueMap)
.filter(Objects::nonNull)
.collect(Collectors.toUnmodifiableList());
Issue: concurrent modification
The ArrayList Javadocs explain the concurrent modification issue:
Note that this implementation is not synchronized. If multiple threads access an ArrayList instance concurrently, and at least one of the threads modifies the list structurally, it must be synchronized externally. This is typically accomplished by synchronizing on some object that naturally encapsulates the list. If no such object exists, the list should be "wrapped" using the Collections.synchronizedList method. This is best done at creation time, to prevent accidental unsynchronized access to the list
Note that the exception you're seeing is not the only incorrect behavior you might encounter. In my own tests of your code against large lists, the resulting list often contained only some of the elements from the source list.
Note that while switching from a parallel stream to a sequential stream would likely fix the issue in practice, it is dependent on the stream implementation, and not guaranteed by the API. Therefore, such an approach is highly inadvisable, as it could break in future versions of the library. Per the forEach Javadocs:
For any given element, the action may be performed at whatever time and in whatever thread the library chooses. If the action accesses shared state, it is responsible for providing the required synchronization.
Issue: not idiomatic
Aside from the correctness issue, another issue with this approach is that it's not particularly idiomatic to use side effects within stream code. The stream documentation explicitly discourages them.
Side-effects in behavioral parameters to stream operations are, in general, discouraged, as they can often lead to unwitting violations of the statelessness requirement, as well as other thread-safety hazards.
[...]
Many computations where one might be tempted to use side effects can be more safely and efficiently expressed without side-effects, such as using reduction instead of mutable accumulators.
Of particular note, the documentation goes on to describe the exact scenario posted in this question as an inappropriate use of side-effects in a stream:
As an example of how to transform a stream pipeline that inappropriately uses side-effects to one that does not, the following code searches a stream of strings for those matching a given regular expression, and puts the matches in a list.
ArrayList<String> results = new ArrayList<>();
stream.filter(s -> pattern.matcher(s).matches())
.forEach(s -> results.add(s)); // Unnecessary use of side-effects!
This code unnecessarily uses side-effects. If executed in parallel, the non-thread-safety of ArrayList would cause incorrect results, and adding needed synchronization would cause contention, undermining the benefit of parallelism.
Aside: traditional non-stream solution
As an aside, this points to a solution one might use using traditional non-stream code. I will discuss it briefly, since it's helpful to understand traditional solutions to the issue of concurrent list modification. Traditionally, one might replace the ArrayList with either a wrapped syncnhronized version using Collections.synchronizedList or an inherently concurrent collection type such as ConcurrentLinkedQueue. Since these approaches are designed for concurrent insertion, they solve the parallel insert issue, though possibly with additional synchronization contention overhead.
Stream solution
The stream documentation continues on with a replacement for the inappropriate use of side effects:
Furthermore, using side-effects here is completely unnecessary; the forEach() can simply be replaced with a reduction operation that is safer, more efficient, and more amenable to parallelization:
List<String>results =
stream.filter(s -> pattern.matcher(s).matches())
.collect(Collectors.toList()); // No side-effects!
Applying this approach to your code, you get:
List msgList = deviceMessageInfoList.parallelStream() // Declare generic type, e.g. List<Map<String, Object>>
.filter(Objects::nonNull)
.map(m -> (DeviceTripMessageInfo) m.getMessageVO())
.filter(Objects::nonNull)
.map(DeviceTripMessageInfo::getValueMap)
.filter(Objects::nonNull)
.collect(Collectors.toUnmodifiableList());

Even if you change that to a synchronized (or better said a thread-safe List), with your current approach, you still don't have a guaranteed order of how the elements are going to be put in. The documentation, btw, is very clear to discourage such things via forEach, here. Just look-up Side-Effects.
This entire thing can be done in far better way (and easier to read too):
deviceMessageInfoList
.stream()
.parallel()
.filter(Objects::notNull)
.map(x -> x.getMessageVO())
.filter(Objects::notNull)
.map(x -> (DeviceTripMessageInfo) x.getMessageVO())
.map(DeviceTripMessageInfo::getValueMap)
.filter(Objects::notNull)
.collect(Collectors.toList());

does stateful map operation of ordered stream process elements in deterministic way?

I'm reading about java streams API and I encountered the following here:
The operation forEachOrdered processes elements in the order specified by the stream, regardless of whether the stream is executed in serial or parallel. However, when a stream is executed in parallel, the map operation processes elements of the stream specified by the Java runtime and compiler. Consequently, the order in which the lambda expression e -> { parallelStorage.add(e); return e; } adds elements to the List parallelStorage can vary every time the code is run. For deterministic and predictable results, ensure that lambda expression parameters in stream operations are not stateful.
I tested the following code and in fact, it works as mentioned:
public class MapOrdering {
public static void main(String[] args) {
// TODO Auto-generated method stub
List < String > serialStorage = new ArrayList < > ();
System.out.println("Serial stream:");
int j = 0;
List < String > listOfIntegers = new ArrayList();
for (int i = 0; i < 10; i++) listOfIntegers.add(String.valueOf(i));
listOfIntegers.stream().parallel().map(e - > {
serialStorage.add(e.concat(String.valueOf(j)));
return e;
}).forEachOrdered(k - > System.out.println(k));;
/*
// Don't do this! It uses a stateful lambda expression.
.map(e -> { serialStorage.add(e); return e; })*/
for (String s: serialStorage) System.out.println(s);
}
}
output
Serial stream:
0
1
2
3
4
5
6
7
8
9
null
null
80
90
50
40
30
00
questions:
The output changes every time I run this. How do I make sure that the stateful map operation is executed in order.
map is an intermediate operation and it only starts processing elements
until terminal operation commences. Since a terminal operation is
ordered, why is a map operation unordered, and tends to change results
every time when working with stateful operation?

You got lucky to see that serialStorage has all the elements that you think it will, after all you are adding from multiple threads multiple elements to a non-thread-safe collection ArrayList. You could have easily seen nulls or a List that does not have all the elements. But even when you add a List that is thread-safe - there is absolutely no order that you can rely on in that List.
This is explicitly mentioned in the documentation under side-effects, and intermediate operations should be side effect-free.
Basically there are two orderings: processing order (intermediate operations) and encounter order. The last one is preserved (if it is has one to begin with and stream intermediate operations don't break it - for example unordered, sorted).
Processing order is not specified, meaning all intermediate operations will process elements in whatever order they feel like. Encounter order (the one you see from a terminal operation) will preserver the initial order.
But even terminal operations don't have to preserve the initial order, for example forEach vs forEachOrdered or when you collect to a Set; of course read the documentation, it usually states clearly this aspect.

I would like to answer your 2 questions, while adding to this other answer...
output changes everytime i run this. how to write code to process statefull map operation in an ordered way?
Stateful map operations are discouraged and you shouldn't use them, even for sequential streams. If you want that behaviour, you'd better use an imperative approach.
map is intermediate operation and it only starts processing elements until terminal operation commences.since terminal operation is ordered ,why map operation is unordered and tend to change results every time when working with statefull operation?
Only forEachOrdered respects encounter order of elements; intermediate operations (such as map) are not compelled to do so. For a parallel stream, this means that intermediate operations are allowed to be executed in any order by the pipeline, thus taking advantage of parallelism.
However, bear in mind that providing a stateful argument to an intermediate operation, (i.e. a stateful mapper function to the map operation) when the stream is parallel, would require you to manually synchronize the state kept by the stateful argument (i.e. you would need to use a synchronized view of the list, or implement some locking mechanism, etc), but this would in turn affect performance negatively, since (as stated in the docs) you'd risk having contention undermine the parallelism you are seeking to benefit from.
Edit: for a terminal operation like forEachOrdered, parallelism would usually bring little benefit, since many times it needs to do some internal processing to comply with the requirement of respecting encounter order, i.e. buffer the elements.

Java, in which thread are sequential streams executed?

While reading the documentation about streams, I came across the following sentences:
... attempting to access mutable state from behavioral parameters presents you with a bad choice ... if you do not synchronize access to that state, you have a data race and therefore your code is broken ... [1]
If the behavioral parameters do have side-effects ... [there are no] guarantees that different operations on the "same" element within the same stream pipeline are executed in the same thread. [2]
For any given element, the action may be performed at whatever time and in whatever thread the library chooses. [3]
These sentences don't make a distinction between sequential and parallel streams. So my questions are:
In which thread is the pipeline of a sequential stream executed? Is it always the calling thread or is an implementation free to choose any thread?
In which thread is the action parameter of the forEach terminal operation executed if the stream is sequential?
Do I have to use any synchronization when using sequential streams?
[1+2] https://docs.oracle.com/javase/8/docs/api/java/util/stream/package-summary.html
[3] https://docs.oracle.com/javase/8/docs/api/java/util/stream/Stream.html#forEach-java.util.function.Consumer-

This all boils down to what is guaranteed based on the specification, and the fact that a current implementation may have additional behaviors beyond what is guaranteed.
Java Language Architect Brian Goetz made a relevant point regarding specifications in a related question:
Specifications exist to describe the minimal guarantees a caller can depend on, not to describe what the implementation does.
[...]
When a specification says "does not preserve property X", it does not mean that the property X may never be observed; it means the implementation is not obligated to preserve it. [...] (HashSet doesn't promise that iterating its elements preserves the order they were inserted, but that doesn't mean this can't accidentally happen -- you just can't count on it.)
This all means that even if the current implementation happens to have certain behavioral characteristics, they should not be relied upon nor assumed that they will not change in new versions of the library.
Sequential stream pipeline thread
In which thread is the pipeline of a sequential stream executed? Is it always the calling thread or is an implementation free to choose any thread?
Current stream implementations may or may not use the calling thread, and may use one or multiple threads. As none of this is specified by the API, this behavior should not be relied on.
forEach execution thread
In which thread is the action parameter of the forEach terminal operation executed if the stream is sequential?
While current implementations use the existing thread, this cannot be relied on, as the documentation states that the choice of thread is up to the implementation. In fact, there are no guarantees that the elements aren't processed by different threads for different elements, though that is not something the current stream implementation does either.
Per the API:
For any given element, the action may be performed at whatever time and in whatever thread the library chooses.
Note that while the API calls out parallel streams specifically when discussing encounter order, that was clarified by Brian Goetz to clarify the motivation of the behavior, and not that any of the behavior is specific to parallel streams:
The intent of calling out the parallel case explicitly here was pedagogical [...]. However, to a reader who is unaware of parallelism, it would be almost impossible to not assume that forEach would preserve encounter order, so this sentence was added to help clarify the motivation.
Synchronization using sequential streams
Do I have to use any synchronization when using sequential streams?
Current implementations will likely work since they use a single thread for the sequential stream's forEach method. However, as it is not guaranteed by the stream specification, it should not be relied on. Therefore, synchronization should be used as though the methods could be called by multiple threads.
That said, the stream documentation specifically recommends against using side-effects that would require synchronization, and suggest using reduction operations instead of mutable accumulators:
Many computations where one might be tempted to use side effects can be more safely and efficiently expressed without side-effects, such as using reduction instead of mutable accumulators. [...] A small number of stream operations, such as forEach() and peek(), can operate only via side-effects; these should be used with care.
As an example of how to transform a stream pipeline that inappropriately uses side-effects to one that does not, the following code searches a stream of strings for those matching a given regular expression, and puts the matches in a list.
ArrayList<String> results = new ArrayList<>();
stream.filter(s -> pattern.matcher(s).matches())
.forEach(s -> results.add(s)); // Unnecessary use of side-effects!
This code unnecessarily uses side-effects. If executed in parallel, the non-thread-safety of ArrayList would cause incorrect results, and adding needed synchronization would cause contention, undermining the benefit of parallelism. Furthermore, using side-effects here is completely unnecessary; the forEach() can simply be replaced with a reduction operation that is safer, more efficient, and more amenable to parallelization:
List<String>results =
stream.filter(s -> pattern.matcher(s).matches())
.collect(Collectors.toList()); // No side-effects!

Stream's terminal operations are blocking operations. In case there is no parallel excution, the thread that executes the terminal operation runs all the operations in the pipeline.
Definition 1.1. Pipeline is a couple of chained methods.
Definition 1.2. Intermediate operations will be located everywhere in the stream except at the end. They return a stream object and does not execute any operation in the pipeline.
Definition 1.3. Terminal operations will be located only at the end of the stream. They execute the pipeline. They does not return stream object so no other Intermidiate operations or terminal operations can be added after them.
From the first solution we can conclude that the calling thread will execute the action method inside the forEach terminal operation on each element in the calling stream.
Java 8 introduces us the Spliterator interface. It has the capabilities of Iterator but also a set of operations to help performing and spliting a task in parallel.
When calling forEach from primitive streams in sequential execution, the calling thread will invoke the Spliterator.forEachRemaining method:
#Override
public void forEach(IntConsumer action) {
if (!isParallel()) {
adapt(sourceStageSpliterator()).forEachRemaining(action);
}
else {
super.forEach(action);
}
}
You can read more on Spliterator in my tutorial: Part 6 - Spliterator
As long as you don't mutate any shared state between multiple threads in one of the stream operations(and it is forbidden - explained soon), you do not need to use any additional synchronization tool or algorithm when you want to run parallel streams.
Stream operations like reduce use accumulator and combiner functions for executing parallel streams. The streams library by definition forbids mutation. You should avoid it.
There are a lot of definitions in concurrent and parallel programming. I will introduce a set of definitions that will serve us best.
Definition 8.1. Concurrent programming is the ability to solve a task using additional synchronization algorithms.
Definition 8.2. Parallel programming is the ability to solve a task without using additional synchronization algorithms.
You can read more about it in my tutorial: Part 7 - Parallel Streams.

Non-interference exact meaning in Java 8 streams

Does the non-interference requirement for using streams of non-concurrent data structure sources mean that we can't change the state of an element of the data structure during the execution of a stream pipeline (in addition to that we can't change the source data structure itself)? (Question 1)
In the section about non-interference, in the stream package description, its said:
"For most data sources, preventing interference means ensuring that the data source is not modified at all during the execution of the stream pipeline."
This passage does not mention modifying the state of elements?
For example, assuming "shapes" is non-thread-safe collection (such as ArrayList), is the code below considered to have an interference? (Question 2)
shapes.stream()
.filter(s -> s.getColor() == BLUE)
.forEach(s -> s.setColor(RED));
This example is taken from a reliable source (to say the least), so it should be correct.
But what if I changed stream() to be parallelStream(), will it still be safe and correct? (Question 3)
On the other hand, "Mastering Lambdas" by Naftalin Maurice, another reliable source, makes it clear that changing the state (value) of elements by the pipeline operation is indeed interference. From the section about non-interference (3.2.3):
"But the rules for streams forbid any modification of stream sources—including, for example, changing the value of an element— by any thread, not only pipeline operations."
If what's said in the book is correct, does it mean we can't use the Stream API to modify state of elements (using forEach), and have to do that using the regular iterator (or for-each, or Iterable.forEach)? (Question 4)

There's a bigger class of functions called "functions with side effects". The JavaDoc statement is correct and complete: here interference means modifying the mutable source. Another case is stateful expressions: expressions which depend on the application state or change this state. You may read the Parallelism tutorial on Oracle site.
In general you can modify the stream elements themselves and it should not be called as "interference". Beware though if you have the same mutable object produced several times by the stream source (for example, using Collections.nCopies(10, new MyMutableObject()).parallelStream(). While it's ensured that the same stream element is not processed concurrently by several threads, if your stream produces the same element twice, you may surely have a race condition when modifying it in the forEach, for example.
So while stateful expressions are sometimes smell and should be used with care and avoided if there's a stateless alternative, they are probably ok if they don't interfere with the stream source. When the stateless expression is required (for example, in Stream.map method), it's specially mentioned in the API docs. In forEach documentation only non-interference is required.
So back to your questions:
Question 1: no we can change the element state, and it's not called interference (though called statefullness)
Question 2: no it has no interference unless you have repeating objects in your stream source)
Question 3: you can safely use parallelStream() there
Question 4: no, you can use Stream API in this case.

Modifying the state of an object stored in a data structure is different from reassigning an element of a data structure.
When the other writes "changing the value of an element" presumably they mean as if assigning a new object to an index of an existing List.
From your link:
It is best to avoid any side-effects in the lambdas passed to stream methods. While some side-effects, such as debugging statements that print out values are usually safe, accessing mutable state from these lambdas can cause data races or surprising behavior since lambdas may be executed from many threads simultaneously, and may not see elements in their natural encounter order. Non-interference includes not only not interfering with the source, but not interfering with other lambdas; this sort of interference can arise when one lambda modifies mutable state and another lambda reads it.
As long as the non-interference requirement is satisfied, we can execute parallel operations safely and with predictable results even on non-thread-safe sources such as ArrayList.
This pertains specifically to parallelism and is no different than any other concurrent programming. Modifying state can cause issues with visibility amongst threads.

Mutable parameters in Java 8 Streams

Looking at this question: How to dynamically do filtering in Java 8?
The issue is to truncate a stream after a filter has been executed. I cant use limit because I dont know how long the list is after the filter. So, could we count the slements after the filter?
So, I thought I could create a class that counts and pass the stream through a map.The code is in this answer.
I created a class that counts but leave the elements unaltered, I use a Function here, to avoid to use the lambdas I used in the other answer:
class DoNothingButCount<T > implements Function<T, T> {
AtomicInteger i;
public DoNothingButCount() {
i = new AtomicInteger(0);
}
public T apply(T p) {
i.incrementAndGet();
return p;
}
}
So my Stream was finally:
persons.stream()
.filter(u -> u.size > 12)
.filter(u -> u.weitght > 12)
.map(counter)
.sorted((p1, p2) -> p1.age - p2.age)
.collect(Collectors.toList())
.stream()
.limit((int) (counter.i.intValue() * 0.5))
.sorted((p1, p2) -> p2.length - p1.length)
.limit((int) (counter.i.intValue() * 0.5 * 0.2)).forEach((p) -> System.out.println(p));
But my question is about another part of the my example.
collect(Collectors.toList()).stream().
If I remove that line the consequences are that the counter is ZERO when I try to execute limit. I am somehow cheating the "efectively final" requirement by using a mutable object.
I may be wrong, but I iunderstand that the stream is build first, so if we used mutable objects to pass parameters to any of the steps in the stream these will be taken when the stream is created.
My question is, if my assumption is right, why is this needed? The stream (if non parallel) could be pass sequentially through all the steps (filter, map..) so this limitation is not needed.

Short answer
My question is, if my assumption is right, why is this needed? The
stream (if non parallel) could be pass sequentially through all the
steps (filter, map..) so this limitation is not needed.
As you already know, for parallel streams, this sounds pretty obvious: this limitation is needed because otherwise the result would be non deterministic.
Regarding non-parallel streams, it is not possible because of their current design: each item is only visited once. If streams did work as you suggest, they would do each step on the whole collection before going to the next step, which would probably have an impact on performance, I think. I suspect that's why the language designers made that decision.
Why it technically does not work without collect
You already know that, but here is the explanation for other readers.
From the docs:
Streams are lazy; computation on the source data is only performed
when the terminal operation is initiated, and source elements are
consumed only as needed.
Every intermediate operation of Stream, such as filter() or limit() is actually just some kind of setter that initializes the stream's options.
When you call a terminal operation, such as forEach(), collect() or count(), that's when the computation happens, processing items following the pipeline previously built.
This is why limit()'s argument is evaluated before a single item has gone through the first step of the stream. That's why you need to end the stream with a terminal operation, and start a new one with the limit() you'll then know.
More detailed answer about why not allow it for parallel streams
Let your stream pipeline be step X > step Y > step Z.
We want parallel treatment of our items. Therefore, if we allow step Y's behavior to depend on the items that already went through X, then Y is non deterministic. This is because at the moment an item arrives at step Y, the set of items that have already gone through X won't be the same across multiple executions (because of the threading).
More detailed answer about why not allow it for non-parallel streams
A stream, by definition, is used to process the items in a flow. You could think of a non-parallel stream as follows: one single item goes through all the steps, then the next one goes through all the steps, etc. In fact, the doc says it all:
The elements of a stream are only visited once during the life of a
stream. Like an Iterator, a new stream must be generated to revisit
the same elements of the source.
If streams didn't work like this, it wouldn't be any better than just do each step on the whole collection before going to the next step. That would actually allow mutable parameters in non-parallel streams, but it would probably have a performance impact (because we would iterate multiple times over the collection). Anyway, their current behavior does not allow what you want.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.