The collect operation in Java 8 Stream API is defined as a mutable reduction that can be safely executed in parallel, even if the resulting Collection is not thread safe.
Can we say the same about the Stream.toArray() method?
Is this method a mutable reduction that is thread safe even if the Stream is a parallel stream and the resulting array is not thread safe?
According to https://docs.oracle.com/javase/8/docs/api/java/util/stream/Stream.html#toArray-java.util.function.IntFunction- it should be since they create
... any additional arrays that might be required for a partitioned execution or for resizing
By deduction, since Stream.toArray() is nothing but stream.toArray(Object[]::new), it should hold for Stream.toArray() too.
The toArray operation is a kind of Mutable Reduction, though not implemented exactly like the collect operation. Instead, it’s more efficient in some cases. But these are unspecified implementation details. The documentation of toArray itself does not say anything about how it is implemented, so regarding your question, you have to resort to more general statements:
package documentation, “Parallelism”
… All streams operations can execute either in serial or in parallel.
…
Except for operations identified as explicitly nondeterministic, such as findAny(), whether a stream executes sequentially or in parallel should not change the result of the computation.
…
Most stream operations accept parameters that describe user-specified behavior, which are often lambda expressions. To preserve correct behavior, these behavioral parameters must be non-interfering, and in most cases must be stateless.
So regardless of how it’s implemented, toArray is a stream operation that can run in parallel and since it’s not specified to have any restrictions or nondeterministic behavior, it will produce the same (correct) result as in sequential mode. That’s the only thing you have to think about.
But if you use the overloaded method toArray(IntFunction), it’s your responsibility to provide an appropriate function, e.g. SomeType[]::new is always non-interfering and stateless so the form toArray(SomeType[]::new) is also thread safe.
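As a minimal sketch of that guarantee (the class and method names here are made up for illustration), the following compares the array built from a sequential stream with the one built from a parallel stream, using a stateless array constructor reference as the generator:

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Hypothetical demo class; not part of any library.
class ToArrayDemo {
    // Builds the array from a sequential stream.
    static Integer[] sequential(int n) {
        return IntStream.range(0, n).boxed().toArray(Integer[]::new);
    }

    // Builds the array from a parallel stream. The generator Integer[]::new
    // is stateless and non-interfering, so it is safe to use here.
    static Integer[] parallel(int n) {
        return IntStream.range(0, n).parallel().boxed().toArray(Integer[]::new);
    }

    public static void main(String[] args) {
        // Both arrays hold the same elements in the same (encounter) order.
        System.out.println(Arrays.equals(sequential(10_000), parallel(10_000)));
    }
}
```

The source here is an ordered range, so both arrays are expected to be equal element by element.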
I understand that modifying an ArrayList makes it not thread-safe.
➠ But if the ArrayList is not being modified, perhaps protected by a call to Collections.unmodifiableList, is calling ArrayList::get thread-safe?
For example, can an ArrayList be passed to a Java Stream for parallel-processing of its elements?
But if the ArrayList is not being modified is calling ArrayList::get thread-safe?
No, it is not thread-safe.
The problems arise if you do something like the following:
1. Thread A creates and populates list.
2. Thread A passes reference to list to thread B (without a happens-before relationship).
3. Thread B calls get on the list.
Unless there is a proper happens-before chain between steps 1 and 3, thread B may see stale values ... occasionally ... on some platforms under certain workloads.
There are ways to address this. For example, if thread A starts thread B after step 1, there will be a happens-before relationship. Similarly, there will be one if A passes the list reference to B via properly synchronized setter / getter calls or a volatile variable.
But the bottom line is that (just) not changing the list is not sufficient to make it thread-safe.
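One way to get the required happens-before chain can be sketched as follows. This is only an illustration with made-up names: the list is fully populated before the reader thread is started, and the result is read only after join() returns.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not part of any library.
class SafeHandoffDemo {
    // Populates a list in the calling thread, then reads it from a second thread.
    static long sumInOtherThread(int n) {
        List<Integer> list = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            list.add(i);
        }
        long[] result = new long[1];
        Thread reader = new Thread(() -> {
            long sum = 0;
            for (int i = 0; i < list.size(); i++) {
                sum += list.get(i); // read-only access to the list
            }
            result[0] = sum;
        });
        reader.start(); // start() happens-before every action in the started thread
        try {
            reader.join(); // the reader's actions happen-before join() returns
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for reader", e);
        }
        return result[0];
    }

    public static void main(String[] args) {
        System.out.println(sumInOtherThread(100)); // 0 + 1 + ... + 99 = 4950
    }
}
```

Without start() and join() (or another synchronization action) between the writes and the reads, the same code would have no such guarantee.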
... perhaps protected by a call to Collections.unmodifiableList
Creating the Collections.unmodifiableList wrapper should provide the happens-before relationship ... provided that you access the list via the wrapper and not directly via ArrayList::get.
For example, can an ArrayList be passed to a Java Stream for parallel-processing of its elements?
That's a specific situation. The stream mechanisms will provide the happens-before relationship, provided they are used as intended. It is complicated.
This comes from the Spliterator interface javadoc.
"Despite their obvious utility in parallel algorithms, spliterators are not expected to be thread-safe; instead, implementations of parallel algorithms using spliterators should ensure that the spliterator is only used by one thread at a time. This is generally easy to attain via serial thread-confinement, which often is a natural consequence of typical parallel algorithms that work by recursive decomposition. A thread calling trySplit() may hand over the returned Spliterator to another thread, which in turn may traverse or further split that Spliterator. The behaviour of splitting and traversal is undefined if two or more threads operate concurrently on the same spliterator. If the original thread hands a spliterator off to another thread for processing, it is best if that handoff occurs before any elements are consumed with tryAdvance(), as certain guarantees (such as the accuracy of estimateSize() for SIZED spliterators) are only valid before traversal has begun."
In other words, thread safety is a joint responsibility of the Spliterator implementation and Stream implementation.
The simple way to think about this is that "magic happens" ... because if it didn't then parallel streams would be unusable.
But note that the Spliterator is not necessarily using ArrayList::get at all.
Thread safety is only a concern, as you stated, when values can change between the threads. If the elements aren't being added or removed, the object remains the same and all threads can easily operate on it. This is the same for most objects in Java.
You may be able to get away with adding to an ArrayList across threads as seen here but I wouldn't bank on it.
No, ArrayList.get() is not inherently thread-safe just because it does not modify the List. You still need something to create a happens-before relationship between each get() and each method invocation that does modify the list.
Suppose, however, that you instantiate and populate the list first, and then perform multiple get()s, never modifying it again, or at least not until after some synchronization point following all the get()s. You do not then need mutual synchronization between the various get()s, and you may be able to obtain cheap synchronization between the get()s and the end of the initialization phase. This is effectively the situation you will have with an otherwise non-shared List that you provide as input to a parallel stream computation.
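The "populate first, then only read" situation described above might look like the sketch below. The class and method names are hypothetical; the point is that the list is never modified once the stream pipeline has been started.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not part of any library.
class ParallelReadDemo {
    static long parallelSum(int n) {
        // Initialization phase: only the calling thread touches the list.
        List<Integer> list = new ArrayList<>();
        for (int i = 1; i <= n; i++) {
            list.add(i);
        }
        // Read-only phase: the list is handed to a parallel stream and
        // is not modified while the pipeline runs.
        return list.parallelStream().mapToLong(Integer::longValue).sum();
    }

    public static void main(String[] args) {
        System.out.println(parallelSum(1000)); // 1 + 2 + ... + 1000 = 500500
    }
}
```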
While reading the documentation about streams, I came across the following sentences:
... attempting to access mutable state from behavioral parameters presents you with a bad choice ... if you do not synchronize access to that state, you have a data race and therefore your code is broken ... [1]
If the behavioral parameters do have side-effects ... [there are no] guarantees that different operations on the "same" element within the same stream pipeline are executed in the same thread. [2]
For any given element, the action may be performed at whatever time and in whatever thread the library chooses. [3]
These sentences don't make a distinction between sequential and parallel streams. So my questions are:
In which thread is the pipeline of a sequential stream executed? Is it always the calling thread or is an implementation free to choose any thread?
In which thread is the action parameter of the forEach terminal operation executed if the stream is sequential?
Do I have to use any synchronization when using sequential streams?
[1+2] https://docs.oracle.com/javase/8/docs/api/java/util/stream/package-summary.html
[3] https://docs.oracle.com/javase/8/docs/api/java/util/stream/Stream.html#forEach-java.util.function.Consumer-
This all boils down to what is guaranteed based on the specification, and the fact that a current implementation may have additional behaviors beyond what is guaranteed.
Java Language Architect Brian Goetz made a relevant point regarding specifications in a related question:
Specifications exist to describe the minimal guarantees a caller can depend on, not to describe what the implementation does.
[...]
When a specification says "does not preserve property X", it does not mean that the property X may never be observed; it means the implementation is not obligated to preserve it. [...] (HashSet doesn't promise that iterating its elements preserves the order they were inserted, but that doesn't mean this can't accidentally happen -- you just can't count on it.)
This all means that even if the current implementation happens to have certain behavioral characteristics, they should not be relied upon, nor should it be assumed that they will not change in new versions of the library.
Sequential stream pipeline thread
In which thread is the pipeline of a sequential stream executed? Is it always the calling thread or is an implementation free to choose any thread?
Current stream implementations may or may not use the calling thread, and may use one or multiple threads. As none of this is specified by the API, this behavior should not be relied on.
forEach execution thread
In which thread is the action parameter of the forEach terminal operation executed if the stream is sequential?
While current implementations use the existing thread, this cannot be relied on, as the documentation states that the choice of thread is up to the implementation. In fact, there are no guarantees that the elements aren't processed by different threads for different elements, though that is not something the current stream implementation does either.
Per the API:
For any given element, the action may be performed at whatever time and in whatever thread the library chooses.
Note that while the API calls out parallel streams specifically when discussing encounter order, Brian Goetz clarified that this was done to explain the motivation for the behavior, not because any of the behavior is specific to parallel streams:
The intent of calling out the parallel case explicitly here was pedagogical [...]. However, to a reader who is unaware of parallelism, it would be almost impossible to not assume that forEach would preserve encounter order, so this sentence was added to help clarify the motivation.
Synchronization using sequential streams
Do I have to use any synchronization when using sequential streams?
Current implementations will likely work since they use a single thread for the sequential stream's forEach method. However, as it is not guaranteed by the stream specification, it should not be relied on. Therefore, synchronization should be used as though the methods could be called by multiple threads.
That said, the stream documentation specifically recommends against using side-effects that would require synchronization, and suggests using reduction operations instead of mutable accumulators:
Many computations where one might be tempted to use side effects can be more safely and efficiently expressed without side-effects, such as using reduction instead of mutable accumulators. [...] A small number of stream operations, such as forEach() and peek(), can operate only via side-effects; these should be used with care.
As an example of how to transform a stream pipeline that inappropriately uses side-effects to one that does not, the following code searches a stream of strings for those matching a given regular expression, and puts the matches in a list.
ArrayList<String> results = new ArrayList<>();
stream.filter(s -> pattern.matcher(s).matches())
.forEach(s -> results.add(s)); // Unnecessary use of side-effects!
This code unnecessarily uses side-effects. If executed in parallel, the non-thread-safety of ArrayList would cause incorrect results, and adding needed synchronization would cause contention, undermining the benefit of parallelism. Furthermore, using side-effects here is completely unnecessary; the forEach() can simply be replaced with a reduction operation that is safer, more efficient, and more amenable to parallelization:
List<String> results =
stream.filter(s -> pattern.matcher(s).matches())
.collect(Collectors.toList()); // No side-effects!
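A self-contained version of the reduction-based form quoted above could look like this. It is only a sketch: the class name, method name and sample data are made up, and a parallel stream is used to show that no synchronization is needed in user code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical demo class; not part of any library.
class CollectMatchesDemo {
    // Collects all strings that fully match the pattern, without side-effects.
    static List<String> matches(List<String> input, Pattern pattern) {
        return input.parallelStream()
                    // A new Matcher is created per element; Pattern itself is immutable.
                    .filter(s -> pattern.matcher(s).matches())
                    .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("apple", "banana", "avocado", "cherry");
        System.out.println(matches(input, Pattern.compile("a.*"))); // [apple, avocado]
    }
}
```

Because the source list has a defined encounter order, the collected list keeps that order even in parallel mode.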
Stream's terminal operations are blocking operations. If there is no parallel execution, the thread that executes the terminal operation runs all the operations in the pipeline.
Definition 1.1. A pipeline is a chain of methods.
Definition 1.2. Intermediate operations can be located anywhere in the stream except at the end. They return a stream object and do not execute any operation in the pipeline.
Definition 1.3. Terminal operations can be located only at the end of the stream. They execute the pipeline. They do not return a stream object, so no other intermediate or terminal operations can be added after them.
From the first solution we can conclude that the calling thread will execute the action method inside the forEach terminal operation on each element in the calling stream.
Java 8 introduces the Spliterator interface. It has the capabilities of Iterator, plus a set of operations that help with splitting a task and performing it in parallel.
When calling forEach from primitive streams in sequential execution, the calling thread will invoke the Spliterator.forEachRemaining method:
@Override
public void forEach(IntConsumer action) {
    if (!isParallel()) {
        adapt(sourceStageSpliterator()).forEachRemaining(action);
    }
    else {
        super.forEach(action);
    }
}
You can read more on Spliterator in my tutorial: Part 6 - Spliterator
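To make the two Spliterator capabilities mentioned above concrete, here is a small sketch with hypothetical names. It splits a list's spliterator once with trySplit() and traverses both parts with forEachRemaining(), all in the calling thread. A parallel stream could hand the two parts to different threads instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Spliterator;

// Hypothetical demo class; not part of any library.
class SpliteratorDemo {
    // Splits the source once and traverses each part with forEachRemaining.
    // Returns the elements of the split-off part followed by the remaining part.
    static List<String> splitAndTraverse(List<String> source) {
        Spliterator<String> rest = source.spliterator();
        Spliterator<String> prefix = rest.trySplit(); // null if the source cannot be split
        List<String> seen = new ArrayList<>();
        if (prefix != null) {
            prefix.forEachRemaining(seen::add);
        }
        rest.forEachRemaining(seen::add);
        return seen;
    }

    public static void main(String[] args) {
        List<String> source = new ArrayList<>(Arrays.asList("a", "b", "c", "d"));
        System.out.println(splitAndTraverse(source));
    }
}
```

For an ordered spliterator such as the one of ArrayList, the part returned by trySplit() covers a prefix of the elements, so visiting it first reproduces the original order.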
As long as you don't mutate any shared state between multiple threads in one of the stream operations (which is forbidden, as explained shortly), you do not need to use any additional synchronization tool or algorithm when you want to run parallel streams.
Stream operations like reduce use accumulator and combiner functions for executing parallel streams. The streams library by definition forbids mutation. You should avoid it.
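As an illustration of the accumulator and combiner roles, the sketch below (hypothetical class and method names) sums the lengths of strings with the three-argument reduce. Neither function mutates shared state, so the pipeline can run in parallel without extra synchronization.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; not part of any library.
class ReduceDemo {
    static int totalLength(List<String> words) {
        return words.parallelStream()
                    .reduce(0,                                          // identity
                            (partial, word) -> partial + word.length(), // accumulator
                            Integer::sum);                              // combiner merges partial results
    }

    public static void main(String[] args) {
        System.out.println(totalLength(Arrays.asList("AAA", "BBB", "CCC"))); // 9
    }
}
```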
There are a lot of definitions in concurrent and parallel programming. I will introduce a set of definitions that will serve us best.
Definition 8.1. Concurrent programming is the ability to solve a task using additional synchronization algorithms.
Definition 8.2. Parallel programming is the ability to solve a task without using additional synchronization algorithms.
You can read more about it in my tutorial: Part 7 - Parallel Streams.
I have this code:
ArrayList<Detector> detectors;
detectors.stream().anyMatch(d -> d.detectRead(impendingInstruction, fieldName));
But I would also like to have guarantees that:
The list is processed in order, from the first element to the last;
As soon as an element returns true, evaluation stops immediately
Is this always true, or if not, is it at least for all common JDK implementations?
Your question implies a concern about side-effects of stream operations, otherwise you wouldn't care about order or immediate termination. From the Javadoc:
Side-effects
Side-effects in behavioral parameters to stream operations are, in general, discouraged, as they can often lead to unwitting violations of the statelessness requirement, as well as other thread-safety hazards.
If the behavioral parameters do have side-effects, unless explicitly stated, there are no guarantees as to the visibility of those side-effects to other threads, nor are there any guarantees that different operations on the "same" element within the same stream pipeline are executed in the same thread. Further, the ordering of those effects may be surprising. Even when a pipeline is constrained to produce a result that is consistent with the encounter order of the stream source (for example, IntStream.range(0,5).parallel().map(x -> x*2).toArray() must produce [0, 2, 4, 6, 8]), no guarantees are made as to the order in which the mapper function is applied to individual elements, or in what thread any behavioral parameter is executed for a given element.
So the contract seems to be that you might get away with it but it's not guaranteed to work.
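If in-order evaluation and immediate termination are hard requirements, one option (an alternative to the stream call, not something the Javadoc prescribes) is a plain loop, which makes both properties explicit. The helper name anyMatchInOrder is made up for this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical demo class; not part of any library.
class OrderedMatchDemo {
    // Tests elements from first to last and stops at the first match.
    static <T> boolean anyMatchInOrder(List<T> items, Predicate<? super T> test) {
        for (T item : items) {
            if (test.test(item)) {
                return true; // no later element is evaluated
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Integer> visited = new ArrayList<>();
        boolean found = anyMatchInOrder(Arrays.asList(1, 2, 3, 4), n -> {
            visited.add(n); // side-effect: record which elements were tested
            return n == 2;
        });
        System.out.println(found + " " + visited); // true [1, 2]
    }
}
```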
Does the non-interference requirement for using streams of non-concurrent data structure sources mean that we can't change the state of an element of the data structure during the execution of a stream pipeline (in addition to that we can't change the source data structure itself)? (Question 1)
In the section about non-interference, in the stream package description, it's said:
"For most data sources, preventing interference means ensuring that the data source is not modified at all during the execution of the stream pipeline."
This passage does not mention modifying the state of elements?
For example, assuming "shapes" is non-thread-safe collection (such as ArrayList), is the code below considered to have an interference? (Question 2)
shapes.stream()
.filter(s -> s.getColor() == BLUE)
.forEach(s -> s.setColor(RED));
This example is taken from a reliable source (to say the least), so it should be correct.
But what if I changed stream() to be parallelStream(), will it still be safe and correct? (Question 3)
On the other hand, "Mastering Lambdas" by Maurice Naftalin, another reliable source, makes it clear that changing the state (value) of elements by the pipeline operation is indeed interference. From the section about non-interference (3.2.3):
"But the rules for streams forbid any modification of stream sources—including, for example, changing the value of an element— by any thread, not only pipeline operations."
If what's said in the book is correct, does it mean we can't use the Stream API to modify state of elements (using forEach), and have to do that using the regular iterator (or for-each, or Iterable.forEach)? (Question 4)
There's a bigger class of functions called "functions with side effects". The JavaDoc statement is correct and complete: here interference means modifying the mutable source. Another case is stateful expressions: expressions which depend on the application state or change this state. You may read the Parallelism tutorial on Oracle site.
In general you can modify the stream elements themselves, and that should not be called "interference". Beware, though, if the same mutable object is produced several times by the stream source (for example, using Collections.nCopies(10, new MyMutableObject()).parallelStream()). While it's ensured that the same stream element is not processed concurrently by several threads, if your stream produces the same element twice, you may well have a race condition when modifying it in forEach, for example.
So while stateful expressions are sometimes a code smell and should be used with care and avoided if there's a stateless alternative, they are probably OK if they don't interfere with the stream source. When a stateless expression is required (for example, in the Stream.map method), it's specifically mentioned in the API docs. In the forEach documentation, only non-interference is required.
So back to your questions:
Question 1: no, we can change the element state, and it's not called interference (though it is called statefulness)
Question 2: no, it has no interference (unless you have repeating objects in your stream source)
Question 3: you can safely use parallelStream() there
Question 4: no, you can use Stream API in this case.
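A runnable sketch of the recoloring example with parallelStream() is given below. Shape and Color are stand-ins invented for this illustration, and the sketch assumes every list entry is a distinct object (so no element is produced twice) and that the list itself is never modified while the pipeline runs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; Shape and Color are invented for this sketch.
class RecolorDemo {
    enum Color { RED, BLUE, GREEN }

    static final class Shape {
        private Color color;
        Shape(Color color) { this.color = color; }
        Color getColor() { return color; }
        void setColor(Color color) { this.color = color; }
    }

    // Creates n distinct shapes, alternating BLUE and GREEN.
    static List<Shape> sample(int n) {
        List<Shape> shapes = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            shapes.add(new Shape(i % 2 == 0 ? Color.BLUE : Color.GREEN));
        }
        return shapes;
    }

    // Recolors every blue shape red; only element state changes, not the list.
    static void recolor(List<Shape> shapes) {
        shapes.parallelStream()
              .filter(s -> s.getColor() == Color.BLUE)
              .forEach(s -> s.setColor(Color.RED));
    }

    static long count(List<Shape> shapes, Color color) {
        return shapes.stream().filter(s -> s.getColor() == color).count();
    }

    public static void main(String[] args) {
        List<Shape> shapes = sample(100);
        recolor(shapes);
        System.out.println(count(shapes, Color.BLUE) + " " + count(shapes, Color.RED)); // 0 50
    }
}
```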
Modifying the state of an object stored in a data structure is different from reassigning an element of a data structure.
When the author writes "changing the value of an element", presumably they mean something like assigning a new object to an index of an existing List.
From your link:
It is best to avoid any side-effects in the lambdas passed to stream methods. While some side-effects, such as debugging statements that print out values are usually safe, accessing mutable state from these lambdas can cause data races or surprising behavior since lambdas may be executed from many threads simultaneously, and may not see elements in their natural encounter order. Non-interference includes not only not interfering with the source, but not interfering with other lambdas; this sort of interference can arise when one lambda modifies mutable state and another lambda reads it.
As long as the non-interference requirement is satisfied, we can execute parallel operations safely and with predictable results even on non-thread-safe sources such as ArrayList.
This pertains specifically to parallelism and is no different than any other concurrent programming. Modifying state can cause issues with visibility amongst threads.
I understand that these methods differ in the order of execution, but in all my tests I cannot get a different execution order.
Example:
System.out.println("forEach Demo");
Stream.of("AAA","BBB","CCC").forEach(s->System.out.println("Output:"+s));
System.out.println("forEachOrdered Demo");
Stream.of("AAA","BBB","CCC").forEachOrdered(s->System.out.println("Output:"+s));
Output:
forEach Demo
Output:AAA
Output:BBB
Output:CCC
forEachOrdered Demo
Output:AAA
Output:BBB
Output:CCC
Please provide examples when 2 methods will produce different outputs.
Stream.of("AAA","BBB","CCC").parallel().forEach(s->System.out.println("Output:"+s));
Stream.of("AAA","BBB","CCC").parallel().forEachOrdered(s->System.out.println("Output:"+s));
The second line will always output
Output:AAA
Output:BBB
Output:CCC
whereas the first one is not guaranteed to, since the order is not kept. forEachOrdered will process the elements of the stream in the order specified by its source, regardless of whether the stream is sequential or parallel.
Quoting from forEach Javadoc:
The behavior of this operation is explicitly nondeterministic. For parallel stream pipelines, this operation does not guarantee to respect the encounter order of the stream, as doing so would sacrifice the benefit of parallelism.
Whereas the forEachOrdered Javadoc states (emphasis mine):
Performs an action for each element of this stream, in the encounter order of the stream if the stream has a defined encounter order.
Although forEach is shorter and looks prettier, I'd suggest using forEachOrdered in every place where order matters, to specify this explicitly. For sequential streams, forEach seems to respect the order, and even the stream API's internal code uses forEach (for a stream which is known to be sequential) where it's semantically necessary to use forEachOrdered! Nevertheless, you may later decide to change your stream to parallel, and your code will be broken. Also, when you use forEachOrdered, the reader of your code sees the message: "the order matters here". Thus it documents your code better.
Note also that for parallel streams, forEach is not only executed in nondeterministic order, but it can also be executed simultaneously in different threads for different elements (which is not possible with forEachOrdered).
Finally, both forEach and forEachOrdered are rarely useful. In most cases you actually need to produce some result, not just a side-effect, so operations like reduce or collect are more suitable. Expressing an operation that is a reduction by nature via forEach is usually considered bad style.
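The difference between the two operations on a parallel stream can be observed with a sketch like the following (hypothetical names). Each method records the order in which the action is performed; a synchronized list is used because forEach may invoke the action from several threads at once.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical demo class; not part of any library.
class ForEachOrderDemo {
    // Records the order in which forEachOrdered performs the action.
    static List<String> viaForEachOrdered(List<String> source) {
        List<String> seen = Collections.synchronizedList(new ArrayList<>());
        source.parallelStream().forEachOrdered(seen::add);
        return seen;
    }

    // Records the order in which forEach performs the action.
    static List<String> viaForEach(List<String> source) {
        List<String> seen = Collections.synchronizedList(new ArrayList<>());
        source.parallelStream().forEach(seen::add);
        return seen;
    }

    public static void main(String[] args) {
        List<String> source = Arrays.asList("AAA", "BBB", "CCC", "DDD", "EEE");
        System.out.println(viaForEachOrdered(source)); // always the source order
        System.out.println(viaForEach(source));        // same elements, order may vary
    }
}
```

Only the first result can be relied on to match the source order; the second contains the same elements but may differ between runs.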
forEach() method performs an action for each element of this stream. For parallel stream, this operation does not guarantee to maintain order of the stream.
forEachOrdered() method performs an action for each element of this stream, guaranteeing that each element is processed in encounter order for streams that have a defined encounter order.
Take the example below:
String str = "sushil mittal";
System.out.println("****forEach without using parallel****");
str.chars().forEach(s -> System.out.print((char) s));
System.out.println("\n****forEach with using parallel****");
str.chars().parallel().forEach(s -> System.out.print((char) s));
System.out.println("\n****forEachOrdered with using parallel****");
str.chars().parallel().forEachOrdered(s -> System.out.print((char) s));
Output:
****forEach without using parallel****
sushil mittal
****forEach with using parallel****
mihul issltat
****forEachOrdered with using parallel****
sushil mittal
We may lose the benefits of parallelism if we use forEachOrdered() with parallel Streams.
As we know, with a parallel stream the elements are printed in parallel if we use the forEach() method, so the order is not fixed. With forEachOrdered(), the order is fixed even in parallel streams.
Stream.of("AAA","BBB","CCC").forEachOrdered(s->System.out.println("Output:"+s));
Output:AAA
Output:BBB
Output:CCC