Does the non-interference requirement for using streams of non-concurrent data structure sources mean that we can't change the state of an element of the data structure during the execution of a stream pipeline (in addition to that we can't change the source data structure itself)? (Question 1)
In the section about non-interference, in the stream package description, its said:
"For most data sources, preventing interference means ensuring that the data source is not modified at all during the execution of the stream pipeline."
This passage does not mention modifying the state of elements?
For example, assuming "shapes" is non-thread-safe collection (such as ArrayList), is the code below considered to have an interference? (Question 2)
shapes.stream()
.filter(s -> s.getColor() == BLUE)
.forEach(s -> s.setColor(RED));
This example is taken from a reliable source (to say the least), so it should be correct.
But what if I changed stream() to be parallelStream(), will it still be safe and correct? (Question 3)
On the other hand, "Mastering Lambdas" by Naftalin Maurice, another reliable source, makes it clear that changing the state (value) of elements by the pipeline operation is indeed interference. From the section about non-interference (3.2.3):
"But the rules for streams forbid any modification of stream sources—including, for example, changing the value of an element— by any thread, not only pipeline operations."
If what's said in the book is correct, does it mean we can't use the Stream API to modify state of elements (using forEach), and have to do that using the regular iterator (or for-each, or Iterable.forEach)? (Question 4)
There's a bigger class of functions called "functions with side effects". The JavaDoc statement is correct and complete: here interference means modifying the mutable source. Another case is stateful expressions: expressions which depend on the application state or change this state. You may read the Parallelism tutorial on Oracle site.
In general you can modify the stream elements themselves and it should not be called as "interference". Beware though if you have the same mutable object produced several times by the stream source (for example, using Collections.nCopies(10, new MyMutableObject()).parallelStream(). While it's ensured that the same stream element is not processed concurrently by several threads, if your stream produces the same element twice, you may surely have a race condition when modifying it in the forEach, for example.
So while stateful expressions are sometimes smell and should be used with care and avoided if there's a stateless alternative, they are probably ok if they don't interfere with the stream source. When the stateless expression is required (for example, in Stream.map method), it's specially mentioned in the API docs. In forEach documentation only non-interference is required.
So back to your questions:
Question 1: no we can change the element state, and it's not called interference (though called statefullness)
Question 2: no it has no interference unless you have repeating objects in your stream source)
Question 3: you can safely use parallelStream() there
Question 4: no, you can use Stream API in this case.
Modifying the state of an object stored in a data structure is different from reassigning an element of a data structure.
When the other writes "changing the value of an element" presumably they mean as if assigning a new object to an index of an existing List.
From your link:
It is best to avoid any side-effects in the lambdas passed to stream methods. While some side-effects, such as debugging statements that print out values are usually safe, accessing mutable state from these lambdas can cause data races or surprising behavior since lambdas may be executed from many threads simultaneously, and may not see elements in their natural encounter order. Non-interference includes not only not interfering with the source, but not interfering with other lambdas; this sort of interference can arise when one lambda modifies mutable state and another lambda reads it.
As long as the non-interference requirement is satisfied, we can execute parallel operations safely and with predictable results even on non-thread-safe sources such as ArrayList.
This pertains specifically to parallelism and is no different than any other concurrent programming. Modifying state can cause issues with visibility amongst threads.
Related
Does filter chaining change the outcome if i use parallelStream() instead of stream() ?
I tried with a few thousand records, and the output appeared consistent over a few iterations. But since this involves threads,(and I could not find enough relevant material that talks about this combination) I want to make doubly sure that parallel stream does not impact the output of filter chaining in any way. Example code:
List<Element> list = myList.parallelStream()
.filter(element -> element.getId() > 10)
.filter(element -> element.getName().contains("something"))
.collect(Collectors.toList());
Short answer: No.
The filter operation as documented expects a non-interferening and stateless predicate to apply to each element to determine if it should be included as part of the new stream.
Few aspects that you shall consider for that are -
With an exception to concurrent collections(what do you choose as myList in the existing code to be) -
For most data sources, preventing interference means ensuring that the
data source is not modified at all during the execution of the stream
pipeline.
The state of the data sources (myList and its elements within your filter operations are not mutated)
Note also that attempting to access mutable state from behavioral
parameters presents you with a bad choice with respect to safety and
performance;
Moreover, think around it, what is it in your filter operation that would be impacted by multiple threads. Given the current code, nothing functionally, as long as both the operations are executed, you would get a consistent result regardless of the thread(s) executing them.
I understand that modifying an ArrayList makes it not thread-safe.
➠ But if the ArrayList is not being modified, perhaps protected by a call to Collections.unmodifiableList, is calling ArrayList::get thread-safe?
For example, can an ArrayList be passed to a Java Stream for parallel-processing of its elements?
But if the ArrayList is not being modified is calling ArrayList::get thread-safe?
No it is not thread-safe.
The problems arise if you do something like the following:
Thread A creates and populates list.
Thread A passes reference to list to thread B (without a happens before relationship)
Thread B calls get on the list.
Unless there is a proper happens before chain between 1 and 3, thread B may see stale values ... occasionally ... on some platforms under certain work loads.
There are ways to address this. For example, if thread A starts thread B after step 1, there will be a happens before. Similarly, there will be happens before if A passes the list reference to B via properly synchronized setter / getter calls or a volatile variable.
But the bottom line is that (just) not changing the list is not sufficient to make it thread-safe.
... perhaps protected by a call to Collections.unmodifiableList
The creation of the Collections.unmodifiableList should provide the happens before relationship ... provided that you access the list via the wrapper not directly via ArrayList::get.
For example, can an ArrayList be passed to a Java Stream for parallel-processing of its elements?
That's a specific situation. The stream mechanisms will provide the happens before relationship. Provided they are used as intended. It is complicated.
This comes from the Spliterator interface javadoc.
"Despite their obvious utility in parallel algorithms, spliterators are not expected to be thread-safe; instead, implementations of parallel algorithms using spliterators should ensure that the spliterator is only used by one thread at a time. This is generally easy to attain via serial thread-confinement, which often is a natural consequence of typical parallel algorithms that work by recursive decomposition. A thread calling trySplit() may hand over the returned Spliterator to another thread, which in turn may traverse or further split that Spliterator. The behaviour of splitting and traversal is undefined if two or more threads operate concurrently on the same spliterator. If the original thread hands a spliterator off to another thread for processing, it is best if that handoff occurs before any elements are consumed with tryAdvance(), as certain guarantees (such as the accuracy of estimateSize() for SIZED spliterators) are only valid before traversal has begun."
In other words, thread safety is a joint responsibility of the Spliterator implementation and Stream implementation.
The simple way to think about this is that "magic happens" ... because if it didn't then parallel streams would be unusable.
But note that the Spliterator is not necessarily using ArrayList::get at all.
Thread safety is only a concern, as you stated when values can change between the threads. If the elements aren't being added or removed, the object remains the same and all threads can easily operate on it. This is the same for most objects in Java.
You may be able to get away with adding to an ArrayList across threads as seen here but I wouldn't bank on it.
No, ArrayList.get() is not inherently thread-safe just because it does not modify the List. You still need something to create a happens-before relationship between each get() and each method invocation that does modify the list.
Suppose, however, that you instantiate and populate the list first, and then perform multiple get()s, never modifying it again, or at least not until after some synchronization point following all the get()s. You do not then need mutual synchronization between the various get()s, and you may be able to obtain cheap synchronization between the get()s and the end of the initialization phase. This is effectively the situation you will have with an otherwise non-shared List that you provide as input to a parallel stream computation.
While reading the documentation about streams, I came across the following sentences:
... attempting to access mutable state from behavioral parameters presents you with a bad choice ... if you do not synchronize access to that state, you have a data race and therefore your code is broken ... [1]
If the behavioral parameters do have side-effects ... [there are no] guarantees that different operations on the "same" element within the same stream pipeline are executed in the same thread. [2]
For any given element, the action may be performed at whatever time and in whatever thread the library chooses. [3]
These sentences don't make a distinction between sequential and parallel streams. So my questions are:
In which thread is the pipeline of a sequential stream executed? Is it always the calling thread or is an implementation free to choose any thread?
In which thread is the action parameter of the forEach terminal operation executed if the stream is sequential?
Do I have to use any synchronization when using sequential streams?
[1+2] https://docs.oracle.com/javase/8/docs/api/java/util/stream/package-summary.html
[3] https://docs.oracle.com/javase/8/docs/api/java/util/stream/Stream.html#forEach-java.util.function.Consumer-
This all boils down to what is guaranteed based on the specification, and the fact that a current implementation may have additional behaviors beyond what is guaranteed.
Java Language Architect Brian Goetz made a relevant point regarding specifications in a related question:
Specifications exist to describe the minimal guarantees a caller can depend on, not to describe what the implementation does.
[...]
When a specification says "does not preserve property X", it does not mean that the property X may never be observed; it means the implementation is not obligated to preserve it. [...] (HashSet doesn't promise that iterating its elements preserves the order they were inserted, but that doesn't mean this can't accidentally happen -- you just can't count on it.)
This all means that even if the current implementation happens to have certain behavioral characteristics, they should not be relied upon nor assumed that they will not change in new versions of the library.
Sequential stream pipeline thread
In which thread is the pipeline of a sequential stream executed? Is it always the calling thread or is an implementation free to choose any thread?
Current stream implementations may or may not use the calling thread, and may use one or multiple threads. As none of this is specified by the API, this behavior should not be relied on.
forEach execution thread
In which thread is the action parameter of the forEach terminal operation executed if the stream is sequential?
While current implementations use the existing thread, this cannot be relied on, as the documentation states that the choice of thread is up to the implementation. In fact, there are no guarantees that the elements aren't processed by different threads for different elements, though that is not something the current stream implementation does either.
Per the API:
For any given element, the action may be performed at whatever time and in whatever thread the library chooses.
Note that while the API calls out parallel streams specifically when discussing encounter order, that was clarified by Brian Goetz to clarify the motivation of the behavior, and not that any of the behavior is specific to parallel streams:
The intent of calling out the parallel case explicitly here was pedagogical [...]. However, to a reader who is unaware of parallelism, it would be almost impossible to not assume that forEach would preserve encounter order, so this sentence was added to help clarify the motivation.
Synchronization using sequential streams
Do I have to use any synchronization when using sequential streams?
Current implementations will likely work since they use a single thread for the sequential stream's forEach method. However, as it is not guaranteed by the stream specification, it should not be relied on. Therefore, synchronization should be used as though the methods could be called by multiple threads.
That said, the stream documentation specifically recommends against using side-effects that would require synchronization, and suggest using reduction operations instead of mutable accumulators:
Many computations where one might be tempted to use side effects can be more safely and efficiently expressed without side-effects, such as using reduction instead of mutable accumulators. [...] A small number of stream operations, such as forEach() and peek(), can operate only via side-effects; these should be used with care.
As an example of how to transform a stream pipeline that inappropriately uses side-effects to one that does not, the following code searches a stream of strings for those matching a given regular expression, and puts the matches in a list.
ArrayList<String> results = new ArrayList<>();
stream.filter(s -> pattern.matcher(s).matches())
.forEach(s -> results.add(s)); // Unnecessary use of side-effects!
This code unnecessarily uses side-effects. If executed in parallel, the non-thread-safety of ArrayList would cause incorrect results, and adding needed synchronization would cause contention, undermining the benefit of parallelism. Furthermore, using side-effects here is completely unnecessary; the forEach() can simply be replaced with a reduction operation that is safer, more efficient, and more amenable to parallelization:
List<String>results =
stream.filter(s -> pattern.matcher(s).matches())
.collect(Collectors.toList()); // No side-effects!
Stream's terminal operations are blocking operations. In case there is no parallel excution, the thread that executes the terminal operation runs all the operations in the pipeline.
Definition 1.1. Pipeline is a couple of chained methods.
Definition 1.2. Intermediate operations will be located everywhere in the stream except at the end. They return a stream object and does not execute any operation in the pipeline.
Definition 1.3. Terminal operations will be located only at the end of the stream. They execute the pipeline. They does not return stream object so no other Intermidiate operations or terminal operations can be added after them.
From the first solution we can conclude that the calling thread will execute the action method inside the forEach terminal operation on each element in the calling stream.
Java 8 introduces us the Spliterator interface. It has the capabilities of Iterator but also a set of operations to help performing and spliting a task in parallel.
When calling forEach from primitive streams in sequential execution, the calling thread will invoke the Spliterator.forEachRemaining method:
#Override
public void forEach(IntConsumer action) {
if (!isParallel()) {
adapt(sourceStageSpliterator()).forEachRemaining(action);
}
else {
super.forEach(action);
}
}
You can read more on Spliterator in my tutorial: Part 6 - Spliterator
As long as you don't mutate any shared state between multiple threads in one of the stream operations(and it is forbidden - explained soon), you do not need to use any additional synchronization tool or algorithm when you want to run parallel streams.
Stream operations like reduce use accumulator and combiner functions for executing parallel streams. The streams library by definition forbids mutation. You should avoid it.
There are a lot of definitions in concurrent and parallel programming. I will introduce a set of definitions that will serve us best.
Definition 8.1. Concurrent programming is the ability to solve a task using additional synchronization algorithms.
Definition 8.2. Parallel programming is the ability to solve a task without using additional synchronization algorithms.
You can read more about it in my tutorial: Part 7 - Parallel Streams.
The collect operation in Java 8 Stream API is defined as a mutable reduction that can be safely executed in parallel, even if the resulting Collection is not thread safe.
Can we say the same about the Stream.toArray() method?
Is this method a mutable reduction that is thread safe even if the Stream is a parallel stream and the resulting array is not thread safe?
According to https://docs.oracle.com/javase/8/docs/api/java/util/stream/Stream.html#toArray-java.util.function.IntFunction- it should be since they create
... any additional arrays that might be required for a partitioned execution or for resizing
And in deduction, since Stream.toArray() is nothing but stream.toArray(Object[]::new) it should hold for Stream.toArray() too.
The toArray operation is a kind of Mutable Reduction, though not implemented exactly like the collect operation. Instead, it’s more efficient in some cases. But these are unspecified implementation details. The documentation of toArray itself does not say anything about how it is implemented, so regarding your question, you have to resort to more general statements:
package documentation, “Parallelism”
… All streams operations can execute either in serial or in parallel.
…
Except for operations identified as explicitly nondeterministic, such as findAny(), whether a stream executes sequentially or in parallel should not change the result of the computation.
…
Most stream operations accept parameters that describe user-specified behavior, which are often lambda expressions. To preserve correct behavior, these behavioral parameters must be non-interfering, and in most cases must be stateless.
So regardless of how it’s implemented, toArray is a stream operation that can run in parallel and since it’s not specified to have any restrictions or nondeterministic behavior, it will produce the same (correct) result as in sequential mode. That’s the only thing you have to think about.
But if you use the overloaded method toArray(IntFunction), it’s your responsibility to provide an appropriate function, e.g. SomeType[]::new is always non-interfering and stateless so the form toArray(SomeType[]::new) is also thread safe.
I read in a book that read in ConcurrentHashmap does not guarantee the most recently updated state and it can sometimes gives the closer value. Is this correct?
I have read its javadocs and many blogs which seems to say otherwise (i.e. it is accurate).
Which one is true?
From ConcurrentHashMap javadoc:
Retrieval operations (including get) generally do not block, so may overlap with update operations (including put and remove). Retrievals reflect the results of the most recently completed update operations holding upon their onset. For aggregate operations such as putAll and clear, concurrent retrievals may reflect insertion or removal of only some entries. Similarly, Iterators and Enumerations return elements reflecting the state of the hash table at some point at or since the creation of the iterator/enumeration. They do not throw ConcurrentModificationException. However, iterators are designed to be used by only one thread at a time.
One point to highlight is that an Iterator is not going to reflect changes that are made after the Iterator is created. So if you need to iterate over the values of the map, while other threads are adding to the map, and you care about the most current state of the map, then using a ConcurrentHashMap may not be the implementation of map that you need.
Intuitively, a ConcurrentHashMap should behave like a set of volatile variables; map keys being the variable addresses. get(key) and put(key, value) should behave like volatile read and write.
That is not explicitly stated in the document. However, I would strongly believe that it is the case. Otherwise there will be a lot of unexpected, surprising behaviors that undermine application logic.
I don't think Doug Lea would do that to us. To be sure, someone please ask him on concurrency-interest mailing list.
Suppose it does obey the volatile semantics, we can reason based on Java Memory Model -
All volatile reads and writes form a single total order. This can be considered a pseudo time line, where reads/writes are points on it.
A volatile read sees the immediate preceding volatile write, and sees only that write. "Preceding" here is according to the pseudo time line.
The pseudo time line can differ from "real" time line. However, in theory, a volatile write cannot be infinitely postponed on the pseudo time line. And, in pracitce, two time lines are pretty close.
Therefore, we can be pretty sure that, a volatile write should become visible "very quickly" to reads.
That is the nature of concurrency. Any concurrent collection may give you the old value in a field if a currently pending write operation is not safely finished. That is in all cases what you have to expect. The collections made for concurrency will not give you corrupted values or break if you access them from multiple threads, but they may give you old values, if writing is not done.