I am in the process of learning Java 8 lambda expressions and would like to ask about the following piece of Java code involving the peek method of the Stream interface that I have come across.
When I run the program in my IDE it gives no output. I was expecting it to print 2, 4, 6.
import java.util.Arrays;
import java.util.List;

public class Test_Q3 {

    public Test_Q3() {
    }

    public static void main(String[] args) {
        List<Integer> values = Arrays.asList(1, 2, 3);
        values.stream()
              .map(n -> n * 2)
              .peek(System.out::print)
              .count();
    }
}
I assume you are running this under Java 9? You are not altering the SIZED property of the stream, so there is no need to execute either map or peek at all.
In other words, all you care about is count as the final result, but in the meantime you do not alter the initial size of the List in any way (via filter, for example, or distinct). This is an optimization done in the stream implementation.
By the way, even if you add a dummy filter, this will show what you expect:
values.stream()
      .map(n -> n * 2)
      .peek(System.out::print)
      .filter(x -> true)
      .count();
Here are some relevant quotes from the Javadoc of the Stream interface:
A stream implementation is permitted significant latitude in optimizing the computation of the result. For example, a stream implementation is free to elide operations (or entire stages) from a stream pipeline -- and therefore elide invocation of behavioral parameters -- if it can prove that it would not affect the result of the computation. This means that side-effects of behavioral parameters may not always be executed and should not be relied upon, unless otherwise specified (such as by the terminal operations forEach and forEachOrdered). (For a specific example of such an optimization, see the API note documented on the count() operation. For more detail, see the side-effects section of the stream package documentation.)
And more specifically from the Javadoc of the count() method:
API Note:
An implementation may choose to not execute the stream pipeline (either sequentially or in parallel) if it is capable of computing the count directly from the stream source. In such cases no source elements will be traversed and no intermediate operations will be evaluated. Behavioral parameters with side-effects, which are strongly discouraged except for harmless cases such as debugging, may be affected. For example, consider the following stream:
List<String> l = Arrays.asList("A", "B", "C", "D");
long count = l.stream().peek(System.out::println).count();
The number of elements covered by the stream source, a List, is known and the intermediate operation, peek, does not inject into or remove elements from the stream (as may be the case for flatMap or filter operations). Thus the count is the size of the List and there is no need to execute the pipeline and, as a side-effect, print out the list elements.
These quotes only appear in the Javadoc of Java 9, so this must be a new optimization.
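If the side effect is what you actually want, a terminal operation that is guaranteed to invoke its action for every element is the reliable choice; here is a minimal sketch of the question's pipeline using forEach instead of count:

import java.util.Arrays;
import java.util.List;

public class PeekVsForEach {
    public static void main(String[] args) {
        List<Integer> values = Arrays.asList(1, 2, 3);

        // forEach is a terminal operation whose action is specified to run for
        // every element, so this prints 246 on Java 8 and Java 9+ alike.
        values.stream()
              .map(n -> n * 2)
              .forEach(System.out::print);
    }
}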
Can anyone point to official Java documentation that describes how many times a Stream will invoke each "non-interfering and stateless" intermediate operation for each element?
For example:
Arrays.asList("1", "2", "3", "4").stream()
.filter(s -> check(s))
.forEach(s -> System.out.println(s));
public boolean check(Object o) {
return true;
}
The above currently invokes the check method 4 times.
Is it possible that, in current or future versions of the JDK, the check method gets executed more or fewer times than the number of elements in the stream created from a List or any other standard Java API?
This does not have to do with the source of the stream, but rather with the terminal operation and the optimizations done in the stream implementation itself. For example:
Stream.of(1, 2, 3, 4)
      .map(x -> x + 1)
      .count();
Since Java 9, map will not be executed even once here.
Or:
someTreeSet.stream()
           .sorted()
           .findFirst();
sorted might not get executed at all, since the source is a TreeSet and getting the first element is trivial; whether this is actually implemented inside the Stream API is a different question.
So the real answer here is: it depends, but I can't imagine an operation that would get executed more times than the number of elements in the source.
From the documentation:
Laziness-seeking. Many stream operations, such as filtering, mapping, or duplicate removal, can be implemented lazily, exposing opportunities for optimization. For example, "find the first String with three consecutive vowels" need not examine all the input strings. Stream operations are divided into intermediate (Stream-producing) operations and terminal (value- or side-effect-producing) operations. Intermediate operations are always lazy.
By that virtue, because filter is an intermediate operation that lazily produces a new Stream, it will only ever invoke the filter predicate once per element as it rebuilds the stream.
The only way your method could be invoked a different number of times is if the stream were somehow mutated between states, which, given that nothing in a stream actually runs until the terminal operation, would realistically only be possible due to a bug upstream.
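As a quick sanity check (an observation about current behavior, not a specification guarantee), you can count the predicate invocations yourself; here is a small sketch using an AtomicInteger:

import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;

public class PredicateInvocationCount {
    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();

        Arrays.asList("1", "2", "3", "4").stream()
              .filter(s -> { calls.incrementAndGet(); return true; })
              .forEach(s -> System.out.println(s));

        // With forEach as the terminal operation the predicate runs once per
        // element here, so this prints 4. A different terminal operation
        // (e.g. count() on Java 9+) may skip the predicate entirely.
        System.out.println("filter invoked " + calls.get() + " times");
    }
}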
Is there any guarantee that operations on a sequential and ordered Stream are processed in encounter order?
I mean, if I have code like this:
IntStream.range(0, 5)
         .map(i -> {
             myFunction(i);
             return i * 2;
         })
         .boxed()
         .collect(toList());
is there a guarantee that it will perform the myFunction() calls in the encounter order of the generated range?
I found draft JavaDocs for the Stream class which explicitly state this:
For sequential stream pipelines, all operations are performed in the encounter order of the pipeline source, if the pipeline source has a defined encounter order.
but in the official JavaDocs this line was removed. It now discusses encounter order only for selected methods. The Package java.util.stream doc in the Side-effects paragraph states:
Even when a pipeline is constrained to produce a result that is consistent with the encounter order of the stream source (for example, IntStream.range(0,5).parallel().map(x -> x*2).toArray() must produce [0, 2, 4, 6, 8]), no guarantees are made as to the order in which the mapper function is applied to individual elements, or in what thread any behavioral parameter is executed for a given element.
but it says nothing about sequential streams, and the example is for a parallel stream (my understanding is that it's true for both sequential and parallel streams, but this is the part I'm not sure about).
On the other hand, it also states in the Ordering section:
If a stream is ordered, most operations are constrained to operate on the elements in their encounter order; if the source of a stream is a List containing [1, 2, 3], then the result of executing map(x -> x*2) must be [2, 4, 6]. However, if the source has no defined encounter order, then any permutation of the values [2, 4, 6] would be a valid result.
but this time it starts with "operate on the elements", while the example is about the resulting stream, so I'm not sure they are taking side effects into account, and side effects are really what this question is about.
I think we can learn a lot from the fact that this explicit sentence has been removed. This question seems to be closely related to the question “Does Stream.forEach respect the encounter order of sequential streams?”. The answer from Brian Goetz basically says that despite the fact that there’s no scenario where the order is ignored by the Stream’s current implementation when forEach is invoked on a sequential Stream, forEach has the freedom to ignore the encounter order even for sequential Streams per specification.
Now consider the following section of Stream’s class documentation:
To perform a computation, stream operations are composed into a stream pipeline. A stream pipeline consists of a source (which might be an array, a collection, a generator function, an I/O channel, etc), zero or more intermediate operations (which transform a stream into another stream, such as filter(Predicate)), and a terminal operation (which produces a result or side-effect, such as count() or forEach(Consumer)). Streams are lazy; computation on the source data is only performed when the terminal operation is initiated, and source elements are consumed only as needed.
Since it is the terminal operation which determines whether elements are needed and whether they are needed in the encounter order, a terminal action’s freedom to ignore the encounter order also implies consuming, hence processing, the elements in an arbitrary order.
Note that not only forEach can do that. A Collector can report an UNORDERED characteristic, e.g. Collectors.toSet() does not depend on the encounter order. It's obvious that an operation like count() also doesn't depend on the order; in Java 9 it may even return without processing any elements. Think of IntStream#sum() for another example.
In the past, the implementation was too eager in propagating an unordered characteristic up the stream; see "Is this a bug in Files.lines(), or am I misunderstanding something about parallel streams?", where the terminal operation affected the outcome of a skip step. That is why the current implementation is reluctant about such optimizations, to avoid similar bugs, but that doesn't preclude the reappearance of such optimizations, implemented with more care…
So at the moment it’s hard to imagine how an implementation could ever gain a performance benefit from exploiting the freedom of unordered evaluations in a sequential Stream, but, as stated in the forEach-related question, that doesn’t imply any guarantees.
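For the UNORDERED point above, you can inspect a collector's characteristics directly; a small sketch (the printed set contents may vary between JDK versions):

import java.util.stream.Collectors;

public class CollectorCharacteristics {
    public static void main(String[] args) {
        // toSet() reports UNORDERED, so an implementation is free to ignore
        // the encounter order while accumulating into it.
        System.out.println(Collectors.toSet().characteristics());
        // typically something like [UNORDERED, IDENTITY_FINISH]

        // toList() does not report UNORDERED.
        System.out.println(Collectors.toList().characteristics());
        // typically [IDENTITY_FINISH]
    }
}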
The Java API documentation states that the combiner parameter of the collect method must be:
an associative, non-interfering, stateless function for combining two values, which must be compatible with the accumulator function
A combiner is a BiConsumer<R,R> that receives two parameters of type R and returns void. But the documentation does not state whether we should combine the elements into the first or the second parameter.
For instance, the following example may give different results depending on whether the combination is m1.addAll(m2) or m2.addAll(m1):
List<String> res = LongStream
        .rangeClosed(1, 1_000_000)
        .parallel()
        .mapToObj(n -> "" + n)
        .collect(ArrayList::new, ArrayList::add, (m1, m2) -> m1.addAll(m2));
I know that in this case we could simply use a method reference, such as ArrayList::addAll. Yet there are cases where a lambda is required and we must combine the items in the correct order, otherwise we could get an inconsistent result when processing in parallel.
Is this claimed in any part of the Java 8 API documentation? Or it really doesn't matter?
Of course, it matters, as when you use m2.addAll(m1) instead of m1.addAll(m2), it doesn’t just change the order of elements, but completely breaks the operation. Since a BiConsumer doesn’t return a result, you have no control over which object the caller will use as the result and since the caller will use the first one, modifying the second instead will cause data loss.
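A quick sketch that demonstrates the data loss (how much is lost varies from run to run, because it depends on how the parallel stream splits the work):

import java.util.ArrayList;
import java.util.List;
import java.util.stream.LongStream;

public class WrongCombinerDemo {
    public static void main(String[] args) {
        // Correct: fold the second container into the first one.
        List<String> good = LongStream.rangeClosed(1, 1_000_000).parallel()
                .mapToObj(n -> "" + n)
                .collect(ArrayList::new, ArrayList::add, (m1, m2) -> m1.addAll(m2));

        // Wrong: the second container is modified, but the caller keeps the first one,
        // so everything merged into m2 is silently thrown away.
        List<String> bad = LongStream.rangeClosed(1, 1_000_000).parallel()
                .mapToObj(n -> "" + n)
                .collect(ArrayList::new, ArrayList::add, (m1, m2) -> m2.addAll(m1));

        System.out.println(good.size()); // 1000000
        System.out.println(bad.size());  // usually far fewer elements
    }
}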
There is a hint if you look at the accumulator function, which has the type BiConsumer<R,? super T>; in other words, it can't do anything other than store the element of type T, provided as the second argument, into the container of type R, provided as the first argument.
If you look at the documentation of Collector, which uses a BinaryOperator as combiner function, hence allows the combiner to decide which argument to return (or even an entirely different result instance), you find:
The associativity constraint says that splitting the computation must produce an equivalent result. That is, for any input elements t1 and t2, the results r1 and r2 in the computation below must be equivalent:
A a1 = supplier.get();
accumulator.accept(a1, t1);
accumulator.accept(a1, t2);
R r1 = finisher.apply(a1); // result without splitting
A a2 = supplier.get();
accumulator.accept(a2, t1);
A a3 = supplier.get();
accumulator.accept(a3, t2);
R r2 = finisher.apply(combiner.apply(a2, a3)); // result with splitting
So if we assume that the accumulator is applied in encounter order, the combiner has to combine the first and second argument in left-to-right order to produce an equivalent result.
Now, the three-arg version of Stream.collect has a slightly different signature, using a BiConsumer as combiner exactly for supporting method references like ArrayList::addAll. Assuming consistency throughout all these operations and considering the purpose of this signature change, we can safely assume that it has to be the first argument which is the container to modify.
But it seems that this is a late change and the documentation hasn’t adapted accordingly. If you look at the Mutable reduction section of the package documentation, you will find that it has been adapted to show the actual Stream.collect’s signature and usage examples, but repeats exactly the same definition regarding the associativity constraint as shown above, despite the fact that finisher.apply(combiner.apply(a2, a3)) doesn’t work if combiner is a BiConsumer…
The documentation issue has been reported as JDK-8164691 and addressed in Java 9. The new documentation says:
combiner - an associative, non-interfering, stateless function that accepts two partial result containers and merges them, which must be compatible with the accumulator function. The combiner function must fold the elements from the second result container into the first result container.
It seems that this is not explicitly stated in the documentation. However, there's an ordering concept in the Streams API. A stream can be either ordered or unordered. It may be unordered from the very beginning if the source spliterator is unordered (for example, if the stream source is a HashSet), or it may become unordered if the user explicitly used the unordered() operation. If the stream is ordered, then the collection procedure should also be stable, thus, I guess, it's assumed that for ordered streams the combiner receives the arguments in strict order. However, it's not guaranteed for an unordered stream.
When would you use collect() vs reduce()? Does anyone have good, concrete examples of when it's definitely better to go one way or the other?
Javadoc mentions that collect() is a mutable reduction.
Given that it's a mutable reduction, I assume it requires synchronization (internally) which, in turn, can be detrimental to performance. Presumably reduce() is more readily parallelizable at the cost of having to create a new data structure for return after every step in the reduce.
The above statements are guesswork however and I'd love an expert to chime in here.
reduce is a "fold" operation, it applies a binary operator to each element in the stream where the first argument to the operator is the return value of the previous application and the second argument is the current stream element.
collect is an aggregation operation where a "collection" is created and each element is "added" to that collection. Collections in different parts of the stream are then added together.
The document you linked gives the reason for having two different approaches:
If we wanted to take a stream of strings and concatenate them into a single long string, we could achieve this with ordinary reduction:
String concatenated = strings.reduce("", String::concat)
We would get the desired result, and it would even work in parallel. However, we might not be happy about the performance! Such an implementation would do a great deal of string copying, and the run time would be O(n^2) in the number of characters. A more performant approach would be to accumulate the results into a StringBuilder, which is a mutable container for accumulating strings. We can use the same technique to parallelize mutable reduction as we do with ordinary reduction.
So the point is that the parallelisation is the same in both cases, but in the reduce case we apply the function to the stream elements themselves, whereas in the collect case we apply the function to a mutable container.
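Here is a minimal sketch of both approaches from the quoted passage, assuming a small List<String> named strings:

import java.util.Arrays;
import java.util.List;

public class ConcatVsCollect {
    public static void main(String[] args) {
        List<String> strings = Arrays.asList("a", "b", "c");

        // Ordinary (immutable) reduction: every step copies the intermediate
        // String, which is O(n^2) in the total number of characters.
        String concatenated = strings.stream().reduce("", String::concat);

        // Mutable reduction: elements are appended into a StringBuilder per
        // segment and the segments are merged, avoiding the repeated copying.
        String collected = strings.stream()
                .collect(StringBuilder::new, StringBuilder::append, StringBuilder::append)
                .toString();

        System.out.println(concatenated + " " + collected); // abc abc
    }
}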
The reason is simply that:
collect() can only work with mutable result objects.
reduce() is designed to work with immutable result objects.
"reduce() with immutable" example
public class Employee {

    private Integer salary;

    public Employee(String aSalary) {
        this.salary = Integer.valueOf(aSalary);
    }

    public Integer getSalary() {
        return this.salary;
    }
}
@Test
public void testReduceWithImmutable() {
    List<Employee> list = new LinkedList<>();
    list.add(new Employee("1"));
    list.add(new Employee("2"));
    list.add(new Employee("3"));

    Integer sum = list.stream()
                      .map(Employee::getSalary)
                      .reduce(0, (Integer a, Integer b) -> Integer.sum(a, b));

    assertEquals(Integer.valueOf(6), sum);
}
"collect() with mutable" example
E.g. if you would like to manually calculate a sum using collect(), it cannot work with an immutable type like BigDecimal, but only with a mutable container such as MutableInt from org.apache.commons.lang.mutable. See:
public class Employee {

    private MutableInt salary;

    public Employee(String aSalary) {
        this.salary = new MutableInt(aSalary);
    }

    public MutableInt getSalary() {
        return this.salary;
    }
}
@Test
public void testCollectWithMutable() {
    List<Employee> list = new LinkedList<>();
    list.add(new Employee("1"));
    list.add(new Employee("2"));

    MutableInt sum = list.stream().collect(
            MutableInt::new,
            (MutableInt container, Employee employee) ->
                    container.add(employee.getSalary().intValue()),
            MutableInt::add);

    assertEquals(new MutableInt(3), sum);
}
This works because the accumulator container.add(employee.getSalary().intValue()) is not supposed to return a new object with the result, but to change the state of the mutable container of type MutableInt.
If you would like to use BigDecimal for the container instead, you could not use the collect() method, because container.add(employee.getSalary()) would not change the container, since BigDecimal is immutable.
(Apart from this, BigDecimal::new would not work as a supplier, because BigDecimal has no no-argument constructor.)
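For completeness, here is a sketch of how an immutable value such as BigDecimal would typically be summed with reduce() instead (reusing the shape of the salary example, with the salaries inlined for brevity):

import java.math.BigDecimal;
import java.util.Arrays;
import java.util.List;

public class ReduceWithBigDecimal {
    public static void main(String[] args) {
        List<BigDecimal> salaries = Arrays.asList(
                new BigDecimal("1"), new BigDecimal("2"), new BigDecimal("3"));

        // BigDecimal is immutable, so each step produces a new value,
        // which is exactly what reduce() is designed for.
        BigDecimal sum = salaries.stream().reduce(BigDecimal.ZERO, BigDecimal::add);

        System.out.println(sum); // 6
    }
}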
The normal reduction is meant to combine two immutable values such as int, double, etc. and produce a new one; it’s an immutable reduction. In contrast, the collect method is designed to mutate a container to accumulate the result it’s supposed to produce.
To illustrate the problem, let's suppose you want to achieve Collectors.toList() using a simple reduction like
List<Integer> numbers = stream.reduce(
        new ArrayList<Integer>(),
        (List<Integer> l, Integer e) -> {
            l.add(e);
            return l;
        },
        (List<Integer> l1, List<Integer> l2) -> {
            l1.addAll(l2);
            return l1;
        });
This is the equivalent of Collectors.toList(). However, in this case you mutate the List<Integer>. As we know, ArrayList is not thread-safe, and it is not safe to add or remove values from it while iterating, so when the accumulator updates the list or the combiner tries to merge the lists you may get a ConcurrentModificationException, an ArrayIndexOutOfBoundsException, or some other exception (especially when run in parallel), because you are mutating the list by accumulating (adding) the integers into it. If you want to make this thread-safe, you would need to pass a new list each time, which would impair performance.
In contrast, Collectors.toList() works in a similar fashion, but it guarantees thread safety when you accumulate the values into the list. From the documentation for the collect method:
Performs a mutable reduction operation on the elements of this stream using a Collector. If the stream is parallel, and the Collector is concurrent, and either the stream is unordered or the collector is unordered, then a concurrent reduction will be performed. When executed in parallel, multiple intermediate results may be instantiated, populated, and merged so as to maintain isolation of mutable data structures. Therefore, even when executed in parallel with non-thread-safe data structures (such as ArrayList), no additional synchronization is needed for a parallel reduction.
So to answer your question:
When would you use collect() vs reduce()?
If you have immutable values such as ints, doubles, or Strings, then normal reduction works just fine. However, if you have to reduce your values into, say, a List (a mutable data structure), then you need to use mutable reduction with the collect method.
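In practice that mutable reduction is what Collectors.toList() (or the three-argument collect shown in the Stream.collect Javadoc) does for you; a minimal sketch:

import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class CollectIntoList {
    public static void main(String[] args) {
        // The built-in collector takes care of the parallel merging.
        List<Integer> viaCollector = Stream.of(1, 2, 3, 4).parallel()
                .collect(Collectors.toList());

        // Equivalent three-argument form: supplier, accumulator, combiner.
        // Each parallel segment gets its own ArrayList, so no external
        // synchronization is needed even though ArrayList is not thread-safe.
        List<Integer> viaThreeArgs = Stream.of(1, 2, 3, 4).parallel()
                .collect(ArrayList::new, ArrayList::add, ArrayList::addAll);

        System.out.println(viaCollector + " " + viaThreeArgs); // [1, 2, 3, 4] [1, 2, 3, 4]
    }
}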
Let the stream be a <- b <- c <- d
In reduction,
you will have ((a # b) # c) # d
where # is that interesting operation that you would like to do.
In collection,
your collector will have some kind of collecting structure K.
K consumes a.
K then consumes b.
K then consumes c.
K then consumes d.
At the end, you ask K what the final result is.
K then gives it to you.
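A tiny concrete version of that picture:

import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class FoldVsCollect {
    public static void main(String[] args) {
        // Reduction: ((1 # 2) # 3) # 4, with # being addition here.
        int sum = Stream.of(1, 2, 3, 4).reduce(0, Integer::sum);

        // Collection: the List plays the role of K and consumes 1, 2, 3, 4 in turn.
        List<Integer> k = Stream.of(1, 2, 3, 4).collect(Collectors.toList());

        System.out.println(sum + " " + k); // 10 [1, 2, 3, 4]
    }
}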
They are very different in their potential memory footprint at runtime. While collect() collects and puts all the data into a collection, reduce() explicitly asks you to specify how to reduce the data that made it through the stream.
For example, if you want to read some data from a file, process it, and put it into some database, you might end up with java stream code similar to this:
streamDataFromFile(file)
        .map(data -> processData(data))
        .map(result -> database.save(result))
        .collect(Collectors.toList());
In this case, we use collect() to force Java to stream the data through and save the results into the database. Without collect() the data is never read and never stored.
This code happily generates a java.lang.OutOfMemoryError: Java heap space runtime error, if the file size is large enough or the heap size is low enough. The obvious reason is that it tries to stack all the data that made it through the stream (and, in fact, has already been stored in the database) into the resulting collection and this blows up the heap.
However, if you replace collect() with reduce(), it won't be a problem anymore, as the latter will reduce and discard the data that made it through.
In the presented example, just replace collect() with an equivalent reduce:
.reduce(0L, (aLong, result) -> aLong, (aLong1, aLong2) -> aLong1);
You do not even need to make the calculation depend on the result, because Java is not a pure FP (functional programming) language and cannot optimize away data that is not used at the bottom of the stream, due to possible side effects.
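Here is a self-contained sketch of that idea; the upper-case mapping and the println stand in for the hypothetical processData and database.save calls from the example above:

import java.util.stream.Stream;

public class ReduceDiscardsData {
    public static void main(String[] args) {
        long processed = Stream.of("a", "b", "c")
                .map(data -> data.toUpperCase())                                           // stand-in for processData
                .map(result -> { System.out.println("saved " + result); return result; }) // stand-in for database.save
                // Keep only a running count; the elements themselves are
                // discarded, so nothing piles up on the heap.
                .reduce(0L, (count, result) -> count + 1, Long::sum);

        System.out.println(processed + " items processed");
    }
}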
Here is a code example:
List<Integer> list = Arrays.asList(1, 2, 3, 4, 5, 6, 7);
int sum = list.stream().reduce((x, y) -> {
    System.out.println(String.format("x=%d,y=%d", x, y));
    return x + y;
}).get();
System.out.println(sum);
Here is the execution result:
x=1,y=2
x=3,y=3
x=6,y=4
x=10,y=5
x=15,y=6
x=21,y=7
28
The reduce function takes two parameters: the first is the previous return value in the stream, and the second is the current value in the stream. It sums the first value and the current value, and that sum becomes the first value in the next calculation.
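The same sum can also be written with an identity value, in which case there is no Optional to unwrap and the very first accumulator call already has both arguments; a minimal variant of the example above:

import java.util.Arrays;
import java.util.List;

public class ReduceWithIdentity {
    public static void main(String[] args) {
        List<Integer> list = Arrays.asList(1, 2, 3, 4, 5, 6, 7);

        // The accumulator starts from the identity 0 and folds each element in: 28.
        int sum = list.stream().reduce(0, Integer::sum);
        System.out.println(sum);
    }
}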
According to the docs
The reducing() collectors are most useful when used in a multi-level reduction, downstream of groupingBy or partitioningBy. To perform a simple reduction on a stream, use Stream.reduce(BinaryOperator) instead.
So basically you'd use reducing() only when forced within a collect.
Here's another example:
For example, given a stream of Person, to calculate the longest last name of residents in each city:
Comparator<String> byLength = Comparator.comparing(String::length);

Map<String, String> longestLastNameByCity = personList.stream()
        .collect(groupingBy(Person::getCity,
                reducing("", Person::getLastName, BinaryOperator.maxBy(byLength))));
According to this tutorial, reduce is sometimes less efficient:
The reduce operation always returns a new value. However, the accumulator function also returns a new value every time it processes an element of a stream. Suppose that you want to reduce the elements of a stream to a more complex object, such as a collection. This might hinder the performance of your application. If your reduce operation involves adding elements to a collection, then every time your accumulator function processes an element, it creates a new collection that includes the element, which is inefficient. It would be more efficient for you to update an existing collection instead. You can do this with the Stream.collect method, which the next section describes...
So the identity is "re-used" in a reduce scenario, so it is slightly more efficient to go with .reduce if possible.
There is a very good reason to always prefer collect() over reduce(). Using collect() is much more performant, as explained here:
Java 8 tutorial
A mutable reduction operation (such as Stream.collect()) collects the stream elements in a mutable result container (collection) as it processes them.
Mutable reduction operations provide much improved performance when compared to an immutable reduction operation (such as Stream.reduce()).
This is due to the fact that the collection holding the result at each step of reduction is mutable for a Collector and can be used again in the next step.
The Stream.reduce() operation, on the other hand, uses immutable result containers and as a result needs to instantiate a new instance of the container at every intermediate step of reduction, which degrades performance.