Can RxJava reduce() be unsafe when parallelized? - java

I want to use the reduce() operation on observable to map it to a Guava ImmutableList, since I prefer it so much more to the standard ArrayList.
Observable<String> strings = ...
Observable<ImmutableList<String>> captured = strings.reduce(ImmutableList.<String>builder(), (b,s) -> b.add(s))
.map(ImmutableList.Builder::build);
captured.forEach(i -> System.out.println(i));
Simple enough. But suppose I somewhere scheduled the observable strings in parallel with multiple threads or something. Would this not derail the reduce() operation and possibly cause a race condition? Especially since the ImmutableList.Builder would be vulnerable to that?

The problem lies in the shared state between realizations of the chain. This is pitfall #8 in my blog:
Shared state in an Observable chain
Let's assume you are dissatisfied with the performance or the type of the List the toList() operator returns and you want to roll your own aggregator instead of it. For a change, you want to do this by using existing operators and you find the operator reduce():
Observable<Vector<Integer>> list = Observable
.range(1, 3)
.reduce(new Vector<Integer>(), (vector, value) -> {
vector.add(value);
return vector;
});
list.subscribe(System.out::println);
list.subscribe(System.out::println);
list.subscribe(System.out::println);
When you run the three subscribe calls, the first prints what you'd expect, but the second prints a vector in which the range 1-3 appears twice, and the third prints 9 elements!
The problem is not with the reduce() operator itself but with the expectation surrounding it. When the chain is established, the new Vector passed in is a 'global' instance and will be shared between all evaluations of the chain.
Naturally, there is a way of fixing this without implementing an operator for the whole purpose (which should be quite simple if you see the potential in the previous CounterOp):
Observable<Vector<Integer>> list2 = Observable
.range(1, 3)
.reduce((Vector<Integer>)null, (vector, value) -> {
if (vector == null) {
vector = new Vector<>();
}
vector.add(value);
return vector;
});
list2.subscribe(System.out::println);
list2.subscribe(System.out::println);
list2.subscribe(System.out::println);
You need to start with null and create a vector inside the accumulator function, which now isn't shared between subscribers.
Alternatively, you can look into the collect() operator which has a factory callback for the initial value.
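For instance, a minimal sketch of that approach applied to the question's code (assuming the RxJava 1.x collect(stateFactory, collector) overload; each subscription now gets its own Builder from the factory):
Observable<ImmutableList<String>> captured = strings
        .collect(() -> ImmutableList.<String>builder(), (b, s) -> b.add(s))
        .map(ImmutableList.Builder::build);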
The rule of thumb here is that whenever you see an aggregator-like operator taking some plain value, be cautious: this 'initial value' will most likely be shared across all subscribers, and if you plan to consume the resulting stream with multiple subscribers, they will clash and may give you unexpected results or even crash.

According to the Observable contract, an observable must not make onNext calls in parallel, so you have to modify your strings Observable to respect this. You can use the serialize operator to achieve this.
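For example, a minimal sketch (serialize() simply wraps the source so that onNext/onError/onCompleted are never delivered concurrently to downstream operators):
Observable<String> safeStrings = strings.serialize();
// downstream aggregation such as reduce() now sees a well-behaved, serialized sequence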

Related

Bad practice to do "interesting" work inside a Stream.peek()? [duplicate]

I'm reading up about Java streams and discovering new things as I go along. One of the new things I found was the peek() function. Almost everything I've read on peek says it should be used to debug your Streams.
What if I had a Stream where each Account has username and password fields and login() and loggedIn() methods?
I also have
Consumer<Account> login = account -> account.login();
and
Predicate<Account> loggedIn = account -> account.loggedIn();
Why would this be so bad?
List<Account> accounts; //assume it's been setup
List<Account> loggedInAccount =
accounts.stream()
.peek(login)
.filter(loggedIn)
.collect(Collectors.toList());
Now as far as I can tell this does exactly what it's intended to do. It:
Takes a list of accounts
Tries to log in to each account
Filters out any account which isn't logged in
Collects the logged-in accounts into a new list
What is the downside of doing something like this? Any reason I shouldn't proceed? Lastly, if not this solution then what?
The original version of this used the .filter() method as follows:
.filter(account -> {
account.login();
return account.loggedIn();
})
The important thing you have to understand is that streams are driven by the terminal operation. The terminal operation determines whether all elements have to be processed or any at all. So collect is an operation that processes each item, whereas findAny may stop processing items once it encountered a matching element.
And count() may not process any elements at all when it can determine the size of the stream without processing the items. Since this is an optimization not made in Java 8 but introduced in Java 9, there might be surprises when you switch to Java 9 and have code relying on count() processing all items. This is also connected to other implementation-dependent details: even in Java 9, the reference implementation will not be able to predict the size of an infinite stream source combined with limit, while there is no fundamental limitation preventing such a prediction.
Since peek allows “performing the provided action on each element as elements are consumed from the resulting stream”, it does not mandate processing of elements but will perform the action depending on what the terminal operation needs. This implies that you have to use it with great care if you need a particular processing, e.g. want to apply an action on all elements. It works if the terminal operation is guaranteed to process all items, but even then, you must be sure that the next developer doesn't change the terminal operation (or that you don't forget that subtle aspect).
Further, while streams guarantee to maintain the encounter order for a certain combination of operations even for parallel streams, these guarantees do not apply to peek. When collecting into a list, the resulting list will have the right order for ordered parallel streams, but the peek action may get invoked in an arbitrary order and concurrently.
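A small self-contained sketch of that behavior (my own example, not from the answers): the collected list is in encounter order, but the peek output is interleaved and unordered.
List<Integer> result = IntStream.range(0, 10).boxed()
        .parallel()
        .peek(i -> System.out.println(Thread.currentThread().getName() + " peeked " + i)) // arbitrary order, multiple threads
        .collect(Collectors.toList());
System.out.println(result); // always [0, 1, 2, ..., 9]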
So the most useful thing you can do with peek is to find out whether a stream element has been processed which is exactly what the API documentation says:
This method exists mainly to support debugging, where you want to see the elements as they flow past a certain point in a pipeline
The key takeaway from this:
Don't use the API in an unintended way, even if it accomplishes your immediate goal. That approach may break in the future, and it is also unclear to future maintainers.
There is no harm in breaking this out to multiple operations, as they are distinct operations. There is harm in using the API in an unclear and unintended way, which may have ramifications if this particular behavior is modified in future versions of Java.
Using forEach on this operation would make it clear to the maintainer that there is an intended side effect on each element of accounts, and that you are performing some operation that can mutate it.
It's also more conventional in the sense that peek is an intermediate operation which doesn't operate on the entire collection until the terminal operation runs, but forEach is indeed a terminal operation. This way, you can make strong arguments around the behavior and the flow of your code as opposed to asking questions about if peek would behave the same as forEach does in this context.
accounts.forEach(a -> a.login());
List<Account> loggedInAccounts = accounts.stream()
.filter(Account::loggedIn)
.collect(Collectors.toList());
Perhaps a rule of thumb should be that if you do use peek outside the "debug" scenario, you should only do so if you're sure of what the terminating and intermediate filtering conditions are. For example:
return list.stream().map(foo->foo.getBar())
.peek(bar->bar.publish("HELLO"))
.collect(Collectors.toList());
seems to be a valid case where you want, in one operation, to transform all Foos to Bars and tell them all hello.
Seems more efficient and elegant than something like:
List<Bar> bars = list.stream().map(foo->foo.getBar()).collect(Collectors.toList());
bars.forEach(bar->bar.publish("HELLO"));
return bars;
and you don't end up iterating a collection twice.
A lot of answers made good points, and especially the (accepted) answer by Makoto describes the possible problems in quite some detail. But no one actually showed how it can go wrong:
[1]-> IntStream.range(1, 10).peek(System.out::println).count();
| $6 ==> 9
No output.
[2]-> IntStream.range(1, 10).filter(i -> i%2==0).peek(System.out::println).count();
| $9 ==> 4
Outputs numbers 2, 4, 6, 8.
[3]-> IntStream.range(1, 10).filter(i -> i > 0).peek(System.out::println).count();
| $12 ==> 9
Outputs numbers 1 to 9.
[4]-> IntStream.range(1, 10).map(i -> i * 2).peek(System.out::println).count();
| $16 ==> 9
No output.
[5]-> Stream.of(1, 2, 3, 4, 5, 6, 7, 8, 9).peek(System.out::println).count();
| $23 ==> 9
No output.
[6]-> Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9).stream().peek(System.out::println).count();
| $25 ==> 9
No output.
[7]-> IntStream.range(1, 10).filter(i -> true).peek(System.out::println).count();
| $30 ==> 9
Outputs numbers 1 to 9.
[1]-> List<Integer> list = new ArrayList<>();
| list ==> []
[2]-> Stream.of(1, 5, 2, 7, 3, 9, 8, 4, 6).sorted().peek(list::add).count();
| $7 ==> 9
[3]-> list
| list ==> []
(You get the idea.)
The examples were run in jshell (Java 15.0.2) and mimic the use case of converting data (replace System.out::println by list::add for example as also done in some answers) and returning how much data was added. The current observation is that any operation that could filter elements (such as filter or skip) seems to force handling of all remaining elements, but it need not stay that way.
I would say that peek provides the ability to decentralize code that can mutate stream objects, or modify global state (based on them), instead of stuffing everything into a simple or composed function passed to a terminal method.
Now the question might be: should we mutate stream objects or change global state from within functions in functional-style Java programming?
If the answer to either of those two questions is yes (or: in some cases yes), then peek() is definitely not only for debugging purposes, for the same reason that forEach() isn't only for debugging purposes.
For me, choosing between forEach() and peek() comes down to the following: do I want the pieces of code that mutate stream objects (or change global state) to be attached to a composable function, or do I want them attached directly to the stream?
I think peek() will pair better with the Java 9 methods. For example, takeWhile() may need to decide when to stop iterating based on an already mutated object, so pairing it with forEach() would not have the same effect; see the sketch after the P.S. below.
P.S. I have not mentioned map() anywhere because when we want to mutate objects (or global state) rather than generate new objects, it works exactly like peek().
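Here is the small sketch mentioned above (Java 9+; my own example): takeWhile() decides when to stop based on state that the upstream peek() has just mutated, which a trailing forEach() could not do mid-stream.
AtomicInteger budget = new AtomicInteger(10);
List<Integer> taken = Stream.of(1, 2, 3, 4, 5)
        .peek(i -> budget.addAndGet(-i))     // side effect: spend part of the budget
        .takeWhile(i -> budget.get() >= 0)   // stop once the budget is exhausted
        .collect(Collectors.toList());
System.out.println(taken); // [1, 2, 3, 4]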
Although I agree with most answers above, I have one case in which using peek actually seems like the cleanest way to go.
Similar to your use case, suppose you want to filter only on active accounts and then perform a login on these accounts.
accounts.stream()
.filter(Account::isActive)
.peek(login)
.collect(Collectors.toList());
peek is helpful to avoid the redundant map workaround below, while still not iterating the collection twice:
accounts.stream()
.filter(Account::isActive)
.map(account -> {
account.login();
return account;
})
.collect(Collectors.toList());
Despite the documentation note for .peek saying the "method exists mainly to support debugging", I think it has general relevance. For one thing, the documentation says "mainly", so it leaves room for other use cases. It has not been deprecated in all these years, and speculation about its removal is futile IMO.
I would say that in a world where we still have to handle side-effectful methods, it has a valid place and utility. There are many valid operations in streams that use side effects. Many have been mentioned in other answers; I'll just add setting a flag on a collection of objects, or registering them with a registry, where those objects are then further processed in the stream. Not to mention creating log messages during stream processing.
I support the idea of having separate actions in separate stream operations, so I avoid pushing everything into a final .forEach. I favor .peek over an equivalent .map with a lambda whose only purpose, besides calling the side-effect method, is to return the passed-in argument. .peek tells me that what goes in also goes out as soon as I encounter this operation, and I don't need to read a lambda to find out. In that sense it is succinct, expressive, and improves the readability of the code.
Having said that, I agree with all the considerations about using .peek, e.g. being aware of the effect of the terminal operation of the stream it is used in.
The functional solution is to make the account object immutable, so account.login() must return a new account object. This means the map operation can be used for login instead of peek.
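A minimal sketch of that idea (it assumes a hypothetical immutable Account whose login() returns a new, logged-in Account instead of mutating the receiver):
List<Account> loggedInAccounts = accounts.stream()
        .map(Account::login)          // pure transformation: old Account in, new Account out
        .filter(Account::loggedIn)
        .collect(Collectors.toList());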
To get rid of warnings, I use a functor tee, named after Unix's tee:
public static <T> Function<T,T> tee(Consumer<T> after) {
    return arg -> {
        after.accept(arg); // apply the side effect, then pass the argument through unchanged
        return arg;
    };
}
You can replace:
.peek(f)
with
.map(tee(f))
It seems like a helper class is needed:
public static class OneBranchOnly<T> {
public Function<T, T> apply(Predicate<? super T> test,
Consumer<? super T> t) {
return o -> {
if (test.test(o)) t.accept(o);
return o;
};
}
}
then switch peek with map:
.map(new OneBranchOnly<Account>().apply(
account -> account.isTestAccount(),
account -> account.setName("Test Account"))
)
Result: a collection of accounts in which only the test accounts got renamed (no reference gets maintained).

Java Streams - Using a setter inside map()

I had a discussion with a colleague that we should not be using setters inside stream.map(), like the solution suggested here: https://stackoverflow.com/a/35377863/1552771
There is a comment to this answer that discourages using map this way, but there hasn’t been a reason given as to why this is a bad idea. Can someone provide a possible scenario why this can break?
I have seen some discussions where people talk about concurrent modification of the collection itself, by adding or removing items from it, but are there any negatives to using map to just set some values to a data object?
Using side effects in map, like invoking a setter, has a lot of similarities to using peek for non-debugging purposes, which has been discussed in In Java streams is peek really only for debugging?
This answer has a very good general advice:
Don't use the API in an unintended way, even if it accomplishes your immediate goal. That approach may break in the future, and it is also unclear to future maintainers.
Whereas the other answer names the associated practical problems, I have to cite myself:
The important thing you have to understand, is that streams are driven by the terminal operation. The terminal operation determines whether all elements have to be processed or any at all.
When you place an operation with a side effect into a map function, you have a specific expectation about on which elements it will be performed and perhaps even how it will be performed, e.g. in which order. Whether the expectation will be fulfilled, depends on other subsequent Stream operations and perhaps even on subtle implementation details.
To show some examples:
IntStream.range(0, 10) // outcome changes with Java 9
.mapToObj(i -> System.out.append("side effect on "+i+"\n"))
.count();
IntStream.range(0, 2) // outcome changes with Java 10 (or 8u222)
.flatMap(i -> IntStream.range(i * 5, (i+1) * 5 ))
.map(i -> { System.out.println("side effect on "+i); return i; })
.anyMatch(i -> i > 3);
IntStream.range(0, 10) // outcome may change with every run
.parallel()
.map(i -> { System.out.println("side effect on "+i); return i; })
.anyMatch(i -> i > 6);
Further, as already mentioned in the linked answer, even if you have a terminal operation that processes all elements and is ordered, there is no guarantee about the processing order (or concurrency for parallel streams) of intermediate operations.
The code may happen to do the desired thing when you have a stream with no duplicates, a terminal operation that processes all elements, and a map function calling only a trivial setter, but the code has so many dependencies on subtle surrounding conditions that it will become a maintenance nightmare. Which brings us back to the first quote about using an API in an unintended way.
I think the real issue here is that it is just bad practice and violates the intended use of the capability. For example, one can also accomplish the same thing with filter. This perverts its use and also makes the code confusing or, at best, unnecessarily verbose.
public static void main(String[] args) {
List<MyNumb> foo =
IntStream.range(1, 11).mapToObj(MyNumb::new).collect(
Collectors.toList());
System.out.println(foo);
foo = foo.stream().filter(i ->
{
i.value *= 10;
return true;
}).collect(Collectors.toList());
System.out.println(foo);
}
class MyNumb {
int value;
public MyNumb(int v) {
value = v;
}
public String toString() {
return Integer.toString(value);
}
}
So, going back to the original example: one does not need to use map at all, resulting in the following rather ugly mess.
foos = foos.stream()
.filter(foo -> { boolean b = foo.isBlue();
if (b) {
foo.setTitle("Some value");
}
return b;})
.collect(Collectors.toList());
Streams are not just some new set of APIs that makes things easier for you. They also bring the functional programming paradigm with them.
And the functional programming paradigm's most important aspect is to use pure functions for computations. A pure function is one where the output depends only on its input.
So, basically, the Streams API should be used with stateless, side-effect-free, pure functions.
Quoting things from Joshua Bloch's Effective Java (3rd Edition)
If you’re new to streams, it can be difficult to get the hang of them. Merely expressing your computation as a stream pipeline can be hard. When you succeed, your program will run, but you may realize little if any benefit. Streams isn’t just an API, it’s a paradigm based on functional programming. In order to obtain the expressiveness, speed, and in some cases parallelizability that streams have to offer, you have to adopt the paradigm as well as the API. The most important part of the streams paradigm is to structure your computation as a sequence of transformations where the result of each stage is as close as possible to a pure function of the result of the previous stage. A pure function is one whose result depends only on its input: it does not depend on any mutable state, nor does it update any state. In order to achieve this, any function objects that you pass into stream operations, both intermediate and terminal, should be free of side-effects.
Occasionally, you may see streams code that looks like this snippet, which builds a frequency table of the words in a text file:
// Uses the streams API but not the paradigm--Don't do this!
Map<String, Long> freq = new HashMap<>();
try (Stream<String> words = new Scanner(file).tokens()) {
words.forEach(word -> { freq.merge(word.toLowerCase(), 1L, Long::sum);
});
}
What’s wrong with this code? After all, it uses streams, lambdas, and method references, and gets the right answer. Simply put, it’s not streams code at all; it’s iterative code masquerading as streams code. It derives no benefits from the streams API, and it’s (a bit) longer, harder to read, and less maintainable than the corresponding iterative code. The problem stems from the fact that this code is doing all its work in a terminal forEach operation, using a lambda that mutates external state (the frequency table). A forEach operation that does anything more than present the result of the computation performed by a stream is a “bad smell in code,” as is a lambda that mutates state. So how should this code look?
// Proper use of streams to initialize a frequency table
Map<String, Long> freq;
try (Stream<String> words = new Scanner(file).tokens()) {
freq = words
.collect(groupingBy(String::toLowerCase, counting()));
}
Just to name a few:
map() with setter is interfering (it modifies the initial data), while specs require a non-interfering function. For more details read this post.
map() with setter is stateful (your logic may depend on initial value of field you're updating), while specs require a stateless function
even if you're not interfering with the collection that you're iterating over, the side effect of the setter is unnecessary
Setters in map may mislead the future code maintainers
etc...
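As a rough illustration of the usual alternatives (the Person type and its withName() copy method are hypothetical, not from the question):
// keep map() pure by producing a new object instead of mutating the old one
List<Person> renamed = people.stream()
        .map(p -> p.withName(p.getName().toUpperCase()))
        .collect(Collectors.toList());
// or, if mutation is really wanted, make the side effect explicit with a terminal operation
people.forEach(p -> p.setName(p.getName().toUpperCase()));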

Why may the tryAdvance of stream.spliterator() accumulate items into a buffer?

Getting a Spliterator from a Stream pipeline may return an instance of a StreamSpliterators.WrappingSpliterator. For example, getting the following Spliterator:
Spliterator<String> source = new Random()
.ints(11, 0, 7) // size, origin, bound
.filter(nr -> nr % 2 != 0)
.mapToObj(Integer::toString)
.spliterator();
Given the above Spliterator<String> source, when we traverse the elements individually through the tryAdvance(Consumer<? super P_OUT> consumer) method of the Spliterator, which in this case is an instance of StreamSpliterators.WrappingSpliterator, it will first accumulate items into an internal buffer before consuming those items, as we can see in StreamSpliterators.java#298. From a simple point of view, doAdvance() first inserts items into the buffer, then gets the next item and passes it to consumer.accept(…).
public boolean tryAdvance(Consumer<? super P_OUT> consumer) {
boolean hasNext = doAdvance();
if (hasNext)
consumer.accept(buffer.get(nextToConsume));
return hasNext;
}
However, I can't figure out the need for this buffer.
In this case, why isn't the consumer parameter of tryAdvance simply used as the terminal Sink of the pipeline?
Keep in mind that this is the Spliterator returned by the public method Stream.spliterator(), so no assumptions about the caller can be made (as long as it is within the contract).
The tryAdvance method may get called once for each of the stream’s elements and once more to detect the end of the stream; well, actually, it might get called an arbitrary number of times even after hitting the end. And there is no guarantee that the caller will always pass the same consumer.
To pass a consumer directly to the source spliterator without buffering, you would have to compose a consumer that performs all pipeline stages, i.e. calls a mapping function and uses its result, or tests a predicate and does not call the downstream consumer on a negative result, and so on. The consumer passed to the source spliterator would also be responsible for notifying the WrappingSpliterator somehow about a value being rejected by the filter, as the source spliterator’s tryAdvance method still returns true in that case and the operation would have to be repeated.
As Eugene correctly mentioned, this is the one-size-fits-all implementation that doesn’t consider how many or what kind of pipeline stages there are. The cost of composing such a consumer could be heavy and might have to be paid again for each tryAdvance call, read: for every stream element, e.g. when different consumers are passed to tryAdvance or when equality checks do not work. Keep in mind that consumers are often implemented as lambda expressions and the identity or equality of the instances produced by lambda expressions is unspecified.
So the tryAdvance implementation avoids these costs by composing only one consumer instance on the first invocation that will always store the element into the same buffer, also allocated on the first invocation, if not rejected by a filter. Note that under normal circumstances, the buffer will only hold one element. Afaik, flatMap is the only operation that may push more elements to the buffer. But note that the existence of this non-lazy behavior of flatMap is also the reason why this buffering strategy is required, at least when flatMap is involved, to ensure that the Spliterator implementation handed out by a public method will fulfill the contract of passing at most one element to the consumer during one invocation of tryAdvance.
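A quick sketch of that flatMap effect (the buffering shown here is an implementation detail of the reference implementation and may change):
Spliterator<Integer> sp = Stream.of(1, 2)
        .flatMap(i -> Stream.of(i * 10, i * 10 + 1, i * 10 + 2))
        .spliterator();
sp.tryAdvance(x -> System.out.println("consumed " + x)); // prints "consumed 10"
// the inner stream was pushed eagerly, so 11 and 12 already sit in the internal buffer
// and will be handed out by the following tryAdvance calls before 2 is even pulled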
In contrast, when you call forEachRemaining, these problems do not exist. There is only one Consumer instance during the entire operation and the non-laziness of flatMap doesn’t matter either, as all elements will get consumed anyway. Therefore, a non-buffering transfer will be attempted, as long as no previous tryAdvance call was made that could have caused buffering of some elements:
public void forEachRemaining(Consumer<? super P_OUT> consumer) {
if (buffer == null && !finished) {
Objects.requireNonNull(consumer);
init();
ph.wrapAndCopyInto((Sink<P_OUT>) consumer::accept, spliterator);
finished = true;
}
else {
do { } while (tryAdvance(consumer));
}
}
As you can see, as long as the buffer has not been initialized, i.e. no previous tryAdvance call was made, consumer::accept is bound as Sink and a complete direct transfer made.
I mostly agree with the great answer by @Holger, but I would put the accents differently. I think it is hard for you to understand the need for a buffer because you have a very simplistic mental model of what the Stream API allows. If one thinks about a Stream as a sequence of map and filter, there is no need for an additional buffer because those operations have 2 important "good" properties:
Work on one element at a time
Produce 0 or 1 element as a result
However, those are not true in the general case. As @Holger (and I in my original answer) mentioned, there is already flatMap in Java 8 that breaks rule #2, and in Java 9 they've finally added takeWhile, which actually transforms a whole Stream -> Stream rather than working on a per-element basis (and which is, AFAIK, the first intermediate short-circuiting operation).
Another point where I don't quite agree with @Holger is that I think the most fundamental reason is a bit different from the one he gives in his second paragraph (i.e. a) that you may call tryAdvance many more times after the end of the Stream, and b) that "there is no guarantee that the caller will always pass the same consumer"). I think the most important reason is that a Spliterator, being functionally identical to a Stream, has to support short-circuiting and laziness (i.e. the ability not to process the whole Stream, or else it couldn't support unbounded streams).
In other words, even if the Spliterator API (quite strangely) required that you use the same Consumer object for all calls of all methods of a given Spliterator, you would still need tryAdvance, and that tryAdvance implementation would still have to use some buffer. You just can't stop processing data if all you've got is forEachRemaining(Consumer<? super T>), so you can't implement anything similar to findFirst or takeWhile with it. Actually, this is one of the reasons why the JDK implementation internally uses the Sink interface rather than Consumer (and what the "wrap" in wrapAndCopyInto stands for): Sink has an additional boolean cancellationRequested() method.
So to sum up: a buffer is required because we want the Spliterator:
To use a simple Consumer that provides no means to report back the end of processing/cancellation
To provide a means to stop processing the data at the request of the (logical) consumer.
Note that those two are actually slightly contradictory requirements.
Example and some code
Here I'd like to provide an example of code that I believe is impossible to implement without an additional buffer, given the current API contract (interfaces). This example is based on your example.
There is the simple Collatz sequence of integers that is conjectured to always eventually hit 1. AFAIK this conjecture is not proved yet, but it has been verified for many integers (at least for the whole 32-bit int range).
So assume that the problem we are trying to solve is the following: from a stream of Collatz sequences for random start numbers in the range from 1 to 1,000,000, find the first number that contains "123" in its decimal representation.
Here is a solution that uses just Stream (not a Spliterator):
static String findGoodNumber() {
return new Random()
.ints(1, 1_000_000) // unbound!
.flatMap(nr -> collatzSequence(nr))
.mapToObj(Integer::toString)
.filter(s -> s.contains("123"))
.findFirst().get();
}
where collatzSequence is a function that returns a Stream containing the Collatz sequence until the first 1 (and, for nitpickers, let it also stop when the current value is bigger than Integer.MAX_VALUE / 3 so we don't hit overflow).
Every such Stream returned by collatzSequence is bounded. Also, the standard Random will eventually generate every number in the provided range. This means we are guaranteed that there will eventually be some "good" number in the stream (for example, just 123), and findFirst is short-circuiting, so the whole operation will actually terminate. However, no reasonable Stream API implementation can predict this.
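For reference, a minimal sketch of the collatzSequence helper the example assumes (the name, the IntStream return type and the overflow guard follow the description above; this is not code from the original answer):
static IntStream collatzSequence(int start) {
    // builds one finite Collatz sequence eagerly; stops at the first 1,
    // or as soon as the next step could overflow
    IntStream.Builder b = IntStream.builder();
    int i = start;
    while (i != 1 && i <= Integer.MAX_VALUE / 3) {
        b.add(i);
        i = (i % 2 == 0) ? i / 2 : 3 * i + 1;
    }
    b.add(i);
    return b.build();
}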
Now let's assume that for some strange reason you want to perform the same thing using intermediate Spliterator. Even though you have only one piece of logic and no need for different Consumers, you can't use forEachRemaining. So you'll have to do something like this:
static Spliterator<String> createCollatzRandomSpliterator() {
return new Random()
.ints(1, 1_000_000) // unbound!
.flatMap(nr -> collatzSequence(nr))
.mapToObj(Integer::toString)
.spliterator();
}
static String findGoodNumberWithSpliterator() {
Spliterator<String> source = createCollatzRandomSpliterator();
String[] res = new String[1]; // workaround for the "effectively final" closure restriction
while (source.tryAdvance(s -> {
if (s.contains("123")) {
res[0] = s;
}
})) {
if (res[0] != null)
return res[0];
}
throw new IllegalStateException("Impossible");
}
It is also important that for some starting numbers the Collatz sequence will contain several matching numbers. For example, both 41123 and 123370 (= 41123*3+1) contain "123". This means that we really don't want our Consumer to be called after the first matching hit. But since a Consumer doesn't expose any means to report the end of processing, the WrappingSpliterator can't just pass our Consumer to the inner Spliterator. The only solution is to accumulate all results of the inner flatMap (with all the post-processing) into some buffer and then iterate over that buffer one element at a time.
Spliterators are designed to handle sequential processing of each item in encounter order, and parallel processing of items in some order. Each method of the Spliterator must be able to support both early binding and late binding. The buffering is intended to gather data into suitable, processable chunks that follow the requirements for ordering, parallelization and mutability.
In other words, tryAdvance() is not the only method in the class, and the other methods have to work with each other to deliver the external contract. Doing that, in the face of subclasses that may override some or all of the methods, requires that each method obey its internal contract.
This is something that I've read from Holger in quite a few posts, and I'll just sum it up here; if there's an exact duplicate (I'll try to find one), I will close and delete my answer in deference to that one.
First, why a WrappingSpliterator is needed in the first place: for stateful operations like sorted, distinct, etc., but I think you already understood that. I assume for flatMap also, since it is eager.
Now, when you call spliterator, if there are no stateful operations there is obviously no real reason to wrap it into a WrappingSpliterator, but at the moment this is not done. This could be changed in a future release, where they could detect whether there are stateful operations before you call spliterator; but they don't do that now and simply treat every operation as stateful, thus wrapping it into a WrappingSpliterator.
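A quick way to observe this (the printed class names are JDK implementation details and may differ between releases):
System.out.println(Stream.of(1, 2, 3).spliterator().getClass());
// a plain source spliterator -- no wrapping needed without intermediate operations
System.out.println(Stream.of(1, 2, 3).map(i -> i).spliterator().getClass());
// java.util.stream.StreamSpliterators$WrappingSpliterator, even though map is stateless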

Can a Collector's combiner function ever be used on sequential streams?

Sample program:
public final class CollectorTest
{
private CollectorTest()
{
}
private static <T> BinaryOperator<T> nope()
{
return (t, u) -> { throw new UnsupportedOperationException("nope"); };
}
public static void main(final String... args)
{
final Collector<Integer, ?, List<Integer>> c
= Collector.of(ArrayList::new, List::add, nope());
IntStream.range(0, 10_000_000).boxed().collect(c);
}
}
So, to simplify matters here, there is no final transformation, so the resulting code is quite simple.
Now, IntStream.range() produces a sequential stream. I simply box the results into Integers and then my contrived Collector collects them into a List<Integer>. Pretty simple.
And no matter how many times I run this sample program, the UnsupportedOperationException never hits, which means my dummy combiner is never called.
I kind of expected this, but then I have already misunderstood streams enough that I have to ask the question...
Can a Collector's combiner ever be called when the stream is guaranteed to be sequential?
A careful reading of the streams implementation code in ReduceOps.java reveals that the combine function is called only when a ReduceTask completes, and ReduceTask instances are used only when evaluating a pipeline in parallel. Thus, in the current implementation, the combiner is never called when evaluating a sequential pipeline.
There is nothing in the specification that guarantees this, however. A Collector is an interface that makes requirements on its implementations, and there are no exemptions granted for sequential streams. Personally, I find it difficult to imagine why sequential pipeline evaluation might need to call the combiner, but someone with more imagination than me might find a clever use for it, and implement it. The specification allows for it, and even though today's implementation doesn't do it, you still have to think about it.
This should not be surprising. The design center of the streams API is to support parallel execution on an equal footing with sequential execution. Of course, it is possible for a program to observe whether it is being executed sequentially or in parallel. But the design of the API is to support a style of programming that allows either.
If you're writing a collector and you find that it's impossible (or inconvenient, or difficult) to write an associative combiner function, leading you to want to restrict your stream to sequential execution, maybe this means you're heading in the wrong direction. It's time to step back a bit and think about approaching the problem a different way.
A common reduction-style operation that doesn't require an associative combiner function is called fold-left. The main characteristic is that the fold function is applied strictly left-to-right, proceeding one at a time. I'm not aware of a way to parallelize fold-left.
When people try to contort collectors the way we've been talking about, they're usually looking for something like fold-left. The Streams API doesn't have direct API support for this operation, but it's pretty easy to write. For example, suppose you want to reduce a list of strings using this operation: repeat the first string and then append the second. It's pretty easy to demonstrate that this operation isn't associative:
List<String> list = Arrays.asList("a", "b", "c", "d", "e");
System.out.println(list.stream()
.collect(StringBuilder::new,
(a, b) -> a.append(a.toString()).append(b),
(a, b) -> a.append(a.toString()).append(b))); // BROKEN -- NOT ASSOCIATIVE
Run sequentially, this produces the desired output:
aabaabcaabaabcdaabaabcaabaabcde
But when run in parallel, it might produce something like this:
aabaabccdde
Since it "works" sequentially, we could enforce this by calling sequential() and back this up by having the combiner throw an exception. In addition, the supplier must be called exactly once. There's no way to combine the intermediate results, so if the supplier is called twice, we're already in trouble. But since we "know" the supplier is called only once in sequential mode, most people don't worry about this. In fact, I've seen people write "suppliers" that return some existing object instead of creating a new one, in violation of the supplier contract.
In this use of the 3-arg form of collect(), we have two out of the three functions breaking their contracts. Shouldn't this be telling us to do things a different way?
The main work here is being done by the accumulator function. To accomplish a fold-style reduction, we can apply this function in a strict left-to-right order using forEachOrdered(). We have to do a bit of setup and finishing code before and after, but that's no problem:
StringBuilder a = new StringBuilder();
list.parallelStream()
.forEachOrdered(b -> a.append(a.toString()).append(b));
System.out.println(a.toString());
Naturally, this works fine in parallel, though the performance benefits of running in parallel may be somewhat negated by the ordering requirements of forEachOrdered().
In summary, if you find yourself wanting to do a mutable reduction but you're lacking an associative combiner function, leading you to restrict your stream to sequential execution, recast the problem as a fold-left operation and use forEachOrdered() with your accumulator function.
As observed in previous comments from @MarkoTopolnik and @Duncan, there is no guarantee that Collector.combiner() is called in sequential mode to produce a reduced result. In fact, the Javadoc is a little vague on this point, which can lead to an inappropriate interpretation.
(...) A parallel implementation would partition the input, create a result container for each partition, accumulate the contents of each partition into a subresult for that partition, and then use the combiner function to merge the subresults into a combined result.
According to NoBlogDefFound, the combiner is used only in parallel mode. See the partial quotation below:
combiner() is used to join two accumulators together into one. It is used when collector is executed in parallel, splitting input Stream and collecting parts independently first.
To show this issue more clearly, I rewrote the first code and put in two approaches (sequential and parallel).
public final class CollectorTest
{
private CollectorTest()
{
}
private static <T> BinaryOperator<T> nope()
{
return (t, u) -> { throw new UnsupportedOperationException("nope"); };
}
public static void main(final String... args)
{
final Collector<Integer, ?, List<Integer>> c =
Collector
.of(ArrayList::new, List::add, nope());
// approach sequential
Stream<Integer> sequential = IntStream
.range(0, 10_000_000)
.boxed();
System.out.println("isParallel:" + sequential.isParallel());
sequential
.collect(c);
// approach parallel
Stream<Integer> parallel = IntStream
.range(0, 10_000_000)
.parallel()
.boxed();
System.out.println("isParallel:" + parallel.isParallel());
parallel
.collect(c);
}
}
After running this code we can get the output:
isParallel:false
isParallel:true
Exception in thread "main" java.lang.UnsupportedOperationException: nope
at com.stackoverflow.lambda.CollectorTest.lambda$nope$0(CollectorTest.java:18)
at com.stackoverflow.lambda.CollectorTest$$Lambda$3/2001049719.apply(Unknown Source)
at java.util.stream.ReduceOps$3ReducingSink.combine(ReduceOps.java:174)
at java.util.stream.ReduceOps$3ReducingSink.combine(ReduceOps.java:160)
So, according to this result, we can infer that the Collector's combiner is called only during parallel execution.
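For completeness, a minimal sketch of a well-behaved combiner for that collector, so the same collector also works when the pipeline runs in parallel:
final Collector<Integer, ?, List<Integer>> safe =
        Collector.of(ArrayList::new, List::add,
                (left, right) -> { left.addAll(right); return left; }); // associative merge of partial lists
IntStream.range(0, 10_000_000).parallel().boxed().collect(safe); // no exception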

Mutable parameters in Java 8 Streams

Looking at this question: How to dynamically do filtering in Java 8?
The issue is to truncate a stream after a filter has been executed. I can't use limit because I don't know how long the list is after the filter. So, could we count the elements after the filter?
So, I thought I could create a class that counts, and pass the stream through a map. The code is in this answer.
I created a class that counts but leaves the elements unaltered; I use a Function here, to avoid the lambdas I used in the other answer:
class DoNothingButCount<T> implements Function<T, T> {
AtomicInteger i;
public DoNothingButCount() {
i = new AtomicInteger(0);
}
public T apply(T p) {
i.incrementAndGet();
return p;
}
}
So my Stream was finally:
persons.stream()
.filter(u -> u.size > 12)
.filter(u -> u.weitght > 12)
.map(counter)
.sorted((p1, p2) -> p1.age - p2.age)
.collect(Collectors.toList())
.stream()
.limit((int) (counter.i.intValue() * 0.5))
.sorted((p1, p2) -> p2.length - p1.length)
.limit((int) (counter.i.intValue() * 0.5 * 0.2)).forEach((p) -> System.out.println(p));
But my question is about another part of the my example.
collect(Collectors.toList()).stream().
If I remove that line, the consequence is that the counter is ZERO when I try to execute limit. I am somehow cheating the "effectively final" requirement by using a mutable object.
I may be wrong, but I understand that the stream is built first, so if we use mutable objects to pass parameters to any of the steps in the stream, their values will be taken when the stream is created.
My question is: if my assumption is right, why is this needed? The stream (if non-parallel) could be passed sequentially through all the steps (filter, map...), so this limitation would not be needed.
Short answer
My question is: if my assumption is right, why is this needed? The
stream (if non-parallel) could be passed sequentially through all the
steps (filter, map...), so this limitation would not be needed.
As you already know, for parallel streams this sounds pretty obvious: this limitation is needed because otherwise the result would be non-deterministic.
Regarding non-parallel streams, it is not possible because of their current design: each item is only visited once. If streams worked as you suggest, they would do each step on the whole collection before going to the next step, which would probably have an impact on performance, I think. I suspect that's why the language designers made that decision.
Why it technically does not work without collect
You already know that, but here is the explanation for other readers.
From the docs:
Streams are lazy; computation on the source data is only performed
when the terminal operation is initiated, and source elements are
consumed only as needed.
Every intermediate operation of a Stream, such as filter() or limit(), is actually just some kind of setter that initializes the stream's options.
When you call a terminal operation, such as forEach(), collect() or count(), that's when the computation happens, processing items following the pipeline previously built.
This is why limit()'s argument is evaluated before a single item has gone through the first step of the stream. That's why you need to end the stream with a terminal operation, and start a new one with the limit() you'll then know.
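A small self-contained sketch of that evaluation order (my own example, not the question's code):
AtomicInteger counter = new AtomicInteger();
List<Integer> out = Stream.of(1, 2, 3, 4)
        .peek(i -> counter.incrementAndGet())
        .limit(counter.get() + 2)   // counter.get() runs NOW, while the pipeline is built: limit(2)
        .collect(Collectors.toList());
System.out.println(out);            // [1, 2] -- the counter only starts moving once collect() runs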
More detailed answer about why not allow it for parallel streams
Let your stream pipeline be step X > step Y > step Z.
We want parallel treatment of our items. Therefore, if we allow step Y's behavior to depend on the items that already went through X, then Y is non-deterministic. This is because at the moment an item arrives at step Y, the set of items that have already gone through X won't be the same across multiple executions (because of the threading).
More detailed answer about why not allow it for non-parallel streams
A stream, by definition, is used to process the items in a flow. You could think of a non-parallel stream as follows: one single item goes through all the steps, then the next one goes through all the steps, etc. In fact, the doc says it all:
The elements of a stream are only visited once during the life of a
stream. Like an Iterator, a new stream must be generated to revisit
the same elements of the source.
If streams didn't work like this, it wouldn't be any better than just doing each step on the whole collection before going to the next step. That would actually allow mutable parameters in non-parallel streams, but it would probably have a performance impact (because we would iterate over the collection multiple times). Anyway, their current behavior does not allow what you want.
