I have been trying to understand and showcase how Java streams implement a type of loop fusion under the hood, so that several operations can be fused into a single pass.
Here is the first example:
Stream.of("The", "cat", "sat", "on", "the", "mat")
.filter(w -> {
System.out.println("Filtering: " + w);
return w.length() == 3;
})
.map(w -> {
System.out.println("Mapping: " + w);
return w.toUpperCase();
})
.forEach(w -> System.out.println("Printing: " + w));
It produces the following output, where the fusion into a single pass per element is quite clear:
Filtering: The
Mapping: The
Printing: THE
Filtering: cat
Mapping: cat
Printing: CAT
Filtering: sat
Mapping: sat
Printing: SAT
Filtering: on
Filtering: the
Mapping: the
Printing: THE
Filtering: mat
Mapping: mat
Printing: MAT
The second example is the same but I use the sorted() operation between the filter and map:
Stream.of("The", "cat", "sat", "on", "the", "mat")
.filter(w -> {
System.out.println("Filtering: " + w);
return w.length() == 3;
})
.sorted()
.map(w -> {
System.out.println("Mapping: " + w);
return w.toUpperCase();
})
.forEach(w -> System.out.println("Printing: " + w));
This has the following output:
Filtering: The
Filtering: cat
Filtering: sat
Filtering: on
Filtering: the
Filtering: mat
Mapping: The
Printing: THE
Mapping: cat
Printing: CAT
Mapping: mat
Printing: MAT
Mapping: sat
Printing: SAT
Mapping: the
Printing: THE
So my question is: with the call to sorted(), am I correct in thinking that because it is a "stateful" intermediate operation, it does not allow each element to be processed individually in a single pass through all of the operations? Furthermore, because the sorted() stateful operation needs to process the entire input stream to produce a result, the fusing technique cannot be applied across it, which is why all the filtering occurs first and the mapping and printing operations are then fused together after the sort? Please correct me if any of my assumptions are incorrect, and feel free to elaborate on what I have already said.
In addition, how does it decide under the hood whether it can fuse the operations into a single pass or not? For example, when a stateful operation such as distinct() is present, is there simply a flag that gets switched off to prevent the fusion that would otherwise happen?
A final query: whilst the benefit of fusing operations into a single pass is sometimes obvious, for example when combined with short-circuiting, what are the main benefits of fusing together operations such as a filter-map-forEach, or even a filter-map-sum?
The stateless operations (map, filter, flatMap, peek, etc.) are fully fused; we build a chain of cascading Consumer objects and pour the data in. Each element can be operated upon independently of the others, so there's never anything "stuck" in the chain. (This is what Louis means by how fusion is implemented -- we compose the stages into a big function, and feed the data to that.)
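To make that "chain of cascading Consumer objects" concrete, here is a minimal sketch of the idea in plain Java. It is not the JDK's actual Sink machinery, just an illustration (the class name FusionSketch is made up) that reproduces the first example's output in one pass:

import java.util.Arrays;
import java.util.function.Consumer;

public class FusionSketch {
    public static void main(String[] args) {
        // Terminal stage: printing
        Consumer<String> print = w -> System.out.println("Printing: " + w);
        // The map stage decorates the downstream (printing) stage
        Consumer<String> map = w -> print.accept(w.toUpperCase());
        // The filter stage decorates the map stage
        Consumer<String> filter = w -> {
            if (w.length() == 3) {
                map.accept(w);
            }
        };
        // "Pouring the data in": a single pass, each element flows through the whole chain
        Arrays.asList("The", "cat", "sat", "on", "the", "mat").forEach(filter);
    }
}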
Stateful operations (distinct, sorted, limit, etc.) are more complicated and vary more in their behavior. Each stateful operation gets to choose how it wants to implement itself, so it can choose the least intrusive approach possible. For example, distinct (under some circumstances) lets elements come out as they are vetted, whereas sorted is a full barrier. (The difference is in how much laziness is possible, and how well they handle things like infinite sources with a limit operation downstream.)
It is true that stateful operations generally undermine some of the benefits of fusion, but not all of them (the operations upstream and downstream can still be fused.)
In addition to the value of short-circuiting, which you observed, additional big wins from fusion include (a) you don't have to populate intermediate result containers between stages, and (b) the data you are dealing with is always "hot" in cache.
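For contrast, a hand-written, non-fused version of the same pipeline might look like the sketch below (the class name NoFusionSketch is made up). Each stage allocates an intermediate container and walks the data again, which is exactly what (a) and (b) avoid:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NoFusionSketch {
    public static void main(String[] args) {
        List<String> source = Arrays.asList("The", "cat", "sat", "on", "the", "mat");

        // Stage 1: filter into an intermediate container
        List<String> filtered = new ArrayList<>();
        for (String w : source) {
            if (w.length() == 3) {
                filtered.add(w);
            }
        }

        // Stage 2: map into another intermediate container
        List<String> mapped = new ArrayList<>();
        for (String w : filtered) {
            mapped.add(w.toUpperCase());
        }

        // Stage 3: print, walking the data a third time
        for (String w : mapped) {
            System.out.println("Printing: " + w);
        }
    }
}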
Yes, that's about right. All of this can be checked by looking at the source code.
Fusion isn't implemented the way I think you think it is, though. There's no looking at the whole pipeline and deciding how to fuse it; there are no flags or anything. It's simply a matter of whether an operation is expressed as a StatefulOp object, which can run the entire stream up to that point and get all the output, or a StatelessOp, which just decorates a Sink that says where the elements go. You can look at the source code of e.g. sorted and map for examples.
Related
I have two Fluxes: one for successful elements, another one holding the erroneous elements.
Flux<String> success= Flux.just("Orange", "Apple", "Banana","Grape", "Strawberry");
Flux<String> erroneous = Flux.just("Banana", "Grape");
How can I filter the first Flux to exclude all the elements of the second one?
You may wish to consider collecting the Flux into a set, caching that set, and then using filterWhen as follows:
Mono<Set<String>> erroneousSet = erroneous.collect(Collectors.toSet()).cache();
Flux<String> filtered = success.filterWhen(v -> erroneousSet.map(s -> !s.contains(v)));
Gives:
Orange
Apple
Strawberry
This isn't the most concise solution (see below), but it enables the contents of erroneous to be cached. In this specific example that's a moot point, but if it's a real-world situation (not using Flux.just()) then erroneous could be recomputed on every subscription, and that could end up being incredibly (and unnecessarily) expensive in performance terms.
Alternatively, if the above really doesn't matter in your use case, filterWhen() and hasElement() can be used much more concisely as follows:
success.filterWhen(s -> erroneous.hasElement(s).map(x->!x))
Or with reactor-extra:
success.filterWhen(s -> BooleanUtils.not(erroneous.hasElement(s)))
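Assuming a recent Reactor Core on the classpath, the pieces above can be assembled into a small runnable example along these lines (the class name FilterFluxExample is just for illustration):

import java.util.Set;
import java.util.stream.Collectors;

import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

public class FilterFluxExample {
    public static void main(String[] args) {
        Flux<String> success = Flux.just("Orange", "Apple", "Banana", "Grape", "Strawberry");
        Flux<String> erroneous = Flux.just("Banana", "Grape");

        // Collect the erroneous elements once and cache the resulting set
        Mono<Set<String>> erroneousSet = erroneous.collect(Collectors.toSet()).cache();

        // Keep only the elements that are not present in the cached set
        Flux<String> filtered = success.filterWhen(v -> erroneousSet.map(set -> !set.contains(v)));

        filtered.subscribe(System.out::println); // Orange, Apple, Strawberry
    }
}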
I have read a lot about Java 8 streams lately, and several articles about lazy loading with Java 8 streams specifically: here and over here. I can't seem to shake the feeling that lazy loading is COMPLETELY useless (or at best, a minor syntactic convenience offering zero performance value).
Let's take this code as an example:
int[] myInts = new int[]{1, 2, 3, 5, 8, 13, 21};
IntStream myIntStream = IntStream.of(myInts);
int[] myChangedArray = myIntStream
    .peek(n -> System.out.println("About to square: " + n))
    .map(n -> (int) Math.pow(n, 2))
    .peek(n -> System.out.println("Done squaring, result: " + n))
    .toArray();
This will log in the console, because the terminal operation, in this case toArray(), is called, and our stream is lazy and executes only when the terminal operation is called. Of course I can also do this:
IntStream myChangedInts = myIntStream
    .peek(n -> System.out.println("About to square: " + n))
    .map(n -> (int) Math.pow(n, 2))
    .peek(n -> System.out.println("Done squaring, result: " + n));
And nothing will be printed, because the map isn't happening, because I don't need the data. Until I call this:
int[] myChangedArray = myChangedInts.toArray();
And voila, I get my mapped data and my console logs. Except I see zero benefit to it whatsoever. I realize I can define the filter code long before I call toArray(), and I can pass this "not-really-filtered" stream around, but so what? Is this the only benefit?
The articles seem to imply there is a performance gain associated with laziness, for example:
In the Java 8 Streams API, the intermediate operations are lazy and their internal processing model is optimized to make it being capable of processing the large amount of data with high performance.
and
Java 8 Streams API optimizes stream processing with the help of short circuiting operations. Short Circuit methods ends the stream processing as soon as their conditions are satisfied. In normal words short circuit operations, once the condition is satisfied just breaks all of the intermediate operations, lying before in the pipeline. Some of the intermediate as well as terminal operations have this behavior.
It sounds literally like breaking out of a loop, and not associated with laziness at all.
Finally, there is this perplexing line in the second article:
Lazy operations achieve efficiency. It is a way not to work on stale data. Lazy operations might be useful in the situations where input data is consumed gradually rather than having whole complete set of elements beforehand. For example consider the situations where an infinite stream has been created using Stream#generate(Supplier<T>) and the provided Supplier function is gradually receiving data from a remote server. In those kind of the situations server call will only be made at a terminal operation when it's needed.
Not working on stale data? What? How does lazy loading keep someone from working on stale data?
TLDR: Is there any benefit to lazy loading besides being able to run the filter/map/reduce/whatever operation at a later time (which offers zero performance benefit)?
If so, what's a real-world use case?
Your terminal operation, toArray(), perhaps supports your argument given that it requires all elements of the stream.
Some terminal operations don't. And for these, it would be a waste if streams weren't lazily executed. Two examples:
//example 1: print first element of 1000 after transformations
IntStream.range(0, 1000)
    .peek(System.out::println)
    .mapToObj(String::valueOf)
    .peek(System.out::println)
    .findFirst()
    .ifPresent(System.out::println);

//example 2: check if any value has an even key
boolean valid = records
    .map(this::heavyConversion)
    .filter(this::checkWithWebService)
    .mapToInt(Record::getKey)
    .anyMatch(i -> i % 2 == 0);
The first stream will print:
0
0
0
That is, the intermediate operations will be run on just one element. This is an important optimization. If the stream weren't lazy, then all the peek() calls would have to run on all elements, which is absolutely unnecessary since you're interested in just one. Intermediate operations can be expensive (as in the second example).
Short-circuiting terminal operations (of which toArray isn't one) make this optimization possible.
Laziness can be very useful for the users of your API, especially when the final result of the Stream pipeline evaluation might be very large!
The simple example is the Files.lines method in the Java API itself. If you don't want to read the whole file into the memory and you only need the first N lines, then just write:
Stream<String> stream = Files.lines(path); // lazy operation
List<String> result = stream.limit(N).collect(Collectors.toList()); // read and collect
You're right that there won't be a benefit from map().reduce() or map().collect(), but there's a pretty obvious benefit with findAny(), findFirst(), anyMatch(), allMatch(), etc. Basically, any operation that can be short-circuited.
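A minimal illustration of that point, using an infinite stream so the difference is unmistakable (the class name is just for illustration):

import java.util.stream.Stream;

public class ShortCircuitSketch {
    public static void main(String[] args) {
        // An infinite stream: without lazy, short-circuiting evaluation this could never terminate
        boolean found = Stream.iterate(1, i -> i + 1)
                .peek(i -> System.out.println("Testing: " + i))
                .anyMatch(i -> i % 7 == 0);

        // Only the elements 1..7 are ever generated and tested
        System.out.println(found);
    }
}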
Good question.
Assuming you write textbook-perfect code, the difference in performance between a properly optimized for loop and a stream is not noticeable (streams tend to be slightly better class-loading-wise, but the difference should not be noticeable in most cases).
Consider the following example.
// Some lengthy computation
private static int doStuff(int i) {
    try { Thread.sleep(1000); } catch (InterruptedException e) { }
    return i;
}

public static OptionalInt findFirstGreaterThanStream(int value) {
    return IntStream
        .of(MY_INTS)
        .map(Main::doStuff)
        .filter(x -> x > value)
        .findFirst();
}

public static OptionalInt findFirstGreaterThanFor(int value) {
    for (int i = 0; i < MY_INTS.length; i++) {
        int mapped = Main.doStuff(MY_INTS[i]);
        if (mapped > value) {
            return OptionalInt.of(mapped);
        }
    }
    return OptionalInt.empty();
}
Given the above methods, the next test should show they execute in about the same time.
public static void main(String[] args) {
    long begin;
    long end;

    begin = System.currentTimeMillis();
    System.out.println(findFirstGreaterThanStream(5));
    end = System.currentTimeMillis();
    System.out.println(end - begin);

    begin = System.currentTimeMillis();
    System.out.println(findFirstGreaterThanFor(5));
    end = System.currentTimeMillis();
    System.out.println(end - begin);
}
OptionalInt[8]
5119
OptionalInt[8]
5001
Anyway, we spend most of the time in the doStuff method. Let's say we want to add more threads to the mix.
Adjusting the stream method is trivial (provided your operations meet the preconditions of parallel streams).
public static OptionalInt findFirstGreaterThanParallelStream(int value) {
    return IntStream
        .of(MY_INTS)
        .parallel()
        .map(Main::doStuff)
        .filter(x -> x > value)
        .findFirst();
}
Achieving the same behavior without streams can be tricky.
public static OptionalInt findFirstGreaterThanParallelFor(int value, Executor executor) {
    AtomicInteger counter = new AtomicInteger(0);

    CompletableFuture<OptionalInt> cf = CompletableFuture.supplyAsync(() -> {
        while (counter.get() != MY_INTS.length - 1);
        return OptionalInt.empty();
    });

    for (int i = 0; i < MY_INTS.length; i++) {
        final int current = MY_INTS[i];
        executor.execute(() -> {
            int mapped = Main.doStuff(current);
            if (mapped > value) {
                cf.complete(OptionalInt.of(mapped));
            } else {
                counter.incrementAndGet();
            }
        });
    }

    try {
        return cf.get();
    } catch (InterruptedException | ExecutionException e) {
        e.printStackTrace();
        return OptionalInt.empty();
    }
}
The tests execute in about the same time again.
public static void main(String[] args) throws InterruptedException {
    long begin;
    long end;

    begin = System.currentTimeMillis();
    System.out.println(findFirstGreaterThanParallelStream(5));
    end = System.currentTimeMillis();
    System.out.println(end - begin);

    ExecutorService executor = Executors.newFixedThreadPool(10);

    begin = System.currentTimeMillis();
    System.out.println(findFirstGreaterThanParallelFor(5678, executor));
    end = System.currentTimeMillis();
    System.out.println(end - begin);

    executor.shutdown();
    executor.awaitTermination(10, TimeUnit.SECONDS);
    executor.shutdownNow();
}
OptionalInt[8]
1004
OptionalInt[8]
1004
In conclusion, although we don't squeeze a big performance benefit out of streams (considering you write excellent multi-threaded code in your for alternative), the code itself tends to be more maintainable.
A (slightly off-topic) final note:
As with programming languages, higher-level abstractions (streams relative to for loops) make things easier to develop at the cost of performance. We did not move from assembly to procedural languages to object-oriented languages because the latter offered greater performance. We moved because it made us more productive (develop the same thing at a lower cost). If you are able to get the same performance out of a stream as you would with a for loop and properly written multi-threaded code, I would say it's already a win.
I have a real example from our code base; since I'm going to simplify it, I'm not entirely sure you'll like it or fully grasp it...
We have a service that needs a List<CustomService>, and I am supposed to call it. In order to call it, I go to a database (much simpler than in reality) and obtain a List<DBObject>; to get a List<CustomService> from that, some heavy transformations need to be done.
And here are my choices: transform in place and pass the list. Simple, yet probably not that optimal. Second option: refactor the service to accept a List<DBObject> and a Function<DBObject, CustomService>. This sounds trivial, but it enables laziness (among other things). The service might sometimes need only a few elements from that List, or sometimes a max by some property, etc., so there is no need for me to do the heavy transformation for all elements; this is where the Stream API's pull-based laziness is a winner.
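A hedged sketch of what that refactoring might look like; all the type and method names here are hypothetical stand-ins, not our real code:

import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

public class ServiceRefactorSketch {

    // Hypothetical stand-ins for the real types in the description
    static class DBObject { final String raw; DBObject(String raw) { this.raw = raw; } }
    static class CustomService { final String value; CustomService(String value) { this.value = value; } }

    // Instead of receiving a fully transformed List<CustomService>, the service receives the raw
    // rows plus the transformation, so it only transforms as many elements as it actually needs.
    static Optional<CustomService> firstMatching(List<DBObject> rows,
                                                 Function<DBObject, CustomService> transform) {
        return rows.stream()
                .map(transform)                         // heavy work happens lazily, element by element
                .filter(cs -> cs.value.startsWith("A")) // some business condition
                .findFirst();                           // stops as soon as one element qualifies
    }

    public static void main(String[] args) {
        List<DBObject> rows = Arrays.asList(new DBObject("Apple"), new DBObject("Banana"));
        firstMatching(rows, row -> {
            System.out.println("Heavy transformation of " + row.raw);
            return new CustomService(row.raw);
        }).ifPresent(cs -> System.out.println("Got: " + cs.value));
    }
}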
Before Streams existed, we used to use guava. It had Lists.transform(list, function), which was lazy too.
It's not a fundamental feature of streams as such (it could have been done even without guava), but it's a lot simpler that way. The example provided here with findFirst is great and the simplest to understand; this is the entire point of laziness: elements are pulled only when needed, they are not passed from one intermediate operation to another in chunks, but move from one stage to the next one at a time.
One interesting use case that hasn't been mentioned is arbitrary composition of operations on streams, coming from different parts of the code base and responding to different sorts of business or technical requirements.
For example, say you have an application where certain users can see all the data but certain other users can only see part of it. The part of the code that checks user permissions can simply impose a filter on whatever stream is being handed about.
Without lazy streams, that same part of the code could be filtering the already realized full collection, but that may have been expensive to obtain, for no real gain.
Alternatively, that same part of the code might want to append its filter to the data source itself, but then it has to know whether the data comes from a database (so it can impose an additional WHERE clause) or from some other source.
With lazy streams, it's a filter that can be implemented every which way. Filters imposed on streams coming from the database can translate into the aforementioned WHERE clause, with obvious performance gains over filtering in-memory collections resulting from whole-table reads.
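A small sketch of that idea (the names are hypothetical): the permission module returns an operator that can be composed onto any Stream it is handed, regardless of where the data came from:

import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class PermissionFilterSketch {

    // Hypothetical permission module: it only knows how to restrict a stream,
    // not where the data comes from or how big it is.
    static UnaryOperator<Stream<String>> restrictionFor(boolean canSeeEverything) {
        return canSeeEverything ? s -> s : s -> s.filter(row -> !row.startsWith("secret"));
    }

    public static void main(String[] args) {
        Stream<String> data = Arrays.asList("public-1", "secret-2", "public-3").stream();

        // The permission check is composed onto whatever stream it is handed
        List<String> visible = restrictionFor(false).apply(data).collect(Collectors.toList());
        System.out.println(visible); // [public-1, public-3]
    }
}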
So, a better abstraction, better performance, better code readability and maintainability, sounds like a win to me. :)
A non-lazy implementation would process all input and collect the output into a new collection on each operation. That is obviously impossible for unlimited or large enough sources, memory-consuming otherwise, and unnecessarily memory-consuming in the case of reducing and short-circuiting operations, so there are great benefits.
Check the following example:
Stream.of("0","0","1","2","3","4")
.distinct()
.peek(a->System.out.println("after distinct: "+a))
.anyMatch("1"::equals);
If it were not lazy, you would expect all elements to pass through the distinct filtering first. But because of lazy execution it behaves differently: it streams the minimum number of elements needed to calculate the result.
The above example will print
after distinct: 0
after distinct: 1
How it works, step by step:
The first "0" goes all the way to the terminal operation but does not satisfy it. Another element must be streamed.
The second "0" is filtered out by .distinct() and never reaches the terminal operation.
Since the terminal operation is not satisfied yet, the next element is streamed.
"1" goes through the terminal operation and satisfies it.
No more elements need to be streamed.
This question already has answers here:
Why filter() after flatMap() is "not completely" lazy in Java streams?
Consider the following code:
urls.stream()
    .flatMap(url -> fetchDataFromInternet(url).stream())
    .filter(...)
    .findFirst()
    .get();
Will fetchDataFromInternet be called for the second url when the first one was enough?
I tried with a smaller example, and it looks like it works as expected, i.e. it processes the data one by one. But can this behavior be relied on? If not, does calling .sequential() before .flatMap(...) help?
Stream.of("one", "two", "three")
.flatMap(num -> {
System.out.println("Processing " + num);
// return FetchFromInternetForNum(num).data().stream();
return Stream.of(num);
})
.peek(num -> System.out.println("Peek before filter: "+ num))
.filter(num -> num.length() > 0)
.peek(num -> System.out.println("Peek after filter: "+ num))
.forEach(num -> {
System.out.println("Done " + num);
});
Output:
Processing one
Peek before filter: one
Peek after filter: one
Done one
Processing two
Peek before filter: two
Peek after filter: two
Done two
Processing three
Peek before filter: three
Peek after filter: three
Done three
Update: I am using the official Oracle JDK 8, if the implementation matters.
Answer:
Based on the comments and the answers below, flatMap is partially lazy, i.e. it reads the first stream fully and only moves on to the next one when required. Reading a single stream is eager, but reading multiple streams is lazy.
If this behavior is intended, the API should let the function return an Iterable instead of a stream.
In other words: link
Under the current implementation, flatMap is eager, like any other stateful intermediate operation (such as sorted and distinct). And it's very easy to prove:
int result = Stream.of(1)
    .flatMap(x -> Stream.generate(() -> ThreadLocalRandom.current().nextInt()))
    .findFirst()
    .get();

System.out.println(result);
This never finishes as flatMap is computed eagerly. For your example:
urls.stream()
    .flatMap(url -> fetchDataFromInternet(url).stream())
    .filter(...)
    .findFirst()
    .get();
It means that for each url, the flatMap will block all the other operations that come after it, even if you care about a single element. So suppose that for a single url your fetchDataFromInternet(url) generates 10_000 lines; your findFirst will have to wait for all 10_000 to be computed, even though you care about only one.
EDIT
This is fixed in Java 10, where we get our laziness back: see JDK-8075939
EDIT 2
This is fixed in Java 8 too (8u222): JDK-8225328
It’s not clear why you set up an example that does not address the actual question you’re interested in. If you want to know whether the processing is lazy when applying a short-circuiting operation like findFirst(), then use an example with findFirst() instead of forEach, which processes all elements anyway. Also, put the logging statement right into the function whose evaluation you want to track:
Stream.of("hello", "world")
.flatMap(s -> {
System.out.println("flatMap function evaluated for \""+s+'"');
return s.chars().boxed();
})
.peek(c -> System.out.printf("processing element %c%n", c))
.filter(c -> c>'h')
.findFirst()
.ifPresent(c -> System.out.printf("found an %c%n", c));
flatMap function evaluated for "hello"
processing element h
processing element e
processing element l
processing element l
processing element o
found an l
This demonstrates that the function passed to flatMap gets evaluated lazily, as expected, while the elements of the returned (sub-)stream are not evaluated as lazily as possible, as already discussed in the Q&A you have linked yourself.
So, regarding your fetchDataFromInternet method that gets invoked from the function passed to flatMap, you will get the desired laziness. But not for the data it returns.
Today I also stumbled upon this bug. The behavior is not so straightforward, because a simple case like the one below works fine, yet similar production code doesn't:
stream(spliterator).map(o -> o).flatMap(Stream::of).flatMap(Stream::of).findAny()
For those who cannot wait another couple of years for the migration to JDK 10, there is an alternative, truly lazy stream. It doesn't support parallel execution. It was dedicated to JavaScript translation, but it worked out for me, because the interface is the same.
StreamHelper is collection-based, but it is easy to adapt it to a Spliterator.
https://github.com/yaitskov/j4ts/blob/stream/src/main/java/javaemul/internal/stream/StreamHelper.java
I've gone through several previous questions like Encounter order preservation in java stream, this answer by Brian Goetz, as well as the javadoc for Stream.reduce(), and the java.util.stream package javadoc, and yet I still can't grasp the following:
Take this piece of code:
public static void main(String... args) {
    final String[] alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ".split("");
    System.out.println("Alphabet: ".concat(Arrays.toString(alphabet)));
    System.out.println(new HashSet<>(Arrays.asList(alphabet))
        .parallelStream()
        .unordered()
        .peek(System.out::println)
        .reduce("", (a, b) -> a + b, (a, b) -> a + b));
}
Why is the reduction always* preserving the encounter order?
*So far, after several dozen runs, the output has been the same.
First of all, unordered does not imply actual shuffling; all it does is set a flag for the stream pipeline, which could later be leveraged.
A shuffle of the source elements could potentially be much more expensive than the operations on the stream pipeline themselves, so the implementation might choose not to do it (as in this case).
At the moment (tested and checked the sources of jdk-8 and jdk-9), reduce does not take that flag into account. Notice that this could very well change in a future build or release.
Also, when you say unordered, you actually mean that you don't care about the order, so the stream returning the same result is not a violation of that rule.
For example, notice this question/answer, which explains that findFirst (just another terminal operation) was changed to take unordered into consideration in java-9, as opposed to java-8.
To help explain this, I am going to reduce the scope of this string to ABCD.
The parallel stream will divide the string into two pieces: AB and CD. When we go to combine these later, the result of the AB side will be the first argument passed into the function, while the result of the CD side will be the second argument passed into the function. This is regardless of which of the two actually finishes first.
The unordered operator will affect some operations on a stream, such as a limit operation, but it does not affect a simple reduce.
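To see that splitting and recombining in action, here is a small sketch using an ordered list source. It is purely illustrative: the logging output will vary between runs, and on such a tiny input the pipeline may not split the same way on every machine.

import java.util.Arrays;

public class CombineOrderSketch {
    public static void main(String[] args) {
        String result = Arrays.asList("A", "B", "C", "D")
                .parallelStream()
                .reduce("",
                        (acc, s) -> {
                            System.out.println(Thread.currentThread().getName() + " accumulating " + s);
                            return acc + s;
                        },
                        (left, right) -> {
                            System.out.println("combining \"" + left + "\" with \"" + right + "\"");
                            return left + right;
                        });

        // The accumulator may run on several threads in any order, but each combine step
        // receives the result of the earlier (left) chunk as its first argument,
        // so an ordered source still yields "ABCD".
        System.out.println(result);
    }
}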
TLDR: .reduce() does not always preserve order; its result is based on the stream's spliterator characteristics.
Spliterator
The encounter order of the stream depends on stream spliterator (None of the answers mentioned that before).
There are different spliterators depending on the stream's source. You can find out which spliterator a collection uses from its source code (and you can also check its characteristics at runtime, as sketched after this list):
HashSet -> HashMap#KeySpliterator = Not ordered
ArrayDeque = Ordered
ArrayList = Ordered
TreeSet -> TreeMap#Spliterator = Ordered and sorted
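A runtime check of those characteristics could look like this (the class name is illustrative): each collection's spliterator reports whether it defines an encounter order.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Spliterator;
import java.util.TreeSet;

public class SpliteratorCharacteristics {
    public static void main(String[] args) {
        System.out.println(ordered(new HashSet<>(Arrays.asList("A", "B"))));   // false
        System.out.println(ordered(new ArrayList<>(Arrays.asList("A", "B")))); // true
        System.out.println(ordered(new TreeSet<>(Arrays.asList("A", "B"))));   // true (also SORTED)
    }

    static boolean ordered(Collection<String> source) {
        return source.spliterator().hasCharacteristics(Spliterator.ORDERED);
    }
}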
logicbig.com - Ordering
logicbig.com - Stateful vs Stateless
Additionally, you can apply the .unordered() intermediate stream operation, which specifies that subsequent operations in the stream should not rely on ordering.
Stream operations (mostly stateful ones) that are affected by the spliterator and by the use of the .unordered() method are:
.findFirst()
.limit()
.skip()
.distinct()
Those operations will give us different results based on the order property of the stream and its spliterator.
The .peek() method does not take ordering into consideration; if the stream is executed in parallel, it will always print/receive elements in an unordered manner.
.reduce()
Now for the terminal .reduce() method. The intermediate operation .unordered() doesn't have any effect on the type of the spliterator (as @Eugene mentioned). But, importantly, the spliterator stays the same as in the source: if the source spliterator is ordered, the result of .reduce() will be ordered; if the source was unordered, the result of .reduce() will be unordered.
You are using new HashSet<>(Arrays.asList(alphabet)) to get the instance of the stream. Its spliterator is unordered. It was just a coincidence that you are getting an ordered result, because you are using single uppercase letters as the elements of the stream and the unordered result happens to be the same. Now if you mix that with numbers, or mix lower case and upper case, this doesn't hold true anymore. For example, take the following inputs (the first one is a subset of the example you posted):
HashSet .reduce() - Unordered
"A","B","C","D","E","F" -> "ABCDEF"
"a","b","c","1","2","3","A","B","C" -> "a1Ab2Bc3C"
"Apple","Orange","Banana","Mango" -> "AppleMangoOrangeBanana"
TreeSet .reduce() - Ordered, Sorted
"A","B","C","D","E","F" -> "ABCDEF"
"a","b","c","1","2","3","A","B","C" -> "123ABCabc"
"Apple","Orange","Banana","Mango" -> "AppleBananaMangoOrange"
ArrayList .reduce() - Ordered
"A","B","C","D","E","F" -> "ABCDEF"
"a","b","c","1","2","3","A","B","C" -> "abc123ABC"
"Apple","Orange","Banana","Mango" -> "AppleOrangeBananaMango"
You see that testing .reduce() operation only with an alphabet source stream can lead to false conclusions.
The answer is .reduce() is not always preserving order, its result is based on the stream spliterator characteristics.
I'm wondering if I can add an operation to a stream, based off of some sort of condition set outside of the stream. For example, I want to add a limit operation to the stream if my limit variable is not equal to -1.
My code currently looks like this, but I have yet to see other examples of streams being used this way, where a Stream object is reassigned to the result of an intermediate operation applied on itself:
// Do some stream stuff
stream = stream.filter(e -> e.getTimestamp() < max);
// Limit the stream
if (limit != -1) {
    stream = stream.limit(limit);
}
// Collect stream to list
stream.collect(Collectors.toList());
As stated in this stackoverflow post, the filter isn't actually applied until a terminal operation is called. Since I'm reassigning the value of stream before a terminal operation is called, is the above code still a proper way to use Java 8 streams?
There is no semantic difference between a chained series of invocations and a series of invocations storing the intermediate return values. Thus, the following code fragments are equivalent:
a = object.foo();
b = a.bar();
c = b.baz();
and
c = object.foo().bar().baz();
In either case, each method is invoked on the result of the previous invocation. But in the latter case, the intermediate results are not stored but lost on the next invocation. In the case of the stream API, the intermediate results must not be used after you have called the next method on them; thus, chaining is the natural way of using streams, as it intrinsically ensures that you don’t invoke more than one method on a returned reference.
Still, it is not wrong to store the reference to a stream as long as you obey the contract of not using a returned reference more than once. By using it the way you do in your question, i.e. overwriting the variable with the result of the next invocation, you also ensure that you don’t invoke more than one method on a returned reference, thus it’s a correct usage. Of course, this only works with intermediate results of the same type, so when you are using map or flatMap and get a stream of a different reference type, you can’t overwrite the local variable. Then you have to be careful not to use the old local variable again but, as said, as long as you are not using it after the next invocation, there is nothing wrong with the intermediate storage.
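A small illustration of that point about element types (the class name is just for the example):

import java.util.stream.IntStream;
import java.util.stream.Stream;

public class IntermediateReferences {
    public static void main(String[] args) {
        Stream<String> words = Stream.of("a", "bb", "ccc").filter(w -> !w.isEmpty());
        // Same element type: overwriting the variable is fine
        words = words.filter(w -> w.length() > 1);

        // Different element type: a new variable is needed, and 'words'
        // must not be used again after this call
        IntStream lengths = words.mapToInt(String::length);
        System.out.println(lengths.sum()); // 5
    }
}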
Sometimes, you have to store it, e.g.
try (Stream<String> stream = Files.lines(Paths.get("myFile.txt"))) {
    stream.filter(s -> !s.isEmpty()).forEach(System.out::println);
}
Note that the code is equivalent to the following alternatives:
try (Stream<String> stream = Files.lines(Paths.get("myFile.txt")).filter(s -> !s.isEmpty())) {
    stream.forEach(System.out::println);
}
and
try (Stream<String> srcStream = Files.lines(Paths.get("myFile.txt"))) {
    Stream<String> tmp = srcStream.filter(s -> !s.isEmpty());
    // must not use the variable srcStream here:
    tmp.forEach(System.out::println);
}
They are equivalent because forEach is always invoked on the result of filter which is always invoked on the result of Files.lines and it doesn’t matter on which result the final close() operation is invoked as closing affects the entire stream pipeline.
To put it in one sentence, the way you use it, is correct.
I even prefer to do it that way, as not chaining a limit operation when you don’t want to apply a limit is the cleanest way of expressing your intent. It’s also worth noting that the suggested alternatives may work in a lot of cases, but they are not semantically equivalent:
.limit(condition? aLimit: Long.MAX_VALUE)
assumes that the maximum number of elements you can ever encounter is Long.MAX_VALUE, but streams can have more elements than that; they might even be infinite.
.limit(condition? aLimit: list.size())
when the stream source is a list, breaks the lazy evaluation of the stream. In principle, a mutable stream source might legally get arbitrarily changed up to the point when the terminal action is commenced, and the result will reflect all modifications made up to that point. When you add an intermediate operation incorporating list.size(), i.e. the actual size of the list at that point, subsequent modifications applied to the collection between this point and the terminal operation may cause this value to have a different meaning than the intended “actually no limit” semantics.
Compare with “Non Interference” section of the API documentation:
For well-behaved stream sources, the source can be modified before the terminal operation commences and those modifications will be reflected in the covered elements. For example, consider the following code:
List<String> l = new ArrayList(Arrays.asList("one", "two"));
Stream<String> sl = l.stream();
l.add("three");
String s = sl.collect(joining(" "));
First a list is created consisting of two strings: "one"; and "two". Then a stream is created from that list. Next the list is modified by adding a third string: "three". Finally the elements of the stream are collected and joined together. Since the list was modified before the terminal collect operation commenced the result will be a string of "one two three".
Of course, this is a rare corner case as normally, a programmer will formulate an entire stream pipeline without modifying the source collection in between. Still, the different semantic remains and it might turn into a very hard to find bug when you once enter such a corner case.
Further, since they are not equivalent, the stream API will never recognize these values as “actually no limit”. Even specifying Long.MAX_VALUE implies that the stream implementation has to track the number of processed elements to ensure that the limit has been obeyed. Thus, not adding a limit operation can have a significant performance advantage over adding a limit with a number that the programmer expects to never be exceeded.
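If you still want the conditional to read as part of a single expression, a tiny helper along these lines keeps this advice intact by not adding any limit at all when none was requested (the helper name is just for illustration):

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ConditionalLimit {

    // Only decorate the stream with limit() when a real limit was requested;
    // otherwise hand the stream back untouched, so no element counting is added.
    static <T> Stream<T> limitIfRequested(Stream<T> stream, long limit) {
        return limit != -1 ? stream.limit(limit) : stream;
    }

    public static void main(String[] args) {
        List<Integer> result = limitIfRequested(
                Arrays.asList(1, 2, 3, 4, 5).stream().filter(i -> i > 1),
                2)
            .collect(Collectors.toList());
        System.out.println(result); // [2, 3]
    }
}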
There are two ways you can do this:
// Do some stream stuff
List<E> results = list.stream()
    .filter(e -> e.getTimestamp() < max)
    .limit(limit > 0 ? limit : list.size())
    .collect(Collectors.toList());
OR
// Do some stream stuff
stream = stream.filter(e -> e.getTimestamp() < max);
// Limit the stream
if (limit != -1) {
    stream = stream.limit(limit);
}
// Collect stream to list
List<E> results = stream.collect(Collectors.toList());
As this is functional programming, you should always work on the result of each function. You should specifically avoid modifying anything in this style of programming and treat everything as if it were immutable where possible.
Since I'm reassigning the value of stream before a terminal operation is called, is the above code still a proper way to use Java 8 streams?
It should work, however it reads as a mix of imperative and functional coding. I suggest writing it as a fixed stream as per my first answer.
I think your first line needs to be:
stream = stream.filter(e -> e.getTimestamp() < max);
so that you're using the stream returned by filter in subsequent operations, rather than the original stream.
I know it is a bit late, but I had the same question myself and didn't find a satisfying answer. However, inspired by this question and its answers, I came to the following solution:
return Stream.of( ///< wrap the target stream in another stream ;)
    /* do regular stream stuff */
    stream.filter(e -> e.getTimestamp() < max)
).flatMap(s -> limit != -1 ? s.limit(limit) : s) ///< apply the limit only if necessary and unwrap the stream of streams back to a "normal" stream
 .collect(Collectors.toList()); ///< do final stuff