Create a finite stream with a generator function - java

I have a program that reads data from multiple sources, uses a tournament tree to merge-sort them, packs the data into blocks, and outputs the blocks. I have implemented this as a function that returns null when no more blocks are available.
DataBlock buildBlock()
Now I want to output a stream of blocks, but the only method I have found so far is Stream.generate, which generates an infinite stream. My stream is of course not infinite. What is a proper way to generate a finite stream from this function?

If you use at least Java 9, you can apply takeWhile(Objects::nonNull) on your stream. If you use an older Java version, check out this question.
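Applied to the question's buildBlock(), that looks like this (a minimal sketch; DataBlock and buildBlock() are the asker's, assumed accessible as an instance method, and takeWhile requires Java 9+):
Stream<DataBlock> blocks = Stream.generate(this::buildBlock) // wrap buildBlock() as a Supplier
        .takeWhile(Objects::nonNull);                        // truncate at the first null block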

You can create a stream of Optionals with Stream.iterate(T seed, Predicate<? super T> hasNext, UnaryOperator<T> next), stopping on empty. You can then call .map(Optional::get) on the stream.
Here's an example of creating a stream that tracks the cumulative sum of a list.
public static Stream<Integer> cumulativeSum(List<Integer> nums) {
    Iterator<Integer> numItr = nums.iterator();
    if (!numItr.hasNext()) {
        return Stream.of();
    }
    return Stream.iterate(
            Optional.of(numItr.next()),
            Optional::isPresent,
            maybeSum -> maybeSum.flatMap(sum ->
                    numItr.hasNext() ? Optional.of(Integer.sum(sum, numItr.next()))
                                     : Optional.empty()))
        .map(Optional::get);
}
If the input list is [2, 4, 3], then the output stream will be [2, 6, 9].
(Java 8 only has Stream.iterate(T seed, UnaryOperator<T> f), which yields an infinite stream, and takeWhile itself was only added in Java 9, so on Java 8 you need another way to truncate the stream, such as the linked question above or the spliterator-based sketch below.)
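For reference, a hedged Java 8 sketch of such a workaround, wrapping the question's buildBlock() in a look-ahead Iterator (DataBlock and buildBlock() come from the question):
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

Iterator<DataBlock> it = new Iterator<DataBlock>() {
    private DataBlock next = buildBlock(); // look one block ahead
    @Override public boolean hasNext() { return next != null; }
    @Override public DataBlock next() {
        if (next == null) throw new NoSuchElementException();
        DataBlock current = next;
        next = buildBlock();
        return current;
    }
};
Stream<DataBlock> blocks = StreamSupport.stream(
        Spliterators.spliteratorUnknownSize(it, Spliterator.ORDERED | Spliterator.NONNULL),
        false); // false = sequential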


Doesn't Stream.parallel() update the characteristics of spliterator?

This question is based on the answers to this question: What is the difference between Stream.of and IntStream.range?
Since IntStream.range produces an already-sorted stream, the code below prints only 0:
IntStream.range(0, 4)
    .peek(System.out::println)
    .sorted()
    .findFirst();
The spliterator also has the SORTED characteristic; the code below returns true:
System.out.println(
    IntStream.range(0, 4)
        .spliterator()
        .hasCharacteristics(Spliterator.SORTED)
);
Now, if I introduce a parallel() in the first snippet then, as expected, the output contains all 4 numbers from 0 to 3, but in a random order, because the stream is not sorted anymore due to parallel().
IntStream.range(0, 4)
    .parallel()
    .peek(System.out::println)
    .sorted()
    .findFirst();
This produces something like the following (in some random order):
2
0
1
3
So I expected that the SORTED characteristic had been removed due to parallel(). But the code below returns true as well:
System.out.println(
    IntStream.range(0, 4)
        .parallel()
        .spliterator()
        .hasCharacteristics(Spliterator.SORTED)
);
Why doesn't parallel() change the SORTED characteristic? And since all four numbers are printed, how does Java realize that the stream is not sorted even though the SORTED characteristic is still present?
How exactly this is done is very much an implementation detail. You will have to dig deep into the source code to really see why. Basically, parallel and sequential pipelines are just handled differently. Look at AbstractPipeline.evaluate, which checks isParallel() and then does different things depending on whether the pipeline is parallel:
return isParallel()
       ? terminalOp.evaluateParallel(this, sourceSpliterator(terminalOp.getOpFlags()))
       : terminalOp.evaluateSequential(this, sourceSpliterator(terminalOp.getOpFlags()));
If you then look at SortedOps.OfInt, you'll see that it overrides two methods:
@Override
public Sink<Integer> opWrapSink(int flags, Sink sink) {
    Objects.requireNonNull(sink);
    if (StreamOpFlag.SORTED.isKnown(flags))
        return sink;
    else if (StreamOpFlag.SIZED.isKnown(flags))
        return new SizedIntSortingSink(sink);
    else
        return new IntSortingSink(sink);
}
@Override
public <P_IN> Node<Integer> opEvaluateParallel(PipelineHelper<Integer> helper,
                                               Spliterator<P_IN> spliterator,
                                               IntFunction<Integer[]> generator) {
    if (StreamOpFlag.SORTED.isKnown(helper.getStreamAndOpFlags())) {
        return helper.evaluate(spliterator, false, generator);
    }
    else {
        Node.OfInt n = (Node.OfInt) helper.evaluate(spliterator, true, generator);
        int[] content = n.asPrimitiveArray();
        Arrays.parallelSort(content);
        return Nodes.node(content);
    }
}
opWrapSink will eventually be called if it's a sequential pipeline, and opEvaluateParallel (as its name suggests) will be called when it's a parallel stream. Notice how opWrapSink doesn't do anything to the given sink if the pipeline is already sorted (it just returns it unchanged), but opEvaluateParallel always evaluates the spliterator.
Also note that parallel-ness and sorted-ness are not mutually exclusive. You can have a stream with any combination of those characteristics.
"Sorted" is a characteristic of a Spliterator. It's not technically a characteristic of a Stream (like "parallel" is). Sure, parallel could create a stream with an entirely new spliterator (that gets elements from the original spliterator) with entirely new characteristics, but why do that, when you can just reuse the same spliterator? Id imagine you'll have to handle parallel and sequential streams differently in any case.
You need to take a step back and think of how you would solve such a problem in general, considering that ForkJoinPool is used for parallel streams and it works based on work stealing. It would be very helpful if you knew how a Spliterator works, too. Some details here.
You have a certain Stream, you "split it" (very simplified) into smaller pieces and give all those pieces to a ForkJoinPool for execution. All of those pieces are worked on independently, by individual threads. Since we are talking about threads here, there is obviously no sequence of events, things happen randomly (that is why you see a random order output).
If your stream preserves order, the terminal operation is supposed to preserve it too. So while intermediate operations may be executed in any order, your terminal operation (if the stream up to that point is ordered) will handle elements in an ordered fashion. To put it in slightly simplified terms:
System.out.println(
    IntStream.of(1, 2, 3)
        .parallel()
        .map(x -> { System.out.println(x * 2); return x * 2; })
        .boxed()
        .collect(Collectors.toList()));
map will process elements in an unknown order (ForkJoinPool and threads, remember that), but collect will receive elements in order, "left to right".
Now, if we extrapolate that to your example: when you invoke parallel, the stream is split into small pieces that are worked on. For example, look at how this is split (a single time):
Spliterator<Integer> spliterator =
    IntStream.of(5, 4, 3, 2, 1, 5, 6, 7, 8)
        .parallel()
        .boxed()
        .sorted()
        .spliterator()
        .trySplit(); // trySplit is invoked internally on parallel
spliterator.forEachRemaining(System.out::println);
On my machine it prints 1, 2, 3, 4. This means that the internal implementation splits the stream into two Spliterators: left and right. left has [1, 2, 3, 4] and right has [5, 5, 6, 7, 8]. But that is not all; these Spliterators can be split further. For example:
Spliterator<Integer> spliterator =
    IntStream.of(5, 4, 3, 2, 1, 5, 6, 7, 8)
        .parallel()
        .boxed()
        .sorted()
        .spliterator()
        .trySplit()
        .trySplit()
        .trySplit();
spliterator.forEachRemaining(System.out::println);
If you try to invoke trySplit again, you will get null, meaning: that's it, it cannot be split any further.
So your Stream, IntStream.range(0, 4), is going to be split into 4 spliterators, all worked on individually, each by a thread. If the first thread knows that the Spliterator it is currently working on is the "left-most" one, that's it! The rest of the threads do not even need to start their work: the result is known.
On the other hand, it could be that the Spliterator holding the "left-most" element is started last. So the first three might already be done with their work (which is why peek is invoked in your example), yet they do not "produce" the needed result.
As a matter of fact, this is how it is done internally. You do not need to understand the code, but the flow and the method names should be obvious.
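To see that "left-most wins" behavior in one runnable check (the peek output order varies between runs, but the result is always 0, because findFirst on an ordered stream must return the first element):
import java.util.OptionalInt;
import java.util.stream.IntStream;

OptionalInt first = IntStream.range(0, 4)
        .parallel()
        .peek(System.out::println) // 0..3 printed in some nondeterministic order
        .sorted()
        .findFirst();
System.out.println("result: " + first.getAsInt()); // always 0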

For Java streams, does generate + limit guarantee no additional calls to the generator function, or is there a preferred alternative?

I have a source of data that I know has n elements, which I can access by repeatedly calling a method on an object; for the sake of example, let's call it myReader.read(). I want to create a stream of data containing those n elements. Let's also say that I don't want to call the read() method more times than the amount of data I want to return, as it will throw an exception (e.g. NoSuchElementException) if it is called after the end of the data is reached.
I know I can create this stream by using the IntStream.range method and mapping each element using the read method. However, this feels a little weird, since I'm completely ignoring the int values in the stream (I'm really just using it to produce a stream with exactly n elements).
return IntStream.range(0, n).mapToObj(i -> myReader.read());
An approach I've considered is using Stream.generate(supplier) followed by Stream.limit(maxSize). Based on my understanding of the limit function, this feels like it should work.
Stream.generate(myReader::read).limit(n)
However, nowhere in the API documentation do I see an indication that the Stream.limit() method guarantees exactly maxSize elements are generated by the stream it's called on. It wouldn't be infeasible for a stream implementation to call the generator function more than n times, so long as the end result was just the first n calls, and so long as it meets the API contract for being a short-circuiting intermediate operation.
Stream.limit JavaDocs
Returns a stream consisting of the elements of this stream, truncated to be no longer than maxSize in length.
This is a short-circuiting stateful intermediate operation.
Stream operations and pipelines documentation
An intermediate operation is short-circuiting if, when presented with infinite input, it may produce a finite stream as a result. [...] Having a short-circuiting operation in the pipeline is a necessary, but not sufficient, condition for the processing of an infinite stream to terminate normally in finite time.
Is it safe to rely on Stream.generate(generator).limit(n) only making n calls to the underlying generator? If so, is there some documentation of this fact that I'm missing?
And to avoid the XY Problem: what is the idiomatic way of creating a stream by performing an operation exactly n times?
Stream.generate creates an unordered Stream. This implies that the subsequent limit operation is not required to use the first n elements (there is no “first” when there is no order) but may select arbitrary n elements. The implementation may exploit this permission, e.g. for higher parallel processing performance.
The following code
IntSummaryStatistics s =
    Stream.generate(new AtomicInteger()::incrementAndGet)
        .parallel()
        .limit(100_000)
        .collect(Collectors.summarizingInt(Integer::intValue));
System.out.println(s);
prints something like
IntSummaryStatistics{count=100000, sum=5000070273, min=1, average=50000,702730, max=100207}
on my machine, whereas the max number may vary. It demonstrates that the Stream has selected exactly 100000 elements, as required, but not the elements from 1 to 100000. Since the generator produces strictly ascending numbers, it's clear that it has been called more than 100000 times to get numbers higher than that.
Another example
System.out.println(
    Stream.generate(new AtomicInteger()::incrementAndGet)
        .parallel()
        .map(String::valueOf)
        .limit(10)
        .collect(Collectors.toList())
);
prints something like this on my machine (JDK-14)
[4, 8, 5, 6, 10, 3, 7, 1, 9, 11]
With JDK-8, it even prints something like
[4, 14, 18, 24, 30, 37, 42, 52, 59, 66]
If a construct like
IntStream.range(0, n).mapToObj(i -> myReader.read())
feels weird due to the unused i parameter, you may use
Collections.nCopies(n, myReader).stream().map(TypeOfMyReader::read)
instead. This doesn’t show an unused int parameter and works equally well, as in fact it’s internally implemented as IntStream.range(0, n).mapToObj(i -> element). There is no way around some counter, visible or hidden, to ensure that the method is called n times. Note that, since read is likely a stateful operation, the resulting behavior will always be like that of an unordered stream when enabling parallel processing, but the IntStream and nCopies approaches create a finite stream that will never invoke the method more than the specified number of times.
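To check that last claim concretely, a small self-contained sketch (the "reader" here is a hypothetical stand-in: a supplier that counts its own invocations):
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

AtomicInteger calls = new AtomicInteger();       // counts invocations of the "reader"
Supplier<Integer> read = calls::incrementAndGet; // stand-in for myReader::read

List<Integer> result = IntStream.range(0, 5)     // finite: exactly 5 indices
        .mapToObj(i -> read.get())               // index i is ignored, one call per element
        .collect(Collectors.toList());

System.out.println(result);      // [1, 2, 3, 4, 5]
System.out.println(calls.get()); // 5: read was invoked exactly n times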
Only answering the XY-problem part of your question: simply create a spliterator for your reader.
import java.util.NoSuchElementException;
import java.util.Spliterator;
import java.util.function.Consumer;

class MyStreamSpliterator implements Spliterator<String> { // or whichever datatype
    private final MyReaderClass reader;

    public MyStreamSpliterator(MyReaderClass reader) {
        this.reader = reader;
    }

    @Override
    public boolean tryAdvance(Consumer<? super String> action) {
        try {
            String nextval = reader.read();
            action.accept(nextval);
            return true;
        } catch (NoSuchElementException e) {
            // cleanup if necessary
            return false;
        }
        // Alternative: if you really really want to use n iterations,
        // add a counter and use it.
    }

    @Override
    public Spliterator<String> trySplit() {
        return null; // we don't split
    }

    @Override
    public long estimateSize() {
        return Long.MAX_VALUE; // or the correct value, if you know it beforehand
    }

    @Override
    public int characteristics() {
        // add SIZED if you know the size
        return Spliterator.IMMUTABLE | Spliterator.ORDERED;
    }
}
Then create your stream as StreamSupport.stream(new MyStreamSpliterator(reader), false).
Disclaimer: I just threw this together in the SO editor, probably there are some errors.

Collect both min and max in one stream

I need to print both the min and the max of a stream of ints in one operation. I currently have 2 operations, but the second is not allowed. Somehow collectors are not working for me:
Stream<Integer> stringInt = Stream.of(8, 50, 16, 0, 72);
System.out.println(stringInt.reduce(Math::min).get());
System.out.println(stringInt.reduce(Math::max).get());
The second is not allowed since a stream cannot be reused. From the Stream javadoc:
A stream should be operated on (invoking an intermediate or terminal stream operation) only once. This rules out, for example, "forked" streams, where the same source feeds two or more pipelines, or multiple traversals of the same stream. A stream implementation may throw IllegalStateException if it detects that the stream is being reused.
You could use collect with Collectors.summarizingInt:
IntSummaryStatistics collect = stringInt.collect(Collectors.summarizingInt(value -> value));
System.out.println(collect.getMax());
System.out.println(collect.getMin());
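If the pipeline can start from an IntStream instead of a Stream<Integer>, the built-in summaryStatistics() terminal operation computes the same statistics without an explicit collector; a minimal sketch:
import java.util.IntSummaryStatistics;
import java.util.stream.IntStream;

IntSummaryStatistics stats = IntStream.of(8, 50, 16, 0, 72).summaryStatistics();
System.out.println(stats.getMin()); // 0
System.out.println(stats.getMax()); // 72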

Why do I have to chain Stream operations in Java? [duplicate]

This question already has answers here:
When is a Java 8 Stream considered to be consumed?
(2 answers)
Closed 4 years ago.
I think all of the resources I have studied one way or another emphasize that a stream can be consumed only once, and the consumption is done by so-called terminal operations (which is very clear to me).
Just out of curiosity I tried this:
import java.util.stream.IntStream;

class App {
    public static void main(String[] args) {
        IntStream is = IntStream.of(1, 2, 3, 4);
        is.map(i -> i + 1);
        int sum = is.sum();
    }
}
which ends up throwing a runtime exception:
Exception in thread "main" java.lang.IllegalStateException: stream has already been operated upon or closed
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:229)
at java.util.stream.IntPipeline.reduce(IntPipeline.java:456)
at java.util.stream.IntPipeline.sum(IntPipeline.java:414)
at App.main(scratch.java:10)
I am probably missing something usual here, but I still want to ask: as far as I know, map is an intermediate (and lazy) operation and does nothing to the Stream by itself. Only when the terminal operation sum (which is an eager operation) is called does the Stream get consumed and operated on.
But why do I have to chain them?
What is the difference between
is.map(i -> i + 1);
is.sum();
and
is.map(i -> i + 1).sum();
?
When you do this:
int sum = IntStream.of(1, 2, 3, 4).map(i -> i + 1).sum();
Every chained method is being invoked on the return value of the previous method in the chain.
So map is invoked on what IntStream.of(1, 2, 3, 4) returns and sum on what map(i -> i + 1) returns.
You don't have to chain stream methods, but it's more readable and less error-prone than using this equivalent code:
IntStream is = IntStream.of(1, 2, 3, 4);
is = is.map(i -> i + 1);
int sum = is.sum();
Which is not the same as the code you've shown in your question:
IntStream is = IntStream.of(1, 2, 3, 4);
is.map(i -> i + 1);
int sum = is.sum();
As you see, you're disregarding the reference returned by map. This is the cause of the error.
EDIT (as per the comments, thanks to @IanKemp for pointing this out): Actually, this is the external cause of the error. If you stop to think about it, map must be doing something internally to the stream itself; otherwise, how would the terminal operation trigger the transformation passed to map on each element? I agree that intermediate operations are lazy, i.e. when invoked, they do nothing to the elements of the stream. But internally, they must configure some state in the stream pipeline itself, so that it can be applied later.
Although I'm not aware of the full details, what happens is that, conceptually, map is doing at least 2 things:
It's creating and returning a new stream that holds the function passed as an argument somewhere, so that it can be applied to elements later, when the terminal operation is invoked.
It is also setting a flag on the old stream instance, i.e. the one it has been called on, indicating that this stream instance no longer represents a valid state for the pipeline. This is because the new, updated state, which holds the function passed to map, is now encapsulated by the instance it has returned. (I believe this decision might have been taken by the JDK team to make errors appear as early as possible, i.e. by throwing an early exception instead of letting the pipeline go on with an invalid/old state that doesn't hold the function to be applied, which would let the terminal operation return unexpected results.)
Later on, when a terminal operation is invoked on this instance flagged as invalid, you're getting that IllegalStateException. The two items above constitute the deep, internal cause of the error.
Another way to see all this: a Stream instance may be operated on, by means of either an intermediate or a terminal operation, only once. Here you are violating this requirement, because you are calling both map and sum on the same instance.
In fact, javadocs for Stream state it clearly:
A stream should be operated on (invoking an intermediate or terminal stream operation) only once. This rules out, for example, "forked" streams, where the same source feeds two or more pipelines, or multiple traversals of the same stream. A stream implementation may throw IllegalStateException if it detects that the stream is being reused. However, since some stream operations may return their receiver rather than a new stream object, it may not be possible to detect reuse in all cases.
Imagine the IntStream is a wrapper around your data stream with an immutable list of operations. These operations are not executed until you need the final result (sum in your case).
Since the list is immutable, you need a new instance of IntStream with a list that contains the previous items plus the new one, which is what .map returns.
This means that if you don't chain, you will operate on the old instance, which does not have that operation.
The stream library also keeps some internal tracking of what's going on, that's why it's able to throw the exception in the sum step.
If you don't want to chain, you can use a variable for each step:
IntStream is = IntStream.of(1, 2, 3, 4);
IntStream is2 = is.map(i -> i + 1);
int sum = is2.sum();
Intermediate operations return a new stream. They are always lazy; executing an intermediate operation such as filter() does not actually perform any filtering, but instead creates a new stream that, when traversed, contains the elements of the initial stream that match the given predicate.
Taken from https://docs.oracle.com/javase/8/docs/api/java/util/stream/package-summary.html under "Stream Operations and Pipelines"
At the lowest level, all streams are driven by a spliterator.
Taken from the same link under "Low-level stream construction"
Traversal and splitting exhaust elements; each Spliterator is useful for only a single bulk computation.
Taken from https://docs.oracle.com/javase/8/docs/api/java/util/Spliterator.html

How to generate a stream using an index rather than the previous element?

How do I generate a stream of "new" data? Specifically, I want to be able to create data using functions that are not reversible.
If I want to create a stream from an Array
I do
Stream.of(arr)
From a collection
col.stream()
A constant stream can be made with a lambda expression
Stream.generate(() -> "constant")
A stream where each element is based on the previous one (any reversible function) may be achieved with
Stream.iterate(0, x -> x + 2)
But what if I want to create a more general generator (say, one that outputs whether a number is divisible by three: 0,0,1,0,0,1,0,0,1...) without creating a new class?
The main issue is that I need some way of passing the index into the lambda, because I want to have a pattern and not depend on the last output of the function.
Note:
someStream.limit(length) may be used to cap the length of the stream, so an infinite stream generator is actually what I am looking for.
If you want to have an infinite stream for a function taking an index, you may consider creating a “practically infinite” stream using
IntStream.rangeClosed(0, Integer.MAX_VALUE).map(index -> your lambda)
or
IntStream.rangeClosed(0, Integer.MAX_VALUE).mapToObj(index -> your lambda)
for a Stream rather than an IntStream.
This isn’t truly infinite, but there are no int values to represent indices after Integer.MAX_VALUE, so you have a semantic problem to solve whenever you hit that index.
Also, when using LongStream.rangeClosed(0, Long.MAX_VALUE).map(index -> yourLambda) instead and each element evaluation takes only a nanosecond, it will take almost three hundred years to process all elements.
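(For the arithmetic: Long.MAX_VALUE is 2^63 - 1, about 9.22 * 10^18; at one element per nanosecond that is about 9.22 * 10^9 seconds, and with roughly 3.15 * 10^7 seconds in a year, that comes to about 292 years.)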
But, of course, there is a way to create a truly infinite stream using
Stream.iterate(BigInteger.ZERO, BigInteger.ONE::add).map(index -> yourLambda)
which might run forever, or, more likely, bail out with an OutOfMemoryError once the index can't be represented in heap memory anymore, if your processing ever gets that far.
Note that streams constructed using range[Closed] might be more efficient than streams constructed using Stream.iterate.
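Applied to the divisible-by-three example from the question (assuming, from the 0,0,1,... pattern, that counting starts at 1), a minimal sketch:
import java.util.stream.IntStream;

IntStream.rangeClosed(1, Integer.MAX_VALUE)
        .map(n -> n % 3 == 0 ? 1 : 0) // 1 when the index is divisible by three
        .limit(9)
        .forEach(System.out::print);  // prints 001001001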
You can do something like this:
AtomicInteger counter = new AtomicInteger(0);
Stream<Integer> s = Stream.generate(counter::getAndIncrement);
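Combined with .limit(n), this yields the indices 0 through n - 1. Note, though, that Stream.generate produces an unordered stream (see the generate-plus-limit question above), so while the AtomicInteger guarantees distinct values, in-order delivery is only guaranteed for sequential use.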
