How does a stream pipeline (such as IntPipeline) work in Java?

I'm learning about Java 8 streams and some questions came up.
Suppose this code:
new Random().ints().forEach(System.out::println);
Internally, at some point, it calls IntPipeline, which I think is responsible for generating those ints indefinitely. The streams implementation is hard to understand just by reading the Java source.
Can you give a brief explanation, or point to some good, easy-to-understand material, about how streams are generated and how the operations in the pipeline are connected? For example, in the code above the integers are generated randomly; how is this connection made?

The Stream implementation is separated into a Spliterator (which is input-specific code) and a pipeline (which is input-independent code). The Spliterator is similar to an Iterator. The main differences are the following:
It can split itself into two parts (the trySplit method). For an ordered spliterator the parts are a prefix and a suffix (for example, for an array it could be the first half and the last half). For unordered sources (like random numbers) both parts can simply generate some of the elements. The resulting parts are able to split further (unless they become too small). This feature is crucial for parallel stream processing.
It can report its size either exact or estimated. The exact size may be used to preallocate memory for some stream operations like toArray() or just to return it to caller (like count() in Java-9). The estimated size is used for parallel stream processing to decide when to stop splitting.
It can report some characteristics like ORDERED, SORTED, DISTINCT, etc.
It implements internal iteration: instead of the two methods hasNext and next you have a single method, tryAdvance, which executes the provided Consumer once unless there are no more elements left.
There are also primitive specializations of Spliterator interface (Spliterator.OfInt, etc.) which can help you process primitive values like int, long or double efficiently.
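The points above can be tried out on an ordinary list. The following is only a minimal sketch: the class name SpliteratorTour and the helper drain are made up for illustration, and the source is a plain Arrays.asList list rather than anything stream-specific.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Spliterator;

final class SpliteratorTour {
    /** Drains a spliterator with tryAdvance, the counterpart of hasNext/next. */
    static <T> List<T> drain(Spliterator<T> spliterator) {
        List<T> out = new ArrayList<>();
        // tryAdvance runs the consumer for one element and returns false once exhausted.
        while (spliterator.tryAdvance(out::add)) {
            // nothing to do here: the consumer already stored the element
        }
        return out;
    }

    public static void main(String[] args) {
        Spliterator<Integer> rest = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8).spliterator();

        // Characteristics and size are reported before any element is consumed.
        System.out.println(rest.hasCharacteristics(Spliterator.ORDERED | Spliterator.SIZED));
        System.out.println(rest.estimateSize());

        // For an ordered source, trySplit hands back a prefix and keeps the suffix.
        Spliterator<Integer> prefix = rest.trySplit();
        System.out.println(drain(prefix));
        System.out.println(drain(rest));
    }
}
```

A parallel stream does essentially this repeatedly: it splits until the parts are small enough, then traverses each part on its own thread.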
Thus to create your own Stream datasource you have to implement Spliterator, then call StreamSupport.stream(mySpliterator, isParallel) to create the Stream and StreamSupport.int/long/doubleStream for primitive specializations. So actually Random.ints calls StreamSupport.intStream providing its own spliterator. You don't have to implement all the Stream operations by yourself. In general Stream interface is implemented only once per stream type in JDK for different sources. There's basic abstract class AbstractPipeline and four implementations (ReferencePipeline for Stream, IntPipeline for IntStream, LongPipeline for LongStream and DoublePipeline for DoubleStream). But you have much more sources (Collection.stream(), Arrays.stream(), IntStream.range, String.chars(), BufferedReader.lines(), Files.lines(), Random.ints(), and so on, even more to appear in Java-9). All of these sources are implemented using custom spliterators. Implementing the Spliterator is much simpler than implementing the whole stream pipeline (especially taking into account the parallel processing), so such separation makes sense.
If you want to create your own stream source, you may start by extending AbstractSpliterator. In this case you only have to implement tryAdvance and call the superclass constructor, providing the estimated size and some characteristics. The AbstractSpliterator provides default splitting behavior by reading a part of your source into an array (calling your implemented tryAdvance method) and creating an array-based spliterator for this prefix. Of course such a strategy is not very performant and often affords only limited parallelism, but as a starting point it's OK. Later you can implement trySplit yourself to provide a better splitting strategy.
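As a rough illustration of that starting point, here is a sketch of a custom int source built on Spliterators.AbstractIntSpliterator and handed to StreamSupport.intStream. The CountdownSource class and its countdown method are hypothetical names invented for this example; this is not how Random.ints is implemented.

```java
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.IntConsumer;
import java.util.stream.IntStream;
import java.util.stream.StreamSupport;

final class CountdownSource {
    /** Hypothetical source: emits start, start - 1, ..., 1. */
    static IntStream countdown(int start) {
        // Only tryAdvance is implemented; splitting is inherited from the abstract class.
        Spliterator.OfInt spliterator = new Spliterators.AbstractIntSpliterator(
                Math.max(start, 0),                          // estimated size
                Spliterator.ORDERED | Spliterator.NONNULL) { // characteristics
            private int next = start;

            @Override
            public boolean tryAdvance(IntConsumer action) {
                if (next <= 0) {
                    return false;        // source exhausted
                }
                action.accept(next--);   // hand exactly one element to the pipeline
                return true;
            }
        };
        // IntPipeline supplies every IntStream operation on top of this spliterator.
        return StreamSupport.intStream(spliterator, false);
    }

    public static void main(String[] args) {
        countdown(5).map(x -> x * x).forEach(System.out::println);
    }
}
```

All the usual operations (map, filter, sum and so on) work on the result without any extra code, which is the point of separating the source from the pipeline.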

Related

Swap operation according to stream's encounter order

Since the documentation defines the so-called encounter order, I think it's reasonable to ask whether we can reverse that encounter order somehow. Looking at the API that streams provide, I didn't find anything related to ordering except sorted().
If I have a stream produced from, say, a List, can I swap two elements of that stream and thereby produce another stream with a modified encounter order?
Does it even make sense to talk about "swapping" elements in a stream, or does the specification say nothing about it?
The Java Stream API has no dedicated operations to reverse the encounter order, swap elements in pairs, or anything like that. Please note that the Stream source can be one-off (like a network socket or a stream of generated random numbers), so in the general case you cannot traverse it backwards without storing everything in memory. That's actually how the sorting operation works: it dumps the whole stream content into an intermediate array, sorts it, then performs the downstream computation. So if a reverse operation were implemented, it would work in the same way.
For particular sources like a random-access list you may create a reversed stream using, for example, this construct:
List<T> list = ...;
Stream<T> stream = IntStream.rangeClosed(1, list.size())
.mapToObj(i -> list.get(list.size()-i));
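For a source that is not random-access, the buffering approach described above for sorting can be sketched as follows. The ReverseStreams class and its reversed method are hypothetical helpers, not part of the Stream API, and the whole input is held in memory before the first element comes out.

```java
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

final class ReverseStreams {
    /** Buffers the whole source, then replays it back to front. */
    @SuppressWarnings("unchecked")
    static <T> Stream<T> reversed(Stream<T> source) {
        Object[] buffer = source.toArray();   // consumes the source completely
        return IntStream.range(0, buffer.length)
                .mapToObj(i -> (T) buffer[buffer.length - 1 - i]);
    }

    public static void main(String[] args) {
        System.out.println(reversed(Stream.of("a", "b", "c")).collect(Collectors.toList()));
    }
}
```

This cannot work for an infinite stream, since toArray() would never return.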

What's the difference between Stream.map(...) and Collectors.mapping(...)?

I've noticed many functionalities exposed in Stream are apparently duplicated in Collectors, such as Stream.map(Foo::bar) versus Collectors.mapping(Foo::bar, ...), or Stream.count() versus Collectors.counting(). What's the difference between these approaches? Is there a performance difference? Are they implemented differently in some way that affects how well they can be parallelized?
The collectors that appear to duplicate functionality in Stream exist so they can be used as downstream collectors for collector combinators like groupingBy().
As a concrete example, suppose you want to compute "count of transactions by seller". You could do:
Map<Seller, Long> salesBySeller =
txns.stream()
.collect(groupingBy(Txn::getSeller, counting()));
Without collectors like counting() or mapping(), these kinds of queries would be much more difficult.
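A self-contained version of that query, plus a mapping() variant, might look like the sketch below. The Txn class here is a made-up stand-in with String fields, since the original Seller and Txn types are not shown.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

final class DownstreamCollectors {
    /** Hypothetical stand-in for the Txn type mentioned in the text. */
    static final class Txn {
        private final String seller;
        private final String item;

        Txn(String seller, String item) {
            this.seller = seller;
            this.item = item;
        }

        String getSeller() { return seller; }
        String getItem() { return item; }
    }

    /** counting() plays the role of Stream.count(), but once per group. */
    static Map<String, Long> countBySeller(List<Txn> txns) {
        return txns.stream()
                .collect(Collectors.groupingBy(Txn::getSeller, Collectors.counting()));
    }

    /** mapping() plays the role of Stream.map(), applied inside each group. */
    static Map<String, Set<String>> itemsBySeller(List<Txn> txns) {
        return txns.stream()
                .collect(Collectors.groupingBy(Txn::getSeller,
                        Collectors.mapping(Txn::getItem, Collectors.toSet())));
    }

    public static void main(String[] args) {
        List<Txn> txns = Arrays.asList(
                new Txn("ann", "pen"), new Txn("ann", "ink"), new Txn("bob", "pen"));
        System.out.println(countBySeller(txns));
        System.out.println(itemsBySeller(txns));
    }
}
```

Calling Stream.map before collect would not help in itemsBySeller, because the seller needed for grouping would already have been mapped away.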
There's a big difference. The stream operations can be divided into two groups:
Intermediate operations - Stream.map, Stream.flatMap, Stream.filter. These produce an instance of Stream and are always lazy, i.e. no actual traversal of the Stream elements happens. These operations are used to create a transformation chain.
Terminal operations - Stream.collect, Stream.findFirst, Stream.reduce etc. These do the actual work, i.e. they perform the transformation chain operations on the stream and produce a terminal value, which could be a List, a count of elements, the first element etc.
Take a look at the Stream package summary javadoc for more information.
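The laziness of intermediate operations can be observed directly. This is a small sketch (the LazyPipeline class and its trace method are invented for the demonstration) that records when the mapping function actually runs:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class LazyPipeline {
    /** Returns a log showing when each step of the pipeline really executes. */
    static List<String> trace() {
        List<String> log = new ArrayList<>();

        // Intermediate operation: only describes the transformation.
        Stream<Integer> pending = Stream.of(1, 2, 3)
                .map(x -> {
                    log.add("map " + x);
                    return x * 10;
                });
        log.add("pipeline built");

        // Terminal operation: this is where the elements are traversed.
        List<Integer> result = pending.collect(Collectors.toList());
        log.add("result " + result);
        return log;
    }

    public static void main(String[] args) {
        trace().forEach(System.out::println);
    }
}
```

The "pipeline built" entry is logged before any "map" entry, because nothing is traversed until collect is called.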

Most efficient collection for filtering a Java Stream?

I'm storing several Things in a Collection. The individual Things are unique, but their types aren't. The order in which they are stored also doesn't matter.
I want to use Java 8's Stream API to search it for a specific type with this code:
Collection<Thing> things = ...;
// ... populate things ...
Stream<Thing> filtered = things.stream().filter(thing -> thing.type.equals(searchType));
Is there a particular Collection that would make the filter() more efficient?
I'm inclined to think no, because the filter has to iterate through the entire collection.
On the other hand, if the collection is some sort of tree that is indexed by the Thing.type then the filter() might be able to take advantage of that fact. Is there any way to achieve this?
Stream operations like filter are not specialized enough to take advantage of such special cases. For example, IntStream.range(0, 1_000_000_000).filter(x -> x > 999_999_000) will actually iterate over all the input numbers; it cannot just "skip" the first 999_999_000. So your question reduces to finding the collection with the most efficient iteration.
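A scaled-down version of that range example makes the point measurable. In this sketch (the class name FilterVisitsEverything is made up) the predicate counts how often it is invoked:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.IntStream;

final class FilterVisitsEverything {
    /** Returns {number of predicate calls, number of elements kept}. */
    static int[] run() {
        AtomicInteger tested = new AtomicInteger();
        long kept = IntStream.range(0, 1_000)
                .filter(x -> {
                    tested.incrementAndGet();   // counts every element offered to the predicate
                    return x >= 990;
                })
                .count();
        return new int[] { tested.get(), (int) kept };
    }

    public static void main(String[] args) {
        int[] counts = run();
        System.out.println("tested=" + counts[0] + " kept=" + counts[1]);
    }
}
```

Only 10 elements pass, yet the predicate is called for all 1,000 inputs: the library has no way to know that the predicate is a range check.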
The iteration is usually performed in the Spliterator.forEachRemaining method (for a non-short-circuiting stream) and in the Spliterator.tryAdvance method (for a short-circuiting stream), so you can look at the corresponding spliterator implementation and check how efficient it is. In my opinion the most efficient source is an array (either bare or wrapped into a list with Arrays.asList): it has minimal overhead. ArrayList is also quite fast, but for a short-circuiting operation it will check the modCount (to detect concurrent modification) on every iteration, which adds a very slight overhead. Other types like HashSet or LinkedList are comparatively slower, though in most applications this difference is practically insignificant.
Note that parallel streams should be used with care. For example, the splitting of LinkedList is quite poor and you may experience worse performance than in sequential case.
The most important thing to understand, regarding this question, is that when you pass a lambda expression to a particular library like the Stream API, all the library receives is an implementation of a functional interface, e.g. an instance of Predicate. It has no knowledge about what that implementation will do and therefore has no way to exploit scenarios like filtering sorted data via comparison. The stream library simply doesn’t know that the Predicate is doing a comparison.
An implementation doing such an optimization would need cooperation between the JVM, which knows and understands the code, and the library, which knows the semantics. Such a thing does not happen in the current implementation and is far away, at least as far as I can see.
If the source is a tree or sorted list and you want to benefit from that for filtering, you have to do it using APIs operating on the source, before creating the stream. E.g. suppose, we have a TreeSet and want to filter it to get items within a particular range, like
// our made-up source
TreeSet<Integer> tree=IntStream.range(0, 100).boxed()
.collect(Collectors.toCollection(TreeSet::new));
// the naive implementation
tree.stream().filter(i -> i>=65 && i<91).forEach(i->System.out.print((char)i.intValue()));
We can do instead:
tree.tailSet(65).headSet(91).stream().forEach(i->System.out.print((char)i.intValue()));
which will utilize the sorted/tree nature. When we have a sorted list instead, say
List<Integer> list=new ArrayList<>(tree);
utilizing the sorted nature is more complex as the collection itself doesn’t know that it’s sorted and doesn’t offer operations utilizing that directly:
int ix=Collections.binarySearch(list, 65);
if(ix<0) ix=~ix;
if(ix>0) list=list.subList(ix, list.size());
ix=Collections.binarySearch(list, 91);
if(ix<0) ix=~ix;
if(ix<list.size()) list=list.subList(0, ix);
list.stream().forEach(i->System.out.print((char)i.intValue()));
Of course, the stream operations here are only exemplary and you don’t need a stream at all, when all you do then is forEach…
As far as I am aware, there's no such differentiation for normal (sequential) streaming.
However, with parallel streaming you might be better off using a collection that is easily divisible, such as an ArrayList, rather than a LinkedList or any type of Set.

How specialized are the Stream implementations returned by the standard collections?

Stream is an interface so whenever one gets hold of a Stream object there are lots of implementation specific details hidden.
For example, take the following code:
List<String> list = new ArrayList<>();
...
long size = list.stream()
.count();
Does it run in constant or linear time? Or this:
Set<String> set = new TreeSet<>();
...
set.stream()
.sorted()
.forEach(System.out::println);
Would that be O(n) or O(n log n)?
In general, how specialized are the streams implementations returned by the standard collections?
Does it run in constant or linear time?
The current implementation runs in linear time:
public final long count() {
return mapToLong(e -> 1L).sum();
}
But, this could be improved (there's an RFE for this somewhere) to run in constant time in some situations.
How? A stream is described by a stream source, zero or more intermediate operations, and a terminal operation (here, count() is the terminal operation). The stream implementation maintains a set of characteristics about the source, and knows how the characteristics are modified by the operations. For example, a stream backed by a Collection has the characteristic SIZED, whereas a stream backed by an Iterator is not sized. Similarly, the operation map() is size-preserving, but the operation filter() destroys any a priori knowledge of sized-ness. The stream implementation knows the composed characteristics of the pipeline before it starts the terminal operation, so it knows whether the source is sized and whether all stages are size-preserving, and in such cases, could simply ask the source for the size and bypass all the actual stream computation. (But the implementation in Java 8 does not happen to do this.)
Note that the streams need not be specialized to support this; the Collection classes create the stream with a Spliterator that knows its characteristics, so a specialized implementation for Collections is not needed, just updating the shared implementation to take advantage of this particular bit of information.
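The characteristics described above can be inspected from outside the pipeline. The sketch below uses only public Spliterator methods; the class name SizedCharacteristic and the helper knownSize are invented for the example, and getExactSizeIfKnown() returns -1 whenever SIZED is not reported.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Spliterator;
import java.util.Spliterators;

final class SizedCharacteristic {
    /** Exact size if the spliterator reports SIZED, otherwise -1. */
    static long knownSize(Spliterator<?> spliterator) {
        return spliterator.getExactSizeIfKnown();
    }

    public static void main(String[] args) {
        List<String> list = Arrays.asList("a", "b", "c");

        // A Collection-backed source knows its size.
        System.out.println(knownSize(list.spliterator()));

        // An Iterator-backed source does not.
        System.out.println(knownSize(
                Spliterators.spliteratorUnknownSize(list.iterator(), Spliterator.ORDERED)));

        // map() is size-preserving, filter() is not.
        System.out.println(knownSize(list.stream().map(String::length).spliterator()));
        System.out.println(knownSize(list.stream().filter(s -> true).spliterator()));
    }
}
```

This is the information a smarter count() could use: when the composed pipeline is still sized, the answer is available without traversing anything.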
The sorted() method does not change the size, so it would, in theory, be possible that a future implementation could do a chain of stream().sorted().count() in O(1) time.
Take a look at Speedment, an open source implementation of the Stream interface, at https://github.com/speedment/speedment. These streams can introspect their own pipeline and optimize away one or several steps in the stream.

Produce a Stream from a Stream and an element, Java 8

I'm working on getting my head around some of the Java 8 Stream features. I'm tolerably familiar with FP, having written some Lisp thirty years ago, and I think I might be trying to do something this new facility isn't really targeted at. Anyway, if the question is stupid, I'm happy to learn the error of my ways.
I'll give a specific problem, though it's really a general concept I'm trying to resolve.
Suppose I want to get a Stream from every third element of a Stream. In regular FP I would (approximately) create a recursive function that operates by concatenating the first element of the list with the (call-thyself) of the remainder of the list after dropping two elements. Easy enough. But to do this in a stream, I feel like I want one of two tools:
1) a means of having an operation extract multiple items from the stream for processing (then I'd just grab three, use the first, and dump the rest)
2) a means of making a Supplier that takes an item and a Stream and creates a Stream. Then it feels like I could create a downstream stream out of the first item and the shortened stream, though it's still unclear to me if this would do the necessary recursive magic to actually work.
BEGIN EDIT
So, there's some interesting and useful feedback; thanks everyone. In particular, the comments have helped me clarify what my head is trying to get around a bit better.
First, conceptually at least, having or needing knowledge of order in a sequence should not prevent fully parallelizable operations. An example sprang to mind: the convolution operations that graphics folks are inclined to do. Imagine blurring an image. Each pixel is modified by virtue of the pixels near to it, but those pixels are only read, not modified, in themselves.
It's my understanding (very shaky at this stage, for sure!) that the streams mechanism is the primary entry point to the wonderful world of VM managed parallelism, and that iterators are still what they always were (yes? no?) If that's correct, then using the iterator to solve the problem domain that I'm waffling around doesn't seem great.
So, at this point, at least, the suggestion to create a chunking spliterator seems the most promising, but boy, does the code that supports that example seem like hard work! I think I'd rather do it with a ForkJoin mechanism, despite it being "old hat" now :)
Anyway, still interested in any more insights folks wish to offer.
END EDIT
Any thoughts? Am I trying to use these Streams to do something they're not intended for, or am I missing something obvious?
Cheers,
Toby.
One of the things to keep in mind is that Stream was primarily designed to be a way of taking advantage of parallel processing. An implication of this is that they have a number of conditions associated with them that are aimed at giving the VM a lot of freedom to process the elements in any convenient order. An example of this is insisting that reduction functions are associative. Another is that local variables manipulated are final. These types of conditions mean the stream items can be evaluated and collected in any order.
A natural consequence of this is that the best use cases for Stream involve no dependencies between the values of the stream. Things such as mapping a stream of integers to their cumulative values are trivial in languages like LISP but a pretty unnatural fit for Java streams (see this question).
There are clever ways of getting around some of these restrictions by using sequential to force the Stream to not be parallel but my experience has been that these are more trouble than they are worth. If your problem involves an essentially sequential series of items in which state is required to process the values then I recommend using traditional collections and iteration. The code will end up being clearer and will perform no worse given the stream cannot be parallelised anyway.
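For the every-third-element problem, the traditional iteration recommended here could be as simple as the following sketch (EveryThirdLoop and everyThird are names chosen for this example):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class EveryThirdLoop {
    /** Keeps the elements at positions 0, 3, 6, ... of the source. */
    static <T> List<T> everyThird(Iterable<T> source) {
        List<T> result = new ArrayList<>();
        int index = 0;
        for (T item : source) {
            if (index++ % 3 == 0) {
                result.add(item);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(everyThird(Arrays.asList(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)));
    }
}
```

The position counter is exactly the kind of state that is awkward to keep in a stream pipeline but trivial in a loop.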
Having said all that, if you really want to do this then the most straightforward way is to have a collector that stores every third item then sends them out as a stream again:
class EveryThird {
private final List<Integer> list = new ArrayList<>();
private int count = 0;
public void accept(Integer i) {
if (count++ % 3 == 0)
list.add(i);
}
public EveryThird combine(EveryThird other) {
list.addAll(other.list);
count += other.count;
return this;
}
public Stream<Integer> stream() {
return list.stream();
}
}
This can then be used like:
IntStream.range(0, 10000)
.collect(EveryThird::new, EveryThird::accept, EveryThird::combine)
.stream()
But that's not really what collectors are designed for and this is pretty inefficient as it's unnecessarily collecting the stream. As stated above my recommendation is to use traditional iteration for this sort of situation.
My StreamEx library enhances the standard Stream API. In particular, it adds the headTail method, which allows recursive definition of custom operations. It takes a function which receives the stream head (the first element) and tail (a stream of the remaining elements) and should return the resulting stream, which will be used instead of the original one. For example, you can define an every3 operation as follows:
public static <T> StreamEx<T> every3(StreamEx<T> input) {
return input.headTail(
(first, tail1) -> tail1.<T>headTail(
(second, tail2) -> tail2.headTail(
(third, tail3) -> every3(tail3))).prepend(first));
}
Here prepend is also used, which just prepends the given element to the stream (this operation is the best friend of headTail).
In general using headTail you can define almost any intermediate operation you want, including existing ones and new ones. You may find some samples here.
Note that I implemented some mechanism which optimizes tails in such recursive operation definition, so properly defined operation will not eat the whole stack when processing the long stream.
Java's streams are nothing like FP's (lazy) sequences. If you are familiar with Clojure, the difference is exactly like the difference between lazy seq and reducer. Whereas the lazy seq packages all processing with each element individually, and thus allows getting individually processed elements, a reducer collapses the complete sequence in a single atomic operation.
Specifically for the example you have described, consider relying on a stream partitioning transformation, as described in detail here. You would then easily do
partition(originalStream, 3).map(xs -> xs.get(0));
resulting in a stream having every third element of the original.
This would maintain efficiency, laziness, and parallelizability.
You can use (or peek at the source of) BatchingSpliterator.
Then, given aStream, you can create a stream that consists of lists of size 3 (except maybe for the last one) and use the first element of each list:
Stream<T> aStream = ...;
Stream<List<T>> batchedStream =
new BatchingSpliterator.Builder().wrap(aStream).batchSize(3).stream();
batchedStream.map(l -> l.get(0) ). ...
You can also "go parallel":
batchedStream.parallel().map(l -> l.get(0) ). ....
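If pulling in a batching utility feels like too much, the same idea can be sketched with a small wrapping spliterator built only on the JDK. This is not the BatchingSpliterator API: the class EveryNth and the method everyNth are hypothetical, and the sketch is sequential only (it passes false for the parallel flag and relies on the default splitting of AbstractSpliterator).

```java
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

final class EveryNth {
    /** Lazily keeps elements 0, n, 2n, ... of the source stream (n >= 1). */
    static <T> Stream<T> everyNth(Stream<T> source, int n) {
        Spliterator<T> src = source.spliterator();
        Spliterator<T> wrapped = new Spliterators.AbstractSpliterator<T>(
                Long.MAX_VALUE,                                // size unknown
                src.characteristics() & Spliterator.ORDERED) { // keep only ORDERED
            @Override
            public boolean tryAdvance(Consumer<? super T> action) {
                // Pass one element downstream ...
                if (!src.tryAdvance(action)) {
                    return false;
                }
                // ... then silently discard up to n - 1 followers.
                for (int i = 1; i < n; i++) {
                    if (!src.tryAdvance(ignored -> { })) {
                        break;
                    }
                }
                return true;
            }
        };
        return StreamSupport.stream(wrapped, false).onClose(source::close);
    }

    public static void main(String[] args) {
        System.out.println(
                everyNth(Stream.of(0, 1, 2, 3, 4, 5, 6, 7, 8, 9), 3).collect(Collectors.toList()));
    }
}
```

Because elements are pulled on demand, this also works on an infinite source as long as the result is limited downstream, for example with limit().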
