Is sort applied before map in Java Streams? - java

I want to process a List using Java streams, but I'm not sure whether I can guarantee that the sort is processed before the map method in the following expression:
list.stream()
    .sorted((a, b) -> b.getStartTime().compareTo(a.getStartTime()))
    .mapToDouble(e -> {
        double points = (e.getDuration() / 60);
        ...
        return points * e.getType().getMultiplier();
    })
    .sum();
I need to perform some calculations that depend on that specific order.

Yes, you can guarantee that, because the operations in a stream pipeline are applied in the order they are declared (once a terminal operation is invoked).
From Stream docs:
To perform a computation, stream operations are composed into a stream pipeline. A stream pipeline consists of a source (which might be an array, a collection, a generator function, an I/O channel, etc), zero or more intermediate operations (which transform a stream into another stream, such as filter(Predicate)), and a terminal operation (which produces a result or side-effect, such as count() or forEach(Consumer)). Streams are lazy; computation on the source data is only performed when the terminal operation is initiated, and source elements are consumed only as needed.
The key word in the above paragraph is pipeline, whose definition in Wikipedia starts as follows:
In software engineering, a pipeline consists of a chain of processing elements (processes, threads, coroutines, functions, etc.), arranged so that the output of each element is the input of the next...

Not only will sorted be applied before map; it will also traverse the entire underlying source. sorted gathers all the elements, puts them into an array or an ArrayList (depending on whether the size is known), sorts them, and then hands one element at a time to the map operation.
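To see the ordering for yourself, here is a minimal, self-contained sketch (the integer list and descending comparator stand in for the asker's objects and comparator) that records the order in which elements reach the mapping step:

```java
import java.util.ArrayList;
import java.util.List;

public class SortThenMap {
    static final List<Integer> seenByMap = new ArrayList<>();

    public static void main(String[] args) {
        double sum = List.of(30, 10, 20).stream()
                .sorted((a, b) -> b.compareTo(a)) // descending, like the question
                .mapToDouble(n -> {
                    seenByMap.add(n);             // record the order map sees
                    return n / 60.0;
                })
                .sum();
        System.out.println(seenByMap); // [30, 20, 10]: map saw the elements already sorted
        System.out.println(sum);
    }
}
```

The map step only ever sees elements in the sorted order, because sorted buffers and reorders everything before handing elements downstream.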

Related

Stream API - How does sorted() operation works if a filter() is placed right after it?

Take the following code, which is sorting a List and then filtering on it :
public static void main(String[] args) {
    List<Integer> list = List.of(3, 2, 1);
    List<Integer> filtered = list.stream()
        .sorted() // Does sorted() sort the entire array first? Then pass the entire sorted output to filter?
        .filter(x -> x < 3)
        .collect(Collectors.toList());
    System.out.println(filtered);
}
Does the entire sort() happen first, then gets passed to filter() ?
Then isn't that a violation of what streams are supposed to do?
I mean, they are supposed to process one element at a time.
Does the entire sort() happen first, then get passed to filter()?
Then isn't that a violation of what streams are supposed to do?
No, it isn't. Take a look at the documentation of the Stream API:
Intermediate operations are further divided into stateless and stateful operations. Stateless operations, such as filter and map, retain no state from previously seen element when processing a new element -- each element can be processed independently of operations on other elements. Stateful operations, such as distinct and sorted, may incorporate state from previously seen elements when processing new elements.
Stateful operations may need to process the entire input before producing a result. For example, one cannot produce any results from sorting a stream until one has seen all elements of the stream. As a result, under parallel computation, some pipelines containing stateful intermediate operations may require multiple passes on the data or may need to buffer significant data. Pipelines containing exclusively stateless intermediate operations can be processed in a single pass, whether sequential or parallel, with minimal data buffering.
That means sorted is aware of all previously encountered elements, i.e. it's stateful. But map and filter don't need this information: they are stateless and lazy, and they always process elements from the stream source one at a time.
And it's technically impossible to sort the contents of a pipeline by looking at a single element in isolation. sorted operates on all elements "at once" and hands out a sorted stream to the next operation. You might think of sorted as if it becomes a new source of the stream.
Let's take a look at the following stream and analyze how it will be processed:
Stream.of("foo", "bar", "Alice", "Bob", "Carol")
    .filter(str -> !str.contains("r")) // lazy processing
    .peek(System.out::println)
    .map(String::toUpperCase)          // lazy processing
    .peek(System.out::println)
    .sorted()                          // <--- all data is being dumped into memory
    .peek(System.out::println)
    .filter(str -> str.length() > 3)   // lazy processing
    .peek(System.out::println)
    .findFirst();                      // <--- the terminal operation
The filter and map operations that precede sorted are applied lazily to each element from the source, and only when needed. That is, filter is applied to "foo", which passes and gets transformed by map. Then filter is applied to "bar", which does not reach map. Then it is "Alice"'s turn: it passes the filter, and map is executed on that string. And so on.
Keep in mind that sorted() requires all the data to do its job, so the first filter will get executed for all elements from the source and the map will be applied on every element that passed the filter.
Then sorted() operation will dump all the contents of the stream into memory and will sort the elements that have passed the first filter.
And after the sorting, all elements are again processed lazily, one at a time. Hence, the second filter will be applied only once (although 3 elements passed the first filter and were sorted). "ALICE" will pass the second filter and reach the terminal operation findFirst(), which will return this string.
Take a look at the debugging output from the peek() calls to verify that execution happens as described above.
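To make that trace easy to inspect in one place, here is a runnable variant of the pipeline above that records what each stage sees (the stage labels are my own):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

public class PeekTrace {
    static final List<String> trace = new ArrayList<>();

    static String run() {
        trace.clear();
        return Stream.of("foo", "bar", "Alice", "Bob", "Carol")
                .filter(str -> !str.contains("r"))
                .peek(s -> trace.add("filter1: " + s))
                .map(String::toUpperCase)
                .peek(s -> trace.add("map:     " + s))
                .sorted()                              // buffers all upstream elements
                .peek(s -> trace.add("sorted:  " + s))
                .filter(str -> str.length() > 3)
                .peek(s -> trace.add("filter2: " + s))
                .findFirst()
                .orElseThrow();
    }

    public static void main(String[] args) {
        String result = run();
        trace.forEach(System.out::println);
        System.out.println("result = " + result); // result = ALICE
    }
}
```

The trace shows "foo", "Alice" and "Bob" each flowing through filter and map before sorted's buffer fills; after sorting, only "ALICE" is pulled, because findFirst short-circuits.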

Grouping Java8 stream without collecting it

Is there any way in Java 8 to group the elements in a java.util.stream.Stream without collecting them? I want the result to be a Stream again. Because I have to work with a lot of data or even infinite streams, I cannot collect the data first and stream the result again.
All elements that need to be grouped are consecutive in the first stream. Therefore I would like to keep the stream evaluation lazy.
There's no way to do it using the standard Stream API. In general you cannot do it, as it's always possible that a new item will appear later that belongs to one of the already-created groups, so you cannot pass a group to downstream analysis until you have processed all the input.
However, if you know in advance that the items to be grouped are always adjacent in the input stream, you can solve your problem using third-party libraries that enhance the Stream API. One such library is StreamEx, which is free and written by me. It contains a number of "partial reduction" operators which collapse adjacent items into one based on some predicate. Usually you supply a BiPredicate which tests two adjacent items and returns true if they should be grouped together. Some of the partial reduction operations are listed below:
collapse(BiPredicate): replace each group with the first element of the group. For example, collapse(Objects::equals) is useful to remove adjacent duplicates from the stream.
groupRuns(BiPredicate): replace each group with the List of group elements (so StreamEx<T> is converted to StreamEx<List<T>>). For example, stringStream.groupRuns((a, b) -> a.charAt(0) == b.charAt(0)) will create a stream of Lists of strings where each list contains adjacent strings starting with the same letter.
Other partial reduction operations include intervalMap, runLengths() and so on.
All partial reduction operations are lazy, parallel-friendly and quite efficient.
Note that you can easily construct a StreamEx object from a regular Java 8 stream using StreamEx.of(stream). There are also methods to construct it from an array, Collection, Reader, etc. The StreamEx class implements the Stream interface and is 100% compatible with the standard Stream API.
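If you'd rather not pull in a library, adjacent grouping can also be sketched in plain Java on top of a stream's iterator. This is a simplified, sequential-only take on the groupRuns idea (class and method names are my own), and it holds only one group in memory at a time:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.BiPredicate;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

public class AdjacentGroups {
    // Lazily collapse runs of adjacent elements for which sameGroup
    // returns true into Lists; only the current group is buffered.
    static <T> Stream<List<T>> groupRuns(Stream<T> source,
                                         BiPredicate<? super T, ? super T> sameGroup) {
        Iterator<T> it = source.iterator();
        Iterator<List<T>> groups = new Iterator<>() {
            boolean hasPending = it.hasNext();
            T pending = hasPending ? it.next() : null;

            @Override public boolean hasNext() { return hasPending; }

            @Override public List<T> next() {
                if (!hasPending) throw new NoSuchElementException();
                List<T> group = new ArrayList<>();
                group.add(pending);
                while (it.hasNext()) {
                    T candidate = it.next();
                    boolean sameRun = sameGroup.test(pending, candidate);
                    pending = candidate;
                    if (sameRun) {
                        group.add(candidate);
                    } else {
                        return group; // candidate starts the next group
                    }
                }
                hasPending = false;
                return group;
            }
        };
        return StreamSupport.stream(
                Spliterators.spliteratorUnknownSize(groups, Spliterator.ORDERED), false);
    }

    public static void main(String[] args) {
        groupRuns(Stream.of("apple", "avocado", "banana", "blueberry", "cherry"),
                  (a, b) -> a.charAt(0) == b.charAt(0))
                .forEach(System.out::println);
        // [apple, avocado]
        // [banana, blueberry]
        // [cherry]
    }
}
```

Unlike StreamEx's implementation, this sketch makes no attempt to support parallel splitting; it simply demonstrates that lazy adjacent grouping is feasible without collecting the whole stream.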

Is it possible to operate on each List from a grouping by collector without an intermediate map being created?

I have the following code that does a group by on a List, and then operates on each grouped List in turn converting it to a single item:
Map<Integer, List<Record>> recordsGroupedById = myList.stream()
    .collect(Collectors.groupingBy(r -> r.get("complex_id")));
List<Complex> whatIwant = recordsGroupedById.values().stream()
    .map(this::toComplex)
    .collect(Collectors.toList());
The toComplex function looks like:
Complex toComplex(List<Record> records);
I have the feeling I can do this without creating the intermediate map, perhaps using reduce. Any ideas?
The input stream is ordered with the elements I want grouped sequentially in the stream. Within a normal loop construct I'd be able to determine when the next group starts and create a "Complex" at that time.
Create a collector that applies your post-processing function to each group by combining toList() with collectingAndThen, and use it as the downstream collector of groupingBy:
Map<Integer, Complex> map = myList.stream()
    .collect(groupingBy(r -> r.get("complex_id"),
                        collectingAndThen(toList(), Xxx::toComplex)));
If you just want a Collection<Complex> here, you can then ask the map for its values().
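As an illustration with concrete types (a toy example of my own, joining strings instead of building Complex objects), each group's List is handed to the finisher as soon as grouping completes, with no second pass over a map of lists:

```java
import java.util.List;
import java.util.Map;
import static java.util.stream.Collectors.collectingAndThen;
import static java.util.stream.Collectors.groupingBy;
import static java.util.stream.Collectors.toList;

public class GroupAndFinish {
    // Group words by first letter, then post-process each group's
    // List<String> with a finisher (here: join with "+").
    static Map<Character, String> byFirstLetter(List<String> words) {
        return words.stream()
                .collect(groupingBy(w -> w.charAt(0),
                         collectingAndThen(toList(), l -> String.join("+", l))));
    }

    public static void main(String[] args) {
        Map<Character, String> m =
                byFirstLetter(List.of("apple", "avocado", "banana", "cherry"));
        System.out.println(m.get('a')); // apple+avocado
    }
}
```

The map values are already the finished type, so no intermediate Map<Character, List<String>> is ever exposed to your code.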
Well you can avoid Map (honestly!) and do everything in single pipeline using my StreamEx library:
List<Complex> result = StreamEx.of(myList)
    .sortedBy(r -> r.get("complex_id"))
    .groupRuns((r1, r2) -> r1.get("complex_id").equals(r2.get("complex_id")))
    .map(this::toComplex)
    .toList();
Here we first sort the input by complex_id, then use the groupRuns custom intermediate operation, which groups adjacent stream elements into a List when the given BiPredicate applied to two adjacent elements returns true. Then you have a stream of lists, which is mapped to a stream of Complex objects and finally collected into a list.
There are actually no intermediate maps, and groupRuns is lazy (in sequential mode it keeps no more than one intermediate List at a time); it also parallelizes well. On the other hand, my tests show that for unsorted input this solution is slower than the groupingBy-based one, as it involves sorting the whole input. And of course sortedBy (which is just a shortcut for sorted(Comparator.comparing(...))) takes intermediate memory to store the input. If your input is already sorted (or at least partially sorted, so TimSort can perform fast), then this solution is usually faster than groupingBy.
No, you can't. You must collect all the data to ensure the contents of all groups are known before moving forward. You can, however, process each element as it is assigned to its group.
Think about it this way: imagine the very first item in the list and the very last item in the list share the same complex_id. You then have to wait for the end of the list anyway to fully gather that group (and all the others), so you must gather all the groups before processing.
Also - you should obviously be able to do:
List<Complex> whatIwant = myList.stream()
    .collect(Collectors.groupingBy(r -> r.get("complex_id")))
    .values()
    .stream()
    .map(this::toComplex)
    .collect(Collectors.toList());

Java 8 Stream vs Collection Storage

I have been reading up on Java 8 Streams and the way data is streamed from a data source, rather than have the entire collection to extract data from.
This quote in particular I read on an article regarding streams in Java 8.
No storage. Streams don't have storage for values; they carry values from a source (which could be a data structure, a generating function, an I/O channel, etc) through a pipeline of computational steps.
I understand the concept of streaming data in from a source piece by piece. What I don't understand is if you are streaming from a collection how is there no storage? The collection already exists on the Heap, you are just streaming the data from that collection, the collection already exists in "storage".
What's the difference memory-footprint wise if I were to just loop through the collection with a standard for loop?
The statement about streams and storage means that a stream doesn't have any storage of its own. If the stream's source is a collection, then obviously that collection has storage to hold the elements.
Let's take one of examples from that article:
int sum = shapes.stream()
    .filter(s -> s.getColor() == BLUE)
    .mapToInt(s -> s.getWeight())
    .sum();
Assume that shapes is a Collection that has millions of elements. One might imagine that the filter operation would iterate over the elements from the source and create a temporary collection of results, which might also have millions of elements. The mapToInt operation might then iterate over that temporary collection and generate its results to be summed.
That's not how it works. There is no temporary, intermediate collection. The stream operations are pipelined, so elements emerging from filter are passed through mapToInt and thence to sum without being stored into and read from a collection.
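You can observe this per-element flow directly. In the sketch below (my own toy data in place of the shapes), each element travels the whole pipeline before the next one is pulled from the source, so nothing is buffered between stages:

```java
import java.util.ArrayList;
import java.util.List;

public class Pipelining {
    static final List<String> trace = new ArrayList<>();

    static int run() {
        trace.clear();
        return List.of(1, 2, 3, 4).stream()
                .peek(n -> trace.add("filter sees " + n))
                .filter(n -> n % 2 == 0)
                .peek(n -> trace.add("map sees " + n))
                .mapToInt(n -> n * 10)
                .sum();
    }

    public static void main(String[] args) {
        int sum = run();
        trace.forEach(System.out::println);
        System.out.println("sum = " + sum); // sum = 60
        // "map sees 2" is recorded before "filter sees 3": element 2 went
        // all the way through before element 3 was even examined.
    }
}
```

If an intermediate collection were built, every "filter sees" line would print before any "map sees" line; the interleaved trace shows that is not what happens.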
If the stream source weren't a collection -- say, elements were being read from a network connection -- there needn't be any storage at all. A pipeline like the following:
int sum = streamShapesFromNetwork()
    .filter(s -> s.getColor() == BLUE)
    .mapToInt(s -> s.getWeight())
    .sum();
might process millions of elements, but it wouldn't need to store millions of elements anywhere.
Think of the stream as a nozzle connected to the water tank that is your data structure. The nozzle doesn't have its own storage. Sure, the water (data) the stream provides is coming from a source that has storage, but the stream itself has no storage. Connecting another nozzle (stream) to your tank (data structure) won't require storage for a whole new copy of the data.
A Collection is a data structure. Based on the problem, you decide which collection to use, such as ArrayList or LinkedList (considering time and space complexity). A Stream, on the other hand, is just a processing tool that makes your life easier.
Another difference is that a Collection is an in-memory data structure to which you can add and from which you can remove elements.
With a Stream, by contrast, you can perform two kinds of operation:
a. Intermediate operations: filter, map, sort, limit on the result set
b. Terminal operations: forEach, or collect the result set into a collection.
But notice that with a stream you can't add or remove elements.
A Stream is a kind of iterator: you can traverse a collection through a stream. Note that you can traverse a stream only once. An example should make this clearer:
Example1:
List<String> employeeNameList = Arrays.asList("John", "Peter", "Sachin");
Stream<String> s = employeeNameList.stream();
// iterate through the list
s.forEach(System.out::println); // this works perfectly fine
s.forEach(System.out::println); // IllegalStateException: stream has already been operated upon
So what you can infer is: you can iterate over a collection as many times as you want, but once a stream has been traversed it cannot be reused; you need to create a new one.
I hope it is clear.
A stream is just a view of the data, it has no storage of its own and you can't modify the underlying collection (assuming it's a stream that was built on top a collection) through the stream. It's like a "read only" access.
If you have any RDBMS experience - it's the exact same idea of "view".
The previous answers are mostly correct, but here is a more intuitive way to think about it (for readers landing here from Google):
Think of streams as UNIX pipelines of text:
cat input.file | sed ... | grep ... > output.file
In general, those UNIX text utilities consume a small amount of RAM compared to the input data they process.
That's not always the case, though. Think of sort: that algorithm needs to keep intermediate data in memory.
The same is true for streams. Sometimes temporary data is needed; most of the time it is not.
As an extra simile, to some extent "cloud serverless" APIs follow this same UNIX-pipeline or Java-stream design: they do not exist in memory until they have input data to process. The cloud OS launches them and injects the input data; the output is sent gradually somewhere else, so the serverless API does not consume many resources (most of the time). There are exceptions in this case too.

Java 8 stream Map<K,V> to List<T>

Given that I have some function that takes two parameters and returns one value , is it possible to convert a Map to a List in a Stream as a non-terminal operation?
The nearest I can find is to use forEach on the map to create instances and add them to a pre-defined List, then start a new Stream from that List. Or did I just miss something?
Eg: The classic "find the 3 most frequently occurring words in some long list of words"
wordList.stream()
    .collect(groupingBy(Function.identity(), Collectors.counting()))
(now I want to stream the entrySet of that map)
    .entrySet().stream()
    .sorted((a, b) -> b.getValue().compareTo(a.getValue()))
    .limit(3)
    .forEach(print...
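For reference, here is one runnable version of that word-count sketch (the class and method names are mine, and the comparator is reversed so that the most frequent words come first):

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class TopWords {
    // Count occurrences, then stream the entry set sorted by
    // descending count and keep the top 3 words.
    static List<String> top3(List<String> words) {
        return words.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()))
                .entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(3)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(top3(List.of("a", "b", "a", "c", "b", "a", "d")));
    }
}
```

Note that the map-building collect is still a terminal operation here; the "non-terminal" feel comes only from chaining a second stream off the resulting entry set.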
You should get the entrySet of the map and glue the entries to the calls of your binary function:
inputMap.entrySet().stream().map(e -> myFun(e.getKey(), e.getValue()));
The result of the above is a stream of T instances.
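Concretely, with made-up types (the Pair record stands in for the asker's T, and its constructor plays the role of myFun; records require Java 16+):

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MapToListStream {
    record Pair(String key, int value) {} // stand-in for the asker's T

    static List<Pair> toPairs(Map<String, Integer> inputMap) {
        return inputMap.entrySet().stream()
                .map(e -> new Pair(e.getKey(), e.getValue())) // the "glue" step
                .sorted(Comparator.comparing(Pair::key))      // further non-terminal ops
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(toPairs(Map.of("b", 2, "a", 1)));
    }
}
```

The map call is an ordinary intermediate operation, so you can keep chaining sorted, filter, limit, etc. on the resulting stream of T instances before any terminal operation runs.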
Update
Your additional example confirms what was discussed in the comments below: grouping and sorting are by their nature terminal operations. They must be performed in full before even the first element of the output can be produced, so involving them as non-terminal operations doesn't buy you anything in terms of performance or memory footprint.
It happens that Java 8 defines sorted as a non-terminal operation; however, that decision can lead to deceptive code, because the operation will block until it has received all upstream elements, and it has to retain all of them while receiving.
You can also turn the HashMap's values into a list. Note, however, that values() returns a Collection, not an ArrayList, so it has to be copied (T here is the map's value type):
List<T> list = new ArrayList<>(hashMap.values());