Java 8 Stream vs Collection Storage

I have been reading up on Java 8 Streams and the way data is streamed from a data source, rather than having the entire collection available to extract data from.
In particular, I read this quote in an article about streams in Java 8:
No storage. Streams don't have storage for values; they carry values from a source (which could be a data structure, a generating function, an I/O channel, etc) through a pipeline of computational steps.
I understand the concept of streaming data in from a source piece by piece. What I don't understand is: if you are streaming from a collection, how is there no storage? The collection already exists on the heap; you are just streaming the data from that collection, so the collection already exists in "storage".
What's the difference, memory-footprint-wise, if I were to just loop through the collection with a standard for loop?

The statement about streams and storage means that a stream doesn't have any storage of its own. If the stream's source is a collection, then obviously that collection has storage to hold the elements.
Let's take one of the examples from that article:
int sum = shapes.stream()
                .filter(s -> s.getColor() == BLUE)
                .mapToInt(s -> s.getWeight())
                .sum();
Assume that shapes is a Collection that has millions of elements. One might imagine that the filter operation would iterate over the elements from the source and create a temporary collection of results, which might also have millions of elements. The mapToInt operation might then iterate over that temporary collection and generate its results to be summed.
That's not how it works. There is no temporary, intermediate collection. The stream operations are pipelined, so elements emerging from filter are passed through mapToInt and thence to sum without being stored into and read from a collection.
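To see the pipelining in action, here is a minimal sketch (using plain numbers instead of the article's hypothetical shapes): each element passes through filter and map before the next element is even drawn from the source, so no intermediate collection ever exists.
import java.util.stream.IntStream;

int sum = IntStream.rangeClosed(1, 5)
        .peek(n -> System.out.println("drawn:  " + n))   // element leaves the source
        .filter(n -> n % 2 == 1)
        .peek(n -> System.out.println("passed: " + n))   // element survived the filter
        .map(n -> n * 10)
        .sum();
System.out.println("sum = " + sum); // 90
// The output interleaves "drawn"/"passed" per element; you never see all
// "drawn" lines first, which is what a temporary collection would imply.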
If the stream source weren't a collection -- say, elements were being read from a network connection -- there needn't be any storage at all. A pipeline like the following:
int sum = streamShapesFromNetwork()
        .filter(s -> s.getColor() == BLUE)
        .mapToInt(s -> s.getWeight())
        .sum();
might process millions of elements, but it wouldn't need to store millions of elements anywhere.
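For instance, here is a minimal sketch with a generated source standing in for the network (numbers instead of shapes): a million elements flow through the pipeline, yet none of them is ever stored in a data structure.
import java.util.stream.LongStream;

long sum = LongStream.rangeClosed(1, 1_000_000)
        .filter(n -> n % 2 == 0)
        .sum();
System.out.println(sum); // 250000500000, computed without storing any elements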

Think of the stream as a nozzle connected to the water tank that is your data structure. The nozzle doesn't have its own storage. Sure, the water (data) the stream provides is coming from a source that has storage, but the stream itself has no storage. Connecting another nozzle (stream) to your tank (data structure) won't require storage for a whole new copy of the data.
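A tiny sketch of that idea (illustrative names): attaching two "nozzles" to the same list costs no extra storage for the data.
import java.util.Arrays;
import java.util.List;

List<Integer> tank = Arrays.asList(1, 2, 3, 4, 5);
// two independent nozzles over the same tank; neither copies the data
long evens = tank.stream().filter(n -> n % 2 == 0).count(); // 2
int total = tank.stream().mapToInt(Integer::intValue).sum(); // 15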

A Collection is a data structure. Based on the problem, you decide which collection to use, such as ArrayList or LinkedList (considering time and space complexity). A Stream, on the other hand, is just a processing tool that makes your life easier.
Another difference is that a Collection is an in-memory data structure, to which you can add and remove elements.
With a Stream, by contrast, you can perform two kinds of operations:
a. Intermediate operations: filter, map, sorted, limit on the result set
b. Terminal operations: forEach, or collect the result set into a collection.
But notice that with a stream you can't add or remove elements.
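A small sketch pairing the two kinds of operations (the list contents are made up): everything up to collect returns a new Stream; collect is the terminal operation that produces a result.
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

List<String> top2 = Arrays.asList("banana", "apple", "cherry", "avocado")
        .stream()
        .filter(s -> s.startsWith("a"))    // intermediate
        .map(String::toUpperCase)          // intermediate
        .sorted()                          // intermediate
        .limit(2)                          // intermediate
        .collect(Collectors.toList());     // terminal
// top2 == [APPLE, AVOCADO]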
A stream is a kind of iterator: you can traverse a collection through a stream. Note that you can traverse a stream only once. Here is an example for better understanding:
Example 1:
List<String> employeeNameList = Arrays.asList("John", "Peter", "Sachin");
Stream<String> s = employeeNameList.stream();
// iterate through the list
s.forEach(System.out::println); // this works perfectly fine
s.forEach(System.out::println); // IllegalStateException: stream has already been operated upon or closed
So, what you can infer is that a collection can be iterated as many times as you want, but a stream, once traversed, cannot be traversed again; you need to create a new one.
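A common workaround, not part of the original example: wrap the stream creation in a Supplier and get a fresh stream for every traversal.
import java.util.function.Supplier;
import java.util.stream.Stream;

Supplier<Stream<String>> names = employeeNameList::stream;
names.get().forEach(System.out::println); // first traversal: fine
names.get().forEach(System.out::println); // also fine -- each get() builds a new stream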
I hope this is clear.

A stream is just a view of the data; it has no storage of its own, and you can't modify the underlying collection (assuming it's a stream that was built on top of a collection) through the stream. It's like read-only access.
If you have any RDBMS experience, it's exactly the idea of a "view".

The previous answers are mostly correct. Still, a more intuitive explanation follows (for Google passengers landing here):
Think of streams as UNIX pipelines of text:
cat input.file | sed ... | grep ... > output.file
In general, those UNIX text utilities consume a small amount of RAM compared to the processed input data.
That's not always the case, though. Think of sort: that algorithm needs to keep intermediate data in memory.
The same is true for streams. Sometimes temporary data is needed; most of the time it is not.
As an extra simile, to some extent cloud "serverless" APIs follow the same design as UNIX pipelines or Java streams.
They do not exist in memory until they have some input data to process. The cloud OS launches them and injects the input data. The output is sent gradually somewhere else, so the serverless API does not consume many resources (most of the time).
Again, these are not absolute truths.

Related

Is sort applied before map in Java Streams?

I want to process a List using Java streams, but I'm not sure I can guarantee that the sort is processed before the map method in the following expression:
list.stream()
    .sorted((a, b) -> b.getStartTime().compareTo(a.getStartTime()))
    .mapToDouble(e -> {
        double points = (e.getDuration() / 60);
        ...
        return points * e.getType().getMultiplier();
    })
    .sum();
I need to perform some calculations that depend on that specific order.
Yes, you can guarantee that, because the operations in a stream pipeline are applied in the order in which they are declared (once the terminal operation is invoked).
From Stream docs:
To perform a computation, stream operations are composed into a stream pipeline. A stream pipeline consists of a source (which might be an array, a collection, a generator function, an I/O channel, etc), zero or more intermediate operations (which transform a stream into another stream, such as filter(Predicate)), and a terminal operation (which produces a result or side-effect, such as count() or forEach(Consumer)). Streams are lazy; computation on the source data is only performed when the terminal operation is initiated, and source elements are consumed only as needed.
The key word in the above paragraph is pipeline, whose definition in Wikipedia starts as follows:
In software engineering, a pipeline consists of a chain of processing elements (processes, threads, coroutines, functions, etc.), arranged so that the output of each element is the input of the next...
Not only will sorted be applied before map, it will also traverse the underlying source in full. sorted gets all the elements, puts them into an array or an ArrayList (depending on whether the size is known), sorts them, and then hands one element at a time to the map operation.
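You can observe that buffering yourself with peek. This is a small demonstration, not from the original answer: every "before" line prints before any "after" line, because sorted must consume the whole source first.
import java.util.stream.Stream;

Stream.of(3, 1, 2)
      .peek(n -> System.out.println("before sorted: " + n))
      .sorted()
      .peek(n -> System.out.println("after sorted:  " + n))
      .forEach(n -> { });
// prints all three "before sorted" lines first, then the "after sorted" lines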

Grouping Java8 stream without collecting it

Is there any way in Java 8 to group the elements in a java.util.stream.Stream without collecting them? I want the result to be a Stream again. Because I have to work with a lot of data or even infinite streams, I cannot collect the data first and stream the result again.
All elements that need to be grouped are consecutive in the first stream. Therefore I'd like to keep the stream evaluation lazy.
There's no way to do it using the standard Stream API. In general you cannot do it, because it's always possible that a new item will appear later that belongs to an already created group, so you cannot pass a group downstream until you have processed all the input.
However, if you know in advance that the items to be grouped are always adjacent in the input stream, you can solve your problem using third-party libraries that enhance the Stream API. One such library is StreamEx, which is free and written by me. It contains a number of "partial reduction" operators which collapse adjacent items into one based on some predicate. Usually you supply a BiPredicate that tests two adjacent items and returns true if they should be grouped together. Some of the partial reduction operations are listed below:
collapse(BiPredicate): replaces each group with the first element of the group. For example, collapse(Objects::equals) is useful for removing adjacent duplicates from a stream.
groupRuns(BiPredicate): replaces each group with a List of the group's elements (so a StreamEx<T> is converted to a StreamEx<List<T>>). For example, stringStream.groupRuns((a, b) -> a.charAt(0) == b.charAt(0)) will create a stream of Lists of strings, where each list contains adjacent strings starting with the same letter.
Other partial reduction operations include intervalMap, runLengths() and so on.
All partial reduction operations are lazy, parallel-friendly and quite efficient.
Note that you can easily construct a StreamEx object from a regular Java 8 stream using StreamEx.of(stream). There are also methods to construct one from an array, Collection, Reader, etc. The StreamEx class implements the Stream interface and is 100% compatible with the standard Stream API.
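A minimal groupRuns sketch, assuming the StreamEx library is on the classpath (sample data made up): adjacent strings with the same first letter collapse into one List.
import one.util.streamex.StreamEx;
import java.util.List;

List<List<String>> runs = StreamEx.of("apple", "avocado", "banana", "blueberry", "cherry")
        .groupRuns((a, b) -> a.charAt(0) == b.charAt(0))
        .toList();
// runs == [[apple, avocado], [banana, blueberry], [cherry]]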

Java grouping in stream

Java 8 streams allow us to collect elements while grouping by an arbitrary constraint. For example:
Map<Type, List<MyThing>> grouped = stream
.collect(groupingBy(myThing -> myThing.type()));
However this has the drawback that the stream must be completely read through, so there is no chance of lazy evaluation of future operations on grouped.
Is there a way to do a grouping operation to get something like Stream<Tuple<Type, Stream<MyThing>>>? Is it even conceptually possible to group lazily in any language without evaluating the whole data set?
The concept of lazy grouping doesn't really make sense. Grouping, by definition, means selecting groups in advance to avoid the overhead of searching through all the elements for each key. "Lazy grouping" would look like this:
List<MyThing> get(Type key) {
    return source.stream()
                 .filter(myThing -> myThing.type().equals(key))
                 .collect(toList());
}
If you prefer to defer iteration to when you know you need it, or if you want to avoid the memory overhead of caching a grouping map, this is perfectly fine. But you can't optimize the selection process without iterating ahead of time.
A stream should be operated on (invoking an intermediate or terminal stream operation) only once. This rules out, for example, "forked" streams, where the same source feeds two or more pipelines, or multiple traversals of the same stream.
Taken from the doc at:
https://docs.oracle.com/javase/8/docs/api/java/util/stream/Stream.html
So I think there is no way to split a stream without consuming it and creating new streams.
I do not think this would make sense, since reading from one partition stream (Tuple<Type, Stream<MyThing>>) of a lazy stream Stream<Tuple<Type, Stream<MyThing>>> could require buffering an arbitrarily large number of elements for the other partitions.
E.g. consider the lazy stream of positive integers in natural order, grouped by their smallest prime factor. Reading from the most recently received partition would force an ever-increasing number of integers to be buffered in the partitions received before it.
Is it even conceptually possible to group lazily in any language without evaluating the whole data set?
No, you cannot group an entire data set correctly without checking the entire data set or having a guarantee of an exploitable pattern in the data. For example, I can group the first 10,000 integers into even-odd lazily, but I can't lazily group even-odd for a random set of 10,000 integers.
As far as grouping in a non-terminal fashion... it's not something that seems like a good idea. Conceptually, a grouping function on a stream should return multiple streams, as if it were branching the different streams, and Java 8 does not support that.
If you really want to use native Stream methods to group non-terminally, you could abuse the sorted method. Give it a sorter that treats the groups differently but treats all elements within a group as equal and you'll end up with group1,group2,group3,etc. This won't give you lazy evaluation, but it is grouping.
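A sketch of that "abuse sorted" idea (the grouping key here, string length, is made up): comparing only the group key keeps elements of the same group together, and since sorted is stable on ordered streams, their relative order within each group is preserved.
import java.util.Comparator;
import java.util.stream.Stream;

Stream.of("bb", "a", "cc", "d")
      .sorted(Comparator.comparingInt(String::length))
      .forEach(System.out::println);
// prints a, d, bb, cc -- grouped by length, original order kept within each group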

Java: the efficient way to read logs from a file

I'm looking for the most efficient way to get all the elements from a List<String> which contain some String value ("value1", for example).
First thought: a simple iteration, adding the elements which contain "value1" to another List<String>. But this task must be done very often and by many users.
I thought about list.removeAll(), but how do I remove all elements which don't contain "value1"?
So, what is the most efficient way to do this?
UPDATE:
The whole picture: I need to read the logs from a file very often and for multiple users simultaneously. The logs must be filtered by the username from the file; each line in the file contains a username.
In terms of time efficiency, you cannot get a better result than linear (O(n)) if you have to iterate through the whole list.
Deciding between LinkedList and ArrayList etc. is most likely irrelevant, as the differences are small.
If you want better-than-linear time in the list size, you need to build on some assumptions and prerequisites:
if you know beforehand what string you'll search for, you can maintain another list alongside your original list containing only the relevant records
if you know you're going to query the same list multiple times, you could build an index (see the sketch below)
If you just have a list on input that someone gave you, and you need to read through this one input once and find the relevant strings, then you're stuck with linear time since you cannot avoid reading the list at least once.
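As a sketch of the "build an index" idea mentioned above (assuming, for illustration, that lines is the full log as a List<String> and that the username is the first whitespace-delimited token of each line): group once in O(n), then every later lookup is O(1).
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

Map<String, List<String>> byUser = lines.stream()
        .collect(Collectors.groupingBy(line -> line.split("\\s+")[0]));
List<String> forAlice = byUser.getOrDefault("alice", Collections.emptyList());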
From your comments it seems like your list is a couple of log statements that should be grouped by user id (which would be your "value1"). If you really need to read the logs very often and for multiple users simultaneously you might consider some caching, possibly with grouping by user id.
As an example you could maintain an additional log file per user and just display it when needed. Alternatively you could keep the latest log statements in memory by employing some FIFO buffer which is grouped by user id (could be a buffer per user, and maybe another LIFO layer on top of that).
However, depending on your use case it might not be worth the effort, and you might just filter the list whenever the user requests it. In that case I'd recommend reading the file line by line and only adding the matching lines to the list. If you first read everything into a single list and then remove non-matching elements, it'll be less efficient (you'd have to iterate more often, shift elements, etc.) and temporarily use more memory (as opposed to discarding every non-matching line right after checking it).
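A minimal sketch of that line-by-line approach (the file name and the matching rule are assumptions): only the matching lines are ever kept.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

try (Stream<String> lines = Files.lines(Paths.get("app.log"))) {
    List<String> userLines = lines
            .filter(line -> line.contains("value1"))
            .collect(Collectors.toList());
} catch (IOException e) {
    e.printStackTrace();
}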
Instead of a List, use a TreeSet with a Comparator that puts all strings containing "value1" at the beginning. When iterating, as soon as a string does not contain "value1", none of the remaining ones do either, and you can stop iterating.
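A hedged sketch of that TreeSet idea (yourList is an assumed List<String>): order matching strings first, breaking ties with natural order so the comparator stays consistent with equals.
import java.util.Comparator;
import java.util.TreeSet;

TreeSet<String> set = new TreeSet<>(
        Comparator.comparing((String s) -> !s.contains("value1"))
                  .thenComparing(Comparator.naturalOrder()));
set.addAll(yourList); // note: a Set drops duplicate lines
for (String s : set) {
    if (!s.contains("value1")) break; // everything after this point is a non-match
    System.out.println(s);
}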
Iteration is likely the only way, but you can let Java optimize it as much as possible (and use an elegant, non-imperative syntax) by employing Java 8's streams:
// test list
List<String> original = new ArrayList<String>() {{
    add("value1"); add("foo"); add("foovalue1"); add("value1foo");
}};
List<String> trimmed = original
        .stream()
        .filter(s -> s.contains("value1"))
        .collect(Collectors.toList());
System.out.println(trimmed);
Output
[value1, foovalue1, value1foo]
Notes
One part of your question that may require more information is "performed often, by many users" - this may call for some concurrency-handling mechanism.
The actual functionality is not very clear. You may still have room to optimize your code by fetching and collecting the "value1"-containing Strings prior to building your List.
OK, here I can suggest the simplest approach I have used.
Using an Iterator makes this easier; note that if you instead call list.remove(val), where val = "value1", you may get an UnsupportedOperationException.
List<String> list = yourList; // contains "value1"
for (Iterator<String> itr = list.iterator(); itr.hasNext();) {
    String val = itr.next();
    if (!val.contains("value1")) { // remove everything that doesn't contain "value1"
        itr.remove();
    }
}
Try this one and let me know. :)

Java 8 stream Map<K,V> to List<T>

Given that I have some function that takes two parameters and returns one value, is it possible to convert a Map to a List in a Stream as a non-terminal operation?
The nearest I can find is to use forEach on the map to create instances and add them to a pre-defined List, then start a new Stream from that List. Or did I just miss something?
Eg: The classic "find the 3 most frequently occurring words in some long list of words"
wordList.stream().collect(groupingBy(Function.identity(), Collectors.counting()))
(now I want to stream the entrySet of that map)
.sorted((a, b) -> a.getValue().compareTo(b.getValue())).limit(3).forEach(print...
You should get the entrySet of the map and glue the entries to the calls of your binary function:
inputMap.entrySet().stream().map(e -> myFun(e.getKey(), e.getValue()));
The result of the above is a stream of T instances.
Update
Your additional example confirms what was discussed in the comments below: group by and sort are by their nature terminal operations. They must be performed in full to be able to produce even the first element of the output, so involving them as non-terminal operations doesn't buy anything in terms of performance/memory footprint.
It happens that Java 8 defines sorted as a non-terminal operation, however that decision could lead to deceptive code because the operation will block until it has received all upstream elements, and will have to retain them all while receiving.
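For reference, here is a complete, runnable version of the questioner's "top 3 words" pipeline (the sample data is made up), streaming the entry set of the grouping map:
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

List<String> wordList = Arrays.asList("a", "b", "a", "c", "b", "a");
wordList.stream()
        .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()))
        .entrySet().stream()
        .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
        .limit(3)
        .forEach(e -> System.out.println(e.getKey() + ": " + e.getValue()));
// a: 3, b: 2, c: 1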
You can also collect the values of a HashMap into an ArrayList. Note that values() returns a Collection, not an ArrayList, so it has to be copied (V is the map's value type):
List<V> list = new ArrayList<>(hashMap.values());
