There are multiple questions about streams, but I didn't find any for this use case in Java. I have a huge stream of objects, Stream<A> (~1 million objects), that comes from a file.
class A {
    enum Status { RUNNING, QUEUED, COMPLETED }
    Status status;
    String name;
}
I want to split Stream<A> into three streams without using any collect calls, since collecting loads everything into memory.
I am getting a StackOverflowError because I am calling Stream.concat multiple times, and Stream.concat has a known problem mentioned in the Java docs:
"Implementation Note:
Use caution when constructing streams from repeated concatenation. Accessing an element of a deeply concatenated stream can result in deep call chains, or even StackOverflowException."
Map<Status, Stream<String>> splitStream = new HashMap<>();
streamA.forEach(aObj -> {
    Stream<String> statusBasedStream = splitStream.getOrDefault(aObj.status, Stream.of());
    splitStream.put(aObj.status, Stream.concat(statusBasedStream, Stream.of(aObj.name)));
});
There are a few custom stream implementations on GitHub that support concatenation, but I wanted to solve this with the standard library.
If the data were smaller, I would have taken a list approach as mentioned here (Split stream into substreams with N elements).
Not an exact solution to the problem, but if you have information about the indexes, then a combination of Stream.skip() and Stream.limit() can help. Below is the dummy code that I tried:
int queuedNumbers = 100;
int runningNumbers = 200;
// A stream can be consumed only once, so each substream needs a fresh
// stream over the source (e.g. by re-reading the file); a Supplier stands in here.
Supplier<Stream<Object>> all = Stream::of; // placeholder for the real source
Stream<Object> queued = all.get().limit(queuedNumbers);
Stream<Object> running = all.get().skip(queuedNumbers).limit(runningNumbers);
Stream<Object> completed = all.get().skip(queuedNumbers + runningNumbers);
Hope it is of some help.
Related
I have two Sets like this:
Set<String> set1;
Set<String> set2;
And I want to merge them with
Set<String> s = Stream.of(set1, set2).collect(Collectors.toSet());
but I get the following error:
no instance(s) of type variable(s) exist so that Set<String> conforms to String
inference variable T has incompatible bounds:
    equality constraints: String
    lower bounds: Set<String>
How can I convert the Sets to a single Set<String> with flatMap()?
Is there any other solution that can accomplish this operation gracefully?
If you insist on using Streams, you can use flatMap to convert your Stream<Set<String>> to a Stream<String>, which can be collected into a Set<String>:
Set<String> s = Stream.of(set1, set2).flatMap(Set::stream).collect(Collectors.toSet());
You can use Stream.concat to merge the streams of the two sets and collect the result as a set.
Set<String> s = Stream.concat(set1.stream(), set2.stream()).collect(Collectors.toSet());
There are a couple of approaches possible:
Concat
Set<String> s = Stream.concat(set1.stream(), set2.stream()).collect(Collectors.toSet());
It gets slightly ugly for more than two streams, as we have to write
Stream.concat(Stream.concat(set1.stream(), set2.stream()), set3.stream())
Concat can be a problem for deeply concatenated streams. From the documentation:
Use caution when constructing streams from repeated
concatenation. Accessing an element of a deeply concatenated stream can
result in deep call chains, or even StackOverflowError.
Reduce
Reduce can also be used to perform concatenation of streams:
Set<String> s = Stream.of(set1.stream(), set2.stream()).reduce(Stream::concat)
.orElseGet(Stream::empty).collect(Collectors.toSet());
Here Stream.reduce() returns an Optional, which is the reason for the orElseGet call. It's also possible to concat multiple sets:
Stream.of(set1.stream(), set2.stream(), set3.stream()).reduce(Stream::concat).orElseGet(Stream::empty).collect(Collectors.toSet());
The problem associated with deeply concatenated streams applies to reduce as well.
Flatmap
flatMap can be used to get the same result:
Set<String> s = Stream.of(set1, set2).flatMap(Set::stream).collect(Collectors.toSet());
To concat multiple sets you can use:
Set<String> s = Stream.of(set1, set2, set3).flatMap(Set::stream).collect(Collectors.toSet());
flatMap avoids the deep call chains: the inner streams are consumed one after another instead of being nested, so the call depth stays constant no matter how many sets you merge.
How do I generate a stream of "new" data? Specifically, I want to be able to create data with functions that are not reversible.
If I want to create a stream from an array
I do
Stream.of(arr)
From a collection
col.stream()
A constant stream can be made with a lambda expression
Stream.generate(() -> "constant")
A stream based on the last input (any reversible function) may be achieved by
Stream.iterate(0, x -> x + 2)
But what if I want to create a more general generator (say, one that outputs whether a number is divisible by three: 0,0,1,0,0,1,0,0,1...) without creating a new class?
The main issue is that I need some way of inputting the index into the lambda, because I want to produce a pattern and not depend on the last output of the function.
Note:
someStream.limit(length) can be used to cap the length of the stream, so an infinite stream generator is actually what I am looking for.
If you want to have an infinite stream for a function taking an index, you may consider creating a “practically infinite” stream using
IntStream.rangeClosed(0, Integer.MAX_VALUE).map(index -> your lambda)
resp.
IntStream.rangeClosed(0, Integer.MAX_VALUE).mapToObj(index -> your lambda)
for a Stream rather than an IntStream.
This isn't truly infinite, but there are no int values to represent indices after Integer.MAX_VALUE, so you have a semantic problem to solve whenever you hit that index.
Also, when using LongStream.rangeClosed(0, Long.MAX_VALUE).map(index -> yourLambda) instead and each element evaluation takes only a nanosecond, it will take almost three hundred years to process all elements.
But, of course, there is a way to create a truly infinite stream using
Stream.iterate(BigInteger.ZERO, BigInteger.ONE::add).map(index -> yourLambda)
which might run forever or, more likely, bail out with an OutOfMemoryError once the index can't be represented in the heap memory anymore, if your processing ever gets that far.
Note that streams constructed using range[Closed] might be more efficient than streams constructed using Stream.iterate, as they have a known size and split well for parallel processing.
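For the divisible-by-three pattern from the question, a minimal sketch (assuming the pattern starts at index 0, so every third element is 1) could be:
IntStream.rangeClosed(0, Integer.MAX_VALUE)
         .map(i -> (i + 1) % 3 == 0 ? 1 : 0) // 0, 0, 1, 0, 0, 1, ...
         .limit(9)
         .forEach(System.out::println);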
You can do something like this
AtomicInteger counter = new AtomicInteger(0);
Stream<Integer> s = Stream.generate(() -> counter.getAndIncrement());
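Note that Stream.generate returns an unordered stream and the supplier here is stateful, so this only behaves predictably for sequential use; with a parallel stream the generated indices may not arrive in order.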
I have the following code:
ArrayList<String> entries = new ArrayList<>();
entries.add("0");
entries.add("1");
entries.add("2");
entries.add("3");
String firstNotHiddenItem = entries.stream()
.filter(e -> e.equals("2"))
.findFirst()
.get();
I need to know the index of that first returned element, since I need to edit it inside the entries ArrayList. As far as I know, get() returns the value of the element, not a reference. Should I just use
int indexOf(Object o)
instead?
You can get the index of an element using an IntStream like:
int index = IntStream.range(0, entries.size())
.filter(i -> "2".equals(entries.get(i)))
.findFirst().orElse(-1);
But you should use the List::indexOf method, which is the preferred way because it's more concise, more expressive, and computes the same result.
You can't in a straightforward way - streams process elements without context of where they are in the stream.
However, if you're prepared to take the gloves off...
int[] position = {-1};
String firstNotHiddenItem = entries.stream()
.peek(x -> position[0]++) // incremented for every element encountered
.filter("2"::equals)
.findFirst()
.get();
System.out.println(position[0]); // 2
The use of an int[], instead of a simple int, is to circumvent the "effectively final" requirement; the reference to the array is constant, only its contents change.
Note also the use of a method reference "2"::equals instead of a lambda e -> e.equals("2"), which not only avoids a possible NPE (if a stream element is null) but, more importantly, looks way cooler.
A more palatable (less hackalicious) version:
AtomicInteger position = new AtomicInteger(-1);
String firstNotHiddenItem = entries.stream()
.peek(x -> position.incrementAndGet()) // incremented for every element encountered
.filter("2"::equals)
.findFirst()
.get();
position.get(); // 2
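(Both variants assume a sequential stream; with a parallel stream, the side-effecting counter in peek is no longer guaranteed to reflect the position of the matched element.)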
This will work using Eclipse Collections with Java 8
int firstIndex = ListIterate.detectIndex(entries, "2"::equals);
If you use a MutableList, you can simplify the code as follows:
MutableList<String> entries = Lists.mutable.with("0", "1", "2", "3");
int firstIndex = entries.detectIndex("2"::equals);
There is also a method to find the last index.
int lastIndex = entries.detectLastIndex("2"::equals);
Note: I am a committer for Eclipse Collections
Yes, you should use indexOf("2") instead. As you might have noticed, any stream-based solution has a higher complexity, without providing any benefit.
In this specific situation, there is no significant difference in performance, but overusing streams can cause dramatic performance degradation, e.g. when using map.entrySet().stream().filter(e -> e.getKey().equals(object)).map(e -> e.getValue()) instead of a simple map.get(object).
The collection operations may utilize their known structure, while most stream operations imply a linear search. So genuine collection operations are preferable.
Of course, if there is no suitable collection operation, like when your predicate is not a simple equality test, the Stream API may be the right tool. As shown in "Is there a concise way to iterate over a stream with indices in Java 8?", the solution for any task involving indices works by using the indices as the starting point, e.g. via IntStream.range, and accessing the list via List.get(int). If the source is not an array or a random access List, there is no equally clean and efficient solution. Sometimes, a loop might turn out to be the simplest and most efficient solution.
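For instance, a sketch of such an index search with a hypothetical non-equality predicate:
int index = IntStream.range(0, entries.size())
                     .filter(i -> entries.get(i).length() > 3) // hypothetical predicate
                     .findFirst()
                     .orElse(-1);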
I'm wondering if I can add an operation to a stream based on some condition set outside of the stream. For example, I want to add a limit operation to the stream if my limit variable is not equal to -1.
My code currently looks like this, but I have yet to see other examples of streams being used this way, where a Stream object is reassigned to the result of an intermediate operation applied to itself:
// Do some stream stuff
stream = stream.filter(e -> e.getTimestamp() < max);
// Limit the stream
if (limit != -1) {
stream = stream.limit(limit);
}
// Collect stream to list
stream.collect(Collectors.toList());
As stated in this stackoverflow post, the filter isn't actually applied until a terminal operation is called. Since I'm reassigning the value of stream before a terminal operation is called, is the above code still a proper way to use Java 8 streams?
There is no semantic difference between a chained series of invocations and a series of invocations storing the intermediate return values. Thus, the following code fragments are equivalent:
a = object.foo();
b = a.bar();
c = b.baz();
and
c = object.foo().bar().baz();
In either case, each method is invoked on the result of the previous invocation. But in the latter case, the intermediate results are not stored but lost on the next invocation. In the case of the stream API, the intermediate results must not be used after you have called the next method on them, thus chaining is the natural way of using streams as it intrinsically ensures that you don't invoke more than one method on a returned reference.
Still, it is not wrong to store the reference to a stream as long as you obey the contract of not using a returned reference more than once. By using it the way you do in your question, i.e. overwriting the variable with the result of the next invocation, you also ensure that you don't invoke more than one method on a returned reference, thus it's a correct usage. Of course, this only works with intermediate results of the same type, so when you are using map or flatMap, getting a stream of a different reference type, you can't overwrite the local variable. Then you have to be careful not to use the old local variable again, but, as said, as long as you are not using it after the next invocation, there is nothing wrong with the intermediate storage.
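For instance, a minimal sketch of that type-changing case (list is assumed to be a List<String>):
Stream<String> words = list.stream().filter(s -> !s.isEmpty());
Stream<Integer> lengths = words.map(String::length); // different type, new variable
// words must not be used from this point on
List<Integer> result = lengths.collect(Collectors.toList());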
Sometimes, you have to store it, e.g.
try(Stream<String> stream = Files.lines(Paths.get("myFile.txt"))) {
stream.filter(s -> !s.isEmpty()).forEach(System.out::println);
}
Note that the code is equivalent to the following alternatives:
try(Stream<String> stream = Files.lines(Paths.get("myFile.txt")).filter(s->!s.isEmpty())) {
stream.forEach(System.out::println);
}
and
try(Stream<String> srcStream = Files.lines(Paths.get("myFile.txt"))) {
Stream<String> tmp = srcStream.filter(s -> !s.isEmpty());
// must not use the variable srcStream here:
tmp.forEach(System.out::println);
}
They are equivalent because forEach is always invoked on the result of filter which is always invoked on the result of Files.lines and it doesn’t matter on which result the final close() operation is invoked as closing affects the entire stream pipeline.
To put it in one sentence, the way you use it, is correct.
I even prefer to do it that way, as not chaining a limit operation when you don't want to apply a limit is the cleanest way of expressing your intent. It's also worth noting that the suggested alternatives may work in a lot of cases, but they are not semantically equivalent:
.limit(condition? aLimit: Long.MAX_VALUE)
assumes that the maximum number of elements you can ever encounter is Long.MAX_VALUE, but streams can have more elements than that; they might even be infinite.
.limit(condition? aLimit: list.size())
when the stream source is a list, is breaking the lazy evaluation of the stream. In principle, a mutable stream source might legally get arbitrarily changed up to the point when the terminal action is commenced. The result will reflect all modifications made up to this point. When you add an intermediate operation incorporating list.size(), i.e. the actual size of the list at this point, subsequent modifications applied to the collection between this point and the terminal operation may cause this value to have a different meaning than the intended "actually no limit" semantic.
Compare with the "Non-Interference" section of the API documentation:
For well-behaved stream sources, the source can be modified before the terminal operation commences and those modifications will be reflected in the covered elements. For example, consider the following code:
List<String> l = new ArrayList(Arrays.asList("one", "two"));
Stream<String> sl = l.stream();
l.add("three");
String s = sl.collect(joining(" "));
First a list is created consisting of two strings: "one"; and "two". Then a stream is created from that list. Next the list is modified by adding a third string: "three". Finally the elements of the stream are collected and joined together. Since the list was modified before the terminal collect operation commenced the result will be a string of "one two three".
Of course, this is a rare corner case, as normally a programmer will formulate an entire stream pipeline without modifying the source collection in between. Still, the different semantics remain, and it might turn into a very hard-to-find bug if you ever enter such a corner case.
Further, since they are not equivalent, the stream API will never recognize these values as “actually no limit”. Even specifying Long.MAX_VALUE implies that the stream implementation has to track the number of processed elements to ensure that the limit has been obeyed. Thus, not adding a limit operation can have a significant performance advantage over adding a limit with a number that the programmer expects to never be exceeded.
There are two ways you can do this:
// Do some stream stuff
List<E> results = list.stream()
    .filter(e -> e.getTimestamp() < max)
    .limit(limit > 0 ? limit : list.size())
    .collect(Collectors.toList());
OR
// Do some stream stuff
stream = stream.filter(e -> e.getTimestamp() < max);
// Limit the stream
if (limit != -1) {
stream = stream.limit(limit);
}
// Collect stream to list
List<E> results = stream.collect(Collectors.toList());
As this is functional programming, you should always work on the result of each function. You should specifically avoid modifying anything in this style of programming and treat everything as if it were immutable, if possible.
Since I'm reassigning the value of stream before a terminal operation is called, is the above code still a proper way to use Java 8 streams?
It should work; however, it reads as a mix of imperative and functional coding. I suggest writing it as a fixed stream, as in the first variant above.
I think your first line needs to be:
stream = stream.filter(e -> e.getTimestamp() < max);
so that you're using the stream returned by filter in subsequent operations rather than the original stream.
I know it is a bit late, but I had the same question myself and didn't find a satisfying answer. However, inspired by this question and its answers, I came to the following solution:
return Stream.of( // wrap the target stream in another stream
        /* do regular stream stuff */
        stream.filter(e -> e.getTimestamp() < max)
    )
    .flatMap(s -> limit != -1 ? s.limit(limit) : s) // apply the limit only if necessary and unwrap the stream of streams back to a "normal" stream
    .collect(Collectors.toList()); // do final stuff
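The same wrap-and-flatMap trick should work for any conditional intermediate operation; for example, a hypothetical variant that only skips elements when an offset was given (source and offset are placeholders):
List<String> result = Stream.of(source.stream())
        .flatMap(s -> offset > 0 ? s.skip(offset) : s) // skip only if requested
        .collect(Collectors.toList());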
This question arose from the answer to another question, where map and reduce were suggested to calculate a sum concurrently.
In that question there is a complexCalculation(e), but now I was wondering how to parallelize even more, by splitting the calculation into two parts, so that complexCalculation(e) = part1(e) * part2(e). I wonder whether it would be possible to calculate part1 and part2 on a collection concurrently (using map() again) and then zip the two resulting streams, so that the ith element of both streams is combined with the function * and the resulting stream equals the stream that can be obtained by mapping complexCalculation(e) over that collection. In code this would look like:
Stream map1 = bigCollection.parallelStream().map(e -> part1(e));
Stream map2 = bigCollection.parallelStream().map(e -> part2(e));
// preferably map1 and map2 are computed concurrently...
Stream result = map1.zip(map2, (e1, e2) -> e1 * e2);
result.equals(bigCollection.map(e -> complexCalculation(e))); //should be true
So my question is: does there exist some functionality like the zip function I tried to describe here?
parallelStream() is not guaranteed to process elements in the order they were submitted. This means you cannot assume that two parallel streams can be zipped together like this.
Your original bigCollection.map(e -> complexCalculation(e)) is likely to be faster, unless your collection is actually smaller than the number of CPUs you have.
If you really want to parallelize part1 and part2 (for example, because your bigCollection has very few elements, fewer than there are CPU cores), you can do the following trick. Suppose you have two methods, part1 and part2, in the current class:
public long part1(Type t) { ... }
public long part2(Type t) { ... }
Create a stream of two functions created from these methods and process it in parallel like this:
bigCollection.parallelStream()
             .map(e -> Stream.<ToLongFunction<Type>>of(this::part1, this::part2)
                             .parallel()
                             .mapToLong(fn -> fn.applyAsLong(e))
                             .reduce(1, (a, b) -> a * b))
             .// continue the outer stream operations
However, this is a very rare case. As @PeterLawrey noted, if your outer collection is big enough, there is no need to parallelize part1 and part2; instead, you will handle separate elements in parallel.
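For reference, here is a self-contained sketch of the trick, with hypothetical part1/part2 implementations (the real methods and element type are whatever your code defines):
import java.util.Arrays;
import java.util.List;
import java.util.function.ToLongFunction;
import java.util.stream.Stream;

public class InnerParallelDemo {
    static long part1(long t) { return t + 1; } // hypothetical placeholder
    static long part2(long t) { return t * 2; } // hypothetical placeholder

    public static void main(String[] args) {
        List<Long> bigCollection = Arrays.asList(1L, 2L, 3L);
        long[] results = bigCollection.parallelStream()
            .mapToLong(e -> Stream.<ToLongFunction<Long>>of(
                    InnerParallelDemo::part1, InnerParallelDemo::part2)
                .parallel()
                .mapToLong(fn -> fn.applyAsLong(e))
                .reduce(1, (a, b) -> a * b)) // part1(e) * part2(e)
            .toArray();
        System.out.println(Arrays.toString(results)); // [4, 12, 24]
    }
}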