I have two Fluxs one for successful elements another one holding the erroneous elements
Flux<String> success= Flux.just("Orange", "Apple", "Banana","Grape", "Strawberry");
Flux<String> erroneous = Flux.just("Banana","Grape",);
How can i filter the first Flux to execlude all the elements from the second one ?
You may wish to consider collecting the Flux into a set, caching that set, and then using filterWhen as follows:
Mono<Set<String>> erroneousSet = erroneous.collect(Collectors.toSet()).cache();
Flux<String> filtered = success.filterWhen(v -> erroneousSet.map(s -> !s.contains(v)));
Gives:
Orange
Apple
Strawberry
This isn't the most concise solution (see below), but it enables the contents of erroneous to be cached. In this specific example that's a moot point, but if it's a real-world situation (not using Flux.just()) then erroneous could be recomputed on every subscription, and that could end up being incredibly (and unnecessarily) expensive in performance terms.
Alternatively, if the above really doesn't matter in your use case, filterWhen() and hasElement() can be used much more concisely as follows:
success.filterWhen(s -> erroneous.hasElement(s).map(x->!x))
Or with reactor-extra:
success.filterWhen(s -> BooleanUtils.not(erroneous.hasElement(s)))
Related
I want change my code for single subscriber. Now i have
auctionFlux.window(Duration.ofSeconds(120), Duration.ofSeconds(120)).subscribe(
s -> s.groupBy(Auction::getItem).subscribe( longAuctionGroupedFlux -> longAuctionGroupedFlux.reduce(new ItemDumpStats(), this::calculateStats )
));
This code is working correctly reduce method is very simple. I tried change my code for single subscriber
auctionFlux.window(Duration.ofSeconds(120), Duration.ofSeconds(120))
.flatMap(window -> window.groupBy(Auction::getItem))
.flatMap(longAuctionGroupedFlux -> longAuctionGroupedFlux.reduce(new ItemDumpStats(), this::calculateStats))
.subscribe(itemDumpStatsMono -> log.info(itemDumpStatsMono.toString()));
This is my code, and this code is not working. No errors and no results. After debugging i found code is stuck on second flatMap when i reducing stream. I think problem is on flatMap merging, stucking on Mono resolve. Some one now how to fix this problem and use only single subscriber?
How to replicate, you can use another class or create one. In small size is working but on bigger is dying
List<Auction> auctionList = new ArrayList<>();
for (int i = 0;i<100000;i++){
Auction a = new Auction((long) i, "test");
a.setItem((long) (i%50));
auctionList.add(a);
}
Flux.fromIterable(auctionList).groupBy(Auction::getId).flatMap(longAuctionGroupedFlux ->
longAuctionGroupedFlux.reduce(new ItemDumpStats(), (itemDumpStats, auction) -> itemDumpStats)).collectList().subscribe(itemDumpStats -> System.out.println(itemDumpStats.toString()));
On this approach is instant result but I using 3 subscribers
Flux.fromIterable(auctionList)
.groupBy(Auction::getId)
.subscribe(
auctionIdAuctionGroupedFlux -> auctionIdAuctionGroupedFlux.reduce(new ItemDumpStats(), (itemDumpStats, auction) -> itemDumpStats).subscribe(itemDumpStats -> System.out.println(itemDumpStats.toString()
)
));
I think the behavior you described is related to the interaction between groupBy chained with flatMap.
Check groupBy documentation. It states that:
The groups need to be drained and consumed downstream for groupBy to work correctly. Notably when the criteria produces a large amount of groups, it can lead to hanging if the groups are not suitably consumed downstream (eg. due to a flatMap with a maxConcurrency parameter that is set too low).
By default, maxConcurrency (flatMap) is set to 256 (i checked the source code of 3.2.2). So,
selecting more than 256 groups may cause the execution to hang (particularly when all execution happens on the same thread).
The following code helps in understanding what happens when you chain the operators groupBy and flatMap:
#Test
public void groupAndFlatmapTest() {
val groupCount = 257;
val groupSize = 513;
val list = rangeClosed(1, groupSize * groupCount).boxed().collect(Collectors.toList());
val source = Flux.fromIterable(list)
.groupBy(i -> i % groupCount)
.flatMap(Flux::collectList);
StepVerifier.create(source).expectNextCount(groupCount).expectComplete().verify();
}
The execution of this code hangs. Changing groupCount to 256 or less makes the test pass (for every value of groupSize).
So, regarding your original problem, it is very possible that you are creating a large amount of groups with your key-selector Auction::getItem.
Adding parallel fixed problem, but i looking answer why reduce dramatically slow flatMap.
I have the following code:
final Observable<String> a = Observable.just("a1", "a2");
final Observable<String> b = Observable.just("b1");
final Observable<String> c = Observable.combineLatest(a, b, (first, second) -> first + second);
c.subscribe(res -> System.out.println(res));
What is expected output? I would have expected
a1b1
a2b1
But the actual output is
a2b1
Does that make sense? What is the correct operator the generate the expected sequence?
As the name of the operator should imply, it combines the latest value of each source. If the sources are synchronous or really fast, this could mean that one or more sources will run to their completion and the operator will remember only the very last values of each. You have to interleave the source values by some means, such as having asynchronous sources with ample amount of time between items and avoid close overlapping of items of multiple sources.
The expected sequence can be generated a couple of ways, depending on what your original intention was. For example, if you wanted all cross combination, use flatMap:
a.flatMap(aValue -> b, (aValue, bValue) -> first + second)
.subscribe(System.out::println);
If b is something expensive to recreate, cache it:
Observable<String> cachedB = b.cache();
a.flatMap(aValue -> cachedB, (aValue, bValue) -> first + second)
.subscribe(System.out::println);
Good question! Seems like perhaps a race condition. combineLatest won't output anything until both sources have emitted, and it appears that by the time b generates its output, a has already moved on to its second item. In a "real world" application with asynchronous events that are spaced out in time, you would probably get the behavior you want.
If you can stand the wait, a solution would be to delay a's outputs a bit. With a bit more work you could delay just the first output (see the various overloads of the delay operator). Also I just noticed there's an operator delaySubscription that would probably do the trick (delay your subscription to a until b emits something). I'm sure there are other, perhaps better, solutions (I'm still learning myself).
I've gone through several previous questions like Encounter order preservation in java stream, this answer by Brian Goetz, as well as the javadoc for Stream.reduce(), and the java.util.stream package javadoc, and yet I still can't grasp the following:
Take this piece of code:
public static void main(String... args) {
final String[] alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ".split("");
System.out.println("Alphabet: ".concat(Arrays.toString(alphabet)));
System.out.println(new HashSet<>(Arrays.asList(alphabet))
.parallelStream()
.unordered()
.peek(System.out::println)
.reduce("", (a,b) -> a + b, (a,b) -> a + b));
}
Why is the reduction always* preserving the encounter order?
So far, after several dozen runs, output is the same
First of all unordered does not imply an actual shuffling; all it does it sets a flag for the Stream pipeline - that could later be leveraged.
A shuffle of the source elements could potentially be much more expensive then the operations on the stream pipeline themselves, so the implementation might choose not to do this(like in this case).
At the moment (tested and looked at the sources) of jdk-8 and jdk-9 - reduce does not take that into account. Notice that this could very well change in a future build or release.
Also when you say unordered - you actually mean that you don't care about that order and the stream returning the same result is not a violation of that rule.
For example notice this question/answer that explains that findFirst for example (just another terminal operation) changed to take unordered into consideration in java-9 as opposed to java-8.
To help explain this, I am going to reduce the scope of this string to ABCD.
The parallel stream will divide the string into two pieces: AB and CD. When we go to combine these later, the result of the AB side will be the first argument passed into the function, while the result of the CD side will be the second argument passed into the function. This is regardless of which of the two actually finishes first.
The unordered operator will affect some operations on a stream, such as a limit operation, it does not affect a simple reduce.
TLDR: .reduce() is not always preserving order, its result is based on the stream spliterator characteristics.
Spliterator
The encounter order of the stream depends on stream spliterator (None of the answers mentioned that before).
There are different spliterators based on the source stream. You can get the types of spliterators from the source code of those collections.
HashSet -> HashMap#KeySpliterator = Not ordered
ArrayDeque = Ordered
ArrayList = Ordered
TreeSet -> TreeMap#Spliterator = Ordered and sorted
logicbig.com - Ordering
logicbig.com - Stateful vs Stateless
Additionally you can apply .unordered() intermediate stream operation that specifies following operations in the stream should not rely on ordering.
Stream operations (mostly stateful) that are affected by spliterator and usage of .unordered() method are:
.findFirst()
.limit()
.skip()
.distinct()
Those operations will give us different results based on the order property of the stream and its spliterator.
.peek() method does not take ordering into consideration, if stream is executed in parallel it will always print/receive elements in unordered manner.
.reduce()
Now for the terminal .reduce() method. Intermediate operation .unordered() doesn't have any affect on type of spliterator (as #Eugene mentioned). But important notice, it still stays the same as it is in the source spliterator. If source spliterator is ordered, result of the .reduce() will be ordered, if source was unordered result of .reduce() will be unordered.
You are using new HashSet<>(Arrays.asList(alphabet)) to get the instance of the stream. Its spliterator is unordered. It was just a coincidence that you are getting your result ordered because you are using the single alphabet Strings as elements of the stream and unordered result is actually the same. Now if you would mix that with numbers or mix it with lower case and upper case then this doesn't hold true anymore. For example take following inputs, the first one is subset of the example you posted:
HashSet .reduce() - Unordered
"A","B","C","D","E","F" -> "ABCDEF"
"a","b","c","1","2","3","A","B","C" -> "a1Ab2Bc3C"
"Apple","Orange","Banana","Mango" -> "AppleMangoOrangeBanana"
TreeSet .reduce() - Ordered, Sorted
"A","B","C","D","E","F" -> "ABCDEF"
"a","b","c","1","2","3","A","B","C" -> "123ABCabc"
"Apple","Orange","Banana","Mango" -> "AppleBananaMangoOrange"
ArrayList .reduce() - Ordered
"A","B","C","D","E","F" -> "ABCDEF"
"a","b","c","1","2","3","A","B","C" -> "abc123ABC"
"Apple","Orange","Banana","Mango" -> "AppleOrangeBananaMango"
You see that testing .reduce() operation only with an alphabet source stream can lead to false conclusions.
The answer is .reduce() is not always preserving order, its result is based on the stream spliterator characteristics.
I have been trying to understand and showcase how Java streams implement a type of loop fusion under the hood, so that several operations can be fused into a single pass.
This first example here:
Stream.of("The", "cat", "sat", "on", "the", "mat")
.filter(w -> {
System.out.println("Filtering: " + w);
return w.length() == 3;
})
.map(w -> {
System.out.println("Mapping: " + w);
return w.toUpperCase();
})
.forEach(w -> System.out.println("Printing: " + w));
Has the following output (with the fusion of a single pass for each element quite clear):
Filtering: The
Mapping: The
Printing: THE
Filtering: cat
Mapping: cat
Printing: CAT
Filtering: sat
Mapping: sat
Printing: SAT
Filtering: on
Filtering: the
Mapping: the
Printing: THE
Filtering: mat
Mapping: mat
Printing: MAT
The second example is the same but I use the sorted() operation between the filter and map:
Stream.of("The", "cat", "sat", "on", "the", "mat")
.filter(w -> {
System.out.println("Filtering: " + w);
return w.length() == 3;
})
.sorted()
.map(w -> {
System.out.println("Mapping: " + w);
return w.toUpperCase();
})
.forEach(w -> System.out.println("Printing: " + w));
This has the following output:
Filtering: The
Filtering: cat
Filtering: sat
Filtering: on
Filtering: the
Filtering: mat
Mapping: The
Printing: THE
Mapping: cat
Printing: CAT
Mapping: mat
Printing: MAT
Mapping: sat
Printing: SAT
Mapping: the
Printing: THE
So my question is here, with the call to distinct, am I correct in thinking that because it is a "stateful" intermediate operation, it does not allow individual elements to be processed individually during a single pass (of all operations). Furthermore, because the sorted() stateful operation needs to process the entire input stream to produce a result, then the fusing technique cannot be deployed here, so that is why all the filtering occurs first, and then it fuses together the mapping and printing operations, after the sort? Please correct me if any of my assumptions are incorrect and feel free to elaborate on what I have already said.
In addition, how does it decide under the hood whether it can fuse elements together into a single pass or not, for example, when the distinct() operation exists, is there simply a flag that switches off to stop it from happening as it does when distinct() is not there?
A final query is, whilst the benefit of fusing operations into a single pass is sometimes obvious, for example, when combined with short-circuiting. What are the main benefits of fusing together operations such as a filter-map-forEach, or even a filter-map-sum?
The stateless operations (map, filter, flatMap, peek, etc) are fully fused; we build a chain of cascading Consumer objects and pour the data in. Each element can be operated upon independent of each other, so there's never anything "stuck" in the chain. (This is what Louis means by how fusion is implemented -- we compose the stages into a big function, and feed the data to that.)
Stateful operations (distinct, sorted, limit, etc) are more complicated, and vary more in their behavior. Each stateful operation gets to choose how it wants to implement itself, so it can choose the least intrusive approach possible. For example, distinct (under some circumstances), lets elements come out as they are vetted, whereas sorted is a full barrier. (The difference is in how much laziness is possible, and how well they handle things like infinite sources with a limit operation downstream.)
It is true that stateful operations generally undermine some of the benefits of fusion, but not all of them (the operations upstream and downstream can still be fused.)
In addition to the value of short-circuiting, which you observed, additional big wins from fusion include (a) you don't have to populate intermediate result containers between stages, and (b) the data you are dealing with is always "hot" in cache.
Yes, that's about right. All of this can be checked by looking at the source code.
Fusion isn't implemented the way I think you think it is, though. There's no looking at the whole pipeline and deciding how to fuse it; there's no flags or anything; it's just whether the operations are expressed as a StatefulOp object, which can run the entire stream up to that point and get all the output, or a StatelessOp which just decorates a Sink that says where the elements go. You can look at the source code for e.g. sorted and map for examples.
Edit
IMHO : I think it is not a duplicate because the two questions are trying to solve the problem in different ways and especially because they provide totally different technological skills (and finally, because I ask myself these two questions).
Question
How to aggregate items from an ordered stream, preferably in an intermediate operation ?
Context
Following my other question : Java8 stream lines and aggregate with action on terminal line
I've got a very large file of the form :
MASTER_REF1
SUBREF1
SUBREF2
SUBREF3
MASTER_REF2
MASTER_REF3
SUBREF1
...
Where SUBREF (if any) is applicable to MASTER_REF and both are complex objects (you can imagine it somewhat like JSON).
On first look I tried to group the lines with an operation returning null while agregating and a value when a group of line could be found (a "group" of lines ends if line.charAt(0)!=' ').
This code is hard to read and requires a .filter(Objects::nonNull).
I think one could achieve this using a .collect(groupingBy(...)) or a .reduce(...) but those are terminal operations which is :
not required in my case : lines are ordered and should be grouped by their position and groups of line are to be transformed afterwards (map+filter+...+foreach);
nor a good idea : I'm talking of a huge data file that is way bigger than the total amount of RAM+SWAP ... a terminal operation would saturate availiable resources (as said, by design I need to keep groups in memory because are to be transformed afterwards)
As I already noted in the answer to the previous question, it's possible to use some third-party libraries which provide partial reduction operations. One of such libraries is StreamEx which I develop by myself.
In StreamEx library the partial reduction operation is the intermediate stream operation which combines several input elements while some condition is met. Usually the condition is specified via BiPredicate applied to the pair of adjacent stream elements which returns true when elements should be combined together. The simplest way to combine elements is to make a List via StreamEx.groupRuns() method like this:
Stream<List<String>> records = StreamEx.of(Files.lines(path))
.groupRuns((line1, line2) -> !line2.startsWith("MASTER"));
Here we start a new record when the second of two adjacent lines starts with "MASTER" (as in your example). Otherwise we continue the previous record.
Note that such stream is still lazy. In sequential processing at most one intermediate List<String> is created at a time. Parallel processing is also supported, though turning the Files.lines stream into parallel mode rarely improves the performance (at least prior to Java-9).