Intersect operation with Flux - Project Reactor - java

Let's say I have var a = Flux.just("A", "B", "C") and var b = Flux.just("B", "C", "D").
I want to intersect the two, and the result should be the equivalent of a set intersection.
Something like a.intersect(b) or Flux.intersect(a, b) that would result in a Flux of ["B", "C"].
I could not find any operator that does this. Any ideas?

You could use join, filter, map and groupBy like so:
// Join the fluxes into tuples
a.join(b, s -> Flux.never(), s -> Flux.never(), Tuples::of)
    // Keep only matching pairs
    .filter(t -> t.getT1().equals(t.getT2()))
    // Unwrap back to a single value
    .map(Tuple2::getT1)
    // Remove duplicates
    .groupBy(f -> f)
    .map(GroupedFlux::key)
    .subscribe(System.out::println);
This results in a single subscription to each Flux, and it also works when there are duplicates.
Or you could write your own intersect method
public <T> Flux<T> intersect(Flux<T> f1, Flux<T> f2) {
    return f1.join(f2, f -> Flux.never(), f -> Flux.never(), Tuples::of)
        .filter(t -> t.getT1().equals(t.getT2()))
        .map(Tuple2::getT1)
        .groupBy(f -> f)
        .map(GroupedFlux::key);
}
// Use on its own
intersect(a, b).subscribe(System.out::println);
// Or with an existing flux
a.transform(f -> intersect(f, b)).subscribe(System.out::println);

My favoured approach would be something like:
Flux.merge(a, b)
    .groupBy(Function.identity())
    .filterWhen(g -> g.count().map(l -> l > 1))
    .map(g -> g.key())
    .subscribe(System.out::print); // Prints "BC"
(If a or b might contain duplicates, replace the first line with Flux.merge(a.distinct(), b.distinct()).)
Each publisher is only played once, and it's trivial to expand it to more than two publishers if necessary.
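For example, here is a rough sketch (my own extension, not from the original answer) of the same idea for three sources a, b and c, assuming none of them contains duplicates; a key must then be seen in all three groups to survive:
Flux.merge(a, b, c)
    .groupBy(Function.identity())
    .filterWhen(g -> g.count().map(n -> n == 3)) // present in all three sources
    .map(GroupedFlux::key)
    .subscribe(System.out::println);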

I like efficiency, so I like to use what is proven without relying too heavily on streaming (or fluxing) operations.
The disadvantage of this approach is the need to collect one of the fluxes into a sorted list. Perhaps you know in advance which Flux is shorter. It seems to me, however, that you are going to have to do something like this no matter what, since you have to compare each element of Flux A against all elements of Flux B (or at least until you find a match).
So, collect Flux A into a sorted list, and then there is no reason not to use Collections::binarySearch on the collected, sorted list.
a.collectSortedList()
    .flatMapMany(sorteda -> b.filter(be -> Collections.binarySearch(sorteda, be) >= 0))
    .subscribe(System.out::println);
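If b itself may contain duplicates that should only appear once in the result, a hedged variant of the same idea simply appends a distinct() step:
a.collectSortedList()
    .flatMapMany(sorteda -> b.filter(be -> Collections.binarySearch(sorteda, be) >= 0))
    .distinct()
    .subscribe(System.out::println);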

Related

Handling Mono Inside Flux Flatmap

I have a Flux of strings. For each string, I have to make a remote call. But the problem is that the method making the remote call actually returns a Mono of the response (obviously, since a single request produces a single response).
What should be the correct pattern to handle such cases? One solution I can think of is to make serial (or parallel) calls for the stream elements and reduce the responses to a single one and return.
Here's the code:
fluxObj.flatMap(a -> makeRemoteCall(a)) // the Mono of the response is flattened into the Flux
    .reduce(...)
I am unable to wrap my head around what happens inside the flatMap. The makeRemoteCall method returns a Mono, but flatMap returns a Flux of the response. First, why is this happening? Second, does it mean that the returned Flux contains a single response object (the one that was returned in the Mono)?
If the mapper Function returns a Mono, then it means that there will be (at most) one derived value for each source element in the Flux.
Having the Function return:
an empty Mono (eg. Mono.empty()) for a given value means that this source value is "ignored"
a valued Mono (like in your example) means that this source value is asynchronously mapped to another specific value
a Flux with several derived values for a given value means that this source value is asynchronously mapped to several values
For instance, given the following flatMap:
Flux.just("A", "B")
.flatMap(v -> Mono.just("value" + v))
Subscribing to the above Flux<String> and printing the emitted elements would yield:
valueA
valueB
Another fun example: with delays, one can get out of order results. Like this:
Flux.just(300, 100)
.flatMap(delay -> Mono.delay(Duration.ofMillis(delay))
.thenReturn(delay + "ms")
)
would result in a Flux<String> that yields:
100ms
300ms
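As a side note (my addition, not from the original answer): if source order must be preserved, concatMap waits for each inner publisher to complete before subscribing to the next. The same example written with concatMap would yield 300ms first and then 100ms:
Flux.just(300, 100)
    .concatMap(delay -> Mono.delay(Duration.ofMillis(delay))
        .thenReturn(delay + "ms"))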
If you look at the flatMap documentation, you can find the answer to your questions:
Transform the elements emitted by this Flux asynchronously into Publishers, then flatten these inner publishers into a single Flux through merging, which allows them to interleave.
Long story short:
@Test
public void testFlux() {
    Flux<String> oneString = Flux.just("1");
    oneString
        .flatMap(s -> testMono(s))
        .collectList()
        .subscribe(integers -> System.out.println("elements:" + integers));
}

private Mono<Integer> testMono(String s) {
    return Mono.just(Integer.valueOf(s + "0"));
}
The mapper s -> testMono(s), where testMono(s) is a Publisher (in your case makeRemoteCall(a)), transforms the String elements of oneString into Integer elements.
I collected the Flux result into a List and printed it. Console output:
elements:[10]
It means the resulting Flux after the flatMap operator contains just one element.
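Applied back to the original question, a minimal sketch, assuming makeRemoteCall returns Mono<Response> and that a hypothetical combine method merges two responses, would look like:
Mono<Response> combined = fluxObj
    .flatMap(a -> makeRemoteCall(a))       // each Mono response is flattened into the Flux
    .reduce((r1, r2) -> combine(r1, r2));  // combine is a hypothetical merge of two responses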

Parallel Stream behaving differently to Stream

I am having trouble comprehending why parallel stream and stream are giving a different result for the exact same statement.
List<String> list = Arrays.asList("1", "2", "3");
String resultParallel = list.parallelStream().collect(StringBuilder::new,
(response, element) -> response.append(" ").append(element),
(response1, response2) -> response1.append(",").append(response2.toString()))
.toString();
System.out.println("ResultParallel: " + resultParallel);
String result = list.stream().collect(StringBuilder::new,
(response, element) -> response.append(" ").append(element),
(response1, response2) -> response1.append(",").append(response2.toString()))
.toString();
System.out.println("Result: " + result);
ResultParallel: 1, 2, 3
Result: 1 2 3
Can somebody explain why this is happening and how I get the non-parallel version to give the same result as the parallel version?
The Java 8 Stream.collect method has the following signature:
<R> R collect(Supplier<R> supplier,
BiConsumer<R, ? super T> accumulator,
BiConsumer<R, R> combiner);
The BiConsumer<R, R> combiner is called only for parallel streams (in order to combine partial results into a single container), which is why the output of your first code snippet is:
ResultParallel: 1, 2, 3
In the sequential version the combiner doesn't get called (see this answer), therefore the following statement is ignored:
(response1, response2) -> response1.append(",").append(response2.toString())
and the result is different:
1 2 3
How to fix it? Check @Eugene's answer or this question and its answers.
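For instance, a minimal sketch of a fix using the joining collector (as @Eugene's answer below also suggests) produces the same result whether the stream is sequential or parallel:
String sequential = list.stream().collect(Collectors.joining(" "));
String parallel = list.parallelStream().collect(Collectors.joining(" "));
// both are "1 2 3"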
To understand why this is going wrong, consider this from the javadoc.
accumulator - an associative, non-interfering, stateless function that must fold an element into a result container.
combiner - an associative, non-interfering, stateless function that accepts two partial result containers and merges them, which must be compatible with the accumulator function. The combiner function must fold the elements from the second result container into the first result container.
What this is saying is that it should not matter whether the elements are collected by "accumulating" or "combining" or some combination of the two. But in your code, the accumulator and the combiner concatenate using a different separator. They are not "compatible" in the sense required by the javadoc.
That leads to inconsistent results depending on whether sequential or parallel streams are used.
In the parallel case, the stream is split into substreams¹ to be handled by different threads. This leads to a separate collection for each substream. The collections are then combined.
In the sequential case, the stream is not split. Instead, the stream is simply accumulated into a single collection, and no combining needs to take place.
Observations:
In general, for a stream of this size performing a simple transformation, parallelStream() is liable to make things slower.
In this specific case, the bottleneck with the parallelStream() version will be the combining step. That is a serial step, and it performs the same amount of copying as the entire serial pipeline. So, in fact, parallelization is definitely going to make things slower.
In fact, the lambdas do not behave correctly. They add an extra space at the start, and double some spaces if the combiner is used. A more correct version would be:
String result = list.stream().collect(
    StringBuilder::new,
    (b, e) -> b.append(b.length() == 0 ? "" : " ").append(e),
    (l, r) -> l.append(l.length() == 0 ? "" : " ").append(r)).toString();
The joining collector (Collectors.joining) is a far simpler and more efficient way to concatenate streams of strings. (Credit: @Eugene)
¹ In this case, the substreams each have only one element. For a longer list, you would typically get as many substreams as there are worker threads, and the substreams would contain multiple elements.
As a side note, even if you replace the , with a space in the combiner, your results are still going to differ (I slightly altered the code to make it more readable):
String resultParallel = list.parallelStream().collect(
    StringBuilder::new,
    (builder, elem) -> builder.append(" ").append(elem),
    (left, right) -> left.append(" ").append(right)).toString();
String result = list.stream().collect(
    StringBuilder::new,
    (builder, elem) -> builder.append(" ").append(elem),
    (left, right) -> left.append(" ").append(right)).toString();
System.out.println("ResultParallel: ->" + resultParallel + "<-"); // -> 1  2  3<-
System.out.println("Result: ->" + result + "<-"); // -> 1 2 3<-
Notice how you end up with a leading space, and with doubled spaces in the parallel result.
The java-doc has the hint:
combiner... must be compatible with the accumulator function
If you want to join, there are simpler options like:
String.join(",", yourList)
yourList.stream().collect(Collectors.joining(","))

Java Map Lambda Exception

I have a List of Maps with certain keys that map to String values.
Something like List<Map<String,String>> aMapList;
Objective : Stream over this List of maps and collect values of a single key in all Maps.
How I'm doing this ->
key = "somekey";
aMapList.stream().map(a -> a.get(key)).collect(Collectors.averagingInt());
The Problem:
I get exceptions because a.get(key) returns null if there is no such key, and averaging then fails. How do I check for this, or make the lambda ignore any such maps and move on?
I do know that I can add a filter on a -> a.containsKey(key) and then proceed.
Edit: I can also add more filters, or simply check multiple conditions in one filter.
Possible Solution:
aMapList.stream().filter(a -> a.containsKey(key))
    .map(a -> a.get(key)).collect(Collectors.averagingInt(Integer::parseInt));
Can this be made prettier? Instead of halting the operation, can I simply skip over such maps?
Is there some more generic way to skip over exceptions or nulls?
For example, we can expand the lambda and put a try-catch block inside, but I still need to return something; what if I want to do the equivalent of "continue"?
E.g.
(a -> { return a.get(key); })
can be expanded to
(a -> { try { return a.get(key); } catch (Exception e) { return null; } })
The above still returns a null instead of just skipping over the element.
I'm selecting the best answer for giving two options, but I do not find either of them prettier. Chaining filters seems to be the solution here.
How about wrapping the result with Optional:
List<Optional<String>> values = aMapList.stream()
.map(a -> Optional.ofNullable(a.get(key)))
.collect(Collectors.toList());
Later code will know to expect possible empty elements.
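For what it's worth (an addition of mine, not part of the original answer), on Java 9+ the empty elements can later be dropped with Optional::stream:
List<String> present = values.stream()
    .flatMap(Optional::stream)    // drops the empty Optionals
    .collect(Collectors.toList());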
The solution you propose has a potential bug for maps that allow null values. For example:
Map<String, String> aMap = new HashMap<>();
aMap.put("somekey", null);
aMapList.add(aMap);
aMapList.stream()
    .filter(a -> a.containsKey("somekey")) // true returned by containsKey
    .map(a -> a.get("somekey"))            // null returned by get
    .collect(Collectors.toList());
Based on the Map documentation, and on your comment under your question, you're not actually getting an exception from a.get(key). Rather, that expression produces a null value, and you're having problems later when you run into these null values. So simply filtering out these null values right away should work just fine:
aMapList.stream()
.map(a -> a.get(key))
.filter(v -> v != null)
.collect(Collectors.toList());
This is prettier, simpler, and performs better than the workaround in your question.
I should mention that I usually prefer the Optional<> type when dealing with null values, but this filtering approach works better in this case since you specifically said you wanted to ignore elements where the key doesn't exist in a map list.
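If the end goal really is the average from the question, here is a rough sketch under the assumption that the map values are numeric strings:
Double average = aMapList.stream()
    .map(a -> a.get(key))
    .filter(v -> v != null)   // skip maps that don't have the key
    .collect(Collectors.averagingInt(Integer::parseInt));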
The simplest I could come up with was:
aMapList.stream()
.filter(map -> map.containsKey(key))
.map(map -> map.get(key))
.collect(Collectors.toList());
By formatting the lambda in this fashion, it is easier to see the distinct steps that the code processes.
Although I reckon this is not exactly a prettier approach, you could do:
aMapList.stream().map(a -> a.containsKey(key) ? a.get(key) : null).collect(Collectors.toList());

Java Lambda create a filter with a predicate function which determines if the Levenshtein distance is greater than 2

I have a query to get the most similar value, but I need to define a threshold on the Levenshtein distance: if the score is more than 2, I don't want to see the value as part of the recommendation.
String recommendation = candidates.parallelStream()
    .map(String::trim)
    .filter(s -> !s.equals(search))
    .min((a, b) -> Integer.compare(
        cache.computeIfAbsent(a, k -> StringUtils.getLevenshteinDistance(
            Arrays.stream(search.split(" ")).sorted().toString(),
            Arrays.stream(k.split(" ")).sorted().toString())),
        cache.computeIfAbsent(b, k -> StringUtils.getLevenshteinDistance(
            Arrays.stream(search.split(" ")).sorted().toString(),
            Arrays.stream(k.split(" ")).sorted().toString()))))
    .get();
Your question is about one single filtering operation: how to exclude the elements with a score of more than 2. You need to write a predicate for it. The simplest form of a predicate that can be written without knowing any details about the rest of your application logic is the following:
.filter(s -> StringUtils.getLevenshteinDistance(search, s) <= 2)
Considering that you cache the Levenshtein scores in a HashMap, the predicate should be rewritten this way:
.filter(s -> cache.computeIfAbsent(s, k -> StringUtils.getLevenshteinDistance(search, k)) <= 2)
Now, if you want to do anything else with the elements like splitting, reordering and joining them, you can further enhance this code, but that's outside of the scope of your question.
Nevertheless, speaking of the splitting/joining, let me correct an error in your code. The line
Arrays.stream(search.split(" ")).sorted().toString()
does not really do anything useful. It would just print a hashcode of a Stream instance. I guess you wanted to get this done:
Arrays.stream(s.split(" ")).sorted().collect(Collectors.joining(" "))
This code will reorder a word chain alphabetically: "Malus Casus" -> "Casus Malus"
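For completeness, here is a rough sketch (my own assembly, not part of the original answer) combining the threshold filter with the min-by-distance selection; it assumes cache is a Map<String, Integer> (ideally a ConcurrentHashMap when used from a parallel stream) and yields an empty Optional when nothing is within distance 2:
Optional<String> recommendation = candidates.parallelStream()
    .map(String::trim)
    .filter(s -> !s.equals(search))
    .filter(s -> cache.computeIfAbsent(s,
        k -> StringUtils.getLevenshteinDistance(search, k)) <= 2) // keep only close matches
    .min(Comparator.comparingInt(cache::get));                    // then pick the closest one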

Is there a way to zip two streams?

This question arose from the answer to another question, where map and reduce were suggested to calculate a sum concurrently.
In that question there is a complexCalculation(e), and I was wondering how to parallelise even further by splitting the calculation into two parts, so that complexCalculation(e) = part1(e) * part2(e). I wonder whether it is possible to calculate part1 and part2 on a collection concurrently (using map() again) and then zip the two resulting streams, so that the ith element of both streams is combined with *, making the resulting stream equal to the stream obtained by mapping complexCalculation(e) over that collection. In code this would look like:
Stream map1 = bigCollection.parallelStream().map(e -> part1(e));
Stream map2 = bigCollection.parallelStream().map(e -> part2(e));
// preferably map1 and map2 are computed concurrently...
Stream result = map1.zip(map2, (e1, e2) -> e1 * e2);
result.equals(bigCollection.map(e -> complexCalculation(e))); //should be true
So my question is: does there exist some functionality like the zip function I tried to describe here?
parallelStream() is guaranteed to complete in the order submitted. This means you cannot assume that two parallel streams can be zipped together like this.
Your original bigCollection.map(e -> complexCalculation(e)) is likely to be faster unless your collection is actually smaller than the number of CPUs you have.
If you really want to parallelize part1 and part2 (for example your bigCollection has very few elements, less than CPU cores), you can do the following trick. Suppose you have two methods part1 and part2 in the current class:
public long part1(Type t) { ... }
public long part2(Type t) { ... }
Create a stream of two functions created from these methods and process it in parallel like this:
bigCollection.parallelStream()
    .map(e -> Stream.<ToLongFunction<Type>>of(this::part1, this::part2)
        .parallel()
        .mapToLong(fn -> fn.applyAsLong(e))
        .reduce(1, (a, b) -> a * b))
    . // continue the outer stream operations
However, this is a very rare case. As @PeterLawrey noted, if your outer collection is big enough there is no need to parallelize part1 and part2; instead you will handle separate elements in parallel.
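As for the zip question itself, the JDK streams API has no built-in zip. A common workaround, sketched here under the assumption that bigCollection is a random-access List and that part1/part2 return long, is to materialise the two mapped results and combine them by index:
List<Long> p1 = bigCollection.parallelStream().map(e -> part1(e)).collect(Collectors.toList());
List<Long> p2 = bigCollection.parallelStream().map(e -> part2(e)).collect(Collectors.toList());
List<Long> zipped = IntStream.range(0, p1.size())
    .mapToObj(i -> p1.get(i) * p2.get(i)) // combine the ith elements of both lists
    .collect(Collectors.toList());
Note that the two map passes run one after the other here, each parallel internally, rather than truly concurrently with each other.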
