Reactor's Flux: fan out and zip it? (Java)

I am trying to do the following operations on a Flux/Publisher that can only be read once (think database results that can be read once). This question is generic enough that it can be answered in a functional-programming context without Reactor knowledge.
1. Count the unique items.
2. Check whether a given element exists.
3. Don't call the publisher/flux generator multiple times.
distinctAndHasElement(4, Flux.just(1,2,3,3,4,4,5));
Mono<Pair<Long, Boolean>> distinctAndHasElement(int toCheck, Flux<Integer> intsFlux) {
    // Code that doesn't compile, due to the use of a non-final local variable
    boolean found = false;
    return intsFlux.map(x -> {
            if (toCheck == x) {
                found = true; // compile error: a variable captured in a lambda must be effectively final
            }
            return x;
        })
        .distinct()
        .count()
        .map(x -> Pair.of(x, found));
}
We just need the ability to fan out into two functions that operate on the same type/domain, and to zip the final result.
The following doesn't work due to constraint #3:
Flux<Integer> distinct = intsFlux.distinct();
Mono<Boolean> found = distinct.hasElement(toCheck);
Mono<Long> count = distinct.count();
return Mono.zip(count, found);

What you're attempting is a reduction of your dataset: you create a single result by merging your initial elements.
Note that count can be considered a kind of reduction, and in your case you want an advanced kind of count operation, one that also checks whether at least one of the input elements equals a given value.
With Reactor (and many other stream frameworks), you can use the reduce operator.
Let's try your first example with it:
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;
import reactor.util.function.Tuple2;
import reactor.util.function.Tuples;

public class CountAndCheck {

    static Mono<Tuple2<Long, Boolean>> distinctAndHasElement(int toCheck, Flux<Integer> intsFlux) {
        return intsFlux
                .distinct()
                .reduce(Tuples.of(0L, false), (intermediateResult, nextElement) ->
                        Tuples.of(intermediateResult.getT1() + 1L,
                                  intermediateResult.getT2() || toCheck == nextElement));
    }

    public static void main(String[] args) {
        System.out.println(distinctAndHasElement(2, Flux.just(1, 2, 2, 3, 4, 4)).block());
    }
}
The above program prints: [4,true]
Note: You can use the scan operator instead of reduce to get a flux of every intermediate step of the reduction. It can be useful for understanding how the reduction is performed; see the sketch below.
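For illustration, here is a minimal sketch (my own variation of the pipeline above, not part of the original answer) that swaps reduce for scan; each emitted tuple is an intermediate (count, found) state:

Flux.just(1, 2, 2, 3, 4, 4)
    .distinct()
    .scan(Tuples.of(0L, false), (acc, next) ->
            Tuples.of(acc.getT1() + 1L, acc.getT2() || next == 2))
    .subscribe(System.out::println);
// prints [0,false], [1,false], [2,true], [3,true], [4,true]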

You can broadcast your Flux as described in the documentation, using publish().autoConnect(2): the upstream is subscribed to only once, after both downstream consumers (here hasElement and count, joined by Mono.zip) have subscribed.
Flux<Integer> distinct = intsFlux.distinct().publish().autoConnect(2);
Mono<Boolean> found = distinct.hasElement(toCheck);
Mono<Long> count = distinct.count();
return Mono.zip(count, found);

Related

How do I turn this expression into a lambda expression?

I'd like to turn what I'm doing into a lambda expression: I iterate over a list (listRegistrationTypeWork), check whether the child list (getRegistrationTypeWorkAuthors) is not null, and if so iterate over it looking for an authorCoauthor equal to type, incrementing a count to find out how many records within the lists have this same type.
public int qtyMaximumWorksByAuthorCoauthor(AuthorCoauthor type) {
    int count = 0;
    for (RegistrationTypeWork tab : listRegistrationTypeWork) {
        if (CollectionUtils.isNotEmpty(tab.getRegistrationTypeWorkAuthors())) {
            for (RegistrationTypeWorkAuthors author : tab.getRegistrationTypeWorkAuthors()) {
                if (author.getAuthorCoauthor().equals(type))
                    count++;
            }
        }
    }
    return count;
}
Although your statement is not clear about what transforming to a lambda expression would mean, I am assuming you would like to turn your imperative looping into a functional, stream-and-lambda-based approach.
This should be straightforward using:
filter to filter out the unwanted values from both of your collections
flatMap to flatten all inner collections into a single stream so that you can operate your count on it as a single source
public int qtyMaximumWorksByAuthorCoauthor(AuthorCoauthor type) {
    return (int) listRegistrationTypeWork.stream()
            .filter(tab -> tab.getRegistrationTypeWorkAuthors() != null)
            .flatMap(tab -> tab.getRegistrationTypeWorkAuthors().stream())
            .filter(author -> type.equals(author.getAuthorCoauthor()))
            .count(); // count() returns a long, hence the cast to match the int return type
}
In addition to Thomas' fine comment, I think you would want to write your stream something like this:
long count = listRegistrationTypeWork.stream()
        // map each RegistrationTypeWork to an Optional of its list of RegistrationTypeWorkAuthors,
        // so that lists that are actually null are not mapped
        .map(registrationTypeWork -> Optional.ofNullable(registrationTypeWork.getRegistrationTypeWorkAuthors()))
        // this removes all empty Optionals from the stream
        .flatMap(Optional::stream)
        // this turns the stream of lists of RegistrationTypeWorkAuthors into a stream of plain RegistrationTypeWorkAuthors
        .flatMap(Collection::stream)
        // this filters out RegistrationTypeWorkAuthors which are of a different type
        .filter(registrationTypeWorkAuthors -> type.equals(registrationTypeWorkAuthors.getAuthorCoauthor()))
        .count();
// count() returns a long, so you either need to return a long from your method or cast to int
return (int) count;
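Note that Optional::stream was only added in Java 9; on Java 8 the same step can be written as .filter(Optional::isPresent).map(Optional::get).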

How to cache and replay the items of a Supplier<Stream<T>>

Given a Supplier<Stream<T>> dataSrc, I would like to cache the stream's items for further traversals of the same sequence of elements. In this case, assume that dataSrc always produces the same sequence (e.g. getting a Stream<Integer> with the temperatures in Celsius for March; see the example usage below). Thus, option 1) is to first collect the stream's items, although this wastes one additional traversal just to add those items into a collection:
Supplier<Stream<T>> dataSrc = ...
List<T> cache = dataSrc.get().collect(toList()); // **additional** traversal to collect items
cache.stream().reduce(…) // 1st traversal
cache.stream().reduce(…) // 2nd traversal
... // Nth traversal
I would like to avoid the additional traversal to collect the items, and to hide the explicit cache variable inside the Supplier<> in such a way that on the first traversal the items are implicitly cached and further traversals read from that cache. I think this is similar to the idea of the cache() method of the Reactor Project for reactive streams.
Thus, I am outlining an alternative in the following cache() method implementation, although it already has (at least) two problems: 1) onClose is not called when a traversal finishes (and I cannot figure out any way of detecting the end of a traversal); 2) if the first traversal never ends, then the cache will never be filled.
Supplier<Stream<T>> dataSrc = cache(...)
dataSrc.get().reduce(…) // 1st traversal
dataSrc.get().reduce(…) // 2nd traversal
... // Nth traversal

static <T> Supplier<Stream<T>> cache(Supplier<Stream<T>> dataSrc) {
    final List<T> cache = new ArrayList<>();
    final AtomicBoolean started = new AtomicBoolean();
    final AtomicBoolean isCached = new AtomicBoolean();
    return () -> {
        if (isCached.get()) return cache.stream();
        if (!started.getAndSet(true)) {
            return dataSrc
                    .get()
                    .peek(cache::add)
                    .onClose(() -> isCached.set(true));
        }
        return dataSrc.get();
    };
}
Question:
Is there any better approach to achieving a utility cache() function that returns a new Stream<T> which caches items on the first stream traversal (without an implicit additional traversal to collect them first), and creates further Stream objects from that cache?
Usage Example:
Here I am getting a stream with the temperatures in March from the World Weather Online API. To execute it you must include a dependency on AsyncHttpClient and a valid API key in the given URI.
Pattern pat = Pattern.compile("\\n");
boolean[] isEven = {true};
CompletableFuture<Stream<Integer>> temps = asyncHttpClient()
        .prepareGet("http://api.worldweatheronline.com/premium/v1/past-weather.ashx?q=37.017,-7.933&date=2018-03-01&enddate=2018-03-31&tp=24&format=csv&key=715b185b36034a4c879141841182802")
        .execute()
        .toCompletableFuture()
        .thenApply(Response::getResponseBody)
        .thenApply(pat::splitAsStream)
        .thenApply(str -> str
                .filter(w -> !w.startsWith("#"))     // filter comments
                .skip(1)                             // skip line: Not Available
                .filter(l -> isEven[0] = !isEven[0]) // filter even lines
                .map(line -> line.substring(14, 16)) // extract temperature in Celsius
                .map(Integer::parseInt)
        );
Note that a CompletableFuture<Stream<Integer>> is functionally compliant with a Supplier<Stream<Integer>> (e.g. temps::join). However, although the CompletableFuture caches the resulting stream, that stream cannot be traversed twice.
Problem 1: The following code throws IllegalStateException: stream has already been operated upon or closed
out.println(temps.join().distinct().count());
out.println(temps.join().max(Integer::compare)); // throws IllegalStateException
Problem 2: Collecting it into a List induces a first traversal, so we end up with 3 traversals instead of 2:
CompletableFuture<List<Integer>> list = temps.thenApply(str -> str.collect(toList()));
out.println(list.join().stream().distinct().count()); // 2 traversals
out.println(list.join().stream().distinct().max(Integer::compare)); // 1 traversal
Goal: Store items in a cache on the first traversal. Every time the stream retrieves an item, it should store it in an internal cache that is then used on further traversals.
Supplier<Stream<Integer>> cache = Cache.of(temps::join);
out.println(cache.get().distinct().count()); // 1 traversal
out.println(cache.get().max(Integer::compare)); // 1 traversal, from cache
I think the only way to detect the end of a Stream traversal is through its iterator() or spliterator(). Thus, maybe a better option for getting a replayable Stream is to record its items from its spliterator (done by the Recorder class in the example below) and then implement a new Spliterator that reads the previously recorded items (done by cacheIterator()). In this solution I made the getOrAdvance() method of Recorder synchronized, to guarantee that just one resulting stream will get a new item from the source.
So, Cache.of(dataSrc) creates a chain of:
dataSrc ----> Recorder ----> cacheIterator() ----> Stream
Side notes:
The resulting stream from the method Cache.of() permits limited parallelism. For better splitting support, cacheIterator() below should return a Spliterator implementation instead, such as AbstractList.RandomAccessSpliterator.
Although it is not a requirement, the Recorder/cacheIterator() solution also works with infinite data sources that can be short-circuited later.
E.g. it can cache the items of the infinite stream nrs and print the output below, first without a cache and then with one (i.e. nrsReplay):
Random rnd = new Random();
Supplier<Stream<String>> nrs = () -> Stream.generate(() -> rnd.nextInt(99)).map(Object::toString);
IntStream.range(1, 6).forEach(size -> out.println(nrs.get().limit(size).collect(joining(","))));
System.out.println();
Supplier<Stream<String>> nrsReplay = Cache.of(nrs);
IntStream.range(1, 6).forEach(size -> out.println(nrsReplay.get().limit(size).collect(joining(","))));
Output:
32
65,94
94,19,34
72,77,66,18
88,41,34,97,28
93
93,65
93,65,71
93,65,71,40
93,65,71,40,68
class Cache {
    public static <T> Supplier<Stream<T>> of(Supplier<Stream<T>> dataSrc) {
        final Spliterator<T> src = dataSrc.get().spliterator(); // !!! maybe it should be lazy and memoized !!!
        final Recorder<T> rec = new Recorder<>(src);
        return () -> {
            // cacheIterator() starts on index 0 and reads data from src or
            // from an internal cache of the Recorder.
            Spliterator<T> iter = rec.cacheIterator();
            return StreamSupport.stream(iter, false);
        };
    }

    static class Recorder<T> {
        final Spliterator<T> src;
        final List<T> cache = new ArrayList<>();
        final long estimateSize;
        boolean hasNext = true;

        public Recorder(Spliterator<T> src) {
            this.src = src;
            this.estimateSize = src.estimateSize();
        }

        public synchronized boolean getOrAdvance(final int index, Consumer<? super T> cons) {
            if (index < cache.size()) {
                // If it is in the cache then just get it from the corresponding index.
                cons.accept(cache.get(index));
                return true;
            } else if (hasNext) {
                // If not in the cache then advance the src spliterator.
                hasNext = src.tryAdvance(item -> {
                    cache.add(item);
                    cons.accept(item);
                });
            }
            return hasNext;
        }

        public Spliterator<T> cacheIterator() {
            return new Spliterators.AbstractSpliterator<T>(estimateSize, src.characteristics()) {
                int index = 0;

                public boolean tryAdvance(Consumer<? super T> cons) {
                    return getOrAdvance(index++, cons);
                }

                public Comparator<? super T> getComparator() {
                    return src.getComparator();
                }
            };
        }
    }
}
You can use Guava's Suppliers#memoize function to turn a given Supplier into a caching ("memoizing") one.
Turn your dataSrc Supplier<Stream<T>> into a Supplier<List<T>> that collects the stream
Wrap it with Suppliers#memoize
This would be your cache() method:
private static <T> Supplier<Stream<T>> cache(Supplier<Stream<T>> dataSrc) {
    Supplier<List<T>> memoized = Suppliers.memoize(() -> dataSrc.get().collect(toList()));
    return () -> memoized.get().stream();
}
(When mixing in Guava you might need to switch between Guava's com.google.common.base.Supplier and java.util.function.Supplier; they can easily be converted back and forth, although in this case it's not even necessary.)
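For reference, bridging the two Supplier types is a one-liner in either direction via method references (a minimal sketch; the variable names are illustrative):

java.util.function.Supplier<String> jdk = () -> "hello";
com.google.common.base.Supplier<String> guava = jdk::get;   // JDK -> Guava
java.util.function.Supplier<String> backToJdk = guava::get; // Guava -> JDK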
Example
Assume a simple Integer stream that returns the first 5 natural numbers and reports computation to stdout:
private static Supplier<Stream<Integer>> getDataSrc() {
    return () -> IntStream.generate(new IntSupplier() {
        private int i = 0;

        @Override
        public int getAsInt() {
            System.out.println("Computing next i: " + (i + 1));
            return i += 1;
        }
    }).limit(5).boxed();
}
Then running the non-memoized version
Supplier<Stream<Integer>> dataSrc = getDataSrc();
System.out.println(dataSrc.get().collect(toList()));
System.out.println(dataSrc.get().collect(toList()));
yields
Computing next i: 1
Computing next i: 2
Computing next i: 3
Computing next i: 4
Computing next i: 5
[1, 2, 3, 4, 5]
Computing next i: 1
Computing next i: 2
Computing next i: 3
Computing next i: 4
Computing next i: 5
[1, 2, 3, 4, 5]
And running the memoized version
Supplier<Stream<Integer>> dataSrc = cache(getDataSrc());
System.out.println(dataSrc.get().collect(toList()));
System.out.println(dataSrc.get().collect(toList()));
yields
Computing next i: 1
Computing next i: 2
Computing next i: 3
Computing next i: 4
Computing next i: 5
[1, 2, 3, 4, 5]
[1, 2, 3, 4, 5]
If using the Reactor Project is an option, then you can simply convert the Supplier<Stream<T>> to a Flux<T>, which already provides the utility cache() method, and from then on use Flux<T> operations rather than Stream<T> operations.
Regarding the example of the original post, where temps is a CompletableFuture<Stream<Integer>> holding the result of an HTTP request transformed into a sequence of temperatures in Celsius, we can perform both queries in the following way:
Flux<Integer> cache = Flux.fromStream(temps::join).cache();
cache.distinct().count().subscribe(out::println);
cache.reduce(Integer::max).subscribe(out::println);
This solution avoids: 1) IllegalStateException on further traversals of this sequence; 2) a first traversal to collect items in a cache.
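(Flux#cache() works by storing the emitted elements and replaying them to late subscribers, so the underlying stream obtained from temps is consumed only once, no matter how many operations subscribe to cache.)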

Limit a stream and find out if there are pending elements

I have the following code that I want to translate to Java 8 streams:
public ReleaseResult releaseReources() {
    List<String> releasedNames = new ArrayList<>();
    Stream<SomeResource> stream = this.someResources();
    Iterator<SomeResource> it = stream.iterator();
    while (it.hasNext() && releasedNames.size() < MAX_TO_RELEASE) {
        SomeResource resource = it.next();
        if (!resource.isTaken()) {
            resource.release();
            releasedNames.add(resource.getName());
        }
    }
    return new ReleaseResult(releasedNames, it.hasNext(), MAX_TO_RELEASE);
}
Method someResources() returns a Stream<SomeResource> and ReleaseResult class is as follows:
public class ReleaseResult {
    private int releasedCount;
    private List<String> releasedNames;
    private boolean hasMoreItems;
    private int releaseLimit;

    public ReleaseResult(List<String> releasedNames,
                         boolean hasMoreItems, int releaseLimit) {
        this.releasedNames = releasedNames;
        this.releasedCount = releasedNames.size();
        this.hasMoreItems = hasMoreItems;
        this.releaseLimit = releaseLimit;
    }

    // getters & setters
}
My attempt so far:
public ReleaseResult releaseReources() {
    List<String> releasedNames = this.someResources()
            .filter(resource -> !resource.isTaken())
            .limit(MAX_TO_RELEASE)
            .peek(SomeResource::release)
            .map(SomeResource::getName)
            .collect(Collectors.toList());
    return new ReleaseResult(releasedNames, ???, MAX_TO_RELEASE);
}
The problem is that I can't find a way to know if there are pending resources to process. I've thought of using releasedNames.size() == MAX_TO_RELEASE, but this doesn't take into account the case where the stream of resources has exactly MAX_TO_RELEASE elements.
Is there a way to do the same with Java 8 streams?
Note: I'm not looking for answers like "you don't have to do everything with streams" or "using loops and iterators is fine". I'm OK if using an iterator and a loop is the only way or just the best way. It's just that I'd like to know if there's a non-murky way to do the same.
Since you don’t wanna hear that you don’t need streams for everything and loops and iterators are fine, let’s demonstrate it by showing a clean solution, not relying on peek:
public ReleaseResult releaseReources() {
    return this.someResources()
        .filter(resource -> !resource.isTaken())
        .limit(MAX_TO_RELEASE + 1)
        .collect(
            () -> new ReleaseResult(new ArrayList<>(), false, MAX_TO_RELEASE),
            (result, resource) -> {
                List<String> names = result.getReleasedNames();
                if (names.size() == MAX_TO_RELEASE) result.setHasMoreItems(true);
                else {
                    resource.release();
                    names.add(resource.getName());
                }
            },
            (r1, r2) -> {
                List<String> names = r1.getReleasedNames();
                names.addAll(r2.getReleasedNames());
                if (names.size() > MAX_TO_RELEASE) {
                    r1.setHasMoreItems(true);
                    names.remove(MAX_TO_RELEASE);
                }
            }
        );
}
This assumes that // getters & setters includes getters and setters for all non-final fields of your ReleaseResult. And that getReleasedNames() returns the list by reference. Otherwise you would have to rewrite it to provide a specialized Collector having special non-public access to ReleaseResult (implementing another builder type or temporary storage would be an unnecessary complication, it looks like ReleaseResult is already designed exactly for that use case).
We could conclude that for any nontrivial loop code that doesn’t fit into the stream’s intrinsic operations, you can find a collector solution that basically does the same as the loop in its accumulator function, but suffers from the requirement of always having to provide a combiner function. Ok, in this case we can prepend a filter(…).limit(…) so it’s not that bad…
I just noticed, if you ever dare to use that with a parallel stream, you need a way to reverse the effect of releasing the last element in the combiner in case the combined size exceeds MAX_TO_RELEASE. Generally, limits and parallel processing never play well.
I don't think there's a nice way to do this. I've found a hack that does it lazily: convert the Stream to an Iterator, convert the Iterator back to another Stream, do the stream operations, then finally test the Iterator for a next element!
Iterator<SomeResource> it = this.someResources().iterator();
List<String> list = StreamSupport.stream(Spliterators.spliteratorUnknownSize(it, Spliterator.ORDERED), false)
        .filter(resource -> !resource.isTaken())
        .limit(MAX_TO_RELEASE)
        .peek(SomeResource::release)
        .map(SomeResource::getName)
        .collect(Collectors.toList());
return new ReleaseResult(list, it.hasNext(), MAX_TO_RELEASE);
The only thing I can think of is
List<SomeResource> list = someResources().collect(Collectors.toList()); // a List, rather than a Stream, is required
List<Integer> indices = IntStream.range(0, list.size())
        .filter(i -> !list.get(i).isTaken())
        .limit(MAX_TO_RELEASE)
        .boxed() // IntStream has no collect(Collector), so box first
        .collect(Collectors.toList());
List<String> names = indices.stream()
        .map(list::get)
        .peek(SomeResource::release)
        .map(SomeResource::getName)
        .collect(Collectors.toList());
Then (I think) there are unprocessed elements if
names.size() == MAX_TO_RELEASE
&& (indices.isEmpty() || indices.get(indices.size() - 1) < list.size() - 1)

How should I check whether a Stream<T> is sorted?

With an Iterable<T>, it's easy:
T last = null;
for (T t : iterable) {
    if (last != null && last.compareTo(t) > 0) {
        return false;
    }
    last = t;
}
return true;
But I can't think of a clean way to do the same thing for a Stream<T> that avoids consuming all the elements when it doesn't have to.
There are several methods to iterate over the successive pairs of a stream. For example, you can check this question. Of course, my favourite method is to use the library I wrote:
boolean unsorted = StreamEx.of(sourceStream)
        .pairMap((a, b) -> a.compareTo(b) > 0)
        .has(true);
It's a short-circuiting operation: it will finish as soon as it finds a misordered pair. It also works fine with parallel streams.
This is a sequential, state-holding solution:
IntStream stream = IntStream.of(3, 3, 5, 6, 6, 9, 10);
final AtomicInteger max = new AtomicInteger(Integer.MIN_VALUE);
boolean sorted = stream.allMatch(n -> n >= max.getAndSet(n));
Parallelizing would require introducing ranges. The state, max, might be handled differently, but the above seems simplest.
You can grab the Stream's underlying spliterator and check whether it has the SORTED characteristic. Since spliterator() is a terminal operation, you can't use the Stream afterwards (but you can create another one from this spliterator; see also Convert Iterable to Stream using Java 8 JDK).
For example:
Stream<Integer> st = Stream.of(1, 2, 3);
boolean isSorted = st.spliterator().hasCharacteristics(Spliterator.SORTED); // false

Stream<Integer> st2 = Stream.of(1, 2, 3).sorted();
boolean isSorted2 = st2.spliterator().hasCharacteristics(Spliterator.SORTED); // true
My example shows that the SORTED characteristic appears only if you get the Stream from a source that reports the SORTED characteristic or you call sorted() at some point in the pipeline.
One could argue that Stream.iterate(0, x -> x + 1); creates a SORTED stream, but there is no knowledge about the semantics of the function applied iteratively. The same applies to Stream.of(...).
If the pipeline is infinite, then this is the only way to know. If not, and the spliterator does not report this characteristic, you need to go through the elements and see whether they satisfy the sorted order you are looking for.
This is what you already did with your iterator approach, but then you need to consume some elements of the Stream (in the worst case, all of them). You can make the task parallelizable with some extra code; then it's up to you to decide whether it's worth it or not...
You could hijack a reduction operation to save the last value and compare it to the current value, throwing an exception if a pair isn't sorted (it has to be an unchecked exception, since a lambda can't throw a checked one):
list.stream().reduce((last, curr) -> {
    if (((Comparable) curr).compareTo(last) < 0) {
        throw new IllegalStateException("not sorted");
    }
    return curr;
});
EDIT: I forked another answer's example and replaced it with my code to show it only does the requisite number of checks.
http://ideone.com/ZMGnVW
You could use allMatch with a multi-line lambda, checking the current value against the previous one. You'll have to wrap the last value into an array, though, so the lambda can modify it.
// infinite stream with one pair of unsorted numbers
IntStream s = IntStream.iterate(0, x -> x != 1000 ? x + 2 : x - 1);
// terminates as soon as the first unsorted pair is found
int[] last = {Integer.MIN_VALUE};
boolean sorted = s.allMatch(x -> {
    boolean b = x >= last[0];
    last[0] = x;
    return b;
});
Alternatively, just get the iterator from the stream and use a simple loop.
A naive solution uses the stream's Iterator:
public static <T extends Comparable<T>> boolean isSorted(Stream<T> stream) {
    Iterator<T> i = stream.iterator();
    if (!i.hasNext()) return true;
    T current = i.next();
    while (i.hasNext()) {
        T next = i.next();
        if (current == null || current.compareTo(next) > 0) return false;
        current = next;
    }
    return true;
}
Edit: It would also be possible to use a spliterator to parallelize the task, but the gains would be questionable and the increase in complexity is probably not worth it.
I don't know how good it is, but I have just got an idea:
Make a list out of your stream (of Integers, Strings, or anything).
I have written this for a List<String> listOfStream:
long countSorted = IntStream.range(1, listOfStream.size())
        .map(index -> {
            if (listOfStream.get(index).compareTo(listOfStream.get(index - 1)) >= 0) {
                return 0;
            }
            return index;
        })
        .sum();
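The stream is sorted exactly when countSorted is 0, since any out-of-order pair at position index contributes a positive value to the sum (note the >= comparison, so that equal neighbours count as sorted).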

Perform operation on n random distinct elements from Collection using Streams API

I'm attempting to retrieve n unique random elements for further processing from a Collection using the Streams API in Java 8, so far without much luck.
More precisely I'd want something like this:
Set<Integer> subList = new HashSet<>();
Queue<Integer> collection = new PriorityQueue<>();
collection.addAll(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9));
Random random = new Random();
int n = 4;
while (subList.size() < n) {
    subList.add(collection.get(random.nextInt())); // pseudocode: Queue has no get(int)
}
subList.forEach(v -> v.doSomethingFancy());
I want to do it as efficiently as possible.
Can this be done?
edit: My second attempt -- although not exactly what I was aiming for:
List<Integer> sublist = new ArrayList<>(collection);
Collections.shuffle(sublist);
sublist.stream().limit(n).forEach(v -> v.doSomethingFancy());
edit: Third attempt (inspired by Holger), which removes a lot of the overhead of shuffle when coll.size() is huge and n is small:
int n = // unique element count
List<Integer> sublist = new ArrayList<>(collection);
Random r = new Random();
for (int i = 0; i < n; i++)
    Collections.swap(sublist, i, i + r.nextInt(sublist.size() - i));
sublist.stream().limit(n).forEach(v -> v.doSomethingFancy());
The shuffling approach works reasonably well, as suggested by fge in a comment and by ZouZou in another answer. Here's a generified version of the shuffling approach:
static <E> List<E> shuffleSelectN(Collection<? extends E> coll, int n) {
    assert n <= coll.size();
    List<E> list = new ArrayList<>(coll);
    Collections.shuffle(list);
    return list.subList(0, n);
}
I'll note that using subList is preferable to getting a stream and then calling limit(n), as shown in some other answers, because the resulting stream has a known size and can be split more efficiently.
The shuffling approach has a couple of disadvantages. It needs to copy out all the elements, and then it needs to shuffle all of them. This can be quite expensive if the total number of elements is large and the number of elements to be chosen is small.
An approach suggested by the OP and by a couple of other answers is to choose elements at random while rejecting duplicates, until the desired number of unique elements has been chosen. This works well if the number of elements to choose is small relative to the total, but as the number to choose rises, it slows down quite a bit because the likelihood of choosing duplicates rises as well.
Wouldn't it be nice if there were a way to make a single pass over the space of input elements and choose exactly the number wanted, with the choices made uniformly at random? It turns out that there is, and as usual, the answer can be found in Knuth. See TAOCP Vol 2, sec 3.4.2, Random Sampling and Shuffling, Algorithm S.
Briefly, the algorithm is to visit each element and decide whether to choose it based on the number of elements visited and the number of elements chosen. In Knuth's notation, suppose you have N elements and you want to choose n of them at random. The next element should be chosen with probability
(n - m) / (N - t)
where t is the number of elements visited so far, and m is the number of elements chosen so far.
It's not at all obvious that this will give a uniform distribution of chosen elements, but apparently it does. The proof is left as an exercise to the reader; see Exercise 3 of this section.
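As a quick sanity check of the formula (my own arithmetic, not from Knuth's text): with N = 5 and n = 2, the first element (t = 0, m = 0) is chosen with probability 2/5. If it was chosen, the second element is chosen with probability (2 - 1)/(5 - 1) = 1/4; if not, with probability 2/4. The overall probability that the second element is chosen is therefore (2/5)(1/4) + (3/5)(1/2) = 2/5 = n/N, exactly what a uniform choice requires.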
Given this algorithm, it's pretty straightforward to implement it in "conventional" Java by looping over the collection and adding to the result list based on the random test. The OP asked about using streams, so here's a shot at that.
Algorithm S doesn't lend itself obviously to Java stream operations. It's described entirely sequentially, and the decision about whether to select the current element depends on a random decision plus state derived from all previous decisions. That might make it seem inherently sequential, but I've been wrong about that before. I'll just say that it's not immediately obvious how to make this algorithm run in parallel.
There is a way to adapt this algorithm to streams, though. What we need is a stateful predicate. This predicate will return a random result based on a probability determined by the current state, and the state will be updated -- yes, mutated -- based on this random result. This seems hard to run in parallel, but at least it's easy to make thread-safe in case it's run from a parallel stream: just make it synchronized. It'll degrade to running sequentially if the stream is parallel, though.
The implementation is pretty straightforward. Knuth's description uses random numbers between 0 and 1, but the Java Random class lets us choose a random integer within a half-open interval. Thus all we need to do is keep counters of how many elements are left to visit and how many are left to choose, et voila:
/**
 * A stateful predicate that, given a total number
 * of items and the number to choose, will return 'true'
 * the chosen number of times, distributed randomly
 * across the total number of calls to its test() method.
 */
static class Selector implements Predicate<Object> {
    int total;  // total number of items remaining
    int remain; // number of items remaining to select
    Random random = new Random();

    Selector(int total, int remain) {
        this.total = total;
        this.remain = remain;
    }

    @Override
    public synchronized boolean test(Object o) {
        assert total > 0;
        if (random.nextInt(total--) < remain) {
            remain--;
            return true;
        } else {
            return false;
        }
    }
}
Now that we have our predicate, it's easy to use in a stream:
static <E> List<E> randomSelectN(Collection<? extends E> coll, int n) {
    assert n <= coll.size();
    return coll.stream()
               .filter(new Selector(coll.size(), n))
               .collect(toList());
}
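A quick usage sketch (the values are illustrative, not from the original answer):

List<Integer> source = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9);
List<Integer> picked = randomSelectN(source, 4); // 4 distinct elements, chosen uniformly at random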
An alternative also mentioned in the same section of Knuth suggests choosing an element at random with a constant probability of n / N. This is useful if you don't need to choose exactly n elements. It'll choose n elements on average, but of course there will be some variation. If this is acceptable, the stateful predicate becomes much simpler. Instead of writing a whole class, we can simply create the random state and capture it from a local variable:
/**
 * Returns a predicate that evaluates to true with a probability
 * of toChoose/total.
 */
static Predicate<Object> randomPredicate(int total, int toChoose) {
    Random random = new Random();
    return obj -> random.nextInt(total) < toChoose;
}
To use this, replace the filter line in the stream pipeline above with
.filter(randomPredicate(coll.size(), n))
Finally, for comparison purposes, here's an implementation of the selection algorithm written using conventional Java, that is, using a for-loop and adding to a collection:
static <E> List<E> conventionalSelectN(Collection<? extends E> coll, int remain) {
    assert remain <= coll.size();
    int total = coll.size();
    List<E> result = new ArrayList<>(remain);
    Random random = new Random();
    for (E e : coll) {
        if (random.nextInt(total--) < remain) {
            remain--;
            result.add(e);
        }
    }
    return result;
}
This is quite straightforward, and there's nothing really wrong with this. It's simpler and more self-contained than the stream approach. Still, the streams approach illustrates some interesting techniques that might be useful in other contexts.
Reference:
Knuth, Donald E. The Art of Computer Programming: Volume 2, Seminumerical Algorithms, 2nd edition. Copyright 1981, 1969 Addison-Wesley.
You could always create a "dumb" comparator that compares elements randomly in the list. Calling distinct() will ensure that the elements are unique (from the queue).
Something like this:
static List<Integer> nDistinct(Collection<Integer> queue, int n) {
    final Random rand = new Random();
    return queue.stream()
            .distinct()
            .sorted(Comparator.comparingInt(a -> rand.nextInt()))
            .limit(n)
            .collect(Collectors.toList());
}
However, I'm not sure it will be more efficient than putting the elements in a list, shuffling it, and returning a sublist.
static List<Integer> nDistinct(Collection<Integer> queue, int n) {
    List<Integer> list = new ArrayList<>(queue);
    Collections.shuffle(list);
    return list.subList(0, n);
}
Oh, and it's probably semantically better to return a Set instead of a List, since the elements are distinct. The methods are also designed to take Integers, but there's no difficulty in making them generic. :)
Just as a note, the Stream API looks like a toolbox we could use for everything, but that's not always the case. As you can see, the second method is more readable (IMO), probably more efficient, and doesn't have more code (even less!).
As an addendum to the shuffle approach of the accepted answer:
If you want to select only a few items from a large list and want to avoid the overhead of shuffling the entire list you can solve the task as follows:
public static <T> List<T> getRandom(List<T> source, int num) {
    Random r = new Random();
    for (int i = 0; i < num; i++)
        Collections.swap(source, i, i + r.nextInt(source.size() - i));
    return source.subList(0, num);
}
What it does is very similar to what shuffle does, but it reduces its action to having only num random elements rather than source.size() random elements…
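Note that this variant swaps elements within the source list itself; if the original order matters, pass a defensive copy, e.g. getRandom(new ArrayList<>(collection), n).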
You can use limit to solve your problem.
http://docs.oracle.com/javase/8/docs/api/java/util/stream/Stream.html#limit-long-
Collections.shuffle(collection);
int howManyDoYouWant = 10;
List<Integer> smallerCollection = collection
        .stream()
        .limit(howManyDoYouWant)
        .collect(Collectors.toList());
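Note that Collections.shuffle requires a List and shuffles it in place; with the question's PriorityQueue you would first need to copy the elements, e.g. into new ArrayList<>(collection), and shuffle the copy.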
List<Integer> collection = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
int n = 4;
Random random = ThreadLocalRandom.current();
random.ints(0, collection.size())
        .distinct()
        .limit(n)
        .mapToObj(collection::get)
        .forEach(System.out::println);
This will of course have the overhead of the intermediate set of indexes, and it will hang forever if n > collection.size().
If you want to avoid any non-constant overhead, you'll have to make a stateful Predicate.
It should be clear that streaming the collection is not what you want.
Use the generate() and limit methods:
Stream.generate(() -> list.get(new Random().nextInt(list.size()))).limit(3).forEach(...);
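Note that this picks with replacement, so the same element can be selected more than once; it does not by itself satisfy the requirement of distinct elements.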
If you want to process the whole Stream without too much hassle, you can simply create your own Collector using Collectors.collectingAndThen():
public static <T> Collector<T, ?, Stream<T>> toEagerShuffledStream() {
    return Collectors.collectingAndThen(
            toList(),
            list -> {
                Collections.shuffle(list);
                return list.stream();
            });
}
But this won't perform well if you want to limit() the resulting Stream. In order to overcome this, one could create a custom Spliterator:
package com.pivovarit.stream;

import java.util.List;
import java.util.Random;
import java.util.Spliterator;
import java.util.function.Consumer;
import java.util.function.Supplier;

public class ImprovedRandomSpliterator<T> implements Spliterator<T> {

    private final Random random;
    private final T[] source;
    private int size;

    ImprovedRandomSpliterator(List<T> source, Supplier<? extends Random> random) {
        if (source.isEmpty()) {
            throw new IllegalArgumentException("RandomSpliterator can't be initialized with an empty collection");
        }
        this.source = (T[]) source.toArray();
        this.random = random.get();
        this.size = this.source.length;
    }

    @Override
    public boolean tryAdvance(Consumer<? super T> action) {
        if (size <= 0) {
            return false; // per the Spliterator contract, return false only when no element was consumed
        }
        int nextIdx = random.nextInt(size);
        int lastIdx = size - 1;
        action.accept(source[nextIdx]);
        source[nextIdx] = source[lastIdx];
        source[lastIdx] = null; // let the object be GCed
        size--;
        return true;
    }

    @Override
    public Spliterator<T> trySplit() {
        return null;
    }

    @Override
    public long estimateSize() {
        return source.length;
    }

    @Override
    public int characteristics() {
        return SIZED;
    }
}
and then:
public final class RandomCollectors {

    private RandomCollectors() {
    }

    public static <T> Collector<T, ?, Stream<T>> toImprovedLazyShuffledStream() {
        return Collectors.collectingAndThen(
                toCollection(ArrayList::new),
                list -> !list.isEmpty()
                        ? StreamSupport.stream(new ImprovedRandomSpliterator<>(list, Random::new), false)
                        : Stream.empty());
    }

    public static <T> Collector<T, ?, Stream<T>> toEagerShuffledStream() {
        return Collectors.collectingAndThen(
                toCollection(ArrayList::new),
                list -> {
                    Collections.shuffle(list);
                    return list.stream();
                });
    }
}
And then you could use it like:
stream
    .collect(RandomCollectors.toImprovedLazyShuffledStream()) // or toEagerShuffledStream(), depending on the use case
    .distinct()
    .limit(42)
    .forEach( ... );
A detailed explanation can be found here.
If you want a random sample of elements from a stream, a lazy alternative to shuffling might be a filter based on the uniform distribution:
// If you don't know ntotal, just use a ratio between 0 and 1
double relativeSize = (double) nsample / ntotal; // cast to avoid integer division

Stream.of(...) // or any other stream
    .parallel() // can work in parallel
    .filter(e -> Math.random() < relativeSize)
    // or any other stream operation
    .forEach(e -> System.out.println("I've got: " + e));
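Since each element passes the filter independently with probability relativeSize, the sample size is only nsample on average (it is binomially distributed), like the constant-probability variant from Knuth mentioned in the accepted answer of the previous question.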
