I have an existing interface that allows me to access a theoretically infinite collection as follows:
List<Element> retrieve(int start, int end);
//example
retrieve(5, 10); // retrieves the elements 5 through 10.
Now I would like to build a Java stream on top of this existing interface so that I can stream as many elements as I need without requesting a large list at once.
How would I go about doing this?
I looked at examples of Java streams, and all I can find are examples of how to create a stream from a collection that is completely in memory. I currently load 30 elements at a time and do the necessary processing, but it would be cleaner if I could abstract that logic away and just return a stream instead.
class Chunk implements Supplier<Element> {
    private final Generator generator;
    private final int chunkSize;
    private List<Element> list = Collections.emptyList();
    private int index = 0;

    public Chunk(Generator generator, int chunkSize) {
        assert chunkSize > 0;
        this.generator = generator;
        this.chunkSize = chunkSize;
    }

    @Override
    public Element get() {
        if (list.isEmpty()) {
            // retrieve is inclusive at both ends per the question,
            // so fetch exactly chunkSize elements.
            list = generator.retrieve(index, index + chunkSize - 1);
            index += chunkSize;
        }
        return list.remove(0);
    }
}
Here I'm assuming retrieve returns a mutable list. If not then you'd need to create a new ArrayList or equivalent at this point.
This can be used as Stream.generate(new Chunk(generator, 30)). It generates an infinite stream starting at index 0. You could add a constructor that allows the starting index to be set if that would be useful.
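For example, a minimal usage sketch (process is a hypothetical placeholder for whatever per-element work you do):
// Hypothetical usage: lazily process the first 100 elements,
// fetched from the generator 30 at a time.
Stream.generate(new Chunk(generator, 30))
    .limit(100)
    .forEach(element -> process(element));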
I assume you can't edit the retrieve method.
You can do this:
IntStream.iterate(1, x -> x + 1).mapToObj(x -> retrieve(x, x).get(0))
If one term of the sequence depends on the previous term, this would mean recalculating every term up to n if you want the nth term.
This mitigates the problem by fetching the elements in chunks of 100:
IntStream.iterate(1, x -> x + 1).mapToObj(x -> retrieve(1 + (x - 1) * 100, x * 100)).flatMap(List::stream)
If you can edit what's behind that interface, you can just make that return a Stream<Element>, using IntStream.iterate as above.
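For instance, a hedged sketch of such a method, assuming a page-size constant CHUNK and that retrieve(start, end) is inclusive at both ends, as in the question:
// Sketch: an infinite, lazily-fetched stream over the collection.
static final int CHUNK = 100;

Stream<Element> elements() {
    return IntStream.iterate(0, start -> start + CHUNK)
        .mapToObj(start -> retrieve(start, start + CHUNK - 1))
        .flatMap(List::stream);
}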
I have a list of objects, let's say of class Document:
class Document {
    private final String id;
    private final int length;

    public Document(String id, int length) {
        this.id = id;
        this.length = length;
    }

    public int getLength() {
        return length;
    }
}
The task at hand is to group them into Envelopes so that the total number of pages (the sum of Document.length values) in each Envelope does not exceed a certain number.
class Envelope {
    private final List<Document> documents = new ArrayList<>();

    public List<Document> getDocuments() {
        return documents;
    }
}
So for example, if I had the following List of Documents:
Document doc0 = new Document("doc0", 2);
Document doc1 = new Document("doc1", 5);
Document doc2 = new Document("doc2", 5);
Document doc3 = new Document("doc3", 5);
and the max page count per envelope is, let's say, 7, then I expect 3 envelopes with the following documents:
Assert.assertEquals(3, envelopeList.size());
Assert.assertEquals(2, envelopeList.get(0).getDocuments().size()); // doc0, doc1
Assert.assertEquals(1, envelopeList.get(1).getDocuments().size()); // doc2
Assert.assertEquals(1, envelopeList.get(2).getDocuments().size()); // doc3
I have implemented this with a traditional for loop and a bunch of ifs, but the question is: is it possible to do this in a more elegant way with streams and collectors?
thank you and best regards
Dalibor
For batching the documents based on the length, we need to maintain the state of accumulated lengths. Streams are not the best choice when external state needs to be maintained; a custom loop is the simpler and more efficient option, as sketched below.
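For comparison, a minimal loop sketch, assuming the same MAX constant and the Couvert class used in the stream version below:
static List<Couvert> splitWithLoop(List<Document> docs) {
    List<Couvert> result = new ArrayList<>();
    Couvert current = new Couvert();
    int accumulated = 0;
    for (Document d : docs) {
        // Start a new envelope once this document would overflow the current one.
        if (accumulated + d.getLength() > MAX && !current.getDocuments().isEmpty()) {
            result.add(current);
            current = new Couvert();
            accumulated = 0;
        }
        current.getDocuments().add(d);
        accumulated += d.getLength();
    }
    if (!current.getDocuments().isEmpty()) {
        result.add(current);
    }
    return result;
}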
If we force-fit streams for this scenario, the DocumentSpliterator would change as below:
public static List<Couvert> splitDocuments(List<Document> docs) {
    IntUnaryOperator helper = new IntUnaryOperator() {
        private int bucketIndex = 0;
        private int accumulated = 0;

        public synchronized int applyAsInt(int length) {
            if (length + accumulated > MAX) {
                bucketIndex++;
                accumulated = 0;
            }
            accumulated += length;
            return bucketIndex;
        }
    };

    return new ArrayList<>(docs.stream()
        .map(d -> new AbstractMap.SimpleEntry<>(helper.applyAsInt(d.getLength()), d))
        .collect(Collectors.groupingBy(AbstractMap.SimpleEntry::getKey,
            Collector.of(Couvert::new,
                (c, e) -> c.getDocuments().add(e.getValue()),
                (c1, c2) -> { c1.getDocuments().addAll(c2.getDocuments()); return c1; })))
        .values());
}
Explanation:
helper maintains the accumulated length and provides a new bucket index when it exceeds MAX. I have used the IntUnaryOperator interface here. Alternatively, we can use any interface that takes an int parameter and returns an int.
Regarding the stream,
Document is mapped to a SimpleEntry of bucketIndex and Document.
The stream of SimpleEntry is first grouped based on the bucketIndex. Another Collector transforms the stream of Documents for a particular bucketIndex into a Couvert. The output of the collect() is a Map<Integer,Couvert>.
Finally, the Collection of Couverts is converted to a list and returned.
Note: For this implementation, I removed the front parameter and included it as part of the docs list.
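For instance, a hypothetical run with the sample data from the question and MAX = 7:
// Expected grouping: [doc0, doc1], [doc2], [doc3]
List<Document> docs = Arrays.asList(
    new Document("doc0", 2),
    new Document("doc1", 5),
    new Document("doc2", 5),
    new Document("doc3", 5));
List<Couvert> envelopes = splitDocuments(docs);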
I have a class called Data which has only one method:
public boolean isValid()
I have a List of Data and I want to loop through it via a Java 8 stream. I need to count how many valid Data objects there are in this List and print out only the valid entries.
Below is how far I've gotten, but I don't understand how to do both in one pass.
List<Data> ar = new ArrayList<>();
...
// ar is now full of Data objects.
...
int count = ar.stream()
    .filter(Data::isValid)
    .forEach(System.out::println)
    .count(); // Compiler error: forEach() returns void, not a Stream.
My second attempt: (horrible code)
List<Data> ar = new ArrayList<>();
...
// Must be (effectively) final or a compiler error will happen in the lambda.
final AtomicInteger counter = new AtomicInteger();
ar.stream()
    .filter(Data::isValid)
    .forEach(d -> {
        System.out.println(d);
        counter.incrementAndGet();
    });
System.out.printf("There are %d/%d valid Data objects.%n", counter.get(), ar.size());
If you don't need the original ArrayList, containing a mixture of valid and invalid objects, later on, you might simply perform a Collection operation instead of the Stream operation:
ar.removeIf(d -> !d.isValid());
ar.forEach(System.out::println);
int count = ar.size();
Otherwise, you can implement it like
List<Data> valid = ar.stream().filter(Data::isValid).collect(Collectors.toList());
valid.forEach(System.out::println);
int count = valid.size();
Having storage for something you need multiple times is not so bad. If the list is really large, you can reduce the storage memory by (typically) a factor of 32, using
BitSet valid = IntStream.range(0, ar.size())
.filter(index -> ar.get(index).isValid())
.collect(BitSet::new, BitSet::set, BitSet::or);
valid.stream().mapToObj(ar::get).forEach(System.out::println);
int count = valid.cardinality();
Though, of course, you can also use
int count = 0;
for (Data d : ar) {
    if (d.isValid()) {
        System.out.println(d);
        count++;
    }
}
peek is similar to forEach, except that it lets you continue the stream.
long count = ar.stream()
    .filter(Data::isValid)
    .peek(System.out::println)
    .count();
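One caveat worth noting: since Java 9, count() may skip the upstream pipeline entirely when the stream size is directly computable, in which case peek() never runs. The filter here prevents that optimization, but if you remove the filter, prefer forcing the traversal, for example by collecting the count:
// Safe across Java versions: Collectors.counting() always traverses,
// so the side effect in peek() is guaranteed to run.
long count = ar.stream()
    .filter(Data::isValid)
    .peek(System.out::println)
    .collect(Collectors.counting());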
Scenario: I have an object with 2 functions:
Integer getA();
Integer getB();
I have a list of objects, say List<MyObject> myObject.
Objective: iterate over the List once and get the sum of the A's and the sum of the B's.
My Solution
int sumA = myObject.stream().collect(Collectors.summingInt(MyObject::getA));
int sumB = myObject.stream().collect(Collectors.summingInt(MyObject::getB));
The Question: How can I do both of these at the same time? This way I will not have to iterate twice over the original list.
Edit: I was doing this because some of the filters that I used were O(n^3). Kind of really bad to do those twice.
Benchmark: it really matters whether it is T or 2T when the program runs for half an hour on an i5. This was on much less data, and if I run on a cluster my data would be larger too.
It does matter if you can do these in one pass!
You need to write another class to store the total values like this:
public class Total {
    private int totalA = 0;
    private int totalB = 0;

    public void addA(int a) {
        totalA += a;
    }

    public void addB(int b) {
        totalB += b;
    }

    public int getTotalA() {
        return totalA;
    }

    public int getTotalB() {
        return totalB;
    }
}
And then collect the values using
Total total = objects.stream()
    .collect(Total::new,
        (t, o) -> {
            t.addA(o.getA());
            t.addB(o.getB());
        },
        (t1, t2) -> {
            // Merge partial totals so the collector also works in parallel.
            t1.addA(t2.getTotalA());
            t1.addB(t2.getTotalB());
        });
// check total.getTotalA() and total.getTotalB()
You can also use AbstractMap.SimpleEntry<Integer, Integer> to replace Total to avoid writing a new class, but it's still kind of weird because A/B are not in a key-value relationship.
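Another minimal sketch, assuming the same MyObject with getA() and getB() as in the question: accumulate both sums into an int[2], so no dedicated class is needed:
// totals[0] collects the sum of A's, totals[1] the sum of B's.
int[] totals = myObject.stream()
    .collect(() -> new int[2],
        (t, o) -> { t[0] += o.getA(); t[1] += o.getB(); },
        (t1, t2) -> { t1[0] += t2[0]; t1[1] += t2[1]; });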
A plain for loop would probably still be the most efficient option, though the stream alternative offers parallelism and may catch up in efficiency. Combining two loops into one is not necessarily faster.
One improvement would be to use int instead of Integer.
List<String> list = Arrays.asList("tom","terry","john","kevin","steve");
int n = list.stream().collect(Collectors.summingInt(String::length));
int h = list.stream().collect(Collectors.summingInt(String::hashCode));
I favour this solution.
If one wants a single pass, there are two alternatives:
putting both ints in their own class. You might abuse the java.awt.Point class with its int x and int y.
packing both ints into a long. As long as there is no overflow, one can even sum the packed longs directly.
The latter:
List<String> list = Arrays.asList("tom","terry","john","kevin","steve");
long nh = list.stream()
    // the cast to long matters: shifting an int left by 32 is a no-op
    .collect(Collectors.summingLong(s -> ((long) s.hashCode() << 32) | s.length()));
int n = (int) nh;
int h = (int) (nh >> 32);
I have a Record class:
public class Record implements Comparable<Record>
{
    private String myCategory1;
    private int myCategory2;
    private String myCategory3;
    private String myCategory4;
    private int myValue1;
    private double myValue2;

    public Record(String category1, int category2, String category3, String category4,
                  int value1, double value2)
    {
        myCategory1 = category1;
        myCategory2 = category2;
        myCategory3 = category3;
        myCategory4 = category4;
        myValue1 = value1;
        myValue2 = value2;
    }

    // Getters here
}
I create a big list of a lot of records. Only the second and fifth values, i / 10000 and i, are used later, by the getters getCategory2() and getValue1() respectively.
List<Record> list = new ArrayList<>();
for (int i = 0; i < 115000; i++)
{
    list.add(new Record("A", i / 10000, "B", "C", i, (double) i / 100 + 1));
}
Note that first 10,000 records have a category2 of 0, then next 10,000 have 1, etc., while the value1 values are 0-114999 sequentially.
I create a Stream that is both parallel and sorted.
Stream<Record> stream = list.stream()
.parallel()
.sorted(
//(r1, r2) -> Integer.compare(r1.getCategory2(), r2.getCategory2())
)
//.parallel()
;
I have a ForkJoinPool that maintains 8 threads, which is the number of cores I have on my PC.
ForkJoinPool pool = new ForkJoinPool(8);
I use the trick described here to submit a stream processing task to my own ForkJoinPool instead of the common ForkJoinPool.
List<Record> output = pool.submit(() ->
stream.collect(Collectors.toList()
)).get();
I expected that the parallel sorted operation would respect the encounter order of the stream, and that it would be a stable sort, because the Spliterator returned by ArrayList is ORDERED.
However, simple code that prints out the elements of the resultant List output in order shows that it's not quite the case.
for (Record record : output)
{
    System.out.println(record.getValue1());
}
Output, condensed:
0
1
2
3
...
69996
69997
69998
69999
71875 // discontinuity!
71876
71877
71878
...
79058
79059
79060
79061
70000 // discontinuity!
70001
70002
70003
...
71871
71872
71873
71874
79062 // discontinuity!
79063
79064
79065
79066
...
114996
114997
114998
114999
The size() of output is 115000, and all elements appear to be there, just in a slightly different order.
So I wrote some checking code to see if the sort was stable. If it's stable, then all of the value1 values should remain in order. This code verifies the order, printing any discrepancies.
int prev = -1;
boolean verified = true;
for (Record record : output)
{
    int curr = record.getValue1();
    if (prev != -1)
    {
        if (prev + 1 != curr)
        {
            System.out.println("Warning: " + prev + " followed by " + curr + "!");
            verified = false;
        }
    }
    prev = curr;
}
System.out.println("Verified: " + verified);
Output:
Warning: 69999 followed by 71875!
Warning: 79061 followed by 70000!
Warning: 71874 followed by 79062!
Warning: 99999 followed by 100625!
Warning: 107811 followed by 100000!
Warning: 100624 followed by 107812!
Verified: false
This condition persists if I do any of the following:
Replace the ForkJoinPool with a ThreadPoolExecutor.
ThreadPoolExecutor pool = new ThreadPoolExecutor(8, 8, 0, TimeUnit.SECONDS, new ArrayBlockingQueue<>(10));
Use the common ForkJoinPool by processing the Stream directly.
List<Record> output = stream.collect(Collectors.toList());
Call parallel() after I call sorted.
Stream<Record> stream = list.stream().sorted().parallel();
Call parallelStream() instead of stream().parallel().
Stream<Record> stream = list.parallelStream().sorted();
Sort using a Comparator. Note that this sort criterion is different from the "natural" order I defined for the Comparable interface, although since the results start out already in order, the result should still be the same.
Stream<Record> stream = list.stream().parallel().sorted(
(r1, r2) -> Integer.compare(r1.getCategory2(), r2.getCategory2())
);
I can only get this to preserve the encounter order if I omit one of the following from the Stream:
The call to parallel().
Any call to an overload of sorted.
Interestingly, the parallel() without a sort preserved the order.
In both of the above cases, the output is:
Verified: true
My Java version is 1.8.0_05. This anomaly also occurs on Ideone, which appears to be running Java 8u25.
Update
I've upgraded my JDK to the latest version as of this writing, 1.8.0_45, and the problem is unchanged.
Question
Is the record order in the resultant List (output) out of order because the sort is somehow not stable, because the encounter order is not preserved, or some other reason?
How can I ensure that the encounter order is preserved when I create a parallel stream and sort it?
It looks like Arrays.parallelSort isn't stable in some circumstances. Well spotted. The stream parallel sort is implemented in terms of Arrays.parallelSort, so it affects streams as well. Here's a simplified example:
public class StableSortBug {
    static final int SIZE = 50_000;

    static class Record implements Comparable<Record> {
        final int sortVal;
        final int seqNum;

        Record(int i1, int i2) { sortVal = i1; seqNum = i2; }

        @Override
        public int compareTo(Record other) {
            return Integer.compare(this.sortVal, other.sortVal);
        }
    }

    static Record[] genArray() {
        Record[] array = new Record[SIZE];
        Arrays.setAll(array, i -> new Record(i / 10_000, i));
        return array;
    }

    static boolean verify(Record[] array) {
        return IntStream.range(1, array.length)
            .allMatch(i -> array[i-1].seqNum + 1 == array[i].seqNum);
    }

    public static void main(String[] args) {
        Record[] array = genArray();
        System.out.println(verify(array));
        Arrays.sort(array);
        System.out.println(verify(array));
        Arrays.parallelSort(array);
        System.out.println(verify(array));
    }
}
On my machine (2 core x 2 threads) this prints the following:
true
true
false
Of course, it's supposed to print true three times. This is on the current JDK 9 dev builds. I wouldn't be surprised if it occurs in all the JDK 8 releases thus far, given what you've tried. Curiously, reducing the size or the divisor will change the behavior. A size of 20,000 and a divisor of 10,000 is stable, and a size of 50,000 and a divisor of 1,000 is also stable. It seems like the problem has to do with a sufficiently large run of values comparing equal versus the parallel split size.
The OpenJDK issue JDK-8076446 covers this bug.
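Until the fix arrives, one hedged workaround is to make the sort key total by breaking ties on the original sequence number, so stability no longer matters; sketched here against the simplified Record above:
// A total order makes stability irrelevant: records with equal sortVal
// are tie-broken by their original position.
Comparator<Record> bySortValThenSeq =
    Comparator.comparingInt((Record r) -> r.sortVal)
        .thenComparingInt(r -> r.seqNum);
Arrays.parallelSort(array, bySortValThenSeq);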
I'm attempting to retrieve n unique random elements for further processing from a Collection using the Streams API in Java 8, however, without much or any luck.
More precisely I'd want something like this:
Set<Integer> subList = new HashSet<>();
Queue<Integer> collection = new PriorityQueue<>();
collection.addAll(Arrays.asList(1,2,3,4,5,6,7,8,9));
Random random = new Random();
int n = 4;
while (subList.size() < n) {
    // pseudo-code: Queue has no get(), and the index must be bounded
    subList.add(collection.get(random.nextInt(collection.size())));
}
subList.forEach(v -> v.doSomethingFancy());
I want to do it as efficiently as possible.
Can this be done?
edit: My second attempt -- although not exactly what I was aiming for:
List<Integer> sublist = new ArrayList<>(collection);
Collections.shuffle(sublist);
sublist.stream().limit(n).forEach(v -> v.doSomethingFancy());
edit: Third attempt (inspired by Holger), which will remove a lot of the overhead of shuffle if coll.size() is huge and n is small:
int n = // unique element count
List<Integer> sublist = new ArrayList<>(collection);
Random r = new Random();
for (int i = 0; i < n; i++)
    Collections.swap(sublist, i, i + r.nextInt(sublist.size() - i));
sublist.stream().limit(n).forEach(v -> v.doSomethingFancy());
The shuffling approach works reasonably well, as suggested by fge in a comment and by ZouZou in another answer. Here's a generified version of the shuffling approach:
static <E> List<E> shuffleSelectN(Collection<? extends E> coll, int n) {
    assert n <= coll.size();
    List<E> list = new ArrayList<>(coll);
    Collections.shuffle(list);
    return list.subList(0, n);
}
I'll note that using subList is preferable to getting a stream and then calling limit(n), as shown in some other answers, because the resulting stream has a known size and can be split more efficiently.
The shuffling approach has a couple of disadvantages. It needs to copy out all the elements, and then it needs to shuffle all the elements. This can be quite expensive if the total number of elements is large and the number of elements to be chosen is small.
An approach suggested by the OP and by a couple of other answers is to choose elements at random, while rejecting duplicates, until the desired number of unique elements has been chosen. This works well if the number of elements to choose is small relative to the total, but as the number to choose rises, this slows down quite a bit because the likelihood of choosing duplicates rises as well.
Wouldn't it be nice if there were a way to make a single pass over the space of input elements and choose exactly the number wanted, with the choices made uniformly at random? It turns out that there is, and as usual, the answer can be found in Knuth. See TAOCP Vol 2, sec 3.4.2, Random Sampling and Shuffling, Algorithm S.
Briefly, the algorithm is to visit each element and decide whether to choose it based on the number of elements visited and the number of elements chosen. In Knuth's notation, suppose you have N elements and you want to choose n of them at random. The next element should be chosen with probability
(n - m) / (N - t)
where t is the number of elements visited so far, and m is the number of elements chosen so far. For example, with N = 9 and n = 4, the very first element (t = 0, m = 0) is chosen with probability 4/9.
It's not at all obvious that this will give a uniform distribution of chosen elements, but apparently it does. The proof is left as an exercise to the reader; see Exercise 3 of this section.
Given this algorithm, it's pretty straightforward to implement it in "conventional" Java by looping over the collection and adding to the result list based on the random test. The OP asked about using streams, so here's a shot at that.
Algorithm S doesn't lend itself obviously to Java stream operations. It's described entirely sequentially, and the decision about whether to select the current element depends on a random decision plus state derived from all previous decisions. That might make it seem inherently sequential, but I've been wrong about that before. I'll just say that it's not immediately obvious how to make this algorithm run in parallel.
There is a way to adapt this algorithm to streams, though. What we need is a stateful predicate. This predicate will return a random result based on a probability determined by the current state, and the state will be updated -- yes, mutated -- based on this random result. This seems hard to run in parallel, but at least it's easy to make thread-safe in case it's run from a parallel stream: just make it synchronized. It'll degrade to running sequentially if the stream is parallel, though.
The implementation is pretty straightforward. Knuth's description uses random numbers between 0 and 1, but the Java Random class lets us choose a random integer within a half-open interval. Thus all we need to do is keep counters of how many elements are left to visit and how many are left to choose, et voila:
/**
 * A stateful predicate that, given a total number
 * of items and the number to choose, will return 'true'
 * the chosen number of times, distributed randomly
 * across the total number of calls to its test() method.
 */
static class Selector implements Predicate<Object> {
    int total;  // total number of items remaining
    int remain; // number of items remaining to select
    Random random = new Random();

    Selector(int total, int remain) {
        this.total = total;
        this.remain = remain;
    }

    @Override
    public synchronized boolean test(Object o) {
        assert total > 0;
        if (random.nextInt(total--) < remain) {
            remain--;
            return true;
        } else {
            return false;
        }
    }
}
Now that we have our predicate, it's easy to use in a stream:
static <E> List<E> randomSelectN(Collection<? extends E> coll, int n) {
    assert n <= coll.size();
    return coll.stream()
        .filter(new Selector(coll.size(), n))
        .collect(toList());
}
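For instance, a hypothetical call that picks 4 distinct elements from the question's sample data, in a single pass and in encounter order:
List<Integer> picked = randomSelectN(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9), 4);
picked.forEach(System.out::println);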
An alternative also mentioned in the same section of Knuth suggests choosing an element at random with a constant probability of n / N. This is useful if you don't need to choose exactly n elements. It'll choose n elements on average, but of course there will be some variation. If this is acceptable, the stateful predicate becomes much simpler. Instead of writing a whole class, we can simply create the random state and capture it from a local variable:
/**
 * Returns a predicate that evaluates to true with a probability
 * of toChoose/total.
 */
static Predicate<Object> randomPredicate(int total, int toChoose) {
    Random random = new Random();
    return obj -> random.nextInt(total) < toChoose;
}
To use this, replace the filter line in the stream pipeline above with
.filter(randomPredicate(coll.size(), n))
Finally, for comparison purposes, here's an implementation of the selection algorithm written using conventional Java, that is, using a for-loop and adding to a collection:
static <E> List<E> conventionalSelectN(Collection<? extends E> coll, int remain) {
    assert remain <= coll.size();
    int total = coll.size();
    List<E> result = new ArrayList<>(remain);
    Random random = new Random();
    for (E e : coll) {
        if (random.nextInt(total--) < remain) {
            remain--;
            result.add(e);
        }
    }
    return result;
}
This is quite straightforward, and there's nothing really wrong with this. It's simpler and more self-contained than the stream approach. Still, the streams approach illustrates some interesting techniques that might be useful in other contexts.
Reference:
Knuth, Donald E. The Art of Computer Programming: Volume 2, Seminumerical Algorithms, 2nd edition. Copyright 1981, 1969 Addison-Wesley.
You could always create a "dumb" comparator that compares elements randomly in the list. Calling distinct() will ensure that the elements (from the queue) are unique.
Something like this:
static List<Integer> nDistinct(Collection<Integer> queue, int n) {
    final Random rand = new Random();
    return queue.stream()
        .distinct()
        // note: a randomly-behaving comparator violates the Comparator contract
        .sorted(Comparator.comparingInt(a -> rand.nextInt()))
        .limit(n)
        .collect(Collectors.toList());
}
However, I'm not sure it will be more efficient than putting the elements in a list, shuffling it, and returning a sublist.
static List<Integer> nDistinct(Collection<Integer> queue, int n) {
    List<Integer> list = new ArrayList<>(queue);
    Collections.shuffle(list);
    return list.subList(0, n);
}
Oh, and it's probably semantically better to return a Set instead of a List, since the elements are distinct. The methods are also designed to take Integers, but there's no difficulty in making them generic. :)
Just as a note, the Stream API looks like a toolbox that we could use for everything; however, that's not always the case. As you can see, the second method is more readable (IMO), probably more efficient, and doesn't have more code (even less!).
As an addendum to the shuffle approach of the accepted answer:
If you want to select only a few items from a large list and want to avoid the overhead of shuffling the entire list you can solve the task as follows:
public static <T> List<T> getRandom(List<T> source, int num) {
    Random r = new Random();
    for (int i = 0; i < num; i++)
        Collections.swap(source, i, i + r.nextInt(source.size() - i));
    return source.subList(0, num);
}
What it does is very similar to what shuffle does, but it reduces its action to having only num random elements rather than source.size() random elements… Note that it swaps elements within the passed-in list, so hand it a copy if the original order must be preserved.
You can use limit to solve your problem.
http://docs.oracle.com/javase/8/docs/api/java/util/stream/Stream.html#limit-long-
Collections.shuffle(collection); // assumes collection is a List
int howManyDoYouWant = 10;
List<Integer> smallerCollection = collection
    .stream()
    .limit(howManyDoYouWant)
    .collect(Collectors.toList());
List<Integer> collection = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
int n = 4;
Random random = ThreadLocalRandom.current();
random.ints(0, collection.size())
    .distinct()
    .limit(n)
    .mapToObj(collection::get)
    .forEach(System.out::println);
This will of course have the overhead of the intermediate set of indexes and it will hang forever if n > collection.size().
If you want to avoid any non-constant overhead, you'll have to make a stateful Predicate.
It should be clear that streaming the collection is not what you want.
Use the generate() and limit methods:
Random random = new Random();
Stream.generate(() -> list.get(random.nextInt(list.size()))).limit(3).forEach(...);
Note that generate() may pick the same element more than once; see the variant below for distinct results.
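A hedged variant that enforces uniqueness; beware that it stalls forever if you ask for more distinct elements than the list contains:
// distinct() filters out repeated picks; limit(3) stops after three unique ones.
Random random = new Random();
Stream.generate(() -> list.get(random.nextInt(list.size())))
    .distinct()
    .limit(3)
    .forEach(System.out::println);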
If you want to process the whole Stream without too much hassle, you can simply create your own Collector using Collectors.collectingAndThen():
public static <T> Collector<T, ?, Stream<T>> toEagerShuffledStream() {
    return Collectors.collectingAndThen(
        toList(),
        list -> {
            Collections.shuffle(list);
            return list.stream();
        });
}
But this won't perform well if you want to limit() the resulting Stream. In order to overcome this, one could create a custom Spliterator:
package com.pivovarit.stream;
import java.util.List;
import java.util.Random;
import java.util.Spliterator;
import java.util.function.Consumer;
import java.util.function.Supplier;
public class ImprovedRandomSpliterator<T> implements Spliterator<T> {

    private final Random random;
    private final T[] source;
    private int size;

    ImprovedRandomSpliterator(List<T> source, Supplier<? extends Random> random) {
        if (source.isEmpty()) {
            throw new IllegalArgumentException("RandomSpliterator can't be initialized with an empty collection");
        }
        // unchecked cast, but safe: the array is only used internally
        this.source = (T[]) source.toArray();
        this.random = random.get();
        this.size = this.source.length;
    }

    @Override
    public boolean tryAdvance(Consumer<? super T> action) {
        // Pick a random remaining element, then move the last remaining
        // element into its slot, shrinking the live region by one.
        int nextIdx = random.nextInt(size);
        int lastIdx = size - 1;

        action.accept(source[nextIdx]);
        source[nextIdx] = source[lastIdx];
        source[lastIdx] = null; // let object be GCed
        return --size > 0;
    }

    @Override
    public Spliterator<T> trySplit() {
        return null;
    }

    @Override
    public long estimateSize() {
        return source.length;
    }

    @Override
    public int characteristics() {
        return SIZED;
    }
}
and then:
public final class RandomCollectors {

    private RandomCollectors() {
    }

    public static <T> Collector<T, ?, Stream<T>> toImprovedLazyShuffledStream() {
        return Collectors.collectingAndThen(
            toCollection(ArrayList::new),
            list -> !list.isEmpty()
                ? StreamSupport.stream(new ImprovedRandomSpliterator<>(list, Random::new), false)
                : Stream.empty());
    }

    public static <T> Collector<T, ?, Stream<T>> toEagerShuffledStream() {
        return Collectors.collectingAndThen(
            toCollection(ArrayList::new),
            list -> {
                Collections.shuffle(list);
                return list.stream();
            });
    }
}
And then you could use it like:
stream
    .collect(toImprovedLazyShuffledStream()) // or toEagerShuffledStream() depending on the use case
    .distinct()
    .limit(42)
    .forEach( ... );
A detailed explanation can be found here.
If you want a random sample of elements from a stream, a lazy alternative to shuffling might be a filter based on the uniform distribution:
...
// If you don't know ntotal, just use a 0-1 ratio
double relativeSize = (double) nsample / ntotal; // double avoids integer division
Stream.of(...) // or any other stream
    .parallel() // can work in parallel
    .filter(e -> Math.random() < relativeSize)
    // or any other stream operation
    .forEach(e -> System.out.println("I've got: " + e));
Note that this yields approximately nsample elements rather than an exact count; for exactly n elements, see the stateful Selector predicate above.