Encounter Order wrong when sorting a parallel stream - java

I have a Record class:
public class Record implements Comparable<Record>
{
private String myCategory1;
private int myCategory2;
private String myCategory3;
private String myCategory4;
private int myValue1;
private double myValue2;
public Record(String category1, int category2, String category3, String category4,
int value1, double value2)
{
myCategory1 = category1;
myCategory2 = category2;
myCategory3 = category3;
myCategory4 = category4;
myValue1 = value1;
myValue2 = value2;
}
// Getters here
}
I create a big list of a lot of records. Only the second and fifth values, i / 10000 and i, are used later, by the getters getCategory2() and getValue1() respectively.
List<Record> list = new ArrayList<>();
for (int i = 0; i < 115000; i++)
{
list.add(new Record("A", i / 10000, "B", "C", i, (double) i / 100 + 1));
}
Note that first 10,000 records have a category2 of 0, then next 10,000 have 1, etc., while the value1 values are 0-114999 sequentially.
I create a Stream that is both parallel and sorted.
Stream<Record> stream = list.stream()
.parallel()
.sorted(
//(r1, r2) -> Integer.compare(r1.getCategory2(), r2.getCategory2())
)
//.parallel()
;
I have a ForkJoinPool that maintains 8 threads, which is the number of cores I have on my PC.
ForkJoinPool pool = new ForkJoinPool(8);
I use the trick described here to submit a stream processing task to my own ForkJoinPool instead of the common ForkJoinPool.
List<Record> output = pool.submit(() ->
stream.collect(Collectors.toList()
)).get();
I expected that the parallel sorted operation would respect the encounter order of the stream, and that it would be a stable sort, because the Spliterator returned by ArrayList is ORDERED.
However, simple code that prints out the elements of the resultant List output in order shows that it's not quite the case.
for (Record record : output)
{
System.out.println(record.getValue1());
}
Output, condensed:
0
1
2
3
...
69996
69997
69998
69999
71875 // discontinuity!
71876
71877
71878
...
79058
79059
79060
79061
70000 // discontinuity!
70001
70002
70003
...
71871
71872
71873
71874
79062 // discontinuity!
79063
79064
79065
79066
...
114996
114997
114998
114999
The size() of output is 115000, and all elements appear to be there, just in a slightly different order.
So I wrote some checking code to see if the sort was stable. If it's stable, then all of the value1 values should remain in order. This code verifies the order, printing any discrepancies.
int prev = -1;
boolean verified = true;
for (Record record : output)
{
int curr = record.getValue1();
if (prev != -1)
{
if (prev + 1 != curr)
{
System.out.println("Warning: " + prev + " followed by " + curr + "!");
verified = false;
}
}
prev = curr;
}
System.out.println("Verified: " + verified);
Output:
Warning: 69999 followed by 71875!
Warning: 79061 followed by 70000!
Warning: 71874 followed by 79062!
Warning: 99999 followed by 100625!
Warning: 107811 followed by 100000!
Warning: 100624 followed by 107812!
Verified: false
This condition persists if I do any of the following:
Replace the ForkJoinPool with a ThreadPoolExecutor.
ThreadPoolExecutor pool = new ThreadPoolExecutor(8, 8, 0, TimeUnit.SECONDS, new ArrayBlockingQueue<>(10));
Use the common ForkJoinPool by processing the Stream directly.
List<Record> output = stream.collect(Collectors.toList());
Call parallel() after I call sorted.
Stream<Record> stream = list.stream().sorted().parallel();
Call parallelStream() instead of stream().parallel().
Stream<Record> stream = list.parallelStream().sorted();
Sort using a Comparator. Note that this sort criterion is different that the "natural" order I defined for the Comparable interface, although starting with the results already in order from the beginning, the result should still be the same.
Stream<Record> stream = list.stream().parallel().sorted(
(r1, r2) -> Integer.compare(r1.getCategory2(), r2.getCategory2())
);
I can only get this to preserve the encounter order if I don't do one of the following on the Stream:
Don't call parallel().
Don't call any overload of sorted.
Interestingly, the parallel() without a sort preserved the order.
In both of the above cases, the output is:
Verified: true
My Java version is 1.8.0_05. This anomaly also occurs on Ideone, which appears to be running Java 8u25.
Update
I've upgraded my JDK to the latest version as of this writing, 1.8.0_45, and the problem is unchanged.
Question
Is the record order in the resultant List (output) out of order because the sort is somehow not stable, because the encounter order is not preserved, or some other reason?
How can I ensure that the encounter order is preserved when I create a parallel stream and sort it?

It looks like Arrays.parallelSort isn't stable in some circumstances. Well spotted. The stream parallel sort is implemented in terms of Arrays.parallelSort, so it affects streams as well. Here's a simplified example:
public class StableSortBug {
static final int SIZE = 50_000;
static class Record implements Comparable<Record> {
final int sortVal;
final int seqNum;
Record(int i1, int i2) { sortVal = i1; seqNum = i2; }
#Override
public int compareTo(Record other) {
return Integer.compare(this.sortVal, other.sortVal);
}
}
static Record[] genArray() {
Record[] array = new Record[SIZE];
Arrays.setAll(array, i -> new Record(i / 10_000, i));
return array;
}
static boolean verify(Record[] array) {
return IntStream.range(1, array.length)
.allMatch(i -> array[i-1].seqNum + 1 == array[i].seqNum);
}
public static void main(String[] args) {
Record[] array = genArray();
System.out.println(verify(array));
Arrays.sort(array);
System.out.println(verify(array));
Arrays.parallelSort(array);
System.out.println(verify(array));
}
}
On my machine (2 core x 2 threads) this prints the following:
true
true
false
Of course, it's supposed to print true three times. This is on the current JDK 9 dev builds. I wouldn't be surprised if it occurs in all the JDK 8 releases thus far, given what you've tried. Curiously, reducing the size or the divisor will change the behavior. A size of 20,000 and a divisor of 10,000 is stable, and a size of 50,000 and a divisor of 1,000 is also stable. It seems like the problem has to do with a sufficiently large run of values comparing equal versus the parallel split size.
The OpenJDK issue JDK-8076446 covers this bug.

Related

How to cache and replay the items of a Supplier<Stream<T>>

Regarding a Supplier<Stream<T>> dataSrc I would like to cache the Stream items for further traversals of the same sequence of elements. In this case, assume that dataSrc always produces the same sequence (e.g. getting a Stream<Integer> with temperatures in Celsius of March (see example usage bellow)). Thus, option 1) is to first collect the Stream items, however it will waste one first traversal to add those items into a collection:
Supplier<Stream<T>> dataSrc = ...
List<T> cache = dataSrc.collect(toList()); // **Additional** traversal to collect items
cache.stream().reduce(…) // 1st traversal
cache.stream().reduce(…) // 2nd traversal
... // Nth traversals
I would like to avoid the additional traversal to collect items and the explicit cache variable and hide it inside the Supplier<> in such a way that on first traversal the items are implicitly cached and on further traversals the items are accessed from that cache. I think this is similar to the idea of the method cache() of Reactor Project for reactive streams.
Thus, I am outlining an alternative in the following cache() method implementation, although it already has two problems (at least): 1) the onClose is not called on traversal finish (and I cannot figure out any way of detecting the end of a traversal); 2) If the first traversal never ends then the cache will never be filled.
Supplier<Stream<T>> dataSrc = cache(...)
dataSrc.get().reduce(…) // 1st traversal
dataSrc.get().reduce(…) // 2nd traversal
... // Nth traversals
static <T> Supplier<Stream<T>> cache(Supplier<Stream<T>> dataSrc) {
final List<T> cache = new ArrayList<>();
final AtomicBoolean started = new AtomicBoolean();
final AtomicBoolean isCached = new AtomicBoolean();
return () -> {
if(isCached.get()) return cache.stream();
if(!started.getAndSet(true)) {
return dataSrc
.get()
.peek(cache::add)
.onClose(() -> isCached.set(true));
}
return dataSrc.get();
};
}
Question:
Is there any better approach to achieve an utility cache() function which returns a new Stream<T> that caches items on first Stream traversal (without an implicit additional traversal to collect first) and further Stream objects are created from that cache?
Usage Example:
Here I am getting a stream with the temperatures in March from the World Weather online API. To execute it you must include a dependency of AsyncHttpClient and a valid API key in given URI.
Pattern pat = Pattern.compile("\\n");
boolean [] isEven = {true};
CompletableFuture<Stream<Integer>> temps = asyncHttpClient()
.prepareGet("http://api.worldweatheronline.com/premium/v1/past-weather.ashx?q=37.017,-7.933&date=2018-03-01&enddate=2018-03-31&tp=24&format=csv&key=715b185b36034a4c879141841182802")
.execute()
.toCompletableFuture()
.thenApply(Response::getResponseBody)
.thenApply(pat::splitAsStream)
.thenApply(str -> str
.filter(w -> !w.startsWith("#")) // Filter comments
.skip(1) // Skip line: Not Available
.filter(l -> isEven[0] = !isEven[0]) // Filter Even line
.map(line -> line.substring(14, 16)) // Extract temperature in celcius
.map(Integer::parseInt)
);
Note that a CompletableFuture<Stream<Integer>> is functionally compliant with a Supplier<Stream<Integer>>. Although, the CompletableFuture caches the resulting stream it cannot be iterated twice.
Problem 1: The following code throws IllegalStateException: stream has already been operated upon or closed
out.println(temps.join().distinct().count());
out.println(temps.join().max(Integer::compare)); // throws IllegalStateException
Problem 2: Collecting it in a List will induce a first traversal and thus we will have 3 traversals, instead of 2:
CompletableFuture<List<Integer>> list = temps.thenApply(str -> str.collect(toList()));
out.println(list.join().stream().distinct().count()); // 2 traversals
out.println(list.join().stream().distinct().max(Integer::compare));// 1 traversal
Goal: Store items in cache on first traversal. Every time the stream retrieves an item it should store it in an internal cache that will be used on further traversals.
Supplier<Stream<Integer>> cache = Cache.of(temps::join);
out.println(temps.get().distinct().count()); // 1 traversal
out.println(temps.get().max(Integer::compare)); // 1 traversal form cache
I think the only way to detect the end of the Stream traversal is through its iterator() or spliterator(). Thus, maybe a better option to get a replayable Stream is to record its items from its iterator (done by Recorder class of the example bellow) and then implement a new Spliterator that reads the items previously recorded (done by cacheIterator()). In this solution I made the getOrAdvance() method of Recorder synchronized to guarantee that just one resulting stream will get a new item from the source.
So, Cache.of(dataSrc) creates a chain of:
dataSrc ----> Recorder ----> cacheIterator() ----> Stream
Side notes:
The resulting stream from the method Cache.of() permits limited parallelism. For better splitting support the cacheIterator() bellow should return a Spliterator implementation instead, such as AbstractList.RandomAccessSpliterator.
Although it is not a requirement, the Recorder/ cacheIterator() solution also works with infinite data sources that can be short-circuited later.
E.g. it can cache the items of the infinite stream nrs and prints the output bellow without, or with cache (i.e. nrsReplay):
Random rnd = new Random();
Supplier<Stream<String>> nrs = () -> Stream.generate(() -> rnd.nextInt(99)).map(Object::toString);
IntStream.range(1, 6).forEach(size -> out.println(nrs.get().limit(size).collect(joining(","))));
System.out.println();
Supplier<Stream<String>> nrsReplay = Cache.of(nrs);
IntStream.range(1, 6).forEach(size -> out.println(nrsReplay.get().limit(size).collect(joining(","))));
Output:
32
65,94
94,19,34
72,77,66,18
88,41,34,97,28
93
93,65
93,65,71
93,65,71,40
93,65,71,40,68
class Cache {
public static <T> Supplier<Stream<T>> of(Supplier<Stream<T>> dataSrc) {
final Spliterator<T> src = dataSrc.get().spliterator(); // !!!maybe it should be lazy and memorized!!!
final Recorder<T> rec = new Recorder<>(src);
return () -> {
// CacheIterator starts on index 0 and reads data from src or
// from an internal cache of Recorder.
Spliterator<T> iter = rec.cacheIterator();
return StreamSupport.stream(iter, false);
};
}
static class Recorder<T> {
final Spliterator<T> src;
final List<T> cache = new ArrayList<>();
final long estimateSize;
boolean hasNext = true;
public Recorder(Spliterator<T> src) {
this.src = src;
this.estimateSize = src.estimateSize();
}
public synchronized boolean getOrAdvance(
final int index,
Consumer<? super T> cons) {
if (index < cache.size()) {
// If it is in cache then just get if from the corresponding index.
cons.accept(cache.get(index));
return true;
} else if (hasNext)
// If not in cache then advance the src iterator
hasNext = src.tryAdvance(item -> {
cache.add(item);
cons.accept(item);
});
return hasNext;
}
public Spliterator<T> cacheIterator() {
return new Spliterators.AbstractSpliterator<T>(
estimateSize, src.characteristics()
) {
int index = 0;
public boolean tryAdvance(Consumer<? super T> cons) {
return getOrAdvance(index++, cons);
}
public Comparator<? super T> getComparator() {
return src.getComparator();
}
};
}
}
}
You can use Guava's Suppliers#memoize function to turn a given Supplier into a caching ("memoizing") one.
Turn your dataSrc Supplier<Stream<T>> into a Supplier<List<T>> that collects the stream
Wrap it with Suppliers#memoize
This would be your cache() method:
private static <T> Supplier<Stream<T>> cache(Supplier<Stream<T>> dataSrc) {
Supplier<List<T>> memoized = Suppliers.memoize(() -> dataSrc.get().collect(toList()));
return () -> memoized.get().stream();
}
(when mixing in Guava you might need to switch between Guava's version of c.g.c.b.Supplier, and java.util.Supplier, and they can easily be transformed back and forth, however in this case it's not even necessary)
Example
Assume a simple Integer stream that returns the first 5 natural numbers and reports computation to stdout:
private static Supplier<Stream<Integer>> getDataSrc() {
return () -> IntStream.generate(new IntSupplier() {
private int i = 0;
#Override
public int getAsInt() {
System.out.println("Computing next i: " + (i + 1));
return i += 1;
}
}).limit(5).boxed();
}
Then running the non-memoized version
Supplier<Stream<Integer>> dataSrc = getDataSrc();
System.out.println(dataSrc.get().collect(toList()));
System.out.println(dataSrc.get().collect(toList()));
yields
Computing next i: 1
Computing next i: 2
Computing next i: 3
Computing next i: 4
Computing next i: 5
[1, 2, 3, 4, 5]
Computing next i: 1
Computing next i: 2
Computing next i: 3
Computing next i: 4
Computing next i: 5
[1, 2, 3, 4, 5]
And running the memoized version
Supplier<Stream<Integer>> dataSrc = cached(getDataSrc());
System.out.println(dataSrc.get().collect(toList()));
System.out.println(dataSrc.get().collect(toList()));
yields
Computing next i: 1
Computing next i: 2
Computing next i: 3
Computing next i: 4
Computing next i: 5
[1, 2, 3, 4, 5]
[1, 2, 3, 4, 5]
If using the Reactor Project is an option, then you can simply convert the Supplier<Stream<T>> to a Flux<T>, which already provides the utility cache() method and henceforward use Flux<T> operations rather than Stream<T> operations.
Regarding the example of the original post, where temps is a CompletableFuture<Stream<Integer>> with the result of an HTTP request transformed in a sequence of temperatures in Celsius, then we can perform both queries in the following way:
Flux<Integer> cache = Flux.fromStream(temps::join).cache();
cache.distinct().count().subscribe(out::println);
cache.reduce(Integer::max).subscribe(out::println);
This solution avoids: 1) IllegalStateException on further traversals of this sequence; 2) a first traversal to collect items in a cache.

Java - Basic streams usage of forEach

I have a class called Data which has only one method:
public boolean isValid()
I have a Listof Dataand I want to loop through them via a Java 8 stream. I need to count how many valid Data objects there are in this List and print out only the valid entries.
Below is how far I've gotten but I don't understand how.
List<Data> ar = new ArrayList<>();
...
// ar is now full of Data objects.
...
int count = ar.stream()
.filter(Data::isValid)
.forEach(System.out::println)
.count(); // Compiler error, forEach() does not return type stream.
My second attempt: (horrible code)
List<Data> ar = new ArrayList<>();
...
// Must be final or compiler error will happen via inner class.
final AtomicInteger counter = new AtomicInteger();
ar.stream()
.filter(Data:isValid)
.forEach(d ->
{
System.out.println(d);
counter.incrementAndGet();
};
System.out.printf("There are %d/%d valid Data objects.%n", counter.get(), ar.size());
If you don’t need the original ArrayList, containing a mixture of valid and invalid objects, later-on, you might simply perform a Collection operation instead of the Stream operation:
ar.removeIf(d -> !d.isValid());
ar.forEach(System.out::println);
int count = ar.size();
Otherwise, you can implement it like
List<Data> valid = ar.stream().filter(Data::isValid).collect(Collectors.toList());
valid.forEach(System.out::println);
int count = valid.size();
Having a storage for something you need multiple times is not so bad. If the list is really large, you can reduce the storage memory by (typically) factor 32, using
BitSet valid = IntStream.range(0, ar.size())
.filter(index -> ar.get(index).isValid())
.collect(BitSet::new, BitSet::set, BitSet::or);
valid.stream().mapToObj(ar::get).forEach(System.out::println);
int count = valid.cardinality();
Though, of course, you can also use
int count = 0;
for(Data d: ar) {
if(d.isValid()) {
System.out.println(d);
count++;
}
}
Peek is similar to foreach, except that it lets you continue the stream.
ar.stream().filter(Data::isValid)
.peek(System.out::println)
.count();

Java Lambda Multiple operations on a single stream

Scenario: I have a object with 2 functions -->
<Integer> getA(); <Integer> getB();
I have a list of objects, say List<MyObject> myObject.
Objective Iterate over the List and get sum of A and B's in the List of object.
My Solution
myObject.stream().map(a -> a.getA()).collect(Collectors.summingDouble());
myObject.stream().map(a -> a.getB()).collect(Collectors.summingDouble());
The Question: How can I do both of these at the same time? This way I will not have to iterate twice over the original list.
#Edit: I was doing this because some of the filters that I used were of O(n^3). Kind of really bad to do those twice.
Benchmark : It really matters if it is T or 2T when the program runs for half hour on an i5. This was on much lesser data and if I run on a cluster, my data would be larger too.
It does matter if you can do these in one line!.
You need to write another class to store the total values like this:
public class Total {
private int totalA = 0;
private int totalB = 0;
public void addA(int a) {
totalA += a;
}
public void addB(int b) {
totalB += b;
}
public int getTotalA() {
return totalA;
}
public int getTotalB() {
return totalB;
}
}
And then collect the values using
Total total = objects.stream()
.map(o -> (A) o)
.collect(Total::new,
(t, a) -> {
t.addA(a.getA());
t.addB(a.getB());
},
(t1, t2) -> { });
//check total.getTotalA() and total.getTotalB()
You can also use AbstractMap.SimpleEntry<Integer, Integer> to replace Total to avoid writing a new class, but it's still kind of weird because A/B are not in a key-value relationship.
The most optimal probably still would be a for loop. Though the stream alternative has parallelism as option, and is likely to become as efficient soon.
Combining two loops to one is not necessarily faster.
One improvement would be to use int instead of Integer.
List<String> list = Arrays.asList("tom","terry","john","kevin","steve");
int n = list.stream().collect(Collectors.summingInt(String::length));
int h = list.stream().collect(Collectors.summingInt(String::hashCode));
I favour this solution.
If one would make one loop, there are two alternatives:
putting both ints in their own class. You might abuse java.awt.Point class with int x and int y.
putting both ints in a long. When no overflow, then one can even sum in the loop on the long.
The latter:
List<String> list = Arrays.asList("tom","terry","john","kevin","steve");
long nh = list.stream()
.collect(Collectors.summingLong((s) -> (s.hashCode() << 32) | s.length()));
int n = (int) nh;
int h = (int) (nh >> 32L);

How should I check whether a Stream<T> is sorted?

With an Iterable<T>, it's easy:
T last = null;
for (T t : iterable) {
if (last != null && last.compareTo(t) > 0) {
return false;
}
last = t;
}
return true;
But I can't think of a clean way to do the same thing for a Stream<T> that avoids consuming all the elements when it doesn't have to.
There are several methods to iterate over the successive pairs of the stream. For example, you can check this question. Of course my favourite method is to use the library I wrote:
boolean unsorted = StreamEx.of(sourceStream)
.pairMap((a, b) -> a.compareTo(b) > 0)
.has(true);
It's short-circuit operation: it will finish as soon as it find the misorder. Also it works fine with parallel streams.
This is a sequential, state holding solution:
IntStream stream = IntStream.of(3, 3, 5, 6, 6, 9, 10);
final AtomicInteger max = new AtomicInteger(Integer.MIN_VALUE);
boolean sorted = stream.allMatch(n -> n >= max.getAndSet(n));
Parallelizing would need to introduce ranges. The state, max might be dealt with otherwise, but the above seems most simple.
You can grab the Stream's underlying spliterator and check it it has the SORTED characteristic. Since it's a terminal operation, you can't use the Stream after (but you can create another one from this spliterator, see also Convert Iterable to Stream using Java 8 JDK).
For example:
Stream<Integer> st = Stream.of(1, 2, 3);
//false
boolean isSorted = st.spliterator().hasCharacteristics(Spliterator.SORTED);
Stream<Integer> st = Stream.of(1, 2, 3).sorted();
//true
boolean isSorted = st.spliterator().hasCharacteristics(Spliterator.SORTED);
My example shows that the SORTED characteristic appears only if you get the Stream from a source's that reports the SORTED characteristic or you call sorted() at a point on the pipeline.
One could argue that Stream.iterate(0, x -> x + 1); creates a SORTED stream, but there is no knowledge about the semantic of the function applied iteratively. The same applies for Stream.of(...).
If the pipeline is infinite then it's the only way to know. If not, and that the spliterator does not report this characteristic, you'd need to go through the elements and see if it does not satisfy the sorted characteristic you are looking for.
This is what you already done with your iterator approach but then you need to consume some elements of the Stream (in the worst case, all elements). You can make the task parallelizable with some extra code, then it's up to you to see if it's worth it or not...
You could hijack a reduction operation to save the last value and compare it to the current value and throw an exception if it isn't sorted:
.stream().reduce((last, curr) -> {
if (((Comparable)curr).compareTo(last) < 0) {
throw new Exception();
}
return curr;
});
EDIT: I forked another answer's example and replaced it with my code to show it only does the requisite number of checks.
http://ideone.com/ZMGnVW
You could use allMatch with a multi-line lambda, checking the current value against the previous one. You'll have to wrap the last value into an array, though, so the lambda can modify it.
// infinite stream with one pair of unsorted numbers
IntStream s = IntStream.iterate(0, x -> x != 1000 ? x + 2 : x - 1);
// terminates as soon as the first unsorted pair is found
int[] last = {Integer.MIN_VALUE};
boolean sorted = s.allMatch(x -> {
boolean b = x >= last[0]; last[0] = x; return b;
});
Alternatively, just get the iterator from the stream and use a simple loop.
A naive solution uses the stream's Iterator:
public static <T extends Comparable<T>> boolean isSorted(Stream<T> stream) {
Iterator<T> i = stream.iterator();
if(!i.hasNext()) return true;
T current = i.next();
while(i.hasNext()) {
T next = i.next();
if(current == null || current.compareTo(next) > 0) return false;
current = next;
}
return true;
}
Edit: It would also be possible to use a spliterator to parallelize the task, but the gains would be questionable and the increase in complexity is probably not worth it.
I don't know how good it is , but i have just got an idea:
Make a list out of your Stream , Integer or Strings or anything.
i have written this for a List<String> listOfStream:
long countSorted = IntStream.range(1, listOfStream.size())
.map(
index -> {
if (listOfStream.get(index).compareTo(listOfStream.get(index-1)) > 0) {
return 0;
}
return index;
})
.sum();

Perform operation on n random distinct elements from Collection using Streams API

I'm attempting to retrieve n unique random elements for further processing from a Collection using the Streams API in Java 8, however, without much or any luck.
More precisely I'd want something like this:
Set<Integer> subList = new HashSet<>();
Queue<Integer> collection = new PriorityQueue<>();
collection.addAll(Arrays.asList(1,2,3,4,5,6,7,8,9));
Random random = new Random();
int n = 4;
while (subList.size() < n) {
subList.add(collection.get(random.nextInt()));
}
sublist.forEach(v -> v.doSomethingFancy());
I want to do it as efficiently as possible.
Can this be done?
edit: My second attempt -- although not exactly what I was aiming for:
List<Integer> sublist = new ArrayList<>(collection);
Collections.shuffle(sublist);
sublist.stream().limit(n).forEach(v -> v.doSomethingFancy());
edit: Third attempt (inspired by Holger), which will remove a lot of the overhead of shuffle if coll.size() is huge and n is small:
int n = // unique element count
List<Integer> sublist = new ArrayList<>(collection);
Random r = new Random();
for(int i = 0; i < n; i++)
Collections.swap(sublist, i, i + r.nextInt(source.size() - i));
sublist.stream().limit(n).forEach(v -> v.doSomethingFancy());
The shuffling approach works reasonably well, as suggested by fge in a comment and by ZouZou in another answer. Here's a generified version of the shuffling approach:
static <E> List<E> shuffleSelectN(Collection<? extends E> coll, int n) {
assert n <= coll.size();
List<E> list = new ArrayList<>(coll);
Collections.shuffle(list);
return list.subList(0, n);
}
I'll note that using subList is preferable to getting a stream and then calling limit(n), as shown in some other answers, because the resulting stream has a known size and can be split more efficiently.
The shuffling approach has a couple disadvantages. It needs to copy out all the elements, and then it needs to shuffle all the elements. This can be quite expensive if the total number of elements is large and the number of elements to be chosen is small.
An approach suggested by the OP and by a couple other answers is to choose elements at random, while rejecting duplicates, until the desired number of unique elements has been chosen. This works well if the number of elements to choose is small relative to the total, but as the number to choose rises, this slows down quite a bit because of the likelihood of choosing duplicates rises as well.
Wouldn't it be nice if there were a way to make a single pass over the space of input elements and choose exactly the number wanted, with the choices made uniformly at random? It turns out that there is, and as usual, the answer can be found in Knuth. See TAOCP Vol 2, sec 3.4.2, Random Sampling and Shuffling, Algorithm S.
Briefly, the algorithm is to visit each element and decide whether to choose it based on the number of elements visited and the number of elements chosen. In Knuth's notation, suppose you have N elements and you want to choose n of them at random. The next element should be chosen with probability
(n - m) / (N - t)
where t is the number of elements visited so far, and m is the number of elements chosen so far.
It's not at all obvious that this will give a uniform distribution of chosen elements, but apparently it does. The proof is left as an exercise to the reader; see Exercise 3 of this section.
Given this algorithm, it's pretty straightforward to implement it in "conventional" Java by looping over the collection and adding to the result list based on the random test. The OP asked about using streams, so here's a shot at that.
Algorithm S doesn't lend itself obviously to Java stream operations. It's described entirely sequentially, and the decision about whether to select the current element depends on a random decision plus state derived from all previous decisions. That might make it seem inherently sequential, but I've been wrong about that before. I'll just say that it's not immediately obvious how to make this algorithm run in parallel.
There is a way to adapt this algorithm to streams, though. What we need is a stateful predicate. This predicate will return a random result based on a probability determined by the current state, and the state will be updated -- yes, mutated -- based on this random result. This seems hard to run in parallel, but at least it's easy to make thread-safe in case it's run from a parallel stream: just make it synchronized. It'll degrade to running sequentially if the stream is parallel, though.
The implementation is pretty straightforward. Knuth's description uses random numbers between 0 and 1, but the Java Random class lets us choose a random integer within a half-open interval. Thus all we need to do is keep counters of how many elements are left to visit and how many are left to choose, et voila:
/**
* A stateful predicate that, given a total number
* of items and the number to choose, will return 'true'
* the chosen number of times distributed randomly
* across the total number of calls to its test() method.
*/
static class Selector implements Predicate<Object> {
int total; // total number items remaining
int remain; // number of items remaining to select
Random random = new Random();
Selector(int total, int remain) {
this.total = total;
this.remain = remain;
}
#Override
public synchronized boolean test(Object o) {
assert total > 0;
if (random.nextInt(total--) < remain) {
remain--;
return true;
} else {
return false;
}
}
}
Now that we have our predicate, it's easy to use in a stream:
static <E> List<E> randomSelectN(Collection<? extends E> coll, int n) {
assert n <= coll.size();
return coll.stream()
.filter(new Selector(coll.size(), n))
.collect(toList());
}
An alternative also mentioned in the same section of Knuth suggests choosing an element at random with a constant probability of n / N. This is useful if you don't need to choose exactly n elements. It'll choose n elements on average, but of course there will be some variation. If this is acceptable, the stateful predicate becomes much simpler. Instead of writing a whole class, we can simply create the random state and capture it from a local variable:
/**
* Returns a predicate that evaluates to true with a probability
* of toChoose/total.
*/
static Predicate<Object> randomPredicate(int total, int toChoose) {
Random random = new Random();
return obj -> random.nextInt(total) < toChoose;
}
To use this, replace the filter line in the stream pipeline above with
.filter(randomPredicate(coll.size(), n))
Finally, for comparison purposes, here's an implementation of the selection algorithm written using conventional Java, that is, using a for-loop and adding to a collection:
static <E> List<E> conventionalSelectN(Collection<? extends E> coll, int remain) {
assert remain <= coll.size();
int total = coll.size();
List<E> result = new ArrayList<>(remain);
Random random = new Random();
for (E e : coll) {
if (random.nextInt(total--) < remain) {
remain--;
result.add(e);
}
}
return result;
}
This is quite straightforward, and there's nothing really wrong with this. It's simpler and more self-contained than the stream approach. Still, the streams approach illustrates some interesting techniques that might be useful in other contexts.
Reference:
Knuth, Donald E. The Art of Computer Programming: Volume 2, Seminumerical Algorithms, 2nd edition. Copyright 1981, 1969 Addison-Wesley.
You could always create a "dumb" comparator, that will compare elements randomly in the list. Calling distinct() will ensure you that the elements are unique (from the queue).
Something like this:
static List<Integer> nDistinct(Collection<Integer> queue, int n) {
final Random rand = new Random();
return queue.stream()
.distinct()
.sorted(Comparator.comparingInt(a -> rand.nextInt()))
.limit(n)
.collect(Collectors.toList());
}
However I'm not sure it will be more efficient that putting the elements in the list, shuffling it and return a sublist.
static List<Integer> nDistinct(Collection<Integer> queue, int n) {
List<Integer> list = new ArrayList<>(queue);
Collections.shuffle(list);
return list.subList(0, n);
}
Oh, and it's probably semantically better to return a Set instead of a List since the elements are distincts. The methods are also designed to take Integers, but there's no difficulty to design them to be generic. :)
Just as a note, the Stream API looks like a tool box that we could use for everything, however that's not always the case. As you see, the second method is more readable (IMO), probably more efficient and doesn't have much more code (even less!).
As an addendum to the shuffle approach of the accepted answer:
If you want to select only a few items from a large list and want to avoid the overhead of shuffling the entire list you can solve the task as follows:
public static <T> List<T> getRandom(List<T> source, int num) {
Random r=new Random();
for(int i=0; i<num; i++)
Collections.swap(source, i, i+r.nextInt(source.size()-i));
return source.subList(0, num);
}
What it does is very similar to what shuffle does but it reduces it’s action to having only num random elements rather than source.size() random elements…
You can use limit to solve your problem.
http://docs.oracle.com/javase/8/docs/api/java/util/stream/Stream.html#limit-long-
Collections.shuffle(collection);
int howManyDoYouWant = 10;
List<Integer> smallerCollection = collection
.stream()
.limit(howManyDoYouWant)
.collect(Collectors.toList());
List<Integer> collection = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
int n = 4;
Random random = ThreadLocalRandom.current();
random.ints(0, collection.size())
.distinct()
.limit(n)
.mapToObj(collection::get)
.forEach(System.out::println);
This will of course have the overhead of the intermediate set of indexes and it will hang forever if n > collection.size().
If you want to avoid any non-constatn overhead, you'll have to make a stateful Predicate.
It should be clear that streaming the collection is not what you want.
Use the generate() and limit methods:
Stream.generate(() -> list.get(new Random().nextInt(list.size())).limit(3).forEach(...);
If you want to process the whole Stream without too much hassle, you can simply create your own Collector using Collectors.collectingAndThen():
public static <T> Collector<T, ?, Stream<T>> toEagerShuffledStream() {
return Collectors.collectingAndThen(
toList(),
list -> {
Collections.shuffle(list);
return list.stream();
});
}
But this won't perform well if you want to limit() the resulting Stream. In order to overcome this, one could create a custom Spliterator:
package com.pivovarit.stream;
import java.util.List;
import java.util.Random;
import java.util.Spliterator;
import java.util.function.Consumer;
import java.util.function.Supplier;
public class ImprovedRandomSpliterator<T> implements Spliterator<T> {
private final Random random;
private final T[] source;
private int size;
ImprovedRandomSpliterator(List<T> source, Supplier<? extends Random> random) {
if (source.isEmpty()) {
throw new IllegalArgumentException("RandomSpliterator can't be initialized with an empty collection");
}
this.source = (T[]) source.toArray();
this.random = random.get();
this.size = this.source.length;
}
#Override
public boolean tryAdvance(Consumer<? super T> action) {
int nextIdx = random.nextInt(size);
int lastIdx = size - 1;
action.accept(source[nextIdx]);
source[nextIdx] = source[lastIdx];
source[lastIdx] = null; // let object be GCed
return --size > 0;
}
#Override
public Spliterator<T> trySplit() {
return null;
}
#Override
public long estimateSize() {
return source.length;
}
#Override
public int characteristics() {
return SIZED;
}
}
and then:
public final class RandomCollectors {
private RandomCollectors() {
}
public static <T> Collector<T, ?, Stream<T>> toImprovedLazyShuffledStream() {
return Collectors.collectingAndThen(
toCollection(ArrayList::new),
list -> !list.isEmpty()
? StreamSupport.stream(new ImprovedRandomSpliterator<>(list, Random::new), false)
: Stream.empty());
}
public static <T> Collector<T, ?, Stream<T>> toEagerShuffledStream() {
return Collectors.collectingAndThen(
toCollection(ArrayList::new),
list -> {
Collections.shuffle(list);
return list.stream();
});
}
}
And then you could use it like:
stream
.collect(toLazyShuffledStream()) // or toEagerShuffledStream() depending on the use case
.distinct()
.limit(42)
.forEach( ... );
A detailed explanation can be found here.
If you want a random sample of elements from a stream, a lazy alternative to shuffling might be a filter based on the uniform distribution:
...
import org.apache.commons.lang3.RandomUtils
// If you don't know ntotal, just use a 0-1 ratio
var relativeSize = nsample / ntotal;
Stream.of (...) // or any other stream
.parallel() // can work in parallel
.filter ( e -> Math.random() < relativeSize )
// or any other stream operation
.forEach ( e -> System.out.println ( "I've got: " + e ) );

Categories