I have a function that returns a Flux<Integer>. This flux is hot: it is emitting live data. After some time of execution, I want to block until the next Integer is emitted, and assign it to a variable. This Integer may not be the first and will not be the last.
I considered blockFirst(), but this would block indefinitely since the Flux has already emitted an Integer. I do not care whether the Integer is the first or last in the Flux; I just want to block until the next Integer is emitted and assign it to a variable.
How would I do this? I think I could subscribe to the Flux and, after the first value, unsubscribe, but I am hoping there is a better way to do this.
It depends on the replay/buffering behavior of your hot flux. Both blockFirst() and the next() operator do the same thing: they wait for the first value received in the current subscription.
It is very important to understand that, because in the case of hot fluxes, subscription is independent of source data emission. The first value is not necessarily the first value emitted upstream. It is the first value received by your current subscriber, and that depends on the upstream flow behaviour.
In the case of hot fluxes, how they pass values to subscribers depends on both their buffering and broadcast strategies. For this answer, I will focus only on the buffering aspect:
If your hot flux does not buffer any emitted value (e.g. Flux.share(), Sinks.many().multicast().directBestEffort()), then both blockFirst() and next().block() meet your requirement: they wait until the next emitted live value, in a blocking fashion.
NOTE: next() has the advantage that it can become non-blocking: just replace block() with cache() and subscribe().
If your upstream flux does buffer some past values, then your subscriber / downstream flow will not receive only the live stream. Before it, it will receive (part of) the upstream history. In such a case, you will have to use a more advanced strategy to skip values until the one you want.
From your question, I would say that skipping values until some time has elapsed can be done using skipUntilOther(Mono.delay(wantedDuration)).
But be careful, because the delay starts from your subscription, not from the upstream subscription (to do that, you would need the upstream to provide timed elements, and then switch to another strategy).
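A rough sketch of that timed-elements idea, assuming you control the upstream and can timestamp values before they are cached (the class and variable names here are mine, for illustration only):
import java.time.Duration;
import reactor.core.publisher.Flux;
import reactor.util.function.Tuple2;

public class TimedElementsSketch {
    public static void main(String[] args) {
        // Assumption: we control the upstream, so we can timestamp BEFORE caching.
        Flux<Tuple2<Long, Long>> hotWithTime = Flux.interval(Duration.ofMillis(100))
                .timestamp()   // attaches the emission time (epoch millis) to each value
                .cache();      // buffering hot flux: replays history with original timestamps

        hotWithTime.subscribe(); // start emission

        long cutoff = System.currentTimeMillis() + 500;

        // Skip the replayed history: keep only elements emitted after the cutoff.
        Long next = hotWithTime
                .filter(t -> t.getT1() >= cutoff)
                .map(Tuple2::getT2)
                .blockFirst();
        System.out.println("next live value: " + next);
    }
}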
It is also important to know that Reactor forbids calling block() from some threads (the ones used by non-elastic schedulers).
Let's demonstrate all of that with code. In the program below, there are four examples:
Use next/blockFirst directly on a non-buffering hot flux
Use skipUntilOther on a buffering hot flux
Show that blocking can fail sometimes
Try to avoid block operation
All examples are commented for clarity, and launched sequentially in a main function:
import java.time.Duration;
import java.util.concurrent.CountDownLatch;

import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

public class HotFlux {

    public static void main(String[] args) throws Exception {
        System.out.println("== 1. Hot flux without any buffering");
        noBuffer();

        System.out.println("== 2. Hot flux with buffering");
        withBuffer();

        // REMINDER: the block operator is not always accepted by Reactor
        System.out.println("== 3. block called from a wrong context");
        blockFailsOnSomeSchedulers();

        System.out.println("== 4. Next value without blocking");
        avoidBlocking();
    }

    static void noBuffer() {
        // Prepare a hot flux thanks to share().
        var hotFlux = Flux.interval(Duration.ofMillis(100))
                          .share();

        // Prepare an operator that fetches the next value from the live stream after a delay.
        var nextValue = Mono.delay(Duration.ofMillis(300))
                            .then(hotFlux.next());

        // Launch live data emission
        var livestream = hotFlux.subscribe(i -> System.out.println("Emitted: " + i));
        try {
            // Trigger value fetching after a delay
            var value = nextValue.block();
            System.out.println("next() -> " + value);

            // Immediately try to block until the next value is available
            System.out.println("blockFirst() -> " + hotFlux.blockFirst());
        } finally {
            // Stop live data production
            livestream.dispose();
        }
    }

    static void withBuffer() {
        // Prepare a hot flux replaying all values emitted in the past to each subscriber
        var hotFlux = Flux.interval(Duration.ofMillis(100))
                          .cache();

        // Launch live data emission
        var livestream = hotFlux.subscribe(i -> System.out.println("Emitted: " + i));
        try {
            // Wait half a second, then get the next emitted value.
            var value = hotFlux.skipUntilOther(Mono.delay(Duration.ofMillis(500)))
                               .next()
                               .block();
            System.out.println("skipUntilOther + next: " + value);

            // blockFirst() can also be used
            value = hotFlux.skipUntilOther(Mono.delay(Duration.ofMillis(500)))
                           .blockFirst();
            System.out.println("skipUntilOther + blockFirst: " + value);
        } finally {
            // Stop live data production
            livestream.dispose();
        }
    }

    public static void blockFailsOnSomeSchedulers() throws InterruptedException {
        var hotFlux = Flux.interval(Duration.ofMillis(100)).share();

        var barrier = new CountDownLatch(1);
        var forbiddenInnerBlock = Mono.delay(Duration.ofMillis(200))
                // This block will fail, because the delay op above is scheduled on the parallel scheduler by default.
                .then(Mono.fromCallable(() -> hotFlux.blockFirst()))
                .doFinally(signal -> barrier.countDown());

        forbiddenInnerBlock.subscribe(value -> System.out.println("Block success: " + value),
                                      err -> System.out.println("BLOCK FAILED: " + err.getMessage()));
        barrier.await();
    }

    static void avoidBlocking() throws InterruptedException {
        var hotFlux = Flux.interval(Duration.ofMillis(100)).share();

        var asyncValue = hotFlux.skipUntilOther(Mono.delay(Duration.ofMillis(500)))
                                .next()
                                // timed() will let us verify that the value has been fetched once, then cached properly
                                .timed()
                                .cache();

        asyncValue.subscribe(); // launch immediately

        // The barrier is required because we're in a main/test program. If you intend to reuse the mono in a bigger scope, you do not need it.
        CountDownLatch barrier = new CountDownLatch(2);

        // Both subscribers receive the same timestamped value, because it has been fetched previously and cached.
        asyncValue.subscribe(value -> System.out.println("Get value (1): " + value), err -> barrier.countDown(), () -> barrier.countDown());
        asyncValue.subscribe(value -> System.out.println("Get value (2): " + value), err -> barrier.countDown(), () -> barrier.countDown());

        barrier.await();
    }
}
This program gives the following output:
== 1. Hot flux without any buffering
Emitted: 0
Emitted: 1
Emitted: 2
Emitted: 3
next() -> 3
Emitted: 4
blockFirst() -> 4
== 2. Hot flux with buffering
Emitted: 0
Emitted: 1
Emitted: 2
Emitted: 3
Emitted: 4
Emitted: 5
skipUntilOther + next: 5
Emitted: 6
Emitted: 7
Emitted: 8
Emitted: 9
Emitted: 10
Emitted: 11
skipUntilOther + blockFirst: 11
== 3. block called from a wrong context
BLOCK FAILED: block()/blockFirst()/blockLast() are blocking, which is not supported in thread parallel-6
== 4. Next value without blocking
Get value (1): Timed(4){eventElapsedNanos=500247504, eventElapsedSinceSubscriptionNanos=500247504, eventTimestampEpochMillis=1674831275873}
Get value (2): Timed(4){eventElapsedNanos=500247504, eventElapsedSinceSubscriptionNanos=500247504, eventTimestampEpochMillis=1674831275873}
I found some surprising behavior with Java parallel streams. I made my own Spliterator, and the resulting parallel stream gets divided up until each stream has only one element in it. That seems way too small, and I wonder what I'm doing wrong. I'm hoping there are some characteristics I can set to correct this.
Here's my test code. The Float here is just a dummy payload; my real stream class is somewhat more complicated.
public static void main( String[] args ) {
    TestingSpliterator splits = new TestingSpliterator( 10 );
    Stream<Float> test = StreamSupport.stream( splits, true );

    double total = test.mapToDouble( Float::doubleValue ).sum();
    System.out.println( "Total: " + total );
}
This code will continually split the stream until each Spliterator has only one element. That seems like far too much splitting to be efficient.
Output:
run:
Split on count: 10
Split on count: 5
Split on count: 3
Split on count: 5
Split on count: 2
Split on count: 2
Split on count: 3
Split on count: 2
Split on count: 2
Total: 5.164293184876442
BUILD SUCCESSFUL (total time: 0 seconds)
Here's the code of the Spliterator. My main concern is which characteristics I should be using, but perhaps there's a problem somewhere else?
public class TestingSpliterator implements Spliterator<Float> {
    int count;
    int splits;

    public TestingSpliterator( int count ) {
        this.count = count;
    }

    @Override
    public boolean tryAdvance( Consumer<? super Float> cnsmr ) {
        if( count > 0 ) {
            cnsmr.accept( (float)Math.random() );
            count--;
            return true;
        } else
            return false;
    }

    @Override
    public Spliterator<Float> trySplit() {
        System.err.println( "Split on count: " + count );
        if( count > 1 ) {
            splits++;
            int half = count / 2;
            TestingSpliterator newSplit = new TestingSpliterator( count - half );
            count = half;
            return newSplit;
        } else
            return null;
    }

    @Override
    public long estimateSize() {
        return count;
    }

    @Override
    public int characteristics() {
        return IMMUTABLE | SIZED;
    }
}
So how can I get the stream to be split into much larger chunks? I was hoping for chunks in the neighborhood of 10,000 to 50,000 elements.
I know I can return null from the trySplit() method, but that seems like a backwards way of doing it. It seems like the system should have some notion of the number of cores, the current load, and how complex the code that uses the stream is, and adjust itself accordingly. In other words, I want the stream chunk size to be externally configured, not internally fixed by the stream itself.
EDIT: re. Holger's answer below, when I increase the number of elements in the original stream, the streams are split somewhat less, so StreamSupport does stop splitting eventually.
At an initial stream size of 100 elements, StreamSupport stops splitting when it reaches a stream size of 2 (the last line I see on my screen is Split on count: 4).
And for an initial stream size of 1000 elements, the final size of the individual stream chunks is about 32 elements.
Edit part deux: After looking at the output of the above, I changed my code to list out the individual Spliterators created. Here are the changes:
public static void main( String[] args ) {
    TestingSpliterator splits = new TestingSpliterator( 100 );
    Stream<Float> test = StreamSupport.stream( splits, true );

    double total = test.mapToDouble( Float::doubleValue ).sum();
    System.out.println( "Total Spliterators: " + testers.size() );
    for( TestingSpliterator t : testers ) {
        System.out.println( "Splits: " + t.splits );
    }
}
And to the TestingSpliterator's ctor:
static Queue<TestingSpliterator> testers = new ConcurrentLinkedQueue<>();

public TestingSpliterator( int count ) {
    this.count = count;
    testers.add( this ); // OUCH! 'this' escape
}
The result of this code is that the first Spliterator gets split 5 times, the next Spliterator gets split 4 times, the next set of Spliterators gets split 3 times, etc. The result is that 36 Spliterators are made and the stream is split into as many parts. On typical desktop systems this seems to be what the API thinks is the best strategy for parallel operations.
I'm going to accept Holger's answer below, which is essentially that the StreamSupport class is doing the right thing, don't worry, be happy. Part of the issue for me was that I was doing my early testing on very small stream sizes and I was surprised at the number of splits. Don't make the same mistake yourself.
You are looking at it from the wrong angle. The implementation did not split “until each spliterator has one element”; it rather split “until having ten spliterators”.
A single spliterator instance can only be processed by one thread. A spliterator is not required to support splitting after its traversal has been started. Therefore any splitting opportunity that has not been used beforehand may lead to limited parallel processing capabilities afterwards.
It’s important to keep in mind that the Stream implementation received a ToDoubleFunction with an unknown workload¹. It doesn’t know that it is as simple as Float::doubleValue in your case. It could be a function taking a minute to evaluate, and then having a spliterator per CPU core would be exactly right. Even having more spliterators than CPU cores is a valid strategy to handle the possibility that some evaluations take significantly longer than others.
A typical number of initial spliterators will be “number of CPU cores” × 4, though there might be more split operations later, when more knowledge about the actual workload exists. When your input data has fewer elements than that number, it’s not surprising that it gets split down until one element per spliterator is left.
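If you want to see the numbers behind that heuristic on your own machine, here is a minimal sketch. Note that the “× 4” target is an internal implementation detail of the stream library, not a public API, so the last line is only an approximation:
import java.util.concurrent.ForkJoinPool;

public class ParallelismInfo {
    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        // Parallel streams run on the common ForkJoinPool.
        int parallelism = ForkJoinPool.commonPool().getParallelism();

        System.out.println("CPU cores:               " + cores);
        System.out.println("Common pool parallelism: " + parallelism);
        // The "cores × 4" figure mentioned above; an internal heuristic,
        // shown here for illustration only.
        System.out.println("Approx. split target:    " + (cores * 4));
    }
}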
You may try with new TestingSpliterator( 10000 ) or 1000 or 100 to see that the number of splits will not change significantly once the implementation assumes it has enough chunks to keep all CPU cores busy.
Since your spliterator does not know anything about the per-element workload of the consuming stream either, you shouldn’t be concerned about this. If you can smoothly support splitting down to single elements, just do that.
¹ It doesn’t have special optimizations for the case that no operations have been chained, though.
Unless I am missing the obvious, you could always pass a bufferSize in the constructor and use that in your trySplit():
@Override
public Spliterator<Float> trySplit() {
    if( count > 1 ) {
        splits++;
        if( count > bufferSize ) {
            count = count - bufferSize;
            return new TestingSpliterator( bufferSize, bufferSize );
        }
    }
    return null;
}
And with this:
TestingSpliterator splits = new TestingSpliterator(12, 5);
Stream<Float> test = StreamSupport.stream(splits, true);

test.map(x -> new AbstractMap.SimpleEntry<>(
             x.doubleValue(),
             Thread.currentThread().getName()))
    .collect(Collectors.groupingBy(
             Map.Entry::getValue,
             Collectors.mapping(
                 Map.Entry::getKey,
                 Collectors.toList())))
    .forEach((x, y) -> System.out.println("Thread : " + x + " processed : " + y));
You would see that there are 3 threads. Two of them process 5 elements each, and one processes 2.
import java.util.ArrayList;
import java.util.List;

public class IterationBenchmark {
    public static void main(String args[]) {
        List<String> persons = new ArrayList<String>();
        persons.add("AAA");
        persons.add("BBB");
        persons.add("CCC");
        persons.add("DDD");

        long timeMillis = System.currentTimeMillis();
        for(String person : persons)
            System.out.println(person);
        System.out.println("Time taken for legacy for loop : " +
                (System.currentTimeMillis() - timeMillis));

        timeMillis = System.currentTimeMillis();
        persons.stream().forEach(System.out::println);
        System.out.println("Time taken for sequence stream : " +
                (System.currentTimeMillis() - timeMillis));

        timeMillis = System.currentTimeMillis();
        persons.parallelStream().forEach(System.out::println);
        System.out.println("Time taken for parallel stream : " +
                (System.currentTimeMillis() - timeMillis));
    }
}
Output:
AAA
BBB
CCC
DDD
Time taken for legacy for loop : 0
AAA
BBB
CCC
DDD
Time taken for sequence stream : 49
CCC
DDD
AAA
BBB
Time taken for parallel stream : 3
Why is the Java 8 Stream API performance so low compared to the legacy for loop?
The very first call to the Stream API in your program is always quite slow, because you need to load many auxiliary classes, generate many anonymous classes for the lambdas and JIT-compile many methods. Thus the very first Stream operation usually takes several dozen milliseconds. Subsequent calls are much faster and may fall below 1 µs depending on the exact stream operation. If you exchange the parallel-stream test and the sequential-stream test, the sequential stream will be much faster: all the hard work is done by whichever comes first.
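To illustrate that one-time cost, here is a rough sketch (not a rigorous benchmark; the class name is mine): one throwaway stream operation before the timed section absorbs the class-loading and lambda-bootstrap work, and the timed call becomes fast.
import java.util.Arrays;
import java.util.List;

public class WarmupDemo {
    public static void main(String[] args) {
        List<String> persons = Arrays.asList("AAA", "BBB", "CCC", "DDD");

        // Throwaway call: loads the Stream classes and bootstraps the lambda
        // machinery before we start measuring.
        persons.stream().forEach(s -> { });

        long t = System.currentTimeMillis();
        persons.stream().forEach(System.out::println);
        System.out.println("Sequential stream after warm-up: "
                + (System.currentTimeMillis() - t) + " ms");
    }
}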
Let's write a JMH benchmark to properly warm-up your code and test all the cases independently:
import java.util.concurrent.TimeUnit;
import java.util.*;
import java.util.stream.*;
import org.openjdk.jmh.annotations.*;

@Warmup(iterations = 5, time = 1000, timeUnit = TimeUnit.MILLISECONDS)
@Measurement(iterations = 10, time = 1000, timeUnit = TimeUnit.MILLISECONDS)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Fork(3)
@State(Scope.Benchmark)
public class StreamTest {
    List<String> persons;

    @Setup
    public void setup() {
        persons = new ArrayList<String>();
        persons.add("AAA");
        persons.add("BBB");
        persons.add("CCC");
        persons.add("DDD");
    }

    @Benchmark
    public void loop() {
        for(String person : persons)
            System.err.println(person);
    }

    @Benchmark
    public void stream() {
        persons.stream().forEach(System.err::println);
    }

    @Benchmark
    public void parallelStream() {
        persons.parallelStream().forEach(System.err::println);
    }
}
Here we have three tests: loop, stream and parallelStream. Note that I changed System.out to System.err, because System.out is normally used to output the JMH results. I will redirect the output of System.err to nul, so the result should depend less on my filesystem or console subsystem (which is especially slow on Windows).
So the results are (Core i7-4702MQ CPU @ 2.2GHz, 4 cores HT, Win7, Oracle JDK 1.8.0_40):
Benchmark Mode Cnt Score Error Units
StreamTest.loop avgt 30 42.410 ± 1.833 us/op
StreamTest.parallelStream avgt 30 76.440 ± 2.073 us/op
StreamTest.stream avgt 30 42.820 ± 1.389 us/op
What we see is that stream and loop produce practically the same result; the difference is statistically insignificant. Actually, the Stream API is somewhat slower than the loop, but the slowest part here is the PrintStream. Even with output redirected to nul, the IO subsystem is very slow compared to the other operations. So we measured not the Stream API or loop speed, but the println speed.
Also note that the units are microseconds; the stream version thus actually runs about 1000 times faster than in your test.
Why is parallelStream much slower? Simply because you cannot parallelize writes to the same PrintStream: it is internally synchronized. So the parallelStream did all the hard work of splitting the 4-element list into 4 sub-tasks, scheduling the jobs on different threads and synchronizing them properly, but it was absolutely futile, as the slowest operation (println) cannot run in parallel: while one thread is working, the others are waiting. In general, it's useless to parallelize code that synchronizes on the same mutex (which is your case).
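To see parallelStream actually pay off, give it CPU-bound work with no shared lock. A minimal sketch (the work() function and the element count are arbitrary choices of mine, not from the question):
import java.util.ArrayList;
import java.util.List;

public class ParallelCpuBoundDemo {
    // CPU-bound work per element, with no shared lock, so threads do not serialize.
    static double work(int seed) {
        double x = seed;
        for (int i = 0; i < 10_000; i++) x = Math.sin(x) + Math.cos(x);
        return x;
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) data.add(i);

        // Warm up both paths first (see above), then time them.
        data.stream().mapToDouble(ParallelCpuBoundDemo::work).sum();
        data.parallelStream().mapToDouble(ParallelCpuBoundDemo::work).sum();

        long t0 = System.nanoTime();
        double s1 = data.stream().mapToDouble(ParallelCpuBoundDemo::work).sum();
        long t1 = System.nanoTime();
        double s2 = data.parallelStream().mapToDouble(ParallelCpuBoundDemo::work).sum();
        long t2 = System.nanoTime();

        System.out.printf("sequential: %d ms (sum=%f)%n", (t1 - t0) / 1_000_000, s1);
        System.out.printf("parallel:   %d ms (sum=%f)%n", (t2 - t1) / 1_000_000, s2);
    }
}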
I had some fun comparing the speed of the removeAll(Collection<?> c) call declared in Collection. Now I know that micro-benchmarks are difficult to do right, and I won’t look at a few milliseconds difference, but I believe my results to be valid, since I ran them repeatedly and they are very reproducible.
Let’s assume I have two collections that are not too tiny, say 100,000 consecutive integer elements, and also that they mostly overlap, for instance 5,000 are in the left but not the right. Now I simply call:
left.removeAll(right);
Of course this all depends on the types of both the left and the right collection. It’s blazingly fast if the right collection is a hash map, because that’s where the look-ups are done. But looking closer, I noticed two results that I cannot explain. I tried all the tests both with an ArrayList that is sorted and with another that is shuffled (using Collections.shuffle(), if that is of importance).
The first weird result is:
00293 025% shuffled ArrayList, HashSet
00090 008% sorted ArrayList, HashSet
Now either removing elements from the sorted ArrayList is faster than removing them from the shuffled list, or looking up consecutive values in the HashSet is faster than looking up random values.
Now the other one:
02311 011% sorted ArrayList, shuffled ArrayList
01401 006% sorted ArrayList, sorted ArrayList
Now this suggests that the lookup in the sorted ArrayList (using a contains() call for each element of the left list) is faster than in the shuffled list. That would be quite easy to explain if we could make use of the fact that the list is sorted and use a binary search, but I do not do that.
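(For illustration only, here is roughly what exploiting the ordering would look like; again, my benchmark does not do this:)
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SortedLookup {
    // Hypothetical helper, NOT used in my benchmark: a lookup that actually
    // exploits the sorted order, O(log n) per call instead of O(n).
    static boolean sortedContains(List<Integer> sortedList, Integer key) {
        return Collections.binarySearch(sortedList, key) >= 0;
    }

    public static void main(String[] args) {
        List<Integer> sorted = Arrays.asList(1, 3, 5, 7, 9);
        System.out.println(sortedContains(sorted, 7));  // true
        System.out.println(sortedContains(sorted, 4));  // false
    }
}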
Both results are mysterious to me. I cannot explain them by looking at the code or with my data-structure knowledge. Does it have anything to do with processor cache access patterns? Is the JIT compiler optimizing stuff? But if so, which? I performed warming up and run the tests a few times in a row, but perhaps there is a fundamental problem with my benchmark?
The reason for the performance difference is the memory access pattern: accessing elements that are consecutive in memory is faster than doing random memory accesses (due to memory prefetching, CPU caches, etc.).
When you initially populate the collection, you create all the elements sequentially in memory, so when you traverse it (forEach, removeAll, etc.) you access consecutive memory regions, which is cache friendly. When you shuffle the collection, the elements remain in the same order in memory, but the pointers to those elements are no longer in that order, so while traversing the collection you'll access, for instance, the 10th, then the 1st, then the 5th element, which is very cache unfriendly and ruins the performance.
You can look at this question where this effect is visible in greater detail:
Why filtering an unsorted list is faster than filtering a sorted list
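To make the effect tangible, here is a small sketch of mine (not from the linked question) that sums the same boxed Integers once in allocation order and once through shuffled references. The absolute numbers will vary by machine and GC behavior:
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class AccessPatternDemo {
    public static void main(String[] args) {
        // Boxed values allocated one after another: the objects end up
        // roughly consecutive on the heap.
        List<Integer> inOrder = new ArrayList<>();
        for (int i = 0; i < 10_000_000; i++) inOrder.add(i);

        // Same objects, but the references are visited in random order,
        // so traversal jumps around the heap.
        List<Integer> shuffled = new ArrayList<>(inOrder);
        Collections.shuffle(shuffled);

        System.out.println("in order: " + timeSum(inOrder) + " ms");
        System.out.println("shuffled: " + timeSum(shuffled) + " ms");
    }

    static long timeSum(List<Integer> list) {
        long start = System.currentTimeMillis();
        long sum = 0;
        for (Integer x : list) sum += x;
        long elapsed = System.currentTimeMillis() - start;
        System.out.println("  (sum = " + sum + ")"); // keep the result observable
        return elapsed;
    }
}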
Since the asker did not provide any example code, and there have been doubts about the benchmark mentioned in the comments and answers, I created a small test to see whether the removeAll method is slower when the argument is a shuffled list (instead of a sorted list). And I confirmed the observation of the asker: The output of the test was roughly
100000 elements, sortedList and sortedList, 5023,090 ms, size 5000
100000 elements, shuffledList and sortedList, 5062,293 ms, size 5000
100000 elements, sortedList and shuffledList, 10657,438 ms, size 5000
100000 elements, shuffledList and shuffledList, 10700,145 ms, size 5000
I'll omit the code for this particular test here, because it also has been questioned (which - by the way - is perfectly justified! A lot of BS is posted on the web...).
So I did further tests, for which I'll provide the code here.
This may also not be considered as a definite answer. But I tried to adjust the tests so that they at least provide some strong evidence that the reason for the reduced performance is indeed what Svetlin Zarev mentioned in his answer (+1 and accept this if it convinces you). Namely, that the reason for the slowdown lies in the caching effects of the scattered accesses.
First of all: I am aware of many of the possible pitfalls when writing a microbenchmark (and so is the asker, according to his statements). However, I know that nobody will believe a hand-written benchmark, even if it is perfectly reasonable, unless it is performed with an appropriate microbenchmarking tool. So in order to show that the performance with a shuffled list is lower than with a sorted list, I created this simple JMH benchmark:
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Thread)
public class RemoveAllBenchmarkJMH
{
    @Param({"sorted", "shuffled"})
    public String method;

    @Param({"1000", "10000", "100000"})
    public int numElements;

    private List<Integer> left;
    private List<Integer> right;

    @Setup
    public void initList()
    {
        left = new ArrayList<Integer>();
        right = new ArrayList<Integer>();
        for (int i = 0; i < numElements; i++)
        {
            left.add(i);
        }
        int n = (int) (numElements * 0.95);
        for (int i = 0; i < n; i++)
        {
            right.add(i);
        }
        if (method.equals("shuffled"))
        {
            Collections.shuffle(right);
        }
    }

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.MICROSECONDS)
    public void testMethod(Blackhole bh)
    {
        left.removeAll(right);
        bh.consume(left.size());
    }
}
The output of this one is as follows:
(method) (numElements) Mode Cnt Score Error Units
sorted 1000 avgt 50 52,055 ± 0,507 us/op
shuffled 1000 avgt 50 55,720 ± 0,466 us/op
sorted 10000 avgt 50 5341,917 ± 28,630 us/op
shuffled 10000 avgt 50 7108,845 ± 45,869 us/op
sorted 100000 avgt 50 621714,569 ± 19040,964 us/op
shuffled 100000 avgt 50 1110301,876 ± 22935,976 us/op
I hope that this helps to resolve doubts about the statement itself. I admit that I'm not a JMH expert, though; if there is something wrong with this benchmark, please let me know.
Now, these results have been roughly in line with my other, manual (non-JMH) microbenchmark. In order to create evidence that the shuffling is the problem, I created a small test that compares the performance using lists that are shuffled to different degrees. By providing a value between 0.0 and 1.0, one can limit the number of swapped elements, and thus the "shuffledness" of the list. (Of course, this is rather "pragmatic", as there are different options for how this could be implemented, considering the different possible (statistical) measures of "shuffledness".)
The code looks as follows:
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.function.Function;

public class RemoveAllBenchmarkExt
{
    public static void main(String[] args)
    {
        for (int n = 10000; n <= 100000; n += 10000)
        {
            runTest(n, sortedList(), sortedList());
            runTest(n, sortedList(), shuffledList(0.00));
            runTest(n, sortedList(), shuffledList(0.25));
            runTest(n, sortedList(), shuffledList(0.50));
            runTest(n, sortedList(), shuffledList(0.75));
            runTest(n, sortedList(), shuffledList(1.00));
            runTest(n, sortedList(), reversedList());
            System.out.println();
        }
    }

    private static Function<Integer, Collection<Integer>> sortedList()
    {
        return new Function<Integer, Collection<Integer>>()
        {
            @Override
            public Collection<Integer> apply(Integer t)
            {
                List<Integer> list = new ArrayList<Integer>(t);
                for (int i = 0; i < t; i++)
                {
                    list.add(i);
                }
                return list;
            }

            @Override
            public String toString()
            {
                return "sorted";
            }
        };
    }

    private static Function<Integer, Collection<Integer>> shuffledList(
        final double degree)
    {
        return new Function<Integer, Collection<Integer>>()
        {
            @Override
            public Collection<Integer> apply(Integer t)
            {
                List<Integer> list = new ArrayList<Integer>(t);
                for (int i = 0; i < t; i++)
                {
                    list.add(i);
                }
                shuffle(list, degree);
                return list;
            }

            @Override
            public String toString()
            {
                return String.format("shuffled(%4.2f)", degree);
            }
        };
    }

    private static void shuffle(List<Integer> list, double degree)
    {
        Random random = new Random(0);
        int n = (int) (degree * list.size());
        for (int i = n; i > 1; i--)
        {
            swap(list, i - 1, random.nextInt(i));
        }
    }

    private static void swap(List<Integer> list, int i, int j)
    {
        list.set(i, list.set(j, list.get(i)));
    }

    private static Function<Integer, Collection<Integer>> reversedList()
    {
        return new Function<Integer, Collection<Integer>>()
        {
            @Override
            public Collection<Integer> apply(Integer t)
            {
                List<Integer> list = new ArrayList<Integer>(t);
                for (int i = 0; i < t; i++)
                {
                    list.add(i);
                }
                Collections.reverse(list);
                return list;
            }

            @Override
            public String toString()
            {
                return "reversed";
            }
        };
    }

    private static void runTest(int n,
        Function<Integer, ? extends Collection<Integer>> leftFunction,
        Function<Integer, ? extends Collection<Integer>> rightFunction)
    {
        Collection<Integer> left = leftFunction.apply(n);
        Collection<Integer> right = rightFunction.apply((int) (n * 0.95));

        long before = System.nanoTime();
        left.removeAll(right);
        long after = System.nanoTime();

        double durationMs = (after - before) / 1e6;
        System.out.printf(
            "%8d elements, %15s, duration %10.3f ms, size %d\n",
            n, rightFunction, durationMs, left.size());
    }
}
(Yes, it's very simple. However, if you think that the timings are completely useless, compare them to a JMH run, and after a few hours, you'll see that they are reasonable)
The timings for the last pass are as follows:
100000 elements, sorted, duration 6016,354 ms, size 5000
100000 elements, shuffled(0,00), duration 5849,537 ms, size 5000
100000 elements, shuffled(0,25), duration 7319,948 ms, size 5000
100000 elements, shuffled(0,50), duration 9344,408 ms, size 5000
100000 elements, shuffled(0,75), duration 10657,021 ms, size 5000
100000 elements, shuffled(1,00), duration 11295,808 ms, size 5000
100000 elements, reversed, duration 5830,695 ms, size 5000
One can clearly see that the timings are basically increasing linearly with the shuffledness.
Of course, all this is still not proof, but at least evidence that the answer by Svetlin Zarev is correct.
Looking at the source code of ArrayList.removeAll() (OpenJDK 7-b147), it appears that it delegates to a private method called batchRemove(), which is as follows:
private boolean batchRemove(Collection<?> c, boolean complement) {
    final Object[] elementData = this.elementData;
    int r = 0, w = 0;
    boolean modified = false;
    try {
        for (; r < size; r++)
            if (c.contains(elementData[r]) == complement)
                elementData[w++] = elementData[r];
    } finally {
        // Preserve behavioral compatibility with AbstractCollection,
        // even if c.contains() throws.
        if (r != size) {
            System.arraycopy(elementData, r,
                             elementData, w,
                             size - r);
            w += size - r;
        }
        if (w != size) {
            for (int i = w; i < size; i++)
                elementData[i] = null;
            modCount += size - w;
            size = w;
            modified = true;
        }
    }
    return modified;
}
It practically loops through the array and makes a bunch of c.contains() calls. Basically, there's no reason why this iteration would be faster for a sorted array.
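As a practical aside that follows from this source: since the cost is dominated by c.contains(), wrapping the argument collection in a HashSet is the usual fix when the right-hand side is a list. A quick sketch (mine, not part of the original benchmark):
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

public class RemoveAllHashSetDemo {
    public static void main(String[] args) {
        List<Integer> left = new ArrayList<>();
        List<Integer> right = new ArrayList<>();
        for (int i = 0; i < 100_000; i++) left.add(i);
        for (int i = 0; i < 95_000; i++) right.add(i);

        // batchRemove calls c.contains(...) once per element of 'left', so the
        // cost is dominated by the lookup cost of the argument collection:
        // ArrayList.contains is O(n); HashSet.contains is O(1) on average.
        long t0 = System.nanoTime();
        left.removeAll(new HashSet<>(right));
        long t1 = System.nanoTime();
        System.out.printf("removeAll via HashSet: %.1f ms, size %d%n",
                (t1 - t0) / 1e6, left.size());
    }
}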
I second StephenC's doubt about the benchmark, and believe that it'd be more fruitful for you to scrutinize the benchmark code before digging in any deeper into cache access patterns etc.
Also, if the benchmark code is not the culprit, it would be interesting to know the Java version and the OS/arch, etc.
Now I know that micro-benchmarks are difficult to do right, and I won’t look at a few milliseconds difference, but I believe my results to be valid, since I ran them repeatedly and they are very reproducible.
That does not convince me. The behaviour of a flawed benchmark can be 100% reproducible.
I suspect that ... in fact ... a flaw or flaws in your benchmark >>is<< the cause of your strange results. It often is.
... but perhaps there is a fundamental problem with my benchmark?
Yes (IMO).
Show us the benchmark code if you want a more detailed answer.
I experienced a performance issue when using a stream created via spliterator() over an Iterable, i.e., like StreamSupport.stream(integerList.spliterator(), true). I wanted to compare this against a normal collection. Please see some benchmark results below.
Question:
Why is the parallel stream created from an Iterable much slower than the stream created from an ArrayList or an IntStream?
From a range
public void testParallelFromIntRange() {
    long start = System.nanoTime();
    IntStream stream = IntStream.rangeClosed(1, Integer.MAX_VALUE).parallel();
    System.out.println("Is Parallel: " + stream.isParallel());
    stream.forEach(ParallelStreamSupportTest::calculate);
    long end = System.nanoTime();
    System.out.println("ParallelStream from range Takes : " + TimeUnit.MILLISECONDS.convert((end - start),
            TimeUnit.NANOSECONDS) + " milli seconds");
}
Is Parallel: true
ParallelStream from range Takes : 490 milli seconds
From an Iterable
public void testParallelFromIterable() {
    Set<Integer> integerList = ContiguousSet.create(Range.closed(1, Integer.MAX_VALUE), DiscreteDomain.integers());
    long start = System.nanoTime();
    Stream<Integer> stream = StreamSupport.stream(integerList.spliterator(), true);
    System.out.println("Is Parallel: " + stream.isParallel());
    stream.forEach(ParallelStreamSupportTest::calculate);
    long end = System.nanoTime();
    System.out.println("ParallelStream from Iterable Takes : " + TimeUnit.MILLISECONDS.convert((end - start),
            TimeUnit.NANOSECONDS) + " milli seconds");
}
Is Parallel: true
ParallelStream from Iterable Takes : 12517 milli seconds
And the trivial calculate method:
public static Integer calculate(Integer input) {
    return input + 2;
}
Not all spliterators are created equally. One of the tasks of a spliterator is to decompose the source into two parts, that can be processed in parallel. A good spliterator will divide the source roughly in half (and will be able to continue to do so recursively.)
Now, imagine you are writing a spliterator for a source that is only described by an Iterator. What quality of decomposition can you get? Basically, all you can do is divide the source into "first" and "rest". That's about as bad as it gets. The result is a computation tree that is very "right-heavy".
The spliterator that you get from a data structure has more to work with; it knows the layout of the data, and can use that to give better splits, and therefore better parallel performance. The spliterator for ArrayList can always divide in half, and retains knowledge of exactly how much data is in each half. That's really good. The spliterator from a balanced tree can get good distribution (since each half of the tree has roughly half the elements), but isn't quite as good as the ArrayList spliterator because it doesn't know the exact sizes. The spliterator for a LinkedList is about as bad as it gets; all it can do is (first, rest). And the same for deriving a spliterator from an iterator.
Now, all is not necessarily lost; if the work per element is high, you can overcome bad splitting. But if you're doing a small amount of work per element, you'll be limited by the quality of splits from your spliterator.
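A small sketch of this difference (the class name is mine): compare the spliterator a sized data structure provides with one derived from a bare iterator.
import java.util.ArrayList;
import java.util.List;
import java.util.Spliterator;
import java.util.Spliterators;

public class SplitQualityDemo {
    public static void main(String[] args) {
        List<Integer> list = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) list.add(i);

        // The list's own spliterator knows its exact size and splits in half.
        Spliterator<Integer> fromList = list.spliterator();
        System.out.println("list spliterator size: " + fromList.estimateSize()); // 10000
        System.out.println("after split: " + fromList.trySplit().estimateSize()
                + " + " + fromList.estimateSize()); // roughly 5000 + 5000

        // A spliterator built from a bare iterator reports an unknown size and
        // can only peel fixed-size batches off the front ("first" and "rest").
        Spliterator<Integer> fromIterator =
                Spliterators.spliteratorUnknownSize(list.iterator(), 0);
        System.out.println("iterator spliterator size: "
                + fromIterator.estimateSize()); // Long.MAX_VALUE (unknown)
    }
}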
There are several problems with your benchmark.
Stream<Integer> cannot be compared to IntStream because of boxing overhead.
You aren't doing anything with the result of the calculation, which makes it hard to know whether the code is actually being run.
You are benchmarking with System.nanoTime instead of using a proper benchmarking tool.
Here's a JMH-based benchmark:
import com.google.common.collect.ContiguousSet;
import com.google.common.collect.DiscreteDomain;
import com.google.common.collect.Range;
import java.util.stream.IntStream;
import java.util.stream.Stream;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class Ranges {
    final static int SIZE = 10_000_000;

    @Benchmark
    public long intStream() {
        Stream<Integer> st = IntStream.rangeClosed(1, SIZE).boxed();
        return st.parallel().mapToInt(x -> x).sum();
    }

    @Benchmark
    public long contiguousSet() {
        ContiguousSet<Integer> cs = ContiguousSet.create(Range.closed(1, SIZE), DiscreteDomain.integers());
        Stream<Integer> st = cs.stream();
        return st.parallel().mapToInt(x -> x).sum();
    }

    public static void main(String[] args) throws RunnerException {
        new Runner(
            new OptionsBuilder()
                .include(".*Ranges.*")
                .forks(1)
                .warmupIterations(5)
                .measurementIterations(5)
                .build()
        ).run();
    }
}
And the output:
Benchmark Mode Samples Score Score error Units
b.Ranges.contiguousSet thrpt 5 13.540 0.924 ops/s
b.Ranges.intStream thrpt 5 27.047 5.119 ops/s
So IntStream.range is about twice as fast as ContiguousSet, which is perfectly reasonable, given that ContiguousSet doesn't implement its own Spliterator and uses the default one from Set.