I had some fun comparing the speed of the removeAll(Collection<?> c) call declared in Collection. Now I know that micro-benchmarks are difficult to do right, and I won’t look at a few milliseconds difference, but I believe my results to be valid, since I ran them repeatedly and they are very reproducible.
Let’s assume I have two collections that are not too tiny, say 100,000 consecutive integer elements, and also that they mostly overlap, for instance 5,000 are in the left but not the right. Now I simply call:
left.removeAll(right);
Of course this all depends on the types of both the left and the right collection. It’s blazingly fast if the right collection is a hash map, because that’s where the look-ups are done. But looking closer, I noticed two results that I cannot explain. I tried all the tests both with an ArrayList that is sorted and with another that is shuffled (using Collections.shuffle(), if that is of importance).
The first weird result is:
00293 025% shuffled ArrayList, HashSet
00090 008% sorted ArrayList, HashSet
Now either removing elements from the sorted ArrayList is faster than removing from the shuffled list, or looking up consecutive values from the HashSet is faster that looking up random values.
Now the other one:
02311 011% sorted ArrayList, shuffled ArrayList
01401 006% sorted ArrayList, sorted ArrayList
Now this suggests that the lookup in the sorted ArrayList (using a contains() call for each element of the list to the left) is faster than in the shuffled list. Now that would be quite easy if we could make use of the fact that it is sorted and use a binary search, but I do not do that.
Both results are mysterious to me. I cannot explain them by looking at the code or with my data-structure knowledge. Does it have anything to do with processor cache access patterns? Is the JIT compiler optimizing stuff? But if so, which? I performed warming up and run the tests a few times in a row, but perhaps there is a fundamental problem with my benchmark?
The reason for the performance difference is the memory access pattern: accessing elements which are consecutive in memory is faster than doing a random memory access (due to memory pre-fetching, cpu caches etc.)
When you initially populate the collection you create all the elements sequentially in the memory, so when you are traversing it (foreach, removeAll, etc) you are accessing consecutive memory regions which is cache friendly. When you shuffle the collection - the elements remain in the same order in memory, but the pointers to those elements are no longer in the same order, so when you are traversing the collection you'll be accessing for instance the 10th, the 1st, then the 5th element which is very cache unfriendly and ruins the performance.
You can look at this question where this effect is visible in greater detail:
Why filtering an unsorted list is faster than filtering a sorted list
Since the asker did not provide any example code, and there have been doubts about the benchmark mentioned in the comments and answers, I created a small test to see whether the removeAll method is slower when the argument is a shuffled list (instead of a sorted list). And I confirmed the observation of the asker: The output of the test was roughly
100000 elements, sortedList and sortedList, 5023,090 ms, size 5000
100000 elements, shuffledList and sortedList, 5062,293 ms, size 5000
100000 elements, sortedList and shuffledList, 10657,438 ms, size 5000
100000 elements, shuffledList and shuffledList, 10700,145 ms, size 5000
I'll omit the code for this particular test here, because it also has been questioned (which - by the way - is perfectly justified! A lot of BS is posted on the web...).
So I did further tests, for which I'll provide the code here.
This may also not be considered as a definite answer. But I tried to adjust the tests so that they at least provide some strong evidence that the reason for the reduced performance is indeed what Svetlin Zarev mentioned in his answer (+1 and accept this if it convinces you). Namely, that the reason for the slowdown lies in the caching effects of the scattered accesses.
First of all: I am aware of many of the possible pitfalls when writing a microbenchmark (and so is the asker, according to his statements). However, I know that nobody will believe a lie benchmark, even if it is perfectly reasonable, unless it is performed with an appropriate microbenchmarking tool. So in order to show that the performance with a shuffled list is lower than with a sorted list, I created this simple JMH benchmark:
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;
#State(Scope.Thread)
public class RemoveAllBenchmarkJMH
{
#Param({"sorted", "shuffled"})
public String method;
#Param({"1000", "10000", "100000" })
public int numElements;
private List<Integer> left;
private List<Integer> right;
#Setup
public void initList()
{
left = new ArrayList<Integer>();
right = new ArrayList<Integer>();
for (int i=0; i<numElements; i++)
{
left.add(i);
}
int n = (int)(numElements * 0.95);
for (int i=0; i<n; i++)
{
right.add(i);
}
if (method.equals("shuffled"))
{
Collections.shuffle(right);
}
}
#Benchmark
#BenchmarkMode(Mode.AverageTime)
#OutputTimeUnit(TimeUnit.MICROSECONDS)
public void testMethod(Blackhole bh)
{
left.removeAll(right);
bh.consume(left.size());
}
}
The output of this one is as follows:
(method) (numElements) Mode Cnt Score Error Units
sorted 1000 avgt 50 52,055 ± 0,507 us/op
shuffled 1000 avgt 50 55,720 ± 0,466 us/op
sorted 10000 avgt 50 5341,917 ± 28,630 us/op
shuffled 10000 avgt 50 7108,845 ± 45,869 us/op
sorted 100000 avgt 50 621714,569 ± 19040,964 us/op
shuffled 100000 avgt 50 1110301,876 ± 22935,976 us/op
I hope that this helps to resolve doubts about the statement itself.
Although I admit that I'm not a JMH expert. If there is something wrong with this benchmark, please let me know
Now, these results have been roughly in line with my other, manual (non-JMH) microbenchmark. In order to create evidence for the fact that the shuffling is the problem, I created a small test that compares the performance using lists that are shuffled by different degrees. By providing a value between 0.0 and 1.0, one can limit the number of swapped elements, and thus the shuffledness of the list. (Of course, this is rather "pragmatic", as there are different options of how this could be implemented, considering the different possible (statistical) measures for "shuffledness").
The code looks as follows:
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.function.Function;
public class RemoveAllBenchmarkExt
{
public static void main(String[] args)
{
for (int n=10000; n<=100000; n+=10000)
{
runTest(n, sortedList() , sortedList());
runTest(n, sortedList() , shuffledList(0.00));
runTest(n, sortedList() , shuffledList(0.25));
runTest(n, sortedList() , shuffledList(0.50));
runTest(n, sortedList() , shuffledList(0.75));
runTest(n, sortedList() , shuffledList(1.00));
runTest(n, sortedList() , reversedList());
System.out.println();
}
}
private static Function<Integer, Collection<Integer>> sortedList()
{
return new Function<Integer, Collection<Integer>>()
{
#Override
public Collection<Integer> apply(Integer t)
{
List<Integer> list = new ArrayList<Integer>(t);
for (int i=0; i<t; i++)
{
list.add(i);
}
return list;
}
#Override
public String toString()
{
return "sorted";
}
};
}
private static Function<Integer, Collection<Integer>> shuffledList(
final double degree)
{
return new Function<Integer, Collection<Integer>>()
{
#Override
public Collection<Integer> apply(Integer t)
{
List<Integer> list = new ArrayList<Integer>(t);
for (int i=0; i<t; i++)
{
list.add(i);
}
shuffle(list, degree);
return list;
}
#Override
public String toString()
{
return String.format("shuffled(%4.2f)", degree);
}
};
}
private static void shuffle(List<Integer> list, double degree)
{
Random random = new Random(0);
int n = (int)(degree * list.size());
for (int i=n; i>1; i--)
{
swap(list, i-1, random.nextInt(i));
}
}
private static void swap(List<Integer> list, int i, int j)
{
list.set(i, list.set(j, list.get(i)));
}
private static Function<Integer, Collection<Integer>> reversedList()
{
return new Function<Integer, Collection<Integer>>()
{
#Override
public Collection<Integer> apply(Integer t)
{
List<Integer> list = new ArrayList<Integer>(t);
for (int i=0; i<t; i++)
{
list.add(i);
}
Collections.reverse(list);
return list;
}
#Override
public String toString()
{
return "reversed";
}
};
}
private static void runTest(int n,
Function<Integer, ? extends Collection<Integer>> leftFunction,
Function<Integer, ? extends Collection<Integer>> rightFunction)
{
Collection<Integer> left = leftFunction.apply(n);
Collection<Integer> right = rightFunction.apply((int)(n*0.95));
long before = System.nanoTime();
left.removeAll(right);
long after = System.nanoTime();
double durationMs = (after - before) / 1e6;
System.out.printf(
"%8d elements, %15s, duration %10.3f ms, size %d\n",
n, rightFunction, durationMs, left.size());
}
}
(Yes, it's very simple. However, if you think that the timings are completely useless, compare them to a JMH run, and after a few hours, you'll see that they are reasonable)
The timings for the last pass are as follows:
100000 elements, sorted, duration 6016,354 ms, size 5000
100000 elements, shuffled(0,00), duration 5849,537 ms, size 5000
100000 elements, shuffled(0,25), duration 7319,948 ms, size 5000
100000 elements, shuffled(0,50), duration 9344,408 ms, size 5000
100000 elements, shuffled(0,75), duration 10657,021 ms, size 5000
100000 elements, shuffled(1,00), duration 11295,808 ms, size 5000
100000 elements, reversed, duration 5830,695 ms, size 5000
One can clearly see that the timings are basically increasing linearly with the shuffledness.
Of course, all this is still not a proof, but at least an evidence that the answer by Svetlin Zarev is correct.
Looking at the source code for ArrayList.removeAll() (OpenJDK7-b147) it appears that the it delegates to a private method called batchRemove() which is as follows:
663 private boolean batchRemove(Collection<?> c, boolean complement) {
664 final Object[] elementData = this.elementData;
665 int r = 0, w = 0;
666 boolean modified = false;
667 try {
668 for (; r < size; r++)
669 if (c.contains(elementData[r]) == complement)
670 elementData[w++] = elementData[r];
671 } finally {
672 // Preserve behavioral compatibility with AbstractCollection,
673 // even if c.contains() throws.
674 if (r != size) {
675 System.arraycopy(elementData, r,
676 elementData, w,
677 size - r);
678 w += size - r;
679 }
680 if (w != size) {
681 for (int i = w; i < size; i++)
682 elementData[i] = null;
683 modCount += size - w;
684 size = w;
685 modified = true;
686 }
687 }
688 return modified;
689 }
It practically loops through the array and has a bunch of c.contains() calls. Basically there's no reason why this iteration would go faster for a sorted array.
I second StephenC's doubt about the benchmark, and believe that it'd be more fruitful for you to scrutinize the benchmark code before digging in any deeper into cache access patterns etc.
Also if the benchmark code is not the culprit, it would be interesting to know the java version, and the OS/arch etc.
Now I know that micro-benchmarks are difficult to do right, and I won’t look at a few milliseconds difference, but I believe my results to be valid, since I ran them repeatedly and they are very reproducible.
That does not convince me. The behaviour of an flawed benchmark can be 100% reproducible.
I suspect that ... in fact ... a flaw or flaws in your benchmark >>is<< the cause of your strange results. It often is.
... but perhaps there is a fundamental problem with my benchmark?
Yes (IMO).
Show us the benchmark code if you want a more detailed answer.
Related
This is not a duplicate of my question. I checked it and it is more about inner anonymous classes.
I was curious about Lambda expressions and tested the following :
Given an array of ten thousand entries, what would be the faster to delete certain indexes : Lamba expression or For-Loop with an if test inside?
First results were not surprising in the fact that I did not know what I was going to come up with :
final int NUMBER_OF_LIST_INDEXES = 10_000;
List<String> myList = new ArrayList<>();
String[] myWords = "Testing Lamba expressions with this String array".split(" ");
for (int i = 0 ; i < NUMBER_OF_LIST_INDEXES ; i++){
myList.add(myWords[i%6]);
}
long time = System.currentTimeMillis();
// BOTH TESTS WERE RUN SEPARATELY OF COURSE
// PUT THE UNUSED ONE IN COMMENTS WHEN THE OTHER WAS WORKING
// 250 milliseconds for the Lambda Expression
myList.removeIf(x -> x.contains("s"));
// 16 milliseconds for the traditional Loop
for (int i = NUMBER_OF_LIST_INDEXES - 1 ; i >= 0 ; i--){
if (myList.get(i).contains("s")) myList.remove(i);
}
System.out.println(System.currentTimeMillis() - time + " milliseconds");
But then, I decided to change the constant NUMBER_OF_LIST_INDEXES to one million and here is the result :
final int NUMBER_OF_LIST_INDEXES = 1_000_000;
List<String> myList = new ArrayList<>();
String[] myWords = "Testing Lamba expressions with this String array".split(" ");
for (int i = 0 ; i < NUMBER_OF_LIST_INDEXES ; i++){
myList.add(myWords[i%6]);
}
long time = System.currentTimeMillis();
// BOTH TESTS WERE RUN SEPARATELY OF COURSE
// PUT THE UNUSED ONE IN COMMENTS WHEN THE OTHER WAS WORKING
// 390 milliseconds for the Lambda Expression
myList.removeIf(x -> x.contains("s"));
// 32854 milliseconds for the traditional Loop
for (int i = NUMBER_OF_LIST_INDEXES - 1 ; i >= 0 ; i--){
if (myList.get(i).contains("s")) myList.remove(i);
}
System.out.println(System.currentTimeMillis() - time + " milliseconds");
To make things simpler to read, here are the results :
| | 10.000 | 1.000.000 |
| LAMBDA | 250ms | 390ms | 156% evolution
|FORLOOP | 16ms | 32854ms | 205000+% evolution
I have the following questions :
What is magic behind this? How do we come to such a big difference for the array and not for the lambda when the indexes to work with is *100.
In terms of performance, how do we know when to use Lambdas and when to stick to traditional ways to work with data?
Is this a specific behavior of the List method? Are other Lambda expression also produce random performances like this one?
Because remove(index) is very expensive :) It needs to copy and shift the rest of elements, and this is done multiple times in your case.
While removeIf(filter) does not need to do that. It can sweep once and mark all elements to be deleted; then the final phase copies survivors to the head of list just once.
I wrote a JMH benchmark to measure this. There are 4 methods in it:
removeIf on an ArrayList.
removeIf on a LinkedList.
iterator with iterator.remove() on an ArrayList.
iterator with iterator.remove() on a LinkedList.
The point of the benchmark is to show that removeIf and an iterator should provide the same performance, but that it is not the case for an ArrayList.
By default, removeIf uses an iterator internally to remove the elements so we should expect the same performance with removeIf and with an iterator.
Now consider an ArrayList which uses an array internally to hold the elements. Everytime we call remove, the remaining elements after the index have to be shifted by one; so each time a lot of elements have to be copied. When an iterator is used to traverse the ArrayList and we need to remove an element, this copying needs to happen again and again, making this very slow. For a LinkedList, this is not the case: when an element is deleted, the only change is the pointer to the next element.
So why is removeIf as fast on an ArrayList as on a LinkedList? Because it is actually overriden and it does not use an iterator: the code actually marks the elements to be deleted in a first pass and then deletes them (shifting the remaining elements) in a second pass. An optimization is possible in this case: instead of shifting the remaining elements each time one needs to be removed, we only do it once when we know all the elements that need to be removed.
Conclusion:
removeIf should be used when one needs to remove every elements matching a predicate.
remove should be used to remove a single known element.
Result of benchmark:
Benchmark Mode Cnt Score Error Units
RemoveTest.removeIfArrayList avgt 30 4,478 ± 0,194 ms/op
RemoveTest.removeIfLinkedList avgt 30 3,634 ± 0,184 ms/op
RemoveTest.removeIteratorArrayList avgt 30 27197,046 ± 536,584 ms/op
RemoveTest.removeIteratorLinkedList avgt 30 3,601 ± 0,195 ms/op
Benchmark:
#Warmup(iterations = 5, time = 1000, timeUnit = TimeUnit.MILLISECONDS)
#Measurement(iterations = 10, time = 1000, timeUnit = TimeUnit.MILLISECONDS)
#BenchmarkMode(Mode.AverageTime)
#OutputTimeUnit(TimeUnit.MILLISECONDS)
#Fork(3)
#State(Scope.Benchmark)
public class RemoveTest {
private static final int NUMBER_OF_LIST_INDEXES = 1_000_000;
private static final String[] words = "Testing Lamba expressions with this String array".split(" ");
private ArrayList<String> arrayList;
private LinkedList<String> linkedList;
#Setup(Level.Iteration)
public void setUp() {
arrayList = new ArrayList<>();
linkedList = new LinkedList<>();
for (int i = 0 ; i < NUMBER_OF_LIST_INDEXES ; i++){
arrayList.add(words[i%6]);
linkedList.add(words[i%6]);
}
}
#Benchmark
public void removeIfArrayList() {
arrayList.removeIf(x -> x.contains("s"));
}
#Benchmark
public void removeIfLinkedList() {
linkedList.removeIf(x -> x.contains("s"));
}
#Benchmark
public void removeIteratorArrayList() {
for (ListIterator<String> it = arrayList.listIterator(arrayList.size()); it.hasPrevious();){
if (it.previous().contains("s")) it.remove();
}
}
#Benchmark
public void removeIteratorLinkedList() {
for (ListIterator<String> it = linkedList.listIterator(linkedList.size()); it.hasPrevious();){
if (it.previous().contains("s")) it.remove();
}
}
public static void main(String[] args) throws Exception {
Main.main(args);
}
}
I think the performance difference you're seeing is probably due more to removeIf's use of an iterator internally vs. get and remove in your for loop. The answers in this PAQ have some good information on the benefits of iterators.
bayou.io's answer is spot on, you can see the code for removeIf here it does two passes to avoid shifting the remaining elements over and over.
I wrote a simple program to compare to performance of stream for finding maximum form list of integer. Surprisingly I found that the performance of ' stream way' 1/10 of 'usual way'. Am I doing something wrong? Is there any condition on which Stream way will not be efficient? Could anyone have a nice explanation for this behavior?
"stream way" took 80 milliseconds "usual way" took 15 milli seconds
Please find the code below
public class Performance {
public static void main(String[] args) {
ArrayList<Integer> a = new ArrayList<Integer>();
Random randomGenerator = new Random();
for (int i=0;i<40000;i++){
a.add(randomGenerator.nextInt(40000));
}
long start_s = System.currentTimeMillis( );
Optional<Integer> m1 = a.stream().max(Integer::compare);
long diff_s = System.currentTimeMillis( ) - start_s;
System.out.println(diff_s);
int e = a.size();
Integer m = Integer.MIN_VALUE;
long start = System.currentTimeMillis( );
for(int i=0; i < e; i++)
if(a.get(i) > m) m = a.get(i);
long diff = System.currentTimeMillis( ) - start;
System.out.println(diff);
}
}
Yes, Streams are slower for such simple operations. But your numbers are completely unrelated. If you think that 15 milliseconds is satisfactory time for your task, then there are good news: after warm-up stream code can solve this problem in like 0.1-0.2 milliseconds, which is 70-150 times faster.
Here's quick-and-dirty benchmark:
import java.util.concurrent.TimeUnit;
import java.util.*;
import java.util.stream.*;
import org.openjdk.jmh.infra.Blackhole;
import org.openjdk.jmh.annotations.*;
#Warmup(iterations = 5, time = 1000, timeUnit = TimeUnit.MILLISECONDS)
#Measurement(iterations = 10, time = 1000, timeUnit = TimeUnit.MILLISECONDS)
#BenchmarkMode(Mode.AverageTime)
#OutputTimeUnit(TimeUnit.MICROSECONDS)
#Fork(3)
#State(Scope.Benchmark)
public class StreamTest {
// Stream API is very nice to get random data for tests!
List<Integer> a = new Random().ints(40000, 0, 40000).boxed()
.collect(Collectors.toList());
#Benchmark
public Integer streamList() {
return a.stream().max(Integer::compare).orElse(Integer.MIN_VALUE);
}
#Benchmark
public Integer simpleList() {
int e = a.size();
Integer m = Integer.MIN_VALUE;
for(int i=0; i < e; i++)
if(a.get(i) > m) m = a.get(i);
return m;
}
}
The results are:
Benchmark Mode Cnt Score Error Units
StreamTest.simpleList avgt 30 38.241 ± 0.434 us/op
StreamTest.streamList avgt 30 215.425 ± 32.871 us/op
Here's microseconds. So the Stream version is actually much faster than your test. Nevertheless the simple version is even more faster. So if you were fine with 15 ms, you can use any of these two versions you like: both will perform much faster.
If you want to get the best possible performance no matter what, you should get rid of boxed Integer objects and work with primitive array:
int[] b = new Random().ints(40000, 0, 40000).toArray();
#Benchmark
public int streamArray() {
return Arrays.stream(b).max().orElse(Integer.MIN_VALUE);
}
#Benchmark
public int simpleArray() {
int e = b.length;
int m = Integer.MIN_VALUE;
for(int i=0; i < e; i++)
if(b[i] > m) m = b[i];
return m;
}
Both versions are faster now:
Benchmark Mode Cnt Score Error Units
StreamTest.simpleArray avgt 30 10.132 ± 0.193 us/op
StreamTest.streamArray avgt 30 167.435 ± 1.155 us/op
Actually the stream version result may vary greatly as it involves many intermediate methods which are JIT-compiled in different time, so the speed may change in any direction after some iterations.
By the way your original problem can be solved by good old Collections.max method without Stream API like this:
Integer max = Collections.max(a);
In general you should avoid testing the artificial code which does not solve real problems. With artificial code you will get the artificial results which generally say nothing about the API performance in real conditions.
The immediate difference that I see is that the stream way uses Integer::compare which might require more autoboxing etc. vs. an operator in the loop. perhaps you can call Integer::compare in the loop to see if this is the reason?
EDIT: following the advice from Nicholas Robinson, I wrote a new version of the test. It uses 400K sized list (the original one yielded zero diff results), it uses Integer.compare in both cases and runs only one of them in each invocation (I alternate between the two methods):
static List<Integer> a = new ArrayList<Integer>();
public static void main(String[] args)
{
Random randomGenerator = new Random();
for (int i = 0; i < 400000; i++) {
a.add(randomGenerator.nextInt(400000));
}
long start = System.currentTimeMillis();
//Integer max = checkLoop();
Integer max = checkStream();
long diff = System.currentTimeMillis() - start;
System.out.println("max " + max + " diff " + diff);
}
static Integer checkStream()
{
Optional<Integer> max = a.stream().max(Integer::compare);
return max.get();
}
static Integer checkLoop()
{
int e = a.size();
Integer max = Integer.MIN_VALUE;
for (int i = 0; i < e; i++) {
if (Integer.compare(a.get(i), max) > 0) max = a.get(i);
}
return max;
}
The results for loop: max 399999 diff 10
The results for stream: max 399999 diff 40 (and sometimes I got 50)
In Java 8 they have been putting a lot of effort into making use of concurrent processes with the new lambdas. You will find the stream to be so much faster because the list is being processed concurrently in the most efficient way possible where as the usual way is running through the list sequentially.
Because the lambda are static this makes threading easier, however when you are accessing something line your hard drive (reading in a file line by line) you will probably find the stream wont be as efficient because the hard drive can only access info.
[UPDATE]
The reason your stream took so much longer than the normal way is because you run in first. The JRE is constantly trying to optimize the performance so there will be a cache set up with the usual way. If you run the usual way before the stream way you should get opposing results. I would recommend running the tests in different mains for the best results.
I experienced a performance issue when using the stream created using the spliterator() over an Iterable. ie., like StreamSupport.stream(integerList.spliterator(), true). Wanted to prove this over a normal collection. Please see below some benchmark results.
Question:
Why does the parallel stream created from an iterable much slower than the stream created from an ArrayList or an IntStream ?
From a range
public void testParallelFromIntRange() {
long start = System.nanoTime();
IntStream stream = IntStream.rangeClosed(1, Integer.MAX_VALUE).parallel();
System.out.println("Is Parallel: "+stream.isParallel());
stream.forEach(ParallelStreamSupportTest::calculate);
long end = System.nanoTime();
System.out.println("ParallelStream from range Takes : " + TimeUnit.MILLISECONDS.convert((end - start),
TimeUnit.NANOSECONDS) + " milli seconds");
}
Is Parallel: true
ParallelStream from range Takes : 490 milli seconds
From an Iterable
public void testParallelFromIterable() {
Set<Integer> integerList = ContiguousSet.create(Range.closed(1, Integer.MAX_VALUE), DiscreteDomain.integers());
long start = System.nanoTime();
Stream<Integer> stream = StreamSupport.stream(integerList.spliterator(), true);
System.out.println("Is Parallel: " + stream.isParallel());
stream.forEach(ParallelStreamSupportTest::calculate);
long end = System.nanoTime();
System.out.println("ParallelStream from Iterable Takes : " + TimeUnit.MILLISECONDS.convert((end - start),
TimeUnit.NANOSECONDS) + " milli seconds");
}
Is Parallel: true
ParallelStream from Iterable Takes : 12517 milli seconds
And the so trivial calculate method.
public static Integer calculate(Integer input) {
return input + 2;
}
Not all spliterators are created equally. One of the tasks of a spliterator is to decompose the source into two parts, that can be processed in parallel. A good spliterator will divide the source roughly in half (and will be able to continue to do so recursively.)
Now, imagine you are writing a spliterator for a source that is only described by an Iterator. What quality of decomposition can you get? Basically, all you can do is divide the source into "first" and "rest". That's about as bad as it gets. The result is a computation tree that is very "right-heavy".
The spliterator that you get from a data structure has more to work with; it knows the layout of the data, and can use that to give better splits, and therefore better parallel performance. The spliterator for ArrayList can always divide in half, and retains knowledge of exactly how much data is in each half. That's really good. The spliterator from a balanced tree can get good distribution (since each half of the tree has roughly half the elements), but isn't quite as good as the ArrayList spliterator because it doesn't know the exact sizes. The spliterator for a LinkedList is about as bad as it gets; all it can do is (first, rest). And the same for deriving a spliterator from an iterator.
Now, all is not necessarily lost; if the work per element is high, you can overcome bad splitting. But if you're doing a small amount of work per element, you'll be limited by the quality of splits from your spliterator.
There are several problems with your benchmark.
Stream<Integer> cannot be compared to IntStream because of boxing overhead.
You aren't doing anything with the result of the calculation, which makes it hard to know whether the code is actually being run
You are benchmarking with System.nanoTime instead of using a proper benchmarking tool.
Here's a JMH-based benchmark:
import com.google.common.collect.ContiguousSet;
import com.google.common.collect.DiscreteDomain;
import com.google.common.collect.Range;
import java.util.stream.IntStream;
import java.util.stream.Stream;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.OptionsBuilder;
public class Ranges {
final static int SIZE = 10_000_000;
#Benchmark
public long intStream() {
Stream<Integer> st = IntStream.rangeClosed(1, SIZE).boxed();
return st.parallel().mapToInt(x -> x).sum();
}
#Benchmark
public long contiguousSet() {
ContiguousSet<Integer> cs = ContiguousSet.create(Range.closed(1, SIZE), DiscreteDomain.integers());
Stream<Integer> st = cs.stream();
return st.parallel().mapToInt(x -> x).sum();
}
public static void main(String[] args) throws RunnerException {
new Runner(
new OptionsBuilder()
.include(".*Ranges.*")
.forks(1)
.warmupIterations(5)
.measurementIterations(5)
.build()
).run();
}
}
And the output:
Benchmark Mode Samples Score Score error Units
b.Ranges.contiguousSet thrpt 5 13.540 0.924 ops/s
b.Ranges.intStream thrpt 5 27.047 5.119 ops/s
So IntStream.range is about twice as fast as ContiguousSet, which is perfectly reasonable, given that ContiguousSet doesn't implement its own Spliterator and uses the default from Set
I have always read that we should use Vector everywhere in Java and that there are no performance issues, which is certainly true. I'm writing a method to calculate the MSE (Mean Squared Error) and noticed that it was very slow - I basically was passing the Vector of values. When I switched to Array, it was 10 times faster but I don't understand why.
I have written a simple test:
public static void main(String[] args) throws IOException {
Vector <Integer> testV = new Vector<Integer>();
Integer[] testA = new Integer[1000000];
for(int i=0;i<1000000;i++){
testV.add(i);
testA[i]=i;
}
Long startTime = System.currentTimeMillis();
for(int i=0;i<500;i++){
double testVal = testArray(testA, 0, 1000000);
}
System.out.println(String.format("Array total time %s ",System.currentTimeMillis() - startTime));
startTime = System.currentTimeMillis();
for(int i=0;i<500;i++){
double testVal = testVector(testV, 0, 1000000);
}
System.out.println(String.format("Vector total time %s ",System.currentTimeMillis() - startTime));
}
Which calls the following methods:
public static double testVector(Vector<Integer> data, int start, int stop){
double toto = 0.0;
for(int i=start ; i<stop ; i++){
toto += data.get(i);
}
return toto / data.size();
}
public static double testArray(Integer[] data, int start, int stop){
double toto = 0.0;
for(int i=start ; i<stop ; i++){
toto += data[i];
}
return toto / data.length;
}
The array one is indeed 10 times faster. Here is the output:
Array total time 854
Vector total time 9840
Can somebody explain me why ? I have searched for quite a while, but cannot figure it out. The vector method appears to be making a local copy of the vector, but I always thought that objects where passed by reference in Java.
I have always read that we should use Vector everywhere in Java and that there are no performance issues, - Wrong. A vector is thread safe and thus it needs additional logic (code) to handle access/ modification by multiple threads So, it is slow. An array on the other hand doesn't need additional logic to handle multiple threads. You should try ArrayList instead of Vector to increase the speed
Note (based on your comment): I'm running the method 500 times each
This is not the right way to measure performance / speed in java. You should atleast give a warm-up run so as to nullify the effect of JIT.
Yes, that's the eternal problem of poor microbenchmarking. The Vector itself is not SO slow.
Here is a trick:
add -XX:BiasedLockingStartupDelay=0 and now testVector "magically" runs 5 times faster than before!
Next, wrap testVector into synchronized (data) - and now it is almost as fast as testArray.
You are basically measuring the performance of object monitors in HotSpot, not the data structures.
Simple thing. Vector is thread-safe so it needs synchoronization to add and access. Use ArrayList which is also back-up by array but it is not thread-safe and faster
Note:
Please provide size of the elements if you know in advance to ArrayList. Since in normal ArrayList without initial capacity resize will happen intenally which uses Arrays copy
And a normal array and ArrayList without initial capacity performances too varies drastically if no of elements is larger
Poor code, instead of list.get() rather use an iterator on the list. The array will still be faster though.
Maybe I'm being misled by my profiler (Netbeans), but I'm seeing some odd behavior, hoping maybe someone here can help me understand it.
I am working on an application, which makes heavy use of rather large hash tables (keys are longs, values are objects). The performance with the built in java hash table (HashMap specifically) was very poor, and after trying some alternatives -- Trove, Fastutils, Colt, Carrot -- I started working on my own.
The code is very basic using a double hashing strategy. This works fine and good and shows the best performance of all the other options I've tried thus far.
The catch is, according to the profiler, lookups into the hash table are the single most expensive method in the entire application -- despite the fact that other methods are called many more times, and/or do a lot more logic.
What really confuses me is the lookups are called only by one class; the calling method does the lookup and processes the results. Both are called nearly the same number of times, and the method that calls the lookup has a lot of logic in it to handle the result of the lookup, but is about 100x faster.
Below is the code for the hash lookup. It's basically just two accesses into an array (the functions that compute the hash codes, according to profiling, are virtually free). I don't understand how this bit of code can be so slow since it is just array access, and I don't see any way of making it faster.
Note that the code simply returns the bucket matching the key, the caller is expected to process the bucket. 'size' is the hash.length/2, hash1 does lookups in the first half of the hash table, hash2 does lookups in the second half. key_index is a final int field on the hash table passed into the constructor, and the values array on the Entry objects is a small array of longs usually of length 10 or less.
Any thoughts people have on this are much appreciated.
Thanks.
public final Entry get(final long theKey) {
Entry aEntry = hash[hash1(theKey, size)];
if (aEntry != null && aEntry.values[key_index] != theKey) {
aEntry = hash[hash2(theKey, size)];
if (aEntry != null && aEntry.values[key_index] != theKey) {
return null;
}
}
return aEntry;
}
Edit, the code for hash1 & hash2
private static int hash1(final long key, final int hashTableSize) {
return (int)(key&(hashTableSize-1));
}
private static int hash2(final long key, final int hashTableSize) {
return (int)(hashTableSize+((key^(key>>3))&(hashTableSize-1)));
}
Nothing in your implementation strikes me as particularly inefficient. I'll admit I don't really follow your hashing/lookup strategy, but if you say it's performant in your circumstances, I'll believe you.
The only thing that I would expect might make some difference is to move the key out of the values array of Entry.
Instead of having this:
class Entry {
long[] values;
}
//...
if ( entry.values[key_index] == key ) { //...
Try this:
class Entry {
long key;
long values[];
}
//...
if ( entry.key == key ) { //...
Instead of incurring the cost of accessing a member, plus doing bounds checking, then getting a value of the array, you should just incur the cost of accessing the member.
Is there a random-access data type faster than an array?
I was interested in the answer to this question, so I set up a test environment. This is my Array interface:
interface Array {
long get(int i);
void set(int i, long v);
}
This "Array" has undefined behaviour when indices are out of bounds. I threw together the obvious implementation:
class NormalArray implements Array {
private long[] data;
public NormalArray(int size) {
data = new long[size];
}
#Override
public long get(int i) {
return data[i];
}
#Override
public void set(int i, long v) {
data[i] = v;
}
}
And then a control:
class NoOpArray implements Array {
#Override
public long get(int i) {
return 0;
}
#Override
public void set(int i, long v) {
}
}
Finally, I designed an "array" where the first 10 indices are hardcoded members. The members are set/selected through a switch:
class TenArray implements Array {
private long v0;
private long v1;
private long v2;
private long v3;
private long v4;
private long v5;
private long v6;
private long v7;
private long v8;
private long v9;
private long[] extras;
public TenArray(int size) {
if (size > 10) {
extras = new long[size - 10];
}
}
#Override
public long get(final int i) {
switch (i) {
case 0:
return v0;
case 1:
return v1;
case 2:
return v2;
case 3:
return v3;
case 4:
return v4;
case 5:
return v5;
case 6:
return v6;
case 7:
return v7;
case 8:
return v8;
case 9:
return v9;
default:
return extras[i - 10];
}
}
#Override
public void set(final int i, final long v) {
switch (i) {
case 0:
v0 = v; break;
case 1:
v1 = v; break;
case 2:
v2 = v; break;
case 3:
v3 = v; break;
case 4:
v4 = v; break;
case 5:
v5 = v; break;
case 6:
v6 = v; break;
case 7:
v7 = v; break;
case 8:
v8 = v; break;
case 9:
v9 = v; break;
default:
extras[i - 10] = v;
}
}
}
I tested it with this harness:
import java.util.Random;
public class ArrayOptimization {
public static void main(String[] args) {
int size = 10;
long[] data = new long[size];
Random r = new Random();
for ( int i = 0; i < data.length; i++ ) {
data[i] = r.nextLong();
}
Array[] a = new Array[] {
new NoOpArray(),
new NormalArray(size),
new TenArray(size)
};
for (;;) {
for ( int i = 0; i < a.length; i++ ) {
testSet(a[i], data, 10000000);
testGet(a[i], data, 10000000);
}
}
}
private static void testGet(Array a, long[] data, int iterations) {
long nanos = System.nanoTime();
for ( int i = 0; i < iterations; i++ ) {
for ( int j = 0; j < data.length; j++ ) {
data[j] = a.get(j);
}
}
long stop = System.nanoTime();
System.out.printf("%s/get took %fms%n", a.getClass().getName(),
(stop - nanos) / 1000000.0);
}
private static void testSet(Array a, long[] data, int iterations) {
long nanos = System.nanoTime();
for ( int i = 0; i < iterations; i++ ) {
for ( int j = 0; j < data.length; j++ ) {
a.set(j, data[j]);
}
}
long stop = System.nanoTime();
System.out.printf("%s/set took %fms%n", a.getClass().getName(),
(stop - nanos) / 1000000.0);
}
}
The results were somewhat surprising. The TenArray performs non-trivially faster than a NormalArray does (for sizes <= 10). Subtracting the overhead (using the NoOpArray average) you get TenArray as taking ~65% of the time of the normal array. So if you know the likely max size of your array, I suppose it is possible to exceed the speed of an array. I would imagine switch uses either less bounds checking or more efficient bounds checking than does an array.
NoOpArray/set took 953.272654ms
NoOpArray/get took 891.514622ms
NormalArray/set took 1235.694953ms
NormalArray/get took 1148.091061ms
TenArray/set took 1149.833109ms
TenArray/get took 1054.040459ms
NoOpArray/set took 948.458667ms
NoOpArray/get took 888.618223ms
NormalArray/set took 1232.554749ms
NormalArray/get took 1120.333771ms
TenArray/set took 1153.505578ms
TenArray/get took 1056.665337ms
NoOpArray/set took 955.812843ms
NoOpArray/get took 893.398847ms
NormalArray/set took 1237.358472ms
NormalArray/get took 1125.100537ms
TenArray/set took 1150.901231ms
TenArray/get took 1057.867936ms
Now whether you can in practice get speeds faster than an array I'm not sure; obviously this way you incur any overhead associated with the interface/class/methods.
Most likely you are partially misled in your interpretation of the profilers results. Profilers are notoriously overinflating the performance impact of small, frequently called methods. In your case, the profiling overhead for the get()-method is probably larger than the actual processing spent in the method itself. The situation is worsened further, since the instrumentation also interferes with the JIT's capability to inline methods.
As a rule of thumb for this situation - if the total processing time for a piece of work of known length increases more then two- to threefold when running under the profiler, the profiling overhead will give you skewed results.
To verify your changes actually do have impact, always measure performance improvements without the profiler, too. The profiler can hint you about bottlenecks, but it can also deceive you to look at places where nothing is wrong.
Array bounds checking can have a surprisingly large impact on performance (if you do comparably little else), but it can also be hard to clearly separate from general memory access penalties. In some trivial cases, the JIT might be able to eliminate them (there have been efforts towards bounds check elimination in Java 6), but this is AFAIK mostly limited to simple loop constructs like for(x=0; x<array.length; x++).
Under some circumstances you may be able to replace array access by simple member access, completely avoiding the bound checks, but its limited to the rare cases where you access you array exclusively by constant indices. I see no way to apply it to your problem.
The change suggested by Mark Peters is most likely not solely faster because it eliminates a bounds check, but also because it alters the locality properties of your data structures in a more cache friendly way.
Many profilers tell you very confusing things, partly because of how they work, and partly because people have funny ideas about performance to begin with.
For example, you're wondering about how many times functions are called, and you're looking at code and thinking it looks like a lot of logic, therefore slow.
There's a very simple way to think about this stuff, that makes it very easy to understand what's going on.
First of all, think in terms of the percent of time a routine or statement is active, rather than the number of times it is called or the average length of time it takes. The reason for that is it is relatively unaffected by irrelevant issues like competing processes or I/O, and it saves you having to multiply the number of calls by the average execution time and divide by the total time just to see if it is a big enough to even care about. Also, percent tells you, bottom line, how much fixing it could potentially reduce the overall execution time.
Second, what I mean by "active" is "on the stack", where the stack includes the currently running instruction and all the calls "above" it back to "call main". If a routine is responsible for 10% of the time, including routines that it calls, then during that time it is on the stack. The same is true of individual statements or even instructions. (Ignore "self time" or "exclusive time". It's a distraction.)
Profilers that put timers and counters on functions can only give you some of this information. Profilers that only sample the program counter tell you even less. What you need is something that samples the call stack and reports to you by line (not just by function) the percent of stack samples containing that line. It's also important that they sample the stack a) during I/O or other blockage, but b) not while waiting for user input.
There are profilers that can do this. I'm not sure about Java.
If you're still with me, let me throw out another ringer. You're looking for things you can optimize, right? and only things that have a large enough percent to be worth the trouble, like 10% or more? Such a line of code costing 10% is on the stack 10% of the time. That means if 20,000 samples are taken, it is on about 2,000 of them. If 20 samples are taken, it is on about 2 of them, on average. Now, you're trying to find the line, right? Does it really matter if the percent is off a little bit, as long as you find it? That's another one of those happy myths of profilers - that precision of timing matters. For finding problems worth fixing, 20,000 samples won't tell you much more than 20 samples will.
So what do I do? Just take the samples by hand and study them. Code worth optimizing will simply jump out at me.
Finally, there's a big gob of good news. There are probably multiple things you could optimize. Suppose you fix a 20% problem and make it go away. Overall time shrinks to 4/5 of what it was, but the other problems aren't taking any less time, so now their percentage is 5/4 of what it was, because the denominator got smaller. Percentage-wise they got bigger, and easier to find. This effect snowballs, allowing you to really squeeze the code.
You could try using a memoizing or caching strategy to reduce the number of actual calls. Another thing you could try if you're very desperate is a native array, since indexing those is unbelievably fast, and JNI shouldn't invoke toooo much overhead if you're using parameters like longs that don't require marshalling.