I am iterating through a HashMap with roughly 20 million entries. In each iteration I am again iterating through another HashMap with roughly 20 million entries.
HashMap<String, BitSet> data_1 = new HashMap<String, BitSet>();
HashMap<String, BitSet> data_2 = new HashMap<String, BitSet>();
I am dividing data_1 into chunks based on the number of threads (threads = cores; I have a four-core processor).
My code takes more than 20 hours to execute, and that is without even storing the results in a file.
1) If I want to store the results of each thread in a file without overlapping, how can I do that?
2) How can I make the following much faster?
3) How can I create the chunks dynamically, based on the number of cores?
int cores = Runtime.getRuntime().availableProcessors();
int threads = cores;
//Number of threads
int Chunks = data_1.size() / threads;
// I don't trust the chunks created by the line below, so I created chunk1, chunk2, chunk3 and chunk4 separately and validated them.
Map<Integer, BitSet>[] Chunk = (Map<Integer, BitSet>[]) new HashMap<?, ?>[threads];
4) How can I create threads using for loops? Is what I am doing below correct?
ClassName thread1 = new ClassName(data2, chunk1);
ClassName thread2 = new ClassName(data2, chunk2);
ClassName thread3 = new ClassName(data2, chunk3);
ClassName thread4 = new ClassName(data2, chunk4);
thread1.start();
thread2.start();
thread3.start();
thread4.start();
thread1.join();
thread2.join();
thread3.join();
thread4.join();
Representation of my code:
public class ClassName extends Thread {
Integer nSimilarEntities = 30;
public void run() {
for (String kNonRepeater : data_1.keySet()) {
// Extract the feature vector
BitSet vFeaturesNonRepeater = data_1.get(kNonRepeater);
// L2 norm: the square root of the number of 1s (cardinality)
double nNormNonRepeater = Math.sqrt(vFeaturesNonRepeater.cardinality());
// Loop through the repeater set
double nMinSimilarity = 100;
int nMinSimIndex = 0;
// Maintain the list of top similar repeaters and the similarity values
long dpind = 0;
ArrayList<String> vSimilarKeys = new ArrayList<String>();
ArrayList<Double> vSimilarValues = new ArrayList<Double>();
for (String kRepeater : data_2.keySet()) {
// Status output at regular intervals
dpind++;
if (Math.floorMod(dpind, pct) == 0) {
System.out.println(dpind + " dot products (" + Math.round(dpind / pct) + "%) out of "
+ nNumSimilaritiesToCompute + " completed!");
}
// Calculate the norm of repeater, and the dot product
BitSet vFeaturesRepeater = data_2.get(kRepeater);
double nNormRepeater = Math.sqrt(vFeaturesRepeater.cardinality());
BitSet vTemp = (BitSet) vFeaturesNonRepeater.clone();
vTemp.and(vFeaturesRepeater);
double nCosineDistance = vTemp.cardinality() / (nNormNonRepeater * nNormRepeater);
// queue.add(new MyClass(kRepeater,kNonRepeater,nCosineDistance));
// if(queue.size() > YOUR_LIMIT)
// queue.remove();
// Don't bother if the similarity is 0, obviously
if ((vSimilarKeys.size() < nSimilarEntities) && (nCosineDistance > 0)) {
vSimilarKeys.add(kRepeater);
vSimilarValues.add(nCosineDistance);
nMinSimilarity = vSimilarValues.get(0);
nMinSimIndex = 0;
for (int j = 0; j < vSimilarValues.size(); j++) {
if (vSimilarValues.get(j) < nMinSimilarity) {
nMinSimilarity = vSimilarValues.get(j);
nMinSimIndex = j;
}
}
} else { // If there are more, keep only the best
// If this is better than the smallest distance, then remove the smallest
if (nCosineDistance > nMinSimilarity) {
// Remove the lowest similarity value
vSimilarKeys.remove(nMinSimIndex);
vSimilarValues.remove(nMinSimIndex);
// Add this one
vSimilarKeys.add(kRepeater);
vSimilarValues.add(nCosineDistance);
// Refresh the index of lowest similarity value
nMinSimilarity = vSimilarValues.get(0);
nMinSimIndex = 0;
for (int j = 0; j < vSimilarValues.size(); j++) {
if (vSimilarValues.get(j) < nMinSimilarity) {
nMinSimilarity = vSimilarValues.get(j);
nMinSimIndex = j;
}
}
}
} // End loop for maintaining list of similar entries
}// End iteration through repeaters
for (int i = 0; i < vSimilarValues.size(); i++) {
System.out.println(Thread.currentThread().getName() + kNonRepeater + "|" + vSimilarKeys.get(i) + "|" + vSimilarValues.get(i));
}
}
}
}
Finally, if not multithreading, are there any other approaches in Java to reduce the running time?
The computer works much like you would by hand (it processes more digits/bits at a time, but the problem is the same).
If you do addition, the time is proportional to the size of the number.
If you do multiplication or division, it is proportional to the square of the size of the number.
For the computer, the size is based on multiples of 32 or 64 significant bits, depending on the implementation.
I'd say this task is suitable for parallel streams. Don't hesitate to take a look at this concept if you have time. Parallel streams seamlessly use multithreading at full speed.
The top-level processing will look like this:
data_1.entrySet()
      .parallelStream()
      .flatMap(nonRepeaterEntry -> processOne(nonRepeaterEntry.getKey(), nonRepeaterEntry.getValue(), data_2))
      .forEach(System.out::println);
You should provide a processOne function with a prototype like this:
Stream<String> processOne(String nonRepeaterKey, BitSet nonRepeaterBitSet, Map<String, BitSet> data_2);
It will return the prepared lines that you currently print to a file. To create the stream inside, you can build a List first and then turn it into a stream in the return statement:
return list.stream();
Even though the inner loop could also be written with streams, nested parallel streaming is discouraged - you already have enough parallelism at the outer level.
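For illustration, here is a minimal sketch of what processOne could look like, reusing the cosine-similarity computation from the question's run() method; the class name is made up, and the top-30 selection is only hinted at in a comment:
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Map;
import java.util.stream.Stream;

public class ProcessOneSketch {
    static Stream<String> processOne(String nonRepeaterKey, BitSet nonRepeaterBitSet,
                                     Map<String, BitSet> data_2) {
        double normNonRepeater = Math.sqrt(nonRepeaterBitSet.cardinality());
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, BitSet> repeater : data_2.entrySet()) {
            BitSet tmp = (BitSet) nonRepeaterBitSet.clone();
            tmp.and(repeater.getValue());
            double similarity = tmp.cardinality()
                    / (normNonRepeater * Math.sqrt(repeater.getValue().cardinality()));
            if (similarity > 0) {
                // Keeping only the best 30 entries (as in the original run() method)
                // would go here, e.g. with the bounded PriorityQueue suggested below.
                lines.add(nonRepeaterKey + "|" + repeater.getKey() + "|" + similarity);
            }
        }
        return lines.stream();
    }
}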
For your questions:
1) If I want to store the results of each thread in a file without overlapping, how can I do that?
Any logging framework (Logback, Log4j) can deal with it. Parallel streams can deal with it too. You can also store the prepared lines in a queue/array and print them from a separate thread. That takes a bit of care, though; ready-made solutions are easier and effectively do the same thing.
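A minimal sketch of the queue-plus-writer-thread idea (the class name, file name and sentinel value are made up for illustration): worker threads enqueue whole lines, and a single writer thread drains the queue, so output never interleaves.
import java.io.IOException;
import java.io.PrintWriter;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class ResultWriter {
    private static final String POISON_PILL = "__EOF__"; // sentinel telling the writer to stop

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();

        Thread writer = new Thread(() -> {
            try (PrintWriter out = new PrintWriter("results.txt")) {
                while (true) {
                    String line = queue.take();          // blocks until a line is available
                    if (POISON_PILL.equals(line)) break; // stop when all workers are done
                    out.println(line);
                }
            } catch (IOException | InterruptedException e) {
                throw new RuntimeException(e);
            }
        });
        writer.start();

        // Worker threads would call queue.put("key|similarKey|value") instead of println.
        queue.put("exampleKey|exampleSimilarKey|0.42");

        queue.put(POISON_PILL); // after all workers have been joined
        writer.join();
    }
}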
2) How can I make the following much faster?
Optimize and parallelize. In a normal situation you get between number_of_threads/1.5 and number_of_threads times faster processing, assuming hyperthreading is in play, but it depends on how much of the work is not parallel and on the underlying implementations.
3) How can I create the chunks dynamically, based on the number of cores?
You don't have to. Make a list of tasks (one task per data_1 entry) and feed an executor service with them - that is already a big enough task size. You can use a fixed thread pool with the number of threads as a parameter, and it will distribute the tasks evenly.
Note that you should create a task class, get a Future for each task from threadpool.submit, and at the end run a loop calling .get() on each Future. This throttles the main thread down to the executor's processing speed, implicitly giving fork-join-like behaviour. A sketch is shown below.
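A minimal sketch of the executor-plus-futures approach; SimilarityTask and the surrounding class are hypothetical names, and the actual similarity computation is left as a comment:
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ExecutorSketch {
    // Hypothetical task: computes the result lines for a single non-repeater entry.
    static class SimilarityTask implements Callable<List<String>> {
        private final String key;
        private final BitSet features;
        private final Map<String, BitSet> data_2;

        SimilarityTask(String key, BitSet features, Map<String, BitSet> data_2) {
            this.key = key;
            this.features = features;
            this.data_2 = data_2;
        }

        @Override
        public List<String> call() {
            List<String> lines = new ArrayList<>();
            // ... same similarity computation as in run(), for this one non-repeater ...
            return lines;
        }
    }

    static void process(Map<String, BitSet> data_1, Map<String, BitSet> data_2) throws Exception {
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);

        List<Future<List<String>>> futures = new ArrayList<>();
        for (Map.Entry<String, BitSet> e : data_1.entrySet()) {
            futures.add(pool.submit(new SimilarityTask(e.getKey(), e.getValue(), data_2)));
        }
        for (Future<List<String>> f : futures) {
            for (String line : f.get()) {   // blocks until that task is done
                System.out.println(line);   // or hand the lines to a single writer thread
            }
        }
        pool.shutdown();
    }
}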
4) Direct thread creation is an outdated technique. It is recommended to use an executor service of some sort, parallel streams, etc. If you do want a for loop, create a list of chunks, create one thread per chunk inside the loop and add it to a list of threads, and then, in another loop, join each thread in the list (see the sketch below).
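A minimal sketch of that loop, assuming chunks is the list of pre-split parts of data_1 and ClassName extends Thread as in the question:
List<Thread> threads = new ArrayList<>();
for (Map<String, BitSet> chunk : chunks) {   // chunks: the pre-split parts of data_1
    Thread t = new ClassName(data_2, chunk); // ClassName extends Thread, as in the question
    threads.add(t);
    t.start();
}
for (Thread t : threads) {
    t.join();                                // wait for every chunk to finish
}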
Ad hoc optimizations:
1) Make a Repeater class that stores the key, the BitSet and the cardinality. Preprocess your maps, turning them into Repeater instances and computing the cardinality once (i.e. not on every inner-loop run). That saves roughly 20 million * (20 million - 1) calls to .cardinality(). You still need to call it on the AND result. See the sketch after this list.
2) Replace vSimilarKeys and vSimilarValues with a size-limited PriorityQueue of combined entries. It is faster for keeping the top 30 elements.
Take a look at this question for info about PriorityQueue:
Java PriorityQueue with fixed size
3) You can skip processing a non-repeater entirely if its cardinality is 0 - an AND with its BitSet can never produce a non-zero cardinality, and you filter out all 0-similarity values anyway.
4) You can skip (remove from the temporary list you create in optimization 1) every Repeater with zero cardinality. As in point 3, it will never produce anything fruitful.
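Here is a minimal sketch of optimizations 1, 2 and 4 combined; the Repeater and Similarity classes are made-up names, and the min-heap keeps only the k best entries by evicting the smallest:
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class TopKSketch {
    // Optimization 1: cache the cardinality (as a norm) so it is computed once per entry.
    static class Repeater {
        final String key;
        final BitSet bits;
        final double norm; // sqrt(cardinality), precomputed

        Repeater(String key, BitSet bits) {
            this.key = key;
            this.bits = bits;
            this.norm = Math.sqrt(bits.cardinality());
        }
    }

    // Hypothetical holder for one (repeaterKey, similarity) result.
    static class Similarity {
        final String repeaterKey;
        final double value;

        Similarity(String repeaterKey, double value) {
            this.repeaterKey = repeaterKey;
            this.value = value;
        }
    }

    // Optimization 4: drop empty BitSets while preprocessing.
    static List<Repeater> preprocess(Map<String, BitSet> data) {
        List<Repeater> result = new ArrayList<>(data.size());
        for (Map.Entry<String, BitSet> e : data.entrySet()) {
            if (e.getValue().cardinality() > 0) {
                result.add(new Repeater(e.getKey(), e.getValue()));
            }
        }
        return result;
    }

    // Optimization 2: a bounded min-heap replaces the two parallel ArrayLists.
    static PriorityQueue<Similarity> topK(BitSet nonRepeater, List<Repeater> repeaters, int k) {
        double normNonRepeater = Math.sqrt(nonRepeater.cardinality());
        PriorityQueue<Similarity> best =
                new PriorityQueue<>(Comparator.comparingDouble((Similarity s) -> s.value));
        for (Repeater r : repeaters) {
            BitSet tmp = (BitSet) nonRepeater.clone();
            tmp.and(r.bits);
            double sim = tmp.cardinality() / (normNonRepeater * r.norm);
            if (sim <= 0) continue;
            if (best.size() < k) {
                best.add(new Similarity(r.key, sim));
            } else if (best.peek().value < sim) {
                best.poll(); // evict the current minimum
                best.add(new Similarity(r.key, sim));
            }
        }
        return best;
    }
}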
I wrote a simple program to compare the performance of a stream for finding the maximum of a list of integers. Surprisingly, I found that the performance of the "stream way" is about 1/10 of the "usual way". Am I doing something wrong? Is there any condition under which the stream way will not be efficient? Does anyone have a good explanation for this behavior?
The "stream way" took 80 milliseconds, the "usual way" took 15 milliseconds.
Please find the code below:
public class Performance {
public static void main(String[] args) {
ArrayList<Integer> a = new ArrayList<Integer>();
Random randomGenerator = new Random();
for (int i=0;i<40000;i++){
a.add(randomGenerator.nextInt(40000));
}
long start_s = System.currentTimeMillis( );
Optional<Integer> m1 = a.stream().max(Integer::compare);
long diff_s = System.currentTimeMillis( ) - start_s;
System.out.println(diff_s);
int e = a.size();
Integer m = Integer.MIN_VALUE;
long start = System.currentTimeMillis( );
for(int i=0; i < e; i++)
if(a.get(i) > m) m = a.get(i);
long diff = System.currentTimeMillis( ) - start;
System.out.println(diff);
}
}
Yes, streams are slower for such simple operations. But your numbers are completely unrelated. If you think that 15 milliseconds is a satisfactory time for your task, then there is good news: after warm-up, the stream code can solve this problem in roughly 0.1-0.2 milliseconds, which is 70-150 times faster.
Here's a quick-and-dirty benchmark:
import java.util.concurrent.TimeUnit;
import java.util.*;
import java.util.stream.*;
import org.openjdk.jmh.infra.Blackhole;
import org.openjdk.jmh.annotations.*;
@Warmup(iterations = 5, time = 1000, timeUnit = TimeUnit.MILLISECONDS)
@Measurement(iterations = 10, time = 1000, timeUnit = TimeUnit.MILLISECONDS)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Fork(3)
@State(Scope.Benchmark)
public class StreamTest {
// Stream API is very nice to get random data for tests!
List<Integer> a = new Random().ints(40000, 0, 40000).boxed()
.collect(Collectors.toList());
@Benchmark
public Integer streamList() {
return a.stream().max(Integer::compare).orElse(Integer.MIN_VALUE);
}
@Benchmark
public Integer simpleList() {
int e = a.size();
Integer m = Integer.MIN_VALUE;
for(int i=0; i < e; i++)
if(a.get(i) > m) m = a.get(i);
return m;
}
}
The results are:
Benchmark Mode Cnt Score Error Units
StreamTest.simpleList avgt 30 38.241 ± 0.434 us/op
StreamTest.streamList avgt 30 215.425 ± 32.871 us/op
These are microseconds, so the stream version is actually much faster than in your test. Nevertheless the simple version is faster still. So if you were fine with 15 ms, you can use either of these two versions: both will perform much faster.
If you want to get the best possible performance no matter what, you should get rid of boxed Integer objects and work with primitive array:
int[] b = new Random().ints(40000, 0, 40000).toArray();
@Benchmark
public int streamArray() {
return Arrays.stream(b).max().orElse(Integer.MIN_VALUE);
}
@Benchmark
public int simpleArray() {
int e = b.length;
int m = Integer.MIN_VALUE;
for(int i=0; i < e; i++)
if(b[i] > m) m = b[i];
return m;
}
Both versions are faster now:
Benchmark Mode Cnt Score Error Units
StreamTest.simpleArray avgt 30 10.132 ± 0.193 us/op
StreamTest.streamArray avgt 30 167.435 ± 1.155 us/op
Actually the stream version result may vary greatly, as it involves many intermediate methods which are JIT-compiled at different times, so the speed may change in either direction after some iterations.
By the way, your original problem can be solved by the good old Collections.max method, without the Stream API, like this:
Integer max = Collections.max(a);
In general you should avoid testing artificial code which does not solve real problems. With artificial code you will get artificial results which generally say nothing about API performance in real conditions.
The immediate difference that I see is that the stream way uses Integer::compare, which might require more autoboxing etc., versus an operator in the loop. Perhaps you can call Integer::compare in the loop to see if this is the reason?
EDIT: following the advice from Nicholas Robinson, I wrote a new version of the test. It uses a 400K-element list (the original size yielded zero-diff results), it uses Integer.compare in both cases, and it runs only one of them in each invocation (I alternate between the two methods):
static List<Integer> a = new ArrayList<Integer>();
public static void main(String[] args)
{
Random randomGenerator = new Random();
for (int i = 0; i < 400000; i++) {
a.add(randomGenerator.nextInt(400000));
}
long start = System.currentTimeMillis();
//Integer max = checkLoop();
Integer max = checkStream();
long diff = System.currentTimeMillis() - start;
System.out.println("max " + max + " diff " + diff);
}
static Integer checkStream()
{
Optional<Integer> max = a.stream().max(Integer::compare);
return max.get();
}
static Integer checkLoop()
{
int e = a.size();
Integer max = Integer.MIN_VALUE;
for (int i = 0; i < e; i++) {
if (Integer.compare(a.get(i), max) > 0) max = a.get(i);
}
return max;
}
The results for loop: max 399999 diff 10
The results for stream: max 399999 diff 40 (and sometimes I got 50)
In Java 8 a lot of effort has gone into making use of concurrent processing with the new lambdas. You will find the stream to be so much faster because the list is being processed concurrently in the most efficient way possible, whereas the usual way runs through the list sequentially.
Because the lambdas are static, this makes threading easier; however, when you are accessing something like your hard drive (reading a file line by line) you will probably find the stream won't be as efficient, because the hard drive can only access information sequentially.
[UPDATE]
The reason your stream took so much longer than the normal way is that you ran it first. The JRE is constantly trying to optimize performance, so by the time the usual way runs, a cache has already been set up. If you run the usual way before the stream way you should get the opposite results. I would recommend running the tests in separate main methods for the best results.
I had some fun comparing the speed of the removeAll(Collection<?> c) call declared in Collection. Now I know that micro-benchmarks are difficult to do right, and I won’t look at a few milliseconds difference, but I believe my results to be valid, since I ran them repeatedly and they are very reproducible.
Let’s assume I have two collections that are not too tiny, say 100,000 consecutive integer elements, and also that they mostly overlap, for instance 5,000 are in the left but not the right. Now I simply call:
left.removeAll(right);
Of course this all depends on the types of both the left and the right collection. It’s blazingly fast if the right collection is a hash map, because that’s where the look-ups are done. But looking closer, I noticed two results that I cannot explain. I tried all the tests both with an ArrayList that is sorted and with another that is shuffled (using Collections.shuffle(), if that is of importance).
The first weird result is:
00293 025% shuffled ArrayList, HashSet
00090 008% sorted ArrayList, HashSet
Now either removing elements from the sorted ArrayList is faster than removing from the shuffled list, or looking up consecutive values from the HashSet is faster than looking up random values.
Now the other one:
02311 011% sorted ArrayList, shuffled ArrayList
01401 006% sorted ArrayList, sorted ArrayList
Now this suggests that the lookup in the sorted ArrayList (using a contains() call for each element of the list to the left) is faster than in the shuffled list. Now that would be quite easy if we could make use of the fact that it is sorted and use a binary search, but I do not do that.
Both results are mysterious to me. I cannot explain them by looking at the code or with my data-structure knowledge. Does it have anything to do with processor cache access patterns? Is the JIT compiler optimizing stuff? But if so, which? I performed warming up and run the tests a few times in a row, but perhaps there is a fundamental problem with my benchmark?
The reason for the performance difference is the memory access pattern: accessing elements which are consecutive in memory is faster than doing a random memory access (due to memory pre-fetching, cpu caches etc.)
When you initially populate the collection you create all the elements sequentially in the memory, so when you are traversing it (foreach, removeAll, etc) you are accessing consecutive memory regions which is cache friendly. When you shuffle the collection - the elements remain in the same order in memory, but the pointers to those elements are no longer in the same order, so when you are traversing the collection you'll be accessing for instance the 10th, the 1st, then the 5th element which is very cache unfriendly and ruins the performance.
You can look at this question where this effect is visible in greater detail:
Why filtering an unsorted list is faster than filtering a sorted list
Since the asker did not provide any example code, and there have been doubts about the benchmark mentioned in the comments and answers, I created a small test to see whether the removeAll method is slower when the argument is a shuffled list (instead of a sorted list). And I confirmed the observation of the asker: The output of the test was roughly
100000 elements, sortedList and sortedList, 5023,090 ms, size 5000
100000 elements, shuffledList and sortedList, 5062,293 ms, size 5000
100000 elements, sortedList and shuffledList, 10657,438 ms, size 5000
100000 elements, shuffledList and shuffledList, 10700,145 ms, size 5000
I'll omit the code for this particular test here, because it also has been questioned (which - by the way - is perfectly justified! A lot of BS is posted on the web...).
So I did further tests, for which I'll provide the code here.
This may also not be considered as a definite answer. But I tried to adjust the tests so that they at least provide some strong evidence that the reason for the reduced performance is indeed what Svetlin Zarev mentioned in his answer (+1 and accept this if it convinces you). Namely, that the reason for the slowdown lies in the caching effects of the scattered accesses.
First of all: I am aware of many of the possible pitfalls when writing a microbenchmark (and so is the asker, according to his statements). However, I know that nobody will believe my benchmark, even if it is perfectly reasonable, unless it is performed with an appropriate microbenchmarking tool. So in order to show that the performance with a shuffled list is lower than with a sorted list, I created this simple JMH benchmark:
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;
@State(Scope.Thread)
public class RemoveAllBenchmarkJMH
{
#Param({"sorted", "shuffled"})
public String method;
#Param({"1000", "10000", "100000" })
public int numElements;
private List<Integer> left;
private List<Integer> right;
@Setup
public void initList()
{
left = new ArrayList<Integer>();
right = new ArrayList<Integer>();
for (int i=0; i<numElements; i++)
{
left.add(i);
}
int n = (int)(numElements * 0.95);
for (int i=0; i<n; i++)
{
right.add(i);
}
if (method.equals("shuffled"))
{
Collections.shuffle(right);
}
}
@Benchmark
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public void testMethod(Blackhole bh)
{
left.removeAll(right);
bh.consume(left.size());
}
}
The output of this one is as follows:
(method) (numElements) Mode Cnt Score Error Units
sorted 1000 avgt 50 52,055 ± 0,507 us/op
shuffled 1000 avgt 50 55,720 ± 0,466 us/op
sorted 10000 avgt 50 5341,917 ± 28,630 us/op
shuffled 10000 avgt 50 7108,845 ± 45,869 us/op
sorted 100000 avgt 50 621714,569 ± 19040,964 us/op
shuffled 100000 avgt 50 1110301,876 ± 22935,976 us/op
I hope that this helps to resolve doubts about the statement itself, although I admit that I'm not a JMH expert. If there is something wrong with this benchmark, please let me know.
Now, these results have been roughly in line with my other, manual (non-JMH) microbenchmark. In order to create evidence for the fact that the shuffling is the problem, I created a small test that compares the performance using lists that are shuffled by different degrees. By providing a value between 0.0 and 1.0, one can limit the number of swapped elements, and thus the shuffledness of the list. (Of course, this is rather "pragmatic", as there are different options of how this could be implemented, considering the different possible (statistical) measures for "shuffledness").
The code looks as follows:
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.function.Function;
public class RemoveAllBenchmarkExt
{
public static void main(String[] args)
{
for (int n=10000; n<=100000; n+=10000)
{
runTest(n, sortedList() , sortedList());
runTest(n, sortedList() , shuffledList(0.00));
runTest(n, sortedList() , shuffledList(0.25));
runTest(n, sortedList() , shuffledList(0.50));
runTest(n, sortedList() , shuffledList(0.75));
runTest(n, sortedList() , shuffledList(1.00));
runTest(n, sortedList() , reversedList());
System.out.println();
}
}
private static Function<Integer, Collection<Integer>> sortedList()
{
return new Function<Integer, Collection<Integer>>()
{
@Override
public Collection<Integer> apply(Integer t)
{
List<Integer> list = new ArrayList<Integer>(t);
for (int i=0; i<t; i++)
{
list.add(i);
}
return list;
}
@Override
public String toString()
{
return "sorted";
}
};
}
private static Function<Integer, Collection<Integer>> shuffledList(
final double degree)
{
return new Function<Integer, Collection<Integer>>()
{
@Override
public Collection<Integer> apply(Integer t)
{
List<Integer> list = new ArrayList<Integer>(t);
for (int i=0; i<t; i++)
{
list.add(i);
}
shuffle(list, degree);
return list;
}
@Override
public String toString()
{
return String.format("shuffled(%4.2f)", degree);
}
};
}
private static void shuffle(List<Integer> list, double degree)
{
Random random = new Random(0);
int n = (int)(degree * list.size());
for (int i=n; i>1; i--)
{
swap(list, i-1, random.nextInt(i));
}
}
private static void swap(List<Integer> list, int i, int j)
{
list.set(i, list.set(j, list.get(i)));
}
private static Function<Integer, Collection<Integer>> reversedList()
{
return new Function<Integer, Collection<Integer>>()
{
@Override
public Collection<Integer> apply(Integer t)
{
List<Integer> list = new ArrayList<Integer>(t);
for (int i=0; i<t; i++)
{
list.add(i);
}
Collections.reverse(list);
return list;
}
@Override
public String toString()
{
return "reversed";
}
};
}
private static void runTest(int n,
Function<Integer, ? extends Collection<Integer>> leftFunction,
Function<Integer, ? extends Collection<Integer>> rightFunction)
{
Collection<Integer> left = leftFunction.apply(n);
Collection<Integer> right = rightFunction.apply((int)(n*0.95));
long before = System.nanoTime();
left.removeAll(right);
long after = System.nanoTime();
double durationMs = (after - before) / 1e6;
System.out.printf(
"%8d elements, %15s, duration %10.3f ms, size %d\n",
n, rightFunction, durationMs, left.size());
}
}
(Yes, it's very simple. However, if you think that the timings are completely useless, compare them to a JMH run, and after a few hours, you'll see that they are reasonable)
The timings for the last pass are as follows:
100000 elements, sorted, duration 6016,354 ms, size 5000
100000 elements, shuffled(0,00), duration 5849,537 ms, size 5000
100000 elements, shuffled(0,25), duration 7319,948 ms, size 5000
100000 elements, shuffled(0,50), duration 9344,408 ms, size 5000
100000 elements, shuffled(0,75), duration 10657,021 ms, size 5000
100000 elements, shuffled(1,00), duration 11295,808 ms, size 5000
100000 elements, reversed, duration 5830,695 ms, size 5000
One can clearly see that the timings are basically increasing linearly with the shuffledness.
Of course, all this is still not proof, but at least evidence that the answer by Svetlin Zarev is correct.
Looking at the source code for ArrayList.removeAll() (OpenJDK7-b147), it appears that it delegates to a private method called batchRemove(), which is as follows:
private boolean batchRemove(Collection<?> c, boolean complement) {
    final Object[] elementData = this.elementData;
    int r = 0, w = 0;
    boolean modified = false;
    try {
        for (; r < size; r++)
            if (c.contains(elementData[r]) == complement)
                elementData[w++] = elementData[r];
    } finally {
        // Preserve behavioral compatibility with AbstractCollection,
        // even if c.contains() throws.
        if (r != size) {
            System.arraycopy(elementData, r,
                             elementData, w,
                             size - r);
            w += size - r;
        }
        if (w != size) {
            for (int i = w; i < size; i++)
                elementData[i] = null;
            modCount += size - w;
            size = w;
            modified = true;
        }
    }
    return modified;
}
It practically loops through the array and has a bunch of c.contains() calls. Basically there's no reason why this iteration would go faster for a sorted array.
I second StephenC's doubt about the benchmark, and believe that it'd be more fruitful for you to scrutinize the benchmark code before digging in any deeper into cache access patterns etc.
Also, if the benchmark code is not the culprit, it would be interesting to know the Java version and the OS/arch, etc.
Now I know that micro-benchmarks are difficult to do right, and I won’t look at a few milliseconds difference, but I believe my results to be valid, since I ran them repeatedly and they are very reproducible.
That does not convince me. The behaviour of a flawed benchmark can be 100% reproducible.
I suspect that ... in fact ... a flaw or flaws in your benchmark >>is<< the cause of your strange results. It often is.
... but perhaps there is a fundamental problem with my benchmark?
Yes (IMO).
Show us the benchmark code if you want a more detailed answer.
I experienced a performance issue when using a stream created with spliterator() over an Iterable, i.e. like StreamSupport.stream(integerList.spliterator(), true). I wanted to compare this against a normal collection. Please see some benchmark results below.
Question:
Why is the parallel stream created from an Iterable much slower than the stream created from an ArrayList or an IntStream?
From a range
public void testParallelFromIntRange() {
long start = System.nanoTime();
IntStream stream = IntStream.rangeClosed(1, Integer.MAX_VALUE).parallel();
System.out.println("Is Parallel: "+stream.isParallel());
stream.forEach(ParallelStreamSupportTest::calculate);
long end = System.nanoTime();
System.out.println("ParallelStream from range Takes : " + TimeUnit.MILLISECONDS.convert((end - start),
TimeUnit.NANOSECONDS) + " milli seconds");
}
Is Parallel: true
ParallelStream from range Takes : 490 milli seconds
From an Iterable
public void testParallelFromIterable() {
Set<Integer> integerList = ContiguousSet.create(Range.closed(1, Integer.MAX_VALUE), DiscreteDomain.integers());
long start = System.nanoTime();
Stream<Integer> stream = StreamSupport.stream(integerList.spliterator(), true);
System.out.println("Is Parallel: " + stream.isParallel());
stream.forEach(ParallelStreamSupportTest::calculate);
long end = System.nanoTime();
System.out.println("ParallelStream from Iterable Takes : " + TimeUnit.MILLISECONDS.convert((end - start),
TimeUnit.NANOSECONDS) + " milli seconds");
}
Is Parallel: true
ParallelStream from Iterable Takes : 12517 milli seconds
And the trivial calculate method:
public static Integer calculate(Integer input) {
return input + 2;
}
Not all spliterators are created equally. One of the tasks of a spliterator is to decompose the source into two parts, that can be processed in parallel. A good spliterator will divide the source roughly in half (and will be able to continue to do so recursively.)
Now, imagine you are writing a spliterator for a source that is only described by an Iterator. What quality of decomposition can you get? Basically, all you can do is divide the source into "first" and "rest". That's about as bad as it gets. The result is a computation tree that is very "right-heavy".
The spliterator that you get from a data structure has more to work with; it knows the layout of the data, and can use that to give better splits, and therefore better parallel performance. The spliterator for ArrayList can always divide in half, and retains knowledge of exactly how much data is in each half. That's really good. The spliterator from a balanced tree can get good distribution (since each half of the tree has roughly half the elements), but isn't quite as good as the ArrayList spliterator because it doesn't know the exact sizes. The spliterator for a LinkedList is about as bad as it gets; all it can do is (first, rest). And the same for deriving a spliterator from an iterator.
Now, all is not necessarily lost; if the work per element is high, you can overcome bad splitting. But if you're doing a small amount of work per element, you'll be limited by the quality of splits from your spliterator.
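To make the split-quality point concrete, here is a small sketch comparing trySplit() on an ArrayList spliterator with one derived from a bare Iterator; the list size is arbitrary, and the exact batch size of the iterator-based spliterator is an implementation detail:
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Spliterator;
import java.util.Spliterators;

public class SplitQuality {
    public static void main(String[] args) {
        List<Integer> list = new ArrayList<>();
        for (int i = 0; i < 1_000_000; i++) {
            list.add(i);
        }

        // The ArrayList spliterator knows its size, so trySplit() hands off half the elements.
        Spliterator<Integer> good = list.spliterator();
        Spliterator<Integer> goodHalf = good.trySplit();
        System.out.println("ArrayList split: " + goodHalf.estimateSize() + " / " + good.estimateSize());

        // A spliterator built from a bare Iterator has unknown size; trySplit() can only
        // peel off a fixed-size batch ("first") and keep the unknown-sized "rest".
        Iterator<Integer> it = list.iterator();
        Spliterator<Integer> poor = Spliterators.spliteratorUnknownSize(it, 0);
        Spliterator<Integer> poorHalf = poor.trySplit();
        System.out.println("Iterator-based split: batch of " + poorHalf.estimateSize()
                + ", remaining estimate " + poor.estimateSize());
    }
}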
There are several problems with your benchmark.
Stream<Integer> cannot be compared to IntStream because of boxing overhead.
You aren't doing anything with the result of the calculation, which makes it hard to know whether the code is actually being run
You are benchmarking with System.nanoTime instead of using a proper benchmarking tool.
Here's a JMH-based benchmark:
import com.google.common.collect.ContiguousSet;
import com.google.common.collect.DiscreteDomain;
import com.google.common.collect.Range;
import java.util.stream.IntStream;
import java.util.stream.Stream;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.OptionsBuilder;
public class Ranges {
final static int SIZE = 10_000_000;
@Benchmark
public long intStream() {
Stream<Integer> st = IntStream.rangeClosed(1, SIZE).boxed();
return st.parallel().mapToInt(x -> x).sum();
}
@Benchmark
public long contiguousSet() {
ContiguousSet<Integer> cs = ContiguousSet.create(Range.closed(1, SIZE), DiscreteDomain.integers());
Stream<Integer> st = cs.stream();
return st.parallel().mapToInt(x -> x).sum();
}
public static void main(String[] args) throws RunnerException {
new Runner(
new OptionsBuilder()
.include(".*Ranges.*")
.forks(1)
.warmupIterations(5)
.measurementIterations(5)
.build()
).run();
}
}
And the output:
Benchmark Mode Samples Score Score error Units
b.Ranges.contiguousSet thrpt 5 13.540 0.924 ops/s
b.Ranges.intStream thrpt 5 27.047 5.119 ops/s
So IntStream.range is about twice as fast as ContiguousSet, which is perfectly reasonable, given that ContiguousSet doesn't implement its own Spliterator and uses the default one from Set.
I have done the same test as in this post:
new String() vs literal string performance
Meaning, I wanted to test which one has better performance. As I expected, the result was that assignment by literal is faster. I don't know why, but I ran the test with some more assignments and noticed something strange: when I let the program do the loops more than 10,000 times, the assignment by literal is relatively not as much faster than at fewer than 10,000 assignments. And at 1,000,000 repetitions it is even slower than creating new objects.
Here is my code:
double tx = System.nanoTime();
for (int i = 0; i<1; i++){
String s = "test";
}
double ty = System.nanoTime();
double ta = System.nanoTime();
for (int i = 0; i<1; i++){
String s = new String("test");
}
double tb = System.nanoTime();
System.out.println((ty-tx));
System.out.println((tb-ta));
I ran it just as written above. I'm still learning Java; my boss asked me to do the test, and after I presented the outcome he asked me to find out why this happens. I cannot find anything on Google or Stack Overflow, so I hope someone can help me out here.
factor at 1 repetition 3,811565221
factor at 10 repetitions 4,393570401
factor at 100 repetitions 5,234779103
factor at 1,000 repetitions 7,909884116
factor at 10,000 repetitions 9,395538811
factor at 100,000 repetitions 2,355514697
factor at 1,000,000 repetitions 0,734826755
Thank you!
First you'll have to learn a lot about the internals of HotSpot, in particular the fact that your code is first interpreted, then at a certain point compiled into native code.
A lot of optimizations happen while compiling, based on results of both static and dynamic analysis of your code.
Specifically, in your code,
String s = "test";
is a clear no-op. The compiler will emit no code whatsoever for this line. All that remains is the loop itself, and the whole loop may be eliminated if HotSpot proves it has no observable outside effects.
Second, even the code
String s = new String("test");
may result in almost the same thing as above because it is very easy to prove that your new String is an instance which cannot escape from the method where it is created.
With your code, the measurements are mixing up the performance of interpreted bytecode, the delay it takes to compile the code and swap it in by On-Stack Replacement, and then the performance of the native code.
Basically, the measurements you are making are measuring everything but the effect you have set out to measure.
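If you do want to keep a hand-rolled test despite these caveats, a minimal sketch would at least warm up both variants and consume the results, so the loops cannot be eliminated outright (the class name and iteration counts here are arbitrary):
public class StringAllocationTest {
    static long literalLoop(int iters) {
        long sum = 0;
        for (int i = 0; i < iters; i++) {
            String s = "test";
            sum += s.length();           // consume the value so the work is observable
        }
        return sum;
    }

    static long newStringLoop(int iters) {
        long sum = 0;
        for (int i = 0; i < iters; i++) {
            String s = new String("test");
            sum += s.length();
        }
        return sum;
    }

    public static void main(String[] args) {
        long sink = 0;
        // Warm-up so both methods get JIT-compiled before measuring.
        for (int i = 0; i < 20; i++) {
            sink += literalLoop(100_000);
            sink += newStringLoop(100_000);
        }

        long t0 = System.nanoTime();
        sink += literalLoop(1_000_000);
        long t1 = System.nanoTime();
        sink += newStringLoop(1_000_000);
        long t2 = System.nanoTime();

        System.out.println("literal:    " + (t1 - t0) + " ns");
        System.out.println("new String: " + (t2 - t1) + " ns");
        System.out.println(sink);        // print the sink so it cannot be optimized away
    }
}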
To make the arguments more solid, I have repeated the test with jmh:
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 1, time = 1)
@Measurement(iterations = 3, time = 1)
@Threads(1)
@Fork(2)
public class Strings
{
static final int ITERS = 1_000;
@GenerateMicroBenchmark
public void literal() {
for (int i = 0; i < ITERS; i++) { String s = "test"; }
}
@GenerateMicroBenchmark
public void newString() {
for (int i = 0; i < ITERS; i++) { String s = new String("test"); }
}
}
and these are the results:
Benchmark Mode Samples Mean Mean error Units
literal avgt 6 0.625 0.023 ns/op
newString avgt 6 43.778 3.283 ns/op
You can see that the whole method body is eliminated in the case of the string literal, while with new String the loop remains, but with nothing in it, because the time per loop iteration is just 0.04 nanoseconds. Definitely no String instances are allocated.