I'm experimenting with microbenchmarks, have chosen JMH, and have read some articles. How does JMH measure the execution of methods that run faster than the system timer's granularity?
A more detailed explanation:
These are the benchmarks I'm running (method names speak for themselves):
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;
import java.util.concurrent.TimeUnit;
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
@Warmup(iterations = 10, time = 200, timeUnit = TimeUnit.NANOSECONDS)
@Measurement(iterations = 20, time = 200, timeUnit = TimeUnit.NANOSECONDS)
public class RandomBenchmark {
public long lastValue;
@Benchmark
@Fork(1)
public void blankMethod() {
}
@Benchmark
@Fork(1)
public void simpleMethod(Blackhole blackhole) {
int i = 0;
blackhole.consume(i++);
}
@Benchmark
@Fork(1)
public void granularityMethod(Blackhole blackhole) {
long initialTime = System.nanoTime();
long measuredTime;
do {
measuredTime = System.nanoTime();
} while (measuredTime == initialTime);
blackhole.consume(measuredTime);
}
}
Here are results:
# Run complete. Total time: 00:00:02
Benchmark Mode Cnt Score Error Units
RandomBenchmark.blankMethod avgt 20 0,887 ± 0,274 ns/op
RandomBenchmark.granularityMethod avgt 20 407,002 ± 26,297 ns/op
RandomBenchmark.simpleMethod avgt 20 6,979 ± 0,743 ns/op
This ran on Windows 7, which, as various articles describe, has a coarse timer granularity (407 ns here). A check with the basic code below confirms that a new timer value indeed arrives only every ~400 ns:
final int sampleSize = 100;
long[] timeMarks = new long[sampleSize];
for (int i=0; i < sampleSize; i++) {
timeMarks[i] = System.nanoTime();
}
for (long timeMark : timeMarks) {
System.out.println(timeMark);
}
It's hard to fully understand exactly how the generated methods work, but judging by the decompiled JMH-generated code, it seems to call the same System.nanoTime() before and after execution and measure the difference. How is it able to measure a method that executes in a couple of nanoseconds when the granularity is 400 ns?
You are totally right. You cannot measure something that is faster than your system's timer granularity.
JMH doesn't measure each invocation of the benchmark method. It calls System.nanoTime() before the start of an iteration, executes the benchmark method X times, and calls System.nanoTime() again after the iteration. The result is then the time difference divided by the number of operations (you can declare more than one operation per invocation on the method with @OperationsPerInvocation).
Aleksey Shipilev discussed measurement problems with nanoTime in his article Nanotrusting the Nanotime. The 'Latency' section contains a code example that shows how JMH measures one benchmark iteration.
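The idea described above can be sketched in plain Java. This is only an illustration of batch timing, not JMH's generated code: the class name BatchTiming, the method averageNanosPerOp, and the volatile sink (a crude stand-in for Blackhole) are all hypothetical, and a fixed invocation count is assumed for simplicity. The timer is read only twice, so its granularity is spread over the whole batch.

```java
import java.util.function.IntSupplier;

public class BatchTiming {
    // Written on every call so the JIT cannot discard the work (assumption: a rough Blackhole substitute).
    static volatile int sink;

    /** Times `invocations` calls of `op` with two timer reads and returns the average ns per call. */
    static double averageNanosPerOp(IntSupplier op, long invocations) {
        long start = System.nanoTime();
        for (long i = 0; i < invocations; i++) {
            sink = op.getAsInt();
        }
        long elapsed = System.nanoTime() - start;
        // Even with a ~400 ns timer step, the error per operation is at most ~400 ns / invocations.
        return (double) elapsed / invocations;
    }

    public static void main(String[] args) {
        IntSupplier op = () -> 42;
        averageNanosPerOp(op, 10_000_000L); // warm-up pass, result ignored
        double avg = averageNanosPerOp(op, 10_000_000L);
        System.out.println("average ns/op: " + avg);
    }
}
```

With ten million invocations per batch, a 400 ns timer step contributes a negligible error to each per-operation figure, which is why sub-nanosecond averages can be reported at all.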
In JMH (Java Microbenchmark Harness), we can use
@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 10)
@Measurement(iterations = 10)
to evaluate the average time of an execution after JVM warms up.
Also we can use
@BenchmarkMode(Mode.SingleShotTime)
@Measurement(iterations = 1)
to estimate the cold start time of an execution. But this only executes the benchmark once, which may introduce bias. So is there any method to evaluate the average time of the cold start in JMH?
According to Aleksey himself (though from 2014):
Single-shot benchmarks were originally destined to run a single
measurement iteration over multiple forks -- the scenarios to estimate
"cold" performance. But for many cases, you might want more measurement
iterations there especially if you are running only a single fork,
because more samples would be generated.
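import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class AverageSingleShot {
    public static void main(String[] args) throws Exception {
        Options opt = new OptionsBuilder()
                .include(AverageSingleShot.class.getSimpleName())
                .build();
        new Runner(opt).run();
    }
    // 100 forks, each a fresh JVM, give 100 cold single-shot samples
    @Fork(100)
    @Benchmark
    @BenchmarkMode(Mode.SingleShotTime)
    public int test() {
        return ThreadLocalRandom.current().nextInt() + ThreadLocalRandom.current().nextInt();
    }
}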
Besides the fact that this will tell you the average (see that 100):
Benchmark Mode Cnt Score Error Units
AverageSingleShot.test ss 100 41173.540 ± 2871.546 ns/op
you will also get Percentiles and a Histogram.
I am having a problem with Guava RateLimiter. I create the RateLimiter with RateLimiter.create(1.0) ("1 permit per second") and call rateLimiter.acquire() on every cycle, but when I do a test run I get the following result:
Average: 1232.0 Diff since last: 2540
Average: 1180.0 Diff since last: 258
Average: 1159.0 Diff since last: 746
Average: 1151.0 Diff since last: 997
Average: 1144.0 Diff since last: 1004
Average is the number of milliseconds it sleeps on average and diff is the number of milliseconds passed since last print. On average it's okay, it does not permit my code to run more than once per second. But sometimes (as you can see) it runs more than once per second.
Do you have any idea why? Am I missing something?
The code that generates the above output:
private int numberOfRequests;
private Date start;
private long last;
private boolean first = true;
private void sleep() {
numberOfRequests++;
if(first) {
first = false;
start = new Date();
}
rateLimiter.acquire();
long current = new Date().getTime();
double num = (current -start.getTime()) / numberOfRequests;
System.out.println("Average: "+ num + " Diff since last: " + (current - last));
last = current;
}
Your benchmark appears to be flawed - when I try to replicate it I see very close to one acquisition per second. Here's my benchmark:
public class RateLimiterDemo {
public static void main(String[] args) {
StatsAccumulator stats = new StatsAccumulator();
RateLimiter rateLimiter = RateLimiter.create(1.0);
rateLimiter.acquire(); // discard initial permit
for (int i = 0; i < 10; i++) {
long start = System.nanoTime();
rateLimiter.acquire();
stats.add((System.nanoTime() - start) / 1_000_000.0);
}
System.out.println(stats.snapshot());
}
}
A sample run prints:
Stats{count=10, mean=998.9071456, populationStandardDeviation=3.25398397901304, min=989.303887, max=1000.971085}
The variance there is almost all attributable to the benchmark overhead (populating the StatsAccumulator and computing the time delta).
Even this benchmark has flaws (though I'd contend fewer of them). Creating an accurate benchmark is very hard, and simple whipped-together benchmarks are often either inaccurate or, worse, don't reflect the actual performance of the code being tested in a production setting.
I wrote a simple program to measure the performance of a stream when finding the maximum of a list of integers. Surprisingly, I found that the performance of the 'stream way' is about 1/10 that of the 'usual way'. Am I doing something wrong? Are there conditions under which the stream way will not be efficient? Does anyone have a good explanation for this behavior?
The "stream way" took 80 milliseconds; the "usual way" took 15 milliseconds.
Please find the code below:
public class Performance {
public static void main(String[] args) {
ArrayList<Integer> a = new ArrayList<Integer>();
Random randomGenerator = new Random();
for (int i=0;i<40000;i++){
a.add(randomGenerator.nextInt(40000));
}
long start_s = System.currentTimeMillis( );
Optional<Integer> m1 = a.stream().max(Integer::compare);
long diff_s = System.currentTimeMillis( ) - start_s;
System.out.println(diff_s);
int e = a.size();
Integer m = Integer.MIN_VALUE;
long start = System.currentTimeMillis( );
for(int i=0; i < e; i++)
if(a.get(i) > m) m = a.get(i);
long diff = System.currentTimeMillis( ) - start;
System.out.println(diff);
}
}
Yes, streams are slower for such simple operations. But your numbers bear no relation to the actual performance. If you think that 15 milliseconds is a satisfactory time for your task, then there is good news: after warm-up, the stream code can solve this problem in about 0.1-0.2 milliseconds, which is 70-150 times faster.
Here's a quick-and-dirty benchmark:
import java.util.concurrent.TimeUnit;
import java.util.*;
import java.util.stream.*;
import org.openjdk.jmh.infra.Blackhole;
import org.openjdk.jmh.annotations.*;
@Warmup(iterations = 5, time = 1000, timeUnit = TimeUnit.MILLISECONDS)
@Measurement(iterations = 10, time = 1000, timeUnit = TimeUnit.MILLISECONDS)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Fork(3)
@State(Scope.Benchmark)
public class StreamTest {
// Stream API is very nice to get random data for tests!
List<Integer> a = new Random().ints(40000, 0, 40000).boxed()
.collect(Collectors.toList());
@Benchmark
public Integer streamList() {
return a.stream().max(Integer::compare).orElse(Integer.MIN_VALUE);
}
@Benchmark
public Integer simpleList() {
int e = a.size();
Integer m = Integer.MIN_VALUE;
for(int i=0; i < e; i++)
if(a.get(i) > m) m = a.get(i);
return m;
}
}
The results are:
Benchmark Mode Cnt Score Error Units
StreamTest.simpleList avgt 30 38.241 ± 0.434 us/op
StreamTest.streamList avgt 30 215.425 ± 32.871 us/op
Note that the units are microseconds. So the Stream version is actually much faster than in your test. Nevertheless, the simple version is faster still. So if you were fine with 15 ms, you can use whichever of these two versions you like: both will perform much faster.
If you want the best possible performance no matter what, you should get rid of the boxed Integer objects and work with a primitive array:
int[] b = new Random().ints(40000, 0, 40000).toArray();
@Benchmark
public int streamArray() {
return Arrays.stream(b).max().orElse(Integer.MIN_VALUE);
}
@Benchmark
public int simpleArray() {
int e = b.length;
int m = Integer.MIN_VALUE;
for(int i=0; i < e; i++)
if(b[i] > m) m = b[i];
return m;
}
Both versions are faster now:
Benchmark Mode Cnt Score Error Units
StreamTest.simpleArray avgt 30 10.132 ± 0.193 us/op
StreamTest.streamArray avgt 30 167.435 ± 1.155 us/op
Actually, the stream version's result may vary greatly, as it involves many intermediate methods that are JIT-compiled at different times, so the speed may change in either direction after some iterations.
By the way, your original problem can be solved with the good old Collections.max method, without the Stream API, like this:
Integer max = Collections.max(a);
In general, you should avoid testing artificial code that does not solve real problems. With artificial code you get artificial results, which generally say nothing about the API's performance in real conditions.
The immediate difference I see is that the stream way uses Integer::compare, which might require more autoboxing etc., versus an operator in the loop. Perhaps you can call Integer.compare in the loop to see if this is the reason?
EDIT: following the advice from Nicholas Robinson, I wrote a new version of the test. It uses 400K sized list (the original one yielded zero diff results), it uses Integer.compare in both cases and runs only one of them in each invocation (I alternate between the two methods):
static List<Integer> a = new ArrayList<Integer>();
public static void main(String[] args)
{
Random randomGenerator = new Random();
for (int i = 0; i < 400000; i++) {
a.add(randomGenerator.nextInt(400000));
}
long start = System.currentTimeMillis();
//Integer max = checkLoop();
Integer max = checkStream();
long diff = System.currentTimeMillis() - start;
System.out.println("max " + max + " diff " + diff);
}
static Integer checkStream()
{
Optional<Integer> max = a.stream().max(Integer::compare);
return max.get();
}
static Integer checkLoop()
{
int e = a.size();
Integer max = Integer.MIN_VALUE;
for (int i = 0; i < e; i++) {
if (Integer.compare(a.get(i), max) > 0) max = a.get(i);
}
return max;
}
The results for loop: max 399999 diff 10
The results for stream: max 399999 diff 40 (and sometimes I got 50)
In Java 8 they have put a lot of effort into making use of concurrent processing with the new lambdas. You will find the stream to be much faster because the list is being processed concurrently in the most efficient way possible, whereas the usual way runs through the list sequentially.
Because the lambdas are static, this makes threading easier; however, when you are accessing something like your hard drive (reading in a file line by line), you will probably find the stream won't be as efficient because the hard drive can only access info.
[UPDATE]
The reason your stream took so much longer than the normal way is that you ran it first. The JRE is constantly trying to optimize performance, so there will be a cache set up by the time the usual way runs. If you run the usual way before the stream way, you should get the opposite results. I would recommend running the tests in different mains for the best results.
import java.util.ArrayList;
import java.util.List;
public class IterationBenchmark {
public static void main(String args[]){
List<String> persons = new ArrayList<String>();
persons.add("AAA");
persons.add("BBB");
persons.add("CCC");
persons.add("DDD");
long timeMillis = System.currentTimeMillis();
for(String person : persons)
System.out.println(person);
System.out.println("Time taken for legacy for loop : "+
(System.currentTimeMillis() - timeMillis));
timeMillis = System.currentTimeMillis();
persons.stream().forEach(System.out::println);
System.out.println("Time taken for sequence stream : "+
(System.currentTimeMillis() - timeMillis));
timeMillis = System.currentTimeMillis();
persons.parallelStream().forEach(System.out::println);
System.out.println("Time taken for parallel stream : "+
(System.currentTimeMillis() - timeMillis));
}
}
Output:
AAA
BBB
CCC
DDD
Time taken for legacy for loop : 0
AAA
BBB
CCC
DDD
Time taken for sequence stream : 49
CCC
DDD
AAA
BBB
Time taken for parallel stream : 3
Why is the performance of the Java 8 Stream API so low compared to the legacy for loop?
The very first call to the Stream API in your program is always quite slow, because many auxiliary classes must be loaded, many anonymous classes must be generated for lambdas, and many methods must be JIT-compiled. Thus the very first Stream operation usually takes several dozen milliseconds. Subsequent calls are much faster and may drop below 1 us, depending on the exact stream operation. If you swap the parallel-stream test and the sequential-stream test, the sequential stream will be much faster. All the hard work is done by whichever comes first.
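The first-call effect can be observed without any harness. The following is a rough sketch, not a proper benchmark: the class name FirstCallCost and the method timeRuns are hypothetical, and the absolute numbers will vary between machines and JVMs. It times the same small stream operation several times in a row within one JVM.

```java
import java.util.Arrays;
import java.util.List;

public class FirstCallCost {
    // Keeps the computed maximum observable so the work is not optimized away.
    static volatile int sink;

    /** Runs the same stream-based max `runs` times and returns each run's duration in nanoseconds. */
    static long[] timeRuns(List<Integer> data, int runs) {
        long[] timings = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            sink = data.stream().max(Integer::compare).orElse(Integer.MIN_VALUE);
            timings[i] = System.nanoTime() - start;
        }
        return timings;
    }

    public static void main(String[] args) {
        long[] timings = timeRuns(Arrays.asList(3, 1, 4, 1, 5), 5);
        for (int i = 0; i < timings.length; i++) {
            System.out.println("run " + (i + 1) + ": " + timings[i] + " ns");
        }
    }
}
```

On a fresh JVM the first run typically dominates the others by a wide margin, because it pays for class loading and lambda bootstrapping.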
Let's write a JMH benchmark to properly warm up your code and test all the cases independently:
import java.util.concurrent.TimeUnit;
import java.util.*;
import java.util.stream.*;
import org.openjdk.jmh.annotations.*;
@Warmup(iterations = 5, time = 1000, timeUnit = TimeUnit.MILLISECONDS)
@Measurement(iterations = 10, time = 1000, timeUnit = TimeUnit.MILLISECONDS)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Fork(3)
@State(Scope.Benchmark)
public class StreamTest {
List<String> persons;
@Setup
public void setup() {
persons = new ArrayList<String>();
persons.add("AAA");
persons.add("BBB");
persons.add("CCC");
persons.add("DDD");
}
@Benchmark
public void loop() {
for(String person : persons)
System.err.println(person);
}
@Benchmark
public void stream() {
persons.stream().forEach(System.err::println);
}
@Benchmark
public void parallelStream() {
persons.parallelStream().forEach(System.err::println);
}
}
Here we have three tests: loop, stream and parallelStream. Note that I changed System.out to System.err. That's because System.out is normally used to output the JMH results. I will redirect the output of System.err to nul, so the result should depend less on my filesystem or console subsystem (which is especially slow on Windows).
So the results are (Core i7-4702MQ CPU @ 2.2GHz, 4 cores HT, Win7, Oracle JDK 1.8.0_40):
Benchmark Mode Cnt Score Error Units
StreamTest.loop avgt 30 42.410 ± 1.833 us/op
StreamTest.parallelStream avgt 30 76.440 ± 2.073 us/op
StreamTest.stream avgt 30 42.820 ± 1.389 us/op
What we see is that stream and loop produce practically the same result: the difference is statistically insignificant. The Stream API is actually somewhat slower than the loop, but here the slowest part is the PrintStream. Even with output to nul, the IO subsystem is very slow compared to the other operations. So we measured not the speed of the Stream API or the loop, but the speed of println.
Also note that the units are microseconds, so the stream version actually runs about 1000 times faster than in your test.
Why is parallelStream much slower? Because you cannot parallelize writes to the same PrintStream: it is internally synchronized. So parallelStream did all the hard work of splitting the 4-element list into 4 sub-tasks, scheduling the jobs on different threads, and synchronizing them properly, but it's absolutely futile, as the slowest operation (println) cannot run in parallel: while one thread is working, the others are waiting. In general, it's useless to parallelize code that synchronizes on the same mutex (which is your case).
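The point about a shared mutex can be illustrated with a small sketch. It does not use PrintStream itself; instead a hypothetical class SynchronizedSink stands in for any sink whose write method synchronizes on one lock, and a counter records how many threads are ever inside the critical section at the same time.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.IntStream;

public class SynchronizedSink {
    private final Object lock = new Object();
    private final AtomicInteger inside = new AtomicInteger();
    private final AtomicInteger maxInside = new AtomicInteger();

    /** Stand-in for a synchronized println: only one thread may be inside at a time. */
    void write(int value) {
        synchronized (lock) {
            int now = inside.incrementAndGet();
            maxInside.accumulateAndGet(now, Math::max); // remember the peak occupancy
            inside.decrementAndGet();
        }
    }

    /** Feeds `n` values through a parallel stream and returns the peak number of concurrent writers. */
    static int peakConcurrentWriters(int n) {
        SynchronizedSink sink = new SynchronizedSink();
        IntStream.range(0, n).parallel().forEach(sink::write);
        return sink.maxInside.get();
    }

    public static void main(String[] args) {
        System.out.println("peak concurrent writers: " + peakConcurrentWriters(100_000));
    }
}
```

However many worker threads the parallel stream uses, the peak is always 1: the lock serializes the only work being done, so the splitting and scheduling are pure overhead.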
I have some code that profiles Runtime.freeMemory. Here is my code:
package misc;
import java.util.ArrayList;
import java.util.Random;
public class FreeMemoryTest {
private final ArrayList<Double> l;
private final Random r;
public FreeMemoryTest(){
this.r = new Random();
this.l = new ArrayList<Double>();
}
public static boolean memoryCheck() {
double freeMem = Runtime.getRuntime().freeMemory();
double totalMem = Runtime.getRuntime().totalMemory();
double fptm = totalMem * 0.05;
boolean toReturn = fptm > freeMem;
return toReturn;
}
public void freeMemWorkout(int max){
for(int i = 0; i < max; i++){
memoryCheck();
l.add(r.nextDouble());
}
}
public void workout(int max){
for(int i = 0; i < max; i++){
l.add(r.nextDouble());
}
}
public static void main(String[] args){
FreeMemoryTest f = new FreeMemoryTest();
int count = Integer.parseInt(args[1]);
long startTime = System.currentTimeMillis();
if(args[0].equals("f")){
f.freeMemWorkout(count);
} else {
f.workout(count);
}
long endTime = System.currentTimeMillis();
System.out.println(endTime - startTime);
}
}
When I run the profiler using -Xrunhprof:cpu=samples, the vast majority of the calls are to the Runtime.freeMemory(), like this:
CPU SAMPLES BEGIN (total = 531) Fri Dec 7 00:17:20 2012
rank self accum count trace method
1 83.62% 83.62% 444 300274 java.lang.Runtime.freeMemory
2 9.04% 92.66% 48 300276 java.lang.Runtime.totalMemory
When I run the profiler using -Xrunhprof:cpu=time, I don't see any of the calls to Runtime.freeMemory at all, and the top five calls are as follows:
CPU TIME (ms) BEGIN (total = 10042) Fri Dec 7 00:29:51 2012
rank self accum count trace method
1 13.39% 13.39% 200000 307547 java.util.Random.next
2 9.69% 23.08% 1 307852 misc.FreeMemoryTest.freeMemWorkout
3 7.41% 30.49% 100000 307544 misc.FreeMemoryTest.memoryCheck
4 7.39% 37.88% 100000 307548 java.util.Random.nextDouble
5 4.35% 42.23% 100000 307561 java.util.ArrayList.add
These two profiles are very different from one another. I thought that samples was supposed to at least roughly approximate the results from times, but here we see a radical difference: something that consumes more than 80% of the samples doesn't even appear in the times profile. This does not make any sense to me. Does anyone know why this is happening?
More on this:
$ java -Xmx1000m -Xms1000m -jar memtest.jar a 20000000
5524
//does not have the calls to Runtime.freeMemory()
$ java -Xmx1000m -Xms1000m -jar memtest.jar f 20000000
9442
//has the calls to Runtime.freeMemory()
Running with freeMemory requires approximately twice as much time as running without it. If 80% of the CPU time is spent in java.lang.Runtime.freeMemory(), and I remove that call, I would expect the program to speed up by a factor of approximately 5. As we can see above, the program speeds up by a factor of approximately 2.
A slowdown of a factor of 5 is way worse than a slowdown of a factor of 2 that was observed empirically, so what I do not understand is how the sampling profiler is so far off from reality.
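The arithmetic behind that expectation can be written out as a short sketch. The class name ExpectedSpeedup and both method names are hypothetical; the only inputs are the 80% figure and the two wall-clock times quoted above, and the calculation assumes the removed call accounts for the whole difference between the two runs.

```java
public class ExpectedSpeedup {
    /** Speedup expected when code taking `fraction` of the run time is removed entirely. */
    static double speedupIfRemoved(double fraction) {
        return 1.0 / (1.0 - fraction);
    }

    /** Fraction of run time attributable to the removed code, derived from wall-clock times. */
    static double fractionFromTimes(double timeWith, double timeWithout) {
        return 1.0 - timeWithout / timeWith;
    }

    public static void main(String[] args) {
        // What the sampling profile would predict if 80% of the time were in freeMemory().
        System.out.println("predicted speedup: " + speedupIfRemoved(0.80));
        // What the measured times (9442 ms with, 5524 ms without) actually imply.
        System.out.println("implied fraction: " + fractionFromTimes(9442, 5524));
    }
}
```

The measured times imply that the memory checks account for roughly 41% of the run time, about half of what the sampling profile suggests.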
Runtime.freeMemory() and Runtime.totalMemory() are native methods.
See http://www.docjar.com/html/api/java/lang/Runtime.java.html
The timing profiler cannot time them, but the sampling profiler can.