In JMH(Java Microbenchmark Harness), we can use
#BenchmarkMode(Mode.AverageTime)
#Warmup(iterations = 10)
#Measurement(iterations = 10)
to evaluate the average time of an execution after JVM warms up.
Also we can use
#BenchmarkMode(Mode.SingleShotTime)
#Measurement(iterations = 1)
to estimate the cold start time of an execution. But this only executes the benchmark once, which may introduce bias. So is there any method to evaluate the average time of the cold start in JMH?
According to Alexey himself (though from 2014):
Single-shot benchmarks were originally destined to run a single
measurement iteration over multiple forks -- the scenarios to estimate
"cold" performance. But for many cases, you might want more measurement
iterations there especially if you are running only a single fork,
because more samples would be generated.
#BenchmarkMode(Mode.AverageTime)
#OutputTimeUnit(TimeUnit.NANOSECONDS)
public class AverageSingleShot {
public static void main(String[] args) throws Exception {
Options opt = new OptionsBuilder()
.include(AverageSingleShot.class.getSimpleName())
.build();
new Runner(opt).run();
}
#Fork(100)
#Benchmark
#BenchmarkMode(Mode.SingleShotTime)
public int test() {
return ThreadLocalRandom.current().nextInt() + ThreadLocalRandom.current().nextInt();
}
}
Besides the fact that this will tell you the average (see that 100):
Benchmark Mode Cnt Score Error Units
AverageSingleShot.test ss 100 41173.540 ± 2871.546 ns/op
you will also get Percentiles and a Histogram.
Related
Details:
For a lot the programs that I develop I use this code (or some slight variant) to "tick" a method every so often, set to the varaible tps (if set to 32 it calls the method tick 32 times every second). Its very essential so I can't remove it from my code as animations and various other parts will break.
Unfortunately it seems to use a sizable amount of cpu usage for a reason I can't figure out. A while back I was thinking about using thread.sleep() to fix this issue but according to this post; it's rather innacurate which makes it unfeasible as this requires reasonably accurate timing.
It doesn't use that much cpu, around 6-11% cpu for a ryzen 1700 in my admittedly short testing, but it's still quite a lot considering how little it's doing. Is there a less cpu intensive method of completing this? Or will the timing be to innacurate for regular usage.
public class ThreadTest {
public ThreadTest() {
int tps = 32;
boolean threadShouldRun = true;
long lastTime = System.nanoTime();
double ns = 1000000000 / tps;
double delta = 0;
long now;
while (threadShouldRun) {
now = System.nanoTime();
delta += (now - lastTime) / ns;
lastTime = now;
while ((delta >= 1) && (threadShouldRun)) {
tick();
delta--;
}
}
}
public void tick() {
}
public static void main(String[] args) {
new ThreadTest();
}
}
Basic summary: The code above uses 6-11% cpu with a ryzen 1700, is there a way in java to accomplish the same code with less cpu usage and keeping reasonable timing when executing code a certain amount of times per second.
One easy alternative that shouldn't use as much CPU is to use a ScheduledExecutorService. For example:
public static void main(String[] args) {
ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
executor.scheduleAtFixedRate(() -> {
}, 0, 31250, TimeUnit.MICROSECONDS);
}
Note that 31250 represents the value of 1/32 seconds converted to microseconds, as that parameter accepts a long.
My current setup:
public void launchBenchmark() throws Exception {
Options opt = new OptionsBuilder()
.include(this.getClass().getName())
.mode(Mode.Throughput) //Calculate number of operations in a time unit.
.mode(Mode.AverageTime) //Calculate an average running time per operation
.timeUnit(TimeUnit.MILLISECONDS)
.warmupIterations(1)
.measurementIterations(30)
.threads(Runtime.getRuntime().availableProcessors())
.forks(1)
.shouldFailOnError(true)
.shouldDoGC(true)
.build();
new Runner(opt).run();
}
How can I know/control (if possible) the number of operations is performed per benchmark ?
And is it important to set warmUp time and measurementIteration time?
Thank you.
You cannot control the number of operations per iteration. The whole point of JMH is to correctly measure that number.
You can configure the warmup using the annotation:
#Warmup(iterations = 10, time = 500, timeUnit = MILLISECONDS)
And the measurement by:
#Measurement(iterations = 200, time = 200, timeUnit = MILLISECONDS)
Just set the appropriate values for your use case
import java.util.ArrayList;
import java.util.List;
public class IterationBenchmark {
public static void main(String args[]){
List<String> persons = new ArrayList<String>();
persons.add("AAA");
persons.add("BBB");
persons.add("CCC");
persons.add("DDD");
long timeMillis = System.currentTimeMillis();
for(String person : persons)
System.out.println(person);
System.out.println("Time taken for legacy for loop : "+
(System.currentTimeMillis() - timeMillis));
timeMillis = System.currentTimeMillis();
persons.stream().forEach(System.out::println);
System.out.println("Time taken for sequence stream : "+
(System.currentTimeMillis() - timeMillis));
timeMillis = System.currentTimeMillis();
persons.parallelStream().forEach(System.out::println);
System.out.println("Time taken for parallel stream : "+
(System.currentTimeMillis() - timeMillis));
}
}
Output:
AAA
BBB
CCC
DDD
Time taken for legacy for loop : 0
AAA
BBB
CCC
DDD
Time taken for sequence stream : 49
CCC
DDD
AAA
BBB
Time taken for parallel stream : 3
Why the Java 8 Stream API performance is very low compare to legacy for loop?
Very first call to the Stream API in your program is always quite slow, because you need to load many auxiliary classes, generate many anonymous classes for lambdas and JIT-compile many methods. Thus usually very first Stream operation takes several dozens of milliseconds. The consecutive calls are much faster and may fall beyond 1 us depending on the exact stream operation. If you exchange the parallel-stream test and sequential stream test, the sequential stream will be much faster. All the hard work is done by one who comes the first.
Let's write a JMH benchmark to properly warm-up your code and test all the cases independently:
import java.util.concurrent.TimeUnit;
import java.util.*;
import java.util.stream.*;
import org.openjdk.jmh.annotations.*;
#Warmup(iterations = 5, time = 1000, timeUnit = TimeUnit.MILLISECONDS)
#Measurement(iterations = 10, time = 1000, timeUnit = TimeUnit.MILLISECONDS)
#BenchmarkMode(Mode.AverageTime)
#OutputTimeUnit(TimeUnit.MICROSECONDS)
#Fork(3)
#State(Scope.Benchmark)
public class StreamTest {
List<String> persons;
#Setup
public void setup() {
persons = new ArrayList<String>();
persons.add("AAA");
persons.add("BBB");
persons.add("CCC");
persons.add("DDD");
}
#Benchmark
public void loop() {
for(String person : persons)
System.err.println(person);
}
#Benchmark
public void stream() {
persons.stream().forEach(System.err::println);
}
#Benchmark
public void parallelStream() {
persons.parallelStream().forEach(System.err::println);
}
}
Here we have three tests: loop, stream and parallelStream. Note that I changed the System.out to System.err. That's because System.out is used normally to output the JMH results. I will redirect the output of System.err to nul, so the result should less depend on my filesystem or console subsystem (which is especially slow on Windows).
So the results are (Core i7-4702MQ CPU # 2.2GHz, 4 cores HT, Win7, Oracle JDK 1.8.0_40):
Benchmark Mode Cnt Score Error Units
StreamTest.loop avgt 30 42.410 ± 1.833 us/op
StreamTest.parallelStream avgt 30 76.440 ± 2.073 us/op
StreamTest.stream avgt 30 42.820 ± 1.389 us/op
What we see is that stream and loop produce exactly the same result. The difference is statistically insignificant. Actually Stream API is somewhat slower than loop, but here the slowest part is the PrintStream. Even with output to nul the IO subsystem is very slow compared to other operations. So we just measured not the Stream API or loop speed, but println speed.
Also see, it's microseconds, thus stream version actually works 1000 times faster than in your test.
Why parallelStream is much slower? Just because you cannot parallelize the writes to the same PrintStream, because it is internally synchronized. So the parallelStream did all the hard work to splitting 4-element list to the 4 sub-tasks, schedule the jobs in the different threads, synchronize them properly, but it's absolutely futile as the slowest operation (println) cannot perform in parallel: while one of threads is working, others are waiting. In general it's useless to parallelize the code which synchronizes on the same mutex (which is your case).
I am writing a micro-benchmark to compare String concatenation using + operator vs StringBuilder. To this aim, I created a JMH benchmark class based on OpenJDK example that uses the batchSize parameter:
#State(Scope.Thread)
#BenchmarkMode(Mode.AverageTime)
#Measurement(batchSize = 10000, iterations = 10)
#Warmup(batchSize = 10000, iterations = 10)
#Fork(1)
public class StringConcatenationBenchmark {
private String string;
private StringBuilder stringBuilder;
#Setup(Level.Iteration)
public void setup() {
string = "";
stringBuilder = new StringBuilder();
}
#Benchmark
public void stringConcatenation() {
string += "some more data";
}
#Benchmark
public void stringBuilderConcatenation() {
stringBuilder.append("some more data");
}
}
When I run the benchmark I get the following error for stringBuilderConcatenation method:
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3332)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:421)
at java.lang.StringBuilder.append(StringBuilder.java:136)
at link.pellegrino.string_concatenation.StringConcatenationBenchmark.stringBuilderConcatenation(StringConcatenationBenchmark.java:29)
at link.pellegrino.string_concatenation.generated.StringConcatenationBenchmark_stringBuilderConcatenation.stringBuilderConcatenation_avgt_jmhStub(StringConcatenationBenchmark_stringBuilderConcatenation.java:165)
at link.pellegrino.string_concatenation.generated.StringConcatenationBenchmark_stringBuilderConcatenation.stringBuilderConcatenation_AverageTime(StringConcatenationBenchmark_stringBuilderConcatenation.java:130)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.openjdk.jmh.runner.BenchmarkHandler$BenchmarkTask.call(BenchmarkHandler.java:430)
at org.openjdk.jmh.runner.BenchmarkHandler$BenchmarkTask.call(BenchmarkHandler.java:412)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I was thinking that the default JVM heap size has to be increased, so I tried to allow up to 10GB using -Xmx10G value with -jvmArgs option provided by JMH. Unfortunately, I still get the error.
Consequently, I tried to reduce the value for batchSize parameter to 1 but I still get an OutOfMemoryError.
The only workaround I have found is to set the benchmark mode to Mode.SingleShotTime. Since this mode seems to consider a batch as a single shot (even if s/op is displayed in the Units column), it seems that I get the metric I want: the average time to perform the set of batch operations. However, I still don't understand why it is not working with Mode.AverageTime.
Please also note that the benchmarks for method stringConcatenation work as expected whatever the benchmark mode is used. The issue only occurs with stringBuilderConcatenation method that makes use of StringBuilder.
Any help to understand why the previous example is not working with Benchmark mode set to Mode.AverageTime is welcome.
JMH version I used is 1.10.4.
You're right that Mode.SingleShotTime is what you need: it measures the time for single batch. When using the Mode.AverageTime your iteration still works until the iteration time finishes (which is 1 second by default). It measures the time per executing the single batch (only batches which were fully finished during the execution time are counted), so the final results differ, but execution time is the same.
Another problem is that #Setup(Level.Iteration) forces setup to be executed before every iteration, but not before every batch. Thus your strings are not actually limited by the batch size. The string version does not cause the OutOfMemoryError just because it's much slower than StringBuilder, so during the 1 second it's capable to build much shorter string.
Not very beautiful way to fix your benchmark (while still using average time mode and batchSize parameter) is to reset the string/stringBuilder manually:
#State(Scope.Thread)
#BenchmarkMode(Mode.AverageTime)
#OutputTimeUnit(TimeUnit.MICROSECONDS)
#Measurement(batchSize = 10000, iterations = 10)
#Warmup(batchSize = 10000, iterations = 10)
#Fork(1)
public class StringConcatenationBenchmark {
private static final String S = "some more data";
private static final int maxLen = S.length()*10000;
private String string;
private StringBuilder stringBuilder;
#Setup(Level.Iteration)
public void setup() {
string = "";
stringBuilder = new StringBuilder();
}
#Benchmark
public void stringConcatenation() {
if(string.length() >= maxLen) string = "";
string += S;
}
#Benchmark
public void stringBuilderConcatenation() {
if(stringBuilder.length() >= maxLen) stringBuilder = new StringBuilder();
stringBuilder.append(S);
}
}
Here's results on my box (i5 3340, 4Gb RAM, 64bit Win7, JDK 1.8.0_45):
Benchmark Mode Cnt Score Error Units
stringBuilderConcatenation avgt 10 145.997 ± 2.301 us/op
stringConcatenation avgt 10 324878.341 ± 39824.738 us/op
So you can see that only about 3 batches fit the second for stringConcatenation (1e6/324878) while for stringBuilderConcatenation thousands of batches can be executed resulting in enormous string leading to OutOfMemoryError.
I don't know why adding more memory doesn't work for you, for me -Xmx4G is enough to run the stringBuilder test of your original benchmark. Probably your box is faster, so the resulting string is even longer. Note that for the very big string you can hit the array size limit (2 billion of elements) even if you have enough memory. Check the exception stacktrace after adding the memory: is it the same? If you hit the array size limit, it will still be OutOfMemoryError, but stacktrace will be different a little bit. Anyways even with enough memory the results for your benchmark will be incorrect (both for String and StringBuilder).
So I'm trying to play a bit with microbenchmarks, have chosen JMH, have read some articles. How JMH measures execution of methods below system's timer granularity?
A more detailed explanation:
These are the benchmarks I'm running (method names speak for themselves):
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;
import java.util.concurrent.TimeUnit;
#BenchmarkMode(Mode.AverageTime)
#OutputTimeUnit(TimeUnit.NANOSECONDS)
#State(Scope.Thread)
#Warmup(iterations = 10, time = 200, timeUnit = TimeUnit.NANOSECONDS)
#Measurement(iterations = 20, time = 200, timeUnit = TimeUnit.NANOSECONDS)
public class RandomBenchmark {
public long lastValue;
#Benchmark
#Fork(1)
public void blankMethod() {
}
#Benchmark
#Fork(1)
public void simpleMethod(Blackhole blackhole) {
int i = 0;
blackhole.consume(i++);
}
#Benchmark
#Fork(1)
public void granularityMethod(Blackhole blackhole) {
long initialTime = System.nanoTime();
long measuredTime;
do {
measuredTime = System.nanoTime();
} while (measuredTime == initialTime);
blackhole.consume(measuredTime);
}
}
Here are results:
# Run complete. Total time: 00:00:02
Benchmark Mode Cnt Score Error Units
RandomBenchmark.blankMethod avgt 20 0,887 ? 0,274 ns/op
RandomBenchmark.granularityMethod avgt 20 407,002 ? 26,297 ns/op
RandomBenchmark.simpleMethod avgt 20 6,979 ? 0,743 ns/op
Currently ran on Windows 7 and as it's described in various articles it has big granularity (407 ns). Checking with basic code below it's indeed new timer value comes every ~400ns:
final int sampleSize = 100;
long[] timeMarks = new long[sampleSize];
for (int i=0; i < sampleSize; i++) {
timeMarks[i] = System.nanoTime();
}
for (long timeMark : timeMarks) {
System.out.println(timeMark);
}
It's hard to fully understand how generated methods exactly work but looking through decompiled JMH generated code it seems like it's using the same System.nanoTime() before and after execution and measures the difference. How is it able to measure method execution of couple nanoseconds while granularity is 400 ns?
You are totally right. You cannot measure something that is faster than your system's timer granularity.
JMH doesn't measure each invocation of the bechmark method. It calls System.nanotime() before the start of an iteration, executes the benchmark method X times and call System.nanotime() again after the iteration. The results is then time difference / # of operations (potentially you specify on the method more than 1 operation per invocation with #OperationsPerInvocation).
Aleksey Shipilev discussed measurement problems with Nanotime in his article Nanotrusting the Nanotime. Section 'Latency' contains a code example that shows how JMH measures one benchmark iteration.