JMH - How to correctly benchmark Thread Pools? - java

Please read the newest EDIT of this question.
Issue: I need to write a correct benchmark that compares work executed on different Thread Pool implementations (including ones from external libraries), submitted via different execution methods, against the same work run without any threading.
For example, I have 24 tasks to complete and 10000 random Strings in the benchmark state:
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Fork(1)
@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 3)
@Measurement(iterations = 3)
@State(Scope.Benchmark)
public class ThreadPoolSamples {
    @Param({"24"})
    int amountOfTasks;

    private static final int tts = Runtime.getRuntime().availableProcessors() * 2;
    private String[] strs = new String[10000];

    @Setup
    public void setup() {
        for (int i = 0; i < strs.length; i++) {
            strs[i] = String.valueOf(Math.random());
        }
    }
}
And two @State inner classes representing the work (String concatenation) and the ExecutorService setup and shutdown:
@State(Scope.Thread)
public static class Work {
    public String doWork(String[] strs) {
        StringBuilder conc = new StringBuilder();
        for (String str : strs) {
            conc.append(str);
        }
        return conc.toString();
    }
}

@State(Scope.Benchmark)
public static class ExecutorServiceState {
    ExecutorService service;

    @Setup(Level.Iteration)
    public void setupMethod() {
        service = Executors.newFixedThreadPool(tts);
    }

    @TearDown(Level.Iteration)
    public void downMethod() {
        service.shutdownNow();
        service = null;
    }
}
A more precise question is: how do I write a correct benchmark to measure the average time of doWork(): first without any threading, second using the .execute() method, and third using the .submit() method and collecting the results of the futures later?
The implementation I tried to write:
@Benchmark
public void noThreading(Work w, Blackhole bh) {
    for (int i = 0; i < amountOfTasks; i++) {
        bh.consume(w.doWork(strs));
    }
}

@Benchmark
public void executorService(ExecutorServiceState e, Work w, Blackhole bh) {
    for (int i = 0; i < amountOfTasks; i++) {
        e.service.execute(() -> bh.consume(w.doWork(strs)));
    }
}

@Benchmark
public void noThreadingResult(Work w, Blackhole bh) {
    String[] strss = new String[amountOfTasks];
    for (int i = 0; i < amountOfTasks; i++) {
        strss[i] = w.doWork(strs);
    }
    bh.consume(strss);
}

@Benchmark
public void executorServiceResult(ExecutorServiceState e, Work w, Blackhole bh) throws ExecutionException, InterruptedException {
    Future[] strss = new Future[amountOfTasks];
    for (int i = 0; i < amountOfTasks; i++) {
        strss[i] = e.service.submit(() -> w.doWork(strs));
    }
    for (Future future : strss) {
        bh.consume(future.get());
    }
}
After benchmarking this implementation on my PC (2 Cores, 4 threads) I got:
Benchmark (amountOfTasks) Mode Cnt Score Error Units
ThreadPoolSamples.executorService 24 avgt 3 255102,966 ± 4460279,056 ns/op
ThreadPoolSamples.executorServiceResult 24 avgt 3 19790020,180 ± 7676762,394 ns/op
ThreadPoolSamples.noThreading 24 avgt 3 18881360,497 ± 340778,773 ns/op
ThreadPoolSamples.noThreadingResult 24 avgt 3 19283976,445 ± 471788,642 ns/op
noThreading and executorService may be correct (but I am still unsure), while noThreadingResult and executorServiceResult don't look correct at all.
EDIT:
I found out some new details, but I think the result is still incorrect. As user17280749 answered in this answer, the thread pool wasn't waiting for the submitted tasks to complete, but that wasn't the only issue: the JVM also somehow optimises the doWork() method in the Work class (probably the result of that operation was predictable), so for simplicity I used Thread.sleep() as the "work". I also added two new values, "1" and "128", to the amountOfTasks param to demonstrate that with 1 task threading will be slower than noThreading, while 24 and 128 tasks will be approximately four times faster than noThreading. For the correctness of the measurement I also set the thread pools to start up and shut down inside the benchmark:
package io.denery;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

import java.util.concurrent.*;

@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Fork(1)
@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 3)
@Measurement(iterations = 3)
@State(Scope.Benchmark)
public class ThreadPoolSamples {
    @Param({"1", "24", "128"})
    int amountOfTasks;

    private static final int tts = Runtime.getRuntime().availableProcessors() * 2;

    @State(Scope.Thread)
    public static class Work {
        public void doWork() {
            try {
                Thread.sleep(1);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
    }

    @Benchmark
    public void noThreading(Work w) {
        for (int i = 0; i < amountOfTasks; i++) {
            w.doWork();
        }
    }

    @Benchmark
    public void fixedThreadPool(Work w)
            throws ExecutionException, InterruptedException {
        ExecutorService service = Executors.newFixedThreadPool(tts);
        Future[] futures = new Future[amountOfTasks];
        for (int i = 0; i < amountOfTasks; i++) {
            futures[i] = service.submit(w::doWork);
        }
        for (Future future : futures) {
            future.get();
        }
        service.shutdown();
    }

    @Benchmark
    public void cachedThreadPool(Work w)
            throws ExecutionException, InterruptedException {
        ExecutorService service = Executors.newCachedThreadPool();
        Future[] futures = new Future[amountOfTasks];
        for (int i = 0; i < amountOfTasks; i++) {
            futures[i] = service.submit(() -> w.doWork());
        }
        for (Future future : futures) {
            future.get();
        }
        service.shutdown();
    }
}
And the result of this benchmark is:
Benchmark (amountOfTasks) Mode Cnt Score Error Units
ThreadPoolSamples.cachedThreadPool 1 avgt 3 1169075,866 ± 47607,783 ns/op
ThreadPoolSamples.cachedThreadPool 24 avgt 3 5208437,498 ± 4516260,543 ns/op
ThreadPoolSamples.cachedThreadPool 128 avgt 3 13112351,066 ± 1905089,389 ns/op
ThreadPoolSamples.fixedThreadPool 1 avgt 3 1166087,665 ± 61193,085 ns/op
ThreadPoolSamples.fixedThreadPool 24 avgt 3 4721503,799 ± 313206,519 ns/op
ThreadPoolSamples.fixedThreadPool 128 avgt 3 18337097,997 ± 5781847,191 ns/op
ThreadPoolSamples.noThreading 1 avgt 3 1066035,522 ± 83736,346 ns/op
ThreadPoolSamples.noThreading 24 avgt 3 25525744,055 ± 45422,015 ns/op
ThreadPoolSamples.noThreading 128 avgt 3 136126357,514 ± 200461,808 ns/op
We see that the error isn't really huge, and the thread pools with 1 task are slower than noThreading. But if you compare 25525744,055 and 4721503,799 the speedup is 5.406, which is somehow faster than the expected ~4, and if you compare 136126357,514 and 18337097,997 the speedup is 7.4. This fake speedup grows with amountOfTasks, so I think the result is still incorrect. I plan to look at this with PrintAssembly to find out whether there are any JVM optimisations.
EDIT:
As user17294549 mentioned in this answer, I used Thread.sleep() as an imitation of real work, and that isn't correct because:
for real work: only 2 tasks can run simultaneously on a 2-core system
for Thread.sleep(): any number of tasks can run simultaneously on a 2-core system
I remembered the JMH method Blackhole.consumeCPU(long tokens), which "burns cycles" to imitate work; there is a JMH example and documentation for it.
So I changed work to:
#State(Scope.Thread)
public static class Work {
public void doWork() {
Blackhole.consumeCPU(4096);
}
}
And the benchmark results after this change:
Benchmark (amountOfTasks) Mode Cnt Score Error Units
ThreadPoolSamples.cachedThreadPool 1 avgt 3 301187,897 ± 95819,153 ns/op
ThreadPoolSamples.cachedThreadPool 24 avgt 3 2421815,991 ± 545978,808 ns/op
ThreadPoolSamples.cachedThreadPool 128 avgt 3 6648647,025 ± 30442,510 ns/op
ThreadPoolSamples.cachedThreadPool 2048 avgt 3 60229404,756 ± 21537786,512 ns/op
ThreadPoolSamples.fixedThreadPool 1 avgt 3 293364,540 ± 10709,841 ns/op
ThreadPoolSamples.fixedThreadPool 24 avgt 3 1459852,773 ± 160912,520 ns/op
ThreadPoolSamples.fixedThreadPool 128 avgt 3 2846790,222 ± 78929,182 ns/op
ThreadPoolSamples.fixedThreadPool 2048 avgt 3 25102603,592 ± 1825740,124 ns/op
ThreadPoolSamples.noThreading 1 avgt 3 10071,049 ± 407,519 ns/op
ThreadPoolSamples.noThreading 24 avgt 3 241561,416 ± 15326,274 ns/op
ThreadPoolSamples.noThreading 128 avgt 3 1300241,347 ± 148051,168 ns/op
ThreadPoolSamples.noThreading 2048 avgt 3 20683253,408 ± 1433365,542 ns/op
We see that fixedThreadPool is somehow slower than the example without threading, and the bigger amountOfTasks is, the smaller the difference between the fixedThreadPool and noThreading examples becomes. What's happening there? I saw the same phenomenon with String concatenation at the beginning of this question, but I didn't report it. (By the way, thanks to everyone who read this novel and is trying to answer this question; you're really helping me.)

See the answers to this question to learn how to write benchmarks in Java.
... executorService maybe correct (but i am still unsure) ...
Benchmark (amountOfTasks) Mode Cnt Score Error Units
ThreadPoolSamples.executorService 24 avgt 3 255102,966 ± 4460279,056 ns/op
It doesn't look like a correct result:
the error 4460279,056 is 17 times greater than the base value 255102,966.
Also you have an error in:
@Benchmark
public void executorService(ExecutorServiceState e, Work w, Blackhole bh) {
    for (int i = 0; i < amountOfTasks; i++) {
        e.service.execute(() -> bh.consume(w.doWork(strs)));
    }
}
You submit the tasks to the ExecutorService, but don't wait for them to complete.
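One minimal way to keep execute() but still wait for completion is a CountDownLatch; here is a hedged sketch (the benchmark method name is illustrative, the state classes are the ones from the question):
@Benchmark
public void executorServiceAwait(ExecutorServiceState e, Work w, Blackhole bh) throws InterruptedException {
    CountDownLatch done = new CountDownLatch(amountOfTasks);
    for (int i = 0; i < amountOfTasks; i++) {
        e.service.execute(() -> {
            bh.consume(w.doWork(strs));
            done.countDown();
        });
    }
    done.await(); // the benchmark method returns only after every task has finished
}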

Look at this code:
@TearDown(Level.Iteration)
public void downMethod() {
    service.shutdownNow();
    service = null;
}
You don't wait for the threads to stop. Read the docs for details.
So some of your benchmarks might run in parallel with another 128 threads spawned by cachedThreadPool in a previous benchmark.
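A teardown that actually waits could look roughly like this (the timeout value is arbitrary):
@TearDown(Level.Iteration)
public void downMethod() throws InterruptedException {
    service.shutdownNow();
    // block until the worker threads have really terminated, so the next iteration starts clean
    if (!service.awaitTermination(1, TimeUnit.MINUTES)) {
        throw new IllegalStateException("thread pool did not terminate");
    }
    service = null;
}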
so for simplicity I used Thread.sleep() as "work"
Are you sure?
There is a big difference between real work and Thread.sleep():
for real work: only 2 tasks can run simultaneously on a 2-core system
for Thread.sleep(): any number of tasks can run simultaneously on a 2-core system
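A small standalone sketch (not a JMH benchmark, just rough wall-clock timing) that makes this difference visible on a machine with few cores: the sleeping tasks all finish in roughly one sleep period, while the CPU-bound tasks take total work divided by core count.
import java.util.concurrent.*;
import org.openjdk.jmh.infra.Blackhole;

public class SleepVsCpuDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newCachedThreadPool();
        // 64 sleeping tasks: they all "run" at once, total time stays near 100 ms
        time(pool, 64, () -> sleep(100));
        // 64 CPU-bound tasks: only as many as you have cores really run at once
        time(pool, 64, () -> Blackhole.consumeCPU(10_000_000));
        pool.shutdown();
    }

    static void time(ExecutorService pool, int n, Runnable task) throws Exception {
        long start = System.nanoTime();
        Future<?>[] fs = new Future<?>[n];
        for (int i = 0; i < n; i++) fs[i] = pool.submit(task);
        for (Future<?> f : fs) f.get();
        System.out.println((System.nanoTime() - start) / 1_000_000 + " ms");
    }

    static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}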

Here is what I got on my machine (maybe this will help you understand what the problem is):
This is the benchmark (I modified it a little bit):
package io.denery;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;
import org.openjdk.jmh.Main;

import java.util.concurrent.*;

@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Fork(1)
@Threads(1)
@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 5)
@Measurement(iterations = 5)
@State(Scope.Benchmark)
public class ThreadPoolSamples {
    @Param({"1", "24", "128"})
    int amountOfTasks;

    private static final int tts = Runtime.getRuntime().availableProcessors() * 2;

    private static void doWork() {
        Blackhole.consumeCPU(4096);
    }

    public static void main(String[] args) throws Exception {
        Main.main(args);
    }

    @Benchmark
    public void noThreading() {
        for (int i = 0; i < amountOfTasks; i++) {
            doWork();
        }
    }

    @Benchmark
    public void fixedThreadPool(Blackhole bh) throws Exception {
        runInThreadPool(amountOfTasks, bh, Executors.newFixedThreadPool(tts));
    }

    @Benchmark
    public void cachedThreadPool(Blackhole bh) throws Exception {
        runInThreadPool(amountOfTasks, bh, Executors.newCachedThreadPool());
    }

    private static void runInThreadPool(int amountOfTasks, Blackhole bh, ExecutorService threadPool)
            throws Exception {
        Future<?>[] futures = new Future[amountOfTasks];
        for (int i = 0; i < amountOfTasks; i++) {
            futures[i] = threadPool.submit(ThreadPoolSamples::doWork);
        }
        for (Future<?> future : futures) {
            bh.consume(future.get());
        }
        threadPool.shutdownNow();
        threadPool.awaitTermination(5, TimeUnit.MINUTES);
    }
}
Specs and versions:
JMH version: 1.33
VM version: JDK 17.0.1, OpenJDK 64-Bit Server
Linux 5.14.14
CPU: Intel(R) Core(TM) i5-2320 CPU @ 3.00GHz, 4 Cores, No Hyper-Threading
Results:
Benchmark (amountOfTasks) Mode Cnt Score Error Units
ThreadPoolSamples.cachedThreadPool 1 avgt 5 92968.252 ± 2853.687 ns/op
ThreadPoolSamples.cachedThreadPool 24 avgt 5 547558.977 ± 88937.441 ns/op
ThreadPoolSamples.cachedThreadPool 128 avgt 5 1502909.128 ± 40698.141 ns/op
ThreadPoolSamples.fixedThreadPool 1 avgt 5 97945.026 ± 435.458 ns/op
ThreadPoolSamples.fixedThreadPool 24 avgt 5 643453.028 ± 135859.966 ns/op
ThreadPoolSamples.fixedThreadPool 128 avgt 5 998425.118 ± 126463.792 ns/op
ThreadPoolSamples.noThreading 1 avgt 5 10165.462 ± 78.008 ns/op
ThreadPoolSamples.noThreading 24 avgt 5 245942.867 ± 10594.808 ns/op
ThreadPoolSamples.noThreading 128 avgt 5 1302173.090 ± 5482.655 ns/op

I solved this issue myself with the help of the other answerers. In the last edit (and in all other edits) the problem was in my Gradle configuration: I was running the benchmark on all of my system's threads. I use this Gradle plugin to run JMH, and before running my benchmarks I had set threads = 4 in my Gradle buildscript, so you saw these strange results because JMH ran the benchmark on all available threads while the thread pool was also doing its work on all available threads. I removed this configuration, set the @State(Scope.Thread) and @Threads(1) annotations on the benchmark class, and slightly edited the runInThreadPool() method to:
public static void runInThreadPool(int amountOfTasks, Blackhole bh, ExecutorService threadPool)
        throws InterruptedException, ExecutionException {
    Future<?>[] futures = new Future[amountOfTasks];
    for (int i = 0; i < amountOfTasks; i++) {
        futures[i] = threadPool.submit(PrioritySchedulerSamples::doWork, (ThreadFactory) runnable -> {
            Thread thread = new Thread(runnable);
            thread.setPriority(10);
            return thread;
        });
    }
    for (Future<?> future : futures) {
        bh.consume(future.get());
    }
    threadPool.shutdownNow();
    threadPool.awaitTermination(10, TimeUnit.SECONDS);
}
So each thread in this thread pool runs at maximal priority.
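Note that for the workers themselves to pick up this priority, a ThreadFactory is normally handed to the pool when it is created rather than to submit(); a minimal sketch of that wiring (assuming that is the intent):
ThreadFactory maxPriorityFactory = runnable -> {
    Thread thread = new Thread(runnable);
    thread.setPriority(Thread.MAX_PRIORITY); // 10
    return thread;
};
ExecutorService threadPool = Executors.newFixedThreadPool(tts, maxPriorityFactory);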
And benchmark of all these changes:
Benchmark (amountOfTasks) Mode Cnt Score Error Units
PrioritySchedulerSamples.fixedThreadPool 2048 avgt 3 8021054,516 ± 2874987,327 ns/op
PrioritySchedulerSamples.noThreading 2048 avgt 3 17583295,617 ± 5499026,016 ns/op
These results seem to be correct (especially for my system).
I also made a list of common problems in microbenchmarking thread pools and basically all concurrent Java components:
Make sure your microbenchmark executes in one thread: use the @Threads(1) and @State(Scope.Thread) annotations. (Use, for example, the htop command to find out how many and which threads are consuming the most CPU.)
Make sure the task is executed completely in your microbenchmark, and wait for all threads to complete it. (Maybe your microbenchmark doesn't wait for tasks to complete?)
Don't use Thread.sleep() to imitate real work; instead, JMH provides the Blackhole.consumeCPU(long tokens) method, which you can freely use as an imitation of work.
Make sure you know the component that you benchmark. (Obvious, but before this post I didn't know Java thread pools very well.)
Make sure you know the compiler optimization effects described in the JMH samples; basically, know JMH very well. (A minimal skeleton combining these points is sketched after this list.)
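A minimal skeleton that puts these points together (names and values are illustrative, not a definitive implementation):
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

import java.util.concurrent.*;

@State(Scope.Thread)
@Threads(1)
@Fork(1)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class SingleThreadedPoolBenchmark {
    ExecutorService pool;

    @Setup(Level.Iteration)
    public void up() {
        pool = Executors.newFixedThreadPool(4);
    }

    @TearDown(Level.Iteration)
    public void down() throws InterruptedException {
        pool.shutdownNow();
        pool.awaitTermination(10, TimeUnit.SECONDS); // wait for workers to really stop
    }

    @Benchmark
    public void poolBenchmark(Blackhole bh) throws Exception {
        Future<?>[] fs = new Future<?>[24];
        for (int i = 0; i < fs.length; i++) {
            fs[i] = pool.submit(() -> Blackhole.consumeCPU(4096)); // imitate work, not Thread.sleep()
        }
        for (Future<?> f : fs) bh.consume(f.get()); // wait for every task to finish
    }
}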

Related

Does the order of if else matter for performance? e.g. put the most likely condition in the front is better

I'm trying to measure if the order of if else affects performance.
For example, if
if (condition == more likely condition) {}
else /** condition == rare condition **/ {}
is faster than
if (condition == rare condition) {}
else /** condition == more likely condition **/ {}
I think maybe JIT should be able to optimise it no matter which order I put it? Couldn't find any documentation on this though.
I tried to test it out myself with the following benchmark. Based on it, I don't see strong evidence that the order matters, because if it did, I think the throughput for bias=0.9 (where the probability that if (zeroOrOne == 1) is true is 0.9) should be higher than for bias=0.1 (where the else branch has probability 0.9).
public class BranchBench {
    @Param({ "0.02", "0.1", "0.9", "0.98", })
    private double bias;

    @Param("10000")
    private int count;

    private final List<Byte> randomZeroOnes = new ArrayList<>(count);

    @Setup
    public void setup() {
        Random r = new Random(12345);
        for (int c = 0; c < count; c++) {
            byte zeroOrOne = (byte) (c < (bias * count) ? 1 : 0);
            randomZeroOnes.add(zeroOrOne);
        }
        Collections.shuffle(randomZeroOnes, r);
    }

    @Benchmark
    public int static_ID_ifElse() {
        int i = 0;
        for (final Byte zeroOrOne : randomZeroOnes) {
            if (zeroOrOne == 1) {
                i++;
            } else {
                i--;
            }
        }
        return i;
    }
}
Benchmark (bias) (count) Mode Cnt Score Error Units
BranchBench.static_ID_ifElse 0.02 10000 thrpt 15 137.409 ± 1.376 ops/ms
BranchBench.static_ID_ifElse 0.1 10000 thrpt 15 129.277 ± 1.552 ops/ms
BranchBench.static_ID_ifElse 0.9 10000 thrpt 15 125.640 ± 5.858 ops/ms
BranchBench.static_ID_ifElse 0.98 10000 thrpt 15 137.427 ± 2.396 ops/ms
On modern processors I don't think the order of your conditionals really matters that much anymore. As part of the instruction pipeline, processors do what is called branch prediction, where they guess which condition will be true and pre-load the instructions into the pipeline.
These days, processors guess correctly >90% of the time, so any hand-written conditional tweaking is less important.
There is quite a lot of literature on branch prediction:
https://dzone.com/articles/branch-prediction-in-java
https://www.baeldung.com/java-branch-prediction

Trying to benchmark lambda performance

I've read this post: Performance difference between Java 8 lambdas and anonymous inner classes, and the article provided there,
which said:
Lambda invocation behaves exactly as anonymous class invocation
"Ok" I said and decided to write my own benchmark, I've used jmh, here it is below (I've also added benchmark for method reference).
public class MyBenchmark {
    public static final int TESTS_COUNT = 100_000_000;

    @Benchmark
    public void testMethod_lambda() {
        X x = i -> test(i);
        for (long i = 0; i < TESTS_COUNT; i++) {
            x.x(i);
        }
    }

    @Benchmark
    public void testMethod_methodRefernce() {
        X x = this::test;
        for (long i = 0; i < TESTS_COUNT; i++) {
            x.x(i);
        }
    }

    @Benchmark
    public void testMethod_anonymous() {
        X x = new X() {
            @Override
            public void x(Long i) {
                test(i);
            }
        };
        for (long i = 0; i < TESTS_COUNT; i++) {
            x.x(i);
        }
    }

    interface X {
        void x(Long i);
    }

    public void test(Long i) {
        if (i == null) System.out.println("never");
    }
}
And the results (on Intel Core i7 4770k) are:
Benchmark Mode Samples Score Score error Units
t.j.MyBenchmark.testMethod_anonymous thrpt 200 16,160 0,044 ops/s
t.j.MyBenchmark.testMethod_lambda thrpt 200 4,102 0,029 ops/s
t.j.MyBenchmark.testMethod_methodRefernce thrpt 200 4,149 0,022 ops/s
So, as you can see, there is a 4x difference between lambda and anonymous class invocation, where the lambda is 4x slower.
The question is: what am I doing wrong, or do I misunderstand the performance theory of lambdas?
EDIT:
# VM invoker: C:\Program Files\Java\jre1.8.0_31\bin\java.exe
# VM options: <none>
# Warmup: 20 iterations, 1 s each
# Measurement: 20 iterations, 1 s each
The problem is in your benchmark: you are the victim of dead code elimination.
The JIT compiler is sometimes smart enough to understand that the result of automatic boxing is never null, so for the anonymous class it simply removed your check, which in turn made the loop body almost empty. Replace it with something less obvious (for the JIT), like this:
public void test(Long i) {
if (i == Long.MAX_VALUE) System.out.println("never");
}
And you will observe the same performance (the anonymous class becomes slower, while lambda and method reference perform at the same level).
For the lambda/method reference it did not make the same optimization for some reason. But you should not worry: it's unlikely that you will have such a method in real code that can be optimized out completely.
In general @apangin is right: use Blackhole instead.
In addition to the issues raised by @TagirValeev, the benchmark approach you are taking is fundamentally flawed, because you are measuring a composite metric (despite your attempts not to.)
The significant costs you want to measure independently are linkage, capture, and invocation. But all your tests smear together some amount of each, poisoning your results. My advice would be to focus only on invocation cost -- this is the most relevant to overall application throughput, and also the easiest to measure (because it is less influenced by caching at multiple levels.)
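One way to focus on invocation cost alone is to create each instance once in benchmark state and only invoke it in the measured method; a rough sketch of that idea (illustrative names, not the original poster's code):
import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
public class InvocationOnlyBenchmark {
    interface Op { long apply(long i); }

    Op lambda;
    Op anonymous;

    @Setup
    public void setup() {
        // linkage and capture happen here, outside the measured methods
        lambda = i -> i + 1;
        anonymous = new Op() {
            @Override public long apply(long i) { return i + 1; }
        };
    }

    @Benchmark
    public long invokeLambda() { return lambda.apply(42); } // only the interface call is measured

    @Benchmark
    public long invokeAnonymous() { return anonymous.apply(42); }
}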
Bottom line: measuring performance in dynamically compiled environments is really, really hard. Even with JMH.
My question is another example of how you shouldn't do benchmarking. I've recreated my test according to the advice in the other answers here.
I hope it is now closer to correct, because it shows that there isn't any significant difference between lambda and anonymous class method invocation performance. See it below:
@State(Scope.Benchmark)
public class MyBenchmark {
    @Param({"1", "100000", "500000"})
    public int arg;

    @Benchmark
    public void testMethod_lambda(Blackhole bh) {
        X x = (i, bh2) -> test(i, bh2);
        x.x(arg, bh);
    }

    @Benchmark
    public void testMethod_methodRefernce(Blackhole bh) {
        X x = this::test;
        x.x(arg, bh);
    }

    @Benchmark
    public void testMethod_anonymous(Blackhole bh) {
        X x = new X() {
            @Override
            public void x(Integer i, Blackhole bh) {
                test(i, bh);
            }
        };
        x.x(arg, bh);
    }

    interface X {
        void x(Integer i, Blackhole bh);
    }

    public void test(Integer i, Blackhole bh) {
        bh.consume(i);
    }
}
Benchmark (arg) Mode Samples Score Score error Units
t.j.MyBenchmark.testMethod_anonymous 1 thrpt 200 415893575,928 1353627,574 ops/s
t.j.MyBenchmark.testMethod_anonymous 100000 thrpt 200 394989882,972 1429490,555 ops/s
t.j.MyBenchmark.testMethod_anonymous 500000 thrpt 200 395707755,557 1325623,340 ops/s
t.j.MyBenchmark.testMethod_lambda 1 thrpt 200 418597958,944 1098137,844 ops/s
t.j.MyBenchmark.testMethod_lambda 100000 thrpt 200 394672254,859 1593253,378 ops/s
t.j.MyBenchmark.testMethod_lambda 500000 thrpt 200 394407399,819 1373366,572 ops/s
t.j.MyBenchmark.testMethod_methodRefernce 1 thrpt 200 417249323,668 1140804,969 ops/s
t.j.MyBenchmark.testMethod_methodRefernce 100000 thrpt 200 396783159,253 1458935,363 ops/s
t.j.MyBenchmark.testMethod_methodRefernce 500000 thrpt 200 395098696,491 1682126,737 ops/s

Performance benefits of a static empty array instance

It seems common practice to extract a constant empty array return value into a static constant. Like here:
public class NoopParser implements Parser {
private static final String[] EMPTY_ARRAY = new String[0];
#Override public String[] supportedSchemas() {
return EMPTY_ARRAY;
}
// ...
}
Presumably this is done for performance reasons, since returning new String[0] directly would create a new array object every time the method is called – but would it really?
I've been wondering if there really is a measurable performance benefit in doing this or if it's just outdated folk wisdom. An empty array is immutable. Is the VM not able to roll all empty String arrays into one? Can the VM not make new String[0] basically free of cost?
Contrast this practice with returning an empty String: we're usually perfectly happy to write return "";, not return EMPTY_STRING;.
I benchmarked it using JMH:
private static final String[] EMPTY_STRING_ARRAY = new String[0];

@Benchmark
public void testStatic(Blackhole blackhole) {
    blackhole.consume(EMPTY_STRING_ARRAY);
}

@Benchmark
@Fork(jvmArgs = "-XX:-EliminateAllocations")
public void testStaticEliminate(Blackhole blackhole) {
    blackhole.consume(EMPTY_STRING_ARRAY);
}

@Benchmark
public void testNew(Blackhole blackhole) {
    blackhole.consume(new String[0]);
}

@Benchmark
@Fork(jvmArgs = "-XX:-EliminateAllocations")
public void testNewEliminate(Blackhole blackhole) {
    blackhole.consume(new String[0]);
}

@Benchmark
public void noop(Blackhole blackhole) {
}
Full source code.
Environment (seen after java -jar target/benchmarks.jar -f 1):
# JMH 1.11.2 (released 51 days ago)
# VM version: JDK 1.7.0_75, VM 24.75-b04
# VM invoker: /usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java
# VM options: <none>
# Warmup: 20 iterations, 1 s each
# Measurement: 20 iterations, 1 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
EliminateAllocations was on by default (seen after java -XX:+PrintFlagsFinal -version | grep EliminateAllocations).
Results:
Benchmark Mode Cnt Score Error Units
MyBenchmark.testNewEliminate thrpt 20 95912464.879 ± 3260948.335 ops/s
MyBenchmark.testNew thrpt 20 103980230.952 ± 3772243.160 ops/s
MyBenchmark.testStaticEliminate thrpt 20 206849985.523 ± 4920788.341 ops/s
MyBenchmark.testStatic thrpt 20 219735906.550 ± 6162025.973 ops/s
MyBenchmark.noop thrpt 20 1126421653.717 ± 8938999.666 ops/s
Using a constant was almost two times faster.
Turning off EliminateAllocations slowed things down a tiny bit.
I'm most interested in the actual performance difference between these two idioms in practical, real-world situations. I have no experience in micro-benchmarking (and it is probably not the right tool for such a question) but I gave it a try anyway.
This benchmark models a somewhat more typical, "realistic" setting. The returned array is just looked at and then discarded. No references hanging around, no requirement for reference equality.
One interface, two implementations:
public interface Parser {
    String[] supportedSchemas();
    void parse(String s);
}

public class NoopParserStaticArray implements Parser {
    private static final String[] EMPTY_STRING_ARRAY = new String[0];

    @Override public String[] supportedSchemas() {
        return EMPTY_STRING_ARRAY;
    }

    @Override public void parse(String s) {
        s.codePoints().count();
    }
}

public class NoopParserNewArray implements Parser {
    @Override public String[] supportedSchemas() {
        return new String[0];
    }

    @Override public void parse(String s) {
        s.codePoints().count();
    }
}
And the JMH benchmark:
import org.openjdk.jmh.annotations.Benchmark;

public class EmptyArrayBenchmark {
    private static final Parser NOOP_PARSER_STATIC_ARRAY = new NoopParserStaticArray();
    private static final Parser NOOP_PARSER_NEW_ARRAY = new NoopParserNewArray();

    @Benchmark
    public void staticEmptyArray() {
        Parser parser = NOOP_PARSER_STATIC_ARRAY;
        for (String schema : parser.supportedSchemas()) {
            parser.parse(schema);
        }
    }

    @Benchmark
    public void newEmptyArray() {
        Parser parser = NOOP_PARSER_NEW_ARRAY;
        for (String schema : parser.supportedSchemas()) {
            parser.parse(schema);
        }
    }
}
The result on my machine, Java 1.8.0_51 (HotSpot VM):
Benchmark Mode Cnt Score Error Units
EmptyArrayBenchmark.staticEmptyArray thrpt 60 3024653836.077 ± 37006870.221 ops/s
EmptyArrayBenchmark.newEmptyArray thrpt 60 3018798922.045 ± 33953991.627 ops/s
EmptyArrayBenchmark.noop thrpt 60 3046726348.436 ± 5802337.322 ops/s
There is no significant difference between the two approaches in this case. In fact, they are indistinguishable from the no-op case: apparently the JIT compiler recognises that the returned array is always empty and optimises the loop away entirely!
Piping parser.supportedSchemas() into the black hole instead of looping over it, gives the static array instance approach a ~30% advantage. But they're definitely of the same magnitude:
Benchmark Mode Cnt Score Error Units
EmptyArrayBenchmark.staticEmptyArray thrpt 60 338971639.355 ± 738069.217 ops/s
EmptyArrayBenchmark.newEmptyArray thrpt 60 266936194.767 ± 411298.714 ops/s
EmptyArrayBenchmark.noop thrpt 60 3055609298.602 ± 5694730.452 ops/s
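The black-hole variant mentioned above presumably looked something like this (a sketch added to the benchmark class above, not the original source):
@Benchmark
public void staticEmptyArrayBlackhole(Blackhole bh) {
    bh.consume(NOOP_PARSER_STATIC_ARRAY.supportedSchemas());
}

@Benchmark
public void newEmptyArrayBlackhole(Blackhole bh) {
    bh.consume(NOOP_PARSER_NEW_ARRAY.supportedSchemas());
}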
Perhaps in the end the answer is the usual "it depends". I have a hunch that in many practical scenarios, the performance benefit in factoring out the array creation is not significant.
I think it is fair to say that
if the method contract gives you the freedom to return a new empty array instance every time, and
unless you need to guard against problematic or pathological usage patterns and/or aim for theoretical max performance,
then returning new String[0] directly is fine.
Personally, I like the expressiveness and concision of return new String[0]; and not having to introduce an extra static field.
By some strange coincidence, a month after I wrote this a real performance engineer investigated the problem: see this section in Alexey Shipilёv's blog post 'Arrays of Wisdom of the Ancients':
As expected, the only effect whatsoever can be observed on a very small collection sizes, and this is only a marginal improvement over new Foo[0]. This improvement does not seem to justify caching the array in the grand scheme of things. As a teeny tiny micro-optimization, it might make sense in some tight code, but I wouldn’t care otherwise.
That settles it. I'll take the tick mark and dedicate it to Alexey.
Is the VM not able to roll all empty String arrays into one?
It can't do that, because distinct empty arrays need to compare unequal with ==. Only the programmer can make this optimization.
Contrast this practice with returning an empty String: we're usually perfectly happy writing return "";.
With strings, there is no requirement that distinct string literals produce distinct strings. In every case I know of, two instances of "" will produce the same string object, but maybe there's some weird case with classloaders where that won't happen.
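A tiny illustration of both points (array identity is required by the language; identical string literals are interned to the same instance):
String[] a = new String[0];
String[] b = new String[0];
System.out.println(a == b);   // false: each new String[0] must be a distinct object
System.out.println("" == ""); // true: identical string literals resolve to the same interned String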
I will go out on a limb and say that the performance benefit, even though using the constant is much faster, is not actually relevant, because the software will likely spend much more time doing other things than returning empty arrays. If the total run-time is hours, a few extra seconds spent creating arrays does not mean much. By the same logic, memory consumption is not relevant either.
The only reason I can think of for doing this is readability.

How JMH measures execution time below granularity value?

So I'm trying to play a bit with microbenchmarks; I've chosen JMH and read some articles. How does JMH measure the execution of methods below the system's timer granularity?
A more detailed explanation:
These are the benchmarks I'm running (method names speak for themselves):
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

import java.util.concurrent.TimeUnit;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
@Warmup(iterations = 10, time = 200, timeUnit = TimeUnit.NANOSECONDS)
@Measurement(iterations = 20, time = 200, timeUnit = TimeUnit.NANOSECONDS)
public class RandomBenchmark {
    public long lastValue;

    @Benchmark
    @Fork(1)
    public void blankMethod() {
    }

    @Benchmark
    @Fork(1)
    public void simpleMethod(Blackhole blackhole) {
        int i = 0;
        blackhole.consume(i++);
    }

    @Benchmark
    @Fork(1)
    public void granularityMethod(Blackhole blackhole) {
        long initialTime = System.nanoTime();
        long measuredTime;
        do {
            measuredTime = System.nanoTime();
        } while (measuredTime == initialTime);
        blackhole.consume(measuredTime);
    }
}
Here are results:
# Run complete. Total time: 00:00:02
Benchmark Mode Cnt Score Error Units
RandomBenchmark.blankMethod avgt 20 0,887 ± 0,274 ns/op
RandomBenchmark.granularityMethod avgt 20 407,002 ± 26,297 ns/op
RandomBenchmark.simpleMethod avgt 20 6,979 ± 0,743 ns/op
I'm currently running on Windows 7, which, as described in various articles, has a large timer granularity (407 ns). Checking with the basic code below, a new timer value indeed only arrives every ~400 ns:
final int sampleSize = 100;
long[] timeMarks = new long[sampleSize];
for (int i = 0; i < sampleSize; i++) {
    timeMarks[i] = System.nanoTime();
}
for (long timeMark : timeMarks) {
    System.out.println(timeMark);
}
It's hard to fully understand how exactly the generated methods work, but looking through the decompiled JMH-generated code it seems to call System.nanoTime() before and after execution and measure the difference. How is it able to measure method executions of a couple of nanoseconds while the granularity is 400 ns?
You are totally right. You cannot measure something that is faster than your system's timer granularity.
JMH doesn't measure each invocation of the benchmark method. It calls System.nanoTime() before the start of an iteration, executes the benchmark method X times, and calls System.nanoTime() again after the iteration. The result is then the time difference divided by the number of operations (you can specify more than one operation per invocation on the method with @OperationsPerInvocation).
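Conceptually, the measurement loop looks roughly like this (a heavily simplified sketch, not the actual JMH-generated code; benchmarkMethod() and iterationDone() are stand-ins, and real JMH checks a flag flipped by a timer thread instead of calling nanoTime in the loop):
long ops = 0;
long start = System.nanoTime();
do {
    benchmarkMethod(); // the method under test, invoked back to back
    ops++;
} while (!iterationDone()); // e.g. 200 ms iteration length
long stop = System.nanoTime();
double nsPerOp = (double) (stop - start) / ops; // the average can resolve far below the timer granularity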
Aleksey Shipilev discussed measurement problems with Nanotime in his article Nanotrusting the Nanotime. Section 'Latency' contains a code example that shows how JMH measures one benchmark iteration.

Why am I getting *worse* performance with a *larger* buffer in my BufferedReader?

I'm getting weird results I can't explain from a BufferedReader when I vary the size of the buffer.
I had strongly expected that performance would gradually increase as I increased the size of the buffer, with diminishing returns setting in fairly quickly, and that thereafter performance would be more or less flat. But it seems that, after only a very modest buffer size, increasing the size of the buffer makes it slower.
Here's a minimal example. All it does is run through a text file, and calculate the sum of the lengths of the lines.
public int traverseFile(int bufSize) throws IOException {
    BufferedReader reader = new BufferedReader(new FileReader("words16"), bufSize * 1024);
    String line;
    int total = 0;
    while ((line = reader.readLine()) != null)
        total += line.length();
    reader.close();
    return total;
}
I tried benchmarking this with various buffer sizes, and the results were rather odd. Up to about 256KB, performance increases; after that point, it gets worse. I wondered whether it was just the time taken to allocate the buffer, so I tried adding something in to make it always allocate the same total amount of memory (see second line below):
public int traverseFile(int bufSize) throws IOException {
    byte[] pad = new byte[(65536 - bufSize) * 1024];
    BufferedReader reader = new BufferedReader(new FileReader("words16"), bufSize * 1024);
    String line;
    int total = 0;
    while ((line = reader.readLine()) != null)
        total += line.length();
    reader.close();
    return total;
}
This makes no odds. I am still getting the same results, on two different machines. Here are the full results:
Benchmark Mode Samples Score Error Units
j.t.BufferSizeBenchmark.traverse_test1_4K avgt 100 363.987 ± 1.901 ms/op
j.t.BufferSizeBenchmark.traverse_test2_16K avgt 100 356.551 ± 0.330 ms/op
j.t.BufferSizeBenchmark.traverse_test3_64K avgt 100 353.462 ± 0.557 ms/op
j.t.BufferSizeBenchmark.traverse_test4_256K avgt 100 350.822 ± 0.562 ms/op
j.t.BufferSizeBenchmark.traverse_test5_1024K avgt 100 356.949 ± 0.338 ms/op
j.t.BufferSizeBenchmark.traverse_test6_4096K avgt 100 358.377 ± 0.388 ms/op
j.t.BufferSizeBenchmark.traverse_test7_16384K avgt 100 367.890 ± 0.393 ms/op
j.t.BufferSizeBenchmark.traverse_test8_65536K avgt 100 363.271 ± 0.228 ms/op
As you can see, the sweet spot is at about 256KB. The difference isn't huge, but it is certainly measurable.
All I can think is that this might be something to do with the memory cache. Is it because the RAM that's being written to is further away from the RAM that's being read? But if it's a cyclic buffer, I'm not even sure that's true: what's being written will be just behind what's being read.
The words16 file is 80MB so I can't post it here, but it's Fedora's standard /usr/share/dict/words file, sixteen times over. I can find a way to post a link if necessary.
Here's the benchmarking code:
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@BenchmarkMode(Mode.AverageTime)
@OperationsPerInvocation(1)
@Warmup(iterations = 30, time = 100, timeUnit = TimeUnit.MILLISECONDS)
@Measurement(iterations = 100, time = 10000, timeUnit = TimeUnit.MILLISECONDS)
@State(Scope.Thread)
@Threads(1)
@Fork(1)
public class BufferSizeBenchmark {
    public int traverseFile(int bufSize) throws IOException {
        byte[] pad = new byte[(65536 - bufSize) * 1024];
        BufferedReader reader = new BufferedReader(new FileReader("words16"), bufSize * 1024);
        String line;
        int total = 0;
        while ((line = reader.readLine()) != null)
            total += line.length();
        reader.close();
        return total;
    }

    @Benchmark
    public int traverse_test1_4K() throws IOException {
        return traverseFile(4);
    }

    @Benchmark
    public int traverse_test2_16K() throws IOException {
        return traverseFile(16);
    }

    @Benchmark
    public int traverse_test3_64K() throws IOException {
        return traverseFile(64);
    }

    @Benchmark
    public int traverse_test4_256K() throws IOException {
        return traverseFile(256);
    }

    @Benchmark
    public int traverse_test5_1024K() throws IOException {
        return traverseFile(1024);
    }

    @Benchmark
    public int traverse_test6_4096K() throws IOException {
        return traverseFile(4096);
    }

    @Benchmark
    public int traverse_test7_16384K() throws IOException {
        return traverseFile(16384);
    }

    @Benchmark
    public int traverse_test8_65536K() throws IOException {
        return traverseFile(65536);
    }

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(".*" + BufferSizeBenchmark.class.getSimpleName() + ".*")
                .forks(1).build();
        new Runner(opt).run();
    }
}
Why am I getting worse performance when I increase the size of the buffer?
This is most likely an effect of the CPU cache size. Since the cache uses an LRU eviction policy, using too large a buffer causes what you've written to the 'beginning' of the buffer to be evicted before you get a chance to read it.
256k is a typical CPU cache size! What type of CPU did you test on?
So what happens is: if you read chunks of 256k or smaller, the content that was written to the buffer is still in the CPU cache when the read accesses it. If you have chunks greater than 256k, then only the last 256k that were read are in the CPU cache, so when the read starts from the beginning the content must be retrieved from main memory.
The second problem is the buffer allocation. The trick with the padding buffer is clever, but it does not really average out the allocation cost. The reason is that the real cost of the allocation is not reserving the memory, but clearing it. Furthermore, the OS may postpone mapping in the real memory until it is first accessed. But you never access the padding buffer.
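If one really wanted the padding trick to equalise the allocation cost, the pad would have to be touched as well; a sketch of that idea (it would still perturb the CPU cache, so it is of limited use):
byte[] pad = new byte[(65536 - bufSize) * 1024];
java.util.Arrays.fill(pad, (byte) 1); // touch every element so the pages are really mapped and written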
