Java 8: Class.getName() slows down String concatenation chain

Recently I've run into an issue regarding String concatenation. This benchmark summarizes it:
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class BrokenConcatenationBenchmark {

    @Benchmark
    public String slow(Data data) {
        final Class<? extends Data> clazz = data.clazz;
        return "class " + clazz.getName();
    }

    @Benchmark
    public String fast(Data data) {
        final Class<? extends Data> clazz = data.clazz;
        final String clazzName = clazz.getName();
        return "class " + clazzName;
    }

    @State(Scope.Thread)
    public static class Data {
        final Class<? extends Data> clazz = getClass();

        @Setup
        public void setup() {
            // explicitly load the name via the native method Class.getName0()
            clazz.getName();
        }
    }
}
On JDK 1.8.0_222 (OpenJDK 64-Bit Server VM, 25.222-b10) I've got the following results:
Benchmark Mode Cnt Score Error Units
BrokenConcatenationBenchmark.fast avgt 25 22,253 ± 0,962 ns/op
BrokenConcatenationBenchmark.fast:·gc.alloc.rate avgt 25 9824,603 ± 400,088 MB/sec
BrokenConcatenationBenchmark.fast:·gc.alloc.rate.norm avgt 25 240,000 ± 0,001 B/op
BrokenConcatenationBenchmark.fast:·gc.churn.PS_Eden_Space avgt 25 9824,162 ± 397,745 MB/sec
BrokenConcatenationBenchmark.fast:·gc.churn.PS_Eden_Space.norm avgt 25 239,994 ± 0,522 B/op
BrokenConcatenationBenchmark.fast:·gc.churn.PS_Survivor_Space avgt 25 0,040 ± 0,011 MB/sec
BrokenConcatenationBenchmark.fast:·gc.churn.PS_Survivor_Space.norm avgt 25 0,001 ± 0,001 B/op
BrokenConcatenationBenchmark.fast:·gc.count avgt 25 3798,000 counts
BrokenConcatenationBenchmark.fast:·gc.time avgt 25 2241,000 ms
BrokenConcatenationBenchmark.slow avgt 25 54,316 ± 1,340 ns/op
BrokenConcatenationBenchmark.slow:·gc.alloc.rate avgt 25 8435,703 ± 198,587 MB/sec
BrokenConcatenationBenchmark.slow:·gc.alloc.rate.norm avgt 25 504,000 ± 0,001 B/op
BrokenConcatenationBenchmark.slow:·gc.churn.PS_Eden_Space avgt 25 8434,983 ± 198,966 MB/sec
BrokenConcatenationBenchmark.slow:·gc.churn.PS_Eden_Space.norm avgt 25 503,958 ± 1,000 B/op
BrokenConcatenationBenchmark.slow:·gc.churn.PS_Survivor_Space avgt 25 0,127 ± 0,011 MB/sec
BrokenConcatenationBenchmark.slow:·gc.churn.PS_Survivor_Space.norm avgt 25 0,008 ± 0,001 B/op
BrokenConcatenationBenchmark.slow:·gc.count avgt 25 3789,000 counts
BrokenConcatenationBenchmark.slow:·gc.time avgt 25 2245,000 ms
This looks like an issue similar to JDK-8043677, where an expression with a side effect
breaks the optimization of a new StringBuilder().append().append().toString() chain.
But the code of Class.getName() itself does not seem to have any side effects:
private transient String name;

public String getName() {
    String name = this.name;
    if (name == null) {
        this.name = name = this.getName0();
    }
    return name;
}

private native String getName0();
The only suspicious thing here is the call to a native method, which in fact happens
only once; its result is cached in a field of the class.
In my benchmark I've explicitly cached it in the setup method.
I expected the branch predictor to figure out that at each benchmark invocation
the actual value of this.name is never null and to optimize the whole expression.
However, while for the BrokenConcatenationBenchmark.fast() I have this:
@ 19   tsypanov.strings.benchmark.concatenation.BrokenConcatenationBenchmark::fast (30 bytes)   force inline by CompileCommand
  @ 6    java.lang.Class::getName (18 bytes)   inline (hot)
    @ 14   java.lang.Class::initClassName (0 bytes)   native method
  @ 14   java.lang.StringBuilder::<init> (7 bytes)   inline (hot)
  @ 19   java.lang.StringBuilder::append (8 bytes)   inline (hot)
  @ 23   java.lang.StringBuilder::append (8 bytes)   inline (hot)
  @ 26   java.lang.StringBuilder::toString (35 bytes)   inline (hot)
i.e. the compiler is able to inline everything. For BrokenConcatenationBenchmark.slow() it is different:
@ 19   tsypanov.strings.benchmark.concatenation.BrokenConcatenationBenchmark::slow (28 bytes)   force inline by CompilerOracle
  @ 9    java.lang.StringBuilder::<init> (7 bytes)   inline (hot)
    @ 3    java.lang.AbstractStringBuilder::<init> (12 bytes)   inline (hot)
      @ 1    java.lang.Object::<init> (1 bytes)   inline (hot)
  @ 14   java.lang.StringBuilder::append (8 bytes)   inline (hot)
    @ 2    java.lang.AbstractStringBuilder::append (50 bytes)   inline (hot)
      @ 10   java.lang.String::length (6 bytes)   inline (hot)
      @ 21   java.lang.AbstractStringBuilder::ensureCapacityInternal (27 bytes)   inline (hot)
        @ 17   java.lang.AbstractStringBuilder::newCapacity (39 bytes)   inline (hot)
        @ 20   java.util.Arrays::copyOf (19 bytes)   inline (hot)
          @ 11   java.lang.Math::min (11 bytes)   (intrinsic)
          @ 14   java.lang.System::arraycopy (0 bytes)   (intrinsic)
      @ 35   java.lang.String::getChars (62 bytes)   inline (hot)
        @ 58   java.lang.System::arraycopy (0 bytes)   (intrinsic)
  @ 18   java.lang.Class::getName (21 bytes)   inline (hot)
    @ 11   java.lang.Class::getName0 (0 bytes)   native method
  @ 21   java.lang.StringBuilder::append (8 bytes)   inline (hot)
    @ 2    java.lang.AbstractStringBuilder::append (50 bytes)   inline (hot)
      @ 10   java.lang.String::length (6 bytes)   inline (hot)
      @ 21   java.lang.AbstractStringBuilder::ensureCapacityInternal (27 bytes)   inline (hot)
        @ 17   java.lang.AbstractStringBuilder::newCapacity (39 bytes)   inline (hot)
        @ 20   java.util.Arrays::copyOf (19 bytes)   inline (hot)
          @ 11   java.lang.Math::min (11 bytes)   (intrinsic)
          @ 14   java.lang.System::arraycopy (0 bytes)   (intrinsic)
      @ 35   java.lang.String::getChars (62 bytes)   inline (hot)
        @ 58   java.lang.System::arraycopy (0 bytes)   (intrinsic)
  @ 24   java.lang.StringBuilder::toString (17 bytes)   inline (hot)
So the question is: is this appropriate behaviour of the JVM, or a compiler bug?
I'm asking because some projects are still using Java 8, and if this won't be fixed in any of its release updates, then it seems reasonable to me to manually hoist calls to Class.getName() out of hot spots.
P.S. On the latest JDKs (11, 13, 14-eap) the issue is not reproduced.

HotSpot JVM collects execution statistics per bytecode. If the same code is run in different contexts, the result profile will aggregate statistics from all contexts. This effect is known as profile pollution.
Class.getName() is obviously called not only from your benchmark code. Before the JIT starts compiling the benchmark, it already knows that the following condition in Class.getName() was met multiple times:
if (name == null)
    this.name = name = getName0();
At least, enough times to treat this branch as statistically important. So, the JIT did not exclude this branch from compilation, and thus could not optimize the string concatenation due to the possible side effect.
This does not even need to be a native method call. Just a regular field assignment is also considered a side effect.
Here is an example of how profile pollution can harm further optimizations.
@State(Scope.Benchmark)
public class StringConcat {
    private final MyClass clazz = new MyClass();

    static class MyClass {
        private String name;

        public String getName() {
            if (name == null) name = "ZZZ";
            return name;
        }
    }

    @Param({"1", "100", "400", "1000"})
    private int pollutionCalls;

    @Setup
    public void setup() {
        for (int i = 0; i < pollutionCalls; i++) {
            new MyClass().getName();
        }
    }

    @Benchmark
    public String fast() {
        String clazzName = clazz.getName();
        return "str " + clazzName;
    }

    @Benchmark
    public String slow() {
        return "str " + clazz.getName();
    }
}
This is basically a modified version of your benchmark that simulates the pollution of the getName() profile. Depending on the number of preliminary getName() calls on a fresh object, the subsequent performance of string concatenation may differ dramatically:
Benchmark (pollutionCalls) Mode Cnt Score Error Units
StringConcat.fast 1 avgt 15 11,458 ± 0,076 ns/op
StringConcat.fast 100 avgt 15 11,690 ± 0,222 ns/op
StringConcat.fast 400 avgt 15 12,131 ± 0,105 ns/op
StringConcat.fast 1000 avgt 15 12,194 ± 0,069 ns/op
StringConcat.slow 1 avgt 15 11,771 ± 0,105 ns/op
StringConcat.slow 100 avgt 15 11,963 ± 0,212 ns/op
StringConcat.slow 400 avgt 15 26,104 ± 0,202 ns/op << !
StringConcat.slow 1000 avgt 15 26,108 ± 0,436 ns/op << !
More examples of profile pollution »
I can't call it either a bug or an "appropriate behaviour". This is just how dynamic adaptive compilation is implemented in HotSpot.

Slightly unrelated, but since Java 9 and JEP 280: Indify String Concatenation, string concatenation is done with invokedynamic rather than StringBuilder. This article shows the differences in the bytecode between Java 8 and Java 9.
If the benchmark re-run on a newer Java version doesn't show the problem, there is most likely no bug in javac, because the compiler now uses the new mechanism. I'm not sure diving into Java 8 behavior is beneficial when there is such a substantial change in the newer versions.
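To see the difference for yourself, you can compile a trivial class (the class and method names below are mine, purely for illustration) with javac 8 and javac 9+, then compare the javap -c output:

```java
public class ConcatDemo {
    static String concat(String name) {
        // javac 8:  compiles to new StringBuilder -> append -> append -> toString
        // javac 9+: compiles to a single invokedynamic bound to
        //           StringConcatFactory.makeConcatWithConstants
        return "class " + name;
    }

    public static void main(String[] args) {
        System.out.println(concat(ConcatDemo.class.getName()));
    }
}
```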


Why does a Java compiled regex work slower than the interpreted one in String::split?

I'm trying to improve the following code:
public int applyAsInt(String ipAddress) {
    var ipAddressInArray = ipAddress.split("\\.");
    ...
So I compile the regular expression into a static constant:
private static final Pattern PATTERN_DOT = Pattern.compile(".", Pattern.LITERAL);

public int applyAsInt(String ipAddress) {
    var ipAddressInArray = PATTERN_DOT.split(ipAddress);
    ...
The rest of the code remained unchanged.
To my amazement, the new code is slower than the previous one.
Below are the test results:
Benchmark (ipAddress) Mode Cnt Score Error Units
ConverterBenchmark.mkyongConverter 1.2.3.4 avgt 10 166.456 ± 9.087 ns/op
ConverterBenchmark.mkyongConverter 120.1.34.78 avgt 10 168.548 ± 2.996 ns/op
ConverterBenchmark.mkyongConverter 129.205.201.114 avgt 10 180.754 ± 6.891 ns/op
ConverterBenchmark.mkyong2Converter 1.2.3.4 avgt 10 253.318 ± 4.977 ns/op
ConverterBenchmark.mkyong2Converter 120.1.34.78 avgt 10 263.045 ± 8.373 ns/op
ConverterBenchmark.mkyong2Converter 129.205.201.114 avgt 10 331.376 ± 53.092 ns/op
Help me understand why this is happening, please.
String.split has code aimed at exactly this use case:
https://github.com/openjdk/jdk17u/blob/master/src/java.base/share/classes/java/lang/String.java#L3102
/* fastpath if the regex is a
* (1) one-char String and this character is not one of the
* RegEx's meta characters ".$|()[{^?*+\\", or
* (2) two-char String and the first char is the backslash and
* the second is not the ascii digit or ascii letter.
*/
That means that with split("\\.") the string is effectively not split using a regular expression: the method splits the string directly at the '.' characters.
This optimization is not possible when you write PATTERN_DOT.split(ipAddress), because a precompiled Pattern always goes through the regex engine.
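A quick sketch of the two call shapes; both produce the same array, but only the first is eligible for the fastpath (case 2 of the comment: a two-char string of backslash plus a non-alphanumeric character):

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class SplitDemo {
    // Precompiled literal-dot pattern, as in the question
    static final Pattern PATTERN_DOT = Pattern.compile(".", Pattern.LITERAL);

    public static void main(String[] args) {
        // Fastpath eligible: "\\." is the two characters '\' and '.'
        String[] a = "1.2.3.4".split("\\.");
        // No fastpath: the Pattern object always runs the regex machinery
        String[] b = PATTERN_DOT.split("1.2.3.4");
        System.out.println(Arrays.equals(a, b)); // true: same result, different path
    }
}
```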

Java Collections.reverseOrder() vs. Comparator

I am practicing LeetCode and oftentimes I see different ways of reversing data structures. I was wondering if there is a benefit to doing it one way vs. the other?
For example, to create a max heap from a PriorityQueue I can write:
PriorityQueue<Integer> heap = new PriorityQueue<>((a,b) -> b-a);
or
PriorityQueue<Integer> heap = new PriorityQueue<>(Collections.reverseOrder());
Is the time and space complexity the same? Should I default to using a custom comparator, or is Collections.reverseOrder() good?
The first Comparator (the one with the lambda) is wrong because it is subject to integer overflow (for example, with b = Integer.MIN_VALUE and a > 0 it would return a positive value, not a negative one); you should use:
PriorityQueue<Integer> heap = new PriorityQueue<>((a,b) -> Integer.compare(b, a));
Integer.compare is also what Integer.compareTo calls, which in turn is what reverseOrder's comparator ends up calling.
As said in the other answer, you should use reverseOrder or Comparator.reversed() because it makes the intention clear.
Now, for your question in depth, you should note that:
The lambda (a, b) -> b - a involves unboxing and should be read as b.intValue() - a.intValue(); since this is not valid for some values due to integer overflow, it should be Integer.compare(b.intValue(), a.intValue()).
reverseOrder simply calls b.compareTo(a), which calls Integer.compare(value, other.value), where value is the actual value of the boxed Integer.
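A minimal sketch of the overflow pitfall: with b = Integer.MIN_VALUE, b - a wraps around and reports the wrong order, while Integer.compare does not.

```java
import java.util.Comparator;

public class OverflowDemo {
    public static void main(String[] args) {
        Comparator<Integer> broken = (a, b) -> b - a;
        Comparator<Integer> safe = (a, b) -> Integer.compare(b, a);

        // Integer.MIN_VALUE - 1 wraps to Integer.MAX_VALUE, so `broken`
        // returns a positive value and ranks MIN_VALUE before 1
        System.out.println(broken.compare(1, Integer.MIN_VALUE)); // 2147483647 (wrong sign)
        System.out.println(safe.compare(1, Integer.MIN_VALUE));   // negative (correct)
    }
}
```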
The performance difference would amount to:
The cost of the two unboxings
The cost of the method calls
The JVM's optimizations
You could wind up a JMH test (like the one below, based on another one I wrote for a different answer). I shortened the v1/v2 value lists to 3 entries because the full run takes a long time (~40 min).
package stackoverflow;
import java.util.*;
import java.util.stream.*;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;
@State(Scope.Benchmark)
@Warmup(time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(1)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@BenchmarkMode({Mode.AverageTime})
public class ComparatorBenchmark {
    private List<Integer> values;

    @Param({ "" + Integer.MIN_VALUE, "" + Integer.MAX_VALUE, "10", "-1000", "5000", "10000", "20000", "50000", "100000" })
    private Integer v1;

    @Param({ "" + Integer.MIN_VALUE, "" + Integer.MAX_VALUE, "10", "1000", "-5000", "10000", "20000", "50000", "100000" })
    private Integer v2;

    private Comparator<Integer> cmp1;
    private Comparator<Integer> cmp2;
    private Comparator<Integer> cmp3;

    @Setup
    public void setUp() {
        cmp1 = (a, b) -> Integer.compare(b, a);
        cmp2 = (a, b) -> b - a;
        cmp3 = Collections.reverseOrder();
    }

    @Benchmark
    public void with_Integer_compare(Blackhole blackhole) {
        blackhole.consume(cmp1.compare(v1, v2));
    }

    @Benchmark
    public void with_b_minus_a(Blackhole blackhole) {
        blackhole.consume(cmp2.compare(v1, v2));
    }

    @Benchmark
    public void with_reverse_comparator(Blackhole blackhole) {
        blackhole.consume(cmp3.compare(v1, v2));
    }
}
Running this on:
Windows 10
Java 17.0.2
AMD Ryzen 7 2700X @ 3.70 GHz
Produces the following result (I limited it to 3 values, MIN_VALUE, MAX_VALUE and 10, because the full run's ETA is ~40 min):
Benchmark (v1) (v2) Mode Cnt Score Error Units
ComparatorBenchmark.with_Integer_compare -2147483648 -2147483648 avgt 5 1,113 ± 0,074 ns/op
ComparatorBenchmark.with_Integer_compare -2147483648 2147483647 avgt 5 1,111 ± 0,037 ns/op
ComparatorBenchmark.with_Integer_compare -2147483648 10 avgt 5 1,111 ± 0,075 ns/op
ComparatorBenchmark.with_Integer_compare 2147483647 -2147483648 avgt 5 1,122 ± 0,075 ns/op
ComparatorBenchmark.with_Integer_compare 2147483647 2147483647 avgt 5 1,123 ± 0,070 ns/op
ComparatorBenchmark.with_Integer_compare 2147483647 10 avgt 5 1,102 ± 0,039 ns/op
ComparatorBenchmark.with_Integer_compare 10 -2147483648 avgt 5 1,097 ± 0,024 ns/op
ComparatorBenchmark.with_Integer_compare 10 2147483647 avgt 5 1,094 ± 0,019 ns/op
ComparatorBenchmark.with_Integer_compare 10 10 avgt 5 1,097 ± 0,034 ns/op
ComparatorBenchmark.with_b_minus_a -2147483648 -2147483648 avgt 5 1,105 ± 0,054 ns/op
ComparatorBenchmark.with_b_minus_a -2147483648 2147483647 avgt 5 1,099 ± 0,040 ns/op
ComparatorBenchmark.with_b_minus_a -2147483648 10 avgt 5 1,094 ± 0,038 ns/op
ComparatorBenchmark.with_b_minus_a 2147483647 -2147483648 avgt 5 1,112 ± 0,044 ns/op
ComparatorBenchmark.with_b_minus_a 2147483647 2147483647 avgt 5 1,105 ± 0,029 ns/op
ComparatorBenchmark.with_b_minus_a 2147483647 10 avgt 5 1,112 ± 0,068 ns/op
ComparatorBenchmark.with_b_minus_a 10 -2147483648 avgt 5 1,086 ± 0,010 ns/op
ComparatorBenchmark.with_b_minus_a 10 2147483647 avgt 5 1,125 ± 0,084 ns/op
ComparatorBenchmark.with_b_minus_a 10 10 avgt 5 1,125 ± 0,082 ns/op
ComparatorBenchmark.with_reverse_comparator -2147483648 -2147483648 avgt 5 1,121 ± 0,050 ns/op
ComparatorBenchmark.with_reverse_comparator -2147483648 2147483647 avgt 5 1,122 ± 0,067 ns/op
ComparatorBenchmark.with_reverse_comparator -2147483648 10 avgt 5 1,129 ± 0,094 ns/op
ComparatorBenchmark.with_reverse_comparator 2147483647 -2147483648 avgt 5 1,117 ± 0,046 ns/op
ComparatorBenchmark.with_reverse_comparator 2147483647 2147483647 avgt 5 1,122 ± 0,072 ns/op
ComparatorBenchmark.with_reverse_comparator 2147483647 10 avgt 5 1,116 ± 0,080 ns/op
ComparatorBenchmark.with_reverse_comparator 10 -2147483648 avgt 5 1,114 ± 0,052 ns/op
ComparatorBenchmark.with_reverse_comparator 10 2147483647 avgt 5 1,133 ± 0,068 ns/op
ComparatorBenchmark.with_reverse_comparator 10 10 avgt 5 1,134 ± 0,036 ns/op
As you can see, the scores are within the same margin.
You would not gain or lose much from either implementation, and as already said, you should use Collections.reverseOrder() to make your intention clear; if not, use Integer.compare, not a subtraction subject to integer overflow, unless you are sure that each Integer is limited in range (for example, Short.MIN_VALUE to Short.MAX_VALUE).
The asymptotics are the same, but Collections.reverseOrder() has several important advantages:
It guarantees that it doesn't do an allocation. ((a, b) -> b - a probably doesn't allocate, either, but reverseOrder guarantees a singleton.)
It is clearly self-documenting.
It works for all integers; b-a will break if comparing e.g. Integer.MAX_VALUE and -2 due to overflow.
Yes, the time and space complexity is the same (constant).
Using Collections.reverseOrder() is better because it names the order explicitly, instead of making the reader read the implementation and infer the order.
I think the time and space complexity must be the same, because in the end you're passing a comparator in both object creations, so the only difference is:
// passing an explicitly specified comparator
PriorityQueue<Integer> heap = new PriorityQueue<>((a, b) -> b - a);
// Collections.reverseOrder() returns a comparator imposing reverse natural ordering
PriorityQueue<Integer> heap = new PriorityQueue<>(Collections.reverseOrder());

JMH: strange dependency on the environment

While making my first approaches to using JMH to benchmark my class, I encountered a behavior that confuses me, and I'd like to clarify the issue before moving on.
The situation that confuses me:
When I run the benchmarks while the CPU is loaded (78%-80%) by extraneous processes, the results shown by JMH look quite plausible and stable:
Benchmark Mode Cnt Score Error Units
ArrayOperations.a_bigDecimalAddition avgt 5 264,703 ± 2,800 ns/op
ArrayOperations.b_quadrupleAddition avgt 5 44,290 ± 0,769 ns/op
ArrayOperations.c_bigDecimalSubtraction avgt 5 286,266 ± 2,454 ns/op
ArrayOperations.d_quadrupleSubtraction avgt 5 46,966 ± 0,629 ns/op
ArrayOperations.e_bigDecimalMultiplcation avgt 5 546,535 ± 4,988 ns/op
ArrayOperations.f_quadrupleMultiplcation avgt 5 85,056 ± 1,820 ns/op
ArrayOperations.g_bigDecimalDivision avgt 5 612,814 ± 5,943 ns/op
ArrayOperations.h_quadrupleDivision avgt 5 631,127 ± 4,172 ns/op
Relatively large errors are because I need only a rough estimate right now and I trade precision for quickness deliberately.
But the results obtained without extraneous load on the processor seem amazing to me:
Benchmark Mode Cnt Score Error Units
ArrayOperations.a_bigDecimalAddition avgt 5 684,035 ± 370,722 ns/op
ArrayOperations.b_quadrupleAddition avgt 5 83,743 ± 25,762 ns/op
ArrayOperations.c_bigDecimalSubtraction avgt 5 531,430 ± 184,980 ns/op
ArrayOperations.d_quadrupleSubtraction avgt 5 85,937 ± 103,351 ns/op
ArrayOperations.e_bigDecimalMultiplcation avgt 5 641,953 ± 288,545 ns/op
ArrayOperations.f_quadrupleMultiplcation avgt 5 102,692 ± 31,625 ns/op
ArrayOperations.g_bigDecimalDivision avgt 5 733,727 ± 161,827 ns/op
ArrayOperations.h_quadrupleDivision avgt 5 820,388 ± 546,990 ns/op
Everything seems to work almost twice as slow, iteration times are very unstable (they may vary from 500 to 1300 ns/op between neighboring iterations), and the errors are correspondingly unacceptably large.
The first set of results was obtained with a bunch of applications running, including a Folding@home distributed computing client (FahCore_a7.exe) which takes 75% of CPU time, a BitTorrent client that actively uses the disks, a dozen tabs in a browser, an e-mail client, etc. Average CPU load is about 85%. During the benchmark execution FahCore decreases its load so that Java takes 25% and the total load is 100%.
The second set of results was taken when all unnecessary processes were stopped, the CPU was practically idle, only Java took its 25%, and a couple of percent were used for system needs.
My CPU is an Intel i5-4460, 4 cores, 3.2 GHz; RAM 32 GB; OS Windows Server 2008 R2.
java version "1.8.0_231"
Java(TM) SE Runtime Environment (build 1.8.0_231-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.231-b11, mixed mode)
The questions are:
Why do the benchmarks show much worse and unstable results when Java is the only task that loads the machine?
Can I consider the first set of results more or less reliable when they depend on the environment so dramatically?
Should I set up the environment somehow to eliminate this dependency?
Or is it my code that is to blame?
The code:
package com.mvohm.quadruple.benchmarks;
// Required imports here
import com.mvohm.quadruple.Quadruple; // The class under tests
@State(value = Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(java.util.concurrent.TimeUnit.NANOSECONDS)
@Fork(value = 1)
@Warmup(iterations = 3, time = 7)
@Measurement(iterations = 5, time = 10)
public class ArrayOperations {
    // To do BigDecimal arithmetic with a precision close to that of Quadruple
    private static final MathContext MC_38 = new MathContext(38, RoundingMode.HALF_EVEN);

    private static final int DATA_SIZE = 0x1_0000;       // 65536
    private static final int INDEX_MASK = DATA_SIZE - 1; // 0xFFFF
    private static final double RAND_SCALE = 1e39; // To provide a sensible range of operands,
                                                   // so that the actual calculations don't get bypassed

    private final BigDecimal[]                 // Data to apply operations to
        bdOp1 = new BigDecimal[DATA_SIZE],     // BigDecimals
        bdOp2 = new BigDecimal[DATA_SIZE],
        bdResult = new BigDecimal[DATA_SIZE];
    private final Quadruple[]
        qOp1 = new Quadruple[DATA_SIZE],       // Quadruples
        qOp2 = new Quadruple[DATA_SIZE],
        qResult = new Quadruple[DATA_SIZE];

    private int index = 0;

    @Setup
    public void initData() {
        final Random rand = new Random(12345); // for reproducibility
        for (int i = 0; i < DATA_SIZE; i++) {
            bdOp1[i] = randomBigDecimal(rand);
            bdOp2[i] = randomBigDecimal(rand);
            qOp1[i] = randomQuadruple(rand);
            qOp2[i] = randomQuadruple(rand);
        }
    }

    private static Quadruple randomQuadruple(Random rand) {
        return Quadruple.nextNormalRandom(rand).multiply(RAND_SCALE); // ranged 0 .. 9.99e38
    }

    private static BigDecimal randomBigDecimal(Random rand) {
        return Quadruple.nextNormalRandom(rand).multiply(RAND_SCALE).bigDecimalValue();
    }

    @Benchmark
    public void a_bigDecimalAddition() {
        bdResult[index] = bdOp1[index].add(bdOp2[index], MC_38);
        index = ++index & INDEX_MASK;
    }

    @Benchmark
    public void b_quadrupleAddition() {
        // semantically the same as above
        qResult[index] = Quadruple.add(qOp1[index], qOp2[index]);
        index = ++index & INDEX_MASK;
    }

    // Other methods are similar

    public static void main(String... args) throws IOException, RunnerException {
        final Options opt = new OptionsBuilder()
            .include(ArrayOperations.class.getSimpleName())
            .forks(1)
            .build();
        new Runner(opt).run();
    }
}
The reason was very simple, and I should have understood it immediately: power saving mode was enabled in the OS, which reduced the clock frequency of the CPU under low load. The moral is: always disable power saving when benchmarking!

Performance comparison of modulo operator and bitwise AND

I'm trying to determine whether a 32-bit integer is even or odd. I have set up two approaches:
modulo(%) approach
int r = (i % 2);
bitwise(&) approach
int r = (i & 0x1);
Both approaches work correctly. So I ran each line 15000 times to test performance.
Result:
modulo(%) approach (source code)
mean 141.5801887ns | SD 270.0700275ns
bitwise(&) approach (source code)
mean 141.2504ns | SD 193.6351007ns
Questions:
Why is the bitwise (&) approach more stable than the modulo (%) approach?
Does the JVM optimize modulo (%) using AND (&), as described here?
Let's try to reproduce with JMH.
@Benchmark
@Measurement(timeUnit = TimeUnit.NANOSECONDS)
@BenchmarkMode(Mode.AverageTime)
public int first() throws IOException {
    return i % 2;
}

@Benchmark
@Measurement(timeUnit = TimeUnit.NANOSECONDS)
@BenchmarkMode(Mode.AverageTime)
public int second() throws IOException {
    return i & 0x1;
}
Okay, it is reproducible: the first is slightly slower than the second. Now let's figure out why. Run it with -prof perfnorm:
Benchmark Mode Cnt Score Error Units
MyBenchmark.first avgt 50 2.674 ± 0.028 ns/op
MyBenchmark.first:CPI avgt 10 0.301 ± 0.002 #/op
MyBenchmark.first:L1-dcache-load-misses avgt 10 0.001 ± 0.001 #/op
MyBenchmark.first:L1-dcache-loads avgt 10 11.011 ± 0.146 #/op
MyBenchmark.first:L1-dcache-stores avgt 10 3.011 ± 0.034 #/op
MyBenchmark.first:L1-icache-load-misses avgt 10 ≈ 10⁻³ #/op
MyBenchmark.first:LLC-load-misses avgt 10 ≈ 10⁻⁴ #/op
MyBenchmark.first:LLC-loads avgt 10 ≈ 10⁻⁴ #/op
MyBenchmark.first:LLC-store-misses avgt 10 ≈ 10⁻⁵ #/op
MyBenchmark.first:LLC-stores avgt 10 ≈ 10⁻⁴ #/op
MyBenchmark.first:branch-misses avgt 10 ≈ 10⁻⁴ #/op
MyBenchmark.first:branches avgt 10 4.006 ± 0.054 #/op
MyBenchmark.first:cycles avgt 10 9.322 ± 0.113 #/op
MyBenchmark.first:dTLB-load-misses avgt 10 ≈ 10⁻⁴ #/op
MyBenchmark.first:dTLB-loads avgt 10 10.939 ± 0.175 #/op
MyBenchmark.first:dTLB-store-misses avgt 10 ≈ 10⁻⁵ #/op
MyBenchmark.first:dTLB-stores avgt 10 2.991 ± 0.045 #/op
MyBenchmark.first:iTLB-load-misses avgt 10 ≈ 10⁻⁵ #/op
MyBenchmark.first:iTLB-loads avgt 10 ≈ 10⁻⁴ #/op
MyBenchmark.first:instructions avgt 10 30.991 ± 0.427 #/op
MyBenchmark.second avgt 50 2.263 ± 0.015 ns/op
MyBenchmark.second:CPI avgt 10 0.320 ± 0.001 #/op
MyBenchmark.second:L1-dcache-load-misses avgt 10 0.001 ± 0.001 #/op
MyBenchmark.second:L1-dcache-loads avgt 10 11.045 ± 0.152 #/op
MyBenchmark.second:L1-dcache-stores avgt 10 3.014 ± 0.032 #/op
MyBenchmark.second:L1-icache-load-misses avgt 10 ≈ 10⁻³ #/op
MyBenchmark.second:LLC-load-misses avgt 10 ≈ 10⁻⁴ #/op
MyBenchmark.second:LLC-loads avgt 10 ≈ 10⁻⁴ #/op
MyBenchmark.second:LLC-store-misses avgt 10 ≈ 10⁻⁵ #/op
MyBenchmark.second:LLC-stores avgt 10 ≈ 10⁻⁴ #/op
MyBenchmark.second:branch-misses avgt 10 ≈ 10⁻⁴ #/op
MyBenchmark.second:branches avgt 10 4.014 ± 0.045 #/op
MyBenchmark.second:cycles avgt 10 8.024 ± 0.098 #/op
MyBenchmark.second:dTLB-load-misses avgt 10 ≈ 10⁻⁵ #/op
MyBenchmark.second:dTLB-loads avgt 10 10.989 ± 0.161 #/op
MyBenchmark.second:dTLB-store-misses avgt 10 ≈ 10⁻⁶ #/op
MyBenchmark.second:dTLB-stores avgt 10 3.004 ± 0.042 #/op
MyBenchmark.second:iTLB-load-misses avgt 10 ≈ 10⁻⁵ #/op
MyBenchmark.second:iTLB-loads avgt 10 ≈ 10⁻⁵ #/op
MyBenchmark.second:instructions avgt 10 25.076 ± 0.296 #/op
Note the difference in cycles and instructions. And now it's kind of obvious: the first cares about the sign, but the second does not (it's just a bitwise AND). To make sure this is the reason, take a look at the assembly fragments:
first:
0x00007f91111f8355: mov 0xc(%r10),%r11d ;*getfield i
0x00007f91111f8359: mov %r11d,%edx
0x00007f91111f835c: and $0x1,%edx
0x00007f91111f835f: mov %edx,%r10d
0x00007f6bd120a6e2: neg %r10d
0x00007f6bd120a6e5: test %r11d,%r11d
0x00007f6bd120a6e8: cmovl %r10d,%edx
second:
0x00007ff36cbda580: mov $0x1,%edx
0x00007ff36cbda585: mov 0x40(%rsp),%r10
0x00007ff36cbda58a: and 0xc(%r10),%edx
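The extra neg/test/cmovl instructions in the first fragment exist because, in Java, % must keep the sign of the dividend, while & 1 simply masks the low bit. A minimal sketch of the semantic difference the JIT has to preserve:

```java
public class ModVsAnd {
    public static void main(String[] args) {
        int i = -3;
        // Remainder takes the sign of the dividend, so extra sign-fixing
        // instructions are emitted for %
        System.out.println(i % 2); // -1
        // Bitwise AND of the low bit ignores the sign entirely
        System.out.println(i & 1); // 1
    }
}
```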
An execution time of 150 ns is about 500 clock cycles. I don't think there has ever been a processor that went about the business of checking a bit this inefficiently :-).
The problem is that your test harness is flawed in many ways. In particular:
you make no attempt to trigger JIT compilation before starting the clock
System.nanoTime() is not guaranteed to have nanosecond accuracy
System.nanoTime() is quite a bit more expensive to call than the code you want to measure
See How do I write a correct micro-benchmark in Java? for a more complete list of things to keep in mind.
Here's a better benchmark:
public abstract class Benchmark {
    final String name;

    public Benchmark(String name) {
        this.name = name;
    }

    @Override
    public String toString() {
        return name + "\t" + time() + " ns / iteration";
    }

    private BigDecimal time() {
        try {
            // automatically detect a reasonable iteration count
            // (and trigger just-in-time compilation of the code under test)
            int iterations;
            long duration = 0;
            for (iterations = 1; iterations < 1_000_000_000 && duration < 1_000_000_000; iterations *= 2) {
                long start = System.nanoTime();
                run(iterations);
                duration = System.nanoTime() - start;
                cleanup();
            }
            return new BigDecimal((duration) * 1000 / iterations).movePointLeft(3);
        } catch (Throwable e) {
            throw new RuntimeException(e);
        }
    }

    /**
     * Executes the code under test.
     * @param iterations
     *            number of iterations to perform
     * @return any value that requires the entire code to be executed (to
     *         prevent dead code elimination by the just in time compiler)
     * @throws Throwable
     *             if the test could not complete successfully
     */
    protected abstract Object run(int iterations) throws Throwable;

    /**
     * Cleans up after a run, setting the stage for the next.
     */
    protected void cleanup() {
        // do nothing
    }

    public static void main(String[] args) throws Exception {
        System.out.println(new Benchmark("%") {
            @Override
            protected Object run(int iterations) throws Throwable {
                int sum = 0;
                for (int i = 0; i < iterations; i++) {
                    sum += i % 2;
                }
                return sum;
            }
        });
        System.out.println(new Benchmark("&") {
            @Override
            protected Object run(int iterations) throws Throwable {
                int sum = 0;
                for (int i = 0; i < iterations; i++) {
                    sum += i & 1;
                }
                return sum;
            }
        });
    }
}
On my machine, it prints:
% 0.375 ns / iteration
& 0.139 ns / iteration
So the difference is, as expected, on the order of a couple clock cycles. That is, & 1 was optimized slightly better by this JIT on this particular hardware, but the difference is so small it is extremely unlikely to have a measurable (let alone significant) impact on the performance of your program.
The two operations correspond to different JVM bytecode instructions:
irem // int remainder (%)
iand // bitwise and (&)
Somewhere I read that irem is usually implemented in software by the JVM, while iand maps directly to a hardware instruction. Oracle explains the two instructions as follows:
iand
An int result is calculated by taking the bitwise AND (conjunction) of value1 and value2.
irem
The int result is value1 - (value1 / value2) * value2.
It seems reasonable to me to assume that iand results in fewer CPU cycles.
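The spec's formula for irem can be checked directly; it also shows why irem involves a division (and sign handling), while iand is a single ALU operation:

```java
public class IremDemo {
    public static void main(String[] args) {
        int[] values = {7, -7, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int v : values) {
            // irem as specified: value1 - (value1 / value2) * value2
            int spec = v - (v / 2) * 2;
            System.out.println(v + " % 2 = " + (v % 2) + ", spec formula = " + spec);
        }
    }
}
```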

Construct a string from a repeated character

In Java I need to construct a string of n zeros, with n unknown at compile time. Ideally I'd use:
String s = new String('0', n);
But no such constructor exists, and CharSequence doesn't seem to have a suitable one either. So I'm tempted to build my own loop using StringBuilder.
Before I do this and risk getting defenestrated by my boss, could anyone advise: is there a standard way of doing this in Java? In C++, one of the std::string constructors allows this.
If you don't mind creating an extra string:
String zeros = new String(new char[n]).replace((char) 0, '0');
Or more explicit (and probably more efficient):
char[] c = new char[n];
Arrays.fill(c, '0');
String zeros = new String(c);
Performance-wise, the Arrays.fill option performs best in most situations, especially for large strings. Using a StringBuilder is quite slow for large strings but efficient for small ones. Using replace is a nice one-liner and performs OK for larger strings, but not as well as fill.
Micro benchmark for different values of n:
Benchmark (n) Mode Samples Score Error Units
c.a.p.SO26504151.builder 1 avgt 3 29.452 ± 1.849 ns/op
c.a.p.SO26504151.builder 10 avgt 3 51.641 ± 12.426 ns/op
c.a.p.SO26504151.builder 1000 avgt 3 2681.956 ± 336.353 ns/op
c.a.p.SO26504151.builder 1000000 avgt 3 3522995.218 ± 422579.979 ns/op
c.a.p.SO26504151.fill 1 avgt 3 30.255 ± 0.297 ns/op
c.a.p.SO26504151.fill 10 avgt 3 32.638 ± 7.553 ns/op
c.a.p.SO26504151.fill 1000 avgt 3 592.459 ± 91.413 ns/op
c.a.p.SO26504151.fill 1000000 avgt 3 706187.003 ± 152774.601 ns/op
c.a.p.SO26504151.replace 1 avgt 3 44.366 ± 5.153 ns/op
c.a.p.SO26504151.replace 10 avgt 3 51.778 ± 2.959 ns/op
c.a.p.SO26504151.replace 1000 avgt 3 1385.383 ± 289.319 ns/op
c.a.p.SO26504151.replace 1000000 avgt 3 1486335.886 ± 1807239.775 ns/op
Create an n-sized char array and convert it to a String:
char[] myZeroCharArray = new char[n];
for(int i = 0; i < n; i++) myZeroCharArray[i] = '0';
String myZeroString = new String(myZeroCharArray);
See StringUtils in Apache Commons Lang
https://commons.apache.org/proper/commons-lang/javadocs/api-2.6/org/apache/commons/lang/StringUtils.html#repeat%28java.lang.String,%20int%29
There isn't a standard JDK way, but Apache Commons (almost a de facto standard) has the StringUtils.repeat() method, e.g.:
String s = StringUtils.repeat('x', 5); // s = "xxxxx"
or the plain old String.format:
int n = 10;
String s = String.format("%" + n + "s", "").replace(' ', '0');
System.out.println(s);
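For completeness: on Java 11 and later (which postdates the original question), String.repeat makes this a standard-library one-liner:

```java
public class Zeros {
    public static void main(String[] args) {
        int n = 5;
        // Java 11+: repeats the receiver string n times
        String zeros = "0".repeat(n);
        System.out.println(zeros); // 00000
    }
}
```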
