Why Java compiled regex works slower then interpreted in String::split?

Why Java compiled regex works slower then interpreted in String::split? - java

I'm trying to improve the following code:
public int applyAsInt(String ipAddress) {
var ipAddressInArray = ipAddress.split("\\.");
...
So I compile the regular expression into a static constant:
private static final Pattern PATTERN_DOT = Pattern.compile(".", Pattern.LITERAL);
public int applyAsInt(String ipAddress) {
var ipAddressInArray = PATTERN_DOT.split(ipAddress);
...
The rest of the code remained unchanged.
To my amazement, the new code is slower than the previous one.
Below are the test results:
Benchmark (ipAddress) Mode Cnt Score Error Units
ConverterBenchmark.mkyongConverter 1.2.3.4 avgt 10 166.456 ± 9.087 ns/op
ConverterBenchmark.mkyongConverter 120.1.34.78 avgt 10 168.548 ± 2.996 ns/op
ConverterBenchmark.mkyongConverter 129.205.201.114 avgt 10 180.754 ± 6.891 ns/op
ConverterBenchmark.mkyong2Converter 1.2.3.4 avgt 10 253.318 ± 4.977 ns/op
ConverterBenchmark.mkyong2Converter 120.1.34.78 avgt 10 263.045 ± 8.373 ns/op
ConverterBenchmark.mkyong2Converter 129.205.201.114 avgt 10 331.376 ± 53.092 ns/op
Help me understand why this is happening, please.

String.split has code aimed at exactly this use case:
https://github.com/openjdk/jdk17u/blob/master/src/java.base/share/classes/java/lang/String.java#L3102
/* fastpath if the regex is a
* (1) one-char String and this character is not one of the
* RegEx's meta characters ".$|()[{^?*+\\", or
* (2) two-char String and the first char is the backslash and
* the second is not the ascii digit or ascii letter.
*/
That means that when using split("\\.") the string is effectively not split using a regular expression - the method splits the string directly at the '.' characters.
This optimization is not possible when you write PATTERN_DOT.split(ipAddress).

Related

Java Collections.reverseOrder() vs. Comparator

I am practicing leetcode and oftenTimes I see different ways of reversing data structures. I was wondering if there is a benefit to doing it one way vs the other?
For example to create a max heap from a PriortiyQueue i can
PriorityQueue<Integer> heap = new PriorityQueue<>((a,b) -> b-a);
or
PriorityQueue<Integer> heap = new PriorityQueue<>(Collections.reverseOrder());
Is the time and space complexity the same? Should i be defaulting to use a comparator or is the collections.reverseOrder() good?

The first Comparator (one with lambda) is wrong because it is subject to integer overflow (for example, with b = Integer.MIN_VALUE and a > 0, this would return a positive value, not a negative one), you should use:
PriorityQueue<Integer> heap = new PriorityQueue<>((a,b) -> Integer.compare(b, a));
Which is also called by Integer.compareTo, which is called by reverseOrder.
As said in other answer, you should use the reverseOrder or Comparator.reversed() because it make the intention clear.
Now, for your question in depth, you should note that:
The lambda (a, b) -> b - a is affected by unboxing and should be read b.intValue() - a.intValue() and since this one is not valid for some values due to integer overflow: it should be Integer.compare(b.intValue(), a.intValue()).
The reverseOrder simply calls b.compareTo(a) which calls Integer.compare(value, other.value) where value is the actual value of the boxed Integer.
The performance difference would amount to:
The cost of the two unboxing
Cost of calling methods.
The JVM optimization
You could wind up a JMH test (like the one below, based on another that I wrote for another answer): I shortened the values v1/v2 does to 3 because it takes times (~40m).
package stackoverflow;
import java.util.*;
import java.util.stream.*;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;
#State(Scope.Benchmark)
#Warmup( time = 1, timeUnit = TimeUnit.SECONDS)
#Measurement( time = 1, timeUnit = TimeUnit.SECONDS)
#Fork(1)
#OutputTimeUnit(TimeUnit.NANOSECONDS)
#BenchmarkMode({ Mode.AverageTime})
public class ComparatorBenchmark {
private List<Integer> values;
#Param({ "" + Integer.MIN_VALUE, "" + Integer.MAX_VALUE, "10", "-1000", "5000", "10000", "20000", "50000", "100000" })
private Integer v1;
#Param({ "" + Integer.MIN_VALUE, "" + Integer.MAX_VALUE, "10", "1000", "-5000", "10000", "20000", "50000", "100000" })
private Integer v2;
private Comparator<Integer> cmp1;
private Comparator<Integer> cmp2;
private Comparator<Integer> cmp3;
#Setup
public void setUp() {
cmp1 = (a, b) -> Integer.compare(b, a);
cmp2 = (a, b) -> b - a;
cmp3 = Collections.reverseOrder();
}
#Benchmark
public void with_Integer_compare(Blackhole blackhole) {
blackhole.consume(cmp1.compare(v1, v2));
}
#Benchmark
public void with_b_minus_a(Blackhole blackhole) {
blackhole.consume(cmp2.compare(v1, v2));
}
#Benchmark
public void with_reverse_comparator(Blackhole blackhole) {
blackhole.consume(cmp3.compare(v1, v2));
}
}
Running this on:
Windows 10
Java 17.0.2
AMD Ryzen 7 2700X # 3.70 GHz
Produces the following result (I limited it to 3 values: MIN_VALUE, MAX_VALUE and 10 because its ETA is ~40m):
Benchmark (v1) (v2) Mode Cnt Score Error Units
ComparatorBenchmark.with_Integer_compare -2147483648 -2147483648 avgt 5 1,113 ± 0,074 ns/op
ComparatorBenchmark.with_Integer_compare -2147483648 2147483647 avgt 5 1,111 ± 0,037 ns/op
ComparatorBenchmark.with_Integer_compare -2147483648 10 avgt 5 1,111 ± 0,075 ns/op
ComparatorBenchmark.with_Integer_compare 2147483647 -2147483648 avgt 5 1,122 ± 0,075 ns/op
ComparatorBenchmark.with_Integer_compare 2147483647 2147483647 avgt 5 1,123 ± 0,070 ns/op
ComparatorBenchmark.with_Integer_compare 2147483647 10 avgt 5 1,102 ± 0,039 ns/op
ComparatorBenchmark.with_Integer_compare 10 -2147483648 avgt 5 1,097 ± 0,024 ns/op
ComparatorBenchmark.with_Integer_compare 10 2147483647 avgt 5 1,094 ± 0,019 ns/op
ComparatorBenchmark.with_Integer_compare 10 10 avgt 5 1,097 ± 0,034 ns/op
ComparatorBenchmark.with_b_minus_a -2147483648 -2147483648 avgt 5 1,105 ± 0,054 ns/op
ComparatorBenchmark.with_b_minus_a -2147483648 2147483647 avgt 5 1,099 ± 0,040 ns/op
ComparatorBenchmark.with_b_minus_a -2147483648 10 avgt 5 1,094 ± 0,038 ns/op
ComparatorBenchmark.with_b_minus_a 2147483647 -2147483648 avgt 5 1,112 ± 0,044 ns/op
ComparatorBenchmark.with_b_minus_a 2147483647 2147483647 avgt 5 1,105 ± 0,029 ns/op
ComparatorBenchmark.with_b_minus_a 2147483647 10 avgt 5 1,112 ± 0,068 ns/op
ComparatorBenchmark.with_b_minus_a 10 -2147483648 avgt 5 1,086 ± 0,010 ns/op
ComparatorBenchmark.with_b_minus_a 10 2147483647 avgt 5 1,125 ± 0,084 ns/op
ComparatorBenchmark.with_b_minus_a 10 10 avgt 5 1,125 ± 0,082 ns/op
ComparatorBenchmark.with_reverse_comparator -2147483648 -2147483648 avgt 5 1,121 ± 0,050 ns/op
ComparatorBenchmark.with_reverse_comparator -2147483648 2147483647 avgt 5 1,122 ± 0,067 ns/op
ComparatorBenchmark.with_reverse_comparator -2147483648 10 avgt 5 1,129 ± 0,094 ns/op
ComparatorBenchmark.with_reverse_comparator 2147483647 -2147483648 avgt 5 1,117 ± 0,046 ns/op
ComparatorBenchmark.with_reverse_comparator 2147483647 2147483647 avgt 5 1,122 ± 0,072 ns/op
ComparatorBenchmark.with_reverse_comparator 2147483647 10 avgt 5 1,116 ± 0,080 ns/op
ComparatorBenchmark.with_reverse_comparator 10 -2147483648 avgt 5 1,114 ± 0,052 ns/op
ComparatorBenchmark.with_reverse_comparator 10 2147483647 avgt 5 1,133 ± 0,068 ns/op
ComparatorBenchmark.with_reverse_comparator 10 10 avgt 5 1,134 ± 0,036 ns/op
As you can see, the score are within the same margin.
You would not gain/lose much from neither implementation, and as already said, you should use Collections.reverseOrder() to make your intention clear, and if not use Integer.compare, not a subtraction subject to integer overflow unless you are sure that each Integer is limited (for example, Short.MIN_VALUE to Short.MAX_VALUE).

The asymptotics are the same, but Collections.reverseOrder() has several important advantages:
It guarantees that it doesn't do an allocation. ((a, b) -> b - a probably doesn't allocate, either, but reverseOrder guarantees a singleton.)
It is clearly self-documenting.
It works for all integers; b-a will break if comparing e.g. Integer.MAX_VALUE and -2 due to overflow.

Yes, the time and space complexity is the same (constant).
Using Collections.reverseOrder() is better because it names the order explicitly, instead of making the reader read the implementation and infer the order.

I think time and space complexity must be same because at the end your passing the comparator in both object creations so the only difference is :-
//passing specified comparator
PriorityQueue<Integer> heap = new PriorityQueue<>((a,b) -> b-a);
//Collection.reverseOrder will return comparator with natural reverse ordering
PriorityQueue<Integer> heap = new PriorityQueue<>(Collections.reverseOrder());

JMH: strange dependency on the environment

While making my first approaches to using JMH to benchmark my class, I encountered a behavior that confuses me, and I'd like to clarify the issue before moving on.
The situation that confuses me:
When I run the benchmarks while the CPU is loaded (78%-80%) by extraneous processes, the results shown by JMH look quite plausible and stable:
Benchmark Mode Cnt Score Error Units
ArrayOperations.a_bigDecimalAddition avgt 5 264,703 ± 2,800 ns/op
ArrayOperations.b_quadrupleAddition avgt 5 44,290 ± 0,769 ns/op
ArrayOperations.c_bigDecimalSubtraction avgt 5 286,266 ± 2,454 ns/op
ArrayOperations.d_quadrupleSubtraction avgt 5 46,966 ± 0,629 ns/op
ArrayOperations.e_bigDecimalMultiplcation avgt 5 546,535 ± 4,988 ns/op
ArrayOperations.f_quadrupleMultiplcation avgt 5 85,056 ± 1,820 ns/op
ArrayOperations.g_bigDecimalDivision avgt 5 612,814 ± 5,943 ns/op
ArrayOperations.h_quadrupleDivision avgt 5 631,127 ± 4,172 ns/op
Relatively large errors are because I need only a rough estimate right now and I trade precision for quickness deliberately.
But the results obtained without extraneous load on the processor seem amazing to me:
Benchmark Mode Cnt Score Error Units
ArrayOperations.a_bigDecimalAddition avgt 5 684,035 ± 370,722 ns/op
ArrayOperations.b_quadrupleAddition avgt 5 83,743 ± 25,762 ns/op
ArrayOperations.c_bigDecimalSubtraction avgt 5 531,430 ± 184,980 ns/op
ArrayOperations.d_quadrupleSubtraction avgt 5 85,937 ± 103,351 ns/op
ArrayOperations.e_bigDecimalMultiplcation avgt 5 641,953 ± 288,545 ns/op
ArrayOperations.f_quadrupleMultiplcation avgt 5 102,692 ± 31,625 ns/op
ArrayOperations.g_bigDecimalDivision avgt 5 733,727 ± 161,827 ns/op
ArrayOperations.h_quadrupleDivision avgt 5 820,388 ± 546,990 ns/op
Everything seems to work almost twice slower, iteration times are very unstable (may vary from 500 to 1300 ns/op at neighbor iterations) and the errors are respectively unacceptably large.
The first set of results is obtained with a bunch of application running, including Folding#home distribute computations client (FahCore_a7.exe) which takes 75% of CPU time, a BitTorrent client that actively uses disks, a dozen of tabs in a browser, e-mail client etc. Average CPU load is about 85%. During the benchmark execution FahCoredecreases the load so that Java takes 25% and total load is 100%.
The second set of results is taken when all unnecessary processes are stopped, CPU is practically idle, only Java takes it's 25% and a couple of percents are used for system needs.
My CPU is Intel i5-4460, 4 kernels, 3.2 GHz, RAM 32 GB, OS Windows Server 2008 R2.
java version "1.8.0_231"
Java(TM) SE Runtime Environment (build 1.8.0_231-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.231-b11, mixed mode)
The questions are:
Why the benchmarks show much worse and unstable results when it's the only task that loads the machine?
Can I consider the first set of results more or less reliable when they depend on the environment so dramatically?
Should I setup the environment somehow to eliminate this dependency?
Or is this my code that is to blame?
The code:
package com.mvohm.quadruple.benchmarks;
// Required imports here
import com.mvohm.quadruple.Quadruple; // The class under tests
#State(value = Scope.Benchmark)
#BenchmarkMode(Mode.AverageTime)
#OutputTimeUnit(java.util.concurrent.TimeUnit.NANOSECONDS)
#Fork(value = 1)
#Warmup(iterations = 3, time = 7)
#Measurement(iterations = 5, time = 10)
public class ArrayOperations {
// To do BigDecimal arithmetic with the precision close to this of Quadruple
private static final MathContext MC_38 = new MathContext(38, RoundingMode.HALF_EVEN);
private static final int DATA_SIZE = 0x1_0000; // 65536
private static final int INDEX_MASK = DATA_SIZE - 1; // 0xFFFF
private static final double RAND_SCALE = 1e39; // To provide a sensible range of operands,
// so that the actual calculations don't get bypassed
private final BigDecimal[] // Data to apply operations to
bdOp1 = new BigDecimal[DATA_SIZE], // BigDecimals
bdOp2 = new BigDecimal[DATA_SIZE],
bdResult = new BigDecimal[DATA_SIZE];
private final Quadruple[]
qOp1 = new Quadruple[DATA_SIZE], // Quadruples
qOp2 = new Quadruple[DATA_SIZE],
qResult = new Quadruple[DATA_SIZE];
private int index = 0;
#Setup
public void initData() {
final Random rand = new Random(12345); // for reproducibility
for (int i = 0; i < DATA_SIZE; i++) {
bdOp1[i] = randomBigDecimal(rand);
bdOp2[i] = randomBigDecimal(rand);
qOp1[i] = randomQuadruple(rand);
qOp2[i] = randomQuadruple(rand);
}
}
private static Quadruple randomQuadruple(Random rand) {
return Quadruple.nextNormalRandom(rand).multiply(RAND_SCALE); // ranged 0 .. 9.99e38
}
private static BigDecimal randomBigDecimal(Random rand) {
return Quadruple.nextNormalRandom(rand).multiply(RAND_SCALE).bigDecimalValue();
}
#Benchmark
public void a_bigDecimalAddition() {
bdResult[index] = bdOp1[index].add(bdOp2[index], MC_38);
index = ++index & INDEX_MASK;
}
#Benchmark
public void b_quadrupleAddition() {
// semantically the same as above
qResult[index] = Quadruple.add(qOp1[index], qOp2[index]);
index = ++index & INDEX_MASK;
}
// Other methods are similar
public static void main(String... args) throws IOException, RunnerException {
final Options opt = new OptionsBuilder()
.include(ArrayOperations.class.getSimpleName())
.forks(1)
.build();
new Runner(opt).run();
}
}

The reason was very simple, and I should have understood it immediately. Power saving mode was enabled in the OS, which reduced the clock frequency of the CPU under low load. The moral is, always disable power saving when benchmarking!

Performance comparison of modulo operator and bitwise AND

I'm working to determine if an 32bit integer is even or odd. I have setup 2 approaches:
modulo(%) approach
int r = (i % 2);
bitwise(&) approach
int r = (i & 0x1);
Both approaches work successfully. So I run each line for 15000 times to test performance.
Result:
modulo(%) approach (source code)
mean 141.5801887ns | SD 270.0700275ns
bitwise(&) approach (source code)
mean 141.2504ns | SD 193.6351007ns
Questions:
Why is bitwise(&) more stable than division(%) ?
Does JVM optimize modulo(%) using AND(&) according to here?

Let's try to reproduce with JMH.
#Benchmark
#Measurement(timeUnit = TimeUnit.NANOSECONDS)
#BenchmarkMode(Mode.AverageTime)
public int first() throws IOException {
return i % 2;
}
#Benchmark
#Measurement(timeUnit = TimeUnit.NANOSECONDS)
#BenchmarkMode(Mode.AverageTime)
public int second() throws IOException {
return i & 0x1;
}
Okay, it is reproducable. The first is slightly slower than the second. Now let's figure out why. Run it with -prof perfnorm:
Benchmark Mode Cnt Score Error Units
MyBenchmark.first avgt 50 2.674 ± 0.028 ns/op
MyBenchmark.first:CPI avgt 10 0.301 ± 0.002 #/op
MyBenchmark.first:L1-dcache-load-misses avgt 10 0.001 ± 0.001 #/op
MyBenchmark.first:L1-dcache-loads avgt 10 11.011 ± 0.146 #/op
MyBenchmark.first:L1-dcache-stores avgt 10 3.011 ± 0.034 #/op
MyBenchmark.first:L1-icache-load-misses avgt 10 ≈ 10⁻³ #/op
MyBenchmark.first:LLC-load-misses avgt 10 ≈ 10⁻⁴ #/op
MyBenchmark.first:LLC-loads avgt 10 ≈ 10⁻⁴ #/op
MyBenchmark.first:LLC-store-misses avgt 10 ≈ 10⁻⁵ #/op
MyBenchmark.first:LLC-stores avgt 10 ≈ 10⁻⁴ #/op
MyBenchmark.first:branch-misses avgt 10 ≈ 10⁻⁴ #/op
MyBenchmark.first:branches avgt 10 4.006 ± 0.054 #/op
MyBenchmark.first:cycles avgt 10 9.322 ± 0.113 #/op
MyBenchmark.first:dTLB-load-misses avgt 10 ≈ 10⁻⁴ #/op
MyBenchmark.first:dTLB-loads avgt 10 10.939 ± 0.175 #/op
MyBenchmark.first:dTLB-store-misses avgt 10 ≈ 10⁻⁵ #/op
MyBenchmark.first:dTLB-stores avgt 10 2.991 ± 0.045 #/op
MyBenchmark.first:iTLB-load-misses avgt 10 ≈ 10⁻⁵ #/op
MyBenchmark.first:iTLB-loads avgt 10 ≈ 10⁻⁴ #/op
MyBenchmark.first:instructions avgt 10 30.991 ± 0.427 #/op
MyBenchmark.second avgt 50 2.263 ± 0.015 ns/op
MyBenchmark.second:CPI avgt 10 0.320 ± 0.001 #/op
MyBenchmark.second:L1-dcache-load-misses avgt 10 0.001 ± 0.001 #/op
MyBenchmark.second:L1-dcache-loads avgt 10 11.045 ± 0.152 #/op
MyBenchmark.second:L1-dcache-stores avgt 10 3.014 ± 0.032 #/op
MyBenchmark.second:L1-icache-load-misses avgt 10 ≈ 10⁻³ #/op
MyBenchmark.second:LLC-load-misses avgt 10 ≈ 10⁻⁴ #/op
MyBenchmark.second:LLC-loads avgt 10 ≈ 10⁻⁴ #/op
MyBenchmark.second:LLC-store-misses avgt 10 ≈ 10⁻⁵ #/op
MyBenchmark.second:LLC-stores avgt 10 ≈ 10⁻⁴ #/op
MyBenchmark.second:branch-misses avgt 10 ≈ 10⁻⁴ #/op
MyBenchmark.second:branches avgt 10 4.014 ± 0.045 #/op
MyBenchmark.second:cycles avgt 10 8.024 ± 0.098 #/op
MyBenchmark.second:dTLB-load-misses avgt 10 ≈ 10⁻⁵ #/op
MyBenchmark.second:dTLB-loads avgt 10 10.989 ± 0.161 #/op
MyBenchmark.second:dTLB-store-misses avgt 10 ≈ 10⁻⁶ #/op
MyBenchmark.second:dTLB-stores avgt 10 3.004 ± 0.042 #/op
MyBenchmark.second:iTLB-load-misses avgt 10 ≈ 10⁻⁵ #/op
MyBenchmark.second:iTLB-loads avgt 10 ≈ 10⁻⁵ #/op
MyBenchmark.second:instructions avgt 10 25.076 ± 0.296 #/op
Note the difference in cycles and instructions. And now that's kind of obvious. The first does care about the sign, but the second does not (just bitwise and). To make sure this is the reason take a look at the assembly fragment:
first:
0x00007f91111f8355: mov 0xc(%r10),%r11d ;*getfield i
0x00007f91111f8359: mov %r11d,%edx
0x00007f91111f835c: and $0x1,%edx
0x00007f91111f835f: mov %edx,%r10d
0x00007f6bd120a6e2: neg %r10d
0x00007f6bd120a6e5: test %r11d,%r11d
0x00007f6bd120a6e8: cmovl %r10d,%edx
second:
0x00007ff36cbda580: mov $0x1,%edx
0x00007ff36cbda585: mov 0x40(%rsp),%r10
0x00007ff36cbda58a: and 0xc(%r10),%edx

An execution time of 150 ns is about 500 clock cycles. I don't think there has ever been a processor that went about the business of checking a bit this inefficiently :-).
The problem is that your test harness is flawed in many ways. In particular:
you make no attempt to trigger JIT compilation before starting the clock
System.nanotime() is not guaranteed to have nanosecond accuracy
System.nanotime() is quite a bit more expensive to call that the code you want to measure
See How do I write a correct micro-benchmark in Java? for a more complete list of things to keep in mind.
Here's a better benchmark:
public abstract class Benchmark {
final String name;
public Benchmark(String name) {
this.name = name;
}
#Override
public String toString() {
return name + "\t" + time() + " ns / iteration";
}
private BigDecimal time() {
try {
// automatically detect a reasonable iteration count (and trigger just in time compilation of the code under test)
int iterations;
long duration = 0;
for (iterations = 1; iterations < 1_000_000_000 && duration < 1_000_000_000; iterations *= 2) {
long start = System.nanoTime();
run(iterations);
duration = System.nanoTime() - start;
cleanup();
}
return new BigDecimal((duration) * 1000 / iterations).movePointLeft(3);
} catch (Throwable e) {
throw new RuntimeException(e);
}
}
/**
* Executes the code under test.
* #param iterations
* number of iterations to perform
* #return any value that requires the entire code to be executed (to
* prevent dead code elimination by the just in time compiler)
* #throws Throwable
* if the test could not complete successfully
*/
protected abstract Object run(int iterations) throws Throwable;
/**
* Cleans up after a run, setting the stage for the next.
*/
protected void cleanup() {
// do nothing
}
public static void main(String[] args) throws Exception {
System.out.println(new Benchmark("%") {
#Override
protected Object run(int iterations) throws Throwable {
int sum = 0;
for (int i = 0; i < iterations; i++) {
sum += i % 2;
}
return sum;
}
});
System.out.println(new Benchmark("&") {
#Override
protected Object run(int iterations) throws Throwable {
int sum = 0;
for (int i = 0; i < iterations; i++) {
sum += i & 1;
}
return sum;
}
});
}
}
On my machine, it prints:
% 0.375 ns / iteration
& 0.139 ns / iteration
So the difference is, as expected, on the order of a couple clock cycles. That is, & 1 was optimized slightly better by this JIT on this particular hardware, but the difference is so small it is extremely unlikely to have a measurable (let alone significant) impact on the performance of your program.

The two operations correspond to different JVM processor instructions:
irem // int remainder (%)
iand // bitwise and (&)
Somewhere I read irem is usually implemented by the JVM, while iand is available on hardware. Oracle explains the two instructions as follows:
iand
An int result is calculated by taking the bitwise AND (conjunction) of value1 and value2.
irem
The int result is value1 - (value1 / value2) * value2.
It seems reasonable to me to assume that iand results in less CPU cycles.

Why Java Optional performance increase with number of chained calls?

I was recently asked about the performance of java 8 Optional. After some searching, I found this question and several blog posts, with contradicting answers. So I benchmarked it using JMH and I don't understand my findings.
Here is the gist of my benchmark code (full code is available on GitHub):
#State(Scope.Benchmark)
public class OptionalBenchmark {
private Room room;
#Param({ "empty", "small", "large", "full" })
private String filling;
#Setup
public void setUp () {
switch (filling) {
case "empty":
room = null;
break;
case "small":
room = new Room(new Flat(new Floor(null)));
break;
case "large":
room = new Room(new Flat(new Floor(new Building(new Block(new District(null))))));
break;
case "full":
room = new Room(new Flat(new Floor(new Building(new Block(new District(new City(new Country("France"))))))));
break;
default:
throw new IllegalStateException("Unsupported filling.");
}
}
#Benchmark
public String nullChecks () {
if (room == null) {
return null;
}
Flat flat = room.getFlat();
if (flat == null) {
return null;
}
Floor floor = flat.getFloor();
if (floor == null) {
return null;
}
Building building = floor.getBuilding();
if (building == null) {
return null;
}
Block block = building.getBlock();
if (block == null) {
return null;
}
District district = block.getDistrict();
if (district == null) {
return null;
}
City city = district.getCity();
if (city == null) {
return null;
}
Country country = city.getCountry();
if (country == null) {
return null;
}
return country.getName();
}
#Benchmark
public String optionalsWithMethodRefs () {
return Optional.ofNullable (room)
.map (Room::getFlat)
.map (Flat::getFloor)
.map (Floor::getBuilding)
.map (Building::getBlock)
.map (Block::getDistrict)
.map (District::getCity)
.map (City::getCountry)
.map (Country::getName)
.orElse (null);
}
#Benchmark
public String optionalsWithLambdas () {
return Optional.ofNullable (room)
.map (room -> room.getFlat ())
.map (flat -> flat.getFloor ())
.map (floor -> floor.getBuilding ())
.map (building -> building.getBlock ())
.map (block -> block.getDistrict ())
.map (district -> district.getCity ())
.map (city -> city.getCountry ())
.map (country -> country.getName ())
.orElse (null);
}
}
And the results I got were:
Benchmark (filling) Mode Cnt Score Error Units
OptionalBenchmark.nullChecks empty thrpt 200 468835378.093 ± 895576.864 ops/s
OptionalBenchmark.nullChecks small thrpt 200 306602013.907 ± 136966.520 ops/s
OptionalBenchmark.nullChecks large thrpt 200 259996142.619 ± 307584.215 ops/s
OptionalBenchmark.nullChecks full thrpt 200 275954974.981 ± 4154597.959 ops/s
OptionalBenchmark.optionalsWithLambdas empty thrpt 200 460491457.335 ± 322920.650 ops/s
OptionalBenchmark.optionalsWithLambdas small thrpt 200 98604468.453 ± 68320.074 ops/s
OptionalBenchmark.optionalsWithLambdas large thrpt 200 67648427.470 ± 206810.285 ops/s
OptionalBenchmark.optionalsWithLambdas full thrpt 200 167124820.392 ± 1229924.561 ops/s
OptionalBenchmark.optionalsWithMethodRefs empty thrpt 200 460690135.554 ± 273853.568 ops/s
OptionalBenchmark.optionalsWithMethodRefs small thrpt 200 98639064.680 ± 56848.805 ops/s
OptionalBenchmark.optionalsWithMethodRefs large thrpt 200 68138436.113 ± 158409.539 ops/s
OptionalBenchmark.optionalsWithMethodRefs full thrpt 200 169603006.971 ± 52646.423 ops/s
First of all, when given a null reference, Optional and null checks behave pretty much the same. I guess this is because there is only one instance of Optional.empty (), so any .map () method call on it just returns itself.
When the given object is non-null and contains a chain of non-null attributes, however, a new Optional has to be instantiated on each call to .map (). Hence, performance degrade much more quickly than with null checks. Makes sense. Expect for my full filling, where the performance all of a sudden increase. So what is the magic going on here? Am I doing something wrong in my benchmark?
Edit
The parameters from my first run were the default from JMH: each benchmark was ran in 10 different forks, with 20 warmup iterations of 1s each, and then 20 measurement iterations of 1s each. I believe those value are sane, since I trust the libraries I use. However, since I was told I wasn’t warming up enough, here is the result of a longer test (200 warmup iterations and 200 measurement iteration for each of the 10 forks):
# JMH version: 1.19
# VM version: JDK 1.8.0_152, VM 25.152-b16
# VM invoker: /Library/Java/JavaVirtualMachines/jdk1.8.0_152.jdk/Contents/Home/jre/bin/java
# VM options: <none>
# Warmup: 200 iterations, 1 s each
# Measurement: 200 iterations, 1 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Run complete. Total time: 17:49:25
Benchmark (filling) Mode Cnt Score Error Units
OptionalBenchmark.nullChecks empty thrpt 2000 471803721.972 ± 116120.114 ops/s
OptionalBenchmark.nullChecks small thrpt 2000 289181482.246 ± 3967502.916 ops/s
OptionalBenchmark.nullChecks large thrpt 2000 260222478.406 ± 105074.121 ops/s
OptionalBenchmark.nullChecks full thrpt 2000 282487728.710 ± 71214.637 ops/s
OptionalBenchmark.optionalsWithLambdas empty thrpt 2000 460931830.242 ± 335263.946 ops/s
OptionalBenchmark.optionalsWithLambdas small thrpt 2000 98688943.879 ± 20485.863 ops/s
OptionalBenchmark.optionalsWithLambdas large thrpt 2000 67262330.106 ± 50465.262 ops/s
OptionalBenchmark.optionalsWithLambdas full thrpt 2000 168070919.770 ± 352435.666 ops/s
OptionalBenchmark.optionalsWithMethodRefs empty thrpt 2000 460998599.579 ± 85063.337 ops/s
OptionalBenchmark.optionalsWithMethodRefs small thrpt 2000 98707338.408 ± 17231.648 ops/s
OptionalBenchmark.optionalsWithMethodRefs large thrpt 2000 68052673.021 ± 55285.427 ops/s
OptionalBenchmark.optionalsWithMethodRefs full thrpt 2000 169259067.479 ± 174402.212 ops/s
As you can see, we have almost the same figures.

Even such a powerful tool like JMH is not able to save from all benchmarking pitfalls.
I've found two different issues with this benchmark.
1.
HotSpot JIT compiler speculatively optimizes code basing on runtime profile. In the given "full" scenario Optional never sees null values. That's why Optional.ofNullable method (also called by Optional.map) happens to be optimized exclusively for non-null path which constructs a new non-empty Optional. In this case JIT is able to eliminate all short-lived allocations and perform all map operations without intermediate objects.
public static <T> Optional<T> ofNullable(T value) {
return value == null ? empty() : of(value);
}
In "small" and "large" scenarios the mapping sequence finally ends with Optional.empty(). That is, both branches of ofNullable method are compiled, and JIT is no longer able to eliminate allocations of intermediate Optional objects - data flow graph appears to be too complex for Escape Analysis to succeed.
Check it by running JMH with -prof gc, and you'll see that "small" allocates 48 bytes (3 Optionals) per iteration, "large" allocates 96 bytes (6 Optionals), and "full" allocates nothing.
Benchmark (filling) Mode Cnt Score Error Units
OptionalBenchmark.optionalsWithMethodRefs:·gc.alloc.rate.norm empty avgt 5 ≈ 10⁻⁶ B/op
OptionalBenchmark.optionalsWithMethodRefs:·gc.alloc.rate.norm small avgt 5 48,000 ± 0,001 B/op
OptionalBenchmark.optionalsWithMethodRefs:·gc.alloc.rate.norm large avgt 5 96,000 ± 0,001 B/op
OptionalBenchmark.optionalsWithMethodRefs:·gc.alloc.rate.norm full avgt 5 ≈ 10⁻⁵ B/op
If you replace new Country("France") with new Country(null), the opimization will also break, and "full" scenario will become expectedly slower than "small" and "large".
Alternatively, the following dummy loop added to setUp will also prevent from overoptimizing ofNullable making the benchmark results more realistic.
for (int i = 0; i < 1000; i++) {
Optional.ofNullable(null);
}
2.
Surprisingly, nullChecks benchmark also appears faster in "full" scenario. The reason here is class initialization barriers. Note that only "full" case initializes all related classes. In "small" and "large" cases nullChecks method refers to some classes that are not yet initialized. This prevents from compiling nullChecks efficiently.
If you explicitly initialize all the classes in setUp, e.g. by creating a dummy object, then "empty", "small" and "large" scenarios of nullChecks will become faster.
Room dummy = new Room(new Flat(new Floor(new Building(new Block(new District(new City(new Country("France"))))))))

Construct a string from a repeated character

In Java I need to construct a string of n zeros with n unknown at compile time. Ideally I'd use
String s = new String('0', n);
But no such constructor exists. CharSequence doesn't seem to have a suitable constructor either. So I'm tempted to build my own loop using StringBuilder.
Before I do this and risk getting defenestrated by my boss, could anyone advise: is there a standard way of doing this in Java? In C++, one of the std::string constructors allows this.

If you don't mind creating an extra string:
String zeros = new String(new char[n]).replace((char) 0, '0');
Or more explicit (and probably more efficient):
char[] c = new char[n];
Arrays.fill(c, '0');
String zeros = new String(c);
Performance wise, the Arrays.fill option seems to perform better in most situations, but especially for large strings. Using a StringBuilder is quite slow for large strings but efficient for small ones. Using replace is a nice one liner and performs ok for larger strings, but not as well as filll.
Micro benchmark for different values of n:
Benchmark (n) Mode Samples Score Error Units
c.a.p.SO26504151.builder 1 avgt 3 29.452 ± 1.849 ns/op
c.a.p.SO26504151.builder 10 avgt 3 51.641 ± 12.426 ns/op
c.a.p.SO26504151.builder 1000 avgt 3 2681.956 ± 336.353 ns/op
c.a.p.SO26504151.builder 1000000 avgt 3 3522995.218 ± 422579.979 ns/op
c.a.p.SO26504151.fill 1 avgt 3 30.255 ± 0.297 ns/op
c.a.p.SO26504151.fill 10 avgt 3 32.638 ± 7.553 ns/op
c.a.p.SO26504151.fill 1000 avgt 3 592.459 ± 91.413 ns/op
c.a.p.SO26504151.fill 1000000 avgt 3 706187.003 ± 152774.601 ns/op
c.a.p.SO26504151.replace 1 avgt 3 44.366 ± 5.153 ns/op
c.a.p.SO26504151.replace 10 avgt 3 51.778 ± 2.959 ns/op
c.a.p.SO26504151.replace 1000 avgt 3 1385.383 ± 289.319 ns/op
c.a.p.SO26504151.replace 1000000 avgt 3 1486335.886 ± 1807239.775 ns/op

Create a n sized char array and convert it to String:
char[] myZeroCharArray = new char[n];
for(int i = 0; i < n; i++) myZeroCharArray[i] = '0';
String myZeroString = new String(myZeroCharArray);

See StringUtils in Apache Commons Lang
https://commons.apache.org/proper/commons-lang/javadocs/api-2.6/org/apache/commons/lang/StringUtils.html#repeat%28java.lang.String,%20int%29

There isn't a standard JDK way, but Apache commons (almost a defacto standard), has the StringUtils.repeat() method, e.g.:
String s = StringUtils.repeat('x', 5); // s = "xxxxx"

or the plain old String Format
int n = 10;
String s = String.format("%" + n + "s", "").replace(' ', '0');
System.out.println(s);

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Why Java compiled regex works slower then interpreted in String::split? - java

Related

Java Collections.reverseOrder() vs. Comparator

JMH: strange dependency on the environment

Performance comparison of modulo operator and bitwise AND

Why Java Optional performance increase with number of chained calls?

Construct a string from a repeated character

Categories

Resources