While analyzing the results of a recent question here, I encountered a quite peculiar phenomenon: apparently an extra layer of HotSpot's JIT-optimization actually slows down execution on my machine.
Here is the code I have used for the measurement:
#OutputTimeUnit(TimeUnit.NANOSECONDS)
#BenchmarkMode(Mode.AverageTime)
#OperationsPerInvocation(Measure.ARRAY_SIZE)
#Warmup(iterations = 2, time = 1)
#Measurement(iterations = 5, time = 1)
#State(Scope.Thread)
#Threads(1)
#Fork(2)
public class Measure
{
public static final int ARRAY_SIZE = 1024;
private final int[] array = new int[ARRAY_SIZE];
#Setup public void setup() {
final Random random = new Random();
for (int i = 0; i < ARRAY_SIZE; ++i) {
final int x = random.nextInt();
array[i] = x == 0? 1 : x;
}
}
#GenerateMicroBenchmark public int normalIndex() {
final int[] array = this.array;
int result = 0;
for (int i = 0; i < array.length; i++) {
final int j = i & array.length-1;
final int entry = array[i];
result ^= entry + j;
}
return result;
}
#GenerateMicroBenchmark public int maskedIndex() {
final int[] array = this.array;
int result = 0;
for (int i = 0; i < array.length; i++) {
final int j = i & array.length-1;
final int entry = array[j];
result ^= entry + i;
}
return result;
}
#GenerateMicroBenchmark public int normalWithExitPoint() {
final int[] array = this.array;
int result = 0;
for (int i = 0; i < array.length; i++) {
final int j = i & array.length-1;
final int entry = array[i];
result ^= entry + j;
if (entry == 0) break;
}
return result;
}
#GenerateMicroBenchmark public int maskedWithExitPoint() {
final int[] array = this.array;
int result = 0;
for (int i = 0; i < array.length; i++) {
final int j = i & array.length-1;
final int entry = array[j];
result ^= entry + i;
if (entry == 0) break;
}
return result;
}
}
The code is quite subtle so let me point out the important bits:
the "normal index" variants use the straight variable i for array index. HotSpot can easily determine the range of i throughout the loop and eliminate the array bounds check;
the "masked index" variants index by j, which is actually equal to i, but that fact is "hidden" from HotSpot through the AND-masking operation;
the "with exit point" variants introduce an explict loop exit point. The importance of this will be explained below.
Loop unrolling and reordering
Observe that the bounds check figures in two important ways:
it has runtime overhead associated with it (a comparison followed by a conditional branch);
it constitutes a loop exit point which can interrupt the loop on any step. This turns out to have important consequences for the applicable JIT optimizations.
By inspecting the machine code emitted for the above four methods, I have noticed the following:
in all cases the loop is unrolled;
in the case of normalIndex, which is distinguished as the only one having no premature loop exit points, the operations of all the unrolled steps are reordered such that first all the array fetches are performed, followed by XOR-ing all the values into the accumulator.
Expected and actual measurement results
Now we can classify the four methods according to the discussed features:
normalIndex has no bounds checks and no loop exit points;
normalWithExitPoint has no bounds checks and 1 exit point;
maskedIndex has 1 bounds check and 1 exit point;
maskedWithExitPoint has 1 bounds check and 2 exit points.
The obvious expectation is that the above list should present the methods in descending order of performance; however, these are my actual results:
Benchmark Mode Samples Mean Mean error Units
normalIndex avgt 20 0.946 0.010 ns/op
normalWithExitPoint avgt 20 0.807 0.010 ns/op
maskedIndex avgt 20 0.803 0.007 ns/op
maskedWithExitPoint avgt 20 1.007 0.009 ns/op
normalWithExitPoint and maskedIndex are identical modulo measurement error, even though only the latter has the bounds check;
the greatest anomaly is observed on normalIndex which should have been the fastest, but is significantly slower than normalWithExitPoint, identical to it in every respect save for having one more line of code, the one which introduces the exit point.
Since normalIndex is the only method which has the extra reordering "optimization" applied to it, the conclusion is that this is the cause of the slowdown.
I am testing on:
Java HotSpot(TM) 64-Bit Server VM (build 24.0-b56, mixed mode) (Java 7 Update 40)
OS X Version 10.9.1
2.66 GHz Intel Core i7
I have successfully reproduced the results on Java 8 EA b118 as well.
My question:
Is the above phenomenon reproducible on other, similar machines? From the question mentioned at the beginning I already have a hint that at least some machines do not reproduce it, so another result from the same CPU would be very interesting.
Update 1: more measurements inspired by maaartinus's findings
I have gathered the following table which correlates execution times with the -XX:LoopUnrollLimit command-line argument. Here I focus only on two variants, with and without the if (entry == 0) break; line:
LoopUnrollLimit: 14 15 18 19 22 23 60
withExitPoint: 96 95 95 79 80 80 69 1/100 ns
withoutExitPoint: 94 64 64 63 64 77 75 1/100 ns
The following sudden changes can be observed:
on the transition from 14 to 15, the withoutExitPoint variant receives a beneficial LCM1 transformation and is significantly sped up. Due to the loop-unrolling limit, all loaded values fit into registers;
on 18->19, the withExitPoint variant receives a speedup, lesser than the above;
on 22->23, the withoutExitPoint variant is slowed down. At this point I see the spill-over into stack locations, as described in maaartinus's answer, starting to happen.
The default loopUnrollLimit for my setup is 60, so I present its results in the last column.
1 LCM = Local Code Motion. It is the transformation which causes all array access to happen on the top, followed by the processing of the loaded values.
Update 2: this is actually a known, reported problem
https://bugs.openjdk.java.net/browse/JDK-7101232
Appendix: The unrolled and reordered loop of normalIndex in machine code
0x00000001044a37c0: mov ecx,eax
0x00000001044a37c2: and ecx,esi ;*iand
; - org.sample.Measure::normalIndex#20 (line 44)
0x00000001044a37c4: mov rbp,QWORD PTR [rsp+0x28] ;*iload_3
; - org.sample.Measure::normalIndex#15 (line 44)
0x00000001044a37c9: add ecx,DWORD PTR [rbp+rsi*4+0x10]
0x00000001044a37cd: xor ecx,r8d
0x00000001044a37d0: mov DWORD PTR [rsp],ecx
0x00000001044a37d3: mov r10d,esi
0x00000001044a37d6: add r10d,0xf
0x00000001044a37da: and r10d,eax
0x00000001044a37dd: mov r8d,esi
0x00000001044a37e0: add r8d,0x7
0x00000001044a37e4: and r8d,eax
0x00000001044a37e7: mov DWORD PTR [rsp+0x4],r8d
0x00000001044a37ec: mov r11d,esi
0x00000001044a37ef: add r11d,0x6
0x00000001044a37f3: and r11d,eax
0x00000001044a37f6: mov DWORD PTR [rsp+0x8],r11d
0x00000001044a37fb: mov r8d,esi
0x00000001044a37fe: add r8d,0x5
0x00000001044a3802: and r8d,eax
0x00000001044a3805: mov DWORD PTR [rsp+0xc],r8d
0x00000001044a380a: mov r11d,esi
0x00000001044a380d: inc r11d
0x00000001044a3810: and r11d,eax
0x00000001044a3813: mov DWORD PTR [rsp+0x10],r11d
0x00000001044a3818: mov r8d,esi
0x00000001044a381b: add r8d,0x2
0x00000001044a381f: and r8d,eax
0x00000001044a3822: mov DWORD PTR [rsp+0x14],r8d
0x00000001044a3827: mov r11d,esi
0x00000001044a382a: add r11d,0x3
0x00000001044a382e: and r11d,eax
0x00000001044a3831: mov r9d,esi
0x00000001044a3834: add r9d,0x4
0x00000001044a3838: and r9d,eax
0x00000001044a383b: mov r8d,esi
0x00000001044a383e: add r8d,0x8
0x00000001044a3842: and r8d,eax
0x00000001044a3845: mov DWORD PTR [rsp+0x18],r8d
0x00000001044a384a: mov r8d,esi
0x00000001044a384d: add r8d,0x9
0x00000001044a3851: and r8d,eax
0x00000001044a3854: mov ebx,esi
0x00000001044a3856: add ebx,0xa
0x00000001044a3859: and ebx,eax
0x00000001044a385b: mov ecx,esi
0x00000001044a385d: add ecx,0xb
0x00000001044a3860: and ecx,eax
0x00000001044a3862: mov edx,esi
0x00000001044a3864: add edx,0xc
0x00000001044a3867: and edx,eax
0x00000001044a3869: mov edi,esi
0x00000001044a386b: add edi,0xd
0x00000001044a386e: and edi,eax
0x00000001044a3870: mov r13d,esi
0x00000001044a3873: add r13d,0xe
0x00000001044a3877: and r13d,eax
0x00000001044a387a: movsxd r14,esi
0x00000001044a387d: add r10d,DWORD PTR [rbp+r14*4+0x4c]
0x00000001044a3882: mov DWORD PTR [rsp+0x24],r10d
0x00000001044a3887: mov QWORD PTR [rsp+0x28],rbp
0x00000001044a388c: mov ebp,DWORD PTR [rsp+0x4]
0x00000001044a3890: mov r10,QWORD PTR [rsp+0x28]
0x00000001044a3895: add ebp,DWORD PTR [r10+r14*4+0x2c]
0x00000001044a389a: mov DWORD PTR [rsp+0x4],ebp
0x00000001044a389e: mov r10d,DWORD PTR [rsp+0x8]
0x00000001044a38a3: mov rbp,QWORD PTR [rsp+0x28]
0x00000001044a38a8: add r10d,DWORD PTR [rbp+r14*4+0x28]
0x00000001044a38ad: mov DWORD PTR [rsp+0x8],r10d
0x00000001044a38b2: mov r10d,DWORD PTR [rsp+0xc]
0x00000001044a38b7: add r10d,DWORD PTR [rbp+r14*4+0x24]
0x00000001044a38bc: mov DWORD PTR [rsp+0xc],r10d
0x00000001044a38c1: mov r10d,DWORD PTR [rsp+0x10]
0x00000001044a38c6: add r10d,DWORD PTR [rbp+r14*4+0x14]
0x00000001044a38cb: mov DWORD PTR [rsp+0x10],r10d
0x00000001044a38d0: mov r10d,DWORD PTR [rsp+0x14]
0x00000001044a38d5: add r10d,DWORD PTR [rbp+r14*4+0x18]
0x00000001044a38da: mov DWORD PTR [rsp+0x14],r10d
0x00000001044a38df: add r13d,DWORD PTR [rbp+r14*4+0x48]
0x00000001044a38e4: add r11d,DWORD PTR [rbp+r14*4+0x1c]
0x00000001044a38e9: add r9d,DWORD PTR [rbp+r14*4+0x20]
0x00000001044a38ee: mov r10d,DWORD PTR [rsp+0x18]
0x00000001044a38f3: add r10d,DWORD PTR [rbp+r14*4+0x30]
0x00000001044a38f8: mov DWORD PTR [rsp+0x18],r10d
0x00000001044a38fd: add r8d,DWORD PTR [rbp+r14*4+0x34]
0x00000001044a3902: add ebx,DWORD PTR [rbp+r14*4+0x38]
0x00000001044a3907: add ecx,DWORD PTR [rbp+r14*4+0x3c]
0x00000001044a390c: add edx,DWORD PTR [rbp+r14*4+0x40]
0x00000001044a3911: add edi,DWORD PTR [rbp+r14*4+0x44]
0x00000001044a3916: mov r10d,DWORD PTR [rsp+0x10]
0x00000001044a391b: xor r10d,DWORD PTR [rsp]
0x00000001044a391f: mov ebp,DWORD PTR [rsp+0x14]
0x00000001044a3923: xor ebp,r10d
0x00000001044a3926: xor r11d,ebp
0x00000001044a3929: xor r9d,r11d
0x00000001044a392c: xor r9d,DWORD PTR [rsp+0xc]
0x00000001044a3931: xor r9d,DWORD PTR [rsp+0x8]
0x00000001044a3936: xor r9d,DWORD PTR [rsp+0x4]
0x00000001044a393b: mov r10d,DWORD PTR [rsp+0x18]
0x00000001044a3940: xor r10d,r9d
0x00000001044a3943: xor r8d,r10d
0x00000001044a3946: xor ebx,r8d
0x00000001044a3949: xor ecx,ebx
0x00000001044a394b: xor edx,ecx
0x00000001044a394d: xor edi,edx
0x00000001044a394f: xor r13d,edi
0x00000001044a3952: mov r8d,DWORD PTR [rsp+0x24]
0x00000001044a3957: xor r8d,r13d ;*ixor
; - org.sample.Measure::normalIndex#34 (line 46)
0x00000001044a395a: add esi,0x10 ;*iinc
; - org.sample.Measure::normalIndex#36 (line 43)
0x00000001044a395d: cmp esi,DWORD PTR [rsp+0x20]
0x00000001044a3961: jl 0x00000001044a37c0 ;*if_icmpge
; - org.sample.Measure::normalIndex#12 (line 43)
The reason why the JITC tries to group everything together is rather unclear to me. AFAIK there are (were?) architectures for which grouping of two loads leads to a better performance (some early Pentium, I think).
As JITC knows the hot spots, it can inline much more aggressively than an ahead-of-time compiler, so it does it 16 times in this case. I can't see any clear advantage thereof here, except for making the looping relatively cheaper. I also doubt that there's any architecture profiting from grouping 16 loads together.
The code computes 16 temporary values, one per iteration of
int j = i & array.length-1;
int entry = array[i];
int tmp = entry + j;
result ^= tmp;
Each computation is pretty trivial, one AND, one LOAD, and one ADD. The values are to be mapped to registers, but there aren't enough of them. So the values have to be stored and loaded later.
This happens for 7 out of the 16 registers and increases the costs significantly.
Update
I'm not very sure about verifying this by using -XX:LoopUnrollLimit:
LoopUnrollLimit Benchmark Mean Mean error Units
8 ..normalIndex 0.902 0.004 ns/op
8 ..normalWithExitPoint 0.913 0.005 ns/op
8 ..maskedIndex 0.918 0.006 ns/op
8 ..maskedWithExitPoint 0.996 0.008 ns/op
16 ..normalIndex 0.769 0.003 ns/op
16 ..normalWithExitPoint 0.930 0.004 ns/op
16 ..maskedIndex 0.937 0.004 ns/op
16 ..maskedWithExitPoint 1.012 0.003 ns/op
32 ..normalIndex 0.814 0.003 ns/op
32 ..normalWithExitPoint 0.816 0.005 ns/op
32 ..maskedIndex 0.838 0.003 ns/op
32 ..maskedWithExitPoint 0.978 0.002 ns/op
- ..normalIndex 0.830 0.002 ns/op
- ..normalWithExitPoint 0.683 0.002 ns/op
- ..maskedIndex 0.791 0.005 ns/op
- ..maskedWithExitPoint 0.908 0.003 ns/op
The limit of 16 makes normalIndex be the fastest variant, which indicates that I was right with the "overallocation penalty". Bur according to Marko, the generated assembly changes with the unroll limit also in other aspects, so things are more complicated.
Related
In this code:
if (value >= x && value <= y) {
when value >= x and value <= y are as likely true as false with no particular pattern, would using the & operator be faster than using &&?
Specifically, I am thinking about how && lazily evaluates the right-hand-side expression (ie only if the LHS is true), which implies a conditional, whereas in Java & in this context guarantees strict evaluation of both (boolean) sub-expressions. The value result is the same either way.
But whilst a >= or <= operator will use a simple comparison instruction, the && must involve a branch, and that branch is susceptible to branch prediction failure - as per this Very Famous Question: Why is it faster to process a sorted array than an unsorted array?
So, forcing the expression to have no lazy components will surely be more deterministic and not be vulnerable to prediction failure. Right?
Notes:
obviously the answer to my question would be No if the code looked like this: if(value >= x && verySlowFunction()). I am focusing on "sufficiently simple" RHS expressions.
there's a conditional branch in there anyway (the if statement). I can't quite prove to myself that that is irrelevant, and that alternative formulations might be better examples, like boolean b = value >= x && value <= y;
this all falls into the world of horrendous micro-optimizations. Yeah, I know :-) ... interesting though?
Update
Just to explain why I'm interested: I've been staring at the systems that Martin Thompson has been writing about on his Mechanical Sympathy blog, after he came and did a talk about Aeron. One of the key messages is that our hardware has all this magical stuff in it, and we software developers tragically fail to take advantage of it. Don't worry, I'm not about to go s/&&/\&/ on all my code :-) ... but there are a number of questions on this site on improving branch prediction by removing branches, and it occurred to me that the conditional boolean operators are at the core of test conditions.
Of course, #StephenC makes the fantastic point that bending your code into weird shapes can make it less easy for JITs to spot common optimizations - if not now, then in the future. And that the Very Famous Question mentioned above is special because it pushes the prediction complexity far beyond practical optimization.
I'm pretty much aware that in most (or almost all) situations, && is the clearest, simplest, fastest, best thing to do - although I'm very grateful to the people who have posted answers demonstrating this! I'm really interested to see if there are actually any cases in anyone's experience where the answer to "Can & be faster?" might be Yes...
Update 2:
(Addressing advice that the question is overly broad. I don't want to make major changes to this question because it might compromise some of the answers below, which are of exceptional quality!) Perhaps an example in the wild is called for; this is from the Guava LongMath class (thanks hugely to #maaartinus for finding this):
public static boolean isPowerOfTwo(long x) {
return x > 0 & (x & (x - 1)) == 0;
}
See that first &? And if you check the link, the next method is called lessThanBranchFree(...), which hints that we are in branch-avoidance territory - and Guava is really widely used: every cycle saved causes sea-levels to drop visibly. So let's put the question this way: is this use of & (where && would be more normal) a real optimization?
Ok, so you want to know how it behaves at the lower level... Let's have a look at the bytecode then!
EDIT : added the generated assembly code for AMD64, at the end. Have a look for some interesting notes.
EDIT 2 (re: OP's "Update 2"): added asm code for Guava's isPowerOfTwo method as well.
Java source
I wrote these two quick methods:
public boolean AndSC(int x, int value, int y) {
return value >= x && value <= y;
}
public boolean AndNonSC(int x, int value, int y) {
return value >= x & value <= y;
}
As you can see, they are exactly the same, save for the type of AND operator.
Java bytecode
And this is the generated bytecode:
public AndSC(III)Z
L0
LINENUMBER 8 L0
ILOAD 2
ILOAD 1
IF_ICMPLT L1
ILOAD 2
ILOAD 3
IF_ICMPGT L1
L2
LINENUMBER 9 L2
ICONST_1
IRETURN
L1
LINENUMBER 11 L1
FRAME SAME
ICONST_0
IRETURN
L3
LOCALVARIABLE this Ltest/lsoto/AndTest; L0 L3 0
LOCALVARIABLE x I L0 L3 1
LOCALVARIABLE value I L0 L3 2
LOCALVARIABLE y I L0 L3 3
MAXSTACK = 2
MAXLOCALS = 4
// access flags 0x1
public AndNonSC(III)Z
L0
LINENUMBER 15 L0
ILOAD 2
ILOAD 1
IF_ICMPLT L1
ICONST_1
GOTO L2
L1
FRAME SAME
ICONST_0
L2
FRAME SAME1 I
ILOAD 2
ILOAD 3
IF_ICMPGT L3
ICONST_1
GOTO L4
L3
FRAME SAME1 I
ICONST_0
L4
FRAME FULL [test/lsoto/AndTest I I I] [I I]
IAND
IFEQ L5
L6
LINENUMBER 16 L6
ICONST_1
IRETURN
L5
LINENUMBER 18 L5
FRAME SAME
ICONST_0
IRETURN
L7
LOCALVARIABLE this Ltest/lsoto/AndTest; L0 L7 0
LOCALVARIABLE x I L0 L7 1
LOCALVARIABLE value I L0 L7 2
LOCALVARIABLE y I L0 L7 3
MAXSTACK = 3
MAXLOCALS = 4
The AndSC (&&) method generates two conditional jumps, as expected:
It loads value and x onto the stack, and jumps to L1 if value is lower. Else it keeps running the next lines.
It loads value and y onto the stack, and jumps to L1 also, if value is greater. Else it keeps running the next lines.
Which happen to be a return true in case none of the two jumps were made.
And then we have the lines marked as L1 which are a return false.
The AndNonSC (&) method, however, generates three conditional jumps!
It loads value and x onto the stack and jumps to L1 if value is lower. Because now it needs to save the result to compare it with the other part of the AND, so it has to execute either "save true" or "save false", it can't do both with the same instruction.
It loads value and y onto the stack and jumps to L1 if value is greater. Once again it needs to save true or false and that's two different lines depending on the comparison result.
Now that both comparisons are done, the code actually executes the AND operation -- and if both are true, it jumps (for a third time) to return true; or else it continues execution onto the next line to return false.
(Preliminary) Conclusion
Though I'm not that very much experienced with Java bytecode and I may have overlooked something, it seems to me that & will actually perform worse than && in every case: it generates more instructions to execute, including more conditional jumps to predict and possibly fail at.
A rewriting of the code to replace comparisons with arithmetical operations, as someone else proposed, might be a way to make & a better option, but at the cost of making the code much less clear.
IMHO it is not worth the hassle for 99% of the scenarios (it may be very well worth it for the 1% loops that need to be extremely optimized, though).
EDIT: AMD64 assembly
As noted in the comments, the same Java bytecode can lead to different machine code in different systems, so while the Java bytecode might give us a hint about which AND version performs better, getting the actual ASM as generated by the compiler is the only way to really find out.
I printed the AMD64 ASM instructions for both methods; below are the relevant lines (stripped entry points etc.).
NOTE: all methods compiled with java 1.8.0_91 unless otherwise stated.
Method AndSC with default options
# {method} {0x0000000016da0810} 'AndSC' '(III)Z' in 'AndTest'
...
0x0000000002923e3e: cmp %r8d,%r9d
0x0000000002923e41: movabs $0x16da0a08,%rax ; {metadata(method data for {method} {0x0000000016da0810} 'AndSC' '(III)Z' in 'AndTest')}
0x0000000002923e4b: movabs $0x108,%rsi
0x0000000002923e55: jl 0x0000000002923e65
0x0000000002923e5b: movabs $0x118,%rsi
0x0000000002923e65: mov (%rax,%rsi,1),%rbx
0x0000000002923e69: lea 0x1(%rbx),%rbx
0x0000000002923e6d: mov %rbx,(%rax,%rsi,1)
0x0000000002923e71: jl 0x0000000002923eb0 ;*if_icmplt
; - AndTest::AndSC#2 (line 22)
0x0000000002923e77: cmp %edi,%r9d
0x0000000002923e7a: movabs $0x16da0a08,%rax ; {metadata(method data for {method} {0x0000000016da0810} 'AndSC' '(III)Z' in 'AndTest')}
0x0000000002923e84: movabs $0x128,%rsi
0x0000000002923e8e: jg 0x0000000002923e9e
0x0000000002923e94: movabs $0x138,%rsi
0x0000000002923e9e: mov (%rax,%rsi,1),%rdi
0x0000000002923ea2: lea 0x1(%rdi),%rdi
0x0000000002923ea6: mov %rdi,(%rax,%rsi,1)
0x0000000002923eaa: jle 0x0000000002923ec1 ;*if_icmpgt
; - AndTest::AndSC#7 (line 22)
0x0000000002923eb0: mov $0x0,%eax
0x0000000002923eb5: add $0x30,%rsp
0x0000000002923eb9: pop %rbp
0x0000000002923eba: test %eax,-0x1c73dc0(%rip) # 0x0000000000cb0100
; {poll_return}
0x0000000002923ec0: retq ;*ireturn
; - AndTest::AndSC#13 (line 25)
0x0000000002923ec1: mov $0x1,%eax
0x0000000002923ec6: add $0x30,%rsp
0x0000000002923eca: pop %rbp
0x0000000002923ecb: test %eax,-0x1c73dd1(%rip) # 0x0000000000cb0100
; {poll_return}
0x0000000002923ed1: retq
Method AndSC with -XX:PrintAssemblyOptions=intel option
# {method} {0x00000000170a0810} 'AndSC' '(III)Z' in 'AndTest'
...
0x0000000002c26e2c: cmp r9d,r8d
0x0000000002c26e2f: jl 0x0000000002c26e36 ;*if_icmplt
0x0000000002c26e31: cmp r9d,edi
0x0000000002c26e34: jle 0x0000000002c26e44 ;*iconst_0
0x0000000002c26e36: xor eax,eax ;*synchronization entry
0x0000000002c26e38: add rsp,0x10
0x0000000002c26e3c: pop rbp
0x0000000002c26e3d: test DWORD PTR [rip+0xffffffffffce91bd],eax # 0x0000000002910000
0x0000000002c26e43: ret
0x0000000002c26e44: mov eax,0x1
0x0000000002c26e49: jmp 0x0000000002c26e38
Method AndNonSC with default options
# {method} {0x0000000016da0908} 'AndNonSC' '(III)Z' in 'AndTest'
...
0x0000000002923a78: cmp %r8d,%r9d
0x0000000002923a7b: mov $0x0,%eax
0x0000000002923a80: jl 0x0000000002923a8b
0x0000000002923a86: mov $0x1,%eax
0x0000000002923a8b: cmp %edi,%r9d
0x0000000002923a8e: mov $0x0,%esi
0x0000000002923a93: jg 0x0000000002923a9e
0x0000000002923a99: mov $0x1,%esi
0x0000000002923a9e: and %rsi,%rax
0x0000000002923aa1: cmp $0x0,%eax
0x0000000002923aa4: je 0x0000000002923abb ;*ifeq
; - AndTest::AndNonSC#21 (line 29)
0x0000000002923aaa: mov $0x1,%eax
0x0000000002923aaf: add $0x30,%rsp
0x0000000002923ab3: pop %rbp
0x0000000002923ab4: test %eax,-0x1c739ba(%rip) # 0x0000000000cb0100
; {poll_return}
0x0000000002923aba: retq ;*ireturn
; - AndTest::AndNonSC#25 (line 30)
0x0000000002923abb: mov $0x0,%eax
0x0000000002923ac0: add $0x30,%rsp
0x0000000002923ac4: pop %rbp
0x0000000002923ac5: test %eax,-0x1c739cb(%rip) # 0x0000000000cb0100
; {poll_return}
0x0000000002923acb: retq
Method AndNonSC with -XX:PrintAssemblyOptions=intel option
# {method} {0x00000000170a0908} 'AndNonSC' '(III)Z' in 'AndTest'
...
0x0000000002c270b5: cmp r9d,r8d
0x0000000002c270b8: jl 0x0000000002c270df ;*if_icmplt
0x0000000002c270ba: mov r8d,0x1 ;*iload_2
0x0000000002c270c0: cmp r9d,edi
0x0000000002c270c3: cmovg r11d,r10d
0x0000000002c270c7: and r8d,r11d
0x0000000002c270ca: test r8d,r8d
0x0000000002c270cd: setne al
0x0000000002c270d0: movzx eax,al
0x0000000002c270d3: add rsp,0x10
0x0000000002c270d7: pop rbp
0x0000000002c270d8: test DWORD PTR [rip+0xffffffffffce8f22],eax # 0x0000000002910000
0x0000000002c270de: ret
0x0000000002c270df: xor r8d,r8d
0x0000000002c270e2: jmp 0x0000000002c270c0
First of all, the generated ASM code differs depending on whether we choose the default AT&T syntax or the Intel syntax.
With AT&T syntax:
The ASM code is actually longer for the AndSC method, with every bytecode IF_ICMP* translated to two assembly jump instructions, for a total of 4 conditional jumps.
Meanwhile, for the AndNonSC method the compiler generates a more straight-forward code, where each bytecode IF_ICMP* is translated to only one assembly jump instruction, keeping the original count of 3 conditional jumps.
With Intel syntax:
The ASM code for AndSC is shorter, with just 2 conditional jumps (not counting the non-conditional jmp at the end). Actually it's just two CMP, two JL/E and a XOR/MOV depending on the result.
The ASM code for AndNonSC is now longer than the AndSC one! However, it has just 1 conditional jump (for the first comparison), using the registers to directly compare the first result with the second, without any more jumps.
Conclusion after ASM code analysis
At AMD64 machine-language level, the & operator seems to generate ASM code with fewer conditional jumps, which might be better for high prediction-failure rates (random values for example).
On the other hand, the && operator seems to generate ASM code with fewer instructions (with the -XX:PrintAssemblyOptions=intel option anyway), which might be better for really long loops with prediction-friendly inputs, where the fewer number of CPU cycles for each comparison can make a difference in the long run.
As I stated in some of the comments, this is going to vary greatly between systems, so if we're talking about branch-prediction optimization, the only real answer would be: it depends on your JVM implementation, your compiler, your CPU and your input data.
Addendum: Guava's isPowerOfTwo method
Here, Guava's developers have come up with a neat way of calculating if a given number is a power of 2:
public static boolean isPowerOfTwo(long x) {
return x > 0 & (x & (x - 1)) == 0;
}
Quoting OP:
is this use of & (where && would be more normal) a real optimization?
To find out if it is, I added two similar methods to my test class:
public boolean isPowerOfTwoAND(long x) {
return x > 0 & (x & (x - 1)) == 0;
}
public boolean isPowerOfTwoANDAND(long x) {
return x > 0 && (x & (x - 1)) == 0;
}
Intel's ASM code for Guava's version
# {method} {0x0000000017580af0} 'isPowerOfTwoAND' '(J)Z' in 'AndTest'
# this: rdx:rdx = 'AndTest'
# parm0: r8:r8 = long
...
0x0000000003103bbe: movabs rax,0x0
0x0000000003103bc8: cmp rax,r8
0x0000000003103bcb: movabs rax,0x175811f0 ; {metadata(method data for {method} {0x0000000017580af0} 'isPowerOfTwoAND' '(J)Z' in 'AndTest')}
0x0000000003103bd5: movabs rsi,0x108
0x0000000003103bdf: jge 0x0000000003103bef
0x0000000003103be5: movabs rsi,0x118
0x0000000003103bef: mov rdi,QWORD PTR [rax+rsi*1]
0x0000000003103bf3: lea rdi,[rdi+0x1]
0x0000000003103bf7: mov QWORD PTR [rax+rsi*1],rdi
0x0000000003103bfb: jge 0x0000000003103c1b ;*lcmp
0x0000000003103c01: movabs rax,0x175811f0 ; {metadata(method data for {method} {0x0000000017580af0} 'isPowerOfTwoAND' '(J)Z' in 'AndTest')}
0x0000000003103c0b: inc DWORD PTR [rax+0x128]
0x0000000003103c11: mov eax,0x1
0x0000000003103c16: jmp 0x0000000003103c20 ;*goto
0x0000000003103c1b: mov eax,0x0 ;*lload_1
0x0000000003103c20: mov rsi,r8
0x0000000003103c23: movabs r10,0x1
0x0000000003103c2d: sub rsi,r10
0x0000000003103c30: and rsi,r8
0x0000000003103c33: movabs rdi,0x0
0x0000000003103c3d: cmp rsi,rdi
0x0000000003103c40: movabs rsi,0x175811f0 ; {metadata(method data for {method} {0x0000000017580af0} 'isPowerOfTwoAND' '(J)Z' in 'AndTest')}
0x0000000003103c4a: movabs rdi,0x140
0x0000000003103c54: jne 0x0000000003103c64
0x0000000003103c5a: movabs rdi,0x150
0x0000000003103c64: mov rbx,QWORD PTR [rsi+rdi*1]
0x0000000003103c68: lea rbx,[rbx+0x1]
0x0000000003103c6c: mov QWORD PTR [rsi+rdi*1],rbx
0x0000000003103c70: jne 0x0000000003103c90 ;*lcmp
0x0000000003103c76: movabs rsi,0x175811f0 ; {metadata(method data for {method} {0x0000000017580af0} 'isPowerOfTwoAND' '(J)Z' in 'AndTest')}
0x0000000003103c80: inc DWORD PTR [rsi+0x160]
0x0000000003103c86: mov esi,0x1
0x0000000003103c8b: jmp 0x0000000003103c95 ;*goto
0x0000000003103c90: mov esi,0x0 ;*iand
0x0000000003103c95: and rsi,rax
0x0000000003103c98: and esi,0x1
0x0000000003103c9b: mov rax,rsi
0x0000000003103c9e: add rsp,0x50
0x0000000003103ca2: pop rbp
0x0000000003103ca3: test DWORD PTR [rip+0xfffffffffe44c457],eax # 0x0000000001550100
0x0000000003103ca9: ret
Intel's asm code for && version
# {method} {0x0000000017580bd0} 'isPowerOfTwoANDAND' '(J)Z' in 'AndTest'
# this: rdx:rdx = 'AndTest'
# parm0: r8:r8 = long
...
0x0000000003103438: movabs rax,0x0
0x0000000003103442: cmp rax,r8
0x0000000003103445: jge 0x0000000003103471 ;*lcmp
0x000000000310344b: mov rax,r8
0x000000000310344e: movabs r10,0x1
0x0000000003103458: sub rax,r10
0x000000000310345b: and rax,r8
0x000000000310345e: movabs rsi,0x0
0x0000000003103468: cmp rax,rsi
0x000000000310346b: je 0x000000000310347b ;*lcmp
0x0000000003103471: mov eax,0x0
0x0000000003103476: jmp 0x0000000003103480 ;*ireturn
0x000000000310347b: mov eax,0x1 ;*goto
0x0000000003103480: and eax,0x1
0x0000000003103483: add rsp,0x40
0x0000000003103487: pop rbp
0x0000000003103488: test DWORD PTR [rip+0xfffffffffe44cc72],eax # 0x0000000001550100
0x000000000310348e: ret
In this specific example, the JIT compiler generates far less assembly code for the && version than for Guava's & version (and, after yesterday's results, I was honestly surprised by this).
Compared to Guava's, the && version translates to 25% less bytecode for JIT to compile, 50% less assembly instructions, and only two conditional jumps (the & version has four of them).
So everything points to Guava's & method being less efficient than the more "natural" && version.
... Or is it?
As noted before, I'm running the above examples with Java 8:
C:\....>java -version
java version "1.8.0_91"
Java(TM) SE Runtime Environment (build 1.8.0_91-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.91-b14, mixed mode)
But what if I switch to Java 7?
C:\....>c:\jdk1.7.0_79\bin\java -version
java version "1.7.0_79"
Java(TM) SE Runtime Environment (build 1.7.0_79-b15)
Java HotSpot(TM) 64-Bit Server VM (build 24.79-b02, mixed mode)
C:\....>c:\jdk1.7.0_79\bin\java -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=print,*AndTest.isPowerOfTwoAND -XX:PrintAssemblyOptions=intel AndTestMain
.....
0x0000000002512bac: xor r10d,r10d
0x0000000002512baf: mov r11d,0x1
0x0000000002512bb5: test r8,r8
0x0000000002512bb8: jle 0x0000000002512bde ;*ifle
0x0000000002512bba: mov eax,0x1 ;*lload_1
0x0000000002512bbf: mov r9,r8
0x0000000002512bc2: dec r9
0x0000000002512bc5: and r9,r8
0x0000000002512bc8: test r9,r9
0x0000000002512bcb: cmovne r11d,r10d
0x0000000002512bcf: and eax,r11d ;*iand
0x0000000002512bd2: add rsp,0x10
0x0000000002512bd6: pop rbp
0x0000000002512bd7: test DWORD PTR [rip+0xffffffffffc0d423],eax # 0x0000000002120000
0x0000000002512bdd: ret
0x0000000002512bde: xor eax,eax
0x0000000002512be0: jmp 0x0000000002512bbf
.....
Surprise! The assembly code generated for the & method by the JIT compiler in Java 7, has only one conditional jump now, and is way shorter! Whereas the && method (you'll have to trust me on this one, I don't want to clutter the ending!) remains about the same, with its two conditional jumps and a couple less instructions, tops.
Looks like Guava's engineers knew what they were doing, after all! (if they were trying to optimize Java 7 execution time, that is ;-)
So back to OP's latest question:
is this use of & (where && would be more normal) a real optimization?
And IMHO the answer is the same, even for this (very!) specific scenario: it depends on your JVM implementation, your compiler, your CPU and your input data.
For those kind of questions you should run a microbenchmark. I used JMH for this test.
The benchmarks are implemented as
// boolean logical AND
bh.consume(value >= x & y <= value);
and
// conditional AND
bh.consume(value >= x && y <= value);
and
// bitwise OR, as suggested by Joop Eggen
bh.consume(((value - x) | (y - value)) >= 0)
With values for value, x and y according to the benchmark name.
The result (five warmup and ten measurement iterations) for throughput benchmarking is:
Benchmark Mode Cnt Score Error Units
Benchmark.isBooleanANDBelowRange thrpt 10 386.086 ▒ 17.383 ops/us
Benchmark.isBooleanANDInRange thrpt 10 387.240 ▒ 7.657 ops/us
Benchmark.isBooleanANDOverRange thrpt 10 381.847 ▒ 15.295 ops/us
Benchmark.isBitwiseORBelowRange thrpt 10 384.877 ▒ 11.766 ops/us
Benchmark.isBitwiseORInRange thrpt 10 380.743 ▒ 15.042 ops/us
Benchmark.isBitwiseOROverRange thrpt 10 383.524 ▒ 16.911 ops/us
Benchmark.isConditionalANDBelowRange thrpt 10 385.190 ▒ 19.600 ops/us
Benchmark.isConditionalANDInRange thrpt 10 384.094 ▒ 15.417 ops/us
Benchmark.isConditionalANDOverRange thrpt 10 380.913 ▒ 5.537 ops/us
The result is not that different for the evaluation itself. As long no perfomance impact is spotted on that piece of code I would not try to optimize it. Depending on the place in the code the hotspot compiler might decide to do some optimization. Which probably is not covered by the above benchmarks.
some references:
boolean logical AND - the result value is true if both operand values are true; otherwise, the result is false
conditional AND - is like &, but evaluates its right-hand operand only if the value of its left-hand operand is true
bitwise OR - the result value is the bitwise inclusive OR of the operand values
I'm going to come at this from a different angle.
Consider these two code fragments,
if (value >= x && value <= y) {
and
if (value >= x & value <= y) {
If we assume that value, x, y have a primitive type, then those two (partial) statements will give the same outcome for all possible input values. (If wrapper types are involved, then they are not exactly equivalent because of an implicit null test for y that might fail in the & version and not the && version.)
If the JIT compiler is doing a good job, its optimizer will be able to deduce that those two statements do the same thing:
If one is predictably faster than the other, then it should be able to use the faster version ... in the JIT compiled code.
If not, then it doesn't matter which version is used at the source code level.
Since the JIT compiler gathers path statistics before compiling, it can potentially have more information about the execution characteristics that the programmer(!).
If the current generation JIT compiler (on any given platform) doesn't optimize well enough to handle this, the next generation could well do ... depending on whether or not empirical evidence points to this being a worthwhile pattern to optimize.
Indeed, if you write you Java code in a way that optimizes for this, there is a chance that by picking the more "obscure" version of the code, you might inhibit the current or future JIT compiler's ability to optimize.
In short, I don't think you should do this kind of micro-optimization at the source code level. And if you accept this argument1, and follow it to its logical conclusion, the question of which version is faster is ... moot2.
1 - I do not claim this is anywhere near being a proof.
2 - Unless you are one of the tiny community of people who actually write Java JIT compilers ...
The "Very Famous Question" is interesting in two respects:
On the one hand, that is an example where the kind of optimization required to make a difference is way beyond the capability of a JIT compiler.
On the other hand, it would not necessarily be the correct thing to sort the array ... just because a sorted array can be processed faster. The cost of sorting the array, could well be (much) greater than the saving.
Using either & or && still requires a condition to be evaluated so it's unlikely it will save any processing time - it might even add to it considering you're evaluating both expressions when you only need to evaluate one.
Using & over && to save a nanosecond if that in some very rare situations is pointless, you've already wasted more time contemplating the difference than you would've saved using & over &&.
Edit
I got curious and decided to run some bench marks.
I made this class:
public class Main {
static int x = 22, y = 48;
public static void main(String[] args) {
runWithOneAnd(30);
runWithTwoAnds(30);
}
static void runWithOneAnd(int value){
if(value >= x & value <= y){
}
}
static void runWithTwoAnds(int value){
if(value >= x && value <= y){
}
}
}
and ran some profiling tests with NetBeans. I didn't use any print statements to save processing time, just know both evaluate to true.
First test:
Second test:
Third test:
As you can see by the profiling tests, using only one & actually takes 2-3 times longer to run compared to using two &&. This does strike as some what odd as i did expect better performance from only one &.
I'm not 100% sure why. In both cases, both expressions have to be evaluated because both are true. I suspect that the JVM does some special optimization behind the scenes to speed it up.
Moral of the story: convention is good and premature optimization is bad.
Edit 2
I redid the benchmark code with #SvetlinZarev's comments in mind and a few other improvements. Here is the modified benchmark code:
public class Main {
static int x = 22, y = 48;
public static void main(String[] args) {
oneAndBothTrue();
oneAndOneTrue();
oneAndBothFalse();
twoAndsBothTrue();
twoAndsOneTrue();
twoAndsBothFalse();
System.out.println(b);
}
static void oneAndBothTrue() {
int value = 30;
for (int i = 0; i < 2000; i++) {
if (value >= x & value <= y) {
doSomething();
}
}
}
static void oneAndOneTrue() {
int value = 60;
for (int i = 0; i < 4000; i++) {
if (value >= x & value <= y) {
doSomething();
}
}
}
static void oneAndBothFalse() {
int value = 100;
for (int i = 0; i < 4000; i++) {
if (value >= x & value <= y) {
doSomething();
}
}
}
static void twoAndsBothTrue() {
int value = 30;
for (int i = 0; i < 4000; i++) {
if (value >= x & value <= y) {
doSomething();
}
}
}
static void twoAndsOneTrue() {
int value = 60;
for (int i = 0; i < 4000; i++) {
if (value >= x & value <= y) {
doSomething();
}
}
}
static void twoAndsBothFalse() {
int value = 100;
for (int i = 0; i < 4000; i++) {
if (value >= x & value <= y) {
doSomething();
}
}
}
//I wanted to avoid print statements here as they can
//affect the benchmark results.
static StringBuilder b = new StringBuilder();
static int times = 0;
static void doSomething(){
times++;
b.append("I have run ").append(times).append(" times \n");
}
}
And here are the performance tests:
Test 1:
Test 2:
Test 3:
This takes into account different values and different conditions as well.
Using one & takes more time to run when both conditions are true, about 60% or 2 milliseconds more time. When either one or both conditions are false, then one & runs faster, but it only runs about 0.30-0.50 milliseconds faster. So & will run faster than && in most circumstances, but the performance difference is still negligible.
What you are after is something like this:
x <= value & value <= y
value - x >= 0 & y - value >= 0
((value - x) | (y - value)) >= 0 // integer bit-or
Interesting, one would almost like to look at the byte code.
But hard to say. I wish this were a C question.
I was curious to the answer as well, so I wrote the following (simple) test for this:
private static final int max = 80000;
private static final int size = 100000;
private static final int x = 1500;
private static final int y = 15000;
private Random random;
#Before
public void setUp() {
this.random = new Random();
}
#After
public void tearDown() {
random = null;
}
#Test
public void testSingleOperand() {
int counter = 0;
int[] numbers = new int[size];
for (int j = 0; j < size; j++) {
numbers[j] = random.nextInt(max);
}
long start = System.nanoTime(); //start measuring after an array has been filled
for (int i = 0; i < numbers.length; i++) {
if (numbers[i] >= x & numbers[i] <= y) {
counter++;
}
}
long end = System.nanoTime();
System.out.println("Duration of single operand: " + (end - start));
}
#Test
public void testDoubleOperand() {
int counter = 0;
int[] numbers = new int[size];
for (int j = 0; j < size; j++) {
numbers[j] = random.nextInt(max);
}
long start = System.nanoTime(); //start measuring after an array has been filled
for (int i = 0; i < numbers.length; i++) {
if (numbers[i] >= x & numbers[i] <= y) {
counter++;
}
}
long end = System.nanoTime();
System.out.println("Duration of double operand: " + (end - start));
}
With the end result being that the comparison with && always wins in terms of speed, being about 1.5/2 milliseconds quicker than &.
EDIT:
As #SvetlinZarev pointed out, I was also measuring the time it took Random to get an integer. Changed it to use a pre-filled array of random numbers, which caused the duration of the single operand test to wildly fluctuate; the differences between several runs were up to 6-7ms.
The way this was explained to me, is that && will return false if the first check in a series is false, while & checks all items in a series regardless of how many are false. I.E.
if (x>0 && x <=10 && x
Will run faster than
if (x>0 & x <=10 & x
If x is greater than 10, because single ampersands will continue to check the rest of the conditions whereas double ampersands will break after the first non-true condition.
I keep reading that bitshifting is not needed, as compiler optimisations will translate a multiplication to a bitshift. Such as Should I bit-shift to divide by 2 in Java? and Is shifting bits faster than multiplying and dividing in Java? .NET?
I am not inquiring here about the performance difference, I could test that out myself. But what I think is curious, is that several people mention that it will be "compiled to the same thing". Which seems to be not true. I have written a small piece of code.
private static void multi()
{
int a = 3;
int b = a * 2;
System.out.println(b);
}
private static void shift()
{
int a = 3;
int b = a << 1L;
System.out.println(b);
}
Which gives the same result, and just prints it out.
When I look at the generated Java Bytecode, the following is shown.
private static void multi();
Code:
0: iconst_3
1: istore_0
2: iload_0
3: iconst_2
4: imul
5: istore_1
6: getstatic #4 // Field java/lang/System.out:Ljava/io/PrintStream;
9: iload_1
10: invokevirtual #5 // Method java/io/PrintStream.println:(I)V
13: return
private static void shift();
Code:
0: iconst_3
1: istore_0
2: iload_0
3: iconst_1
4: ishl
5: istore_1
6: getstatic #4 // Field java/lang/System.out:Ljava/io/PrintStream;
9: iload_1
10: invokevirtual #5 // Method java/io/PrintStream.println:(I)V
13: return
Now we can see the difference between "imul" and "ishl".
My question being: clearly the spoken off optimization is not visible in the java bytecode. I am still assuming the optimization does happen, so does it just happen at a lower level? Or, alternatively because it is Java, does the JVM when encountering the imul statement somehow know that it should be translated to something else. If so, any resources on how this is handled would be greatly appreciated.
(as a sidenote, I am not trying to justify the need for bitshifting. I think it decreases readability, at least to people used to Java, for C++ that might be different. I am just trying to see where the optimization happens).
The question in the title sounds a bit different than the question in the text. The quoted statement that the shift and the multiplication would be "compiled to the same thing" is true. But it does not yet apply to the bytecode.
In general, the Java bytecode is rather un-optimized. There are very few optimizations done at all - mainly inlining of constants. Beyond that, the Java bytecode is just an intermediate representation of the original program. And the translation from Java to Java bytecode is done rather "literally".
(Which, I think, is a good thing. The bytecode still closely resembles the original Java code. All the nitty-gritty (platform specific!) optimizations that may be possible are left to the virtual machine, which has far more options here.
All further optimizations, like arithmetic optimizations, dead code elimination or method inlining, are done by the JIT (Just-In-Time-Compiler), at runtime. The Just-In-Time compiler also applies the optimization of replacing the multiplication by a bit shift.
The example that you gave makes it a bit difficult to show the effect, for several reasons. The fact that the System.out.println was included in the method tends to make the actual machine code large, due to inlining and the general prerequisites for calling this method. But more importantly, the shift by 1, which corresponds to a multiplication with 2, also corresponds to an addition of the value to itself. So instead of observing a shl (left-shift) assembler instruction in the resulting machine code for the multi method, you'd likely see a disguised add instruction in the multi- and the shift method.
However, here is a very pragmatic example that does a left-shift by 8, corresponding to a multiplication with 256:
class BitShiftOptimization
{
public static void main(String args[])
{
int blackHole = 0;
for (int i=0; i<1000000; i++)
{
blackHole += testMulti(i);
blackHole += testShift(i);
}
System.out.println(blackHole);
}
public static int testMulti(int a)
{
int b = a * 256;
return b;
}
public static int testShift(int a)
{
int b = a << 8L;
return b;
}
}
(It receives the value to be shifted as an argument, to prevent it from being optimized away into a constant. It calls the methods several times, to trigger the JIT. And it returns and collects the values from both methods to prevent the method calls to be optimized away. Again, this is very pragmatic, but sufficient to show the effect)
Running this in a Hotspot Disassembler VM with
java -server -XX:+UnlockDiagnosticVMOptions -XX:+TraceClassLoading -XX:+LogCompilation -XX:+PrintInlining -XX:+PrintAssembly BitShiftOptimization
will produce the following assembler code for the testMulti method:
Decoding compiled method 0x000000000286fbd0:
Code:
[Entry Point]
[Verified Entry Point]
[Constants]
# {method} {0x000000001c0003b0} 'testMulti' '(I)I' in 'BitShiftOptimization'
# parm0: rdx = int
# [sp+0x40] (sp of caller)
0x000000000286fd20: mov %eax,-0x6000(%rsp)
0x000000000286fd27: push %rbp
0x000000000286fd28: sub $0x30,%rsp
0x000000000286fd2c: movabs $0x1c0005a8,%rax ; {metadata(method data for {method} {0x000000001c0003b0} 'testMulti' '(I)I' in 'BitShiftOptimization')}
0x000000000286fd36: mov 0xdc(%rax),%esi
0x000000000286fd3c: add $0x8,%esi
0x000000000286fd3f: mov %esi,0xdc(%rax)
0x000000000286fd45: movabs $0x1c0003a8,%rax ; {metadata({method} {0x000000001c0003b0} 'testMulti' '(I)I' in 'BitShiftOptimization')}
0x000000000286fd4f: and $0x1ff8,%esi
0x000000000286fd55: cmp $0x0,%esi
0x000000000286fd58: je 0x000000000286fd70 ;*iload_0
; - BitShiftOptimization::testMulti#0 (line 17)
0x000000000286fd5e: shl $0x8,%edx
0x000000000286fd61: mov %rdx,%rax
0x000000000286fd64: add $0x30,%rsp
0x000000000286fd68: pop %rbp
0x000000000286fd69: test %eax,-0x273fc6f(%rip) # 0x0000000000130100
; {poll_return}
0x000000000286fd6f: retq
0x000000000286fd70: mov %rax,0x8(%rsp)
0x000000000286fd75: movq $0xffffffffffffffff,(%rsp)
0x000000000286fd7d: callq 0x000000000285f160 ; OopMap{off=98}
;*synchronization entry
; - BitShiftOptimization::testMulti#-1 (line 17)
; {runtime_call}
0x000000000286fd82: jmp 0x000000000286fd5e
0x000000000286fd84: nop
0x000000000286fd85: nop
0x000000000286fd86: mov 0x2a8(%r15),%rax
0x000000000286fd8d: movabs $0x0,%r10
0x000000000286fd97: mov %r10,0x2a8(%r15)
0x000000000286fd9e: movabs $0x0,%r10
0x000000000286fda8: mov %r10,0x2b0(%r15)
0x000000000286fdaf: add $0x30,%rsp
0x000000000286fdb3: pop %rbp
0x000000000286fdb4: jmpq 0x0000000002859420 ; {runtime_call}
0x000000000286fdb9: hlt
0x000000000286fdba: hlt
0x000000000286fdbb: hlt
0x000000000286fdbc: hlt
0x000000000286fdbd: hlt
0x000000000286fdbe: hlt
(the code for the testShift method has the same instructions, by the way).
The relevant line here is
0x000000000286fd5e: shl $0x8,%edx
which corresponds to the left-shift by 8.
There are two ways to check if the number is divisible by 2:
x % 2 == 1
(x & 1) == 1
Which of the two is more efficient?
The bit operation is almost certainly faster.
Division/modulus is a generalized operation which must work for any divisor you provide, not just 2. It must also check for underflow, range errors and division by zero, and maintain a remainder, all of which takes time.
The bit operation just does a bit "and" operation, which in this case just so happens to correspond to division by two. It might actually use just a single processor operation to execute.
Either the & expression will be faster or they will be the same speed. Last time I tried, they were the same speed when I used a literal 2 (because the compiler could optimise it) but % was slower if the 2 was in a variable.
The expression x % 2 == 1 as a test for odd numbers does not work for negative x.
So there's at least one reason to prefer &.
There will hardly be a noticable difference in practice. Particularly, it's hard to imagine a case where such an instruction will be the actual bottleneck.
(Some nitpicking: The "binary" operation should rather be called bitwise operation, and the "modulo" operation actually is a remainder operation)
From a more theoretical point of view, one could assume that the binary operation is more efficient than the remainder operation, for reasons that already have been pointed out in other answers.
However, back to the practical point of view again: The JIT will almost certainly come for the rescue. Considering the following (very simple) test:
class BitwiseVersusMod
{
public static void main(String args[])
{
for (int i=0; i<10; i++)
{
for (int n=100000; n<=100000000; n*=10)
{
long s0 = runTestBitwise(n);
System.out.println("Bitwise sum "+s0);
long s1 = runTestMod(n);
System.out.println("Mod sum "+s1);
}
}
}
private static long runTestMod(int n)
{
long sum = 0;
for (int i=0; i<n; i++)
{
if (i % 2 == 1)
{
sum += i;
}
}
return sum;
}
private static long runTestBitwise(int n)
{
long sum = 0;
for (int i=0; i<n; i++)
{
if ((i & 1) == 1)
{
sum += i;
}
}
return sum;
}
}
Running it with a Hotspot Disassembler VM using
java -server -XX:+UnlockDiagnosticVMOptions -XX:+TraceClassLoading -XX:+LogCompilation -XX:+PrintAssembly BitwiseVersusMod
creates the JIT disassembly log.
Indeed, for the first invocations of the modulo version, it creates the following disassembly:
...
0x00000000027dcae6: cmp $0xffffffff,%ecx
0x00000000027dcae9: je 0x00000000027dcaf2
0x00000000027dcaef: cltd
0x00000000027dcaf0: idiv %ecx ;*irem
; - BitwiseVersusMod::runTestMod#11 (line 26)
; implicit exception: dispatches to 0x00000000027dcc18
0x00000000027dcaf2: cmp $0x1,%edx
0x00000000027dcaf5: movabs $0x54fa0888,%rax ; {metadata(method data for {method} {0x0000000054fa04b0} 'runTestMod' '(I)J' in 'BitwiseVersusMod')}
0x00000000027dcaff: movabs $0xb0,%rdx
....
where the irem instruction is translated into an idiv, which is considered to be rather expensive.
In contrast to that, the binary version uses an and instruction for the decision, as expected:
....
0x00000000027dc58c: nopl 0x0(%rax)
0x00000000027dc590: mov %rsi,%rax
0x00000000027dc593: and $0x1,%eax
0x00000000027dc596: cmp $0x1,%eax
0x00000000027dc599: movabs $0x54fa0768,%rax ; {metadata(method data for {method} {0x0000000054fa0578} 'runTestBitwise' '(I)J' in 'BitwiseVersusMod')}
0x00000000027dc5a3: movabs $0xb0,%rbx
....
However, for the final, optimized version, the generated code is more similar for both versions. In both cases, the compiler does a lot of loop unrolling, but the core of the methods can still be identified: For the bitwise version, it generates an unrolled loop containing the following instructions:
...
0x00000000027de2c7: mov %r10,%rax
0x00000000027de2ca: mov %r9d,%r11d
0x00000000027de2cd: add $0x4,%r11d ;*iinc
; - BitwiseVersusMod::runTestBitwise#21 (line 37)
0x00000000027de2d1: mov %r11d,%r8d
0x00000000027de2d4: and $0x1,%r8d
0x00000000027de2d8: cmp $0x1,%r8d
0x00000000027de2dc: jne 0x00000000027de2e7 ;*if_icmpne
; - BitwiseVersusMod::runTestBitwise#13 (line 39)
0x00000000027de2de: movslq %r11d,%r10
0x00000000027de2e1: add %rax,%r10 ;*ladd
; - BitwiseVersusMod::runTestBitwise#19 (line 41)
...
There is still the and instruction for testing the lowest bit. But for the modulo version, the core of the unrolled loop is
...
0x00000000027e3a0a: mov %r11,%r10
0x00000000027e3a0d: mov %ebx,%r8d
0x00000000027e3a10: add $0x2,%r8d ;*iinc
; - BitwiseVersusMod::runTestMod#21 (line 24)
0x00000000027e3a14: test %r8d,%r8d
0x00000000027e3a17: jl 0x00000000027e3a2e ;*irem
; - BitwiseVersusMod::runTestMod#11 (line 26)
0x00000000027e3a19: mov %r8d,%r11d
0x00000000027e3a1c: and $0x1,%r11d
0x00000000027e3a20: cmp $0x1,%r11d
0x00000000027e3a24: jne 0x00000000027e3a2e ;*if_icmpne
; - BitwiseVersusMod::runTestMod#13 (line 26)
...
I have to admit that I can not fully understand (at least, not in reasonable time) what exactly it is doing there. But in any case: The irem bytecode instruction is also implemented with an and assembly instruction, and there is no longer any idiv instruction in the resulting machine code.
So to repeat the first statement from this answer: There will hardly be a noticable difference in practice. Not only because the cost of a single instruction will hardly ever be the bottleneck, but also because you never know which instructions actually will be executed, and in this particular case, you have to assume that they will basically be equal.
Actually neither of those expressions test divisibility by two (other than in the negative). They actually both resolve to true if x is odd.
There are many other ways of testing even/oddness (e.g. ((x / 2) * 2) == x)) but none of them have the optimal properties of x & 1 solely because no compiler could possibly get it wrong and use a divide.
Most modern compilers would compile x % 2 to the same code as x & 1 but a particularly stupid one could implement x % 2 using a divide operation so it could be less efficient.
The argument as to which is better is a different story. A rookie/tired programmer may not recognize x & 1 as a test for odd numbers but x % 2 would be clearer so there is an argument that x % 2 would be the better option.
Me - I'd go for if ( Maths.isEven(x) ) making it absolutely clear what I mean. IMHO Efficiency comes way down the list, well past clarity and readability.
public class Maths {
// Method is final to encourage compiler to inline if it is bright enough.
public static final boolean isEven(long n) {
/* All even numbers have their lowest-order bit set to 0.
* This `should` therefore be the most efficient way to recognise
* even numbers.
* Also works for negative numbers.
*/
return (n & 1) == 0;
}
}
The binary operation is faster.
The mod operation has to calculate a division in order to get the remainder.
I am working on some Java code which needs to be highly optimized as it will run in hot functions that are invoked at many points in my main program logic. Part of this code involves multiplying double variables by 10 raised to arbitrary non-negative int exponents. One fast way (edit: but not the fastest possible, see Update 2 below) to get the multiplied value is to switch on the exponent:
double multiplyByPowerOfTen(final double d, final int exponent) {
switch (exponent) {
case 0:
return d;
case 1:
return d*10;
case 2:
return d*100;
// ... same pattern
case 9:
return d*1000000000;
case 10:
return d*10000000000L;
// ... same pattern with long literals
case 18:
return d*1000000000000000000L;
default:
throw new ParseException("Unhandled power of ten " + power, 0);
}
}
The commented ellipses above indicate that the case int constants continue incrementing by 1, so there are really 19 cases in the above code snippet. Since I wasn't sure whether I would actually need all the powers of 10 in case statements 10 thru 18, I ran some microbenchmarks comparing the time to complete 10 million operations with this switch statement versus a switch with only cases 0 thru 9 (with the exponent limited to 9 or less to avoid breaking the pared-down switch). I got the rather surprising (to me, at least!) result that the longer switch with more case statements actually ran faster.
On a lark, I tried adding even more cases which just returned dummy values, and found that I could get the switch to run even faster with around 22-27 declared cases (even though those dummy cases are never actually hit while the code is running). (Again, cases were added in a contiguous fashion by incrementing the prior case constant by 1.) These execution time differences are not very significant: for a random exponent between 0 and 10, the dummy padded switch statement finishes 10 million executions in 1.49 secs versus 1.54 secs for the unpadded version, for a grand total savings of 5ns per execution. So, not the kind of thing that makes obsessing over padding out a switch statement worth the effort from an optimization standpoint. But I still just find it curious and counter-intuitive that a switch doesn't become slower (or perhaps at best maintain constant O(1) time) to execute as more cases are added to it.
These are the results I obtained from running with various limits on the randomly-generated exponent values. I didn't include the results all the way down to 1 for the exponent limit, but the general shape of the curve remains the same, with a ridge around the 12-17 case mark, and a valley between 18-28. All tests were run in JUnitBenchmarks using shared containers for the random values to ensure identical testing inputs. I also ran the tests both in order from longest switch statement to shortest, and vice-versa, to try and eliminate the possibility of ordering-related test problems. I've put my testing code up on a github repo if anyone wants to try to reproduce these results.
So, what's going on here? Some vagaries of my architecture or micro-benchmark construction? Or is the Java switch really a little faster to execute in the 18 to 28 case range than it is from 11 up to 17?
github test repo "switch-experiment"
UPDATE: I cleaned up the benchmarking library quite a bit and added a text file in /results with some output across a wider range of possible exponent values. I also added an option in the testing code not to throw an Exception from default, but this doesn't appear to affect the results.
UPDATE 2: Found some pretty good discussion of this issue from back in 2009 on the xkcd forum here: http://forums.xkcd.com/viewtopic.php?f=11&t=33524. The OP's discussion of using Array.binarySearch() gave me the idea for a simple array-based implementation of the exponentiation pattern above. There's no need for the binary search since I know what the entries in the array are. It appears to run about 3 times faster than using switch, obviously at the expense of some of the control flow that switch affords. That code has been added to the github repo also.
As pointed out by the other answer, because the case values are contiguous (as opposed to sparse), the generated bytecode for your various tests uses a switch table (bytecode instruction tableswitch).
However, once the JIT starts its job and compiles the bytecode into assembly, the tableswitch instruction does not always result in an array of pointers: sometimes the switch table is transformed into what looks like a lookupswitch (similar to an if/else if structure).
Decompiling the assembly generated by the JIT (hotspot JDK 1.7) shows that it uses a succession of if/else if when there are 17 cases or less, an array of pointers when there are more than 18 (more efficient).
The reason why this magic number of 18 is used seems to come down to the default value of the MinJumpTableSize JVM flag (around line 352 in the code).
I have raised the issue on the hotspot compiler list and it seems to be a legacy of past testing. Note that this default value has been removed in JDK 8 after more benchmarking was performed.
Finally, when the method becomes too long (> 25 cases in my tests), it is in not inlined any longer with the default JVM settings - that is the likeliest cause for the drop in performance at that point.
With 5 cases, the decompiled code looks like this (notice the cmp/je/jg/jmp instructions, the assembly for if/goto):
[Verified Entry Point]
# {method} 'multiplyByPowerOfTen' '(DI)D' in 'javaapplication4/Test1'
# parm0: xmm0:xmm0 = double
# parm1: rdx = int
# [sp+0x20] (sp of caller)
0x00000000024f0160: mov DWORD PTR [rsp-0x6000],eax
; {no_reloc}
0x00000000024f0167: push rbp
0x00000000024f0168: sub rsp,0x10 ;*synchronization entry
; - javaapplication4.Test1::multiplyByPowerOfTen#-1 (line 56)
0x00000000024f016c: cmp edx,0x3
0x00000000024f016f: je 0x00000000024f01c3
0x00000000024f0171: cmp edx,0x3
0x00000000024f0174: jg 0x00000000024f01a5
0x00000000024f0176: cmp edx,0x1
0x00000000024f0179: je 0x00000000024f019b
0x00000000024f017b: cmp edx,0x1
0x00000000024f017e: jg 0x00000000024f0191
0x00000000024f0180: test edx,edx
0x00000000024f0182: je 0x00000000024f01cb
0x00000000024f0184: mov ebp,edx
0x00000000024f0186: mov edx,0x17
0x00000000024f018b: call 0x00000000024c90a0 ; OopMap{off=48}
;*new ; - javaapplication4.Test1::multiplyByPowerOfTen#72 (line 83)
; {runtime_call}
0x00000000024f0190: int3 ;*new ; - javaapplication4.Test1::multiplyByPowerOfTen#72 (line 83)
0x00000000024f0191: mulsd xmm0,QWORD PTR [rip+0xffffffffffffffa7] # 0x00000000024f0140
;*dmul
; - javaapplication4.Test1::multiplyByPowerOfTen#52 (line 62)
; {section_word}
0x00000000024f0199: jmp 0x00000000024f01cb
0x00000000024f019b: mulsd xmm0,QWORD PTR [rip+0xffffffffffffff8d] # 0x00000000024f0130
;*dmul
; - javaapplication4.Test1::multiplyByPowerOfTen#46 (line 60)
; {section_word}
0x00000000024f01a3: jmp 0x00000000024f01cb
0x00000000024f01a5: cmp edx,0x5
0x00000000024f01a8: je 0x00000000024f01b9
0x00000000024f01aa: cmp edx,0x5
0x00000000024f01ad: jg 0x00000000024f0184 ;*tableswitch
; - javaapplication4.Test1::multiplyByPowerOfTen#1 (line 56)
0x00000000024f01af: mulsd xmm0,QWORD PTR [rip+0xffffffffffffff81] # 0x00000000024f0138
;*dmul
; - javaapplication4.Test1::multiplyByPowerOfTen#64 (line 66)
; {section_word}
0x00000000024f01b7: jmp 0x00000000024f01cb
0x00000000024f01b9: mulsd xmm0,QWORD PTR [rip+0xffffffffffffff67] # 0x00000000024f0128
;*dmul
; - javaapplication4.Test1::multiplyByPowerOfTen#70 (line 68)
; {section_word}
0x00000000024f01c1: jmp 0x00000000024f01cb
0x00000000024f01c3: mulsd xmm0,QWORD PTR [rip+0xffffffffffffff55] # 0x00000000024f0120
;*tableswitch
; - javaapplication4.Test1::multiplyByPowerOfTen#1 (line 56)
; {section_word}
0x00000000024f01cb: add rsp,0x10
0x00000000024f01cf: pop rbp
0x00000000024f01d0: test DWORD PTR [rip+0xfffffffffdf3fe2a],eax # 0x0000000000430000
; {poll_return}
0x00000000024f01d6: ret
With 18 cases, the assembly looks like this (notice the array of pointers which is used and suppresses the need for all the comparisons: jmp QWORD PTR [r8+r10*1] jumps directly to the right multiplication) - that is the likely reason for the performance improvement:
[Verified Entry Point]
# {method} 'multiplyByPowerOfTen' '(DI)D' in 'javaapplication4/Test1'
# parm0: xmm0:xmm0 = double
# parm1: rdx = int
# [sp+0x20] (sp of caller)
0x000000000287fe20: mov DWORD PTR [rsp-0x6000],eax
; {no_reloc}
0x000000000287fe27: push rbp
0x000000000287fe28: sub rsp,0x10 ;*synchronization entry
; - javaapplication4.Test1::multiplyByPowerOfTen#-1 (line 56)
0x000000000287fe2c: cmp edx,0x13
0x000000000287fe2f: jae 0x000000000287fe46
0x000000000287fe31: movsxd r10,edx
0x000000000287fe34: shl r10,0x3
0x000000000287fe38: movabs r8,0x287fd70 ; {section_word}
0x000000000287fe42: jmp QWORD PTR [r8+r10*1] ;*tableswitch
; - javaapplication4.Test1::multiplyByPowerOfTen#1 (line 56)
0x000000000287fe46: mov ebp,edx
0x000000000287fe48: mov edx,0x31
0x000000000287fe4d: xchg ax,ax
0x000000000287fe4f: call 0x00000000028590a0 ; OopMap{off=52}
;*new ; - javaapplication4.Test1::multiplyByPowerOfTen#202 (line 96)
; {runtime_call}
0x000000000287fe54: int3 ;*new ; - javaapplication4.Test1::multiplyByPowerOfTen#202 (line 96)
0x000000000287fe55: mulsd xmm0,QWORD PTR [rip+0xfffffffffffffe8b] # 0x000000000287fce8
;*dmul
; - javaapplication4.Test1::multiplyByPowerOfTen#194 (line 92)
; {section_word}
0x000000000287fe5d: jmp 0x000000000287ff16
0x000000000287fe62: mulsd xmm0,QWORD PTR [rip+0xfffffffffffffe86] # 0x000000000287fcf0
;*dmul
; - javaapplication4.Test1::multiplyByPowerOfTen#188 (line 90)
; {section_word}
0x000000000287fe6a: jmp 0x000000000287ff16
0x000000000287fe6f: mulsd xmm0,QWORD PTR [rip+0xfffffffffffffe81] # 0x000000000287fcf8
;*dmul
; - javaapplication4.Test1::multiplyByPowerOfTen#182 (line 88)
; {section_word}
0x000000000287fe77: jmp 0x000000000287ff16
0x000000000287fe7c: mulsd xmm0,QWORD PTR [rip+0xfffffffffffffe7c] # 0x000000000287fd00
;*dmul
; - javaapplication4.Test1::multiplyByPowerOfTen#176 (line 86)
; {section_word}
0x000000000287fe84: jmp 0x000000000287ff16
0x000000000287fe89: mulsd xmm0,QWORD PTR [rip+0xfffffffffffffe77] # 0x000000000287fd08
;*dmul
; - javaapplication4.Test1::multiplyByPowerOfTen#170 (line 84)
; {section_word}
0x000000000287fe91: jmp 0x000000000287ff16
0x000000000287fe96: mulsd xmm0,QWORD PTR [rip+0xfffffffffffffe72] # 0x000000000287fd10
;*dmul
; - javaapplication4.Test1::multiplyByPowerOfTen#164 (line 82)
; {section_word}
0x000000000287fe9e: jmp 0x000000000287ff16
0x000000000287fea0: mulsd xmm0,QWORD PTR [rip+0xfffffffffffffe70] # 0x000000000287fd18
;*dmul
; - javaapplication4.Test1::multiplyByPowerOfTen#158 (line 80)
; {section_word}
0x000000000287fea8: jmp 0x000000000287ff16
0x000000000287feaa: mulsd xmm0,QWORD PTR [rip+0xfffffffffffffe6e] # 0x000000000287fd20
;*dmul
; - javaapplication4.Test1::multiplyByPowerOfTen#152 (line 78)
; {section_word}
0x000000000287feb2: jmp 0x000000000287ff16
0x000000000287feb4: mulsd xmm0,QWORD PTR [rip+0xfffffffffffffe24] # 0x000000000287fce0
;*dmul
; - javaapplication4.Test1::multiplyByPowerOfTen#146 (line 76)
; {section_word}
0x000000000287febc: jmp 0x000000000287ff16
0x000000000287febe: mulsd xmm0,QWORD PTR [rip+0xfffffffffffffe6a] # 0x000000000287fd30
;*dmul
; - javaapplication4.Test1::multiplyByPowerOfTen#140 (line 74)
; {section_word}
0x000000000287fec6: jmp 0x000000000287ff16
0x000000000287fec8: mulsd xmm0,QWORD PTR [rip+0xfffffffffffffe68] # 0x000000000287fd38
;*dmul
; - javaapplication4.Test1::multiplyByPowerOfTen#134 (line 72)
; {section_word}
0x000000000287fed0: jmp 0x000000000287ff16
0x000000000287fed2: mulsd xmm0,QWORD PTR [rip+0xfffffffffffffe66] # 0x000000000287fd40
;*dmul
; - javaapplication4.Test1::multiplyByPowerOfTen#128 (line 70)
; {section_word}
0x000000000287feda: jmp 0x000000000287ff16
0x000000000287fedc: mulsd xmm0,QWORD PTR [rip+0xfffffffffffffe64] # 0x000000000287fd48
;*dmul
; - javaapplication4.Test1::multiplyByPowerOfTen#122 (line 68)
; {section_word}
0x000000000287fee4: jmp 0x000000000287ff16
0x000000000287fee6: mulsd xmm0,QWORD PTR [rip+0xfffffffffffffe62] # 0x000000000287fd50
;*dmul
; - javaapplication4.Test1::multiplyByPowerOfTen#116 (line 66)
; {section_word}
0x000000000287feee: jmp 0x000000000287ff16
0x000000000287fef0: mulsd xmm0,QWORD PTR [rip+0xfffffffffffffe60] # 0x000000000287fd58
;*dmul
; - javaapplication4.Test1::multiplyByPowerOfTen#110 (line 64)
; {section_word}
0x000000000287fef8: jmp 0x000000000287ff16
0x000000000287fefa: mulsd xmm0,QWORD PTR [rip+0xfffffffffffffe5e] # 0x000000000287fd60
;*dmul
; - javaapplication4.Test1::multiplyByPowerOfTen#104 (line 62)
; {section_word}
0x000000000287ff02: jmp 0x000000000287ff16
0x000000000287ff04: mulsd xmm0,QWORD PTR [rip+0xfffffffffffffe5c] # 0x000000000287fd68
;*dmul
; - javaapplication4.Test1::multiplyByPowerOfTen#98 (line 60)
; {section_word}
0x000000000287ff0c: jmp 0x000000000287ff16
0x000000000287ff0e: mulsd xmm0,QWORD PTR [rip+0xfffffffffffffe12] # 0x000000000287fd28
;*tableswitch
; - javaapplication4.Test1::multiplyByPowerOfTen#1 (line 56)
; {section_word}
0x000000000287ff16: add rsp,0x10
0x000000000287ff1a: pop rbp
0x000000000287ff1b: test DWORD PTR [rip+0xfffffffffd9b00df],eax # 0x0000000000230000
; {poll_return}
0x000000000287ff21: ret
And finally the assembly with 30 cases (below) looks similar to 18 cases, except for the additional movapd xmm0,xmm1 that appears towards the middle of the code, as spotted by #cHao - however the likeliest reason for the drop in performance is that the method is too long to be inlined with the default JVM settings:
[Verified Entry Point]
# {method} 'multiplyByPowerOfTen' '(DI)D' in 'javaapplication4/Test1'
# parm0: xmm0:xmm0 = double
# parm1: rdx = int
# [sp+0x20] (sp of caller)
0x0000000002524560: mov DWORD PTR [rsp-0x6000],eax
; {no_reloc}
0x0000000002524567: push rbp
0x0000000002524568: sub rsp,0x10 ;*synchronization entry
; - javaapplication4.Test1::multiplyByPowerOfTen#-1 (line 56)
0x000000000252456c: movapd xmm1,xmm0
0x0000000002524570: cmp edx,0x1f
0x0000000002524573: jae 0x0000000002524592 ;*tableswitch
; - javaapplication4.Test1::multiplyByPowerOfTen#1 (line 56)
0x0000000002524575: movsxd r10,edx
0x0000000002524578: shl r10,0x3
0x000000000252457c: mulsd xmm0,QWORD PTR [rip+0xfffffffffffffe3c] # 0x00000000025243c0
;*dmul
; - javaapplication4.Test1::multiplyByPowerOfTen#364 (line 118)
; {section_word}
0x0000000002524584: movabs r8,0x2524450 ; {section_word}
0x000000000252458e: jmp QWORD PTR [r8+r10*1] ;*tableswitch
; - javaapplication4.Test1::multiplyByPowerOfTen#1 (line 56)
0x0000000002524592: mov ebp,edx
0x0000000002524594: mov edx,0x31
0x0000000002524599: xchg ax,ax
0x000000000252459b: call 0x00000000024f90a0 ; OopMap{off=64}
;*new ; - javaapplication4.Test1::multiplyByPowerOfTen#370 (line 120)
; {runtime_call}
0x00000000025245a0: int3 ;*new ; - javaapplication4.Test1::multiplyByPowerOfTen#370 (line 120)
0x00000000025245a1: mulsd xmm0,QWORD PTR [rip+0xfffffffffffffe27] # 0x00000000025243d0
;*dmul
; - javaapplication4.Test1::multiplyByPowerOfTen#358 (line 116)
; {section_word}
0x00000000025245a9: jmp 0x0000000002524744
0x00000000025245ae: mulsd xmm0,QWORD PTR [rip+0xfffffffffffffe22] # 0x00000000025243d8
;*dmul
; - javaapplication4.Test1::multiplyByPowerOfTen#348 (line 114)
; {section_word}
0x00000000025245b6: jmp 0x0000000002524744
0x00000000025245bb: mulsd xmm0,QWORD PTR [rip+0xfffffffffffffe1d] # 0x00000000025243e0
;*dmul
; - javaapplication4.Test1::multiplyByPowerOfTen#338 (line 112)
; {section_word}
0x00000000025245c3: jmp 0x0000000002524744
0x00000000025245c8: mulsd xmm0,QWORD PTR [rip+0xfffffffffffffe18] # 0x00000000025243e8
;*dmul
; - javaapplication4.Test1::multiplyByPowerOfTen#328 (line 110)
; {section_word}
0x00000000025245d0: jmp 0x0000000002524744
0x00000000025245d5: mulsd xmm0,QWORD PTR [rip+0xfffffffffffffe13] # 0x00000000025243f0
;*dmul
; - javaapplication4.Test1::multiplyByPowerOfTen#318 (line 108)
; {section_word}
0x00000000025245dd: jmp 0x0000000002524744
0x00000000025245e2: mulsd xmm0,QWORD PTR [rip+0xfffffffffffffe0e] # 0x00000000025243f8
;*dmul
; - javaapplication4.Test1::multiplyByPowerOfTen#308 (line 106)
; {section_word}
0x00000000025245ea: jmp 0x0000000002524744
0x00000000025245ef: mulsd xmm0,QWORD PTR [rip+0xfffffffffffffe09] # 0x0000000002524400
;*dmul
; - javaapplication4.Test1::multiplyByPowerOfTen#298 (line 104)
; {section_word}
0x00000000025245f7: jmp 0x0000000002524744
0x00000000025245fc: mulsd xmm0,QWORD PTR [rip+0xfffffffffffffe04] # 0x0000000002524408
;*dmul
; - javaapplication4.Test1::multiplyByPowerOfTen#288 (line 102)
; {section_word}
0x0000000002524604: jmp 0x0000000002524744
0x0000000002524609: mulsd xmm0,QWORD PTR [rip+0xfffffffffffffdff] # 0x0000000002524410
;*dmul
; - javaapplication4.Test1::multiplyByPowerOfTen#278 (line 100)
; {section_word}
0x0000000002524611: jmp 0x0000000002524744
0x0000000002524616: mulsd xmm0,QWORD PTR [rip+0xfffffffffffffdfa] # 0x0000000002524418
;*dmul
; - javaapplication4.Test1::multiplyByPowerOfTen#268 (line 98)
; {section_word}
0x000000000252461e: jmp 0x0000000002524744
0x0000000002524623: mulsd xmm0,QWORD PTR [rip+0xfffffffffffffd9d] # 0x00000000025243c8
;*dmul
; - javaapplication4.Test1::multiplyByPowerOfTen#258 (line 96)
; {section_word}
0x000000000252462b: jmp 0x0000000002524744
0x0000000002524630: movapd xmm0,xmm1
0x0000000002524634: mulsd xmm0,QWORD PTR [rip+0xfffffffffffffe0c] # 0x0000000002524448
;*dmul
; - javaapplication4.Test1::multiplyByPowerOfTen#242 (line 92)
; {section_word}
0x000000000252463c: jmp 0x0000000002524744
0x0000000002524641: movapd xmm0,xmm1
0x0000000002524645: mulsd xmm0,QWORD PTR [rip+0xfffffffffffffddb] # 0x0000000002524428
;*dmul
; - javaapplication4.Test1::multiplyByPowerOfTen#236 (line 90)
; {section_word}
0x000000000252464d: jmp 0x0000000002524744
0x0000000002524652: movapd xmm0,xmm1
0x0000000002524656: mulsd xmm0,QWORD PTR [rip+0xfffffffffffffdd2] # 0x0000000002524430
;*dmul
; - javaapplication4.Test1::multiplyByPowerOfTen#230 (line 88)
; {section_word}
0x000000000252465e: jmp 0x0000000002524744
0x0000000002524663: movapd xmm0,xmm1
0x0000000002524667: mulsd xmm0,QWORD PTR [rip+0xfffffffffffffdc9] # 0x0000000002524438
;*dmul
; - javaapplication4.Test1::multiplyByPowerOfTen#224 (line 86)
; {section_word}
[etc.]
0x0000000002524744: add rsp,0x10
0x0000000002524748: pop rbp
0x0000000002524749: test DWORD PTR [rip+0xfffffffffde1b8b1],eax # 0x0000000000340000
; {poll_return}
0x000000000252474f: ret
Switch - case is faster if the case values are placed in a narrow range Eg.
case 1:
case 2:
case 3:
..
..
case n:
Because, in this case the compiler can avoid performing a comparison for every case leg in the switch statement. The compiler make a jump table which contains addresses of the actions to be taken on different legs. The value on which the switch is being performed is manipulated to convert it into an index in to the jump table. In this implementation , the time taken in the switch statement is much less than the time taken in an equivalent if-else-if statement cascade. Also the time taken in the switch statement is independent of the number of case legs in the switch statement.
As stated in wikipedia about switch statement in Compilation section.
If the range of input values is identifiably 'small' and has only a
few gaps, some compilers that incorporate an optimizer may actually
implement the switch statement as a branch table or an array of
indexed function pointers instead of a lengthy series of conditional
instructions. This allows the switch statement to determine instantly
what branch to execute without having to go through a list of
comparisons.
The answer lies in the bytecode:
SwitchTest10.java
public class SwitchTest10 {
public static void main(String[] args) {
int n = 0;
switcher(n);
}
public static void switcher(int n) {
switch(n) {
case 0: System.out.println(0);
break;
case 1: System.out.println(1);
break;
case 2: System.out.println(2);
break;
case 3: System.out.println(3);
break;
case 4: System.out.println(4);
break;
case 5: System.out.println(5);
break;
case 6: System.out.println(6);
break;
case 7: System.out.println(7);
break;
case 8: System.out.println(8);
break;
case 9: System.out.println(9);
break;
case 10: System.out.println(10);
break;
default: System.out.println("test");
}
}
}
Corresponding bytecode; only relevant parts shown:
public static void switcher(int);
Code:
0: iload_0
1: tableswitch{ //0 to 10
0: 60;
1: 70;
2: 80;
3: 90;
4: 100;
5: 110;
6: 120;
7: 131;
8: 142;
9: 153;
10: 164;
default: 175 }
SwitchTest22.java:
public class SwitchTest22 {
public static void main(String[] args) {
int n = 0;
switcher(n);
}
public static void switcher(int n) {
switch(n) {
case 0: System.out.println(0);
break;
case 1: System.out.println(1);
break;
case 2: System.out.println(2);
break;
case 3: System.out.println(3);
break;
case 4: System.out.println(4);
break;
case 5: System.out.println(5);
break;
case 6: System.out.println(6);
break;
case 7: System.out.println(7);
break;
case 8: System.out.println(8);
break;
case 9: System.out.println(9);
break;
case 100: System.out.println(10);
break;
case 110: System.out.println(10);
break;
case 120: System.out.println(10);
break;
case 130: System.out.println(10);
break;
case 140: System.out.println(10);
break;
case 150: System.out.println(10);
break;
case 160: System.out.println(10);
break;
case 170: System.out.println(10);
break;
case 180: System.out.println(10);
break;
case 190: System.out.println(10);
break;
case 200: System.out.println(10);
break;
case 210: System.out.println(10);
break;
case 220: System.out.println(10);
break;
default: System.out.println("test");
}
}
}
Corresponding bytecode; again, only relevant parts shown:
public static void switcher(int);
Code:
0: iload_0
1: lookupswitch{ //23
0: 196;
1: 206;
2: 216;
3: 226;
4: 236;
5: 246;
6: 256;
7: 267;
8: 278;
9: 289;
100: 300;
110: 311;
120: 322;
130: 333;
140: 344;
150: 355;
160: 366;
170: 377;
180: 388;
190: 399;
200: 410;
210: 421;
220: 432;
default: 443 }
In the first case, with narrow ranges, the compiled bytecode uses a tableswitch. In the second case, the compiled bytecode uses a lookupswitch.
In tableswitch, the integer value on the top of the stack is used to index into the table, to find the branch/jump target. This jump/branch is then performed immediately. Hence, this is an O(1) operation.
A lookupswitch is more complicated. In this case, the integer value needs to be compared against all the keys in the table until the correct key is found. After the key is found, the branch/jump target (that this key is mapped to) is used for the jump. The table that is used in lookupswitch is sorted and a binary-search algorithm can be used to find the correct key. Performance for a binary search is O(log n), and the entire process is also O(log n), because the jump is still O(1). So the reason the performance is lower in the case of sparse ranges is that the correct key must first be looked up because you cannot index into the table directly.
If there are sparse values and you only had a tableswitch to use, table would essentially contain dummy entries that point to the default option. For example, assuming that the last entry in SwitchTest10.java was 21 instead of 10, you get:
public static void switcher(int);
Code:
0: iload_0
1: tableswitch{ //0 to 21
0: 104;
1: 114;
2: 124;
3: 134;
4: 144;
5: 154;
6: 164;
7: 175;
8: 186;
9: 197;
10: 219;
11: 219;
12: 219;
13: 219;
14: 219;
15: 219;
16: 219;
17: 219;
18: 219;
19: 219;
20: 219;
21: 208;
default: 219 }
So the compiler basically creates this huge table containing dummy entries between the gaps, pointing to the branch target of the default instruction. Even if there isn't a default, it will contain entries pointing to the instruction after the switch block. I did some basic tests, and I found that if the gap between the last index and the previous one (9) is greater than 35, it uses a lookupswitch instead of a tableswitch.
The behavior of the switch statement is defined in Java Virtual Machine Specification (§3.10):
Where the cases of the switch are sparse, the table representation of the tableswitch instruction becomes inefficient in terms of space. The lookupswitch instruction may be used instead. The lookupswitch instruction pairs int keys (the values of the case labels) with target offsets in a table. When a lookupswitch instruction is executed, the value of the expression of the switch is compared against the keys in the table. If one of the keys matches the value of the expression, execution continues at the associated target offset. If no key matches, execution continues at the default target. [...]
Since the question is already answered (more or less), here is some tip.
Use
private static final double[] mul={1d, 10d...};
static double multiplyByPowerOfTen(final double d, final int exponent) {
if (exponent<0 || exponent>=mul.length) throw new ParseException();//or just leave the IOOBE be
return mul[exponent]*d;
}
That code uses significantly less IC (instruction cache) and will be always inlined. The array will be in L1 data cache if the code is hot. The lookup table is almost always a win. (esp. on microbenchmarks :D )
Edit: if you wish the method to be hot-inlined, consider the non-fast paths like throw new ParseException() to be as short as minimum or move them to separate static method (hence making them short as minimum). That is throw new ParseException("Unhandled power of ten " + power, 0); is a weak idea b/c it eats a lot of the inlining budget for code that can be just interpreted - string concatenation is quite verbose in bytecode . More info and a real case w/ ArrayList
Based on javac source, you can write switch in a way that it uses tableswitch.
We can use the calculation from javac source to calculate cost for your second example.
lo = 0
hi = 220
nlabels = 24
table_space_cost = 4 + hi - lo + 1
table_time_cost = 3
lookup_space_cost = 3 + 2 * nlabels
lookup_time_cost = nlabels
table_cost = table_space_cost + 3 * table_time_cost // 234
lookup_cost = lookup_space_cost + 3 * lookup_time_cos // 123
Here tableswitch cost is higher (234) than the lookupswitch (123) and therefore lookupswitch will be selected as the opcode for this switch statement.
I am running some micro benchmarks on Java list iteration code. I have used -XX:+PrintCompilation, and -verbose:gc flags to ensure that nothing is happening in the background when the timing is being run. However, I see something in the output which I cannot understand.
Here's the code, I am running the benchmark on:
import java.util.ArrayList;
import java.util.List;
public class PerformantIteration {
private static int theSum = 0;
public static void main(String[] args) {
System.out.println("Starting microbenchmark on iterating over collections with a call to size() in each iteration");
List<Integer> nums = new ArrayList<Integer>();
for(int i=0; i<50000; i++) {
nums.add(i);
}
System.out.println("Warming up ...");
//warmup... make sure all JIT comliling is done before the actual benchmarking starts
for(int i=0; i<10; i++) {
iterateWithConstantSize(nums);
iterateWithDynamicSize(nums);
}
//actual
System.out.println("Starting the actual test");
long constantSizeBenchmark = iterateWithConstantSize(nums);
long dynamicSizeBenchmark = iterateWithDynamicSize(nums);
System.out.println("Test completed... printing results");
System.out.println("constantSizeBenchmark : " + constantSizeBenchmark);
System.out.println("dynamicSizeBenchmark : " + dynamicSizeBenchmark);
System.out.println("dynamicSizeBenchmark/constantSizeBenchmark : " + ((double)dynamicSizeBenchmark/(double)constantSizeBenchmark));
}
private static long iterateWithDynamicSize(List<Integer> nums) {
int sum=0;
long start = System.nanoTime();
for(int i=0; i<nums.size(); i++) {
// appear to do something useful
sum += nums.get(i);
}
long end = System.nanoTime();
setSum(sum);
return end-start;
}
private static long iterateWithConstantSize(List<Integer> nums) {
int count = nums.size();
int sum=0;
long start = System.nanoTime();
for(int i=0; i<count; i++) {
// appear to do something useful
sum += nums.get(i);
}
long end = System.nanoTime();
setSum(sum);
return end-start;
}
// invocations to this method simply exist to fool the VM into thinking that we are doing something useful in the loop
private static void setSum(int sum) {
theSum = sum;
}
}
Here's the output.
152 1 java.lang.String::charAt (33 bytes)
160 2 java.lang.String::indexOf (151 bytes)
165 3Starting microbenchmark on iterating over collections with a call to size() in each iteration java.lang.String::hashCode (60 bytes)
171 4 sun.nio.cs.UTF_8$Encoder::encodeArrayLoop (490 bytes)
183 5
java.lang.String::lastIndexOf (156 bytes)
197 6 java.io.UnixFileSystem::normalize (75 bytes)
200 7 java.lang.Object::<init> (1 bytes)
205 8 java.lang.Number::<init> (5 bytes)
206 9 java.lang.Integer::<init> (10 bytes)
211 10 java.util.ArrayList::add (29 bytes)
211 11 java.util.ArrayList::ensureCapacity (58 bytes)
217 12 java.lang.Integer::valueOf (35 bytes)
221 1% performance.api.PerformantIteration::main # 21 (173 bytes)
Warming up ...
252 13 java.util.ArrayList::get (11 bytes)
252 14 java.util.ArrayList::rangeCheck (22 bytes)
253 15 java.util.ArrayList::elementData (7 bytes)
260 2% performance.api.PerformantIteration::iterateWithConstantSize # 19 (59 bytes)
268 3% performance.api.PerformantIteration::iterateWithDynamicSize # 12 (57 bytes)
272 16 performance.api.PerformantIteration::iterateWithConstantSize (59 bytes)
278 17 performance.api.PerformantIteration::iterateWithDynamicSize (57 bytes)
Starting the actual test
Test completed... printing results
constantSizeBenchmark : 301688
dynamicSizeBenchmark : 782602
dynamicSizeBenchmark/constantSizeBenchmark : 2.5940773249184588
I don't understand these four lines from the output.
260 2% performance.api.PerformantIteration::iterateWithConstantSize # 19 (59 bytes)
268 3% performance.api.PerformantIteration::iterateWithDynamicSize # 12 (57 bytes)
272 16 performance.api.PerformantIteration::iterateWithConstantSize (59 bytes)
278 17 performance.api.PerformantIteration::iterateWithDynamicSize (57 bytes)
Why are both these methods being compiled twice ?
How do I read this output... what do the various numbers mean ?
I am going to attempt answering my own question with the help of this link posted by Thomas Jungblut.
260 2% performance.api.PerformantIteration::iterateWithConstantSize # 19 (59 bytes)
268 3% performance.api.PerformantIteration::iterateWithDynamicSize # 12 (57 bytes)
272 16 performance.api.PerformantIteration::iterateWithConstantSize (59 bytes)
278 17 performance.api.PerformantIteration::iterateWithDynamicSize (57 bytes)
First column
The first column '260' is the timestamp.
Second column
The second column is the compilation_id and method_attributes. When a HotSpot compilation is triggered, every compilation unit gets a compilation id. The number in the second column is the compilation id. JIT compilation, and OSR compilation have two different sequences for the compilation id. So 1% and 1 are different compilation units. The % in the first two rows, refer to the fact that this is an OSR compilation. An OSR compilation was triggered because the code was looping over a large loop, and the VM determined that this code is hot. So an OSR compilation was triggered, which would enable the VM to do an On Stack Replacement and move over to the optimized code, once it is ready.
Third column
The third column performance.api.PerformantIteration::iterateWithConstantSize is the method name.
Fourth column
The fourth column is again different when OSR compilation happens and when it does not. Let's look at the common parts first. The end of the fourth column (59 bytes), refers to the size of the compilation unit in bytecode (not the size of the compiled code). The # 19 part in OSR compilation refers to the osr_bci. I am going to quote from the link mentioned above -
A "place" in a Java method is defined by its bytecode index (BCI), and
the place that triggered an OSR compilation is called the "osr_bci".
An OSR-compiled nmethod can only be entered from its osr_bci; there
can be multiple OSR-compiled versions of the same method at the same
time, as long as their osr_bci differ.
Finally, why was the method compiled twice ?
The first one is an OSR compilation, which presumably happened while the loop was running due to the warmup code (in the example), and the second compilation is a JIT compilation, presumably to further optimize the compiled code ?
I think first time OSR happened , then it change the Invocation Counter tigger method compilar
(PS: sorry, my english is pool)