I have always wondered if, in general, declaring a throw-away variable before a loop, as opposed to repeatedly inside the loop, makes any (performance) difference?
A (quite pointless) example in Java:
a) declaration before loop:
double intermediateResult;
for(int i=0; i < 1000; i++){
intermediateResult = i;
System.out.println(intermediateResult);
}
b) declaration (repeatedly) inside loop:
for(int i=0; i < 1000; i++){
double intermediateResult = i;
System.out.println(intermediateResult);
}
Which one is better, a or b?
I suspect that repeated variable declaration (example b) creates more overhead in theory, but that compilers are smart enough that it doesn't matter. Example b has the advantage of being more compact and limiting the scope of the variable to where it is used. Still, I tend to code according to example a.
Edit: I am especially interested in the Java case.
Which is better, a or b?
From a performance perspective, you'd have to measure it. (And in my opinion, if you can measure a difference, the compiler isn't very good).
From a maintenance perspective, b is better. Declare and initialize variables in the same place, in the narrowest scope possible. Don't leave a gaping hole between the declaration and the initialization, and don't pollute namespaces you don't need to.
Well, I ran your A and B examples 20 times each, looping 100 million times (JVM 1.5.0).
A: average execution time: .074 sec
B: average execution time : .067 sec
To my surprise B was slightly faster.
As fast as computers are now, it's hard to say whether you could measure this accurately.
I would code it the A way as well but I would say it doesn't really matter.
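For anyone who wants to repeat this kind of measurement, here is a rough sketch of a timing harness. The class name LoopDeclTiming, the method names and the iteration counts are my own assumptions, not the code that produced the numbers above. Both methods return a sum so the JIT cannot simply discard the loop body, and a warm-up pass runs before timing. Hand-rolled timings like this are noisy, so treat small differences with suspicion.

```java
public class LoopDeclTiming {

    // Variant a: the temporary is declared once, before the loop.
    static double outside(int n) {
        double sum = 0;
        double intermediateResult;
        for (int i = 0; i < n; i++) {
            intermediateResult = i;
            sum += intermediateResult;
        }
        return sum;
    }

    // Variant b: the temporary is declared inside the loop body.
    static double inside(int n) {
        double sum = 0;
        for (int i = 0; i < n; i++) {
            double intermediateResult = i;
            sum += intermediateResult;
        }
        return sum;
    }

    public static void main(String[] args) {
        int n = 10_000_000;    // assumed iteration count, kept small so the sketch runs quickly
        int runs = 20;         // assumed number of timed runs

        // Warm-up so both methods are JIT-compiled before timing starts.
        for (int r = 0; r < 5; r++) {
            outside(n);
            inside(n);
        }

        long outsideNanos = 0, insideNanos = 0;
        double sink = 0;       // keeps the results observable
        for (int r = 0; r < runs; r++) {
            long t0 = System.nanoTime();
            sink += outside(n);
            long t1 = System.nanoTime();
            sink += inside(n);
            long t2 = System.nanoTime();
            outsideNanos += t1 - t0;
            insideNanos += t2 - t1;
        }
        System.out.printf("a (outside): %.3f s average%n", outsideNanos / 1e9 / runs);
        System.out.printf("b (inside):  %.3f s average%n", insideNanos / 1e9 / runs);
        System.out.println("checksum: " + sink);
    }
}
```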
It depends on the language and the exact use. For instance, in C# 1 it made no difference. In C# 2, if the local variable is captured by an anonymous method (or lambda expression in C# 3) it can make a very significant difference.
Example:
using System;
using System.Collections.Generic;
class Test
{
static void Main()
{
List<Action> actions = new List<Action>();
int outer;
for (int i=0; i < 10; i++)
{
outer = i;
int inner = i;
actions.Add(() => Console.WriteLine("Inner={0}, Outer={1}", inner, outer));
}
foreach (Action action in actions)
{
action();
}
}
}
Output:
Inner=0, Outer=9
Inner=1, Outer=9
Inner=2, Outer=9
Inner=3, Outer=9
Inner=4, Outer=9
Inner=5, Outer=9
Inner=6, Outer=9
Inner=7, Outer=9
Inner=8, Outer=9
Inner=9, Outer=9
The difference is that all of the actions capture the same outer variable, but each has its own separate inner variable.
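For comparison, a rough Java analogue (my own sketch; the class name CaptureDemo is hypothetical). Java avoids the surprise in a different way: a lambda may only capture local variables that are effectively final, so the shared outer variable cannot be captured at all, and only the per-iteration variable works.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntSupplier;

public class CaptureDemo {

    static List<IntSupplier> buildActions() {
        List<IntSupplier> actions = new ArrayList<>();
        int outer;
        for (int i = 0; i < 10; i++) {
            outer = i;
            int inner = i;   // a fresh, effectively final variable per iteration
            // Capturing 'outer' here would be a compile error, because it is
            // reassigned on every iteration and so is not effectively final:
            // actions.add(() -> outer);
            actions.add(() -> inner);
        }
        return actions;
    }

    public static void main(String[] args) {
        for (IntSupplier action : buildActions()) {
            System.out.println("Inner=" + action.getAsInt());
        }
    }
}
```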
The following is what I wrote and compiled in .NET.
double r0;
for (int i = 0; i < 1000; i++) {
r0 = i*i;
Console.WriteLine(r0);
}
for (int j = 0; j < 1000; j++) {
double r1 = j*j;
Console.WriteLine(r1);
}
This is what I get from .NET Reflector when CIL is rendered back into code.
for (int i = 0; i < 0x3e8; i++)
{
double r0 = i * i;
Console.WriteLine(r0);
}
for (int j = 0; j < 0x3e8; j++)
{
double r1 = j * j;
Console.WriteLine(r1);
}
So both look exactly the same after compilation. In managed languages, code is compiled to CIL/bytecode, which is converted into machine code at execution time. In machine code, the double may not even be created on the stack; it may just live in a register, since the code shows it is only a temporary for the WriteLine call. There is a whole set of optimization rules just for loops, so the average developer shouldn't worry about this, especially in managed languages. There are cases where you can optimize managed code yourself: for example, concatenating a large number of strings with string a; a += anotherstring[i] versus using StringBuilder, where the performance difference is very large. There are many such cases where the compiler cannot optimize your code because it cannot figure out what is intended in a bigger scope, but it can pretty much optimize the basic things for you.
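To make the string example concrete, here is a small Java sketch of the two styles (the answer above is about .NET; Java's StringBuilder plays the same role, and the class and method names are mine). Both methods return the same text, but the += version builds a new string on each iteration.

```java
public class ConcatDemo {

    // Repeated += creates a new String each time around the loop.
    static String concatWithPlus(String[] parts) {
        String a = "";
        for (int i = 0; i < parts.length; i++) {
            a += parts[i];
        }
        return a;
    }

    // StringBuilder appends into one growing buffer.
    static String concatWithBuilder(String[] parts) {
        StringBuilder sb = new StringBuilder();
        for (String part : parts) {
            sb.append(part);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] parts = {"declare", " ", "inside", " ", "or", " ", "outside"};
        System.out.println(concatWithPlus(parts));
        System.out.println(concatWithBuilder(parts));
    }
}
```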
This is a gotcha in VB.NET. The Visual Basic result won't reinitialize the variable in this example:
For i as Integer = 1 to 100
Dim j as Integer
Console.WriteLine(j)
j = i
Next
' Output: 0 1 2 3 4...
This will print 0 the first time (Visual Basic variables have default values when declared!) but the value assigned in the previous iteration each time after that.
If you add a = 0, though, you get what you might expect:
For i as Integer = 1 to 100
Dim j as Integer = 0
Console.WriteLine(j)
j = i
Next
'Output: 0 0 0 0 0...
I made a simple test:
int b;
for (int i = 0; i < 10; i++) {
b = i;
}
vs
for (int i = 0; i < 10; i++) {
int b = i;
}
I compiled both versions with gcc 5.2.0 and then disassembled main() for each. These are the results:
1º:
0x00000000004004b6 <+0>: push rbp
0x00000000004004b7 <+1>: mov rbp,rsp
0x00000000004004ba <+4>: mov DWORD PTR [rbp-0x4],0x0
0x00000000004004c1 <+11>: jmp 0x4004cd <main+23>
0x00000000004004c3 <+13>: mov eax,DWORD PTR [rbp-0x4]
0x00000000004004c6 <+16>: mov DWORD PTR [rbp-0x8],eax
0x00000000004004c9 <+19>: add DWORD PTR [rbp-0x4],0x1
0x00000000004004cd <+23>: cmp DWORD PTR [rbp-0x4],0x9
0x00000000004004d1 <+27>: jle 0x4004c3 <main+13>
0x00000000004004d3 <+29>: mov eax,0x0
0x00000000004004d8 <+34>: pop rbp
0x00000000004004d9 <+35>: ret
vs
2º
0x00000000004004b6 <+0>: push rbp
0x00000000004004b7 <+1>: mov rbp,rsp
0x00000000004004ba <+4>: mov DWORD PTR [rbp-0x4],0x0
0x00000000004004c1 <+11>: jmp 0x4004cd <main+23>
0x00000000004004c3 <+13>: mov eax,DWORD PTR [rbp-0x4]
0x00000000004004c6 <+16>: mov DWORD PTR [rbp-0x8],eax
0x00000000004004c9 <+19>: add DWORD PTR [rbp-0x4],0x1
0x00000000004004cd <+23>: cmp DWORD PTR [rbp-0x4],0x9
0x00000000004004d1 <+27>: jle 0x4004c3 <main+13>
0x00000000004004d3 <+29>: mov eax,0x0
0x00000000004004d8 <+34>: pop rbp
0x00000000004004d9 <+35>: ret
The asm output is exactly the same. Isn't that proof that the two versions produce the same thing?
It is language dependent - IIRC C# optimises this, so there isn't any difference, but JavaScript (for example) will do the whole memory allocation shebang each time.
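I would always use A (rather than relying on the compiler) and might also rewrite to:
for (int i = 0, intermediateResult = 0; i < 1000; i++) {
    intermediateResult = i;
    System.out.println(intermediateResult);
}
This still restricts intermediateResult to the loop's scope, but doesn't redeclare it during each iteration. Note that a Java for initializer can declare variables of only one type, so this works only when the loop counter and the temporary share a type (int here); mixing int and double in the initializer, as the question's example would require, does not compile.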
In my opinion, b is the better structure. In a, the last value of intermediateResult sticks around after your loop is finished.
Edit:
This doesn't make a lot of difference with value types, but reference types can be somewhat weighty. Personally, I like variables to be dereferenced as soon as possible for cleanup, and b does that for you.
I suspect a few compilers could optimize both to be the same code, but certainly not all. So I'd say you're better off with the former. The only reason for the latter is if you want to ensure that the declared variable is used only within your loop.
As a general rule, I declare my variables in the inner-most possible scope. So, if you're not using intermediateResult outside of the loop, then I'd go with B.
A co-worker prefers the first form, saying it is an optimization because it re-uses a declaration.
I prefer the second one (and try to persuade my co-worker! ;-)), having read that:
It reduces scope of variables to where they are needed, which is a good thing.
Java optimizes enough to make no significant difference in performance. IIRC, perhaps the second form is even faster.
Anyway, it falls into the category of premature optimization that relies on the quality of the compiler and/or JVM.
There is a difference in C# if you are using the variable in a lambda, etc. But in general the compiler will basically do the same thing, assuming the variable is only used within the loop.
Given that they are basically the same: note that version b makes it much more obvious to readers that the variable isn't, and can't be, used after the loop. Additionally, version b is much more easily refactored. It is more difficult to extract the loop body into its own method in version a. Moreover, version b assures you that there is no side effect to such a refactoring.
Hence, version a annoys me to no end, because there's no benefit to it and it makes it much more difficult to reason about the code...
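A small sketch of the refactoring point, using the question's example (the class name RefactorDemo and the helper name toIntermediate are hypothetical). Because the variable in version b lives entirely inside the loop body, the body's computation can be lifted into a method without first checking whether anything after the loop reads the variable.

```java
public class RefactorDemo {

    // The computation from the former loop body, extracted as-is.
    static double toIntermediate(int i) {
        double intermediateResult = i;
        return intermediateResult;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 5; i++) {
            System.out.println(toIntermediate(i));
        }
    }
}
```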
Well, you could always make a scope for that:
{ //Or if(true) if the language doesn't support making scopes like this
double intermediateResult;
for (int i=0; i<1000; i++) {
intermediateResult = i;
System.out.println(intermediateResult);
}
}
This way you only declare the variable once, and it'll die when you leave the loop.
I think it depends on the compiler and is hard to give a general answer.
I've always thought that if you declare your variables inside of your loop then you're wasting memory. If you have something like this:
for(;;) {
Object o = new Object();
}
Then not only does the object need to be created for each iteration, but there needs to be a new reference allocated for each object. It seems that if the garbage collector is slow then you'll have a bunch of dangling references that need to be cleaned up.
However, if you have this:
Object o;
for(;;) {
o = new Object();
}
Then you're only creating a single reference and assigning a new object to it each time. Sure, it might take a bit longer for it to go out of scope, but then there's only one dangling reference to deal with.
My practice is the following:
If the type of the variable is simple (int, double, ...), I prefer variant b (inside).
Reason: it reduces the scope of the variable.
If the type of the variable is not simple (some kind of class or struct), I prefer variant a (outside).
Reason: it reduces the number of ctor/dtor calls.
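The ctor/dtor wording suggests C++, but a loose Java analogue of the second rule is hoisting object construction out of the loop. The sketch below is my own illustration (class and method names are hypothetical): one version constructs a StringBuilder per iteration, the other constructs it once and resets it. Whether this is measurably faster depends on the JVM, so it is only meant to show the shape of the two variants.

```java
public class ReuseDemo {

    // Variant b: a new builder is constructed on every iteration.
    static String[] labelsInside(int n) {
        String[] out = new String[n];
        for (int i = 0; i < n; i++) {
            StringBuilder sb = new StringBuilder();
            sb.append("item-").append(i);
            out[i] = sb.toString();
        }
        return out;
    }

    // Variant a: one builder is constructed once and cleared per iteration.
    static String[] labelsOutside(int n) {
        String[] out = new String[n];
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.setLength(0);   // discard the previous iteration's content
            sb.append("item-").append(i);
            out[i] = sb.toString();
        }
        return out;
    }

    public static void main(String[] args) {
        for (String label : labelsOutside(3)) {
            System.out.println(label);
        }
    }
}
```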
I had this very same question for a long time. So I tested an even simpler piece of code.
Conclusion: For such cases there is NO performance difference.
Outside loop case
int intermediateResult;
for(int i=0; i < 1000; i++){
intermediateResult = i+2;
System.out.println(intermediateResult);
}
Inside loop case
for(int i=0; i < 1000; i++){
int intermediateResult = i+2;
System.out.println(intermediateResult);
}
I checked the compiled file with IntelliJ's decompiler, and for both cases I got the same Test.class:
for(int i = 0; i < 1000; ++i) {
int intermediateResult = i + 2;
System.out.println(intermediateResult);
}
I also disassembled the code for both cases using the method given in this answer. I'll show only the parts relevant to the answer.
Outside loop case
Code:
stack=2, locals=3, args_size=1
0: iconst_0
1: istore_2
2: iload_2
3: sipush 1000
6: if_icmpge 26
9: iload_2
10: iconst_2
11: iadd
12: istore_1
13: getstatic #2 // Field java/lang/System.out:Ljava/io/PrintStream;
16: iload_1
17: invokevirtual #3 // Method java/io/PrintStream.println:(I)V
20: iinc 2, 1
23: goto 2
26: return
LocalVariableTable:
Start Length Slot Name Signature
13 13 1 intermediateResult I
2 24 2 i I
0 27 0 args [Ljava/lang/String;
Inside loop case
Code:
stack=2, locals=3, args_size=1
0: iconst_0
1: istore_1
2: iload_1
3: sipush 1000
6: if_icmpge 26
9: iload_1
10: iconst_2
11: iadd
12: istore_2
13: getstatic #2 // Field java/lang/System.out:Ljava/io/PrintStream;
16: iload_2
17: invokevirtual #3 // Method java/io/PrintStream.println:(I)V
20: iinc 1, 1
23: goto 2
26: return
LocalVariableTable:
Start Length Slot Name Signature
13 7 2 intermediateResult I
2 24 1 i I
0 27 0 args [Ljava/lang/String;
If you pay close attention, only the Slot assigned to i and intermediateResult in LocalVariableTable is swapped as a product of their order of appearance. The same difference in slot is reflected in other lines of code.
No extra operation is being performed
intermediateResult is still a local variable in both cases, so there is no difference in access time.
BONUS
Compilers do a ton of optimization; take a look at what happens in this case.
Zero work case
for(int i=0; i < 1000; i++){
int intermediateResult = i;
System.out.println(intermediateResult);
}
Zero work decompiled
for(int i = 0; i < 1000; ++i) {
System.out.println(i);
}
From a performance perspective, outside is (much) better.
public static void outside() {
double intermediateResult;
for(int i=0; i < Integer.MAX_VALUE; i++){
intermediateResult = i;
}
}
public static void inside() {
for(int i=0; i < Integer.MAX_VALUE; i++){
double intermediateResult = i;
}
}
I executed both functions 1 billion times each.
outside() took 65 milliseconds. inside() took 1.5 seconds.
I tested for JS with Node 4.0.0 if anyone is interested. Declaring outside the loop resulted in a ~.5 ms performance improvement on average over 1000 trials with 100 million loop iterations per trial. So I'm gonna say go ahead and write it in the most readable / maintainable way which is B, imo. I would put my code in a fiddle, but I used the performance-now Node module. Here's the code:
var now = require("../node_modules/performance-now")
// declare vars inside loop
function varInside(){
for(var i = 0; i < 100000000; i++){
var temp = i;
var temp2 = i + 1;
var temp3 = i + 2;
}
}
// declare vars outside loop
function varOutside(){
var temp;
var temp2;
var temp3;
for(var i = 0; i < 100000000; i++){
temp = i
temp2 = i + 1
temp3 = i + 2
}
}
// for computing average execution times
var insideAvg = 0;
var outsideAvg = 0;
// run varInside 1000 times, keeping a running (exponentially weighted) average of execution times
for(var i = 0; i < 1000; i++){
var start = now()
varInside()
var end = now()
insideAvg = (insideAvg + (end-start)) / 2
}
// run varOutside 1000 times, keeping a running (exponentially weighted) average of execution times
for(var i = 0; i < 1000; i++){
var start = now()
varOutside()
var end = now()
outsideAvg = (outsideAvg + (end-start)) / 2
}
console.log('declared inside loop', insideAvg)
console.log('declared outside loop', outsideAvg)
A) is a safer bet than B). Imagine you are initializing a structure in the loop rather than an 'int' or 'float'; then what?
For example:
typedef struct loop_example{
JXTZ hi; // where JXTZ could be another type...say closed source lib
// you include in Makefile
}loop_example_struct;
//then....
int j = 0; // declare here; declaring it in the for header is an error before C99 (depends on compiler settings)
for ( ;j++; )
{
loop_example_struct loop_object; // guess the result in memory heap?
}
You are certainly bound to face problems with memory leaks! Hence I believe 'A' is the safer bet, while 'B' is vulnerable to memory accumulation, especially when working with closed-source libraries. You can check this using the Valgrind tool on Linux, specifically its leak-checking sub-tool Memcheck.
It's an interesting question. From my experience there is an ultimate question to consider when you debate this matter for a code:
Is there any reason why the variable would need to be global?
It makes sense to only declare the variable once, globally, as opposed to many times locally, because it is better for organizing the code and requires fewer lines of code. However, if it only needs to be declared locally within one method, I would initialize it in that method so it is clear that the variable is exclusively relevant to that method. Be careful not to use this variable outside the method in which it is initialized if you choose the latter option: your code won't know what you're talking about and will report an error.
Also, as a side note, don't duplicate local variable names between different methods even if their purposes are near-identical; it just gets confusing.
This is the better form:
double intermediateResult;
int i = 0;
for (; i < 1000; i++)
{
    intermediateResult = i;
    System.out.println(intermediateResult);
}
1) This way both variables are declared once, not on every loop iteration.
2) The assignment is faster than in all the other options.
3) So the best-practice rule is to put every declaration outside the loop.
I tried the same thing in Go and compared the compiler output using go tool compile -S with Go 1.9.4.
Zero difference, as per the assembler output.
I use (A) when I want to see the contents of the variable after exiting the loop. It only matters for debugging. I use (B) when I want the code more compact, since it saves one line of code.
Even if I know my compiler is smart enough, I don't like to rely on it, and will use the a) variant.
The b) variant makes sense to me only if you desperately need to make intermediateResult unavailable after the loop body. But I can't imagine such a desperate situation anyway...
EDIT: Jon Skeet made a very good point, showing that variable declaration inside a loop can make an actual semantic difference.
Related
In this code:
if (value >= x && value <= y) {
when value >= x and value <= y are as likely true as false with no particular pattern, would using the & operator be faster than using &&?
Specifically, I am thinking about how && lazily evaluates the right-hand-side expression (i.e. only if the LHS is true), which implies a conditional, whereas in Java & in this context guarantees strict evaluation of both (boolean) sub-expressions. The resulting value is the same either way.
But whilst a >= or <= operator will use a simple comparison instruction, the && must involve a branch, and that branch is susceptible to branch prediction failure - as per this Very Famous Question: Why is it faster to process a sorted array than an unsorted array?
So, forcing the expression to have no lazy components will surely be more deterministic and not be vulnerable to prediction failure. Right?
Notes:
obviously the answer to my question would be No if the code looked like this: if(value >= x && verySlowFunction()). I am focusing on "sufficiently simple" RHS expressions.
there's a conditional branch in there anyway (the if statement). I can't quite prove to myself that that is irrelevant, and that alternative formulations might be better examples, like boolean b = value >= x && value <= y;
this all falls into the world of horrendous micro-optimizations. Yeah, I know :-) ... interesting though?
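To make the evaluation-order point concrete, here is a small sketch (all names are mine) that counts how often the right-hand side runs. With && the right-hand side is skipped whenever the left side is false; with & it always runs, which is why the two are only interchangeable when the right-hand side is cheap and free of side effects.

```java
public class EvalOrderDemo {

    static int rhsCalls = 0;

    // Right-hand side with a visible side effect: it counts its own calls.
    static boolean atMost(int value, int y) {
        rhsCalls++;
        return value <= y;
    }

    static boolean shortCircuit(int x, int value, int y) {
        return value >= x && atMost(value, y);
    }

    static boolean strict(int x, int value, int y) {
        return value >= x & atMost(value, y);
    }

    public static void main(String[] args) {
        rhsCalls = 0;
        boolean a = shortCircuit(10, 5, 20);   // left side false, right side skipped
        int callsAfterShortCircuit = rhsCalls;

        rhsCalls = 0;
        boolean b = strict(10, 5, 20);         // left side false, right side still runs
        int callsAfterStrict = rhsCalls;

        System.out.println(a + " with " + callsAfterShortCircuit + " right-hand evaluations");
        System.out.println(b + " with " + callsAfterStrict + " right-hand evaluations");
    }
}
```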
Update
Just to explain why I'm interested: I've been staring at the systems that Martin Thompson has been writing about on his Mechanical Sympathy blog, after he came and did a talk about Aeron. One of the key messages is that our hardware has all this magical stuff in it, and we software developers tragically fail to take advantage of it. Don't worry, I'm not about to go s/&&/\&/ on all my code :-) ... but there are a number of questions on this site on improving branch prediction by removing branches, and it occurred to me that the conditional boolean operators are at the core of test conditions.
Of course, @StephenC makes the fantastic point that bending your code into weird shapes can make it less easy for JITs to spot common optimizations - if not now, then in the future. And that the Very Famous Question mentioned above is special because it pushes the prediction complexity far beyond practical optimization.
I'm pretty much aware that in most (or almost all) situations, && is the clearest, simplest, fastest, best thing to do - although I'm very grateful to the people who have posted answers demonstrating this! I'm really interested to see if there are actually any cases in anyone's experience where the answer to "Can & be faster?" might be Yes...
Update 2:
(Addressing advice that the question is overly broad. I don't want to make major changes to this question because it might compromise some of the answers below, which are of exceptional quality!) Perhaps an example in the wild is called for; this is from the Guava LongMath class (thanks hugely to @maaartinus for finding this):
public static boolean isPowerOfTwo(long x) {
return x > 0 & (x & (x - 1)) == 0;
}
See that first &? And if you check the link, the next method is called lessThanBranchFree(...), which hints that we are in branch-avoidance territory - and Guava is really widely used: every cycle saved causes sea-levels to drop visibly. So let's put the question this way: is this use of & (where && would be more normal) a real optimization?
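Whatever the answer on speed, the two spellings do at least return the same result here, since neither operand has side effects. A quick sketch to check that (the class and method names are mine, not Guava's):

```java
public class PowerOfTwoCheck {

    // Non-short-circuit form, as in the Guava snippet above.
    static boolean withAnd(long x) {
        return x > 0 & (x & (x - 1)) == 0;
    }

    // Conventional short-circuit form.
    static boolean withAndAnd(long x) {
        return x > 0 && (x & (x - 1)) == 0;
    }

    public static void main(String[] args) {
        long[] samples = {Long.MIN_VALUE, -8, -1, 0, 1, 2, 3, 1024, 1025, Long.MAX_VALUE};
        for (long x : samples) {
            if (withAnd(x) != withAndAnd(x)) {
                throw new AssertionError("mismatch for " + x);
            }
            System.out.println(x + " -> " + withAnd(x));
        }
    }
}
```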
Ok, so you want to know how it behaves at the lower level... Let's have a look at the bytecode then!
EDIT : added the generated assembly code for AMD64, at the end. Have a look for some interesting notes.
EDIT 2 (re: OP's "Update 2"): added asm code for Guava's isPowerOfTwo method as well.
Java source
I wrote these two quick methods:
public boolean AndSC(int x, int value, int y) {
return value >= x && value <= y;
}
public boolean AndNonSC(int x, int value, int y) {
return value >= x & value <= y;
}
As you can see, they are exactly the same, save for the type of AND operator.
Java bytecode
And this is the generated bytecode:
public AndSC(III)Z
L0
LINENUMBER 8 L0
ILOAD 2
ILOAD 1
IF_ICMPLT L1
ILOAD 2
ILOAD 3
IF_ICMPGT L1
L2
LINENUMBER 9 L2
ICONST_1
IRETURN
L1
LINENUMBER 11 L1
FRAME SAME
ICONST_0
IRETURN
L3
LOCALVARIABLE this Ltest/lsoto/AndTest; L0 L3 0
LOCALVARIABLE x I L0 L3 1
LOCALVARIABLE value I L0 L3 2
LOCALVARIABLE y I L0 L3 3
MAXSTACK = 2
MAXLOCALS = 4
// access flags 0x1
public AndNonSC(III)Z
L0
LINENUMBER 15 L0
ILOAD 2
ILOAD 1
IF_ICMPLT L1
ICONST_1
GOTO L2
L1
FRAME SAME
ICONST_0
L2
FRAME SAME1 I
ILOAD 2
ILOAD 3
IF_ICMPGT L3
ICONST_1
GOTO L4
L3
FRAME SAME1 I
ICONST_0
L4
FRAME FULL [test/lsoto/AndTest I I I] [I I]
IAND
IFEQ L5
L6
LINENUMBER 16 L6
ICONST_1
IRETURN
L5
LINENUMBER 18 L5
FRAME SAME
ICONST_0
IRETURN
L7
LOCALVARIABLE this Ltest/lsoto/AndTest; L0 L7 0
LOCALVARIABLE x I L0 L7 1
LOCALVARIABLE value I L0 L7 2
LOCALVARIABLE y I L0 L7 3
MAXSTACK = 3
MAXLOCALS = 4
The AndSC (&&) method generates two conditional jumps, as expected:
It loads value and x onto the stack, and jumps to L1 if value is lower. Else it keeps running the next lines.
It loads value and y onto the stack, and jumps to L1 also, if value is greater. Else it keeps running the next lines.
Which happen to be a return true in case none of the two jumps were made.
And then we have the lines marked as L1 which are a return false.
The AndNonSC (&) method, however, generates three conditional jumps!
It loads value and x onto the stack and jumps to L1 if value is lower. This time it needs to save the result so that it can be combined with the other half of the AND, so it has to execute either "save true" or "save false"; it can't do both with the same instruction.
It loads value and y onto the stack and jumps to L3 if value is greater. Once again it needs to save true or false, and that's two different lines depending on the comparison result.
Now that both comparisons are done, the code actually executes the AND operation -- and if the result is false, it jumps (for a third time) to return false; or else it continues execution onto the next line to return true.
(Preliminary) Conclusion
Though I'm not very experienced with Java bytecode and I may have overlooked something, it seems to me that & will actually perform worse than && in every case: it generates more instructions to execute, including more conditional jumps to predict and possibly fail at.
A rewriting of the code to replace comparisons with arithmetical operations, as someone else proposed, might be a way to make & a better option, but at the cost of making the code much less clear.
IMHO it is not worth the hassle for 99% of the scenarios (it may be very well worth it for the 1% loops that need to be extremely optimized, though).
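For illustration, one way such an arithmetic rewrite could look (my own sketch, not code from the comments). It relies on the assumption that value - x and y - value do not overflow; under that assumption both differences are non-negative exactly when value lies between x and y, and OR-ing them leaves the sign bit set if either one is negative. It also shows the readability cost mentioned above.

```java
public class RangeCheck {

    // Conventional form with short-circuit AND.
    static boolean inRange(int x, int value, int y) {
        return value >= x && value <= y;
    }

    // Arithmetic form with a single comparison in the source.
    // ASSUMPTION: neither subtraction overflows for the inputs used.
    static boolean inRangeArithmetic(int x, int value, int y) {
        return ((value - x) | (y - value)) >= 0;
    }

    public static void main(String[] args) {
        int x = 10, y = 20;
        for (int value = 5; value <= 25; value++) {
            if (inRange(x, value, y) != inRangeArithmetic(x, value, y)) {
                throw new AssertionError("mismatch for " + value);
            }
        }
        System.out.println("both forms agree for 5..25 with x=10, y=20");
    }
}
```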
EDIT: AMD64 assembly
As noted in the comments, the same Java bytecode can lead to different machine code in different systems, so while the Java bytecode might give us a hint about which AND version performs better, getting the actual ASM as generated by the compiler is the only way to really find out.
I printed the AMD64 ASM instructions for both methods; below are the relevant lines (stripped entry points etc.).
NOTE: all methods compiled with java 1.8.0_91 unless otherwise stated.
Method AndSC with default options
# {method} {0x0000000016da0810} 'AndSC' '(III)Z' in 'AndTest'
...
0x0000000002923e3e: cmp %r8d,%r9d
0x0000000002923e41: movabs $0x16da0a08,%rax ; {metadata(method data for {method} {0x0000000016da0810} 'AndSC' '(III)Z' in 'AndTest')}
0x0000000002923e4b: movabs $0x108,%rsi
0x0000000002923e55: jl 0x0000000002923e65
0x0000000002923e5b: movabs $0x118,%rsi
0x0000000002923e65: mov (%rax,%rsi,1),%rbx
0x0000000002923e69: lea 0x1(%rbx),%rbx
0x0000000002923e6d: mov %rbx,(%rax,%rsi,1)
0x0000000002923e71: jl 0x0000000002923eb0 ;*if_icmplt
; - AndTest::AndSC@2 (line 22)
0x0000000002923e77: cmp %edi,%r9d
0x0000000002923e7a: movabs $0x16da0a08,%rax ; {metadata(method data for {method} {0x0000000016da0810} 'AndSC' '(III)Z' in 'AndTest')}
0x0000000002923e84: movabs $0x128,%rsi
0x0000000002923e8e: jg 0x0000000002923e9e
0x0000000002923e94: movabs $0x138,%rsi
0x0000000002923e9e: mov (%rax,%rsi,1),%rdi
0x0000000002923ea2: lea 0x1(%rdi),%rdi
0x0000000002923ea6: mov %rdi,(%rax,%rsi,1)
0x0000000002923eaa: jle 0x0000000002923ec1 ;*if_icmpgt
; - AndTest::AndSC@7 (line 22)
0x0000000002923eb0: mov $0x0,%eax
0x0000000002923eb5: add $0x30,%rsp
0x0000000002923eb9: pop %rbp
0x0000000002923eba: test %eax,-0x1c73dc0(%rip) # 0x0000000000cb0100
; {poll_return}
0x0000000002923ec0: retq ;*ireturn
; - AndTest::AndSC@13 (line 25)
0x0000000002923ec1: mov $0x1,%eax
0x0000000002923ec6: add $0x30,%rsp
0x0000000002923eca: pop %rbp
0x0000000002923ecb: test %eax,-0x1c73dd1(%rip) # 0x0000000000cb0100
; {poll_return}
0x0000000002923ed1: retq
Method AndSC with -XX:PrintAssemblyOptions=intel option
# {method} {0x00000000170a0810} 'AndSC' '(III)Z' in 'AndTest'
...
0x0000000002c26e2c: cmp r9d,r8d
0x0000000002c26e2f: jl 0x0000000002c26e36 ;*if_icmplt
0x0000000002c26e31: cmp r9d,edi
0x0000000002c26e34: jle 0x0000000002c26e44 ;*iconst_0
0x0000000002c26e36: xor eax,eax ;*synchronization entry
0x0000000002c26e38: add rsp,0x10
0x0000000002c26e3c: pop rbp
0x0000000002c26e3d: test DWORD PTR [rip+0xffffffffffce91bd],eax # 0x0000000002910000
0x0000000002c26e43: ret
0x0000000002c26e44: mov eax,0x1
0x0000000002c26e49: jmp 0x0000000002c26e38
Method AndNonSC with default options
# {method} {0x0000000016da0908} 'AndNonSC' '(III)Z' in 'AndTest'
...
0x0000000002923a78: cmp %r8d,%r9d
0x0000000002923a7b: mov $0x0,%eax
0x0000000002923a80: jl 0x0000000002923a8b
0x0000000002923a86: mov $0x1,%eax
0x0000000002923a8b: cmp %edi,%r9d
0x0000000002923a8e: mov $0x0,%esi
0x0000000002923a93: jg 0x0000000002923a9e
0x0000000002923a99: mov $0x1,%esi
0x0000000002923a9e: and %rsi,%rax
0x0000000002923aa1: cmp $0x0,%eax
0x0000000002923aa4: je 0x0000000002923abb ;*ifeq
; - AndTest::AndNonSC@21 (line 29)
0x0000000002923aaa: mov $0x1,%eax
0x0000000002923aaf: add $0x30,%rsp
0x0000000002923ab3: pop %rbp
0x0000000002923ab4: test %eax,-0x1c739ba(%rip) # 0x0000000000cb0100
; {poll_return}
0x0000000002923aba: retq ;*ireturn
; - AndTest::AndNonSC@25 (line 30)
0x0000000002923abb: mov $0x0,%eax
0x0000000002923ac0: add $0x30,%rsp
0x0000000002923ac4: pop %rbp
0x0000000002923ac5: test %eax,-0x1c739cb(%rip) # 0x0000000000cb0100
; {poll_return}
0x0000000002923acb: retq
Method AndNonSC with -XX:PrintAssemblyOptions=intel option
# {method} {0x00000000170a0908} 'AndNonSC' '(III)Z' in 'AndTest'
...
0x0000000002c270b5: cmp r9d,r8d
0x0000000002c270b8: jl 0x0000000002c270df ;*if_icmplt
0x0000000002c270ba: mov r8d,0x1 ;*iload_2
0x0000000002c270c0: cmp r9d,edi
0x0000000002c270c3: cmovg r11d,r10d
0x0000000002c270c7: and r8d,r11d
0x0000000002c270ca: test r8d,r8d
0x0000000002c270cd: setne al
0x0000000002c270d0: movzx eax,al
0x0000000002c270d3: add rsp,0x10
0x0000000002c270d7: pop rbp
0x0000000002c270d8: test DWORD PTR [rip+0xffffffffffce8f22],eax # 0x0000000002910000
0x0000000002c270de: ret
0x0000000002c270df: xor r8d,r8d
0x0000000002c270e2: jmp 0x0000000002c270c0
First of all, the generated ASM code differs depending on whether we choose the default AT&T syntax or the Intel syntax.
With AT&T syntax:
The ASM code is actually longer for the AndSC method, with every bytecode IF_ICMP* translated to two assembly jump instructions, for a total of 4 conditional jumps.
Meanwhile, for the AndNonSC method the compiler generates a more straight-forward code, where each bytecode IF_ICMP* is translated to only one assembly jump instruction, keeping the original count of 3 conditional jumps.
With Intel syntax:
The ASM code for AndSC is shorter, with just 2 conditional jumps (not counting the non-conditional jmp at the end). Actually it's just two CMP, two JL/E and a XOR/MOV depending on the result.
The ASM code for AndNonSC is now longer than the AndSC one! However, it has just 1 conditional jump (for the first comparison), using the registers to directly compare the first result with the second, without any more jumps.
Conclusion after ASM code analysis
At AMD64 machine-language level, the & operator seems to generate ASM code with fewer conditional jumps, which might be better for high prediction-failure rates (random values for example).
On the other hand, the && operator seems to generate ASM code with fewer instructions (with the -XX:PrintAssemblyOptions=intel option anyway), which might be better for really long loops with prediction-friendly inputs, where the smaller number of CPU cycles for each comparison can make a difference in the long run.
As I stated in some of the comments, this is going to vary greatly between systems, so if we're talking about branch-prediction optimization, the only real answer would be: it depends on your JVM implementation, your compiler, your CPU and your input data.
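If you want to check this on your own machine, a naive timing sketch could look like the following (the class name, array size, bounds and seed are my assumptions). It feeds the same pseudo-random values to both forms and prints a checksum so the work cannot be optimized away. For results you intend to rely on, a dedicated harness such as JMH is a better choice than hand-rolled timing.

```java
import java.util.Random;

public class AndTiming {

    static int countShortCircuit(int[] values, int x, int y) {
        int hits = 0;
        for (int value : values) {
            if (value >= x && value <= y) {
                hits++;
            }
        }
        return hits;
    }

    static int countStrict(int[] values, int x, int y) {
        int hits = 0;
        for (int value : values) {
            if (value >= x & value <= y) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Unpredictable input: uniformly random values in [0, 100).
        int[] values = new Random(42).ints(1_000_000, 0, 100).toArray();
        int x = 25, y = 75;
        long sink = 0;

        // Warm-up so both methods are JIT-compiled before timing starts.
        for (int r = 0; r < 200; r++) {
            sink += countShortCircuit(values, x, y) + countStrict(values, x, y);
        }

        long t0 = System.nanoTime();
        for (int r = 0; r < 200; r++) sink += countShortCircuit(values, x, y);
        long t1 = System.nanoTime();
        for (int r = 0; r < 200; r++) sink += countStrict(values, x, y);
        long t2 = System.nanoTime();

        System.out.printf("&&: %.1f ms%n", (t1 - t0) / 1e6);
        System.out.printf("& : %.1f ms%n", (t2 - t1) / 1e6);
        System.out.println("checksum: " + sink);
    }
}
```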
Addendum: Guava's isPowerOfTwo method
Here, Guava's developers have come up with a neat way of calculating if a given number is a power of 2:
public static boolean isPowerOfTwo(long x) {
return x > 0 & (x & (x - 1)) == 0;
}
Quoting OP:
is this use of & (where && would be more normal) a real optimization?
To find out if it is, I added two similar methods to my test class:
public boolean isPowerOfTwoAND(long x) {
return x > 0 & (x & (x - 1)) == 0;
}
public boolean isPowerOfTwoANDAND(long x) {
return x > 0 && (x & (x - 1)) == 0;
}
Intel's ASM code for Guava's version
# {method} {0x0000000017580af0} 'isPowerOfTwoAND' '(J)Z' in 'AndTest'
# this: rdx:rdx = 'AndTest'
# parm0: r8:r8 = long
...
0x0000000003103bbe: movabs rax,0x0
0x0000000003103bc8: cmp rax,r8
0x0000000003103bcb: movabs rax,0x175811f0 ; {metadata(method data for {method} {0x0000000017580af0} 'isPowerOfTwoAND' '(J)Z' in 'AndTest')}
0x0000000003103bd5: movabs rsi,0x108
0x0000000003103bdf: jge 0x0000000003103bef
0x0000000003103be5: movabs rsi,0x118
0x0000000003103bef: mov rdi,QWORD PTR [rax+rsi*1]
0x0000000003103bf3: lea rdi,[rdi+0x1]
0x0000000003103bf7: mov QWORD PTR [rax+rsi*1],rdi
0x0000000003103bfb: jge 0x0000000003103c1b ;*lcmp
0x0000000003103c01: movabs rax,0x175811f0 ; {metadata(method data for {method} {0x0000000017580af0} 'isPowerOfTwoAND' '(J)Z' in 'AndTest')}
0x0000000003103c0b: inc DWORD PTR [rax+0x128]
0x0000000003103c11: mov eax,0x1
0x0000000003103c16: jmp 0x0000000003103c20 ;*goto
0x0000000003103c1b: mov eax,0x0 ;*lload_1
0x0000000003103c20: mov rsi,r8
0x0000000003103c23: movabs r10,0x1
0x0000000003103c2d: sub rsi,r10
0x0000000003103c30: and rsi,r8
0x0000000003103c33: movabs rdi,0x0
0x0000000003103c3d: cmp rsi,rdi
0x0000000003103c40: movabs rsi,0x175811f0 ; {metadata(method data for {method} {0x0000000017580af0} 'isPowerOfTwoAND' '(J)Z' in 'AndTest')}
0x0000000003103c4a: movabs rdi,0x140
0x0000000003103c54: jne 0x0000000003103c64
0x0000000003103c5a: movabs rdi,0x150
0x0000000003103c64: mov rbx,QWORD PTR [rsi+rdi*1]
0x0000000003103c68: lea rbx,[rbx+0x1]
0x0000000003103c6c: mov QWORD PTR [rsi+rdi*1],rbx
0x0000000003103c70: jne 0x0000000003103c90 ;*lcmp
0x0000000003103c76: movabs rsi,0x175811f0 ; {metadata(method data for {method} {0x0000000017580af0} 'isPowerOfTwoAND' '(J)Z' in 'AndTest')}
0x0000000003103c80: inc DWORD PTR [rsi+0x160]
0x0000000003103c86: mov esi,0x1
0x0000000003103c8b: jmp 0x0000000003103c95 ;*goto
0x0000000003103c90: mov esi,0x0 ;*iand
0x0000000003103c95: and rsi,rax
0x0000000003103c98: and esi,0x1
0x0000000003103c9b: mov rax,rsi
0x0000000003103c9e: add rsp,0x50
0x0000000003103ca2: pop rbp
0x0000000003103ca3: test DWORD PTR [rip+0xfffffffffe44c457],eax # 0x0000000001550100
0x0000000003103ca9: ret
Intel's asm code for && version
# {method} {0x0000000017580bd0} 'isPowerOfTwoANDAND' '(J)Z' in 'AndTest'
# this: rdx:rdx = 'AndTest'
# parm0: r8:r8 = long
...
0x0000000003103438: movabs rax,0x0
0x0000000003103442: cmp rax,r8
0x0000000003103445: jge 0x0000000003103471 ;*lcmp
0x000000000310344b: mov rax,r8
0x000000000310344e: movabs r10,0x1
0x0000000003103458: sub rax,r10
0x000000000310345b: and rax,r8
0x000000000310345e: movabs rsi,0x0
0x0000000003103468: cmp rax,rsi
0x000000000310346b: je 0x000000000310347b ;*lcmp
0x0000000003103471: mov eax,0x0
0x0000000003103476: jmp 0x0000000003103480 ;*ireturn
0x000000000310347b: mov eax,0x1 ;*goto
0x0000000003103480: and eax,0x1
0x0000000003103483: add rsp,0x40
0x0000000003103487: pop rbp
0x0000000003103488: test DWORD PTR [rip+0xfffffffffe44cc72],eax # 0x0000000001550100
0x000000000310348e: ret
In this specific example, the JIT compiler generates far less assembly code for the && version than for Guava's & version (and, after yesterday's results, I was honestly surprised by this).
Compared to Guava's, the && version translates to 25% less bytecode for the JIT to compile, 50% fewer assembly instructions, and only two conditional jumps (the & version has four of them).
So everything points to Guava's & method being less efficient than the more "natural" && version.
... Or is it?
As noted before, I'm running the above examples with Java 8:
C:\....>java -version
java version "1.8.0_91"
Java(TM) SE Runtime Environment (build 1.8.0_91-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.91-b14, mixed mode)
But what if I switch to Java 7?
C:\....>c:\jdk1.7.0_79\bin\java -version
java version "1.7.0_79"
Java(TM) SE Runtime Environment (build 1.7.0_79-b15)
Java HotSpot(TM) 64-Bit Server VM (build 24.79-b02, mixed mode)
C:\....>c:\jdk1.7.0_79\bin\java -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=print,*AndTest.isPowerOfTwoAND -XX:PrintAssemblyOptions=intel AndTestMain
.....
0x0000000002512bac: xor r10d,r10d
0x0000000002512baf: mov r11d,0x1
0x0000000002512bb5: test r8,r8
0x0000000002512bb8: jle 0x0000000002512bde ;*ifle
0x0000000002512bba: mov eax,0x1 ;*lload_1
0x0000000002512bbf: mov r9,r8
0x0000000002512bc2: dec r9
0x0000000002512bc5: and r9,r8
0x0000000002512bc8: test r9,r9
0x0000000002512bcb: cmovne r11d,r10d
0x0000000002512bcf: and eax,r11d ;*iand
0x0000000002512bd2: add rsp,0x10
0x0000000002512bd6: pop rbp
0x0000000002512bd7: test DWORD PTR [rip+0xffffffffffc0d423],eax # 0x0000000002120000
0x0000000002512bdd: ret
0x0000000002512bde: xor eax,eax
0x0000000002512be0: jmp 0x0000000002512bbf
.....
Surprise! The assembly code generated for the & method by the JIT compiler in Java 7 has only one conditional jump now, and is way shorter! Whereas the && method (you'll have to trust me on this one, I don't want to clutter the ending!) remains about the same, with its two conditional jumps and a couple fewer instructions, tops.
Looks like Guava's engineers knew what they were doing, after all! (if they were trying to optimize Java 7 execution time, that is ;-)
So back to OP's latest question:
is this use of & (where && would be more normal) a real optimization?
And IMHO the answer is the same, even for this (very!) specific scenario: it depends on your JVM implementation, your compiler, your CPU and your input data.
For this kind of question you should run a microbenchmark. I used JMH for this test.
The benchmarks are implemented as
// boolean logical AND
bh.consume(value >= x & y <= value);
and
// conditional AND
bh.consume(value >= x && y <= value);
and
// bitwise OR, as suggested by Joop Eggen
bh.consume(((value - x) | (y - value)) >= 0);
With values for value, x and y according to the benchmark name.
The result (five warmup and ten measurement iterations) for throughput benchmarking is:
Benchmark Mode Cnt Score Error Units
Benchmark.isBooleanANDBelowRange thrpt 10 386.086 ± 17.383 ops/us
Benchmark.isBooleanANDInRange thrpt 10 387.240 ± 7.657 ops/us
Benchmark.isBooleanANDOverRange thrpt 10 381.847 ± 15.295 ops/us
Benchmark.isBitwiseORBelowRange thrpt 10 384.877 ± 11.766 ops/us
Benchmark.isBitwiseORInRange thrpt 10 380.743 ± 15.042 ops/us
Benchmark.isBitwiseOROverRange thrpt 10 383.524 ± 16.911 ops/us
Benchmark.isConditionalANDBelowRange thrpt 10 385.190 ± 19.600 ops/us
Benchmark.isConditionalANDInRange thrpt 10 384.094 ± 15.417 ops/us
Benchmark.isConditionalANDOverRange thrpt 10 380.913 ± 5.537 ops/us
The results do not differ much for the evaluation itself. As long as no performance impact is spotted on that piece of code, I would not try to optimize it. Depending on where the code sits, the HotSpot compiler might decide to do some optimization, which is probably not covered by the above benchmarks.
some references:
boolean logical AND - the result value is true if both operand values are true; otherwise, the result is false
conditional AND - is like &, but evaluates its right-hand operand only if the value of its left-hand operand is true
bitwise OR - the result value is the bitwise inclusive OR of the operand values
I'm going to come at this from a different angle.
Consider these two code fragments,
if (value >= x && value <= y) {
and
if (value >= x & value <= y) {
If we assume that value, x, y have a primitive type, then those two (partial) statements will give the same outcome for all possible input values. (If wrapper types are involved, then they are not exactly equivalent because of an implicit null test for y that might fail in the & version and not the && version.)
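The wrapper-type caveat can be made concrete with a small sketch. The class and method names below are made up for illustration; the only assumption is that y is an Integer that may be null.

```java
// Sketch: with a wrapper type for y, & and && stop being interchangeable.
class WrapperAndDemo {
    // Conditional AND: the right operand (which unboxes y) is skipped when the left one is false.
    static boolean inRangeConditional(int value, int x, Integer y) {
        return value >= x && value <= y;
    }

    // Boolean logical AND: both operands are always evaluated, so y is always unboxed.
    static boolean inRangeLogical(int value, int x, Integer y) {
        return value >= x & value <= y;
    }

    public static void main(String[] args) {
        // Left operand is false, so the null Integer is never touched.
        System.out.println(inRangeConditional(1, 10, null));
        try {
            inRangeLogical(1, 10, null);
        } catch (NullPointerException e) {
            System.out.println("& unboxed the null Integer and failed");
        }
    }
}
```

With primitive operands neither method can throw, which is why the two forms are equivalent in that case.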
If the JIT compiler is doing a good job, its optimizer will be able to deduce that those two statements do the same thing:
If one is predictably faster than the other, then it should be able to use the faster version ... in the JIT compiled code.
If not, then it doesn't matter which version is used at the source code level.
Since the JIT compiler gathers path statistics before compiling, it can potentially have more information about the execution characteristics than the programmer(!).
If the current generation JIT compiler (on any given platform) doesn't optimize well enough to handle this, the next generation could well do ... depending on whether or not empirical evidence points to this being a worthwhile pattern to optimize.
Indeed, if you write your Java code in a way that optimizes for this, there is a chance that by picking the more "obscure" version of the code, you might inhibit the current or future JIT compiler's ability to optimize.
In short, I don't think you should do this kind of micro-optimization at the source code level. And if you accept this argument [1], and follow it to its logical conclusion, the question of which version is faster is ... moot [2].
[1] - I do not claim this is anywhere near being a proof.
[2] - Unless you are one of the tiny community of people who actually write Java JIT compilers ...
The "Very Famous Question" is interesting in two respects:
On the one hand, that is an example where the kind of optimization required to make a difference is way beyond the capability of a JIT compiler.
On the other hand, it would not necessarily be the correct thing to sort the array ... just because a sorted array can be processed faster. The cost of sorting the array, could well be (much) greater than the saving.
Using either & or && still requires a condition to be evaluated so it's unlikely it will save any processing time - it might even add to it considering you're evaluating both expressions when you only need to evaluate one.
Using & over && to save a nanosecond, if that, in some very rare situations is pointless; you've already wasted more time contemplating the difference than you would've saved using & over &&.
Edit
I got curious and decided to run some benchmarks.
I made this class:
public class Main {
static int x = 22, y = 48;
public static void main(String[] args) {
runWithOneAnd(30);
runWithTwoAnds(30);
}
static void runWithOneAnd(int value){
if(value >= x & value <= y){
}
}
static void runWithTwoAnds(int value){
if(value >= x && value <= y){
}
}
}
and ran some profiling tests with NetBeans. I didn't use any print statements to save processing time, just know both evaluate to true.
First test:
Second test:
Third test:
As you can see from the profiling tests, using only one & actually takes 2-3 times longer to run than using two &&. This strikes me as somewhat odd, as I expected better performance from a single &.
I'm not 100% sure why. In both cases, both expressions have to be evaluated because both are true. I suspect that the JVM does some special optimization behind the scenes to speed it up.
Moral of the story: convention is good and premature optimization is bad.
Edit 2
I redid the benchmark code with @SvetlinZarev's comments in mind and a few other improvements. Here is the modified benchmark code:
public class Main {
static int x = 22, y = 48;
public static void main(String[] args) {
oneAndBothTrue();
oneAndOneTrue();
oneAndBothFalse();
twoAndsBothTrue();
twoAndsOneTrue();
twoAndsBothFalse();
System.out.println(b);
}
static void oneAndBothTrue() {
int value = 30;
for (int i = 0; i < 2000; i++) {
if (value >= x & value <= y) {
doSomething();
}
}
}
static void oneAndOneTrue() {
int value = 60;
for (int i = 0; i < 4000; i++) {
if (value >= x & value <= y) {
doSomething();
}
}
}
static void oneAndBothFalse() {
int value = 100;
for (int i = 0; i < 4000; i++) {
if (value >= x & value <= y) {
doSomething();
}
}
}
static void twoAndsBothTrue() {
int value = 30;
for (int i = 0; i < 4000; i++) {
if (value >= x && value <= y) {
doSomething();
}
}
}
static void twoAndsOneTrue() {
int value = 60;
for (int i = 0; i < 4000; i++) {
if (value >= x && value <= y) {
doSomething();
}
}
}
static void twoAndsBothFalse() {
int value = 100;
for (int i = 0; i < 4000; i++) {
if (value >= x && value <= y) {
doSomething();
}
}
}
//I wanted to avoid print statements here as they can
//affect the benchmark results.
static StringBuilder b = new StringBuilder();
static int times = 0;
static void doSomething(){
times++;
b.append("I have run ").append(times).append(" times \n");
}
}
And here are the performance tests:
Test 1:
Test 2:
Test 3:
This takes into account different values and different conditions as well.
Using one & takes more time to run when both conditions are true, about 60% or 2 milliseconds more time. When either one or both conditions are false, then one & runs faster, but it only runs about 0.30-0.50 milliseconds faster. So & will run faster than && in most circumstances, but the performance difference is still negligible.
What you are after is something like this:
x <= value & value <= y
value - x >= 0 & y - value >= 0
((value - x) | (y - value)) >= 0 // integer bit-or
Interesting, one would almost like to look at the byte code.
But hard to say. I wish this were a C question.
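The three forms listed above can be checked against each other with a short sketch. The class and method names are invented here, and the bitwise version assumes that value - x and y - value do not overflow int.

```java
// Sketch: three ways to write the same inclusive range check x <= value <= y.
class RangeCheckDemo {
    static boolean viaConditionalAnd(int value, int x, int y) {
        return x <= value && value <= y;
    }

    static boolean viaLogicalAnd(int value, int x, int y) {
        return x <= value & value <= y;
    }

    static boolean viaBitwiseOr(int value, int x, int y) {
        // The sign bit of (a | b) is set if either a or b is negative,
        // so the result is non-negative only when both differences are.
        // Assumption: neither subtraction overflows.
        return ((value - x) | (y - value)) >= 0;
    }

    public static void main(String[] args) {
        int x = 22, y = 48;
        for (int value = 0; value <= 60; value++) {
            boolean expected = viaConditionalAnd(value, x, y);
            if (viaLogicalAnd(value, x, y) != expected || viaBitwiseOr(value, x, y) != expected) {
                throw new IllegalStateException("forms disagree for value " + value);
            }
        }
        System.out.println("all three forms agree on 0..60");
    }
}
```

The bitwise form trades two comparisons for one, but whether that helps is exactly the kind of thing that has to be measured on a given JVM.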
I was curious about the answer as well, so I wrote the following (simple) test for this:
private static final int max = 80000;
private static final int size = 100000;
private static final int x = 1500;
private static final int y = 15000;
private Random random;
@Before
public void setUp() {
this.random = new Random();
}
@After
public void tearDown() {
random = null;
}
@Test
public void testSingleOperand() {
int counter = 0;
int[] numbers = new int[size];
for (int j = 0; j < size; j++) {
numbers[j] = random.nextInt(max);
}
long start = System.nanoTime(); //start measuring after an array has been filled
for (int i = 0; i < numbers.length; i++) {
if (numbers[i] >= x & numbers[i] <= y) {
counter++;
}
}
long end = System.nanoTime();
System.out.println("Duration of single operand: " + (end - start));
}
@Test
public void testDoubleOperand() {
int counter = 0;
int[] numbers = new int[size];
for (int j = 0; j < size; j++) {
numbers[j] = random.nextInt(max);
}
long start = System.nanoTime(); //start measuring after an array has been filled
for (int i = 0; i < numbers.length; i++) {
if (numbers[i] >= x && numbers[i] <= y) {
counter++;
}
}
long end = System.nanoTime();
System.out.println("Duration of double operand: " + (end - start));
}
With the end result being that the comparison with && always wins in terms of speed, being about 1.5/2 milliseconds quicker than &.
EDIT:
As @SvetlinZarev pointed out, I was also measuring the time it took Random to get an integer. Changed it to use a pre-filled array of random numbers, which caused the duration of the single operand test to wildly fluctuate; the differences between several runs were up to 6-7ms.
The way this was explained to me, is that && will return false if the first check in a series is false, while & checks all items in a series regardless of how many are false. I.E.
if (x>0 && x <=10 && x
Will run faster than
if (x>0 & x <=10 & x
If x is greater than 10, because single ampersands will continue to check the rest of the conditions whereas double ampersands will break after the first non-true condition.
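The difference in how many operands get evaluated can be counted directly. This is only a sketch: the helper check and the third condition (x % 2 == 0) are made up for illustration and are not taken from the snippet above.

```java
// Sketch: count how many operands are actually evaluated by && versus &.
class ShortCircuitDemo {
    static int calls = 0;

    // Hypothetical helper: records that an operand was evaluated, then returns it unchanged.
    static boolean check(boolean condition) {
        calls++;
        return condition;
    }

    static int countWithConditionalAnd(int x) {
        calls = 0;
        boolean ignored = check(x > 0) && check(x <= 10) && check(x % 2 == 0);
        return calls;
    }

    static int countWithLogicalAnd(int x) {
        calls = 0;
        boolean ignored = check(x > 0) & check(x <= 10) & check(x % 2 == 0);
        return calls;
    }

    public static void main(String[] args) {
        // For x = 11 the second condition is false, so && never reaches the third one.
        System.out.println("&& evaluated " + countWithConditionalAnd(11) + " operands");
        System.out.println("&  evaluated " + countWithLogicalAnd(11) + " operands");
    }
}
```

Fewer evaluated operands does not automatically mean faster code, since the early exit is a branch the CPU has to predict.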
I think the question should be self explanatory, and the language I'm thinking about right now is Java, but it probably applies across all languages.
That being said, basically what I'm talking about is whether this:
// Initialize first
int i = 0;
for (i = 0; i < x; i++) {
// do some stuff
}
for (i = 0; i < x; i++) {
// do some more stuff
}
for (i = 0; i < x; i++) {
// do other stuff
}
Is better than this:
// Initializing i in the for loop
for(int i = 0; i < x; i++) {
// do some stuff
}
for(int i = 0; i < x; i++) {
// do some more stuff
}
for(int i = 0; i < x; i++) {
// do other stuff
}
This is a performance question, and I'm talking about initializing once /per/ scope resolution.
I performed a performance test with x=10 to evaluate the performance difference between the in-loop declaration method and the out-of-loop declaration method.
Details: I ran the code 300x with in-loop first and then 300x with out-of-loop first. Each run, I recorded the total runtime in nanoseconds to execute each method 10,000 times. So, I recorded a total of 1200 observations (600 per method). To measure steady-state performance (vice startup performance), I removed the 20 observations from each data set that had the longest duration. (The mean runtime for the 20 startup observations was an order of magnitude larger than the mean runtime for all the other observations.)
Results: A single-factor ANOVA indicates that the in-loop declaration is faster than the out-of-loop declaration (p-value=8.12584E-07). The mean runtimes were 158635.4931 nanoseconds for in-loop and 166943.7397 nanoseconds for out-of-loop. From a practical standpoint, we're talking about a difference of ~0.01ms per 10,000 iterations.
Conclusion: Just use the in-loop declaration. @FallAndLearn also points out that the in-loop declaration is easier to maintain because the local variable i is declared with the smallest scope possible.
Your first piece of code is better than the second one because int is a value type and its value is stored on the stack once you initialize it; later on you just assign that variable a value again and again.
On the other hand, in the second piece of code you are initializing i three times, i.e. creating stack entries three times.
So the first piece of code is better than the second one, performance-wise.
The scope of local variables should always be the smallest possible.
Hence if int i is not used outside the loop then second way is always better. More readable too.
Performance-wise they are both the same. From a maintenance perspective, the second option is better.
Also, the answer to this question will depend on your requirements, for example whether your code has other data that depends on i or consists only of the three for loop statements.
Let's check out the disassembled code for the following snippet:
public class Test {
public static void main(String[] args) {
int i = 0;
for(i = 0; i < 3; i++){
//do some stuff
}
}
}
public class Test {
public Test();
Code:
0: aload_0
1: invokespecial #1 // Method java/lang/Object."<init>":()V
4: return
public static void main(java.lang.String[]);
Code:
0: iconst_0
1: istore_1
2: iconst_0
3: istore_1
4: iload_1
5: iconst_3
6: if_icmpge 15
9: iinc 1, 1
12: goto 4
15: return
}
Now let's generate another one for the initialization of the control variable inside the loop:
public class Test {
public Test();
Code:
0: aload_0
1: invokespecial #1 // Method java/lang/Object."<init>":()V
4: return
public static void main(java.lang.String[]);
Code:
0: iconst_0
1: istore_1
2: iload_1
3: iconst_3
4: if_icmpge 13
7: iinc 1, 1
10: goto 2
13: return
}
They're not the same. I'm not a bytecode expert, but I can tell that the second one has less overhead: the first one pushes an int constant twice (the two iconst_<i> instructions) and has two istore_<n> instructions, compared to one of each in the second version.
import java.io.*;
public class test {
public static void main(String args[]) {
int a=0, b=6, sum;
for(int i=0; i<=2; i++) {
System.out.println(i=i++);
}
}
}
Output: 0 | 1 | 2. But actually I think it should be 0 | 2. Please explain why I am wrong? Thank you in advance.
The difference is in this line of code:
System.out.println(i=i++);
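i++ is a post-increment: the expression evaluates to the old value of i, and i itself is incremented right away, before the assignment takes place.
So it goes a bit like this:
int tempI = i; // remember the old value
i = i + 1; // the increment happens first
i = tempI; // the assignment then overwrites it with the old value
System.out.println(i);
In the end, you print the old value of i, while the incremented value is overwritten, so it is lost.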
The answer is in the bytecode generated for the code above. It stores the old value of i back into i, so the i=i++ statement has no logical effect.
iconst_0
istore_1
goto 11
getstatic java/lang/System/out Ljava/io/PrintStream;
iload_1
iinc 1 1
dup
istore_1
invokevirtual java/io/PrintStream/println(I)V
iinc 1 1
iload_1
iconst_2
if_icmple 4
return
It is easy to test it out by yourself actually even without using a debugger.
int a=0,b=6, sum;
for(int i=0;i<=2;i++)
{
System.out.println(i=i++);
System.out.println("Value of i:" + i);
}
OUTPUT:
0
Value of i:0
1
Value of i:1
2
Value of i:2
Question:
So why after System.out.println(i=i++);, value of i is not increasing?
Answer: This is because i++ is a post-increment. The expression i++ evaluates to the old value of i; i is incremented, but the assignment then writes the old value back.
//Let's say i is 0.
i = i++;
The right-hand side i++ yields 0 (and briefly bumps i to 1), then that 0 is assigned to the i on the left-hand side, so i ends up as 0 again.
In your code:
for(int i=0;i<=2;i++)
{
System.out.println(i=i++); //here i is incremented first, then its old value is assigned back to i.
}
When you write i=i++, the old value of i is assigned back after the increment, so the increment is lost. Hence you are not getting what you expected.
You can use pre-increment operator and see the difference.
for(int i=0;i<=2;i++)
{
System.out.println(i=++i);
}
Here is a related thread which will help you: What is x after “x = x++”?
Please see this snippet:
for (int i=0; i<=2;i+=2){
System.out.println("i= "+ i);
}
this should give you a short way to do it.
If this works:
for(int i=0;i<=2;i++)
{
System.out.println(i=++i);
}
then why does this not work?
for(int i=0;i<=2;i++)
{
System.out.println(i=i++);
}
Apparently the postincrement in the println method is treated differently.
It is an undefined behaviour in C or C++ if you write:
i = i++;
Reason being that the order of evaluation cannot be determined. Hence if you write this in C or C++, there is no guarantee what will be produced.
In Java, this kind of ambiguity is removed from the design. The order of evaluation is as follows:
i = i++; //value of i++ is stored (0 is stored in this case)
//i increased by 1, i is now 1
//The stored value assigned back to i (0 assigned to i)
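The evaluation order described above can be verified with a few lines of Java. The class and method names are made up for this sketch.

```java
// Sketch: i = i++ leaves i unchanged in Java, while i = ++i increments it.
class PostIncrementDemo {
    static int assignPostIncrement(int i) {
        // The value of i++ (the old i) is saved, i is incremented,
        // then the saved value is assigned back to i.
        i = i++;
        return i;
    }

    static int assignPreIncrement(int i) {
        // ++i increments first and yields the new value, which is then assigned.
        i = ++i;
        return i;
    }

    // Mirrors the loop from the question, collecting what println would show.
    static String loopOutput() {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i <= 2; i++) {
            out.append(i = i++).append(' ');
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(assignPostIncrement(0)); // prints 0
        System.out.println(assignPreIncrement(0));  // prints 1
        System.out.println(loopOutput());           // prints 0 1 2
    }
}
```

Only the i++ in the for header ever changes i in that loop, which is why every value from 0 to 2 is printed.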
The TL;DR version, for those who don't want the background, is the following specific question:
Question
Why doesn't Java have an implementation of true multidimensional arrays? Is there a solid technical reason? What am I missing here?
Background
Java has multidimensional arrays at the syntax level, in that one can declare
int[][] arr = new int[10][10];
but it seems that this is really not what one might have expected. Rather than having the JVM allocate a contiguous block of RAM big enough to store 100 ints, it comes out as an array of arrays of ints: so each layer is a contiguous block of RAM, but the thing as a whole is not. Accessing arr[i][j] is thus rather slow: the JVM has to
find the int[] stored at arr[i];
index this to find the int stored at arr[i][j].
This involves querying an object to go from one layer to the next, which is rather expensive.
Why Java does this
At one level, it's not hard to see why this can't be optimised to a simple scale-and-add lookup even if it were all allocated in one fixed block. The problem is that arr[3] is a reference all of its own, and it can be changed. So although arrays are of fixed size, we could easily write
arr[3] = new int[11];
and now the scale-and-add is screwed because this layer has grown. You'd need to know at runtime whether everything is still the same size as it used to be. In addition, of course, this will then get allocated somewhere else in RAM (it'll have to be, since it's bigger than what it's replacing), so it's not even in the right place for scale-and-add.
What's problematic about it
It seems to me that this is not ideal, and that for two reasons.
For one, it's slow. A test I ran with these methods for summing the contents of a single dimensional or multidimensional array took nearly twice as long (714 seconds vs 371 seconds) for the multidimensional case (an int[1000000] and an int[100][100][100] respectively, filled with random int values, run 1000000 times with warm cache).
public static long sumSingle(int[] arr) {
long total = 0;
for (int i=0; i<arr.length; i++)
total+=arr[i];
return total;
}
public static long sumMulti(int[][][] arr) {
long total = 0;
for (int i=0; i<arr.length; i++)
for (int j=0; j<arr[0].length; j++)
for (int k=0; k<arr[0][0].length; k++)
total+=arr[i][j][k];
return total;
}
Secondly, because it's slow, it thereby encourages obscure coding. If you encounter something performance-critical that would be naturally done with a multidimensional array, you have an incentive to write it as a flat array, even if that makes the code unnatural and hard to read. You're left with an unpalatable choice: obscure code or slow code.
What could be done about it
It seems to me that the basic problem could easily enough be fixed. The only reason, as we saw earlier, that it can't be optimised is that the structure might change. But Java already has a mechanism for making references unchangeable: declare them as final.
Now, just declaring it with
final int[][] arr = new int[10][10];
isn't good enough because it's only arr that is final here: arr[3] still isn't, and could be changed, so the structure might still change. But if we had a way of declaring things so that it was final throughout, except at the bottom layer where the int values are stored, then we'd have an entire immutable structure, and it could all be allocated as one block, and indexed with scale-and-add.
How it would look syntactically, I'm not sure (I'm not a language designer). Maybe
final int[final][] arr = new int[10][10];
although admittedly that looks a bit weird. This would mean: final at the top layer; final at the next layer; not final at the bottom layer (else the int values themselves would be immutable).
Finality throughout would enable the JIT compiler to optimise this to give performance comparable to that of a single dimensional array, which would then take away the temptation to code that way just to get round the slowness of multidimensional arrays.
(I hear a rumour that C# does something like this, although I also hear another rumour that the CLR implementation is so bad that it's not worth having... perhaps they're just rumours...)
Question
So why doesn't Java have an implementation of true multidimensional arrays? Is there a solid technical reason? What am I missing here?
Update
A bizarre side note: the difference in timings drops away to only a few percent if you use an int for the running total rather than a long. Why would there be such a small difference with an int, and such a big difference with a long?
Benchmarking code
Code I used for benchmarking, in case anyone wants to try to reproduce these results:
import java.util.Random;

public class Multidimensional {
public static long sumSingle(final int[] arr) {
long total = 0;
for (int i=0; i<arr.length; i++)
total+=arr[i];
return total;
}
public static long sumMulti(final int[][][] arr) {
long total = 0;
for (int i=0; i<arr.length; i++)
for (int j=0; j<arr[0].length; j++)
for (int k=0; k<arr[0][0].length; k++)
total+=arr[i][j][k];
return total;
}
public static void main(String[] args) {
final int iterations = 1000000;
Random r = new Random();
int[] arr = new int[1000000];
for (int i=0; i<arr.length; i++)
arr[i]=r.nextInt();
long total = 0;
System.out.println(sumSingle(arr));
long time = System.nanoTime();
for (int i=0; i<iterations; i++)
total = sumSingle(arr);
time = System.nanoTime()-time;
System.out.printf("Took %d ms for single dimension\n", time/1000000, total);
int[][][] arrMulti = new int[100][100][100];
for (int i=0; i<arrMulti.length; i++)
for (int j=0; j<arrMulti[i].length; j++)
for (int k=0; k<arrMulti[i][j].length; k++)
arrMulti[i][j][k]=r.nextInt();
System.out.println(sumMulti(arrMulti));
time = System.nanoTime();
for (int i=0; i<iterations; i++)
total = sumMulti(arrMulti);
time = System.nanoTime()-time;
System.out.printf("Took %d ms for multi dimension\n", time/1000000, total);
}
}
but it seems that this is really not what one might have expected.
Why?
Consider that the form T[] means "array of type T", then just as we would expect int[] to mean "array of type int", we would expect int[][] to mean "array of type array of type int", because there's no less reason for having int[] as the T than int.
As such, considering that one can have arrays of any type, it follows just from the way [ and ] are used in declaring and initialising arrays (and for that matter, {, } and ,), that without some sort of special rule banning arrays of arrays, we get this sort of use "for free".
Now consider also that there are things we can do with jagged arrays that we can't do otherwise:
We can have "jagged" arrays where different inner arrays are of different sizes.
We can have null arrays within the outer array where that is an appropriate mapping of the data, or perhaps to allow lazy building.
We can deliberately alias within the array so e.g. lookup[1] is the same array as lookup[5]. (This can allow for massive savings with some data-sets, e.g. many Unicode properties can be mapped for the full set of 1,112,064 code points in a small amount of memory because leaf arrays of properties can be repeated for ranges with matching patterns).
Some heap implementations can handle the many smaller objects better than one large object in memory.
There are certainly cases where these sort of multi-dimensional arrays are useful.
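A short sketch of the first three points (jagged rows, null rows and aliased rows). The class name and the contents of lookup are invented purely for illustration.

```java
// Sketch: things an array of arrays allows that a single rectangular block would not.
class ArrayOfArraysDemo {
    static int[][] build() {
        int[][] lookup = new int[4][];      // only the outer array is allocated; every row starts as null
        lookup[0] = new int[] {1, 2, 3};
        lookup[1] = new int[] {7};          // jagged: this row has a different length
        lookup[3] = lookup[1];              // aliasing: rows 1 and 3 are the same array object
        // lookup[2] is deliberately left null (e.g. "no data", or to be built lazily)
        return lookup;
    }

    public static void main(String[] args) {
        int[][] lookup = build();
        lookup[1][0] = 9;                   // a write through row 1 ...
        System.out.println(lookup[3][0]);   // ... is visible through row 3
        System.out.println(lookup[2] == null);
    }
}
```

The aliasing case is what makes the memory savings possible: a shared leaf array is stored once no matter how many outer slots point at it.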
Now, the default state of any feature is unspecified and unimplemented. Someone needs to decide to specify and implement a feature, or else it wouldn't exist.
As shown above, the array-of-array sort of multidimensional array will exist unless someone decides to introduce a special feature banning arrays of arrays. Since arrays of arrays are useful for the reasons above, that would be a strange decision to make.
Conversely, the sort of multidimensional array where an array has a defined rank that can be greater than 1 and so be used with a set of indices rather than a single index, does not follow naturally from what is already defined. Someone would need to:
Decide on the specification for how the declaration, initialisation and use would work.
Document it.
Write the actual code to do this.
Test the code to do this.
Handle the bugs, edge-cases, reports of bugs that aren't actually bugs, backward-compatibility issues caused by fixing the bugs.
Also users would have to learn this new feature.
So, it has to be worth it. Some things that would make it worth it would be:
1. If there was no way of doing the same thing.
2. If the way of doing the same thing was strange or not well-known.
3. People would expect it from similar contexts.
4. Users can't provide similar functionality themselves.
In this case though:
1. But there is.
2. Using strides within arrays was already known to C and C++ programmers and Java built on its syntax so that the same techniques are directly applicable.
3. Java's syntax was based on C++, and C++ similarly only has direct support for multidimensional arrays as arrays-of-arrays. (Except when statically allocated, but that's not something that would have an analogy in Java where arrays are objects).
4. One can easily write a class that wraps an array and details of stride-sizes and allows access via a set of indices.
Really, the question is not "why doesn't Java have true multidimensional arrays"? But "Why should it?"
Of course, the points you made in favour of multidimensional arrays are valid, and some languages do have them for that reason, but the burden is nonetheless to argue a feature in, not argue it out.
(I hear a rumour that C# does something like this, although I also hear another rumour that the CLR implementation is so bad that it's not worth having... perhaps they're just rumours...)
Like many rumours, there's an element of truth here, but it is not the full truth.
.NET arrays can indeed have multiple ranks. This is not the only way in which it is more flexible than Java. Each rank can also have a lower-bound other than zero. As such, you could for example have an array that goes from -3 to 42 or a two dimensional array where one rank goes from -2 to 5 and another from 57 to 100, or whatever.
C# does not give complete access to all of this from its built-in syntax (you need to call Array.CreateInstance() for lower bounds other than zero), but it does allow you to use the syntax int[,] for a two-dimensional array of int, int[,,] for a three-dimensional array, and so on.
Now, the extra work involved in dealing with lower bounds other than zero adds a performance burden, and yet these cases are relatively uncommon. For that reason single-rank arrays with a lower-bound of 0 are treated as a special case with a more performant implementation. Indeed, they are internally a different sort of structure.
In .NET multi-dimensional arrays with lower bounds of zero are treated as multi-dimensional arrays whose lower bounds just happen to be zero (that is, as an example of the slower case) rather than the faster case being able to handle ranks greater than 1.
Of course, .NET could have had a fast-path case for zero-based multi-dimensional arrays, but then all the reasons for Java not having them apply and the fact that there's already one special case, and special cases suck, and then there would be two special cases and they would suck more. (As it is, one can have some issues with trying to assign a value of one type to a variable of the other type).
Not a single thing above shows clearly that Java couldn't possibly have had the sort of multi-dimensional array you talk of; it would have been a sensible enough decision, but the decision that was made was also sensible.
This should be a question to James Gosling, I suppose. The initial design of Java was about OOP and simplicity, not about speed.
If you have a better idea of how multidimensional arrays should work, there are several ways of bringing it to life:
Submit a JDK Enhancement Proposal.
Develop a new JSR through Java Community Process.
Propose a new Project.
UPD. Of course, you are not the first to question the problems of Java arrays design.
For instance, projects Sumatra and Panama would also benefit from true multidimensional arrays.
"Arrays 2.0" is John Rose's talk on this subject at JVM Language Summit 2012.
To me it looks like you sort of answered the question yourself:
... an incentive to write it as a flat array, even if that makes the code unnatural and hard to read.
So write it as a flat array which is easy to read. With a trivial helper like
double get(int row, int col) {
return data[rowLength * row + col];
}
and similar setter and possibly a +=-equivalent, you can pretend you're working with a 2D array. It's really no big deal. You can't use the array notation and everything gets verbose and ugly. But that seems to be the Java way. It's exactly the same as with BigInteger or BigDecimal. You can't use braces for accessing a Map, that's a very similar case.
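A minimal sketch of such a wrapper, built around the get helper shown above. The class name FlatGrid, the constructor, the add method and the bounds check are assumptions added for illustration.

```java
// Sketch: a 2D view over one contiguous double[], indexed with scale-and-add.
final class FlatGrid {
    private final double[] data;
    private final int rows;
    private final int rowLength;

    FlatGrid(int rows, int rowLength) {
        this.rows = rows;
        this.rowLength = rowLength;
        this.data = new double[rows * rowLength];
    }

    // Maps (row, col) to a flat index; rejects indices that would silently land in another row.
    private int index(int row, int col) {
        if (row < 0 || row >= rows || col < 0 || col >= rowLength) {
            throw new IndexOutOfBoundsException("(" + row + ", " + col + ")");
        }
        return rowLength * row + col;
    }

    double get(int row, int col) {
        return data[index(row, col)];
    }

    void set(int row, int col, double value) {
        data[index(row, col)] = value;
    }

    // The "+=" equivalent mentioned above.
    void add(int row, int col, double delta) {
        data[index(row, col)] += delta;
    }

    public static void main(String[] args) {
        FlatGrid grid = new FlatGrid(3, 4);
        grid.set(1, 2, 5.0);
        grid.add(1, 2, 1.5);
        System.out.println(grid.get(1, 2));
    }
}
```

The explicit bounds check is a design choice: without it, an out-of-range col would quietly read from the neighbouring row instead of failing.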
Now the question is how important all those features are? Would more people be happy if they could write x += BigDecimal.valueOf("123456.654321") + 10;, or spouse["Paul"] = "Mary";, or use 2D arrays without the boilerplate, or what? All of this would be nice and you could go further, e.g., array slices. But there's no real problem. You have to choose between verbosity and inefficiency as in many other cases. IMHO, the effort spent on this feature can be better spent elsewhere. Your 2D arrays are a new best as....
Java actually has no 2D primitive arrays, ...
it's mostly syntactic sugar; the underlying thing is an array of objects.
double[][] a = new double[1][1];
Object[] b = a;
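To illustrate what "array of objects" means in practice, here is a small sketch (the class and method names are invented for this example). Each row is an independent object, so rows can have different lengths and two row slots can even point at the same array:

```java
// Illustrative demo; class and method names are hypothetical.
final class ArrayOfArraysDemo {
    // Returns true if writing through row 0 is visible through row 1
    // after both slots have been pointed at the same double[].
    static boolean rowsCanAlias() {
        double[][] a = new double[2][3];
        a[1] = new double[5]; // rows may have different lengths
        a[0] = a[1];          // two row slots now share one array
        a[0][4] = 1.0;
        return a[1][4] == 1.0;
    }

    public static void main(String[] args) {
        double[][] a = new double[1][1];
        Object[] b = a; // legal, because each row is itself an object
        System.out.println(b[0] instanceof double[]); // prints true
        System.out.println(rowsCanAlias());           // prints true
    }
}
```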
As arrays are reified, the current implementation needs hardly any support. Your implementation would open a can of worms:
There are currently 8 primitive types, which means 9 array types, would a 2D array be the tenth? What about 3D?
There is a single special object header type for arrays. A 2D array could need another one.
What about java.lang.reflect.Array? Clone it for 2D arrays?
Many other features would have to be adapted, e.g. serialization.
And what would
??? x = {new int[1], new int[2]};
be? An old-style 2D int[][]? What about interoperability?
I guess, it's all doable, but there are simpler and more important things missing from Java. Some people need 2D arrays all the time, but many can hardly remember when they used any array at all.
I am unable to reproduce the performance benefits you claim. Specifically, the test program:
import java.math.BigDecimal;
import java.util.Random;

public abstract class Benchmark {
final String name;
public Benchmark(String name) {
this.name = name;
}
abstract int run(int iterations) throws Throwable;
private BigDecimal time() {
try {
int nextI = 1;
int i;
long duration;
do {
i = nextI;
long start = System.nanoTime();
run(i);
duration = System.nanoTime() - start;
nextI = (i << 1) | 1;
} while (duration < 1000000000 && nextI > 0);
return new BigDecimal((duration) * 1000 / i).movePointLeft(3);
} catch (Throwable e) {
throw new RuntimeException(e);
}
}
@Override
public String toString() {
return name + "\t" + time() + " ns";
}
public static void main(String[] args) throws Exception {
final int[] flat = new int[100*100*100];
final int[][][] multi = new int[100][100][100];
Random chaos = new Random();
for (int i = 0; i < flat.length; i++) {
flat[i] = chaos.nextInt();
}
for (int i=0; i<multi.length; i++)
for (int j=0; j<multi[0].length; j++)
for (int k=0; k<multi[0][0].length; k++)
multi[i][j][k] = chaos.nextInt();
Benchmark[] marks = {
new Benchmark("flat") {
@Override
int run(int iterations) throws Throwable {
long total = 0;
for (int j = 0; j < iterations; j++)
for (int i = 0; i < flat.length; i++)
total += flat[i];
return (int) total;
}
},
new Benchmark("multi") {
@Override
int run(int iterations) throws Throwable {
long total = 0;
for (int iter = 0; iter < iterations; iter++)
for (int i=0; i<multi.length; i++)
for (int j=0; j<multi[0].length; j++)
for (int k=0; k<multi[0][0].length; k++)
total+=multi[i][j][k];
return (int) total;
}
},
new Benchmark("multi (idiomatic)") {
@Override
int run(int iterations) throws Throwable {
long total = 0;
for (int iter = 0; iter < iterations; iter++)
for (int[][] a : multi)
for (int[] b : a)
for (int c : b)
total += c;
return (int) total;
}
}
};
for (Benchmark mark : marks) {
System.out.println(mark);
}
}
}
run on my workstation with
java version "1.8.0_05"
Java(TM) SE Runtime Environment (build 1.8.0_05-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.5-b02, mixed mode)
prints
flat 264360.217 ns
multi 270303.246 ns
multi (idiomatic) 266607.334 ns
That is, we observe a mere 3% difference between the one-dimensional and the multi-dimensional code you provided. This difference drops to 1% if we use idiomatic Java (specifically, an enhanced for loop) for traversal (probably because bounds checking is performed on the same array object the loop dereferences, enabling the just in time compiler to elide bounds checking more completely).
Performance therefore seems an inadequate justification for increasing the complexity of the language. Specifically, to support true multi dimensional arrays, the Java programming language would have to distinguish between arrays of arrays, and multidimensional arrays.
Likewise, programmers would have to distinguish between them, and be aware of their differences. API designers would have to ponder whether to use an array of arrays, or a multidimensional array. The compiler, class file format, class file verifier, interpreter, and just in time compiler would have to be extended. This would be particularly difficult, because multidimensional arrays of different dimension counts would have an incompatible memory layout (because the size of their dimensions must be stored to enable bounds checking), and can therefore not be subtypes of each other. As a consequence, the methods of class java.util.Arrays would likely have to be duplicated for each dimension count, as would all otherwise polymorphic algorithms working with arrays.
To conclude, extending Java to support multidimensional arrays would offer negligible performance gain for most programs, but require non-trivial extensions to its type system, compiler and runtime environment. Introducing them would therefore have been at odds with the design goals of the Java programming language, specifically that it be simple.
Since this question is to a great extent about performance, let me contribute a proper JMH-based benchmark. I have also changed some things to make your example simpler and the performance difference more prominent.
In my case I compare a 1D array with a 2D-array, and use a very short inner dimension. This is the worst case for the cache.
I have tried with both long and int accumulator and saw no difference between them. I submit the version with int.
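import static java.util.concurrent.TimeUnit.MILLISECONDS;

import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@OutputTimeUnit(TimeUnit.NANOSECONDS)
@BenchmarkMode(Mode.AverageTime)
@OperationsPerInvocation(X*Y)
@Warmup(iterations = 30, time = 100, timeUnit=MILLISECONDS)
@Measurement(iterations = 5, time = 1000, timeUnit=MILLISECONDS)
@State(Scope.Thread)
@Threads(1)
@Fork(1)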
public class Measure
{
static final int X = 100_000, Y = 10;
private final int[] single = new int[X*Y];
private final int[][] multi = new int[X][Y];
@Setup public void setup() {
final ThreadLocalRandom rnd = ThreadLocalRandom.current();
for (int i=0; i<single.length; i++) single[i] = rnd.nextInt();
for (int i=0; i<multi.length; i++)
for (int j=0; j<multi[0].length; j++)
multi[i][j] = rnd.nextInt();
}
@Benchmark public long sumSingle() { return sumSingle(single); }
@Benchmark public long sumMulti() { return sumMulti(multi); }
public static long sumSingle(int[] arr) {
int total = 0;
for (int i=0; i<arr.length; i++)
total+=arr[i];
return total;
}
public static long sumMulti(int[][] arr) {
int total = 0;
for (int i=0; i<arr.length; i++)
for (int j=0; j<arr[0].length; j++)
total+=arr[i][j];
return total;
}
}
The difference in performance is larger than what you have measured:
Benchmark Mode Samples Score Score error Units
o.s.Measure.sumMulti avgt 5 1,356 0,121 ns/op
o.s.Measure.sumSingle avgt 5 0,421 0,018 ns/op
That's a factor above three. (Note that the timing is reported per array element.)
I also note that there is no warmup involved: the first 100 ms are as fast as the rest. Apparently this is such a simple task that the interpreter already does all it takes to make it optimal.
Update
Changing sumMulti's inner loop to
for (int j=0; j<arr[i].length; j++)
total+=arr[i][j];
(note arr[i].length) resulted in a significant speedup, as predicted by maaartinus. Using arr[0].length makes it impossible to eliminate the index range check. Now the results are as follows:
Benchmark Mode Samples Score Error Units
o.s.Measure.sumMulti avgt 5 0,992 ± 0,066 ns/op
o.s.Measure.sumSingle avgt 5 0,424 ± 0,046 ns/op
If you want a fast implementation of a true multi-dimensional array, you could write a custom implementation like this. But you are right: it is not as crisp as the array notation. Still, a neat implementation could be quite friendly.
public class MyArray{
private int rows = 0;
private int cols = 0;
String[] backingArray = null;
public MyArray(int rows, int cols){
this.rows = rows;
this.cols = cols;
backingArray = new String[rows*cols];
}
public String get(int row, int col){
return backingArray[row*cols + col];
}
// ... setters and other stuff
}
Why is it not the default implementation?
The designers of Java probably had to decide how the familiar C array syntax would behave by default. They had a single array notation, which could implement either arrays of arrays or true multi-dimensional arrays.
I think the early Java designers were really concerned with Java being safe. A lot of decisions seem to have been taken to make it difficult for the average programmer (or a good programmer on a bad day) to mess something up. With true multi-dimensional arrays, it is easier for users to waste large chunks of memory by allocating blocks where they are not useful.
Also, given Java's embedded-systems roots, they probably found that small pieces of memory were more likely to be available for allocation than the large contiguous chunks required for true multi-dimensional objects.
Of course, the flip side is that places where multi-dimensional arrays really make sense suffer. And you are forced to use a library and messy looking code to get your work done.
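As an illustration of the memory argument above, arrays of arrays let each row be exactly as long as it needs to be. The following sketch (names are hypothetical) stores a lower-triangular matrix and compares its element count with that of a full rectangular block:

```java
// Illustrative demo; class and method names are hypothetical.
final class TriangularDemo {
    // Lower-triangular storage: row i holds only i + 1 elements.
    static int[][] lowerTriangle(int n) {
        int[][] t = new int[n][];
        for (int i = 0; i < n; i++) {
            t[i] = new int[i + 1];
        }
        return t;
    }

    // Total number of int elements actually allocated across all rows.
    static long countElements(int[][] arr) {
        long count = 0;
        for (int[] row : arr) {
            count += row.length;
        }
        return count;
    }

    public static void main(String[] args) {
        int n = 1000;
        System.out.println(countElements(lowerTriangle(n))); // prints 500500
        System.out.println(countElements(new int[n][n]));    // prints 1000000
    }
}
```

This counts elements only. Each row is also a separate object with its own header, so the real saving is somewhat smaller than the element counts suggest.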
Why is it still not included in the language?
Even today, true multi-dimensional arrays are a risk from the point of view of possible memory wastage or misuse.
There are two ways to check if the number is divisible by 2:
x % 2 == 1
(x & 1) == 1
Which of the two is more efficient?
The bit operation is almost certainly faster.
Division/modulus is a generalized operation which must work for any divisor you provide, not just 2. It must also check for underflow, range errors and division by zero, and maintain a remainder, all of which takes time.
The bit operation just does a bit "and" operation, which in this case just so happens to correspond to division by two. It might actually use just a single processor operation to execute.
Either the & expression will be faster or they will be the same speed. Last time I tried, they were the same speed when I used a literal 2 (because the compiler could optimise it) but % was slower if the 2 was in a variable.
The expression x % 2 == 1 as a test for odd numbers does not work for negative x.
So there's at least one reason to prefer &.
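A short sketch of the problem (class and method names are invented for this example): in Java the result of % takes the sign of the dividend, so -3 % 2 is -1, not 1.

```java
// Illustrative demo; class and method names are hypothetical.
final class OddCheckDemo {
    // Fails for negative odd numbers, because -3 % 2 == -1 in Java.
    static boolean oddByRemainder(int x) { return x % 2 == 1; }

    // Remainder-based test that also works for negative values.
    static boolean oddByRemainderFixed(int x) { return x % 2 != 0; }

    // Tests the lowest bit; correct for negatives in two's complement.
    static boolean oddByBit(int x) { return (x & 1) == 1; }

    public static void main(String[] args) {
        System.out.println(-3 % 2);                  // prints -1
        System.out.println(oddByRemainder(-3));      // prints false
        System.out.println(oddByRemainderFixed(-3)); // prints true
        System.out.println(oddByBit(-3));            // prints true
    }
}
```

Comparing against 0 instead of 1 keeps the remainder form correct if you prefer it for readability.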
There will hardly be a noticeable difference in practice. In particular, it's hard to imagine a case where such an instruction will be the actual bottleneck.
(Some nitpicking: The "binary" operation should rather be called bitwise operation, and the "modulo" operation actually is a remainder operation)
From a more theoretical point of view, one could assume that the binary operation is more efficient than the remainder operation, for reasons that already have been pointed out in other answers.
However, back to the practical point of view again: The JIT will almost certainly come to the rescue. Consider the following (very simple) test:
class BitwiseVersusMod
{
public static void main(String args[])
{
for (int i=0; i<10; i++)
{
for (int n=100000; n<=100000000; n*=10)
{
long s0 = runTestBitwise(n);
System.out.println("Bitwise sum "+s0);
long s1 = runTestMod(n);
System.out.println("Mod sum "+s1);
}
}
}
private static long runTestMod(int n)
{
long sum = 0;
for (int i=0; i<n; i++)
{
if (i % 2 == 1)
{
sum += i;
}
}
return sum;
}
private static long runTestBitwise(int n)
{
long sum = 0;
for (int i=0; i<n; i++)
{
if ((i & 1) == 1)
{
sum += i;
}
}
return sum;
}
}
Running it with a Hotspot Disassembler VM using
java -server -XX:+UnlockDiagnosticVMOptions -XX:+TraceClassLoading -XX:+LogCompilation -XX:+PrintAssembly BitwiseVersusMod
creates the JIT disassembly log.
Indeed, for the first invocations of the modulo version, it creates the following disassembly:
...
0x00000000027dcae6: cmp $0xffffffff,%ecx
0x00000000027dcae9: je 0x00000000027dcaf2
0x00000000027dcaef: cltd
0x00000000027dcaf0: idiv %ecx ;*irem
; - BitwiseVersusMod::runTestMod@11 (line 26)
; implicit exception: dispatches to 0x00000000027dcc18
0x00000000027dcaf2: cmp $0x1,%edx
0x00000000027dcaf5: movabs $0x54fa0888,%rax ; {metadata(method data for {method} {0x0000000054fa04b0} 'runTestMod' '(I)J' in 'BitwiseVersusMod')}
0x00000000027dcaff: movabs $0xb0,%rdx
....
where the irem instruction is translated into an idiv, which is considered to be rather expensive.
In contrast to that, the binary version uses an and instruction for the decision, as expected:
....
0x00000000027dc58c: nopl 0x0(%rax)
0x00000000027dc590: mov %rsi,%rax
0x00000000027dc593: and $0x1,%eax
0x00000000027dc596: cmp $0x1,%eax
0x00000000027dc599: movabs $0x54fa0768,%rax ; {metadata(method data for {method} {0x0000000054fa0578} 'runTestBitwise' '(I)J' in 'BitwiseVersusMod')}
0x00000000027dc5a3: movabs $0xb0,%rbx
....
However, for the final, optimized version, the generated code is more similar for both versions. In both cases, the compiler does a lot of loop unrolling, but the core of the methods can still be identified: For the bitwise version, it generates an unrolled loop containing the following instructions:
...
0x00000000027de2c7: mov %r10,%rax
0x00000000027de2ca: mov %r9d,%r11d
0x00000000027de2cd: add $0x4,%r11d ;*iinc
; - BitwiseVersusMod::runTestBitwise@21 (line 37)
0x00000000027de2d1: mov %r11d,%r8d
0x00000000027de2d4: and $0x1,%r8d
0x00000000027de2d8: cmp $0x1,%r8d
0x00000000027de2dc: jne 0x00000000027de2e7 ;*if_icmpne
; - BitwiseVersusMod::runTestBitwise@13 (line 39)
0x00000000027de2de: movslq %r11d,%r10
0x00000000027de2e1: add %rax,%r10 ;*ladd
; - BitwiseVersusMod::runTestBitwise@19 (line 41)
...
There is still the and instruction for testing the lowest bit. But for the modulo version, the core of the unrolled loop is
...
0x00000000027e3a0a: mov %r11,%r10
0x00000000027e3a0d: mov %ebx,%r8d
0x00000000027e3a10: add $0x2,%r8d ;*iinc
; - BitwiseVersusMod::runTestMod@21 (line 24)
0x00000000027e3a14: test %r8d,%r8d
0x00000000027e3a17: jl 0x00000000027e3a2e ;*irem
; - BitwiseVersusMod::runTestMod@11 (line 26)
0x00000000027e3a19: mov %r8d,%r11d
0x00000000027e3a1c: and $0x1,%r11d
0x00000000027e3a20: cmp $0x1,%r11d
0x00000000027e3a24: jne 0x00000000027e3a2e ;*if_icmpne
; - BitwiseVersusMod::runTestMod@13 (line 26)
...
I have to admit that I can not fully understand (at least, not in reasonable time) what exactly it is doing there. But in any case: The irem bytecode instruction is also implemented with an and assembly instruction, and there is no longer any idiv instruction in the resulting machine code.
So to repeat the first statement from this answer: There will hardly be a noticeable difference in practice. Not only because the cost of a single instruction will hardly ever be the bottleneck, but also because you never know which instructions will actually be executed, and in this particular case, you have to assume that they will basically be equal.
Actually neither of those expressions test divisibility by two (other than in the negative). They actually both resolve to true if x is odd.
There are many other ways of testing even/oddness (e.g. ((x / 2) * 2) == x) but none of them have the optimal properties of x & 1, solely because no compiler could possibly get it wrong and use a divide.
Most modern compilers would compile x % 2 to the same code as x & 1 but a particularly stupid one could implement x % 2 using a divide operation so it could be less efficient.
The argument as to which is better is a different story. A rookie/tired programmer may not recognize x & 1 as a test for odd numbers but x % 2 would be clearer so there is an argument that x % 2 would be the better option.
Me - I'd go for if ( Maths.isEven(x) ) making it absolutely clear what I mean. IMHO Efficiency comes way down the list, well past clarity and readability.
public class Maths {
// Method is final to encourage compiler to inline if it is bright enough.
public static final boolean isEven(long n) {
/* All even numbers have their lowest-order bit set to 0.
* This `should` therefore be the most efficient way to recognise
* even numbers.
* Also works for negative numbers.
*/
return (n & 1) == 0;
}
}
The binary operation is faster.
The mod operation has to calculate a division in order to get the remainder.