My code looks like this:
public static int counter = 0;

public static void operation() {
    counter++;
}

public static void square(int n) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            operation();
}

public static void main(String[] args) {
    System.out.println("Start...");
    long start = System.nanoTime();
    square(350000);
    long end = System.nanoTime();
    System.out.println("Run time: " + ((end - start) / 1000000) + " ms");
}
I ran this code from IntelliJ and it took 6,500 ms, while Eclipse was much faster at 18 ms. I'm using a Skylake CPU and Java 11. Both IDEs are on the same settings; I didn't change a thing.
How can I configure IntelliJ to get the same results as Eclipse?
Thanks.
This is not about compile time but about run time.
The Eclipse compiler and the javac compiler used by IntelliJ IDEA generate different bytecode because they apply different optimizations. You get the same difference in run times on the command line if you compile the Java code with the two compilers and execute the results in the same Java VM.
For example, the inner loop of square(int)
for (int j = 0; j < n; j++)
operation();
is compiled by Eclipse to
L4
GOTO L5
L6
INVOKESTATIC Snippet.operation() : void
IINC 2: j 1
L5
ILOAD 2: j
ILOAD 0: n
IF_ICMPLT L6
whereas javac creates the following bytecode:
L4
ILOAD 2: j
ILOAD 0: n
IF_ICMPGE L5
INVOKESTATIC Snippet.operation() : void
IINC 2: j 1
GOTO L4
L5
Semantically, the two are the same, but the jump (GOTO) is executed only once (on loop entry, for j = 0) in the bytecode created by Eclipse, while GOTO is executed 350,000 times per inner loop in the bytecode created by javac. In combination with the machine code generated by the Java VM and the optimizations of the processor (especially inlining and branch prediction), this can lead to very different execution times, as in this case (I suppose that in one case the static field counter is updated only once and in the other case it is updated 350,000 x 350,000 times).
IntelliJ IDEA ships with (an older version of) the Eclipse compiler, which is not used by default. Switching IntelliJ to the Eclipse compiler should therefore produce the same bytecode.
Related: Seemingly endless loop terminates, unless System.out.println is used
I was messing around in Eclipse when, to my surprise, I found that this piece of code terminates without any error or exception when run:
public class Test {
    public static void main(String[] args) {
        for (int i = 2; i > 0; i++) {
            int c = 0;
        }
    }
}
while this piece of code keeps on executing:
public class Test {
    public static void main(String[] args) {
        for (int i = 2; i > 0; i++) {
            int c = 0;
            System.out.println(c);
        }
    }
}
Even though both ought to be infinite loops running forever. Is there something I'm missing as to why the first code snippet terminates?
First of all, neither snippet is an infinite loop: i overflows to a negative value once it passes Integer.MAX_VALUE. They just take a long time to run.
The first snippet takes much less time to run, since it doesn't have to print anything, and it's possible the compiler is smart enough to just optimize the code and eliminate the loop, since it does nothing.
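The overflow itself is easy to see in isolation (a minimal standalone demo, not from the question):

```java
public class OverflowDemo {
    public static void main(String[] args) {
        int i = Integer.MAX_VALUE;
        i++;  // wraps around: two's-complement overflow
        System.out.println(i);  // prints -2147483648 (Integer.MIN_VALUE)
    }
}
```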
Testing your first snippet, adding System.out.println (System.currentTimeMillis ()); before and after the loop, I got :
1486539220248
1486539221124
i.e. it ran in less than 1 second.
Changing the loop slightly :
System.out.println (System.currentTimeMillis ());
for(int i = 2; i > 0; i++){
int c = 0;
if (i==Integer.MAX_VALUE)
System.out.println (i);
}
System.out.println (System.currentTimeMillis ());
I got
1486539319309
2147483647
1486539319344
As you can see, it takes less than 1 second for i to increment from 2 to Integer.MAX_VALUE and then overflow, at which point the loop terminates.
The more prints you add to the loop, the more time it will take to terminate. For example :
System.out.println (System.currentTimeMillis ());
for(int i = 2; i > 0; i++){
int c = 0;
if (i % 100000000 == 0)
System.out.println (i);
}
System.out.println (System.currentTimeMillis ());
Output :
1486539560318
100000000
200000000
300000000
400000000
500000000
600000000
700000000
800000000
900000000
1000000000
1100000000
1200000000
1300000000
1400000000
1500000000
1600000000
1700000000
1800000000
1900000000
2000000000
2100000000
1486539563232
Now it took 3 seconds.
The point is: this loop doesn't have any visible side effect.
Thus one could assume that the compiler is optimizing away the complete loop. On the other hand, javac isn't exactly famous for doing a lot of optimizations. Thus, let's see what happens:
javap -c Test
...
public static void main(java.lang.String[]);
0: iconst_2
1: istore_1
2: iload_1
3: ifle 14
6: iconst_0
7: istore_2
8: iinc 1, 1
11: goto 2
14: return
Obviously the loop is still there. So what really happens is: your loop stops due to int overflow at some point, and the first version of your program simply reaches that point much more quickly (System.out.println() is a very expensive operation compared to pure integer arithmetic).
I found a fairly simple n-process mutual exclusion algorithm on page 4 (836) in the following paper: "Mutual Exclusion Using Indivisible Reads and Writes" by Burns and Lynch
program Process_i;
type flag = (down, up);
shared var F : array [1..N] of flag;
var j : 1..N;
begin
while true do begin
1: F[i] := down;
2: remainder; (* remainder region *)
3: F[i] := down;
4: for j := 1 to i-1 do
if F[j] = up then goto 3;
5: F[i] := up;
6: for j := 1 to i-1 do
if F[j] = up then goto 3;
7: for j := i+1 to N do
if F[j] = up then goto 7;
8: critical; (* critical region *)
end
end.
I like it, because of its minimal memory use and the goto's should allow me to implement it in a method enterCriticalRegion() that returns a boolean indicating whether the process succeeded in acquiring the lock (i.e. reached line 8) or whether it hit one of the goto's and needs to try again later rather than busy-waiting. (Fairness and starvation aren't really a concern in my case)
I tried to implement this in Java and test it out by having a bunch of threads try to enter the critical region in rapid succession (looks long, but it's mostly comments):
import java.util.concurrent.atomic.AtomicInteger;
public class BurnsME {
// Variable to count processes in critical section (for verification)
private static AtomicInteger criticalCount = new AtomicInteger(0);
// shared var F : array [1..N] of flag;
private static final boolean[] F = new boolean[10000];
// Some process-local variables
private final int processID;
private boolean atLine7;
public BurnsME(int processID) {
this.processID = processID;
this.atLine7 = false;
}
/**
* Try to enter critical region.
*
* @return T - success; F - failure, need to try again later
*/
public boolean enterCriticalRegion() {
// Burns Lynch Algorithm
// Mutual Exclusion Using Indivisible Reads and Writes, p. 836
if (!atLine7) {
// 3: F[i] down
F[processID] = false;
// 4: for j:=1 to i-1 do if F[j] = up goto 3
for (int process=0; process<processID; process++)
if (F[process]) return false;
// 5: F[i] = up
F[processID] = true;
// 6: for j:=1 to i-1 do if F[j] = up goto 3
for (int process=0; process<processID; process++)
if (F[process]) return false;
atLine7 = true;
}
// 7: for j:=i+1 to N do if F[j] = up goto 7
for (int process=processID+1; process<F.length; process++)
if (F[process]) return false;
// Verify mutual exclusion
if (criticalCount.incrementAndGet()>1) {
System.err.println("TWO PROCESSES ENTERED CRITICAL SECTION!");
System.exit(1);
}
// 8: critical region
return true;
}
/**
* Leave critical region and allow next process in
*/
public void leaveCriticalRegion() {
// Reset state
atLine7 = false;
criticalCount.decrementAndGet();
// Release critical region lock
// 1: F[i] = down
F[processID] = false;
}
//===============================================================================
// Test Code
private static final int THREADS = 50;
public static void main(String[] args) {
System.out.println("Launching "+THREADS+" threads...");
for (int i=0; i<THREADS; i++) {
final int threadID = i;
new Thread() {
@Override
public void run() {
BurnsME mutex = new BurnsME(threadID);
while (true) {
if (mutex.enterCriticalRegion()) {
System.out.println(threadID+" in critical region");
mutex.leaveCriticalRegion();
}
}
}
}.start();
}
while (true);
}
}
For some reason, the mutual exclusion verification (via the AtomicInteger) keeps failing after a few seconds and the program exits with the message TWO PROCESSES ENTERED CRITICAL SECTION!.
Both the algorithm and my implementation are so simple, that I'm a little perplexed why it's not working.
Is there something wrong with the Burns/Lynch algorithm (doubt it)? Or did I make some stupid mistake somewhere that I'm just not seeing? Or is this caused by some Java instruction reordering? The latter seems somewhat unlikely to me since each assignment is followed by a potential return and should thus not be swapped with any other, no? Or is array access in Java not thread safe?
A quick aside:
Here is how I visualize the Burns and Lynch algorithm (might help think about the issue):
I'm the process and I'm standing somewhere in a row with other people (processes). When I want to enter the critical section, I do the following:
3/4: I look to my left and keep my hand down as long as someone there has their hand up.
5: If no-one to my left has their hand up, I put mine up
6: I check again if anyone to my left has meanwhile put their hand up. If so, I put mine back down and start over. Otherwise, I keep my hand up.
7: Everyone to my right goes first, so I look to my right and wait until I don't see any hands up.
8: Once no-one to my right has their hand up any more, I can enter the critical section.
1: When I'm done, I put my hand back down.
Seems solid to me... Not sure why it shouldn't work reliably...
In the Java memory model, there is no guarantee that a write to F[i] will be visible to another thread that later reads it.
The standard solution for this kind of visibility problem is to declare the shared variable volatile, but here F is an array, and declaring it volatile only makes the reference volatile; reads and writes of the elements F[i] get no visibility guarantee.
It is not possible to declare an "array of volatile elements", but one can declare F as an AtomicIntegerArray and use compareAndSet to atomically change the array contents without worrying about thread visibility.
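As a sketch of that approach (class and variable names are mine, not from the question's code): the down/up flags become 0/1 entries in an AtomicIntegerArray, whose per-element reads and writes have volatile semantics:

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

public class FlagArrayDemo {
    // 0 = down, 1 = up; element accesses have volatile semantics
    static final AtomicIntegerArray F = new AtomicIntegerArray(4);

    public static void main(String[] args) {
        F.set(2, 1);                                   // raise flag 2; visible to all threads
        System.out.println(F.get(2));                  // prints 1
        boolean lowered = F.compareAndSet(2, 1, 0);    // atomically lower it
        System.out.println(lowered + " " + F.get(2));  // prints "true 0"
    }
}
```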
I'm running Windows 8.1 x64 with Java 7 update 45 x64 (no 32 bit Java installed) on a Surface Pro 2 tablet.
The code below takes 1688ms when the type of i is a long and 109ms when i is an int. Why is long (a 64 bit type) an order of magnitude slower than int on a 64 bit platform with a 64 bit JVM?
My only speculation is that the CPU takes longer to add a 64 bit integer than a 32 bit one, but that seems unlikely. I suspect Haswell doesn't use ripple-carry adders.
I'm running this in Eclipse Kepler SR1, btw.
public class Main {
private static long i = Integer.MAX_VALUE;
public static void main(String[] args) {
System.out.println("Starting the loop");
long startTime = System.currentTimeMillis();
while(!decrementAndCheck()){
}
long endTime = System.currentTimeMillis();
System.out.println("Finished the loop in " + (endTime - startTime) + "ms");
}
private static boolean decrementAndCheck() {
return --i < 0;
}
}
Edit: Here are the results from equivalent C++ code compiled by VS 2013 (below), on the same system, in 32-bit debug mode: long: 72265ms, int: 74656ms.
In 64-bit release mode: long: 875ms, long long: 906ms, int: 1047ms.
This suggests that the result I observed is JVM optimization weirdness rather than CPU limitations.
#include "stdafx.h"
#include "iostream"
#include "windows.h"
#include "limits.h"
long long i = INT_MAX;
using namespace std;
boolean decrementAndCheck() {
return --i < 0;
}
int _tmain(int argc, _TCHAR* argv[])
{
cout << "Starting the loop" << endl;
unsigned long startTime = GetTickCount64();
while (!decrementAndCheck()){
}
unsigned long endTime = GetTickCount64();
cout << "Finished the loop in " << (endTime - startTime) << "ms" << endl;
}
Edit: Just tried this again in Java 8 RTM, no significant change.
My JVM does this pretty straightforward thing to the inner loop when you use longs:
0x00007fdd859dbb80: test %eax,0x5f7847a(%rip) /* fun JVM hack */
0x00007fdd859dbb86: dec %r11 /* i-- */
0x00007fdd859dbb89: mov %r11,0x258(%r10) /* store i to memory */
0x00007fdd859dbb90: test %r11,%r11 /* unnecessary test */
0x00007fdd859dbb93: jge 0x00007fdd859dbb80 /* go back to the loop top */
It cheats, hard, when you use ints; first there's some screwiness that I don't claim to understand but looks like setup for an unrolled loop:
0x00007f3dc290b5a1: mov %r11d,%r9d
0x00007f3dc290b5a4: dec %r9d
0x00007f3dc290b5a7: mov %r9d,0x258(%r10)
0x00007f3dc290b5ae: test %r9d,%r9d
0x00007f3dc290b5b1: jl 0x00007f3dc290b662
0x00007f3dc290b5b7: add $0xfffffffffffffffe,%r11d
0x00007f3dc290b5bb: mov %r9d,%ecx
0x00007f3dc290b5be: dec %ecx
0x00007f3dc290b5c0: mov %ecx,0x258(%r10)
0x00007f3dc290b5c7: cmp %r11d,%ecx
0x00007f3dc290b5ca: jle 0x00007f3dc290b5d1
0x00007f3dc290b5cc: mov %ecx,%r9d
0x00007f3dc290b5cf: jmp 0x00007f3dc290b5bb
0x00007f3dc290b5d1: and $0xfffffffffffffffe,%r9d
0x00007f3dc290b5d5: mov %r9d,%r8d
0x00007f3dc290b5d8: neg %r8d
0x00007f3dc290b5db: sar $0x1f,%r8d
0x00007f3dc290b5df: shr $0x1f,%r8d
0x00007f3dc290b5e3: sub %r9d,%r8d
0x00007f3dc290b5e6: sar %r8d
0x00007f3dc290b5e9: neg %r8d
0x00007f3dc290b5ec: and $0xfffffffffffffffe,%r8d
0x00007f3dc290b5f0: shl %r8d
0x00007f3dc290b5f3: mov %r8d,%r11d
0x00007f3dc290b5f6: neg %r11d
0x00007f3dc290b5f9: sar $0x1f,%r11d
0x00007f3dc290b5fd: shr $0x1e,%r11d
0x00007f3dc290b601: sub %r8d,%r11d
0x00007f3dc290b604: sar $0x2,%r11d
0x00007f3dc290b608: neg %r11d
0x00007f3dc290b60b: and $0xfffffffffffffffe,%r11d
0x00007f3dc290b60f: shl $0x2,%r11d
0x00007f3dc290b613: mov %r11d,%r9d
0x00007f3dc290b616: neg %r9d
0x00007f3dc290b619: sar $0x1f,%r9d
0x00007f3dc290b61d: shr $0x1d,%r9d
0x00007f3dc290b621: sub %r11d,%r9d
0x00007f3dc290b624: sar $0x3,%r9d
0x00007f3dc290b628: neg %r9d
0x00007f3dc290b62b: and $0xfffffffffffffffe,%r9d
0x00007f3dc290b62f: shl $0x3,%r9d
0x00007f3dc290b633: mov %ecx,%r11d
0x00007f3dc290b636: sub %r9d,%r11d
0x00007f3dc290b639: cmp %r11d,%ecx
0x00007f3dc290b63c: jle 0x00007f3dc290b64f
0x00007f3dc290b63e: xchg %ax,%ax /* OK, fine; I know what a nop looks like */
then the unrolled loop itself:
0x00007f3dc290b640: add $0xfffffffffffffff0,%ecx
0x00007f3dc290b643: mov %ecx,0x258(%r10)
0x00007f3dc290b64a: cmp %r11d,%ecx
0x00007f3dc290b64d: jg 0x00007f3dc290b640
then the teardown code for the unrolled loop, itself a test and a straight loop:
0x00007f3dc290b64f: cmp $0xffffffffffffffff,%ecx
0x00007f3dc290b652: jle 0x00007f3dc290b662
0x00007f3dc290b654: dec %ecx
0x00007f3dc290b656: mov %ecx,0x258(%r10)
0x00007f3dc290b65d: cmp $0xffffffffffffffff,%ecx
0x00007f3dc290b660: jg 0x00007f3dc290b654
So it goes 16 times faster for ints because the JIT unrolled the int loop 16 times, but didn't unroll the long loop at all.
For completeness, here is the code I actually tried:
public class foo136 {
private static int i = Integer.MAX_VALUE;
public static void main(String[] args) {
System.out.println("Starting the loop");
for (int foo = 0; foo < 100; foo++)
doit();
}
static void doit() {
i = Integer.MAX_VALUE;
long startTime = System.currentTimeMillis();
while(!decrementAndCheck()){
}
long endTime = System.currentTimeMillis();
System.out.println("Finished the loop in " + (endTime - startTime) + "ms");
}
private static boolean decrementAndCheck() {
return --i < 0;
}
}
The assembly dumps were generated using the options -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly. Note that you need to mess around with your JVM installation to make this work for you as well; you need to put the hsdis disassembler shared library in exactly the right place or it will fail.
The JVM stack is defined in terms of words, whose size is an implementation detail but must be at least 32 bits wide. The JVM implementer may use 64-bit words, but the bytecode can't rely on this, and so operations with long or double values have to be handled with extra care. In particular, the JVM integer branch instructions are defined on exactly the type int.
In the case of your code, disassembly is instructive. Here's the bytecode for the int version as compiled by the Oracle JDK 7:
private static boolean decrementAndCheck();
Code:
0: getstatic #14 // Field i:I
3: iconst_1
4: isub
5: dup
6: putstatic #14 // Field i:I
9: ifge 16
12: iconst_1
13: goto 17
16: iconst_0
17: ireturn
Note that the JVM will load the value of your static i (0), subtract one (3-4), duplicate the value on the stack (5), and push it back into the variable (6). It then does a compare-with-zero branch and returns.
The version with the long is a bit more complicated:
private static boolean decrementAndCheck();
Code:
0: getstatic #14 // Field i:J
3: lconst_1
4: lsub
5: dup2
6: putstatic #14 // Field i:J
9: lconst_0
10: lcmp
11: ifge 18
14: iconst_1
15: goto 19
18: iconst_0
19: ireturn
First, when the JVM duplicates the new value on the stack (5), it has to duplicate two stack words. In your case, it's quite possible that this is no more expensive than duplicating one, since the JVM is free to use a 64-bit word if convenient. However, you'll notice that the branch logic is longer here. The JVM doesn't have an instruction to compare a long with zero, so it has to push a constant 0L onto the stack (9), do a general long comparison (10), and then branch on the value of that calculation.
Here are two plausible scenarios:
The JVM is following the bytecode path exactly. In this case, it's doing more work in the long version, pushing and popping several extra values, and these are on the virtual managed stack, not the real hardware-assisted CPU stack. If this is the case, you'll still see a significant performance difference after warmup.
The JVM realizes that it can optimize this code. In this case, it's taking extra time to optimize away some of the practically unnecessary push/compare logic. If this is the case, you'll see very little performance difference after warmup.
I recommend you write a correct microbenchmark to eliminate the effect of the JIT kicking in, and also try this with a termination condition that isn't zero, to force the JVM to do the same comparison on the int that it does with the long.
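A minimal hand-rolled version of such a benchmark might look like this (a sketch only; the bound of 10,000,000, the run count, and the class name are my choices, and a proper harness such as JMH is still preferable):

```java
public class WarmupBench {
    static long l;  // switch the type to int to compare the two cases

    static boolean decrementAndCheck() {
        return --l < 0;
    }

    // time one full countdown from `start` to below zero
    static long timeOnce(long start) {
        l = start;
        long t0 = System.nanoTime();
        while (!decrementAndCheck()) { }
        return System.nanoTime() - t0;
    }

    public static void main(String[] args) {
        // discard the early runs: they include JIT compilation time
        for (int run = 0; run < 5; run++) {
            System.out.println("run " + run + ": " + timeOnce(10_000_000L) + " ns");
        }
    }
}
```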
The basic unit of data in a Java Virtual Machine is the word. The choice of word size is left to the JVM implementation, subject to a minimum of 32 bits; an implementation may choose a larger word size for efficiency, and there is no requirement that a 64-bit JVM use 64-bit words.
The underlying architecture doesn't dictate the word size either. The JVM reads and writes data word by word, which is why operations on a long may take longer than on an int.
Here you can find more on the same topic.
I have just written a benchmark using caliper.
The results are quite consistent with the original code: a ~12x speedup for using int over long. It certainly seems that the loop unrolling reported by tmyklebu or something very similar is going on.
timeIntDecrements 195,266,845.000
timeLongDecrements 2,321,447,978.000
This is my code; note that it uses a freshly-built snapshot of caliper, since I could not figure out how to code against their existing beta release.
package test;
import com.google.caliper.Benchmark;
import com.google.caliper.Param;
public final class App {
@Param({""+1}) int number;
private static class IntTest {
public static int v;
public static void reset() {
v = Integer.MAX_VALUE;
}
public static boolean decrementAndCheck() {
return --v < 0;
}
}
private static class LongTest {
public static long v;
public static void reset() {
v = Integer.MAX_VALUE;
}
public static boolean decrementAndCheck() {
return --v < 0;
}
}
@Benchmark
int timeLongDecrements(int reps) {
int k=0;
for (int i=0; i<reps; i++) {
LongTest.reset();
while (!LongTest.decrementAndCheck()) { k++; }
}
return (int)LongTest.v | k;
}
@Benchmark
int timeIntDecrements(int reps) {
int k=0;
for (int i=0; i<reps; i++) {
IntTest.reset();
while (!IntTest.decrementAndCheck()) { k++; }
}
return IntTest.v | k;
}
}
For the record, this version does a crude "warmup":
public class LongSpeed {
private static long i = Integer.MAX_VALUE;
private static int j = Integer.MAX_VALUE;
public static void main(String[] args) {
for (int x = 0; x < 10; x++) {
runLong();
runWord();
}
}
private static void runLong() {
System.out.println("Starting the long loop");
i = Integer.MAX_VALUE;
long startTime = System.currentTimeMillis();
while(!decrementAndCheckI()){
}
long endTime = System.currentTimeMillis();
System.out.println("Finished the long loop in " + (endTime - startTime) + "ms");
}
private static void runWord() {
System.out.println("Starting the word loop");
j = Integer.MAX_VALUE;
long startTime = System.currentTimeMillis();
while(!decrementAndCheckJ()){
}
long endTime = System.currentTimeMillis();
System.out.println("Finished the word loop in " + (endTime - startTime) + "ms");
}
private static boolean decrementAndCheckI() {
return --i < 0;
}
private static boolean decrementAndCheckJ() {
return --j < 0;
}
}
The overall times improve about 30%, but the ratio between the two remains roughly the same.
For the record:
if i use
boolean decrementAndCheckLong() {
lo = lo - 1l;
return lo < -1l;
}
(changing the decrement to an explicit "lo = lo - 1l"), long performance improves by ~50%.
It's likely due to the JVM checking for safepoints when long is used (uncounted loop), and not doing it for int (counted loop).
Some references:
https://stackoverflow.com/a/62557768/14624235
https://stackoverflow.com/a/58726530/14624235
http://psy-lob-saw.blogspot.com/2016/02/wait-for-it-counteduncounted-loops.html
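The two loop shapes can be illustrated as follows (whether the per-iteration safepoint poll is actually elided depends on the JVM version and flags; this only shows counted vs. uncounted form):

```java
public class LoopShapes {
    public static void main(String[] args) {
        long acc = 0;
        // uncounted loop: long induction variable; HotSpot keeps a
        // safepoint poll in the loop body
        for (long i = 0; i < 1_000_000L; i++) {
            acc += i;
        }
        // counted loop: int induction variable with an int bound;
        // HotSpot can treat it as counted and elide the per-iteration poll
        for (int i = 0; i < 1_000_000; i++) {
            acc += i;
        }
        System.out.println(acc);  // prints 999999000000
    }
}
```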
I don't have a 64 bit machine to test with, but the rather large difference suggests that there is more than the slightly longer bytecode at work.
I see very close times for long/int (4400 vs 4800ms) on my 32-bit 1.7.0_45.
This is only a guess, but I strongly suspect the effect of a memory misalignment penalty. To confirm or deny the suspicion, try adding a public static int dummy = 0; before the declaration of i. That will push i down by 4 bytes in the memory layout and may make it properly aligned for better performance. Confirmed not to be causing the issue.
EDIT: The reasoning behind this was that the VM may not reorder fields at its leisure or add padding for optimal alignment, since that could interfere with JNI (not the case).
I am running this code and getting unexpected results. I expect that the loop which adds the primitives would perform much faster, but the results do not agree.
import java.util.*;
public class Main {
public static void main(String[] args) {
StringBuilder output = new StringBuilder();
long start = System.currentTimeMillis();
long limit = 1000000000; //10^9
long value = 0;
for(long i = 0; i < limit; ++i){}
long i;
output.append("Base time\n");
output.append(System.currentTimeMillis() - start + "ms\n");
start = System.currentTimeMillis();
for(long j = 0; j < limit; ++j) {
value = value + j;
}
output.append("Using longs\n");
output.append(System.currentTimeMillis() - start + "ms\n");
start = System.currentTimeMillis();
value = 0;
for(long k = 0; k < limit; ++k) {
value = value + (new Long(k));
}
output.append("Using Longs\n");
output.append(System.currentTimeMillis() - start + "ms\n");
System.out.print(output);
}
}
Output:
Base time
359ms
Using longs
1842ms
Using Longs
614ms
I have tried running each individual test in its own Java program, but the results are the same. What could cause this?
Small detail: running java 1.6
Edit:
I asked two other people to try this code; one gets exactly the same strange results I do. The other gets results that actually make sense! I asked the guy who got normal results to give us his class binary. We ran it and we STILL get the strange results, so the problem is not at compile time (I think). I'm running 1.6.0_31; the guy who gets normal results is on 1.6.0_16; the guy who gets strange results like I do is on 1.7.0_04.
Edit: I get the same results with a Thread.sleep(5000) at the start of the program, and also with a while loop around the whole program (to see if the times would converge to normal after Java was fully started up).
I suspect that this is a JVM warmup effect. Specifically, the code is being JIT compiled at some point, and this is distorting the times that you are seeing.
Put the whole lot in a loop, and ignore the times reported until they stabilize. (But note that they won't entirely stabilize. Garbage is being generated, and therefore the GC will need to kick in occasionally. This is liable to distort the timings, at least a bit. The best way to deal with this is to run a huge number of iterations of the outer loop and calculate/display the average times.)
Another problem is that the JIT compiler on some releases of Java may be able to optimize away the stuff you are trying to test:
It could figure out that the creation and immediate unboxing of the Long objects could be optimized away. (Thanks Louis!)
It could figure out that the loops are doing "busy work" ... and optimize them away entirely. (The value of value is not used once each loop ends.)
FWIW, it is generally recommended that you use Long.valueOf(long) rather than new Long(long) because the former can make use of a cached Long instance. However, in this case, we can predict that there will be a cache miss in all but the first few loop iterations, so the recommendation is not going to help. If anything, it is likely to make the loop in question slower.
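The caching behavior itself is easy to observe in isolation (a small standalone demo; only the [-128, 127] range is guaranteed to be cached):

```java
public class LongCacheDemo {
    public static void main(String[] args) {
        // values in [-128, 127] come from a shared cache, so valueOf
        // returns the same instance each time
        System.out.println(Long.valueOf(100) == Long.valueOf(100));   // true
        // values outside that range are allocated fresh on stock HotSpot
        System.out.println(Long.valueOf(1000) == Long.valueOf(1000)); // false
    }
}
```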
UPDATE
I did some investigation of my own, and ended up with the following:
import java.util.*;
public class Main {
public static void main(String[] args) {
while (true) {
test();
}
}
private static void test() {
long start = System.currentTimeMillis();
long limit = 10000000; //10^7
long value = 0;
for(long i = 0; i < limit; ++i){}
long t1 = System.currentTimeMillis() - start;
start = System.currentTimeMillis();
for(long j = 0; j < limit; ++j) {
value = value + j;
}
long t2 = System.currentTimeMillis() - start;
start = System.currentTimeMillis();
for(long k = 0; k < limit; ++k) {
value = value + (new Long(k));
}
long t3 = System.currentTimeMillis() - start;
System.out.print(t1 + " " + t2 + " " + t3 + " " + value + "\n");
}
}
which gave me the following output.
28 58 2220 99999990000000
40 58 2182 99999990000000
36 49 157 99999990000000
34 51 157 99999990000000
37 49 158 99999990000000
33 52 158 99999990000000
33 50 159 99999990000000
33 54 159 99999990000000
35 52 159 99999990000000
33 52 159 99999990000000
31 50 157 99999990000000
34 51 156 99999990000000
33 50 159 99999990000000
Note that the first two columns are pretty stable, but the third one shows a significant speedup on the 3rd iteration ... probably indicating that JIT compilation has occurred.
Interestingly, before I separated out the test into a separate method, I didn't see the speedup on the 3rd iteration. The numbers all looked like the first two rows. And that seems to be saying that the JVM (that I'm using) won't JIT compile a method that is currently executing ... or something like that.
Anyway, this demonstrates (to me) that there should be a warm up effect. If you don't see a warmup effect, your benchmark is doing something that is inhibiting JIT compilation ... and therefore isn't meaningful for real applications.
I'm surprised, too.
My first guess would have been inadvertent "autoboxing", but that's clearly not an issue in your example code.
This link might give a clue:
http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/Long.html
public static Long valueOf(long l)
Returns a Long instance representing the specified long value. If a new Long instance is not required, this method should generally be used in preference to the constructor Long(long), as this method is likely to yield significantly better space and time performance by caching frequently requested values.
Parameters: l - a long value.
Returns: a Long instance representing l.
Since: 1.5
But yes, I would expect using a wrapper (e.g. "Long") to take MORE time, and MORE space. I would not expect using the wrapper to be three times FASTER!
================================================================================
ADDENDUM:
I got these results with your code:
Base time 6878ms
Using longs 10515ms
Using Longs 428022ms
I'm running JDK 1.6.0_16 on a pokey 32-bit, single-core CPU.
OK - here's a slightly different version, along with my results (running JDK 1.6.0_16 on a pokey 32-bit, single-core CPU):
import java.util.*;
/*
Test Base longs Longs/new Longs/valueOf
---- ---- ----- --------- -------------
0 343 896 3431 6025
1 342 957 3401 5796
2 342 881 3379 5742
*/
public class LongTest {
private static int limit = 100000000;
private static int ntimes = 3;
private static final long[] base = new long[ntimes];
private static final long[] primitives = new long[ntimes];
private static final long[] wrappers1 = new long[ntimes];
private static final long[] wrappers2 = new long[ntimes];
private static void test_base (int idx) {
long start = System.currentTimeMillis();
for (int i = 0; i < limit; ++i){}
base[idx] = System.currentTimeMillis() - start;
}
private static void test_primitive (int idx) {
long value = 0;
long start = System.currentTimeMillis();
for (int i = 0; i < limit; ++i){
value = value + i;
}
primitives[idx] = System.currentTimeMillis() - start;
}
private static void test_wrappers1 (int idx) {
long value = 0;
long start = System.currentTimeMillis();
for (int i = 0; i < limit; ++i){
value = value + new Long(i);
}
wrappers1[idx] = System.currentTimeMillis() - start;
}
private static void test_wrappers2 (int idx) {
long value = 0;
long start = System.currentTimeMillis();
for (int i = 0; i < limit; ++i){
value = value + Long.valueOf(i);
}
wrappers2[idx] = System.currentTimeMillis() - start;
}
public static void main(String[] args) {
for (int i=0; i < ntimes; i++) {
test_base (i);
test_primitive(i);
test_wrappers1 (i);
test_wrappers2 (i);
}
System.out.println ("Test Base longs Longs/new Longs/valueOf");
System.out.println ("---- ---- ----- --------- -------------");
for (int i=0; i < ntimes; i++) {
System.out.printf (" %2d %6d %6d %6d %6d\n",
i, base[i], primitives[i], wrappers1[i], wrappers2[i]);
}
}
}
=======================================================================
5.28.2012:
Here are some additional timings, from a faster (but still modest), dual-core CPU running Windows 7/64 and running the same JDK revision 1.6.0_16:
/*
PC 1: limit = 100,000,000, ntimes = 3, JDK 1.6.0_16 (32-bit):
Test Base longs Longs/new Longs/valueOf
---- ---- ----- --------- -------------
0 343 896 3431 6025
1 342 957 3401 5796
2 342 881 3379 5742
PC 2: limit = 1,000,000,000, ntimes = 5,JDK 1.6.0_16 (64-bit):
Test Base longs Longs/new Longs/valueOf
---- ---- ----- --------- -------------
0 3 2 5627 5573
1 0 0 5494 5537
2 0 0 5475 5530
3 0 0 5477 5505
4 0 0 5487 5508
PC 2: "for loop" counters => long; limit = 10,000,000,000, ntimes = 5:
Test Base longs Longs/new Longs/valueOf
---- ---- ----- --------- -------------
0 6278 6302 53713 54064
1 6273 6286 53547 53999
2 6273 6294 53606 53986
3 6274 6325 53593 53938
4 6274 6279 53566 53974
*/
You'll notice:
I'm not using StringBuilder, and I separate out all of the I/O until the end of the program.
"long" primitive is consistently equivalent to a "no-op"
"Long" wrappers are consistently much, much slower
"new Long()" is slightly faster than "Long.valueOf()"
Changing the loop counters from "int" to "long" makes the first two columns ("base" and "longs") much slower.
"JIT warmup" is negligible after the first few iterations...
... provided I/O (like System.out) and potentially memory-intensive activities (like StringBuilder) are moved outside of the actual test sections.
Look please at this code:
public static void main(String[] args) {
String[] array = new String[10000000];
Arrays.fill(array, "Test");
long startNoSize;
long finishNoSize;
long startSize;
long finishSize;
for (int called = 0; called < 6; called++) {
startNoSize = Calendar.getInstance().getTimeInMillis();
for (int i = 0; i < array.length; i++) {
array[i] = String.valueOf(i);
}
finishNoSize = Calendar.getInstance().getTimeInMillis();
System.out.println(finishNoSize - startNoSize);
}
System.out.println("Length saved");
int length = array.length;
for (int called = 0; called < 6; called++) {
startSize = Calendar.getInstance().getTimeInMillis();
for (int i = 0; i < length; i++) {
array[i] = String.valueOf(i);
}
finishSize = Calendar.getInstance().getTimeInMillis();
System.out.println(finishSize - startSize);
}
}
The results differ from run to run, but a strange pattern can be observed:
6510
4604
8805
6070
5128
8961
Length saved
6117
5194
8814
6380
8893
3982
Generally, there are three results - 6 seconds, 4 seconds, and 8 seconds - and they repeat in the same order.
Who knows, why does it happen?
UPDATE
After some playing with the -Xms and -Xmx JVM options, the following was observed:
The total memory size must be at least 1024m for this code, otherwise there will be an OutOfMemoryError. The -Xms option influences the execution time of the for blocks:
it varies between 10 seconds for -Xms16m and 4 seconds for -Xms256m.
The question is: why does the initial heap size affect every iteration, and not only the first one?
Thank you in advance.
Microbenchmarking in Java is not trivial. A lot happens in the background when a Java program runs, garbage collection being a prime example. There may also be context switches from your Java process to another process. IMO, there is no definitive explanation for the pattern in the seemingly random times generated.
This is not entirely unexpected. There are all sorts of factors that could be affecting your numbers.
See: How do I write a correct micro-benchmark in Java?