Adding numbers using Java Long wrapper versus primitive longs - java

I am running this code and getting unexpected results. I expect that the loop which adds the primitives would perform much faster, but the results do not agree.
import java.util.*;
public class Main {
public static void main(String[] args) {
StringBuilder output = new StringBuilder();
long start = System.currentTimeMillis();
long limit = 1000000000; //10^9
long value = 0;
for(long i = 0; i < limit; ++i){}
long i;
output.append("Base time\n");
output.append(System.currentTimeMillis() - start + "ms\n");
start = System.currentTimeMillis();
for(long j = 0; j < limit; ++j) {
value = value + j;
}
output.append("Using longs\n");
output.append(System.currentTimeMillis() - start + "ms\n");
start = System.currentTimeMillis();
value = 0;
for(long k = 0; k < limit; ++k) {
value = value + (new Long(k));
}
output.append("Using Longs\n");
output.append(System.currentTimeMillis() - start + "ms\n");
System.out.print(output);
}
}
Output:
Base time
359ms
Using longs
1842ms
Using Longs
614ms
I have tried running each individual test in it's own java program, but the results are the same. What could cause this?
Small detail: running java 1.6
Edit:
I asked 2 other people to try out this code, one gets the same exact strange results that I get. The other gets results that actually make sense! I asked the guy who got normal results to give us his class binary. We run it and we STILL get the strange results. The problem is not at compile time (I think). I'm running 1.6.0_31, the guy who gets normal results is on 1.6.0_16, the guy who gets strange results like I do is on 1.7.0_04.
Edit: Get same results with a Thread.sleep(5000) at the start of program. Also get the same results with a while loop around the whole program (to see if the times would converge to normal times after java was fully started up)

I suspect that this is a JVM warmup effect. Specifically, the code is being JIT compiled at some point, and this is distorting the times that you are seeing.
Put the whole lot in a loop, and ignore the times reported until they stabilize. (But note that they won't entirely stabilize. Garbage is being generated, and therefore the GC will need to kick occasionally. This is liable to distort the timings, at least a bit. The best way to deal with this is to run a huge number of iterations of the outer loop, and calculate / display the average times.)
Another problem is that the JIT compiler on some releases of Java may be able to optimize away the stuff you are trying to test:
It could figure out that the creation and immediate unboxing of the Long objects could be optimized away. (Thanks Louis!)
It could figure out that the loops are doing "busy work" ... and optimize them away entirely. (The value of value is not used once each loop ends.)
FWIW, it is generally recommended that you use Long.valueOf(long) rather than new Long(long) because the former can make use of a cached Long instance. However, in this case, we can predict that there will be a cache miss in all but the first few loop iterations, so the recommendation is not going to help. If anything, it is likely to make the loop in question slower.
UPDATE
I did some investigation of my own, and ended up with the following:
import java.util.*;
public class Main {
public static void main(String[] args) {
while (true) {
test();
}
}
private static void test() {
long start = System.currentTimeMillis();
long limit = 10000000; //10^9
long value = 0;
for(long i = 0; i < limit; ++i){}
long t1 = System.currentTimeMillis() - start;
start = System.currentTimeMillis();
for(long j = 0; j < limit; ++j) {
value = value + j;
}
long t2 = System.currentTimeMillis() - start;
start = System.currentTimeMillis();
for(long k = 0; k < limit; ++k) {
value = value + (new Long(k));
}
long t3 = System.currentTimeMillis() - start;
System.out.print(t1 + " " + t2 + " " + t3 + " " + value + "\n");
}
}
which gave me the following output.
28 58 2220 99999990000000
40 58 2182 99999990000000
36 49 157 99999990000000
34 51 157 99999990000000
37 49 158 99999990000000
33 52 158 99999990000000
33 50 159 99999990000000
33 54 159 99999990000000
35 52 159 99999990000000
33 52 159 99999990000000
31 50 157 99999990000000
34 51 156 99999990000000
33 50 159 99999990000000
Note that the first two columns are pretty stable, but the third one shows a significant speedup on the 3rd iteration ... probably indicating that JIT compilation has occurred.
Interestingly, before I separated out the test into a separate method, I didn't see the speedup on the 3rd iteration. The numbers all looked like the first two rows. And that seems to be saying that the JVM (that I'm using) won't JIT compile a method that is currently executing ... or something like that.
Anyway, this demonstrates (to me) that there should be a warm up effect. If you don't see a warmup effect, your benchmark is doing something that is inhibiting JIT compilation ... and therefore isn't meaningful for real applications.

I'm surprised, too.
My first guess would have been inadvertant "autoboxing", but that's clearly not an issue in your example code.
This link might give a clue:
http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/Long.html
valueOf
public static Long valueOf(long l)
Returns a Long instance representing the specified long value. If a new Long instance is not required, this method should generally be
used in preference to the constructor Long(long), as this method is
likely to yield significantly better space and time performance by
caching frequently requested values.
Parameters:
l - a long value.
Returns:
a Long instance representing l.
Since:
1.5
But yes, I would expect using a wrapper (e.g. "Long") to take MORE time, and MORE space. I would not expect using the wrapper to be three times FASTER!
================================================================================
ADDENDUM:
I got these results with your code:
Base time 6878ms
Using longs 10515ms
Using Longs 428022ms
I'm running JDK 1.6.0_16 on a pokey 32-bit, single-core CPU.

OK - here's a slightly different version, along with my results (running JDK 1.6.0_16 pokey 32-bit single-code CPU):
import java.util.*;
/*
Test Base longs Longs/new Longs/valueOf
---- ---- ----- --------- -------------
0 343 896 3431 6025
1 342 957 3401 5796
2 342 881 3379 5742
*/
public class LongTest {
private static int limit = 100000000;
private static int ntimes = 3;
private static final long[] base = new long[ntimes];
private static final long[] primitives = new long[ntimes];
private static final long[] wrappers1 = new long[ntimes];
private static final long[] wrappers2 = new long[ntimes];
private static void test_base (int idx) {
long start = System.currentTimeMillis();
for (int i = 0; i < limit; ++i){}
base[idx] = System.currentTimeMillis() - start;
}
private static void test_primitive (int idx) {
long value = 0;
long start = System.currentTimeMillis();
for (int i = 0; i < limit; ++i){
value = value + i;
}
primitives[idx] = System.currentTimeMillis() - start;
}
private static void test_wrappers1 (int idx) {
long value = 0;
long start = System.currentTimeMillis();
for (int i = 0; i < limit; ++i){
value = value + new Long(i);
}
wrappers1[idx] = System.currentTimeMillis() - start;
}
private static void test_wrappers2 (int idx) {
long value = 0;
long start = System.currentTimeMillis();
for (int i = 0; i < limit; ++i){
value = value + Long.valueOf(i);
}
wrappers2[idx] = System.currentTimeMillis() - start;
}
public static void main(String[] args) {
for (int i=0; i < ntimes; i++) {
test_base (i);
test_primitive(i);
test_wrappers1 (i);
test_wrappers2 (i);
}
System.out.println ("Test Base longs Longs/new Longs/valueOf");
System.out.println ("---- ---- ----- --------- -------------");
for (int i=0; i < ntimes; i++) {
System.out.printf (" %2d %6d %6d %6d %6d\n",
i, base[i], primitives[i], wrappers1[i], wrappers2[i]);
}
}
}
=======================================================================
5.28.2012:
Here are some additional timings, from a faster (but still modest), dual-core CPU running Windows 7/64 and running the same JDK revision 1.6.0_16:
/*
PC 1: limit = 100,000,000, ntimes = 3, JDK 1.6.0_16 (32-bit):
Test Base longs Longs/new Longs/valueOf
---- ---- ----- --------- -------------
0 343 896 3431 6025
1 342 957 3401 5796
2 342 881 3379 5742
PC 2: limit = 1,000,000,000, ntimes = 5,JDK 1.6.0_16 (64-bit):
Test Base longs Longs/new Longs/valueOf
---- ---- ----- --------- -------------
0 3 2 5627 5573
1 0 0 5494 5537
2 0 0 5475 5530
3 0 0 5477 5505
4 0 0 5487 5508
PC 2: "for loop" counters => long; limit = 10,000,000,000, ntimes = 5:
Test Base longs Longs/new Longs/valueOf
---- ---- ----- --------- -------------
0 6278 6302 53713 54064
1 6273 6286 53547 53999
2 6273 6294 53606 53986
3 6274 6325 53593 53938
4 6274 6279 53566 53974
*/
You'll notice:
I'm not using StringBuilder, and I separate out all of the I/O until the end of the program.
"long" primtive is consistently equivalent to a "no-op"
"Long" wrappers are consistently much, much slower
"new Long()" is slightly faster than "Long.valueOf()"
Changing the loop counters from "int" to "long" makes the first two columns ("base" and "longs" much slower.
"JIT warmup" is negligible after the the first few iterations...
... provided I/O (like System.out) and potentially memory-intensive activities (like StringBuilder) are moved outside of the actual test sections.

Related

java stream performace for finding maximum element form a list

I wrote a simple program to compare to performance of stream for finding maximum form list of integer. Surprisingly I found that the performance of ' stream way' 1/10 of 'usual way'. Am I doing something wrong? Is there any condition on which Stream way will not be efficient? Could anyone have a nice explanation for this behavior?
"stream way" took 80 milliseconds "usual way" took 15 milli seconds
Please find the code below
public class Performance {
public static void main(String[] args) {
ArrayList<Integer> a = new ArrayList<Integer>();
Random randomGenerator = new Random();
for (int i=0;i<40000;i++){
a.add(randomGenerator.nextInt(40000));
}
long start_s = System.currentTimeMillis( );
Optional<Integer> m1 = a.stream().max(Integer::compare);
long diff_s = System.currentTimeMillis( ) - start_s;
System.out.println(diff_s);
int e = a.size();
Integer m = Integer.MIN_VALUE;
long start = System.currentTimeMillis( );
for(int i=0; i < e; i++)
if(a.get(i) > m) m = a.get(i);
long diff = System.currentTimeMillis( ) - start;
System.out.println(diff);
}
}
Yes, Streams are slower for such simple operations. But your numbers are completely unrelated. If you think that 15 milliseconds is satisfactory time for your task, then there are good news: after warm-up stream code can solve this problem in like 0.1-0.2 milliseconds, which is 70-150 times faster.
Here's quick-and-dirty benchmark:
import java.util.concurrent.TimeUnit;
import java.util.*;
import java.util.stream.*;
import org.openjdk.jmh.infra.Blackhole;
import org.openjdk.jmh.annotations.*;
#Warmup(iterations = 5, time = 1000, timeUnit = TimeUnit.MILLISECONDS)
#Measurement(iterations = 10, time = 1000, timeUnit = TimeUnit.MILLISECONDS)
#BenchmarkMode(Mode.AverageTime)
#OutputTimeUnit(TimeUnit.MICROSECONDS)
#Fork(3)
#State(Scope.Benchmark)
public class StreamTest {
// Stream API is very nice to get random data for tests!
List<Integer> a = new Random().ints(40000, 0, 40000).boxed()
.collect(Collectors.toList());
#Benchmark
public Integer streamList() {
return a.stream().max(Integer::compare).orElse(Integer.MIN_VALUE);
}
#Benchmark
public Integer simpleList() {
int e = a.size();
Integer m = Integer.MIN_VALUE;
for(int i=0; i < e; i++)
if(a.get(i) > m) m = a.get(i);
return m;
}
}
The results are:
Benchmark Mode Cnt Score Error Units
StreamTest.simpleList avgt 30 38.241 ± 0.434 us/op
StreamTest.streamList avgt 30 215.425 ± 32.871 us/op
Here's microseconds. So the Stream version is actually much faster than your test. Nevertheless the simple version is even more faster. So if you were fine with 15 ms, you can use any of these two versions you like: both will perform much faster.
If you want to get the best possible performance no matter what, you should get rid of boxed Integer objects and work with primitive array:
int[] b = new Random().ints(40000, 0, 40000).toArray();
#Benchmark
public int streamArray() {
return Arrays.stream(b).max().orElse(Integer.MIN_VALUE);
}
#Benchmark
public int simpleArray() {
int e = b.length;
int m = Integer.MIN_VALUE;
for(int i=0; i < e; i++)
if(b[i] > m) m = b[i];
return m;
}
Both versions are faster now:
Benchmark Mode Cnt Score Error Units
StreamTest.simpleArray avgt 30 10.132 ± 0.193 us/op
StreamTest.streamArray avgt 30 167.435 ± 1.155 us/op
Actually the stream version result may vary greatly as it involves many intermediate methods which are JIT-compiled in different time, so the speed may change in any direction after some iterations.
By the way your original problem can be solved by good old Collections.max method without Stream API like this:
Integer max = Collections.max(a);
In general you should avoid testing the artificial code which does not solve real problems. With artificial code you will get the artificial results which generally say nothing about the API performance in real conditions.
The immediate difference that I see is that the stream way uses Integer::compare which might require more autoboxing etc. vs. an operator in the loop. perhaps you can call Integer::compare in the loop to see if this is the reason?
EDIT: following the advice from Nicholas Robinson, I wrote a new version of the test. It uses 400K sized list (the original one yielded zero diff results), it uses Integer.compare in both cases and runs only one of them in each invocation (I alternate between the two methods):
static List<Integer> a = new ArrayList<Integer>();
public static void main(String[] args)
{
Random randomGenerator = new Random();
for (int i = 0; i < 400000; i++) {
a.add(randomGenerator.nextInt(400000));
}
long start = System.currentTimeMillis();
//Integer max = checkLoop();
Integer max = checkStream();
long diff = System.currentTimeMillis() - start;
System.out.println("max " + max + " diff " + diff);
}
static Integer checkStream()
{
Optional<Integer> max = a.stream().max(Integer::compare);
return max.get();
}
static Integer checkLoop()
{
int e = a.size();
Integer max = Integer.MIN_VALUE;
for (int i = 0; i < e; i++) {
if (Integer.compare(a.get(i), max) > 0) max = a.get(i);
}
return max;
}
The results for loop: max 399999 diff 10
The results for stream: max 399999 diff 40 (and sometimes I got 50)
In Java 8 they have been putting a lot of effort into making use of concurrent processes with the new lambdas. You will find the stream to be so much faster because the list is being processed concurrently in the most efficient way possible where as the usual way is running through the list sequentially.
Because the lambda are static this makes threading easier, however when you are accessing something line your hard drive (reading in a file line by line) you will probably find the stream wont be as efficient because the hard drive can only access info.
[UPDATE]
The reason your stream took so much longer than the normal way is because you run in first. The JRE is constantly trying to optimize the performance so there will be a cache set up with the usual way. If you run the usual way before the stream way you should get opposing results. I would recommend running the tests in different mains for the best results.

Multi-Threading Mechanics?

I've been experimenting with multithreading and I'm really confused about a test I did.
I had been doing research and everything was talking about how multithreading can allow 2 processes to run at the same time.
I made this program so 3 different threads would use a for loop to count 1-10, 11-20, and 21-30 so that I could see if they actually ran at the same time how I expected them to.
After running the program the output is something like 1 2 3 4 5 6 7 8 9 10 21 22 23 24 25 26 27 28 29 30 11 12 13 14 15 16 17 18 19 20
Basically any variation order of the 3 sets of numbers. So they can all be in order or have 21-30 before 11-20 sometimes. That doesn't seem like it is running at the same time, just running one after another.
In the for loop in the println(i); if i change it to println(i + "a"); for 1-10 and b, c for 11-20, 21-30 The output is actually in a random order like I had expected. Like this: 1a 11b 21c 2a 22c 12b 13b 23c 3a
Does the program know its doing nothing but counting up and just throw all the numbers on the screen without actually doing it? Or does adding the string at the end make it slow enough for the other threads to sneak in between the operations? I know nothing about this ha.
public class Run{
static Runnable updatePosition;
static Runnable render;
static Runnable checkCollisions;
public static void main(String[] args) {
// System.out.println(Runtime.getRuntime().availableProcessors());
updatePosition = new Runnable() {
#Override
public void run() {
for (int i = 1; i <= 10; i++) {
System.out.println(i);
}
}
};
render = new Runnable() {
#Override
public void run() {
for (int i = 11; i <= 20; i++) {
System.out.println(i);
}
}
};
checkCollisions = new Runnable() {
#Override
public void run() {
for (int i = 21; i <= 30; i++) {
System.out.println(i);
}
}
};
Thread updatePositionThread = new Thread(updatePosition);
Thread renderThread = new Thread(render);
Thread checkCollisionsThread = new Thread(checkCollisions);
updatePositionThread.start();
renderThread.start();
checkCollisionsThread.start();
}
}
Also, how do threads get assigned to CPU cores? In depth please, reasonably. What I mean by asking this is: If I were to use a single thread program and use an update method and a draw method and they together took too long and made my program lag, would putting them on separate threads make this not help, or does it not actually run side by side? Assuming I can deal with all of the concurrency.
Actually, JIT optimizations, processor architecture and several other things play a part in how these kind of situations turn out.
The actual output should not be an ordered execution of threads like 1-10, 21-30, 11-20.
Changing your code a little to :
for (int i = 1; i <= 10000; i++)
for (int i = 10001; i <= 20000; i++)
for (int i = 20001; i <= 30000; i++)
I get (as expected, complete execution of one thread is not happening, as one might assume in your case). It is all about how much time a thread gets.
1
2
3
...
250
251
252
253
20001
20002
20003
...
20127
20128
10001
10002
10003
..
Changing i to "a" + i leads to dynamic construction of new Strings using StringBuilder, this will indeed take some time (and hence CPU cycles). On the other hand, primitive ints don't have this delay.. So, you get such an output.

Why is long slower than int in x64 Java?

I'm running Windows 8.1 x64 with Java 7 update 45 x64 (no 32 bit Java installed) on a Surface Pro 2 tablet.
The code below takes 1688ms when the type of i is a long and 109ms when i is an int. Why is long (a 64 bit type) an order of magnitude slower than int on a 64 bit platform with a 64 bit JVM?
My only speculation is that the CPU takes longer to add a 64 bit integer than a 32 bit one, but that seems unlikely. I suspect Haswell doesn't use ripple-carry adders.
I'm running this in Eclipse Kepler SR1, btw.
public class Main {
private static long i = Integer.MAX_VALUE;
public static void main(String[] args) {
System.out.println("Starting the loop");
long startTime = System.currentTimeMillis();
while(!decrementAndCheck()){
}
long endTime = System.currentTimeMillis();
System.out.println("Finished the loop in " + (endTime - startTime) + "ms");
}
private static boolean decrementAndCheck() {
return --i < 0;
}
}
Edit: Here are the results from equivalent C++ code compiled by VS 2013 (below), same system. long: 72265ms int: 74656ms Those results were in debug 32 bit mode.
In 64 bit release mode: long: 875ms long long: 906ms int: 1047ms
This suggests that the result I observed is JVM optimization weirdness rather than CPU limitations.
#include "stdafx.h"
#include "iostream"
#include "windows.h"
#include "limits.h"
long long i = INT_MAX;
using namespace std;
boolean decrementAndCheck() {
return --i < 0;
}
int _tmain(int argc, _TCHAR* argv[])
{
cout << "Starting the loop" << endl;
unsigned long startTime = GetTickCount64();
while (!decrementAndCheck()){
}
unsigned long endTime = GetTickCount64();
cout << "Finished the loop in " << (endTime - startTime) << "ms" << endl;
}
Edit: Just tried this again in Java 8 RTM, no significant change.
My JVM does this pretty straightforward thing to the inner loop when you use longs:
0x00007fdd859dbb80: test %eax,0x5f7847a(%rip) /* fun JVM hack */
0x00007fdd859dbb86: dec %r11 /* i-- */
0x00007fdd859dbb89: mov %r11,0x258(%r10) /* store i to memory */
0x00007fdd859dbb90: test %r11,%r11 /* unnecessary test */
0x00007fdd859dbb93: jge 0x00007fdd859dbb80 /* go back to the loop top */
It cheats, hard, when you use ints; first there's some screwiness that I don't claim to understand but looks like setup for an unrolled loop:
0x00007f3dc290b5a1: mov %r11d,%r9d
0x00007f3dc290b5a4: dec %r9d
0x00007f3dc290b5a7: mov %r9d,0x258(%r10)
0x00007f3dc290b5ae: test %r9d,%r9d
0x00007f3dc290b5b1: jl 0x00007f3dc290b662
0x00007f3dc290b5b7: add $0xfffffffffffffffe,%r11d
0x00007f3dc290b5bb: mov %r9d,%ecx
0x00007f3dc290b5be: dec %ecx
0x00007f3dc290b5c0: mov %ecx,0x258(%r10)
0x00007f3dc290b5c7: cmp %r11d,%ecx
0x00007f3dc290b5ca: jle 0x00007f3dc290b5d1
0x00007f3dc290b5cc: mov %ecx,%r9d
0x00007f3dc290b5cf: jmp 0x00007f3dc290b5bb
0x00007f3dc290b5d1: and $0xfffffffffffffffe,%r9d
0x00007f3dc290b5d5: mov %r9d,%r8d
0x00007f3dc290b5d8: neg %r8d
0x00007f3dc290b5db: sar $0x1f,%r8d
0x00007f3dc290b5df: shr $0x1f,%r8d
0x00007f3dc290b5e3: sub %r9d,%r8d
0x00007f3dc290b5e6: sar %r8d
0x00007f3dc290b5e9: neg %r8d
0x00007f3dc290b5ec: and $0xfffffffffffffffe,%r8d
0x00007f3dc290b5f0: shl %r8d
0x00007f3dc290b5f3: mov %r8d,%r11d
0x00007f3dc290b5f6: neg %r11d
0x00007f3dc290b5f9: sar $0x1f,%r11d
0x00007f3dc290b5fd: shr $0x1e,%r11d
0x00007f3dc290b601: sub %r8d,%r11d
0x00007f3dc290b604: sar $0x2,%r11d
0x00007f3dc290b608: neg %r11d
0x00007f3dc290b60b: and $0xfffffffffffffffe,%r11d
0x00007f3dc290b60f: shl $0x2,%r11d
0x00007f3dc290b613: mov %r11d,%r9d
0x00007f3dc290b616: neg %r9d
0x00007f3dc290b619: sar $0x1f,%r9d
0x00007f3dc290b61d: shr $0x1d,%r9d
0x00007f3dc290b621: sub %r11d,%r9d
0x00007f3dc290b624: sar $0x3,%r9d
0x00007f3dc290b628: neg %r9d
0x00007f3dc290b62b: and $0xfffffffffffffffe,%r9d
0x00007f3dc290b62f: shl $0x3,%r9d
0x00007f3dc290b633: mov %ecx,%r11d
0x00007f3dc290b636: sub %r9d,%r11d
0x00007f3dc290b639: cmp %r11d,%ecx
0x00007f3dc290b63c: jle 0x00007f3dc290b64f
0x00007f3dc290b63e: xchg %ax,%ax /* OK, fine; I know what a nop looks like */
then the unrolled loop itself:
0x00007f3dc290b640: add $0xfffffffffffffff0,%ecx
0x00007f3dc290b643: mov %ecx,0x258(%r10)
0x00007f3dc290b64a: cmp %r11d,%ecx
0x00007f3dc290b64d: jg 0x00007f3dc290b640
then the teardown code for the unrolled loop, itself a test and a straight loop:
0x00007f3dc290b64f: cmp $0xffffffffffffffff,%ecx
0x00007f3dc290b652: jle 0x00007f3dc290b662
0x00007f3dc290b654: dec %ecx
0x00007f3dc290b656: mov %ecx,0x258(%r10)
0x00007f3dc290b65d: cmp $0xffffffffffffffff,%ecx
0x00007f3dc290b660: jg 0x00007f3dc290b654
So it goes 16 times faster for ints because the JIT unrolled the int loop 16 times, but didn't unroll the long loop at all.
For completeness, here is the code I actually tried:
public class foo136 {
private static int i = Integer.MAX_VALUE;
public static void main(String[] args) {
System.out.println("Starting the loop");
for (int foo = 0; foo < 100; foo++)
doit();
}
static void doit() {
i = Integer.MAX_VALUE;
long startTime = System.currentTimeMillis();
while(!decrementAndCheck()){
}
long endTime = System.currentTimeMillis();
System.out.println("Finished the loop in " + (endTime - startTime) + "ms");
}
private static boolean decrementAndCheck() {
return --i < 0;
}
}
The assembly dumps were generated using the options -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly. Note that you need to mess around with your JVM installation to have this work for you as well; you need to put some random shared library in exactly the right place or it will fail.
The JVM stack is defined in terms of words, whose size is an implementation detail but must be at least 32 bits wide. The JVM implementer may use 64-bit words, but the bytecode can't rely on this, and so operations with long or double values have to be handled with extra care. In particular, the JVM integer branch instructions are defined on exactly the type int.
In the case of your code, disassembly is instructive. Here's the bytecode for the int version as compiled by the Oracle JDK 7:
private static boolean decrementAndCheck();
Code:
0: getstatic #14 // Field i:I
3: iconst_1
4: isub
5: dup
6: putstatic #14 // Field i:I
9: ifge 16
12: iconst_1
13: goto 17
16: iconst_0
17: ireturn
Note that the JVM will load the value of your static i (0), subtract one (3-4), duplicate the value on the stack (5), and push it back into the variable (6). It then does a compare-with-zero branch and returns.
The version with the long is a bit more complicated:
private static boolean decrementAndCheck();
Code:
0: getstatic #14 // Field i:J
3: lconst_1
4: lsub
5: dup2
6: putstatic #14 // Field i:J
9: lconst_0
10: lcmp
11: ifge 18
14: iconst_1
15: goto 19
18: iconst_0
19: ireturn
First, when the JVM duplicates the new value on the stack (5), it has to duplicate two stack words. In your case, it's quite possible that this is no more expensive than duplicating one, since the JVM is free to use a 64-bit word if convenient. However, you'll notice that the branch logic is longer here. The JVM doesn't have an instruction to compare a long with zero, so it has to push a constant 0L onto the stack (9), do a general long comparison (10), and then branch on the value of that calculation.
Here are two plausible scenarios:
The JVM is following the bytecode path exactly. In this case, it's doing more work in the long version, pushing and popping several extra values, and these are on the virtual managed stack, not the real hardware-assisted CPU stack. If this is the case, you'll still see a significant performance difference after warmup.
The JVM realizes that it can optimize this code. In this case, it's taking extra time to optimize away some of the practically unnecessary push/compare logic. If this is the case, you'll see very little performance difference after warmup.
I recommend you write a correct microbenchmark to eliminate the effect of having the JIT kick in, and also trying this with a final condition that isn't zero, to force the JVM to do the same comparison on the int that it does with the long.
Basic unit of data in a Java Virtual Machine is word. Choosing the right word size is left upon the implementation of the JVM. A JVM implementation should choose a minimum word size of 32 bits. It can choose a higher word size to gain efficiency. Neither there is any restriction that a 64 bit JVM should choose 64 bit word only.
The underlying architecture doesn't rules that the word size should also be the same. JVM reads/writes data word by word. This is the reason why it might be taking longer for a long than an int.
Here you can find more on the same topic.
I have just written a benchmark using caliper.
The results are quite consistent with the original code: a ~12x speedup for using int over long. It certainly seems that the loop unrolling reported by tmyklebu or something very similar is going on.
timeIntDecrements 195,266,845.000
timeLongDecrements 2,321,447,978.000
This is my code; note that it uses a freshly-built snapshot of caliper, since I could not figure out how to code against their existing beta release.
package test;
import com.google.caliper.Benchmark;
import com.google.caliper.Param;
public final class App {
#Param({""+1}) int number;
private static class IntTest {
public static int v;
public static void reset() {
v = Integer.MAX_VALUE;
}
public static boolean decrementAndCheck() {
return --v < 0;
}
}
private static class LongTest {
public static long v;
public static void reset() {
v = Integer.MAX_VALUE;
}
public static boolean decrementAndCheck() {
return --v < 0;
}
}
#Benchmark
int timeLongDecrements(int reps) {
int k=0;
for (int i=0; i<reps; i++) {
LongTest.reset();
while (!LongTest.decrementAndCheck()) { k++; }
}
return (int)LongTest.v | k;
}
#Benchmark
int timeIntDecrements(int reps) {
int k=0;
for (int i=0; i<reps; i++) {
IntTest.reset();
while (!IntTest.decrementAndCheck()) { k++; }
}
return IntTest.v | k;
}
}
For the record, this version does a crude "warmup":
public class LongSpeed {
private static long i = Integer.MAX_VALUE;
private static int j = Integer.MAX_VALUE;
public static void main(String[] args) {
for (int x = 0; x < 10; x++) {
runLong();
runWord();
}
}
private static void runLong() {
System.out.println("Starting the long loop");
i = Integer.MAX_VALUE;
long startTime = System.currentTimeMillis();
while(!decrementAndCheckI()){
}
long endTime = System.currentTimeMillis();
System.out.println("Finished the long loop in " + (endTime - startTime) + "ms");
}
private static void runWord() {
System.out.println("Starting the word loop");
j = Integer.MAX_VALUE;
long startTime = System.currentTimeMillis();
while(!decrementAndCheckJ()){
}
long endTime = System.currentTimeMillis();
System.out.println("Finished the word loop in " + (endTime - startTime) + "ms");
}
private static boolean decrementAndCheckI() {
return --i < 0;
}
private static boolean decrementAndCheckJ() {
return --j < 0;
}
}
The overall times improve about 30%, but the ratio between the two remains roughly the same.
For the records:
if i use
boolean decrementAndCheckLong() {
lo = lo - 1l;
return lo < -1l;
}
(changed "l--" to "l = l - 1l") long performance improves by ~50%
It's likely due to the JVM checking for safepoints when long is used (uncounted loop), and not doing it for int (counted loop).
Some references:
https://stackoverflow.com/a/62557768/14624235
https://stackoverflow.com/a/58726530/14624235
http://psy-lob-saw.blogspot.com/2016/02/wait-for-it-counteduncounted-loops.html
I don't have a 64 bit machine to test with, but the rather large difference suggests that there is more than the slightly longer bytecode at work.
I see very close times for long/int (4400 vs 4800ms) on my 32-bit 1.7.0_45.
This is only a guess, but I strongly suspect that it is the effect of a memory misalignment penalty. To confirm/deny the suspicion, try adding a public static int dummy = 0; before the declaration of i. That will push i down by 4 bytes in memory layout and may make it properly aligned for better performance. Confirmed to be not causing the issue.
EDIT: The reasoning behind this is that the VM may not reorder fields at its leisure adding padding for optimal alignment, since that may interfere with JNI (Not the case).

Profiling java code that calls Runtime.freeMemory()

I have some code that profiles Runtime.freeMemory. Here is my code:
package misc;
import java.util.ArrayList;
import java.util.Random;
public class FreeMemoryTest {
private final ArrayList<Double> l;
private final Random r;
public FreeMemoryTest(){
this.r = new Random();
this.l = new ArrayList<Double>();
}
public static boolean memoryCheck() {
double freeMem = Runtime.getRuntime().freeMemory();
double totalMem = Runtime.getRuntime().totalMemory();
double fptm = totalMem * 0.05;
boolean toReturn = fptm > freeMem;
return toReturn;
}
public void freeMemWorkout(int max){
for(int i = 0; i < max; i++){
memoryCheck();
l.add(r.nextDouble());
}
}
public void workout(int max){
for(int i = 0; i < max; i++){
l.add(r.nextDouble());
}
}
public static void main(String[] args){
FreeMemoryTest f = new FreeMemoryTest();
int count = Integer.parseInt(args[1]);
long startTime = System.currentTimeMillis();
if(args[0].equals("f")){
f.freeMemWorkout(count);
} else {
f.workout(count);
}
long endTime = System.currentTimeMillis();
System.out.println(endTime - startTime);
}
}
When I run the profiler using -Xrunhprof:cpu=samples, the vast majority of the calls are to the Runtime.freeMemory(), like this:
CPU SAMPLES BEGIN (total = 531) Fri Dec 7 00:17:20 2012
rank self accum count trace method
1 83.62% 83.62% 444 300274 java.lang.Runtime.freeMemory
2 9.04% 92.66% 48 300276 java.lang.Runtime.totalMemory
When I run the profiler using -Xrunhprof:cpu=time, I don't see any of the calls to Runtime.freeMemory at all, and the top five calls are as follows:
CPU TIME (ms) BEGIN (total = 10042) Fri Dec 7 00:29:51 2012
rank self accum count trace method
1 13.39% 13.39% 200000 307547 java.util.Random.next
2 9.69% 23.08% 1 307852 misc.FreeMemoryTest.freeMemWorkout
3 7.41% 30.49% 100000 307544 misc.FreeMemoryTest.memoryCheck
4 7.39% 37.88% 100000 307548 java.util.Random.nextDouble
5 4.35% 42.23% 100000 307561 java.util.ArrayList.add
These two profiles are so different from one another. I thought that samples was supposed to at least roughly approximate the results from the times, but here we see a very radical difference, something that consumes more than 80% of the samples doesn't even appear in the times profile. This does not make any sense to me, does anyone know why this is happening?
More on this:
$ java -Xmx1000m -Xms1000m -jar memtest.jar a 20000000 5524
//does not have the calls to Runtime.freeMemory()
$ java -Xmx1000m -Xms1000m -jar memtest.jar f 20000000 9442
//has the calls to Runtime.freeMemory()
Running with freemem requires approximately twice the amount of time as running without it. If 80% of the CPU time is spent in java.Runtime.freeMemory(), and I remove that call, I would expect the program to speed up by a factor of approximately 5. As we can see above, the program speeds up by a factor of approximately 2.
A slowdown of a factor of 5 is way worse than a slowdown of a factor of 2 that was observed empirically, so what I do not understand is how the sampling profiler is so far off from reality.
The Runtime freeMemory() and totalMemory() are native calls.
See http://www.docjar.com/html/api/java/lang/Runtime.java.html
The timer cannot time them, but the sampler can.

Strange performance issues of Java VM

Look please at this code:
public static void main(String[] args) {
String[] array = new String[10000000];
Arrays.fill(array, "Test");
long startNoSize;
long finishNoSize;
long startSize;
long finishSize;
for (int called = 0; called < 6; called++) {
startNoSize = Calendar.getInstance().getTimeInMillis();
for (int i = 0; i < array.length; i++) {
array[i] = String.valueOf(i);
}
finishNoSize = Calendar.getInstance().getTimeInMillis();
System.out.println(finishNoSize - startNoSize);
}
System.out.println("Length saved");
int length = array.length;
for (int called = 0; called < 6; called++) {
startSize = Calendar.getInstance().getTimeInMillis();
for (int i = 0; i < length; i++) {
array[i] = String.valueOf(i);
}
finishSize = Calendar.getInstance().getTimeInMillis();
System.out.println(finishSize - startSize);
}
}
The execution result differs from run to run, but there can be observed a strange behavior:
6510
4604
8805
6070
5128
8961
Length saved
6117
5194
8814
6380
8893
3982
Generally, there are 3 result: 6 seconds, 4 seconds, 8 seconds and they iterates in the same order.
Who knows, why does it happen?
UPDATE
After some playing with -Xms and -Xmx Java VM option the next results was observed:
The minimum total memory size should be at least 1024m for this code, otherwise there will be an OutOfMemoryError. The -Xms option influences the time of execution of for block:
It flows between 10 seconds for -Xms16m and 4 seconds for -Xms256m.
The question is - why the initial available memory size affect each iteration and not only the first one ?
Thank you in advance.
Micro benchmarking in Java is not that trivial. A lot of things happen in the background when we run a java program; Garbage collection being a prime example. There also might be the case of a context switch from your Java process to another process. IMO, there is no definite explanation why there is a sequence in the seemingly random times generated.
This is not entirely unexpected. There are all sorts of factors that could be affecting your numbers.
See: How do I write a correct micro-benchmark in Java?

Categories