Java 2D array fill - innocent optimization caused terrible slowdown

Java 2D array fill - innocent optimization caused terrible slowdown - java

I've tried to optimize a filling of square two-dimensional Java array with sums of indices at each element by computing each sum once for two elements, opposite relative to the main diagonal. But instead of speedup or, at least, comparable performance, I've got 23(!) times slower code.
My code:
#State(Scope.Benchmark)
#BenchmarkMode(Mode.AverageTime)
#OperationsPerInvocation(ArrayFill.N * ArrayFill.N)
#OutputTimeUnit(TimeUnit.NANOSECONDS)
public class ArrayFill {
public static final int N = 8189;
public int[][] g;
#Setup
public void setup() { g = new int[N][N]; }
#GenerateMicroBenchmark
public int simple(ArrayFill state) {
int[][] g = state.g;
for(int i = 0; i < g.length; i++) {
for(int j = 0; j < g[i].length; j++) {
g[i][j] = i + j;
}
}
return g[g.length - 1][g[g.length - 1].length - 1];
}
#GenerateMicroBenchmark
public int optimized(ArrayFill state) {
int[][] g = state.g;
for(int i = 0; i < g.length; i++) {
for(int j = 0; j <= i; j++) {
g[j][i] = g[i][j] = i + j;
}
}
return g[g.length - 1][g[g.length - 1].length - 1];
}
}
Benchmark results:
Benchmark Mode Mean Mean error Units
ArrayFill.simple avgt 0.907 0.008 ns/op
ArrayFill.optimized avgt 21.188 0.049 ns/op
The question:
How could so tremendous performance drop be explained?
P. S. Java version is 1.8.0-ea-b124, 64-bit 3.2 GHz AMD processor, benchmarks were executed in a single thread.

A side note: Your "optimized" version mightn't be faster at all, even when we leave all possible problems aside. There are multiple resources in a modern CPU and saturating one of them may stop you from any improvements. What I mean: The speed may be memory-bound, and trying to write twice as fast may in one iteration may change nothing at all.
I can see three possible reasons:
Your access pattern may enforce bound checks. In the "simple" loop they can be obviously eliminated, in the "optimized" only if the array is a square. It is, but this information is available only outside of the method (moreover a different piece of code could change it!).
The memory locality in your "optimized" loop is bad. It accesses essentially random memory locations as there's nothing like a 2D array in Java (only an array of arrays for which new int[N][N] is a shortcut). When iterating column-wise, you use only a single int from each loaded cacheline, i.e., 4 bytes out of 64.
The memory prefetcher can have a problem with your access pattern. The array with its 8189 * 8189 * 4 bytes is too big to fit in any cache. Modern CPUs have a prefetcher allowing to load a cache line during in advance, when it spots a regular access pattern. The capabilities of the prefetchers vary a lot. This might be irrelevant here, as you're only writing, but I'm not sure if it's possible to write into a cache-line which hasn't been fetched.
I guess the memory locality is the main culprit:
I added a method "reversed" which works juts like simple, but with
g[j][i] = i + j;
instead of
g[i][j] = i + j;
This "innocuous" change is a performance desaster:
Benchmark Mode Samples Mean Mean error Units
o.o.j.s.ArrayFillBenchmark.optimized avgt 20 10.484 0.048 ns/op
o.o.j.s.ArrayFillBenchmark.reversed avgt 20 20.989 0.294 ns/op
o.o.j.s.ArrayFillBenchmark.simple avgt 20 0.693 0.003 ns/op

I wrote version that works faster than "simple". But, I don't know why it is faster (. Here is the code:
class A {
public static void main(String[] args) {
int n = 8009;
long st, en;
// one
int gg[][] = new int[n][n];
st = System.nanoTime();
for(int i = 0; i < n; i++) {
for(int j = 0; j < n; j++) {
gg[i][j] = i + j;
}
}
en = System.nanoTime();
System.out.println("\nOne time " + (en - st)/1000000.d + " msc");
// two
int g[][] = new int[n][n];
st = System.nanoTime();
int odd = (n%2), l=n-odd;
for(int i = 0; i < l; ++i) {
int t0, t1;
int a0[] = g[t0 = i];
int a1[] = g[t1 = ++i];
for(int j = 0; j < n; ++j) {
a0[j] = t0 + j;
a1[j] = t1 + j;
}
}
if(odd != 0)
{
int i = n-1;
int a[] = g[i];
for(int j = 0; j < n; ++j) {
a[j] = i + j;
}
}
en = System.nanoTime();
System.out.println("\nOptimized time " + (en - st)/1000000.d + " msc");
int r = g[0][0]
// + gg[0][0]
;
System.out.println("\nZZZZ = " + r);
}
}
The results are:
One time 165.177848 msc
Optimized time 99.536178 msc
ZZZZ = 0
Can someone explain me we why it is faster?

http://www.learn-java-tutorial.com/Arrays.cfm#Multidimensional-Arrays-in-Memory
Picture: http://www.learn-java-tutorial.com/images/4715/Arrays03.gif
int[][] === array of arrays of values
int[] === array of values
class A {
public static void main(String[] args) {
int n = 5000;
int g[][] = new int[n][n];
long st, en;
// one
st = System.nanoTime();
for(int i = 0; i < n; i++) {
for(int j = 0; j < n; j++) {
g[i][j] = 10;
}
}
en = System.nanoTime();
System.out.println("\nTwo time " + (en - st)/1000000.d + " msc");
// two
st = System.nanoTime();
for(int i = 0; i < n; i++) {
g[i][i] = 20;
for(int j = 0; j < i; j++) {
g[j][i] = g[i][j] = 20;
}
}
en = System.nanoTime();
System.out.println("\nTwo time " + (en - st)/1000000.d + " msc");
// 3
int arrLen = n*n;
int[] arr = new int[arrLen];
st = System.nanoTime();
for(int i : arr) {
arr[i] = 30;
}
en = System.nanoTime();
System.out.println("\n3 time " + (en - st)/1000000.d + " msc");
// 4
st = System.nanoTime();
int i, j;
for(i = 0; i < n; i++) {
for(j = 0; j < n; j++) {
arr[i*n+j] = 40;
}
}
en = System.nanoTime();
System.out.println("\n4 time " + (en - st)/1000000.d + " msc");
}
}
Two time 71.998012 msc
Two time 551.664166 msc
3 time 63.74851 msc
4 time 57.215167 msc
P.S. I'am not a java spec =)

I see, you allocated a new array for the second run, but still, did you try changing the order of "unoptimized" and "optimized" runs? – fikto
I changed order of them and optimized it a little bit:
class A {
public static void main(String[] args) {
int n = 8009;
double q1, q2;
long st, en;
// two
int g[][] = new int[n][n];
st = System.nanoTime();
int odd = (n%2), l=n-odd;
for(int i = 0; i < l; ++i) {
int t0, t1;
int a0[] = g[t0 = i];
int a1[] = g[t1 = ++i];
for(int j = 0; j < n; ++j, ++t0, ++t1) {
a0[j] = t0;
a1[j] = t1;
}
}
if(odd != 0)
{
int i = n-1;
int a[] = g[i];
for(int j = 0; j < n; ++j, ++i) {
a[j] = i;
}
}
en = System.nanoTime();
System.out.println("Optimized time " + (q1=(en - st)/1000000.d) + " msc");
// one
int gg[][] = new int[n][n];
st = System.nanoTime();
for(int i = 0; i < n; i++) {
for(int j = 0; j < n; j++) {
gg[i][j] = i + j;
}
}
en = System.nanoTime();
System.out.println("One time " + (q2=(en - st)/1000000.d) + " msc");
System.out.println("1 - T1/T2 = " + (1 - q1/q2));
}
}
And results are:
Optimized time 99.360293 msc
One time 162.23607 msc
1 - T1/T2 = 0.3875573231033026

Related

Out of Memory Heap Space

I am working on a code that will output all the possible combinations of a certain amount of objects that will occur between three possibilities. My code is working for smaller numbers like 10,000 but I want it to be able to go up to 100,000. Anytime I go above 10,000 I get the following error code: java.lang.OutOfMemoryError: Java heap space.
Is there a more efficient way to store this information, or be able to circumvent the error code somehow? I have the code below to show what I am talking about
public static double Oxygen[][];
public static int OxygenPermutationRows;
public void OxygenCalculations(){
Scanner reader = new Scanner(System.in);
System.out.print("Oxygen Number: ");
int oxygenNumber = reader.nextInt();
System.out.println();
int oxygenIsotopes = 3;
OxygenPermutationRows = 0;
//Number of Feesable Permutations
if(oxygenNumber % 2 == 0)
{
OxygenPermutationRows = (1 + oxygenNumber) * ((oxygenNumber / 2) + 1);
}
else
{
OxygenPermutationRows = (1 + oxygenNumber) * (int)Math.floor(oxygenNumber / 2) + (int)Math.ceil(oxygenNumber / 2) + 2 + oxygenNumber;
}
int [][] Permutations = new int[OxygenPermutationRows][oxygenIsotopes];
int counterrow = 0;
int k;
for (int f = 0; f <= oxygenNumber; f++)
{
for (int j = 0; j <= oxygenNumber; j++)
{
k = oxygenNumber - j - f;
Permutations[counterrow][0] = f;
Permutations[counterrow][1] = j;
Permutations[counterrow][2] = k;
counterrow++;
if(f+j == oxygenNumber)
{
j = oxygenNumber + 10;
}
}
}
//TO CHECK PERMUTATION ARRAY VALUES
System.out.println("PERMUTATION ARRAY =======================================");
for (int i = 0; i < OxygenPermutationRows; i++) {
System.out.println();
for (int j = 0; j < 3; j++) {
System.out.print(Permutations[i][j] + " ");
}
}
public double[][] returnOxygen()
{
return Oxygen;
}
public double returnOxygenRows()
{
return OxygenPermutationRows;
}
}

Couldn't you split the 100000 iterations in chunks? Maybe chunks of 10000 iterations? And for every chunk, write the result in a file.This will get rid of the OOM error.

Raise max memory with the parameter -Xmx, you can do this in the run configuration for the java class

Why is running time of the same algorithm different in doubling ratio test?

I am analyzing brute force Three Sum algorithm. Let's say the running time of this algorithm is T(N)=aN^3. What I am doing is that I am running this ThreeSum.java program with 8Kints.txt and using that running time to calculate constant a. After calculating a I am guessing what the running time of 16Kints.txt is. Here is my ThreeSum.java file:
public class ThreeSum {
public static int count(int[] a) {
// Count triples that sum to 0.
int N = a.length;
int cnt = 0;
for (int i = 0; i < N; i++)
for (int j = i + 1; j < N; j++)
for (int k = j + 1; k < N; k++)
if (a[i] + a[j] + a[k] == 0)
cnt++;
return cnt;
}
public static void main(String[] args) {
In in = new In(args[0]);
int[] a = in.readAllInts();
Stopwatch timer = new Stopwatch();
int count = count(a);
StdOut.println("elapsed time = " + timer.elapsedTime());
StdOut.println(count);
}
}
When I run like this:
$ java ThreeSum 8Kints.txt
I get this:
elapsed time = 165.335
And now in doubling ratio experiment where I use the same method inside another client and run this client with multiple files as arguments and wanna try to compare the running time of 8Kints.txt with above method but I get different result actually faster result. Here is my DoublingRatio.java client:
public class DoublingRatio {
public static double timeTrial(int[] a) {
Stopwatch timer = new Stopwatch();
int cnt = ThreeSum.count(a);
return timer.elapsedTime();
}
public static void main(String[] args) {
In in;
int[][] inputs = new int[args.length][];
for (int i = 0; i < args.length; i++) {
in = new In(args[i]);
inputs[i] = in.readAllInts();
}
double prev = timeTrial(inputs[0]);
for (int i = 1; i < args.length; i++) {
double time = timeTrial(inputs[i]);
StdOut.printf("%6d %7.3f ", inputs[i].length, time);
StdOut.printf("%5.1f\n", time / prev);
prev = time;
}
}
}
When I run this like:
$ java DoublingRatio 1Kints.txt 2Kints.txt 4Kints.txt 8Kints.txt 16Kints.txt 32Kints.txt
I get faster reuslt and I wonder why:
N sec ratio
2000 2.631 7.8
4000 4.467 1.7
8000 34.626 7.8
I know it is something that has to do with Java not the algorithm? Does java optimizes some things under the hood.

Why does reducing the number of loops not speed up the program?

I have a program that does a lot of matrix multiplication. I thought I'd speed it up by reducing the number of loops in the code to see how much faster it would be (I'll try a matrix math library later). It turns out it's not faster at all. I've been able to replicate the problem with some example code. My guess was that testOne() would be faster than testTwo() because it doesn't create any new arrays and because it has a third as many loops. On my machine, its takes twice as long to run:
Duration for testOne with 5000 epochs: 657, loopCount: 64000000
Duration for testTwo with 5000 epochs: 365, loopCount: 192000000
My guess is that multOne() is slower than multTwo() because in multOne() the CPU is not writing to sequential memory addresses like it is in multTwo(). Does that sound right? Any explanations would be appreciated.
import java.util.Random;
public class ArrayTest {
double[] arrayOne;
double[] arrayTwo;
double[] arrayThree;
double[][] matrix;
double[] input;
int loopCount;
int rows;
int columns;
public ArrayTest(int rows, int columns) {
this.rows = rows;
this.columns = columns;
this.loopCount = 0;
arrayOne = new double[rows];
arrayTwo = new double[rows];
arrayThree = new double[rows];
matrix = new double[rows][columns];
Random random = new Random();
for (int i = 0; i < rows; i++) {
for (int j = 0; j < columns; j++) {
matrix[i][j] = random.nextDouble();
}
}
}
public void testOne(double[] input, int epochs) {
this.input = input;
this.loopCount = 0;
long start = System.currentTimeMillis();
long duration;
for (int i = 0; i < epochs; i++) {
multOne();
}
duration = System.currentTimeMillis() - start;
System.out.println("Duration for testOne with " + epochs + " epochs: " + duration + ", loopCount: " + loopCount);
}
public void multOne() {
for (int i = 0; i < rows; i++) {
for (int j = 0; j < columns; j++) {
arrayOne[i] += matrix[i][j] * arrayOne[i] * input[j];
arrayTwo[i] += matrix[i][j] * arrayTwo[i] * input[j];
arrayThree[i] += matrix[i][j] * arrayThree[i] * input[j];
loopCount++;
}
}
}
public void testTwo(double[] input, int epochs) {
this.loopCount = 0;
long start = System.currentTimeMillis();
long duration;
for (int i = 0; i < epochs; i++) {
arrayOne = multTwo(matrix, arrayOne, input);
arrayTwo = multTwo(matrix, arrayTwo, input);
arrayThree = multTwo(matrix, arrayThree, input);
}
duration = System.currentTimeMillis() - start;
System.out.println("Duration for testTwo with " + epochs + " epochs: " + duration + ", loopCount: " + loopCount);
}
public double[] multTwo(double[][] matrix, double[] array, double[] input) {
double[] newArray = new double[rows];
for (int i = 0; i < rows; i++) {
for (int j = 0; j < columns; j++) {
newArray[i] += matrix[i][j] * array[i] * input[j];
loopCount++;
}
}
return newArray;
}
public static void main(String[] args) {
int rows = 100;
int columns = 128;
ArrayTest arrayTest = new ArrayTest(rows, columns);
Random random = new Random();
double[] input = new double[columns];
for (int i = 0; i < columns; i++) {
input[i] = random.nextDouble();
}
arrayTest.testOne(input, 5000);
arrayTest.testTwo(input, 5000);
}
}

There is a simple reason why your tests take different time: they don't do the same thing. Since the two loops you compare are not functionally identical, the number of iterations is not a good metric to look at.
testOne takes longer than testTwo because:
In multOne you update arrayOne[i] in place, during each iteration
of the j loop. This means for each iteration of the inner loop j
you are using a new value of arrayOne[i], computed in the
previous iteration. This creates a loop carried dependency, which is
harder to optimise for the compiler, because you require the output
of the operation matrix[i][j] * arrayOne[i] * input[j] on the next
CPU clock cycle. This is not really possible with floating point
operations, which have a latency of a few clock cycles usually, so
it results in stalls, therefore reduced performance.
In testTwo you
update arrayOne only once per each iteration of the epoch, and
since there are no carried dependecies, the loop can be vectorised
efficiently, which results in better cache and arithmetic
performance.

How to display a result from a class?

I've been working on a program which multiplies matrices using threads. I written the program in a non-threaded fashion and it was working like a charm. However when I was writing it w/ threading function, it appears that I am not getting a result from the Threading class (I will rename it later on). Also if I were to use 1 thread, I would get all 3 matrices followed by a empty results for the matC matrix. Anything more than 1 would get me a Array index out of bounds.
Still pretty inept using classes and whatnot, and i apologize in advance of being somewhat wordy. But any help would be appreciated.
Main class:
package matrixthread;
import java.io.*;
import java.util.*;
public class MatrixThread extends Thread{
public static void matrixPrint(int[][] A) {
int i, j;
for (i = 0; i < A.length; i++) {
for (j = 0; j < A.length; j++) {
System.out.print(A[i][j] + " ");
}
System.out.println("");
}
}
public static void matrixFill(int[][] A) {
int i, j;
Random r = new Random();
for (i = 0; i < A.length; i++) {
for (j = 0; j < A.length; j++) {
A[i][j] = r.nextInt(10);
}
}
}
public static void main(String[] args) {
Scanner in = new Scanner(System.in);
int threadCounter = 0;
System.out.print("Enter number of rows: ");
int M = in.nextInt();
System.out.print("Enter number of threads: ");
int N = in.nextInt();
Thread[] thrd = new Thread[N];
if (M < N)
{
System.out.println("Error! Numbers of rows are greater than number of threads!");
System.exit(0);
}
if(M % N != 0)
{
System.out.println("Error! Number of rows and threads aren't equal");
System.exit(0);
}
int[][] matA = new int[M][M];
int[][] matB = new int[M][M];
int[][] matC = new int[M][M];
try
{
for (int x = 0; x < N; x ++)
for (int y = 0; y < N; y++)
{
thrd[threadCounter] = new Thread(new Threading(matA, matB, matC));
thrd[threadCounter].start();
thrd[threadCounter].join();
threadCounter++;
}
}
catch(InterruptedException ie){}
long startTime = (int)System.currentTimeMillis();
matrixFill(matA);
matrixFill(matB);
//mulMatrix(matA, matB, matC);
long stopTime = (int)System.currentTimeMillis();
int execTimeMin = (int) ((((stopTime - startTime)/1000)/60)%60);
int execTimeSec = ((int) ((stopTime - startTime)/1000)%60);
System.out.println("\n" + "Matrix 1: ");
matrixPrint(matA);
System.out.println("\n" + "Matrix 2: ");
matrixPrint(matB);
System.out.println("\n" + "Results: ");
matrixPrint(matC);
System.out.println("\n" + "Finish: " + execTimeMin + "m " + execTimeSec + "s");
}
}
And here's my threading class:
package matrixthread;
public class Threading implements Runnable {
//private int N;
private int[][] matA;
private int[][] matB;
private int[][] matC;
public Threading(int matA[][], int matB[][], int matC[][]) {
this.matA = matA;
this.matB = matB;
this.matC = matC;
}
#Override
public void run() {
int i, j, k;
for (i = 0; i < matA.length; i++) {
for (j = 0; j < matA.length; j++) {
for (k = 0; k < matA.length; k++) {
matC[i][j] += matA[i][k] * matB[k][j];
}
}
}
}
}
Here are my results using 2 rows and 1 thread
Matrix 1:
7 8
4 5
Matrix 2:
3 0
1 5
Results:
0 0
0 0

You have a large number of problems here. First and foremost is this loop:
for (int x = 0; x < N; x ++)
for (int y = 0; y < N; y++) {
thrd[threadCounter] = new Thread(new Threading(matA, matB, matC));
thrd[threadCounter].start();
thrd[threadCounter].join();
threadCounter++;
}
You run this loop before calling matrixFill. matA and matB are both equal to the Zero matrix at this point.
You then, for each Thread, call join() which waits for completion. So all you do is 0 * 0 = 0 N2 times.
Creating threads in a loop and calling join() on them as they are created waits for each thread to finish. This means that you never have two tasks running in parallel. This isn't threading, this is doing something with a single thread in a very complicated manner.
You then fill matA and matB and print out all three. Unsurprisingly, matC is still equal to the Zero matrix.
Your next issue is with your threading:
public void run() {
int i, j, k;
for (i = 0; i < matA.length; i++) {
for (j = 0; j < matA.length; j++) {
for (k = 0; k < matA.length; k++) {
matC[i][j] += matA[i][k] * matB[k][j];
}
}
}
}
Each thread runs this code. This code multiplies two matrices. Running this code N2 times as you do just makes a single matrix multiplication N2 times slower.
If you fixed your threading (see above) so that all your threads ran concurrently then all you would have is the biggest race hazard since in creation of Java. I highly doubt you would ever get the correct result.
TL;DR The reason matC is zero is because, when multiplication happens, maA == matB == 0.

First of all I don't know why you run the thread with the same data N^2 times. Possible you wanted to fill matA and matB every time with different data?
If you want to compute many data parallel you should do it in this way:
//First start all Threads
for (int x = 0; x < N; x ++)
for (int y = 0; y < N; y++)
{
thrd[threadCounter] = new Thread(new Threading(matA, matB, matC));
thrd[threadCounter].start();
threadCounter++;
}
//then wait for all
for(int i = 0; i < threadCounter; i++){
thrd[threadCounter].join();
}
Otherwise there is nothing parallel because you wait until the first Thread is ready and then do the next.
That you get a result of 0 0 0 0 is because matA and matB are 0 0 0 0 at the moment you start the threads.

Inline Code is Slower Than Function Calls / Static Functions in Java

I've been running some tests to see how inlining function code (explicitly writing function algorithms in the code itself) affects performance. I wrote a simple byte array to integer code and then wrapped it in a function, called it statically from another class, and called it statically from the class itself. The code is as follows:
public class FunctionCallSpeed {
public static final int numIter = 50000000;
public static void main (String [] args) {
byte [] n = new byte[4];
long start;
System.out.println("Function from Static Class =================");
start = System.nanoTime();
for (int i = 0; i < numIter; i++) {
StaticClass.toInt(n);
}
System.out.println("Elapsed time: " + (double)(System.nanoTime() - start) / 1000000000 + "s");
System.out.println("Function from Class ========================");
start = System.nanoTime();
for (int i = 0; i < numIter; i++) {
toInt(n);
}
System.out.println("Elapsed time: " + (double)(System.nanoTime() - start) / 1000000000 + "s");
int actual = 0;
int len = n.length;
System.out.println("Inline Function ============================");
start = System.nanoTime();
for (int i = 0; i < numIter; i++) {
for (int j = 0; j < len; j++) {
actual += n[len - 1 - j] << 8 * j;
}
}
System.out.println("Elapsed time: " + (double)(System.nanoTime() - start) / 1000000000 + "s");
}
public static int toInt(byte [] num) {
int actual = 0;
int len = num.length;
for (int i = 0; i < len; i++) {
actual += num[len - 1 - i] << 8 * i;
}
return actual;
}
}
The results are as follows:
Function from Static Class =================
Elapsed time: 0.096559931s
Function from Class ========================
Elapsed time: 0.015741711s
Inline Function ============================
Elapsed time: 0.837626286s
Is there something weird going on with the bytecode? I've looked at the bytecode myself, but I'm not very familiar and I can't make heads or tails of it.
EDIT
I added assert statements to read the outputs and then randomized the bytes read and the benchmark now behaves the way I thought it would. Thanks to Tomasz Nurkiewicz, who pointed me to the microbenchmark article. The resulting code is thus:
public class FunctionCallSpeed {
public static final int numIter = 50000000;
public static void main (String [] args) {
byte [] n;
long start, end;
int checker, calc;
end = 0;
System.out.println("Function from Object =================");
for (int i = 0; i < numIter; i++) {
checker = (int)(Math.random() * 65535);
n = toByte(checker);
start = System.nanoTime();
calc = StaticClass.toInt(n);
end += System.nanoTime() - start;
assert calc == checker;
}
System.out.println("Elapsed time: " + (double)end / 1000000000 + "s");
end = 0;
System.out.println("Function from Class ==================");
start = System.nanoTime();
for (int i = 0; i < numIter; i++) {
checker = (int)(Math.random() * 65535);
n = toByte(checker);
start = System.nanoTime();
calc = toInt(n);
end += System.nanoTime() - start;
assert calc == checker;
}
System.out.println("Elapsed time: " + (double)end / 1000000000 + "s");
int len = 4;
end = 0;
System.out.println("Inline Function ======================");
start = System.nanoTime();
for (int i = 0; i < numIter; i++) {
calc = 0;
checker = (int)(Math.random() * 65535);
n = toByte(checker);
start = System.nanoTime();
for (int j = 0; j < len; j++) {
calc += n[len - 1 - j] << 8 * j;
}
end += System.nanoTime() - start;
assert calc == checker;
}
System.out.println("Elapsed time: " + (double)(System.nanoTime() - start) / 1000000000 + "s");
}
public static byte [] toByte(int val) {
byte [] n = new byte[4];
for (int i = 0; i < 4; i++) {
n[i] = (byte)((val >> 8 * i) & 0xFF);
}
return n;
}
public static int toInt(byte [] num) {
int actual = 0;
int len = num.length;
for (int i = 0; i < len; i++) {
actual += num[len - 1 - i] << 8 * i;
}
return actual;
}
}
Results:
Function from Static Class =================
Elapsed time: 9.276437031s
Function from Class ========================
Elapsed time: 9.225660708s
Inline Function ============================
Elapsed time: 5.9512E-5s

It's always hard to make a guarantee of what the JIT is doing, but if I had to guess, it noticed the return value of the function was never being used, and optimized a lot of it out.
If you actually use the return value of your function I bet it changes the speed.

I ported your test case to caliper:
import com.google.caliper.SimpleBenchmark;
public class ToInt extends SimpleBenchmark {
private byte[] n;
private int total;
#Override
protected void setUp() throws Exception {
n = new byte[4];
}
public int timeStaticClass(int reps) {
for (int i = 0; i < reps; i++) {
total += StaticClass.toInt(n);
}
return total;
}
public int timeFromClass(int reps) {
for (int i = 0; i < reps; i++) {
total += toInt(n);
}
return total;
}
public int timeInline(int reps) {
for (int i = 0; i < reps; i++) {
int actual = 0;
int len = n.length;
for (int i1 = 0; i1 < len; i1++) {
actual += n[len - 1 - i1] << 8 * i1;
}
total += actual;
}
return total;
}
public static int toInt(byte[] num) {
int actual = 0;
int len = num.length;
for (int i = 0; i < len; i++) {
actual += num[len - 1 - i] << 8 * i;
}
return actual;
}
}
class StaticClass {
public static int toInt(byte[] num) {
int actual = 0;
int len = num.length;
for (int i = 0; i < len; i++) {
actual += num[len - 1 - i] << 8 * i;
}
return actual;
}
}
And indeed seems like inlined version is the slowest while two static versions are almost the same (as expected):
The reasons are hard to imagine. I can think of two factors:
JVM is better in performing micro-optimizations when code blocks are as small and simple to reason about as possible. When the function is inlined, the whole code becomes more complex and JVM gives up. With smaller toInt() function it JIT is more clever
cache locality - somehow JVM performs better with two small chunks of code (loop and method) rather than one bigger

You have several problems, but the main one is that you are testing one iteration of one optimised code. That is sure to give you mixed results. I suggest running the test for 2 seconds, ignoring the first 10,000 iterations or so.
If the result of a loop is not kept, the entire loop can be discarded after some random interval.
Breaking each test into a separate method
public class FunctionCallSpeed {
public static final int numIter = 50000000;
private static int dontOptimiseAway;
public static void main(String[] args) {
byte[] n = new byte[4];
for (int i = 0; i < 10; i++) {
test1(n);
test2(n);
test3(n);
System.out.println();
}
}
private static void test1(byte[] n) {
System.out.print("from Static Class: ");
long start = System.nanoTime();
for (int i = 0; i < numIter; i++) {
dontOptimiseAway = FunctionCallSpeed.toInt(n);
}
System.out.print((System.nanoTime() - start) / numIter + "ns ");
}
private static void test2(byte[] n) {
long start;
System.out.print("from Class: ");
start = System.nanoTime();
for (int i = 0; i < numIter; i++) {
dontOptimiseAway = toInt(n);
}
System.out.print((System.nanoTime() - start) / numIter + "ns ");
}
private static void test3(byte[] n) {
long start;
int actual = 0;
int len = n.length;
System.out.print("Inlined: ");
start = System.nanoTime();
for (int i = 0; i < numIter; i++) {
for (int j = 0; j < len; j++) {
actual += n[len - 1 - j] << 8 * j;
}
dontOptimiseAway = actual;
}
System.out.print((System.nanoTime() - start) / numIter + "ns ");
}
public static int toInt(byte[] num) {
int actual = 0;
int len = num.length;
for (int i = 0; i < len; i++) {
actual += num[len - 1 - i] << 8 * i;
}
return actual;
}
}
prints
from Class: 7ns Inlined: 11ns from Static Class: 9ns
from Class: 6ns Inlined: 8ns from Static Class: 8ns
from Class: 6ns Inlined: 9ns from Static Class: 6ns
This suggest that when the inner loop is optimised separately it is slightly more efficient.
However if I use an optimised conversion of bytes to int
public static int toInt(byte[] num) {
return num[0] + (num[1] << 8) + (num[2] << 16) + (num[3] << 24);
}
all the tests report
from Static Class: 0ns from Class: 0ns Inlined: 0ns
from Static Class: 0ns from Class: 0ns Inlined: 0ns
from Static Class: 0ns from Class: 0ns Inlined: 0ns
as its realised the test doesn't do anything useful. ;)

Your test is flawed. The second test is having the benefit of the first test already being run. You need to run each test case in its own JVM invocation.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.