I have two pieces of code, and I want to know which one runs faster and why. I don't know much about the JVM and the CPU yet, but I'm working hard on learning them. Every tip will help.
int[] a = new int[1000];
int[] b = new int[10000000];

long start = System.currentTimeMillis();
// method 1
for (int i = 0; i < 1000; i++) {
    for (int j = 0; j < 10000000; j++) {
        a[i]++;
    }
}
long end = System.currentTimeMillis();
System.out.println(end - start);

start = System.currentTimeMillis();
// method 2
for (int i = 0; i < 10000000; i++) {
    for (int j = 0; j < 1000; j++) {
        b[i]++;
    }
}
end = System.currentTimeMillis();
System.out.println(end - start);
I'll throw my answer in there: in theory they will be exactly the same, but in practice there will be a small, negligible difference, too small to really matter.
The basic idea is how array b is stored in memory. Because it is a lot larger, depending on your platform/implementation it might be stored in chunks, i.e. non-contiguously (at least in physical memory). That is likely, since an array of 10 million ints is 40 million bytes ≈ 38 MB!
EDIT: I get 572 ms and 593 ms, respectively.
Complexity
In terms of asymptotic complexity (e.g. big-O notation), they have the same running time.
Data locality
Ignoring any optimization for the moment...
b is larger and is thus more likely to be split across multiple pages. Because of this, the first method is likely faster.
The difference here is likely to be rather small, unless not all of these pages fit into RAM and need to be written to disk (which is unlikely here since b is only 10000000*4 = 40000000 bytes = 38 MB).
Optimization
The first method involves "execute a[i]++ 10000000 times" (for a fixed i), which can theoretically rather easily be converted to a[i] += 10000000 by the optimizer.
A similar optimization can occur for b, but only to b[i] += 1000, which still has to run 10000000 times.
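To make that concrete, here is a sketch of what such a transformation could produce (my illustration of the idea, not code HotSpot is guaranteed to emit):

// method 1 after (hypothetical) strength reduction: the inner loop collapses
for (int i = 0; i < 1000; i++) {
    a[i] += 10000000;
}
// method 2 after the same (hypothetical) transformation: still 10,000,000 iterations
for (int i = 0; i < 10000000; i++) {
    b[i] += 1000;
}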
The optimizer is free to do this or not do this. As far as I know, the Java language specification doesn't say much about what should and shouldn't be optimized, as long as it doesn't change the end result.
As an extreme result, the optimizer could, in theory, see that you're not doing anything with a or b after the loops and thus get rid of both loops.
The first loop runs faster on my system (median: 333 ms vs. 596 ms)
(Edit: I made a wrong assumption about the number of array accesses in my first response, see comments.)
Subsequent incremental (index++) accesses to the same array seem to be faster than random accesses or decremental (index--) accesses. I assume the Java HotSpot compiler can optimize away the array bounds checks if it recognizes that the array will be traversed incrementally.
When reversing the loops, it actually runs slower:
// incremental array index
for (int i = 0; i < 1000; i++) {
    for (int j = 0; j < 10000000; j++) {
        a[i]++;
    }
}

// decremental array index
for (int i = 1000 - 1; i >= 0; i--) {
    for (int j = 10000000 - 1; j >= 0; j--) {
        a[i]++;
    }
}
Incremental: 349 ms, decremental: 485 ms.
Without bounds checks, decremental loops are usually faster, especially on older processors, because the comparison against zero is cheap.
If my assumption is right, this makes 1000 optimized bounds checks versus 10000000 checks, so the first method is faster.
By the way, when benchmarking:
Do multiple rounds and compare the averages/medians instead of the first sample
In Java: give your benchmark a warmup phase (execute it a few times before measuring). On the first run, classes have to be loaded, and code might be interpreted before the HotSpot VM kicks in and compiles it to native code
Measure time deltas with System.nanoTime(). This gives more accurate timestamps. System.currentTimeMillis() is not that precise (it depends on the VM) and usually jumps in chunks of a dozen or more milliseconds, making your result times look more volatile than they actually are. By the way: 1 millisecond = 1,000,000 nanoseconds.
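Putting those tips together, a minimal harness could look like the sketch below (the method name and round counts are mine, purely for illustration):

static long timeMethod1(int[] a) {
    long t0 = System.nanoTime();
    for (int i = 0; i < 1000; i++) {
        for (int j = 0; j < 10000000; j++) {
            a[i]++;
        }
    }
    return System.nanoTime() - t0;
}

public static void main(String[] args) {
    int[] a = new int[1000];
    for (int w = 0; w < 5; w++) {
        timeMethod1(a);                  // warmup rounds: let HotSpot compile the loop
    }
    long[] samples = new long[9];
    for (int r = 0; r < samples.length; r++) {
        samples[r] = timeMethod1(a);     // measured rounds
    }
    java.util.Arrays.sort(samples);
    System.out.println("median ns: " + samples[samples.length / 2]);
    System.out.println(a[0]);            // consume the result so the loop isn't dead code
}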
My guess would be that they are both pretty much the same. One of them has a smaller array to handle, but that wouldn't make much difference except for the initial allocation of memory, which is outside your measurement anyway.
The time to execute each iteration should be the same (writing a value into an array). Incrementing larger numbers shouldn't take the JVM longer than incrementing smaller numbers, nor should addressing a smaller or larger array index.
But why the question if you already know how to measure yourself?
Check out big-O notation.
A nested for loop is O(n*m), and both versions perform the same total number of iterations (1000 * 10,000,000 = 10^10), so in theory they will run the same.
The bounds 1000 and 10000000 are just constants, and constants don't change the asymptotic class.
They won't be exactly identical in practice because of various other things at play, but it will be close.
Time should be equal; the result will obviously be different, since a will contain 1000 entries with value 10000000 and b will contain 10000000 entries with value 1000. I don't really get your question. What's the result of end-start?
It might be that the JVM will optimize the for loops: if it understands what the end result in the array will be, the smaller array will be much easier to compute, since it requires only 1000 assignments, while the other one needs 10,000 times more.
The first will be faster, because a has far fewer cells, so far fewer distinct memory locations have to be initialized and updated.
Modern architectures are complex, so answering this kind of question is never easy.
The run time could be the same, or the first could be faster.
The things to consider in this case are mostly memory access and optimization.
A good optimizer will realize that the values are never read, so the loops can be skipped entirely, which gives a run time of 0 in both cases. Such optimization can take place at compile time or at run-time.
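When benchmarking, you can guard against that by consuming the results; something as simple as this sketch of mine is usually enough:

// Use the arrays after timing so the loops cannot be eliminated as dead code
long checksum = 0;
for (int i = 0; i < a.length; i++) checksum += a[i];
for (int i = 0; i < b.length; i++) checksum += b[i];
System.out.println("checksum: " + checksum);  // forces the loop results to be observable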
If it isn't optimized away, then there is memory access to consider. a[] is much smaller than b[], so it will more readily fit in faster cache memory resulting in fewer cache misses.
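To put rough numbers on it (typical sizes, your CPU may differ): a[] is 1000 * 4 bytes ≈ 4 KB, which fits comfortably in a 32 KB L1 data cache, while b[] is about 40 MB, far larger than a typical 256 KB L2 or even a multi-megabyte L3, so iterating over b[] keeps streaming data in from main memory.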
Another thing to consider is memory interleaving.
Related
In Java, for primitive arrays, is reusing arrays significantly faster than repeatedly recreating them?
The following snippet compares the two cases: (a) reuse an array via System.arraycopy vs. (b) repeatedly create an array of specified values, in an intensive loop.
public static void testArrayAllocationVsCopyPerformance() {
    long time = System.currentTimeMillis();
    final int length = 3000;
    boolean[] array = new boolean[length];
    boolean[] backup = new boolean[length];
    //for (int j = 0; j < length; j++) {
    //    backup[j] = true;
    //}
    for (int i = 0; i < 10000000; i++) {
        //(a). array copy
        //System.arraycopy(backup, 0, array, 0, length);
        //(b). reconstruct arrays
        array = new boolean[length];
        //for (int j = 0; j < length; j++) {
        //    array[j] = true;
        //}
    }
    long millis = System.currentTimeMillis() - time;
    System.out.println("Time taken: " + millis + " milliseconds.");
    System.exit(0);
}
On my PC, (b) takes around 2600 milliseconds on average, while (a) takes around 450 milliseconds on average. For recreating an array with different values, the performance gap grows wider: (b) takes around 3750 milliseconds on average, while (a) remains constant at around 450 milliseconds.
In the snippet above, if 'boolean' is changed to 'int', the results are similar: reusing the int array takes around one third of the time of recreating arrays. In addition, (a) is only slightly less readable than (b), which does not need the 'backup' array.
However, answers to similar questions on Stack Overflow or Stack Exchange regarding Java object creation are always things like "don't optimize it until it becomes a bottleneck", "the JIT or JVM handles these better and faster than you can yourself", "keep it simple for readability", etc. And these kinds of answers are typically well received by viewers.
The question is: can the performance comparison snippet above show that array copy is significantly faster compared to array re-creation with respect to short-lived primitive arrays? Is the snippet above flawed? Or should people still not optimize it until it becomes a bottleneck, etc.?
Can the performance comparison snippet above show that array copy is significantly faster compared to array re-creation with respect to short-lived primitive arrays?
Yes; however, you don't really need to prove it. An array occupies a contiguous space in memory. System.arraycopy is a native method dedicated to copying arrays. It is obvious that it will be faster than creating the array, iterating over it, increasing the counter on every iteration, checking the boolean expression to see whether the loop should terminate, assigning the primitive to a certain position in the array, and so on.
You should also remember that compilers nowadays are quite smart and might replace your code with a more efficient version. In that case, you would not observe any difference in the test you wrote. Also, keep in mind that Java uses a just-in-time compiler, which might optimize your code after you have run it many times, once it decides the optimization is worth doing.
Is the snippet above flawed?
Yes, it is flawed. What you are doing here is microbenchmarking, but you haven't done any warm-up phase. I suggest doing some more reading about this topic.
Or people still shouldn't optimize it until it becomes a bottleneck, etc?
You should never do premature optimization. If there are performance issues, run the code with a profiler enabled, identify the bottleneck, and fix the problem.
However, you should also use some common sense. If you have a List and are adding elements at the front, use a LinkedList, not an ArrayList. If you need to copy an entire array, use System.arraycopy instead of looping over it and doing it manually.
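For instance, a quick sketch of the array-copy case:

int[] src = {1, 2, 3, 4, 5};
int[] dst = new int[src.length];
System.arraycopy(src, 0, dst, 0, src.length);  // dedicated native copy
// instead of the equivalent manual loop:
for (int i = 0; i < src.length; i++) {
    dst[i] = src[i];
}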
I've seen that the JITC uses unsigned comparison for checking array bounds (the test 0 <= x < LIMIT is equivalent to x ≺ LIMIT, where ≺ treats the numbers as unsigned quantities). So I was curious whether it works for arbitrary comparisons of the form 0 <= x < LIMIT as well.
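To spell out the equivalence with a small example of my own: a negative x, reinterpreted as unsigned, becomes a huge value, so a single unsigned compare covers both bounds:

int x = -1;
int limit = 100;
boolean inRange = 0 <= x && x < limit;            // false: x is negative
boolean viaUnsigned = (x & 0xFFFFFFFFL) < limit;  // false: -1 reads as 4294967295 unsigned
// Since Java 8, Integer.compareUnsigned(x, limit) < 0 expresses the same test directly.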
The results of my benchmark are pretty confusing. I've created three experiments of the form
for (int i=0; i<LENGTH; ++i) {
int x = data[i];
if (condition) result += x;
}
with different conditions
0 <= x called above
x < LIMIT called below
0 <= x && x < LIMIT called inRange
0 <= x & x < LIMIT called inRange2
and prepared the data so that the probabilities of the condition being true are the same.
The results should be fairly similar, except that above might be slightly faster, as it compares against zero. Even if the JITC couldn't use the unsigned comparison for the test, the results for above and below should still be similar.
Can anyone explain what's going on here? It's quite possible that I did something wrong...
Update
I'm using Java build 1.7.0_51-b13 on Ubuntu (kernel 2.6.32-54-generic) with an i5-2400 CPU @ 3.10 GHz, in case anybody cares. As the results for inRange and inRange2 near 0.00 are especially confusing, I re-ran the benchmark with more steps in that area.
The variation in the benchmark results most likely has to do with CPU caching at different levels.
Since primitive ints are being used, there is no JVM-specific caching going on, as would happen with auto-boxed Integer values.
Thus all that remains, given the minimal memory consumption of the data[] array, is CPU caching of low-level values/operations. Since, as described, the distributions of values are based on random values with statistical probabilities of the conditions being true across the tests, the likely cause is that, depending on the values, more or less (random) caching is going on for each test, leading to more randomness.
Further, depending on the isolation of the computer (background services/processes), the test cases may not be running in complete isolation. Ensure that everything is shut down for these tests except the core OS functions and the JVM. Set the JVM memory min/max to the same value, shut down any networking processes, updates, etc.
Are your test results the average of a number of runs, or did you only test each function once?
One thing I have found is that the first time you run a for loop, the JVM will interpret it; then, after it has run enough times, the JVM will compile and optimize it. Therefore the first few runs may get horrible performance, but after a few runs it will be near native performance.
I also figured out that a loop will not be optimized while it is running. I have not tested whether this applies to just the loop or to the whole function. If it only applies to the loop, you may get much more performance if you nest it in an inner and an outer loop and work with your data one block at a time. If it is the whole function, you will have to place the inner loop in its own function, as sketched below.
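For example, along those lines (a sketch; the names and block size are mine), extracting the hot loop gives the JIT a compact unit to compile:

static long sumBlock(int[] data, int from, int to) {
    long sum = 0;
    for (int i = from; i < to; i++) {
        sum += data[i];          // hot inner loop, compiled as its own unit
    }
    return sum;
}

static long sumAll(int[] data) {
    long total = 0;
    for (int start = 0; start < data.length; start += 4096) {
        // process the data one block at a time
        total += sumBlock(data, start, Math.min(start + 4096, data.length));
    }
    return total;
}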
Also run the test more than once; if you compare the timings, you will notice how the JIT optimizes the code in stages.
For most code this gives Java optimal performance. It allows it to skip costly optimization of code that runs rarely and makes code that runs often a lot faster. However, if you have a code block that runs only once but for a long time, it will become horribly slow.
This piece of code is from the dotproduct method of a vector class of mine. The method computes the inner product against a target array of vectors (1000 vectors).
When the vector length is an odd number (262145), the compute time is 4.37 seconds. When the vector length (N) is 262144 (a multiple of 8), the compute time is 1.93 seconds.
time1=System.nanotime();
int count=0;
for(int j=0;j<1000;i++)
{
b=vektors[i]; // selects next vector(b) to multiply as inner product.
// each vector has an array of float elements.
if(((N/2)*2)!=N)
{
for(int i=0;i<N;i++)
{
t1+=elements[i]*b.elements[i];
}
}
else if(((N/8)*8)==N)
{
float []vek=new float[8];
for(int i=0;i<(N/8);i++)
{
vek[0]=elements[i]*b.elements[i];
vek[1]=elements[i+1]*b.elements[i+1];
vek[2]=elements[i+2]*b.elements[i+2];
vek[3]=elements[i+3]*b.elements[i+3];
vek[4]=elements[i+4]*b.elements[i+4];
vek[5]=elements[i+5]*b.elements[i+5];
vek[6]=elements[i+6]*b.elements[i+6];
vek[7]=elements[i+7]*b.elements[i+7];
t1+=vek[0]+vek[1]+vek[2]+vek[3]+vek[4]+vek[5]+vek[6]+vek[7];
//t1 is total sum of all dot products.
}
}
}
time2=System.nanotime();
time3=(time2-time1)/1000000000.0; //seconds
Question: could the reduction in time from 4.37 s to 1.93 s (2x as fast) be the JIT wisely deciding to use SIMD instructions, or is it just the positive effect of my loop unrolling?
If the JIT cannot do SIMD optimization automatically, does that mean that in this example there is also no unrolling optimization done automatically by the JIT?
For 1M iterations (vectors) and a vector size of 64, the speedup multiplier goes to 3.5x (a cache advantage?).
Thanks.
Your code has a bunch of problems. Are you sure you're measuring what you think you're measuring?
Your first loop does this, indented more conventionally:
for(int j=0;j<1000;i++) {
b=vektors[i]; // selects next vector(b) to multiply as inner product.
// each vector has an array of float elements.
}
Your rolled loop involves a really long chain of dependent loads and stores. Your unrolled loop involves 8 separate chains of dependent loads and stores. The JVM can't turn one into the other if you're using floating-point arithmetic because they're fundamentally different computations. Breaking dependent load-store chains can lead to major speedups on modern processors.
Your rolled loop iterates over the whole vector. Your unrolled loop only iterates over the first (roughly) eighth. Thus, the unrolled loop again computes something fundamentally different.
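For contrast, a correctly unrolled dot product with independent accumulators might look like this sketch of mine (note that reordering float additions can change the result slightly, which is exactly why the JVM won't do it for you):

static float dot(float[] x, float[] y, int n) {
    float s0 = 0f, s1 = 0f, s2 = 0f, s3 = 0f;
    int i = 0;
    for (; i + 4 <= n; i += 4) {        // four independent dependency chains
        s0 += x[i]     * y[i];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
    }
    float sum = s0 + s1 + s2 + s3;
    for (; i < n; i++) {                // handle the remaining elements
        sum += x[i] * y[i];
    }
    return sum;
}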
I haven't seen a JVM generate vectorised code for something like your second loop, but I'm maybe a few years out of date on what JVMs do. Try using -XX:+PrintAssembly when you run your code and inspect the code opto generates.
I have done a little research on this (and am drawing on knowledge from a similar project I did in C with matrix multiplication), but take my answer with a grain of salt, as I am by no means an expert on this topic.
As for your first question, I think the speedup comes from your loop unrolling; you're making roughly 87% fewer condition checks in terms of the for loop. As far as I know, the JVM has supported SSE since 1.4, but to actually control whether your code uses vectorization (and to know for sure), you'll need to use JNI.
See an example of JNI here: Do any JVM's JIT compilers generate code that uses vectorized floating point instructions?
When you decrease the size of your vector to 64 from 262144, cache is definitely a factor. When I did this project in C, we had to implement cache blocking for larger matrices in order to take advantage of the cache. One thing you might want to do is check your cache size.
Just as a side note: it might be better to measure performance in FLOPS rather than seconds, because the runtime (in seconds) of your program can vary based on many different factors, such as CPU usage at the time.
In Ulrich Drepper's paper What Every Programmer Should Know About Memory, part 3 (CPU Caches), he shows a graph of the relationship between "working set" size and the CPU cycles consumed per operation (in this case, sequential reading). There are two jumps in the graph which indicate the sizes of the L1 and L2 caches. I wrote my own program to reproduce the effect in C. It simply reads an int[] array sequentially from head to tail, and I've tried different array sizes (from 1 KB to 1 MB). When I plot the data, the graph is a straight line; there are no jumps.
My questions are:
Is there something wrong with my method? What is the right way to produce the CPU cache effect (to see the jumps)?
I was thinking, if it is sequential read, then it should operate like this:
When the first element is read, it's a cache miss; then, within the cache line (64 bytes), subsequent reads are hits. With the help of prefetching, the latency of reading the next cache line is hidden. Reading continues contiguously into the L1 cache; even when the working set size is over the L1 cache size, the least recently used lines are evicted and prefetching continues. So most of the cache misses are hidden, and the time consumed fetching data from L2 is hidden behind the reading activity, meaning they operate at the same time. The associativity (8-way in my case) hides the latency of reading data from L2. So, if that's right, the behavior of my program should be expected; am I missing something?
Is it possible to get the same effect in java?
By the way, I am doing this in linux.
Edit 1
Thanks to Stephen C for the suggestion; here is some additional information:
This is my code:
int *arrayInt;
void initInt(long len) {
int i;
arrayInt = (int *)malloc(len * sizeof(int));
memset(arrayInt, 0, len * sizeof(int));
}
long sreadInt(long len) {
int sum = 0;
struct timespec tsStart, tsEnd;
initInt(len);
clock_gettime(CLOCK_REALTIME, &tsStart);
for(i = 0; i < len; i++) {
sum += arrayInt[i];
}
clock_gettime(CLOCK_REALTIME, &tsEnd);
free(arrayInt);
return (tsEnd.tv_nsec - tsStart.tv_nsec) / len;
}
In the main() function, I've tried array sizes from 1 KB to 100 MB; it's still the same, the average time consumed per element is 2 nanoseconds. I think this time is the access time of L1d.
My cache size:
L1d == 32k
L2 == 256k
L3 == 6144k
EDIT 2
I've changed my code to use a linked list.
// element type
struct l {
struct l *n;
long int pad[NPAD]; // the NPAD could be changed, in my case I set it to 1
};
struct l *array;
long globalSum;
// for init the array
void init(long len) {
long i, j;
struct l *ptr;
array = (struct l*)malloc(sizeof(struct l));
ptr = array;
for(j = 0; j < NPAD; j++) {
ptr->pad[j] = j;
}
ptr->n = NULL;
for(i = 1; i < len; i++) {
ptr->n = (struct l*)malloc(sizeof(struct l));
ptr = ptr->n;
for(j = 0; j < NPAD; j++) {
ptr->pad[j] = i + j;
}
ptr->n = NULL;
}
}
// for free the array when operation is done
void release() {
struct l *ptr = array;
struct l *tmp = NULL;
while(ptr) {
tmp = ptr;
ptr = ptr->n;
free(tmp);
}
}
double sread(long len) {
int i;
long sum = 0;
struct l *ptr;
struct timespec tsStart, tsEnd;
init(len);
ptr = array;
clock_gettime(CLOCK_REALTIME, &tsStart);
while(ptr) {
for(i = 0; i < NPAD; i++) {
sum += ptr->pad[i];
}
ptr = ptr->n;
}
clock_gettime(CLOCK_REALTIME, &tsEnd);
release();
globalSum += sum;
return (double)(tsEnd.tv_nsec - tsStart.tv_nsec) / (double)len;
}
Finally, I printf out globalSum in order to avoid compiler optimization. As you can see, it is still a sequential read. I've even tried array sizes up to 500 MB, and the average time per element is approximately 4 nanoseconds (maybe because it has to access both the data 'pad' and the pointer 'n', two accesses), the same as for a 1 KB array. So I think it is because cache optimizations like prefetching hide the latency very well; am I right? I will try random access and post the result later.
EDIT 3
I've tried random access to the linked list; here is the result: the first red line is my L1 cache size, the second is L2. So we can see a little jump there, and sometimes the latency is still hidden well.
This answer isn't an answer, but more of a set of notes.
First, the CPU tends to operate on cache lines, not on individual bytes/words/dwords. This means that if you sequentially read/write an array of integers, then the first access to a cache line may cause a cache miss, but subsequent accesses to different integers in that same cache line won't. For 64-byte cache lines and 4-byte integers, this means you'd only get a cache miss once for every 16 accesses, which dilutes the results.
Second, the CPU has a "hardware pre-fetcher." If it detects that cache lines are being read sequentially, the hardware pre-fetcher will automatically pre-fetch cache lines it predicts will be needed next (in an attempt to fetch them into cache before they're needed).
Third, the CPU does other things (like "out of order execution") to hide fetch costs. The time difference (between cache hit and cache miss) that you can measure is the time that the CPU couldn't hide and not the total cost of the fetch.
These 3 things combined mean that, for sequentially reading an array of integers, it's likely that the CPU pre-fetches the next cache line while you're doing 16 reads from the previous cache line, and any cache miss costs won't be noticeable and may be entirely hidden. To prevent this, you'd want to "randomly" access each cache line once, to maximise the performance difference measured between "working set fits in cache/s" and "working set doesn't fit in cache/s".
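One way to do that, sketched here in Java since the question also asks about Java (the 64-byte line size is an assumption about the hardware):

static long touchLinesRandomly(int[] data) {
    final int intsPerLine = 16;               // assumes 64-byte lines / 4-byte ints
    int lines = data.length / intsPerLine;
    int[] order = new int[lines];
    for (int i = 0; i < lines; i++) {
        order[i] = i;
    }
    java.util.Random rnd = new java.util.Random(42);
    for (int i = lines - 1; i > 0; i--) {     // Fisher-Yates shuffle of the line order
        int j = rnd.nextInt(i + 1);
        int tmp = order[i]; order[i] = order[j]; order[j] = tmp;
    }
    long sum = 0;
    for (int line : order) {
        sum += data[line * intsPerLine];      // touch each cache line exactly once
    }
    return sum;                               // return the sum so the reads aren't eliminated
}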
Finally, there are other factors that may influence measurements. For example, on an OS that uses paging (e.g. Linux and almost all other modern OSs) there's a whole layer of caching above all this (TLBs, Translation Look-aside Buffers), and TLB misses will start occurring once the working set gets beyond a certain size; this should be visible as a fourth "step" in the graph. There's also interference from the kernel (IRQs, page faults, task switches, multiple CPUs, etc.), which might be visible as random static/error in the graph (unless tests are repeated often and outliers discarded). There are also artifacts of the cache design (cache associativity) that can reduce the effectiveness of the cache in ways that depend on the physical addresses allocated by the kernel, which might be seen as the "steps" in the graph shifting to different places.
Is there something wrong with my method?
Possibly, but without seeing your actual code that cannot be answered.
Your description of what your code is doing does not say whether you are reading the array once or many times.
The array may not be big enough ... depending on your hardware. (Don't some modern chips have a 3rd level cache of a few megabytes?)
In the Java case in particular you have to do lots of things the right way to implement a meaningful micro-benchmark.
In the C case:
You might try adjusting the C compiler's optimization switches.
Since your code is accessing the array serially, the compiler might be able to order the instructions so that the CPU can keep up, or the CPU might be optimistically prefetching or doing wide fetches. You could try reading the array elements in a less predictable order.
It is even possible that the compiler has entirely optimized the loop away because the result of the loop calculation is not used for anything.
(According to this Q&A - How much time does it take to fetch one word from memory?, a fetch from L2 cache is ~7 nanoseconds and a fetch from main memory is ~100 nanoseconds. But you are getting ~2 nanoseconds. Something clever has to be going on here to make it run as fast as you are observing.)
With gcc 4.7, compiling with gcc -std=c99 -O2 -S -D_GNU_SOURCE -fverbose-asm tcache.c, you can see that the compiler optimizes enough to remove the for loop (because sum is not used).
I had to fix your source code: some #include-s are missing, and i is not declared in the second function, so your example doesn't even compile as it is.
Make sum a global variable, or pass it somehow to the caller (perhaps with a global int globalsum; and putting globalsum = sum; after the loop).
And I am not sure you are right to clear the array with memset. I could imagine a clever-enough compiler understanding that you are summing all zeros.
Finally, your code has extremely regular behavior with good locality: once in a while a cache miss happens, the entire cache line is loaded, and the data is good for many iterations. Some clever optimizations (e.g. with -O3 or better) might generate the right prefetch instructions. This is close to optimal for caches: with 64-byte lines holding 16 ints, a cache miss happens only every 16 iterations and so is well amortized.
Making the data a linked list will make cache behavior worse. Conversely, in some real programs, carefully adding a __builtin_prefetch at a few well-chosen places may improve performance by more than 10% (but adding too many of them will decrease performance).
In real life, the processor spends the majority of its time waiting for some cache (and that is difficult to measure; the waiting is CPU time, not idle time). Remember that during an L3 cache miss, the time needed to load data from your RAM module is enough time to execute hundreds of machine instructions!
I can't say for certain about 1 and 2, but it would be more challenging to run such a test successfully in Java. In particular, I'd be concerned that managed-language features like automatic garbage collection might kick in in the middle of your testing and throw off your results.
As you can see from graph 3.26, the Intel Core 2 shows hardly any jumps while reading (the red line at the top of the graph). It is in writing/copying where the jumps are clearly visible. Better to do a write test.
OK, I miscalculated some things in my microbenchmarking. Please don't read on unless you have spare time.
Instead of
double[] my_array=new array[1000000];double blabla=0;
for(int i=0;i<1000000;i++)
{
my_array[i]=Math.sqrt(i);//init
}
for(int i=0;i<1000000;i++)
{
blabla+=my_array[i];//array access time is 3.7ms per 1M operation
}
I used
public final static class my_class
{
public static double element=0;
my_class(double elementz)
{
element=elementz;
}
}
my_class[] class_z=new my_class[1000000];
for(int i=0;i<1000000;i++)
{
class_z[i]=new my_class(Math.sqrt(i)); //instantiating array elements for later use(random-access)
}
double blabla=0;
for(int i=0;i<1000000;i++)
{
blabla+=class_z[i].element; // array access time 2.7 ms per 1M operations.
}
}
The looping overhead is nearly 0.5 ms per 1M loop iterations (I used this as an offset).
The element access time for the array of classes is 25% lower than for the primitive array.
Question: do you know any other way to lower random-access time even further?
Intel 2 GHz single core, Java in Eclipse.
Looking at your code again, I can see that in the first loop you are adding 1M different elements. In the second example, you are adding the same static element 1M times.
A common problem with micro-benchmarks is the order you perform the tests impacts the results.
For example, if you have two loops, the first loop is initially not compiled to native code. However, after some time the whole method will be compiled, and the loop will run faster.
Then you run the second loop and find it is either:
much faster, because it is optimised from the start (for simple loops), or
much slower, because it is optimised without any runtime metrics (for complex loops).
You need to place each loop in a separate method and run the test alternately a number of times to get reproducible results.
In your first case, the loop is not optimised until after it has been running for a while. In the second case, your loop is likely to have already been compiled when it starts.
The difference is easily explained:
The primitive array has a memory footprint of 1M * 8 bytes = 8 MB.
The class array has a memory footprint of 1M * 4 bytes = 4 MB, with every element pointing to the same instance (assuming a 32-bit VM or compressed refs on a 64-bit VM).
Put different objects into your class array and you will see the primitive array perform better. You are comparing apples to oranges at the moment; the sketch below spells out why.
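To make the pitfall explicit (my paraphrase of the code above): because element is static, every instance writes and reads the same single field:

final class MyClass {
    static double element;             // one field shared by the whole class, not per instance
    MyClass(double e) { element = e; }
}
// After filling the array, MyClass.element holds only the last value written,
// so the "class array" loop reads one memory location a million times,
// while the primitive-array loop reads a million distinct locations.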
There are several problems with your benchmarks and your assessment above. First, your code doesn't compile as shown. Second, your benchmark times (i.e., a few milliseconds) are far too short to be of any statistical worth on today's high-speed processors. Third, you're comparing apples to oranges (as mentioned above): you're timing two completely different use cases, a single static field and a million variables.
I fixed your code and ran it several times on an i7-2620m for 10,000 x 1,000,000 repetitions. All results were within +/- 1%, which is good enough for this discussion. Then, I took the fastest of all of those runs in order to compare their performance.
Above, you claimed that the second use case was "25% lower" than the first. That is wildly inaccurate.
In order to do a "static" versus "variable" performance comparison, I changed the first benchmark to add the 999,999th square-root just like the second one is doing. The difference was only about 4.63% in favor of the second use case.
In order to do an array access performance comparison, I changed the second use case to a "non-static" variable. The difference was about 68.2% in favor of the first use case (primitive array access), meaning that the first way was much faster than the second.
(Feel free to ask me more about micro-benchmarking since I've been doing performance measurement and assessment for over 25 years.)