Can Java use unsigned comparison for general in-range tests? - java

I've seen that JITC uses unsigned comparison for checking array bounds (the test 0 <= x < LIMIT is equivalent to x ≺ LIMIT where ≺ treats the numbers as unsigned quantities). So I was curious if it works for arbitrary comparisons of the form 0 <= x < LIMIT as well.
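For readers unfamiliar with the trick, here is a minimal sketch of the equivalence, assuming LIMIT is non-negative. The class and method names are made up for illustration. On Java 8 and later, Integer.compareUnsigned(x, limit) < 0 expresses the same test; the sign-bit flip below also works on Java 7.

```java
final class UnsignedRangeCheck {
    // Signed form: two comparisons.
    static boolean inRangeSigned(int x, int limit) {
        return 0 <= x && x < limit;
    }

    // Unsigned form: one comparison. Only valid when limit >= 0.
    // Flipping the sign bit maps unsigned order onto signed order,
    // so negative x values become "very large" and fail the test.
    static boolean inRangeUnsigned(int x, int limit) {
        return (x ^ Integer.MIN_VALUE) < (limit ^ Integer.MIN_VALUE);
    }

    public static void main(String[] args) {
        int limit = 100;
        int[] samples = {-1, 0, 99, 100, Integer.MIN_VALUE, Integer.MAX_VALUE};
        for (int x : samples) {
            System.out.println(x + " -> signed: " + inRangeSigned(x, limit)
                    + ", unsigned: " + inRangeUnsigned(x, limit));
        }
    }
}
```

Whether the JIT actually performs this rewrite for a hand-written range test is exactly what the benchmark is trying to find out; the sketch only shows that the two forms agree.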
The results of my benchmark are pretty confusing. I've created three experiments of the form
for (int i=0; i<LENGTH; ++i) {
int x = data[i];
if (condition) result += x;
}
with different conditions
0 <= x called above
x < LIMIT called below
0 <= x && x < LIMIT called inRange
0 <= x & x < LIMIT called inRange2
and prepared the data so that the probabilities of the condition being true are the same.
I expected the results to be fairly similar, except that above might be slightly faster as it compares against zero. Even if the JITC couldn't use the unsigned comparison for the test, the results for above and below should still be similar.
Can anyone explain what's going on here? It's quite possible that I did something wrong...
Update
I'm using Java build 1.7.0_51-b13 on Ubuntu 2.6.32-54-generic with i5-2400 CPU @ 3.10GHz, in case anybody cares. As the results for inRange and inRange2 near 0.00 are especially confusing, I re-ran the benchmark with more steps in this area.

The variation in the benchmark results likely has to do with CPU caching at different levels.
Since primitive int(s) are being used, there is no JVM specific caching going on, as will happen with auto-boxed Integer to primitive values.
Thus all that remains, given minimal memory consumption of the data[] array, is CPU-caching of low level values/operations. Since as described the distributions of values are based on random values with statistical 'probabilities' of the conditions being true across the tests, the likely cause is that, depending on the values, more or less (random) caching is going on for each test, leading to more randomness.
Further, depending on the isolation of the computer (background services/processes), the test cases may not be running in complete isolation. Ensure that everything is shut down for these tests except the core OS functions and the JVM. Set the JVM memory min/max to the same value, and shut down any networking processes, updates, etc.

Are your test results the average of a number of runs, or did you only test each function once?
One thing I have found is that the first time you run a for loop the JVM will interpret it, and each time it's run again the JVM will optimize it further. Therefore the first few runs may get horrible performance, but after a few runs it will be near native performance.
I also figured out that a loop will not be optimized while it's running. I have not tested if this applies to just the loop or the whole function. If it only applies to the loop, you may get much more performance if you nest it in an inner and outer loop, and work with your data one block at a time. If it's the whole function, you will have to place the inner loop in its own function.
Also run the test more than once; if you compare the code you will notice how the JIT optimizes the code in stages.
For most code this gives Java optimal performance. It allows it to skip costly optimization on code that runs rarely and makes code that runs often a lot faster. However, if you have a code block that runs once but for a long time, it will become horribly slow.

Related

Oracle JDK8: unrolled loop converted to NOOP by JVM?

I've got a little program that is a fairly pointless exercise in simple number crunching that has thrown me for a loop.
The program spawns a bunch of worker threads that do simple mathematical operations. Recently I changed the inner loop of one variant of worker from:
do
{
int3 = int1 + int2;
int3 = int1 * int2;
int1++;
int2++;
i++;
}
while (i < 128);
to something akin to:
int3 = tempint4[0] + tempint5[0];
int3 = tempint4[0] * tempint5[0];
int3 = tempint4[1] + tempint5[1];
int3 = tempint4[1] * tempint5[1];
int3 = tempint4[2] + tempint5[2];
int3 = tempint4[2] * tempint5[2];
int3 = tempint4[3] + tempint5[3];
int3 = tempint4[3] * tempint5[3];
...
int3 = tempint4[127] + tempint5[127];
int3 = tempint4[127] * tempint5[127];
The arrays are populated by random integers no higher than 1025 in value, and the array values do not change.
The end result was that the program ran much faster, though closer examination seems to indicate that the CPU isn't actually doing anything when running the newer version of the code. It seems that the JVM has figured out that it can safely ignore the code that replaced the inner loop after one iteration of the outer loop since it is only redoing the same calculations on the same set of data over and over again.
To illustrate my point, the old code took maybe ~27000 ms to run and noticeably increased the operating temperature of the CPU (it also showed 100% utilization for all cores). The new code takes maybe 5 ms to run (sometimes less) and causes nary a spike in CPU utilization or temperature. Increasing the number of outer loop iterations does nothing to change the behavior of the new code, even when the number of iterations increases by a hundred times or more.
I have another version of the worker that is identical to the one above except that it has a division operation along with the addition and multiplication operations. In its new unrolled form, the division-enabled version is also much faster than its previous form, but it actually takes a little while (~300 ms on the first run and ~200 ms on subsequent runs, despite warmup, which is a little odd) and produces a profound spike in CPU temperature for its brief run. Increasing the number of outer loop iterations seems to cause the temperature phenomenon to mostly cease after a certain amount of time has passed while running the program, though utilization still shows 100% for all cores. My guess is the JVM is taking much longer to figure out which operations it can safely ignore when handling division operations, and that it is not ignoring all of them.
Short of adding division operations to all my code (which isn't really a fix anyway beyond a certain number of outer loop iterations), is there any way I can get the JVM to stop reducing my code to apparent NOOPs? I've tried several solutions to the problem, such as generating new random values per iteration of the outer loop, going back to simple integer variables with incrementation, and some other nonsense, but none of those solutions have produced desirable results. Either it continues to ignore the series of instructions, or the performance hit from modifications is bad enough that my division-heavy variant actually performs better than the code without division operations.
edit: to provide some context:
i: this variable is an integer that is used for a loop counter in a do/while loop. It is defined in the class file containing the worker code. Its initial value is 0. It is no longer used in the newer version of the worker.
int1/int2: These are integers defined in the class file containing the worker code. Their initial values are both 0. They were used in the old version of the code to provide changing values for each iteration of the internal loop. All I had to do was increment them upward by one per loop iteration, and the JVM would be forced to carry out every operation faithfully. Unfortunately, this loop apparently prevented the use of SIMD. Each time the outer loop iterated, int1 and int2 had their values reset to prevent overflow of int1, int2, or int3 (I have discovered that integer overflow can slow down the code unnecessarily, as can allowing a float to reach Infinity).
tempint4/tempint5: These are references to a pair of integer arrays defined in the main class file for the program (Mathtester. Yes, unimaginative, I know). When the program first starts, there is a short do/while loop that fills each array with random integers ranging from 1-1025. The arrays are 128 integers in size. Each array is static, though the reference variables are not. In truth there is no particular reason for me to use the reference variables. They are leftovers from when I was trying to do an array reference swap so that, after each iteration of the outer loop, tempint4 and tempint5 would be referred to the opposite array. It was my hope that the JVM would stop ignoring my code block. For the division-enabled version of the code, this seems to have worked (sort of), since it fundamentally changes the values to be calculated. Swapping tempint4 for tempint5 and vice versa does not change the results of the addition and multiplication operations, so the JVM can still ignore those.
edit: Making tempint4 and tempint5 (since they are only reference variables, I am actually referring to the main arrays, Mathtester.int4 and Mathtester.int5) volatile worked without notably reducing the amount of CPU activity or level of CPU temperature. It did slow down the code a bit, but that is a probable indicator that the JVM was NOOPing more than I knew.
Is there any way I can get the JVM to stop reducing my code to apparent NOOPs?
Yes, by making int3 volatile.
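A minimal sketch of what that looks like, assuming the worker boils down to repeated stores into int3. The class name VolatileSink and the method work are made up for illustration. The idea is that a store to a volatile field is visible to other threads, so in practice HotSpot keeps every store and therefore has to compute the values being stored.

```java
final class VolatileSink {
    // Stores to a volatile field are kept by HotSpot in practice,
    // so the arithmetic that feeds them cannot be thrown away.
    static volatile int int3;

    // Hypothetical stand-in for the worker's inner block.
    static void work(int[] a, int[] b) {
        for (int i = 0; i < a.length; i++) {
            int3 = a[i] + b[i];
            int3 = a[i] * b[i];
        }
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3};
        int[] b = {4, 5, 6};
        work(a, b);
        System.out.println(int3); // the last store was 3 * 6
    }
}
```

Note that volatile stores have a cost of their own, so the timings now include that overhead on top of the arithmetic.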
One of the first things when dealing with Java performance that you have to learn by heart is this:
"A single line of Java code means nothing at all in isolation".
Modern JVMs are very complex beasts, and do all kinds of optimization. If you try to measure some small piece of code, the chances are that you will not be measuring what you think you are - it is really complicated to do it correctly without very, very detailed knowledge of what the JVM is doing.
In this case, yes, it's entirely likely that the JVM is optimizing away the loop. There's no simple way to prevent it from doing this, and almost all techniques are fragile and JVM-version specific (because new & cleverer optimizations are developed & added to the JVM all the time).
So, let me turn the question around: "What are you really trying to achieve here? Why do you want to prevent the JVM from optimizing?"

Which code runs faster?

I have two pieces of code, and I want to know which one runs faster and why. I don't know much about the JVM and CPUs yet, but I'm working hard on learning them. Every tip will help.
int[] a=new int[1000];
int[] b=new int[10000000];
long start = System.currentTimeMillis();
//method 1
for(int i=0;i<1000;i++){
for(int j=0;j<10000000;j++){
a[i]++;
}
}
long end = System.currentTimeMillis();
System.out.println(end-start);
start=System.currentTimeMillis();
//method 2
for(int i=0 ;i<10000000;i++){
for(int j=0;j<1000;j++){
b[i]++;
}
}
end = System.currentTimeMillis();
System.out.println(end-start);
I'll throw my answer in there: in theory they will be exactly the same, but in practice there will be a small, negligible difference. Too small to really matter, actually.
The basic idea is how array b is stored in memory. Because it is a lot larger, depending on your platform/implementation it might be stored in chunks, aka non-contiguously. That is likely since an array of 10 million ints is 40 million bytes = 40 MB!
EDIT: I get 572 and 593, respectively.
Complexity
In terms of asymptotic complexity (e.g. big-O notation), they have the same running time.
Data localization
Ignoring any optimization for the moment...
b is larger and is thus more likely to be split across multiple (or more) pages. Because of this, the first is likely faster.
The difference here is likely to be rather small, unless not all of these pages fit into RAM and need to be written to disk (which is unlikely here since b is only 10000000*4 = 40000000 bytes = 38 MB).
Optimization
The first method involves "execute a[i]++ 10000000 times" (for a fixed i), which can theoretically rather easily be converted to a[i] += 10000000 by the optimizer.
A similar optimization can occur for b, but only to b[i] += 1000, which still has to run 10000000 times.
The optimizer is free to do this or not do this. As far as I know, the Java language specification doesn't say much about what should and shouldn't be optimized, as long as it doesn't change the end result.
As an extreme result, the optimizer could, in theory, see that you're not doing anything with a or b after the loops and thus get rid of both loops.
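To make the point about the collapsed loop concrete, here is a small sketch with made-up names (LoopCollapse, nested, collapsed) and smaller sizes than in the question so it finishes quickly. It only shows that the two forms produce the same array; it does not claim that any particular JVM performs this rewrite.

```java
import java.util.Arrays;

final class LoopCollapse {
    // The literal nested loop from the question, with parameterised sizes.
    static int[] nested(int outer, int inner) {
        int[] a = new int[outer];
        for (int i = 0; i < outer; i++) {
            for (int j = 0; j < inner; j++) {
                a[i]++;
            }
        }
        return a;
    }

    // What an optimizer would be allowed to turn it into:
    // one addition per element instead of 'inner' increments.
    static int[] collapsed(int outer, int inner) {
        int[] a = new int[outer];
        for (int i = 0; i < outer; i++) {
            a[i] += inner;
        }
        return a;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.equals(nested(1000, 10000), collapsed(1000, 10000)));
    }
}
```

With the question's sizes, the collapsed form needs 1000 additions for a but 10000000 for b, which is the asymmetry described above.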
The first loop runs faster on my system (median: 333 ms vs. 596 ms)
(Edit: I made a wrong assumption on number of array accesses in my first response, see comments)
Subsequent incremental (index++) accesses to the same array seem to be faster than random accesses or decremental (index--) accesses. I assume the Java Hotspot compiler can optimize the array bound checks if it recognizes that the array will be incrementally traversed.
When reversing the loops, it actually runs slower:
//incremental array index
for (int i = 0; i < 1000; i++) {
for (int j = 0; j < 10000000; j++) {
a[i]++;
}
}
//decremental array index
for (int i = 1000 - 1; i >= 0; i--) {
for (int j = 10000000 - 1; j >= 0; j--) {
a[i]++;
}
}
Incremental: 349ms, decremental: 485ms.
Without bounds checks, decremental loops usually are faster, especially on old processors (comparing to zero).
If my assumption is right, this makes 1000 optimized bounds checks versus 10000000 checks, so the first method is faster.
By the way, when benchmarking:
Do multiple rounds and compare the averages/medians instead of the first sample.
In Java: give your benchmark a warmup phase (execute it a few times before measuring). On the first run, classes have to be loaded, and code might be interpreted before the HotSpot JIT kicks in and compiles it to native code.
Measure time deltas with System.nanoTime(). This gives more accurate timestamps. System.currentTimeMillis() is not that precise (depends on the VM), and usually 'hops' in chunks of a dozen or more milliseconds, making your result times more volatile than they actually are. Btw: 1 millisecond = 1'000'000 nanoseconds.
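The three tips above can be put together in a few lines. This is only a rough sketch: MiniBench, median, medianNanos and sink are names invented here, and the warmup and round counts are arbitrary. For serious measurements a dedicated harness such as JMH is the usual recommendation.

```java
import java.util.Arrays;

final class MiniBench {
    // Written to after each run so the measured loop is not dead code.
    static volatile long sink;

    // Median of the samples; less sensitive to outliers than the mean.
    static long median(long[] samples) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return (sorted.length % 2 == 1)
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2;
    }

    // Runs the task 'warmup' times unmeasured, then 'rounds' times measured,
    // and returns the median elapsed time in nanoseconds.
    static long medianNanos(Runnable task, int warmup, int rounds) {
        for (int i = 0; i < warmup; i++) {
            task.run();
        }
        long[] samples = new long[rounds];
        for (int i = 0; i < rounds; i++) {
            long start = System.nanoTime();
            task.run();
            samples[i] = System.nanoTime() - start;
        }
        return median(samples);
    }

    public static void main(String[] args) {
        final int[] a = new int[1000];
        Runnable task = () -> {
            long sum = 0;
            for (int i = 0; i < a.length; i++) {
                a[i]++;
                sum += a[i];
            }
            sink = sum; // publish the result
        };
        System.out.println("median ns: " + medianNanos(task, 20, 11));
    }
}
```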
My guess would be that they are both pretty much the same. One of them has a smaller array to handle, but that wouldn't make much difference except for the initial allocation of memory, which is outside your measuring anyway.
The time to execute each iteration should be the same (writing a value into an array). Incrementing larger numbers shouldn't take the JVM longer than incrementing smaller numbers, nor should addressing a smaller or larger array index.
But why the question if you already know how to measure yourself?
Check out big-O notation.
A nested for loop is O(n^2), so they will run the same in theory.
The numbers 1000 and 10000000 are constants k: O(n^2 + k).
They won't be exactly identical in practice because of various other things at play, but it will be close.
Time should be equal; the result will obviously be different, since a will contain 1000 entries with value 10000000 and b will contain 10000000 entries with value 1000. I don't really get your question. What's the result for end-start?
It might be that the JVM will optimize the for loops. If it understands what the end results in the array will be, then the smaller array will be much easier to calculate, since it requires only 1000 assignments, while the other one needs 10,000 times more.
The first will be faster, because the a[i] cell and i are initialized far fewer times.
Modern architectures are complex, so answering this kind of question is never easy.
The run time could be same or the first could be faster.
Things to consider in this case is mostly memory access and optimization.
A good optimizer will realize that the values are never read, so the loops can be skipped entirely, which gives a run time of 0 in both cases. Such optimization can take place at compile time or at run-time.
If it isn't optimized away, then there is memory access to consider. a[] is much smaller than b[], so it will more readily fit in faster cache memory resulting in fewer cache misses.
Another thing to consider is memory interleaving.

Java can recognize SIMD advantages of CPU; or there is just optimization effect of loop unrolling

This part of code is from dotproduct method of a vector class of mine. The method does inner product computing for a target array of vectors(1000 vectors).
When vector length is an odd number(262145), compute time is 4.37 seconds. When vector length(N) is 262144(multiple of 8), compute time is 1.93 seconds.
time1=System.nanoTime();
int count=0;
for(int j=0;j<1000;i++)
{
b=vektors[i]; // selects next vector(b) to multiply as inner product.
// each vector has an array of float elements.
if(((N/2)*2)!=N)
{
for(int i=0;i<N;i++)
{
t1+=elements[i]*b.elements[i];
}
}
else if(((N/8)*8)==N)
{
float []vek=new float[8];
for(int i=0;i<(N/8);i++)
{
vek[0]=elements[i]*b.elements[i];
vek[1]=elements[i+1]*b.elements[i+1];
vek[2]=elements[i+2]*b.elements[i+2];
vek[3]=elements[i+3]*b.elements[i+3];
vek[4]=elements[i+4]*b.elements[i+4];
vek[5]=elements[i+5]*b.elements[i+5];
vek[6]=elements[i+6]*b.elements[i+6];
vek[7]=elements[i+7]*b.elements[i+7];
t1+=vek[0]+vek[1]+vek[2]+vek[3]+vek[4]+vek[5]+vek[6]+vek[7];
//t1 is total sum of all dot products.
}
}
}
time2=System.nanoTime();
time3=(time2-time1)/1000000000.0; //seconds
Question: Could the reduction of time from 4.37s to 1.93s (2x as fast) be JIT's wise decision of using SIMD instructions or just my loop-unrolling's positive effect?
If the JIT cannot do SIMD optimization automatically, then in this example there is also no unrolling optimization done automatically by the JIT. Is this true?
For 1M iterations(vectors) and for vector size of 64, speedup multiplier goes to 3.5X(cache advantage?).
Thanks.
Your code has a bunch of problems. Are you sure you're measuring what you think you're measuring?
Your first loop does this, indented more conventionally:
for(int j=0;j<1000;i++) {
b=vektors[i]; // selects next vector(b) to multiply as inner product.
// each vector has an array of float elements.
}
Your rolled loop involves a really long chain of dependent loads and stores. Your unrolled loop involves 8 separate chains of dependent loads and stores. The JVM can't turn one into the other if you're using floating-point arithmetic because they're fundamentally different computations. Breaking dependent load-store chains can lead to major speedups on modern processors.
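To illustrate the difference between one dependency chain and several, here is a sketch with invented names (DotProduct, dotSingle, dotFour). It uses four accumulators rather than eight to keep it short, steps the index by 4 so the whole vector is covered, and assumes the length is a multiple of 4.

```java
final class DotProduct {
    // One accumulator: every addition has to wait for the previous one.
    static float dotSingle(float[] x, float[] y) {
        float sum = 0f;
        for (int i = 0; i < x.length; i++) {
            sum += x[i] * y[i];
        }
        return sum;
    }

    // Four independent accumulators: the four chains can overlap in the CPU.
    // Assumes x.length is a multiple of 4.
    static float dotFour(float[] x, float[] y) {
        float s0 = 0f, s1 = 0f, s2 = 0f, s3 = 0f;
        for (int i = 0; i < x.length; i += 4) {
            s0 += x[i] * y[i];
            s1 += x[i + 1] * y[i + 1];
            s2 += x[i + 2] * y[i + 2];
            s3 += x[i + 3] * y[i + 3];
        }
        return (s0 + s1) + (s2 + s3);
    }

    public static void main(String[] args) {
        float[] x = {1, 2, 3, 4, 5, 6, 7, 8};
        float[] y = {1, 1, 1, 1, 1, 1, 1, 1};
        System.out.println(dotSingle(x, y) + " " + dotFour(x, y));
    }
}
```

For small whole numbers like these both methods agree exactly. For general float data the different summation order can round differently, which is why the JVM is not allowed to make this transformation on its own.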
Your rolled loop iterates over the whole vector. Your unrolled loop only iterates over the first (roughly) eighth. Thus, the unrolled loop again computes something fundamentally different.
I haven't seen a JVM generate vectorised code for something like your second loop, but I'm maybe a few years out of date on what JVMs do. Try running your code with -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly (this needs the hsdis disassembler plugin installed) and inspect the code opto generates.
I have done a little research on this (and am drawing from knowledge from a similar project I did in C with matrix multiplication), but take my answer with a grain of salt as I am by no means an expert on this topic.
As for your first question, I think the speedup is coming from your loop unrolling; you're making roughly 87% fewer condition checks in terms of the for loop. As far as I know, JVM supports SSE since 1.4, but to actually control whether your code is using vectorization (and to know for sure), you'll need to use JNI.
See an example of JNI here: Do any JVM's JIT compilers generate code that uses vectorized floating point instructions?
When you decrease the size of your vector to 64 from 262144, cache is definitely a factor. When I did this project in C, we had to implement cache blocking for larger matrices in order to take advantage of the cache. One thing you might want to do is check your cache size.
Just as a side note: It might be a better idea to measure performance in flops rather than seconds, just because the runtime (in seconds) of your program can vary based on many different factors, such as CPU usage at the time.

How do comparison operators work in java?

Recently someone told me that a comparison involving smaller integers will be faster, e.g.,
// Case 1
int i = 0;
int j = 1000;
if(i<j) { /* do something */ }
// Case 2
int j = 6000000;
int i = 500000;
if(i<j) { /* do something else */ }
so, comparison (if condition) in Case 1 should be faster than that in Case 2. In Case 2 the integers will take more space to store but that can affect the comparison, I am not sure.
Edit 1: I was thinking of binary representation of i & j, e.g., for i=0, it will be 0 and for j=1000 its 1111101000 (in 32-bit presentation it should be: 22 zeros followed by 1111101000, completely forgot about 32-bit or 64-bit representation, my bad!)
I tried to look at JVM Specification and bytecode of a simple comparison program, nothing made much sense to me. Now the question is how does comparison work for numbers in java? I guess that will also answer why (or why not) any of the cases will be faster.
Edit 2: I am just looking for a detailed explanation, I am not really worried about micro optimizations
If you care about performance, you really only need to consider what happens when the code is compiled to native code. In this case, it is the behaviour of the CPU which matters. For simple operations almost all CPUs work the same way.
so, comparison (if condition) in Case 1 should be faster than that in Case 2. In Case 2 the integers will take more space to store but that can affect the comparison, I am not sure.
An int is always 32-bit in Java. This means it always takes the same space. In any case, the size of the data type is not very important e.g. short and byte are often slower because the native word size is 32-bit and/or 64-bit and it has to extract the right data from a larger type and sign extend it.
I tried to look at JVM Specification and bytecode of a simple comparison program, nothing made much sense to me.
The JLS specifies behaviour, not performance characteristics.
Now the question is how does comparison work for numbers in java?
It works the same way it does in just about every other compiled language. It uses a single machine code instruction to compare the two values and another to perform a condition branch as required.
I guess that will also answer why (or why not) any of the cases will be faster.
Most modern CPUs use branch prediction. To keep the CPU's pipeline full, it attempts to predict which branch will be taken (before it knows it is the right one to take) so there is no break in the instructions executed. When this works well the branch has almost no cost. When it mis-predicts, the pipeline can be filled with instructions for a branch which was the wrong guess, and this can cause a significant delay as it clears the pipeline and takes the right branch. In the worst case it can mean 100s of clock cycles of delay.
Consider the following.
int i = 0; // happens to be 0
if (i > 0) {
int j = 100 / i;
}
Say the branch is usually taken. This means the pipeline could be loaded with an instruction which triggers an interrupt (or Error in Java) before it knows the branch will not be taken. This can result in a complex situation which takes a while to unwind correctly.
These are called mispredicted branches. In short, a branch which goes the same way every/most of the time is faster; a branch which suddenly changes or is quite random (e.g. in sorting and tree data structures) will be slower.
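A rough way to see this for yourself is to run the same branch over random and over sorted data. The sketch below uses invented names (BranchDemo, sumAbove, time, sink) and arbitrary sizes. Treat any numbers it prints with caution: the JIT may compile the if into a conditional move, in which case there is no branch left to mispredict and both timings come out about the same.

```java
import java.util.Arrays;
import java.util.Random;

final class BranchDemo {
    // Written to after timing so the summing loop is not dead code.
    static volatile long sink;

    // Sums the elements at or above the threshold; the 'if' is the branch under test.
    static long sumAbove(int[] data, int threshold) {
        long sum = 0;
        for (int v : data) {
            if (v >= threshold) {
                sum += v;
            }
        }
        return sum;
    }

    // Times 'passes' passes over the data and returns the elapsed nanoseconds.
    static long time(int[] data, int passes) {
        long sum = 0;
        long start = System.nanoTime();
        for (int p = 0; p < passes; p++) {
            sum += sumAbove(data, 128);
        }
        long elapsed = System.nanoTime() - start;
        sink = sum;
        return elapsed;
    }

    public static void main(String[] args) {
        int[] random = new Random(42).ints(1 << 16, 0, 256).toArray();
        int[] sorted = random.clone();
        Arrays.sort(sorted); // same values, but the branch now flips only once
        for (int round = 0; round < 5; round++) { // early rounds double as warmup
            System.out.println("random: " + time(random, 1000)
                    + " ns, sorted: " + time(sorted, 1000) + " ns");
        }
    }
}
```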
int might be faster on a 32-bit system and long might be faster on a 64-bit system.
Should I bother about it? No, you don't code for a system configuration; you code based on what your requirements are. Micro-optimizations never work, and they might introduce some unexpected issues.
Small Integer objects can be faster because Java treats them specially.
All Integer values with -128 <= i && i <= 127 are stored in IntegerCache, an inner class of Integer.
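A small sketch of what the cache means in practice (IntegerCacheDemo and sameBox are names invented for this example). Note that it only concerns boxed Integer objects; the primitive int comparison from the question involves no boxing and never touches the cache.

```java
final class IntegerCacheDemo {
    // True when boxing the same value twice yields the same object.
    static boolean sameBox(int value) {
        Integer a = Integer.valueOf(value);
        Integer b = Integer.valueOf(value);
        return a == b; // reference comparison on purpose
    }

    public static void main(String[] args) {
        // Inside the guaranteed cache range, so both boxes are the same object.
        System.out.println("127 cached: " + sameBox(127));
        // Outside the guaranteed range; the result depends on JVM settings.
        System.out.println("128 cached: " + sameBox(128));
        // Primitive comparison: no objects, no cache.
        int i = 0, j = 1000;
        System.out.println("0 < 1000: " + (i < j));
    }
}
```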

Nested loop comparison in Python,Java and C

The following code in python takes very long to run. (I couldn't wait until the program ended, though my friend told me for him it took 20 minutes.)
But the equivalent code in Java runs in approximately 8 seconds and in C it takes 45 seconds.
I expected Python to be slow, but not this slow, and C, which I expected to be faster than Java, was actually slower. Is the JVM using some loop unrolling technique to achieve this speed? Is there any reason for Python being so slow?
import time
st=time.time()
for i in xrange(0,100000):
for j in xrange(0,100000):
continue;
print "Time taken : ",time.time()-st
Your test is not measuring anything meaningful.
A language's performance in the real world has little to do with how quickly it executes a tight loop.
Frankly, I'm intrigued that C and Java took as long as they did; I would have expected both of their compilers to realize that there was nothing happening inside the inner loop, and have optimized both of them away into nonexistence (and 0 seconds).
Python, on the other hand, is still interpreted (I could be wrong about this). In any case, it looks like the outer loop is needing to construct 100,000 xrange objects on which to run the empty inner loop, and that's unlikely to be optimized away.
So all you're really measuring is various compilers' ability to see through the fact that no real computing work is being done.
The lesson is: Performance is never what you expect. Therefore, always measure, never believe.
Some reasons why you might see these numbers (and from the first sentence, some of these might be completely wrong):
C is compiled for an "i586" processor (also called Pentium). That CPU was sold from 1993 to about 2000. Have you seen one lately? Guess not. So the C code isn't really optimized for what your CPU can do (or to put it another way around: Today's CPUs try very hard to be a fast Pentium CPU). Java, OTOH, is compiled for your CPU as the code is loaded. It can pull some tricks that C simply can't. The price is that the C program starts in 15ms while the Java program needs 4 Seconds.
Python has no JIT (just in time compiler), yet. All code is converted into bytecode which is then interpreted. This means the loop above is turned into a dozen bytecode instructions which are then interpreted by a C program. That just takes time. Python is not meant for huge loops, it's meant for smart algorithms which you simply can't express in any other language (at least not with the same amount of code and readability).
So just as it doesn't make sense to go shopping with an 18t truck (you can transport anything but you won't find a space to park it), choose your programming language according to the problem you need to solve. It has to be small & fast? C. Just fast? Java. Flexible? Python. Flexible & fast: Python with a helper library in C (like NumPy).
Is there any reason for Python being so slow?
Yes.
But what does it matter? You've created 100,000 xrange objects. Why? What does that matter? What is your real question on performance? What algorithm do you actually have that's actually too slow?
for i in xrange(0,100000): # Creates one xrange object
for j in xrange(0,100000): # Creates a fresh xrange object each time through the loop
for i in xrange(0, 10000):
for j in xrange(0, 10000):
pass
or
for i in xrange(0, 100000000):
pass
Python 2.6.5 - Time taken : 8.50866484642
PyPy 1.3 - Time taken : 1.55999398232
So the reason for the slowness is not the creation of xrange objects.
gcc 4.2 with the -O1 flag or higher optimizes away the loop, and the program takes 1 millisecond to execute.
This benchmark is not very representative as it is very far from any real world use.
You're doing a nested loop for a reason, and you never leave it empty.
Python doesn't optimize away the loop, although I see no technical reason why it couldn't.
Python is slower than C because it's further from the machine language. xrange is a nice abstraction but it adds a heavy layer of machine code compared to a simple C loop.
C source:
int main( void ){
int i, j;
for (i=0;i<100000;i++){
for (j=0;j<100000;j++){
continue;
}
}
return 0;
}
A good compiler would optimise away the loop.
Assuming the loop isn't optimised away, I'd expect Python to be something like 100 times slower than the C version
