Nested loop comparison in Python, Java and C

The following code in Python takes very long to run. (I couldn't wait for the program to end; my friend told me it took 20 minutes for him.)
But the equivalent code in Java runs in approximately 8 seconds, and in C it takes 45 seconds.
I expected Python to be slow, but not this much, and C, which I expected to be faster than Java, was actually slower. Is the JVM using some loop unrolling technique to achieve this speed? Is there any reason for Python being so slow?
import time
st = time.time()
for i in xrange(0, 100000):
    for j in xrange(0, 100000):
        continue
print "Time taken : ", time.time() - st

Your test is not measuring anything meaningful.
A language's performance in the real world has little to do with how quickly it executes a tight loop.
Frankly, I'm intrigued that C and Java took as long as they did; I would have expected both of their compilers to realize that there was nothing happening inside the inner loop, and have optimized both of them away into nonexistence (and 0 seconds).
Python, on the other hand, is still interpreted (I could be wrong about this). In any case, it looks like the outer loop needs to construct 100,000 xrange objects on which to run the empty inner loop, and that's unlikely to be optimized away.
So all you're really measuring is various compilers' ability to see through the fact that no real computing work is being done.

The lesson is: Performance is never what you expect. Therefore, always measure, never believe.
Some reasons why you might see these numbers (and from the first sentence, some of these might be completely wrong):
C is compiled for an "i586" processor (also called Pentium). That CPU was sold from 1993 to about 2000. Have you seen one lately? Guess not. So the C code isn't really optimized for what your CPU can do (or to put it another way: today's CPUs try very hard to be a fast Pentium CPU). Java, OTOH, is compiled for your CPU as the code is loaded. It can pull some tricks that C simply can't. The price is that the C program starts in 15 ms while the Java program needs 4 seconds.
Python has no JIT (just in time compiler), yet. All code is converted into bytecode which is then interpreted. This means the loop above is turned into a dozen bytecode instructions which are then interpreted by a C program. That just takes time. Python is not meant for huge loops, it's meant for smart algorithms which you simply can't express in any other language (at least not with the same amount of code and readability).
So just as it doesn't make sense to go shopping with an 18-ton truck (you can transport anything but you won't find a space to park it), choose your programming language according to the problem you need to solve. Does it have to be small and fast? C. Just fast? Java. Flexible? Python. Flexible and fast? Python with a helper library in C (like NumPy).

Is there any reason for Python being so slow?
Yes.
But what does it matter? You've created 100,000 xrange objects. Why? What does that matter? What is your real question on performance? What algorithm do you actually have that's actually too slow?
for i in xrange(0,100000): # Creates one xrange object
    for j in xrange(0,100000): # Creates a fresh xrange object each time through the loop

for i in xrange(0, 10000):
    for j in xrange(0, 10000):
        pass
or
for i in xrange(0, 100000000):
    pass
Python 2.6.5 - Time taken : 8.50866484642
PyPy 1.3 - Time taken : 1.55999398232
So the reason for the slow run is not the creation of xrange objects.

gcc 4.2 with the -O1 flag or higher optimizes away the loop, and the program takes 1 millisecond to execute.
This benchmark is not very representative as it is very far from any real world use.
You're doing a nested loop for a reason, and you never leave it empty.
Python doesn't optimize away the loop, although I see no technical reason why it couldn't.
Python is slower than C because it's further from the machine language. xrange is a nice abstraction but it adds a heavy layer of machine code compared to a simple C loop.
C source:
int main( void ){
    int i, j;
    for (i = 0; i < 100000; i++){
        for (j = 0; j < 100000; j++){
            continue;
        }
    }
    return 0;
}

A good compiler would optimise away the loop.
Assuming the loop isn't optimised away, I'd expect Python to be something like 100 times slower than the C version.

Related

Oracle JDK8: unrolled loop converted to NOOP by JVM?

I've got a little program that is a fairly pointless exercise in simple number crunching that has thrown me for a loop.
The program spawns a bunch of worker threads that do simple mathematical operations. Recently I changed the inner loop of one variant of worker from:
do
{
    int3 = int1 + int2;
    int3 = int1 * int2;
    int1++;
    int2++;
    i++;
}
while (i < 128);
to something akin to:
int3 = tempint4[0] + tempint5[0];
int3 = tempint4[0] * tempint5[0];
int3 = tempint4[1] + tempint5[1];
int3 = tempint4[1] * tempint5[1];
int3 = tempint4[2] + tempint5[2];
int3 = tempint4[2] * tempint5[2];
int3 = tempint4[3] + tempint5[3];
int3 = tempint4[3] * tempint5[3];
...
int3 = tempint4[127] + tempint5[127];
int3 = tempint4[127] * tempint5[127];
The arrays are populated by random integers no higher than 1025 in value, and the array values do not change.
The end result was that the program ran much faster, though closer examination seems to indicate that the CPU isn't actually doing anything when running the newer version of the code. It seems that the JVM has figured out that it can safely ignore the code that replaced the inner loop after one iteration of the outer loop since it is only redoing the same calculations on the same set of data over and over again.
To illustrate my point, the old code took maybe ~27000 ms to run and noticeably increased the operating temperature of the CPU (it also showed 100% utilization for all cores). The new code takes maybe 5 ms to run (sometimes less) and causes nary a spike in CPU utilization or temperature. Increasing the number of outer loop iterations does nothing to change the behavior of the new code, even when the number of iterations increases by a hundred times or more.
I have another version of the worker that is identical to the one above except that it has a division operation along with the addition and multiplication operations. In its new unrolled form, the division-enabled version is also much faster than its previous form, but it actually takes a little while (~300 ms on the first run and ~200 ms on subsequent runs, despite warmup, which is a little odd) and produces a profound spike in CPU temperature for its brief run. Increasing the number of outer loop iterations seems to cause the temperature phenomenon to mostly cease after a certain amount of time has passed while running the program, though utilization still shows 100% for all cores. My guess is the JVM is taking much longer to figure out which operations it can safely ignore when handling division operations, and that it is not ignoring all of them.
Short of adding division operations to all my code (which isn't really a fix anyway beyond a certain number of outer loop iterations), is there any way I can get the JVM to stop reducing my code to apparent NOOPs? I've tried several solutions to the problem, such as generating new random values per iteration of the outer loop, going back to simple integer variables with incrementation, and some other nonsense, but none of those solutions have produced desirable results. Either it continues to ignore the series of instructions, or the performance hit from modifications is bad enough that my division-heavy variant actually performs better than the code without division operations.
edit: to provide some context:
i: this variable is an integer that is used as a loop counter in a do/while loop. It is defined in the class file containing the worker code. Its initial value is 0. It is no longer used in the newer version of the worker.
int1/int2: These are integers defined in the class file containing the worker code. Their initial values are both 0. They were used in the old version of the code to provide changing values for each iteration of the internal loop. All I had to do was increment them upward by one per loop iteration, and the JVM would be forced to carry out every operation faithfully. Unfortunately, this loop apparently prevented the use of SIMD. Each time the outer loop iterated, int1 and int2 had their values reset to prevent overflow of int1, int2, or int3 (I have discovered that integer overflow can slow down the code unnecessarily, as can allowing a float to reach Infinity).
tempint4/tempint5: These are references to a pair of integer arrays defined in the main class file for the program (Mathtester. Yes, unimaginative, I know). When the program first starts, there is a short do/while loop that fills each array with random integers ranging from 1 to 1025. The arrays are 128 integers in size. Each array is static, though the reference variables are not. In truth there is no particular reason for me to use the reference variables. They are leftovers from when I was trying to do an array reference swap so that, after each iteration of the outer loop, tempint4 and tempint5 would refer to the opposite array. It was my hope that the JVM would stop ignoring my code block. For the division-enabled version of the code, this seems to have worked (sort of), since it fundamentally changes the values to be calculated. Swapping tempint4 for tempint5 and vice versa does not change the results of the addition and multiplication operations, so the JVM can still ignore those.
edit: Making tempint4 and tempint5 (since they are only reference variables, I am actually referring to the main arrays, Mathtester.int4 and Mathtester.int5) volatile worked, without notably reducing the amount of CPU activity or the CPU temperature. It did slow down the code a bit, but that is a probable indicator that the JVM was NOOPing more than I knew.
Is there any way I can get the JVM to stop reducing my code to apparent NOOPs?
Yes, by making int3 volatile.
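For illustration, here is a minimal sketch of that fix (the surrounding worker class is my assumption, not the original code):

class Worker implements Runnable {
    // volatile makes every write to int3 an observable action under the
    // Java memory model, so the JIT can no longer prove the stores are
    // dead and delete the whole block.
    volatile int int3;

    public void run() {
        for (int i = 0; i < 128; i++) {
            int3 = i + i; // must actually be performed now
            int3 = i * i;
        }
    }
}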
One of the first things when dealing with Java performance that you have to learn by heart is this:
"A single line of Java code means nothing at all in isolation".
Modern JVMs are very complex beasts, and do all kinds of optimization. If you try to measure some small piece of code, the chances are that you will not be measuring what you think you are - it is really complicated to do it correctly without very, very detailed knowledge of what the JVM is doing.
In this case, yes, it's entirely likely that the JVM is optimizing away the loop. There's no simple way to prevent it from doing this, and almost all techniques are fragile and JVM-version specific (because new & cleverer optimizations are developed & added to the JVM all the time).
So, let me turn the question around: "What are you really trying to achieve here? Why do you want to prevent the JVM from optimizing?"

Can Java use unsigned comparison for general in-range tests?

I've seen that the JITC uses unsigned comparison for checking array bounds (the test 0 <= x < LIMIT is equivalent to x ≺ LIMIT, where ≺ treats the numbers as unsigned quantities). So I was curious whether it works for arbitrary comparisons of the form 0 <= x < LIMIT as well.
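The identity itself is easy to check in plain Java; a small sketch, assuming LIMIT is non-negative (method names are mine):

// A negative x, reinterpreted as unsigned, becomes a huge value and
// fails the test, so one unsigned compare replaces two signed ones.
static boolean inRangeSigned(int x, int limit) {
    return 0 <= x && x < limit;
}

static boolean inRangeFlipped(int x, int limit) {
    // Flipping the sign bit maps signed order onto unsigned order.
    return (x ^ Integer.MIN_VALUE) < (limit ^ Integer.MIN_VALUE);
}

Both return the same result for every x whenever limit >= 0.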
The results of my benchmark are pretty confusing. I've created three experiments of the form
for (int i=0; i<LENGTH; ++i) {
    int x = data[i];
    if (condition) result += x;
}
with different conditions
0 <= x called above
x < LIMIT called below
0 <= x && x < LIMIT called inRange
0 <= x & x < LIMIT called inRange2
and prepared the data so that the probabilities of the condition being true are the same.
The results should be fairly similar; above might just be slightly faster, as it compares against zero. Even if the JITC couldn't use the unsigned comparison for the test, the results for above and below should still be similar.
Can anyone explain what's going on here? It's quite possible that I did something wrong...
Update
I'm using Java build 1.7.0_51-b13 on Ubuntu 2.6.32-54-generic with an i5-2400 CPU @ 3.10GHz, in case anybody cares. As the results for inRange and inRange2 near 0.00 are especially confusing, I re-ran the benchmark with more steps in this area.
The likely variation in the results of the benchmarks have to do with CPU caching at different levels.
Since primitive ints are being used, there is no JVM-specific caching going on, as would happen with auto-boxed Integer values.
Thus all that remains, given the minimal memory consumption of the data[] array, is CPU caching of low-level values/operations. Since, as described, the distributions of values are based on random values with statistical 'probabilities' of the conditions being true across the tests, the likely cause is that, depending on the values, more or less (random) caching is going on for each test, leading to more randomness.
Further, depending on the isolation of the computer (background services/processes), the test cases may not be running in complete isolation. Ensure that everything is shutdown for these tests except the core OS functions and the JVM. Set the JVM memory min/max the same, shutdown any networking processes, updates, etc.
Are your test results the average of a number of runs, or did you only test each function once?
One thing I have found is that the first time you run a for loop the JVM will interpret it; then, each subsequent time it's run, the JVM will optimize it. Therefore the first few runs may get horrible performance, but after a few runs it will be near native performance.
I also figured out that a loop will not be optimized while it's running. I have not tested if this applies to just the loop or the whole function. If it only applies to the loop, you may get much more performance if you nest it in an inner and outer loop and work with your data one block at a time. If it's the whole function, you will have to place the inner loop in its own function.
Also run the test more than once, if you compare the code you will notice how the JIT optimizes the code in stages.
For most code this gives Java optimal performance. It allows it to skip costly optimization on code that runs rarely and makes code that run often a lot faster. However if you have a code block that runs once but for a long time, it will become horribly slow.
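A rough way to watch those stages is to time the same loop repeatedly in one process; a sketch (class name and iteration counts are arbitrary, and run-to-run numbers will vary):

public class WarmupDemo {
    static long work() {
        long sum = 0;
        for (int i = 0; i < 10_000_000; i++) {
            sum += i;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Early runs are interpreted; later runs are JIT-compiled,
        // so the per-run time typically drops in visible steps.
        for (int run = 0; run < 10; run++) {
            long t0 = System.nanoTime();
            long r = work();
            long elapsed = System.nanoTime() - t0;
            // Printing the result keeps the loop from being optimized away.
            System.out.println("run " + run + ": " + elapsed / 1_000_000 + " ms (sum=" + r + ")");
        }
    }
}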

Can Java recognize SIMD advantages of the CPU, or is this just the optimization effect of loop unrolling?

This part of code is from the dotproduct method of a vector class of mine. The method computes the inner product against a target array of vectors (1000 vectors).
When the vector length is an odd number (262145), compute time is 4.37 seconds. When the vector length (N) is 262144 (a multiple of 8), compute time is 1.93 seconds.
time1=System.nanoTime();
int count=0;
for(int j=0;j<1000;i++)
{
    b=vektors[i]; // selects next vector(b) to multiply as inner product.
                  // each vector has an array of float elements.
    if(((N/2)*2)!=N)
    {
        for(int i=0;i<N;i++)
        {
            t1+=elements[i]*b.elements[i];
        }
    }
    else if(((N/8)*8)==N)
    {
        float []vek=new float[8];
        for(int i=0;i<(N/8);i++)
        {
            vek[0]=elements[i]*b.elements[i];
            vek[1]=elements[i+1]*b.elements[i+1];
            vek[2]=elements[i+2]*b.elements[i+2];
            vek[3]=elements[i+3]*b.elements[i+3];
            vek[4]=elements[i+4]*b.elements[i+4];
            vek[5]=elements[i+5]*b.elements[i+5];
            vek[6]=elements[i+6]*b.elements[i+6];
            vek[7]=elements[i+7]*b.elements[i+7];
            t1+=vek[0]+vek[1]+vek[2]+vek[3]+vek[4]+vek[5]+vek[6]+vek[7];
            //t1 is total sum of all dot products.
        }
    }
}
time2=System.nanoTime();
time3=(time2-time1)/1000000000.0; //seconds
Question: Could the reduction of time from 4.37 s to 1.93 s (2x as fast) be the JIT's wise decision to use SIMD instructions, or just my loop unrolling's positive effect?
If the JIT cannot do SIMD optimization automatically, then in this example there is also no unrolling optimization done automatically by the JIT; is this true?
For 1M iterations (vectors) and for a vector size of 64, the speedup multiplier goes to 3.5x (cache advantage?).
Thanks.
Your code has a bunch of problems. Are you sure you're measuring what you think you're measuring?
Your first loop does this, indented more conventionally:
for(int j=0;j<1000;i++) {
    b=vektors[i]; // selects next vector(b) to multiply as inner product.
                  // each vector has an array of float elements.
}
Your rolled loop involves a really long chain of dependent loads and stores. Your unrolled loop involves 8 separate chains of dependent loads and stores. The JVM can't turn one into the other if you're using floating-point arithmetic because they're fundamentally different computations. Breaking dependent load-store chains can lead to major speedups on modern processors.
Your rolled loop iterates over the whole vector. Your unrolled loop only iterates over the first (roughly) eighth. Thus, the unrolled loop again computes something fundamentally different.
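For comparison, here is a sketch of an unroll-by-8 that does cover the whole vector, using the question's variable names (note the float caveat above still applies: eight separate accumulators change the order of the additions, so the result can differ slightly from the rolled loop):

// Unroll by 8 but advance i by 8 so every element is visited once;
// eight independent partial sums break the single dependent add chain.
float s0 = 0, s1 = 0, s2 = 0, s3 = 0, s4 = 0, s5 = 0, s6 = 0, s7 = 0;
for (int i = 0; i + 7 < N; i += 8) {
    s0 += elements[i]     * b.elements[i];
    s1 += elements[i + 1] * b.elements[i + 1];
    s2 += elements[i + 2] * b.elements[i + 2];
    s3 += elements[i + 3] * b.elements[i + 3];
    s4 += elements[i + 4] * b.elements[i + 4];
    s5 += elements[i + 5] * b.elements[i + 5];
    s6 += elements[i + 6] * b.elements[i + 6];
    s7 += elements[i + 7] * b.elements[i + 7];
}
t1 += s0 + s1 + s2 + s3 + s4 + s5 + s6 + s7; // N assumed a multiple of 8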
I haven't seen a JVM generate vectorised code for something like your second loop, but I'm maybe a few years out of date on what JVMs do. Try using -XX:+PrintAssembly when you run your code and inspect the code opto generates.
I have done a little research on this (and am drawing from knowledge from a similar project I did in C with matrix multiplication), but take my answer with a grain of salt as I am by no means an expert on this topic.
As for your first question, I think the speedup is coming from your loop unrolling; you're making roughly 87% fewer condition checks in terms of the for loop. As far as I know, the JVM has supported SSE since 1.4, but to actually control whether your code is using vectorization (and to know for sure), you'll need to use JNI.
See an example of JNI here: Do any JVM's JIT compilers generate code that uses vectorized floating point instructions?
When you decrease the size of your vector to 64 from 262144, cache is definitely a factor. When I did this project in C, we had to implement cache blocking for larger matrices in order to take advantage of the cache. One thing you might want to do is check your cache size.
Just as a side note: It might be a better idea to measure performance in flops rather than seconds, just because the runtime (in seconds) of your program can vary based on many different factors, such as CPU usage at the time.

How do comparison operators work in java?

Recently someone told me that a comparison involving smaller integers will be faster, e.g.,
// Case 1
int i = 0;
int j = 1000;
if (i < j) { /* do something */ }

// Case 2
int j = 6000000;
int i = 500000;
if (i < j) { /* do something else */ }
so, the comparison (if condition) in Case 1 should be faster than that in Case 2. In Case 2 the integers will take more space to store, but whether that affects the comparison, I am not sure.
Edit 1: I was thinking of the binary representation of i and j, e.g., for i=0 it will be 0, and for j=1000 it is 1111101000 (in 32-bit representation it should be 22 zeros followed by 1111101000; I completely forgot about 32-bit or 64-bit representation, my bad!)
I tried to look at JVM Specification and bytecode of a simple comparison program, nothing made much sense to me. Now the question is how does comparison work for numbers in java? I guess that will also answer why (or why not) any of the cases will be faster.
Edit 2: I am just looking for a detailed explanation, I am not really worried about micro optimizations
If you care about performance, you really only need to consider what happens when the code is compiled to native code. In this case, it is the behaviour of the CPU which matters. For simple operations almost all CPUs work the same way.
so, comparison (if condition) in Case 1 should be faster than that in Case 2. In Case 2 the integers will take more space to store but that can affect the comparison, I am not sure.
An int is always 32-bit in Java. This means it always takes the same space. In any case, the size of the data type is not very important; e.g., short and byte are often slower because the native word size is 32-bit and/or 64-bit and the CPU has to extract the right data from a larger type and sign-extend it.
I tried to look at JVM Specification and bytecode of a simple comparison program, nothing made much sense to me.
The JLS specifies behaviour, not performance characteristics.
Now the question is how does comparison work for numbers in java?
It works the same way it does in just about every other compiled language. It uses a single machine code instruction to compare the two values and another to perform a conditional branch as required.
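You can see the first half of that with javap -c. For a method like this sketch (the method is mine, not from the question), the comparison and branch compile to a single if_icmpge bytecode, which the JIT typically turns into one compare plus one conditional jump in machine code:

static int min(int i, int j) {
    // javap -c shows roughly: iload_0, iload_1, if_icmpge <else-branch>
    if (i < j) {
        return i;
    }
    return j;
}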
I guess that will also answer why (or why not) any of the cases will be faster.
Most modern CPUs use branch prediction. To keep the CPU's pipeline full, it attempts to predict which branch will be taken (before it knows it is the right one to take) so there is no break in the instructions executed. When this works well, the branch has almost no cost. When it mis-predicts, the pipeline can be filled with instructions for a branch which was the wrong guess, and this can cause a significant delay as it clears the pipeline and takes the right branch. In the worst case it can mean hundreds of clock cycles of delay.
Consider the following.
int i; // happens to be 0
if (i > 0) {
    int j = 100 / i; // may be speculatively executed before the branch resolves
}
Say the branch is usually taken. This means the pipeline could be loaded with an instruction which triggers an interrupt (or Error in Java) before it knows the branch will not be taken. This can result in a complex situation which takes a while to unwind correctly.
These are called mispredicted branches. In short, a branch which goes the same way every time (or most of the time) is faster; a branch which suddenly changes or is quite random (e.g. in sorting and tree data structures) will be slower.
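A minimal sketch that makes the effect visible: run the same branchy loop over sorted versus unsorted data (class name and sizes are arbitrary, and timings are indicative only; a clever JIT may replace the branch with a conditional move):

import java.util.Arrays;
import java.util.Random;

public class BranchDemo {
    public static void main(String[] args) {
        int[] data = new int[1 << 20];
        Random rnd = new Random(42);
        for (int i = 0; i < data.length; i++) data[i] = rnd.nextInt(256);

        // With the sort, the branch below becomes highly predictable;
        // comment it out and the same loop usually runs noticeably slower.
        Arrays.sort(data);

        long sum = 0;
        long t0 = System.nanoTime();
        for (int pass = 0; pass < 100; pass++) {
            for (int v : data) {
                if (v >= 128) sum += v; // taken for about half the values
            }
        }
        System.out.println((System.nanoTime() - t0) / 1_000_000 + " ms, sum=" + sum);
    }
}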
An int might be faster on a 32-bit system and a long might be faster on a 64-bit system.
Should I bother about it? No, you don't code for a system configuration; you code based on your requirements. Micro-optimizations never work and they might introduce unexpected issues.
Small Integer objects can be faster because Java treats them specially.
All Integer values with -128 <= i <= 127 are stored in IntegerCache, an inner class of Integer.
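You can see the cache directly:

// Integer.valueOf returns the same cached object for values in
// [-128, 127], so == compares identical references there.
Integer a = Integer.valueOf(100), b = Integer.valueOf(100);
Integer c = Integer.valueOf(1000), d = Integer.valueOf(1000);
System.out.println(a == b); // true  (both from IntegerCache)
System.out.println(c == d); // false (two distinct objects)

Note this affects identity and allocation of boxed objects, not the speed of a primitive comparison like i < j.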

Comparing c and java programs runtime

I had a job interview today; we were given a programming question and asked to solve it using C/C++/Java. I solved it in Java, and its runtime was 3 seconds (the test input was more than 16,000 lines, and the person accompanying us said the running time was reasonable). Another person there solved it in C and the runtime was 0.25 seconds. So I was wondering: is a factor of 12 normal?
Edit:
As I said, I don't think there was really much room for algorithm variation, except maybe in one little thing. Anyway, there was this protocol that we had to implement:
A (client) and B (server) communicate according to some protocol p. Before the messages are delivered, their validity is checked. The protocol is defined by its states and the text messages that can be sent in each state; in every state there was only one valid message that could be sent, except in one state, where there were about 10 messages that could be sent. There are 5 states, and the state transitions are defined by the protocol too.
So what I did with the state from which 10 different messages can be sent was to store their string values in an ArrayList container; then, when I needed to check a message's validity in the corresponding state, I checked arrayList.contains(sentMessageStr). I would think that this operation's complexity is O(n), although I think Java has some built-in optimization for this operation. Although, now that I am thinking about it, maybe I should've used a HashSet container. I suppose the C implementation would have stored those predefined legal strings lexicographically in an array and implemented a binary search function.
thanks
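Edit: for reference, a minimal sketch of the HashSet variant I mentioned (the message strings here are made up):

// Needs java.util.Arrays, java.util.HashSet, java.util.Set.
// HashSet.contains is expected O(1) versus O(n) for ArrayList.contains,
// though with only ~10 short strings either one is cheap, so this alone
// is unlikely to explain a 12x difference.
Set<String> validMessages = new HashSet<String>(Arrays.asList(
        "MSG_A", "MSG_B", "MSG_C" /* ... the ~10 legal messages */));
boolean valid = validMessages.contains(sentMessageStr);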
I would guess that it's likely the jvm took a significant portion of that 3 seconds just to load. Try running your java version on the same machine 5 times in a row. Or try running both on a dataset 500 times as large. I suspect you'll see a significant constant latency for the Java version that will become insignificant when runtimes go into the minutes.
Sounds more like a case of insufficient samples and unequal implementations (and possibly unequal test beds).
One of the first rules in measurement is to establish enough samples and obtain the mean of the samples for comparison. Even a couple of runs of the same program is not sufficient. You need to tax the machine enough to obtain samples whose values can be compared. That's why test-beds need to be warmed up, so that there are little or no variables at play, except for the system under observation.
And of course, you also have different people implementing the same requirement/algorithm in different manners. It counts. Period. Unless the algorithm implementations have been "normalized", obtaining samples and comparing them are the same as comparing apples and watermelons.
I don't think I need to expand on the fact that the testbeds could have been of varying configurations, or under varying loads.
It's almost impossible to say without seeing the code - you may have a better algorithm for example that scales up much better for larger input but has a greater overhead for small input sizes.
Having said that, this kind of 12x difference is roughly what I would expect if you coded the solution using "higher level" constructs such as ArrayLists / boxed objects and the C solution was basically using optimised, low level pointer arithmetic into a pre-allocated memory region.
I'd rather maintain the higher level solution, but there are times when only hand-optimised low level code will do.....
Another potential explanation is that the JIT had not yet warmed up on your code. In general, you need to reach "steady state" (typically a few thousand iterations of every code path) before you will see top performance in JIT-compiled code.
Performance depends on implementation. Without knowing exactly what you code and what your competitor did, it's very difficult to tell exactly what happened.
But let's say, for instance, that you used objects like vectors or whatever to solve the problem and the C guy used plain arrays; his implementation is going to be faster than yours for sure.
C code can be translated very efficiently into assembly instructions, while Java on the other hand, relies on a bunch of stuff (like the JVM) that might make the bytecode of your program fatter and probably a little bit slower.
You will be hard pressed to find something that can execute faster in Java than in C. It's true that an order of magnitude is a big difference, but in general C is more performant.
On the other hand you can produce a solution to any given problem much quicker in Java (especially taking into account the richness of the libraries).
So at the end of the day, if there is a choice at all, it comes down as a dilemma between performance and productivity.
That depends on the algorithm. Java is of course generally slower than C/C++ as it's a virtual machine but for most common applications its speed is sufficient. I would not call a factor of 12 normal for common applications.
Would be nice if you posted the C and Java codes for comparison.
A factor of 12 can be normal. So could a factor of 1 or 1/2. As some commentators mentioned, a lot has to do with how you coded your solution.
Don't forget that Java programs have to run in a JVM (unless you compile to native machine code), so any benchmarks should take that into account.
You can google for 'java and c speed comparisons' for some analysis
Back in the day I'd have said there's nothing wrong with your Java code being 12 times slower. But nowadays I'd rather say that the C guy implemented it more efficiently. Obviously I might be wrong, but are you sure you used proper data structures and simply coded it well?
Also, did you measure the memory usage? This might sound silly, but last year at uni we had a programming challenge. I don't really remember what it was, but we had to solve a graph problem in whatever language we wanted. I did two implementations of my algorithm, one in C and one in Java; the Java one was ~1.5-2x slower, BUT, for instance, I knew I didn't have to worry about memory management (I knew exactly how big the input would be and how many test samples the teacher would run), so I simply didn't free any memory (freeing would have taken way too much time in a program that ran for ~1-2 seconds on a graph with ~15k nodes, or was it 150k?). So my Java code was better memory-wise, but it was slower. I also parsed the input myself in C (I didn't do that in Java), which saved me really A LOT of time (~20-25% boost; I was amazed myself). I'd say 1.5-2x is more realistic than 12x.
Most likely the algorithm used in the implementation was different.
For instance (an oversimplification): if you want to add a number N, M times, one implementation could be:
long addTimes( long n, long m ) {
    long r = 0;
    long i;
    for( i = 0; i < m ; i++ ) {
        r += n;
    }
    return r;
}
And another implementation could simply be:
long addTimes( long n, long m ) {
    return n * m;
}
Both will run mostly the same in Java and C (you don't even have to change the code), and still one implementation will run way faster than the other.
