I just came across this seemingly innocuous comment, benchmarking ArrayList vs a raw String array. It's from a couple years ago, but the OP writes
I did notice that using for String s: stringsList was about 50% slower than using an old-style for-loop to access the list. Go figure...
Nobody commented on it in the original post, and the test seemed a little dubious (too short to be accurate), but I nearly fell out of my chair when I read it. I've never benchmarked an enhanced loop against a "traditional" one, but I'm currently working on a project that does hundreds of millions of iterations over ArrayList instances using enhanced loops, so this is a concern to me.
I'm going to do some benchmarking and post my findings here. I could find precious little info online about relative performance, except for a couple of offhand mentions that enhanced loops over ArrayLists run a lot slower under Android.
Has anybody experienced this? Does such a performance gap still exist? I suspect that if it did exist, it has been fixed in more modern VMs, but I guess now I'll have to do some testing to confirm.
Update: I made some changes to my code, but was already suspecting what others here have already pointed out: sure, the enhanced for loop is slower, but outside of very trivial tight loops, the cost should be a minuscule fraction of the cost of the logic of the loop. In my case, even though I'm iterating over very large lists of strings using enhanced loops, my logic inside the loop is complex enough that I couldn't even measure a difference after switching to index-based loops.
TL;DR: enhanced loops are indeed slower than a traditional index-based loop over an ArrayList; but for most applications the difference should be negligible.
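For context, the enhanced for loop over a List is just syntactic sugar for an Iterator loop: per the Java Language Specification (14.14.2), the compiler rewrites it roughly as in the following sketch (the wrapper class is mine, for illustration):

import java.util.Iterator;
import java.util.List;

class Desugared {
    // What "for (String s : stringsList) { ... }" compiles to, roughly:
    static void printAll(List<String> stringsList) {
        for (Iterator<String> it = stringsList.iterator(); it.hasNext(); ) {
            String s = it.next();
            System.out.println(s); // stand-in loop body
        }
    }
}

So the gap people measure is really the cost of hasNext()/next() calls versus a plain index increment and get(i).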
The problem you have is that using an Iterator will be slower than using a direct lookup. On my machine the difference is about 0.13 ns per iteration. Using an array instead saves about 0.15 ns per iteration. This should be trivial in 99% of situations.
public static void main(String... args) {
    int testLength = 100 * 1000 * 1000;
    String[] stringArray = new String[testLength];
    Arrays.fill(stringArray, "a");
    List<String> stringList = new ArrayList<String>(Arrays.asList(stringArray));
    {
        long start = System.nanoTime();
        long total = 0;
        for (String str : stringArray) {
            total += str.length();
        }
        System.out.printf("The for each Array loop time was %.2f ns total=%d%n", (double) (System.nanoTime() - start) / testLength, total);
    }
    {
        long start = System.nanoTime();
        long total = 0;
        for (int i = 0, stringListSize = stringList.size(); i < stringListSize; i++) {
            String str = stringList.get(i);
            total += str.length();
        }
        System.out.printf("The for/get List loop time was %.2f ns total=%d%n", (double) (System.nanoTime() - start) / testLength, total);
    }
    {
        long start = System.nanoTime();
        long total = 0;
        for (String str : stringList) {
            total += str.length();
        }
        System.out.printf("The for each List loop time was %.2f ns total=%d%n", (double) (System.nanoTime() - start) / testLength, total);
    }
}
When run with one billion entries, this prints (using Java 6 update 26):
The for each Array loop time was 0.76 ns total=1000000000
The for/get List loop time was 0.91 ns total=1000000000
The for each List loop time was 1.04 ns total=1000000000
When run with one billion entries, this prints (using OpenJDK 7):
The for each Array loop time was 0.76 ns total=1000000000
The for/get List loop time was 0.91 ns total=1000000000
The for each List loop time was 1.04 ns total=1000000000
i.e. exactly the same. ;)
Every claim that X is slower than Y on a JVM which does not address all the issues presented in this article and its second part spreads fear and lies about the performance of a typical JVM. This applies to the comment referred to by the original question as well as to GravityBringer's answer. I am sorry to be so rude, but unless you use appropriate micro-benchmarking technology, your benchmarks produce really badly skewed random numbers.
Tell me if you're interested in more explanations. Although it is all in the articles I referred to.
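For what it's worth, one widely used harness that takes care of warmup, dead-code elimination, and the other pitfalls those articles describe is JMH. A minimal sketch of this particular comparison might look like the following (class name and sizes are my own, not from the articles):

import java.util.ArrayList;
import java.util.List;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Thread)
public class LoopBenchmark {
    List<String> strings;

    @Setup
    public void setup() {
        strings = new ArrayList<>();
        for (int i = 0; i < 1_000_000; i++) strings.add("a");
    }

    @Benchmark
    public void enhancedFor(Blackhole bh) {
        for (String s : strings) bh.consume(s); // Blackhole defeats dead-code elimination
    }

    @Benchmark
    public void indexedFor(Blackhole bh) {
        for (int i = 0, n = strings.size(); i < n; i++) bh.consume(strings.get(i));
    }
}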
GravityBringer's number doesn't seem right, because I know ArrayList.get() is as fast as raw array access after VM optimization.
I ran GravityBringer's test twice on my machine, in -server mode:
50574847
43872295
30494292
30787885
(2nd round)
33865894
32939945
33362063
33165376
The bottleneck in such tests is actually memory read/write. Judging from the numbers, the entire 2 arrays are in my L2 cache. If we decrease the size to fit L1 cache, or if we increase the size beyond L2 cache, we'll see 10X throughput difference.
The iterator of ArrayList uses a single int counter. Even if the VM doesn't put it in a register (if the loop body is too complex), at least it will be in the L1 cache, therefore reads and writes of it are basically free.
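For illustration, here is a stripped-down sketch of what that iterator amounts to (modeled loosely on the OpenJDK source; the real ArrayList.Itr also tracks lastRet and an expectedModCount for fail-fast behavior):

import java.util.Iterator;

class SketchIterator<E> implements Iterator<E> {
    private final Object[] elementData;
    private final int size;
    private int cursor; // the single int counter

    SketchIterator(Object[] elementData, int size) {
        this.elementData = elementData;
        this.size = size;
    }

    public boolean hasNext() { return cursor != size; }

    @SuppressWarnings("unchecked")
    public E next() { return (E) elementData[cursor++]; }
}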
The ultimate answer of course is to test your particular program in your particular environment.
Though it's not helpful to play agnostic whenever a benchmark question is raised.
The situation has gotten worse for ArrayLists. On my computer running Java 6 update 26, there is a fourfold difference. Interestingly (and perhaps quite logically), there is no difference for raw arrays. I ran the following test:
int testSize = 5000000;
ArrayList<Double> list = new ArrayList<Double>();
Double[] arr = new Double[testSize];

// set up the data - make sure the data doesn't have patterns
// or anything the compiler could somehow optimize
for (int i = 0; i < testSize; i++) {
    double someNumber = Math.random();
    list.add(someNumber);
    arr[i] = someNumber;
}

// ArrayList foreach
long time = System.nanoTime();
double total1 = 0;
for (Double k : list) {
    total1 += k;
}
System.out.println(System.nanoTime() - time);

// ArrayList get() method
time = System.nanoTime();
double total2 = 0;
for (int i = 0; i < testSize; i++) {
    total2 += list.get(i);
}
System.out.println(System.nanoTime() - time);

// array foreach
time = System.nanoTime();
double total3 = 0;
for (Double k : arr) {
    total3 += k;
}
System.out.println(System.nanoTime() - time);

// array indexing
time = System.nanoTime();
double total4 = 0;
for (int i = 0; i < testSize; i++) {
    total4 += arr[i];
}
System.out.println(System.nanoTime() - time);

// would be strange if different values were produced,
// but no, all these are the same, of course
System.out.println(total1);
System.out.println(total2);
System.out.println(total3);
System.out.println(total4);
The arithmetic in the loops is to prevent the JIT compiler from possibly optimizing away some of the code. The effect of the arithmetic on performance is small, as the runtime is dominated by the ArrayList accesses.
The runtimes are (in nanoseconds):
ArrayList foreach: 248,351,782
ArrayList get(): 60,657,907
array foreach: 27,381,576
array direct indexing: 27,468,091
I have an int array with 1000 elements. I need to extract the size of various sub-populations within the array (How many are even, odd, greater than 500, etc..).
I could use a for loop and a bunch of if statements to try add to a counting variable for each matching item such as:
for (int i = 0; i < someArray.length; i++) {
if(conditionA) sizeA++;
if(conditionB) sizeB++;
if(conditionC) sizeC++;
...
}
or I could do something more lazy such as:
Supplier<IntStream> ease = () -> Arrays.stream(someArray);
int sizeA = ease.get().filter(conditionA).toArray().length;
int sizeB = ease.get().filter(conditionB).toArray().length;
int sizeC = ease.get().filter(conditionC).toArray().length;
...
The benefit of doing it the second way seems to be limited to readability, but is there a massive hit on efficiency? Could it possibly be more efficient? I guess it boils down to this: is iterating through the array once with 4 conditions always better than iterating through it 4 times with one condition each time (assuming the conditions are independent)? I am aware that in this particular example the second method makes lots of additional method calls, which I'm sure don't help efficiency any.
Preamble:
As @Kayaman points out, for a small array (1000 elements) it probably doesn't matter.
The correct approach to this kind of thing is to do the optimization after you have working code, and a working benchmark, and after you have profiled the code to see where the real hotspots are.
But assuming that this is worth spending effort on optimization, the first version is likely to be faster than the second version for a couple of reasons:
The overheads of incrementing and testing the index are only incurred once in the first version versus three times in the second one.
For an array that is too large to fit into the memory cache, the first version will entail fewer memory reads than the second one. Since memory access is typically a bottleneck (especially on a multi-core machine), this can be significant.
Streams add an extra performance overhead compared to simple iteration of an array.
I did some time measuring with this code:
Random r = new Random();
int[] array = IntStream.generate(() -> r.nextInt(100)).limit(1000).toArray();
long odd = 0;
long even = 0;
long divisibleBy3 = 0;
long start = System.nanoTime();

//for (int i : array) {
//    if (i % 2 == 1) {
//        odd++;
//    }
//    if (i % 2 == 0) {
//        even++;
//    }
//    if (i % 3 == 0) {
//        divisibleBy3++;
//    }
//}

even = Arrays.stream(array).parallel().filter(x -> x % 2 == 0).toArray().length;
odd = Arrays.stream(array).parallel().filter(x -> x % 2 == 1).toArray().length;
divisibleBy3 = Arrays.stream(array).parallel().filter(x -> x % 3 == 0).toArray().length;

System.out.println(System.nanoTime() - start);
The above outputs an 8-digit number, usually around 14,000,000.
If I uncomment the for loop and comment out the streams, I get a 5-digit number as output, usually around 80,000.
So the streams are slower in terms of execution time.
When the array size is bigger, though, the difference between streams and loops becomes smaller.
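As an aside, if the extra passes (and the throwaway arrays from toArray()) bother you, count() avoids materializing anything, and a single indexed pass can gather all the tallies at once. A sketch under the question's assumptions (the data and conditions here are placeholders):

import java.util.Arrays;

class Counts {
    public static void main(String[] args) {
        int[] someArray = {1, 2, 3, 500, 501, 999}; // placeholder data

        // Stream version without the throwaway array:
        long evens = Arrays.stream(someArray).filter(x -> x % 2 == 0).count();

        // Single pass, all conditions at once:
        int sizeA = 0, sizeB = 0, sizeC = 0;
        for (int x : someArray) {
            if (x % 2 == 0) sizeA++; // even
            if (x % 2 != 0) sizeB++; // odd
            if (x > 500) sizeC++;    // greater than 500
        }
        System.out.println(evens + " " + sizeA + " " + sizeB + " " + sizeC);
    }
}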
I want to analyze moving 1 2 3 4 to 3 1 2 4 (list of integers) using LinkedList or ArrayList.
What I have done:
aux = arraylist.get(2); // O(1)
arraylist.remove(2); // O(n)
arraylist.add(0, aux); // O(n), shifting elements up.
aux = linkedlist.get(2); // O(n)
linkedlist.remove(2); // O(n)
linkedlist.addFirst(aux); // O(1)
So, in this case, can we say that they are the same or am I missing something?
You can indeed say this specific operation takes O(n) time for both a LinkedList and an ArrayList.
But you can not say that they take the same amount of real time as a result of this.
Big-O notation only tells you how the running time scales as the input gets larger; it ignores constant factors. So an O(n) algorithm can take 1*n operations, or 100000*n operations, or a whole lot more, and each of those operations can take a greatly varying amount of time. This also means Big-O is generally a pretty inaccurate measure of performance for small inputs, since constant factors can have a bigger effect on the running time than the size of a small input (but performance differences tend to be less important for small inputs anyway).
See also: What is a plain English explanation of "Big O" notation?
Here's a quick and dirty benchmark. I timed both operations, repeated ten million times:
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedList;

public class Main {
    public static void main(String[] args) {
        int arrayListTime = 0;
        int linkedListTime = 0;
        int n = 10000000;

        for (int i = 0; i < n; i++) {
            ArrayList<Integer> A = new ArrayList<>(Arrays.asList(1, 2, 3, 4));
            long startTime = System.currentTimeMillis();
            int x = A.remove(2);
            A.add(0, x);
            long endTime = System.currentTimeMillis();
            arrayListTime += (endTime - startTime);

            LinkedList<Integer> L = new LinkedList<>(Arrays.asList(1, 2, 3, 4));
            long startTime2 = System.currentTimeMillis();
            int x2 = L.remove(2);
            L.addFirst(x2);
            long endTime2 = System.currentTimeMillis();
            linkedListTime += (endTime2 - startTime2);
        }

        System.out.println(arrayListTime);
        System.out.println(linkedListTime);
    }
}
The difference is pretty small. My output was:
424
363
So using LinkedList was only 61 ms faster over the course of 10,000,000 operations.
Big O complexity means nothing when the input is very small, as in your case. Recall that the definition of big O only holds for "large enough n", and I doubt you have reached that threshold for n == 4.
If this runs in a tight loop with a small data size (and with different data each time), the only thing that will actually matter is cache performance, and the ArrayList solution is much more cache-friendly than the LinkedList solution, since each Node in a linked list is likely to require a cache miss.
In this case, I would even prefer using a raw array (int[]), to avoid the redundant wrapper objects, which trigger more cache misses.
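For example, a raw-array version of the same move (my own sketch, not from the question) needs no boxing and touches one contiguous block of memory:

class MoveToFront {
    // Move the element at index 'from' to the front, shifting the rest right.
    static void moveToFront(int[] a, int from) {
        int tmp = a[from];
        System.arraycopy(a, 0, a, 1, from); // shift a[0..from-1] one slot right
        a[0] = tmp;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4};
        moveToFront(a, 2);
        System.out.println(java.util.Arrays.toString(a)); // prints [3, 1, 2, 4]
    }
}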
(Apologies in advance for the descriptive question)
I have to take integers as values for two variables, and keep taking integer values for as long as I like, from the console. At any point of time, I might ask for the sum of the integers entered so far for either variable. For example, for variable A I have entered 5, 5, 4 and for variable B I have entered 2, 3, 6 (these inputs are taken from the console at different times, not at the same time). At any point of time I would need to display the sum of A (which is 14) and the sum of B (which is 11).
What I have thought is to declare two ArrayLists: one each for A and B. As I get the values for A and B, I would keep adding them to their respective ArrayLists. Whenever there is a demand for displaying the sum of A, I would add up the values in the ArrayList for A and display it. Same for B.
However, I have a doubt that this approach may be inefficient. Because if I enter one thousand integers for variable A, then the ArrayList would become 1000 indexes long, would consume too much memory, and summing would be too much work. Can I achieve what I want without using an ArrayList? Some other elegant solution?
You have doubt about 1000 integers in a list, and the time it takes to sum them? Let me remove all your doubts: You don't have a problem.
Try this code:
long start = System.currentTimeMillis();
ArrayList<Integer> list = new ArrayList<>();
for (int i = 0; i < 1000000; i++)
    list.add(i);
int sum = 0; // note: the sum of 0..999999 overflows int, hence the odd-looking output
for (int num : list)
    sum += num;
long end = System.currentTimeMillis();
System.out.printf("Sum of %d values is %d%n", list.size(), sum);
System.out.printf("Built and calculated in %.3f seconds%n", (end - start) / 1000.0);
It builds an ArrayList<Integer> with one million values, not a measly 1000, then sums them. Output is:
Sum of 1000000 values is 1783293664
Built and calculated in 0.141 seconds
So, a million in way less than one second. 1000? Not an issue!
It sounds like you are trying to do something like:
(in Java)
private int a = 0;

public void add_to_A(int input)
{
    a += input;
}

public int a_sum()
{
    return a;
}
As noted, an ArrayList would require O(n) space, and answering a sum query would take O(n) time at any given point in the input. If you received a pathological input, asking for the sum at every step, this would require O(n^2) time, which wouldn't be good for this problem.
As other people have noted, memory isn't going to be an issue for 1000 integers, BUT for the sake of knowledge it is important to note that if we increase the size of the input a few orders of magnitude and give it a bad input (ask for the sum of A at every step), the run time will blow up quickly. Going by the statistic Andreas submitted, if the array algorithm runs in a few milliseconds for 1000 inputs, the same algorithm running on a million inputs would be closer to a few days/weeks for an unlucky run.
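To make the constant-space idea concrete for both variables, here is a sketch (my own generalization, not part of the original answer) that keeps one running sum per variable, so every update and every query is O(1):

import java.util.HashMap;
import java.util.Map;

class RunningSums {
    private final Map<String, Long> sums = new HashMap<>();

    void add(String variable, int value) {
        sums.merge(variable, (long) value, Long::sum);
    }

    long sum(String variable) {
        return sums.getOrDefault(variable, 0L);
    }

    public static void main(String[] args) {
        RunningSums s = new RunningSums();
        s.add("A", 5); s.add("A", 5); s.add("A", 4);
        s.add("B", 2); s.add("B", 3); s.add("B", 6);
        System.out.println(s.sum("A")); // 14
        System.out.println(s.sum("B")); // 11
    }
}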
This is the C program, under Linux/GNU:
#include <stdio.h>
#include <sys/time.h>

#define Max 1024*1024

int main()
{
    struct timeval start, end;
    long int dis;
    int i;
    int m = 0;
    int a[Max];

    gettimeofday(&start, NULL);
    for (i = 0; i < Max; i += 1) {
        a[Max] *= 3;
    }
    gettimeofday(&end, NULL);
    dis = end.tv_usec - start.tv_usec;
    printf("time1: %ld\n", dis);

    gettimeofday(&start, NULL);
    for (i = 0; i < Max; i += 16) {
        a[Max] *= 3;
    }
    gettimeofday(&end, NULL);
    dis = end.tv_usec - start.tv_usec;
    printf("time2: %ld\n", dis);

    return 0;
}
the output:
time1: 7074
time2: 234
It's a big difference.
this Java program:
public class Cache1 {
    public static void main(String[] args) {
        int a[] = new int[1024 * 1024 * 64];

        long time1 = System.currentTimeMillis();
        for (int i = 0; i < a.length; i++) {
            a[i] *= 3;
        }
        long time2 = System.currentTimeMillis();
        System.out.println(time2 - time1);

        time1 = System.currentTimeMillis();
        for (int i = 0; i < a.length; i += 16) {
            a[i] *= 3;
        }
        time2 = System.currentTimeMillis();
        System.out.println(time2 - time1);
    }
}
the output:
92
82
It's nearly the same.
Both programs should benefit from the CPU cache, so why is there such a big difference between them? Is the CPU cache ineffective in C programming?
I hope you realize that the units of time differ between those tests by a factor of 10^3 (the C code prints microseconds, the Java code milliseconds). The C code is an order of magnitude faster than the Java code.
In the C code there should be a[i] instead of a[Max].
As for the cache: since you access only one memory location in your C code (which also triggers undefined behavior, since a[Max] is out of bounds), your C test is completely invalid.
And even if it were correct, your method is flawed. It is quite possible that the multiplication operations, and even the whole loops, were skipped completely by the C compiler, since nothing depends on their outcome.
The result where first run takes long, and the second takes less time, is expected. Data has to be loaded to cache anyway, and that takes time. Once it is loaded, operations on that data take less time.
Java may either not use cache at all (not likely) or preload the whole array to cache even before the loops are executed. That would explain equal execution times.
You have three cache sizes; these are typically:
L1: 32 KB (data), 4 clock cycles
L2: 256 KB, 10-11 clock cycles
L3: 3-24 MB, 40-75 clock cycles
Anything larger than this will not fit in the cache; if you just scroll through memory, it will be as if the caches are not there.
I suggest you write a test which empirically works out the CPU cache sizes, as a good exercise to help you understand this. BTW, you don't need to use *= to exercise the cache, as this exercises the ALU. Perhaps there is a simpler operation you can use ;)
In the case of your Java code, most likely it has not been JIT-compiled yet, so you are seeing the speed of the interpreter, not the memory accesses.
I suggest you run the test repeatedly on smaller memory sizes for at least 2 seconds and take the average.
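A rough sketch of such an empirical probe (my own; the sizes and strides are arbitrary choices): it times a fixed number of strided reads while the working set grows, and the ns/read figure should step up each time the array spills out of a cache level.

public class CacheProbe {
    public static void main(String[] args) {
        final int reads = 1 << 26; // fixed amount of work per array size
        for (int kb = 16; kb <= 32 * 1024; kb *= 2) {
            int[] a = new int[kb * 1024 / 4];
            long sum = 0;
            long start = System.nanoTime();
            for (int r = 0, i = 0; r < reads; r++) {
                sum += a[i];  // a plain read; no *= needed to exercise the cache
                i += 16;      // step one 64-byte cache line at a time
                if (i >= a.length) i = 0;
            }
            long ns = System.nanoTime() - start;
            System.out.printf("%6d KB: %.2f ns/read (sum=%d)%n",
                    kb, (double) ns / reads, sum);
        }
    }
}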
I don't use Java much.
I am writing some optimized math code and I was shocked by my profiler results. My code collects values, interleaves the data and then chooses the values based on that. Java runs slower than my C++ and MATLAB implementations.
I am using javac 1.7.0_05 (the Sun/Oracle JDK).
There exists a floor function that performs a relevant task in the code.
Does anybody know of the paradigmatic way to fix this?
I noticed that my floor() function is defined with something called StrictMath. Is there something like -ffast-math for Java? I am expecting there must be a way to change the floor function to something more computationally reasonable without writing my own.
public static double floor(double a) {
return StrictMath.floor(a); // default impl. delegates to StrictMath
}
Edit
So a few people suggested I try a cast. I tried this and there was absolutely no change in wall time.
private static int flur(float dF)
{
return (int) dF;
}
413742 cast floor function
394675 Math.floor
These tests were run without the profiler. An effort was made to use a profiler, but the runtime was drastically altered (15+ minutes, so I quit).
You might want to give FastMath a try.
Here is a post about the performance of Math in Java vs. JavaScript. There are a few good hints about why the default math lib is slow. They are discussing other operations than floor, but I guess their findings can be generalized. I found it interesting.
EDIT
According to this bug entry, floor has been implemented in pure Java code in 7 (b79) and 6u21 (b01), resulting in better performance. The code of floor in JDK 6 is still a bit longer than the one in FastMath, but might not be responsible for such a performance degradation. What JDK are you using? Could you try with a more recent version?
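For reference, a minimal usage sketch, assuming Apache Commons Math 3 (artifact commons-math3) is on the classpath:

import org.apache.commons.math3.util.FastMath;

class FastMathFloorDemo {
    public static void main(String[] args) {
        System.out.println(FastMath.floor(3.7)); // intended as a drop-in replacement for Math.floor
    }
}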
Here's a sanity check for your hypothesis that the code is really spending 99% of its time in floor. Let's assume that you have Java and C++ versions of the algorithm that are both correct in terms of the outputs they produce. For the sake of the argument, let us assume that the two versions call the equivalent floor functions the same number of times. So a time function is
t(input) = nosFloorCalls(input) * floorTime + otherTime(input)
where floorTime is the time taken for a call to floor on the platform.
Now if your hypothesis is correct, and floorTime is vastly more expensive on Java (to the extent that it takes roughly 99% of the execution time) then you would expect the Java version of the application to run a large factor (50 times or more) slower than the C++ version. If you don't see this, then your hypothesis most likely is false.
If the hypothesis is false, here are two alternative explanations for the profiling results.
This is a measurement anomaly; i.e. the profiler has somehow got it wrong. Try using a different profiler.
There is a bug in the Java version of your code that is causing it to call floor many, many more times than in the C++ version of the code.
Math.floor() is insanely fast on my machine at around 7 nanoseconds per call in a tight loop. (Windows 7, Eclipse, Oracle JDK 7). I'd expect it to be very fast in pretty much all circumstances and would be extremely surprised if it turned out to be the bottleneck.
Some ideas:
I'd suggest re-running some benchmarks without a profiler running. It sometimes happens that profilers create spurious overhead when they instrument the binary - particularly for small functions like Math.floor() that are likely to be inlined.
Try a couple of different JVMs, you might have hit an obscure bug
Try the FastMath class in the excellent Apache Commons Math library, which includes a new implementation of floor. I'd be really surprised if it is faster, but you never know.
Check you are not running any virtualisation technology or similar that might be interfering with Java's ability to call native code (which is used in a few of the java.lang.Math functions, including Math.floor())
It is worth noting that monitoring a method takes some overhead, and in the case of VisualVM this is fairly high. If you have a method which is called often but does very little, it can appear to use lots of CPU; e.g., I have seen Integer.hashCode() as a big hitter once. ;)
On my machine a floor takes 5.6 ns, but a cast takes 2.3 ns. You might like to try this on your machine.
Unless you need to handle corner cases, a plain cast is faster.
// Rounds to zero, instead of negative infinity.
public static double floor(double a) {
    return (long) a;
}

public static void main(String... args) {
    int size = 100000;
    double[] a = new double[size];
    double[] b = new double[size];
    double[] c = new double[size];
    for (int i = 0; i < a.length; i++) a[i] = Math.random() * 1e6;
    for (int i = 0; i < 5; i++) {
        timeCast(a, b);
        timeFloor(a, c);
        for (int j = 0; j < size; j++)
            if (b[j] != c[j])
                System.err.println(a[j] + ": " + b[j] + " " + c[j]);
    }
}

private static void timeCast(double[] from, double[] to) {
    long start = System.nanoTime();
    for (int i = 0; i < from.length; i++)
        to[i] = floor(from[i]);
    long time = System.nanoTime() - start;
    System.out.printf("Cast took an average of %.1f ns%n", (double) time / from.length);
}

private static void timeFloor(double[] from, double[] to) {
    long start = System.nanoTime();
    for (int i = 0; i < from.length; i++)
        to[i] = Math.floor(from[i]);
    long time = System.nanoTime() - start;
    System.out.printf("Math.floor took an average of %.1f ns%n", (double) time / from.length);
}
prints
Cast took an average of 62.1 ns
Math.floor took an average of 123.6 ns
Cast took an average of 61.9 ns
Math.floor took an average of 6.3 ns
Cast took an average of 47.2 ns
Math.floor took an average of 6.5 ns
Cast took an average of 2.3 ns
Math.floor took an average of 5.6 ns
Cast took an average of 2.3 ns
Math.floor took an average of 5.6 ns
First of all: your profiler shows that you're spending 99% of the CPU time in the floor function. This does not indicate that floor is slow; if you do almost nothing but floor(), that's totally sane. Since other languages seem to implement floor more efficiently, your assumption may be correct, however.
I know from school that a naive implementation of floor (which works only for positive numbers and is one off for negative ones) can be done by casting to an integer/long. That is language-agnostic and some sort of general knowledge from CS courses.
Here are some micro benches. Works on my machine and backs what I learned in school ;)
rataman@RWW009 ~/Desktop
$ javac Cast.java && java Cast
10000000 Rounds of Casts took 16 ms

rataman@RWW009 ~/Desktop
$ javac Floor.java && java Floor
10000000 Rounds of Floor took 140 ms
// Cast.java and Floor.java are identical apart from the body of floor(...) below.
public class Cast /* or Floor */ {

    private static final int ROUNDS = 10000000;

    public static void main(String[] args)
    {
        double[] vals = new double[ROUNDS];
        double[] res = new double[ROUNDS];

        // awesome testdata
        for (int i = 0; i < ROUNDS; i++)
        {
            vals[i] = Math.random() * 10.0;
        }

        // warmup
        for (int i = 0; i < ROUNDS; i++)
        {
            res[i] = floor(vals[i]);
        }

        long start = System.currentTimeMillis();
        for (int i = 0; i < ROUNDS; i++)
        {
            res[i] = floor(vals[i]);
        }
        System.out.println(ROUNDS + " Rounds of Casts took " + (System.currentTimeMillis() - start) + " ms");
    }

    private static double floor(double arg)
    {
        // Floor.java:
        return Math.floor(arg);
        // or Cast.java:
        // return (int) arg;
    }
}
Math.floor (and Math.ceil) can be a surprising bottleneck if your algorithm depends on it a lot.
This is because these functions handle edge cases that you might not care about (such as negative zero and positive zero). Just look at the implementation of these functions to see what they're actually doing; there's a surprising amount of branching in there.
Also consider that Math.floor/ceil take only a double as an argument and return a double, which you might not want. If you just want an int or long, some of the checks in Math.floor are simply unnecessary.
Some have suggested simply casting to an int, which will work as long as your values are positive (and your algorithm doesn't depend on the edge cases that Math.floor checks for). If that's the case, a simple cast is the fastest solution by quite a margin (in my experience).
If for example your values can be negative and you want an int from a float, you can do something like this:
public static final int floor(final float value) {
return ((int) value) - (Float.floatToRawIntBits(value) >>> 31);
}
(It just subtracts the float's sign bit from the truncated cast, which avoids an "if". Note that for exact negative integers such as -2.0f this yields one less than Math.floor, so it only matches floor when negative inputs are never whole numbers.)
In my experience, this is a lot faster than Math.floor. If it isn't, I suggest you check your algorithm, or perhaps you've run into a JVM performance bug (which is much less likely).
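If you go this route, it is worth checking the trick against Math.floor on your own value range first. A quick harness (mine, not from the answer above):

public class FloorTrickCheck {
    static int fastFloor(float value) {
        return ((int) value) - (Float.floatToRawIntBits(value) >>> 31);
    }

    public static void main(String[] args) {
        float[] samples = {2.5f, 0.0f, -0.75f, -2.5f, -2.0f};
        for (float v : samples) {
            System.out.printf("%6.2f -> trick=%d  Math.floor=%d%n",
                    v, fastFloor(v), (int) Math.floor(v));
        }
        // -2.0f prints trick=-3 vs Math.floor=-2: the sign bit is subtracted
        // even when the value is already integral.
    }
}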