As a test of Java 8's new implementation of streams and automatic parallelization, I ran the following simple test:
ArrayList<Integer> nums = new ArrayList<>();
for (int i=1; i<49999999; i++) nums.add(i);
int sum=0;
double begin, end;
begin = System.nanoTime();
for (Integer i : nums) sum += i;
end = System.nanoTime();
System.out.println( "1 core: " + (end-begin) );
begin = System.nanoTime();
sum = nums.parallelStream().reduce(0, Integer::sum);
end = System.nanoTime();
System.out.println( "8 cores: " + (end-begin) );
I thought summing up a series of integers would be able to take great advantage of all 8 cores, but the output looks like this:
1 core: 1.70552398E8
8 cores: 9.938507635E9
I am aware that nanoTime() has issues in multicore systems, but I doubt that's the problem here since I'm off by an order of magnitude.
Is the operation I'm performing so simple that the overhead required for reduce() is overcoming the advantage of multiple cores?
Your stream example has 2 unboxings (Integer.sum(int,int)) and one boxing (the resulting int has to be converted back to an Integer) for every number whereas the for loop has only one unboxing. So the two are not comparable.
When you plan to do calculations with Integers it's best to use an IntStream:
nums.stream().mapToInt(i -> i).sum();
That would give you a performance similar to that of the for loop. A parallel stream is still slower on my machine.
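If you still want a parallel version without the boxing overhead, a minimal sketch (reusing the same nums list) would be:
nums.parallelStream().mapToInt(i -> i).sum();
For a workload this cheap, though, the splitting and merging overhead may still outweigh the gain.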
The fastest alternative would be this btw:
IntStream.range(1, 49999999).sum();
An order of magnitude faster, and without the overhead of building a list first. It's only an alternative for this special use case, of course, but it demonstrates that it pays off to rethink an existing approach instead of merely "adding a stream".
To properly compare this at all, you need to use similar overheads for both operations.
ArrayList<Integer> nums = new ArrayList<>();
for (int i = 1; i < 49999999; i++)
nums.add(i);
int sum = 0;
long begin, end;
begin = System.nanoTime();
sum = nums.stream().reduce(0, Integer::sum);
end = System.nanoTime();
System.out.println("1 core: " + (end - begin));
begin = System.nanoTime();
sum = nums.parallelStream().reduce(0, Integer::sum);
end = System.nanoTime();
System.out.println("8 cores: " + (end - begin));
This lands me
1 core: 769026020
8 cores: 538805164
which is in fact quicker for parallelStream(). (Note: I only have 4 cores, but parallelStream() does not always use all of your cores anyway.)
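As an aside, parallel streams run on the common ForkJoinPool, whose parallelism is typically one less than the number of cores because the calling thread participates as well. A quick way to check (a small sketch, not part of the original test):
System.out.println("cores: " + Runtime.getRuntime().availableProcessors());
System.out.println("common pool parallelism: " + java.util.concurrent.ForkJoinPool.commonPool().getParallelism());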
Another thing is boxing and unboxing. There is boxing for nums.add(i), and unboxing for everything going into Integer::sum which takes two ints. I converted this test to an array to remove that:
int[] nums = new int[49999999];
System.err.println("adding numbers");
for (int i = 1; i < 49999999; i++)
nums[i - 1] = i;
int sum = 0;
System.err.println("begin");
long begin, end;
begin = System.nanoTime();
sum = Arrays.stream(nums).reduce(0, Integer::sum);
end = System.nanoTime();
System.out.println("1 core: " + (end - begin));
begin = System.nanoTime();
sum = Arrays.stream(nums).parallel().reduce(0, Integer::sum);
end = System.nanoTime();
System.out.println("8 cores: " + (end - begin));
And that gives an unexpected timing:
1 core: 68050642
8 cores: 154591290
It is much faster (1-2 orders of magnitude) for the linear reduce with plain ints, while the parallel reduce takes only about a quarter of its earlier (boxed) time and yet still ends up slower than the sequential version. I'm not sure why that is, but it is certainly interesting!
I did some profiling, and it turns out that the fork() method used by parallel streams is very expensive because of its use of ThreadLocalRandom, which calls upon the network interfaces for its seed! This is very slow and is the only reason why parallelStream() is slower than stream() here!
Some of my VisualVM data: (ignore the await() time, that's the method I used so I could track the program)
For first example: https://www.dropbox.com/s/z7qf2es0lxs6fvu/streams1.nps?dl=0
For second example: https://www.dropbox.com/s/f3ydl4basv7mln5/streams2.nps?dl=0
TL;DR: In your Integer case it looks like parallel wins, but there is some overhead for the int case that makes parallel slower.
Related
I am trying to write a method comparing the run times of four different sorting algorithms (mergesort, quicksort, heapsort, insertionsort). I am trying to time each algorithm with each iteration of the for loop, which increases the array size of a randomly generated array each loop. The code that I have works but takes way too much time. At the end of the loop, I am calculating the average time each sorting algorithm took over the arrays from size 1 to 100.
Note: generateArray(num) just creates an array filled with random integers of size num.
I'm not such a great coder, what would be a way to implement it better?
Here's the code snippet:
static void testArrays() {
ThreadMXBean bean = ManagementFactory.getThreadMXBean();
long insertionCount=0, mergeCount=0,
quickCount=0, heapCount = 0;
for (int i=1; i<100; i++) {
long[] A = generateArray(i);
long[] copy1 = A.clone();
long[] copy2 = A.clone();
long[] copy3 = A.clone();
long startTime1 = bean.getCurrentThreadCpuTime();
insertionSort(A);
long endTime1 = bean.getCurrentThreadCpuTime();
long duration = endTime1 - startTime1;
insertionCount = insertionCount + duration;
long startTime2 = bean.getCurrentThreadCpuTime();
mergeSort(copy1);
long endTime2 = bean.getCurrentThreadCpuTime();
long mergeDuration = endTime2 - startTime2;
mergeCount = mergeCount + mergeDuration;
long startTime3 = bean.getCurrentThreadCpuTime();
heapSort(copy2);
long endTime3 = bean.getCurrentThreadCpuTime();
long heapDuration = endTime3 - startTime3;
heapCount = heapCount + heapDuration;
long startTime4 = bean.getCurrentThreadCpuTime();
quickSort(copy3);
long endTime4 = bean.getCurrentThreadCpuTime();
long quickDuration = endTime4 - startTime4;
quickCount = quickCount + quickDuration;
}
long averageIS = insertionCount / 99; // 99 runs: array sizes 1 to 99
long averageMS = mergeCount / 99;
long averageHS = heapCount / 99;
long averageQS = quickCount / 99;
System.out.println("Insertion Sort Avg: " + averageIS);
System.out.println("MergeSort Avg: " + averageMS);
System.out.println("HeapSort Avg: " + averageHS);
System.out.println("QuickSort Avg: " + averageQS);
}
Sorting algorithms fall into two broad families: comparison sorts and distribution sorts.
A comparison sort orders the data by repeatedly comparing two key values at a time and exchanging them when they are out of order.
A distribution sort divides the data into several subsets based on key values and sorts each subset to obtain the sorted whole.
It is generally known that Quick Sort is the fastest in practice. Its O(n²) worst case occurs when the pivot is always the minimum or maximum value; to avoid this, pick pivots randomly or use median-of-three partitioning. On average it gives the best performance.
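For illustration, a hedged sketch of median-of-three pivot selection (my own code, using long[] to match the arrays in the question above; not taken from it):
static int medianOfThree(long[] a, int lo, int hi) {
    // order a[lo], a[mid], a[hi] so that a[lo] <= a[mid] <= a[hi];
    // the middle element is then a decent pivot even for already-sorted input
    int mid = lo + (hi - lo) / 2;
    if (a[mid] < a[lo]) { long t = a[mid]; a[mid] = a[lo]; a[lo] = t; }
    if (a[hi] < a[lo]) { long t = a[hi]; a[hi] = a[lo]; a[lo] = t; }
    if (a[hi] < a[mid]) { long t = a[hi]; a[hi] = a[mid]; a[mid] = t; }
    return mid; // partition around a[mid]
}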
Insertion Sort is the fastest for data that is already sorted: each element is then compared only against its immediate predecessor.
The time complexity of the algorithms described above is as follows.
O(n²) (worst case): Bubble Sort, Selection Sort, Insertion Sort, Shell Sort, Quick Sort
O(n log n): Heap Sort, Merge Sort
O(kn): Radix Sort, where k is the key length (the number of digits or passes); it performs well for fixed-width keys such as 4-byte integers with few digits.
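A hedged sketch of what an O(kn) LSD radix sort looks like for non-negative 32-bit ints, doing one stable counting pass per byte (k = 4 passes; illustrative code of my own, not from the question):
import java.util.Arrays;

public class RadixSortSketch {
    static void radixSort(int[] a) {
        int[] out = new int[a.length];
        for (int shift = 0; shift < 32; shift += 8) {
            int[] count = new int[257];
            for (int v : a) count[((v >>> shift) & 0xFF) + 1]++;     // histogram of this byte
            for (int i = 0; i < 256; i++) count[i + 1] += count[i];  // prefix sums = start positions
            for (int v : a) out[count[(v >>> shift) & 0xFF]++] = v;  // stable scatter
            System.arraycopy(out, 0, a, 0, a.length);
        }
    }

    public static void main(String[] args) {
        int[] a = {170, 45, 75, 90, 802, 24, 2, 66};
        radixSort(a);
        System.out.println(Arrays.toString(a)); // [2, 24, 45, 66, 75, 90, 170, 802]
    }
}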
I am taking a Java course in university and my notes give me 3 methods for calculating the sum of an ArrayList. First using iteration, second using recursion, and third using array split combine with recursion.
My question is how do I test the efficiency of these algorithms? As it is, I think the number of steps it takes for the algorithm to compute the value is what tells you the efficiency of the algorithm.
My Code for the 3 algorithms:
import java.util.ArrayList;
public class ArraySumTester {
static int steps = 1;
public static void main(String[] args) {
ArrayList<Integer> numList = new ArrayList<Integer>();
numList.add(1);
numList.add(2);
numList.add(3);
numList.add(4);
numList.add(5);
System.out.println("------------------------------------------");
System.out.println("Recursive array sum = " + ArraySum(numList));
System.out.println("------------------------------------------");
steps = 1;
System.out.println("Iterative array sum = " + iterativeSum(numList));
System.out.println("------------------------------------------");
steps = 1;
System.out.println("Array sum using recursive array split : " + sumArraySplit(numList));
}
static int ArraySum(ArrayList<Integer> list) {
return sumHelper(list, 0);
}
static int sumHelper(ArrayList<Integer> list, int start) {
// System.out.println("Start : " + start);
System.out.println("Rescursive step : " + steps++);
if (start >= list.size())
return 0;
else
return list.get(start) + sumHelper(list, start + 1);
}
static int iterativeSum(ArrayList<Integer> list) {
int sum = 0;
for (Integer item : list) {
System.out.println("Iterative step : " + steps++);
sum += item;
}
return sum;
}
static int sumArraySplit(ArrayList<Integer> list) {
int start = 0;
int end = list.size();
int mid = (start + end) / 2;
System.out.println("Rescursive step : " + steps++);
//System.out.println("Start : " + start + ", End : " + end + ", Mid : " + mid);
//System.out.println(list);
if (list.size() <= 1)
return list.get(0);
else
return sumArraySplit(new ArrayList<Integer>(list.subList(0, mid)))
+ sumArraySplit(new ArrayList<Integer>(list.subList(mid,
end)));
}
}
Output:
------------------------------------------
Recursive step : 1
Recursive step : 2
Recursive step : 3
Recursive step : 4
Recursive step : 5
Recursive step : 6
Recursive array sum = 15
------------------------------------------
Iterative step : 1
Iterative step : 2
Iterative step : 3
Iterative step : 4
Iterative step : 5
Iterative array sum = 15
------------------------------------------
Recursive step : 1
Recursive step : 2
Recursive step : 3
Recursive step : 4
Recursive step : 5
Recursive step : 6
Recursive step : 7
Recursive step : 8
Recursive step : 9
Array sum using recursive array split : 15
Now, from the above output, the recursive array-split algorithm takes the most steps; however, according to my notes it is as efficient as the iterative algorithm. So which is incorrect, my code or my notes?
Do you just want to look at speed of execution? If so, you'll want to look at microbenchmarking:
How do I write a correct micro-benchmark in Java?
Essentially, because of how the JVM and modern processors work, you won't get consistent results by running something a million times in a for loop and measuring the execution speed with a system timer.
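A minimal sketch of the usual warm-up-then-measure pattern (illustrative only; a proper harness such as JMH handles warm-up, dead-code elimination and statistics for you):
Runnable task = () -> { /* code under test */ };
for (int i = 0; i < 10000; i++) task.run();   // warm-up: let the JIT compile the hot path
long start = System.nanoTime();
for (int i = 0; i < 10000; i++) task.run();   // measure only after warm-up
System.out.println("avg per run: " + (System.nanoTime() - start) / 10000 + " ns");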
That said, "efficiency" can also mean other things like memory consumption. For instance, any recursive method runs a risk of a stack overflow, the issue this site is named after :) Try giving that ArrayList tens of thousands of elements and see what happens.
Using System.currentTimeMillis() is the way to go. Define a start variable before your code and an end variable after it completes. The difference of these will be the time elapsed for your program to execute. The shortest time will be the most efficient.
long start = System.currentTimeMillis();
// Program to test
long end = System.currentTimeMillis();
long diff = end - start;
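If you need finer resolution, the same pattern works with System.nanoTime(), which is meant for measuring elapsed time; a minimal sketch:
long start = System.nanoTime();
// Program to test
long elapsedNanos = System.nanoTime() - start;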
I suggest that you look at the running time and space complexity (these are more computer sciencey names for efficiency) of these algorithms in the abstract. This is what the so-called Big-Oh notation is for.
To be exact, of course, after making the implementations as tight and side-effect-free as possible, you should consider writing microbenchmarks.
Since you have to read the value of every element of the list in order to sum them up, no algorithm is going to perform better than a (linear) O(n) time, O(1) space algorithm (which is what your iterative algorithm is) in the general case, i.e. without any other assumptions. Here n is the size of the input (the number of elements in the list). Such an algorithm is said to have linear time and constant space complexity: its running time grows with the size of the list, but it needs only a fixed, constant amount of additional memory to do its job.
The other two recursive algorithms can, at best, perform as well as this simple algorithm, because the iterative algorithm does not have any of the complications (additional memory on the stack, for instance) that recursive algorithms suffer from.
This gets reflected into what are called the constant terms of the algorithms that have the same O(f(n)) running time. For instance, if you somehow found an algorithm which examines roughly half the elements of a list to solve a problem, whereas another algorithm must see all the elements, then, the first algorithm has better constant terms than the second and is expected to beat it in practice, although both these algorithms have a time complexity of O(n).
Now, it is quite possible to parallelize the solution to this problem by splitting the giant list into smaller lists (you can achieve the effect via indexes into a single list) and then use a parallel summing operation which may beat other algorithms if the list is sufficiently long. This is because each non-overlapping interval can be summed up in parallel (at the same time) and you'd sum the partial sums up in the end. But this is not a possibility we are considering in the current context.
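With Java 8 parallel streams that splitting can be expressed directly; a hedged sketch over index ranges into a single list (class and method names are my own, purely illustrative):
import java.util.ArrayList;
import java.util.List;
import java.util.stream.IntStream;

public class ParallelSumSketch {
    // Non-overlapping index intervals are summed in parallel and the
    // partial sums are combined at the end by the framework.
    static long parallelSum(List<Integer> list) {
        return IntStream.range(0, list.size())
                .parallel()
                .mapToLong(i -> list.get(i))
                .sum();
    }

    public static void main(String[] args) {
        List<Integer> nums = new ArrayList<>();
        for (int i = 1; i <= 1000; i++) nums.add(i);
        System.out.println(parallelSum(nums)); // 500500
    }
}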
I would say to use the Guava Google Core Libraries For Java Stopwatch. Example:
import com.google.common.base.Stopwatch;
import java.util.concurrent.TimeUnit;

Stopwatch stopwatch = Stopwatch.createStarted();
// TODO: Your tests here
long elapsedTime = stopwatch.stop().elapsed(TimeUnit.MILLISECONDS);
You get the elapsed time in whatever unit you need, and you don't need any extra calculations.
If you want to consider efficiency then you really need to look at algorithm structure rather than timing.
Load the sources for the methods you are using, dive into the structure and look for looping - that will give you the correct measure of efficiency.
Why might this code
long s, e, sum1 = 0, sum2 = 0, TRIALS = 10000000;
for(long i=0; i<TRIALS; i++) {
s = System.nanoTime();
e = System.nanoTime();
sum1 += e - s;
s = System.nanoTime();
e = System.nanoTime();
sum2 += e - s;
}
System.out.println(sum1 / TRIALS);
System.out.println(sum2 / TRIALS);
produce this result
-60
61
on my machine?
EDIT:
Sam I am's answer points to the nanoTime() documentation, which helps, but now, more precisely, why does the result consistently favor the first sum?
"my machine":
JavaSE-1.7, Eclipse
Win 7 x64, AMD Athlon II X4 635
switching the order inside the loop produces reverse results
for(int i=0; i<TRIALS; i++) {
s = System.nanoTime();
e = System.nanoTime();
sum2 += e - s;
s = System.nanoTime();
e = System.nanoTime();
sum1 += e - s;
}
61
-61
Storing (e-s) in a temporary variable before adding it to sum1 makes sum1 positive.
for(long i=0; i<TRIALS; i++) {
s = System.nanoTime();
e = System.nanoTime();
temp = e-s;
if(temp < 0)
count++;
sum1 += temp;
s = System.nanoTime();
e = System.nanoTime();
sum2 += e - s;
}
61
61
And as Andrew Alcock points out, sum1 += -s + e produces the expected outcome.
for(long i=0; i<TRIALS; i++) {
s = System.nanoTime();
e = System.nanoTime();
sum1 += -s + e;
s = System.nanoTime();
e = System.nanoTime();
sum2 += -s + e;
}
61
61
A few other tests: http://pastebin.com/QJ93NZxP
This answer is supposition. If you update your question with some details about your environment, it's likely that someone else can give a more detailed, grounded answer.
The nanoTime() function works by accessing some high-resolution timer with low access latency. On the x86, I believe this is the Time Stamp Counter, which is driven by the basic clock cycle of the machine.
If you're seeing consistent results of +/- 60 ns, then I believe you're simply seeing the basic interval of the timer on your machine.
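One way to eyeball that interval on a given machine (a small sketch of my own, not from the original tests):
long minStep = Long.MAX_VALUE;
for (int i = 0; i < 1000000; i++) {
    long a = System.nanoTime();
    long b = System.nanoTime();
    if (b > a) minStep = Math.min(minStep, b - a);  // smallest positive difference seen
}
System.out.println("smallest observed nanoTime() step: " + minStep + " ns");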
However, what about the negative numbers? Again, supposition, but if you read the Wikipedia article, you'll see a comment that Intel processors might re-order the instructions.
In conjunction with roundar, we ran a number of tests on this code. In summary, the effect disappeared when:
Running the same code in interpreted mode (-Xint)
Changing the aggregation logic order from sum += e - s to sum += -s + e
Running on some different architectures or different VMs (eg I ran on Java 6 on Mac)
Placing logging statements inspecting s and e
Performing additional arithmetic on s and e
In addition, the effect is not a threading issue:
There are no additional threads spawned
Only local variables are involved
This effect is 100% reproducible in roundar's environment, and always results in precisely the same timings, namely +61 and -61.
The effect is not a timing issue because:
The execution takes place over 10m iterations
This effect is 100% reproducible in roundar's environment
The result is precisely the same timings, namely +61 and -61, on all iterations.
Given the above, I believe we have a bug in the HotSpot JIT of the Java VM. The code as written should return positive results, but does not.
Straight from Oracle's documentation:
In short: the frequency of updating the values can cause results to differ.
nanoTime
public static long nanoTime()
Returns the current value of the most precise available system timer, in nanoseconds.
This method can only be used to measure elapsed time and is not related to any other notion of system or wall-clock time.
The value returned represents nanoseconds since some fixed but arbitrary time
(perhaps in the future, so values may be negative). This method
provides nanosecond precision, but not necessarily nanosecond
accuracy. No guarantees are made about how frequently values change.
Differences in successive calls that span greater than approximately
292 years (2^63 nanoseconds) will not accurately compute elapsed time
due to numerical overflow.
For example, to measure how long some code takes to execute:
long startTime = System.nanoTime();
// ... the code being measured ...
long estimatedTime = System.nanoTime() - startTime;
Returns:
The current value of the system timer, in nanoseconds.
Since:
1.5
I ran a set of performance benchmarks for 10,000,000 elements, and I've discovered that the results vary greatly with each implementation.
Can anybody explain why creating a Range.ByOne results in performance that is better than a simple array of primitives, but converting that same range to a list results in even worse performance than the worst-case scenario?
Create 10,000,000 elements, and print out those that are multiples of 1,000,000. JVM size is always set to same min and max: -Xms?m -Xmx?m
import java.util.concurrent.TimeUnit
import java.util.concurrent.TimeUnit._
object LightAndFastRange extends App {
def chrono[A](f: => A, timeUnit: TimeUnit = MILLISECONDS): (A,Long) = {
val start = System.nanoTime()
val result: A = f
val end = System.nanoTime()
(result, timeUnit.convert(end-start, NANOSECONDS))
}
def millions(): List[Int] = (0 to 10000000).filter(_ % 1000000 == 0).toList
val results = chrono(millions())
results._1.foreach(x => println ("x: " + x))
println("Time: " + results._2);
}
It takes 141 milliseconds with a JVM size of 27m
In comparison, converting to List affects performance dramatically:
import java.util.concurrent.TimeUnit
import java.util.concurrent.TimeUnit._
object LargeLinkedList extends App {
def chrono[A](f: => A, timeUnit: TimeUnit = MILLISECONDS): (A,Long) = {
val start = System.nanoTime()
val result: A = f
val end = System.nanoTime()
(result, timeUnit.convert(end-start, NANOSECONDS))
}
val results = chrono((0 to 10000000).toList.filter(_ % 1000000 == 0))
results._1.foreach(x => println ("x: " + x))
println("Time: " + results._2)
}
It takes 8514-10896 ms with a JVM size of 455-460m
In contrast, this Java implementation uses an array of primitives
import static java.util.concurrent.TimeUnit.*;
public class LargePrimitiveArray {
public static void main(String[] args){
long start = System.nanoTime();
int[] elements = new int[10000000];
for(int i = 0; i < 10000000; i++){
elements[i] = i;
}
for(int i = 0; i < 10000000; i++){
if(elements[i] % 1000000 == 0) {
System.out.println("x: " + elements[i]);
}
}
long end = System.nanoTime();
System.out.println("Time: " + MILLISECONDS.convert(end-start, NANOSECONDS));
}
}
It takes 116ms with JVM size of 59m
Java List of Integers
import java.util.List;
import java.util.ArrayList;
import static java.util.concurrent.TimeUnit.*;
public class LargeArrayList {
public static void main(String[] args){
long start = System.nanoTime();
List<Integer> elements = new ArrayList<Integer>();
for(int i = 0; i < 10000000; i++){
elements.add(i);
}
for(Integer x: elements){
if(x % 1000000 == 0) {
System.out.println("x: " + x);
}
}
long end = System.nanoTime();
System.out.println("Time: " + MILLISECONDS.convert(end-start, NANOSECONDS));
}
}
It takes 3993 ms with JVM size of 283m
My question is: why is the first example so performant, while the second is so badly affected? I tried creating views, but wasn't successful at reproducing the performance benefits of the range.
All tests running on Mac OS X Snow Leopard,
Java 6u26 64-Bit Server
Scala 2.9.1.final
EDIT:
For completeness, here's the actual implementation using a LinkedList (which is a fairer comparison in terms of space than ArrayList since, as rightly pointed out, Scala's List is a linked list).
import java.util.List;
import java.util.LinkedList;
import static java.util.concurrent.TimeUnit.*;
public class LargeLinkedList {
public static void main(String[] args){
LargeLinkedList test = new LargeLinkedList();
long start = System.nanoTime();
List<Integer> elements = test.createElements();
test.findElementsToPrint(elements);
long end = System.nanoTime();
System.out.println("Time: " + MILLISECONDS.convert(end-start, NANOSECONDS));
}
private List<Integer> createElements(){
List<Integer> elements = new LinkedList<Integer>();
for(int i = 0; i < 10000000; i++){
elements.add(i);
}
return elements;
}
private void findElementsToPrint(List<Integer> elements){
for(Integer x: elements){
if(x % 1000000 == 0) {
System.out.println("x: " + x);
}
}
}
}
Takes 3621-6749 ms with a JVM size of 460-480 MB. That's much more in line with the performance of the second Scala example.
finally, a LargeArrayBuffer
import collection.mutable.ArrayBuffer
import java.util.concurrent.TimeUnit
import java.util.concurrent.TimeUnit._
object LargeArrayBuffer extends App {
def chrono[A](f: => A, timeUnit: TimeUnit = MILLISECONDS): (A,Long) = {
val start = System.nanoTime()
val result: A = f
val end = System.nanoTime()
(result, timeUnit.convert(end-start, NANOSECONDS))
}
def millions(): List[Int] = {
val size = 10000000
var items = new ArrayBuffer[Int](size)
(0 to size).foreach (items += _)
items.filter(_ % 1000000 == 0).toList
}
val results = chrono(millions())
results._1.foreach(x => println ("x: " + x))
println("Time: " + results._2);
}
Taking about 2145 ms and 375 MB.
Thanks a lot for the answers.
Oh So Many Things going on here!!!
Let's start with Java int[]. Arrays in Java are the only collection that is not type erased. The run time representation of an int[] is different from the run time representation of Object[], in that it actually uses int directly. Because of that, there's no boxing involved in using it.
In memory terms, you have 40.000.000 consecutive bytes in memory, that are read and written 4 at a time whenever an element is read or written to.
In contrast, an ArrayList<Integer> -- as well as pretty much any other generic collection -- is composed of 40.000.000 or 80.000.000 consecutive bytes (on 32- and 64-bit JVMs respectively), PLUS 80.000.000 bytes spread all around memory in groups of 8 bytes. Every read and write of an element has to go through two memory spaces, and the sheer time spent handling all that memory is significant when the actual task you are doing is so fast.
So, back to Scala, for the second example where you manipulate a List. Now, Scala's List is much more like Java's LinkedList than the grossly misnamed ArrayList. Each element of a List is composed of an object called Cons, which has 16 bytes, with a pointer to the element and a pointer to another list. So, a List of 10.000.000 elements is composed of 160.000.000 bytes spread all around memory in groups of 16 bytes, plus 80.000.000 bytes spread all around memory in groups of 8 bytes. So what was true for ArrayList is even more so for List.
Finally, Range. A Range is a sequence of integers with a lower and an upper boundary, plus a step. A Range of 10.000.000 elements is 40 bytes: three ints (not generic) for lower and upper bounds and step, plus a few pre-computed values (last, numRangeElements) and two other ints used for lazy val thread safety. Just to make clear, that's NOT 40 times 10.000.000: that's 40 bytes TOTAL. The size of the range is completely irrelevant, because IT DOESN'T STORE THE INDIVIDUAL ELEMENTS. Just the lower bound, upper bound and step.
Now, because a Range is a Seq[Int], it still has to go through boxing for most uses: an int will be converted into an Integer and then back into an int again, which is sadly wasteful.
Cons Size Calculation
So, here's a tentative calculation of Cons. First of all, read this article about some general guidelines on how much memory an object takes. The important points are:
Java uses 8 bytes for normal objects, and 12 for object arrays, for "housekeeping" information (what's the class of this object, etc).
Objects are allocated in 8-byte chunks. If your object is smaller than that, it will be padded out to the next multiple of 8.
I actually thought it was 16 bytes, not 8. Anyway, Cons is also smaller than I thought. Its fields are:
public static final long serialVersionUID; // static, doesn't count
private java.lang.Object scala$collection$immutable$$colon$colon$$hd;
private scala.collection.immutable.List tl;
References are at least 4 bytes (could be more on a 64-bit JVM). So we have:
8 bytes Java header
4 bytes hd
4 bytes tl
Which makes it only 16 bytes long. Pretty good, actually. In the example, hd will point to an Integer object, which I assume is 8 bytes long. As for tl, it points to another cons, which we are already counting.
I'm going to revise the estimates, with actual data where possible.
In the first example you create a linked list with 10 elements by computing the steps of the range.
In the second example you create a linked list with 10 million elements and filter it down to a new linked list with 10 elements.
In the third example you create an array-backed buffer with 10 million elements which you traverse and print; no new array-backed buffer is created.
Conclusion:
Every piece of code does something different, that's why the performance varies greatly.
This is an educated guess ...
I think it is because in the fast version the Scala compiler is able to translate the key statement into something like this (in Java):
List<Integer> millions = new ArrayList<Integer>();
for (int i = 0; i <= 10000000; i++) {
if (i % 1000000 == 0) {
millions.add(i);
}
}
As you can see, (0 to 10000000) doesn't generate an intermediate list of 10,000,000 Integer objects.
By contrast, in the slow version the Scala compiler is not able to do that optimization, and is generating that list.
(The intermediate data structure could possibly be an int[], but the observed JVM size suggests that it is not.)
It's hard to read the Scala source on my iPad, but it looks like Range's constructor isn't actually producing a list, just remembering the start, increment and end. It uses these to produce its values on request, so that iterating over a range is a lot closer to a simple for loop than examining the elements of an array.
As soon as you say range.toList you are forcing Scala to produce a linked list of the 'values' in the range (allocating memory for both the values and the links), and then you are iterating over that. Being a linked list the performance of this is going to be worse than your Java ArrayList example.
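To illustrate the difference in Java terms (a hedged analogy of my own, not Scala's actual implementation): a range only remembers its start, end and step, so traversing it is essentially a counting loop, whereas toList materializes every element first:
// range-like traversal: nothing is stored, values are produced on the fly
int hits = 0;
for (int i = 0; i <= 10000000; i++) {
    if (i % 1000000 == 0) hits++;
}

// toList-like traversal: build a linked list of all the values first, then walk it
java.util.List<Integer> all = new java.util.LinkedList<Integer>();
for (int i = 0; i <= 10000000; i++) all.add(i);  // allocation and boxing dominate
int hits2 = 0;
for (int x : all) if (x % 1000000 == 0) hits2++;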
I just came across this seemingly innocuous comment, benchmarking ArrayList vs a raw String array. It's from a couple years ago, but the OP writes
I did notice that using for String s: stringsList was about 50% slower than using an old-style for-loop to access the list. Go figure...
Nobody commented on it in the original post, and the test seemed a little dubious (too short to be accurate), but I nearly fell out of my chair when I read it. I've never benchmarked an enhanced loop against a "traditional" one, but I'm currently working on a project that does hundreds of millions of iterations over ArrayList instances using enhanced loops so this is a concern to me.
I'm going to do some benchmarking and post my findings here, but this is obviously a big concern to me. I could find precious little info online about relative performance, except for a couple offhand mentions that enhanced loops for ArrayLists do run a lot slower under Android.
Has anybody experienced this? Does such a performance gap still exist? I'll post my findings here, but was very surprised to read it. I suspect that if this performance gap did exist, it has been fixed in more modern VM's, but I guess now I'll have to do some testing and confirm.
Update: I made some changes to my code, but was already suspecting what others here have already pointed out: sure, the enhanced for loop is slower, but outside of very trivial tight loops the cost should be a minuscule fraction of the cost of the loop's logic. In my case, even though I'm iterating over very large lists of strings using enhanced loops, my logic inside the loop is complex enough that I couldn't even measure a difference after switching to index-based loops.
TL;DR: enhanced loops are indeed slower than a traditional index-based loop over an arraylist; but for most applications the difference should be negligible.
The problem you have is that using an Iterator will be slower than using a direct lookup. On my machine the difference is about 0.13 ns per iteration. Using an array instead saves about 0.15 ns per iteration. This should be trivial in 99% of situations.
public static void main(String... args) {
int testLength = 100 * 1000 * 1000;
String[] stringArray = new String[testLength];
Arrays.fill(stringArray, "a");
List<String> stringList = new ArrayList<String>(Arrays.asList(stringArray));
{
long start = System.nanoTime();
long total = 0;
for (String str : stringArray) {
total += str.length();
}
System.out.printf("The for each Array loop time was %.2f ns total=%d%n", (double) (System.nanoTime() - start) / testLength, total);
}
{
long start = System.nanoTime();
long total = 0;
for (int i = 0, stringListSize = stringList.size(); i < stringListSize; i++) {
String str = stringList.get(i);
total += str.length();
}
System.out.printf("The for/get List loop time was %.2f ns total=%d%n", (double) (System.nanoTime() - start) / testLength, total);
}
{
long start = System.nanoTime();
long total = 0;
for (String str : stringList) {
total += str.length();
}
System.out.printf("The for each List loop time was %.2f ns total=%d%n", (double) (System.nanoTime() - start) / testLength, total);
}
}
When run with one billion entries it prints (using Java 6 update 26):
The for each Array loop time was 0.76 ns total=1000000000
The for/get List loop time was 0.91 ns total=1000000000
The for each List loop time was 1.04 ns total=1000000000
When run with one billion entries it prints (using OpenJDK 7):
The for each Array loop time was 0.76 ns total=1000000000
The for/get List loop time was 0.91 ns total=1000000000
The for each List loop time was 1.04 ns total=1000000000
i.e. exactly the same. ;)
Every claim that X is slower than Y on a JVM which does not address all the issues presented in this article and its second part spreads fear and lies about the performance of a typical JVM. This applies to the comment referred to by the original question as well as to GravityBringer's answer. I am sorry to be so rude, but unless you use appropriate micro-benchmarking technology your benchmarks produce really badly skewed random numbers.
Tell me if you're interested in more explanations. Although it is all in the articles I referred to.
GravityBringer's number doesn't seem right, because I know ArrayList.get() is as fast as raw array access after VM optimization.
I ran GravityBringer's test twice on my machine, -server mode
50574847
43872295
30494292
30787885
(2nd round)
33865894
32939945
33362063
33165376
The bottleneck in such tests is actually memory read/write. Judging from the numbers, both arrays fit entirely in my L2 cache. If we decrease the size to fit the L1 cache, or increase it beyond the L2 cache, we'll see a 10X throughput difference.
The iterator of ArrayList uses a single int counter. Even if the VM doesn't put it in a register (the loop body is too complex), it will at least be in the L1 cache, so reads and writes of it are basically free.
The ultimate answer of course is to test your particular program in your particular environment.
Though it's not helpful to play agnostic whenever a benchmark question is raised.
The situation has gotten worse for ArrayLists. On my computer running Java 6 update 26, there is a fourfold difference. Interestingly (and perhaps quite logically), there is no difference for raw arrays. I ran the following test:
int testSize = 5000000;
ArrayList<Double> list = new ArrayList<Double>();
Double[] arr = new Double[testSize];
//set up the data - make sure data doesn't have patterns
//or anything compiler could somehow optimize
for (int i=0;i<testSize; i++)
{
double someNumber = Math.random();
list.add(someNumber);
arr[i] = someNumber;
}
//ArrayList foreach
long time = System.nanoTime();
double total1 = 0;
for (Double k: list)
{
total1 += k;
}
System.out.println (System.nanoTime()-time);
//ArrayList get() method
time = System.nanoTime();
double total2 = 0;
for (int i=0;i<testSize;i++)
{
total2 += list.get(i);
}
System.out.println (System.nanoTime()-time);
//array foreach
time = System.nanoTime();
double total3 = 0;
for (Double k: arr)
{
total3 += k;
}
System.out.println (System.nanoTime()-time);
//array indexing
time = System.nanoTime();
double total4 = 0;
for (int i=0;i<testSize;i++)
{
total4 += arr[i];
}
System.out.println (System.nanoTime()-time);
//would be strange if different values were produced,
//but no, all these are the same, of course
System.out.println (total1);
System.out.println (total2);
System.out.println (total3);
System.out.println (total4);
The arithmetic in the loops is to prevent the JIT compiler from possibly optimizing away some of the code. The effect of the arithmetic on performance is small, as the runtime is dominated by the ArrayList accesses.
The runtimes are (in nanoseconds):
ArrayList foreach: 248,351,782
ArrayList get(): 60,657,907
array foreach: 27,381,576
array direct indexing: 27,468,091