Why is the Java Vector API so slow compared to scalar?

I recently decided to play around with Java's new incubated vector API, to see how fast it can get. I implemented two fairly simple methods, one for parsing an int and one for finding the index of a character in a string. In both cases, my vectorized methods were incredibly slow compared to their scalar equivalents.
Here's my code:
import jdk.incubator.vector.*;
import java.nio.charset.StandardCharsets;

// run with --add-modules jdk.incubator.vector
public class SIMDParse {
    // place values, right-aligned: an int has at most 10 digits, so lanes 6..15 are used
    private static IntVector mul = IntVector.fromArray(
            IntVector.SPECIES_512,
            new int[] {0, 0, 0, 0, 0, 0, 1000000000, 100000000, 10000000, 1000000, 100000, 10000, 1000, 100, 10, 1},
            0
    );
    private static byte zeroChar = (byte) '0';
    private static int width = IntVector.SPECIES_512.length();
    private static byte[] filler;

    static {
        filler = new byte[16];
        for (int i = 0; i < 16; i++) {
            filler[i] = zeroChar;
        }
    }

    public static int parseInt(String str) {
        boolean negative = str.charAt(0) == '-';
        byte[] bytes = str.getBytes(StandardCharsets.UTF_8);
        if (negative) {
            bytes[0] = zeroChar;
        }
        bytes = ensureSize(bytes, width);
        ByteVector vec = ByteVector.fromArray(ByteVector.SPECIES_128, bytes, 0);
        vec = vec.sub(zeroChar);
        IntVector ints = (IntVector) vec.castShape(IntVector.SPECIES_512, 0);
        ints = ints.mul(mul);
        return ints.reduceLanes(VectorOperators.ADD) * (negative ? -1 : 1);
    }

    // pads on the left with '0' bytes so the digits end up right-aligned
    public static byte[] ensureSize(byte[] arr, int per) {
        int mod = arr.length % per;
        if (mod == 0) {
            return arr;
        }
        int length = arr.length - (mod);
        length += per;
        byte[] newArr = new byte[length];
        System.arraycopy(arr, 0, newArr, per - mod, arr.length);
        System.arraycopy(filler, 0, newArr, 0, per - mod);
        return newArr;
    }

    // pads on the right with zero bytes
    public static byte[] ensureSize2(byte[] arr, int per) {
        int mod = arr.length % per;
        if (mod == 0) {
            return arr;
        }
        int length = arr.length - (mod);
        length += per;
        byte[] newArr = new byte[length];
        System.arraycopy(arr, 0, newArr, 0, arr.length);
        return newArr;
    }

    public static int indexOf(String s, char c) {
        byte[] b = s.getBytes(StandardCharsets.UTF_8);
        int width = ByteVector.SPECIES_MAX.length();
        byte bChar = (byte) c;
        b = ensureSize2(b, width);
        for (int i = 0; i < b.length; i += width) {
            ByteVector vec = ByteVector.fromArray(ByteVector.SPECIES_MAX, b, i);
            int pos = vec.compare(VectorOperators.EQ, bChar).firstTrue();
            if (pos != width) {
                return pos + i;
            }
        }
        return -1;
    }
}
I fully expected my int parsing to be slower, since it won't ever be handling more than the vector size can hold (an int can never be more than 10 digits long).
By my benchmarks, parsing 123 as an int 10k times took 3081 microseconds for Integer.parseInt, and 80601 microseconds for my implementation. Searching for 'a' in a very long string ("____".repeat(4000) + "a" + "----".repeat(193)) took 7709 microseconds to String#indexOf's 7.
Why is it so unbelievably slow? I thought the entire point of SIMD is that it's faster than the scalar equivalents for tasks like these.

You picked something SIMD is not great at (string->int), and something that JVMs are very good at optimizing out of loops. And you made an implementation with a bunch of extra copying work if the inputs aren't exact multiples of the vector width.
I'm assuming your times are totals (for 10k repeats each), not a per-call average.
7 us is impossibly fast for that.
"____".repeat(4000) is 16k bytes before the 'a', which I assume is what you're searching for. Even a well-tuned / unrolled memchr (aka indexOf) running at 2x 32-byte vectors per clock cycle, on a 4GHz CPU, would take 625 us for 10k reps. (16000B / (64B/c) * 10000 reps / 4000 MHz). And yes, I'd expect a JVM to either call the native memchr or use something equally efficient for a commonly-used core library function like String#indexOf. For example, glibc's avx2 memchr is pretty well-tuned with loop unrolling; if you're on Linux, your JVM might be calling it.
Built-in String indexOf is also something the JIT "knows about". It's apparently able to hoist it out of loops when it can see that you're using the same string repeatedly as input. (But then what's it doing for the rest of those 7 us? I guess doing a not-quite-so-great memchr and then doing an empty 10k iteration loop at 1/clock could take about 7 microseconds, especially if your CPU isn't as fast as 4GHz.)
See Idiomatic way of performance evaluation? - if doubling the repeat-count to 20k doesn't double the time, your benchmark is broken and not measuring what you think it does.
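A quick way to apply that check (illustrative only; a real harness like JMH is the right tool for serious measurements) is to time the same call at both repeat counts and compare:

static long timeReps(int reps, String haystack, char needle) {
    long t0 = System.nanoTime();
    int sink = 0;
    for (int r = 0; r < reps; r++) {
        sink += haystack.indexOf(needle); // consume the result so it can't be trivially dropped
    }
    long t1 = System.nanoTime();
    // if reps=20000 doesn't take roughly 2x reps=10000, the work is being optimized away
    System.out.println("reps=" + reps + " sink=" + sink + " us=" + (t1 - t0) / 1000);
    return t1 - t0;
}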
Your manual SIMD indexOf is very unlikely to get optimized out of a loop. It makes a copy of the whole array every time, if the size isn't an exact multiple of the vector width!! (In ensureSize2). The normal technique is to fall back to scalar for the last size % width elements, which is obviously much better for large arrays. Or even better, do an unaligned load that ends at the end of the array (if the total size is >= vector width) for something where overlap with previous work is not a problem.
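Here's a minimal sketch of that scalar-fallback technique (my illustrative code, not yours; SPECIES_PREFERRED and the method name are placeholders): vectorize up to species.loopBound(b.length), then finish the remainder with a plain loop, so nothing ever gets copied:

static int indexOfNoCopy(byte[] b, byte target) {
    var species = ByteVector.SPECIES_PREFERRED;
    int upper = species.loopBound(b.length); // largest lane-count multiple <= b.length
    int i = 0;
    for (; i < upper; i += species.length()) {
        ByteVector vec = ByteVector.fromArray(species, b, i);
        int pos = vec.compare(VectorOperators.EQ, target).firstTrue();
        if (pos != species.length()) {
            return i + pos;
        }
    }
    for (; i < b.length; i++) { // scalar tail: fewer than one vector's worth of elements
        if (b[i] == target) return i;
    }
    return -1;
}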
A decent memchr on modern x86 (using an algorithm like your indexOf without unrolling) should go at about 1 vector (16/32/64 bytes) per maybe 1.5 clock cycles, with data hot in L1d cache, without loop unrolling or anything. (Checking both the vector compare and the pointer bound as possible loop exit conditions takes extra asm instructions vs. a simple strlen, but see this answer for some microbenchmarks of a simple hand-written strlen that assumes aligned buffers.) Probably your indexOf loop bottlenecks on front-end throughput on a CPU like Skylake, with its pipeline width of 4 uops/clock.
So let's guess that your implementation takes 1.5 cycles per 16-byte vector, assuming perhaps you're on a CPU without AVX2? You didn't say.
16kB / 16B = 1000 vectors. At 1 vector per 1.5 clocks, that's 1500 cycles. On a 3GHz machine, 1500 cycles takes 500 ns = 0.5 us per call, or 5000 us per 10k reps. But since 16194 bytes isn't a multiple of 16, you're also copying the whole thing every call, so that costs some more time, and could plausibly account for your 7709 us total time.
What SIMD is good for
"for tasks like these."
No, "horizontal" stuff like ints.reduceLanes is something SIMD is generally slow at. And even with something like How to implement atoi using SIMD? using x86 pmaddwd to multiply and add pairs horizontally, it's still a lot of work.
Note that to make the elements wide enough to multiply by place-values without overflow, you have to unpack, which costs some shuffling. ints.reduceLanes takes about log2(elements) shuffle/add steps, and if you're starting with 512-bit AVX-512 vectors of int, the first 2 of those shuffles are lane-crossing, 3 cycle latency (https://agner.org/optimize/). (Or if your machine doesn't even have AVX2, then a 512-bit integer vector is actually 4x 128-bit vectors. And you had to do separate work to unpack each part. But at least the reduction will be cheap, just vertical adds until you get down to a single 128-bit vector.)

Hmm. I found this post because I've hit something strange with Vector performance on something it should ostensibly be ideal for: multiplying two double arrays.
static private void doVector(int iteration, double[] input1, double[] input2, double[] output) {
    Instant start = Instant.now();
    for (int i = 0; i < SPECIES.loopBound(ARRAY_LENGTH); i += SPECIES.length()) {
        DoubleVector va = DoubleVector.fromArray(SPECIES, input1, i);
        DoubleVector vb = DoubleVector.fromArray(SPECIES, input2, i);
        va.mul(vb); // note: this result is discarded; the multiply is redone below
        System.arraycopy(va.mul(vb).toArray(), 0, output, i, SPECIES.length());
    }
    Instant finish = Instant.now();
    System.out.println("vector duration " + iteration + ": " + Duration.between(start, finish).getNano());
}
The species length comes out at 4 on my machine (CPU is Intel i7-7700HQ at 2.8 GHz).
On my first attempt the execution took more than 15 milliseconds (compared with 0 for the scalar equivalent), even with a tiny array length (8 elements). On a hunch I added the iteration number to see whether something had to warm up - and indeed, the first iteration still ALWAYS takes ages (44 ms for 65536 elements). Whilst most of the other iterations report zero time, a few take around 15 ms, but they are randomly distributed (i.e. not always the same iteration index on each run). I sort of expect that (because I'm measuring wall-clock time and other stuff will be going on).
However, overall for an array size of 65536 elements, and 32 iterations, the total duration for the vector approach is 2-3 times longer than that for the scalar one.
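(For what it's worth, one likely contributor is the per-iteration va.mul(vb).toArray() allocation plus System.arraycopy; the Vector API can store a result straight into an existing array. A minimal sketch of that inner loop, reusing the names from the code above:)

for (int i = 0; i < SPECIES.loopBound(ARRAY_LENGTH); i += SPECIES.length()) {
    DoubleVector va = DoubleVector.fromArray(SPECIES, input1, i);
    DoubleVector vb = DoubleVector.fromArray(SPECIES, input2, i);
    va.mul(vb).intoArray(output, i); // no temporary array, no copy
}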

Related

What is the space complexity of BitSet in this scenario

I am doing a LeetCode problem where I have to find the duplicate in an array whose values are in the range [1, N] inclusive, and came upon this solution:
public int findDuplicate(int[] nums) {
    BitSet bit = new BitSet();
    for (int num : nums) {
        if (!bit.get(num)) {
            bit.set(num);
        } else {
            return num;
        }
    }
    return -1;
}
The use of BitSet here, I'm assuming, is similar to using a boolean[] to track whether we have seen the current number before. So my question is: what is the space complexity of this? The runtime seems to be O(n), where n is the size of the input array. Would the same be true of the space complexity?
Link to problem : https://leetcode.com/problems/find-the-duplicate-number/
Your BitSet creates an underlying long[] to store the values. Reading the code of BitSet#set, I would say it's safe to say that the array will never be larger than max(nums) / 64 * 2 = max(nums) / 32. Since long has a fixed size, this comes down to O(max(nums)). If nums contains large values, you can do better with a hash map.
I'm trying this out with simple code, and it seems to corroborate my reading of the code.
BitSet bitSet = new BitSet();
bitSet.set(100);
System.out.println(bitSet.toLongArray().length); // 2 (max(nums) / 32 = 3.125)
bitSet.set(64000);
System.out.println(bitSet.toLongArray().length); // 1001 (max(nums) / 32 = 2000)
bitSet.set(100_000);
System.out.println(bitSet.toLongArray().length); // 1563 (max(nums) / 32 = 3125)
Note that the factor of 2 I added is conservative; in general it will be a smaller factor, which is why my formula consistently over-estimates the actual length of the long array, but never by more than a factor of 2. This is the code in BitSet that made me add it:
private void ensureCapacity(int wordsRequired) {
    if (words.length < wordsRequired) {
        // Allocate larger of doubled size or required size
        int request = Math.max(2 * words.length, wordsRequired);
        words = Arrays.copyOf(words, request);
        sizeIsSticky = false;
    }
}
In summary, I would say the bit set is only a good idea if you have reason to believe the values are small relative to the number of values (the count). For example, if you have only two values but they are over a billion, you will needlessly allocate an array of several million elements.
Additionally, even in cases where values remain small, this solution performs poorly for sorted arrays, because BitSet#set will keep reallocating and copying the array, so your complexity is not linear at all; it's quadratic in max(nums), which can be terrible if max(nums) is very large. To be linear, you would need to first find the maximum, allocate the necessary length in the BitSet, and only then go through the array.
At this point, using a map is simpler and fits all situations. If speed really matters, my bet is that the BitSet will beat a map under specific conditions (lots of values, but small, and by pre-sizing the bit set as described).
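A minimal sketch of that pre-sized variant (assuming nums is non-empty with non-negative values; the method name is mine):

public int findDuplicatePresized(int[] nums) {
    int max = 0;
    for (int num : nums) max = Math.max(max, num);
    BitSet seen = new BitSet(max + 1); // one allocation up front, no growth copies
    for (int num : nums) {
        if (seen.get(num)) return num;
        seen.set(num);
    }
    return -1;
}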

Creating combinations of a BitSet

Assume I have a Java BitSet. I now need to generate combinations of the BitSet such that only bits which are set can be flipped, i.e. I only need combinations of the bits which are set.
E.g. BitSet 1010, combinations: 1010, 1000, 0010, 0000
BitSet 1100, combinations: 1100, 1000, 0100, 0000
I can think of a few solutions, e.g. taking combinations of all 4 bits and then XORing them with the original BitSet. But this would be very resource-intensive for large, sparse BitSets. So I was looking for a more elegant solution.
It appears that you want to get the power set of the bit set. There is already an answer here about how to get the power set of a Set<T>. Here, I will show a modified version of the algorithm shown in that post, using BitSets:
private static Set<BitSet> powerset(BitSet set) {
    Set<BitSet> sets = new HashSet<>();
    if (set.isEmpty()) {
        sets.add(new BitSet(0));
        return sets;
    }
    int head = set.nextSetBit(0);
    BitSet rest = set.get(0, set.size()); // copy of the input
    rest.clear(head);
    for (BitSet s : powerset(rest)) {
        BitSet newSet = s.get(0, s.size()); // copy of s, with head added
        newSet.set(head);
        sets.add(newSet);
        sets.add(s);
    }
    return sets;
}
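For example (a hypothetical driver for the question's 1010 case, i.e. bits 1 and 3 set; the HashSet's print order is unspecified):

BitSet in = new BitSet();
in.set(1);
in.set(3);
System.out.println(powerset(in)); // e.g. [{}, {3}, {1}, {1, 3}]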
You can perform the operation in a single linear pass instead of recursion, if you realize that integer numbers are a computer's intrinsic variant of "on off" patterns and that iterating over the appropriate integer range will ultimately produce all possible combinations. The only challenge in your case is to transfer the densely packed bits of an integer number to the target bits of a BitSet.
Here is such a solution:
static List<BitSet> powerset(BitSet set) {
    int nBits = set.cardinality();
    if (nBits > 30) throw new OutOfMemoryError(
        "Not enough memory for " + BigInteger.ONE.shiftLeft(nBits) + " BitSets");
    int max = 1 << nBits;
    int[] targetBits = set.stream().toArray();
    List<BitSet> sets = new ArrayList<>(max);
    for (int onOff = 0; onOff < max; onOff++) {
        BitSet next = new BitSet(set.size());
        for (int bitsToSet = onOff, ix = 0; bitsToSet != 0; ix++, bitsToSet >>>= 1) {
            if ((bitsToSet & 1) == 0) {
                int skip = Integer.numberOfTrailingZeros(bitsToSet);
                ix += skip;
                bitsToSet >>>= skip;
            }
            next.set(targetBits[ix]);
        }
        sets.add(next);
    }
    return sets;
}
It uses an int value for the iteration, which is already enough to represent all combinations that can ever be stored in one of Java's builtin collections. If your source BitSet has 32 one bits, the 2³² possible combinations would not only require hundreds of GB of heap, but also a collection supporting 2³² elements, i.e. a size not representable as an int.
So the code above terminates early if the number exceeds those capabilities, without even trying. You could rewrite it to use a long or even BigInteger instead, to keep it busy in such cases, until it fails with an OutOfMemoryError anyway.
For the working cases, the int solution is the most efficient variant.
Note that the code returns a List rather than a HashSet to avoid the costs of hashing. The values are already known to be unique, and hashing would only pay off if you want to perform lookups, i.e. call contains with another BitSet. But to test whether an existing BitSet is a combination of your input BitSet, you wouldn't even need to generate all the combinations; a simple bit operation, e.g. andNot, would tell you that already. So for storing and iterating the combinations, an ArrayList is more efficient.
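A sketch of that test (names are mine): a candidate is one of the combinations exactly when it has no bit set outside the input's bits:

static boolean isCombinationOf(BitSet candidate, BitSet input) {
    BitSet extra = (BitSet) candidate.clone();
    extra.andNot(input); // bits of candidate that are not in input
    return extra.isEmpty();
}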

Reverse Engineer Sorting Algorithm

I have been given 3 algorithms to reverse engineer and explain how they work. So far I have worked out that I have been given a quick sorting algorithm and a bubble sorting algorithm; however, I'm not sure what algorithm this is. I understand how the quick sort and bubble sort work, but I just can't get my head around this algorithm. I'm unsure what the variables are and was hoping someone out there would be able to tell me what's going on here:
public static ArrayList<Integer> SortB(ArrayList<Integer> a)
{
    ArrayList<Integer> array = CopyArray(a);
    Integer[] zero = new Integer[a.size()];
    Integer[] one = new Integer[a.size()];
    int i, b;
    Integer x, p;
    //Change from 8 to 32 for whole integers - will run 4 times slower
    for (b = 0; b < 8; ++b)
    {
        int zc = 0;
        int oc = 0;
        for (i = 0; i < array.size(); ++i)
        {
            x = array.get(i);
            p = 1 << b;
            if ((x & p) == 0)
            {
                zero[zc++] = array.get(i);
            }
            else
            {
                one[oc++] = array.get(i);
            }
        }
        for (i = 0; i < oc; ++i) array.set(i, one[i]);
        for (i = 0; i < zc; ++i) array.set(i + oc, zero[i]);
    }
    return (array);
}
This is a Radix Sort, limited to the least significant eight bits. It does not complete the sort unless you change the loop to go 32 times instead of 8.
Each iteration processes a single bit b. It prepares a mask called p by shifting 1 left b times. This produces a power of two - 1, 2, 4, 8, ..., or 1, 10, 100, 1000, 10000, ... in binary.
For each bit b, the elements of the original array with bit b set to 1 and those with it set to 0 are separated into two buckets, called one and zero. Once the separation is over, the elements are placed back into the original array, and the algorithm proceeds to the next bit.
This implementation uses twice as much extra storage as the original array, and goes through the array a total of 16 times (64 times in the full version: once for reading and once for writing for each bit). The asymptotic complexity of the algorithm is linear.
Looks like a bit-by-bit radix sort to me, but it seems to be sorting backwards: the one bucket is written back before the zero bucket.
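If ascending order were the goal, writing the zero bucket back before the one bucket at each pass should fix it (a stable LSD radix sort needs the 0-bits before the 1-bits); a sketch of just the two changed lines:

for (i = 0; i < zc; ++i) array.set(i, zero[i]);
for (i = 0; i < oc; ++i) array.set(i + zc, one[i]);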

Efficient way of altering data in an array with threads

I've been trying to figure out the most efficient way for many threads to alter a very big byte array at the bit level. For ease of explanation I'll base the question around a multithreaded Sieve of Eratosthenes. The code should not be expected to be fully complete, as I'll omit certain parts that aren't directly related; the sieve also won't be fully optimised, as that's not the question. The sieve works in such a way that it saves which values are primes in a byte array, where each byte contains 7 numbers (we can't alter the first bit due to all things being signed).
Let's say our goal is to find all the primes below 1 000 000 000 (1 billion). As a result we would need a byte array of length 1 000 000 000 / 7 + 1, or 142 857 143 (about 143 million).
class Prime {
    int max = 1000000000;
    byte[] b = new byte[(max / 7) + 1];

    Prime() {
        for (int i = 0; i < b.length; i++) {
            b[i] = (byte) 127; // setting all values to 1 at start
        }
        findPrimes();
    }

    /*
     * Calling remove will set the bit value associated with the number
     * to 0, signaling that it isn't a prime
     */
    void remove(int i) {
        int j = i / 7; // gets which array index to access
        b[j] = (byte) (b[j] & ~(1 << (i % 7)));
    }

    void findPrimes() {
        remove(1); // 1 is not a prime and we wanna remove it from the start
        int prime = 2;
        while (prime * prime < max) {
            for (int i = prime * 2; i < max; i = prime + i) {
                remove(i);
            }
            prime = nextPrime(prime); // this returns the next prime from the list
        }
    }

    ... // omitting code, not relevant to question
}
Now we've got a basic outline where something runs through all the numbers in a given multiplication table and calls remove to set the bits for those numbers to 0 once we've found out they aren't primes.
Now to up the ante, we create threads that do the checking for us. We split the work so that each takes a part of the removing from the table. So, for example, if we have 4 threads and we are running through the multiplication table for the prime 2, we would assign thread 1 everything in the 8 times table with a starting offset of 2, that is 4, 12, 20, ...; the second thread gets an offset of 4, so it goes through 6, 14, 22, ...; and so on. They then call remove on the ones they want.
Now to the real question. As most can see, while the prime is less than 7 we will have multiple threads accessing the same array index. While running through 2, for example, threads 1, 2 and 3 will all try to access b[0] to alter the byte, which causes a race condition we don't want.
The question therefore is: what's the best way of optimising access to the byte array?
So far the thoughts I've had for it are:
Putting synchronized on the remove method. This would obviously be very easy to implement, but it's a horrible idea, as it would remove any gain from having threads.
Creating a mutex array equal in size to the byte array. To write to an index one would need the mutex for the same index. This would be fairly fast, but requires another very big array in memory, which might not be the best way to do it.
Limiting the numbers stored per byte to the prime we start running on. So if we start on 2, we would have 2 numbers per byte. This would however increase our array length to 500 000 000 (500 million).
Are there other ways of doing this in a fast and optimal way without overusing the memory?
(This is my first question here, so I tried to be as detailed and thorough as possible, but I would accept any comments on how I can improve the question - too much detail, needs more detail, etc.)
You can use an array of atomic integers for this. Unfortunately there isn't a getAndAND, which would be ideal for your remove() function, but you can CAS in a loop:
java.util.concurrent.atomic.AtomicIntegerArray aia;
....

void remove(int i) {
    int j = i / 32; // gets which array index to access
    boolean updated;
    do {
        int oldVal = aia.get(j);
        int newVal = oldVal & ~(1 << (i % 32));
        updated = aia.weakCompareAndSet(j, oldVal, newVal);
    } while (!updated);
}
Basically you keep trying to adjust the slot to remove that bit, but you only succeed if nobody else modifies it out from under you. Safe, and likely to be very efficient. weakCompareAndSet is basically an abstracted Load-link/Store conditional instruction.
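(On Java 8+ the same retry loop can also be written with accumulateAndGet, which performs the CAS loop internally; a sketch:)

void remove(int i) {
    // apply (current & mask) atomically, retrying on contention
    aia.accumulateAndGet(i / 32, ~(1 << (i % 32)), (cur, mask) -> cur & mask);
}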
BTW, there's no reason not to use the sign bit.
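For example, packing 8 numbers per byte works fine, because the byte is promoted to int before the bitwise ops; a sketch adapting the question's remove():

void remove(int i) {
    int j = i / 8; // 8 numbers per byte, sign bit included
    b[j] = (byte) (b[j] & ~(1 << (i % 8)));
}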
I think you could avoid synchronizing threads...
For example, this task:
for (int i = prime * 2; i < max; i = prime + i) {
    remove(i);
}
it could be partitioned into small tasks:
for (int i = 0; i < thread_poll; i++) {
    int totalPos = max / 8;                     // dividing the virtual array into bytes
    int partitionSize = totalPos / thread_poll; // dividing the bytes among the thread pool
    removeAll(prime, partitionSize * i * 8, (i + 1) * partitionSize * 8);
}
....
// no collisions!!!
void removeAll(int prime, int initial, int max) {
    int k = initial / prime;
    if (k < 2) k = 2;
    for (int i = k * prime; i < max; i = i + prime) {
        remove(i);
    }
}

Scala ranges versus lists performance on large collections

I ran a set of performance benchmarks for 10,000,000 elements, and I've discovered that the results vary greatly with each implementation.
Can anybody explain why creating a Range.ByOne results in performance that is better than a simple array of primitives, but converting that same range to a list results in even worse performance than the worst-case scenario?
Create 10,000,000 elements, and print out those that are multiples of 1,000,000. JVM size is always set to the same min and max: -Xms?m -Xmx?m
import java.util.concurrent.TimeUnit
import java.util.concurrent.TimeUnit._

object LightAndFastRange extends App {
  def chrono[A](f: => A, timeUnit: TimeUnit = MILLISECONDS): (A, Long) = {
    val start = System.nanoTime()
    val result: A = f
    val end = System.nanoTime()
    (result, timeUnit.convert(end - start, NANOSECONDS))
  }

  def millions(): List[Int] = (0 to 10000000).filter(_ % 1000000 == 0).toList

  val results = chrono(millions())
  results._1.foreach(x => println("x: " + x))
  println("Time: " + results._2)
}
It takes 141 milliseconds with a JVM size of 27 MB.
In comparison, converting to List affects performance dramatically:
import java.util.concurrent.TimeUnit
import java.util.concurrent.TimeUnit._

object LargeLinkedList extends App {
  def chrono[A](f: => A, timeUnit: TimeUnit = MILLISECONDS): (A, Long) = {
    val start = System.nanoTime()
    val result: A = f
    val end = System.nanoTime()
    (result, timeUnit.convert(end - start, NANOSECONDS))
  }

  val results = chrono((0 to 10000000).toList.filter(_ % 1000000 == 0))
  results._1.foreach(x => println("x: " + x))
  println("Time: " + results._2)
}
It takes 8514-10896 ms with a JVM size of 455-460 MB.
In contrast, this Java implementation uses an array of primitives
import static java.util.concurrent.TimeUnit.*;

public class LargePrimitiveArray {
    public static void main(String[] args) {
        long start = System.nanoTime();
        int[] elements = new int[10000000];
        for (int i = 0; i < 10000000; i++) {
            elements[i] = i;
        }
        for (int i = 0; i < 10000000; i++) {
            if (elements[i] % 1000000 == 0) {
                System.out.println("x: " + elements[i]);
            }
        }
        long end = System.nanoTime();
        System.out.println("Time: " + MILLISECONDS.convert(end - start, NANOSECONDS));
    }
}
It takes 116 ms with a JVM size of 59 MB.
Java List of Integers
import java.util.List;
import java.util.ArrayList;
import static java.util.concurrent.TimeUnit.*;

public class LargeArrayList {
    public static void main(String[] args) {
        long start = System.nanoTime();
        List<Integer> elements = new ArrayList<Integer>();
        for (int i = 0; i < 10000000; i++) {
            elements.add(i);
        }
        for (Integer x : elements) {
            if (x % 1000000 == 0) {
                System.out.println("x: " + x);
            }
        }
        long end = System.nanoTime();
        System.out.println("Time: " + MILLISECONDS.convert(end - start, NANOSECONDS));
    }
}
It takes 3993 ms with a JVM size of 283 MB.
My question is: why is the first example so performant, while the second is so badly affected? I tried creating views, but wasn't successful at reproducing the performance benefits of the range.
All tests running on Mac OS X Snow Leopard,
Java 6u26 64-Bit Server
Scala 2.9.1.final
EDIT:
for completeness, here's the actual implementation using a LinkedList (which is a fairer comparison in terms of space than ArrayList since, as rightly pointed out, Scala's List is linked)
import java.util.List;
import java.util.LinkedList;
import static java.util.concurrent.TimeUnit.*;

public class LargeLinkedList {
    public static void main(String[] args) {
        LargeLinkedList test = new LargeLinkedList();
        long start = System.nanoTime();
        List<Integer> elements = test.createElements();
        test.findElementsToPrint(elements);
        long end = System.nanoTime();
        System.out.println("Time: " + MILLISECONDS.convert(end - start, NANOSECONDS));
    }

    private List<Integer> createElements() {
        List<Integer> elements = new LinkedList<Integer>();
        for (int i = 0; i < 10000000; i++) {
            elements.add(i);
        }
        return elements;
    }

    private void findElementsToPrint(List<Integer> elements) {
        for (Integer x : elements) {
            if (x % 1000000 == 0) {
                System.out.println("x: " + x);
            }
        }
    }
}
Takes 3621-6749 ms with 460-480 MB. That's much more in line with the performance of the second Scala example.
Finally, a LargeArrayBuffer:
import collection.mutable.ArrayBuffer
import java.util.concurrent.TimeUnit
import java.util.concurrent.TimeUnit._

object LargeArrayBuffer extends App {
  def chrono[A](f: => A, timeUnit: TimeUnit = MILLISECONDS): (A, Long) = {
    val start = System.nanoTime()
    val result: A = f
    val end = System.nanoTime()
    (result, timeUnit.convert(end - start, NANOSECONDS))
  }

  def millions(): List[Int] = {
    val size = 10000000
    var items = new ArrayBuffer[Int](size)
    (0 to size).foreach(items += _)
    items.filter(_ % 1000000 == 0).toList
  }

  val results = chrono(millions())
  results._1.foreach(x => println("x: " + x))
  println("Time: " + results._2)
}
Taking about 2145 ms and 375 MB.
Thanks a lot for the answers.
Oh So Many Things going on here!!!
Let's start with Java int[]. Arrays in Java are the only collection that is not type erased. The run time representation of an int[] is different from the run time representation of Object[], in that it actually uses int directly. Because of that, there's no boxing involved in using it.
In memory terms, you have 40,000,000 consecutive bytes in memory that are read and written 4 at a time whenever an element is read or written.
In contrast, an ArrayList<Integer> -- as well as pretty much any other generic collection -- is composed of 40,000,000 or 80,000,000 consecutive bytes (on 32-bit and 64-bit JVMs respectively), PLUS 80,000,000 bytes spread all around memory in groups of 8 bytes. Every read and write of an element has to go through two memory spaces, and the sheer time spent handling all that memory is significant when the actual task you are doing is so fast.
So, back to Scala, and the second example, where you manipulate a List. Now, Scala's List is much more like Java's LinkedList than the grossly misnamed ArrayList. Each element of a List is composed of an object called Cons, which has 16 bytes, with a pointer to the element and a pointer to another list. So, a List of 10,000,000 elements is composed of 160,000,000 bytes spread all around memory in groups of 16 bytes, plus 80,000,000 bytes spread all around memory in groups of 8 bytes. So what was true for ArrayList is even more so for List.
Finally, Range. A Range is a sequence of integers with a lower and an upper boundary, plus a step. A Range of 10,000,000 elements is 40 bytes: three ints (not generic) for the lower and upper bounds and the step, plus a few pre-computed values (last, numRangeElements) and two other ints used for lazy val thread safety. Just to make clear, that's NOT 40 times 10,000,000: that's 40 bytes TOTAL. The size of the range is completely irrelevant, because IT DOESN'T STORE THE INDIVIDUAL ELEMENTS. Just the lower bound, upper bound and step.
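(To illustrate, here's a hypothetical Java analogue: a range is just three ints plus on-demand iteration, so its footprint is independent of how many elements it represents.)

// Hypothetical sketch, assuming a positive step: O(1) storage, elements computed on demand.
class IntRange implements Iterable<Integer> {
    final int start, end, step; // the only state, however large the range
    IntRange(int start, int end, int step) { this.start = start; this.end = end; this.step = step; }
    public java.util.Iterator<Integer> iterator() {
        return new java.util.Iterator<Integer>() {
            int next = start;
            public boolean hasNext() { return next <= end; }
            public Integer next() { int v = next; next += step; return v; }
        };
    }
}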
Now, because a Range is a Seq[Int], it still has to go through boxing for most uses: an int will be converted into an Integer and then back into an int again, which is sadly wasteful.
Cons Size Calculation
So, here's a tentative calculation of Cons. First of all, read this article about some general guidelines on how much memory an object takes. The important points are:
Java uses 8 bytes for normal objects, and 12 for object arrays, for "housekeeping" information (what's the class of this object, etc).
Objects are allocated in 8 bytes chunks. If your object is smaller than that, it will be padded to complement it.
I actually thought it was 16 bytes, not 8. Anyway, Cons is also smaller than I thought. Its fields are:
public static final long serialVersionUID; // static, doesn't count
private java.lang.Object scala$collection$immutable$$colon$colon$$hd;
private scala.collection.immutable.List tl;
References are at least 4 bytes (could be more on a 64-bit JVM). So we have:
8 bytes Java header
4 bytes hd
4 bytes tl
Which makes it only 16 bytes long. Pretty good, actually. In the example, hd will point to an Integer object, which I assume is 8 bytes long. As for tl, it points to another cons, which we are already counting.
I'm going to revise the estimates, with actual data where possible.
In the first example you create a linked list with 10 elements by computing the steps of the range.
In the second example you create a linked list with 10 million elements and filter it down to a new linked list with 10 elements.
In the third example you create an array-backed buffer with 10 million elements, which you traverse and print; no new array-backed buffer is created.
Conclusion:
Every piece of code does something different, which is why the performance varies greatly.
This is an educated guess ...
I think it is because in the fast version the Scala compiler is able to translate the key statement into something like this (in Java):
List<Integer> millions = new ArrayList<Integer>();
for (int i = 0; i <= 10000000; i++) {
    if (i % 1000000 == 0) {
        millions.add(i);
    }
}
As you can see, (0 to 10000000) doesn't generate an intermediate list of 10,000,000 Integer objects.
By contrast, in the slow version the Scala compiler is not able to do that optimization, and is generating that list.
(The intermediate data structure could possibly be an int[], but the observed JVM size suggests that it is not.)
It's hard to read the Scala source on my iPad, but it looks like Range's constructor isn't actually producing a list, just remembering the start, increment and end. It uses these to produce its values on request, so that iterating over a range is a lot closer to a simple for loop than examining the elements of an array.
As soon as you say range.toList you are forcing Scala to produce a linked list of the 'values' in the range (allocating memory for both the values and the links), and then you are iterating over that. Being a linked list the performance of this is going to be worse than your Java ArrayList example.
