Related
I recently decided to play around with Java's new incubating Vector API, to see how fast it can get. I implemented two fairly simple methods, one for parsing an int and one for finding the index of a character in a string. In both cases, my vectorized methods were incredibly slow compared to their scalar equivalents.
Here's my code:
import java.nio.charset.StandardCharsets;

import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorOperators;

public class SIMDParse {
    // Place values for up to 16 digits; the number being parsed is right-aligned.
    private static IntVector mul = IntVector.fromArray(
            IntVector.SPECIES_512,
            new int[] {0, 0, 0, 0, 0, 0, 1000000000, 100000000, 10000000, 1000000, 100000, 10000, 1000, 100, 10, 1},
            0
    );
    private static byte zeroChar = (byte) '0';
    private static int width = IntVector.SPECIES_512.length();
    private static byte[] filler;

    static {
        filler = new byte[16];
        for (int i = 0; i < 16; i++) {
            filler[i] = zeroChar;
        }
    }

    public static int parseInt(String str) {
        boolean negative = str.charAt(0) == '-';
        byte[] bytes = str.getBytes(StandardCharsets.UTF_8);
        if (negative) {
            bytes[0] = zeroChar;
        }
        // Left-pad with '0' bytes so the digits are right-aligned in the vector.
        bytes = ensureSize(bytes, width);
        ByteVector vec = ByteVector.fromArray(ByteVector.SPECIES_128, bytes, 0);
        vec = vec.sub(zeroChar); // ASCII digits -> numeric values
        IntVector ints = (IntVector) vec.castShape(IntVector.SPECIES_512, 0);
        ints = ints.mul(mul);
        return ints.reduceLanes(VectorOperators.ADD) * (negative ? -1 : 1);
    }

    // Left-pads to a multiple of per with '0' bytes (allocates and copies).
    public static byte[] ensureSize(byte[] arr, int per) {
        int mod = arr.length % per;
        if (mod == 0) {
            return arr;
        }
        int length = arr.length - mod;
        length += per;
        byte[] newArr = new byte[length];
        System.arraycopy(arr, 0, newArr, per - mod, arr.length);
        System.arraycopy(filler, 0, newArr, 0, per - mod);
        return newArr;
    }

    // Right-pads to a multiple of per with zero bytes (allocates and copies).
    public static byte[] ensureSize2(byte[] arr, int per) {
        int mod = arr.length % per;
        if (mod == 0) {
            return arr;
        }
        int length = arr.length - mod;
        length += per;
        byte[] newArr = new byte[length];
        System.arraycopy(arr, 0, newArr, 0, arr.length);
        return newArr;
    }

    public static int indexOf(String s, char c) {
        byte[] b = s.getBytes(StandardCharsets.UTF_8);
        int width = ByteVector.SPECIES_MAX.length();
        byte bChar = (byte) c;
        b = ensureSize2(b, width);
        for (int i = 0; i < b.length; i += width) {
            ByteVector vec = ByteVector.fromArray(ByteVector.SPECIES_MAX, b, i);
            // firstTrue() returns the vector length when no lane matched.
            int pos = vec.compare(VectorOperators.EQ, bChar).firstTrue();
            if (pos != width) {
                return pos + i;
            }
        }
        return -1;
    }
}
I fully expected my int parsing to be slower, since it won't ever be handling more than the vector size can hold (an int can never be more than 10 digits long).
By my benchmarks, parsing 123 as an int 10k times took 3081 microseconds for Integer.parseInt, and 80601 microseconds for my implementation. Searching for 'a' in a very long string ("____".repeat(4000) + "a" + "----".repeat(193)) took 7709 microseconds, versus 7 for String#indexOf.
Why is it so unbelievably slow? I thought the entire point of SIMD is that it's faster than the scalar equivalents for tasks like these.
You picked something SIMD is not great at (string->int), and something that JVMs are very good at optimizing out of loops. And you made an implementation with a bunch of extra copying work if the inputs aren't exact multiples of the vector width.
I'm assuming your times are totals (for 10k repeats each), not a per-call average.
7 us is impossibly fast for that.
"____".repeat(4000) is 16k bytes before the 'a', which I assume is what you're searching for. Even a well-tuned / unrolled memchr (aka indexOf) running at 2x 32-byte vectors per clock cycle, on a 4GHz CPU, would take 625 us for 10k reps. (16000B / (64B/c) * 10000 reps / 4000 MHz). And yes, I'd expect a JVM to either call the native memchr or use something equally efficient for a commonly-used core library function like String#indexOf. For example, glibc's avx2 memchr is pretty well-tuned with loop unrolling; if you're on Linux, your JVM might be calling it.
Built-in String indexOf is also something the JIT "knows about". It's apparently able to hoist it out of loops when it can see that you're using the same string repeatedly as input. (But then what's it doing for the rest of those 7 us? I guess doing a not-quite-so-great memchr and then doing an empty 10k iteration loop at 1/clock could take about 7 microseconds, especially if your CPU isn't as fast as 4GHz.)
See Idiomatic way of performance evaluation? - if doubling the repeat-count to 20k doesn't double the time, your benchmark is broken and not measuring what you think it does.
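A harness such as JMH takes care of warmup and dead-code elimination for you; here's a minimal sketch (the class and method names are made up for illustration):

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
public class IndexOfBench {
    // Same shape of input as the question: the 'a' sits 16000 bytes in.
    final String haystack = "____".repeat(4000) + "a" + "----".repeat(193);

    @Benchmark
    public int builtinIndexOf() {
        // Returning the result keeps the JIT from optimizing the call away.
        return haystack.indexOf('a');
    }
}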
Your manual SIMD indexOf is very unlikely to get optimized out of a loop. It makes a copy of the whole array every time, if the size isn't an exact multiple of the vector width!! (In ensureSize2). The normal technique is to fall back to scalar for the last size % width elements, which is obviously much better for large arrays. Or even better, do an unaligned load that ends at the end of the array (if the total size is >= vector width) for something where overlap with previous work is not a problem.
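For example, here's a copy-free sketch of the question's indexOf using the same Vector API calls, with the tail handled by a scalar loop:

import java.nio.charset.StandardCharsets;

import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class VectorIndexOf {
    private static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_PREFERRED;

    public static int indexOf(String s, char c) {
        byte[] b = s.getBytes(StandardCharsets.UTF_8);
        byte target = (byte) c;
        // Vector loop over the largest prefix that is a multiple of the
        // vector width; loopBound rounds b.length down, so no copying.
        int upper = SPECIES.loopBound(b.length);
        for (int i = 0; i < upper; i += SPECIES.length()) {
            ByteVector vec = ByteVector.fromArray(SPECIES, b, i);
            int pos = vec.compare(VectorOperators.EQ, target).firstTrue();
            if (pos != SPECIES.length()) {
                return i + pos;
            }
        }
        // Scalar fallback for the remaining tail bytes.
        for (int i = upper; i < b.length; i++) {
            if (b[i] == target) {
                return i;
            }
        }
        return -1;
    }
}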
A decent memchr on modern x86 (using an algorithm like your indexOf without unrolling) should go at about 1 vector (16/32/64 bytes) per maybe 1.5 clock cycles, with data hot in L1d cache, without loop unrolling or anything. (Checking both the vector compare and the pointer bound as possible loop exit conditions takes extra asm instructions vs. a simple strlen, but see this answer for some microbenchmarks of a simple hand-written strlen that assumes aligned buffers). Probably your indexOf loop bottlenecks on front-end throughput on a CPU like Skylake, with its pipeline width of 4 uops/clock.
So let's guess that your implementation takes 1.5 cycles per 16-byte vector - perhaps you're on a CPU without AVX2? You didn't say.
16kB / 16B = 1000 vectors. At 1 vector per 1.5 clocks, that's 1500 cycles. On a 3GHz machine, 1500 cycles takes 500 ns = 0.5 us per call, or 5000 us per 10k reps. But since 16194 bytes isn't a multiple of 16, you're also copying the whole thing every call, so that costs some more time, and could plausibly account for your 7709 us total time.
What SIMD is good for
"for tasks like these."
No, "horizontal" stuff like ints.reduceLanes is something SIMD is generally slow at. And even with something like How to implement atoi using SIMD? using x86 pmaddwd to multiply and add pairs horizontally, it's still a lot of work.
Note that to make the elements wide enough to multiply by place-values without overflow, you have to unpack, which costs some shuffling. ints.reduceLanes takes about log2(elements) shuffle/add steps, and if you're starting with 512-bit AVX-512 vectors of int, the first 2 of those shuffles are lane-crossing, 3 cycle latency (https://agner.org/optimize/). (Or if your machine doesn't even have AVX2, then a 512-bit integer vector is actually 4x 128-bit vectors. And you had to do separate work to unpack each part. But at least the reduction will be cheap, just vertical adds until you get down to a single 128-bit vector.)
Hmm. I found this post because I've hit something strange with the Vector performance for something that it ostensibly should be ideal for: multiplying two double arrays.
static private void doVector(int iteration, double[] input1, double[] input2, double[] output) {
    Instant start = Instant.now();
    for (int i = 0; i < SPECIES.loopBound(ARRAY_LENGTH); i += SPECIES.length()) {
        DoubleVector va = DoubleVector.fromArray(SPECIES, input1, i);
        DoubleVector vb = DoubleVector.fromArray(SPECIES, input2, i);
        va.mul(vb); // result discarded - this multiply is redundant
        // toArray() allocates a fresh temporary array on every iteration
        System.arraycopy(va.mul(vb).toArray(), 0, output, i, SPECIES.length());
    }
    Instant finish = Instant.now();
    System.out.println("vector duration " + iteration + ": " + Duration.between(start, finish).getNano());
}
The species length comes out at 4 on my machine (CPU is Intel i7-7700HQ at 2.8 GHz).
On my first attempt the execution was taking more than 15 milliseconds (compared with 0 for the scalar equivalent), even with a tiny array length (8 elements). On a hunch I added the iteration number to see whether something had to warm up - and indeed, the first iteration still ALWAYS takes ages (44 ms for 65536 elements). Whilst most of the other iterations report zero time, a few take around 15 ms, and they are randomly distributed (i.e. not always the same iteration index on each run). I sort of expect that (because I'm measuring wall-clock time, and other stuff will be going on).
However, overall for an array size of 65536 elements, and 32 iterations, the total duration for the vector approach is 2-3 times longer than that for the scalar one.
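Part of that overhead is self-inflicted: the loop multiplies twice and round-trips through a temporary array from toArray(). A leaner sketch, assuming the same SPECIES and ARRAY_LENGTH constants from the code above, writes the products straight into the output:

static private void doVectorDirect(double[] input1, double[] input2, double[] output) {
    for (int i = 0; i < SPECIES.loopBound(ARRAY_LENGTH); i += SPECIES.length()) {
        DoubleVector va = DoubleVector.fromArray(SPECIES, input1, i);
        DoubleVector vb = DoubleVector.fromArray(SPECIES, input2, i);
        va.mul(vb).intoArray(output, i); // one multiply, no temporary array
    }
}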
I don't do Java much.
I am writing some optimized math code and I was shocked by my profiler results. My code collects values, interleaves the data and then chooses the values based on that. Java runs slower than my C++ and MATLAB implementations.
I am using javac 1.7.0_05 with the Sun/Oracle JDK 1.7.0_05.
A floor function performs the relevant task in the code. Does anybody know the paradigmatic way to fix this?
I noticed that my floor() function is defined with something called StrictMath. Is there something like -ffast-math for Java? I am expecting there must be a way to change the floor function to something more computationally reasonable without writing my own.
public static double floor(double a) {
    return StrictMath.floor(a); // default impl. delegates to StrictMath
}
Edit
So a few people suggested I try a cast. I tried this and there was absolutely no change in wall time.
private static int flur(float dF)
{
    return (int) dF;
}
413742 cast floor function
394675 Math.floor
These tests were run without the profiler. An effort was made to use a profiler, but the runtime was drastically altered (15+ minutes), so I quit.
You might want to give FastMath a try.
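FastMath mirrors java.lang.Math's API, so swapping it in is mechanical. A minimal sketch, assuming the Commons Math 3 artifact (earlier releases shipped it in the org.apache.commons.math.util package):

import org.apache.commons.math3.util.FastMath;

public class FastMathDemo {
    public static void main(String[] args) {
        // Drop-in replacement for Math.floor / StrictMath.floor.
        System.out.println(FastMath.floor(3.7));   // 3.0
        System.out.println(FastMath.floor(-3.7));  // -4.0
    }
}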
Here is a post about the performance of Math in Java vs. JavaScript. There are a few good hints about why the default math lib is slow. They discuss operations other than floor, but I guess their findings can be generalized. I found it interesting.
EDIT
According to this bug entry, floor has been implemented in pure Java code in 7 (b79) and 6u21 (b01), resulting in better performance. The code of floor in JDK 6 is still a bit longer than the one in FastMath, but might not be responsible for such a performance degradation. What JDK are you using? Could you try with a more recent version?
Here's a sanity check for your hypothesis that the code is really spending 99% of its time in floor. Let's assume that you have Java and C++ versions of the algorithm that are both correct in terms of the outputs they produce. For the sake of the argument, let us assume that the two versions call the equivalent floor functions the same number of times. So a time function is
t(input) = nosFloorCalls(input) * floorTime + otherTime(input)
where floorTime is the time taken for a call to floor on the platform.
Now if your hypothesis is correct, and floorTime is vastly more expensive on Java (to the extent that it takes roughly 99% of the execution time) then you would expect the Java version of the application to run a large factor (50 times or more) slower than the C++ version. If you don't see this, then your hypothesis most likely is false.
If the hypothesis is false, here are two alternative explanations for the profiling results.
This is a measurement anomaly; i.e. the profiler has somehow got it wrong. Try using a different profiler.
There is a bug in the Java version of your code that is causing it to call floor many, many more times than in the C++ version of the code.
Math.floor() is insanely fast on my machine at around 7 nanoseconds per call in a tight loop. (Windows 7, Eclipse, Oracle JDK 7). I'd expect it to be very fast in pretty much all circumstances and would be extremely surprised if it turned out to be the bottleneck.
Some ideas:
I'd suggest re-running some benchmarks without a profiler running. It sometimes happens that profilers create spurious overhead when they instrument the binary - particularly for small functions like Math.floor() that are likely to be inlined.
Try a couple of different JVMs, you might have hit an obscure bug
Try the FastMath class in the excellent Apache Commons Math library, which includes a new implementation of floor. I'd be really surprised if it is faster, but you never know.
Check that you are not running any virtualisation technology or similar that might be interfering with Java's ability to call native code (which is used in a few of the java.lang.Math functions, including Math.floor())
It is worth noting that monitoring a method takes some overhead and in the case of VisualVM, this is fairly high. If you have a method which is called often but does very little it can appear to use lots of CPU. e.g. I have seen Integer.hashCode() as a big hitter once. ;)
On my machine a floor takes about 5.6 ns, but a cast takes 2.3 ns. You might like to try this on your machine.
Unless you need to handle corner cases, a plain cast is faster.
// Rounds towards zero, instead of negative infinity.
public static double floor(double a) {
    return a < 0 ? -(long) -a : (long) a;
}

public static void main(String... args) {
    int size = 100000;
    double[] a = new double[size];
    double[] b = new double[size];
    double[] c = new double[size];
    for (int i = 0; i < a.length; i++) a[i] = Math.random() * 1e6;
    for (int i = 0; i < 5; i++) {
        timeCast(a, b);
        timeFloor(a, c);
        // compare with j, not the outer loop counter
        for (int j = 0; j < size; j++)
            if (b[j] != c[j])
                System.err.println(a[j] + ": " + b[j] + " " + c[j]);
    }
}
private static void timeCast(double[] from, double[] to) {
    long start = System.nanoTime();
    for (int i = 0; i < from.length; i++)
        to[i] = floor(from[i]);
    long time = System.nanoTime() - start;
    System.out.printf("Cast took an average of %.1f ns%n", (double) time / from.length);
}

private static void timeFloor(double[] from, double[] to) {
    long start = System.nanoTime();
    for (int i = 0; i < from.length; i++)
        to[i] = Math.floor(from[i]);
    long time = System.nanoTime() - start;
    System.out.printf("Math.floor took an average of %.1f ns%n", (double) time / from.length);
}
prints
Cast took an average of 62.1 ns
Math.floor took an average of 123.6 ns
Cast took an average of 61.9 ns
Math.floor took an average of 6.3 ns
Cast took an average of 47.2 ns
Math.floor took an average of 6.5 ns
Cast took an average of 2.3 ns
Math.floor took an average of 5.6 ns
Cast took an average of 2.3 ns
Math.floor took an average of 5.6 ns
First of all: your profiler shows that you're spending 99% of the CPU time in the floor function. This does not indicate that floor is slow. If you do nothing but floor(), that's totally sane. Since other languages seem to implement floor more efficiently, your assumption may be correct, however.
I know from school that a naive implementation of floor (which works only for positive numbers and is one off for negative ones) can be done by casting to an integer/long. That is language agnostic and some sort of general knowledge from CS courses.
Here are some micro benches. Works on my machine and backs what I learned in school ;)
rataman@RWW009 ~/Desktop
$ javac Cast.java && java Cast
10000000 Rounds of Casts took 16 ms

rataman@RWW009 ~/Desktop
$ javac Floor.java && java Floor
10000000 Rounds of Floor took 140 ms
// One source file, two variants: compile as Cast.java or Floor.java with the
// matching class name, return statement and message.
public class Cast {
    private static final int ROUNDS = 10000000;

    public static void main(String[] args)
    {
        double[] vals = new double[ROUNDS];
        double[] res = new double[ROUNDS];
        // awesome testdata
        for (int i = 0; i < ROUNDS; i++)
        {
            vals[i] = Math.random() * 10.0;
        }
        // warmup
        for (int i = 0; i < ROUNDS; i++)
        {
            res[i] = floor(vals[i]);
        }
        long start = System.currentTimeMillis();
        for (int i = 0; i < ROUNDS; i++)
        {
            res[i] = floor(vals[i]);
        }
        System.out.println(ROUNDS + " Rounds of Casts took " + (System.currentTimeMillis() - start) + " ms");
    }

    private static double floor(double arg)
    {
        // Cast.java
        return (int) arg;
        // or Floor.java
        // return Math.floor(arg);
    }
}
Math.floor (and Math.ceil) can be a surprising bottleneck if your algorithm depends on it a lot.
This is because these functions handle edge cases that you might not care about (such as minus-zero and positive-zero etc). Just look at the implementation of these functions to see what they're actually doing; there's a surprising amount of branching in there.
Also consider that Math.floor/ceil take only a double as an argument and return a double, which you might not want. If you just want an int or long, some of the checks in Math.floor are simply unnecessary.
Some have suggested to simply cast to an int, which will work as long as your values are positive (and your algorithm doesn't depend on the edge cases that Math.floor checks for). If that's the case, a simple cast is the fastest solution by quite a margin (in my experience).
If for example your values can be negative and you want an int from a float, you can do something like this:
public static final int floor(final float value) {
    return ((int) value) - (Float.floatToRawIntBits(value) >>> 31);
}
(It just subtracts the float's sign bit from the cast to correct for negative numbers, while preventing an "if". Note that it subtracts 1 for every negative input, so for negative values that are already whole numbers, such as -2.0f, the result is one below Math.floor.)
In my experience, this is a lot faster than Math.floor. If it isn't, I suggest to check your algorithm, or perhaps you've ran into JVM performance bug (which is much less likely).
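If negative whole numbers can occur, one extra comparison fixes that edge case; a sketch (floorExact is a made-up name):

// Matches Math.floor for any float whose floor fits in an int,
// including negative whole numbers and -0.0f.
public static int floorExact(final float value) {
    final int cast = (int) value;           // truncates towards zero
    return cast - ((value < cast) ? 1 : 0); // step down only for fractional negatives
}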
I have a Java method that repeatedly evaluates the following expression in a very tight loop with a large number of repetitions:
Math.abs(a - b) - Math.abs(c - d)
a, b, c and d are long values that can span the whole range of their type. They are different in each loop iteration and they do not satisfy any invariant that I know of.
The profiler indicates that a significant portion of the processor time is spent in this method. While I will pursue other avenues of optimization first, I was wondering if there is a smarter way to calculate the aforementioned expression.
Apart from inlining the Math.abs() calls manually for a very slight (if any) performance gain, is there any mathematical trick that I could use to speed-up the evaluation of this expression?
I suspect the profiler isn't giving you a true result, as it is trying to profile (and thus adding overhead to) such a trivial "method". Without the profiler, Math.abs can be turned into a small number of machine code instructions, and you won't be able to make it faster than that.
I suggest you do a micro-benchmark to confirm this. I would expect loading the data to be an order of magnitude more expensive.
long a = 10, b = 6, c = -2, d = 3;
int runs = 1000 * 1000 * 1000;
long start = System.nanoTime();
for (int i = 0; i < runs; i += 2) {
    long r = Math.abs(i - a) - Math.abs(c - i);
    long r2 = Math.abs(i - b) - Math.abs(d - i);
    // consume the results so the JIT cannot eliminate the loop body
    if (r + r2 < Integer.MIN_VALUE) throw new AssertionError();
}
long time = System.nanoTime() - start;
System.out.printf("Took an average of %.1f ns per abs-abs. %n", (double) time / runs);
prints
Took an average of 0.9 ns per abs-abs.
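That 0.9 ns figure is plausible: for long arguments the JIT compiles Math.abs down to a couple of branch-free instructions, roughly equivalent to this classic formulation (shown for illustration only):

// Branch-free |x| for long. Like Math.abs, it returns Long.MIN_VALUE
// unchanged, since its absolute value does not fit in a long.
static long abs(long x) {
    long mask = x >> 63;      // 0 when x >= 0, -1 when x < 0
    return (x + mask) ^ mask; // identity for mask == 0, negation for mask == -1
}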
I ended up using this little method:
public static long diff(final long a, final long b, final long c, final long d) {
    final long a0 = (a < b) ? (b - a) : (a - b);
    final long a1 = (c < d) ? (d - c) : (c - d);
    return a0 - a1;
}
I experienced a measurable performance increase - about 10-15% for the whole application. I believe this is mostly due to:
The elimination of a method call: Rather than calling Math.abs() twice, I call this method once. Sure, static method calls are not inordinately expensive, but they still have an impact.
The elimination of a couple of negation operations: This may be offset by the slightly increased size of the code, but I'll happily fool myself into believing that it actually made a difference.
EDIT:
It seems that it's actually the other way around. Explicitly inlining the code does not seem to impact the performance in my micro-benchmark. Changing the way the absolute values are calculated does...
You can always try to unroll the functions and hand optimize, if you don't get more cache misses it might be faster.
If I got the unrolling right it could be something like this:
if (a < b)
{
    if (c < d)
    {
        r = b - a - d + c;
    }
    else
    {
        r = b - a + d - c;
    }
}
else
{
    if (c < d)
    {
        r = a - b - d + c;
    }
    else
    {
        r = a - b + d - c;
    }
}
Are you sure it's the method itself that causes the problem? Maybe it's an enormous number of invocations of this method, and you just see the aggregated result (like TIME_OF_METHOD_EXECUTION X NUMBER_OF_INVOCATIONS) in your profiler?
I ran a set of performance benchmarks for 10,000,000 elements, and I've discovered that the results vary greatly with each implementation.
Can anybody explain why creating a Range.ByOne results in performance that is better than a simple array of primitives, but converting that same range to a list results in even worse performance than the worst-case scenario?
Create 10,000,000 elements, and print out those that are multiples of 1,000,000. The JVM size is always set to the same min and max: -Xms?m -Xmx?m
import java.util.concurrent.TimeUnit
import java.util.concurrent.TimeUnit._

object LightAndFastRange extends App {
  def chrono[A](f: => A, timeUnit: TimeUnit = MILLISECONDS): (A, Long) = {
    val start = System.nanoTime()
    val result: A = f
    val end = System.nanoTime()
    (result, timeUnit.convert(end - start, NANOSECONDS))
  }

  def millions(): List[Int] = (0 to 10000000).filter(_ % 1000000 == 0).toList

  val results = chrono(millions())
  results._1.foreach(x => println("x: " + x))
  println("Time: " + results._2)
}
It takes 141 milliseconds with a JVM size of 27m
In comparison, converting to List affects performance dramatically:
import java.util.concurrent.TimeUnit
import java.util.concurrent.TimeUnit._

object LargeLinkedList extends App {
  def chrono[A](f: => A, timeUnit: TimeUnit = MILLISECONDS): (A, Long) = {
    val start = System.nanoTime()
    val result: A = f
    val end = System.nanoTime()
    (result, timeUnit.convert(end - start, NANOSECONDS))
  }

  val results = chrono((0 to 10000000).toList.filter(_ % 1000000 == 0))
  results._1.foreach(x => println("x: " + x))
  println("Time: " + results._2)
}
It takes 8514-10896 ms with a JVM size of 455-460 m.
In contrast, this Java implementation uses an array of primitives
import static java.util.concurrent.TimeUnit.*;

public class LargePrimitiveArray {
    public static void main(String[] args) {
        long start = System.nanoTime();
        int[] elements = new int[10000000];
        for (int i = 0; i < 10000000; i++) {
            elements[i] = i;
        }
        for (int i = 0; i < 10000000; i++) {
            if (elements[i] % 1000000 == 0) {
                System.out.println("x: " + elements[i]);
            }
        }
        long end = System.nanoTime();
        System.out.println("Time: " + MILLISECONDS.convert(end - start, NANOSECONDS));
    }
}
It takes 116ms with JVM size of 59m
Java List of Integers
import java.util.List;
import java.util.ArrayList;
import static java.util.concurrent.TimeUnit.*;

public class LargeArrayList {
    public static void main(String[] args) {
        long start = System.nanoTime();
        List<Integer> elements = new ArrayList<Integer>();
        for (int i = 0; i < 10000000; i++) {
            elements.add(i);
        }
        for (Integer x : elements) {
            if (x % 1000000 == 0) {
                System.out.println("x: " + x);
            }
        }
        long end = System.nanoTime();
        System.out.println("Time: " + MILLISECONDS.convert(end - start, NANOSECONDS));
    }
}
It takes 3993 ms with JVM size of 283m
My question is: why is the first example so performant, while the second is so badly affected? I tried creating views, but wasn't successful at reproducing the performance benefits of the range.
All tests running on Mac OS X Snow Leopard,
Java 6u26 64-Bit Server
Scala 2.9.1.final
EDIT:
For completeness, here's the actual implementation using a LinkedList (which is a fairer comparison in terms of space than ArrayList since, as rightly pointed out, Scala's Lists are linked):
import java.util.List;
import java.util.LinkedList;
import static java.util.concurrent.TimeUnit.*;

public class LargeLinkedList {
    public static void main(String[] args) {
        LargeLinkedList test = new LargeLinkedList();
        long start = System.nanoTime();
        List<Integer> elements = test.createElements();
        test.findElementsToPrint(elements);
        long end = System.nanoTime();
        System.out.println("Time: " + MILLISECONDS.convert(end - start, NANOSECONDS));
    }

    private List<Integer> createElements() {
        List<Integer> elements = new LinkedList<Integer>();
        for (int i = 0; i < 10000000; i++) {
            elements.add(i);
        }
        return elements;
    }

    private void findElementsToPrint(List<Integer> elements) {
        for (Integer x : elements) {
            if (x % 1000000 == 0) {
                System.out.println("x: " + x);
            }
        }
    }
}
Takes 3621-6749 ms with 460-480 MB. That's much more in line with the performance of the second Scala example.
Finally, a LargeArrayBuffer:
import collection.mutable.ArrayBuffer
import java.util.concurrent.TimeUnit
import java.util.concurrent.TimeUnit._

object LargeArrayBuffer extends App {
  def chrono[A](f: => A, timeUnit: TimeUnit = MILLISECONDS): (A, Long) = {
    val start = System.nanoTime()
    val result: A = f
    val end = System.nanoTime()
    (result, timeUnit.convert(end - start, NANOSECONDS))
  }

  def millions(): List[Int] = {
    val size = 10000000
    val items = new ArrayBuffer[Int](size)
    (0 to size).foreach(items += _)
    items.filter(_ % 1000000 == 0).toList
  }

  val results = chrono(millions())
  results._1.foreach(x => println("x: " + x))
  println("Time: " + results._2)
}
Taking about 2145 ms and 375 MB.
Thanks a lot for the answers.
Oh So Many Things going on here!!!
Let's start with Java int[]. Arrays in Java are the only collection that is not type erased. The run time representation of an int[] is different from the run time representation of Object[], in that it actually uses int directly. Because of that, there's no boxing involved in using it.
In memory terms, you have 40,000,000 consecutive bytes in memory, which are read and written 4 at a time whenever an element is read or written.
In contrast, an ArrayList<Integer> -- as well as pretty much any other generic collection -- is composed of 40,000,000 or 80,000,000 consecutive bytes (on 32- and 64-bit JVMs respectively), PLUS 80,000,000 bytes spread all around memory in groups of 8 bytes. Every read and write to an element has to go through two memory spaces, and the sheer time spent handling all that memory is significant when the actual task you are doing is so fast.
So, back to Scala, for the second example where you manipulate a List. Now, Scala's List is much more like Java's LinkedList than the grossly misnamed ArrayList. Each element of a List is composed of an object called Cons, which has 16 bytes, with a pointer to the element and a pointer to another list. So, a List of 10,000,000 elements is composed of 160,000,000 bytes spread all around memory in groups of 16 bytes, plus 80,000,000 bytes spread all around memory in groups of 8 bytes. So what was true for ArrayList is even more so for List.
Finally, Range. A Range is a sequence of integers with a lower and an upper boundary, plus a step. A Range of 10,000,000 elements is 40 bytes: three ints (not generic) for lower and upper bounds and step, plus a few pre-computed values (last, numRangeElements) and two other ints used for lazy val thread safety. Just to make clear, that's NOT 40 times 10,000,000: that's 40 bytes TOTAL. The size of the range is completely irrelevant, because IT DOESN'T STORE THE INDIVIDUAL ELEMENTS. Just the lower bound, upper bound and step.
Now, because a Range is a Seq[Int], it still has to go through boxing for most uses: an int will be converted into an Integer and then back into an int again, which is sadly wasteful.
Cons Size Calculation
So, here's a tentative calculation of Cons. First of all, read this article about some general guidelines on how much memory an object takes. The important points are:
Java uses 8 bytes for normal objects, and 12 for object arrays, for "housekeeping" information (what's the class of this object, etc).
Objects are allocated in 8-byte chunks. If your object is smaller than that, it will be padded out to a multiple of 8.
I actually thought it was 16 bytes, not 8. Anyway, Cons is also smaller than I thought. Its fields are:
public static final long serialVersionUID; // static, doesn't count
private java.lang.Object scala$collection$immutable$$colon$colon$$hd;
private scala.collection.immutable.List tl;
References are at least 4 bytes (could be more on 64-bit JVMs). So we have:
8 bytes Java header
4 bytes hd
4 bytes tl
Which makes it only 16 bytes long. Pretty good, actually. In the example, hd will point to an Integer object, which I assume is 8 bytes long. As for tl, it points to another cons, which we are already counting.
I'm going to revise the estimates, with actual data where possible.
In the first example you create a linked list with 10 elements by computing the steps of the range.
In the second example you create a linked list with 10 millions of elements and filter it down to a new linked list with 10 elements.
In the third example you create an array-backed buffer with 10 millions of elements which you traverse and print, no new array-backed buffer is created.
Conclusion:
Every piece of code does something different, that's why the performance varies greatly.
This is an educated guess ...
I think it is because in the fast version the Scala compiler is able to translate the key statement into something like this (in Java):
List<Integer> millions = new ArrayList<Integer>();
for (int i = 0; i <= 10000000; i++) {
    if (i % 1000000 == 0) {
        millions.add(i);
    }
}
As you can see, (0 to 10000000) doesn't generate an intermediate list of 10,000,000 Integer objects.
By contrast, in the slow version the Scala compiler is not able to do that optimization, and is generating that list.
(The intermediate data structure could possibly be an int[], but the observed JVM size suggests that it is not.)
It's hard to read the Scala source on my iPad, but it looks like Range's constructor isn't actually producing a list, just remembering the start, increment and end. It uses these to produce its values on request, so that iterating over a range is a lot closer to a simple for loop than examining the elements of an array.
As soon as you say range.toList you are forcing Scala to produce a linked list of the 'values' in the range (allocating memory for both the values and the links), and then you are iterating over that. Being a linked list the performance of this is going to be worse than your Java ArrayList example.
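In Java terms, the difference is roughly the following (a sketch; the names are mine):

import java.util.LinkedList;
import java.util.List;

public class RangeSketch {
    // Iterating a Range: each value is computed from start/step on demand,
    // nothing is stored.
    static long sumRange(int start, int end, int step) {
        long sum = 0;
        for (int i = start; i <= end; i += step) {
            sum += i;
        }
        return sum;
    }

    // range.toList: every value is boxed and put in a freshly allocated
    // node, which is where the time and memory go.
    static List<Integer> materialize(int start, int end, int step) {
        List<Integer> values = new LinkedList<>();
        for (int i = start; i <= end; i += step) {
            values.add(i); // one Integer box + one list node per element
        }
        return values;
    }
}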
I just came across this seemingly innocuous comment, benchmarking ArrayList vs a raw String array. It's from a couple years ago, but the OP writes
I did notice that using for String s: stringsList was about 50% slower than using an old-style for-loop to access the list. Go figure...
Nobody commented on it in the original post, and the test seemed a little dubious (too short to be accurate), but I nearly fell out of my chair when I read it. I've never benchmarked an enhanced loop against a "traditional" one, but I'm currently working on a project that does hundreds of millions of iterations over ArrayList instances using enhanced loops so this is a concern to me.
I'm going to do some benchmarking and post my findings here, but this is obviously a big concern to me. I could find precious little info online about relative performance, except for a couple offhand mentions that enhanced loops for ArrayLists do run a lot slower under Android.
Has anybody experienced this? Does such a performance gap still exist? I'll post my findings here, but was very surprised to read it. I suspect that if this performance gap did exist, it has been fixed in more modern VM's, but I guess now I'll have to do some testing and confirm.
Update: I made some changes to my code, but was already suspecting what others here have already pointed out: sure, the enhanced for loop is slower, but outside of very trivial tight loops, the cost should be a minuscule fraction of the cost of the logic of the loop. In my case, even though I'm iterating over very large lists of strings using enhanced loops, my logic inside the loop is complex enough that I couldn't even measure a difference after switching to index-based loops.
TL;DR: enhanced loops are indeed slower than a traditional index-based loop over an ArrayList; but for most applications the difference should be negligible.
The problem you have is that using an Iterator will be slower than using a direct lookup. On my machine the difference is about 0.13 ns per iteration. Using an array instead saves about 0.15 ns per iteration. This should be trivial in 99% of situations.
public static void main(String... args) {
    int testLength = 100 * 1000 * 1000;
    String[] stringArray = new String[testLength];
    Arrays.fill(stringArray, "a");
    List<String> stringList = new ArrayList<String>(Arrays.asList(stringArray));
    {
        long start = System.nanoTime();
        long total = 0;
        for (String str : stringArray) {
            total += str.length();
        }
        System.out.printf("The for each Array loop time was %.2f ns total=%d%n", (double) (System.nanoTime() - start) / testLength, total);
    }
    {
        long start = System.nanoTime();
        long total = 0;
        for (int i = 0, stringListSize = stringList.size(); i < stringListSize; i++) {
            String str = stringList.get(i);
            total += str.length();
        }
        System.out.printf("The for/get List loop time was %.2f ns total=%d%n", (double) (System.nanoTime() - start) / testLength, total);
    }
    {
        long start = System.nanoTime();
        long total = 0;
        for (String str : stringList) {
            total += str.length();
        }
        System.out.printf("The for each List loop time was %.2f ns total=%d%n", (double) (System.nanoTime() - start) / testLength, total);
    }
}
When run with one billion entries, this prints (using Java 6 update 26):
The for each Array loop time was 0.76 ns total=1000000000
The for/get List loop time was 0.91 ns total=1000000000
The for each List loop time was 1.04 ns total=1000000000
When run with one billion entries, this prints (using OpenJDK 7):
The for each Array loop time was 0.76 ns total=1000000000
The for/get List loop time was 0.91 ns total=1000000000
The for each List loop time was 1.04 ns total=1000000000
i.e. exactly the same. ;)
Every claim that X is slower than Y on a JVM which does not address all the issues presented in this article and its second part spreads fear and lies about the performance of a typical JVM. This applies to the comment referred to by the original question as well as to GravityBringer's answer. I am sorry to be so rude, but unless you use appropriate micro-benchmarking technology, your benchmarks produce really badly skewed random numbers.
Tell me if you're interested in more explanations. Although it is all in the articles I referred to.
GravityBringer's number doesn't seem right, because I know ArrayList.get() is as fast as raw array access after VM optimization.
I ran GravityBringer's test twice on my machine, in -server mode:
50574847
43872295
30494292
30787885
(2nd round)
33865894
32939945
33362063
33165376
The bottleneck in such tests is actually memory read/write. Judging from the numbers, the entire 2 arrays are in my L2 cache. If we decrease the size to fit L1 cache, or if we increase the size beyond L2 cache, we'll see 10X throughput difference.
The iterator of ArrayList uses a single int counter. Even if the VM doesn't put it in a register (the loop body is too complex), at least it will be in the L1 cache, so reads and writes of it are basically free.
The ultimate answer of course is to test your particular program in your particular environment.
Though it's not helpful to play agnostic whenever a benchmark question is raised.
The situation has gotten worse for ArrayLists. On my computer running Java 6.26, there is a fourfold difference. Interestingly (and perhaps quite logically), there is no difference for raw arrays. I ran the following test:
int testSize = 5000000;
ArrayList<Double> list = new ArrayList<Double>();
Double[] arr = new Double[testSize];

// set up the data - make sure data doesn't have patterns
// or anything compiler could somehow optimize
for (int i = 0; i < testSize; i++)
{
    double someNumber = Math.random();
    list.add(someNumber);
    arr[i] = someNumber;
}

// ArrayList foreach
long time = System.nanoTime();
double total1 = 0;
for (Double k : list)
{
    total1 += k;
}
System.out.println(System.nanoTime() - time);

// ArrayList get() method
time = System.nanoTime();
double total2 = 0;
for (int i = 0; i < testSize; i++)
{
    total2 += list.get(i);
}
System.out.println(System.nanoTime() - time);

// array foreach
time = System.nanoTime();
double total3 = 0;
for (Double k : arr)
{
    total3 += k;
}
System.out.println(System.nanoTime() - time);

// array indexing
time = System.nanoTime();
double total4 = 0;
for (int i = 0; i < testSize; i++)
{
    total4 += arr[i];
}
System.out.println(System.nanoTime() - time);

// would be strange if different values were produced,
// but no, all these are the same, of course
System.out.println(total1);
System.out.println(total2);
System.out.println(total3);
System.out.println(total4);
The arithmetic in the loops is to prevent the JIT compiler from possibly optimizing away some of the code. The effect of the arithmetic on performance is small, as the runtime is dominated by the ArrayList accesses.
The runtimes are (in nanoseconds):
ArrayList foreach: 248,351,782
ArrayList get(): 60,657,907
array foreach: 27,381,576
array direct indexing: 27,468,091