Does this mean that Java Math.floor is extremely slow?

I don't use Java much.
I am writing some optimized math code and I was shocked by my profiler results. My code collects values, interleaves the data and then chooses the values based on that. Java runs slower than my C++ and MATLAB implementations.
I am using javac 1.7.0_05
I am using the Sun/Oracle JDK 1.7.05
There exists a floor function that performs a relevant task in the code.
Does anybody know of the paradigmatic way to fix this?
I noticed that my floor() function is defined with something called StrictMath. Is there something like -ffast-math for Java? I am expecting there must be a way to change the floor function to something more computationally reasonable without writing my own.
public static double floor(double a) {
return StrictMath.floor(a); // default impl. delegates to StrictMath
}
Edit
So a few people suggested I try to do a cast. I tried this and there was absolutely no change in walltime.
private static int flur(float dF)
{
return (int) dF;
}
413742 cast floor function
394675 Math.floor
These tests were run without the profiler. An effort was made to use a profiler, but the runtime was drastically altered (15+ minutes, so I quit).

You might want to give a try to FastMath.
Here is a post about the performance of Math in Java vs. Javascript. There are a few good hints about why the default math lib is slow. They are discussing other operations than floor, but I guess their findings can be generalized. I found it interesting.
EDIT
According to this bug entry, floor has been implemented in pure Java code in 7(b79) and 6u21(b01), resulting in better performance. The code of floor in JDK 6 is still a bit longer than the one in FastMath, but it might not be responsible for such a performance degradation. What JDK are you using? Could you try with a more recent version?

Here's a sanity check for your hypothesis that the code is really spending 99% of its time in floor. Let's assume that you have Java and C++ versions of the algorithm that are both correct in terms of the outputs they produce. For the sake of the argument, let us assume that the two versions call the equivalent floor functions the same number of times. So a time function is
t(input) = nosFloorCalls(input) * floorTime + otherTime(input)
where floorTime is the time taken for a call to floor on the platform.
Now if your hypothesis is correct, and floorTime is vastly more expensive on Java (to the extent that it takes roughly 99% of the execution time) then you would expect the Java version of the application to run a large factor (50 times or more) slower than the C++ version. If you don't see this, then your hypothesis most likely is false.
If the hypothesis is false, here are two alternative explanations for the profiling results.
This is a measurement anomaly; i.e. the profiler has somehow got it wrong. Try using a different profiler.
There is a bug in the Java version of your code that is causing it to call floor many, many more times than in the C++ version of the code.
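The sanity check above can be made concrete with a small calculation. This is only a sketch under the same assumptions as the argument (same number of floor calls, and the same otherTime on both platforms); the class and method names are made up for illustration. If floor accounts for a share sJ of the Java run and a share sC of the C++ run, then otherTime = (1 - sJ) * tJava = (1 - sC) * tCpp, so tJava / tCpp = (1 - sC) / (1 - sJ).

```java
// Hypothetical helper for the time model
//   t(input) = nosFloorCalls(input) * floorTime + otherTime(input)
// Assumption: otherTime is the same in the Java and the C++ version.
class FloorShareCheck {
    // Returns tJava / tCpp implied by the share of time each version spends in floor.
    static double impliedSlowdown(double javaFloorShare, double cppFloorShare) {
        return (1.0 - cppFloorShare) / (1.0 - javaFloorShare);
    }

    public static void main(String[] args) {
        // 99% of the Java run in floor, 50% of the C++ run in floor.
        System.out.printf("Java should be about %.0fx slower%n", impliedSlowdown(0.99, 0.50));
    }
}
```

If the measured slowdown is nowhere near the factor this model predicts, the 99% figure from the profiler is probably not trustworthy.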

Math.floor() is insanely fast on my machine at around 7 nanoseconds per call in a tight loop. (Windows 7, Eclipse, Oracle JDK 7). I'd expect it to be very fast in pretty much all circumstances and would be extremely surprised if it turned out to be the bottleneck.
Some ideas:
I'd suggest re-running some benchmarks without a profiler running. It sometimes happens that profilers create spurious overhead when they instrument the binary - particularly for small functions like Math.floor() that are likely to be inlined.
Try a couple of different JVMs; you might have hit an obscure bug.
Try the FastMath class in the excellent Apache Commons Math library, which includes a new implementation of floor. I'd be really surprised if it is faster, but you never know.
Check you are not running any virtualisation technology or similar that might be interfering with Java's ability to call native code (which is used in a few of the java.lang.Math functions including Math.floor()).

It is worth noting that monitoring a method takes some overhead and in the case of VisualVM, this is fairly high. If you have a method which is called often but does very little it can appear to use lots of CPU. e.g. I have seen Integer.hashCode() as a big hitter once. ;)
On my machine a floor takes about 5.6 ns, but a cast takes 2.3 ns. You might like to try this on your machine.
Unless you need to handle corner cases, a plain cast is faster.
// Rounds to zero, instead of Negative infinity.
public static double floor(double a) {
return (long) a;
}
public static void main(String... args) {
int size = 100000;
double[] a = new double[size];
double[] b = new double[size];
double[] c = new double[size];
for (int i = 0; i < a.length; i++) a[i] = Math.random() * 1e6;
for (int i = 0; i < 5; i++) {
timeCast(a, b);
timeFloor(a, c);
for (int j = 0; j < size; j++)
if (b[j] != c[j])
System.err.println(a[j] + ": " + b[j] + " " + c[j]);
}
}
public static double floor(double a) {
return a < 0 ? -(long) -a : (long) a;
}
private static void timeCast(double[] from, double[] to) {
long start = System.nanoTime();
for (int i = 0; i < from.length; i++)
to[i] = floor(from[i]);
long time = System.nanoTime() - start;
System.out.printf("Cast took an average of %.1f ns%n", (double) time / from.length);
}
private static void timeFloor(double[] from, double[] to) {
long start = System.nanoTime();
for (int i = 0; i < from.length; i++)
to[i] = Math.floor(from[i]);
long time = System.nanoTime() - start;
System.out.printf("Math.floor took an average of %.1f ns%n", (double) time / from.length);
}
prints
Cast took an average of 62.1 ns
Math.floor took an average of 123.6 ns
Cast took an average of 61.9 ns
Math.floor took an average of 6.3 ns
Cast took an average of 47.2 ns
Math.floor took an average of 6.5 ns
Cast took an average of 2.3 ns
Math.floor took an average of 5.6 ns
Cast took an average of 2.3 ns
Math.floor took an average of 5.6 ns

First of all: Your profiler shows that you're spending 99% of the CPU time in the floor function. This does not indicate that floor is slow. If you do nothing but floor(), that's totally sane. Since other languages seem to implement floor more efficiently, your assumption may be correct, however.
I know from school that a naive implementation of floor (which works only for positive numbers and is one off for negative ones) can be done by casting to an integer/long. That is language agnostic and some sort of general knowledge from CS courses.
Here are some micro benches. Works on my machine and backs what I learned in school ;)
rataman#RWW009 ~/Desktop
$ javac Cast.java && java Cast
10000000 Rounds of Casts took 16 ms
rataman#RWW009 ~/Desktop
$ javac Floor.java && java Floor
10000000 Rounds of Floor took 140 ms
#
public class Floor { // name the class Cast in Cast.java
private static final int ROUNDS = 10000000;
public static void main(String[] args)
{
double[] vals = new double[ROUNDS];
double[] res = new double[ROUNDS];
// awesome testdata
for(int i = 0; i < ROUNDS; i++)
{
vals[i] = Math.random() * 10.0;
}
// warmup
for(int i = 0; i < ROUNDS; i++)
{
res[i] = floor(vals[i]);
}
long start = System.currentTimeMillis();
for(int i = 0; i < ROUNDS; i++)
{
res[i] = floor(vals[i]);
}
System.out.println(ROUNDS + " Rounds of Casts took " + (System.currentTimeMillis() - start) +" ms");
}
private static double floor(double arg)
{
// Floor.java uses this line:
return Math.floor(arg);
// Cast.java uses this line instead:
// return (int)arg;
}
}

Math.floor (and Math.ceil) can be a surprising bottleneck if your algorithm depends on it a lot.
This is because these functions handle edge cases that you might not care about (such as minus-zero and positive-zero etc). Just look at the implementation of these functions to see what they're actually doing; there's a surprising amount of branching in there.
Also consider that Math.floor/ceil take only a double as an argument and return a double, which you might not want. If you just want an int or long, some of the checks in Math.floor are simply unnecessary.
Some have suggested to simply cast to an int, which will work as long as your values are positive (and your algorithm doesn't depend on the edge cases that Math.floor checks for). If that's the case, a simple cast is the fastest solution by quite a margin (in my experience).
If for example your values can be negative and you want an int from a float, you can do something like this:
public static final int floor(final float value) {
return ((int) value) - (Float.floatToRawIntBits(value) >>> 31);
}
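(It just subtracts the float's sign bit from the cast to make it correct for negative numbers, while avoiding an "if". Note that for negative values that are already whole numbers, such as -2.0f, this yields one less than Math.floor.)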
In my experience, this is a lot faster than Math.floor. If it isn't, I suggest checking your algorithm, or perhaps you've run into a JVM performance bug (which is much less likely).
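For inputs that may include negative whole numbers, a variant that compares the truncated value back against the input is a possible alternative. The sketch below puts the sign-bit trick next to such a variant; compareFloor and the class name are hypothetical, and neither method handles NaN or values outside the int range. Whether the comparison costs anything measurable depends on the JIT, so it would need benchmarking on the target machine.

```java
class FastFloorSketch {
    // The sign-bit trick from above, unchanged.
    static int signBitFloor(float value) {
        return ((int) value) - (Float.floatToRawIntBits(value) >>> 31);
    }

    // Hypothetical variant: truncate toward zero, then step down by one only
    // when truncation moved the value upward (negative, non-integral inputs).
    static int compareFloor(float value) {
        int truncated = (int) value;
        return value < truncated ? truncated - 1 : truncated;
    }

    public static void main(String[] args) {
        float[] samples = {2.5f, -2.5f, -2.0f, 0.0f};
        for (float v : samples) {
            System.out.println(v + ": Math.floor=" + (int) Math.floor(v)
                    + " signBit=" + signBitFloor(v)
                    + " compare=" + compareFloor(v));
        }
    }
}
```

The two methods agree on positive values and on negative fractions; they differ on negative whole numbers, where only compareFloor matches Math.floor.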

Related

Fast way to check if long integer is a cube (in Java)

I am writing a program in which I am required to check if certain large numbers (permutations of cubes) are cubic (equal to n^3 for some n).
At the moment I simply use the method
static boolean isCube(long input) {
double cubeRoot = Math.pow(input,1.0/3.0);
return Math.round(cubeRoot) == cubeRoot;
}
but this is very slow when working with large numbers (10+ digits). Is there a faster way to determine if integer numbers are cubes?

There are only 2^21 cubes that don't overflow a long (2^22 - 1 if you allow negative numbers), so you could just use a HashSet lookup.
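A minimal sketch of the lookup idea, with one substitution stated plainly: instead of a HashSet it uses a sorted long[] and a binary search, which avoids boxing and needs about 16 MB for the table. The class name is made up, and only non-negative inputs are covered.

```java
import java.util.Arrays;

class CubeTable {
    // All non-negative cubes that fit in a long: 0^3 .. (2^21 - 1)^3, in ascending order.
    private static final long[] CUBES = new long[1 << 21];

    static {
        for (int i = 0; i < CUBES.length; i++) {
            CUBES[i] = (long) i * i * i;
        }
    }

    // Exact membership test; negative inputs are simply reported as "not a cube".
    static boolean isCube(long input) {
        return Arrays.binarySearch(CUBES, input) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(isCube(27L) + " " + isCube(28L));
    }
}
```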

The Hacker's Delight book has a short and fast function for integer cube roots which could be worth porting to 64bit longs, see below.
It appears that testing if a number is a perfect cube can be done faster than actually computing the cube root. Burningmath has a technique that uses the "digital root" (sum the digits, and repeat until it's a single digit). If the digital root is 1, 8 or 9 (that is, the number is 1, 8 or 0 modulo 9), your number might be a perfect cube.
This method could be extremely valuable for your case of permuting (the digits of?) numbers. If you can rule out a number by its digital root, all permutations are also ruled out.
They also describe a technique based on the prime factors for checking perfect cubes. This looks most appropriate for mental arithmetic, as I think factoring is slower than cube-rooting on a computer.
Anyway, the digital root is quick to compute, and you even have your numbers as a string of digits to start with. You'll still need a divide-by-10 loop, but your starting point is the sum of digits of the input, not the whole number, so it won't be many divisions. (Integer division is about an order of magnitude slower than multiplication on current CPUs, but division by a compile-time constant can be optimized to multiply+shift with a fixed-point inverse. Hopefully Java JIT compilers use that, too, and maybe even use it for runtime constants.)
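The digital-root filter could be sketched as follows. The names are hypothetical and the sketch assumes non-negative inputs. It relies on the fact that a cube is always 0, 1 or 8 modulo 9, which corresponds to a digital root of 9, 1 or 8 (a digital root of 0 only occurs for the input 0).

```java
class DigitalRootFilter {
    // Digital root of a non-negative number: sum the decimal digits until one digit is left.
    static int digitalRoot(long n) {
        while (n >= 10) {
            long sum = 0;
            for (long rest = n; rest > 0; rest /= 10) {
                sum += rest % 10;
            }
            n = sum;
        }
        return (int) n;
    }

    // false means "certainly not a cube"; true means "needs the full check".
    static boolean mightBeCube(long n) {
        int root = digitalRoot(n);
        return root == 0 || root == 1 || root == 8 || root == 9;
    }

    public static void main(String[] args) {
        System.out.println(mightBeCube(343L) + " " + mightBeCube(344L));
    }
}
```

Because the digit sum does not depend on digit order, a rejected number rules out every permutation of its digits as well.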
This plus A. Webb's test (input % 819 -> search of a table of 45 entries) will rule out a lot of inputs as not possible perfect cubes.
IDK if binary search, linear search, or hash/set would be best.
These tests could be a front-end to David Eisenstat's idea of just storing the set of longs that are perfect cubes in a data structure that allows quick is-present checks. (e.g. HashSet). Yes, cache misses are expensive enough that at least the digital-root test is probably worth it before doing a HashSet lookup, maybe both.
You could use less memory on this idea by using it for a Bloom filter instead of an exact set (David Ehrman's suggestion). This would give another candidate-rejection frontend to the full calculation. The Guava BloomFilter implementation requires a "funnel" function to translate objects to bytes, which in this case should be f(x)=x.
I suspect that Bloom filtering isn't going to be a big win over an exact HashSet check, since it requires multiple memory accesses. It's appropriate when you really can't afford the space for a full table, and what you're filtering out is something really expensive like a disk access.
The integer cube root function (below) is probably faster than a single cache miss. If the cbrt check is causing cache misses, then probably the rest of your code will suffer more cache misses too, when its data is evicted.
Math.SE had a question about this for perfect squares, but that was about squares, not cubes, so none of this came up. The answers there did discuss and avoid the problems in your method, though. >.<
There are several problems with your method:
The problem with using pow(x, 1./3) is that 1/3 does not have an exact representation in floating point, so you're not "really" getting the cube root. So use cbrt. It's highly unlikely to be slower, unless it has higher accuracy that comes with a time cost.
You're assuming Math.pow or Math.cbrt always return a value that's exactly an integer, and not 41.999999 or something. Java docs say:
The computed result must be within 1 ulp of the exact result.
This means your code might not work on a conforming Java implementation. Comparing floating point numbers for exact equality is tricky business. What Every Computer Scientist Should Know About Floating-Point Arithmetic has much to say about floating point, but it's really long. (With good reason. It's easy to shoot yourself in the foot with floating point.) See also Comparing Floating Point Numbers, 2012 Edition, Bruce Dawson's series of FP articles.
I think it won't work for all long values. double can only precisely represent integers up to 2^53 (size of the mantissa in a 64bit IEEE double). Math.cbrt of integers that can't be represented exactly is even less likely to be an exact integer.
FP cube root, and then testing the resulting integer, avoids all the problems that the FP comparison introduced:
static boolean isCube(long input) {
double cubeRoot = Math.cbrt(input);
long intRoot = Math.round(cubeRoot);
return (intRoot*intRoot*intRoot) == input;
}
(After searching around, I see other people on other stackoverflow / stackexchange answers suggesting that integer-comparison method, too.)
If you need high performance, and you don't mind having a more complex function with more source code, then there are possibilities. For example, use a cube-root successive-approximation algorithm with integer math. If you eventually get to a point where n^3 < input < (n+1)^3, then input isn't a cube.
There's some discussion of methods on this math.SE question.
I'm not going to take the time to dig into integer cube-root algorithms in detail, as the cbrt part is probably not the main bottleneck. Probably input parsing and string->long conversion is a major part of your bottleneck.
Actually, I got curious. Turns out there is already an integer cube-root implementation available in Hacker's Delight (use / copying / distributing even without attribution is allowed. AFAICT, it's essentially public domain code.):
// Hacker's delight integer cube-root (for 32-bit integers, I think)
int icbrt1(unsigned x) {
int s;
unsigned y, b;
y = 0;
for (s = 30; s >= 0; s = s - 3) {
y = 2*y;
b = (3*y*(y + 1) + 1) << s;
if (x >= b) {
x = x - b;
y = y + 1;
}
}
return y;
}
That 30 looks like a magic number based on the number of bits in an int. Porting this to long would require testing. (Also note that this is C; Java has no unsigned types, so a Java port would have to use signed types.)
IDK if this is common knowledge among Java people, but the 32bit Windows JVM doesn't use the server JIT engine, and doesn't optimize your code as well.

You can first eliminate a large number of candidates by testing modulo given numbers. For example, a cube modulo the number 819 can only take on the following 45 values.
0 125 181 818 720 811 532 755 476
1 216 90 307 377 694 350 567 442
8 343 559 629 658 351 190 91 469
27 512 287 252 638 118 603 161 441
64 729 99 701 792 378 260 468 728
So, you could eliminate actually having to compute the cubic root in almost 95% of uniformly distributed cases.
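The modulo filter can be sketched without typing in the table by computing the cube residues once. This is a minimal sketch with made-up names; Math.floorMod requires Java 8 or later and is used so that negative inputs map to a valid index.

```java
class Mod819Filter {
    private static final int MODULUS = 819;
    private static final boolean[] IS_CUBE_RESIDUE = new boolean[MODULUS];

    static {
        for (int i = 0; i < MODULUS; i++) {
            // 818^3 still fits in an int, so no overflow here.
            IS_CUBE_RESIDUE[(i * i * i) % MODULUS] = true;
        }
    }

    // Number of distinct residues a cube can have modulo 819.
    static int residueCount() {
        int count = 0;
        for (boolean b : IS_CUBE_RESIDUE) {
            if (b) count++;
        }
        return count;
    }

    // false means "certainly not a cube"; true means "needs the full check".
    static boolean mightBeCube(long input) {
        return IS_CUBE_RESIDUE[(int) Math.floorMod(input, (long) MODULUS)];
    }

    public static void main(String[] args) {
        System.out.println(residueCount() + " possible residues");
    }
}
```

Since 45 of 819 residues survive, about 94.5% of uniformly distributed inputs are rejected by a single table lookup.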

The hackers delight routine seems to work on long numbers if you just change int to long and 30 to 60. If you change 30 to 61 it does not seem to work.
I didn't really understand the program, so I made another version that seems to work in Java.
private static int cubeRoot(long n) {
final int MAX_POWER = 21;
int power = MAX_POWER;
long factor;
long root = 0;
long next, square, cube;
while (power >= 0) {
factor = 1 << power;
next = root + factor;
while (true) {
if (next > n) {
break;
}
if (n / next < next) {
break;
}
square = next * next;
if (n / square < next) {
break;
}
cube = square * next;
if (cube > n) {
break;
}
root = next;
next += factor;
}
--power;
}
return (int) root;
}
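For comparison, the change described above (int to long, 30 to 60) could look like the following Java port of the Hacker's Delight routine. It is a sketch, not a tested library function: the class name and the isCube wrapper are additions, and it assumes non-negative inputs because Java's long is signed. The shift has to start at a multiple of 3, which would explain why 61 does not work.

```java
class IntegerCubeRoot {
    // Digit-by-digit integer cube root; returns floor(cbrt(x)) for x >= 0.
    static long icbrt(long x) {
        if (x < 0) {
            throw new IllegalArgumentException("negative input: " + x);
        }
        long y = 0;
        for (int s = 60; s >= 0; s -= 3) {
            y = 2 * y;
            // (y + 1)^3 - y^3, scaled to the current group of three bits
            long b = (3 * y * (y + 1) + 1) << s;
            if (x >= b) {
                x -= b;
                y += 1;
            }
        }
        return y;
    }

    // Exact cube test using only integer arithmetic.
    static boolean isCube(long input) {
        if (input < 0) {
            return false; // this sketch only covers non-negative inputs
        }
        long root = icbrt(input);
        return root * root * root == input;
    }

    public static void main(String[] args) {
        System.out.println(icbrt(27L) + " " + isCube(27L) + " " + isCube(28L));
    }
}
```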

Please define "very slow". Here is a test program:
public static void main(String[] args) {
for (long v = 1; v > 0; v = v * 10) {
long start = System.nanoTime();
for (int i = 0; i < 100; i++)
isCube(v);
long end = System.nanoTime();
System.out.println(v + ": " + (end - start) + "ns");
}
}
static boolean isCube(long input) {
double cubeRoot = Math.pow(input,1.0/3.0);
return Math.round(cubeRoot) == cubeRoot;
}
Output is:
1: 290528ns
10: 46188ns
100: 45332ns
1000: 46188ns
10000: 46188ns
100000: 46473ns
1000000: 46188ns
10000000: 45048ns
100000000: 45048ns
1000000000: 44763ns
10000000000: 45048ns
100000000000: 44477ns
1000000000000: 45047ns
10000000000000: 46473ns
100000000000000: 47044ns
1000000000000000: 46188ns
10000000000000000: 65291ns
100000000000000000: 45047ns
1000000000000000000: 44477ns
I don't see a performance impact of "large" numbers.

The CPU Cache is invalid in C Programming?

This is the C program under Linux/GNU:
#include<stdio.h>
#include<sys/time.h>
#define Max 1024*1024
int main()
{
struct timeval start,end;
long int dis;
int i;
int m=0;
int a[Max];
gettimeofday(&start,NULL);
for(i=0;i<Max;i += 1){
a[Max] *= 3;
}
gettimeofday(&end,NULL);
dis = end.tv_usec - start.tv_usec;
printf("time1: %ld\n",dis);
gettimeofday(&start,NULL);
for(i=0;i<Max;i += 16){
a[Max] *= 3;
}
gettimeofday(&end,NULL);
dis = end.tv_usec - start.tv_usec;
printf("time2: %ld\n",dis);
return 0;
}
the output:
time1: 7074
time2: 234
It's a big difference.
this Java program:
public class Cache1 {
public static void main(String[] args){
int a[] = new int[1024*1024*64];
long time1 = System.currentTimeMillis();
for(int i=0;i<a.length;i++){
a[i] *= 3;
}
long time2 = System.currentTimeMillis();
System.out.println(time2 - time1);
time1 = System.currentTimeMillis();
for(int i=0;i<a.length;i += 16){
a[i] *= 3;
}
time2 = System.currentTimeMillis();
System.out.println(time2 - time1);
}
}
the output:
92
82
It's nearly the same.
With the CPU cache, why is there so much difference between them? Is the CPU cache invalid in C programming?

I hope you realize that the units of time in those tests differ by a factor of 10^3. The C code is an order of magnitude faster than the Java code.
In C code there should be a[i] instead of a[Max].
As for cache: since you access only one memory location in your C code (which triggers undefined behavior) your C test is completely invalid.
And even if it were correct, your method is flawed. It is quite possible that the multiplication operations and even the whole loops were skipped completely by the C compiler, since nothing depends on their outcome.
The result where first run takes long, and the second takes less time, is expected. Data has to be loaded to cache anyway, and that takes time. Once it is loaded, operations on that data take less time.
Java may either not use cache at all (not likely) or preload the whole array to cache even before the loops are executed. That would explain equal execution times.

You have three cache sizes, these are typically
L1: 32 KB (data), 4 clock cycles
L2: 256KB, 10-11 clock cycles
L3: 3-24 MB. 40 - 75 clock cycles.
Anything larger than this will not fit into the cache; if you just scroll through that much memory, it will be as if the caches are not there.
I suggest you write a test which empirically works out the CPU cache sizes as a good exercise to help you understand this. BTW You don't need to use *= to exercise the cache as this exercises the ALU. Perhaps there is a simpler operation you can use ;)
In the case of your Java code, most likely it is not compiled yet, so you are seeing the speed of the interpreter, not the memory accesses.
I suggest you run the test repeatedly on smaller memory sizes for at least 2 seconds and take the average.
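A rough sketch of such a test is shown below. It is one possible design, not a calibrated benchmark: the class name, the 16 KB to 64 MB range and the stride of 16 ints (assuming 64-byte cache lines) are all assumptions. Each working-set size is walked repeatedly for a minimum duration so that JIT warm-up is amortized, and the loop only reads, as suggested above.

```java
class CacheSizeProbe {
    // Reads one int per assumed 64-byte cache line in an array of `ints` elements,
    // repeating for at least `minMillis`, and returns the average nanoseconds per access.
    static double nanosPerAccess(int ints, long minMillis) {
        int[] a = new int[ints];
        long minNanos = minMillis * 1_000_000L;
        long accesses = 0;
        long sink = 0;
        long start = System.nanoTime();
        long elapsed;
        do {
            for (int i = 0; i < a.length; i += 16) {
                sink += a[i]; // a plain read is enough to exercise the cache
            }
            accesses += (a.length + 15) / 16;
            elapsed = System.nanoTime() - start;
        } while (elapsed < minNanos);
        // Using sink keeps the JIT from discarding the reads; the array is all zeros.
        if (sink != 0) {
            throw new AssertionError("array was never written, the sum must be 0");
        }
        return (double) elapsed / accesses;
    }

    public static void main(String[] args) {
        // Working sets from 16 KB to 64 MB, two seconds per size.
        for (int kb = 16; kb <= 64 * 1024; kb *= 2) {
            int ints = kb * 1024 / 4;
            System.out.printf("%6d KB: %.2f ns per access%n", kb, nanosPerAccess(ints, 2000));
        }
    }
}
```

Steps in the reported time per access are expected roughly where the working set stops fitting into L1, L2 and L3, though the exact numbers will vary by machine and JVM.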

A faster implementation for Math.abs(a - b) - Math.abs(c - d)?

I have a Java method that repeatedly evaluates the following expression in a very tight loop with a large number of repetitions:
Math.abs(a - b) - Math.abs(c - d)
a, b, c and d are long values that can span the whole range of their type. They are different in each loop iteration and they do not satisfy any invariant that I know of.
The profiler indicates that a significant portion of the processor time is spent in this method. While I will pursue other avenues of optimization first, I was wondering if there is a smarter way to calculate the aforementioned expression.
Apart from inlining the Math.abs() calls manually for a very slight (if any) performance gain, is there any mathematical trick that I could use to speed-up the evaluation of this expression?

I suspect the profiler isn't giving you a true result, as it is trying to profile (and thus adding overhead to) such a trivial "method". Without the profiler, Math.abs can be turned into a small number of machine code instructions, and you won't be able to make it faster than that.
I suggest you do a micro-benchmark to confirm this. I would expect loading the data to be an order of magnitude more expensive.
long a = 10, b = 6, c = -2, d = 3;
int runs = 1000 * 1000 * 1000;
long start = System.nanoTime();
for (int i = 0; i < runs; i += 2) {
long r = Math.abs(i - a) - Math.abs(c - i);
long r2 = Math.abs(i - b) - Math.abs(d - i);
if (r + r2 < Integer.MIN_VALUE) throw new AssertionError();
}
long time = System.nanoTime() - start;
System.out.printf("Took an average of %.1f ns per abs-abs. %n", (double) time / runs);
prints
Took an average of 0.9 ns per abs-abs.

I ended up using this little method:
public static long diff(final long a, final long b, final long c, final long d) {
final long a0 = (a < b)?(b - a):(a - b);
final long a1 = (c < d)?(d - c):(c - d);
return a0 - a1;
}
I experienced a measurable performance increase - about 10-15% for the whole application. I believe this is mostly due to:
The elimination of a method call: Rather than calling Math.abs() twice, I call this method once. Sure, static method calls are not inordinately expensive, but they still have an impact.
The elimination of a couple of negation operations: This may be offset by the slightly increased size of the code, but I'll happily fool myself into believing that it actually made a difference.
EDIT:
It seems that it's actually the other way around. Explicitly inlining the code does not seem to impact the performance in my micro-benchmark. Changing the way the absolute values are calculated does...

You can always try to unroll the functions and hand-optimize; if you don't get more cache misses, it might be faster.
If I got the unrolling right it could be something like this:
if(a<b)
{
if(c<d)
{
r=b-a-d+c;
}
else
{
r=b-a+d-c;
}
}
else
{
if(c<d)
{
r=a-b-d+c;
}
else
{
r=a-b+d-c;
}
}
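To check that the unrolling is right, the four branches can be wrapped in a method and compared against the original expression. This is a sketch with hypothetical names, and it only checks small inputs. Near the ends of the long range the subtractions overflow differently in the two forms, so the results can differ there (for example a = Long.MIN_VALUE, b = 1, c = d = 0).

```java
class AbsDiffCheck {
    // The unrolled branches from above, as a method.
    static long unrolled(long a, long b, long c, long d) {
        if (a < b) {
            return (c < d) ? b - a - d + c : b - a + d - c;
        } else {
            return (c < d) ? a - b - d + c : a - b + d - c;
        }
    }

    // The expression from the question.
    static long reference(long a, long b, long c, long d) {
        return Math.abs(a - b) - Math.abs(c - d);
    }

    // Exhaustively compares both forms for all inputs in [-3, 3].
    static boolean agreeOnSmallInputs() {
        for (long a = -3; a <= 3; a++)
            for (long b = -3; b <= 3; b++)
                for (long c = -3; c <= 3; c++)
                    for (long d = -3; d <= 3; d++)
                        if (unrolled(a, b, c, d) != reference(a, b, c, d)) return false;
        return true;
    }

    public static void main(String[] args) {
        System.out.println("agree on small inputs: " + agreeOnSmallInputs());
    }
}
```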

Are you sure it's the method itself that causes the problem? Maybe there is an enormous number of invocations of this method, and you just see the aggregated results (like TIME_OF_METHOD_EXECUTION x NUMBER_OF_INVOCATIONS) in your profiler?

Enhanced for loop performance worse than traditional indexed lookup?

I just came across this seemingly innocuous comment, benchmarking ArrayList vs a raw String array. It's from a couple years ago, but the OP writes
I did notice that using for String s: stringsList was about 50% slower than using an old-style for-loop to access the list. Go figure...
Nobody commented on it in the original post, and the test seemed a little dubious (too short to be accurate), but I nearly fell out of my chair when I read it. I've never benchmarked an enhanced loop against a "traditional" one, but I'm currently working on a project that does hundreds of millions of iterations over ArrayList instances using enhanced loops so this is a concern to me.
I'm going to do some benchmarking and post my findings here, but this is obviously a big concern to me. I could find precious little info online about relative performance, except for a couple offhand mentions that enhanced loops for ArrayLists do run a lot slower under Android.
Has anybody experienced this? Does such a performance gap still exist? I'll post my findings here, but was very surprised to read it. I suspect that if this performance gap did exist, it has been fixed in more modern VM's, but I guess now I'll have to do some testing and confirm.
Update: I made some changes to my code, but was already suspecting what others here have already pointed out: sure the enhanced for loop is slower, but outside of very trivial tight loops, the cost should be a minuscule fraction of the cost of the logic of the loop. In my case, even though I'm iterating over very large lists of strings using enhanced loops, my logic inside the loop is complex enough that I couldn't even measure a difference after switching to index-based loops.
TL;DR: enhanced loops are indeed slower than a traditional index-based loop over an ArrayList; but for most applications the difference should be negligible.

The problem you have is that using an Iterator will be slower than using a direct lookup. On my machine the difference is about 0.13 ns per iteration. Using an array instead saves about 0.15 ns per iteration. This should be trivial in 99% of situations.
public static void main(String... args) {
int testLength = 100 * 1000 * 1000;
String[] stringArray = new String[testLength];
Arrays.fill(stringArray, "a");
List<String> stringList = new ArrayList<String>(Arrays.asList(stringArray));
{
long start = System.nanoTime();
long total = 0;
for (String str : stringArray) {
total += str.length();
}
System.out.printf("The for each Array loop time was %.2f ns total=%d%n", (double) (System.nanoTime() - start) / testLength, total);
}
{
long start = System.nanoTime();
long total = 0;
for (int i = 0, stringListSize = stringList.size(); i < stringListSize; i++) {
String str = stringList.get(i);
total += str.length();
}
System.out.printf("The for/get List loop time was %.2f ns total=%d%n", (double) (System.nanoTime() - start) / testLength, total);
}
{
long start = System.nanoTime();
long total = 0;
for (String str : stringList) {
total += str.length();
}
System.out.printf("The for each List loop time was %.2f ns total=%d%n", (double) (System.nanoTime() - start) / testLength, total);
}
}
When run with one billion entries, this prints (using Java 6 update 26):
The for each Array loop time was 0.76 ns total=1000000000
The for/get List loop time was 0.91 ns total=1000000000
The for each List loop time was 1.04 ns total=1000000000
When run with one billion entries, this prints (using OpenJDK 7):
The for each Array loop time was 0.76 ns total=1000000000
The for/get List loop time was 0.91 ns total=1000000000
The for each List loop time was 1.04 ns total=1000000000
i.e. exactly the same. ;)
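To make the Iterator cost visible, here is what the three loop styles look like side by side. For an Iterable such as ArrayList, the Java Language Specification defines the enhanced for loop in terms of an explicit Iterator loop, so the first two methods below do the same work. The class and method names are made up for this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class ForEachDesugared {
    // Enhanced for loop as written in source code.
    static long enhanced(List<String> list) {
        long total = 0;
        for (String s : list) {
            total += s.length();
        }
        return total;
    }

    // What the enhanced for loop over an Iterable amounts to.
    static long explicitIterator(List<String> list) {
        long total = 0;
        for (Iterator<String> it = list.iterator(); it.hasNext(); ) {
            String s = it.next();
            total += s.length();
        }
        return total;
    }

    // Index-based loop: no Iterator object, just get(i).
    static long indexed(List<String> list) {
        long total = 0;
        for (int i = 0, size = list.size(); i < size; i++) {
            total += list.get(i).length();
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> list = new ArrayList<String>(Arrays.asList("a", "bb", "ccc"));
        System.out.println(enhanced(list) + " " + explicitIterator(list) + " " + indexed(list));
    }
}
```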

Every claim that X is slower than Y on a JVM which does not address all the issues presented in this article and its second part spreads fears and lies about the performance of a typical JVM. This applies to the comment referred to by the original question as well as to GravityBringer's answer. I am sorry to be so rude, but unless you use appropriate micro-benchmarking technology, your benchmarks produce really badly skewed random numbers.
Tell me if you're interested in more explanations. Although it is all in the articles I referred to.

GravityBringer's number doesn't seem right, because I know ArrayList.get() is as fast as raw array access after VM optimization.
I ran GravityBringer's test twice on my machine, -server mode
50574847
43872295
30494292
30787885
(2nd round)
33865894
32939945
33362063
33165376
The bottleneck in such tests is actually memory read/write. Judging from the numbers, the entire 2 arrays are in my L2 cache. If we decrease the size to fit L1 cache, or if we increase the size beyond L2 cache, we'll see 10X throughput difference.
The iterator of ArrayList uses a single int counter. Even if the VM doesn't put it in a register (the loop body is too complex), at least it will be in the L1 cache, therefore reads and writes of it are basically free.
The ultimate answer of course is to test your particular program in your particular environment.
Though it's not helpful to play agnostic whenever a benchmark question is raised.

The situation has gotten worse for ArrayLists. On my computer running Java 6.26, there is a fourfold difference. Interestingly (and perhaps quite logically), there is no difference for raw arrays. I ran the following test:
int testSize = 5000000;
ArrayList<Double> list = new ArrayList<Double>();
Double[] arr = new Double[testSize];
//set up the data - make sure data doesn't have patterns
//or anything compiler could somehow optimize
for (int i=0;i<testSize; i++)
{
double someNumber = Math.random();
list.add(someNumber);
arr[i] = someNumber;
}
//ArrayList foreach
long time = System.nanoTime();
double total1 = 0;
for (Double k: list)
{
total1 += k;
}
System.out.println (System.nanoTime()-time);
//ArrayList get() method
time = System.nanoTime();
double total2 = 0;
for (int i=0;i<testSize;i++)
{
total2 += list.get(i);
}
System.out.println (System.nanoTime()-time);
//array foreach
time = System.nanoTime();
double total3 = 0;
for (Double k: arr)
{
total3 += k;
}
System.out.println (System.nanoTime()-time);
//array indexing
time = System.nanoTime();
double total4 = 0;
for (int i=0;i<testSize;i++)
{
total4 += arr[i];
}
System.out.println (System.nanoTime()-time);
//would be strange if different values were produced,
//but no, all these are the same, of course
System.out.println (total1);
System.out.println (total2);
System.out.println (total3);
System.out.println (total4);
The arithmetic in the loops is to prevent the JIT compiler from possibly optimizing away some of the code. The effect of the arithmetic on performance is small, as the runtime is dominated by the ArrayList accesses.
The runtimes are (in nanoseconds):
ArrayList foreach: 248,351,782
ArrayList get(): 60,657,907
array foreach: 27,381,576
array direct indexing: 27,468,091

Switch to BigInteger if necessary

I am reading a text file which contains numbers in the range [1, 10^100]. I am then performing a sequence of arithmetic operations on each number. I would like to use a BigInteger only if the number is out of the int/long range. One approach would be to count how many digits there are in the string and switch to BigInteger if there are too many. Otherwise I'd just use primitive arithmetic as it is faster. Is there a better way?
Is there any reason why Java could not do this automatically i.e. switch to BigInteger if an int was too small? This way we would not have to worry about overflows.
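The digit-counting approach described above could be sketched like this. It assumes each input is a plain non-negative decimal string without sign or whitespace; the class name and the Number return type are choices made for the sketch. Any string of at most 18 digits is below 10^18 and therefore fits in a long, so the cutoff is conservative (some 19-digit values would also fit).

```java
import java.math.BigInteger;

class NumberParser {
    // Any decimal string of at most 18 digits is below 10^18 and fits in a long.
    private static final int MAX_SAFE_DIGITS = 18;

    // Returns a Long for short inputs and a BigInteger for everything else.
    static Number parse(String digits) {
        if (digits.length() <= MAX_SAFE_DIGITS) {
            return Long.valueOf(digits);
        }
        return new BigInteger(digits);
    }

    public static void main(String[] args) {
        System.out.println(parse("12345").getClass().getSimpleName());
        System.out.println(parse("12345678901234567890123").getClass().getSimpleName());
    }
}
```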

I suspect the decision to use primitive values for integers and reals (done for performance reasons) made that option not possible. Note that Python and Ruby both do what you ask.
In this case it may be more work to handle the smaller special case than it is worth (you need some custom class to handle the two cases), and you should just use BigInteger.

Is there any reason why Java could not do this automatically i.e. switch to BigInteger if an int was too small?
Because that is a higher level programming behavior than what Java currently is. The language is not even aware of the BigInteger class and what it does (i.e. it's not in JLS). It's only aware of Integer (among other things) for boxing and unboxing purposes.
Speaking of boxing/unboxing, an int is a primitive type; BigInteger is a reference type. You can't have a variable that can hold values of both types.

You could read the values into BigIntegers, and then convert them to longs if they're small enough.
private static final BigInteger LONG_MAX = BigInteger.valueOf(Long.MAX_VALUE);
private static List<BigInteger> readAndProcess(BufferedReader rd) throws IOException {
List<BigInteger> result = new ArrayList<BigInteger>();
for (String line; (line = rd.readLine()) != null; ) {
BigInteger bignum = new BigInteger(line);
if (bignum.compareTo(LONG_MAX) > 0) // doesn't fit in a long
result.add(bignumCalculation(bignum));
else result.add(BigInteger.valueOf(primitiveCalculation(bignum.longValue())));
}
return result;
}
private BigInteger bignumCalculation(BigInteger value) {
// perform the calculation
}
private long primitiveCalculation(long value) {
// perform the calculation
}
(You could make the return value a List<Number> and have it a mixed collection of BigInteger and Long objects, but that wouldn't look very nice and wouldn't improve performance by a lot.)
The performance may be better if a large share of the numbers in the file are small enough to fit in a long (depending on the complexity of the calculation). There's still a risk of overflow depending on what you do in primitiveCalculation, and you've now duplicated the code, (at least) doubling the bug potential, so you'll have to decide whether the performance gain is really worth it.
If your code is anything like my example, though, you'd probably have more to gain by parallelizing the code so the calculations and the I/O aren't performed on the same thread - you'd have to do some pretty heavy calculations for an architecture like that to be CPU-bound.
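One way to contain the overflow risk mentioned above, assuming Java 8 or later is available (newer than the JDK this thread started with), is to do the primitive arithmetic with the Math.*Exact methods, which throw ArithmeticException on overflow, and to redo the work with BigInteger only in that case. The calculation below (square the value and add one) is a made-up stand-in for the real one, and the class name CheckedCalc and the method calculate are hypothetical.

```java
import java.math.BigInteger;

// Hypothetical sketch; the "calculation" is just value * value + 1.
class CheckedCalc {
    // Fast path on primitives. Math.multiplyExact and Math.addExact
    // (Java 8+) throw ArithmeticException instead of wrapping around.
    static long primitiveCalculation(long value) {
        return Math.addExact(Math.multiplyExact(value, value), 1L);
    }

    // The same stand-in calculation on BigInteger.
    static BigInteger bignumCalculation(BigInteger value) {
        return value.multiply(value).add(BigInteger.ONE);
    }

    // Try the fast path first; fall back to BigInteger if it overflows.
    static BigInteger calculate(long value) {
        try {
            return BigInteger.valueOf(primitiveCalculation(value));
        } catch (ArithmeticException overflow) {
            return bignumCalculation(BigInteger.valueOf(value));
        }
    }

    public static void main(String[] args) {
        System.out.println(calculate(3L));             // 10
        System.out.println(calculate(Long.MAX_VALUE)); // computed via BigInteger
    }
}
```

This does not remove the code duplication, and the exception path is comparatively expensive, so it only pays off if overflow is rare.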

The impact of using BigDecimals when something smaller will suffice is surprisingly, err, big: Running the following code
import java.math.BigDecimal;
import java.text.DecimalFormat;
import java.util.Random;

public static class MyLong {
    private long l;
    public MyLong(long l) { this.l = l; }
    public void add(MyLong l2) { l += l2.l; }
}

public static void main(String[] args) throws Exception {
    // generate lots of random numbers
    long ls[] = new long[100000];
    BigDecimal bds[] = new BigDecimal[100000];
    MyLong mls[] = new MyLong[100000];
    Random r = new Random();
    for (int i=0; i<ls.length; i++) {
        long n = r.nextLong();
        ls[i] = n;
        bds[i] = new BigDecimal(n);
        mls[i] = new MyLong(n);
    }
    // time with longs, BigDecimals and MyLongs
    long t0 = System.currentTimeMillis();
    for (int j=0; j<1000; j++) for (int i=0; i<ls.length-1; i++) {
        ls[i] += ls[i+1];
    }
    // Math.max keeps the long time non-zero so the ratios below never divide by zero
    long t1 = Math.max(t0 + 1, System.currentTimeMillis());
    for (int j=0; j<1000; j++) for (int i=0; i<ls.length-1; i++) {
        // BigDecimal is immutable and the result is discarded here,
        // so this only measures the cost of add() itself
        bds[i].add(bds[i+1]);
    }
    long t2 = System.currentTimeMillis();
    for (int j=0; j<1000; j++) for (int i=0; i<ls.length-1; i++) {
        mls[i].add(mls[i+1]);
    }
    long t3 = System.currentTimeMillis();
    // compare times
    t3 -= t2;
    t2 -= t1;
    t1 -= t0;
    DecimalFormat df = new DecimalFormat("0.00");
    System.err.println("long: " + t1 + "ms, bigd: " + t2 + "ms, x"
        + df.format(t2*1.0/t1) + " more, mylong: " + t3 + "ms, x"
        + df.format(t3*1.0/t1) + " more");
}
produces, on my system, this output:
long: 375ms, bigd: 6296ms, x16.79 more, mylong: 516ms, x1.38 more
The MyLong class is there only to look at the effects of boxing, to compare against what you would get with a custom BigOrLong class.

Java is fast, really really fast. It's only 2-4x slower than C, and sometimes as fast or a tad faster, whereas most other languages are 10x (Python) to 100x (Ruby) slower than C/Java. (Fortran is also hella-fast, by the way.)
Part of this is because it doesn't do things like switch number types for you. It could, but currently it can inline an operation like "a*5" in just a few bytes. Imagine the hoops it would have to go through if a were an object: it would at least be a dynamic call to a's multiply method, which would be a few hundred or a few thousand times slower than when a is simply an integer value.
Java probably could, these days, actually use JIT compiling to optimize the call better and inline it at runtime, but even then very few library calls support BigInteger/BigDecimal, so a LOT of native support would be needed; it would be a completely new language.
Also imagine how switching from int to BigInteger instead of long would make debugging video games crazy-hard! (Yeah, every time we move to the right side of the screen the game slows down by 50x, the code is all the same! How is this possible?!??)

Would it have been possible? Yes. But there are many problems with it.
Consider, for instance, that Java stores references to BigInteger objects, which are allocated on the heap, but stores int values directly. The difference can be made clear in C:
int i;
BigInt* bi;
Now, to automatically go from a direct value to a reference, one would necessarily have to tag the value somehow. For instance, if the highest bit of the int was set, then the other bits could be used as a table lookup of some sort to retrieve the proper reference. That also means you'd get a BigInt** bi whenever it overflowed into that.
Of course, that's the bit usually used for sign, and hardware instructions pretty much depend on it. Worse still, if we do that, then the hardware won't be able to detect overflow and set the flags to indicate it. As a result, each operation would have to be accompanied by some test to see if an overflow has happened or will happen (depending on when it can be detected).
All that would add a lot of overhead to basic integer arithmetic, which would in practice negate any benefits you had to begin with. In other words, it is faster to assume BigInt than it is to try to use int and detect overflow conditions while at the same time juggling the reference/value problem.
So, to get any real advantage, one would have to use more space to represent ints. So instead of storing 32 bits on the stack, in the objects, or anywhere else we use them, we store 64 bits, for example, and use the additional 32 bits to control whether we want a reference or a value. That could work, but there's an obvious problem with it -- space usage. :-) We might see more of it with 64-bit hardware, though.
Now, you might ask why not just 40 bits (32 bits + 1 byte) instead of 64? Basically, on modern hardware it is preferable to store stuff in 32 bits increments for performance reasons, so we'll be padding 40 bits to 64 bits anyway.
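To make the per-operation overflow test described above concrete, here is roughly what a software check for 32-bit addition looks like in Java. It is a sketch of the general technique, not of anything the JVM does internally, and the names OverflowCheck and addOverflows are invented for the example.

```java
// Hypothetical sketch of a software overflow test for int addition.
class OverflowCheck {
    // Signed addition overflows exactly when both operands have the same
    // sign and the wrapped result has the opposite sign. In that case the
    // sign bit of both (a ^ r) and (b ^ r) is set.
    static boolean addOverflows(int a, int b) {
        int r = a + b; // wraps around silently in Java
        return ((a ^ r) & (b ^ r)) < 0;
    }

    public static void main(String[] args) {
        System.out.println(addOverflows(1, 2));                  // false
        System.out.println(addOverflows(Integer.MAX_VALUE, 1));  // true
        System.out.println(addOverflows(Integer.MIN_VALUE, -1)); // true
    }
}
```

Even this cheap test adds a couple of extra operations and a branch to every addition, which is the kind of overhead the answer is talking about.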
EDIT
Let's consider how one could go about doing this in C#. Now, I have no programming experience with C#, so I can't write the code to do it, but I expect I can give an overview.
The idea is to create a struct for it. It should look roughly like this:
public struct MixedInt
{
    private int i;
    private System.Numerics.BigInteger bi;
    public MixedInt(string s)
    {
        i = 0; // a struct constructor must assign every field
        bi = System.Numerics.BigInteger.Parse(s);
        if (bi <= int.MaxValue && bi >= int.MinValue)
        {
            // small enough: keep the value in the int and clear the BigInteger
            i = (int) bi;
            bi = 0;
        }
    }
    // Define all required operations
}
So, if the number is in the integer range we use int, otherwise we use BigInteger. The operations have to ensure transition from one to another as required/possible. From the client point of view, this is transparent. It's just one type MixedInt, and the class takes care of using whatever fits better.
Note, however, that this kind of optimization may well be part of C#'s BigInteger already, given its implementation as a struct.
If Java had something like C#'s struct, we could do something like this in Java as well.

Is there any reason why Java could not do this automatically i.e. switch to BigInteger if an int was too small?
This is one of the advantages of dynamic typing, but Java is statically typed, which prevents this.
In a dynamically typed language, when summing two Integers would produce an overflow, the system is free to return, say, a Long. Because dynamically typed languages rely on duck typing, that's fine. The same cannot happen in a statically typed language; it would break the type system.
EDIT
Given that my answer and comment were not clear, here I try to provide more details on why I think static typing is the main issue:
1) the very fact that we speak of primitive types is a static typing issue; we wouldn't care in a dynamically typed language.
2) with primitive types, the result of the overflow cannot be converted to a type other than int, because that would not be correct w.r.t. static typing
int i = Integer.MAX_VALUE + 1; // -2147483648
3) with reference types, it's the same except that we have autoboxing. Still, the addition could not return, say, a BigInteger because it would not match the static type system (a BigInteger cannot be cast to Integer).
Integer j = new Integer( Integer.MAX_VALUE ) + 1; // -2147483648
4) what could be done is to subclass, say, Number and implement a type UnboundedNumeric that optimizes the representation internally (representation independence).
UnboundedNum k = new UnboundedNum( Integer.MAX_VALUE ).add( 1 ); // 2147483648
Still, it's not really the answer to the original question.
5) with dynamic typing, something like
var d = new Integer( Integer.MAX_VALUE ) + 1; // 2147483648
would return a Long which is ok.
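Point 4 could look roughly like the following. UnboundedNum is the hypothetical class named above, not an existing library type; this sketch implements only add, keeps a long internally until an addition would overflow, and assumes Java 8+ for Math.addExact.

```java
import java.math.BigInteger;

// Hypothetical class: small values live in a long, large ones in a BigInteger.
final class UnboundedNum extends Number {
    private final long small;     // used when big == null
    private final BigInteger big; // non-null only once the value outgrew a long

    UnboundedNum(long value) { this.small = value; this.big = null; }

    private UnboundedNum(BigInteger value) { this.small = 0L; this.big = value; }

    // Adds on the primitive if possible, otherwise promotes to BigInteger.
    UnboundedNum add(long other) {
        if (big == null) {
            try {
                return new UnboundedNum(Math.addExact(small, other));
            } catch (ArithmeticException overflow) {
                // fall through to the BigInteger path
            }
        }
        return new UnboundedNum(toBigInteger().add(BigInteger.valueOf(other)));
    }

    BigInteger toBigInteger() { return big != null ? big : BigInteger.valueOf(small); }

    @Override public int intValue() { return (int) longValue(); }
    @Override public long longValue() { return big != null ? big.longValue() : small; }
    @Override public float floatValue() { return (float) doubleValue(); }
    @Override public double doubleValue() { return big != null ? big.doubleValue() : (double) small; }
    @Override public String toString() { return toBigInteger().toString(); }

    public static void main(String[] args) {
        System.out.println(new UnboundedNum(Integer.MAX_VALUE).add(1)); // 2147483648
        System.out.println(new UnboundedNum(Long.MAX_VALUE).add(1));    // 9223372036854775808
    }
}
```

The static type stays UnboundedNum throughout, so the type system is satisfied, but every value is now a heap object and every operation a method call, which is the performance cost discussed in the other answers.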
