Java Math.pow(a,b) time complexity

I would like to ask about the time complexity of the following code. Is it O(n)? (Is the time complexity of Math.pow() O(1)?) In general, does Math.pow(a,b) have time complexity O(b) or O(1)? Thanks in advance.
public void foo(int[] ar) {
    int n = ar.length;
    int sum = 0;
    for (int i = 0; i < n; ++i) {
        sum += Math.pow(10, ar[i]);
    }
}

@Blindy talks about possible approaches that Java could take in implementing pow.
First of all, the general case cannot be repeated multiplication: that won't work when the exponent is not an integer. (The signature for pow is Math.pow(double, double)!)
In the OpenJDK 8 codebase, the native code implementation for pow can work in two ways:
The first implementation in e_pow.c uses a power series. The approach is described in the C comments as follows:
* Method: Let x = 2^n * (1+f)
* 1. Compute and return log2(x) in two pieces:
*        log2(x) = w1 + w2,
*    where w1 has 53-24 = 29 bit trailing zeros.
* 2. Perform y*log2(x) = n+y' by simulating multi-precision
*    arithmetic, where |y'| <= 0.5.
* 3. Return x**y = 2**n*exp(y'*log2)
The second implementation in w_pow.c is a wrapper for the pow function provided by the Standard C library. The wrapper deals with edge cases.
Now it is possible that the Standard C library uses CPU-specific math instructions. If it did, and the JDK build (or runtime) selected[1] the second implementation, then Java would use those instructions too.
But either way, I can see no trace of any special case code that uses repeated multiplication. You can safely assume that it is O(1).
[1] - I haven't delved into how or when that selection is / can be made.
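To make the constant-time nature of the approach concrete, here is a minimal Java-level sketch of the same idea (this is my illustration, not the actual fdlibm code, which does the split-precision bookkeeping described above): the work is a fixed number of floating-point operations regardless of how large the exponent is.

// Hedged illustration only: x^y computed as exp(y * ln(x)), which is the same
// idea as the base-2 method described above, minus the extra-precision pieces.
// Only meaningful for x > 0; the cost is a constant number of operations,
// i.e. O(1) in the size of the exponent.
static double sketchPow(double x, double y) {
    return Math.exp(y * Math.log(x));
}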

You can consider Math.pow to be O(1).
There are a few possible implementations, ranging from a CPU assembler instruction (Java doesn't use this) to a stable software implementation based on, for example, the Taylor series expansion over a few terms (though not exactly the Taylor implementation; there are some more specific algorithms).
It most definitely won't repeatedly multiply if that's what you're worried about.

Related

Worst case time complexity of Math.sqrt in java

We have a test exercise where you need to find out whether a given number N is a square of another number or not, with the smallest time complexity.
I wrote:
public static boolean what2(int n) {
    double newN = (double) n;
    double x = Math.sqrt(newN);
    int y = (int) x;
    if (y * y == n)
        return false;
    else
        return true;
}
I looked online, and specifically on SO, to try and find the complexity of sqrt but couldn't find it. This SO post is for C# and says it's O(1), and this Java post says it's O(1) but could potentially iterate over all doubles.
I'm trying to understand the worst time complexity of this method. All other operations are O(1) so this is the only factor.
Would appreciate any feedback!
Using the floating point conversion is OK because Java's int type is 32 bits and Java's double type is the IEEE 64-bit format, which can represent all values of a 32-bit integer exactly.
If you were to implement your function for long, you would need to be more careful, because many large long values are not represented exactly as doubles, so taking the square root and converting it to an integer type might not yield the actual square root.
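To illustrate that extra care, here is a hedged sketch of a long-safe perfect-square test (the helper name and the neighbor-checking approach are mine, not part of the answer): rather than trusting the rounded root, it verifies the integer candidates around it.

// Hedged sketch: a perfect-square test for long values.
// Math.sqrt on a large long can be slightly off after rounding, so instead of
// trusting the truncated root we check the integer candidates around it.
static boolean isPerfectSquare(long n) {
    if (n < 0) return false;
    long r = (long) Math.sqrt((double) n);
    // 3037000499 is the largest long whose square still fits in a long,
    // so capping the candidate avoids overflow in c * c.
    long hi = Math.min(r + 1, 3037000499L);
    for (long c = Math.max(r - 1, 0); c <= hi; c++) {
        if (c * c == n) return true;
    }
    return false;
}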
All operations in your implementation execute in constant time, so the complexity of your solution is indeed O(1).
If I understood the question correctly, the Java instruction can be converted by just-in-time compilation to use the native fsqrt instruction (though I don't know whether this is actually the case), which, according to this table, uses a bounded number of processor cycles, which means the complexity would be O(1).
Java's Math.sqrt actually delegates to StrictMath; its source code can be found here. Looking at the sqrt function, the complexity appears to be constant time: see the while (r != 0) loop inside.

Why is Math.pow(int,int) slower than my naive implementation?

Yesterday I saw a question asking why Math.pow(int,int) is so slow, but the question was poorly worded and showed no research effort, so it was quickly closed.
I did a little test of my own and found that the Math.pow method actually did run extremely slow compared to my own naive implementation (which isn't even a particularly efficient implementation) when dealing with integer arguments. Below is the code I ran to test this:
class PowerTest {

    public static double myPow(int base, int exponent) {
        if (base == 0) return 0;
        if (exponent == 0) return 1;
        int absExponent = (exponent < 0) ? exponent * -1 : exponent;
        double result = base;
        for (int i = 1; i < absExponent; i++) {
            result *= base;
        }
        if (exponent < 1) result = 1 / result;
        return result;
    }

    public static void main(String args[]) {
        long startTime, endTime;

        startTime = System.nanoTime();
        for (int i = 0; i < 5000000; i++) {
            Math.pow(2, 2);
        }
        endTime = System.nanoTime();
        System.out.printf("Math.pow took %d milliseconds.\n", (endTime - startTime) / 1000000);

        startTime = System.nanoTime();
        for (int i = 0; i < 5000000; i++) {
            myPow(2, 2);
        }
        endTime = System.nanoTime();
        System.out.printf("myPow took %d milliseconds.\n", (endTime - startTime) / 1000000);
    }
}
On my computer (Linux on an Intel x86_64 CPU), the output almost always reported that Math.pow took 10ms while myPow took 2ms. This occasionally fluctuated by a millisecond here or there, but Math.pow ran about 5x slower on average.
I did some research and, according to grepcode, Math.pow only offers a method with type signature of (double, double), and it defers that to the StrictMath.pow method which is a native method call.
The fact that the Math library only offers a pow function that deals with doubles seems to indicate a possible answer to this question. Obviously, a power algorithm that must handle the possibility of a base or exponent of type double is going to take longer to execute than my algorithm which only deals with integers. However, in the end, it boils down to architecture-dependent native code (which almost always runs faster than JVM byte code, probably C or assembly in my case). It seems that at this level, an optimization would be made to check the data type and run a simpler algorithm if possible.
Given this information, why does the native Math.pow method consistently run much slower than my un-optimized and naive myPow method when given integer arguments?
As others have said, you cannot just ignore the use of double, as floating point arithmetic will almost certainly be slower. However, this is not the only reason - if you change your implementation to use them, it is still faster.
This is because of two things: the first is that 2^2 (exponentiation, not xor) is a very quick calculation to perform, so your algorithm is fine to use for that. Try using two values from Random#nextInt (or nextDouble) and you'll see that Math#pow is actually much quicker.
The other reason is that calling native methods has overhead, which is actually meaningful here, because 2^2 is so quick to calculate, and you are calling Math#pow so many times. See What makes JNI calls slow? for more on this.
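A minimal sketch of the random-argument check suggested above (the class name, loop sizes and ranges are my own choices, and it assumes the PowerTest class shown earlier is compiled alongside it): feeding both implementations the same random inputs removes the advantage of the trivially small 2^2 case.

import java.util.Random;

// Hedged sketch: time both pow variants on the same random inputs instead of
// the constant 2^2, accumulating into a sink so the JIT cannot discard the calls.
class RandomPowBench {
    public static void main(String[] args) {
        Random rnd = new Random(42);
        int n = 5000000;
        int[] bases = new int[n], exps = new int[n];
        for (int i = 0; i < n; i++) {
            bases[i] = rnd.nextInt(100) + 1;
            exps[i] = rnd.nextInt(100);
        }

        double sink = 0;
        long t0 = System.nanoTime();
        for (int i = 0; i < n; i++) sink += Math.pow(bases[i], exps[i]);
        long t1 = System.nanoTime();
        for (int i = 0; i < n; i++) sink += PowerTest.myPow(bases[i], exps[i]);
        long t2 = System.nanoTime();

        System.out.printf("Math.pow: %d ms, myPow: %d ms (sink=%g)%n",
                (t1 - t0) / 1000000, (t2 - t1) / 1000000, sink);
    }
}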
There is no pow(int,int) function. You are comparing apples to oranges with your simplifying assumption that floating point numbers can be ignored.
Math.pow is slow because it deals with an equation in the generic sense, using fractional powers to raise it to the given power. It's the lookup it has to go through when computing that takes more time.
Simply multiplying numbers together is often faster, since native calls in Java are much more efficient.
Edit: It may also be worth noting that Math functions use doubles, which can also take longer than using ints.
Math.pow(x, y) is probably implemented as exp(y log x). This allows for fractional exponents and is remarkably fast.
But you'll be able to beat this performance by writing your own version if you only require small positive integral arguments.
Arguably Java could make that check for you, but there would be a point for large integers where the built-in version would be faster. It would also have to define an appropriate integral return type and the risk of overflowing that is obvious. Defining the behaviour around a branch region would be tricky.
By the way, your integral type version could be faster. Do some research on exponentiation by squaring.
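For reference, here is a minimal sketch of exponentiation by squaring (the method name is mine, not from the answer); it needs only O(log exponent) multiplications instead of O(exponent).

// Hedged sketch of exponentiation by squaring.
// Each loop iteration halves the exponent, so only O(log exponent)
// multiplications are performed.
static double powBySquaring(double base, int exponent) {
    boolean negative = exponent < 0;
    long e = Math.abs((long) exponent); // long avoids overflow for Integer.MIN_VALUE
    double result = 1.0;
    double b = base;
    while (e > 0) {
        if ((e & 1) == 1) result *= b; // fold in the current bit of the exponent
        b *= b;                        // square for the next bit
        e >>= 1;
    }
    return negative ? 1.0 / result : result;
}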

Why does my processing time drop when running the same function over and over again (with incremented values)?

I was testing a new method to replace my old one and did some speed testing.
When I now look at the graph I see, that the time it takes per iteration drops drastically.
Now I'm wondering why that might be.
My guess would be that my graphics card takes over the heavy work, but the first function iterates n times and the second (the blue one) doesn't have a single iteration, just some "heavy" calculation work with doubles.
In case system details are needed:
OS: Mac OS X 10.10.4
Core: 2.8 GHz Intel Core i7 (4x)
GPU: AMD Radeon R9 M370X 2048 MB
If you need the two functions:
New One:
private static int sumOfI(int i) {
    int factor;
    float factor_ = (i + 1) / 2;
    factor = (int) factor_;
    return (i % 2 == 0) ? i * factor + i / 2 : i * factor;
}
Old One:
private static int sumOfIOrdinary(int j) {
    int result = 0;
    for (int i = 1; i <= j; i++) {
        result += i;
    }
    return result;
}
To clarify my question:
Why does the processing time drop that drastically?
Edit:
I understand at least a little bit about cost and such. I probably didn't explain my test method well enough. I have a simple for loop which in this test counted from 0 to 1000, and I fed each value to one method and recorded the time it took (for the whole loop to execute); then I did the same with the other method.
So after the loop reached about 500, the same method took significantly less time to execute.
Java does not calculate anything on the graphics card (without help from other frameworks or classes). Also, what you think of as a "heavy" calculation is quite easy for a CPU these days (even if division is somewhat tricky). So the speed depends on the bytecode generated, on the optimisations Java applies when running the program, and mostly on the Big-O behaviour of the method.
Your method sumOfI is just x statements to execute, so it is O(1): regardless of how large your i is, it is always only those x statements. But sumOfIOrdinary uses a loop and is O(n): it executes y statements plus i loop iterations, depending on the input.
So in theory, and in the worst case, sumOfI is always faster than sumOfIOrdinary.
You can also see this in the bytecode view. sumOfI is only some load, add and multiply instructions for the CPU, but for a loop the bytecode also uses a goto, has to jump back to an earlier address and execute the same lines again, and that costs time.
On my VM, with i=500000 the first method needs <1 millisecond and the second method, because of the loop, takes 2-4 milliseconds.
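As a side note, the same constant-time result can be written more directly with the classic closed form for 1 + 2 + ... + i (this variant is mine, not from the answer):

// Hedged sketch: the closed-form sum 1 + 2 + ... + i = i * (i + 1) / 2,
// computed in a fixed number of operations just like sumOfI above.
// Using long for the intermediate product avoids int overflow for large i.
private static int sumClosedForm(int i) {
    return (int) ((long) i * (i + 1) / 2);
}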
Links to explain Big-O-Notation:
Simple Big O Notation
A beginner's guide to Big O notation

What's wrong with using associativity by compilers?

Sometimes associativity can be used to loosen data dependencies, and I was curious how much it can help. I was rather surprised to find out that I can get nearly a speed-up factor of 4 by manually unrolling a trivial loop, both in Java (build 1.7.0_51-b13) and in C (gcc 4.4.3).
So either I'm doing something pretty stupid or the compilers ignore a powerful tool. I started with
int a = 0;
for (int i=0; i<N; ++i) a = M1 * a + t[i];
which computes something close to String.hashCode() (set M1=31 and use a char[]). The computation is pretty trivial and for t.length=1000 takes about 1.2 microseconds on my i5-2400 @ 3.10GHz (both in Java and C).
Observe that each two steps a gets multiplied by M2 = M1*M1 and added something. This leads to this piece of code
int a = 0;
int i = 0;
for (; i + 1 < N; i += 2) {
    a = M2 * a + (M1 * t[i] + t[i+1]); // <-- note the parentheses!
}
if (i < N) a = M1 * a + t[i]; // Handle odd length.
This is exactly twice as fast as the first snippet. Strangely, leaving out the parentheses eats 20% of the speed-up. Funnily enough, this can be repeated and a factor of 3.8 can be achieved.
Unlike Java, gcc -O3 chooses not to unroll the loop. That's a wise choice, since it wouldn't help anyway (as -funroll-all-loops shows).
So my question is: What prevents such an optimization?
Googling didn't work, I got "associative arrays" and "associative operators" only.
Update
I polished up my benchmark a little bit and can provide some results now. There's no speedup beyond unrolling 4 times, probably because of multiplication and addition together taking 4 cycles.
Update 2
As Java already unrolls the loop, all the hard work is done. What we get is something like
...pre-loop
for (int i = 0; i < N; i += 2) {
    a2 = M1 * a + t[i];
    a = M1 * a2 + t[i+1];
}
...post-loop
where the interesting part can be rewritten like
a = M1 * ((M1 * a) + t[i]) + t[i+1]; // latency 2mul + 2add
This reveals that there are 2 multiplications and 2 additions, all of them to be performed sequentially, thus needing 8 cycles on a modern x86 CPU. All we need now is some primary-school math (which works for ints even in the presence of overflow, but is not applicable to floating point).
a = ((M1 * (M1 * a)) + (M1 * t[i])) + t[i+1]; // latency 2mul + 2add
So far we gained nothing, but it allows us to fold the constants
a = ((M2 * a) + (M1 * t[i])) + t[i+1]; // latency 1mul + 2add
and gain even more by regrouping the sum
a = (M2 * a) + ((M1 * t[i]) + t[i+1]); // latency 1mul + 1add
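Putting the pieces together, a hedged sketch of the 2-way unrolled, regrouped loop might look like this (variable names follow the question; the method wrapper and use of char[] are my own assumptions):

// Hedged sketch of the manually unrolled hash-style loop from the question.
// M2 = M1 * M1 is folded once up front; each iteration then needs only one
// multiplication on the loop-carried value 'a', shortening the dependency chain.
static int unrolledHash(char[] t, int M1) {
    final int M2 = M1 * M1;
    final int N = t.length;
    int a = 0;
    int i = 0;
    for (; i + 1 < N; i += 2) {
        a = M2 * a + (M1 * t[i] + t[i + 1]); // regrouped as in the question
    }
    if (i < N) a = M1 * a + t[i]; // handle odd length
    return a;
}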
Here is how I understand your two cases: In the first case, you have a loop that takes N steps; in the second case you manually merged two consecutive iterations of the first case into one, so you only need to do N/2 steps in the second case. Your second case runs faster and you are wondering why the dumb compiler couldn't do it automatically.
There is nothing that would prevent the compiler from doing such an optimization. But please note that this re-write of the original loop leads to larger executable size: You have more instructions inside the for loop and the additional if after the loop.
If N=1 or N=3, the original loop is likely to be faster (less branching and better caching/prefetching/branch prediction). It made things faster in your case, but it may make things slower in other cases. It is not clear-cut whether this optimization is worth doing, and it can be highly nontrivial to implement in a compiler.
By the way, what you have done is very similar to loop vectorization, but in your case, you did the parallel step manually and plugged in the result. Eric Brumer's Compiler Confidential talk will give you insight into why rewriting loops in general is tricky and what the drawbacks / disadvantages are (larger executable size, potentially slower in some cases). So compiler writers are very well aware of this optimization possibility and are actively working on it, but it is highly nontrivial in general and can also make things slower.
Please try something for me:
int a = 0;
for (int i = 0; i < N; ++i)
    a = ((a << 5) - a) + t[i];
assuming M1=31. In principle, the compiler should be smart enough to rewrite 31*a into (a<<5)-a but I am curious if it really does that.

Empirically estimating big-oh time efficiency

Background
I'd like to estimate the big-oh performance of some methods in a library through benchmarks. I don't need precision -- it suffices to show that something is O(1), O(logn), O(n), O(nlogn), O(n^2) or worse than that. Since big-oh means upper-bound, estimating O(logn) for something that is O(log logn) is not a problem.
Right now, I'm thinking of finding the constant multiplier k that best fits data for each big-oh (but will top all results), and then choosing the big-oh with the best fit.
Questions
Are there better ways of doing it than what I'm thinking of? If so, what are they?
Otherwise, can anyone point me to the algorithms to estimate k for best fitting, and comparing how well each curve fits the data?
Notes & Constraints
Given the comments so far, I need to make a few things clear:
This needs to be automated. I can't "look" at data and make a judgment call.
I'm going to benchmark the methods with multiple n sizes. For each size n, I'm going to use a proven benchmark framework that provides reliable statistical results.
I actually know beforehand the big-oh of most of the methods that will be tested. My main intention is to provide performance regression testing for them.
The code will be written in Scala, and any free Java library can be used.
Example
Here's one example of the kind of stuff I want to measure. I have a method with this signature:
def apply(n: Int): A
Given an n, it will return the nth element of a sequence. This method can have O(1), O(logn) or O(n) given the existing implementations, and small changes can get it to use a suboptimal implementation by mistake. Or, more easily, some other method that depends on it could end up using a suboptimal version of it.
In order to get started, you have to make a couple of assumptions.
n is large compared to any constant terms.
You can effectively randomize your input data
You can sample with sufficient density to get a good handle on the distribution of runtimes
In particular, (3) is difficult to achieve in concert with (1). So you may get something with an exponential worst case, but never run into that worst case, and thus think your algorithm is much better than it is on average.
With that said, all you need is any standard curve fitting library. Apache Commons Math has a fully adequate one. You then either create a function with all the common terms that you want to test (e.g. constant, log n, n, n log n, n*n, n*n*n, e^n), or you take the log of your data and fit the exponent, and then if you get an exponent not close to an integer, see if throwing in a log n gives a better fit.
(In more detail, if you fit C*x^a for C and a, or more easily log C + a log x, you can get the exponent a; in the all-common-terms-at-once scheme, you'll get weights for each term, so if you have n*n + C*n*log(n) where C is large, you'll pick up that term also.)
You'll want to vary the size by enough so that you can tell the different cases apart (might be hard with log terms, if you care about those), and safely more different sizes than you have parameters (probably 3x excess would start being okay, as long as you do at least a dozen or so runs total).
Edit: Here is Scala code that does all this for you. Rather than explain each little piece, I'll leave it to you to investigate; it implements the scheme above using the C*x^a fit, and returns ((a,C),(lower bound for a, upper bound for a)). The bounds are quite conservative, as you can see from running the thing a few times. The units of C are seconds (a is unitless), but don't trust that too much as there is some looping overhead (and also some noise).
class TimeLord[A: ClassManifest, B: ClassManifest](setup: Int => A, static: Boolean = true)(run: A => B) {
  @annotation.tailrec final def exceed(time: Double, size: Int, step: Int => Int = _*2, first: Int = 1): (Int,Double) = {
    var i = 0
    val elapsed = 1e-9 * {
      if (static) {
        val a = setup(size)
        var b: B = null.asInstanceOf[B]
        val t0 = System.nanoTime
        var i = 0
        while (i < first) {
          b = run(a)
          i += 1
        }
        System.nanoTime - t0
      }
      else {
        val starts = if (static) { val a = setup(size); Array.fill(first)(a) } else Array.fill(first)(setup(size))
        val answers = new Array[B](first)
        val t0 = System.nanoTime
        var i = 0
        while (i < first) {
          answers(i) = run(starts(i))
          i += 1
        }
        System.nanoTime - t0
      }
    }
    if (time > elapsed) {
      val second = step(first)
      if (second <= first) throw new IllegalArgumentException("Iteration size increase failed: %d to %d".format(first,second))
      else exceed(time, size, step, second)
    }
    else (first, elapsed)
  }

  def multibench(smallest: Int, largest: Int, time: Double, n: Int, m: Int = 1) = {
    if (m < 1 || n < 1 || largest < smallest || (n>1 && largest==smallest)) throw new IllegalArgumentException("Poor choice of sizes")
    val frac = (largest.toDouble)/smallest
    (0 until n).map(x => (smallest*math.pow(frac,x/((n-1).toDouble))).toInt).map{ i =>
      val (k,dt) = exceed(time,i)
      if (m==1) i -> Array(dt/k) else {
        i -> ( (dt/k) +: (1 until m).map(_ => exceed(time,i,first=k)).map{ case (j,dt2) => dt2/j }.toArray )
      }
    }.foldLeft(Vector[(Int,Array[Double])]()){ (acc,x) =>
      if (acc.length==0 || acc.last._1 != x._1) acc :+ x
      else acc.dropRight(1) :+ (x._1, acc.last._2 ++ x._2)
    }
  }

  def alpha(data: Seq[(Int,Array[Double])]) = {
    // Use Theil-Sen estimator for calculation of straight-line fit for exponent
    // Assume timing relationship is t(n) = A*n^alpha
    val dat = data.map{ case (i,ad) => math.log(i) -> ad.map(x => math.log(i) -> math.log(x)) }
    val slopes = (for {
      i <- dat.indices
      j <- ((i+1) until dat.length)
      (pi,px) <- dat(i)._2
      (qi,qx) <- dat(j)._2
    } yield (qx - px)/(qi - pi)).sorted
    val mbest = slopes(slopes.length/2)
    val mp05 = slopes(slopes.length/20)
    val mp95 = slopes(slopes.length-(1+slopes.length/20))
    val intercepts = dat.flatMap{ case (i,a) => a.map{ case (li,lx) => lx - li*mbest } }.sorted
    val bbest = intercepts(intercepts.length/2)
    ((mbest,math.exp(bbest)),(mp05,mp95))
  }
}
Note that the multibench method is expected to take about sqrt(2)*n*m*time to run, assuming that static initialization data is used and is relatively cheap compared to whatever you're running. Here are some examples with parameters chosen to take ~15s to run:
val tl1 = new TimeLord(x => List.range(0,x))(_.sum) // Should be linear
// Try list sizes 100 to 10000, with each run taking at least 0.1s;
// use 10 different sizes and 10 repeats of each size
scala> tl1.alpha( tl1.multibench(100,10000,0.1,10,10) )
res0: ((Double, Double), (Double, Double)) = ((1.0075537890632216,7.061397125245351E-9),(0.8763463348353099,1.102663784225697))
val longList = List.range(0,100000)
val tl2 = new TimeLord(x=>x)(longList.apply) // Again, should be linear
scala> tl2.alpha( tl2.multibench(100,10000,0.1,10,10) )
res1: ((Double, Double), (Double, Double)) = ((1.4534378213477026,1.1325696181862922E-10),(0.969955396265306,1.8294175293676322))
// 1.45?! That's not linear. Maybe the short ones are cached?
scala> tl2.alpha( tl2.multibench(9000,90000,0.1,100,1) )
res2: ((Double, Double), (Double, Double)) = ((0.9973235607566956,1.9214696731124573E-9),(0.9486294398193154,1.0365312207345019))
// Let's try some sorting
val tl3 = new TimeLord(x=>Vector.fill(x)(util.Random.nextInt))(_.sorted)
scala> tl3.alpha( tl3.multibench(100,10000,0.1,10,10) )
res3: ((Double, Double), (Double, Double)) = ((1.1713142886974603,3.882658025586512E-8),(1.0521099621639414,1.3392622111121666))
// Note the log(n) term comes out as a fractional power
// (which will decrease as the sizes increase)
// Maybe sort some arrays?
// This may take longer to run because we have to recreate the (mutable) array each time
val tl4 = new TimeLord(x=>Array.fill(x)(util.Random.nextInt), false)(java.util.Arrays.sort)
scala> tl4.alpha( tl4.multibench(100,10000,0.1,10,10) )
res4: ((Double, Double), (Double, Double)) = ((1.1216172965292541,2.2206198821180513E-8),(1.0929414090177318,1.1543697719880128))
// Let's time something slow
def kube(n: Int) = (for (i <- 1 to n; j <- 1 to n; k <- 1 to n) yield 1).sum
val tl5 = new TimeLord(x=>x)(kube)
scala> tl5.alpha( tl5.multibench(10,100,0.1,10,10) )
res5: ((Double, Double), (Double, Double)) = ((2.8456382116915484,1.0433534274508799E-7),(2.6416659356198617,2.999094292838751))
// Okay, we're a little short of 3; there's constant overhead on the small sizes
Anyway, for the stated use case--where you are checking to make sure the order doesn't change--this is probably adequate, since you can play with the values a bit when setting up the test to make sure they give something sensible. One could also create heuristics that search for stability, but that's probably overkill.
(Incidentally, there is no explicit warmup step here; the robust fitting of the Theil-Sen estimator should make it unnecessary for sensibly large benchmarks. This also is why I don't use any other benching framework; any statistics that it does just loses power from this test.)
Edit again: if you replace the alpha method with the following:
// We'll need this math
@inline private[this] def sq(x: Double) = x*x
final private[this] val inv_log_of_2 = 1/math.log(2)
@inline private[this] def log2(x: Double) = math.log(x)*inv_log_of_2
import math.{log,exp,pow}

// All the info you need to calculate a y value, e.g. y = x*m+b
case class Yp(x: Double, m: Double, b: Double) {}

// Estimators for data order
//   fx = transformation to apply to x-data before linear fitting
//   fy = transformation to apply to y-data before linear fitting
//   model = given x, slope, and intercept, calculate predicted y
case class Estimator(fx: Double => Double, invfx: Double => Double, fy: (Double,Double) => Double, model: Yp => Double) {}

// C*n^alpha
val alpha = Estimator(log, exp, (x,y) => log(y), p => p.b*pow(p.x,p.m))
// C*log(n)*n^alpha
val logalpha = Estimator(log, exp, (x,y) => log(y/log2(x)), p => p.b*log2(p.x)*pow(p.x,p.m))

// Use Theil-Sen estimator for calculation of straight-line fit
case class Fit(slope: Double, const: Double, bounds: (Double,Double), fracrms: Double) {}

def theilsen(data: Seq[(Int,Array[Double])], est: Estimator = alpha) = {
  // Use Theil-Sen estimator for calculation of straight-line fit for exponent
  // Assume timing relationship is t(n) = A*n^alpha
  val dat = data.map{ case (i,ad) => ad.map(x => est.fx(i) -> est.fy(i,x)) }
  val slopes = (for {
    i <- dat.indices
    j <- ((i+1) until dat.length)
    (pi,px) <- dat(i)
    (qi,qx) <- dat(j)
  } yield (qx - px)/(qi - pi)).sorted
  val mbest = slopes(slopes.length/2)
  val mp05 = slopes(slopes.length/20)
  val mp95 = slopes(slopes.length-(1+slopes.length/20))
  val intercepts = dat.flatMap{ _.map{ case (li,lx) => lx - li*mbest } }.sorted
  val bbest = est.invfx(intercepts(intercepts.length/2))
  val fracrms = math.sqrt(data.map{ case (x,ys) => ys.map(y => sq(1 - y/est.model(Yp(x,mbest,bbest)))).sum }.sum / data.map(_._2.length).sum)
  Fit(mbest, bbest, (mp05,mp95), fracrms)
}
then you can get an estimate of the exponent even when there's a log term. Error estimates exist to help you decide whether including the log term is the right choice, but it's up to you to make the call (i.e. I'm assuming you'll be supervising this initially and reading the numbers that come off):
val tl3 = new TimeLord(x=>Vector.fill(x)(util.Random.nextInt))(_.sorted)
val timings = tl3.multibench(100,10000,0.1,10,10)
// Regular n^alpha fit
scala> tl3.theilsen( timings )
res20: tl3.Fit = Fit(1.1811648421030059,3.353753446942075E-8,(1.1100382697696545,1.3204652930525234),0.05927994882343982)
// log(n)*n^alpha fit--note first value is closer to an integer
// and last value (error) is smaller
scala> tl3.theilsen( timings, tl3.logalpha )
res21: tl3.Fit = Fit(1.0369167329732445,9.211366397621766E-9,(0.9722967182484441,1.129869067913768),0.04026308919615681)
(Edit: fixed the RMS computation so it's actually the mean, plus demonstrated that you only need to do timings once and can then try both fits.)
I don't think your approach will work in general.
The problem is that "big O" complexity is based on a limit as some scaling variable tends to infinity. For smaller values of that variable, the performance behavior can appear to fit a different curve entirely.
The problem is that with an empirical approach you can never know if the scaling variable is large enough for the limit to be apparent in the results.
Another problem is that if you implement this in Java / Scala, you have to go to considerable lengths to eliminate distortions and "noise" in your timings due to things like JVM warmup (e.g. class loading, JIT compilation, heap resizing) and garbage collection.
Finally, nobody is going to place much trust in empirical estimates of complexity. Or at least, they wouldn't if they understood the mathematics of complexity analysis.
FOLLOWUP
In response to this comment:
Your estimate's significance will improve drastically the more and larger samples you use.
This is true, though my point is that you (Daniel) haven't factored this in.
Also, runtime functions typically have special characteristics which can be exploited; for example, algorithms tend to not change their behaviour at some huge n.
For simple cases, yes.
For complicated cases and real world cases, that is a dubious assumption. For example:
Suppose some algorithm uses a hash table with a large but fixed-sized primary hash array, and uses external lists to deal with collisions. For N (== number of entries) less than the size of the primary hash array, the behaviour of most operations will appear to be O(1). The true O(N) behaviour can only be detected by curve fitting when N gets much larger than that.
Suppose that the algorithm uses a lot of memory or network bandwidth. Typically, it will work well until you hit the resource limit, and then performance will tail off badly. How do you account for this? If it is part of the "empirical complexity", how do you make sure that you get to the transition point? If you want to exclude it, how do you do that?
If you are happy to estimate this empirically, you can measure how long it takes to do exponentially increasing numbers of operations. Using the ratio you can get which function you estimate it to be.
e.g. compare the time for 1,000 operations with the time for 10,000 operations (a 10x increase in size), testing the longer one first. You need to do a realistic number of operations to see what the order is for the range you have. Then read the observed time ratio off this table (a short timing sketch follows it):
1x => O(1)
1.2x => O(ln ln n)
~ 2-5x => O(ln n)
10x => O(n)
20-50x => O(n ln n)
100x => O(n ^ 2)
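A minimal Java sketch of this ratio test (the harness, class name and IntConsumer-based interface are my own assumptions, not from the answer): time the same workload at n and at 10n and print the ratio.

import java.util.function.IntConsumer;

// Hedged sketch of the ratio test described above: run the same workload at
// size n and size 10*n and compare the elapsed times. The larger size is
// timed first, as suggested, so JIT warm-up hurts it rather than flatters it.
class RatioEstimate {
    static long timeIt(IntConsumer workload, int size) {
        long t0 = System.nanoTime();
        workload.accept(size);
        return System.nanoTime() - t0;
    }

    public static void main(String[] args) {
        IntConsumer workload = n -> {            // example workload: an O(n) sum
            long sum = 0;
            for (int i = 0; i < n; i++) sum += i;
            if (sum == 42) System.out.println(); // keep the loop from being optimized away
        };
        int n = 1000000;
        long big = timeIt(workload, 10 * n);     // longer run first
        long small = timeIt(workload, n);
        System.out.printf("ratio = %.1fx%n", (double) big / small); // ~10x suggests O(n)
    }
}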
It is just an estimate, as time complexity is defined for an idealised machine, and it is something that should be mathematically proven rather than measured.
e.g. many people tried to prove empirically that PI is a fraction. When they measured the ratio of circumference to diameter for circles they had made, it was always a fraction. Eventually, it was generally accepted that PI is not a fraction.
We have recently implemented a tool that does semi-automated average-runtime analysis for JVM code. You do not even need access to the sources. It is not published yet (we are still ironing out some usability flaws) but will be soon, I hope.
It is based on maximum-likelihood models of program execution [1]. In short, byte code is augmented with cost counters. The target algorithm is then run (distributed, if you want) on a bunch of inputs whose distribution you control. The aggregated counters are extrapolated to functions using involved heuristics (method of least squares on crack, sort of). From those, more science leads to an estimate for the average runtime asymptotics (3.576n - 1.23log(n) + 1.7, for instance). For example, the method is able to reproduce rigorous classic analyses done by Knuth and Sedgewick with high precision.
The big advantage of this method compared to what others post is that you are independent of time estimates, that is in particular independent of machine, virtual machine and even programming language. You really get information about your algorithm, without all the noise.
And---probably the killer feature---it comes with a complete GUI that guides you through the whole process.
See my answer on cs.SE for a little more detail and further references.
You can find a preliminary website (including a beta version of the tool and the papers published) here.
(Note that average runtime can be estimated that way while worst case runtime can never be, except in case you know the worst case. If you do, you can use the average case for worst case analysis; just feed the tool only worst case instances. In general, runtime bounds can not be decided, though.)
Maximum likelihood analysis of algorithms and data structures by U. Laube and M.E. Nebel (2010). [preprint]
What you are looking to achieve is impossible in general. Even the fact that an algorithm will ever stop cannot be proven in the general case (see the Halting Problem). And even if it does stop on your data, you still cannot deduce the complexity by running it. For instance, bubble sort has complexity O(n^2), while on already-sorted data it performs as if it were O(n). There is no way to select "appropriate" data for an unknown algorithm to estimate its worst case.
You should consider changing a critical aspect of your task.
Change the terminology you are using to "estimate the runtime of the algorithm" or "set up performance regression testing".
Can you estimate the runtime of the algorithm? Well, you propose to try different input sizes and measure either some critical operation or the time it takes. Then, for the series of input sizes, you plan to programmatically estimate whether the algorithm's runtime shows no growth, constant growth, exponential growth, etc.
So you have two problems: running the tests, and programmatically estimating the growth rate as your input set grows. This sounds like a reasonable task.
I'm not sure I get 100% what you want. But I understand that you test your own code, so you can modify it, e.g. inject observing statements. Otherwise you could use some form of aspect weaving?
How about adding resettable counters to your data structures and then increasing them each time a particular sub-function is invoked? You could make those counters @elidable so they will be gone in the deployed library.
Then for a given method, say delete(x), you would test it with all sorts of automatically generated data sets, trying to give them some skew, etc., and gather the counts. While, as Igor points out, you cannot verify that the data structure won't ever violate a big-O bound, you will at least be able to assert that in the actual experiment a given limit count is never exceeded (e.g. going down a node in a tree is never done more than 4 * log(n) times), so you can detect some mistakes.
Of course, you would need certain assumptions, e.g. that calling a method is O(1) in your computer model.
"I actually know beforehand the big-oh of most of the methods that will be tested. My main intention is to provide performance regression testing for them."
This requirement is key. You want to detect outliers with minimal data (because testing should be fast, dammit), and in my experience fitting curves to numerical evaluations of complex recurrences, linear regression and the like will overfit. I think your initial idea is a good one.
What I would do to implement it is prepare a list of expected complexity functions g1, g2, ..., and for data f, test how close to constant f/gi + gi/f is for each i. With a least squares cost function, this is just computing the variance of that quantity for each i and reporting the smallest. Eyeball the variances at the end and manually inspect unusually poor fits.
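A hedged Java sketch of that selection rule (the class name, candidate set and use of a size-to-time map are my own assumptions): for each candidate g, compute the sample variance of f(n)/g(n) + g(n)/f(n) over the measured points and report the candidate with the smallest variance.

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.DoubleUnaryOperator;

// Hedged sketch of the "f/g + g/f should be constant" heuristic above.
// times maps input size n (use sizes >= 2 so log n > 0) to measured runtime f(n);
// the candidate g with the lowest variance of f/g + g/f is reported as the best fit.
class ComplexityGuess {
    static String bestFit(Map<Integer, Double> times) {
        Map<String, DoubleUnaryOperator> candidates = new LinkedHashMap<>();
        candidates.put("O(1)",       n -> 1.0);
        candidates.put("O(log n)",   n -> Math.log(n));
        candidates.put("O(n)",       n -> n);
        candidates.put("O(n log n)", n -> n * Math.log(n));
        candidates.put("O(n^2)",     n -> n * n);

        String best = null;
        double bestVariance = Double.POSITIVE_INFINITY;
        for (Map.Entry<String, DoubleUnaryOperator> c : candidates.entrySet()) {
            // collect q(n) = f(n)/g(n) + g(n)/f(n) for every measured point
            double[] q = times.entrySet().stream()
                    .mapToDouble(e -> {
                        double f = e.getValue();
                        double g = c.getValue().applyAsDouble(e.getKey());
                        return f / g + g / f;
                    }).toArray();
            double mean = java.util.Arrays.stream(q).average().orElse(0);
            double variance = java.util.Arrays.stream(q)
                    .map(x -> (x - mean) * (x - mean)).average().orElse(0);
            if (variance < bestVariance) { bestVariance = variance; best = c.getKey(); }
        }
        return best;
    }
}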
For an empiric analysis of the complexity of the program, what you would do is run (and time) the algorithm given 10, 50, 100, 500, 1000, etc input elements. You can then graph the results and determine the best-fit function order from the most common basic types: constant, logarithmic, linear, nlogn, quadratic, cubic, higher-polynomial, exponential. This is a normal part of load testing, which makes sure that the algorithm is first behaving as theorized, and second that it meets real-world performance expectations despite its theoretical complexity (a logarithmic-time algorithm in which each step takes 5 minutes is going to lose all but the absolute highest-cardinality tests to a quadratic-complexity algorithm in which each step is a few millis).
EDIT: Breaking it down, the algorithm is very simple:
Define a list, N, of various cardinalities for which you want to evaluate performance (10, 100, 1000, 10000, etc.)
For each element X in N:
    Create a suitable set of test data that has X elements.
    Start a stopwatch, or determine and store the current system time.
    Run the algorithm over the X-element test set.
    Stop the stopwatch, or determine the system time again.
    The difference between start and stop times is your algorithm's run time over X elements.
Repeat for each X in N.
Plot the results; given X elements (x-axis), the algorithm takes T time (y-axis). The closest basic function governing the increase in T as X increases is your Big-Oh approximation. As Raphael stated, this approximation is exactly that, and will not give you very fine distinctions such as coefficients of N, which could make the difference between an N^2 algorithm and a 2N^2 algorithm (both are technically O(N^2), but given the same number of elements one will perform twice as fast).
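A minimal Java harness in the spirit of those steps (the class name, cardinality list and the use of Arrays.sort as a stand-in algorithm are my own placeholders): it times the algorithm at each cardinality and prints size/time pairs that can then be plotted or fitted.

import java.util.Random;
import java.util.function.Consumer;

// Hedged sketch of the measurement loop described above: for each cardinality,
// build a test set, time one run of the algorithm over it, and print the pair.
class LoadTest {
    public static void main(String[] args) {
        int[] cardinalities = {10, 100, 1000, 10000, 100000};
        Random rnd = new Random(1);
        Consumer<int[]> algorithm = java.util.Arrays::sort; // placeholder for the method under test

        for (int x : cardinalities) {
            int[] data = rnd.ints(x).toArray(); // suitable test data with X elements
            long t0 = System.nanoTime();
            algorithm.accept(data);
            long t1 = System.nanoTime();
            System.out.printf("%d\t%.3f ms%n", x, (t1 - t0) / 1e6);
        }
    }
}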
Wanted to share my experiments as well. Nothing new from the theoretical standpoint, but it's a fully functional Python module that can easily be extended.
Main points:
It's based on the scipy Python library's curve_fit function, which allows fitting any function to the given set of points by minimizing the sum of squared differences;
Since tests are done increasing the problem size exponentially, points closer to the start will carry a bigger weight, which does not help to identify the correct approximation, so it seems to me that simple linear interpolation to redistribute the points evenly does help;
The set of approximations we are trying to fit is fully under our control; I've added the following ones:
import numpy as np

def fn_linear(x, k, c):
    return k * x + c

def fn_squared(x, k, c):
    return k * x ** 2 + c

def fn_pow3(x, k, c):
    return k * x ** 3 + c

def fn_log(x, k, c):
    return k * np.log10(x) + c

def fn_nlogn(x, k, c):
    return k * x * np.log10(x) + c
Here is a fully functional Python module to play with: https://gist.github.com/gubenkoved/d9876ccf3ceb935e81f45c8208931fa4, and some pictures it produces (please note -- 4 graphs per sample with different axis scales).
