Binary GCD Algorithm vs. Euclid's Algorithm on modern computers - java

http://en.wikipedia.org/wiki/Binary_GCD_algorithm
This Wikipedia entry has a very dissatisfying implication: the Binary GCD algorithm was at one time as much as 60% more efficient than the standard Euclid Algorithm, but as late as 1998 Knuth concluded that there was only a 15% gain in efficiency on his contemporary computers.
Well, another 15 years have passed... how do these two algorithms stack up today with advances in hardware?
Does the Binary GCD continue to outperform the Euclidean Algorithm in low-level languages but languish behind due to its complexity in higher level languages like Java? Or is the difference moot in modern computing?
Why do I care you might ask? I just so happen to have to process like 100 billion of these today :) Here's a toast to living in an era of computing (poor Euclid).

The answer is of course "it depends". It depends on the hardware, the compiler, the specific implementation, and whatever else I forgot. On machines with slow division, binary GCD tends to outperform the Euclidean algorithm. I benchmarked it a couple of years ago on a Pentium 4 in C, Java and a few other languages. Overall in that benchmark, binary GCD with a 256-element lookup table beat the Euclidean algorithm by a factor of between 1.6 and nearly 3. The Euclidean algorithm came closer when, instead of dividing immediately, a few rounds of subtraction were performed first. I don't remember the figures, but binary was still considerably faster.
If the machine has fast division, things may be different, since the Euclidean algorithm needs fewer operations. If the cost difference between division and subtraction/shifts is small enough, binary will be slower. Which one is better in your circumstances you will have to find out by benchmarking yourself.
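For reference, here is a minimal sketch of both algorithms in Java, assuming non-negative inputs; the table-based binary variant mentioned above would additionally replace some of the shift loops with byte-wide lookups. This is an illustrative sketch, not the benchmarked code.

    // Binary GCD: only shifts, subtractions and comparisons.
    static long binaryGcd(long a, long b) {
        if (a == 0) return b;
        if (b == 0) return a;
        int shift = Long.numberOfTrailingZeros(a | b); // common factors of two
        a >>>= Long.numberOfTrailingZeros(a);          // make a odd
        do {
            b >>>= Long.numberOfTrailingZeros(b);      // make b odd
            if (a > b) { long t = a; a = b; b = t; }   // keep a <= b
            b -= a;                                    // difference of two odd numbers is even
        } while (b != 0);
        return a << shift;
    }

    // Euclidean GCD: one remainder (division) per iteration.
    static long euclidGcd(long a, long b) {
        while (b != 0) { long t = a % b; a = b; b = t; }
        return a;
    }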

Related

Which algorithm does Java use for multiplication?

The list of possible algorithms for multiplication is quite long:
Schoolbook long multiplication
Karatsuba algorithm
3-way Toom–Cook multiplication
k-way Toom–Cook multiplication
Mixed-level Toom–Cook
Schönhage–Strassen algorithm
Fürer's algorithm
Which one is used by Java by default and why? When does it switch to a "better performance" algorithm?
Well ... the * operator will use whatever the hardware provides. Java has no say in it.
But if you are talking about BigInteger.multiply(BigInteger), the answer depends on the Java version. For Java 11 it uses:
naive "long multiplication" for small numbers,
Karatsuba algorithm for medium-sized numbers, and
3-way Toom–Cook multiplication for large numbers.
The thresholds are: Karatsuba for numbers represented by 80 to 239 int values, and 3-way Toom-Cook for >= 240 int values. The smaller of the numbers being multiplied controls the algorithm selection.
Which one is used by Java by default and why?
Which ones? See above.
Why? Comments in the code imply that the thresholds were chosen empirically; i.e. someone did some systematic testing to determine which threshold values gave the best performance1.
You can find more details by reading the source code2.
1 - The current BigInteger implementation hasn't changed significantly since 2013, so it is possible that it doesn't incorporate more recent research results.
2 - Note that this link is to the latest version on Github.
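To see where a given pair of operands lands, here is a small sketch that classifies them by the thresholds quoted above (80 and 240 int values, as read from the OpenJDK source); the regime() helper is purely illustrative, not a real BigInteger method.

    import java.math.BigInteger;

    public class MultiplyRegime {
        // Which multiplication algorithm Java 11's BigInteger would pick,
        // based on the 80-int and 240-int thresholds mentioned above.
        static String regime(BigInteger x, BigInteger y) {
            int xlen = (x.bitLength() + 31) / 32;   // size in 32-bit words
            int ylen = (y.bitLength() + 31) / 32;
            int smaller = Math.min(xlen, ylen);     // the smaller operand decides
            if (smaller < 80)  return "schoolbook long multiplication, O(n^2)";
            if (smaller < 240) return "Karatsuba, ~O(n^1.58)";
            return "3-way Toom-Cook, ~O(n^1.46)";
        }

        public static void main(String[] args) {
            BigInteger small  = BigInteger.ONE.shiftLeft(1_000);   //  ~32 ints
            BigInteger medium = BigInteger.ONE.shiftLeft(5_000);   // ~157 ints
            BigInteger large  = BigInteger.ONE.shiftLeft(10_000);  // ~313 ints
            System.out.println(regime(small, small));
            System.out.println(regime(medium, medium));
            System.out.println(regime(large, large));
        }
    }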

Runtime of BigInteger operations? [duplicate]

What complexity are the methods multiply, divide and pow in BigInteger currently? There is no mention of the computational complexity in the documentation (nor anywhere else).
If you look at the code for BigInteger (provided with JDK), it appears to me that
multiply(..) is O(n^2) (the actual work is done in multiplyToLen(..)). The code for the other methods is a bit more complex, but you can see for yourself.
Note: this is for Java 6. I assume it won't differ in Java 7.
As noted in the comments on #Bozho's answer, Java 8 and onwards use more efficient algorithms to implement multiplication and division than the naive O(N^2) algorithms in Java 7 and earlier.
Java 8 multiplication adaptively uses either the naive O(N^2) long multiplication algorithm, the Karatsuba algorithm, or the 3-way Toom-Cook algorithm, depending on the sizes of the numbers being multiplied. The latter two are (respectively) O(N^1.58) and O(N^1.46).
Java 8 division adaptively uses either Knuth's O(N^2) long division algorithm or the Burnikel-Ziegler algorithm. (According to the research paper, the latter is 2K(N) + O(N log N) for dividing a 2N-digit number by an N-digit number, where K(N) is the Karatsuba multiplication time for two N-digit numbers.)
Likewise some other operations have been optimized.
There is no mention of the computational complexity in the documentation (nor anywhere else).
Some details of the complexity are mentioned in the Java 8 source code. The reason that the javadocs do not mention complexity is that it is implementation specific, both in theory and in practice. (As illustrated by the fact that the complexity of some operations is significantly different between Java 7 and 8.)
There is a new "better" BigInteger class that is not being used by the Sun JDK out of conservatism and a lack of useful regression tests (huge data sets). The person who wrote the better algorithms may have discussed the old BigInteger in the comments.
Here you go http://futureboy.us/temp/BigInteger.java
Measure it. Do operations with linearly increasing operands and draw the times on a diagram.
Don't forget to warm up the JVM (several runs) to get valid benchmark results.
Whether the operations are linear O(n), quadratic O(n^2), polynomial, or exponential should then be obvious.
EDIT: While you can give algorithms theoretical bounds, they may not be that useful in practice. First of all, the complexity does not give you the constant factor. Some linear or subquadratic algorithms are simply not useful because they eat so much time and resources that they are not adequate for the problem at hand (e.g. Coppersmith-Winograd matrix multiplication).
Then your computation may have all kinds of quirks you can only detect by experiment. There are preparatory algorithms which do nothing to solve the problem itself but speed up the real solver (matrix conditioning). There are suboptimal implementations. With longer lengths, your speed may drop dramatically (cache misses, memory movement, etc.). So for practical purposes, I advise experimentation.
The best thing is to double the length of the input each time and compare the times.
And yes, you can find out whether an algorithm has n^1.5 or n^1.8 complexity. Quadruple the input length and an n^1.5 algorithm needs only about half the time an n^2 algorithm would (4^1.5 = 8 versus 4^2 = 16). Telling n^1.8 apart from n^2 takes a bigger jump: you have to multiply the length by 32 before the times differ by a factor of two (32^0.2 = 2).
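As a concrete version of this doubling advice, here is a rough sketch that times BigInteger.multiply at doubling sizes and prints the log2 ratio of consecutive times; the iteration counts and warm-up are guesses, not tuned values, so use a proper harness such as JMH for real measurements.

    import java.math.BigInteger;
    import java.util.Random;

    public class DoublingBench {
        public static void main(String[] args) {
            Random rnd = new Random(1);
            double prev = -1;
            long sink = 0;                                   // defeat dead-code elimination
            for (int bits = 1 << 12; bits <= 1 << 18; bits <<= 1) {
                BigInteger a = new BigInteger(bits, rnd), b = new BigInteger(bits, rnd);
                for (int i = 0; i < 50; i++) sink += a.multiply(b).bitLength();  // crude JVM warm-up
                long t0 = System.nanoTime();
                for (int i = 0; i < 200; i++) sink += a.multiply(b).bitLength();
                double t = (System.nanoTime() - t0) / 200.0; // ns per multiply
                // log2(t(2n)/t(n)) near 2.0 suggests O(n^2); near 1.58 suggests Karatsuba-like growth.
                if (prev > 0)
                    System.out.printf("%7d bits: %10.0f ns, log2 ratio = %.2f%n",
                                      bits, t, Math.log(t / prev) / Math.log(2));
                else
                    System.out.printf("%7d bits: %10.0f ns%n", bits, t);
                prev = t;
            }
            System.out.println("(ignore) " + sink);
        }
    }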

Is a (common) CPU faster at computing low values than great ones? [duplicate]

This question already has answers here:
Why does Java's hashCode() in String use 31 as a multiplier?
Question is that simple:
Would combining two low values with a common basic operation like addition, division, modulo, or bit shift be computed faster than the same operation on greater values?
This would, as far as I can tell, require the CPU to keep track of the most significant bit (which I assume to be unlikely) but maybe there's something else in the business.
I am specifically asking because I often see rather low primes (e.g. 31) used in some of Java's basic classes' hashCode() methods (e.g. String and List), which is surprising since greater values would most likely cause more diffusion (which is generally a good thing for hash functions).
Arithmetic
I do not think there are many pipelined processors (which is to say almost all processors except the very smallest) where a simple arithmetic instruction's cost changes with the value of a register or memory operand. This would make the design of the pipeline more complex and may be counterproductive in practice.
I could imagine that a very complex instruction (at least a division) that may take many cycles compared to the pipeline length could show such behaviour, since it likely introduces wait states anyway. Agner Fog writes that this is true "on AMD processors, but not on Intel processors."
If an operation cannot be computed in one instruction, like a multiplication of numbers that are larger than the native integer width, the implementation may well contain a "fast path" for cases where, e.g., the upper half of both operands is zero. A common example would be 64-bit multiplications on 32-bit x86, as generated by MSVC. Some smaller processors do not have instructions for division, sometimes not even for multiplication. The assembly for computing these operations may well terminate earlier for smaller operands. This effect would be felt more acutely on smaller architectures.
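As a hedged illustration of such a fast path (not how any particular compiler runtime actually implements it), here is how a 64x64-to-128-bit unsigned multiply might be composed from 32-bit halves, with an early exit when both upper halves are zero:

    // Sketch: compose a 64x64 -> 128-bit unsigned multiply from 32-bit pieces.
    // Returns { high 64 bits, low 64 bits }. Real runtimes do this in assembly.
    static long[] mul64x64(long a, long b) {
        long aLo = a & 0xFFFFFFFFL, aHi = a >>> 32;
        long bLo = b & 0xFFFFFFFFL, bHi = b >>> 32;

        // Fast path: both operands fit in 32 bits, one hardware multiply suffices.
        if (aHi == 0 && bHi == 0) {
            return new long[] { 0L, aLo * bLo };
        }

        // General case: four partial products, schoolbook-style on 32-bit "digits".
        long lolo = aLo * bLo, lohi = aLo * bHi, hilo = aHi * bLo, hihi = aHi * bHi;
        long mid  = (lolo >>> 32) + (lohi & 0xFFFFFFFFL) + (hilo & 0xFFFFFFFFL);
        long low  = (mid << 32) | (lolo & 0xFFFFFFFFL);
        long high = hihi + (lohi >>> 32) + (hilo >>> 32) + (mid >>> 32);
        return new long[] { high, low };
    }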
Immediate Value Encodings
For immediate values (constants) this may be different. For example, there are RISC processors that allow encoding immediates of up to 16 bits in a load/add-immediate instruction; a 32-bit constant then requires either two instructions (load-upper-immediate plus add-immediate) or a load of the constant from program memory.
In CISC processors a large immediate value will likely take up more memory, which may reduce the number of instructions that can be fetched per cycle, or increase the number of cache misses.
In such cases a small constant may be cheaper than a large one.
I am not sure whether the encoding differences matter as much for Java, since most code will at least initially be distributed as Java bytecode, although a JIT-enabled JVM will translate the code to machine code and some library classes may have precompiled implementations. I do not know enough about Java bytecode to determine the consequences of constant size on it. From what I have read, it seems most constants are loaded via an index into the constant pool rather than encoded directly in the bytecode stream, so I would not expect a large difference here, if any.
Strength reduction optimizations
For very expensive operations (relative to the processor) compilers and programmers often employ tricks to replace a hard computation by a simpler one that is valid for a constant, like in the multiplication example mentioned where a multiplication is replaced by a shift and a subtraction/addition.
In the example given (multiply by 31 vs. multiply by 65,537), I would not expect a difference. For other numbers there will be a difference, but it will not correlate perfectly with the magnitude of the number. Divisions by constants are also commonly replaced by an arcane sequence of multiplications and shifts.
See for example how gcc translates a division by 13.
On x86 processors, some multiplications by small constants can be replaced by load-effective-address (LEA) instructions, but only for certain constants.
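As a hedged Java illustration of the multiply-by-31 case (whether the JIT actually performs this rewrite depends on the platform), the strength-reduced form relies on 31 * h == (h << 5) - h:

    // One step of a String.hashCode()-style accumulation, written both ways.
    static int hashStep(int h, char c) {
        // return 31 * h + c;          // the straightforward form
        return (h << 5) - h + c;       // strength-reduced: 31*h == (h<<5) - h, also true under int overflow
    }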
All in all I would expect this effect to depend very much on the processor architecture and the operations to be performed. Since Java is supposed to run almost everywhere, I think the library authors wanted their code to be efficient over a large range of processors, including small embedded processors, where operand size will play a larger role.

Estimating actual (not theoretic) runtime complexity of an implementation

Anyone in computer science will know that HeapSort is O(n log n) worst case in theory, while QuickSort is O(n^2) worst case. However, in practice, a well implemented QuickSort (with good heuristics) will outperform HeapSort on every single data set. On one hand, we barely observe the worst case, and on the other hand e.g. CPU cache lines, prefetching etc. make an enormous difference in many simple tasks. And while e.g. QuickSort can handle presorted data (with a good heuristic) in O(n), HeapSort will always reorganize the data in O(n log n), as it doesn't exploit the existing structure.
For my toy project caliper-analyze, I've recently been looking into methods for estimating the actual average complexity of algorithms from benchmarking results. In particular, I've tried Lawson and Hanson's NNLS fitting with different polynomials.
However, it doesn't work too well yet. Sometimes I get usable results, sometimes I don't. I figure that it may help to just do larger benchmarks, in particular try more parameters.
The following results are for sorting Double objects, in a SAW pattern with 10% randomness. n was only up to 500 for this run, so it is not very representative of actual use... The numbers are the estimated runtime dependency on the size. The output is hand-edited and manually sorted, so it does not reflect what the tool currently provides!
BubbleSortTextbook LINEAR: 67.59 NLOG2N: 1.89 QUADRATIC: 2.51
BubbleSort LINEAR: 54.84 QUADRATIC: 1.68
BidirectionalBubbleSort LINEAR: 52.20 QUADRATIC: 1.36
InsertionSort LINEAR: 17.13 NLOG2N: 2.97 QUADRATIC: 0.86
QuickSortTextbook NLOG2N: 18.15
QuickSortBo3 LINEAR: 59.74 QUADRATIC: 0.12
Java LINEAR: 6.81 NLOG2N: 12.33
DualPivotQuickSortBo5 NLOG2N: 11.28
QuickSortBo5 LINEAR: 3.35 NLOG2N: 9.67
You can tell that, while in this particular setting (often it does not work satisfactorily at all), the results largely agree with the known behavior: bubble sort is really costly, and a good heuristic on QuickSort is much better. However, QuickSort with the median-of-three heuristic, for example, ends up with an O(n + n^2) estimate, while the other QuickSorts are estimated as O(n + n log n).
So now to my actual questions:
Do you know algorithms/approaches/tools that perform runtime complexity analysis from benchmark data, in order to predict which implementation (as you can see above, I'm interested in comparing different implementations of the same algorithm!) performs best on real data?
Do you know scientific articles with respect to this (estimating average complexity of implementations)?
Do you know robust fitting methods that will help getting more accurate estimates here? E.g. a regularized version of NNLS.
Do you know rules-of-thumb of how many samples one needs to get a reasonable estimate? (in particular, when should the tool refrain from giving any estimate, as it will likely be inaccurate anyway?)
And let me emphasize once more that I'm not interested in theoretical complexity or formal analysis. I'm interested in seeing how implementations (of theoretically even identical algorithms) perform on benchmark data on real CPUs... the numerical factors for a common range are of key interest to me, more than the asymptotic behaviour. (And no, in the long run it is not just time complexity and sorting. I'm also interested in index structures and other parameters. And caliper can, if I'm not mistaken, also measure memory consumption.) Plus, I'm working in Java. An approach that just calls a MATLAB builtin is of no use to me, as I'm not living in the MATLAB world.
If I have time, I will try to re-run some of these benchmarks with a larger variety of sizes, so I get more data points. Maybe it will then just work... but I believe there are more robust regression methods that I could use to get better estimates even from smaller data sets. Plus, I'd like to detect when the sample is just too small to do any prediction at all!
If you want the actual complexity you are better off measuring it. Trying to guess how a program will perform without measuring is very unreliable.
The same program can perform very differently on a different machine. e.g. one algo might be faster on one machine but slower on another.
Your programs can be slower depending on what else the machine is doing, so an algorithm which looks good but makes heavy use of resources like caches can be slower, and can make other programs slower, when it has to share those resources.
Testing an algo on a machine by itself can be up to 2-5x faster than trying to use it in a real program.
Do you know rules-of-thumb of how many samples one needs to get a reasonable estimate? (in particular, when should the tool refrain from giving any estimate, as it will likely be inaccurate anyway?)
For determining a percentile like 90% or 99% you need on the order of 1/(1-p)^2 samples, i.e. for the 99th percentile you need at least 10,000 samples after warmup. For the 99.9th percentile you need one million.
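On the fitting question itself, one crude, hedged alternative to NNLS polynomial fitting is an ordinary least-squares fit of log(time) against log(size); the slope estimates the dominant exponent. It ignores lower-order terms such as an n log n component, so treat it only as a sanity check.

    // Estimate k in time ~ c * n^k from benchmark data via a log-log least-squares fit.
    static double estimateExponent(long[] sizes, double[] times) {
        int m = sizes.length;
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < m; i++) {
            double x = Math.log(sizes[i]);
            double y = Math.log(times[i]);
            sx += x; sy += y; sxx += x * x; sxy += x * y;
        }
        // Slope of the least-squares line through the points (log n, log t).
        return (m * sxy - sx * sy) / (m * sxx - sx * sx);
    }

Feeding it times measured at doubling sizes, a result near 1.0 suggests linear behaviour and near 2.0 quadratic; values in between hint at n log n or subquadratic growth, which is exactly where more data points and more robust regression are needed.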

Big O for a finite, fixed size set of possible values

This question triggered some confusion and many comments about whether the algorithms proposed in the various answers are O(1) or O(n).
Let's use a simple example to illustrate the two points of view:
we want to find a long x such that a * x + b = 0, where a and b are known, non-null longs.
An obvious O(1) algo is x = - b / a
A much slower algo would consist in testing every possible long value, which would be about 2^63 times slower on average.
Is the second algorithm O(1) or O(n)?
The arguments presented in the linked questions are:
it is O(n) because in the worst case you need to loop over all possible long values
it is O(1) because its complexity is of the form c x O(1), where c = 2^64 is a constant.
Although I understand the argument to say that it is O(1), it feels counter-intuitive.
ps: I added java as the original question is in Java, but this question is language-agnostic.
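To make the two algorithms concrete, here is a runnable sketch, shrunk to the int domain so the brute-force variant can actually finish; the method names are made up for illustration, and it assumes an exact integer solution exists.

    // The "O(1)" algorithm: one division.
    static int solveDirect(int a, int b) {
        return -b / a;
    }

    // The brute-force algorithm: scan the entire fixed domain of candidates.
    static int solveBruteForce(int a, int b) {
        for (long x = Integer.MIN_VALUE; x <= Integer.MAX_VALUE; x++) {
            if ((long) a * x + b == 0) return (int) x;   // exact check in 64-bit arithmetic
        }
        throw new IllegalArgumentException("no exact integer solution");
    }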
The complexity is only relevant if there is a variable N. So the question makes no sense as it stands. If the question were:
A much slower algo would consist in testing every possible value in a range of N values, which would be about N times slower on average.
Is the second algorithm O(1) or O(N)?
Then the answer would be: this algorithm is O(N).
Big O describes how an algorithm's performance will scale as the input size n scales. In other words as you run the algorithm across more input data.
In this case the input data is a fixed size so both algorithms are O(1) albeit with different constant factors.
If you took "n" to mean the number of bits in the numbers (i.e. you removed the restriction that it's a 64-bit long), then you could analyze how the algorithms scale for a given bit size n.
In this scenario, the first would still be O(1) (see Qnan's comment), but the second would now be O(2^n).
I highly recommend watching the early lectures from MIT's "Introduction to Algorithms" course. They are a great explanation of Big O (and Big Omega/Theta) although do assume a good grasp of maths.
Checking every possible input is O(2^N) in the number of bits in the solution. When you make the number of bits constant, both algorithms are O(1): you know in advance how many candidates you need to check.
Fact: Every algorithm that you actually run on your computer is O(1), because the universe has finite computational power (there are finitely many atoms and finitely many seconds have passed since the Big Bang).
This is true, but not a very useful way to think about things. When we use big-O in practice, we generally assume that the constants involved are small relative to the asymptotic terms, because otherwise giving only the asymptotic term doesn't tell you much about how the algorithm performs. This works great in practice because the constants are usually things like "do I use an array or a hash map", which is at most about a 30x difference, and the inputs are 10^6 or 10^9, so the difference between a quadratic and a linear algorithm is more important than constant factors. Discussions of big-O that don't respect this convention (like algorithm #2) are pointless.
Whatever the values of a and b are, the worst case is still to check 2^64 or 2^32 or 2^somevalue values. This algorithm's complexity is O(2^k), where k is the number of bits used to represent a long value, or O(1) if k is treated as a constant.
