How does Math.pow() handle fractional exponents (specifically nth roots) [duplicate] - java

I'm trying to determine the asymptotic run-time of one of my algorithms, which uses exponents, but I'm not sure of how exponents are calculated programmatically.
I'm specifically looking for the pow() algorithm used for double-precision, floating point numbers.

I've had a chance to look at fdlibm's implementation. The comments describe the algorithm used:
* n
* Method: Let x = 2 * (1+f)
* 1. Compute and return log2(x) in two pieces:
* log2(x) = w1 + w2,
* where w1 has 53-24 = 29 bit trailing zeros.
* 2. Perform y*log2(x) = n+y' by simulating muti-precision
* arithmetic, where |y'|<=0.5.
* 3. Return x**y = 2**n*exp(y'*log2)
followed by a listing of all the special cases handled (0, 1, inf, nan).
The most intense sections of the code, after all the special-case handling, involve the log2 and 2** calculations. And there are no loops in either of those. So, the complexity of floating-point primitives notwithstanding, it looks like a asymptotically constant-time algorithm.
Floating-point experts (of which I'm not one) are welcome to comment. :-)

Unless they've discovered a better way to do it, I believe that approximate values for trig, logarithmic and exponential functions (for exponential growth and decay, for example) are generally calculated using arithmetic rules and Taylor Series expansions to produce an approximate result accurate to within the requested precision. (See any Calculus book for details on power series, Taylor series, and Maclaurin series expansions of functions.) Please note that it's been a while since I did any of this so I couldn't tell you, for example, exactly how to calculate the number of terms in the series you need to include guarantee an error that small enough to be negligible in a double-precision calculation.
For example, the Taylor/Maclaurin series expansion for e^x is this:
+inf [ x^k ] x^2 x^3 x^4 x^5
e^x = SUM [ --- ] = 1 + x + --- + ----- + ------- + --------- + ....
k=0 [ k! ] 2*1 3*2*1 4*3*2*1 5*4*3*2*1
If you take all of the terms (k from 0 to infinity), this expansion is exact and complete (no error).
However, if you don't take all the terms going to infinity, but you stop after say 5 terms or 50 terms or whatever, you produce an approximate result that differs from the actual e^x function value by a remainder which is fairly easy to calculate.
The good news for exponentials is that it converges nicely and the terms of its polynomial expansion are fairly easy to code iteratively, so you might (repeat, MIGHT - remember, it's been a while) not even need to pre-calculate how many terms you need to guarantee your error is less than precision because you can test the size of the contribution at each iteration and stop when it becomes close enough to zero. In practice, I do not know if this strategy is viable or not - I'd have to try it. There are important details I have long since forgotten about. Stuff like: machine precision, machine error and rounding error, etc.
Also, please note that if you are not using e^x, but you are doing growth/decay with another base like 2^x or 10^x, the approximating polynomial function changes.

The usual approach, to raise a to the b, for an integer exponent, goes something like this:
result = 1
while b > 0
if b is odd
result *= a
b -= 1
b /= 2
a = a * a
It is generally logarithmic in the size of the exponent. The algorithm is based on the invariant "a^b*result = a0^b0", where a0 and b0 are the initial values of a and b.
For negative or non-integer exponents, logarithms and approximations and numerical analysis are needed. The running time will depend on the algorithm used and what precision the library is tuned for.
Edit: Since there seems to be some interest, here's a version without the extra multiplication.
result = 1
while b > 0
while b is even
a = a * a
b = b / 2
result = result * a
b = b - 1

You can use exp(n*ln(x)) for calculating xn. Both x and n can be double-precision, floating point numbers. Natural logarithm and exponential function can be calculated using Taylor series. Here you can find formulas: http://en.wikipedia.org/wiki/Taylor_series

If I were writing a pow function targeting Intel, I would return exp2(log2(x) * y). Intel's microcode for log2 is surely faster than anything I'd be able to code, even if I could remember my first year calculus and grad school numerical analysis.

e^x = (1 + fraction) * (2^exponent), 1 <= 1 + fraction < 2
x * log2(e) = log2(1 + fraction) + exponent, 0 <= log2(1 + fraction) < 1
exponent = floor(x * log2(e))
1 + fraction = 2^(x * log2(e) - exponent) = e^((x * log2(e) - exponent) * ln2) = e^(x - exponent * ln2), 0 <= x - exponent * ln2 < ln2

Related

How to implement Karatsuba multiplication using bit manipulation

I'm implementing Karatsuba multiplication in Scala (my choice) for an online course. Considering the algorithm is meant to multiply large numbers, I chose the BigInt type which is backed by Java BigInteger. I'd like to implement the algorithm efficiently, which using base 10 arithmetic is copied below from Wikipedia:
procedure karatsuba(num1, num2)
if (num1 < 10) or (num2 < 10)
return num1*num2
/* calculates the size of the numbers */
m = max(size_base10(num1), size_base10(num2))
m2 = floor(m/2)
/* split the digit sequences in the middle */
high1, low1 = split_at(num1, m2)
high2, low2 = split_at(num2, m2)
/* 3 calls made to numbers approximately half the size */
z0 = karatsuba(low1, low2)
z1 = karatsuba((low1 + high1), (low2 + high2))
z2 = karatsuba(high1, high2)
return (z2 * 10 ^ (m2 * 2)) + ((z1 - z2 - z0) * 10 ^ m2) + z0
Given that BigInteger is internally represented as an int[], if I can calculate m2 in terms of the int[], I can use bit shifting to extract the lower and higher halves of the number. Similarly, the last step can be achieved by bit shifting too.
However, it's easier said than done, as I can't seem to wrap my head around the logic. For example, if the max number is 999, the binary representation is 1111100111, lower half is 99 = 1100011, upper half is 9 = 1001. How do I get the above split?
Note:
There is an existing question that shows how to implement using arithmetic on BigInteger, but not bit shifting. Hence, my question is not a duplicate.
To be able to use bit shifting to do the splits and recombination, the base needs to be a power of two. Using two itself, as in the linked answer, is probably reasonable. Then the "length" of the inputs can be found directly with bitLength, and the split could be implemented as:
// x = a + 2^N b
BigInteger b = x.shiftRight(N);
BigInteger a = x.subtract(b.shiftLeft(N));
Where N is the size that a will have in bits.
Given that BigInteger is implemented with 32bit limbs, it makes sense to use 2³² as the base, ensuring that the big shifts involve only the movement of whole integers, and not also the slower code path where the BigInteger is shifted by a value between 1 and 31. This could be accomplished by rounding N to a multiple of 32.
The specific constant in this line,
if (N <= 2000) return x.multiply(y); // optimize this parameter
Should probably not be trusted too much, given that comment. For performance there should be some bound though, otherwise the recursive splitting goes too deeply. For example, when the size of the numbers is 32 or less, it's clearly better to just multiply, but probably a good cut-off is much higher. In this source of BigInteger itself, the cutoff is expressed in terms of the number of limbs instead of bits, and set to 80 (so 2560 bits) - it also has an other threshold above which it switches to 3-way Toom-Cook multiplication instead of Karatsuba multiplication.

Java Estimating Double's Representation Error

Every now and then I see some rounding errors which are caused by floor'ing some values as shown in the two examples below.
// floor(number, precision)
double balance = floor(0.7/0.1, 3) // = 6.999
double balance = floor(0.7*0.1, 3) // = 0.069
The problem of course is 0.7/0.1 and 0.7*0.1 are not exactly the number it should be due to representation errors [check the NOTE below].
One solution could be to add an epsilon so any representation error is mitigated just before applying the floor.
double balance = floor(0.7/0.1 + 1e-10, 3) // = 7.0
double balance = floor(0.7*0.1 + 1e-10, 3) // = 0.07
What epsilon should I use so it's guaranteed to work in all cases? I feel this solution is rather hacky unless I have a good strategy for choosing the correct epsilon which probably depends on the numbers I'm dealing with.
For instance, if there was a way of getting an estimation of the error (as in representation - number) or at least the sign of it (whether representation > number or not), that would be helpful to determine in what direction I should correct the result before applying the floor.
Any other workaround you can think of is very welcome.
NOTE: I know the real problem is I'm using doubles and it has representation errors. Please refrain from saying anything like I should store the balance in a long ((long) Math.floor(3931809L/0.080241D) is equally erratic). I also tried using BigDecimal but the performance degraded quite a lot (it's a realtime application). Also, note I'm not very concerned about propagating small errors over time, I do a lot of calculations like those above but I start from a fresh balance number every time (I do maybe 3 of those operations before returning and starting over).
EDIT: To make that clear, that's the only operation I do, and I repeat it 3 times on the same balance. For example, I take a balance in USD and I convert it to RUB, then to JPY then to EUR, and I return the balance and start over from the beginning (with a fresh balance number, ie no rounding error is propagated other than on these 3 operations). The values are not constrained apart from them being positive numbers (ie, in the range [0, +inf)) and the precision is always below 8 (8 decimal digits, ie 0.00000001 is the smallest balance I will ever have to deal with).
What epsilon should I use so it's guaranteed to work in all cases?
There is no epsilon that is guaranteed to work in all cases1. Period.
If you analyze (mathematically2) the computations that your application is performing, then it may be possible to figure out a value of epsilon that will work for you.
But note that there are dangers in repeatedly "rounding off" the errors in a multi-step computation. You can end up with the wrong answer in the end. That's what the maths says.
Finally, ask yourself this: If it is legitimate / safe to just make the epsilon based adjustments, why (after 50 something years) do typical hand-held calculators still insist that 1.0 / 3 * 3 is 0.9999999999....
1 - Lets be clear. You have not attempted to specify what your "cases" are. So I am assuming that you mean all possible computations.
2 - The analysis is complicated by the fact that the epsilon between a Real number and the corresponding floating binary representation (e.g. a "double") depends on the binary magnitude of the number.
A binary floating point number (double precision) has 53 bits of precision or approximately 15.95 decimal digits.
Let r be a Real and d be the double that is closest to r
epsilon(r) = |r - d| is in the range 0 to r * 2floor(log2(r)) -53
and, if you have to deal with numbers in the range 0 to N, then the maximum epsilon value across the range will be approximately:
N * 2floor(log2(N)) - 53
When you perform a computation you need to estimate the cumulative error for all of the steps in the computation. Addition, multiplication and division are relatively "safe". For example with multiplication:
Let r1 = d1 + e1 and r2 = d2 + e2
r1 * r2 = (d1 + e1) * (d2 + e2) = d1 * d2 + d2 * e1 + d1 * e2 + e1 * e2
Unless the epsilon values are already large, the e1 * e2 term vanishes relative to the others, and the epsilon is going to be ≤ 2 * max(|d1|, |d2|) * max(|e1|, |e2|).
(I think. It is a long, long time since I last did this stuff in anger.)
However, subtraction has nasty properties; see Theorem 9 of the Goldberg paper!
The floor function you are using is also a bit tricky.
Let r = d + e
epsilon(floor(r)) = |floor(r, 3) - floor(d, 3)|
which is ≤ max(|ceiling(e, 3)|, 10-3)

Check division by 3 with binary operations?

I've read this interesting answer about "Checking if a number is divisible by 3"
Although the answer is in Java , it seems to work with other languages also.
Obviously we can do :
boolean canBeDevidedBy3 = (i % 3) == 0;
But the interesting part was this other calculation :
boolean canBeDevidedBy3 = ((int) (i * 0x55555556L >> 30) & 3) == 0;
For simplicity :
0x55555556L = "1010101010101010101010101010110"
Nb
There's also another method to check it :
One can determine if an integer is divisible by 3 by counting the 1
bits at odd bit positions, multiply this number by 2, add the number
of 1-bits at even bit positions add them to the result and check if
the result is divisible by 3
For example :
9310 ( is divisible by 3)
010111012
It has 2 bits in the odd places and 4 bits at the even places ( place is the zero based of the base 2 digit location)
So 2*1 + 4 = 6 which is divisible by 3.
At first I thought those 2 methods are related but I didn't find how.
Question
How does
boolean canBeDevidedBy3 = ((int) (i * 0x55555556L >> 30) & 3) == 0;
— actually determines if i%3==0 ?
Whenever you add 3 to a number, what you do is to add binary 11. Whatever the original value of the number, this will maintain the invariant that twice the number of 1 bits at odd positions, plus the number of 1 bits at even positions, will also be divisible by 3.
You can see that in this way. Let's call the value of the above expression c. You're adding 1 to an odd position, and 1 to an even position. When you add 1 to an even position, either the bit you've added 1 to was set or unset. If it was unset, you'll increase the value of c by 1, because you've added a new 1 in an odd position. If it was previously set, you'll flip that bit, but add a 1 in an even position (from the carry). This means that you initially decrease c by 1, but now when you add the 1 in the even position, you increase c by 2, so overall you've increased c by 2.
Of course, this carry bit might also get added to a bit that's already set, in which case we need to check that this part still increases c by 2: you'll remove a 1 in an even position (decreasing c by 2), and then add a 1 in an odd position (increasing c by 1), meaning that we've in fact decreased c by 1. That is the same as increasing c by 2, though, if we're working modulo 3.
A more formal version of this would be structured as a proof by induction.
The two methods do not appear to be related. The bit-wise method seems to be related to certain methods for the efficient computation of modulo b-1 when using digit base b, known in decimal arithmetic as "casting out nines".
The multiplication-based method is directly based on the definition of division when accomplished by multiplication with the reciprocal. Letting / denote mathematical division, we have
int_quot = (int)(i / 3)
frac_quot = i / 3 - int_quot = i / 3 - (int)(i / 3)
i % 3 = 3 * frac_quot = 3 * (i / 3 - (int)(i / 3))
The fractional portion of the mathematical quotient translates directly into the remainder of integer division: If the fraction is 0, the remainder is 0, if the fraction is 1/3 the remainder is 1, if the fraction is 2/3 the remainder is 2. This means we only need to examine the fractional portion of the quotient.
Instead of dividing by 3, we can multiply by 1/3. If we perform the computation in a 32.32 fixed-point format, 1/3 corresponds to 232*1/3 which is a number between 0x55555555 and 0x55555556. For reasons that will become apparent shortly, we use the overestimation here, that is the rounded-up result 0x555555556.
When we multiply 0x55555556 by i, the most significant 32 bits of the full 64-bit product will contain the integral portion of the quotient (int)(i * 1/3) = (int)(i / 3). We are not interested in this integral portion, so we neither compute nor store it. The lower 32-bits of the product is one of the fractions 0/3, 1/3, 2/3 however computed with a slight error since our value of 0x555555556 is slightly larger than 1/3:
i = 1: i * 0.55555556 = 0.555555556
i = 2: i * 0.55555556 = 0.AAAAAAAAC
i = 3: i * 0.55555556 = 1.000000002
i = 4: i * 0.55555556 = 1.555555558
i = 5: i * 0.55555556 = 1.AAAAAAAAE
If we examine the most significant bits of the three possible fraction values in binary, we find that 0x5 = 0101, 0xA = 1010, 0x0 = 0000. So the two most significant bits of the fractional portion of the quotient correspond exactly to the desired modulo values. Since we are dealing with 32-bit operands, we can extract these two bits with a right shift by 30 bits followed by a mask of 0x3 to isolate two bits. I think the masking is needed in Java as 32-bit integers are always signed. For uint32_t operands in C/C++ the shift alone would suffice.
We now see why choosing 0x55555555 as representation of 1/3 wouldn't work. The fractional portion of the quotient would turn into 0xFFFFFFF*, and since 0xF = 1111 in binary, the modulo computation would deliver an incorrect result of 3.
Note that as i increases in magnitude, the accumulated error from the imprecise representation of 1/3 affects more and more bits of the fractional portion. In fact, exhaustive testing shows that the method only works for i < 0x60000000: beyond that limit the error overwhelms the most significant fraction bits which represent our result.

When using doubles, why isn't (x / (y * z)) the same as (x / y / z)? [duplicate]

This question already has answers here:
How to avoid floating point precision errors with floats or doubles in Java?
(12 answers)
Double calculation producing odd result [duplicate]
(3 answers)
Closed 7 years ago.
This is partly academic, as for my purposes I only need it rounded to two decimal places; but I am keen to know what is going on to produce two slightly different results.
This is the test that I wrote to narrow it to the simplest implementation:
#Test
public void shouldEqual() {
double expected = 450.00d / (7d * 60); // 1.0714285714285714
double actual = 450.00d / 7d / 60; // 1.0714285714285716
assertThat(actual).isEqualTo(expected);
}
But it fails with this output:
org.junit.ComparisonFailure:
Expected :1.0714285714285714
Actual :1.0714285714285716
Can anyone explain in detail what is going on under the hood to result in the value at 1.000000000000000X being different?
Some of the points I'm looking for in an answer are:
Where is the precision lost?
Which method is preferred, and why?
Which is actually correct? (In pure maths, both can't be right. Perhaps both are wrong?)
Is there a better solution or method for these arithmetic operations?
I see a bunch of questions that tell you how to work around this problem, but not one that really explains what's going on, other than "floating-point roundoff error is bad, m'kay?" So let me take a shot at it. Let me first point out that nothing in this answer is specific to Java. Roundoff error is a problem inherent to any fixed-precision representation of numbers, so you get the same issues in, say, C.
Roundoff error in a decimal data type
As a simplified example, imagine we have some sort of computer that natively uses an unsigned decimal data type, let's call it float6d. The length of the data type is 6 digits: 4 dedicated to the mantissa, and 2 dedicated to the exponent. For example, the number 3.142 can be expressed as
3.142 x 10^0
which would be stored in 6 digits as
503142
The first two digits are the exponent plus 50, and the last four are the mantissa. This data type can represent any number from 0.001 x 10^-50 to 9.999 x 10^+49.
Actually, that's not true. It can't store any number. What if you want to represent 3.141592? Or 3.1412034? Or 3.141488906? Tough luck, the data type can't store more than four digits of precision, so the compiler has to round anything with more digits to fit into the constraints of the data type. If you write
float6d x = 3.141592;
float6d y = 3.1412034;
float6d z = 3.141488906;
then the compiler converts each of these three values to the same internal representation, 3.142 x 10^0 (which, remember, is stored as 503142), so that x == y == z will hold true.
The point is that there is a whole range of real numbers which all map to the same underlying sequence of digits (or bits, in a real computer). Specifically, any x satisfying 3.1415 <= x <= 3.1425 (assuming half-even rounding) gets converted to the representation 503142 for storage in memory.
This rounding happens every time your program stores a floating-point value in memory. The first time it happens is when you write a constant in your source code, as I did above with x, y, and z. It happens again whenever you do an arithmetic operation that increases the number of digits of precision beyond what the data type can represent. Either of these effects is called roundoff error. There are a few different ways this can happen:
Addition and subtraction: if one of the values you're adding has a different exponent from the other, you will wind up with extra digits of precision, and if there are enough of them, the least significant ones will need to be dropped. For example, 2.718 and 121.0 are both values that can be exactly represented in the float6d data type. But if you try to add them together:
1.210 x 10^2
+ 0.02718 x 10^2
-------------------
1.23718 x 10^2
which gets rounded off to 1.237 x 10^2, or 123.7, dropping two digits of precision.
Multiplication: the number of digits in the result is approximately the sum of the number of digits in the two operands. This will produce some amount of roundoff error, if your operands already have many significant digits. For example, 121 x 2.718 gives you
1.210 x 10^2
x 0.02718 x 10^2
-------------------
3.28878 x 10^2
which gets rounded off to 3.289 x 10^2, or 328.9, again dropping two digits of precision.
However, it's useful to keep in mind that, if your operands are "nice" numbers, without many significant digits, the floating-point format can probably represent the result exactly, so you don't have to deal with roundoff error. For example, 2.3 x 140 gives
1.40 x 10^2
x 0.23 x 10^2
-------------------
3.22 x 10^2
which has no roundoff problems.
Division: this is where things get messy. Division will pretty much always result in some amount of roundoff error unless the number you're dividing by happens to be a power of the base (in which case the division is just a digit shift, or bit shift in binary). As an example, take two very simple numbers, 3 and 7, divide them, and you get
3. x 10^0
/ 7. x 10^0
----------------------------
0.428571428571... x 10^0
The closest value to this number which can be represented as a float6d is 4.286 x 10^-1, or 0.4286, which distinctly differs from the exact result.
As we'll see in the next section, the error introduced by rounding grows with each operation you do. So if you're working with "nice" numbers, as in your example, it's generally best to do the division operations as late as possible because those are the operations most likely to introduce roundoff error into your program where none existed before.
Analysis of roundoff error
In general, if you can't assume your numbers are "nice", roundoff error can be either positive or negative, and it's very difficult to predict which direction it will go just based on the operation. It depends on the specific values involved. Look at this plot of the roundoff error for 2.718 z as a function of z (still using the float6d data type):
In practice, when you're working with values that use the full precision of your data type, it's often easier to treat roundoff error as a random error. Looking at the plot, you might be able to guess that the magnitude of the error depends on the order of magnitude of the result of the operation. In this particular case, when z is of the order 10-1, 2.718 z is also on the order of 10-1, so it will be a number of the form 0.XXXX. The maximum roundoff error is then half of the last digit of precision; in this case, by "the last digit of precision" I mean 0.0001, so the roundoff error varies between -0.00005 and +0.00005. At the point where 2.718 z jumps up to the next order of magnitude, which is 1/2.718 = 0.3679, you can see that the roundoff error also jumps up by an order of magnitude.
You can use well-known techniques of error analysis to analyze how a random (or unpredictable) error of a certain magnitude affects your result. Specifically, for multiplication or division, the "average" relative error in your result can be approximated by adding the relative error in each of the operands in quadrature - that is, square them, add them, and take the square root. With our float6d data type, the relative error varies between 0.0005 (for a value like 0.101) and 0.00005 (for a value like 0.995).
Let's take 0.0001 as a rough average for the relative error in values x and y. The relative error in x * y or x / y is then given by
sqrt(0.0001^2 + 0.0001^2) = 0.0001414
which is a factor of sqrt(2) larger than the relative error in each of the individual values.
When it comes to combining operations, you can apply this formula multiple times, once for each floating-point operation. So for instance, for z / (x * y), the relative error in x * y is, on average, 0.0001414 (in this decimal example) and then the relative error in z / (x * y) is
sqrt(0.0001^2 + 0.0001414^2) = 0.0001732
Notice that the average relative error grows with each operation, specifically as the square root of the number of multiplications and divisions you do.
Similarly, for z / x * y, the average relative error in z / x is 0.0001414, and the relative error in z / x * y is
sqrt(0.0001414^2 + 0.0001^2) = 0.0001732
So, the same, in this case. This means that for arbitrary values, on average, the two expressions introduce approximately the same error. (In theory, that is. I've seen these operations behave very differently in practice, but that's another story.)
Gory details
You might be curious about the specific calculation you presented in the question, not just an average. For that analysis, let's switch to the real world of binary arithmetic. Floating-point numbers in most systems and languages are represented using IEEE standard 754. For 64-bit numbers, the format specifies 52 bits dedicated to the mantissa, 11 to the exponent, and one to the sign. In other words, when written in base 2, a floating point number is a value of the form
1.1100000000000000000000000000000000000000000000000000 x 2^00000000010
52 bits 11 bits
The leading 1 is not explicitly stored, and constitutes a 53rd bit. Also, you should note that the 11 bits stored to represent the exponent are actually the real exponent plus 1023. For example, this particular value is 7, which is 1.75 x 22. The mantissa is 1.75 in binary, or 1.11, and the exponent is 1023 + 2 = 1025 in binary, or 10000000001, so the content stored in memory is
01000000000111100000000000000000000000000000000000000000000000000
^ ^
exponent mantissa
but that doesn't really matter.
Your example also involves 450,
1.1100001000000000000000000000000000000000000000000000 x 2^00000001000
and 60,
1.1110000000000000000000000000000000000000000000000000 x 2^00000000101
You can play around with these values using this converter or any of many others on the internet.
When you compute the first expression, 450/(7*60), the processor first does the multiplication, obtaining 420, or
1.1010010000000000000000000000000000000000000000000000 x 2^00000001000
Then it divides 450 by 420. This produces 15/14, which is
1.0001001001001001001001001001001001001001001001001001001001001001001001...
in binary. Now, the Java language specification says that
Inexact results must be rounded to the representable value nearest to the infinitely precise result; if the two nearest representable values are equally near, the one with its least significant bit zero is chosen. This is the IEEE 754 standard's default rounding mode known as round to nearest.
and the nearest representable value to 15/14 in 64-bit IEEE 754 format is
1.0001001001001001001001001001001001001001001001001001 x 2^00000000000
which is approximately 1.0714285714285714 in decimal. (More precisely, this is the least precise decimal value that uniquely specifies this particular binary representation.)
On the other hand, if you compute 450 / 7 first, the result is 64.2857142857..., or in binary,
1000000.01001001001001001001001001001001001001001001001001001001001001001...
for which the nearest representable value is
1.0000000100100100100100100100100100100100100100100101 x 2^00000000110
which is 64.28571428571429180465... Note the change in the last digit of the binary mantissa (compared to the exact value) due to roundoff error. Dividing this by 60 gives you
1.000100100100100100100100100100100100100100100100100110011001100110011...
Look at the end: the pattern is different! It's 0011 that repeats, instead of 001 as in the other case. The closest representable value is
1.0001001001001001001001001001001001001001001001001010 x 2^00000000000
which differs from the other order of operations in the last two bits: they're 10 instead of 01. The decimal equivalent is 1.0714285714285716.
The specific rounding that causes this difference should be clear if you look at the exact binary values:
1.0001001001001001001001001001001001001001001001001001001001001001001001...
1.0001001001001001001001001001001001001001001001001001100110011001100110...
^ last bit of mantissa
It works out in this case that the former result, numerically 15/14, happens to be the most accurate representation of the exact value. This is an example of how leaving division until the end benefits you. But again, this rule only holds as long as the values you're working with don't use the full precision of the data type. Once you start working with inexact (rounded) values, you no longer protect yourself from further roundoff errors by doing the multiplications first.
It has to do with how the double type is implemented and the fact that the floating-point types don't make the same precision guarantees as other simpler numerical types. Although the following answer is more specifically about sums, it also answers your question by explaining how there is no guarantee of infinite precision in floating-point mathematical operations: Why does changing the sum order returns a different result?. Essentially you should never attempt to determine the equality of floating-point values without specifying an acceptable margin of error. Google's Guava library includes DoubleMath.fuzzyEquals(double, double, double) to determine the equality of two double values within a certain precision. If you wish to read up on the specifics of floating-point equality this site is quite useful; the same site also explains floating-point rounding errors. In summation: the expected and actual values of your calculation differ because of the rounding differing between the calculations due to the order of operations.
Let's simplify things a bit. What you want to know is why 450d / 420 and 450d / 7 / 60 (specifically) give different results.
Let's see how division is performed in IEE double-precision floating point format. Without going deep into implementation details, it's basically XOR-ing the sign bit, subtracting the exponent of the divisor from the exponent of the dividend, dividing the mantissas, and normalizing the result.
First, we should represent our numbers in the proper format for double:
450 is 0 10000000111 1100001000000000000000000000000000000000000000000000
420 is 0 10000000111 1010010000000000000000000000000000000000000000000000
7 is 0 10000000001 1100000000000000000000000000000000000000000000000000
60 is 0 10000000100 1110000000000000000000000000000000000000000000000000
Let's first divide 450 by 420
First comes the sign bit, it's 0 (0 xor 0 == 0).
Then comes the exponent. 10000000111b - 10000000111b + 1023 == 10000000111b - 10000000111b + 01111111111b == 01111111111b
Looking good, now the mantissa:
1.1100001000000000000000000000000000000000000000000000 / 1.1010010000000000000000000000000000000000000000000000 == 1.1100001 / 1.101001. There are a couple of different ways to do this, I'll talk a bit about them later. The result is 1.0(001) (you can verify it here).
Now we should normalize the result. Let's see the guard, round and sticky bit values:
0001001001001001001001001001001001001001001001001001 0 0 1
Guard bit's 0, we don't do any rounding. The result is, in binary:
0 01111111111 0001001001001001001001001001001001001001001001001001
Which gets represented as 1.0714285714285714 in decimal.
Now let's divide 450 by 7 by analogy.
Sign bit = 0
Exponent = 10000000111b - 10000000001b + 01111111111b == -01111111001b + 01111111111b + 01111111111b == 10000000101b
Mantissa = 1.1100001 / 1.11 == 1.00000(001)
Rounding:
0000000100100100100100100100100100100100100100100100 1 0 0
Guard bit is set, round and sticky bits are not. We are rounding to-nearest (default mode for IEEE), and we're stuck right between the two possible values which we could round to. As the lsb is 0, we add 1. This gives us the rounded mantissa:
0000000100100100100100100100100100100100100100100101
The result is
0 10000000101 0000000100100100100100100100100100100100100100100101
Which gets represented as 64.28571428571429 in decimal.
Now we will have to divide it by 60... But you already know that we have lost some precision. Dividing 450 by 420 didn't require rounding at all, but here, we already had to round the result at least once. But, for completeness's sake, let's finish the job:
Dividing 64.28571428571429 by 60
Sign bit = 0
Exponent = 10000000101b - 10000000100b + 01111111111b == 01111111110b
Mantissa = 1.0000000100100100100100100100100100100100100100100101 / 1.111 == 0.10001001001001001001001001001001001001001001001001001100110011
Round and shift:
0.1000100100100100100100100100100100100100100100100100 1 1 0 0
1.0001001001001001001001001001001001001001001001001001 1 0 0
Rounding just as in the previous case, we get the mantissa: 0001001001001001001001001001001001001001001001001010.
As we shifted by 1, we add that to the exponent, getting
Exponent = 01111111111b
So, the result is:
0 01111111111 0001001001001001001001001001001001001001001001001010
Which gets represented as 1.0714285714285716 in decimal.
Tl;dr:
The first division gave us:
0 01111111111 0001001001001001001001001001001001001001001001001001
And the last division gave us:
0 01111111111 0001001001001001001001001001001001001001001001001010
The difference is in the last 2 bits only, but we could have lost more - after all, to get the second result, we had to round two times instead of none!
Now, about mantissa division. Floating-point division is implemented in two major ways.
The way mandated by the IEEE long division (here are some good examples; it's basically the regular long division, but with binary instead of decimal), and it's pretty slow. That is what your computer did.
There is also a faster but less accrate option, multiplication by inverse. First, a reciprocal of the divisor is found, and then multiplication is performed.
That's because double division often lead to a loss of precision. Said loss can vary depending on the order of the divisions.
When you divide by 7d, you already lost some precision with the actual result. Then only you divide an erroneous result by 60.
When you divide by 7d * 60, you only have to use division once, thus losing precision only once.
Note that double multiplication can sometimes fail too, but that's much less common.
Certainly the order of the operations mixed with the fact that doubles aren't precise :
450.00d / (7d * 60) --> a = 7d * 60 --> result = 450.00d / a
vs
450.00d / 7d / 60 --> a = 450.00d /7d --> result = a / 60

Why does this random value have a 25/75 distribution instead of 50/50?

Edit: So basically what I'm trying to write is a 1 bit hash for double.
I want to map a double to true or false with a 50/50 chance. For that I wrote code that picks some random numbers (just as an example, I want to use this on data with regularities and still get a 50/50 result), checks their last bit and increments y if it is 1, or n if it is 0.
However, this code constantly results in 25% y and 75% n. Why is it not 50/50? And why such a weird, but straight-forward (1/3) distribution?
public class DoubleToBoolean {
#Test
public void test() {
int y = 0;
int n = 0;
Random r = new Random();
for (int i = 0; i < 1000000; i++) {
double randomValue = r.nextDouble();
long lastBit = Double.doubleToLongBits(randomValue) & 1;
if (lastBit == 1) {
y++;
} else {
n++;
}
}
System.out.println(y + " " + n);
}
}
Example output:
250167 749833
Because nextDouble works like this: (source)
public double nextDouble()
{
return (((long) next(26) << 27) + next(27)) / (double) (1L << 53);
}
next(x) makes x random bits.
Now why does this matter? Because about half the numbers generated by the first part (before the division) are less than 1L << 52, and therefore their significand doesn't entirely fill the 53 bits that it could fill, meaning the least significant bit of the significand is always zero for those.
Because of the amount of attention this is receiving, here's some extra explanation of what a double in Java (and many other languages) really looks like and why it mattered in this question.
Basically, a double looks like this: (source)
A very important detail not visible in this picture is that numbers are "normalized"1 such that the 53 bit fraction starts with a 1 (by choosing the exponent such that it is so), that 1 is then omitted. That is why the picture shows 52 bits for the fraction (significand) but there are effectively 53 bits in it.
The normalization means that if in the code for nextDouble the 53rd bit is set, that bit is the implicit leading 1 and it goes away, and the other 52 bits are copied literally to the significand of the resulting double. If that bit is not set however, the remaining bits must be shifted left until it becomes set.
On average, half the generated numbers fall into the case where the significand was not shifted left at all (and about half those have a 0 as their least significant bit), and the other half is shifted by at least 1 (or is just completely zero) so their least significant bit is always 0.
1: not always, clearly it cannot be done for zero, which has no highest 1. These numbers are called denormal or subnormal numbers, see wikipedia:denormal number.
From the docs:
The method nextDouble is implemented by class Random as if by:
public double nextDouble() {
return (((long)next(26) << 27) + next(27))
/ (double)(1L << 53);
}
But it also states the following (emphasis mine):
[In early versions of Java, the result was incorrectly calculated as:
return (((long)next(27) << 27) + next(27))
/ (double)(1L << 54);
This might seem to be equivalent, if not better, but in fact it introduced a large nonuniformity because of the bias in the rounding of floating-point numbers: it was three times as likely that the low-order bit of the significand would be 0 than that it would be 1! This nonuniformity probably doesn't matter much in practice, but we strive for perfection.]
This note has been there since Java 5 at least (docs for Java <= 1.4 are behind a loginwall, too lazy to check). This is interesting, because the problem apparently still exists even in Java 8. Perhaps the "fixed" version was never tested?
This result doesn't surprise me given how floating-point numbers are represented. Let's suppose we had a very short floating-point type with only 4 bits of precision. If we were to generate a random number between 0 and 1, distributed uniformly, there would be 16 possible values:
0.0000
0.0001
0.0010
0.0011
0.0100
...
0.1110
0.1111
If that's how they looked in the machine, you could test the low-order bit to get a 50/50 distribution. However, IEEE floats are represented as a power of 2 times a mantissa; one field in the float is the power of 2 (plus a fixed offset). The power of 2 is selected so that the "mantissa" part is always a number >= 1.0 and < 2.0. This means that, in effect, the numbers other than 0.0000 would be represented like this:
0.0001 = 2^(-4) x 1.000
0.0010 = 2^(-3) x 1.000
0.0011 = 2^(-3) x 1.100
0.0100 = 2^(-2) x 1.000
...
0.0111 = 2^(-2) x 1.110
0.1000 = 2^(-1) x 1.000
0.1001 = 2^(-1) x 1.001
...
0.1110 = 2^(-1) x 1.110
0.1111 = 2^(-1) x 1.111
(The 1 before the binary point is an implied value; for 32- and 64-bit floats, no bit is actually allocated to hold this 1.)
But looking at the above should demonstrate why, if you convert the representation to bits and look at the low bit, you will get zero 75% of the time. This is due to all values less than 0.5 (binary 0.1000), which is half the possible values, having their mantissas shifted over, causing 0 to appear in the low bit. The situation is essentially the same when the mantissa has 52 bits (not including the implied 1) as a double does.
(Actually, as #sneftel suggested in a comment, we could include more than 16 possible values in the distribution, by generating:
0.0001000 with probability 1/128
0.0001001 with probability 1/128
...
0.0001111 with probability 1/128
0.001000 with probability 1/64
0.001001 with probability 1/64
...
0.01111 with probability 1/32
0.1000 with probability 1/16
0.1001 with probability 1/16
...
0.1110 with probability 1/16
0.1111 with probability 1/16
But I'm not sure it's the kind of distribution most programmers would expect, so it probably isn't worthwhile. Plus it doesn't gain you much when the values are used to generate integers, as random floating-point values often are.)

Categories