Is every double value rational? - java

Is every double a rational number (excluding the special values Infinity, -Infinity and NaN)? I am leaning towards saying yes, based on the following logic:
The mantissa has a value that can be represented as a decimal, which can be the numerator.
The exponent can be converted to a denominator, so that the result is scaled up and down as required.
Is this logic correct, and if not, what is wrong with it, and are there counterexamples which prove double values can be irrational?

This logic seems correct.
Computers have only limited space, so in the double format they can only represent rational numbers in memory: an irrational number has an infinite, non-repeating expansion, which cannot be stored in a finite number of bits.
Come to think of it, you could store executable code for a function that defines a number, rational or not, but this wouldn't work for every irrational number and, more importantly, isn't how double works.
As for the special values, I don't think so. Infinity is not really a number, so I find it hard to call it rational or irrational. The same goes for NaN (which is, by definition, not a number).

You seem to be correct: doubles, at least in IEEE 754 with base 2, are rational.
With IEEE 754 you have
x = s * m * b^e
where s is the sign, m is the mantissa, b is the base (2), and e is the exponent.
Since s, m, b and e are all integers (with e possibly negative), x is an integer multiplied or divided by a power of two, i.e. a ratio of two integers, so x must be rational.
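To make the decomposition concrete, here is a small sketch added for illustration (not part of the original answer; the class name is made up). It recovers the exact fraction s * m * 2^e behind a finite double from its bit pattern:
import java.math.BigInteger;

public class DoubleAsFraction {
    public static void main(String[] args) {
        double x = 0.1;
        long bits = Double.doubleToLongBits(x);           // assumes x is finite
        int sign = bits < 0 ? -1 : 1;
        int biasedExp = (int) ((bits >> 52) & 0x7FF);     // 11 exponent bits
        long frac = bits & 0x000FFFFFFFFFFFFFL;           // 52 fraction bits
        // Normal numbers have an implicit leading 1 bit; subnormals do not.
        long m = biasedExp == 0 ? frac : frac | (1L << 52);
        int e = (biasedExp == 0 ? 1 : biasedExp) - 1075;  // the power of two
        BigInteger num = BigInteger.valueOf(sign * m);
        BigInteger den = BigInteger.ONE;
        if (e >= 0) num = num.shiftLeft(e); else den = den.shiftLeft(-e);
        BigInteger g = num.gcd(den);                      // reduce the fraction
        System.out.println(num.divide(g) + " / " + den.divide(g));
        // 3602879701896397 / 36028797018963968 -- the exact value stored for 0.1
    }
}
Every finite double comes out as an integer over a power of two, which is the rationality argument in code.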

Related

Appropriate scale for converting via BigDecimal to floating point

I've written an arbitrary precision rational number class that needs to provide a way to convert to floating-point. This can be done straightforwardly via BigDecimal:
return new BigDecimal(num).divide(new BigDecimal(den), 17, RoundingMode.HALF_EVEN).doubleValue();
but this requires a value for the scale parameter when dividing the decimal numbers. I picked 17 as the initial guess because that is approximately the precision of a double precision floating point number, but I don't know whether that's actually correct.
What would be the correct number to use, defined as the smallest number such that making it any larger would not make the answer any more accurate?
Introduction
No finite precision suffices.
The problem posed in the question is equivalent to:
What precision p guarantees that converting any rational number x to p decimal digits and then to floating-point yields the floating-point number nearest x (or, in case of a tie, either of the two nearest x)?
To see this is equivalent, observe that the BigDecimal divide shown in the question returns num/den to a selected number of decimal places. The question then asks whether increasing that number of decimal places could increase the accuracy of the result. Clearly, if there is a floating-point number nearer x than the result, then the accuracy could be improved. Thus, we are asking how many decimal places are needed to guarantee the closest floating-point number (or one of the tied two) is obtained.
Since BigDecimal offers a choice of rounding methods, I will consider whether any of them suffices. For the conversion to floating-point, I presume round-to-nearest-ties-to-even is used (which BigDecimal appears to use when converting to Double or Float). I give a proof using the IEEE-754 binary64 format, which Java uses for Double, but the proof applies to any binary floating-point format by changing the 2^52 used below to 2^(w-1), where w is the number of bits in the significand.
Proof
One of the parameters to a BigDecimal division is the rounding method. Java’s BigDecimal has several rounding methods. We only need to consider three, ROUND_UP, ROUND_HALF_UP, and ROUND_HALF_EVEN. Arguments for the others are analogous to those below, by using various symmetries.
In the following, suppose we convert to decimal using any large precision p. That is, p is the number of decimal digits in the result of the conversion.
Let m be the rational number 2^52 + 1 + ½ − 10^−p. The two binary64 numbers neighboring m are 2^52 + 1 and 2^52 + 2. m is closer to the first one, so that is the result we require from converting m first to decimal and then to floating-point.
In decimal, m is 4503599627370497.4999…, where there are p−1 trailing 9s. When rounded to p significant digits with ROUND_UP, ROUND_HALF_UP, or ROUND_HALF_EVEN, the result is 4503599627370497.5 = 2^52 + 1 + ½. (Recognize that, at the position where rounding occurs, there are 16 trailing 9s being discarded, effectively a fraction of .9999999999999999 relative to the rounding position. In ROUND_UP, any non-zero discarded amount causes rounding up. In ROUND_HALF_UP and ROUND_HALF_EVEN, a discarded amount greater than ½ at that position causes rounding up.)
2^52 + 1 + ½ is equally close to the neighboring binary64 numbers 2^52 + 1 and 2^52 + 2, so the round-to-nearest-ties-to-even method produces 2^52 + 2.
Thus, the result is 2^52 + 2, which is not the binary64 value closest to m.
Therefore, no finite precision p suffices to round all rational numbers correctly.
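To make the proof concrete, here is a sketch added for illustration (not part of the original proof; the class name is made up, and it assumes BigDecimal.doubleValue() rounds to nearest-ties-to-even, as presumed above). It builds m for a chosen precision p and shows the decimal detour missing the nearest double:
import java.math.BigDecimal;
import java.math.BigInteger;
import java.math.MathContext;
import java.math.RoundingMode;

public class NoPrecisionSuffices {
    public static void main(String[] args) {
        int p = 40;  // any precision >= 17 works; the counterexample adapts to it
        // m = 2^52 + 1 + 1/2 - 10^-p, written as the exact fraction
        // ((2^53 + 3) * 10^p - 2) / (2 * 10^p).
        BigInteger tenP = BigInteger.TEN.pow(p);
        BigInteger num = BigInteger.ONE.shiftLeft(53).add(BigInteger.valueOf(3))
                                       .multiply(tenP).subtract(BigInteger.valueOf(2));
        BigInteger den = tenP.shiftLeft(1);
        // The denominator is 2 * 10^p, so this division terminates exactly.
        BigDecimal m = new BigDecimal(num).divide(new BigDecimal(den));

        double viaDecimal = m.round(new MathContext(p, RoundingMode.HALF_EVEN))
                             .doubleValue();   // 4.503599627370498E15: rounded up twice
        double nearest = m.doubleValue();      // 4.503599627370497E15: correct
        System.out.println(viaDecimal == nearest);  // false, however large p is
    }
}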

How to round a double/float to BINARY precision?

I am writing tests for code performing calculations on floating point numbers. Quite expectedly, the results are rarely exact and I would like to set a tolerance between the calculated and expected result. I have verified that in practice, with double precision, the results are always correct after rounding the last two significant decimals, but usually after rounding just the last decimal. I am aware of the format in which doubles and floats are stored, as well as the two main methods of rounding (the precise one via BigDecimal and the faster one via multiplication, Math.round and division). As the mantissa is stored in binary, however, is there a way to perform rounding using base 2 rather than 10?
Just clearing the last 3 bits almost always yields equal results, but if I could push it and instead 'add 2' to the mantissa if its second least significant bit is set, I could probably reach the limit of accuracy. This would be easy enough, except I have no idea how to handle overflow (when all mantissa bits 52-1 are set).
A Java solution would be preferred, but I could probably port one for another language if I understood it.
EDIT:
As part of the problem was that my code was generic with regard to arithmetic (relying on the scala.Numeric type class), what I did was incorporate the rounding suggested in the answer into a new numeric type, which carried the calculated number (floating point in this case) and the rounding error, essentially representing a range instead of a point. I then overrode equals so that two numbers are equal if their error ranges overlap (and they share arithmetic, i.e. the number type).
Yes, rounding off binary digits makes more sense than going through BigDecimal and can be implemented very efficiently if you are not worried about being within a small factor of Double.MAX_VALUE.
You can round a floating-point double value x with the following sequence in Java (untested):
double t = 9 * x; // beware: this overflows if x is too close to Double.MAX_VALUE
double y = x - t + t;
After this sequence, y should contain the rounded value. Adjust the distance between the two set bits in the constant 9 in order to adjust the number of bits that are rounded off. The value 3 rounds off one bit. The value 5 rounds off two bits. The value 17 rounds off four bits, and so on.
This sequence of instructions is attributed to Veltkamp and is typically used in "Dekker multiplication". This page has some references.
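Packaged up, the sequence might look like the following sketch, added here for illustration (untested; the method name is made up, and it assumes round-to-nearest arithmetic and that the multiplication does not overflow):
public class BinaryRound {
    // Rounds off the low k bits of x's significand using the Veltkamp
    // constant 2^k + 1.
    static double roundOffBits(double x, int k) {
        double t = ((1L << k) + 1) * x;  // overflows if x is too close to Double.MAX_VALUE
        return x - t + t;                // evaluated left to right: (x - t) + t
    }

    public static void main(String[] args) {
        double a = 0.1 + 0.2;  // 0.30000000000000004
        double b = 0.3;
        System.out.println(a == b);                                    // false
        System.out.println(roundOffBits(a, 4) == roundOffBits(b, 4));  // true
    }
}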

If the double type can handle the numbers 4.35 and 435, why does 4.35 * 100 evaluate to 434.99999999999994? [duplicate]

This question already has answers here:
Is floating point math broken?
Rounding oddity - what is special about "100"? [duplicate]
As I understand this, some numbers can't be represented with exactitude in binary, and that's why floating-point arithmetic sometimes gives us unexpected results; like 4.35 * 100 = 434.99999999999994. Something similar to what happens with 1/3 in decimal.
That makes sense, but this induces another question. Seems that in binary both 4.35 and 435 can be represented with exactitude. That's when it stops making sense to me. Why does 4.35 * 100 evaluate to 434.99999999999994? 435 and 4.35 both appear to have an exact representation in the double type:
double number1 = 4.35;
double number2 = 435;
double number3 = 100;
System.out.println(number1); // 4.35
System.out.println(number2); // 435.0
System.out.println(number3); // 100.0
// So far so good. Everything ok.
System.out.println(number1 * number3); // 434.99999999999994 !!!
// But 4.35 * 100 evaluates to 434.99999999999994
Why?
Edit: this question was marked as duplicate, and it is not. As you can see in the accepted answer, my confusion was regarding the discrepancy between the actual value and the printed value.
Seems that in binary both 4.35 and 435 can be represented with exactitude.
I see that you understand how floating-point numbers are internally represented. As for your doubt: no, 4.35 does not have an exact binary representation. So the issue is why the first print statement prints 4.35.
That is happening because System.out.println() invokes the Double.toString(double) method, which in turn uses the FloatingDecimal#toJavaFormatString() method, which performs some rounding internally on the passed double argument. You can go through the source code I linked.
For seeing the actual value of 4.35, try using this:
BigDecimal bd = new BigDecimal(number1);
System.out.println(bd);
This will print:
4.3499999999999996447286321199499070644378662109375
In this case, rather than printing the double value, you create a BigDecimal object, passing the double value as argument. BigDecimal represents an arbitrary-precision signed decimal number, so it gives you the exact value that is actually stored for 4.35.
You are right in that sometimes floating-point arithmetic gives unexpected results.
Your assertion that 4.35 can be represented exactly in floating-point is incorrect, because it can't be written as a terminating binary fraction. 100 can obviously be represented exactly, so for the result to be 434.99999999999994, 4.35 must not be represented exactly.
To be represented exactly in floating-point, a number must be able to be converted to a fraction where the denominator is a power of two only (and it must not be so precise that it exceeds the maximum precision of the floating-point type you're using). In this case, 4.35 is 4 7/20, and the denominator has a factor of 5, so the number can't be represented exactly in binary.
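To test the power-of-two-denominator rule mechanically, here is a small sketch added for illustration (the helper name is made up): a decimal string is exactly representable iff the double parsed from it holds exactly that value:
import java.math.BigDecimal;

public class ExactCheck {
    static boolean exactlyRepresentable(String decimal) {
        double d = Double.parseDouble(decimal);
        // new BigDecimal(double) yields the double's exact stored value.
        return new BigDecimal(d).compareTo(new BigDecimal(decimal)) == 0;
    }

    public static void main(String[] args) {
        System.out.println(exactlyRepresentable("4.35")); // false: 87/20, factor of 5
        System.out.println(exactlyRepresentable("435"));  // true
        System.out.println(exactlyRepresentable("100"));  // true
        System.out.println(exactlyRepresentable("4.25")); // true: 17/4, power of two
    }
}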
Although from a hardware perspective each floating-point number represents some exact value of the form M * 2^E (where M and E are integers in a certain range), from a software perspective it is more helpful to think of each floating-point number as representing "something for which M * 2^E has been deemed the best representation, and which is hopefully close to that". Given a floating-point value (M * 2^E), one should figure that the actual number it's intended to represent may very easily be anywhere from (M − ½) * 2^E to (M + ½) * 2^E, and in practice may extend a bit further beyond.
As a simple example, with type float, the value of M is limited to the range 0-16777215. The best representation of 2000000.1f is thus 16000001 * 2^-3 [i.e. 16000001/8]. Although the exact decimal value of 16000001/8 is 2000000.125, the last digit isn't necessary to define the value of the number, since 16000001/8 would be the best representation of both 2000000.120 and 2000000.129 (or, for that matter, of all values between 2000000.0625 and 2000000.1875, non-inclusive). Because the number of digits that would be required to display the exact decimal value of a number of the form M * 2^E would often far exceed the number of meaningful digits, it is common to limit the number of displayed digits to roughly those necessary to uniquely define the value.
Note that if one regards floating-point numbers as representing ranges, one will observe that casts from double to float--even though they must be explicitly specified--are actually safe since converting the double that best represents a particular value to float will yield either the best float representation of that value or something very close to it. Conversely, conversion from float to double, even though it's allowed implicitly, is dangerous because such conversion is very unlikely to select the double which would best represent the number that the float was supposed to represent.
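A quick sketch added for illustration, confirming the 2000000.1f numbers used above:
import java.math.BigDecimal;

public class FloatRange {
    public static void main(String[] args) {
        // 2000000.1f is stored as 16000001 * 2^-3; widening to double is
        // exact, so BigDecimal shows the float's exact value.
        System.out.println(new BigDecimal(2000000.1f));   // 2000000.125
        // Nearby decimals collapse onto the same float, consistent with
        // the "range" view:
        System.out.println(2000000.120f == 2000000.129f); // true
    }
}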
It is a bit hard to explain in English, because I learned computer number representation in Hungarian. In short, 4.35 is not stored as exactly that number, but as mantissa * 2^k (where the characteristic k runs from -k to +k and t is the length of the mantissa in the number system M = (t, -k, +k)), and the print call does some rounding on top of that. So the number line of representable values is not continuous; it is denser near certain points.
So these numbers are not exactly what you expect, and after the operation (I suppose this is one or two simple binary operations) you get an error on the order of the distance between neighboring floating-point values.

Is it possible that a number exactly represented as float can NOT be exactly represented as double?

I have a question which arose from another question about precision of floating numbers.
Now, I know that floating points can not always be represented accurately and hence they are stored as the closest possible floating number that can be represented.
My question is actually about the difference in representation of float and double.
Where does this question arise from?
Suppose I do:
System.out.println(.475d+.075d);
then the output would not be 0.55 but 0.549999 (on my machine)
However, when I do :
System.out.println(.475f+.075f);
I get the correct answer, i.e. 0.55 (a little unexpected for me)
Till now I was under the impression that double has more precision than float (that a double will be accurate to a greater number of decimal places). So, if a double cannot be represented precisely, then its equivalent float representation will also be stored inaccurately.
However the results I got are a little disturbing for me. I am confused if:
I have an incorrect understanding of what precision means?
float and double are represented differently, apart from the fact that double has more bits?
A number that can be represented as a float can be represented as a double too.
What you see is just formatted output; you are not looking at the actual binary representation.
System.out.println(Long.toBinaryString(Double.doubleToRawLongBits(.475d + .075d)));
// 11111111100001100110011001100110011001100110011001100110011001
System.out.println(Integer.toBinaryString(Float.floatToRawIntBits(.475f + .075f)));
// 111111000011001100110011001101
double d = .475d + .075d;
System.out.println(d);
// 0.5499999999999999
System.out.println((float)d);
// 0.55 (as expected)
System.out.println((double)(float)d);
// 0.550000011920929
System.out.println( .475f + .075f == 0.550000011920929d);
// true
Precision just means more bits. A number that cannot be represented as a float may have an exact representation as a double, but the number of those cases is vanishingly small relative to the total number of possible cases.
For simple cases like 0.1, the value is not representable as a finite-length binary floating-point number, no matter how many bits are available. This is the same as saying that a fraction such as 1/7 cannot be represented exactly in decimal, regardless of the number of digits you are allowed to use (as long as the number of digits is finite). You can approximate it as 0.142857142857142857... repeating over and over again, but you will never be able to write it EXACTLY no matter how long you go on.
Conversely, if a number is representable exactly as a float, it will also be representable exactly as a double. A double has a larger exponent range and more mantissa bits.
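A quick spot-check of that claim, added for illustration (a sketch over a few samples, not an exhaustive proof): widening a float to double and narrowing back recovers the identical bit pattern.
public class WideningIsExact {
    public static void main(String[] args) {
        float[] samples = { 0.55f, 0.1f, Float.MIN_VALUE, Float.MAX_VALUE, -0.0f };
        for (float f : samples) {
            double d = f;  // widening conversion: exact, no rounding
            System.out.println(Float.floatToRawIntBits((float) d)
                               == Float.floatToRawIntBits(f));  // true each time
        }
    }
}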
For your example, the cause of the apparent discrepancy is that in float, the difference between 0.475 and its float representation happened to be in the 'right' direction, so that when rounding occurred it went the way you expected. When the available precision increases, the representation is 'closer' to 0.475, but the residual error can end up on the opposite side. As a gross example, let's say that the closest possible float was 0.475006 but the closest possible double was 0.474999. This would give you the results you see.
Edit: Here's the results of a quick experiment:
public class Test {
    public static void main(String[] args) {
        float f = 0.475f;
        double d = 0.475d;
        System.out.printf("%20.16f", f);
        System.out.printf("%20.16f", d);
    }
}
Output:
0.4749999940395355 0.4750000000000000
What this means is that the binary representation of the number 0.475, carried out to any number of bits, works out just a tiny bit less than 0.475. You can see this in the float output above. In the double, the first 'wrong' bit occurs so far to the right that the value printed to 16 places just happens to work out to 0.475. This is purely an accident.
If one regards floating-point types as actually representing ranges of values, rather than discrete values (e.g. 0.1f doesn't represent 13421773/134217728, but rather "something between 13421772.5/134217728 and 13421773.5/134217728"), conversions from double to float will usually be accurate, while conversions from float to double will usually not. Unfortunately, Java allows the usually-inaccurate conversions to be performed implicitly, while requiring a typecast in the usually-accurate direction.
For every value of type float, there exists a value of type double whose range is centered about the center of the float's range. That does not mean the double is an accurate representation of the value in the float. For example, converting 0.1f to double yields a value meaning "something between 13421772.9999999/134217728 and 13421773.0000001/134217728", a value which is off by over a million times the implied tolerance.
For almost every value of type double, there exists a value of type float whose range completely includes the range implied by the double. The only exceptions are values whose range is centered precisely on the boundary between two float values. Converting such values to float would require that the system chose one range or the other; if the system rounds up when the double actually represented a number below the center of its range, or vice versa, the range of the float would not totally encompass that of the double. In practical terms, though, this is a non-issue, since it means that instead of a float cast from a double representing a range like (13421772.5/134217728 to 13421773.5/134217728) it would represent a range like (13421772.4999999/134217728 to 13421773.5000001/134217728). Compared with the horrendous imprecision resulting from a float to double cast, that tiny imprecision is nothing.
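A small sketch of the two conversion directions just described, added for illustration:
public class RangeConversions {
    public static void main(String[] args) {
        // float -> double keeps the float's exact value, which is not
        // the decimal you typed:
        System.out.println((double) 0.1f);        // 0.10000000149011612
        System.out.println(0.1f == 0.1d);         // false
        // double -> float picks the nearest float, which is usually
        // what you want:
        System.out.println((float) 0.1d == 0.1f); // true
    }
}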
BTW, returning to the particular numbers you are using, when you do your calculations as float, the computations are:
0.075f = 20132660±½ / 268435456
0.475f = 31876710±½ / 67108864
Sum = 18454938±½ / 33554432
In other words, the sum represents a number somewhere between roughly 0.54999999701 and 0.55000002682. The most natural representation is 0.55 (since the actual value could be more or less than that, additional digits would be meaningless).

Why is Double.MIN_VALUE not negative

Can anyone shed some light on why Double.MIN_VALUE is not actually the minimum value that Doubles can take? It is a positive value, and a Double can of course be negative.
I understand why it's a useful number, but it seems a very unintuitive name, especially when compared to Integer.MIN_VALUE. Calling it Double.SMALLEST_POSITIVE or MIN_INCREMENT or similar would have clearer semantics.
Also, what is the minimum value that Doubles can take? Is it -Double.MAX_VALUE? The docs don't seem to say.
The IEEE 754 format has one bit reserved for the sign and the remaining bits representing the magnitude. This means that it is "symmetrical" around the origin (as opposed to the Integer values, which have one more negative value). Thus the minimum value is simply the maximum value with the sign bit flipped, so yes, -Double.MAX_VALUE is the lowest actual number you can represent with a double.
I suppose Double.MAX_VALUE should be seen as the maximum magnitude, in which case it actually makes sense to simply write -Double.MAX_VALUE. It also explains why Double.MIN_VALUE is the least positive value (since that represents the least possible magnitude).
But sure, I agree that the naming is a bit misleading. Being used to the meaning of Integer.MIN_VALUE, I too was a bit surprised when I read that Double.MIN_VALUE was the smallest absolute value that could be represented. Perhaps they thought it was superfluous to have a constant representing the least possible value, as it is simply a - away from MAX_VALUE :-)
(Note, there is also Double.NEGATIVE_INFINITY, but I'm disregarding this, as it is to be seen as a "special case" and does not in fact represent any actual number.)
These constants have nothing to do with sign. This makes more sense if you consider a double as a composite of three parts: sign, exponent and mantissa.
Double.MIN_VALUE is actually the smallest value the mantissa can assume when the exponent is at its minimum value, before a flush to zero occurs. Likewise MAX_VALUE can be understood as the largest value the mantissa can assume when the exponent is at its maximum value, before a flush to infinity occurs.
More descriptive names for these two could be Smallest Absolute Value (add "non-zero" for verbosity) and Largest Absolute Value (add "non-infinite" for verbosity).
Check out the IEEE 754 (1985) standard for details. There is a revised (2008) version, but that only introduces more formats, which aren't even supported by Java (strictly speaking, Java even lacks support for some mandatory features of IEEE 754-1985, like many other high-level languages).
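To make the flush-to-zero and flush-to-infinity boundaries concrete, here is a small sketch added for illustration:
public class DoubleExtremes {
    public static void main(String[] args) {
        System.out.println(Double.MIN_VALUE);      // 4.9E-324, smallest positive value
        System.out.println(-Double.MAX_VALUE);     // most negative finite double
        System.out.println(Double.MIN_VALUE / 2);  // 0.0: flush to zero
        System.out.println(Double.MAX_VALUE * 2);  // Infinity: flush to infinity
        System.out.println(Double.longBitsToDouble(0x1L) == Double.MIN_VALUE); // true
    }
}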
I assume the confusing names can be traced back to C, which defined FLT_MIN as the smallest positive number.
Like in Java, where you have to use -Double.MAX_VALUE, you have to use -FLT_MAX to get the smallest float in C.
The minimum value for a double is Double.NEGATIVE_INFINITY; that's why Double.MIN_VALUE isn't really the minimum for a Double.
As doubles are floating-point numbers, you can only have the largest magnitudes (with lower precision) or the numbers closest to 0 (with greater precision).
If you really want a minimal value for a double that isn't infinity, then you can use -Double.MAX_VALUE.
With floating-point numbers, the precision is what matters, as there is no exact range.
/**
* A constant holding the smallest positive nonzero value of type
* <code>double</code>, 2<sup>-1074</sup>. It is equal to the
* hexadecimal floating-point literal
* <code>0x0.0000000000001P-1022</code> and also equal to
* <code>Double.longBitsToDouble(0x1L)</code>.
*/
But I agree that it should probably have been named something better :)
As it says in the documentation,
Double.MIN_VALUE is a constant holding the smallest POSITIVE nonzero value of type double, 2^(-1074).
The trick here is that we are talking about a floating-point number representation. The double data type is a double-precision 64-bit IEEE 754 floating point. Floating points represent numbers from 1,000,000,000,000 to 0.0000000000000001 with ease, while maximizing precision (the number of digits) at both ends of the scale.
The mantissa, always a positive number, holds the significant digits of the floating-point number. The exponent indicates the positive or negative power of the radix that the mantissa and sign should be multiplied by. The four components are combined as follows to get the floating-point value: value = sign × mantissa × radix^exponent.
Think of MIN_VALUE as the minimum value that the mantissa can represent, since the minimum value of a floating-point representation is the minimum magnitude that can be represented using it. (A better name could have been chosen to avoid this confusion, though.)
123 > 10 > 1 > 0.12 > 0.012 > 0.0000123 > 0.000000001 > 0.0000000000000001
Below is just FYI.
Double-precision floating-point can represent 2,098 powers of two, from 2^-1074 through 2^1023. Denormalized powers of two are those from 2^-1074 through 2^-1023; normalized powers of two are those from 2^-1022 through 2^1023.
