How to round a double/float to BINARY precision? - java

I am writing tests for code performing calculations on floating point numbers. Quite expectedly, the results are rarely exact and I would like to set a tolerance between the calculated and expected result. I have verified that in practice, with double precision, the results are always correct after rounding of last two significant decimals, but usually after rounding the last decimal. I am aware of the format in which doubles and floats are stored, as well as the two main methods of rounding (precise via BigDecimal and faster via multiplication, math.round and division). As the mantissa is stored in binary however, is there a way to perform rounding using base 2 rather than 10?
Just clearing the last 3 bits almost always yields equal results, but if I could push it and instead 'add 2' to the mantissa if its second least significast bit is set, I could probably reach the limit of accuracy. This would be easy enough, expect I have no idea how to handle overflow (when all bits 52-1 are set).
A Java solution would be preferred, but I could probably port one for another language if I understood it.
EDIT:
As part of the problem was that my code was generic with regards to arithmetic (relying on scala.Numeric type class), what I did was an incorporation of rounding suggested in the answer into a new numeric type, which carried the calculated number (floating point in this case) and rounding error, essentially representing a range instead of a point. I then overrode equals so that two numbers are equal if their error ranges overlap (and they share arithmetic, i.e. the number type).

Yes, rounding off binary digits makes more sense than going through BigDecimal and can be implemented very efficiently if you are not worried about being within a small factor of Double.MAX_VALUE.
You can round a floating-point double value x with the following sequence in Java (untested):
double t = 9 * x; // beware: this overflows if x is too close to Double.MAX_VALUE
double y = x - t + t;
After this sequence, y should contain the rounded value. Adjust the distance between the two set bits in the constant 9 in order to adjust the number of bits that are rounded off. The value 3 rounds off one bit. The value 5 rounds off two bits. The value 17 rounds off four bits, and so on.
This sequence of instruction is attributed to Veltkamp and is typically used in “Dekker multiplication”. This page has some references.

Related

Appropriate scale for converting via BigDecimal to floating point

I've written an arbitrary precision rational number class that needs to provide a way to convert to floating-point. This can be done straightforwardly via BigDecimal:
return new BigDecimal(num).divide(new BigDecimal(den), 17, RoundingMode.HALF_EVEN).doubleValue();
but this requires a value for the scale parameter when dividing the decimal numbers. I picked 17 as the initial guess because that is approximately the precision of a double precision floating point number, but I don't know whether that's actually correct.
What would be the correct number to use, defined as, the smallest number such that making it any larger would not make the answer any more accurate?
Introduction
No finite precision suffices.
The problem posed in the question is equivalent to:
What precision p guarantees that converting any rational number x to p decimal digits and then to floating-point yields the floating-point number nearest x (or, in case of a tie, either of the two nearest x)?
To see this is equivalent, observe that the BigDecimal divide shown in the question returns num/div to a selected number of decimal places. The question then asks whether increasing that number of decimal places could increase the accuracy of the result. Clearly, if there is a floating-point number nearer x than the result, then the accuracy could be improved. Thus, we are asking how many decimal places are needed to guarantee the closest floating-point number (or one of the tied two) is obtained.
Since BigDecimal offers a choice of rounding methods, I will consider whether any of them suffices. For the conversion to floating-point, I presume round-to-nearest-ties-to-even is used (which BigDecimal appears to use when converting to Double or Float). I give a proof using the IEEE-754 binary64 format, which Java uses for Double, but the proof applies to any binary floating-point format by changing the 252 used below to 2w-1, where w is the number of bits in the significand.
Proof
One of the parameters to a BigDecimal division is the rounding method. Java’s BigDecimal has several rounding methods. We only need to consider three, ROUND_UP, ROUND_HALF_UP, and ROUND_HALF_EVEN. Arguments for the others are analogous to those below, by using various symmetries.
In the following, suppose we convert to decimal using any large precision p. That is, p is the number of decimal digits in the result of the conversion.
Let m be the rational number 252+1+½−10−p. The two binary64 numbers neighboring m are 252+1 and 252+2. m is closer to the first one, so that is the result we require from converting m first to decimal and then to floating-point.
In decimal, m is 4503599627370497.4999…, where there are p−1 trailing 9s. When rounded to p significant digits with ROUND_UP, ROUND_HALF_UP, or ROUND_HALF_EVEN, the result is 4503599627370497.5 = 252+1+½. (Recognize that, at the position where rounding occurs, there are 16 trailing 9s being discarded, effectively a fraction of .9999999999999999 relative to the rounding position. In ROUND_UP, any non-zero discarded amount causes rounding up. In ROUND_HALF_UP and ROUND_HALF_EVEN, a discarded amount greater than ½ at that position causes rounding up.)
252+1+½ is equally close to the neighboring binary64 numbers 252+1 and 252+2, so the round-to-nearest-ties-to-even method produces 252+2.
Thus, the result is 252+2, which is not the binary64 value closest to m.
Therefore, no finite precision p suffices to round all rational numbers correctly.

Double data type confusion - redeemed

A question of mine was recently closed as a duplicate, but that didn't help me completely. My new and a more specific question is:
Can all values(whole numbers, without any decimal part) smaller than 1.7e308 and greater than 0 be stored in a double data type, as 1.7e308 is the maximum value of a double data type? Because I don't want a decimal numeral, but a large, non-decimal number so large that can't be represented even by long long.
Can all whole numbers smaller than 1.7e308 and greater than 0 be stored in a double data type
The simple answer is No.
There are various ways to come to this conclusion, but here is one that doesn't even depend on an understanding of floating point number formats.
We know that double is a 64 bit representation.
Therefore, there can be at most 264 distinct double values: that is about 1.8 x 1019
You are asking if a double can represent all integers between zero and 1.7 x 10308.
That is 1.7 x 10308 distinct values.
1.7 x 10308 is greater than 1.8 x 1019.
Therefore what you are asking is impossible.
Your best simple option is to use BigInteger.
But you said this:
... but due to slow operations on BigIntegers, I'm keeping it a last choice. It takes about a second to multiply two 4-digit numbers.
That is simply not true. If you have come to that conclusion by benchmarking, then there is something very wrong with your methodology.
Multiplying 2 x 4 digit numbers using BigInteger should take less than a microsecond.
All floating type numbers (halfs/floats/doubles/long doubles/etc) are composed of a mantissa and an exponent.
Suppose you have 1.7e308, 1.7 is the mantissa while 308 is the exponent. You can't exactly separate the two in a float. This is because every float is represented as a composition of the aforementioned in memory. Hence you can't have a "non-decimal" float.

When using doubles, why isn't (x / (y * z)) the same as (x / y / z)? [duplicate]

This question already has answers here:
How to avoid floating point precision errors with floats or doubles in Java?
(12 answers)
Double calculation producing odd result [duplicate]
(3 answers)
Closed 7 years ago.
This is partly academic, as for my purposes I only need it rounded to two decimal places; but I am keen to know what is going on to produce two slightly different results.
This is the test that I wrote to narrow it to the simplest implementation:
#Test
public void shouldEqual() {
double expected = 450.00d / (7d * 60); // 1.0714285714285714
double actual = 450.00d / 7d / 60; // 1.0714285714285716
assertThat(actual).isEqualTo(expected);
}
But it fails with this output:
org.junit.ComparisonFailure:
Expected :1.0714285714285714
Actual :1.0714285714285716
Can anyone explain in detail what is going on under the hood to result in the value at 1.000000000000000X being different?
Some of the points I'm looking for in an answer are:
Where is the precision lost?
Which method is preferred, and why?
Which is actually correct? (In pure maths, both can't be right. Perhaps both are wrong?)
Is there a better solution or method for these arithmetic operations?
I see a bunch of questions that tell you how to work around this problem, but not one that really explains what's going on, other than "floating-point roundoff error is bad, m'kay?" So let me take a shot at it. Let me first point out that nothing in this answer is specific to Java. Roundoff error is a problem inherent to any fixed-precision representation of numbers, so you get the same issues in, say, C.
Roundoff error in a decimal data type
As a simplified example, imagine we have some sort of computer that natively uses an unsigned decimal data type, let's call it float6d. The length of the data type is 6 digits: 4 dedicated to the mantissa, and 2 dedicated to the exponent. For example, the number 3.142 can be expressed as
3.142 x 10^0
which would be stored in 6 digits as
503142
The first two digits are the exponent plus 50, and the last four are the mantissa. This data type can represent any number from 0.001 x 10^-50 to 9.999 x 10^+49.
Actually, that's not true. It can't store any number. What if you want to represent 3.141592? Or 3.1412034? Or 3.141488906? Tough luck, the data type can't store more than four digits of precision, so the compiler has to round anything with more digits to fit into the constraints of the data type. If you write
float6d x = 3.141592;
float6d y = 3.1412034;
float6d z = 3.141488906;
then the compiler converts each of these three values to the same internal representation, 3.142 x 10^0 (which, remember, is stored as 503142), so that x == y == z will hold true.
The point is that there is a whole range of real numbers which all map to the same underlying sequence of digits (or bits, in a real computer). Specifically, any x satisfying 3.1415 <= x <= 3.1425 (assuming half-even rounding) gets converted to the representation 503142 for storage in memory.
This rounding happens every time your program stores a floating-point value in memory. The first time it happens is when you write a constant in your source code, as I did above with x, y, and z. It happens again whenever you do an arithmetic operation that increases the number of digits of precision beyond what the data type can represent. Either of these effects is called roundoff error. There are a few different ways this can happen:
Addition and subtraction: if one of the values you're adding has a different exponent from the other, you will wind up with extra digits of precision, and if there are enough of them, the least significant ones will need to be dropped. For example, 2.718 and 121.0 are both values that can be exactly represented in the float6d data type. But if you try to add them together:
1.210 x 10^2
+ 0.02718 x 10^2
-------------------
1.23718 x 10^2
which gets rounded off to 1.237 x 10^2, or 123.7, dropping two digits of precision.
Multiplication: the number of digits in the result is approximately the sum of the number of digits in the two operands. This will produce some amount of roundoff error, if your operands already have many significant digits. For example, 121 x 2.718 gives you
1.210 x 10^2
x 0.02718 x 10^2
-------------------
3.28878 x 10^2
which gets rounded off to 3.289 x 10^2, or 328.9, again dropping two digits of precision.
However, it's useful to keep in mind that, if your operands are "nice" numbers, without many significant digits, the floating-point format can probably represent the result exactly, so you don't have to deal with roundoff error. For example, 2.3 x 140 gives
1.40 x 10^2
x 0.23 x 10^2
-------------------
3.22 x 10^2
which has no roundoff problems.
Division: this is where things get messy. Division will pretty much always result in some amount of roundoff error unless the number you're dividing by happens to be a power of the base (in which case the division is just a digit shift, or bit shift in binary). As an example, take two very simple numbers, 3 and 7, divide them, and you get
3. x 10^0
/ 7. x 10^0
----------------------------
0.428571428571... x 10^0
The closest value to this number which can be represented as a float6d is 4.286 x 10^-1, or 0.4286, which distinctly differs from the exact result.
As we'll see in the next section, the error introduced by rounding grows with each operation you do. So if you're working with "nice" numbers, as in your example, it's generally best to do the division operations as late as possible because those are the operations most likely to introduce roundoff error into your program where none existed before.
Analysis of roundoff error
In general, if you can't assume your numbers are "nice", roundoff error can be either positive or negative, and it's very difficult to predict which direction it will go just based on the operation. It depends on the specific values involved. Look at this plot of the roundoff error for 2.718 z as a function of z (still using the float6d data type):
In practice, when you're working with values that use the full precision of your data type, it's often easier to treat roundoff error as a random error. Looking at the plot, you might be able to guess that the magnitude of the error depends on the order of magnitude of the result of the operation. In this particular case, when z is of the order 10-1, 2.718 z is also on the order of 10-1, so it will be a number of the form 0.XXXX. The maximum roundoff error is then half of the last digit of precision; in this case, by "the last digit of precision" I mean 0.0001, so the roundoff error varies between -0.00005 and +0.00005. At the point where 2.718 z jumps up to the next order of magnitude, which is 1/2.718 = 0.3679, you can see that the roundoff error also jumps up by an order of magnitude.
You can use well-known techniques of error analysis to analyze how a random (or unpredictable) error of a certain magnitude affects your result. Specifically, for multiplication or division, the "average" relative error in your result can be approximated by adding the relative error in each of the operands in quadrature - that is, square them, add them, and take the square root. With our float6d data type, the relative error varies between 0.0005 (for a value like 0.101) and 0.00005 (for a value like 0.995).
Let's take 0.0001 as a rough average for the relative error in values x and y. The relative error in x * y or x / y is then given by
sqrt(0.0001^2 + 0.0001^2) = 0.0001414
which is a factor of sqrt(2) larger than the relative error in each of the individual values.
When it comes to combining operations, you can apply this formula multiple times, once for each floating-point operation. So for instance, for z / (x * y), the relative error in x * y is, on average, 0.0001414 (in this decimal example) and then the relative error in z / (x * y) is
sqrt(0.0001^2 + 0.0001414^2) = 0.0001732
Notice that the average relative error grows with each operation, specifically as the square root of the number of multiplications and divisions you do.
Similarly, for z / x * y, the average relative error in z / x is 0.0001414, and the relative error in z / x * y is
sqrt(0.0001414^2 + 0.0001^2) = 0.0001732
So, the same, in this case. This means that for arbitrary values, on average, the two expressions introduce approximately the same error. (In theory, that is. I've seen these operations behave very differently in practice, but that's another story.)
Gory details
You might be curious about the specific calculation you presented in the question, not just an average. For that analysis, let's switch to the real world of binary arithmetic. Floating-point numbers in most systems and languages are represented using IEEE standard 754. For 64-bit numbers, the format specifies 52 bits dedicated to the mantissa, 11 to the exponent, and one to the sign. In other words, when written in base 2, a floating point number is a value of the form
1.1100000000000000000000000000000000000000000000000000 x 2^00000000010
52 bits 11 bits
The leading 1 is not explicitly stored, and constitutes a 53rd bit. Also, you should note that the 11 bits stored to represent the exponent are actually the real exponent plus 1023. For example, this particular value is 7, which is 1.75 x 22. The mantissa is 1.75 in binary, or 1.11, and the exponent is 1023 + 2 = 1025 in binary, or 10000000001, so the content stored in memory is
01000000000111100000000000000000000000000000000000000000000000000
^ ^
exponent mantissa
but that doesn't really matter.
Your example also involves 450,
1.1100001000000000000000000000000000000000000000000000 x 2^00000001000
and 60,
1.1110000000000000000000000000000000000000000000000000 x 2^00000000101
You can play around with these values using this converter or any of many others on the internet.
When you compute the first expression, 450/(7*60), the processor first does the multiplication, obtaining 420, or
1.1010010000000000000000000000000000000000000000000000 x 2^00000001000
Then it divides 450 by 420. This produces 15/14, which is
1.0001001001001001001001001001001001001001001001001001001001001001001001...
in binary. Now, the Java language specification says that
Inexact results must be rounded to the representable value nearest to the infinitely precise result; if the two nearest representable values are equally near, the one with its least significant bit zero is chosen. This is the IEEE 754 standard's default rounding mode known as round to nearest.
and the nearest representable value to 15/14 in 64-bit IEEE 754 format is
1.0001001001001001001001001001001001001001001001001001 x 2^00000000000
which is approximately 1.0714285714285714 in decimal. (More precisely, this is the least precise decimal value that uniquely specifies this particular binary representation.)
On the other hand, if you compute 450 / 7 first, the result is 64.2857142857..., or in binary,
1000000.01001001001001001001001001001001001001001001001001001001001001001...
for which the nearest representable value is
1.0000000100100100100100100100100100100100100100100101 x 2^00000000110
which is 64.28571428571429180465... Note the change in the last digit of the binary mantissa (compared to the exact value) due to roundoff error. Dividing this by 60 gives you
1.000100100100100100100100100100100100100100100100100110011001100110011...
Look at the end: the pattern is different! It's 0011 that repeats, instead of 001 as in the other case. The closest representable value is
1.0001001001001001001001001001001001001001001001001010 x 2^00000000000
which differs from the other order of operations in the last two bits: they're 10 instead of 01. The decimal equivalent is 1.0714285714285716.
The specific rounding that causes this difference should be clear if you look at the exact binary values:
1.0001001001001001001001001001001001001001001001001001001001001001001001...
1.0001001001001001001001001001001001001001001001001001100110011001100110...
^ last bit of mantissa
It works out in this case that the former result, numerically 15/14, happens to be the most accurate representation of the exact value. This is an example of how leaving division until the end benefits you. But again, this rule only holds as long as the values you're working with don't use the full precision of the data type. Once you start working with inexact (rounded) values, you no longer protect yourself from further roundoff errors by doing the multiplications first.
It has to do with how the double type is implemented and the fact that the floating-point types don't make the same precision guarantees as other simpler numerical types. Although the following answer is more specifically about sums, it also answers your question by explaining how there is no guarantee of infinite precision in floating-point mathematical operations: Why does changing the sum order returns a different result?. Essentially you should never attempt to determine the equality of floating-point values without specifying an acceptable margin of error. Google's Guava library includes DoubleMath.fuzzyEquals(double, double, double) to determine the equality of two double values within a certain precision. If you wish to read up on the specifics of floating-point equality this site is quite useful; the same site also explains floating-point rounding errors. In summation: the expected and actual values of your calculation differ because of the rounding differing between the calculations due to the order of operations.
Let's simplify things a bit. What you want to know is why 450d / 420 and 450d / 7 / 60 (specifically) give different results.
Let's see how division is performed in IEE double-precision floating point format. Without going deep into implementation details, it's basically XOR-ing the sign bit, subtracting the exponent of the divisor from the exponent of the dividend, dividing the mantissas, and normalizing the result.
First, we should represent our numbers in the proper format for double:
450 is 0 10000000111 1100001000000000000000000000000000000000000000000000
420 is 0 10000000111 1010010000000000000000000000000000000000000000000000
7 is 0 10000000001 1100000000000000000000000000000000000000000000000000
60 is 0 10000000100 1110000000000000000000000000000000000000000000000000
Let's first divide 450 by 420
First comes the sign bit, it's 0 (0 xor 0 == 0).
Then comes the exponent. 10000000111b - 10000000111b + 1023 == 10000000111b - 10000000111b + 01111111111b == 01111111111b
Looking good, now the mantissa:
1.1100001000000000000000000000000000000000000000000000 / 1.1010010000000000000000000000000000000000000000000000 == 1.1100001 / 1.101001. There are a couple of different ways to do this, I'll talk a bit about them later. The result is 1.0(001) (you can verify it here).
Now we should normalize the result. Let's see the guard, round and sticky bit values:
0001001001001001001001001001001001001001001001001001 0 0 1
Guard bit's 0, we don't do any rounding. The result is, in binary:
0 01111111111 0001001001001001001001001001001001001001001001001001
Which gets represented as 1.0714285714285714 in decimal.
Now let's divide 450 by 7 by analogy.
Sign bit = 0
Exponent = 10000000111b - 10000000001b + 01111111111b == -01111111001b + 01111111111b + 01111111111b == 10000000101b
Mantissa = 1.1100001 / 1.11 == 1.00000(001)
Rounding:
0000000100100100100100100100100100100100100100100100 1 0 0
Guard bit is set, round and sticky bits are not. We are rounding to-nearest (default mode for IEEE), and we're stuck right between the two possible values which we could round to. As the lsb is 0, we add 1. This gives us the rounded mantissa:
0000000100100100100100100100100100100100100100100101
The result is
0 10000000101 0000000100100100100100100100100100100100100100100101
Which gets represented as 64.28571428571429 in decimal.
Now we will have to divide it by 60... But you already know that we have lost some precision. Dividing 450 by 420 didn't require rounding at all, but here, we already had to round the result at least once. But, for completeness's sake, let's finish the job:
Dividing 64.28571428571429 by 60
Sign bit = 0
Exponent = 10000000101b - 10000000100b + 01111111111b == 01111111110b
Mantissa = 1.0000000100100100100100100100100100100100100100100101 / 1.111 == 0.10001001001001001001001001001001001001001001001001001100110011
Round and shift:
0.1000100100100100100100100100100100100100100100100100 1 1 0 0
1.0001001001001001001001001001001001001001001001001001 1 0 0
Rounding just as in the previous case, we get the mantissa: 0001001001001001001001001001001001001001001001001010.
As we shifted by 1, we add that to the exponent, getting
Exponent = 01111111111b
So, the result is:
0 01111111111 0001001001001001001001001001001001001001001001001010
Which gets represented as 1.0714285714285716 in decimal.
Tl;dr:
The first division gave us:
0 01111111111 0001001001001001001001001001001001001001001001001001
And the last division gave us:
0 01111111111 0001001001001001001001001001001001001001001001001010
The difference is in the last 2 bits only, but we could have lost more - after all, to get the second result, we had to round two times instead of none!
Now, about mantissa division. Floating-point division is implemented in two major ways.
The way mandated by the IEEE long division (here are some good examples; it's basically the regular long division, but with binary instead of decimal), and it's pretty slow. That is what your computer did.
There is also a faster but less accrate option, multiplication by inverse. First, a reciprocal of the divisor is found, and then multiplication is performed.
That's because double division often lead to a loss of precision. Said loss can vary depending on the order of the divisions.
When you divide by 7d, you already lost some precision with the actual result. Then only you divide an erroneous result by 60.
When you divide by 7d * 60, you only have to use division once, thus losing precision only once.
Note that double multiplication can sometimes fail too, but that's much less common.
Certainly the order of the operations mixed with the fact that doubles aren't precise :
450.00d / (7d * 60) --> a = 7d * 60 --> result = 450.00d / a
vs
450.00d / 7d / 60 --> a = 450.00d /7d --> result = a / 60

BigDecimal.divide(...) - appropriate scale for quotients with non-terminating expansions

I'm using BigDecimal for some floating-point math. If you divide 5 by 4.2 you'll get an exception (as the result has a non-terminating expansion which cannot be represented by a BigDecimal) i.e.
BigDecimal five = new BigDecimal("5");
BigDecimal fourPointTwo = new BigDecimal("4.2");
five.divide(fourPointTwo) // ArithmeticException: Non-terminating decimal expansion; no exact representable decimal result.
I'm prepared to lose some precision in this case, so I am going to use the divide(...) method which allows a scale for the result to be provided:
five.divide(fourPointTwo, 2, RoundingMode.HALF_UP); //Fine, but obviously not 100% accurate
What scale should I pass to this method so that the result is as accurate as if I had performed the calculation using two doubles?
From the javadocs of the BigDecimal class:
If zero or positive, the scale is the number of digits to the right of the decimal point. If negative, the unscaled value of the number is multiplied by ten to the power of the negation of the scale. The value of the number represented by the BigDecimal is therefore (unscaledValue × 10-scale).
The precision of double varies accordingly to the order of magnitude of the value. According to this, it uses 52 bits to store the unsigned mantissa, so any integer that may be represented with 52 bits will be ok. This will be roughly 18 decimal digits.
Further, double uses 11 bits to store the exponent. So, something like 4 decimal precision will do. This way, any integer up to 52 bits multiplied by a positive or negative power of 2 with at most 10 bits may be represented (one bit is the sign of the expoent). Beyond that, you start to lose precision.
The extra bit of double stores the sign.
This way, scale 18 + 4 = 22 will be at least as precise as double.
Your problem is called "round-off error" or "rounding error". Example:
You have two numbers a and b. You know that each has a certain precision (i.e. number of digits that you're confident of) which means that every other digit is "random" noise.
Imagine b has a precision of two digits. The result of (b*100)-int(b*100) will be random since the operation cuts away all "correct" digits.
These errors propagate depending on the mathematical operation. Some examples:
Errors margins add when the numbers are added. If a and b have a precision of two, adding them might turn the second digit of the fraction into garbage: 0.003 + 0.008 = 0.011
Multiplication grows the error fast and exponential functions grow it even faster.
Division reduces the error margin (0.003 / 3 = 0.001)
So if you want a correct answer, you must calculate the error margins of all operations in your code following the rules outlined above. Link anyone?
Of course, this is usually not an option. So you need to think what amount of error you can live with. For example if you do math on financial data, a precision of 10 or 20 is generally enough because you have enough bits to "waste" for several mathematical operations before the error grows into significant parts of the value.
Example: You start with 10.500 000 000 and 3.100 000 000. If you divide the two, you get 3.387 096 774. From that, you only need 3.87 - the rest is spare precision which you can use up in further operations until you round the last result to two digits and save it back in the database.

double d=1/0.0 vs double d=1/0

double d=1/0.0;
System.out.println(d);
It prints Infinity , but if we will write double d=1/0; and print it we'll get this exception: Exception
in thread "main" java.lang.ArithmeticException: / by zero
at D.main(D.java:3) Why does Java know in one case that diving by zero is infinity but for the int 0 it is not defined?
In both cases d is double and in both cases the result is infinity.
Floating point data types have a special value reserved to represent infinity, integer values do not.
In your code 1/0 is an integer division that, of course, fails. However, 1/0.0 is a floating point division and so results in Infinity.
strictly speaking, 1.0/0.0 isn't infinity at all, it's undefined.
As David says in his answer, Floats have a way of expressing a number that is not in the range of the highest number it can represent and the lowest. These values are collectively known as "Not a number" or just NaNs. NaNs can also occur from calculations that really are infinite (such as limx -> 0 ln2 x), values that are finite but overflow the range floats can represent (like 10100100), as well as undefined values like 1/0.
Floating point numbers don't quite clearly distinguish among undefined values, overflow and infinity; what combination of bits results from that calculation depends. Since just printing "NaN" or "Not a Number" is a bit harder to understand for folks that don't know how floating point values are represented, that formatter just prints "Infinity" or sometimes "-Infinity" Since it provides the same level of information when you do know what FP NaN's are all about, and has some meaning when you don't.
Integers don't have anything comparable to floating point NaN's. Since there's no sensible value for an integer to take when you do 1/0, the only option left is to raise an exception.
The same code written in machine language can either invoke an interrupt, which is comparable to a Java exception, or set a condition register, which would be a global value to indicate that the last calculation was a divide by zero. which of those are available varies a bit by platform.

Categories