The float data type is a single-precision 32-bit IEEE 754 floating point and the double data type is a double-precision 64-bit IEEE 754 floating point.
What does it mean? And when should I use float instead of double or vice-versa?
The Wikipedia page on it is a good place to start.
To sum up:
float is represented in 32 bits, with 1 sign bit, 8 bits of exponent, and 23 bits of the significand (or what follows from a scientific-notation number: 2.33728*10^12; 33728 is the significand).
double is represented in 64 bits, with 1 sign bit, 11 bits of exponent, and 52 bits of significand.
By default, Java uses double to represent its floating-point numerals (so a literal 3.14 is typed double). It's also the data type that will give you a much larger number range, so I would strongly encourage its use over float.
There may be certain libraries that actually force your usage of float, but in general, unless you can guarantee that your result will be small enough to fit in float's prescribed range, it's best to opt for double.
If you require accuracy - for instance, you can't have a decimal value that is inaccurate (like 1/10 + 2/10), or you're doing anything with currency (for example, representing $10.33 in the system), then use a BigDecimal, which can support an arbitrary amount of precision and handle situations like that elegantly.
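As a small illustration of the accuracy point, here is a sketch that computes 1/10 + 2/10 once in double and once in BigDecimal. The class and method names are made up for this example.

```java
import java.math.BigDecimal;

class DecimalSumDemo {
    // Sum in binary floating point; neither 0.1 nor 0.2 is exactly representable.
    static double doubleSum() {
        return 0.1 + 0.2;
    }

    // Sum in decimal arithmetic; the String constructor keeps both values exact.
    static BigDecimal decimalSum() {
        return new BigDecimal("0.1").add(new BigDecimal("0.2"));
    }

    public static void main(String[] args) {
        System.out.println(doubleSum());        // 0.30000000000000004
        System.out.println(doubleSum() == 0.3); // false
        System.out.println(decimalSum());       // 0.3
    }
}
```

Note that the BigDecimal values are built from strings; building them from double literals would carry the binary approximation along.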
A float gives you approx. 6-7 decimal digits precision while a double gives you approx. 15-16. Also the range of numbers is larger for double.
A double needs 8 bytes of storage space while a float needs just 4 bytes.
Floating-point numbers, also known as real numbers, are used when evaluating expressions that require fractional precision. For example, calculations such as square root, or transcendentals such as sine and cosine, result in a value whose precision requires a floating-point type. Java implements the standard (IEEE 754) set of floating-point types and operators. There are two kinds of floating-point types, float and double, which represent single- and double-precision numbers, respectively. Their width and ranges are shown here:
Name     Width in Bits   Approximate Range
double   64              4.9e-324 to 1.8e+308
float    32              1.4e-045 to 3.4e+038
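The widths and limits can be checked against the constants on the wrapper classes. A minimal sketch (the class and method names are made up; Float.BYTES and Double.BYTES need Java 8 or later):

```java
class FloatDoubleLimits {
    // Smallest positive and largest finite float values, as printed by Java.
    static String floatRange() {
        return Float.MIN_VALUE + " to " + Float.MAX_VALUE;
    }

    // Smallest positive and largest finite double values, as printed by Java.
    static String doubleRange() {
        return Double.MIN_VALUE + " to " + Double.MAX_VALUE;
    }

    public static void main(String[] args) {
        System.out.println(Float.SIZE + " bits, " + Float.BYTES + " bytes");   // 32 bits, 4 bytes
        System.out.println(Double.SIZE + " bits, " + Double.BYTES + " bytes"); // 64 bits, 8 bytes
        System.out.println(floatRange());  // 1.4E-45 to 3.4028235E38
        System.out.println(doubleRange()); // 4.9E-324 to 1.7976931348623157E308
    }
}
```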
float
The type float specifies a single-precision value that uses 32 bits of storage. Single precision is faster on some processors and takes half as much space as double precision, but will become imprecise when the values are either very large or very small. Variables of type float are useful when you need a fractional component, but don't require a large degree of precision.
Here are some example float variable declarations:
float hightemp, lowtemp;
double
Double precision, as denoted by the double keyword, uses 64 bits to store a value. Double precision is actually faster than single precision on some modern processors that have been optimized for high-speed mathematical calculations. All transcendental math functions, such as sin( ), cos( ), and sqrt( ), return double values. When you need to maintain accuracy over many iterative calculations, or are manipulating large-valued numbers, double is the best choice.
This will give an error:
public class MyClass {
    public static void main(String args[]) {
        float a = 0.5;
    }
}
/MyClass.java:3: error: incompatible types: possible lossy conversion from double to float
        float a = 0.5;
This will work perfectly fine:
public class MyClass {
    public static void main(String args[]) {
        double a = 0.5;
    }
}
This will also work perfectly fine:
public class MyClass {
    public static void main(String args[]) {
        float a = (float)0.5;
    }
}
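Reason: in Java, a floating-point literal without a suffix (such as 0.5) has type double, and a double cannot be assigned to a float without an explicit cast or an f suffix on the literal.
A double takes more space but is more precise during computation; a float takes less space but is less precise.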
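Besides the cast, the f suffix also works. A quick sketch comparing the two (class and method names are made up for this example):

```java
class LiteralSuffixDemo {
    // The f suffix makes the literal a float, so no cast is needed.
    static float viaSuffix() {
        return 0.1f;
    }

    // An explicit cast narrows the double literal to a float.
    static float viaCast() {
        return (float) 0.1;
    }

    public static void main(String[] args) {
        System.out.println(viaSuffix() == viaCast()); // true
        // Comparing a float with a double literal promotes the float to double first.
        System.out.println(viaSuffix() == 0.1);       // false
    }
}
```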
Java seems to have a bias towards using double for computations nonetheless:
Case in point: in the program I wrote earlier today, the methods didn't work when I used float, but they work great now that I have replaced float with double (in the NetBeans IDE):
package palettedos;

import java.util.*;

class Palettedos{
    private static Scanner Z = new Scanner(System.in);
    public static final double pi = 3.142;

    public static void main(String[]args){
        Palettedos A = new Palettedos();
        System.out.println("Enter the base and height of the triangle respectively");
        int base = Z.nextInt();
        int height = Z.nextInt();
        System.out.println("Enter the radius of the circle");
        int radius = Z.nextInt();
        System.out.println("Enter the length of the square");
        long length = Z.nextInt();
        double tArea = A.calculateArea(base, height);
        double cArea = A.calculateArea(radius);
        long sqArea = A.calculateArea(length);
        System.out.println("The area of the triangle is\t" + tArea);
        System.out.println("The area of the circle is\t" + cArea);
        System.out.println("The area of the square is\t" + sqArea);
    }

    double calculateArea(int base, int height){
        double triArea = 0.5*base*height;
        return triArea;
    }

    double calculateArea(int radius){
        double circArea = pi*radius*radius;
        return circArea;
    }

    long calculateArea(long length){
        long squaArea = length*length;
        return squaArea;
    }
}
According to the IEEE standards, float is a 32 bit representation of a real number while double is a 64 bit representation.
In Java programs we mostly see the double data type used. This is to avoid overflow, since the range of numbers that double can accommodate is larger than the range of float.
Also, when high precision is required, the use of double is encouraged. A few library methods that were implemented a long time ago still require the float data type (only because they were implemented using float, nothing else!).
But if you are certain that your program only needs small numbers and that no overflow will occur with float, then using float will substantially reduce your memory usage, as a float requires half the memory of a double.
This example illustrates how to extract the sign (the leftmost bit), exponent (the 8 following bits) and mantissa (the 23 rightmost bits) from a float in Java.
int bits = Float.floatToIntBits(-0.005f);
int sign = bits >>> 31;
int exp = (bits >>> 23 & ((1 << 8) - 1)) - ((1 << 7) - 1);
int mantissa = bits & ((1 << 23) - 1);
System.out.println(sign + " " + exp + " " + mantissa + " " +
Float.intBitsToFloat((sign << 31) | (exp + ((1 << 7) - 1)) << 23 | mantissa));
The same approach can be used for doubles (11-bit exponent and 52-bit mantissa).
long bits = Double.doubleToLongBits(-0.005);
long sign = bits >>> 63;
long exp = (bits >>> 52 & ((1 << 11) - 1)) - ((1 << 10) - 1);
long mantissa = bits & ((1L << 52) - 1);
System.out.println(sign + " " + exp + " " + mantissa + " " +
Double.longBitsToDouble((sign << 63) | (exp + ((1 << 10) - 1)) << 52 | mantissa));
Credit: http://s-j.github.io/java-float/
You should use double instead of float for precise calculations, and float instead of double when less accuracy is acceptable. A float is an IEEE 754 single-precision floating-point number, while a double is an IEEE 754 double-precision floating-point number, so a double can store and compute numbers more accurately. Hope this helps.
In regular programming calculations, we rarely use float. If we can ensure that the result stays within the range of the float data type, then we can choose float to save memory. Generally, we use double for two reasons:
If we want a floating-point literal to be a float, we must explicitly add the suffix F or f, because by default every floating-point literal is treated as a double. This puts an extra burden on the programmer. If we use double, we don't need to add any suffix.
float is a single-precision data type that occupies 4 bytes, so in large computations the result may lose significant digits. double occupies 8 bytes and retains about twice as many significant digits.
Both float and double were designed especially for scientific calculations, where approximation errors are acceptable. If accuracy is the primary concern, it is recommended to use the BigDecimal class instead of float or double. Source: Float and double datatypes in Java
If you call the following Java method:
void processIt(long a) {
    float b = a; /* do I have loss here? */
}
do I have information loss when I assign the long variable to the float variable?
The Java language Specification says that the float type is a supertype of long.
Do I have information loss when I assign the long variable to the float variable?
Potentially, yes. That should be fairly clear from the fact that long has 64 bits of information, whereas float has only 32.
More specifically, as float values get bigger, the gap between successive values becomes more than 1 - whereas with long, the gap between successive values is always 1.
As an example:
long x = 100000000L;
float f1 = (float) x;
float f2 = (float) (x + 1);
System.out.println(f1 == f2); // true
In other words, two different long values have the same nearest representation in float.
This isn't just true of float though - it can happen with double too. In that case the numbers have to be bigger (as double has more precision) but it's still potentially lossy.
Again, it's reasonably easy to see that it has to be lossy - even though both long and double are represented in 64 bits, there are obviously double values which can't be represented as long values (trivially, 0.5 is one such) which means there must be some long values which aren't exactly representable as double values.
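The same collision can be shown for long to double. This is a sketch with made-up names; it uses 2^53 because a double carries 53 significant bits, so 2^53 + 1 is the first positive long that cannot be represented.

```java
class LongToDoubleDemo {
    // True when two different long values convert to the same double.
    static boolean collide(long a, long b) {
        return a != b && (double) a == (double) b;
    }

    public static void main(String[] args) {
        long x = 1L << 53; // 9007199254740992
        System.out.println(collide(x, x + 1));     // true
        System.out.println(collide(x - 1, x));     // false
    }
}
```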
Yes, this is possible: if only for the reason that float has too few (typically 6-7) significant digits to deal with all possible numbers that long can represent (19 significant digits). This is in part due to the fact that float has only 32 bits of storage, and long has 64 (the other part is float's storage format † ). As per the JLS:
A widening conversion of an int or a long value to float, or of a long value to double, may result in loss of precision - that is, the result may lose some of the least significant bits of the value. In this case, the resulting floating-point value will be a correctly rounded version of the integer value, using IEEE 754 round-to-nearest mode (§4.2.4).
By example:
long i = 1000000001; // 10 significant digits
float f = i;
System.out.printf(" %d %n %.1f", i, f);
This prints (with the difference highlighted):
1000000001
1000000000.0
~ ← lost the number 1
It is worth noting this is also the case with int to float and long to double (as per that quote). In fact, among conversions from int or long to a floating-point type, the only one that never loses precision is int to double.
~~~~~~
† I say in part as this is also true for int widening to float which can also lose precision, despite both int and float having 32-bits. The same sample above but with int i has the same result as printed. This is unsurprising once you consider the way that float is structured; it uses some of the 32-bits to store the mantissa, or significand, so cannot represent all integer numbers in the same range as that of int.
Yes you will, for example...
public static void main(String[] args) {
long g = 2;
g <<= 48;
g++;
System.out.println(g);
float f = (float) g;
System.out.println(f);
long a = (long) f;
System.out.println(a);
}
... prints...
562949953421313
5.6294995E14
562949953421312
I've been trying to find out the reason, but I couldn't.
Can anybody help me?
Look at the following example.
float f = 125.32f;
System.out.println("value of f = " + f);
double d = (double) 125.32f;
System.out.println("value of d = " + d);
This is the output:
value of f = 125.32
value of d = 125.31999969482422
The value of a float does not change when converted to a double. There is a difference in the displayed numerals because more digits are required to distinguish a double value from its neighbors, which is required by the Java documentation. That is the documentation for toString, which is referred (through several links) from the documentation for println.
The exact value for 125.32f is 125.31999969482421875. The two neighboring float values are 125.3199920654296875 and 125.32000732421875. Observe that 125.32 is closer to 125.31999969482421875 than to either of the neighbors. Therefore, by displaying “125.32”, Java has displayed enough digits so that conversion back from the decimal numeral to float reproduces the value of the float passed to println.
The two neighboring double values of 125.31999969482421875 are 125.3199996948242045391452847979962825775146484375 and 125.3199996948242329608547152020037174224853515625.
Observe that 125.32 is closer to the latter neighbor than to the original value (125.31999969482421875). Therefore, printing “125.32” does not contain enough digits to distinguish the original value. Java must print more digits in order to ensure that a conversion from the displayed numeral back to double reproduces the value of the double passed to println.
When you convert a float into a double, there is no loss of information. Every float can be represented exactly as a double.
On the other hand, neither decimal representation printed by System.out.println is the exact value for the number. An exact decimal representation could require up to about 760 decimal digits. Instead, System.out.println prints exactly as many decimal digits as are needed to parse the decimal representation back into the original float or double. There are more doubles, so when printing one, System.out.println needs to print more digits before the representation becomes unambiguous.
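That round-trip property is easy to check. A minimal sketch, with made-up class and method names, using the 125.32f value from the question:

```java
class RoundTripDemo {
    // Parsing the printed form of a float gives back the same float.
    static boolean floatRoundTrips(float f) {
        return Float.parseFloat(Float.toString(f)) == f;
    }

    // Parsing the printed form of a double gives back the same double.
    static boolean doubleRoundTrips(double d) {
        return Double.parseDouble(Double.toString(d)) == d;
    }

    public static void main(String[] args) {
        float f = 125.32f;
        double d = f; // widening, the value is unchanged
        System.out.println(Float.toString(f));  // 125.32
        System.out.println(Double.toString(d)); // 125.31999969482422
        System.out.println(floatRoundTrips(f) && doubleRoundTrips(d)); // true
        System.out.println((float) d == f);     // true
        // "125.32" parsed as a double is a different double than d.
        System.out.println(Double.parseDouble("125.32") == d); // false
    }
}
```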
The conversion from float to double is a widening conversion, as specified by the JLS. A widening conversion is defined as an injective mapping of a smaller set into its superset. Therefore the number being represented does not change after a conversion from float to double.
More information regarding your updated question
In your update you added an example which is supposed to demonstrate that the number has changed. However, it only shows that the string representation of the number has changed, which indeed it has due to the additional precision acquired through the conversion to double. Note that your first output is just a rounding of the second output. As specified by Double.toString,
There must be at least one digit to represent the fractional part, and beyond that as many, but only as many, more digits as are needed to uniquely distinguish the argument value from adjacent values of type double.
Since the adjacent values in the type double are much closer than in float, more digits are needed to comply with that ruling.
The 32bit IEEE-754 floating point number closest to 125.32 is in fact 125.31999969482421875. Pretty close, but not quite there (that's because 0.32 is repeating in binary).
When you cast that to a double, it's the value 125.31999969482421875 that will be made into a double (125.32 is nowhere to be found at this point, the information that it should really end in .32 is completely lost) and of course can be represented exactly by a double. When you print that double, the print routine thinks it has more significant digits than it really has (but of course it can't know that), so it prints to 125.31999969482422, which is the shortest decimal that rounds to that exact double (and of all decimals of that length, it is the closest).
The issue of the precision of floating-point numbers is really language-agnostic, so I'll be using MATLAB in my explanation.
The reason you see a difference is that certain numbers are not exactly representable in fixed number of bits. Take 0.1 for example:
>> format hex
>> double(0.1)
ans =
3fb999999999999a
>> double(single(0.1))
ans =
3fb99999a0000000
So the error in the approximation of 0.1 in single-precision gets bigger when you cast it as double-precision floating-point number. The result is different from its approximation if you started directly in double-precision.
>> double(single(0.1)) - double(0.1)
ans =
1.490116113833651e-09
As already explained, all floats can be exactly represented as a double and the reason for your issue is that System.out.println performs some rounding when displaying the value of a float or double but the rounding methodology is not the same in both cases.
To see the exact value of the float, you can use a BigDecimal:
float f = 125.32f;
System.out.println("value of f = " + new BigDecimal(f));
double d = (double) 125.32f;
System.out.println("value of d = " + new BigDecimal(d));
which outputs:
value of f = 125.31999969482421875
value of d = 125.31999969482421875
It won't work in Java, because by default Java treats real-number literals as double. If we assign a literal to a float variable without the float suffix, that is, without writing it like
123.45f
the literal is taken as a double, and the compiler reports an error about possible loss of precision.
The representation of the values changes due to contracts of the methods that convert numerical values to a String, correspondingly java.lang.Float#toString(float) and java.lang.Double#toString(double), while the actual value remains the same. There is a common part in Javadoc of both aforementioned methods that elaborates requirements to values' String representation:
There must be at least one digit to represent the fractional part, and beyond that as many, but only as many, more digits as are needed to uniquely distinguish the argument value from adjacent values
To illustrate the similarity of significant parts for values of both types, the following snippet can be run:
package com.my.sandbox.numbers;

public class FloatToDoubleConversion {

    public static void main(String[] args) {
        float f = 125.32f;
        floatToBits(f);
        double d = (double) f;
        doubleToBits(d);
    }

    private static void floatToBits(float floatValue) {
        System.out.println();
        System.out.println("Float.");
        System.out.println("String representation of float: " + floatValue);
        int bits = Float.floatToIntBits(floatValue);
        int sign = bits >>> 31;
        // Unbiased exponent: the 8-bit field minus the bias of 127.
        int exponent = (bits >>> 23 & ((1 << 8) - 1)) - ((1 << 7) - 1);
        int mantissa = bits & ((1 << 23) - 1);
        // Integer.toBinaryString avoids sign extension to 64 bits for negative floats.
        System.out.println("Bytes: " + Integer.toBinaryString(bits));
        System.out.println("Sign: " + Integer.toBinaryString(sign));
        System.out.println("Exponent: " + Integer.toBinaryString(exponent));
        System.out.println("Mantissa: " + Integer.toBinaryString(mantissa));
        System.out.println("Back from parts: " + Float.intBitsToFloat((sign << 31) | (exponent + ((1 << 7) - 1)) << 23 | mantissa));
    }

    private static void doubleToBits(double doubleValue) {
        System.out.println();
        System.out.println("Double.");
        System.out.println("String representation of double: " + doubleValue);
        long bits = Double.doubleToLongBits(doubleValue);
        long sign = bits >>> 63;
        // Unbiased exponent: the 11-bit field minus the bias of 1023.
        long exponent = (bits >>> 52 & ((1 << 11) - 1)) - ((1 << 10) - 1);
        long mantissa = bits & ((1L << 52) - 1);
        System.out.println("Bytes: " + Long.toBinaryString(bits));
        System.out.println("Sign: " + Long.toBinaryString(sign));
        System.out.println("Exponent: " + Long.toBinaryString(exponent));
        System.out.println("Mantissa: " + Long.toBinaryString(mantissa));
        System.out.println("Back from parts: " + Double.longBitsToDouble((sign << 63) | (exponent + ((1 << 10) - 1)) << 52 | mantissa));
    }
}
In my environment, the output is:
Float.
String representation of float: 125.32
Bytes: 1000010111110101010001111010111
Sign: 0
Exponent: 110
Mantissa: 11110101010001111010111
Back from parts: 125.32
Double.
String representation of double: 125.31999969482422
Bytes: 100000001011111010101000111101011100000000000000000000000000000
Sign: 0
Exponent: 110
Mantissa: 1111010101000111101011100000000000000000000000000000
Back from parts: 125.31999969482422
This way, you can see that the values' sign and exponent are the same, while the mantissa was extended with trailing zeros and retained its significant part (11110101010001111010111) exactly.
The used extraction logic of floating point number parts: 1 and 2.
Both are what Microsoft refers to as "approximate number data types."
There's a reason. A float has a precision of about 7 decimal digits, and a double about 15. But I have seen it happen many times that 8.0 - 1.0 = 6.999999999. This is because they are not guaranteed to represent a decimal fraction exactly.
If you need absolute, invariable precision, go with a decimal, or integral type.
If I do something like
final float third = 1f / 3f;
System.out.println((third + third + third) == 1.0f);
I get true. Does that mean float can exactly represent 1/3?
I lost a bet a long time ago when I said "No!" Not only that, but the other party claimed "There is no CS reason that float cannot represent fractional values."
Here is a sample program that explores the issue a bit:
package test;
public class FloatTest
{
public static void main(String[] args)
{
new FloatTest().run();
}
public void run()
{
final float third = 1f / 3f;
final float small = third * .01f;
final float realSmall = third * .0001f;
System.out.println("third: " + third);
System.out.println("small: " + small);
System.out.println("real small: " + realSmall);
System.out.println("Three thirds: " + (third + third + third));
System.out.println("Three small: " + (small + small + small));
System.out.println("Three real small: " + (realSmall + realSmall + realSmall));
}
}
The output is
third: 0.33333334
small: 0.0033333334
real small: 3.3333334E-5
Three thirds: 1.0
Three small: 0.01
Three real small: 1.00000005E-4
The strange results have to do with a couple of things. First, float cannot exactly represent 1/3, any more than 1/3 can be written out with a finite number of decimal digits. In the old days, the result would be .99999999, due to rounding.
In modern times, IEEE 754 has specified how float should be represented and arithmetic handled. In particular, when doing arithmetic, an extra three bits are kept and rounding is performed. That's why the first two results come out exactly right and I lost my bet. However, these extra bits don't guarantee accuracy, as the third result shows.
Here's a decent description of float in general, and the section toward the bottom, "on Rounding" (sic), describes the extra bits: http://pages.cs.wisc.edu/~markhill/cs354/Fall2008/notes/flpt.apprec.html
Bottom line, if you want exact fractional values and to control rounding, use BigDecimal.
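To make the "control rounding" part concrete, here is a sketch of 1/3 in BigDecimal. The class and method names are made up, and the scale of 10 and HALF_UP mode are arbitrary choices for the example. Without an explicit scale or MathContext, the division throws because 1/3 has no terminating decimal expansion.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

class BigDecimalThirdDemo {
    // 1/3 rounded to the given number of decimal places.
    static BigDecimal roundedThird(int scale) {
        return BigDecimal.ONE.divide(new BigDecimal(3), scale, RoundingMode.HALF_UP);
    }

    // Dividing without a rounding rule fails for non-terminating quotients.
    static boolean unroundedDivisionThrows() {
        try {
            BigDecimal.ONE.divide(new BigDecimal(3));
            return false;
        } catch (ArithmeticException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        BigDecimal third = roundedThird(10);
        System.out.println(third);                             // 0.3333333333
        System.out.println(third.multiply(new BigDecimal(3))); // 0.9999999999
        System.out.println(unroundedDivisionThrows());         // true
    }
}
```

So BigDecimal does not make 1/3 exact either; it makes the rounding explicit and under your control.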
A float cannot exactly represent every fraction because the float data type is a single-precision 32-bit IEEE 754 floating point and is subject to IEEE rounding rules. Therefore anything requiring more precision than a float's 24-bit significand provides is not representable by a float. There's also the double, which is a double-precision 64-bit IEEE 754 floating point number. Finally, Java has the arbitrary-precision BigDecimal class. However, it cannot represent every fraction perfectly either; consider the golden ratio. BigDecimal will throw an exception if you try to calculate
float cannot represent exactly 1/3. If A/B is a fraction in reduced notation, with A and B both integers, A > 0, and B > 1, the only fractions float can represent are those where B is a power of 2 up to 2^126, and A is less than 2^24; or other cases where B is a power of 2 greater than 2^126, but I won't go into exactly what those are. (I'm not considering B=1; it would take extra work to describe what integers float is capable of representing, and it's not relevant here.)
Thus, float cannot represent 1/3. If you compute f=1/3, the fraction represented by the float is 11184811/33554432 (33554432 = 2^25). If you add f+f, the result is a float that represents the fraction 11184811/16777216 (16777216 = 2^24). If you then add this to f, the exact resulting fraction would be 33554433/33554432. But this fraction cannot be represented exactly in a float, since it breaks the rule in the first paragraph (33554433 > 2^24). Thus, the result must be rounded to something that can be represented in a float, and that something will be 1.
Try this:
final float third = 1f / 3f;
System.out.println(((double)third + (double)third + (double)third) == 1.0);
If third represented exactly 1/3, then surely casting it to a double would also represent exactly 1/3, since a double has more precision than a float, right? But this displays false. In fact, the double on the left side is 1.0000000298023224 if you display it. That is, 33554433/33554432.
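The fraction 11184811/33554432 can be read straight out of the float's bits. A sketch with made-up names, valid for normal (non-subnormal, finite) floats only:

```java
class ThirdAsFraction {
    // 24-bit significand of a normal float, including the implicit leading 1.
    static int significand(float f) {
        int bits = Float.floatToIntBits(f);
        return (bits & 0x7FFFFF) | 0x800000;
    }

    // k such that |f| == significand(f) / 2^k for a normal float.
    static int denominatorPower(float f) {
        int bits = Float.floatToIntBits(f);
        int unbiasedExponent = ((bits >>> 23) & 0xFF) - 127;
        return 23 - unbiasedExponent;
    }

    public static void main(String[] args) {
        float third = 1f / 3f;
        System.out.println(significand(third) + "/2^" + denominatorPower(third)); // 11184811/2^25
        // Exact numerator of third + third + third over 2^25; it needs 26 bits.
        System.out.println(3L * significand(third));       // 33554433
        System.out.println(third + third + third == 1.0f); // true
    }
}
```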
I am having this problem:
float rate= (115/100);
When I do:
System.out.println(rate);
It gives me 1.0
What... is the problem?
115 and 100 are both integers, so the division returns an integer.
Try doing this:
float rate = (115f / 100f);
You're performing integer division (which provides an integer result) and then storing it in a float.
You need to use at least one float in the operation for the result to be the proper type:
float rate = 115f / 100;
float rate= (115/100);
Does the following things:
1) Performs integer division of 115 by 100, which yields the value 1.
2) Casts the result from step 1) to a float, which yields the value 1.0.
What you want is this:
float rate = 115.0f / 100;
Or more generally, you want to convert one of the operands of your division into a float. Whether you do that by casting, as in (float)115/100, or by appending an f suffix to one of the two operands, as in float rate = 115f / 100, is completely up to you and yields the same result. (Appending only a decimal point, as in 115.0/100, makes the result a double, which cannot be assigned to a float without a cast.)
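Putting the variants side by side (a sketch; the class and method names are made up for this example):

```java
class IntegerDivisionDemo {
    // Integer division happens first, then the int result 1 is widened to float.
    static float truncated() {
        return 115 / 100;
    }

    // One float operand is enough to make the whole division floating point.
    static float withSuffix() {
        return 115f / 100;
    }

    // The cast binds to 115 only, so this is also a float division.
    static float withCast() {
        return (float) 115 / 100;
    }

    public static void main(String[] args) {
        System.out.println(truncated());  // 1.0
        System.out.println(withSuffix()); // 1.15
        System.out.println(withCast());   // 1.15
    }
}
```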
In order to perform floating-point arithmetic with integers you need to cast at least one of the operands to a float.
Example:
int a = 115;
int b = 100;
float rate = ((float)a)/b;
use float rate= (float)(115.0/100); instead
It is enough to put float rate = 115f / 100;
The problem you have is that your dividend and divisor are declared as integer type.
In mathematics, dividing one integer by another gives a quotient and a remainder; Java's integer division keeps only the quotient and discards the remainder. That quotient is what you assign to your rate variable.
So to get the result you expected, a quotient with a fractional part, your dividend or divisor must be declared in a type that can hold fractions.
The two primitive types that can do so are float (single-precision floating point) and double (double-precision floating point).
By default, all whole numbers (integer literals, for purists) written in Java code have type int. To change that, you need to tell the compiler that the literal should be represented in a different type, which you do by appending a suffix to the integer literal.
Literal suffixes for floating-point types:
float - f or F, for example 110f
double - d or D, for example 110D
Note that when you want to use double, you can also declare it by adding a decimal separator to the literal:
double d = 2.;
or
double d = 2.0;
I encourage you to use double instead of float. The double type is more suitable for most modern applications. Using float may cause unexpected results, because accuracy problems have a bigger impact on the result in single-precision calculations. A good read on this is “What Every Computer Scientist Should Know About Floating-Point Arithmetic”.
In addition, on current CPU architectures both float and double have much the same performance characteristics, so there is no need to sacrifice accuracy.
A final note about floating-point types is that neither of them should be used when writing a financial application. To get valid results there, you should always use BigDecimal.
I have an exam question I am revising for and the question is for 4 marks.
"In java we can assign a int to a double or a float". Will this ever lose information and why?
I have put that ints are normally of fixed length or size, so the precision for storing data is finite, whereas storing information in floating point can be infinite, and essentially we lose information because of this.
Now I am a little sketchy as to whether or not I am hitting the right areas here. I am very sure it will lose precision but I can't exactly put my finger on why. Can I get some help, please?
In Java Integer uses 32 bits to represent its value.
In Java a FLOAT uses a 23 bit mantissa, so integers greater than 2^23 will have their least significant bits truncated. For example 33554435 (or 0x2000003) will be rounded to 33554436, the nearest multiple of 4.
In Java a DOUBLE uses a 52 bit mantissa, so will be able to represent a 32bit integer without loss of data.
See also "Floating Point" on wikipedia
It's not necessary to know the internal layout of floating-point numbers. All you need is the pigeonhole principle and the knowledge that int and float are the same size.
int is a 32-bit type, for which every bit pattern represents a distinct integer, so there are 2^32 int values.
float is a 32-bit type, so it has at most 2^32 distinct values.
Some floats represent non-integers, so there are fewer than 2^32 float values that represent integers.
Therefore, different int values will be converted to the same float (=loss of precision).
Similar reasoning can be used with long and double.
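The pigeonhole argument can be watched in action just above 2^24, where neighbouring floats are 2 apart. A sketch with made-up names:

```java
import java.util.HashSet;
import java.util.Set;

class PigeonholeDemo {
    // Counts how many distinct float values the ints in [start, start + count) map to.
    static int distinctFloats(int start, int count) {
        Set<Float> seen = new HashSet<>();
        for (int i = 0; i < count; i++) {
            seen.add((float) (start + i));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        int start = 1 << 24; // 16777216
        System.out.println((float) start == (float) (start + 1)); // true
        System.out.println(distinctFloats(start, 16));            // 9
        System.out.println(distinctFloats(0, 16));                // 16
    }
}
```

Sixteen consecutive ints land on only nine floats, so some of them must share a value.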
Here's what JLS has to say about the matter (in a non-technical discussion).
JLS 5.1.2 Widening primitive conversion
The following 19 specific conversions on primitive types are called the widening primitive conversions:
int to long, float, or double
(rest omitted)
Conversion of an int or a long value to float, or of a long value to double, may result in loss of precision -- that is, the result may lose some of the least significant bits of the value. In this case, the resulting floating-point value will be a correctly rounded version of the integer value, using IEEE 754 round-to-nearest mode.
Despite the fact that loss of precision may occur, widening conversions among primitive types never result in a run-time exception.
Here is an example of a widening conversion that loses precision:
class Test {
public static void main(String[] args) {
int big = 1234567890;
float approx = big;
System.out.println(big - (int)approx);
}
}
which prints:
-46
thus indicating that information was lost during the conversion from type int to type float because values of type float are not precise to nine significant digits.
No, float and double are fixed-length too - they just use their bits differently. Read more about how exactly they work in the Floating-Point Guide.
Basically, you cannot lose precision when assigning an int to a double, because double has 53 bits of precision (52 stored plus one implicit), which is enough to hold all int values. But float has only 24 bits of precision (23 stored plus one implicit), so it cannot exactly represent all int values whose magnitude is larger than 2^24.
Your intuition is correct, you MAY lose precision when converting int to float. However, it is not as simple as presented in most other answers.
In Java a FLOAT uses a 23 bit mantissa, so integers greater than 2^23 will have their least significant bits truncated. (from a post on this page)
Not true.
Example: here is an integer that is greater than 2^23 that converts to a float with no loss:
int i = 33_554_430 * 64; // is greater than 2^23 (and also greater than 2^24); i = 2_147_483_520
float f = i;
System.out.println("result: " + (i - (int) f)); // Prints: result: 0
System.out.println("with i:" + i + ", f:" + f); // Prints: with i:2147483520, f:2.14748352E9
Therefore, it is not true that integers greater than 2^23 will have their least significant bits truncated.
The best explanation I found is here:
A float in Java is 32-bit and is represented by:
sign * mantissa * 2^exponent
sign * (0 to 33_554_431) * 2^(-125 to +127)
Source: http://www.ibm.com/developerworks/java/library/j-math2/index.html
Why is this an issue?
It leaves the impression that you can determine whether there is a loss of precision from int to float just by looking at how large the int is.
I have especially seen Java exam questions where one is asked whether a large int would convert to a float with no loss.
Also, sometimes people tend to think that there will be loss of precision from int to float:
when an int is larger than: 1_234_567_890 not true (see counter-example above)
when an int is larger than: 2 exponent 23 (equals: 8_388_608) not true
when an int is larger than: 2 exponent 24 (equals: 16_777_216) not true
Conclusion
Conversions from sufficiently large ints to floats MAY lose precision.
It is not possible to determine whether there will be loss just by looking at how large the int is (i.e. without trying to go deeper into the actual float representation).
Possibly the clearest explanation I've seen:
http://www.ibm.com/developerworks/java/library/j-math2/index.html
the ULP or unit of least precision defines the precision available between any two float values. As these values increase the available precision decreases.
For example: between 1.0 and 2.0 inclusive there are 8,388,609 floats, between 1,000,000 and 1,000,001 there are 17. At 10,000,000 the ULP is 1.0, so above this value you soon have multiple integral values mapping to each available float, hence the loss of precision.
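Java exposes the ULP directly through Math.ulp, so these figures can be checked. A short sketch (the class and method names are made up):

```java
class UlpDemo {
    // Gap between x and the next larger float in magnitude.
    static float ulpAt(float x) {
        return Math.ulp(x);
    }

    public static void main(String[] args) {
        System.out.println(ulpAt(1.0f));           // 1.1920929E-7
        System.out.println(ulpAt(1_000_000f));     // 0.0625
        System.out.println(ulpAt(10_000_000f));    // 1.0
        System.out.println(ulpAt(1_000_000_000f)); // 64.0
    }
}
```

A ULP of 0.0625 at one million is what gives the 17 floats between 1,000,000 and 1,000,001 inclusive.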
There are two reasons that assigning an int to a double or a float might lose precision:
There are certain numbers that just can't be represented as a double/float, so they end up approximated
Large integer numbers may contain too much precision in the least-significant digits
For these examples, I'm using Java.
Use a function like this to check for loss of precision when casting from int to float
static boolean checkPrecisionLossToFloat(int val)
{
if(val < 0)
{
val = -val;
}
// 8 is the bit-width of the exponent for single-precision
return Integer.numberOfLeadingZeros(val) + Integer.numberOfTrailingZeros(val) < 8;
}
Use a function like this to check for loss of precision when casting from long to double
static boolean checkPrecisionLossToDouble(long val)
{
if(val < 0)
{
val = -val;
}
// 11 is the bit-width for the exponent in double-precision
return Long.numberOfLeadingZeros(val) + Long.numberOfTrailingZeros(val) < 11;
}
Use a function like this to check for loss of precision when casting from long to float
static boolean checkPrecisionLossToFloat(long val)
{
if(val < 0)
{
val = -val;
}
// 8 + 32
return Long.numberOfLeadingZeros(val) + Long.numberOfTrailingZeros(val) < 40;
}
For each of these functions, returning true means that casting that integral value to the floating point value will result in a loss of precision.
Casting to float will lose precision if the integral value has more than 24 significant bits.
Casting to double will lose precision if the integral value has more than 53 significant bits.
You can assign an int to a double without losing precision.
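One way to gain confidence in the bit-counting test is to compare it with a plain round trip. This sketch repeats the int-to-float check under a different, made-up name. The round trip widens to long because casting a float back to int saturates at Integer.MAX_VALUE and would hide the loss there.

```java
class PrecisionLossCrossCheck {
    // Bit-count test: more than 24 significant bits means the float cannot hold the value.
    static boolean bitCountSaysLoss(int val) {
        if (val < 0) {
            val = -val;
        }
        return Integer.numberOfLeadingZeros(val) + Integer.numberOfTrailingZeros(val) < 8;
    }

    // Round-trip test: convert to float and back, comparing as long to avoid int saturation.
    static boolean roundTripSaysLoss(int val) {
        return (long) (float) val != (long) val;
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, 16_777_216, 16_777_217, 1_234_567_890,
                         2_147_483_520, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int val : samples) {
            System.out.println(val + ": bitCount=" + bitCountSaysLoss(val)
                    + ", roundTrip=" + roundTripSaysLoss(val));
        }
    }
}
```

For these sample values the two tests agree; this is a spot check, not a proof.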