How to compare against Double.MAX_VALUE - Java

I want to check if a double has Double.MAX_VALUE.
Is this the right way (version 1):
boolean hasMaxVal(double val) {
    return val == Double.MAX_VALUE;
}
or do I need to do something like this (version 2):
boolean hasMaxVal(double val) {
    return Math.abs(val - Double.MAX_VALUE) < 0.00001;
}

Java's double type is a double-precision IEEE 754 floating-point number. This means there are 53 bits of precision in the mantissa, and hence the precision of the number is limited to about 16 significant figures in a decimal format.
Double.MAX_VALUE is approximately 1.798×10^308, so the 16th significant figure has a magnitude on the order of 10^(308-16) = 10^292. We can confirm this using the Math.ulp method, which returns a double value's "unit of least precision":
> Double.MAX_VALUE
1.7976931348623157E308
> Math.ulp(Double.MAX_VALUE)
1.9958403095347198E292
This means if you do want to test for a value "close to" Double.MAX_VALUE, it only makes sense to do so within an epsilon of at least 2E292. Your epsilon of 0.00001 is far too small for there to be any values within that range other than Double.MAX_VALUE itself, so your test is equivalent to val == Double.MAX_VALUE.
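To make this concrete, here is a small sketch (the class name is mine) probing the neighbourhood of Double.MAX_VALUE with Math.ulp and Math.nextDown:

```java
public class MaxValueUlp {
    public static void main(String[] args) {
        double max = Double.MAX_VALUE;

        // The gap between MAX_VALUE and its nearest neighbour is about 2e292.
        System.out.println(Math.ulp(max)); // 1.9958403095347198E292

        // Largest double strictly below MAX_VALUE.
        double below = Math.nextDown(max);

        // Version 2's epsilon admits no double other than MAX_VALUE itself:
        System.out.println(Math.abs(below - max) < 0.00001); // false
        System.out.println(max == Double.MAX_VALUE);         // true
    }
}
```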

Related

Is storing currency in Java double (floating point), without any math, always accurate?

Of course no math should be done, because the outcome will not be accurate. Floating-point values are not suitable for this.
But what about just storing values? Personally, I'd go for String or Long, but it looks like I might sometimes be forced to interact with systems that insist on floating point types.
It looks like values from 0.00 to 2.00 are 100% accurate - see code below. But is this so? And why? Shouldn't there be problems already when I simply do double v = 0.01?
public static void main(final String[] args) {
    final DecimalFormat df = new DecimalFormat("0.0000000000000000000000000", DecimalFormatSymbols.getInstance(Locale.US));
    final BigDecimal aHundred = new BigDecimal("100");
    final BigDecimal oneHundredth = BigDecimal.ONE.divide(aHundred);
    for (int i = 0; i < 200; i++) {
        BigDecimal dec = oneHundredth;
        for (int ii = 0; ii < i; ii++) {
            dec = dec.add(oneHundredth);
        }
        final double v = dec.doubleValue();
        System.err.println(v);
        System.err.println(df.format(v));
    }
    System.exit(0);
}
Output:
0.01
0.0100000000000000000000000
0.02
0.0200000000000000000000000
0.03
0.0300000000000000000000000
...
1.38
1.3800000000000000000000000
1.39
1.3900000000000000000000000
1.4
1.4000000000000000000000000
1.41
1.4100000000000000000000000
...
1.99
1.9900000000000000000000000
2.0
2.0000000000000000000000000
Converting from decimal to binary-based floating-point or vice-versa is math. It is an operation that rounds the result to the nearest representable value.
When you convert .01 to double, the result is exactly 0.01000000000000000020816681711721685132943093776702880859375. Java’s default formatting for displaying this may show it as “0.01”, but the actual value is 0.01000000000000000020816681711721685132943093776702880859375.
The precision of Java’s double format is such that if any decimal numeral with at most 15 significant decimal digits is rounded to the nearest representable double and then that double is rounded to the nearest decimal numeral with 15 significant digits or fewer, the result will be the original number.
Therefore, you can use a double to store any decimal numeral with at most 15 significant digits (within the exponent range) and can recover the original numeral by converting it back to decimal. Beyond 15 digits, some numbers will be changed by the round trip.
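A sketch of that round-trip property (the specific numerals and class name are just illustrations):

```java
public class RoundTrip {
    public static void main(String[] args) {
        // 15 significant digits: the decimal -> double -> decimal trip is lossless,
        // so parsing the printed form gives back the identical double.
        double d = Double.parseDouble("0.123456789012345");
        System.out.println(d == Double.parseDouble(Double.toString(d))); // true

        // 17 significant digits: two distinct decimal numerals can collapse
        // onto the same double, so the round trip cannot tell them apart.
        double a = Double.parseDouble("0.1");
        double b = Double.parseDouble("0.10000000000000001");
        System.out.println(a == b); // true
    }
}
```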
Is storing currency in Java double (floating point), without any math, always accurate?
If you represent the currency values as a multiple of the smallest unit of currency (for instance, cents), then you have effectively 53 bits of precision to work with ... which works out at 9.0 × 10^15 cents, or 9.0 × 10^13 dollars.
(For scale, the US national debt is currently around 2.8 × 10^13 dollars.)
And if you try to represent currency values in (say) floating point dollars (using double), then most cent values simply cannot be represented precisely. Only multiples of 25 cents have a precise representation in binary floating point.
In short, it is potentially imprecise even if you are not performing arithmetic on the values.
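The `new BigDecimal(double)` constructor exposes the exact binary value a double actually stores, which makes the quarter-dollar point easy to check (a sketch; the class name is mine):

```java
import java.math.BigDecimal;

public class StoredCents {
    public static void main(String[] args) {
        // 0.25 dollars is 1/4, a binary fraction: stored exactly.
        System.out.println(new BigDecimal(0.25)); // 0.25

        // 0.10 dollars is 1/10, not a binary fraction: stored only approximately.
        System.out.println(new BigDecimal(0.10)); // 0.1000000000000000055511...
    }
}
```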

Why `2.0 - 1.1` and `2.0F - 1.1F` produce different results?

I am working on a code where I am comparing Double and float values:
class Demo {
    public static void main(String[] args) {
        System.out.println(2.0 - 1.1);           // 0.8999999999999999
        System.out.println(2.0 - 1.1 == 0.9);    // false
        System.out.println(2.0F - 1.1F);         // 0.9
        System.out.println(2.0F - 1.1F == 0.9F); // true
        System.out.println(2.0F - 1.1F == 0.9);  // false
    }
}
Output is given below:
0.8999999999999999
false
0.9
true
false
I believe the Double value can save more precision than the float.
Please explain this; it looks like the float value doesn't lose precision but the double one does?
Edit:
@goodvibration I'm aware that 0.9 cannot be stored exactly in any computer language; I'm just confused how Java works with this in detail: why 2.0F - 1.1F == 0.9F but 2.0 - 1.1 != 0.9. Another interesting find may help:
class Demo {
    public static void main(String[] args) {
        System.out.println(2.0 - 0.9);           // 1.1
        System.out.println(2.0 - 0.9 == 1.1);    // true
        System.out.println(2.0F - 0.9F);         // 1.1
        System.out.println(2.0F - 0.9F == 1.1F); // true
        System.out.println(2.0F - 0.9F == 1.1);  // false
    }
}
I know I can't count on float or double precision, but I can't figure this out and it's driving me crazy. What's the real deal behind this? Why does 2.0 - 0.9 == 1.1 but 2.0 - 1.1 != 0.9?
The difference between float and double:
IEEE 754 single-precision binary floating-point format
IEEE 754 double-precision binary floating-point format
Let's run your numbers in a simple C program, in order to get their binary representations:
#include <stdio.h>

typedef union {
    float val;
    struct {
        unsigned int fraction : 23;
        unsigned int exponent : 8;
        unsigned int sign     : 1;
    } bits;
} F;

typedef union {
    double val;
    struct {
        unsigned long long fraction : 52;
        unsigned long long exponent : 11;
        unsigned long long sign     : 1;
    } bits;
} D;

int main() {
    F f = {(float)(2.0 - 1.1)};
    D d = {(double)(2.0 - 1.1)};
    printf("%d %d %d\n", f.bits.sign, f.bits.exponent, f.bits.fraction);
    printf("%lld %lld %lld\n", d.bits.sign, d.bits.exponent, d.bits.fraction);
    return 0;
}
The printout of this code is:
0 126 6710886
0 1022 3602879701896396
Based on the two format specifications above, let's convert these numbers to rational values.
In order to achieve high accuracy, let's do this in a simple Python program:
from decimal import Decimal
from decimal import getcontext

getcontext().prec = 100

TWO = Decimal(2)

def convert(sign, exponent, fraction, e_len, f_len):
    return (-1) ** sign * TWO ** (exponent - 2 ** (e_len - 1) + 1) * (1 + fraction / TWO ** f_len)

def toFloat(sign, exponent, fraction):
    return convert(sign, exponent, fraction, 8, 23)

def toDouble(sign, exponent, fraction):
    return convert(sign, exponent, fraction, 11, 52)

f = toFloat(0, 126, 6710886)
d = toDouble(0, 1022, 3602879701896396)

print('{:.40f}'.format(f))
print('{:.40f}'.format(d))
The printout of this code is:
0.8999999761581420898437500000000000000000
0.8999999999999999111821580299874767661094
If we print these two values while specifying between 8 and 15 decimal digits, then we shall see the same thing that you have observed (the double value prints as 0.9, while the float value prints as something close to 0.9):
In other words, this code:
for n in range(8, 15 + 1):
    string = '{:.' + str(n) + 'f}'
    print(string.format(f))
    print(string.format(d))
Gives this printout:
0.89999998
0.90000000
0.899999976
0.900000000
0.8999999762
0.9000000000
0.89999997616
0.90000000000
0.899999976158
0.900000000000
0.8999999761581
0.9000000000000
0.89999997615814
0.90000000000000
0.899999976158142
0.900000000000000
Our conclusion is therefore that Java prints decimals with a precision of between 8 and 15 digits by default.
Nice question BTW...
Pop quiz: Represent 1/3rd, in decimal.
Answer: You can't; not precisely.
Computers count in binary, and in binary there are many more numbers that 'cannot be completely represented'. Just as with the decimal question: if you have only a small piece of paper to write on, you may simply go with 0.3333333 and call it a day, leaving you with a number that is quite close to, but not exactly the same as, 1/3. That is how computers represent fractions.
Or, think about it this way: a float occupies 32-bits; a double occupies 64. There are only 2^32 (about 4 billion) different numbers that a 32-bit value can represent. And yet, even between 0 and 1 there are an infinite amount of numbers. So, given that there are at most 2^32 specific, concrete numbers that are 'representable precisely' as a float, any number that isn't in that blessed set of about 4 billion values, is not representable. Instead of just erroring out, you simply get the one in this pool of 4 billion values that IS representable, and is the closest number to the one you wanted.
In addition, because computers count in binary and not decimal, your sense of what is 'representable' and what isn't, is off. You may think that 1/3 is a big problem, but surely 1/10 is easy, right? That's simply 0.1 and that is a precise representation. Ah, but, a tenth works well in decimal. After all, decimal is based around the number 10, no surprise there. But in binary? a half, a fourth, an eighth, a sixteenth: Easy in binary. A tenth? That is as difficult as a third: NOT REPRESENTABLE.
0.9 is, itself, not a representable number. And yet, when you printed your float, that's what you got.
The reason is, printing floats/doubles is an art, more than a science. Given that only a few numbers are representable, and given that these numbers don't feel 'natural' to humans due to the binary v. decimal thing, you really need to add a 'rounding' strategy to the number or it'll look crazy (nobody wants to read 0.899999999999999999765). And that is precisely what System.out.println and co do.
But you really should take control of the rounding: never use System.out.println to print doubles and floats. Use System.out.printf("%.6f", yourDouble); instead, and in this case both would print 0.900000. Neither type can actually represent 0.9 precisely; with default printing, the float result (take the float closest to 2.0, which is 2.0 itself, and the float closest to 1.1, which is not exactly 1.1, subtract them, and round to the nearest float) prints as 0.9 even though it isn't 0.9, while the double result does not print as 0.9.
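For example (a minimal sketch, class name mine), fixing the number of digits makes both subtractions print the same rounded text:

```java
import java.util.Locale;

public class PrintRounding {
    public static void main(String[] args) {
        System.out.println(2.0 - 1.1); // 0.8999999999999999 (default shortest-unique form)

        // Explicit rounding to six decimal places hides the representation error:
        System.out.printf(Locale.US, "%.6f%n", 2.0 - 1.1);              // 0.900000
        System.out.printf(Locale.US, "%.6f%n", (double) (2.0f - 1.1f)); // 0.900000
    }
}
```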

Bijection between Java float and integer keeping order

Both int and float in Java are 32-bit values. Is it possible to program a pair of functions
int toInt(float f);
float toFloat(int n);
such that if f1 and f2 are arbitrary non-NaN float values and i1 and i2 are arbitrary int values:
f1 < f2 if and only if toInt(f1) < toInt(f2)
f1 > f2 if and only if toInt(f1) > toInt(f2)
f1 == f2 if and only if toInt(f1) == toInt(f2)
toInt(toFloat(i1)) == i1
toFloat(toInt(f1)) == f1
Edit: I have edited the question to exclude NaN values for float, thanks to the answers clarifying what happens with those.
Yes. IEEE floats and doubles are arranged in such a way that you can compare them by doing an unsigned comparison of the raw binary representation. The functions to convert from float to raw integer and back are java.lang.Float.floatToIntBits and java.lang.Float.intBitsToFloat. These functions are processor intrinsics, so they have an extremely low cost.
The same is true for longs and doubles. Here the conversion functions are java.lang.Double.doubleToLongBits and java.lang.Double.longBitsToDouble.
Note that if you want to use the normal signed comparison for your integers, you have to do some additional transformation in addition to the conversion to integer.
The only exception to this rule is NaN, which does not permit a total ordering anyway.
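One common form of that extra transformation (a sketch; the helper name is mine): flip the lower 31 bits of negative floats so that ordinary signed int comparison matches the float ordering.

```java
public class SortableBits {
    // Maps a float to an int whose signed order matches the float order
    // (floatToIntBits canonicalizes NaN, which sorts above everything else).
    static int toSortableInt(float f) {
        int bits = Float.floatToIntBits(f);
        // For negative floats (sign bit set) invert the remaining 31 bits,
        // which reverses their back-to-front ordering; positives pass through.
        return bits ^ ((bits >> 31) & 0x7FFFFFFF);
    }

    public static void main(String[] args) {
        float[] ascending = {Float.NEGATIVE_INFINITY, -2.0f, -0.0f, 0.0f, 1.5f, Float.MAX_VALUE};
        for (int i = 1; i < ascending.length; i++) {
            System.out.println(toSortableInt(ascending[i - 1]) < toSortableInt(ascending[i])); // true
        }
    }
}
```

Note that -0.0f and 0.0f, which compare equal as floats, map to distinct ints here, so this yields a total order (like Float.compare) rather than the == relation.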
You can use
int n = Float.floatToRawIntBits(f);
float f2 = Float.intBitsToFloat(n);
int n2 = Float.floatToRawIntBits(f2);
assert n == n2; // always
assert f == f2 || Float.isNaN(f);
The raw bits as an int have the same sort order as the original float, with the exception of NaN values, which are not comparable as float values but do have values as ints.
Note: there are multiple bit patterns for NaN, and they are not equal to each other as floats.
No, you cannot.
There are 2^32 possible int values, all of which are distinct.
However, there are fewer than 2^32 distinct floats; e.g. the bit patterns 0x7F800001 to 0x7FFFFFFF all represent NaNs.
Therefore, you have more ints than floats and cannot distinctly map them to each other, as toFloat(i1) would not be capable of producing a distinct float for every int.
I see what you're saying. At first I had a different interpretation of your question. As everyone else has mentioned: yes. Use the articles described here and here to understand why we should use the methods described by @Peter Lawrey in order to compare the underlying bit patterns of ints and floats.
The answer from Rüdiger Klaehn gives the normal case, but it lacks some details. The bijection exists only in the domain of nice and clean floats.
Notice: the representation of an IEEE float is sign_bit (1 bit), exponent (8 bits), significand (23 bits), and the value is (-1)^sign * 2^exp * significand in clean cases. In fact, the 23 bits represent the fractional part of the actual significand, the integer part being 1.
All is fine for 0 < exp < 255 (which corresponds to normal, nonzero floats) as an unsigned byte, and in that domain you have a bijection.
For exp == 255 you have the infinite values if significand == 0, and all the NaNs for significand != 0 - ok, you explicitly excluded them.
But for exp == 0 there are still weird things: when significand == 0 you have +0 and -0. In Java these compare equal with ==, but as integer bit patterns they are of course different.
And when exp == 0 and significand != 0 you find denormalized (subnormal) numbers: distinct values smaller in magnitude than any normal float, which some hardware flushes to zero or to the smallest nonzero normal number.
So if you want a bijection, only use normal numbers having 0 < exp < 255 and avoid NaN, infinities, 0 and denormal numbers, where things are weird.
References :
IEEE floating point
Single-precision floating-point format
Denormal number
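On the ±0 question above: in Java the two zeros compare equal with ==, but their bit patterns (and hence any bits-based mapping) differ, and Float.compare distinguishes them. A quick check (class name mine):

```java
public class SignedZero {
    public static void main(String[] args) {
        System.out.println(0.0f == -0.0f);               // true: == treats them as equal
        System.out.println(Float.floatToIntBits(0.0f));  // 0
        System.out.println(Float.floatToIntBits(-0.0f)); // -2147483648 (0x80000000)
        System.out.println(Float.compare(-0.0f, 0.0f));  // negative: compare orders -0.0 below +0.0
    }
}
```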
f1 == f2
is impossible, see this answer for more info. You will need to include a delta if you actually want to APPROXIMATE your equality-check.

Double vs. BigDecimal?

I have to calculate some floating point variables and my colleague suggested I use BigDecimal instead of double since it will be more precise. But I want to know what it is and how to make the most out of BigDecimal?
A BigDecimal is an exact way of representing numbers. A Double has a certain precision. Working with doubles of various magnitudes (say d1=1000.0 and d2=0.001) could result in the 0.001 being dropped altogether when summing as the difference in magnitude is so large. With BigDecimal this would not happen.
The disadvantage of BigDecimal is that it's slower, and it's a bit more difficult to program algorithms that way (due to + - * and / not being overloaded).
If you are dealing with money, or precision is a must, use BigDecimal. Otherwise Doubles tend to be good enough.
I do recommend reading the javadoc of BigDecimal as they do explain things better than I do here :)
My English is not good so I'll just write a simple example here.
double a = 0.02;
double b = 0.03;
double c = b - a;
System.out.println(c);
BigDecimal _a = new BigDecimal("0.02");
BigDecimal _b = new BigDecimal("0.03");
BigDecimal _c = _b.subtract(_a);
System.out.println(_c);
Program output:
0.009999999999999998
0.01
Does anyone still want to use double? ;)
There are two main differences from double:
Arbitrary precision: similarly to BigInteger, they can contain numbers of arbitrary precision and size (whereas a double has a fixed number of bits)
Base 10 instead of base 2: a BigDecimal is n*10^-scale, where n is an arbitrarily large signed integer and scale can be thought of as the number of digits to move the decimal point left or right
It is still not true to say that BigDecimal can represent any number. But two reasons you should use BigDecimal for monetary calculations are:
It can represent all numbers that can be represented in decimal notation, which includes virtually all numbers in the monetary world (you never transfer $1/3 to someone).
The precision can be controlled to avoid accumulated errors. With a double, as the magnitude of the value increases, its precision decreases and this can introduce significant error into the result.
If you write down a fractional value like 1 / 7 as decimal value you get
1/7 = 0.142857142857142857142857142857142857142857...
with an infinite repetition of the digits 142857. Since you can only write a finite number of digits you will inevitably introduce a rounding (or truncation) error.
Numbers like 1/10 or 1/100 expressed as binary numbers with a fractional part also have an infinite number of digits after the decimal point:
1/10 = binary 0.0001100110011001100110011001100110...
Doubles store values as binary and therefore might introduce an error solely by converting a decimal number to a binary number, without even doing any arithmetic.
Decimal numbers (like BigDecimal), on the other hand, store each decimal digit as is (binary coded, but each decimal on its own). This means that a decimal type is not more precise than a binary floating point or fixed point type in a general sense (i.e. it cannot store 1/7 without loss of precision), but it is more accurate for numbers that have a finite number of decimal digits as is often the case for money calculations.
Java's BigDecimal has the additional advantage that it can have an arbitrary (but finite) number of digits on both sides of the decimal point, limited only by the available memory.
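A sketch contrasting the two failure modes: finite decimals are exact in BigDecimal, while a number like 1/7 forces BigDecimal to round just as double does (the class name and the choice of MathContext.DECIMAL64 precision are mine):

```java
import java.math.BigDecimal;
import java.math.MathContext;

public class DecimalLimits {
    public static void main(String[] args) {
        // 0.1 has a finite decimal expansion: BigDecimal sums it exactly.
        BigDecimal tenth = new BigDecimal("0.1");
        System.out.println(tenth.add(tenth).add(tenth)); // 0.3, exactly

        // 1/7 has no finite decimal expansion: BigDecimal must round too,
        // so multiplying the rounded quotient back by 7 does not return 1.
        BigDecimal seventh = BigDecimal.ONE.divide(new BigDecimal(7), MathContext.DECIMAL64);
        System.out.println(seventh.multiply(new BigDecimal(7)).compareTo(BigDecimal.ONE) == 0); // false
    }
}
```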
If you are dealing with financial calculations, there are laws on how you should calculate and what precision you should use. If you fail to meet them, you will be doing something illegal.
The only real reason is that the bit representation of decimal values is not precise. As Basil simply put it, an example is the best explanation. Just to complement his example, here's what happens:
static void theDoubleProblem1() {
    double d1 = 0.3;
    double d2 = 0.2;
    System.out.println("Double:\t 0,3 - 0,2 = " + (d1 - d2));

    float f1 = 0.3f;
    float f2 = 0.2f;
    System.out.println("Float:\t 0,3 - 0,2 = " + (f1 - f2));

    BigDecimal bd1 = new BigDecimal("0.3");
    BigDecimal bd2 = new BigDecimal("0.2");
    System.out.println("BigDec:\t 0,3 - 0,2 = " + (bd1.subtract(bd2)));
}
Output:
Double: 0,3 - 0,2 = 0.09999999999999998
Float: 0,3 - 0,2 = 0.10000001
BigDec: 0,3 - 0,2 = 0.1
Also we have that:
static void theDoubleProblem2() {
    double d1 = 10;
    double d2 = 3;
    System.out.println("Double:\t 10 / 3 = " + (d1 / d2));

    float f1 = 10f;
    float f2 = 3f;
    System.out.println("Float:\t 10 / 3 = " + (f1 / f2));

    // Exception!
    BigDecimal bd3 = new BigDecimal("10");
    BigDecimal bd4 = new BigDecimal("3");
    System.out.println("BigDec:\t 10 / 3 = " + (bd3.divide(bd4)));
}
Gives us the output:
Double: 10 / 3 = 3.3333333333333335
Float: 10 / 3 = 3.3333333
Exception in thread "main" java.lang.ArithmeticException: Non-terminating decimal expansion; no exact representable decimal result.
But:
static void theDoubleProblem2() {
    BigDecimal bd3 = new BigDecimal("10");
    BigDecimal bd4 = new BigDecimal("3");
    System.out.println("BigDec:\t 10 / 3 = " + (bd3.divide(bd4, 4, BigDecimal.ROUND_HALF_UP)));
}
Has the output:
BigDec: 10 / 3 = 3.3333
BigDecimal is Oracle's arbitrary-precision numerical library. BigDecimal is part of the Java language and is useful for a variety of applications ranging from the financial to the scientific (that's sort of where I am).
There's nothing wrong with using doubles for certain calculations. Suppose, however, you wanted to calculate Math.PI * Math.PI / 6, that is, the value of the Riemann zeta function for a real argument of two (a project I'm currently working on). Floating-point division presents you with a painful problem of rounding error.
BigDecimal, on the other hand, includes many options for calculating expressions to arbitrary precision. The add, multiply, and divide methods as described in the Oracle documentation below "take the place" of +, *, and / in BigDecimal Java World:
http://docs.oracle.com/javase/7/docs/api/java/math/BigDecimal.html
The compareTo method is especially useful in while and for loops.
Be careful, however, in your use of constructors for BigDecimal. The string constructor is very useful in many cases. For instance, the code
BigDecimal onethird = new BigDecimal("0.33333333333");
utilizes a string representation of 1/3 to represent that infinitely-repeating number to a specified degree of accuracy. The round-off error is most likely somewhere so deep inside the JVM that the round-off errors won't disturb most of your practical calculations. I have, from personal experience, seen round-off creep up, however. The setScale method is important in these regards, as can be seen from the Oracle documentation.
If you need to use division in your arithmetic, you need to use double instead of BigDecimal. Division (the divide(BigDecimal) method) in BigDecimal is pretty useless, as BigDecimal can't handle repeating decimal rational numbers (divisions whose quotient has a non-terminating decimal expansion) and will throw java.lang.ArithmeticException: Non-terminating decimal expansion; no exact representable decimal result.
Just try BigDecimal.ONE.divide(new BigDecimal("3"));
Double, on the other hand, will handle division fine (with the understood precision which is roughly 15 significant digits)

Loss of precision - int -> float or double

I have an exam question I am revising for and the question is for 4 marks.
"In java we can assign a int to a double or a float". Will this ever lose information and why?
I have put that because ints are normally of fixed length or size - the precision for storing data is finite, whereas the information a floating-point value tries to represent can be effectively infinite; essentially we lose information because of this.
Now I am a little sketchy as to whether or not I am hitting the right areas here. I am very sure it will lose precision, but I can't exactly put my finger on why. Can I get some help, please?
In Java Integer uses 32 bits to represent its value.
In Java a FLOAT uses a 23 bit mantissa, so integers greater than 2^23 will have their least significant bits truncated. For example 33554435 (or 0x2000003) will be rounded to 33554436.
In Java a DOUBLE uses a 52 bit mantissa, so it will be able to represent a 32-bit integer without loss of data.
See also "Floating Point" on wikipedia
It's not necessary to know the internal layout of floating-point numbers. All you need is the pigeonhole principle and the knowledge that int and float are the same size.
int is a 32-bit type, for which every bit pattern represents a distinct integer, so there are 2^32 int values.
float is a 32-bit type, so it has at most 2^32 distinct values.
Some floats represent non-integers, so there are fewer than 2^32 float values that represent integers.
Therefore, different int values will be converted to the same float (=loss of precision).
Similar reasoning can be used with long and double.
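A pigeonhole collision is easy to exhibit (a sketch; the class name and the particular pair of ints are mine):

```java
public class Pigeonhole {
    public static void main(String[] args) {
        // Two distinct ints that round to the same float (2^31 = 2.14748365E9):
        float a = 2147483646;       // Integer.MAX_VALUE - 1
        float b = 2147483647;       // Integer.MAX_VALUE
        System.out.println(a == b); // true: the int distinction is gone

        // double, with its wider mantissa, keeps every int distinct:
        double c = 2147483646;
        double d = 2147483647;
        System.out.println(c == d); // false
    }
}
```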
Here's what the JLS has to say about the matter (in a non-technical discussion).
JLS 5.1.2 Widening primitive conversion
The following 19 specific conversions on primitive types are called the widening primitive conversions:
int to long, float, or double
(rest omitted)
Conversion of an int or a long value to float, or of a long value to double, may result in loss of precision -- that is, the result may lose some of the least significant bits of the value. In this case, the resulting floating-point value will be a correctly rounded version of the integer value, using IEEE 754 round-to-nearest mode.
Despite the fact that loss of precision may occur, widening conversions among primitive types never result in a run-time exception.
Here is an example of a widening conversion that loses precision:
class Test {
    public static void main(String[] args) {
        int big = 1234567890;
        float approx = big;
        System.out.println(big - (int) approx);
    }
}
which prints:
-46
thus indicating that information was lost during the conversion from type int to type float because values of type float are not precise to nine significant digits.
No, float and double are fixed-length too - they just use their bits differently. Read more about how exactly they work in the Floating-Point Guide.
Basically, you cannot lose precision when assigning an int to a double, because double has 52 bits of precision, which is enough to hold all int values. But float only has 23 bits of precision, so it cannot exactly represent all int values that are larger than about 2^23.
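The 2^24 boundary can be probed directly (a small sketch; the class name is mine):

```java
public class FloatBoundary {
    public static void main(String[] args) {
        int exact = 16_777_216; // 2^24: still exactly representable as a float
        int lossy = 16_777_217; // 2^24 + 1: needs 25 significant bits

        System.out.println((int) (float) exact == exact); // true
        System.out.println((int) (float) lossy == lossy); // false: rounds to 16_777_216

        // Every int survives a trip through double:
        System.out.println((int) (double) Integer.MAX_VALUE == Integer.MAX_VALUE); // true
    }
}
```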
Your intuition is correct: you MAY lose precision when converting int to float. However, it is not as simple as presented in most other answers.
In Java a FLOAT uses a 23 bit mantissa, so integers greater than 2^23 will have their least significant bits truncated. (from a post on this page)
Not true.
Example: here is an integer that is greater than 2^23 that converts to a float with no loss:
int i = 33_554_430 * 64; // greater than 2^23 (and also greater than 2^24); i = 2_147_483_520
float f = i;
System.out.println("result: " + (i - (int) f)); // Prints: result: 0
System.out.println("with i:" + i + ", f:" + f); // Prints: with i:2147483520, f:2.14748352E9
Therefore, it is not true that integers greater than 2^23 will have their least significant bits truncated.
The best explanation I found is here:
A float in Java is 32-bit and is represented by:
sign * mantissa * 2^exponent
sign * (0 to 33_554_431) * 2^(-125 to +127)
Source: http://www.ibm.com/developerworks/java/library/j-math2/index.html
Why is this an issue?
It leaves the impression that you can determine whether there is a loss of precision from int to float just by looking at how large the int is.
I have especially seen Java exam questions where one is asked whether a large int would convert to a float with no loss.
Also, sometimes people tend to think that there will be loss of precision from int to float:
when an int is larger than: 1_234_567_890 not true (see counter-example above)
when an int is larger than: 2 exponent 23 (equals: 8_388_608) not true
when an int is larger than: 2 exponent 24 (equals: 16_777_216) not true
Conclusion
Conversions from sufficiently large ints to floats MAY lose precision.
It is not possible to determine whether there will be loss just by looking at how large the int is (i.e. without trying to go deeper into the actual float representation).
Possibly the clearest explanation I've seen:
http://www.ibm.com/developerworks/java/library/j-math2/index.html
The ULP, or unit of least precision, defines the precision available between any two adjacent float values. As these values increase, the available precision decreases.
For example: between 1.0 and 2.0 inclusive there are 8,388,609 floats; between 1,000,000 and 1,000,001 there are 17. At 10,000,000 the ULP is 1.0, so above this value you soon have multiple integral values mapping to each available float, hence the loss of precision.
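Math.ulp makes the shrinking precision visible (a sketch; the sample magnitudes follow the figures above, the class name is mine):

```java
public class UlpGrowth {
    public static void main(String[] args) {
        System.out.println(Math.ulp(1.0f));        // 1.1920929E-7 (2^-23)
        System.out.println(Math.ulp(1_000_000f));  // 0.0625: 17 floats between 1,000,000 and 1,000,001
        System.out.println(Math.ulp(10_000_000f)); // 1.0: neighbouring floats are a whole integer apart
    }
}
```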
There are two reasons that assigning an int to a double or a float might lose precision:
There are certain numbers that just can't be represented as a double/float, so they end up approximated
Large integer numbers may contain more precision in the least-significant digits than the mantissa can hold
For these examples, I'm using Java.
Use a function like this to check for loss of precision when casting from int to float
static boolean checkPrecisionLossToFloat(int val)
{
    if (val < 0)
    {
        val = -val;
    }
    // 8 is the bit-width of the exponent for single-precision
    return Integer.numberOfLeadingZeros(val) + Integer.numberOfTrailingZeros(val) < 8;
}
Use a function like this to check for loss of precision when casting from long to double
static boolean checkPrecisionLossToDouble(long val)
{
    if (val < 0)
    {
        val = -val;
    }
    // 11 is the bit-width for the exponent in double-precision
    return Long.numberOfLeadingZeros(val) + Long.numberOfTrailingZeros(val) < 11;
}
Use a function like this to check for loss of precision when casting from long to float
static boolean checkPrecisionLossToFloat(long val)
{
    if (val < 0)
    {
        val = -val;
    }
    // 8 + 32
    return Long.numberOfLeadingZeros(val) + Long.numberOfTrailingZeros(val) < 40;
}
For each of these functions, returning true means that casting that integral value to the floating point value will result in a loss of precision.
Casting to float will lose precision if the integral value has more than 24 significant bits.
Casting to double will lose precision if the integral value has more than 53 significant bits.
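A sketch cross-checking the bit-counting test against an actual cast round trip (the class name and sample values are mine; the samples deliberately stay below 2^31 - 128 so the int cast back from float cannot saturate):

```java
public class PrecisionCheck {
    static boolean checkPrecisionLossToFloat(int val) {
        if (val < 0) {
            val = -val;
        }
        // Loss occurs when the span of significant bits exceeds 24,
        // i.e. when leading + trailing zeros total fewer than 32 - 24 = 8.
        return Integer.numberOfLeadingZeros(val) + Integer.numberOfTrailingZeros(val) < 8;
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, 16_777_216, 16_777_217, 1_234_567_890, 2_147_483_520};
        for (int v : samples) {
            boolean predicted = checkPrecisionLossToFloat(v);
            boolean actual = (int) (float) v != v;
            System.out.println(v + " -> predicted=" + predicted + ", actual=" + actual);
        }
    }
}
```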
You can assign an int to a double without losing precision.
