Question:
The total number of float values is finite; there are about 2^32 of them. Given a float, you can step directly to the next or previous one using java.lang.Math.nextAfter. I call that a single leap. My main question, composed of sub-questions, is: how can I navigate over floats using leaps?
First, how can I move from one float to another with multiple leaps at once?
public static float moveFloat(float value, int leaps) {
    for (int i = 0; i < Math.abs(leaps); i++)
        value = Math.nextAfter(value, Float.POSITIVE_INFINITY * Math.signum(leaps));
    return value;
}
That approach should work in theory but is really unoptimized. How can I do it in a single addition?
I also need to know how many leaps there are between two floats. Here's an example implementation for this one:
public static int getLeaps(float value, float destination) {
    int leaps = 0;
    float direction = Math.signum(destination - value);
    while (value * direction < destination * direction) {
        value = Math.nextAfter(value, Float.POSITIVE_INFINITY * direction);
        leaps++;
    }
    return leaps;
}
Again, same problem here. This implementation isn't suitable.
Extra:
The thing I call a leap, does it have an actual name?
Background:
I'm trying to make a simple 2D physics engine in Java and I'm having trouble with my floating-point operations. I learned about relative-error float comparison and it helped a bit, but it's not magic. What I want is to be exact with my floating points.
I already know that a lot of base-ten numbers cannot be exactly represented with floating points, but as an exception here, I don't care. All I want is exact float arithmetic in base 2.
To simplify, in my collision detection and response process, I check whether shapes overlap (let's stay in one dimension for this example) and I reposition the two overlapping shapes according to their weights.
See this example:
If the black lines are the float values (and the spaces between them leaps), whatever the precision is, I want to place both shapes (colored lines) exactly at the brown position. (The brown position is determined by the weight ratio and by rounding. What I call penetration is the overlapping area/distance. If the penetration had been 5, red would have been pushed by 1 and blue by 4.)
The problem is, to do that I have to keep the penetration of the collision (in this case the penetration is exactly the ULP of the float, or 1 leap) in a float, and I suspect this leads to inaccuracy. If the penetration value is bigger than the coordinates of the shapes, it will be less precise, so they won't be repositioned at exactly the right coordinates.
What I imagine is to keep the penetration of the collision as the number of leaps needed to get from one float to the other, and use that afterwards.
This is a simplified version of the current code I have:
public class ReplaceResolver implements CollisionResolver {

    @Override
    public void resolve(Collision collision) {
        float deltaB = collision.weightRatio * collision.penetration; // weightRatio is bodyA's weight over the sum of the 2 (precalculated)
        float deltaA = 1f - deltaB;

        // The normal indicates where the shape should be pushed. For now, my engine
        // is only AA, so one component of the normal (x or y) is always 0 while the other is 1.
        if (deltaB > 0)
            replace(collision.bodyA, collision.normalB, deltaA);
        if (deltaA > 0)
            replace(collision.bodyB, collision.normalA, deltaB);
    }

    private void replace(Body body, Vector2 normal, float delta) {
        body.getPosition().x += normal.x * delta; // body.getPosition() is a Vector2
        body.getPosition().y += normal.y * delta;
    }
}
Obviously, this doesn't work properly and accumulates floating-point precision error. The error is handled well by my collision detection, which checks for float equality using ULPs. However, it breaks when crossing 0, because the ULP becomes extremely small there.
I could simply fix an epsilon for a physics simulation, but that would defeat the whole point of using floats. The technique I want to use lets the user choose their precision implicitly and theoretically should work with any precision.
The underlying IEEE 754 floating-point model has this property: if you reinterpret the bits as an integer, taking the next float after (or before, depending on the direction) is just like taking the next (or previous) integer, that is, adding or subtracting 1 to the bit pattern reinterpreted as an integer.
Stepping n times is adding (or subtracting) n to the bit pattern. It's as simple as that, as long as the sign does not change and you don't overflow into NaN or Inf.
And the number of different floats between two floats is the difference of the two integers, if the signs agree.
If the signs differ, then because floats use a sign-magnitude-like representation, which does not match the two's-complement integer representation, you'll have to exert a bit of extra arithmetic.
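Here is a minimal sketch of that idea in Java (the helper names are mine, not a library API). Negative floats are mirrored so that integer order matches float order, and note it deliberately does not guard against stepping past ±Float.MAX_VALUE into the Infinity/NaN bit patterns:
public class Leaps {
    // Map the IEEE 754 bit pattern to an integer that increases monotonically
    // with the float value: non-negative floats keep their pattern, negative
    // floats are mirrored (sign-magnitude -> two's-complement order).
    static int orderedBits(float f) {
        int bits = Float.floatToIntBits(f);
        return bits >= 0 ? bits : 0x80000000 - bits;
    }

    // Number of leaps (ULP steps) from a to b, as a single subtraction;
    // the long return type avoids overflow on very wide spans.
    static long leapsBetween(float a, float b) {
        return (long) orderedBits(b) - orderedBits(a);
    }

    // Move by n leaps in a single addition.
    static float moveFloat(float value, int leaps) {
        int ordered = orderedBits(value) + leaps;
        int bits = ordered >= 0 ? ordered : 0x80000000 - ordered;
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        System.out.println(leapsBetween(1.5f, Math.nextUp(1.5f))); // 1
        System.out.println(leapsBetween(-1.5f, 1.5f));             // spans zero correctly
        System.out.println(moveFloat(1.5f, 2) == Math.nextUp(Math.nextUp(1.5f))); // true
    }
}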
I want to do the same calculation. So, if "leaps" means, as @aka.nice said, the integer difference/span/distance between two floating-point values according to the IEEE 754 floating-point "single format" bit layout (IEEE 754 format), I may have found a simple method:
public static native int floatToRawIntBits(float value) (backed by Java_java_lang_Float_floatToRawIntBits) can be used for this purpose. It has functionality similar to my C test code below, which reinterprets the memory of a float (in the spirit of reinterpret_cast):
#include <stdio.h>
#include <string.h>

/* https://stackoverflow.com/questions/44008357/adding-and-subtracting-exact-values-to-float */
int main(void) {
    float float0 = 1.5f;
    float float1 = 1.5000001f;
    int intbits_of_float0;
    int intbits_of_float1;
    /* memcpy avoids the strict-aliasing issues of *(int *)&float0 */
    memcpy(&intbits_of_float0, &float0, sizeof intbits_of_float0);
    memcpy(&intbits_of_float1, &float1, sizeof intbits_of_float1);
    printf("float %.17g is reinterpreted as an integer %d\n", float0, intbits_of_float0);
    printf("float %.17g is reinterpreted as an integer %d\n", float1, intbits_of_float1);
    return 0;
}
And the Java code (online compiler) below is used to calculate the "leaps":
public class Toy {

    public static void main(String args[]) {
        int length = 0x82000000;
        int x = length >>> 24;
        int y = (length >>> 24) & 0xFF;
        System.out.println("length = " + length + ", x = " + x + ", y = " + y);

        float float0 = 1.5f;
        float float1 = 1.5000001f;
        float float2 = 1.5000002f;
        float float4 = 1.5000004f;
        float float5 = 1.5000005f;

        // testLeaps(float0, float4);
        // testLeaps(0, float5);
        // testLeaps(0, -float1);
        // testLeaps(-float1, 0);
        System.out.println(Math.nextAfter(-float1, Float.POSITIVE_INFINITY));
        System.out.println(INT_POWER_MASK & Float.floatToIntBits(-float0));
        System.out.println(INT_POWER_MASK & Float.floatToIntBits(float0));
        // testLeaps(-float1, -float0);
        testLeaps(-float0, 0);
        testLeaps(float0, 0);
    }

    public static void testLeaps(float value, float destination) {
        System.out.println("optLeaps(" + value + ", " + destination + ") = " + optLeaps(value, destination));
        System.out.println("getLeaps(" + value + ", " + destination + ") = " + getLeaps(value, destination));
    }

    public static final int INT_POWER_MASK = 0x7f800000 | 0x007fffff; // ~0x80000000
    /**
     * Retrieves the integer difference between two floating-point values according to
     * the IEEE 754 floating-point "single format" bit layout.
     *
     * <pre>
     * mask  0x80000000 | 0x7f800000 | 0x007fffff
     *       sign       | exponent   | coefficient/significand/mantissa
     * +----+------------------+----------------------------------+
     * | 31 | 30 ........... 23 | 22 ........................... 0 |
     * +----+------------------+----------------------------------+
     * 0x7fc00000 => NaN
     * 0x7f800000 => +Infinity
     * 0xff800000 => -Infinity
     * </pre>
     *
     * The numerical value of such a float is
     * (-1)^sign x 1.significand x 2^(exponent - 127), so the significand is a
     * key factor in the calculation of leaps.
     *
     * @param value the first operand
     * @param destination the second operand
     * @return the integer span from {@code value} to {@code destination}
     */
    public static int optLeaps(float value, float destination) {
        // TODO handle special cases (NaN, infinities) in the inputs.
        int valueBits = Float.floatToIntBits(value);             // IEEE 754 "single format" bit layout
        int destinationBits = Float.floatToIntBits(destination); // IEEE 754 "single format" bit layout
        int leaps;
        if ((destinationBits ^ valueBits) >= 0) { // same sign: plain distance
            leaps = Math.abs(destinationBits - valueBits);
        } else { // signs differ: add the two magnitudes on either side of zero.
            // The parentheses matter: + binds tighter than &, so each operand
            // must be masked before the addition.
            leaps = (INT_POWER_MASK & destinationBits) + (INT_POWER_MASK & valueBits);
        }
        return leaps;
    }
    public static int getLeaps(float value, float destination) {
        int leaps = 0;
        float signum = Math.signum(destination - value);
        if (0 == signum) {
            return 0;
        }
        if (0 < signum) {
            while (value < destination) {
                value = Math.nextAfter(value, Float.POSITIVE_INFINITY);
                leaps++;
            }
        } else {
            while (value > destination) {
                value = Math.nextAfter(value, Float.NEGATIVE_INFINITY);
                leaps++;
            }
        }
        return leaps;
    }

    // optimize to reduce the elapsed time by roughly half
}
To start, I just want to say I don't like hacking into an object's implementation, and you should use your own (or another library's) implementation first, but sometimes you have to get creative.
Let's start with the key detail here, what you call the "leap" (I would call it rounding error). So what is this rounding error, and why is it there? Floats (and doubles) are stored as integer x base_integer^exponent_integer (IEEE standard). So using base 10, if you have 1.2340 x 10^3 (or 1,234.0), your "leap" will be 0.1, since that is the size of your least significant digit (in storage, the decimal point is implied).
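As a quick illustration (my snippet, not the asker's), Java's Math.ulp reports exactly this leap size, and you can watch it grow with magnitude:
// The "leap" (ULP) is not constant; it scales with the exponent.
System.out.println(Math.ulp(1.0f));    // 1.1920929E-7  (2^-23)
System.out.println(Math.ulp(1234.0f)); // 1.2207031E-4  (2^-13)
System.out.println(Math.ulp(1.0e10f)); // 1024.0        (2^10)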
(And I'm out, too much black magic here for me)
Related
Could someone please tell me if I'm on the right track, or if I'm doing it totally wrong?
How could I do it more simply? I'm thankful for any helpful tips.
I also get "i and mm cannot be resolved to a variable"; look at (1).
I also get "double cannot be converted to int"; look at (2).
Edit/Update:
I'm trying to do it with a try/catch and was converting to a String instead of a double. But I've noticed that I can't round it then. Isn't try/catch a good idea?
/**
 * Returns the best guess given a range defined by its lowest and highest possible values.
 * The best guess that can be taken is the number exactly in the centre of the range.
 * If the centre is between two numbers, the number directly below the centre is returned
 * (the centre is rounded down).
 * @param lowestPossible The lowest possible value of the range
 * @param highestPossible The highest possible value of the range
 * @return The best guess for the given range
 */
public static int getBestNewGuess(int lowestPossilbe, int highestPossible) {
    //return -1; // TODO: IMPLEMENT ME (AND DELETE THIS COMMENT AFTERWARDS!!!)
My Code:
    double x = (lowestPossilbe + highestPossible);
    double y = x / 2; // getting the middle
    if (int i = (int) y) { // if double y can be converted to an int then return int y (1)
        return y; // (2)
    } else if (mm = (int) Math.round(y)) { // if that's not the case, then round double y and convert it to an int (1)
        int m = (int) y;
        return m; // return rounded and converted y
    }
    // if (roundf(y) == y) {
    //     return y;
    // } if else {
    //
    // }
}
If you are worried about overflow, and if you know that the lower value is less than the upper one, you can subtract the lower from the upper, divide the resulting int by 2, and add it back to the lower number. This avoids casting a precise int to an approximate double and back again.
That simple case assumes that both numbers are positive. If the lower number is negative and the upper one positive, you need to add them and divide by 2, since subtracting a large negative number from a large positive one can overflow.
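A sketch of the subtract-first form, assuming lowest <= highest and both non-negative (the parameter names mirror the question's, with the typo fixed):
// Overflow-safe midpoint. Subtracting first keeps the intermediate value in
// int range; for a non-negative difference, integer division truncates
// downward, which matches the "round the centre down" requirement.
public static int getBestNewGuess(int lowestPossible, int highestPossible) {
    return lowestPossible + (highestPossible - lowestPossible) / 2;
}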
Try this:
return (int) Math.floor((Double.valueOf(lowestPossilbe) + Double.valueOf(highestPossible)) / 2);
In C++, the sqrt function operates only on double values.
If we use integers (unsigned long long), can we be sure that
x == sqrt(x * x)
for any positive x where x * x <= MAXIMUM_VALUE?
Does it depend on the machine architecture and the compiler?
In Java, Math.sqrt(x) takes a double value. You stated that x is such that x * x is below Integer.MAX_VALUE. Every such integer is perfectly representable in double: double in Java is explicitly defined as an IEEE 754 style double with a 52-bit stored mantissa (53 significant bits, counting the implicit leading bit); therefore in Java a double can perfectly represent all integral values between -2^53 and +2^53, which easily covers all int values (as int is defined as signed 32-bit in Java), but it does not cover all long values (defined as signed 64-bit; 64 is more than 53, so no go).
Thus, x * x loses no precision when it is converted from int to double. Then, Math.sqrt() on this number will give a result that is also perfectly representable as a double (because it is x, and given that x * x fits in an int, x must also fit), and thus, yes, this will always work out for all such x.
But, hey, why not give it a shot, right?
public static void main(String[] args) {
    int i = 1;
    while (true) {
        if (i * i < 0) break; // stop once i * i overflows int
        int j = (int) Math.sqrt(i * i);
        if (i != j) System.out.println("Oh dear! " + i + " -> " + j);
        i++;
    }
    System.out.println("Done at " + i);
}
> Done at 46341
Thus proving it by exhaustively trying it all.
Turns out, none exist: any long value x such that x * x still fits (thus, is < 2^63 - 1) has the property that x == (long) Math.sqrt(x * x). This is presumably because, even though not every integer of that size is exactly representable as a double, the rounding in x * x and in Math.sqrt still cancels out for perfect squares. Proof:
long p = 2000000000L;
for (; true; p++) {
    long pp = p * p;
    if (pp < 0) break; // stop once p * p overflows long
    long q = (long) Math.sqrt(pp);
    if (q != p) System.out.println("PROBLEM: " + p + " -> " + q);
}
System.out.println("Abort: " + p);
> Abort: 3037000500
Surely, if any number existed for which this doesn't hold, there would be at least one in this high end of the range. Starting from 0 takes very long.
But do we know that sqrt will always return an exact value for a perfect square, or might it be slightly inaccurate?
We should - it's Java. Unlike C, almost everything is 'well defined', and a JVM cannot legally call itself one if it fails to produce the exact answer as specified. The leeway that the Math.sqrt docs provide is not sufficient for any answer other than precisely x to be a legal implementation; therefore, yes, this is a guarantee.
In theory the JVM has some very minor leeway with floating-point numbers, which strictfp disables, but [A] that's more about using 80-bit registers to represent intermediate results instead of 64, which cannot possibly ruin this hypothesis, and [B] a while back a question in the java tag went looking for strictfp having any observable effect on any hardware and any VM version, and the only candidate was a non-reproducible report from 15 years ago. I feel quite confident stating that this will always hold, regardless of hardware or VM version.
I think we can believe.
Type casting a floating-point number to an integer keeps only its integer part. I believe your concern is that, for example, sqrt(4) might yield a floating-point number like 1.999...9, which would be cast to 1. (Yielding 2.000...1 is fine, because it will be cast to 2.)
But the floating-point number 4 is represented as
(1 * 2^0 + 0 * 2^-1 + ... + 0 * 2^-23) * 2^2
according to floating-point arithmetic. This means it must not be smaller than 4, like 3.999...9. So likewise, the sqrt of that number must not be smaller than
(1 * 2^0) * 2^1
So the sqrt of the square of an integer will at worst yield a floating-point number greater than, but close enough to, the integer.
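A quick spot check of that claim (Math.sqrt in Java is specified to return the double closest to the true square root, so exact squares come back exact):
System.out.println(Math.sqrt(4.0) == 2.0); // true: exactly 2.0, not 1.999...
System.out.println((int) Math.sqrt(49.0)); // 7, with no truncation artifact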
Just try it. Yes, it works in Java for non-negative numbers. It even works for long, contrary to common opinion.
class Code {
    public static void main(String[] args) throws Throwable {
        for (long x = (long) Math.sqrt(Long.MAX_VALUE);; --x) {
            if (!(x == (long) Math.sqrt(x * x))) {
                System.err.println("Not true for: " + x);
                break;
            }
        }
        System.err.println("Done");
    }
}
(The first number for which it doesn't work is 3037000500L, which goes negative when squared.)
Even for longs, the range of testable values is only around 2^31, or 2x10^9, so for something this trivial it is reasonable to check every single value. You can even brute-force reasonable cryptographic functions for 32-bit values - something more people should realise. It won't work so well for the full 64 bits.
BigInteger.sqrt (since Java 9)
Use cases requiring tighter constraints on the possibility of overflow can use BigInteger. BigInteger should work for any practical use case. Still, for the normal use case, this might not be as efficient.
Constraints
BigInteger Limits
BigInteger must support values in the range -2^Integer.MAX_VALUE (exclusive) to +2^Integer.MAX_VALUE (exclusive) and may support values outside of that range. An ArithmeticException is thrown when a BigInteger constructor or method would generate a value outside of the supported range. The range of probable prime values is limited and may be less than the full supported positive range of BigInteger. The range must be at least 1 to 2^500000000.
Implementation Note:
In the reference implementation, BigInteger constructors and operations throw ArithmeticException when the result is out of the supported range of -2^Integer.MAX_VALUE (exclusive) to +2^Integer.MAX_VALUE (exclusive).
Array size limit when initialized as a byte array
String length limit when initialized as a String
And it definitely does not support 1/0:
jshell> new BigInteger("1").divide(new BigInteger("0"))
| Exception java.lang.ArithmeticException: BigInteger divide by zero
| at MutableBigInteger.divideKnuth (MutableBigInteger.java:1178)
| at BigInteger.divideKnuth (BigInteger.java:2300)
| at BigInteger.divide (BigInteger.java:2281)
| at (#1:1)
Example code:
import java.math.BigInteger;
import java.util.Arrays;
import java.util.List;

public class SquareAndSqrt {

    static void valid() {
        List<String> values = Arrays.asList("1", "9223372036854775807",
                "92233720368547758079223372036854775807",
                new BigInteger("2").pow(Short.MAX_VALUE - 1).toString());
        for (String input : values) {
            final BigInteger value = new BigInteger(input);
            final BigInteger square = value.multiply(value);
            final BigInteger sqrt = square.sqrt();
            System.out.println("value: " + input + System.lineSeparator()
                    + ", square: " + square + System.lineSeparator()
                    + ", sqrt: " + sqrt + System.lineSeparator()
                    + ", " + value.equals(sqrt));
            System.out.println(System.lineSeparator().repeat(2)); // pre Java 11: System.out.println(new String(new char[2]).replace("\0", System.lineSeparator()));
        }
    }

    static void mayBeInvalid() {
        try {
            new BigInteger("2").pow(Integer.MAX_VALUE);
        } catch (ArithmeticException e) {
            System.out.print("value: 2^Integer.MAX_VALUE, Exception: " + e);
            System.out.println(System.lineSeparator().repeat(2));
        }
    }

    public static void main(String[] args) {
        valid();
        mayBeInvalid();
    }
}
In the cmath library, the sqrt function always converts its argument to double or float. The range of double or float is much larger than that of unsigned long long, so it always gives a positive result.
For reference you can use:
https://learn.microsoft.com/en-us/cpp/standard-library/cmath?view=msvc-170,
https://learn.microsoft.com/en-us/cpp/c-language/type-float?view=msvc-170
Sometimes when you do calculations with very small probabilities using common data types such as doubles, numerical inaccuracies cascade over multiple calculations and lead to incorrect results. Because of this it is recommended to use log probabilities, which improve numerical stability. I have implemented log probabilities in Java and my implementation works, but it has worse numerical stability than using raw doubles. What is wrong with my implementation? What is an accurate and efficient way to perform many consecutive calculations with small probabilities in Java?
I'm unable to provide a neatly contained demonstration of this problem because the inaccuracies cascade over many calculations. However, here is proof that a problem exists: this submission to a CodeForces contest fails due to numerical accuracy. Running test #7 and adding debug prints clearly show that from day 1774, numerical errors begin cascading until the sum of probabilities drops to 0 (when it should be 1). After replacing my Prob class with a simple wrapper over doubles the exact same solution passes tests.
My implementation of multiplying probabilities:
a * b = Math.log(a) + Math.log(b)
My implementation of addition:
a + b = Math.log(a) + Math.log(1 + Math.exp(Math.log(b) - Math.log(a)))
The stability problem is most likely contained within those 2 lines, but here is my entire implementation:
class Prob {
    /** Math explained: https://en.wikipedia.org/wiki/Log_probability
     * Quick start:
     * - Instantiate probabilities, e.g. Prob a = new Prob(0.75)
     * - add(), multiply() return new objects, can perform on nulls & NaNs.
     * - get() returns probability as a readable double */

    /** Logarithmized probability. Note: 0% represented by logP NaN. */
    private double logP;

    /** Construct instance with real probability. */
    public Prob(double real) {
        if (real > 0) this.logP = Math.log(real);
        else this.logP = Double.NaN;
    }

    /** Construct instance with already logarithmized value. */
    static boolean dontLogAgain = true;
    public Prob(double logP, boolean anyBooleanHereToChooseThisConstructor) {
        this.logP = logP;
    }

    /** Returns real probability as a double. */
    public double get() {
        return Math.exp(logP);
    }

    @Override
    public String toString() {
        return "" + get();
    }

    /***************** STATIC METHODS BELOW ********************/

    /** Note: returns NaN only when a && b are both NaN/null. */
    public static Prob add(Prob a, Prob b) {
        if (nullOrNaN(a) && nullOrNaN(b)) return new Prob(Double.NaN, dontLogAgain);
        if (nullOrNaN(a)) return copy(b);
        if (nullOrNaN(b)) return copy(a);

        double x = a.logP;
        double y = b.logP;
        double sum = x + Math.log(1 + Math.exp(y - x));
        return new Prob(sum, dontLogAgain);
    }

    /** Note: multiplying by null or NaN produces NaN (repping 0% real prob). */
    public static Prob multiply(Prob a, Prob b) {
        if (nullOrNaN(a) || nullOrNaN(b)) return new Prob(Double.NaN, dontLogAgain);
        return new Prob(a.logP + b.logP, dontLogAgain);
    }

    /** Returns true if p is null or NaN. */
    private static boolean nullOrNaN(Prob p) {
        return (p == null || Double.isNaN(p.logP));
    }

    /** Returns a new instance with the same value as original. */
    private static Prob copy(Prob original) {
        return new Prob(original.logP, dontLogAgain);
    }
}
The problem was caused by the way Math.exp(z) was used in this line:
a + b = Math.log(a) + Math.log(1 + Math.exp(Math.log(b) - Math.log(a)))
When z reaches extreme values, numerical accuracy of double is not enough for the output of Math.exp(z). This causes us to lose information, produce an inaccurate result, and then these results cascade over multiple calculations.
When z >= 710 then Math.exp(z) = Infinity
When z <= -746 then Math.exp(z) = 0
In the original code I was calling Math.exp with y - x and arbitrarily choosing which was x and which was y. Let's instead choose x and y based on which is larger, so that z is negative rather than positive. The point where we overflow is further out on the negative side (-746 rather than 710) and, more importantly, when we do overflow we end up at 0 rather than Infinity, which is what we want for a vanishingly small probability.
double x = Math.max(a.logP, b.logP);
double y = Math.min(a.logP, b.logP);
double sum = x + Math.log(1 + Math.exp(y - x));
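One further refinement (my addition, not part of the fix above): Math.log1p computes log(1 + z) more accurately than Math.log(1 + z) when z is tiny. A sketch of the full corrected addition, keeping the class's NaN-means-zero convention:
// log(e^x + e^y) = x + log1p(e^(y - x)) with x >= y, so the argument of
// exp() is always <= 0: it can underflow to 0 but never overflow.
static double logAdd(double logA, double logB) {
    if (Double.isNaN(logA)) return logB; // NaN encodes probability 0 here
    if (Double.isNaN(logB)) return logA;
    double x = Math.max(logA, logB);
    double y = Math.min(logA, logB);
    return x + Math.log1p(Math.exp(y - x));
}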
Is there an optimized, performant way to round a double to the nearest exact multiple of a given power-of-two fraction?
In other words, round .44 to the nearest 1/16 (that is, to a value that can be expressed as n/16 where n is an integer), which would give .4375. Note: this is relevant because power-of-two fractions can be stored without rounding errors, e.g.
public class PowerOfTwo {
    public static void main(String... args) {
        double inexact = .44;
        double exact = .4375;
        System.out.println(inexact + ": " + Long.toBinaryString(Double.doubleToLongBits(inexact)));
        System.out.println(exact + ": " + Long.toBinaryString(Double.doubleToLongBits(exact)));
    }
}
Output:
0.44: 11111111011100001010001111010111000010100011110101110000101001
0.4375: 11111111011100000000000000000000000000000000000000000000000000
If you want to choose the power of two, the simplest way is to multiply by, e.g., 16, round to the nearest integer, then divide by 16. Note that division by a power of two is exact if the result is a normal number. It can cause rounding error for subnormal numbers.
Here is a sample program using this technique:
public class Test {

    public static void main(String[] args) {
        System.out.println(roundToPowerOfTwo(0.44, 2));
        System.out.println(roundToPowerOfTwo(0.44, 3));
        System.out.println(roundToPowerOfTwo(0.44, 4));
        System.out.println(roundToPowerOfTwo(0.44, 5));
        System.out.println(roundToPowerOfTwo(0.44, 6));
        System.out.println(roundToPowerOfTwo(0.44, 7));
        System.out.println(roundToPowerOfTwo(0.44, 8));
    }

    public static double roundToPowerOfTwo(double in, int power) {
        double multiplier = 1L << power; // long literal so powers above 30 don't wrap
        return Math.rint(in * multiplier) / multiplier;
    }
}
Output:
0.5
0.5
0.4375
0.4375
0.4375
0.4375
0.44140625
If the question is about rounding any number to a pre-determined binary precision, what you need to do is this:
Convert the value to long using Double.doubleToLongBits().
Examine the exponent: if it's too big (exponent + required precision > 51, the number of stored bits in the significand), you won't be able to do any rounding, but you won't have to: the number already satisfies your criteria.
If, on the other hand, exponent + required precision < 0, the result of the rounding is always 0.
In any other case, look at the significand and blot out all the bits below the (exponent + required precision)-th significant bit.
Convert the number back to double using Double.longBitsToDouble().
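A sketch of those steps (my code, not the answerer's; it truncates toward zero, as the "blot out" wording implies, and simply collapses subnormals to a signed zero):
// Truncate d to "precision" fractional binary digits by clearing the
// significand bits whose weight is below 2^-precision.
public static double truncateToBinaryPrecision(double d, int precision) {
    if (!Double.isFinite(d)) return d;                           // NaN and infinities pass through
    int exponent = Math.getExponent(d);
    if (exponent + precision > 51) return d;                     // ULP of d already >= 2^-precision
    if (exponent + precision < 0) return Math.copySign(0.0, d);  // everything truncates to 0
    long bits = Double.doubleToLongBits(d);
    long mask = -1L << (52 - exponent - precision);              // keeps sign and exponent bits too
    return Double.longBitsToDouble(bits & mask);
}

// Example: truncateToBinaryPrecision(0.44, 4) == 0.4375 (7/16)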
Getting this right in all corner cases is a bit tricky. If I have to solve such a task, I usually start with a naive implementation that I can be pretty sure is correct, and only then start implementing an optimized version. While doing so, I can always compare against the naive approach to validate my results.
The naive approach is to start with 1 and multiply/divide it by 2 until we have bracketed the absolute value of the input. Then we output the nearer of the two boundaries. It's actually a bit more complicated: if the value is NaN or an infinity, it requires special treatment.
Here is the code:
public static double getClosestPowerOf2Loop(final double x) {
    final double absx = Math.abs(x);
    double prev = 1.0;
    double next = 1.0;
    if (Double.isInfinite(x) || Double.isNaN(x)) {
        return x;
    } else if (absx < 1.0) {
        do {
            prev = next;
            next /= 2.0;
        } while (next > absx);
    } else if (absx > 1.0) {
        do {
            prev = next;
            next *= 2.0;
        } while (next < absx);
    }
    if (x < 0.0) {
        prev = -prev;
        next = -next;
    }
    return (Math.abs(next - x) < Math.abs(prev - x)) ? next : prev;
}
I hope the code will be clear without further explanation. Since Java 8, you can use !Double.isFinite(x) as a replacement for Double.isInfinite(x) || Double.isNaN(x).
Let's look at an optimized version. As other answers have already suggested, we should probably look at the bit representation. Java requires floating-point values to be represented using IEEE 754. In that format, numbers in double (64-bit) precision are represented as
1 bit sign,
11 bits exponent and
52 bits mantissa.
We will special-case NaNs and infinities (which are represented by special bit patterns) again. However, there is yet another exception: the most significant bit of the mantissa is implicitly 1 and not found in the bit pattern - except for very small numbers, where a so-called subnormal representation is used, in which the implicit leading bit is absent. Therefore, for normal numbers we simply set the mantissa's bits to all 0, but for subnormals we convert to a number where none but the most significant 1 bit is preserved. This procedure always rounds towards zero, so to get the other bound, we simply multiply by 2.
Let's see how this all works together:
public static double getClosestPowerOf2Bits(final double x) {
    if (Double.isInfinite(x) || Double.isNaN(x)) {
        return x;
    } else {
        final long bits = Double.doubleToLongBits(x);
        final long signexp = bits & 0xfff0000000000000L;
        final long mantissa = bits & 0x000fffffffffffffL;
        final long mantissaPrev = Math.abs(x) < Double.MIN_NORMAL
                ? Long.highestOneBit(mantissa)
                : 0x0000000000000000L;
        final double prev = Double.longBitsToDouble(signexp | mantissaPrev);
        final double next = 2.0 * prev;
        return (Math.abs(next - x) < Math.abs(prev - x)) ? next : prev;
    }
}
I'm not entirely sure I have covered all corner cases, but the following tests do run:
public static void main(final String[] args) {
    final double[] values = {
        5.0, 4.1, 3.9, 1.0, 0.0, -0.1, -8.0, -8.1, -7.9,
        0.9 * Double.MIN_NORMAL, -0.9 * Double.MIN_NORMAL,
        Double.NaN, Double.MAX_VALUE, Double.MIN_VALUE,
        Double.NEGATIVE_INFINITY, Double.POSITIVE_INFINITY,
    };
    for (final double value : values) {
        final double powerL = getClosestPowerOf2Loop(value);
        final double powerB = getClosestPowerOf2Bits(value);
        System.out.printf("%17.10g --> %17.10g %17.10g%n",
                value, powerL, powerB);
        assert Double.doubleToLongBits(powerL) == Double.doubleToLongBits(powerB);
    }
}
Output:
5.000000000 --> 4.000000000 4.000000000
4.100000000 --> 4.000000000 4.000000000
3.900000000 --> 4.000000000 4.000000000
1.000000000 --> 1.000000000 1.000000000
0.000000000 --> 0.000000000 0.000000000
-0.1000000000 --> -0.1250000000 -0.1250000000
-8.000000000 --> -8.000000000 -8.000000000
-8.100000000 --> -8.000000000 -8.000000000
-7.900000000 --> -8.000000000 -8.000000000
2.002566473e-308 --> 2.225073859e-308 2.225073859e-308
-2.002566473e-308 --> -2.225073859e-308 -2.225073859e-308
NaN --> NaN NaN
1.797693135e+308 --> 8.988465674e+307 8.988465674e+307
4.900000000e-324 --> 4.900000000e-324 4.900000000e-324
-Infinity --> -Infinity -Infinity
Infinity --> Infinity Infinity
How about performance?
I have run the following benchmark
public static void main(final String[] args) {
    final Random rand = new Random();
    for (int i = 0; i < 1000000; ++i) {
        final double value = Double.longBitsToDouble(rand.nextLong());
        final double power = getClosestPowerOf2(value);
    }
}
where getClosestPowerOf2 is to be replaced by either getClosestPowerOf2Loop or getClosestPowerOf2Bits. On my laptop, I get the following results:
getClosestPowerOf2Loop: 2.35 s
getClosestPowerOf2Bits: 1.80 s
Was that really worth the effort?
You are going to need some bit magic if you are going to round to arbitrary powers of 2.
You will need to inspect the exponent:
int exponent = Math.getExponent(inexact);
Then, knowing that there are 53 bits in the mantissa, you can find the bit at which you need to round.
Or just do:
(double) Math.round(inexact * (1L << exponent)) / (1L << exponent)
I use Math.round because I expect it to be optimal for this task, as opposed to trying to implement it yourself. (Note the cast to double: without it, a long divided by a long would silently do integer division.)
Here is my first attempt at a solution. It doesn't handle all the cases in @biziclop's answer, and probably does "floor" instead of "round":
public static double round(double d, int precision) {
    double longPart = Math.rint(d);
    double decimalOnly = d - longPart;
    long bits = Double.doubleToLongBits(decimalOnly);
    long mask = -1L << (54 - precision);
    return Double.longBitsToDouble(bits & mask) + longPart;
}
I came across this post trying to solve a related problem: how to efficiently find the two powers of two that bracket any given regular real value. Since my program deals in many types besides doubles, I needed a general solution. Someone wanting to round to the nearest power of two can get the bracketing values and choose the closest. In my case the general solution required BigDecimals. Here is the trick I used.
For numbers > 1:
int exponent = myBigDecimal.toBigInteger().bitLength() - 1;
BigDecimal lowerBound = TWO.pow(exponent);
BigDecimal upperBound = TWO.pow(exponent+1);
For numbers > 0 and < 1:
int exponent = -(BigDecimal.ONE.divide(myBigDecimal, myContext).toBigInteger().bitLength()-1);
BigDecimal lowerBound = TWO.pow(exponent-1);
BigDecimal upperBound = TWO.pow(exponent);
I have only laid out the positive case. You generally take a number and use this algorithm on its absolute value; then, if in the original problem the number was negative, you multiply the algorithm's result by -1. Finally, the original num == 0 and num == 1 cases are trivial to handle outside this algorithm. That covers the whole real number line except infinities and NaNs, which you deal with before calling this algorithm.
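For completeness, here is a self-contained version of the numbers > 1 branch (TWO and the sample value are my own placeholders):
import java.math.BigDecimal;

public class Bracket {
    static final BigDecimal TWO = BigDecimal.valueOf(2);

    public static void main(String[] args) {
        BigDecimal value = new BigDecimal("37.5");           // any value > 1
        int exponent = value.toBigInteger().bitLength() - 1; // floor(log2(value))
        System.out.println(TWO.pow(exponent));               // 32, the lower bracket
        System.out.println(TWO.pow(exponent + 1));           // 64, the upper bracket
    }
}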
In Java, floating-point arithmetic is not represented precisely. For example, this Java code:
float a = 1.2f;
float b = 3.0f;
float c = a * b;

if (c == 3.6f) {
    System.out.println("c is 3.6");
} else {
    System.out.println("c is not 3.6");
}
Prints "c is not 3.6".
I'm not interested in precision beyond 3 decimals (#.###). How can I deal with this problem so that I can multiply floats and compare them reliably?
It's a general rule that floating-point numbers should never be compared like (a == b), but rather like (Math.abs(a - b) < delta), where delta is a small number.
A floating-point value with a fixed number of digits in decimal form does not necessarily have a fixed number of digits in binary form.
Addition for clarity:
Though strict == comparison of floating-point numbers has very little practical sense, the strict < and > comparisons, on the contrary, are a valid use case (example: logic triggered when a certain value exceeds a threshold: (val > threshold) && panic();).
If you are interested in fixed precision numbers, you should be using a fixed precision type like BigDecimal, not an inherently approximate (though high precision) type like float. There are numerous similar questions on Stack Overflow that go into this in more detail, across many languages.
I think it has nothing to do with Java; it happens with any IEEE 754 floating-point number. It is because of the nature of floating-point representation. Any language that uses the IEEE 754 format will encounter the same problem.
As suggested by David above, you should use the abs method of the java.lang.Math class to get the absolute value (dropping the positive/negative sign).
You can read this: http://en.wikipedia.org/wiki/IEEE_754_revision and also a good numerical methods text book will address the problem sufficiently.
public static void main(String[] args) {
    float a = 1.2f;
    float b = 3.0f;
    float c = a * b;
    final float PRECISION_LEVEL = 0.001f;

    if (Math.abs(c - 3.6f) < PRECISION_LEVEL) {
        System.out.println("c is 3.6");
    } else {
        System.out.println("c is not 3.6");
    }
}
I'm using this bit of code in unit tests to check whether the outcomes of 2 different calculations are the same, barring floating-point math errors.
It works by looking at the binary representation of the floating point number. Most of the complication is due to the fact that the sign of floating point numbers is not two’s complement. After compensating for that it basically comes down to just a simple subtraction to get the difference in ULPs (explained in the comment below).
/**
 * Compare two floating points for equality within a margin of error.
 *
 * This can be used to compensate for inequality caused by accumulated
 * floating point math errors.
 *
 * The error margin is specified in ULPs (units of least precision).
 * A one-ULP difference means there are no representable floats in between.
 * E.g. 0f and 1.4e-45f are one ULP apart. So are -6.1340704f and -6.13407f.
 * Depending on the number of calculations involved, typically a margin of
 * 1-5 ULPs should be enough.
 *
 * @param expected The expected value.
 * @param actual The actual value.
 * @param maxUlps The maximum difference in ULPs.
 * @return Whether they are equal or not.
 */
public static boolean compareFloatEquals(float expected, float actual, int maxUlps) {
    int expectedBits = Float.floatToIntBits(expected) < 0 ? 0x80000000 - Float.floatToIntBits(expected) : Float.floatToIntBits(expected);
    int actualBits = Float.floatToIntBits(actual) < 0 ? 0x80000000 - Float.floatToIntBits(actual) : Float.floatToIntBits(actual);
    int difference = expectedBits > actualBits ? expectedBits - actualBits : actualBits - expectedBits;

    return !Float.isNaN(expected) && !Float.isNaN(actual) && difference <= maxUlps;
}
Here is a version for double precision floats:
/**
 * Compare two double precision floats for equality within a margin of error.
 *
 * @param expected The expected value.
 * @param actual The actual value.
 * @param maxUlps The maximum difference in ULPs.
 * @return Whether they are equal or not.
 * @see Utils#compareFloatEquals(float, float, int)
 */
public static boolean compareDoubleEquals(double expected, double actual, long maxUlps) {
    long expectedBits = Double.doubleToLongBits(expected) < 0 ? 0x8000000000000000L - Double.doubleToLongBits(expected) : Double.doubleToLongBits(expected);
    long actualBits = Double.doubleToLongBits(actual) < 0 ? 0x8000000000000000L - Double.doubleToLongBits(actual) : Double.doubleToLongBits(actual);
    long difference = expectedBits > actualBits ? expectedBits - actualBits : actualBits - expectedBits;

    return !Double.isNaN(expected) && !Double.isNaN(actual) && difference <= maxUlps;
}
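A possible usage in a test (the 4-ULP margin is an arbitrary choice for illustration):
float expected = 1.2f * 3.0f; // 3.6000001f
float actual = 3.6f;
// The two differ by exactly 1 ULP, so this prints true.
System.out.println(compareFloatEquals(expected, actual, 4));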
This is a weakness of all floating-point representations, and it happens because some numbers that appear to have a fixed number of digits in the decimal system actually have an infinite number of digits in the binary system. So what you think is 1.2 is actually something like 1.199999999997, because when representing it in binary, the digits have to be chopped off after a certain point, and you lose some precision. Then multiplying it by 3 actually gives 3.5999999...
http://docs.python.org/py3k/tutorial/floatingpoint.html <- this might explain it better (even if it's for python, it's a common problem of the floating point representation)
Like the others wrote:
Compare floats with: if (Math.abs(a - b) < delta)
You can write a nice method for doing this:
public static int compareFloats(float f1, float f2, float delta)
{
if (Math.abs(f1 - f2) < delta)
{
return 0;
} else
{
if (f1 < f2)
{
return -1;
} else {
return 1;
}
}
}
/**
* Uses <code>0.001f</code> for delta.
*/
public static int compareFloats(float f1, float f2)
{
return compareFloats(f1, f2, 0.001f);
}
So, you can use it like this:
if (compareFloats(a * b, 3.6f) == 0)
{
System.out.println("They are equal");
}
else
{
System.out.println("They aren't equal");
}
There is an apache class for comparing doubles: org.apache.commons.math3.util.Precision
It contains some interesting constants: SAFE_MIN and EPSILON, which are the maximum possible deviations when performing arithmetic operations.
It also provides the necessary methods to compare, equal or round doubles.
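A brief usage sketch, assuming commons-math3 is on the classpath:
import org.apache.commons.math3.util.Precision;

public class PrecisionDemo {
    public static void main(String[] args) {
        double c = 1.2f * 3.0f; // 3.600000143051147 after widening to double
        System.out.println(Precision.equals(c, 3.6, 0.001)); // true: within an absolute epsilon
        System.out.println(Precision.round(c, 3));           // 3.6
    }
}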
Rounding is a bad idea. Use BigDecimal and set its precision as needed.
Like:
public static void main(String... args) {
    float a = 1.2f;
    float b = 3.0f;
    float c = a * b;

    BigDecimal a2 = BigDecimal.valueOf(a);
    BigDecimal b2 = BigDecimal.valueOf(b);
    BigDecimal c2 = a2.multiply(b2);

    BigDecimal a3 = a2.setScale(2, RoundingMode.HALF_UP);
    BigDecimal b3 = b2.setScale(2, RoundingMode.HALF_UP);
    BigDecimal c3 = a3.multiply(b3);
    BigDecimal c4 = a3.multiply(b3).setScale(2, RoundingMode.HALF_UP);

    System.out.println(c);  // 3.6000001
    System.out.println(c2); // 3.60000014305114740
    System.out.println(c3); // 3.6000
    System.out.println(c == 3.6f); // false
    System.out.println(Float.compare(c, 3.6f) == 0); // false
    System.out.println(c2.compareTo(BigDecimal.valueOf(3.6f)) == 0); // false
    System.out.println(c3.compareTo(BigDecimal.valueOf(3.6f)) == 0); // false
    System.out.println(c3.compareTo(BigDecimal.valueOf(3.6f).setScale(2, RoundingMode.HALF_UP)) == 0); // true
    System.out.println(c3.compareTo(BigDecimal.valueOf(3.6f).setScale(9, RoundingMode.HALF_UP)) == 0); // false
    System.out.println(c4.compareTo(BigDecimal.valueOf(3.6f).setScale(2, RoundingMode.HALF_UP)) == 0); // true
}
To compare two floats, f1 and f2, within a precision of #.###, I believe you would need to do it like this:
((int) (f1 * 1000 + 0.5)) == ((int) (f2 * 1000 + 0.5))
f1 * 1000 lifts 3.14159265... to 3141.59265, + 0.5 makes it 3142.09265, and the (int) cast chops off the decimals: 3142. That is, it includes 3 decimals and rounds the last digit properly.
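As a sketch (note that the + 0.5 trick as written only rounds correctly for non-negative values):
// Compare two floats after rounding each to 3 decimal places.
static boolean equalTo3Decimals(float f1, float f2) {
    return (int) (f1 * 1000 + 0.5) == (int) (f2 * 1000 + 0.5);
}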