Manipulating and comparing floating points in java


In Java, floating-point arithmetic is not exact. For example, this Java code:
float a = 1.2f;
float b = 3.0f;
float c = a * b;
if(c == 3.6){
System.out.println("c is 3.6");
}
else {
System.out.println("c is not 3.6");
}
Prints "c is not 3.6".
I'm not interested in precision beyond 3 decimals (#.###). How can I deal with this problem to multiply floats and compare them reliably?

As a general rule, floating-point numbers should never be compared with (a == b), but rather with (Math.abs(a - b) < delta), where delta is a small number.
A value that has a finite number of digits in decimal form does not necessarily have a finite number of digits in binary form.
Addition for clarity:
Although strict == comparison of floating-point numbers rarely makes practical sense, strict < and > comparisons are a valid use case (for example, logic that triggers when a value exceeds a threshold: (val > threshold) && panic();).
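To make the point about decimal versus binary digits concrete, here is a minimal sketch (the class and method names are made up for illustration). It relies on the fact that new BigDecimal(double) converts the stored binary value exactly, so it shows what a float literal really holds:

```java
import java.math.BigDecimal;

final class ExactFloatValue {
    // new BigDecimal(double) is an exact conversion; the float is first
    // widened to double, which is also exact.
    static String exactDecimal(float f) {
        return new BigDecimal(f).toString();
    }
}
```

Calling exactDecimal(1.2f) yields 1.2000000476837158203125, while exactDecimal(0.5f) yields 0.5, because 0.5 is a power of two and therefore exactly representable.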

If you are interested in fixed precision numbers, you should be using a fixed precision type like BigDecimal, not an inherently approximate (though high precision) type like float. There are numerous similar questions on Stack Overflow that go into this in more detail, across many languages.
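As a rough sketch of that suggestion (the helper name productEquals is hypothetical), the values can be built from decimal strings so that no binary rounding ever happens, and compared with compareTo so that the scale is ignored:

```java
import java.math.BigDecimal;

final class DecimalProduct {
    // Multiplies two decimal strings exactly and compares with the expected value.
    static boolean productEquals(String a, String b, String expected) {
        BigDecimal product = new BigDecimal(a).multiply(new BigDecimal(b));
        // compareTo treats 3.60 and 3.6 as equal; equals() would also compare the scale.
        return product.compareTo(new BigDecimal(expected)) == 0;
    }
}
```

With this, productEquals("1.2", "3.0", "3.6") is true, which is the behaviour the question was hoping to get from float.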

I think it has nothing to do with Java, it happens on any IEEE 754 floating point number. It is because of the nature of floating point representation. Any languages that use the IEEE 754 format will encounter the same problem.
As suggested by David above, you should use the method abs of java.lang.Math class to get the absolute value (drop the positive/negative sign).
You can read this: http://en.wikipedia.org/wiki/IEEE_754_revision and also a good numerical methods text book will address the problem sufficiently.
public static void main(String[] args) {
float a = 1.2f;
float b = 3.0f;
float c = a * b;
final float PRECISION_LEVEL = 0.001f;
if(Math.abs(c - 3.6f) < PRECISION_LEVEL) {
System.out.println("c is 3.6");
} else {
System.out.println("c is not 3.6");
}
}

I’m using this bit of code in unit tests to compare if the outcome of 2 different calculations are the same, barring floating point math errors.
It works by looking at the binary representation of the floating point number. Most of the complication is due to the fact that the sign of floating point numbers is not two’s complement. After compensating for that it basically comes down to just a simple subtraction to get the difference in ULPs (explained in the comment below).
/**
* Compare two floating points for equality within a margin of error.
*
* This can be used to compensate for inequality caused by accumulated
* floating point math errors.
*
* The error margin is specified in ULPs (units of least precision).
* A one-ULP difference means there are no representable floats in between.
* E.g. 0f and 1.4e-45f are one ULP apart. So are -6.1340704f and -6.13407f.
* Depending on the number of calculations involved, typically a margin of
* 1-5 ULPs should be enough.
*
* @param expected The expected value.
* @param actual The actual value.
* @param maxUlps The maximum difference in ULPs.
* @return Whether they are equal or not.
*/
public static boolean compareFloatEquals(float expected, float actual, int maxUlps) {
int expectedBits = Float.floatToIntBits(expected) < 0 ? 0x80000000 - Float.floatToIntBits(expected) : Float.floatToIntBits(expected);
int actualBits = Float.floatToIntBits(actual) < 0 ? 0x80000000 - Float.floatToIntBits(actual) : Float.floatToIntBits(actual);
int difference = expectedBits > actualBits ? expectedBits - actualBits : actualBits - expectedBits;
return !Float.isNaN(expected) && !Float.isNaN(actual) && difference <= maxUlps;
}
Here is a version for double precision floats:
/**
* Compare two double precision floats for equality within a margin of error.
*
* @param expected The expected value.
* @param actual The actual value.
* @param maxUlps The maximum difference in ULPs.
* @return Whether they are equal or not.
* @see Utils#compareFloatEquals(float, float, int)
*/
public static boolean compareDoubleEquals(double expected, double actual, long maxUlps) {
long expectedBits = Double.doubleToLongBits(expected) < 0 ? 0x8000000000000000L - Double.doubleToLongBits(expected) : Double.doubleToLongBits(expected);
long actualBits = Double.doubleToLongBits(actual) < 0 ? 0x8000000000000000L - Double.doubleToLongBits(actual) : Double.doubleToLongBits(actual);
long difference = expectedBits > actualBits ? expectedBits - actualBits : actualBits - expectedBits;
return !Double.isNaN(expected) && !Double.isNaN(actual) && difference <= maxUlps;
}

This is a weakness of all binary floating-point representations. It happens because some numbers that have a finite number of digits in the decimal system have an infinitely repeating expansion in the binary system. So what you think is 1.2 is actually stored as the nearest representable value (for a float, roughly 1.2000000477), because the binary expansion has to be cut off after a fixed number of bits and some precision is lost. Multiplying that by 3 then gives roughly 3.6000001 rather than 3.6.
http://docs.python.org/py3k/tutorial/floatingpoint.html <- this might explain it better (even if it's for python, it's a common problem of the floating point representation)

Like the others wrote:
Compare floats with: if (Math.abs(a - b) < delta)
You can write a nice method for doing this:
public static int compareFloats(float f1, float f2, float delta)
{
if (Math.abs(f1 - f2) < delta)
{
return 0;
} else
{
if (f1 < f2)
{
return -1;
} else {
return 1;
}
}
}
/**
* Uses <code>0.001f</code> for delta.
*/
public static int compareFloats(float f1, float f2)
{
return compareFloats(f1, f2, 0.001f);
}
So, you can use it like this:
if (compareFloats(a * b, 3.6f) == 0)
{
System.out.println("They are equal");
}
else
{
System.out.println("They aren't equal");
}

There is an Apache Commons Math class for comparing doubles: org.apache.commons.math3.util.Precision
It contains some interesting constants: EPSILON, the largest double such that 1 + EPSILON is still numerically equal to 1, and SAFE_MIN, the smallest value such that 1 / SAFE_MIN does not overflow.
It also provides methods to compare doubles (compareTo), test them for equality within a tolerance (equals), and round them (round).

Rounding is a bad idea. Use BigDecimal and set its precision as needed.
Like:
public static void main(String... args) {
float a = 1.2f;
float b = 3.0f;
float c = a * b;
BigDecimal a2 = BigDecimal.valueOf(a);
BigDecimal b2 = BigDecimal.valueOf(b);
BigDecimal c2 = a2.multiply(b2);
BigDecimal a3 = a2.setScale(2, RoundingMode.HALF_UP);
BigDecimal b3 = b2.setScale(2, RoundingMode.HALF_UP);
BigDecimal c3 = a3.multiply(b3);
BigDecimal c4 = a3.multiply(b3).setScale(2, RoundingMode.HALF_UP);
System.out.println(c); // 3.6000001
System.out.println(c2); // 3.60000014305114740
System.out.println(c3); // 3.6000
System.out.println(c == 3.6f); // false
System.out.println(Float.compare(c, 3.6f) == 0); // false
System.out.println(c2.compareTo(BigDecimal.valueOf(3.6f)) == 0); // false
System.out.println(c3.compareTo(BigDecimal.valueOf(3.6f)) == 0); // false
System.out.println(c3.compareTo(BigDecimal.valueOf(3.6f).setScale(2, RoundingMode.HALF_UP)) == 0); // true
System.out.println(c3.compareTo(BigDecimal.valueOf(3.6f).setScale(9, RoundingMode.HALF_UP)) == 0); // false
System.out.println(c4.compareTo(BigDecimal.valueOf(3.6f).setScale(2, RoundingMode.HALF_UP)) == 0); // true
}

To compare two floats, f1 and f2, to a precision of #.###, I believe you would need to do something like this:
((int) (f1 * 1000 + 0.5)) == ((int) (f2 * 1000 + 0.5))
f1 * 1000 lifts 3.14159265... to 3141.59265, + 0.5 results in 3142.09265, and the (int) cast chops off the decimals, leaving 3142. That is, it keeps 3 decimals and rounds the last digit properly. Note that the cast truncates toward zero, so this rounding trick is only correct for non-negative values.
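A possible variant of the same idea, sketched with Math.round instead of the + 0.5 and cast (the method name sameTo3Decimals is made up here). Math.round(float) returns an int and rounds to the nearest integer for negative inputs as well:

```java
final class DecimalRounding {
    // True if both values round to the same multiple of 0.001.
    static boolean sameTo3Decimals(float f1, float f2) {
        return Math.round(f1 * 1000f) == Math.round(f2 * 1000f);
    }
}
```

One caveat that applies to any rounding-based comparison: two values that are very close but sit on opposite sides of a rounding boundary (for example 1.0004f and 1.0006f) compare as different, even though they differ by less than 0.001.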

Related

Adding and subtracting exact values to float

Question:
The total number of float values is finite: there are about 2^32 of them. With a float, you can go directly to the next or previous one using java.lang.Math.nextAfter. I call that a single leap. My main question, composed of sub-questions, is: how can I navigate on floats using leaps?
First, how can I move a float to another with multiple leaps at once?
public static float moveFloat(float value, int leaps) {
for(int i = 0; i < Math.abs(leaps); i++)
value = Math.nextAfter(value, Float.POSITIVE_INFINITY * signum(leaps));
return value;
}
That should work in theory but is really unoptimized. How can I do it in a single addition?
I also need to know how many leaps there are between 2 floats. Here's an example implementation for this one:
public static int getLeaps(float value, float destination) {
int leaps = 0;
float direction = signum(destination - value);
while(value * direction < destination * direction) {
value = Math.nextAfter(value, Float.POSITIVE_INFINITY * direction);
leaps++;
}
return leaps;
}
Again, same problem here. This implementation isn't suitable.
Extra:
The thing I call a leap, does it have an actual name ?
Background:
I'm trying to make a simple 2D physics engine in Java and I have trouble with my floating point operations. I learned about relative error float comparison and it helped a bit but it's not magic. What I want is to be exact with my floating points.
I already know a lot of base-ten numbers cannot be exactly represented with floating points but, exceptionally, I don't care. All I want is exact float arithmetic in base 2.
To simplify, in my collision detection and response process, I check if shapes overlap (let's stay in one dimension for this example) and I replace the 2 shapes overlapping using their weight.
See this example:
If the black lines are the float values (and the spaces between them are leaps), then whatever the precision is, I want to place both shapes (colored lines) exactly at the brown position. (The brown position is determined by the weight ratio and by rounding. What I call penetration is the overlapping area/distance. If the penetration had been 5, red would have been pushed by 1 and blue by 4.)
The problem is that, to do that, I have to keep the penetration of the collision (in this case the penetration is exactly the ULP of the float, or 1 leap) in a float, and I suspect this leads to inexactness. If the penetration value is bigger than the coordinates of the shapes, it will be less precise, so they won't be placed back at exactly the right coordinate.
What I have in mind is to keep the penetration of the collision as the number of leaps needed to get from one to the other and use that afterwards.
This is a simplified version of the current code I have:
public class ReplaceResolver implements CollisionResolver {
@Override
public void resolve(Collision collision) {
float deltaB = collision.weightRatio * collision.penetration; //bodyA's weight over the sum of the 2 (pre calculated)
float deltaA = 1f - deltaB;
//the normal indicates where the shape should be pushed. For now, my engine is only AA so a component of the normal (x or y) is always 0 while the other is 1
if(deltaB > 0)
replace(collision.bodyA, collision.normalB, deltaA);
if(deltaA > 0)
replace(collision.bodyB, collision.normalA, deltaB);
}
private void replace(Body body, Vector2 normal, float delta) {
body.getPosition().x += normal.x * delta; //body.getPosition() is a Vector2
body.getPosition().y += normal.y * delta;
}
}
Obviously, this doesn't work properly and accumulates floating point precision error. The error is handled well by my collision detection, which checks for float equality using ULPs. However, it breaks when crossing 0 because the ULP becomes extremely small.
I could simply fix an epsilon for a physics simulation, but that would remove the whole point of using floats. The technique I want to use lets the user choose the precision implicitly and should theoretically work with any precision.

The underlying IEEE 754 floating-point model has this property: if you reinterpret the bits as an integer, taking the next float after (or before, depending on the direction) is just like taking the next (or previous) integer, that is, adding 1 to or subtracting 1 from the bit pattern reinterpreted as an integer.
Stepping n times is adding (or subtracting) n to the bit pattern. It's as simple as that, as long as the sign does not change and you don't overflow to NaN or Inf.
And the number of distinct floats between two floats is the difference of the two integers if the signs agree.
If the signs differ, since floats use a sign-magnitude-like representation, which does not match the two's complement integer representation, you'll have to do a bit of extra arithmetic.
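One way to sketch this in Java is to map each float's bit pattern to an integer that grows monotonically with the float's value, which takes care of the sign-magnitude issue. The names FloatLeaps, toOrdered and fromOrdered are made up for illustration, and the sketch assumes the inputs are not NaN and that stepping does not run past the infinities:

```java
final class FloatLeaps {
    // Maps float bits to an int ordered like the float values.
    // Negative floats (sign bit set) are mirrored so that -0.0f and +0.0f
    // both map to 0 and -Float.MIN_VALUE maps to -1.
    static int toOrdered(float f) {
        int bits = Float.floatToIntBits(f);
        return bits < 0 ? Integer.MIN_VALUE - bits : bits;
    }

    // Inverse of toOrdered (0 maps back to +0.0f).
    static float fromOrdered(int ordered) {
        int bits = ordered < 0 ? Integer.MIN_VALUE - ordered : ordered;
        return Float.intBitsToFloat(bits);
    }

    // Moves value by the given number of leaps (negative means toward -Infinity).
    static float moveFloat(float value, int leaps) {
        return fromOrdered(toOrdered(value) + leaps);
    }

    // Signed number of leaps from value to destination; long avoids int overflow.
    static long getLeaps(float value, float destination) {
        return (long) toOrdered(destination) - toOrdered(value);
    }
}
```

Under these assumptions, moveFloat(x, 1) agrees with Math.nextUp(x) for finite nonzero x, and getLeaps runs in constant time instead of looping over every intermediate float. Note that this mapping treats -0.0f and +0.0f as the same position.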

I want to do the same calculation. So, if "leaps" means, as @aka.nice said, the integer difference/span/distance between two floating-point values according to the IEEE 754 floating-point "single format" bit layout (IEEE754 Format), I may have found a simple method:
public static native int floatToRawIntBits(float value) and Java_java_lang_Float_floatToRawIntBits can be used for this purpose; they do much the same thing as my test code below, which reinterprets the float's memory as an integer (like reinterpret_cast in C++).
#include <stdio.h>
/* https://stackoverflow.com/questions/44008357/adding-and-subtracting-exact-values-to-float */
int main(void) {
float float0 = 1.5f;
float float1 = 1.5000001f;
int intbits_of_float0 = *(int *)&float0;
int intbits_of_float1 = *(int *)&float1;
printf("float %.17g is reinterpreted as an integer %d\n", float0, intbits_of_float0);
printf("float %.17g is reinterpreted as an integer %d\n", float1, intbits_of_float1);
return 0;
}
And the Java code (online compiler) below is used to calculate the "leaps":
public class Toy {
public static void main(String args[]) {
int length = 0x82000000;
int x = length >>> 24;
int y = (length >>> 24) & 0xFF;
System.out.println("length = " + length + ", x = " + x + ", y = " + y);
float float0 = 1.5f;
float float1 = 1.5000001f;
float float2 = 1.5000002f;
float float4 = 1.5000004f;
float float5 = 1.5000005f;
// testLeaps(float0, float4);
// testLeaps(0, float5);
// testLeaps(0, -float1);
// testLeaps(-float1, 0);
System.out.println(Math.nextAfter(-float1, Float.POSITIVE_INFINITY));
System.out.println(INT_POWER_MASK & Float.floatToIntBits(-float0));
System.out.println(INT_POWER_MASK & Float.floatToIntBits(float0));
// testLeaps(-float1, -float0);
testLeaps(-float0, 0);
testLeaps(float0, 0);
}
public static void testLeaps(float value, float destination) {
System.out.println("optLeaps(" + value + ", " + destination + ") = " + optLeaps(value, destination));
System.out.println("getLeaps(" + value + ", " + destination + ") = " + getLeaps(value, destination));
}
public static final int INT_POWER_MASK = 0x7f800000 | 0x007fffff; // ~0x80000000
/**
* Retrieves the integer difference between two float-point values according to
* the IEEE 754 floating-point "single format" bit layout.
*
* <pre>
* mask 0x80000000 | 0x7f800000 | 0x007fffff
* sign | exponent | coefficient/significand/mantissa
* +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
* | | | |
* +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
* 31 30 23 22 0
* 0x7fc00000 => NaN
* 0x7f800000 +Infinity
* 0xff800000 -Infinity
* </pre>
*
* The format uses base (radix) 2: the numerical value of a normal float is
* (-1)^sign x 1.significand x 2^(exponent - 127). For values of the same sign,
* the bit patterns are ordered like integers, which is what the leap count relies on.
*
* @param value the first operand
* @param destination the second operand
* @return the integer span from {@code value} to {@code destination}
*/
public static int optLeaps(float value, float destination) {
// TODO process possible cases for some special inputs.
int valueBits = Float.floatToIntBits(value); // IEEE 754 floating-point "single format" bit layout
int destinationBits = Float.floatToIntBits(destination); // IEEE 754 floating-point "single format" bit layout
int leaps; // Float.intBitsToFloat();
if ((destinationBits ^ valueBits) >= 0) {
leaps = Math.abs(destinationBits - valueBits);
} else {
// Parentheses are required: + binds tighter than & in Java.
leaps = (INT_POWER_MASK & destinationBits) + (INT_POWER_MASK & valueBits);
}
return leaps;
}
public static int getLeaps(float value, float destination) {
int leaps = 0;
float signum = Math.signum(destination - value);
// float direction = Float.POSITIVE_INFINITY * signum;
// while (value * signum < destination * signum) {
// value = Math.nextAfter(value, direction); // Float.POSITIVE_INFINITY * direction
// leaps++;
// }
if (0 == signum) {
return 0;
}
if (0 < signum) {
while (value < destination) {
value = Math.nextAfter(value, Float.POSITIVE_INFINITY);
leaps++;
}
} else {
while (value > destination) {
value = Math.nextAfter(value, Float.NEGATIVE_INFINITY);
leaps++;
}
}
return leaps;
}
// optimize to reduce the elapsed time by roughly half
}

To start, I just want to say I don't like hacking into an object's implementation, and you should try using your own (or another library's) implementation first, but sometimes you have to get creative.
Let's start with the key detail here: what you call a "leap" (I would call it rounding error). So what is it, and why is there rounding error? Floats (and doubles) are stored as integer x base^exponent (IEEE standard). So using base 10, if you have 1.2340 x 10^3 (or 1,234.0), your "leap" will be 0.1, since that is the size of your least significant digit (in storage, the . is implied).
(And I'm out, too much black magic here for me)

Numerical accuracy with log probability Java implementation

Sometimes when you do calculations with very small probabilities using common data types such as doubles, numerical inaccuracies cascade over multiple calculations and lead to incorrect results. Because of this it is recommended to use log probabilities, which improve numerical stability. I have implemented log probabilities in Java and my implementation works, but it has worse numerical stability than using raw doubles. What is wrong with my implementation? What is an accurate and efficient way to perform many consecutive calculations with small probabilities in Java?
I'm unable to provide a neatly contained demonstration of this problem because the inaccuracies cascade over many calculations. However, here is proof that a problem exists: this submission to a CodeForces contest fails due to numerical accuracy. Running test #7 and adding debug prints clearly show that from day 1774, numerical errors begin cascading until the sum of probabilities drops to 0 (when it should be 1). After replacing my Prob class with a simple wrapper over doubles the exact same solution passes tests.
My implementation of multiplying probabilities:
log(a * b) = Math.log(a) + Math.log(b)
My implementation of addition:
log(a + b) = Math.log(a) + Math.log(1 + Math.exp(Math.log(b) - Math.log(a)))
The stability problem is most likely contained within those 2 lines, but here is my entire implementation:
class Prob {
/** Math explained: https://en.wikipedia.org/wiki/Log_probability
* Quick start:
* - Instantiate probabilities, eg. Prob a = new Prob(0.75)
* - add(), multiply() return new objects, can perform on nulls & NaNs.
* - get() returns probability as a readable double */
/** Logarithmized probability. Note: 0% represented by logP NaN. */
private double logP;
/** Construct instance with real probability. */
public Prob(double real) {
if (real > 0) this.logP = Math.log(real);
else this.logP = Double.NaN;
}
/** Construct instance with already logarithmized value. */
static boolean dontLogAgain = true;
public Prob(double logP, boolean anyBooleanHereToChooseThisConstructor) {
this.logP = logP;
}
/** Returns real probability as a double. */
public double get() {
return Math.exp(logP);
}
@Override
public String toString() {
return ""+get();
}
/***************** STATIC METHODS BELOW ********************/
/** Note: returns NaN only when a && b are both NaN/null. */
public static Prob add(Prob a, Prob b) {
if (nullOrNaN(a) && nullOrNaN(b)) return new Prob(Double.NaN, dontLogAgain);
if (nullOrNaN(a)) return copy(b);
if (nullOrNaN(b)) return copy(a);
double x = a.logP;
double y = b.logP;
double sum = x + Math.log(1 + Math.exp(y - x));
return new Prob(sum, dontLogAgain);
}
/** Note: multiplying by null or NaN produces NaN (repping 0% real prob). */
public static Prob multiply(Prob a, Prob b) {
if (nullOrNaN(a) || nullOrNaN(b)) return new Prob(Double.NaN, dontLogAgain);
return new Prob(a.logP + b.logP, dontLogAgain);
}
/** Returns true if p is null or NaN. */
private static boolean nullOrNaN(Prob p) {
return (p == null || Double.isNaN(p.logP));
}
/** Returns a new instance with the same value as original. */
private static Prob copy(Prob original) {
return new Prob(original.logP, dontLogAgain);
}
}

The problem was caused by the way Math.exp(z) was used in this line:
log(a + b) = Math.log(a) + Math.log(1 + Math.exp(Math.log(b) - Math.log(a)))
When z reaches extreme values, the range of double is not enough for the output of Math.exp(z). This causes us to lose information and produce an inaccurate result, and these results then cascade over multiple calculations.
When z >= 710 then Math.exp(z) = Infinity
When z <= -746 then Math.exp(z) = 0
In the original code I was calling Math.exp with y - x and arbitrarily choosing which is x and which is y. Let's instead choose y and x based on which is larger, so that z is never positive. The point where the result leaves the representable range is further out on the negative side (746 rather than 710) and, more importantly, when that happens we underflow to 0 rather than overflow to Infinity, which is what we want with a low probability.
double x = Math.max(a.logP, b.logP);
double y = Math.min(a.logP, b.logP);
double sum = x + Math.log(1 + Math.exp(y - x));
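For reference, here is a self-contained sketch of the same max/min trick as a standalone helper (LogSpace and logAdd are hypothetical names). It differs from the Prob class above in two ways that are my own assumptions, not part of the original: a zero probability is represented by Double.NEGATIVE_INFINITY (which is what Math.log(0) returns) instead of NaN, and Math.log1p is used in place of Math.log(1 + ...) because it is more accurate when the exponential term is tiny:

```java
final class LogSpace {
    // Returns log(a + b) given logA = log(a) and logB = log(b).
    // Assumption: probability 0 is encoded as Double.NEGATIVE_INFINITY.
    static double logAdd(double logA, double logB) {
        double hi = Math.max(logA, logB);
        double lo = Math.min(logA, logB);
        if (hi == Double.NEGATIVE_INFINITY) {
            return hi; // both operands are zero probabilities
        }
        // lo - hi <= 0, so Math.exp can only underflow to 0, never overflow.
        return hi + Math.log1p(Math.exp(lo - hi));
    }
}
```

For example, logAdd(Math.log(0.25), Math.log(0.5)) is approximately Math.log(0.75), and logAdd(-1000, -1001) stays finite even though Math.exp(-1000) on its own would underflow to 0.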

Handle equal case in float comparison

I have some items which have an id and a value and I am looking for the maximum element.
The values are floats/doubles and as tie breaking I want to use the object with the smaller id.
One approach is the following:
double maxValue = Double.NEGATIVE_INFINITY;
Item maxItem = null;
for (Item item : items) {
if (item.value() > maxValue) {
maxValue = item.value();
maxItem = item;
} else if (item.value() == maxValue && item.id() < maxItem.id()) {
maxItem = item;
}
}
However, this includes an equality comparison using floating point numbers, which is discouraged and in my case also creates a critical issue in the code analysis step.
Of course, I can write something to avoid the issue, e.g. use >= for the second comparison, however from the point of readability my future me or any other reader might wonder if it is a bug.
My question: Is there an approach which expresses well the intent and also avoids float comparison using == for this task?

In this case there is nothing wrong with testing equality of two floating point values. Two float/double values can be equal, and this can be tested with the == operator.
One reason that using == is discouraged for floating point values is that comparing them with literal constants can lead to unexpected behavior because the literals are usually written in decimal notation, whereas the floating point variables are stored in binary. Not all decimal values can be exactly represented in binary, and therefore the value of the variables are only approximations of the decimal values. For example: 3.0 * 0.1 == 0.3 evaluates to false.
Another reason is that floating point values do not always behave like real numbers. In particular, although IEEE 754 addition and multiplication are commutative (x * y == y * x), they are not necessarily associative ((x * y) * z == x * (y * z)). For example, (0.3 * 0.2) * 0.1 == 0.3 * (0.2 * 0.1) evaluates to false.
In your case, however, there is no reason not to use ==.
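A small sketch of the effects described above (the class and method names are invented for the demo). It contrasts comparing against a decimal literal and regrouping an expression, both of which can give surprising results, with comparing a stored value against itself, which is what the max search in the question does:

```java
final class FloatIdentityDemo {
    // 3.0 * 0.1 is rounded to a double slightly above the double nearest to 0.3.
    static boolean literalProductIsExact() {
        return 3.0 * 0.1 == 0.3;
    }

    // The grouping changes where rounding happens, so the two sums differ.
    static boolean additionIsAssociativeHere() {
        return (0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3);
    }

    // A stored (non-NaN) value always compares equal to a copy of itself.
    static boolean storedValueEqualsCopy(double v) {
        double copy = v;
        return copy == v;
    }
}
```

The first two methods return false, while the third returns true for any value other than NaN.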

You could write a Comparator<Item> and then use it to find the greatest item:
Comparator<Item> byValueAscThenIdDesc = (i1, i2) -> {
int valueComparison = Double.compare(i1.value(), i2.value());
if(valueComparison == 0) {
int idComparison = Integer.compare(i1.id(), i2.id());
return -idComparison;
}
return valueComparison;
};
List<Item> items = new ArrayList<>();
Item max = items.stream().max(byValueAscThenIdDesc).get();
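The same ordering can probably be written more compactly with the Comparator factory methods. This is only a sketch: the Item class below is a stand-in I made up with the id() and value() accessors used in the question, and maxItem returns an Optional so that an empty list does not throw:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

final class MaxItemDemo {
    // Hypothetical stand-in for the question's Item type.
    static final class Item {
        private final int id;
        private final double value;
        Item(int id, double value) { this.id = id; this.value = value; }
        int id() { return id; }
        double value() { return value; }
    }

    // Greater value wins; on equal values the smaller id counts as "greater".
    static final Comparator<Item> BY_VALUE_THEN_SMALLER_ID =
            Comparator.comparingDouble(Item::value)
                      .thenComparing(Comparator.comparingInt(Item::id).reversed());

    static Optional<Item> maxItem(List<Item> items) {
        return items.stream().max(BY_VALUE_THEN_SMALLER_ID);
    }
}
```

Keep in mind that Comparator.comparingDouble uses Double.compare, which is not identical to ==: it orders -0.0 before 0.0 and treats NaN as greater than every other value.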

Nearest multiple of a power of two fraction

Is there an optimized, performant way to round a double to the exact value nearest multiple of a given power of two fraction?
In other words, round .44 to the nearest 1/16 (in other words, to a value that can be expressed as n/16 where n is an integer) would be .4375. Note: this is relevant because power of two fractions can be stored without rounding errors, e.g.
public class PowerOfTwo {
public static void main(String... args) {
double inexact = .44;
double exact = .4375;
System.out.println(inexact + ": " + Long.toBinaryString(Double.doubleToLongBits(inexact)));
System.out.println(exact + ": " + Long.toBinaryString(Double.doubleToLongBits(exact)));
}
}
Output:
0.44: 11111111011100001010001111010111000010100011110101110000101001
0.4375: 11111111011100000000000000000000000000000000000000000000000000

If you want to choose the power of two, the simplest way is to multiply by e.g. 16, round to nearest integer, then divide by 16. Note that division by a power of two is exact if the result is a normal number. It can cause rounding error for subnormal numbers.
Here is a sample program using this technique:
public class Test {
public static void main(String[] args) {
System.out.println(roundToPowerOfTwo(0.44, 2));
System.out.println(roundToPowerOfTwo(0.44, 3));
System.out.println(roundToPowerOfTwo(0.44, 4));
System.out.println(roundToPowerOfTwo(0.44, 5));
System.out.println(roundToPowerOfTwo(0.44, 6));
System.out.println(roundToPowerOfTwo(0.44, 7));
System.out.println(roundToPowerOfTwo(0.44, 8));
}
public static double roundToPowerOfTwo(double in, int power) {
double multiplier = 1 << power;
return Math.rint(in * multiplier) / multiplier;
}
}
Output:
0.5
0.5
0.4375
0.4375
0.4375
0.4375
0.44140625

If the question is about rounding any number to a pre-determined binary precision, what you need to do is this:
Convert the value to long using Double.doubleToLongBits().
Examine the exponent: if it's too big (exponent+required precision>51, the number of bits in the significand), you won't be able to do any rounding but you won't have to: the number already satisfies your criteria.
If on the other hand exponent+required precision<0, the result of the rounding is always 0.
In any other case, look at the significand and blot out all the bits that are below the (exponent+required precision)th significant bit.
Convert the number back to double using Double.longBitsToDouble()
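The steps above could be sketched roughly as follows. The names are hypothetical, and note that blotting out the low bits truncates toward zero rather than rounding to nearest. The sketch uses Math.getExponent for the unbiased exponent and treats the stored significand as 52 bits, so the thresholds are my own reading of the steps rather than a literal transcription:

```java
final class BinaryTruncate {
    // Truncates x toward zero to a multiple of 2^-fractionBits by masking
    // the low bits of the stored significand.
    static double truncate(double x, int fractionBits) {
        // Unbiased exponent; 1024 for NaN/Infinity, -1023 for zero and subnormals.
        int exponent = Math.getExponent(x);
        // Number of stored significand bits that must be kept.
        int keep = exponent + fractionBits;
        if (keep >= 52) {
            return x; // already a multiple (also covers NaN and Infinity)
        }
        if (keep < 0) {
            return Math.copySign(0.0, x); // magnitude is below 2^-fractionBits
        }
        long mask = -1L << (52 - keep); // clears the (52 - keep) lowest bits
        return Double.longBitsToDouble(Double.doubleToLongBits(x) & mask);
    }
}
```

For instance, truncate(0.44, 4) gives 0.4375 (that is, 7/16), and truncate(5.3, 1) gives 5.0.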

Getting this right in all corner cases is a bit tricky. If I have to solve such a task, I'd usually start with a naive implementation that I can be pretty sure will be correct and only then start implementing an optimized version. While doing so, I can always compare against the naive approach to validate my results.
The naive approach is to start with 1 and multiply / divide it with / by 2 until we have bracketed the absolute value of the input. Then, we'll output the nearer of the boundaries. It's actually a bit more complicated: If the value is a NaN or infinity, it requires special treatment.
Here is the code:
public static double getClosestPowerOf2Loop(final double x) {
final double absx = Math.abs(x);
double prev = 1.0;
double next = 1.0;
if (Double.isInfinite(x) || Double.isNaN(x)) {
return x;
} else if (absx < 1.0) {
do {
prev = next;
next /= 2.0;
} while (next > absx);
} else if (absx > 1.0) {
do {
prev = next;
next *= 2.0;
} while (next < absx);
}
if (x < 0.0) {
prev = -prev;
next = -next;
}
return (Math.abs(next - x) < Math.abs(prev - x)) ? next : prev;
}
I hope the code will be clear without further explanation. Since Java 8, you can use !Double.isFinite(x) as a replacement for Double.isInfinite(x) || Double.isNaN(x).
Let's look at an optimized version. As other answers have already suggested, we should probably look at the bit representation. Java requires floating point values to be represented using IEEE 754. In that format, numbers in double (64 bit) precision are represented as
1 bit sign,
11 bits exponent and
52 bits mantissa.
We will special-case NaNs and infinities (which are represented by special bit patterns) again. However, there is yet another exception: The most significant bit of the mantissa is implicitly 1 and not found in the bit pattern, except for very small numbers, where a so-called subnormal representation is used in which the most significant digit is not the most significant bit of the mantissa. Therefore, for normal numbers we will simply set the mantissa's bits to all 0, but for subnormals, we convert it to a number where none but the most significant 1 bit is preserved. This procedure always rounds towards zero, so to get the other bound, we simply multiply by 2.
Let's see how this all works together:
public static double getClosestPowerOf2Bits(final double x) {
if (Double.isInfinite(x) || Double.isNaN(x)) {
return x;
} else {
final long bits = Double.doubleToLongBits(x);
final long signexp = bits & 0xfff0000000000000L;
final long mantissa = bits & 0x000fffffffffffffL;
final long mantissaPrev = Math.abs(x) < Double.MIN_NORMAL
? Long.highestOneBit(mantissa)
: 0x0000000000000000L;
final double prev = Double.longBitsToDouble(signexp | mantissaPrev);
final double next = 2.0 * prev;
return (Math.abs(next - x) < Math.abs(prev - x)) ? next : prev;
}
}
I'm not entirely sure I have covered all corner cases but the following tests do run:
public static void main(final String[] args) {
final double[] values = {
5.0, 4.1, 3.9, 1.0, 0.0, -0.1, -8.0, -8.1, -7.9,
0.9 * Double.MIN_NORMAL, -0.9 * Double.MIN_NORMAL,
Double.NaN, Double.MAX_VALUE, Double.MIN_VALUE,
Double.NEGATIVE_INFINITY, Double.POSITIVE_INFINITY,
};
for (final double value : values) {
final double powerL = getClosestPowerOf2Loop(value);
final double powerB = getClosestPowerOf2Bits(value);
System.out.printf("%17.10g --> %17.10g %17.10g%n",
value, powerL, powerB);
assert Double.doubleToLongBits(powerL) == Double.doubleToLongBits(powerB);
}
}
Output:
5.000000000 --> 4.000000000 4.000000000
4.100000000 --> 4.000000000 4.000000000
3.900000000 --> 4.000000000 4.000000000
1.000000000 --> 1.000000000 1.000000000
0.000000000 --> 0.000000000 0.000000000
-0.1000000000 --> -0.1250000000 -0.1250000000
-8.000000000 --> -8.000000000 -8.000000000
-8.100000000 --> -8.000000000 -8.000000000
-7.900000000 --> -8.000000000 -8.000000000
2.002566473e-308 --> 2.225073859e-308 2.225073859e-308
-2.002566473e-308 --> -2.225073859e-308 -2.225073859e-308
NaN --> NaN NaN
1.797693135e+308 --> 8.988465674e+307 8.988465674e+307
4.900000000e-324 --> 4.900000000e-324 4.900000000e-324
-Infinity --> -Infinity -Infinity
Infinity --> Infinity Infinity
How about performance?
I have run the following benchmark
public static void main(final String[] args) {
final Random rand = new Random();
for (int i = 0; i < 1000000; ++i) {
final double value = Double.longBitsToDouble(rand.nextLong());
final double power = getClosestPowerOf2(value);
}
}
where getClosestPowerOf2 is to be replaced by either getClosestPowerOf2Loop or getClosestPowerOf2Bits. On my laptop, I get the following results:
getClosestPowerOf2Loop: 2.35 s
getClosestPowerOf2Bits: 1.80 s
Was that really worth the effort?

You are going to need some bit magic if you are going to round to arbitrary powers of 2.
You will need to inspect the exponent:
int exponent = Math.getExponent(inexact);
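Then, knowing that the significand has 53 bits (52 stored plus one implicit), you can find the bit at which you need to round.
Or, if you only need to keep a fixed number of fractional binary digits (call it bits), just do:
(double) Math.round(inexact * (1L << bits)) / (1L << bits)
The cast keeps the division in floating point; without it, dividing one long by another would truncate to an integer. I use Math.round because I expect it to be optimal for the task as opposed to trying to implement it yourself.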

Here is my first attempt at a solution, which doesn't handle all the cases in @biziclop's answer and probably does "floor" instead of "round":
public static double round(double d, int precision) {
double longPart = Math.rint(d);
double decimalOnly = d - longPart;
long bits = Double.doubleToLongBits(decimalOnly);
long mask = -1l << (54 - precision);
return Double.longBitsToDouble(bits & mask) + longPart;
}

I came across this post trying to solve a related problem: how to efficiently find the two powers of two that bracket any given regular real value. Since my program deals in many types beside doubles I needed a general solution. Someone wanting to round to the nearest power of two can get the bracketing values and choose the closest. In my case the general solution required BigDecimals. Here is the trick I used.
For numbers > 1:
int exponent = myBigDecimal.toBigInteger().bitLength() - 1;
BigDecimal lowerBound = TWO.pow(exponent);
BigDecimal upperBound = TWO.pow(exponent+1);
For numbers > 0 and < 1:
int exponent = -(BigDecimal.ONE.divide(myBigDecimal, myContext).toBigInteger().bitLength()-1);
BigDecimal lowerBound = TWO.pow(exponent-1);
BigDecimal upperBound = TWO.pow(exponent);
I have only outlined the positive case. You generally take a number and use this algorithm on its absolute value; if the original number was negative, you multiply the algorithm's result by -1. Finally, the cases where the original num == 0 or num == 1 are trivial to handle outside this algorithm. That covers the whole real number line except infinities and NaNs, which you deal with before calling this algorithm.

Is there an efficient way to check for an approximate floating point equality in C++? [duplicate]

What would be the most efficient way to compare two double or two float values?
Simply doing this is not correct:
bool CompareDoubles1 (double A, double B)
{
return A == B;
}
But something like:
bool CompareDoubles2 (double A, double B)
{
double diff = A - B;
return (diff < EPSILON) && (-diff < EPSILON);
}
Seems to waste processing.
Does anyone know a smarter float comparer?

Be extremely careful using any of the other suggestions. It all depends on context.
I have spent a long time tracing bugs in a system that presumed a==b if |a-b|<epsilon. The underlying problems were:
The implicit presumption in an algorithm that if a==b and b==c then a==c.
Using the same epsilon for lines measured in inches and lines measured in mils (.001 inch). That is a==b but 1000a!=1000b. (This is why AlmostEqual2sComplement asks for the epsilon or max ULPS).
The use of the same epsilon for both the cosine of angles and the length of lines!
Using such a compare function to sort items in a collection. (In this case using the builtin C++ operator == for doubles produced correct results.)
Like I said: it all depends on context and the expected size of a and b.
By the way, std::numeric_limits<double>::epsilon() is the "machine epsilon". It is the difference between 1.0 and the next value representable by a double. I guess that it could be used in the compare function but only if the expected values are less than 1. (This is in response to @cdv's answer...)
Also, if you basically have int arithmetic in doubles (here we use doubles to hold int values in certain cases) your arithmetic will be correct. For example 4.0/2.0 will be the same as 1.0+1.0. This is as long as you do not do things that result in fractions (4.0/3.0) or do not go outside of the size of an int.
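
The first two pitfalls above can be reproduced in a few lines. This is only a minimal sketch: nearlyEqual and breaksTransitivity are hypothetical helper names, not taken from any library, and the numbers are chosen purely for illustration.

```cpp
#include <cmath>

// Hypothetical helper: the naive absolute-epsilon "equality" discussed above.
bool nearlyEqual(double a, double b, double epsilon) {
    return std::fabs(a - b) < epsilon;
}

// Returns true when a~b and b~c hold but a~c does not, i.e. when the
// relation fails to be transitive for the given inputs.
bool breaksTransitivity(double a, double b, double c, double epsilon) {
    return nearlyEqual(a, b, epsilon) && nearlyEqual(b, c, epsilon) &&
           !nearlyEqual(a, c, epsilon);
}
```

With epsilon = 1.0, breaksTransitivity(0.0, 0.75, 1.5, 1.0) is true: each neighbouring pair is "equal" but the end points are not. Likewise nearlyEqual(1.0, 1.0005, 0.001) holds, while the same quantities scaled by 1000, nearlyEqual(1000.0, 1000.5, 0.001), do not compare equal.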

The comparison with an epsilon value is what most people do (even in game programming).
You should change your implementation a little though:
bool AreSame(double a, double b)
{
return fabs(a - b) < EPSILON;
}
Edit: Christer has added a stack of great info on this topic on a recent blog post. Enjoy.

Comparing floating point numbers for equality depends on the context. Since even changing the order of operations can produce different results, it is important to know how "equal" you want the numbers to be.
Comparing floating point numbers by Bruce Dawson is a good place to start when looking at floating point comparison.
The following definitions are from The art of computer programming by Knuth:
bool approximatelyEqual(float a, float b, float epsilon)
{
return fabs(a - b) <= ( (fabs(a) < fabs(b) ? fabs(b) : fabs(a)) * epsilon);
}
bool essentiallyEqual(float a, float b, float epsilon)
{
return fabs(a - b) <= ( (fabs(a) > fabs(b) ? fabs(b) : fabs(a)) * epsilon);
}
bool definitelyGreaterThan(float a, float b, float epsilon)
{
return (a - b) > ( (fabs(a) < fabs(b) ? fabs(b) : fabs(a)) * epsilon);
}
bool definitelyLessThan(float a, float b, float epsilon)
{
return (b - a) > ( (fabs(a) < fabs(b) ? fabs(b) : fabs(a)) * epsilon);
}
Of course, choosing epsilon depends on the context, and determines how equal you want the numbers to be.
Another method of comparing floating point numbers is to look at the ULP (units in last place) of the numbers. While not dealing specifically with comparisons, the paper What Every Computer Scientist Should Know About Floating-Point Arithmetic is a good resource for understanding how floating point works and what the pitfalls are, including what ULP is.
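
To see how the first two definitions differ, here is a small sketch that restates them with std::fmax and std::fmin so it compiles on its own. The helper onlyApproximately is a hypothetical name added for illustration.

```cpp
#include <cmath>

// Tolerance is scaled by the LARGER magnitude: the more permissive test.
bool approximatelyEqual(float a, float b, float epsilon) {
    return std::fabs(a - b) <= std::fmax(std::fabs(a), std::fabs(b)) * epsilon;
}

// Tolerance is scaled by the SMALLER magnitude: the stricter test.
bool essentiallyEqual(float a, float b, float epsilon) {
    return std::fabs(a - b) <= std::fmin(std::fabs(a), std::fabs(b)) * epsilon;
}

// Hypothetical helper: true when the pair passes the permissive test only.
bool onlyApproximately(float a, float b, float epsilon) {
    return approximatelyEqual(a, b, epsilon) && !essentiallyEqual(a, b, epsilon);
}
```

For a = 16, b = 15 and epsilon = 0.064f the difference is 1, the larger-magnitude bound is about 1.024 and the smaller-magnitude bound is about 0.96, so onlyApproximately(16.0f, 15.0f, 0.064f) is true. Anything that is essentially equal is also approximately equal, but not the other way round.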

I found that the Google C++ Testing Framework contains a nice cross-platform template-based implementation of AlmostEqual2sComplement which works on both doubles and floats. Given that it is released under the BSD license, using it in your own code should be no problem, as long as you retain the license. I extracted the code below from http://code.google.com/p/googletest/source/browse/trunk/include/gtest/internal/gtest-internal.h (now at https://github.com/google/googletest/blob/master/googletest/include/gtest/internal/gtest-internal.h) and added the license on top.
Be sure to #define GTEST_OS_WINDOWS to some value (or to change the code where it's used to something that fits your codebase - it's BSD licensed after all).
Usage example:
double left = // something
double right = // something
const FloatingPoint<double> lhs(left), rhs(right);
if (lhs.AlmostEquals(rhs)) {
//they're equal!
}
Here's the code:
// Copyright 2005, Google Inc.
// All rights reserved.
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// * Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
// * Redistributions in binary form must reproduce the above
// copyright notice, this list of conditions and the following disclaimer
// in the documentation and/or other materials provided with the
// distribution.
// * Neither the name of Google Inc. nor the names of its
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
// "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
// LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
// A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
// OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
// SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
// LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
// DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
// THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Authors: wan#google.com (Zhanyong Wan), eefacm#gmail.com (Sean Mcafee)
//
// The Google C++ Testing Framework (Google Test)
// This template class serves as a compile-time function from size to
// type. It maps a size in bytes to a primitive type with that
// size. e.g.
//
// TypeWithSize<4>::UInt
//
// is typedef-ed to be unsigned int (unsigned integer made up of 4
// bytes).
//
// Such functionality should belong to STL, but I cannot find it
// there.
//
// Google Test uses this class in the implementation of floating-point
// comparison.
//
// For now it only handles UInt (unsigned int) as that's all Google Test
// needs. Other types can be easily added in the future if need
// arises.
template <size_t size>
class TypeWithSize {
public:
// This prevents the user from using TypeWithSize<N> with incorrect
// values of N.
typedef void UInt;
};
// The specialization for size 4.
template <>
class TypeWithSize<4> {
public:
// unsigned int has size 4 in both gcc and MSVC.
//
// As base/basictypes.h doesn't compile on Windows, we cannot use
// uint32, uint64, and etc here.
typedef int Int;
typedef unsigned int UInt;
};
// The specialization for size 8.
template <>
class TypeWithSize<8> {
public:
#if GTEST_OS_WINDOWS
typedef __int64 Int;
typedef unsigned __int64 UInt;
#else
typedef long long Int; // NOLINT
typedef unsigned long long UInt; // NOLINT
#endif // GTEST_OS_WINDOWS
};
// This template class represents an IEEE floating-point number
// (either single-precision or double-precision, depending on the
// template parameters).
//
// The purpose of this class is to do more sophisticated number
// comparison. (Due to round-off error, etc, it's very unlikely that
// two floating-points will be equal exactly. Hence a naive
// comparison by the == operation often doesn't work.)
//
// Format of IEEE floating-point:
//
// The most-significant bit being the leftmost, an IEEE
// floating-point looks like
//
// sign_bit exponent_bits fraction_bits
//
// Here, sign_bit is a single bit that designates the sign of the
// number.
//
// For float, there are 8 exponent bits and 23 fraction bits.
//
// For double, there are 11 exponent bits and 52 fraction bits.
//
// More details can be found at
// http://en.wikipedia.org/wiki/IEEE_floating-point_standard.
//
// Template parameter:
//
// RawType: the raw floating-point type (either float or double)
template <typename RawType>
class FloatingPoint {
public:
// Defines the unsigned integer type that has the same size as the
// floating point number.
typedef typename TypeWithSize<sizeof(RawType)>::UInt Bits;
// Constants.
// # of bits in a number.
static const size_t kBitCount = 8*sizeof(RawType);
// # of fraction bits in a number.
static const size_t kFractionBitCount =
std::numeric_limits<RawType>::digits - 1;
// # of exponent bits in a number.
static const size_t kExponentBitCount = kBitCount - 1 - kFractionBitCount;
// The mask for the sign bit.
static const Bits kSignBitMask = static_cast<Bits>(1) << (kBitCount - 1);
// The mask for the fraction bits.
static const Bits kFractionBitMask =
~static_cast<Bits>(0) >> (kExponentBitCount + 1);
// The mask for the exponent bits.
static const Bits kExponentBitMask = ~(kSignBitMask | kFractionBitMask);
// How many ULP's (Units in the Last Place) we want to tolerate when
// comparing two numbers. The larger the value, the more error we
// allow. A 0 value means that two numbers must be exactly the same
// to be considered equal.
//
// The maximum error of a single floating-point operation is 0.5
// units in the last place. On Intel CPU's, all floating-point
// calculations are done with 80-bit precision, while double has 64
// bits. Therefore, 4 should be enough for ordinary use.
//
// See the following article for more details on ULP:
// http://www.cygnus-software.com/papers/comparingfloats/comparingfloats.htm.
static const size_t kMaxUlps = 4;
// Constructs a FloatingPoint from a raw floating-point number.
//
// On an Intel CPU, passing a non-normalized NAN (Not a Number)
// around may change its bits, although the new value is guaranteed
// to be also a NAN. Therefore, don't expect this constructor to
// preserve the bits in x when x is a NAN.
explicit FloatingPoint(const RawType& x) { u_.value_ = x; }
// Static methods
// Reinterprets a bit pattern as a floating-point number.
//
// This function is needed to test the AlmostEquals() method.
static RawType ReinterpretBits(const Bits bits) {
FloatingPoint fp(0);
fp.u_.bits_ = bits;
return fp.u_.value_;
}
// Returns the floating-point number that represent positive infinity.
static RawType Infinity() {
return ReinterpretBits(kExponentBitMask);
}
// Non-static methods
// Returns the bits that represents this number.
const Bits &bits() const { return u_.bits_; }
// Returns the exponent bits of this number.
Bits exponent_bits() const { return kExponentBitMask & u_.bits_; }
// Returns the fraction bits of this number.
Bits fraction_bits() const { return kFractionBitMask & u_.bits_; }
// Returns the sign bit of this number.
Bits sign_bit() const { return kSignBitMask & u_.bits_; }
// Returns true iff this is NAN (not a number).
bool is_nan() const {
// It's a NAN if the exponent bits are all ones and the fraction
// bits are not entirely zeros.
return (exponent_bits() == kExponentBitMask) && (fraction_bits() != 0);
}
// Returns true iff this number is at most kMaxUlps ULP's away from
// rhs. In particular, this function:
//
// - returns false if either number is (or both are) NAN.
// - treats really large numbers as almost equal to infinity.
// - thinks +0.0 and -0.0 are 0 ULP's apart.
bool AlmostEquals(const FloatingPoint& rhs) const {
// The IEEE standard says that any comparison operation involving
// a NAN must return false.
if (is_nan() || rhs.is_nan()) return false;
return DistanceBetweenSignAndMagnitudeNumbers(u_.bits_, rhs.u_.bits_)
<= kMaxUlps;
}
private:
// The data type used to store the actual floating-point number.
union FloatingPointUnion {
RawType value_; // The raw floating-point number.
Bits bits_; // The bits that represent the number.
};
// Converts an integer from the sign-and-magnitude representation to
// the biased representation. More precisely, let N be 2 to the
// power of (kBitCount - 1), an integer x is represented by the
// unsigned number x + N.
//
// For instance,
//
// -N + 1 (the most negative number representable using
// sign-and-magnitude) is represented by 1;
// 0 is represented by N; and
// N - 1 (the biggest number representable using
// sign-and-magnitude) is represented by 2N - 1.
//
// Read http://en.wikipedia.org/wiki/Signed_number_representations
// for more details on signed number representations.
static Bits SignAndMagnitudeToBiased(const Bits &sam) {
if (kSignBitMask & sam) {
// sam represents a negative number.
return ~sam + 1;
} else {
// sam represents a positive number.
return kSignBitMask | sam;
}
}
// Given two numbers in the sign-and-magnitude representation,
// returns the distance between them as an unsigned number.
static Bits DistanceBetweenSignAndMagnitudeNumbers(const Bits &sam1,
const Bits &sam2) {
const Bits biased1 = SignAndMagnitudeToBiased(sam1);
const Bits biased2 = SignAndMagnitudeToBiased(sam2);
return (biased1 >= biased2) ? (biased1 - biased2) : (biased2 - biased1);
}
FloatingPointUnion u_;
};
EDIT: This post is 4 years old. It's probably still valid, and the code is nice, but some people found improvements. Best go get the latest version of AlmostEquals right from the Google Test source code, and not the one I pasted up here.

For a more in depth approach read Comparing floating point numbers. Here is the code snippet from that link:
// Usable AlmostEqual function
bool AlmostEqual2sComplement(float A, float B, int maxUlps)
{
// Make sure maxUlps is non-negative and small enough that the
// default NAN won't compare as equal to anything.
assert(maxUlps > 0 && maxUlps < 4 * 1024 * 1024);
int aInt = *(int*)&A;
// Make aInt lexicographically ordered as a twos-complement int
if (aInt < 0)
aInt = 0x80000000 - aInt;
// Make bInt lexicographically ordered as a twos-complement int
int bInt = *(int*)&B;
if (bInt < 0)
bInt = 0x80000000 - bInt;
int intDiff = abs(aInt - bInt);
if (intDiff <= maxUlps)
return true;
return false;
}
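
The pointer cast *(int*)&A in the snippet above is generally considered a strict-aliasing violation in C++, and the subtraction from 0x80000000 relies on signed wrap-around. A minimal sketch of the same idea that copies the bits with std::memcpy and works in unsigned arithmetic could look like this. It assumes a 32-bit IEEE 754 float; orderedBits and almostEqualUlps are hypothetical names, not part of the linked article.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "sketch assumes 32-bit float");

// Map the sign-and-magnitude bit pattern of a float onto unsigned integers
// that increase monotonically with the float's value (+0.0 and -0.0 coincide).
static std::uint32_t orderedBits(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);            // well-defined way to read the bits
    const std::uint32_t signBit = 0x80000000u;
    return (u & signBit) ? (~u + 1u)          // negative: mirror below the midpoint
                         : (u | signBit);     // positive: shift above the midpoint
}

// True when a and b are at most maxUlps representable floats apart.
bool almostEqualUlps(float a, float b, std::uint32_t maxUlps) {
    if (std::isnan(a) || std::isnan(b)) return false;  // NaN equals nothing
    const std::uint32_t ua = orderedBits(a);
    const std::uint32_t ub = orderedBits(b);
    const std::uint32_t distance = (ua >= ub) ? (ua - ub) : (ub - ua);
    return distance <= maxUlps;
}
```

For example, 1.0f and std::nextafter(1.0f, 2.0f) are one ULP apart and compare equal with maxUlps = 1, while 1.0f and 1.001f are thousands of ULPs apart and do not compare equal with maxUlps = 4.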

I realize this is an old thread, but this article is one of the most straightforward ones I have found on comparing floating point numbers, and it has more detailed references if you want to explore further. The main site also covers a complete range of issues dealing with floating point numbers: The Floating-Point Guide: Comparison.
A somewhat more practical article is Floating-point tolerances revisited. It notes that there is an absolute tolerance test, which boils down to this in C++:
bool absoluteToleranceCompare(double x, double y)
{
return std::fabs(x - y) <= std::numeric_limits<double>::epsilon() ;
}
and relative tolerance test:
bool relativeToleranceCompare(double x, double y)
{
double maxXY = std::max( std::fabs(x) , std::fabs(y) ) ;
return std::fabs(x - y) <= std::numeric_limits<double>::epsilon()*maxXY ;
}
The article notes that the absolute test fails when x and y are large, and the relative test fails when they are small. Assuming the absolute and relative tolerances are the same, a combined test would look like this:
bool combinedToleranceCompare(double x, double y)
{
double maxXYOne = std::max( { 1.0, std::fabs(x) , std::fabs(y) } ) ;
return std::fabs(x - y) <= std::numeric_limits<double>::epsilon()*maxXYOne ;
}
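
To make the failure modes concrete, here is a small sketch that restates the three tests under shorter, hypothetical names (absoluteTol, relativeTol, combinedTol) so it compiles on its own. The sample values in the note after the code are chosen only for illustration.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>

constexpr double kEps = std::numeric_limits<double>::epsilon();

// Absolute test: fixed tolerance regardless of magnitude.
bool absoluteTol(double x, double y) {
    return std::fabs(x - y) <= kEps;
}

// Relative test: tolerance scaled by the larger magnitude.
bool relativeTol(double x, double y) {
    return std::fabs(x - y) <= kEps * std::max(std::fabs(x), std::fabs(y));
}

// Combined test: scale factor never drops below 1.0.
bool combinedTol(double x, double y) {
    return std::fabs(x - y) <= kEps * std::max({1.0, std::fabs(x), std::fabs(y)});
}
```

Take 1e6 and its immediate neighbour std::nextafter(1e6, 2e6): the absolute test rejects the pair even though no double lies between them, while the relative and combined tests accept it. Take 1e-20 and 0.0: the relative test rejects the pair, while the absolute and combined tests accept it.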

I ended up spending quite some time going through material in this great thread. I doubt everyone wants to spend so much time so I would highlight the summary of what I learned and the solution I implemented.
Quick Summary
Is 1e-8 approximately the same as 1e-16? If you are looking at noisy sensor data then probably yes, but if you are doing molecular simulation then maybe not! Bottom line: you always need to think of the tolerance value in the context of a specific function call, not make it a generic app-wide hard-coded constant.
For general library functions, it's still nice to have a parameter with a default tolerance. A typical choice is numeric_limits::epsilon(), which is the same as FLT_EPSILON in float.h. This is problematic, however, because the epsilon for comparing values like 1.0 is not the same as the epsilon for values like 1E9. FLT_EPSILON is defined for 1.0.
The obvious implementation to check whether a number is within tolerance is fabs(a-b) <= epsilon, but this doesn't work because the default epsilon is defined for 1.0. We need to scale epsilon up or down in terms of a and b.
There are two solutions to this problem: either you set epsilon proportional to max(a,b), or you get the next representable numbers around a and then see whether b falls into that range. The former is called the "relative" method and the latter the ULP method.
Both methods actually fail when comparing with 0. In this case, the application must supply the correct tolerance.
Utility Functions Implementation (C++11)
//implements relative method - do not use for comparing with zero
//use this most of the time, tolerance needs to be meaningful in your context
template<typename TReal>
static bool isApproximatelyEqual(TReal a, TReal b, TReal tolerance = std::numeric_limits<TReal>::epsilon())
{
TReal diff = std::fabs(a - b);
if (diff <= tolerance)
return true;
if (diff < std::fmax(std::fabs(a), std::fabs(b)) * tolerance)
return true;
return false;
}
//supply tolerance that is meaningful in your context
//for example, default tolerance may not work if you are comparing double with float
template<typename TReal>
static bool isApproximatelyZero(TReal a, TReal tolerance = std::numeric_limits<TReal>::epsilon())
{
if (std::fabs(a) <= tolerance)
return true;
return false;
}
//use this when you want to be on safe side
//for example, don't start rover unless signal is above 1
template<typename TReal>
static bool isDefinitelyLessThan(TReal a, TReal b, TReal tolerance = std::numeric_limits<TReal>::epsilon())
{
// mirror of isDefinitelyGreaterThan: b must exceed a by the tolerance
TReal diff = b - a;
if (diff > tolerance)
return true;
if (diff > std::fmax(std::fabs(a), std::fabs(b)) * tolerance)
return true;
return false;
}
template<typename TReal>
static bool isDefinitelyGreaterThan(TReal a, TReal b, TReal tolerance = std::numeric_limits<TReal>::epsilon())
{
TReal diff = a - b;
if (diff > tolerance)
return true;
if (diff > std::fmax(std::fabs(a), std::fabs(b)) * tolerance)
return true;
return false;
}
//implements ULP method
//use this when you are only concerned about floating point precision issue
//for example, if you want to see if a is 1.0 by checking if its within
//10 closest representable floating point numbers around 1.0.
template<typename TReal>
static bool isWithinPrecisionInterval(TReal a, TReal b, unsigned int interval_size = 1)
{
TReal min_a = a - (a - std::nextafter(a, std::numeric_limits<TReal>::lowest())) * interval_size;
TReal max_a = a + (std::nextafter(a, std::numeric_limits<TReal>::max()) - a) * interval_size;
return min_a <= b && max_a >= b;
}

The portable way to get epsilon in C++ is
#include <limits>
std::numeric_limits<double>::epsilon()
Then the comparison function becomes
#include <cmath>
#include <limits>
bool AreSame(double a, double b) {
return std::fabs(a - b) < std::numeric_limits<double>::epsilon();
}

The code you wrote is bugged :
return (diff < EPSILON) && (-diff > EPSILON);
The correct code would be :
return (diff < EPSILON) && (diff > -EPSILON);
(...and yes this is different)
I wonder if fabs wouldn't make you lose short-circuit (lazy) evaluation in some cases. I would say it depends on the compiler. You might want to try both. If they are equivalent on average, take the implementation with fabs.
If you have some info on which of the two floats is more likely to be bigger than the other, you can play on the order of the comparisons to take better advantage of the short-circuit evaluation.
Finally you might get better result by inlining this function. Not likely to improve much though...
Edit: OJ, thanks for correcting your code. I erased my comment accordingly

return fabs(a - b) < EPSILON;
This is fine if:
the order of magnitude of your inputs doesn't change much
very small numbers of opposite signs can be treated as equal
But otherwise it'll lead you into trouble. Double precision numbers have a resolution of about 16 decimal places. If the two numbers you are comparing are larger in magnitude than EPSILON*1.0E16, then you might as well be saying:
return a==b;
I'll examine a different approach that assumes you need to worry about the first issue and that the second is fine for your application. A solution would be something like:
#define VERYSMALL (1.0E-150)
#define EPSILON (1.0E-8)
bool AreSame(double a, double b)
{
double absDiff = fabs(a - b);
if (absDiff < VERYSMALL)
{
return true;
}
double maxAbs = max(fabs(a), fabs(b));
return (absDiff/maxAbs) < EPSILON;
}
This is expensive computationally, but it is sometimes what is called for. This is what we have to do at my company because we deal with an engineering library and inputs can vary by a few dozen orders of magnitude.
Anyway, the point is this (and applies to practically every programming problem): Evaluate what your needs are, then come up with a solution to address your needs -- don't assume the easy answer will address your needs. If after your evaluation you find that fabs(a-b) < EPSILON will suffice, perfect -- use it! But be aware of its shortcomings and other possible solutions too.

As others have pointed out, using a fixed-exponent epsilon (such as 0.0000001) will be useless for values away from the epsilon value. For example, if your two values are 10000.000977 and 10000, then there are NO 32-bit floating-point values between these two numbers -- 10000 and 10000.000977 are as close as you can possibly get without being bit-for-bit identical. Here, an epsilon of less than 0.0009 is meaningless; you might as well use the straight equality operator.
Likewise, as the two values approach epsilon in size, the relative error grows to 100%.
Thus, trying to mix a fixed point number such as 0.00001 with floating-point values (where the exponent is arbitrary) is a pointless exercise. This will only ever work if you can be assured that the operand values lie within a narrow domain (that is, close to some specific exponent), and if you properly select an epsilon value for that specific test. If you pull a number out of the air ("Hey! 0.00001 is small, so that must be good!"), you're doomed to numerical errors. I've spent plenty of time debugging bad numerical code where some poor schmuck tosses in random epsilon values to make yet another test case work.
If you do numerical programming of any kind and believe you need to reach for fixed-point epsilons, READ BRUCE'S ARTICLE ON COMPARING FLOATING-POINT NUMBERS.
Comparing Floating Point Numbers
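
The spacing claim above is easy to check with std::nextafter. The following is a minimal sketch; gapAbove and fixedEpsilonEqual are hypothetical helper names introduced for illustration.

```cpp
#include <cmath>

// Gap between x and the next representable float above it (one ULP at x).
float gapAbove(float x) {
    return std::nextafter(x, INFINITY) - x;
}

// The fixed-epsilon comparison that the answer above warns about.
bool fixedEpsilonEqual(float a, float b, float epsilon) {
    return std::fabs(a - b) < epsilon;
}
```

For a 32-bit IEEE 754 float, gapAbove(10000.0f) is 0.0009765625 (2^-10), so with an epsilon of 0.00001f the call fixedEpsilonEqual(10000.0f, std::nextafter(10000.0f, INFINITY), 0.00001f) returns false: at this magnitude the test accepts only bit-identical values, exactly like operator ==.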

Here's proof that using std::numeric_limits::epsilon() is not the answer: it fails for values greater than one.
#include <stdio.h>
#include <cmath>
#include <limits>
double ItoD (__int64 x) {
// Return double from 64-bit hexadecimal representation.
return *(reinterpret_cast<double*>(&x));
}
void test (__int64 ai, __int64 bi) {
double a = ItoD(ai), b = ItoD(bi);
bool close = std::fabs(a-b) < std::numeric_limits<double>::epsilon();
printf ("%.16f and %.16f %s close.\n", a, b, close ? "are" : "are not");
}
int main()
{
test (0x3fe0000000000000L,
0x3fe0000000000001L);
test (0x3ff0000000000000L,
0x3ff0000000000001L);
}
Running yields this output:
0.5000000000000000 and 0.5000000000000001 are close.
1.0000000000000000 and 1.0000000000000002 are not close.
Note that in the second case (one and just larger than one), the two input values are as close as they can possibly be, and still compare as not close. Thus, for values greater than 1.0, you might as well just use an equality test. Fixed epsilons will not save you when comparing floating-point values.

Qt implements two functions, maybe you can learn from them:
static inline bool qFuzzyCompare(double p1, double p2)
{
return (qAbs(p1 - p2) <= 0.000000000001 * qMin(qAbs(p1), qAbs(p2)));
}
static inline bool qFuzzyCompare(float p1, float p2)
{
return (qAbs(p1 - p2) <= 0.00001f * qMin(qAbs(p1), qAbs(p2)));
}
You may also need the following functions, since, as the Qt documentation for qFuzzyCompare puts it:
Note that comparing values where either p1 or p2 is 0.0 will not work,
nor does comparing values where one of the values is NaN or infinity.
If one of the values is always 0.0, use qFuzzyIsNull instead. If one
of the values is likely to be 0.0, one solution is to add 1.0 to both
values.
static inline bool qFuzzyIsNull(double d)
{
return qAbs(d) <= 0.000000000001;
}
static inline bool qFuzzyIsNull(float f)
{
return qAbs(f) <= 0.00001f;
}
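
The zero caveat can be illustrated without Qt. The sketch below mirrors the double overloads quoted above using the standard library; fuzzyCompare and fuzzyIsNull are hypothetical stand-in names, not the Qt functions themselves.

```cpp
#include <algorithm>
#include <cmath>

// Stand-in for the double overload quoted above: relative test scaled by the
// SMALLER magnitude, which collapses to "difference <= 0" when either side is 0.
bool fuzzyCompare(double p1, double p2) {
    return std::fabs(p1 - p2) <=
           0.000000000001 * std::min(std::fabs(p1), std::fabs(p2));
}

// Stand-in for the null check: plain absolute tolerance around zero.
bool fuzzyIsNull(double d) {
    return std::fabs(d) <= 0.000000000001;
}
```

Here fuzzyCompare(0.0, 1e-15) is false, because the tolerance is scaled by min(0, 1e-15) = 0. Both fuzzyIsNull(1e-15) and the "add 1.0 to both values" workaround, fuzzyCompare(1.0 + 0.0, 1.0 + 1e-15), return true.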

Unfortunately, even your "wasteful" code is incorrect. EPSILON is the smallest value that could be added to 1.0 and change its value. The value 1.0 is very important — larger numbers do not change when added to EPSILON. Now, you can scale this value to the numbers you are comparing to tell whether they are different or not. The correct expression for comparing two doubles is:
if (fabs(a - b) <= DBL_EPSILON * fmax(fabs(a), fabs(b)))
{
// ...
}
This is at a minimum. In general, though, you would want to account for noise in your calculations and ignore a few of the least significant bits, so a more realistic comparison would look like:
if (fabs(a - b) <= 16 * DBL_EPSILON * fmax(fabs(a), fabs(b)))
{
// ...
}
If comparison performance is very important to you and you know the range of your values, then you should use fixed-point numbers instead.
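
As a rough illustration of the fixed-point suggestion, the sketch below stores values as integer thousandths, which matches a requirement of three decimal places. The Milli type, kScale, and the helper functions are hypothetical names; a real implementation would also need a rounding policy and overflow checks.

```cpp
#include <cstdint>

// Hypothetical fixed-point value: the real number times 1000, held as an integer.
struct Milli {
    std::int64_t raw;
};

constexpr std::int64_t kScale = 1000;

constexpr Milli fromThousandths(std::int64_t t) { return Milli{t}; }

// The product of two scaled values carries the scale twice, so divide once.
// Integer division truncates toward zero; overflow is not guarded here.
constexpr Milli mul(Milli a, Milli b) { return Milli{a.raw * b.raw / kScale}; }

// Comparison is exact integer comparison: no epsilon needed.
constexpr bool equal(Milli a, Milli b) { return a.raw == b.raw; }
```

With this representation 1.2 * 3.0 becomes mul(fromThousandths(1200), fromThousandths(3000)), which yields exactly 3600 thousandths, so equal(..., fromThousandths(3600)) is true.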

General-purpose comparison of floating-point numbers is generally meaningless. How to compare really depends on a problem at hand. In many problems, numbers are sufficiently discretized to allow comparing them within a given tolerance. Unfortunately, there are just as many problems, where such trick doesn't really work. For one example, consider working with a Heaviside (step) function of a number in question (digital stock options come to mind) when your observations are very close to the barrier. Performing tolerance-based comparison wouldn't do much good, as it would effectively shift the issue from the original barrier to two new ones. Again, there is no general-purpose solution for such problems and the particular solution might require going as far as changing the numerical method in order to achieve stability.

You have to do this processing for floating point comparison, since floats can't be perfectly compared like integer types. Here are functions for the various comparison operators.
Floating Point Equal to (==)
I also prefer the subtraction technique rather than relying on fabs() or abs(), but I'd have to speed profile it on various architectures from 64-bit PC to ATMega328 microcontroller (Arduino) to really see if it makes much of a performance difference.
So, let's forget about all this absolute value stuff and just do some subtraction and comparison!
Modified from Microsoft's example here:
/// @brief See if two floating point numbers are approximately equal.
/// @param[in] a number 1
/// @param[in] b number 2
/// @param[in] epsilon A small value such that if the difference between the two numbers is
/// smaller than this they can safely be considered to be equal.
/// @return true if the two numbers are approximately equal, and false otherwise
bool is_float_eq(float a, float b, float epsilon) {
return ((a - b) < epsilon) && ((b - a) < epsilon);
}
bool is_double_eq(double a, double b, double epsilon) {
return ((a - b) < epsilon) && ((b - a) < epsilon);
}
Example usage:
constexpr float EPSILON = 0.0001; // 1e-4
is_float_eq(1.0001, 0.99998, EPSILON);
I'm not entirely sure, but it seems to me some of the criticisms of the epsilon-based approach, as described in the comments below this highly-upvoted answer, can be resolved by using a variable epsilon, scaled according to the floating point values being compared, like this:
float a = 1.0001;
float b = 0.99998;
float epsilon = std::max(std::fabs(a), std::fabs(b)) * 1e-4;
is_float_eq(a, b, epsilon);
This way, the epsilon value scales with the floating point values and is therefore never so small of a value that it becomes insignificant.
For completeness, let's add the rest:
Greater than (>), and less than (<):
/// @brief See if floating point number `a` is > `b`
/// @param[in] a number 1
/// @param[in] b number 2
/// @param[in] epsilon a small value such that if `a` is > `b` by this amount, `a` is considered
/// to be definitively > `b`
/// @return true if `a` is definitively > `b`, and false otherwise
bool is_float_gt(float a, float b, float epsilon) {
return a > b + epsilon;
}
bool is_double_gt(double a, double b, double epsilon) {
return a > b + epsilon;
}
/// @brief See if floating point number `a` is < `b`
/// @param[in] a number 1
/// @param[in] b number 2
/// @param[in] epsilon a small value such that if `a` is < `b` by this amount, `a` is considered
/// to be definitively < `b`
/// @return true if `a` is definitively < `b`, and false otherwise
bool is_float_lt(float a, float b, float epsilon) {
return a < b - epsilon;
}
bool is_double_lt(double a, double b, double epsilon) {
return a < b - epsilon;
}
Greater than or equal to (>=), and less than or equal to (<=)
/// @brief Returns true if `a` is definitively >= `b`, and false otherwise
bool is_float_ge(float a, float b, float epsilon) {
return a > b - epsilon;
}
bool is_double_ge(double a, double b, double epsilon) {
return a > b - epsilon;
}
/// @brief Returns true if `a` is definitively <= `b`, and false otherwise
bool is_float_le(float a, float b, float epsilon) {
return a < b + epsilon;
}
bool is_double_le(double a, double b, double epsilon) {
return a < b + epsilon;
}
Additional improvements:
A good default value for epsilon in C++ is std::numeric_limits<T>::epsilon(), which evaluates to either 0 or FLT_EPSILON, DBL_EPSILON, or LDBL_EPSILON. See here: https://en.cppreference.com/w/cpp/types/numeric_limits/epsilon. You can also see the float.h header for FLT_EPSILON, DBL_EPSILON, and LDBL_EPSILON.
See https://en.cppreference.com/w/cpp/header/cfloat and
https://www.cplusplus.com/reference/cfloat/
You could template the functions instead, to handle all floating point types: float, double, and long double, with type checks for these types via a static_assert() inside the template.
Scaling the epsilon value is a good idea to ensure it works for really large and really small a and b values. This article recommends and explains it: http://realtimecollisiondetection.net/blog/?p=89. So, you should scale epsilon by a scaling value equal to max(1.0, abs(a), abs(b)), as that article explains. Otherwise, as a and/or b increase in magnitude, the epsilon would eventually become so small relative to those values that it becomes lost in the floating point error. So, we scale it to become larger in magnitude like they are. However, using 1.0 as the smallest allowed scaling factor for epsilon also ensures that for really small-magnitude a and b values, epsilon itself doesn't get scaled so small that it also becomes lost in the floating point error. So, we limit the minimum scaling factor to 1.0.
If you want to "encapsulate" the above functions into a class, don't. Instead, wrap them up in a namespace if you like in order to namespace them. Ex: if you put all of the stand-alone functions into a namespace called float_comparison, then you could access the is_eq() function like this, for instance: float_comparison::is_eq(1.0, 1.5);.
It might also be nice to add comparisons against zero, not just comparisons between two values.
So, here is a better type of solution with the above improvements in place:
namespace float_comparison {
/// Scale the epsilon value to become large for large-magnitude a or b,
/// but no smaller than 1.0, per the explanation above, to ensure that
/// epsilon doesn't ever fall out in floating point error as a and/or b
/// increase in magnitude.
template<typename T>
static constexpr T scale_epsilon(T a, T b, T epsilon =
std::numeric_limits<T>::epsilon()) noexcept
{
static_assert(std::is_floating_point_v<T>, "Floating point comparisons "
"require type float, double, or long double.");
T scaling_factor;
// Special case for when a or b is infinity
if (std::isinf(a) || std::isinf(b))
{
scaling_factor = 0;
}
else
{
scaling_factor = std::max({(T)1.0, std::abs(a), std::abs(b)});
}
T epsilon_scaled = scaling_factor * std::abs(epsilon);
return epsilon_scaled;
}
// Compare two values
/// Equal: returns true if a is approximately == b, and false otherwise
template<typename T>
static constexpr bool is_eq(T a, T b, T epsilon =
std::numeric_limits<T>::epsilon()) noexcept
{
static_assert(std::is_floating_point_v<T>, "Floating point comparisons "
"require type float, double, or long double.");
// test `a == b` first to see if both a and b are either infinity
// or -infinity
return a == b || std::abs(a - b) <= scale_epsilon(a, b, epsilon);
}
/*
etc. etc.:
is_eq()
is_ne()
is_lt()
is_le()
is_gt()
is_ge()
*/
// Compare against zero
/// Equal: returns true if a is approximately == 0, and false otherwise
template<typename T>
static constexpr bool is_eq_zero(T a, T epsilon =
std::numeric_limits<T>::epsilon()) noexcept
{
static_assert(std::is_floating_point_v<T>, "Floating point comparisons "
"require type float, double, or long double.");
return is_eq(a, (T)0.0, epsilon);
}
/*
etc. etc.:
is_eq_zero()
is_ne_zero()
is_lt_zero()
is_le_zero()
is_gt_zero()
is_ge_zero()
*/
} // namespace float_comparison
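The comparisons elided in the "etc. etc." comment could be filled in roughly as follows. This is only a sketch, not the author's actual code: the names float_comparison_sketch and scaled are made up for illustration, the bodies simply mirror the unscaled is_lt/is_ge/is_le functions with the tolerance multiplied by max(1, |a|, |b|), and infinities and NaN get no special handling here.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <type_traits>

namespace float_comparison_sketch {  // hypothetical name, for illustration only

// Scale epsilon by max(1, |a|, |b|), as described in the text above.
// Assumption: a and b are finite.
template <typename T>
T scaled(T a, T b, T epsilon) {
    static_assert(std::is_floating_point<T>::value,
                  "Floating point comparisons require float, double, or long double.");
    return std::abs(epsilon) * std::max({T(1), std::abs(a), std::abs(b)});
}

/// Less than: true only if a is below b by more than the scaled tolerance
template <typename T>
bool is_lt(T a, T b, T epsilon = std::numeric_limits<T>::epsilon()) {
    return a < b - scaled(a, b, epsilon);
}

/// Greater than: true only if a is above b by more than the scaled tolerance
template <typename T>
bool is_gt(T a, T b, T epsilon = std::numeric_limits<T>::epsilon()) {
    return a > b + scaled(a, b, epsilon);
}

/// Greater than or approximately equal to
template <typename T>
bool is_ge(T a, T b, T epsilon = std::numeric_limits<T>::epsilon()) {
    return a > b - scaled(a, b, epsilon);
}

/// Less than or approximately equal to
template <typename T>
bool is_le(T a, T b, T epsilon = std::numeric_limits<T>::epsilon()) {
    return a < b + scaled(a, b, epsilon);
}

}  // namespace float_comparison_sketch
```

With these, float_comparison_sketch::is_le(0.1 + 0.2, 0.3) is true even though 0.1 + 0.2 is stored as a value slightly above 0.3, while is_gt(0.1 + 0.2, 0.3) is false.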
See also:
The macro forms of some of the functions above in my repo here: utilities.h.
UPDATE 29 NOV 2020: it's a work-in-progress, and I'm going to make it a separate answer when ready, but I've produced a better, scaled-epsilon version of all of the functions in C in this file here: utilities.c. Take a look.
ADDITIONAL READING (which I have now done): Floating-point tolerances revisited, by Christer Ericson. VERY USEFUL ARTICLE! It talks about scaling epsilon in order to ensure it never falls out in floating point error, even for really large-magnitude a and/or b values!

My class is based on previously posted answers. It is very similar to Google's code, but I use a bias which pushes all NaN values above 0xFF000000. That allows a faster check for NaN.
This code is meant to demonstrate the concept, not be a general solution. Google's code already shows how to compute all the platform specific values and I didn't want to duplicate all that. I've done limited testing on this code.
typedef unsigned int U32;
// Float Memory Bias (unsigned)
// ----- ------ ---------------
// NaN 0xFFFFFFFF 0xFF800001
// NaN 0xFF800001 0xFFFFFFFF
// -Infinity 0xFF800000 0x00000000 ---
// -3.40282e+038 0xFF7FFFFF 0x00000001 |
// -1.40130e-045 0x80000001 0x7F7FFFFF |
// -0.0 0x80000000 0x7F800000 |--- Valid <= 0xFF000000.
// 0.0 0x00000000 0x7F800000 | NaN > 0xFF000000
// 1.40130e-045 0x00000001 0x7F800001 |
// 3.40282e+038 0x7F7FFFFF 0xFEFFFFFF |
// Infinity 0x7F800000 0xFF000000 ---
// NaN 0x7F800001 0xFF000001
// NaN 0x7FFFFFFF 0xFF7FFFFF
//
// Either value of NaN returns false.
// -Infinity and +Infinity are not "close".
// -0 and +0 are equal.
//
class CompareFloat{
public:
union{
float m_f32;
U32 m_u32;
};
static bool IsClose( float A, float B, U32 unitsDelta = 4 )
{
U32 a = CompareFloat::GetBiased( A );
U32 b = CompareFloat::GetBiased( B );
if ( (a > 0xFF000000) || (b > 0xFF000000) )
{
return( false );
}
// Unsigned distance: avoids calling abs() on a wrapped unsigned difference.
return( ((a > b) ? (a - b) : (b - a)) < unitsDelta );
}
protected:
static U32 GetBiased( float f )
{
U32 r = ((CompareFloat*)&f)->m_u32;
if ( r & 0x80000000 )
{
return( ~r - 0x007FFFFF );
}
return( r + 0x7F800000 );
}
};

I'd be very wary of any of these answers that involves floating point subtraction (e.g., fabs(a-b) < epsilon). First, the floating point numbers become more sparse at greater magnitudes and at high enough magnitudes where the spacing is greater than epsilon, you might as well just be doing a == b. Second, subtracting two very close floating point numbers (as these will tend to be, given that you're looking for near equality) is exactly how you get catastrophic cancellation.
While not portable, I think grom's answer does the best job of avoiding these issues.
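To make the first point concrete, here is a small sketch (the helper names spacing_above and fixed_epsilon_equal are made up for this illustration) that measures the gap between a float and its next representable neighbour using std::nextafter, alongside the fixed-tolerance test being criticised.

```cpp
#include <cmath>
#include <limits>

// Distance from x to the next representable float above it.
float spacing_above(float x) {
    return std::nextafter(x, std::numeric_limits<float>::infinity()) - x;
}

// The fixed-tolerance comparison this answer warns about.
bool fixed_epsilon_equal(float a, float b, float epsilon) {
    return std::fabs(a - b) < epsilon;
}
```

Assuming IEEE 754 single precision, spacing_above(1.0f) is FLT_EPSILON (2^-23), but spacing_above(1.0e10f) is 1024. With a tolerance such as 0.001f, fixed_epsilon_equal can therefore only return true near 1.0e10f when the two values are bit-for-bit identical, which is the same as writing a == b.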

There are actually cases in numerical software where you want to check whether two floating point numbers are exactly equal. I posted this on a similar question
https://stackoverflow.com/a/10973098/1447411
So you can not say that "CompareDoubles1" is wrong in general.

In terms of the scale of quantities:
If epsilon is a small fraction of the magnitude of the quantity (i.e. a relative value) in some physical sense, and the A and B types are comparable in the same sense, then I think the following is quite correct:
#include <limits>
#include <iomanip>
#include <iostream>
#include <cmath>
#include <cstdlib>
#include <cassert>
template< typename A, typename B >
inline
bool close_enough(A const & a, B const & b,
typename std::common_type< A, B >::type const & epsilon)
{
using std::isless;
assert(isless(0, epsilon)); // epsilon is a part of the whole quantity
assert(isless(epsilon, 1));
using std::abs;
auto const delta = abs(a - b);
auto const x = abs(a);
auto const y = abs(b);
// comparable generally and |a - b| < eps * (|a| + |b|) / 2
return isless(epsilon * y, x) && isless(epsilon * x, y) && isless((delta + delta) / (x + y), epsilon);
}
int main()
{
std::cout << std::boolalpha << close_enough(0.9, 1.0, 0.1) << std::endl;
std::cout << std::boolalpha << close_enough(1.0, 1.1, 0.1) << std::endl;
std::cout << std::boolalpha << close_enough(1.1, 1.2, 0.01) << std::endl;
std::cout << std::boolalpha << close_enough(1.0001, 1.0002, 0.01) << std::endl;
std::cout << std::boolalpha << close_enough(1.0, 0.01, 0.1) << std::endl;
return EXIT_SUCCESS;
}

I use this code:
bool AlmostEqual(double v1, double v2)
{
return (std::fabs(v1 - v2) < std::fabs(std::min(v1, v2)) * std::numeric_limits<double>::epsilon());
}

Found another interesting implementation on: https://en.cppreference.com/w/cpp/types/numeric_limits/epsilon
#include <cmath>
#include <limits>
#include <iomanip>
#include <iostream>
#include <type_traits>
#include <algorithm>
template<class T>
typename std::enable_if<!std::numeric_limits<T>::is_integer, bool>::type
almost_equal(T x, T y, int ulp)
{
// the machine epsilon has to be scaled to the magnitude of the values used
// and multiplied by the desired precision in ULPs (units in the last place)
return std::fabs(x-y) <= std::numeric_limits<T>::epsilon() * std::fabs(x+y) * ulp
// unless the result is subnormal
|| std::fabs(x-y) < std::numeric_limits<T>::min();
}
int main()
{
double d1 = 0.2;
double d2 = 1 / std::sqrt(5) / std::sqrt(5);
std::cout << std::fixed << std::setprecision(20)
<< "d1=" << d1 << "\nd2=" << d2 << '\n';
if(d1 == d2)
std::cout << "d1 == d2\n";
else
std::cout << "d1 != d2\n";
if(almost_equal(d1, d2, 2))
std::cout << "d1 almost equals d2\n";
else
std::cout << "d1 does not almost equal d2\n";
}

In a more generic way:
template <typename T>
bool compareNumber(const T& a, const T& b) {
return std::abs(a - b) < std::numeric_limits<T>::epsilon();
}
Note:
As pointed out by @SirGuy, this approach is flawed.
I am leaving this answer here as an example not to follow.

I use this code. Unlike the above answers, it lets you specify an abs_relative_error, which is explained in the comments of the code. The first version compares complex numbers, so that the error can be explained in terms of the angle between two "vectors" of the same length in the complex plane (which gives a little insight). From there, the correct formula for two real numbers follows.
https://github.com/CarloWood/ai-utils/blob/master/almost_equal.h
The latter then is
template<class T>
typename std::enable_if<std::is_floating_point<T>::value, bool>::type
almost_equal(T x, T y, T const abs_relative_error)
{
return 2 * std::abs(x - y) <= abs_relative_error * std::abs(x + y);
}
where abs_relative_error is basically (twice) the absolute value of what comes closest to being defined in the literature: a relative error. But that is just the choice of the name.
What it really is can be seen most clearly in the complex plane, I think: if |x| = 1 and y lies within a circle around x with diameter abs_relative_error, then the two are considered equal.

I use the following function for floating-point numbers comparison:
bool approximatelyEqual(double a, double b)
{
return fabs(a - b) <= ((fabs(a) < fabs(b) ? fabs(b) : fabs(a)) * std::numeric_limits<double>::epsilon());
}

It depends on how precise you want the comparison to be. If you want to compare for exactly the same number, then just go with ==. (You almost never want to do this unless you actually want exactly the same number.) On any decent platform you can also do the following:
diff= a - b; return fabs(diff)<EPSILON;
as fabs tends to be pretty fast. By pretty fast I mean it is basically a bitwise AND, so it better be fast.
And integer tricks for comparing doubles and floats are nice but tend to make it more difficult for the various CPU pipelines to handle effectively. And it's definitely not faster on certain in-order architectures these days due to using the stack as a temporary storage area for values that are being used frequently. (Load-hit-store for those who care.)
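To illustrate the "basically a bitwise AND" remark: on a platform where double is IEEE 754 binary64 (an assumption, checked only by size here), the absolute value can be obtained by clearing the top bit. The function name abs_by_mask is made up for this sketch; in real code you would just call std::fabs and let the compiler pick the instruction.

```cpp
#include <cstdint>
#include <cstring>

// Clears the sign bit of a 64-bit IEEE 754 double.
// Assumption: double uses the binary64 layout with the sign in bit 63.
double abs_by_mask(double x) {
    static_assert(sizeof(double) == sizeof(std::uint64_t),
                  "this sketch assumes a 64-bit double");
    std::uint64_t bits;
    std::memcpy(&bits, &x, sizeof bits);   // well-defined way to read the bit pattern
    bits &= 0x7FFFFFFFFFFFFFFFULL;         // the single AND that drops the sign
    std::memcpy(&x, &bits, sizeof x);
    return x;
}
```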

/// testing whether two doubles are almost equal. We consider two doubles
/// equal if the difference is within the range [0, epsilon).
///
/// epsilon: a positive number (supposed to be small)
///
/// if either x or y is 0, then we are comparing the absolute difference to
/// epsilon.
/// if both x and y are non-zero, then we are comparing the relative difference
/// to epsilon.
bool almost_equal(double x, double y, double epsilon)
{
double diff = x - y;
if (x != 0 && y != 0){
diff = diff/y;
}
if (diff < epsilon && -1.0*diff < epsilon){
return true;
}
return false;
}
I used this function for my small project and it works, but note the following:
Double-precision error can surprise you. With epsilon = 1.0e-6, 1.0 and 1.000001 should NOT be considered equal according to the above code, but on my machine the function considers them equal. This is because 1.000001 cannot be represented exactly in binary; the stored value is probably 1.0000009xxx. When I test with 1.0 and 1.0000011 instead, I get the expected result.
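The surprise can be reproduced with a short sketch. The function below repeats the same logic under a different, made-up name (almost_equal_rel) so that it is self-contained, and stored_gap is a hypothetical helper that exposes the difference actually held in memory. This assumes IEEE 754 doubles.

```cpp
// Same logic as almost_equal above: relative difference when both values are
// non-zero, absolute difference otherwise.
bool almost_equal_rel(double x, double y, double epsilon) {
    double diff = x - y;
    if (x != 0 && y != 0) {
        diff /= y;
    }
    return diff < epsilon && -diff < epsilon;
}

// Difference between 1.0 and the nearest double to the literal 1.000001.
double stored_gap() {
    return 1.000001 - 1.0;
}
```

Here stored_gap() comes out just under 1.0e-6, because the nearest double to 1.000001 lies slightly below it, and dividing by y = 1.000001 shrinks the relative difference a little further. So almost_equal_rel(1.0, 1.000001, 1.0e-6) returns true, while almost_equal_rel(1.0, 1.0000011, 1.0e-6) returns false.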

You cannot compare two doubles with a fixed EPSILON: the appropriate EPSILON varies with the magnitude of the values being compared.
A better double comparison would be:
bool same(double a, double b)
{
return std::nextafter(a, std::numeric_limits<double>::lowest()) <= b
&& std::nextafter(a, std::numeric_limits<double>::max()) >= b;
}
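As a usage sketch, the function is repeated below under a made-up name (within_one_ulp) so the example is self-contained. It accepts b only if it is a itself or one of a's two immediate neighbours. Assuming IEEE 754 doubles, that is enough to treat 0.1 + 0.2 and 0.3 as the same, since they are adjacent representable values.

```cpp
#include <cmath>
#include <limits>

// Same logic as same() above: true if b equals a or is one representable
// step below or above a.
bool within_one_ulp(double a, double b) {
    return std::nextafter(a, std::numeric_limits<double>::lowest()) <= b
        && std::nextafter(a, std::numeric_limits<double>::max()) >= b;
}
```

For example, within_one_ulp(0.1 + 0.2, 0.3) is true although 0.1 + 0.2 == 0.3 is false. The tolerance is deliberately tight: a value four steps away, such as 1.0 + 4 * std::numeric_limits<double>::epsilon() compared with 1.0, is rejected, so this may be too strict after a long chain of operations.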

My way may not be strictly correct, but it is useful: convert both floats to strings and then do a string compare.
bool IsFlaotEqual(float a, float b, int decimal)
{
TCHAR form[50] = _T("");
_stprintf(form, _T("%%.%df"), decimal);
TCHAR a1[30] = _T(""), a2[30] = _T("");
_stprintf(a1, form, a);
_stprintf(a2, form, b);
if( _tcscmp(a1, a2) == 0 )
return true;
return false;
}
Operator overloading can also be done.
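The code above relies on the Windows TCHAR macros. A portable sketch of the same idea, using std::snprintf with a "%.*f" format, might look like this (the name equal_as_text and the 64-byte buffers are choices made for this illustration, not part of the original answer):

```cpp
#include <cstdio>
#include <cstring>

// Formats both values with the same number of decimals and compares the text.
// Assumption: the formatted values fit in 64 bytes; longer output is truncated
// by snprintf rather than overflowing the buffer.
bool equal_as_text(float a, float b, int decimals) {
    char text_a[64];
    char text_b[64];
    std::snprintf(text_a, sizeof text_a, "%.*f", decimals, a);
    std::snprintf(text_b, sizeof text_b, "%.*f", decimals, b);
    return std::strcmp(text_a, text_b) == 0;
}
```

This shows why the approach "may not be correct": equal_as_text(1.2341f, 1.2342f, 3) is true because both print as 1.234, but equal_as_text(1.2344f, 1.2346f, 3) is false because the values fall on opposite sides of a rounding boundary (1.234 versus 1.235), even though they are only about 0.0002 apart.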

I wrote this for Java, but maybe you will find it useful. It uses longs instead of doubles, but takes care of NaNs, subnormals, etc.
public static boolean equal(double a, double b) {
final long fm = 0xFFFFFFFFFFFFFL; // fraction mask
final long sm = 0x8000000000000000L; // sign mask
final long cm = 0x8000000000000L; // most significant decimal bit mask
long c = Double.doubleToLongBits(a), d = Double.doubleToLongBits(b);
int ea = (int) (c >> 52 & 2047), eb = (int) (d >> 52 & 2047);
if (ea == 2047 && (c & fm) != 0 || eb == 2047 && (d & fm) != 0) return false; // NaN
if (c == d) return true; // identical - fast check
if (ea == 0 && eb == 0) return true; // ±0 or subnormals
if ((c & sm) != (d & sm)) return false; // different signs
if (abs(ea - eb) > 1) return false; // b > 2*a or a > 2*b
d <<= 12; c <<= 12;
if (ea < eb) c = c >> 1 | sm;
else if (ea > eb) d = d >> 1 | sm;
c -= d;
return c < 65536 && c > -65536; // don't use abs(), because:
// There is a possibility that c=0x8000000000000000, which cannot be converted to positive
}
public static boolean zero(double a) { return (Double.doubleToLongBits(a) >> 52 & 2047) < 3; }
Keep in mind that after a number of floating-point operations, a number can be very different from what we expect. There is no code that can fix that.
