Comparing Floating-Point Numbers Is Tricky

Floating-point math is fraught with subtle gotchas,
and comparing values properly is no exception.
Here we discuss common pitfalls,
examine some possible solutions,
and try to beat Boost.

Things you probably know about floats

If you need to represent a non-integer in a mainstream programming
language, you’ll probably end up using IEEE 754 floating-point values.
Since their standardization in 1985, they’ve become ubiquitous.
Nearly all modern CPUs—and many microprocessors—contain special hardware
(called floating-point units, or FPUs) to handle them.

Each float consists of a sign bit, some bits representing an exponent,
and bits representing a fraction, also called the mantissa.
Under most circumstances, the value of a float is:

where is our sign bit, is some fraction represented by the mantissa bits,
is an unsigned integer represented by the exponent bits,
and is half the maximum value of , i.e.,
127 for a 32-bit float and 1023 for a 64-bit float.

There are also some special cases.
For example, when all exponent bits are zero, the formula changes to:

Note the lack of an implicit 1 preceding the mantissa—this allows us to store
small values close to zero, called denormal or subnormal values.
And when all exponent bits are one, certain mantissa values represent
, , and “Not a Number” (NaN), the result of undefined
or unrepresentable operations such as dividing by zero.

We’ll make two observations that will prove themselves useful shortly:

Floats cannot store arbitrary real numbers,
or even arbitrary rational numbers.
They can only store numbers representable by the equations shown before.
For example, if I declare some variable,

Since the equations are exponential, the distance on the number line
between adjacent values increases (exponentially!) as you move away
from zero.
The distance between 1.0 and the next possible value is about
, but the distance between adjacent floats near
is roughly .
This will prove to be our greatest challenge: when comparing floats,
we want to handle inputs close to zero as well as we handle ones
close to the Avogadro constant.

What is equality?

Since the result of every floating-point operation must be rounded to the
nearest possible value, math doesn’t behave like it does with real numbers.
Depending on your hardware, compiler, and compiler flags,
may produce a different result than .1
Whenever we compare calculated values to each other,
we should provide some leeway to account for this.
Comparing their exact values with == won’t cut it.

Instead,
we should consider two distinct values and equal if
for some sufficiently small .
As luck would have it, the C standard library contains a FLT_EPSILON.
Let’s try it out!

boolalmostEqual(floata,floatb){returnfabs(a-b)<=FLT_EPSILON;}

We would hope that we’re done here, but we would be wrong.
A look at the language standards reveals that
FLT_EPSILON is equal to
the difference between 1.0 and the value that follows it.
But as we noted before, float values aren’t equidistant!
For values smaller than 1, FLT_EPSILON quickly becomes too large to be useful.
For values greater than 2, FLT_EPSILON is smaller than the
distance between adjacent values, so
fabs(a-b)<=FLT_EPSILON
will always be false.

To address these problems,
what if we scaled proportionally to our inputs?

boolrelativelyEqual(floata,floatb,floatmaxRelativeDiff=FLT_EPSILON){constfloatdifference=fabs(a-b);// Scale to the largest value.
a=fabs(a);b=fabs(b);constfloatscaledEpsilon=maxRelativeDiff*max(a,b);returndifference<=scaledEpsilon;}

This works better than our initial solution,
but it’s not immediately obvious what values of
maxRelativeDiff we might want for different cases.
The fact that we scale it by arbitrary inputs also means it can fall prey
to the same rounding we’re worried about in the first place.

What about Boost?

Boost, the popular collection of C++ libraries,
provides functions for similar purposes.2
After removing template boilerplate and edge case handling for and NaNs,
they resemble:

Unfortunately, these functions don’t seem to solve
our problems.3
Since the division in relative_difference often makes its result
quite small,4
how do we know what a good threshold might be?
By dividing that result by FLT_EPSILON,
epsilon_difference attempts to give an easier value to reason about.
But we just saw the dangers of FLT_EPSILON!
This scheme becomes increasingly questionable as inputs move away from one.

What about ULPs?

It would be nice to define comparisons in terms of
something more concrete than arbitrary thresholds.
Ideally, we would like to know the number of possible floating-point
values—sometimes called units of least precision, or ULPs—between
inputs.
If I have some value , and another value is only two or three ULPs away,
we can probably consider them equal, assuming some rounding error.
Most importantly, this is true regardless of the distance between
and on the number line.

Boost offers a function called float_distance to get the distance
between values in ULPs,
but it’s about an order of magnitude slower than the approaches
discussed so far.
With some bit-fiddling, we can do better.

Consider some positive float where every mantissa
bit is one.
must use the next largest exponent,
and all its mantissa bits must be zero.
As an example, consider 1.99999988 and 2:

Value

Bits

Exponent

Mantissa bits

1.99999988

0x3FFFFFFF

127

0x7FFFFF

2.0

0x40000000

128

0x000000

The property holds for denormals, even though they have a different
value equation.
Consider the largest denormal value and the smallest normal one:

Value

Bits

Exponent

Mantissa bits

0x007FFFFF

-126

0x7FFFFF

0x00800000

-126

0x000000

Notice an interesting corollary:
adjacent floats (of the same sign)
have adjacent integer values when reinterpreted as such.
This reinterpretation is sometimes called type punning,
and we can use it to calculate the distance between values in ULPs.

Traditionally in C and C++, one used a union trick:

unionFloatPun{floatf;int32_ti;};FloatPunfp;fp.f=25.624f;// Read the same value as an integer.
printf("%x",fp.i);

This still works in C,
but can run afoul of strict aliasing rules in C++.5
A better approach is to use memcpy.
Given the usual use of the function, one might assume that it would be less
efficient, but

This code is quite portable—it only assumes that the platform supports
32-bit integers and that floats are stored in accordance with
IEEE 754.6
We avoid comparing differently-signed values for a few reasons:

ULPs are the wrong tool to compare values near or across zero,
as we’ll see below.

Almost all modern CPUs use two’s complement
arithmetic, while floats use
signed magnitude.
Converting one format to the other in order to meaningfully add or subtract
differently-signed values requires some extra work.
For the same reason, the sign of our result might not be what we expect,
so we take its absolute value.
We only care about the distance between our two inputs.

If the subtraction overflows or underflows, we get undefined behavior
with signed integers and modular arithmetic with unsigned ones.
Neither is desirable here.

We calculate the absolute value ourselves instead of using std::abs
for two reaons.
First, the integer versions of std::abs only take types—such as
int, long, and longlong—whose
sizes are platform-specific.
We want to avoid assumptions about implicit conversions
between those types and int32_t.7
The second is a strange pitfall related to the placement of
std::abs overloads in the C++ standard library.
If you include <cmath> but not <cstdlib>,
only the floating-point versions of std::abs are provided.
Several toolchains I tested then promote the int32_t
value to a double,
even if your target only has
a 32-bit FPU and must emulate double
using integer registers.
(As one might guess, this is terrible for performance.)
Warning flags such as -Wconversion can help us notice this happening,
or we can just avoid all these gotchas by calculating the absolute value directly.
At any rate, this is a trivial detail.

No silver bullets

Relative epsilons—including ULPs-based ones—don’t make sense around zero.
The exponential nature of floats means that many more values are
gathered there than anywhere else on the number line.
Despite being a fairly small value in the context of many calculations,
0.1 is over one billion ULPs away from zero!
Consequently, fixed epsilons are probably the best
choice when you expect the results to be small.
What particular you want is entirely dependent on the calculations
performed.

Armed with this knowledge,
you may be tempted to write some end-all comparison function along the lines of:

But using it meaningfully is difficult without understanding the theory
we’ve discussed.

Brief aside: Other ULPs-based functions

We can use the same techniques to write other useful functions,
such as one that increments a float by some number of ULPs.
Boost offers a similar family of functions
(float_next, float_advance, etc.),
but like float_distance, they pay a performance cost to avoid
type punning.

One would hope we could simply get our ULPs, perform our addition,
and pun the result back, e.g.,

This naïve solution works for positive values,
but on most hardware, “incrementing” a negative float by a positive number of
ULPs will move us away from zero!
This is probably not what we want.
We mentioned before that floats use a signed magnitude scheme,
whereas most CPUs use two’s complement.
So, to operate on negative values, we need to convert from the former
to the CPU’s native integer format.8

staticconstint32_tint32SignBit=(int32_t)1<<31;int32_tfloatToNativeSignedUlps(floatf){int32_ti;memcpy(&i,&f,sizeof(float));// Positive values are the same in both
// two's complement and signed magnitude.
// For negative values, remove the sign bit
// and negate the result (subtract from 0).
returni>=0?i:-(i&~int32SignBit);}

After operating on the ULPs,
we must convert back to signed magnitude:

Takeaways

FLT_EPSILON… isn’t float epsilon,
except in the ranges [-2, -1] and [1, 2].
The distance between adjacent values depends on the values in question.

When comparing to some known value—especially zero or values near it—use
a fixed that makes sense for your calculations.

When comparing non-zero values, some ULPs-based comparison is probably the
best choice.

When values could be anywhere on the number line,
some hybrid of the two is needed.
Choose epsilons carefully based on expected outputs.

Acknowledgments

Much of this was adapted from Bruce Dawson’s fantastic
exploration of the topic on his blog,
Random ASCII.
Thanks also to
coworkers Evan Thompson and Matt Drees for their input.

Afterward: performance concerns

The relatively poor performance of boost::float_distance
was a large motivation for implementing
our own ulpsDistance.
For the sake of completeness, the following is a benchmark
(using Google’s benchmark library)
comparing the two with a handful of inputs.

#include <cstring> // For memcpy
#include <limits> // for numeric_limits<float>::infinity
#include <random>
#include <benchmark/benchmark.h>
#include <boost/math/special_functions/next.hpp>
#include <boost/math/special_functions/relative_difference.hpp>
usingnamespacestd;usingnamespaceboost::math;std::pair<float,float>pickInput(){staticautore=mt19937(random_device()());staticautocoinFlip=bernoulli_distribution(0.5);staticautoinputPicker=uniform_int_distribution<int>(1,10);constfloatinfinity=numeric_limits<float>::infinity();switch(inputPicker(re)){// Let's say there's a 5% chance our values are denormal.
// (This is probably more pessimal than our actual data.)
case1:if(coinFlip(re))return{1e-38f,float_advance(1e-38f,3)};// Intentional fall-through
// Let's throw in some huge numbers
case2:case3:case4:case5:return{6.022e23f,2.998e8f};break;// And so not-so-huge ones.
case6:case7:case8:case9:return{1.0f,11.0f};// Let's say there's a 5% chance we have NaNs
// and another 5% chance they're infinity
case10:if(coinFlip(re))return{42,numeric_limits<float>::quiet_NaN()};elsereturn{42,infinity};default:assert(0);}}__attribute__((noinline))// For visibility when benchmarking
int32_tulpsDistance(constfloata,constfloatb){// We can skip all the following work if they're equal.
if(a==b)return0;constautomax=numeric_limits<int32_t>::max();// We first check if the values are NaN.
// If this is the case, they're inherently unequal;
// return the maximum distance between the two.
if(isnan(a)||isnan(b))returnmax;// If one's infinite, and they're not equal,
// return the max distance between the two.
if(isinf(a)||isinf(b))returnmax;// At this point we know that the floating-point values aren't equal and
// aren't special values (infinity/NaN).
// Because of how IEEE754 floats are laid out
// (sign bit, then exponent, then mantissa), we can examine the bits
// as if they were integers to get the distance between them in units
// of least precision (ULPs).
static_assert(sizeof(float)==sizeof(int32_t),"What size is float?");// memcpy to get around the strict aliasing rule.
// The compiler knows what we're doing and will just transfer the float
// values into integer registers.
int32_tia,ib;memcpy(&ia,&a,sizeof(float));memcpy(&ib,&b,sizeof(float));// If the signs of the two values aren't the same,
// return the maximum distance between the two.
// This is done to avoid integer overflow, and because the bit layout of
// floats is closer to sign-magnitude than it is to two's complement.
// This *also* means that if you're checking if a value is close to zero,
// you should probably just use a fixed epsilon instead of this function.
if((ia<0)!=(ib<0))returnmax;// If we've satisfied all our caveats above, just subtract the values.
// The result is the distance between the values in ULPs.
int32_tdistance=ia-ib;if(distance<0)distance=-distance;returndistance;}voidbenchFloatDistance(benchmark::State&state){while(state.KeepRunning()){state.PauseTiming();floata,b;std::tie(a,b)=pickInput();state.ResumeTiming();// float_distance can't handle NaN and Infs.
if(!isnan(a)&&!isnan(b)&&!isinf(a)&&!isinf(b)){benchmark::DoNotOptimize(float_distance(a,b));}}}BENCHMARK(benchFloatDistance);voidbenchUlps(benchmark::State&state){while(state.KeepRunning()){state.PauseTiming();floata,b;std::tie(a,b)=pickInput();state.ResumeTiming();benchmark::DoNotOptimize(ulpsDistance(a,b));}}BENCHMARK(benchUlps);BENCHMARK_MAIN();

Actual timing values obviously depend on the type of inputs,
the uniformity of the inputs (which influences branch prediction),
and many other factors,
but our function seems to outperform Boost alternatives in the general case.

For the typographically inclined, this is available as a
PDF
and its
LaTeX
source.

On my setup, the former gives 1.0 and the latter gives 1.000000119209290. ↩

The format of float is implementation-defined according
to the C++ standard, and not necessarily adherent to IEEE 754.
(See http://stackoverflow.com/a/24157568.)
Perhaps this is why Boost’s float_distance is implemented the way it is. ↩

Granted, this is borderline paranoia—int32_t is one of
those types on nearly every relevant platform. ↩

On esoteric hardware, the native format may also be signed magnitude.
In those cases, we trust the compiler to elide the needless work we do here. ↩