This article is now hosted on csharpindepth.com

I'm gradually moving all my C# articles to the web site for
my book, C# in Depth. This
article has already moved. I'm leaving the content here so that
existing links still work, but I won't be updating this page.

Binary floating point and .NET

Lots of people are at first surprised when some of their arithmetic
comes out "wrong" in .NET. This isn't something specific to .NET in
particular - most languages/platforms use something called "floating
point" arithmetic for representing non-integer numbers. This is fine
in itself, but you need to be a bit aware of what's going on under
the covers, otherwise you'll be surprised at some of the results.

It's worth noting that I am not an expert on this matter. Since
writing this article, I've found another one - this time written by
someone who really is an expert, Jeffrey Sax. I strongly
recommend that you read his
article on floating point concepts too.

What is floating point?

Computers always need some way of representing data, and ultimately
those representations will always boil down to binary (0s and 1s).
Integers are easy to represent (with appropriate conventions for
negative numbers, and with well-specified ranges to know how big the
representation is to start with) but non-integers are a bit more
tricky. Whatever you come up with, there'll be a problem with it.
For instance, take out own normal way of writing numbers in decimal:
that can't (in itself) express a third. You end up with a recurring 3.
Whatever base you come up with, you'll have the same problem with
some numbers - and in particular, "irrational" numbers (numbers which
can't be represented as fractions) like the mathematical constants
pi and e are always going to give trouble.

You could store all the rational numbers exactly as two integers,
with the number being the first number divided by the second - but the
integers can grow quite large quite quickly even for "simple" operations,
and things like square roots will tend to produce irrational numbers.
There are various other schemes which also pose problems, but the one most
systems use in one form or other is floating point. The idea of
this is that basically you have one integer (the mantissa) which
gives some scaled representation of the number, and another (the
exponent) which says what the scale is, in terms of "where does the
dot go". For instance, 34.5 could be represented in "decimal floating
point" as mantissa 3.45 with an exponent of 1, whereas 3450 would have the
same mantissa but an exponent of 3 (as 34.5 is 3.45x101, and
3450 is 3.45x103). Now, that example is in decimal just for
simplicity, but the most common formats of floating point are for binary.
For instance, the binary mantissa 1.1 with an exponent of -1 would mean
decimal 0.75 (binary 1.1==decimal 1.5, and the exponent of -1 means
"divide by 2" in the same way that a decimal exponent of -1 means "divide
by 10").

It's very important to understand that in the same way that you can't represent
a third exactly in a (finite) decimal expansion, there are lots of numbers
which look simple in decimal, but which have long or infinite expansions
in a binary expansion. This means that (for instance) a binary floating point
variable can't have the exact value of decimal 0.1. Instead, if you
have some code like

double x = 0.1d;

the variable will actually
store the closest available double to that value. Once you can get your head
round that, it becomes obvious why some calculations seem to be "wrong". If
you were asked to add a third to a third, but could only represent the thirds
using 3 decimal places, you'd get the "wrong" answer: the closest you could get
to a third is 0.333, and adding two of those together gives 0.666, rather than
0.667 (which is closer to the exact value of two thirds). An example in binary
floating point is that 3.65d+0.05d != 3.7d (although it may be displayed
as 3.7 in some situations).

What floating point types are available in .NET?

The C# standard only lists double and float as floating
points available (those being the C# shorthand for System.Double
and System.Single), but I believe that strictly speaking the
decimal type (shorthand for System.Decimal) is also a floating
point type - it's just it's decimal floating point, and the ranges of exponents
are interesting. The decimal type is described in another article,
so this doesn't go into it any further - we're concentrating on the more
commonly used double and float. Both of these are
binary floating point types, conforming to IEEE 754 (a standard defining
various floating point types). float is a 32 bit type (1 bit of sign,
23 bits of mantissa, and 8 bits of exponent), and double is a 64 bit type
(1 bit of sign, 52 bits of mantissa and 11 bits of exponent).

Isn't it bad that results aren't what I'd expect?

Well, that depends on the situation. If you're writing financial applications,
you probably have very rigidly defined ways of treating errors, and the amounts
are also intuitively represented as decimal - in which case the
decimal type is more likely to be appropriate than float
or double. If, however, you're writing a scientific app, the link
with the decimal representation is likely to be weaker, and you're also likely
to be dealing with less precise amounts to start with (a dollar is exactly a dollar,
but if you've measured a length to be a metre, that's likely to have some sort
of inaccuracy in it to start with).

Comparing floating point numbers

One consequence of all of this is that you should very, very rarely
be comparing binary floating point numbers for equality directly. It's usually fine to
compare in terms of greater-than or less-than, but when you're interested in
equality you should always consider whether what you actually want is near
equality: is one number almost the same as another. One simple way of
doing this is to subtract one from the other, use Math.Abs to find the
absolute value of the difference, and then check whether this is lower than a certain
tolerance level.

There are some cases which are particularly pathological though, and these are
due to JIT optimisations. Look at the following code:

It should always print True, right? Wrong, unfortunately.
When running under debug, where the JIT can't make as many optimisations as normal,
it will print True. When running normally, the JIT can store the result
of the sum more accurately than a float can really represent - it can use the default
x86 80-bit representation, for instance, for the sum itself, the return value, and
the local variable. See the ECMA CLI spec, partition 1, section 12.1.3 for more details.
Uncommenting the commented out line in the above, makes the JIT
behave a bit more conservatively - the result is then True either way -
although that may only be true under the current implementation, and shouldn't be relied
on. (Casting g to float in the comparison line has the same effect,
even though it looks like a no-op.) This is another reason to avoid equality comparisons
even if you're really sure that the results should be the same.

How does .NET format floating point numbers?

There's no built-in way to see the exact decimal value of a floating
point number in .NET, although you can do it with a bit of work. (See the
bottom of this article for some code to do this.) By default, .NET formats
a double to 15 decimal places, and a float to 7.
(In some cases it will use scientific notation; see the MSDN page on
standard numeric format strings for more information.) If you use the
round-trip format specifier ("r"), it formats the number to
the shortest form which, when parsed (to the same type), will get back to
the original number. If you are storing floating point numbers as strings
and the exact value is important to you, you should definitely use the
round-trip specifier, as otherwise you are very likely to lose data.

What exactly does a floating point number look like in memory?

As it says above, a floating point number basically has a sign, an
exponent and a mantissa. All of these are integers, and the combination of
the three of them specifies exactly what number is being represented.
There are various classes of floating point number: normalised,
subnormal, infinity and not a number (NaN). Most numbers
are normalised, which means that the first bit of the binary mantissa is
assumed to be 1, which means you don't actually need to store it. For
instance, the binary number 1.01101 could be expressed as just .01101 -
the leading 1 is assumed, as if it were 0 a different exponent would be
used. That technique only works when the number is in the range where you
can choose the exponent suitably. Numbers which don't lie in that range
(very, very small numbers) are called subnormal, and no leading bit is
assumed. "Not a number" (NaN) values are for things like the result of
dividing 0 by 0, etc. There are various different classes of NaN, and
there's some odd behaviour there as well. Subnormal numbers are also
sometimes called denormalised numbers.

The actual representation of the sign, exponent and mantissa at the bit
level is for each of them to be an unsigned integer, with the stored value
being just the concatenation of the sign, then the exponent, then the
mantissa. The "real" exponent is biased - for instance, in the case of a
double, the exponent is biased by 1023, so a stored exponent
value of 1026 really means 3 when you come to work out the actual value.
The table below shows what each combination of sign, exponent and mantissa
means, using double as an example. The same principles apply
for float, just with slightly different values (such as the
bias). Note that the exponent value given here is the stored exponent,
before the bias is applied. (That's why the bias is shown in the "value"
column.)

This is therefore a normal number of value
(-1)0 x 1.0111001101101101001001001000010101110011000100100011 (binary) x 21028-1023
which is more simply represented as
1.0111001101101101001001001000010101110011000100100011 (binary) x 25
or
101110.01101101101001001001000010101110011000100100011

In decimal, this is 46.42829231507700882275457843206822872161865234375, but .NET
will display it by default as 46.428292315077 or with the "round-trip" format specifier
as 46.428292315077009.

Sample code

DoubleConverter.cs: this is a fairly simple class
which allows you to convert a double to its exact decimal representation as a string.
Note that although finite decimals don't always have finite binary expansions, all finite
binaries have a finite decimal expansion (because 2 is a factor of 10, essentially). The
class is extremely simple to use - just call DoubleConverter.ToExactString(value)
and the exact string representation for value is returned. There's also a simple
online version - enter the number as
you'd write it in code, and the exact value which is actually used will be displayed.

NaNs

NaNs are odd beasts. There are two types of NaNs - signalling and
quiet, or SNaN and QNaN for short. In terms of the bit pattern,
a quiet NaN has the top bit of the mantissa set, whereas a signalling NaN has it cleared.
Quiet NaNs are used to signify that the result of a mathematical operation is undefined,
whereas signalling NaNs are used to signify exceptions (where the operation was invalid,
rather than just having an indeterminate outcome).

The strangest thing most people are likely to see about NaNs is that they're not equal
to themselves. For instance, Double.NaN==Double.NaN evaluates to false.
Instead, you need to use Double.IsNaN to check whether a value is not a number.
Fortunately, most people are unlikely to encounter NaNs at all except in articles like
this.

Conclusion

Binary floating point arithmetic is fine so long as you know what's going on and don't
expect values to be exactly the decimal ones you put in your program, and don't expect
calculations involving binary floating point numbers to necessarily yield precise results.
Even if two numbers are both exactly represented in the type you're using, the
result of an operation involving those two numbers won't necessarily be exactly represented.
This is most easily seen with division (eg 1/10 isn't exactly representable despite both
1 and 10 being exactly representable) but it can happen with any operation - even
seemingly innocent ones such as addition and subtraction.

If you particularly want precise decimal numbers, consider using the
decimal type instead - but expect to pay a performance penalty for doing
so. (One very quickly devised test came out with multiplication of doubles
being about 40 times faster than multiplication of decimals; don't pay
particular heed to this exact figure, but take it as an indication that
binary floating point is generally faster on current hardware than decimal
floating point.)