Floating Point

Floating Point Intermediate Values

On many computers, greater
precision operations do not take any longer than lesser
precision operations, so it makes numerical sense to use
the greatest precision available for internal temporaries.
The philosophy is not to dumb down the language to the lowest
common hardware denominator, but to enable the exploitation
of the best capabilities of target hardware.

For floating point operations and expression intermediate values,
a greater precision can be used than the type of the
expression.
Only the minimum precision is set by the types of the
operands, not the maximum.

Implementation Note: On Intel x86 machines, for example,
it is expected (but not required) that the intermediate
calculations be done to the full 80 bits of precision
implemented by the hardware.
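
As a rough illustration, the precision that each type guarantees on a given target can be inspected through the built-in mant_dig property. This is only a sketch: the value reported for real depends on the hardware (64 bits on x87, possibly the same as double elsewhere).

```d
import std.stdio;

void main()
{
    // mant_dig is the number of significand bits the type provides.
    writefln("float:  %s bits", float.mant_dig);
    writefln("double: %s bits", double.mant_dig);
    writefln("real:   %s bits", real.mant_dig); // target dependent

    // The operand types fix only the minimum precision:
    // real is never narrower than double, nor double than float.
    static assert(real.mant_dig >= double.mant_dig);
    static assert(double.mant_dig >= float.mant_dig);
}
```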

It's possible that, due to greater use of temporaries and
common subexpressions, optimized code may produce a more
accurate answer than unoptimized code.

Algorithms should be written to work based on the minimum
precision of the calculation. They should not degrade or
fail if the actual precision is greater. The float and double types,
as opposed to the real (extended) type, should only be used for:

reducing memory consumption for large arrays

when speed is more important than accuracy

data and function argument compatibility with C
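
One way to follow this advice is sketched below: the data is stored as float to save memory, the running total is kept in real, and the result is checked with a tolerance so that extra intermediate precision cannot make the check fail. The helper name sumOf is hypothetical, and isClose is assumed to be available from std.math (recent Phobos releases).

```d
import std.math : isClose;
import std.stdio;

/// Hypothetical helper: sums compactly stored floats in a real accumulator.
real sumOf(const(float)[] values)
{
    real total = 0;
    foreach (v; values)
        total += v; // each float is widened before being added
    return total;
}

void main()
{
    // float storage keeps a large array small.
    auto samples = new float[](1000);
    samples[] = 0.1f;

    real total = sumOf(samples);

    // Compare with a relative tolerance rather than ==, so the check
    // holds whether or not the hardware carries extra precision.
    assert(isClose(total, 100.0L, 1e-5));
    writeln(total);
}
```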

Floating Point Constant Folding

Regardless of the type of the operands, floating point
constant folding is done in real or greater precision.
It is always done following IEEE 754 rules and round-to-nearest
is used.

Floating point constants are internally represented in
the implementation in at least real precision, regardless
of the constant's type. The extra precision is available for
constant folding. Committing to the precision of the result is
done as late as possible in the compilation process. For example:

const float f = 0.2f;
writeln(f - 0.2);

will print 0. A non-const static variable's value cannot be
propagated at compile time, so:

static float f = 0.2f;
writeln(f - 0.2);

will print 2.98023e-09. Hex floating point constants can also
be used when specific floating point bit patterns are needed that
are unaffected by rounding. To find the hex value of 0.2f:

import std.stdio;

void main()
{
    writefln("%a", 0.2f);
}

which is 0x1.99999ap-3. Using the hex constant:

const float f = 0x1.99999ap-3f;
writeln(f - 0.2);

prints 2.98023e-09.
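
The relationship between the decimal and hex spellings can be checked by comparing raw bit patterns once the values have been stored as float. The following is a minimal sketch; bitsOf is a hypothetical helper, not part of any library.

```d
import std.stdio;

/// Hypothetical helper: returns the raw IEEE 754 bits of a float.
uint bitsOf(float x)
{
    // x lives in 32 bits of memory here, so any extra precision is gone.
    return *cast(uint*) &x;
}

void main()
{
    float fromDecimal = 0.2f;       // rounded to float when stored
    float fromHex = 0x1.99999ap-3f; // bit pattern spelled out exactly

    writefln("%08x %08x", bitsOf(fromDecimal), bitsOf(fromHex));
    assert(bitsOf(fromDecimal) == bitsOf(fromHex));
}
```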

Different compiler settings, optimization settings,
and inlining settings can affect opportunities for constant
folding, therefore the results of floating point calculations may differ
depending on those settings.

Rounding Control

IEEE 754 floating point arithmetic includes the ability to select one of
four rounding modes.
These are accessible via the functions in core.stdc.fenv.

If the floating-point rounding mode is changed within a function,
it must be restored before the function exits. If this rule is violated
(for example, by the use of inline asm), the rounding mode used for
subsequent calculations is undefined.
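
A function that changes the rounding mode might save and restore it as sketched below, using fegetround and fesetround from core.stdc.fenv. The helper name divideRoundingUp is hypothetical. Note the assumption that the compiler does not move the division across the calls; an aggressive optimizer that does not model the floating point environment may do so.

```d
import core.stdc.fenv;
import std.stdio;

/// Hypothetical helper: divides with rounding toward +infinity and
/// restores the caller's rounding mode before returning.
double divideRoundingUp(double a, double b)
{
    immutable saved = fegetround();
    scope (exit) fesetround(saved); // runs on every exit path

    fesetround(FE_UPWARD);
    return a / b;
}

void main()
{
    immutable before = fegetround();
    writeln(divideRoundingUp(1.0, 3.0));

    // The caller's rounding mode is unchanged after the call.
    assert(fegetround() == before);
}
```

Using scope (exit) rather than a trailing fesetround call keeps the restore in place even if the function later gains early returns or throws.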

Exception Flags

IEEE 754 floating point arithmetic can set several flags based on what
happened with a computation: invalid operation, division by zero,
overflow, underflow, and inexact result.
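
As an illustration, the flags can be cleared and tested with feclearexcept and fetestexcept from core.stdc.fenv. This is a sketch only: raisesDivByZero is a hypothetical helper, and the result assumes the division really executes at run time between the two calls, which an optimizing build may not guarantee.

```d
import core.stdc.fenv;
import std.stdio;

/// Hypothetical helper: reports whether num / den raises the
/// divide-by-zero flag, and returns the quotient through an out parameter.
bool raisesDivByZero(double num, double den, out double quotient)
{
    feclearexcept(FE_ALL_EXCEPT); // start from a clean set of flags
    quotient = num / den;
    return fetestexcept(FE_DIVBYZERO) != 0;
}

void main()
{
    double q;
    writeln(raisesDivByZero(1.0, 0.0, q), " ", q);
    writeln(raisesDivByZero(1.0, 2.0, q), " ", q);
}
```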

Floating Point Transformations

An implementation may perform transformations on
floating point computations in order to reduce their strength,
i.e. their runtime computation time.
Because floating point math does not precisely follow mathematical
rules, some transformations are not valid, even though some
other programming languages still allow them.

The following transformations of floating point expressions
are not allowed because under IEEE rules they could produce
different results.

Disallowed Floating Point Transformations

transformation        comments
x + 0 → x             not valid if x is -0
x - 0 → x             not valid if x is ±0 and rounding is towards -∞
-x ↔ 0 - x            not valid if x is +0
x - x → 0             not valid if x is NaN or ±∞
x - y ↔ -(y - x)      not valid because (1-1=+0) whereas -(1-1)=-0
x * 0 → 0             not valid if x is NaN or ±∞
x / c ↔ x * (1/c)     valid if (1/c) yields an exact result
x != x → false        not valid if x is a NaN
x == x → true         not valid if x is a NaN
x !op y ↔ !(x op y)   not valid if x or y is a NaN

Of course, transformations that would alter side effects are also
invalid.
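
A few of these cases can be checked directly. The sketch below assumes the default round-to-nearest mode and uses signbit and isNaN from std.math. Each assertion shows a result that the corresponding "simplified" expression would get wrong.

```d
import std.math : isNaN, signbit;
import std.stdio;

void main()
{
    double negZero = -0.0;
    double inf = double.infinity;
    double nan = double.nan;

    // x + 0 → x would be wrong for -0: the sum is +0.
    assert(signbit(negZero) != 0);
    assert(signbit(negZero + 0.0) == 0);

    // x - x → 0 and x * 0 → 0 would be wrong for infinities: both give NaN.
    assert(isNaN(inf - inf));
    assert(isNaN(inf * 0.0));

    // x == x → true and x != x → false would be wrong for NaN.
    assert(!(nan == nan));
    assert(nan != nan);

    writeln("all checks passed");
}
```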