Supported arithmetic evaluation methods (provided by the underlying platform and
compiler) are FLT_EVAL_METHOD 0, 1 or 2 (when FLT_EVAL_METHOD != 0 the excess
precision and range should be handled according to the ISO C semantics: rounded
away at assignments and casts).

The four rounding modes (FE_TONEAREST, FE_UPWARD, FE_DOWNWARD, FE_TOWARDZERO)
are supported on all platforms with fenv support.

The five status flags (FE_INEXACT, FE_INVALID, FE_DIVBYZERO, FE_UNDERFLOW,
FE_OVERFLOW) are supported on all platforms with fenv support.

No additional classifications are provided beyond the five standard classes
(FP_NAN, FP_INFINITE, FP_ZERO, FP_SUBNORMAL, FP_NORMAL); in the case of ld80 the
invalid representations are treated according to the x87 FPU (pseudo-infinities,
pseudo-nans and unnormals are classified as FP_NAN and pseudo-subnormals are
classified as FP_NORMAL).

Errors are reported by raising floating-point status flags, i.e. math_errhandling
is set to MATH_ERREXCEPT on all platforms (errno is not set by math functions,
and on platforms without IEEE 754 exception semantics there is no error
reporting).

There are no predefined fenv settings other than FE_DFL_ENV.

If the floating-point environment is changed by means other than the interfaces
provided in fenv.h, the behaviour of the library functions is undefined (e.g.
changing the precision setting of the x87 FPU or unmasking floating-point
exceptions is not supported).

Status flags are raised correctly in the default fenv, except for the inexact
flag (inexact is only guaranteed to be raised where explicitly specified by ISO
C; the underflow flag is handled correctly in all cases, which is a stronger
guarantee than the requirement in Annex F).

Decimal conversion functions in stdlib.h and stdio.h are correctly rounded as
well.

Other functions should have small ulp error in nearest rounding mode, and the
sign of zero should be correct. (The goal is < 1.5 ulp error, i.e. the last bit
may be wrong; this is met for most commonly used functions. Notable exceptions:
several complex functions, the bessel functions, the lgamma and tgamma functions
and the arc hyperbolic functions.)

Signaling nan is not supported (library functions handle it as a quiet nan, so
they may not raise the invalid flag properly).

Quiet nans are treated equally (there is only one logical nan value; it is
printed as “nan” or “-nan”, and the library otherwise does not care about the
sign and payload of a nan).

Infinities are printed as “inf” and “-inf”.

Development documentation

Overview

The source code of the math functions is in three directories: src/math/,
src/complex/ and src/fenv/ (implementing the math.h, complex.h and fenv.h APIs
respectively).

Most other functions in the standard library don’t use floating-point
arithmetics; the notable exceptions are the decimal conversion functions: strtof,
strtod, strtold, printf, scanf. (Note that the conversion functions that don’t
use global stdio buffers are async-signal-safe in musl, except scanf with %m,
which calls malloc. No other async-signal-safe code in musl uses floating-point
arithmetics; this is important because in C11 the state of the fenv in a signal
handler is unspecified, so one has to explicitly save, set to default and
restore the fenv in a signal handler if floating-point code is used.)

A large part of the mathematical library is based on the freebsd libm, which is
in turn based on fdlibm. No effort was made to keep the integrity of the
original code, most of it needed heavy cleanups and rewrites. The numerical
algorithms and approximation methods are not changed in most cases though.
(Designing new algorithms requires additional tools and is more effort than
conformance cleanups).

Note that floating-point semantics is non-trivial and often misinterpreted or
misimplemented, some examples:

Bit manipulation

It is often required to manipulate the bit representation of floating-point
numbers using integer arithmetics (mostly to avoid raising floating-point
exceptions, or to do simple operations that are hard or slow with floating-point
arithmetics).

libm.h has historical macros for bit manipulation; the convention for new code
is to use explicitly declared unions, and the ldshape union in the case of long
double. An example of typical bit manipulation code that accesses the sign and
exponent:
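A minimal sketch of that convention (assuming the IEEE binary64 representation; the helper name is illustrative):

```c
#include <stdint.h>

/* illustrative helper: read the sign bit and the biased exponent of a
   double through a union, without touching floating-point state */
static int sign_and_exponent(double x, int *e)
{
    union {double f; uint64_t i;} u = {x};
    *e = (u.i >> 52) & 0x7ff; /* 11-bit biased exponent */
    return u.i >> 63;         /* sign bit */
}
```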

The union convention is used because other solutions would violate the C99
aliasing rules or could not be optimized easily. For float and double the unions
are explicit instead of typedefed somewhere because they are sufficiently short,
make the code self-contained and they should be portable to all platforms with
IEEE fp format. (There is one known exception: the ARM floating-point
accelerator provided IEEE arithmetics with a mixed-endian double representation,
but the ARM OABI is not supported by musl.)

The ldshape union is for ld80 and ld128 code and depends on byte order.

The bit operations and integer arithmetics need to be done with care and
awareness of the integer promotion rules and of the implementation-defined and
undefined behaviours of signed int arithmetics. The freebsd math code heavily
relies on implementation-defined behaviour (signed int representation, unsigned
to signed conversion, signed right shift) and even undefined behaviour (signed
int overflow, signed left shift). Such integer issues should be fixed; they are
considered to be bugs. The convention is to use the uint32_t and uint64_t types
for bit manipulations to reduce surprises.

Constants

Floating constants can have float, double or long double type, eg. 0.1f, 0.1 or
0.1L respectively.

The usual arithmetic conversion rules convert to the “largest” type used in an
expression, so the ‘f’ suffix should be used consistently for single precision
float expressions (missing the ‘f’ causes the expression to be evaluated at
double precision).

The ‘L’ suffix has to be used for long double constants that are not exactly
representable at double precision to avoid precision loss. For clarity the
convention in musl is to use ‘L’ consistently in long double context except for
trivial numbers (0.0, 1.0, 2.0, .5) where it is optional.

Using integers in a floating-point expression is fine; they are converted to the
appropriate floating-point type at translation time (e.g. 2*x is just as good as
2.0f*x, however -0*x is not the same as -0.0*x).

Note that 0.1f and (float)0.1 have different meanings: if FLT_EVAL_METHOD==0 the
first is 0.1 rounded to float precision, while the second is 0.1 rounded to
double precision and then rounded to float, causing double-rounding issues (not
in this case, but e.g. for (float)0x1.00000100000001p0), and it raises
floating-point status flags (inexact in this case) if FENV_ACCESS is ON.
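The double-rounding case can be demonstrated directly (hexfloat literals are required to be correctly rounded, so the two variables below differ in the last bit):

```c
#include <assert.h>

/* demonstrate the two roundings of the same hexfloat value */
static void constant_rounding_demo(void)
{
    /* one rounding, straight to float: just above the halfway point,
       so it rounds up to 1 + 0x1p-23 */
    float direct = 0x1.00000100000001p0f;
    /* rounded to double first (the 2^-56 sliver is lost), then the
       exact halfway case ties to even: 1.0f */
    float twice = (float)0x1.00000100000001p0;

    assert(direct == 0x1.000002p0f);
    assert(twice == 1.0f);
}
```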

FLT_EVAL_METHOD != 0 means the evaluation precision and range of constants and
expressions can be higher than that of the type, i.e. 0.1f == 0.1 may be true
(this is not implemented correctly by gcc as of gcc-4.8, and clang only supports
targets with FLT_EVAL_METHOD==0).

Note that a larger floating-point type can always represent all values of a
smaller one (so conversion to larger type can never cause precision loss).

Even if FENV_ACCESS is ON, static initializers don’t raise flags at runtime (the
constants are evaluated using the default fenv, i.e. round-to-nearest mode), so
‘static const’ variables are used for named constants when the inexact flag and
a higher-precision representation should be avoided; otherwise macros or
explicit literals are used.

The standard does not require correct rounding for decimal float constants (the
converted floating-point value can be the nearest or either neighbor of the
nearest representable number, but the recommended practice is correct rounding
according to nearest rounding mode); hexfloats, however, can represent all
values of a type exactly and unambiguously, so the convention is to prefer
hexfloat representation in the math code internally, especially when it makes
the intention clearer (e.g. 0x1p52 instead of 4503599627370496.0). (In public
header files they must not be used, as they are not C++ or C89 compatible.)

There is another i386-specific gcc issue related to constants: gcc can recognize
special constants that are built into the x87 FPU (e.g.
3.141592653589793238462643383L) and load them from the FPU instead of from
memory; however, the loaded value depends on the current rounding mode, so in
this rare case static const values may not be correctly rounded to the nearest
number in non-nearest rounding modes. Musl supports non-nearest rounding modes,
but library results may be imprecise for functions that are not listed as
correctly rounded.

Excess precision

In C, fp arithmetics and constants may be evaluated in a format that has greater
precision and range than the type of the operands, as specified by
FLT_EVAL_METHOD. The mathematical library must be compiled with a consistent and
well-defined evaluation method (FLT_EVAL_METHOD >= 0).

If FLT_EVAL_METHOD==0 then there is no excess precision; this is the most
common case. (This gives the IEEE semantics; clang does not even support other
values, and gcc only supports other values on i386, where FLT_EVAL_METHOD==2 is
the default.)

There is an additional issue on i386: the x87 FPU has a precision setting; it
can do arithmetics at extended precision (64-bit precision and 16-bit
sign+exponent), at double precision (53-bit precision and 16-bit sign+exponent)
or at single precision. On linux the default setting is extended precision and
changing it is not supported by musl. On other operating systems the default is
double precision (most notably on freebsd), so taking floating-point code from
other systems should be done with care. (The double precision setting provides
similar fp semantics to FLT_EVAL_METHOD==0, but there are still issues with the
wider exponent range, and the extra precision of long double is lost.)

The convention in musl is to use double_t or float_t when excess precision is
not harmful, and to use an explicit cast or assignment when rounding away the
excess precision is important. (Rounding the result means extra load/store
operations on i386; on other architectures it is a no-op.)

Argument passing is an assignment, so excess precision is dropped there, but a
return statement is not (this has changed in C11), so if the extra range or
precision is harmful then an explicit cast or assignment should be used before
the return (e.g. ‘return 0x1p-1000*0x1p-1000;’ is not ok for returning 0 with
underflow raised).

An important issue with FLT_EVAL_METHOD != 0 is double rounding: rounding an
infinite precision result into an intermediate format and then again into a
smaller precision format can give a different result than a single rounding to
the smaller format in nearest rounding mode. (With the x87 double precision
setting the wider exponent range can still cause double rounding in the
subnormal range, where the in-memory representation has smaller precision than
the fp registers.)

double x = 0.5 + 0x1p-50; // exact, no rounding
x = x + 0x1p52;

On FLT_EVAL_METHOD==0 platforms x is set to 0x1p52+1; on i386 the addition is
evaluated to 0x1p52+0.5 (rounded to extended precision) and then this is rounded
again to double precision for the assignment, giving 0x1p52 (nearest rounding,
ties-to-even).

In non-nearest rounding mode double rounding cannot cause different results, but
it can cause the omission of the underflow flag:

(double)(0x1p-1070 + 0x1p-1000*0x1p-1000)

The result is inexact and subnormal, so it should raise the underflow flag, but
the multiplication does not underflow at extended precision, and the addition is
first rounded to extended precision and then to double: the result of the second
rounding is an exact subnormal.

Musl raises the underflow flag correctly, so it needs to take special care when
the result is in the subnormal range.

Fenv and error handling

Correctly rounded functions support all rounding modes and all math functions
support the exception flags other than inexact.

In some cases the rounding mode is changed (e.g. fma), the exception flags are
tested and cleared (e.g. nearbyint), or exception flags are raised by some
floating-point operation (many functions). This means the compiler must not
treat surrounding floating-point operations as side-effect-free pure
computations: they depend on the current rounding mode and change the exception
flags (they depend on and modify global fenv state).

To allow compilers to do fp optimizations, the standard invented the FENV_ACCESS
pragma:

#pragma STDC FENV_ACCESS ON
#pragma STDC FENV_ACCESS OFF

The OFF state means the code does not access the fenv in any way, so various
reordering, constant folding and common subexpression elimination operations are
valid optimizations. The ON state means in the local scope the compiler must not
do such optimizations.

Neither gcc nor clang supports the FENV_ACCESS pragma, in which case they should
assume FENV_ACCESS ON for all code for safety; however, both compilers assume
FENV_ACCESS OFF and do aggressive fp optimizations which cannot be turned off.

Whenever the optional FE_* macros are used in library code, they should be
conditional (#ifdef FE…) because some platforms may not support them.

Musl does not provide errno-based error handling; math_errhandling is
MATH_ERREXCEPT, so it’s important that the floating-point status flags are
raised correctly. The convention is not to use explicit feraiseexcept(flag)
calls, because math code should work even if the platform does not support fenv,
raising flags through function calls is slow, and arithmetics usually raise
flags correctly anyway. The math code makes sure that exceptions are not raised
spuriously and exceptions are not omitted, by using appropriate arithmetics.

A simple exception-raising example: implement a function f that is odd, can be
approximated by x + x*x*x/2 + x^5*eps when x is small, and otherwise returns
g(x).
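A hedged sketch of such an f, following the usual small-|x| pattern (the 2^-12 threshold, the FORCE_EVAL definition and the placeholder g are illustrative assumptions, not the actual library code):

```c
#include <math.h>

/* placeholder for the general-case algorithm */
static float g(float x) { return x + x*x*x/2; }

/* evaluate an expression only for its fenv side effects */
#define FORCE_EVAL(x) do { volatile float unused = (x); (void)unused; } while (0)

static float f(float x)
{
    if (fabsf(x) < 0x1p-12f) {
        /* here x + x*x*x/2 rounds to x; the dummy arithmetic raises
           inexact when x != 0, and underflow when x is subnormal */
        FORCE_EVAL(fabsf(x) < 0x1p-126f ? x/0x1p120f : x + 0x1p120f);
        return x;
    }
    return g(x);
}
```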

FORCE_EVAL is used if an expression is evaluated only for its fenv side effects.
Sometimes more efficient code could be generated without any store, if compilers
did not optimize away unused code. (Efficient in terms of code size; these are
usually rare corner cases, so performance is not critical.)

Common formulas

Commonly used formulas throughout the mathematical library (some of them are
described in the Goldberg paper, “What Every Computer Scientist Should Know
About Floating-Point Arithmetic”). In these formulas EPSILON is DBL_EPSILON
(0x1p-52) or LDBL_EPSILON (0x1p-63 or 0x1p-112) according to FLT_EVAL_METHOD.
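Two such formulas, sketched as a hedged C example (variable names are illustrative; the exact-addition trick assumes |a| >= |b| and FLT_EVAL_METHOD==0):

```c
#include <assert.h>
#include <float.h>

static void formulas_demo(void)
{
    /* exact addition (Fast2Sum): for |a| >= |b|, a + b == hi + lo exactly */
    double a = 1.0, b = 0x1p-60;
    double hi = a + b;        /* rounded sum */
    double lo = b - (hi - a); /* the rounding error, recovered exactly */
    assert(hi == 1.0 && lo == 0x1p-60);

    /* 1.0 + x rounds to 1.0 whenever |x| < EPSILON/2 (nearest rounding) */
    assert(1.0 + DBL_EPSILON/4 == 1.0);
    assert(1.0 + DBL_EPSILON != 1.0);
}
```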

Sign of zero

Musl tries to return zero with correct sign (this is important in functions
which may be used in complex code, because of branch cuts).

With simple arithmetics the sign of zero cannot be detected silently: 0.0 ==
-0.0, and in every context they are treated equally except for division (1/-0.0
== -inf, 1/0.0 == inf), but that raises the divbyzero flag. The signbit macro or
bit manipulation can be used to get the sign silently.

To remove the sign of zero, x*x can be used if x == 0; alternatively x+0 works
too, except in downward rounding mode (-0.0+0 is -0.0 in downward rounding).

To create a 0 with sign according to the sign of x, 0*x can be used if x is
finite (otherwise copysign(0,x) or signbit(x)?-0.0:0.0 can be used; since
signbit is a macro it is probably faster, and nan may need special treatment).
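These tricks can be sketched as follows (a hedged example; the demo function name is illustrative):

```c
#include <assert.h>
#include <math.h>

static void zero_sign_demo(void)
{
    volatile double nzero = -0.0;

    /* x*x strips the sign of a zero in every rounding mode */
    assert(!signbit(nzero * nzero));

    /* 0*x gives a zero with the sign of a finite x */
    assert(signbit(0.0 * -3.0));
    assert(!signbit(0.0 * 3.0));

    /* copysign also handles infinities (nan may need special treatment) */
    assert(signbit(copysign(0.0, -INFINITY)));
}
```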

Complex

Currently some complex functions are dummies: naive implementations that may
give wildly incorrect results. The reason is that there are non-trivial special
cases which are hard to handle correctly. (There are various papers on the
topic, e.g. about branch cuts by William Kahan and about the implementation of
complex inverse trigonometric functions by Hull et al.)

The convention is to create complex numbers using the CMPLX(re,im) construct,
new in C11, because gcc does not support the _Imaginary type and turns

x*I

into

(x+0*I)*(0+I) = 0*x + x*I

So if x == inf then the result is nan+inf*I instead of inf*I, and thus a+b*I
cannot be used to create complex numbers in general.
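The difference can be demonstrated directly (a hedged sketch; CMPLX requires a C11 compiler and library):

```c
#include <assert.h>
#include <complex.h>
#include <math.h>

static void cmplx_demo(void)
{
    /* CMPLX assembles the parts directly: the real part stays 0.0 */
    double complex good = CMPLX(0.0, INFINITY);
    assert(creal(good) == 0.0 && cimag(good) == INFINITY);

    /* multiplying by I goes through inf*0, producing a nan real part */
    double complex bad = INFINITY * I;
    assert(isnan(creal(bad)));
    assert(cimag(bad) == INFINITY);
}
```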