Classifying Floating Point Numbers

IEEE floating point numbers consist of three components. The first part is the "significand", an integer representing the fractional part of the number. Next follows the "exponent", an offset integer describing what power of two the significand needs to be multiplied by. Finally, the most significant bit contains the sign of the number.
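These three fields can be pulled apart with a little bit manipulation. Below is a minimal C sketch for the 32-bit float type; the helper and struct names are mine, but the layout is the standard IEEE one just described.

```c
#include <stdint.h>
#include <string.h>

/* The three fields of a 32-bit IEEE float (illustrative helper names) */
struct float_fields { uint32_t sign, exponent, fraction; };

static struct float_fields unpack_float(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);          /* reinterpret the bits safely */

    struct float_fields r;
    r.sign = u >> 31;                  /* most significant bit */
    r.exponent = (u >> 23) & 0xff;     /* 8 biased exponent bits */
    r.fraction = u & 0x7fffff;         /* 23 fractional significand bits */
    return r;
}
```

For example, -1.5f is 1.1₂ × 2⁰, so it unpacks to sign 1, exponent 127 (the bias for zero) and fraction 0x400000 (the ".1" part).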

The C programming language on x86_64 systems exposes four different floating point types. Three of these are standard IEEE types of 32, 64 and 128 bits in size. The remaining type is an Intel-specific 80 bit floating point number. In addition to the difference in size, the Intel format stores the leading one of the significand explicitly, whereas the IEEE types include only the fractional part, leaving the leading one implicit.

The different floating point types have different precisions for their exponents and significands. They are also passed in different registers.

C Type        Size in bits   Exponent size in bits   Significand size in bits   Register
float         32             8                       23                         SSE single precision
double        64             11                      52                         SSE double precision
long double   80             15                      63 + leading 1             memory + x87 stack
__float128    128            15                      112                        SSE 128 bit

Most of the above floating point types are well known. However, the __float128 type is a little obscure. It is currently poorly supported in the GNU glibc library, lacking most of the math library bindings. The reason is that there is currently no hardware support for this type on x86 machines; software-based floating point is used for its arithmetic instead, which is quite slow.

IEEE floating point numbers can represent more than just normal numbers. Special bit patterns are used for ±∞, ±0, subnormal numbers, and signaling and quiet NaNs. If the exponent has all bits set, or all bits clear, then the number isn't "normal". The two infinities have all exponent bits set, together with all significand bits clear. If the exponent is all ones and a bit is set in the significand, then the result is a NaN. If the highest (or second highest on the 80 bit format) significand bit is set, then the NaN is quiet; otherwise it is a signaling NaN.

Zero is represented by all bits clear, and negative zero has just the sign bit set. If the exponent has all bits clear, and there are still bits set in the significand, then the result is a subnormal number. These numbers do not have the implicit leading one, and allow gradual underflow.
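The special encodings can be demonstrated by constructing floats from explicit bit patterns. The helper below is my own (it assumes the little-endian layout of x86); the patterns follow directly from the rules above.

```c
#include <stdint.h>
#include <string.h>
#include <math.h>

/* Build a float from an explicit bit pattern (little-endian x86 assumed) */
static float float_from_bits(uint32_t u)
{
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* 0x7f800000  exponent all ones, significand clear   -> +infinity
   0xff800000  the same plus the sign bit             -> -infinity
   0x7fc00000  exponent all ones, top fraction bit    -> quiet NaN
   0x80000000  only the sign bit set                  -> -0.0
   0x00000001  exponent all zeros, fraction nonzero   -> smallest subnormal */
```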

Many algorithms depend on testing floating point numbers to see which type they are classified as. For example, a data stream may represent missing points as NaNs, which will then need to be tested for later on. So what is the fastest way to test for a NaN?

isnan()

The most obvious function to use is the isnan() function/macro from the C math library. This will return non-zero if the floating point number is a NaN. Benchmarking 2²⁷ repetitions takes, on this machine:

Type          Time in sec
float         0.68
double        0.74
long double   1.45
__float128    6.98

As can be seen, the larger the type, the slower the operation is. The __float128 case is extremely slow. This is due to the lack of an isnan() implementation optimized for that type in the glibc library. The compiler instead makes an implicit conversion to the long double type, which slows things down enormously.

Can we do better? It turns out that gcc is smarter than the C library. If we use the fact that a NaN always compares false, we can write a test for them. The compiler is smart enough to recognize this, and use an optimized test in the resulting machine code.
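The comparison trick is a one-liner. A NaN is the only value that compares unequal to itself, and gcc compiles the self-comparison down to a single unordered-compare instruction:

```c
/* NaN is the only value unequal to itself; gcc recognizes this idiom
   and emits an unordered comparison rather than a library call. */
static int my_isnan(double x)
{
    return x != x;
}
```

The same expression works for the float and long double types as well.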

The fact that __float128 types do not have hardware support makes this technique a loser for them. Instead, we can use SSE instructions to manipulate the 128 bit values. With careful instruction selection, we can get a very fast function. The trick here is that the paddusw instruction (implied by the _mm_adds_epu16() intrinsic) allows us to test the exponent and significand simultaneously. The result, after clearing the sign followed by the bounded addition, will only have the high bit set if the exponent was originally all ones. The other bits extracted by the pmovmskb can then be massaged to test for any bits in the significand.
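The test the SSE code performs can be written out in portable C on the raw bytes. This sketch is my own helper, not the paddusw routine itself, but it makes the condition explicit: a NaN is any value whose magnitude exceeds that of infinity.

```c
#include <stdint.h>
#include <string.h>

/* Portable sketch of the 128-bit NaN test (illustrative helper, not the
   SSE version described above).  The 16 bytes hold a little-endian IEEE
   binary128 value: 1 sign bit, 15 exponent bits, 112 fraction bits. */
static int isnan128(const unsigned char bytes[16])
{
    uint64_t lo, hi;
    memcpy(&lo, bytes, 8);
    memcpy(&hi, bytes + 8, 8);

    hi &= 0x7fffffffffffffffULL;   /* clear the sign bit */

    /* NaN: exponent all ones (0x7fff) and a nonzero fraction, i.e. a
       magnitude strictly greater than that of infinity */
    return hi > 0x7fff000000000000ULL
        || (hi == 0x7fff000000000000ULL && lo != 0);
}
```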

This is quite a bit faster for float and double types, nearly three times faster for long doubles, and over an order of magnitude faster for __float128s.

isfinite()

The isfinite() function/macro returns non-zero if the floating point number passed to it is not an infinity or NaN. Otherwise it returns zero. This can be implemented in C, but it is much faster to use assembly. We can use the pextrw SSE instruction to extract part of the float, and then do a compare to the all-bits set pattern to test for this.
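In portable C, the exponent test behind isfinite() can be sketched as below for the double type. This is a hedged illustration of the idea, not the pextrw assembly itself:

```c
#include <stdint.h>
#include <string.h>

/* A value is finite iff its 11 exponent bits are not all ones
   (all ones encodes both infinities and NaNs). */
static int my_isfinite(double x)
{
    uint64_t u;
    memcpy(&u, &x, sizeof u);

    return ((u >> 52) & 0x7ff) != 0x7ff;
}
```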

In the assembly versions, which correspond to the float, double, long double and __float128 types respectively, note how the sbb instruction is used to create the result via the carry flag set, or cleared, by the previous sub instruction. Minus one is returned when the test passes, and zero if not.

This code is about twice as fast as the functions within glibc, and again over an order of magnitude faster for the __float128 version.

Type          Original Time in sec   Optimized Time in sec
float         0.68                   0.35
double        0.68                   0.35
long double   0.68                   0.38
__float128    5.5                    0.35

signbit()

We can use a similar trick to extract the sign bit of a floating point type. However, it is faster to simply shift all of the non-required bits away rather than use a mask. The resulting optimized asm routines are only three instructions long.
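The shift trick looks like this in C for the double type (a sketch of the idea, not the asm routine itself):

```c
#include <stdint.h>
#include <string.h>

/* Shift everything but the sign bit away; no mask is needed */
static int my_signbit(double x)
{
    uint64_t u;
    memcpy(&u, &x, sizeof u);
    return u >> 63;            /* only the sign bit survives the shift */
}
```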

Since the task is so simple, the glibc library code is the same speed for float and double types. However, the above is a fair bit more compact, since it avoids writing to the stack simply to transfer the information from the SSE register.

Type          Original Time in sec   Optimized Time in sec
float         0.40                   0.40
double        0.45                   0.45
long double   1.18                   0.52
__float128    5.6                    0.40

Again, the __float128 version is much faster, and the performance increase for long double types is also significant.

isnormal()

The isnormal() function/macro returns nonzero if its parameter is not an infinity, zero, subnormal or NaN. This corresponds to a number with an exponent that is neither all bits set nor all bits clear. By masking with an and instruction, we can simultaneously check to see if the exponent bits are all zero. This allows us to need only one explicit comparison in the optimized functions.
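The single-comparison idea can be sketched in C for doubles. The trick (an assumption on my part as to the exact form, but a standard idiom) is that subtracting one makes a zero exponent wrap around to a huge unsigned value, so one unsigned compare rejects both boundary cases at once:

```c
#include <stdint.h>
#include <string.h>

/* Normal iff the exponent is neither all zeros nor all ones */
static int my_isnormal(double x)
{
    uint64_t u;
    memcpy(&u, &x, sizeof u);

    uint64_t e = (u >> 52) & 0x7ff;  /* the 11 exponent bits */

    /* e - 1 wraps to a huge value when e == 0, so a single unsigned
       comparison excludes e == 0 (zero/subnormal) and e == 0x7ff
       (infinity/NaN) simultaneously */
    return e - 1 < 0x7fe;
}
```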

The repz prefix is used on the return instruction to work around a limitation of the AMD64 branch predictor. It avoids having the ret instruction be the target of a branch. Apparently, if that is the case, the machine cannot correctly predict the target of the return, causing a slow-down.

Type          Original Time in sec   Optimized Time in sec
float         1.02                   0.45
double        1.2                    0.46
long double   1.65                   0.58
__float128    6.6                    0.46

As can be seen by the timing results, the optimized assembly versions are about twice as fast as the glibc library, and even faster for long double and __float128 types.

isinf()

The isinf() function/macro is a little different from the previous functions. Those can return any non-zero value if their condition is true, and isinf() is specified the same way in the C99 standard. However, the glibc library defines its version to return 1 for positive infinity, and -1 for negative infinity. This means that we need to make our return value dependent on the sign if we are to remain compatible. Also note how the isinf() function needs to check both the exponent and the significand, since a NaN has the same exponent bit pattern as an infinity. This makes this function much more complex than the previous routines.

The big question is which is better: SSE code, or 64bit code? SSE instructions work on a large amount of data at a time, but tend to be relatively slow and inflexible. They are also rather non-orthogonal, and the exact instruction you might like to use may not exist. This makes programming using SSE instructions difficult. Conversely, the 64bit instructions are very flexible, but only work with a small amount of data at a time. It turns out that the fastest code uses both instruction sets simultaneously, extracting the best of both worlds.

How the above routines work is non-obvious. The basic algorithm for the float and double versions is to first mask off the sign bit. This turns a negative infinity into a positive infinity. We then do a comparison with the largest non-infinite number. If we are larger, and not a NaN, then we need to return the sign. While the SSE instructions are running, we are simultaneously calculating this value. The trick is that the arithmetic shift will yield ±1 since we know the second most significant bit is set. Finally, the results are combined with a conditional move instruction.
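The glibc-compatible contract can be sketched in C for the double type. This is an illustration of the test being performed, not the mixed SSE/64-bit assembly itself:

```c
#include <stdint.h>
#include <string.h>

/* glibc-style isinf(): +1 for +inf, -1 for -inf, 0 otherwise */
static int my_isinf(double x)
{
    uint64_t u;
    memcpy(&u, &x, sizeof u);

    /* After clearing the sign, only an infinity matches the pattern of
       exponent all ones with a zero significand */
    if ((u & 0x7fffffffffffffffULL) != 0x7ff0000000000000ULL)
        return 0;

    return u >> 63 ? -1 : 1;   /* the sign bit selects the return value */
}
```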

The long double algorithm is a little different. Here we extract the sign bit into the carry flag via the first add instruction. The sbb instruction then sets %edx to either all ones or all zeros based on this carry value. If both comparisons pass, then we use the expression eax = 2*edx+1 to convert this into the ±1 result we need.
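The arithmetic behind that last step is easy to verify in C (the helper name here is purely illustrative): sbb materializes the sign as a 0 or -1 mask, and 2*mask + 1 maps that to ±1.

```c
/* sbb produces 0 or -1 (all ones) from the carry; 2*mask + 1 then
   converts this mask into the +1 or -1 return value */
static int sign_to_one(int sign_bit)     /* sign_bit is 0 or 1 */
{
    int mask = -sign_bit;                /* 0 -> 0, 1 -> -1 */
    return 2 * mask + 1;                 /* 0 -> +1, -1 -> -1 */
}
```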

The __float128 version is also different. In this case, it is because the 128bit SSE instructions operate on fragments of the 128 bit registers independently. It is difficult to transfer information from the lower to the upper parts, and vice versa. Thus to obtain the results, we need to extract the information we need, and store it in normal 64bit registers. The first movmskpd instruction extracts the sign information, thus %edx will be either 2 or 0 depending on the sign of the infinity. We next compare with the bit pattern for positive infinity, and construct a 16 bit result that contains the information about the status of the exponent and significand via the pcmpeqw and pmovmskb instructions. Finally, the information is combined to return ±1 or 0 depending on the input in a branch-free manner.

Type          Original Time in sec   Optimized Time in sec
float         0.75                   0.35
double        0.78                   0.35
long double   1.55                   0.45
__float128    6.3                    0.46

Again, the resulting optimized assembly routines are much faster than the glibc versions, being at least twice as fast.

fpclassify()

The above functions are defined in terms of the fpclassify function/macro. This returns an integer based on the classification of its parameter. To remain compatible with the glibc library, we will use the same values: FP_NAN = 0, FP_INFINITE = 1, FP_ZERO = 2, FP_SUBNORMAL = 3 and FP_NORMAL = 4.

The algorithm we will use is to first check the exponent. If it is all ones, branch to code that checks the significand: if the significand is all zeros, return 1, otherwise return zero. If the exponent isn't all ones, check to see if it is all zeros. If not, then we have a normal number, and return 4. Finally, we need to return 2 or 3 depending on whether or not the significand is clear. Unfortunately, this algorithm requires a fair amount of jumping, so it is quite slow. However, the last steps can be done branchlessly via various setcc instructions, which helps somewhat.
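The branchy algorithm can be written out in C for the double type, using the glibc classification values (a sketch of the control flow, not the optimized assembly):

```c
#include <stdint.h>
#include <string.h>

/* fpclassify for doubles, returning the glibc values:
   FP_NAN = 0, FP_INFINITE = 1, FP_ZERO = 2,
   FP_SUBNORMAL = 3, FP_NORMAL = 4 */
static int my_fpclassify(double x)
{
    uint64_t u;
    memcpy(&u, &x, sizeof u);

    uint64_t e = (u >> 52) & 0x7ff;          /* 11 exponent bits */
    uint64_t f = u & 0xfffffffffffffULL;     /* 52 significand bits */

    if (e == 0x7ff)
        return f == 0 ? 1 : 0;   /* infinity, else NaN */
    if (e == 0)
        return f == 0 ? 2 : 3;   /* zero, else subnormal */
    return 4;                    /* normal */
}
```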

The above functions have execution times that depend on the type of numbers tested. This makes constructing a benchmark to test them harder. We will thus test these functions in two ways. The first is to simply use a normal number over and over again. This corresponds to use where most numbers are normal. The other case will test all possibilities by enumerating the values within an array repeatedly.

Normals

Type          Original Time in sec   Optimized Time in sec
float         0.74                   0.35
double        0.92                   0.45
long double   1.17                   0.40
__float128    6.1                    0.46

Mixed numbers

Type          Original Time in sec   Optimized Time in sec
float         0.75                   0.51
double        0.88                   0.57
long double   1.56                   0.58
__float128    8.5                    0.60

As can be seen, the mixed input results aren't quite as much of an improvement. This is due to the branchy nature of the algorithm. However, the normal-only results are at least twice as fast as the glibc versions.

Summary

Using assembly code we can obtain about a 2× speedup over the C standard library for the bit-manipulation algorithms used to classify floating point types. By using SSE instructions in non-standard ways, it is possible to construct 128 bit comparisons. By using SSE and normal 64 bit code simultaneously, it is possible to construct branch-free versions of some quite complex algorithms.

The above contains quite a few assembly tricks, and it may be worthwhile investigating the source in more detail. For example, note how the sbb instruction can be used after a comparison to store the result. The use of shifts instead of an and instruction + comparison is also interesting. Another trick is to test a set of contiguous bits with an add causing a carry. Finally, the arithmetic shift of the infinity exponent bit-pattern to yield ±1 is also something you won't see all that often.

Obviously, spending quite so much effort optimizing normal code isn't worth it. However, the floating point classification algorithms are simple test cases that allow one to explore what possibilities the extremes of optimization allow. By spending similar effort on your inner loops in performance-critical code, similar speed improvements may be possible. The most important thing is to think laterally: you don't have to use instructions the way they are normally used...

Comments

said...

In the article 'Classifying Floating Point Numbers' the term "magnitude" is used wrongly. You should replace all occurrences of "magnitude" with "exponent". If one were to use the term "magnitude" at all in regard to floating point numbers, it would refer to the combination of the exponent and the significand, in other words, everything except the sign.

sfuerst said...

Thanks, I've fixed the notation problem and replaced "magnitude" with the more precise "exponent" where applicable.

Jeremy said...

Have you ever compared the speed of the "fxam" instruction for classifying floating point numbers on x86/x87? It seems to do everything you provide explicit code for, in one instruction!