Know your FPU: Fixing Floating Fast

This is roughly six years overdue,
but I wanted to share a more robust version of my favorite Float to Int
technique and a broader version of the code that
Mike Herf posted on his most excellent site.
That article is a good primer,
and I tried to avoid repeating too much of the same material here.

Truthfully, there are a lot more floating point tricks than just conversions to
integers, but I think most people would be surprised just how much of a
bottleneck this one simple conversion is, even on modern processors -
especially for multimedia coding (graphics, video, music, sound, speech, etc.).

It was easily a 2x difference in my (software) 3d renderer when it's
geometry bound. And in my 2d vector rendering code, it's often the
difference between being geometry bound and not. Just for context: when I
mention "production code" in this article, I don't mean my private HaXor
elite graphics code that will p0wn your lame commercial product: I mean the
variety of products in the marketplace that continue to use my realtime
graphics and/or UI code on a variety of platforms, so I think I can speak
with some small authority on the topic. Further, I think it's worth some
(more) exploration exactly because it's a transparent, common, but expensive
operation for realtime multimedia coders.

That said, the thing I've noticed about most people's conversions (or at least
their explanations of them) is that they tend to be functionally and
numerically, um, incomplete, for lack of a better term. By that I mean, they
don't round out the conversion routines you'd like (pun intended :)), and their
numerical validity is generally a little unclear.

For example, if you use the "fistp in rounding mode" conversion technique that
almost everyone recommends, and you're writing graphics code, you probably have
a subtle rounding bug you didn't know you had. I know I did. In
addition, understanding this particular floating point trick (and its variants)
is a good way to start constructing your own.

In this article, I'll compare some fast floating point to integer
conversions, extend one of the techniques to ultrafast conversion
directly to fixed point, and provide substantially faster floor() and ceil()
functions, all with an eye to making the arithmetic implications and tradeoffs
clear.

Can you snatch the mantissa from my hand, grasshopper?


To cut to the chase, here are the performance results for basic conversion I see
on my machine (1.2GHz Pentium IV):

I can verify that these results are basically the same as what I see in
production graphics code, across processor speeds/types (of the x86 variety),
but this particular set of timings came from an included benchmark utility. If
anything, I find that the "magic number" technique that xs_CRoundToInt() uses
tends to pipeline much better than fistp, resulting in a bigger gain in the
real world than the benchmark demonstrates.

Explanations to follow, but the important part is that the technique I like,
xs_CRoundToInt(), is fastest, more flexible (more on this coming), and more
portable: across CPU types, and even across code and compilers - for example,
it doesn't require any assembly or processor-level floating point flags to be
set for correctness.

[Aside: xs_ToInt() uses the same tricks as xs_CRoundToInt(), but matches the
direct C cast to integer for correctness. It has a compare that's
required to match the ANSI/ISO C Standard - and branching turns out to be,
unsurprisingly, pretty bad for performance.]

So why is this so much faster than the simple cast?

First, you should understand that the folks who write compilers are NOT dumb. In
fact, they are, more likely than not, smarter than you (and I put myself in the
"you" bucket, just to be clear). So why is their version - the code
generated by the compiler to perform this operation - not just slow, but real
slow?

The answer's pretty pedestrian.

In order to make floating point (single, double, extended, et al) as accurate
and as fast as possible, there are a number of
truncation and precision choices that the IEEE prescribed for performing
arithmetic - after all, even extended precision floating point numbers are a
finite, fixed number of bits. In particular, arithmetic truncation because of
limited precision will round towards or away from 0 for any particular math
operation (multiply, divide, and normalization in particular) in a
deterministic but "averaging" way. Simply put (over-simply, really, but good
enough for our purposes), some numbers round up, some down - and numbers at the
"boundary" (i.e. exactly halfway between) don't
always round up as they did when you were learning about rounding in your high
school math class. Instead, the rounding in these cases is always towards the
"even" number.

Now this is a fine rule. It even has a name:
Banker's Rounding, so clearly it must
be fine - bankers use it! Incidentally, that's why I distinguish the
"CRound()" from the plain "Round()" versions of my conversion routines - the
former uses Banker's rounding (like C) while the latter uses arithmetic rounding.

The net result of Banker's rounding is that numerical errors in
floating point math tend to, on average, just come out in the wash. If you're
doing scientific computing, you probably need to care, but the rest of us don't
really.

Except.

Except that the IEEE rounding/truncation rules that turn out to be good for
"better" consistency with floating point numbers don't actually match what the
ANSI/ISO C specification requires for float to integer conversions. And much
hilarity ensues because of it.
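
To make the mismatch concrete, a small C99 illustration (assuming the
default rounding mode):

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        /* The C cast is required to truncate toward zero... */
        printf("%d %d\n", (int)1.7, (int)-1.7);        /* 1 -1 */

        /* ...but the FPU's default mode rounds to nearest,
           which is what a bare fistp would hand you. */
        printf("%ld %ld\n", lrint(1.7), lrint(-1.7));  /* 2 -2 */
        return 0;
    }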

This is not to say that the ANSI folks were off - quite the contrary, it makes good
sense, and MOST of the time isn't a big deal. Unless you're writing
graphics/audio/ray-tracing/processing code (i.e. multimedia) on a modern
processor.

Then it's a big deal.

So basically, the straight float to int cast is slow because it does a WHOLE lot
of work to ensure that it is always correct (correctness, in this case, being
"match the C Spec"). And this is unfortunately at odds with generally
not propagating rounding errors during floating point arithmetic. The
compiler cast is slow because you can be fast, or you can be right. Guess which
one they chose?

Now, the obvious and easy question is this: "I'm writing
[graphics][audio][ray-tracer][etc] code, I don't need to match the spec - can't
we just do it fast?"

The answer, obviously, is "Yes."

And by "Yes", I mean that we can be both much faster, and correct more than
enough of the time for our purposes. Remember, we're not doing scientific
computing, although exactly understanding BOTH parts of the previous
sentence is still important.

This is where most people default to "fistp" as the answer. It's the built-in x86
instruction for converting floating point numbers to integers, so it seems like a pretty
reasonable choice - pretty fast, if non-portable (even moving between compilers can be
a pain), but it has a few gotchas. For one, its behaviour depends on the
floating point rounding mode. Now, you could just set the mode when your app starts, and
use it willy-nilly wherever you need to cast quickly. The problem is that (1)
you may have libraries, code, even third party DLLs that you're inadvertently
changing the mode for - it's set per thread, and (2) you're changing the math
for any and all floating point operations (multiply, divide, and normalization in
particular).
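
For reference, here's roughly what the fistp technique looks like (a sketch
in MSVC-style 32-bit inline assembly; the function name is mine):

    /* Classic fistp conversion sketch (MSVC-style, 32-bit x86 only).
       The result depends on whatever rounding mode the FPU happens
       to be in - which is exactly the gotcha described above. */
    inline int fistp_to_int(double val)
    {
        int result;
        __asm {
            fld   val      // push the double onto the FPU stack
            fistp result   // pop to integer in the CURRENT rounding mode
        }
        return result;
    }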

Eeek.

You could try to wrap it locally, where you have hotspots, and that's not a bad
option. Or more simply, you could just change your routines so that they
work correctly with "rounded" conversions as opposed to straight truncation....

...Which is what you should do - I don't suggest using any of these
conversion routines as global conversions; just treat hotspots and adjust so
that rounding works for you... And I really don't recommend setting the
FPU round mode except by clear and extreme exception.

However.

However, it's not the only option - and it doesn't solve what, for me, turns out
to be a pretty core problem in multimedia programming: conversion directly to
fixed point.

I'm not going to get into the fun details of fixed point (if you're not sure
what it is, then I take back the thing I said about putting myself in the "you"
bucket), but suffice it to say that it turns out to be a specific, big hotspot in
accurate multimedia coding - as much, if not more so, than just your
conversions to integer.

Additionally, if you want to do other floating "conversions" (operations,
really: ceil, floor, etc.) and their respective conversions to fixed (ceil to
fixed, floor to fixed, etc.), then you'll end up doing a whole bunch of FPU
mode setting, state management, etc. (high complexity), or you're back pretty
much where the compiler guys left you - 80 cycles to set the FPU round mode,
flush the state, convert, and set it back... yikes.
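
To make that concrete, here's the shape of the mode-switching approach in
portable C99 (a sketch; the cycle cost lives in the two fesetround() calls):

    #include <fenv.h>
    #include <math.h>

    /* The "set the mode, convert, set it back" version of floor-to-int.
       Every mode switch flushes FPU state, which is where the cycles go. */
    int floor_to_int_slow(double val)
    {
        int old_mode = fegetround();
        fesetround(FE_DOWNWARD);      /* round toward -infinity */
        long result = lrint(val);     /* converts in the current mode */
        fesetround(old_mode);         /* put the mode back for everyone else */
        return (int)result;
    }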

That's where the "magic number" conversion techniques come in.

Magic Numbers.

Poof.

Six quadrillion, seven hundred fifty five trillion, three hundred ninety
nine billion, four hundred forty one million, fifty five thousand, seven
hundred forty four (also known as 2 raised to the 52nd
power multiplied by 1.5, or 6755399441055744.0). See why it's magical? OK, maybe
not so much...

The nitty gritty mechanisms of IEEE floats and magic numbers have been covered
in detail by others, most clearly by
Chris Hecker (IMHO), but the root idea of "magic number" conversions is
not too complicated.

Basically, in order to add two floating point numbers, your processor "lines up"
the decimal points of the numbers so that it can easily add the bits. It does
this by "normalizing" the numbers such that the most significant bits are
preserved, i.e. the smaller number "normalizes" to match the bigger one.

So the principle of the "magic number" conversion that xs_CRoundToInt() uses is
this: We add a big enough floating point number (a number that is so big that
there are significant digits only UP TO the decimal point, and none after it)
to the one you're converting such that: (a) the number gets normalized by the
processor to its integer equivalent and (b) adding the two does not erase the
integral significant bits in the number you were trying to convert (i.e. XX00 +
00YY = XXYY).

And then, you can just "read" that integral component right out of the lower
(32) bits into your integer. Since a double precision number has 52 bits of
mantissa (plus 11 bits exponent, 1 bit sign), the magic number for doubles
turns out to be 2 to the 52nd power (times 1.5, to deal with some sign
arithmetic) - AKA our "magic number". As I said,
RTFA because we're just getting warmed up.
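
In code, the whole trick boils down to something like this (a minimal sketch
assuming little-endian x86 and results that fit in 32 bits; the names here
are mine, and the real xs_ routines handle the portability details):

    #include <stdint.h>

    static const double magic = 6755399441055744.0;  /* 2^52 * 1.5 */

    /* Magic-number round: the add forces the FPU to normalize val
       into the low mantissa bits, which we then read back out as a
       plain 32-bit integer (low word first on little-endian). */
    static inline int32_t magic_round_to_int(double val)
    {
        union { double d; int32_t i[2]; } u;
        u.d = val + magic;   /* the FPU does the rounding for us */
        return u.i[0];       /* low 32 bits hold the result */
    }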

Straight to Fixed.

Again, to cut to the chase, here are timings on my machine, and I can again
confirm the results in production code - though in this case, the fistp to
fixed conversion performs MUCH worse in the benchmark than in production
because of pipelining issues (i.e. it's slower in my renderer, too, but not
quite as badly).

After all, if you can use the FPU's normalization to get to an integer, you can
use it just as easily to get to an integer shifted by any power of 2 - just
EXACTLY what you need to convert to a fixed point number. So rather than using
2 to the 52nd power (times 1.5), if we want 16 bits of fixed point goodness,
we'd use 2 to the 36th power (times 1.5), otherwise known as 2 to the (52-16)th
power.

So it's easy to convert any floating point number to any fixed point number.

And it's fast and pipelines well.
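
Concretely (same sketch as before, same little-endian assumption), going to
16.16 fixed point is just a different constant:

    static const double fix16_magic = 103079215104.0;  /* 2^(52-16) * 1.5 */

    /* Same trick, straight to 16.16 fixed point: with the magic number
       at 2^36 instead of 2^52, the low 16 mantissa bits now hold the
       fraction and the integer part sits just above them. */
    static inline int32_t magic_to_fix16(double val)
    {
        union { double d; int32_t i[2]; } u;
        u.d = val + fix16_magic;
        return u.i[0];   /* val * 65536, rounded, as 16.16 fixed point */
    }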

But wait, there's more.

I promised a discussion of numerical accuracy as well. But at this point, that
might have to be a follow-up.... at least in any great depth.

The routine xs_CRoundToInt() produces exactly the same results as fistp, i.e. it
uses Banker's rounding. Be careful with it, as it can be subtle and, in fact, it
just recently bit me in my grid fitting code - you can actually see the bug in
action if you resize windows in AIM Triton 1.0;
they appear to sometimes lose a pixel at the bottom or right edge.

I should have used the routine xs_RoundToInt(), which is very nearly as
fast, but uses "true" rounding, i.e. as in your high school math class, though
I should point out that it's not accurate to scientific precision. Still, it is
MORE than good enough for even high precision ray tracing - last I had heard
some of the exluna folks
were still using it (they're actually the ones that prompted me to finally
write this all up - although that was *cough* ... 3 years ago...).
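
For the curious, one plausible way to get the arithmetic-rounding behavior
out of the same machinery (a sketch building on magic_round_to_int() above;
the bias is also exactly why it's not accurate to scientific precision):

    /* Nudge exact halfway cases upward by a hair before the magic add,
       so 0.5 rounds to 1 instead of (Banker's) 0. */
    static inline int32_t magic_round_to_int_arith(double val)
    {
        return magic_round_to_int(val + 1.5e-8);
    }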

The routine xs_ToInt() matches the C cast, but is more than twice as fast (it's
actually quite a bit better than 2x in the real world - but close enough), and
doesn't require setting the FPU rounding mode. Ditto in terms of precision as
xs_RoundToInt().

The xs_Fixed() point conversion routines, along with xs_FloorToInt() and
xs_CeilToInt(), are 5 to 10x faster than the alternatives, and again ditto
(is that redundantly repeating?) for precision.
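
One plausible construction for those, again building on the sketch above
(the epsilon choice is the delicate part, so treat this as the idea rather
than production code):

    static const double almost_half = 0.5 - 1.5e-8;

    /* Floor and ceil as biased rounds: shift the input just shy of half
       a unit so round-to-nearest lands on the floor (or the ceiling). */
    static inline int32_t magic_floor_to_int(double val)
    {
        return magic_round_to_int(val - almost_half);
    }

    static inline int32_t magic_ceil_to_int(double val)
    {
        return magic_round_to_int(val + almost_half);
    }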

Final Thoughts.

Remember that xs_CRoundToInt() and fistp produce EXACTLY the same results.
The function xs_RoundToInt() is subtly different: it does arithmetic rounding
instead of Banker's rounding. In all likelihood,
that's what you actually thought you were doing with
fistp, if you're doing graphics/multimedia (I know that's what I thought),
and so that's actually the one I recommend you use everywhere.

Although most of this discussion was about comparisons to fistp (and just plain
old C) on an x86, I've found the tricks just as valuable on Sparc-powered
boxes, 68k processors, etc. (especially the direct-to-fixed conversions on the
latter).

On the PowerPC, however, it didn't really matter. It wasn't really slower, but
it wasn't really faster, other than floor() and ceil()...

Below are some times from my machine (1.2GHz Pentium IV) and
Herf's machine (3.2GHz Pentium D), just for comparison, followed by
code.