If you answered ‘double’ and ‘float’ then you score one point for youthful idealism, but zero points for correctness. The correct answer, for zero-idealism points and forty two correctness points is “it depends”.

Read on for more details.

It depends. It depends on the compiler, the compiler version, the compiler settings, 32-bit versus 64-bit, and the run-time state of some CPU flags. And, this ‘it depends’ behavior can affect both performance and, in some cases, the calculated results. How exciting!

Previously on this channel…

If you’re just joining us then you may find it helpful to read some of the earlier posts in this series.

Post 4 (Comparing Floating Point Numbers) is particularly relevant because its test code legitimately prints different results with different compilers.

Default x86 code

The obvious place to start is with a default VC++ project. Let’s look at the results for the release build of a Win32 (x86) project with the default compiler settings except for turning off Link Time Code Generation (LTCG). Here’s the generated code:

It’s interesting that the code for MulDouble and MulFloat is identical except for size specifiers on the load and store instructions. These size specifiers just indicate whether to load/store a float or double from memory. Either way the fmul is done at “register precision”.

Register precision

x87 code for floating-point operations, as shown above. The peculiar x87 FPU has eight registers, and people will usually tell you that ‘register precision’ on this FPU means 80-bit precision. These people are wrong. Mostly.

It turns out that the x87 FPU has a precision setting. This can be set to 24-bit (float), 53-bit (double), or 64-bit (long double). VC++ hasn’t supported long double for a while so it initializes the FPU to double precision during thread startup. This means that every floating-point operation on the x87 FPU – add, multiply, square root, etc. – is done to double precision by default, albeit in 80-bit registers.The registers are 80 bits, but every operation is rounded to double precision before being stored in the register.

Except even that is not quite true. The mantissa is rounded to a double-precision compatible 53 bits, but the exponent is not, so the precision is double compatible but the range is not.

In the x87 register diagram below blue is the sign bit, pink is the exponent, and green is the mantissa. The full exponent is always used, but when the rounding is set to 24-bit then only the light green mantissa is used. When the rounding is set to 53-bit then only the light and medium green mantissa bits are used, and when 64-bit precision is requested the entire mantissa is used.

In summary, as long as your results are between DBL_MIN and DBL_MAX (they should be) then the larger exponent doesn’t matter and we can simplify and just say that register precision on x87 means double precision.

Except when it doesn’t.

While the VC++ CRT initializes the x87 FPUs register precision to _PC_53 (53-bits of mantissa) this can be changed. By calling _D3DCREATE_FPU_PRESERVE. This means that on the thread where you initialized D3D9 you should expect many of your calculations to give different results than on other threads. D3D9 does this to improve the performance of the x87’s floating-point divide and square root. Nothing else runs faster or slower.

So, the assembly code above does its intermediate calculations in:

64-bit precision (80-bit long double), if somehow you avoid the CRT’s initialization or if you use _controlfp_s to set the precision to _PC_64

or 53-bit precision (64-bit double), by default

or 24-bit precision (32-bit float), if you’ve initialized D3D9 or used _controlfp_s to set the precision to _PC_24

On more complicated calculations the compiler can confuse things even more by spilling temporary results to memory, which may be done at a different precision than the currently selected register precision.

The glorious thing is that you can’t tell by looking at the code. The precision flags are a per-thread runtime setting, so the same code called from different threads in the same process can give different results. Is it confusingly awesome, or awesomely confusing?

/arch:SSE/SSE2

The acronym soup which is SSE changes all of this. Instructions from the SSE family all specify what precision to use and there is no precision override flag, so you can tell exactly what you’re going to get (as long as nobody has changed the rounding flags, but let’s pretend I didn’t mention that shall we?)

If we enable /arch:sse2 and otherwise leave our compiler settings unchanged (release build, link-time-code-gen off, /fp:strict) we will see these results in our x86 build:

The MulDouble code looks comfortably similar, with just some mnemonic and register changes on three of the instructions. Simple enough.

The MulFloat code looks… longer. Weirder. Slower.

The MulFoat code has four extra instructions. Two of these instructions convert the inputs from float to double, and a third converts the result from double back to float. The fourth is an explicit load of ‘y’ because SSE can’t load-and-multiply in one instruction when combining float and double precision. Unlike the x87 instruction set where conversions happen as part of the load/store process, SSE conversions must be explicit. This gives greater control, but it adds cost. If we optimistically assume that the load and float-to-double conversions of ‘x’ and ‘y’ happen in parallel then the dependency chain for the floating-point math varies significantly between these two functions:

MulDouble:

movsd-> mulsd-> movsd

MulFloat

movss-> cvtps2pd-> mulsd-> cvtpd2ps-> movss

That means the MulFloat dependency chain is 66% longer than the MulDouble dependency chain. The movss and movsd instructions are somewhere between cheap and free and the convert instructions have comparable cost to floating-point multiply, so the actual latency increase could be even higher. Measuring such things is tricky at best, and the measurements extrapolate poorly, but in my crude tests I found that on optimized 32-bit /arch:SSE2 builds with VC++ 2010 MulFloat takes 35% longer to run than MulDouble. On tests where there is less function call overhead I’ve seen the float code take 78% longer than the double code. Ouch.

To widen or not to widen

So why does the compiler do this? Why does it waste three instructions to widen the calculations to double precision?

Is this even legal? Is it perhaps required?

The guidance for whether to do float calculations using double-precision intermediates has gone back and forth over the years, except when when it was neither back nor forth.

The IEEE 754 floating-point math standard chose not to rule on this issue:

Annex B.1: The format of an anonymous destination is defined by language expression evaluation rules.

All floating arithmetic in C is carried out in double-precision; whenever a float appears in an expression it is lengthened to double by zero-padding its fraction.

However the C++ 1998 standard does not appear to contain that language. The most obvious governing language is the usual arithmetic conversions (section 5.9 in the 1998 standard) which basically says that double-precision is only used if the expression contains a double:

5.9: If the above condition is not met and either operand is of type double, the other operand is converted to type double.

But the C++ 1998 standard also appears to permit floating-point math to be done at higher precisions if a compiler feels like it:

5.10 The values of the floating operands and the results of floating expressions may be represented in greater precision and range than that required by the type; the types are not changed thereby.55)

It is generally more desirable to compute the intermediate results in as high as [sic] precision as is practical.

The consensus is clear: sometimes having high-precision intermediates is great, and sometimes it is horrible. The IEEE math standard defers to the C++ standard, which defers to compiler writers, which sometimes then defer to developers.

It’s a design decision and a performance bug

As Goldberg and Fleegal point out, there are no easy answers. Sometimes higher precision intermediates will preserve vital information, and sometimes they will lead to confusing discrepancies.

I think I’ve reverse engineered how the VC++ team made their decision to use double-precision intermediates with /arch:SSE2. For many years 32-bit code on Windows has used the x87 FPU, set to double precision, so for years the intermediate values have been (as long as they stay in the registers) double precision. It seems obvious that when SSE and SSE2 came along the VC++ team wanted to maintain consistency. This means explicitly coding double precision temporaries.

The compiler (spoiler alert: prior to VS 11) also has a strong tendency to use x87 instructions on any function that returns a float or double, presumably because the result has to be returned in an x87 register anyway, and I’m sure they wanted to avoid inconsistencies within the same program.

The frustrating thing about my test functions above is that the extra instructions are provably unneeded. They will make no difference to the result.

The reason I can be so confident that the double-precision temporaries make no difference in the example above is that there is just one operation being done. The IEEE standard has always guaranteed that the basic operations (add, subtract, multiply, divide, square root, some others) give perfectly rounded results. On any single basic operation on two floats if the result is stored into a float then widening the calculation to double is completely pointless because the result is already perfect. The optimizer should recognize this and remove the four extraneous instructions. Intermediate precision matters in complex calculation, but in a single multiply, it doesn’t.

In short, even though the VC++ policy in this case is to use double-precision for float calculations, the “as-if” rule means that the compiler can (and should) use single precision if it is faster and if the results would be the same “as-if” double precision was used.

The difference is almost comical. The shortest function for 32-bit was eight instructions, and these are three instructions. What happened?

The main differences come from the 64-bit ETW/xperf profiling and for many other tools and, as dramatic as it looks in this example, is not actually expensive and should not be disabled. On 64-bit the stack walking is done using metadata rather than executed instructions, so we save three instructions.

The 64-bit ABI also helps because the three parameters are all passed in registers instead of on the stack, and that saves us from two instructions to load parameters into registers. And with that we’ve accounted for all of the differences between the 32-bit and 64-bit versions of MulDouble.

The differences in MulFloat are more significant. x64 implies /arch:SSE2 so there’s no need to worry about the x87 FPU. And, the math is being done at float precision. That saves three conversion instructions. It appears that the VC++ team decided that double-precision temporaries were no longer a good idea for x64. In the sample code I’ve used so far in this post this is pure win – the results will be identical, only faster.

butterfly effect, wherein a distant hurricane may cause a butterfly to flap its wings

If you want to control the precision of intermediates then cast some of your input values to double, and store explicit temporaries in doubles. The cost to do this can be minimal in many cases.

64-bit FTW

64-bit code-gen was a chance for the VC++ team to wipe the slate clean. The new ABI plus the new policy on intermediate precision means that the instruction count for these routines goes from eight-to-twelve instructions to just three. Plus there are twice as many registers. No wonder x64 code can (if your memory footprint doesn’t increase too much) be faster than x86 code.

VS 11 changes everything

Visual Studio 11 beta (which I’m guessing will eventually be called VS 2012) appears to have a lot of code-gen differences, and one of them is in this area. VS 11 no longer uses double precision for intermediate values when it is generating SSE/SSE2 code for float calculations.

It’s the journey, not the destination

So far our test code has simply done one basic math operation on two floats and stored the result in a float, in which case higher precision does not affect the result. As soon as we do multiple operations, or store the result in a double, higher precision intermediates give us a different and more accurate answer.

One could argue that the compiler should use higher precision whenever the result is stored in a double. However this would diverge from the C++ policy for integer math where the size of the destination doesn’t affect the calculation. For example:

int x =1000000;longlong y = x * x;

The multiply generates an ‘int’ result and then assigns that to the long long. The multiply overflows, but that’s not the compiler’s problem. With integer math the operands determine the precision of the calculation, and applying the same rules to floating-point math helps maintain consistency.

Comprehension test

Here’s some simple code. It just adds four numbers together. The numbers have been carefully chosen so that there are three possible results, depending on whether the precision of intermediate values is float, double, or long-double.

Take this code, add it to a project, turn off LTCG, turn on Assembler Output (so you can see the generated code without even calling the code), and play around with build settings to see what you get. You can also run the code to see the three different results, perhaps with some calls to _controlfp_s to vary the x87’s precision.

// Here's how to use _controlfp_s to set the x87 register precistion// to 24-bits (float precision)unsigned oldState;
_controlfp_s(&oldState, _PC_24, _MCW_PC);

To aid in your explorations, use this simple flowchart to figure out what precision is used for intermediates in all-float calculations. Click on the image for a clearer view of the flowchart:

Got it? Simple enough?

I’m sure the VC++ team didn’t plan for the flowchart to be this complicated, however much of the complexity is due to the weirdness of the x87 FPU, and then their desire for SSE/SSE2 code to be x87 compatible. The Intel compiler’s options are complex also but I glossed over the details by just assuming that SSE/SSE2 is available.

And at least Microsoft is moving towards a dramatically simpler story with VS 11 and with x64.

Bigger is not necessarily better

I want to emphasize that more precise intermediates are not ‘better’. As we have seen they can cause performance to be worse, but they can also cause unexpected results. Consider this code:

Sometimes bigger is better

If the compiler uses source-precision (float-precision in this case for intermediates) then it will calculate a float-precision result and then convert it to a double and return it. This is consistent with the rules for integer calculations (the destination of a calculation doesn’t affect how the calculation is done), but it means that maximum precision is not realized. Developers can fix this by casting g_three or g_seven to double, but this is just another example showing that there is no simple solution.

The code that got me interested in this question is shown below in a simplified form:

Depending on the intermediate precision, and therefore depending on your compiler and other factors, this might print 1.0000000000 or 1.0000000149:

What’s a developer to do?

Right now the best recommendation for x86 game developers using VC++ is to compile with /fp:fast and /arch:sse2.

For x64 game developers /arch:sse2 is implied, and /fp:fast may not be needed either. Your mileage may vary, but it seems quite possible that /fp:strict will provide fast enough floating-point, while still giving conformant and predictable results. It really depends on how much you depend on predictable handling of NaNs and other floating-point subtleties that /fp:fast takes away.

For x86 developers using VS 11 /arch:sse will still be needed but, as with x64, /fp:fast may no longer be.

If you need double-precision intermediates you can always request them on a case-by-case basis by using double or by casting from float to double at key points. That’s why I like source-precision intermediates because it gives developers the option to request higher precision when they want it.

Be wary of floating-point constants. Remember that 1.0 is a double, and using it will cause your float calculations to be done at double precision, with the attendant conversion costs. Searching your generated code for “cvtps2pd” and “cvtpd2ps” can help track down significant performance problems. I recently sped up some x64 code by more than 40% just by adding an ‘f’ to some constants. You can also find double-to-float conversions by enabling warning C4244: