Andrew Koenig

Dr. Dobb's Bloggers

The Temptation For Implementations To Cut Corners

February 20, 2014

It would be useful if all conversions were as accurate as possible and, independently, were always done in exactly the same way. But complications can impede these goals.

Last week, I noted that C++ programs convert floating-point numbers from decimal strings to internal binary form in several different contexts, and suggested that it would be useful if all such conversions were as accurate as possible and, independently, were always done in exactly the same way. These desirable qualities may seem so obvious as to raise the question of why an implementation would ever want to do anything else. The reasons, as is so often the case, are complicated.

What is the value of percent? Obviously it should be close to 0.01, but how close?

On an implementation that conforms to the IEEE floating-point standard, there is no question: The value of tenth is the closest single-precision floating-point approximation to 0.1, and the value of percent is obtained by squaring that (approximate) value, rounding the result to single precision (because the product of two float values is a float, not a double), and then converting that value (presumably without changing it) to double.

To make this discussion concrete, let's assume that float numbers have 24 significant bits. Then the value of tenth will be a floating-point number with a binary value of exactly 0.000110011001100110011001101. This number has 27 bits after the binary point, but only 24 of them are significant, because the three leading zeroes after the binary point do not count as significant bits.

Remember that every binary floating-point number can be represented exactly as a decimal number with the same number of digits after the decimal point as the binary number has after the binary point. This particular value has 27 bits after the binary point, which means that we can represent it exactly as a decimal value with 27 digits after the decimal point, namely 0.100000001490116119384765625. This is the best approximation to 0.1 that is possible in 24-bit binary floating-point.

If we now square this number, we get a binary value of exactly 0.000000101000111101011100001010100011110101110000101001. With its 54 bits after the binary point, this value translates into a decimal value of exactly 0.010000000298023226097399174250313080847263336181640625, which has 54 digits after the decimal point. The product has only 48 significant bits; so if our typical IEEE-compliant implementation has 53 significant bits in double-precision floating-point numbers, this value can be represented exactly as a double but not as a float. If we wish to represent it as a float, we must round it to 24 significant bits, which gives us 0.000000101000111101011100001011 in binary. The last bit changes from 0 to 1 as a consequence of rounding. Because this value has 30 bits after the binary point, we can convert it exactly to a decimal value with 30 digits after the decimal point, namely 0.010000000707805156707763671875.

If you've spaced out while you were trying to follow this intricate discussion, here's the point. If we assume that float variables have 24 significant bits and double variables have 53 significant bits, then:

The closest float to 0.1 is 0.100000001490116119384765625, so that is the value that tenth will have.

The exact value of tenth * tenth is 0.010000000298023226097399174250313080847263336181640625, which can be represented exactly as a double.

If we round tenth * tenth to float, the result is exactly 0.010000000707805156707763671875.

The result of rounding tenth * tenth to float differs from the exact value of tenth * tenth in the 9th significant (decimal) digit, and it differs from the ideal value 0.01 by about 7 parts in 10^8.

In other words, rounding tenth to float limits the precision of tenth * tenth to about seven significant digits.

I now invite you to ponder the following question:

Suppose you are implementing C++ on a computer that always produces a double result as the product of two float values. On such hardware, storing the double value of tenth * tenth directly in percent might be substantially faster than rounding that result to float and then converting it back to double. In this case, the decision about whether to round would affect the 9th and subsequent significant digits of a value that has only seven significant digits in the first place.

As an implementor, surely you will be tempted to generate the faster code, because doing so is at least as accurate as computing the result as a double, rounding it to float, and then converting it back to double. What possible disadvantage could there be to this faster, more accurate approach? We shall see next week.

