Floating point binary

Fixed point may be simple but it is very limited in the range of numbers it can represent. If a calculation produces big changes of scale, i.e. results very much smaller or bigger than the initial set of numbers, then the calculation fails because of underflow or overflow.

A better scheme if you want trouble free calculations is to use a floating point representation.

In this case the radix point is allowed to move during the calculation and extra bits are allocated to keep track of where it is i.e. the binary point "floats".

The advantage of this approach is clear if you consider multiplying a value such as 123.4 by 1000. If the hardware (decimal in this case!) can only hold four digits then the result is an overflow error:

123.4 * 1000 = 123400

which truncates to 3400, which is clearly not the right answer.

If the hardware uses the floating point approach it can simply record that the decimal point has shifted three places to the right, leaving the digits 1234 unchanged.

You can think of this as a way of allowing a much larger range of numbers to be represented but with a fixed number of digits’ precision.
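The difference can be sketched in a few lines of Python. This is only an illustration of the decimal example above: the four-digit limit, the function names and the choice to store the digits as an integer alongside a power-of-ten exponent are all assumptions made for the sketch, not a description of any real hardware.

```python
from fractions import Fraction

DIGITS = 4  # assumed hardware limit: four decimal digits

def fixed_point_multiply(tenths, factor):
    """Fixed point: only the low four digits of the integer part survive."""
    whole = (tenths * factor) // 10      # integer part of the true result
    return whole % 10**DIGITS            # the high digits overflow and are lost

def float_scale(digits, exponent, power):
    """Floating point: the digits are untouched, only the exponent moves."""
    return digits, exponent + power

def float_value(digits, exponent):
    """Value of 0.dddd x 10^exponent, computed exactly."""
    return digits * Fraction(10) ** (exponent - DIGITS)

# 123.4 is 1234 tenths in fixed point, or 0.1234 x 10^3 in floating point.
print(fixed_point_multiply(1234, 1000))      # 3400, the wrong answer
digits, exponent = float_scale(1234, 3, 3)   # multiply by 10^3
print(digits, exponent)                      # 1234 6
print(float_value(digits, exponent))         # 123400
```

The floating point version never touches the four stored digits, which is why the range grows while the precision stays fixed.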

A floating point number is represented by two parts – an exponent and a fractional part.

The fractional part is just a fixed-point representation of the number – usually with the radix point to the immediate left of the first bit making its value less than 1.

The exponent is a scale factor which determines the true magnitude of the number.

In decimal we are used to this scheme as scientific notation, standard form or exponential notation. For example, Avogadro’s number is usually written as 6.02252 x 10^23, where the 23 is the exponent and the 6.02252 is the fractional part. Notice that in standard form the fractional part is always at least 1 and less than 10. In floating point representation it is usual for the fractional part to be normalised to be just less than 1.

Floating point in this form was known to the Babylonians in 1800 BC but computers took a long time to get round to using it. It was independently proposed by Leonardo Torres y Quevedo in 1914, by Konrad Zuse in 1936 and George Stibitz in 1939. The first relay computers, the Harvard Mark II for example, had floating point hardware but it was too costly to include in the ENIAC, though it was considered.

In binary, floating point is just the binary equivalent of standard form in decimal. The exponent is the power of two by which you have to multiply the fraction to get the true magnitude. At this point you might want to write floating point off as trivial but there are some subtleties.
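As a quick illustration, Python's standard `math.frexp` function splits a value into exactly this kind of pair: a fraction that is at least one half and less than 1, and the power of two that scales it. The helper name `split` is just for this sketch, and the pair it returns is a mathematical decomposition, not the bit layout that any particular hardware stores.

```python
import math

def split(x):
    """Return (fraction, exponent) with x == fraction * 2**exponent."""
    return math.frexp(x)

for x in (6.0, 0.15625, 256.0):
    fraction, exponent = split(x)
    # Multiplying the fraction by the power of two recovers the value.
    assert math.ldexp(fraction, exponent) == x
    print(f"{x} = {fraction} x 2^{exponent}")
```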

For example, when the fractional part is zero what should the exponent be set to?

Clearly there is more than one representation for zero. By convention the exponent is made as negative as it can be, i.e. as small as possible, in the representation of zero. If two's complement were used for the exponent, its most negative value would be a one followed by zeros, giving a zero that didn’t have all its bits set to zero. We don't like this because it means that zero can't be tested for, or memory cleared to zero, simply by working with all-zero bits.

To achieve an all-zero zero a small change is needed: use a biased exponent, formed by adding the magnitude of the most negative exponent to the true exponent before storing it.

For example, if the exponent is six bits in size, the two's complement notation range is –32 to +31.

If instead of two's complement a simple unsigned representation is used then 0, i.e. all bits zero, represents –32, 32 represents 0 and 63 represents 31. The same range is covered but now the representation of zero has all bits set to zero, even if it does now mean 0 x 2^-32.

The value you subtract from the exponent to obtain its true value is called the “bias” – 32 in our example.
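The six-bit example can be checked with a small sketch. The function names here are hypothetical and the code only models the exponent field, assuming a bias of 32 as in the text.

```python
EXP_BITS = 6
BIAS = 1 << (EXP_BITS - 1)   # 32 for a six-bit exponent

def encode_exponent(true_exponent):
    """Add the bias so the stored field is a plain unsigned number."""
    stored = true_exponent + BIAS
    if not 0 <= stored < (1 << EXP_BITS):
        raise OverflowError("exponent does not fit in six bits")
    return stored

def decode_exponent(stored):
    """Subtract the bias to recover the true exponent."""
    return stored - BIAS

for e in (-32, 0, 31):
    print(e, format(encode_exponent(e), "06b"))
# -32 000000
# 0 100000
# 31 111111
```

The most negative exponent now maps to all zero bits, which is what lets a floating point zero be all zero bits as well.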

Algorithms

Algorithms for working with floating point numbers are very complex and very easy to get wrong.

Early computers, or perhaps it should be early programmers, suffered a lot from poor floating point implementations. Often the values that those floating point routines produced were closer to random numbers than to true values.

For example, consider the algorithm for adding two floating point numbers.

You can’t simply add the fractional parts because the numbers could be very different in size. The first task is to shift the fraction with the smaller exponent to the right, increasing its exponent by one for each place shifted, until the two exponents are equal.

When this has been done the fractional parts can be added in the usual way and the result can then be normalised so that the fraction is just less than one.

Sounds innocent enough, but consider what happens when you try to add 1 x 2^-8 to 1 x 2^8 using an eight-bit fractional part. Both numbers are represented in floating point form as binary 0.1, but with exponents of -7 and 9 respectively.

When you try to add these two values the first has to be shifted 16 places to the right to raise its exponent from -7 to 9, and the single non-zero bit falls off the end of the eight-bit fraction after only eight shifts, so the value becomes zero in the standard precision. So when you think you are adding a small value to 1 x 2^8 you are in fact adding zero.
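The align, add and normalise steps can be sketched as follows. This is a minimal model under stated assumptions, not a real floating point unit: numbers are positive only, a value is a hypothetical pair (fraction, exponent) in which the fraction is an eight-bit integer standing for fraction/256, and bits shifted off the end are simply discarded rather than rounded.

```python
FRACTION_BITS = 8

def normalise(fraction, exponent):
    """Bring the fraction back into the range 0.5 <= f < 1 (top bit set)."""
    if fraction == 0:
        return 0, exponent
    while fraction >= (1 << FRACTION_BITS):        # carry out of the top bit
        fraction >>= 1
        exponent += 1
    while fraction < (1 << (FRACTION_BITS - 1)):   # leading zeros
        fraction <<= 1
        exponent -= 1
    return fraction, exponent

def add(a, b):
    """Add two positive (fraction, exponent) pairs."""
    (fa, ea), (fb, eb) = a, b
    if ea < eb:                                    # make a the larger exponent
        (fa, ea), (fb, eb) = (fb, eb), (fa, ea)
    fb >>= (ea - eb)                               # align: low bits fall off
    return normalise(fa + fb, ea)

small = (0b10000000, -7)   # binary 0.1 x 2^-7, i.e. 2^-8
large = (0b10000000, 9)    # binary 0.1 x 2^9,  i.e. 2^8
print(add(small, large) == large)                  # True: the small value vanished

# When the exponents are close the same routine works as expected:
print(add((0b11000000, 2), (0b10000000, 0)))       # (224, 2): 3 + 0.5 = 3.5
```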

Not much of a problem, but try the following program in C# (the same would happen in most languages; only the values used would change):

float v = 0.999999F;
do
{
    v = v + 0.0000001F;
} while (v < 1);

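This program does complete the loop but if you add one more zero before the 1 in the quantity added to v, the loop never ends. The increment is then less than half the gap between adjacent float values just below 1, so each addition rounds back to the value v already had.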
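The same behaviour can be reproduced without risking an endless loop. The sketch below is in Python rather than C#, and because Python's own floats are double precision it assumes that rounding each result through the standard `struct` module's four-byte `f` format is a fair stand-in for single-precision arithmetic. The helper names and the step limit of 100 are hypothetical choices for the illustration.

```python
import struct

def to_float32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def steps_to_reach_one(increment, limit=100):
    """Count additions until v reaches 1, or return None if it never moves."""
    v = to_float32(0.999999)
    inc = to_float32(increment)
    for step in range(1, limit + 1):
        v = to_float32(v + inc)      # round after every addition, as float does
        if v >= 1:
            return step
    return None                      # gave up: v is stuck below 1

print(steps_to_reach_one(0.0000001))    # a small step count: the loop ends
print(steps_to_reach_one(0.00000001))   # None: v never changes
```

Capping the number of steps is what makes the stalled case observable; the original loop has no such guard.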