Floating-point numbers

Mass of an electron: 9x10e-28 grams to Mass of the sun 2x10e33;
range exceeds10e60

No way we can fit a range of 10e60 values in 8 bits, or even 32 bits.
It would take about 200 bits.

So, we handle this in a computer the same way we do ourselves:
scientific notation. Just translate the number into binary and represent
it with a sign, an exponent, and a mantissa (fraction). We'll put all three
parts in a single word (single precision), or in two words (double precision).
The base will be fixed for all numbers, so there is no need to represent
it explicitly.

Basic Format

The assumed base of the exponent is a power of 2, not 10. The
exponent is 1's, 2's, or excess format (later).

The mantissa is typically represented in sign magnitude and normalized
so
that all of the positions can be used for precision.

Example using base 10:

6875 x 10e3 represented as
6.875 x 10e6 or .6875 x 10e7

The term floating-point is used because the position of the radix
point is adjusted during computation to retain the normal form.

Example format:

| sign bit | 7-bit exponent | 24-bit fractional mantissa
|

Converting decimal fractions into binary fractions

An important difference between real numbers and their representation on
the computer is their density. Real numbers form a continuum; between any
two real numbers no matter how close there is a real number between them.
On a computer, we are using a fixed number of bits, say n, for the fraction
field (the mantissa). Only 2eN different bit patterns, so only 2eN different
fractions can be represented. Truncation or rounding is performed.

Here is an intuitive way to convert a decimal value into a binary fraction.

In general, the fractional places will range from 1/2 to 1/(2eN), where
N is the number of bits.

Procedure: express everything as a fraction of 2eN (here, 256),
then use our conversion strategy for integers.

Convert .625: .625 = X/256;
.625 * 256 = 160; .625 = 160/256.

We want to find which of the digits should be 1, such that those numerators
sum to 160.
160 = 128 + 32. So, .625 decimal = .10100000 binary.

Another example: .6875 using 4 bits.

The positions: . 1/2
1/4 1/8
1/16
. 8/16 4/16
2/16 1/16

Express .6875 as a fraction of 16: .6875 = 11/16.

So, .6875 in 4 bits is: .1011

So, suppose we want to convert 35.6875 decimal to the floating point
representation above.

100011.1011 = .1000111011 * 2e+6.

In our representation above:

| 0 | 0000110 | 1000111011 |

Representing Larger Ranges

Hexadecimal Normalization

Obviously, there is a tradeoff between size (the more bits for the exponent,
the larger the range of numbers) and accuracy (the more bits for the mantissa,
the more significant digits can be represented).

In 7 bits, with an implied base of 2, we are limited to exponents between
-64 to +63. This is not enough of a range.

If we can help it, we don't want to reduce the size of the mantissa
field, because then we will lose accuracy.

What can we do?

Change the implied base to a larger power of 2 (4, 8, or 16, say).

Suppose the base is 16. Range roughly 10e-76 to 10e76,
while still using 7 bits for the mantissa.

Shifting the mantissa mantissa to perform normalization must take place
in steps of 4-bit shifts. A representation is now considered to be
normalized if any of the leading 4 bits of its mantissa is 1.
This is called Hexadecimal normalization.

So, what is the tradeoff of using the larger base? Extra range at the
expense of precision. Even though 24 bits are still used for the mantissa,
hex normalization allows the three leading bits of a mantissa to be 0s.
So, in some cases, only 21 signficant bits are retained. equires I bits.

Another option: use base 2 and a phantom bit

If the base is 2, then the most signifiant bit of a normalized mantissa
is always 1. So, why store it? This way, you can get 24 bits
of precision in 23 bits, and use the extra bit for the exponent.

Excess Notation

Instead of representing the exponent in a signed 2's-complement format,
represent it in excess 2e(N-1), where N is the number of bits in
the exponent. So, in our case, with 7 bits, we have excess-64. An
exponent having the signed value E is represented by the value E' = E +
64. So, what is the range? 0 through 127.

This simplifies the circuitry needed for comparing exponents.

Example: -0.00000101... * 16e3

Unnormalized: | 1 | 0000011
| 00000101.... |

Normalized, base-16 representation:

| 1 | 0000010 | 01010000....
|

Represented in excess-64, base-16 representation:

| 1 | 1000010 | 01010000...|

Overflow

In our example format, underflow occurs if the exponent
< -64; overflow occurs if the exponent > 63. Events
like these are often called exceptions. A typical way to handle
them is to raise an interrupt when they occur.

Binary normalization (implied base is 2)
A phantom bit is used, so that 24 (53) bits of precision are represented
using 23 (52) bits.
The 8-bit exponent is in excess-127 representation, where E'=
0 and E'=255 are used to represent special values, such as exact 0 and
infinity.

E: -127 |
-126 127 |
???
+127 | +127+127 | +127E'
0 |
1 254
| 255

Actually, a more precise version is the following. Recall that
$FF_us = 255 and $FF_2c = -1:

E: -128
-127 | -126
127 |
+127+127 |
+127+127 |
E' -1
0 |
1 254 |

Where -1 ($FF) and 0 are the end values used to represent special values.