Survey of Floating-Point Formats

This page gives a very brief summary of floating-point formats that
have been used over the years. Most have been implemented in hardware
and/or software and used for real work; a few (notably the small ones
at the beginning) just for lecture and homework examples. They are
listed in order of increasing range (a function of exponent size)
rather than by precision or chronologically.

IEEE 754-2008 binary16, also called "half", "s10e5", or "fp16". 2-byte format with an excess-15 exponent, used in nVidia NV3x and subsequent GPUs [24,25,27,33]; the largest minifloat. Can approximate any 16-bit unsigned integer or its reciprocal to 3 decimal places.

B : Base of exponent. This is the factor by which your
floating-point number is multiplied if you raise its exponent by 1. Modern
formats like IEEE 754 all use base 2, so B is 2, and increasing
the exponent field by 1 amounts to multiplying the number by 2. Older
formats used base 8, 10 or 16.

V : The size of this field, and therefore the precision,
number of "digits", and/or exponent range, is variable and limited only
by available memory and how long you're willing to wait for a calculation.
For example, one version of Maxima takes over 4 hours to
give an answer for "bfloat(10.0)^(bfloat(2.0)^bfloat(100000.0));".

We : Width of exponent. If B is 2, 8 or 16, this is the
number of bits (binary digits) in the exponent field. For the specific
case of B=2, We is equal to K+1 in the inequality 1-2^K < e < 2^K
specifying the bounds of the excess-(2^K-1) exponent in an IEEE 754
representation (see below). When B is 10, there are two cases: "6d"
indicates an exponent stored as base-10 digits, and the letter d is
included to make this clear; "8-" indicates an IEEE 754 decimal
format with a binary-encoded exponent, using 2 bits in the combination
field and 6 bits in the following exponent field, which together can
hold only 3/4 of the values such a width would imply (because the high
2 bits cannot both be 1); thus the legal values are e such that
0 ≤ e < 3×2^6.

Wm : Width of mantissa. For binary formats with "hidden" or
"implied" leading mantissa bits, this is given as "1+N", such as
"1+23"; the "1+" refers to the leading 1 bit, and this plus 23 actual
bits gives a total of 24 bits of precision. For decimal formats the
letter "d" is shown to make it clear that the precision is in decimal
digits.

IEEE 754 Single Representation

This is worth describing in a bit more detail because it is so
prevalent in the hardware used today, and it is probably what you'll
be looking at when you try to decipher a floating-point value from its
"raw binary".

First a warning: Although the "normal" values are what you see when
your program is working with real data, proper handling of the rest of
the values (denorms, NANs, etc.) is vitally important; otherwise
you'll get all sorts of horrible results that are difficult to
understand, and usually impossible to fix.

So, for the normal values (which in this case means not including the
zeros, denorms, NANs, and infinities), the value being represented can
be expressed in the following form:

value = s × 2^(k+1-N) × n

where the sign s is -1 or 1, and k and n are integers that fall
within the ranges given by:

1-2^K < k < 2^K and 2^(N-1)-1 < n < 2^N

for two integers K and N. If you look at the ranges of k and n
you can see that k can have exactly 2^(K+1)-2 values and n can
have exactly 2^(N-1) values, and therefore exactly K+1 bits can be
used to store the exponent (including two unused values discussed
below) and N-1 bits to store the mantissa. To give a specific
example, for IEEE 754 single precision, as the above table shows there
are We=8 bits for the exponent and Wm=23 bits for the mantissa,
so K is 7 and N is 24.

The exponent is stored in "excess 2^K-1" format, which means the
binary value you see is 2^K-1 (that is, 127 when K is 7) bigger than
the actual value of k being represented. For example, when K is 7 and
the value 254 is seen, k is 127, and the value being represented is
s × 2^(128-N) × n. This is only true for the normal values
just described, not for denorms.

The next set of values to understand are the denormalized values
(or "denorms"), very small values for which

k = 2-2^K and 0 < n < 2^(N-1)

using the same definitions as above. These values use one of the
"unused" exponent values, namely the one that is all 0 bits. They are
very important because they make underflow work better: instead of
jumping suddenly to 0, you lose precision gradually as you go
towards 0.
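Putting the formulas together, here is a minimal decoder sketch (Python; the function name and test values are my own, not from any particular library) that pulls s, k and n out of a raw 32-bit pattern and evaluates s × 2^(k+1-N) × n, using k = 2-2^K for the zeros and denorms. It checks itself against the machine's native conversion via the struct module:

```python
import struct

K, N = 7, 24                        # IEEE 754 single: We = K+1 = 8, Wm = N-1 = 23

def decode_single(bits):
    """Evaluate s * 2^(k+1-N) * n for a raw 32-bit pattern (normals,
    denorms and zeros only; infinities and NANs are not handled)."""
    s = -1 if bits >> 31 else 1
    e = (bits >> 23) & 0xFF         # the excess-(2^K - 1) exponent field
    f = bits & 0x7FFFFF             # the 23 stored mantissa bits
    if e == 0:                      # zeros and denorms: k = 2 - 2^K, no hidden bit
        k, n = 2 - 2**K, f
    else:                           # normals: hidden leading 1, k = e - (2^K - 1)
        k, n = e - (2**K - 1), (1 << (N - 1)) + f
    return s * 2.0 ** (k + 1 - N) * n

# spot-check against the machine's own reading of the same bit patterns
for bits in (0x40400000, 0x3F800000, 0x00000001, 0x7F7FFFFF, 0x80000000):
    assert decode_single(bits) == struct.unpack('<f', struct.pack('<I', bits))[0]
print(decode_single(0x40400000))    # 3.0 (see the table of sample values below)
```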

In addition to making the underflow case a little less severe by
losing precision gradually instead of suddenly, denormalized values
eliminate a lot of strange bugs that would otherwise occur. For
example, the tests "if x>y" and "if x-y>0" can yield different
results, unless you use denorms.
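To make the x>y versus x-y>0 point concrete, here is a tiny demonstration (Python, using ordinary 64-bit doubles): with gradual underflow, the difference of two distinct tiny values is a denorm that is still greater than 0. If denormal results were flushed to zero instead, x>y would be true while x-y>0 was false.

```python
d = 5e-324            # the smallest positive denormalized double
x, y = 3 * d, 2 * d   # two distinct, very small values
assert x > y          # the comparison sees them as different...
assert x - y > 0      # ...and so does subtraction, because the difference
                      # (a single denormal step) is itself representable
print(x - y == d)     # True: the gap survives as a denorm instead of flushing to 0
```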

All of the various values are arranged in such a way that hardware or
software can perform comparisons treating the data as signed-magnitude
integers, and as long as neither argument is a NAN the proper answer
will result. Such comparisons even properly handle the infinities and
negative zero. (A signed-magnitude integer is a sign bit followed by
an unsigned expression of its magnitude; this is not the normal
signed integer format, which is called "2's complement signed integer".
As with floats, there are two ways to express 0 as a signed-magnitude
integer.)
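The following sketch (Python; the key function is my own, for illustration) shows the trick in action: reinterpret each value's bits, compare them as signed-magnitude integers, and the resulting order agrees with numeric order, infinities, denorms and negative zero included.

```python
import struct

def sign_magnitude_key(x):
    """Map a float to its IEEE single bits read as a signed-magnitude integer."""
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    sign, magnitude = bits >> 31, bits & 0x7FFFFFFF
    return -magnitude if sign else magnitude

vals = [3.0, -2.0, float('inf'), 1e-42, -0.0, 0.5, float('-inf'), -1e-42]
# sorting by the integer key gives the same order as numeric comparison
assert sorted(vals, key=sign_magnitude_key) == sorted(vals)
```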

Here are some sample values with their binary representation. The
binary digits are broken into groups of 4 to help with interpreting a
value in hexadecimal. They are shown in order from largest to
smallest, with the non-numbers in the places they would fall if they
were sorted by their bit patterns.

 s | exponent   | mantissa                     | value(s)
---+------------+------------------------------+------------------------------------------
 0 | 111.1111.1 | 111.1111.1111.1111.1111.1111 | Quiet NANs
 0 | 111.1111.1 | 100.0000.0000.0000.0000.0000 | Indeterminate
 0 | 111.1111.1 | 0xx.xxxx.xxxx.xxxx.xxxx.xxxx | Signaling NANs
 0 | 111.1111.1 | 000.0000.0000.0000.0000.0000 | Infinity
 0 | 111.1111.0 | 111.1111.1111.1111.1111.1111 | 3.402×10^38
 0 | 100.0000.1 | 000.0000.0000.0000.0000.0000 | 4.0
 0 | 100.0000.0 | 100.0000.0000.0000.0000.0000 | 3.0
 0 | 100.0000.0 | 000.0000.0000.0000.0000.0000 | 2.0
 0 | 011.1111.1 | 000.0000.0000.0000.0000.0000 | 1.0
 0 | 011.1111.0 | 000.0000.0000.0000.0000.0000 | 0.5
 0 | 000.0000.1 | 000.0000.0000.0000.0000.0000 | 1.175×10^-38 (Smallest normalized value)
 0 | 000.0000.0 | 111.1111.1111.1111.1111.1111 | 1.175×10^-38 (Largest denormalized value)
 0 | 000.0000.0 | 000.0000.0000.0000.0000.0001 | 1.401×10^-45 (Smallest denormalized value)
 0 | 000.0000.0 | 000.0000.0000.0000.0000.0000 | 0
 1 | 000.0000.0 | 000.0000.0000.0000.0000.0000 | -0
 1 | 000.0000.0 | 000.0000.0000.0000.0000.0001 | -1.401×10^-45 (Smallest denormalized value)
 1 | 000.0000.0 | 111.1111.1111.1111.1111.1111 | -1.175×10^-38 (Largest denormalized value)
 1 | 000.0000.1 | 000.0000.0000.0000.0000.0000 | -1.175×10^-38 (Smallest normalized value)
 1 | 011.1111.0 | 000.0000.0000.0000.0000.0000 | -0.5
 1 | 011.1111.1 | 000.0000.0000.0000.0000.0000 | -1.0
 1 | 100.0000.0 | 000.0000.0000.0000.0000.0000 | -2.0
 1 | 100.0000.0 | 100.0000.0000.0000.0000.0000 | -3.0
 1 | 100.0000.1 | 000.0000.0000.0000.0000.0000 | -4.0
 1 | 111.1111.0 | 111.1111.1111.1111.1111.1111 | -3.402×10^38
 1 | 111.1111.1 | 000.0000.0000.0000.0000.0000 | Negative infinity
 1 | 111.1111.1 | 0xx.xxxx.xxxx.xxxx.xxxx.xxxx | Signaling NANs
 1 | 111.1111.1 | 100.0000.0000.0000.0000.0000 | Indeterminate
 1 | 111.1111.1 | 111.1111.1111.1111.1111.1111 | Quiet NANs

IEEE 754 Decimal Formats

The decimal32, decimal64 and decimal128 formats defined in the
IEEE 754-2008 standard are interesting largely because of their
innovative packing of 3 decimal digits into 10 binary digits. Decimal
formats are still useful because they can store decimal fractions
(like 0.01) precisely. Normal BCD (binary-coded decimal) uses 4 binary
digits for each decimal digit, wasting about 17% of the
information capacity of the bits. The 1000 combinations of 3 decimal
digits fit nearly perfectly into the 1024 combinations of 10 binary
digits. In addition to the space efficiency, groups of 3 work well for
formatting and printing, which typically use a thousands separator
(such as "," or a blank space) between groups of 3 digits. However,
at first glance the prospects for easy encoding and decoding seem
bleak. In 1975 Chen and Ho published the first such system, but it had
some drawbacks. The Cowlishaw encoding [4], used in IEEE 754-2008, is
remarkable because it manages to achieve all of the following
desirable goals:

The encoding of 000 is all 0's; if the 3 digits are 000-009, the
high 6 bits of the encoded result are 0; and if the digits are 010-099
the high 3 bits are 0. Thus you can store 1 digit in 4 bits or 2
digits in 7 bits, making it easy to store any
number of decimal digits, not just a multiple of 3; and you can
expand any field into a larger field by adding 0's on the left.

All combinations from 000-079 encode into the same bit pattern as
normal BCD.

You can easily discover if any decimal digit is odd or even by testing
a single bit in the binary encoding: test bit 0 (the lowest bit)
to see if the low digit is odd; test bit 4 to see if the middle digit
is odd and test bit 7 to see if the high digit is odd. These tests
always work regardless of the values of the other digits. (As a consequence
of this, the hardware implementations for encoding and decoding require
no gates for these 3 bits.)

The hardware implementation for encoding 3 decimal digits into 10
binary digits requires only a total of 33 NAND gates,
and decoding back to decimal requires only 54 NAND gates, with a
3-gate delay in both directions (not including fanout drivers).

The 24 unused bit patterns are easily characterized as [ddx11x111x]
with [dd] equal to 01, 10 or 11.
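Here is a small self-check (Python) of those properties. The decode table is transcribed from the published description of the encoding; since the 24 redundant patterns differ from their canonical twins only in the two high bits, the canonical encoding of each 3-digit group can be recovered as the numerically smallest pattern that decodes to it:

```python
def dpd_decode(b):
    """Decode a 10-bit densely-packed-decimal pattern to digits (d2, d1, d0)."""
    p, q, r, s, t, u, v, w, x, y = [(b >> i) & 1 for i in range(9, -1, -1)]
    if v == 0:                        # three small digits (0-7)
        return 4*p+2*q+r, 4*s+2*t+u, 4*w+2*x+y
    if (w, x) == (0, 0):              # only the low digit is large (8-9)
        return 4*p+2*q+r, 4*s+2*t+u, 8+y
    if (w, x) == (0, 1):              # only the middle digit is large
        return 4*p+2*q+r, 8+u, 4*s+2*t+y
    if (w, x) == (1, 0):              # only the high digit is large
        return 8+r, 4*s+2*t+u, 4*p+2*q+y
    return {(0, 0): (8+r, 8+u, 4*p+2*q+y),    # (w,x) == (1,1): two or three
            (0, 1): (8+r, 4*p+2*q+u, 8+y),    # large digits, selected by (s,t)
            (1, 0): (4*p+2*q+r, 8+u, 8+y),
            (1, 1): (8+r, 8+u, 8+y)}[(s, t)]

enc = {}
for b in range(1024):                 # ascending, so first hit = canonical pattern
    d2, d1, d0 = dpd_decode(b)
    enc.setdefault(100*d2 + 10*d1 + d0, b)

for n, b in enc.items():
    d2, d1, d0 = n // 100, n // 10 % 10, n % 10
    if n <= 9:  assert b >> 4 == 0                      # 000-009: high 6 bits are 0
    if n <= 99: assert b >> 7 == 0                      # 010-099: high 3 bits are 0
    if n <= 79: assert b == (d2 << 8) | (d1 << 4) | d0  # same bit pattern as BCD
    assert (b >> 0) & 1 == d0 & 1                       # one bit reveals each
    assert (b >> 4) & 1 == d1 & 1                       # digit's parity
    assert (b >> 7) & 1 == d2 & 1
print("all", len(enc), "groups check out")              # all 1000 groups check out
```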

Minifloats and Microfloats: Excessively Small Floating-Point Formats

Although they do not have much practical value as a universal format
for computation, very small floating-point formats are of interest for
other reasons.

One can refer to a format using 16 bits or less as a
minifloat. (For the origin of the term, see footnotes
[42,43,44].) Of these, the most popular by far is 1.5.10 (or
s10e5 or binary16), the 16-bit format invented at nVidia and ILM
and now a part of IEEE 754-2008. This format uses 1 sign bit, a
5-bit excess-15 exponent, 10 mantissa bits (with an implied 1 bit) and
all the standard IEEE rules including denormals,
infinities and NaNs. The minimum and maximum representable (and
positive) values are 5.96×10^-8 and 65504 respectively.

 s | expon. | mantissa     | value(s)
---+--------+--------------+------------------------------------------
 0 | 111.11 | xx.xxxx.xxxx | various NANs
 0 | 111.11 | 00.0000.0000 | Infinity
 0 | 111.10 | 11.1111.1111 | 65504 (Largest finite value)
 0 | 100.11 | 10.1100.0000 | 27.0
 0 | 100.01 | 11.0000.0000 | 7.0
 0 | 100.00 | 10.0000.0000 | 3.0
 0 | 011.11 | 00.0000.0000 | 1.0
 0 | 011.10 | 00.0000.0000 | 0.5
 0 | 000.01 | 00.0000.0000 | 6.104×10^-5 (Smallest normalized value)
 0 | 000.00 | 11.1111.1111 | 6.098×10^-5 (Largest denormalized value)
 0 | 000.00 | 00.0000.0001 | 5.96×10^-8 (Smallest denormalized value)
 0 | 000.00 | 00.0000.0000 | 0
 1 | 011.11 | 00.0000.0000 | -1.0 (other negative values are analogous)
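As a cross-check on the table, this sketch (Python 3.6 or later, whose struct module understands binary16 via the 'e' format character) decodes 16-bit patterns by the excess-15 rules described above and compares the results with the library's own conversion:

```python
import struct

def decode_half(bits):
    """Decode a 16-bit binary16 pattern using the s10e5 rules."""
    s = -1.0 if bits >> 15 else 1.0
    e = (bits >> 10) & 0x1F                  # 5-bit excess-15 exponent field
    m = bits & 0x3FF                         # 10 stored mantissa bits
    if e == 0:                               # zeros and denorms: 0.m × 2^-14
        return s * m * 2.0 ** -24
    if e == 31:                              # infinities and NANs
        return s * float('inf') if m == 0 else float('nan')
    return s * (1024 + m) * 2.0 ** (e - 25)  # normals: 1.m × 2^(e-15)

# rows from the table above: 65504, 27.0, 1.0, smallest denorm, -0
for bits in (0x7BFF, 0x4EC0, 0x3C00, 0x0001, 0x8000):
    assert decode_half(bits) == struct.unpack('<e', struct.pack('<H', bits))[0]
print(decode_half(0x7BFF), decode_half(0x4EC0))   # 65504.0 27.0
```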

This format is supported in hardware by many nVidia graphics cards
including GeForce FX and Quadro FX 3D (they call it fp16 or
s10e5), and is used by Industrial Light and Magic (as part of their
OpenEXR standard) and Pixar as the native format for raw output of
rendered frames (prior to conversion to a compressed format like DVD
or HDTV, or imaging on photographic film for exhibition in a theater).
s10e5 is more than sufficient to represent light levels in a rendered
image, and compared to 32-bit floating-point it presents quite a few
advantages: it requires half as much memory space (and bus bandwidth),
and an operation (such as addition or multiplication) takes less than
half the time (as measured in gate delay) and uses about 1/4 as many
transistors. All of these advantages are very important when you are
expected to perform trillions of operations to render a frame.

To give a concrete example: at the time of the 3-GHz Pentium, which
was capable of 12 billion floating-point operations per second (12
GFLOPs), nVidia graphics cards for consumers could manage around 40
billion operations per second. Soon after that, ATI (which used the
24-bit 1.7.16 or s16e7 format) surpassed that, and the two companies
repeatedly leapfrogged each other. In subsequent years, the graphics
cards continued to widen their lead over CPUs, and even after 32-bit
floating-point became common on graphics cards, 16-bit remained in
very common use, mainly because it puts a lighter load on memory
bandwidth.

The computer-graphics industry has long recognized the value of
floating-point to represent pixels, because a pixel expresses
(essentially) a light level. Light levels can vary over a very wide
range; for example, the ratio between broad daylight and a clear
night under a full moon is 14 "magnitudes" on the scale used by
astronomers. That's 2.512^14 ≈ 400,000. The ratio of brightnesses
in nighttime environments with bright lights (such as when driving at
night, or in a candlelit room) is similar. Such scenes have
"high-contrast" lighting. The human eye can handle this range easily.
A standard 8-bit format for pixel values (typically 8 bits for each of
the three components red, green and blue) doesn't even come close.
Doubling the pixel width to 16 bits produces the 48-bit format (common
in the industry) but does little to improve the situation for
high-contrast lighting; for pixel values near the bottom of the
range, roundoff error is terrible. But using the 1.5.10 float format
increases the range to over 10^9 (values as small as 6.1×10^-5
and as large as 65504), with the equivalent of 3 decimal digits of
precision over the entire range. It can also represent any integer
from -2048 to 2048, so it is even useful in some situations like pixel
addresses within a texture.

A floating-point format using 8 bits or less fits in a byte; I call
this a microfloat. These are the best for learning, particularly
when you have to convert to/from floating-point using pencil and
paper. I am not alone in thinking they are useful as an educational
tool for learning about and practising the implementation of
floating-point algorithms; I have found courses at no fewer than 11
colleges and universities that use them in
lectures [21,22,23].

But surprisingly, such small representations even have use in the real
world, sort of. Some encodings used for waveforms and other
time-variable analog data are very close to being a floating-point
encoding with a small number of exponent and mantissa bits. An example
is the "mu-law" coding used for audio. Such codes usually store the
logarithm of a value plus a sign, and have a special value for zero.
This is not the same as a true floating-point format, but it has a
similar range and precision.

The smallest format that has all three fields would be 1.1.1 format:
three bits, with one bit each for sign, exponent and mantissa. 1.1.1
format encodes the values {-3, -2, -1, -0, 0, 1, 2, 3} or an
equivalent set multiplied by a scaling factor. But this isn't very
"useful" because you can do a little better just by treating the 3
bits as a signed integer (which gives you the integers -4 through 3).

The smallest formats that are "useful" in the sense of covering a
broader range than the same number of bits as a signed integer have at
least a 2-bit exponent field. There is always at least 1 mantissa bit
anyway (the hidden leading 1, or leading 0 for the denormalized values
when the exponent field is 0). The smallest of these is 1.2.0 format:
three bits, encoding the values {-4, -2, -1, -0, 0, 1, 2, 4}.

Adding one mantissa bit to get the 4-bit format 1.2.1 gives us a lot
more: it encodes the set {-12, -8, -6, -4, -3, -2, -1, -0, 0, 1, 2,
3, 4, 6, 8, 12}, giving quite a bit more than the range of the 4-bit
signed integer {-8 ... 7}.

5 bits are best used in a 1.2.2 format, using 1 sign bit, 2 exponent
bits and 2 mantissa bits (plus an implied leading 1 bit for a mantissa
precision of 3 bits). If the exponent is treated as excess -2
(that's "excess minus-two"), all representable values are
integers and the range is {-28 .. 28} (or {-24 .. 24} if the highest
exponent value is used for infinities). 5 bits as a normal two's
complement integer has a range of {-16 .. 15}.
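A quick enumerator makes these value sets easy to reproduce. The sketch below (Python; the function and its bias convention are my own framing, an IEEE-style bias in which the denorms share the scale of the smallest normal exponent) prints the nonnegative values of a 1.E.M format. The "excess" figures quoted above use a slightly different convention, so the bias arguments here were chosen to reproduce the quoted sets, which in any case are only defined up to a power-of-two scale factor:

```python
def minifloat_values(E, M, bias):
    """Nonnegative values of a 1.E.M format: denorms when the exponent
    field is 0, a hidden leading 1 bit otherwise, no infinities or NANs."""
    vals = set()
    for e in range(2 ** E):
        for m in range(2 ** M):
            if e == 0:
                vals.add(m * 2.0 ** (1 - bias - M))             # 0.m × 2^(1-bias)
            else:
                vals.add((2 ** M + m) * 2.0 ** (e - bias - M))  # 1.m × 2^(e-bias)
    return sorted(vals)

print(minifloat_values(2, 0, 1))    # 1.2.0: 0, 1, 2, 4
print(minifloat_values(2, 1, 0))    # 1.2.1: 0, 1, 2, 3, 4, 6, 8, 12
print(minifloat_values(2, 2, -1))   # 1.2.2: 0..8 by 1, 8..16 by 2, 16..28 by 4
```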

Reader George Spelvin [34] pointed out that an "all-integer" 0.2.3
format, with no sign and without denormals or infinities, is used in
the command for setting the keyboard repeat rate (the "typematic
rate") of an IBM PC keyboard. There is a 5-bit field whose 32 possible
values are used for the numbers 8 through 120, as follows: field
values 00000-00111 give 8 through 15, 01000-01111 give the even
numbers 16 through 30, 10000-10111 give the multiples of 4 from 32
through 60, and 11000-11111 give the multiples of 8 from 64
through 120.

These are used to indicate inter-character delay values of 8/240 through 120/240 of a
second (i.e. the fastest rate is 30 characters per second and the
slowest is 2 per second). This is like a 2-bit "excess -3" exponent
and a 3-bit mantissa.
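A decoder for that field is a one-liner; this sketch (Python; the function name is hypothetical, not from any keyboard-programming library) reproduces the stated extremes:

```python
def typematic_count(field):
    """Decode the 5-bit typematic field: 2-bit excess -3 exponent and a
    3-bit mantissa, giving (8 + m) × 2^e in units of 1/240 second."""
    e, m = field >> 3, field & 0b111
    return (8 + m) << e                     # 8 through 120

assert typematic_count(0b00000) == 8        # 8/240 s: 30 characters per second
assert typematic_count(0b11111) == 120      # 120/240 s: 2 characters per second
assert len({typematic_count(f) for f in range(32)}) == 32  # all 32 values distinct
```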

In separate, earlier correspondence, Spelvin suggested other similar
all-integer formats, with denormals but without infinities or
NANs. The exponent excess is taken to be whatever value causes the
denorms to use the same storage format as the corresponding integer.

Using 0.5.3 format as an example: there is no sign bit, so
all values are positive. When the exponent field is 0, the mantissa is
denormalized, so the values (in binary) 00000.000 through 00000.111
express the integers 0 through 7. The next exponent value is 00001 in
binary; its 1 bit happens to coincide with the implied leading 1 of
the (now normalized) mantissa, so values 00001.000 through 00001.111
express the integers 8 through 15. Notice how all of these values for
the integers 0 through 15 are the same as the normal 8-bit integer
representation.

After that, values scale in the normal way: 00010.000 through
00010.111 express the even integers 16 through 30 (note that only
the first of these corresponds to the integer representation);
00011.000 through 00011.111 are the multiples of 4 from 32 through 60;
and so on. The highest value is 11111.111, which is 15×2^30 =
2^(2^5-2)×(2^(3+1)-1) = 16106127360. Another similar format is
0.4.4, excess -4, which expresses integers from 0 up to
2^(2^4-2)×(2^(4+1)-1) = 507904.
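Here is a sketch of that construction (Python; the helper is mine) which enumerates such an unsigned, all-integer 0.E.M format and confirms the two totals just computed:

```python
def allint_values(E, M):
    """Values of a 0.E.M format whose excess is chosen so that the
    denorms coincide with the ordinary binary integers 0..2^M - 1."""
    vals = []
    for e in range(2 ** E):
        for m in range(2 ** M):
            if e == 0:
                vals.append(m)                        # denorms: same bits as integers
            else:
                vals.append((2 ** M + m) << (e - 1))  # normals: hidden leading 1
    return vals

v = allint_values(5, 3)                    # the 0.5.3 format described above
assert v[:16] == list(range(16))           # 0-15 match the plain integer encoding
assert max(v) == 15 * 2 ** 30              # 16106127360, as computed above
assert max(allint_values(4, 4)) == 507904  # and the 0.4.4 figure
```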

In general, using E exponent bits and M mantissa bits, you can
express all integers from 0 to 2^(M+1), and various higher values up
to 2^(2^E-2)×(2^(M+1)-1).

Here is a table presenting most of the smaller entries from the main
table in a somewhat different format, along with the integer-only
formats that bias the exponent so that the smallest denorm is 1.

 s.e.m | excess | range   | comments
-------+--------+---------+----------------------------------------------------------
 1.1.1 |    0   | 1 to 3  | Less range than signed-magnitude integer
 1.2.0 |    0   | 1 to 4  | The smallest format whose range exceeds that of the same
       |        |         | number of bits interpreted as a signed-magnitude integer
 1.2.1 |    0   | 1 to 12 | Best use of 4 bits
 1.2.2 |   -2   | 1 to 28 | Using no infinity values; range is 1 to 24 if the
       |        |         | biggest values are used for infinities

12 :
One source gave 8^31 as the range for the Burroughs B5500.
(I forgot to save my source for this.) I have sources for other
Burroughs systems giving 8^76 as the highest value (and 8^-50 as
the lowest), for a field width of 7 bits. I might have inferred it
from http://www.cs.science.cmu.ac.th/panutson/433.htm which only gives
a field width of 6 bits, and no bias. The Burroughs 5000 manual says
the mantissa is 39 bits, but does not talk about exponent range. Did
some models have a 6-bit exponent field? Since these are the folks who
simplified things by storing all integers as floating-point numbers
with an exponent of 0 [17], I suspect anything is possible.

16 :
http://www.netlib.org/slap/slapqc.tgz
FORTRAN-90
implementation of a linear algebra package, including a file
(mach.f) which curiously begins with a table of machine
floating-point register parameters for lots of old mainframes. See
also the NETLIB D1MACH function,
which gives similar values for many systems. (Formerly at
http://www.csit.fsu.edu/~burkardt/f_src/slap/slap.f90 and
http://interval.louisiana.edu/pub/interval_math/Fortran_90_software/d1i1mach.for
respectively.)

17 :
http://grouper.ieee.org/groups/754/meeting-minutes/02-04-18.html
Includes this brief description of the key design feature of the
Burroughs B5500: "ints and floats with the same value have the same
strings in registers and memory. The octal point at the right, zero
exponent." This shows why the exponent range is quoted as 8^-50 (or
8^-51) to 8^76: the exponent ranged from 8^-63 to 8^63, and
the (for floating-point, always normalized) 13-digit mantissa held any
value from 8^12 up to nearly 8^13, shifting both ends of the range
up by that amount.

20 :
This format would be easy to implement on an 8-bit
microprocessor. It has the sign and exponent in one byte, and a 16-bit
mantissa with an explicit leading 1 bit (if the leading 1 is
hidden/implied, we get twice the range). With only 4-5 decimal digits
it isn't too useful, but it's what you could expect to see on a really
small early home computer.

21 :
http://turing.cs.plymouth.edu/~wjt/Architecture/CS-APP/L05-FloatingPoint.pdf
This lecture presentation (or a variation of it) appears at
clarkson.edu, plymouth.edu, sc.edu, ucar.edu, umd.edu, umn.edu,
utah.edu, utexas.edu and vancouver.wsu.edu. Good discussion of
floating-point representations, subnormals, rounding modes and various
other issues. Pages 14-16 use the 1.4.3 microfloat format as an
example to illustrate in a very concrete way how the subnormals,
normals and NANs are related; pages 17-18 use the even smaller 1.3.2
format to show the range of representable values on a number line.
Make sure to see page 30; this alone is worth the effort of
downloading and viewing the document!

31 :
http://en.wikipedia.org/wiki/Floating_point
Wikipedia,
Floating point (encyclopedia article). While it's possible the
idea of floating-point might have been devised for use in mechanical
calculators, Konrad Zuse had formulated the ideas behind his model Z3
before building the Z1, and the Z3 is generally regarded as the first
generally-programmable computer (more on that topic here).

32 :
http://www.epemag.com/zuse/part3c.htm
Horst Zuse, The Life
and Work of Konrad Zuse. The Zuse Z1 took numeric input from the
operator in decimal form, and then converted it to binary. For
output, binary was converted back to decimal. The input and output
devices both used 5 decimal digits and an exponent ranging from
10^-8 to 10^8. However, the internal representation had 7 binary
digits of exponent, so the range for intermediate calculations was
somewhat larger; perhaps 2^63 or 2^64. The Zuse Z3 was similar, but
had 4 or 5 digits and exponent ranges of -9 to 9 (for input) and -13
to +12 (for output).

35 :
http://www.trs-80.com/trs80-zaps-internals.htm
The TRS-80
passes 4-byte single-precision (and with Level II BASIC, 8-byte
double-precision) values into and out of its ROM routines, and it is
clear that one byte is an exponent. The exponent is often described as
being in excess-128 (or "XS128") format. However, as reported by
reader Ulrich Müller, emulators show that the range is 2^127, and
that the internal representation actually uses excess-129.

36 :
Joe Zbiciak, email correspondence. The TI 99/4A uses radix 100
in an 8-byte storage format. 7 bytes are base-100 mantissa "digits"
(equivalent to 14 decimal digits), and the exponent (a value from -64
to 63) is stored in the 8th byte along with a sign bit. The exponent
is treated as a power of 100. The largest-magnitude values are
±99.999999999999×100^63, and the smallest-magnitude values
(apart from 0) are ±1×100^-64. Precision varies from just over
12 decimal digits to just under 14: for example, π/3 is
01.047197551197×100^0 and 3/π is 0.95492965855137 (represented
as 95.492965855137×100^-1).

42 :
Queen's University, Minifloat.java (source code), August 20,
2003. This is a confirmed source of the term "minifloat" that
pre-dates my usage. The link was
http://www.caslab.queensu.ca/~apsc142i/W2003/lecturenotes/Section_BCDJ/Lecture15/Minifloat.java,
and this link is dead, but it can probably still be viewed at
archive.org here:
Minifloat.java (20030820).

45 :
Industrial Light & Magic (a division of Lucas Digital Ltd.
LLC), s10e5 C++ class, with rounding improvements and
other changes by Robert Munafo. Use is subject to the copyright and
distribution conditions in its header comment. There are two other
files (which I have not used): s10e5-limits.h and
s10e5-function.h.

46 :
John J. G. Savard ("quadibloc"),
Floating-Point Formats,
fills in many of the details for old mainframe computers with nice
color-coded diagrams for each.