In the June 2002 issue of SIAM Review, I reviewed Michael Overton’s 2001 SIAM book, Numerical Computing with IEEE Floating Point Arithmetic: Including One Theorem, One Rule of Thumb, and One Hundred and One Exercises. In this post I reproduce the review and then discuss what has changed in the thirteen years since the book was published.

The Original Review

This very attractively produced hardback book of just over 100 pages describes IEEE standard floating point arithmetic and associated issues such as hardware implementations and language support, together with some basics of numerical analysis. Its main intended audience is computer science and mathematics students, where it can serve as a supplement to more general textbooks, and its motivation (stated in the preface) is that “15 years after its publication, the key ideas of the IEEE standard remain poorly understood by many students and computing professionals.” One could argue that the standard is so well designed that the average user does not need to understand it; many of the weaknesses that plagued earlier floating point arithmetics, such as \(x\)−\(y\) evaluating to zero when \(x\) and \(y\) are different floating point numbers, are not present in IEEE arithmetic, and so the unwary programmer is much less likely to be surprised by the results produced. Nevertheless, an appreciation of the standard is useful for anyone involved in scientific computation. A self-contained, accessible and easy to read treatment dedicated to IEEE arithmetic has been lacking, and this book admirably fills the gap.

The book’s subtitle refers to the fact that the book contains just one theorem—the theorem that bounds the relative error in rounding a real number to floating point form—and a lot of exercises. I suspect that upgrading a few more results to theorems would have helped the student identify the essentials and made the book easier to use for reference. Including solutions to the exercises, where appropriate, would make the book easier to use for self-study.
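The theorem in question says that rounding a real number \(x\) in the normalized range to the nearest floating point number \(fl(x)\) gives \(fl(x) = x(1+\delta)\) with \(|\delta| \le u\), where the unit roundoff \(u\) is \(2^{-53}\) in IEEE double precision. As a quick illustration of my own (not taken from the book), exact rational arithmetic in Python can verify the bound for \(x = 1/10\), which is not representable in binary:

```python
from fractions import Fraction

u = Fraction(1, 2**53)        # unit roundoff for IEEE double precision
x = Fraction(1, 10)           # 1/10 is not exactly representable in binary
fl_x = Fraction(float(x))     # float() rounds x to the nearest double
delta = abs(fl_x - x) / x     # relative error of the rounding
assert 0 < delta <= u         # the theorem's bound holds (and is not vacuous)
```

Because `Fraction` arithmetic is exact, `delta` is the true relative error, not a floating point approximation of it.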

The book is very clearly written and is far more than just a “user-friendly” version of the relevant IEEE standards document. It draws on ideas and observations from many sources, not least from William Kahan, who won the ACM Turing Award for his work on floating point computation, including the standard. Many examples are included, along with pointers to the literature. The webpage for the book includes the bibliography with an appropriate link for most entries, such as an online version of the entry or a publisher’s catalogue.

I learned from the book that what I have always called the “mantissa” of a floating point number should really be referred to as the “significand” (the term mantissa properly applies only to logarithms). The reader who searches here for the term mantissa won’t find it, which could cause some confusion, for example when the book is being used in conjunction with a numerical analysis textbook.
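The significand is easy to inspect in code. As a small aside of my own (not from the book), Python’s `math.frexp` exposes the significand and exponent of a double, from which the 53-bit integer significand \(m\) can be recovered:

```python
import math

x = 0.1
m, e = math.frexp(x)           # x == m * 2**e with 0.5 <= m < 1
significand = int(m * 2**53)   # scale to a 53-bit integer (exact for a double)
assert x == significand * 2.0**(e - 53)   # reconstruct x from m and e
```

Scaling by a power of two is exact in binary floating point, so `significand` is recovered without any rounding error.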

The content is hard to fault, and I can find only a few minor quibbles. The “positional” representation of floating point numbers, \(b_0.b_1b_2\dots b_{t-1} \times 2^E\), is used exclusively throughout. The integer-based representation \(m \times 2^{e-t}\), with \(m \le 2^t - 1\) an integer, which goes back at least to Forsythe and Matula, is not mentioned, but is often easier to work with and might be preferred by some students. Chapter 11 on cancellation could be enhanced by giving some constructive examples of how to avoid damaging cancellation. Finally, a better index (more entries and greater use of subentries) would make it easier to locate specific topics.
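As an example of the kind of constructive illustration I have in mind (my own, not from the book), the quadratic formula suffers damaging cancellation when \(b^2 \gg 4ac\): the smaller root is computed as the difference of two nearly equal quantities. Computing the larger-magnitude root first, with the sign chosen so that no subtraction occurs, and then obtaining the smaller root from the product of the roots \(c/a\) avoids the problem:

```python
import math

def stable_quadratic_roots(a, b, c):
    # Real roots of a*x**2 + b*x + c = 0 (a != 0, b*b >= 4*a*c assumed),
    # avoiding cancellation between -b and the square root.
    d = math.sqrt(b*b - 4*a*c)
    # Pick the sign so -b and d are added, never subtracted:
    # this gives the larger-magnitude root accurately.
    if b >= 0:
        x1 = (-b - d) / (2*a)
    else:
        x1 = (-b + d) / (2*a)
    x2 = c / (a * x1)   # the product of the roots is c/a
    return x1, x2
```

For \(a=1\), \(b=-10^8\), \(c=1\), the naive expression \((-b-d)/(2a)\) for the smaller root loses almost all of its relative accuracy, while the version above delivers both roots to near full precision.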

In summary, I think Overton has achieved his aims and produced an attractive, insightful book that can serve as the first port of call for anyone wanting to understand IEEE floating point arithmetic. I won’t be surprised if the book sells as well to academics and computer professionals as to its primary intended market of students.

Revisiting the Review

A paperback version of the book, including minor corrections, appeared in 2004. The book has been translated into Spanish and a Persian translation by Behnam Hashemi is due to be published by Sharif University Press in 2014.

Overton’s book is about the 1985 IEEE standard. In 2008 an updated standard, IEEE Std 754-2008 (a revision of IEEE Std 754-1985), was published. It added support for decimal arithmetic (the original standard covered only binary; decimal had been treated separately in IEEE Std 854-1987), but the most significant aspects from the numerical analysis point of view are as follows.

Support was added for a fused multiply-add operation (FMA), \(a \times b + c\), which is evaluated “as if with unbounded range and precision, rounding only once to the destination format”. This corresponds to the FMA instruction available on certain chips.
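The difference that single rounding makes can be seen without FMA hardware. In the following sketch of my own, the exact product \(ab\) lies within one rounding of 1, so evaluating the product and the sum with two separate roundings destroys all the information, while rounding the exact value of \(ab + c\) once, as an FMA does, preserves it. Exact rational arithmetic simulates the fused behaviour:

```python
from fractions import Fraction

a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30   # exact product a*b is 1 - 2**-60
c = -1.0

# Two separate roundings: a*b rounds to exactly 1.0, so adding c gives 0.0.
separate = a * b + c

# Simulated FMA: compute a*b + c exactly, then round once to double.
fused = float(Fraction(a) * Fraction(b) + Fraction(c))

# separate == 0.0, while fused == -2.0**-60, the exact answer.
```

Here \(-2^{-60}\) is exactly representable as a double, so the simulated FMA result is in fact exact.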

A 128-bit quadruple precision floating point format was introduced. This reflects the availability of quadruple precision in software, through some compilers. In fact, Overton mentioned that the (then) ongoing revision of the standard was expected to include a 128-bit format and said, further, that

Inevitably, 256-bit floating point will become standard eventually. Equally inevitably, there will be some users for whom this will not be enough, who will use arbitrary precision algorithms.

128-bit floating point arithmetic is not available in hardware on current machines, to my knowledge. If and when it, or an even higher precision, does become available, there will be some interesting questions for numerical analysts about how to exploit it.

What new books dealing with floating point arithmetic have been published since 2001? A second edition of my Accuracy and Stability of Numerical Algorithms, which has a chapter on floating point arithmetic, was published in 2002. The main new entrant is the 572-page Handbook of Floating-Point Arithmetic by Jean-Michel Muller and coauthors (Birkhäuser, 2010).

This is an excellent reference for the subject, covering not just the basics of floating point arithmetic but also topics such as computer language support, evaluation of elementary functions, the table maker’s dilemma, and proofs of correctness of floating point algorithms.

Nick Higham is the Richardson Professor of Applied Mathematics at The University of Manchester.