JMike’s Favorite Bug

I found this bug while working as one of the first QA engineers at the MathWorks (makers of MATLAB and Simulink technical computing software) in the early 1990s.

ANSI standard 754 mandates rounding and such behavior for real finite-precision floating-point arithmetic. But there is no standard — at least I can say for sure there was no standard in the early 1990s — for extending those rules to complex arithmetic. So I did a fair amount of exploratory testing of MATLAB’s complex arithmetic details on various operating systems, mathematical libraries, and machine architectures.

One basic mathematical rule is that a number times its complex conjugate is always real (has no imaginary part). So I wrote a little test that multiplied a bunch of numbers against their complex conjugates, and looked at the results. Nearly all platforms performed as expected. But the IBM RS-6000 would give Funny Answers: often there would be a teeny tiny little imaginary part, on the order of ten-to-the-minus-sixteen-or-so times the real part.

I showed the result to MathWorks chief scientist Cleve Moler, who speculated that it was a problem in how C’s mathematical operations were compiled down to the RS-6000’s instruction set. He sent the result off to the appropriate people at IBM, who quickly determined that that was true.

On most platforms, the part of the computation to determine the result’s imaginary part compiled down to the following:

Multiply the first number’s real part by the second number’s imaginary part and store the result in standard 64-bit floating-point format.

Multiply the second number’s real part by the first number’s imaginary part and store the result in another 64-bit register.

Add the results of the two registers and store the result back in one of them.

But on the IBM RS-6000, the operation was fancier and exposed that they were carrying internal results around in longer (I believe 80-bit) registers:

Multiply the first number’s real part by the second number’s imaginary part, to 80-bit precision, and store the result in a 64-bit register.

In one shot, using the “multiply with add” instruction: multiply the second number’s real part by the first number’s imaginary part, to 80-bit precision, and add in the result from the register (which has been rounded to 64 bits), and return the answer.

See the problem there: the second half of the operation carried more bits after the decimal place than the first half did, so when you add them together, those low-order bits would be left, which would then come out as the imaginary part of the answer.

IBM fixed this issue quickly without a fuss.

This is my favorite bug because it demonstrated to me (and to others at the MathWorks) the value of hunting for bugs, it launched me into a long period of curiosity about finite-precision arithmetic (leading eventually to an advanced degree in numerical analysis), and, in retrospect, showed how a company could fix a bug that had significant impact on scientific computation without the widespread hue and cry that resulted from the classic “Pentium Bug” that dogged Intel a few years later.