Heisenbug

The year was 1974. Computers looked like computers, just as
you'd expect from the movies. Below is the maintenance panel of a
Univac 1110 computer, like the one I worked on at the time. Operators
interacted with the computer through a separate console, with a CRT
terminal and page printer, but the maintenance panel provided access
to the low-level circuitry of the computer, allowing hardware
maintenance engineers and systems programmers (like me, who worked on
the operating system at a low level) to examine its operation in
binary, stepping the multimillion-dollar machine
instruction-by-instruction through programs late at night while paying
customers were asleep.

The indicators at the right showed the most basic registers of the
computer such as the program counter and processor state register,
while the knobs at the left allowed selecting a variety of internal
registers to be displayed or modified, with a legend above the binary
display mechanically rotated to label the bits in the selection. This
was a 36-bit word machine, so the registers were of that length.

When the computer was running its normal work load, the maintenance
panel, which was in view of the operator's console, displayed what was
essentially a random pattern of lights: the registers were changing so
rapidly they were averaged out into a blur.

That wouldn't do. I was a systems programmer! This machine was what
we now call a symmetric multiprocessor: it had two central processing
units (CPUs) which shared access to common memory. In single-processor
systems, it was usual that when the system was idle it would simply
while away the time in an infinite loop until interrupted when work
arrived, but this was a poor choice for a multiprocessor: retrieving
the infinite loop instruction over and over from memory would impede
other processors and input/output devices from accessing it. (Today's
computers would keep such an instruction in a CPU cache local to the
processor, but this was the 1970s, when such extravagances were like
flying cars.)

Fortunately, the Univac 1110 had an instruction called “Block
Transfer”, intended to move memory in bulk from one location to
another with a single instruction. Further, this instruction could be
(ab)used to move data from CPU registers to other registers, never
accessing memory at all. By replacing the “here: go to here;”
instruction in the idle loop with a block transfer, an idle CPU would
make no memory accesses at all, and thus not interfere with other CPUs
or input/output operations.

This was cool, and I considered myself a Knight of Efficiency for
implementing it. But then I observed something else. While the block
transfer instruction was executing, the value in the register it was
transferring was displayed in one of the rows of lights on the
maintenance panel, so long as the rotary switch was set to view it.
Now this suggested something even cooler: since I could display any
pattern I wished in the lights, why not do something interesting, like
bits which went zorp-zorp back and forth across the field? This would
only be displayed when the CPU was idle, and the speed at which the
bits moved, when visible, indicated the percentage of idle time, and
hence the inverse of the load on the system. I called it “The
Speedometer”. It was an immediate hit, and other Univac sites
adopted it.
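
The original 1110 assembly code is not reproduced here, and the exact
pattern it displayed is not recorded above. Purely as an illustration
of the idea, the sketch below assumes the simplest version: a single
lit bit sweeping back and forth across a 36-bit word. The names
bouncing_bit and render are invented for this example, and Python
stands in for what was really an idle loop writing CPU registers.

```python
from itertools import islice

WORD_BITS = 36  # the Univac 1110 was a 36-bit word machine


def bouncing_bit(word_bits=WORD_BITS):
    """Yield an endless series of words in which one lit bit sweeps
    from the low end of the word to the high end and back again.
    (Assumed pattern; the real display may have differed.)"""
    position, step = 0, 1
    while True:
        yield 1 << position
        # Reverse direction when the next move would leave the word.
        if position + step < 0 or position + step >= word_bits:
            step = -step
        position += step


def render(word, word_bits=WORD_BITS):
    """Draw a word as a row of lamps, most significant bit at the left."""
    return "".join(
        "*" if (word >> bit) & 1 else "." for bit in reversed(range(word_bits))
    )


if __name__ == "__main__":
    # Show the first few frames of the pattern as rows of lamps.
    for word in islice(bouncing_bit(), 6):
        print(render(word))
```

In this sketch each frame is produced on demand, so a caller that
asks for frames only when it has nothing else to do would see the
bit move faster the more idle it is, which is the property that made
the real display a speedometer.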

But all was not well at its home site. After installation of the
speedometer, the system started to crash more frequently than it had
previously. (Although, at that time, reliability was such that it was
difficult to tell the difference.) One property of these grand 1970s
timesharing systems which has been forgotten by users of personal
computers is that when the music stopped—the system went
down—you could have more than three thousand people furious with
you all at the same time. This was not a good place to be, especially
when your cool hack for the blinky lights seemed to be culpable.

You could, and I did, look at the speedometer code in great detail,
and find there was nothing which could explain the crashes. Further,
none of the other sites which had installed the speedometer were
experiencing these crashes.

What was going on?

It turns out this was a Heisenbug, a problem which manifested itself
depending upon whether it was being observed. The name is derived from
Heisenberg's uncertainty principle in quantum mechanics, according to
which the results of a measurement depend upon which experiment the
observer chooses to make. In this case, in order to observe the
speedometer on the maintenance panel, the rotary switch on the left
side of the panel had to be set to view an internal register in the
CPU which held the value in the block transfer. In order to display
the bits, the circuitry in the panel imposed a load upon this internal
register which, because it contained a marginal component, caused it
to fail at random. When a different register was displayed
(as maintenance personnel did when performing diagnostics on the CPU),
the problem did not occur. Thus the problem was not caused by the
speedometer, but rather the selection of the display which allowed it
to be observed. The weak circuit which failed when its state was
monitored by the maintenance panel display was eventually identified
and replaced. In electrical engineering, a circuit's behaving
differently when observed with a measuring instrument is called a
probe effect, and has been infuriating people for more than a century.
My reputation was rehabilitated until the next outrage.

What can you learn from this? Observation affects what you measure.
Always blame the systems programmer. Sometimes it really is
the hardware. The 1970s were fully as awful as you've imagined them to
be.