S/370 Compatible Memory: An Unusual Bug

Discussed here is a bug which affected the IBM System/370 Models 158AP
and 158MP. These configurations had two processors.
(The bug is not relevant to the single-processor 158UP.)

The complex interacting bug described here is fascinating
because it involves all of the following:

A ``precarious'' IBM hardware design choice,

``Precarious'' plug-compatible hardware design,

An IBM operating system upgrade, and

A bug in an IBM operating system diagnostic patch.

The bug did not manifest itself until IBM released the new
operating systems, MVS/SE and VM370/SE, in late 1977, although these
systems ``did nothing wrong.''
When the bug first surfaced, some called it the ``compare and
swap'' bug, but this was a gross misnomer since compare-swap
was the one instruction which could not provoke the bug.

Background

During the late 1970's I was frequently involved in troubleshooting
IBM mainframes, or rather ``plug-compatible'' main memory attachments
for those mainframes.
I encountered a variety of problems:

design errors by AMS/Intersil or National Semiconductor
(the ``plug-compatible'' manufacturers I worked for),

design errors by IBM (there were various reasons these might manifest
themselves during our installation),

manufacturing defects (a machine in Lyons, France had a totally
failed ``pull-down'' transistor which nevertheless caused system
failure only a few times per week; I enhanced my
troubleshooting reputation
when I created a ``scope loop'' which provoked the failure millions of times per minute),
and

wires broken or misplaced during installation.

Often the misplaced wire was more challenging to diagnose than a
design error, but the most interesting bug of all involved design errors.
I'll tell that story in great detail, going on tangents which,
though irrelevant to ``The Unusual Bug (tm),'' are instructive
in their own right.

Something that made engineering IBM ``plug-compatible''
systems particularly challenging was that modifications were made to
essentially ``hostile'' circuits. Thus a problem which IBM might
solve with a simple redesign of a CPU circuit board could not
easily be solved by us the same way -- we would then need to redesign
the entire IBM circuit, an expensive and time-consuming undertaking.
Instead we would try to design a parallel circuit which could
``wire-or'' to and override the IBM logic. When no way could be
found to achieve the desired logic with a ``wired-or,'' we might
use an X-Acto knife to cut an IBM printed-circuit trace,
though many project managers refused to countenance that.

Refresh Holdoff

IBM mainframes used static memory, but the Brand X companies
saved cost with cheap high-density dynamic memories.
These memories needed to be refreshed periodically;
if the refresh cycles could not be hidden during system
operation, the CPU would have to be held off
(stopped or prevented from accessing memory) while the
dynamic memory did a forced refresh.
AMS/Intersil used hidden refreshes, so the dynamic memory had
only slight degradation to system performance.
NatSemi, on the other hand, used only forced refresh on some
models and needed to signal the CPU to hold off once every 16 microseconds.

Eventually, holdoff circuitry was designed for the IBM 370
models 138, 148, 158 and 168 (and later for 3031, 3032, 3033 --
all relatives of the 168).
The 148 design was more challenging
and interesting than that of the 138, and the 158 design was more
challenging still -- the 158 was a complicated beast,
with non-obvious issues. Perhaps paradoxically, the top-of-the-line
168 was the easiest of all to modify -- because of its large number
of components, IBM had given its hardware a
very careful modular structure, so it was easy to manipulate.

370/158 Buffer and Backing Store

The larger IBM models are complicated by a memory hierarchy:
recently fetched data is saved in a high-speed buffer (cache)
to avoid redundant fetches to the slower backing store;
when the backing store is written the high-speed buffer
must be checked for that address, and the buffer entry also rewritten,
or at least invalidated.
(Even less expensive models like the 148 may have this issue,
although the ``buffer'' may just be a few bytes representing
the next instruction(s) to be executed.)

When the 158AP or 158MP performs a memory write, three
actions have to happen:

The local buffer is checked for that address; if present the
old data is overwritten with the new data.

The remote buffer is checked for that address; if present the
old data is invalidated.

The backing store is written.

(The remote buffer is invalidated, not overwritten.
Although the CPU has a data bus to the opposite backing storage,
backing and buffer store operate independently
and do not share the same input data bus.)

The three actions are initiated concurrently, but only the first
two need take place quickly. The backing store write may be delayed
by a previous write from this CPU which has not yet completed,
or by any operation from the other CPU if that is accepted first.
It can be further delayed when both local write and remote access
are pending. On machines with non-IBM add-on ``dynamic'' memory,
access can be delayed still further by a forced
refresh cycle.
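
For concreteness, here is a toy C model of that write path as I understand
it -- the structure and names are my own invention for illustration, not
IBM's design:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Toy model of one shared location as seen by the two CPUs.     */
    /* All names here are illustrative; they are not IBM's.          */
    struct loc {
        bool     buf_a_valid, buf_b_valid;   /* buffer entries present? */
        uint32_t buf_a, buf_b;               /* buffered copies         */
        uint32_t backing;                    /* backing-store copy      */
        bool     write_pending;              /* backing write delayed   */
        uint32_t pending_data;
    };

    /* CPU A stores new_data: actions 1 and 2 are fast; action 3 may lag. */
    void cpu_a_write(struct loc *m, uint32_t new_data, bool delayed)
    {
        if (m->buf_a_valid) m->buf_a = new_data;  /* 1. overwrite local buffer */
        m->buf_b_valid = false;                   /* 2. invalidate remote copy */
        if (!delayed)                             /* 3. write backing store... */
            m->backing = new_data;
        else {                                    /* ...unless held off, e.g.  */
            m->write_pending = true;              /*    by a forced refresh    */
            m->pending_data  = new_data;
        }
    }

    /* CPU B fetches: an invalid buffer entry is refilled from backing store. */
    uint32_t cpu_b_fetch(struct loc *m)
    {
        if (!m->buf_b_valid) { m->buf_b = m->backing; m->buf_b_valid = true; }
        return m->buf_b;
    }

    void backing_write_completes(struct loc *m)
    {
        if (m->write_pending) { m->backing = m->pending_data; m->write_pending = false; }
    }

    int main(void)
    {
        struct loc m = { true, true, 0, 0, 0, false, 0 };
        cpu_a_write(&m, 1, true);        /* A stores 1; backing write held off */
        uint32_t seen = cpu_b_fetch(&m); /* B refills its buffer from backing  */
        backing_write_completes(&m);     /* the delayed write finally lands    */
        printf("B sees %u although backing store now holds %u\n", seen, m.backing);
        return 0;
    }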

Asynchronous Control Signal

The 158 AP/MP systems we discuss have some bug-like features
which are not related to the Unusual Bug that developed,
but I mention them just so this page can be a compendium of
interesting design issues.

For example, although the two CPUs in these systems operate at identical
8.69565 MHz clock speeds, each CPU has its own independent
oscillator, so control signals from the remote CPU occur at
completely arbitrary times relative to the local oscillator.
A single clocked latch cannot reliably synchronize such a signal
to the local clock; this is because the asynchronous
signal might switch at exactly the same time as the clocking
edge -- in that case the latch output may be poorly behaved.
(In some logic families it might oscillate. With the ECL circuits
used in IBM mainframes in the 1970's, a more likely failure mode
would be for the signal to remain near threshold level for a while,
neither at a high (`true') nor low (`false') level.)
Such a latch is said to enter a ``metastable'' state.

The output of the metastable latch can be clocked into a second
latch, say 45 nanoseconds after the first latch is clocked.
This is likely to produce a clean signal, but is still not guaranteed:
the faulty level output by the first (metastable) latch may put
the second latch also into a metastable mode (``Garbage in, garbage out'').

The usual solution is to provide yet more clocked latching in series:
the more such latches, the more unlikely failure becomes.
In the circuit diagram that follows, `-RMT SELECT' is seen to pass through three
such synchronizing latches, but my memory is hazy --
in the IBM circuit there might
have been four latches in this series.
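
To get a feel for why each extra latch helps so much, here is a rough C
calculation using the standard exponential settling model; the device
constants are invented for illustration (only the 45-nanosecond spacing
and the 8.69565 MHz clock come from this page):

    #include <math.h>
    #include <stdio.h>

    /* Back-of-envelope synchronizer reliability, using the usual model
     *   MTBF = exp(t_resolve / tau) / (T0 * f_clk * f_data).
     * tau and T0 below are made-up illustrative values, not measured
     * figures for IBM's logic family.                                  */
    int main(void)
    {
        const double tau    = 2.0e-9;     /* metastability decay constant (assumed)     */
        const double T0     = 0.5e-9;     /* width of the vulnerable window (assumed)    */
        const double f_clk  = 8.69565e6;  /* 158 clock rate                              */
        const double f_data = 1.0e6;      /* rate of asynchronous select edges (assumed) */
        const double t_r    = 45.0e-9;    /* settling time added per extra latch         */

        for (int latches = 1; latches <= 4; latches++) {
            double mtbf = exp((latches - 1) * t_r / tau) / (T0 * f_clk * f_data);
            printf("%d latch(es) in series: MTBF ~ %.3g seconds\n", latches, mtbf);
        }
        return 0;
    }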

(An article in an engineering magazine about this problem
had an interesting table showing, for different logic families,
the probability a latch output could still be metastable X nanoseconds
after the clock. I remember CMOS scored well, and ECL scored poorly,
but the surprising thing was that popular 74S-series TTL parts
scored all over the chart:
those manufactured by Texas Instruments outperformed
CMOS, while parts from National Semi with ``identical form, fit and function''
to the T.I. parts were the worst in the chart.)

370/158 - FET CTRL BZ

We begin our look at details by depicting the circuitry which
initiates a backing store cycle on the 370 Model 158.
This is from my 30-year-old memory, but I think it is
substantially correct.

The relevant nets are labeled with the names as they appeared in the IBM logic
diagrams: ``- Local Select'' and ``+Remote Select'' are the
signals from the two respective CPUs which request
access to a backing-store memory cabinet.

``Fet Control Busy'' is active when a backing store
memory cycle is in progress (the 158 backing store used Field Effect
(MOS) Transistors, hence the ``FET'' mnemonic); ``Remote Cycle''
indicates to which CPU's request the backing store is responding,
and ``End Reset'' is a pulse generated a fixed delay after the
falling edge of ``- FET CTRL BZ.''

Ignore the peculiar part of the circuitry in green;
it isn't part of the IBM circuit and is discussed in the next section.

Special System/370 op-codes (Compare-Swap and Test-Set)
allow one CPU to deny access to the other; this is
indicated in the diagram (``cs/ts'') but without details.

Note that Remote Select has priority over Local Select,
but with contemporaneous selects, Local goes first because Remote
is delayed by the synchronization latches discussed in the previous section.
The ``+RMT CYCLE'' signal must not change states
while ``-FET CTRL BZ'' is active; note that the logic
ensures this.
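
My reading of that behavior, reduced to a toy C routine (a paraphrase of
the logic, not IBM's circuit):

    #include <stdbool.h>

    /* Toy paraphrase of the select logic: a cycle starts only when the
     * FET-Control-Busy latch is clear, and +RMT CYCLE is (re)latched only
     * at that moment -- never while a cycle is in progress.  Remote wins
     * a tie here, but a truly simultaneous remote select normally loses
     * because it arrives late through the synchronizing latches.         */
    struct fet_ctrl { bool busy; bool remote_cycle; };

    void select_edge(struct fet_ctrl *c, bool local_sel, bool remote_sel_synced)
    {
        if (!c->busy && (local_sel || remote_sel_synced)) {
            c->remote_cycle = remote_sel_synced;   /* latched only when idle    */
            c->busy = true;                        /* begin backing-store cycle */
        }
        /* while busy, nothing may touch remote_cycle */
    }

    void end_reset(struct fet_ctrl *c) { c->busy = false; }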

The ``Fet Control Busy'' signal exists in two polarities;
the minus-active is connected, unbuffered, to the backplane (``mother-board'')
so it can be used by other circuit boards.
The diagram does not depict those details, though this
arrangement happens to be quite fortuitous, as implied in the next section.

Refresh Holdoff on the 370/158

One engineer (John Bizjak) designed a circuit to hold off the 158 from
accessing its backing store, and supplied it to several manufacturers.

The relevant circuit is so simple that it may seem amazing
it does the job: delaying the memory accesses of a million-dollar
computer, and letting them resume right on schedule.
``Want Holdoff'' requests a forced refresh, and ``Holdoff''
acknowledges that it may begin.
``Refresh complete'' terminates the holdoff; it must be synchronized to the
appropriate 158 clock phase.

I've shown the add-on circuits in a different color from the
IBM circuits in the previous diagram, but ``-FET CTRL BZ'' is
the IBM net. It is both an input and an output on an AND-gate:
this is called a one-gate latch.
Since the IBM 158 computer used NPN emitter-coupled logic,
the output of this circuit is OR'ed (``wire-or'ed'') with the
output of IBM's ``-FET CTRL BZ'' circuit. While the one-gate
latch is set, a backing-store cycle cannot be initiated,
but the latch will not set if such a cycle is already in
progress.

With the Bizjak circuit added on, IBM's ``FET CTRL BZ'' latch will
still set as usual but, when the add-on memory is doing forced
refresh, the backing-store cycle will be delayed.

Note a curious feature of the IBM circuit when this add-on is installed:
During Holdoff, the Remote Cycle latch will be allowed to
change from local to remote selection
even though a Local select has previously set
the Fet-Control-Busy latch!
(This leads directly to the Unusual Bug
which is the subject of this web page.)

Wire-Or over Distance

Although it did not cause trouble here,
I will mention an unusual logic hazard that can arise in a wire-or
like that used in the Bizjak circuit.

The Bizjak circuit is installed in an empty card slot
in IBM's card cage. Because there are no empty slots near
the IBM circuit board that generates ``-FET CTRL BZ'',
the add-on ``one-gate latch'' will be at least 60 centimeters
away (``as the electron flies'') from IBM's latch; in other words
the drivers that are wire-or'ed together are two feet apart.

When both inputs to an OR-gate are active (high) and one goes
low, you expect the output to remain high, but that is not what
happens here! The add-on circuitry used different power-supply
voltages and had a slightly lower high level, so the IBM
driver would supply all the current to the wired-or net (See Note).
When that transistor turns off, its emitter voltage will
immediately begin dropping, pulled down by a nearby (ca 390 ohm)
resistor. This falling voltage cannot be reversed until
it is ``seen'' by the other drive transistor, which will then
turn on, and until that turn-on is ``seen'' back at the first source.
In other words there is a delay of eight
nanoseconds minimum before the falling voltage can be reversed!
(Light travels about one foot per nanosecond in vacuum, but electric
signals move at only about half that speed in copper, and the signal
has a round trip of 4 feet through the 2-foot wire.)
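
The arithmetic, spelled out (the two-nanosecond-per-foot figure is the one
used above):

    #include <stdio.h>

    /* Minimum time before the drooping wired-or net can be pulled back up:
     * the falling edge must reach the far driver and that driver's response
     * must travel back, at roughly 2 ns per foot in copper.                 */
    int main(void)
    {
        const double ns_per_foot = 2.0;
        for (int feet = 1; feet <= 3; feet++)
            printf("%d-foot wire: about %.0f ns of uncorrectable droop\n",
                   feet, 2.0 * feet * ns_per_foot);
        return 0;
    }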

Placing an oscilloscope on IBM's Fet-Control-Busy net,
one did indeed see a clear negative-going glitch whenever the
CPU attempted to initiate a backing-store cycle during a
forced refresh. However, the glitchy excursion only went barely below
the threshold level and was not quite
enough to cause trouble. Installers took care to use a
short wire and keep the Fet-Control-Busy add-on connection to two feet:
the system would fail when a 3-foot wire was substituted!

Note: The industry standard ECL 10K family
used a 5.2 volt power supply differential and was not fully
compatible with IBM's MST logic family and its
three voltages -- +1.25, 0, and 3.0 volts.
(Like other add-on designs, the Bizjak circuit used
an add-on -4.0 volt power supply to achieve a 5.25 volt differential.)
However, because of minor device variations,
the described logic hazard would exist even if the two
drivers had been compatible.

Everything OK at First

The add-on manufacturers began attaching dynamic memory
to the IBM 370 Model 158 in 1975 and,
despite the peculiarities mentioned, things were OK at first,
even on the 158MP and 158AP systems.

It may be somewhat uncommon for one CPU to be caching data from
a memory section into which the other CPU is writing,
so the software might have suffered no ill effects even if the
buffer invalidation mechanism were defective!

When both CPUs are executing kernel code, they may each have copies
of shared variables in their caches, but the kernel will often
use Test-Set or Compare-Swap when updating such a shared variable.
On the 158, these instructions are handled with special microcode.
The sequence begins with Read-and-Lock, which bypasses the buffer,
sets a latch to prevent the other CPU from accessing the backing
store, and reads backing store.
If a Compare operation leads to a Write, that Write will invalidate
the remote buffer entry, update backing store, and only then
reset the Lock latch. Caches and backing store
are thus kept consistent.
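
In modern terms the software side of this is just a compare-and-swap retry
loop; here is a sketch using C11 atomics as a stand-in for the 370's CS
instruction (an analogy, not the actual kernel code):

    #include <stdatomic.h>
    #include <stdint.h>

    /* CS-style update of a shared counter: fetch, compute, and store only if
     * no other CPU has written the word in the meantime; otherwise retry.   */
    void locked_increment(_Atomic uint32_t *counter)
    {
        uint32_t old = atomic_load(counter);
        while (!atomic_compare_exchange_weak(counter, &old, old + 1)) {
            /* 'old' has been reloaded with the other CPU's value; retry */
        }
    }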

CPUs are not required to use the Read-and-Lock/Compare/Write
mechanism on all updates to shared variables, however, and as we will
see the caches could become inconsistent.
The IBM design which allowed this might be called ``precarious''
but could not fairly be described as ``buggy'' -- failures
arose only when non-IBM (``Brand X'') memories were attached.

Buffer Inconsistency

IBM began shipping new operating systems (MVS/SE and VM370/SE) shortly
before I went to work for NatSemi. The 158 was regarded as a ``done deal''
so I was asked to design circuitry for the 3031, 3032 and 3033
attachments. But soon strange reports began drifting in
about 158AP and 158MP systems.

The symptoms varied from site to site. In several cases,
the only problem was that a terminated job survived in a ``zombie''
status: the operators could ``kill'' that job explicitly and
continue normally. At other installations, CPU B (the ``AP side'')
would go into a tight infinite loop.
Those customers had the choice of continuing
with half their CPU power out of action, or doing a re-IPL
(and then hanging again, randomly, after several hours).
At still other installations, the operating system would crash completely.

On one particular machine, a kernel variable (SRBCNT)
would go negative. Since this variable starts at zero,
increments when an SRB begins, and decrements when it exits, it should
never become negative ... but it did. The problem began as an inconsistency
between the two buffers of the two CPUs, but it could be quickly
disguised: one CPU fetches its inconsistent value, updates it,
and writes it to backing store. The buffered and backing-store
data are all then identical -- but wrong.

IBM's diagnostic strategy may have been simplified once they
noticed that all the failing machines had NatSemi add-on
memory installed!

IBM's top software specialists developed special diagnostic patches
to detect the buffer inconsistencies before they ``healed''
themselves.
Because the Compare-Swap instruction fetches from the backing store
on the Model 158,
software can detect buffer inconsistency by simply doing an
ordinary fetch and a Compare-Swap fetch and comparing the results.

The telephone communication that develops among engineers in such situations
is rather like a game of ``gossip.''
IBM had little incentive to put its best engineers in contact with NatSemi,
and NatSemi let managerial types communicate at its end.
For some reason, all these people were calling the syndrome
the ``Compare-and-Swap'' problem.
I had a hunch what the real problem was, but knew it couldn't be
Compare-Swap. When I finally got an IBM engineer on the phone,
I asked him what instruction was used to write SRBCNT.
He thought it was CS (Compare-Swap) but I asked him to please check
the source code.
When he came back on the line with the answer,
he seemed surprised: STH (Store Halfword).

I have no knowledge of IBM's operating systems -- my programs
were standalone diagnostics -- but I suspect that MVS/SE was more aggressive
than its predecessor in allowing both CPUs to execute dispatcher
code in the kernel concurrently.
Its designers took care to keep the data consistent, but
this needn't require that every
write instruction be ``locked'' (use a CS or TS opcode): a single
locked write can suffice to grant one processor exclusive, software-controlled
write authority.
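
My guess at the kind of pattern involved, again in C11-atomics shorthand
(the names are invented, and the MVS code certainly did not look like this):
a single CS-style operation takes the kernel lock, after which the protected
counters are updated with ordinary stores such as STH.

    #include <stdatomic.h>
    #include <stdint.h>

    static _Atomic uint32_t kernel_lock = 0;   /* 0 = free, 1 = held            */
    static volatile int16_t srbcnt      = 0;   /* like SRBCNT, updated with STH */

    void dispatch_srb(void)
    {
        uint32_t expected = 0;
        while (!atomic_compare_exchange_weak(&kernel_lock, &expected, 1))
            expected = 0;                      /* spin until the lock is free   */

        srbcnt = (int16_t)(srbcnt + 1);        /* plain halfword store: correct
                                                  software, but it now relies on
                                                  the hardware to keep the other
                                                  CPU's buffer consistent       */

        atomic_store(&kernel_lock, 0);         /* release the lock              */
    }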

A Detailed Look at the Faulty SRBCNT

The following table shows what I think was happening to SRBCNT.
Note that it wasn't necessary that CPU B read SRBCNT while CPU A held
the kernel lock; the bug is provoked by a read of any datum (XYZ) in
the same quadword (the 158 memory operates in 16-byte chunks).
The table is not ``to scale''! The actual STH R5,SRBCNT
instruction is broken into parts to show cache and backing-store changes
separately.

 Kernel |         CPU actions         |           Value of SRBCNT           |
  Lock  |    CPU A     |    CPU B     | Buffer A | Buffer B | Backing store | Comment
--------+--------------+--------------+----------+----------+---------------+---------------------------------------
   -    | -            | -            |    0     |    0     |       0       |
   B    | -            | dispatch SRB |    0     |    0     |       0       | B issues an SRB,
   B    | -            | (begin STH)  |   Inv    |    1     |       0       | updating SRBCNT
   B    | -            | (end STH)    |   Inv    |    1     |       1       | normally
   -    | -            | -            |   (1)    |    1     |       1       |
   -    | -            | -            |    1     |    1     |       1       | (time passes, A's cache gets reloaded)
   -    | -            | -            |    1     |    1     |       1       |
   A    | dispatch SRB | -            |    1     |    1     |       1       | A issues an SRB,
   A    | (begin STH)  | -            |    2     |   Inv    |       1       | but before the Store-Halfword
   A    |         ( ( holdoff for forced refresh in A's memory ) )          | makes it to backing store
   A    | -            | Read XYZ     |    2     |    1     |       1       | an unrelated fetch fills
   A    | (end STH)    | -            |    2     |    1     |       2       | B's cache with obsolete data
   -    | -            | -            |    2     |    1     |       2       |
   -    | -            | -            |    2     |    1     |       2       | (time passes)
   -    | -            | -            |    2     |    1     |       2       |
   B    | -            | SRB exit     |    2     |    1     |       2       | B needs to decrement SRBCNT;
   B    | -            | (begin STH)  |   Inv    |    0     |       2       | this time the hardware
   B    | -            | (end STH)    |   Inv    |    0     |       0       | operation proceeds correctly
   -    | -            | -            |   Inv    |    0     |       0       | but the resultant value is wrong
   -    | -            | -            |   (0)    |    0     |       0       |
   -    | -            | -            |    0     |    0     |       0       | (time passes, A's cache gets reloaded)
   -    | -            | -            |    0     |    0     |       0       |
   A    | SRB exit     | -            |    0     |    0     |       0       | Now A decrements SRBCNT;
   A    | (begin STH)  | -            |    -1    |   Inv    |       0       | again hardware action is correct
   A    | (end STH)    | -            |    -1    |   Inv    |      -1       | but an ``impossible'' negative results
   -    | -            | -            |    -1    |   (-1)   |      -1       |

   ( ( with an erroneous SRBCNT the MVS system does not behave properly ) )

Three-Way Coincidence

The problem was caused by a coincidence of three independent events:

The STH R5,SRBCNT issued by CPU A,

a fetch by CPU B to a datum (XYZ) in the same quadword as SRBCNT, and

a forced refresh of the NatSemi memory.

The normal sequence of events would be:

CPU A invalidates B's buffer

CPU A writes SRBCNT to backing store

CPU B fetches XYZ from backing store

With the forced refresh, the operations may be reordered:

CPU A invalidates B's buffer

CPU A's backing store is preempted for forced-refresh

CPU B fetches XYZ from backing store (obtaining the obsolete SRBCNT)

CPU A writes SRBCNT to backing store

(It may seem wrong to invalidate B's buffer before
updating the backing store, but unmodified IBM machines did not fail.)

IBM turns the problem over to NatSemi

IBM modified the kernels on each failing machine, not to fix
the problem, but to detect it at the earliest stage.
Since machines had different symptoms, the patches were different,
but usually checked explicitly for a mismatch between cache and
backing store.

In those days, IBM operated under a consent decree ordered by
a federal judge, which held that IBM was a monopoly and required it to cooperate
to some degree with competing manufacturers and maintenance services.
However, although IBM had figured out the details of the problem,
it did not feel obliged to educate NatSemi. (It did submit
a large bill for the many hours IBM had spent troubleshooting the
problem.)

(Eventually, other manufacturers of dynamic add-on memory
for the 158 had to address this problem.
NatSemi had the first opportunity since, lacking any mechanism
to hide refresh, they were holding off the CPU once every 16 microseconds
and thus provoked a high failure rate.)

It fell to NatSemi to rediagnose and solve the problem.
Although a customer machine might only fail once a day or so,
I wrote a simple program which would cause and detect a buffer
mismatch after a few seconds. This enabled us to test design fixes,
and we developed circuitry both to hide some refreshes and to
preserve the Local/Remote ordering during forced refreshes.
The fixes were installed on customer machines, and all began acting correctly.

Or should I say, all but one ...

A Buggy Bug-detector in Columbus, Ohio

The electric utility in Columbus, Ohio had an IBM 158AP with
NatSemi memory attached. Their machine had suffered problems
and IBM patched the kernel to detect buffer inconsistencies.
NatSemi installed its fixes, but the kernel patch continued
to report errors.

The 158AP has backing store only on A's side so, as explained above,
the Bug can lead to a mismatch between cache and backing store
only on B's side. However, the mismatches now reported by
the modified kernel were occurring on the A side.
It seemed to me that the defect was probably in the ad-hoc kernel patch
rather than the hardware, but the customer in Columbus refused to
revert to the virgin operating system.

I examined the kernel patch. It was in 370 Assembler language, of course,
but was equivalent to the following C code:
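
(The original listing is long gone; the sketch below reconstructs its logic
from memory. backstore_fetch() and shared_variable are the names used in the
discussion that follows; report_inconsistency() merely stands for whatever
error action the patch actually took.)

    extern volatile int shared_variable;            /* the kernel word under test */
    extern int  backstore_fetch(volatile int *p);   /* fetch via the CS opcode    */
    extern void report_inconsistency(void);

    int check_buffers(void)
    {
        int cached, backing;

        cached  = shared_variable;                   /* ordinary fetch, via the buffer */
                                                     /*            <-- Window 1        */
        backing = backstore_fetch(&shared_variable); /* fetch from backing store       */
        if (cached == backing)
            return 0;                                /* buffers agree                  */

        /* A mismatch might just mean the other CPU stored a new value during
           Window 1, so look once more before declaring an error.              */
        cached  = shared_variable;                   /* buffer was invalidated, so     */
                                                     /* this reloads it                */
                                                     /*            <-- Window 2        */
        backing = backstore_fetch(&shared_variable);
        if (cached == backing)
            return 0;

        report_inconsistency();                      /* ``cache inconsistency!''       */
        return 1;
    }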

(As explained earlier, backstore_fetch() is accomplished
with the CS opcode.)

The bug-detection patch does two comparisons, to allow
for the possibility that the other CPU will update shared_variable
exactly at the point marked ``Window 1.''

Is there a bug in this bug-detector?
Could the other CPU update shared_variable both
during ``Window 1'' and during ``Window 2''?
Perhaps we'll never know for sure now,
but I assume that was happening.

NatSemi was ordered to remove its memory attachment,
but the kernel patch continued to report ``cache inconsistencies''
even when there was no dynamic memory installed.

The bug in the bug-detector at Columbus, Ohio should teach us
a healthy respect for Murphy's
Law as it applies to multi-processor races.

I hoped to describe an interesting bug from the past, but I'm
afraid it ended up long-winded and boring.
Please e-mail me and tell me what you thought.