Article - Jan 25, 1999.

World-Stop Debuggers

In the 1970s, the Xerox Office Systems Division produced an
operating system called "Pilot" with a world-stop debugger [Redell80]
called "CoPilot". When you hit a breakpoint or pressed a reset button
on the computer, the entire virtual memory image would be written to
disk, and the CoPilot debugger would be restored from a previously
unmounted superior disk volume. In essence, the "World-Stop" Debugger
controlled everything about the computer system from an
unmounted disk volume, which served as a firewall. In this system the
OS was just an application like any other. Using this idea, you
worked with three volumes: your main volume, where you did your work;
an inferior volume, where you debugged new code; and a superior
volume, in case your OS crashed.

This is an amazing idea. It provides a strobe light for looking at
software. You may set breakpoints and debug the debugger,
running on a subordinate disk volume, using the debugger on a superior
volume. You may debug or rewrite the file system with no worries
about corrupting your execution environment. You may write a
bootable new operating system (within reason) using the old debugger
for the entire project. You may step every line of code in the
system, including every line of every interrupt routine. We still
can't do these things with 99% of all the debuggers available today,
25 years later.

The only code that was untouchable was the debugger "nub" - a
piece of code that performed the world swap, and talked to the network
(in the case of remote debugging). Because the CPU microcode was
stored and swapped with the volume, you could even debug microcode or
run INTERLISP on a remote volume!

Time-Stop Debuggers

In the implementation of real-time software, the idea of a
time-stop debugger is just as appealing as the world-stop debugger.
However, there are a number of problems to be resolved. When time is
frozen, important system services (e.g., timers, time-sliced tasking,
time-triggered applications, watchdogs) do not work. Therefore, to
implement a time-stop debugger, some sort of world-stop debugger is
required. When CoPilot was written in 1979, a line of code could be
stepped in about 5-10 seconds. By 1986, main memory was 8 times bigger,
and it took sixty seconds to swap in the debugger, and sixty seconds
to swap out the debugger. The debugger turned into a failure because
its benefits didn't justify the 2-minute cost of stepping a line of
code.

This is amusing, because Xerox generally practiced a form of "future
prediction" research, where you predict the hardware that will be
widespread in 10 years, build it now, and get going on that software
to take advantage of the hardware. Clearly, they were right about how
hardware would get better from 1978-1985, but they misjudged the size
of the software binaries by a mile!

Eventually, the Xerox OSD programmers modified the system to allow
debugging in your own environment, increasing productivity. However,
self-corruption was possible, and many types of code (e.g., certain
interrupts and critical debugger tasks) could not be stepped. They
could still be debugged from the superior volume, though.

For effective time-stop debugging, you must be able to freeze time
for some portion of the system, while time proceeds for the rest of
the system (i.e. for the debugger itself). Thus, a time-firewall is
necessary to allow part of the system to proceed in time, while the
rest of the system is frozen in time.
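One way to picture such a time-firewall is a virtual timebase: tasks on the frozen side read a virtual clock that stops advancing while the debugger has them halted, while the debugger itself keeps using the real hardware counter. The sketch below is purely illustrative; the counter, variable, and function names are invented here, not part of CoPilot or any particular system.

```c
#include <stdint.h>
#include <stdbool.h>

/* Illustrative virtual timebase for a "time-firewall".  The frozen
 * side of the system calls virtual_now(); the debugger side reads
 * the real counter directly.  hw_ticks stands in for a free-running
 * hardware counter. */

static uint64_t hw_ticks;        /* stand-in for the hardware counter */
static uint64_t frozen_at;       /* counter value when freeze began   */
static uint64_t frozen_total;    /* total ticks spent frozen so far   */
static bool     frozen;

static uint64_t read_hw_counter(void) { return hw_ticks; }

void timebase_freeze(void) {
    if (!frozen) {
        frozen = true;
        frozen_at = read_hw_counter();
    }
}

void timebase_thaw(void) {
    if (frozen) {
        frozen = false;
        frozen_total += read_hw_counter() - frozen_at;
    }
}

/* Virtual time = real time minus all time spent frozen, so timers
 * derived from it never observe the gap introduced by a debug stop. */
uint64_t virtual_now(void) {
    uint64_t base = frozen ? frozen_at : read_hw_counter();
    return base - frozen_total;
}
```

The key property is that freeze/thaw pairs are invisible to the frozen side: after a thaw, virtual time resumes exactly where it stopped.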

These days most complex real-time systems involve many CPUs
(Iridium has 7 onboard PowerPC 603 CPUs). A Globalstar transceiver
subsystem (forward link or reverse link) has at least 20 VME cards.
Without a facility to stop all the CPUs at once, debugging in
such a distributed system would probably be useless.

To simplify the implementation of a time-stop debugger, the
hardware should be restricted to a single timing device. The clock
fed to this device could be enabled or disabled via software.
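As a sketch of what that software gate might look like: a single read-modify-write of a clock-gate register freezes every timer, time slice, and watchdog derived from the one timing device, all at once. The register address, bit layout, and names below are assumptions for illustration, not any real board's interface; a real system would take them from the hardware manual, and the register would be a fixed memory-mapped address rather than a settable pointer.

```c
#include <stdint.h>

#define CLKGATE_TIMER (1u << 0)   /* 1 = timer clock running, 0 = frozen */

/* On real hardware this would be a fixed MMIO address, e.g.
 * (volatile uint32_t *)0x40001000; it is a settable pointer here
 * only so the sketch can run anywhere. */
volatile uint32_t *clkgate_reg;

/* Entering the debugger: gate the clock off, and everything derived
 * from the single timing device stops together. */
void timer_clock_disable(void) { *clkgate_reg &= ~CLKGATE_TIMER; }

/* Resuming: re-enable the clock; the frozen side never saw the stop. */
void timer_clock_enable(void)  { *clkgate_reg |=  CLKGATE_TIMER; }
```

Restricting the hardware to one gateable timing device is what makes this a two-line operation instead of a race-prone scramble to quiesce many independent clocks.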