Summary: Use a Raspberry Pi Zero as an RGB to HDMI converter that's optimised for the video timings of the Model B/Master/Electron and supports all screen modes (including Mode 7), together with automatic calibration. Do this as cheaply as possible, using a small CPLD for level shifting and pixel sampling. The advantage of using HDMI, compared to VGA, is that almost all LCD TVs/monitors with HDMI inputs should support 50Hz, whereas few VGA monitors do.

This is a project that's been in the pipeline for a while, but I've not written much about it yet. Most of the work was done between April and June last year, then I got a bit stalled on the PCB (more later). I've had some help along the way from Ed (BigEd) on the auto calibration and Dominic (dp11) on optimising the ARM code. So a big thank you to them.

Anyway, sorry for the rather long post, I hope some of you find the details interesting.

Here's a photo of the current prototype:

You can see there are two boards here:
- a small CPLD development board containing a XC9572XL CPLD, used for level shifting and pixel sampling
- a Raspberry Pi Zero, running some bare metal firmware, producing the HDMI output

So, how does it work?

Here's a block diagram of the overall system:

(do click on the image to make it larger)

The CPLD contains a 4-pixel shift register which is used to collect four successive pixel samples (called a Pixel Quad). The shift register is followed by an output register so the Pixel Quad value seen by the Pi stays stable for as long as possible. Each time a new value is loaded into the output register the Pixel Sync output is toggled. This is used by the Pi to determine when a new Pixel Quad value is available, to be written into the frame buffer.
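As a rough behavioural sketch in C (purely illustrative: the real logic is in the CPLD, and the register widths and Pixel Sync polarity here are my assumptions):

```c
#include <assert.h>
#include <stdint.h>

/* Behavioural model of the Pixel Quad capture (illustrative only).
 * Each sample is a 3-bit RGB value; four samples make a 12-bit quad. */
typedef struct {
    uint16_t shift;   /* 4 x 3-bit shift register                */
    uint16_t output;  /* output register, read by the Pi         */
    int      count;   /* samples collected towards the next quad */
    int      psync;   /* Pixel Sync level, toggled per quad      */
} quad_capture_t;

/* Clock one RGB sample in; returns 1 when a new quad has been latched. */
int quad_clock(quad_capture_t *q, unsigned rgb)
{
    q->shift = (uint16_t)(((q->shift << 3) | (rgb & 7u)) & 0xFFFu);
    if (++q->count == 4) {
        q->output = q->shift;  /* latch the quad into the output register */
        q->psync ^= 1;         /* toggle Pixel Sync to signal the Pi      */
        q->count = 0;
        return 1;
    }
    return 0;
}
```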

The precise pixel sampling points are controlled by the sampling state machine. At the end of each horizontal sync pulse, a counter is loaded with a large negative value and then starts counting up. When this counter reaches zero, the sampling begins.

To reliably sample in the centre of each pixel, the sampling clock is significantly higher than the pixel clock. In this case we're using 96MHz, as this is an integer multiple of both 12MHz and 16MHz. This clock is generated by one of the PLLs on the Pi. In Modes 0-6 a pixel is sampled every 6 cycles (16MHz). In Mode 7 a pixel is sampled every 8 cycles (12MHz).

It's necessary to deal with some variation in the Beeb's pixel clock, because the tolerance of the Beeb's 16MHz crystal is ~100ppm, and in practice after 30+ years it may actually be worse than this. On startup the Pi runs an initial calibration phase that accurately measures the duration of two video fields (i.e. one frame). If the Beeb's clock is exactly 16MHz then the frame time should be exactly 40 ms. Any error from this allows us to estimate the actual 16MHz clock frequency, and then vary the Pi's PLL frequency so the 96MHz clock tracks this variation. The net result is there is very little shift in the pixel sampling point across the line.
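The arithmetic can be sketched as follows (function names are mine, not the actual firmware's):

```c
#include <assert.h>

/* One frame (two fields) should take exactly 40 ms at a 16 MHz Beeb clock. */
#define NOMINAL_FRAME_NS 40e6

/* Estimate the Beeb clock error, in ppm, from the measured frame time. */
double clock_error_ppm(double measured_frame_ns)
{
    return (measured_frame_ns - NOMINAL_FRAME_NS) / NOMINAL_FRAME_NS * 1e6;
}

/* Scale the nominal 96 MHz sampling clock so it tracks the Beeb's clock. */
double corrected_sample_clock_hz(double error_ppm)
{
    return 96e6 * (1.0 + error_ppm / 1e6);
}
```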

Even after calibrating the 96MHz clock, picking the best 96MHz clock cycle to sample the pixel is tricky. Dead reckoning doesn't seem to work that well, as the video timings vary slightly on different machines, and also vary between the Elk, the Model B and the Master.

To address this, the CPLD allows some fine tuning of the exact pixel sample point, through the sample point register. This register is a set of 3-bit values that allow the sample point to be offset by 0 to 7 cycles (each cycle is ~10ns).
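Packing a set of 3-bit offsets into such a register might look like this (the 6-entry layout and field ordering are my assumptions, not the real CPLD register layout):

```c
#include <assert.h>
#include <stdint.h>

/* Pack six 3-bit sample point offsets (0-7 cycles each) into one
 * register value, 3 bits per entry. Layout is illustrative only. */
uint32_t pack_sample_points(const unsigned offsets[6])
{
    uint32_t reg = 0;
    for (int i = 0; i < 6; i++)
        reg |= (uint32_t)(offsets[i] & 7u) << (3 * i);
    return reg;
}
```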

In Modes 0-6 it turns out that a single offset value will suffice (that's the X in the diagram) regardless of the mode, so long as it is correctly set, because all these modes derive their pixel clock directly from the Beeb's 16MHz clock.

In Mode 7 it's unfortunately far more complicated. This is because the scheme that the Beeb uses to generate the 6MHz clock for the SAA5050 is a bit of a hack, involving a 4MHz clock, an 8MHz clock, some XOR gates and some capacitors. The resulting 6MHz clock can have very unevenly spaced edges, which actually means the pixels are different sizes. This can be seen in the below scope capture:

It turns out that to be able to do the best job in Mode 7, you actually need to use different sample point offsets for 6 successive pixels. That's the purpose of the values A - F in the sample point register.

To correctly set the sample point register the Pi performs a second calibration phase. This involves going through all the possible offset values (0-5 in Modes 0-6, and 0-7 in Mode 7). At each stage a number of separate frames are captured, and the number of differences between frames is counted. The best sampling offset is the one that gives the fewest differences between successive frames (a bit of a simplification, because Mode 7 is more complex). Clearly this needs a static image to be present on the Beeb!
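The selection step amounts to a minimum search. A sketch in C, with the per-offset difference counts supplied as an array standing in for the real frame captures:

```c
#include <assert.h>

/* Given the number of inter-frame differences measured at each candidate
 * sample offset, return the offset with the fewest differences. */
int best_offset(const int diffs[], int num_offsets)
{
    int best = 0;
    for (int off = 1; off < num_offsets; off++)
        if (diffs[off] < diffs[best])
            best = off;
    return best;
}
```

In reality each entry would be produced by capturing several frames at that offset and counting how many pixels differ between them.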

In the below scope capture, you can see the sampling points that have been picked for a Mode 7 screen. Even though the pixels are of varying width, the sample is taken close to the centre of each pixel:

The sampling offset calibration happens automatically the first time a particular mode is used. The user can also trigger it to happen again by pressing the Calibrate button. It takes about a second.

On the Pi side all of the software is running bare metal, so that interrupts don't cause sampling glitches. The software is written in a mixture of C and ARM assembler. All of the non-time-critical code, like the calibration phases, is written in C. The time-critical piece, copying Pixel Quad values from GPIO to the Frame Buffer, is written in ARM assembler.

The Frame Buffer on the Pi is set to:
- 672 x 540 x 4 bits/pixel if Modes 0-6 are being used
- 504 x 540 x 4 bits/pixel if Mode 7 is being used

These values allow a bit of overscan, as it turns out the position of the Beeb's screen shifts very slightly in the different screen modes.

A further optimisation is using the 4 bits/pixel frame buffer mode. This allows two Pixel Quad values to fit nicely in a 32-bit word, meaning the frame buffer can be updated with a single write. From a performance perspective the worst case is in Modes 0-6, where a new Pixel Quad is available every 500ns. It seems that a Pi Zero can just about keep up with this while writing to memory.
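For example (the nibble ordering here is an assumption; the real packing depends on the Pi's frame buffer layout):

```c
#include <assert.h>
#include <stdint.h>

/* Two Pixel Quads (eight 4-bit pixels) packed into one 32-bit word,
 * so eight pixels hit memory with a single store. */
uint32_t pack_two_quads(uint16_t quad0, uint16_t quad1)
{
    return ((uint32_t)quad1 << 16) | quad0;
}
```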

The system also performs simple weave de-interlacing (Mode 7) or line doubling (Modes 0-6) as data is written into the frame buffer.
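The line placement can be sketched as follows (my naming, not the firmware's):

```c
#include <assert.h>

/* Weave de-interlace (Mode 7): the odd field fills the odd frame-buffer
 * lines and the even field the even lines, preserving vertical detail. */
int weave_dest_line(int line_in_field, int odd_field)
{
    return 2 * line_in_field + (odd_field ? 1 : 0);
}

/* Line doubling (Modes 0-6): each captured line is written to a pair of
 * adjacent frame-buffer lines. */
void doubled_dest_lines(int line_in_field, int dest[2])
{
    dest[0] = 2 * line_in_field;
    dest[1] = 2 * line_in_field + 1;
}
```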

On the output side, the GPU is used to scale up the frame buffer to match the output panel resolution. To keep things looking nice, it's desirable that an integer scaling factor is used. This is typically a factor of two with the resolutions we have chosen.

Last year I started working on a PCB design (also in Github), but got rather distracted over form factor and connector choice.

More recently Myelin has said he might be able to help out with this. The plan (I think) is to try for a form factor that's the same width as the Pi Zero, but about 30% longer. This will allow room for a standard Beeb RGB DIN connector. All of the connectors will be in the same plane, so it should be possible to put this in a nice little box. All the power comes from +5V pin on the Beeb's RGB output.

In the next post, I'll try to show some screen shots of the results, and show one of the debug logs from the calibration.

Last edited by hoglet on Thu Jan 25, 2018 9:35 pm, edited 3 times in total.

The software measures the duration of one frame (two fields), compares with the expected value of 40 ms, and then corrects the Pi's clock accordingly. In this case the measured error is 575 ppm. If uncorrected, the effect of this error over 640 pixels would be to shift the sampling point by 0.368 pixels, which is quite significant.
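The drift figure follows directly from the error: a clock off by e ppm accumulates e millionths of a pixel for every pixel along the line, so over 640 pixels:

```c
#include <assert.h>

/* Sampling point drift across a line caused by an uncorrected clock error. */
double sample_drift_pixels(double error_ppm, int pixels_per_line)
{
    return error_ppm * 1e-6 * pixels_per_line;
}
```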

You can see the results of setting up the clocks later in the logs, where CORE_FREQ is now 383.779 MHz:

First, it's detected we are in Mode 7 (which itself was quite hard; there are subtle differences in the sync timing that can be measured).

In Mode 7 the pixel clock is 12MHz and the RGBtoHDMI sampling clock is 96MHz. That means that there are 8 possible sample point offsets (0 to 7) in a pixel. The software configures each offset in turn, and then measures the number of differences between 10 successive frames. You can see that an offset of 5 yields the best results, with an average of 31 differences per frame.

The problem in Mode 7 is the clock is rather asymmetric, as can be seen here:

The SAA5050 uses both edges of the clock, and you can see in the scope capture the edge spacing is not consistent.

In fact, it turns out the pattern of jitter repeats every 6 edges. To deal with this the CPLD has six sample offset registers (A-F) which are used in turn for successive pixels. At this point, A - F are all set to the same value (5). Next, the software tries moving each of points A-F a small amount, and if that improves the overall error metric the nudged value is persisted:

Here, just two of the points were nudged, and this reduced the number of errors from 31 per frame to 10 per frame.

Depending on the Beeb, and how bad its teletext clock is, it may not be possible to get error free sampling. But in general the errors are only noticeable if you really look for them, as occasionally twittering pixels.

Here's the results after calibration:

In Modes 0-6 the calibration is much simpler, as only one offset register is needed:

Now, you can see at several positions the sampling is error free. In this case an offset of 0 is picked for the default sampling point (sp_default), because none of the later positions improve on that.

There is actually room for improvement here. If you re-order the measurements, it's clear that actually offset 5 would have been a better choice, because it's in the centre of the minima:
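A sketch of that improvement (my code, not the firmware's; wrap-around of the offset range is ignored for simplicity, and the error counts in the test are made up):

```c
#include <assert.h>

/* Among offsets tying for the minimum error, return the offset in the
 * centre of the longest run of minima, so the sample point sits as far
 * as possible from both edges of the good region. */
int centre_of_minima(const int errs[], int n)
{
    int min = errs[0];
    for (int i = 1; i < n; i++)
        if (errs[i] < min)
            min = errs[i];

    int best_start = 0, best_len = 0, run_start = -1;
    for (int i = 0; i <= n; i++) {
        if (i < n && errs[i] == min) {
            if (run_start < 0)
                run_start = i;        /* a run of minima begins here */
        } else if (run_start >= 0) {
            int len = i - run_start;  /* a run of minima just ended  */
            if (len > best_len) {
                best_len = len;
                best_start = run_start;
            }
            run_start = -1;
        }
    }
    return best_start + best_len / 2;
}
```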

Finally, to get an idea of aspect ratio I've enabled the "grey background" debugging feature. There is intentionally a bit of overscan here, to allow for small changes in the positioning of the active area (e.g. due to *TV or 6845 register tweaking):

Dave

Last edited by hoglet on Fri Jan 26, 2018 12:03 pm, edited 9 times in total.

I suspect this would also do really well on the Archimedes, particularly for retro gamers wanting 320x256 @ 50 Hz. Definitely worth testing on an Archie.

Back to the beeb, when building the digitised frame, are you using the pixel colour captured as digitised? Or are you re-colouring the captured pixels to true bbc colours, to eliminate digitising noise? Both options could be useful.

steve3000 wrote:
Back to the beeb, when building the digitised frame, are you using the pixel colour captured as digitised? Or are you re-colouring the captured pixels to true bbc colours, to eliminate digitising noise? Both options could be useful.

There is no video DAC - i.e. the TTL-level R G B inputs are connected directly to the CPLD (so we effectively have 1-bit RGB). These are eventually written to the frame buffer as 4-bit pixels. Only values 0-7 are ever used, and then the Pi's palette is set up to map these to the usual colours on the Beeb: https://github.com/hoglet67/RGBtoHDMI/b ... ace.c#L204
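The mapping is essentially the following (the bit ordering of the R, G and B inputs is my assumption here):

```c
#include <assert.h>
#include <stdint.h>

/* Map a 1-bit-per-channel RGB value (0-7) to a 24-bit palette entry,
 * giving the usual eight saturated Beeb colours. */
uint32_t beeb_colour(unsigned rgb)
{
    uint32_t c = 0;
    if (rgb & 4u) c |= 0xFF0000u;  /* red   */
    if (rgb & 2u) c |= 0x00FF00u;  /* green */
    if (rgb & 1u) c |= 0x0000FFu;  /* blue  */
    return c;
}
```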

If you want to extend this to 2 or 4 bits each of RGB, I think the limiting factor will be software copying the data from GPIO inputs to the frame buffer. I think you will hit that limit very quickly.

The only alternative to increase the copying performance is to find a way to move the data from the CPLD into the frame buffer using DMA. Dominic (dp11) has some ideas here, but I haven't followed up with him yet.

The frame buffer is currently double buffered, so there is one field (20 ms) of lag (maybe two in the worst case).

The reason for double buffering is that currently the Pi's HDMI video, though nominally 50Hz, is not genlocked/slaved to the Beeb video frame rate. So the double buffering is there to avoid visible tearing. But the difference in frame rates means that occasionally a field will need to be dropped, or duplicated.

Last year it wasn't obvious to me whether it was possible to genlock the HDMI. But in reading around this evening, it might be as simple as continuously tweaking the Pi's HDMI pixel clock frequency, and there is a firmware mail box interface call to do this.

Just chiming in to confirm that I am indeed planning on helping out with the PCB (although I haven't started yet). Hoglet has actually already designed a pretty nice PCB for it, so my job is to shrink it down a little and move some stuff around to make it easier to design an enclosure for. I've been looking forward to building one of these for a long time!

Impressive though this is, the bit that I find most outright surprising is that you manage to get a 96MHz clock out of the RasPi. In other forums, I've seen people struggling to do SPI at anything above about 25MHz, despondently posting ominous waveforms showing very high slew!

tricky wrote:
Great work, I expect we will be seeing this mentioned on hackaday before too long.
Put me down for one too
Does it work on the old and new zero (w)?

It's untested with the new zero, but I don't see why it wouldn't work, unless the memory is for some reason slower. It doesn't work well with the Pi 2 or Pi 3, I think for precisely this reason, i.e. there is lots of jitter/wobbling visible on the screen.

crj wrote:
Impressive!
Was this at ABug and I just missed it?

No, but it will be at the next one, together with me.

crj wrote:
Impressive!
Impressive though this is, the bit that I find most outright surprising is that you manage to get a 96MHz clock out of the RasPi. In other forums, I've seen people struggling to do SPI at anything above about 25MHz, despondently posting ominous waveforms showing very high slew!

The clock is being generated internally by a PLL, then output on one of the dedicated GPCLK pins. Even then, on my scope it's basically a sine wave. But 96MHz is approaching the scope bandwidth limit, so that might be expected. For this application, as long as it crosses the CPLD logic threshold with sufficient amplitude, the phase doesn't really matter because the system as a whole is self-calibrating.

It's probably worth pointing out that RGBtoHDMI is not going to work well with the extended capabilities of VideoNuLA, for two reasons:
1. RGB is being digitised directly by the CPLD (i.e. no DAC is used) so you end up with just 1-bit RGB.
2. In some of the VideoNuLA attribute modes the pixel clock is 12MHz, but RGBtoHDMI is expecting 16MHz so will mis-sample.

hoglet wrote:It's necessary to deal with some variation in the Beeb's pixel clock, because the tolerance of the Beeb's 16MHz crystal is ~100ppm, and in practice after 30+ years it may actually be worse than this.

As an aside...

On the Master AIV they fixed this (to permit genlock with the video disc player) by feeding the 16MHz from a higher quality oscillator module located on the SCSI interface board.

When you were working on the Beeb FPGA, I asked about positioning the picture based on the centre of the h-sync pulse (as a CRT would) to allow the pulse width to be changed to give fine hardware scrolling.
I know I haven't got any further with Rally-X and that nothing else except maybe Orlando's Scrolls supports it, but if it isn't too much trouble, could this also support it?
It doesn't need to model springing back over several scan lines, instantly adjusting should be ok.

hoglet wrote:
On startup the Pi runs an initial calibration phase that accurately measures the duration of two video fields (i.e. one frame). If the Beeb's clock is exactly 16MHz then the frame time should be exactly 40 ms. Any error from this allows us to estimate the actual 16MHz clock frequency, and then vary the Pi's PLL frequency so the 96MHz clock tracks this variation. The net result is there is very little shift in the pixel sampling point across the line.

Do you cope with the case of *TV0,1 as software sometimes has that embedded in it, so it's not as simple as not using that feature?
(When interlace is turned off the frame length is 39.936 ms except for Mode 7 which remains at 40ms)

hoglet wrote:
The system also performs simple weave de-interlacing (Mode 7) or line doubling (Modes 0-6) as data is written into the frame buffer.

Does that mean edge combing on any mode 7 horizontal movement? If so, wouldn't bob de-interlacing be better?

tricky wrote:
When you were working on the Beeb FPGA, I asked about positioning the picture based on the centre of the h-sync pulse (as a CRT would) to allow the pulse width to be changed to give fine hardware scrolling.
I know I haven't got any further with Rally-X and that nothing else except maybe Orlando's Scrolls supports it, but if it isn't too much trouble, could this also support it?
It doesn't need to model springing back over several scan lines, instantly adjusting should be ok.

It might be possible, but it would require changes to the CPLD, which is fairly full.

It comes down to how the counter that determines when to start sampling the pixels on the line is managed.

Currently the counter works as follows:
- it is loaded with a suitable negative value (that excludes the HSYNC width) on the trailing edge of HSYNC
- it counts up during the left border
- on reaching zero, pixel sampling will start

To start it from the centre of HSYNC this would have to change to:
- it is loaded with a suitable negative value (that includes the HSYNC width) on the leading edge of HSYNC
- it counts up during HSYNC at half the normal rate (I think...)
- it counts up during the left border at the normal rate
- on reaching zero, pixel sampling starts
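A behavioural sketch of the two schemes (in C for illustration; the real counter lives in the CPLD, and the values here are made up):

```c
#include <assert.h>

/* Current scheme: counter loaded at the HSYNC trailing edge,
 * counting up at the full clock rate until it reaches zero. */
int cycles_until_sampling(int load_value)
{
    int counter = load_value, cycles = 0;
    while (counter < 0) {
        counter++;
        cycles++;
    }
    return cycles;
}

/* Proposed scheme: counter loaded at the HSYNC leading edge, counting
 * at half rate during the sync pulse and full rate afterwards. The net
 * effect is that sampling starts a fixed delay after the *centre* of
 * HSYNC, so changing the pulse width moves the picture. */
int cycles_until_sampling_centred(int load_value, int hsync_cycles)
{
    int counter = load_value, cycles = 0;
    while (counter < 0) {
        if (cycles >= hsync_cycles || (cycles & 1))
            counter++;  /* half rate inside HSYNC, full rate after */
        cycles++;
    }
    return cycles;
}
```

With a load value of -N, sampling starts N cycles after the centre of the sync pulse (N plus half the pulse width after the leading edge), which is exactly the CRT-like behaviour being asked for.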

JonC wrote:
I'm wondering if this just works on a Zero, or would it work on other Pis as well? Presumably the Zero has sufficient oomph to do the job and the older ones not so much..

I'm only planning on supporting the Pi Zero. I did try a Pi 3, and it didn't work well at all. It seems the memory bandwidth is less, or at least there are more stalls when continuously writing to memory.

JonC wrote:
PS - This is a very cool project which could work well for a lot of retro machines potentially! I know one of the Amiga guys at work keeps asking me if it could be adapted for Amigas.

It's probably limited to digital (i.e. 1-bit) RGB, or the data rates get too high for the memory bandwidth of the Pi.

IanB wrote:
Do you cope with the case of *TV0,1 as software sometimes has that embedded in it so it's not as simple as not using that feature.
(When interlace is turned off the frame length is 39.936 ms except for Mode 7 which remains at 40ms)

IanB wrote:
Does that mean edge combing on any mode 7 horizontal movement? If so, wouldn't bob de-interlacing be better?

Yes, it likely does mean edge combing will be present.

Can you (or anyone else) suggest any Mode 7 games with continuous horizontal scrolling that I could test with?

The way it works is that there is a single 504 x 540 resolution frame buffer that's being output continuously at 50Hz (i.e. no double buffering is used). The odd field is simply written to the odd lines, and the even field is written to the even lines. The advantage is that (at least on a static screen) there is no loss of vertical resolution.

Doesn't bob effectively halve the vertical resolution (i.e. each field is line doubled and written to two lines)?

Your timing for the h-sync centre aligned positioning seems correct.
It is only a nice to have, but I am still hoping to use it, maybe for Scramble and/or Rally-X, and the more ways to use it without a CRT the better. Edge Grinder and presumably Super Edge Grinder on the CPC use it, I believe, with shadow RAM like the Master to get single MODE 1 pixel hardware scrolling, but the two-pixel scrolling that this would give is fine for most constantly scrolling games.
I've attached one of my demos, just in case you want to try it / see the difference. You can "drive" with the cursor keys, but try to line up with a road!

hoglet wrote:
Yes, it likely does mean edge combing will be present.
Doesn't bob effectively half the vertical resolution (i.e. each field is line doubled and written to two lines)
I'm definitely open to ideas for how to improve this.

Bob does reduce the effective vertical resolution a bit but not by half and it does introduce interlace twitter (like a real CRT).
Proper bob deinterlacing is done by interpolating the lines in between the missing ones for each field when converting to a frame, not just line doubling, as follows:

Odd field:  1 i 3 i 5 i ...
Even field: i 2 i 4 i 6 ...

The 'i' lines above are interpolated and the non 'i' lines are the original lines from each field.
Interpolation can be as simple as an average of the line above and below but better quality can be achieved by using weighted values from four lines or even more. (This means that RGB intensities will not just be on or off so you would need to have multiple bits per pixel for intermediate intensities)
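A minimal sketch of the simplest interpolation (averaging only, assuming 8-bit per-channel intensities):

```c
#include <assert.h>
#include <stdint.h>

/* Fill a missing line as the rounded average of the lines above and
 * below it, one channel value per entry. */
void interpolate_line(const uint8_t *above, const uint8_t *below,
                      uint8_t *out, int width)
{
    for (int x = 0; x < width; x++)
        out[x] = (uint8_t)((above[x] + below[x] + 1) / 2);
}
```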

This results in most of the resolution of the video being retained but it introduces interlace twitter on horizontal edges as the interpolated values don't 100% match the real values in the next de-interlaced frame.

An even better solution would be motion adaptive de-interlacing which switches between bob and weave in parts of the frame depending on their motion. i.e. moving areas are bobbed and stationary areas are weaved. Normally this is quite complex as determining the extent of a moving area is quite difficult but in the case of mode 7, the characters can only be in a cell and so it might be possible to switch between bob and weave on a per character cell basis depending on whether that cell had changed in the last frame (It might be possible to use simple line doubling bob under those circumstances as well)

Hoglet, I'm sorry, I'd like to add a proviso to my answer.
If counting at half rate means changing the sampling positions, it isn't right, but may be near enough.
In mode 7 it is ok if a character is a multiple of 12 sample points.
In any other mode, I think it is ok.
If it is about which pixels to fill in on the pi, then it is always ok.

PS: unless someone is doing a side scrolling text game in Mode 7, it could only really be used for shaking the screen, and wouldn't need to be perfect.

Last edited by tricky on Sun Jan 28, 2018 7:41 am, edited 1 time in total.