Immature Optimization

How much difference can a single instruction make? On the Raspberry Pi, it can make a difference of 14%, demonstrable without any expensive digital testing equipment.

The Tutorial

I started with Alex Chadwick’s Baking Pi tutorial on the U. of Cambridge website. Baking Pi is a course in very rudimentary embedded programming—no operating system, so everything is “bare metal.” The Beginner lessons demonstrate how to use the GPIO interface to manipulate the OK light on the Raspberry Pi. Part 1 covers how to turn on the OK light, and part 2 explains delay loops and turning off the OK light.

And that’s as far as I made it. I got a bit distracted after that.

Veering off-course

What caught my attention was an inefficient delay loop. Granted, the purpose of a delay loop is to burn CPU time, but can one tweak in a delay loop have a visible difference? Yes, it can, even if the difference is barely perceptible.

Chadwick does explain instruction predication, as used in the bne wait1$ instruction. However, he misses an opportunity to explain the -s mnemonic suffix, which sets flags by comparing the result to zero automatically. In the code above, setting/clearing the Equal flag is delayed until the cmp r2,#0 instruction, when it could be incorporated into the subtraction directly:

This works when the eliminated instruction is a comparison to zero. The fetch-decode-execute cycle has been shortened by one instruction, four bytes. The difference in performance was, as I said earlier, barely perceptible. How could I prove such a small difference?

The measurement

Recording the flashing OK light on a digital camera was my flash of inspiration (bad pun intended). I could record ten flashes of the OK light, then split the frames apart and find the frame counts where the OK light changed from “off” to “on.” (Physics students may complain that the proper calculation of wavelength is “peak-to-peak.” The waveform in this case is a square wave with a pulse width of 50%, so the calculation is based on the rising edge, not the non-existent peak.) The camera has two qualities for video, which trade off framerate and image resolution. Using the lower resolution, I could get 20.032 FPS.

After splitting the frames apart from the first video of Chadwick’s un-optimized loop, I found that the first pulse began on frame 8, and the tenth pulse began on frame 127, giving an average of (127-8)/9 = 13.2 frames-per-pulse, or 1.52 Hz (20.032 FPS / 13.2 frames).

Moving the compare-to-zero operation into the subs instruction shortened the delay loop. The first pulse began on frame 7, and the tenth on frame 111, giving an average of (111-7)/9 = 11.6 frames-per-pulse, or 1.73 Hz, a 14% speed-up. To carry the math further, given that the delay loop counts down twice from 4,128,768 (once with the OK light turned on, and once with it turned off), it would appear that my version has shaved about 10 nanoseconds from each iteration of the delay loop. The clock speed is the default 700 MHz, so that’s roughly 7 machine cycles.

Of course, this all assumes no lost precision in my speed calculations. I could probably get better precision by timing 100 flashes, or even 1,000 flashes. Still, as a proof-of-concept, this demonstrates the ability to measure event durations using a commodity item.

In the next article, I’ll demonstrate implementing the same light flasher in C, and with actual code!