Last year I did some experiments with the Theora video decoder on a Texas Instruments DM642 DSP. A royalty-free video decoder is very attractive for embedded devices, but after some major restructuring for performance, some problems remained.

The main problem is that, unlike MPEG video, Theora video is not packed in the bitstream in the raster order in which it is displayed on screen, but in Hilbert curve order. This is not a problem in itself, but Theora’s DC prediction and post-processing loop filter are both defined in raster order. The need to traverse the data once in Hilbert curve order and once in raster order leaves Theora decoding requiring more memory bandwidth than MPEG decoding.
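To make the ordering mismatch concrete, here is a small sketch of a Hilbert traversal of the 4×4 grid of blocks inside a superblock. The table shows one valid Hilbert curve over a 4×4 grid; the exact orientation used by the Theora specification may differ, but the point stands either way: consecutive blocks in the bitstream are neighbours on the curve, not neighbours in raster order.

```c
/* One possible Hilbert traversal of a 4x4 grid of blocks, stored as
 * (x, y) pairs.  Illustrative only -- the precise orientation in the
 * Theora spec may differ.  Each entry is a grid-adjacent neighbour of
 * the previous one, which is what makes it a Hilbert curve. */
static const int hilbert4x4[16][2] = {
    {0,0},{1,0},{1,1},{0,1},
    {0,2},{0,3},{1,3},{1,2},
    {2,2},{2,3},{3,3},{3,2},
    {3,1},{2,1},{2,0},{3,0}
};

/* Raster index (y*4 + x) of the i-th block visited by the curve.
 * Note how the sequence jumps around rather than scanning rows. */
int hilbert_to_raster(int i)
{
    return hilbert4x4[i][1] * 4 + hilbert4x4[i][0];
}
```

Following the curve, the decoder visits raster positions 0, 1, 5, 4, 8, … rather than 0, 1, 2, 3, …, which is why a raster-order pass (DC prediction, loop filter) cannot simply be fused into the bitstream decode loop.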

The encoder faces a similar problem. Andrey N. Filippov describes an FPGA implementation of the Theora Encoder, and comments on the high memory bandwidth required. The solution in the article is to implement a custom SDRAM controller with knowledge of the Theora data structures, an option not available on a DSP.

There are other minor problems remaining. The DM642 has instructions to assist video encoding and decoding, but these are optimised for MPEG and may not easily apply to Theora. For example, the avg2 instruction averages two pairs of 16-bit values, but it uses the formula (x + y + 1) >> 1, whereas Theora’s half-pixel predictor uses the formula (x + y) >> 1.

Where does this leave Theora decode on DSP? The DM642 is just capable of decoding NTSC quality video (640×480, 30fps) provided that the bitrate is controlled. The good news is that the newer DaVinci architecture provides extra memory bandwidth through a DDR2 memory controller, plus the possibility of splitting the workload to place bitstream decode on the ARM processor and frame reconstruction on the DSP.


Makes me wonder how much effort it would take to put an embedded version of Dirac inside. After all, Theora is open and free, but it is still VP3, so from a codec perspective it is an old codec based on older technology.

While the new Dirac project is working on x86 performance, it makes you wonder how well it could work on a DSP.

Interesting question, which I’d like to look into later. Dirac has advanced to a stage where someone could do the experiment, but optimisation for DSP often involves different tactics to optimisation on a desktop platform.

The real problem, as I saw it, was not the raster/Hilbert curve ordering (though this certainly introduces many complexities in the encoder; the decoder is not as bad), but the fact that coefficients are ordered by their frequency first and by the block they belong to second. This means that if the 63rd coefficient in a block is non-zero, you need to decode virtually all the coefficients in the frame before you can finish decoding that one block.
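The loop structure implied by that ordering can be sketched as follows. This is a deliberate simplification with hypothetical names (the real bitstream uses tokens with end-of-block runs, not one value per slot), but it shows why the whole frame’s coefficients must be buffered before any single block is complete:

```c
#define NBLOCKS 4   /* tiny "frame" for illustration; a real frame has thousands */

/* Sketch of the decode order implied by the bitstream layout: every
 * block's coefficient 0 arrives first, then every block's coefficient 1,
 * and so on.  The frequency index is the OUTER loop, so the last
 * coefficient of the first block is nearly the last thing decoded. */
void decode_coeffs(const int *tokens, int coeffs[NBLOCKS][64])
{
    int n = 0;
    for (int ci = 0; ci < 64; ci++)          /* frequency index: outer */
        for (int b = 0; b < NBLOCKS; b++)    /* block index: inner */
            coeffs[b][ci] = tokens[n++];
}
```

With a per-block (MPEG-style) layout the loops would nest the other way round, and each block could be reconstructed and written out as soon as its own tokens were consumed.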

This structure actually helps avoid the raster order problem for DC prediction, because you can undo the DC prediction after decoding just the DC coefficients, i.e., on 1/64th of the final image data, which often fits entirely in cache (at 640×480 it’s less than 16k of data).

The loop filter problem can similarly be handled by applying the filter (with a one block row delay) after each super block row is decoded. This requires slightly more cache (just under 64k of image data at 640×480, plus additional overhead for the coefficients that were decoded), but should still fit well within the 256K L2 cache on your DSP.
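The one-row-delay pipeline can be sketched as below. The `decode_row`/`filter_row` functions are hypothetical stand-ins that just record the order of operations; the shape of the loop is the point: each row is filtered as soon as its successor has been decoded, so the working set stays one super block row deep instead of a whole frame.

```c
#define NROWS 4  /* super block rows in this toy example */

/* Event log: positive = row decoded, negative = row filtered
 * (row numbers stored 1-based so row 0 is distinguishable). */
static int log_buf[2 * NROWS], log_n;

static void decode_row(int r) { log_buf[log_n++] = +(r + 1); }
static void filter_row(int r) { log_buf[log_n++] = -(r + 1); }

void decode_frame_pipelined(void)
{
    log_n = 0;
    for (int r = 0; r < NROWS; r++) {
        decode_row(r);
        if (r > 0)
            filter_row(r - 1);   /* row r-1's lower edge is now known */
    }
    filter_row(NROWS - 1);       /* drain: the last row has no successor */
}
```

The resulting schedule is decode 0, decode 1, filter 0, decode 2, filter 1, … , which is exactly the one block row delay described above.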

The theora-exp decoder does both these things, to good effect. The spec describes everything as separate processes for conceptual simplicity, but to get good performance they really need to be pipelined.

For the loop filter at larger resolutions, you could (if you’re careful about it) process 3/4 of each superblock (with a one block column delay) immediately after decoding it, without waiting for the entire row to finish. Then you only need to cache one row of block data between superblock rows.