
Forward Error Correction with the TI CC1120 Sub-GHz transceiver

We're building a new, fancier flight computer and decided to give the
new, fancier TI CC1120 chip a try.
It has significantly more transmit power (+16dBm) and receive
sensitivity (-109dBm) than the transceiver built into
the TI CC1111 RF SoC that we've
built our other boards with. Given that we'd already decided to use a
fancier STM32L
processor, using the nicest stand-alone radio chip we could find
seemed to make good sense.

We assumed that, as the CC1120 was advertised as a 'better'
replacement for our CC1111 radio, it would also include all of
these fine features. After we got the chip all soldered down to our
first prototype boards, we read through the reference manual looking
for the right register values to flip for forward error correction. We
found support for CRC computation and data whitening, but no mention
of interleaving or forward error correction at all.

Thanks, TI. Of course, without the FEC pieces, the other packet
handling hardware is worthless—you can't exactly do a CRC on the
encoded data and expect it to be useful on the other end of the link,
and data whitening happens after the CRC.

Basic CC1120 driver

I assumed I'd eventually figure out something to do about the FEC
pieces and started writing a basic
CC1120 driver for AltOS, the operating system used in
all of our projects. I started with the transmit side, which seemed
easier: I could upload the
desired bit sequence, using whatever fancy encoding I liked, and the
existing packet hardware would happily transmit it.

I first tested that using low data-rate bits (2kbps), transmitting an
alternating sequence of zeros and ones to generate a 1kHz tone on a
regular 70cm receiver. That seemed to work just fine, demonstrating
that I had some grasp of the basic operation of the chip.

Convolutional Encoding in Software

Not to be daunted by a small challenge in software, I set out to
replicate the encoding scheme used in the CC1111 chips for use in our
CC1120-based design. We obviously want our CC1120 boards to
interoperate with the CC1111 boards. Fortunately, TI
fully documents
the CRC, data whitening, convolutional encoding and interleaving done
in the hardware.

I created my own implementation of the encoder (licensed, as usual, under the
GPLv2), which you can see in the file
ao_fec_tx.c.
It's not quite as flexible as the hardware: it always whitens the data
and always appends the CRC bytes to the end of the packet, along with
the necessary trellis termination bytes.

ao_fec_encode_table takes care of the convolutional piece,
ao_fec_whiten_table (not shown) contains the data whitening codes
while the little loop at the bottom does the interleaving. I didn't
spend a lot of time optimizing this as it seemed to run plenty fast on
the 32MHz processor.

With the bits all nicely encoded, I passed them to my simple CC1120
driver and let it send them along to our CC1111 receiver. Much to my
delight, with only a few bugs fixed, it worked just fine! I'd managed
to get one direction working, validating the hardware and making it
clear that we'd at least be able to use the board in flight for
telemetry.

Getting Soft-decision data out of the CC1120

In packet mode, the CC1120 receiver converts the incoming GFSK stream
directly into bits, making its best guess as to whether the detected
value was a one or zero. Of course, like any FM modulation, the
receiver will be producing intermediate values, especially when the
signal is weak or masked by interference. A Viterbi decoder can make
good use of this information; these noisy intermediate values carry
less weight in the final path cost, which allows the strong
0/1 values to pull the solution toward the right one. Soft decision
decoding yields a theoretical 2dB gain over hard decision decoders,
and given our desire for maximum packet reception over long distances,
it seemed important to make this work.

The CC1120 has a way to recover the soft decision values, but only one
8-bit value at a time. It does, however, helpfully provide a GPIO
wire which can be programmed to interrupt the CPU when the soft
decision data has arrived.

So, I hooked up a GPIO line on the processor to the GPIO line on the
CC1120 and let it interrupt once per bit time, or 38400 times
per second, to read this valuable soft decision data out of the radio
part. According to TI support, this is the only way to make it work;
they generally recommend that developers use the packet-mode
hardware and throw away this valuable source of data.

I was happily surprised to learn that the STM32L can in fact survive
this interrupt rate, and by setting interrupt priorities
appropriately, I managed to reliably capture an entire packet's worth of
data without dropping a single bit.

I set up a simple packet dumping function to run on the prototype
board and captured raw encoded bytes from our CC1111-based flight
computer so that I'd have some "real" data to work from.

Soft-decision Viterbi Decoding

The sample encoding implementation from TI made that piece fairly
straightforward. Lacking a similar sample for the decoding piece meant
that I'd be starting from scratch, implementing something that
seemed like magic to me at the outset. I started by reading a few
articles on Viterbi decoding:

I found Chip Fleming's tutorial
very useful as it included a bunch of worked examples using a
convolutional encoder with constraint length (k) of 3 instead of 4;
this makes all of the trellis diagrams half the size.

I also sent my friend, Phil Karn, a query about
this process, and he responded with a bunch of helpful suggestions,
while also recommending that I just steal his
DSP and FEC library
implementation. Of course, I don't like using code that I can't
understand, and his code only implements much longer codes than I
needed, so I'd have to understand it anyway to make it work for
me. Phil's code is a marvel, but it's not exactly comprehensible to
someone who doesn't even know what a Viterbi decoder is supposed to do.

So, with Chip's tutorial on my screen, I wrote a primitive Viterbi
decoder. This did only the Viterbi step, and split that into two
phases. The first computed the cost for each state in the system for
every received pair of symbols received. It also tracked the chain of
states through the entire decode process:

Completely separate from this code were the interleaving and data
whitening stages. Keeping those separate allowed them to be tested
independently, which made debugging them far easier.

None of this code ever ran on the STM32L processor; it ran instead
in a test framework on my laptop. I first got it working with data
generated by the software convolutional encoder, then managed to get
it to decode the raw packets that I had recorded directly from the
receiver itself.

This marked a significant milestone in my mind—I'd managed to
replace, in software, the CC1111 FEC encoder and decoder. However, the
decoder was not yet fast enough to keep up with back-to-back
packets. Having something in hand that worked meant that further
development would be directly comparable to something that did work,
making optimization much easier. Yet another lesson in making it work,
and then making it work fast.

Optimizing the Viterbi Decoder

The first step was to reduce memory usage within the decoder. While
it's best to defer computing the path until the very last bit has been
received, it's also true that the chance of a new bit affecting old data
shrinks as the distance between the new and old data increases. With
the usual rule of thumb being a truncation depth of k × 6 or k × 7, I saved 32 output
bits for each state and wrote out the oldest 8 each time the buffer
filled up, keeping a distance of 24 bits between the newest saved data
and the current decoding position. That meant a fixed amount of
storage sufficed for an arbitrarily long packet.

I tested a variety of history sizes from 8 to 56 bits
(corresponding to 16, 32 and 64 bits of saved data for each
state). With 100000 random packets generated with Gaussian noise
added to each bit, here are the error rates for the different history lengths:

History Length    Total Packets    Correct           Incorrect
8 bits            100000           90520 (90.52%)    9480 (9.48%)
24 bits           100000           93614 (93.61%)    6386 (6.39%)
56 bits           100000           93620 (93.62%)    6380 (6.38%)

The run-time performance of the 8-bit history and 24-bit history versions
turned out to be essentially identical (given the 32-bit nature of the
STM32L, that shouldn't be too surprising). The
56-bit version, which manipulated uint64_t data types, was significantly
slower, and given the very modest accuracy benefit, deemed not worth it. If we
move to a larger constraint length (k > 4) or start using a
punctured convolutional encoding, we would want to re-measure this.

The next step was to combine the de-interleaving, decoding,
de-whitening and CRC computation into a single loop. This would
eliminate a pile of data copying and extra memory usage. Nothing fancy
here, but it made enough of a difference that I could decode most of
the incoming packets now, dropping only about 25% of them.

To get to being able to decode every packet, I needed to start
decoding the packet while it was being received. Because of
interleaving, I had to have 32 bits of data to be able to decode
anything, so it was a simple matter of having my per-soft-value
interrupt handler start the decoder going after each interleave block
of data was received. With this in place, I was able to decode a
76-byte payload (160 transmitted bytes) in 14.7 ms, or just about 50%
of the time it took to transmit the packet. Unrolling one of the inner
decoding loops eliminated a bunch of computation and sped this up even
more, decoding packets in 9.5ms.

Success at last: a solid 100% of packets received were being decoded,
as long as the CPU was otherwise idle.

Finally, I went and re-read Phil Karn's code and pondered over how it
works. It was opaque until I wrote a message back to Phil asking how
it worked and pointing out sections that were confusing. Fortunately,
simply phrasing my confusion in words made me understand how the code
works. Here's my original code for computing the per-state cost
metric:

Before this loop is run, I have to initialize cost[n] to a large
value so that the comparisons will be true for the first case to reach
each state. Phil's code, in contrast, doesn't have to initialize
cost[n] as we know which two states can reach any possible new
state. His code looks like this: