A detailed tutorial on speeding up AVR division

[Alan Burlison] is working on an Arduino project with an accelerometer and a few LEDs. Having the LEDs light up as his board is tilted to one side or another is an easy enough project a computer cowboy could whip out in an hour, but [Alan] – ever the perfectionist – decided to optimize his code so his accelerometer-controlled LEDs don’t jitter. The result is a spectacular blog post chronicling the pitfalls of floating point math and division on an AVR.

To remove the jitter from his LEDs, [Alan] used a smoothing algorithm known as an exponential moving average. This algorithm relies on multiplication and is usually implemented with floating point arithmetic. Unfortunately, AVRs don’t have hardware floating point, so [Alan] turned to fixed point arithmetic – a system similar to balancing your checkbook in cents rather than dollars.
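As a rough illustration of the idea (not [Alan]’s actual code – the function names and the choice of 4 fractional bits here are hypothetical), an exponential moving average can be carried in a scaled integer so that the per-sample update needs only adds and shifts:

```c
#include <stdint.h>

/* The running average is kept scaled by 16 (4 fractional bits), so
 * integer math stands in for avg = avg*(15/16) + sample*(1/16).
 * Note: right-shifting negative values is implementation-defined in C;
 * this sketch assumes the usual arithmetic shift. */
static int32_t ema_update(int32_t avg_q4, int32_t sample)
{
    int32_t sample_q4 = sample << 4;              /* scale the new sample */
    return avg_q4 + ((sample_q4 - avg_q4) >> 4);  /* alpha = 1/16, via a shift */
}

static int32_t ema_value(int32_t avg_q4)
{
    return avg_q4 >> 4;  /* back to raw sensor units */
}
```

Feeding a steady stream of identical readings converges on that reading, while a single outlier only nudges the average by a sixteenth of the difference – which is exactly the jitter-smoothing behavior described above.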

With a clever use of bit shifting to calculate the average with scaling, [Alan] was able to make the fixed point version nearly six times faster than the floating point implementation. After digging into the assembly of his fixed point algorithm, he was able to push it to ten times faster than floating point arithmetic.

The takeaway from [Alan]’s adventures in arithmetic is that division on an AVR is slow. That’s not very surprising once you realize the AVR doesn’t have a division instruction. Of course, sometimes you can’t get around having to divide, so multiplying by the reciprocal and using fixed point arithmetic is the way to go if speed is an issue.
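For the common divide-by-ten case, the reciprocal trick looks something like this sketch (the helper name `div10` is ours; the constant 0xCCCD is the widely used scaled reciprocal ≈ 2^19/10 for 16-bit operands):

```c
#include <stdint.h>

/* Divide a 16-bit unsigned value by 10 without a division instruction:
 * multiply by a scaled reciprocal (0xCCCD ~= 2^19 / 10), then shift
 * right by 19.  The 32-bit intermediate cannot overflow, since
 * 65535 * 0xCCCD fits comfortably in 32 bits. */
static uint16_t div10(uint16_t x)
{
    return (uint16_t)(((uint32_t)x * 0xCCCDu) >> 19);
}
```

On an AVR the 16×16 multiply compiles down to a few hardware MUL instructions plus shifts, which is far cheaper than the library division routine.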

Sure, squeezing every last cycle out of an 8 bit microcontroller is a bit excessive if you’re just using an Arduino as a switch. But if you’re doing something with graphics or need very fast response times, [Alan] gives a lot of really useful tips.

That’s the whole point – by choosing reciprocals whose denominators are a power of two, and by choosing fixed-point scaling factors that are also a power of two, the divisions and some of the multiplications can simply be replaced with shifts. The remaining multiplications can be done with the hardware MUL instruction. And in some cases the operations will even partially cancel out, which is why the new sample value is just shifted left by 1 place: scaling (multiplying) by 32 and then multiplying by 1/16 is the same as multiplying by 2, i.e. one left shift.

There are a lot of fast algorithms to calculate the reciprocal of a number without using a division. A fast and easy-to-implement example is the Newton-Raphson algorithm. I have used it lots of times when working with fixed point DSPs.
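A minimal fixed-point sketch of that iteration, x′ = x·(2 − d·x), for inputs normalized to [0.5, 1) in Q16.16 (the seed constants come from the classic 48/17 − 32/17·d starting estimate; the function name and the use of 64-bit intermediates are our choices for a self-contained demo – on an AVR you would pick narrower types and fewer fractional bits):

```c
#include <stdint.h>

/* Newton-Raphson reciprocal in Q16.16 for d in [0.5, 1).
 * Each iteration roughly doubles the number of correct bits. */
static uint32_t recip_q16(uint32_t d_q16)
{
    /* Seed: x0 = 48/17 - (32/17)*d, here 185043 and 123362 in Q16.16. */
    uint32_t x = 185043u - (uint32_t)(((uint64_t)123362u * d_q16) >> 16);
    for (int i = 0; i < 4; i++) {
        uint32_t dx = (uint32_t)(((uint64_t)d_q16 * x) >> 16); /* d*x        */
        uint32_t e  = 131072u - dx;                            /* 2 - d*x    */
        x = (uint32_t)(((uint64_t)x * e) >> 16);               /* x*(2-d*x)  */
    }
    return x;  /* ~ 1/d in Q16.16, within a couple of LSBs */
}
```

Four iterations from that seed are enough to pin the result down to the last bit or two of a Q16.16 value.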

Why doesn’t the existing software floating point library in avr-gcc simply calculate the reciprocal using Newton-Raphson and multiply by it? Or are there situations where that’s not faster, or would violate IEEE 754?

For exponentially weighted moving averages, you’re going to be running a very large number of cycles of avg = avg*p + measurement*(1-p), so you want to think about roundoff error accumulation, and about doing calculations upfront whenever possible instead of doing them for each measurement. So using division to calculate a reciprocal once and then using multiplication each time is going to be a win.
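A small sketch of that hoisting, in plain float for clarity (the function name and the `period` parameterization are illustrative, not from the article):

```c
/* EMA over a buffer of samples: the two weights are computed with one
 * division up front, then every sample costs only two multiplies. */
static float ema_run(const int *samples, int n, float period)
{
    float q = 1.0f / period;        /* the only division, done once */
    float p = 1.0f - q;
    float avg = (float)samples[0];
    for (int i = 1; i < n; i++)
        avg = avg * p + (float)samples[i] * q;
    return avg;
}
```

Because p + q is exactly 1, a constant input stream is a fixed point of the update, so round-off does not drift the average on steady data.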

If you’re using fixed-point calculations, you want your base to be a power of 2, not 10, so you can use shifting instead of division, and so you get precise answers instead of repeating decimals. For instance, instead of avg*95/100 + measurement*5/100, you may want to do avg*122/128 + measurement*6/128, as (avg*122+measurement*6)>>7, or maybe (avg*122+measurement*6+64)>>7 so you get the round-off errors to balance.
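That last expression, written out as a compilable sketch (the helper name is ours):

```c
#include <stdint.h>

/* avg*122/128 + measurement*6/128, with a half-step (+64) added before
 * the >>7 so truncation rounds to nearest instead of always down.
 * The weights 122 + 6 sum to 128, so a constant input is a fixed point. */
static int32_t ema_step(int32_t avg, int32_t m)
{
    return (avg * 122 + m * 6 + 64) >> 7;
}
```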

If you’re after speed, why not use a simple rolling average with a 2^n sized stack? Then there’s no need for MUL (granted, a 2-cycle MUL is slick) or DIV at all. Assuming unsigned values, just ADD, SUB and LSR. Use a stack pointer so you don’t have to roll the whole stack each time. The only problem with this method is that you get an initial lag of roughly stack size ÷ sample rate in the result while the stack fills, whereas the EMA method gives you somewhat noisy data initially. In theory the lag seems like a problem, but in practice it rarely (for me) is.
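One way to sketch that rolling average (the struct layout, names, and 8-sample window here are our illustrative choices): keep a running sum over a power-of-two circular buffer, so each update is one subtract, one add, and a shift.

```c
#include <stdint.h>

#define RA_SHIFT 3
#define RA_SIZE  (1u << RA_SHIFT)   /* power-of-two window: 8 samples */

typedef struct {
    uint16_t buf[RA_SIZE];  /* the "stack" of recent samples        */
    uint8_t  head;          /* index of the oldest sample           */
    uint32_t sum;           /* running sum of everything in buf     */
} RollAvg;

static uint16_t ra_update(RollAvg *ra, uint16_t sample)
{
    ra->sum -= ra->buf[ra->head];            /* drop the oldest sample */
    ra->sum += sample;                       /* add the newest         */
    ra->buf[ra->head] = sample;
    ra->head = (uint8_t)((ra->head + 1) & (RA_SIZE - 1));  /* wrap    */
    return (uint16_t)(ra->sum >> RA_SHIFT);  /* divide by 8 via shift  */
}
```

The AND-mask wraparound is why the window size must be a power of two, and the shift at the end is the division the commenter is avoiding.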

@Cyril, I’ve used rolling average (RA) in the past but in this case it didn’t work very well. The raw sensor values tend to be longish sequences of one value, then of another. If I make the RA stack long enough to kill the jitter it also kills the responsiveness to movement. EMA kills the jitter without killing responsiveness to movement.

As I read it, the point is that division of ANY kind sucks, regardless of whether it’s fixed or floating point. If you read the article, “avg = val * 0.1 + avg * 0.9;” came out faster than the fixed-point “v2 = ((val … >> 5;” version because of the two divisions in there. I agree that simply weighting the terms in advance and then dividing should probably be investigated too, to see what benefits that provides.

I came across this site a while back. It’s pretty handy for finding “good enough” fractional equivalents to floating point numbers: http://www.mindspring.com/~alanh/fracs.html
It takes any decimal number in and gives you various fractional versions of that number pretty much instantly, at varying levels of accuracy. I’ve also used it to avoid overflows in my code (for example, when setting up an audio IC one time I needed to multiply an already large number by 1048576/48000, and this shrank it down to 8192/375 for me, which avoided the overflow).
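That particular shrink is just Euclid’s algorithm reducing the ratio to lowest terms – a quick sketch (the tool linked above also finds *approximate* fractions, which this snippet doesn’t attempt):

```c
#include <stdint.h>

/* Greatest common divisor via Euclid's algorithm; dividing both sides
 * of a ratio by it gives the smallest exact equivalent fraction,
 * e.g. 1048576/48000 -> 8192/375. */
static uint32_t gcd_u32(uint32_t a, uint32_t b)
{
    while (b != 0) {
        uint32_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}
```

Smaller numerator and denominator mean the intermediate product stays within a narrower integer type, which is exactly the overflow the commenter was dodging.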