root is effectively the divide-by-two needed
to convert the root * 2 that has actually
been in root to the true root value.

I hand-compiled this function into PIC
assembly language, partially shown in Listing

2. Following the lead of the previous article on
square roots, I used macros extensively to encapsulate the 32-bit operations. For trial >>= 1
example, the ior32 macro actually requires root |= trial
eight instructions to perform a logical-OR on root^=trial
two 32-bit operands. This approach allows you
to see the close relationship between the C and
assembler versions of the program without arg -= root
getting caught up in the details of 32-bit arith- root+=trial
metic. The complete program listing with the
macro definitions can be downloaded from the root++
SERVO website ( www.servomagazine.com).

There is one optimization worth noting that is possible in
assembly language but not in C. In C, the loop exits when the
bit in trial has been shifted out and trial is now zero. In
assembler, we can use the fact that the trial bit was shifted
out into the carry flag. This saves us from testing the four
bytes of trial to see if they’re all zero.

In the PIC version, I have put the rounding code in a
conditional block marked with #if ROUND. I wanted to make
clear the optional nature of this section so it could be
omitted if not appropriate for a given application. It adds
very little execution time because it is out of the main loop,
but it does add almost 20% more code. It also has the effect
of allowing the result to exceed 16 bits, rounding it to
0x10000 if the input is big enough, so this must be taken into
consideration if a purely 16-bit result is expected.

Another option you’ll find in the full source code listing is
to choose to assemble the program for the PIC16 or for the
PIC18. This doesn’t show up in Listing 2 because it only affects
the macros. The PIC18 code is typically around 20% faster.

C Operators

Here is an explanation of the operators used in Listing 1:

1 << (INT_BITS - 1)

With INT_BITS set to 32, this shifts a “1” bit left 31
bits, producing the value 0x80000000.

I also compiled the C version of my routine for the AVR
using the free WinAVR compiler package. Assuming a 16
MHz clock, execution ranged from 34 to 37 microseconds.
Code size was 108 bytes.

In conclusion, this algorithm makes computing square
roots entirely practical in microcontroller-based robotic
projects, even in moderately fast service loops. No other
approach comes close to the performance and code size of
this algorithm. SV

√ Performance Results

I compared the performance of this square root routine
with the original one described in the November ‘06 issue. In
both cases, I used PIC18-optimized code. I switched off the
rounding code in my routine since the original didn’t round. I
used the Microchip MPLAB simulator with its stopwatch function to compute execution time, assuming a 20 MHz clock.

The execution time of my program ranges from 136 to
159 microseconds, for an average of about 150 microseconds. The variation is due to the different code paths inside
the main loop on whether the trial bit fits or not. Code size
is 134 bytes ( 22 bytes more when rounding is included).

The performance of the original square root program is
totally dependent on the size of the argument. With an input
of 500, it takes 155 microseconds; about the same as my
program. For smaller arguments, it would be faster — as little
as 12 microseconds — for an argument of 1. At the other
end, it can take as long as 445 milliseconds (close to a 1/2
second!). Code size is 100 bytes.