ARM code inverse square root routines

Because of the complete lack of a FAQ for the newsgroup
comp.sys.arm, and because some questions
come up again and again, I decided to archive an old discussion on
inverse square root routines for the ARM processor on the web where people may
be able to find it.

I have not explicitly obtained permission from the posters to reproduce
their postings. Anyone who objects should mail me at the address given at
the end of this web page.

I have felt free to edit the postings to avoid repetition, to
change the formatting and to cut out anything I deemed irrelevant.
The order of the routines in this document is more or less the order
in which they were posted in this thread. Some of the routines are
better than others.

The start of the thread

Torben Mogensen posted an article which
explains why calculating 1/sqrt(x) is interesting and which also gives his
method of doing it. It's probably simplest to give his article verbatim.

[2012-06-07:
ManuTOO
pointed out to me that the original article appears to contain an error. This
is marked, and corrections suggested, in text highlighted in the same manner
as this paragraph]

When doing graphics, e.g. raytracing, it is often necessary to
normalize vectors, i.e. given (x,y) to calculate (x/l,y/l) where l =
sqrt(x^2+y^2), or similar for (x,y,z) when in 3D. This is quite
expensive to calculate on the ARM since there is no built-in divide
or square root. The usual procedure is to first calculate l using a
software square-root and then use software divide two or three times
afterwards or to compute 1/sqrt(...) once and multiply by this,
which is faster when you have hardware multiply (especially on the
StrongARM). However, 1/sqrt(k) is normally (to my knowledge) done
using first a software square root and then a software divide. My
idea is that you can do 1/sqrt(k) faster than this.

I assume that k is normalized so that 0.25<=k<1, as is usual for square
root routines. This means that the result satisfies 1<r<=2.

We must find r = 1/sqrt(k). This is the same as solving k*r^2 = 1.
We now try to do this by a binary search:
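Torben's posted search loop does not survive in this archive, but the idea he describes can be sketched in C (my untested reconstruction, not his code): start from r = 1 and, for each i, keep the increment 0.5^i whenever k*(r+0.5^i)^2 stays at or below 1.

```c
/* Plain binary search for r = 1/sqrt(k); a sketch of the method
   described above, with a full square and multiply on every step. */
double inv_sqrt_bsearch(double k)   /* assumes 0.25 <= k < 1 */
{
    double r = 1.0;
    double step = 0.5;              /* 0.5^i */
    for (int i = 1; i <= 30; i++) {
        double t = r + step;
        if (k * t * t <= 1.0)       /* expensive: squaring and multiply */
            r = t;
        step *= 0.5;
    }
    return r;                       /* 1 < r <= 2 */
}
```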

Now, the squaring and the multiplication are expensive, so this is
not a good way of doing it. We may, however, do these operations
incrementally, using
k*(r+0.5^i)^2 = k*r^2+k*0.5^(2*i)+2*k*r*0.5^i.
So we use variables kr2 to hold k*r^2 and kr to hold 2*k*r.
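In C, the incremental version might look like this (again my reconstruction, using the kr2 and kr variables just described): the candidate k*(r+0.5^i)^2 is built from the stored products, so each step needs only additions and scalings by 0.5^i.

```c
/* Incremental binary search: maintain kr2 = k*r^2 and kr = 2*k*r so
   the candidate k*(r+s)^2 = kr2 + k*s^2 + kr*s needs no fresh square. */
double inv_sqrt_incremental(double k)   /* assumes 0.25 <= k < 1 */
{
    double r   = 1.0;
    double kr2 = k;         /* k * r^2 with r = 1 */
    double kr  = 2.0 * k;   /* 2 * k * r with r = 1 */
    double step = 0.5;      /* 0.5^i */
    for (int i = 1; i <= 30; i++) {
        double t = kr2 + k * step * step + kr * step;
        if (t <= 1.0) {
            r   += step;
            kr2  = t;               /* updated k*r^2 */
            kr  += 2.0 * k * step;  /* updated 2*k*r */
        }
        step *= 0.5;
    }
    return r;
}
```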

[2012-06-07: If the code has been transformed using the
substitution q = 0.5*r then, since the previous code was initialised with
r = 1.0 we'd expect this code to be initialised with q = 0.5. The fix is
simply to alter the initialisation to
q = 0.5; kq2 = 0.25 * k; kq = 0.5 * k.]
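With the corrected initialisation, the q = 0.5*r version can be sketched as follows (the loop body is my reconstruction): since k*r^2 = 1 becomes k*q^2 = 0.25, the comparison is against 0.25, and the result r is recovered as 2*q.

```c
/* Binary search on q = 0.5*r with the corrected initialisation:
   q = 0.5, kq2 = k*q^2 = 0.25*k, kq = k*q = 0.5*k. */
double inv_sqrt_q(double k)   /* assumes 0.25 <= k < 1 */
{
    double q   = 0.5;
    double kq2 = 0.25 * k;    /* k * q^2 */
    double kq  = 0.5  * k;    /* k * q   */
    double step = 0.25;       /* q's increments are half of r's */
    for (int i = 1; i <= 30; i++) {
        double t = kq2 + k * step * step + 2.0 * kq * step;
        if (t <= 0.25) {      /* k*q^2 <= 0.25 <=> k*r^2 <= 1 */
            q   += step;
            kq2  = t;
            kq  += k * step;
        }
        step *= 0.5;
    }
    return 2.0 * q;           /* r = 2*q */
}
```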

When making this into ARM code, we assume numbers in 0<=x<1 are
represented as integers holding x*2^32. We unroll the loop to avoid
register-controlled shifts.

[2012-06-07: The article later warns that this ARM code
is untested and so may contain errors. As well as the initialisation error
already discussed, there appear to be three other issues: 1) the second
ADD in the calculation of t is missing a parameter
— it should be ADD R4,R4,R0,LSR#n; 2) compared to the
earlier code, the comparison has been inverted, <= has been
encoded as condition code HS, I think it should be
LS; 3) the original code also uses 2^n for constants
— I suspect most assemblers would prefer 1<<n. I'll
quote the original article first and then my suggested, but untested, fix.]
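Since the ARM listing itself is not reproduced here, a C model of the fixed-point scheme may help: values in [0,1) are held as 32-bit integers scaled by 2^32, and multiplying a scaled value by 0.5^i is simply a right shift by i. This is my own illustration of the representation, not a translation of the missing listing.

```c
#include <stdint.h>

/* Fixed-point model of the q = 0.5*r search. k holds x*2^32 for
   0.25 <= x < 1; the result holds q = 0.5/sqrt(x), scaled by 2^32
   (r = 2*q itself would not fit in 32 bits). */
uint32_t inv_sqrt_fixed32(uint32_t k)
{
    uint64_t q   = (uint64_t)1 << 31;   /* 0.5, scaled by 2^32 */
    uint64_t kq2 = (uint64_t)k >> 2;    /* 0.25*k = k*q^2 */
    uint64_t kq  = (uint64_t)k >> 1;    /* 0.5*k  = k*q   */
    const uint64_t quarter = (uint64_t)1 << 30;   /* 0.25, scaled */

    for (int i = 2; i <= 31; i++) {     /* try adding s = 0.5^i to q */
        /* k*(q+s)^2 = kq2 + 2*kq*s + k*s^2; the k*s^2 term shifts
           to zero once 2*i >= 32, matching the remark below about
           dropping the smallest component */
        uint64_t t = kq2 + (kq >> (i - 1)) + ((uint64_t)k >> (2 * i));
        if (t <= quarter) {
            q   += (uint64_t)1 << (32 - i);
            kq2  = t;
            kq  += (uint64_t)k >> i;
        }
    }
    return (uint32_t)q;
}
```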

Note that the component 0.5^(2*i+2) of the sum becomes less than the
precision after the first 14 iterations, so we omit it. Hence, the
time used is 3+14*6+16*5+3 = 170 cycles, not counting call/return.
This is only slightly more than a square root, and certainly less
than a square root followed by a divide. Furthermore, the code takes
up less space (in the cache) than a fully unrolled divide plus a
fully unrolled square root. I don't expect the precision to be any
worse, as you don't accumulate errors.

Note that the code is not tested, so there might be small errors.
I'm sure the general principle is fine, but I may have miscalculated
some of the shift-counts in the assembly code etc.

Look-up tables

Colin Hogben made a suggestion about using
a look-up table. He said that the TMS320C40 DSP chip has a single-cycle
reciprocal square root instruction for just this purpose. It gives 8 bits
of mantissa accuracy by using a 512-entry look-up table. The result can be
made more accurate by applying the iteration

r[n+1] = r[n] * (1.5 - k/2 * r[n] * r[n])

which doubles the number of bits of accuracy each time.
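The quoted iteration is straightforward to sketch in C. The seed argument below stands in for the table lookup; starting from even a crude seed such as r = 1.0, a handful of steps converge, and a seed accurate to 8 bits would need only two or three.

```c
/* Newton-Raphson refinement of r ~ 1/sqrt(k), as quoted above:
   each step r' = r*(1.5 - k/2 * r^2) roughly doubles the number of
   correct bits. */
double refine_rsqrt(double k, double r, int steps)
{
    for (int i = 0; i < steps; i++)
        r = r * (1.5 - 0.5 * k * r * r);
    return r;
}
```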

Colin hadn't worked out the details of the time and memory usage of an ARM
version of this method, but suggested that code and table space usage would
be comparable to Torben's unrolled loop, and that this method should
`yield a result more quickly'. Sadly, he left the implementation details
to any interested reader.

Wilco Dijkstra also suggested
the look-up table method might be useful when fewer than 32 bits of accuracy
were required. He commented that he used 128-entry look-up tables with
interpolation to calculate 16-bit fixed point sines, cosines and square roots.
He said that this method gave him about 15.5-bit precision in 22 ARM710
cycles; he reckoned this would be about 13 StrongARM cycles including the
return. He estimated that a 512- or 1024-entry table might give 20 bits of
precision.

Combined methods

In a separate post,
Wilco Dijkstra summarised
the available methods and their times for doing a 2 dimensional normalisation.
I've added the times for a 3 dimensional normalisation in square brackets. To
remind you, x and y [and z] are two [three] components of the vector, l is the
current length, that is, l = sqrt(x^2+y^2[+z^2]), and we want to calculate
x/l and y/l [and z/l]. All these methods assume that we've already calculated
x^2 + y^2 [+ z^2]. This is referred to as k.

This looks much worse than the simultaneous approach for an ARM710.
However, on an ARM7DM or a StrongARM where full 32 bit multiplies are
fast, the multiplies only take 6 cycles each. This reduces the time to
5*32+2*6=172 cycles [178 cycles] neglecting any fix up code.

This method assumes a modified 5 cycle/bit 1/sqrt calculation which
Wilco went on to describe.

It's clear that approach 4 wins for an ARM710 which is the only case that
Wilco considered in detail. I think that on a chip with a fast long multiply
instruction, the calculations would have to be redone.

Approach 4 relies on a simultaneous calculation described by Wilco. Wilco
first modified Torben's routine to use 5 cycles/bit instead of the 5.5
Torben's original needed. He also altered the entry conditions. For the
original routine, k had to satisfy 0.25<=k<1; Wilco suggested
0.25<k<=1, as that guarantees that 1/l is less than 2, allowing
31-bit fixed point calculations to be used instead of 30-bit.

What value you choose for K depends on how much code you want to save. Wilco
suggested that for a 27 bit version you could take K=8 whereas for a 31 bit
version, take K=7 or 14. He added that a fully unrolled loop (640 bytes)
makes little sense on a cached system: on an ARM7, for example, one cache
miss would add a 20-cycle penalty, while with K=14 the average penalty
would be half as big.

Finally, Wilco delivered the modified version of the code that calculates
x*1/l and y*1/l simultaneously (I've added code for z*1/l).