SQRT

R&D: Square Root in Assembly |Part II

This is the second part of the square root algorithm. It was
developed in one month during the final stages of finishing
this web site and reviewing this document for publication.

While Part I dealt with an empirical discovery for a sequential
algorithm to find the square root, and its subsequent mathematical
proof, Part II deals with a binary search variation that greatly
improves the root's calculation speed (even though the sequential
search algorithm is the smallest you'll find, if size is what
matters to you). Both algorithms have the same accuracy.

NOTES: The "official" address for this document is
http://www.pedrofreire.com/sqrt.
Please do not link directly to the address that shows up in
your browser — this may be changed without notice. Also note that at
the time I write this (August 2000) I am not aware of these algorithms
being in use (at least the binary search version), but I cannot
confirm this.

Binary Search

A binary search works pretty much like searching for a word in a
dictionary. When looking for "Green", you don't open the dictionary
in the first page and start looking for your word throughout.
Instead you take advantage in the knowledge that the words are
ordered alphabetically.

Here's how you would describe the search to someone else: First open
the dictionary in half, and look for the first word you find. Compare
it to the word you're looking for. If it's a match, you found it.
Otherwise divide one of the two halves of the dictionary in half again
(the right half if the word you're looking for is after the one you
found, the left half if it's before, alphabetically), and repeat the
process.

Remember the game "Simon Says" from childhood? "Simon Says: Guess the
Number". "I'm thinking of a number between 1 and 100 — guess what it
is in under 10 attempts". "I can only tell you if your guess is
above or bellow the number I'm thinking of".

The best strategy to win at this game is exactly the same. You start
by guessing 50, and then go on to guess half of that, then half
and so on, based on the answers you get. If you do this you will
always get the number in under 8 attempts! This is called
a binary search.

For the square root algorithm, we can do a similar job (because both
the square root and raising to the square are continuous growing
operations, and are therefore "ordered"): first start at half the
maximum root that can be calculated. For n-bit numbers, this is

`sqrt((2^n)-1)/2 ~~ sqrt(2^n)/2 = 2^(n/2-1)`

For example, for n=16, half of the maximum root is 128. Depending
on how the original value to root, x, compares to the square of 128,
you cut one of the two halves in half again. One-quarter of the maximum
root (or half of 128) is 64, so if you should select the lower half,
you compare x with the square of 64, if you select the upper half,
you compare x with the square of 192.

Since 64 is precisely in the middle of the lower half, it can be obtained
by either 0+64 or 128-64, and since 192 is precisely in the middle of the
upper half, it can be obtained by either 128+64 or 256-64 (i.e., you get these
values by adding one-fourth of the maximum root to the bottom edge of the
half, or subtracting it from the top edge of the half, as can be seen in the
following graphic).

When you want to split the next quarter in half, you'll need to use one-eighth
of the maximum root (32). On the next step, one-sixteenth (16) and so on.
The easiest way to get the value (half of the section you want to move
to) from these maximum root partial values, is to add them to the current
value (if you want the upper section), or subtract them from it (if you
want the lower section). For instance, if you had moved to 192, and x was
now lower than the square of 192, then the next value you want to compare
x with is the square of 192-32, or 160.

All this is very well, and the square of the first value is relatively easy
to calculate — after all,

`(2^(n/2-1))^2 = 2^(n-2)`

and for each of the next halves (for instance if the section selected
is always the lower section),

`(2^(n/2-2))^2 = 2^(n-4)`

`(2^(n/2-3))^2 = 2^(n-6)`

`(2^(n/2-4))^2 = 2^(n-8)`

…

`(2^(n/2-k))^2 = 2^(n-2k)`

which means that the square of these powers of 2 can be calculated
consecutively by shifting the first square twice to the right, just
as their square root (the halves of the maximum root) can be
calculated by shifting the first half to the right by one. But what
if we need to select an upper section? We'll get to that.

So far we only have integer shifts, additions, subtractions and
comparisons, which is good for our purposes. But the entire method
is not yet done. We have a way to approximate the roots, but not
to approximate the squares. For instance, if you change 128 to
192 by adding 64, what do you do with their respective squares?
Do you just calculate a square by doing a multiplication?

Well, not necessarily (which is good since that would be slow!).
You see, if a is the current root search value (192 for
example), and b the maximum root half you want to add or
subtract from it (32 in this example), what you want to find is

`(a+-b)^2`

and you have

`(a+-b)^2 = a^2 +- 2ab + b^2`

which means that the offset

`+-2ab + b^2`

should be added to the current square, where the sign of 2ab
is identical to the operation performed on the roots (addition /
subtraction).

Here we have 3 multiplications, but let's
take a look at them: the first is a multiplication by 2, which can
be replaced by a shift operation. Then you multiply by b which
is itself a power of 2, so this multiplication can also be replaced
by a shift operation. Finally, b² is easy to calculate
exactly because b is a power of 2, as we have just seen!

In other words, the offset is actually

`a * 2^(n/2-k+1) + 2^(n-2k)`

which, for an even n and integer k>0, always
returns positive integer exponents of 2, which means all
operations are fast shifts and additions, not multiplications!
This also proves mathematically the steps used in the method.

Here, the variable i has a similar function to the
above formulas' k, though they don't match in terms
of values. The last 4 lines of code are a product of the
analysis made in Part I of
this document. Upon entry to the code, x should have
the value to root, and upon exit r will have the
integer result rounded to the nearest integer.

What you can see right away is that instead of the old sequential
algorithms that took

`sqrt(x)-1`

to

`sqrt(x)`

iterations to finish, which means that the number of iterations
ranged from

`0 <= sqrt(x)-1 <= 2^(n/2)-1`

to

`0 <= sqrt(x) <= 2^(n/2)`

the binary search has a fixed number of iterations of

`(n/2)-1`

This is a huge improvement over the prior algorithms (though they
are still faster for a very small range of roots). For instance,
even for a low value of n (say n=16), the other
algorithms took anywhere between 0 and 256 iterations to find the
root, whereas this one takes exactly 7 iterations to find that
same root!

We don't use x==x2 as a loop termination condition
for the same reason since it only improves the root calculation
speed of

`2^(n/2-1) - 1`

values of x and makes it worse for all others!

The next step in optimisation asks for your focus on shifts.
See how the only use of r in the loop is shifted left
by an ever-decreasing number of digits, and it is updated by
a value (h) that is shifted right by 1 on each
iteration? If you were to start the loop with r fully
shifted left and shift it right by 1 on each iteration, h
would have to be shifted right by 2 (the same direction and
amount of sq's shift). See how h and i are
not used anywhere else in the code?

This allows us to build the optimised C code version of this loop,
where h no longer has the meaning of "root half", but rather
of "helper" value that merges the old code's h and sq.

which means that this code can calculate a 32-bit square root
(n=32) in 191~219 clock cycles, and a 16-bit square root
(n=16) in 95~107 clock cycles, which is 14% to 30%
faster than hard-sqrt!

Hardware Version

If implemented in hardware, this algorithm could be parallelised,
which means even more performance. Both halves of the loop would
be calculated in parallel, and then the proper result selected
in a multiplexer from the result of the initial comparison.
Shifts don't take any time because they can be implemented by
connecting the output bits of an operation circuit to the input
bits of the register, already shifted.

As long as there is enough time within one clock cycle to
propagate the electrical signals from one addition / subtraction
circuit, one multiplexer and another addition circuit, the loop
can be implemented in 1 clock cycle. If not, using
clock-synchronised multiplexers and doing some changes, should
allow you to do the loop in 2 clock cycles, at least.

Let's see a diagram of the former circuit:

This diagram only shows the part of the square-root circuit that
runs the loop — it doesn't show prologue or epilogue code.
Inputs that have a note of >>1 or >>2
have the bits of their register word connected in such a way that
they implement the corresponding shift operation before entering
that circuit. For example if you connect output bit u to
input bit u-1 for all bits of that word, and connect the
most-significant input bit to 0, you're implementing a shift right
by 1 between the two circuits.

All clocks pins of all registers (clk)
should be connected to the same common clock. The loop is over
(and EBX holds the almost-ready root), when the output
pin zero is set, indicating ECX=0. The two
multiplexers return value a if sel=0 (false), and
value b if sel=1 (true). Finally, the adder
circuit that implements EBX+ECX and the right-most adder
circuit can both be replaced by n-bit OR circuits.

Do note that even though the algorithm is highly optimised, the
circuit that implements the algorithm is far from it. For instance,
circuits that do both an addition or subtraction based on a sel
signal could eliminate the two ADD/SUB/MUX pairs
and replace each for a single ADD+SUB circuit. In fact, the
ADD/SUB/MUX pair that implements
EBX±ECX can be replaced by an OR with
ECX and a specialised XOR with the sel signal.
An optimised comparison circuit will give the result EAX<EDX?
much quicker than a SUB circuit, and so on. The diagram shown
here is an illustration of one possibility, not an optimisation.

Floating-Point

This easily extends to floating-point (real numbers).
Floating-point IEEE numbers are stored in the format

`+-"mantissa" * 2^"xpon",`

`0 <= "mantissa" < 2,`

`|"xpon"| in NN_0`

where xpon is the exponent and mantissa... the mantissa!
If the exponent has a special value named the "denormal" value,
the mantissa is between 0 (inclusive) and 1 (exclusive). If the
exponent has any other value (a "normal" value), the mantissa is
between 1 (inclusive) and 2 (exclusive). This means the left-most
(most significant) bit can be determined by the exponent, which
is why most floating-point formats do not code this bit.

The mantissa can be made into an integer by multiplying it by a power
of 2, which is the same as subtracting the exponent of that power from
xpon. This reduces the problem to calculating the square
root of an integer (we already know how to do this), and dividing
xpon by 2 (trivial).

`sqrt("mantissa" * 2^"xpon") = sqrt("mantissa") * 2^("xpon"/2)`

In order to return n bits of significant digits, you need
2n digits in the mantissa to start with and its
most-significant-bit set (1), so you will need to multiply it by yet
another power of 2. This and other little things around this (returning
the mantissa into a value between 0 and 2, handling an odd xpon
before its division by 2, handling an odd number of bits in the
mantissa — some registers are set-up to

`2^(n/2)`

and similar values which are not integer if n is odd! —
and so on), are what complicates this process.

Since it is beyond the scope of this document, I have written a test
program in ANSI C to describe and test two slightly different
algorithms that calculate a floating-point square root (these algorithms
are best implemented in hardware, and the test program has been designed
to ease the understanding of how to lay out the algorithms in that medium).
You can download this program (C source only) at the end of this section.

The first of these algorithms is slightly faster and takes less circuitry
to implement (which means it is cheaper), but returns values that are
not precise, but rather have a small measurable, quantifiable error.
This should be more than adequate for graphical applications, for
instance. Such algorithm can be implemented in hardware using the
following diagram:

The same comments that explained the integer algorithm's hardware
implementation also apply to this diagram. Here, circuit names that
end in *2 refer to circuits with (or that operate on) 2n
bits instead of just n bits. Notes of hi mean the top
n bits of such circuits, whereas low mean the bottom
n bits. For instance, one input to the right-most adder circuit
are the top n bits of ECX*2 shifted right by 1. The
loop ends when the top n bits of EBX*2 (signal
hi_zero) are 0. low(EBX*2) will then hold the result's mantissa
and xpon its exponent (which was properly modified prior to
loop start). To understand the need for xpon's counter, read
the code's comments. The output of the comparison SUB circuit
need not be complemented if you swap the a and b entries
of the multiplexers, but the diagram was left this way so that the
only structural change to the integer diagram is in this comparison.

The second algorithm will take some more circuitry, but it is
not more complex. It will return exact values for the square roots.

As you can see, now nearly all circuits are 2n bits in size, which
makes it nearly twice as expensive as the integer algorithm, and about
30% more expensive than the first floating-point algorithm already
described. It is, however, the preferred circuit to implement in FPUs
(floating-point units), due to its precision. This loop ends when
ECX*2 is 0 or 1 (signal zero_one), i.e., when all of its
bits except for the least-significant-bit are 0.

The test program can be compiled to test any of these two algorithms
in just about any ANSI C compliant machine, with just about any type
of floating-point format (please read the section on restrictions at
the head of the code). It will be able to be compiled in many forms,
to test for a variety of floating-point formats and algorithm
variations. It can test single values (displaying the algorithms'
progress) or entire ranges of values for empirical proof of
accuracy. You should take a look at the code of function
gotest_real() to see both algorithms' implementations and
comments.

Please read the comments at the head of the code for more details and
conclusions (for instance, for 32-bit IEEE floating-point numbers, the
hardware algorithm would be nearly 30% faster than a latest-generation
(in August/2000) Intel Itanium).

Version 1.00
ftest.zip (21kb)

Root of any Index

When using the binary search algorithm for a root of index i, the
root values taken at each iteration (in the first example, 128 followed by 64
or 192, followed by 32 or ..., and so on) are not the same. The first
value to test for the root (we'll call it h) is half of the
maximum root we can generate using n-bit numbers, or

`h = root(i) (2^n-1) / 2`

which is very close to

`h = 1/2 * root(i) (2^n)`

`h = 2^(n/i-1)`

If this value is not integer, you need to use a double h
instead of int. r and x2 in the C code
described bellow, follow the type of h. For n=16
and i=2 (the example we've been using), h=128,
as expected.

This value is still divided by 2 in each iteration of the loop, but
unlike what happened with i=2, where h was always
evenly divisible by 2, we cannot guarantee that now, so we need to
use h/=2 and double h instead of h>>=1
in those cases.

Updating the value of x2 within the C loop is no longer
the same, as you may expect. But the problem is not complex, though.
If you call XA(r,h) to

`(r+h)^i - r^i`

and XS(r,h) to

`(r-h)^i - r^i`

then you can add these to x2 when adding or subtracting h
from r, respectively. For instance, for i=2,
XA(r,h)=h*(2*r+h), and for i=3,
XA(r,h)=h*(3*r*(r+h)+h*h).

Finally, we need a loop termination condition. We could use
h<0.001 or some other low value, but we would need to
fine-tune it to each index. We could also use XA(r,h)<0.5
but that will force too many iterations. The solution found was to
terminate the loop when a) the loop "side" (addition/subtraction)
changed, and b) r rounded to the nearest integer did not
change as the side changed. This means the root is between the
two previously tested squares, and equals r rounded to the
nearest integer. This allows us to dismiss the epilogue code (which
is not good in terms of speed, but does simplify the code), but does
have a drawback — the minimum and maximum values for the n-bit
range will cause an infinite loop because the loop will continuously
approach them, but never leap over them. They should be treated as
special cases.

This code will only work on 0<x<ULONG_MAX (i.e.,
positive numbers except the special cases of 0 and ULONG_MAX).
How efficient is the code? It depends on your ability to optimise code.
Please take a look at the sequential search algorithm's
Root of any Index
section in Part I for more details and example. I will not expand
this issue.

Further Work

The square root is often used in formulas (especially 3D formulas)
scaled (multiplied) by a constant, or another value. Because the
binary-search algorithm iterates through each bit of the result and
therefore is similar to the operation of a dumb multiplication
circuit, one could alter the first (or second) floating-point
diagram shown so that a multiplication is done in parallel with
the square root calculation, and an integer result is returned (or
floating-point, if you model the circuit to it) with very little
loss of accuracy (if any).

This is quite feasible if you think about it: as EBX is being
built on any variation of the algorithm, the bit represented by
ECX will always become set on EBX, regardless of whether
you add or subtract ECX from it, but the bit prior to that
(and only that bit) may be changed (if a subtraction was
used) from 1 to 0. The parallel multiplication may then follow that
prior bit in each cycle, instead of the bit in ECX. I will
not get further into that, however — the process is straightforward
and can lead to much simplified and accelerated 3D graphics circuits.

Another direction of work is in the development of a "Guess Search".
This search came about due to the first two paragraphs that describe
the binary search.

When you look through a dictionary, you don't open it at the middle
at first, bit rather take an educated guess to where do you think
the word will be. And then, from the comparison with the word you
find there, you don't blindly cut one of the two split sections in
half, but rather move into it some distance — again an educated
guess.

But how well can you guess square roots? Well, as we have already
seen, "eating away" half of the least significant digits in a number
(the integer part of that half rounded down if the number of significant
digits is odd) yields a pretty good approximation. This comes from

`sqrt(2^n) = 2^(n/2)`

So if you find an n such that

`2^(n-1) <= x < 2^n`

`n = "int"(log_2 x)+1`

where the logarithm is rounded down, then

`x / 2^(n/2)`

would be a good (and fast) approximation. The error is given by

`sqrt(x) - x / 2^(n/2)`

which has its minimum value at nearly

`sqrt(2^n) - 2^n / 2^(n/2) = 0`

as expected, and its maximum value at

`sqrt(2^(n-1)) - 2^(n-1) / 2^(n/2) = 2^((n-1)/2) - 2^((n-2)/2)`

which for 16-bit words that have the maximum n at 16 (this
is a slightly different meaning of n than what we've
been giving it so far), the maximum error is 54 (>21%); for
32-bit words, the maximum n is 32, and the maximum error
is 13573 (<21%).

Unfortunately, you can't just plug it in at the head of a
sequential search algorithm just like this. The approximation
prevented the sequential search algorithm from reducing EAX
as it normally would. Adding it to the binary-search algorithm
is also silly because its maximum error means that only one
iteration of the algorithm would be saved, and too much time is
taken preparing the approximation.
Plus the only way to find this approximation's square (so as to
use it in either of these algorithms) is to do a multiplication
which is not all that fast. Further work on this issue is
therefore left to the reader.

Conclusion

In terms of speed, the binary search is one of the best (or the
best) algorithm to use in software on CPUs that only have integer
operations, or to implement quite easily by hardware. In terms of size,
the first sequential search described in Part I has no parallel.

These are not perhaps the best algorithms around, and am I am not
(possibly) the first to develop them. I am unaware if their use
anywhere, though, or of papers or books describing such algorithms
as I describe in this document, but I have not done extensive
search for this.