2.6.4 Multipliers

Figure 2.27
shows a symmetric 6-bit array multiplier (an
n -bit multiplier multiplies two
n -bit numbers; we shall use n
-bit by m -bit multiplier if the lengths
are different). Adders a0f0 may be eliminated, which then eliminates adders
a1a6, leaving an asymmetric CSA array of 30 (5
¥6)
adders (including one half adder). An n
-bit array multiplier has a delay proportional to
n plus the delay of the CPA (adders b6f6 in Figure 2.27). There
are two items we can attack to improve the performance of a multiplier:
the number of partial products and the addition of the partial products.

FIGURE 2.27 Multiplication.
A 6-bit array multiplier using a final carry-propagate adder (full-adder
cells a6f6, a ripple-carry adder). Apart from the generation of the summands
this multiplier uses the same structure as the carry-save adder of Figure 2.23(d).

Suppose we wish to multiply 15
(the multiplicand ) by 19 (the multiplier ) mentally. It is
easier to calculate 15¥20
and subtract 15. In effect we complete the multiplication as 15¥(201)
and we could write this as 15¥
, with the overbar representing a minus sign. Now suppose we wish to multiply
an 8-bit binary number, A, by B=00010111
(decimal 16+4+2+1=23).
It is easier to multiply A by the canonical signed-digit vector(
CSD vector ) D=
(decimal 328+1=
23) since this requires only three add or subtract operations (and a subtraction
is as easy as an addition). We say B has a weight of 4 and D has
a weight of 3. By using D instead of B we have reduced the number of partial
products by 1 (=43).

We can recode (or encode)
any binary number, B, as a CSD vector, D, as follows (canonical means there
is only one CSD vector for any number):

D
i=B
i+C
i2C
i+1
,(2.58)

where C
i+1
is the carry from the sum of B i+1+B
i+C
i (we start with C 0=0).

so that D=
(decimal 41=3).
CSD vectors are useful to represent fixed coefficients in digital filters,
for example.

We can recode using a radix
other than 2. Suppose B is an ( n
+1)-digit two's
complement number,

B=B
0+B
1 2+B
2 2 2+...+B
i 2 i
+
...+B
n1
2 n1B
n 2 n .(2.60)

We can rewrite the expression
for B using the following sleight-of-hand:

2BB=B=B
0+(B
0B
1 )2+...+(B
i1B
i )2 i
+...+B
n1
2 n1B
n 2 n

=(2B
1+B
0 )2 0+(2B
3+B
2+B
1 )2 2+...

+(2B
i+B
i1+B
i2
)2 i1+(2B
i+2+B
i+1+B
i
)2 i+1+...

+(2B
n +B
i1+B
i2
)2 n 1
.(2.61)

This is very useful. Consider
B=101001
(decimal 932=23,
n=5),

B=101001=(2B
1+B
0 )2 0+(2B
3+B
2+B
1 )2 2+(2B
5 +B
4+B
3 )2 4

= ((2
¥0)+1)2
0+((2
¥1)+0+0)2
2+((2
¥1)+0+1)2
4.(2.62)

Equation 2.61
tells us how to encode B as a radix-4 signed digit, E=
(decimal 168+1=23).
To multiply by B encoded as E we only have to perform a multiplication by
2 (a shift) and three add/subtract operations.

Using Eq. 2.61 we can
encode any number by taking groups of three bits at a time and calculating

E
j=2B
i+B
i1+B
i2
, E j+1=2B
i+2+B
i+1+B
i , ...,(2.63)

where each 3-bit
group overlaps by one bit. We pad B with a zero, B
n...B
1 B 0 0, to match the first term in Eq. 2.61.
If B has an odd number of bits, then we extend the sign: B
n B n...B
1 B 0 0. For example, B=01011
(eleven), encodes to E=
(1641);
and B=101
is E=
. This is called Booth encoding and reduces the number of partial
products by a factor of two and thus considerably reduces the area as well
as increasing the speed of our multiplier [Booth, 1951].

Next we turn our attention
to improving the speed of addition in the CSA array. Figure 2.28(a)
shows a section of the 6-bit array multiplier from Figure 2.27. We
can collapse the chain of adders a0f5 (5 adder delays) to the Wallace tree
consisting of adders 5.15.4 (4 adder delays) shown in Figure 2.28(b).

FIGURE 2.28 Tree-based
multiplication. (a) The portion of Figure 2.27 that calculates
the sum bit, P 5 , using a chain of adders
(cells a0f5). (b) We can collapse this chain to a Wallace tree (cells
5.15.5). (c) The stages of multiplication.

Figure 2.28(c) pictorially
represents multiplication as a sort of golf course. Each link corresponds
to an adder. The holes or dots are the outputs of one stage (and the inputs
of the next). At each stage we have the following three choices: (1) sum
three outputs using a full adder (denoted by a box enclosing three dots);
(2) sum two outputs using a half adder (a box with two dots); (3) pass
the outputs directly to the next stage. The two outputs of an adder are
joined by a diagonal line (full adders use black dots, half adders white
dots). The object of the game is to choose (1), (2), or (3) at each stage
to maximize the performance of the multiplier. In tree-based multipliers
there are two ways to do thisworking forward and working backward.

In a Wallace-tree multiplier
we work forward from the multiplier inputs, compressing the number of signals
to be added at each stage [Wallace, 1960]. We can view an FA as a 3:2
compressor or (3, 2) counter it counts the number of '1's on
the inputs. Thus, for example, an input of '101' (two '1's) results in an
output '10' (2). A half adder is a (2, 2) counter . To form P
5 in Figure 2.29 we must add 6 summands (S
05 , S 14 , S
23 , S 32 , S
41 , and S 50 ) and 4 carries from the
P 4 column. We add these in stages 17, compressing
from 6:3:2:2:3:1:1. Notice that we wait until stage 5 to add the last carry
from column P 4,
and this means we expand (rather than compress) the number of signals (from
2 to 3) between stages 3 and 5. The maximum delay through the CSA array
of Figure 2.29 is 6 adder delays. To this we must add the delay of
the 4-bit (9 inputs) CPA (stage 7). There are 26 adders (6 half adders)
plus the 4 adders in the CPA.

FIGURE 2.29 A
6-bit Wallace-tree multiplier. The carry-save adder (CSA) requires 26 adders
(cells 126, six are half adders). The final carry-propagate adder (CPA)
consists of 4 adder cells (2730). The delay of the CSA is 6 adders. The
delay of the CPA is 4 adders.

In a Dadda multiplier
(Figure 2.30) we work backward from the final product [Dadda, 1965].
Each stage has a maximum of 2, 3, 4, 6, 9, 13, 19, ...outputs
(each successive stage is 3/2 times largerrounded down to an integer). Thus,
for example, in Figure 2.28(d) we require 3 stages (with 3 adder delaysplus
the delay of a 10-bit output CPA) for a 6-bit Dadda multiplier. There are
19 adders (4 half adders) in the CSA plus the 10 adders (2 half adders)
in the CPA. A Dadda multiplier is usually faster and smaller than a Wallace-tree
multiplier.

FIGURE 2.30 The
6-bit Dadda multiplier. The carry-save adder (CSA) requires 20 adders (cells
120, four are half adders). The carry-propagate adder (CPA, cells 2130)
is a ripple-carry adder (RCA). The CSA is smaller (20 versus 26 adders),
faster (3 adder delays versus 6 adder delays), and more regular than the
Wallace-tree CSA of Figure 2.29. The overall speed of this implementation
is approximately the same as the Wallace-tree multiplier of Figure 2.29;
however, the speed may be increased by substituting a faster CPA.

In general, the number of stages
and thus delay (in units of an FA delayexcluding the CPA) for an
n -bit tree-based multiplier using (3, 2) counters is

log
1.5
n=log
10
n /log 101.5=log
10
n /0.176.(2.64)

Figure 2.31(a) shows
how the partial-product array is constructed in a conventional 4-bit multiplier.
The FerrariStefanelli multiplier (Figure 2.31b) "nests"
multipliersthe 2-bit submultipliers reduce the number of partial products
[Ferrari and Stefanelli, 1969].

FIGURE 2.31 FerrariStefanelli
multiplier. (a) A conventional 4-bit array multiplier using AND gates
to calculate the summands with (2, 2) and (3, 2) counters to sum the partial
products. (b) A 4-bit FerrariStefanelli multiplier using 2-bit submultipliers
to construct the partial product array. (c) A circuit implementation
for an inverting 2-bit submultiplier.

There are several issues in deciding
between parallel multiplier architectures:

Since it is easier to fold triangles rather
than trapezoids into squares, a Wallace-tree multiplier is more suited
to full-custom layout, but is slightly larger, than a Dadda multiplierboth
are less regular than an array multiplier. For cell-based ASICs, a Dadda
multiplier is smaller than a Wallace-tree multiplier.

The overall multiplier speed does depend
on the size and architecture of the final CPA, but this may be optimized
independently of the CSA array. This means a Dadda multiplier is always
at least as fast as the Wallace-tree version.

The low-order bits of any parallel multiplier
settle first and can be added in the CPA before the remaining bits settle.
This allows multiplication and the final addition to be overlapped in time.

Any of the parallel multiplier architectures
may be pipelined. We may also use a variably pipelined approach
that tailors the register locations to the size of the multiplier.

Using (4, 2), (5, 3), (7, 3), or (15, 4)
counters increases the stage compression and permits the size of the stages
to be tuned. Some ASIC cell libraries contain a (7, 3) countera 2-bit
full-adder . A (15, 4) counter is a 3-bit full adder. There is
a trade-off in using these counters between the speed and size of the logic
cells and the delay as well as area of the interconnect.

Power dissipation is reduced by the tree-based
structures. The simplified carry-save logic produces fewer signal transitions
and the tree structures produce fewer glitches than a chain.

None of the multiplier structures we have
discussed take into account the possibility of staggered arrival times
for different bits of the multiplicand or the multiplier. Optimization
then requires a logic-synthesis tool.