http://www.reenigne.org/blog/multiplying-faster-with-squares/

$(a+b)^2 = a^2 + b^2 + 2ab$ and $(a-b)^2 = a^2 + b^2 - 2ab$, and subtracting
the second from the first gives $4ab = (a+b)^2 - (a-b)^2$, or
$ab = (a+b)^2/4 - (a-b)^2/4$. So if we keep in memory a table of $x^2/4$, we
can do a multiply with an add, two subtractions, and two table lookups. The
table has on the order of a+b entries rather than a*b, but it must hold values
as large as a*b if the full range of possible output is to be supported.
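
A minimal Python sketch of the quarter-square trick follows. The function names and the 8-bit operand limit are assumptions made for illustration, not anything taken from the linked post. Flooring x^2/4 is harmless because a+b and a-b are either both even or both odd, so the discarded quarters cancel.

```python
def build_quarter_square_table(max_operand):
    """Table of floor(x*x / 4) for x = 0 .. 2*max_operand (hypothetical helper)."""
    return [(x * x) // 4 for x in range(2 * max_operand + 1)]


def quarter_square_multiply(a, b, table):
    """Multiply two non-negative integers with two table lookups."""
    total = a + b        # the single addition
    diff = abs(a - b)    # first subtraction; the sign vanishes when squared
    return table[total] - table[diff]   # second subtraction


table = build_quarter_square_table(255)   # 511 entries cover all 8-bit operands
print(quarter_square_multiply(44, 51, table))
```

The table index runs up to a+b, which is why its length grows with the sum of the operand ranges rather than with their product.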

Novel Methods of Integer Multiplication and Division

You may be familiar with the method of multiplication, variously alleged
to be of Kenyan, Russian, or even Himalayan origin, in which you repeatedly
halve the multiplicand and double the multiplier until the multiplicand becomes
1. Then the sum of those multipliers that have a multiplicand counterpart
of odd value becomes the product. This sounds complicated, but it's really
not; table 1 shows an example.

Table 1: An example of the Kenyan double-and-halve
algorithm for integer multiplication.

Procedure: Repeatedly halve the multiplicand (discarding remainders)
and double the multiplier until the former is 1. For every
odd multiplicand, add the respective multiplier.

Example: 44 x 51

(a)            (b)          (c)       (d)                    (e)
Multiplicand   Multiplier   Partial   Column (c) Expressed   Remainder of
                            Sum       in Terms of Original   Division of
                                      Multiplier             Column (a) by 2

44             51                                            0
22             102                                           0
11             204          204       4 x 51                 1
5              408          408       8 x 51                 1
2              816                                           0
1              1632         1632      32 x 51                1

Total                       2244 =    44 x 51                101100 is
                                                             binary for 44

This algorithm readily lends itself to coding, as exemplified by the sequence
in 8080 code shown in listing 1. Halving is done by shifting to the right,
and the odd/even test is performed by checking the carry. Doubling is done
by adding to itself using the DAD instruction, which is also used for summing
up the output terms.

Repeated halving of a number and then noting the odd/even results is a nice
way of finding the binary form of the number (the last bit found being the
most significant one). It also tells something of the binary nature of the
Kenyan method.

Listing 1: An implementation of the Kenyan algorithm for integer
multiplication for the 8080 microprocessor.

;multiplication program MULT
;input multiplication factors in HL and DE, one of which must
;necessarily be an 8-bit number; if not, carry is set
;output product in DE, carry set if overflow.
;********************** Initial test to find 8-bit factor
MULT:   xra a           ;clear A
        ora d           ;is D zero?
        jz found        ;yes, DE number is 8-bit factor
        xra a           ;no, DE number was not 8-bit factor
        ora h           ;is H zero then?
        stc
        rnz             ;no, return with carry set
        xchg            ;yes, place 8-bit factor in DE
found:  mov a,e         ;transfer multiplicand to A
;********************** Multiplication starts in earnest
        lxi d,0         ;clear DE to receive output terms
        ana a           ;8-bit factor now in A; clear carry.
next:   rar             ;halve the multiplicand; result odd?
        jnc even        ;no, don't add multiplier term
        xchg            ;yes, therefore,
        dad d           ;add multiplier (now in DE) to output
        rc              ;overflow, carry set on return
        xchg            ;put multiplier back in HL
even:   ana a           ;already reached 1 by halving?
        rz              ;yes, return with result, carry cleared
        dad h           ;no, double the multiplier and
        jnc next        ;continue the process.
        ret             ;overflow, carry set on return
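
For readers without an 8080 at hand, the same double-and-halve procedure can be sketched in Python. This is an illustrative restatement rather than a translation of listing 1: the function name and the returned row format are assumptions, and no 8-bit or 16-bit limits are modeled.

```python
def kenyan_multiply(multiplicand, multiplier):
    """Double-and-halve multiplication; returns (product, rows)."""
    if multiplicand < 1:
        raise ValueError("multiplicand must be a positive integer")
    product = 0
    rows = []   # one (multiplicand, multiplier, remainder) tuple per table row
    while True:
        odd = multiplicand & 1          # remainder of dividing by 2
        if odd:
            product += multiplier       # odd multiplicand: add this multiplier
        rows.append((multiplicand, multiplier, odd))
        if multiplicand == 1:
            break
        multiplicand >>= 1              # halve, discarding the remainder
        multiplier += multiplier        # double by adding to itself
    return product, rows


product, rows = kenyan_multiply(44, 51)
print(product)
# Reading the remainders from the last row up spells the multiplicand in binary.
print("".join(str(odd) for _, _, odd in reversed(rows)))
```

Run on 44 x 51, the rows match columns (a), (b), and (e) of table 1.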

Some time ago I became intrigued by the possibility of finding a procedure
for division that was similar to the Kenyan method of multiplication. I came
up with the following scheme: The divisor is repeatedly doubled until it
reaches the largest value that does not exceed the dividend; the doubled
divisors are then successively subtracted from the dividend, largest first.
Every time the subtraction gives a non-negative result, a 1 is noted;
otherwise a 0 is recorded and the subtraction is not performed. Remarkably
enough, the resultant sequence of 0s and 1s constitutes the quotient directly
in binary form, as shown in table 2.

Table 2: An example of a new method of integer division
suitable for implementation on microprocessors without a
divide instruction

Procedure: Double the divisor until it is just less than the dividend.
Then try to subtract the doubled divisors, starting with the
largest, from the dividend. Note a 1 if the subtraction is possible;
otherwise, note a zero and do not perform the subtraction.

The 1s and 0s constitute the binary form of the quotient. To obtain
the decimal form, multiply the latter digits with the corresponding
terms in a power of 2 series, arranged in reverse order.
The quotient is the sum of the resultant terms.

To obtain decimal accuracy, multiply the dividend initially by an
Nth power of 10. Then, after the division is complete, divide the
quotient by the same power of 10 (moving the decimal point N
places).

Example: 2246/51

Double:

              Doubled divisor   Counter
              51                0
              102               1
              204               2
              408               3
              816               4
              1632              5

Subtract:

Dividend      Doubled divisor   Result   Bit   Weight       Counter
2246          -1632             614      1     x 32 = 32    5
614           -816                       0     x 16 = 0     4
614           -408              206      1     x 8 = 8      3
206           -204              2        1     x 4 = 4      2
2             -102                       0     x 2 = 0      1
2             -51                        0     x 1 = 0      0

Remainder: 2

Quotient: 101100 = 44

Notice that the procedure is quite mechanical, with none of the trial-and-error
search for the next correct quotient digit that is characteristic of the
conventional method. Furthermore, it lends itself beautifully to coding (see
listing 2). There need be no 8-bit restrictions on any of the numbers; the
dividend, divisor, quotient, and remainder can all be entered as 16-bit numbers.
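
The double-then-subtract procedure of table 2 can be sketched in a few lines of Python. This is a plain restatement with unbounded integers, not the 8080 code; the function name and the returned bit list are assumptions for illustration.

```python
def double_and_subtract_divide(dividend, divisor):
    """Return (quotient, remainder, bits) by doubling and trial subtraction."""
    if divisor <= 0 or dividend < 0:
        raise ValueError("need dividend >= 0 and divisor > 0")
    doubled = []
    d = divisor
    while d <= dividend:        # doubling phase
        doubled.append(d)
        d += d
    remainder = dividend
    bits = []
    for d in reversed(doubled):  # subtraction phase, largest first
        if d <= remainder:
            remainder -= d
            bits.append(1)       # subtraction possible
        else:
            bits.append(0)       # not possible: leave the remainder alone
    quotient = 0
    for bit in bits:             # the bits are the quotient in binary
        quotient = quotient * 2 + bit
    return quotient, remainder, bits


print(double_and_subtract_divide(2246, 51))
# Decimal accuracy: scale the dividend by 10**2, then read the quotient
# with the decimal point moved two places to the left.
print(double_and_subtract_divide(2246 * 100, 51)[0])
```

The first call reproduces the bits 1 0 1 1 0 0 of table 2; the second gives 4403, to be read as 44.03.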

Listing 2: An implementation of the author's integer-division algorithm
for the 8080 microprocessor.

To handle 16-bit numbers, the add-to-itself DAD H instruction is used for
doubling the divisor, and the necessary comparison with the dividend is
accomplished by reverse-polarity addition, using the negative value of the
dividend (in the DE register pair) and testing on the carry. Care is taken
to restore the divisor before the next doubling by adding back the positive
value (in the BC register pair). The doubled divisors are put in temporary
storage by pushing them to the stack.

For the necessary subtraction of the doubled divisors from the dividend,
reverse-polarity addition is used again. Luckily, the dividend is already
present in negative form (in the DE register pair), and the divisors can
be used in their existing positive form as they are popped from the stack
for subtraction. The carry is then indicative of a positive or negative result,
and for every subtraction, it is shifted into a register pair to form the
final quotient. A counter sees to it that there are no more subtractions
than there were doubling operations. The contents of the DE register pair
constitute the remainder (in complemented form).
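
The reverse-polarity idea can be modeled in Python with explicit 16-bit arithmetic. This is a sketch under assumptions, not the author's listing: it takes the "negative" dividend to be a one's complement (which the phrase "in complemented form" suggests but does not spell out), it uses a Python list in place of the 8080 stack, and the names add16 and divide_complemented are hypothetical.

```python
MASK = 0xFFFF


def add16(x, y):
    """16-bit addition returning (result, carry), loosely like DAD."""
    total = x + y
    return total & MASK, total > MASK


def divide_complemented(dividend, divisor):
    """Divide two 16-bit numbers using only additions and carry tests."""
    if not (0 <= dividend <= MASK and 0 < divisor <= MASK):
        raise ValueError("operands must fit in 16 bits, divisor nonzero")
    comp = dividend ^ MASK            # complemented running remainder
    stack = []
    d = divisor
    while True:                       # doubling phase
        _, carry = add16(d, comp)     # carry set exactly when d > remainder
        if carry:
            break
        stack.append(d)
        d, overflow = add16(d, d)
        if overflow:                  # doubled divisor no longer fits
            break
    quotient = 0
    while stack:                      # subtraction phase, largest first
        d = stack.pop()
        trial, carry = add16(d, comp)
        quotient <<= 1
        if not carry:                 # d <= remainder: keep the result
            comp = trial              # complement of (remainder - d)
            quotient |= 1
    return quotient, comp ^ MASK      # un-complement the remainder


print(divide_complemented(2246, 51))
```

Adding d to the complement 65535 - r overflows 16 bits exactly when d exceeds r, so a single addition serves as both the comparison and the subtraction.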

As we have seen, odd ways of multiplying and dividing can lead to useful
code algorithms. But the reverse can also be true. Machine-code algorithms
can lead to odd but perhaps not so useful manual methods.

First, consider a table used for multiplying by a fixed number K, based on
using the 8080 DAD instruction (see table 3). The multiplicand is loaded
into two register pairs (HL and DE), and the product is obtained by executing
a sequence of DAD H and DAD D commands in the order given beneath each value
of K (operand sequences for K=2 to K=32 have been included). DAD H doubles
the accumulated multiplicand in the HL pair, and DAD D adds the original
multiplicand to the HL pair.

Procedure: Input multiplicand in both HL and DE register pairs. Constant
K is the multiplier. Then perform a series of DAD D and DAD H instructions
in the order given by the sequence of Ds and Hs under the given value of
K. The final product will be in the HL register pair. If every DAD instruction
is followed by a test of carry (JC or RC), carry will be set in case of overflow.

It seems natural to look for a general algorithm based on DAD Hs and DAD
Ds. If you look hard at table 3, you'll see a familiar pattern emerge: the
Hs and Ds actually represent K in binary form. The 0s are represented by
H, whereas the 1s are represented by H and D as a group. True, the most
significant bit is missing, but that will always be a 1 anyway. As an example,
consider K=19. The sequence is H H (H D) (H D), which translates into (1)
0 0 1 1.
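
The binary pattern just described can be checked with a short Python sketch. It derives the H/D sequence from the binary digits of K and then replays it; the function names are assumptions, and the output follows the rule stated in the text rather than the literal entries of table 3.

```python
def dad_sequence(k):
    """DAD operand letters for multiplying by the constant k (k >= 1)."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    bits = bin(k)[3:]    # binary digits of k with the leading 1 dropped
    return "".join("HD" if bit == "1" else "H" for bit in bits)


def run_dad_sequence(multiplicand, sequence):
    """Replay a sequence: H doubles HL, D adds the original multiplicand."""
    hl = de = multiplicand
    for op in sequence:
        hl += hl if op == "H" else de
    return hl


print(dad_sequence(19))
print(run_dad_sequence(7, dad_sequence(19)))
```

For K=19 the sequence comes out as HHHDHD, that is H H (H D) (H D), and replaying it on 7 gives 133.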

Thus, we can multiply by shifting the multiplier and examining the carry.
When carry is cleared, we perform a DAD H operation, and when it is set,
we do both a DAD H and a DAD D. This gives us the code in listing 3.
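
As a rough Python model of that shift-and-test loop (not the 8080 listing itself), the sketch below shifts an 8-bit multiplier left one bit at a time and treats the bit that falls out as the carry. The 8-bit multiplier width, the helper dad, and the use of an exception to signal overflow are all assumptions made for illustration.

```python
def dad(hl, operand):
    """16-bit add in the spirit of DAD; a carry out is raised as an error."""
    total = hl + operand
    if total > 0xFFFF:
        raise OverflowError("product does not fit in 16 bits")
    return total


def shift_and_add_multiply(multiplicand, multiplier):
    """Multiply a 16-bit multiplicand by an 8-bit multiplier, MSB first."""
    if not 0 < multiplier <= 0xFF:
        raise ValueError("multiplier must be in 1..255")
    if not 0 <= multiplicand <= 0xFFFF:
        raise ValueError("multiplicand must fit in 16 bits")
    remaining = 8
    while True:                      # skip up to and including the leading 1
        carry = multiplier & 0x80
        multiplier = (multiplier << 1) & 0xFF
        remaining -= 1
        if carry:
            break
    hl = de = multiplicand           # the leading 1 is implied by this load
    for _ in range(remaining):
        carry = multiplier & 0x80
        multiplier = (multiplier << 1) & 0xFF
        hl = dad(hl, hl)             # DAD H for every bit
        if carry:
            hl = dad(hl, de)         # plus DAD D when the bit was 1
    return hl


print(shift_and_add_multiply(51, 44))
```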

Listing 3: An implementation in 8080 assembly language of the
integer-multiplication algorithm given in tables 3 and 4.

Now for the manual method that can be derived from this: Repeatedly halve
the multiplier until it becomes 1 (in order to find the binary form). Reverse
the sequence of halved multipliers and ignore the 1. Repeatedly double the
multiplicand. Whenever the corresponding halved multiplier is odd, also add
the original multiplicand to the accumulated doubled multiplicands; table
4 gives us an example of this method. Oh well, not everything is progress.
But then, progress isn't everything.

Table 4: An example of manual implementation of the algorithm of
table 3.

Procedure: Repeatedly halve the multiplier (discarding remainders)
until you reach 1. Ignore the 1 and arrange the resultant halved multipliers
vertically in reverse order. For each halved multiplier, double the
multiplicand. Add also the initial multiplicand if the halved multiplier is
an odd number.
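
The manual procedure can be traced with a small Python sketch. The example reuses 44 and 51 from table 1, with 44 as the multiplier; that choice, the function name, and the row format are assumptions, not the contents of table 4.

```python
def manual_multiply(multiplicand, multiplier):
    """Halve, reverse, then double-and-add; returns (product, rows)."""
    if multiplier < 1:
        raise ValueError("multiplier must be a positive integer")
    halved = [multiplier]
    while halved[-1] > 1:
        halved.append(halved[-1] // 2)      # discard remainders
    column = list(reversed(halved[:-1]))    # ignore the 1, reverse the rest
    total = multiplicand
    rows = []
    for h in column:
        total += total                      # double the running value
        if h % 2:
            total += multiplicand           # odd: add the initial multiplicand
        rows.append((h, total))
    return total, rows


product, rows = manual_multiply(51, 44)
for halved_multiplier, running_total in rows:
    print(halved_multiplier, running_total)
print(product)
```

The odd and even halved multipliers 2, 5, 11, 22, 44 spell out 0 1 1 0 0, the binary form of 44 without its leading 1.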

How about 'Vedic' multiplication methods? While not so beautiful without
parallelism (i.e. an FPGA array), they are still interesting, and allow
multiplication of large numbers in just a few steps (e.g. 6 clock cycles for
a 64-bit multiply if an FPGA is used).
The simplest way to perform it is to multiply bit by bit, the way one
multiplies on paper.

Notice that while a CPU has to repeat the 'shift and multiply'
for each bit of the multiplier, an FPGA can do it in parallel in just one cycle
(the shift is just addressing the destination register and moving the data
there, masked by the multiplier bit; this takes just one cycle, actually one
clock edge or 'slope', as not even a full cycle is needed).
Then we have an array of $multiplier_bit_count_size rows to sum.
For a 64-bit multiplier, this would come to as many as 64 operations, so 64
cycles, but we can once again try to be smart the 'Vedic' way: the addition
operations can be split into parallel operations.

Because addition is associative, we can add the rows in pretty much any
order, e.g. (0110 + 00000) + (011000 + 0000000). This means we can quickly add
each pair, which makes 32 parallel operations in the first clock cycle, then
16 in the 2nd, 8 in the 3rd, 4 in the 4th, 2 in the 5th, a last one to join the
final pair, and voila.
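
The partial-product rows and the pairwise addition can be imitated in Python, with each round of the loop standing in for one parallel clock cycle. This is only a software model under assumptions: the function names are hypothetical, and the small example takes 0110 x 0101 as the operands, which is inferred from the rows quoted above.

```python
def partial_products(multiplicand, multiplier, width):
    """One row per multiplier bit: the shifted multiplicand, masked by the bit."""
    return [(multiplicand << i) if (multiplier >> i) & 1 else 0
            for i in range(width)]


def tree_sum(rows):
    """Add rows pairwise; returns (total, number of additions in each round)."""
    adds_per_round = []
    while len(rows) > 1:
        pairs = len(rows) // 2
        summed = [rows[2 * i] + rows[2 * i + 1] for i in range(pairs)]
        if len(rows) % 2:
            summed.append(rows[-1])   # an odd row out waits for the next round
        adds_per_round.append(pairs)
        rows = summed
    return rows[0], adds_per_round


rows = partial_products(0b0110, 0b0101, 4)
print([bin(r) for r in rows])
print(tree_sum(rows))

a, b = 0x123456789ABCDEF0, 0xFEDCBA9876543210
total, rounds = tree_sum(partial_products(a, b, 64))
print(total == a * b, rounds)
```

Filtering out the all-zero rows before calling tree_sum would model the zero-detection idea, at the price of a round count that depends on the multiplier.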

Also, if extra logic were used to detect 'pairs' that are unlikely to cause
any overflow in the addition (e.g. 001 + 010), they could be merged instantly
to perform (a+b)+c additions, as this does not require an additional
'overflow sum' cycle. We can group our adds into ones that are unlikely to
influence each other.

The order can be even more free if we have logic that lets us detect zeros
and choose only the non-zero rows, further reducing the number of
operations (though at the cost of an unpredictable operation length and more
logic complexity).

Including the masked preloading of the array mentioned earlier, this all
comes to just 6 cycles by the count above (7 if the last pairwise addition
takes a cycle of its own), and of those, just two involve the ALU bus
(fetching the number and the multiplier is one, and placing the final addition
result in the destination is another).

So, assuming the MUL array is clocked 6x faster than the ALU bus (which is
quite doable in CMOS, assuming we talk about speeds up to ~1 GHz), we can
practically deliver a MUL instruction in just 2 ALU bus clock cycles, or, if
the registers can be independent (a separate ALU bus to the result register),
in just one clock cycle.

In asm it takes a bit more looping, unfortunately, but it still makes
multiplying insanely large (64-bit and more) numbers quite a breeze.
