We discuss a sequence of parallel algorithms for linear algebra
and other problems, based on trees. We begin by showing that
any nontrivial function of n parameters takes at least
log_2 n parallel steps to compute, using only binary operations.
Then we show how the parallel prefix operation is implemented in
O(log n) time, and use it to implement a variety of other fast
algorithms.

The spirit of this lecture is to find the fastest possible parallel
algorithm for a variety of problems, not concerning ourselves
with how many processors are needed, or whether our algorithms are
numerically stable.
As a result, some of our algorithms will be impractical, and mostly of
academic interest (and we are academics!).
Others are of great practical interest, and in fact are used in every
existing microprocessor hardware design.

We assume that we are only allowed to use binary operations, which produce
one result given 2 operands. Suppose each possible operation takes the
same amount of time (use the time for the fastest operation if they differ).
Then in 1 time step, only 1 binary operation can be performed, so our function
can depend on at most 2 parameters, the arguments to the binary operation.
In 2 time steps, both of these parameters can for the same reason depend on
at most two more parameters each, for a total of 4 independent parameters
on which the single output can depend. More generally, adding one more time
step can at most double the number of independent inputs on which the
single output can depend. So with k=log n time steps, there can be at most
2^k = n independent parameters.

Trees are naturally suited for broadcasts, because the root processor
simply sends to its children, the children send to their children,
and so on until the broadcast is complete after log_2 p steps. Reduction
with an associative operation like addition is similar, with the leaf nodes
sending their data to their common parent, who sums the child values,
and send the sum to its parent for analogous treatment. After log_2 p steps,
the root has the full sum:

In what follows, the + operator could be replaced by any other associative
operator (product, max, min, matrix multiplication, ...) but for simplicity
we discuss addition. We show how to compute y0 through yp-1 in 2*log_2 p -1
steps using a tree.

There are two ways to see the pattern. Perhaps the simplest is the
following, where we use the abbreviation i:j, where i <= j, to mean the
sum ai+...+aj. The final values at each node are indicated in blue.

This is actually mapped onto a tree as follows. There is an "up-the-tree"
phase, and a "down-the-tree" phase. On the way up, each internal
tree node runs the following algorithm.

Get the values L and R from the left and right child, respectively.

Save L in a local register M.

Compute the sum T = L+R, and pass T to the parent.

On the way down, this is the algorithm:

Get a value T from the parent (the root gets 0).

Send T to the left child.

Send T+M to the right child.

On the way up, one can show by induction on the tree that M
contains the sum of the leaves of the left subtree at each node. On the way
down, one can show by induction that the value T obtained from the parent
is the sum of all the leaf nodes to the left of the subtree rooted at the
parent. Finally, the value at each leaf is added to the value received
from the parent to get the final value. Here is an example.

Details of the CM-5

The CM-5 has a second network called the control network, which has
a variety of broadcast, reduction and scan operations built into it.
Reduction and scan are only supported for logical OR, AND and XOR,
as well as signed max, signed integer add, and unsigned integer add.
The CM-2 had floating point operations as well, but these were eliminated
in the CM-5.
It is used by CMMD and Split-C to implement their corresponding high
level operations. It performs several other useful operations as well:

Synchronous Global OR. Each processor supplies a bit, blocking until
all participate, at which point the logical OR of all the bits is
returned to the processors. This is useful for barriers.

Asynchronous Global OR. Each processor supplies a bit, with the
global logical OR being constantly computed and updated for each
processor. This can be used to test for the first processor to
complete a search task, for example.

Broadcasts, reduction and scans could also be done
by the data network, and indeed must be done by the data network for
more complicated data types (such as floating point) not handled by the
control network.

The diagonal of the triangular matrix consists of consecutive integers,
and the i-th subdiagonal is equal to s(i). Thus, given the s(k),
we can solve this triangular system to get the polynomial coefficients.

Proof. We sketch the proof. Details may be found in
"Parallel and Distributed Computation" by Bertsekas and Tsitsiklis,
Prentice-Hall, 1989. We expand the derivative p'(x) in powers of x in
two different ways, and equate coefficients of like powers of x in each
expansion. The first expansion is simply

p'(x) = n*x^(n-1) + c(1)*(n-1)*x^(n-2) + ... + c(n-1)

The second expansion begins with p(x) = prod_{i=1}^n (x-lambda(i)) to get

where s(0)=1. Now substitute p(x) = x^n + ... + c(n), and multiply out to get
the second expansion.
This completes the proof.

To be able to use Lemma 3 to get the coefficients of the characteristic
polynomial, we need to compute the s(k), the sums of powers of eigenvalues.
The next two lemmas show how to do this.

Lemma 4. The trace of the matrix A is defined
as tr(A) = sum_{i=1}^n A(i,i), the sum of the diagonal entries.
Then tr(A) = sum_{i=1}^n lambda(i), the sum of the eigenvalues.

Proof. We sketch the proof. As above, we may write the characteristic
polynomial p(x) in two ways:

p(x) = x^n + c(1)*x^(n-1) + ... + c(n) = prod_{i=1}^n (x - lambda(i))

By examining the expansion of determinant(x*I-A), it is easy to see
that the coefficient of x^(n-1), c(1) = -tr(A).
By examining prod_{i=1}^n (x - lambda(i)), it is easy to see that the
coefficient of x^(n-1) may also be written -sum_{i=1}^n lambda(i).
This completes the proof.

Lemma 4. If the eigenvalues of A are lambda(1), ... , lambda(n),
then the eigenvalues of A^k are lambda(1)^k , ... , lambda(n)^k.

This algorithm is numerically quite unstable. In a numerical experiment,
we used this algorithm to invert 3*I_n, where I_n is the n-by-n identity
matrix. We used double precision, i.e. about 16 decimal digits of accuracy.
For n>50, no correct digits were obtained. The source of the
instability is the sizes of the terms c(i)*A^(n-i-1) which must be
summed to get inv(A). These are enormous, both because A^(n-i-1) and
because c(i) is enormous. These terms are added and we expect the sum to
be (1/3)*I_n. In other words, we take enormous numbers of the size 10^16
or larger, which are accurate to at most 16 decimals, and add them so that
the leading digits cancel, leaving no significant figures to approximate
1/3.

Theoretically, we could modify this algorithm to let the precision it
uses grow with n, rather than use fixed precision.
In this model, including the cost of the extra precise arithmetic,
it is possible to show we can invert matrices accurately in O(log^2 n) time.
But the overhead of such an algorithm makes it unattractive.

The solution via Gaussian elimination without pivoting would take 3 steps:

Factorize T = L*U. L is a lower triangular matrix with ones on the
diagonal, and U is an upper triangular matrix. Only the diagonal and
first subdiagonal (superdiagonal) of L (of U) is nonzero.

Solve L*z = y.

Solve U*x = z.

The parallel version of Gaussian elimination discussed in another section
would take O(n) steps to factorize, and O(n) steps for each triangular solve.
We will show how to use parallel prefix to do each of these in O( log n) steps.

where f(1) = 0, p(0) = 1, and q(0) = 0. Thus, we need to compute the
prefix-products N(i) for i=1 to n, and this can be done using parallel
prefix with 2-by-2 matrix multiplication as the associative operation,
in O(log n) steps. Thus, we can LU factorize T=L*U in O(log n) steps.

Now we consider solving the triangular system U*x=z in O(log n) steps;
solving L*z=y is analogous. We again convert to parallel-prefix with
2-by-2 matrix multiplication as the operation. Equating (U*x)(i) = z(i)
yields

Let a = a(n-1)a(n-2)a(n-3)...a(0) be an n-bit binary number, with bits
a(n-1) through a(0).
Let b = b(n-1)b(n-2)b(n-3)...b(0) be another n-bit binary number.
We want to compute the sum a+b = s = s(n)s(n-1)...s(0).
The standard algorithm is to add from right to left, propagating
a carry-bit c(i) from bit to bit. In the following algorithm, 0 or F
means false, 1 or T means true, and xor, or and
and are logical operations:

The challenge is to propagate the c(i) from right to left quickly.
After this all the sum bits s(i) can be computed in a single time step.
For simplicity let p(i) = [a(i) xor b(i)] (the "propagate bit") and
g(i) = [a(i) and b(i)] (the "generate bit"), which reduces the recurrence
for the carry-bit to

c(i) = ( p(i) and c(i-1) ) or g(i)

We evaluate this recurrence using parallel prefix, where the
associative operation is 2-by-2 Boolean matrix multiply, in almost
the same way as we solved U*x=z:

We consider whether we can parallelize the recurrence z(i) = f_i (z(i-1))
where z(i) is a number and f_i is a rational function. We assume
each f(i) has degree d, which means that d is the larger of the
degrees of the numerator polynomial n_i(z) and denominator polynomial d_i(z)
defining f_i(z) = n_i(z) / d_i(z) (we assume n_i and d_i have no common
factors). We present a theorem due to Kung which characterizes exactly
when z(n) can be evaluated in parallel.

Next, consider the case d>1. We want to show that evaluating z(n) takes
time at least proportional to n. To do this we consider z(n) as a rational
function of n, and ask what its degree is, in the same sense as the
degree of f_i: if we write z(n) as a rational function of z(0), what is the
maximum of the degrees of the numerator and denominator? It is easy to
see that since d-th powers of polynomials occur in each f_i, the
degree can increase by a factor of d at each step, so the degree of z(n)
is at most d^n, and will equal d^n barring fortuitous cancellations.

Now suppose we evaluate z(n) in m parallel time steps. We will show that
the degree of z(n) is at most 2^m. Thus, for z(n) to have the right
degree, we will need 2^m >= d^n, or m >= n, since d >= 2. Thus we will
need m >= n parallel time steps to evaluate z(n), so no parallel speedup
is possible. To see that the degree of z(n) is at most 2^m after m time
steps, we proceed by induction. After m=0 steps, z(0) has degree 1 = 2^0.
Suppose we have degree 2^(m-1) after m-1 time steps. At the m-th step,
we can add, subtract, multiply or divide two rational functions of
degree at most 2^(m-1) to get z(n). It is easy to confirm that doing
any one of these four rational operations on 2 rational functions can
at most double the maximum degree. This completes the proof.

We emphasize that this theorem is true because we permit only addition,
subtraction, multiplication and division to be used. If we permit
other, transcendental operations, like square roots, exponentials, and
logarithms, further speedup is possible. For example, consider evaluating
the square root using Newton's method to find a positive root of
g(x) = x^2 - a = 0. This yields the recurrence

x(i+1) = (x(i)^2 + a)/(2*x(i)) = f(x(i))

Since the degree of f(.) is 2, this commonly used recurrence
cannot be accelerated. However, let us change variables using the
decidedly transcendental transformation

We close by stating, without proof, a theorem due to Brent.
Let E be an arbitrary expression consisting of +, -, *, /, parentheses,
and n variables, where each appearance of a variable is counted
separately. Another way to say this is to let E be an arbitrary
binary expression tree with n leaves (the variables), and with each
internal node labeled by +, -, * or /. Then E can be evaluated in
O(log n) time, using the laws of commutativity, associativity and
distributivity. A more modern proof than Brent's uses a greedy algorithm,
collapsing all leaves into their parents at each time step, and
evaluating all chains with parallel prefix.