In order to simplify the design of ScaLAPACK, and because the BLAS
have proven to be very useful tools outside LAPACK, we chose to build
a Parallel BLAS, or PBLAS,
whose interface is as similar
to the BLAS as possible. This decision has permitted the ScaLAPACK
code to be quite similar, and sometimes nearly identical, to the
analogous LAPACK code. Only one substantially new routine was added
to the PBLAS, matrix transposition, since this is a complicated
operation in a distributed memory environment [3].

We hope that the PBLAS will provide a distributed memory standard,
just as the BLAS have provided a shared memory standard. This would
simplify and encourage the development of high performance and portable
parallel numerical software, as well as providing manufacturers with a
small set of routines to be optimized. The acceptance of the PBLAS
requires reasonable compromises among competing goals of functionality
and simplicity. These issues are discussed below.

The PBLAS operate on matrices distributed in a 2D block cyclic layout.
Since such a data layout requires many parameters to fully describe
the distributed matrix, we have chosen a more object-oriented approach,
and encapsulated these parameters in an integer array called an
array descriptor.
An array descriptor includes

(1) the number of rows in the distributed matrix, (2) the number of columns in the distributed matrix, (3) the row block size
( in section 2.5), (4) the column block size
( in section 2.5), (5) the process row over which the first row of the matrix is
distributed, (6) the process column over which the first column of the matrix
is distributed, (7) the BLACS context, and (8) the leading dimension of the local array storing the local blocks.

For example, here is an example of a call to the BLAS double precision
matrix multiplication routine DGEMM, and the corresponding PBLAS routine
PDGEMM; note how similar they are:

DGEMM computes C = BETA * C + ALPHA * op( A ) * op( B ),
where op(A) is either A or its transpose depending on
TRANSA, op(B) is similar, op(A) is M-by-K,
and op(B) is K-by-N. PDGEMM is the same, with
the exception of the way in which submatrices are specified. To pass
the submatrix starting at A(IA,JA) to DGEMM, for example,
the actual argument corresponding to the formal argument A would
simply be A(IA,JA). PDGEMM, on the other hand, needs to
understand the global storage scheme of A to extract the correct
submatrix, so IA and JA must be passed in separately.
DESC_A is the array descriptor for A. The parameters
describing the matrix operands B and C are analogous
to those describing A. In a truly object-oriented environment
matrices and DESC_A would be the synonymous. However, this
would require language support, and detract from portability.

Our implementation of the PBLAS emphasizes the mathematical view of a
matrix over its storage. In fact, it is even possible to reuse our
interface to implement the PBLAS for a different block data
distribution that would not fit in the block-cyclic scheme.

The presence of a context associated with every distributed matrix
provides the ability to have separate ``universes'' of message passing.
The use of separate communication contexts by distinct libraries
(or distinct library invocations) such as the PBLAS insulates
communication internal to the library from external communication.
When more than one descriptor array is present in the argument
list of a routine in the PBLAS, it is required that the individual
BLACS context entries must be equal. In other words, the PBLAS do
not perform ``intra-context'' operations.

We have not included specialized routines to take advantage of
packed storage schemes for symmetric, Hermitian, or triangular
matrices, nor of compact storage schemes for banded matrices.