Name

Synopsis

Description

HPL_pdpanrlT
factorizes a panel of columns that is a sub-array of a
larger one-dimensional panel A using the Right-looking variant of the
usual one-dimensional algorithm. The lower triangular N0-by-N0 upper
block of the panel is stored in transpose form.
A bi-directional exchange is used to perform the swap::broadcast
operations for one column of the panel in a single step. This results
in fewer, slightly larger messages than usual. On P processes, and
assuming bi-directional links, the running time of this function can
be approximated (when N is equal to N0) by:
    N0 * log_2( P ) * ( lat + ( 2*N0 + 4 ) / bdwth ) +
    N0^2 * ( M - N0/3 ) * gam2-3
where M is the local number of rows of the panel, lat and bdwth are
the latency and bandwidth of the network for double precision real
words, and gam2-3 is an estimate of the Level 2 and Level 3 BLAS
rate of execution. The recursive algorithm indeed makes it possible
to come close to Level 3 BLAS performance in the panel factorization.
On many modern machines, however, this operation is latency bound,
meaning that its cost can be estimated by the latency portion alone,
N0 * log_2(P) * lat. Mono-directional links will double this
communication cost.
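
The estimate above can be evaluated numerically. The following Python
sketch is illustrative only and is not part of HPL: the function name,
its arguments and the units are assumptions. In particular it assumes
lat is a time per message, bdwth is a number of double precision words
per unit of time, and gam23 (standing for gam2-3) is a time per
floating point operation, which is what the formula needs in order to
be dimensionally consistent.

```python
import math


def panel_time_estimate(m, n0, p, lat, bdwth, gam23, bidirectional=True):
    """Return (communication, computation) estimates for N equal to N0.

    Hypothetical helper, not an HPL routine. Assumed units:
      lat    time per message
      bdwth  double precision words per unit of time
      gam23  time per floating point operation
    """
    # N0 * log_2(P) * (lat + (2*N0 + 4) / bdwth)
    comm = n0 * math.log2(p) * (lat + (2 * n0 + 4) / bdwth)
    if not bidirectional:
        # Mono-directional links double the communication cost.
        comm *= 2.0
    # N0^2 * (M - N0/3) * gam2-3
    comp = n0 ** 2 * (m - n0 / 3.0) * gam23
    return comm, comp


def latency_only_estimate(n0, p, lat):
    """Latency-bound approximation: N0 * log_2(P) * lat."""
    return n0 * math.log2(p) * lat
```

Comparing latency_only_estimate with the communication term returned
by panel_time_estimate for a given machine gives a rough idea of how
latency bound the operation is under these assumptions.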
Note that one iteration of the main loop is unrolled. The local
search for the maximum absolute value in the next column is performed
just after that column is updated by the current column. This way,
the current column is brought through cache only once at each step.
The current implementation does not perform any blocking for this
sequence of BLAS operations; however, the design allows for plugging
in an optimal (machine-specific) specialized BLAS-like kernel. This
idea was suggested to us by Fred Gustavson, IBM T.J. Watson Research
Center.
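
The fusion of the update and the pivot search described above can be
sketched as follows. This is a simplified, single-process Python
illustration and not the HPL implementation: there is no
communication, the upper block is not stored in transposed form, and
the function name and data layout (a list of columns) are assumptions
made for the example.

```python
def factor_panel_right_looking(a):
    """In-place right-looking LU with partial pivoting (hypothetical sketch).

    a is a list of n columns, each a list of m floats, with m >= n.
    Returns the list of pivot row indices, one per column.
    """
    n = len(a)
    m = len(a[0])
    pivots = []
    # Unrolled first step: pivot search for column 0 is done up front.
    p = max(range(m), key=lambda i: abs(a[0][i]))
    for j in range(n):
        pivots.append(p)
        if p != j:
            # Swap rows j and p across the whole panel.
            for col in a:
                col[j], col[p] = col[p], col[j]
        cur = a[j]
        pivot = cur[j]
        if pivot != 0.0:
            # Scale the current column below the diagonal.
            for i in range(j + 1, m):
                cur[i] /= pivot
        next_p, best = j + 1, -1.0
        for k in range(j + 1, n):
            col = a[k]
            u = col[j]
            for i in range(j + 1, m):
                col[i] -= cur[i] * u
                # Fused pivot search: only for the next column, done
                # while its freshly updated entries are at hand.
                if k == j + 1 and abs(col[i]) > best:
                    best, next_p = abs(col[i]), i
        p = next_p
    return pivots
```

For example, with the columns [2, 4, 8] and [1, 3, 7], the sketch
selects row 2 as pivot in both steps and leaves the multipliers below
the diagonal of each column.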