LAPACK 3.2

Release date: Sunday, November 16, 2008.

This material is based upon work supported by the National Science Foundation
and the Department of Energy under Grant No. NSF-CCF-00444486, NSF-CNS
0325873, NSF-EIA 0122599, NSF-ACI-0090127, DOE-DE-FC02-01ER25478,
DOE-DE-FC02-06ER25768.

LAPACK is a software package provided by Univ. of Tennessee, Univ. of
California Berkeley, Univ. of Colorado Denver and NAG Ltd.

XBLAS, or portable “extra precise BLAS”: Our new linear
solvers in (1) depend on the XBLAS to perform iterative refinement. See
reference [3] below. The XBLAS will be released as a separate
package. See “More Details”.

Non-Negative Diagonals from Householder QR: The QR
factorization routines now guarantee that the diagonal of R is real
and non-negative. Factoring a uniformly random matrix now correctly
generates an orthogonal Q from the Haar distribution. See reference
[4] below.
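To illustrate the guarantee (this is a numpy sketch, not the LAPACK implementation), the snippet below post-processes an arbitrary QR factorization so that the diagonal of R becomes non-negative; numpy.linalg.qr makes no sign guarantee, so the sign fix-up is ours:

```python
# Sketch: enforcing a non-negative diagonal on R after QR,
# the property the new xGEQRF guarantees by construction.
import numpy as np

def qr_nonneg_diag(A):
    """QR factorization with R having a non-negative real diagonal.

    Flip the sign of each row of R (and the matching column of Q)
    wherever the diagonal entry is negative; Q R is unchanged.
    """
    Q, R = np.linalg.qr(A)
    signs = np.sign(np.diag(R))
    signs[signs == 0] = 1.0       # leave exact-zero diagonal entries alone
    Q = Q * signs                 # scale columns of Q
    R = signs[:, None] * R        # scale rows of R
    return Q, R

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
Q, R = qr_nonneg_diag(A)
assert np.all(np.diag(R) >= 0)
assert np.allclose(Q @ R, A)
assert np.allclose(Q.T @ Q, np.eye(5))
```

Because the sign flips cancel between Q and R, the factorization itself is unchanged; only the sign convention is fixed.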

High Performance QR and Householder Reflections on Low-Profile
Matrices: The auxiliary routines that apply Householder reflections
(e.g. DLARFB) automatically reduce the cost of QR from O(n^3) to
O(n^2 + nb^2) for band matrices with bandwidth b stored in a dense
format, with no user-interface changes. Other users of these routines
can see similar benefits, in particular on “narrow profile” matrices.
See reference [4] below.

New fast and accurate Jacobi SVD: High accuracy SVD routine for
dense matrices, which can compute tiny singular values to many more
correct digits than xGESVD when the matrix has columns differing
widely in norm, and usually runs faster than xGESVD too. See
references [5], [6] below.

Routines for Rectangular Full Packed format: The RFP format
(SF, HF, PF, TF) enables efficient routines with optimal storage for
symmetric, Hermitian or triangular matrices. Since these routines
utilize the Level 3 BLAS, they are generally much more efficient
than the existing packed storage routines (SP, HP, PP, TP). See
reference [7] below.

Mixed precision iterative refinement routines for exploiting
fast single precision hardware: On platforms like the Cell processor
that do single precision much faster than double, linear systems can
be solved many times faster. Even on commodity processors there is a
factor of 2 in speed between single and double precision. The
matrix types supported in this release are: GE (general), PO
(positive definite). See reference [1] below.
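The idea behind these routines can be sketched in a few lines of numpy: factor and solve in single precision, then refine the solution using residuals computed in double precision. This is an illustration of the technique, not the LAPACK code of drivers such as DSGESV (a real implementation reuses the single-precision LU factors rather than re-solving from scratch):

```python
# Sketch of mixed-precision iterative refinement: cheap float32
# solves, with the residual accumulated in float64.
import numpy as np

def solve_mixed(A, b, iters=10):
    """Solve A x = b via a float32 solve plus float64 refinement."""
    A = np.asarray(A, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    tol = np.finfo(np.float64).eps * np.linalg.norm(A, np.inf)
    A32 = A.astype(np.float32)                  # "fast" low-precision copy
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = b - A @ x                           # residual in double precision
        if np.linalg.norm(r, np.inf) <= tol * np.linalg.norm(x, np.inf):
            break                               # converged to double precision
        d = np.linalg.solve(A32, r.astype(np.float32)).astype(np.float64)
        x += d                                  # correct in double precision
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 50)) + 50 * np.eye(50)   # well conditioned
b = rng.standard_normal(50)
x = solve_mixed(A, b)
assert np.allclose(A @ x, b, rtol=1e-12, atol=1e-10)
```

For well-conditioned systems a couple of refinement steps recover full double-precision accuracy while the O(n^3) work is done in single precision.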

Some new variants have been added for the one-sided factorizations: LU
gets right-looking, left-looking, Crout, and recursive variants; QR gets
right-looking and left-looking; Cholesky gets left-looking,
right-looking, and top-looking. Depending on the computer architecture
(or the speed of the underlying BLAS), one of these variants may be
faster than the original LAPACK implementation.
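As an illustration of what “variant” means here (this is a numpy sketch, not the VARIANTS code itself), the following shows unblocked right-looking and left-looking Cholesky: both do exactly the same arithmetic, but in a different order, which changes the memory-access pattern and hence the performance on a given machine:

```python
# Two orderings of the same Cholesky factorization A = L L^T.
import numpy as np

def chol_right_looking(A):
    """Right-looking: scale a column, then update the whole trailing matrix."""
    L = np.tril(np.asarray(A, dtype=float).copy())
    n = L.shape[0]
    for k in range(n):
        L[k, k] = np.sqrt(L[k, k])
        L[k+1:, k] /= L[k, k]
        # eager rank-1 update of the entire trailing submatrix
        L[k+1:, k+1:] -= np.tril(np.outer(L[k+1:, k], L[k+1:, k]))
    return L

def chol_left_looking(A):
    """Left-looking: update each column from all previous columns, then scale."""
    L = np.tril(np.asarray(A, dtype=float).copy())
    n = L.shape[0]
    for k in range(n):
        L[k:, k] -= L[k:, :k] @ L[k, :k]   # delayed update from columns 0..k-1
        L[k, k] = np.sqrt(L[k, k])
        L[k+1:, k] /= L[k, k]
    return L

rng = np.random.default_rng(2)
B = rng.standard_normal((6, 6))
A = B @ B.T + 6 * np.eye(6)                # symmetric positive definite
L1, L2 = chol_right_looking(A), chol_left_looking(A)
assert np.allclose(L1, L2)
assert np.allclose(L1 @ L1.T, A)
```

The right-looking version streams over the trailing submatrix at every step; the left-looking version touches only one column plus the already-computed panel, which can be friendlier to caches.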

The XBLAS must be built with a Fortran compiler compatible with the one
used to build LAPACK. We recommend using exactly the same Fortran
compiler for both the XBLAS and LAPACK. If these instructions do not
work, see the INSTALL file in the XBLAS distribution for more
information.

New linear solvers that “guarantee” answers accurate to machine precision or
warn that the answer cannot be trusted based on computed error bounds. All but
extremely ill-conditioned problems fall into the first category.

public interfaces

These interfaces are prototypes; they may change after user feedback.
The matrix types supported in this release are

GE (general)

SY (symmetric)

PO (positive definite)

HE (Hermitian)

GB (general band)

in real/complex and single/double precision versions. The driver routines are designated
-svxx, and the refinement routines -rfsx. The testing codes for SY
and HE have not been integrated into the LAPACK distribution yet.

A SRC/sgesvxx.f A SRC/sposvxx.f A SRC/sgerfsx.f A SRC/sporfsx.f
A SRC/cgesvxx.f A SRC/cposvxx.f A SRC/cgerfsx.f A SRC/cporfsx.f
A SRC/dgesvxx.f A SRC/dposvxx.f A SRC/dgerfsx.f A SRC/dporfsx.f
A SRC/zgesvxx.f A SRC/zposvxx.f A SRC/zgerfsx.f A SRC/zporfsx.f
A SRC/ssysvxx.f A SRC/sgbsvxx.f A SRC/ssyrfsx.f A SRC/sgbrfsx.f
A SRC/csysvxx.f A SRC/cgbsvxx.f A SRC/csyrfsx.f A SRC/cgbrfsx.f
A SRC/dsysvxx.f A SRC/dgbsvxx.f A SRC/dsyrfsx.f A SRC/dgbrfsx.f
A SRC/zsysvxx.f A SRC/zgbsvxx.f A SRC/zsyrfsx.f A SRC/zgbrfsx.f
A SRC/chesvxx.f A SRC/cherfsx.f
A SRC/zhesvxx.f A SRC/zherfsx.f

We have also added routines to compute equilibration factors.

A SRC/sgeequb.f A SRC/ssyequb.f A SRC/spoequb.f A SRC/sgbequb.f
A SRC/cgeequb.f A SRC/csyequb.f A SRC/cpoequb.f A SRC/cgbequb.f
A SRC/dgeequb.f A SRC/dsyequb.f A SRC/dpoequb.f A SRC/dgbequb.f
A SRC/zgeequb.f A SRC/zsyequb.f A SRC/zpoequb.f A SRC/zgbequb.f
A SRC/cheequb.f
A SRC/zheequb.f

The following is the interface of the extra-precise iterative refinement driver SGESVXX.
The interfaces for the drivers for other matrix types and precisions are analogous.

The refinement routines listed above make use of new routines to refine each right-hand side.

A SRC/sla_gerfsx_extended.f A SRC/sla_porfsx_extended.f A SRC/cla_herfsx_extended.f
A SRC/cla_gerfsx_extended.f A SRC/cla_porfsx_extended.f A SRC/zla_herfsx_extended.f
A SRC/dla_gerfsx_extended.f A SRC/dla_porfsx_extended.f
A SRC/zla_gerfsx_extended.f A SRC/zla_porfsx_extended.f
A SRC/sla_syrfsx_extended.f A SRC/sla_gbrfsx_extended.f
A SRC/cla_syrfsx_extended.f A SRC/cla_gbrfsx_extended.f
A SRC/dla_syrfsx_extended.f A SRC/dla_gbrfsx_extended.f
A SRC/zla_syrfsx_extended.f A SRC/zla_gbrfsx_extended.f

We have added new routines to compute the reciprocal condition number.

A SRC/sla_gercond.f A SRC/sla_syrcond.f A SRC/sla_porcond.f A SRC/sla_gbrcond.f
A SRC/cla_gercond_x.f A SRC/cla_syrcond_x.f A SRC/cla_porcond_x.f A SRC/cla_gbrcond_x.f
A SRC/cla_gercond_c.f A SRC/cla_syrcond_c.f A SRC/cla_porcond_c.f A SRC/cla_gbrcond_c.f
A SRC/dla_gercond.f A SRC/dla_syrcond.f A SRC/dla_porcond.f A SRC/dla_gbrcond.f
A SRC/zla_gercond_x.f A SRC/zla_syrcond_x.f A SRC/zla_porcond_x.f A SRC/zla_gbrcond_x.f
A SRC/zla_gercond_c.f A SRC/zla_syrcond_c.f A SRC/zla_porcond_c.f A SRC/zla_gbrcond_c.f
A SRC/zla_hercond_x.f
A SRC/zla_hercond_c.f

We have added routines to compute reciprocal pivot growth.

A SRC/sla_rpvgrw.f A SRC/sla_syrpvgrw.f A SRC/sla_porpvgrw.f A SRC/sla_gbrpvgrw.f
A SRC/cla_rpvgrw.f A SRC/cla_syrpvgrw.f A SRC/cla_porpvgrw.f A SRC/cla_gbrpvgrw.f
A SRC/dla_rpvgrw.f A SRC/dla_syrpvgrw.f A SRC/dla_porpvgrw.f A SRC/dla_gbrpvgrw.f
A SRC/zla_rpvgrw.f A SRC/zla_syrpvgrw.f A SRC/zla_porpvgrw.f A SRC/zla_gbrpvgrw.f
A SRC/cla_herpvgrw.f
A SRC/zla_herpvgrw.f

Finally, we have added a number of auxiliary routines that are necessary to support the above.

A SRC/sla_lin_berr.f A SRC/slascl2.f A SRC/sla_geamv.f A SRC/cla_heamv.f
A SRC/cla_lin_berr.f A SRC/clascl2.f A SRC/cla_geamv.f A SRC/zla_heamv.f
A SRC/dla_lin_berr.f A SRC/dlascl2.f A SRC/dla_geamv.f
A SRC/zla_lin_berr.f A SRC/zlascl2.f A SRC/zla_geamv.f
A SRC/slarscl2.f A SRC/sla_wwaddw.f A SRC/sla_syamv.f
A SRC/clarscl2.f A SRC/cla_wwaddw.f A SRC/cla_syamv.f
A SRC/dlarscl2.f A SRC/dla_wwaddw.f A SRC/dla_syamv.f
A SRC/zlarscl2.f A SRC/zla_wwaddw.f A SRC/zla_syamv.f

9.2. XBLAS, or the portable extra-precision and mixed-precision BLAS

All material relevant to the XBLAS is located at
http://www.netlib.org/xblas/. The XBLAS routines are necessary to build
the extra-precise refinement routines above.

The auxiliary routines that apply Householder reflections (e.g. DLARFB)
automatically reduce the cost of QR from O(n^3) to O(n^2 + nb^2) for band
matrices with bandwidth b stored in a dense format, with no user-interface
changes. Other users of these routines can see similar benefits, in
particular on “narrow profile” matrices.

High accuracy SVD routine for dense matrices, which can compute tiny singular
values to many more correct digits than SGESVD when the matrix has columns
differing widely in norm, and usually runs faster than SGESVD too.

A dgejsv.f A sgejsv.f
A dgesvj.f A sgesvj.f
A dgsvj0.f A sgsvj0.f
A dgsvj1.f A sgsvj1.f

SGESVJ implements one-sided Jacobi SVD with fast scaled rotations and de
Rijk’s pivoting. The code works as specified in the theory of Demmel and
Veselic. It uses a special implementation of Jacobi rotations, so that it can
compute the singular values in the full range [underflow, overflow]. SGESVJ
can be used standalone, and it is also the kernel routine in the expert
driver routine SGEJSV.
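The core idea of one-sided Jacobi SVD can be sketched in numpy as follows; this toy version omits the fast scaled rotations, de Rijk’s pivoting, and the careful scaling that make SGESVJ fast and robust in the full [underflow, overflow] range:

```python
# Bare-bones one-sided Jacobi SVD: rotate pairs of columns until all
# pairs are orthogonal; the column norms are then the singular values.
import numpy as np

def jacobi_svd(A, sweeps=30, tol=1e-14):
    U = np.asarray(A, dtype=float).copy()
    n = U.shape[1]
    V = np.eye(n)
    for _ in range(sweeps):
        converged = True
        for p in range(n - 1):
            for q in range(p + 1, n):
                alpha = U[:, p] @ U[:, p]
                beta = U[:, q] @ U[:, q]
                gamma = U[:, p] @ U[:, q]
                if abs(gamma) <= tol * np.sqrt(alpha * beta):
                    continue                      # pair already orthogonal
                converged = False
                zeta = (beta - alpha) / (2.0 * gamma)
                sgn = 1.0 if zeta >= 0 else -1.0
                t = sgn / (abs(zeta) + np.hypot(1.0, zeta))  # tan of angle
                c = 1.0 / np.hypot(1.0, t)
                s = c * t
                rot = np.array([[c, s], [-s, c]])
                U[:, [p, q]] = U[:, [p, q]] @ rot  # rotate the two columns
                V[:, [p, q]] = V[:, [p, q]] @ rot  # accumulate V
        if converged:
            break
    sigma = np.linalg.norm(U, axis=0)
    return U / sigma, sigma, V        # A == (U/sigma) * diag(sigma) * V^T

rng = np.random.default_rng(3)
A = rng.standard_normal((8, 5))
U, s, V = jacobi_svd(A)
assert np.allclose(U * s @ V.T, A)
assert np.allclose(sorted(s), sorted(np.linalg.svd(A, compute_uv=False)))
```

Because every transformation is an exactly orthogonal column rotation, small singular values are determined to high relative accuracy, which is the property the Demmel–Veselic theory formalizes.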

SGEJSV implements the preconditioned Jacobi SVD of Drmac and Veselic. This is
the expert driver routine that calls SGESVJ after certain preconditioning.
In addition to the standard level of accuracy, it also delivers relatively
accurate singular values, including the smallest ones, if the matrix A can
be diagonally scaled so that the scaled matrix C = A * D is well
conditioned. Relative accuracy is usually obtained on the broader class of
matrices A = D1 * C * D2, where D1 and D2 are arbitrary diagonal matrices
and C is well conditioned.
The computational range for the singular values can be set as the full range
(underflow, overflow), provided that the BLAS and LAPACK routines called by
xGEJSV are implemented to work in that range.

The driver routine SGEJSV can be set to work safely in a restricted range,
meaning that singular values of magnitude below ||A||_2 / SLAMCH('O') are
returned as zeros. This means that the software complies with the theory as
long as the condition number does not overflow.

Motivation for the RFP format: LAPACK's current full formats (PO, TR, HE)
waste half of the memory, while its current packed formats (PP, TP, HP) are
inefficient. In 2005 Gustavson introduced the rectangular full packed
format (PF, TF, HF), which uses half the full storage while maintaining
efficiency through Level 3 BLAS/LAPACK kernels.

The idea is simple: to store an N by N triangle (and suppose for simplicity
that N is even), partition the triangle into three parts: two N/2 by N/2
triangles and an N/2 by N/2 square, then pack this as an N by N/2
rectangle (or N/2 by N rectangle), by transposing (or transpose-conjugating)
one of the triangles and packing it next to the other triangle. Since the two
triangles are stored in full storage (PO, TR, HE), one can use existing
efficient routines on them. There are some subtleties and variants, but
everything works fine. The storage format Gustavson proposed has 8 possible
cases, depending on whether N is odd or even, whether the matrix is upper
or lower triangular, and whether the user stores the rectangle in normal or
(conjugate-)transposed form.
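The partition-and-transpose idea can be made concrete with a toy packing routine. The arrangement below (lower triangle, even N) is one self-consistent choice made up for this illustration; the actual RFP layout is one of the 8 cases above and is documented in the routine headers:

```python
# Toy version of the RFP idea: pack an N x N lower triangle (N even)
# into a rectangle of exactly N*(N+1)/2 entries, with no waste.
import numpy as np

def pack_lower(A):
    """Pack the lower triangle of A (N even) into an (N+1) x (N/2) rectangle."""
    N = A.shape[0]
    n1 = N // 2
    rect = np.zeros((N + 1, n1))
    rect[n1 + 1:, :] = A[n1:, :n1]                  # the square block A21
    for j in range(n1):
        rect[j + 1:n1 + 1, j] = A[j:n1, j]          # triangle A11, by columns
        rect[:j + 1, j] = A[n1 + j, n1:n1 + j + 1]  # triangle A22, transposed
    return rect

def unpack_lower(rect):
    """Invert pack_lower, rebuilding the full symmetric matrix."""
    n1 = rect.shape[1]
    N = 2 * n1
    A = np.zeros((N, N))
    A[n1:, :n1] = rect[n1 + 1:, :]
    for j in range(n1):
        A[j:n1, j] = rect[j + 1:n1 + 1, j]
        A[n1 + j, n1:n1 + j + 1] = rect[:j + 1, j]
    return np.tril(A) + np.tril(A, -1).T            # symmetrize

rng = np.random.default_rng(4)
N = 6
B = rng.standard_normal((N, N))
A = B + B.T                                  # symmetric test matrix
rect = pack_lower(A)
assert rect.size == N * (N + 1) // 2         # exactly the triangle's entries
assert np.allclose(unpack_lower(rect), A)
```

The payoff is that the two triangles and the square inside the rectangle are ordinary contiguous full-storage blocks, so Level 3 BLAS kernels apply to them directly.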

storing a matrix in RFP format

The RFP format is not that hard to understand. We refer the reader to the
headers of the routines, the paper in the reference above, or an
illustrative webpage [URL-3].

The routines CHFRK, ZHFRK, SSFRK and DSFRK directly build matrices
in RFP format. Once the matrix is constructed, the user can use it without
understanding the details of the format.

There are also conversion routines: xTPTTF goes from standard packed to
RFP, and xTRTTF goes from full to RFP. This approach is less interesting
than (1) or (2) since it requires storing two matrices.

RFP nomenclature

SF    xSFyy    symmetric
HF    xHFyy    Hermitian
PF    xPFyy    symmetric/Hermitian positive definite
TF    xTFyy    triangular

contribution

A SRC/chfrk.f A SRC/dlansf.f A SRC/slansf.f A SRC/zhfrk.f
A SRC/clanhf.f A SRC/dpftrf.f A SRC/spftrf.f A SRC/zlanhf.f
A SRC/cpftrf.f A SRC/dpftri.f A SRC/spftri.f A SRC/zpftrf.f
A SRC/cpftri.f A SRC/dpftrs.f A SRC/spftrs.f A SRC/zpftri.f
A SRC/cpftrs.f A SRC/dsfrk.f A SRC/ssfrk.f A SRC/zpftrs.f
A SRC/ctfsm.f A SRC/dtfsm.f A SRC/stfsm.f A SRC/ztfsm.f
A SRC/ctftri.f A SRC/dtftri.f A SRC/stftri.f A SRC/ztftri.f
A SRC/ctfttp.f A SRC/dtfttp.f A SRC/stfttp.f A SRC/ztfttp.f
A SRC/ctfttr.f A SRC/dtfttr.f A SRC/stfttr.f A SRC/ztfttr.f
A SRC/ctpttf.f A SRC/dtpttf.f A SRC/stpttf.f A SRC/ztpttf.f
A SRC/ctpttr.f A SRC/dtpttr.f A SRC/stpttr.f A SRC/ztpttr.f
A SRC/ctrttf.f A SRC/dtrttf.f A SRC/strttf.f A SRC/ztrttf.f
A SRC/ctrttp.f A SRC/dtrttp.f A SRC/strttp.f A SRC/ztrttp.f

functionality and interface

Cholesky factorization        CPFTRF  DPFTRF  SPFTRF  ZPFTRF   (TRANSR,UPLO,N,A,INFO)
Multiple solve after xPFTRF   CPFTRS  DPFTRS  SPFTRS  ZPFTRS   (TRANSR,UPLO,N,NRHS,A,B,LDB,INFO)
Inversion after xPFTRF        CPFTRI  DPFTRI  SPFTRI  ZPFTRI   (TRANSR,UPLO,N,A,INFO)
Triangular inversion          CTFTRI  DTFTRI  STFTRI  ZTFTRI   (TRANSR,UPLO,DIAG,N,A,INFO)
Sym/Herm matrix norm          CLANHF  DLANSF  SLANSF  ZLANHF   (NORM,TRANSR,UPLO,N,A,WORK)
Triangular solve              CTFSM   DTFSM   STFSM   ZTFSM    (TRANSR,SIDE,UPLO,TRANS,DIAG,M,N,ALPHA,A,B,LDB)
Sym/Herm rank-k update        CHFRK   DSFRK   SSFRK   ZHFRK    (TRANSR,UPLO,TRANS,N,K,ALPHA,A,LDA,BETA,C)
Conv. from TP to TF           CTPTTF  DTPTTF  STPTTF  ZTPTTF   (TRANSR,UPLO,N,AP,ARF,INFO)
Conv. from TR to TF           CTRTTF  DTRTTF  STRTTF  ZTRTTF   (TRANSR,UPLO,N,A,LDA,ARF,INFO)
Conv. from TF to TP           CTFTTP  DTFTTP  STFTTP  ZTFTTP   (TRANSR,UPLO,N,ARF,AP,INFO)
Conv. from TF to TR           CTFTTR  DTFTTR  STFTTR  ZTFTTR   (TRANSR,UPLO,N,ARF,A,LDA,INFO)
Conv. from TR to TP           CTRTTP  DTRTTP  STRTTP  ZTRTTP   (UPLO,N,A,LDA,AP,INFO)
Conv. from TP to TR           CTPTTR  DTPTTR  STPTTR  ZTPTTR   (UPLO,N,AP,A,LDA,INFO)

example of what one can do with this new release

ZHFRK( ) Form the normal equations (maybe update them)

ZPFTRF( ) Cholesky factorization of the normal equations

ZPFTRS( ) Solve the normal equations

which performs the same operations as ZHERK, ZPOTRF, ZPOTRS (with half the
storage for the normal equations).
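In full (non-RFP) storage, the same three-step computation looks as follows in numpy; the point of the RFP calls above is that they produce the same mathematical result while storing only half of the Hermitian matrix C:

```python
# Least squares via the normal equations, the computation the
# ZHFRK / ZPFTRF / ZPFTRS sequence performs in RFP storage.
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((10, 4)) + 1j * rng.standard_normal((10, 4))
b = rng.standard_normal(10) + 1j * rng.standard_normal(10)

C = A.conj().T @ A                        # ZHERK-like rank-k update: C = A^H A
L = np.linalg.cholesky(C)                 # ZPOTRF-like Cholesky: C = L L^H
y = np.linalg.solve(L, A.conj().T @ b)    # forward substitution
x = np.linalg.solve(L.conj().T, y)        # back substitution (ZPOTRS-like)

# x solves the least-squares problem min ||A x - b||_2
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(x, x_ref)
```

(The usual caveat applies: forming A^H A squares the condition number, so QR-based solvers are preferable for ill-conditioned least-squares problems.)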

Warning

The current RFP routines can hit INTEGER overflow since they access entries
like A( N*N ), and N*N (for very large N) may not be representable as a
32-bit INTEGER. (Note that the same restriction holds for the existing
LAPACK packed routines.) Use a 64-bit INTEGER and you’ll be covered.
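The cutoff is easy to check: N*N fits in a 32-bit INTEGER only up to N = 46340. A quick sketch (in Python, for illustration only):

```python
# Where the index N*N stops being representable as a 32-bit INTEGER.
import numpy as np

i32_max = np.iinfo(np.int32).max            # 2**31 - 1 = 2147483647
n_safe = int(np.floor(np.sqrt(i32_max)))    # largest N with N*N representable
assert n_safe == 46340
assert n_safe * n_safe <= i32_max
assert (n_safe + 1) * (n_safe + 1) > i32_max

# The same index is fine in 64-bit arithmetic:
n = np.int64(n_safe + 1)
assert n * n == 2147488281
```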

On platforms like the Cell processor that do single precision much faster than
double, linear systems can be solved many times faster. Even on commodity
processors there is a factor of 2 in speed between single and double
precision. The matrix types supported in this release are: GE (general), PO
(positive definite).

A SRC/VARIANTS/cholesky/RL/cpotrf.f
A SRC/VARIANTS/cholesky/RL/dpotrf.f
A SRC/VARIANTS/cholesky/RL/spotrf.f
A SRC/VARIANTS/cholesky/RL/zpotrf.f
A SRC/VARIANTS/cholesky/TOP/cpotrf.f
A SRC/VARIANTS/cholesky/TOP/dpotrf.f
A SRC/VARIANTS/cholesky/TOP/spotrf.f
A SRC/VARIANTS/cholesky/TOP/zpotrf.f
A SRC/VARIANTS/lu/CR/cgetrf.f
A SRC/VARIANTS/lu/CR/dgetrf.f
A SRC/VARIANTS/lu/CR/sgetrf.f
A SRC/VARIANTS/lu/CR/zgetrf.f
A SRC/VARIANTS/lu/LL/cgetrf.f
A SRC/VARIANTS/lu/LL/dgetrf.f
A SRC/VARIANTS/lu/LL/sgetrf.f
A SRC/VARIANTS/lu/LL/zgetrf.f
A SRC/VARIANTS/lu/REC/cgetrf.f
A SRC/VARIANTS/lu/REC/dgetrf.f
A SRC/VARIANTS/lu/REC/sgetrf.f
A SRC/VARIANTS/lu/REC/zgetrf.f
A SRC/VARIANTS/qr/LL/cgeqrf.f
A SRC/VARIANTS/qr/LL/dgeqrf.f
A SRC/VARIANTS/qr/LL/sceil.f
A SRC/VARIANTS/qr/LL/sgeqrf.f
A SRC/VARIANTS/qr/LL/zgeqrf.f

The version of dqds available in LAPACK-3.1.1 (released in February 2007) is
essentially the version of dqds available in LAPACK-3.0 (released in 1999).

The difference between the two versions is that the dqds in LAPACK-3.1.1 is
thread safe. Any shortcomings of the dqds algorithm in LAPACK-3.0 are
therefore still present in LAPACK-3.1.1. In LAPACK-3.2.0, we introduce the
latest version of dqds, produced by Beresford Parlett and Osni Marques in
2006. This version contains a number of improvements introduced after 1999.
In particular, the flipping mechanism has been improved. The original
motivation was a problem reported by one of Inderjit Dhillon’s students
(Univ. of Texas at Austin). As a result, the new version of dqds (call it
dqds-3.2) passes the tests with Cleve Moler’s matrix, while dqds-3.0 and
dqds-3.1 fail to converge on it within the maximum number of iterations
allowed. Cleve Moler’s matrix is at
[URL-1].
The single precision version of dqds-3.2 has a problem in the LAPACK
TESTING suite with one of the matrices of type 16, while dqds-3.0 and
dqds-3.1 were fine. The matrix is at
[URL-2]. To pass this matrix, we needed to set the IEEE LOGICAL variable in
SLASQ2 to .FALSE.. This is a temporary fix; we hope to have a better one in
the future.

There are interface changes for:
DLASQ3, DLASQ4, SLASQ3, SLASQ4.
We decided not to add another set of routines and simply
changed the interfaces.
We have removed the routines
DLAZQ3, DLAZQ4, SLAZQ3, SLAZQ4.
They were introduced in 3.1 for thread safety but are no longer needed.

Take into account the comments of Robert van de Geijn (Univ. of Texas at
Austin) concerning the interface of xLARFT. The interface labels V as
input/output, which is not at all convenient for multithreaded
implementations. xLARFT could be written in such a way that V is input
only. This also affects the thread friendliness of xORMQR and related
routines.

laundry list

Change the default Cholesky factorization in SRC from right-looking to left-looking,
move left-looking to the VARIANTS directory.

Add some recursive variants for QR and Cholesky.

Remove IEEE=.FALSE. in DQDS (DLASQ3, SLASQ3 of Osni).

Remove the possibility of INTEGER overflow in the RFP routines.

Since we have moved to Fortran 90, replace xLAMCH with the corresponding Fortran INTRINSIC functions.