A brief summary of routines is given here. Full descriptions of individual routines are given in the Modules section.

Driver & computational routines have a magma_ prefix. These are generally hybrid CPU/GPU algorithms. A suffix indicates in what memory the matrix starts and ends, not where the computation is done.

Suffix    Example              Description
(none)    magma_dgetrf         hybrid CPU/GPU routine where the matrix is initially in CPU host memory
_m        magma_dgetrf_m       hybrid CPU/multiple-GPU routine where the matrix is initially in CPU host memory
_gpu      magma_dgetrf_gpu     hybrid CPU/GPU routine where the matrix is initially in GPU device memory
_mgpu     magma_dgetrf_mgpu    hybrid CPU/multiple-GPU routine where the matrix is distributed across multiple GPUs' device memories
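For example, a minimal sketch contrasting the CPU and GPU interfaces for LU factorization (allocation, data transfer, and error checking omitted):

magma_int_t m, n, lda, ldda, info;
magma_int_t *ipiv;
double *A;             // matrix in CPU host memory
magmaDouble_ptr dA;    // matrix in GPU device memory
// ... allocate and fill A, dA, and ipiv ...
magma_dgetrf(     m, n, A,  lda,  ipiv, &info );  // CPU interface
magma_dgetrf_gpu( m, n, dA, ldda, ipiv, &info );  // GPU interface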

In general, MAGMA follows LAPACK's naming conventions. The base name of each routine has a one-letter precision (occasionally two letters), a two-letter matrix type, and usually a 2-3 letter routine name. For example, DGETRF is D (double precision), GE (general matrix), TRF (triangular factorization).

Auxiliary routines

These are additional BLAS-like routines, many originally defined in LAPACK. They follow a similar naming scheme: precision, then "la", then the routine name. MAGMA implements the common ones on the GPU, and adds a few of its own, such as symmetrize and transpose.

For auxiliary routines, the magmablas_ prefix indicates our own MAGMA implementation (e.g., magmablas_zlaswp). All MAGMA auxiliary routines are GPU native and take the matrix in GPU memory.

Utility routines

Memory allocation

MAGMA can use regular CPU memory allocated with malloc or new, but it may achieve better performance using aligned and, especially, pinned memory. There are typed versions of these (e.g., magma_zmalloc) that avoid the need to cast and use sizeof, and un-typed versions (e.g., magma_malloc) that are more flexible but require a (void**) cast and multiplying the number of elements by sizeof.
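For example, a brief sketch (error checking omitted):

magma_int_t n = 1000;
magmaDoubleComplex *dA, *hA;
magma_zmalloc( &dA, n*n );         // typed: GPU device memory
magma_zmalloc_pinned( &hA, n*n );  // typed: pinned CPU host memory
// untyped equivalent of the device allocation:
// magma_malloc( (void**) &dA, n*n * sizeof(magmaDoubleComplex) );
// ... use hA and dA ...
magma_free( dA );
magma_free_pinned( hA );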

Communication
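Data is transferred between CPU host memory and GPU device memory with set/get routines, such as magma_zsetmatrix and magma_zgetmatrix; a minimal sketch, assuming the MAGMA 2.x queue-based interface:

magma_zsetmatrix( m, n, hA, lda, dA, ldda, queue );  // copy CPU hA -> GPU dA
magma_zgetmatrix( m, n, dA, ldda, hA, lda, queue );  // copy GPU dA -> CPU hA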

Data types & complex numbers

Integers

MAGMA uses magma_int_t for integers. Normally, this is mapped to the C/C++ int type. Most systems today use the LP64 convention, meaning long and pointers are 64-bit, while int is 32-bit.

MAGMA also supports the ILP64 convention as an alternative, where int, long, and pointers are all 64-bit; in this case, magma_int_t is typedef'd to long long. To use ILP64, define MAGMA_ILP64 or MKL_ILP64 when compiling, and link with an ILP64 BLAS and LAPACK library; see make.inc.mkl-ilp64 for an example.
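Because the width of magma_int_t depends on the convention, a portable way to print it is to cast explicitly; a small sketch:

magma_int_t n = 10000;
printf( "n = %lld\n", (long long) n );  // correct under both LP64 and ILP64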

Complex numbers

MAGMA supports complex numbers. Unfortunately, there is not a single standard for how to implement complex numbers in C/C++. Fortunately, most implementations are identical on a binary level, so you can freely cast from one to another. The MAGMA types are: magmaFloatComplex, which in CUDA MAGMA is a typedef of cuFloatComplex, and magmaDoubleComplex, which in CUDA MAGMA is a typedef of cuDoubleComplex.

For C, we provide macros to manipulate complex numbers. For C++ support, include the magma_operators.h header, which provides overloaded C++ operators and functions.

C macro                  C++ operator    Description
c = MAGMA_*_MAKE(r,i)                    create complex number from real & imaginary parts
r = MAGMA_*_REAL(a)      r = real(a)     return real part
i = MAGMA_*_IMAG(a)      i = imag(a)     return imaginary part
c = MAGMA_*_NEGATE(a)    c = -a;         negate
c = MAGMA_*_ADD(a,b)     c = a + b;      add
c = MAGMA_*_SUB(a,b)     c = a - b;      subtract
c = MAGMA_*_MUL(a,b)     c = a * b;      multiply
c = MAGMA_*_DIV(a,b)     c = a / b;      divide
c = MAGMA_*_CNJG(a)      c = conj(a)     conjugate
r = MAGMA_*_ABS(a)       r = fabs(a)     2-norm, sqrt( real(a)^2 + imag(a)^2 )
r = MAGMA_*_ABS1(a)      r = abs1(a)     1-norm, abs(real(a)) + abs(imag(a))

Constant            Description
c = MAGMA_*_ZERO    zero
c = MAGMA_*_ONE     one
c = MAGMA_*_NAN     not-a-number (e.g., 0/0)
c = MAGMA_*_INF     infinity (e.g., 1/0, overflow)

where * is one of the four precisions, S D C Z.
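For example, with the Z (double-complex) precision (a brief sketch; include magma_operators.h for the C++ forms):

magmaDoubleComplex a = MAGMA_Z_MAKE( 1.0, 2.0 );   // a = 1 + 2i
magmaDoubleComplex b = MAGMA_Z_MAKE( 3.0, -1.0 );  // b = 3 - i
magmaDoubleComplex c = MAGMA_Z_MUL( a, b );        // C macro: c = 5 + 5i
c = a * b;                                         // equivalent C++ operator
double r = MAGMA_Z_REAL( c );                      // r = 5.0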

Conventions for variables

Here are general guidelines for variable names; there are of course exceptions to these.

LAPACK and MAGMA use column-major matrices. For matrix X with dimension (lda,n), element X(i, j) is X[ i + j*lda ]. For symmetric, Hermitian, and triangular matrices, only the lower or upper triangle is accessed, as specified by the uplo argument; the other triangle is ignored.

lda is the leading dimension of matrix A; similarly ldb for B, ldda for dA, etc. It should immediately follow the matrix pointer in the argument list. The leading dimension can be the number of rows, or, if A is a sub-matrix of a larger parent matrix, lda is the leading dimension (i.e., the number of rows) of the parent matrix.

On the GPU, it is often beneficial to round the leading dimension up to a multiple of 32, to provide better performance. This aligns memory reads so they are coalesced. This is provided by the magma_roundup function:

ldda = magma_roundup( m, 32 );

The formula ((m + 31)/32)*32 also works, relying on floored integer division, but the roundup function is clearer to use.

On the CPU, it is often beneficial to ensure that the leading dimension is not a multiple of the page size (often 4 KiB), to minimize TLB misses.
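Putting these together, a typical device allocation might look like the following sketch:

magma_int_t m = 1000, n = 500;
magma_int_t ldda = magma_roundup( m, 32 );  // ldda = 1024
magmaDoubleComplex *dA;
magma_zmalloc( &dA, ldda*n );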

For vectors, incx is the increment or stride between elements of vector x. In all cases, incx != 0. In most cases, if incx < 0, then the vector is indexed in reverse order, for instance, using Matlab notation,

incx = 1 means x( 1 : 1 : n )
incx = 2 means x( 1 : 2 : 2*n-1 )

while

incx = -1 means x( n : -1 : 1 )
incx = -2 means x( 2*n-1 : -2 : 1 )

For several routines (amax, amin, asum, nrm2, scal), the order is irrelevant, so negative incx are not allowed; incx > 0.

MAGMA includes functions, lapack_xyz_const(), which take MAGMA's integer constants and return LAPACK's string constants, where xyz is a MAGMA type such as uplo, trans, etc. From the standpoint of LAPACK, only the first letter of each string is significant. Nevertheless, the functions return meaningful strings, such as "No transpose", "Transpose", "Upper", "Lower", etc. Similarly, there are functions to go from MAGMA's integer constants to CBLAS, OpenCL's clBLAS, and CUDA's cuBLAS integer constants.

There are also functions, magma_xyz_const(), to go in the opposite direction, from LAPACK's string constants to MAGMA's integer constants.
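For example (a sketch; the exact prototypes are declared in magma_types.h):

const char*   s = lapack_trans_const(  MagmaNoTrans );  // "No transpose"
char          c = lapacke_trans_const( MagmaNoTrans );  // 'N'
magma_trans_t t = magma_trans_const( 'N' );             // MagmaNoTrans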

The most common constants are those defined for BLAS routines:

enum { MagmaNoTrans, MagmaTrans, MagmaConjTrans } magma_trans_t

Whether a matrix is not transposed, transposed, or conjugate-transposed. For a real matrix, Trans and ConjTrans have the same meaning.

enum { MagmaLower, MagmaUpper, MagmaFull } magma_uplo_t

Whether the lower or upper triangle of a matrix is given, or the full matrix.

enum { MagmaLeft, MagmaRight } magma_side_t

Whether the matrix is on the left or right.

enum { MagmaUnit, MagmaNonUnit } magma_diag_t

Whether the diagonal is assumed to be unit (all ones) or not.

Additional constants for specific routines are defined in the documentation for the routines.
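For example, these constants appear as arguments in BLAS-style calls; a sketch assuming the MAGMA 2.x queue-based interface:

// dC = alpha*dA*dB^T + beta*dC on the GPU
magma_dgemm( MagmaNoTrans, MagmaTrans, m, n, k,
             alpha, dA, ldda,
                    dB, lddb,
             beta,  dC, lddc, queue );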

Because MAGMA, CBLAS, LAPACK, cuBLAS, and clBLAS use potentially different constants, converters between them are provided, as described above. Substitute lapacke for lapack (e.g., lapacke_trans_const) to get a version that returns a single char instead of a string (const char*).

Errors

Driver and computational routines, and a few BLAS/auxiliary routines, currently return errors both as a return value and in the info argument. The return value and info should always be identical. In general, the meaning is as given in this table. Predefined error codes are large negative numbers. Using the symbolic constants below is preferred, but the numeric values can be found in include/magma_types.h.

Info                         Description
info = 0 (MAGMA_SUCCESS)     Successful exit
info < 0, but small          For info = -i, the i-th argument had an illegal value
info > 0                     Function-specific error such as singular matrix
MAGMA_ERR_DEVICE_ALLOC       Could not allocate GPU device memory
MAGMA_ERR_HOST_ALLOC         Could not allocate CPU host memory
MAGMA_ERR_ILLEGAL_VALUE      An argument had an illegal value (deprecated; instead it should return -i to say the i-th argument was bad)
MAGMA_ERR_INVALID_PTR        Can't free pointer
MAGMA_ERR_NOT_IMPLEMENTED    Function or option not implemented
MAGMA_ERR_NOT_SUPPORTED      Function or option not supported on the current architecture

magma_xerbla() is called to report errors (mostly bad arguments) to the user.
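A typical check, using magma_strerror to convert an error code to a descriptive string:

magma_int_t info = 0;
magma_dgetrf( m, n, A, lda, ipiv, &info );
if ( info != 0 ) {
    fprintf( stderr, "magma_dgetrf failed: %s (info = %lld)\n",
             magma_strerror( info ), (long long) info );
}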

Methodology

One-sided matrix factorizations

The one-sided LU, Cholesky, and QR factorizations form a basis for solving linear systems. A general recommendation is to use LU for general n-by-n matrices, Cholesky for symmetric/Hermitian positive definite (SPD) matrices, and QR for solving least squares problems,

min || A x - b ||

for general m-by-n matrices with m > n.

We use hybrid algorithms where the computation is split between the GPU and the CPU. In general, for the one-sided factorizations, the panels are factored on the CPU and the trailing sub-matrix updates are done on the GPU. Look-ahead techniques are used to overlap the CPU and GPU work and some of the communication.

In both the CPU and GPU interfaces, the matrix to be factored resides in GPU memory, and CPU-GPU transfers are associated only with the panels. The resulting factorization is accumulated (on the CPU or GPU, according to the interface) during the computation, as a byproduct of the algorithm, rather than by sending the entire matrix when needed. In the CPU interface, the initial transfer of the matrix to the GPU is overlapped with the factorization of the first panel. In this sense the CPU and GPU interfaces, although similar, are not derivatives of each other, as they have different communication patterns.

Although the solution step requires O(n) times fewer floating-point operations than the factorization, it is still very important to optimize it. Solving a triangular system of equations can be very slow because the computation is bandwidth limited and naturally offers little parallelism. Various approaches have been proposed in the past. We use an approach where diagonal blocks of A are explicitly inverted and used in a block algorithm. This results in a high-performance, numerically stable algorithm, especially when used with triangular matrices coming from numerically stable factorization algorithms (e.g., as in LAPACK and MAGMA).

When the GPU's single precision performance is much higher than its double precision performance, MAGMA provides a second set of solvers, based on the mixed precision iterative refinement technique. These solvers are again based on the LU, QR, and Cholesky factorizations, respectively, and are designed to solve linear systems to double precision accuracy, but at a speed characteristic of the much faster single precision arithmetic. The idea is to use single precision for the bulk of the computation, namely the factorization step, and then use that factorization as a preconditioner in a simple iterative refinement process in double precision arithmetic. This often yields solvers that are both fast and accurate.
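Schematically, with a single-precision factorization A ≈ L*U used as the preconditioner, the refinement iterates as follows (a sketch of the idea; in MAGMA this corresponds to routines such as magma_dsgesv_gpu for LU):

x = solve( L*U, b )          // initial solve, single precision factors
repeat until converged:
    r = b - A*x              // residual, computed in double precision
    d = solve( L*U, r )      // correction, single precision factors
    x = x + d                // update, double precision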

Two-sided matrix factorizations

Just as the one-sided matrix factorizations are the basis for various linear solvers, the two-sided matrix factorizations are the basis for eigensolvers, and therefore form an important class of dense linear algebra routines. It has traditionally been more difficult to achieve high performance in the two-sided factorizations, because they involve large matrix-vector products, which are memory bound; as the gap between compute power and memory bandwidth widens, these memory-bound operations become an increasingly difficult bottleneck. GPUs, however, offer an attractive possibility to accelerate them: having high memory bandwidth (e.g., 10 times larger than current CPU bus bandwidths), GPUs can accelerate matrix-vector products significantly (10 to 30 times). Here, the panel factorization itself is hybrid, while the trailing matrix update is performed on the GPU.