Many problems in science and engineering, particularly those related to partial differential equations (PDEs), can be represented by a linear system where only a few elements in the matrix are non-zero. For these systems, it would be wasteful, in both storage and computation, to represent all of the elements. To address these common problems, storage formats and methods have been developed to solve sparse matrices with minimal memory and computation requirements. These methods can be broken into two main categories: direct methods and iterative methods.

Direct methods, common for dense matrices, attempt to solve system in a two-step process. Typical algorithms include LU and QR factorization where the linear system is transformed into an equivalent system that can be solved using Gaussian elimination. Direct methods can also be applied to sparse systems but algorithms become increasingly complex in an attempt to minimize storage and computation.

The other class of sparse solvers, and those currently implemented in CULA Sparse, are iterative methods. These methods attempt to coverage on solution to the system \(Ax=b\) by continuously iterating over new solutions until a solution’s residual, typically defined as \(||b-Ax||/||b||\), is under a given tolerance. At each step, a solution is calculated using a technique specific to the given algorithm. Because it is possible for iterative methods to fail to find a solution, they are commonly configured with a maximum number of iterations.

A common method to improve the speed at which a solution converges is called preconditioning. These methods attempt, at each iteration, to transform the original linear system into a new equivalent system that can more readily be solved. This adds overhead, in both memory and time per iteration, but will often result in a shorter end-to-end solution time.

CULA Sparse intends to support the full range of operating systems that are supported by CUDA. Installers are currently available for Windows, Linux, and MAC OS X in 32-bit and 64-bit versions. CULA Sparse has been tested on the following systems:

Windows XP / Vista / 7

Ubuntu Linux 10.04 (and newer)

Red Hat Enterprise Linux 5.7 (and newer)

Fedora 16

Mac OSX 10.6 Snow Leopard / 10.7 Lion

Please provide feedback on any other systems on which you attempt to use CULA Sparse. Although we are continually testing CULA Sparse on other systems, at present we officially support the above list. If your system is not listed, please let us know through the provided feedback channels.

This work has been made possible by the NASA Small Business Innovation Research (SBIR) program. We recognize NVIDIA for their support.

CULA Sparse is built on NVIDIA CUDA and NVIDIA CUSPARSE.

CULA Sparse uses COLAMD, covered by the GNU LGPL license. The source code for COLAMD is used in an unmodified fashion; a copy of this code is distributed in the src/suitesparse directory of this package.

While not required, CULA Sparse can optionally utilize CUDA-enabled GPUs to perform accelerated linear algebra operations. To use these functions an NVIDIA GPU with CUDA support is required. A list of supported GPUs can be found on NVIDIA’s CUDA Enabled webpage.

Installation is completed via the downloadable installation packages. To install CULA Sparse, refer to the section below that applies to your system.

Windows

Run the CULA Sparse installer and when prompted select the location to which to install. The default install location is c:\Program Files\CULA\S#, where S# represents the release number of CULA Sparse.

Linux

It is recommended that you run the CULA Sparse installer as an administrator in order to install to a system-level directory. The default install location is /usr/local/culasparse.

Mac OS X Leopard

Open the CULA Sparse .dmg file and run the installer located inside. The default install location is /usr/local/culasparse.

Note

You may wish to set up environment variables to common CULA Sparse locations. More details are available in the Configuring Your Environment chapter.

CULA Sparse presents two main C headers, cula_sparse.h and cula_sparse_legacy.h. You must include one of these headers in your C source file to use CULA Sparse. These main headers rely on three supplemental headers, cula_sparse_common.h, cula_sparse_options.h, and cula_sparse_defs.h. These files are included automatically by the main C headers and shouldn’t typically be included directly by users.

CULA Sparse provides a link-time stub library, but is otherwise built as a shared library. Applications should link against the following libraries:

Windows

Choose to link against cula_sparse.lib as a link-time option.

Linux / Mac OS X Leopard

Add -l-lcula_sparse to your program’s link line.

CULA Sparse is built as a shared library, and as such it must be visible to your runtime system. This requires that the shared library is located in a directory that is a member of your system’s runtime library path. For more detailed information regarding operating-system-specific linking procedures, please refer to the Configuring Your Environment chapter.

The CULA Sparse’s example projects are a good resource for learning how to set up CULA Sparse for your own project.

Note

CULA Sparse is built against NVIDIA CUDA 5.0 and ships with a copy of the CUDA 5.0 redistributable files. If you have a different version of CUDA installed, you must ensure that the CUDA runtime libraries shipped with CULA Sparse are the first visible copies to your CULA Sparse program. This can be accomplished by placing the CULA Sparsebin path earlier in your system PATH than any CUDA bin path. If a non-CUDA 5.0 runtime loads first, you will experience CULA Sparse errors.

CULA Sparse is also built against Intel OpenMP 5 and ships with a copy of the Intel OpenMP 5 runtime redistributable file.

This chapter describes, at a high level, how to use the CULA Sparse API. Basic information about how to initialize and configure an iterative solver is discussed. Furthermore, we introduce how to collect and interpret the results from the iterative solvers as well any error condition that may occur.

To convey the status of a given function call, all CULA Sparse functions return a culaSparseStatus code.

The culaSparseStatus return code is unique to the CULA Sparse library. It is a high level status code used to indicate if the associated call has completed successfully. It conveys non-specific information such as:

Success condition

Non-convergence condition

Parameter errors

GPU runtime errors

Out of memory errors

Additional information regarding the last status code may be obtain by examining the string returned by culaSparseGetLastStatusString. This routine will return a string with in-depth information regarding the actual error. For example, on an argument error it will report which parameter is in error and why.

The CULA Sparse library is initialized by calling the culaSparseCreate() function to create a library handle. This handle provides serialized thread-safe access and is used as input to all CULA Sparse library functions. When using CULA Sparse on multiple host threads, it is recommended to create a handle for each thread.

Prior to CULA Sparse version S5, the legacy interface was the only interface available and was named cula_sparse.h. However, this was succeeded by the plan interface found in the current version of cula_sparse.h. The plan interface offers a vastly reduced number of functions, improved performance, and better control compared to the legacy interfaces. The old-style interface is still available in the cula_sparse.h header.

The general usage pattern of the plan interface requires the user to create a “plan” object which is then associated with a platform, data format, solver method, and preconditioning strategy. One all options have been set, the plan is executed with the specified settings. The plan interface offers the benefit that data is cached within the plan. For example, if a plan with the ilu0 preconditioner is executed with the gmres solver, changing the plan to be associated with the bicgstab solver will not require the ilu0 preconditioner to be regenerated.

In the legacy interface, all state is encapsulated within a single function call. While this is fundamentally simple to use, the increased flexibility of the plan interface is lost.

The first step of using a plan is to associate with an execution platform.

In CULA Sparse, the Host and Cuda platform (e.g. culaSparseSetHostPlatform() and culaSparseSetCudaPlatform()) will interpret pointers to matrices and vectors as data allocated on the host; that is, data allocated with malloc(), new, or std::vector. For the Cuda platform, CULA Sparse will automatically transfer the data to-and-from the accelerated device. This step is performed concurrently to other operations and in non-trivially small problems yields almost no performance impact for this transfer.

Alternatively, the CudaDevice platform (e.g. culaSparseSetPlatform()) only accept pointers to matrix and vector data allocated on a CUDA GPU; that is, data allocated with cudaMalloc(). When using this data interfaces, be sure that the deviceId field in the :c:func`culaSparseCudaDeviceOptions` options structure matches the CUDA device where allocation has occurred. This data interface is typically used when a sparse matrix has been assembled previously on the GPU. It is not recommended that users manager their own data in an attempt to gain performance.

CULA Sparse currently supports three major matrix storage formats: coordinate, compressed row, and compressed column. It is recommended to use the compressed row format when possible as internally, all input formats are converted to this format.

In the coordinate format, a sparse matrix is represented by three vectors sized by the number of non-zero elements of the system. The common abbreviation for this length, \(nnz\) is used throughout the API and this document. In CULA Sparse, these vectors do not need to be sorted in any fashion, though higher performance is possible if they are sorted by row indices.

In the compressed sparse row format, the row index vector is replaced by a row pointer vector of size \(m+1\). This vector now stores the locations of values that start a row; the last entry of this vector points to one past the final data element. The column index is as in COO format.

Consider the same 3x3 matrix, a zero-indexed CSR format can be represented by three vectors:

In the compressed sparse column format, the column index vector is replaced by a column pointer vector of size \(n+1\). This vector now stores the locations of values that start a column; the last entry of this vector points to one past the final data element. The row index is as in COO format. In CULA Sparse, the row indices must be sorted on a per-column basis.

Consider the same 3x3 matrix, a zero-indexed CSC format can be represented by three vectors:

In the matrix free format, the user provides a pointer to function that performs \(x' = A*x\) instead of explicitly calculating the value. This gives support for a number of exotic matrix storage formats. When using this format, only user definied preconditioners are available.

The complex types provided by CULA Sparse are simple structs containing real and imaginary floating point elements. Additional steps are taken to insure the values are 16-byte aligned as required by many vector processing units.

It is possible to utilize C99 complex values for the Host and Cuda platforms or NVIDIA’s cuComplex when using the CudaDevice platform. The pointers will simply need to be cast to the appropriate culaSparseCompelxFloat or culaSparseComplexDouble type when passed in the CULA Sparse API.

All execution routines within CULA Sparse utilize a common configuration parameter to steer the execution of the solver. The configuration parameter is represented by the culaSparseConfig structure and is the first parameter to any solver function within the library. The configuration parameter informs the API of the desired solve by specifying:

The relative tolerance at which a solve is considered as converged (relative to r0)

The absolution tolerance at which solve is considered as converged

The divergent tolerance at which solve is considered as diverged (relative to r0)

The maximum number of iterations to attempt

The maximum time to execute the solve

Per-iterative result vector

More parameters may be added in the future.

Configuration parameters must be set up before a solver can be called. The configuration parameter is initialized by the culaSparseConfigInit() function. This function ensures that all parameters within this structure are set to reasonable defaults. After calling this function, you may set specific parameters to your needs.

Preconditioner selection is more of an art - the method chosen is tightly coupled to the specifics of the linear system. That is, the computational tradeoffs of generation and application of the preconditioner will be different for different systems.

Using the plan interface, a preconditioner is chosen the culaSparseSet{name}Preconditioner() routines.

Once a plan has been assembled, it can be executed using the culaSparseExecutePlan() routine. Note that no computation, other than parameter checking, is performed until the plan is executed. This routine is responsible for the actual data management, preconditioner generation, and the iterative solve.

For example:

#include <cula_sparse.h>// information returned by the solverculaSparseResultsresults;// execute the planculaSparseStatusstatus=culaSparseExecutePlan(handle,plan,&config,&results);

A flag that denotes why the solver converged, diverged, or failed to converge

The number of iterations performed, regardless of whether this led to convergence

The solution residual when the solve ended

Timing information for the overhead, preconditioner generation, and solving

This structure is returned by the culaSparseExecutePlan routine.

CULA Sparse also provides a utility function for culaSparseResult, which is called culaSparseGetResultString(), that constructs a readable string of information that is suitable for printing. In many cases, this function is able to provide information beyond that which is available by inspection of the structure, and so it is recommended that it be used whenever attempting to debug a solver problem.

// allocate result string bufferconstintbufferSize=256;charbuffer[bufferSize];// fill buffer with result stringculaIterativeResultString(handle,&result,buffer,bufferSize);// print result string to screenprintf("%s\n",buffer);

In CULA Sparse, the culaSparseNonConvergence return status is used to describe any condition for which the solver executed, but failed to converge on solution within the given configuration parameters.

Possible reasons for returning a culaSparseNonConvergence include:

Maximum number of iterations exceeded without reaching desired tolerance

An internal scalar quantity became too large or small to continue

The method stagnated

Maximum runtime was

These possible reasons are enumerated by the flag field of the culaSparseFlag structure.

In some cases, a user may wish to get the best possible result in a fixed number of iterations. In such a case, a status can be interpreted as success if and only if the flag is culaSparseMaxIterationsReached. The residual should then be consulted to judge the quality of solution obtained in the budgeted number of iterations.

For convenience, the culaSparseResult result structure contains high precision timing information regarding the runtime of the iterative solver. This timing is broken down into three major components:

Overhead - This includes memory allocations, transfers to-and-from the GPU, and internal operations such as storage type conversion.

Preconditioner - The time taken to generate the requested preconditioner. This does not include the per-iteration time to apply the preconditioner; the per-iteration time of a preconditioner is included in the solver time.

Solver - This represents the time spent in the actual iteration loop

Additionally, the total time is returned which is a sum of these three values. All values are in seconds.

For some users, it may be desirable to capture the relative residual of the solver at each iteration. CULA Sparse provides a mechanism to obtain this information, via a parameter in the configuration structure. This parameter, result.byIteration is normally set to NULL, but may be assigned by the user to specify an array into which the residual by iteration should be stored. It is up to the user to ensure that this location has enough memory to store the residual for each iteration; in practice this is achieved by ensuring that the specified array is long enough to store one double precision value for each requested iteration.

This function describes the various data types defined in the CULA Sparse library. These types are accepted as arguments by or returned by various interface functions in the library, which will be described later.

The culaSparseFlag data type is an output enumeration containing a set of all possible convergence and non-convergence cases that can be returned by the iterative solver routines. Possible elements include:

Flag Value

Meaning

culaSparseConvergedToRelative

The solver converged to the specified relative tolerance

culaSparseConvergedToAbsolute

The solver converged to the specified absolute tolerance

culaSparseDiverged

The solver diverged by exceeding the specified divergence tolerance

culaSparseMaxIterationsReached

Maximum iterations reached without convergence

culaSparseMaxTimeReached

Maximum execution time reached without convergence

culaSparseStagnation

The solver stagnated

culaSparseBreakdown

An intermediate scalar value was out of range

culaSparseInterrupted

The routine was manually interrupted

culaSparseSolveNotExecuted

The solver was not executed

culaSparseUnknownIterationError

An unknown iteration error was encountered

For more information about various failure status conditions, see the Common Errors chapter.

The culaIterativeConfig data type is an input structure that contains information that steers execution of iterative functions with the following fields:

Name

Type

Description

relativeTolerance

double

The relative tolerance for which convergence is achieved.

absoluteTolerance

double

The absolute tolerance for which convergence is achieved.

diverganceTolerance

double

The relative tolerance of the residual norm for which the iterative
solver will determine the server is diverging.

maxIterations

int

The maximum number of iterations that the solver will attempt.

maxRuntime

int

The maximum time, in seconds, at which the solver will not begin a new
iteration. If set to 0, the solver will run until convergence or the
maximum number of iterations has been reached.

residualVector

double*

This parameter provides the means for a user to capture the residual at
each iteration. The specified array must be at least maxIter in length.
This parameter may be NULL if these quantities are not desired.

useInitialResultVector

int

Indicates whether the ‘x’ vector in iterative
solves should be used as given or ignored. When
ignored, the ‘x’ vector is considered a zero.

useBestAnswer

int

Indicates whether the ‘x’ vector in iterative solves should return the
final answer or the best answer and its associated iteration number in
the case of non-convergence.

useStagnationCheck

int

Indicates whether to check whether the iterative solve stagnated. This
option defaults to on; turning this option off will increase performance
if a problem is certain not to stagnate.

The culaSparseReordering data type is an enum that specifies a reordering strategy for the input data. For some matrices, reordering can introduce additional parallelism that can allow the solver or preconditioner to execute more efficiently on a parallel device.

Platform, solver, and preconditioner options structures allow you to steer the execution of a given platform, solver, and preconditioner. For documentation on individual options structures, see the corresponding solver or preconditioner section.

Initializing these structures is done with a method that matches the name of the associated solver or preconditioner with an Init appended.

Several options structures have reserved parameters. These structures are implemented in this way so as to maintain uniformity in the solver parameter list and to provide compatibility for possible future code changes. We recommend that you make sure to call the options initialization function (as shown above) for all options in the case that any parameters are added to it in the future. For routines that take an option structure, a NULL pointer may be passed, in which case reasonable defaults will be assigned.

CULA Sparse supports a number of processing platforms with different performance and data locality charactertics. Using the plan interface, this preference is set using a function call. These routines take a platform option parameter specific to the platform and have the form of culaSparseSet{Platform}(...) where the value enclosed by :c:type`{Platform}`s are one of the values outlined below.

Selects a reordering method to apply to the input matrix prior to any processing. This operation is applied to a copy of the matrix rather than actual input data.

deviceId

int

Select the CUDA device for which all operations will take place. The device identifiers are enumerated by the CUDA API. The device id must be the same as where memory allocation took place using cudaMalloc.

useHybridFormat

int

Optionally use the hybrid matrix optimization which may result in higher performance for many matrix structures at the expense of higher memory usage.

This data format utilizes a user supplied function that returns \(x' = A*x\) or \(x' = A^t*x\) opposed to explicitly performing the multiplication. This is achieved through a C function pointer that is called whenever an iterative solver needs to perform matrix-vector multiplication. As described in the following table, the function pointer must accept a transpose character, a constant input vector, an output vector, and a void pointer that may contain user data (the parameter names provided in the function pointer are just suggestions and don’t have to match the user provided function).

Preconditioning is an additional step to aid in convergence of an iterative solver. This step is simply a means of transforming the original linear system into one which has the same solution, but which is most likely easier to solve with an iterative solver. Generally speaking, the inclusion of a preconditioner will decrease the number of iterations needed to converge upon a solution. Proper preconditioner selection is necessary for minimizing the number of iterations required to reach a solution. However, the preconditioning step does add additional work to every iteration as well as up-front processing time and additional memory. In some cases this additional work may end up being a new bottleneck and improved overall performance may be obtained using a preconditioner that takes more steps to converge, but the overall runtime is actually lower. As such, we recommend analyzing multiple preconditioner methods and looking at the total runtime as well the number of iterations required.

This chapter describes several preconditioners, which have different parameters in terms of effectiveness, memory usage, and setup/apply time.

This preconditioner is a natural fit for systems with multiple physical variables that have been grouped into blocks.

The Block Jacobi preconditioner requires more computation in both generation and application than the simpler Jacobi preconditioner. However, both generation and application are parallel operations that map well to the GPU.

The Block Jacobi preconditioner is associated with a plan by calling the culaSparseSetBlockPreconditioner with the culaSparseBlockJacobiOptions input struct parameter (see Options Structures).

The ILU0 preconditioner is lightweight in generation, and due to the zero-fill component, requires roughly the same memory as the linear system trying to be solved. In application, the ILU0 preconditioner requires two triangular solve routines - a method not well suited for parallel processing platforms such as the GPU or multicore processors. As such, using the ILU0 preconditioner may result in a reduced number of iterations at the cost of a longer runtime.

In comparison to Jacobi, the construction and application time and the memory requirements are higher for ILU0. For some matrices, the ILU0 might result in significantly improved convergence, which can offset the costs.

In order to successfully complete the factorization, the input matrix must have a diagonal entry in every row, and it must be nonzero. Failure to meet this criteria will result in a culaSparsePreconditionerError code.

The ILU0 preconditioner is associated with a plan by calling the culaSparseSetIlu0Preconditioner with the culaSparseIlu0Options input struct parameter (see Options Structures).

The factorized approximate inverse preconditioner is an incomplete approximation of the inverse of a symmetric positive definite input matrix, where \(A*M \approx I\) and \(M\) has a sparsity pattern as specified by culaSparseSparsityPattern. Fundamentally, the preconditioner attempts to minimize the Frobenius norm:

\[\|AM-I\|^2_F\]

The approximate inverse preconditioner is heavyweight in generation. However, the algorithm expresses per-column parallelism and therefor can be generated efficiently. Unlike ilu based preconditioners, this preconditioner is applied using matrix-vector multiplication which can reach very high performance on GPU based systems.

The approximate inverse preconditioner is associated with a plan by calling the culaSparseSetAinvPreconditioner function with the culaSparseAinvOptions input struct parameter (see Options Structures).

pattern of the generated preconditioner; defaults to match the input matrix

dropTolerance

double

absolute threshold for which small values are dropped from the preconditioner; this can help remove insignificantly small values from the preconditioner and therefor speed up application; set to zero for no dropping

The approximate inverse preconditioner is an incomplete approximation of the inverse of the input matrix, where \(M = LL^T \approx I\) and \(M\) has a sparsity pattern as specified by culaSparseSparsityPattern. Fundamentally, the preconditioner attempts to minimize the Frobenius norm:

\[\|LL^T - I\|_F\]

Unlike the standard approximate inverse preconditioner, this method generates a symmetric positive definite preconditioner appropriate for use with the conjugate gradient solver.

The factorized approximate inverse preconditioner is fairly heavyweight in generation. However, the algorithm expresses per-column parallelism and therefor can be generated efficiently. Unlike ilu based preconditioners, this preconditioner is applied using matrix-vector multiplication which can reach very high performance on GPU based systems.

The approximate inverse preconditioner is associated with a plan by calling the culaSparseSetAinvPreconditioner function with the culaSparseAinvOptions input struct parameter (see Options Structures).

The factorized approximate inverse preconditioner is associated with a plan by calling the culaSparseSetFainvPreconditioner function with the culaSparseFainvOptions input struct parameter (see Options Structures).

pattern of the generated preconditioner; defaults to match the input matrix

dropTolerance

double

absolute threshold for which small values are dropped from the preconditioner; this can help remove insignificantly small values from the preconditioner and therefor speed up application; set to zero for no dropping

The user defined preconditioner allows the user to provide a callback function to preconditioner an input vector such that \(z' = M^-1*z\).

This is achieved through a C function pointer that is called whenever an iterative solver needs to apply the preconditioner. As described in the following table, the function pointer must accept a transpose character, a constant input vector, an output vector, and a void pointer that may contain user data (the parameter names provided in the function pointer are just suggestions and don’t have to match the user provided function).

This family of functions attempt to solve \(Ax=b\) using the conjugate gradient (CG) method where \(A\) is a symmetric positive definite matrix stored in a sparse matrix format and \(x\) and \(b\) are dense vectors. The matrix must be a fully populated symmetric; i.e. for each populated entry \(A_{ij}\) there must be an identical entry \(A_{ji}\).

Solver Trait

Value

matrix class

Symmetric Positive Definite

memory overhead

\(6n\)

The conjugate gradient solver is associated with a plan by calling the culaSparseSetCgSolver with the culaSparseCgOptions input struct parameter (see Options Structures).

This family of functions attempt to solve \(Ax=b\) using the conjugate gradient (BiCG) method where \(A\) is a square matrix stored in a sparse matrix format format and \(x\) and \(b\) are dense vectors. While BiCG may converge for general matrices, it is mathematically most suitable for symmetric systems that are not positive definite. For symmetric positive definite systems, this method is identical to but considerably more expensive than CG.

The biconjugate gradient solver is associated with a plan by calling the culaSparseSetBicgSolver with the culaSparseBicgOptions input struct parameter (see Options Structures).

This family of functions attempt to solve \(Ax=b\) using the conjugate gradient stabilized (BiCGSTAB) method where \(A\) is a square matrix stored in a sparse matrix format format and \(x\) and \(b\) are dense vectors. This method was developed to solve non-symmetric linear systems while avoiding the irregular convergence patterns of the Conjugate Gradient Squared (CGS) method.

Solver Trait

Value

matrix class

General

memory overhead

\(10n\)

The biconjugate gradient stabilized solver is associated with a plan by calling the culaSparseSetBicgstabSolver with the culaSparseBicgstabOptions input struct parameter (see Options Structures).

This family of functions attempt to solve \(Ax=b\) using the conjugate gradient stabilized (BiCGSTAB(L)) method where \(A\) is a square matrix stored in a sparse matrix format format and \(x\) and \(b\) are dense vectors. This method extendeds the BiCG algorithm by adding an additional GMRES step with a restart value of L after each BiCGSTAB iteration. In practice, this may help to smooth convergence - especially in cases where \(A\) has large complex eigenpairs.

Solver Trait

Value

matrix class

General

memory overhead

\(n*L + 8n\)

The biconjugate gradient stabilized (L) solver is associated with a plan by calling the culaSparseSetBicgstablSolver with the culaSparseBicgstablOptions input struct parameter (see Options Structures).

This family of functions attempt to solve \(Ax=b\) using the restarted general minimal residual GMRES(m) method where \(A\) is a square matrix stored in a sparse matrix format and \(x\) and \(b\) are dense vectors. This method is implemented using the modified Gram-Schmidt method for orthogonalization. When a preconditioner is specified, GMRES attempts to minimize \(||Mb-MAx||/||b||\) opposed to \(||b-Ax||/||b||\) in the absence of a preconditioner.

The maximum iterations, specified by culaIterativeConfig, are in reference to the outer iteration count. The maximum inner iteration count is specified by the restart value contained in the culaGmresOptions parameter.

The gmres solver is associated with a plan by calling the culaSparseSetGmresSolver with the culaSparseGmresOptions input struct parameter (see Options Structures).

This family of functions attempt to solve \(Ax=b\) using the minimum residual method MINRES method where \(A\) is a square matrix stored in a sparse matrix format and \(x\) and \(b\) are dense vectors. When a preconditioner is specified, MINRES attempts to minimize \(||Mb-MAx||/||b||\) opposed to \(||b-Ax||/||b||\) in the absence of a preconditioner.

Solver Trait

Value

matrix class

General

memory overhead

\(11n\)

The minres solver is associated with a plan by calling the culaSparseSetMinresSolver with the culaSparseMinresOptions input struct parameter (see Options Structures).

This chapter describes the Legacy Interface found in cula_sparse_legacy.h():

The legacy interface provides a single function call interface to achieve all of the capability of the standard interface. Due to the number of platforms, data formats, solvers, and preconditioenrs, the combinatorial number of functions in this interface is very large. Five major concepts are conveyed by the function names of the iterative system solvers:

Here, each {...} segment represents a different major component of the routine. The following table explains each of these components:

Name

Meaning

platform

execution platform

type

data type used

storage

sparse matrix storage format used

solvers

iterative method used

precond

preconditioner method used

For example, the routine culaSparseCudaDcsrCgJacobi() will use the CUDA platform to attempt to solve a double precision sparse matrix stored in the compressed sparse row (CSR) storage format using the conjugate gradient (CG) method with Jacobi preconditioning.

This chapter outlines many of the performance and accuracy considerations pertaining to the CULA Sparse library. There are details regarding how to get the most performance out of your solvers and provide possible reasons why a particular solver may be under performing.

Many of the solvers in CULA Sparse library should use the double precision data format to achieve convergence in less iterations. While users of the NVIDIA GeForce line may still see an appreciable speedup, we recommend using the NVIDIA Tesla line of compute cards with greatly improved double precision performance.

The modern GPU is optimized to handle large, massively parallel problems with a high computation to memory access ratio. As such, small problems with minimal parallelism will perform poorly on the GPU and are much better suited for the CPU where they can reside in cache memory. Typical problem sizes worth GPU-acceleration are systems with at least 10,000 unknowns and at least 30,000 non-zero elements.

For storage and performance reasons, the compressed sparse row (CSR) format is preferred as many internal operations have been optimized for this format. For other storage formats, CULA Sparse will invoke accelerated conversions routines to convert to CSR internally. To measure this conversion overhead, inspect the overhead field of the timing structure in the culaIterativeResult return structure. These conversion routines also require an internal buffer of size \(nnz\) which for large problems may be more memory than is available on the GPU.

Proper preconditioner selection is necessary for minimizing the number of iterations required to solve a solution. However, as mentioned in previous chapters, the preconditioning step does add additional work to every iteration. In some cases this additional work may end up being a new bottleneck and improved overall performance may be obtained using a preconditioner that takes more steps to converge, but the overall runtime is actually lower. As such, we recommend analyzing multiple preconditioner methods and looking at the total runtime as well the number of iterations required.

Iterative methods are typically very sensitive to numerical precision. Therefore, different implementations of the same algorithm may take a different number of iterations to converge to the same solution. This is as expected when dealing with the non-associative nature of floating point computations.

Unless otherwise specified, all of CULA Sparse’s iteration calculations are done with regards to a residual relative to the norm of the right hand side of the linear system. The defining equation for this is \(||b-Ax||/||b||\).

This section shows a very simple example of how to use the CULA Sparse API. Please note this example does no error checking and uses default parameters for each option. For a more complete example, see the iterativeSystemSolve SDK example.

This section describes how to configure Microsoft Visual Studio to use CULA Sparse. Before following the steps within this section, take note of where you installed CULA Sparse (the default is C:\Program Files\CULA\S#). To set up Visual Studio, you will need to set both Global- and Project-level settings. Each of these steps is described in the sections below.

When inside Visual Studio, navigate to the menu bar and select Tools > Options. A window will open that offers several options; in this window, navigate to Projects and Solutions > VC++ Directories. From this dialog you will be able to configure global executable, include, and library paths, which will allow any project that you create to use CULA Sparse.

The table below specifies the recommended settings for the various directories that the VC++ Directories dialog makes available. When setting up your environment, prepend the path of your CULA Sparse installation to each of the entries in the table below. For example, to set the include path for a typical installation, enter C:\Program Files\CULA\include for the Include Files field.

Option

Win32

x64

Executable Files

bin

bin64

Include Files

include

include

Library Files

lib

lib64

With these global settings complete, Visual Studio will be able to include CULA Sparse files in your application. Before you can compile and link an application that uses CULA Sparse, however, you will need to set up your project to link CULA Sparse.

To use CULA Sparse, you must instruct Visual Studio to link CULA Sparse to your application. To do this, right-click on your project and select Properties. From here, navigate to Configuration Properties > Linker > Input. In the Additional Dependencies field, enter “cula_core.lib cula_sparse.lib”.

On the Windows platform, CULA Sparse’s libraries are distributed as a dynamic link library (DLL) (cula_sparse.dll) and an import library (cula_sparse.lib), located in the bin and lib directories of the CULA Sparse installation, respectively. By Linking cula_sparse.lib, you are instructing Visual Studio to make an association between your application and the CULA Sparse DLL, which will allow your application to use the code that is contained within the CULA Sparse DLL.

CULA Sparse is built as a dynamically linked library, and as such it must be visible to your runtime system. This requires that cula_sparse.dll and its supporting dll’s are located in a directory that is a member of your system’s runtime path. On Windows, you may do one of several things:

The first step in this process is to set up environment variables so that your build scripts can infer the location of CULA Sparse.

On a Linux or Mac OS X system, a simple way to set up CULA Sparse to use environment variables. For example, on a system that uses the bourne (sh) or bash shells, add the following lines to an appropriate shell configuration file (e.g. .bashrc).

This section describes how to set up the gcc compiler to include CULA Sparse in your application. When compiling an application, you will typically need to add the following arguments to your compiler’s argument list:

CULA Sparse is built as a shared library, and as such it must be visible to your runtime system. This requires that CULA Sparse’s shared libraries are located in a directory that is a member of your system’s runtime library path. On Linux, you may do one of several things:

This chapter provides solutions to errors commonly encountered when using the CULA Sparse library.

As a general note, whenever an error is encountered, consider enabling debugging mode. The configuration parameter offers a debugging flag that, when set, causes CULA Sparse to perform many more checks than it would normally. These checks can be computationally expensive and so are not enabled on a default run. These checks may highlight the issue for you directly, saving you from having to do more time consuming debugging. Debugging output will occur as either a culaSparseStatus return code (i.e., a malformed matrix may be printed to the console, or returned via the result culaBadStorageFormat or via the result structure.

A function’s culaSparseStatus return code is equal to culaSparseArgumentError.

Description

This error indicates that one of your parameters to your function is in error. The culaSparseGetLastStatusString function will report which particular parameter is in error. Typical errors include invalid sizes or null pointers.

Solution

Check the noted parameter against the routine’s documentation to make sure the input is valid.

A function’s culaSparseStatus return code is equal to culaSparseDataFormatError.

Description

This error indicates that the set of inputs describing a sparse matrix is somehow malformed. If the debugging flag is set, the culaSparseGetLastStatusString function will report which particular parameter is in error. Typical errors include invalid sizes or null pointers.

There are many conditions which can trigger this, of which a few common examples are listed below.

For 0-based indexing, any entry in the row or column values is less than zero or greater than \(n-1\)

For 1-based indexing, any entry in the row or column values is less than one or greater than \(n\)

Duplicated indices

Entries in an Index array are not ascending

The \(n+1\) element of an Index array is not set properly; ie it does not account for all \(nnz\) elements

There are many others, and the above may not be true for all matrix types.

Solution

Check the matrix data against the documentation for matrix storage types to ensure that it meets any necessary criteria.

Upon a culaSparseNonConvergence return code, it is possible to obtain more information by examining the culaSparseFlag within the culaSparseResult structure. This will indicate a problem with of the following errors:

A function’s culaSparseStatus return code is equal to culaSparseNonConvergence and the culaSparseFlag is indicating culaSparseMaxIterationsReached.

Description

This error indicates that the solver has reached a maximum number of iterations before the an answer within the given tolerance was reached.

Solution

Increase the iteration count or lower the desired tolerance. Also, the given solver and/or preconditioner might not be appropriate for your data. If this is the case, try a different solver and/or preconditioner. It is also possible that the input matrix may not be solvable with any of the methods available in CULA Sparse.

This might also be a desirable outcome in the case that the user is seeking the best possible answer within a budgeted number of iterations. In this case, the “error” can be safely ignored.

A function’s culaSparseStatus return code is equal to culaSparseNonConvergence and the culaSparseFlag is indicating culaSparseStagnation.

Description

The selected iterative solver has stagnated by calculating the same residual for multiple iterations in a row. The solver has exited early because a better solution cannot be calculated.

Solution

A different iterative solver and/or preconditioner may be necessary. It is also possible that the input matrix may not be solvable with any of the methods available in CULA Sparse.

It is implicit when this error is issued that the current residual still exceeds the specified tolerance, but the result may still be usable if the user is looking only for a “best effort” solution. In that case, this “error” can be disregarded.

A function’s culaSparseStatus return code is equal to culaSparseNonConvergence and the culaSparseFlag is indicating culaSparseBreakdown.

Description

The selected iterative solver has encountered an invalid floating point value during calculations. It is possible the method has broken down and isn’t able to solve the provided linear system.

Solution

Therefore, a different iterative solver and/or preconditioner may be necessary. It is also possible that the input matrix may not be solvable with any of the methods available in CULA Sparse. Also, it is possible the input matrix has improperly formed data.

A function’s culaSparseStatus return code is equal to culaSparsePreconditionerError.

Description

The preconditioner failed to generate and no iterations were attempted. This error is usually specific to different preconditioner methods but typically indicates that there is either bad data (i.e., malformed matrix) or a singular matrix. See the documentation for the preconditioner used, as it may specify certain conditions which must be met.

Solution

More information can be obtained through the culaSparseGetLastStatusString() function. In many cases the input matrix is singular and factorization methods such as ILU0 are not appropriate. In this case, try a different preconditioner and check that the structure of your matrix is correct.

A function’s culaSparseStatus return code is equal to culaSparseFeatureNotImplemented.

Description

Execution failed because a necessary internal function is not implemented.

Solution

More information can be obtained through the culaSparseGetLastStatusString() function. In some cases, certain preconditioner and solvers are only implemented for a specific platform or explicitly require a specific storage format.

Selecting the best sparse iterative solver and preconditioner is often a difficult decision. Very rarely can one simply know which combination will converge quickest to find a solution within the given constraints. Often the best answer requires knowledge pertaining to the structure of the matrix and the properties it exhibits. To help aid in the selection of a solver and preconditioner, we have constructed two flow charts to help gauge which solver and preconditioner might work best. Again, since there is no correct answer for any given system, we encourage users to experiment with different solvers, preconditioners, and options. These flowcharts are simply designed to give suggestions, and not absolute answers.