This program was contributed by Bruno Turcksin and Daniel Arndt, Oak Ridge National Laboratory.

Introduction

This example shows how to implement a matrix-free method on the GPU using CUDA for the Helmholtz equation with variable coefficients on a hypercube. The linear system will be solved using the conjugate gradient method and is parallelized with MPI.

In the last few years, heterogeneous computing in general and GPUs in particular have gained a lot of popularity. This is because GPUs offer better computing capabilities and memory bandwidth than CPUs for a given power budget. Among the architectures available in early 2019, GPUs are about 2x-3x as power efficient than server CPUs with wide SIMD for PDE-related tasks. GPUs are also the most popular architecture for machine learning. On the other hand, GPUs are not easy to program. This program explores the deal.II capabilities to see how efficiently such a program can be implemented.

While we have tried for the interface of the matrix-free classes for the CPU and the GPU to be as close as possible, there are a few differences. When using the matrix-free framework on a GPU, one must write some CUDA code. However, the amount is fairly small and the use of CUDA is limited to a few keywords.

We choose as domain \(\Omega=[0,1]^3\) and \(a(\mathbf x)=\frac{10}{0.05 + 2\|\mathbf x\|^2}\). Since the coefficient is symmetric around the origin but the domain is not, we will end up with a non-symmetric solution.

If you've made it this far into the tutorial, you will know how the weak formulation of this problem looks like and how, in principle, one assembles linear systems for it. Of course, in this program we will in fact not actually form the matrix, but rather only represent its action when one multiplies with it.

Moving data to and from the device

GPUs (we will use the term "device" from now on to refer to the GPU) have their own memory that is separate from the memory accessible to the CPU (we will use the term "host" from now on). A normal calculation on the device can be divided in three separate steps:

the data is moved from the host to the device,

the computation is done on the device,

the result is moved back from the device to the host

The data movements can either be done explicitly by the user code or done automatically using UVM (Unified Virtual Memory). In deal.II, only the first method is supported. While it means an extra burden for the user, this allows for better control of data movement and more importantly it avoids to mistakenly run important kernels on the host instead of the device.

The data movement in deal.II is done using LinearAlgebra::ReadWriteVector. These vectors can be seen as buffers on the host that are used to either store data received from the device or to send data to the device. There are two types of vectors that can be used on the device:

Both of the vector classes used here only work on a single machine, i.e., one memory space on a host and one on a device.

But there are cases where one wants to run a parallel computation between multiple MPI processes on a number of machines, each of which is equipped with GPUs. In that case, one wants to use LinearAlgebra::distributed::Vector<Number,MemorySpace::CUDA>, which is similar but the import() stage may involve MPI communication:

The relevant_rw_vector is an object that stores a subset of all elements of the vector. Typically, these are the locally relevant DoFs, which implies that they overlap between different MPI processes. Consequently, the elements stored in that vector on one machine may not coincide with the ones stored by the GPU on that machine, requiring MPI communication to import them.

Matrix-vector product implementation

The code necessary to evaluate the matrix-free operator on the device is very similar to the one on the host. However, there are a few differences, the main ones being that the local_apply() function in step-37 and the loop over quadrature points both need to be encapsulated in their own functors.

The commented program

First include the necessary files from the deal.II libary known from the previous tutorials.

#include <deal.II/base/conditional_ostream.h>

#include <deal.II/base/quadrature_lib.h>

#include <deal.II/dofs/dof_tools.h>

#include <deal.II/fe/fe_q.h>

#include <deal.II/grid/grid_generator.h>

#include <deal.II/grid/tria.h>

#include <deal.II/lac/affine_constraints.h>

#include <deal.II/lac/la_parallel_vector.h>

#include <deal.II/lac/precondition.h>

#include <deal.II/lac/solver_cg.h>

#include <deal.II/numerics/data_out.h>

#include <deal.II/numerics/vector_tools.h>

The following ones include the data structures for the implementation of matrix-free methods on GPU:

Class VaryingCoefficientFunctor

Next, we define a class that implements the varying coefficients we want to use in the Helmholtz operator. Later, we want to pass an object of this type to a CUDAWrappers::MatrixFree object that expects the class to have an operator() that fills the values provided in the constructor for a given cell. This operator needs to run on the device, so it needs to be marked as __device__ for the compiler.

Since CUDAWrappers::MatrixFree::Data doesn't know about the size of its arrays, we need to store the number of quadrature points and the numbers of degrees of freedom in this class to do necessary index conversions.

staticconstunsignedint n_dofs_1d = fe_degree + 1;

staticconstunsignedint n_local_dofs =

::Utilities::pow(n_dofs_1d, dim);

staticconstunsignedint n_q_points =

::Utilities::pow(n_dofs_1d, dim);

private:

double *coef;

};

The following function implements this coefficient. Recall from the introduction that we have defined it as \(a(\mathbf x)=\frac{10}{0.05 + 2\|\mathbf x\|^2}\)

Class HelmholtzOperatorQuad

The class HelmholtzOperatorQuad implements the evaluation of the Helmholtz operator at each quadrature point. It uses a similar mechanism as the MatrixFree framework introduced in step-37. As before, the functions of this class need to run on the device, so need to be marked as __device__ for the compiler.

Again, the CUDAWrappers::MatrixFree object doesn't know about the number of degrees of freedom and the number of quadrature points so we need to store these for index calculations in the call operator.

This is the call operator that performs the Helmholtz operator evaluation on a given cell similar to the MatrixFree framework on the CPU. In particular, we need access to both values and gradients of the source vector and we write value and gradient information to the destination vector.

Class HelmholtzOperator

The HelmholtzOperator class acts as wrapper for LocalHelmholtzOperator defining an interface that can be used with linear solvers like SolverCG. In particular, like every class that implements the interface of a linear operator, it needs to have a vmult() function that performs the action of the linear operator on a source vector.

The following is the implementation of the constructor of this class. In the first part, we initialize the mf_data member variable that is going to provide us with the necessary information when evaluating the operator.

In the second half, we need to store the value of the coefficient for each quadrature point in every active, locally owned cell. We can ask the parallel triangulation for the number of active, locally owned cells but only have a DoFHandler object at hand. Since DoFHandler::get_triangulation() returns a Triangulation object, not a parallel::Triangulation object, we have to downcast the return value. This is safe to do here because we know that the triangulation is a parallel:distributed::Triangulation object in fact.

The key step then is to use all of the previous classes to loop over all cells to perform the matrix-vector product. We implement this in the next function.

When applying the Helmholtz operator, we have to be careful to handle boundary conditions correctly. Since the local operator doesn't know about constraints, we have to copy the correct values from the source to the destination vector afterwards.

Since all the operations in the solve() function are executed on the graphics card, it is necessary for the vectors used to store their values on the GPU as well. LinearAlgebra::distributed::Vector can be told which memory space to use. There is also LinearAlgebra::CUDAWrappers::Vector that always uses GPU memory storage but doesn't work with MPI. It might be worth noticing that the communication between different MPI processes can be improved if the MPI implementation is CUDA-aware and the configure flag DEAL_II_MPI_WITH_CUDA_SUPPORT is enabled. (The value of this flag needs to be set at the time you call cmake when installing deal.II.)

In addition, we also keep a solution vector with CPU storage such that we can view and display the solution as usual.

Unlike programs such as step-4 or step-6, we will not have to assemble the whole linear system but only the right hand side vector. This looks in essence like we did in step-4, for example, but we have to pay attention to using the right constraints object when copying local contributions into the global vector. In particular, we need to make sure the entries that correspond to boundary nodes are properly zeroed out. This is necessary for CG to converge. (Another solution would be to modify the vmult() function above in such a way that we pretend the source vector has zero entries by just not taking them into account in matrix-vector products. But the approach used here is simpler.)

At the end of the function, we can't directly copy the values from the host to the device but need to use an intermediate object of type LinearAlgebra::ReadWriteVector to construct the correct communication pattern necessary.

This solve() function finally contains the calls to the new classes previously dicussed. Here we don't use any preconditioner, i.e. precondition by the identity matrix, to focus just on the peculiarities of the CUDAWrappers::MatrixFree framework. Of course, in a real application the choice of a suitable preconditioner is crucial but we have at least the same restrictions as in step-37 since matrix entries are computed on the fly and not stored.

After solving the linear system in the first part of the function, we copy the solution from the device to the host to be able to view its values and display it in output_results(). This transfer works the same as at the end of the previous function.

The output results function is as usual since we have already copied the values back from the GPU to the CPU.

While we're already doing something with the function, we might as well compute the \(L_2\) norm of the solution. We do this by calling VectorTools::integrate_difference(). That function is meant to compute the error by evaluating the difference between the numerical solution (given by a vector of values for the degrees of freedom) and an object representing the exact solution. But we can easily compute the \(L_2\) norm of the solution by passing in a zero function instead. That is, instead of evaluating the error \(\|u_h-u\|_{L_2(\Omega)}\), we are just evaluating \(\|u_h-0\|_{L_2(\Omega)}=\|u_h\|_{L_2(\Omega)}\) instead.

The main() function

Finally for the main() function. By default, all the MPI ranks will try to access the device with number 0, which we assume to be the GPU device associated with the CPU on which a particular MPI rank runs. This works, but if we are running with MPI support it may be that multiple MPI processes are running on the same machine (for example, one per CPU core) and then they would all want to access the same GPU on that machine. If there is only one GPU in the machine, there is nothing we can do about it: All MPI ranks on that machine need to share it. But if there are more than one GPU, then it is better to address different graphic cards for different processes. The choice below is based on the MPI proccess id by assigning GPUs round robin to GPU ranks. (To work correctly, this scheme assumes that the MPI ranks on one machine are consecutive. If that were not the case, then the rank-GPU association may just not be optimal.) To make this work, MPI needs to be initialized before using this function.

Results

Since the main purpose of this tutorial is to demonstrate how to use the CUDAWrappers::MatrixFree interface, not to compute anything useful in itself, we just show the expected output here:

Cycle 0

Number of active cells: 8

Number of degrees of freedom: 343

Solved in 27 iterations.

solution norm: 0.0205439

Cycle 1

Number of active cells: 64

Number of degrees of freedom: 2197

Solved in 60 iterations.

solution norm: 0.0205269

Cycle 2

Number of active cells: 512

Number of degrees of freedom: 15625

Solved in 114 iterations.

solution norm: 0.0205261

Cycle 3

Number of active cells: 4096

Number of degrees of freedom: 117649

Solved in 227 iterations.

solution norm: 0.0205261

One can make two observations here: First, the norm of the numerical solution converges, presumably to the norm of the exact (but unknown) solution. And second, the number of iterations roughly doubles with each refinement of the mesh. (This is in keeping with the expectation that the number of CG iterations grows with the square root of the condition number of the matrix; and that we know that the condition number of the matrix of a second-order differential operation grows like \({\cal O}(h^{-2})\).) This is of course rather inefficient, as an optimal solver would have a number of iterations that is independent of the size of the problem. But having such a solver would require using a better preconditioner than the identity matrix we have used here.

Possible extensions

Currently, this program uses no preconditioner at all. This is mainly since constructing an efficient matrix-free preconditioner is non-trivial. However, simple choices just requiring the diagonal of the corresponding matrix are good candidates and these can be computed in a matrix-free way as well. Alternatively, and maybe even better, one could extend the tutorial to use multigrid with Chebyshev smoothers similar to step-37.