Saturday, November 12, 2011

MPI communications from GPU memory

There are several groups working on MPI implementations capable of transferring data directly from GPU memory, made possible by the introduction of Unified Virtual Addressing (UVA) in CUDA 4.0. The MVAPICH group is the first to officially release a version with CUDA support.

Being able to pass GPU pointers directly to MPI functions greatly simplifies programming on clusters. For example, if the programmer needs to send data from a GPU on system A to another GPU on system B, instead of the sequence:

1. Transfer the data from GPU memory to host memory on system A

2. Transfer the data from host memory on system A to host memory on system B, for example using MPI_Send/MPI_Recv

3. Transfer the data from host memory to GPU memory on system B

the programmer can just issue MPI_Send/MPI_Recv with the buffers located in GPU memory.
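As an illustration, the three staged transfers and the direct call can be sketched in CUDA Fortran as follows (a_h and b_h are hypothetical host arrays mirroring the device arrays a_d and b_d; error checking is omitted):

```fortran
! Without GPU-aware MPI: stage through host buffers.
! On rank 0 (sender):
a_h = a_d                                                      ! 1. device -> host copy
call MPI_Send(a_h, N, MPI_INTEGER, 1, 0, MPI_COMM_WORLD, ierr) ! 2. host -> host transfer
! On rank 1 (receiver):
call MPI_Recv(b_h, N, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, status, ierr)
b_d = b_h                                                      ! 3. host -> device copy

! With GPU-aware MPI: pass the device buffers directly.
! On rank 0:
call MPI_Send(a_d, N, MPI_INTEGER, 1, 0, MPI_COMM_WORLD, ierr)
! On rank 1:
call MPI_Recv(b_d, N, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, status, ierr)
```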

A GPU-aware MPI stack is also capable of optimizing the transfers under the hood via pipelining (this could also be programmed explicitly, but having the library take care of it is much more convenient).
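For readers curious about what the explicit version would look like, here is a sketch of a manual two-stage pipeline: while chunk i is on the wire, the device-to-host copy of chunk i+1 proceeds. This is illustrative only, not how MVAPICH implements it internally; nchunks, chunk, the pinned host buffer buf_h, and the array of CUDA streams are assumed to be declared and allocated elsewhere.

```fortran
! Overlap MPI sends with device-to-host copies, one chunk at a time.
istat = cudaMemcpyAsync(buf_h(1), a_d(1), chunk, stream=streams(1))
do i = 1, nchunks
   istat = cudaStreamSynchronize(streams(i))   ! chunk i has arrived on the host
   call MPI_Isend(buf_h((i-1)*chunk+1), chunk, MPI_INTEGER, 1, 0, &
                  MPI_COMM_WORLD, requests(i), ierr)
   ! While chunk i is being sent, start copying chunk i+1 off the device
   if (i < nchunks) istat = cudaMemcpyAsync(buf_h(i*chunk+1), a_d(i*chunk+1), &
                                            chunk, stream=streams(i+1))
end do
call MPI_Waitall(nchunks, requests, MPI_STATUSES_IGNORE, ierr)
```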

In this post, I am going to explain how to use the CUDA-enabled MVAPICH2 from CUDA Fortran.

After downloading the tar file from the MVAPICH web site, we need to configure the installation. Due to compatibility issues between CUDA and the PGI C compiler, we are going to use gcc as the C compiler and PGI Fortran as the Fortran compiler.

We need to specify the location of the CUDA include files and libraries (in this case, they are in the standard location /usr/local/cuda) and the installation path for MVAPICH (I am installing on a cluster where all the apps are located in /share/apps).

FC=pgfortran F77=pgfortran FCFLAGS=-fast FFLAGS=-fast ./configure \
  --prefix=/share/apps/mvapich2-gpu \
  --enable-cuda \
  --with-cuda-include=/usr/local/cuda/include \
  --with-cuda-libpath=/usr/local/cuda/lib64

The next steps are to run "make" and then "make install" (for this last step, depending on the location of the installed software, you may need root privileges). You will also need to add the location of the binaries (in this case /share/apps/mvapich2-gpu/bin) to your path.

We are now ready to write a CUDA Fortran code that uses MPI to transfer data between two GPUs. Each process allocates two device arrays, a_d and b_d, and fills them with values that depend on its rank. Processor 0 then sends a_d to processor 1. After processor 1 receives the data in b_d, it copies b_d back to the host array a and prints the values.

program mpi_test_gpu
  use mpi
  implicit none
  integer, allocatable :: a(:)
  integer, device, allocatable :: a_d(:), b_d(:)
  integer :: N, ierr, rank, num_procs, status(MPI_STATUS_SIZE)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, num_procs, ierr)

  N = 4
  allocate (a(N), a_d(N), b_d(N))

  ! Fill the device arrays with rank-dependent values
  a_d = (rank+1)*10
  b_d = (rank-1)*100
  a = -999

  if (rank == 0) then
     call MPI_Send(a_d, N, MPI_INTEGER, 1, 0, MPI_COMM_WORLD, ierr)
  else
     call MPI_Recv(b_d, N, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, status, ierr)
  end if

  ! Copy the received device data back to the host and print it
  if (rank == 1) a = b_d
  print *, "Rank=", rank, "A=", a

  deallocate (a, a_d, b_d)
  call MPI_Finalize(ierr)
end program mpi_test_gpu

If the code is in a file named mpi_test_gpu.cuf, we can generate an executable with the following command:

mpif90 -O3 -o mpi_test_gpu mpi_test_gpu.cuf

We are now ready to run with the command mpirun_rsh. We need to pass a special flag, MV2_USE_CUDA=1, to enable the new GPU path (or you can add export MV2_USE_CUDA=1 to your .bashrc to avoid typing it every time).

We are going to use two nodes, c0-0 and c0-1, connected by Infiniband.

mpirun_rsh -np 2 c0-0 c0-1 MV2_USE_CUDA=1 ./mpi_test_gpu

Rank= 0 A= -999 -999 -999 -999

Rank= 1 A= 10 10 10 10

As expected, a on rank 1 contains the value 10, which is what was initially stored in a_d on rank 0.

MVAPICH also allows sending data from GPU memory to host memory and vice versa.
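For example, a device buffer on the sender can be matched with an ordinary host buffer on the receiver. A sketch, reusing the arrays of the program above:

```fortran
! GPU-to-host transfer with a GPU-aware MPI:
! a_d is a device array on rank 0, a is a plain host array on rank 1.
if (rank == 0) then
   call MPI_Send(a_d, N, MPI_INTEGER, 1, 0, MPI_COMM_WORLD, ierr)
else if (rank == 1) then
   call MPI_Recv(a, N, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, status, ierr)
end if
```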

An important remark for other users who would like to try this: MVAPICH2 1.8a1p1 with PGI 11.10 and CUDA 4.0/4.1RC1/4.1RC2 must be built WITHOUT hwloc support (--without-hwloc) in order to work.