Thursday, June 2, 2011

CUDA 4.0 ships with the Thrust library, a standard template library for GPU that offers several useful algorithms ( sorting, prefix sum, reduction). In the previous post I explained how to configure CUDA Fortran to use the 4.0 toolkit. Now I am going to show how to call Thrust from CUDA Fortran, in particular how to sort an array.

On the Thrust web page, there are a lot of examples and documentation. The basic idea of Thrust is to have containers, that manage host and device memory and simplify data transfers, iterators, thatact like pointers and keep track of memory spaces, and algorithms, that are applied to containers.

This is a simple Thrust code to sort an array of random data.

#include <thrust/host_vector.h >

#include <thrust/device_vector.h>

#include <thrust/sort.h>

int main(void) {

// define a vector of 16M int on the host

thrust::host_vector h_vec(1 << 24);

// generate 16M random numbers on the host

thrust::generate(h_vec.begin(), h_vec.end(), rand);

// transfer data to the device

thrust::device_vector d_vec = h_vec;

// sort data on the device

thrust::sort(d_vec.begin(), d_vec.end());

// transfer data back to host

thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());

return 0;

An important feature, that we will use to call Thrust from CUDA Fortran, is the conversion of Thrust objects to raw pointers or vice versa. This Thrust code snippet will convert a device container to a standard C pointer that we could use to call a CUDA C kernel:

// allocate device vector

thrust::device_vector d_vec(4);

// obtain raw pointer to device vector’s memory

int * ptr = thrust::raw_pointer_cast(&d_vec[0]);

The basic idea is to write a wrapper to the Thrust algorithms that will handle standard C pointer and then use the Iso C Binding to call the wrapper. Since we want to sort an array, let's write a wrapper for the sort algorithm in Thrust.

// Filename: csort.cu

// nvcc -c -arch sm_13 csort.cu

#include <thrust/device_vector.h>

#include <thrust/device_vector.h>

#include <thrust/sort.h>

extern "C" {

//Sort for integer arrays

void sort_int_wrapper( int *data, int N)

{

// Wrap raw pointer with a device_ptr

thrust::device_ptr <int> dev_ptr(data);

// Use device_ptr in Thrust sort algorithm

thrust::sort(dev_ptr, dev_ptr+N);

}

//Sort for float arrays

void sort_float_wrapper( float *data, int N)

{

thrust::device_ptr <float> dev_ptr(data);

thrust::sort(dev_ptr, dev_ptr+N);

}

//Sort for double arrays

void sort_double_wrapper( double *data, int N)

{

thrust::device_ptr <double> dev_ptr(data);

thrust::sort(dev_ptr, dev_ptr+N);

}

}

We can compile the code using

nvcc -c -arch sm_13 csort.cu

This will generate an object file, csort.o that we will use later on in the linking stage of the CUDA Fortran code.

The other missing piece is the interface to these C functions.

We will define a generic interface thrustsort, that depending on the kind of data (integer, single precision or double precision) will call the correct sort function:

module thrust

interface thrustsort

subroutine sort_int( input,N) bind(C,name="sort_int_wrapper")

use iso_c_binding

integer(c_int),device:: input(*)

integer(c_int),value:: N

end subroutine

subroutine sort_float( input,N) bind(C,name="sort_float_wrapper")

use iso_c_binding

real(c_float),device:: input(*)

integer(c_int),value:: N

end subroutine

subroutine sort_double( input,N) bind(C,name="sort_double_wrapper")

use iso_c_binding

real(c_double),device:: input(*)

integer(c_int),value:: N

end subroutine

end interface

end module thrust

At this point we have all we need to write the CUDA Fortran code:

program testsort

use thrust

real, allocatable :: cpuData(:)

real, allocatable, device :: gpuData(:)

integer:: N=10

allocate(cpuData(N))

allocate(gpuData(N))

do i=1,N

cpuData(i)=random(i)

end do

cpuData(5)=100.

print *,"Before sorting", cpuData

gpuData=cpuData

call thrustsort(gpuData,size(gpuData))

cpuData=gpuData

print *,"After sorting", cpuData

end program

If we save the module in a file module_thrust.cuf and the program in simplesort.cuf, we are ready to compile and execute: