CudaDMASequential Version 2.0

Constructors

The new CudaDMASequential transfer pattern supports between three and five template parameters. All constructors support a new option, BYTES_PER_THREAD, which specifies the number of bytes each thread keeps in flight as outstanding LDG loads. The value of BYTES_PER_THREAD must be a multiple of ALIGNMENT. Selecting 4*ALIGNMENT causes the implementation to fall back to the Fermi implementation.
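The constraint on BYTES_PER_THREAD can be checked at compile time. The following is a minimal host-side C++ sketch and not part of the CudaDMA API: the name LdgConfig and its members are hypothetical, and it assumes that the number of outstanding loads per thread is BYTES_PER_THREAD divided by ALIGNMENT.

```cpp
#include <cstddef>

// Hypothetical helper (NOT a CudaDMA type): encodes the documented rule that
// BYTES_PER_THREAD must be a multiple of ALIGNMENT.
template <std::size_t ALIGNMENT, std::size_t BYTES_PER_THREAD>
struct LdgConfig {
    static_assert(ALIGNMENT == 4 || ALIGNMENT == 8 || ALIGNMENT == 16,
                  "ALIGNMENT is expected to be 4, 8, or 16 bytes");
    static_assert(BYTES_PER_THREAD > 0 && BYTES_PER_THREAD % ALIGNMENT == 0,
                  "BYTES_PER_THREAD must be a multiple of ALIGNMENT");

    // Assumption: each load moves ALIGNMENT bytes, so this is the number of
    // loads a single thread keeps outstanding.
    static constexpr std::size_t loads_per_thread = BYTES_PER_THREAD / ALIGNMENT;

    // True when the setting equals 4*ALIGNMENT, the value the text associates
    // with the Fermi implementation.
    static constexpr bool selects_fermi_path = (BYTES_PER_THREAD == 4 * ALIGNMENT);
};
```

With this sketch, LdgConfig<16, 64> compiles and reports four loads per thread, while LdgConfig<16, 24> is rejected at compile time because 24 is not a multiple of 16.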

Unlike previous versions of CudaDMA, the non-warp-specialized implementations also allow you to specify that only a subset of the available warps should be used. These parameters are optional. If they are omitted, all threads in the threadblock are used for the transfer.

CudaDMASequential Version 1.0

Pattern Description

The CudaDMASequential transfer pattern is used for transferring a contiguous block of memory. Only two parameters are required to characterize a sequential transfer pattern:

ALIGNMENT - the alignment of the block of memory (e.g. 4-, 8-, or 16-byte alignment)

XFER_SIZE - the size of the block of memory to be transferred in bytes
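To illustrate how these two parameters describe a transfer, the following host-side C++ sketch models a contiguous copy in which each logical thread moves ALIGNMENT-sized chunks at a stride of num_threads chunks. It is an illustration under assumptions, not the CudaDMA implementation: the function names copy_share and sequential_copy are hypothetical, and the threads are simulated by a loop on the CPU.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical model of one thread's share of a sequential transfer.
// Thread `tid` copies chunk tid, tid + num_threads, tid + 2*num_threads, ...
template <std::size_t ALIGNMENT>
void copy_share(const unsigned char* src, unsigned char* dst,
                std::size_t xfer_size, std::size_t tid, std::size_t num_threads) {
    for (std::size_t off = tid * ALIGNMENT; off < xfer_size;
         off += num_threads * ALIGNMENT) {
        // The final chunk may be shorter than ALIGNMENT bytes.
        std::size_t remaining = xfer_size - off;
        std::size_t n = remaining < ALIGNMENT ? remaining : ALIGNMENT;
        std::memcpy(dst + off, src + off, n);
    }
}

// Runs every simulated thread in turn and returns the destination buffer.
template <std::size_t ALIGNMENT>
std::vector<unsigned char> sequential_copy(const std::vector<unsigned char>& src,
                                           std::size_t num_threads) {
    std::vector<unsigned char> dst(src.size(), 0);
    for (std::size_t tid = 0; tid < num_threads; ++tid) {
        copy_share<ALIGNMENT>(src.data(), dst.data(), src.size(), tid, num_threads);
    }
    return dst;
}
```

Here the vector's size plays the role of XFER_SIZE. Because the chunks assigned to different threads never overlap, the simulated threads can run in any order and still produce a complete copy.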

Constructors

There are three constructors for the CudaDMASequential transfer pattern. All three describe the same sequential transfer pattern, but they differ in which parameters can be supplied as compile-time constants via template parameters. Below are models for invoking the three constructors for CudaDMASequential.

The first constructor allows the user to supply the largest number of compile-time constants as template parameters: ALIGNMENT, XFER_SIZE, and the number of DMA_THREADS. The second constructor keeps XFER_SIZE as a compile-time constant but makes the number of DMA threads a dynamic parameter. The last constructor makes the transfer size a dynamic parameter as well. All other parameters are base parameters required by the CudaDMA API.
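The difference between the three variants can be sketched with plain C++ functions that mirror the same split between compile-time and dynamic parameters. These are hypothetical helpers for illustration only, not the CudaDMASequential constructors; they compute how many ALIGNMENT-sized chunks each thread would handle, assuming the chunks are divided evenly among the DMA threads.

```cpp
#include <cstddef>

// Ceiling division, usable in constant expressions.
constexpr std::size_t ceil_div(std::size_t a, std::size_t b) {
    return (a + b - 1) / b;
}

// Variant 1: ALIGNMENT, XFER_SIZE, and DMA_THREADS are all compile-time
// constants, so the result is itself a compile-time constant.
template <std::size_t ALIGNMENT, std::size_t XFER_SIZE, std::size_t DMA_THREADS>
constexpr std::size_t chunks_static() {
    return ceil_div(ceil_div(XFER_SIZE, ALIGNMENT), DMA_THREADS);
}

// Variant 2: XFER_SIZE stays a compile-time constant, the thread count is dynamic.
template <std::size_t ALIGNMENT, std::size_t XFER_SIZE>
constexpr std::size_t chunks_dynamic_threads(std::size_t dma_threads) {
    return ceil_div(ceil_div(XFER_SIZE, ALIGNMENT), dma_threads);
}

// Variant 3: both the transfer size and the thread count are dynamic.
template <std::size_t ALIGNMENT>
constexpr std::size_t chunks_dynamic(std::size_t xfer_size, std::size_t dma_threads) {
    return ceil_div(ceil_div(xfer_size, ALIGNMENT), dma_threads);
}

// With everything static, the compiler can evaluate the count during compilation.
static_assert(chunks_static<16, 1024, 32>() == 2,
              "1024 bytes in 16-byte chunks over 32 threads is 2 chunks per thread");
```

All three variants return the same value for the same inputs. The practical difference is that the first lets the compiler fold the count into a constant (and, for example, fully unroll a loop over it), while the others may defer that work to run time.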

For the non-warp-specialized constructors, the total threads parameter tells the CudaDMA object how many threads should perform the transfer. When total threads is not specified as a compile-time constant, blockDim.x is used as the number of threads performing the transfer.

Performance Considerations

Supplying as many parameters as possible as compile-time constants contributes the most to achieving high performance with CudaDMASequential. Performance also improves when data is aligned to the largest possible byte alignment: 16-byte alignment performs better than 8-byte alignment, and 8-byte alignment performs better than 4-byte alignment.
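One way to see why larger alignment helps is to count the loads needed to move a block. The sketch below is a rough model, assuming each load moves exactly ALIGNMENT bytes; the function name loads_required is hypothetical, and load count is only one of several factors that affect real performance.

```cpp
#include <cstddef>

// Hypothetical helper: number of loads needed to move xfer_size bytes when
// each load moves `alignment` bytes (rounded up for a partial final load).
constexpr std::size_t loads_required(std::size_t xfer_size, std::size_t alignment) {
    return (xfer_size + alignment - 1) / alignment;
}

// A 1024-byte transfer under this model:
static_assert(loads_required(1024, 16) == 64, "16-byte alignment: 64 loads");
static_assert(loads_required(1024, 8) == 128, "8-byte alignment: 128 loads");
static_assert(loads_required(1024, 4) == 256, "4-byte alignment: 256 loads");
```

Under this model, moving from 4-byte to 16-byte alignment cuts the number of loads for the same transfer by a factor of four.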