4 Initialize Device Calls
- To select the device associated with the host thread: cudaSetDevice(device)
  This function must be called before any __global__ function is launched; otherwise device 0 is selected automatically.
- To get the number of devices: cudaGetDeviceCount(&deviceCount)
- To retrieve a device's properties: cudaGetDeviceProperties(&deviceProp, device)
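A minimal sketch putting these three calls together (assumes the CUDA runtime and at least one CUDA-capable device; error checking omitted for brevity):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);      // how many CUDA devices are visible
    printf("Found %d device(s)\n", deviceCount);

    for (int d = 0; d < deviceCount; ++d) {
        cudaDeviceProp deviceProp;
        cudaGetDeviceProperties(&deviceProp, d);
        printf("Device %d: %s, compute capability %d.%d\n",
               d, deviceProp.name, deviceProp.major, deviceProp.minor);
    }

    // Select device 0 explicitly; this must happen before any kernel
    // launch, otherwise device 0 is chosen implicitly anyway.
    cudaSetDevice(0);
    return 0;
}
```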

8 To Try CUDA Programming
- SSH to
- Set environment variables in .bashrc in your home directory:
  export PATH=$PATH:/usr/local/cuda/bin
  export LD_LIBRARY_PATH=/usr/local/cuda/lib:$LD_LIBRARY_PATH
  export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
- Copy the SDK from /home/students/NVIDIA_GPU_Computing_SDK
- Compile the following directories:
  NVIDIA_GPU_Computing_SDK/shared/
  NVIDIA_GPU_Computing_SDK/C/common/
- The sample codes are in NVIDIA_GPU_Computing_SDK/C/src/

9 Demo
- Hello World: print out block and thread IDs
- Vector add: C = A + B
- Show some real demos: the ones above, plus additional ones in the SDK directories
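A sketch of the first demo, a hello-world kernel where each thread prints its block and thread IDs (device-side printf assumes compute capability 2.0 or later; the launch shape is illustrative):

```cuda
#include <cstdio>

// Each thread prints its own block and thread indices.
__global__ void helloKernel() {
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main() {
    helloKernel<<<2, 4>>>();      // 2 blocks of 4 threads each
    cudaDeviceSynchronize();      // wait for the kernel so output is flushed
    return 0;
}
```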

12 CUDA Programming Model
- Parallel code (a kernel) is launched and executed on a device by many threads
- Threads are grouped into thread blocks
- Parallel code is written from the perspective of a single thread

// Kernel definition
__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}
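A host-side sketch of how this vecAdd kernel could be driven end to end (the size N and the single-block launch are illustrative assumptions, not from the slides):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vecAdd(float* A, float* B, float* C) {
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main() {
    const int N = 256;                     // illustrative: one block of N threads
    size_t bytes = N * sizeof(float);
    float hA[N], hB[N], hC[N];
    for (int i = 0; i < N; ++i) { hA[i] = i; hB[i] = 2.0f * i; }

    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    vecAdd<<<1, N>>>(dA, dB, dC);          // 1 block, N threads

    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    printf("C[10] = %f\n", hC[10]);        // 10 + 20 = 30

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```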

13 Thread Hierarchy
- Threads launched for a parallel section are partitioned into thread blocks
- A thread block is a group of threads that can:
  - Synchronize their execution
  - Communicate via a low-latency shared memory
- Grid = all thread blocks for a given launch
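A small illustrative kernel (not from the slides) showing the two block-level capabilities named above, shared memory and synchronization, by reversing an array within a single block:

```cuda
// Reverses up to 64 elements within one block: threads communicate
// through per-block __shared__ memory and synchronize with
// __syncthreads() between the load and store phases.
__global__ void reverseInBlock(int* data, int n) {
    __shared__ int tile[64];           // low-latency shared memory
    int i = threadIdx.x;
    if (i < n) tile[i] = data[i];
    __syncthreads();                   // all loads complete before any store
    if (i < n) data[i] = tile[n - 1 - i];
}
```

Without the __syncthreads() barrier, a thread could read tile[n - 1 - i] before the thread responsible for that slot had written it.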

15 IDs and Dimensions
- Threads:
  - 3D IDs, unique within a block
  - Two threads from two different blocks cannot cooperate
- Blocks:
  - 2D or 3D IDs (depending on the hardware)
  - Unique within a grid
- Dimensions are set at launch time and can differ for each launch
- Built-in variables: threadIdx, blockIdx, blockDim, gridDim
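An illustrative kernel combining all four built-in variables: for a 1D launch, blockIdx.x * blockDim.x + threadIdx.x yields a globally unique index (the launch shape is an assumption for the example):

```cuda
#include <cstdio>

// Each thread reports its position in the hierarchy and its
// derived global index for a 1D launch.
__global__ void printGlobalId() {
    int globalId = blockIdx.x * blockDim.x + threadIdx.x;
    printf("block %d of %d, thread %d of %d -> global id %d\n",
           blockIdx.x, gridDim.x, threadIdx.x, blockDim.x, globalId);
}

int main() {
    printGlobalId<<<2, 4>>>();    // gridDim.x = 2, blockDim.x = 4
    cudaDeviceSynchronize();
    return 0;
}
```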

26 Calling a Kernel Function – Thread Creation
A kernel function must be called with an execution configuration:
  KernelFunc<<< DimGrid, DimBlock, SharedMemBytes, Streams >>>(...);
- DimGrid = dimension and size of the grid
- DimBlock = dimension and size of each block
- SharedMemBytes = number of bytes of dynamically allocated shared memory (optional)
- Streams = the associated stream (optional)
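A sketch of a launch using all four configuration parameters (the kernel, sizes, and scale factor are illustrative assumptions):

```cuda
#include <cuda_runtime.h>

__global__ void scaleKernel(float* data, float factor) {
    extern __shared__ float tile[];    // sized by SharedMemBytes at launch
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = data[i] * factor;
    __syncthreads();
    data[i] = tile[threadIdx.x];
}

int main() {
    const int N = 512;
    float* d;
    cudaMalloc(&d, N * sizeof(float));

    dim3 dimGrid(N / 128);                      // 4 blocks
    dim3 dimBlock(128);                         // 128 threads per block
    size_t sharedMemBytes = 128 * sizeof(float);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Full execution configuration: grid, block, shared memory, stream.
    scaleKernel<<<dimGrid, dimBlock, sharedMemBytes, stream>>>(d, 2.0f);

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    cudaFree(d);
    return 0;
}
```

The last two parameters default to 0 bytes of dynamic shared memory and the default stream when omitted.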

31 Memory Optimizations
- Reduce the time of memory transfer between host and device:
  - Use asynchronous memory transfer (CUDA streams)
  - Use zero copy
- Reduce the number of transactions between on-chip and off-chip memory:
  - Memory coalescing
  - Avoid bank conflicts in shared memory
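A sketch of the first technique, asynchronous host-to-device transfer on a CUDA stream; cudaMemcpyAsync only overlaps with other work when the host buffer is page-locked (pinned), so the sketch allocates it with cudaMallocHost (buffer size is illustrative):

```cuda
#include <cuda_runtime.h>

int main() {
    const int N = 1 << 20;
    size_t bytes = N * sizeof(float);

    float* hData;
    cudaMallocHost(&hData, bytes);     // pinned (page-locked) host memory
    for (int i = 0; i < N; ++i) hData[i] = 1.0f;

    float* dData;
    cudaMalloc(&dData, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Returns immediately: the host can queue kernels or other copies
    // on other streams while this transfer is in flight.
    cudaMemcpyAsync(dData, hData, bytes, cudaMemcpyHostToDevice, stream);

    cudaStreamSynchronize(stream);     // wait for the transfer to finish

    cudaStreamDestroy(stream);
    cudaFree(dData);
    cudaFreeHost(hData);
    return 0;
}
```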

50 Control Flow
- if, switch, do, for, while
- Branch divergence within a warp:
  - Threads in a warp follow different execution paths
  - The different paths are serialized
  - This increases the number of instructions executed by that warp
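Two illustrative kernels (not from the slides) contrasting a divergent branch with a divergence-free one; the warp size of 32 is the value on current NVIDIA hardware:

```cuda
// Divergent: even and odd lanes of the same warp take different
// branches, so the warp executes both paths one after the other.
__global__ void divergent(int* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)
        out[i] = i * 2;
    else
        out[i] = i + 1;
}

// Divergence-free: the condition depends only on the warp index
// (threadIdx.x / 32), so all 32 threads of a warp agree on the branch
// and no serialization occurs.
__global__ void uniform(int* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((threadIdx.x / 32) % 2 == 0)
        out[i] = i * 2;
    else
        out[i] = i + 1;
}
```

Both kernels compute branchy results, but only the first pays the divergence penalty; restructuring conditions to align with warp boundaries is a common way to avoid it.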