Porting the CPN code to GPU

Author: Jie Lin (s0969522)

August 26, 2010

MSc in High Performance Computing

The University of Edinburgh

Year of Presentation: 2010

Abstract

GPGPU, which stands for General-Purpose computation on Graphics Processing Units, is now a popular topic in high performance computing. It has attracted the attention of many programmers for its high computational capability as well as its ease of use. Existing code with rich parallelism can be ported to GPU to examine the potential performance gain. This dissertation aims to implement and benchmark a GPU implementation of the CPN code using the CUDA programming language. It discusses strategies for both the porting and the optimisation process. With large parameters, the first version of the modified code achieves a speed-up of 6.3 for the whole program, and about twenty times for some selected functions. Further optimisation, removing unnecessary memory operations, produces even better speed-ups: with large parameters, the speed-up is about 7.4 for the whole program and 37 for selected suitable functions.

I am grateful to my supervisor Chris Johnson for his patient guidance during the project, and for his careful corrections and suggestions about the style and language of the report during the revision stage. I would also like to thank my friends and classmates for helping me to learn CUDA programming techniques and dissertation writing skills.


Chapter 1

Introduction

Driven by the very large computer game industry, graphics processing units have seen rapid performance increases in recent years. Traditionally, the GPU was highly efficient at, and widely used for, computer graphics. Due to its rapid performance improvement and highly parallel structure, it has now moved into the general purpose domain as an accelerator for the CPU.

GPGPU, which refers to general purpose computation on Graphics Processing Units, "has a broader application across a variety of fields such as computer clusters, physics engines, audio signal processing, digital image processing, bioinformatics as well as scientific computing" [13]. Because of the impressive power of the GPU, some institutions have begun schemes to construct supercomputers using GPUs. "Tokyo Institute of Technology announced a plan to build a 2.4 Petaflop supercomputer based on GPU" [17], and ACCRE also plans to construct a GPU cluster to boost its supercomputing power [18]. In summary, application of GPUs has become prevalent in many traditional HPC areas, and it is predictable that GPUs will be applied more extensively and deeply in the future.

In Chapter 2, I briefly describe GPU features and the CUDA programming language. Chapter 3 describes the porting process of the CPN code along with a discussion of parallel reduction. Chapter 4 discusses the general optimisation strategies, the approaches used to evaluate performance, and the optimisation attempts on the first version of the modified CPN code.


Chapter 2

Background Knowledge

2.1 Features of GPU and the CUDA programming language

In this section I will discuss some features associated with the GPU and the CUDA programming language.

2.1.1 Main differences to CPUs

While the improvement in CPU performance has slowed its pace in recent years, GPUs have seen fast and steady performance improvements. The reason lies in the difference in design between the GPU and the CPU.

D. Kirk and W. Hwu [14] summarise the difference: "Design of a CPU is optimised for sequential code performance, while GPU is designed to optimize for the execution of massive number of threads" [14]. The CPU has a sophisticated design which can deal with very complex control logic. By contrast, the GPU pursues high computational throughput: it sustains thousands of threads to perform intensive computation and hide memory latency. In addition, the CPU is designed to minimise instruction and data access latencies, which requires large caches, while the GPU uses small caches so that more transistor area is dedicated to calculations [6].

Although the performance of the GPU is impressive, it cannot take on all the work; the GPU is used to complement the CPU. A program which can benefit from the GPU should have its parallel portion run on the GPU and its sequential portion run on the CPU.

2.1.2 Architecture of a modern GPU

The Streaming Multiprocessor (SM) is the basic building block of a modern NVIDIA GPU. It is composed of 8 Streaming Processors, each of which has essential function units such as a multiply-add unit [14]. Besides that, each SM contains a piece of shared memory and an instruction unit which issues one instruction to all Streaming Processors per clock cycle. SMs can also perform fast floating point operations through a special functional unit. Figure 1 illustrates an example of a streaming multiprocessor.


Figure 1 [6]: A Streaming Multiprocessor

The details of the GPU cards in Ness are as follows. There are two nodes in total. Each node has two 2.0 GHz Intel Xeon E5504 quad-core processors and a dual M1060 Tesla card, so there are 16 2.0 GHz Intel Xeon cores and 4 GPUs in total. Integrated Intel Dual Port Gigabit Ethernet connects them.

The M1060 Tesla card is massively parallel, with 240 Streaming Processor cores [2]. The peak single precision and double precision floating point performance is 933 GFlops and 78 GFlops respectively. It has 4 GB of dedicated memory with a bandwidth of 102 GB/s. These hardware resources provide powerful computational capability: the GPU sustains thousands of threads during the runtime of a program, and applications well written for the GPU can derive more than ten thousand threads running simultaneously. In order to write effective programs, it is very important to keep this architecture in mind.

2.2 CUDA programming language

Several years ago, when the concept of GPGPU was initially introduced, programming on GPUs was very hard, because there were no dedicated programming languages for GPU programming. The programmer needed to reformulate the code in a graphics development pattern, which was very difficult. In November 2006, NVIDIA introduced CUDA.

CUDA allows developers to use C as a high-level programming language [1]. An alternative programming language is OpenCL [3]. OpenCL, which refers to Open Compute Language, was initiated and developed by Apple. It supports a large range of applications and architectures. However, OpenCL is a low-level API, and it introduces more work for the programmer. For my project, I chose C for CUDA to port the code to GPU.

2.2.1 CUDA Program Structure

In this section, I will discuss a simple CUDA programming model.

D. Kirk and W. Hwu [14] illustrate the execution process of a typical CUDA program. A CUDA program consists of two kinds of code: one is standard ANSI C code which runs on the CPU, and the other is extended C code which executes on the GPU. In the CUDA programming model, the CPU is also referred to as the host, while the GPU is referred to as the device. Figure 2 gives an example of CUDA program execution. In general, a CUDA program starts from the host side code, which runs in serial. When it reaches device side code, such as a kernel, the program starts to execute in parallel on multiple streaming processors in a single GPU. When the parallel execution of the device code completes, the program returns control to the CPU. The CUDA programming model also allows concurrent execution of host code on the CPU and device code on the GPU, especially when there is no data dependence between the host code and the device code.

Figure 2 [14]: The execution of a typical CUDA program

The compilation phase reflects the execution process of a CUDA program. NVIDIA provides a compiler called NVCC which first separates the host and device code from a single CUDA program. The host code is then further compiled by a standard C compiler such as the GNU compiler. NVCC further compiles the device code to make it able to execute in parallel on multiple streaming processors.

The device code uses C syntax with a few extensions, i.e. keywords, according to which the NVCC compiler can split the device code from the whole program. The main differences between host code and device code are their execution pattern and memory structure: the device code is executed in parallel rather than in serial, and it also uses a variety of memory types within the GPU. The parallel execution and the different memory types are specified by the extended CUDA keywords. CUDA introduces two additional kinds of function: kernel functions and device functions. A kernel function, with the __global__ qualifier, generates multiple threads running on the GPU, while a device function, with the __device__ qualifier, executes on a single thread. The parallel execution pattern in the CUDA programming model is a two-level hierarchy: the invocation of a kernel produces a grid which consists of several blocks, and each block includes a large number of threads. The structure of the grid and the blocks within it is determined in the invocation of a kernel, which uses the <<<gridDim, blockDim>>> syntax to specify the dimensions of the grid and blocks.

Similar to the functions running on the GPU, variables in host code and device code use separate memory spaces. The GPU has its own memory, and there are many types of memory in the GPU's memory hierarchy. Variables on the GPU side cannot be used on the CPU side, nor can variables on the CPU side be used on the GPU side. For this reason, there must be explicit transfers of data between host and device. A CUDA program will usually allocate two interrelated buffers which reside on the GPU and the CPU separately. The host buffer stores the initial data; in order to make the data available on the GPU side for parallel operation during the invocation of a kernel, the buffer on the CPU must copy its data to the corresponding buffer on the GPU. When the execution of the kernel completes and the output data is generated, the buffer on the GPU should transfer the outcome back to its corresponding buffer on the CPU. This process is realised by the CUDA runtime function cudaMemcpy. Allocation of a buffer on the GPU is done by cudaMalloc; this buffer is dynamically allocated, so it can be freed after use by cudaFree.
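As a minimal sketch of this allocate/copy/compute/copy-back pattern (the buffer and kernel names here are illustrative, not taken from the CPN code):

size_t bytes = n * sizeof(double);
double *h_buf = (double*)malloc(bytes);            // host buffer holding the initial data
double *d_buf;
cudaMalloc((void**)&d_buf, bytes);                 // allocate the device buffer
cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);   // host -> device
some_kernel<<<num_blocks, block_size>>>(d_buf, n); // parallel work on the device
cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);   // device -> host
cudaFree(d_buf);                                   // free the device buffer
free(h_buf);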

Listing 2.1 [1]: a simple C for CUDA example

__global__ void vecAdd(float *A, float *B, float *C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    // Kernel invocation (A, B and C are assumed to be allocated device buffers)
    dim3 dimBlock(16, 16);
    dim3 dimGrid(4, 4);
    vecAdd<<<dimGrid, dimBlock>>>(A, B, C);
}

In the example shown in Listing 2.1, the block and grid are both defined as two-dimensional, and the threads in a block are organised in a 16x16 pattern.

2.2.2 CUDA Memory and Variable Types

So far, we have briefly discussed a simple CUDA programming model. The only memory type we have covered is global memory, which is used to allocate global variables from the host. But we will not gain a performance improvement by only using global memory, because accessing global memory is fairly slow. Besides the global memory space, there are three other memory types, i.e. shared memory, registers and constant memory, which are much more efficient than global memory. Figure 3 illustrates the hierarchy of the CUDA memory.

Figure 3 [7]: Hierarchy of CUDA memory

The performance of a memory can be measured from different aspects, including access latency and bandwidth. We can further understand its usage by looking at the features of the variable types that reside in that memory, such as the scope and lifetime of a particular type of variable. Again, different types of variable are distinguished by additional CUDA keywords. The scope here refers to the range of threads that can access the variable; this follows from the two-level hierarchical structure of threads in the CUDA programming model. A variable can be accessed from a single thread, from a block, or from the whole grid. The lifetime of a variable is another important issue: it identifies the range of the program in which the variable can be used. There are two different kinds of lifetime, one within the execution of a kernel, and the other for the whole program. We can characterise the different memory spaces and their associated variable types in terms of these aspects.


2.2.2.1 Global Memory

Global memory has much higher latency and lower bandwidth than the other memory spaces on the GPU, although compared to contemporary CPU memory the bandwidth of global memory is many times higher. Residing in the global memory space are global variables, specified by an optional __device__ qualifier. They are created and operated on in the host code by invoking CUDA runtime APIs such as cudaMalloc and cudaMemcpy. Global variables have the scope of the entire grid, and their lifetime is that of the entire program. The value of global variables is maintained throughout multiple invocations of different kernel functions, so global variables are particularly used to pass data between different kernel executions. The device side buffer mentioned in Section 2.2.1 is a typical example of a global variable.

2.2.2.2 Shared Memory

Global memory exists in DRAM on the GPU. By contrast, shared memory is on-chip memory, much like a cache manipulated by the user. Each SM has a piece of shared memory that is evenly partitioned among the blocks on the SM and assigned to those blocks at runtime. Access to shared memory is therefore much faster than access to global memory. Besides the low latency, shared memory also consists of several memory banks which allow it to be accessed simultaneously by threads; this gives shared memory a bandwidth several times as high as that of a single bank. Residing in shared memory are shared variables, specified by the __shared__ qualifier. Shared variables have the scope of a block: they are accessible to all threads within the block to which the shared memory is assigned. Shared variables have the lifetime of the block; they are not kept when the kernel that created them completes execution. Due to the remarkably low latency and high bandwidth, shared variables are used to replace global variables in a kernel in order to get good data locality when those global variables are heavily used.

2.2.2.3 Registers

Registers are another kind of on-chip memory, with extremely low latency. Each SM has a set of registers that are assigned to threads. Unlike shared memory, which is owned by all threads in a block, registers are exclusively owned by each thread. Access to shared memory can be highly parallel due to its multiple banks; likewise, access to registers can also be highly parallel because each thread has its own registers. Variables placed into registers are automatic variables, which are variables declared within a kernel or device function without any qualifier. The scope of an automatic variable is a single thread: each thread generates a private copy of an automatic variable during the kernel's execution. Automatic variables have a lifetime limited to the kernel's execution. As the number of registers is limited, only some automatic variables will be stored in registers; the remaining variables will reside in the local memory space, with the same high latency and low bandwidth as global memory. This raises a trade-off in the number of threads for a kernel: we can either use a large number of threads to achieve a high level of parallelism, or reduce the number of threads so that each thread has more registers, giving fast access to frequently used variables.


2.2.2.4 Constant Memory

Similar to global memory, the constant memory space is in DRAM and can be both read and written in the host code. However, sped up by a constant cache in each SM, constant memory has much faster access than global memory. In the device code, constant memory only provides read-only access. Constant variables are specified by the __constant__ qualifier and must be declared outside any function body. Clearly their scope is across all grids, and they have the lifetime of the entire program. Constant variables are often used to hold constant input values which will be passed to a kernel.

2.3 Libraries

Programming libraries are important tools and building blocks for application development. Besides the CUDA library from NVIDIA, which is a component of the CUDA C language, there are several third-party libraries which provide powerful utilities to facilitate CUDA programming. There are two libraries from which the porting of the CPN code may benefit: one is the Thrust library, which uses a similar style to the C++ STL and contributes immensely to the productivity of application development [8], and the other is the CUDPP library, which provides data-parallel algorithm primitives such as parallel reduction [9]. Applying these libraries can give high and stable performance and ease the development process.


Chapter 3

Porting the CPN code to GPU

The project aims to port the CPN code to GPU. The CPN code deals with a 2D lattice QCD field theory. It uses three different Monte-Carlo simulation algorithms, i.e. Metropolis, overheat bath and Microcanonical, to update fields. Within the code, the main routine is in the file mc.c. The input parameters are found in a text file, "input_file". There are two choices of action, 0 and 1, corresponding to "action_unimproved" and "action_symanzik". However, from a programmer's view, the maths and the physical meaning of the results are not important to the project.

Porting is the process of modifying the code so that it can run on the GPU. Since modifying a large code is difficult and time consuming, it is better to first make sure the code runs correctly, without considering performance at this stage. The common strategy for porting the code can be divided into the following steps. The first step is to find out which parts of the CPN code can be ported to GPU, according to the profiling results and whether a function or code section has desirable characteristics. Then all the data that will be used and modified on the device must be identified. The next step is the decomposition process, which parallelises the code into many threads.

It is always difficult to port a large program to GPU. Many functions or code segments will be selected to be converted to kernels, and many CUDA runtime API functions will be needed to deal with the data required by the GPU. It is preferable to move all the code newly introduced in the porting process into separate files instead of the original ones, to avoid damaging the originals.

3.1 Desirable characteristics of selected code

Application of GPU is only beneficial to certain types of code. This section discusses several desirable characteristics of code that is suitable for running on GPU.

3.1.1 High computational intensity

As described in the background section, host and device have separate memory spaces, and they cannot reference each other's memory. Therefore, the CPU and GPU must transfer data via the PCI Express connection between them in order to see each other's data. The latency for transferring the data is large, and this can make performing an operation on the CPU faster than performing it on the GPU. Although the GPU is good at large amounts of computation, the delay of transferring data back and forth will significantly slow down performance. Therefore, the code must involve a significant number of computations per data element in order to achieve a speed-up when implemented on a GPU.

3.1.2 Embarrassingly parallel

The selected code should involve rich data parallelism, meaning it performs many identical operations which can be executed simultaneously on a large amount of data. GPUs have an explicitly parallel programming model [4]. When a kernel is invoked, thousands of lightweight threads are constructed to perform similar operations [14]. Therefore, a large amount of data parallelism is required for a GPU application to benefit from the GPU's tremendous computational capability.

We should also avoid interaction between threads. Due to the implementation of thread scheduling, different threads are not guaranteed to be active simultaneously, so if we do need interaction between threads, we need expensive barrier synchronisations. Moreover, the interaction is limited, because synchronisation only occurs within a block.

3.1.3 Minimal conditional execution

The selected code should have as little conditional execution as possible. The warp is the scheduling unit in an SM. Threads are not independent: they are grouped into blocks, and blocks are executed as 32-thread warps. Only one instruction is issued to a warp per "warp time", which is 4 clock cycles. If at some point the code requires different sets of instructions for threads within a warp, then, because of the one-instruction limit, some threads take the instruction while others that fall into another execution path are idle. This is called thread divergence, and it wastes the GPU's computational capability. Therefore, programmers should avoid conditional statements to ensure maximum efficiency.

3.2 Profiling

According to Amdahl's law, it is only worth working on the parts of the code that take a lot of time; a large speed-up of unimportant sections will have little impact on overall performance. Therefore it is essential to examine the code using profiling tools. Only functions that consume the most time and are good candidates for parallelism have the potential to benefit from GPU acceleration. The identified functions should also exhibit rich data parallelism so that they can be decomposed into many threads. In a word, the overall performance gain from using the GPU depends entirely on the extent to which the code can be parallelised.

3.3 Selecting code to port

With the profile of the code, it is possible to identify which functions would be suitable for acceleration. Note that although the profile indicates that the top three functions which consume the most time are iz, itheta and Trpp, it is not actually possible to port them to GPU, as they do not include loops and so cannot be decomposed. Though these primitive functions cannot be converted to kernel functions, functions that contain loops and invoke these primitives inside the loops can be parallelised. From the profile, the functions and code sections that both take large portions of time and have rich data parallelism are the three field updating functions (update_fields_heat, update_fields_motropolis, update_fields_micro) and two magnetic functions (magnetic_susceptibility, magnetic_GP1). In addition, it is desirable that the selected functions and code sections have enough computational intensity that the memory latency can be well hidden.

A profile of the CPN code indicates that if the parameters N, T, L are small, then no function takes an absolute advantage over the others, which means there would be little performance benefit in porting a single function even though it may parallelise efficiently. However, when scaling the problem size by increasing these parameters (N, T, L), some functions take a large portion of the runtime. When keeping the parameter N small (e.g. 2) but making T and L large (e.g. 16), about 80% of the code's runtime is spent in the functions magnetic_susceptibility and magnetic_GP1. When keeping the parameters T and L small but making N large, the updating functions consume much of the time, as these functions scale with N. Moreover, many functions in the file observables.c have the same loop pattern as magnetic_susceptibility and magnetic_GP1, so they can be ported to GPU as well, though the improvement will be trivial. Table 1 illustrates some potential functions and code sections which are suitable for running on GPU.

Functions and code sections                          loop size
plaquette                                            L*T
topological_susceptibility                           L*T
action_unimproved                                    L*T
action_symanzik                                      L*T
code section which invokes the three
update functions                                     L*T
magnetic_susceptibility                              L*L*T*T
magnetic_GP1                                         L*L*T*T

Table 1: Suitable functions and code sections

It is also important to verify that there is no data dependency between loop iterations in a function. Data independence is an important requirement for decomposition: if a later computation within a loop depends on the computation in a previous iteration, the function cannot be parallelised. For the loops that invoke update_fields_heat, update_fields_motropolis and update_fields_micro, although each invocation focuses on a distinct vector element in the lattice field, the computation on each vector element depends on its neighbouring vector elements. Therefore the direct decomposition method cannot be applied here.

3.4 Identifying data required on GPU

As discussed in the background section, there are separate memory spaces for host and device. Programmers have to declare two interrelated buffers in host and device code, and explicitly transfer values between them before and after the invocation of GPU functions.

Another point to note is that a CUDA-supported GPU has a multi-level memory hierarchy. CUDA threads within a grid may access data from multiple memory spaces during their execution, and variables in different memory spaces have different access scopes, lifetimes and memory latencies. Using different memories can result in significant differences in performance. Therefore, after identifying all the data that is required by the GPU, it is necessary to figure out which memory space is suitable for placing each piece of data. The different memory spaces and their associated variables are discussed as follows.

The global memory space is persistent across kernel launches by the same application, and it can be accessed by all threads. At least two arrays should reside in global memory: z_field and theta_field. The former is a 2N-component real vector at each site on the L*T lattice (so a total of L*T*2N double precision values), and the latter is a real phase theta at each site (so a total of L*T*2 double precision values). Dealing with global memory requires memory management functions from the CUDA runtime API, all invoked on the host side. We need these functions to allocate memory on the GPU, copy data from host to device or from device to host, and finally free the allocated memory on the GPU.

Threads within a block can access the shared memory on the same multiprocessor that the block is issued to. Shared memory has the same lifetime as the block. As shared memory is on-chip, it has very low latency; in circumstances with no bank conflicts between threads, its latency can be about 100 times lower than that of global memory. For this reason, if a variable is used many times in a kernel function, it should be loaded from device memory and stored in shared memory. In the CPN code, the reduction operation is used frequently in the selected functions and code sections which will be ported to GPU. As the parallel reduction uses the same data elements many times, these elements should reside in shared memory. However, shared memory is fast but small, and it is evenly issued to each block: as the number of blocks increases, the amount of shared memory issued to each block decreases. For example, the maximum amount of shared memory per multiprocessor in the Tesla 1060 is 16 KB, so for an SM with 8 blocks, the shared memory per block is 2 KB.

Each thread has its own local variables, called automatic variables. An automatic variable is only accessible from its thread and has the same lifetime as the kernel. All variables declared within a kernel without any qualifier are automatic variables, and they will reside either in registers or in DRAM, depending on conditions such as whether the registers issued to each thread are exhausted. It is important to note that there is a finite number of registers on each SM. Similar to shared memory, registers are shared by all the threads resident on an SM; therefore, the more threads on the SM, the fewer registers each thread can have. The number of 32-bit registers per multiprocessor is 16 K. Since the maximum number of resident threads per multiprocessor is 1024 on Tesla, at full occupancy there can be 16384/1024 = 16 registers per thread.

The constant memory space resides in device memory. As it is sped up by the constant cache in each SM, accessing constant memory is much faster than accessing global memory, so it is better to place unchanging parameters in constant memory to reduce unnecessary global memory accesses. In the CPN code, there is a C struct called global_params_t that stores parameters read from the input file, and these parameters will not be changed. As these parameters are frequently loaded throughout the whole program, placing them in constant memory instead of global memory gives a good performance gain.
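A sketch of how these parameters could be placed in constant memory (the device-side symbol name is illustrative; cudaMemcpyToSymbol is the standard CUDA call for copying to a constant symbol):

__constant__ global_params_t d_global;   // declared at file scope, outside any function

// at program start, after the input file has been read into the host struct 'global':
cudaMemcpyToSymbol(d_global, &global, sizeof(global_params_t));

// kernels then read d_global.L, d_global.T, ... through the fast constant cache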

As the CPN code uses double precision arithmetic and double precision data elements, we need the compiler flag -arch=sm_13 to support that; otherwise, all double precision operations will be converted to single precision.

3.5 Decomposition and execution configuration

After selecting the suitable functions and code sections, and identifying the necessary variables for them, the next stage of porting is to decide how to decompose the functions or code sections into many GPU threads. Parallel decomposition in CUDA requires one to decide how many threads each block has and how many blocks each grid has; this is called the execution configuration in CUDA. The selected functions and code sections all involve many loops, or rather, nested loops. In order to port them to GPU, we first convert the computations within the loops to a kernel, which is specified by the __global__ qualifier. Then the selected functions and code sections invoke the kernels instead of executing the loops of the serial code. Invocation of a kernel requires one to specify the execution configuration. Apart from the computations that are converted to a kernel function, the primitive functions invoked within those computations, like iz, itheta and Trpp, should be converted to device functions with the __device__ qualifier.

The execution configuration is part of the kernel invocation. The whole invocation of a kernel function takes the form function<<<gridDim, blockDim, Ns, S>>>(args, ...). gridDim and blockDim refer to the grid dimension and block dimension respectively. Ns specifies the amount of dynamically allocated shared memory, and S specifies the associated stream (covered in the optimisation chapter). Ns and S are optional arguments.

There are several important points to note here. Firstly, there is a limit to the number of threads per SM, depending on the compute capability; for the Tesla 1060, the maximum number of threads per SM is 1024. Also, there is a finite amount of memory resources, such as registers and shared memory, available per multiprocessor. For example, using many registers in each thread reduces the number of active blocks, which means registers and shared memory can be the limit to parallelism. In addition, threads within a block can communicate with each other through shared memory. Moreover, only threads in the same block can have barrier synchronisation with each other.

The magnetic_susceptibility and magnetic_GP1 functions have some desirable characteristics for parallel decomposition. They contain four nested loops, and within the inner loop the computational intensity is strong enough. For these two functions, the block dimension can initially be set to the total iteration count of the inner nested loops, and the grid dimension to the total iteration count of the outer nested loops. As there is rich data parallelism in these two functions, one can be sure the performance gain from porting them to GPU will be big. The nested loops in the main function invoke the three field updating functions (update_fields_heat, update_fields_motropolis, update_fields_micro), so these loops could be parallelised. One way to parallelise them is to set the block dimension to the total iteration count of the inner loop, and the grid dimension to the number of SMs; this decomposition distributes the threads evenly between the blocks. However, there are alternative decompositions. As threads are scheduled in the form of warps and one warp is 32 threads, in order to fully utilise the SM's capability it is always better to set the threads per block to a multiple of 32. As memory transactions are issued per half warp, which is 16 threads, it is always better to set the number of threads per block to a multiple of 16 to facilitate coalesced memory access. Therefore, it is better to try different parameter combinations when setting the execution configuration, in order to find the optimal configuration. When setting the block size to a constant value which is a multiple of 16, the number of blocks equals the total iteration count of the nested loops divided by that value. This will probably not divide evenly, so the number of blocks should be truncated to an integer plus one, which means that one of the blocks will have idle threads that do not need to take any work.
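A sketch of this configuration calculation (the kernel and variable names are illustrative):

int block_size = 256;                                  // a multiple of 32, and so of 16
int num_blocks = (n + block_size - 1) / block_size;    // truncate and add one when n
                                                       // does not divide evenly
my_kernel<<<num_blocks, block_size>>>(d_data, n);
// inside the kernel, each thread computes i = blockIdx.x*blockDim.x + threadIdx.x
// and returns immediately if i >= n, leaving the surplus threads idle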

3.6 Parallel reduction algorithm

Reduction operations are common and important computational primitives. In the CPN code, they occur many times: almost all functions or code segments that have been selected for conversion to kernels contain a reduction operation over intermediate results. So it is worth taking a close look at how to implement a parallel version of the reduction operation.

Suppose we have to perform a sum reduction over an array of size N, and we are going to use the GPU to do it. One method of performing reductions on GPU is to use the intrinsic atomic functions in CUDA. In this implementation, there is a variable designating the result, and each thread in the grid holds one piece of the computation; in the kernel, the result variable updates itself by invoking an atomic function on the element and itself. However, atomic functions serialise the accesses, which wastes the GPU's resources and leads to very low efficiency. The recommended approach is a tree-based algorithm which can be executed in parallel.


Figure 4 [11]: Tree-based approach for parallel reduction

Generally, each thread block runs a loop that implicitly builds an addition tree over the input elements. In every round of the reduction, neighbouring data elements are added together, and the result is saved for the next round of addition. Each step of this pair-wise addition cuts the number of partial sums in half and ultimately produces the final sum after log2(N) steps.

Listing 3.1 [11]: Reduction with divergence

__global__ void block_sum(double *input, double *result, size_t n)
{
    extern __shared__ double s_data[];
    s_data[threadIdx.x] = 0;
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) {
        s_data[threadIdx.x] = input[i];
    }
    __syncthreads();
    for (int stride = 1; stride < blockDim.x; stride *= 2)
    {
        if (threadIdx.x % (2*stride) == 0) {
            s_data[threadIdx.x] += s_data[threadIdx.x + stride];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) {
        result[blockIdx.x] = s_data[threadIdx.x];
    }
}

Listing 3.1 illustrates the implementation of a reduction kernel. There is an array declared in the kernel which holds the initial values and afterwards stores the partial sums. The reduction kernel is mainly composed of a loop which repeatedly adds neighbouring elements of the array and stores the partial results in the proper positions. Each iteration of the loop performs one round of pair-wise addition, as shown in Figure 5. In order to access this array in parallel, and for performance reasons, the array should be a shared variable. Another consideration is the synchronisation of threads within a block: the intrinsic function __syncthreads is needed to avoid read-after-write hazards and to ensure the completion of one round of pair-wise addition. The stride variable is used to select the threads required for the pair-wise addition. In the first iteration, the stride is set to 1, which means the even-numbered threads are selected to perform the pair-wise addition: all the elements participate in the reduction, and the results are stored in the elements with even indices. At the start of the second iteration, the number of remaining elements is half the number of initial elements. The second iteration picks up all the even elements, performs additions on them, and places the results in the positions with indices that are multiples of four. This process goes on until thread 0 finishes the last round of addition and derives the final result [15].

Figure 5 [11]: Parallel reduction with divergence

However, this implementation has a big drawback: there is thread divergence within each warp, and when thread divergence occurs in a warp, it harms performance significantly. We can take the first round of the reduction as an example. In the first iteration, only threads with even indices perform the pair-wise addition. This creates two execution paths: at first, half of the threads take control and the others are idle; when that work is completed, this half of the threads goes to sleep and the other half takes control. In successive iterations this effect becomes even more serious: as fewer threads take part in the pair-wise addition in a warp, more and more threads are idle. This is obviously a waste of GPU resources.

Listing 3.2 [11]: Reduction with interleaved addressing

__global__ void block_sum(double *input, double *result, size_t n)
{
    extern __shared__ double s_data[];
    s_data[threadIdx.x] = 0;
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) {
        s_data[threadIdx.x] = input[i];
    }
    __syncthreads();
    for (int stride = blockDim.x/2; stride > 0; stride >>= 1)
    {
        if (threadIdx.x < stride) {
            s_data[threadIdx.x] += s_data[threadIdx.x + stride];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) {
        result[blockIdx.x] = s_data[threadIdx.x];
    }
}

Listing 3.2 shows a modified kernel with a slightly different algorithm for sum reduction. Thread divergence is inevitable for the grid as a whole, but many occurrences of divergence can be avoided, because divergence only matters within a warp. This gives rise to an improved version of parallel reduction with interleaved addressing [11]. In contrast to the old version of the reduction algorithm, the new version always picks the first half of the remaining threads to perform the pair-wise addition. This is achieved by adding pairs of elements that are half the remaining size away from each other; the important stride variable controls the distance between the elements in a pair. Although one divergent warp will probably still exist in any iteration, namely the warp that contains both the last few working threads and the first few idle ones, this algorithm remarkably reduces the redundant divergence compared to the first version.

For reductions over a tremendous number of elements, the solution is to decompose the computation into multiple kernel invocations. In the first invocation, as there is a limit on the number of threads for each block, we need multiple thread blocks to deal with the large array; each thread block takes a portion of the total reduction work. In order to combine the partial results from all the blocks within the grid, we declare an array to store these temporary results and invoke a reduction again on them. The input for this second launch is much smaller, and it can be placed in a single block to get the final result [11]. In the case of reductions, the code for all levels is the same. Listing 3.3 illustrates the multiple kernel invocations of the parallel reduction.

Listing 3.3 [14]: Multiple kernel invocations of parallel reduction

// reduce per-block partial sums
int smem_sz = block_size*sizeof(double);
block_sum<<<num_blocks, block_size, smem_sz>>>(d_input, d_sums, n);

// reduce partial sums to a total sum
block_sum<<<1, block_size, smem_sz>>>(d_sums, d_sums + num_blocks, num_blocks);


3.7 Porting code using Thrust

This section discusses the porting process using the Thrust library, which was mentioned in the background section. Applying the Thrust library has many benefits. First, Thrust has many parallel primitives which can simplify CUDA computations and make the CUDA code concise; employing Thrust allows users to rapidly develop complex applications. Thrust also encourages generic programming: one reduction can apply to input of any data type. In addition, Thrust achieves high performance with minimal programmer effort. Finally, Thrust integrates well with CUDA C/C++ code.

Thrust provides two vector containers, host_vector and device_vector. The host_vector is stored in host memory and the device_vector in device memory. Thrust's vector containers are analogous to the C++ STL std::vector: they are generic containers that can store any data type and can be resized dynamically. Listing 3.4 illustrates the use of Thrust's vector containers and compares it with raw CUDA.
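The body of Listing 3.4 is not preserved in this extract; the following is a minimal sketch of what it likely demonstrated, using the standard Thrust API:

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>

thrust::host_vector<double> h_vec(100, 1.0);    // host vector of 100 elements

thrust::device_vector<double> d_vec = h_vec;    // '=' copies host -> device,
                                                // hiding cudaMalloc and cudaMemcpy

d_vec[0] = 3.5;                                 // device elements readable and writable
double x = d_vec[0];                            // from the host via bracket notation

thrust::host_vector<double> h_copy = d_vec;     // '=' also copies device -> host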

As this example shows, the = operator can be used to copy a host_vector to a device_vector. The = operator can also be used to copy host_vector to host_vector or device_vector to device_vector. This feature hides cudaMalloc, cudaMemcpy and cudaFree, and makes the common operations concise and readable. Also note that individual elements of a device_vector can be accessed using the standard bracket notation. This feature facilitates debugging in CUDA, because device values can be examined directly from the host side. However, because each of these accesses requires a call to cudaMemcpy, they should be used sparingly.

In order to use user-defined kernels, we can extract a raw pointer from a device vector and pass it to the kernel. Listing 3.5 illustrates an example.


Listing 3.5: passing a raw pointer to a device vector into a kernel

// allocate device vector
thrust::device_vector<double> d_trpp(n);

// obtain raw pointer to device vector's memory
double *ptrpp = thrust::raw_pointer_cast(&d_trpp[0]);

// use the pointer in a CUDA C kernel
sum_magnetic_GP1<<<global.L*global.T, global.L*global.T>>>(d_zfield, ptrpp);

Thrust provides a large number of common parallel algorithms, such as the reduction operation. Unlike the raw CUDA implementation, performing such an operation with Thrust does not require writing a kernel or setting an execution configuration. The sum of an array is implemented with thrust::reduce as follows:

double sum = thrust::reduce(D.begin(), D.end(), 0.0, thrust::plus<double>());

Here the first arguments to the reduction, D.begin() and D.end(), are iterators which define the range of values; the third argument is the initial value, and the fourth argument, thrust::plus<double>(), is the addition operator.

3.8 Debugging

Debugging CUDA is time consuming, and there are several ways of doing it. In former versions of the CUDA Toolkit, the programming environment did not include any native debug support for kernel functions. There was only a compilation mode called device emulation mode, which makes the entire program run on the CPU, so that programmers can debug both the host code and the device code on the host side with common tools such as GDB [1]. As the code runs on the host, device emulation mode also allows print statements in kernel functions. However, execution on the GPU may produce different results from device emulation mode, because device emulation behaves slightly differently from the normal mode; as a consequence, device emulation mode may not work for debugging. In addition, when compiling with device emulation mode, performance is rather slow. Both of these reasons make device emulation mode unsuitable for debugging.

Directly examining the values of variables using print statements is the common way of debugging code that runs on the CPU. But print statements are not natively supported on the GPU in former CUDA versions and on compute capability 1.x devices: the printf function can neither be invoked in a kernel, nor can device data be examined by printf on the host side. Fortunately, there is a software package called cuPrintf that allows the user to add printf-equivalent cuPrintf calls to CUDA C code. The cuPrintf code works with CUDA 2.3 or newer, and is supported on all platforms on which the CUDA Toolkit is supported. This was the main way of examining variables in kernels on the Ness system. With the Thrust library, it is also possible to examine device data directly on the host side, avoiding the process of transferring device data to the host before the values can be printed. On the fermi0 system, with compute capability 2.0, formatted output which behaves in a similar way to the standard C printf function is natively supported.
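As a sketch of how cuPrintf is typically used (based on the cuPrintf package distributed with the CUDA SDK; the kernel and buffer names are illustrative):

#include "cuPrintf.cu"                   // from the cuPrintf package

__global__ void my_kernel(double *data)
{
    cuPrintf("thread %d sees %f\n", threadIdx.x, data[threadIdx.x]);
}

int main()
{
    cudaPrintfInit();                    // set up the device-side print buffer
    my_kernel<<<1, 32>>>(d_data);        // d_data assumed allocated elsewhere
    cudaPrintfDisplay(stdout, true);     // copy the buffer to the host and print it
    cudaPrintfEnd();                     // release the buffer
}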

Newer versions of CUDA have introduced a debugger called CUDA-GDB. It is an extension to the standard GDB, and it allows the user to debug not only the native host code but also the CUDA code. "CUDA-GDB supports setting breakpoints at any host or device function residing in a CUDA application by using the function symbol name or the source file line number" [12]. This allows programmers to check the value of a variable at any point, whether on the host side or the device side. However, the Ness system, where I spent most of the project time, does not support CUDA-GDB for some reason. Fortunately, the fermi0 system does support it, so I was able to use it there.


Chapter 4

Optimisation

Porting the code to GPUs aims to exploit the GPU's capacity to accelerate the code; therefore, the performance of the resulting code is always the key objective of the project.

The GPU has many features which differ from those of the CPU, and some of these features have a significant influence on performance. To efficiently utilise the GPU's compute capability, it is better to optimise the code according to these features after the first round of porting. Several papers list the key issues for achieving optimal performance on GPUs [5]. The general performance optimisation approach involves several steps. The first step is to measure the performance and find out which parts of the code will give a significant performance gain when optimised. Then the optimisation should be performed following the overall performance optimisation strategies listed in Section 4.1, including memory usage optimisation and instruction usage optimisation respectively. This chapter discusses the overall performance optimisation strategies, the importance of performance measurement, and the optimisation attempts for the CPN code.

4.1 Overall performance optimisation strategies

Maximising parallel execution means efficiently mapping the parallelism of the application onto the various components of the GPU system, in order to fully utilise the GPU's compute capability.

Instruction usage optimisation includes two aspects. The first is to minimise the use of arithmetic instructions with lower speed. CUDA supports the standard mathematical functions as well as its own intrinsic functions; using intrinsic functions sacrifices precision but gives higher speed.

Optimising instruction usage also means avoiding divergence within warps. In the CUDA programming model, a block is executed as 32-thread warps. During execution, threads in the same warp perform the same operation at the same time if there are no divergent paths within the warp. However, if there are different paths within a single warp, the different execution paths are serialised. An example of the different consequences of conditional statements is discussed below.

If the conditional statement is written in the form if (threadIdx.x > 2) {...}, then two different execution paths are produced for the threads in a block. If we write another statement, if (threadIdx.x/32 > 2) {...}, we also get two different execution paths. Although both produce different execution paths, their impact is different. As thread divergence only occurs within a warp, the second statement, which gives the same execution path to all threads in a warp, does not actually incur thread divergence.

Memory usage optimisation is the key issue for performance improvement. CUDA memory handling is complex: there are various types of memory, and different memory spaces have different bandwidths and latencies. For example, it takes 4 cycles to issue a global memory fetch, but the fetch has 400-600 cycles of latency, which is equivalent to 100 MADs. The latencies of registers, shared memory and constant memory are much lower, so using the different memories effectively can lead to huge speed-ups. Memory usage optimisation means minimising data transfer between host and device, minimising traffic between global memory and the device's processors, and accessing each memory space with its optimal access pattern; memory coalescing and shared memory bank conflicts, discussed below, relate to the optimal access pattern [14]. The tiling algorithm is another example of efficient use of memory; it is beneficial when there is much data reuse within a tile.

Shared memory is divided into banks to enable many threads to access the memory simultaneously. Each bank can service one address per cycle, so a shared memory can service as many simultaneous accesses as it has banks. Without bank conflicts, CUDA shared memory is as fast as registers. However, multiple simultaneous accesses to the same bank result in a bank conflict, and conflicting accesses are serialised. Therefore, we must try to eliminate bank conflicts.

4.2 Measuring Performance

Measuring performance is essential preparation for the optimisation steps that follow. It clearly shows the effect or improvement from each optimisation step, indicating whether the optimisation is worthwhile, and it can also point out the limit of the speed-up achievable on the GPU. This section discusses the common approaches to evaluating performance, using either CPU timers or CUDA events.

4.2.1 Timing

Measuring the performance of a CUDA program particularly focuses on CUDA API calls and user-defined kernel executions. It is possible to measure the performance with both CPU and GPU timers. There are a variety of ways to measure time on the host side, and the programmer should always choose a CPU timer with high resolution. Another issue to keep in mind is that many CUDA API functions are asynchronous [16], which means that when these functions are invoked, they immediately return control to the calling CPU thread even though they have only just begun their work on the GPU. Likewise, execution of kernel functions is asynchronous. As a result, programmers should use the CUDA API function cudaThreadSynchronize to force synchronisation between the CPU thread and the device in order to get accurate results for the duration of a CUDA call or kernel execution.

To make the results more accurate, it is better to use events. Events are part of the CUDA API and provide a system-independent way to measure execution times on CUDA devices. CUDA provides calls that create and destroy events, record events (via timestamps), and convert timestamp differences into a floating-point value. Note that events are recorded on the GPU, so they can only be used for timing GPU execution. The returned value is expressed in milliseconds and has a resolution of approximately half a microsecond. With such a resolution, it is possible to get reliable timings even from a single kernel launch. Listing 4.2 illustrates an example of timing using CUDA events.
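The body of Listing 4.2 is not preserved in this extract; the following is a minimal sketch of event-based timing with the standard CUDA event API (the kernel name is illustrative):

cudaEvent_t start, stop;
float elapsed_ms;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);                            // place the start event in the default stream
some_kernel<<<num_blocks, block_size>>>(d_data, n);   // the code being timed
cudaEventRecord(stop, 0);                             // place the stop event after the kernel
cudaEventSynchronize(stop);                           // wait for the stop event to complete

cudaEventElapsedTime(&elapsed_ms, start, stop);       // elapsed time in milliseconds
cudaEventDestroy(start);
cudaEventDestroy(stop);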

Here cudaEventRecord, a CUDA runtime API function, is used to place the start and stop events into the default stream. "Stream is a sequence of operations that are performed in order on the device. The device will record a timestamp for the event when it reaches that event in the stream" [1]. Two events are necessary to get the elapsed time: the programmer needs a start and a stop event surrounding the target code which is to be timed. The elapsed time is returned by the cudaEventElapsedTime function. Table 2 shows the time spent by the CPN code on GPU compared with CPU.

T     CPU runtime    GPU runtime    magnetic_susceptibility    magnetic_susceptibility_gpu
                                    runtime                    runtime
6     11.59024       21.789         1.5467                     7.6587
8     53.193         28.298         5.7527                     8.2164
10    58.191         34.558         12.008                     8.516
12    174.81         45.843         24.943                     10.196
14    195.23         57.657         46.265                     11.544
16    454.94         72.616         79.039                     13.946

Table 2: Time spent for the CPN code on GPU vs CPU

4.2.2 Maximum Performance Benefit

Before starting to perform the optimisation, it is essential to figure out how much speed-up can be achieved theoretically. One way to do this is to apply Amdahl's law. Amdahl's law is used to predict the maximum speed-up of a parallel program, and it can clearly be adapted to CUDA programming. It expresses the expected maximum speed-up with the formula:

Equation 1: S(N, P) = 1 / (a + (1 - a) / P)

Here S(N, P) stands for the speed-up of a program with scale N on P processors, and a is the sequential portion of the program that cannot execute in parallel. This equation can be simplified in the extreme case to make the result clearer. In CUDA programming, where there are hundreds of streaming processors, the parallel fraction of the execution time, (1 - a)/P, can be ignored. Therefore we can simplify the equation and derive S = 1/a, which indicates that the speed-up of a program is limited by the time of the sequential portion, no matter how large the number of processors is. For example, with 10% sequential code, a program can expect at most a 10 times speed-up.

So in order to get as much speed-up as possible, it is important to find as many portions of code that can be parallelised as possible, to minimise a.


Figure 6: Overall speed-up for the first version

Figure 6 illustrates the speed-up of the first version of the modified code. For small values of T there is no speed-up; porting the code to GPU actually gives a performance reduction. However, as the value of T increases, execution on GPU achieves a big speed-up for the selected functions magnetic_GP1 and magnetic_susceptibility. This is because the GPU is suited to code with rich data parallelism.

This figure also indicates the maximum performance benefit that can be achieved by modifying only the magnetic_GP1 and magnetic_susceptibility functions. As these functions take approximately 85% of the total runtime, the speed-up for the whole program can be about 7.

4.3 Optimisation for the CPN code

There are many spots where the first version of the modified CPN code can be optimised. This section describes the optimisation attempts for the CPN code. All the optimisations described below were performed on magnetic_susceptibility_gpu and magnetic_GP1_gpu.

4.3.1 Optimise memory usage

Memory optimisation is the key issue for performance improvement, as discussed in Section 4.1. Therefore, it is best to first explore the potential performance improvement from memory optimisation. Before optimising, it is worth examining the time spent on each GPU-related operation.

The total time spent on GPU includes not only kernel execution but also the time spent dealing with data, involving memory allocation (which may implicitly include CUDA initialisation in the first cudaMalloc call), de-allocation, and memory transfers between CPU and GPU. Therefore, to get an accurate picture of the performance gained by using the GPU, it is essential to record all the time spent on GPU. Figure 7 illustrates the time spent on each GPU-related operation.

Figure 7: Time spent on GPU-related operations

4.3.1.1 Remove redundant device memory allocation

When first porting the code to GPU, a correct result is the first concern rather than performance. In order to get accurate results, the programmer should be very cautious when modifying the code, so in the first version all the introduced code, including the memory allocation and de-allocation, was placed inside the selected functions. However, device memory allocation and de-allocation via cudaMalloc and cudaFree are expensive operations, so they should be minimised.

From Figure 7, it is obvious that memory allocation and de-allocation take more than 1/3 of the total time spent on GPU-related operations. It is expected that eliminating unnecessary memory allocation and de-allocation will give a good performance win.

In the CPN code, as the variables z_field and theta_field exist throughout the whole program, it is better to move the invocations of cudaMalloc and cudaFree out of the selected functions and put them at the beginning and end of the program respectively. After this optimisation, the allocation and de-allocation are only invoked once; compared with thousands of iterations of computation, they are trivial and no longer a performance problem. This optimisation gives a speed-up of 1.5 for both functions.

After this optimisation, the speed-up for the whole program is 7.3 and speed–up formagnetic_GP1_gpu is 37.

4.3.1.2 Minimise data transfer between CPU and GPU

The next optimisation of memory usage is to minimise data transfers between the host and device. This can be done by only transferring data when the data has been updated.

In the previous version, in both modified functions magnetic_susceptibility_gpu and magnetic_GP1_gpu, there is a data transfer for the z_field array. However, these two functions do not change the values of the elements of z_field. Therefore, the data transfer for z_field in the magnetic_GP1_gpu function, whose invocation immediately follows that of magnetic_susceptibility_gpu, is not necessary, and I could remove this redundant memory transfer.

4.3.1.3 Using mapped page-locked memory

Another way to eliminate the data transfer between host and device entirely is to use mapped pinned memory. Section 2.2.1 discussed the separate memory spaces in the CUDA programming model and the necessity of transferring data between host and device. However, recent versions of CUDA provide a way for the device to have direct access to host memory. This is realised by mapped pinned memory, which is on the host system but can be shared by the device. There are two scenarios for pinned memory: discrete GPUs and integrated GPUs. As integrated GPUs actually use the same memory as the host system, using mapped pinned memory there is always beneficial [10]. Whether a given GPU is integrated can be verified by calling the CUDA runtime's cudaGetDeviceProperties function and checking the integrated member of the cudaDeviceProp structure. For the Ness system with an integrated GPU, it would be better to use mapped pinned memory. However, since the mapped buffer is shared by the CPU and GPU, developers must synchronise access using the existing context, stream or event APIs. The synchronisation overhead introduced is huge: in the modified CPN code, applying mapped pinned memory actually results in a big performance hit.

4.3.1.4 Eliminating unnecessary global memory accesses

Because of the huge difference in latency between global memory and on-chip shared memory, kernel accesses to global memory should also be minimised by maximising the use of shared memory on the device. Using shared memory is valuable when the global memory in a kernel is accessed many times. In the kernel function magnetic_susceptibility_kernel, the array z_field is accessed frequently, so it is desirable to load the whole array from global memory into shared memory. At full occupancy, where each multiprocessor has the maximum number of resident blocks, each block can have at least 16KB/8 = 2KB of shared memory. This amount is enough to hold the whole z_field array when T is 16 and N is 2.

However, loading the data into shared memory is itself an overhead, and it also introduces a __syncthreads call to avoid read-after-write hazards, which reduces performance. Moreover, the global memory latency is already well hidden in the first modification of the code, which means there is little potential performance gain from using shared memory. The global memory latency in this kernel is well hidden for two reasons. Firstly, for the kernel function magnetic_susceptibility_kernel, the block size is big (256 when T and L are set to 16). With such a block size, the multiprocessor can have 8 warps, which gives it great flexibility to switch between warps: as each warp takes 4 clock cycles per instruction, 8 warps give 32 cycles per instruction. The second reason is that this kernel involves rich arithmetic instructions; sufficient instructions give the thread scheduler more ability to schedule threads in a way that hides the global memory latency.

For these reasons, loading the whole array into shared memory is not very beneficial. Nevertheless, it is still best to avoid accessing global memory whenever possible.

4.3.2 Optimise instruction usage

Integer division is very costly; in the previous version of the kernel function, there were four integer divisions which had to be removed. Instruction usage can also be optimised by using intrinsic functions, as mentioned in Section 4.1. The benefit can be tested by replacing all the arithmetic operations with intrinsic functions. The intrinsic functions for double precision are __dadd_rn and __dmul_rn, and for integers __mul24.
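As a sketch of these replacements inside device code (the function and variable names are illustrative):

__device__ double mul_add_intrinsic(double a, double b, double c)
{
    // double r = a * b + c;                 // standard arithmetic
    return __dadd_rn(__dmul_rn(a, b), c);    // explicit round-to-nearest intrinsics
}

__device__ int warp_index(int tid)
{
    // int w = tid / 32;                     // integer division is costly
    return tid >> 5;                         // a shift, since 32 is a power of two
}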

Applying the intrinsic functions results in a small performance gain for the kernel magnetic_susceptibility_kernel.


Chapter 5

Conclusions

This dissertation evaluates the potential benefit of porting a CPN code to GPU. With T and L set to 16, a final speed-up of 7.3 is achieved. With large values of T and L, the first version already obtained a big performance gain for the selected functions: magnetic_susceptibility_gpu is 5 times faster than the serial code, and magnetic_GP1_gpu is 21 times faster. However, for small values of T and L there is no performance gain. This is because the GPU is only suitable for code with rich data parallelism that can fully utilise the GPU resources; small values of T and L mean small loop sizes, which result in a big waste of GPU resources.

Several optimisations were performed to achieve further speed-ups, mainly concentrating on memory usage. The elimination of redundant allocation and de-allocation is a performance win over the first version of the modification. Using mapped pinned memory did not work as expected: it actually reduces performance dramatically, which may be due to the introduced synchronisation overhead. The benefit of using shared memory instead of global memory is trivial. Using intrinsic functions gives some small performance gain.

Apart from the speed-ups, this dissertation also discusses approaches for porting the code and strategies to follow for optimisation. Measurement of the code is always important, as it tells us which parts of the code are worth porting and optimising.