This project explores how the Wave2D program can be mapped onto single or multiple GPU devices and compares two different methods of coordinating multiple devices. The Wave2D program was selected as a test program because it is one of the Multi-Agent and Spatial Simulation (MASS) applications used to test the MASS library. Although this project is tailored toward Wave2D, it can be extended or used as the basis for a MASS-GPU library.

Reason for choosing this topic

The graphics processing unit (GPU) has become an integral part of today's mainstream computing systems. Over the past 6 years, there has been a marked increase in the performance and capabilities of GPUs. The modern GPU is not only a powerful graphics engine but also a highly parallel programmable processor featuring peak arithmetic and memory bandwidth that substantially outpaces its CPU counterpart. The GPU's rapid increase in both programmability and capability has spawned a research community that has successfully mapped a broad range of computationally demanding, complex problems to the GPU. This effort in general-purpose computing on the GPU, also known as GPU computing, has positioned the GPU as a compelling alternative to traditional microprocessors in high-performance computer systems of the future.

Based on my initial readings on GPU computing and the current uses of this technology, I have become extremely interested in this topic and wanted to confirm first-hand the performance and capabilities of GPU devices. In my project I have explored the CUDA platform, which enables dramatic increases in computing performance by harnessing the power of the GPU. In order to evaluate the performance increases, I implemented a simulation of Schroedinger's wave propagation over a two-dimensional space (Wave2D) in single-threaded, multithreaded, single-GPU, and multiple-GPU versions using the C language. Having all of the applications developed in a single programming language allows me to compare the pure parallelization techniques. Furthermore, having the single-threaded and multithreaded versions of the Wave2D program allowed me to generate a baseline for performance evaluation and data validation.

Literature review

The multi-process library for multi-agent and spatial simulation (MASS) is intended to facilitate entity-based simulation for on-the-fly sensor data analysis. It reduces the difficulty that comes with mapping designers' algorithms to technology-specific implementations such as OpenMP, MPI, or MapReduce. Without it, each designer has to learn and apply a technology-specific implementation rather than spend their time on the actual problem they are trying to solve. MASS reduces the complexity of creating an application and mapping its algorithms to a specific technology, and it allows designers to focus on their programs' core functionality and correctness by abstracting the multi-core and multi-process implementation details.

The MASS library demonstrates programming advantages such as a clear separation of the simulation scenario from the simulation model and automatic parallelization. The MASS library would benefit from a parallel computing platform and programming model such as CUDA.

The other papers reviewed suggest that utilizing multiple GPUs can speed up programs from tens to hundreds of times, depending on the program and on how its workload can be divided between the GPUs. Memory management needs to be carefully considered to fully exploit GPU power. They also mention that the hardware itself needs to be carefully considered when designing a system with multiple GPU devices.

Wave2D Overview

As mentioned above, the Wave2D program is a simulation of Schroedinger's wave propagation over a two-dimensional space. The two-dimensional space is partitioned into N by N cells. A wave propagates north, south, east, and west of each cell, and thus each cell needs to compute its new surface height from the previous height of itself and its four neighboring cells. The simulation advances the time step by one, and computes the surface height of all cells (i, j) (i.e., Zt_i,j) at each time t, based on the wave-equation formulas (a standard finite-difference form is sketched below).
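For reference, a standard finite-difference discretization of the two-dimensional wave equation is sketched below; the exact coefficients and boundary handling are assumptions for illustration and are not taken verbatim from the original program. Here c is the wave speed, dt the time step, and dd the cell spacing. For the first step after the initial condition:

    Z1_i,j = Z0_i,j + (c^2 / 2) * (dt / dd)^2 * (Z0_i+1,j + Z0_i-1,j + Z0_i,j+1 + Z0_i,j-1 - 4 * Z0_i,j)

and for every later step (t >= 2):

    Zt_i,j = 2 * Zt-1_i,j - Zt-2_i,j + c^2 * (dt / dd)^2 * (Zt-1_i+1,j + Zt-1_i-1,j + Zt-1_i,j+1 + Zt-1_i,j-1 - 4 * Zt-1_i,j)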

Here are a few snapshots of the Wave2D simulation as it changes over time:

GPU and CUDA Overview

The GPU, a graphics processing unit, is a massively multithreaded, multicore chip available, for example, in:

• computer video cards
• PlayStation 3
• Xbox

CUDA, the Compute Unified Device Architecture, is a scalable parallel programming model and a software environment for parallel computing. Its heterogeneous serial-parallel programming model provides minimal extensions to the familiar C and C++ environments. The combination of the two, GPU computing with CUDA, brings data-parallel computing to the masses: more than 46 million CUDA-capable GPUs have already been sold, and a developer kit costs on average $200. As a result, massively parallel computing has become a commodity technology. Computing problems that used to require an incredible amount of computing resources can today be solved on a laptop equipped with a GPU.

One kernel is executed at a time on the device, and multiple threads execute each kernel. Each thread executes the same code on different data, based on the CUDA built-in threadIdx variable.

CUDA Threads

A grid is composed of blocks, which are completely independent, and a block is composed of threads, which can communicate within their own block.
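For illustration (this is a generic example rather than the Wave2D kernel itself), each thread typically derives the element it owns from its block and thread indices:

    // Generic kernel skeleton: the global index selects this thread's element.
    __global__ void scale(float *data, float factor, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // block offset + position within the block
        if (i < n)                                       // guard threads past the end of the data
            data[i] *= factor;
    }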

CUDA Programming

A typical approach to processing data with CUDA is to allocate memory on the CPU and copy it to the GPU memory. Once the memory is copied, we can start executing on the GPU by launching kernel methods. After successful execution of the kernel methods, the memory needs to be copied from the GPU device back to the system memory (CPU).
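A minimal sketch of this allocate / copy / launch / copy-back flow (illustrative only; the actual Wave2D kernels, sizes, and error handling differ):

    #include <cuda_runtime.h>

    __global__ void scale(float *data, float factor, int n) {   // same kernel as in the sketch above
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= factor;
    }

    void run_on_gpu(float *host, int n) {
        float *dev;
        size_t bytes = (size_t)n * sizeof(float);

        cudaMalloc((void **)&dev, bytes);                         // 1. allocate on the GPU
        cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);     // 2. copy host -> device

        scale<<<(n + 255) / 256, 256>>>(dev, 2.0f, n);            // 3. launch the kernel
        cudaDeviceSynchronize();                                  //    wait for it to finish

        cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);     // 4. copy device -> host
        cudaFree(dev);
    }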

Single Threaded

In a 100 by 100 simulation space, a single-threaded version needs to compute one of the above formulas for each cell, which means performing the calculation 10,000 times per simulation time increment. As you can imagine, this consumes server resources and takes a lot more time to complete the simulation.
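As a sketch, one time step of the single-threaded version is just a double loop over the cells; the array names and the coefficient C below are illustrative, and the update follows the assumed discretization sketched earlier:

    #define N 100
    #define C 0.1f                     // stands in for c^2 * (dt/dd)^2; illustrative value

    float z[N][N], z1[N][N], z2[N][N]; // current, previous, and second-previous surface heights

    // One simulation time step: roughly 10,000 cell updates performed sequentially.
    void step_single_threaded(void) {
        for (int i = 1; i < N - 1; i++)
            for (int j = 1; j < N - 1; j++)
                z[i][j] = 2.0f * z1[i][j] - z2[i][j]
                        + C * (z1[i + 1][j] + z1[i - 1][j] + z1[i][j + 1] + z1[i][j - 1]
                               - 4.0f * z1[i][j]);
        // after this step, z2 takes the values of z1 and z1 those of z (array rotation)
    }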

Multithreaded

In the multithreaded version we can take advantage of multiple threads if the server consists of multiple cores. If four threads are used, as in my experiment, the workload of 10,000 cells is divided equally into 2,500 cells per thread. Consequently, the total execution time is smaller, and the server's resources become available to process other tasks much sooner than in the single-threaded version.
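A sketch of the four-thread partitioning (illustrative; it reuses the arrays and update from the single-threaded sketch above and splits the rows into contiguous bands):

    #include <pthread.h>

    #define N        100
    #define NTHREADS 4
    #define C        0.1f                       // same illustrative coefficient as above

    extern float z[N][N], z1[N][N], z2[N][N];   // the arrays from the single-threaded sketch

    struct range { int begin, end; };           // half-open range of rows owned by one thread

    static void *worker(void *arg) {
        struct range *r = (struct range *)arg;
        for (int i = r->begin; i < r->end; i++) {
            if (i == 0 || i == N - 1) continue; // boundary rows stay fixed
            for (int j = 1; j < N - 1; j++)
                z[i][j] = 2.0f * z1[i][j] - z2[i][j]
                        + C * (z1[i + 1][j] + z1[i - 1][j] + z1[i][j + 1] + z1[i][j - 1]
                               - 4.0f * z1[i][j]);
        }
        return NULL;
    }

    // One time step split across four threads: 100 rows / 4 = 25 rows (2,500 cells) per thread.
    void step_multithreaded(void) {
        pthread_t tid[NTHREADS];
        struct range r[NTHREADS];
        for (int t = 0; t < NTHREADS; t++) {
            r[t].begin = t * (N / NTHREADS);
            r[t].end   = (t + 1) * (N / NTHREADS);
            pthread_create(&tid[t], NULL, worker, &r[t]);
        }
        for (int t = 0; t < NTHREADS; t++)
            pthread_join(tid[t], NULL);         // all rows finished before the arrays are rotated
    }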

The graphs above illustrate that the Wave2D CUDA version outperforms the multithreaded version in every aspect. From the smallest simulation space, 100 by 100, to the largest, 5000 by 5000, the difference in total execution time between the two versions grows as the simulation space becomes larger and the number of computations increases.

Multiple GPU devices

To coordinate multiple GPU devices, we need to create a separate host thread per device. The workload is then divided equally between these devices. The data is staged on the host (CPU), and each host thread calculates its portion of the array based on a data index offset. Each host thread must also allocate its own memory, which needs to be copied to its GPU. Once the GPU completes its assigned work, the host thread needs to update the main arrays staged on the host and pick up the portions completed by the other thread(s). As you can see, this involves a lot of communication between the CPU and the GPUs. There is also a large overhead of allocating redundant data in the main (staged) arrays, the host-thread arrays, and the GPU arrays, plus the additional step of synchronizing the staged data with each host thread.
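A condensed sketch of this copy-based coordination (assumed structure; the kernel body, array names, and helpers are illustrative, not the original code):

    #include <cuda_runtime.h>
    #include <pthread.h>

    #define N     100
    #define NGPUS 2

    float staged[N * N];                     // main array staged in host (CPU) memory

    // Hypothetical kernel: each thread updates some of this device's cells
    // (the real wave-equation update and halo handling are omitted for brevity).
    __global__ void wave2d_kernel(float *z, int rows, int n) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        for (int c = idx; c < rows * n; c += gridDim.x * blockDim.x)
            z[c] += 0.0f;                    // placeholder for the wave-equation update
    }

    static void *gpu_worker(void *arg) {
        int dev   = (int)(long)arg;          // device number: 0 or 1
        int rows  = N / NGPUS;               // each device owns half of the rows
        int first = dev * rows;
        size_t bytes = (size_t)rows * N * sizeof(float);
        float *dz;

        cudaSetDevice(dev);                               // bind this host thread to one GPU
        cudaMalloc((void **)&dz, bytes);                  // per-device copy of its slice
        cudaMemcpy(dz, staged + first * N, bytes, cudaMemcpyHostToDevice);

        wave2d_kernel<<<50, 100>>>(dz, rows, N);          // 50 blocks x 100 threads, as in the tests
        cudaDeviceSynchronize();

        // write the finished slice back so the other host thread can see it
        cudaMemcpy(staged + first * N, dz, bytes, cudaMemcpyDeviceToHost);
        cudaFree(dz);
        return NULL;
    }

    void step_multi_gpu(void) {
        pthread_t tid[NGPUS];
        for (long d = 0; d < NGPUS; d++)
            pthread_create(&tid[d], NULL, gpu_worker, (void *)d);
        for (int d = 0; d < NGPUS; d++)
            pthread_join(tid[d], NULL);      // synchronize the staged array before the next step
    }

In a real time-stepped run, the boundary (halo) rows between the two slices would also have to be exchanged through the staged array every step, which is exactly the synchronization overhead described above.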

There is also another method to achieve the same results: stage the array in system memory as in the previous example, but pass the host memory pointer to the GPU. In this case the GPU reads and writes the data on demand, and as long as the data is read or written only once in the kernel method, the communication overhead should be largely hidden.

To summarize, the two coordination methods are:

1. Copy-based approach
   a. Requires data to be copied back and forth between the CPU (host) and the multiple GPU devices
2. Zero-copy approach (a minimal setup sketch follows this list)
   a. Retrieves data on demand from the host and uses the index as a global offset
   b. Requires data to be allocated only once on the CPU (host); each GPU device then gets a pointer to the host memory
   c. Requires the device to support the cudaDeviceMapHost flag
   d. Requires the memory to be pinned
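A minimal zero-copy setup sketch (illustrative; it shows only the allocation and pointer mapping, assuming the device supports mapped host memory):

    #include <cuda_runtime.h>

    #define N 100

    // Returns a device-visible pointer to an N x N array that physically lives in
    // pinned host memory; the GPU then reads and writes it on demand over PCIe.
    float *map_host_array(float **host_out) {
        float *host_z, *dev_z;

        cudaSetDeviceFlags(cudaDeviceMapHost);        // must be set before the device is first used
        cudaHostAlloc((void **)&host_z, (size_t)N * N * sizeof(float),
                      cudaHostAllocMapped);           // pinned + mapped host allocation
        cudaHostGetDevicePointer((void **)&dev_z, host_z, 0);

        // dev_z (plus this device's row offset) is passed to the kernel in place of a
        // cudaMalloc'd pointer; cudaDeviceSynchronize() is still required before the
        // host reads the results back.
        *host_out = host_z;
        return dev_z;
    }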

Tests were run on both servers available in the lab:

• Hydra.uwb.edu
• Hercules.uwb.edu

Each GPU device was assigned the same kernel configuration: 50 blocks of 100 threads per device.

HYDRA Configuration

• 2 x Tesla C1060: 1.3 CC, 240 CUDA cores, 4 GB

HYDRA Results (total execution time in seconds)

Simulation Space   Single GPU   Multi GPU Copy-based   Multi GPU Zero-copy
100                0.002        4.389                  0.204
300                0.011        4.551                  0.885
500                0.019        4.769                  2.217

HERCULES Configuration

• GeForce GTX 680: 3.0 CC, 1536 CUDA cores, 2 GB
• Quadro NVS 295: 1.1 CC, 8 CUDA cores, 256 MB

HERCULES Results (total execution time in seconds)

Simulation Space   Single GPU   Multi GPU Copy-based   Multi GPU Zero-copy
100                0.004        0.750                  N/A
300                0.007        0.949                  N/A
500                0.015        1.161                  N/A

Based on my tests, I found that neither method improved the performance of the simulation. In fact, the redundant memory allocation and the additional communication and synchronization of data between the host and the devices slowed the performance significantly. Also, there is a large discrepancy between the two servers in the Multi GPU copy-based test. On Hydra, each simulation takes at least 4 seconds, but subsequent tests with a larger simulation space do not increase the time by much. In fact, the execution time across the different simulation spaces is quite similar, which makes me think there is a difference in hardware configuration, such as a slower PCIe bus.

Another finding was that I was not able to run the zero-copy test on the Hercules server, as the Quadro NVS 295 would fail when a host memory reference was passed to the device.

Lesson Learned

Based on this project, I have learned about CUDA technology and the GPU computing market. My knowledge of memory management has expanded dramatically. I have learned what is needed to implement a single-threaded and a multithreaded application and then map it to a GPU version with a single GPU or multiple GPUs. The coordination of multiple devices has been challenging at times, but in the end I have learned a new skill, which, hopefully, I will be able to apply at my work.

As a next step, I would like to take advantage of CUDA's peer-to-peer memory access, which allows the data to be copied directly between GPU cards and thus removes the overhead of redundant copies of data between the host and the devices. I believe that these new results would show a performance improvement when coordinating multiple GPU devices.

The next phase would be to generalize my solution in such a way that it could be used by other programs, such as Molecular Dynamics, which could benefit from powerful and efficient GPU parallelization.