Introduction

It is likely that the graphics card in your computer supports CUDA or OpenCL. If it does, then you are in for a real treat if you take the time to explore its capabilities. In this article I am showing off the new 4.5 teraflop GTX Titan card from NVidia ($1000). The one-year-old GTX 680 costs half that and still comes in at a staggering 3 teraflops. Even if you have a lower-cost GPU card, chances are that its performance will still be pretty impressive compared to your CPU.

We will run a "test" using the GPU with CUDA, the GPU with OpenCL, the CPU with OpenCL, and the CPU using straight C# - all within the safe confines of a managed C# application. Then we will explore the concept of streams, which allow us to overlap computations with memory transfers. Later we will leave C# behind and use straight C, only to find that there are no performance gains down that path. Finally we will tune our GPU code - enough to make your head hurt, but also to really extract all the computing power from our GPU.

Source code for all of this is provided (see above) and a checklist of required downloads is provided below.

The Test

Smooth one million floating point values using a set of 63 smoothing coefficients.

The function that computes the smoothed value of a given point takes a weighted sum of the surrounding data values and the smoothing coefficients.
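The article's original listing is not reproduced here, but a minimal CPU sketch of such a smoothing function might look like the following. The names `smooth_point`, `NCOEFF`, and the edge-clamping behavior are my assumptions, not the article's actual code:

```c
#include <stddef.h>

#define NCOEFF 63            /* number of smoothing coefficients (63-tap filter) */
#define HALF   (NCOEFF / 2)  /* 31 neighbors on each side of the center point */

/* Smooth one point: a weighted sum of its neighbors, clamped at the edges. */
float smooth_point(const float *data, size_t n, size_t i, const float *coeff)
{
    float sum = 0.0f;
    for (int k = -HALF; k <= HALF; k++) {
        long j = (long)i + k;
        if (j < 0) j = 0;                 /* clamp at the left boundary */
        if (j >= (long)n) j = (long)n - 1; /* clamp at the right boundary */
        sum += coeff[k + HALF] * data[j];
    }
    return sum;
}
```

Applied to all one million points, this is the inner loop that each GPU thread will later execute for its own point.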

CUDA vs OpenCL

As you can see above, CUDAfy gives you a choice of GPU technologies to use with your C# application. I believe it is pretty amazing that I can write some code in C# and have that code executed on the GPU using either CUDA or OpenCL, or on the CPU using straight C# or OpenCL for Intel CPUs. There have been a few heated debates on CUDA vs. OpenCL for GPUs, and I certainly do not want to give the impression that I know which technology is better. Here are some points to consider:

OpenCL is available for many video card technologies. CUDA is available only for NVidia-based cards (from Asus, EVGA, MSI, etc.). OpenCL is also available as a driver that uses the main CPU.

CUDAfy with OpenCL uses the video card driver to compile the code. CUDAfy with CUDA uses the C++ compiler at run time - but you can use a premade CUDAfy module (*.cdfy) or embed the code in the .NET assembly using the cudafycl tool.

Streaming in CUDA can achieve a 2X improvement in performance. I’ve been told OpenCL supports streams too, but I have not figured out how that works yet.

Under the Hood

Behind the scenes, CUDAfy magically creates either a CUDA or an OpenCL rendition of your code. The CUDA code must be compiled using a C++ compiler with the NVidia CUDA extensions. The OpenCL code is processed by the device driver, so there is much less headache in distributing your code.

CUDA Streaming

Simply stated, "streaming" in CUDA allows the GPU to perform concurrent tasks. In this application, the performance gains in CUDA are due to three overlapped operations. At any point in the performance test, the CUDA code is performing each of these three tasks concurrently:

Upload raw data from the host memory (CPU) to the device (GPU) memory.

Process (smooth) the data in device memory.

Download smoothed data from the device to the host.

Synchronize to wait for all operations issued on the given stream to complete before proceeding.
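The steps above map almost directly onto the CUDA runtime API. Here is a rough sketch of one streamed iteration - the buffer and kernel names are illustrative, not the article's actual code, and the host buffers would need to be allocated with cudaHostAlloc (pinned memory) for the copies to be truly asynchronous:

```cuda
cudaStream_t stream;
cudaStreamCreate(&stream);

/* 1. Upload a chunk of raw data from host to device asynchronously. */
cudaMemcpyAsync(devIn, hostIn, chunkBytes, cudaMemcpyHostToDevice, stream);

/* 2. Smooth the chunk in device memory (1024 blocks x 1024 threads). */
smoothKernel<<<1024, 1024, 0, stream>>>(devIn, devOut, devCoeff);

/* 3. Download the smoothed chunk from device back to the host. */
cudaMemcpyAsync(hostOut, devOut, chunkBytes, cudaMemcpyDeviceToHost, stream);

/* 4. Wait for all work issued on this stream to complete. */
cudaStreamSynchronize(stream);
```

With several streams in flight, the upload for one chunk can overlap the kernel for another and the download for a third, which is where the roughly 2X gain comes from.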

The slight difference in performance is due to the way the tasks are scheduled in CUDA. I implemented three scheduling methods, which I call Smooth A, Smooth B, and Smooth C.

Now I don’t have the stamina to turn this blog post into a tutorial on CUDA streaming. Feel free to examine the source code and see how the above three methods are implemented.

CUDA C vs. CUDAfy C#

Some have wondered if the overhead of C# could be significant, so I put together a straight C version of the same streaming performance test. The source code at Assembla now includes this new test.

The results show that, in this test at least, there is no overhead in using C#.

Faster!

It turns out that much of the time in the smoothing kernel is spent retrieving the input data and the smoothing coefficients from RAM, which NVidia calls "device memory". Each smoothing coefficient is accessed 1 million times, and each data point in the source is accessed 64 times. NVidia tells us that device memory is relatively slow, so maybe we can do something about that.

I had already broken the smoothing problem down into 1024 "blocks", where each block has 1024 threads. This means I have allocated 1 thread per data point. It turns out that the threads within a block can share this really fast memory called, well, "shared memory". Shared memory is at least two orders of magnitude faster than device memory. So the idea is to allocate and load the shared memory with all the smoothing coefficients and all the data points from device memory that the threads in that block will need. We need 64 coefficients and (because we are smoothing +/- 32 values around each data point) we need 32 + 1024 + 32 data points loaded from device memory into shared memory.

Since we have 1024 threads, I decided to let them move the first 1024 data points from device memory into shared memory in parallel.
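The article's original kernel listing is not shown here, but a sketch of a kernel that stages its data through shared memory might look like the following. The kernel and variable names are assumptions, and the exact tap count and boundary handling in the real code may differ:

```cuda
#define RADIUS 32   /* +/- 32 points around each data point */
#define BLOCK  1024 /* threads per block, one thread per data point */

__global__ void smoothKernel(const float *in, float *out, const float *coeff)
{
    /* 32 halo points + 1024 block points + 32 halo points. */
    __shared__ float s[RADIUS + BLOCK + RADIUS];

    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    /* All 1024 threads copy one interior point in parallel... */
    s[RADIUS + threadIdx.x] = in[gid];

    /* ...and the first 32 threads also fill the two halo regions
       (boundary clamping omitted for brevity). */
    if (threadIdx.x < RADIUS) {
        s[threadIdx.x] = in[gid - RADIUS];
        s[RADIUS + BLOCK + threadIdx.x] = in[gid + BLOCK];
    }
    __syncthreads();   /* wait until shared memory is fully loaded */

    /* Smooth from fast shared memory instead of slow device memory. */
    float sum = 0.0f;
    for (int k = -RADIUS; k <= RADIUS; k++)
        sum += coeff[k + RADIUS] * s[RADIUS + threadIdx.x + k];
    out[gid] = sum;
}
```

The __syncthreads() barrier is essential: no thread may start smoothing until every thread has finished loading its share of the data.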

It turns out that the copying of the coefficients to shared memory did not buy me much in terms of performance, but copying the data sure helped.

The Future

I tried executing a "null" kernel that did nothing but return. This gave me a time of about 0.74 ms. Therefore, the next place to take time out of this system is to obtain an NVidia card that sports what is called a "dual copy engine" which allows one upload, one download, and several kernels to all run concurrently.

Share

About the Author

John Hauck has been developing software professionally since 1981, and focused on Windows-based development since 1988. For the past 17 years John has been working at LECO, a scientific laboratory instrument company, where he manages software development. John also served as the manager of software development at Zenith Data Systems, as the Vice President of software development at TechSmith, as the lead medical records developer at Instrument Makar, as the MSU student who developed the time and attendance system for Dart container, and as the high school kid who wrote the manufacturing control system at Wohlert. John loves the Lord, his wife, their three kids, and sailing on Lake Michigan.

Comments and Discussions

I have updated the blog at w8isms. Hopefully that will show up here soon. It turns out that the GTX Titan is indeed up to 70X faster than the Xeon E5. I also explain much more about the different streaming methods and kernel strategies.

It seems that, at least until CUDAfy V1.12, which was the most recent version when I tried your code, the option eGPUType.OpenCL does not yet exist. As I got an AMD gfx card (and no NVidia), I'd really love to see it running with OpenCL.

As of today, version 1.21 (which supports OpenCL) is in beta. You can find this at cudafy.codeplex.com. Select the downloads page (tab). On the right side of the page, there is a box titled "other downloads". Select CUDAfy V1.21 Beta. I hope this helps.

Can you explain your results further? What does Smooth 0 measure?
How do you conclude that "The CPU is 70x slower for this specific task". The CPU time was 53.84ms, so 70x faster would be 0.77ms - where can I see that implementation?
thanks

Code project pulled my blog post before I was ready. I've updated my blog since then but I guess it takes Code Project a bit to assimilate my updates. You can see the latest at w8isms.blogspot.com.

In any case, I originally obtained a recorded time of 0.75ms when I wrote the article - but later found out I made a mistake in the code. 50X is what I am seeing now. Sorry about the confusion.

Smooth 0 is a non-streamed version of the loop. That code is shown in the article. Smooth A, B, and C are streamed versions. The latest blog post explains this a bit better - complete with a "timing diagram".

The link to the source should have the latest code and all should be pretty clear from that.