Porting to CUDA 6.0

Update: With CUDA 6.0 released on 4/15/2014 I need to update this article and my Neutrino project code. I’m traveling but I expect to get to it by the end of the month.

This is a technical article so skip it if you read my blog for pithy tech commentary and wisdom. So I’m working with a beta version of CUDA 6.0 from nvidia that enables a new kind of GPU memory management called Unified Memory or “Managed Memory” which makes it much easier to migrate complex C++ class objects with deep data structures to the GPU. This of course is a GREAT feature for enabling much more generalized use of the GPU. Since I’m on the bleeding edge of adoption here I thought I would write an article about my first approach and experiences with porting C++ objects to CUDA 6.0 for the benefit of those who will come after me with the same challenges. As a little context, prior to this moment in GPU computing history writing GPU code was largely constrained to one of two approaches. 1) Writing individual GPU functions that could be called from the CPU code to massively parallel accelerate a specific function call. 2) Using technologies like AMP, OpenAcc or OpenMP to enable the compiler to automatically parallelize for-loops on the GPU. Both approaches enabled developers to kick a narrow range of common compute intensive operations up to the GPU for massive acceleration.. but still largely relied on the primary code base to exist and run on the CPU. The introduction of Unified Memory in CUDA, for the first time makes it practical to move huge bodies of general C++ code entirely up to the GPU and to write and run entire complex code systems entirely on the GPU with minimal CPU governance. In theory a big leap, but not without some new challenges. In this article I will present my first experience and approach to accomplishing this with the caveat that I do not wish to present myself as an authority on this subject. It’s new and I’m just learning it myself. None-the-less it’s very exciting and I wanted to pass on my early adopter wisdom such as it is. To keep this article concise I’m not going to attempt to explain the entire CUDA memory architecture here, the documentation does a great job, what I’m going to focus on is the specific approach I’ve taken to porting C++ class objects and deep data structures to the GPU. As a starting point I began my effort with Mark Harris’s great blog article on Unified Memory: http://devblogs.nvidia.com/parallelforall/unified-memory-in-cuda-6/#more-2221 The “magic” of porting C++ objects to Unified Memory is captured here with the simple overloading of a class objects new and delete operators to allocate Unified Memory for an object INSTEAD of CPU memory:

Simple, very elegant. It worked great. Obviously to port an existing object, I would need an easy way to test it against the functionality of it’s pure CPU implementation which I accomplished by modifying Mark’s code as follows;

Now I can just toggle between ordinary CPU memory management and UM memory by setting the CUDA_MANAGED variable to true or false. When you instantiate a C++ object that inherits its new and delete operator from the Managed class it gets created in shared CPU/GPU memory automatically! Deep data structure copies also happen automatically as long as those structures ALSO inherit from the UM new and delete operators. However there is obviously little benefit to doing this IF the member functions themselves only work on the CPU! Unfortunately the state-of-the art in GPU programming does not yet automatically turn my C++ code into GPU code… hopefully that will come one day soon, but in the meantime I need a strategy to convert each of my class member functions into GPU kernels. How is this accomplished? Before we begin, I should mention that most nvidia documentation on the use of CUDA places a huge emphasis on code optimization which often obscures how simple it can be to engage in basic porting. Although CUDA optimization is often essential to achieving the incredible performance potential of the GPU, it also often overcomplicates the initial steps required to port CPU code and verify that it is working properly FIRST. Preserving the CPU only functionality of the class not only increases it’s portability and generality, it provides an essential testing template for verifying that the GPU version of the code is working correctly, especially when you begin optimizing it for the GPU. To accomplish this I structure my initial C++ code as follows inheriting from my Managed class;

In this simple example I want to “port” foo() to a GPU kernel but preserve it’s utility as a CPU only function. To accomlish this task we first convert foo() into a CUDA __host__ __device__ function as follows;

The result parameter is added because CUDA functions can’t return values directly, you have to pass a pointer to the memory you want to return a result in. This function is dual purpose, when

#define CUDA_MANAGED false

It will be compiled as CPU code and work normally as a CPU function but when CUDA_MANAGED = true it will compile as a GPU function. Now I need to modify my original foo function to differentiate between CPU and GPU implementations.

Now the old foo() function has been converted into a redirection call that chooses between CPU and GPU implementations depending on whether or not the code was compiled to use Unified Memory. As ordinary CPU code the d_foo() function gets called and behaves just as the original foo() function did. If, however CUDA_MANAGED == true, it calls a kernel I have not created yet called g_foo() in a single CUDA thread. Because device functions CANNOT be called directly from the CPU we need a CUDA kernel function to call the device version of d_foo() for us. This may seem like a lot of indirection but the nvidia compiler will actually inline the d_foo() device function into the g_foo() redirection kernel function at compile time. Here’s what g_foo() needs to look like;

Nothing to it… just a kernel launch to call d_foo() from the GPU. The important “porting” feature of this approach is here in foo():

g_foo<<<1,1>>>( a_parameter, &result );

I’m calling the g_foo() kernel as a single CUDA thread. Stupid, not parallel, silly implementation BUT very handy for testing and verifying that the function is working in CUDA as expected. That’s it! Using this approach I ported several thousand lines of C++ code and half a dozen objects to CUDA 6.0 in a couple days. Having accomplished this I am now in the process of writing unit tests that leverage the CPU code to verify that the GPU code is working correctly which will be essential when I begin the process of actually parallelizing the GPU code and CUDA optimizing it. I’ll write a subsequent article on how that goes, once I’ve figured it out myself.

UPDATE: Mark Harris, who wrote the original NVidia parallelforall blog post I cited in this article submitted a comment suggesting an approach to my unit testing challenges. The code got trashed as a comment so I’m embedding it as an image until such a time as I’m in the mood to manually edit the HTML to format it correctly.. don’t hold your breath… He also points out that I should have shown the definition of the String class itself inheriting from the Managed class, thereby enabling a deep copy of the entire dataElem object to shared memory. After I’ve had a chance to work through his suggestions and get happy with my approach to unit testing I’ll write an update to this article that condenses this collective wisdom. “Regarding your parameterized allocation, you might have some luck with policy based class design / mixins… Here’s an example:

Note my use of a reference parameter here — I did this to ensure the contents of the dataElem struct, which doesn’t contain any pointers, isn’t passed by value (which would not be the right test…).” -Mark Harris

Share this:

Like this:

9 Comments

Exciting time ahead in seeing GPUs more readily used to their full potential.

Do you have any insight into whether CUDA will ever work on NVidia’s favorite friend AMD GPUs? Its pretty disheartening to see that any CUDA work you put in is going to be useless for almost exactly 50% of the gaming market (AMD+Intel GPUs).
As much as I want to get excited and play with CUDA, those kinds of stats make me shy away from it.

I haven’t looked at it yet but AMD made a big GPU announcement this week and the news item linked support to it to AMD’s Mantle API. It’s on my list to check out. We’re on the very bleeding edge of innovation here, there are plenty of open parallelism API’s that work on both platforms… that are very limited… I have confidence that given NVidia’s leadership in CUDA the market will tend to follow them in capability over time, but I want to be out there TODAY. The trend is clearly toward making adopting GPU programming less and less work and even if the API is open the optimization effort will probably remain pretty proprietary for the foreseeable future. In other words, regardless of whether or not the parallelism API you are using is proprietary, the work you do to achieve performance will still be closely linked to the chip architecture you optimized for…

This is a little embarrassing because I suspect there is a more elegant way to do this, but I found that I was having difficulty writing unit tests to sweep through the various permutations of my class libraries running in CPU memory, GPU shared memory and GPU allocated memory using the Managed class I used for this example. I needed a way to dynamically choose between memory allocation types. This is how I solved it in this situation but I’m certainly receptive to more brilliant suggestions.

So basically I created new and delete overloads that let me parameterize the memory allocation type. But to use them you have to use a fairly exotic bit of C++ syntax. Marvel at this line of code;

//test_foo = new dataElem();
test_foo = (dataElem *)dataElem::operator new( sizeof(dataElem(), true );
The problem with it is that it doesn’t actually call dataElem’s constructor… that piece of syntax appears to be beyond me (how can you initialize something that hasn’t been constructed yet?), so I had to stoop to replacing the objects constructor with an initializer member function and calling that instead. (I am ashamed)

test_foo->init_test_foo();

Yeah, it’s been a couple decades since I was a professional engineer so I’m catching up on my C++ 11, say what you will… it seems to work. I’m open to better approaches, but at least now I can write unit tests for all the GPU memory allocation permutations I want to support.

I feel the urge to help you out here (when i get some time), as your investigations aligns closely with the direction I’ve been instigating at work for porting large C++ class hierarchies without rewriting. Ours use STL which is an additional challenge for CUDA.

Hey Terry, feel free to share any wisdom you develop on the subject. I’m a little twisted and enjoy the learning curve so rest assured that I’m having a good time. I haven’t gotten to trying Mark’s mixen approach yet, just because my partially successful unit test solution… revealed bugs… which I must now compulsively fix before I can proceed. Yeah the STL issue is a challenge and an opportunity. The body of code I just ported is actually a library of STL-like data structure libraries I wrote with the expectation that they would optimize very well for the GPU, so my unit test library is actually the STL for common data structures whose interfaces I’ve duplicated in CUDA. I’ve designed new kinds of hash tables, sparse arrays and sorting libraries for my use on CUDA. I know, many of these solutions are already available from Thrust and other libraries but what I’ve done is “different” and a great exercise in learning CUDA optimization. The structure I’m most excited about is a new kind of hashtable I created that I suspect will absolutely rock on the GPU because it should exhibit extremely good memory coalescence compared to an ordinary hash table. One of the interesting things about the GPU is that it appears that hash code generation and decoding can be virtually free so you can get a lot more creative about how you compute these things.

I’m doing similar things. My classes have members that are references to other members, so I need to call copy constructor to GPU. So far I’ve done this https://gist.github.com/mbianco/8728714 (CUDA 5.5)

ooh, thats’ clever, you’re giving me ideas. Much nicer in theory than dragging along a lot of code in my derived classes.. gotta think on that.

// A class should publicly derive from this class passing itself as template argument.
It’s like using two mirrors to see the back of your head. If I get what that’s doing, this lets your base class call members of your derived class.. which means that even if you have arbitrary pointers to memory that need special code to migrate, you might still be able to specify an API in the derived class that the base can call to copy those members as well? I guess the question with trying that is whether or not it’s possible for the base class to glue any arbitrary combination of pointers to member data and objects back together that way. Hmmmm…

*you’ve got a “teamplate” keyword down at the bottom of your code. Angle brackets are such a pita to format online.

Daily Comments

Only Microsoft would think that the solution to creating a disasterous desktop UI experience that cratered PC sales is to launch a little less of it AND incorporate it into the long lost Start Menu to "improve" it. WIndows 9 Marketing Slogan? "Windows 9 sucks less than Windows 8 and we refuse to sell you Windows 7 anymore!"