Debugging GPU Code in Microsoft C++ AMP

Visual Studio 2013 makes it much easier to debug parallel code running on the GPU

Now, it's time to run code on the GPU to calculate the sum of the three arrays. Concurrency::parallel_for_each invokes a parallel computation of a kernel function (that is, a function designed for execution on the GPU) over a compute domain on an accelerator_view. sum.extent is the extent that represents the set of indices that form the compute domain. A lambda expression is the function object that performs the parallel computation and receives idx (index<1>). Visual Studio 2013 also includes enhancements that reduce the parallel_for_each launch overhead in the C++ AMP runtime compared with the previous runtime version.
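For readers who do not have the listing at hand, here is a minimal sketch of what a function like amp_sum might look like, based only on the description above. The signature, the use of std::vector, and the SKETCH_HAS_AMP macro are assumptions made for illustration, not the original listing. Because C++ AMP ships only with Visual C++, the sketch falls back to a plain CPU loop on other compilers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

#if defined(_MSC_VER) && !defined(__clang__)
#include <amp.h>  // C++ AMP header; available with Visual C++ only
#define SKETCH_HAS_AMP 1  // hypothetical macro used only by this sketch
#endif

// Hypothetical reconstruction of amp_sum: adds three equally sized arrays
// element by element. The signature is an assumption made for this sketch.
std::vector<int> amp_sum(const std::vector<int>& va,
                         const std::vector<int>& vb,
                         const std::vector<int>& vc) {
    assert(va.size() == vb.size() && vb.size() == vc.size());
    std::vector<int> result(va.size());
    if (result.empty()) return result;  // nothing to compute

#ifdef SKETCH_HAS_AMP
    using namespace concurrency;
    const int n = static_cast<int>(result.size());
    // array_view wraps CPU memory; data is copied to the accelerator on demand.
    array_view<const int, 1> a(n, va.data());
    array_view<const int, 1> b(n, vb.data());
    array_view<const int, 1> c(n, vc.data());
    array_view<int, 1> sum(n, result.data());
    sum.discard_data();  // hint: the current contents of sum need not be copied over

    // One kernel invocation per index in sum.extent.
    parallel_for_each(sum.extent, [=](index<1> idx) restrict(amp) {
        sum[idx] = a[idx] + b[idx] + c[idx];
    });
    sum.synchronize();  // copy the results back into result before returning
#else
    // Fallback for compilers without C++ AMP: the same arithmetic on the CPU.
    for (std::size_t i = 0; i < result.size(); ++i) {
        result[i] = va[i] + vb[i] + vc[i];
    }
#endif
    return result;
}
```

With hypothetical inputs {1, 2, 3}, {10, 20, 30} and {100, 200, 300}, the function returns {111, 222, 333}. The explicit call to synchronize is a design choice in this sketch that makes the copy back to CPU memory visible in the code.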

Notice that restrict(amp) indicates that the code is to be executed on a C++ AMP accelerator. Therefore, you cannot debug this piece of code while you are debugging the code that runs on the CPU. If you set a breakpoint at the line sum[idx] = a[idx] + b[idx] + c[idx], you will see the invalid breakpoint icon with the following message: The breakpoint will not currently be hit. No executable code of the debugger's target code type is associated with this line. Possible causes include: conditional compilation, compiler optimizations, or the target architecture of this line is not supported by the current debugger code type.

By default, when you start a new C++ project, the debugger type is set to Auto, and therefore, you are limited to CPU debugging when you have C++ AMP code. To check your current settings, open the project properties and select Configuration Properties | Debugging | Debugger Type. As I mentioned earlier, if you are running Visual Studio 2013 on Windows 8.1, you can use the WARP accelerator to debug both CPU and GPU code. However, right now I want to focus on debugging CPU code, so I will discuss the WARP accelerator later.

Naturally, when you debug CPU code, you can step through the _tmain function. The code that creates the C++ AMP objects within the amp_sum function also runs on the CPU, and therefore, the last line at which you can set a breakpoint is the call to sum.discard_data(). This optimization hint is the last line executed on the CPU before parallel_for_each invokes the parallel computation of the kernel function over the compute domain. Thus, if you want to inspect the values of the different array_view instances before the GPU executes the kernel, that line is the appropriate place to set a breakpoint (see Figure 1).

While debugging, keep in mind that each parallel_for_each call launches as an asynchronous operation. However, the execution will end before the amp_sum function returns, and therefore, the valuesAPlusBPlusC array in the _tmain function will hold the results of the sum computed by the code executed on the GPU. Thus, you can inspect valuesAPlusBPlusC if you set a breakpoint at the for loop that prints the results within the _tmain function.
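Besides inspecting the output array by eye in the debugger, you can compare it against a reference computed on the CPU. The helper below is a minimal sketch of that idea. The name first_mismatch and its signature are hypothetical and do not come from the article's listing.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns the first index at which sum differs from
// a[i] + b[i] + c[i] computed on the CPU, or -1 when every element matches.
long first_mismatch(const std::vector<int>& a,
                    const std::vector<int>& b,
                    const std::vector<int>& c,
                    const std::vector<int>& sum) {
    assert(a.size() == b.size() && b.size() == c.size() && c.size() == sum.size());
    for (std::size_t i = 0; i < sum.size(); ++i) {
        if (sum[i] != a[i] + b[i] + c[i]) {
            return static_cast<long>(i);
        }
    }
    return -1;
}
```

Calling such a helper right after the function that launches the kernel returns gives you a single value to watch at the breakpoint: -1 means the GPU results agree with the CPU reference, and any other value points at the first element worth inspecting.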

Debugging GPU Code

Once you check the input and the output for the kernel, you can relaunch the debugging session with new settings to debug the kernel code. Access the project properties, select Configuration Properties | Debugging | Debugger Type and change the setting from Auto to GPU Only. The Debugging Accelerator Type will change to GPU - Software Emulator, and the GPU Default Breakpoint Behavior will be set to Break once per warp. First, I will use the default settings that specify that a breakpoint event should be raised only once per warp. Then, I will show the differences with the other available setting, which breaks with every thread, similar to the CPU behavior.

Now, let's set a breakpoint at the following line within the parallel_for_each invocation, in the amp_sum function: sum[idx] = a[idx] + b[idx] + c[idx];

When you start debugging the application, the debugger will stop at that breakpoint. Select Debug | Windows | GPU Threads to activate the GPU Threads window (see Figure 2). The GPU software emulator lets you work with four threads, and therefore, you will see 4 threads in the Thread Count column for the amp_sum::I3<lambda> Location. You can debug the kernel code much as you are used to doing with multithreaded CPU code. You can use the Parallel Stacks window to check the threads launched on the GPU and the Locals window to inspect the contents of the different array_view instances (see Figure 2). The first time the debugger hits the breakpoint, you will see the ten values for sum set to 0. You can also use the Parallel Watch window to freeze and thaw GPU threads, just as you do with CPU threads.

If you press F5 (Continue), you will notice that the first four values for sum are filled:

sum[0]: 111

sum[1]: 222

sum[2]: 333

sum[3]: 444

As I mentioned earlier, the GPU software emulator is working with four threads and the GPU Default Breakpoint Behavior is set to Break once per warp. The warp size is 4, so when the debugger hits the breakpoint again, the GPU software emulator has already executed four threads that completed the first four sum values.

The GPU Threads window provides valuable information for understanding the C++ AMP code. Click the Expand Thread Switcher button in the upper-left corner, and a new panel displays the number of the thread that is active in the debugger and the range for the index. Figure 3 shows Thread 4 as the active thread and a Range of 0..9. Thus, the debugger is sitting on the line that will calculate the value to be stored in sum[4] because idx is equal to 4.

Figure 3: The GPU Threads window with the expanded thread switcher and the Parallel Watch 1 window. The active thread is #4 (Thread[4]).

You can activate another thread in the debugger by entering the desired number in the Thread textbox within the thread switcher. For example, if you enter 5 in the Thread textbox and press Enter, the debugger will switch to thread #5, idx will be equal to 5, and the displayed line will calculate the value to be stored in sum[5]. However, keep in mind that you are working in a warp of four threads, which means that you can debug threads 4, 5, 6, and 7. You cannot go to thread #8 yet. You can also select Debug | Windows | Parallel Watch | Parallel Watch 1 and see the list of the four active GPU threads: [4], [5], [6], and [7]. I find it easier to see and select the active thread by using the Parallel Watch 1 window instead of the GPU Threads window.

If you press F5 (Continue), you will notice that the first eight values for sum are filled:

sum[0]: 111

sum[1]: 222

sum[2]: 333

sum[3]: 444

sum[4]: 555

sum[5]: 666

sum[6]: 777

sum[7]: 888

This time, there are going to be two active GPU threads: [8] and [9]. Finally, the code will use these two threads to calculate the last two values.
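The sequence of breakpoint hits described above follows from simple arithmetic on the extent and the warp size. The following plain C++ sketch (no C++ AMP required) models it. The function name warp_ranges is hypothetical, and the model assumes that threads are grouped into warps in index order, as observed in this walkthrough.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Returns the inclusive [first, last] range of thread indices that are
// active at each breakpoint hit, assuming threads are grouped in index order.
std::vector<std::pair<int, int>> warp_ranges(int extent, int warp_size) {
    assert(extent >= 0 && warp_size > 0);
    std::vector<std::pair<int, int>> ranges;
    for (int first = 0; first < extent; first += warp_size) {
        // The last warp may be partial when extent is not a multiple of warp_size.
        const int last = std::min(first + warp_size, extent) - 1;
        ranges.emplace_back(first, last);
    }
    return ranges;
}
```

For an extent of 10 and a warp size of 4, the function returns three ranges: 0..3, 4..7, and 8..9. These match the three groups of active threads seen in the GPU Threads and Parallel Watch 1 windows, including the final partial warp with only two threads.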

Conclusion

If you are not comfortable with the debugger working with up to four threads per breakpoint hit, you can change the GPU default breakpoint behavior so that the debugger breaks for every thread. Open the project properties, select Configuration Properties | Debugging | GPU Default Breakpoint Behavior, and set it to Break for every thread (like CPU behavior). This way, you will work with just one thread at a time, which makes certain algorithms easier to debug.
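To get a feel for the trade-off between the two settings, you can estimate how many times the debugger will stop at a kernel breakpoint. The sketch below is a rough model, not a description of the debugger's internals. The function name breakpoint_stops is hypothetical, and the model assumes that every thread reaches the breakpoint and that no thread is frozen.

```cpp
#include <cassert>

// Estimated number of stops at a kernel breakpoint when each stop covers up
// to threads_per_stop threads (4 for "Break once per warp" in this example,
// 1 for "Break for every thread").
int breakpoint_stops(int extent, int threads_per_stop) {
    assert(extent >= 0 && threads_per_stop > 0);
    return (extent + threads_per_stop - 1) / threads_per_stop;  // ceiling division
}
```

For the ten-element example, the model gives 3 stops with four threads per stop and 10 stops with one thread per stop. Breaking for every thread is simpler to follow, but the number of stops grows with the size of the compute domain.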

As you can see from this simple example, Visual Studio 2013 makes it easy to debug C++ AMP code. In fact, you can debug GPU code in a similar way to how you debug multithreaded CPU code. However, things become a bit more complicated when you use the GPU programming performance optimization technique called tiling.

In the next article, I'll explain how Visual Studio 2013 C++ AMP debugging capabilities help you to understand and debug code that uses tiling optimizations. The GPU code is going to be more complex than the example explained in this article and it will be necessary to use additional helpers to simplify the debugging session.
