CUDA Pro Tip: nvprof is Your Handy Universal GPU Profiler

CUDA 5 added a powerful new tool to the CUDA Toolkit: nvprof. nvprof is a command-line profiler available for Linux, Windows, and OS X. At first glance, nvprof seems to be just a GUI-less version of the graphical profiling features available in the NVIDIA Visual Profiler and Nsight Eclipse Edition. But nvprof is much more than that; to me, nvprof is the lightweight profiler that reaches where other tools can't.

Use nvprof for Quick Checks

I often find myself wondering if my CUDA application is running as I expect it to. Sometimes this is just a sanity check: is the app running kernels on the GPU at all? Is it performing excessive memory copies? By running my application with nvprof ./myApp, I can quickly see a summary of all the kernels and memory copies that it used, as shown in the following sample output.
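A quick sanity check like this needs no recompilation or code changes; you simply run the application under nvprof (the application name `myApp` here is a placeholder):

```shell
# Default summary mode: prints total time and percentage of GPU time
# for each kernel, plus a breakdown of memory copies.
nvprof ./myApp
```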

In its default summary mode, nvprof presents an overview of the GPU kernels and memory copies in your application. The summary groups all calls to the same kernel together, presenting the total time and percentage of the total application time for each kernel. In addition to summary mode, nvprof supports GPU-Trace and API-Trace modes that let you see a complete list of all kernel launches and memory copies, and in the case of API-Trace mode, all CUDA API calls.

Following is an example of profiling the nbody sample application running on two GPUs on my PC, using nvprof --print-gpu-trace. We can see on which GPU each kernel ran, as well as the grid dimensions used for each launch. This is very useful when you want to verify that a multi-GPU application is running as you expect.
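The trace run takes the form below. The nbody arguments shown are illustrative; the CUDA sample accepts `--benchmark` and `-numdevices=N` flags, but check its `--help` output for your Toolkit version.

```shell
# GPU-Trace mode: a chronological list of every kernel launch and
# memory copy, including the device each kernel ran on and the grid
# and block dimensions used for each launch.
nvprof --print-gpu-trace ./nbody --benchmark -numdevices=2
```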

Use nvprof to Profile Anything

nvprof knows how to profile CUDA kernels running on NVIDIA GPUs, no matter what language they are written in (as long as they are launched using the CUDA runtime API or driver API). This means that I can use nvprof to profile OpenACC programs (which have no explicit kernels), or even programs that generate PTX assembly kernels internally. Mark Ebersole showed a great example of this in his recent CUDACast (Episode #10) about CUDA Python, in which he used the NumbaPro compiler (from Continuum Analytics) to Just-In-Time compile a Python function and run it in parallel on the GPU.

During initial implementation of OpenACC or CUDA Python programs, it may not be obvious whether a function is running on the GPU or the CPU (especially if you aren't timing it). In Mark's example, he ran the Python interpreter inside of nvprof, capturing a trace of the application's CUDA function calls and kernel launches. The trace showed that the kernel was indeed running on the GPU, along with the cudaMemcpy calls used to transfer data from the CPU to the GPU. This is a great example of the "sanity check" ability of a lightweight command-line GPU profiler like nvprof.
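In outline, that workflow amounts to running the interpreter itself under the profiler (the script name here is hypothetical):

```shell
# nvprof traces CUDA activity regardless of the language that produced
# it, so profiling JIT-compiled Python kernels is just:
nvprof --print-gpu-trace python my_gpu_script.py
```

If the trace shows no kernel launches at all, the function fell back to the CPU, which is exactly the kind of surprise this check is meant to catch.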

Use nvprof for Remote Profiling

Sometimes the system that you are deploying on is not your desktop system. For example, you may be using a GPU cluster or a cloud system such as Amazon EC2, where you have only terminal access to the machine. This is another great use for nvprof. Simply connect to the remote machine (using ssh, for example), and run your application under nvprof.

By using the --output-profile command-line option, you can output a data file for later import into either nvprof or the NVIDIA Visual Profiler. This means that you can capture a profile on a remote machine, and then visualize and analyze the results on your desktop in the Visual Profiler (see “Remote Profiling” for more details).
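A remote profiling session might look like the following sketch (file names are placeholders):

```shell
# On the remote machine: capture a profile to a file instead of
# printing results to the terminal.
nvprof --output-profile my_profile.out ./myApp

# Back on your desktop (e.g. after `scp remote:my_profile.out .`),
# either re-examine the data with nvprof itself...
nvprof --import-profile my_profile.out

# ...or import the file into the NVIDIA Visual Profiler via
# File > Import for graphical timeline analysis.
```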

nvprof provides a handy option (--analysis-metrics) to capture all of the GPU metrics that the Visual Profiler needs for its "guided analysis" mode. The screenshot below shows the Visual Profiler being used to determine the bottleneck of a kernel. The data for this analysis were captured using the command line below.
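The capture command takes roughly this form (the output file name is a placeholder; note that collecting the full metric set replays kernels multiple times, so the profiled run takes noticeably longer than normal):

```shell
# Collect every metric the Visual Profiler's guided analysis needs,
# and write the result to a file for later import.
nvprof --analysis-metrics --output-profile my_analysis.out ./myApp
```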

A screenshot of the NVIDIA Visual Profiler (nvvp) analyzing data imported from the nvprof command-line profiler.

A Very Handy Tool

If you are a fan of command-line tools, I think you will love using nvprof. There is a lot more that nvprof can do that I haven’t even touched on here, such as collecting profiling metrics for analysis in the NVIDIA Visual Profiler. Check out the nvprof documentation for full details.

I hope that after reading this post, you’ll find yourself using it every day, like a handy pocket knife that you carry with you.

About Mark Harris

Mark is Chief Technologist for GPU Computing Software at NVIDIA. Mark has fifteen years of experience developing software for GPUs, ranging from graphics and games, to physically-based simulation, to parallel algorithms and high-performance computing. Mark has been using GPUs for general-purpose computing since before they even supported floating point arithmetic. While a Ph.D. student at UNC he recognized this nascent trend and coined a name for it: GPGPU (General-Purpose computing on Graphics Processing Units), and started GPGPU.org to provide a forum for those working in the field to share and discuss their work. Follow @harrism on Twitter

So 6 registers. However, running in nvprof still shows 7 registers. I’m not sure about the cause of this discrepancy but I will file a bug! Thanks!

George

Ok! Same output here.
So, I must always use sm_21 (for compute capability 2.1)?
So, must the ptxas output and nvprof always give the same results?

Thank you!

http://www.markmark.net/ Mark Harris

You don’t have to explicitly specify the arch version (sm_21), but if you want full control over what code is generated you might want to. I recommend you read my post linked above about fat binaries and JIT linking.

As I wrote I think the profiler *should* match the ptxas output, so I have filed an issue internally to figure that out.

George

Ok,thank you!

http://www.markmark.net/ Mark Harris

I got the answer. To support profiling (for example of concurrent kernels), the profiler has to patch kernel code with some additional instructions, sometimes consuming extra registers. So in this case it uses an extra register. You can verify this by running

nvprof --print-gpu-trace --concurrent-kernels-off ./run

This disables profiling of concurrent kernels (not needed for this app), and you will see the register count drop to 6.