CUDA Pro Tip: Generate Custom Application Profile Timelines with NVTX

The last time you used the timeline feature in the NVIDIA Visual Profiler or Nsight to analyze a complex application, you may have wished to see more than just CUDA API calls and GPU kernels. Most applications do significant work on both the CPU and the GPU, so it helps to see in more detail which CPU functions are taking time, for example to identify the sources of idle GPU time.

In this post I will show you how you can use the NVIDIA Tools Extension (NVTX) to annotate the timeline with useful information. I will demonstrate how to add time ranges by calling the NVTX API from your application or library. This can be a tedious task for complex applications with deeply nested call graphs, so I will also explain how to use compiler instrumentation to automate it.

What is the NVIDIA Tools Extension (NVTX)?

The NVIDIA Tools Extension (NVTX) is an application interface to the NVIDIA profiling tools, including the NVIDIA Visual Profiler, Nsight Eclipse Edition, and Nsight Visual Studio Edition. NVTX lets you annotate the profiler timeline with events and ranges, customize their appearance, and assign names to resources such as CPU threads and devices.

Let’s use the following source code as the basis for our example. (This code is incomplete; complete examples are available in the Parallel Forall GitHub repository.)
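The full listing lives in the repository; the condensed stand-in below, assuming a simple daxpy kernel alongside the init_host_data helper discussed later, gives the general shape of the example.

```cpp
#include <cuda_runtime.h>
#include <cstdlib>

__global__ void daxpy(int n, double a, const double* x, double* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Host-side initialization whose duration we would like to see
// in the profiler timeline.
void init_host_data(int n, double* x) {
    for (int i = 0; i < n; ++i) x[i] = i;
}

int main() {
    const int n = 1 << 20;
    double* x = (double*)malloc(n * sizeof(double));
    double* y = (double*)malloc(n * sizeof(double));
    double *d_x, *d_y;

    init_host_data(n, x);  // CPU work, invisible in the default timeline
    init_host_data(n, y);

    cudaMalloc(&d_x, n * sizeof(double));
    cudaMalloc(&d_y, n * sizeof(double));
    cudaMemcpy(d_x, x, n * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, n * sizeof(double), cudaMemcpyHostToDevice);

    daxpy<<<(n + 255) / 256, 256>>>(n, 2.0, d_x, d_y);

    cudaMemcpy(y, d_y, n * sizeof(double), cudaMemcpyDeviceToHost);

    cudaFree(d_x); cudaFree(d_y);
    free(x); free(y);
    return 0;
}
```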

When we run this application in the NVIDIA Visual Profiler we get a timeline like the following image.

This timeline shows CUDA memory copies, kernels, and CUDA API calls. To also see (for example) the duration of the host function init_host_data in this timeline, we can use an NVTX range. In this post I will explain one way to use ranges; a description of all NVTX features is available at docs.nvidia.com.

How to use NVTX

Use NVTX like any other C library: include the header "nvToolsExt.h", call the API functions from your source, and link against the NVTX library by adding -lnvToolsExt to the compiler command line.

To see the duration of init_host_data you can use nvtxRangePushA and nvtxRangePop:
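A minimal sketch, wrapping the init_host_data call from the example (nvtxRangePushA takes the range name as an ASCII string, and nvtxRangePop closes the innermost open range on the current thread):

```cpp
#include "nvToolsExt.h"  // link with -lnvToolsExt

// ...
nvtxRangePushA("init_host_data");  // open a named range on this thread
init_host_data(n, x);
nvtxRangePop();                    // close the innermost open range
// ...
```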

All ranges created with nvtxRangePushA will be colored green. In an application that defines many ranges, this single default color quickly becomes confusing, so NVTX offers the nvtxRangePushEx function, which lets you customize the color and appearance of a range in the timeline. For convenience in my own applications, I use the following macros to insert calls to nvtxRangePushEx and nvtxRangePop.
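One version of these macros looks like this (the particular color values are my own choice, and the whole block is guarded so that it compiles away in builds without NVTX):

```cpp
#ifdef USE_NVTX
#include "nvToolsExt.h"

const uint32_t colors[] = { 0xff00ff00, 0xff0000ff, 0xffffff00,
                            0xffff00ff, 0xff00ffff, 0xffff0000,
                            0xffffffff };
const int num_colors = sizeof(colors) / sizeof(uint32_t);

// Open a range named `name`, colored by `cid` (modulo the color table).
#define PUSH_RANGE(name, cid) { \
    int color_id = cid; \
    color_id = color_id % num_colors; \
    nvtxEventAttributes_t eventAttrib = {0}; \
    eventAttrib.version = NVTX_VERSION; \
    eventAttrib.size = NVTX_EVENT_ATTRIB_STRUCT_SIZE; \
    eventAttrib.colorType = NVTX_COLOR_ARGB; \
    eventAttrib.color = colors[color_id]; \
    eventAttrib.messageType = NVTX_MESSAGE_TYPE_ASCII; \
    eventAttrib.message.ascii = name; \
    nvtxRangePushEx(&eventAttrib); \
}
#define POP_RANGE nvtxRangePop();
#else
// Expand to nothing when USE_NVTX is not defined.
#define PUSH_RANGE(name, cid)
#define POP_RANGE
#endif
```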

To eliminate profiling overhead during production runs and remove the dependency on NVTX for release builds, the macros PUSH_RANGE and POP_RANGE only have an effect if USE_NVTX is defined. The macros can be used as in the following excerpt.
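For instance, to wrap the init_host_data call from the earlier example:

```cpp
PUSH_RANGE("init_host_data", 0)  // range name and color index
init_host_data(n, x);
POP_RANGE
```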

Using compiler instrumentation with NVTX

Instrumenting all functions in a large-scale application using the manual approach described above is not feasible. Fortunately, most compilers offer a way to automatically instrument the functions in your application or library: when compiler instrumentation is enabled, the compiler generates calls to user-defined functions at the beginning and end of each function. I’ll demonstrate how this works with GCC and NVTX. You can find more details about GCC function instrumentation in this post.

To use compiler instrumentation with GCC you need to do two things:

Compile your objects with -finstrument-functions

Provide definitions of the functions void __cyg_profile_func_enter(void *this_fn, void *call_site) and void __cyg_profile_func_exit(void *this_fn, void *call_site) at link time; these functions are called at each function’s entrance and exit, respectively.
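A build following these two steps might look like the following (the file names, including nvtx_hooks.cpp for the file defining the hooks, are placeholders):

```shell
# Step 1: instrument every function in the application's own objects
g++ -c -finstrument-functions main.cpp -o main.o

# The hooks themselves should not be instrumented, so compile them normally
g++ -c nvtx_hooks.cpp -o nvtx_hooks.o

# Step 2: link the hook definitions in together with the NVTX library
g++ main.o nvtx_hooks.o -o app -lnvToolsExt
```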

GCC also provides the attribute __attribute__((no_instrument_function)), which you can use to disable instrumentation of specific functions. This attribute must be present on the declaration of __cyg_profile_func_enter and __cyg_profile_func_exit to avoid endless recursion. The following code shows a simple implementation of __cyg_profile_func_enter and __cyg_profile_func_exit to generate NVTX timeline ranges.
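A minimal sketch of the two hooks (the range name is just a placeholder at this stage):

```cpp
#include "nvToolsExt.h"

extern "C" {

// The attribute prevents GCC from instrumenting the hooks themselves,
// which would otherwise recurse endlessly.
void __cyg_profile_func_enter(void* this_fn, void* call_site)
    __attribute__((no_instrument_function));
void __cyg_profile_func_exit(void* this_fn, void* call_site)
    __attribute__((no_instrument_function));

// Called on entry to every instrumented function.
void __cyg_profile_func_enter(void* this_fn, void* call_site) {
    nvtxRangePushA("instrumented function");
}

// Called on exit from every instrumented function.
void __cyg_profile_func_exit(void* this_fn, void* call_site) {
    nvtxRangePop();
}

}  // extern "C"
```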

This implementation does not provide the function name in the timeline, so it is not very useful. One way to obtain the name of the called function is to compile with -fPIC and call dladdr on the this_fn pointer passed to __cyg_profile_func_enter. To obtain function names we need to add some code like the following.
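A sketch of the enter hook extended with dladdr (which requires <dlfcn.h> and linking with -ldl; dladdr can only report a name when the symbol is resolvable, hence the fallback branch):

```cpp
#include <dlfcn.h>
#include "nvToolsExt.h"

extern "C" void __cyg_profile_func_enter(void* this_fn, void* call_site)
    __attribute__((no_instrument_function));

extern "C" void __cyg_profile_func_enter(void* this_fn, void* call_site) {
    Dl_info this_fn_info;
    // Look up the symbol containing the address of the entered function.
    if (dladdr(this_fn, &this_fn_info) && this_fn_info.dli_sname) {
        // For C++ this is still a mangled name; see below.
        nvtxRangePushA(this_fn_info.dli_sname);
    } else {
        nvtxRangePushA("unknown");
    }
}
```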

To support function overloading, C++ compilers use “name mangling” to encode function argument types into the names of generated functions, producing somewhat unreadable name strings. We therefore need to call __cxa_demangle on the string this_fn_info.dli_sname returned by dladdr to get back the original human-readable function declaration.

A complete example using compiler instrumentation and coloring is available in the Parallel Forall GitHub repository. Using compiler instrumentation with the NVIDIA Visual Profiler gives us a timeline that also includes some function calls automatically generated by nvcc, as in the following screenshot.

You should be aware that using compiler instrumentation for profiling can have a significant impact on the performance of your application. Besides the overhead of the additional calls to __cyg_profile_func_enter, __cyg_profile_func_exit, and NVTX, many compilers (GCC among them) disable function inlining when instrumentation is enabled. This is especially important for C++ code, because many STL containers, among other things, rely on inlining for good performance. For that reason you should always compare the run time of a non-instrumented build with that of an instrumented build; if the difference is larger than about 10%, you need to reduce the profiling overhead to get reliable results. With GCC you can use the options -finstrument-functions-exclude-function-list or -finstrument-functions-exclude-file-list to exclude frequently called functions with short run times and so decrease the profiling overhead.
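For example (the substrings below are illustrative; both options match substrings of the function name and of the source file name, respectively):

```shell
# Skip instrumentation for symbols whose name contains "operator"
# and for anything compiled from the C++ standard library headers.
g++ -c -finstrument-functions \
    -finstrument-functions-exclude-function-list=operator \
    -finstrument-functions-exclude-file-list=/usr/include/c++ \
    main.cpp -o main.o
```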

Other Tools

The approach described here can be extended to collect GPU and CPU hardware counters (using CUPTI and PAPI) and to support MPI applications. A tool with built-in support for MPI, OpenMP, hardware counters, and CUDA is Score-P, which can be used to profile and trace applications in order to find bottlenecks. You can use Vampir to visualize Score-P traces in a way similar to the NVIDIA Visual Profiler.

About Jiri Kraus

Jiri Kraus is a developer in NVIDIA's European Developer Technology team. As a consultant for GPU HPC applications at the NVIDIA Jülich Applications Lab, Jiri collaborates with local developers and scientists at the Jülich Supercomputing Centre and the Forschungszentrum Jülich. Before joining NVIDIA Jiri worked on the parallelization and optimization of scientific and technical applications for clusters of multicore CPUs and GPUs at Fraunhofer SCAI in St. Augustin. He holds a Diploma in Mathematics from the University of Cologne, Germany.