CUDA Pro Tip: Understand Fat Binaries and JIT Caching

As NVIDIA GPUs evolve to support new features, the instruction set architecture naturally changes. Because applications must run on multiple generations of GPUs, the NVIDIA compiler tool chain supports compiling for multiple architectures in the same application executable or library. CUDA also relies on the PTX virtual GPU ISA to provide forward compatibility, so that already deployed applications can run on future GPU architectures. In this post I will give you a basic understanding of CUDA “fat binaries” and compilation for multiple GPU architectures, as well as just-in-time PTX compilation for forward compatibility.

nvcc, the CUDA compiler driver, uses a two-stage compilation model. The first stage compiles source device code to PTX virtual assembly, and the second stage compiles the PTX to binary code for the target architecture. The CUDA driver can execute the second stage at run time, compiling the PTX virtual assembly “Just In Time” (JIT) to run it. This JIT compilation can delay application start-up (or, more accurately, CUDA context creation). CUDA uses two approaches to mitigate the start-up overhead of JIT compilation: fat binaries and JIT caching.

Fat Binaries

The first approach is to completely avoid the JIT cost by including binary code for one or more architectures in the application binary along with PTX code. The CUDA run time looks for code for the present GPU architecture in the binary, and runs it if found. If binary code is not found but PTX is available, then the driver compiles the PTX code. In this way deployed CUDA applications can support new GPUs when they come out.
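One way to see which of these cases applies to a given build is to list what its fat binary actually contains. The sketch below assumes the cuobjdump utility from the CUDA toolkit is on your PATH; the function name list_device_code and the file name in the usage comment are hypothetical.

```shell
# List the device code embedded in a CUDA executable or library.
# Assumption: cuobjdump (shipped with the CUDA toolkit) is on PATH.
list_device_code() {
    if ! command -v cuobjdump >/dev/null 2>&1; then
        echo "cuobjdump not found on PATH" >&2
        return 127
    fi
    echo "Binary images (one per sm_XX target):"
    cuobjdump --list-elf "$1"
    echo "PTX images (one per compute_XX target):"
    cuobjdump --list-ptx "$1"
}

# Usage (hypothetical file name):
#   list_device_code ./my_app
```

If the binary list has no entry for the architecture of the GPU you run on, the driver has to fall back to JIT compiling one of the PTX images.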

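nvcc enables compilation for multiple architectures using the -arch and -code command-line options: -arch names the virtual (PTX) architecture that the first stage targets, and -code lists the code to embed in the fat binary. For example, this command generates binary code for two variants of the Tesla architecture (sm_10 and sm_13), plus PTX code (compute_10) for use on next-generation GPUs.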

nvcc x.cu -arch=compute_10 -code=compute_10,sm_10,sm_13

nvcc organizes device code into “fat binaries”, which are able to hold multiple translations of the same GPU source code. At run time, the CUDA driver selects the most appropriate translation when it launches the device function. For full details of using nvcc to generate code for multiple architectures and PTX versions, see the document “NVIDIA CUDA Compiler Driver NVCC”.
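With the -gencode option, which nvcc also accepts, each virtual/real architecture pair can be spelled out separately. Here is a minimal sketch that assembles such a command line. The SM list (35 and 50), the output name, and the choice to embed PTX only for the newest target are illustrative assumptions; the script only prints the command rather than running it, and the targets must be ones your toolkit version supports.

```shell
# Hypothetical target list: embed binary code for each of these SM versions.
SM_LIST="35 50"
# Hypothetical choice: also embed PTX for the newest listed architecture,
# so that future GPUs can JIT compile it.
PTX_ARCH="50"

GENCODE_FLAGS=""
for sm in $SM_LIST; do
    GENCODE_FLAGS="$GENCODE_FLAGS -gencode arch=compute_${sm},code=sm_${sm}"
done
GENCODE_FLAGS="$GENCODE_FLAGS -gencode arch=compute_${PTX_ARCH},code=compute_${PTX_ARCH}"
GENCODE_FLAGS="${GENCODE_FLAGS# }"   # drop the leading space

# Dry run: show the command instead of invoking the compiler.
echo "nvcc x.cu -o x $GENCODE_FLAGS"
```

Each code=sm_XX entry adds a binary image that runs without JIT on that architecture, while the code=compute_XX entry adds the PTX that provides forward compatibility.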

Update (05/08/2014): Starting in CUDA 5.5, we can also JIT link separately compiled code from PTX stored in the fat binary.

JIT Caching

The second approach to mitigate JIT overhead is to cache the binaries generated by JIT compilation. When the device driver just-in-time compiles PTX code for an application, it automatically caches a copy of the generated binary code to avoid repeating the compilation in later invocations of the application. The cache—referred to as the compute cache—is automatically invalidated when the device driver is upgraded, so that applications can benefit from improvements in the just-in-time compiler built into the device driver.
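To check whether the compute cache is being populated, you can look at its size on disk. This sketch assumes a Linux system where the cache lives in ~/.nv/ComputeCache unless CUDA_CACHE_PATH says otherwise; the variable name cache_dir is just for illustration.

```shell
# Assumption: default Linux cache location, overridden by CUDA_CACHE_PATH if set.
cache_dir="${CUDA_CACHE_PATH:-$HOME/.nv/ComputeCache}"

if [ -d "$cache_dir" ]; then
    # Total size of the cached JIT-compiled binaries.
    du -sh "$cache_dir"
else
    echo "no compute cache found at $cache_dir"
fi
```

A cache that stays empty, or never grows after running an application that is JIT compiled, suggests that caching is disabled or that the generated code does not fit.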

Environment variables are available to control just-in-time compilation.

Setting CUDA_CACHE_DISABLE to 1 disables caching (no binary code is added to or retrieved from the cache).

CUDA_CACHE_MAXSIZE specifies the size of the compute cache in bytes. The default size is 32 MB and the maximum size is 4 GB. Binary codes whose size exceeds the cache size are not cached, and older binary codes are evicted from the cache to make room for newer ones if needed.

Setting CUDA_FORCE_PTX_JIT to 1 forces the device driver to ignore any binary code embedded in an application (see Application Compatibility) and to just-in-time compile embedded PTX code instead. If a kernel does not have embedded PTX code, it will fail to load. You can use this environment variable to confirm that an application binary contains PTX code and that just-in-time compilation works as expected to guarantee forward compatibility with future architectures.
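As a sketch of that check, the wrapper below runs a command with CUDA_FORCE_PTX_JIT set for that one process only. It also sets CUDA_CACHE_DISABLE so that a previously cached binary cannot stand in for a fresh compilation, which is an assumption about what you want to test. The function name check_ptx_jit and the application name in the usage comment are hypothetical.

```shell
# Run a command with the driver forced to JIT compile embedded PTX.
# The variables are set via env, so they apply only to that command
# and do not leak into the calling shell.
check_ptx_jit() {
    env CUDA_FORCE_PTX_JIT=1 CUDA_CACHE_DISABLE=1 "$@"
}

# Usage (hypothetical application name):
#   check_ptx_jit ./my_app
```

If the application's kernels fail to load under this wrapper but work without it, the binary is probably missing PTX for some of its kernels.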

Potential Problems

It is helpful to know the above options so you can recognize and avoid problems. Let’s look at two example situations: an insufficient JIT cache size and a cache stored on a slow network share.

Insufficient JIT Cache Size

Recently I was testing an application that uses the CUDA Data Parallel Primitives library (CUDPP), which is a large library with many CUDA kernels. I had compiled CUDPP using the default settings, which generated binary code for GPUs with SM versions 1.0, 1.3, and 2.0, as well as PTX. Because I was running on a Tesla K20c with SM version 3.5, all the kernels in the library were JIT compiled, taking about 75 seconds at application start-up. Moreover, the large number of kernels required well over the default JIT cache size of 32 MB, so they were not cached, and the application incurred the full JIT cost at every invocation. Because I had the source to the library, I was able to recompile it with support for sm_35, but I could also have increased the value of CUDA_CACHE_MAXSIZE to make sure the code fit in the cache.
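Because CUDA_CACHE_MAXSIZE is given in bytes, it is easy to get the number wrong by hand. A small sketch that computes it from a size in megabytes; the 1024 MB target is an arbitrary example, not a recommendation, and CACHE_MB is a hypothetical helper variable.

```shell
# Hypothetical target: a 1 GB compute cache, expressed in megabytes.
CACHE_MB=1024

# CUDA_CACHE_MAXSIZE takes a size in bytes.
CUDA_CACHE_MAXSIZE=$((CACHE_MB * 1024 * 1024))
export CUDA_CACHE_MAXSIZE

echo "CUDA_CACHE_MAXSIZE=$CUDA_CACHE_MAXSIZE"
```

The variable must be set in the environment of the process that creates the CUDA context, for example in the shell or job script that launches the application.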

Cache Stored on a Slow Network Share

On Linux, the default location of the CUDA JIT cache is in your home directory. On clusters, it is not uncommon for home directories to be mounted on the compute nodes over a relatively slow connection (for example, using the Lustre file system for scratch space but only NFS for the home directory). We have seen cases where this relatively slow connection to the home directory (and thus the JIT cache) resulted in very long application start-up times when the application was not built with code for the right SM version. Even more confusing, start-up time can vary from node to node due to intricacies of the NFS setup.

In this situation, it is best to build the application with binary code for the target GPUs so that it avoids JIT entirely or, alternatively, to set CUDA_CACHE_PATH to point to a location on a fast file system.
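A minimal sketch of the second option, suitable for a job script. It assumes that $TMPDIR (or /tmp as a fallback) is on a fast, node-local file system, which you should verify for your cluster; the directory name cuda-cache-<user> is a hypothetical choice.

```shell
# Assumption: $TMPDIR (or /tmp) is node-local and fast on this cluster.
# A per-user directory avoids collisions on shared nodes.
CUDA_CACHE_PATH="${TMPDIR:-/tmp}/cuda-cache-$(id -un)"
mkdir -p "$CUDA_CACHE_PATH"
export CUDA_CACHE_PATH

echo "JIT cache directory: $CUDA_CACHE_PATH"
```

Keep in mind that a node-local cache is populated separately on each node, and is lost if the directory is cleaned between jobs, so the first run on each node still pays the JIT cost.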

More Information

For more information on the CUDA compilation flow, fat binaries, architecture and PTX versions, and JIT caching, see the CUDA programming guide section on “Compilation with NVCC” and the NVCC documentation.

About Mark Harris

Mark is Chief Technologist for GPU Computing Software at NVIDIA. Mark has fifteen years of experience developing software for GPUs, ranging from graphics and games, to physically-based simulation, to parallel algorithms and high-performance computing. Mark has been using GPUs for general-purpose computing since before they even supported floating point arithmetic. While a Ph.D. student at UNC he recognized this nascent trend and coined a name for it: GPGPU (General-Purpose computing on Graphics Processing Units), and started GPGPU.org to provide a forum for those working in the field to share and discuss their work.