Warp Synchrony and The First Law of CUDA Development

One of the most overlooked developments of GTC2017 was that NVIDIA’s Architecture Team has finally Had It Up To Here with developers who write warp synchronous code. As you may know, warp synchronous code relies on the way CUDA hardware executes 32-thread warps in lockstep. The CUDA Handbook contains some examples of warp synchronous code. In the reduction chapter, for example, warp synchronous code is used to optimize performance of the last 5 iterations of this loop that accumulates partial sums in shared memory:

Notice that every iteration of the loop is accompanied by a call to __syncthreads(), the intrinsic that serves as block synchronization primitive and memory barrier. The unrolled, warp synchronous implementation of the last 5 iterations looks like this:

The volatile keyword represents NVIDIA’s grudging acceptance of warp synchronous code. Historically, volatile is a keyword that hints to the compiler not to optimize out memory traffic through the associated pointer. The classic application is for device drivers for hardware with memory-mapped hardware registers, where reads and writes to “memory” are used to program the hardware. But volatile doesn’t give the compiler enough information; although it inhibits optimizations such as reusing registers or conserving memory writes, it’s not expressive enough to capture the synchronization semantics required when threads within a warp can diverge.

As a result, with Volta’s improved support for divergent code execution, NVIDIA is giving up on the volatile keyword workaround and deprecating all warp-level primitives. Instead, developers are encouraged to use new intrinsics with “_sync” appended. So instead of calling any(), the function that returns True if the input predicate expression is true for any of the 32 threads in the warp, we are to call any_sync(). The new function may be invoked on older hardware, and I suspect they are synonyms for the older functions; but on Volta, it likely will enforce semantics that converge execution across the warp.

After listening to the presentation at GTC, I sought out an NVIDIAn and told them that CUDA developers have always known that warp synchronous coding wasn’t strictly correct. NVIDIA has been finger-waggling at CUDA developers who write warp synchronous code for years! To gain some insight into why developers do it anyway, we turn our attention to a completely unscientific survey of developers where they were asked why they write CUDA code: Figure 1. Motivations for CUDA Development

I call this the First Law of CUDA Development: Performance is CUDA’s raison d’être. No one writes CUDA code for fun. Every CUDA user is trying to get a return on investment in the form of higher application performance. The reason developers write warp synchronous code even though it’s the “wrong” thing to do is because it is faster. Put another way, sprinkling __syncthreads() calls that turn out to be superfluous is… well… slower. (A subtler implication is that if the behavior does not change, it is harder for developers to tell which __syncthreads() calls are superfluous). Developers always want to do the right thing, I told the NVIDIAn; but ultimately, if you want developers doing the right thing, you have to make the right thing also be the fastest thing.

During the course of the conversation, the NVIDIAn defended the idea that they should break warp synchronous code in the future: “If I warn you to look both ways before you cross the road, don’t blame me if you get hit by a car.” I told him: “If that is your position, it’s your responsibility to make sure that developers who don’t look both ways ALWAYS get hit by a car.”