CUDA Graphs, ROI, and API Adoption

CUDA 10 adds a new API called “CUDA Graphs” that are immediately familiar to graphics API designers: they are a scene graph API for compute. Scene graph APIs enable developers to describe geometry at a “higher”™ level, in ways that express the relationships between, say, rooms and doorways within a castle or the arms and legs of a 3D character. The idea is that with this additional information, the API implementor (in this case, NVIDIA) can write code that will traverse the scene graph (say, rendering the characters with their limbs animated) more efficiently than code written by the developer. Either that, or the scene graph API is sufficiently easier to learn than learning how to write the scene graph code that developers can achieve faster time-to-market by learning and using the scene graph API.

API designers drive adoption by maximizing the return on investment, where the return is efficient, working code and the investment is developer time. APIs that are not easy to learn are disadvantaged because every developer who writes or maintains the code must invest in learning the API. APIs that don’t deliver a compelling performance advantage must be *very* easy to learn, hence conferring an expressive advantage. (i.e. faster development times.)

CUDA adoption has been driven by delivering huge performance gains (the return) despite a steep learning curve (the investment). (It makes for an interesting thought-experiment to wonder why CUDA has succeeded and other manycore platforms have not. Although this blog post does not touch on the issue, customer investments must be considered in addition to developer investments.)

An early API (in fact, it was created in the 1970s, long before the term “API” had been invented) that delivers high ROI is BLAS, the Basic Linear Algebra Subprograms. Originally written in FORTRAN, the motivations for this library were twofold: to “provide names and argument lists that might become widely used and recognized for some of the basic operations of computational linear algebra,” and “to improve efficiency of math software.” BLAS code is reasonably performance- and platform-portable. As the underlying platforms evolved, the same BLAS code benefited transparently from assembly language hand-coding to cache blocking to SIMD instruction sets. There was no need to update the API client code as the implementation changed underneath. BLAS has achieved widespread adoption in numerical code, amplifying developers’ expressive power and enabling them to leverage the development effort invested by others in its implementation. At this point, BLAS gets an inordinate amount of attention from hardware vendors, making it unlikely that developers can match its performance without exploiting a priori knowledge of their application requirements. It takes time to learn, but delivers a considerable return on that investment.

On the other end of the spectrum, an API that has high ROI by minimizing developer investment is malloc()/free(). Learning first-hand the difficulty of writing a fast, robust memory allocator has been an inflection point in many junior developers’ careers – it’s harder than it looks. Other APIs that deliver a high return with minimal investment: the thread synchronization APIs built into operating systems. They are not hard to learn and, for most developers, impossible to implement.

In the early days (DirectX 2.0-3.0), Direct3D had a scene graph API called the “retained mode,” but the last version shipped in 1996. No one was using it, despite heroic evangelism efforts by its developers. Developers could use “immediate mode” APIs to implement their own scene graphs more efficiently – both in terms of developer time and in terms of high-performance implementations of the operations they needed. As an added bonus, by writing the scene graph traversal themselves, developers kept all the IP in-house (e.g., their visibility algorithm) and, if there was a bug, they could fix it in their code on their own schedule.

Since game developers co-design their content development tools with the runtime, a great deal of intellectual property is encapsulated in the scene graph traversal. In a sense, 3D scene graph API designers were aspiring to co-opt developers’ core IP – never a winning proposition for a platform.

I suspect that CUDA developers will come to similar conclusions with the CUDA Graphs. No one will use them unless they deliver a return on investment in the form of higher performance, or greater expressiveness commensurate with the effort to learn the APIs. Higher performance will be difficult to achieve since CUDA gives developers ready access to the underlying tools used by the CUDA Graphs.

One possible opportunity for NVIDIA: perhaps CUDA Graphs will be an efficient way to enable concurrent execution of kernels that weren’t designed to run in streams? CUDA streams are like const correctness – it is difficult to retrofit code to use them because they must be plumbed into interfaces from top to bottom. An alternative to revisiting interfaces top-to-bottom is to add a “current stream” API (as CUBLAS did), but current-anything APIs interoperate poorly and tend to be inefficient at changing the current-thing. More importantly, the current-thing state must be saved and restored across interfaces.

So one path to adoption for CUDA Graphs may be an efficient way to enable concurrent execution of kernels that weren’t designed to use streams. But in general, like immediate-mode graphics APIs, most developers will be able to more quickly write their own code expressing the dependencies in their application than it would take to learn and use the CUDA Graphs APIs. And developer-authored code will run at least as fast, paying tribute to the First Law Of CUDA Development.

Unless CUDA Graphs deliver a high ROI, they will go the same way as other features that Seemed Like A Neat Idea At The Time, like dynamic parallelism and managed memory.