Managed Memory and Segmentation

At the GPU Technology Conference this year, I ran into an old colleague from NVIDIA and the topic of managed memory came up. He related that earlier in the conference, the Q&A session after a Dell-sponsored presentation on managed memory had suffered a serious decline in the level of discourse. Neither of us had been in attendance, but apparently one questioner after another stood up and asked the presenters, in effect, “Where are you going with this?”

He made it sound like the presenters thought they were going to be pelted with rotten fruit!

Why the controversy? Managed memory is supposed to make CUDA programming simpler. It is intended to eliminate the need to copy data back and forth between CPU and GPU memory. If it were performance-neutral like, say, C++ lambdas, then managed memory would be a welcome addition to CUDA. The problem is that any feature that presents a risk of degrading performance will be viewed with skepticism by the CUDA programming community – because no one programs CUDA for fun.

What is it about CUDA that makes it so difficult to automatically manage residency of memory?

Reflecting on the answer to this question brought me back to a conversation I had at a different GTC, so long ago that I’m not sure which year it occurred. I had just met Daniel Moth, the Program Manager at Microsoft for C++ AMP. Once it was firmly established that we were fellow travelers, charting the technical roadmaps for competing data parallel programming environments, he had a question.

“Tell me one thing,” he asked. “Why do you need streams and events?”

I had to think for a minute. Why indeed? I’d added the feature in CUDA 1.1, to cover new hardware that could DMA host memory concurrently with kernel execution; but it was already clear that the new abstractions were future-proof to multiple kernels executing concurrently, and even coordinating execution between multiple GPUs.

“Streams are like CPU threads,” I told him stupidly, quoting from the original design document I’d written in 2007. “Operations that are done in different streams can happen concurrently. And you need events to coordinate execution between streams.”

“But we don’t need that stuff in C++ AMP. The stuff that can be done in parallel, we just do it in parallel.”

After a few minutes’ conversation, the key difference emerged and I finally had it.

“Oh,” I cried. “CUDA has a flat address space!”

C++ AMP does not.

CUDA’s address space causes more trouble than is widely appreciated. Because pointers can be stored in device memory, any CUDA kernel can attempt to access any address. In early versions of CUDA, where paging is not supported (every byte of virtual memory is backed by physical memory) and systems with multiple GPUs were rare (and certainly not for sale in the public cloud), having an address space seemed to make sense. That impression was bolstered by the hardware design community’s ideological commitment to linear address spaces, which had taken root after a divisive debate contrasting linear address spaces with segmentation.

Linear Addressing versus Segmentation

Segmentation is the idea that memory should be modeled as a set of discrete buffers with base pointers and lengths, rather than assigning an address (like a PO Box) to each byte of memory. Segmented memory is accessed via a segment/offset tuple instead of by a single address, a paradigm that is implemented at the hardware level. Intel’s x86 architecture was segmented from the beginning (c. 1976). It provided for 4 segments to be accessible at any given time: the segment registers CS, DS, SS, and ES were for code, data, stack, and “extra” data, respectively. Each segment register had a base address and a length, and most machine instructions implicitly referenced a segment that represented a sensible default. The PUSH and POP instructions that operate on the stack implicitly referenced the stack segment (SS). Loads and stores from memory implicitly used DS, the data segment, unless that default was overridden by a “segment prefix” instruction. For example, the SS: prefix could be used to operate on stack memory.

The problem with segments was that they made code difficult to compose: even simple operations like function calls were complicated by potential differences between the segment register settings needed by the caller and callee. The callee could save and restore its segment registers at the subroutine boundary, but that hurt performance. More typically, developers would select a “memory model” with fixed segmentation usage that was appropriate for their application. So-called “large” memory models would just specify a segment:offset tuple for every address; under MS-DOS, this amounted to a cheesy way to enable 20-bit addressing with 32-bit addresses, or 1M of memory with 4G worth of address width. It also hurt performance since every load and store needed a segment override.

Segmentation introduced difficult, but solvable, problems for developers of individual applications; but even 25 years ago, it was clear that plugin architectures like OLE automation would play a central role in future software development. Being able to load code and data dynamically into an application and have it “just work,” without having to worry about segments, was of paramount importance. The ability for libraries to efficiently access their callers’ data, and process it on their behalf, overrode the concerns that buggy code could corrupt data that happened to be accessible.

Segmentation and flat addressing can be reconciled by enabling large segment offsets and having the operating system map all the segments to cover the same address range. This usage was anticipated when the Intel i386 was released in 1986, and implemented in 32-bit multitasking operating systems like UNIX (or Microsoft’s long-lost Xenix), and later, OS/2 and Windows NT. This paradigm was so popular, and the need for segmentation support in hardware so unclear, that AMD mostly did away with segment registers when they revised x86 to enable 64-bit addressing in the early aughts.

When I wrote the specification for CUDA textures, with a clear separation between memory and views on the memory (CUDA arrays and texture/surface references, respectively), it quickly became clear that CUDA arrays were effectively segmentation. A CUDA kernel can’t access just any CUDA array; the CUDA driver must predeclare the CUDA arrays to be accessed by a kernel. Coupled with other per-launch parameters, such as the amount of shared memory and the number of registers needed, a CUDA kernel launch more closely resembles a container launch than a subroutine call.

A key reason segmentation was an abject failure for general-purpose computer architectures was the high cost of “switching segments” on a per-instruction basis. On x86, instructions such as LDS (load data segment) were costly; instruction prefixes to change the segments being operated on by a given instruction added complexity; and naïve systems that kept segment:offset tuples for all pointers essentially wasted addressing bits. Now that we have 64-bit addressing, it is possible to envision having page tables play the role of segments (by introducing a byte-granular limit to page table size), as argued in this blog post. For now, however, there is a decisive consensus in favor of flat address spaces.

What does all this have to do with managed memory?

By implementing segmentation on a per-kernel basis instead of a per-machine-instruction basis, GPU computing technologies get many of the benefits of segmentation, without the costs that hindered adoption on the CPU side. Kernels may take slightly longer to launch than they would otherwise, but the cost of a kernel launch is high enough that the additional cost of segmentation is negligible. And if each kernel launch predeclares the needed segments, the system can infer residency requirements, ensure coherency, and identify parallelism opportunities, much in the same manner that superscalar CPUs use real-time dependency analysis to identify which instructions can execute in parallel.

What do you mean by “infer residency requirements,” you ask? You guessed it: managed memory!

What do you mean by “identify parallelism opportunities,” you ask? You guessed it: automatic CUDA streams!

What about coherency? Not much would change here. The CUDA driver already uses software mechanisms to enforce coherency, for example, by inserting cache-invalidate instructions into the command stream before launching kernels that read from texture. In a segmented memory architecture, read-only segments can be copied where they are needed, then discarded without having to worry about propagating changes to the data. Writeable segments could be copied back wholesale, or using dirty bit optimizations.

So, it is not hard to imagine a GPU computing technology that uses segmentation to manage memory rather than a linear address space. In fact, we do not have to imagine OpenACC – it’s already here – and for CUDA, programs that used only CUDA arrays would have the properties needed to automate residency and parallelization. As a side note, the WDDM display driver model introduced in Windows Vista embraced a segmented memory architecture for paging.

Let’s review some of the deficiencies in managed memory, as discussed in my previous blog. It attempts to infer residency requirements based on memory accesses – which hurts performance and breaks the First Law of CUDA Development. It breaks the useful ability to infer the “owner” (CPU or which GPU) of a given address in the Unified Virtual Address Space. The semantics of multi-engine and multi-GPU memory management are complicated, and require hinting. Even if we set aside warranted skepticism about whether the hinting will be future-proof (I have my doubts), it introduces enough complexity that managed memory does not compare favorably to the static, affinitized allocations like CUDA 1.0 or segment-based architectures like OpenACC.