Managed Memory: Belated Comments on Implementation

Managed memory is a new-ish CUDA feature that aspires to do away with the need to explicitly copy CPU memory to and from GPU memory. Introduced in CUDA 6.0, its initial implementation was unusably slow. (For example, copying managed memory from GPU to CPU memory ran at 512MB/s, 25x slower than an asynchronous memcpy.)

When they undertook to build the managed memory feature, NVIDIA had many different implementation strategies they could have pursued. As best I can tell, here is a summary of their implementation strategy:

Upon allocation of managed memory, the CUDA driver allocates device memory, plus a pageable range of CPU memory at the same virtual address range.

The CUDA driver use dirty page bits to track which 4K pages were “touched” by the CPU.

Upon kernel launch, the CUDA driver would unmap the managed memory from the CPU and copy the dirty pages from the CPU to the GPU. Unmapping the CPU memory removes the risk of write-after-read hazards from the CPU corrupting managed memory before the GPU was able to copy it.

While CUDA kernels run, the device memory copy of the managed memory is the only valid one.

Upon CPU/GPU synchronization, the CPU buffer is made accessible again, but is not copied wholesale from GPU memory. It is possible the GPU’s hardware does not have the same dirty bit tracking facilities as the CPU, or perhaps NVIDIA just thought it would be preferable to copy device memory back to the CPU “on demand.”

Copying managed device memory back to host memory is prompted by page faults: when the CPU attempts to access a page of managed memory, the CUDA driver handles the page fault by copying the 4K of GPU memory to CPU memory.

The application I used to investigate NVIDIA’s managed memory implementation is only about 60 lines of code. The key component is a function usPerLaunch that allocates a specified amount of managed memory, launches a NULL kernel, synchronizes with the GPU, then optionally “touches” the managed memory to force the CUDA driver to copy it back to host memory. (In an earlier version of this test, I confirmed that CUDA lazily copies only “dirty” pages in the other direction, as NVIDIA claims in its documentation.)

I ran this program on a Haswell-based Windows 7 machine on two NVIDIA GPU boards: the NVIDIA GeForce GTX 970 and Titan X (GM200 and GP100, respectively). Although both are large “win” chips, I would expect similar test results to hold true across all Maxwell and Pascal GPUs, since they seem to have implemented a hardware interface that improved managed memory performance.

μs

Launch time (ms)

Memory (KB)

Bandwidth (MB/s)

47

0

–

105

4

39

104

8

78

115

16

143

134

32

244

213

64

307

381

128

344

649

256

404

1247

512

420

2221

1024

472

4712

2048

445

8458

4096

496

17041

8192

492

33992

16384

494

Table 1. GM200 results.

Launch time (ms)

Memory (KB)

Bandwidth (MB/s)

39

0

0

47.15

4

7

49.86

8

164

57.84

16

283

59.04

32

555

64.73

64

1012

79.08

128

1657

98.41

256

2664

137.15

512

3823

205.56

1024

5101

391.91

2048

5351

745.81

4096

5624

1543.91

8192

5433

3114.83

16384

5386

Table 2. GP100 launch results.

“Better,” however, does not mean “good.” The most important thing to note is that these kernel launch times are VERY SLOW. You can measure synchronous and asynchronous kernel launch times with the nullKernelSync.cu and nullKernelAsync.cu programs in the same directory. On this machine, those times are 46.35 and 3.25 microseconds, respectively. (In fairness, results likely would be better under Linux, especially the synchronous kernel launch. On Windows 7, launching a CUDA kernel always requires the driver to have the operating system do a user-kernel transition or “kernel thunk.” Sadly, no amount of editing can get around the sad fact that CUDA kernels and OS kernels are completely different things and some sentences must refer to both!)

On the Maxwell machine, whatever mechanism NVIDIA is using to copy managed memory back from the GPU has a maximum performance of less than 500MB/s. That’s a nonstarter. It is more than 25x slower than the bus bandwidth. Pascal has improved things, but is still less than half the performance of a PCI Express 3.0 link. A CUDA kernel reporting results via mapped pinned memory would achieve much higher performance.

Superficially, NVIDIA’s implementation makes sense, assuming there is one CPU and one GPU and that the application isn’t doing any fancy tricks with CPU/GPU concurrency. The main mistake in their implementation was failing to speculatively copy extra pages back from the GPU to the CPU in Step 6, an oversight that seems to have been remedied in subsequent releases. The overhead of servicing the page fault is so high that it’s dominated by interrupt handling, not copying of a 4K page, so it makes sense to copy more pages on the page fault until the overhead of the additional copying becomes non-negligible.

Less clear, however, is the optimal behavior of managed memory in a system with multiple GPUs. Does a managed memory buffer get allocated for each GPU? When a kernel is launched on GPU 0, do the other GPUs get copies of the managed memory? Which memory ranges are valid for which GPUs as kernels are executing? And it seems clear that managed memory can’t possibly retain the property that the “owning” device can be inferred from a UVA address, by e.g. calling cudaPointerGetAttributes().

The paradigm also breaks for applications that perform memory copies and kernel processing concurrently.

I submit that the APIs needed to “enlighten” the managed memory subsystem to do the right thing, are at least as complicated as simply writing the CUDA code to explicitly allocate and copy memory.