Nicholas Wilt

Nicholas Wilt has been programming professionally for over 25 years in a variety of areas, including industrial machine vision, graphics, and low-level multimedia software. While at Microsoft, he served as the development lead for Direct3D 5.0 and 6.0, built the prototype for the Desktop Window Manager, and did early GPU computing work. At NVIDIA, he worked on CUDA from its inception, designing and often implementing most of CUDA's low-level abstractions. Now at Amazon, Mr. Wilt is working on cloud computing technologies relating to GPUs.

Author Updates

Not long after CUDA first shipped, NVIDIA CEO Jensen Huang seized on it as a key differentiating opportunity. Since CUDA was a proprietary API, after all, only NVIDIA GPUs could run CUDA apps, so “CUDA Everywhere!” became a mantra. Get developers to port their code to CUDA, the thinking went, and the improved user experience would sell more NVIDIA GPUs.
NVIDIA worked hard to persuade software developers to port their applications to CUDA: they invested in CUDA libraries, purchased Ageia and porte

This question on StackExchange was put on hold as primarily opinion-based: “…answers to this question will tend to be almost entirely based on opinions, rather than facts, references, or specific expertise.”
The content of StackExchange is usually high quality, but in this case, while the design decision was based on opinion, the answer to the question needn’t be… you just need to ask the people who know! And the inimitable talonmies, who is poised to crack 30k on StackExchange’s poin

Some time ago, I wrote this in response to a StackOverflow question, but wanted to share here on the blog.
The question basically asked how you could make floating point operations produce the same results on the CPU and the GPU, and here is an updated version of the answer:
There are many reasons why it is not realistic to expect the same results from floating point computations run on the CPU and GPU. It's much stronger than that: you can't assume that FP results will be the same when

I submitted the final manuscript last Friday, and thought I would reflect briefly on how it aligns with my original goals.
I’ve wanted to write a book on CUDA for years. Until I left NVIDIA, I was just too busy building CUDA to work on it. So when it came time to write up a proposal, I’d been thinking about the subject matter and the organization for some time.
One of the exercises that authors undertake (and that editors demand as part of a proposal) is a competitive analysis: W

The first multi-GPU implementation of N-body is done, and it's old school: using the same thread delegation code as the multithreaded CPU implementation, it uses a separate CPU thread for each GPU.

For workloads like N-body, that's probably not necessary - N-body is so GPU-bound that the CPU is more of a traffic cop than an active contributor to the computation - but in a world where 4-core CPUs are common and 16-core CPUs are available, it seems terribly wasteful to drive multiple GPUs
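A minimal host-side sketch of that delegation scheme (CUDA runtime API plus `std::thread`; the worker function and partitioning here are illustrative, not the book's actual code). The key point is that `cudaSetDevice()` binds the calling CPU thread to its GPU:

```cuda
#include <cuda_runtime.h>
#include <thread>
#include <vector>

// Hypothetical per-GPU worker: each CPU thread owns one device.
void gpuWorker(int device, const float4* hostBodies, size_t n,
               size_t first, size_t count)
{
    cudaSetDevice(device);            // bind this thread to its GPU
    float4* d_bodies = nullptr;
    cudaMalloc(&d_bodies, n * sizeof(float4));
    cudaMemcpy(d_bodies, hostBodies, n * sizeof(float4),
               cudaMemcpyHostToDevice);
    // ... launch the N-body kernel on bodies [first, first+count) ...
    cudaDeviceSynchronize();
    cudaFree(d_bodies);
}

void computeOnAllGpus(const float4* hostBodies, size_t n)
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    std::vector<std::thread> workers;
    size_t per = n / deviceCount;
    for (int d = 0; d < deviceCount; d++)
        workers.emplace_back(gpuWorker, d, hostBodies, n, d * per,
                             (d == deviceCount - 1) ? n - d * per : per);
    for (auto& t : workers)
        t.join();                     // the "traffic cop" threads just wait
}
```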

The multithreaded variant of SSE N-body is complete, and I've had the opportunity to gather some timing information.

Three variants of N-body were timed: single-threaded SSE, multithreaded SSE, and the shared memory (fastest so far) GPU formulation.

Three platforms were tested, two of them on Amazon EC2:

- cg1.4xlarge (2x Xeon 5570 "Nehalem" with 2x Tesla M2050 GPUs)
- cc2.8xlarge (2x Xeon 2670 "Sandy Bridge")
- GeForce GTX 680

If I thought operating system mattere

CUDA developers who are recent refugees from the land of CPU programming often have to learn the hard way that sensible CPU optimizations don’t always work well on GPUs. Lookup tables, for example, are to be avoided. (Okay, not the best example since it is also true on modern CPUs.) Here’s a better one: registers are so precious that it’s often better to recompute results than to store them. GPUs have brought brute force back into style!
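As an illustrative sketch (a hypothetical helper, not code from the book), recomputing a value each time it is needed is often cheaper on a GPU than caching it in precious registers or spilling it to local memory:

```cuda
// Hypothetical N-body-style helper: rather than caching per-pair
// reciprocal distances, recompute them on the fly each time. rsqrtf()
// is a single fast hardware approximation instruction, so the
// recomputation usually costs less than the registers (or local
// memory spills) a cached table would consume.
__device__ float invDistance(float4 a, float4 b)
{
    float dx = a.x - b.x;
    float dy = a.y - b.y;
    float dz = a.z - b.z;
    // small softening term keeps the result finite when a == b
    return rsqrtf(dx * dx + dy * dy + dz * dz + 1e-9f);
}
```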

The N-body code has been posted to github; it has not yet been ported to Linux, but that will not be difficult to do. It does not use graphics; it is just a Win32 console application that responds to keyboard input (ironically, other than porting the threading code, the keyboard support is the hard part of the Linux port of this app).

So far there are 8 formulations:

1) CPU_AOS (array-of-structures implementation - the gold standard to which we compare other implemen

I have updated the Web site's Sample Chapters section to point to The CUDA Handbook's page in Safari Books Online's Rough Cuts. Rough Cuts gives early access to books while they are still in progress (3-6 months from publication) and includes tools for readers to give feedback to the author while there's still time to make changes.

Several chapters are still missing, but they will be uploaded soon.

I'm still interested in reviewers who would be willing to take a look a

This post will discuss SIMD instruction sets, peak floating point performance for CPUs, programming models, and ease-of-use, as applied to the flagship N-body application for CUDA.
My involvement in SIMD instruction sets dates back to the mid-1990s, when x86 vendors were adding MMX (all vendors), SSE (Intel) and 3DNow! (AMD, Cyrix et al.) to the x86 architecture. These SIMD instruction sets reflect a trend that had been developing in CPUs for some time: that because most of the die ar

CUDA 5.0 adds "stream callbacks," a new mechanism for CPU/GPU synchronization. Previously, CPU/GPU synchronization was accomplished by calling functions like cuStreamSynchronize(), which returns when all preceding commands in the stream have been completed, or cuEventSynchronize(), which waits until the specified event has been recorded by the GPU.
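Stream callbacks invert that relationship: instead of the CPU blocking, the driver calls back into application code once the stream reaches the callback. A minimal sketch using the runtime API's cudaStreamAddCallback() (the kernel here is just a stand-in):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void busyKernel(float* p) { p[threadIdx.x] *= 2.0f; }

// Invoked by the CUDA driver once all prior work in the stream has
// completed. Note: CUDA API calls are not allowed inside a callback.
void CUDART_CB myCallback(cudaStream_t stream, cudaError_t status,
                          void* userData)
{
    printf("stream work finished, status = %d\n", (int)status);
}

int main(void)
{
    float* d = NULL;
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMalloc(&d, 256 * sizeof(float));

    busyKernel<<<1, 256, 0, stream>>>(d);
    cudaStreamAddCallback(stream, myCallback, NULL, 0);
    // The CPU is free to do other work here; the callback fires
    // asynchronously when the kernel completes.

    cudaStreamSynchronize(stream);
    cudaFree(d);
    cudaStreamDestroy(stream);
    return 0;
}
```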

I spent some time investigating how stream callbacks are implemented on Windows 7. cudaStreamAddC

A new version of the Streaming Multiprocessors chapter has been uploaded, this one with merged coverage of the math library for float and double, plus improved coverage of shared memory (especially shared memory atomics) and conditional code.

The code emitted by the compiler when performing shared memory atomics turns out to be the perfect illustration of how CUDA hardware handles conditional code. For the SM 2.0 architecture, a shared atomic add compiles to the following microcode
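The microcode itself isn't reproduced here, but the source-level construct that triggers it is simply an atomicAdd() on a __shared__ location - on SM 2.x hardware the compiler wraps the update in a lock/retry loop, which is exactly the conditional-code pattern the chapter describes. A minimal (illustrative) kernel that exercises it:

```cuda
// Illustrative histogram kernel: the atomicAdd() on the __shared__
// array is the operation that, on SM 2.x, compiles to a software
// lock/update/unlock loop in the emitted microcode.
__global__ void sharedAtomicHistogram(const unsigned char* in, size_t n,
                                      unsigned int* out)
{
    __shared__ unsigned int counts[256];
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        counts[i] = 0;
    __syncthreads();

    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += (size_t)gridDim.x * blockDim.x)
        atomicAdd(&counts[in[i]], 1);   // shared-memory atomic

    __syncthreads();
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        atomicAdd(&out[i], counts[i]);  // global-memory atomic
}
```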

I thought the architecture of the Streaming Multiprocessors, with a special focus on their instruction sets, deserved a full chapter of the book. This is the chapter that describes in detail exactly what the SMs' capabilities are, and how to access them. From integer to single- and double-precision floating point, including descriptions of the floating point formats and intrinsics that perform functions such as directed rounding... it's all here! At the end of the chapter is a reference guide

I have uploaded the sample chapter on parallel prefix sum ("scan"), an algorithm whose importance has been recognized in parallel algorithm design at least since Blelloch's work on the Connection Machine. Blelloch's excellent survey can be found here.

For CUDA, Mark Harris et al. published the first implementation, though much simpler and faster formulations have been published since. When I wrote Mark to ask him where to find the latest work on Scan, he referred me to Dua

Believe it or not, the addition of texturing support for CUDA was somewhat controversial when it was first proposed. The hardware support was there, but some felt it was a "graphics feature" that would detract from the product's primary goal of winning general-purpose clock cycles away from CPUs. Once we were able to do performance analysis on actual hardware, though, some of the reasons that we suspected texturing wou

My chapter on normalized correlation is now available. This chapter features use of texture, constant memory and shared memory to accelerate a popular template-matching algorithm used in image processing and machine vision. From statistics computed between an image location and the corresponding pixels of a "template image," the algorithm generates a value in the range [-1.0, 1.0], where 1.0 is a perfect match.

My editor at Pearson, the inimitable Peter Gordon, agreed to allow me to "open source" the code that was to accompany The CUDA Handbook. I think we both figured that if the code was useful, it would be a good way to promote the book.

That left me with an interesting problem: which of the fifty dozen open source licenses should I use? For my purposes, the more permissive, the better; by that metric, the best "license" I have seen is for the code accompanying Warre