Twilight of the GPU: an epic interview with Tim Sweeney

I recently sat down with Epic Games co-founder and graphics guru Tim Sweeney …

Challenges ahead

JS: So it sounds like right now is a really good time to be a graphics programmer.

TS: Yeah, it's a good time, and it's also a very challenging time. Whereas previously, you had the DirectX API, and some fairly well-documented techniques for taking advantage of it, now, you get a C compiler and a blank text-editing window, and you have infinitely many possibilities. So it takes much more effort to choose the optimal approach. I think we'll see a lot of developers heading off in a lot of different directions, which is great for the industry, but it's a lot more of an open problem for everybody than it has been in the past. You're not handed the toolkit of predefined solutions to everything this time around.

JS: What do you think about the next generation of consoles? It sounds like, instead of the standard CPU plus GPU configuration, we may just have many-core CPUs... or, sorry, not many-core general-purpose CPUs, because that would be silly, but something that's like... never mind, I can't think of how to put this question.

TS: No, I see exactly where you're heading. In the next console generation you could have consoles built around a single non-commodity chip. It could be a general processor, whether it evolved from a past CPU architecture or GPU architecture, and it could potentially run everything—the graphics, the AI, sound, and all these systems—in an entirely homogeneous manner. That's a very interesting prospect, because it could dramatically simplify the toolset and the processes for creating software.

Right now, in the course of shipping Unreal 3, we have to use multiple programming languages. We use one programming language for writing pixel shaders, another for writing gameplay code, and then on PlayStation 3 we use yet another compiler to write code to run on the Cell processor. So the PlayStation 3 ends up being a particular challenge, because there you have three completely different processors from different vendors with different instruction sets and different compilers and different performance techniques. So, a lot of the complexity is unnecessary and makes load-balancing more difficult.

When you have, for example, three different chips with different programming capabilities, you often have two of those chips sitting idle for much of the time, while the other is maxed out. But if the architecture is completely uniform, then you can run any task on any part of the chip at any time, and get the best performance tradeoff that way.

JS: So do you think Sony has learned its lesson there?

TS: I think Sony increasingly recognizes the value of a platform that's easy to develop for—excellent tools, excellent infrastructure. Sony has made a lot of strides there in the past couple of years.

JS: Now I have to ask the obligatory performance question, but I don't want to ask, "which is going to be faster, Intel or NVIDIA," because it sounds like it's going to depend heavily on the software. In other words, [the performance question] is a different question than it is when you have a fixed-function GPU and everybody's doing things the same way, vs. asking the performance question when things are fully programmable. Does this make sense?

TS: Do you mean, what are the factors that determine performance of these new processors?

JS: Yeah.

TS: These processors don't entirely exist yet, so it's a challenge to say even what aspects of performance are relevant. So you have a certain number of cores running at a certain clock rate, and each core has a vector instruction set of some sort with a certain width—Intel has said that their width is 16, and NVIDIA's public presentations indicate their GPUs run with 16- to 32-wide pixel pipelines or vectors. So there are all those parameters, and for code that's perfectly parallel, performance can potentially be determined by the number of cores times clock rate times vector width. That's powerful scaling; that gets you up into the Teraflop range with currently-manufacturable chips.
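To make that scaling arithmetic concrete, here is a minimal sketch. The chip parameters are purely illustrative (they are not specs of any actual Intel or NVIDIA part), but they show how cores times clock rate times vector width reaches the Teraflop range:

```python
# Hypothetical chip parameters -- illustrative only, not vendor specs.
cores = 32          # number of cores
clock_hz = 2.0e9    # 2 GHz clock rate
vector_width = 16   # vector lanes per core (the width Intel cited)

# Peak throughput for perfectly parallel code:
# cores x clock rate x vector width, one FLOP per lane per cycle.
peak_flops = cores * clock_hz * vector_width

print(f"{peak_flops / 1e12:.3f} TFLOPs")  # 1.024 TFLOPs
```

Doubling any one of the three factors doubles the peak, which is why all three show up as headline parameters for these designs.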

And then on top of that, you have a lot of ancillary issues that could be significant but are hard to analyze without having a complete engine written for a next-gen processor like that. One is the tradeoff between the cache and memory systems. How much memory bandwidth do you have? DRAM buses are going to be fundamentally constrained, so you might be able to get 100 or 200 GB/s of bandwidth, but will that be enough to power several Teraflops of computing power?

Typically you've wanted about 1 byte of bandwidth for every FLOP; the further away you get from that, the more likely you are to be bottlenecked by memory. So clearly the way to achieve that is with some amount of cache on these chips. Caches can provide far more bandwidth and lower latency for memory accesses. So how much cache do you have, how is it organized, what's the latency, what's the cache architecture—those are the key dimensions of performance.
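Plugging in the figures from the discussion shows how far DRAM alone falls short of that rule of thumb (the numbers are the illustrative ones above, not measurements):

```python
# Illustrative figures from the discussion: ~200 GB/s of DRAM
# bandwidth feeding ~1 TFLOP of compute.
bandwidth_bytes_per_s = 200e9   # 200 GB/s
peak_flops = 1.0e12             # 1 TFLOP

bytes_per_flop = bandwidth_bytes_per_s / peak_flops
print(f"{bytes_per_flop:.2f} bytes/FLOP")  # 0.20 bytes/FLOP

# The rule of thumb wants ~1 byte/FLOP, so DRAM alone is short by
# about 5x; on-chip caches have to supply the remaining bandwidth.
shortfall = 1.0 / bytes_per_flop
print(f"caches must cover a {shortfall:.0f}x gap")
```

That 5x gap is exactly why cache size, organization, and latency become first-order performance parameters rather than implementation details.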

And then there's usability—the programmer's point of view. How hard is it to write code for this architecture? Code that we can write in 10 lines and run on a single-threaded processor today—how many lines do we need to write to have that scale up to many cores and wide vectors in the future? Is it still 10 lines? That's the ideal. If a simple application today can be written simply and scale up to Teraflop performance, that would be great. But it might be a lot worse. You might have to write 20 lines of code, or 50 lines of code, to scale up to multiple threads in the future.

That's certainly the case with multicore CPUs. To take a simple 10-line sorting algorithm and translate that to run efficiently on a lot of cores requires exploding that to maybe 50 lines of code. You have significant productivity loss as you get into these more advanced architectures, so a key aspect of programmer productivity is to be able to write simple code that runs in parallel. NVIDIA has done some really cool work with CUDA to show that GPUs can actually run code written in a C-subset language. But can we take CUDA's restricted feature set—it doesn't support recursion or function pointers—and make that into a fully C++-compatible language? There are a lot of open questions there.
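The code-size expansion Sweeney describes is easy to see in miniature. Below is a sketch (not his example; the function names and chunking scheme are my own) of a short sequential merge sort next to a parallelized version that splits the input across workers and merges the sorted runs. The parallel version is already noticeably more code, and a production version would also need chunk-size tuning, load balancing, and—in CPython—processes rather than threads to get an actual speedup past the GIL:

```python
from concurrent.futures import ThreadPoolExecutor
from heapq import merge

# The "10-line" sequential version: a plain merge sort.
def merge_sort(xs):
    if len(xs) <= 1:
        return list(xs)
    mid = len(xs) // 2
    return list(merge(merge_sort(xs[:mid]), merge_sort(xs[mid:])))

# The parallel version: split the input into chunks, sort each chunk
# on its own worker, then merge the sorted runs. Threads are used
# here only to show the structure; real parallel speedup in CPython
# would require processes, plus load balancing and chunk tuning.
def parallel_sort(xs, workers=4):
    if len(xs) <= 1:
        return list(xs)
    chunk = -(-len(xs) // workers)  # ceiling division
    parts = [xs[i:i + chunk] for i in range(0, len(xs), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        runs = list(pool.map(sorted, parts))
    return list(merge(*runs))

data = [5, 3, 8, 1, 9, 2, 7, 4, 6, 0]
assert parallel_sort(data) == sorted(data) == merge_sort(data)
```

Both functions compute the same result; the difference is purely the extra machinery the parallel structure demands, which is the productivity cost being discussed.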

But those are the key issues that need to be explored, because nobody's going to move to one of these new processors if it requires vastly higher development costs. If it costs $10 million to develop a game for current-gen, and on a next-generation chip it costs $30 million, that likely makes the whole thing uneconomical. So we need easy, simple programming models that scale to multiple threads and cores. Those are just some of the open issues there to consider.

I'd like to thank Tim Sweeney for agreeing to do this interview, and for being such a good sport about it. I accosted him after his talk at NVISION, and he kindly agreed to be interviewed on the spot.