Valve's Source engine goes multi-core

Multithreading promises to make the next Half-Life episode even better

by Geoff Gasior, 12:00 AM on November 13, 2006

IF THE LAUNCH OF Intel's new quad-core Core 2 Extreme QX6700 processor has made one thing clear, it's that some applications are multithreaded, and others are not. Those that are can look forward to a healthy performance boost jumping to four cores, including near-linear scaling in some cases. Those that are not enjoy no such performance benefits, and may even run slower than on the fastest dual-core chips due to the slightly slower clock speeds of Intel's first quad-core offering. Unfortunately, most of today's game engines are among those applications that aren't effectively multithreaded. A handful can take advantage of additional processor cores, but not in a manner that improves performance substantially.

With the megahertz era effectively over and processor makers adding cores rather than cranking up clock speeds, game developers looking to exploit the capabilities of current hardware are faced with a daunting challenge: "one of the most important issues to be solving as a game developer right now," according to Valve Software's Gabe Newell. Valve has invested significant resources into optimizing its Source engine for multi-core systems, and doing so has opened up a whole new world of possibilities for its game designers.

You won't have to wait for Half-Life 3 to enjoy the benefits of Valve's multi-core efforts, though. Multi-core optimizations for Source will be included in the next engine update, which is due to become available via Steam before Half-Life 2: Episode 2 is released. Read on to see how Valve has implemented multithreading in its Source engine and developer tools, and how they perform on the latest dual- and quad-core processors from AMD and Intel.

Multiple approaches to multi-core

Unlike some types of applications, games strive for 100% CPU utilization to give players the best experience their hardware can provide. That's easy enough with a single processor core, but more challenging when the number of cores is multiplied by two, and especially by four. Multithreading is needed to take advantage of extra processor cores, and Valve explored several approaches before settling on a strategy for the Source engine.

Perhaps the most obvious way to take advantage of multiple cores is to distribute in-game systems, such as physics, artificial intelligence, sound, and rendering, among available processors. This coarse threading approach plays well with existing game code, which is generally single-threaded, because it essentially just involves using multiple single threads.

Game code tends to be single-threaded because games are inherently serial applications: each in-game system depends on the output of other systems. Those dependencies create problems for coarse threading, though, because games tend to become bound by the slowest system. It may be possible to spread multiple systems across a number of processor cores, but performance often doesn't scale in a linear fashion.

Valve initially experimented with coarse threading by splitting the Source engine's client and server systems between a pair of processor cores. Client-side systems included the user interface, graphics simulation, and rendering, while server systems handled AI, physics, and game logic. Unfortunately, this approach didn't yield anywhere close to a linear increase in performance. Valve found that its games spend 80% of their time rendering and only 20% simulating, resulting in an imbalance in the CPU utilization of each core. With standard single-player maps, coarse threading was only able to improve performance by about 20%. Doubling performance was possible, but only by using contrived maps designed to inflate physics and AI loads artificially.
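The ceiling on that client/server split can be estimated with some quick arithmetic. The sketch below is a back-of-the-envelope model rather than anything Valve presented: it assumes each system gets its own core and that the systems overlap perfectly, so the frame is bound by the slowest one. The function name `coarse_speedup` is made up for illustration.

```python
def coarse_speedup(stage_fractions):
    """Ideal speedup when every stage runs on its own core in perfect overlap.

    stage_fractions: share of single-threaded frame time spent in each system.
    With one core per system, the frame takes as long as the slowest system.
    """
    total = sum(stage_fractions)
    return total / max(stage_fractions)


# The article's figures: 80% rendering (client), 20% simulation (server).
print(coarse_speedup([0.8, 0.2]))

# A contrived map that balances the two halves is what it takes to double performance.
print(coarse_speedup([0.5, 0.5]))
```

Under these idealized assumptions the 80/20 split tops out at a 1.25x speedup, which is in line with the roughly 20% gain Valve reported once synchronization overhead is taken into account.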

In addition to failing to scale well, coarse threading also introduced an element of latency. Valve had to enable the networking component of the engine to keep the client and server systems synchronized, even with the single-player game. Looking forward, Valve also realized that coarse threading runs into problems when the number of cores exceeds the number of in-game systems. There are more than enough in-game systems to go around for today's dual- and quad-core processors, of course, but with Intel's 80-core "terascale" research processor hinting at things to come, coarse threading appears to have little long-term potential.

As an alternative to (and indeed the opposite of) coarse threading, Valve turned its attention to fine-grained threading. This approach breaks down problems into small, identical tasks that can be spread over multiple cores, making it considerably more complex than coarse threading. Operations executed in parallel must be completely orthogonal, and scaling gets tricky if the computational cost of each operation is variable.
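To make the idea concrete, here is a minimal sketch of fine-grained threading in Python. It is not Valve's code: `light_sample` is a hypothetical stand-in for one small, independent calculation, and the chunk size and worker count are arbitrary. Note, too, that CPython threads won't speed up CPU-bound work because of the global interpreter lock; the point here is the shape of the decomposition, not the performance.

```python
from concurrent.futures import ThreadPoolExecutor


def light_sample(x):
    """Hypothetical per-item computation standing in for one small task."""
    return x * x


def chunked(items, size):
    """Split items into consecutive chunks of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def process_chunk(chunk):
    # Each chunk reads only its own data, so chunks never depend on each other.
    return [light_sample(x) for x in chunk]


def fine_grained_map(items, workers=4, chunk_size=64):
    """Process every chunk on a thread pool and return results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(process_chunk, chunked(items, chunk_size))
        return [y for part in results for y in part]
```

Because every task is identical and self-contained, adding workers needs no changes to the tasks themselves. If the cost of `light_sample` varied widely from item to item, though, some workers would finish early and sit idle, which is the scaling problem described above.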

Interestingly, Valve has already implemented fine-grained threading in a couple of its in-house development tools. Valve uses proprietary VVIS and VRAD applications to distribute the calculation of visibility and lighting for game levels across all the systems in its Bellevue headquarters. These apps have long taken advantage of distributed computing, much like Folding@Home, but are also well suited to fine-grained threading. Valve has seen close to linear scaling after adapting the apps to take advantage of multiple cores, and has even delayed upgrading systems in its offices until it can order quad-core CPUs.

Valve's, er, valve

The Prius method of game programming

Fine-grained threading may work well when it comes to visibility and lighting calculations for game levels, but Valve decided that it wasn't the right approach for multithreading in the Source engine, in part because fine-grained threading tends to be bound by available memory bandwidth. Instead, Valve chose to implement something it calls hybrid threading, which takes an "appropriate tool for the job" approach.

With hybrid threading, Valve created a framework that allows multiple threading models depending on what's appropriate for the task at hand. In-game systems can be sent to individual cores with coarse threading, and calculations that lend themselves to parallel processing can be spread over multiple cores using fine-grained threading. Work can even be queued for processing by idle cores if the results aren't needed right away.

Of course, Valve didn't want its game programmers to have to become threading experts just to take advantage of hybrid threading. Game programmers should be solving game problems rather than threading problems, so a work management system was designed to address gaming problems in a way that's intuitive for game programmers. This system supports all the elements of hybrid threading and focuses on keeping multiple cores as busy as possible.

Valve's work management system features a main thread that uses a pool of N-1 worker threads, where N is the number of processor cores available. Of course, multiple threads create problems for data sharing if parallel threads want to read and write the same data. Locks are traditionally used to prevent corruption when a thread tries to read data that's currently being written or modified. However, locks force the reading thread to wait, leading to idle CPU cycles that clash with Valve's desire to keep all cores occupied at all times.
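The main-thread-plus-N-1-workers arrangement might look something like the following. This is a sketch using Python's standard thread pool, not Valve's work management system; `worker_count` and `run_frame` are invented names, and the "jobs" are assumed to be plain callables that take no arguments.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def worker_count(cores=None):
    """N-1 workers for N cores, leaving one core to the main thread (minimum 1)."""
    n = cores if cores is not None else (os.cpu_count() or 1)
    return max(1, n - 1)


def run_frame(jobs, cores=None):
    """Submit jobs to the pool, then collect their results in submission order."""
    with ThreadPoolExecutor(max_workers=worker_count(cores)) as pool:
        # Jobs are queued and picked up by whichever worker is idle.
        futures = [pool.submit(job) for job in jobs]
        # The main thread would carry on with its own work here.
        # It blocks only at the point where the results are actually needed.
        return [f.result() for f in futures]
```

On a quad-core system, `worker_count(4)` gives three workers, so the main thread and the pool together account for all four cores.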

In an attempt to avoid core idling due to thread locking, Valve made extensive use of "lock-free" algorithms. These algorithms allow threads to progress regardless of the state of other threads, and have been put under the hood of all of Valve's developer tools.

To illustrate the application of its new programming framework, Valve explained how it handles multithreaded access to the spatial partition, a data structure that represents every object in the world. The spatial partition is used any time something dynamic happens in the world, from movement to shooting. Obviously, you want to allow multiple threads to access the partition, but that becomes tricky if multiple write threads try to access it at the same time. Through profiling, Valve discovered that 95% of the threads that wanted to access the spatial partition were just reading, while only 5% were writing. Valve now allows multiple threads to read the partition at the same time, but only one thread can access it to write.
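The many-readers, one-writer policy can be sketched with a conventional read-write lock. To be clear about the assumptions: Valve didn't describe how its version is built, and this example uses ordinary blocking primitives rather than the lock-free techniques mentioned earlier. `ReadWriteLock` and `SpatialPartition` are hypothetical classes, and the partition is reduced to a dictionary of object positions.

```python
import threading


class ReadWriteLock:
    """Allows many concurrent readers or one exclusive writer (no writer priority)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writing = False

    def acquire_read(self):
        with self._cond:
            while self._writing:          # readers wait only while a write is in progress
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()   # wake any writer waiting for readers to drain

    def acquire_write(self):
        with self._cond:
            while self._writing or self._readers > 0:
                self._cond.wait()
            self._writing = True

    def release_write(self):
        with self._cond:
            self._writing = False
            self._cond.notify_all()


class SpatialPartition:
    """Hypothetical stand-in for the real structure: maps object id to position."""

    def __init__(self):
        self._lock = ReadWriteLock()
        self._objects = {}

    def query(self, obj_id):
        self._lock.acquire_read()
        try:
            return self._objects.get(obj_id)
        finally:
            self._lock.release_read()

    def move(self, obj_id, position):
        self._lock.acquire_write()
        try:
            self._objects[obj_id] = position
        finally:
            self._lock.release_write()
```

With reads making up 95% of accesses, nearly all callers go through `query` and never block one another. One caveat of this simple version is that a steady stream of readers can starve a waiting writer; a production design would need to address that.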

Valve was also able to apply multithreading to the Source engine's renderer. Game engines must perform numerous tasks before even issuing draw calls, including building world and object lists, performing graphical simulations, updating animations, and computing shadows. These tasks are all CPU-bound, and must be calculated for every "view", be it the player camera, surface reflections, or in-game security camera monitors. With hybrid threading, Valve is able to construct world and object lists for multiple views in parallel. Graphics simulations can be overlapped, and things like shadows and bone transformations for all characters in all views can be processed across multiple cores. Multiple draw threads can even be executed in parallel, and Valve has rewritten the graphics library that sits between its engine and the DirectX API to take advantage of multiple cores.