Lionel Lemarié opened the afternoon conference sessions. He works on the profiling tools team at Sony, and his talk covered multi-core optimization of games.

First, an overview of the problem: recent PC processors, the PS3 Cell and the Xbox 360 Xenon are all multi-core processors, although their underlying architectures are very different. The Cell is made of one PPU (with 2 hardware threads) and a bunch of SPUs. The Xenon is made of 3 cores that support 2 hardware threads each. Recent PC processors are made of one or more cores, often with 2 hardware threads each. These differences have a great impact on software architecture, especially if you have to support all three platforms.

To take advantage of multi-core architectures, you have to use threads. Lionel suggests putting them in a thread pool: first query the system for the processor count, then create one thread per logical processor. Of course, you may want to cap the number of threads at something workable. The threads run a set of small, independent tasks, and the task list is prepared by the low-priority main worker thread. You get one massive benefit (the architecture is scalable) and a few drawbacks (resource management and synchronization become a bit annoying).
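A minimal sketch of such a pool, under stated assumptions: `run_tasks` and `MAX_WORKERS` are hypothetical names, and the cap of 8 is arbitrary (the talk only says "something workable"). Python is used for brevity; CPython threads share one interpreter lock, so the sketch shows the structure rather than the speed-up a native engine would see.

```python
import os
import queue
import threading

MAX_WORKERS = 8  # hypothetical cap, not a value from the talk


def run_tasks(tasks, max_workers=MAX_WORKERS):
    """Run small independent callables on one thread per logical processor."""
    # os.cpu_count() reports logical processors and may return None.
    worker_count = max(1, min(os.cpu_count() or 1, max_workers))
    pending = queue.Queue()
    results = [None] * len(tasks)

    def worker():
        while True:
            item = pending.get()
            if item is None:  # sentinel: no more work for this thread
                break
            index, task = item
            results[index] = task()

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()

    # The main thread only prepares the task list, then waits.
    for index, task in enumerate(tasks):
        pending.put((index, task))
    for _ in threads:
        pending.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return results
```

Calling `run_tasks([lambda i=i: i * i for i in range(20)])` returns the squares in task order, whichever worker ran each one. The single shared queue is also the single point of contention, which is where synchronization cost shows up.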

Regarding thread setup: let the system do the dirty work, do not forget the rest of the engine, give your main worker thread a lower priority (it does not do that much), and balance the work correctly. Balancing is pretty hard, because there are so many things to take into account: do you sleep or do you “spin lock” on important resources? And if you spin, how fast do you spin?
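One common compromise between the two options is to spin for a short, bounded time and then go to sleep. The sketch below is only an illustration of that trade-off, not something shown in the talk; `wait_for` and the `spin_seconds` budget are hypothetical.

```python
import threading
import time


def wait_for(flag: threading.Event, spin_seconds: float = 0.0005) -> str:
    """Spin briefly on `flag`, then block. Returns which path was taken."""
    deadline = time.perf_counter() + spin_seconds
    while True:
        if flag.is_set():
            return "spin"   # caught while busy-waiting: lowest latency
        if time.perf_counter() >= deadline:
            break           # spin budget used up
    flag.wait()             # give the core back to the scheduler
    return "sleep"
```

A longer spin budget lowers wake-up latency when the resource frees up quickly, but burns a core that another task could have used; a shorter one does the opposite.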

As you will have guessed, thread synchronization is pretty important. The PS3 SDK offers barriers as a synchronization primitive. A barrier is a fast synchronization primitive that lets threads wait until a condition is satisfied (typically: all threads have finished their tasks). It is possible to emulate the same behavior under Windows XP, but performance will be lower. Vista has no native barrier either. However, Vista implements condition variables, which serve a similar purpose.
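To show how a barrier can be emulated on top of a condition variable, here is a minimal sketch. `ConditionBarrier` is a hypothetical class written for this example (Python already ships `threading.Barrier`); it illustrates the idea only and is not the PS3 or Windows API.

```python
import threading


class ConditionBarrier:
    """Reusable barrier emulated with a condition variable."""

    def __init__(self, parties: int):
        self._parties = parties
        self._waiting = 0
        self._generation = 0  # distinguishes successive uses of the barrier
        self._cond = threading.Condition()

    def wait(self) -> None:
        with self._cond:
            generation = self._generation
            self._waiting += 1
            if self._waiting == self._parties:
                # Last thread in: open the barrier for everyone.
                self._generation += 1
                self._waiting = 0
                self._cond.notify_all()
            else:
                # The loop guards against spurious wake-ups.
                while generation == self._generation:
                    self._cond.wait()
```

Each worker calls `wait()` when its task is done; nobody proceeds until the last one arrives. Every arrival takes the lock, which hints at why an emulated barrier tends to be slower than a native one.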

With this approach, designing a task does not depend much on the underlying architecture. Let’s take a simple example. On the PS3, a typical task performs these operations:

Get the next data block address.

DMA the next data block.

For each data block:

Wait for the DMA to end.

DMA the next data block.

Process the current block.

Send the data back using the DMA.

On the PC:

Get the next data block address.

Memcpy the data (optional).

For each data block:

Memcpy the next data block (optional).

Process the current block.

Memcpy the data back (optional).

In the end, the architecture is quite similar.
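The two outlines above share one double-buffered loop, which can be sketched as follows. All names (`process_blocks`, `fetch`, `process`, `store`) are hypothetical, and a one-thread executor merely stands in for the DMA engine. Writing each result back inside the loop is one reading of the outline.

```python
from concurrent.futures import ThreadPoolExecutor


def process_blocks(blocks, fetch, process, store):
    """Double-buffered loop: fetch block i+1 while processing block i."""
    if not blocks:
        return
    # A single background thread plays the role of the DMA engine.
    with ThreadPoolExecutor(max_workers=1) as transfer:
        in_flight = transfer.submit(fetch, blocks[0])  # start the first transfer
        for i in range(len(blocks)):
            current = in_flight.result()               # wait for the transfer to end
            if i + 1 < len(blocks):
                # Kick off the next transfer before touching the current block.
                in_flight = transfer.submit(fetch, blocks[i + 1])
            store(i, process(current))                 # process, then send back
```

On a PC-like setup, `fetch=list` plays the part of the optional memcpy: `process_blocks([[1, 2], [3, 4]], fetch=list, process=sum, store=out.__setitem__)` fills a dict `out` with one sum per block. Only the `fetch` and `store` callables would change between platforms; the loop stays the same.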

The optimization team at Sony created tools to help visualize what really happens under the hood. In addition to their profiler, SN Tuner (for PS3), they developed an in-game profiler that displays the profiling information on-screen. Each thread is represented by one line, and on each line, color codes show running tasks, synchronization and idle time. Special colors represent draw call initialization and execution.

Figure 5: on-screen visualization of the running threads
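A rough text-mode analogue of such a display, as a sketch only: it is not Sony's tool, and `span`, `render` and the `SYMBOLS` table are hypothetical. Characters replace the color codes, with one output line per thread.

```python
import threading
import time
from collections import defaultdict
from contextlib import contextmanager

SYMBOLS = {"task": "#", "sync": "~", "idle": "."}  # stand-ins for the color codes

_spans = defaultdict(list)  # thread name -> [(kind, start, end), ...]
_lock = threading.Lock()


@contextmanager
def span(kind):
    """Record how long the calling thread spends in a section of this kind."""
    start = time.perf_counter()
    try:
        yield
    finally:
        end = time.perf_counter()
        with _lock:
            _spans[threading.current_thread().name].append((kind, start, end))


def render(width=60):
    """Return one text line per thread, one character per time slice."""
    with _lock:
        snapshot = {name: list(spans) for name, spans in _spans.items()}
    if not snapshot:
        return []
    t0 = min(start for spans in snapshot.values() for _, start, _ in spans)
    t1 = max(end for spans in snapshot.values() for _, _, end in spans)
    scale = (width - 1) / max(t1 - t0, 1e-9)
    lines = []
    for name in sorted(snapshot):
        row = [" "] * width
        for kind, start, end in snapshot[name]:
            first = int((start - t0) * scale)
            last = min(width - 1, int((end - t0) * scale))
            for col in range(first, last + 1):
                row[col] = SYMBOLS[kind]
        lines.append(f"{name:>10} |{''.join(row)}|")
    return lines
```

A worker wraps its phases in `with span("task"):` or `with span("sync"):`, and the main loop prints `render()` once per frame. Long runs of `~` or `.` on a line are the synchronization and idle time the on-screen profiler is meant to expose.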

Lionel ran some experiments to measure the impact of the number of threads versus the number of logical processors on a typical game (the falling blocks game in fig. 5). No surprise here: your biggest enemy is synchronization. His first attempt to go from one thread to multiple threads on a multi-core processor led to a performance decrease of more than 60%. To remedy the situation, he changed the design of the task scheduler to use two FIFOs instead of one. The main thread is still responsible for creating these FIFOs, but the number of locks is drastically reduced. To summarize:

Use mini-tasks to distribute the load among a thread pool. This architecture can be used on any multi-core platform. There are still some differences, but they are reasonably easy to work out.

Verify the performance continuously within the game, with an on-screen profiler.

According to Lionel, this is a bit difficult to get right, but it’s worth the effort.