I was thinking of taking it to the extreme and creating threads for every subsystem conceivable. But I was worried that might even slow things down. For example, would it be sane to separate the input thread from the rendering or game-logic thread? Would the data synchronization required make it pointless, or even slower?

7 Answers

The common approach for taking advantage of multiple cores is, frankly, just plain misguided. Separating your subsystems into different threads will indeed split up some of the work across multiple cores, but it has some major problems. First, it's very hard to work with. Who wants to muck around with locks and synchronization and communication and stuff when they could just be writing straight-up rendering or physics code instead? Second, the approach doesn't actually scale up. At best, this will allow you to take advantage of maybe three or four cores, and that's if you really know what you're doing. There are only so many subsystems in a game, and of those there are even fewer that take up large chunks of CPU time. There are a couple of good alternatives that I know of.

One is to have a main thread along with a worker thread for each additional CPU. Regardless of subsystem, the main thread delegates isolated tasks to the worker threads via some sort of queue (or queues); these tasks may themselves create further tasks. The sole purpose of the worker threads is to each grab tasks from the queue one at a time and perform them. The most important thing, though, is that as soon as a thread needs the result of a task, if the task is completed it can get the result, and if not it can safely remove the task from the queue and go ahead and perform that task itself. That is, not all tasks will end up being scheduled in parallel with each other. Having more tasks than can be executed in parallel is a good thing in this case; it means the approach is likely to scale as you add more cores. One downside is that it requires a lot of work up front to design a decent queue and worker loop, unless you have access to a library or language runtime that already provides this for you. The hardest parts are making sure your tasks are truly isolated and thread-safe, and making sure your tasks sit in a happy middle ground between coarse-grained and fine-grained.
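The "if it isn't done yet, pull it from the queue and run it yourself" idea can be sketched as follows. This is an illustrative Python sketch (the answer gives no code; names like `Task` and `worker_loop` are mine), but the same structure applies in whatever language the engine is written in:

```python
import threading
from queue import Queue, Empty

class Task:
    """A unit of work that exactly one thread will execute."""
    def __init__(self, fn, *args):
        self.fn, self.args = fn, args
        self._claim_lock = threading.Lock()
        self._done = threading.Event()
        self._claimed = False
        self.result = None

    def run(self):
        # Claim the task atomically so only one thread ever executes it.
        with self._claim_lock:
            if self._claimed:
                return False
            self._claimed = True
        self.result = self.fn(*self.args)
        self._done.set()
        return True

    def get(self):
        # If no worker has started this task yet, execute it ourselves;
        # otherwise wait for the worker that claimed it to finish.
        if not self.run():
            self._done.wait()
        return self.result

task_queue = Queue()

def worker_loop(stop):
    # Workers just drain the queue; running an already-claimed task is a no-op.
    while not stop.is_set():
        try:
            task_queue.get(timeout=0.1).run()
        except Empty:
            pass
```

Whichever thread reaches the task first wins the claim, so the result is computed exactly once regardless of who needed it.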

Another alternative to subsystem threads is to parallelize each subsystem in isolation. That is, instead of running rendering and physics in their own threads, write the physics subsystem to use all your cores at once, write the rendering subsystem to use all your cores at once, then have the two systems simply run sequentially (or interleaved, depending on other aspects of your game architecture). For example, in the physics subsystem you could take all the point masses in the game, divide them up among your cores, and then have all the cores update them at once. Each core can then work on your data in tight loops with good locality. This lock-step style of parallelism is similar to what a GPU does. The hardest part here is in making sure that you are dividing your work up into fine-grained chunks such that dividing it evenly actually results in an equal amount of work across all processors.
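The divide-the-point-masses idea can be sketched like this (Python for brevity; a real engine would do this in native code, and the function names are mine). Each thread owns a disjoint slice of the data, so no locking is needed inside the loop:

```python
import threading

def integrate_slice(positions, velocities, start, end, dt):
    # Tight loop over a contiguous, thread-owned slice: good locality, no locks.
    for i in range(start, end):
        positions[i] += velocities[i] * dt

def parallel_integrate(positions, velocities, dt, num_threads=4):
    n = len(positions)
    chunk = (n + num_threads - 1) // num_threads  # even-ish split across cores
    threads = []
    for t in range(num_threads):
        start, end = t * chunk, min((t + 1) * chunk, n)
        th = threading.Thread(target=integrate_slice,
                              args=(positions, velocities, start, end, dt))
        threads.append(th)
        th.start()
    for th in threads:
        th.join()  # barrier: the whole subsystem finishes before the next runs
```

The `join` at the end is the "run subsystems sequentially" part: physics fans out across all cores, completes, and only then does the next subsystem start.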

However, sometimes it's just easiest, due to politics, existing code, or other frustrating circumstances, to give each subsystem a thread. In that case, it's best to avoid making more OS threads than cores (if you have a runtime with lightweight threads that just happen to balance across your cores, this isn't as big of a deal). Also, avoid excessive communication. One nice trick is to try pipelining; each major subsystem can be working on a different game state at a time. Pipelining reduces the amount of communication necessary among your subsystems since they don't all need access to the same data at the same time, and it also can nullify some of the damage caused by bottlenecks. For example, if your physics subsystem tends to take a long time to complete and your rendering subsystem ends up always waiting for it, your absolute frame rate could be higher if you run the physics subsystem for the next frame while the rendering subsystem is still working on the previous frame. In fact, if you have such bottlenecks and can't remove them any other way, pipelining may be the most legitimate reason to bother with subsystem threads.
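The pipelining trick can be sketched with a bounded queue between the two stages. A minimal illustration (Python; the `physics` stand-in and all names are mine), in which physics works on frame N+1 while rendering consumes frame N:

```python
import threading
from queue import Queue

def physics(frame):
    # Stand-in for a real simulation step producing the frame's game state.
    return f"state{frame}"

def renderer(states, rendered, num_frames):
    for _ in range(num_frames):
        # Blocks until physics hands over a completed state.
        rendered.append(states.get())

def run_pipeline(num_frames):
    states = Queue(maxsize=1)  # small buffer: physics runs at most 1 frame ahead
    rendered = []
    r = threading.Thread(target=renderer, args=(states, rendered, num_frames))
    r.start()
    for frame in range(num_frames):
        # While the renderer draws frame N, this computes frame N+1.
        states.put(physics(frame))
    r.join()
    return rendered
```

The queue's capacity bounds how far ahead the producer may run, which is exactly the "each subsystem works on a different game state" arrangement described above, with one frame of extra latency as the cost.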

"as soon as a thread needs the result of a task, if the task is completed it can get the result, and if not it can safely remove the task from the queue and go ahead and perform that task itself". Are you talking about a task spawned by the same thread? If so, wouldn't it make more sense for that task to be executed by the thread which spawned it?
– jmp97, Feb 7 '11 at 8:31

i.e. the thread could, without scheduling the task, execute that task right away.
– jmp97, Feb 7 '11 at 9:52


The point is that the thread doesn't necessarily know up front whether it would be better to run the task in parallel or not. The idea is to speculatively spark work that you will eventually need done, and if another thread finds itself idling then it can go ahead and do this work for you. If this ends up not happening by the time you need the result, you can just pull the task from the queue yourself. This scheme is for dynamically balancing a workload across multiple cores rather than statically.
– Jake McArthur, Feb 7 '11 at 18:09

There are a couple of things to consider. The thread-per-subsystem route is easy to think about, since the code separation is pretty apparent from the get-go. However, depending on how much intercommunication your subsystems need, inter-thread communication could really kill your performance. In addition, this only scales to N cores, where N is the number of subsystems you abstract into threads.

If you're just looking to multithread an existing game, this is probably the path of least resistance. However, if you're working on some low level engine systems that might be shared between several games or projects, I would consider another approach.

It can take a bit of mind twisting, but if you can break things up as a job queue with a set of worker threads it will scale much better in the long run. As the latest and greatest chips come out with a gazillion cores, your game's performance will scale along with it, just fire up more worker threads.

So basically, if you're looking to bolt on some parallelism to an existing project, I'd parallelize across subsystems. If you're building a new engine from scratch with parallel scalability in mind, I'd look into a job queue.

The system you mention is very akin to the scheduling system mentioned in the answer given by the other James; still, there's good detail in that area, so +1 as it does add to the discussion.
– James, Jan 13 '11 at 16:50


A community wiki on how to set up a job queue and worker threads would be nice.
– bot_bot, Nov 15 '11 at 8:12

That question has no single best answer, as it depends on what you are trying to accomplish.

The Xbox 360 has three cores and can handle a few threads before context-switching overhead becomes a problem. The PC can deal with quite a few more.

A lot of games have typically been single-threaded for ease of programming. This is fine for most personal games. The only things you would likely need separate threads for are networking and audio.

Unreal has a game thread, render thread, network thread, and audio thread (if I remember correctly). This is pretty standard for a lot of current-gen engines, though being able to support a separate rendering thread can be a pain and involves a lot of groundwork.

The idTech5 engine being developed for Rage actually uses any number of threads, and it does so by breaking down game tasks into 'jobs' that are processed with a tasking system. Their explicit goal is to have their game engine scale nicely when the number of cores on the average gaming system jumps.

The technology I use (and have written) has a separate thread for networking, input, audio, rendering, and scheduling. It then has any number of threads which can be used to perform game tasks, and this is managed by the scheduling thread. A lot of work went into getting all the threads to play nicely with each other, but it seems to be working well and getting very good use out of multicore systems, so perhaps it is mission accomplished (for now; I might break the audio/networking/input work down into just 'tasks' that the worker threads can update).

You are right that the most critical part is to avoid synchronization wherever possible. There are a few ways to achieve this.

Know your data and store it in memory according to your processing needs. This enables you to plan for parallel calculations without the need for synchronization. Unfortunately this is quite hard to achieve most of the time, as the data is often accessed from different systems at unpredictable times.

Define clear access times for data. You could separate your main tick into x phases. If you are sure that thread X reads the data only in a specific phase, you also know that the data can safely be modified by other threads in a different phase.
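The phase idea can be sketched with a barrier that keeps all threads in lock-step: nobody enters phase k+1 until everyone has finished phase k, so writes in one phase are safely visible to reads in the next. A hypothetical illustration (Python; `PhasedTick` and all names are mine):

```python
import threading

class PhasedTick:
    """Splits one tick into phases; all threads finish phase k before k+1 starts."""
    def __init__(self, num_threads):
        self.barrier = threading.Barrier(num_threads)

    def run(self, thread_fns):
        # thread_fns: one list of phase functions per thread, all the same length.
        def body(phases):
            for phase in phases:
                phase()
                self.barrier.wait()  # nobody enters the next phase early
        threads = [threading.Thread(target=body, args=(p,)) for p in thread_fns]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
```

A thread that writes some data in phase 1 can rely on other threads only reading it from phase 2 onward, with no per-access locking.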

Double-buffer your data. That is the simplest approach, but it increases latency, as thread X works with data from the last frame while thread Y prepares the data for the next frame.
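A minimal sketch of such a double buffer (illustrative only; a real engine would swap pointers to preallocated state blocks rather than copy dictionaries, and the class name is mine):

```python
class DoubleBuffer:
    """Readers see last frame's data while the writer prepares the next frame."""
    def __init__(self, initial):
        self.front = dict(initial)  # read by e.g. the render thread
        self.back = dict(initial)   # written by e.g. the update thread

    def write(self, key, value):
        self.back[key] = value      # never touches what readers see

    def read(self, key):
        return self.front[key]      # stable for the whole frame

    def swap(self):
        # Called once per frame at a known sync point, e.g. end of the tick.
        self.front = self.back
        self.back = dict(self.front)  # writer continues from the latest state
```

Until `swap` is called, readers keep seeing the old frame's values no matter what the writer does, which is exactly the one-frame latency described above.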

My personal experience shows that fine-grained calculations are the most effective way, as they can scale far better than a subsystem-based solution. If you thread your subsystems, the frame time will be bound by the most expensive subsystem. This can lead to all threads but one idling until the expensive subsystem has finally finished its work.
If you are able to separate large parts of your game into small tasks, these tasks can be scheduled so as to avoid idling cores. But this is hard to accomplish if you already have a big code-base.

To take some hardware constraints into consideration, you should try never to oversubscribe your hardware. By oversubscribing, I mean having more software threads than your platform has hardware threads. Especially on PPC architectures (Xbox 360, PS3), a task switch is really expensive. It's of course perfectly okay to have a few oversubscribed threads which are only triggered for a small amount of time (once a frame, for example).
If you target the PC, you should keep in mind that the number of cores (or better, hardware threads) is constantly growing, so you would want to find a scalable solution which takes advantage of the additional CPU power. So, in this area, you should try to design your code to be as task-based as possible.
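A small sketch of sizing the worker pool to the hardware so you never oversubscribe (Python for illustration; `worker_count` is a made-up helper):

```python
import os

def worker_count(reserved_threads=1):
    """Number of task workers that avoids oversubscribing the hardware.

    reserved_threads accounts for threads you already run permanently (e.g.
    the main/render thread), so the total OS thread count stays at or below
    the machine's hardware thread count.
    """
    hw = os.cpu_count() or 1  # cpu_count() can return None on exotic platforms
    return max(1, hw - reserved_threads)
```

Querying the count at startup, rather than hard-coding it, is what makes the task-based design scale as core counts grow.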

A thread per subsystem is the wrong way to go. Suddenly, your app won't scale, because some subsystems demand a lot more than others. This was the threading approach taken by Supreme Commander, and it didn't scale beyond two cores, because they only had two subsystems that took up a substantial amount of CPU: rendering and physics/game logic. Even though they had 16 threads, the other threads barely did any work, and as a result the game only scaled to two cores.

What you should do is use something called a thread pool. This somewhat mirrors the approach taken on GPUs: you post work, and any available thread simply comes along and does it, then returns to waiting for work. Think of it like a ring buffer of threads. This approach has the advantage of N-core scaling and is very good at scaling for both low and high core counts. The disadvantage is that it's quite hard to work out thread ownership for this approach, as it's impossible to know which thread is doing what work at any given time, so you have to have the ownership issues locked down very tightly. It also makes it very hard to use technologies like Direct3D 9 which don't support multiple threads.

Thread pools are very hard to use, but they deliver the best possible results. If you need extremely good scaling, or you have plenty of time to work on it, use a thread pool. If you're trying to introduce parallelism into an existing project with unknown dependency problems and single-threaded technologies, this isn't the solution for you.
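For illustration only: a thread pool in the sense described here can be as simple as a shared queue of jobs that idle workers drain. Python's standard executor shows the shape (a real engine would roll its own in native code; `run_jobs` is a name I made up):

```python
from concurrent.futures import ThreadPoolExecutor

def run_jobs(jobs, workers=4):
    # Any idle worker grabs the next job, so work balances across cores
    # no matter which subsystem posted it.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map preserves submission order in its results, even though the
        # jobs themselves may complete out of order on different workers.
        return list(pool.map(lambda job: job(), jobs))
```

The N-core scaling comes for free: posting more, smaller jobs gives the scheduler more freedom, and adding cores just means raising `workers`.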

Just to be a bit more precise: GPUs do not use thread pools; instead, the thread scheduler is implemented in hardware, which makes it very cheap to create new threads and switch between them, as opposed to CPUs, where thread creation and context switches are expensive. See Nvidia's CUDA Programming Guide, for example.
– Nils, Jan 19 '11 at 12:08


+1: Best answer here. I would even use more abstract constructs than thread pools (for example, job queues and workers) if your framework allows it. It is much easier to think and program in these terms than in pure threads/locks/etc. Plus: splitting your game into rendering, logic, etc. is nonsense, since the rendering has to wait for the logic to finish. Rather, create jobs that can actually be executed in parallel (for example: compute the AI for one NPC for the next frame).
– Dave O., Feb 10 '11 at 15:45

General rule of thumb for threading an application: 1 thread per CPU core. On a quad-core PC that means 4. As was noted, the Xbox 360 however has 3 cores with 2 hardware threads each, so 6 threads in this case. On a system like the PS3... well, good luck with that one :) People are still trying to figure it out.

I would suggest designing each system as a self-contained module that you could thread if you wanted. This usually means having very clearly defined communication pathways between the module and the rest of the engine. I particularly like read-only processes like rendering and audio, as well as 'are we there yet' processes like reading player input, as candidates for being threaded off. To touch on the answer given by AttackingHobo: when you are rendering at 30-60 fps, if your data is 1/30th-1/60th of a second out of date, it really is not going to detract from the responsive feel of your game. Always remember that the main difference between application software and video games is doing everything 30-60 times a second. On that same note, however, input may be one of the things you want to keep on the main thread, so the rest can react to it as soon as it appears :)

If you design your engine's systems well enough any of them can be moved from thread to thread to load balance your engine more appropriately on a per-game basis and the like. In theory you could also use your engine in a distributed system if need be where entirely separate computer systems run each component.

I create one thread per logical core (minus one, to account for the main thread, which incidentally is responsible for rendering, but otherwise acts as a worker thread too).

I collect input device events in real time throughout a frame, but don't apply them until the end of the frame: they will take effect in the next frame.
And I use similar logic for rendering (old state) versus updating (new state).

I use atomic events to defer unsafe operations until later in the same frame, and I use more than one event queue (job queue) in order to implement a memory barrier that gives an iron-clad guarantee regarding order of operations, without locking or waiting (lock free concurrent queues in order of job priority).

It is worth mentioning that any job can issue subjobs (which are finer, and approach atomicity) to the same priority queue or to a higher one (served later in the frame).

Given I have three such queues, all threads except one can potentially stall exactly three times per frame (while waiting for other threads to complete all outstanding jobs issued at the current priority level).

My frame starts with MAIN rendering the OLD STATE from the previous frame's update pass, while all other threads immediately start calculating the NEXT frame's state; I'm just using events to double-buffer state changes until a point in the frame where nobody is reading anymore.
– Homer, Oct 27 '14 at 8:14