Setup

I have an entity-component architecture where entities can have a set of attributes (which are pure data with no behavior), and there are systems that run the entity logic and act on that data.

For example, a system that moves all entities at a constant rate might look, in somewhat pseudo-code, like this:

MovementSystem extends System
{
    update()
    {
        for each entity in entities
        {
            position = entity.attributes["position"];
            position += vec3(1,1,1);
        }
    }
}

Essentially, I'm trying to parallelize update() as efficiently as possible. This can be done either by running entire systems in parallel, or by splitting one system's update() across threads, so that different threads execute the update of the same system, but for different subsets of the entities registered with that system.

Problem

In the case of the shown MovementSystem, parallelization is trivial. Since entities don't depend on each other and don't modify shared data, we could just move all entities in parallel.
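To make the trivial case concrete, here is a minimal C++ sketch of that per-entity parallelization. The contiguous position array and the function name are my own assumptions, not part of the question's engine:

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

struct Vec3 { float x, y, z; };

// Move every entity by (1,1,1), splitting the position array into one
// contiguous chunk per thread. No locking is needed: the chunks are
// disjoint and no entity reads another entity's data.
void parallel_move(std::vector<Vec3>& positions, unsigned num_threads)
{
    std::vector<std::thread> workers;
    const std::size_t chunk =
        (positions.size() + num_threads - 1) / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = t * chunk;
        const std::size_t end = std::min(begin + chunk, positions.size());
        if (begin >= end) break;
        workers.emplace_back([&positions, begin, end] {
            for (std::size_t i = begin; i < end; ++i) {
                positions[i].x += 1.0f;
                positions[i].y += 1.0f;
                positions[i].z += 1.0f;
            }
        });
    }
    for (auto& w : workers) w.join();
}
```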

However, these systems sometimes require that entities interact with (read/write data from/to) each other, sometimes within the same system, but often between different systems that depend on each other.

For example, in a physics system entities may interact with each other: two objects collide, their positions, velocities and other attributes are read from both, updated, and then the updated attributes are written back to both entities.

And before the rendering system in the engine can start rendering entities, it has to wait for other systems to complete execution to ensure that all relevant attributes are what they need to be.

If we try to blindly parallelize this, we get classical race conditions, where different systems read and modify the same data at the same time.

Ideally, there would be a solution where every system can read data from any entity it wishes, without having to worry about other systems modifying that same data at the same time, and without the programmer having to manually order the execution and parallelization of these systems (which may sometimes not even be possible).

In a basic implementation, this could be achieved by putting all data reads and writes in critical sections (guarding them with mutexes). But this introduces a large amount of runtime overhead and is probably not suitable for performance-sensitive applications.
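For reference, the mutex-guarded variant described above might look like this sketch (the type and member names are hypothetical):

```cpp
#include <mutex>

struct Vec3 { float x, y, z; };

// Naive approach: every read or write of a shared attribute takes a
// lock. This is correct, but each access pays the cost of acquiring
// the mutex even when no other thread is touching that entity.
struct GuardedPosition {
    Vec3 value{};
    std::mutex m;

    Vec3 read() {
        std::lock_guard<std::mutex> lock(m);
        return value;
    }

    void write(const Vec3& v) {
        std::lock_guard<std::mutex> lock(m);
        value = v;
    }
};
```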

Solution?

In my thinking, a possible solution would be a system where reading/updating and writing of data are separated. In one expensive phase, systems only read data and compute what they need to compute, caching the results somehow; then all the changed data is written back to the target entities in a separate writing pass. All systems would act on the data in the state it was in at the beginning of the frame. Then, before the end of the frame, when all systems have finished updating, a serialized writing pass iterates through the cached results from all the different systems and writes them back to the target entities.
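One way to sketch that read-then-write split is a per-system command buffer: each system records its results during the read phase, and a single serialized pass applies them at the end of the frame. The structures and names here are illustrative assumptions, not an established API:

```cpp
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

// A result computed during the read phase, to be applied later.
struct PendingWrite {
    std::size_t entity;  // index of the target entity
    Vec3 new_position;   // value computed from the frame-start snapshot
};

// During update(), each system appends only to its own buffer, so the
// buffers need no locking. At the end of the frame, one serialized
// pass applies every cached result to the entities.
void apply_writes(std::vector<Vec3>& positions,
                  const std::vector<std::vector<PendingWrite>>& buffers)
{
    for (const auto& buffer : buffers)   // one buffer per system
        for (const auto& w : buffer)
            positions[w.entity] = w.new_position;
}
```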

This is based on the (maybe wrong?) idea that the easy parallelization win could be big enough to outweigh the cost (both in runtime performance and in code overhead) of the result caching and the writing pass.

The Question

How might such a system be implemented to achieve optimal performance? What are the implementation details of such a system and what are the prerequisites for an Entity-Component system that wants to use this solution?

2 Answers

First point: since you don't mention having profiled your release build and found a specific need, I suggest you do that ASAP. What does your profile look like? Are you thrashing the caches with bad memory layout, is one core pegged at 100%, how much relative time is spent processing your ECS versus the rest of your engine, etc.?

Read from an entity, compute something... and hold onto the results in an intermediate storage area until later? I don't think you can separate read+compute+store in the way you think and expect this intermediate store to be anything but pure overhead.

Plus, since you're doing continuous processing, the main rule you want to follow is one thread per CPU core. I think you are looking at this at the wrong layer: try looking at entire systems, not individual entities.

Create a dependency graph between your systems: a tree of which systems need results from which earlier systems' work. Once you have that dependency tree, you can easily send entire systems full of entities off to process on a thread.
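A sketch of that idea in C++, assuming each system is identified by an index and declares which systems it depends on: group the systems into stages, where every system in a stage can be dispatched to a thread at once because all of its dependencies completed in earlier stages. This is just Kahn-style topological sorting; the names are my own:

```cpp
#include <vector>

// deps[i] lists the systems that must finish before system i may run.
// Returns the systems grouped into stages: all systems in one stage
// can run in parallel, and the stages run in order.
std::vector<std::vector<int>>
build_stages(const std::vector<std::vector<int>>& deps)
{
    const int n = static_cast<int>(deps.size());
    std::vector<int> remaining(n);             // unmet dependency counts
    std::vector<std::vector<int>> dependents(n);
    for (int i = 0; i < n; ++i) {
        remaining[i] = static_cast<int>(deps[i].size());
        for (int d : deps[i]) dependents[d].push_back(i);
    }

    std::vector<std::vector<int>> stages;
    std::vector<int> ready;
    for (int i = 0; i < n; ++i)
        if (remaining[i] == 0) ready.push_back(i);

    while (!ready.empty()) {
        stages.push_back(ready);
        std::vector<int> next;
        for (int s : ready)                    // "run" the whole stage
            for (int dep : dependents[s])
                if (--remaining[dep] == 0) next.push_back(dep);
        ready = std::move(next);
    }
    return stages;  // a cycle would leave some systems out of all stages
}
```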

So let's say your dependency tree is a morass of brambles and bear traps. That's a design issue, but we have to work with what we have. The best case here is that, inside each system, no entity depends on any other result inside that system. Then you can easily subdivide the processing across threads: 0-99 and 100-199 on two threads, for an example with two cores and 200 entities owned by this system.

In either case, at each stage you have to wait for the results that the next stage depends on. But this is OK, because waiting for ten large blocks of data to be processed in bulk is far superior to synchronizing a thousand times on small blocks.

The idea behind building a dependency graph is to trivialize the seemingly impossible task of "finding and assembling other systems to run in parallel" by automating it. If such a graph shows signs of being blocked by constant waiting for previous results, then creating a read+modify phase and a delayed write only moves the blockage; it does not remove the serial nature of the processing.

And serial processing can only be made parallel between each sequence point, not overall. But you realize this, because it is the core of your problem. Even if you cache reads of data that hasn't been written yet, you still need to wait for that cache to become available.

If creating parallel architectures were easy or even possible with these kinds of constraints then computer science wouldn't have been struggling with the problem since Bletchley Park.

The only real solution is to minimize these dependencies so that sequence points are needed as rarely as possible. This may involve subdividing systems into sequential processing steps where, inside each subsystem, going parallel with threads becomes trivial.

That's the best I've got for this problem, and it's really nothing more than recommending that if hitting your head on a brick wall hurts, break it into smaller brick walls so you're only hitting your shins.

I'm sorry to tell you, but this answer seems kind of unproductive. You're just telling me that what I'm looking for doesn't exist, which seems wrong (at least in principle), and I've seen people allude to such a system in several places before (nobody ever gives enough details, though, which is the main motivation for asking this question). It might be that I wasn't nearly detailed enough in my original question, which is why I've extensively updated it (and I will keep updating it if my mind stumbles on something).
– TravisG, Sep 2 '13 at 17:00

@TravisG There are often systems that depend on other systems, as Patrick pointed out. To avoid frame delays, or multiple update passes as part of one logic step, the accepted solution is to serialize the update phase: run subsystems in parallel where possible, serialize subsystems with dependencies, all the while batching smaller update passes inside each subsystem using a parallel_for() concept. It's ideal for any combination of subsystem update-pass needs, and the most flexible.
– crancran, Sep 13 '13 at 4:40

I've heard of an interesting solution to this problem: the idea is to keep two copies of the entity data (wasteful, I know). One copy is the present copy, and the other is the past copy. The present copy is strictly write-only, and the past copy is strictly read-only. I'm assuming that systems don't want to write to the same data elements; if that's not the case, those systems should be on the same thread. Each thread has write access to the present copies of mutually exclusive sections of the data, and every thread has read access to all past copies of the data, and can thus update the present copies from the past copies with no locking. Between frames, the present copy becomes the past copy, however you want to handle the swapping of roles.

This method also removes race conditions because all systems will be working with a stale state that will not change before/after the system has processed it.
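A minimal sketch of that double-buffered scheme, with my own hypothetical names: `past` is read-only during the frame, `present` is write-only, and the two are swapped between frames:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

struct Vec3 { float x, y, z; };

// Two copies of the component data. During a frame, every system reads
// only `past` and writes only its own range of `present`, so no locks
// are needed. Between frames the roles are swapped in O(1).
struct DoubleBufferedPositions {
    std::vector<Vec3> past;
    std::vector<Vec3> present;

    explicit DoubleBufferedPositions(std::size_t n) : past(n), present(n) {}

    void flip() { std::swap(past, present); }  // end of frame
};

// Example system: every entity moves halfway toward entity 0's *past*
// position, so the result is independent of update order.
void seek_entity_zero(DoubleBufferedPositions& buf)
{
    const Vec3 target = buf.past[0];
    for (std::size_t i = 0; i < buf.past.size(); ++i) {
        buf.present[i].x = buf.past[i].x + (target.x - buf.past[i].x) * 0.5f;
        buf.present[i].y = buf.past[i].y + (target.y - buf.past[i].y) * 0.5f;
        buf.present[i].z = buf.past[i].z + (target.z - buf.past[i].z) * 0.5f;
    }
}
```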

That's John Carmack's heap copy trick, isn't it? I've wondered about it, but it still potentially has the same problem that multiple threads might write to the same output location. It's probably a good solution if you keep everything "single-pass", but I'm not sure how feasible that is.
– TravisG, Sep 2 '13 at 18:53

Input-to-screen-display latency would go up by one frame's time, including GUI reactivity, which may matter for action/timing games or heavy GUI manipulation like an RTS has. I like it as a creative idea, however.
– Patrick Hughes, Sep 2 '13 at 18:57

I heard about this from a friend and did not know it was a Carmack trick. Depending on how rendering is done, the rendering of components may be one frame behind. You could use this just for the update phase, then render from the current copy once everything is up to date.
– John McDonald, Sep 3 '13 at 15:28