Tuesday, September 21, 2010

For console development, memory is a very precious resource. You want good locality of reference and as little fragmentation of possible. You also want to be able to track the amount of memory used by different subsystems and eliminate memory leaks. To do that, you want to write your own custom memory allocators. But the standard ways of doing that in C++ leave a lot to be desired.

You can override global new and replace it with something else. This way you can get some basic memory tracking, but you still have to use the same allocation strategy for all allocations, which is far from ideal. Some systems work better with memory pools. Some can use simple frame allocation (i.e., pointer bump allocation). You really want each system to be able to have its own custom allocators.

The other option in C++ is to override new on a per class basis. This has always has seemed kind of strange to me. Pretty much the only thing you can use it for are object pools. Global, per-class object pools. If you want one pool per thread, or one pool per streaming chunk -- you run into problems.

Then you have the STL solution, where containers are templated on their allocator, so containers that use different allocators have different types. It also has fun things such as rebind(). But the weirdest thing is that all instances of the allocator class must be equivalent. So you must put all your data in static variables. And if you want to create two separate memory pools you have to have two different allocator classes.

I must admit that every time I run into something in STL that seems completely bonkers I secretly suspect that I have missed something. Because obviously STL has been created by some really clever people who have thought long and hard about these things. But I just don't understand the idea behind the design of the custom allocator interface at all. Can any one explain it to me? Does any one use it? Find it practical? Sane?

If it weren't for the allocator interface I could almost use STL. Almost. There is also the pretty inefficient map implementation. And the fact that deque is not a simple ring buffer, but some horrible beast. And that many containers allocate memory even if they are empty... So my own version of everything it is. Boring, but what's a poor gal gonna do?

Back to allocators. In conclusion, all the standard C++ ways of implementing custom allocators are (to me) strange and strangely useless. So what do I do instead? I use an abstract allocator interface and implement it with a bunch of concrete classes that allocate memory in different ways:

I think this is about as sane as an allocator API can get. One possible point of contention is the allocated_size() method. Some allocators (e.g., the frame allocator) do not automatically know the sizes of their individual allocations, and would have to use extra memory to store them. However, being able to answer questions about allocation sizes is very useful for memory tracking, so I require all allocators to provide that information, even if it means that a frame allocator will have to use a little extra memory to store it.

I use an abstract interface with virtual functions, because I don't want to template my classes on the allocator type. I like my allocators to be actual objects that I can create more than one of, thank you very much. Memory allocation is expensive anyway, so I don't care about the cost of a virtual function call.

In the BitSquid engine, you can only allocate memory through an Allocator object. If you call malloc or new the engine will assert(false).

Also, in the BitSquid engine all allocators keep track of the total number of allocations they have made, and the total size of those allocations. The numbers are decreased on deallocate(). In the allocator destructor we assert(_size == 0 && _allocations == 0) and when we shut down the application we tear down all allocators properly. So we know that we don't have any memory leaks in the engine. At least not along any code path that has ever been run.

Since everything must be allocated through an Allocator, all our collection classes (and a bunch of other low-level classes) take an Allocator & in the constructor and use that for all their allocations. Higher level classes either create their own allocator or use one of the globals, such as memory_globals::default_allocator().

With this interface set, we can implement a number of different allocators. A HeapAllocator that allocates from a heap. A PoolAllocator that uses an object pool. A FrameAllocator that pointer bumps. A PageAllocator that allocates raw virtual memory. And so on.

Most of the allocators are set up to use a backing allocator to allocate large chunks of memory which they then chop up into smaller pieces. The backing allocator is also an Allocator. So a pool allocator could use either the heap or the virtual memory to back up its allocations.

We use proxy allocators for memory tracking. For example, the sound system uses:

ProxyAllocator("sound", memory_globals::default_allocator());

which forwards all allocations to the default allocator, but keeps track of how much memory has been allocated by the sound system, so that we can display it in nice memory overviews.

If we have a hairy memory leak in some system, we can add a TraceAllocator, another proxy allocator which records a stack trace for each allocation. Though, truth be told, we haven't actually had to use that much. Since our assert triggers as soon as a memory leak is introduced, and the ProxyAllocator tells us in which subsystem the leak occurred, we usually find them quickly.

To create and destroy objects using our allocators, we have to use placement new and friends:

One last interesting thing to talk about. Since we use the allocators to assert on memory leaks, we really want to make sure that we set them up and tear them down in a correct, deterministic order. Since we are not allowed to allocate anything without using allocators, this raises an interesting chicken-and-egg problem: who allocates the allocators? How does the first allocator get allocated?

The first allocator could be static, but I want deterministic creation and destruction. I don't want the allocator to be destroyed by some random _exit() callback god knows when.

The solution -- use a chunk of raw memory and new the first allocator into that:

Note how this works. _buffer is initialized statically, but since that doesn't call any constructors or destructors, we are fine with that. Then we placement new a HeapAllocator at the start of that buffer. That heap allocator is a static heap allocator that uses a predefined memory block to create its heap in. And the memory block that it uses is the rest of the _buffer -- whatever remains after _static_heap has been placed in the beginning.

Now we have our bootstrap allocator, and we can go on creating all the other allocators, using the bootstrap allocator to create them.

Friday, September 17, 2010

The BitSquid engine has two separate scripting systems. The first is Lua. The entire engine is exposed to Lua in such a way that you can (and are encouraged to) write an entire game in Lua without touching the C++ code at all.

The second system, which is the focus of this post, is a visual scripting system that allows artists to add behaviors to levels and entities. We call this system Flow.

Lua and Flow have different roles and complement each other. Lua allows gameplay programmers to quickly test designs and iterate over gameplay mechanics. Flow lets artists enrich their objects by adding effects, destruction sequences, interactivity, etc. When the artists can do this themselves, without the help of a programmer, they can iterate much faster and the quality is raised.

Flow uses a pretty standard setup with nodes representing actions and events. Black links control how events cascade through the graph causing actions to be taken. Blue links represent variables that are fed into the flow nodes:

In this graph, the green nodes represent events and the red node represents an action. The yellow node is an action implemented in Lua. The gameplay programmers can set up such actions on a per-project basis and make them available to the artists.

Each unit (= entity) type can have its own flow graph, specifying that unit's behavior, e.g. an explosion sequence for a barrel. There is also a separate flow graph for the entire level that the level designers can use to set up the level logic.

Since Flow will be used a lot I wanted to implement it as efficiently as possible, both in terms of memory and performance. This means that I don't use a standard object-oriented design where each node is a separate heap-allocated object that inherits from an abstract Node class with a virtual do_action() method. That way lies heap allocation and pointer chasing madness.

Sure, we might use such a design in the editor, where we don't care about performance and where the representation needs to be easy to use, modifiable, stable to version changes, etc. In the runtime, where the graph is static and compiled for a particular platform, we can do a lot better.

(This is why you should keep your editor (source) data format completely separate from your in-game data format. They have very different requirements. One needs to be dynamic, multi-platform and able to handle file format version changes. The other is static, compiled for a specific platform and does not have to care about versioning -- since we can just recompile from source if we change the format. If you don't already -- just use JSON for all your source data, and binary blobs for your in-game data, it's the only sane option.)

The runtime data doesn't even have to be a graph at all. There are at least two other options:

We could convert the entire graph into a Lua program, representing the nodes as little snippets or functions of Lua code. This would be a very flexible approach that would be easy to extend. However, I don't like it because it would introduce significant overhead in both CPU and memory usage.

We could "unroll" the graph. For each possible external input event we could trace how it flows through the graph and generate a list of triggered actions as a sort of "bytecode". The runtime data would then just be a bytecode snippet for each external event. This has the potential of being very fast but can be complicated by nodes with state that may cause branching or looping. Also, it could potentially use a lot of extra memory if there are many shared code paths.

So I've decided to use a graph. But not a stupid object-oriented graph. A data-oriented graph, where we put the entire graph in a single blob of cohesive memory:

Oh, these blobs, how I love them. They are really not that complicated. I just concatenate all the node data into a single memory chunk. The data for each node begins with a node type identifier that lets me switch on the node type and do the appropriate action for each node. (Yes, I'll take that over your virtual calls any day.) Pointers to nodes are stored as offsets within the blob. To follow a pointer, just add the offset to the blob's start address, cast the pointer to something useful and presto!

Yes, it is ok to cast pointers. You can do it. You don't have to feel bad about it. You know that there is a struct ParticleEffectNode at that address. Just cast the pointer and get it over with.

The many nice things about blobs with offsets deserve to be repeated:

We can allocate the entire structure with a single memory allocation. Efficient! And we know exactly how much memory it will use.

We can make mem-copies of the blob without having to do any pointer patching, because there are no pointers, only offsets.

We can even DMA these copies over to a SPU. Everything is still valid.

We can write the data to disk and read it back with a single call. No pointer patching or other fixups needed.

In fact, doesn't this whole blob thing look suspicously like a file format? Yes! That is exactly what it is. A file format for memory. And it shouldn't be surprising. The speed difference between RAM and CPU means that memory is the new disk!

If you are unsure about how to do data-oriented design, thinking "file formats for memory" is not a bad place to start.

Note that there is a separate blob for dynamic data. It is used for the nodes' state data, such as which actor collided with us for a collision event (the blue data fields in the flow graph). We build that blob in the same way. When we compile the graph, any node that needs dynamic data reserves a little space in the dynamic blob and stores the offset to that space from the start of the dynamic block.

When we clone a new entity, we allocate a new dynamic data block and memcopy in the template data from the dynamic data block that is stored in the compiled file.

Sharing of dynamic data (the blue links in the graph) is implemented by just letting nodes point to the same offset in the dynamic data blob.

Since we are discussing performance, I should perhaps also say something about multithreading. How do we parallelize the execution of flow graphs?

The short answer: we don't.

The flow graph is a high level system that talks to a lot of other high level systems. A flow graph may trigger an effect, play a sound, start an animation, disable a light, etc, etc. At the same time there isn't really any heavy CPU processing going on in the flow graph itself. In fact it doesn't really do anything other than calling out to other systems. Multithreading this wouldn't gain a lot (since the compute cost is low to begin with) and add significant costs (since all the external calls would have to be synchronized in some way).

Wednesday, September 8, 2010

I'd like to spend as little of my time as possible writing exporters, and I'm guessing most other developers feel the same. So, in that spirit I'm sharing the code of the simple Motion Builder exporter I just wrote: