Friday, December 18, 2015

We’re all familiar with the benefits that a data driven architecture brings to gameplay: code is decoupled from data, enabling live linking and rapid iteration. Placing new objects in the editor or modifying the speed of a character has an immediate effect on a live game instance. Really speeds up the development process as you fine tune scripts, gameplay and other content.

What about graphics programming? It turns out that the same architecture and associated benefits apply to Stingray’s renderer.

Just by modifying configuration files (albeit somewhat complex configuration files) we can implement new shader programs, post-processing effects and even different cascading shadow map implementations. All in real time, on a live game instance. Which is a big win for graphics programmers: try out new ideas, fine tune shaders all with real-time feedback. No more of that long edit/compile/run/debug cycle. And this applies to the entire rendering pipeline: everything from the object space to world space transforms to shadow casting and the final rendering pass is all exposed as config file data, not as C++ code as with traditional architectures.

I gave a presentation on this topic a while back which has now found it’s way to our YouTube channel:

By the way, there’s a lot of other great Stingray content up there so please check it out! The renderer presentation can be found under “Stingray Render Config Tutorial.”

The details as well as a PowerPoint can be found there. The code changes to add a trivial greyscale post-processing effect involve:

settings.ini:

The render_config variable points to the renderer.config file. Settings.ini also provides a section to override default settings found in the next file, renderer.render_config

core/stingray_renderer/renderer.render_config:

Points to our shader libraries, text files containing actual shader programs. A section called global_resources allocates graphics buffers, such as scratch buffers for the cascading shadow maps and G-buffers for deferred rendering along with the main framebuffer. And most of the actual rendering is invoked in the resource_generators section. Again, more details in the YouTube video though a surprising amount can be learned just by grepping through the various config files and playing with the settings. Which is easy to do since it’s all data driven!

core/stingray_renderer/shader_libraries/development.shader_source:

One of several shader libraries. While shader code can be entered as text here, Stingray also provides a graphical node-based shader editor. And we support ShaderFX materials from Max or Maya. It’s often easier (and more portable) to implement shaders graphically.

But whatever method you choose to implement shaders in, the key point is that Stingray's entire rendering pipeline is fully accessible through configuration files. With our data driven architecture making complex rendering changes, while still non-trivial, is a whole lot faster and easier (and portable!) than working with platform-specific C++ code.

Thursday, September 10, 2015

We've recently re-visited our AO solution with the goal of improving its performance on consoles. We currently use the Scalable Ambient Obscurance algorithm presented by Morgan McGuire. Out target was to bring down the cost of the entire effect to something between 1-1.5ms on Xbox One. To achieve this it was clear that we needed to reduce the number of taps we were taking for each AO sample. An important part of this work went towards improving the efficiency of the temporal reprojection of our AO buffer. I thought I'd share a few observations we've made along the way.

Distributing AO Samples

When reprojecting data the key to success is to make sure your samples are well distributed through time. The Halton sequence was made popular for reprojection methods after Brian Karis presented his High Quality Temporal Supersampling. It is a sequence that gives well distributed samples in space as well as in time.

If we add an offset to a single AO sample we can see that after 2π it repeats (as expected)

So we use 8 samples ranging from [0, 2π] distributed by using the first 8 terms of a base 3 Halton sequence:
{1/3, 2/3, 1/9, 4/9, 7/9, 2/9, 5/9, 8/9} x 2π

And this is the result using 6 AO samples.

Notice that there is some banding that can appear. By adding a dithered offset to the current sample's radius we can remove this quite nicely. We use a 4x4 pattern based on the Bayer matrix.

And here are our samples distributed through time. If anyone has a better way of distributing these I would be very interested to know.

Reprojection function

When doing any temporal reprojection it is crucial to have a good reprojection function. I like to refer to this as a similarity function. Its purpose is to identify how likely it is that the reprojected samples correspond to the samples of the current pixel. When writing this function it's quite important to have a convenient way to visualize it. If the function gets complicated, it's a good idea to isolate the different terms of the function so that you can reason and debug them individually. The reprojection function we use is a combination of three main terms.

Disocclusion Term

This is a simple term which identifies depth differences and classifies previous pixels as disoccluded. We use the relative depth difference described by Huw Bowles in Iterative Image Warping.

Velocity Term

The second term is also very straight forward and consists simply of reducing the similarity for fast moving pixels. A moving pixel has less chance of reprojecting successfully than a still one.

velocity_similarity = saturate(velocity * VELOCITY_SCALAR);

Dangerous Samples Term

The idea is to identify if the AO samples we are gathering are touching moving objects. To do this efficiently, we encode a moving bit as part of the depth buffer info passed in to the SAO algorithm. If you use the mip chain described by the SAO paper, make sure that you forward that 'moving bit' to the lower levels of the mip chain. This idea was presented by Anton Michels during the Labs R&D: Rendering Techniques in Rise of the Tomb Raider presentation earlier this year. Since each AO sample will need to read the depth information we get the 'moving bit' read for free.

To try and invalidate samples associated with fast moving object, the Dangerous Samples term is accumulated through time. An idea described quite well by Oliver Mattausch in his TSSAO Gpu-Pro2 article which he called 'smooth invalidation'.

Here's the kind of ghosting we get without identifying 'dangerous' samples.

With this term as part of the reprojection function we can eliminate most of the ghosting that arises from the temporal reprojection:

Putting it all together

The final similarity term is calculated by combining all terms together.

Thursday, August 20, 2015

Making technical illustrations in drawing programs is tedious and boring. We are programmers, we should be programming our illustrations. Luckily, with JavaScript and its canvas, we can. And we can make them move!

To try this out, and to brush up my rusty JavaScript so that I can hang with all the cool web kids, I made an illustration of the how the buddy allocator described in the last article works:

Note, I use ECMAScript 2015, so this code currently only works in recent versions of Chrome and Firefox. Sorry, but worrying about compatibility takes all the fun out of JavaScript.

The allocator needs to be fast at serving an allocation request, i.e. finding a suitable piece of free memory. It also needs to be fast at freeing memory, i.e. making a previously used piece of memory available for new allocations. Finally, it needs to prevent fragmentation -- more about that in a moment.

Suppose we put all free blocks in a linked list and allocate memory by searching that list for a block of a suitable size. That makes allocation an O(n) operation, where n is the total number of free blocks. There could be thousands of free blocks and following the links in the list will cause cache misses, so to make a competitive allocator we need a faster method.

Fragmentation occurs when the free memory cannot be used effectively, because it is chopped up into little pieces:

Here, we might not be able to service a large allocation request, because the free memory is split up in two pieces. In a real world scenario, the memory can be fragmented into thousands of pieces.

The first step in preventing fragmentation is to ensure that we have some way of merging free memory blocks together. Otherwise, allocating blocks and freeing them will leave the memory buffer in a chopped up state where it is unable to handle any large requests:

Use separate allocators for long-lived and short-lived allocations, so that the short-lived allocations don't create "holes" between the long lived ones.

Put "small" allocations in a separate part of the buffer so they don't interfere with the big ones.

Make the memory blocks relocatable (i.e. use "handles" rather than pointers).

Allocate whole pages from the OS and rely on the page mapping to prevent fragmentation.

The last approach can be surprisingly efficient if you have a small page size and follow the advice suggested earlier in this series, to try to have a few large allocations rather than many small ones. On the other hand, a small page size means more TLB misses. But maybe that doesn't matter so much if you have good data locality. Speculation, speculation! I should provide some real numbers instead, but that is too much work!

Three techniques used by many allocators are in-place linked lists, preambles and postambles.

In-place linked lists is a technique for storing linked lists of free memory blocks without using any extra memory. The idea is that since the memory in the blocks is free anyway, we can just store the prev and next pointers directly in the blocks themselves, which means we don't need any extra storage space.

A preamble is a little piece of data that sits just before the allocated memory and contains some information about that memory block. The allocator allocates extra memory for the preamble and fills it with information when the memory is allocated:

In C we pretty much need to have a preamble, because when the user calls free(void *p) on a pointer p, we get no information about how big the memory block allocated at p is. That information needs to come from somewhere and a preamble is a reasonable option, because it is easy to access from the free() code:

Note that there are other options. We could use a hash table to store the size of each pointer. We could reserve particular areas in the memory buffer for allocations of certain sizes and use pointer compare to find the area (and hence the size) for a certain pointer. But hash tables are expensive, and having certain areas for allocations of certain sizes only really work if you have a limited number of different sizes. So preambles are a common option.

They are really annoying though. They increase the size of all memory allocations and they mess with alignment. For example, suppose that the user wants to allocate 4 K of memory and that our OS uses 4 K pages. Without preambles, we could just allocate a page from the OS and return it. But if we need a four byte preamble, then we will have to allocate 8 K from the OS so that we have somewhere to put those extra four bytes. So annoying!

And what makes it even more annoying is that in most cases storing the size is pointless, because the caller already knows it. For example, in C++, when we do:

delete x;

The runtime knows the actual type of x, because otherwise it wouldn't be able to call the destructor properly. But since it knows the type, it knows the size of that type and it could provide that information to the allocator when the memory is freed..

Similarly, if the memory belongs to an std::vector, the vector class has a capacity field that stores how big the buffer is, so again the size is known.

In fact, you could argue that whenever you have a pointer, some part of the runtime has to know how big that memory allocation is, because otherwise, how could the runtime use that memory without causing an access violation?

So we could imagine a parallel world where instead of free(void *) we would have free(void *, size_t) and the caller would be required to explicitly pass the size when freeing a memory block. That world would be a paradise for allocators. But alas, it is not the world we live in.

(You could enforce this parallel world in a subsystem, but I'm not sure if it is a good idea to enforce it across the board in a bigger project. Going against the grain of the programming language can be painful.)

A postamble is a similar piece of data that is put at the end of an allocated memory block.

Postambles are useful for merging. As mentioned above, when you free a memory block, you want to merge it with its free neighbors. But how do you know what the neighbors are and if they are free or not?

For the memory block to the right it is easy. That memory block starts where yours end, so you can easily get to it and check its preamble.

The neighbor to the left is trickier. Since you don't know how big that memory block might be, you don't know where to find its preamble. A postamble solves that problem, since the postamble of the block to the left will always be located just before your block.

Again, the alternative to using preambles and postambles to check for merging is to have some centralized structure with information about the blocks that you can query. And the challenge is to make such queries efficient.

If you require all allocations to be 16-byte aligned, then having both a preamble and a postamble will add 32 bytes of overhead to your allocations. That is not peanuts, especially if you have many small allocations. You can get around that by using slab or block allocators for such allocations, or even better, avoid them completely and try to make fewer and bigger allocations instead, as already mentioned in this series.

The buddy allocator

With that short introduction to some general allocation issues, it is time to take a look at the buddy allocator.

The buddy allocator works by repeatedly splitting memory blocks in half to create two smaller "buddies" until we get a block of the desired size.

If we start with a 512 K block allocated from the OS, we can split it to create two 256 K buddies. We can then take one of those and split it further into two 128 K buddies, and so on.

When allocating, we check to see if we have a free block of the appropriate size. If not, we split a larger block as many times as necessary to get a block of a suitable size. So if we want 32 K, we split the 128 K block into 64 K and then split one of those into 32 K.

At the end of this, the state of the allocator will look something like this:

As you can see, this method of splitting means that the block sizes will always be a powers of two. If you try to allocate something smaller, say 13 K, the allocation will be rounded up to the nearest power of two (16 K) and then get assigned a 16 K block.

So there is a significant amount of fragmentation happening here. This kind of fragmentation is called internal fragmentation since it is wasted memory inside a block, not wasted space between the blocks.

Merging in the buddy allocator is dead simple. Whenever a block is freed, we check if it's buddy is also free. If it is, we merge the two buddies back together into the single block they were once split from. We continue to do this recursively, so if this newly created free block also has a free buddy, they get merged together into an even bigger block, etc.

The buddy allocator is pretty good at preventing external fragmentation, since whenever something is freed there is a pretty good chance that we can merge, and if we can't the "hole" should be filled pretty soon by a similarly sized allocation. You can still imagine pathological worst-case scenarios. For example, if we first allocate every leaf node and then free every other of those allocations we would end up with a pretty fragmented memory. But such situations should be rare in practice.

I'm being pretty vague here, I know. That's because it is quite hard in general to say something meaningful about how "good" an allocator is at preventing fragmentation. You can say how good it does with a certain allocation pattern, but every program has a different allocation pattern.

Implementing the buddy allocator

Articles on algorithms and data structures are often light on implementation details. For example, you can find tons of articles describing the high-level idea behind the buddy allocator as I've outlined it above, but not much information about how to implement the bloody thing!

This is a pity, because the implementation details can really matter. For example, it's not uncommon to see someone carefully implement the A*-algorithm, but using a data structure for the open and closed sets that completely obliterates the performance advantages of the algorithm.

So let's get into a bit more detail.

We start with allocation. How can we find a free block of a requested size? We can use the technique described above: we put the free blocks of each size in an implicit linked list. To find a free block we just take the first item from the list of blocks of that size, remove it from the list and return it.

If there is no block of the right size, we take the block of the next higher size and split that. We use one of the two blocks we get and put the other one on the free list for that size. If the list of blocks of the bigger size is also empty, we can go to the even bigger size, etc.

To make things easier for us, let's introduce the concept of levels. We say that the single block that we start with, representing the entire buffer, is at level 0. When we split that we get two blocks at level 1. Splitting them, we get to level 2, etc.

We can now write the pseudocode for allocating a block at level n:

if the list of free blocks at level n is empty
allocate a block at level n-1 (using this algorithm)
split the block into two blocks at level n
insert the two blocks into the list of free blocks for level n
remove the first block from the list at level n and return it

The only data structure we need for this is a list of pointers to the first free block at each level:

static const int MAX_LEVELS = 32;
void *_free_lists[MAX_LEVELS];

The prev and next pointers for the lists are stored directly in the free blocks themselves.

Note that MAX_LEVELS = 32 is probably enough since that gives a total size of leaf_size * 4 GB and we know leaf_size will be at least 16. (The leaf nodes must have room for the prev and next pointers of the linked list and we assume a 64 bit system.)

Note also that we can create a unique index for each block in the buddy allocator as (1<<level) + index_in_level - 1. The node at level 0 will have index 0. The two nodes at level 1 will have index 1 and 2, etc:

The total number of entries in the index is (1 << num_levels) - 1. So if we want to store some data per block, this is how much memory we will need. For the sake of simplicity, let's ignore the - 1 part and just round it of as (1 << num_levels).

What about deallocation?

The tricky part is the merging. Doing the merging is simple, we just take the two blocks, remove them from the free list at level n and insert the merged block into the free list at level n-1.

The tricky part is to know when we should merge. I.e. when we are freeing a block p, how do we know if it is buddy is also free, so that we can merge them?

First, note that we can easily compute the address of the buddy. Suppose we have free a block p at level n. We can compute the index of that in the level as:

index_in_level_of(p,n) == (p - _buffer_start) / size_of_level(n)

If the index i is even, then the buddy as at index i+1 and otherwise the buddy is at i-1 and we can use the formula above to solve for the pointer, given the index.

So given the address of the buddy, let's call it buddy_ptr, how can we know if it is free or not? We could look through the free list for level n. If we find it there we know it is free and otherwise it's not. But there could be thousands of blocks and walking the list is hard on the cache.

To do better, we need to store some kind of extra information.

We could use preambles and postambles as discussed earlier, but that would be a pity. The buddy allocator has such nice, even block sizes: 1 K, 2 K, 4 K, we really don't want to mess that up with preambles and postambles.

But what we can do is to store a bit for each block, telling us if that block is free or allocated. We can use the block index as described above to access this bitfield. This will require a total of (1 << num_level) bits. Since the total size of the tree is (1 << num_levels) * leaf_size bytes, we can see that the overhead of storing these extra bits is 1 / 8 / leaf_size. With a decent leaf_size of say 128 (small allocations are better handled by a slab alloactor anyway) the overhead of this table is just 0.1 %. Not too shabby.

But in fact we can do even better. We can get by with just half a bit per block. That sounds impossible, but here is how:

For each pair of buddies A and B we store the single bit is_A_free XOR is_B_free. We can easily maintain the state of that bit by flipping it each time one of the buddies is freed or allocated.

When we consider making a merge we know that one of buddies is free, because it is only when a block has just been freed that we consider a merge. This means we can find out the state of the other block from the XORed bit. If it is 0, then both blocks are free. If it is 1 then it is just our block that is free.

So we can get by with just one bit for every pair of blocks, that's half a bit per block, or an overhead of just 1 / 16 / leaf_size.

At this point, careful readers may note that I have been cheating.

All this time I have assumed that we know the level n of the block that we are freeing. Otherwise we cannot compute the address of the buddy or its index in the node tree.

But to know the level n of ptr we must know the size of its allocated block. So this only really works if the user passes the size of the allocation when freeing the block. I.e, the free(void *, size_t) interface that we discussed earlier.

If we want to support the simpler and more common API free(void *p), the alloator needs to somehow store the size of each alloation.

Again, using a preamble is possible, but we don't really want to.

We could use an array, indexed by (p - _buffer_start) / leaf_size to store the sizes. Note that this is not the same as the block index. We can't use the block index, since we don't know the level. Instead this is an index of size 1 << (num_levels - 1) with one entry for each possible pointer that the buddy allocator can return.

We don't have to store the full size (32 bits) in the index, just the level. That's 5 bits assuming that MAX_LEVELS = 32. Since the number of entries in this index is half that of the block index this ammounts to 2.5 bits per block.

But we can do even better.

Instead of storing the size explicitly, we can use the block index and store a single bit to keep track of whether the block at that level has been split or not.

Since the leaf blocks can't be split, we only need 1 << (num_levels - 1) entries in the split index. This means that the cost of the split index is the same as for the merge index, 0.5 bits per block. It's a bit amazing that we can do all this with a total overhead of just 1 bit per block.

The prize of the memory savings is that we now have to loop a bit to find the allocated size. But num_levels is usually small (in any case <= 32) and since we only have 1 bit per entry the cache usage is pretty good. Furthermore, with this approach it is easy to offer both a free(void *) and a free(void *, size_t) interface. The latter can be used by more sophisticated callers to avoid the loop to calculate the block size.

Memory arrangements

Where do we store this 1 bit of metadata per block? We could use a separate buffer, but it is not that elegant. It would mean that our allocator would have to request two buffers from the system, one for the data and one for the metadata.

Instead, let's just put the metadata in the buffer itself, at the beginning where we can easily find it. We mark the blocks used to store the metadata as allocated so that they won't be used by other allocations:

Note that when allocating the metadata we can be a bit sneaky and not round up the allocation to the nearest power of two. Instead we just take as many leaf blocks as we need. That is because when we allocate the metadata we know that the allocator is completely empty, so we are guaranteed to be able to allocate adjacent leaf blocks. In the example above we only have to use 5 * 16 = 80 K for the metadata instead of the 128 K we would have used if we rounded up.

(The size of the metadata has been greatly exaggerated in the illustration above to show this effect. In reality, since the tree in the illustration has only six levels, the metadata is just 1 * (1 << 6) = 64 bits, that's 8 bytes, not 80 K.)

Note that you have to be a bit careful when allocating the metadata in this way, because you are allocating memory for the metadata that your memory allocation functions depend on. That's a chicken-and-egg problem. Either you have to write a special allocation routine for this initial allocation, or be very careful with how you write your allocation code so that this case is handled gracefully.

We can use the same technique to handle another pesky issue.

It's a bit irritating that the size of the buddy allocator has to be a power of two of the leaf size. Say that we happen to have 400 K of memory lying around somewhere. It would be really nice if we could use all of that memory instead of just the first 256 K.

We can do that using the same trick. For our 400 K, we can just create a 512 K buddy allocator and mark the first 144 K of it as "already allocated". We also offset the start of the buffer, so that the start of the usable memory coincides with the start of the buffer in memory. Like this:

Again, this requires some care when writing the code that does the initial allocation so that it doesn't write into the unallocated memory and causes an access violation.

The buddy allocator and growing buffers

As mentioned in the previous post, the buddy allocator is perfect for allocating dynamically growing buffers, because what we want there is allocations that progressively double in size, which is exactly what the different levels of the buddy allocator offer.

When a buffer needs to grow, we just allocate the next level from the buddy allocator and set the capacity of the buffer so that it fills up all that space.

Note that this completely avoids the internal fragmentation issue, which is otherwise one of the biggest problems with the buddy allocator. There will be no internal fragmentation because the dynamic buffers will make use of all the available space.

Monday, June 22, 2015

Last week's post ended with a puzzle: How can we allocate an array of dynamically growing and shrinking things in an efficient and data-oriented way? I.e. using contiguous memory buffers and as few allocations as possible.

The example in that post was kind of complicated, and I don't want to get lost in the details, so let's look at a simpler version of the same fundamental problem.

Suppose we want to create a TagComponent that allows us to store a number of unsignedtags for an entity.

These tags will be hashes of strings such as "player", "enemy", "consumable", "container", etc and the TagComponent will have some sort of efficient lookup structure that allows us to quickly find all entities with a particular tag.

But to keep things simple, let's ignore that for now. For now we will just consider how to store these lists of tags for all our entities. I.e. we want to find an alternative to:

std::vector< std::vector < unsigned> > data;

that doesn't store every list in a separate memory allocation.

Fixed size

If we can get away with it, we can get rid of the "array of arrays" by setting a hard limit on the number of items we can store per entity. In that case, the data structure becomes simply:

Now all the data is contained in a single buffer, the data buffer for Array<Tags>.

Sometimes the hard limit is inherent in the problem itself. For example, in a 2D grid a cell can have at most four neighbors.

Sometimes the limit is a widely accepted compromise between cost and quality. For example, when skinning meshes it is usually consider ok to limit the number of bone influences per vertex to four.

Sometimes there is no sensible limit inherent to the problem itself, but for the particular project that we are working on we can agree to a limit and then design the game with that limit in mind. For example we may know that there will never be more than two players, never more than three lights affecting an object, never more than four tags needed for an entity, etc.

This of course requires that we are writing, or at least configuring, the engine for a particular project. If we are writing a general engine to be used for a lot of games it is hard to set such limits without artificially constraining what those games will be able to do.

Also, since the fixed size must be set to the maximum array size, every entity that uses fewer entries than the maximum will waste some space. If we need a high maximum this can be a significant problem and it might make sense to go with a dynamic solution even though there is an upper limit.

So while the fixed size approach can be good in some circumstances, it doesn't work in every situation.

Linked list

Instead of using arrays, we can put the tags for a particular entity in a linked list:

struct Tag
{
unsigned tag;
Tag *next;
};
Array<Tag *> data;

Using a linked list may seem like a very bad choice at first. A linked list can give us a cache miss for every next pointer we follow. This would give us even worse performance than we would get with vector < vector < unsigned > >.

But the nodes in the linked list do not necessarily have to be allocated individually on the heap. We can do something similar to what we did in the last post: allocate the nodes in a buffer and refer to them using offsets rather than pointers:

struct Node
{
unsigned tag;
unsigned next;
};
Array<Node> nodes;

With this approach we only have a single allocation -- the buffer for the array that contains all the tag nodes -- and we can follow the indexes in the next field to walk the list.

Side note: Previously I have always used UINT_MAX to mark an nil value for an unsigned. So in the struct above, I would have used UINT_MAX for the next value to indicate the end of the list. But recently, I've switched to using 0 instead. I think it is nice to be able to memset() a buffer to 0 to reset all values. I think it is nice that I can just use if (next) to check if the value is valid. It is also nice that the invalid value will continue to be 0 even if I later decide to change the type to int or uint_16t. It does mean that I can't use the nodes[0] entry, since that is reserved for the nil value, but I think the increased simplicity is worth it.

Using a single buffer rather than separate allocations gives us much better cache locality, but the next references can still jump around randomly in that buffer. So we can still get cache misses. If the buffer is large, this can be as bad as using freely allocated nodes.

Another thing to note is that we are wasting a significant amount of memory. Only half of the memory is used for storing tags, the rest of it is wasted on the next pointers.

We can try to address both these problems by making the nodes a little bigger:

This is just as before, except we have more than one tag per node. This gives better cache performance because we can now process eight tags at a time before we have to follow a next pointer and jump to a different memory location. Memory use can also be better. If the nodes are full, we are using 80 % of the memory for actual tags, rather than 50 % as we had before.

However, if the nodes are not full we could be wasting even more memory than before. If entities have three tags on average, then we are only using 30 % of the memory to store tags.

We can balance cache performance and memory use by changing MAX_TAGS_PER_NODE. Increasing it gives better cache coherence, because we can process more tags before we need to jump to a different memory location. However, increasing it also means more wasted memory. It is probably good to set the size so that "most entities" fit into a single node, but a few special ones (players and enemies maybe) need more.

One interesting thing to note about the cache misses is that we can get rid of them by sorting the nodes. If we sort them so that the nodes in the same next chain always appear directly after one another in the array, then walking the list will access the data linearly in memory, just as if we were accessing an array:

Note that a complete ordering is not required, it is enough if the linked nodes end up together. Single nodes, such as the B node above could go anywhere.

Since these are dynamic lists where items will be added and removed all the time, we can't really do a full O(n log n) sort every time something changes. That would be too expensive. But we could sort the list "incrementally". Every time the list is accessed, we do a little bit of sorting work. As long as the rate of mutation is low compared to the rate of access, which you would expect in most circumstances, our sorting should be able to keep up with the mutations and keep the list "mostly sorted".

You would need a sorting algorithm that can be run incrementally and that works well with already sorted data. Two-way bubble sort perhaps? I haven't thought too deeply about this, because I haven't implemented this method in practice.

Custom memory allocator

Another option is to write a custom memory allocator to divide the bigger buffer up into smaller parts for memory allocations.

You might think that this is a much too complex solution, but a custom memory allocator doesn't necessarily need to be a complex thing. In fact, both the fixed size and linked list approaches described above could be said to be using a very simple kind of custom memory allocator: one that just allocates fixed blocks from an array. Such an allocator does not need many lines of code.

Another criticism against this approach is that if we are writing our own custom memory allocator, aren't we just duplicating the work that malloc() or new already does? What's the point of first complaining a lot about how problematic the use of malloc() can be and then go on to write our very own (and probably worse) implementation of malloc()?

The answer is that malloc() is a generic allocator that has to do well in a lot of different situations. If we have more detailed knowledge of how the allocator is used, we can write an allocator that is both simpler and performs better. For example, as seen above, when we know the allocations are fixed size we can make a very fast and simple allocator. System software typically uses such allocators (check out the slab allocator for instance) rather than relying on malloc().

In addition, we also get the benefit that I talked about in the previous post. Having all of a system's allocations in a single place (rather than mixed up with all other malloc() allocations) makes it much easier to reason about them and optimize them.

As I said above, the key to making something better than malloc() is to make use of the specific knowledge we have about the allocation patterns of our system. So what is special about our vector < vector < unsigned > > case?

1. There are no external pointers to the data.

All the pointers are managed by the TagComponent itself and never visible outside that component.

This means that we can "move around" memory blocks as we like, as long as the TagComponent keeps track of and updates its data structures with the new locations. So we don't have to worry (that much) about fragmentation, because when we need to, we can always move things around in order to defrag the memory.

I'm sure you can build something interesting based on that, but I actually want to explore another property:

2. Memory use always grows by a factor of two.

If you look at the implementation of std::vector or a similar class (since STL code tends to be pretty unreadable) you will see that the memory allocated always grows by a factor of two. (Some implementations may use 1.5 or something else, but usually it is 2. The exact figure doesn't matter that much.)

The vector class keeps track of two counters:

size which stores the number of items in the vector and

capacity which stores how many items the vector has room for, i.e. how much memory has been allocated.

If you try to push an item when size == capacity, more memory is needed. So what typically happens is that the vector allocates twice as much memory as was previously used (capacity *= 2) and then you can continue to push items.

This post is already getting pretty long, but if you haven't thought about it before you may wonder why the vector grows like this. Why doesn't it grow by one item at a time, or perhaps 16 items at a time.

The reason is that we want push_back() to be a cheap operation -- O(1) using computational complexity notation. When we reallocate the vector buffer, we have to move all the existing elements from the old place to the new place. This will take O(n) time. Here, n is the number of elements in the vector.

If we allocate one item at a time, then we need to allocate every time we push and since re-allocate takes O(n) that means push will also take O(n). Not good.

If we allocate 16 items at a time, then we need to allocate every 16th time we push, which means that push on average takes O(n)/16, which by the great laws of O(n) notation is still O(n). Oops!

But if we allocate 2*n items when we allocate, then we only need to reallocate after we have pushed n more items, which means that push on average takes O(n)/n. And O(n)/n is O(1), which is exactly what we wanted.

Note that it is just on average that push is O(1). Every n pushes, you will encounter a push that takes O(n) time. For this reason, push is said to run in amortized constant time. If you have really big vectors, that can cause an unacceptable hitch and in that case you may want to use something other than a vector to store the data.

Anyways, back to our regular programming.

The fact that our data (and indeed, any kind of dynamic data that uses the vector storage model) grows by powers of two is actually really interesting. Because it just so happens that there is an allocator that is very good at allocating blocks at sizes that are powers of two. It is called the buddy allocator and we will take a deeper look at it in the next post.

Friday, June 12, 2015

When sketching out a new low level system I always start with the data layout, because that's the most important part. And I do that with two main goals:

Make sure that memory is laid out and accessed linearly.

Minimize the number of individual memory allocations.

The importance of the first goal should be obvious to anybody who has been following the data-oriented design movement. Modern CPUs are often memory bound. Writing cache friendly code is paramount. Etc.

The advantages of the second goal are less obvious.

Too some extent, it goes hand-in-hand with the first. For the cache to work efficiently you need to avoid pointer chasing, which naturally leads to fewer allocations. But there are other advantages as well.

Data with fewer individual allocations puts less pressure on the memory system. It gives you less fragmentation, more control over memory usage, allocation patterns that are easier to profile and optimize, etc.

It also makes the data easier to move and copy, since it requires less pointer patching, which let's you do lots of cool tricks with it. In general, it's just neater.

For static data, such as resources loaded from disk, both these goals can be easily achieved, by using the blob method.

For dynamic data, things get trickier. If you just use STL as you are told to, you end up with scary stuff like std::map< std::string, std::vector<std::string> > where the cache and memory usage is all over the place.

So let's try and see how we can escape that monster's lair by looking at a component and improving it step-by-step until we get to something nicer.

Introducing the DataComponent

The DataComponent is a part of our entity system that is used to store small amounts of arbitrary dynamic data for an entity. For example, it can be used to store a character sheet:

As data format we use a restricted form of JSON where the only allowed values are:

booleans

floats

strings

objects

arrays of numbers

Note that arbitrary arrays are not allowed. Arrays can only be lists of numbers, i.e. [0 0.5 0.5 0.7]. If you want to have sets of other items, you need to use an object with unique values (possibly GUIDs) as the keys.

The reason for these restrictions is that we want to make sure that all operations on the data merge in simple and well-defined ways. This makes it easier to use the data in collaborative workflows.

Strings and float arrays are regarded as monolithic with respect to merging. This means that all data operations can be reduced to setting object keys, which makes them easy to reason about.

We assume that the amount of data stored for each entity will be small, but that lots of entities will have data and that it is important to keep memory use low and performance high.

The naive approach

Representing a tree structure that can include both strings and arrays is pretty complicated and if we just start to sketch the structure using our regular STL tools it quickly becomes pretty messy:

That doesn't look very fun at all. There are tons of allocations everywhere, both explicit (std::string *) and implicit (inside string, vector, and map). So let's start working.

1: Hash the keys

If we consider how we will use this data structure, we will always be setting and getting the values corresponding to certain keys. We don't have a pressing need to extract the key strings themselves. This means we can use hashed strings instead of the strings themselves for lookup.

Since we assume that the number of keys will be small, it is probably enough to use an unsigned rather than an uint64_t for the hash value:

std::map<unsigned, DataValue> *object;

Great start. That's a lot of allocations already gone and as a bonus we also got rid of a lot of string comparisons.

2: Flatten the structure

If we also sacrifice the ability of being able to enumerate all the (hashed) keys in a particular object we can go even further and flatten the entire structure.

Just as before we represent these keys as hashed values, so "stats.health" is represented by an unsigned containing the hashed value of "stats.health".

(Note: we won't actually hash the string "stats.health" directly, because that could lead to problems if some of our keys contained a "." character. Instead we hash each sub key separately and then hash them all together.)

With this approach, the user will still be able to look up a particular piece of data, such as stats.health, but she won't be able to enumerate all the keys and values under stats, since the hierarchy information is lost. But that's ok.

Getting rid of the tree structure allows us to store our data in a flat array:

Note that we no longer have a lookup hierarchy for the entries. To find a particular key we have to linearly search std::vector<Entry> for a match. Luckily, linear search is the best thing caches know and for reasonable sizes, the search will run faster than a std::map lookup.

We could sort the entries and do a binary search, but since the number of entries will be small, it is probably not even worth bothering.

3: Rewrite using a structure-of-arrays approach

One thing that does slow the search for a particular key down is all the additional data we need to load into the cache to perform the search. As we scan the vector for a particular key, the cache will also be filled by the type and value data.

We can solve this with the standard techinque of rewriting our "array of structures" as a "structure of arrays". I.e., we break out each field of the structure into its own array:

Now when searching for a particular key, we only need to load the keys table which means the cache only contains useful data.

4: Co-allocate the arrays

Not too shabby, but we still have three separate allocations for the three vectors. To get rid of that we have to stop using these fancy std::vectors and go back to the basics, and in terms of basics, a std::vector<unsigned> is really just:

Note that as a bonus, we can now share the capacity and the size fields between the vectors, since they all have the same values and change together.

The arrays are also all reallocated at the same time. We can make use of this and instead of doing three separate allocations, we just allocate one big buffer and lay them out sequentially in that buffer:

5: Get rid of STL

We still have these pointers to deal with in the values union:

std::string *s;
std::vector<float> *array;

Let's start by getting rid of those STL types. At this point they are no longer helping. STL is based around individual allocations which we don't want. We also have two levels of indirection there. First, the pointer to the std::vector, and then the pointer to the actual data inside the std::vector type.

There are two things we need to take care of here. First, we may run out of space in this buffer as we add more data. Second, the individual values we store may change size and no longer fit in their designated place.

The first issue is no big problem. If we run out of space, we can just allocate a bigger buffer as std::vector does.

The second issue is no biggie either. If an item needs to grow, we just allocate a new bigger slot for it from the buffer. This will leave a hole where the item was originally located:

These "holes" will cause fragmentation of the buffer and eventually make it run out of space. If this was a general purpose memory allocator that would be a serious issue that we would have to be seriously worried about.

But in our case it doesn't really matter. The number of items is small and we control the pointers to them, so we can easily move them around and defragment the buffer when necessary. But in fact, we don't even have to do that. If we get fragmentation we will eventually run out of space. This will reallocate the buffer which will also defragment it. This is enough for our purposes.

7: Slim down the Value struct

Since all our value data are now allocated in a single buffer we can use the buffer offset rather than the pointer to refer to the data. This saves us memory (from 64 bits to 32 bits) and as an added bonus, it also allows us to relocate the buffer without having to do pointer patching.

For these small snippets of data 64 K is plenty, which means we can fit both the offset and the size in 32 bits:

struct Data {
uint16_t offset;
uint16_t size;
};

(If you need more memory, you could use a full uint32_t for the offset and store the size in the buffer, together with the data.)

So know we have:

struct Value
{
union {
bool b;
float f;
Data data;
}
};

where the data field is used for both strings and arrays. This means our Value structure is now just 32 bits, half the size of the STL version.

8: Merge the final two allocations

Now we are down to just two buffers: one for the header (keys and types), and one for the values (strings and arrays). Wouldn't it be great if we could merge them to use just a single allocation?

Now they both have some room to grow, but we are not utilizing the free space to its maximum potential. The Values section may run out of free space before the Header, forcing us to reallocate the buffer even though we actually have some free space left in it. That's a bummer.

Instead of allocating the values bottom up, we allocate them top down in the buffer. Now the headers and the values can share the same free space and we don't have to reallocate the buffer until they meet in the middle. Nice!

Benefits of single buffer structures

Stripping everything down to a single buffer gives us many of the same benefits that the blob approach gives us for static data.

Since the data is stored in a single buffer and doesn't contain any pointers it is fully relocatable. We can move it around in memory as we please and make copies of it with a simple memcpy().

If we want to save the data to disk, as a static resource or as part of a saved game, we don't need any special serialization code. We can just write and read the data directly.

The pointer arithmetic that is needed to manipulate data in this way may seem complex at first, if you are not used to think about pointer operations on raw memory. But I would argue that once you get used to that, this kind of representation is in many ways simpler than the multi-level std::vector, std::map based approach that we started with. And in that simplicity lies a lot of power.

But wait, I have been cheating!

Yes, to be honest I have been cheating all this time.

I have ignored the old adage of data-oriented programming:

Where there is one, there are many.

All this time I have been talking about how oneDataComponent can store its data in a single buffer. But in our real system we will have manyDataComponents. We might have 1000 entities, each with a DataComponent that stores a single health float. Even if each component uses a contiguous buffer, that's still a lot of individual memory allocations.

Wouldn't it be nice if we could somehow bundle them together too?

But that represents a trickier challenge. We need to find a way to allocate a lot of individual objects in a single buffer in such a way that each individual object can grow and shrink independently.

Thursday, June 11, 2015

When I joined Bitsquid a month ago, someone mentioned they wanted to upgrade the DirectX SDK
to get some improvements, but that there was a dependency in the way.
I was foolish, and volunteered to investigate.
Over the past ten days or so I have untangled the whole mess, leading to a successful upgrade.
I now want to share my findings so the next unfortunate soul can save some time.

Step 1: Explore

First stop: MSDN's article.
I had heard that the DirectX SDK was now included in the Windows SDK,
but I wasn't sure what that covered.
This article
sums it up.
With a teammate, we went through the whole list, figuring out what we were and were not using.
In the end, the only problematic components were XInput, XAudio2, and D3DX9Mesh.
The bulk of the codebase had already been converted away from using D3DX, which was great!
However another thing needed clearing up.
Our minspec is still Windows 7.
How was that going to work?
Luckily, MSDN had the answer again.
This article
reveals that the Windows 8.X SDK is available on Windows 7.
This is covered in more details on this page and that page.

Step 2: Well let's just try then

I changed the paths in our project generation files to the Windows SDK.
I also added the June 2010 SDK, but only for XAudio2 and D3DX9Mesh
(more on XInput further down).
After fixing only a few compile errors, things seemed mostly fine...
until I got a runtime crash about ID3D11ShaderReflection. Huh?

Step 3: GUIDs and the magic #define

I had wrongly assumed that the link errors I had been seeing when changing the paths were caused by DX9, because I read too fast.
Linking with the old dxguid.lib made the errors go away, so I didn't think about it more.
However, a large part of DirectX relies on GUIDs, unique hardcoded identifiers.
When debugging, I noticed that IID_ID3D11ShaderReflection had the wrong value compared to the Windows SDK header,
which was causing the crash.
I went on a goose hunt for what was somehow changing this value,
and wasted a day to looking for a wrongly included file.
But by default, those GUIDs are extern variables, and will get their values from lib files.
And I was linking with an old one.
Mystery solved!
I removed dxguid.lib from the linker, but that of course caused the GUIDs to be undefined.
The solution for that is to #define INITGUID before including windows.h.
Thanks to the Ogre3D forums for pointing me towards
the relevant support page,
since they encountered the same issue before.
At this point everything was fine, except that it was failing on the build machines.

Step 4: d3dcompiler

The first error had been around for a long time.
We had so far, unknowingly, relied on the d3dcompiler DLL being present in System32!
Since System32 is part of the default DLL search path, this is easy to overlook,
especially when the DirectX SDK is a required install anyway.
We were now relying on a more recent version,
supposed to be included in the Windows SDK.
Yet still it was failing...
because we did not have a proper installation step.
I tweaked the project files again,
adding a copy step for that DLL.
CI, however, was still failing.

Step 5: XInput

XInput comes in several versions in the Windows SDK.
1.4 is the most recent one as I'm writing this, and is Windows 8-only.
To use XInput on Windows 7, you need to use version 9.1.0.
For that, ensure that
the magic _WIN32_WINNT #define
is set to the proper value (see further up on the page).
You also need to explicitly link with XInput9_1_0.lib and not XInput.lib,
or Windows 7 will get a runtime crash trying to fetch XInput1_4.dll, which doesn't exist on Windows 7.
In my case this was breaking the automated tests on a Windows 7 machine, but was completely fine on my Windows 8 workstation.

Step 6: Profit?

As far as I can tell this should be the end of it,
but the rendering team has yet to stress-test it.
We'll see what breaks as they poke around :)
Hopefully this can save you some time if you're doing a similar upgrade,
or convince you to give it a try if you've been holding back.

Tuesday, March 10, 2015

I've written before about multithreading gameplay, but since I didn't really come to any conclusion I think it is time to revisit the topic.

As the number of processors and cores in consumer hardware keeps increasing, Amdahl's law tells us that any single-threaded part of the code will have a bigger and bigger effect on the performance. (As long as you are not memory bound, so keep those caches happy, ok?)

So even if single-threaded gameplay isn't a problem today it soon will be. At some point the elephant will outgrow the living room.

There are (at least) three problems with multithreading gameplay code as we know it:

Writing and reading multithreaded code is much harder than single threaded code, especially for "messy" stuff such as gameplay code.

Gameplay code tend to be more sprawling than engine code, touching all kinds of systems, which means that the standard optimisation technique of finding the hotspots and multithreading them is less likely to work.

Lua, which we use as our scripting language, does not have built-in multithreading support. This might not be a problem for you. On the other hand, if you expect your gameplay programmers to write multithreaded C++ code, you probably have other problems.

If we start with the first point, I don't think it is reasonable to expect gameplay programmers to write safe and efficient multithreaded code using the standard techniques of mutexes, locks, queues, semaphores, atomic operations, etc. Especially not when writing messy gameplay code where requirements often change and experimentation and quick iterations are important. If anyone has experience of this, I'd like to know.

So I think the primary goal is to find a multithreading model that is easy to work with.

You can read up on the actor model if you are not familiar with it. The basic idea is that processing nodes only touch their own local memory and communicate with other nodes through asynchronous message passing. Since there is no shared memory, explicit synchronization primitives are not necessary.

Luckily for us, using this model also takes care of issue #3. If the nodes don't need to share memory, we can let them live in separate Lua VMs that communicate through message passing. Typically we would spawn a separate Lua VM for each processing core in our system.

As a completely contrived example, suppose we had a bunch of numbers that we needed to factor. We could then split them up by our number of VMs and send each VM a message ['factor', n]. Each VM would compute its factors in parallel and send the result back to the main thread.

All fine and dandy. This would work well and give us a good performance boost. But of course, this contrived example is absolutely nothing like real gameplay code.

In real gameplay code, we don't have pockets of completely isolated but computationally intensive code that lends itself to easy parallelization. (If we do, that code should probably be moved out of the gameplay code and into the engine.)

Instead, most gameplay code interacts with the world and does things like moving a unit a little bit, casting a physics ray, adjusting the position of another unit, etc. Unless we can paralelize that kind of messy code, we won't have gained very much.

The problem here is that your engine is probably a big ball of mutable state. If it isn't, congratulations to you I guess. You have figured out something we others haven't and I look forward to your next GDC talk. But for the sake of argument, let's assume that it is.

Any interaction with this mutable state (say a script calling PhysicsWorld.raycast()) is a potential for threading issues.

We could try to make the entire script API thread-safe. For example, we could put a critical section in each API call. But that is unlikely to make anyone happy. With so many critical sections, we will probably loose whatever performance we hoped to gain from multithreading.

So we seem to be at an impasse. Gameplay code will need to interact frequently with a lot of engine APIs and making those APIs thread-safe will likely kill performance.

I've been stuck here for a while. To be honest, a couple of years. (Hey, it's not like I haven't had other stuff to do.) But in the general creative atmosphere of GDC and a discussion with some colleagues and the nice people at Pixeldiet, something shook loose.

Instead of synchronizing at each function call, what if we did it at the level of the API:

In this model, the Lua VM for the threads start with a blank slate. There are no public APIs (except for safe, functional APIs that don't touch mutable state). To do anything with the engine, you must obtain a lock for a particular API.

You could argue that this is nothing than another shade of the complicated explicit multithreading model that we wanted to get rid of to begin with, but I do think there is something different here.

First, since the Lua part of the code will use the Actor model, we have eliminated all the problems with synchronizing the Lua state.

Second, since you can't use an API before locking in it there is a safety mechanism that prevents you from accidentally using multithreading the wrong way.

In this model, the main Lua thread (yes there would still be a main Lua thread) would spawn of a number of jobs for performing a computation intensive task, such as updating a number of units. The main Lua thread would be suspended while the task was performed. (We can only avoid suspension if the main thread also uses the locking mechanism to access the APIs, but that seems too cumbersome.)

The worker Lua threads lock the APIs they need to perform their tasks, and when they have completed the control returns back to the main Lua thread that can gather the results.

Since Lua supports coroutines (aka green threads) the lock_api() function does not have to lock the thread if an API is locked by someone else, we can just switch to a different coroutine.

This model is certainly not perfect. It's a bit annoying that we still have to have a main Lua thread that is "special". It's also a pity that the main Lua thread can't continue to run while the jobs are being executed, since that could allow for greater parallelism.

And it is certainly possible for the gameplay programmer to mess up. For example, it is easy to create a deadlock by requiring the same APIs in different orders in different threads.

But still, to me this seems like the best solution, and something that would actually be worthwhile for the gameplay programmers (unlike some of my previous ideas). So I think I will start tinkering with it and see if it will fly.