Monday, June 22, 2015

Last week's post ended with a puzzle: How can we allocate an array of dynamically growing and shrinking things in an efficient and data-oriented way? I.e. using contiguous memory buffers and as few allocations as possible.

The example in that post was kind of complicated, and I don't want to get lost in the details, so let's look at a simpler version of the same fundamental problem.

Suppose we want to create a TagComponent that allows us to store a number of unsignedtags for an entity.

These tags will be hashes of strings such as "player", "enemy", "consumable", "container", etc and the TagComponent will have some sort of efficient lookup structure that allows us to quickly find all entities with a particular tag.

But to keep things simple, let's ignore that for now. For now we will just consider how to store these lists of tags for all our entities. I.e. we want to find an alternative to:

std::vector< std::vector < unsigned> > data;

that doesn't store every list in a separate memory allocation.

Fixed size

If we can get away with it, we can get rid of the "array of arrays" by setting a hard limit on the number of items we can store per entity. In that case, the data structure becomes simply:

Now all the data is contained in a single buffer, the data buffer for Array<Tags>.

Sometimes the hard limit is inherent in the problem itself. For example, in a 2D grid a cell can have at most four neighbors.

Sometimes the limit is a widely accepted compromise between cost and quality. For example, when skinning meshes it is usually consider ok to limit the number of bone influences per vertex to four.

Sometimes there is no sensible limit inherent to the problem itself, but for the particular project that we are working on we can agree to a limit and then design the game with that limit in mind. For example we may know that there will never be more than two players, never more than three lights affecting an object, never more than four tags needed for an entity, etc.

This of course requires that we are writing, or at least configuring, the engine for a particular project. If we are writing a general engine to be used for a lot of games it is hard to set such limits without artificially constraining what those games will be able to do.

Also, since the fixed size must be set to the maximum array size, every entity that uses fewer entries than the maximum will waste some space. If we need a high maximum this can be a significant problem and it might make sense to go with a dynamic solution even though there is an upper limit.

So while the fixed size approach can be good in some circumstances, it doesn't work in every situation.

Linked list

Instead of using arrays, we can put the tags for a particular entity in a linked list:

struct Tag
{
unsigned tag;
Tag *next;
};
Array<Tag *> data;

Using a linked list may seem like a very bad choice at first. A linked list can give us a cache miss for every next pointer we follow. This would give us even worse performance than we would get with vector < vector < unsigned > >.

But the nodes in the linked list do not necessarily have to be allocated individually on the heap. We can do something similar to what we did in the last post: allocate the nodes in a buffer and refer to them using offsets rather than pointers:

struct Node
{
unsigned tag;
unsigned next;
};
Array<Node> nodes;

With this approach we only have a single allocation -- the buffer for the array that contains all the tag nodes -- and we can follow the indexes in the next field to walk the list.

Side note: Previously I have always used UINT_MAX to mark an nil value for an unsigned. So in the struct above, I would have used UINT_MAX for the next value to indicate the end of the list. But recently, I've switched to using 0 instead. I think it is nice to be able to memset() a buffer to 0 to reset all values. I think it is nice that I can just use if (next) to check if the value is valid. It is also nice that the invalid value will continue to be 0 even if I later decide to change the type to int or uint_16t. It does mean that I can't use the nodes[0] entry, since that is reserved for the nil value, but I think the increased simplicity is worth it.

Using a single buffer rather than separate allocations gives us much better cache locality, but the next references can still jump around randomly in that buffer. So we can still get cache misses. If the buffer is large, this can be as bad as using freely allocated nodes.

Another thing to note is that we are wasting a significant amount of memory. Only half of the memory is used for storing tags, the rest of it is wasted on the next pointers.

We can try to address both these problems by making the nodes a little bigger:

This is just as before, except we have more than one tag per node. This gives better cache performance because we can now process eight tags at a time before we have to follow a next pointer and jump to a different memory location. Memory use can also be better. If the nodes are full, we are using 80 % of the memory for actual tags, rather than 50 % as we had before.

However, if the nodes are not full we could be wasting even more memory than before. If entities have three tags on average, then we are only using 30 % of the memory to store tags.

We can balance cache performance and memory use by changing MAX_TAGS_PER_NODE. Increasing it gives better cache coherence, because we can process more tags before we need to jump to a different memory location. However, increasing it also means more wasted memory. It is probably good to set the size so that "most entities" fit into a single node, but a few special ones (players and enemies maybe) need more.

One interesting thing to note about the cache misses is that we can get rid of them by sorting the nodes. If we sort them so that the nodes in the same next chain always appear directly after one another in the array, then walking the list will access the data linearly in memory, just as if we were accessing an array:

Note that a complete ordering is not required, it is enough if the linked nodes end up together. Single nodes, such as the B node above could go anywhere.

Since these are dynamic lists where items will be added and removed all the time, we can't really do a full O(n log n) sort every time something changes. That would be too expensive. But we could sort the list "incrementally". Every time the list is accessed, we do a little bit of sorting work. As long as the rate of mutation is low compared to the rate of access, which you would expect in most circumstances, our sorting should be able to keep up with the mutations and keep the list "mostly sorted".

You would need a sorting algorithm that can be run incrementally and that works well with already sorted data. Two-way bubble sort perhaps? I haven't thought too deeply about this, because I haven't implemented this method in practice.

Custom memory allocator

Another option is to write a custom memory allocator to divide the bigger buffer up into smaller parts for memory allocations.

You might think that this is a much too complex solution, but a custom memory allocator doesn't necessarily need to be a complex thing. In fact, both the fixed size and linked list approaches described above could be said to be using a very simple kind of custom memory allocator: one that just allocates fixed blocks from an array. Such an allocator does not need many lines of code.

Another criticism against this approach is that if we are writing our own custom memory allocator, aren't we just duplicating the work that malloc() or new already does? What's the point of first complaining a lot about how problematic the use of malloc() can be and then go on to write our very own (and probably worse) implementation of malloc()?

The answer is that malloc() is a generic allocator that has to do well in a lot of different situations. If we have more detailed knowledge of how the allocator is used, we can write an allocator that is both simpler and performs better. For example, as seen above, when we know the allocations are fixed size we can make a very fast and simple allocator. System software typically uses such allocators (check out the slab allocator for instance) rather than relying on malloc().

In addition, we also get the benefit that I talked about in the previous post. Having all of a system's allocations in a single place (rather than mixed up with all other malloc() allocations) makes it much easier to reason about them and optimize them.

As I said above, the key to making something better than malloc() is to make use of the specific knowledge we have about the allocation patterns of our system. So what is special about our vector < vector < unsigned > > case?

1. There are no external pointers to the data.

All the pointers are managed by the TagComponent itself and never visible outside that component.

This means that we can "move around" memory blocks as we like, as long as the TagComponent keeps track of and updates its data structures with the new locations. So we don't have to worry (that much) about fragmentation, because when we need to, we can always move things around in order to defrag the memory.

I'm sure you can build something interesting based on that, but I actually want to explore another property:

2. Memory use always grows by a factor of two.

If you look at the implementation of std::vector or a similar class (since STL code tends to be pretty unreadable) you will see that the memory allocated always grows by a factor of two. (Some implementations may use 1.5 or something else, but usually it is 2. The exact figure doesn't matter that much.)

The vector class keeps track of two counters:

size which stores the number of items in the vector and

capacity which stores how many items the vector has room for, i.e. how much memory has been allocated.

If you try to push an item when size == capacity, more memory is needed. So what typically happens is that the vector allocates twice as much memory as was previously used (capacity *= 2) and then you can continue to push items.

This post is already getting pretty long, but if you haven't thought about it before you may wonder why the vector grows like this. Why doesn't it grow by one item at a time, or perhaps 16 items at a time.

The reason is that we want push_back() to be a cheap operation -- O(1) using computational complexity notation. When we reallocate the vector buffer, we have to move all the existing elements from the old place to the new place. This will take O(n) time. Here, n is the number of elements in the vector.

If we allocate one item at a time, then we need to allocate every time we push and since re-allocate takes O(n) that means push will also take O(n). Not good.

If we allocate 16 items at a time, then we need to allocate every 16th time we push, which means that push on average takes O(n)/16, which by the great laws of O(n) notation is still O(n). Oops!

But if we allocate 2*n items when we allocate, then we only need to reallocate after we have pushed n more items, which means that push on average takes O(n)/n. And O(n)/n is O(1), which is exactly what we wanted.

Note that it is just on average that push is O(1). Every n pushes, you will encounter a push that takes O(n) time. For this reason, push is said to run in amortized constant time. If you have really big vectors, that can cause an unacceptable hitch and in that case you may want to use something other than a vector to store the data.

Anyways, back to our regular programming.

The fact that our data (and indeed, any kind of dynamic data that uses the vector storage model) grows by powers of two is actually really interesting. Because it just so happens that there is an allocator that is very good at allocating blocks at sizes that are powers of two. It is called the buddy allocator and we will take a deeper look at it in the next post.

Friday, June 12, 2015

When sketching out a new low level system I always start with the data layout, because that's the most important part. And I do that with two main goals:

Make sure that memory is laid out and accessed linearly.

Minimize the number of individual memory allocations.

The importance of the first goal should be obvious to anybody who has been following the data-oriented design movement. Modern CPUs are often memory bound. Writing cache friendly code is paramount. Etc.

The advantages of the second goal are less obvious.

Too some extent, it goes hand-in-hand with the first. For the cache to work efficiently you need to avoid pointer chasing, which naturally leads to fewer allocations. But there are other advantages as well.

Data with fewer individual allocations puts less pressure on the memory system. It gives you less fragmentation, more control over memory usage, allocation patterns that are easier to profile and optimize, etc.

It also makes the data easier to move and copy, since it requires less pointer patching, which let's you do lots of cool tricks with it. In general, it's just neater.

For static data, such as resources loaded from disk, both these goals can be easily achieved, by using the blob method.

For dynamic data, things get trickier. If you just use STL as you are told to, you end up with scary stuff like std::map< std::string, std::vector<std::string> > where the cache and memory usage is all over the place.

So let's try and see how we can escape that monster's lair by looking at a component and improving it step-by-step until we get to something nicer.

Introducing the DataComponent

The DataComponent is a part of our entity system that is used to store small amounts of arbitrary dynamic data for an entity. For example, it can be used to store a character sheet:

As data format we use a restricted form of JSON where the only allowed values are:

booleans

floats

strings

objects

arrays of numbers

Note that arbitrary arrays are not allowed. Arrays can only be lists of numbers, i.e. [0 0.5 0.5 0.7]. If you want to have sets of other items, you need to use an object with unique values (possibly GUIDs) as the keys.

The reason for these restrictions is that we want to make sure that all operations on the data merge in simple and well-defined ways. This makes it easier to use the data in collaborative workflows.

Strings and float arrays are regarded as monolithic with respect to merging. This means that all data operations can be reduced to setting object keys, which makes them easy to reason about.

We assume that the amount of data stored for each entity will be small, but that lots of entities will have data and that it is important to keep memory use low and performance high.

The naive approach

Representing a tree structure that can include both strings and arrays is pretty complicated and if we just start to sketch the structure using our regular STL tools it quickly becomes pretty messy:

That doesn't look very fun at all. There are tons of allocations everywhere, both explicit (std::string *) and implicit (inside string, vector, and map). So let's start working.

1: Hash the keys

If we consider how we will use this data structure, we will always be setting and getting the values corresponding to certain keys. We don't have a pressing need to extract the key strings themselves. This means we can use hashed strings instead of the strings themselves for lookup.

Since we assume that the number of keys will be small, it is probably enough to use an unsigned rather than an uint64_t for the hash value:

std::map<unsigned, DataValue> *object;

Great start. That's a lot of allocations already gone and as a bonus we also got rid of a lot of string comparisons.

2: Flatten the structure

If we also sacrifice the ability of being able to enumerate all the (hashed) keys in a particular object we can go even further and flatten the entire structure.

Just as before we represent these keys as hashed values, so "stats.health" is represented by an unsigned containing the hashed value of "stats.health".

(Note: we won't actually hash the string "stats.health" directly, because that could lead to problems if some of our keys contained a "." character. Instead we hash each sub key separately and then hash them all together.)

With this approach, the user will still be able to look up a particular piece of data, such as stats.health, but she won't be able to enumerate all the keys and values under stats, since the hierarchy information is lost. But that's ok.

Getting rid of the tree structure allows us to store our data in a flat array:

Note that we no longer have a lookup hierarchy for the entries. To find a particular key we have to linearly search std::vector<Entry> for a match. Luckily, linear search is the best thing caches know and for reasonable sizes, the search will run faster than a std::map lookup.

We could sort the entries and do a binary search, but since the number of entries will be small, it is probably not even worth bothering.

3: Rewrite using a structure-of-arrays approach

One thing that does slow the search for a particular key down is all the additional data we need to load into the cache to perform the search. As we scan the vector for a particular key, the cache will also be filled by the type and value data.

We can solve this with the standard techinque of rewriting our "array of structures" as a "structure of arrays". I.e., we break out each field of the structure into its own array:

Now when searching for a particular key, we only need to load the keys table which means the cache only contains useful data.

4: Co-allocate the arrays

Not too shabby, but we still have three separate allocations for the three vectors. To get rid of that we have to stop using these fancy std::vectors and go back to the basics, and in terms of basics, a std::vector<unsigned> is really just:

Note that as a bonus, we can now share the capacity and the size fields between the vectors, since they all have the same values and change together.

The arrays are also all reallocated at the same time. We can make use of this and instead of doing three separate allocations, we just allocate one big buffer and lay them out sequentially in that buffer:

5: Get rid of STL

We still have these pointers to deal with in the values union:

std::string *s;
std::vector<float> *array;

Let's start by getting rid of those STL types. At this point they are no longer helping. STL is based around individual allocations which we don't want. We also have two levels of indirection there. First, the pointer to the std::vector, and then the pointer to the actual data inside the std::vector type.

There are two things we need to take care of here. First, we may run out of space in this buffer as we add more data. Second, the individual values we store may change size and no longer fit in their designated place.

The first issue is no big problem. If we run out of space, we can just allocate a bigger buffer as std::vector does.

The second issue is no biggie either. If an item needs to grow, we just allocate a new bigger slot for it from the buffer. This will leave a hole where the item was originally located:

These "holes" will cause fragmentation of the buffer and eventually make it run out of space. If this was a general purpose memory allocator that would be a serious issue that we would have to be seriously worried about.

But in our case it doesn't really matter. The number of items is small and we control the pointers to them, so we can easily move them around and defragment the buffer when necessary. But in fact, we don't even have to do that. If we get fragmentation we will eventually run out of space. This will reallocate the buffer which will also defragment it. This is enough for our purposes.

7: Slim down the Value struct

Since all our value data are now allocated in a single buffer we can use the buffer offset rather than the pointer to refer to the data. This saves us memory (from 64 bits to 32 bits) and as an added bonus, it also allows us to relocate the buffer without having to do pointer patching.

For these small snippets of data 64 K is plenty, which means we can fit both the offset and the size in 32 bits:

struct Data {
uint16_t offset;
uint16_t size;
};

(If you need more memory, you could use a full uint32_t for the offset and store the size in the buffer, together with the data.)

So know we have:

struct Value
{
union {
bool b;
float f;
Data data;
}
};

where the data field is used for both strings and arrays. This means our Value structure is now just 32 bits, half the size of the STL version.

8: Merge the final two allocations

Now we are down to just two buffers: one for the header (keys and types), and one for the values (strings and arrays). Wouldn't it be great if we could merge them to use just a single allocation?

Now they both have some room to grow, but we are not utilizing the free space to its maximum potential. The Values section may run out of free space before the Header, forcing us to reallocate the buffer even though we actually have some free space left in it. That's a bummer.

Instead of allocating the values bottom up, we allocate them top down in the buffer. Now the headers and the values can share the same free space and we don't have to reallocate the buffer until they meet in the middle. Nice!

Benefits of single buffer structures

Stripping everything down to a single buffer gives us many of the same benefits that the blob approach gives us for static data.

Since the data is stored in a single buffer and doesn't contain any pointers it is fully relocatable. We can move it around in memory as we please and make copies of it with a simple memcpy().

If we want to save the data to disk, as a static resource or as part of a saved game, we don't need any special serialization code. We can just write and read the data directly.

The pointer arithmetic that is needed to manipulate data in this way may seem complex at first, if you are not used to think about pointer operations on raw memory. But I would argue that once you get used to that, this kind of representation is in many ways simpler than the multi-level std::vector, std::map based approach that we started with. And in that simplicity lies a lot of power.

But wait, I have been cheating!

Yes, to be honest I have been cheating all this time.

I have ignored the old adage of data-oriented programming:

Where there is one, there are many.

All this time I have been talking about how oneDataComponent can store its data in a single buffer. But in our real system we will have manyDataComponents. We might have 1000 entities, each with a DataComponent that stores a single health float. Even if each component uses a contiguous buffer, that's still a lot of individual memory allocations.

Wouldn't it be nice if we could somehow bundle them together too?

But that represents a trickier challenge. We need to find a way to allocate a lot of individual objects in a single buffer in such a way that each individual object can grow and shrink independently.

Thursday, June 11, 2015

When I joined Bitsquid a month ago, someone mentioned they wanted to upgrade the DirectX SDK
to get some improvements, but that there was a dependency in the way.
I was foolish, and volunteered to investigate.
Over the past ten days or so I have untangled the whole mess, leading to a successful upgrade.
I now want to share my findings so the next unfortunate soul can save some time.

Step 1: Explore

First stop: MSDN's article.
I had heard that the DirectX SDK was now included in the Windows SDK,
but I wasn't sure what that covered.
This article
sums it up.
With a teammate, we went through the whole list, figuring out what we were and were not using.
In the end, the only problematic components were XInput, XAudio2, and D3DX9Mesh.
The bulk of the codebase had already been converted away from using D3DX, which was great!
However another thing needed clearing up.
Our minspec is still Windows 7.
How was that going to work?
Luckily, MSDN had the answer again.
This article
reveals that the Windows 8.X SDK is available on Windows 7.
This is covered in more details on this page and that page.

Step 2: Well let's just try then

I changed the paths in our project generation files to the Windows SDK.
I also added the June 2010 SDK, but only for XAudio2 and D3DX9Mesh
(more on XInput further down).
After fixing only a few compile errors, things seemed mostly fine...
until I got a runtime crash about ID3D11ShaderReflection. Huh?

Step 3: GUIDs and the magic #define

I had wrongly assumed that the link errors I had been seeing when changing the paths were caused by DX9, because I read too fast.
Linking with the old dxguid.lib made the errors go away, so I didn't think about it more.
However, a large part of DirectX relies on GUIDs, unique hardcoded identifiers.
When debugging, I noticed that IID_ID3D11ShaderReflection had the wrong value compared to the Windows SDK header,
which was causing the crash.
I went on a goose hunt for what was somehow changing this value,
and wasted a day to looking for a wrongly included file.
But by default, those GUIDs are extern variables, and will get their values from lib files.
And I was linking with an old one.
Mystery solved!
I removed dxguid.lib from the linker, but that of course caused the GUIDs to be undefined.
The solution for that is to #define INITGUID before including windows.h.
Thanks to the Ogre3D forums for pointing me towards
the relevant support page,
since they encountered the same issue before.
At this point everything was fine, except that it was failing on the build machines.

Step 4: d3dcompiler

The first error had been around for a long time.
We had so far, unknowingly, relied on the d3dcompiler DLL being present in System32!
Since System32 is part of the default DLL search path, this is easy to overlook,
especially when the DirectX SDK is a required install anyway.
We were now relying on a more recent version,
supposed to be included in the Windows SDK.
Yet still it was failing...
because we did not have a proper installation step.
I tweaked the project files again,
adding a copy step for that DLL.
CI, however, was still failing.

Step 5: XInput

XInput comes in several versions in the Windows SDK.
1.4 is the most recent one as I'm writing this, and is Windows 8-only.
To use XInput on Windows 7, you need to use version 9.1.0.
For that, ensure that
the magic _WIN32_WINNT #define
is set to the proper value (see further up on the page).
You also need to explicitly link with XInput9_1_0.lib and not XInput.lib,
or Windows 7 will get a runtime crash trying to fetch XInput1_4.dll, which doesn't exist on Windows 7.
In my case this was breaking the automated tests on a Windows 7 machine, but was completely fine on my Windows 8 workstation.

Step 6: Profit?

As far as I can tell this should be the end of it,
but the rendering team has yet to stress-test it.
We'll see what breaks as they poke around :)
Hopefully this can save you some time if you're doing a similar upgrade,
or convince you to give it a try if you've been holding back.