Tuesday, September 21, 2010

Custom Memory Allocation in C++

For console development, memory is a very precious resource. You want good locality of reference and as little fragmentation as possible. You also want to be able to track the amount of memory used by different subsystems and eliminate memory leaks. To do that, you want to write your own custom memory allocators. But the standard ways of doing that in C++ leave a lot to be desired.

You can override global new and replace it with something else. This way you can get some basic memory tracking, but you still have to use the same allocation strategy for all allocations, which is far from ideal. Some systems work better with memory pools. Some can use simple frame allocation (i.e., pointer bump allocation). You really want each system to be able to have its own custom allocators.

The other option in C++ is to override new on a per-class basis. This has always seemed kind of strange to me. Pretty much the only thing you can use it for is object pools. Global, per-class object pools. If you want one pool per thread, or one pool per streaming chunk -- you run into problems.

Then you have the STL solution, where containers are templated on their allocator, so containers that use different allocators have different types. It also has fun things such as rebind(). But the weirdest thing is that all instances of the allocator class must be equivalent. So you must put all your data in static variables. And if you want to create two separate memory pools you have to have two different allocator classes.

I must admit that every time I run into something in STL that seems completely bonkers I secretly suspect that I have missed something. Because obviously STL has been created by some really clever people who have thought long and hard about these things. But I just don't understand the idea behind the design of the custom allocator interface at all. Can anyone explain it to me? Does anyone use it? Find it practical? Sane?

If it weren't for the allocator interface I could almost use STL. Almost. There is also the pretty inefficient map implementation. And the fact that deque is not a simple ring buffer, but some horrible beast. And that many containers allocate memory even if they are empty... So my own version of everything it is. Boring, but what's a poor gal gonna do?

Back to allocators. In conclusion, all the standard C++ ways of implementing custom allocators are (to me) strange and strangely useless. So what do I do instead? I use an abstract allocator interface and implement it with a bunch of concrete classes that allocate memory in different ways:
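A minimal sketch of such an interface, assuming only the three operations this post names (allocate(), deallocate() and allocated_size()). MallocAllocator is a hypothetical stand-in for one concrete implementation, not the engine's actual code:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Abstract interface implemented by every allocation strategy.
class Allocator
{
public:
    virtual ~Allocator() {}
    virtual void *allocate(size_t size, size_t align) = 0;
    virtual void deallocate(void *p) = 0;
    virtual size_t allocated_size(void *p) = 0;
};

// Hypothetical concrete allocator that forwards to malloc(). It keeps each
// block's size in a header slot in front of the block, so allocated_size()
// can answer without a lookup table.
class MallocAllocator : public Allocator
{
public:
    void *allocate(size_t size, size_t align) override
    {
        // Simplification: only alignments that malloc() already guarantees.
        assert(align != 0 && align <= HEADER && (align & (align - 1)) == 0);
        char *raw = static_cast<char *>(std::malloc(HEADER + size));
        if (!raw) return nullptr;
        *reinterpret_cast<size_t *>(raw) = size;
        return raw + HEADER;
    }
    void deallocate(void *p) override
    {
        if (p) std::free(static_cast<char *>(p) - HEADER);
    }
    size_t allocated_size(void *p) override
    {
        return *reinterpret_cast<size_t *>(static_cast<char *>(p) - HEADER);
    }
private:
    // One maximally aligned slot, so the block that follows stays aligned.
    static constexpr size_t HEADER = alignof(std::max_align_t);
    static_assert(HEADER >= sizeof(size_t), "header slot must hold a size_t");
};
```

Client code only ever sees an Allocator &, so it does not need to know which strategy sits behind it.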

I think this is about as sane as an allocator API can get. One possible point of contention is the allocated_size() method. Some allocators (e.g., the frame allocator) do not automatically know the sizes of their individual allocations, and would have to use extra memory to store them. However, being able to answer questions about allocation sizes is very useful for memory tracking, so I require all allocators to provide that information, even if it means that a frame allocator will have to use a little extra memory to store it.

I use an abstract interface with virtual functions, because I don't want to template my classes on the allocator type. I like my allocators to be actual objects that I can create more than one of, thank you very much. Memory allocation is expensive anyway, so I don't care about the cost of a virtual function call.

In the BitSquid engine, you can only allocate memory through an Allocator object. If you call malloc or new the engine will assert(false).

Also, in the BitSquid engine all allocators keep track of the total number of allocations they have made, and the total size of those allocations. The numbers are decreased on deallocate(). In the allocator destructor we assert(_size == 0 && _allocations == 0) and when we shut down the application we tear down all allocators properly. So we know that we don't have any memory leaks in the engine. At least not along any code path that has ever been run.

Since everything must be allocated through an Allocator, all our collection classes (and a bunch of other low-level classes) take an Allocator & in the constructor and use that for all their allocations. Higher level classes either create their own allocator or use one of the globals, such as memory_globals::default_allocator().

With this interface set, we can implement a number of different allocators. A HeapAllocator that allocates from a heap. A PoolAllocator that uses an object pool. A FrameAllocator that pointer bumps. A PageAllocator that allocates raw virtual memory. And so on.

Most of the allocators are set up to use a backing allocator to allocate large chunks of memory which they then chop up into smaller pieces. The backing allocator is also an Allocator. So a pool allocator could use either the heap or the virtual memory to back up its allocations.

We use proxy allocators for memory tracking. For example, the sound system uses:

ProxyAllocator("sound", memory_globals::default_allocator());

which forwards all allocations to the default allocator, but keeps track of how much memory has been allocated by the sound system, so that we can display it in nice memory overviews.
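A rough sketch of what such a proxy could look like, together with the leak assert in the destructor described earlier. The accessor names (total_allocated(), allocation_count()) and the malloc-based backing allocator are assumptions made for the example:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

class Allocator
{
public:
    virtual ~Allocator() {}
    virtual void *allocate(size_t size, size_t align) = 0;
    virtual void deallocate(void *p) = 0;
    virtual size_t allocated_size(void *p) = 0;
};

// Hypothetical backing allocator: malloc() plus a size header per block.
class MallocAllocator : public Allocator
{
public:
    void *allocate(size_t size, size_t align) override
    {
        assert(align != 0 && align <= HEADER);
        char *raw = static_cast<char *>(std::malloc(HEADER + size));
        if (!raw) return nullptr;
        *reinterpret_cast<size_t *>(raw) = size;
        return raw + HEADER;
    }
    void deallocate(void *p) override
    {
        if (p) std::free(static_cast<char *>(p) - HEADER);
    }
    size_t allocated_size(void *p) override
    {
        return *reinterpret_cast<size_t *>(static_cast<char *>(p) - HEADER);
    }
private:
    static constexpr size_t HEADER = alignof(std::max_align_t);
};

// Forwards every call to the backing allocator and keeps running totals.
class ProxyAllocator : public Allocator
{
public:
    ProxyAllocator(const char *name, Allocator &backing)
        : _name(name), _backing(backing), _size(0), _allocations(0) {}

    // Anything still outstanding here is a leak in this subsystem.
    ~ProxyAllocator() override { assert(_size == 0 && _allocations == 0); }

    void *allocate(size_t size, size_t align) override
    {
        void *p = _backing.allocate(size, align);
        if (p) {
            _size += _backing.allocated_size(p);
            ++_allocations;
        }
        return p;
    }
    void deallocate(void *p) override
    {
        if (!p) return;
        _size -= _backing.allocated_size(p);
        --_allocations;
        _backing.deallocate(p);
    }
    size_t allocated_size(void *p) override { return _backing.allocated_size(p); }

    const char *name() const { return _name; }
    size_t total_allocated() const { return _size; }
    size_t allocation_count() const { return _allocations; }

private:
    const char *_name;
    Allocator &_backing;
    size_t _size;
    size_t _allocations;
};
```

Because the proxy is itself an Allocator, it can be handed to any system or container in place of the allocator it wraps.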

If we have a hairy memory leak in some system, we can add a TraceAllocator, another proxy allocator which records a stack trace for each allocation. Though, truth be told, we haven't actually had to use that much. Since our assert triggers as soon as a memory leak is introduced, and the ProxyAllocator tells us in which subsystem the leak occurred, we usually find them quickly.

To create and destroy objects using our allocators, we have to use placement new and friends:
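A sketch of that pattern wrapped in two helpers. The names make_new and make_delete come up in the comments below, but their exact signatures are not shown in this post; the free-function form and the C++11 variadic template used here are assumptions (the comments mention hand-unrolled overloads for up to three arguments instead). Sound is a made-up example type:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <utility>

class Allocator
{
public:
    virtual ~Allocator() {}
    virtual void *allocate(size_t size, size_t align) = 0;
    virtual void deallocate(void *p) = 0;
    virtual size_t allocated_size(void *p) = 0;
};

// Hypothetical allocator used only to make the example self-contained.
class MallocAllocator : public Allocator
{
public:
    void *allocate(size_t size, size_t align) override
    {
        assert(align != 0 && align <= HEADER);
        char *raw = static_cast<char *>(std::malloc(HEADER + size));
        if (!raw) return nullptr;
        *reinterpret_cast<size_t *>(raw) = size;
        return raw + HEADER;
    }
    void deallocate(void *p) override
    {
        if (p) std::free(static_cast<char *>(p) - HEADER);
    }
    size_t allocated_size(void *p) override
    {
        return *reinterpret_cast<size_t *>(static_cast<char *>(p) - HEADER);
    }
private:
    static constexpr size_t HEADER = alignof(std::max_align_t);
};

// Allocate raw memory from the allocator, then construct a T in it.
template <class T, class... Args>
T *make_new(Allocator &a, Args &&... args)
{
    void *memory = a.allocate(sizeof(T), alignof(T));
    return memory ? new (memory) T(std::forward<Args>(args)...) : nullptr;
}

// Run the destructor by hand, then return the memory to the allocator.
template <class T>
void make_delete(Allocator &a, T *p)
{
    if (p) {
        p->~T();
        a.deallocate(p);
    }
}

// Example type that counts live instances.
struct Sound
{
    explicit Sound(int id) : id(id) { ++live; }
    ~Sound() { --live; }
    int id;
    static int live;
};
int Sound::live = 0;
```

The object must be destroyed through the same allocator that created it, which is why both helpers take the allocator explicitly.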

One last interesting thing to talk about. Since we use the allocators to assert on memory leaks, we really want to make sure that we set them up and tear them down in a correct, deterministic order. Since we are not allowed to allocate anything without using allocators, this raises an interesting chicken-and-egg problem: who allocates the allocators? How does the first allocator get allocated?

The first allocator could be static, but I want deterministic creation and destruction. I don't want the allocator to be destroyed by some random atexit() callback god knows when.

The solution -- use a chunk of raw memory and new the first allocator into that:
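A minimal sketch of that bootstrap. The names _buffer and _static_heap follow the explanation below; StaticHeap, BUFFER_SIZE and the init()/shutdown() functions are assumptions, and StaticHeap is a simple bump allocator standing in for the real static heap allocator:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

// Stand-in for the static heap allocator: hands out memory from a fixed
// block by bumping a pointer (a real heap would also support freeing).
class StaticHeap
{
public:
    StaticHeap(char *begin, size_t size) : _next(begin), _end(begin + size) {}

    void *allocate(size_t size, size_t align)
    {
        assert(align != 0 && (align & (align - 1)) == 0);
        uintptr_t p = reinterpret_cast<uintptr_t>(_next);
        uintptr_t aligned = (p + align - 1) & ~static_cast<uintptr_t>(align - 1);
        size_t padding = static_cast<size_t>(aligned - p);
        size_t left = remaining();
        if (padding > left || size > left - padding) return nullptr;
        char *result = _next + padding;
        _next = result + size;
        return result;
    }
    size_t remaining() const { return static_cast<size_t>(_end - _next); }

private:
    char *_next;
    char *_end;
};

namespace memory_globals
{
    const size_t BUFFER_SIZE = 64 * 1024;                // assumed size
    alignas(std::max_align_t) char _buffer[BUFFER_SIZE]; // raw static memory
    StaticHeap *_static_heap = nullptr;

    void init()
    {
        // Construct the first allocator at the start of the buffer and let
        // it manage whatever remains after itself.
        _static_heap = new (_buffer) StaticHeap(
            _buffer + sizeof(StaticHeap), BUFFER_SIZE - sizeof(StaticHeap));
    }

    void shutdown()
    {
        // Explicit, deterministic teardown: run the destructor by hand.
        _static_heap->~StaticHeap();
        _static_heap = nullptr;
    }
}
```

Nothing here runs before init() or after shutdown(), so the order of construction and destruction is entirely under the application's control.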

Note how this works. _buffer is initialized statically, but since that doesn't call any constructors or destructors, we are fine with that. Then we placement new a HeapAllocator at the start of that buffer. That heap allocator is a static heap allocator that uses a predefined memory block to create its heap in. And the memory block that it uses is the rest of the _buffer -- whatever remains after _static_heap has been placed in the beginning.

Now we have our bootstrap allocator, and we can go on creating all the other allocators, using the bootstrap allocator to create them.

72 comments:

STL allocators were probably never designed for custom allocation techniques (e.g. pools), but rather to abstract the memory model (Stepanov mentions allowing persistent memory models). They seem to mainly have been pushed by external parties, and Stepanov himself says they are pretty flawed semantically (can't do find(a.begin(), a.end(), b[1]); where a and b are vectors with different allocators for example).

lionet: Of course, one of the allocators in the system should be DougLeaAllocator (in fact, that is exactly what our HeapAllocator is). And if you are using some other allocator because you think it is faster / less fragmented / etc than the HeapAllocator you should performance test. Optimizations should always be based on real world data.

Actually STL was not very well thought through; it was a research project that was standardized in a very short amount of time. The ideas and principles of generic programming are very well thought through and very sane, STL is not :) Nor are the C++ features that "support" GP.

Greetings, a noob question: how do you handle deallocation of data? Do you just memset them to NULL and defrag the rest of the data (reallocate everything)? Or is data never deleted and you work with an index scheme?

I have a couple of questions about your allocation tracking though. You mentioned above that you record the stack trace for each allocation. It seems you've decided not to use the __FILE__ and __LINE__ macros. I guess that would mean you'd have to wrap every allocation call with a macro to do it that way.

How are you recording the stack trace? Writing to an external file? It would seem that this sort of tracking would be a big hit on performance and isn't enabled during regular development, no?

You are right, I don't use __FILE__ and __LINE__. The main reason is that they don't give enough information in many cases. For example, all allocations in the Vector class would show up as vector.inl:72 or something like that, which doesn't give any information about what system is actually leaking memory.

I record the information in memory (as pointers, so it is just 4 bytes for each stack trace entry) using a special debug allocator. All debug allocations (profiler data, stack traces, etc) go through that allocator, so I always know how much memory is used for debugging and how much is used by the "real game" -- another advantage of the allocator system.

It is a hit on performance, so it is not enabled during regular development. When I use it, I usually enable it just for the particular allocator that I know is leaking, for example the "sound" allocator, if the shutdown test has shown that that allocator is leaking. That way the game runs at nearly full speed, since only a small percent of the total allocations are traced.

What if the concrete allocator must take some additional params? For example, the stack allocator may take an additional argument for which side to allocate from (when double-ended). Then the code will need to know which allocator it uses.

I just have a real noob question: You have disallowed the use of new and malloc() so how do you get your chunks of raw memory?

If you use byte arrays as above, how do you check that the allocation was successful and that you have not run out of memory? I mean there is no way for the program to report an error in allocating a static array, right?

Thanks for the great post, very interesting read. Couple of questions:

1. How do you control internal STL allocations? You said you avoided creating custom STL allocators, but how do you ensure that any memory dynamically allocated inside the STL (Grow() etc) go through your allocators?

2. Do you have any recommended books/links on the types of allocators you've mentioned here (HeapAllocator, FrameAllocator, PageAllocator, PoolAllocator)?

1. We don't use STL. We use our own collection classes. They are quite similar to the STL classes, but they take a reference to an allocator interface in their constructors. They use that interface for all their memory allocations.

2. I've picked up information here and there. You should be able to find some pointers by just googling "memory allocations". Some more detailed information:

HeapAllocator - An allocator that allocates varied sized blocks and keeps an in-place linked list of free blocks. Look at dlmalloc, it is pretty much the standard allocator.

FrameAllocator - Also called "arena allocator" or "pointer bump allocator". An allocator that doesn't deallocate individual blocks but releases all its memory in one go. That means it has a super simple internal state (just a pointer to the next free byte and the remaining bytes of free memory).

PageAllocator - The virtual memory allocator provided by the system. Allocates memory as a number of "pages". The page size is OS dependent (typically something like 1K -- 64K). You don't want to use it for small allocations, but it can be used as a "backing allocator" for any of the other allocators. Google "virtual memory" and you should find lots of information.

PoolAllocator - An allocator that has pools for allocations of certain sizes. For example, one pool may only allocate 16 byte blocks. The pool can be packed tight in memory, you don't need headers and footers for the blocks, since they are always the same size. This saves memory and reduces fragmentation and allocation time.
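To make two of the descriptions above concrete, here is a small sketch of a frame allocator and a fixed-size pool. FrameArena and FixedPool are hypothetical names, both work on a caller-supplied buffer, and neither implements the full Allocator interface; they only show the core bookkeeping:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Frame ("arena") allocation: bump a pointer, release everything at once.
class FrameArena
{
public:
    FrameArena(char *buffer, size_t size)
        : _begin(buffer), _next(buffer), _remaining(size), _capacity(size) {}

    void *allocate(size_t size, size_t align)
    {
        assert(align != 0 && (align & (align - 1)) == 0);
        uintptr_t p = reinterpret_cast<uintptr_t>(_next);
        uintptr_t aligned = (p + align - 1) & ~static_cast<uintptr_t>(align - 1);
        size_t padding = static_cast<size_t>(aligned - p);
        if (padding > _remaining || size > _remaining - padding) return nullptr;
        char *result = _next + padding;
        _next = result + size;
        _remaining -= padding + size;
        return result;
    }
    // No per-block deallocation: all memory is released in one go.
    void reset() { _next = _begin; _remaining = _capacity; }
    size_t remaining() const { return _remaining; }

private:
    char *_begin;
    char *_next;
    size_t _remaining;
    size_t _capacity;
};

// Pool of equally sized blocks. Free blocks are chained through their own
// first bytes, so no per-block header or footer is needed.
class FixedPool
{
public:
    FixedPool(char *buffer, size_t block_size, size_t count) : _free(nullptr)
    {
        assert(block_size >= sizeof(void *) && block_size % alignof(void *) == 0);
        assert(reinterpret_cast<uintptr_t>(buffer) % alignof(void *) == 0);
        // Push blocks in reverse so the first allocation returns block 0.
        for (size_t i = count; i-- > 0;) {
            void **block = reinterpret_cast<void **>(buffer + i * block_size);
            *block = _free;
            _free = block;
        }
    }
    void *allocate()
    {
        if (!_free) return nullptr;
        void **block = _free;
        _free = static_cast<void **>(*block);
        return block;
    }
    void deallocate(void *p)
    {
        void **block = static_cast<void **>(p);
        *block = _free;
        _free = block;
    }

private:
    void **_free; // head of the free list
};
```

In both cases allocation is a handful of pointer operations, which is where the speed advantage over a general heap comes from.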

The problem with that is: if T has a destructor, placement new will write 'length' into the first 4 bytes and return the allocated memory address offset by 4 bytes. This means it actually writes beyond the allocated buffer while initializing the objects. Furthermore, I can't use the returned pointer to deallocate the memory again, since it is not the original one returned by the allocate method. How did you solve this? Manually calling regular placement new on each element? Thanks

True. We use both const and non-const argument references so we run into the 2^N combinatorial explosion. We have unrolled that to up to three arguments... if you need more than that you have to write placement new by hand, rather than using make_new.

I'm experimenting with this allocator strategy, rolling my own containers and such. I'm concerned by the global override of operator new/delete. While I understand that it's needed to ensure that everything is under control, by doing so we also forbid the use of any external library (or 99% of them) in their original state. E.g. I use UnitTest++ and it does allocate using new. Do you really not use anything external?

I humbly thank you for this blog and the time spent for knowledge sharing.

No, we don't use anything external that uses new() and delete(). We use some external stuff, but we only use stuff that lets us customize the memory allocation. That's the way it should be, I think. Any third party library intended for high performance use should let you supply your own memory allocators.

Very nice approach, Niklas! But how have you implemented your HeapAllocator? Are you simply using malloc()/free() internally, or have you copied the full source from dlmalloc? I'm asking because I don't see a reason for reinventing the wheel instead of simply using malloc()/free() in this allocator.

We are using dlmalloc. Using that instead of the standard malloc()/free() means we have more insight into and control over how the heap allocator grabs system memory. So we can make sure that it "plays nice" with our other allocators and we can write functions that, given a pointer, can tell us if it was allocated by the heap or not.

It also allows us to have multiple heaps. For example, we have a special heap just for debug allocations (such as profiler data) so that we can easily ignore them when we check if we are within the memory budget.

I am still confused about how you handle arrays of objects. You say you use Vector<> (I assume that this is your own custom container) How does this allocate the necessary memory as well as call the constructors for each of the objects in the array?

It is up to each class implementing the Allocator interface. Most of our allocators are thread-safe, because that makes them easier and safer to use, but we have some special allocators that are faster but not thread-safe.

It is implemented with critical sections protecting the access to the shared data.

Nice article! I noticed your heap_allocator is using page_allocator as backend. My question is how page_allocator is implemented. Does it allocate a big chunk at the beginning, give memory to heap_allocator when requested, and only VirtualFree the chunk at the end? Also, did you redirect dlmalloc's MMAP to the page_allocator?

No, the page allocator allocates pages as requested and returns them when done. It doesn't grab a big chunk at startup. It tries to be friendly to other processes/allocators that might be using the VM system at the same time.

Yes for heaps that can grow, mmap() is redirected to the page allocator. (Actually, any allocator that supports the Allocator interface can be used as backing allocator.)

With an assert in global new and proper handling of delete, I still get all of the benefits of allocator usage along with all of the syntactic sugar the new and delete keywords provide. It also alleviates the need for the multiple template parameters make_new uses, as the syntax is the same as it would have been except for:

HeapAllocator my_allocator;

MyClass* mc = new (my_allocator) MyClass();

Still enforces the allocator usage, and I can hide it internally if desired. Even if it wasn't hidden, someone using it would still get correct behavior.

Still toying with the idea though. Since I allow placement new, it doesn't seem too bad to allow this for my own internal usage, but I will have to see if I keep with that line of thought.

Now I just need to go research more about the actual implementations of the various types of allocators.

I think realloc() is kind of weird. It is a non-deterministic optimization. If you are really lucky and there happens to be some free space to the right of the memory buffer, then realloc() can be faster than alloc + copy + delete. But you can't really do anything to make sure that that happens.

So you could easily add realloc support (with a default implementation of alloc + copy + delete for allocators that don't have a faster path). But to me it isn't that useful. I much prefer deterministic optimizations.
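For reference, that default path is only a few lines on top of the allocator interface. The reallocate() helper below is hypothetical (it is not part of the interface described in the post), and MallocAllocator is again a stand-in to keep the example self-contained:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

class Allocator
{
public:
    virtual ~Allocator() {}
    virtual void *allocate(size_t size, size_t align) = 0;
    virtual void deallocate(void *p) = 0;
    virtual size_t allocated_size(void *p) = 0;
};

// Hypothetical allocator: malloc() plus a size header per block.
class MallocAllocator : public Allocator
{
public:
    void *allocate(size_t size, size_t align) override
    {
        assert(align != 0 && align <= HEADER);
        char *raw = static_cast<char *>(std::malloc(HEADER + size));
        if (!raw) return nullptr;
        *reinterpret_cast<size_t *>(raw) = size;
        return raw + HEADER;
    }
    void deallocate(void *p) override
    {
        if (p) std::free(static_cast<char *>(p) - HEADER);
    }
    size_t allocated_size(void *p) override
    {
        return *reinterpret_cast<size_t *>(static_cast<char *>(p) - HEADER);
    }
private:
    static constexpr size_t HEADER = alignof(std::max_align_t);
};

// Default realloc: allocate a new block, copy, free the old one.
// On failure the old block is left untouched and nullptr is returned.
inline void *reallocate(Allocator &a, void *p, size_t new_size, size_t align)
{
    void *q = a.allocate(new_size, align);
    if (!q) return nullptr;
    if (p) {
        size_t old_size = a.allocated_size(p);
        std::memcpy(q, p, old_size < new_size ? old_size : new_size);
        a.deallocate(p);
    }
    return q;
}
```

Note that this version relies on allocated_size() to know how many bytes to copy, which is one more use for that method.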

I get the determinism argument, but I'm not sure it is strong enough to outweigh the benefits of a successful realloc(). Also, I would not provide a default realloc in terms of alloc+copy+del but rather return null from realloc(), so that a+c+d would be an explicit next step and it is always obvious whether the optimization happened or not.

If I need a buffer that can grow without reallocating memory I use some other technique, such as a linked chain of fixed size buffers (perhaps 8K each) that I merge as a final step. That way I am sure I don't have to copy the data unnecessarily, regardless of what the memory system decides to do.
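A sketch of that idea, assuming 8K chunks as in the example. ChainedBuffer is a hypothetical name, and it calls malloc() directly only to stay short; in an engine set up as described in this post the chunks would come from an Allocator instead:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Growable buffer built from a chain of fixed-size chunks. Appending never
// moves bytes that were written earlier.
class ChainedBuffer
{
    static constexpr size_t CHUNK_SIZE = 8 * 1024;
    struct Chunk { Chunk *next; size_t used; char data[CHUNK_SIZE]; };

public:
    ChainedBuffer() : _head(nullptr), _tail(nullptr), _size(0) {}
    ~ChainedBuffer()
    {
        while (_head) {
            Chunk *next = _head->next;
            std::free(_head);
            _head = next;
        }
    }
    ChainedBuffer(const ChainedBuffer &) = delete;
    ChainedBuffer &operator=(const ChainedBuffer &) = delete;

    bool append(const void *bytes, size_t n)
    {
        const char *src = static_cast<const char *>(bytes);
        while (n > 0) {
            if (!_tail || _tail->used == CHUNK_SIZE) {
                // Current chunk is full: link a fresh one at the end.
                Chunk *c = static_cast<Chunk *>(std::malloc(sizeof(Chunk)));
                if (!c) return false;
                c->next = nullptr;
                c->used = 0;
                if (_tail) _tail->next = c; else _head = c;
                _tail = c;
            }
            size_t room = CHUNK_SIZE - _tail->used;
            size_t count = n < room ? n : room;
            std::memcpy(_tail->data + _tail->used, src, count);
            _tail->used += count;
            _size += count;
            src += count;
            n -= count;
        }
        return true;
    }

    size_t size() const { return _size; }

    // Final merge into a caller-provided buffer of at least size() bytes.
    // This is the only point where already written data is copied.
    void merge_into(char *out) const
    {
        for (const Chunk *c = _head; c; c = c->next) {
            std::memcpy(out, c->data, c->used);
            out += c->used;
        }
    }

private:
    Chunk *_head;
    Chunk *_tail;
    size_t _size;
};
```

Each byte is copied exactly twice (once on append, once on merge), regardless of how the memory system lays out the chunks.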

But isn't that what Lea's allocator is doing already? I can see your point tho: you want to have full control always. Two more questions: 1) Do you use HeapCreate/Alloc on win to have different OS heaps for different systems/resources? 2) How do you handle dtor calling with allocators such as FrameAllocator where memory might be "freed" without client's explicit free() call? And sorry for not saying it with first comment: great read! :)

1) No, I don't use OS heaps. All heap allocation is done by dlmalloc. I use the OS page allocator (VirtualAlloc) as a backing allocator for dlmalloc.

2) If you "delete" an object in a FrameAllocator, the destructor will be called, but memory will not be freed (all memory is freed when the FrameAllocator is destroyed). When you destroy the FrameAllocator, all memory is released, but no destructors are called... So if you want destructors to be called you must call delete (the same as if you use any other allocator). With the frame allocator though, you can choose to *not* call delete, if you know that the destructor doesn't do anything useful. That should not be the normal code path though, it should be reserved for "special optimizations".

Hi Niklas, I see your heap allocator can take a pointer to a memory area and construct the heap in that region. You also said your heap allocator uses dlmalloc. How did you manage to tell dlmalloc what memory region it should use?

We use a special allocator for Lua, so that we can track all Lua allocations (and optimize for Lua allocation patterns if needed).

We use lightuserdata for most objects exported from C to Lua to minimize the load on the garbage collector.

We use an incremental garbage collector where we spend a certain amount of time every frame to collect. We dynamically adapt the time we spend collecting to the amount of garbage we generate so that the system settles down in a "steady state" with a fixed garbage collection time per frame and a fixed garbage memory overhead.

It is up to the allocator to track the size of the allocated areas. Each allocator does it differently.

For example, dlmalloc adds some extra memory before each allocated memory block where it stores a header with that information.

A slot based allocator typically allocates slots one "chunk" at a time. In the chunk header it can then store information about the size of the slots in the chunk. If chunks are always say 8K big and aligned at 8K you can round down a slot pointer to the nearest 8K aligned address to get to the chunk header from the slot pointer, and then read the slot size from there.
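The round-down trick is just a mask on the slot address. A small sketch, assuming 8K chunks aligned at 8K; ChunkHeader, init_chunk(), slot_at() and chunk_of() are hypothetical names:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

const size_t CHUNK_SIZE = 8 * 1024; // chunk size and chunk alignment

// Stored once at the start of every chunk, shared by all slots in it.
struct ChunkHeader
{
    size_t slot_size;
};

// Writes the header at the start of a chunk. The memory must be aligned
// to CHUNK_SIZE for the masking below to work.
inline ChunkHeader *init_chunk(void *memory, size_t slot_size)
{
    assert(reinterpret_cast<uintptr_t>(memory) % CHUNK_SIZE == 0);
    ChunkHeader *header = new (memory) ChunkHeader;
    header->slot_size = slot_size;
    return header;
}

// Address of the index-th slot; slots start right after the header.
inline void *slot_at(ChunkHeader *chunk, size_t index)
{
    return reinterpret_cast<char *>(chunk + 1) + index * chunk->slot_size;
}

// Round a slot pointer down to the nearest CHUNK_SIZE boundary to find
// the header of the chunk it lives in.
inline ChunkHeader *chunk_of(void *slot)
{
    uintptr_t p = reinterpret_cast<uintptr_t>(slot);
    return reinterpret_cast<ChunkHeader *>(p & ~static_cast<uintptr_t>(CHUNK_SIZE - 1));
}

inline size_t allocated_size(void *slot)
{
    return chunk_of(slot)->slot_size;
}
```

The cost is one header per chunk rather than one per allocation, which is why this works well for small slots.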

I have been googling around some about data alignment, and I'm curious about how you generally handle data alignment in the Bitsquid engine since it is multiplatform. Do you use alignas and alignof from the C++11 standard, or do you use some compiler-specific methods?

I'm also interested in how you handle it in your allocators, if it's not too much to ask.

As seen above, our default allocate function takes a size and an alignment

void *allocate(size_t size, size_t align);

The alignment must be a power-of-two and <= size. The different allocators handle alignment as necessary. The simplest method is to just allocate extra memory and then offset the pointer to make sure that it has the right alignment. But most of our allocators are smarter than that. When they request a page from the system, and chop it up for allocation, they make sure that the blocks fall on reasonable alignment boundaries.
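The simplest method mentioned above (allocate extra, then offset the pointer) can be sketched like this. The function names are hypothetical and malloc() stands in for whatever backing allocator is in use. The original pointer is stored in the pointer-sized slot just before the returned block, and that slot is itself properly aligned because the minimum alignment is raised to alignof(void *):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Rounds p up to the next multiple of align (align must be a power of two).
inline uintptr_t align_forward(uintptr_t p, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (p + align - 1) & ~static_cast<uintptr_t>(align - 1);
}

// Over-allocates, then offsets the pointer so that it lands on the
// requested boundary.
inline void *aligned_allocate(size_t size, size_t align)
{
    if (align < alignof(void *)) align = alignof(void *);
    // Room for the stored pointer plus worst-case alignment padding.
    char *raw = static_cast<char *>(std::malloc(sizeof(void *) + align - 1 + size));
    if (!raw) return nullptr;
    uintptr_t p = align_forward(reinterpret_cast<uintptr_t>(raw) + sizeof(void *), align);
    // Remember the pointer malloc() returned so it can be freed later.
    reinterpret_cast<void **>(p)[-1] = raw;
    return reinterpret_cast<void *>(p);
}

inline void aligned_deallocate(void *p)
{
    if (p) std::free(static_cast<void **>(p)[-1]);
}
```

This wastes up to align - 1 bytes plus one pointer per allocation, which is why allocators that carve up whole pages can usually do better.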

There's something not clear to me about those headers/metadata allocated in this extra space: are they aligned themselves? In the Game Engine Architecture book, for example, he just stores an int in the 4 bytes before the aligned pointer... That's not guaranteed to be aligned. Isn't that a problem? Don't people care about allocator metadata alignment?

1) This doesn't work with private destructors, but regular new and delete wouldn't work in that case either. If you have a private destructor you typically have some other way of creating and destroying objects than using new/delete. In that case you would just make sure that that system worked with the allocator model.

2) Actually we have pretty much exactly that macro in our code base. And we are transitioning from using the templates to using the macro to improve compile-times and reduce object size, just as you suggest. Good point!

Hi there, great article! I just had a quick question about the allocators you use and how they are used with your container classes (in this article and the Foundation example you created).

I have been playing around with making my own Heap style allocator which allocates memory from a global pool/blob (just like in this example). I currently use a std::vector to track allocations which happen (a struct containing the start and end address of an allocation). I use this to find and remove allocations, and detect where there are gaps in the memory blob to make new allocations if they'll fit. I realised I would like to not have to use std::vector and create my own Vector class in a similar style to the one you created in the Foundation example code, but I hit a problem. The Allocator needs a dynamic, resizing array to track allocations, but the dynamic resizing array needs an allocator itself when it is created, and that doesn't quite work as I have a sort of chicken/egg scenario. I could be completely misunderstanding, but from the example outlined above I assumed that you would not call new/malloc or delete/free at all (not quite sure what dlmalloc is I'm afraid). I guess what I am trying to ask is how do you track/store allocations that happen in your base allocator. I suppose I could use some sort of Scratch or Temp Allocator to hold the vector inside the Heap Allocator, but that seemed sort of messy and I was hoping there was a nicer solution. I thought I'd ask you in case I've got things horribly wrong and am barking up the wrong tree, I hope you understand what I'm prattling on about :)

I tend to not use traditional data structures (vectors, etc) in the implementation of the allocators, just because of this reason. It becomes tricky to keep track of what is happening if the process of allocating memory triggers a resize which triggers another memory allocation.

So instead I use other techniques, such as chaining together blocks of memory in implicit linked lists, having "header blocks" in memory regions allocated from the backing allocator that stores information about the number and type of allocations. Perhaps also a "master header" block with global information, etc.

Dlmalloc can be found here http://g.oswego.edu/dl/html/malloc.html. You can see how it implements some of these techniques. You can find similar stuff in other memory allocators.

There are some other situations when the chicken-and-egg problem can crop-up, such as when creating the initial global allocators. I don't want to use static variables, since I want to have a clear initialization and destruction order. So instead I placement-new them into a statically allocated memory buffer (as you can see in memory.cpp in the foundation project).

As a temporary solution (while I couldn't think of anything better) I used a simple linear allocator internal to the heap allocator which used a member variable char buffer in the heap allocator as its memory. I set it to what seemed like an appropriate size and stuck an assert in to catch if ever it grew too large. The implicit linked list solution sounds like a nice solution, I will definitely check out Dlmalloc too.

Thank you again, the Foundation project has been incredibly interesting to poke around in, both in terms of the coding style and techniques.

Just to note that I reference this in my blog post here. I talk about applying this kind of allocator setup in the specific context of a custom STL style vector class, and also the addition of realloc() support.

Hi there, I tried to understand a few concepts that you've shared here.

You have a PageAllocator (which in turn uses VirtualAlloc, mmap etc) as the top level allocator. How do you handle allocated_size() in this allocator? If you store it directly in the page you may need to allocate another page just to store this information. This way a request for 4KB would need to be satisfied by allocating 8KB of memory in 2 pages, just to store its size and to align the space accordingly. Do you have a separate hash table inside the allocator just for the bookkeeping? Does PageAllocator even implement the Allocator interface?

As for dlmalloc, it uses only 2 global callbacks for allocating its memory. How can you give it an allocator to allocate from? Did you modify it and pass your own callbacks and an Allocator pointer on creation, so it can allocate from it? By default, dlmalloc also coalesces adjacent segments and frees them in one go (in one call to munmap()). How do you handle this behavior in your allocators?

On Windows you can use VirtualQuery() to get the size. But not all platforms provide a similar interface. On other platforms we have a hash table in the allocator for storing the size of page allocations. (As an optimization, we only store allocations > 1 page, if the page is not in the table the allocation size is assumed to be 1 page.)

Yes, we have modified dlmalloc so that it uses a PageAllocator instead of mmap to allocate system memory.