Introduction

In most systems the default allocator used by C++ (and C) compilers is
a very general memory allocator which supports allocating and freeing blocks
of memory of any size. The advantage of this is that it's memory-efficient
when the sizes of the allocated blocks are very varied throughout the
execution of the program.

The disadvantage of this is, however, that the bookkeeping data needed
by the default allocator is relatively complicated, so managing all
this memory with free-sized blocks is relatively slow. Another problem
is that the default allocator tends to be rather cache-unfriendly with
repeated allocations and deallocations, as the list of freed blocks gets
quite randomized (a very random list of free memory blocks will cause
allocations to cause large amounts of cache misses, which can heavily
penalize the speed of the allocator).

There are many situations, especially when programming in C++ using
the STL data containers, where it would be enough to have an allocator
for fixed-sized memory blocks. The advantage of this is that the allocator
can be made considerably faster because the bookkeeping is much simpler
and it's much easier to make it cache-friendly.

This library provides two allocators which are more optimized for
speed and memory usage. The first one is especially suitable for the
C++ STL data containers, specifically std::list,
std::set and std::map.

Allocators provided by this library

This library presents two fixed-size block allocators:

FSBAllocator

This is a more generic fixed-size block allocator, which is very
suitable for use with the C++ STL data containers which allocate one
element at a time, ie. std::list, std::set and
std::map. It cannot be used with std::vector,
std::deque nor std::string (but on the other
hand the memory usage of those containers is already very good, so they
don't really need a specialized allocator).

The allocator works by allocating contiguous blocks of memory where
the allocated elements are stored. The size of the element is determined
automatically from the objects to be allocated. Basically an allocator is
created for every element size that is used. (On the other hand, if two
different data containers use elements of the same size, they will share
the same allocator.)

If an entire block of elements is deallocated, then the block will be
freed from the main memory as well. Besides the obvious advantage of
memory being freed for other uses, this has also speed advantages
(because if the block is allocated again, it will be cache-friendly
and consecuently very fast).

Each allocated element has an overhead of 4 bytes in 32-bit systems
(8 bytes in 64-bit systems), which is the same as with most default
memory allocators. Thus this allocator does not consume any more memory
than the default one. In fact, it can sometimes allocate even less memory
than the default allocator because many such allocators align allocated
elements to 8-byte boundaries, while FSBAllocator allocates
to 4-byte boundaries (in 32-bit systems).

The disadvantages of using this allocator are the following:

The memory used by this allocator cannot be shared with anything else
in the program which uses dynamically allocated memory.

This allocator supports allocating only individual elements. It cannot
be used to allocate arrays (which is why it cannot be used with
std::vector nor std::deque). On the other
hand, this is usually not a problem.

If the memory allocated using this allocator becomes fragmented,
it may in some cases cause the allocator to waste considerable
amounts of memory. (This happens mainly if a lot of elements are
allocated and then the majority of them are deallocated, but random
individual elements here and there aren't.)

This allocator is not thread-safe by default. However, the
library offers several possibilities for making it thread-safe with
a precompiler constant. See the thread-safety section below.

FSBAllocator2

This is very similar to the FSBAllocator, with the difference
that it never frees memory. The advantage of this is that no bookkeeping data
is needed, and thus there's no overhead and allocated elements take only
as much memory as the size of the element.

This can be especially efficient if lots of very small elements
(such as individual integers) are allocated. For example the default
memory allocator in Linux allocates a minimum of 16 bytes per element,
so if you allocate an individual (4-byte) integer, 16 bytes will be
allocated nevertheless. Using FSBAllocator2 only 4 bytes will be
allocated (in 32-bit systems).

The disadvantage over FSBAllocator is that since memory is
never freed, this memory cannot be used by anything else and, more
importantly, FSBAllocator2 can be considerably slower than
FSBAllocator when lots of elements are constantly being
allocated and freed. (As mentioned earlier, if the list of freed elements
gets very randomized, this will cause the allocator to become
cache-unfriendly, causing lots of cache misses, which can penalize it
in speed quite a lot.)

Note, however, that according to my tests FSBAllocator2
is basically never slower than the default allocator, and thus safe to use
speedwise.

Only if all the allocated elements are freed, then
FSBAllocator2 will release all of its memory, becoming
(for the time being) cache-friendly and fast again.

FSBAllocator2 also has a special function for freeing
blocks of memory which only contain deallocated elements. This function
performs a linear sweep through all the blocks and deallocates the ones
which do not contain allocated elements, as well as making the list of
free elements linear. While this sweep can be slow, it will make
subsequent allocations faster.

This allocator is especially efficient for allocating very small elements,
such as for example smart pointer reference counters. Details about this
are given later in this document.

Note that in simple circumstances FSBAllocator2 can be
even faster than FSBAllocator. (For example if memory is
only allocated during the execution of the program, and only freed all
at the same time at the end.)

FSBAllocator2 can be used with the STL containers in the
same way as FSBAllocator, but due to its speed issues this
should only be done if the memory savings are important or in simple
cases, like the above.

The same disadvantages apply as with FSBAllocator.

Speed benchmark of FSBAllocator

To test the speed of FSBAllocator compared to the default
allocator used by the compiler, the following test was performed on a Linux
system using gcc 4.1.2 on a Pentium4 3.4GHz:

A std::list<int> was used, and the following operations
where done to it:

The benchmark was run for both the default allocator and the FSBallocator.
For comparison, this test was also run using
boost::fast_pool_allocator. All conditions were otherwise
exactly identical. The results were:

Default allocator: 1 minute 12 seconds.

Boost pool allocator: 57 seconds.

FSBAllocator: 16 seconds.

The same test was repeated using std::set<int>
(obviously using insert() instead of push_back()),
and the results were:

Default allocator: 2 minutes 12 seconds.

Boost pool allocator: 1 minute 40 seconds.

FSBAllocator: 1 minute 10 seconds.

(Note that in this case there are far fewer element creations and
deletions because of all the repeated values, which is the reason why the
times are closer to each other.)

Usage of FSBAllocator

Using the allocator is rather simple. Simply
#include "FSBAllocator.hh" in the source files where
the allocator is used, and then simply give the allocator to the desired
STL container as template parameter. For example, instead of writing this:

std::list<int> aList;

write this:

std::list<int, FSBAllocator<int> > aList;

That's it. No other modifications are needed.

The same for std::set would be:

std::set<int, std::less<int>, FSBAllocator<int> > aSet;

If you want to use the allocator directly, as a replacement of
new and delete, it can be done like this:

Note that allocate() only allocates memory for the object
but doesn't initialize it in any way (similarly to how, for example,
std::malloc() would do). If your type is a class, you'll have
to initialize and destroy it yourself as demonstrated in the commented
lines or by using placement new directly after allocating and by
calling its destructor explicitly before deallocating.

(Also note that, as stated earlier, arrays cannot be allocated with this
allocator, and thus the only valid parameter value is 1.)

FSBAllocator2

FSBAllocator2 can be used in the same way as
FSBAllocator, although usually it's only recommended for
special cases, like for example as a reference counter pool, described below.

Additionally, FSBAllocator2 offers a special function for
freeing unused memory, which can be used like this:

FSBAllocator2<YourType>().cleanSweep();

This will make the allocator to traverse all the allocated blocks
and free the ones which do not contain any allocated elements. It will
also make the list of freed elements linear.

Note, however, that there's a catch, which can make this function to
malfunction, and thus it must be used with care: In order to distinguish
which allocated elements are free and which aren't, it fills these freed
elements with an "unused value", which it takes as parameter, and which
by default is a size_t(-1). This value must indeed by
unused by the application. If the application has written this value
in some of the allocated elements, this function will malfunction (and
free elements it shouldn't).

For example if this allocator is used as a reference counter pool,
and all reference counts get values greater or equal to 0, then it's
safe to call this function as it is. If the value size_t(-1)
is used by the application for something, then it must provide some other
unused value for this function to use, for example:

FSBAllocator2<size_t>().cleanSweep(size_t(-2));

If unsure, it's better to not to call this function at all.

Note, however, that even though this function can be slow (depending
on how much memory has been allocated), it can make subsequent allocations
much faster, especially if the list of freed elements has become very
randomized.

Thread safety

By default this library is not thread-safe, and thus cannot be
used as an allocator for multiple threads. (As long as only one thread uses
this allocator and the memory allocated by it, it should be ok.)

The library offers several possibilities for making itself thread-safe.
This is done by defining one (and only one) of the following precompiler
constants, either in your compiler settings, or before including this
library. The possible precompiler constants are:

FSBALLOCATOR_USE_THREAD_SAFE_LOCKING_BOOST: Use the Boost
Thread library. This will probably require linking the boost thread library
into the project (eg. with -lboost_thread-mt).

FSBALLOCATOR_USE_THREAD_SAFE_LOCKING_GCC: If you are using
the gcc compiler, this uses an extension provided by it (more precisely, the
__sync_bool_compare_and_swap() function).

FSBALLOCATOR_USE_THREAD_SAFE_LOCKING_GCC_WITH_SCHED: Like
the previous, but additionally uses the posix <sched.h>
library to better interact with the system task scheduler. This results in
an even more efficient and CPU-friendly locking.

Note that enabling thread-safe locking will make the library slower.
In fact, in many systems all the locking mechanisms except the last two
(ie. the ones which use the GCC extension) may even make the library slower
than the default allocator when using multiple threads. They are still
offered, especially as a means to make FSBAllocator2 (which offers memory
saving) thread-safe in most systems.

The following is a list of running times of the first benchmark
mentioned before, using the different locking mechanisms (from fastest
to slowest). Also the same benchmark was run using the default allocator.

No locking: 16 seconds.

GCC: 28 seconds. (The sched version makes no difference when only one
thread is used.)

OpenMP: 46 seconds.

pthread: 55 seconds.

Boost: 70 seconds.

Default allocator: 72 seconds.

A different (but somewhat similar) benchmark, using two threads, both of
them using the same FSBAllocator type, resulted in the following times (from
fastest to slowest):

GCC with <sched.h>: 39 seconds.

GCC: 60 seconds.

Default allocator: 96 seconds.

OpenMP: 107 seconds.

pthread: 120 seconds.

Boost: 133 seconds.

Naturally FSBALLOCATOR_USE_THREAD_SAFE_LOCKING_GCC_WITH_SCHED
should be preferred, if the target system supports its requirements.

FSBAllocator2 as a reference counter pool

On smart pointers

There are basically three types of smart pointers:

Non-intrusive reference-counting smart pointer.

Non-intrusive doubly-linked smart pointer.

Intrusive reference-counting smart pointer.

The third type of smart pointer is usually the most efficient: The size
of the smart pointer object is that of one single pointer, and it's very
fast to copy and assign. Its problem is that it requires for the allocated
object to contain a reference counter as member variable, and thus this
smart pointer cannot be used with existing types which do not contain
such member variable (nor with internal types, such as ints or doubles,
obviously).

The first type of smart pointer has the advantage that it can be used
with any type of allocated object. In other words, the smart pointer does
not require anything from the object in question. However, it has two main
problems compared to the intrusive type: Its size is that of two pointers
(rather than one), and most importantly, it needs to dynamically allocate
the reference counter (which is just an integer type) separately. Such an
extra allocation not only slows down the creation and destruction of such
smart pointers, but it also wastes memory (typically each allocation
consumes 16 bytes of memory).

The second type of smart pointer is a clever variation: Rather than
allocating a separate reference counting integer, what it does is that
it doubly-links itself with all the other smart pointer instances which
are pointing to the same object, effectively creating a doubly-linked list
of smart pointers pointing to the same object. This way it has all the
advantages of the first smart pointer type, but it doesn't need to allocate
anything. Its disadvantage is that its size is that of three pointers.

The doubly-linked smart pointer is more efficient than the non-intrusive
reference-counting smart pointer when there are lots of allocated objects
with only one (or a few) smart pointer pointing to them, and these smart
pointers are not copied/assigned around a lot. In this case the doubly-linked
smart pointers consume less memory and are faster to create and destroy.

However, the non-intrusive reference-counting smart pointers start being
more efficient if there are lots of smart pointers pointing to the same
object, and these smart pointers are copied/assigned around a lot. The more
smart pointers point to the same object, the less memory is wasted on the
smart pointers themselves, compared to the doubly-linked smart pointers.
Also copying and assigning is slightly faster for the reference-counting
ones.

The ideal solution (when intrusive smart pointers cannot be used) is to
use the non-intrusive reference-counting smart pointers with a reference
counter pool.

A reference counter pool solution

FSBAllocator2 is ideal for such a reference counter pool.
This is because each allocated counter requires only as much memory as the size
of the counter (4 bytes in 32-systems) and nothing more. Allocating and
deallocating these counters is also faster (sans the problems with
cache-friendliness discussed earlier).

When such a reference counter pool is used, the reference-counting
smart pointer loses all of its disadvantages over the doubly-linked
smart pointers: Each smart pointer instance allocates at most two
pointers and one reference counter (which has the size of a pointer),
which makes it equal in memory consumption to the doubly-linked smart
pointer. However, if more than one reference-counting smart pointer
points to the same object, it immediately is more memory-efficient
than the doubly-linked smart pointer.

Also creating and destroying such smart pointers becomes faster,
as the FSBAllocator2 is used for allocating and deallocating the
reference counters.

The "FSBAllocator.hh" header defines an allocator name
for this specific purpose: FSBRefCountAllocator (which is
simply a shortcut for FSBAllocator2<size_t>).

Also an example smart pointer implementation is supplied, as the
"SmartPtr.hh" file.

Benchmark

To test the memory consumption and speed of the supplied smart pointer
using different allocators, a simple test was run, where 15 million elements
of type double were dynamically allocated to that many smart
pointers in a vector (and then simply destroyed when the program ends),
with the following four different allocation strategies used:

Both the double and the reference counter were allocated
using the default allocator of the compiler.

The double was allocated with the default allocator,
but the reference counter was allocated with
FSBRefCountAllocator.

Both the double and the reference counter were allocated
using FSBAllocator.

The double was allocated with FSBAllocator,
and the reference counter was allocated with
FSBRefCountAllocator.

The peak memory usage and the execution time of the whole program were
the following:

572 MB, 3.6 seconds.

401 MB, 2.2 seconds.

401 MB, 1.0 seconds.

344 MB, 0.8 seconds.

The same test was repeated, but allocating objects of type int
rather than of type double. The results were:

572 MB, 3.6 seconds.

401 MB, 2.2 seconds.

344 MB, 0.9 seconds.

287 MB, 0.8 seconds.

The memory usage numbers in the above tests are caused by how the linux
gcc allocator behaves:

The size of the allocation request plus 4 bytes are
allocated, aligned to an 8-byte boundary, with a minimum allocation of
16 bytes. The size of a double element is 8 bytes and the
size of an int element is 4 bytes, but the default allocator
allocates 16 bytes in both cases.

License

This library is released under the MIT license.

Copyright (c) 2008-2011 Juha Nieminen

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.