Array-like C++ containers: Four steps of trading speed

In C++ programming, the STL provides std::vector as a sequential container with contiguous storage. It is implemented as a dynamic array that grows automatically when elements are inserted.

In this blog article we first discuss performance aspects of std::vector and then look at three other array-like, contiguous-storage sequential containers that can improve upon std::vector in terms of performance, which is often driven by cache locality and memory-allocation costs.

std::vector

Memory layout.
A std::vector is a sequential container that can hold an arbitrary number of elements. It is implemented1 as a dynamic array with a specific growth factor, which is often 2, but not always. Whenever its capacity is increased, the std::vector has to allocate memory for the new capacity and copy2 over the old elements. A std::vector behaves just like an STL-enriched3 dynamic array and has the following memory layout:

In an object-oriented software design we often implicitly obtain std::vector of std::vector structures due to a twofold composition with a 0..* cardinality. For instance, a train has 0..* cars, a car has 0..* seats, and each composition is naturally modeled by a std::vector. Note that, memory-wise, a std::vector of std::vector is not an STL-enriched version of a multi-dimensional C++ array: the latter is stored in a contiguous chunk of memory, but the former is not:4

Performance considerations.
Contiguous storage is beneficial to cache locality, which makes std::vector more efficient than the non-contiguous sequential container std::deque for operations that do not change the capacity.5

A cascaded std::vector structure has a memory layout similar to std::deque, which is sub-optimal for cache locality. Moreover, and probably more severe, the creation of a cascaded std::vector requires multiple6 memory allocations where a multi-dimensional array requires a single one.

In high-performance computing, memory allocation is a performance killer. In cases where the size of a std::vector can be estimated in advance, it is good practice to call reserve() in order to eliminate repeated reallocations on the way to the vector's final size.

But even a single allocation can be a computational burden if it happens, say, in a frequently called subroutine. On real-time systems dynamic memory allocation is prohibited or at least considered bad practice: a standard memory allocator does not provide constant time complexity and may break real-time guarantees. But even if a real-time memory allocator like TLSF is used, the runtime cost of a single memory allocation is high.

Also note that a single invocation of the new operator or malloc() may cause a system call to increase the heap size7, which involves a full switch to kernel space. A real-time memory allocator, of course, cannot afford that.

boost small_vector

Memory layout.
From a memory point of view, a std::string is similar to a std::vector. Many implementations of std::string (e.g., gcc-5 and later) implement small string optimization (SSO)8 (aka short string optimization), which works as follows: a string object needs to store a data pointer to the string buffer. On a 64-bit platform that pointer alone could hold 8 characters without allocating memory. A typical string implementation actually stores size, capacity, and a pointer to the buffer; the latter two can be reused directly as an internal string buffer if the size fits:

class string_with_sso {
    size_t _size;
    union {
        struct {
            size_t _capacity;
            char* _buffer;
        } heapdata;
        char smallstring[sizeof(heapdata)];
    };
    bool is_small() const {
        return _size < sizeof(heapdata) - 1; // if we also store '\0'
    }
};

Boost 1.58 introduced a container type
boost::container::small_vector
which — similar to SSO — contains in-place space for a fixed number N of
elements. If the size exceeds N then dynamic memory allocation takes place. For
vectors this technique is also called small buffer optimization.

Performance considerations.
A car may comprise an arbitrary number of wheels but in almost all cases it
does not exceed four. If we implement this composition using a small_vector we
effectively eliminate memory allocation, which is especially beneficial for
real-time applications.

Moreover, a train — std::vector of cars — points to a contiguous chunk of
memory that contains all cars with the wheels embedded (for cars with at most
four wheels). That is, a std::vector over small_vector is (typically)
memory-wise contiguous just as a multi-dimensional C++ array. This again
improves cache locality:

If a std::vector is created temporarily on the stack (e.g., a local variable of a function holding intermediate results of a computation), then a small_vector may hold its elements entirely on the stack, with no memory allocation involved at all, just as static C++ arrays do.9

Using small_vector comes with an additional cost for the in-place buffer.10 Indications that a small_vector may not help are: (i) there is no reasonable "small" number N, (ii) in many cases the size is much smaller than N, or (iii) the size of an element is very large. In all these cases the original motivation, improving cache locality, may not be achieved by small_vector, and the cost reduction from eliminating memory allocations may be nullified by a larger memory footprint. Never optimize without measuring/profiling; check your hypotheses even if they appear obviously correct.

boost static_vector

Memory layout.
The small buffer optimization requires additional logic to distinguish the two
cases (i) data is held in-place and (ii) data is held externally. If we know
that the size of a small_vector will not exceed N then we can turn this fact
into a design decision and use static_vector instead. Boost 1.56 introduced
static_vector
as a contiguous-storage sequential container that can hold a number of elements
between zero and a fixed maximum capacity N.

A static_vector only needs to store the size and requires a fixed buffer for
capacity-many elements.11

Performance considerations.
In comparison to small_vector, a static_vector does not need the above-mentioned case distinction at runtime, which eliminates costly control flow branches. The memory footprint is also reduced because the capacity does not need to be stored. Note that, in some sense, space is time due to the memory hierarchy.

std::array

A static_vector is still more flexible than a C++ array over N elements. The latter has exactly N elements, which are all instantiated up front, whereas the former has a variable number of elements, which can initially be zero. So we can strip a static_vector further down and we arrive at the std::array introduced in C++11. A std::array is only an STL-enriched static C++ array. The size of a std::array<char, N> is just max(1, N).

Conclusion

In a certain sense we drew a line from std::vector to std::array along which we trade flexibility (or a higher memory footprint) for a gain in speed. Very often the full flexibility of a std::vector is not required; it is merely used as a convenient standard choice that gives access to the entire STL arsenal. It is also a convenient standard choice for one-to-many relationships in an object-oriented design.

By taking deliberate reductions of flexibility into account, significant performance improvements can be achieved. Nicholas Ormrod reports a 1% performance gain of the entire Facebook C++ code base by switching to fbstring by Andrei Alexandrescu, which uses small string optimization!8

In my industry job at B&R I am working on software that does motion control for a hundred moving elements in a transport system. The software runs on a real-time operating system on a PLC. By replacing various occurrences of std::vector with boost small_vector and static_vector we achieved significant performance improvements and heavily reduced runtime jitter.

Strictly speaking, the STL does not define the implementation.
However, it defines time complexities on operations, guarantees on
iterators, and the like that basically leave no other reasonable choice. ↩

Actually, one could leave the old storage and copy a constant
number of old elements whenever an insertion takes place. This way the
amortized constant time push_back() could become a worst-case constant
time operation. However, this strategy violates STL-defined properties of
std::vector, like element access via pointer arithmetic, which is possible
for a single contiguous storage. ↩

By “STL-enriched” we mean that it conforms to the iterator concept, which in turn makes it usable for STL algorithms or container adaptors like std::priority_queue. ↩

Note also that a std::vector of std::vector can model more than a two-dimensional array can, as each sub-vector can have a different length. Different cars of a train may have different numbers of seats.

The syntax of the C# programming language distinguishes between a jagged
array
(array of arrays) and a multi-dimensional
array,
e.g., int[][] versus int[,]. See also the
indexing
of multi-dimensional numpy arrays in Python. (Thanks for the remark,
Gerhard). ↩

Generalities concerning performance are dangerous. A std::deque is typically implemented as an array of arrays. This, for instance, allows push_back to be in constant time, whereas for std::vector it is only amortized constant time. The key difference, which gives std::deque its name, is that it also provides a push_front(), which is constant time, too. The internal two-level structure, however, comes at a cost that is not covered by the Big-O notation, but it is still relevant for high-performance or real-time code. ↩

In our example, we need a number of allocation operations equal to the number of cars plus one. For a deeper composition cascade we need to consider the product of the cardinalities, which grows exponentially with the composition depth. ↩

Note that the stack size has a limit defined by the operating
system. Under Linux the command ulimit -s prints the stack size in kB,
which is 8192 on my system. ↩

On my 64-bit Gentoo Linux with boost-1.65 and gcc-7.3.0 a std::vector<char> has a size of 24 bytes and a small_vector<char, N> has a size of max(32, 24 + N) bytes, rounded up to multiples of 8 bytes. ↩

On my 64-bit system a static_vector<char, N> has a size of max(16, 8 + N) bytes, rounded up to multiples of 8 bytes. ↩