Flat Containers the C++ Way

Introduction

Flat Containers (P0038R0)
by Sean Middleditch gave a good overview of what flat containers are
and how they might be designed. This paper aims to answer the questions
posed in Sean's paper, focusing in on a specific design for the standard.

3. Flat containers expose their underlying container with a public data member

Use container to gain access to the underlying container.
This works because flat containers are just light wrappers around their
contained containers.

std::flat_set<int> set;
// flat_set does not have a reserve member, but that's not a problem with 'container'.
set.container.reserve(1000);
for(int i = 0; i != 1000; ++i)
    set.insert(i + i/2);
// operator[] on 'container' is convenient to have.
// (Doubling every element preserves the sorted order.)
for(std::size_t i = 0; i != set.container.size(); ++i)
    set.container[i] *= 2;

There are rules regarding container, but we'll get into those later.

Design considerations

Do what programmers expect.

Use the same interface as standard map/set if possible.

Define operations in terms of existing standard functions.

Be worth using over plain vectors.

There should be little reason to use plain sorted vectors as maps
when flat_map exists.

Be easy to migrate to.

Lots of code out there uses Boost/Loki/EASTL flat containers already.

An improved design gives people reason to migrate.

Support embedded and memory-constrained programmers.

Stack-allocated vector implementations are key.

Performance considerations

There are three cases where flat containers beat out everything else,
bar none:

In-order iteration.

Whole-container copying.

Memory usage.

Note that "searching" is not included in the list. In general, the O(1)
average-case performance of hash tables will beat flat containers
when it comes to searching, insertion, and removal. Flat containers do
have their advantages over hash tables when it comes to worst-case performance,
but rarely are they optimal.

The graph below shows the performance of associative containers
at sizes less than 256. Note that hash tables perform the best and linear search
(displayed as 'unsorted') performs the worst. The two flat containers
are displayed in black and cyan.

(Side note: Linear search is really, really terrible and should stop being
recommended to newbies as a faster alternative to associative containers.
It's not a worthwhile optimization to make; it's just stupid.)

Internal algorithms

Array Layouts for Comparison-Based Searching

"After extensive testing and tuning on a wide variety of modern hardware,
we arrive at the conclusion that, for small values of n, sorted order,
combined with a good implementation of binary search is best. For larger
values of n, we arrive at the surprising conclusion that the Eytzinger
[breadth-first] layout is usually the fastest."

That quotation is from the abstract of an absolutely fantastic paper by
Pat Morin and Paul-Virak Khuong called Array Layouts for Comparison-Based
Searching, which details the different representations of flat sorted arrays
and their influence upon performance. The paper is based on a series
of binary search benchmarks written by Pat and run by others on various
types of hardware. Graphs made from these benchmarks are available on
Pat Morin's website.

Here is one such graph which shows the performance of 32-bit data on an
Intel i7-4790K:

As you can see, branch-free binary search performs best for sizes that
fit in the cache, but Eytzinger (breadth-first) is the clear winner past
that.
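As a rough illustration of what "branch-free binary search" means here, the following is a minimal sketch over a plain sorted vector, written in the general style those benchmarks measure (not the paper's exact code):

```cpp
#include <cstddef>
#include <vector>

// Branch-free binary search over a plain sorted vector.
// Returns the index of the first element >= key, or a.size() if none.
std::size_t branchfree_lower_bound(std::vector<int> const& a, int key)
{
    if(a.empty())
        return 0;
    int const* base = a.data();
    std::size_t n = a.size();
    while(n > 1)
    {
        std::size_t const half = n / 2;
        // the compiler can turn this into a conditional move, not a branch:
        base = (base[half - 1] < key) ? base + half : base;
        n -= half;
    }
    return std::size_t(base - a.data()) + std::size_t(*base < key);
}
```

The loop body has no unpredictable branch, which is why it wins on in-cache sizes: the CPU never mispredicts and the loop's trip count depends only on n.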

A brief explanation of these results

It's all due to bandwidth, prefetching, and speculative loads. It is not
due to "having fewer cache misses".

Two key properties to take note of:

The Eytzinger layout causes child and grandchild (and great-grandchild, etc)
indices to share the same cacheline.

On modern CPUs, requesting up to 4 cachelines of memory can have the same
latency as requesting just one. Multiple requests can be outgoing at once.

The Eytzinger search algorithm is fast because it loads memory it
might need several iterations in advance. Property #1 allows this prefetching
to be efficient; all possible grandchildren of a given node exist in just one
or two cachelines. Property #2 provides the speedup; a small number
of prefetches (four) can be outgoing at once without increases in latency.
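The search itself can be sketched as follows; this is adapted from the paper's algorithm rather than copied from it, and uses a 1-indexed array with a[0] unused:

```cpp
#include <cstddef>
#include <vector>

// Branch-free search over an Eytzinger (breadth-first) layout.
// a[1..n] holds the tree; the children of node i live at 2i and 2i+1.
// Returns the index of the first element >= key, or 0 if none exists.
std::size_t eytzinger_lower_bound(std::vector<int> const& a, int key)
{
    std::size_t const n = a.size() - 1;
    std::size_t i = 1;
    while(i <= n)
    {
        // a tuned version would prefetch the cacheline holding the
        // great-great-grandchildren of node i here
        i = 2 * i + std::size_t(a[i] < key);
    }
    // i encodes the search path taken; shifting off the trailing 1-bits
    // (plus one more bit) recovers the index of the answer
    while(i & 1)
        i >>= 1;
    return i >> 1;
}
```

Note how the descent step is a single arithmetic expression, so the prefetch target several levels down is computable before the current comparisons resolve.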

A detailed explanation of these results

Rejecting Eytzinger

Unfortunately, the Eytzinger layout is not useful as a general-purpose
container. Despite being fast at searches, Eytzinger's other map-like
operations are slow and complicated.

Full implementations of the Eytzinger layout do not exist, but performance
numbers can be estimated from a benchmark done by Joaquín M López Muñoz.
This benchmark compares the in-order traversal time of Eytzinger layouts
(referred to as level-order) to sorted vectors. The Eytzinger layout
took five times as long when doing this traversal.

Assumptions can be made about the speed of insertion, removal, and
in-order construction based on these results. Since Eytzinger layouts
are perfectly balanced binary trees, such operations must re-balance
the tree, which means traversing and rewriting it in at least O(N) time.
The traversal benchmark suggests the constant factor on that work is
at least five times that of sorted vectors.

Sorted vectors are strongly preferred from a standard library perspective.
They offer good - but not optimal - performance across all of their
operations, and can be implemented using pre-existing library functions.
Sorted vectors are the best choice for this proposal.

One Vector or Two?

The graphs below show the performance of searching maps with these two
representations.

8 byte keys / 8 byte mapped values:

8 byte keys / 16 byte mapped values:

8 byte keys / 256 byte mapped values:

(The y-axis scale of these graphs is linear and starts at time=0)

When the size of the key is equal to the size of the mapped value,
both representations perform about the same. However, as the
size of the mapped value grows, performance shifts to favor separate
vectors over a single interleaved one.

In reality, large mapped values are a rare sight. It is often
more desirable for the map to hold pointers that point to large objects
than it is for the map to hold large objects directly. This enables the
container to have fast insertion and removal without problems
of reference invalidation.

Interface

The interleaved representation uses std::pair value types and is thus
strongly preferred from a design perspective. Interleaved vectors of
pairs mesh well with the existing standard library and containers.
In addition, the container adaptor design of flat containers strongly
favors single containers.

Boost's flat_map implementation uses a single sorted vector of
pairs. Unlike std::map, the keys of flat_map's pairs are not const;
they can be modified. Modifying the keys causes the vector to become unsorted,
and an unsorted vector causes flat_map to break.

This problem is not unique to Boost. Loki, EASTL, and every other
popular implementation have this problem too.

The search for a solution

It is possible to design flat container classes that cannot be made
unsorted. All such designs are based around keeping key values
immutable.

Idea: Reference Proxies.

Useful, but full of gotchas. Reference proxies break the property of,
"Do what people expect."

Not everyone considers them "zero overhead".

Nobody wants another vector<bool>.

Idea: Abandon std::pair iterators.

Breaks compatibility with standard algorithms and containers.

Interface becomes awkward and verbose in certain places.

Idea: Const iterators by default.

See below.

Const iterators: Safety by default

All iterators of flat containers can be made const iterators by default.
This means that references to const pairs are returned when dereferencing
flat container iterators.

With const iterators as the default, two mechanisms provide mutable
access. First, one can use the has function. has works just like
at and find, but returns a pointer value instead of an iterator.

if(std::string* mapped_value = map.has(4))
    *mapped_value = "has is handy";

Whenever possible, use has.

Second, one can use an iterator public data member called underlying.
underlying is the iterator equivalent of container and grants access
to iterators belonging to the adapted container. These 'underlying' container
iterators have the adapted container's usual non-const semantics and can
be used to modify both keys and values.

map.begin().underlying->second = "underlying is useful";

Like container and const_cast, use of underlying is an explicit
declaration to others that you know what you're doing and you can verify
that it is safe. This explicitness is what makes underlying the safer
alternative to Boost-style iterators.

Are unsortedness states beneficial?

Yes. Going all-out to prevent unsorted states leads to verbose, hard-to-use
container designs that are closed off from user optimizations.

It is more practical to be "safe by default" than "safe all the time".

Guidelines for coders

There are only three things that can lead to unsorted states:

container

underlying

Throwing moves.

Avoid all three and flat containers are guaranteed to remain sorted.

Always remember that std::sort followed by std::unique (plus an erase of
the leftover tail) can be used to turn any unsorted state into a sorted state.
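As a concrete sketch, here is that recipe applied to a flat container's underlying vector (shown against a plain std::vector<int> stand-in):

```cpp
#include <algorithm>
#include <vector>

// Restore the sorted-unique invariant after unchecked modifications
// made through 'container' or 'underlying'.
void restore_invariant(std::vector<int>& v)
{
    std::sort(v.begin(), v.end());
    v.erase(std::unique(v.begin(), v.end()), v.end());
}
```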

Sorted Preconditions

Permitting unsorted states in flat containers requires the standard
to address and define this behavior.

Affected member functions gain preconditions based on the preconditions
of lower_bound, upper_bound, and equal_range.

Requires: The elements e of [begin, end) shall be
partitioned with respect to the expression
key_comp(e, value).

Unsorted/unpartitioned flat containers are well-defined for all
operations that do not have such preconditions.

Uniqueness

The rules regarding duplicate keys are less restrictive than the rules
regarding sortedness.

The intention of flat_map and flat_set is to
hold unique keys, but this is not a hard requirement of the standard.
The behavior of member functions on duplicate-key containers is well defined
but implementation-specific. For example, using find on a container
with duplicate keys will return an iterator to any one of the matching
values. This matches the preexisting behavior of std::multimap.

Container Adaptors

The design of priority_queue has influenced the design of flat containers
and the similarities between the two are apparent.

std::priority_queue

A light wrapper around a container. Default: vector.

Operations are defined in terms of the standard functions
make_heap, push_heap,
and pop_heap.

std::flat_map

A light wrapper around a container. Default: vector.

Operations are defined in terms of the standard functions
lower_bound, upper_bound, and equal_range.
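For illustration, here is how a flat_set-style unique insert might be defined in terms of std::lower_bound. This is a sketch against a plain vector; the names are illustrative, not taken from the proposal:

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Unique insert into a sorted vector, in the style a flat container
// adaptor could use internally.
template<typename T>
std::pair<typename std::vector<T>::iterator, bool>
flat_insert(std::vector<T>& c, T value)
{
    auto const it = std::lower_bound(c.begin(), c.end(), value);
    if(it != c.end() && !(value < *it))
        return { it, false };  // equivalent key already present
    return { c.insert(it, std::move(value)), true };
}
```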

The ability to use static_vector and small_vector with flat containers
is a major selling point of this proposal. The container adaptor interface
does not add complexity or performance penalties to the library's
implementation.

User-defined reference proxies

Users may want to use their own reference proxy containers with flat
containers. For example, consider a container which attempts to implement the
"two separate vectors" idea discussed in the Internal algorithms
section:
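One possible shape for such a container is sketched below. This is purely hypothetical (all names are invented here): two parallel vectors plus a reference proxy that keeps keys immutable while exposing mapped values for mutation.

```cpp
#include <cstddef>
#include <vector>

// Minimal "structure of arrays" container: keys and mapped values live
// in two separate vectors. Element access yields a reference proxy
// instead of a std::pair&.
template<typename K, typename V>
struct paired_vectors
{
    std::vector<K> keys;
    std::vector<V> values;

    struct reference  // proxy returned on element access
    {
        K const& first;  // keys stay immutable through the proxy
        V& second;       // mapped values remain mutable
    };

    std::size_t size() const { return keys.size(); }

    reference operator[](std::size_t i) { return { keys[i], values[i] }; }

    void push_back(K key, V value)
    {
        keys.push_back(std::move(key));
        values.push_back(std::move(value));
    }
};
```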

If possible, custom value types should be permitted by
the standard for use in flat containers.

Public container data member

The standard library's current container adaptors are not well-loved,
and this is largely due to them having less functionality than the
containers that they adapt. The fact is, there are very few cases where
stack and queue are the ideal container choice. Far more often it makes
sense to use vector and deque directly and gain access to all the extra
functionality they provide. The current container adaptors are practically
useless.

With a desire to make flat containers useful, the natural solution
is to make their adapted container member publicly visible. This comes
at no loss to safety; unsortedness would exist in the design regardless of
whether or not container was public. There are no synchronization issues
to worry about as container can be the one and only data member of the
flat container classes. And by making container a public data member,
several advantages are to be had.

Case 1: Delayed sorting

Associative containers often follow this access pattern:

Store a bunch of values.

Wait.

Retrieve a bunch of values.

An efficient way of storing multiple values into a flat container
is to insert the values at the end and sort the container afterwards.
For cases involving a single range, the flat container's insert function can
be implemented to have this behavior:
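A sketch of such a range insert, operating directly on the underlying vector (illustrative only; unique keys assumed):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Append the new range at the end, sort just that tail, merge it with
// the already-sorted front, then drop duplicates.
template<typename It>
void flat_insert_range(std::vector<int>& c, It first, It last)
{
    auto const mid = static_cast<std::ptrdiff_t>(c.size());
    c.insert(c.end(), first, last);
    std::sort(c.begin() + mid, c.end());
    std::inplace_merge(c.begin(), c.begin() + mid, c.end());
    c.erase(std::unique(c.begin(), c.end()), c.end());
}
```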

Unfortunately, not all real-world situations have such clean-cut insertion
ranges, and it is not always desirable for the container to sort itself after
every insertion. For cases such as these, the public
container data member offers a solution:
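The pattern looks like this; the adaptor is stood in for by a hypothetical struct with a public container member, just enough to show the idea:

```cpp
#include <algorithm>
#include <vector>

// Hypothetical stand-in for the flat container adaptor.
struct int_flat_set { std::vector<int> container; };

// Delayed sorting: push values straight into 'container' as they
// arrive, then restore the sorted-unique invariant once at the end.
void bulk_insert(int_flat_set& set, std::vector<int> const& incoming)
{
    for(int v : incoming)
        set.container.push_back(v);
    std::sort(set.container.begin(), set.container.end());
    set.container.erase(
        std::unique(set.container.begin(), set.container.end()),
        set.container.end());
}
```

The sort cost is paid once per batch rather than once per element, which is exactly the access pattern described above.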

Case 2: Access to member functions

Functions such as reserve and max_size are not part of flat
container interfaces. This is not a problem with a public container
data member.

map.container.reserve(1000);

Flat containers are container adaptors and so any type of vector-like
container can be used. Access to container allows users of
custom vector implementations to use whatever member functions they need.

Occasionally it may be desirable to iterate a container using
indexes rather than iterators. While it is possible to do this using the
operator[] of iterators, using container for this is far easier to grasp.

Are there any downsides to container?

Nope!

Unsorted states exist in the design for iterators, not container. Removing
container would not remove unsorted states. The container member is the
one and only member of flat container classes. There are no variable
out-of-sync issues to worry about.