Summary

During the discussion of coroutines in a Tuesday session in Jacksonville 2018, we had the following exchange:

Richard Smith: I have concerns about the generator example. ... The optimization
relies on accumulate being inlinable or optimizable. ...
You need to inline at least begin to know that the value of the handle is not changed.

Gor Nishanov: Let's take this offline and come back with the joint response.

This document is that join response. It evaluates the necessary conditions for heap allocation elision
to work. After careful study, we find that:

Heap allocation elision optimization DOES NOT require inlining a potentially unbounded amount of code, such as the coroutine body OR the algorithms using the coroutine (such as accumulate in the generator example).

It DOES REQUIRE inlining some compiler synthesized code (such as the coroutine ramp) and inlining a bounded amount of wrapper/glue code that gives the coroutine the desired semantics (such as begin in the generator example). Specifically, we need to inline (or analyze interprocedurally) sufficient code to conclude that all execution paths that create the coroutine and then leave the calling function will include a call to destroy on a coroutine_handle denoting the coroutine (typically in the destructor of the coroutine type).

With the implementation of coroutines in clang today, inlining the coroutine ramp function requires the coroutine to be defined in the same translation unit. We believe that alternative implementation strategies may allow this optimization to apply across ABI boundaries when the definition of the coroutine is not available in the current translation unit, but at this time have not analyzed the tradeoffs of such approaches.

The rest of the document goes over details of what exactly needs to be inlined and explores generator and task examples.

Terminology

Coroutine frame: a memory location where state of the coroutine that has to be preserved across suspend points is stored. The coroutine frame can reside on the heap or on the stack, depending on whether coroutine heap elision is in effect or whether a custom allocator is provided by the user to override default allocation of the coroutine state.

User-authored coroutine body: the function-body of the coroutine.

Coroutine state machine: a compiler created transformation of the user-authored coroutine body

Coroutine ramp function: a compiler synthesized ramp/trampoline/thunk that creates an initial coroutine state and starts the coroutine state machine.

Coroutine handle: an object of a coroutine_handle type that points at a coroutine state machine.

Coroutine type: a wrapper around the coroutine handle that has the necessary bindings to give the coroutine the desired semantics. For example: a generator<T> or a task<T>.

Coroutine escaping its caller: an operation within the caller of the coroutine that could make the address of the object of coroutine type visible outside the caller, excluding such cases where the compiler can prove that such escaping does not occur. For example: storing the address of the object of coroutine type into a global or through an externally-supplied pointer, or calling a member function on the object of coroutine type whose body is not available for examination (or is not examined due to optimizer limitations).

For the coroutine heap allocation elision optimization to work we need the following to be available
for inlining:

the coroutine ramp function

get_return_object()

generator<int>::begin()

the constructor and move constructor of generator<int>

the destructor of generator<int>

coroutine_handle<>::destroy (either a compiler builtin or inlineable to a compiler builtin)

We DO NOT need to inline the body of the coroutine or the algorithm accumulate,
unless the iterator type retains a pointer or reference to the generator<int>
object.

Note that authors of coroutine types should be aware of this point: at least one
blogger has described an implementation of generator<T>::iterator that holds a
generator<T>*. However, every implementation of generator<T> we can find in the
wild holds a coroutine_handle<Promise> instead, which avoids the problem, and
allows heap allocation elision without inlining accumulate.

To give a feel of how big the functions that need to be inlined are, here is a typical generator implementation:

Generator case with the Ranges TS

Under the Ranges TS, we expect that generators will be passed by reference to algorithms.
If these algorithms extract the begin / end iterators from their ranges and then call
other functions, only the range wrapper function need be inlined. For example, given:

Heap allocation elision optimization in all_of(range(1, 5), pred) only requires that
the wrapper version of all_of be inlined in addition to those functions identified above.
The algorithm implementation does not need to be inlined.

Task example

Let's consider the big_task example from P0978R0 with a small addition of a call to an executor (as the question was raised whether a coroutine posted to execute on an executor can have its allocation elided).

Conclusions

We find that the functions that need to be inlined for coroutine heap allocation elision optimization
to work are tiny and we would want them inlined anyway irrespective of whether they are needed for
heap allocation elision to work or not. We anticipate that optimizers will reliably perform the
optimization on code patterns similar to those above, but that allocations will be performed for
code patterns where the object of coroutine type escapes or when compiling without optimization.

Some constraints are imposed on authors of coroutine types and on consumers of coroutine types.
Code should avoid unnecessarily retaining pointers and references to the coroutine type,
by extracting the necessary information from it in an inlineable wrapper. These constraints are
straightforward to satisfy in the cases we have examined, but are novel.