N3710: Specifying the absence of "out of thin air" results (LWG2265)

The C++11 memory model specified the semantics of shared ordinary
and atomic objects. There is one small part of this specification that
we always recognized was dubious. The specification in 1.10 of weakly
ordered atomics allows an atomic load to essentially see, or not see,
the result of any racing store. In the case of explicit very
weak memory ordering specifications, this can lead to "causal cycles" and
highly counterintuitive results. Consider the following example, where
x and y are atomic variables initialized to zero, and ri are
local variables:

Effectively Thread 1 copies x to y,
and thread 2 copies y to x. The section
1.10 specification allows each load to see either the initializing store
of zero, or the store in the other thread.

This famously allows
both r1 and r2 to have final values of 42, or
any other "out of thin air" value. This occurs if each load sees
the store in the other thread. It effectively models an execution
in which the compiler speculates that both atomic variables will have a value
of 42, speculatively stores the resulting values, and then performs the
loads to confirm that the speculation was correct and nothing needs to be
undone.

No known implementations actually produce such results. However,
it is extraordinarily hard to write specifications that present them
without also preventing legitimate compiler and hardware optimizations.
As a first indication of the complexity, note that the following variation
of the preceding example should ideally allow
x = y = 42, and some existing implementations
can produce such a result:

In this case, the load in each thread actually can see the store in the
other thread, without problems. The two operations in thread 2 are
independent and unordered, so either the compiler or hardware can reorder them.

Essentially this issue has been an open issue in the Java specification
for about 10 years. The major advantage that we have in C++ is that the
problem is confined to non-synchronizing atomics, i.e.
memory_order_relaxed, and some
memory_order_consume uses (or read-modify-write operations that
effectively weaken the ordering on either the read or write to one of
those). Many of us expect those to be rarely used.

The current standard addresses this in 29.3p9-11, using words that
none of us ever considered rock-solid. (A 2007 or so discussion leading
up to the current wording can be found
here.)
Unfortunately, we underestimated
both the impact of this issue, and the problems with the current wording.
In the meantime, a number of people, including Peter Sewell, Mark Batty,
and more recently Brian Demsky have pointed out issues with the current wording.

This proposal is an attempt to repair the specification, at
non-negligible cost relative to our original intent, though
not so when compared to the literal text of the current document.
This change would effectively imply
stronger semantics for memory_order_relaxed, which
would imply additional overhead for memory_order_relaxed
on a number of architectures, including ARM, current Nvidia architectures,
and possibly POWER. Our proposal effectively conflates the
above two examples, forbidding the x = y = 42
outcome in both, in order to give us cleanly definable semantics.

Such restrictions on memory_order_relaxed are however not
unprecedented; the current cache coherence requirements already imply
that ordinary loads and stores do not suffice for
memory_order_relaxed on Itanium. (See
Jaroslav Sevcik and Peter Sewell's table of C++11 atomic
mappings.) Indeed, there is no
a priori reason to believe that load and store instructions designed
for regular non-racing accesses should be directly suitable for
atomic access.

We try to mitigate the required performance impact on these architectures
by weakening the intended semantics from a hard requirement to
normative encouragement. The intent is that this should provide
hardware vendors the time required to better support the revised semantics.
We also propose to provide an alternative to memory_order_relaxed
accesses for non-racing contexts, in which both the existing and proposed
requirements are too strong.

We reiterate that this added overhead is relative to the
intent of the C++11 standard,
not relative to the actual specification. The actual specification, as
currently worded, seems to unintentionally require
memory_order_relaxed to be implemented as something
close to memory_order_seq_cst, and thus technically
imposes a much larger penalty, as pointed out in
LWG2265.

This proposal benefitted from a lengthy email discussion involving
participants listed at the beginning of this paper, and others.
This provided very useful
input resulting in improvements to this proposal. It unfortunately
resulted in neither a clear consensus in that group,
nor an alternate proposal with clearly better
consensus behind it.
Not all of the listed contributors necessarily support this proposal.
Since I don't believe there is support for maintaining the status
quo, I nonetheless bring forward this proposal.
Mark Batty and Peter Sewell are currently considering submitting additional
late papers on this topic, including an alternate proposal.

Why it's important to prohibit out-of-thin-air results

In a Java context, it is necessary to prohibit out-of-thin-air results
to be able to show that untrusted code cannot obtain access to
certain information. If untrusted code could somehow provoke the
implementation into generating the user's secret password "out of thin air",
that would be bad. Since C++ does not attempt to support running
untrusted code as part of a trusted executable, that's much less of an issue
here.

However, as Brian Demsky and others have pointed out, allowing
out-of-thin-air results also makes it very hard to reason formally, or
perhaps even informally, about ordinary C++ code.

To see the problem, consider a program P that only ever assigns 0 or 1
to the atomic variables x and y,
though it does sometimes copy from one to the other using weakly
ordered operations.
Assume we have a library function f(),
whose specification requires
that its first argument be 0 or 1, and its specification
says that if f()
is called correctly it doesn't modify either global variables
or its second by-reference argument.
Let's say this program calls f(x.load(memory_order_relaxed), y).

We would expect this to be sufficient to prove that the first argument
to f is always 0 or 1. But without an out-of-thin-air
guarantee it is not.
If thread 1 copies y to x using
memory_order_relaxed operations, while thread 2 calls
f(x.load(memory_order_relaxed), y), we still need to preclude
cases like the one in which thread 1 sees y = 42,
copies 42 to x, causing thread 2 to pass 42 to f(),
which, now operating outside its specification, stores 42 in y,
thus justifying thread 1's load of 42.

This kind of argument is clearly impossible in any realistic setting in
which f() is a library function for which we have
no idea how it behaves if its preconditions are not met.
This necessity for reasoning about such "impossible" paths is
what makes it untenable to allow any kind of out-of-thin-air results.

Why it's hard to prohibit out-of-thin-air results

As the above illustrates, out-of-thin-air values occur when there are
cyclic "dependency chains" between values. Thus if we could define
"dependency chains", we could define out-of-thin-air values,
prohibit them, and we should be done.

This is essentially the approach taken by hardware memory models.
They prohibit stores from becoming visible to other threads before the
loads on which they depend, and this is sufficient to prevent
out-of-thin-air values at the hardware level. At the hardware level,
it is relatively straightforward to define a notion of one value
"depending" on another; the approach is similar in spirit to that
taken by our definition of "carries a dependency" in
1.10 [intro.multithread]. Essentially anytime a is an argument
to an operation that computes b, b is taken to depend on
a, and similarly if b is computed by a chain of such
operations. Unlike our "carries a dependency" formulation, hardware
models usually also consider b to depend on a if a branch
preceding the computation of b depended on a.
(Itanium is an exception, but not in a way that matters here.)

Hardware memory models then implicitly outlaw out-of-thin-air values
by requiring that dependent operations become visible to other threads in
order. Since the hardware notion of dependency covers all cases in which
one value can affect another, this suffices to solve the problem.

Our existing "carries a dependency" definition does not similarly suffice,
since it covers only a subset of the ways in which one value can affect
another. For example, an evaluation inside the condition of an if-statement
does not carry a dependency to evaluations in the then- or else-statement,
even though it affects their values. Even if we ensured ordering whenever
a carries a dependency to b, we could not preclude the
problem case from the last section, since the argument evaluation of
x.load(...) may not carry a dependency to the store to
y inside f(), for example because the assignment
to y assigns a constant to it, but the code to do so is inside
a conditional whose condition is false if the first argument to
f() is less than 2.

To address this problem as the hardware does,
we would have to generalize "carries a dependency"
to include all methods by which one value can affect another, including
control dependencies, and then insist that all such dependencies enforce
ordering. This appears impractical, since compilers routinely remove
such dependencies as part of optimization. Consider the code segment

if (x) {
    a[i++] = 1;
} else {
    a[i++] = 2;
}

Any reasonable compiler will move the computation of
&(a[i++]) out of the conditional, thereby removing
the dependency of i on x.
Since relevant dependencies may pass through separately compiled functions
that don't mention atomics, we would have to prohibit such optimization
in all parts of the program, at a cost that I would expect to be much higher
than our proposal below. We would also have to require recompilation of
existing code not mentioning atomics to get the right semantics for atomics.

The fact that compilation can remove such syntactic dependencies, thus
hiding them from the hardware, makes this problem fundamentally
different at the programming language level from the hardware level.

Note that such elimination of dependencies does not actually introduce
out-of-thin-air results. It is simply at odds with this specification
technique for prohibiting them. The elimination of the syntactic dependency
of i on x simply reflects the fact that
the value of i doesn't really change as a result of a change of
x. Thus we can try to define a more "dynamic" notion
of dependency that reflects this fact.

But that seems to be a fundamentally undefinable notion, and also not
one that has been attempted by any hardware specifications. To illustrate
the difficulty, consider the function definition:

int f(int x, int y) { return x - y; }

Does the result "dynamically depend on" the input x?
If f() is only called in contexts in which the
arguments are equal, and the compiler determines that, e.g. by
inlining f(), it may not. Even if that's not always the case,
the compiler may effectively clone a copy of f(),
typically again by inlining, to handle the one call in which the arguments
are equal. Mark Batty and Peter Sewell point out that such a notion is
fundamentally not definable by looking at just a single execution.
Thus, at a minimum, we are again in a position in which we need to understand
other hypothetical executions in order to understand program
behavior on a given input. Even then, it is not clear that any such
definition could correctly deal with clones or specializations of
functions.

Informally it seems clear that defining the correct notion of "dependency"
is equivalent to defining "out-of-thin-air", and it appears that neither is
possible. Thus we are willing to propose a fairly drastic, of necessity
long term, direction in order to sidestep the problem.

The immediate problem

Brian Demsky and others point out
in
LWG2265 that the current wording intended to avoid "out-of-thin-air"
values is practically unimplementable
on a number of important weakly-ordered architectures. The current
wording essentially requires that memory_order_relaxed
be implemented as something much too close to
memory_order_seq_cst, effectively defeating the point of
introducing weakly ordered memory operations.
This is an accident arising from our earlier efforts to at least
partially address the "out-of-thin-air" problem.

The particular example in LWG2265 seems to be difficult to analyze.
But there seems to be agreement that the following (somewhat abbreviated)
variant of Brian Demsky's example, when compiled normally on
POWER or ARM, can cause the T1 and
T2 stores to be seen in different orders by T3 and T4 (as intended),
in spite of that being accidentally disallowed by 29.3p9 [atomics.order], which
requires a "finite sequence of program evaluations" to produce
the value stored at line 32:

To produce this result, the required sequence of evaluations must
have T1's and T2's assignments before T3's and T4's initial loads respectively,
and the order of the loads in each of T3 and T4 must be preserved. If T4's
initial load follows T3's, then T4's final load must see a 1, and similarly
if T3's load is last. Thus the sequence, as required by reasonable
interpretations of the current text, can't exist.

On the other hand, preventing this outcome on POWER or ARM requires a
heavyweight fence between the two loads in each of T3 and T4,
forcing the loads to be treated more or less as sequentially consistent,
completely defeating the purpose of memory_order_relaxed.
And we do not believe that a typical compiler can usually see enough
context to compile memory_order_relaxed as intended,
even for more realistic examples.

As we pointed out in the last sections, there is no known way to fix this,
and it seems very unlikely, and increasingly so, that we are overlooking
a painless solution.

The proposed long-term solution

We propose a fairly drastic change of wording, which effectively
prevents a memory_order_relaxed
(or memory_order_consume) load from being reordered
with a subsequent memory_order_relaxed store,
and imposes corresponding restrictions on some very weakly ordered
atomic read-modify-write operations. This is easily expressed
as the restriction that the "write seen by" (inverse "reads-from")
relation on atomics,
together with the "happens-before" relation, cannot form a cycle or,
more formally, that the transitive closure of the union of the two relations is
irreflexive. This clearly prohibits any sort of causal cycle;
we can no longer follow a cyclic sequence of dependencies, since all
such dependencies are clearly reflected in either a write-seen-by relationship
or a happens-before relationship (or a data race).
But it unfortunately also prohibits
the second example we
gave at the beginning, in which one thread stores a constant.
Instead of enforcing simply dependency ordering between a load
and a later store, as the hardware does, we are treating every relaxed
store as though it were dependent on every prior load, thus
dodging the issue of defining dependencies.

Although details depend on specific machine memory models,
the above restriction can generally be enforced by preventing load
to store reordering
of atomics. To see that, assume we have a path consisting of
a sequence of write-seen-by
and happens-before relationships. Each happens-before edge
itself can be decomposed into synchronizes-with and
sequenced-before edges. (Ignore memory_order_consume for now.)
The cycle thus consists of intra-thread segments, connected by
write-seen-by or synchronizes-with edges. Each intra-thread
segment must start with a load or acquire operation, and end with
a release or store operation. In either case, ordering is preserved.
Thus every intra-thread segment must be completed after it started,
and the same applies to the synchronization edges. Thus we can't
get back to an earlier action along the path, and a cycle is impossible.

(If such a path involves memory_order_consume, the
same path will still exist if we ignore the additional properties
of memory_order_consume and treat it as
memory_order_relaxed. The happens-before relationships
induced by memory_order_consume can be replaced by the
write-seen-by relationship between the memory_order_release
store and memory_order_consume load, with sequenced-before
taking the place of carries-a-dependency-to. Thus the same
argument continues to apply.)

Performance consequences

Since our proposal does not require any enforcement of dependencies,
it affects only code that uses weakly ordered atomic operations.
The added guarantee already holds for code that consistently uses
at least memory_order_acquire for loads,
memory_order_release for stores and
memory_order_acq_rel for read-modify-write operations.

On many architectures, a memory_order_relaxed load
followed by a memory_order_relaxed store is already ordered,
provided the compiler refrains from rearranging them. The latter is usually
a tiny cost, since there is rarely strong motivation for a compiler
to reorder in that direction. These architectures include TSO architectures
like x86. They include Itanium because we already imposed restrictions
on memory_order_relaxed that prevent Itanium from using
ordinary loads and stores. They appear to include current implementations
of POWER, though not the POWER architecture as officially specified.

The major architectures on which we add appreciable overhead for
memory_order_relaxed appear to be

ARM (both v8 and earlier)

GPU architectures

(GPU architectures are relevant here, since OpenCL has adopted essentially
the C++ memory model, and we would like to preserve that consistency.)

On ARM (and according to the POWER spec), we can enforce the required
ordering by adding a never-taken conditional branch that tests the result
of the memory_order_relaxed load after every such load,
and before the next memory_order_relaxed atomic store
to a different location. The branch should either never actually be taken
or target the next instruction. (Again memory_order_consume
might require similar treatment if it is not already compiled
as memory_order_acquire. Read-modify-write
operations with weaker than memory_order_acquire
ordering also need to be treated correspondingly.)
This is believed to be considerably less expensive than adding
a fence instruction, especially since the compiler may be able to
postpone the branch significantly, thus hopefully avoiding a stall
waiting for the load to complete.

On current ARM processors, this technique may have the negative effect
of effectively wasting a branch-prediction slot. We guess that this should
be avoidable in future revisions of the architecture.

The overhead seems to be potentially much larger on current Nvidia GPUs,
since we have to wait for the load to complete, and that typically involves
waiting for an actual memory access. However, we hope and expect that typically
this wait can be delayed until the result of the load is needed anyway,
thus hopefully avoiding much of the overhead.

Since this does impact performance on some architectures, our goal is to

Weaken the current requirement to a more vaguely worded recommendation, and

Clearly state an intended future direction, which has the properties we desire, but may not be efficiently implementable on current hardware.

Note that even the future direction has less implementation impact than
the current unintentional restrictions.

Proposed wording for 29.3

The phrasing explicitly allows vendors to either find an alternate way to
address the problem or, at least in the short term, ignore the recommendation
completely. But the hope is that, in the absence of a very insightful
alternate vendor specification, the recommended direction will eventually
become a requirement.

Proposed wording

Change 29.3p9-11 [atomics.order] as follows, deleting the existing wording
(quoted next) and replacing it with the text that follows it:

An atomic store shall only store a value that has been computed from
constants and program input values by a finite sequence of program
evaluations, such that each evaluation observes the values of variables as
computed by the last prior assignment in the sequence. The ordering of
evaluations in this sequence shall be such that:

if an evaluation B observes a value computed by
A in a different thread, then B does not happen
before A, and

if an evaluation A is included in the sequence, then every
evaluation that assigns to the same variable and happens before A is included.

It is implementation-defined when, for an
atomic access a sequenced before an
atomic update b,
there is a dependency from a to b.
[ Note: This definition should include all cases in
which the value read by a affects that stored at
b. Implementations are encouraged to conservatively define there to
be a dependency from a to b whenever a is
sequenced before b, since alternatives are difficult to
define meaningfully and precisely. --end note ]
[Note: This differs from the
"carries a dependency" terminology in 1.10 in that "carries a
dependency" describes a proper subset of such possible information
flows. --end note. ]

We say that an evaluation A causally precedes an
evaluation B if

A synchronizes with B, or

The evaluation B observes the value stored by
modification A, or

There is a dependency from atomic access A to atomic
update B, or

There is an evaluation C such that
A causally precedes C, and C causally precedes B.

Implementations should ensure that no evaluation causally
precedes itself.

[ Note: This recommendation can normally be implemented by ensuring that
an atomic load operation is never reordered with a later atomic
store operation performed by the same thread.
It avoids "out-of-thin-air" or "speculative"
stores of atomics when relaxed atomics are used.
For example, with x and y initially zero,
the first example given at the beginning of this paper
(its statements labeled (1) and (2) in thread 1, and (3) and (4) in thread 2)
may not produce r1 = r2 = 42,
since that would require (4)
to provide the value for (1), which is sequenced before (2),
which provides the value for (3), which is sequenced before (4),
resulting in cyclic causally-precedes relationships.
In the absence of "relaxed"
operations and read-modify-write operations with weaker than
memory_order_acq_rel ordering,
this recommendation has no impact. -- end note ]

[ Note: The recommendation similarly disallows the x = y = 42
outcome in the second example given at the beginning of this paper,
in which one thread stores a constant. -- end note ]

Non-atomic loads and stores of atomics

As we pointed out, the above removes unintended requirements on
memory_order_relaxed, while adding a new requirement,
load-to-store ordering, that was not part of the original design.
This slows down memory_order_relaxed on some platforms
relative to what we initially thought was possible, though not relative to
what the standard unintentionally requires.

A large fraction of memory_order_relaxed use cases are those
in which the access to the atomic variable is in fact either known not
to race with any other accesses, or will be discarded unless we can determine
that there was no race. These arguably cover the safest and simplest use
cases, e.g. the second "check" in double-checked locking, and reads inside
a seqlock reader critical section.

We can regain the performance in these cases, and sometimes gain a bit
of additional performance, by supporting these cases with separate kinds
of accesses. Since these cases do not really apply to read-modify-write
operations, it makes sense to add functions
non_atomic_load() and non_atomic_store()
analogous to
load() and store() instead
of new memory_order_ values. Since these
operations cannot meaningfully be used in racy contexts, they may be
performed in multiple steps, e.g. a byte at a time. In a lock-based
implementation, they do not need to acquire the lock. They do not
need to enforce cache coherence, and hence are subject to weaker
optimization rules. On Itanium, they may, unlike
memory_order_relaxed, use ordinary load and store
operations.

We introduce a helper class race_or<T>
to express the result of a speculative
non_atomic_load() operation that might
have raced, but whose result will only be examined in the
absence of a race. Logically race_or<T>
either contains a T value, or a special "race"
value. There is no way to test which case applies, and
conversion to T yields undefined behavior if it
contained a race value.

This allows us to call non_atomic_load() and
then check that no race occurred, e.g. by examining a version number
to confirm that there was no intervening store, as is done
for seqlocks (cf.
my
recent MSPC paper), and in many transactional memory implementations.

There have been requests for supporting non-atomic operations on atomic
values before. And that may be a requirement for using C++11/C11
atomics in the Linux kernel, for example. Previously our counterargument
was "memory_order_relaxed may be close enough; we'll reconsider
if it isn't". This may be a good time to reconsider.

I don't consider this addition essential at this point, provided we
adopt the weak wording suggested above. I do believe a change along these
lines is required before we make load-to-store ordering a hard requirement,
if and when we decide to do that.

Tentative wording for non-atomic loads and stores

Disclaimer: I am not qualified to write wording for new library classes.
This didn't stop me. LWG help is requested.

This is a helper class for the non_atomic_load functions.
It represents a value that was potentially speculatively loaded
and should not be examined before establishing that the load did
not participate in a race.
The type of the template argument T shall be trivially copyable.

A value of type race_or<T> represents either a value
of type T or a race indication, representing the fact
that the value is not uniquely determined.
[Note: There is no way to test a race_or<T> value
to determine whether it
contains a T value or a race indication. Implementations will
typically represent race_or<T> as a T.
A race_or value can only be generated by
copying another one or as the result produced by one of the
non_atomic_load functions. In the latter case, it
provides the opportunity for the calling code to check whether a data
race occurred, for example by checking a version number
incremented by writers,
after performing the load but before performing the conversion.
-- end note.]

race_or() = default;

Effects: Initializes the object to a race indication.

operator T() const;

Requires: The object does not contain a race indication.

Returns: The contained type T object.

Add the following member functions to the synopsis at the beginning
of 29.5 [atomics.types.generic]:

Effects: For purposes of (1.10 [intro.multithread]) this is an
atomic operation, in spite of its name. It cannot, by itself, introduce
a data race that results in undefined behavior. [Note:
The indication that such a data race would otherwise have occurred is
instead captured in the return value, and deferred until that
value is converted to T. --end note]
[Note: Unlike an ordinary
value computation, the implementation must respect the fact that the
value being loaded may change while it's being loaded, and this must
not result in any visible effects unless the result is subsequently
converted to a T. A data-race-checking implementation should
explicitly distinguish the two cases in the race_or<T> representation,
so that the race can be reported when the conversion occurs.
-- end note]

Returns: If the visible sequence of side effects with respect
to this value computation of *object or *this
(1.10 [intro.multithread]) contains exactly one such side effect, then
the resulting race_or object will contain the value stored by this side
effect. Otherwise it will contain an indication that the load involved
a race.