Tuesday, 27 July 2010

Dekker's
algorithm is one of the most basic algorithms for mutual
exclusion,
alongside Peterson's
algorithm
and Lamport's
bakery algorithm. It has the nice property that it only requires
load and store operations rather than exchange or test-and-set, but
it still requires some level of ordering between the operations. On
a weakly-ordered architecture such as PowerPC or SPARC, a correct
implementation of Dekker's algorithm thus requires the use of fences
or memory barriers in order to ensure correct operation.

The code

For those of you who just want the code: here it is —
Dekker's algorithm in C++, with explicit fences.
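The original listing is not reproduced here, so what follows is a sketch reconstructed from the analysis below: C++11 atomics with explicit fences, using the names flag0, flag1, turn, p0 and p1 that the analysis refers to. Passing the critical section in as a function pointer is my addition, purely so the functions are self-contained.

```cpp
#include <atomic>

std::atomic<bool> flag0(false), flag1(false);
std::atomic<int> turn(0);

void p0(void (*critical_section)())
{
    flag0.store(true, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);

    while (flag1.load(std::memory_order_relaxed))
    {
        if (turn.load(std::memory_order_relaxed) != 0)
        {
            // Not our turn: back off and wait.
            flag0.store(false, std::memory_order_relaxed);
            while (turn.load(std::memory_order_relaxed) != 0)
            {
            }
            flag0.store(true, std::memory_order_relaxed);
            // The second seq_cst fence: see "Second time around" below.
            std::atomic_thread_fence(std::memory_order_seq_cst);
        }
    }
    std::atomic_thread_fence(std::memory_order_acquire);

    critical_section();

    turn.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release);
    flag0.store(false, std::memory_order_relaxed);
}

void p1(void (*critical_section)())
{
    // Symmetric to p0, with the flags and turn values swapped.
    flag1.store(true, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);

    while (flag0.load(std::memory_order_relaxed))
    {
        if (turn.load(std::memory_order_relaxed) != 1)
        {
            flag1.store(false, std::memory_order_relaxed);
            while (turn.load(std::memory_order_relaxed) != 1)
            {
            }
            flag1.store(true, std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_seq_cst);
        }
    }
    std::atomic_thread_fence(std::memory_order_acquire);

    critical_section();

    turn.store(0, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release);
    flag1.store(false, std::memory_order_relaxed);
}
```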

The analysis

If you're like me then you'll be interested in why stuff
works, rather than just taking the code. Here is my analysis of the
required orderings, and how the fences guarantee those orderings.

Suppose thread 0 and thread 1 enter p0
and p1 respectively at the same time. They both set their
respective flags to true, execute the fence and then read
the other flag at the start of the while loop.

If both threads read false then both will enter the
critical section, and the algorithm doesn't work. It is the job of the
fences to ensure that this doesn't happen.

The fences are marked with memory_order_seq_cst, so either the
fence in p0 is before the fence in p1 in the global ordering of
memory_order_seq_cst operations, or vice-versa. Without
loss of generality, we can assume that the fence in p0
comes before the fence in p1, since the code is
symmetric. The store to flag0 is sequenced before the
fence in p0, and the fence in p1 is
sequenced before the read from flag0. Therefore the
read from flag0 must see the value stored
(true), so p1 will enter
the while loop.

On the other side, there is no such guarantee for the read
from flag1 in p0, so p0 may or
may not enter the while loop. If p0 reads
the value of false for flag1, it will not
enter the while loop, and will instead enter the critical
section, but that is OK since p1 has entered
the while loop.

Though flag0 is not set to false until p0 has finished the critical section, we need to ensure that p1 does not see this store until the values modified in the critical section are also visible to p1. This is the purpose of the release fence prior to the store to flag0 and the acquire fence after the while loop. When p1 reads the value false from flag0 in order to exit the while loop, it must be reading the value stored by p0 after the release fence at the end of the critical section. The acquire fence after the load then guarantees that all the values written before the release fence are visible, which is exactly what we need here.
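This release-fence/acquire-fence pairing can be seen in isolation in a minimal message-passing sketch (the names data, ready, writer and reader are mine, not from the algorithm):

```cpp
#include <atomic>

int data = 0;                    // plain, non-atomic payload
std::atomic<bool> ready(false);

void writer()
{
    data = 42;                                            // the "critical section" write
    std::atomic_thread_fence(std::memory_order_release);  // release fence before the store
    ready.store(true, std::memory_order_relaxed);
}

int reader()
{
    while (!ready.load(std::memory_order_relaxed))
    {
    }
    std::atomic_thread_fence(std::memory_order_acquire);  // acquire fence after the load
    // The acquire fence synchronizes with the release fence in writer(),
    // so the write to data is guaranteed to be visible here.
    return data;
}
```

The fences synchronize with each other through the relaxed store and load of ready, just as the fences in Dekker's algorithm synchronize through flag0.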

If p0 reads true for flag1,
it will enter the while loop rather than the critical
section. Both threads are now looping, so we need a way to ensure
that exactly one of them breaks out. This is the purpose of
the turn variable. Initially, turn is 0,
so p1 will enter the if and
set flag1 to false, whilst p0
will not enter the if. Because p1 set
flag1 to false, eventually p0
will read flag1 as false and exit the
outer while loop to enter the critical section. On the
other hand, p1 is now stuck in the
inner while loop because turn is
0. When p0 exits the critical section it
sets turn to 1. This will eventually be seen
by p1, allowing it to exit the inner while
loop. When the store to
flag0 becomes visible p1 can then exit the
outer while loop too.

If turn had been 1 initially (because p0
was the last thread to enter the critical section) then the inverse
logic would apply, and p0 would enter the inner loop,
allowing p1 to enter the critical section first.

Second time around

If p0 is called a second time whilst p1
is still in the inner loop then we have a similar situation to the
start of the function — p1 may exit the inner
loop and store true in flag1
whilst p0 stores true
in flag0. We therefore need the
second memory_order_seq_cst fence after the store to
the flag in the inner loop. This guarantees that
at least one of the threads will see the flag from the other thread
set when it executes the check in the outer loop. Without this fence
then both threads can read false, and both can enter
the critical section.

Alternatives

You could put the ordering constraints on the loads and stores
themselves rather than using fences. Indeed, the default memory
ordering for atomic operations in C++
is memory_order_seq_cst, so the algorithm would "just
work" with plain loads and stores to atomic variables. However, by
using memory_order_relaxed on the loads and stores we
can add fences to ensure we have exactly the ordering constraints
required.
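For comparison, here is a fence-free sketch of this alternative, relying on the default memory_order_seq_cst of plain operations on atomic variables; as above, passing the critical section as a function pointer is my addition:

```cpp
#include <atomic>

std::atomic<bool> flag0(false), flag1(false);
std::atomic<int> turn(0);

void p0(void (*critical_section)())
{
    flag0 = true;        // seq_cst store
    while (flag1)        // seq_cst load
    {
        if (turn != 0)
        {
            flag0 = false;
            while (turn != 0)
            {
            }
            flag0 = true;
        }
    }

    critical_section();

    turn = 1;
    flag0 = false;
}

void p1(void (*critical_section)())
{
    // Symmetric to p0, with the flags and turn values swapped.
    flag1 = true;
    while (flag0)
    {
        if (turn != 1)
        {
            flag1 = false;
            while (turn != 1)
            {
            }
            flag1 = true;
        }
    }

    critical_section();

    turn = 0;
    flag1 = false;
}
```

Every operation here is part of the single total order of memory_order_seq_cst operations, so no explicit fences are needed; the cost is that every access pays for full ordering, including the ones that the fence-based version leaves relaxed.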

The algorithm looks suspicious to me, though. There is a seq_cst fence between the store to flag0 and the load from flag1 on the main path. However, there is no such fence on the contention path. This can result in the following execution sequence:
T0:
(contention path)
flag0 = true;
load from flag1;

T1:
(main path)
flag1 = true;
seq_cst fence;
load from flag0;

Then T0 can load false from flag1, and T1 can load false from flag0.
What am I missing?

I know this post is a little old but I hope you will answer some of my questions :)

The part of your analysis I do not fully understand is this one:

"On the other side, there is no such guarantee for the read from flag1 in p0, so p0 may or may not enter the while loop. If p0 reads the value of false for flag1, it will not enter the while loop, and will instead enter the critical section, but that is OK since p1 has entered the while loop."

From my understanding, the std::memory_order_seq_cst fence will operate as an MFENCE: data after the fence will be up to date. So p1 will store its value and the fence will operate on the store buffer so that the value is updated in thread 0's (p0's) cache. Therefore, shouldn't p0 see the value?

In the example being discussed, we "know" that the fence in p0 occurs before the fence in p1. However, we have no information about whether the read from flag1 in p0 comes before or after the write to flag1 from p1 becomes visible to p0. It would be entirely possible for p0 to run to completion before p1's store to flag1 became visible.

A fence with memory_order_seq_cst does not enforce immediate visibility to other threads (and neither does an MFENCE instruction). The C++0x memory ordering constraints are just that --- ordering constraints. memory_order_seq_cst operations form a total order, but there are no restrictions on what that order is, except that it must be agreed on by all threads, and it must not violate other ordering constraints. In particular, threads may continue to see "stale" values for some time, provided they see values in an order consistent with the constraints.

I thought the draft C++0x standard rules around seq_cst imposed no ordering constraints between non-atomic (or relaxed) loads and stores. If that is the case, how does adding an explicit seq_cst fence (say, between the flag0.store and flag1.load in p0) add a store-load memory barrier?

I always thought this was a terrible gap in the standard that makes composing SC and non-SC code impossible to reason about. If I place an explicit seq_cst fence (not attached to a particular variable) then I would expect it to insert a MEMBAR #StoreLoad | #LoadLoad | #LoadStore | #StoreStore assuming SPARC RMO. That's how it should work. But I thought the standard did not allow it. What changed?

Note here I'm not using acquire-release semantics. Now I would expect that since the fences are globally ordered that it must NEVER be possible to see r1 == r2 == 0 after these threads execute.

For that matter, I could mark the loads and stores here with acquire and release semantics respectively and the result would still be the same. At least to me, that's intuitive behavior. But as I understood it, the standard did not guarantee this unless I missed an update somewhere.

Anthony, I reviewed your manuscript but I don't recall if you clear this up. Could you clear this up somewhere in the section on atomics, and maybe add this example and walk through it very carefully? I think this is the minimum code possible to show why seq_cst is necessary (intuitively at least), and it would be useful to walk through the above proof.

Now, it's clear that r1 == r2 == 0 is possible. For a simple existence proof, assume Thread 1 observes A and B in the order {B, A}. One possible global ordering is {B, C, D, A}. Obviously, Thread 1 must be self-consistent so {C, D} ordering must be preserved. But in this case Thread 1 observed A and B out of order and this resulted in r1 == r2 == 0.

I really think a simple example like this would demonstrate the differences between one-sided fences across a pair of threads and two-sided fences on a single thread and why both models are necessary depending on whether you need the #StoreLoad.
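The snippet under discussion is not quoted above, but the description matches the classic store-buffering litmus test. Here is a sketch with the seq_cst fences in place; the names x, y, r1 and r2 come from the comment, while the mapping of the labels A through D onto particular operations is my assumption:

```cpp
#include <atomic>

std::atomic<int> x(0), y(0);
int r1 = 0, r2 = 0;

void thread0()
{
    x.store(1, std::memory_order_relaxed);                // A: store to x
    std::atomic_thread_fence(std::memory_order_seq_cst);
    r1 = y.load(std::memory_order_relaxed);               // C: load from y
}

void thread1()
{
    y.store(1, std::memory_order_relaxed);                // B: store to y
    std::atomic_thread_fence(std::memory_order_seq_cst);
    r2 = x.load(std::memory_order_relaxed);               // D: load from x
}

// The two fences are ordered in the single total order of seq_cst
// operations. Whichever fence comes second guarantees that the load
// after it sees the store before the other fence, so after both
// threads finish, r1 == 0 && r2 == 0 is impossible. Weaken the fences
// to acquire/release (or remove them) and that outcome becomes legal,
// because acquire/release alone provides no #StoreLoad ordering.
```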

Wow, if it doesn't work this way I'd hate to be a compiler vendor supporting the C++11 memory model. ;-)