Additions to atomic<T>

Store and load are as they should be (you don't need an exclusive store to clear the monitor). Thanks for drawing my attention to the fact that ARM dropped the ball on clearing the monitor for abrupt control-flow changes (how about a recall?); it then falls on the code taking control (in this case the Linux kernel) to correct that, because there's nothing a user can do to absolutely guarantee correctness, is there, except perhaps do silly things to improve the odds at the expense of performance? 64-bit support can be done without a lock (I'll work on it over the weekend); if you want to try this now, reverse the strex result flag test (I had not noticed yet that it flags failure instead of success) and disable the 64-bit test in test_atomic (TBB only needs 64-bit atomics if a pointer is that size). The sign-extension issues are because "cmp" has no width argument: it always compares entire registers. Reliance on g++ intrinsics is not a net gain for TBB.

(Added) Please use new gcc_arm.h snapshot for testing (also disables 64-bit test in test_atomic), on top of previous diff.

Attachments:

Besides: I managed to run TBB on ARM using only the cmpxchg implementations. However, release builds tend to freeze after a couple of seconds; debug builds just need some more time. It might be related to the issue I mentioned previously. Am I right that standard TBB does not provide a way of specifying operations for atomic stores for all sizes (1, 2, 4, 8)?

I am off to try Raf's patch now, but it needs some more work. For instance, why do I have to specify the type and size of atomic operations for various types separately?:

__TBB_M(1,int8_t ,"b")
__TBB_M(2,int16_t,"h")
__TBB_M(4,int32_t,"" )

The first argument should always be sizeof() the second here, shouldn't it?

Regarding cmp, you are totally right. I was misled by your code, actually, because it allowed size modifiers to be appended to the cmp mnemonic. Now I suppose that was just to break compilation and catch problematic cases.

Raf, here comes an updated version for ARM:
- it compiles without the assembler complaining about disallowed registers, etc.
- it allows TBB to run my example without freezing
- no 64-bit support included, but I consider it useful (ABA-prevention things)

So in case you need it I vote for using your approach in TBB (at least for the ARM side).

--L

Attachments:

Thanks, and sorry for the bad proofreading (good catch on the test for exclusive-store failure!); I'll try to do better for the 64-bit code still to be added. I did restore the double duty of "intended", however, and rolled back the "memory" clobber (that's what "m"(*(T*)ptr) is for, because clobbering "memory" may prevent some valid optimisations, theoretically at least).

As I do not know enough about the internals of gas+gcc, I was afraid that an unused/unreferenced output parameter ("m"(*(T*)ptr)) might be ignored. Due to the bad experience with the 'stock' version I used the memory clobber for safety. But I totally agree that it lacks the possibility to specify a distinct address, so if the distinct output really works, it is clearly the better solution. Now going to play with the RealView compiler and TBB...

Some traction on PA-RISC/HP-UX/g++ (most of the tests and examples should work), added support for ARM/g++ (64-bit support ready for testing) and {ESA/390,z/Architecture}/g++, parallel_for examples now show up on PowerPC-based Mac OS X (x86-specific flags now only selectively added).

Some corrections and ideas for the scalable memory allocator, and tracking some changes by Intel's TBB team.

Corrections include: alignment guarantees, portability.

The ideas are about decreasing memory overhead (adding bins for sizes 2 and 4, eliminating most of the overhead for allocation sizes of about 8000 to 16000 bytes, potentially decreasing overhead because of the alignment needs for large objects, or eliminating those needs altogether). Because it is not (always) obvious how this will affect performance and whether it is even effective, preprocessor switches control which changes are active, and you are invited to evaluate the results (see __TBB_USE_... in src/tbbmalloc/MemoryAllocator.cpp). The __TBB_USE_MAGIC option is incomplete: first the safe probing should be validated for this to be a viable approach.
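To make the bin idea above concrete, here is a purely hypothetical sketch (the function name and exact size classes are mine, not from the patch) of mapping a requested allocation size to a bin index, with extra exact-fit bins for 2- and 4-byte objects to cut per-object overhead for tiny allocations:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sketch: extra small bins for sizes 2 and 4, then
// power-of-two size classes starting at 8. Not the actual
// MemoryAllocator.cpp logic, just an illustration of the idea.
static std::size_t sizeToBin(std::size_t size) {
    if (size <= 2) return 0;  // exact-fit bin for 2-byte objects
    if (size <= 4) return 1;  // exact-fit bin for 4-byte objects
    std::size_t binSize = 8, bin = 2;
    while (binSize < size) { binSize <<= 1; ++bin; }
    return bin;
}
```

A real allocator would likely use finer-grained classes; the point is only that two cheap comparisons up front add the small bins without disturbing the rest of the mapping.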

Some new atomic functionality (compare-and-swap with spurious failure) but with a twist (different signature, pedantic orthogonalism).

Spurious failure isn't as bad as the name would indicate: it just means that the function does not have to keep trying (spinning) until it can come up with a better excuse than that the other threads got in the way, i.e., the observed value does not have to differ from the expected value. On some architectures this does not matter (x86 gets/simulates a lock and either succeeds or fails because the observed value is actually different), on others it does, and where it matters one might as well unravel the compound loop of spinning on a spinning compare_and_swap, which also should allow for a more rational backoff policy (or so I hope).
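The distinction can be sketched with today's std::atomic names (an assumption: the patch predates this spelling): the "strong" form hides the retry-on-spurious-failure loop inside the operation, while the "weak" form may fail even when the observed value equals the expected one, letting the caller own the loop and its backoff policy.

```cpp
#include <atomic>

// Strong form: must not fail spuriously; if it returns false, the
// observed value genuinely differed from the expected one.
bool try_claim_strong(std::atomic<int>& flag) {
    int expected = 0;
    return flag.compare_exchange_strong(expected, 1);
}

// Weak form: may fail spuriously, so the caller loops, and can tell a
// spurious failure (expected still 0) from a genuine one (expected
// updated with the observed, different value).
bool try_claim_weak_with_retry(std::atomic<int>& flag) {
    int expected = 0;
    while (!flag.compare_exchange_weak(expected, 1)) {
        if (expected != 0) return false;  // genuinely claimed by another
        // else: spurious failure; a caller-chosen backoff could go here
    }
    return true;
}
```

Unravelling the loop this way is exactly what allows a more rational backoff policy than spinning on a spinning operation.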

where the observed value is exchanged for the expected value in "comparand", instead of being the return value (TBB style) or being exchanged for the intended value as is often done (it seems more user-friendly to me to keep the "comparand" continuously updated, with x86 and ESA/390-z/Architecture as a precedent), and the "wait" parameter determines whether spurious failure is disallowed (the default mimics compare_and_swap), and to make this part of an orthogonal {compare,fetch}_and_{store,add,and,or,xor}() family of operations (just for kicks, I also added {add,and,or,xor}_and_fetch() with "prefix" functionality; the one with store would probably not make sense even though it does the same as operator=), with compare_and_swap now somewhat superfluous but not yet deprecated (pending evaluation, but some uses have already been replaced with a specific expectation to improve performance).

Here's a small illustration in pseudo code, ignoring backoff, memory semantics, etc., showing a default implementation for one of the other operations:
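In the spirit of that illustration (which is not reproduced here), a sketch of such a default implementation using today's std::atomic names, still ignoring backoff and memory semantics: fetch_and_or built on a compare-and-store whose comparand is continuously updated with the observed value.

```cpp
#include <atomic>

// Default implementation of fetch_and_or on top of a weak
// compare-and-store: compare_exchange_weak writes the observed value
// back into 'comparand' on failure, the "keep the comparand updated"
// style discussed above, so no reload is needed in the loop.
int fetch_and_or(std::atomic<int>& a, int mask) {
    int comparand = a.load();
    while (!a.compare_exchange_weak(comparand, comparand | mask))
        ;  // spurious failure allowed: just retry
    return comparand;  // the value observed just before the update
}
```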

Attachments:

I think it's great! IIRC, a CAS that updates the comparand is called something like 'IBM-style'.

What I am missing in current C++0x is functions like atomic_{add,and,or,xor}, without any fetch at all. On the x86 architecture the 'lock' prefix can be applied to instructions like add, and, or, xor. And in some algorithms this is exactly what is needed: I don't need the previous value, I don't need a CAS loop, I just need the atomic operation.
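An illustration of the point, in today's std::atomic spelling (an assumption relative to the thread's era): when the result of fetch_or/fetch_and is discarded, a compiler targeting x86 may emit a plain `lock or` / `lock and` instead of a compare-and-swap loop (whether it does is a quality-of-implementation matter).

```cpp
#include <atomic>

// Result unused: eligible for a single `lock or` on x86.
void set_flag(std::atomic<unsigned>& flags, unsigned bit) {
    flags.fetch_or(bit);
}

// Result unused: eligible for a single `lock and` on x86.
void clear_flag(std::atomic<unsigned>& flags, unsigned bit) {
    flags.fetch_and(~bit);
}
```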

Btw, if you want to make the interfaces more C++ish and have more guarantees about static dispatch, you can consider the following interfaces:

// how this is done in current C++0x (or Java):
bool compare_and_store_weak(value_type &comparand, value_type intended);
bool compare_and_store_strong(value_type &comparand, value_type intended);

On the spelling issue, I'd be inclined to follow the C++0x lead and use the _strong suffix for the strong case, and leave off the suffix for the weak case. My guess is that most algorithms can get by with the weak case, and that algorithms depending on the strong case are the unusual ones that need to call out their unusualness.

I suspect that once serious C++0x implementations start appearing, they will optimize "atomic_fetch_or(&x,c)" with an unused result to a "LOCK OR" instruction. It's too good an opportunity to pass up. Too bad the C++0x "axioms" feature isn't powerful enough to express this transform.

I was wondering what default to use: the one that provides the current behaviour (for least opposition) with the least risk of unexpected results (no spurious failure assumed, perhaps because of blind conversion, tested on x86, executed on POWER/PowerPC etc.), or the one that arguably makes the most sense per se. I will happily change the default then (wait = false), caveat convertor.

I don't see what benefit omitting the fetch part would bring on current hardware. Maybe special hardware can use that, perhaps together with coalescing (or merging, or whatever it is called), but if the value has to be handled on the processor anyway? Or is it just about saving a register on (sparsely outfitted) x86, and how real a benefit is that? And what algorithms would it be used for? I was hesitant to push orthogonalism to the limit (ad nauseam), without seeing a real benefit myself, but now, with this validation, there's no stopping me anymore...

As for C++0x, I still have this pipe dream that, if I can show them a working implementation, they will amend their current ugly API proposal. Would this convince anyone, you think (something you couldn't do without template arguments for memory semantics):

template
int fetch_and_add(int addend) {
    int comparand = load();
    while( !compare_and_store( comparand, comparand+addend ) )
        ;
    return comparand;
}

It looks cool to me, but maybe it is inferior to specialisations that keep the fence outside the loop?

Raf_Schietekat:I don't see what benefit omitting the fetch part would bring on current hardware.

The benefit is that x86 has instructions that do atomic AND, XOR, and OR as long as no (user-visible) fetch is required. Once the fetch requirement is added, x86 needs a compare-and-swap loop with its attendant overheads.

spin_rw_mutex.cpp uses both atomic OR and atomic AND to manipulate some flag bits.

I've been regretting all day that I didn't take just a little longer to think about what I wrote.

No-fetch and/or: no red-face smiley available?

With the sample of code, I was mixing up two things, one of them misstated. The first is a straight question about how costly a repeated fence operation is compared to a single fence, especially in a loop. The second is that static parameters apparently have all the flexibility required (unlike an affix), while they are also guaranteed to be "optimised", whereas with runtime parameters this seems far less certain to say the least: any runtime variation of the value is almost surely a bug that I would like the compiler to rule out for me, and how would the compiler be able to provide equally efficient compiled code, especially if it has to come up with it ASAP (in a debug build)?

I suspect that once serious C++0x implementations start appearing, they will optimize "atomic_fetch_or(&x,c)" with an unused result to a "LOCK OR" instruction. It's too good an opportunity to pass up.

But will they optimize TBB's atomics? ;)

I think that the first releases of C++0x compilers won't optimize anything, so I think we will see the first atomic-optimizing compilers only around 2011. And there are more serious things than "LOCK OR". For example, sequentially consistent atomic RMWs in C++0x are not as strong as x86 locked RMWs. So one often will have to write a sequentially consistent atomic RMW followed by a sequentially consistent fence. And without optimization, on x86 that will result in a locked RMW followed by mfence, whereas the intended x86 code sequence is just the locked RMW (because it effectively includes a trailing sequentially consistent fence). This can basically double the cost of synchronization primitives. The C++0x memory model and atomics library are so general that they just have to be backed up by a heavily optimizing compiler.
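The pattern being described, written out in today's syntax: a sequentially consistent RMW followed by a sequentially consistent fence. Without an optimizing compiler this may become `lock xadd` plus `mfence` on x86, even though the locked RMW alone already acts as a full fence there.

```cpp
#include <atomic>

// seq_cst RMW followed by a seq_cst fence, as the C++0x formal rules
// can require; on x86 the fence is redundant with the locked RMW, and
// eliminating it is exactly the compiler optimization discussed above.
int rmw_then_fence(std::atomic<int>& a) {
    int old = a.fetch_add(1, std::memory_order_seq_cst);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    return old;
}
```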

Raf_Schietekat:I was wondering what default to use: the one that provides the current behaviour (for least opposition) with the least risk of unexpected results (no spurious failure assumed, perhaps because of blind conversion, tested on x86, executed on POWER/PowerPC etc.), or the one that arguably makes the most sense per se. I will happily change the default then (wait = false), caveat convertor.

I think you can include into header files something like:

#ifndef I_DO_AWARE_OF_CHANGED_SEMANTICS_OF_COMPARE_AND_SWAP

#error "Ah-ah-ah. It seems that you are not yet aware of the changed semantics of compare_and_swap. RTFM!"

#endif

Raf_Schietekat:The first is a straight question about how costly a repeated fence operation is compared to a single fence, especially in a loop.

It depends on the architecture. On x86/SPARC it doesn't make any sense, just because you can't vary any fences. On PPC/Itanium it makes sense, and unless you want people looking at your implementation and saying "Oh, cr@p! I have to rewrite it manually!" :) you must take the fence out of the loop. Some numbers running in my head: acquire is 25 cycles and release is 40 cycles on PPC, but I am really not sure about those numbers. It's a common practice in high-performance synchronization algorithms. For example, RCU in the Linux kernel tracks additional quiescent epochs just to remove the acquire fence from mutex acquisition and the release fence from mutex release. SMR+RCU does the same thing. My asymmetric rw-mutex does the same thing (i.e. read_lock/read_unlock doesn't contain any memory barriers, not even acquire and release).
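Taking the fence out of the loop can be sketched like this in today's std::atomic terms (an assumption relative to the patch's own API): the CAS retries run with relaxed ordering, and a single acquire fence is issued once, after the loop succeeds. On x86/SPARC-TSO this changes nothing; on PPC/Itanium it avoids paying for a fence on every retry.

```cpp
#include <atomic>

// fetch_and_add with the fence hoisted out of the retry loop: the
// retries are fence-free, and the acquire fence is paid exactly once.
int fetch_and_add_hoisted(std::atomic<int>& a, int addend) {
    int observed = a.load(std::memory_order_relaxed);
    while (!a.compare_exchange_weak(observed, observed + addend,
                                    std::memory_order_relaxed,
                                    std::memory_order_relaxed))
        ;  // no fence inside the loop
    std::atomic_thread_fence(std::memory_order_acquire);  // paid once
    return observed;
}
```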

"For example, sequentially consistent atomic RMW in C++0x are not so strong as x86 locked RMW." I didn't use an mfence instruction for ordered RMW operations (like fetch_and_add), nor did TBB before me, so what's different in C++0x? Note that "Intel 64 Architecture Memory Ordering White Paper" 318147.pdf says "Locked instructions have a total order", or is this not valid for Intel-32 anyway (I think somebody said that it was despite the name)? What exactly changed from 486/Pentium onwards (see "Intel 64 and IA-32 Architectures Software Developers Manual/Volume 3A: System Programming Guide, Part 1")? What about non-Intel implementations?

It's probably overkill here.

See my correction: the technical argument is about runtime vs. static parameters, and how they're treated by the compiler (compile-time assurance that they're actually static, and optimisation). What do you think?

"On x86/Sparc it doesn't make any sense, just because you can't variate any fences." What do you mean? And what's different on PPC/Itanium? Are fence instructions fixed in duration regardless of the current state of the processor (I thought maybe the second one wouldn't have anything to do and so might be fairly cheap)? Are you saying that after a number of instructions not needing a fence it will be implicit anyway? Where is that documented, and can it be used here?

Raf_Schietekat:"For example, sequentially consistent atomic RMW in C++0x are not so strong as x86 locked RMW." I didn't use an mfence instruction for ordered RMW operations (like fetch_and_add), nor did TBB before me, so what's different in C++0x? Note that "Intel 64 Architecture Memory Ordering White Paper" 318147.pdf says "Locked instructions have a total order", or is this not valid for Intel-32 anyway (I think somebody said that it was despite the name)? What exactly changed from 486/Pentium onwards (see "Intel 64 and IA-32 Architectures Software Developers Manual/Volume 3A: System Programming Guide, Part 1")? What about non-Intel implementations?

Doesn't work in the sense that (r1 == 0) && (r2 == 0) is a possible output.

Because C++0x's sequentially consistent atomic RMW is not as strong as x86 locked RMW; basically, a C++0x seq_cst atomic RMW doesn't include a full fence. So sometimes you will have to insert an additional seq_cst fence after a seq_cst atomic RMW, just to satisfy the C++0x formal rules.

Note that C++0x's seq_cst atomic RMWs form a total order, and they are both release and acquire. But the example still doesn't work.

Raf_Schietekat:See my correction: the technical argument is about runtime vs. static parameters, and how they're treated by the compiler (compile-time assurance that they're actually static, and optimisation). What do you think?

Personally I'm not concerned with C interoperability, and I like advanced C++, so personally I am not objecting to memory ordering as a template parameter. But the C++ committee is concerned with C interoperability. Also note that the current C++0x atomics interface can be implemented in a smart way, so that it provides the guarantees of static dispatch. Something like:

typedef int memory_order;

struct memory_order_relaxed_t {
    operator int () const {
        return 1;
    }
};

struct memory_order_acquire_t {
    operator int () const {
        return 2;
    }
};

extern memory_order_relaxed_t memory_order_relaxed;
extern memory_order_acquire_t memory_order_acquire;

// general implementation that uses dynamic dispatch
int atomic_exchange(..., memory_order);

// static specialization for memory_order_relaxed_t
int atomic_exchange(..., memory_order_relaxed_t);

// static specialization for memory_order_acquire_t
int atomic_exchange(..., memory_order_acquire_t);

Now, if I specify the memory ordering statically (likely), then the implementation will be chosen statically; if I specify the memory ordering dynamically, then the implementation will be chosen dynamically. It's a kind of killing two birds with one stone.
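A self-contained, simplified rendition of the trick (names and the `dispatched` variable are mine, for illustration): the tag type converts to int, so a statically named ordering picks the specialized overload at compile time, while a plain int falls back to the dynamic-dispatch overload.

```cpp
// Tag type convertible to int, as in the sketch above.
struct memory_order_relaxed_t { operator int() const { return 1; } };

int dispatched;  // records which overload ran, for illustration only

// general implementation that uses dynamic dispatch
int atomic_exchange_demo(int /*order*/)          { dispatched = 0; return 0; }
// static specialization chosen by overload resolution for the tag type
int atomic_exchange_demo(memory_order_relaxed_t) { dispatched = 1; return 0; }
```

Passing `memory_order_relaxed_t{}` is an exact match for the second overload, so no conversion to int (and no runtime branch on the ordering) ever happens; passing an int value selects the general implementation.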

Raf_Schietekat:"On x86/Sparc it doesn't make any sense, just because you can't variate any fences." What do you mean?

I mean that on x86 an atomic RMW is always sequentially consistent, a load is always acquire, and a store is always release. So one has no physical possibility to move a fence out of the loop; the fence will always be a part of the operation.

Raf_Schietekat:
And what's different on PPC/Itanium?

On PPC/Itanium one is able to separate the fence and the operation, so at least it's possible to move the fence out of the loop.

Raf_Schietekat:

Are fence instructions fixed in duration regardless of the current state of the processor (I thought maybe the second one wouldn't have anything to do and so might be fairly cheap)?

I think it's highly architecture-dependent. I can only speak about Pentium 4 processors, where I was conducting some measurements. I concluded that on the Pentium 4 an atomic RMW and mfence always take the same time, no matter what state the processor is currently in.

Raf_Schietekat:

Are you saying that after a number of instructions not needing a fence it will be implicit anyway? Where is that documented, and can it be used here?

1. What exactly is the difference between C++0x acq_rel and seq_cst if not making the example work?

2. I think that C interoperability is the root of all evil (you can count on evil to have two roots!). It virtually doomed CORBA all by itself even if Microsoft had not been around to shun it (I still feel far from confident, and I wrote a working implementation...). Upward compatibility is great, downward compatibility is a giant mistake. IMHO anyway.

OK, I remember now that you wrote before about how this can still provide static dispatch. But then it's still a quality-of-implementation question, and what would be a plausible scenario to ever allow runtime dispatch for memory semantics (as in a language can give you more by allowing you to do less)?

3. SPARC: "A compare-and-swap operation does not imply any memory barrier semantics." So it's like PPC/Itanium.

BTW, in the patch I never (intentionally) put a fence operation inside a loop.

To elaborate: I was wondering what you meant by "quiescent epochs just to remove acquire fence from mutex acquisition".

Raf_Schietekat:1. What exactly is the difference between C++0x acq_rel and seq_cst if not making the example work?

seq_cst operations participate in a total order over all seq_cst operations. Period. This fact by itself doesn't make the example work. Personally, for me it's quite counter-intuitive. I was talking about this issue with committee members; they contend that it's correct behaviour. Operations are local; fences are global. For me it's still counter-intuitive.

"seq_cst operations participate in total order over all seq_cst operations. Period." Yes, that seems to be what the proposed standard text says. So is this like Java volatile, where volatiles are mutually sequentially consistent, but where you still need a causal chain for other operations to be synchronised? Isn't Java being criticised for this (too little to be helpful, too much to be economical), or did I imagine that? What does it "do" for a program? And I thought that this example was specifically why they chose seq_cst as the default semantics? Anyway, this is not what I implemented for "ordered". So what now...

"I forget to mention, that in practice Sparc nearly always run in TSO mode. Other modes (RMO, PMO) don't even have to be supported." http://www.linuxjournal.com/article/8212 says "Solaris on SPARC uses total-store order (TSO); however, Linux runs SPARC in relaxed-memory order (RMO) mode.".

"What exactly to elaborate?" The first "quiescent epochs" related to computing that shows up in Google is this discussion (I gave up after a few pages). With RCU or "Linux kernel" it's the only hit.

Raf_Schietekat:"I forget to mention, that in practice Sparc nearly always run in TSO mode. Other modes (RMO, PMO) don't even have to be supported." http://www.linuxjournal.com/article/8212 says "Solaris on SPARC uses total-store order (TSO); however, Linux runs SPARC in relaxed-memory order (RMO) mode.".

This must be a mistake. See /linux-2.6.25.1/include/asm-sparc/system.h:

Raf_Schietekat:"What exactly to elaborate?" The first "quiescent epochs" related to computing that shows up in Google is this discussion (I gave up after a few pages). With RCU or "Linux kernel" it's the only hit.

Now that I would not have expected... But where is the commitment that 32-bit Linux will forever run in TSO, because 64-bit seems to have a "this is the future" ring to it? Otherwise, what would be a safe way to determine this (a macro doesn't help if a program is executed on a later kernel)? Until this is clear I think I'll just leave it as it is.

About 8.4.4.3, can you explain why it is not as follows: "Thus, TSO ensures that:
- Each load instruction behaves as if it were followed by a MEMBAR with a mask value of #LoadLoad|#LoadStore.
- Each store instruction behaves as if it were preceded by a MEMBAR with a mask of #LoadStore|#StoreStore.
- Each atomic load-store behaves as if it were preceded by a MEMBAR with a mask of #LoadStore|#StoreStore and followed by a MEMBAR with a mask value of #LoadLoad|#LoadStore."

Also, it begs the question where #StoreLoad comes into the picture, i.e., it seems that TSO implements the rel_acq policy of the patch (more or less), but it seems that "ordered" has to be explicitly added (but where exactly? all the relations both before and after the ordered instruction?). That would also mean that on SPARC it would indeed make sense to wonder about keeping a fence instruction out of the loop (see 30263829).

Raf_Schietekat:Now that I would not have expected... But where is the commitment that 32-bit Linux will forever run in TSO, because 64-bit seems to have a "this is the future" ring to it? Otherwise, what would be a safe way to determine this (a macro doesn't help if a program is executed on a later kernel)? Until this is clear I think I'll just leave it as it is.

The Rule of Least Surprise is Unix philosophy :) And it would be quite a radical change...

I don't know. There is a PSTATE.MM register that determines the memory model, but afaik it is only accessible from privileged mode.

Btw, theoretically they could even change the endianness of data. But I think that's not something they would change...

Raf_Schietekat: - Each store instruction behaves as if it were preceded by a MEMBAR with a mask of #LoadStore|#StoreStore.

If a StoreStore membar is associated with *EVERY* store, then it doesn't make any sense whether the membar precedes or follows :)

Why is there no LoadStore? I think that it's not needed. A load is either (1) not needed to be ordered wrt the store, or (2) will be implicitly ordered by data/control dependencies. Try to come up with a counter-example.

Raf_Schietekat: - Each atomic load-store behaves as if it were preceded by a MEMBAR with a mask of #LoadStore|#StoreStore and followed by a MEMBAR with a mask value of #LoadLoad|#LoadStore.

Each atomic load-store behaves as if it were followed by a MEMBAR with a mask of #LoadStore|#StoreStore|#LoadLoad. I think that's enough. An RMW is a load and a store at the same time, so:
Loads can't hoist above the RMW because of #LoadLoad.
Stores can't hoist above the RMW because of #StoreStore.
Loads can't sink below the RMW because loads are followed by #LoadLoad.
Stores can't sink below the RMW because stores are followed by #StoreStore.

Raf_Schietekat:Also, it begs the question where #StoreLoad comes into the picture, i.e., it seems that TSO implements the rel_acq policy of the patch (more or less), but it seems that "ordered" has to be explicitly added (but where exactly? all the relations both before and after the ordered instruction?). That would also mean that on SPARC it would indeed make sense to wonder about keeping a fence instruction out of the loop (see 30263829).

I believe that SPARC TSO is effectively the same as x86. So an ordered store maps to XCHG, an ordered load maps to a plain load, and an ordered RMW maps to RMW (without any membars). So on TSO there is just nothing that can be moved out of the loop.

randomizer: Each atomic load-store behaves as if it were followed by a MEMBAR with a mask of #LoadStore|#StoreStore|#LoadLoad. I think that's enough. An RMW is a load and a store at the same time, so:
Loads can't hoist above the RMW because of #LoadLoad.
Stores can't hoist above the RMW because of #StoreStore.
Loads can't sink below the RMW because loads are followed by #LoadLoad.
Stores can't sink below the RMW because stores are followed by #StoreStore.

However, I am not 100% sure here. There is no implied #StoreLoad membar, which is the most critical (and expensive) one. So can loads hoist above the RMW?

"If StoreStore membar associated with *EVERY* store, then it doesn't make any sense whether membar is followed or preceded :)"

It's still distracting (as evidenced by the need to prove it instead of being able to just see it), and doing the same in RMO will not work: two reasons why this is an inappropriate way of presenting things.

"Why is there no LoadStore? I think that it's not needed. A load is either (1) not needed to be ordered wrt the store, or (2) will be implicitly ordered by data/control dependencies. Try to come up with a counter-example."

I always wondered why there was a #LoadStore at all, but thought that maybe there might be implementations with delayed loads that would need it, so I already replaced a comment about it with an actual use of #LoadStore (see next version), which should at the worst be harmless. Not being able to come up with a counter-example is no proof, of course, so I'll just leave it like that.

If things were presented as I suggest, it would be easier to see that SPARC RMW is only "rel_acq", not "ordered" like x86 RMW, and the question of loads "hoisting" above RMW would not even come up (sure they can "hoist", no Dekker's with just xchg_u32()...). Right or wrong?

BTW, about C++0x seq_cst: could it be that the strange semantics are inspired by the desire to have only one #StoreLoad per instruction?

(Added) Here's an example of what can go horribly wrong with "D.7.2 Dekkers Algorithm": "This code does not guarantee that any processor can enter (that requires a retry mechanism which is omitted here), but it does guarantee mutual exclusion, which means that it is impossible that each processor finds the others lock idle ( = 0) when it enters cthe ritical section." (sic).

Raf_Schietekat: "If a StoreStore membar is associated with *EVERY* store, then it doesn't make any sense whether the membar precedes or follows :)"

It's still distracting (as evidenced by the need to prove it instead of being able to just see it), and doing the same in RMO will not work: two reasons why this is an inappropriate way of presenting things.

I agree that it's not a very intuitive definition. I just know that on SPARC TSO a store is a store-release and a load is a load-acquire. I don't know what they were thinking about when writing those definitions. Maybe it's how TSO is implemented in hardware, maybe it's the most efficient way to achieve the guarantees required for TSO, and they don't need formal proofs of correctness because they just know that it will work as expected...

Raf_Schietekat:I always wondered why there was a #LoadStore at all, but thought that maybe there might be implementations with delayed loads that would need it, so I already replaced a comment about it with an actual use of #LoadStore (see next version), which should at the worst be harmless. Not being able to come up with a counter-example is no proof, of course, so I'll just leave it like that.

For a mathematician it's not a proof, but for an engineer it's a proof... Joke :) If you try hard to come up with a counter-example, and see that you are not able to do it, then at least it will give you some degree of confidence... This is how I got confidence in this question :)

Informally: loads are invisible to other threads. Other threads can't determine whether a load was before a store or after. In order for other threads to be able to determine whether a load was before or after a store, there must be another store which depends on the load (whether data-dependent or control-dependent). Because this store is dependent on the load, the load can't sink below this store. Then this store also can't sink below an atomic store-release, because store-release contains #StoreStore. So transitively the load can't sink below the atomic store-release... or it just doesn't matter whether the load sinks below or not (because other threads have no means to determine when the load was executed).

Raf_Schietekat:If things were presented as I suggest, it would be easier to see that SPARC RMW is only "rel_acq", not "ordered" like x86 RMW, and the question of loads "hoisting" above RMW would not even come up (sure they can "hoist", no Dekker's with just xchg_u32()...). Right or wrong?

*IMVHO* Dekker's algorithm can be implemented on SPARC TSO with atomic RMWs only, no need for additional #StoreLoad barriers. The sources of the Linux kernel somehow confirm this point: atomic RMWs in the Linux kernel are not augmented with fences, but I believe that they do want sequential consistency there.

Raf_Schietekat:
BTW, about C++0x seq_cst: could it be that the strange semantics are inspired by the desire to have only one #StoreLoad per instruction?

As I understand it, the semantics of seq_cst are inspired by PPC. But I'm not sure; they are very secretive about this :)

"Other threads can't determine whether load was before store or after." Not immediately, obviously, but why would that be a requirement? Given 2 threads A and B, globals a and b both initially clear, and lock l, if each thread acquires the lock, writes its ID into its own global, probes the other global, clears its own global and releases the lock, shouldn't both probes be clear when they compare notes later on? Yet without #LoadStore as part of a release fence, one might have seen the other thread's ID, which seems rather weird.
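That scenario can be written out as runnable code (the names and the use of std::mutex are mine; the assumption is that a correct lock's release orders the preceding probe load before the following clearing store, i.e. it includes #LoadStore):

```cpp
#include <mutex>
#include <thread>

// Two threads: each acquires the lock, writes its ID into its own
// global, probes the other global, clears its own global, and releases
// the lock. With a correct lock, each probe must read 0.
int a = 0, b = 0;
std::mutex l;

int run(int id, int& mine, const int& other) {
    std::lock_guard<std::mutex> guard(l);
    mine = id;
    int probe = other;  // must see 0: the other global is cleared
    mine = 0;           // before the other thread's lock release
    return probe;
}

bool both_probes_clear() {
    int pa = -1, pb = -1;
    std::thread t1([&] { pa = run(1, a, b); });
    std::thread t2([&] { pb = run(2, b, a); });
    t1.join();
    t2.join();
    return pa == 0 && pb == 0;
}
```

If the release side lacked #LoadStore, the probe load could in principle sink past the clearing store and the lock release, allowing a probe to observe the other thread's ID, which is the weird outcome described above.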

"*IMVHO* Dekker's algorithm can be implemented on Sparc TSO only with atomic RMW's, no need for additional #StoreLoad barriers." I'm officially confused: why then doesn't 8.4.4.3 document any implicit #StoreLoad?

"As I understand semantics of seq_cst are inspired by PPC." There I suppose it would mean only one SYNC per operation as opposed to two, as is currently implemented here for "ordered"; the other side can be just LWSYNC. "But I'm not sure, they are very secretive on this :)" Why? Am I mistaken?