Description

Herb Sutter presents atomic<> Weapons, 2 of 2. This was filmed at C++ and Beyond 2012. As the title suggests, this is a two-part series (given the depth of treatment and complexity of the subject matter).

It's a session that includes topics I've publicly said for years are Stuff You Shouldn't Need To Know and I Just Won't Teach, but it's becoming achingly clear that people do need to know about it. Achingly, heartbreakingly clear, because some hardware incents you to pull out the big guns to achieve top performance, and C++ programmers just are so addicted to full performance that they'll reach for the big red levers with the flashing warning lights. Since we can't keep people from pulling the big red levers, we'd better document the A to Z of what the levers actually do, so that people don't SCRAM unless they really, really, really meant to.

Topics Covered:

The facts: The C++11 memory model and what it requires you to do to make sure your code is correct and stays correct. We'll include clear answers to several FAQs: "how do the compiler and hardware cooperate to remember how to respect these rules?", "what is a race condition?", and the ageless one-hand-clapping question "how is a race condition like a debugger?"

The tools: The deep interrelationships and fundamental tradeoffs among mutexes, atomics, and fences/barriers. I'll try to convince you why standalone memory barriers are bad, and why barriers should always be associated with a specific load or store.

The unspeakables: I'll grudgingly and reluctantly talk about the Thing I Said I'd Never Teach That Programmers Should Never Need To Know: relaxed atomics. Don't use them! If you can avoid it. But here's what you need to know, even though it would be nice if you didn't need to know it.

The rapidly-changing hardware reality: How locks and atomics map to hardware instructions on ARM and x86/x64, and throw in POWER and Itanium for good measure – and I'll cover how and why the answers are actually different last year and this year, and how they will likely be different again a few years from now. We'll cover how the latest CPU and GPU hardware memory models are rapidly evolving, and how this directly affects C++ programmers.

The Discussion

The link to part one under the "More episodes in this show" section of this page doesn't work. There's probably some HTML escaping bug involved as the angle brackets are not displayed correctly in the link text.

@Matthias: The consensus seems to be that the compiler can't completely optimize away the load operation from the loop, because the standard states in 1.10/25 that "An implementation should ensure that the last value (in modification order) assigned by an atomic or synchronization operation will become visible to all other threads in a finite period of time."

See also the comments by Bartosz Milewski and Anthony Williams here: http://stackoverflow.com/questions/8819095#comment-11038619

And the "memory clobber" note here: http://david.jobet.free.fr/wiclear-blog/index.php?title=2010-10-17-c%2B%2B-atomic-lib-impl-rules&mode=print&lang=fr
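
To make the point about the load in the loop concrete, here is a minimal sketch of a spin loop on a relaxed atomic load. The names (flag, spin_until_set, setter_is_eventually_seen) and the 10 ms delay are assumptions made up for illustration; they are not taken from the talk or from Matthias's question. The sketch relies on the "should become visible in a finite period of time" wording quoted above, which is how mainstream implementations behave in practice.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Hypothetical flag, for illustration only.
std::atomic<bool> flag{false};

long spin_until_set() {
    long spins = 0;
    // Every iteration performs a fresh atomic load. If the load were hoisted,
    // as in
    //     bool f = flag.load(std::memory_order_relaxed);
    //     while (!f) ++spins;
    // the loop could never observe a store made by another thread.
    while (!flag.load(std::memory_order_relaxed)) {
        ++spins;
    }
    return spins;
}

bool setter_is_eventually_seen() {
    flag.store(false);
    long spins = -1;
    std::thread spinner([&spins] { spins = spin_until_set(); });

    // Give the spinner some time to start looping, then set the flag.
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    flag.store(true, std::memory_order_relaxed);

    // join() returns only once the loop has observed the store.
    spinner.join();
    return spins >= 0;
}
```

Even with memory_order_relaxed on both sides, the spinner is expected to terminate, because relaxed only drops ordering guarantees relative to other memory operations, not the atomicity or eventual visibility of the flag itself.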

The slide states that using relaxed in the exchange operation is bad because the code "could do some widget creation if CAS fails - and worse". However, I'm having difficulties seeing how the relaxed exchange in the if condition could result in observably bad behaviour. Can anyone help me out?

BTW, somebody should point out that there currently is a severe problem with the code generation for atomic operations by VC++2012: http://connect.microsoft.com/VisualStudio/feedback/details/770885/std-atomic-load-implementation-is-absurdly-slow

The stop=true; can't be relaxed because otherwise it could float up across the launch (annoying but mostly benign, just causing workers to always immediately stop) or down across the join (oops, program will never terminate).

The exchange_explicit can't be relaxed because for example part of "new widget" could float up speculatively out of the if.

Regarding a relaxed stop=true on page 49: launch_workers() and join_workers() should normally both synchronize with the launched and joined workers, so shouldn't that prevent the worker threads from seeing the assignment too early or too late?
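
For readers without the slides, the pattern under discussion might look roughly like the sketch below. It is reconstructed only from what is said in this thread: the relaxed load in the worker loop, the work_done counter, and the run_until_stopped wrapper are assumptions, not the slide's code. It uses the default (sequentially consistent) store for stop, as recommended in the reply above, and does not settle whether a relaxed store would also be correct. The question rests on the fact that std::thread construction synchronizes with the start of the thread function, and thread completion synchronizes with the return from join().

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical reconstruction; all names besides stop, launch_workers and
// join_workers are made up for this sketch.
std::atomic<bool> stop{false};
std::atomic<long> work_done{0};
std::vector<std::thread> workers;

void launch_workers(int n) {
    for (int i = 0; i < n; ++i) {
        workers.emplace_back([] {
            // Assumed worker loop: poll the flag with a relaxed load.
            while (!stop.load(std::memory_order_relaxed)) {
                work_done.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
}

void join_workers() {
    for (auto& t : workers) {
        t.join();
    }
    workers.clear();
}

long run_until_stopped(int n) {
    stop = false;
    work_done = 0;

    launch_workers(n);

    // Wait until at least one unit of work has been done.
    while (work_done.load(std::memory_order_relaxed) == 0) {
        std::this_thread::yield();
    }

    stop = true;  // default memory_order_seq_cst store

    join_workers();
    return work_done.load();
}
```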

Regarding the relaxed exchange_explicit on page 54: If I'm not mistaken, the compiler or CPU can't move anything from the if-body to before the if-statement if that could lead to observable side effects when the if-condition evaluates to false. When the if-condition evaluates to true and any reader load-acquires a non-null instance pointer, the reader is guaranteed to see the fully constructed widget due to the store-release being sequenced after the new widget(). When the if-condition evaluates to false (after a thread saw a null instance), the thread load-acquires the instance pointer in a spin loop until it sees a non-null instance (which then again must be fully constructed). So, I still don't see why a relaxed exchange wouldn't be enough, what am I missing?
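
As a reference point for this question, here is a sketch of the lazy-initialization pattern as it is described in the paragraph above: an exchange in the if condition, a store-release after new widget(), and a load-acquire spin loop for the threads that lose the exchange. It is an assumption-laden reconstruction, not the code from page 54; the create flag, the constructed counter, the widget contents, and the all_threads_agree helper are invented for illustration. The exchange uses the default sequentially consistent ordering, so the sketch shows the variant nobody disputes rather than answering whether relaxed would be enough.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Counts constructor calls so the sketch can check "created exactly once".
std::atomic<int> constructed{0};

struct widget {
    int value = 7;
    widget() { constructed.fetch_add(1, std::memory_order_relaxed); }
};

std::atomic<widget*> instance{nullptr};
std::atomic<bool> create{false};  // hypothetical "someone is creating" flag

widget* get_instance() {
    if (instance.load(std::memory_order_acquire) == nullptr) {
        if (!create.exchange(true)) {  // seq_cst; only one thread sees false
            // Publish the fully constructed widget with a store-release.
            instance.store(new widget(), std::memory_order_release);
        } else {
            // Lost the exchange: spin until the winner has published.
            while (instance.load(std::memory_order_acquire) == nullptr) {
                std::this_thread::yield();
            }
        }
    }
    return instance.load(std::memory_order_acquire);
}

bool all_threads_agree(int n) {
    std::vector<widget*> seen(n, nullptr);
    std::vector<std::thread> threads;
    for (int i = 0; i < n; ++i) {
        threads.emplace_back([&seen, i] { seen[i] = get_instance(); });
    }
    for (auto& t : threads) {
        t.join();
    }
    for (widget* w : seen) {
        if (w == nullptr || w != seen[0] || w->value != 7) {
            return false;
        }
    }
    // The singleton is intentionally never deleted in this sketch.
    return constructed.load() == 1;
}
```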