I've just read the paper, and I believe that the following variation on
the code would work and would avoid the MP-unsafe issues raised because
bool is defined to be a single byte.

Furthermore, I'm pretty certain that it also resolves the issues with the
order of construction and setting of the pointer in the singleton case, and
probably resolves all the other "over-smart optimisation" issues as well.

Some recent research I've done in this area, prompted by Scott Meyers' and
Andrei Alexandrescu's article "C++ and the Perils of Double-Checked Locking"
at http://www.aristeia.com/Papers/DDJ_J...04_revised.pdf, makes me
wonder whether this code is thread-safe on multi-processor machines. As the
article points out, DCLP is dangerous in general; however, it is most likely
safe if the thing being tested and set is accessed atomically. On most
32-bit machines a 32-bit quantity will generally be accessed in a single bus
transaction, making it inherently atomic. However, there may be cases where
it is not atomic. An example could be on a machine that allows unaligned
accesses, such as the x86. It may be possible for half of the value to be
updated in another processor's cache, and used (since the value is therefore
not NULL), before the other half is updated. It seems that the race
condition we are trying to avoid may in fact have been reduced rather than
eliminated. While it may be true that the code generated by the compiler
doesn't typically result in unaligned accesses, it is still a possibility
that exists, and there may be other ways for non-atomic access to occur
without misalignment being the cause.
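
For readers who have not seen the pattern, here is a minimal sketch of the
classic DCLP shape that the article discusses. It is not the code under
discussion in this thread: the class name Singleton and its members are
hypothetical, and std::mutex is used only to keep the sketch self-contained.
The point is the first, unlocked read of a plain pointer, which is the access
whose atomicity and ordering are in question.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical class, for illustration only; not the code from this thread.
class Singleton {
public:
    static Singleton* instance() {
        if (instance_ == nullptr) {                     // first check, no lock held
            std::lock_guard<std::mutex> guard(mutex_);
            if (instance_ == nullptr) {                 // second check, lock held
                // Allocation, construction and the pointer store are not
                // guaranteed to become visible to other processors in this order.
                instance_ = new Singleton;
            }
        }
        return instance_;
    }

    int value() const { return value_; }

private:
    Singleton() : value_(42) {}

    static Singleton* instance_;  // plain pointer: the unlocked read races with the store
    static std::mutex mutex_;
    int value_;
};

Singleton* Singleton::instance_ = nullptr;
std::mutex Singleton::mutex_;

// Single-threaded sanity check; it says nothing about multi-processor safety.
inline void check_single_threaded() {
    assert(Singleton::instance() == Singleton::instance());
}
```

Run from a single thread this behaves as expected, which is part of why the
problem is hard to spot: the failure needs a second processor to observe the
pointer at the wrong moment.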

I've tried some elaborate workarounds to maintain the optimisation that DCLP
provides, but they turn out to be not entirely safe on other processors such
as the Itanium. The easiest way to fix this would seem to be always
obtaining the lock before using the variables in question, but this could
have an impact on performance. A more involved alternative is to use locked
instructions, such as the Interlocked... functions on Windows and some
hand-rolled assembler on other platforms, to ensure that the values are
updated atomically. I'm not offering patches at this point in case there is
too much resistance to a performance hit, so I'm interested to know thoughts
either way. I agree that the margin for error is very, very small, and I
don't know how much of an impact on performance the necessary changes would
have. I'm partly sending this so that, if nothing is done and a race
condition is reported in future, it may assist with locating the problem.
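
To make the two options concrete, here is a rough sketch of each. It is not a
proposed patch. It assumes a C++11 or later compiler, and it swaps in
std::atomic as a portable stand-in for the Interlocked... functions and
hand-rolled assembler mentioned above. The names Resource, get_always_locked
and get_atomic are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

// Hypothetical payload type, for illustration only.
struct Resource {
    int value = 7;
};

// Option 1: always take the lock before touching the pointer.
// Simple and safe, at the cost of a lock acquisition on every call.
Resource* get_always_locked() {
    static std::mutex m;
    static Resource* p = nullptr;
    std::lock_guard<std::mutex> guard(m);
    if (p == nullptr) {
        p = new Resource;
    }
    return p;
}

// Option 2: keep the double check, but make the pointer atomic.
// The release store publishes the fully constructed object, and the acquire
// load on the fast path ensures a reader that sees the pointer also sees
// the object's contents.
Resource* get_atomic() {
    static std::mutex m;
    static std::atomic<Resource*> p{nullptr};

    Resource* r = p.load(std::memory_order_acquire);   // fast path, no lock
    if (r == nullptr) {
        std::lock_guard<std::mutex> guard(m);
        r = p.load(std::memory_order_relaxed);         // re-check under the lock
        if (r == nullptr) {
            r = new Resource;
            p.store(r, std::memory_order_release);     // publish after construction
        }
    }
    return r;
}
```

Either function returns the same object on every call. How much the first
option costs relative to the second would have to be measured on the
platforms we care about; I have no numbers to offer.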