Note: XCHG with a memory operand is always atomic: it asserts LOCK#
regardless of the presence of the LOCK prefix. The other read-modify-write
instructions (XADD, CMPXCHG, INC, DEC...) are atomic only when the LOCK prefix
is explicitly added.

These low-level LOCK mechanisms ensure that some memory is modified by only
one thread at a time.

So what is wrong with these LOCKed instructions?

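On a multi-core CPU, a LOCKed asm instruction is expensive: the core needs
exclusive ownership of the target memory for the whole read-modify-write, and
the other cores cannot touch this memory until the instruction completes. On
older CPUs (or with unaligned or uncached operands), the whole memory bus is
locked, so every core which needs memory access has to wait. On more recent
CPUs, the lock is usually taken on the cache line only, but the instruction
still acts as a full memory barrier, and the cache line has to travel from one
core to another.

So guess what... when several threads execute such instructions on the same
data, the cores spend their time waiting for each other, and your brand new
8-core CPU may run not much faster than a 1-core CPU...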

This is the same kind of LOCKed asm instruction which is used
internally by Windows for its Critical Sections. That's why Windows
itself is said not to be very multi-core friendly, because it uses a lot of
critical sections internally... Linux is said to be much more advanced, and to
scale pretty well on massive multi-core architectures.

What about Delphi?

String types and dynamic arrays are reference-counted, and the RTL uses the
same LOCKed asm instruction everywhere to update the reference count, i.e. for
every assignment, and for every access which may lead to a write to the
string.
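
To make the reference counting visible, here is a minimal sketch. It assumes the usual string header layout, with the reference count stored 8 bytes before the first character: this is an implementation detail, not a documented API, and RefCountOf is a hypothetical helper written only for this illustration. One would expect the count to go up on the assignment and back down when B is cleared, each change being a LOCKed instruction in the standard RTL.

```pascal
program RefCountDemo;
{$APPTYPE CONSOLE}

// Hypothetical helper: reads the reference count stored just before the
// string content. The -8 offset is an assumption about the RTL internals.
function RefCountOf(const S: string): Integer;
begin
  if Pointer(S) = nil then
    Result := 0 // an empty string has no heap block at all
  else
    Result := PInteger(PAnsiChar(Pointer(S)) - 8)^;
end;

var
  A, B: string;
begin
  A := StringOfChar('x', 10); // heap-allocated, so it has a real count
  WriteLn(RefCountOf(A));
  B := A;  // no copy of the characters: only the count is incremented
  WriteLn(RefCountOf(A));
  B := ''; // releasing B decrements the count again
  WriteLn(RefCountOf(A));
end.
```

Note that a string constant (like 'abc') is not reference-counted the same way, which is why the sketch builds its string at run time.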

See what I wrote in the
Embarcadero forum... this post was not very popular, but I do think
I've raised a big issue about the Delphi compiler internals and performance
there - and I don't think Embarcadero has plans to resolve it...

IMHO if you use strings in your application and need speed, using another
memory manager than FastMM4 is not enough. You'll have to avoid most string
use, and implement a safe TStringBuilder-like class.

ShortStrings could be handy here, even if they are limited to 255
characters.

Using regular PAnsiChar and fixed buffers on the stack is also a solution,
but it must be safe...

Our enhanced RTL

In our enhanced RTL for Delphi 7, we avoid the use of this LOCKed asm
instruction if your application has only one thread known to the RTL: so if
you use our enhanced RTL, and create your threads by yourself (not using the
TThread class), you'll have the best multi-thread performance possible.

So if the AVOIDLOCK conditional is defined, and there is only one thread in
your application (i.e. IsMultiThread is false), the lock dec
[ecx] instruction won't be executed: a much faster (and
core-friendly) dec [ecx] instruction is used instead.
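
The idea can be sketched in plain Pascal. This is not the actual code of the enhanced RTL (which is written in asm inside System.pas): ReleaseRef is a hypothetical helper, and InterlockedDecrement stands here for the lock dec [ecx] instruction.

```pascal
program AvoidLockSketch;
{$APPTYPE CONSOLE}
{$DEFINE AVOIDLOCK}
uses
  Windows; // for InterlockedDecrement

// Hypothetical helper, for illustration only: decrements a reference
// count and returns the new value.
function ReleaseRef(var RefCnt: Integer): Integer;
begin
{$IFDEF AVOIDLOCK}
  if not IsMultiThread then
  begin
    Dec(RefCnt); // plain decrement, no LOCK prefix: single thread only
    Result := RefCnt;
    Exit;
  end;
{$ENDIF}
  Result := InterlockedDecrement(RefCnt); // atomic (LOCKed) decrement
end;

var
  Count: Integer;
begin
  Count := 2;
  WriteLn(ReleaseRef(Count)); // prints 1, whichever branch is taken
end.
```

The test on IsMultiThread costs one compare and one branch, which is expected to be much cheaper than the LOCKed instruction it avoids.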

Note that there is a similar check already in FastMM4: if
IsMultiThread is false, no LOCKed instruction will be used.

The only drawback is that if you want to use threads in your application,
you'll have to follow some rules:

TThread is not to be used: the creation of a single TThread
sets IsMultiThread to true, and so enables the LOCKed
instructions;

The BeginThread() function must be avoided too (it also sets the flag);

So you'll have to call the CreateThread() Win32 API directly for your
threads;

And none of your units should use either TThread or BeginThread!
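
These rules can be sketched as follows. It is a minimal example, not production code: Worker and Total are names made up for the illustration. The thread body deliberately avoids strings, dynamic arrays and heap allocation: since IsMultiThread stays false, neither the reference counting nor the memory manager takes any lock, so nothing of that kind may be touched by two threads at once.

```pascal
program RawThreadSketch;
{$APPTYPE CONSOLE}
uses
  Windows;

// Thread body, with the signature expected by CreateThread.
// It only works on the integer it receives: no string, no allocation.
function Worker(Param: Pointer): DWORD; stdcall;
var
  I: Integer;
begin
  for I := 1 to 1000 do
    Inc(PInteger(Param)^, I);
  Result := 0;
end;

var
  Total: Integer;
  Handle: THandle;
  ThreadID: DWORD;
begin
  Total := 0;
  // Direct Win32 call: unlike TThread or BeginThread, it does not set
  // the IsMultiThread flag.
  Handle := CreateThread(nil, 0, @Worker, @Total, 0, ThreadID);
  if Handle = 0 then
    Halt(1);
  WaitForSingleObject(Handle, INFINITE); // wait before reading Total
  CloseHandle(Handle);
  WriteLn(Total);         // sum of 1..1000 computed by the thread
  WriteLn(IsMultiThread); // still FALSE
end.
```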

That's why it would be useful for Embarcadero to take this problem into
account, and try to resolve it at the compiler level...