Re: Why are lock-free implementations faster than traditional synchronization? Are they?

I've realized I have a very basic misunderstanding of the topic of
lock-free algorithms.

They possess many good qualities, but is performance one of these
qualities? And if it is, to what extent and for what reasons
exactly?

Well, the reason is that they do not lock and block.
Each time a thread blocks on a mutex it loses its current time slice.
Even if it needs to stop for only a few clock cycles, the scheduler will
not immediately reschedule the thread as long as there is something else
to do. This also causes most of the cache content to become waste.
A lock-free algorithm never gives up a time slice.

Of course, there may be some special situations where we run
faster with block-free and wait-free implementations using solely
interlocked-xadd-like operations.
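As a sketch of what such a wait-free implementation looks like (names and parameters are mine, not from the original examples): a shared counter where every increment is a single lock-xadd-style instruction, so no thread ever loops, spins, or blocks.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical wait-free counter: each increment compiles down to a
// single interlocked xadd-style instruction. No thread ever retries
// or gives up its time slice.
std::atomic<long> counter{0};

long run_counter(int threads, int increments_per_thread) {
    counter.store(0);
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([increments_per_thread] {
            for (int i = 0; i < increments_per_thread; ++i)
                counter.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& th : pool) th.join();
    return counter.load();
}
```

Every fetch_add completes in a bounded number of steps regardless of what other threads do, which is exactly the wait-free property.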


But when it comes to using our main tool, the cmpxchg instruction, our
code gets a kind of transactional semantics, with failing and recurring
attempts to modify data. It's as if we are spinning(!)
on attempts to modify some data with cmpxchg.

You are right. Cmpxchg is similar to a spin-lock, and it will have
performance comparable to a fast mutex implemented with a spin-lock. But
there are some important differences.

Now I should ask how it is different, performance-wise, from using just
a spin-lock, so that instead of

Compare-exchange comes into play only if the old value is needed to
compute the new value in a non-trivial way, i.e. where no dedicated
machine instruction is available. Most platforms offer trivial
operations like x+value, x-value, x&value etc. directly, without a loop.
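A minimal sketch of such a non-trivial case (my own example, not from the original post): an atomic maximum, for which no single instruction exists, so the old value must be loaded and the store retried with compare-exchange until no other thread has interfered.

```cpp
#include <atomic>

// Compare-exchange loop for a non-trivial update: atomic maximum.
// There is no "lock max" instruction, so we load the old value,
// compute the new one, and retry if another thread got in between.
void atomic_max(std::atomic<int>& target, int value) {
    int old = target.load(std::memory_order_relaxed);
    // On failure compare_exchange_weak reloads 'old' for us,
    // and we take exactly one extra turn per collision.
    while (old < value &&
           !target.compare_exchange_weak(old, value,
                                         std::memory_order_relaxed))
        ;
}
```

Note that the "critical section" here is just the load-compute-CAS sequence itself; there is no separate lock word.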

One may argue that waiting on a lock (spin-lock) saturates the bus much
more than the rarely called cmpxchg in example 2, but this is merely a
question of implementing an efficient spin-lock for multicore.

The collision probability scales with the square of the time taken in
one spinning loop cycle. So the basic idea is to keep this time as short
as possible. It is obvious that spinning on some spin-lock variable and
then modifying some other memory location holds the lock longer than
reading the old value and writing the new value back, which implicitly
releases the lock. The compare-exchange simply joins the memory
location of the mutex with the memory location of the data to protect.
This also shows when it makes sense: only as long as you do not protect
more than one memory location with the lock. But keep in mind that fine-
grained locking is a requirement for scalability.
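The difference in how long the "lock" is held can be sketched side by side (a hypothetical minimal spin-lock of my own, not code from the original examples): with a spin-lock the lock word is a separate memory location held across load, compute, and store; with compare-exchange the collision window shrinks to the CAS itself.

```cpp
#include <atomic>

// Minimal test-and-set spin-lock (sketch, assuming C++11 atomics).
// The lock word is a *separate* memory location from the data it
// protects, so the lock stays held across the whole update.
class SpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock()   { while (flag.test_and_set(std::memory_order_acquire)) {} }
    void unlock() { flag.clear(std::memory_order_release); }
};

SpinLock lock_;
int data_ = 0;

void locked_increment() {
    lock_.lock();        // collision window: entire critical section
    data_ = data_ + 1;
    lock_.unlock();
}

std::atomic<int> adata_{0};

void cas_increment() {
    int old = adata_.load(std::memory_order_relaxed);
    // Collision window: just the CAS; storing the new value
    // implicitly "releases the lock".
    while (!adata_.compare_exchange_weak(old, old + 1,
                                         std::memory_order_relaxed))
        ;
}
```

Both update exactly one memory location; precisely the situation where the CAS variant wins, because mutex word and payload coincide.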

And there is another important difference. The loop condition is moved
from the beginning of the critical section to the end. This has a
significant effect on scheduling issues. See below.

Supposing that spinning on our spinlock does not saturate the bus and
just wastes the internal resources of one core (which is also the case
in example 1), why should the code in example 1 be faster than in
example 2?

The loop is smaller. Example 1 results in a collision only if at least
two threads are between lines 1 and 3. In example 2 the range is extended
to lines 1 to 6. OK, there is a bug: the delete oldPtr should be kept out
of this area, as it is in example 1. And note that the collision
probability scales with the square of this time (and with even higher
powers for multiple collisions).

Yes, I understand that example 2 might run slowly in some
circumstances, e.g. when a thread that has locked a spinlock is
preempted and other threads start to spin.

This is an important worst case. For this reason most spin-lock
implementations fall back to some ordinary blocking, at least
sleep(0), after some turns.
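Such a fall-back might look like this (a sketch under my own assumptions; the spin budget of 100 is arbitrary): after a bounded number of spins the waiter yields its time slice instead of burning a full quantum while the lock owner is preempted.

```cpp
#include <atomic>
#include <thread>

// Hypothetical spin-then-yield lock: bounded spinning, then a
// sleep(0)-style fall-back (std::this_thread::yield) so waiters
// stop wasting cycles when the owner has been preempted.
class SpinYieldLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock() {
        int spins = 0;
        while (flag.test_and_set(std::memory_order_acquire)) {
            if (++spins > 100) {            // spin budget exhausted
                std::this_thread::yield();  // give up the time slice
                spins = 0;
            }
        }
    }
    void unlock() { flag.clear(std::memory_order_release); }
};
```

The yield is what turns the pure spin-lock back into "ordinary blocking" once contention lasts longer than a few turns.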

This is where the second advantage of a compare-exchange loop comes into
play. The check condition is at the end of the critical section. This is
like optimistic locking.
Any process is allowed to proceed until it wants to update a value. At
that point the update is rejected if it is not based on the previous
version. This is almost the same way any reasonable version control
repository works. (Note that the compare-exchange loop still has a
potential race condition: other threads might have modified the
value and eventually restored the old value before the loop completes.)
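That parenthetical race is the classic ABA problem, which can be demonstrated in a few lines (a single-threaded simulation of the interleaving, my own construction):

```cpp
#include <atomic>

// Simulation of the ABA hazard: the CAS compares only the value, so
// a change A -> B -> A performed by other threads between our load
// and our CAS is invisible to us.
bool cas_sees_aba() {
    std::atomic<int> v{1};
    int old = v.load();         // we read A (= 1)
    v.store(2);                 // meanwhile: someone writes B...
    v.store(1);                 // ...and restores A
    // Our CAS still succeeds, although the value was modified twice.
    return v.compare_exchange_strong(old, 3);
}
```

For plain counters this is harmless; it only bites when the value stands for something with identity, such as a pointer in a lock-free stack.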

Checking the condition at the end of the loop has the important
advantage that if a thread is preempted or suspended for some reason, no
other thread will spin. Only the thread that did not do its job in time
pays the bill and takes one extra turn. In contrast, a spin-lock mutex
may spin indefinitely, possibly even in parallel, if only one thread
causes a delay. It simply is not lock-free. For the same reason compare-
exchange loops cannot cause priority inversion either.

May a composite lock (spinlock + kernel mutex) be helpful in this case?

This is one work-around. It has the disadvantage that it requires kernel
resources even if they are never needed. It also makes construction and
destruction of the fast mutex expensive, which could be an issue if they
are used in temporaries.
In fact I have not seen such an implementation of a fast mutex so far.
Either you use fast mutexes, confirming thereby that you do not
hold the mutex for long, or you don't and use kernel mutexes instead.

Or if we take a pure kernel-level
implementation where a 'thread' cannot be 'preempted', what about
saying that example 2 is as good as example 1? Probably not, as Linux
uses RCU, but what are the exact reasons?

First of all you incur considerable overhead from the required switches
to kernel context. But a spin-lock is not RCU, so I don't understand
your question.