T1 releases the lock L. This signals T2, and at a lower level this involves some sort of kernel transition.

T2 wakes up and acquires the lock L, incurring another context switch.

So there are always at least two context switches when primitive synchronization objects are used. A spin lock can avoid these expensive context switches and kernel transitions.

Most modern hardware supports atomic instructions, one of which is 'compare and swap' (CAS). On Win32 systems these are exposed as interlocked operations. Using the interlocked functions, an application can compare and store a value as a single atomic, uninterruptible operation. With them, it is possible to achieve lock freedom and save the expensive context switches and kernel transitions that can be a bottleneck in a low latency application. On a multiprocessor machine, a spin lock (a kind of busy waiting) can avoid both of the above issues and save thousands of CPU cycles otherwise spent in context switches. The downside is that a spin lock becomes wasteful when held for a longer period of time, in which case it prevents other threads from acquiring the lock and progressing. The implementation shown in this article is an effort to develop a general purpose spin lock.
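To make the discussion concrete, here is a minimal sketch of such a lock built on InterlockedCompareExchange. The names acquire(), release(), dest and compare follow the text below; the locked value 100 is arbitrary.

#include <windows.h>

// Minimal spin lock sketch: dest is 0 (== compare) when the lock is
// free; a thread atomically stores 100 to take it.
struct SimpleSpinLock
{
    volatile LONG dest;                  // 0 = unlocked, 100 = locked
    static const LONG compare  = 0;
    static const LONG exchange = 100;

    void acquire()
    {
        // Busy wait until dest was 'compare' and we swapped in 'exchange'.
        while (InterlockedCompareExchange(&dest, exchange, compare) != compare)
            ;                            // spin
    }

    void release()
    {
        // Plain aligned store back to 'free'; whether this needs an
        // interlocked operation is debated in the comments below.
        dest = compare;
    }
};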

Here, thread T1 acquires the lock by calling acquire(), and the value of dest becomes 100. When thread T2 tries to acquire the lock, it loops continuously (a.k.a. busy waiting), because the values of dest and compare differ and InterlockedCompareExchange therefore keeps failing. When T1 calls release(), it sets dest back to 0 and allows T2 to acquire the lock. Because only the thread that acquired the lock calls release(), mutual exclusion is guaranteed.

Above is a simple implementation of a spin lock. However, this implementation alone is not production fit, because spinning consumes CPU cycles without doing any useful work: the spinning thread stays scheduled on the processor until it is pre-empted. Another downside is that the spinning thread continuously accesses memory to re-evaluate dest inside the InterlockedXXX call, which puts pressure on the memory bus.

On a single processor machine, a spin wait is a total waste of CPU, as the other thread T2 cannot even get scheduled until the spinning thread is switched out by the kernel.

So far this implementation isn't good enough. A general purpose spin lock requires a bit more work: it must fall back to true waiting in the worst case, when it has been spinning for a longer period. Here are the points that must be considered:

Yield Processor

The Win32 macro YieldProcessor() emits a spin-wait hint instruction (PAUSE, encoded as 'rep nop', on x86). It makes the processor aware that the code is currently performing a spin wait, so that on a hyper-threading enabled processor the other logical processors sharing the core can make progress.

Switch to Another Thread

Sometimes it is useful to force a context switch once a spinning thread has already consumed spinning time equivalent to the time slice allocated to it by the kernel. At that point it makes good sense to let another thread do useful work instead. The function SwitchToThread() relinquishes the calling thread's time slice and runs another thread that is in the ready state. It returns TRUE when a switch occurs, otherwise FALSE.

Sleeping

SwitchToThread() may not consider all threads on the system for execution, so it can be wise to occasionally call Sleep() or SleepEx(). Calling Sleep() with an argument of 0 is a good approach: it results in a context switch only if another thread of equal priority is in the ready state; otherwise it returns immediately without one.
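To illustrate, one wait step of a spinning thread could escalate through these calls roughly as follows; the threshold is a placeholder, and the sample project's actual constants are described in a later section:

#include <windows.h>

// One wait step of a spinning thread, escalating with the spin count.
// 'yieldLimit' is an illustrative threshold, not the project's constant.
inline void WaitStep(int spins, int yieldLimit)
{
    if (spins < yieldLimit)
        YieldProcessor();        // stay on the CPU, just hint the core
    else if (!SwitchToThread())  // give the time slice to a ready thread
        Sleep(0);                // none found: offer it to equal priority
}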

Other Considerations

A pure spin lock is only good enough when the lock is held for a very short period of time. The critical region should contain no more than about 10 instructions, and in practice even a simple memory allocation, a virtual call, or file I/O takes more than 10 instructions.

Secondly, as mentioned above, it would be wasteful to use spin locks when an application runs on a single processor.

Sample Project and Implementation

The sample project, in C++, contains a spin lock implementation that takes the points above into account. It also has implementations of a Stack, a Queue, and a thin Producer-Consumer class. I'll only focus on the spin lock implementation here, as the rest is easy to follow.

The file SpinLock.h defines these constants:

YIELD_ITERATION, set to 30: the spinning thread will spin for 30 iterations waiting to acquire the lock before it calls Sleep(0) to give other threads an opportunity to progress.

MAX_SLEEP_ITERATION, set to 40: when the total iteration (or spin) count reaches 40, a context switch is forced using SwitchToThread(), in case another thread is in the ready state.
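In the header this presumably reads along these lines:

// Spin this many iterations before calling Sleep(0).
#define YIELD_ITERATION 30
// At this total spin count, force a context switch via SwitchToThread().
#define MAX_SLEEP_ITERATION 40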

The struct tSpinLock acts as the lock object and is declared in the class whose objects are being synchronized. This object is then passed to the constructor of a tScopedLock, which keeps a reference to it. The tScopedLock constructor locks the object using a member function of the class tSpinWait; the destructor ~tScopedLock() releases the lock.
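The shape of this RAII wrapper might be sketched as follows; the names follow the description above, the exact signatures in the project may differ, and the real Lock() with backoff is sketched in the next section:

#include <windows.h>

struct tSpinLock
{
    volatile LONG dest;              // 0 = free, non-zero = held
};

class tSpinWait
{
public:
    // Trivial placeholder versions; see the Lock() sketch below.
    void Lock(tSpinLock& lock)
    {
        while (InterlockedCompareExchange(&lock.dest, 1, 0) != 0)
            YieldProcessor();
    }
    void Unlock(tSpinLock& lock) { InterlockedExchange(&lock.dest, 0); }
};

class tScopedLock
{
public:
    explicit tScopedLock(tSpinLock& lock) : m_LockObj(lock)
    {
        m_SpinWait.Lock(m_LockObj);      // acquire on construction
    }
    ~tScopedLock()
    {
        m_SpinWait.Unlock(m_LockObj);    // release on destruction
    }
private:
    tSpinLock& m_LockObj;                // reference to the shared lock
    tSpinWait  m_SpinWait;
};

A synchronized method then simply declares tScopedLock guard(m_Lock); at the top of its scope.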

The Lock() function in the class tSpinWait has a nested while loop. This is deliberate: a thread spinning to acquire the lock does not call InterlockedXXX() on every iteration; instead, it loops in the inner while loop on a plain read. This trick avoids keeping the system memory bus busy with continuous interlocked calls.

The inner while loop just compares the values of dest and compare; once they become equal (the lock looks free), the thread attempts the acquisition with the interlocked operation. Depending on the iteration count, the spinning thread is either put to sleep or switched out. When the application is running on a single CPU, it always forces a context switch.
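Put together, Lock() has roughly this structure; this is a sketch reconstructed from the description above, not the project's exact source, and the single-CPU special case is omitted:

#include <windows.h>

#define YIELD_ITERATION 30        // spin this long before Sleep(0)
#define MAX_SLEEP_ITERATION 40    // then force SwitchToThread()

struct tSpinLock { volatile LONG dest; LONG exchange; LONG compare; };

void SpinLock_Lock(tSpinLock& LockObj)
{
    int iterations = 0;
    while (true)
    {
        // Inner loop: a cheap read-only test, so the memory bus is not
        // hammered by a locked instruction on every iteration.
        while (LockObj.dest != LockObj.compare)
        {
            if (iterations >= MAX_SLEEP_ITERATION)
                SwitchToThread();            // force a context switch
            else if (iterations >= YIELD_ITERATION)
                Sleep(0);                    // yield to equal priority
            else
                YieldProcessor();            // plain spin hint
            ++iterations;
        }
        // The lock looked free; try to take it atomically.
        if (InterlockedCompareExchange(&LockObj.dest, LockObj.exchange,
                                       LockObj.compare) == LockObj.compare)
            return;                          // acquired
    }
}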

Test Results

I tested the performance of this spin lock implementation by inserting integers into a queue from multiple threads (each thread inserting 10000 integers into the queue). I then replaced the spin lock with a Critical Section synchronization primitive in the code and ran the same tests. All tests ran on an Intel Core 2 Duo T9600 @ 2.80 GHz.

The x-axis is the number of threads and the y-axis is the time taken in milliseconds. Both synchronization methods (spin lock and CS) showed similar performance with 2 and 4 threads. As the number of threads increased, critical section locking took more than double the time of the spin lock. The spin lock scaled much better as contention increased with the thread count. Times were measured with the Win32 QueryPerformanceCounter API. However, I would suggest performing your own testing on the platform you intend to use.
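The elapsed time was presumably computed along these lines (sketch):

#include <windows.h>

// Elapsed milliseconds between two QueryPerformanceCounter readings.
double ElapsedMs(LARGE_INTEGER start, LARGE_INTEGER end)
{
    LARGE_INTEGER freq;
    QueryPerformanceFrequency(&freq);    // counts per second
    return (end.QuadPart - start.QuadPart) * 1000.0 / freq.QuadPart;
}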

Here is the table with the results:

No. of Threads   Time taken in ms (Spin lock)   Time taken in ms (Critical Section)
 2                 6.6                             7.14
 4                12.81                           14.09
 6                16.01                           46.37
 8                23.32                           54.34
10                26.21                           74.76
15                41.17                           89.05
20                47.63                          116.82
25                62.25                          147.68
30                64.37                          169.17
35                88.02                          210.07
40                93.99                          296.32

Future Work

Profiling the code on different platforms.

Adding a couple more data structures to the project, such as associative arrays and a hashtable.

Conclusion

This was an effort to develop a general purpose spin lock implementation. Pure spin locking isn't a good option in all scenarios, so there is a need for an implementation that allows the spinning thread to fall back to being suspended by the kernel.


About the Author

Studied for an MSc in Network and Parallel Computing at Reading University, UK. I was always interested in IT. Previously worked in network security, games development, and satellite communications; now in the financial services industry. After work, time is usually spent with my son. I love playing cricket and badminton, though not much sport is happening these days. OK, enough about me.

I'd like to use it in a personal project I'm working on, but I have some questions.

In my testing (on a single core), Sleep(0) makes my CPU go to 100% when spinning.
Now I am wondering what would be wrong with using Sleep(1), which makes a huge difference to the CPU load?

Second, as I understand the code, it goes into a while(true) loop, compares once with the InterlockedCompareExchange function, then goes into the nested while(LockObj.dest != LockObj.compare) loop, which it can only leave once the two variables become equal.
Then it compares the values again with InterlockedCompareExchange before leaving the acquire loop.

Isn't the whole reason we use the InterlockedXXX functions that we can be 100% sure the value of LockObj.dest is the true one at that moment? So in this case, could the code spin longer than really necessary?
Or does the volatile keyword take care of that? But then, why use InterlockedCompareExchange in the first place?

You shouldn't use spinning-based synchronization on a single core (with no hyper-threading enabled) because only one thread can run at any time. If thread T1 is holding the lock and is suspended by the kernel, a thread T2 waiting for the same spin lock will never be able to acquire it until T1 is re-scheduled, finishes the job it was doing, and finally releases the lock. As I mentioned in the article, spinning wastes CPU cycles. Sleep(0) results in a context switch only if another thread of equal priority is in the ready state, so the behaviour may depend on the thread priorities you are using in your application.

while(LockObj.dest != LockObj.compare) is used so that a spinning thread doesn't call InterlockedXXX() on every iteration; instead, it loops in the inner while loop. This trick avoids keeping the system memory bus busy with continuous interlocked calls.

There is a definite need to use interlocked operations both when acquiring and when releasing the lock, because had we not released the lock using an interlocked operation, we could have two problems.
Read the discussion in this article between bobyx82 and me.

Let me know if you have more questions.

And one more thing: volatile doesn't help with memory barriers at the processor level. It is a command to the compiler not to re-order the statements (expressions).

I'm sure it was supposed to be a reference to the tSpinLock object. Change the declaration to

tSpinLock& m_LockObj;

and initialize it properly in the tScopedLock constructor.

Now the tasks:

1) The lock is not reentrant. Try to lock it twice from the same thread and it will hang the application with the CPU busy.
It would be a good idea to store in tSpinLock::dest the id of the thread owning the lock.

2) Helper::GetNumberOfProcessors calls GetSystemInfo(&sysinfo) on each iteration. Wrap SYSTEM_INFO in a class, invoke the call once in its constructor, and declare a static instance of that class inside GetNumberOfProcessors.
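For example (sketch):

#include <windows.h>

// Query the system info once and cache it; GetSystemInfo is not free,
// so it shouldn't run on every spin iteration.
inline DWORD GetNumberOfProcessorsCached()
{
    struct CachedSystemInfo
    {
        SYSTEM_INFO info;
        CachedSystemInfo() { GetSystemInfo(&info); }
    };
    static CachedSystemInfo s;        // constructed exactly once
    return s.info.dwNumberOfProcessors;
}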

1. YieldProcessor is not the greatest way to do this. At least according to Intel documentation, the proper instruction for a busy wait is "__asm { pause }", while the MS-provided macro shows that it uses "rep nop" or an SSE nop. I tend to trust the Intel documentation more than MS in this case, and it is also the way TBB does their spin lock. In my codebase, I used something along the following lines (a sketch; the constants are illustrative):
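#include <windows.h>
#include <intrin.h>   // _mm_pause, equivalent to __asm { pause }

// Sketch: spin on a plain read, pausing in the loop, and double the
// pause count between attempts so the interlocked exchange is tried
// at an increasingly reduced rate.
inline void SpinAcquire(volatile LONG* lock)
{
    unsigned backoff = 1;
    while (InterlockedCompareExchange(lock, 1, 0) != 0)
    {
        do
        {
            for (unsigned i = 0; i < backoff; ++i)
                _mm_pause();              // busy-wait hint
            if (backoff < 1024)           // cap the exponential backoff
                backoff <<= 1;
        } while (*lock != 0);             // cheap read-only re-test
    }
}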

Those two modifications will greatly help reduce wasted "real" CPU cycles while spinning. You attempt the actual exchange at an increasingly reduced rate, which prevents a very large amount of memory and CPU wastage.

When you get to the point of implementing recursion, you may want to consider two variations. You can do one with a single 64-bit value, which is fair but can be rather performance inhibiting if you use the recursion a lot. As a starting point and debugging method this is fine; for high performance systems you want to separate the thread id portion from the recursion count by at least 64 or 128 bytes, depending on cache line size, in order to prevent false sharing. Basically, if you use the recursion any notable amount, it causes false sharing, which burns memory bandwidth and of course tears down a fair amount of CPU performance. In the long run a generalized solution is probably overkill, but it is always interesting to present as an exercise.
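The layout idea, as a sketch (64-byte cache line assumed, names illustrative):

#include <windows.h>

// Owner id and recursion count on separate cache lines so that
// recursion bookkeeping doesn't false-share the lock word.
struct RecursiveSpinLockLayout
{
    __declspec(align(64)) volatile LONG ownerThreadId; // CAS'd by waiters
    __declspec(align(64)) LONG recursionCount;         // owner-only
};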

Another item of interest: don't bother with volatile variables, and only worry about proper barriers in your code. This has been an ongoing argument, but I tend to agree with the folks with the best knowledge, such as the people behind TBB, who as I understand it removed all but one single volatile and got another pretty notable performance increase. (Well, notable only in terms of scaling over many cores.) Basically, the argument is that volatile removes a lot of optimization potential, even though you only need a few locations in code that absolutely guarantee order of operations. I don't want to start a big discussion on this, but it seems to be working out pretty well, and I do a lot of highly threaded code this way with a fair improvement in overall performance.

Like any other technique, there are pros and cons. A spin lock is a good technique when not used on a single processor, as said in the article. It can be helpful when the lock is held for a short period of time; something like waiting for a database write isn't meant for spin locks. In fact, the technique I showed in the article is quite similar to what Windows uses in the function InitializeCriticalSectionAndSpinCount(). This function also ignores spin iterations on a single processor system. This is what Microsoft has to say in support of its usage:

"The spin count is useful for critical sections of short duration that can experience high levels of contention. Consider a worst-case scenario, in which an application on an SMP system has two or three threads constantly allocating and releasing memory from the heap. The application serializes the heap with a critical section. In the worst-case scenario, contention for the critical section is constant, and each thread makes an expensive call to the WaitForSingleObject function. However, if the spin count is set properly, the calling thread does not immediately call WaitForSingleObject when contention occurs. Instead, the calling thread can acquire ownership of the critical section if it is released during the spin operation.

You can improve performance significantly by choosing a small spin count for a critical section of short duration. The heap manager uses a spin count of roughly 4000 for its per-heap critical sections. This gives great performance and scalability in almost all worst-case scenarios."
"

Assignment to an integral type is always atomic, therefore there is no need to use InterlockedXXX in the release method. In acquire, by contrast, two tasks are done in one operation: compare and exchange.

According to MSDN, simple reads and writes to 32-bit values are atomic for correctly aligned variables.
I should have mentioned alignment in my article. Would you point out where it says writes to an integer are not atomic? Moreover, the variable dest is a member of the struct tSpinLock, and only one thread will ever call release() at any point in time while the rest of the threads spin, so there isn't any concurrent write access to dest.

Well, the benefit of this approach over InitializeCriticalSectionAndSpinCount is that you don't have to call EnterCriticalSection(), LeaveCriticalSection(), and DeleteCriticalSection(). People do forget.
Secondly, if you have a look at the structure RTL_CRITICAL_SECTION, you'll see it does a lot more; it uses an auto-reset event, which is a kernel object, so there will be some sort of kernel transition too. This implementation, by contrast, stays purely at user level.

According to MSDN, simple reads and writes to 32-bit values are atomic for correctly aligned variables.

Atomicity does not guarantee visibility in an SMP environment. This will probably work on most IA platforms due to the strong memory consistency model, but a simple (even aligned) assignment is neither portable nor does it guarantee correctness. Generally, a spin lock implementation requires memory fences. Unfortunately, the volatile attribute does not provide any (it only prevents compiler optimizations).
Also, this code may turn out to be very inefficient in some situations, since it is prone to false sharing of a cache line. Spin locks should always be aligned to the cache line size to prevent that.
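For example (sketch, assuming a 64-byte cache line):

#include <windows.h>

// Give the lock word a cache line of its own so neighbouring data
// cannot false-share it.
struct __declspec(align(64)) AlignedSpinLock
{
    volatile LONG dest;                 // the lock word
    char pad[64 - sizeof(LONG)];        // pad out the rest of the line
};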

The volatile keyword only prevents instruction re-ordering at the compiler level; it does not prevent re-ordering at the cache level, and this implementation does not try to provide any memory fencing through volatile. On SMP systems each processor has its own cache, which means two values of the same variable can exist at once, volatile or not. There are two ways to fix this problem for multithreaded code on SMP: (1) disable the CPU cache using the PAGE_NOCACHE flag and expect a drastic slowdown of the application (note that interlocked instructions may raise a hardware exception on no-cache memory); or (2) use interlocked operations.

The cache coherency protocol synchronizes the caches of two processors when a shared variable (held in different CPU caches) is modified. Now, you may wonder what the point of application-level synchronization is if the hardware already does this. Well, that cache synchronization happens asynchronously with code execution, which means that while the caches are being synchronized, application threads can still see different values of the shared variable. Interlocked operations fix this because an interlocked instruction acts as a command to make the change to memory directly under a locked bus, which nullifies any effect of cache inconsistencies.

As far as memory access barriers are concerned, InterlockedXXX has acquire and release semantics, so memory re-ordering is out of the question. Another thing about interlocked operations: if a shared variable is changed using non-interlocked operations by one thread, and another thread then uses an InterlockedXXX function to change that shared variable, the interlocked operation first forces the updated value of the shared variable into memory before executing.

So, I don't think what you said is right; it will work on most of the platforms because of the guarantees given by interlocked instructions: memory consistency, atomicity, and cache coherence.

So, I don't think what you said is right; it will work on most of the platforms because of the guarantees given by interlocked instructions: memory consistency, atomicity, and cache coherence.

Sorry if I wasn't clear enough. I referred to the release case, where you did NOT use interlocked primitives. Anyway, the cache hierarchy is more complex than that, and memory reordering is not only caused by caches but also, e.g., by out-of-order speculative execution.
Basically, you should use interlocked primitives in every spin lock access unless you can guarantee that no memory barriers are required.

Imagine this situation:
Thread A acquires the spinlock.
Thread B waits for the spinlock.
Thread A releases the spinlock (write to local cache).
Thread B still sees the spinlock locked, because the old spinlock value resides in its local cache.
(A memory fence may be required here.)

This will probably work on most IA platforms due to self snooping, but try Itanium or Alpha (which had support for WinNT).

There is, however, a different problem on IA. Another scenario:

Thread A acquires the spinlock.
Thread A fills a structure for thread B.
Thread A releases the spinlock.
Thread B acquires the spinlock, but the content of the structure is bogus due to memory access reordering, since there was no fence instruction before the spinlock release.

Thanks for pointing that out. I remember reading about release and interlocked operations in an article by Joe Duffy some time back. I think example 1 wouldn't pose any problems, because at some point very soon thread B will see the updated value of 'dest'. However, I realized there could potentially be another issue with the plain assignment in the release case: if InterlockedXXX is not used in release, it could lead to starvation, because the release may not become visible in time (cache synchronization being an asynchronous operation itself), while the thread that called release can successfully re-acquire the lock if it wants to (it's coded that way). The other threads spinning for access could then starve because they don't see the release fast enough. I'll update the implementation.
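The updated release would then be something like this sketch:

#include <windows.h>

// Release through an interlocked store: the full barrier publishes the
// unlock (and the data writes before it) promptly to spinning threads.
inline void SpinRelease(volatile LONG* dest)
{
    InterlockedExchange(dest, 0);    // 0 == unlocked
}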

Thanks for pointing that out. I remember reading about release and interlocked operations in an article by Joe Duffy some time back. I think example 1 wouldn't pose any problems, because at some point very soon thread B will see the updated value of 'dest'.

Right, I mentioned that the 1st issue would not occur on IA. But what if the dest value has already been committed while the data writes preceding the spinlock release haven't (the 2nd issue)? Spin locks are usually used to prevent data races, not just for task scheduling.

Spin locks are typically only used in kernel mode with the scheduler's preemption disabled; in user mode they can lead to a priority inversion issue, so a transition to kernel is required for any sort of user mode lock. It's true that SwitchToThread() in fact yields the current thread's CPU quantum to another ready thread, but this implementation can cause higher CPU usage and more memory bus contention than Windows's critical section. I once wrote a similar user-mode spin lock, pretty much like yours, and tests indicated no huge benefit from such a design, while incurring huge CPU/memory bus utilization. I think the default critical section is tuned to work well under most workloads. For high concurrency code paths, I guess a queued spin lock (user mode) can help to improve the CPU/memory contention issue; even so, it's still not a good idea to use such a lock in user mode, since priority inversion remains a problem.

If spin locks were only meant to run in kernel mode, then why would a language like C# implement an API for use in user mode? How does spin lock usage in user mode lead to a priority inversion issue? I would think it's useful for locking regions that have just a few instructions, as there is a bit more cost to using Win32 CS objects, which maintain a debug_info structure internally as well.
Can you explain a bit more about your comment on high CPU usage? My tests showed no such side effects, and the lock seems to scale when more than 4 threads are running. The inner while loop is actually meant to reduce memory bus utilization, as it's not calling InterlockedXXX() in every iteration.