Introduction

This article examines critical sections: what they are, what they're for, and how they're implemented in Windows. Then we'll look at an alternative implementation with some performance advantages and extra functionality (a timeout).

Readers who are not very experienced with multi-threaded applications will find this article educational, since it covers synchronization problems in detail. Those who are experienced with multithreaded, high-performance applications may find my optimization useful in some cases, though it's not a revolutionary performance breakthrough (critical sections are already fast enough in most cases). They can skip straight to the final section.

What is synchronization?

In operating systems such as Windows and Linux, the decision to grant processor time to a particular thread is up to the system. An application is not allowed to interfere with the thread scheduling mechanism. Even assigning different priorities to threads does not guarantee the order of execution. In particular, two consecutive processor instructions are not guaranteed to execute without preemption by another thread.

In multi-threaded applications, there's usually a need to access some resource from different threads. Depending on the type of the resource/object we access, we may or may not allow simultaneous access to it. For example, there's no problem when multiple threads read some global parameter simultaneously. There is, however, a problem if you want to append a node to a shared linked list. Inserting a node into a linked list involves several steps, hence we can be preempted in the middle, and until we resume, the linked list is left in an inconsistent state. We say that insertion into a linked list is not an atomic operation (atom, in Greek, means indivisible).

So, we need a solution. As was said already, we cannot "ask" the system not to preempt us until we finish the operation. It may seem that we can solve this simply by keeping a boolean variable, which we set to true when we access the shared resource and to false when we finish; then, before accessing it, we check this variable. But this naive method does not solve the problem: querying a variable and then modifying it is not atomic; you can be preempted after querying the variable and just before modifying it.

The correct solution was designed into the processors. Modern processors have instructions that perform several steps yet still execute atomically. For example, on i386 processors (modern Intel and AMD processors are their successors), there are the xadd and cmpxchg instructions. xadd adds a specified value to the operand and returns the operand's previous value. cmpxchg modifies the operand to a specified value if the operand was equal to another specified value. What makes these instructions so valuable for us is the processor's ability to execute them atomically: a thread cannot be preempted in the middle, leaving xadd with only part of the job done.
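
The semantics of these two instructions can be sketched portably with C++'s std::atomic (on x86, compilers emit lock xadd and lock cmpxchg for these calls); this is an illustrative stand-in, not the Win32 API:

```cpp
#include <atomic>
#include <cassert>

// Portable sketch of the two instructions' semantics via std::atomic.

// xadd: add 'delta' to the operand and return its PREVIOUS value.
int atomic_xadd(std::atomic<int>& v, int delta) {
    return v.fetch_add(delta);
}

// cmpxchg: store 'desired' only if the operand still equals 'expected';
// return whether the exchange took place.
bool atomic_cmpxchg(std::atomic<int>& v, int expected, int desired) {
    return v.compare_exchange_strong(expected, desired);
}
```

For instance, atomic_xadd on a value of 5 with delta 3 returns 5 and leaves 8 behind; an atomic_cmpxchg with a stale expected value fails and changes nothing.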

So far, we have atomic operations that perform several steps; this gives us immunity against the thread scheduling mechanism. But there's also another factor we must take into account: multi-processor systems. On machines where several processors (or other on-board devices) work with the same memory, extra care must be taken to restrict simultaneous access to the same variable. Because of this, the atomic instructions we've discussed should be executed with the lock prefix, which signals all other devices in the system that a specific memory location is being modified. This affects performance (tens to hundreds of cycles, depending on the processor), but it is the only way to synchronize between processors. It may seem to some that multi-processor systems are something special, intended for 'monster' servers and not seen every day, but this is not true: today's home computers are already dual- or quad-core; soon there won't be such a thing as a single-processor computer.

Operations that restrict simultaneous access to a variable in this way are called interlocked operations, and they are the basis for all the synchronization mechanisms in all operating systems. (In kernel mode, there's a mechanism that guarantees the order of execution, but the lock prefix is still needed on multi-processor machines.)

Synchronization in Win32 API

In Win32, interlocked functions are available through the standard API (exported by kernel32.dll); you don't have to be an assembler guru to work with them. All of them have the InterlockedXXXX prefix. Examples are InterlockedIncrement, InterlockedDecrement, InterlockedExchangeAdd, etc.

All this is very good, but let's come back down to earth. How can all this help us add a node to a shared linked list? So far, we haven't seen anything like InterlockedAppendNodeToMyLinkedList. So, let's try to implement it using the functionality we have.

A simple way to do this is the following: initialize some variable, say nLockCount, by some known value, say 0. Next, at the beginning of the function, put the following loop:

while (InterlockedCompareExchange(&nLockCount, 1, 0));

The first caller of InterlockedCompareExchange will set nLockCount to 1 and receive 0; all others will receive 1 without affecting nLockCount. Even if two threads, even on different processors, enter this loop (nearly) simultaneously, only one of them will receive 0 and exit the loop. After the loop completes, we can do whatever we want: add nodes to a linked list, reallocate buffers, anything. Does this mean we can't be preempted by another thread? Of course not. The system runs its scheduling mechanism as usual. But if we're interrupted and some other thread calls our function, it will just spin in this loop until eventually our thread is resumed, finishes its work (whatever it is), and sets nLockCount back to 0. After that point, one of the "waiting" threads (if there is one) will exit the loop when it is resumed, and so on.

We say that we've implemented a synchronization mechanism, and nLockCount is our shared synchronization object. A situation in which a thread attempts to lock an object that is already locked by another thread is called synchronization collision. Our mechanism detects such collisions, and prevents simultaneous access to the shared resource. Let's now analyze the advantages and drawbacks of what we've just done.

First, suppose that from within a locked section (between the locking loop and setting nLockCount back to 0) we call another function which, in turn, wishes to lock the same resource. The locking loop will spin forever because nLockCount is already 1 and no one is going to unlock it. So, we have a deadlock. There are two things we can do about it: either always be aware of which synchronization objects we've already locked before calling other functions, or enhance our synchronization mechanism to remember which thread owns it, and skip the lock/unlock if it is already owned by the calling thread.
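
The "remember the owner" enhancement can be sketched as follows; std::thread::id stands in here for the Win32 thread ID (an illustrative sketch, not the article's code):

```cpp
#include <atomic>
#include <thread>
#include <cassert>

// A default-constructed std::thread::id means "no owner".
std::atomic<std::thread::id> g_owner{std::thread::id{}};
int g_recursion = 0;  // touched only by the owning thread

void rec_lock() {
    const std::thread::id self = std::this_thread::get_id();
    if (g_owner.load() == self) {  // we already own it: just count
        ++g_recursion;
        return;
    }
    std::thread::id none{};
    while (!g_owner.compare_exchange_weak(none, self))
        none = std::thread::id{};  // reset and spin until we grab ownership
    g_recursion = 1;
}

void rec_unlock() {
    if (--g_recursion == 0)
        g_owner.store(std::thread::id{});  // fully released
}
```

A function that re-enters rec_lock() from the thread that already holds the lock no longer deadlocks; it just increments the recursion count.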

(This doesn't mean we've solved the deadlock problem completely. Suppose a thread owns A and waits for B, whereas another thread owns B and waits for A. Deadlock problems may arise from incorrect overall design, and they can't be solved by "enhancements" of the synchronization objects themselves. There are different models for designing the whole application correctly; we'll skip that discussion because it's endless.)

Now, about performance. Our synchronization object is very light in terms of memory: just a single 32-bit variable, or two variables if we want to remember which thread owns us. Locking/unlocking it is also fast; well, no faster than the lock prefix allows, but that's what we have. But what is the cost of a synchronization collision for our method? On collision, we just wait in the loop until the object is unlocked; this is called spinning. This is fine on multi-processor systems when we usually hold the lock for short durations, so that after several attempts the thread is likely to acquire it. But in other cases, we have a situation where the OS gives processor time to a thread, and all this valuable time is just spent spinning, without a reasonable chance (on single-processor machines, without any chance) of success. Isn't it better to "tell" the OS somehow that we don't need the processor time right now?

This can easily be fixed: modify the locking loop so that if you fail to acquire the lock, you call Sleep(0), which tells the OS that we don't need the rest of our time slice. Well, this helps, but reality is even crueler. Attaching/detaching a thread for execution, the so-called context switch, is a complex operation for the OS, and it costs processor time. Although we saved the rest of the time slice from useless spinning, it would still be better to let the system know in advance that our thread should not execute right now, so the context switch to it can be omitted. To implement this, there should be some agreed mechanism that informs the system about our synchronization pattern.
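
The modified loop can be sketched like this, with std::this_thread::yield() playing the role of Sleep(0) in this portable stand-in:

```cpp
#include <atomic>
#include <thread>

// The same locking loop, but giving away the rest of the time slice
// on each failed attempt instead of spinning at full speed.
std::atomic<long> lockVar{0};

void lock_with_yield() {
    long expected = 0;
    while (!lockVar.compare_exchange_weak(expected, 1)) {
        expected = 0;
        std::this_thread::yield();  // let the OS run another thread
    }
}

void unlock_after_yield() {
    lockVar.store(0);
}
```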

Win32 synchronization objects

We can make the OS aware of our synchronization intentions by using its synchronization objects. There are different types of these objects; the most common are events, mutexes, semaphores, and critical sections. They're created by CreateEvent, CreateMutex, CreateSemaphore, and InitializeCriticalSection, respectively. I won't go deeply into the details of each of them; they are well documented in MSDN. I'd just like to point out some of their key differences:

Only critical sections and mutexes "remember" which thread owns them. (By the way, that's the only difference between a mutex and an auto-reset event.) They are intended mainly for the situation we've described: per-thread locking of a shared resource. Other objects may be used for this purpose too; however, that's not their intended use.

All of them, except critical sections, are kernel objects (hence, you free them by CloseHandle). Below, we'll discuss in depth what it means for us.

To acquire kernel synchronization objects, there are the WaitXXXX (and MsgWaitXXXX) functions in the Win32 API; the most basic is WaitForSingleObject. If the object is available, it is acquired in place and the function returns immediately. Otherwise, the OS suspends the calling thread and gives the rest of its processor time to another one. Furthermore, the OS will not schedule this thread until the object(s) it's waiting for can be acquired.

It is also worth noting that the same kernel object may be accessed from different processes (applications), thus enabling synchronization between applications. (In fact, even non-kernel objects can be shared between processes using a shared-memory mechanism.)

Sounds good. Unfortunately, there are drawbacks too. Any operation on a kernel object is performed by the system in kernel mode, leading to a kernel-mode transition, which has significant implications. Even if the object is available and acquired in place by WaitForSingleObject, calling this function too often can lead to a serious performance hit. Interlocked operations cost us tens to hundreds of processor cycles, whereas every kernel-mode call, including WaitForSingleObject and ReleaseMutex, costs thousands of cycles; so on multi-processor systems, for short locks, spinning may be the preferred way to go.

Critical Sections

With all the above in mind, we arrive at critical sections. A critical section is a hybrid. When you attempt to lock it (call EnterCriticalSection), the idea is to perform the following steps:

1. Check if this thread already owns the critical section. If it does, the lock/unlock is omitted (skip the rest).

2. Attempt to lock a dedicated variable via an interlocked instruction (similar to what we've done above). If the lock succeeds, return (skip the rest).

3. Optionally, retry the second step a number of times. This is enabled by calling InitializeCriticalSectionAndSpinCount instead of InitializeCriticalSection. This step is always skipped on single-processor machines.

4. After all of the above has failed, call a kernel-mode WaitXXXX function.
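
The steps above can be sketched in portable C++; here a mutex and condition variable stand in for the kernel wait (Windows actually uses an auto-reset event, and as noted later, the real algorithm is more subtle than this simplified sketch):

```cpp
#include <atomic>
#include <mutex>
#include <condition_variable>
#include <thread>

class HybridLock {
    std::atomic<std::thread::id> owner{std::thread::id{}};  // default id = unowned
    int recursion = 0;                                      // touched only by the owner
    int spinMax = 100;
    std::mutex m;                                           // slow path only
    std::condition_variable cv;

    bool tryAcquire(std::thread::id self) {
        std::thread::id none{};
        return owner.compare_exchange_strong(none, self);
    }

public:
    void lock() {
        const std::thread::id self = std::this_thread::get_id();
        if (owner.load() == self) { ++recursion; return; }  // step 1
        if (tryAcquire(self)) { recursion = 1; return; }    // step 2
        for (int i = 0; i < spinMax; ++i)                   // step 3
            if (tryAcquire(self)) { recursion = 1; return; }
        std::unique_lock<std::mutex> ul(m);                 // step 4: block
        cv.wait(ul, [&] { return tryAcquire(self); });
        recursion = 1;
    }

    void unlock() {
        if (--recursion > 0) return;  // still held recursively
        {
            std::lock_guard<std::mutex> g(m);  // close the lost-wakeup race
            owner.store(std::thread::id{});
        }
        cv.notify_one();  // wake one waiter, if any
    }
};
```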

In this way, we achieve the highest possible performance: lock as fast as possible when there's no collision, and suspend for as long as it takes when the critical section is busy.

Note that implementing this is not as trivial as it may seem. For example, when unlocking, you must somehow know that someone is already waiting, in order to signal the kernel object for them. Also, depending on how exactly you implement it, a successful return from WaitForSingleObject does not yet mean that the critical section is acquired; you may need to repeat all the steps again.

There's also a choice here of performance vs. fairness: suppose there are threads currently suspended and the critical section is eventually unlocked; but before the system schedules one of those threads, another thread arrives. Should it acquire the lock immediately (performance), or wait like all the others (fairness)?

In the Windows API, the answer to this question is performance (as experiments show). That's logical, in my opinion: the whole idea of critical sections is to achieve maximal performance. Also, the WaitXXXX functions do not guarantee absolute fairness anyway.

Also, in Windows, the kernel waitable object is created only upon the first collision, not right at initialization.

Sounds good so far. So, what can be optimized then?

Optimizations

Let's now see in-depth what we have already.

First of all, the Windows interlocked functions are very fast. Furthermore, the Microsoft VC++ compiler supports intrinsic interlocked functions (see the underscored versions, _InterlockedXXXX, in MSDN). I could not squeeze out any more, even at the assembler level.

So, what's wrong with the standard Windows critical sections?

Let's look at the definition of the CRITICAL_SECTION C/C++ structure.
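
It is defined in winnt.h as follows (in this sketch, the Windows SDK typedefs are replaced with portable stand-ins so the snippet compiles anywhere; the member layout is as in the SDK):

```cpp
#include <cstdint>
#include <cassert>

// Stand-ins for the Windows SDK types.
typedef int32_t   LONG;
typedef void*     HANDLE;
typedef uintptr_t ULONG_PTR;
struct RTL_CRITICAL_SECTION_DEBUG;  // the extra heap-allocated structure

typedef struct _RTL_CRITICAL_SECTION {
    RTL_CRITICAL_SECTION_DEBUG* DebugInfo;
    LONG      LockCount;
    LONG      RecursionCount;
    HANDLE    OwningThread;   // actually holds a thread ID, not a handle
    HANDLE    LockSemaphore;  // in practice an auto-reset event
    ULONG_PTR SpinCount;
} RTL_CRITICAL_SECTION, CRITICAL_SECTION;
```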

Some explanation of the members: LockCount is the operand of the interlocked operations; OwningThread is the owning thread's ID (although declared as HANDLE, it is actually a DWORD); RecursionCount is how many times the owning thread has acquired the critical section; SpinCount is the number of additional attempts to make before going into the suspended state; and LockSemaphore is the kernel object handle (despite the name, it appears to be an auto-reset event rather than a semaphore).

In addition, there's a DebugInfo member, and that's our first problem. When you call InitializeCriticalSection, besides initializing the members of the structure, the Windows API also allocates an additional structure and stores a pointer to it in the DebugInfo member. In this way, Windows tracks all your critical sections. What is this for? It is said that this lets Windows automatically detect deadlocks in the program. In my opinion - a stupid idea. At the very least, there should be an option to initialize a critical section without this DebugInfo.

First of all, I've never seen this deadlock detection mechanism work, even when a deadlock really happened (does it need to be enabled somehow?). Also, tracking the critical sections alone does not give you deadlock detection, since there may be other synchronization objects involved; if so, deadlock detection should rather be implemented in kernel mode. And there's no ultimate solution anyway: an application may also use infinite spin-locks, which is not forbidden. In my opinion, deadlocks should be prevented from the start by designing the application correctly; detection is something you may need at debug time only.

The CRITICAL_SECTION structure consists of six 32-bit members, but because of DebugInfo, the real memory consumption is more than 24 bytes. RTL_CRITICAL_SECTION_DEBUG takes another 32 bytes, plus it is apparently allocated on the heap, which enlarges the footprint even further. From my experiments, it's about 100 bytes in total.

Also, creating/destroying such critical sections is not free: allocating/freeing memory on the heap takes time, and the heap itself needs additional synchronization.

This impact is insignificant for applications that work with some constant set of critical sections, but in some cases, you need to create/destroy them on-the-fly. That's where we have a performance hit.

Next, I noticed that calling EnterCriticalSection leads to a single interlocked instruction when there's no collision, and this is OK. What was a bad surprise for me was that calling LeaveCriticalSection also involves an interlocked instruction. This is a pity, because it is possible to implement a critical section so that unlock will not use the lock prefix at all.

I believe this is so for historical reasons, and no one has fixed it since. Early 386 processors had no cmpxchg instruction; you could only use inc and dec with the lock prefix. There was no way to atomically check a variable and modify it only if it equaled something, nor to retrieve its previous value (as xadd offers). All you could learn was whether, upon inc/dec, the variable became negative, positive, or zero, plus its parity (by testing the CPU flags). Note also that InitializeCriticalSection sets the LockCount member to -1. Why? Because executing lock inc on it sets the ZR CPU flag to 1 for the first caller; all others get 0.
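
The trick can be sketched with a portable stand-in (fetch_add models lock inc here; the real 386 exposed only the CPU flags, not the previous value):

```cpp
#include <atomic>
#include <cassert>

// 386-era trick: LockCount starts at -1, and locking is a single atomic
// increment. Only the caller that moves the counter to exactly zero
// (the ZR flag after lock inc) has acquired the lock.
std::atomic<long> LockCount{-1};

bool try_enter_386_style() {
    return LockCount.fetch_add(1) == -1;  // previous -1 => we hit 0 first
}

void leave_386_style() {
    LockCount.fetch_sub(1);  // back toward -1 once everyone is out
}
```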

But the really bad surprise was noticing that even a recursive lock/unlock of a critical section involves interlocked instructions. What for? I could sincerely forgive all of the above, but this barbarity crosses every line. The only explanation I can imagine is that critical sections were designed so that you can enter one in one thread and leave it in another. But what kind of programmer would do such a thing? Critical sections (and mutexes) are specifically designed to be per-thread locking objects.

Plus, there's no timeout option: you can either wait forever, or try to lock without waiting at all (TryEnterCriticalSection). Why not implement a timeout too, if it's not too complex?

Alternative implementation

With all of the above in mind, I've decided to write my own critical section. I won't get too deep into the code discussion; we've already covered the ideas at length, and it's possible (though a bit tricky) to figure out how the code works. Just a note: implementing the timeout (without performance degradation) was not as trivial as it may seem.

Let's just point to some key features of it.

A critical section is encapsulated in a CritSectEx C++ object. On construction, you may set its maximal spin count (like InitializeCriticalSectionAndSpinCount). You can also change it later via the SetSpinMax method; this is absolutely safe, even in the middle of use.

As mentioned, the kernel synchronization object is normally created silently upon the first collision, but you can also pre-create it up front by calling AllocateKernelSemaphore. (Windows offers this option too, hence I decided to provide it as well.) Again, it's safe to call during use.

You can use this critical section object in a traditional way: call Lock and Unlock on it; however, I wouldn't recommend that. Instead, it is better to use a Scope locking technique, something like this:

// in some place you want to use a critical section declared as myCs
{
CritSectEx::Scope scope(myCs);
// access the shared resource
// ...
}
// Here the critical section has been unlocked

In my opinion, this should be the correct way to work with critical sections. Cases where lock and unlock must be separated are pretty rare, and even there, the Scope gives you flexibility: you may call Lock and Unlock explicitly on the Scope variable. For example, say you want to lock a CS in a function and then unlock it somewhere after the function returns. This can be done in the following way:
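
Since the article's CritSectEx code isn't reproduced here, the following hypothetical sketch uses std::mutex and a minimal Scope stand-in to show the pattern; the real CritSectEx::Scope likewise offers explicit Lock/Unlock:

```cpp
#include <mutex>
#include <cassert>

// Minimal stand-in for CritSectEx::Scope over a std::mutex.
class Scope {
    std::mutex* m_cs = nullptr;
public:
    Scope() = default;
    explicit Scope(std::mutex& cs) { Lock(cs); }
    ~Scope() { Unlock(); }  // the compiler guarantees this runs
    void Lock(std::mutex& cs) { Unlock(); cs.lock(); m_cs = &cs; }
    void Unlock() { if (m_cs) { m_cs->unlock(); m_cs = nullptr; } }
};

std::mutex myCs;  // stand-in for the CritSectEx object

void beginWork(Scope& scope) {
    scope.Lock(myCs);  // lock inside the function...
    // ... prepare the shared resource ...
}  // ...the Scope outlives this call, so the lock is still held

void caller() {
    Scope scope;       // declared in the outer scope, initially unlocked
    beginWork(scope);
    // ... still locked here; use the shared resource ...
    scope.Unlock();    // or simply let the destructor do it
}
```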

The idea is to avoid situations where you 'forget' to unlock the CS. Why not make the C++ compiler do this job? It is responsible for tracking the lifetime of C++ objects and calling their destructors, when necessary. Hence, we only need to ensure the unlocking in the destructor, and make the lifetime of the Scope appropriate.

Anyway, you can also use explicit Lock and Unlock. However, it's a bit more involved than with standard critical sections: upon return, Lock gives you an extra value that you must later pass to Unlock. This is because I've omitted one member from my implementation, something similar to RecursionCount in the standard critical sections; the trade-off is justifiable.

Here are some numbers comparing the performance of my critical sections with the standard Windows ones. The results are reasonably accurate, though every test in a multi-threaded environment depends on dozens of miscellaneous factors. Tests were performed on Intel® Pentium® 4 and Intel® Core™ Duo processors, under two operating systems: Windows Server 2003 Standard Edition and Windows XP Professional.

These are the results for the standard critical sections:

OS            CPU           Init + Uninit  Lock + Unlock  Recursive Lock  Memory
                            (cycles)       (cycles)       + Unlock        consumption
                                                          (cycles)        (bytes)
------------  ------------  -------------  -------------  --------------  -----------
Server 2003   P4 D                    977            250             138          100
Professional  P4 (earlier)          7660            401             388          100
Professional  Duo                   6672             85              90          100

First of all, there's a clear difference between the two operating systems. On Windows XP Professional, init + uninit is extremely slow. Well, this has an explanation: on Windows XP, heap functions are implemented in the kernel mode, yet another barbarity. Besides, on Server 2003, recursive lock + unlock is optimized (a single interlocked operation).

We also see a difference between the processors. The Intel® Core™ Duo is really impressive: despite a somewhat lower clock speed, it is a real monster (in fact, two monsters). Interlocked operations are greatly optimized, which is important for processor cooperation. (This is not an Intel advertisement; in fact, I haven't tested on AMD processors, and maybe they perform even better.) Kernel-mode transitions, however, are still slow.

Now, let's see our critical sections in action:

OS            CPU           Init + Uninit  Lock + Unlock  Recursive Lock  Memory
                            (cycles)       (cycles)       + Unlock        consumption
                                                          (cycles)        (bytes)
------------  ------------  -------------  -------------  --------------  -----------
Server 2003   P4 D                      5            116              16           16
Professional  P4 (earlier)             8            198              15           16
Professional  Duo                      5             45               1           16

There's no significant difference between the operating systems, which is expected. The only costly operation is the non-recursive lock + unlock, which is about twice as fast as what the standard critical sections offer. All other operations are negligible in comparison.

And again, we see a clear advantage for the Intel® Core™ Duo. It may seem that with such a processor our optimization becomes less significant, but that's not accurate: yes, it executes interlocked operations in fewer cycles, but it probably executes many other instructions faster too, so the relative impact on performance may be nearly the same.

Conclusion

Some will probably laugh: "What is this for?", "You will never see the difference", "Better to use something standard", "Yet another bicycle inventor". I disagree, and there's no point arguing about it. Take into account that modern processors execute several instructions per cycle, and you'll see that an interlocked operation may cost about a thousand (!) regular instructions. When you write a high-performance application that should squeeze the maximum from the computer, you have to care about letting the CPU give its maximum. In particular, it is sometimes worth using a per-thread heap (without serialization), or even considering non-thread-safe reference counters for objects, unless they are actually referenced from different threads. Some believe that merely writing a program in a multithreaded way makes it faster. Not true, my friend: improper design kills performance totally, and the program will run even slower than if it were single-threaded.

Believe it or not, I once managed to make a server 30 times faster just by removing/caching heap allocations, omitting kernel-mode transitions, etc. No one believed me until I demonstrated the numbers. That's the sad truth: our computers are super fast; unfortunately, we don't usually use them correctly.

Back to our critical sections. Well, not a breakthrough in most of the cases, I admit. But:

There're situations where the impact may be significant.

There's nothing to lose anyway; we have no disadvantages compared to standard critical sections.

Timeout ability.

Well, some will probably value the timeout (except those who have already decided to work with mutexes because standard critical sections lack a timeout, and who see no significant benefit in critical sections).

I will appreciate comments, both positive and negative. But, I say it again: please don't try to convince me that there's no significant performance difference.


About the Author

My name is Vladislav Gelfer. I was born in Kiev (former Soviet Union); since 1993 I have lived in Israel.
In programming, I'm interested mostly in low-level work, OOP design, DSP, and multimedia.
Besides programming, I like physics, math, and digital photography.

Comments and Discussions

In my case, I would rarely use the timeout feature, so I decided to modify the original class to remove it.
It was a fairly easy job, and I came up with the following code for CritSectEx::Lock. I intentionally expanded all internal function calls so that I could clearly see the flow of Lock in one place. After full expansion, I rearranged some lines and removed some duplicates. Now it is quite straightforward and easy to follow how Lock works.

I see two implementations.
[1] Can someone explain when to use CritSectRec instead of CritSectEx?
[2] Do we really need CritSectRec after all?

While waiting for someone to answer my questions, I dug into the source code.
Here are my own answers, whether or not they are correct.
CritSectEx does not allow recursive Lock/Unlock pair calls, while CritSectRec does.
So, unless I am 100% sure that no recursive Lock/Unlock is made, I should use CritSectRec, not CritSectEx.

Well, after more thinking, both CritSectEx and CritSectRec allow recursive Lock/Unlock. The only difference is that the latter keeps a count of the recursion.

I understood your scenario. But it's OK. There is absolutely no problem if m_nWaiters becomes negative and an extra ReleaseSemaphore call is made.

BTW this may also happen without the intervention of Thread 3. For instance imagine Thread 1 gets suspended exactly where you suggested, meanwhile Thread 2 gets "tired" waiting (timeout), and returns from the PerfLockKernel function. At this point m_nWaiters is already 0. Then Thread 1 resumes, decrements the m_nWaiters and calls ReleaseSemaphore.

So you are left with an unlocked critical section, a negative m_nWaiters, and an extra charge on the semaphore. Let's see what happens then.

Since the critical section is unlocked, the next thread that tries to lock it will succeed, without even looking at m_nWaiters or the semaphore. During the unlock it will not release the semaphore, since m_nWaiters is not positive. So everything will work normally (without any performance hit) until the next synchronization collision occurs.

Now let's see what happens during a synchronization collision. A particular thread enters PerfLockKernel. It eventually calls WaiterPlus, and then calls WaitForSingleObject, which returns immediately. At this point m_nWaiters becomes 0, and the extra semaphore charge is consumed; by this, the normal situation is restored. On the next loop iteration the locker will call WaiterPlus + WaitForSingleObject again, and the latter won't return now until either the critical section is unlocked or the timeout expires.

In conclusion: this situation is normal. It may happen that an extra semaphore release occurs and m_nWaiters goes negative, but the only consequence is a small performance hit: every call to ReleaseSemaphore is expensive (it's a kernel-mode function), plus every such "unneeded" semaphore charge will be consumed later by WaitForSingleObject (also a kernel-mode function) on the next synchronization collision.

Since those situations are rare - the performance hit should be minor. Our critical sections are optimized mostly for situations where there are no collisions (otherwise one should just use a standard mutex).

OTOH, imagine the reverse situation: under some sophisticated scenario, the unlocker fails to release the semaphore when needed. Then you are left with a positive m_nWaiters and a thread waiting on a semaphore that may never be released. That situation would be problematic. And, if everything is designed correctly, that situation is impossible.

Starvation is prevented, at the expense of a low probability of overfeeding + overeating.

I could be misunderstanding a little bit what's going on in PerfLockKernel, but I think I see a couple of issues, and I am wondering if they could potentially be causing the deadlocks others are seeing (or at least lead to an inconsistent state that could cause a deadlock down the road).

Let's take this scenario here:

Pre-condition: Thread 1 has the lock and there are currently no threads waiting.
1. Thread 2 comes along and the lock is locked, so Thread 2 executes PerfLockKernel -> bWaiter is false the first time through the loop, so WaiterPlus occurs and Thread 2 starts to wait for the semaphore.
2. Thread 1 now releases the lock and signals the semaphore (because m_nWaiters is currently 1).
3. Thread 2 is released because the semaphore was signaled; WaitForSingleObject returns WAIT_OBJECT_0, thus bWaiter remains false, and we go to the top of the loop where we do WaiterPlus again (leading to m_nWaiters being 2 at this point). The simple case at this point is that we execute PerfLockImmediate and it returns true, which causes PerfLockKernel to return true. Control returns to PerfLock where we do WaiterMinus, and now Thread 2 owns the lock.

You'll notice that at this point we have a single thread which owns the lock and no waiters, but m_nWaiters does not reflect this: its value is 1.

You can extrapolate this scenario out a little to when we have several threads and Thread 2 above does not receive the lock when it loops around to call PerfLockImmediate (one of the other threads wins the race), which causes Thread 2 to go back to wait on the semaphore. This could happen indefinitely and lead m_nWaiters to be way out of whack.

I believe that if this scenario were to occur, it would lead to output similar to the reported deadlocks where m_nLocker is 0 and m_nWaiters is >0, though in this scenario there is not a true deadlock.

I still think I might be misunderstanding this code a little bit as it looks like this was done intentionally, but my thought is that WaiterPlus should only be called once at the beginning of PerfLockKernel.

The other thing I noticed (it's more of a question as to why it was done this way) is why do we bother to loop around when WaitForSingleObject returns WAIT_TIMEOUT; is it not just sufficient to return false at this point (and maybe more "accurate")?

Other than these observations I'm very happy with this code and am trying to adapt it into an application that I work with.

There's one point you've missed: during stage (2), Thread 1 releases the lock and immediately calls WaiterMinus. This makes sure m_nWaiters isn't incremented each time.

The actual meaning of m_nWaiters is not how many threads are waiting (or going to wait) on the semaphore. It's the difference between the number of threads that are going to call WaitForSingleObject (or have already called it) and the charge count of the semaphore. This is achieved by the following policy:

1. Every thread that's going to call WaitForSingleObject first calls WaiterPlus.
2. If WaitForSingleObject has received the semaphore (i.e., returned WAIT_OBJECT_0) - the thread calls WaiterPlus again on the next loop iteration.
3. After the loop, the locking thread calls WaiterMinus (regardless of whether the lock was acquired or not).
4. If the unlocking thread charges the semaphore - it calls WaiterMinus.

So m_nWaiters is not just a hackish deadlock preventer. It has a strict definition (which is obeyed).
There may be a temporary state where m_nWaiters is bigger than the number of threads suspended within WaitForSingleObject. But this is OK. However, we guarantee that m_nWaiters is never ever smaller than the number of threads suspended within WaitForSingleObject.

This guarantees we always release the semaphore if there is even the tiniest chance that a thread is waiting.
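To make the policy concrete, here is a simplified, portable sketch of how the four rules above might fit together. The names (m_nLocker, m_nWaiters, PerfLockImmediate, WaiterPlus/WaiterMinus as fetch_add/fetch_sub) mirror the discussion, but the bodies are my own analogue, not the real CritSectEx code; a condition-variable-based semaphore stands in for the Win32 one, and the spin and timeout logic are omitted:

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <thread>

// Minimal stand-in for the Win32 semaphore used by CritSectEx.
class Semaphore {
    std::mutex m; std::condition_variable cv; long count = 0;
public:
    void Release() { std::lock_guard<std::mutex> g(m); ++count; cv.notify_one(); }
    void Wait()    { std::unique_lock<std::mutex> g(m); cv.wait(g, [&]{ return count > 0; }); --count; }
};

struct DemoLock {
    std::atomic<long> m_nLocker{-1};  // -1 == free, otherwise the owner's id
    std::atomic<long> m_nWaiters{0};  // (threads at/heading to Wait) minus (semaphore charge)
    Semaphore m_sem;

    bool PerfLockImmediate(long id) {
        long freeVal = -1;
        return m_nLocker.compare_exchange_strong(freeVal, id);
    }
    void Lock(long id) {
        if (PerfLockImmediate(id)) return;
        for (;;) {                      // analogue of PerfLockKernel
            m_nWaiters.fetch_add(1);    // rules 1 and 2: WaiterPlus before each check
            if (PerfLockImmediate(id)) break;
            m_sem.Wait();               // WaitForSingleObject analogue
        }
        m_nWaiters.fetch_sub(1);        // rule 3: WaiterMinus after the loop
    }
    void Unlock() {
        m_nLocker.store(-1);
        if (m_nWaiters.load() > 0) {    // someone waits, or is about to
            m_nWaiters.fetch_sub(1);    // rule 4: WaiterMinus by the unlocker...
            m_sem.Release();            // ...paired with charging the semaphore
        }
    }
};
```

Note that m_nWaiters can transiently exceed (but, per the invariant above, never undershoot) the number of threads actually suspended, which is exactly the behaviour the first comment observed.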

Your other note - you are right, we could immediately return false if WAIT_TIMEOUT is returned. This is, however, only a (minor) performance consideration.

P.S. Actually, I have no idea why many users complain about deadlocks. I haven't seen problems for a while. I suspect that at least some of them are caused by improper use...

Aha, yes, I did indeed miss that, and it makes sense now. My next question, then: I don't quite understand why we must do WaiterPlus after we receive WAIT_OBJECT_0 when immediately after that (in PerfLock) we do WaiterMinus. Could both of these not be eliminated?

One more point that you may have missed: if the locking thread has received the semaphore (got WAIT_OBJECT_0), this does NOT mean it has ownership of the critical section.

There is only one way to take ownership of the critical section: modify the m_nLocker variable (by an interlocked operation). And this is done only by the PerfLockImmediate function.

Now, look carefully at the PerfLockKernel function. After receiving WAIT_OBJECT_0, the locking thread does not return from the function. Instead it goes to another loop iteration, and then calls PerfLockImmediate again.
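The ownership rule can be shown in a minimal sketch (my own portable analogue; the real code uses an interlocked operation such as InterlockedCompareExchange on a Windows LONG, replaced here by std::atomic):

```cpp
#include <atomic>

// -1 means "free"; any other value is the owner's thread id.
std::atomic<long> m_nLocker{-1};

// Hypothetical stand-in for the PerfLockImmediate discussed above:
// ownership changes hands ONLY through this compare-and-swap. Merely
// waking from the semaphore never grants the lock by itself.
bool PerfLockImmediate(long threadId) {
    long expected = -1;
    return m_nLocker.compare_exchange_strong(expected, threadId);
}
```

A woken waiter must still win this compare-and-swap; if another thread got there first, the waiter simply loops and waits again.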

It's also very important that WaiterPlus is called before PerfLockImmediate. Otherwise, imagine the following scenario:

1. Thread 1 has the lock.
2. Thread 2 has just entered PerfLockKernel function.
3. On the first loop iteration it calls PerfLockImmediate, and it does not succeed.
4. Meanwhile Thread 1 is releasing the lock.
5. Since m_nWaiters is still zero - it does not release the semaphore.
6. Meanwhile Thread 2 calls WaitForSingleObject. And goes to sleep. Forever.

To avoid such a situation, after a call to WaiterPlus the locking thread must check whether the critical section is already free. Only then may it call WaitForSingleObject.

You are right, this leads to a redundant WaiterPlus/WaiterMinus pair. But this is not too significant compared to the other costs involved during a collision (such as the kernel-mode transitions involved in ReleaseSemaphore and WaitForSingleObject).

I'm discovering your critical section implementation, and so far I like how it works. I use a few CritSectEx instances to act as mutexes, among other things to prevent concurrent access to a USB driver. A doubt about my own code made me write a very simple test app in which the main thread and a background thread attempt to get a CritSectEx::Scope on a single, global CritSectEx instance, then Sleep for a fixed duration (different in the main and background loops), and release the Scope.

I was a bit amazed to see that on a multicore machine I need to "waste some time" after coming out of a Scope before obtaining a new one if I want to let the other thread get the lock. I would have expected that Windows (XP, 64-bit) would give the lock to the already-waiting thread at least once in a while...

The mystery concerns a small modification I made. I added a volatile m_bIsLocked bool member to CritSectEx so that one can see whether the instance is locked before obtaining a Scope. This member is set in PerfLockImmediate() or PerfLock() (with the return value from PerfLockKernel()) and unset in PerfUnlock().

I've been looking at this a bit more, adding "API-compatible" wrapper classes around the standard CriticalSection and Mutex objects in order to compare their behaviour with CritSectEx. It turns out a call to Sleep (or even querying the performance counter?!) can also provoke an early unlock.

An interesting observation from the debug output of the code outlined above (812 is the background thread in this case):

This shows the foreground thread unlocking the background thread (I'm using a standard Mutex in this case). This happens about as soon as the background thread enters the Sleep function. The unlocking messages come from the PerfUnlock method.

Curiously, that "cross-thread" PerfUnlock does not actually release the Mutex, as can be seen from the output immediately following the lines above:

First, thanks for the high-performance lock you've provided here. However, when I replaced all my Windows critical sections with the implementation here, after a period of time it deadlocked in WaitForSingleObject(m_hSemaphore, dwWait). The computer we used has a quad-core CPU, with about 4 to 5 threads running. I don't know why, but it never happened when using the Windows critical section functions. Any ideas?

I appreciate your answer. However, we just took CritSectRec to replace our default critical sections. The information may not be enough to figure out where the problem is, but I cannot give you more detail, since we haven't determined which threads caused the deadlock, and it does not always happen. There must be something wrong, since the default critical sections never caused deadlocks (except when it really was a deadlock). If there is any new information, I'll post here again. Best regards.

Note in particular that it uses a simple "cmp dword ptr [edx+4], 0FFFFFFFFh" to check whether the section became available, at least as a first pass. Compare that to the _InterlockedCompareExchange your spin loop uses; it's much faster... of course, speed just doesn't matter in a spin loop, since the point is to waste time.

But I do wonder about the resources used in wasting time. Both implementations issue a pause instruction, but that frequent, repeated _InterlockedCompareExchange call looks like something that might take up bus cycles or some other resource on some systems. I don't really know much about this stuff; possibly Microsoft's implementation there is only correct on a subset of systems, or only on x86, and thus not generally usable without #ifdefs or runtime checks.

Interesting point, indeed. I think the correct behavior here is determined by what the "spinning" means in the specific context.

The spin cycle that I used tries to lock the cs. When it succeeds - it immediately returns without any further actions.
The spin cycle that you suggest just checks whether the cs may be locked. When this condition is satisfied ([edx+4] differs from -1) - it just means that the cs is probably free, and you should then try to lock it.

Is this method better? Well, it seems so, but I can't tell for sure. Its advantage is that we produce less spam on the bus while waiting.
However, in some situations this may be problematic: the value that you see in m_nLocker when accessed without lock semantics is usually outdated. By the time you eventually see the new value - the cs may already be locked again.

As a result, you may fail to try locking the cs when it's actually available, and try to lock it when it's busy again. IMHO, such a scenario is typical for extra-short-duration cs.
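For what it's worth, the "check first, then do the interlocked operation" idea can be sketched as a test-and-test-and-set spin. This is my own sketch, not the CritSectEx or Microsoft code; _mm_pause stands in for the pause instruction on x86, with a portable fallback elsewhere:

```cpp
#include <atomic>
#include <thread>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <emmintrin.h>
#define CPU_PAUSE() _mm_pause()
#else
#define CPU_PAUSE() std::this_thread::yield()
#endif

std::atomic<long> m_nLocker{-1};  // -1 == free

// Spin on a plain (cheap) read; attempt the expensive interlocked
// compare-and-swap only when the section looks free. This keeps the
// bus quieter than hammering the CAS on every iteration.
bool SpinLockTTAS(long threadId, int maxSpins) {
    for (int i = 0; i < maxSpins; ++i) {
        if (m_nLocker.load(std::memory_order_relaxed) == -1) {
            long expected = -1;
            if (m_nLocker.compare_exchange_weak(expected, threadId))
                return true;  // acquired the cs
        }
        CPU_PAUSE();
    }
    return false;  // give up; the caller would fall back to the kernel wait
}
```

As noted above, the plain read may be stale, so the CAS can still fail and the loop simply continues - correctness is unaffected either way; only the spin traffic differs.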

I'll run some tests to check this.
Thanks again for the interesting point.