Assume these threads run on a single-core CPU, and that the CPU runs only one instruction per cycle. That is to say, even though the threads share the CPU, the computer ensures that only one instruction executes at a time. So are locks unnecessary for multithreading?

7 Answers

Suppose we have a simple task that we want to perform multiple times in parallel, and we want to keep track globally of the number of times that the task has been performed, for example, counting hits on a web page.

When each thread gets to the point at which it's incrementing the count, its execution will look like this:

1. Read the number of hits from memory into a processor register.

2. Increment that number.

3. Write that number back to memory.

Remember that every thread can suspend at any point in this process. So if thread A performs step 1 and then gets suspended, followed by thread B performing all three steps, then when thread A resumes, its registers will hold the wrong number of hits: its registers will be restored, it will happily increment the old number of hits, and store that incremented number.

In addition, any number of other threads could have run during the time that thread A was suspended, so the count thread A writes at the end might be well below the correct count.

For that reason, it's necessary to ensure that if a thread performs step 1, it must perform step 3 before any other thread is allowed to perform step 1, which can be accomplished by all threads waiting to get a single lock before they begin this process, and freeing the lock only after the process is complete, so that this "critical section" of code cannot be incorrectly interleaved, resulting in a wrong count.
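As a concrete sketch of that critical section in Java (the class and field names here are illustrative, not from any particular library), a `synchronized` block serves as the lock:

```java
public class LockedCounter {
    private int hits = 0;
    private final Object lock = new Object();

    // The whole read-increment-write critical section is guarded by a
    // single lock, so the three steps can never interleave across threads.
    void increment() {
        synchronized (lock) {
            hits++;   // read, increment, write, all while holding the lock
        }
    }

    int get() {
        synchronized (lock) {
            return hits;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        LockedCounter counter = new LockedCounter();
        Thread[] threads = new Thread[4];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> {
                for (int j = 0; j < 10_000; j++) counter.increment();
            });
            threads[i].start();
        }
        for (Thread t : threads) t.join();
        System.out.println(counter.get()); // always 40000; never undercounts
    }
}
```

Any thread suspended inside the `synchronized` block still holds the lock, so no other thread can begin step 1 until it finishes step 3.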

But what if the operation were atomic?

Yes, in the land of magical unicorns and rainbows, where the increment operation is atomic, then locking would not be necessary for the above example.

It's important to realize, however, that we spend very little time in the world of magical unicorns and rainbows. In almost every programming language, the increment operation is broken down into the above three steps. That's because, even if the processor supports an atomic increment operation, that operation is significantly more expensive: it has to read from memory, modify the number, and write it back to memory...and usually the atomic increment operation is an operation that can fail, meaning the simple sequence above has to be replaced with a loop (as we'll see below).

Since, even in multithreaded code, many variables are kept local to a single thread, programs are much more efficient if they assume each variable is local to a single thread, and let the programmers take care of protecting shared state between threads. Especially given that atomic operations are not usually enough to solve threading issues, as we'll see later.

Volatile variables

If we wanted to avoid locks for this particular problem, we first have to realize that the steps depicted in our first example aren't actually what happens in modern compiled code. Because compilers assume only one thread is modifying the variable, each thread will keep its own cached copy of the variable, until the processor register is needed for something else. As long as it has the cached copy, it assumes it doesn't need to go back to memory and read it again (which would be expensive). Nor will it write the variable back to memory as long as the value is kept in a register.

We can get back to the situation we gave in the first example (with all the same threading problems we identified above) by marking the variable as volatile, which tells the compiler that this variable is being modified by others, and so must be read from or written to memory whenever it's accessed or modified.

So a variable marked as volatile will not take us to the land of atomic increment operations; it only gets us back to the situation we thought we were in all along.
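To make that concrete in Java (where `volatile` is a field modifier; the class name here is mine), marking the counter volatile restores the three-step read/increment/write behavior but makes none of it atomic:

```java
public class HitCounter {
    // volatile forces every read and write of count to go through main
    // memory instead of a cached register copy -- but count++ still
    // compiles to three separate steps (read, add, write), so two
    // threads incrementing concurrently can still lose updates.
    volatile int count = 0;

    void hit() { count++; }  // NOT atomic, even though count is volatile

    public static void main(String[] args) {
        HitCounter counter = new HitCounter();
        counter.hit();
        counter.hit();
        System.out.println(counter.count); // 2 in a single thread,
        // but concurrent callers of hit() could still interleave and
        // produce an undercount
    }
}
```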

Making the increment atomic

Once we're using a volatile variable, we can make our increment operation atomic by using a low-level conditional set operation that most modern CPUs support (often called compare and set or compare and swap). This approach is taken, for example, in Java's AtomicInteger class:
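A runnable sketch of that loop (older JDKs implemented `AtomicInteger.incrementAndGet()` essentially this way on a volatile int; current JDKs delegate to intrinsics, but the logic is the same -- here we borrow `compareAndSet` from `AtomicInteger` itself so the example is self-contained):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CasIncrement {
    // The retry loop at the heart of an atomic increment.
    static int incrementAndGet(AtomicInteger counter) {
        for (;;) {
            int current = counter.get();               // step 1: volatile read from memory
            int next = current + 1;                    // increment a local copy
            if (counter.compareAndSet(current, next))  // step 2: atomic compare-and-set
                return next;                           // no other thread interfered
            // CAS failed: another thread changed the value; loop and retry
        }
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger hits = new AtomicInteger(0);
        Thread[] threads = new Thread[4];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> {
                for (int j = 0; j < 10_000; j++) incrementAndGet(hits);
            });
            threads[i].start();
        }
        for (Thread t : threads) t.join();
        System.out.println(hits.get()); // always 40000, with no locks
    }
}
```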

The above loop repeatedly performs the following steps, until step 2 succeeds:

1. Read the value of the volatile variable directly from memory.

2. Change the value (in memory) if and only if its current value (in memory) is the same as the value we initially read, using a special atomic operation.

If step 2 fails (because the value was changed by a different thread after step 1), it again reads the variable directly from main memory and tries again.

While the compare-and-swap operation is expensive, it's slightly better than using locking in this case, because if a thread is suspended after step 1, other threads that reach step 1 do not have to block and wait for the first thread, which can prevent costly context switching. When the first thread resumes, it will fail in its first attempt to write the variable, but will be able to continue by re-reading the variable, which again is less expensive than the context switch that would have been necessary with locking.

So, we can get to the land of atomic increments (or other operations on a single variable) without using actual locks, via compare and swap.

So when is locking strictly necessary?

If you need to modify more than one variable in an atomic operation, then locking will be necessary; you won't find a special processor instruction for that.

As long as you're working on a single variable, and you're prepared for whatever work you've done to fail and to have to read the variable and start over again, compare-and-swap will be good enough, however.

Let's consider an example where each thread first adds 2 to the variable X, and then multiplies X by two.

If X is initially one, and two threads run, we expect the result to be (((1 + 2) * 2) + 2) * 2 = 16.

However, if the threads interleave, we could, even with all operations being atomic, instead have both additions occur first, and the multiplications come after, resulting in (1 + 2 + 2) * 2 * 2 = 20.

This happens because addition and multiplication do not commute with each other: the order in which the operations are applied changes the result.

So the operations themselves being atomic is not enough; we must make the combination of operations atomic.

We can do that either by using locking to serialize the process, or we could use one local variable to store the value of X when we started our calculation, a second local variable for the intermediate steps, and then use compare-and-swap to set a new value only if the current value of X is the same as the original value of X. If we fail, we would have to start over again by reading X and performing the calculations again.
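The compare-and-swap variant can be sketched in Java with `AtomicInteger.compareAndSet` (the method and variable names are mine):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CompoundCas {
    // Make the whole "add 2, then multiply by 2" combination atomic
    // without a lock: snapshot X, do all the arithmetic on locals, then
    // compare-and-swap the result in only if X is still unchanged.
    static void addTwoThenDouble(AtomicInteger x) {
        for (;;) {
            int original = x.get();            // local copy of X's starting value
            int result = (original + 2) * 2;   // intermediate work on locals only
            if (x.compareAndSet(original, result))
                return;                        // X was untouched; update applied
            // X changed under us: re-read it and redo the calculation
        }
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger x = new AtomicInteger(1);
        Thread a = new Thread(() -> addTwoThenDouble(x));
        Thread b = new Thread(() -> addTwoThenDouble(x));
        a.start(); b.start();
        a.join(); b.join();
        System.out.println(x.get()); // always (((1 + 2) * 2) + 2) * 2 = 16
    }
}
```

Because each thread retries until its entire calculation is applied against an unchanged X, the interleaved result of 20 from the example above can never occur.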

There are several trade-offs involved: as the calculations become longer, it becomes much more likely that the running thread will be suspended, and the value will be modified by another thread before we resume, meaning failures become much more likely, leading to wasted processor time. In the extreme case of large numbers of threads with very long running calculations, we might have 100 threads read the variable and be engaged in calculations, in which case only the first to finish will succeed in writing the new value, the other 99 will still complete their calculations, but discover upon completion that they can't update the value...at which point they'll each read the value and start the calculation over. We'd likely have the remaining 99 threads repeat the same problem, wasting vast quantities of processor time.

Full serialization via locks would be much better in that situation: 99 threads would suspend when they didn't get the lock, and we'd run each thread in order of arrival at the locking point.

Full serialization via locks would also give us advantages in cases where it's important that jobs be processed in order of arrival: if we're running a banking application, it's important that, if a deposit came in before a withdrawal, the deposit be processed first, otherwise the withdrawal, which came later and should succeed, might find insufficient funds and (incorrectly) fail.

If serialization is not critical (as in our incrementing case), and the calculations that would be lost if updating the number fails are minimal, there may be a significant advantage to be gained from using the compare-and-swap operation, because that operation is less expensive than locking.

but what if the counter increment is atomic, is the lock still necessary?
–
pythoneeNov 17 '12 at 9:37

@pythonee: if counter increment is atomic, then possibly not. But in any multithreaded program of reasonable size you will have non-atomic tasks to be done on a shared resource.
–
Doc BrownNov 17 '12 at 11:25


Unless you're using a compiler intrinsic to make the increment atomic, it probably isn't.
–
Mike LarsenNov 17 '12 at 16:33

Yes, if the read/modify(increment)/write is atomic, the lock is unnecessary, for that operation. The DEC-10 AOSE (add one and skip if result == 0) instruction was made atomic specifically so it could be used as a test-and-set semaphore. The manual mentions that it was good enough because it would take the machine several days of continuous counting to roll a 36-bit register all the way over. NOW, however, not everything you do will be "add one to memory".
–
John R. StrohmNov 18 '12 at 11:05

I've updated my answer to address some of these concerns: yes, you can make the operation atomic, but no, even on architectures that support it, it won't be atomic by default, and there are situations where atomicity isn't enough and full serialization is needed. Locking is the only mechanism I'm aware of for achieving full serialization.
–
Theodore MurdockNov 18 '12 at 15:53

A CPU runs one instruction at a time, but what if you have two or more CPUs?

You are right in that locks are not needed, if you can write the program such that it takes advantage of atomic instructions: instructions whose execution is not interruptible on the given processor, and free from interference by other processors.

Locks are required when several instructions need to be protected from interference, and there is no equivalent atomic instruction.

For instance, inserting a node into a doubly-linked list requires the update of several memory locations. Prior to the insertion, and after the insertion, certain invariants hold about the structure of the list. However, during the insertion, those invariants are temporarily broken: the list is in an "under construction" state.

If another thread marches through the list while those invariants are broken, or tries to modify the list while it is in such a state, the data structure will probably become corrupted and the behavior will be unpredictable: maybe the software will crash, or maybe it will continue with incorrect results. It is therefore necessary for threads to somehow agree to stay out of each other's way while the list is being updated.

Suitably designed lists can be manipulated with atomic instructions, so that locks are not needed. Algorithms for this are called "lock-free". However, note that atomic instructions are actually a form of locking. They are specially implemented in hardware, and work via communication among processors. They are more expensive than similar instructions which are not atomic.
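As an illustrative sketch of the idea (a minimal Treiber-style stack in Java; not production code, and the names are mine), the multi-word "insert a node" problem is reduced to a single compare-and-swap on the head pointer, so the structure is never observable in an "under construction" state:

```java
import java.util.concurrent.atomic.AtomicReference;

public class LockFreeStack<T> {
    private static final class Node<T> {
        final T value;
        Node<T> next;
        Node(T value) { this.value = value; }
    }

    private final AtomicReference<Node<T>> head = new AtomicReference<>();

    public void push(T value) {
        Node<T> node = new Node<>(value);
        for (;;) {
            Node<T> current = head.get();
            node.next = current;                    // link while the node is still private
            if (head.compareAndSet(current, node))  // publish the node atomically
                return;
            // head moved under us: retry against the new head
        }
    }

    public T pop() {
        for (;;) {
            Node<T> current = head.get();
            if (current == null) return null;       // stack is empty
            if (head.compareAndSet(current, current.next))
                return current.value;
        }
    }

    public static void main(String[] args) {
        LockFreeStack<Integer> stack = new LockFreeStack<>();
        stack.push(1);
        stack.push(2);
        System.out.println(stack.pop()); // 2
        System.out.println(stack.pop()); // 1
    }
}
```

Note that in environments without garbage collection this naive `pop` is exposed to the ABA problem; Java's GC sidesteps it here.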

On multiprocessors that lack the luxury of atomic instructions, primitives for mutual exclusion have to be built up of simple memory accesses and polling loops. Such problems have been worked on by the likes of Edsger Dijkstra and Leslie Lamport.

Some people, when confronted with a problem, think, "I know, I'll use
threads," and then two they haver poblesms

You see, even if only one instruction runs on a CPU at any given time, computer programs comprise a lot more than just atomic assembly instructions. So, for example, writing to the console (or a file) means you have to lock to ensure the output comes out the way you want.

I thought the quote was regular expressions, not threads?
–
user16764Nov 18 '12 at 2:10


The quote looks much more applicable for threads to me (with the words/characters being printed out of order due to threading issues). But there's currently an extra "s" in the output, which suggests the code has three problems.
–
Theodore MurdockNov 18 '12 at 16:03


it's a side-effect. Very occasionally you could add 1 plus 1 and get 4294967295 :)
–
gbjbaanbNov 21 '12 at 22:50

It seems many answers attempted to explain locking, but I think what the OP needs is an explanation of what multitasking actually is.

When you have more than one thread running on a system even with one CPU, there are two main methodologies that dictate how these threads will be scheduled (i.e. placed to run into your single-core CPU):

Cooperative Multitasking - Used by 16-bit Windows applications (Windows 3.x, and 16-bit apps on Win9x); each application had to explicitly give up control. In this case, you wouldn't need to worry about locking, since as long as thread A is executing some algorithm, you are guaranteed that it will never be interrupted.

Preemptive Multitasking - Used in most modern OSs (Windows 95/NT and later). This uses time slices and will interrupt threads even if they are still doing work. This is much more robust because a single thread can never hang your entire machine, which was a real possibility with cooperative multitasking. On the other hand, now you need to worry about locks, because at any given time one of your threads could be interrupted (i.e. preempted) and the OS might schedule a different thread to run. When coding multithreaded applications with this behavior, you MUST assume that between every line of code (or even every instruction) a different thread might run. Now, even with a single core, locking becomes very important to ensure a consistent state for your data.

how many instructions would setting a 32-bit integer take?
–
DXMNov 17 '12 at 7:58


Can you expand on your first statement a bit. You imply that only a bool can be atomically read/written, but that doesn't make sense. A "bool" doesn't actually exist in hardware. It is usually implemented as either a byte or a word, thus how could only bool have this property? And are you talking about loading from memory, altering, and pushing back to memory, or are you talking about at a register level? All reads/writes to registers are uninterrupted, but mem load then mem store are not (as that alone is 2 instructions, then at least 1 more to change the value).
–
CorbinNov 17 '12 at 10:48


The concept of a single instruction in a hyperthreaded/multicore/branch-predicted/multi-cached CPU is a bit tricky - but the standard says that only 'bool' needs to be safe against a context switch in the middle of a read/write of a single variable. There is a boost::Atomic which wraps a mutex around other types, and I think C++11 adds some more threading guarantees
–
Martin BeckettNov 17 '12 at 22:55

The explanation "the standard says that only 'bool' needs to be safe against a context switch in the middle of a read/write of a single variable" should really be added to the answer.
–
WolfJan 9 at 15:16

The problem does not lie with individual operations, but the larger tasks the operations carry out.

Many algorithms are written with the assumption that they are in full control of the state they operate on. With an interleaved ordered execution model like the one you describe, the operations may be arbitrarily interleaved with each other, and if they share state, there is a risk that the state is in an inconsistent shape.

You can compare it with functions that may temporarily break an invariant in order to do what they do. As long as the intermediary state is not observable from the outside, they can do whatever they want to achieve their task.

When you write concurrent code, you need to ensure that contended state is considered unsafe unless you have exclusive access to it. The common way to achieve exclusive access is synchronizing on a synchronization primitive, like holding a lock.

On some platforms, synchronization primitives have another effect: they emit memory barriers, which ensure inter-CPU consistency of memory.