        CPU 1                   CPU 2                   CPU 3
        ======================= ======================= =======================
                { X = 0, Y = 0 }
        STORE X=1               LOAD X                  STORE Y=1
                                <read barrier>          <general barrier>
                                LOAD Y                  LOAD X

This substitution destroys transitivity: in this example, it is perfectly
legal for CPU 2's load from X to return 1, its load from Y to return 0,
and CPU 3's load from X to return 0.

The key point is that although CPU 2's read barrier orders its pair
of loads, it does not guarantee to order CPU 1's store.  Therefore, if
this example runs on a system where CPUs 1 and 2 share a store buffer
or a level of cache, CPU 2 might have early access to CPU 1's writes.
General barriers are therefore required to ensure that all CPUs agree
on the combined order of CPU 1's and CPU 2's accesses.

To reiterate, if your code requires transitivity, use general barriers
throughout.
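
For comparison, the transitive version of this example - the one the
substitution above was derived from - uses general barriers on both CPU 2 and
CPU 3 (assuming, as before, initial values of { X = 0, Y = 0 }):

        CPU 1                   CPU 2                   CPU 3
        ======================= ======================= =======================
                { X = 0, Y = 0 }
        STORE X=1               LOAD X                  STORE Y=1
                                <general barrier>       <general barrier>
                                LOAD Y                  LOAD X

With general barriers throughout, if CPU 2's load from X returns 1 and its
load from Y returns 0, then CPU 3's load from X is guaranteed to return 1.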


========================
EXPLICIT KERNEL BARRIERS
========================

The Linux kernel has a variety of different barriers that act at different
levels:

 (*) Compiler barrier.

 (*) CPU memory barriers.

 (*) MMIO write barrier.


COMPILER BARRIER
----------------

The Linux kernel has an explicit compiler barrier function that prevents the
compiler from moving the memory accesses either side of it to the other side:

        barrier();

This is a general barrier - lesser varieties of compiler barrier do not exist.

The compiler barrier has no direct effect on the CPU, which may then reorder
things however it wishes.
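
As an illustration, here is a minimal sketch - the flag variable and the
busy-wait loop are invented for this example - of using barrier() to stop the
compiler from caching a value in a register across iterations:

        static int flag;

        static void wait_for_flag(void)
        {
                while (!flag) {
                        /* Without this, the compiler may hoist the load of
                         * flag out of the loop and spin on a stale register
                         * copy forever. */
                        barrier();
                }
        }

Note that this constrains only the compiler; the CPU remains free to reorder
the actual memory accesses, so the CPU memory barriers below may still be
needed.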


CPU MEMORY BARRIERS
-------------------

The Linux kernel has eight basic CPU memory barriers:

        TYPE            MANDATORY               SMP CONDITIONAL
        =============== ======================= ===========================
        GENERAL         mb()                    smp_mb()
        WRITE           wmb()                   smp_wmb()
        READ            rmb()                   smp_rmb()
        DATA DEPENDENCY read_barrier_depends()  smp_read_barrier_depends()


All memory barriers except the data dependency barriers imply a compiler
barrier.  Data dependencies do not impose any additional compiler ordering.

Aside: In the case of data dependencies, the compiler would be expected to
issue the loads in the correct order (eg. a[b] would have to load the value
of b before loading a[b]); however, the C specification does not guarantee
that the compiler will not speculate the value of b (eg. guess that it is
equal to 1) and load a[b] before loading b (eg. tmp = a[1]; if (b != 1)
tmp = a[b]; ).  There is also the problem of a compiler reloading b after
having loaded a[b], thus ending up with a newer copy of b than of a[b].  A
consensus has not yet been reached about these problems; however, the
ACCESS_ONCE macro is a good place to start looking.
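
To sketch the reloading problem in particular, ACCESS_ONCE() can be used to
force the compiler to load b exactly once and use that single snapshot (the
variable names follow the aside above; this is a suggestion, not a definitive
idiom):

        int tmp, idx;

        idx = ACCESS_ONCE(b);   /* b is loaded once, and only once */
        tmp = a[idx];           /* the array access uses that one snapshot */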

SMP memory barriers are reduced to compiler barriers on uniprocessor compiled
systems because it is assumed that a CPU will appear to be self-consistent,
and will order overlapping accesses correctly with respect to itself.

[!] Note that SMP memory barriers _must_ be used to control the ordering of
references to shared memory on SMP systems, though the use of locking instead
is sufficient.

Mandatory barriers should not be used to control SMP effects, since mandatory
barriers unnecessarily impose overhead on UP systems.  They may, however, be
used to control MMIO effects on accesses through relaxed memory I/O windows.
These are required even on non-SMP systems as they affect the order in which
memory operations appear to a device by prohibiting both the compiler and the
CPU from reordering them.


There are some more advanced barrier functions:

 (*) set_mb(var, value)

     This assigns the value to the variable and then inserts a full memory
     barrier after it, depending on the implementation.  It isn't guaranteed
     to insert anything more than a compiler barrier in a UP compilation.
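
     As a rough sketch - not a definitive expansion - a call such as the one
     the sleep/wake-up primitives make (see later) can be thought of on SMP
     as a store followed by a full barrier:

        set_mb(current->state, TASK_UNINTERRUPTIBLE);

        /* behaves approximately like: */
        current->state = TASK_UNINTERRUPTIBLE;
        smp_mb();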


 (*) smp_mb__before_atomic_dec();
 (*) smp_mb__after_atomic_dec();
 (*) smp_mb__before_atomic_inc();
 (*) smp_mb__after_atomic_inc();

     These are for use with atomic add, subtract, increment and decrement
     functions that don't return a value, especially when used for reference
     counting.  These functions do not imply memory barriers.

     As an example, consider a piece of code that marks an object as being dead
     and then decrements the object's reference count:

        obj->dead = 1;
        smp_mb__before_atomic_dec();
        atomic_dec(&obj->ref_count);

     This makes sure that the death mark on the object is perceived to be set
     *before* the reference counter is decremented.

     See Documentation/atomic_ops.txt for more information.  See the "Atomic
     operations" subsection for information on where to use these.


 (*) smp_mb__before_clear_bit(void);
 (*) smp_mb__after_clear_bit(void);

     These are used in a similar way to the atomic inc/dec barriers.  They are
     typically used for bitwise unlocking operations, so care must be taken as
     there are no implicit memory barriers here either.

     Consider implementing an unlock operation of some nature by clearing a
     locking bit.  The clear_bit() would then need to be barriered like this:

        smp_mb__before_clear_bit();
        clear_bit( ... );

     This prevents memory operations before the clear leaking to after it.  See
     the subsection on "Locking Functions" with reference to UNLOCK operation
     implications.

     See Documentation/atomic_ops.txt for more information.  See the "Atomic
     operations" subsection for information on where to use these.


MMIO WRITE BARRIER
------------------

The Linux kernel also has a special barrier for use with memory-mapped I/O
writes:

        mmiowb();

This is a variation on the mandatory write barrier that causes writes to weakly
ordered I/O regions to be partially ordered.  Its effects may go beyond the
CPU->Hardware interface and actually affect the hardware at some level.

See the subsection "Locks vs I/O accesses" for more information.


===============================
IMPLICIT KERNEL MEMORY BARRIERS
===============================

Some of the other functions in the Linux kernel imply memory barriers, amongst
which are locking and scheduling functions.

This specification is a _minimum_ guarantee; any particular architecture may
provide more substantial guarantees, but these may not be relied upon outside
of arch specific code.


LOCKING FUNCTIONS
-----------------

The Linux kernel has a number of locking constructs:

 (*) spin locks
 (*) R/W spin locks
 (*) mutexes
 (*) semaphores
 (*) R/W semaphores
 (*) RCU

In all cases there are variants on "LOCK" operations and "UNLOCK" operations
for each construct.  These operations all imply certain barriers:

 (1) LOCK operation implication:

     Memory operations issued after the LOCK will be completed after the LOCK
     operation has completed.

     Memory operations issued before the LOCK may be completed after the LOCK
     operation has completed.

 (2) UNLOCK operation implication:

     Memory operations issued before the UNLOCK will be completed before the
     UNLOCK operation has completed.

     Memory operations issued after the UNLOCK may be completed before the
     UNLOCK operation has completed.

 (3) LOCK vs LOCK implication:

     All LOCK operations issued before another LOCK operation will be completed
     before that LOCK operation.

 (4) LOCK vs UNLOCK implication:

     All LOCK operations issued before an UNLOCK operation will be completed
     before the UNLOCK operation.

     All UNLOCK operations issued before a LOCK operation will be completed
     before the LOCK operation.

 (5) Failed conditional LOCK implication:

     Certain variants of the LOCK operation may fail, either due to being
     unable to get the lock immediately, or due to receiving an unblocked
     signal whilst asleep waiting for the lock to become available.  Failed
     locks do not imply any sort of barrier.

Therefore, from (1), (2) and (4) an UNLOCK followed by an unconditional LOCK is
equivalent to a full barrier, but a LOCK followed by an UNLOCK is not.

[!] Note: one of the consequences of LOCKs and UNLOCKs being only one-way
    barriers is that the effects of instructions outside of a critical section
    may seep into the inside of the critical section.

A LOCK followed by an UNLOCK may not be assumed to be a full memory barrier
because it is possible for an access preceding the LOCK to happen after the
LOCK, and an access following the UNLOCK to happen before the UNLOCK, and the
two accesses can themselves then cross:

        *A = a;
        LOCK
        UNLOCK
        *B = b;

may occur as:

        LOCK, STORE *B, STORE *A, UNLOCK

Locks and semaphores may not provide any guarantee of ordering on UP compiled
systems, and so cannot be counted on in such a situation to actually achieve
anything at all - especially with respect to I/O accesses - unless combined
with interrupt disabling operations.

See also the section on "Inter-CPU locking barrier effects".


As an example, consider the following:

        *A = a;
        *B = b;
        LOCK
        *C = c;
        *D = d;
        UNLOCK
        *E = e;
        *F = f;

The following sequence of events is acceptable:

        LOCK, {*F,*A}, *E, {*C,*D}, *B, UNLOCK

        [+] Note that {*F,*A} indicates a combined access.

But none of the following are:

        {*F,*A}, *B, LOCK, *C, *D, UNLOCK, *E
        *A, *B, *C, LOCK, *D, UNLOCK, *E, *F
        *A, *B, LOCK, *C, UNLOCK, *D, *E, *F
        *B, LOCK, *C, *D, UNLOCK, {*F,*A}, *E


INTERRUPT DISABLING FUNCTIONS
-----------------------------

Functions that disable interrupts (LOCK equivalent) and enable interrupts
(UNLOCK equivalent) will act as compiler barriers only.  So if memory or I/O
barriers are required in such a situation, they must be provided from some
other means.
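
For example, in a sketch like the following (the shared variables are invented
for illustration), an explicit SMP barrier must still be supplied inside the
interrupt-disabled section if another CPU is to observe the stores in order:

        unsigned long flags;

        local_irq_save(flags);          /* compiler barrier only */
        shared_data = new_value;        /* hypothetical shared variable */
        smp_wmb();                      /* order the two stores for other CPUs */
        shared_flag = 1;                /* hypothetical shared flag */
        local_irq_restore(flags);       /* compiler barrier only */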


SLEEP AND WAKE-UP FUNCTIONS
---------------------------

Sleeping and waking on an event flagged in global data can be viewed as an
interaction between two pieces of data: the task state of the task waiting for
the event and the global data used to indicate the event.  To make sure that
these appear to happen in the right order, the primitives to begin the process
of going to sleep, and the primitives to initiate a wake up imply certain
barriers.

Firstly, the sleeper normally follows something like this sequence of events:

        for (;;) {
                set_current_state(TASK_UNINTERRUPTIBLE);
                if (event_indicated)
                        break;
                schedule();
        }

A general memory barrier is interpolated automatically by set_current_state()
after it has altered the task state:

        CPU 1
        ===============================
        set_current_state();
          set_mb();
            STORE current->state
            <general barrier>
        LOAD event_indicated

set_current_state() may be wrapped by:

        prepare_to_wait();
        prepare_to_wait_exclusive();

which therefore also imply a general memory barrier after setting the state.
The whole sequence above is available in various canned forms, all of which
interpolate the memory barrier in the right place:

        wait_event();
        wait_event_interruptible();
        wait_event_interruptible_exclusive();
        wait_event_interruptible_timeout();
        wait_event_killable();
        wait_event_timeout();
        wait_on_bit();
        wait_on_bit_lock();


Secondly, code that performs a wake up normally follows something like this:

        event_indicated = 1;
        wake_up(&event_wait_queue);

or:

        event_indicated = 1;
        wake_up_process(event_daemon);

A write memory barrier is implied by wake_up() and co. if and only if they wake
something up.  The barrier occurs before the task state is cleared, and so sits
between the STORE to indicate the event and the STORE to set TASK_RUNNING:

        CPU 1                           CPU 2
        =============================== ===============================
        set_current_state();            STORE event_indicated
          set_mb();                     wake_up();
            STORE current->state          <write barrier>
            <general barrier>             STORE current->state
        LOAD event_indicated

The available waker functions include:

        complete();
        wake_up();
        wake_up_all();
        wake_up_bit();
        wake_up_interruptible();
        wake_up_interruptible_all();
        wake_up_interruptible_nr();
        wake_up_interruptible_poll();
        wake_up_interruptible_sync();
        wake_up_interruptible_sync_poll();
        wake_up_locked();
        wake_up_locked_poll();
        wake_up_nr();
        wake_up_poll();
        wake_up_process();


[!] Note that the memory barriers implied by the sleeper and the waker do _not_
order multiple stores before the wake-up with respect to loads of those stored
values after the sleeper has called set_current_state().  For instance, if the
sleeper does:

        set_current_state(TASK_INTERRUPTIBLE);
        if (event_indicated)
                break;
        __set_current_state(TASK_RUNNING);
        do_something(my_data);

and the waker does:

        my_data = value;
        event_indicated = 1;
        wake_up(&event_wait_queue);

there's no guarantee that the change to event_indicated will be perceived by
the sleeper as coming after the change to my_data.  In such a circumstance, the
code on both sides must interpolate its own memory barriers between the
separate data accesses.  Thus the above sleeper ought to do:

        set_current_state(TASK_INTERRUPTIBLE);
        if (event_indicated) {
                smp_rmb();
                do_something(my_data);
        }

and the waker should do:

        my_data = value;
        smp_wmb();
        event_indicated = 1;
        wake_up(&event_wait_queue);


MISCELLANEOUS FUNCTIONS
-----------------------

Other functions that imply barriers:

 (*) schedule() and similar imply full memory barriers.


=================================
INTER-CPU LOCKING BARRIER EFFECTS
=================================

On SMP systems locking primitives give a more substantial form of barrier: one
that does affect memory access ordering on other CPUs, within the context of
conflict on any particular lock.


LOCKS VS MEMORY ACCESSES
------------------------

Consider the following: the system has a pair of spinlocks (M) and (Q), and
three CPUs; then should the following sequence of events occur:

        CPU 1                           CPU 2
        =============================== ===============================
        *A = a;                         *E = e;
        LOCK M                          LOCK Q
        *B = b;                         *F = f;
        *C = c;                         *G = g;
        UNLOCK M                        UNLOCK Q
        *D = d;                         *H = h;

Then there is no guarantee as to what order CPU 3 will see the accesses to *A
through *H occur in, other than the constraints imposed by the separate locks
on the separate CPUs.  It might, for example, see:

        *E, LOCK M, LOCK Q, *G, *C, *F, *A, *B, UNLOCK Q, *D, *H, UNLOCK M

But it won't see any of:

        *B, *C or *D preceding LOCK M
        *A, *B or *C following UNLOCK M
        *F, *G or *H preceding LOCK Q
        *E, *F or *G following UNLOCK Q


However, if the following occurs:

        CPU 1                           CPU 2
        =============================== ===============================
        *A = a;
        LOCK M          [1]
        *B = b;
        *C = c;
        UNLOCK M        [1]
        *D = d;                         *E = e;
                                        LOCK M          [2]
                                        *F = f;
                                        *G = g;
                                        UNLOCK M        [2]
                                        *H = h;

CPU 3 might see:

        *E, LOCK M [1], *C, *B, *A, UNLOCK M [1],
                LOCK M [2], *H, *F, *G, UNLOCK M [2], *D

But assuming CPU 1 gets the lock first, CPU 3 won't see any of:

        *B, *C, *D, *F, *G or *H preceding LOCK M [1]
        *A, *B or *C following UNLOCK M [1]
        *F, *G or *H preceding LOCK M [2]
        *A, *B, *C, *E, *F or *G following UNLOCK M [2]


LOCKS VS I/O ACCESSES
---------------------

Under certain circumstances (especially involving NUMA), I/O accesses within
two spinlocked sections on two different CPUs may be seen as interleaved by the
PCI bridge, because the PCI bridge does not necessarily participate in the
cache-coherence protocol, and is therefore incapable of issuing the required
read memory barriers.

For example:

        CPU 1                           CPU 2
        =============================== ===============================
        spin_lock(Q);
        writel(0, ADDR);
        writel(1, DATA);
        spin_unlock(Q);
                                        spin_lock(Q);
                                        writel(4, ADDR);
                                        writel(5, DATA);
                                        spin_unlock(Q);

may be seen by the PCI bridge as follows:

        STORE *ADDR = 0, STORE *ADDR = 4, STORE *DATA = 1, STORE *DATA = 5

which would probably cause the hardware to malfunction.


What is necessary here is to intervene with an mmiowb() before dropping the
spinlock, for example:

        CPU 1                           CPU 2
        =============================== ===============================
        spin_lock(Q);
        writel(0, ADDR);
        writel(1, DATA);
        mmiowb();
        spin_unlock(Q);
                                        spin_lock(Q);
                                        writel(4, ADDR);
                                        writel(5, DATA);
                                        mmiowb();
                                        spin_unlock(Q);

this will ensure that the two stores issued on CPU 1 appear at the PCI bridge
before either of the stores issued on CPU 2.


Furthermore, following a store by a load from the same device obviates the need
for the mmiowb(), because the load forces the store to complete before the load
is performed:

        CPU 1                           CPU 2
        =============================== ===============================
        spin_lock(Q);
        writel(0, ADDR);
        a = readl(DATA);
        spin_unlock(Q);
                                        spin_lock(Q);
                                        writel(4, ADDR);
                                        b = readl(DATA);
                                        spin_unlock(Q);


See Documentation/DocBook/deviceiobook.tmpl for more information.


=================================
WHERE ARE MEMORY BARRIERS NEEDED?
=================================

Under normal operation, memory operation reordering is generally not going to
be a problem as a single-threaded linear piece of code will still appear to
work correctly, even if it's in an SMP kernel.  There are, however, four
circumstances in which reordering definitely _could_ be a problem:

 (*) Interprocessor interaction.

 (*) Atomic operations.

 (*) Accessing devices.

 (*) Interrupts.


INTERPROCESSOR INTERACTION
--------------------------

When there's a system with more than one processor, more than one CPU in the
system may be working on the same data set at the same time.  This can cause
synchronisation problems, and the usual way of dealing with them is to use
locks.  Locks, however, are quite expensive, and so it may be preferable to
operate without the use of a lock if at all possible.  In such a case
operations that affect both CPUs may have to be carefully ordered to prevent
a malfunction.

Consider, for example, the R/W semaphore slow path.  Here a waiting process is
queued on the semaphore, by virtue of it having a piece of its stack linked to
the semaphore's list of waiting processes:

        struct rw_semaphore {
                ...
                spinlock_t lock;
                struct list_head waiters;
        };

        struct rwsem_waiter {
                struct list_head list;
                struct task_struct *task;
        };

To wake up a particular waiter, the up_read() or up_write() functions have to:

 (1) read the next pointer from this waiter's record to know where the next
     waiter record is;

 (2) read the pointer to the waiter's task structure;

 (3) clear the task pointer to tell the waiter it has been given the semaphore;

 (4) call wake_up_process() on the task; and

 (5) release the reference held on the waiter's task struct.

In other words, it has to perform this sequence of events:

        LOAD waiter->list.next;
        LOAD waiter->task;
        STORE waiter->task;
        CALL wakeup
        RELEASE task

and if any of these steps occur out of order, then the whole thing may
malfunction.

Once it has queued itself and dropped the semaphore lock, the waiter does not
get the lock again; it instead just waits for its task pointer to be cleared
before proceeding.  Since the record is on the waiter's stack, this means that
if the task pointer is cleared _before_ the next pointer in the list is read,
another CPU might start processing the waiter and might clobber the waiter's
stack before the up*() function has a chance to read the next pointer.

Consider then what might happen to the above sequence of events:

        CPU 1                           CPU 2
        =============================== ===============================
                                        down_xxx()
                                        Queue waiter
                                        Sleep
        up_yyy()
        LOAD waiter->task;
        STORE waiter->task;
                                        Woken up by other event
        <preempt>
                                        Resume processing
                                        down_xxx() returns
                                        call foo()
                                        foo() clobbers *waiter
        </preempt>
        LOAD waiter->list.next;
        --- OOPS ---

This could be dealt with using the semaphore lock, but then the down_xxx()
function has to needlessly get the spinlock again after being woken up.

The way to deal with this is to insert a general SMP memory barrier:

        LOAD waiter->list.next;
        LOAD waiter->task;
        smp_mb();
        STORE waiter->task;
        CALL wakeup
        RELEASE task

In this case, the barrier makes a guarantee that all memory accesses before the
barrier will appear to happen before all the memory accesses after the barrier
with respect to the other CPUs on the system.  It does _not_ guarantee that all
the memory accesses before the barrier will be complete by the time the barrier
instruction itself is complete.

On a UP system - where this wouldn't be a problem - the smp_mb() is just a
compiler barrier, thus making sure the compiler emits the instructions in the
right order without actually intervening in the CPU.  Since there's only one
CPU, that CPU's dependency ordering logic will take care of everything else.


ATOMIC OPERATIONS
-----------------

Whilst they are technically interprocessor interaction considerations, atomic
operations are noted specially as some of them imply full memory barriers and
some don't, but they're very heavily relied on as a group throughout the
kernel.

Any atomic operation that modifies some state in memory and returns information
about the state (old or new) implies an SMP-conditional general memory barrier
(smp_mb()) on each side of the actual operation (with the exception of
explicit lock operations, described later).  These include:

        xchg();
        cmpxchg();
        atomic_cmpxchg();
        atomic_inc_return();
        atomic_dec_return();
        atomic_add_return();
        atomic_sub_return();
        atomic_inc_and_test();
        atomic_dec_and_test();
        atomic_sub_and_test();
        atomic_add_negative();
        atomic_add_unless();    /* when succeeds (returns 1) */
        test_and_set_bit();
        test_and_clear_bit();
        test_and_change_bit();

These are used for such things as implementing LOCK-class and UNLOCK-class
operations and adjusting reference counters towards object destruction, and as
such the implicit memory barrier effects are necessary.
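
As a hedged sketch of the reference-counting case (the object type and its
free routine are invented for this example), atomic_dec_and_test() needs no
additional barriers precisely because it returns a value:

        struct obj {
                atomic_t ref_count;
                /* ... payload ... */
        };

        static void obj_put(struct obj *obj)
        {
                /* atomic_dec_and_test() returns true only for the final
                 * reference, and the barriers implied on either side of it
                 * order this CPU's earlier accesses to *obj against the
                 * object's destruction. */
                if (atomic_dec_and_test(&obj->ref_count))
                        kfree(obj);
        }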


The following operations are potential problems as they do _not_ imply memory
barriers, but might be used for implementing such things as UNLOCK-class
operations:

        atomic_set();
        set_bit();
        clear_bit();
        change_bit();

With these the appropriate explicit memory barrier should be used if necessary
(smp_mb__before_clear_bit() for instance).


The following also do _not_ imply memory barriers, and so may require explicit
memory barriers under some circumstances (smp_mb__before_atomic_dec() for
instance):

        atomic_add();
        atomic_sub();
        atomic_inc();
        atomic_dec();

If they're used for statistics generation, then they probably don't need memory
barriers, unless there's a coupling between statistical data.

If they're used for reference counting on an object to control its lifetime,
they probably don't need memory barriers because either the reference count
will be adjusted inside a locked section, or the caller will already hold
sufficient references to make the lock, and thus a memory barrier, unnecessary.

If they're used for constructing a lock of some description, then they probably
do need memory barriers as a lock primitive generally has to do things in a
specific order.

Basically, each usage case has to be carefully considered as to whether memory
barriers are needed or not.

The following operations are special locking primitives:

        test_and_set_bit_lock();
        clear_bit_unlock();
        __clear_bit_unlock();

These implement LOCK-class and UNLOCK-class operations.  These should be used
in preference to other operations when implementing locking primitives,
because their implementations can be optimised on many architectures.
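
By way of illustration, here is a minimal sketch of a bit-based lock built on
these primitives (the lock word and the bit number are invented for this
example):

        #define MY_LOCK_BIT     0               /* hypothetical bit number */

        static unsigned long my_lock_word;      /* hypothetical lock word */

        static void my_lock(void)
        {
                /* test_and_set_bit_lock() has LOCK-class semantics: accesses
                 * issued after it cannot leak to before it. */
                while (test_and_set_bit_lock(MY_LOCK_BIT, &my_lock_word))
                        cpu_relax();
        }

        static void my_unlock(void)
        {
                /* clear_bit_unlock() has UNLOCK-class semantics: accesses
                 * issued before it cannot leak to after it. */
                clear_bit_unlock(MY_LOCK_BIT, &my_lock_word);
        }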

[!] Note that special memory barrier primitives are available for these
situations because on some CPUs the atomic instructions used imply full memory
barriers, and so barrier instructions are superfluous in conjunction with them,
and in such cases the special barrier primitives will be no-ops.

See Documentation/atomic_ops.txt for more information.


ACCESSING DEVICES
-----------------

Many devices can be memory mapped, and so appear to the CPU as if they're just
a set of memory locations.  To control such a device, the driver usually has to
make the right memory accesses in exactly the right order.

However, having a clever CPU or a clever compiler creates a potential problem
in that the carefully sequenced accesses in the driver code won't reach the
device in the requisite order if the CPU or the compiler thinks it is more
efficient to reorder, combine or merge accesses - something that would cause
the device to malfunction.

Inside of the Linux kernel, I/O should be done through the appropriate accessor
routines - such as inb() or writel() - which know how to make such accesses
appropriately sequential.  Whilst this, for the most part, renders the explicit
use of memory barriers unnecessary, there are a couple of situations where they
might be needed:

 (1) On some systems, I/O stores are not strongly ordered across all CPUs, and
     so for _all_ general drivers locks should be used and mmiowb() must be
     issued prior to unlocking the critical section.

 (2) If the accessor functions are used to refer to an I/O memory window with
     relaxed memory access properties, then _mandatory_ memory barriers are
     required to enforce ordering.

See Documentation/DocBook/deviceiobook.tmpl for more information.


INTERRUPTS
----------

A driver may be interrupted by its own interrupt service routine, and thus the
two parts of the driver may interfere with each other's attempts to control or
access the device.

This may be alleviated - at least in part - by disabling local interrupts (a
form of locking), such that the critical operations are all contained within
the interrupt-disabled section in the driver.  Whilst the driver's interrupt
routine is executing, the driver's core may not run on the same CPU, and its
interrupt is not permitted to happen again until the current interrupt has been
handled, thus the interrupt handler does not need to lock against that.

However, consider a driver that was talking to an ethernet card that sports an
address register and a data register.  If that driver's core talks to the card
under interrupt-disablement and then the driver's interrupt handler is invoked:

        LOCAL IRQ DISABLE
        writew(3, ADDR);
        writew(y, DATA);
        LOCAL IRQ ENABLE
        <interrupt>
        writew(4, ADDR);
        q = readw(DATA);
        </interrupt>

The store to the data register might happen after the second store to the
address register if ordering rules are sufficiently relaxed:

        STORE *ADDR = 3, STORE *ADDR = 4, STORE *DATA = y, q = LOAD *DATA


If ordering rules are relaxed, it must be assumed that accesses done inside an
interrupt disabled section may leak outside of it and may interleave with
accesses performed in an interrupt - and vice versa - unless implicit or
explicit barriers are used.

Normally this won't be a problem because the I/O accesses done inside such
sections will include synchronous load operations on strictly ordered I/O
registers that form implicit I/O barriers.  If this isn't sufficient then an
mmiowb() may need to be used explicitly.


A similar situation may occur between an interrupt routine and two routines
running on separate CPUs that communicate with each other.  If such a case is
likely, then interrupt-disabling locks should be used to guarantee ordering.


==========================
KERNEL I/O BARRIER EFFECTS
==========================

When accessing I/O memory, drivers should use the appropriate accessor
functions:

 (*) inX(), outX():

     These are intended to talk to I/O space rather than memory space, but
     that's primarily a CPU-specific concept.  The i386 and x86_64 processors
     do indeed have special I/O space access cycles and instructions, but many
     CPUs don't have such a concept.

     The PCI bus, amongst others, defines an I/O space concept which - on such
     CPUs as i386 and x86_64 - readily maps to the CPU's concept of I/O
     space.  However, it may also be mapped as a virtual I/O space in the CPU's
     memory map, particularly on those CPUs that don't support alternate I/O
     spaces.

     Accesses to this space may be fully synchronous (as on i386), but
     intermediary bridges (such as the PCI host bridge) may not fully honour
     that.

     They are guaranteed to be fully ordered with respect to each other.

     They are not guaranteed to be fully ordered with respect to other types of
     memory and I/O operation.

 (*) readX(), writeX():

     Whether these are guaranteed to be fully ordered and uncombined with
     respect to each other on the issuing CPU depends on the characteristics
     defined for the memory window through which they're accessing.  On later
     i386 architecture machines, for example, this is controlled by way of the
     MTRR registers.

     Ordinarily, these will be guaranteed to be fully ordered and uncombined,
     provided they're not accessing a prefetchable device.

     However, intermediary hardware (such as a PCI bridge) may indulge in
     deferral if it so wishes; to flush a store, a load from the same location
     is preferred[*], but a load from the same device or from configuration
     space should suffice for PCI.

     [*] NOTE! attempting to load from the same location as was written to may
         cause a malfunction - consider the 16550 Rx/Tx serial registers for
         example.

     Used with prefetchable I/O memory, an mmiowb() barrier may be required to
     force stores to be ordered.

     Please refer to the PCI specification for more information on interactions
     between PCI transactions.

 (*) readX_relaxed()

     These are similar to readX(), but are not guaranteed to be ordered in any
     way.  Be aware that there is no I/O read barrier available.

 (*) ioreadX(), iowriteX()

     These will perform appropriately for the type of access they're actually
     doing, be it inX()/outX() or readX()/writeX().


========================================
ASSUMED MINIMUM EXECUTION ORDERING MODEL
========================================

It has to be assumed that the conceptual CPU is weakly-ordered but that it will
maintain the appearance of program causality with respect to itself.  Some CPUs
(such as i386 or x86_64) are more constrained than others (such as powerpc or
frv), and so the most relaxed case (namely DEC Alpha) must be assumed outside
of arch-specific code.

This means that it must be considered that the CPU will execute its instruction
stream in any order it feels like - or even in parallel - provided that if an
instruction in the stream depends on an earlier instruction, then that
earlier instruction must be sufficiently complete[*] before the later
instruction may proceed; in other words: provided that the appearance of
causality is maintained.

 [*] Some instructions have more than one effect - such as changing the
     condition codes, changing registers or changing memory - and different
     instructions may depend on different effects.

A CPU may also discard any instruction sequence that winds up having no
ultimate effect.  For example, if two adjacent instructions both load an
immediate value into the same register, the first may be discarded.

Similarly, it has to be assumed that the compiler might reorder the instruction
stream in any way it sees fit, again provided the appearance of causality is
maintained.


============================
THE EFFECTS OF THE CPU CACHE
============================

The way cached memory operations are perceived across the system is affected to
a certain extent by the caches that lie between CPUs and memory, and by the
memory coherence system that maintains the consistency of state in the system.

As far as the way a CPU interacts with another part of the system through the
caches goes, the memory system has to include the CPU's caches, and memory
barriers for the most part act at the interface between the CPU and its cache
(memory barriers logically act on the dotted line in the following diagram):

            <--- CPU --->         :       <----------- Memory ----------->
                                  :
        +--------+    +--------+  :   +--------+    +-----------+
        |        |    |        |  :   |        |    |           |    +--------+
        |  CPU   |    | Memory |  :   | CPU    |    |           |    |        |
        |  Core  |--->| Access |----->| Cache  |<-->|           |    |        |
        |        |    | Queue  |  :   |        |    |           |--->| Memory |
        |        |    |        |  :   |        |    |           |    |        |
        +--------+    +--------+  :   +--------+    |           |    |        |
                                  :                 | Cache     |    +--------+
                                  :                 | Coherency |
                                  :                 | Mechanism |    +--------+
        +--------+    +--------+  :   +--------+    |           |    |        |
        |        |    |        |  :   |        |    |           |    |        |
        |  CPU   |    | Memory |  :   | CPU    |    |           |--->| Device |
        |  Core  |--->| Access |----->| Cache  |<-->|           |    |        |
        |        |    | Queue  |  :   |        |    |           |    |        |
        |        |    |        |  :   |        |    |           |    +--------+
        +--------+    +--------+  :   +--------+    +-----------+
                                  :
                                  :

Although any particular load or store may not actually appear outside of the
CPU that issued it since it may have been satisfied within the CPU's own cache,
it will still appear as if the full memory access had taken place as far as the
other CPUs are concerned since the cache coherency mechanisms will migrate the
cacheline over to the accessing CPU and propagate the effects upon conflict.

The CPU core may execute instructions in any order it deems fit, provided the
expected program causality appears to be maintained.  Some of the instructions
generate load and store operations which then go into the queue of memory
accesses to be performed.  The core may place these in the queue in any order
it wishes, and continue execution until it is forced to wait for an instruction
to complete.

What memory barriers are concerned with is controlling the order in which
accesses cross from the CPU side of things to the memory side of things, and
the order in which the effects are perceived to happen by the other observers
in the system.

[!] Memory barriers are _not_ needed within a given CPU, as CPUs always see
their own loads and stores as if they had happened in program order.

[!] MMIO or other device accesses may bypass the cache system.  This depends on
the properties of the memory window through which devices are accessed and/or
the use of any special device communication instructions the CPU may have.


CACHE COHERENCY
---------------

Life isn't quite as simple as it may appear above, however: for while the
caches are expected to be coherent, there's no guarantee that that coherency
will be ordered.  This means that whilst changes made on one CPU will
eventually become visible on all CPUs, there's no guarantee that they will
become apparent in the same order on those other CPUs.

Consider dealing with a system that has a pair of CPUs (1 & 2), each of which
has a pair of parallel data caches (CPU 1 has A/B, and CPU 2 has C/D):

                    :
                    :                          +--------+
                    :      +---------+         |        |
        +--------+  :  +-->| Cache A |<------->|        |
        |        |  :  |   +---------+         |        |
        |  CPU 1 |<---+                        |        |
        |        |  :  |   +---------+         |        |
        +--------+  :  +-->| Cache B |<------->|        |
                    :      +---------+         |        |
                    :                          | Memory |
                    :      +---------+         | System |
        +--------+  :  +-->| Cache C |<------->|        |
        |        |  :  |   +---------+         |        |
        |  CPU 2 |<---+                        |        |
        |        |  :  |   +---------+         |        |
        +--------+  :  +-->| Cache D |<------->|        |
                    :      +---------+         |        |
                    :                          +--------+
                    :

Imagine the system has the following properties:

 (*) an odd-numbered cache line may be in cache A, cache C or it may still be
     resident in memory;

 (*) an even-numbered cache line may be in cache B, cache D or it may still be
     resident in memory;

 (*) whilst the CPU core is interrogating one cache, the other cache may be
     making use of the bus to access the rest of the system - perhaps to
     displace a dirty cacheline or to do a speculative load;

 (*) each cache has a queue of operations that need to be applied to that cache
     to maintain coherency with the rest of the system;

 (*) the coherency queue is not flushed by normal loads to lines already
     present in the cache, even though the contents of the queue may
     potentially affect those loads.

Imagine, then, that two writes are made on the first CPU, with a write barrier
between them to guarantee that they will appear to reach that CPU's caches in
the requisite order:

        CPU 1           CPU 2           COMMENT
        =============== =============== =======================================
                                        u == 0, v == 1 and p == &u, q == &u
        v = 2;
        smp_wmb();                      Make sure change to v is visible before
                                         change to p
        <A:modify v=2>                  v is now in cache A exclusively
        p = &v;
        <B:modify p=&v>                 p is now in cache B exclusively

The write memory barrier forces the other CPUs in the system to perceive that
the local CPU's caches have apparently been updated in the correct order.  But
now imagine that the second CPU wants to read those values:

        CPU 1           CPU 2           COMMENT
        =============== =============== =======================================
        ...
                        q = p;
                        x = *q;

The above pair of reads may then fail to happen in the expected order, as the
cacheline holding p may get updated in one of the second CPU's caches whilst
the update to the cacheline holding v is delayed in the other of the second
CPU's caches by some other cache event:

        CPU 1           CPU 2           COMMENT
        =============== =============== =======================================
                                        u == 0, v == 1 and p == &u, q == &u
        v = 2;
        smp_wmb();
        <A:modify v=2>  <C:busy>
                        <C:queue v=2>
        p = &v;         q = p;
                        <D:request p>
        <B:modify p=&v> <D:commit p=&v>
                        <D:read p>
                        x = *q;
                        <C:read *q>     Reads from v before v updated in cache
                        <C:unbusy>
                        <C:commit v=2>

Basically, whilst both cachelines will be updated on CPU 2 eventually, there's
no guarantee that, without intervention, the order of update will be the same
as that committed on CPU 1.

To intervene, we need to interpolate a data dependency barrier or a read
barrier between the loads.  This will force the cache to commit its coherency
queue before processing any further requests:

        CPU 1           CPU 2           COMMENT
        =============== =============== =======================================
                                        u == 0, v == 1 and p == &u, q == &u
        v = 2;
        smp_wmb();
        <A:modify v=2>  <C:busy>
                        <C:queue v=2>
        p = &v;         q = p;
                        <D:request p>
        <B:modify p=&v> <D:commit p=&v>
                        <D:read p>
                        smp_read_barrier_depends()
                        <C:unbusy>
                        <C:commit v=2>
                        x = *q;
                        <C:read *q>     Reads from v after v updated in cache


This sort of problem can be encountered on DEC Alpha processors as they have a
split cache that improves performance by making better use of the data bus.
Whilst most CPUs do imply a data dependency barrier on the read when a memory
access depends on a read, not all do, so it may not be relied on.

Other CPUs may also have split caches, but must coordinate between the various
cachelets for normal memory accesses.  The semantics of the Alpha removes the
need for such coordination in the absence of memory barriers.


CACHE COHERENCY VS DMA
----------------------

Not all systems maintain cache coherency with respect to devices doing DMA.  In
such cases, a device attempting DMA may obtain stale data from RAM because
dirty cache lines may be resident in the caches of various CPUs, and may not
have been written back to RAM yet.  To deal with this, the appropriate part of
the kernel must flush the overlapping bits of cache on each CPU (and maybe
invalidate them as well).

In addition, the data DMA'd to RAM by a device may be overwritten by dirty
cache lines being written back to RAM from a CPU's cache after the device has
installed its own data, or cache lines present in the CPU's cache may simply
obscure the fact that RAM has been updated, until at such time as the cacheline
is discarded from the CPU's cache and reloaded.  To deal with this, the
appropriate part of the kernel must invalidate the overlapping bits of the
cache on each CPU.

See Documentation/cachetlb.txt for more information on cache management.


CACHE COHERENCY VS MMIO
-----------------------

Memory mapped I/O usually takes place through memory locations that are part of
a window in the CPU's memory space that has different properties assigned than
the usual RAM directed window.

Amongst these properties is usually the fact that such accesses bypass the
caching entirely and go directly to the device buses.  This means MMIO accesses
may, in effect, overtake accesses to cached memory that were emitted earlier.
A memory barrier isn't sufficient in such a case, but rather the cache must be
flushed between the cached memory write and the MMIO access if the two are in
any way dependent.


=========================
THE THINGS CPUS GET UP TO
=========================

A programmer might take it for granted that the CPU will perform memory
operations in exactly the order specified, so that if the CPU is, for example,
given the following piece of code to execute:

        a = *A;
        *B = b;
        c = *C;
        d = *D;
        *E = e;

they would then expect that the CPU will complete the memory operation for each
instruction before moving on to the next one, leading to a definite sequence of
operations as seen by external observers in the system:

        LOAD *A, STORE *B, LOAD *C, LOAD *D, STORE *E.


Reality is, of course, much messier.  With many CPUs and compilers, the above
assumption doesn't hold because:

 (*) loads are more likely to need to be completed immediately to permit
     execution progress, whereas stores can often be deferred without a
     problem;

 (*) loads may be done speculatively, and the result discarded should it prove
     to have been unnecessary;

 (*) loads may be done speculatively, leading to the result having been fetched
     at the wrong time in the expected sequence of events;

 (*) the order of the memory accesses may be rearranged to promote better use
     of the CPU buses and caches;

 (*) loads and stores may be combined to improve performance when talking to
     memory or I/O hardware that can do batched accesses of adjacent locations,
     thus cutting down on transaction setup costs (memory and PCI devices may
     both be able to do this); and

 (*) the CPU's data cache may affect the ordering, and whilst cache-coherency
     mechanisms may alleviate this - once the store has actually hit the cache
     - there's no guarantee that the coherency management will be propagated in
     order to other CPUs.

So what another CPU, say, might actually observe from the above piece of code
is:

        LOAD *A, ..., LOAD {*C,*D}, STORE *E, STORE *B

        (Where "LOAD {*C,*D}" is a combined load)


However, it is guaranteed that a CPU will be self-consistent: it will see its
_own_ accesses appear to be correctly ordered, without the need for a memory
barrier.  For instance with the following code:

        U = *A;
        *A = V;
        *A = W;
        X = *A;
        *A = Y;
        Z = *A;

and assuming no intervention by an external influence, it can be assumed that
the final result will appear to be:

        U == the original value of *A
        X == W
        Z == Y
        *A == Y

The code above may cause the CPU to generate the full sequence of memory
accesses:

        U=LOAD *A, STORE *A=V, STORE *A=W, X=LOAD *A, STORE *A=Y, Z=LOAD *A

in that order, but, without intervention, the sequence may have almost any
combination of elements combined or discarded, provided the program's view of
the world remains consistent.

The compiler may also combine, discard or defer elements of the sequence before
the CPU even sees them.

For instance:

        *A = V;
        *A = W;

may be reduced to:

        *A = W;

since, without a write barrier, it can be assumed that the effect of the
storage of V to *A is lost.  Similarly:

        *A = Y;
        Z = *A;

may, without a memory barrier, be reduced to:

        *A = Y;
        Z = Y;

and the LOAD operation need never appear outside of the CPU.
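
Where such merging or discarding would be harmful - for instance, if *A is a
location shared with another CPU or a device - one way to suppress it is the
ACCESS_ONCE macro mentioned earlier (a sketch; this constrains only the
compiler, not the CPU):

        ACCESS_ONCE(*A) = Y;    /* the store may not be discarded or merged */
        Z = ACCESS_ONCE(*A);    /* the load must really be emitted */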


AND THEN THERE'S THE ALPHA
--------------------------

The DEC Alpha CPU is one of the most relaxed CPUs there is.  Not only that,
some versions of the Alpha CPU have a split data cache, permitting them to have
two semantically-related cache lines updated at separate times.  This is where
the data dependency barrier really becomes necessary as this synchronises both
caches with the memory coherence system, thus making it seem like pointer
changes vs new data occur in the right order.

The Alpha defines the Linux kernel's memory barrier model.

See the subsection on "Cache Coherency" above.


============
EXAMPLE USES
============

CIRCULAR BUFFERS
----------------

Memory barriers can be used to implement circular buffering without the need
of a lock to serialise the producer with the consumer.  See:

        Documentation/circular-buffers.txt

for details.
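
As a rough sketch of the idea - the buffer layout and function names here are
invented for illustration and are not taken from that document - a
single-producer/single-consumer ring pairs a write barrier in the producer
with a read barrier in the consumer:

        #define BUF_SIZE 16                     /* power of two, hypothetical */

        static void *buffer[BUF_SIZE];
        static unsigned long head;              /* written only by the producer */
        static unsigned long tail;              /* written only by the consumer */

        /* Producer, running on one CPU */
        int produce(void *item)
        {
                if (head - tail >= BUF_SIZE)
                        return 0;               /* full */
                buffer[head % BUF_SIZE] = item;
                smp_wmb();                      /* commit the item before... */
                head++;                         /* ...publishing the new head */
                return 1;
        }

        /* Consumer, running on another CPU */
        void *consume(void)
        {
                void *item;

                if (tail == head)
                        return NULL;            /* empty */
                smp_rmb();                      /* read the index before the item */
                item = buffer[tail % BUF_SIZE];
                tail++;
                return item;
        }

The smp_wmb() in the producer pairs with the smp_rmb() in the consumer so that
the consumer never reads a slot before the item stored in it is visible.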


==========
REFERENCES
==========

Alpha AXP Architecture Reference Manual, Second Edition (Sites & Witek,
Digital Press)
        Chapter 5.2: Physical Address Space Characteristics
        Chapter 5.4: Caches and Write Buffers
        Chapter 5.5: Data Sharing
        Chapter 5.6: Read/Write Ordering

AMD64 Architecture Programmer's Manual Volume 2: System Programming
        Chapter 7.1: Memory-Access Ordering
        Chapter 7.4: Buffering and Combining Memory Writes

IA-32 Intel Architecture Software Developer's Manual, Volume 3:
System Programming Guide
        Chapter 7.1: Locked Atomic Operations
        Chapter 7.2: Memory Ordering
        Chapter 7.4: Serializing Instructions

The SPARC Architecture Manual, Version 9
        Chapter 8: Memory Models
        Appendix D: Formal Specification of the Memory Models
        Appendix J: Programming with the Memory Models

UltraSPARC Programmer Reference Manual
        Chapter 5: Memory Accesses and Cacheability
        Chapter 15: Sparc-V9 Memory Models

UltraSPARC III Cu User's Manual
        Chapter 9: Memory Models

UltraSPARC IIIi Processor User's Manual
        Chapter 8: Memory Models

UltraSPARC Architecture 2005
        Chapter 9: Memory
        Appendix D: Formal Specifications of the Memory Models

UltraSPARC T1 Supplement to the UltraSPARC Architecture 2005
        Chapter 8: Memory Models
        Appendix F: Caches and Cache Coherency

Solaris Internals, Core Kernel Architecture, p63-68:
        Chapter 3.3: Hardware Considerations for Locks and
                     Synchronization

Unix Systems for Modern Architectures, Symmetric Multiprocessing and Caching
for Kernel Programmers:
        Chapter 13: Other Memory Models

Intel Itanium Architecture Software Developer's Manual: Volume 1:
        Section 2.6: Speculation
        Section 4.4: Memory Access