Avoiding Locks: Read Copy Update

There is a special method of read/write locking called Read Copy
Update. Using RCU, the readers can avoid taking a lock
altogether: as we expect our cache to be read more often than
updated (otherwise the cache is a waste of time), it is a
candidate for this optimization.

How do we get rid of read locks? Getting rid of read locks
means that writers may be changing the list underneath the
readers. Coping with that is actually quite simple: we can read a linked
list while an element is being added if the writer adds the
element very carefully. For example, adding
new to a singly linked list called
list:

new->next = list->next;
wmb();
list->next = new;

The wmb() is a write memory barrier. It
ensures that the first operation (setting the new element's
next pointer) is complete and will be seen by
all CPUs before the second operation (putting the new
element into the list) is. This is important, since modern
compilers and modern CPUs can both reorder instructions unless
told otherwise: we want a reader to either not see the new
element at all, or see the new element with the
next pointer correctly pointing at the rest of
the list.
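
To make the ordering concrete, here is a minimal user-space sketch of
the same careful insertion using C11 atomics. It is an analogy, not
kernel code: struct node and insert_after() are hypothetical names
invented for this illustration, and the release store stands in for
the wmb() followed by the plain pointer assignment.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical node type for illustration; not the kernel's list_head. */
struct node {
    int value;
    _Atomic(struct node *) next;
};

/* Insert new_node directly after head. Assumes a single writer at a time. */
void insert_after(struct node *head, struct node *new_node)
{
    assert(head != NULL && new_node != NULL);

    /* Step 1: point the new node at the rest of the list. No reader
     * can reach new_node yet, so relaxed ordering is enough here. */
    struct node *rest = atomic_load_explicit(&head->next,
                                             memory_order_relaxed);
    atomic_store_explicit(&new_node->next, rest, memory_order_relaxed);

    /* Step 2: publish. The release store plays the role of wmb():
     * step 1 becomes visible before the new node becomes reachable. */
    atomic_store_explicit(&head->next, new_node, memory_order_release);
}
```

A reader that loads head->next with acquire ordering then sees either
the old list or the new node with its next pointer already filled in.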

Fortunately, there is a function to do this for standard
struct list_head lists:
list_add_rcu()
(include/linux/rculist.h in current kernels; older kernels
declare it in include/linux/list.h).

Removing an element from the list is even simpler: we replace
the pointer to the old element with a pointer to its successor,
and readers will either see it, or skip over it.

list->next = old->next;

There is list_del_rcu()
(include/linux/rculist.h in current kernels) which does this (the
normal version poisons the old object, which we don't want).
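
The removal step can be sketched the same way in user-space C11. As
before, struct node and unlink_after() are hypothetical names for this
illustration only. The point to notice is that the removed node's own
next pointer is left alone, so a reader standing on it can still walk
on to the rest of the list.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical node type for illustration; not the kernel's list_head. */
struct node {
    int value;
    _Atomic(struct node *) next;
};

/* Unlink the node that follows prev and return it (NULL if there is
 * none). Assumes a single writer. The caller must not free the
 * returned node until all pre-existing readers are done with it. */
struct node *unlink_after(struct node *prev)
{
    assert(prev != NULL);

    struct node *old = atomic_load_explicit(&prev->next,
                                            memory_order_relaxed);
    if (old == NULL)
        return NULL;

    struct node *succ = atomic_load_explicit(&old->next,
                                             memory_order_relaxed);

    /* One pointer store: a reader sees the list with old, or without it. */
    atomic_store_explicit(&prev->next, succ, memory_order_release);

    /* old->next is deliberately not cleared or poisoned. */
    return old;
}
```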

The reader must also be careful: some CPUs can look through the
next pointer to start reading the contents of
the next element early, but don't realize that the pre-fetched
contents are wrong when the next pointer changes
underneath them. Once again, there is a
list_for_each_entry_rcu()
(include/linux/rculist.h in current kernels) to help you. Of
course, writers can just use
list_for_each_entry(), since there cannot
be two simultaneous writers.
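
For the reader side, a user-space C11 sketch might look like the
following. The names struct node and list_contains() are hypothetical,
and the acquire loads are a conservative stand-in for the ordering
that the kernel helper arranges for you: each node's contents are read
only after the pointer to it has been loaded.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical node type for illustration; not the kernel's list_head. */
struct node {
    int value;
    _Atomic(struct node *) next;
};

/* Lock-free reader: return 1 if some node after head holds wanted. */
int list_contains(struct node *head, int wanted)
{
    assert(head != NULL);

    /* Each acquire load pairs with the writer's release store, so the
     * fields of a node are valid by the time we dereference it. */
    for (struct node *n = atomic_load_explicit(&head->next,
                                               memory_order_acquire);
         n != NULL;
         n = atomic_load_explicit(&n->next, memory_order_acquire)) {
        if (n->value == wanted)
            return 1;
    }
    return 0;
}
```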

Our final dilemma is this: when can we actually destroy the
removed element? Remember, a reader might be stepping through
this element in the list right now: if we free this element and
the next pointer changes, the reader will jump
off into garbage and crash. We need to wait until we know that
all the readers who were traversing the list when we deleted the
element are finished. We use call_rcu() to
register a callback which will actually destroy the object once
all pre-existing readers are finished. Alternatively,
synchronize_rcu() may be used to block until
all pre-existing readers are finished.

But how does Read Copy Update know when the readers are
finished? The method is this: firstly, the readers always
traverse the list inside
rcu_read_lock()/rcu_read_unlock()
pairs: these simply disable preemption so the reader won't go to
sleep while reading the list.

RCU then waits until every other CPU has slept at least once:
since readers cannot sleep, we know that any readers which were
traversing the list during the deletion are finished, and the
callback is triggered. The real Read Copy Update code is a
little more optimized than this, but this is the fundamental
idea.
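
The waiting rule can be modelled in a few lines of ordinary C. This is
a deliberately simplified, single-threaded toy and not how the kernel
implements it: NCPUS, struct toy_rcu and the toy_* functions are all
hypothetical names, and for simplicity the toy waits for every CPU.
The writer snapshots a per-CPU sleep counter when it removes an
element, and the grace period is over once every counter has moved.

```c
#include <assert.h>
#include <stdbool.h>

#define NCPUS 4 /* hypothetical CPU count for this model */

/* One counter per CPU, bumped each time that CPU sleeps. */
struct toy_rcu {
    unsigned long sleeps[NCPUS];
};

/* What the writer remembers from the moment of deletion. */
struct toy_grace_period {
    unsigned long snapshot[NCPUS];
};

/* Record that a CPU has slept, so it cannot be inside a read section. */
void toy_cpu_slept(struct toy_rcu *rcu, int cpu)
{
    assert(cpu >= 0 && cpu < NCPUS);
    rcu->sleeps[cpu]++;
}

/* Begin waiting: remember how often each CPU had slept so far. */
void toy_gp_start(const struct toy_rcu *rcu, struct toy_grace_period *gp)
{
    for (int i = 0; i < NCPUS; i++)
        gp->snapshot[i] = rcu->sleeps[i];
}

/* True once every CPU has slept at least once since toy_gp_start(). */
bool toy_gp_done(const struct toy_rcu *rcu,
                 const struct toy_grace_period *gp)
{
    for (int i = 0; i < NCPUS; i++) {
        if (rcu->sleeps[i] == gp->snapshot[i])
            return false; /* this CPU may still hold an old reader */
    }
    return true;
}
```

Only when toy_gp_done() returns true would it be safe to run the
destroy callback for the removed element.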

Note that the reader will alter the
popularity member in
__cache_find(), and now it doesn't hold a lock.
One solution would be to make it an atomic_t, but for
this usage, we don't really care about races: an approximate result is
good enough, so I didn't change it.
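
For comparison, the atomic alternative mentioned above could be
sketched like this in user-space C11. It is the option that was not
taken here, struct object is cut down to the one member that matters,
and object_touch() and object_popularity() are hypothetical helpers.
Relaxed ordering is used because the counter only needs to avoid lost
updates; it does not order any other memory accesses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Cut-down, hypothetical version of the cache object. */
struct object {
    atomic_uint popularity;
};

/* Count one lookup without holding any lock. */
void object_touch(struct object *obj)
{
    assert(obj != NULL);
    atomic_fetch_add_explicit(&obj->popularity, 1u, memory_order_relaxed);
}

/* Read the current count; it may already be stale when returned. */
unsigned int object_popularity(struct object *obj)
{
    assert(obj != NULL);
    return atomic_load_explicit(&obj->popularity, memory_order_relaxed);
}
```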

The result is that cache_find() requires no
synchronization with any other functions, so it is almost as fast on SMP
as it would be on UP.

There is a further optimization possible here: remember our original
cache code, where there were no reference counts and the caller simply
held the lock whenever using the object? This is still possible: if
you hold the lock, no one can delete the object, so you don't need to
get and put the reference count.

Now, because the 'read lock' in RCU is simply disabling preemption, a
caller which always has preemption disabled between calling
cache_find() and
object_put() does not need to actually get and
put the reference count: we could expose
__cache_find() by making it non-static, and
such callers could simply call that.

The benefit here is that the reference count is not written to: the
object is not altered in any way, so its cache line can stay shared
among CPUs instead of bouncing between them, which is much faster on
SMP machines.