Scaling the dentry cache

Introduction:

Currently, the dentry cache is protected by a single global lock,
dcache_lock. This lock is held during lookup of dentries
as well as during all manipulations of the dentry cache and the assorted
lists that maintain hierarchies, aliases and LRU entries. This does not
seem to be an issue on lower-end SMP machines, but it becomes more of
one as the number of CPUs increases. So we experimented with
various approaches to understand the scaling issues related to the dentry cache.

A look at the issues:

So far we have used two benchmarks to understand the scaling of the dentry cache -
dbench (with settings chosen to avoid I/O as much as possible) and httperf.
The latest numbers are based on the 2.4.16 kernel. The numbers were
measured on an 8-way P-III Xeon (1MB L2 cache) box with 2 GB RAM.

dbench results:

The dbench results show that lock utilization and contention, while
not significant, grow steadily with an increasing number of CPUs.
The 8-way results show 5.3% utilization with 16.5% contention on
this lock.

One significant observation from the lockmeter output is that for this
workload d_lookup() is the most frequently performed operation: 74% of
the time, the global lock is acquired from d_lookup().

Summary of baseline measurements:

The baseline measurements show that dcache_lock, while not very hot,
shows an increasing level of contention for some benchmarks. It is
probably not an immediate bottleneck only because of other bottlenecks
like the BKL. Looking at the distribution of lock acquisitions
for these workloads, it becomes obvious that d_lookup() is the
routine to optimize, since it is the routine from which the
global lock is acquired most often.

Avoiding the global lock in d_lookup()

dcache_lock is held in d_lookup() while traversing the d_hash list,
and to update the LRU list for freeing if the dentry found has a zero
reference count. By using RCU we can avoid dcache_lock while reading
the d_hash list. That is what the first RCU dcache patch,
dcache_rcu-2.4.10-01, did. In it, the d_hash lookup was done lock-free,
but dcache_lock still had to be taken while updating the LRU list. This
patch does decrease lock hold time and the contention level to some
extent. Lockmeter statistics for the base kernel (2.4.16) while running
dbench are here, and for this minimal patch are here.

The RCU patch (rcu-2.4.10-1.patch) used for this approach can be found here.
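In sketch form, this first approach looks roughly like the following.
This is a hedged reconstruction rather than the patch itself: the
d_op->d_compare case is omitted, and the names follow 2.4 fs/dcache.c.

struct dentry * d_lookup(struct dentry * parent, struct qstr * name)
{
        unsigned int len = name->len;
        unsigned int hash = name->hash;
        const unsigned char *str = name->name;
        struct list_head *head = d_hash(parent, hash);
        struct list_head *tmp;

        rcu_read_lock();                /* hash chain is walked lock-free */
        for (tmp = head->next; tmp != head; tmp = tmp->next) {
                struct dentry *dentry = list_entry(tmp, struct dentry, d_hash);

                if (dentry->d_name.hash != hash)
                        continue;
                if (dentry->d_parent != parent)
                        continue;
                if (dentry->d_name.len != len)
                        continue;
                if (memcmp(dentry->d_name.name, str, len))
                        continue;

                /*
                 * The global lock is still needed here: a dentry with
                 * a zero refcount sits on the LRU list and must be
                 * removed from it when it gains a reference.
                 */
                spin_lock(&dcache_lock);
                if (!atomic_read(&dentry->d_count)) {
                        list_del_init(&dentry->d_lru);
                        dentry_stat.nr_unused--;
                }
                atomic_inc(&dentry->d_count);
                spin_unlock(&dcache_lock);

                rcu_read_unlock();
                return dentry;
        }
        rcu_read_unlock();
        return NULL;
}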

This proved that just doing a lock-free lookup of the dentry
hash is not enough. d_lookup() also acquires dcache_lock to
update the LRU list if the newly looked-up entry previously
had a zero refcount. For the dbench workload this was true
most of the time, and hence we ended up acquiring the lock after
almost every lock-free hash-table lookup in d_lookup().
We needed some way to avoid acquiring dcache_lock so often.
Different algorithms were tried to get rid of this lock in
d_lookup(). Here is a summary of the results.

1. Separate lock for LRU list

The idea behind this was that since d_lookup() only updates the LRU
list, we can relax dcache_lock by introducing a separate lock for the
LRU list. The logic worked, but the entire load got transferred to the
LRU list lock and there was practically no gain at all. Routines like
prune_dcache(), select_parent() and d_prune_aliases() read/write other
lists apart from the LRU list, so in those places both locks were
needed. The patch is dcache_rcu-lru_lock-2.4.16-02.patch. It has not
been tested completely, beyond running dbench and collecting lockmeter
statistics (lockmeter statistics).

The RCU patch (rcu-2.4.16-1.patch) used for this approach can be found here.
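In sketch form, the locking split looked roughly like this. The helper
names and d_lru_lock are illustrative, not the patch's:

static spinlock_t d_lru_lock = SPIN_LOCK_UNLOCKED;      /* illustrative name */

/* d_lookup() hit path: only the LRU list needs protection here */
static void lru_touch(struct dentry *dentry)
{
        spin_lock(&d_lru_lock);
        if (!atomic_read(&dentry->d_count))
                list_del_init(&dentry->d_lru);  /* in use, off the LRU */
        atomic_inc(&dentry->d_count);
        spin_unlock(&d_lru_lock);
}

/* prune_dcache()/select_parent()/d_prune_aliases() touch the hash,
 * child and alias lists too, so they must nest both locks - and
 * every d_lookup() now contends with them on d_lru_lock. */
static void prune_one(struct dentry *dentry)
{
        spin_lock(&dcache_lock);
        spin_lock(&d_lru_lock);
        list_del_init(&dentry->d_lru);
        /* ... unlink from d_hash/d_child/d_alias under dcache_lock ... */
        spin_unlock(&d_lru_lock);
        spin_unlock(&dcache_lock);
}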

2. Per-bucket lock for hash & LRU list

We tried hard to make this one work, but it resulted in lots of nasty,
hard-to-debug race conditions, and it never worked properly.
The fundamental problem is that the dentry cache has several other
lists that span buckets, and to modify or access any of those
lists we need to take multiple bucket locks. This results in
a serious lock-ordering problem, and it turned out to be unworkable.
In addition, the same problem was present here too: both the per-bucket
lock and dcache_lock were needed in many places. The whole logic was
becoming quite complicated, and we could not get any statistics out of
this approach, as it never worked beyond booting. The approach was
abandoned. The patch is dcache_rcu-bucket-2.4.16-05.patch.

The RCU patch (rcu-2.4.16-1.patch) used for this approach can be found here.
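The lock-ordering trap can be sketched in a few lines (all names here
are illustrative):

static spinlock_t bucket_lock[D_HASHSIZE];      /* one lock per bucket */

/*
 * A dentry and its parent almost always hash to different buckets,
 * and the child/alias lists cross buckets freely.  An operation that
 * must atomically touch two dentries therefore needs two bucket
 * locks, and nothing orders them:
 *
 *      CPU0: spin_lock(&bucket_lock[h1]); spin_lock(&bucket_lock[h2]);
 *      CPU1: spin_lock(&bucket_lock[h2]); spin_lock(&bucket_lock[h1]);
 *
 * A classic ABBA deadlock.  Taking locks in bucket-index order can
 * paper over single pairs, but routines like select_parent() walk
 * whole subtrees, so the set of buckets needed is not known up front.
 */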

3. Lazy dentry cache

Given that lock-free traversal of hash chains is not enough to
reduce dcache_lock acquisitions, we looked at the possibility
of removing dcache_lock acquisitions from d_lookup() completely.
It seemed that after using the RCU-based lock-free hash lookup,
the only reason dcache_lock was needed in d_lookup() was to
update the LRU list if the entry previously had a zero refcount.
We decided to relax the LRU list norms by letting entries
with a non-zero refcount stay on it, delaying the update of the
LRU list. The logic behind this was that instead of acquiring
the global dcache_lock for every single entry, we would process
a number of entries per acquisition of the lock. To accomplish
this, we introduced another flag in the dentry to mark it,
and a per-dentry lock to maintain consistency between the flag
and the refcount. For all other fields, the refcount still
provides the mutual exclusion. We also needed to make sure
that we don't end up with so many entries with a non-zero
refcount in the LRU list that the deletion process becomes very
lengthy. For this we decided to try out two approaches - periodic
update of the LRU list using a timer, and periodic update of the
LRU list during various traversals. Both approaches use the same
tactics in conjunction with RCU; they differ only in how the
periodic update is done.

The RCU patch (rcu-2.4.17-1.patch) used for this approach can be found here.
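A minimal sketch of the resulting fast path, assuming the flag is
DCACHE_REFERENCED as described in the notes below (the helper name
d_lookup_hit is ours, not the patch's); it replaces the
spin_lock(&dcache_lock) section in the earlier d_lookup() sketch:

/* Called on a successful RCU-protected hash walk in d_lookup(). */
static inline void d_lookup_hit(struct dentry *dentry)
{
        spin_lock(&dentry->d_lock);     /* per-dentry lock, not dcache_lock */
        atomic_inc(&dentry->d_count);
        /*
         * Leave the dentry on d_lru even though it is now referenced;
         * the flag tells select_parent()/prune_dcache()/dput() to
         * remove it lazily, many entries per lock acquisition.
         */
        dentry->d_vfs_flags |= DCACHE_REFERENCED;
        spin_unlock(&dentry->d_lock);
}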

Notes on lazy dcache implementation

Contention for dcache_lock is reduced in all routines, at the expense
of prune_dcache(). prune_dcache() and select_parent() take more time
because the d_lru list is longer. That is acceptable, as neither
routine is in the fast path.

The per-dentry lock (d_lock) is needed to protect d_vfs_flags and
d_count in d_lookup(). That is fine, as there is almost nil contention
on the per-dentry lock. The DCACHE_REFERENCED flag now does some more
work, i.e. it also marks the dentries that are not supposed to be on
the d_lru list. Right now, apart from d_lookup(), the per-dentry lock
(dentry->d_lock) is taken wherever we read or modify d_count or
d_vfs_flags. It is probably possible to tune the code further and relax
the locking in some cases.

Due to the lock hierarchy between dcache_lock and the per-dentry lock,
the dput() code has become slightly ugly. It also uses a d_nexthash
field in the dentry structure, needed because dentries are now freed
in a deferred manner using RCU.

Because of RCU, before unlinking a dentry from the d_hash list we have
to update the d_nexthash pointer and mark the dentry for deferred
freeing by setting the DCACHE_DEFERRED_FREE bit in d_vfs_flags. All of
this is done by a new function, include/linux/dcache.h: d_unhash().
Because of this, the old fs/namei.c: d_unhash() has been renamed to
fs/namei.c: d_vfs_unhash().

As the lookup is lockless, rmb() is used in d_lookup() to avoid an
out-of-order read of d_nexthash, and wmb() is used in d_unhash() to
make sure that d_vfs_flags and d_nexthash are updated before the dentry
is unlinked from the d_hash chain.
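Putting the last two paragraphs together, the new d_unhash() could look
roughly like this (a hedged sketch assembled from the names given
above; the actual patch may differ in detail):

static inline void d_unhash(struct dentry *dentry)
{
        /* Preserve a walk pointer for lock-free readers that may
         * still be sitting on this dentry. */
        dentry->d_nexthash = dentry->d_hash.next;
        dentry->d_vfs_flags |= DCACHE_DEFERRED_FREE;
        /* Make d_nexthash and d_vfs_flags visible before the dentry
         * disappears from the chain; d_lookup() pairs this with rmb(). */
        wmb();
        list_del_init(&dentry->d_hash);
        /* the actual freeing happens later, after an RCU grace period */
}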

The d_lru list is brought up to date in select_parent(), prune_dcache()
and dput(); that is, dentries with a non-zero refcount are removed
from the d_lru list in those routines.
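In sketch form, the lazy removal these routines perform could look like
this (lru_cleanup is a hypothetical helper; dentry_unused and
dentry_stat are the 2.4 names for the LRU list head and the dcache
statistics):

static void lru_cleanup(void)
{
        struct list_head *tmp, *next;

        /* One dcache_lock acquisition covers many entries - the
         * whole point of the lazy scheme. */
        spin_lock(&dcache_lock);
        for (tmp = dentry_unused.next; tmp != &dentry_unused; tmp = next) {
                struct dentry *dentry = list_entry(tmp, struct dentry, d_lru);

                next = tmp->next;
                spin_lock(&dentry->d_lock);
                if (atomic_read(&dentry->d_count)) {
                        /* looked up since it was queued: it no longer
                         * belongs on the LRU list */
                        list_del_init(&dentry->d_lru);
                        dentry_stat.nr_unused--;
                }
                spin_unlock(&dentry->d_lock);
        }
        spin_unlock(&dcache_lock);
}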

Every dget() now marks the dentry as referenced by setting the
DCACHE_REFERENCED bit in d_vfs_flags. So we always have to take the
per-dentry lock in dget(), and there is no dget_locked() anymore.
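A minimal sketch of the reworked dget(), modeled on the 2.4 inline with
the lock and flag added:

static inline struct dentry * dget(struct dentry *dentry)
{
        if (dentry) {
                spin_lock(&dentry->d_lock);
                atomic_inc(&dentry->d_count);
                /* mark it so the lazy LRU cleanup leaves no stale entry */
                dentry->d_vfs_flags |= DCACHE_REFERENCED;
                spin_unlock(&dentry->d_lock);
        }
        return dentry;
}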

Lazy dcache patches

For lazy LRU updating we also tried a timer-based approach, in which a
timer was used to remove referenced dentries from the LRU list so that
the LRU list stays short. It did not prove any better than the no-timer
approach. Also, since we have to take dcache_lock from the timer
handler, we had to use spin_lock_bh() and spin_unlock_bh() for
dcache_lock, which also created a problem of cyclic dependencies in
dcache.h. Still, the patch is worth a look, as proper tuning of the
timer frequency could probably give better results. The patch is
dcache_rcu-lazy_lru-timer-2.4.16-04.patch.
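A sketch of the timer-based variant follows; lru_timer_fn and
LRU_UPDATE_INTERVAL are made-up names, and the interval is the tunable
mentioned above:

#define LRU_UPDATE_INTERVAL     (5 * HZ)        /* illustrative value */

static struct timer_list lru_timer;

static void lru_timer_fn(unsigned long data)
{
        /* Timer handlers run in bottom-half context; this is why
         * process-context users of dcache_lock must switch to
         * spin_lock_bh()/spin_unlock_bh(). */
        spin_lock(&dcache_lock);
        /* drop referenced dentries from d_lru, as in lru_cleanup() */
        spin_unlock(&dcache_lock);
        mod_timer(&lru_timer, jiffies + LRU_UPDATE_INTERVAL);
}

static void lru_timer_start(void)
{
        init_timer(&lru_timer);
        lru_timer.function = lru_timer_fn;
        lru_timer.expires = jiffies + LRU_UPDATE_INTERVAL;
        add_timer(&lru_timer);
}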

As of now, the patch has been tested with ext2, ext3, JFS and the /proc
filesystem. Our goal was only to experiment with the dcache; testing
with other filesystems is in the pipeline.

Lazy dcache results

We ran dbench and httperf to understand the effect of lazy dcache,
and the results were very good. By doing a lock-free d_lookup(), we
were able to cut the number of dcache_lock acquisitions substantially.
This resulted in substantially decreased contention as well as
lock utilization.

dbench results:

The dbench results showed that lock utilization and contention levels
remain flat with lazy dcache, as opposed to increasing steadily with
the baseline kernel. For 8 processors, the contention level is 0.95%,
as opposed to 16.5% for the baseline (2.4.16) kernel.

One significant observation is that the maximum lock hold times for
prune_dcache() and select_parent() are high with this algorithm.
However, these are not frequently performed operations for this
workload.

A comparison of the baseline (2.4.16) kernel and lazy dcache can be
seen in the following graphs -

The throughput results show marginal differences (statistically
insignificant) for up to 4 CPUs, but a gain of about 1% (statistically
significant) on 8 CPUs. So there is no performance regression at the
lower end, and the gains are quite small at the higher end. Along with
the decreased contention for dcache_lock at the higher end, we see
increased contention for the BKL. So it is likely that other
bottlenecks negate the gain made by rcu_lazy_dcache.

Note:

All dcache patches need the RCU patch for the corresponding kernel
version. For example, for dcache_rcu-2.5.3-x.patch, first apply
rcu-2.5.3-x.patch or rcu_poll-2.5.3-x.patch. The RCU and dcache patches
can be downloaded from the Read-Copy Update package on LSE.