Scaling dcache with RCU

Reorganizing the way Linux caches filename lookups is a big win for scaling to large servers.

Performance and Complexity Comparisons

Although this change in dcache was relatively small, it had far-reaching
consequences in the kernel, because a well-defined API for filesystems
to interact with dcache was not in place. This resulted in a large number of bugs in
the Linux 2.5 kernel due to filesystem hackers attempting to
manipulate dcache directly in the traditional style. Given that a
somewhat more formal API now exists, we hope future changes will be
less traumatic.

Figure 9 shows the performance of a multiuser benchmark running on a Linux
2.5.59 kernel patched to use RCU in the directory-entry cache compared
to the performance of an unpatched kernel. These benchmarks were run on a 16-CPU
NUMA-Q system using 700MHz PIII Intel Xeons with 1MB L2 cache and 16GB of memory.

Figure 9. Multiuser Benchmark Performance

Applying the dcache_rcu patch to a Linux 2.4.17 kernel increased SPECweb99
(without SSL) throughput from 2,258 to 2,530 on an 8-CPU PIII Xeon server,
a 12% improvement.
Applying the same patch to a Linux 2.5.40-mm2 kernel reduced the
system time consumed by a Linux kernel build from 47.548 CPU seconds
to 42.498 CPU seconds, more than a 10% reduction. A similar test run
on a uniprocessor 700MHz PIII Xeon system running the Linux 2.5.42 kernel
showed no change. In summary, dcache RCU not only increases scaling for
high-end machines, it also maintains good performance on low-end machines.
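The percentages quoted above can be checked directly from the raw figures. The short Python sketch below does only that arithmetic; the helper name percent_change is ours, chosen for illustration, and is not part of any benchmark tooling.

```python
def percent_change(before, after):
    """Return the change from before to after as a percentage of before."""
    return (after - before) / before * 100.0

# Figures quoted in the text above.
# SPECweb99 throughput on the 8-CPU server (higher is better).
specweb_gain = percent_change(2258, 2530)
# Kernel-build system time in CPU seconds (lower is better), so negate.
build_saving = -percent_change(47.548, 42.498)

print(f"SPECweb99 throughput gain: {specweb_gain:.1f}%")
print(f"Kernel build system-time reduction: {build_saving:.1f}%")
```

Both results agree with the rounded values given in the text: roughly 12% for SPECweb99 and a little over 10% for the kernel build.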

Future Directions

Although the 2.6 dcache system is much more scalable than the 2.4 version
was, a number of issues still need to be investigated:

Updates still are gated by dcache_lock, which means
that update-intensive workloads do not scale well.

The global hash table defeats cache locality and makes update code more
complex than necessary. Of course, any alternative must preserve
its benefits, including high-performance handling of large directories.

The 2.6 dcache code acquires each dentry's d_lock
spinlock, resulting in cache-line bouncing and atomic
operations, particularly on the root directory and on working
directories. Much thought is needed to arrive at a simple
solution, as moving permissions into the dentry turns out to
be quite complex.

The code that resolves races between __d_lookup() and
d_move() is overly complex.

We eagerly anticipate participating in the 2.7 effort to resolve these issues,
hopefully resulting in the situation shown in Figure 10.

Figure 10. Tux's Duty in 2.7

Legal Statement

This work represents the view of the author and does not necessarily
represent the view of IBM.

SPEC and the benchmark name SPECweb are registered trademarks
of the Standard Performance Evaluation Corporation. The benchmarking was
done for research purposes only and may not be compared to published results
on the SPECweb site, due to the following deviations from the rules:

It was run on hardware that does not meet the SPEC
availability-to-the-public criteria. The machine was
an engineering sample.

access_log was not kept for full accounting. It was being
written but deleted every 200 seconds.

Paul E. McKenney is a distinguished engineer at IBM and has worked on
SMP and NUMA algorithms for longer than he cares to admit. Prior to
that, he worked on packet-radio and Internet protocols (but long before the
Internet became popular). His hobbies include running and the usual
house-wife-and-kids habit.

Dipankar Sarma currently is working on a number of Linux kernel
projects, including CPU hot-plug, RCU and VFS enhancements. Prior
to his Linux days, he worked in a number of areas, including
ABI, OS bringup, I/O drivers and multipath I/O.

Maneesh Soni has been working with IBM's Linux Technology Center as a
member of the Linux Scalability Effort project. He has experience
in the system software arena, particularly with operating-system
kernels and filesystems.