This message is something of a hodge-podge of thoughts.
Benchmarks are useful because they give you a consistent measure.
Benchmarks are harmful, however, when the measure they give you is
not a measure of `real' performance on `real' applications.
Unfortunately, `real' applications (a) vary from one person to
the next and (b) rarely work well as benchmarks.

One problem with optimizing system calls in general is that only
benchmarks spend a large fraction of time making repeated getpid()
calls, and speeding up such a benchmark is not useful. On the
other hand, applications that are important to someone *do* spend
a lot of time making, say, read() or write() calls -- and making
getpid() faster also makes those faster. The question (for which
I do not have the answer) is, how *much* faster, and should the
effort be put into the syscall stub, or into the path within the
file system read() call? The time for a read() may turn out to be
dominated by byte copies that could be eliminated entirely via
page-mapping (e.g., replace the user's buffer pages with COW pages
that alias the buffer cache).
Van and I objected to the RPC-ish VFS interface that sits inside
Lite2-derived systems, but we lost that particular battle. If
someone out there could measure the actual time-effect of that
interface vs a normal call interface, on some `realistic' benchmark
(about which one can also argue all day), that might be helpful.
>Alan Cox just devised a way for Linux/SPARC to avoid packet copying on
>our networking stack ...

This is not a micro-optimization. (Neither, for that matter, is
the `system calls via normal subroutine calls' trick, although this
is probably not the place to *start* optimizing.) In particular,
for applications that spend all their time sending bulk network
data, eliminating these copies eliminates the place they spend most
of their time -- a network send is, or should be, dominated by the
time spent copying those bytes.
Chris