New brainfart for threaded VFS and data passing between threads.

The recent PIPE work adapted from Alan Cox's work in FreeBSD-5 has really
lit a fire under my seat. It's amazing how such a simple concept can
change the world as we know it :-)
Originally the writer side of the PIPE code was mapping the supplied
user data into KVM and then signalling the reader side. The reader
side would then copy the data out of KVM.
The concept Alan codified is quite different: Instead of having the
originator map the data into KVM, simply supply an array of vm_page_t's
to the target and let the target map the data into KVM. In the case
of the PIPE code, Alan used the SF_BUF API (which was originally developed
by David Greenman for the sendfile() implementation) on the target side
to handle the KVA mappings.
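To make that concrete, here is a rough userland sketch of the handoff. It is only an illustration, not the actual pipe code: struct fake_page, struct page_handoff, fake_map_page() and the other names are made up for this example, and the "mapping" is simulated by simply returning a pointer. The point is just to show who does the mapping: the originator records page references and the target maps them when it needs the bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define FAKE_PAGE_SIZE    4096
#define HANDOFF_MAX_PAGES 16

/* Stand-in for a vm_page_t: in this simulation a "page" just owns its bytes. */
struct fake_page {
    unsigned char data[FAKE_PAGE_SIZE];
};

/* What the originator hands over: page references only, no kernel mapping. */
struct page_handoff {
    struct fake_page *pages[HANDOFF_MAX_PAGES];
    int npages;
};

/* Counts the mappings the target side had to set up (simulation only). */
static int target_mappings;

/* Hypothetical stand-in for an sf_buf style "map this page" call. */
static unsigned char *fake_map_page(struct fake_page *pg)
{
    target_mappings++;
    return pg->data;
}

/* Originator: record which pages hold the data. Nothing is mapped here. */
static void handoff_init(struct page_handoff *h, struct fake_page *pages,
                         int npages)
{
    assert(npages >= 0 && npages <= HANDOFF_MAX_PAGES);
    for (int i = 0; i < npages; i++)
        h->pages[i] = &pages[i];
    h->npages = npages;
}

/* Target: map each page only when it is actually needed, then copy out. */
static size_t handoff_copy_out(const struct page_handoff *h,
                               unsigned char *dst, size_t dstlen)
{
    size_t done = 0;

    for (int i = 0; i < h->npages && done < dstlen; i++) {
        size_t n = dstlen - done;
        if (n > FAKE_PAGE_SIZE)
            n = FAKE_PAGE_SIZE;
        memcpy(dst + done, fake_map_page(h->pages[i]), n);
        done += n;
    }
    return done;
}
```

In this sketch handoff_init() never touches fake_map_page(), so all of the mapping cost lands on whichever side calls handoff_copy_out().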
Seems simple, eh? But Alan got an unexpectedly huge boost in performance
on IA32 when he did this. The performance boost turned out to be due
to several factors:
* Avoiding the KVM mappings and the related kernel_object manipulations
required for those mappings saves a lot of cpu cycles when all you
want is a quick mapping into KVM.
* On SMP, KVM mappings generated IPIs to all cpus in order to
invalidate the TLB. By avoiding KVM mappings all of those IPIs
go away.
* When the target maps the page, it can often get away with doing
a simple localized cpu_invlpg(). Most targets will NEVER HAVE TO
SEND IPIs TO OTHER CPUS. The current SF_BUF implementation still
does send IPIs in the uncached case, but I had an idea to fix that
and Alan agrees that it is sound: store a cpumask in the sf_buf so
that a user of the sf_buf only has to invalidate the cached KVM
mapping locally, and only if the mapping has not yet been accessed
on that particular cpu.
* For PIPEs, the fact that SF_BUF's cached their KVM mappings
reduced the mapping overhead almost to zero.
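The cpumask idea from the list above could look roughly like the following. This is a simulation under stated assumptions, not the real sf_buf code: struct fake_sf_buf and both functions are hypothetical names, the mask is limited to 32 cpus, the local invalidation is just a counter, and a real kernel would need atomic operations on the mask.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified sf_buf: one reusable KVA slot plus a cpu mask. */
struct fake_sf_buf {
    const void *page;    /* page currently mapped in this slot */
    uint32_t    cpumask; /* bit N set: cpu N has already synced its TLB */
};

/* Simulated count of local (non-IPI) TLB invalidations. */
static int local_invalidations;

/*
 * Point the slot at a different page. Every cpu's cached translation for
 * this slot is now stale, so the mask is cleared instead of sending IPIs.
 */
static void fake_sf_buf_remap(struct fake_sf_buf *sf, const void *page)
{
    sf->page = page;
    sf->cpumask = 0;
}

/*
 * Called on a cpu before it touches the mapping. Only the first use on a
 * given cpu after a remap pays for a local invalidation.
 */
static void fake_sf_buf_use(struct fake_sf_buf *sf, int cpu)
{
    assert(cpu >= 0 && cpu < 32);
    uint32_t bit = (uint32_t)1 << cpu;

    if ((sf->cpumask & bit) == 0) {
        local_invalidations++;  /* stands in for a local page invalidation */
        sf->cpumask |= bit;
    }
}
```

Under these assumptions a cpu that never touches the buffer never pays anything, and a cpu that reuses a cached mapping pays once per remap.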
Now when I heard about this huge performance increase I of course
immediately decided that DragonFly needed this feature too, and so we
now have it for DFly pipes.
*Light bulb goes off in head*
But it also got me to thinking about a number of other sticky issues
that we face, especially in our desire to thread major subsystems (such
as Jeff's threading of the network stack and my desire to thread VFS),
and also issues related to how to efficiently pass data between threads,
and how to efficiently pass data down through the I/O subsystem.
Until now, I think everyone here and in FreeBSD land was stuck on the
concept of the originator mapping the data into KVM instead of the
target for most things. But Alan's work has changed all that.
This idea of using SF_BUF's and making the target responsible for mapping
the data has changed everything. Consider what this can be used for:
* For threaded VFS we can change the UIO API to a new API (I'll call it
XIO) which passes an array of vm_page_t's instead of a user process
pointer and userspace buffer pointer.
So 'XIO' would basically be our implementation of target-side mappings
with SF_BUF capabilities.
* We can do away with KVM mappings in the buffer cache for the most
prevalent buffers we cache... those representing file data blocks.
We still need them for meta-data, and a few other circumstances, but
the KVM load on the system from buffer cache would drop by like 90%.
* We can use the new XIO interface for all block data references from
userland and get rid of the whole UIO_USERSPACE / UIO_SYSSPACE mess.
(I'm gunning to get rid of UIO entirely, in fact).
* We can use the new XIO interface for the entire I/O path all the way
down to busdma, yet still retain the option to map the data if/when
we need to. I never liked the BIO code in FreeBSD-5; this new XIO
concept is far superior and will solve the problem neatly in DragonFly.
* We can eventually use XIO and SF_BUF's to codify copy-on-write at
the vm_page_t level and no longer stall memory modifications to I/O
buffers during I/O writes.
* I will be able to use XIO for our message passing IPC (our CAPS code),
making it much, much faster than it currently is. I may do that as
a second step to prove-out the first step (which is for me to create
the XIO API).
* Once we have vm_page_t copy-on-write we can recode zero-copy TCP
to use XIO, and it won't be a hack any more.
* XIO fits perfectly into the eventual pie-in-the-sky goal of
implementing SSI/Clustering, because it means we can pass data
references (vm_page_t equivalents) between machines instead of
passing the data itself, and only actually copy the data across
on the final target. e.g. if on an SSI system you were to do
'cp file1 file2', and both file1 and file2 are on the same filesystem,
the actual *data* transfer might only occur on the machine housing
the physical filesystem and not on the machine doing the 'cp'. Not
one byte. Can you imagine how fast that would be?
And many other things. XIO is the nutcracker, and the nut is virtually
all the remaining big-ticket items we need to cover in DragonFly.
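Since the XIO API does not exist yet, the following is only a guess at its general shape, written as userland C. Everything here is hypothetical (struct xio_sketch, its fields, and both functions): the idea is a descriptor that carries an array of pages plus a byte offset and length, where a UIO would carry a process pointer and a userspace buffer pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define XIO_PAGE_SIZE 4096
#define XIO_MAX_PAGES 8

/* Stand-in for a vm_page_t in this userland simulation. */
struct xio_page {
    unsigned char data[XIO_PAGE_SIZE];
};

/* Hypothetical descriptor: page references plus the byte range they hold. */
struct xio_sketch {
    struct xio_page *pages[XIO_MAX_PAGES];
    int    npages;
    size_t offset;  /* byte offset of the data within pages[0] */
    size_t bytes;   /* total number of bytes described */
};

/*
 * Describe `bytes` bytes starting `offset` bytes into pages[0]. The caller
 * must supply enough consecutive pages to cover the range. Returns 0 on
 * success, -1 if the range does not fit the descriptor.
 */
static int xio_sketch_init(struct xio_sketch *xio, struct xio_page *pages,
                           size_t offset, size_t bytes)
{
    if (offset >= XIO_PAGE_SIZE)
        return -1;
    size_t npages = (offset + bytes + XIO_PAGE_SIZE - 1) / XIO_PAGE_SIZE;
    if (npages > XIO_MAX_PAGES)
        return -1;
    for (size_t i = 0; i < npages; i++)
        xio->pages[i] = &pages[i];
    xio->npages = (int)npages;
    xio->offset = offset;
    xio->bytes = bytes;
    return 0;
}

/* Target side: copy up to n bytes, starting uoffset bytes into the data. */
static size_t xio_sketch_copy_out(const struct xio_sketch *xio, size_t uoffset,
                                  void *dst, size_t n)
{
    unsigned char *out = dst;
    size_t done = 0;

    if (uoffset >= xio->bytes)
        return 0;
    if (n > xio->bytes - uoffset)
        n = xio->bytes - uoffset;

    while (done < n) {
        size_t pos = xio->offset + uoffset + done; /* position in page run */
        size_t idx = pos / XIO_PAGE_SIZE;
        size_t pgoff = pos % XIO_PAGE_SIZE;
        size_t chunk = XIO_PAGE_SIZE - pgoff;

        if (chunk > n - done)
            chunk = n - done;
        assert(idx < (size_t)xio->npages);
        /* A real target would map pages[idx] here before touching it. */
        memcpy(out + done, xio->pages[idx]->data + pgoff, chunk);
        done += chunk;
    }
    return done;
}
```

Note that nothing in the descriptor says whose address space the data came from, which is the property that would let the same structure travel through VFS, the I/O path, or an IPC message.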
This is very exciting to me.
-Matt