Re: VFS ROADMAP (and vfs01.patch stage 1 available for testing)

:
:On Thu, Aug 12, 2004 at 06:19:40PM -0700, Matthew Dillon wrote:
:> Then I'll start working on stage 2 which will be to wrap all the
:> VOP forwarding calls (the VCALL and VOCALL macros).
:>
:> That will give us the infrastructure necessary to implement a
:> messaging interface in a later stage (probably around stage 15 :-)).
:
:Do you want to keep the message API with the structure as argument or do
:you want to switch to direct argument passing and marshalling in the
:messaging layer? In the short term, that would make the calling more
:readable, but might increase the overhead on the stack.

I think we will have to stick with the structure, just like we do with
the system call layer. This will allow us to embed a message and do
other things without having to completely rewrite every single VOP call
in the system.
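
To illustrate the structure-as-argument approach, here is a minimal
sketch. It is not the actual DragonFly code: msg_hdr, vop_getsize_args,
vop_dispatch and the "getsize" operation are all hypothetical names made
up for this example. The point is only that a message header embedded at
the start of the argument structure lets one wrapper dispatch the call
directly today, and could queue the same structure as a message later
without touching the callers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical message header (assumption, not a real kernel type). */
struct msg_hdr {
    int cmd;    /* which operation is being requested */
    int error;  /* result code filled in by the target */
};

/* Hypothetical argument structure for a made-up "getsize" operation. */
struct vop_getsize_args {
    struct msg_hdr hdr;  /* embedded message, must be the first member */
    const void *vp;      /* stand-in for a vnode pointer */
    long size;           /* out: filled in by the filesystem */
};

enum { OP_GETSIZE = 1 };

typedef int (*vop_func_t)(struct msg_hdr *);

/* Filesystem side: recover the full argument structure from the header. */
static int demo_getsize(struct msg_hdr *hdr)
{
    struct vop_getsize_args *ap = (struct vop_getsize_args *)hdr;

    (void)ap->vp;       /* a real filesystem would consult the vnode */
    ap->size = 4096;    /* fixed value for the demo */
    return 0;
}

static vop_func_t demo_ops[] = { NULL, demo_getsize };

/*
 * The single forwarding point. It calls the operation directly here,
 * but because it only sees a header it could equally well hand the
 * structure to a message port.
 */
int vop_dispatch(struct msg_hdr *hdr)
{
    int nops = (int)(sizeof(demo_ops) / sizeof(demo_ops[0]));

    assert(hdr->cmd > 0 && hdr->cmd < nops);
    hdr->error = demo_ops[hdr->cmd](hdr);
    return hdr->error;
}

/* Readable wrapper, so callers never build the structure by hand. */
long vop_getsize(const void *vp, int *errorp)
{
    struct vop_getsize_args ap;

    ap.hdr.cmd = OP_GETSIZE;
    ap.hdr.error = 0;
    ap.vp = vp;
    ap.size = -1;
    *errorp = vop_dispatch(&ap.hdr);
    return ap.size;
}
```

A caller would just write vop_getsize(vp, &error), so readability at the
call site need not suffer much from keeping the structure underneath.
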
:> The really nasty stuff starts at stage 3. Before I can implement the
:> messaging interface I have to:
:>
:> * Lock namespaces via the namecache rather than via directory
:> vnode locking (ultimately means that directories do not have
:> to be exclusively locked during create/delete/rename). Otherwise
:> even the simplest, fully cached namespace operations will wind up
:> needing a message.
:
:How does this play with remote and/or dynamically created filesystems?
:Does the filesystem have to keep track of the namespace entries and
:invalidate them? Moving away from an exclusive vnode lock for modifying
:operations does fit in with internal range locking, because those could
:be implemented very well e.g. in a tree-based FS.

It shouldn't create an issue if there is sufficient information in
the remote filesystem VFS to use the bottom-up cache invalidation
infrastructure (described down below), but even so I expect there
may be collisions, especially with NFS. However, there are *already*
collisions with NFS, even with the current infrastructure, because
NFS is stateless. I think all we can do there is maintain the existing
recovery mechanisms in the form of a retry or late error.

The main thing the namespace locking will do is give the VFS layer
an assurance that no operations currently initiated by the kernel,
regardless of lock state, will collide with each other. What the
VFS layer does with that assurance is up to it. For example, a
filesystem won't have to exclusively lock a directory vnode just to
prevent a file name from being reused out from under some operation.
For UFS this means that the directory vnode lock can eventually be
changed to just a buffer cache (struct buf) lock.
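
A toy sketch of per-name locking, to make the idea concrete. This is an
assumption-laden illustration, not the real namecache: ncentry,
nc_trylock and nc_unlock are hypothetical names, the table is a fixed
array, and there is no real concurrency. It only shows that locking one
name in a directory leaves the other names in that directory free.

```c
#include <assert.h>
#include <string.h>

#define NC_MAX 8

/* Hypothetical name cache entry: one per (directory, name) pair. */
struct ncentry {
    int  dir_id;     /* stand-in for the parent directory */
    char name[32];
    int  locked;     /* 1 while some operation owns this name */
    int  in_use;
};

static struct ncentry nc_table[NC_MAX];

/* Find the entry for (dir_id, name), creating it if there is room. */
static struct ncentry *nc_lookup(int dir_id, const char *name)
{
    struct ncentry *free_slot = NULL;

    for (int i = 0; i < NC_MAX; i++) {
        struct ncentry *nc = &nc_table[i];

        if (nc->in_use && nc->dir_id == dir_id &&
            strcmp(nc->name, name) == 0)
            return nc;
        if (!nc->in_use && free_slot == NULL)
            free_slot = nc;
    }
    if (free_slot == NULL || strlen(name) >= sizeof(free_slot->name))
        return NULL;
    free_slot->in_use = 1;
    free_slot->dir_id = dir_id;
    free_slot->locked = 0;
    strcpy(free_slot->name, name);
    return free_slot;
}

/* Returns 1 if the name was locked, 0 if it is held or cannot be cached. */
int nc_trylock(int dir_id, const char *name)
{
    struct ncentry *nc = nc_lookup(dir_id, name);

    if (nc == NULL || nc->locked)
        return 0;
    nc->locked = 1;
    return 1;
}

/* Release a name previously locked with nc_trylock(). */
void nc_unlock(int dir_id, const char *name)
{
    struct ncentry *nc = nc_lookup(dir_id, name);

    assert(nc != NULL && nc->locked);
    nc->locked = 0;
}
```

With this, a create of "a" and a delete of "b" in the same directory can
both hold their name locks at once, while a second operation on "a" is
refused until the first one unlocks it.
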
:> This step alone will require major changes to the arguments passed
:> in just about every single VOP call because we will be switching
:> from passing directory vnodes to passing namecache pointers.
:>
:> * Ranged data locks will replace the vnode lock for I/O atomicity
:> guarantees (ultimately means that if program #1 is
:> blocked writing to a vnode program #2 can still read cached
:> data from that same vnode without blocking on program #1).
:> Otherwise the messaging latency will kill I/O performance.
:
:Do you plan to move the data locking into the filesystem or should it
:still be implemented in the VFS layer? Moving it down makes IMO more sense
:because it would allow us to keep a simple locking for less important
:filesystems and would allow us to better exploit the internal data structures.
:E.g. if we have a special data structure to handle the byte ranges of a
:file anyway, we could attach the locking on that level.

The atomicity guarantee for I/O operations will be a function of the
kernel, meaning that it will cover *ALL* VFSs. We will add VOPs
for record locking but they will only be needed by those remote VFSs
which have integrated cache management... which is, umm... maybe NFSv4
(which we don't have), and perhaps Coda (but maybe not). i.e. the
cupboards are pretty bare there.

I actually believe that the range locks will not cost anything. The
vast majority of cases will have only one or two I/O range locks on a
file at any given moment (only databases really need parallel access
to a file) so it will cost us virtually nothing to implement it in a
kernel layer.
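
A minimal sketch of why a handful of ranges per file is cheap to manage.
Everything here is an assumption for illustration (rl_file,
rl_tryacquire, rl_release, the fixed RL_MAX slots, try-lock semantics
instead of blocking): with only one or two active ranges, a linear scan
over a tiny array is all the bookkeeping needed.

```c
#include <assert.h>
#include <stddef.h>

#define RL_MAX 4    /* assumption: only a few ranges are active per file */

enum rl_type { RL_READ, RL_WRITE };

struct rl_range {
    long start, end;    /* half-open byte range [start, end) */
    enum rl_type type;
    int  active;
};

struct rl_file {
    struct rl_range ranges[RL_MAX];
};

/* Two ranges conflict if they overlap and at least one is a write. */
static int rl_conflicts(const struct rl_range *r, long start, long end,
                        enum rl_type type)
{
    if (!r->active)
        return 0;
    if (end <= r->start || r->end <= start)
        return 0;
    return r->type == RL_WRITE || type == RL_WRITE;
}

/*
 * Returns a slot index >= 0 on success, or -1 if the request conflicts
 * with a held range or no slot is free. A kernel version would sleep
 * instead of failing.
 */
int rl_tryacquire(struct rl_file *f, long start, long end,
                  enum rl_type type)
{
    int free_slot = -1;

    assert(start < end);
    for (int i = 0; i < RL_MAX; i++) {
        if (rl_conflicts(&f->ranges[i], start, end, type))
            return -1;
        if (!f->ranges[i].active && free_slot < 0)
            free_slot = i;
    }
    if (free_slot < 0)
        return -1;
    f->ranges[free_slot].start = start;
    f->ranges[free_slot].end = end;
    f->ranges[free_slot].type = type;
    f->ranges[free_slot].active = 1;
    return free_slot;
}

/* Release a range previously returned by rl_tryacquire(). */
void rl_release(struct rl_file *f, int slot)
{
    assert(slot >= 0 && slot < RL_MAX && f->ranges[slot].active);
    f->ranges[slot].active = 0;
}
```

In this model a writer holding [0, 4096) does not stop a reader of
[8192, 12288) on the same file, which is the program #1 / program #2
case from the roadmap.
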
:>
:> * vattr information will be cached in the vnode so it can be
:> accessed directly without having to enter the VFS layer.
:>
:> * VM objects will become mandatory for all filesystems and will
:> also be made directly accessible to the kernel without having
:> to enter the VFS layer (ultimately this will result in greatly
:> improved read() and write() performance).
:
:How does this affect filesystems that are not physically backed? If I want to
:support compressed files in ext2, when do I have to decompress the actual
:data?

The data is always decompressed in the VM object, no matter what. Same
with the buffer cache (which is VM object backed). But remember
that data doesn't just appear in a VM object... something has to load
the data into the VM object and if you are reading from a file that
something is the VFS. So compressed filesystems would still work as
expected.
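
A small sketch of that division of labor, under heavy simplification.
The names (vm_object_demo, obj_read, rle_load), the single 16-byte
"page", and the run-length encoding standing in for real compression
are all assumptions for the example. The generic read path only ever
sees decompressed data in the object; the filesystem's load hook is the
one place that decompresses, and it runs only on a cache miss.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE_DEMO 16

struct vm_object_demo;
typedef int (*load_fn)(struct vm_object_demo *obj, unsigned char *page);

/* Hypothetical one-page VM object holding decompressed file data. */
struct vm_object_demo {
    unsigned char page[PAGE_SIZE_DEMO];
    int valid;                  /* 1 once the page has been loaded */
    int loads;                  /* how many times the VFS was entered */
    load_fn load;               /* supplied by the filesystem */
    const unsigned char *disk;  /* compressed data: (count, byte) pairs */
    size_t disk_len;
};

/* Filesystem side: run-length decode the on-disk data into the page. */
int rle_load(struct vm_object_demo *obj, unsigned char *page)
{
    size_t out = 0;

    for (size_t i = 0; i + 1 < obj->disk_len; i += 2) {
        unsigned char count = obj->disk[i];

        if (out + count > PAGE_SIZE_DEMO)
            return -1;
        memset(page + out, obj->disk[i + 1], count);
        out += count;
    }
    memset(page + out, 0, PAGE_SIZE_DEMO - out);    /* zero-fill the rest */
    return 0;
}

/* Generic side: serve reads from the object, entering the VFS on a miss. */
int obj_read(struct vm_object_demo *obj, size_t off, unsigned char *buf,
             size_t len)
{
    if (off > PAGE_SIZE_DEMO || len > PAGE_SIZE_DEMO - off)
        return -1;
    if (!obj->valid) {
        if (obj->load(obj, obj->page) != 0)
            return -1;
        obj->valid = 1;
        obj->loads++;
    }
    memcpy(buf, obj->page + off, len);
    return 0;
}
```

The first obj_read() triggers one decompression; later reads of the same
page are served straight from the object without calling the filesystem.
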
:> * Implement a bottom-up cache invalidation abstraction for
:> namespace, attribute, and file data, so layered filesystems
:> work properly.
:
:The invalidation is one problem, the locking another. The separation of
:vnode and namespace locks should solve most issues though.
:
:Let's discuss the rest later :)
:
:Joerg

Yes, I think so too. In many respects the namespace locking is the
single most difficult part of the work... but it is something we
absolutely have to have (along with bottom-up cache invalidation and
management) if we ever want to have an efficient filesystem caching
interface in a cluster.
-Matt
Matthew Dillon
<dillon@xxxxxxxxxxxxx>