Pull up following revision(s) (requested by hannken in ticket #67):
sys/fs/puffs/puffs_sys.h: revision 1.86
sys/fs/puffs/puffs_vfsops.c: revision 1.114
sys/fs/puffs/puffs_msgif.c: revision 1.95
sys/fs/puffs/puffs_node.c: revision 1.32
sys/fs/puffs/puffs_vnops.c: revision 1.184
Change puffs from hashlist to vcache.
- field "pa_nhashbuckets" of struct "puffs_kargs" becomes a no-op.
and should be removed on the next protocol version bump.

sync with head.
for a reference, the tree before this commit was tagged
as yamt-pagecache-tag8.
this commit was splitted into small chunks to avoid
a limitation of cvs. ("Protocol error: too many arguments")

Excise struct componentname from the namecache.
This uglifies the interface, because several operations need to be
passed the namei flags and cache_lookup also needs for the time being
to be passed cnp->cn_nameiop. Nonetheless, it's a net benefit.
The glop should be able to go away eventually but requires structural
cleanup elsewhere first.
This change requires a kernel bump.

Pull up following revision(s) (requested by manu in ticket #438):
lib/libperfuse/perfuse_priv.h: revision 1.31
sys/fs/puffs/puffs_msgif.h: revision 1.80
sys/fs/puffs/puffs_vnops.c: revision 1.171
lib/libpuffs/puffs_ops.3: revision 1.31
sys/fs/puffs/puffs_vnops.c: revision 1.172
sys/fs/puffs/puffs_vnops.c: revision 1.173
sys/fs/puffs/puffs_vnops.c: revision 1.174
usr.sbin/perfused/perfused.c: revision 1.24
sys/fs/puffs/puffs_sys.h: revision 1.80
sys/fs/puffs/puffs_sys.h: revision 1.81
sys/fs/puffs/puffs_sys.h: revision 1.82
lib/libperfuse/subr.c: revision 1.19
lib/libperfuse/perfuse.c: revision 1.30
sys/fs/puffs/puffs_msgif.c: revision 1.90
sys/fs/puffs/puffs_msgif.c: revision 1.91
sys/fs/puffs/puffs_msgif.c: revision 1.92
lib/libperfuse/ops.c: revision 1.59
lib/libpuffs/puffs.3: revision 1.53
lib/libperfuse/debug.c: revision 1.12
lib/libpuffs/puffs.3: revision 1.54
sys/fs/puffs/puffs_vnops.c: revision 1.167
sys/fs/puffs/puffs_msgif.h: revision 1.79
usr.sbin/perfused/msg.c: revision 1.21
sys/fs/puffs/puffs_vfsops.c: revision 1.102
sys/fs/puffs/puffs_vfsops.c: revision 1.103
sys/fs/puffs/puffs_vfsops.c: revision 1.105
lib/libpuffs/puffs.h: revision 1.123
lib/libperfuse/perfuse_if.h: revision 1.20
lib/libperfuse/perfuse.c: revision 1.29
lib/libpuffs/dispatcher.c: revision 1.42
lib/libpuffs/dispatcher.c: revision 1.43
- Fix same vnodes associated with multiple cookies
The scheme used to retreive known nodes on lookup was flawed, as it only
used parent and name. This produced a different cookie for the same file
if it was renamed, when looking up ../ or when dealing with multiple files
associated with the same name through link(2).
We therefore abandon the use of node name and introduce hashed lists of
inodes. This causes a huge rewrite of reclaim code, which do not attempt
to keep parents allocated until all their children are reclaimed
- Fix race conditions in reclaim
There are a few situations where we issue multiple FUSE operations for
a PUFFS operation. On reclaim, we therefore have to wait for all FUSE
operation to complete, not just the current exchanges. We do this by
introducing node reference count with node_ref() and node_rele().
- Detect data loss caused by FAF
VOP_PUTPAGES causes FAF writes where the kernel does not check the
operation result. At least issue a warning on error.
- Enjoy FAF shortcut on setattr
No need to wait for the result if the kernel does not want it. There is
however an exception for setattr that touch the size, we need to wait
for completion because we have other operations queued for after the
resize.
- Fix fchmod() on write-open file
fchmod() on a node open with write privilege will send setattr with both mode
and size set. This confuses some FUSE filesystem. Therefore we send two FUSE
operations, one for mode, and one for size.
- Remove node TTL handling for netbsd-5 for simplicity sake. The code
still builds on netbsd-5 but does not have the node TTL feature anymore.
It works fine with kernel support on netbsd-6.
- Improve PUFFS_KFLAG_CACHE_FS_TTL by reclaiming older inactive nodes.
The normal kernel behavior is to retain inactive nodes in the freelist
until it runs out of vnodes. This has some merit for local filesystems,
where the cost of an allocation is about the same as the cost of a
lookup. But that situation is not true for distributed filesystems.
On the other hand, keeping inactive nodes for a long time hold memory
in the file server process, and when the kernel runs out of vnodes, it
produce reclaim avalanches that increase lattency for other operations.
We do not reclaim inactive vnodes immediatly either, as they may be
looked up again shortly. Instead we introduce a grace time and we
reclaim nodes that have been inactive beyond the grace time.
- Fix lookup/reclaim race condition.
The above improvement undercovered a race condition between lookup and
reclaim. If we reclaimed a vnode associated with a userland cookie while
a lookup returning that same cookiewas inprogress, then the kernel ends
up with a vnode associated with a cookie that has been reclaimed in
userland. Next operation on the cookie will crash (or at least confuse)
the filesystem.
We fix this by introducing a lookup count in kernel and userland. On
reclaim, the kernel sends the count, which enable userland to detect
situation where it initiated a lookup that is not completed in kernel.
In such a situation, the reclaim must be ignored, as the node is about
to be looked up again.
Fix hang unmount bug introduced by last commit.
We introduced a slow queue for delayed reclaims, while the existing
queue for unmount, flush and exist has been renamed fast queue. Both
queues had timestamp for when an operation should be done, but it was
useless for the fast queue, which is always used to run an operation
ASAP. And the timestamp test had an error that turned ASAP into "at next
tick", but nobody what there to wake the thread at next tick, hence
the hang. The fix is to remove the useless and buggy timestamp test for
fast queue.
Rename slow sopreq queue into node sopreq queue, to refet the fact that
is only intended for postponed node reclaims.
When purging the node sopreq queue, do not call puffs_msg_sendresp(), as
it makes no sense.
Fix race condition between (create|mknod|mkdir|symlino) and reclaim, just
like we did it between lookup and reclaim.
Missing bit in previous commit (prevent race between create|mknod|mkdir|symlink
and reclaim)
Bump date for previous.
New sentence, new line; remove trailing whitespace; fix typos;
punctuation nits.
Add PUFFS_KFLAG_CACHE_DOTDOT so that vnodes hold a reference on their
parent, keeping them active, and allowing to lookup .. without sending
a request to the filesystem.
Enable the featuure for perfused, as this is how FUSE works.
Missing bit in previous commit (PUFFS_KFLAG_CACHE_DOTDOT option to avoid
looking up ..)

Rename slow sopreq queue into node sopreq queue, to refet the fact that
is only intended for postponed node reclaims.
When purging the node sopreq queue, do not call puffs_msg_sendresp(), as
it makes no sense.

Fix hang unmount bug introduced by last commit.
We introduced a slow queue for delayed reclaims, while the existing
queue for unmount, flush and exist has been renamed fast queue. Both
queues had timestamp for when an operation should be done, but it was
useless for the fast queue, which is always used to run an operation
ASAP. And the timestamp test had an error that turned ASAP into "at next
tick", but nobody what there to wake the thread at next tick, hence
the hang. The fix is to remove the useless and buggy timestamp test for
fast queue.

- Improve PUFFS_KFLAG_CACHE_FS_TTL by reclaiming older inactive nodes.
The normal kernel behavior is to retain inactive nodes in the freelist
until it runs out of vnodes. This has some merit for local filesystems,
where the cost of an allocation is about the same as the cost of a
lookup. But that situation is not true for distributed filesystems.
On the other hand, keeping inactive nodes for a long time hold memory
in the file server process, and when the kernel runs out of vnodes, it
produce reclaim avalanches that increase lattency for other operations.
We do not reclaim inactive vnodes immediatly either, as they may be
looked up again shortly. Instead we introduce a grace time and we
reclaim nodes that have been inactive beyond the grace time.
- Fix lookup/reclaim race condition.
The above improvement undercovered a race condition between lookup and
reclaim. If we reclaimed a vnode associated with a userland cookie while
a lookup returning that same cookiewas inprogress, then the kernel ends
up with a vnode associated with a cookie that has been reclaimed in
userland. Next operation on the cookie will crash (or at least confuse)
the filesystem.
We fix this by introducing a lookup count in kernel and userland. On
reclaim, the kernel sends the count, which enable userland to detect
situation where it initiated a lookup that is not completed in kernel.
In such a situation, the reclaim must be ignored, as the node is about
to be looked up again.

Pull up following revision(s) (requested by manu in ticket #1679):
sys/fs/puffs/puffs_vnops.c: revision 1.157
sys/fs/puffs/puffs_vnops.c: revision 1.158
sys/fs/puffs/puffs_vnops.c: revision 1.159
sys/fs/puffs/puffs_vfsops.c: revision 1.97
sys/fs/puffs/puffs_vfsops.c: revision 1.99
sys/fs/puffs/puffs_vnops.c: revision 1.160
sys/fs/puffs/puffs_vfsops.c: revision 1.100
sys/miscfs/syncfs/sync_subr.c: revision 1.47
sys/fs/puffs/puffs_node.c: revision 1.21
sys/fs/puffs/puffs_node.c: revision 1.22
sys/fs/puffs/puffs_msgif.c: revision 1.88
sys/fs/puffs/puffs_msgif.c: revision 1.89
sys/fs/puffs/puffs_vnops.c: revision 1.156
Make sure ioflush does not sleep in PUFFS code path, waiting for a mutex,
a memory allocation, or a response from the filesystem.
This avoids deadlocks in the following situations:
1) when memory is low: ioflush waits the fileystem, the fielsystem waits
for memory
2) when the filesystem does not respond (e.g.: network outage ona
distributed filesystem)
Fix the build that was broken by struct lwp *updateproc reference in
RUMP-visible code. Instead of checking that updateproc (aka ioflush,
aka syncer) will not sleep in PUFFS code, I check for any kernel thread:
after all none of them are designed to hang awaiting for a remote filesystem
operation to complete.
Roll back the change that forced kernel threads to not sleep in PUFFS.
The change does not make consensus, since only pagedaemon should need it.
Other threads will tolerate sleeping, and problems here are only symptoms
that something is going wrong in memory management. The cause, not the
symptoms, need to be fixed.
Make sure pagedaemon does not sleep for memory in puffs_vnop_sleep.
Add KASSERT on any sleeping memory allocation to check it cannot happen again.
Remove #ifdef DIAGNOSTIC guards around KASSERT, as the macro contains them

Pull up following revision(s) (requested by manu in ticket #1604):
sys/fs/puffs/puffs_msgif.c: revision 1.84
Apply patch from PR kern/44093 by yamt:
Interrupt server wait only on certain signals (same set at nfs -i)
instead of all signals. According to the PR this helps with
"git clone" run on a puffs file system.

Pull up following revision(s) (requested by manu in ticket #1604):
sys/fs/puffs/puffs_msgif.c: revision 1.84 via patch
Apply patch from PR kern/44093 by yamt:
Interrupt server wait only on certain signals (same set at nfs -i)
instead of all signals. According to the PR this helps with
"git clone" run on a puffs file system.

Apply patch from PR kern/44093 by yamt:
Interrupt server wait only on certain signals (same set at nfs -i)
instead of all signals. According to the PR this helps with
"git clone" run on a puffs file system.

In case the operations thread has exited, do not queue any more
operations. This prevents kernel memory leaks (one of which happened
every time the file system was unmounted via PUFFSOP_UNMOUNT ...
and incidentally would've been trivially caught with the old
malloc(9) interface. I wonder if the message is to use a ton of
pools instead of regression-attractive kmem interface).

Pull up following revision(s) (requested by pooka in ticket #1212):
sys/fs/puffs/puffs_msgif.c: revision 1.76 via patch
sys/fs/puffs/puffs_sys.h: revision 1.73 via patch
sys/fs/puffs/puffs_vfsops.c: revision 1.84 via patch
Process flush requests from the file server in a separate thread
context. This fixes a long-standing but seldomly seen deadlock,
where the kernel was holding pages busy (due to e.g. readahead
request) while waiting for the server to respond, and the server
made a callback into the kernel asking to invalidate those pages.
... or, well, theoretically fixes, since I didn't have any reliable
way of repeating the deadlock and I think I saw it only twice.

Add a PUFFS_UNMOUNT server->kernel request, which causes the kernel
to initiate self destruct, i.e. unmount(MNT_FORCE). This, however,
is a semi-controlled self-destruct, since all caches are flushed
before the (possibly) violent unmount takes place.

Process flush requests from the file server in a separate thread
context. This fixes a long-standing but seldomly seen deadlock,
where the kernel was holding pages busy (due to e.g. readahead
request) while waiting for the server to respond, and the server
made a callback into the kernel asking to invalidate those pages.
... or, well, theoretically fixes, since I didn't have any reliable
way of repeating the deadlock and I think I saw it only twice.

Kill suspend support. It was never implemented correctly:
* it depended on the biglock (in a very cruel way)
* it was attached to userspace transactions rather than logical
fs operations
(If someone wants to revisit it some day, most of the stuff can be
reused from cvs history)

PR kern/38141 lookup/vfs_busy acquire rwlock recursively
Simplify the mount locking. Remove all the crud to deal with recursion on
the mount lock, and crud to deal with unmount as another weirdo lock.
Hopefully this will once and for all fix the deadlocks with this. With this
commit there are two locks on each mount:
- krwlock_t mnt_unmounting. This is used to prevent unmount across critical
sections like getnewvnode(). It's only ever read locked with rw_tryenter(),
and is only ever write locked in dounmount(). A write hold can't be taken
on this lock if the current LWP could hold a vnode lock.
- kmutex_t mnt_updating. This is taken by threads updating the mount, for
example when going r/o -> r/w, and is only present to serialize updates.
In order to take this lock, a read hold must first be taken on
mnt_unmounting, and the two need to be held across the operation.
One effect of this change: previously if an unmount failed, we would make a
half hearted attempt to back out of it gracefully, but that was unlikely to
work in a lot of cases. Now while an unmount that will be aborted is in
progress, new file operations within the mount will fail instead of being
delayed. That is unlikely to be a problem though, because if the admin
requests unmount of a file system then s(he) has made a decision to deny
access to the resource.

PR kern/38135 vfs_busy/vfs_trybusy confusion
The previous fix worked, but it opened a window where mounts could have
disappeared from mountlist while the caller was traversing it using
vfs_trybusy(). Fix that.

kern/38135 vfs_busy/vfs_trybusy confusion
The symptom was that sometimes file systems would occasionally not appear
in output from 'df' or 'mount' if the system was busy. Resolution:
- Make mount locks work somewhat like vm_map locks.
- vfs_trybusy() now only fails if the mount is gone, or if someone is
unmounting the file system. Simple contention on mnt_lock doesn't
cause it to fail.
- vfs_busy() will wait even if the file system is being unmounted.

PR kern/37706 (forced unmount of file systems is unsafe):
- Do reference counting for 'struct mount'. Each vnode associated with a
mount takes a reference, and in turn the mount takes a reference to the
vfsops.
- Now that mounts are reference counted, replace the overcomplicated mount
locking inherited from 4.4BSD with a recursable rwlock.

Restructure the messaging interface a bit more: make all interfacing
with the file server happen through puffs_msg_enqueue() and
puffs_msg_wait() instead of having a billion different routines.
Build the existing system upon these two. Most importantly though,
decouple insertation into the op queue from the actual wait. This
is useful for a number of reasons coming soon to a cvs repo near you.

Part 2/n of extensive changes to request transport to/from userspace:
Rip the transport code completely out of puffs and generalize it
into an independent module which will be used for multiple purposes
in the future. This module is called the Pass-to-Userspace
Transporter (known as "putter" among friends).
This is very much work-in-progress and one dependency with puffs
remains: the request framing format.
The device name is still /dev/puffs, but that will change soon.
Users of puffs need the following in their kernel configs now:
pseudo-device putter

Sync with HEAD.
Follow the merge of pmap.c on i386 and amd64 and move
pmap_init_tmp_pgtbl into arch/x86/x86/pmap.c. Modify the ACPI wakeup
code to restore CR4 before jumping back into kernel space as the large
page option might cover that.

When doing a read operation, don't copy the whole kernel buffer to
userspace, since it doesn't contain any information yet. I should
still rework this more so this is just a quickie to get the read/write
style interface more up to speed with the ioctl version.

Part 1/n of some pretty extensive changes to how the kernel module
interacts with the userspace file server:
* since the kernel-user communication is not purely request-response
anymore (hasn't been since 2006), try to rename some "request" to
"message". more similar mangling will take place in the future.
* completely rework how messages are allocated. previously most of
them were borrowed from the stack (originally *all* of them),
but now always allocate dynamically. this makes the structure
of the code much cleaner. also makes it possible to fix a
locking order violation. it enables plenty of future enhancements.
* start generalizing the transport interface to be independent of puffs
* move transport interface to read/write instead of ioctl. the
old one had legacy design problems, and besides, ioctl's suck.
implement a very generic version for now; this will be
worked on later hopefully some day reaching "highly optimized".
* implement libpuffs support behind existing library request
interfaces. this will change eventually (I hate those interfaces)

If kernel resource allocation fails after the file server has
committed something, issue an abort. The abort is done through
the regular op channel, e.g. failed mkdir leads to regular rmdir,
inactive and reclaim. No internal interface is planned currently
for the one file system out of a million which would implement it
to benefit from the one case in a billion where kernel resource
allocation actually does fail and out of that one case in a trillion
where internal vs. external would make a difference.

Add error notifications, which are used to deliver errors from the
kernel to the file server for silly things the file server did,
e.g. attempting to create a file with size VSIZENOTSET. The file
server can handle these as it chooses, but the default action is
for it to throw its hands in the air and sing "goodbye, cruel world,
it's over, walk on by".

Introduce noref setbacks, which the file server can use to signal
the kernel it has 0 references to the node in question. In other
words, this can be used to avoid inactive(), or, if the file server
does not implement inactive, prompt reclaim for removed nodes.

If the op was interrupted, decrease ops waiting for fetch from the
file server only if the op was still waiting for fetch (as opposed
to waiting for the response). Also, properly flag the possible
following inactive as an op for which we do not want to wait for
the response from the file server.

Introduce puffs "setbacks", which can be used to set certain flags
for nodes upon return from the userspace. Currently it can be used
to indicate that the file server should be notified of "inactive"
in case the file server has opted to not receive inactive every
time the reference count for a vnode drops to zero. (inactive is
a common event, almost never requires any action and must be executed
sychronously, so it is wasteful).
While doing this, cleanup the release-relock nonsense from the
vntouser*() arguments. It was never enabled and the whole LOCKEDVP()
concept was very broken to begin with.

Fix a problem introduced when I converted puffs to use newlock2:
when unmounting the file system in case of a certain timing (and
possibly some other conditions), a thread would wait on a condition
variable, while another thread broadcast the cv and immediately
proceeded to destroy it. The result was a system frozen completely
solid shorly after the process waiting for the cv woke up. So
introduce reference counting to synchronize destruction of the
resources in unmount.
I was able to repeat the problem only on my laptop in some special
cases, so I do not know how common it was. Ironically, killing
the file server process violently instead of unmount() didn't have
this problem because it never entered the unmount path from two
directions.

Fix one more bug from today's commit: don't remove the op for which
getops runs out of file server buffer space from the request queue.
Otherwise that operation silently vanishes and things go, well, quite
wrong.

Make it possible to interrupt waiters for fs operation completion
again. This is useful until locking is further developed and basically
any deadlocks can be solved by killing appropriate processes.
Thanks especially to Tommi Kyntola and Antti Louko for sitting down
with me and discussing resource ownership and locking strategies
in implementing this.

* abstract ASYNCBIOREAD and let callers freely issue a callback called
from putop. even though there's only one user currently, makes code
more readable
* move "delta" to a standard parameter in vntouser and get rid of the
specialcase vntouser_delta

Convert spinlocks & sleep/wakeup to newlock2 locking stuff. Fix a
bunch of bugs.
* park structures are now always allocated from a pool instead of a
mixed stack/malloc allocation
* get rid of the whole adjbuf concept, always just alloc the maximal
amount of memory to satisfy a request
* little regression: don't allow interrupting wait from file system
to userspace; this had problems already before, but now the problems
really started to shine through. I'll try to make this work again
some day.
* fix bmap to return a sensible value in runp

* rework the page cache interaction a bit: cache metadata in the
kernel and flush it out all at once instead of continuous updating
* add support for delivering notifications to the file server about
when a page was written to (but disabled by default for now). the
file server can use this to request flushing or invalidating the
kernel page cache

Make wait for the user file server PCATCHable. This makes it
possible to recover the system by just killing processes in case
a file server manages to recurse into itself either by fault of
file server implementation or by pilot error. The downside is that
the code is extremely hard to follow and practically screams out
for newlock2 (in addition to screaming "bug here"). The whole
PCATCH nonsense and induced megacomplexity can hopefully be avoided
in the future by tweaking other parts of the implementation.

Initial attempt at suspend/snapshot support for userspace file
servers. This is still pretty much on the level "if it breaks ...".
It should work for single-threaded servers which handle one operation
from start to finish in one go. Also, it does not yet totally
correctly synchronize metadata and data in some cases. So needless
to say, it needs improvement, but it is possible that will have to
wait for some lock revampage.

Don't allow calls to be queued while MOUNTING. We don't make any
kernel->server calls at that time and it allows a window where
operations use an incorrect root node cookie.
XXX: there's still a (very much smaller and biglock safe) race, but
that's going to be solved by some more thorough restructuring

PCATCH in tsleep while waiting for operations in getop. Otherwise
we could end up in an unkillable deadlock if GETOP was called when
an operation that had locked the root vnode was already in userspace.

Fix a race condition with unmount where the mountpoint might disappear
from under us while waiting for syncer_lock and before we got to vfs_busy.
This happens easily e.g. when the userspace server loses its will to
live in VOP_RECLAIM, which is called from vflush() in VFS_UNMOUNT. We
get two competing unmounters. When the first one finishes, it releases
syncer_lock. Now the second one tries to vfs_busy(), but is greeted
with garbage in *mp.
XXX: Technically this is a more general issue and should be fixed
elsewhere, but it's hard to trigger it with normal file systems
unless they are unmounted "simultaneously" twice and are dirty
enough for flushing to take a while. So make a note about it in
the little black book next to the poems and postpone the crusade
for now.

If the control descriptor is closed, mark userspace dead and wakeup
all waiters *before* trying to get the syncer lock necessary for
dounmount(). This prevents a deadlock if the userspace server dies
while the syncer is running.

kernel portion of puffs - the Pass-to-Userspace Framework File System.
It contains the VFS attachment and userspace message-passing interface.
This work was initially started and completed for Google SoC 2005
and tweaked to work a bit better in the past few weeks. While
being far from complete, it is functional enough to be able and
stable to host a fairly general-purpose in-memory file system in
userspace. Even so, puffs should be considered experimental and
no binary compatibility for interfaces or crash-freedom or zero
security implications should be relied upon just yet.
The GSoC project was mentored by William Studenmund and the final
review for the code was done by Christos.