This commit dumps the same frequency for each core; there's no
point supporting different frequencies unless we monitor frequency
change events. We use the frequency to print out relative timestamps
in microseconds. This relies on ktrdump having done correct sorting.
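For reference, the conversion is just ticks * 1,000,000 / frequency. A
minimal sketch, with illustrative names (not the actual ktrdump code):

    static uint64_t
    ticks_to_usec(uint64_t rel_ticks, uint64_t freq_hz)
    {
            /*
             * Multiply before dividing to preserve precision; the
             * tick counts are relative to the first entry, so the
             * product stays well within 64 bits.
             */
            return (rel_ticks * 1000000ULL / freq_hz);
    }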

* TMPFS_ASSERT_ELOCKED() is called in numerous places where
tn_vnode is not necessarily assigned, for example during
tmpfs_nremove() after the directory entry has been cleaned out.

Remove the assertion that tn_vnode != NULL.

* Add tmpfs_mount->tm_flags and TMPFS_FLAG_UNMOUNTING, used during
unmounting to tell tmpfs_fsync() to throw away the contents of
the file (normally it ignores it). This fixes a panic on umount
when the main kernel checks that all dirty buffers have been
cleaned out.
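A rough sketch of the check in tmpfs_fsync(); vinvalbuf() without V_SAVE
is one plausible way to discard the buffers, not necessarily the exact
call used:

    struct tmpfs_mount *tmp = VFS_TO_TMPFS(vp->v_mount);

    if (tmp->tm_flags & TMPFS_FLAG_UNMOUNTING) {
            /* unmounting: discard dirty buffers instead of ignoring them */
            vinvalbuf(vp, 0, 0, 0);
    }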

* Fix two places where the wrong length for a string is
being kmalloc()'d. The softlink and the directory entry
string allocations were wrong and resulted in a string
terminator being stuffed beyond the end of the malloced
buffer.
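The bug class, illustrated with a user-space malloc for clarity
(dup_name() is a hypothetical helper, not the tmpfs code):

    #include <stdlib.h>
    #include <string.h>

    char *
    dup_name(const char *name)
    {
            size_t len = strlen(name);
            char *copy = malloc(len + 1);   /* the +1 was missing */

            if (copy != NULL) {
                    memcpy(copy, name, len);
                    copy[len] = '\0';       /* would land past the end
                                               of a len-byte buffer */
            }
            return (copy);
    }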

* Do a safety-NULL-out of a few fields after kfree()ing them.

* Refactor tmpfs_dir_lookup() a little.

* Enhance tmpfs_reg_resize() to also resize the SWAP VM object.
Failing to do so can leave extraneous swap assignments for
deleted areas of the file which become visible again (instead
of becoming zero-fill) if the file is then later ftruncate()d
larger.

Also fix the block size parameters passed to nvtruncbuf() and
nvextendbuf(); they must match the block size used for the buffer cache.

* Temporarily turn off all the MPSAFE flags. Run under the BGL.

* The buffer offset (offset) in tmpfs_read() and tmpfs_write()
can be a size_t. It does not have to be off_t.

* tmpfs_write() was using getblk(). It actually has to use bread()
in order to ensure that the buffer contents are valid when potentially
doing a piecemeal write which does not cover the whole buffer.
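Roughly, the corrected pattern (a sketch using the classic buffer cache
API; BSIZE stands in for whatever block size tmpfs uses):

    /* bread() validates the contents; getblk() may return stale data */
    error = bread(vp, base_offset, BSIZE, &bp);
    if (error) {
            brelse(bp);
            return (error);
    }
    /* partial write: bytes outside [offset, offset+len) remain valid */
    error = uiomove((char *)bp->b_data + offset, len, uio);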

* Refactor tmpfs_write() to leave the underlying VM pages dirty,
except in cases where the system page daemon wants to flush pages to
free space in RAM (IO_SYNC, IO_ASYNC). Use buwrite() to do this.
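The resulting write-out dispatch looks roughly like this (a sketch of
the general idea, not the exact code):

    if (ioflag & IO_SYNC)
            bwrite(bp);             /* caller wants it on disk now */
    else if (ioflag & IO_ASYNC)
            bawrite(bp);            /* queue the write immediately */
    else
            buwrite(bp);            /* just dirty the underlying VM pages */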

* Fix an error path in tmpfs_strategy() which was not biodone()ing
the bio.

* tmpfs_remove() was making assumptions with regard to v->a_nch.ncp->nc_vp
which were not correct. The vp is not referenced and can get ripped
out from under the caller unless properly handled.

* Fix sequencing in tmpfs_inactive(). If tn_links is 0 and the node
is not in the middle of being allocated, we can destroy it.

* Remove unnecessary vnode locks from tmpfs_reclaim(). There are also other
vnode locks scattered around that aren't needed (for another time).

* Add buf->flags/B_NOTMETA, vm_page->flags/PG_NOTMETA. If set, the pages
underlying the buffer will not be considered meta-data from the
point of view of the swapcache.

* HAMMER must sometimes access bulk data via the block device instead of
via a file vnode. For example, the reblocking and mirroring code.
We do not want this data to be misinterpreted as meta-data when
the meta-data-only swapcache is turned on, otherwise it will blow
out the actual meta-data in the swapcache.

HAMMER_RECTYPE_DATA and HAMMER_RECTYPE_DB are considered normal data.
All other record types (e.g. direntry, inode, etc) are meta-data.
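In sketch form, at the point where HAMMER issues the device I/O (the
test follows directly from the record types above; the placement is
illustrative):

    /* bulk data is not meta-data from swapcache's point of view */
    if (rec_type == HAMMER_RECTYPE_DATA || rec_type == HAMMER_RECTYPE_DB)
            bp->b_flags |= B_NOTMETA;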

* Refactor the histogram code. This code is responsible for breaking
down a large initial mirroring stream into smaller chunks so the
transaction id can be synced more often. This way if the stream
is interrupted it can be restarted at a more recent point instead
of having to restart further back (or at the beginning).

Ok, here's what is going on. If an SMI interrupt occurs while
an AP is going through the INIT/STARTUP IPI sequence, the AP will
brick, and nothing you do will resurrect it.

BIOSes typically set up SMI interrupts when emulating (for example)
a PS/2 keyboard with a USB keyboard, or even if just implementing
BIOS support for a USB keyboard. Even worse, the BIOS may set up
the interrupt to poll at 1000Hz. And, EVEN WORSE, it can totally
depend on which USB port you've plugged your keyboard in. And, on top
of all of that, the SMI interrupt is not consistent.

The INIT/STARTUP code contains a 10ms delay (as per Intel spec) between
the INIT IPI and the STARTUP IPI. Well, you can do the math.
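(To spell it out: at 1000Hz the gap between SMIs is only ~1ms, so a
10ms INIT-to-STARTUP window can never fit between two SMIs and will be
hit roughly ten times over.)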

In order to reliably boot an SMP system where the BIOS has set up
SMI interrupts, this patch uses a nifty bit of code to detect when
the SMI interrupt has occurred and tries to shift the INIT/STARTUP
sequence into a gap between SMI interrupts. If it has to, it will
reduce the 10ms spec delay all the way down to 150us. In many
cases we really have no choice for reliable operation. Even a 300us
delay is too much in the tests I performed on a Shuttle Phenom and
Phenom II cube. I don't honestly know if this will break other SMP
configurations; we'll have to see.

On the particular shuttle I tested on, one of the four USB connections
on the backpanel (the upper left when looking at it from the back)
seemed to cause the BIOS to set up SMI interrupts at a high rate and
caused kernel boots to fail. With this commit those boots now succeed.

* Detect the case where B-Tree leafs are being laid down sequentially,
such as when creating a large file. When linear operation is detected,
split leafs 75:25 instead of 50:50. This greatly improves fill ratios.

It should be noted that the HAMMER flush sorts by inode so directory
entries will also tend to benefit.
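The heuristic boils down to something like this (illustrative names,
fixed per-leaf element count assumed):

    if (sequential_insertions_detected)
            split_index = (count * 3) / 4;  /* 75:25, keep left side full */
    else
            split_index = count / 2;        /* 50:50 for random insertions */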

* This only affects (improves) the initial B-Tree layout. The overnight
hammer cleanup will refactor the B-Tree to a more optimal state
regardless.

* Fix endurance statements for SLC. SLC has approximately 10x the
endurance. Documentation on the web is confused on this matter, with
10x and 100x both being thrown around. We will just assume 10x
for now.

* The cluster_read() code was tripping over itself due to a findblk()
call which caused it to believe it had found a buffer hole when it
really found a busy buffer.

Redo the code to use the FINDBLK_TEST flag to locate the next buffer
hole. Also add a shortcut to support efficient coding for larger
read-ahead values.
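The probe is now along these lines (a sketch; findblk() with
FINDBLK_TEST checks for the buffer's existence without locking it):

    /* a NULL return now means a real hole, not just a busy buffer */
    bp = findblk(vp, loffset, FINDBLK_TEST);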

* Change the single-read-ahead in cluster_read() to a multiple-read-ahead
based on the maxra parameter. Before, we did just a single read-ahead,
and even though this was a cluster read it still created a situation
where the next cluster_read() operation would stall on the previous
read-ahead before issuing the next one. In other words, it wasn't
pipelining requests as well as it could.

This change tries to keep at least two read-aheads in progress so when
the next cluster_read() stalls on the first one the second one is still
in the pipeline after it unstalls, allowing it to issue the third one
to keep the pipeline hot.
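Conceptually (hypothetical names, not the actual cluster_read()
internals):

    /* keep at least two async read-aheads in flight */
    while (nr_inflight < 2 && ra_loffset < ra_limit) {
            issue_cluster_readahead(vp, ra_loffset, chunk);  /* async */
            ra_loffset += chunk;
            nr_inflight++;
    }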

* These changes improve SSD swapcache operation as well as normal HD
cluster_read() pipelining. In addition the read-ahead is now sufficient
to keep the pipeline hot across a 2 x Swap (interleaved) setup.

- Import libevtr, a library for abstracting access to an event stream.
libevtr uses its own dump format and can synthesize event attributes
based on known event types.
- Modify ktrdump(8) to be able to dump an event stream to a file
using libevtr.
- Add evtranalyze(1), a proof of concept utility to display events in
a line-oriented text format or to generate an svg file displaying
the events on each processor. This needs quite some work.

* Implement write clustering. Swapcache attempts to cluster writes
for optimal matching between swap and the buffer cache. This
also reduces the IOPS for writes by a factor of 16. The SSD should
be able to do write combining and erasing more optimally as well.

* Add vm.swapcache.minburst

This ensures that once curburst hits 0 it is allowed to recover
sufficiently for a good write burst to be done. Otherwise
swapcache winds up doing tiny bursts which tend to fragment the cache.

* Add vm.swapcache.maxfilesize

If set to non-zero prevents swapcache from caching files larger than
the specified size. That is, swapcache will only cache smaller files.
This is experimental because there are issues caching small files
anyway (the vnodes get recycled too quickly).

* Allow vm.swapcache.curburst to be manually set larger than
vm.swapcache.maxburst, so the initial load-in can be different
from the maximum reburst.

* Adjust the code which deals with write errors on swap to ensure
that the backing store is destroyed (because it isn't a clean copy).

- Stop special ROT13 treatment of fortunes-o. Neither murphy-o,
fortunes2-o nor limerick received the same treatment, and they contain
even more possibly offensive material.
- Merge the spelling files for fortunes{,-o}; this improves
maintainability in case fortunes are moved between the files.
- Make the installation of offensive material depend on
INSTALL_OFFENSIVE_FORTUNES, like NetBSD does (defaults to yes).
Previously you had to edit the Makefile to disable it.
- Drop CVS Ids, which are no longer maintained :(

kernel - SWAP CACHE part 13/many - More vm_pindex_t work for vm_objects on i386

* vm_object->size also needs to be a vm_pindex_t, e.g. when mmap()ing regular
HAMMER files or block devices or HAMMER's own use of block devices,
in order to support vm_object operations past the 16TB mark.

* Introduce a 64-bit-friendly trunc_page64() and round_page64(), just to
make sure we don't cut off page alignment operations on 64-bit offsets.
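The definitions are along these lines (the point is the 64-bit mask, so
i386's 32-bit PAGE_MASK cannot truncate the upper half of the offset):

    #define trunc_page64(x) ((x) & ~(int64_t)PAGE_MASK)
    #define round_page64(x) (((x) + PAGE_MASK) & ~(int64_t)PAGE_MASK)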

* Our kmem_init() was mapping out the ~6G of KVA below KERNBASE. KERNBASE
is at the -2G mark and unlike i386 it does not mark the beginning of KVA.

Add two more globals, virtual2_start and virtual2_end, and adjust
kmem_init() to use that space. This fixes kernel_map exhaustion issues
on x86_64. Before the change only ~600M of KVA was available after a
fresh boot.

* Populate the PDPs around both KERNBASE and at virtual2_start for
bootstrapping purposes.

* Adjust kernel_vm_end to start iteration for growkernel purposes at
VM_MIN_KERNEL_ADDRESS and no longer use it to figure out the end
of KVM for the minidump.

* Increase the maximum buffer cache from 200M to 400M. Note that
the buffer cache is backed by the VM page cache which is unlimited.

* Use size_t for kmalloc() tracking

* Allow 0 to be specified for kmalloc_raise_limit() which makes a
kmalloc pool unlimited.
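Usage is simply (M_EXAMPLE stands in for whatever pool is being
unlimited):

    kmalloc_raise_limit(M_EXAMPLE, 0);      /* 0 = no pool limit */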

* Adjust the kern.maxvnodes autocalculation for both i386 and x86_64.
i386 boxes with maximum memory will get a slightly lower vnode
limit while x86_64 boxes will get a dramatically higher vnode limit.

* Remove kmalloc pool limits for vnodes, for HAMMER inodes, and
for UFS inodes. These pools track maxvnodes and do not require
limits.

This fixes occasional kmalloc assertions and allows the sysop to
raise kern.maxvnodes on a running system.

* vx_lock_nonblock() is used by allocfreevnode() to interlock the
vnode being freed. However, this function will incorrectly succeed
on a vnode recursively held by a caller of allocfreevnode() which
is in the middle of being reclaimed, if the vnode in question
allows LK_CANRECURSE locks in the lockinit. UFS vnodes use this
mechanism.

* Add a small state machine and hysteresis to flip between swapcache
writing and swapcache cleaning. The swapcache is written to until
(unless) it hits 75% use. If this occurs it switches to cleaning
mode to get rid of swapcache pages until it gets down to 70%. While
in cleaning mode burst accumulation still occurs. Then it flips back.
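The flip is a two-state machine with a 5% hysteresis band, roughly
(illustrative names):

    if (state == SWAPC_WRITING && use_pct >= 75)
            state = SWAPC_CLEANING;         /* too full, start cleaning */
    else if (state == SWAPC_CLEANING && use_pct <= 70)
            state = SWAPC_WRITING;          /* enough freed, resume writing */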

Currently the cleaning mode tries to choose swap meta-blocks which
are wholly swapped (have no VM pages), running linearly through
the VM object list in order to try to clean contiguous areas of
the swapcache. The idea is to reduce fragmentation that would lead
to excessive disk seeking. At the same time the limited cleaning
run (only 5% of the swap cache) should prevent any large-scale
excessive deletion of the swapcache.

* Add a new VM object type, OBJT_MARKER, which may be used by iterators
running through the vm_object_list.
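The canonical marker pattern looks roughly like this (a sketch assuming
a TAILQ-based vm_object_list with an object_list linkage field; the
marker holds the scan position so the list lock can be dropped while a
real object is processed):

    struct vm_object marker;
    struct vm_object *object;

    bzero(&marker, sizeof(marker));
    marker.type = OBJT_MARKER;

    TAILQ_INSERT_HEAD(&vm_object_list, &marker, object_list);
    while ((object = TAILQ_NEXT(&marker, object_list)) != NULL) {
            /* advance the marker past the object before processing it */
            TAILQ_REMOVE(&vm_object_list, &marker, object_list);
            TAILQ_INSERT_AFTER(&vm_object_list, object, &marker,
                               object_list);
            if (object->type == OBJT_MARKER)
                    continue;               /* skip other iterators' markers */
            /* ... process object; the lock may be dropped here ... */
    }
    TAILQ_REMOVE(&vm_object_list, &marker, object_list);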

* Improve write staging by not counting VM pages which already have a
swap assignment when doing the limited scan of the INACTIVE VM page
queue.

As the swapcache starts to perform, more and more disk I/O goes to it,
radically increasing the data rate and also radically increasing the
rate at which pages are shuffled between VM page queues. At some
point enough data is coming from the swapcache that vm.swapcache.maxlaunder
is unable to keep up even when sufficient burst bandwidth is available.

This led to an asymptotic caching curve. After the fix the caching
curve is linear (for data sets which fit in the swapcache).

* The swapcache associated with meta-data (VCHR vnodes) was not being
destroyed on umount. Adjust a conditional such that it is properly
destroyed. Otherwise stale data might be retained across e.g. a
media change.

* The code which limits how much swap space the swap cache uses was
broken. It was using the current amount of free swap space instead
of the total space, causing it to only use 40% of available swap
instead of 66%.

* Add vn_cache_strategy() and adjust vn_strategy() to call it. This
implements the read intercept. If vn_cache_strategy() determines that
the entire request can be handled by the swap cache it issues an
appropriate swap_pager_strategy() call and returns 1, else it returns 0
and the normal vn_strategy() function is run.
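The intercept shape, as described above (a sketch):

    void
    vn_strategy(struct vnode *vp, struct bio *bio)
    {
            if (vn_cache_strategy(vp, bio))
                    return;         /* satisfied entirely from swap cache */
            /* ... normal vn_strategy() path ... */
    }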

On machine boot curburst defaults to maxburst and will automatically
be trimmed to maxburst if you change maxburst. This allows a high
write-rate after boot.

During normal operation writes reduce curburst and accrate increases
curburst (up to maxburst), so periods of inactivity will allow another
burst of write activity later on.

vm.swapcache.read_enable (default 0 - disabled)

Enable the swap cache read intercept. When turned on vn_strategy()
calls will read from the swap cache if possible. When turned off
vn_strategy() calls read from the underlying vnode whether data
is available in the swap cache or not.

vm.swapcache.meta_enable (default 0 - disabled)

Enable swap caching of meta-data (the VM-backed block devices used
by filesystems). The swapcache code scans the VM page inactive
queue for suitable clean VCHR-backed VM pages and writes them to
the swap cache.

vm.swapcache.maxlaunder

Specifies the maximum number of pages in the inactive queue to
scan every 1/10 second. Set fairly low for the moment, but
the default will ultimately be increased to something like 512
or 1024.

vm.swapcache.write_count

The total amount of data written by the swap cache to swap,
in bytes, since boot.

* Call swap_pager_unswapped() in a few more places that need it.

* NFS doesn't use bread/vn_strategy so it has been modified to call
vn_cache_strategy() directly for async IO. Currently we cannot
easily do it for synchronous IO. But async IO will get most of
it.

* The swap cache will use up to 2/3 of available swap space to
cache clean vnode-backed data. Currently once this limit is
reached it will rely on vnode recycling to clean out space
and make room for more.

Vnode recycling is currently excessively limiting the amount of
data which can be cached, since when a vnode is recycled its
backing VM object is also recycled and the swap cache assignments
are freed. Meta-data has other problems... it can choke the
swap cache.

* Refactor swap_pager_freespace() to use a RB_SCAN() instead of a
vm_pindex_t iteration. This is necessary if we intend to allow
swap backing store for vnodes because the related files & VM objects
can be huge. This is also generally a good idea in 64-bit mode
to help deal with x86_64's massive address space.
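The refactored scan has roughly this shape (hypothetical callback and
info-structure names):

    info.begi = base;                       /* first page index to free */
    info.endi = base + count;               /* one past the last index */
    RB_SCAN(swblock_rb_tree, &object->swblock_root, rb_swblock_scancmp,
            swp_pager_freespace_callback, &info);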

* Change vm_pindex_t from unsigned long (32 bits) to __uint64_t (64 bits).
This change is necessary to support block devices with greater than 16TB
of storage as well as to support the mmap()ing of HAMMER files larger
than 16TB.

Primarily this was done to support block devices greater than 16TB
since HAMMER volumes are allowed to be up to 4096TB each. Filesystem
mounts use VM objects to back block devices.

* On x86_64 vm_pindex_t is already 64 bits but change the typedef from
unsigned long to __uint64_t to match i386.

* Most conversions to and from vm_pindex_t are to 64 bits anyway so this
change does not create any performance issues.