This is the initial commit of B-Tree rebalancing support for HAMMER.
The rebalancer may be run using the 'hammer rebalance' utility directive.

The leaves in a HAMMER B-Tree all reside at the same depth. Insertions and
deletions only collapse the B-Tree when a leaf node becomes empty and then
only if any necessary recursion (possibly reaching the root node) succeeds.
No balancing occurs during normal operation and B-Tree nodes can wind up
with wildly different element counts which bloats the tree and makes
searches less efficient.

The rebalancer effectively does a depth-first traversal of the B-Tree,
visiting leaf nodes first and parent nodes as a trailing function on the
way back up the tree. For any given internal node the sum total of
elements contained in its children is divided by the number of children.
The effective number of children is reduced as is practical to obtain a 75%
fill level. The elements are then packed into the children and any
wholly empty children left over are deleted. The rebalancer does not
create new B-Tree nodes.
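
The fill-level arithmetic described above can be sketched roughly as
follows. This is only an illustration, not the HAMMER code: the function
name, its parameters, and the round-up policy are assumptions made for
the example.

```c
#include <assert.h>

/*
 * Hypothetical helper (not from the HAMMER sources): given the total
 * number of elements held by an internal node's children, return how
 * many children should remain so that each is filled to roughly 75%.
 */
static int
rebalance_target_children(int total_elms, int cur_children, int node_capacity)
{
	int per_child;
	int wanted;

	assert(total_elms >= 0 && cur_children > 0 && node_capacity > 0);
	assert(total_elms <= cur_children * node_capacity);

	per_child = node_capacity * 3 / 4;	/* 75% fill target */
	if (per_child < 1)
		per_child = 1;

	/* Round up so that every element has a slot (assumed policy). */
	wanted = (total_elms + per_child - 1) / per_child;

	/* The rebalancer never creates nodes, so cap at the current count. */
	if (wanted > cur_children)
		wanted = cur_children;
	return wanted;
}
```

With an assumed capacity of 64 elements per node, 100 elements spread
over 8 children would be packed into 3 children, and the remaining 5
would be left empty and deleted.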

Element packing is fairly complex, requiring tracked cursors, on-media
parent pointers, mirror TIDs, and boundary elements to be updated. The
rebalancer must hold a large number of B-Tree nodes exclusively locked
while running.

* The normal cleanup operations now reblock all B-Tree, inode, and directory
elements in the normal daily reblock mode instead of only the ones in
fragmented big-blocks. Bulk data is handled by the 30-day recopy mode.

* Add a new directive 'rebalance' (a future VFS ioctl). This directive will
tell the HAMMER VFS to rebalance the B-Tree. HAMMER B-Trees are always
balanced by depth but degenerate cases with minimal elements in a node
can easily build up. The new directive will rebalance the elements in
each B-Tree node.

* The hammer cleanup directive was not reblocking directories. Now it does.

It usually does not take very long to reblock the B-Tree nodes, inodes, or
directory elements. Reblocking these unconditionally, instead of just
reblocking fragmented allocation areas, keeps the B-Tree in a more optimal
layout, though there is still a lack of correlation between inode numbers
and directory scan order.

- Upper layer will always check if_capabilities against ifreq,
so we don't need to check if_capabilities again.
- When IFCAP_RSS changes, jme_init() should be called only if
the interface is running.
- Don't use compile time condition for the code handling IFCAP_RSS.

The HAMMER VFS supports a short "@@-1:%05d" for master PFSs. Adjust the
HAMMER VFS to return softlinks in that form and adjust the hammer cleanup
code to recognize softlinks in that form.
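
As an illustration of the short form, the format string quoted above
expands as shown in this sketch. The helper name and buffer handling are
assumptions for the example; only the "@@-1:%05d" format comes from the
text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: build the short softlink text for a master PFS.
 * Returns the value of snprintf(), i.e. the length the full string
 * needs, excluding the terminating NUL.
 */
static int
format_master_pfs_link(char *buf, size_t len, int pfs_id)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "@@-1:%05d", pfs_id);
}
```

For PFS id 3 this yields the 10-character string "@@-1:00003".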

Note that PFS softlinks are created as "@@PFS%05d", but the HAMMER VFS
presents them in an expanded form which allows the HAMMER VFS to reflect
the latest synchronized transaction id on slave PFSs. This also prevents
slave PFSs from confusing DragonFly's namecache as each snapshot will appear
to be an entirely different path.

Unfortunately this does mean that cpdup/cp/tar will pick up a translated
softlink and not the actual one. It's just something else to remember
about these 'weird' PFS mount points.

According to Intel OpenSDM's RDH description:
"... If software were to write to this register while the receive
function was enabled, the on-chip descriptor buffers can be
invalidated and other indeterminate operations might result ..."

When tracing a process, a thread can get stopped because of a signal and
its tracing. In this case the tracing parent is notified and may choose
to let the process service the signal.

However if this stop+trace is happening somewhere deep in the kernel due
to a call to CURSIG(), it might happen that the same signal again is the
cause for a stop+trace cycle because of another call to CURSIG() while
the call stack is unwinding.

Introduce CURSIG_TRACE(), which explicitly allows stopping for tracing
signal delivery. This is only called from userret().
All other instances of CURSIG() may still block/sleep because of SA_STOP
signals, but these invocations may not trace + repost signals.

As such, the only place where trace + repost of signals can happen now
is userret(). Nevertheless, CURSIG() still decides not to ignore a
currently ignored signal and rather lets the kernel unwind until this
signal arrives in the CURSIG_TRACE() called from userret().

Under some conditions (mainly related to multi-threaded processes and
tracing (gdb)), wakeups, scheduling and stops can lead to a race which
will leave the process stopped and wait()ed, but the P_WAITED flag
cleared. This happens because a thread in tstop() might have been woken
up, but not yet scheduled. If the process in turn would get stopped
again (another bug), the thread in tstop() would be counted as stopped,
but would only be waiting to be scheduled to transition into LSRUN.

emx(4): Prepare multi-RX queue support -- use different struct for RX/TX buffer

Extended RX descriptors will be needed for multi-RX queue support. However,
the hardware writes information into the RX buffer address field of extended
RX descriptors, so we need to save the "device visible" address in the RX
buffer.

Diff works on pairs of tids and not on only the first tid passed in to
the generator. As a result, there is nothing to do for the last pair of
tids: the last tid and max_tid refer to the same version. Avoid running
the generator in this case.
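
The pairing can be sketched as below. The callback type, the logging
struct, and the function names are hypothetical and only illustrate the
loop bounds; the real generator is not shown in this message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical generator callback: diff one (from, to) pair of tids. */
typedef void (*diff_fn)(uint64_t from_tid, uint64_t to_tid, void *arg);

/* Example callback state: remembers how many pairs were diffed. */
struct pair_log {
	size_t   count;
	uint64_t last_from;
	uint64_t last_to;
};

static void
log_pair(uint64_t from_tid, uint64_t to_tid, void *arg)
{
	struct pair_log *log = arg;

	log->count++;
	log->last_from = from_tid;
	log->last_to = to_tid;
}

/*
 * Run the generator on each adjacent pair of tids.  The final pair
 * (last tid, max_tid) is deliberately skipped because both refer to
 * the same version.  Returns the number of generator invocations.
 */
static size_t
diff_tid_pairs(const uint64_t *tids, size_t ntids, diff_fn fn, void *arg)
{
	size_t i;
	size_t calls = 0;

	assert(fn != NULL);
	for (i = 0; i + 1 < ntids; i++) {
		fn(tids[i], tids[i + 1], arg);
		calls++;
	}
	return calls;
}
```

With three tids the generator runs twice, and with a single tid it does
not run at all.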

- emx_buf.m_head is only set/cleared in the last emx_buf associated with
the packet, so we don't need to keep clearing it in transmit descriptor
setup loop.
- Used transmit descriptors do not need to be cleared in txeof and tx_collect.

Create a mini-API for boot2 filesystems, split out the filesystem probe &
initialization code, and adjust boot2 to probe multiple filesystems. While
the coding is fairly generic, only the larger boot2 area in a disklabel64
is big enough to hold a multi-filesystem boot2. 32 bit disklabels can still
only boot from UFS.

As part of this work the BTX loader offset for boot2 had to be adjusted.
boot1 used to load boot2 at 0xA000+0x4000 = 0xE000, but this left only 8KB
available before the segment would overflow in boot1's relocation code.

The BOOT2_VORIGIN was adjusted downward from 0x4000 to 0x2000, reducing
the absolute physical load address for boot2 to 0xC000 and allowing us
to load up to a 16K boot2 without overflowing the segment.
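
The address arithmetic can be checked with a small sketch. The macro and
function names are made up for the example, and the 64KB segment limit
is inferred from the 8KB figure above rather than taken from the boot
code.

```c
#include <assert.h>
#include <stdint.h>

#define BTX_LOAD_BASE	0xA000u		/* base address quoted in the text */
#define SEGMENT_LIMIT	0x10000u	/* assumption: 64KB segment */

/* Absolute physical load address of boot2 for a given BOOT2_VORIGIN. */
static uint32_t
boot2_load_addr(uint32_t vorigin)
{
	assert(BTX_LOAD_BASE + vorigin < SEGMENT_LIMIT);
	return BTX_LOAD_BASE + vorigin;
}

/* Bytes available for boot2 before the segment would overflow. */
static uint32_t
boot2_room(uint32_t vorigin)
{
	return SEGMENT_LIMIT - boot2_load_addr(vorigin);
}
```

Under these assumptions, boot2_room(0x4000) is 8KB (load address 0xE000)
and boot2_room(0x2000) is 16KB (load address 0xC000).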

The reasons to create another driver for 8257{1,2,3}:
- Workaround code for various old hardware bugs is removed, so the
code is more straightforward, especially on the transmit path.
- Only 8257{1,2,3} support multiple RX queues.
- Only 8257{1,2} support multiple TX queues (no plan for it yet).
- It can serve as a sandbox for me to add multi-queue support, while
em(4) always works :).

Adjust the kern_utimes() code in the kernel to check for write permissions
prior to diving into the VFS. UFS checks for write perms but HAMMER doesn't.
Generally speaking we want (at least for now) the kernel to do as much of
these checks as possible.

When deleting a file, msdosfs keeps its denode in the denode cache until it is
reclaimed. This causes a collision in the cache when recycling the directory
entry of a deleted but still open file for a new or renamed file. This
collision was incorrectly handled, resulting in a kernel panic (rename case) or
syscall error and corrupted in-core state (new file case).

Fix by allowing denodes pointing to the same directory entry to coexist in the
cache as long as a single one of them represents an existing file.

This function returns an error if there is already a denode in the hash table:
EBUSY if the hashed denode represents a live file and EINVAL if it represents a
deleted but still open file.

There was a typo in the function causing it to check for liveness in the denode
to insert instead of the already inserted one. As a consequence, if N threads
were in a race in deget() to insert a new denode for the same file in the hash
table, the losers would fail with EINVAL instead of retrying.
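
The corrected check can be sketched as follows. The struct and function
names are hypothetical stand-ins, not the msdosfs definitions; the point
is only which denode's liveness is tested.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a denode; only liveness matters here. */
struct denode_sketch {
	bool live;	/* true if it represents an existing file */
};

/*
 * Decide the result of inserting new_dep when 'hashed' is the denode
 * already in the hash table for the same directory entry (or NULL).
 * Liveness is tested on the already-hashed denode; the typo tested
 * new_dep instead.
 */
static int
hash_insert_check(const struct denode_sketch *hashed,
    const struct denode_sketch *new_dep)
{
	assert(new_dep != NULL);
	if (hashed == NULL)
		return 0;		/* no collision, insert proceeds */
	return hashed->live ? EBUSY : EINVAL;
}
```

A racing thread that loses the insert for a live file then sees EBUSY
and can retry, rather than failing with EINVAL.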

With this change, the device will have at most 48 TX descriptors pending
to be written back. 48 is chosen according to the table listed in:
Intel 82571EB/82572EI Ethernet Controller Revision 6.0, Page 43,
Item 70. 82571/82572 Overwrites Transmit Descriptors in Internal Buffer.

We don't use TIDV/TADV to implement TX interrupt moderation, i.e.
TX desc's IDE bit should always be off. When we set TX desc's RS
bit, we do want TX interrupt to come immediately after the TX
desc's DD bit is set by hardware.

The RS (report status) bit in the TX desc controls whether DD bit
should be set by device (via write request) and whether TX interrupt
should be generated. By setting RS bit in the last TX desc of
int_tx_nsegs TX descs, we greatly reduce the TX interrupt rate
(from 20000/s to 1200/s for full-speed 1472-byte UDP datagrams) and
the number of the device's TX desc write requests. This also gives me
an additional +10Kpps on 82573E_IAMT. Add a sysctl node for int_tx_nsegs;
its default value is 1/16 of the number of TX descs. The implementation
details are commented near struct adapter's related fields.
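
The RS batching idea can be sketched as below. This is not the driver
code: the struct, its field names other than int_tx_nsegs, and the
helper functions are assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-ring bookkeeping for RS batching. */
struct tx_ring_sketch {
	int ndesc;		/* total TX descriptors in the ring */
	int int_tx_nsegs;	/* set RS once this many descs are queued */
	int pending;		/* descs queued since RS was last set */
};

static void
tx_ring_init(struct tx_ring_sketch *r, int ndesc)
{
	assert(ndesc > 0);
	r->ndesc = ndesc;
	r->int_tx_nsegs = ndesc / 16;	/* default: 1/16 of the TX descs */
	if (r->int_tx_nsegs < 1)
		r->int_tx_nsegs = 1;
	r->pending = 0;
}

/*
 * Account for the nsegs descriptors of one packet.  Returns true if
 * the packet's last descriptor should carry the RS bit, so that the
 * device writes back DD and raises a TX interrupt.
 */
static bool
tx_want_rs(struct tx_ring_sketch *r, int nsegs)
{
	r->pending += nsegs;
	if (r->pending >= r->int_tx_nsegs) {
		r->pending = 0;
		return true;
	}
	return false;
}
```

With an assumed ring of 256 descriptors the threshold is 16, so two
10-segment packets in a row set RS only on the second packet.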