Expand support for PCI-e memory-mapped configuration space access.

This defaults to off and must be explicitly
enabled by setting the loader tunable hw.pci.mcfg=1.
- Add support for the Intel 915GM chipsets by reading the BAR.
- Add parsing of the ACPI MCFG table to discover memory-mapped configuration
access on modern machines.
- For config requests to buses outside ACPI's reported min/max valid bus
range, fall back to type #1 configuration access instead (see the sketch
after this list).
- Add a workaround for some K8 chipsets that do not expose all devices on
bus 0 via MCFG and fall back to type #1 for those devices instead.

Remove the p1003_1b.mq_open_max sysctl, which:

1. Has never worked properly (displayed a wrong value).
2. Didn't allow for its value to be tuned.
3. Is non-standard, so no 3rd party should rely on it, and thus no one
should be harmed by its removal.
The canonical way to get the maximum number of open message queue
descriptors per process is via the sysconf(3) interface, as illustrated
below.
4. Has now been replaced by the kern.mqueue.mq_open_max sysctl, which
addresses 1. and 2.

A stub has been inserted in place of the old sysctl definitions, in order to
reduce unnecessary diffs due to renumbering. This patch does NOT introduce
p1003_1b.unused1.
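
For reference, the portable query looks like this; _SC_MQ_OPEN_MAX is the
standard POSIX name and nothing here is specific to this change:

    #include <stdio.h>
    #include <unistd.h>

    int
    main(void)
    {
        /* The per-process mq_open() descriptor limit, per POSIX. */
        long max = sysconf(_SC_MQ_OPEN_MAX);

        if (max == -1)
            printf("mq_open() descriptor limit: indeterminate\n");
        else
            printf("mq_open() descriptor limit: %ld\n", max);
        return (0);
    }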

I'm reverting it because:
1) the change didn't get properly discussed
2) it was based on a false premise:
"The rest of the world seems to call amd64 x86_64."
3) no pkgsrc bulk build was done to test the change
4) the original committer acted irresponsibly by committing
such a big change just before going on vacation.

* Use the ultradma field instead of the legacy field. All SATA devices
must support DMA, so don't bother with the other legacy fields. If the
ultradma field is not initialized (on future devices) then presumably
we do not have to SETXFER at all.
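
A hedged sketch of the mode selection this implies; the field and function
names are illustrative, not the driver's actual ones:

    #include <stdint.h>

    /* Returns the highest supported UDMA mode, or -1 to skip SETXFER. */
    static int
    pick_udma_mode(uint16_t ultradma)
    {
        int mode;

        if (ultradma == 0)
            return (-1);    /* field not initialized: no SETXFER */

        /* Low byte is a bitmask of supported modes (bit N = UDMA N). */
        for (mode = 7; mode >= 0; mode--)
            if (ultradma & (1 << mode))
                return (mode);
        return (-1);
    }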

* Change the tailq of inodes in a flush group to a red-black tree.
The flusher now processes inodes in sorted order and breaks them up
into larger sets for concurrent flushing. The flusher threads are thus
more likely to concurrently process inodes which are fairly far apart
in the B-Tree.

This greatly reduces lock interference between flusher threads. However,
B-Tree deadlocks are still an issue between inodes undergoing flushes
and front-end access operations. This can be observed by noting periods
of low dev-write activity in 'hammer iostats 1' output during a blogbench
test. The hammer-S* kernel threads will likely be in a 'hmrdlk' state
at the same time.
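
A minimal sketch of the data structure change, using the <sys/tree.h> RB
macros; the key and structure are simplified stand-ins for the real
hammer_inode:

    #include <sys/tree.h>
    #include <stdint.h>

    struct flush_inode {
        RB_ENTRY(flush_inode) rb_node;
        uint64_t obj_id;                /* sort key: B-Tree position */
    };

    static int
    flush_inode_cmp(struct flush_inode *a, struct flush_inode *b)
    {
        return ((a->obj_id < b->obj_id) ? -1 : (a->obj_id > b->obj_id));
    }

    RB_HEAD(flush_tree, flush_inode);
    RB_PROTOTYPE(flush_tree, flush_inode, rb_node, flush_inode_cmp);
    RB_GENERATE(flush_tree, flush_inode, rb_node, flush_inode_cmp);

    /* RB_FOREACH now visits inodes in B-Tree order, so consecutive
     * batches handed to flusher threads cover disjoint key ranges. */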

* Add sysctl vfs.hammer.limit_reclaim to set the maximum
number of inodes with no vnode associations, default 4000.

NOTE: For debugging only; setting this value too high will blow
out the kmalloc pool.
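
For reference, such a knob is typically declared FreeBSD/DragonFly-style
as below; this is a sketch, not the exact declaration in the HAMMER source:

    /* Kernel context: <sys/param.h>, <sys/kernel.h>, <sys/sysctl.h>. */
    SYSCTL_DECL(_vfs_hammer);

    static int hammer_limit_reclaim = 4000;
    SYSCTL_INT(_vfs_hammer, OID_AUTO, limit_reclaim, CTLFLAG_RW,
        &hammer_limit_reclaim, 0,
        "Maximum number of inodes with no vnode association");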

Continue tuning page recyclement in the VM paging queues. Fix an issue
where the memory used to recycle one-time-use cache data becomes too
constricted. For example if blogbench is run on one directory and then
run again on another directory, too many pages cached from the first run
were being left in the active queue and not recycled.

The only way to deal with this is to allow the pageout code to pull pages
from the active queue to the inactive queue. This in turn resurrects the
issue of overnight processes (whose pages are idle, after all) getting
excessively uncached. I think the goal needs to be to reduce excessive
recyclement of such pages over a shorter time-frame, such as an hour.
Adjusting vfs.vm_cycle_point higher may help (but we don't do it in this
commit).

* Change the pageout code a bit to pull a limited number of pages from
the active queue to the inactive queue if the inactive target has not
been met but the cache+free targets were satisfied from the inactive
queue. This allows the inactive_target to be increased without creating
additional management overhead on the machine (see the sketch after
this list).

* Increase the inactive_target to 1/2 of probed memory.

* Document the issues involved with pulling pages out of the active queue.
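
A hedged sketch of the limited pull described in the first bullet; the
names and the batch cap are illustrative only:

    /* Illustrative counters; the kernel tracks these per queue. */
    extern int vm_inactive_count, vm_inactive_target;
    extern int vm_cache_free_satisfied;    /* cache+free targets met? */

    void vm_deactivate_pages(int npages);  /* active -> inactive */

    static void
    vm_pageout_tail(void)
    {
        int shortage = vm_inactive_target - vm_inactive_count;

        /*
         * Pull from the active queue only when the inactive target
         * was missed even though cache+free were satisfied, and cap
         * the batch so long-idle pages are not deactivated wholesale.
         */
        if (shortage > 0 && vm_cache_free_satisfied)
            vm_deactivate_pages(shortage > 128 ? 128 : shortage);
    }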

These changes only apply to HAMMER version 4+ filesystems. HAMMER
versions less than 4 only implement some of these changes and do not
use the new features during crash recovery.

* Add a sequence number to the UNDO FIFO media record format. The field
already existed for just this purpose so no media structures changed
size.
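
A simplified sketch of the record header carrying that field; consult
hammer_disk.h for the authoritative on-media layout (hammer_crc_t is
assumed here to be a 32-bit CRC):

    #include <stdint.h>

    typedef uint32_t hammer_crc_t;      /* assumption for this sketch */

    struct hammer_fifo_head_sketch {
        uint16_t hdr_signature;
        uint16_t hdr_type;              /* PAD, UNDO, DUMMY, REDO, ... */
        uint32_t hdr_size;              /* record size incl. header */
        uint32_t hdr_seq;               /* the now-used sequence number */
        hammer_crc_t hdr_crc;
    };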

* Change the alignment boundary for HAMMER UNDO records from 16K to 512
bytes. This coupled with the sequence number virtually guarantees that
the recovery code can detect uninterrupted sequences of UNDO records
without having to rely on the FIFO last_offset field in the volume
header.

This isn't as bad as it sounds. It just means that large UNDO blocks
are broken up into smaller on-media structures in order to ensure a
record header occurs on every 512 byte boundary.
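
The chunking rule can be sketched as follows (names are illustrative):

    #include <stddef.h>

    #define UNDO_ALIGN      512

    /*
     * Limit each emitted UNDO record so that its header starts at,
     * and the record does not cross, a 512-byte boundary.
     */
    static size_t
    undo_chunk_size(size_t fifo_offset, size_t bytes_left)
    {
        size_t room = UNDO_ALIGN - (fifo_offset & (UNDO_ALIGN - 1));

        return (bytes_left < room ? bytes_left : room);
    }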

* Add HAMMER_HEAD_TYPE_DUMMY and HAMMER_HEAD_TYPE_REDO (Redo is not yet
used). The DUMMY type is a dummy record used solely to identify a
sequence number. PAD records cannot have sequence numbers, so we need
the DUMMY record for that purpose.

Remove unused UNDO FIFO record types.

* Adjust the version upgrade code to completely reinitialize the UNDO FIFO
space when moving from version < 4 to version >= 4. This puts all blocks
in the UNDO FIFO in a deterministic state with deterministic sequence
numbers on 512 byte boundaries.

* Refactor the flush code. In versions less than 4 the flush code had to
flush dirty UNDO buffers, synchronize disk, then flush the volume header
and synchronize disk again, then flush the meta data. For HAMMER
versions >= 4 the flush code removes the second disk synchronization
operation.
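
In outline, the difference is one sync; the helper names below are
placeholders for the real buffer-flush routines:

    void flush_dirty_undo_buffers(void);
    void flush_volume_header(void);
    void flush_meta_data(void);
    void disk_sync(void);

    static void
    hammer_flush_sketch(int version)
    {
        flush_dirty_undo_buffers();
        disk_sync();                /* UNDO must be durable first */
        flush_volume_header();
        if (version < 4)
            disk_sync();            /* dropped in v4+: recovery now
                                     * scans the FIFO for its end */
        flush_meta_data();
    }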

* Refactor the crash recovery code. For versions < 4 the crash recovery
code relied on the UNDO FIFO first_offset and next_offset indexes in
the volume header to calculate the UNDO space that needed to be run.
For versions >= 4 the crash recovery code uses first_offset for the
beginning of the UNDO space and proactively scans the UNDO FIFO to
find the end of the space. This takes longer but allows HAMMER to
remove one of the two disk sync operations in the flush code.
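
The proactive scan amounts to walking headers while the sequence numbers
stay contiguous; a hedged sketch with placeholder accessors:

    #include <stdint.h>
    #include <stdbool.h>
    #include <stddef.h>

    uint32_t head_seq(size_t off);      /* hdr_seq at this offset */
    bool     head_valid(size_t off);    /* signature and CRC check */
    size_t   next_head(size_t off);     /* next 512-byte header */

    /* Scan forward from first_offset; the first gap in the sequence
     * numbers marks the end of the UNDO space to be run. */
    static size_t
    find_undo_end(size_t first_offset)
    {
        size_t off = first_offset;
        uint32_t expect;

        if (!head_valid(off))
            return (off);
        expect = head_seq(off);
        while (head_valid(off) && head_seq(off) == expect) {
            off = next_head(off);
            expect++;
        }
        return (off);
    }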

* Split the crash recovery code into stage 1 and stage 2. Stage 2 will
be used to run REDO operations (REDO is not yet implemented).

* When recursively removing empty internal nodes from the B-Tree, only
call hammer_cursor_deleted_element() if the related internal
element is actually removed. The element might not be removed due
to the deadlock fail path.

* If hammer_cursor_up_locked() fails, fully restore the cursor before
returning. The index field was not being restored.

* Acquire the sync lock when recovering a cursor lost due to a deadlock
in the mirroring code.

* Document and fix an issue in the rebalancing code which could cause a
cursor to fall off the end of the B-Tree.

Binutils-2.20 gas is stricter and does not accept 32-bit arguments
to the fnstsw opcode. Appease it by declaring all arguments as
__uint16_t. This possibly even fixes some latent bugs caused by the
missing initialization of the top 16 bits.
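
For illustration only (not the actual libm code): with a 16-bit operand
gas is satisfied, since fnstsw stores the status word to %ax or a 16-bit
memory operand:

    #include <stdint.h>

    static inline uint16_t
    read_fpu_status(void)
    {
        uint16_t sw;

        /* "=am": let gcc pick %ax or a 16-bit memory slot. */
        __asm__ __volatile__("fnstsw %0" : "=am" (sw));
        return (sw);
    }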

While our current build infrastructure now produces binaries which can
be read properly by the kernel, this does not hold true for any non-base
linker, particularly binutils compiled directly from source, or even the
recent binutils-2.20 import.

Instead of relying on local modifications to the Elf linker scripts,
bite the bullet and make the kernel deal with PT_NOTE sections that lie
outside the first page.
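
In outline, the kernel-side fix is to walk all program headers looking
for PT_NOTE instead of assuming the note lives in the first page; a
user-space sketch of the same loop:

    #include <elf.h>
    #include <stddef.h>

    static const Elf64_Phdr *
    find_note_phdr(const Elf64_Ehdr *eh, const Elf64_Phdr *ph)
    {
        int i;

        /* PT_NOTE may live anywhere in the image, not just page 0. */
        for (i = 0; i < eh->e_phnum; i++)
            if (ph[i].p_type == PT_NOTE)
                return (&ph[i]);
        return (NULL);
    }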