* Do not scrap deleted + modified chains unconditionally, as this will mess
up operations on unlinked-but-open files. Also fixes an assertion which
was getting hit and fixes poudriere test run stdout EBADF errors on
unlinked fifos.

* Optimize handling of the DESTROYED flag to restore the feature where a
rm -rf can get away with doing almost no write I/O.

This explanation is supposed to be simpler and better. In particular
"comparing it to the snprintf API provides lots of value, since it raises the
bar on understanding, so that programmers/auditors will do a better job calling
all 3 of these functions."

It was printed even when the SIGSEGV was caught, such as by configure
tests, causing a rather noisy console when packages were built. After
this commit we're back to the traditional behavior (no message if the
signal is caught, and the usual message if not):

* Fix heavy cpu use in flush due to a blown recursion which can run down
the same chain many times because of the aliasing of hammer2_chain_core
structures.

The basic problem is that there can be H2 operations running concurrently
with a flush that are not part of the flush. These operations have a
higher transaction id. When situated deep in the tree, they can cause
the flush to repeatedly traverse large portions of the tree that it had
already checked, because the recorded flush TID is lower than the
update_tid from the concurrent operations.

* Fix a multitude of flush / concurrent-operations races. The worst of the
lot is related to the situation where a concurrent operation does a
delete-duplicate on a chain containing a block table (which can include
an inode chain) which the flush needs to update. This results in TWO
block tables needing updating relative to different synchronization
points. Essentially, one of the chains is strictly temporary for flush
purposes while the other is the 'real' chain.

For example, if the concurrent operation is adding or deleting elements
from a block table the flush may have to add/delete DIFFERENT elements
for its own view. This requires two different versions of the block table
(one being strictly temporary).

Improper updates of the chain->bref.mirror_tid caused the flush to get
confused and assert on the blocktable not containing the expected data.

* Fix more concurrent-operation issues during a flush. If a concurrent
operation deletes a chain and the flush needs to fork a 'live' version
of the chain, the flush's version will have a lower transaction id and
must be properly ordered in hammer2_chain_core->ownerq. It was not
being ordered properly.

* Flushes are recursive and to improve concurrency the flush temporarily
unlocks the old parent when diving under a child. This can result in a
race where, due to hammer2_chain_core aliasing, the recursion can wrap
around back to the parent.

Detect the case after re-locking the parent on the way back up the tree
and do the right thing.

* Fix handling of the flush block table rollup. Consolidate the call to
modify the parent (so we can adjust the blockrefs after flushing the
children) to a single point.

* Improve flush performance. If a parent is deferred at a higher level
and then encountered again via a shallower path, we now leave it deferred
and do not try to execute it in the shallower path even though the stack
depth is ok, as it will likely become deferred at a lower level anyway.

Check a deleted-chain case early before we recurse. A deleted chain
which is flagged DUPLICATED does not have to recurse as the sub-path
is reachable via some other parent. This significantly improves
performance because there are often a ton of chains in-memory marked
DELETED.

This results in more efficient deferrals.
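
A minimal sketch of the early test described above, using illustrative
flag and type names rather than the actual hammer2 structures:

    /*
     * A chain that is both deleted and duplicated has a successor that is
     * reachable via some other parent, so the flush can skip recursing
     * into it.
     */
    #define CHAIN_DELETED      0x0001      /* illustrative flag values */
    #define CHAIN_DUPLICATED   0x0002

    struct chain {
            unsigned flags;
    };

    static int
    flush_should_recurse(const struct chain *chain)
    {
            if ((chain->flags & (CHAIN_DELETED | CHAIN_DUPLICATED)) ==
                (CHAIN_DELETED | CHAIN_DUPLICATED))
                    return 0;  /* sub-tree reachable via the duplicate */
            return 1;
    }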

* Fix adjustments of modify_tid and delete_tid in delete-duplicate
operations, clean up handling of CHAIN_INITIAL, properly transfer
flags in delete-duplicate.

* Keep track of a generation number on the hammer2_chain_core structure
so the flush code can re-scan when it modifies elements within the
flush transaction.
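
A hedged sketch of how such a generation check might be used (field and
function names are illustrative, not the actual hammer2 code):

    struct core {
            int generation;         /* bumped when elements change */
    };

    static void
    flush_scan_core(struct core *core)
    {
            int generation;

            do {
                    generation = core->generation;
                    /*
                     * Scan / modify elements here.  If the pass itself (or
                     * anything else inside the flush transaction) bumps
                     * core->generation, the test below fails and the scan
                     * is restarted.
                     */
            } while (generation != core->generation);
    }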

* Cleanup the duplication and delete-duplication code and hardlink handling.
The delete-duplication code now properly tags delete_tid when a flush is
delete-duplicating a chain which is deleted in the live view but is still
valid in the flush view.

* Correct numerous bugs in tracking the modified/deleted state of
a chain.

* Correct numerous flush bugs.

* Separate the mirror TID for the freemap chain from the volume chain.
This will allow freemap updates to be delayed.

* Implement a more stringent algorithm to determine when CHAIN_MOVED
can be cleared in chain->flags.

* Do a better job limiting the flush scan when concurrent modifying
operations are occurring in large volumes.

* Replace HAMMER2_CHAIN_SUBMODIFIED with core->update_tid. SUBMODIFIED
applies to chain->core, not to chain. Use a TID to track updates to
make it easier for a flush to update records without messing up flush
sequencing of chains being concurrently modified outside the flush's
TID (that will be handled in the next flush).

* Make sure the DUPLICATED flag is set when duplicating a chain which
has already been duplicated to another target. This case is only during
flushes and can occur when the flush races against concurrent updates
which are not part of the flush.

* Refactor bioq flushing during a flush. hammer2_vfs_sync now gives the
bioq a window to operate using the flush's TID before the flush actually
starts to flush.

* hammer2_chain_modify() retains the current allocation block if the TID
does not cross a flush boundary.

* chain->bref.mirror_tid is now used to track flush progress and is compared
against core->update_tid to determine when a flush is needed.
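
A small sketch of the relationship described in the last two items, with
illustrative stand-in types rather than the actual hammer2 structures:

    #include <stdint.h>

    typedef uint64_t hammer2_tid_t;

    struct core  { hammer2_tid_t update_tid; };  /* newest recorded change */
    struct chain { hammer2_tid_t mirror_tid; struct core *core; };

    /* A modification records its transaction id on the chain's core. */
    static void
    record_update(struct chain *chain, hammer2_tid_t tid)
    {
            if (chain->core->update_tid < tid)
                    chain->core->update_tid = tid;
    }

    /*
     * The flush only needs to visit a chain whose core has updates newer
     * than the flush progress recorded in mirror_tid.
     */
    static int
    chain_needs_flush(const struct chain *chain)
    {
            return (chain->mirror_tid < chain->core->update_tid);
    }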

Background:
High-rate IPIs (actually at the same rate as polling(4)) are observed on
random CPUs when polling(4) is enabled and there is virtually no network
activity.

After polling(4) activities are traced using ktr(9), it turns out that the
high rate IPIs are actually from the wakeup() on netisr's msgport. Since
the sleep queue cpumask is indexed by the hash of ident, there are chances
that the netisr's msgport ident has the same hash value as other idents
that certain threads on other CPUs are waiting on. If this ever happens
(well, it does happen), the netisr's msgport wakeup will trigger "wakeup"
IPIs to other CPUs. However, these "wakeup" IPIs are actually useless,
since only netisr will wait on its msgport.
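
The collision can be pictured with a hypothetical ident hash (the real
sleep queue hash may differ):

    #include <stdint.h>

    #define SLEEPQ_TABLESIZE 256    /* illustrative */

    static unsigned
    sleepq_bucket(const void *ident)
    {
            return (unsigned)(((uintptr_t)ident >> 6) % SLEEPQ_TABLESIZE);
    }

    /*
     * If sleepq_bucket(netisr_msgport_ident) == sleepq_bucket(other_ident),
     * the two wait channels share a cpumask entry, so a wakeup() on the
     * netisr msgport IPIs every cpu recorded for that bucket, including
     * cpus whose threads are only sleeping on other_ident.
     */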

A putport_oncpu() msgport method is added which calls wakeup_mycpu() for
spin msgports, if we know that the port is only accessed by one thread on
the current CPU, e.g. polling(4). This is also the case for other network
code, e.g. syncache timeout, TCP timeout, fastforward flow cache timeout,
etc. However, that network code runs at too low a rate to expose the
extra "wakeup" IPI problem. lwkt_sendmsg_oncpu() is added as a wrapper
around the putport_oncpu() msgport method.

Currently, only polling(4) is using lwkt_sendmsg_oncpu(). Others will
be converted soon.
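
As a rough illustration, a caller that knows its target port is serviced
only by a thread on the current cpu would pick the new path; this sketch
assumes lwkt_sendmsg_oncpu() shares lwkt_sendmsg()'s (port, msg) calling
convention, and port_is_local_only is a hypothetical stand-in for that
knowledge:

    if (port_is_local_only)
            lwkt_sendmsg_oncpu(port, msg);  /* wakeup_mycpu(): no cross-cpu IPI */
    else
            lwkt_sendmsg(port, msg);        /* plain wakeup(): may IPI other cpus */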

* Move the live_zero optimization from hammer2_chain to
hammer2_chain_core. It is only applicable to the core,
and delete-duplicate operations can mess up the cache.

* Move the HAMMER2_CHAIN_COUNTEDBREFS flag to HAMMER2_CORE_COUNTEDBREFS.
It is only applicable to the core, and delete-duplication operations
can really mess up calculations of live_count otherwise.

* Don't bump live_count if inserting a deleted chain.

* The vp in hammer2_sync_scan2() is intentionally not locked. Use the
synclist token interlock to safely ref the hammer2_inode before
potentially blocking, otherwise it can get ripped out from under us.
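
Roughly, the pattern looks like the following sketch (simplified, not the
verbatim hammer2_sync_scan2() code):

    /*
     * vp is intentionally unlocked here.  While still covered by the
     * synclist token, take a ref on the inode so it cannot be ripped out
     * from under us once we potentially block below.
     */
    ip = VTOI(vp);
    hammer2_inode_ref(ip);          /* safe: still under the synclist token */
    /* ... work that may block; ip stays valid via the ref ... */
    hammer2_inode_drop(ip);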

I should have been more clear in its commit message that locate.updatedb
was not failing with an error if this local change wasn't kept, but that
the database was incomplete and of a much smaller size.

When testing find(1) after an upgrade, a good general rule is that
/var/db/locate.database needs to have about the same size after the
upgrade as it had before the upgrade.

One thing to note is the interrupt moderation when MSI-X is enabled. On
the PCIE-8AL-C, it looks like the interrupt rate programmed into the chip
is the total interrupt rate, NOT the per-MSI-X-vector interrupt rate:
e.g. suppose the interrupt rate is set to 8000 and 8 MSI-X vectors are
allocated. If two MSI-X vectors are active, then the interrupt rate for
each MSI-X vector will be ~4000. If all MSI-X vectors are active, then
the interrupt rate for each MSI-X vector will be ~1000. This kind of
interrupt moderation is very unfriendly to MSI-X ...

MSI-X is not enabled by default yet. You can set the tunable
hw.mxge.num_slices or hw.mxgeX.num_slices to 0 or any value greater
than 1 to enable MSI-X.
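
For example, either tunable could be set in /boot/loader.conf (device
unit 0 is shown as an example):

    # Per the note above, 0 or any value greater than 1 enables MSI-X.
    hw.mxge.num_slices=0
    # Or per device:
    #hw.mxge0.num_slices=8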

Fix a memory leak in makenetvfslist which would occur when a previous
call to strdup fails and the function returns on error.
The simple fix is a call to free(3) to free memory allocated to listptr
before returning.
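
The pattern looks roughly like the following sketch (simplified, not the
verbatim makenetvfslist code):

    #include <stdlib.h>
    #include <string.h>

    static char *
    build_list(const char *a, const char *b)
    {
            char *listptr, *copy;

            if ((listptr = strdup(a)) == NULL)
                    return (NULL);
            if ((copy = strdup(b)) == NULL) {
                    free(listptr);  /* the fix: do not leak listptr on error */
                    return (NULL);
            }
            /* ... build the list from listptr and copy ... */
            free(copy);
            return (listptr);
    }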

This removes nearly all the prior proc_token contention and also removes
process-group processing contention and makes it easier to track tty
sessions.

* Normal processes, zombie processes, the original linear list, and the
original hash mechanism are now all combined into a single allprocs[]
table. The various API functions will filter out zombie vs non-zombie
based on the type of request.

* Rewrite the PID allocator to take advantage of the hashed array topology.
An atomic_fetchadd_int() is used on the static base value which will cause
each cpu to start at a different array entry, thus removing SMP conflicts.

At the moment we iterate the relatively small number of elements in the
bucket to find a free pid.

Since the same proc_tokens[n] lock applies to all three arrays (proc,
pgrp, and session), we can validate the pid against all three at the
same time with a single lock.
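
A rough sketch of the allocation idea (illustrative only, not the kernel
code):

    #include <stdatomic.h>

    #define ALLPROC_HSIZE 1024              /* illustrative bucket count */

    static atomic_uint pid_base;

    /*
     * The atomic fetch-add on a shared base spreads concurrent callers
     * across different candidate pids, and therefore across different hash
     * buckets, avoiding SMP conflicts.  The small per-bucket list is then
     * scanned, under that bucket's proc/pgrp/session lock, for a pid that
     * is free in all three tables.
     */
    static unsigned
    next_candidate_pid(void)
    {
            return atomic_fetch_add_explicit(&pid_base, 1,
                                             memory_order_relaxed);
    }

    static unsigned
    pid_bucket(unsigned pid)
    {
            return pid % ALLPROC_HSIZE;
    }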

* Rewrite the procs sysctl to iterate the hash table. Since there are
1024 different locks, a 'ps' or similar operation no longer has any
significant effect on system performance, and 'ps' is VERY fast now
regardless of the load.

* Remove one of the two remaining major bottlenecks in the system, the
global vmobj_token which is used to manage access to the vm_object_list.
All VM object creation and deletion would get thrown into this list.

* Replace it with an array of 64 tokens and an array of 64 lists.
vmobj_token[] and vm_object_lists[]. Use a simple right-shift
hash code to index the array.
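
A sketch of the bucket selection (the shift amount is illustrative):

    #include <stdint.h>

    #define VMOBJ_HSIZE 64
    #define VMOBJ_HMASK (VMOBJ_HSIZE - 1)

    /*
     * Simple right-shift hash of the object pointer: drop the low bits
     * (identical due to allocation alignment) and mask the result into the
     * 64-entry vmobj_token[] / vm_object_lists[] arrays.
     */
    static inline unsigned
    vmobj_hash(const void *object)
    {
            return (unsigned)(((uintptr_t)object >> 8) & VMOBJ_HMASK);
    }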

* This reduces contention by a factor of 64 or so which makes a big
difference on multi-chip cpu systems. It won't be as noticeable on
single-chip (e.g. 4-core/8-thread) systems.

* Rip-out some of the linux vmstats compat functions which were iterating
the object list and replace with the pcpu accumulator scan that was
recently implemented for dragonfly vmstats.

* imgact_elf - drop the vm_object a little earlier in load_section(),
and use a shared object lock when iterating ELF segments.

* When starting a vforked process use a shared process token to
interlock the wait loop instead of an exclusive token. Also don't
bother with the token if there's nothing to wait for.

* When forking, pre-assign lp2 thread's td_ucred.

* Remove the vp->v_object load check loop. It should not be possible
for vp->v_object to change after being assigned as long as the vp
is referenced.

* Replace most OBJ_DEAD tests with assertions that the flag is not set.

* Remove the VOLOCK/VOWANT vnode interlock. It shouldn't be possible
for the vnode's object to change while the vnode is ref'd. This was
a leftover from a long-ago time when vnodes were more persistent and
could be recycled and race accessors.

This also removes vm_object_dead_sleep/wait and related code.

* When memory mapping a vnode object there is no need to formally
hold and chain_wait the object. We can simply add a ref to it,
because vnode objects cannot have backing chains.

* When deallocating a vm_object we can shortcut counts greater than 1
for OBJT_VNODE objects instead of counts greater than 3.

* Adjust trigger points such that under normal operation vnlru_proc()
handles cleaning up extra vnodes. If this is not sufficient then
the synchronous cleanup code will kick in at higher levels.

* Adjust vnode->v_act handling and try to take into account vnodes
with large memory objects (which we would rather reclaim later and
not sooner). This takes over functionality from vlru_reclaim().

* Remove the vlrureclaim() mount-scanning infrastructure. vnlru_proc()
now just calls freesomevnodes(). This should now be sufficient. This
removes significant locking overheads during steady-state operation.

The original code assumes that the total size of the RX slots is the same
as one RX descriptor ring size. This assumption is easily broken if we
ask the chip to deliver the RSS hash (the RX slot size changes from 4
bytes to 8 bytes). The RX slot count is now recorded.

Currently, even without __cachealign, the RX data struct size is properly
aligned to two cache lines. Add __cachealign so that even if some
debugging fields are added, the RX data struct size will still be cache
line size aligned.
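
A sketch of the attribute usage (field names are illustrative; in
DragonFly __cachealign comes from the system headers, modeled here with a
64-byte line):

    #ifndef __cachealign
    #define __cachealign __attribute__((__aligned__(64)))
    #endif

    struct rx_data {
            int     rx_prod;            /* hot receive-path fields ... */
            int     rx_cons;
    #ifdef RX_DEBUG
            long    rx_debug_drops;     /* no longer disturbs the alignment */
    #endif
    } __cachealign;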

* Fix bugs in the cachedvnodes counter tracking. v_refcnt has to
be masked against VREF_MASK to detect non-zero->0 and 0->non-zero
transitions properly.
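
A sketch of the corrected test (the flag layout is illustrative, not the
exact vnode definitions):

    #include <stdint.h>

    #define VREF_MASK       0x3fffffffU     /* count bits (illustrative) */
    #define VREF_FINALIZE   0x40000000U     /* example flag sharing v_refcnt */

    /*
     * v_refcnt carries flag bits alongside the count, so transitions must
     * be detected on the masked count, not on the raw field value.
     */
    static int
    went_zero(uint32_t oldref, uint32_t newref)
    {
            return ((oldref & VREF_MASK) != 0 && (newref & VREF_MASK) == 0);
    }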

* Clear VREF_FINALIZE when reactivating a vnode in vget().

* vhold()/vdrop() no longer prevent a vnode from being moved to the
vinactive list. They simply prevent reclamation.

* Adjust the vnlru trigger points a bit.

* When cleaning, leave the vnode on the inactive list until we determine
we can destroy it. Add a ref instead of using the VREF_TERMINATE
placeholding ref (since the vnode is still on the list).

* Implement vnode->v_act and remove the inactive mid-point stuff. The
idea now is that vnodes are selectively moved from the active list to
the inactive list as needed. Inactive vnodes are then cleaned up in order.

* Adjust hysteresis so that vnlru has a better chance of handling the
vnode garbage collection before it is forced to be done synchronously
in userexit.