Correct the behaviour when using command_interpreter in
rc.d scripts where the proctitle of ps does not include
the full interpreter path, but for example just "perl: ..."
instead of "/usr/pkg/bin/perl -flags ...".

With this patch, the /etc/rc.d/postgrey stop/status script
of pkgsrc package mail/postgrey works. Other packages seem
to not use command_interpreter at all.

* The new kfree() was improperly adjusting ks_memuse/ks_inuse for the wrong
cpu, leading to MP races which could cause the memory statistics to go
negative and trigger a panic.

* When calculating loosememuse it is possible to race another cpu and
come up with an incorrect value. The race itself is ok; loosememuse
is not supposed to be 100% deterministic. But even so, do not allow
the value to underflow or we will wind up asserting.
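
A minimal sketch of the clamp, assuming nothing about the real kernel code
beyond the description above (the counter name is an illustrative stand-in):

```c
static long loosememuse;    /* illustrative stand-in for the real counter */

static void
loose_subtract(long bytes)
{
    long n = loosememuse - bytes;

    if (n < 0)          /* racing another cpu can yield a negative value */
        n = 0;          /* clamp rather than letting the stat underflow */
    loosememuse = n;
}
```

The value stays approximate either way; the clamp only removes the
panic-on-underflow failure mode.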

In this example gdb'ing cc1 and examining the code revealed an impossible
crash case where off(%ebx) was deterministically accessed a few
instructions before, then accessed again and somehow %ebx had become zero.

Unfortunately I could find no smoking gun, but my conjecture is that it
is an MP race which can occur when the thread migrates between cpus,
and/or a mis-handled IPI.

* In the LWKT messaging code move the cpu_mfence() call in the sequence,
from rindex->read_args->MFENCE->call to rindex->MFENCE->read_args->call.

* In the LWKT thread acquisition code (for thread migration between cpus),
add a cpu_mfence() call after the td_flags check indicates success,
instead of inside the loop where we are waiting for the flags check to
indicate success.

* In both cases the issue seems to be out-of-order reads and/or speculative
reads. Even though MP writes are well ordered on Intel/AMD systems reads
are not. In the case of the IPIQ FIFO the data related to the arguments
can be ordered ahead of the read of the FIFO rindex and thus wind up being
stale relative to the other CPU writing the entry. Moving the mfence
ensures that the args stored in the FIFO are not accessed until after
the rindex is read.
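
A runnable sketch of the corrected ordering, under the assumption that the
structure and field names below are illustrative (they are not the actual
DragonFly ipiq layout) and with `__sync_synchronize()` standing in for
`cpu_mfence()`:

```c
struct ipiq {
    volatile int ip_rindex;     /* consumer index */
    volatile int ip_windex;     /* producer index, written by the other cpu */
    int          ip_info[32];   /* "args" slots; an int for this sketch */
};

/* Returns -1 if the FIFO is empty, else the consumed entry. */
static int
ipiq_process_one(struct ipiq *ip)
{
    int ri = ip->ip_rindex;

    if (ri == ip->ip_windex)
        return -1;
    __sync_synchronize();       /* fence BEFORE touching the args, so the
                                 * read below cannot be speculated ahead of
                                 * the rindex/windex check and return stale
                                 * data written by the other cpu */
    int v = ip->ip_info[ri & 31];
    ip->ip_rindex = ri + 1;
    return v;
}
```

The old sequence fenced after the args were read, which did nothing to stop
the speculative read of a stale slot.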

For the thread acquisition code, access to and manipulation of the thread
td_allq might be based on stale out-of-order reads prior to the
determination that the thread completed its move.

This can be a problem because several mechanisms in DragonFly are able
to operate without even having to use locked bus cycles (the IPIQ, thread
migration, and kern/sys_pipe.c being the best examples), so the natural
barrier provided by a locked bus-cycle instruction is not necessarily
present.
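
The thread-acquisition side of the fix can be sketched like this, with a
stand-in `struct thread`, an illustrative TDF_MOVING flag, and
`__sync_synchronize()` modeling `cpu_mfence()`:

```c
#define TDF_MOVING      0x0001  /* illustrative migration-in-progress flag */

struct thread {
    volatile int td_flags;
    int          td_allq_ready; /* stand-in for the td_allq linkage */
};

static int
thread_wait_migrate(struct thread *td)
{
    while (td->td_flags & TDF_MOVING)
        ;                       /* spin; no fence needed per-iteration */
    __sync_synchronize();       /* one fence AFTER the flags check succeeds,
                                 * so subsequent reads of td_allq-style
                                 * fields cannot be stale speculated reads
                                 * from before the migration completed */
    return td->td_allq_ready;
}
```

Fencing once after success is both cheaper and sufficient; fencing inside
the wait loop (the old placement) did not order the reads that follow it.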

* Instead of IPIing the chunk being freed to the originating cpu, we
use atomic ops to directly link the chunk onto the target slab.
We then notify the target cpu via an IPI message only in the case where
we believe the slab has to be entered back onto the target cpu's
ZoneAry.

This reduces the IPI messaging load by a factor of 100x or more.
kfree() sends virtually no IPIs any more.

* Move malloc_type accounting to the cpu issuing the kmalloc or kfree
(kfree used to forward the accounting to the target cpu). The
accounting is done using the per-cpu malloc_type accounting array
so large deltas will likely accumulate, but they should all cancel
out properly in the summation.
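
A sketch of the per-cpu accounting scheme described above, assuming an
illustrative fixed cpu count and field layout (the real malloc_type array
differs):

```c
#define NCPU 4                  /* illustrative cpu count */

struct malloc_type {
    long ks_memuse[NCPU];       /* per-cpu deltas; individual slots may
                                 * accumulate large or negative values */
};

static void
mtype_account(struct malloc_type *t, int mycpu, long delta)
{
    t->ks_memuse[mycpu] += delta;   /* always the issuing cpu's slot; no
                                     * forwarding to the target cpu */
}

static long
mtype_total(const struct malloc_type *t)
{
    long sum = 0;

    for (int i = 0; i < NCPU; ++i)
        sum += t->ks_memuse[i];     /* deltas cancel in the summation */
    return sum;
}
```

A kmalloc on cpu 0 later kfree'd on cpu 2 leaves +N in slot 0 and -N in
slot 2; the total still comes out right.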

* Use the kmemusage array and kup->ku_pagecnt to track whether a
SLAB is active or not, which allows the handler for the asynchronous IPI
to validate that the SLAB still exists before trying to access it.

This is necessary because once the cpu doing the kfree() successfully
links the chunk into z_RChunks, the target slab can get ripped out
from under it by the owning cpu.

* The special cpu-competing linked list is different from the linked list
normally used to find free chunks, so the localized code and the
MP code are segregated.

We pay special attention to list ordering to try to avoid unnecessary
cache mastership changes, though it should be noted that the c_Next
link field in the chunk creates an issue no matter what we do.

A 100% lockless algorithm is used. atomic_cmpset_ptr() is used
to manage the z_RChunks singly-linked list.
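
The producer side of that list can be sketched as a standard lockless
CAS push, with `__sync_bool_compare_and_swap()` standing in for
`atomic_cmpset_ptr()` and simplified struct layouts:

```c
struct chunk {
    struct chunk *c_Next;
};

struct slab {
    struct chunk * volatile z_RChunks;  /* remote-free list, MP contended */
};

static void
chunk_free_remote(struct slab *z, struct chunk *c)
{
    struct chunk *old;

    do {
        old = z->z_RChunks;
        c->c_Next = old;        /* link before publishing the chunk */
    } while (!__sync_bool_compare_and_swap(&z->z_RChunks, old, c));
}
```

Because the push retries until the CAS wins, any number of cpus can free
into the same slab concurrently without a lock; the owning cpu later
takes the whole list in one swap.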

* Remove the page localization code for now. For the life of the
typical chunk of memory I don't think this provided much of
an advantage.

* Note I'm talking about exit/enter wrappers, not enter/exit wrappers.
I believe the enter/exit wrappers can be removed too but for now
we have to remove the exit/enter wrappers which assumed a critical
section would be held on entry.

This is no longer the case. Since so much of the network stack is
now threaded, callers into PF are not necessarily holding a critical
section to exit out of.

* Newly allocated mbufs now set m_len and (if a packet header)
m_pkthdr.len to 0 instead of leaving them uninitialized,
allowing us to assert that the mbuf does not have an overrun
later when it is freed.
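
A sketch of the initialization, with a cut-down stand-in for the real
mbuf structure (only the fields the description mentions):

```c
struct pkthdr { int len; };

struct mbuf {
    int           m_len;
    struct pkthdr m_pkthdr;
    int           m_flags;
};

#define M_PKTHDR 0x0001         /* illustrative flag value */

static void
mbuf_init_lengths(struct mbuf *m)
{
    m->m_len = 0;               /* previously left uninitialized */
    if (m->m_flags & M_PKTHDR)
        m->m_pkthdr.len = 0;    /* ditto for packet headers */
}
```

With both fields known to start at 0, the free path can assert that the
recorded lengths never grew past the cluster size, catching overruns.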

* The last fix wasn't good enough. Really try to fix it this time. Use
a pool token and validate so_head after acquiring it to deal with races,
interlock against 0-ref races (sockets can be on the so_comp/so_incomp
queues with 0 references), and use it for the accept predicate.
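
The validate-after-acquire pattern described above can be sketched like
this; the struct and the retry loop are illustrative, and the token
acquisition (which may block, letting so_head change underneath us) is
only indicated by a comment:

```c
#include <stddef.h>

struct socket {
    struct socket *so_head;     /* listen socket we are queued on */
};

static struct socket *
so_get_head_locked(struct socket *so)
{
    struct socket *head;

    for (;;) {
        head = so->so_head;
        if (head == NULL)
            return NULL;
        /* acquiring head's pool token would go here and may block */
        if (so->so_head == head)
            return head;        /* still the same head: safe to proceed */
        /* raced: drop head's token and retry against the new value */
    }
}
```

Re-checking so_head after the (possibly blocking) token acquisition is
what closes the race; the reference taken under the token then interlocks
against the 0-ref case.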

* Fix a race where a socket undergoing an accept() was not being
referenced soon enough, resulting in a window of opportunity for
the kernel to attempt to free it if the tcp connection resets
before userland can finish the accept.

* As of this writing AMD has some new chipsets out for AM3 MBs which
support AHCI on 5 SATA + 1 E-SATA connectors. My testing was done
on a MB with the 880G chipset.

The AHCI firmware for this chipset is a bit on the rough side. It
seems a bit slow on the INIT/device-detection sequencing (possibly due
to longer PHY training time? It's supposed to be a 6GBit PHY), and it
generates a stream of PCS interrupts for some devices.

My assumption is that the PCS interrupts are not being masked by the
chipset during the INIT phase. Both IFS and PCS interrupts seem to
occur during INIT/RESET and PM probing stages.

In addition, at least one drive, an Intel SSD, caused a large number
of PCS interrupts during the INIT phase even when connected to an
internal SATA port at power-on. This is clearly a bug in the AMD
AHCI chipset, again related to their firmware not internally masking
communications glitches during INIT, and/or taking an extra long time
to train the PHY.

* Adjust the AHCI driver to deal with this situation. Limit the interrupt
rate for PCS errors and do harsh reinitialization of the port when we get
a PCS error, along with allowing extra time for the device detect to
succeed.
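
One way the rate limit could look, as a hypothetical sketch (the function,
field names, and tick-based policy below are not taken from the driver):

```c
/* Decide whether a PCS error should trigger the harsh port reinit, or
 * merely be acknowledged because one happened too recently. */
static int
pcs_should_reinit(int *last_tick, int now, int min_interval)
{
    if (now - *last_tick < min_interval)
        return 0;               /* too soon; just ack the interrupt */
    *last_tick = now;
    return 1;                   /* do the harsh port reinitialization */
}
```

This keeps a chipset that streams PCS interrupts during INIT from
re-resetting the port on every single one.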

* As a side benefit the AHCI driver should be able to deal with device
connection and disconnection on non-hot-swap-capable ports, at least
up to a point.

* Silence some of the console output during probe.

* Try harder to clear the CI/SACT registers when stopping a port. Some
chipsets appear to not clear the registers when we cycle ST when they
have already stopped the command processor, possibly as part of the IFS
or PCS interrupt paths.

* Fix a bug where an IFS or PCS interrupt marks a probe command (software
reset sequence) as complete when it actually errored-out.

* Sleep longer between retries if a command fails due to an IFS error.
When testing with the WD Green drives a drive inserted into a PM
enclosure cold seems to take longer to start up during the COMRESET
sequence. This only seems to occur with the AMD chipset and does
not occur with the older NVidia chipset. IFS errors occur for several
seconds beyond what I would consider a reasonable sleep interval.

Previously, BUF_CMD_FLUSH ended up as a zero-byte write command, which
always fails, flooding the console with `iobuf error 5'. Filesystems
other than HAMMER almost never issue this command, so we never saw
the error message in pre-HAMMER days. This commit adds a new path
for BUF_CMD_FLUSH and issues IPS_CACHE_FLUSH_CMD for it.

Also mention the tunable/sysctl knob debug.ips.ignore_flush_cmd in ips(4)
man page in case the new behavior confuses your controller; when set, the
driver just discards BUF_CMD_FLUSH.
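
The dispatch logic can be sketched roughly as below; the enum values, the
string returns, and the knob variable are stand-ins for the actual ips(4)
internals, kept only to show the shape of the new path:

```c
#include <stddef.h>

enum buf_cmd { BUF_CMD_READ, BUF_CMD_WRITE, BUF_CMD_FLUSH };

static int ips_ignore_flush_cmd;    /* models debug.ips.ignore_flush_cmd */

/* Returns a string naming the action taken, for illustration only. */
static const char *
ips_start(enum buf_cmd cmd, size_t bytes)
{
    switch (cmd) {
    case BUF_CMD_FLUSH:
        if (ips_ignore_flush_cmd)
            return "discarded";         /* knob set: complete silently */
        return "IPS_CACHE_FLUSH_CMD";   /* new dedicated flush path */
    case BUF_CMD_READ:
        return "read";
    case BUF_CMD_WRITE:
        /* The old code fell through to here for flushes; a zero-byte
         * write always fails and flooded the console with errors. */
        return bytes ? "write" : "error";
    }
    return "error";
}
```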

* Due to the bloat in m_hdr and m_pkthdr the 256-byte mbuf structure
is no longer large enough, and there appears to be quite a bit of
legacy code still using m_get() and making assumptions about the
available space without checking the actual space.

We have assertions in place to catch these but stabilizing the
system is more important right now.