* lwp_signotify() was improperly scheduling threads whose td_gd is on the
local cpu without checking the SINTR flags. This can catch a thread in
the middle of being transitioned to another cpu and cause havoc.

* Only schedule the thread if the SINTR flags are set.

* We can't call setrunnable() from an IPI so adjustments have to be made
in the remote cpu to set the lp's lwp_stat state before issuing the IPI
and only do the scheduling of its thread from the IPI function.

* Adjust sysctl_kern_proc()'s kernel thread scanning code to use a marker
instead of depending on td remaining on its proper list. Otherwise
blocking conditions can rip td out from under us or move it to another
cpu, potentially resulting in a crash or livelock. Index the scan
backwards to avoid live-locking continuous adds to the list.

* Fix a potential race in the zombie removal code vs a ps; p->p_token was
being released too early.

* Adjust lwkt_exit() to wait for the thread's hold count to drop to zero
so lwkt_hold() works as advertised.

* This fixes booting issues on i386 with vm.shared_fault=1 (pool
tokens would sometimes coincide with the token used for kernel_object
which causes problems on i386 due to the pmap code's use of
kernel_map/kernel_object).

* Do not call pmap_enter() from vm_fault_page*(). This function can be
called from foreign pmap contexts and thus the current cpu's bit may
not be set in the target pmap cpumask. Any pmap_enter() operation will
thus not properly synchronize with other users of the pmap (particularly
other foreign users).

* In addition, for callers of the umtx*() functions, calling pmap_enter()
is inefficient as the correct page might already be faulted in. Because
we are no longer updating the page in the pmap, an older page
may still exist in the pmap (mapped read-only as it was originally COW).

This page may no longer be correct because the umtx*() functions
modify the content of the page returned by vm_fault_page() without
necessarily mapping it. So to keep the user visibility into the memory
correct we unmap the old page when vm_fault_page() has to do a COW.

This is slightly more burdensome for fork() but far less burdensome
for the umtx system calls and also allows procfs_memrw to work properly.

* procfs uses vm_fault_page*() to access command line arguments for
any process and umtx*() uses it to access the memory page the umtx
is operating in. Relative to procfs the user process pmap is foreign
(i.e. the current cpu's bit is not set in its pm_active) and cannot
be properly updated via a vm_fault_page*() from procfs anyway, so the
above new behavior for vm_fault_page*() is even more correct for
procfs use cases.

The soreserve and pru_attach could set these two flags internally,
so the original code would only retain those two flags but not clear
them if the listen socket does not have them. We now explicitly
check those two flags and set or clear them accordingly.

It is possible for the tcp input path or tcp timers to drop the socket
reference and put the socket into the disconnected state while there
are still asynchronous send messages pending on the netisr
message port. If the user space program chooses to close the tcp
socket in this situation, the socket will be freed directly
on the syscall path since it has already been
disconnected, so the pending asynchronous send messages on the
netisr message port will reference the freed socket, causing a
panic later on.

Fix the problem by explicitly syncing the netisr that could have
pending asynchronous send messages before freeing the socket
on the soclose path.

This commit rolls up a lot of work to improve postgres database operations
and the system in general. With these changes we can pgbench -j 8 -c 40 on
our 48-core opteron monster at 140000+ tps, and the shm vm_fault rate
hits 3.1M pps.

* Implement shared tokens. They work as advertised, with some caveats.

It is acceptable to acquire a shared token while you already hold the same
token exclusively, but you will deadlock if you acquire an exclusive token
while you hold the same token shared.

Currently exclusive tokens are not given priority over shared tokens so
starvation is possible under certain circumstances.

* Create a critical code path in vm_fault() using the new shared token
feature to quickly fault-in pages which already exist in the VM cache.
pmap_object_init_pt() also uses the new feature.

This increases fault-in concurrency by a ridiculously huge amount,
particularly on SHM segments (say when you have a large number of postgres
clients). Scaling for large numbers of clients on large numbers of
cores is significantly improved.

This also increases fault-in concurrency for MAP_SHARED file maps.

* Expand the breadn() and cluster_read() APIs. Implement breadnx() and
cluster_readx(), which allow a getblk()'d bp to be passed in. If *bpp is not
NULL a bp is being passed in, otherwise the routines call getblk().

* Modify the HAMMER read path to use the new API. Instead of calling
getcacheblk() HAMMER now calls getblk() and checks the B_CACHE flag.
This gives getblk() a chance to regenerate a fully cached buffer from
VM backing store without having to acquire any hammer-related locks,
resulting in even faster operation.

* If kern.ipc.shm_use_phys is set to 2 the VM pages will be pre-allocated.
This can take quite a while for a large map and also lock the machine
up for a few seconds. Defaults to off.

* Reorder the smp_invltlb()/cpu_invltlb() combos in a few places, running
cpu_invltlb() last.

* An invalidation interlock might be needed in pmap_enter() under certain
circumstances, enable the code for now.

* vm_object_backing_scan_callback() was failing to properly check the
validity of a vm_object after acquiring its token. Add the required
check + some debugging.

* Make vm_object_set_writeable_dirty() a bit more cache friendly.

* The vmstats sysctl was scanning every process's vm_map (requiring a
vm_map read lock to do so), which can stall for long periods of time
when the system is paging heavily. Change the mechanic to a LWP flag
which can be tested with minimal locking.

* Have the phys_pager mark the page as dirty too, to make sure nothing
tries to free it.

* Remove the spinlock in pmap_prefault_ok(), since we do not delete page
table pages it shouldn't be needed.

* Add a required cpu_ccfence() in pmap_inval.c. The code generated prior
to this fix was still correct, and this makes sure it stays that way.

* The key aim of dfregress is to make it simple, if not outright dead easy, to
write test cases. A test case is a simple program using
printf/fprintf and exit - no magic needed. A kernel test case is a
very small module that just needs to implement a few functions and
call a logging and result function.

* Sample output of the text frontend (Frontends are explained further
down the text): http://leaf.dragonflybsd.org/~alexh/dfregress.txt

* dfregress is very UNIXy, it uses makefiles to build the testcases,
stdout/stderr redirection from the testcases (no fancy output
functions needed in the testcases) and evaluates the normal return
value of the testcase (no need to call fancy functions).

* For kernel testcases it is a bit different - you do have to call
functions to log output and to log the result, but it is very simple,
hardly any overhead.

* The test driver assumes that testcases are in the testcases/
directory, but it supports several command line options.

* The tests to run, including several options, are specified in the
runlist file. An example runlist including all the testcases is
included as config/runlist.run. Options that can be specified are:
- timeout: (in seconds) after which the test is aborted if it hasn't
finished.
- test type (userland, kernel or buildonly)
- make: which 'make' tool to use to build the test cases. Defaults to
'make', but could also be 'gmake', etc.
- nobuild: doesn't build the testcase and tries to directly execute
it.
- pre, post: external pre-run and post-run commands, e.g. to set up a
vn device and to tear it down. This is to avoid duplication in test
cases, the common setup can be factored out.
- intpre, intpost: similar to the above, but it assumes that the
testcase, when passed the parameter 'pre', will do the pre-setup,
and when passed 'post' will do the post-test cleanup/setup/etc.
- any number of command line arguments that are passed to the test
case. (See the crypto/ test examples in the runlist).

* A range of sample testcases are available in
test/dfregress/testcases/sample, including a kernel testcase sample.

* Note that many of the test cases in the testcases/ directory have
been copied from elsewhere in the main repository and are,
temporarily at least, duplicated.

* The test driver is completely separated from the frontends. The test
driver outputs the test run results in an XML-like format (plist)
that can easily be parsed using proplib. Python and Ruby also have
plist modules that could be used to parse the output.

* The only available frontend is a simple C program that will parse the
intermediate format to an easy to read plain text format. Additional
frontends can be written in any language, as long as it is possible
to load plists. Frontends are in the fe/ directory.

* XXX: the default options (currently just the timeout) are still
hardcoded in config.c.

* The NOTES file gives details on how the test execution occurs and
which result codes can be raised at which point. This document, and a look
at the generated plist, are all you need to write a new frontend that,
for example, generates beautiful HTML output.
For completeness' sake, a part of NOTES is reproduced under the main commit
message - specifically the part detailing the execution of a single test
case.

======
Execution of a single test case:
======
1) chdir to testcase directory
- if it fails, set RESULT_PREFAIL (sysbuf is of interest), goto (6)

2) build testcase (make) (unless nobuild flag is set).
+ build_buf is used for stdout/stderr
- if there is an internal driver error (that leads to not running the
build command), set RESULT_PREFAIL (sysbuf is of interest), goto (6)
- if the build command has a non-zero exit value, set the result to
BUILDFAIL, unless it's a buildonly test case, in which case it is set to
the actual result value (TIMEOUT, SIGNALLED, FAIL)
goto (6)

3) run 'pre' command if intpre or pre is set.
+ precmd_buf is used for stdout/stderr
- if there is an internal driver error (that leads to not running the
command), set RESULT_PREFAIL (sysbuf is of interest), goto (6)
- if the pre command has a non-zero exit value, set RESULT_PREFAIL and
goto (6)

4) run actual testcase, depending on type
+ stdout_buf is used for stdout
+ stderr_buf is used for stderr
- for BUILDONLY: set RESULT_PASS since the build already succeeded
- for userland and kernel: run the testcase, possibly as a different
user (depending on the runas option), set the result to the actual
result value (TIMEOUT, SIGNALLED, FAIL, NOTRUN)
- if there is an internal driver error (that leads to not running the
command), RESULT_NOTRUN is set (sysbuf is of interest)

5) run 'post' command if intpost or post is set.
+ postcmd_buf is used for stdout/stderr
- if there is an internal driver error (that leads to not running the
command), set RESULT_POSTFAIL (sysbuf is of interest), goto (6)
- if the post command has a non-zero exit value, set RESULT_POSTFAIL
and goto (6)

6) clean testcase directory (make clean) (unless nobuild flag is set).
+ cleanup_buf is used for stdout/stderr and system (driver error) buffer
- no further action.

* The vm_zone (zalloc) code needs to be replaced with objcache, but until
we do, we continue improving certain critical paths in it.

* Burst fill an empty per-cpu cache to reduce overheads when working with
large numbers of processes with large shared address spaces (postgres,
mysql, etc). This reduces contention under heavy use situations.

* Bring in a much faster allocator for x86-64. DMalloc is a slab allocator
with dynamic slab sizing capabilities, allowing slabs to be used for
all allocation sizes. This simplifies the code paths considerably.

* DMalloc is optimized for heavy-use situations but will still retain a
run size similar to the old nmalloc code. The VSZ is going to be quite
a bit bigger, though. The best test is w/mysqld as mysql[d] allocates
and frees memory at a very high rate.

* DMalloc is almost completely lockless. Slabs become owned by threads
which can then manipulate them trivially. Frees can operate on foreign
slabs in a lockless manner. A depot is used primarily as a catch-all
for thread exits.

* Flag the case where a sysretq can be performed to quickly return
from a system call instead of having to execute the slower doreti
code.

* This about halves syscall times for simple system calls such as
getuid(), and reduces longer syscalls by ~80ns or so on a fast
3.4GHz SandyBridge, but does not seem to really affect performance
a whole lot.

* Change the low-level IPI code to physically disable interrupts when
waiting for the ICR status bit to clear and issuing a new IPI. It
appears (based on circumstantial evidence) that on Intel cpus a
LAPIC EOI can busy the command sequencer.

Thus if interrupts are enabled inside a critical section, even if all
they do is EOI the LAPIC and IRET, this can prevent an IPI from being
sent if the interrupt occurs at just the right moment during an IPI
operation.

* Because IPIs are already limited to one per target cpu at any given
moment via gd->gd_npoll we can also do away with the ipiq polling
code that was inside the ICR wait.

* The VM page queues were not being fully utilized, causing the pageout
daemon to calculate incorrect average page counts for deactivation/freeing.
This caused the pageout daemon to dig into the active queue even when it
did not need to.

* The pageout daemon was incorrectly calculating the maxscan value for each
queue. It was using the aggregate count (across all 256 queues) instead of
the per-queue count, resulting in long stalls when memory is low.

* Clean up the PQ_L2* knobs, constants, and other cruft, reducing them to
the essentials for our goals.

* Remove the vm.vm_load logic, it was breaking more things than it was
fixing.

* Fix a bug in the pageout algorithm that was causing the PQ_ACTIVE queue
to drain excessively, messing up the LRU/activity algorithm.

* Rip out hammer_limit_running_io and instead just call waitrunningbufspace().

* Change the waitrunningbufspace() logic to add a bit of hysteresis and to
fairly block everyone doing write I/O, otherwise some threads may be
blocked while other threads are allowed to proceed while the buf_daemon
is trying to flush stuff out.

The ACPI specification states that if P_LVL2_LAT > 100, then a system
doesn't support C2; if P_LVL3_LAT > 1000, then C3 is not supported.

But there are no such rules for Cx state data returned by _CST. If a
state is not supported it should not be included into the return
package. In other words, any latency value returned by _CST is valid,
it's up to the OS and/or user to decide whether to use it.