tcp: Don't allow persist timer if TCP connection is not established yet

This probably could move the un-updated snd_nxt panic earlier.

The dump of the panic in http://bugs.dragonflybsd.org/issue1939 shows:
- snd_nxt is less than snd_una.
- A persist timer was fired (frame 16, tp->tt_msg->tt_prev_tasks).
- The TCP segment that triggered the panic has SYN|ACK set (frame 17,
th->th_flags). This TCP segment is considered valid (frame 17, list),
so tp->t_state was SYN_SENT.

This explains why snd_nxt is less than snd_una:
If tcp_output() is called by the persist timer, the persist timer is
active and "forced" is turned on, which causes snd_nxt not to be
updated at all.

MISSING LINK IN THE CHAIN:
How did the persist timer get set in SYN_SENT in the first place?

* All vm_page_deactivate() calls now operate with the caller holding
the page PG_BUSY or the page known not to be busy. Reorder several
cases where a vm_page is unbusied prior to calling deactivate.

* vm_page_cache() now expects the vm_page to be PG_BUSY and will cache
the page and clear the bit.

* Fix a race in vm_pageout_page_free() which calls vm_object_reference()
with an unbusied vm_page, then proceeds to busy and free the page.
The problem is that vm_object_reference() can block on vmobj_token.

This may fix the x86-64 seg-fault issue. Or it may not (throws up hands).

* Count the number of times the idle thread is entered on a cpu without
switching to a non-idle thread. Use the fast-halt (non-ACPI) until the
count exceeds a reasonable machdep.cpu_idle_repeat.

This improves the default performance to levels closer to cpu_idle_hlt
mode 1 but still gives us the power savings from mode 3. Performance is
improved significantly because many threads on SMP boxes are event
or pipe oriented and only sleep for short periods of time or ping-pong
back and forth. For example, a cc -pipe, or typical kernel threads
blocking on tokens or locks for short periods of time.

* Adjust machdep.cpu_idle_hlt modes:

0 Never halt, the idle thread just spins.

1 Always use a fast HLT/MONITOR/MWAIT

2 Hybrid approach: use (1) up to a certain point, then use (3).
(this is the default)

* Adjust the FIFO contention resequencer (which deals with spinning
on tokens) to use MONITOR/MWAIT when available instead of DELAY(1)
when waiting to become the head of the queue.

* Adjust the x86-64 idle loop to use MONITOR/MWAIT when available when
the idle halt mode (machdep.cpu_idle_hlt) is set to 1. This
significantly improves performance for event-oriented programs, including
compile pipelines.

NOTE: On the 48-core monster setting machdep.cpu_idle_hlt to 1 improves
performance but at the cost of an additional 200W of power at idle vs
the default value of 2 (ACPI idle halt). Look for a hybrid approach in
a future commit.

Provides cpu_mmw_spin() and cpu_mmw_mwait(), both of which wait for a given
memory cell to contain a value different from an expected value. _spin()
merely spins on the cell; _mwait() uses the SSE3 MONITOR/MWAIT insns.

* Change the LWKT scheduler's token spinning algorithm. It used to
DELAY a short period of time and then simply retry, creating a lot
of contention between cpus trying to acquire a token.

Now the LWKT scheduler uses a FIFO index mechanic to resequence the
contending cpus into 1uS retry slots using essentially just
atomic_fetchadd_int(), so it is very cache friendly. The spin-retry
thus has a bounded cache management traffic load regardless of
the number of cpus and contending cpus will not be tripping over
each other.

The new algorithm slightly regresses 4-cpu operation (~5% under heavy
contention) but significantly improves 48-cpu operation. It is also
flexible enough for further work down the road. The old algorithm
simply did not scale very well.

Add three sysctls:

sysctl lwkt.spin_method=1

0 Allow a user thread to be scheduled on a cpu while kernel
threads are contended on a token, using the IPI mechanic
to interrupt the user thread and reschedule on decontention.

This can potentially result in excessive IPI traffic.

1 Allow a user thread to be scheduled on a cpu while kernel
threads are contended on a token, reschedule on the next clock
tick (100 Hz typically). Decontention will NOT generate
any IPI traffic. DEFAULT.

2 Do not allow a user thread to be scheduled on a cpu while
kernel threads are contended. Should not be used normally,
for debugging only.

sysctl lwkt.spin_delay=1

Slot time in microseconds, default 1uS. Recommended values are
1 or 2 but not longer.

sysctl lwkt.spin_loops=10

Number of times the LWKT scheduler loops on contended threads
before giving up and allowing an idle-thread HLT. Waking up from
the HLT requires a decontention IPI, so you do not want to set
this value too small. Values between 10 and 100 are recommended.

* Redo the token decontention algorithm. Use a new gd_reqflags flag,
RQF_WAKEUP, coupled with RQF_AST_LWKT_RESCHED in the per-cpu globaldata
structure to determine which cpus actually need to be IPIed on token
decontention (to wake up their idle threads stuck in HLT).

This requires that all gd_reqflags operations use locked atomic
instructions rather than non-locked instructions.

* Decontention IPIs are a last-gasp effort if the LWKT scheduler has spun
too many times. Under normal conditions, even under heavy contention,
actual IPIing should be minimal.

* This commit introduces the necessary support for utmpx, wtmpx and
lastlogx, as well as updating many base utils to work with these
while mostly maintaining compatibility with the old utmp, wtmp and
lastlog.

* The new last(1) supports wtmpx but defaults to wtmp as not all wtmp
writers have been updated for wtmpx.

* All utmp readers support both utmp and utmpx now.

* lastlogin (the only lastlog reader) supports both lastlog and
lastlogx.

* The utils who(1) and finger(1) have been almost directly replaced by
their NetBSD equivalents. In the case of who(1) the only custom
modification is the behaviour of '-b', which is kept as it has always been.

* Add a kern.proc.cwd sysctl to be able to get the cwd of processes.
Many terminal emulators need something like this to set a window title.
Right now, some simply don't work (konsole) and others (Terminal) use
linprocfs.

* Replace the cpu_contention_mask global with a per-token contention mask.

* Fold the lwkt.user_pri_sched feature into the scheduler and remove the
sysctl. The feature is now always on. The feature allows a lower
priority non-contending thread to be scheduled in the face of a
high-priority contending thread that would normally spin in the scheduler.

* A reschedule IPI is now performed when a high-priority contending thread
might possibly resolve, which will kick the user process back into the
kernel and allow rescheduling of the higher priority thread.

* Change the idle-cpu check semantics. When a cpu's scheduler finds only
contending threads it used to loop in the scheduler and the idle thread
would be flagged to not halt. We now allow the idle thread to halt in
this case and expect to receive an IPI when any of the contending threads
might possibly resolve.

As a fringe benefit this should also benefit vkernels.

* lwkt_schedule() has been significantly simplified. Or as I would say,
decomplexified.

* Rearrange the way the scheduler helpers are woken up. This results in
much better coverage on systems with large numbers of cpus.

Tested on the 48-core opteron monster.

* Essentially we no longer do bogus wakeups of scheduler helpers, and the
chaining has been fixed when a scheduler helper is unable to perform
the operation it was scheduled for (it tries to find another idle cpu
to forward to).

Most cpumask operations are now performed while holding the spin lock.

* The userland scheduler was unconditionally calling lwkt_switch()
via userexit() (i.e. on every system call), creating unnecessary
overhead and possibly also triggering a bsd4 scheduler event
requiring a common spinlock.

* Rearrange the code slightly to reduce instances where lwkt_switch()
is called. We want to limit those to instances where a higher priority
LWKT thread is potentially runnable or where the LWKT fairq accumulator
for the current thread has been exhausted.

* This removes system call overhead multiplication on MP systems. For
example, on a 48-core box system call overhead when all 48 cpus are
busy doing getuid() loops went from 10uS back down to 270nS (which
is near the single-cpu test results).