Currently the passed-in address is copied into newly allocated
memory (grr, an additional blocking kmalloc), and PRUS_FREEADDR
is set so that the protocol thread knows when to free the
address.

Before this change netperf UDP_STREAM (unconnected socket) could
only do ~200Kpps (w/ -m 18); now it can do ~990Kpps (w/ -m 18).
This gives ~500% performance improvement for tiny UDP packet TX.
The improvement is not as large as for the connected socket case
(~600%), mainly because of the additional memory allocation for
the address. We _may_ be able to optimize out the address
allocation later.

There is no performance impact on the most commonly used sockets:
- IPv4/IPv6 TCP implement pru_savefaddr, so their pru_accept will not
  be called at all
- UNIX domain sockets use a sync msgport, so there is no protocol thread
  dispatch

* Reorder the vnode ref/rele sequence in the exec path so p_textvp
  remains in a valid state while it is being initialized.

* Remove the vm_exitingcnt test in exec_new_vmspace(). Release
  various resources unconditionally on the last exiting thread regardless
  of the state of exitingcnt. This just moves some of the resource
  releases out of the wait*() system call path and back into the exit*()
  path.

* Implement a hold/drop mechanic for vmspaces and use it in procfs_rwmem(),
  vmspace_anonymous_count(), vmspace_swap_count(), and various other
  places.

This does a better job protecting the vmspace from deletion while various
unrelated third parties might be trying to access it.

* Implement vmspace_free() for other code to call instead of having
  callers use sysref_put() directly. Interlock with a vmspace_hold() so
  final termination processing always keys off the vm_holdcount.
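The hold/drop interlock above can be sketched in userspace C. This is a hypothetical illustration, not DragonFly's actual code: the names (vmspace_sketch, vmspace_hold_sketch, and so on) are made up, and C11 atomics stand in for the kernel's atomic ops and sysref machinery.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Sketch of the hold/drop idea: a transient hold count kept next to
 * the primary reference count.  Final termination keys off the hold
 * count, so a third party holding the structure stalls destruction.
 */
struct vmspace_sketch {
    atomic_int ref_count;   /* primary references (sysref in the kernel) */
    atomic_int hold_count;  /* transient holds by third parties */
    bool       terminated;  /* set only once both counts are gone */
};

void vmspace_hold_sketch(struct vmspace_sketch *vm)
{
    atomic_fetch_add(&vm->hold_count, 1);
}

/* Returns true when this drop terminated a dead vmspace. */
bool vmspace_drop_sketch(struct vmspace_sketch *vm)
{
    if (atomic_fetch_sub(&vm->hold_count, 1) == 1 &&
        atomic_load(&vm->ref_count) == 0) {
        vm->terminated = true;  /* final termination would run here */
        return true;
    }
    return false;
}

/* vmspace_free() analogue: drop the primary ref under a hold. */
void vmspace_free_sketch(struct vmspace_sketch *vm)
{
    vmspace_hold_sketch(vm);
    atomic_fetch_sub(&vm->ref_count, 1);
    vmspace_drop_sketch(vm);    /* may trigger termination */
}

/* Demo: last primary ref dropped while a third party still holds it. */
int vmspace_sketch_demo(void)
{
    struct vmspace_sketch vm = { 1, 0, false };
    vmspace_hold_sketch(&vm);   /* third party (e.g. procfs_rwmem) */
    vmspace_free_sketch(&vm);   /* last primary reference goes away */
    if (vm.terminated)
        return -1;              /* terminated too early: bug */
    vmspace_drop_sketch(&vm);   /* third party lets go */
    return vm.terminated ? 0 : -1;
}
```

The demo shows the point of the interlock: dropping the last primary reference while a hold is outstanding does not terminate the structure; termination fires only when the last hold is released.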

* Implement vm_object_allocate_hold() and use it in a few places in order
to allow OBJT_SWAP objects to be allocated atomically, so other third
parties (like the swapcache cleaning code) can't wiggle their way in
and access a partially initialized object.

* Reorder the vmspace_terminate() code and introduce some flags to ensure
that resources are terminated at the proper time and in the proper order.

The trick to avoid rebuilding 6 object files unnecessarily ended up
breaking the gold linker of binutils 2.22. This commit takes them
off of the gold library and back into the gold program, and likewise
into the incremental-dump program. Until somebody shows me how the
latter program can use object files from the former, we're just going
to build them twice.

* During a [v]fork/exec sequence the exec will replace the VM space of the
target process. A concurrent 'ps' operation could access the target
process's vmspace as it was being ripped out, resulting in memory
corruption.

* The P_INEXEC test in procfs was insufficient, the exec code itself must
also wait for procfs's PHOLD() on the process to go away before it can
proceed. This should properly interlock the entire operation.

* Can occur with both procfs and non-procfs ps operations (via the proc
  sysctls).

* Possibly related to the seg-fault issue we have where the user stack gets
corrupted.

* Also revamp PHOLD()/PRELE() and add PSTALL(), changing all manual while()
loops waiting on p->p_lock to use PSTALL().

These functions now integrate a wakeup request flag into p->p_lock
using atomic ops and no longer tsleep() for 1 tick (or hz ticks, or
whatever). Wakeups are issued proactively.
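The integrated wakeup-request flag can be sketched in userspace C. Again a hypothetical illustration, not the kernel's code: the flag value, the `_sketch` names, and the `wakeups` counter (standing in for a real wakeup()) are all made up, and C11 atomics replace the kernel's atomic ops and tsleep().

```c
#include <stdatomic.h>

#define PLOCK_WAITING 0x40000000u   /* hypothetical wakeup-request flag */
#define PLOCK_MASK    0x3fffffffu   /* hold count occupies the low bits */

struct proc_sketch {
    atomic_uint p_lock;     /* hold count plus the waiting flag */
    int         wakeups;    /* stands in for wakeup(&p->p_lock) */
};

void phold_sketch(struct proc_sketch *p)
{
    atomic_fetch_add(&p->p_lock, 1);
}

void prele_sketch(struct proc_sketch *p)
{
    unsigned v = atomic_load(&p->p_lock);
    for (;;) {
        if ((v & PLOCK_MASK) == 1 && (v & PLOCK_WAITING)) {
            /* Last hold with a waiter: clear both, wake proactively. */
            if (atomic_compare_exchange_weak(&p->p_lock, &v, 0)) {
                p->wakeups++;       /* kernel would wakeup() here */
                return;
            }
        } else if (atomic_compare_exchange_weak(&p->p_lock, &v, v - 1)) {
            return;
        }
        /* v was reloaded by the failed CAS; retry */
    }
}

/* PSTALL() analogue: register the flag instead of polling every tick. */
int pstall_sketch(struct proc_sketch *p)
{
    unsigned v = atomic_load(&p->p_lock);
    while (v & PLOCK_MASK) {
        if (atomic_compare_exchange_weak(&p->p_lock, &v, v | PLOCK_WAITING))
            return 1;   /* kernel would tsleep() until the wakeup */
    }
    return 0;           /* no holders, no need to sleep */
}

/* Demo: a waiter registers its flag; the last release wakes it. */
int pstall_demo(void)
{
    struct proc_sketch p = { 1, 0 };    /* one outstanding hold */
    pstall_sketch(&p);                  /* waiter sets PLOCK_WAITING */
    prele_sketch(&p);                   /* last PRELE issues the wakeup */
    return p.wakeups;
}
```

The win over the old scheme is that the releaser sees the flag in the same word it is already modifying, so the wakeup is issued exactly when needed instead of the waiter re-polling every tick.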

* Use an atomic cmpxchg to set the cpu bit in the pmap->pm_active bitmap
  AND test the pmap interlock bit at the same time, instead of testing
  the interlock bit afterwards.

* In addition, if we find the lock bit set and must spin-wait for it to
clear, we skip the %cr3 comparison check and unconditionally load %cr3.

* It is unclear if the race could be realized in any way. It was probably
not responsible for the seg-fault issue as prior tests with an unconditional
load of %cr3 did not fix the problem. Plus in the same-%cr3-as-last-thread
case the cpu bit is already set so there should be no possibility of
losing a TLB interlock IPI (and %cr3 is loaded unconditionally when it
doesn't match, so....).
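The set-and-test-in-one-operation idea can be sketched in userspace C. This is a hypothetical illustration, not the pmap code: the interlock bit value and the `_sketch` names are invented, and C11 atomics stand in for the kernel's cmpxchg.

```c
#include <stdatomic.h>

#define PMAP_INTERLOCK 0x80000000u  /* hypothetical interlock bit */

/*
 * Set our cpu's bit in pm_active and sample the interlock bit from
 * the same atomic snapshot.  Returns nonzero if the interlock was
 * held, in which case the caller spin-waits and then loads %cr3
 * unconditionally rather than trusting the comparison check.
 */
int pmap_activate_sketch(atomic_uint *pm_active, unsigned cpu_bit)
{
    unsigned old = atomic_load(pm_active);
    while (!atomic_compare_exchange_weak(pm_active, &old, old | cpu_bit))
        ;   /* old is refreshed by the failed CAS; retry */
    return (old & PMAP_INTERLOCK) != 0;
}

/* Demo: one pmap with the interlock held, one without. */
int pmap_activate_demo(void)
{
    atomic_uint locked   = PMAP_INTERLOCK;
    atomic_uint unlocked = 0;
    int a = pmap_activate_sketch(&locked, 0x1);    /* must spin-wait */
    int b = pmap_activate_sketch(&unlocked, 0x1);  /* fast path */
    return (a == 1 && b == 0 &&
            atomic_load(&locked)   == (PMAP_INTERLOCK | 0x1) &&
            atomic_load(&unlocked) == 0x1);
}
```

Because the cmpxchg returns the whole prior word, the cpu bit is published and the interlock bit observed atomically; the old test-afterwards order left a window between the two.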

* fork() and vfork() allocate a new process, p2, initialize it, and add it
  to the allproc list as well as other lists.

* These functions failed to acquire p2's token; p2 becomes visible to the
  rest of the system the moment it is added to the allproc list. Even
  though p2's state is set to SIDL, that alone is insufficient protection.

Acquire the token prior to adding p2 to allproc and keep holding the token
until after we have finished initializing p2.

* We must also PHOLD()/PRELE() p2 around the start_forked_proc() call
to prevent it from getting ripped out from under us (if it exits
quickly and/or detaches itself from its parent).

* Possibly fixes the random seg-faulting issue we've seen under very heavy
fork/exec (parallel compile) loads on the 48-core monster.

- Add so_faddr to the socket, which records the accepted socket's foreign
  address. If it is set, kern_accept() will use it directly instead of
  calling the protocol-specific method to extract the foreign address.
- Add a protocol-specific method, pru_savefaddr, which saves the
  foreign address into socket.so_faddr if the necessary information is
  supplied. This protocol method will only be called in the protocol
  thread.
- Pass the foreign address to sonewconn() if possible, so the foreign
  address can be saved before the accepted socket is put onto the
  complete list.

Currently only IPv4/TCP implements pru_savefaddr.

This intends to address the following problems:
- Calling pru_accept directly from user context is not MPSAFE; we
  always race the socket.so_pcb check-then-use against the protocol
  thread's clear/free of socket.so_pcb, though the race window is too
  tiny to be hit. To make it MPSAFE, we should dispatch pru_accept to
  the protocol thread.
  If socket.so_faddr is set here, there is no race at all and nothing
  expensive, like putting the current user thread to sleep, will
  happen. However, if the socket is dropped while it still sits
  on the complete list, the error will not be delivered in a timely
  manner, i.e. accept(2) will not return the error, but a later
  read(2)/write(2) on the socket will.
- Calling pru_accept directly races against inpcb.inp_f{addr,port}
  being set up in the protocol thread, since inpcb.inp_f{addr,port} is
  set up _after_ the accepted socket is put onto the complete list.
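The so_faddr fast path can be sketched in userspace C. This is a hypothetical illustration of the flow, not the socket code: the structs, the `faddr_valid` field, and the `_sketch` names are all invented stand-ins.

```c
/* Toy stand-ins for struct sockaddr and struct socket. */
struct sockaddr_sketch {
    unsigned       addr;
    unsigned short port;
};

struct socket_sketch {
    int                    faddr_valid;  /* set once so_faddr is saved */
    struct sockaddr_sketch so_faddr;     /* cached foreign address */
};

/*
 * pru_savefaddr analogue: the protocol thread records the peer address
 * before the accepted socket goes onto the complete list.
 */
void pru_savefaddr_sketch(struct socket_sketch *so,
                          const struct sockaddr_sketch *faddr)
{
    so->so_faddr = *faddr;
    so->faddr_valid = 1;
}

/*
 * kern_accept() analogue: use the cached address when present.
 * Returns 0 on the fast path, 1 when a dispatch to the protocol
 * thread would be needed instead.
 */
int accept_faddr_sketch(struct socket_sketch *so,
                        struct sockaddr_sketch *out)
{
    if (so->faddr_valid) {
        *out = so->so_faddr;   /* no protocol-thread round trip */
        return 0;
    }
    return 1;                  /* would dispatch pru_accept here */
}

/* Demo: slow path before the save, fast path after. */
int accept_faddr_demo(void)
{
    struct socket_sketch   so   = { 0, { 0, 0 } };
    struct sockaddr_sketch peer = { 0x7f000001u, 12345 }, out;

    if (accept_faddr_sketch(&so, &out) != 1)
        return -1;                   /* expected: needs a dispatch */
    pru_savefaddr_sketch(&so, &peer);
    if (accept_faddr_sketch(&so, &out) != 0)
        return -1;                   /* expected: fast path */
    return out.port == 12345;
}
```

The ordering is the crux: because the save happens in the protocol thread before the socket reaches the complete list, the user-context accept never observes a half-initialized foreign address.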