Reorganize the way machine architectures are handled. Consolidate the
kernel configurations into a single generic directory. Move machine-specific
Makefiles and loader scripts into the appropriate architecture directory.
Kernel and module builds also generally add sys/arch to the include path so
source files that include architecture-specific headers do not have to
be adjusted.
sys/<ARCH> -> sys/arch/<ARCH>
sys/conf/*.<ARCH> -> sys/arch/<ARCH>/conf/*.<ARCH>
sys/<ARCH>/conf/<KERNEL> -> sys/config/<KERNEL>

Reformulate the way the kernel updates the PMAPs in the system when adding
a new page table page to expand kernel memory. Keep track of the PMAPs in
their own list rather than scanning the process list to locate them. This
allows PMAPs managed on behalf of virtual kernels to be properly updated.
VM spaces can now be allocated from scratch and may not have a parent
template to inherit certain fields from. Make sure these fields are
properly cleared.

MAP_VPAGETABLE support part 2/3.
Implement preliminary virtual page table handling code in vm_fault. This
code is strictly temporary so subsystem and userland interactions can be
tested, but the real code will be very similar.

Consolidate the initialization of td_mpcount into lwkt_init_thread().
Fix a bug in kern.trap_mpsafe, the mplock was not being properly released
when operating in vm86 mode (when kern.trap_mpsafe was set to 1).

Make tsleep/wakeup() MP SAFE for kernel threads and get us closer to
making it MP SAFE for user processes. Currently the code is operating
under the rule that access to a thread structure requires cpu locality of
reference, and access to a proc structure requires the Big Giant Lock. The
two are not mutually exclusive so, for example, tsleep/wakeup on a proc
needs both cpu locality of reference *AND* the BGL. This was true with the
old tsleep/wakeup and has now been documented.
The new tsleep/wakeup algorithm is quite simple in concept. Each cpu has its
own ident based hash table and each hash slot has a cpu mask which tells
wakeup() which cpus might have the ident. A wakeup iterates through all
candidate cpus simply by chaining the IPI message through them until either
all candidate cpus have been serviced, or (with wakeup_one()) the requested
number of threads have been woken up.
Other changes made in this patch set:
* The sense of P_INMEM has been reversed. It is now P_SWAPPEDOUT. Also,
P_SWAPPING and P_SWAPINREQ are no longer relevant and have been removed.
* The swapping code has been cleaned up and seriously revamped. The new
swapin code staggers swapins to give the VM system a chance to respond
to new conditions. Also some lwp-related fixes were made (more
p_rtprio vs lwp_rtprio confusion).
* As mentioned above, tsleep/wakeup have been rewritten. The process
p_stat no longer does crazy transitions from SSLEEP to SSTOP. There is
now only SSLEEP and SSTOP is synthesized from P_SWAPPEDOUT for userland
consumption. Additionally, tsleep() with PCATCH will NO LONGER STOP THE
PROCESS IN THE TSLEEP CALL. Instead, the actual stop is deferred until
the process tries to return to userland. This removes all remaining cases
where a stopped process can hold a locked kernel resource.
* A P_BREAKTSLEEP flag has been added. This flag indicates when an event
occurs that is allowed to break a tsleep with PCATCH. All the weird
undocumented setrunnable() rules have been removed and replaced with a
very simple algorithm based on this flag.
* Since the UAREA is no longer swapped, we no longer faultin() on PHOLD().
This also incidentally fixes the 'ps' command's tendency to try to swap
all processes back into memory.
* speedup_syncer() no longer does hackish checks on proc0's tsleep channel
(td_wchan).
* Userland scheduler acquisition and release has now been tightened up and
KKASSERT's have been added (one of the bugs Stefan found was related
to an improper lwkt_schedule() and was caught by one of the new assertions).
We also have added other assertions related to expected conditions.
* A serious race in pmap_release_free_page() has been corrected. We
no longer couple the object generation check with a failed
pmap_release_free_page() call. Instead the two conditions are checked
independently. We no longer loop when pmap_release_free_page() succeeds
(it is unclear how that could ever have worked properly).
Major testing by: Stefan Krueger <skrueger@meinberlikomm.de>

Allow 'options SMP' *WITHOUT* 'options APIC_IO'. That is, an ability to
produce an SMP-capable kernel that uses the PIC/ICU instead of the IO APICs
for interrupt routing.
SMP boxes with broken BIOSes (namely my Shuttle XPC SN95G5) could very well
have serious interrupt routing problems when operating in IO APIC mode.
One solution is to not use the IO APICs. That is, to run only the Local
APICs for the SMP management.
* Don't conditionalize NIDT. Just set it to 256.
* Make the ICU interrupt code MP SAFE. This primarily means using the
imen_spinlock to protect accesses to icu_imen.
* When running SMP without APIC_IO, set the LAPIC TPR to prevent unintentional
interrupts. Leave LINT0 enabled (normally with APIC_IO LINT0 is disabled
when the IO APICs are activated). LINT0 is the virtual wire between the
8259 and LAPIC 0.
* Get rid of NRSVIDT. Just use IDT_OFFSET instead.
* Clean up all the APIC_IO tests which should have been SMP tests, and all
the SMP tests which should have been APIC_IO tests. Explicitly #ifdef
out all code related to the IO APICs when APIC_IO is not set.

ICU/APIC cleanup part 1/many.
Move ICU and APIC support files into their own subdirectory, bump the
required config version for the build since this move also requires the
use of the new arch/ symlink.

Userland 1:1 threading changes step 1/4+:
o Move thread-local members from struct proc into new struct lwp.
o Add a LIST_HEAD(lwp) p_lwps to struct proc. This links a proc
with its lwps.
o Add a td_lwp member to struct thread which links a thread to its lwp,
if it exists. This won't replace td_proc completely to save indirections.
o For now embed one struct lwp into struct proc and set up preprocessor
linkage so that semantics don't change for the rest of the kernel.
Once all consumers are converted to take a struct lwp instead of a struct
proc, this will go away.
Reviewed-by: dillon, davidxu

Remove spl*() calls from i386, replacing them with critical sections.
Leave spl support intact for the moment (it will be removed soon). Adjust
the interrupt mux to use a critical section for 'old' interrupt handlers
not using the new serialization API (which is nearly all of them at the
moment).

Try to close an occasional VM page related panic that is believed to occur
due to the VM page queues or free lists being indirectly manipulated by
interrupts that are not protected by splvm(). Do this by replacing splvm()'s
with critical sections in a number of places.
Note: some of this work bled over into the "VFS messaging/interfacing work
stage 8/99" commit.

Add a stack-size argument to the LWKT threading code so threads can be
created with different-sized stacks. Adjust libcaps to match.
This is a prerequisite to adding NDIS support. NDIS threads need larger
stacks because Microsoft drivers expect larger stacks.

Merge from FreeBSD, RELENG_4 branch, revision 1.250.2.26.
--- original commit message ---
Log:
There is a comma missing in the table initializing the
pmap_prefault_pageorder array. This has two effects:
1. The resulting bogus contents of the array thwarts part of
the optimization effect pmap_prefault() is supposed to have.
2. The resulting array is only 7 elements long (auto-sized), while
pmap_prefault() expects it to be the intended 8 elements. So
this function in fact accesses memory beyond the end of the array.
Fortunately though, if the data at this location is out of bounds
it will be ignored.
This bug dates back more than 6 years. It has been introduced
in revision 1.178.
Submitted by: Uwe Doering <gemini@geminix.org>
PR: 67460
--- original commit message ---

Remove an unimplemented advisory function, pmap_pageable(); there is
no pmap implementation in existence requires it.
Discussed-with: Alan Cox <alc at freebsd.org>,
Matthew Dillon <dillon at backplane.com>

Mask bits properly for pte_prot() in case it is called with additional
VM_PROT_ bits.
Fix a wired memory leak bug in pmap_enter(). If a page wiring change is
made and the page has already been faulted in for read access, and a
write-fault occurs, pmap_enter() was losing track of the wiring count in
the pmap when it tried to optimize the RO->RW case in the page table.
This prevented the page table page from being freed and led to a memory leak.
The case is easily reproducible if you attempt to wire the data/bss crossover
page in a program (typically just declare a global variable in a small program
and mlock() its page, then exit without munlock()ing). 4K is lost each time
the program is run.

Close an interrupt race between vm_page_lookup() and (typically) a
vm_page_sleep_busy() check by using the correct spl protection.
An interrupt can occur inbetween the two operations and unbusy/free
the page in question, causing the busy check to fail and for the code
to fall through and then operate on a page that may have been freed
and possibly even reused. Also note that vm_page_grab() had the same
issue between the lookup, busy check, and vm_page_busy() call.
Close an interrupt race when scanning a VM object's memq. Interrupts
can free pages, removing them from memq, which interferes with memq scans
and can cause a page unassociated with the object to be processed as if it
were associated with the object.
Calls to vm_page_hold() and vm_page_unhold() require spl protection.
Rename the passed socket descriptor argument in sendfile() to make the
code more readable.
Fix several serious bugs in procfs_rwmem(). In particular, force it to
block if a page is busy and then retry.
Get rid of vm_pager_map_page() and vm_pager_unmap_page(), make the functions
that used to use these routines use SFBUF's instead.
Get rid of the (userland?) 4MB page mapping feature in pmap_object_init_pt()
for now. The code appears to not track the page directory properly and
could result in a non-zero page being freed as PG_ZERO.
This commit also includes updated code comments and some additional
non-operational code cleanups.

Another major mmx/xmm/FP commit. This is a combination of several patches
but since the earlier patches didn't actually fix the crashing and corruption
issues we were seeing everything has been rolled into one well tested commit.
Make the FP more deterministic by requiring that npxthread and the FP state
be properly synchronized, and that the FP be in a 'safe' state (meaning
that mmx/xmm registers be usable) when npxthread is NULL. Allow the FP
save area to be revectored. Kernel entities which use the FP unit,
such as the bcopy code, must save the app state if it hasn't already been
saved, then revector the save area.
Note that combinations of operations must be protected by a critical section
or interrupt disablement. Any clearing or setting npxthread combined with
an fxsave/fnsave/frstor/fxrstor/fninit must be protected as an atomic entity.
Since interrupts are threads and can preempt, such preemption will cause
a thread switch to occur and thus cause npxthread and the FP state to be
manipulated. The kernel can only depend on the FP state being stable for its
use after it has revectored the FP save area.
This commit fixes a number of issues, including potential filesystem
corruption and kernel crashes.

Commit an update to the pipe code that implements various pipe algorithms.
Note that the newer algorithms are either experimental or only exist for
testing purposes. The default remains the same (sfbuf mode), which is
considered to be stable. The code is just too useful not to commit it.
Add pmap_qenter2() for installing cpu-localized KVM mappings.
Add pmap_page_assertzero() which will be used in a later diagnostic commit.

Correct a bug in the last FPU optimized bcopy commit. The user FPU state
was being corrupted by interrupts.
Fix the bug by implementing a feature described as missing in the original
FreeBSD comments... add a pointer to the FP saved state in the thread
structure so routines which 'borrow' the FP unit can simply revector the
pointer temporarily to avoid corruption of the original user FP state.
The MMX_*_BLOCK macros in bcopy.s have also been simplified somewhat. We
can simplify them even more (in the future) by reserving FPU save space in
the per-cpu structure instead of on the stack.

Enhance the pmap_kenter*() API and friends, separating out entries which
only need invalidation on the local cpu against entries which need invalidation
across the entire system, and provide a synchronization abstraction.
Enhance sf_buf_alloc() and friends to allow the caller to specify whether the
sf_buf's kernel mapping is going to be used on just the current cpu or
whether it needs to be valid across all cpus. This is done by maintaining
a cpumask of known-synchronized cpus in the struct sf_buf.
Optimize sf_buf_alloc() and friends by removing both TAILQ operations in the
critical path. TAILQ operations to remove the sf_buf from the free queue
are now done in a lazy fashion. Most sf_buf operations allocate a buf,
work on it, and free it, so why waste time moving the sf_buf off the freelist
if we are only going to move it back onto the free list a microsecond later?
Fix a bug in sf_buf_alloc() code as it was being used by the PIPE code.
sf_buf_alloc() was unconditionally using PCATCH in its tsleep() call, which
is only correct when called from the sendfile() interface.
Optimize the PIPE code to require only local cpu_invlpg()'s when mapping
sf_buf's, greatly reducing the number of IPIs required. On a DELL-2550,
a pipe test which explicitly blows out the sf_buf caching by using huge
buffers improves from 350 to 550 MBytes/sec. However, note that buildworld
times were not found to have changed.
Replace the PIPE code's custom 'struct pipemapping' structure with a
struct xio and use the XIO API functions rather than its own.

Newtoken commit. Change the token implementation as follows: (1) Obtaining
a token no longer enters a critical section. (2) Tokens can be held through
scheduler switches and blocking conditions and are effectively released and
reacquired on resume. Thus tokens serialize access only while the thread
is actually running. Serialization is not broken by preemptive interrupts.
That is, interrupt threads which preempt do not release the preempted thread's
tokens. (3) Unlike spl's, tokens will interlock w/ interrupt threads on
the same or on a different cpu.
The vnode interlock code has been rewritten and the API has changed. The
mountlist vnode scanning code has been consolidated and all known races have
been fixed. The vnode interlock is now a pool token.
The code that frees unreferenced vnodes whose last VM page has been freed has
been moved out of the low level vm_page_free() code and moved to the
periodic filesystem syncer code in vfs_msync().
The SMP startup code and the IPI code has been cleaned up considerably.
Certain early token interactions on AP cpus have been moved to the BSP.
The LWKT rwlock API has been cleaned up and turned on.
Major testing by: David Rhodus

Synchronize a bunch of things from FreeBSD-5 in preparation for the new
ACPICA driver support.
* Bring in a lot of new bus and pci DEV_METHODs from FreeBSD-5
* split apic.h into apicreg.h and apicio.h
* rename INTR_TYPE_FAST -> INTR_FAST and move the #define
* rename INTR_TYPE_EXCL -> INTR_EXCL and move the #define
* rename some PCIR_ registers and add additional macros from FreeBSD-5
* note: new pcib bus call, host_pcib_get_busno() imported.
* kern/subr_power.c no longer optional.
Other changes:
* machine/smp.h and machine/smptests.h can now be #included unconditionally,
and some APIC_IO vs SMP separation has been done as well.
* gd_acpi_id and gd_apic_id added to machine/globaldata.h prep for new
ACPI code.
Despite all the changes, the generated code should be virtually the same.
These were mostly additions which the pre-existing code does not (yet) use.

Introduce an MI cpu synchronization API, redo the SMP AP startup code,
and start cleaning up deprecated IPI and clock code. Add a MMU/TLB page
table invalidation API (pmap_inval.c) which properly synchronizes page
table changes with other cpus in SMP environments.
* removed (unused) gd_cpu_lockid
* remove confusing invltlb() and friends, normalize use of cpu_invltlb()
and smp_invltlb().
* redo the SMP AP startup code to make the system work better in
situations where all APs do not startup.
* add memory barrier API, cpu_mb1() and cpu_mb2().
* remove (obsolete, no longer used) old IPI hard and stat clock forwarding
code.
* add a cpu synchronization API which is capable of handling multiple
simultaneous requests without deadlocking or livelocking.
* major changes to the PMAP code to use the new invalidation API.
* remove (unused) all_procs_ipi() and self_ipi().
* only use all_but_self_ipi() if it is known that all AP's started up,
otherwise use a mask.
* remove (obsolete, no longer used) BETTER_CLOCK code
* remove (obsolete, no longer used) Xcpucheckstate IPI code
Testing-by: David Rhodus and others

Retool the M_* flags to malloc() and the VM_ALLOC_* flags to
vm_page_alloc(), and vm_page_grab() and friends.
The M_* flags now have more flexibility, with the intent that we will start
using some of it to deal with NULL pointer return problems in the codebase
(CAM is especially bad at dealing with unexpected return values). In
particular, add M_USE_INTERRUPT_RESERVE and M_FAILSAFE, and redefine
M_NOWAIT as a combination of M_ flags instead of its own flag.
The VM_ALLOC_* macros are now flags (0x01, 0x02, 0x04) rather than states
(1, 2, 3), which allows us to create combinations that the old interface
could not handle.

CAPS IPC library stage 1/3: The core CAPS IPC code, providing system calls
to create and connect to named rendezvous points. The CAPS interface
implements a many-to-1 (client:server) capability and is totally self
contained. The messaging is designed to support single and multi-threading,
synchronous or asynchronous (as of this commit: polling and synchronous only).
Message data is 100% opaque and so while the intention is to integrate it into
a userland LWKT messaging subsystem, the actual system calls do not depend
on any LWKT structures.
Since these system calls are experimental and may contain root holes,
they must be enabled via the sysctl kern.caps_enabled.

USER_LDT is now required by a number of packages as well as our upcoming
user threads support. Make it non-optional.
USER_LDT breaks SysV emulated sysarch(... SVR4_SYSARCH_DSCR) support.
For now just #if 0 out the support (which is what FreeBSD-5.x does).
Submitted-by: Craig Dooley <craig@xlnx-x.net>

Fix the pt_entry_t and pd_entry_t types. They were previously pointers to
integers which is completely bogus. What they really represent are page
table entries so define them as __uint32_t. Also add a vtophys_pte()
macro to distinguish between physical addresses (vm_paddr_t) and
physical addresses represented in PTE form (pt_entry_t). vm_paddr_t can
be 64 bits even on IA32 boxes without PAE which use 32 bit PTE's.
Taken loosely from: FreeBSD-4.x

64 bit address space cleanups which are a prerequisite for future 64 bit
address space work and PAE. Note: this is not PAE. This patch basically
adds vm_paddr_t, which represents a 'physical address'. Physical addresses
may be larger than virtual addresses and on IA32 we make vm_paddr_t a 64
bit quantity.
Submitted-by: Hiten Pandya <hmp@backplane.com>

Do a bit of Ansification, add some pmap assertions to catch the
improper use of certain pmap functions from an interrupt, similar to
FreeBSD-5, rewrite a number of comments, and surround some of the
pmap functions which manipulate per-cpu CMAPs with critical sections.

MP Implementation 3/4: MAJOR progress on SMP, full userland MP is now working!
A number of issues relating to MP lock operation have been fixed, primarily
that we have to read %cr2 before get_mplock() since get_mplock() may switch
away. Idlethreads can now safely HLT without any performance detriment.
The userland scheduler has been almost completely rewritten and is now
using an extremely flexible abstraction with a lot of room to grow. pgeflag
has been removed from mapdev (without per-page invalidation it isn't safe
to use PG_G even on UP). Necessary locked bus cycles have been added for
the pmap->pm_active field in swtch.s. CR3 has been unoptimized for the
moment (see comment in swtch.s). Since the switch code runs without the
MP lock we have to adjust pm_active PRIOR to loading %cr3.
Additional sanity checks have been added to the code (see PARANOID_INVLTLB
and ONLY_ONE_USER_CPU in the code), plus many more in kern_switch.c.
A passive release mechanism has been implemented to optimize P_CURPROC/lwkt
priority shifting when going from user->kernel and kernel->user.
Note: preemptive interrupts don't care due to the way preemption works so
no additional complexity there. non-locking atomic functions to protect
only against local interrupts have been added. astpending now uses
non-locking atomic functions to set and clear bits. private_tss has been
moved to a per-cpu variable. The LWKT thread module has been considerably
enhanced and cleaned up, including some fixes to handle MPLOCKED vs td_mpcount
races (so eventually we can do MP locking without a pushfl/cli/popfl combo).
stopevent() needs critical section protection, maybe.

MP Implementation 1/2: Get the APIC code working again, sweetly integrate the
MP lock into the LWKT scheduler, replace the old simplelock code with
tokens or spin locks as appropriate. In particular, the vnode interlock
(and most other interlocks) are now tokens. Also clean up a few curproc/cred
sequences that are no longer needed.
The APs are left in a degenerate state with non-IPI interrupts disabled as
additional LWKT work must be done before we can really make use of them,
and FAST interrupts are not managed by the MP lock yet. The main thing
for this stage was to get the system working with an APIC again.
buildworld tested on UP and 2xCPU/MP (Dell 2550)

thread stage 10: (note stage 9 was the kern/lwkt_rwlock commit). Clean up
thread and process creation functions. Check the spl against ipending in
cpu_lwkt_restore (so the idle loop does not lockup the machine). Remove
the old VM object kstack allocation and freeing code. Leave newly created
processes in a stopped state to fix wakeup/fork_handler races. Normalize
the lwkt_init_*() functions.
Add a sysctl debug.untimely_switch which will cause the last crit_exit()
to yield, which causes a task switch to occur in wakeup() and catches a
lot of 4.x-isms that can be found and fixed on UP.

thread stage 8: add crit_enter(), per-thread cpl handling, fix deferred
interrupt handling for critical sections, add some basic passive token code,
and blocking/signaling code. Add structural definitions for additional
LWKT mechanisms.
Remove asleep/await. Add generation number based xsleep/xwakeup.
Note that when exiting the last crit_exit() we run splz() to catch up
on blocked interrupts. There is also some #if 0'd code that will cause
a thread switch to occur 'at odd times'... primarily wakeup()->
lwkt_schedule()->critical_section->switch. This will be useful for testing
purposes down the line.
The passive token code is mostly disabled at the moment. It's primary use
will be under SMP and its primary advantage is very low overhead on UP and,
if used properly, should also have good characteristics under SMP.

thread stage 7: Implement basic LWKTs, use a straight round-robin model for
the moment. Also continue consolidating the globaldata structure so both UP
and SMP use it with more commonality. Temporarily match user processes up
with scheduled LWKTs on a 1:1 basis. Eventually user processes will have
LWKTs, but they will not all be scheduled 1:1 with the user process's
runnability.
With this commit work can potentially start to fan out, but I'm not ready
to announce yet.

thread stage 6: Move thread stack management from the proc structure to
the thread structure, cleanup the pmap_new_*() and pmap_dispose_*()
functions, and disable UPAGES swapping (if we eventually separate the kstack
from the UPAGES we can reenable it). Also LIFO/4 cache thread structures
which improves fork() performance by 40% (when used in typical fork/exec/exit
or fork/subshell/exit situations).

thread stage 4: remove curpcb, use td_pcb reference instead. Move the pcb
to the end of the thread stack, and note that a pcb will always exist because
a thread context will always exist. Also note that vm86 replaces td_pcb
temporarily and we really need to rip that out and instead make a copy on
the stack, because assumptions are made in regards to the pcb's location.