Reorganize symbol table embedding. The existing option SYMTAB_SPACE is
replaced by the make option COPY_SYMTAB set to any value. The copy of
the symbol table is no longer put into a buffer in kern_ksyms.o, but a
small helper object. This object is build first with a dummy size, then
the kernel is linked to compute the real dimension of the symbol table
buffer. After that, the helper object is rebuild and the kernel linked
again.

Move the amd64 and i386 pcb to the bottom of the uarea, and move the
kernel stack to the top.
Change the pcb layouts so that fpu save area is at the end and is
64byte aligned ready for xsave (saving the ymm registers).
Welcome to 6.99.32

Minor fpu initialisation cleanups:
Set default CR) so that the FPU is enabled (unset CR0_EM) and initialise
i386_fpu_present to 1.
No need to call the npx trap indirectly, rename to fpunda() to match amd64.
Remove the i386_fpu_exception variable and sysctl (It used to indicate
which irq was used for fpu exceptions, but we only support 'internal'
now). Hopefully no one cares.
fpuinit() now only needs to clear TS before the fninit(). Apart from the
checks for 486SX and the 'fdiv bug' this matches the amd64 version.
Exclude fpuinit() from XEN kernels, they don't call it - which rather begs
the question as to whether it is needed at all!

Remove support for 'external' floating point units and the MS-DOS
compatible method of handling floating point exceptions.
Make kernel support for teh fpu non-optional (486SX should still work).
Only 386 cpus support external fpu, and i386 support was removed years ago.
This means that the npx code no longer uses port 0xf0 or interupt 13.
All the "npx at isa" lines go from the configs, arch/i386/isa/npx.c
is now mandatory for all i386 kernels.
I've renamed npxinit() to fpuinit() and npxinit_cpu() to fpuinit_cpu()
to match the very similar amd64 functions.
The fpu of the boot cpu is now initialised by a direct call from
cpu_configure(), this enables FP emulation for a 486SX.
(for amd64 the cr0 values are set in locore.S and similar).
This fixes a long-standing bug in linux_setregs() - which did not
save the fpu regsiters if they were active.
I've test booted a single cpu i386 kernel (using anita).
amd64 builds - none of teh changes should affect it.
The i386 XEN kernels build, but I'm not sure where they set cr0, and
it might have got lost!

Use the MI "pcu" framework for bookkeeping of npx/fpu states on x86.
This reduces the amount of MD code enormously, and makes it easier
to implement support for newer CPU features which require more fpu
state, or for fpu usage by the kernel.
For access to FPU state across CPUs, an xcall kthread is used now
rather than a dedicated IPI.
No user visible changes intended.

in osyscall, set the PSL_I bit into the correct field of the trapframe.
it was going into tf_eip instead of tf_eflags, which would sometimes
corrupt %eip and always return to user mode with interrupts disabled.
this was found with a netbsd 1.0 binary, and dsl@ points out that
this should also fix PR 41342.

Retire XEN_COMPAT_030001 as detailed on port-xen@:
http://mail-index.netbsd.org/port-xen/2012/06/25/msg007431.html
The xen_p2m API comes next.
ok bouyer@.
Tested on i386 PAE and amd64 (Xen 3.3 on private test bed, and
Xen 3.4 for Amazon EC2).
FWIW, Amazon always reported:
hypervisor0 at mainbus0: Xen version 3.4.3-kaos_t1micro
multiple times for Europe and US West-1, so I guess they are now at
3.4 (32 and 64 bits).

Mirror what is done for amd64 boot and ACPI wakeup code by setting
CR0_WP (write protection bit) early on boot. Although it is set later via
cpu_init(), this can help tracking down invalid writes to pages mapped
as read only from ring 0.
No regression observed when booting under anita (QEMU) or a P4 host.
Depending on your hardware or setup, you may trigger code paths I have
overlooked. So if your machine does not start properly, or you get
page faults early during boot, please report them to me.

Follow locore.S and move FPU handling from x86_64_switch_context() to
x86_64_tls_switch(); raise IPL to IPL_HIGH in x86_64_switch_context()
and test ci_fpcurlwp to decide to disable FPU or not.
Change the Xen i386 context switch code to be like the amd64 one.

Welcome PAE inside i386 current.
This patch is inspired by work previously done by Jeremy Morse, ported by me
to -current, merged with the work previously done for port-xen, together with
additionals fixes and improvements.
PAE option is disabled by default in GENERIC (but will be enabled in ALL in
the next few days).
In quick, PAE switches the CPU to a mode where physical addresses become
36 bits (64 GiB). Virtual address space remains at 32 bits (4 GiB). To cope
with the increased size of the physical address, they are manipulated as
64 bits variables by kernel and MMU.
When supported by the CPU, it also allows the use of the NX/XD bit that
provides no-execution right enforcement on a per physical page basis.
Notes:
- reworked locore.S
- introduce cpu_load_pmap(), used to switch pmap for the curcpu. Due to the
different handling of pmap mappings with PAE vs !PAE, Xen vs native, details
are hidden within this function. This helps calling it from assembly,
as some features, like BIOS calls, switch to pmap_kernel before mapping
trampoline code in low memory.
- some changes in bioscall and kvm86_call, to reflect the above.
- the L3 is "pinned" per-CPU, and is only manipulated by a
reduced set of functions within pmap. To track the L3, I added two
elements to struct cpu_info, namely ci_l3_pdirpa (PA of the L3), and
ci_l3_pdir (the L3 VA). Rest of the code considers that it runs "just
like" a normal i386, except that the L2 is 4 pages long (PTP_LEVELS is
still 2).
- similar to the ci_pae_l3_pdir{,pa} variables, amd64's xen_current_user_pgd
becomes an element of cpu_info (slowly paving the way for MP world).
- bootinfo_source struct declaration is modified, to cope with paddr_t size
change with PAE (it is not correct to assume that bs_addr is a paddr_t when
compiled with PAE - it should remain 32 bits). bs_addrs is now a
void * array (in bootloader's code under i386/stand/, the bs_addrs
is a physaddr_t, which is an unsigned long).
- fixes in multiboot code (same reason as bootinfo): paddr_t size
change. I used Elf32_* types, use RELOC() where necessary, and move the
memcpy() functions out of the if/else if (I do not expect sym and str tables
to overlap with ELF).
- 64 bits atomic functions for pmap
- all pmap_pdirpa access are now done through the pmap_pdirpa macro. It
hides the L3/L2 stuff from PAE, as well as the pm_pdirpa change in
struct pmap (it now becomes a PDP_SIZE array, with or without PAE).
- manipulation of recursive mappings ( PDIR_SLOT_{,A}PTEs ) is done via
loops on PDP_SIZE.
See also http://mail-index.netbsd.org/port-i386/2010/07/17/msg002062.html
No objection raised on port-i386@ and port-xen@R for about a week.
XXX kvm(3) will be fixed in another patch to properly handle both PAE and !PAE
kernel dumps (VA => PA macros are slightly different, and need proper 64 bits
PA support in kvm_i386).
XXX Mixing PAE and !PAE modules may lead to unwanted/unexpected results. This
cannot be solved easily, and needs lots of thinking before being declared
safe (paddr_t/bus_addr_t size handling, PD/PT macros abstractions).

In Xen PAE case, fix argument size passed to init386(), by pushing the
upper bits onto stack (we do not expect first_avail to be above 4GiB, so
assume their value is 0).
Remove macros (PROC0PDIR and PROC0STACK) that were never used.

PR port-i386/40143 Viewing an mpeg transport stream with mplayer causes crash
Fix numerous problems:
1. LDT updates are not atomic.
2. Number of processes running with private LDTs and/or I/O bitmaps
is not capped. System with high maxprocs can be paniced.
3. LDTR can be leaked over context switch.
4. GDT slot allocations can race, giving the same LDT slot to two procs.
5. Incomplete interrupt/trap frames can be stacked.
6. In some rare cases segment faults are not handled correctly.

Make the emulations, exec formats, coredump, NFS, and the NFS server
into modules. By and large this commit:
- shuffles header files and ifdefs
- splits code out where necessary to be modular
- adds module glue for each of the components
- adds/replaces hooks for things that can be installed at runtime

Replace lines beginning with assembler-style comments with c-style comments,
since '#' can be interpretted as a preprocessor directive. Remove single
quote ''' from assembler-style comments, as the preprocessor may ignore
everything between them.
Fixes compilation with pcc.

- Don't bother using sse to copy/zero pages on demand. It turns out not
to be worth it.
- If the machine has sse, re-enable zeroing pages in the idle loop and
use the sse instructions so that we don't blow out the cache.

- Give x86 BIOS boot the ability to load new style modules and pass them
into the kernel. Based on a patch by jmcneill@, with many fixes and
improvements by me.
- Put MEMORY_DISK_DYNAMIC and MODULAR into the GENERIC kernels, so that
you can load miniroot.kmod from the boot blocks and boot into the
installer!

Merge the bouyer-xeni386 branch. This brings in PAE support to NetBSD xeni386
(domU only). PAE support is enabled by 'options PAE', see the new XEN3PAE_DOMU
and INSTALL_XEN3PAE_DOMU kernel config files.
See the comments in arch/i386/include/{pte.h,pmap.h} to see how it works.
In short, we still handle it as a 2-level MMU, with the second level page
directory being 4 pages in size. pmap switching is done by switching the
L2 pages in the L3 entries, instead of loading %cr3. This is almost required
by Xen, which handle the last L2 page (the one mapping 0xc0000000 - 0xffffffff)
in a very special way. But this approach should also work for native PAE
support if ever supported (in fact, the pmap should almost suport native
PAE, what's missing is bootstrap code in locore.S).

Merge the bouyer-xeni386 branch to head, at tag bouyer-xeni386-merge1 (the
branch is still active and will see i386PAE support developement).
Sumary of changes:
- switch xeni386 to the x86/x86/pmap.c, and the xen/x86/x86_xpmap.c
pmap bootstrap.
- merge back most of xen/i386/ to i386/i386
- change the build to reduce diffs between i386 and amd64 in file locations
- remove include files that were identical to the i386/amd64 counterparts,
the build will find them via the xen-ma/machine link.

- When computing the TSC frequency, call i8254_delay() and not DELAY().
- Use atomics to adjust the pmap reference count, instead of taking locks.
- Implement I386_{SET,GET}_{FS,GS}BASE, allowing %fs and %gs to be used
as per-thread registers. This is compatible with FreeBSD.
- Run patches after we have attached CPUs, since we then know if the
system is uniprocessor or not. Eliminates a lot of #ifdef MULTIPROCESSOR
and makes running MP kernels on UP systems cheaper.
- Patch out many of the 'lock' prefixes to nops if uniprocessor.
- Do a wbinvd after patching to ensure that the trace/instruction cache
is up to date.

Merge the ppcoea-renovation branch to HEAD.
This branch was a major cleanup and rototill of many of the various OEA
cpu based PPC ports that focused on sharing as much code as possible
between the various ports to eliminate near-identical copies of files in
every tree. Additionally there is a new PIC system that unifies the
interface to interrupt code for all different OEA ppc arches. The work
for this branch was done by a variety of people, too long to list here.
TODO:
bebox still needs work to complete the transition to -renovation.
ofppc still needs a bunch of work, which I will be looking at.
ev64260 still needs to be renovated
amigappc was not attempted.
NOTES:
pmppc was removed as an arch, and moved to a evbppc target.

x86 changes for pcc and LKMs.
- Replace most inline assembly with proper functions. As a side effect
this reduces the size of amd64 GENERIC by about 120kB, and i386 by a
smaller amount. Nearly all of the inlines did something slow, or something
that does not need to be fast.
- Make curcpu() and curlwp functions proper, unless __GNUC__ && _KERNEL.
In that case make them inlines. Makes curlwp LKM and preemption safe.
- Make bus_space and bus_dma more LKM friendly.
- Share a few more files between the ports.
- Other minor changes.

Merge most x86 changes from the vmlocking branch, except the threaded soft
interrupt stuff. This is mostly comprised of changes to the pmap modules to
work on multiprocessor systems without kernel_lock, and changes to speed up
tlb shootdowns.

Remove the usage of Multiboot's "a.out kludge" to tell the boot loader to
reserve some more space for the BSS section than the binary says. This
trick was used to leave room after the kernel's image to copy the symbol
table following the format required by ksyms_init. (It was also used to
workaround a bug in the physical address fields of the binary, but this has
been long fixed.) Yes, the MULTIBOOT_SYMTAB_SPACE option goes away; yay!
Instead, copy the required data after the kernel in a way that avoids having
to reserve space and use the new ksyms_init_explicit function to avoid the
need to construct a minimal ELF image.
Fixes ksyms when using an "unpatched" GRUB (one that does not contain the
fix to honour the "a.out kludge" for ELF images, even when present) -- i.e.
ddb and lkms. As a side effect, the new code is much clearer to read and
digest.
Closes PR port-i386/32865.

Convert the code that fetches bootinfo information and stores it into the
kernel from assembler to C and add several comments to make things clearer.
I've been running with these changes for a couple of months without problems.

Implement support for 'The Multiboot Specification' so that i386 kernels
can be booted directly from Multiboot-compliant boot loaders (e.g. GRUB).
See the added multiboot(8) manual page for more information.
No objections in tech-kern@; only positive comments.

some assym cleanup.
- move copyin and friends from locore.S to their own file, copy.S.
share it between i386 and xen.
- defparam KERNBASE and kill KERNBASE_LOCORE hack.
- add more symbols to assym.h and use it where appropriate.

Check the passed in address as well as determining the maximum length
using VM_MAXUSER_ADDRESS in copyinstr and copyoutstr.
Problem originally fixed in OpenBSD/i386.
This fix suggested by Charles Hannum (mycroft at netbsd dot org).

As suggested on tech-kern@ days ago:
* Get rid of PTmap, PTD, PTDpde, APTmap, APTD, and APTDpde from locore.S.
* Rename PTDpaddr to PDPpaddr, ptdpaddr in struct cpu_kcore_hdr to pdppaddr for consistency.

- keep cr3 register and its copy in TSS synchronized.
otherwise an interrupt vector using a task gate (ie. ddbipi) messes it up.
- defer LDTR loading as well as cr3.
- tweak comments to make three copies of switching code more synchronized.

Pass pointers to frames from assembly, do not use the 'frame on stack
as argument passed by value' trick, as gcc 3.3.x makes (valid) assumptions
about the stack that will not be true. Costs 2 instructions per trap/syscall
on i386, 4 per interrupt for MP. One instruction per trap/syscall on amd64,
2 per interrupt for MP. I expect gcc 3.3.1 to make up for this by better
optimization (it'd better..)
While here, make amd64 compile again by using subr_mbr_disk.c

While the previous change actually made the code do what it intended,
it was still wrong. cpu_switch() must return 1 when it switched to
a different LWP, 0 if it didn't. It was doing exactly the reverse.