The main idea is that info values are stored in a vector attached to
each symbol when possible. When not possible, the storage reverts
to the volatile [sic] environment, but still using a vector as the
payload instead of the chained hashing/alist approach.
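
The storage scheme described above can be sketched roughly as follows. This is a toy model, not SBCL's code: the `Symbol` class, `fallback_env`, and the three-element entry layout are all hypothetical, and it assumes for illustration that the names which cannot carry a vector are non-symbol names such as `("setf", "car")`.

```python
# Hypothetical model: names, slots, and entry layout are illustrative only.

class Symbol:
    """A symbol that carries its own info vector (the fast path)."""
    def __init__(self, name):
        self.name = name
        self.info = []  # flat vector of (info_class, info_type, value) entries


# Fallback table for names that cannot carry a vector themselves.
# The payload is still one vector per name, not a chained bucket or alist.
fallback_env = {}


def info_vector(name, create=False):
    """Return the info vector for NAME, or None if it has none."""
    if isinstance(name, Symbol):
        return name.info
    if create:
        return fallback_env.setdefault(name, [])
    return fallback_env.get(name)


def set_info(info_class, info_type, name, value):
    """Updates carry the extra work: replace an entry in place or append."""
    vec = info_vector(name, create=True)
    for i, (c, t, _) in enumerate(vec):
        if (c, t) == (info_class, info_type):
            vec[i] = (c, t, value)
            return
    vec.append((info_class, info_type, value))


def get_info(info_class, info_type, name, default=None):
    """Lookup is a short linear scan over one small vector."""
    vec = info_vector(name)
    if vec is None:
        return default
    for c, t, v in vec:
        if (c, t) == (info_class, info_type):
            return v
    return default
```

For a symbol, a lookup never touches the global table; only the fallback path pays for a hash probe.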

This strives to be very fast at lookup at the expense of some added
complexity during updates. Performance testing suggests that it is
at least 2x to 3x faster at (INFO :class :type name), and FBOUNDP
is almost 4x faster. In a repeatable test, a file that took 1.8 seconds
to compile now takes 1.7 seconds but with more consing (as expected).

sbcl.core itself increases in size by <1% for 64-bit architecture,
and less for 32-bit architecture because there is proportionally less
wasted space. A compact environment's table is effectively the
concatenation of all info vectors into one, so the added overhead is
in vector headers. However, the fallback hash is now smaller; the
compact env used to have more wasted cells.

Eventually the compact and volatile environments will both go away,
but not until the quasi-lockfree hashtable bootstraps properly.
The problem is an inability to use raw slots in early cold init.
It's actually not a problem of using them - the compiled code is ok -
but cold-init drops into 'ldb' due to how defstruct expands.

Among the bugs fixed by this (not straightforwardly testable) is that
the compact environment would hold onto symbols that became otherwise
inaccessible. It no longer does, but still holds onto other names.

This patch builds with CCL as host, and for 32-on-64 and vice-versa,
so nothing seems terribly broken in terms of assumptions made.

Remove a level of indirection when unbinding special bindings. Instead
of saving a symbol on the binding stack and then accessing its
tls-index to unbind it, save the tls-index directly, which saves one
memory read.
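
A minimal sketch of the idea, under the assumption that each thread keeps a vector of thread-local values indexed by tls-index. The `Symbol` and `Thread` classes and their fields are hypothetical and do not mirror SBCL's actual layout.

```python
# Hypothetical model of thread-local dynamic binding; not SBCL's layout.

class Symbol:
    def __init__(self, name, tls_index):
        self.name = name
        self.tls_index = tls_index  # slot number in each thread's TLS vector


class Thread:
    def __init__(self, tls_size):
        self.tls = [None] * tls_size
        self.binding_stack = []  # entries are (old_value, tls_index)

    def bind(self, symbol, value):
        index = symbol.tls_index  # the symbol is read once, at bind time
        self.binding_stack.append((self.tls[index], index))
        self.tls[index] = value

    def unbind(self):
        # No symbol access here: the saved index addresses the slot directly.
        old_value, index = self.binding_stack.pop()
        self.tls[index] = old_value
```

Had the stack entry held the symbol instead, `unbind` would need an extra read of `symbol.tls_index` before it could restore the slot.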

* Platform-agnostic changes:
- Declare type testing/checking routines.
- Define three primitive types: simd-pack-double for packs
of doubles, simd-pack-single for packs of singles, and
simd-pack-int for packs of integer/unknown.
- Define a heap-representation for 128-bit SIMD packs,
along with reserving a widetag and filling the corresponding
entries in gencgc's tables.
- Make the simd-pack class definition fully concrete.
- Teach IR1 how to expand SIMD-PACK type checks.
- IR2-conversion maps SIMD-PACK types to the right primitive type.
- Increase the limit on the number of storage classes: SIMD packs
went way past the previous (arbitrary?) limit of 40.

* Platform-specific changes, in src/compiler/target/simd-pack:
- Create new storage classes (that are backed by the float-reg [i.e. SSE]
storage base): one for each of double, single and integer sse packs.
- Also create the corresponding immediate-constant and stack storage
classes.
- Teach the assembler and the inline constant code about this new kind
of registers/constants, and how to map constant SIMD-PACKs to which SC.
- Define movement/conversion VOPs for SSE packs, along with VOP routines
needed for basic creation/manipulation of SSE packs.
- The type-checking VOP in generic/late-type-vops is extremely
x86-64-specific... IIRC, there are ordering issues I do not
want to tangle with.

* Implementation idiosyncrasy: while type *tests* (i.e. TYPEP calls) consider
the element type, type *checks* (e.g. THE or DECLARE) only check for
SIMD-PACKness, without looking at the element type. This is allowed by the
standard, is similar to what Python does for FUNCTION types, and helps
code remain efficient even when type checks can't be fully elided.
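
The difference between a test and a check can be illustrated with a toy model. The `SimdPack` class and both functions below are hypothetical stand-ins, not SBCL internals; they only show that the check ignores the element type while the test compares it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimdPack:
    element_type: str  # "double", "single" or "integer"
    bits: int          # 128-bit payload, kept as a plain integer here


def typep_simd_pack(obj, element_type=None):
    """Type *test*: also compares the element type when one is given."""
    if not isinstance(obj, SimdPack):
        return False
    return element_type is None or obj.element_type == element_type


def check_simd_pack(obj, element_type=None):
    """Type *check*: verifies SIMD-PACKness only; element_type is ignored."""
    if not isinstance(obj, SimdPack):
        raise TypeError("not a SIMD pack: %r" % (obj,))
    return obj
```

A pack of singles therefore fails the test for `"double"` but still passes a check declared for `"double"`.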

The vast majority of the code is verbatim or heavily inspired by Alexander
Gavrilov's branch.

- Mark Lisp signal handlers with a flag `synchronous' to indicate
whether we can (and must) handle them immediately. Conversely,
we understand this flag to imply a guarantee that the signal
does not occur during allocation.

- Any signal with a Lisp handler that is not synchronous is
implemented in the runtime using a trampoline, which (instead of
invoking Lisp code directly) first spawns a new pthread, which
only then calls back into Lisp to invoke the handler function
(with a fake signal context).

- Used in particular for SIGINT.

- For SIGPROF, introduce a second per-thread allocation region,
which gets swapped with the usual region around the call into
SIGPROF-HANDLER. This handler is a special case, because it is
careful not to trigger GC nor non-local unwinds, and we can
safely return to the original region afterwards.

- Add a new subclass SIGNAL-HANDLER-THREAD for this purpose,
making it easy to identify these threads (e.g. in the test
driver).

- Run sprof tests while building the contrib. Add a test stressing
time profiling of allocation sequences.
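
The dispatch rule for synchronous versus asynchronous handlers might look roughly like this. It is a sketch only: the handler table, `install_handler`, `deliver`, and the dictionary used as a fake context are hypothetical, and Python threads stand in for pthreads spawned by the runtime.

```python
import threading


class SignalHandlerThread(threading.Thread):
    """Marker subclass so that handler threads are easy to identify."""


handlers = {}  # signal number -> (handler function, synchronous flag)


def install_handler(signo, handler, synchronous=False):
    handlers[signo] = (handler, synchronous)


def deliver(signo, context=None):
    """Trampoline: run synchronous handlers in place, others in a new thread."""
    handler, synchronous = handlers[signo]
    if synchronous:
        handler(signo, context)  # handled immediately, with the real context
        return None
    # Asynchronous case: the interrupted thread never runs the handler.
    fake_context = {"signal": signo, "fake": True}
    thread = SignalHandlerThread(target=handler, args=(signo, fake_context))
    thread.start()
    return thread
```

Because the handler runs in a fresh thread, it cannot observe the interrupted thread in the middle of an allocation sequence.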

Enable using :SB-SAFEPOINT-STRICTLY on features.

Quite usable already on x86 and x86-64; PPC still has more prominent
issues, e.g. in threads.impure.lisp.

Some support for platforms whose libraries do not maintain a frame pointer

For platforms on which system libraries are built with the
equivalent of -fomit-frame-pointer, i.e. do not maintain EBP, save
it in the thread structure upon entry to an exception handler, and
restore the register during call_into_lisp.

Currently for Windows on x86-64 only, where it is required.
Analogous changes had been implemented for x86, but are not included
here.

- Microsoft x86-64 calling convention differences compared to the
System V ABI: argument passing registers; shadow space.
- Inform gcc that we are using the System V ABI for a few functions.
- Define long, unsigned-long to be 32 bit. This change just falls
into place now, since incompatible code had been adjusted earlier.
- Use VEH, not SEH.
- No pseudo atomic needed around inline allocation, but tweak alloc().
- Use the gencgc space alignment that also works on win32 x86.
- Factor "function end breakpoint" handling out of the sigtrap handler.

Beware known bugs, manifested as hangs during threads.impure.lisp,
happening rather frequently with 64 bit builds and at least much
less frequently (or not at all) with 32 bit binaries on the same
version of Windows, tested on Server 2012. (All credit for features
goes to Anton, all bugs are my fault.)

... if and only if running on a version of Windows new enough to
support doing so. Two scenarios come to mind where synchronous (i.e.
non-overlapped) I/O might matter:

- There is one kind of HANDLE which is never overlapped: Unnamed
pipes. Unlike named pipes, the feature added by this commit is
our only option of interrupting I/O on the former.

- User code might pass in a HANDLE through MAKE-FD-STREAM without
the right flag set. In principle, non-interruptibility of such a
HANDLE is a bug in said user code, but it doesn't hurt to deal
with these correctly as a side benefit. (The only Windows
releases which support re-opening of a HANDLE with the right
flag also have the functions needed by this commit.)

One downside for users might be an element of surprise, in that the
same SBCL binary will exhibit the presence or lack of features,
respectively, when started on recent Windows or old Windows. However,
the advantages of offering the feature seem to me to outweigh that
disadvantage.

* Implement pthreads, futex API on top of Win32.
* Adds support for the timer facility using sb-wtimer.
* Implement an interruptible `nanosleep' using waitable timers.
* Threading on Windows uses safepoints to stop the world.
On this platform, either all or none of :SB-THREAD, :SB-SAFEPOINT,
:SB-THRUPT, and :SB-WTIMER need to be enabled together.
* On this platform, INTERRUPT-THREAD will not run interruptions
in a target thread that is executing foreign code, even though
the POSIX version of sb-thrupt still allows this (potentially
unsafe) form of signalling by default.
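
The all-or-none constraint on the four features can be expressed as a simple consistency check. This is an illustrative sketch; the function name is hypothetical and SBCL's build scripts may enforce the rule differently.

```python
# Features that, on Windows, must be enabled together or not at all.
REQUIRED_TOGETHER = {":sb-thread", ":sb-safepoint", ":sb-thrupt", ":sb-wtimer"}


def features_consistent(features):
    """Return True if either all or none of the coupled features are present."""
    enabled = REQUIRED_TOGETHER & set(features)
    return not enabled or enabled == REQUIRED_TOGETHER
```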

Does not yet include interruptible I/O, which will be made available
separately. Slime users are requested to build SBCL without threads
until then.

Note that these changes alone are not yet sufficient to make SBCL on
Windows an ideal backend. Users looking for a particularly stable
or thread-enabled version of SBCL for Windows are still advised to
use the well-known Windows branch instead.

This is a merge of features developed earlier by Dmitry Kalyanov and
Anton Kovalenko.

* Performance note: Does not currently replace pseudo-atomic entirely,
except on Windows. Only once further work has been done to reduce
use of signals will pseudo-atomic become truly redundant. Therefore
use of safepoints on POSIX currently still implies the combined
performance overhead of both mechanisms.

* Design alternatives exist for some choices made here. In particular,
this commit places the safepoint trap page into the SBCL binary for
simplicity. It is likely that future changes to allow slam-free
runtime changes will have to go back to a hand-crafted address
parameter.

* This feature has been extracted from work related to Windows
support and backported to POSIX.

Credits: Uses the CSP-based stop-the-world protocol by Anton Kovalenko,
based on the safepoint and threading work by Dmitry Kalyanov. Use of
safepoints for SBCL originally researched by Paul Khuong.

Move the ALLOC-REGION, PSEUDO-ATOMIC-BITS, and BINDING-STACK-* slots
closer to the beginning of the thread structure. This change ensures
that the offsets for those slots are < 128 bytes, which in turn enables
shorter encodings for all accesses to this structure from Lisp code.
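
The saving comes from how x86 encodes memory-operand displacements: a displacement that fits in a signed byte takes one byte, anything larger takes four. The sketch below models only that rule (it ignores the case where a zero displacement can be omitted entirely), and the slot offsets in the usage note are made-up numbers.

```python
def displacement_bytes(offset):
    """Bytes needed to encode OFFSET as an x86 memory-operand displacement."""
    if -128 <= offset <= 127:
        return 1  # disp8, sign-extended by the CPU
    return 4      # disp32


def bytes_saved(old_offset, new_offset):
    """Per-access encoding bytes saved by moving a slot to a new offset."""
    return displacement_bytes(old_offset) - displacement_bytes(new_offset)
```

For example, moving a slot from a hypothetical offset of 200 to 96 gives `bytes_saved(200, 96) == 3` for every instruction that touches it.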

I *think* we had this working earlier already, but it's been broken at least
for a while now since there were no tests for it.

Add a DEFKNOWN to the array byte bashers, providing the RESULT-ARG -- and
make them return the sequence.

Replace the unused and bitrotted UNSAFE IR1 attribute with its inverse:
DX-SAFE, and use that together with RESULT-ARG to allow multiple refs to
potentially DX leafs. Still accept UNSAFE in DEFKNOWNs occurring in
user-code, but ignore it and give a style-warning.
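
The accept-but-ignore treatment of the retired attribute could be modelled like this. The attribute set is an illustrative subset and the parser is hypothetical; a Python warning stands in for the style-warning.

```python
import warnings

# Illustrative subset of attribute names; not the full list.
KNOWN_ATTRIBUTES = {"dx-safe", "flushable", "movable"}


def parse_attributes(attributes):
    """Return the recognized attributes, warning about the retired one."""
    result = set()
    for attr in attributes:
        if attr == "unsafe":
            # Still accepted so that old user code keeps loading.
            warnings.warn("the UNSAFE attribute is ignored", stacklevel=2)
            continue
        if attr not in KNOWN_ATTRIBUTES:
            raise ValueError("unknown attribute: %s" % attr)
        result.add(attr)
    return result
```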

For now, add DX-SAFE to LENGTH and VECTOR-LENGTH, which is enough for our
purposes.

* Remove all lutex-specific code from the system.
** Use SB-FUTEX for futex-capable platforms, and plain SB-THREAD
otherwise.
** Make non-futex mutexes unfair spinlocks for now, using WAIT-FOR to
provide timeouts and backoff.
** Build non-futex condition variables on top of a queue and WAIT-FOR.
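
An unfair spinlock with timeout and backoff can be sketched as below. This is not the SBCL implementation: the class and method names are hypothetical, a non-blocking `threading.Lock` attempt stands in for an atomic compare-and-swap, and a sleep loop stands in for WAIT-FOR.

```python
import threading
import time


class SpinMutex:
    """Toy unfair spinlock: whoever retries at the right moment wins."""

    def __init__(self):
        self._flag = threading.Lock()  # stand-in for an atomic CAS word

    def try_mutex(self):
        """One non-blocking acquisition attempt."""
        return self._flag.acquire(blocking=False)

    def grab(self, timeout=None):
        """Acquire the mutex; return False if TIMEOUT seconds elapse first."""
        if self.try_mutex():
            return True
        return self._wait_on_mutex(timeout)

    def _wait_on_mutex(self, timeout):
        deadline = None if timeout is None else time.monotonic() + timeout
        delay = 1e-6
        while not self.try_mutex():
            if deadline is not None and time.monotonic() >= deadline:
                return False
            time.sleep(delay)             # back off instead of spinning hot
            delay = min(delay * 2, 1e-3)  # exponential backoff, capped
        return True

    def release(self):
        self._flag.release()
```

Nothing queues the waiters, so the lock is unfair: a late arrival can acquire it ahead of a thread that has been backing off for longer.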

Performance implications: SB-FUTEX builds should perform pretty much the
same, or improve a bit. Threaded non-futex builds are affected as follows:

1. Threads idling on semaphores or condition variables aren't quite as
cheap. Just how costly depends on the OS. On Darwin 1000 idle threads
can chew up a bit over 50% CPU. I will try to address this later.

2. Contested locking around operations that take considerably longer
than a single timeslice suffers mild degradation.

3. Contested locking around operations that don't take long is an order
of magnitude more performant.

4. Highly active semaphores perform much better. (Follows from #3.)

* GRAB-MUTEX gets timeout support on all platforms.

* CONDITION-WAIT gets timeout support.

* Disable a bunch of prone-to-hang thread tests on Darwin. (All of them
were already prone to hang prior to this commit.)

* Enable a bunch of tests that now /pass/ on Darwin. \o/ This doesn't mean that
the threaded Darwin is fully expected to pass all tests yet, but let's say
it's more likely to do so.

...but still not robust enough to enable threads on Darwin by default.

* GET-MUTEX/GRAB-MUTEX get refactored into two main parts: %TRY-MUTEX and
%WAIT-ON-MUTEX, which are also used directly from CONDITION-WAIT where
appropriate.

* Also in pseudo-atomic.h, update the non-x86oid gencgc code
to do the right thing with threaded pseudo-atomic-bits.

* Due to the way dynamic binding works on threaded targets, it
is now a requirement that the arch_* pseudo_atomic functions call
the generic versions if foreign_function_call_active_p() is true
on threaded targets (in short, C code needs to be able to enter
pseudo-atomic, not just lisp code).

* On all platforms:
- Slightly more stable complex-complex float (double and single)
division;
- New transform for real-complex division;
- complex-real and real-complex float addition and subtraction
behave as though the real was first upgraded to a complex, thus
losing the sign of any imaginary zero.
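
The effect on the sign of an imaginary zero follows from IEEE addition, where `-0.0 + 0.0` is `+0.0` under round-to-nearest. The sketch below shows this with an explicit upgrade step; the function name is hypothetical and the code illustrates the arithmetic only, not SBCL's transforms.

```python
import math


def add_complex_real(z, x):
    """Add real X to complex Z by first upgrading X to a complex."""
    upgraded = complex(x, 0.0)  # the imaginary part becomes +0.0
    return complex(z.real + upgraded.real, z.imag + upgraded.imag)


z = complex(1.0, -0.0)             # imaginary part is negative zero
result = add_complex_real(z, 2.0)  # imaginary part: -0.0 + 0.0
imag_sign_before = math.copysign(1.0, z.imag)
imag_sign_after = math.copysign(1.0, result.imag)
```

Here `imag_sign_before` is `-1.0` and `imag_sign_after` is `1.0`: the negative zero in the imaginary part does not survive the addition.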

* On x86-64
- Complex floats are represented packed in a single SSE register;
- VOPs for all four arithmetic operations, complex-complex, but also
complex-real and real-complex, except for complex-complex and
real-complex division;
- VOPs for =, negate and conjugate of complexes (complex-real and
complex-complex);
- VOPs for EQL of floats (real and complexes).
- Full register moves for float values in SSE registers should also
speed scalar operations up.