Make has several expressions that gcc47 believes will always evaluate
to true or false. Likely gcc47 is not detecting some cases correctly,
so -Wno-address was passed to it.
However, -Werror overrides it, so NO_WERROR had to be set on this
Makefile.

The remaining GCC warnings will be left for swildner to handle.
The -Werror flag will be suppressed for GCC47 until further notice.

With high job counts (make -j N), c++config.h would sometimes get included
before it had finished being generated. Solve this race by using "depend all",
which should ensure c++config.h gets generated before anything else
in libstdc++ starts to build.

Replace GCC 4.1 with GCC 4.7. The primary compiler remains GCC 4.4
and the source and makefiles for GCC 4.1 remain intact so it can be
brought back if necessary. All references to GCC 4.1 in documentation
were updated to reflect version 4.7.

The majority of these changes are new files required to build GCC on
DragonFly. They are identical to the lang/gcc-aux modifications. Of
interest:

1) The modification to c-format.c is a carry-over from GCC44. It
maintains support for the DragonFly-specific %b and %D conversions.
2) The modification to tree-inline.c is a carry-over from GCC44. It
maintains the suppression of "unlikely call" inline warnings.
3) The gcc driver was modified to strip out all the bad paths in its
search path. gcc -print-search-dirs is now short and accurate.

The following programs fail to build with gcc47 due to the new
unused-but-set-variable warning. They've been fixed in various ways.
The ones set with WARNS=3 suppress the cast-qual warning.
There is also a single enum-compare error.

RX empty events rarely happen (I didn't see one even when the card was
sinking full-speed tiny packets on one RX ring). Put the RX empty
events on an independent MSI-X vector, so the hot-path RX MSI-X
handler need not read any register at all.
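The benefit of the split can be illustrated with a small userland model. Everything here is hypothetical (struct rx_ring, read_intr_status() and both handlers are not the driver's real names), and the counter merely stands in for MMIO reads of an interrupt cause register. Because each vector implies its own cause, the RX hot path never touches the register; only the rare RX-empty vector pays for a read.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-ring state for the model. */
struct rx_ring {
    unsigned int pkts_processed;
    unsigned int empty_events;
};

/* Counts reads of the (simulated) interrupt status register. */
static unsigned int status_reg_reads;

/* Stand-in for an expensive MMIO read of an interrupt cause register. */
static uint32_t read_intr_status(void)
{
    status_reg_reads++;
    return 0;
}

/* Hot path: this vector fires only for RX completions, so the cause is
 * implied by the vector itself and no status register is read. */
static void rx_ring_intr(struct rx_ring *ring, unsigned int npkts)
{
    assert(ring != NULL);
    ring->pkts_processed += npkts;
}

/* Rare path: RX empty events get their own vector, which is the only
 * place that pays for a register read in this model. */
static void rx_empty_intr(struct rx_ring *ring)
{
    assert(ring != NULL);
    (void)read_intr_status();
    ring->empty_events++;
}
```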

This action prepares for the import of GCC 4.7 into base.
GCC 4.4, unlike GCC 4.1, requires gmp and mpfr, and these libraries
were part of the GCC 4.4 world makefile set. GCC 4.7 also needs
these libraries, so rather than build them twice, they are moved out
to a common area where both compilers can use them.

(2) Exit on another thread simultaneously removes all remaining VM
pages from the pmap. However, due to (1), there is still an
active page table page in pmap->pm_pteobj that the exit code has
no visibility into.

(3) The related pmap is then dtor'd due to heavy fork/exec/exit load
on the system. The VM page is still present; vm_page_protect()
is still stuck on the token (or hasn't gotten the cpu back).

(4) Nominal vm_object_terminate() destroys the page table page.

(5) vm_page_protect() unblocks and tries to destroy the page.

(6) BOOM.

* This fix places a barrier between the normal process exit code and the
dtor which will block while a vm_page_protect() is active on the pmap.

* This time for sure, but if not, we still know that the problem is related
to this exit race.
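A userland analog of the barrier, assuming it behaves like a count of in-flight protect operations that the dtor waits to drain. struct model_pmap and the pmap_protect_enter(), pmap_protect_exit() and pmap_dtor_barrier() functions are hypothetical, and a pthreads mutex/condvar stands in for kernel tokens; this is not the actual pmap code.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical model of a pmap with a protect-in-progress counter. */
struct model_pmap {
    pthread_mutex_t lock;
    pthread_cond_t  idle;
    int             protect_active;  /* vm_page_protect()-like ops in flight */
    int             destroyed;
};

static void pmap_model_init(struct model_pmap *pm)
{
    pthread_mutex_init(&pm->lock, NULL);
    pthread_cond_init(&pm->idle, NULL);
    pm->protect_active = 0;
    pm->destroyed = 0;
}

/* Entered at the start of a protect operation on this pmap. */
static void pmap_protect_enter(struct model_pmap *pm)
{
    pthread_mutex_lock(&pm->lock);
    assert(!pm->destroyed);
    pm->protect_active++;
    pthread_mutex_unlock(&pm->lock);
}

/* Leaves the protect operation and wakes a dtor waiting on the barrier. */
static void pmap_protect_exit(struct model_pmap *pm)
{
    pthread_mutex_lock(&pm->lock);
    assert(pm->protect_active > 0);
    if (--pm->protect_active == 0)
        pthread_cond_broadcast(&pm->idle);
    pthread_mutex_unlock(&pm->lock);
}

/* The barrier: the dtor blocks while any protect operation is active,
 * so page table pages cannot be torn down underneath it. */
static void pmap_dtor_barrier(struct model_pmap *pm)
{
    pthread_mutex_lock(&pm->lock);
    while (pm->protect_active > 0)
        pthread_cond_wait(&pm->idle, &pm->lock);
    pm->destroyed = 1;
    pthread_mutex_unlock(&pm->lock);
}
```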

* Add reschedule hints when issuing a read() on a pipe or socket, or
issuing a blocking kevent() call.

* usched_dfly will force a reschedule after the round-robin count has
passed the half-way point if it detects a scheduling hint. This is
an attempt to avoid rescheduling in the middle of some critical user
operation (e.g. postgres server holding internal locks).
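One possible reading of the hint rule, sketched as a per-tick check. RR_INTERVAL, struct model_lwp and tick_should_resched() are hypothetical; the real usched_dfly fields and thresholds may differ. In this sketch the hint stays set until the next reschedule, and "passed the half-way point" is taken literally as strictly greater than half the slice.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical round-robin slice length in scheduler ticks. */
#define RR_INTERVAL 4

struct model_lwp {
    int  rrcount;       /* ticks consumed in the current slice */
    bool resched_hint;  /* set when the lwp does read()/blocking kevent() */
};

/* Called once per scheduler tick for the running lwp.
 * Returns true when a reschedule should be forced now. */
static bool tick_should_resched(struct model_lwp *lp)
{
    assert(lp->rrcount >= 0);
    lp->rrcount++;
    if (lp->rrcount >= RR_INTERVAL) {
        /* Full slice used: reschedule regardless of hints. */
        lp->rrcount = 0;
        lp->resched_hint = false;
        return true;
    }
    if (lp->resched_hint && lp->rrcount > RR_INTERVAL / 2) {
        /* Past the half-way point and at a natural break: switch early. */
        lp->rrcount = 0;
        lp->resched_hint = false;
        return true;
    }
    return false;
}
```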

* Add kern.usched_dfly.fast_resched which allows the scheduler to avoid
interrupting a less desirable process with a more desirable process
as long as the priority difference is not too great.

However, default the value to 0, because a non-zero value has
consequences for interactive responsiveness.
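A sketch of how fast_resched might gate preemption, assuming (this is not stated above) that its value acts as the largest priority gap that will not trigger an interrupt, with lower numbers being more desirable. should_preempt() is a hypothetical helper. With the default of 0, any more desirable process preempts, which keeps interactive behavior unchanged.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical: lower pri is more desirable; fast_resched is the sysctl
 * value, assumed to be the largest gap that does NOT cause preemption. */
static bool should_preempt(int cur_pri, int new_pri, int fast_resched)
{
    assert(fast_resched >= 0);
    if (new_pri >= cur_pri)
        return false;   /* newcomer is not more desirable */
    return (cur_pri - new_pri) > fast_resched;
}
```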

* When running pgbench we recommend leaving fast_resched disabled and
instead running pgbench at idprio 15. This works around an issue where
the pgbench processes interrupt the postgres server process(es), causing
the servers to hit internal lock conflicts more quickly and to enter a
semaphore wait more often (when both pgbench and the postgres servers
are running on the same machine).

This is really an issue with postgres server scaling. Because the pgbench
processes use so much less cpu than the postgres server processes, they are
given a more desirable priority and can therefore interrupt the postgres
server processes. We can't really 'fix' this in the scheduler without
seriously degrading normal interactive responsiveness for the system.

* NOTE: This introduces a few regressions at high loads. They've been
identified and will be fixed in another iteration.

We've identified an issue with weight2. When weight2 successfully
schedules a process pair on the same cpu it can lead to inefficiencies
elsewhere in the scheduler related to user-mode and kernel-mode
priority switching. When testing pgbench/postgres pairs in this situation
(e.g. -j $ncpus -c $ncpus) we sometimes see serious regressions on
multi-socket machines, and at other times remarkably high performance.

* Fix a reported panic.

* Revamp the weights and algorithms significantly. Fix algorithmic errors
and improve the accuracy of weight3. Add weight4 which basically tells
the scheduler to try harder to find a free cpu to schedule the lwp on
when the current cpu is busy doing something else.

* Allow various fork() behaviors to be supported via
kern.usched_dfly.features.

* Set the default to place the newly forked process on
a random cpu instead of the current cpu.

The bsd4 scheduler had a global queue and could just signal
a random helper to pick up the thread. The dfly scheduler
has per-cpu queues and must actually enqueue the thread to
another cpu.

The bsd4 scheduler is still slightly superior here because
if the parent running on the current cpu immediately waits
for the child, the child is able to run on the current cpu.
However, randomization works quite well and this removes
nearly all of the make -j N regression.
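A minimal sketch of random placement with per-cpu queues. struct cpu_queue and fork_place_random() are hypothetical, and rand() stands in for whatever random source the kernel actually uses. The point of contrast with bsd4 is the explicit enqueue on the chosen cpu's own queue.

```c
#include <assert.h>
#include <stdlib.h>

#define NCPUS 4   /* hypothetical machine size */

/* Hypothetical per-cpu run queue: only the length matters here. */
struct cpu_queue {
    int nthreads;
};

/* Put a newly forked child on a randomly chosen cpu. Unlike a global
 * queue, the child must actually be enqueued on that cpu's queue. */
static int fork_place_random(struct cpu_queue q[], int ncpus)
{
    assert(ncpus > 0);
    int cpu = rand() % ncpus;
    q[cpu].nthreads++;
    return cpu;
}
```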

* Rewrite the balancing rover. The rover will now move one process per
tick from a very heavily loaded cpu queue to a lightly loaded cpu queue.
Each cpu target is iterated by the rover, one target per tick.
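The rover can be modeled as a step function called once per tick. The load array, struct rover, rover_tick() and the imbalance threshold of 2 threads are assumptions for illustration; only the one-process-per-tick and one-target-per-tick behavior comes from the description above.

```c
#include <assert.h>

#define NCPUS 4   /* hypothetical machine size */

/* Hypothetical rover state: which cpu is examined on the next tick. */
struct rover {
    int target;
};

/* One rover step per tick. Examines a single target cpu and, if it is
 * heavily loaded relative to the least loaded cpu, moves one thread.
 * load[] holds runnable user threads per cpu. Returns the destination
 * cpu, or -1 if nothing moved. */
static int rover_tick(struct rover *rv, int load[], int ncpus)
{
    assert(ncpus > 0);
    int src = rv->target;
    rv->target = (rv->target + 1) % ncpus;   /* next target on next tick */

    int dst = 0;
    for (int i = 1; i < ncpus; i++)
        if (load[i] < load[dst])
            dst = i;

    /* Move only when the gap is at least 2 threads, so a single move
     * cannot simply invert the imbalance. */
    if (load[src] - load[dst] < 2)
        return -1;
    load[src]--;
    load[dst]++;
    return dst;
}
```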

* Reformulate dfly_chooseproc_locked() and friends. Add a capability to
choose the 'worst' process (from the end of the queue), which is used
by the rover.
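Choosing from either end of a priority-ordered queue is what lets the rover take the 'worst' process. This sketch uses a hand-rolled doubly linked list; struct model_runq, runq_insert() and runq_choose() are hypothetical and far simpler than dfly_chooseproc_locked().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical thread: lower pri is more desirable. */
struct model_thread {
    int pri;
    struct model_thread *prev, *next;
};

/* Queue kept in priority order: head = best, tail = worst. */
struct model_runq {
    struct model_thread *head, *tail;
};

/* Insert keeping the queue sorted (FIFO among equal priorities). */
static void runq_insert(struct model_runq *rq, struct model_thread *td)
{
    assert(td->prev == NULL && td->next == NULL);
    struct model_thread *pos = rq->head;
    while (pos != NULL && pos->pri <= td->pri)
        pos = pos->next;
    td->next = pos;
    td->prev = pos ? pos->prev : rq->tail;
    if (td->prev) td->prev->next = td; else rq->head = td;
    if (pos) pos->prev = td; else rq->tail = td;
}

/* Remove and return the best thread (worst == 0) or the worst thread
 * from the end of the queue (worst != 0). Returns NULL if empty. */
static struct model_thread *runq_choose(struct model_runq *rq, int worst)
{
    struct model_thread *td = worst ? rq->tail : rq->head;
    if (td == NULL)
        return NULL;
    if (td->prev) td->prev->next = td->next; else rq->head = td->next;
    if (td->next) td->next->prev = td->prev; else rq->tail = td->prev;
    td->prev = td->next = NULL;
    return td;
}
```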

* When pulling a random thread we require the queue it is taken from to
be MUCH more heavily loaded than our own queue, which avoids ping-ponging
processes back and forth when the load is not balanced against the number
of cpu cores (e.g. 6 servers, 4 cores).
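The pull hysteresis can be shown with a fixed margin. PULL_MARGIN and should_pull() are assumptions; the real comparison presumably uses the scheduler's own load units rather than a raw thread count.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical margin: the remote queue must hold at least this many
 * more runnable threads than ours before we pull from it. */
#define PULL_MARGIN 2

static bool should_pull(int our_load, int their_load)
{
    assert(our_load >= 0 && their_load >= 0);
    return their_load - our_load >= PULL_MARGIN;
}
```

With 6 servers on 4 cores the loads settle at {2, 2, 1, 1}. A lightly loaded cpu sees a difference of only 1 and leaves things alone. With a margin of 1 it would pull, and the imbalance would simply move to another cpu, inviting yet another pull.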

* Change the process pulling behavior. Now we pull the 'worst' thread
from some other cpu instead of the best (duh!), and we only pull when a
cpu winds up with no designated user threads or via a
schedulerclock-implemented rover.

The schedulerclock-implemented rover will allow ONE cpu to pull the
'worst' thread across all cpus (with some locality) once every
round-robin interval (4 scheduler ticks).

The rover is responsible for taking excess processes that are unbalancing
one or more cpus (for example, you have 6 running batch processes and
only 4 cpus) and slowly moving them between cpus. If we did not do this,
the 'good' processes running on the unbalanced cpus would be put at an
unfair disadvantage.

* This should fix all known edge cases, including ramp-down edge cases.

* Fix an edge case where user processes were interrupting each other
when they were in the same queue, which could cause a synchronous
process like a postgres server to lose cpu while holding internal
locks during a short operation.

* Fix fork regression with usched_dfly. Most fork/exec sequences involve
the parent waiting. The new scheduler was placing the newly forked
process on another cpu which is non-optimal if the parent is going
to immediately wait.

Instead, if there is nothing else waiting to run on the current cpu,
leave the forked process on the current cpu initially. If the parent
waits quickly, the forked process will get the cpu; otherwise it will get
scheduled away soon enough. If the parent forks additional children,
we find there is something on the queue now (the first child) and
put the additional children on other cpus.
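The local-first placement can be sketched as follows. queued[] and fork_place_local_first() are hypothetical, and choosing the least loaded other cpu is an assumption; the text only says that additional children go to other cpus.

```c
#include <assert.h>

/* queued[] counts threads waiting to run on each cpu (not counting the
 * one currently running). Returns the cpu chosen for the new child. */
static int fork_place_local_first(int queued[], int ncpus, int curcpu)
{
    assert(curcpu >= 0 && curcpu < ncpus);
    int cpu = curcpu;
    if (queued[curcpu] != 0) {
        /* Something is already waiting here (e.g. the first child), so
         * pick the least loaded other cpu instead (assumed policy). */
        cpu = (curcpu + 1) % ncpus;
        for (int i = 0; i < ncpus; i++)
            if (i != curcpu && queued[i] < queued[cpu])
                cpu = i;
    }
    queued[cpu]++;
    return cpu;
}
```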

* Add a field to the thread structure, td_wakefromcpu. All wakeup()
family calls will load this field with the cpu the thread was woken
up FROM.

* Use this field in usched_dfly to weight scheduling such that pairs
of synchronously-dependent threads (for example, a pgbench thread
and a postgres server process) are placed closer to each other in
the cpu topology.

* Weighting:

- Load matters the most
- Current cpu thread is scheduled on is next
- Synchronous wait/wakeup weighting is last
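The ordering above can be expressed as a weighted score in which each weight outweighs the largest possible contribution of the terms below it. The weights, the two-cpus-per-package topology and all function names are illustrative assumptions, not the usched_dfly values. The lowest score wins.

```c
#include <assert.h>

/* Illustrative weights: load dominates, then staying on the current cpu,
 * then closeness to td_wakefromcpu. W_CURCPU exceeds the maximum wakeup
 * penalty (2 * W_WAKE), and W_LOAD exceeds W_CURCPU plus that penalty. */
#define W_LOAD       1000
#define W_CURCPU      100
#define W_WAKE         10
#define CPUS_PER_PKG    2   /* hypothetical two-level topology */

/* 0 = same cpu, 1 = same package, 2 = different package. */
static int topo_distance(int a, int b)
{
    if (a == b)
        return 0;
    return (a / CPUS_PER_PKG == b / CPUS_PER_PKG) ? 1 : 2;
}

/* Lower score is better. */
static int cpu_score(int cpu, const int load[], int curcpu, int wakefromcpu)
{
    int score = load[cpu] * W_LOAD;
    if (cpu != curcpu)
        score += W_CURCPU;
    score += topo_distance(cpu, wakefromcpu) * W_WAKE;
    return score;
}

static int choose_cpu(const int load[], int ncpus, int curcpu, int wakefromcpu)
{
    assert(ncpus > 0);
    int best = 0;
    for (int i = 1; i < ncpus; i++)
        if (cpu_score(i, load, curcpu, wakefromcpu) <
            cpu_score(best, load, curcpu, wakefromcpu))
            best = i;
    return best;
}
```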