* australasia: Fiji has altered the end date for summer time this
summer, moving it from February to January. It may yet shift again,
but this does appear to be the current plan.

* backward, europe, zone.tab: The Pridnestrovian Moldavian Republic
(Europe/Tiraspol) has not followed much of Russia and will not
retain summer time; instead it reverts to standard time along with
western Europe and Ukraine on Oct 30, as it was earlier scheduled
to do. This removes the Europe/Tiraspol zone (again), as the
variation never actually happened, and restores the entry in the
"backward" file.

* northamerica: Cuba (America/Havana) has extended summer time by two
weeks, now to end on Nov 13, rather than the (already past) Oct 30.

* Mark the hammer flusher threads as system threads and call
vm_wait_nominal() in the inode flush loop prior to acquiring
an inode lock.

* This attempts to work around an issue where the pageout daemon has
to do a BMAP indirectly via vnode_pager_put_pages(), which requires
a dive into hammer deep enough to need the inode lock.

The pageout daemon checks the vnode lock but has no visibility into
the inode lock. Only the hammer backend (theoretically) can acquire
the inode lock without holding the vnode lock. Hopefully this will
improve the issue.

* netisr thread ports are based on IPIs. However, when asynchronous
socket writes are enabled, a user thread that is migrated between cpus
while sending async netmsgs can cause the netisr to receive those
messages out of order, corrupting the data stream.

* Add TDF_FORCE_SPINPORT to allow the netisr threads to implement their
message ports as spinports instead of threadports, which guarantees
message ordering.

There is no longer a reason to maintain multiple versions of binutils
in the base system. While the contrib/binutils-2.20 directory isn't
being removed quite yet, this commit effectively removes binutils 2.20
from DragonFly.

Sometime in the future, binutils may be removed from the objformat
handler. The value of the BINUTILSVERS variable no longer has any
effect, and the only version of binutils on the system is 2.21.

- net.inet.tcp.sosnd_async is added to allow pure asynchronous pru_send.
It currently defaults to off.
- To prevent soclose() and soshutdown() from interfering with TCP
processing on the loopback interface, so_pru_sync() is added, which
makes sure that so_pru_disconnect() and so_pru_shutdown() run only
after all of the previously sent packets have been requeued to the
netisr (the semantics of the original half-asynchronous pru_send).

* LWKT threads can use thread/IPI or spin-based message ports. The
default is thread-based. Spin-based ports had numerous problems which
would result in panics. This commit fixes those panics and makes the
spinlock version viable.

* However, currently there is no performance improvement so the default
is staying as it was.

* Remove the global kq_token and move to a per-kq and per-kqlist
pool token. This greatly improves postgresql and mysql performance
on MP systems.

* Adjust signal processing tokens to be per-LWP instead of per-PROC.
Signal delivery still utilizes a per-proc token but signal distribution
now uses a per-LWP token, which allows the trap code to only lock the
LWP when checking for pending signals.

This also significantly improves database performance.

* The socket code also now uses only its per-socket pool token instead
of kq_token for its kq operations. kq handles its own tokens.

BSDs have libcrypt and the prototypes for its functions are in
<unistd.h>. The reason we had crypt.h installed for a while was
to make KDE link against libcrypt, due to a wrong check in KDE.

Unfortunately, at least one other package (chat/dircproxy)
assumed that if <crypt.h> exists, it would also find prototypes
for crypt() and friends there, which is not the case. So it
would crash on x86_64 due to defaulting to int as crypt()'s
return type (which is a pointer).

The check in KDE has since been fixed and now properly checks
for the presence of libcrypt.

Bug Fixes
===================
1. echo c|grep '[c]' would fail for any c in 0x80..0xff,
and in many locales.
E.g., printf '\xff\n'|grep "$(printf '[\xff]')" || echo FAIL
would print FAIL rather than the required matching line.
[bug introduced in grep-2.6]

2. grep's interpretation of range expressions is now more consistent with
that of other tools. [bug present since multi-byte character set
support was introduced in 2.5.2, though the steps needed to reproduce
it changed in grep-2.6]

3. grep erroneously returned with exit status 1 on some memory allocation
failure. [bug present since "the beginning"]

5. grep is faster on regular expressions that match multibyte characters
in brackets (such as '[áéíóú]').

6. echo c|grep '[c]' would fail for any c in 0x80..0xff, with a unibyte
encoding for which the byte-to-wide-char mapping is nontrivial. For
example, the ISO-8859-1 locales are not affected, but ru_RU.KOI8-R is.
[bug introduced in grep-2.6]

The majority of the changes were inherited from gnulib. There were only
a few observable differences from version 3.0:

Release 3.2 (2011-09-02) [stable]
Release 3.1 (2011-08-10) [stable]

Bug fixes
===================
diff no longer reports spurious differences merely because two entries
in the same directory have names that compare equal in the current
locale, or compare equal because --ignore-file-name-case was given.

Changes in behavior
===================
--ignore-file-name-case now applies at the top level too.
For example, "diff dir inIt" might compare "dir/Init" to "inIt".

New features
===================
diff and sdiff have a new option --ignore-trailing-space (-Z).
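A quick illustration of the new option (assuming a diffutils with -Z
support is installed): two files that differ only in trailing
whitespace compare equal under -Z but not under plain diff.

```shell
printf 'a\nb\n'    > f1
printf 'a \nb\t\n' > f2
diff -Z f1 f2 && echo "no differences with -Z"
diff f1 f2 || echo "plain diff still sees them"
```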

This requires a bit of explanation. The last single-point spinlocks in the
VM system were the spinlocks for the inactive and active queue. Even though
these two spinlocks are only held for a very short period of time they can
create a major point of contention when one has (e.g.) 48 cores all trying
to run a VM fault at the same time. This is an issue with multi-socket/
many-cores systems and not so much an issue with single-socket systems.

On many-cores systems the global VM fault rate was limited to around
~200-250K zfod faults per second prior to this commit on our 48-core
opteron test box. Since any single compiler process can run ~35K zfod
faults per second the maximum concurrency topped out at around ~7 concurrent
processes.

With this commit the global VM fault rate was tested to almost 900K zfod
faults per second. That's 900,000 page faults per second (about 3.5 GBytes
per second). Typical operation was consistently above 750K zfod faults per
second. Maximum concurrency at a 35K fault rate per process is thus
increased from 7 processes to over 25 processes, and is probably approaching
the physical memory bus limit considering that one also has to take into
account generic page-fault overhead above and beyond the memory impact on the
page itself.

I can't stress enough how important it is to avoid contention entirely when
possible on a many-cores system. In this case even though the VM page queue
spinlocks are only held for a very short period of time, the convulsing of
the cache coherency management between physical cpu sockets when all the
cores need to use the spinlock still created an enormous bottleneck. Fixing
this one spinlock easily doubled concurrent compiler performance on our
48-core opteron.

* Fan-out the PQ_INACTIVE and PQ_ACTIVE page queues from 1 queue to
256 queues, each with its own spin lock.

* This removes the last major contention point in the VM system.

* -j48 buildkernel test on monster (48-core opteron) now runs in 55 seconds.
It was originally 167 seconds, and 101 seconds just prior to this commit.

Concurrent compiles are now three times faster (a +200% improvement) on
a many-cores box, with virtually no contention at all.

* Add lwkt_yield() calls in a few critical places which can hog the cpu
on large many-cores boxes during periods of very heavy contention. This
allows other kernel threads on the same cpu to run and reduces symptoms
of e.g. high ping times under certain load conditions.

* Run the callout kernel threads at the same priority as other kernel
threads so cpu-hogging operations run from callouts can yield to
other kernel threads (e.g. yield to the netisr threads).

* Change the vm_page_alloc() API to catch situations where the allocation
races an insertion due to potentially blocking when dealing with
PQ_CACHE pages. VM_ALLOC_NULL_OK allows vm_page_alloc() to return NULL
in this case (otherwise it will panic).

* Change vm_page_insert() to return TRUE if the insertion succeeded and
FALSE if it didn't due to a race against another thread.

* Change the meaning of the cpuid argument to lwkt_alloc_thread() and
lwkt_create(). A cpuid of -1 will cause the kernel to choose a cpu
to run the thread on (instead of choosing the current cpu).

Eventually this specification will allow dynamic migration (but not at
the moment).

Adjust lwp_fork() to specify the current cpu, required for initial
LWKT calls when setting the forked thread up.

Numerous kernel threads will now be spread around the available cpus:
devfs core threads, NFS socket threads, etc.

Interrupt threads are still fixed on cpu 0 awaiting additional work from
Sephe.

Put the emergency interrupt thread on the last cpu.

* Change the vm_page_grab() API. When VM_ALLOC_ZERO is specified the
vm_page_grab() code will automatically set an invalid page valid and
zero it (using the PG_ZERO optimization if possible). Pages which are
already valid are not zero'd.

This simplifies several use cases.

* Change vm_fault_page() to enter the page into the pmap while the vm_map
is still locked, instead of after unlocking it. For now anyhow.

* Minor change to ensure that a deterministic value is stored in *freebuf
in vn_fullpath().

* Minor debugging features added to help track down an x86-64 seg-fault
issue.

* For the moment implement a Red-Black tree for pv_entry_t manipulation.
Revamp the pindex to include all page table page levels, from terminal
pages to the PML4 page. The hierarchy is now arranged via the PV system.

* As before, the kernel page tables only use PV entries for terminal pages.

* Refactor the locking to allow blocking operations during deep scans.
Individual PV entries are now locked and critical PMAP operations do not
require the pmap->pm_token. This should greatly improve threaded
program performance.

* Fix kgdb on the live kernel (pmap_extract() was not handling short-cutted
page directory pages).

* Remove the rest of the LWKT fairq code, it may be added back in a different
form later. Go back to the strict priority model with round-robining
of same-priority LWKT threads.

Currently the model scans gd_tdrunq for sort insertion, which is probably
a bit too inefficient.

* Refactor the LWKT scheduler clock. The round-robining is now based on
the head of gd->gd_tdrunq and the lwkt_schedulerclock() function will
move it. When a thread not on the head is selected to run (because
the head is contending on a token), the round-robin tick will force a
resched on the next tick. As before, we never reschedule-ahead the
kernel scheduler helper thread or threads that have already dropped
to a user priority.

* The token code now tries a little harder to acquire the token before
giving up, controllable with lwkt.token_spin and lwkt.token_delay
(token_spin is the number of times to try and token_delay is the delay
between tries, in nanoseconds).

* Fix a serious bug in usched_bsd4.c which improperly reassigned the 'dd'
variable and caused the scheduler helper to monitor the wrong dd
structure.

* Refactor the vm_page coloring code. On SMP systems we now use the
coloring code to implement cpu localization when allocating pages.
The pages are still 'twisted' based on their physical address so both
functions are served, but cpu localization is now the more important
function.

* Implement NON-OBJECT vm_page allocations. NULL may now be passed, which
allocates a VM page unassociated with any VM object. This will be
used by the pmap code.

* Implement cpu localization for zalloc() and friends. This removes a major
contention point when handling concurrent VM faults. The only major
contention point left is the PQ_INACTIVE vm_page_queues[] queue.

* Temporarily remove the VM_ALLOC_ZERO request. This will probably be
reenabled in a later commit.

* Change the spinlock algorithm to do a read-test before atomic_swap_int().
This has no effect on single-chip cpus (tested on phenom II quad-core),
but has a HUGE HUGE HUGE effect on multi-chip/many-core systems. On
monster (48-core opteron / 4 x 12-core chips) concurrent kernel compile
time is reduced from 170 seconds to 75 seconds with this one change.
That's an improvement of well over 100%.

The reason the change is important is because it unloads the hardware
cache coherency bus and communication by creating a closed-loop with
the pre-read, which essentially passively waits for the cache update
instead of actively issuing a locked bus cycle memory op. This prevents
total armageddon on the memory busses when a substantial number of
cores are doing real work.

* Increase the number of pool spinlocks from 1024 to 8192. We need them
now that vm_page structures use pool spinlocks.