Pull the intel i915 hibernation memory corruption fix from Dave Airlie:
"I tracked down the misc memory corruption after i915 hibernate to the
blinking fbcon cursor, and realised the i915 driver wasn't doing the
fbdev suspend/resume calls at all. nouveau and radeon have done these
calls for a long time.

This has been fairly well tested and is definitely the main culprit in
hibernate not working."

Looking at hibernate overwriting I though it looked like a cursor,
so I tracked down this missing piece to stop the cursor blink
timer. I've no idea if this is sufficient to fix the hibernate
problems people are seeing, but please test it.

hugepage-mmap.c, hugepage-shm.c and map_hugetlb.c in Documentation/vm are
simple pass/fail tests, It's better to promote them to
tools/testing/selftests.

Thanks suggestion of Andrew Morton about this. They all need firstly
setting up proper nr_hugepages and hugepage-mmap need to mount hugetlbfs.
So I add a shell script run_vmtests to do such work which will call the
three test programs and check the return value of them.

Changes to original code including below:
a. add run_vmtests script
b. return error when read_bytes mismatch with writed bytes.
c. coding style fixes: do not use assignment in if condition

Remove the run_tests script and launch the selftests by calling "make
run_tests" from the selftests top directory instead. This delegates to
the Makefile in each selftest directory, where it is decided how to launch
the local test.

This removes the need to add each selftest directory to the now removed
"run_tests" top script.

* i386 on Sandy Bridge
New code faster for common cases: tagged and dense trees.
Some degradations for non-tagged lookup on sparse trees.

Ideally, there might help __ffs() analog for searching first non-zero
long element in array, gcc sometimes cannot optimize this loop corretly.

Numbers:

CPU: Intel Sandy Bridge i7-2620M 4Mb L3

radix-tree with 1024 slots:

tagged lookup

step 1 before 7156 after 3613
step 2 before 5399 after 2696
step 3 before 4779 after 1928
step 4 before 4456 after 1429
step 5 before 4292 after 1213
step 6 before 4183 after 1052
step 7 before 4157 after 951
step 8 before 4016 after 812
step 9 before 3952 after 851
step 10 before 3937 after 732
step 11 before 4023 after 709
step 12 before 3872 after 657
step 13 before 3892 after 633
step 14 before 3720 after 591
step 15 before 3879 after 578
step 16 before 3561 after 513

normal lookup

step 1 before 4266 after 3301
step 2 before 2695 after 2129
step 3 before 2083 after 1712
step 4 before 1801 after 1534
step 5 before 1628 after 1313
step 6 before 1551 after 1263
step 7 before 1475 after 1185
step 8 before 1432 after 1167
step 9 before 1373 after 1092
step 10 before 1339 after 1134
step 11 before 1292 after 1056
step 12 before 1319 after 1030
step 13 before 1276 after 1004
step 14 before 1256 after 987
step 15 before 1228 after 992
step 16 before 1247 after 999

step 1 before 626064869 after 544418266
step 2 before 418809975 after 336321473
step 7 before 242303598 after 207755560
step 15 before 208380563 after 176496355
step 63 before 186854206 after 167283638
step 64 before 176188060 after 170143976
step 65 before 185139608 after 167487116
step 128 before 88181865 after 86913490
step 256 before 45733628 after 45143534
step 512 before 24506038 after 23859036
step 12345 before 2177425 after 2018662

* AMD Athlon 6000+ 2x1Mb L2, without L3

radix-tree with 1024 slots:

tag-lookup

step 1 before 8164 after 5379
step 2 before 5818 after 5581
step 3 before 4959 after 4213
step 4 before 4371 after 3386
step 5 before 4204 after 2997
step 6 before 4950 after 2744
step 7 before 4598 after 2480
step 8 before 4251 after 2288
step 9 before 4262 after 2243
step 10 before 4175 after 2131
step 11 before 3999 after 2024
step 12 before 3979 after 1994
step 13 before 3842 after 1929
step 14 before 3750 after 1810
step 15 before 3735 after 1810
step 16 before 3532 after 1660

normal-lookup

step 1 before 7875 after 5847
step 2 before 4808 after 4071
step 3 before 4073 after 3462
step 4 before 3677 after 3074
step 5 before 4308 after 2978
step 6 before 3911 after 3807
step 7 before 3635 after 3522
step 8 before 3313 after 3202
step 9 before 3280 after 3257
step 10 before 3166 after 3083
step 11 before 3066 after 3026
step 12 before 2985 after 2982
step 13 before 2925 after 2924
step 14 before 2834 after 2808
step 15 before 2805 after 2803
step 16 before 2647 after 2622

step 1 before 1025677120 after 861375100
step 2 before 647220080 after 572258540
step 7 before 505518960 after 484041813
step 15 before 430483053 after 444815320 *
step 63 before 388113453 after 404250546 *
step 64 before 374154666 after 396027440 *
step 65 before 381423973 after 396704853 *
step 128 before 190078700 after 202619384 *
step 256 before 100886756 after 102829108 *
step 512 before 64074505 after 56158720
step 12345 before 4237289 after 4422299 *

* i686 on Sandy bridge

radix-tree with 1024 slots:

tagged lookup

step 1 before 7990 after 4019
step 2 before 5698 after 2897
step 3 before 5013 after 2475
step 4 before 4630 after 1721
step 5 before 4346 after 1759
step 6 before 4299 after 1556
step 7 before 4098 after 1513
step 8 before 4115 after 1222
step 9 before 3983 after 1390
step 10 before 4077 after 1207
step 11 before 3921 after 1231
step 12 before 3894 after 1116
step 13 before 3840 after 1147
step 14 before 3799 after 1090
step 15 before 3797 after 1059
step 16 before 3783 after 745

Iterating divided into two phases:
* lookup next chunk in radix-tree leaf node
* iterating through slots in this chunk

Main iterator function radix_tree_next_chunk() returns pointer to first
slot, and stores in the struct radix_tree_iter index of next-to-last slot.
For tagged-iterating it also constuct bitmask of tags for retunted chunk.
All additional logic implemented as static-inline functions and macroses.

Also adds radix_tree_find_next_bit() static-inline variant of
find_next_bit() optimized for small constant size arrays, because
find_next_bit() too heavy for searching in an array with one/two long
elements.

In the case of a child pid namespace, rebooting the system does not really
makes sense. When the pid namespace is used in conjunction with the other
namespaces in order to create a linux container, the reboot syscall leads
to some problems.

A container can reboot the host. That can be fixed by dropping the
sys_reboot capability but we are unable to correctly to poweroff/
halt/reboot a container and the container stays stuck at the shutdown time
with the container's init process waiting indefinitively.

After several attempts, no solution from userspace was found to reliabily
handle the shutdown from a container.

This patch propose to make the init process of the child pid namespace to
exit with a signal status set to : SIGINT if the child pid namespace
called "halt/poweroff" and SIGHUP if the child pid namespace called
"reboot". When the reboot syscall is called and we are not in the initial
pid namespace, we kill the pid namespace for "HALT", "POWEROFF",
"RESTART", and "RESTART2". Otherwise we return EINVAL.

Returning EINVAL is also an easy way to check if this feature is supported
by the kernel when invoking another 'reboot' option like CAD.

By this way the parent process of the child pid namespace knows if it
rebooted or not and can take the right decision.

The IPMI watchdog timer clears or extends the timer on reboot/shutdown.
It was using the non-locking routine for setting the watchdog timer, but
this was causing race conditions. Instead, use the locking version to
avoid the races. It seems to work fine.

The part of the IPMI driver that delivered panic information to the event
log and extended the watchdog timeout during a panic was not properly
handling the messages. It used static messages to avoid allocation, but
wasn't properly waiting for these, or wasn't properly handling the
refcounts.

The IPMI driver would release a lock, deliver a message, then relock.
This is obviously ugly, and this patch converts the message handler
interface to use a tasklet to schedule work. This lets the receive
handler be called from an interrupt handler with interrupts enabled.

We currently time out and retry KCS transactions after 1 second of waiting
for IBF or OBF. This appears to be too short for some hardware. The IPMI
spec says "All system software wait loops should include error timeouts.
For simplicity, such timeouts are not shown explicitly in the flow
diagrams. A five-second timeout or greater is recommended". Change the
timeout to five seconds to satisfy the slow hardware.

This was marked as obsolete for quite a while now.. Now it is time to
remove it altogether. And while doing this, get rid of first_cpu() as
well. Also, remove the redundant setting of cpu_online_mask in
smp_prepare_cpus() because the generic code would have already set cpu 0
in cpu_online_mask.

Calculate a cpumask of CPUs with per-cpu pages in any zone and only send
an IPI requesting CPUs to drain these pages to the buddy allocator if they
actually have pages when asked to flush.

This patch saves 85%+ of IPIs asking to drain per-cpu pages in case of
severe memory pressure that leads to OOM since in these cases multiple,
possibly concurrent, allocation requests end up in the direct reclaim code
path so when the per-cpu pages end up reclaimed on first allocation
failure for most of the proceeding allocation attempts until the memory
pressure is off (possibly via the OOM killer) there are no per-cpu pages
on most CPUs (and there can easily be hundreds of them).

This also has the side effect of shortening the average latency of direct
reclaim by 1 or more order of magnitude since waiting for all the CPUs to
ACK the IPI takes a long time.

Tested by running "hackbench 400" on a 8 CPU x86 VM and observing the
difference between the number of direct reclaim attempts that end up in
drain_all_pages() and those were more then 1/2 of the online CPU had any
per-cpu page in them, using the vmstat counters introduced in the next
patch in the series and using proc/interrupts.

In the test sceanrio, this was seen to save around 3600 global
IPIs after trigerring an OOM on a concurrent workload:

In several code paths, such as when unmounting a file system (but not
only) we send an IPI to ask each cpu to invalidate its local LRU BHs.

For multi-cores systems that have many cpus that may not have any LRU BH
because they are idle or because they have not performed any file system
accesses since last invalidation (e.g. CPU crunching on high perfomance
computing nodes that write results to shared memory or only using
filesystems that do not use the bh layer.) This can lead to loss of
performance each time someone switches the KVM (the virtual keyboard and
screen type, not the hypervisor) if it has a USB storage stuck in.

flush_all() is called for each kmem_cache_destroy(). So every cache being
destroyed dynamically ends up sending an IPI to each CPU in the system,
regardless if the cache has ever been used there.

For example, if you close the Infinband ipath driver char device file, the
close file ops calls kmem_cache_destroy(). So running some infiniband
config tool on one a single CPU dedicated to system tasks might interrupt
the rest of the 127 CPUs dedicated to some CPU intensive or latency
sensitive task.

I suspect there is a good chance that every line in the output of "git
grep kmem_cache_destroy linux/ | grep '\->'" has a similar scenario.

This patch attempts to rectify this issue by sending an IPI to flush the
per cpu objects back to the free lists only to CPUs that seem to have such
objects.

The check which CPU to IPI is racy but we don't care since asking a CPU
without per cpu objects to flush does no damage and as far as I can tell
the flush_all by itself is racy against allocs on remote CPUs anyway, so
if you required the flush_all to be determinstic, you had to arrange for
locking regardless.

Add the on_each_cpu_cond() function that wraps on_each_cpu_mask() and
calculates the cpumask of cpus to IPI by calling a function supplied as a
parameter in order to determine whether to IPI each specific cpu.

The function works around allocation failure of cpumask variable in
CONFIG_CPUMASK_OFFSTACK=y by itereating over cpus sending an IPI a time
via smp_call_function_single().

The function is useful since it allows to seperate the specific code that
decided in each case whether to IPI a specific cpu for a specific request
from the common boilerplate code of handling creating the mask, handling
failures etc.

We have lots of infrastructure in place to partition multi-core systems
such that we have a group of CPUs that are dedicated to specific task:
cgroups, scheduler and interrupt affinity, and cpuisol= boot parameter.
Still, kernel code will at times interrupt all CPUs in the system via IPIs
for various needs. These IPIs are useful and cannot be avoided
altogether, but in certain cases it is possible to interrupt only specific
CPUs that have useful work to do and not the entire system.

This patch set, inspired by discussions with Peter Zijlstra and Frederic
Weisbecker when testing the nohz task patch set, is a first stab at trying
to explore doing this by locating the places where such global IPI calls
are being made and turning the global IPI into an IPI for a specific group
of CPUs. The purpose of the patch set is to get feedback if this is the
right way to go for dealing with this issue and indeed, if the issue is
even worth dealing with at all. Based on the feedback from this patch set
I plan to offer further patches that address similar issue in other code
paths.

This patch creates an on_each_cpu_mask() and on_each_cpu_cond()
infrastructure API (the former derived from existing arch specific
versions in Tile and Arm) and uses them to turn several global IPI
invocation to per CPU group invocations.

Core kernel:

on_each_cpu_mask() calls a function on processors specified by cpumask,
which may or may not include the local processor.

You must not call this function with disabled interrupts or from a
hardware interrupt handler or from a bottom half handler.

arch/arm:

Note that the generic version is a little different then the Arm one:

1. It has the mask as first parameter
2. It calls the function on the calling CPU with interrupts disabled,
but this should be OK since the function is called on the other CPUs
with interrupts disabled anyway.

arch/tile:

The API is the same as the tile private one, but the generic version
also calls the function on the with interrupts disabled in UP case

This is OK since the function is called on the other CPUs
with interrupts disabled.

Holepunching filesystems ext4 and xfs are using truncate_inode_pages_range
but forgetting to unmap pages first (ocfs2 remembers). This is not really
a bug, since races already require truncate_inode_page() to handle that
case once the page is locked; but it can be very inefficient if the file
being punched happens to be mapped into many vmas.

Provide a drop-in replacement truncate_pagecache_range() which does the
unmapping pass first, handling the awkward mismatch between arguments to
truncate_inode_pages_range() and arguments to unmap_mapping_range().

Note that holepunching does not unmap privately COWed pages in the range:
POSIX requires that we do so when truncating, but it's hard to justify,
difficult to implement without an i_size cutoff, and no filesystem is
attempting to implement it.

Pull "Disintegrate and delete asm/system.h" from David Howells:
"Here are a bunch of patches to disintegrate asm/system.h into a set of
separate bits to relieve the problem of circular inclusion
dependencies.

I've built all the working defconfigs from all the arches that I can
and made sure that they don't break.

The reason for these patches is that I recently encountered a circular
dependency problem that came about when I produced some patches to
optimise get_order() by rewriting it to use ilog2().

This uses bitops - and on the SH arch asm/bitops.h drags in
asm-generic/get_order.h by a circuituous route involving asm/system.h.

The main difficulty seems to be asm/system.h. It holds a number of
low level bits with no/few dependencies that are commonly used (eg.
memory barriers) and a number of bits with more dependencies that
aren't used in many places (eg. switch_to()).

Move xchg() and cmpxchg() here as they're full word atomic ops and
frequently used by atomic_xchg() and atomic_cmpxchg().

(5) asm/bug.h

Move die() and related bits.

(6) asm/auxvec.h

Move AT_VECTOR_SIZE_ARCH here.

Other arch headers are created as needed on a per-arch basis."

Fixed up some conflicts from other header file cleanups and moving code
around that has happened in the meantime, so David's testing is somewhat
weakened by that. We'll find out anything that got broken and fix it..

Pull a few more things for powerpc by Benjamin Herrenschmidt:
- Anton's did some recent improvements to EPOW event reporting on
pSeries (power supply failures and such). The patches are self
contained enough and replace really nasty code so I felt it should
still go in
- I did the vio driver registration change Greg requested, I don't see
the point of leaving that til the next merge window
- The remaining EEH changes I said were still pending to get rid of the
EEH references from the generic struct device_node
- A few more iSeries removal bits
- A perf bug fix on 970

Pull module and param updates from Rusty Russell:
"I'm getting married next week, and then honeymoon until 6th May. I'll
be offline from next week, except to post the compulsory pictures if
Alex shaves her head..."

I'm sure Rusty can take time off from his honeymoon if something comes
up. And here's the explanation about head shaving:

http://baldalex.org/

in case you wondered and wanted to support another insane caper or
Rusty's involving shaving.

Pull EDAC fixes from Mauro Carvalho Chehab:
"A series of EDAC driver fixes. It also has one core fix at the
documentation, and a rename patch, fixing the name of the struct that
contains the rank information."

Pull x86 platform driver updates from Matthew Garrett:
"Some significant updates to samsung-laptop, additional hardware
support for Toshibas, misc updates to various hardware and a new
backlight driver for some Apple machines."

Pull GPIO changes for v3.4 from Grant Likely:
"Primarily gpio device driver changes with some minor side effects
under arch/arm and arch/x86. Also includes a few core changes such as
explicitly supporting (electrical) open source and open drain outputs
and some help for parsing gpio devicetree properties."

Fix up context conflict due to Laxman Dewangan adding sleep control for
the tps65910 driver separately for gpio's and regulators.

Pull MFD changes from Samuel Ortiz:
- 4 new drivers: Freescale i.MX on-chip Anatop, Ricoh's RC5T583 and
TI's TPS65090 and TPS65217.
- New variants support (8420, 8520 ab9540), cleanups and bug fixes for
the abx500 and db8500 ST-E chipsets.
- Some minor fixes and update for the wm8994 from Mark.
- The beginning of a long term TWL cleanup effort coming from the TI
folks.
- Various fixes and cleanups for the s5m, TPS659xx, pm860x, and MAX8997
drivers.

Pull "drivers/clk: common clock framework" from Olof Johansson:
"This branch contains patches from Mike Turquette adding a common clock
framework to be shared across platforms. This is part of the work
towards building a common zImage for several ARM platforms."

Pull "ARM: More device tree support updates" from Olof Johansson:
"This branch contains a number of updates for device tree support on
several ARM platforms, in particular:

* AT91 continues the device tree conversion adding support for a
number of on-chip drivers and other functionality
* ux500 adds probing of some of the core SoC blocks through device
tree
* Initial device tree support for ST SPEAr600 platforms
* kirkwood continues the conversion to device-tree probing"

Manually merge arch/arm/mach-ux500/Kconfig due to MACH_U8500 rename, and
drivers/usb/gadget/at91_udc.c due to header file include cleanups.

Also do an "evil merge" for the MACH_U8500 config option rename that the
affected RMI4 touchscreen driver in staging. It's called MACH_MOP500
now, and it was missed during previous merges.

Pull "ARM: More SoC driver updates" from Olof Johansson:
"This branch contains a handful of driver updates, mostly to the
LPC32xx platform but also for Samsung EXYNOS and Davinci.

It had a few context conflicts against patches already merged through
fixes-non-critical. We should have resolved this early during the
development cycle by pulling them in as a dependency, instead I did it
after the fact this time."

On discard the corresponding mapping(s) are removed from the thin
device. If the associated block(s) are no longer shared the discard
is passed to the underlying device.

All bios other than discards now have an associated deferred_entry
that is saved to the 'all_io_entry' in endio_hook. When non-discard
IO completes and associated mappings are quiesced any discards that
were deferred, via ds_add_work() in process_discard(), will be queued
for processing by the worker thread.

The thin metadata format can only make use of a device that is <=
THIN_METADATA_MAX_SECTORS (currently 15.9375 GB). Therefore, there is no
practical benefit to using a larger device.

However, it may be that other factors impose a certain granularity for
the space that is allocated to a device (E.g. lvm2 can impose a coarse
granularity through the use of large, >= 1 GB, physical extents).

Rather than reject a larger metadata device, during thin-pool device
construction, switch to allowing it but issue a warning if a device
larger than THIN_METADATA_MAX_SECTORS_WARNING (16 GB) is
provided. Any space over 15.9375 GB will not be used.

Save space by removing entries from the space map ref_count tree if
they're no longer needed.

Ref counts are stored in two places: a bitmap if the ref_count is
below 3, or a btree of uint32_t if 3 or above.

When a ref_count that was above 3 drops below we can remove it from
the tree and save some metadata space. This removal was commented out
before because I was unsure why this was causing under-populated btree
nodes. Earlier patches have fixed this issue.

Released blocks don't become available until after the next commit
(for crash resilience). Prior to this patch commits were only
triggered by a message to the target or a REQ_{FLUSH,FUA} bio. This
allowed far too big a position to build up.

The interval is hard-coded to 1 second. This is a sensible setting.
I'm not making this user configurable, since there isn't much to be
gained by tweaking this - and a lot lost by setting it far too high.

Device mapper uses sscanf to convert arguments to numbers. The problem is that
the way we use it ignores additional unmatched characters in the scanned string.

For example, this `if (sscanf(string, "%d", &number) == 1)' will match a number,
but also it will match number with some garbage appended, like "123abc".

As a result, device mapper accepts garbage after some numbers. For example
the command `dmsetup create vg1-new --table "0 16384 linear 254:1bla 34816bla"'
will pass without an error.

This patch fixes all sscanf uses in device mapper. It appends "%c" with
a pointer to a dummy character variable to every sscanf statement.

The construct `if (sscanf(string, "%d%c", &number, &dummy) == 1)' succeeds
only if string is a null-terminated number (optionally preceded by some
whitespace characters). If there is some character appended after the number,
sscanf matches "%c", writes the character to the dummy variable and returns 2.
We check the return value for 1 and consequently reject numbers with some
garbage appended.

The dm-raid code currently fails to create a RAID array if any of the
superblocks cannot be read. This was an oversight as there is already
code to handle this case if the values ('- -') were provided for the
failed array position.

With this patch, if a superblock cannot be read, the array position's
fields are initialized as though '- -' was set in the table. That is,
the device is failed and the position should not be used, but if there
is sufficient redundancy, the array should still be activated.

The root is a chunk of data that gets written to the superblock. This
data is used to recreate the space map when opening a metadata area.
We have two space maps; one tracking space on the metadata device and
one of the data device. Both of these use the same format for their
root, so this typo was harmless.

When we remove an entry from a node we sometimes rebalance with it's
two neighbours. This wasn't being done correctly; in some cases
entries have to move all the way from the right neighbour to the left
neighbour, or vice versa. This patch pretty much re-writes the
balancing code to fix it.

This code is barely used currently; only when you delete a thin
device, and then only if you have hundreds of them in the same pool.
Once we have discard support, which removes mappings, this will be used
much more heavily.

Avoid using the bi_next field for the holder of a cell when deferring
bios because a stacked device below might change it. Store the
holder in a new field in struct cell instead.

When a cell is created, the bio that triggered creation (the holder) was
added to the same bio list as subsequent bios. In some cases we pass
this holder bio directly to devices underneath. If those devices use
the bi_next field there will be trouble...

This also simplifies some code that had to work out which bio was the
holder.

There were cases where an error code would be set only if we finish
processing the last sector. If there were other encryption operations in
flight, the error would be ignored and bio would be returned with
success as if no error happened.

This bug is present in kcryptd_crypt_write_convert, kcryptd_crypt_read_convert
and kcryptd_async_done.

Currently, dm-crypt reserves a mempool of MIN_BIO_PAGES reserved pages.
It allocates first MIN_BIO_PAGES with non-failing allocation (the allocation
cannot fail and waits until the mempool is refilled). Further pages are
allocated with different gfp flags that allow failing.

Because allocations may be done in parallel, this code can deadlock. Example:
There are two processes, each tries to allocate MIN_BIO_PAGES and the processes
run simultaneously.
It may end up in a situation where each process allocates (MIN_BIO_PAGES / 2)
pages. The mempool is exhausted. Each process waits for more pages to be freed
to the mempool, which never happens.

To avoid this deadlock scenario, this patch changes the code so that only
the first page is allocated with non-failing gfp mask. Allocation of further
pages may fail.

asm/system.h is a cause of circular dependency problems because it contains
commonly used primitive stuff like barrier definitions and uncommonly used
stuff like switch_to() that might require MMU definitions.

asm/system.h has been disintegrated by this point on all arches into the
following common segments: