Add a document that describes how to take advantage of ACPI enumeration for
buses like platform, I2C and SPI. In addition to that we document how to
translate ACPI GpioIo and GpioInt resources to be useful in Linux device
drivers.

Drivers usually expect that the devices they are supposed to handle
will be operational when their .probe() routines are called, but that
need not be the case on some ACPI-based systems with ACPI-based
device enumeration where the BIOSes don't put devices into D0 by
default. To work around this problem it is sufficient to change
bus type .probe() routines to ensure that devices will be powered
on before the drivers' .probe() routines run (and their .remove()
and .shutdown() routines accordingly).

Modify platform_drv_probe() to run acpi_dev_pm_attach() for devices
whose ACPI handles are present, so that ACPI power management is used
to change their power states. Analogously, modify
platform_drv_remove() and platform_drv_shutdown() to call
acpi_dev_pm_detach() for those devices, so that they are not subject
to ACPI PM any more.

Make it possible to ask the routines used for adding/removing devices
to/from the general ACPI PM domain, acpi_dev_pm_attach() and
acpi_dev_pm_detach(), respectively, to change the power states of
devices so that they are put into the full-power state automatically
by acpi_dev_pm_attach() and into the lowest-power state available
automatically by acpi_dev_pm_detach().

Currently we have valid flag to represent if this ACPI device power
state is valid. A device power state is valid does not necessarily
mean we, as OSPM, has a mean to put the device into that power state,
e.g. D3 cold is always a valid power state for any ACPI device, but if
there is no _PS3 or _PRx for this device, we can't really put that
device into D3 cold power state. The same is true for D0 power state.

So here comes the os_accessible flag, which is only set if the device
has provided us the required means to put it into that power state,
e.g. if we have _PS3 or _PRx, we can put the device into D3 cold state
and thus, D3 cold power state's os_accessible flag will be set in this
case.

And a new wrapper inline function is added to be used to check if
firmware has provided us a way to power off the device during runtime.

The current platform device creation and registration code in
acpi_create_platform_device() is quite convoluted. This function
takes an ACPI device node as an argument and eventually calls
platform_device_register_resndata() to create and register a
platform device object on the basis of the information contained
in that code. However, it doesn't associate the new platform
device with the ACPI node directly, but instead it relies on
acpi_platform_notify(), called from within device_add(), to find
that ACPI node again with the help of acpi_platform_find_device()
and acpi_platform_match() and then attach the new platform device
to it. This causes an additional ACPI namespace walk to happen and
is clearly suboptimal.

Use the observation that it is now possible to initialize the ACPI
handle of a device before calling device_add() for it to make this
code more straightforward. Namely, add a new field to struct
platform_device_info allowing us to pass the ACPI handle of interest
to platform_device_register_full(), which will then use it to
initialize the new device's ACPI handle before registering it.
This will cause acpi_platform_notify() to use the ACPI handle from
the device structure directly instead of using the .find_device()
routine provided by the device's bus type. In consequence,
acpi_platform_bus, acpi_platform_find_device(), and
acpi_platform_match() are not necessary any more, so remove them.

To avoid adding an ACPI handle pointer to struct device on
architectures that don't use ACPI, or generally when CONFIG_ACPI is
not set, in which cases that pointer is useless, define struct
acpi_dev_node that will contain the handle pointer if CONFIG_ACPI is
set and will be empty otherwise and use it to represent the ACPI
device node field in struct device.

In addition to that define macros for reading and setting the ACPI
handle of a device that don't generate code when CONFIG_ACPI is
unset. Modify the ACPI subsystem to use those macros instead of
referring to the given device's ACPI handle directly.

Currently, the ACPI handles of devices are initialized from within
device_add(), by acpi_bind_one() called from acpi_platform_notify()
which first uses the .find_device() routine provided by the device's
bus type to find the matching device node in the ACPI namespace.
This is a source of some computational overhead and, moreover, the
correctness of the result depends on the implementation of
.find_device() which is known to fail occasionally for some bus types
(e.g. PCI). In some cases, however, the corresponding ACPI device
node is known already before calling device_add() for the given
struct device object and the whole .find_device() dance in
acpi_platform_notify() is then simply unnecessary.

For this reason, make it possible to initialize the ACPI handles of
devices before calling device_add() for them. Modify
acpi_platform_notify() to call acpi_bind_one() in advance to check
the device's existing ACPI handle and skip the .find_device()
search if that is successful. Change acpi_bind_one() accordingly.

Currently acpi_dev_process_resource() returns AE_ABORT_METHOD
to terminate the acpi_walk_resources() it is called from if
the .preproc() routine provided by the caller of
acpi_dev_get_resources() initiating the resources walk returns
an error code. It is better to use AE_CTRL_TERMINATE for this
purpose, however, so do that.

Commit e5cc8ef (ACPI / PM: Provide ACPI PM callback routines for
subsystems) introduced a build problem occuring if CONFIG_ACPI is
unset or CONFIG_PM is unset and errno.h is not included before
acpi.h, because in that case ENODEV used in acpi.h is undefined.

Currently, whoever wants to use ACPI device resources has to call
acpi_walk_resources() to browse the buffer returned by the _CRS
method for the given device and create filters passed to that
routine to apply to the individual resource items. This generally
is cumbersome, time-consuming and inefficient. Moreover, it may
be problematic if resource conflicts need to be resolved, because
the different users of _CRS will need to do that in a consistent
way. However, if there are resource conflicts, the ACPI core
should be able to resolve them centrally instead of relying on
various users of acpi_walk_resources() to handle them correctly
together.

For this reason, introduce a new function, acpi_dev_get_resources(),
that can be used by subsystems to obtain a list of struct resource
objects corresponding to the ACPI device resources returned by
_CRS and, if necessary, to apply additional preprocessing routine
to the ACPI resources before converting them to the struct resource
format.

Make the ACPI code that creates platform device objects use
acpi_dev_get_resources() for resource processing instead of executing
acpi_walk_resources() twice by itself, which causes it to be much
more straightforward and easier to follow.

In the future, acpi_dev_get_resources() can be extended to meet
the needs of the ACPI PNP subsystem and other users of _CRS in
the kernel.

Move some code used for parsing ACPI device resources from the PNP
subsystem to the ACPI core, so that other bus types (platform, SPI,
I2C) can use the same routines for parsing resources in a consistent
way, without duplicating code.

With ACPI 5 it is now possible to enumerate traditional SoC
peripherals, like serial bus controllers and slave devices behind
them. These devices are typically based on IP-blocks used in many
existing SoC platforms and platform drivers for them may already
be present in the kernel tree.

To make driver "porting" more straightforward, add ACPI support to
the platform bus type. Instead of writing ACPI "glue" drivers for
the existing platform drivers, register the platform bus type with
ACPI to create platform device objects for the drivers and bind the
corresponding ACPI handles to those platform devices.

This should allow us to reuse the existing platform drivers for the
devices in question with the minimum amount of modifications.

This changeset is based on Mika Westerberg's and Mathias Nyman's
work.

With ACPI 5 we are starting to see devices that don't natively support
discovery but can be enumerated with the help of the ACPI namespace.
Typically, these devices can be represented in the Linux device driver
model as platform devices or some serial bus devices, like SPI or I2C
devices.

Since we want to re-use existing drivers for those devices, we need a
way for drivers to specify the ACPI IDs of supported devices, so that
they can be matched against device nodes in the ACPI namespace. To
this end, it is sufficient to add a pointer to an array of supported
ACPI device IDs, that can be provided by the driver, to struct device.

Moreover, things like ACPI power management need to have access to
the ACPI handle of each supported device, because that handle is used
to invoke AML methods associated with the corresponding ACPI device
node. The ACPI handles of devices are now stored in the archdata
member structure of struct device whose definition depends on the
architecture and includes the ACPI handle only on x86 and ia64. Since
the pointer to an array of supported ACPI IDs is added to struct
device_driver in an architecture-independent way, it is logical to
move the ACPI handle from archdata to struct device itself at the same
time. This also makes code more straightforward in some places and
follows the example of Device Trees that have a poiter to struct
device_node in there too.

Some bus types don't support power management natively, but generally
there may be device nodes in ACPI tables corresponding to the devices
whose bus types they are (under ACPI 5 those bus types may be SPI,
I2C and platform). If that is the case, standard ACPI power
management may be applied to those devices, although currently the
kernel has no means for that.

For this reason, provide a set of routines that may be used as power
management callbacks for such devices. This may be done in three
different ways.

(1) Device drivers handling the devices in question may run
acpi_dev_pm_attach() in their .probe() routines, which (on
success) will cause the devices to be added to the general ACPI
PM domain and ACPI power management will be used for them going
forward. Then, acpi_dev_pm_detach() may be used to remove the
devices from the general ACPI PM domain if ACPI power management
is not necessary for them any more.

(2) The devices' subsystems may use acpi_subsys_runtime_suspend(),
acpi_subsys_runtime_resume(), acpi_subsys_prepare(),
acpi_subsys_suspend_late(), acpi_subsys_resume_early() as their
power management callbacks in the same way as the general ACPI
PM domain does that.

(3) The devices' drivers may execute acpi_dev_suspend_late(),
acpi_dev_resume_early(), acpi_dev_runtime_suspend(),
acpi_dev_runtime_resume() from their power management callbacks
as appropriate, if that's absolutely necessary, but it is not
recommended to do that, because such drivers may not work
without ACPI support as a result.

If the caller of acpi_bus_set_power() already has a pointer to the
struct acpi_device object corresponding to the device in question, it
doesn't make sense for it to go through acpi_bus_get_device(), which
may be costly, because it involves acquiring the global ACPI
namespace mutex.

For this reason, export the function operating on struct acpi_device
objects used internally by acpi_bus_set_power(), so that it may be
called instead of acpi_bus_set_power() in the above case, and change
its name to acpi_device_set_power().

Two device wakeup management routines in device_pm.c and sleep.c,
acpi_pm_device_run_wake() and acpi_pm_device_sleep_wake(), take a
device pointer argument and use it to obtain the ACPI handle of the
corresponding ACPI namespace node. That handle is then used to get
the address of the struct acpi_device object corresponding to the
struct device passed as the argument.

Unfortunately, that last operation may be costly, because it involves
taking the global ACPI namespace mutex, so it shouldn't be carried
out too often. However, the callers of those routines usually call
them in a row with acpi_pm_device_sleep_state() which also takes that
mutex for the same reason, so it would be more efficient if they ran
acpi_bus_get_device() themselves to obtain a pointer to the struct
acpi_device object in question and then passed that pointer to the
appropriate PM routines.

To make that possible, split each of the PM routines mentioned above
in two parts, one taking a struct acpi_device pointer argument and
the other implementing the current interface for compatibility.

Additionally, change acpi_pm_device_run_wake() to actually return
an error code if there is an error while setting up runtime remote
wakeup for the device.

The ACPI function for choosing device power state is now located
in drivers/acpi/sleep.c, but drivers/acpi/device_pm.c is a more
logical place for it, so move it there.

However, instead of moving the function entirely, move its core only
under a different name and with a different list of arguments, so
that it is more flexible, and leave a wrapper around it in the
original location.

ACPI routines for adding and removing device wakeup notifiers are
currently defined in a PCI-specific file, but they will be necessary
for non-PCI devices too, so move them to a separate file under
drivers/acpi and rename them to indicate their ACPI origins.

The kerneldoc comments for acpi_pm_device_sleep_state(),
acpi_pm_device_run_wake(), and acpi_pm_device_sleep_wake() are
outdated or otherwise inaccurate and/or don't follow the common
kerneldoc patterns, so fix them.

Additionally, notice that acpi_pm_device_run_wake() should be under
CONFIG_PM_RUNTIME rather than under CONFIG_PM_SLEEP, so fix that too.

Since dev_pm_qos_add_request(), dev_pm_qos_update_request() and
dev_pm_qos_remove_request() for PM QoS flags should not be invoked
when device in RPM_SUSPENDED, add pm_runtime_get_sync() and pm_runtime_put()
around these functions in dev_pm_qos_expose_flags() and
dev_pm_qos_hide_flags().

[rjw: Modified the subject and changelog to better reflect the code
changes made.]

7) __inet_diag_dump() needs to repsect and report the error value
returned from inet_diag_lock_handler() rather than ignore it.
Otherwise if an inet diag handler is not available for a particular
protocol, we essentially report success instead of giving an error
indication. Fix from Cyrill Gorcunov.

8) When the QFQ packet scheduler sees TSO/GSO packets it does not
handle things properly, and in fact ends up corrupting it's
datastructures as well as mis-schedule packets. Fix from Paolo
Valente.

11) When we send unsolicited ipv6 neighbour advertisements, we should
send them to the link-local allnodes multicast address, as per
RFC4861. Fix from Hannes Frederic Sowa.

12) There is some kind of bug in the usbnet's kevent deferral
mechanism, but more immediately when it triggers an uncontrolled
stream of kernel messages spam the log. Rate limit the error log
message triggered when this problem occurs, as sending thousands
of error messages into the kernel log doesn't help matters at all,
and in fact makes further diagnosis more difficult.

From Steve Glendinning.

13) Fix gianfar restore from hibernation, from Wang Dongsheng.

14) The netlink message attribute sizes are wrong in the ipv6 GRE
driver, it was using the size of ipv4 addresses instead of ipv6
ones :-) Fix from Nicolas Dichtel."

1) Configuring a mix of static vs. modular sparc64 crypto modules
didn't work, remove an ill-conceived attempt to only have to build
the device match table for these drivers once to fix the problem.

Reported by Meelis Roos.

2) Make the montgomery multiple/square and mpmul instructions actually
usable in 32-bit tasks. Essentially this involves providing 32-bit
userspace with a way to use a 64-bit stack when it needs to.

3) Our sparc64 atomic backoffs don't yield cpu strands properly on
Niagara chips. Use pause instruction when available to achieve
this, otherwise use a benign instruction we know blocks the strand
for some time.

4) Wire up kcmp

5) Fix the build of various drivers by removing the unnecessary
blocking of OF_GPIO when SPARC.

of/address: sparc: Declare of_address_to_resource() as an extern function for sparc again

This bug-fix makes sure that of_address_to_resource is defined extern for sparc
so that the sparc-specific implementation of of_address_to_resource() is once
again used when including include/linux/of_address.h in a sparc context. A
number of drivers in mainline relies on this function working for sparc.

The bug was introduced in a850a7554442f08d3e910c6eeb4ee216868dda1e, "of/address:
add empty static inlines for !CONFIG_OF". Contrary to that commit title, the
static inlines are added for !CONFIG_OF_ADDRESS, and CONFIG_OF_ADDRESS is never
defined for sparc. This is good behavior for the other functions in
include/linux/of_address.h, as the extern functions defined in
drivers/of/address.c only gets linked when OF_ADDRESS is configured. However,
for of_address_to_resource there exists a sparc-specific implementation in
arch/sparc/arch/sparc/kernel/of_device_common.c

This adds sparc support for platform_get_irq that in the normal case use
platform_get_resource() to get an irq. This standard approach fails for sparc as
there are no resources of type IORESOURCE_IRQ for irqs for sparc.

Cross platform drivers can then use this standard platform function and work on
sparc instead of having to have a special case for sparc.

If a gianfar ethernet device is down prior to hibernating a
system, it will no longer be present upon system restore.

For example:

~# ifconfig eth0 down
~# echo disk > /sys/power/state

<trigger a restore from hibernation>

~# ifconfig eth0 up
SIOCSIFFLAGS: No such device

This happens because the restore function bails out early upon
finding devices that were not up at hibernation. In doing so,
it never gets to the netif_device_attach call at the end of
the restore function. Adding the netif_device_attach as done
here also makes the gfar_restore code consistent with what is
done in the gfar_resume code.

when something goes wrong, a flood of these messages can be
generated by usbnet (thousands per second). This doesn't
generally *help* the condition so this patch ratelimits the
rate of their generation.

There's an underlying problem in usbnet's kevent deferral
mechanism which needs fixing, specifically that events *can*
get dropped and not handled. This patch doesn't address this,
but just mitigates fallout caused by the current implemention.

Pull sound fixes from Takashi Iwai:
"Most of commits are for stable and regression fixes. Except for one
fix for a regression in 3.7-rc4, there are all driver local changes,
so nothing too much to worry."

The commit 911dec0db4de6ccc544178a8ddaf9cec0a11d533
"xen/arm: Fix compile errors when drivers are compiled as modules." exports
the neccessary functions. But to guard ourselves against out-of-tree modules
and future drivers hitting this, lets export all of the relevant
hypercalls.

- zero the allocation_args structure on the stack before using it to
determine whether to use a worker for allocation
- move allocation stack switch to xfs_bmapi_allocate in order to
prevent deadlock on AGF buffers

- growfs no longer reads in garbage for new secondary superblocks

- silence a build warning

- ensure that invalid buffers never get written to disk while on free
list

lib/atomic64.c: In function 'lock_addr':
lib/atomic64.c:40:11: error: 'L1_CACHE_SHIFT' undeclared (first use in this function)
lib/atomic64.c:40:11: note: each undeclared identifier is reported only once for each function it appears in

Revert commit 03a7beb55b9f ("epoll: support for disabling items, and a
self-test app") pending resolution of the issues identified by Michael
Kerrisk, copied below.

We'll revisit this for 3.8.

: I've taken a look at this patch as it currently stands in 3.7-rc1, and
: done a bit of testing. (By the way, the test program
: tools/testing/selftests/epoll/test_epoll.c does not compile...)
:
: There are one or two places where the behavior seems a little strange,
: so I have a question or two at the end of this mail. But other than
: that, I want to check my understanding so that the interface can be
: correctly documented.
:
: Just to go though my understanding, the problem is the following
: scenario in a multithreaded application:
:
: 1. Multiple threads are performing epoll_wait() operations,
: and maintaining a user-space cache that contains information
: corresponding to each file descriptor being monitored by
: epoll_wait().
:
: 2. At some point, a thread wants to delete (EPOLL_CTL_DEL)
: a file descriptor from the epoll interest list, and
: delete the corresponding record from the user-space cache.
:
: 3. The problem with (2) is that some other thread may have
: previously done an epoll_wait() that retrieved information
: about the fd in question, and may be in the middle of using
: information in the cache that relates to that fd. Thus,
: there is a potential race.
:
: 4. The race can't solved purely in user space, because doing
: so would require applying a mutex across the epoll_wait()
: call, which would of course blow thread concurrency.
:
: Right?
:
: Your solution is the EPOLL_CTL_DISABLE operation. I want to
: confirm my understanding about how to use this flag, since
: the description that has accompanied the patches so far
: has been a bit sparse
:
: 0. In the scenario you're concerned about, deleting a file
: descriptor means (safely) doing the following:
: (a) Deleting the file descriptor from the epoll interest list
: using EPOLL_CTL_DEL
: (b) Deleting the corresponding record in the user-space cache
:
: 1. It's only meaningful to use this EPOLL_CTL_DISABLE in
: conjunction with EPOLLONESHOT.
:
: 2. Using EPOLL_CTL_DISABLE without using EPOLLONESHOT in
: conjunction is a logical error.
:
: 3. The correct way to code multithreaded applications using
: EPOLL_CTL_DISABLE and EPOLLONESHOT is as follows:
:
: a. All EPOLL_CTL_ADD and EPOLL_CTL_MOD operations should
: should EPOLLONESHOT.
:
: b. When a thread wants to delete a file descriptor, it
: should do the following:
:
: [1] Call epoll_ctl(EPOLL_CTL_DISABLE)
: [2] If the return status from epoll_ctl(EPOLL_CTL_DISABLE)
: was zero, then the file descriptor can be safely
: deleted by the thread that made this call.
: [3] If the epoll_ctl(EPOLL_CTL_DISABLE) fails with EBUSY,
: then the descriptor is in use. In this case, the calling
: thread should set a flag in the user-space cache to
: indicate that the thread that is using the descriptor
: should perform the deletion operation.
:
: Is all of the above correct?
:
: The implementation depends on checking on whether
: (events & ~EP_PRIVATE_BITS) == 0
: This replies on the fact that EPOLL_CTL_AD and EPOLL_CTL_MOD always
: set EPOLLHUP and EPOLLERR in the 'events' mask, and EPOLLONESHOT
: causes those flags (as well as all others in ~EP_PRIVATE_BITS) to be
: cleared.
:
: A corollary to the previous paragraph is that using EPOLL_CTL_DISABLE
: is only useful in conjunction with EPOLLONESHOT. However, as things
: stand, one can use EPOLL_CTL_DISABLE on a file descriptor that does
: not have EPOLLONESHOT set in 'events' This results in the following
: (slightly surprising) behavior:
:
: (a) The first call to epoll_ctl(EPOLL_CTL_DISABLE) returns 0
: (the indicator that the file descriptor can be safely deleted).
: (b) The next call to epoll_ctl(EPOLL_CTL_DISABLE) fails with EBUSY.
:
: This doesn't seem particularly useful, and in fact is probably an
: indication that the user made a logic error: they should only be using
: epoll_ctl(EPOLL_CTL_DISABLE) on a file descriptor for which
: EPOLLONESHOT was set in 'events'. If that is correct, then would it
: not make sense to return an error to user space for this case?

Virtio wants to release used indices after the corresponding
virtio device has been unregistered. However, virtio does not
hold an extra reference, giving up its last reference with
device_unregister(), making accessing dev->index afterwards
invalid.

Just some minor fixes for VM reg check and a regression fix for dce3 plls

* 'drm-fixes-3.7' of git://people.freedesktop.org/~agd5f/linux:
drm/radeon/si: add some missing regs to the VM reg checker
drm/radeon/cayman: add some missing regs to the VM reg checker
drm/radeon/dce3: switch back to old pll allocation order for discrete

Commit 4439647 ("xfs: reset buffer pointers before freeing them") in
3.0-rc1 introduced a regression when recovering log buffers that
wrapped around the end of log. The second part of the log buffer at
the start of the physical log was being read into the header buffer
rather than the data buffer, and hence recovery was seeing garbage
in the data buffer when it got to the region of the log buffer that
was incorrectly read.

When we shut down the filesystem, we have to unpin and free all the
buffers currently active in the CIL. To do this we unpin and remove
them in one operation as a result of a failed iclogbuf write. For
buffers, we do this removal via a simultated IO completion of after
marking the buffer stale.

At the time we do this, we have two references to the buffer - the
active LRU reference and the buf log item. The LRU reference is
removed by marking the buffer stale, and the active CIL reference is
by the xfs_buf_iodone() callback that is run by
xfs_buf_do_callbacks() during ioend processing (via the bp->b_iodone
callback).

However, ioend processing requires one more reference - that of the
IO that it is completing. We don't have this reference, so we free
the buffer prematurely and use it after it is freed. For buffers
marked with XBF_ASYNC, this leads to assert failures in
xfs_buf_rele() on debug kernels because the b_hold count is zero.

Fix this by making sure we take the necessary IO reference before
starting IO completion processing on the stale buffer, and set the
XBF_ASYNC flag to ensure that IO completion processing removes all
the active references from the buffer to ensure it is fully torn
down.

Inode buffers do not need to be mapped as inodes are read or written
directly from/to the pages underlying the buffer. This fixes a
regression introduced by commit 611c994 ("xfs: make XBF_MAPPED the
default behaviour").

When we free a block from the alloc btree tree, we move it to the
freelist held in the AGFL and mark it busy in the busy extent tree.
This typically happens when we merge btree blocks.

Once the transaction is committed and checkpointed, the block can
remain on the free list for an indefinite amount of time. Now, this
isn't the end of the world at this point - if the free list is
shortened, the buffer is invalidated in the transaction that moves
it back to free space. If the buffer is allocated as metadata from
the free list, then all the modifications getted logged, and we have
no issues, either. And if it gets allocated as userdata direct from
the freelist, it gets invalidated and so will never get written.

However, during the time it sits on the free list, pressure on the
log can cause the AIL to be pushed and the buffer that covers the
block gets pushed for write. IOWs, we end up writing a freed
metadata block to disk. Again, this isn't the end of the world
because we know from the above we are only writing to free space.

The problem, however, is for validation callbacks. If the block was
on old btree root block, then the level of the block is going to be
higher than the current tree root, and so will fail validation.
There may be other inconsistencies in the block as well, and
currently we don't care because the block is in free space. Shutting
down the filesystem because a freed block doesn't pass write
validation, OTOH, is rather unfriendly.

So, make sure we always invalidate buffers as they move from the
free space trees to the free list so that we guarantee they never
get written to disk while on the free list.

Uninitialised variable build warning introduced by 2903ff0 ("switch
simple cases of fget_light to fdget"), gcc is not smart enough to
work out that the variable is not used uninitialised, and the commit
removed the initialisation at declaration that the old variable had.

When updating new secondary superblocks in a growfs operation, the
superblock buffer is read from the newly grown region of the
underlying device. This is not guaranteed to be zero, so violates
the underlying assumption that the unused parts of superblocks are
zero filled. Get a new buffer for these secondary superblocks to
ensure that the unused regions are zero filled correctly.

Switching stacks are xfs_alloc_vextent can cause deadlocks when we
run out of worker threads on the allocation workqueue. This can
occur because xfs_bmap_btalloc can make multiple calls to
xfs_alloc_vextent() and even if xfs_alloc_vextent() fails it can
return with the AGF locked in the current allocation transaction.

If we then need to make another allocation, and all the allocation
worker contexts are exhausted because the are blocked waiting for
the AGF lock, holder of the AGF cannot get it's xfs-alloc_vextent
work completed to release the AGF. Hence allocation effectively
deadlocks.

To avoid this, move the stack switch one layer up to
xfs_bmapi_allocate() so that all of the allocation attempts in a
single switched stack transaction occur in a single worker context.
This avoids the problem of an allocation being blocked waiting for
a worker thread whilst holding the AGF.

Certain allocation paths through xfs_bmapi_write() are in situations
where we have limited stack available. These are almost always in
the buffered IO writeback path when convertion delayed allocation
extents to real extents.

The current stack switch occurs for userdata allocations, which
means we also do stack switches for preallocation, direct IO and
unwritten extent conversion, even those these call chains have never
been implicated in a stack overrun.

Hence, let's target just the single stack overun offended for stack
switches. To do that, introduce a XFS_BMAPI_STACK_SWITCH flag that
the caller can pass xfs_bmapi_write() to indicate it should switch
stacks if it needs to do allocation.

The log write code stamps each iclog with the current tail LSN in
the iclog header so that recovery knows where to find the tail of
thelog once it has found the head. Normally this is taken from the
first item on the AIL - the log item that corresponds to the oldest
active item in the log.

The problem is that when the AIL is empty, the tail lsn is dervied
from the the l_last_sync_lsn, which is the LSN of the last iclog to
be written to the log. In most cases this doesn't happen, because
the AIL is rarely empty on an active filesystem. However, when it
does, it opens up an interesting case when the transaction being
committed to the iclog spans multiple iclogs.

That is, the first iclog is stamped with the l_last_sync_lsn, and IO
is issued. Then the next iclog is setup, the changes copied into the
iclog (takes some time), and then the l_last_sync_lsn is stamped
into the header and IO is issued. This is still the same
transaction, so the tail lsn of both iclogs must be the same for log
recovery to find the entire transaction to be able to replay it.

The problem arises in that the iclog buffer IO completion updates
the l_last_sync_lsn with it's own LSN. Therefore, If the first iclog
completes it's IO before the second iclog is filled and has the tail
lsn stamped in it, it will stamp the LSN of the first iclog into
it's tail lsn field. If the system fails at this point, log recovery
will not see a complete transaction, so the transaction will no be
replayed.

The fix is simple - the l_last_sync_lsn is updated when a iclog
buffer IO completes, and this is incorrect. The l_last_sync_lsn
shoul dbe updated when a transaction is completed by a iclog buffer
IO. That is, only iclog buffers that have transaction commit
callbacks attached to them should update the l_last_sync_lsn. This
means that the last_sync_lsn will only move forward when a commit
record it written, not in the middle of a large transaction that is
rolling through multiple iclog buffers.

Commit 149c24151e85 ("ARM: SMP: use a timing out completion for cpu
hotplug") modified arm's CPU up path to use completions. It seems that
we only got half of this patch for arm64, so add the missing call to
complete.

struct user_fp does not exist for arm64, so use struct user_fpsimd_state
instead for the ELF core dumping definitions. Furthermore, since we use
regset-based core dumping, we do not need definitions for dump_task_regs
and dump_fpu.