memcpy_mcsafe() is an API currently used by the pmem subsystem to convert
errors encountered during a memcpy (machine check exceptions) into a return
value. This patchset consists of three patches:
1. The first patch is a bug fix to handle machine check errors correctly
while walking the page tables in kernel mode, due to huge pmd/pud sizes.
2. The second patch adds memcpy_mcsafe() support; this is largely derived
from existing code.
3. The third patch registers for callbacks on machine check exceptions and
in them uses specialized knowledge of the type of page to decide whether
to handle the MCE as is or to return to a fixup address present in
memcpy_mcsafe(). If a fixup address is used, then we return an error
value of -EFAULT to the caller.
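The return convention described for patch 3 can be illustrated in user space. This is only a minimal sketch, not the powerpc implementation: copy_mcsafe_sketch() and struct poison_range are hypothetical names, and the "poisoned" range merely stands in for memory that would raise a machine check. The only behaviour taken from the description above is that the caller sees -EFAULT instead of a fatal error.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a memory range that would raise a machine check. */
struct poison_range {
	const unsigned char *start;
	size_t len;
};

/*
 * Sketch of the memcpy_mcsafe() contract: copy n bytes from src to dst, but
 * if the source overlaps the poisoned range, return -EFAULT instead of
 * copying. Returns 0 on success. A NULL poison pointer means "no poison".
 */
static int copy_mcsafe_sketch(void *dst, const void *src, size_t n,
			      const struct poison_range *poison)
{
	if (poison && poison->len && n) {
		uintptr_t s = (uintptr_t)src;
		uintptr_t p = (uintptr_t)poison->start;

		/* Half-open interval overlap test: [s, s+n) vs [p, p+len). */
		if (s < p + poison->len && p < s + n)
			return -EFAULT;	/* models the fixup path after an MCE */
	}

	memcpy(dst, src, n);
	return 0;
}
```

A caller would check the return value exactly as it would for any other fallible copy, and treat a non-zero result as "the source data could not be read".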
Testing
A large part of the testing was done under a simulator by selectively
inserting machine check exceptions in a test driver doing memcpy_mcsafe
via ioctls.
Changelog v2
- Fix the logic of shifting in addr_to_pfn
- Use shift consistently instead of PAGE_SHIFT
- Fix a typo in patch1
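To illustrate why the mapping shift matters when turning a faulting address into a PFN, here is a rough sketch. It is not the code from the patch: addr_to_pfn_sketch(), its base_pfn argument, and the 4K SKETCH_PAGE_SHIFT are assumptions made for illustration. The idea is that, for a huge mapping, the offset of the address inside the mapping has to be taken with the mapping's own shift before it is converted into base pages.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12u	/* assumed 4K base pages, illustration only */

/*
 * base_pfn: PFN of the first base page of the mapping (hypothetical input).
 * addr:     faulting address inside that mapping.
 * shift:    log2 of the mapping size (e.g. 12 for 4K, 21 for 2M).
 */
static uint64_t addr_to_pfn_sketch(uint64_t base_pfn, uint64_t addr,
				   unsigned int shift)
{
	uint64_t offset_mask;

	assert(shift < 64);

	/* A base-page mapping covers exactly one PFN. */
	if (shift <= SKETCH_PAGE_SHIFT)
		return base_pfn;

	/* Offset within the huge mapping, converted to base pages. */
	offset_mask = ((uint64_t)1 << shift) - 1;
	return base_pfn + ((addr & offset_mask) >> SKETCH_PAGE_SHIFT);
}
```

Using the base page shift for the mask would drop the offset inside a huge mapping and always report the first page of the mapping.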
Balbir Singh (3):
powerpc/mce: Bug fixes for MCE handling in kernel space
powerpc/memcpy: Add memcpy_mcsafe for pmem
powerpc/mce: Handle memcpy_mcsafe
arch/powerpc/include/asm/mce.h | 3 +-
arch/powerpc/include/asm/string.h | 2 +
arch/powerpc/kernel/mce.c | 77 ++++++++++++-
arch/powerpc/kernel/mce_power.c | 26 +++--
arch/powerpc/lib/Makefile | 2 +-
arch/powerpc/lib/memcpy_mcsafe_64.S | 212 ++++++++++++++++++++++++++++++++++++
6 files changed, 308 insertions(+), 14 deletions(-)
create mode 100644 arch/powerpc/lib/memcpy_mcsafe_64.S
--
2.13.6

Changes from v1:
* Reworked patches 1 and 2 so that the __bdev_dax_supported() function
stays hidden behind the bdev_dax_supported() wrapper. This is needed
to prevent compilation errors in configs where CONFIG_FS_DAX isn't
defined. (0-day)
* Added Eric's Reviewed-by to patch 1. I did this in spite of the
bdev_dax_supported() changes because they were minor and I think
Eric's review was focused on the XFS parts.
---
This series fixes a few issues that I found with DM's handling of DAX
devices. Here are some of the issues I found:
* We can create a dm-stripe or dm-linear device which is made up of an
fsdax PMEM namespace and a raw PMEM namespace but which can hold a
filesystem mounted with the -o dax mount option. DAX operations to
the raw PMEM namespace part lack struct page and can fail in
interesting/unexpected ways when doing things like fork(), examining
memory with gdb, etc.
* We can create a dm-stripe or dm-linear device which is made up of an
fsdax PMEM namespace and a BRD ramdisk which can hold a filesystem
mounted with the -o dax mount option. All I/O to this filesystem
will fail.
* In DM you can't transition a dm target which could possibly support
DAX (mode DM_TYPE_DAX_BIO_BASED) to one which can't support DAX
(mode DM_TYPE_BIO_BASED), even if you never use DAX.
The first 2 patches in this series are prep work from Darrick and Dave
which improve bdev_dax_supported(). The last 5 patches fix the
above-mentioned problems in DM. I feel that this series simplifies the
handling of DAX devices in DM, and the last 5 DM-related patches have a
net code reduction of 50 lines.
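The rule the first two issues point towards is that a stacked device should only allow a DAX mount when every underlying device supports DAX. A minimal sketch of that check follows. The struct and function names are hypothetical and the two boolean fields are a simplification of what the real per-device check looks at; this is not the DM code from the series.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one device underneath a DM table. */
struct sketch_dev {
	const char *name;
	bool has_struct_page;	/* assumed false for a raw PMEM namespace */
	bool has_direct_access;	/* assumed false for a BRD ramdisk */
};

/* A single device is DAX capable only if it meets both requirements. */
static bool sketch_dev_supports_dax(const struct sketch_dev *d)
{
	return d->has_struct_page && d->has_direct_access;
}

/* A stacked device may allow -o dax only if every component supports DAX. */
static bool sketch_table_supports_dax(const struct sketch_dev *devs, size_t n)
{
	size_t i;

	if (n == 0)
		return false;

	for (i = 0; i < n; i++)
		if (!sketch_dev_supports_dax(&devs[i]))
			return false;

	return true;
}
```

With this rule, a table mixing an fsdax namespace with a raw namespace or a ramdisk is rejected for DAX up front rather than failing later at fork() or on the first I/O.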
Darrick J. Wong (1):
fs: allow per-device dax status checking for filesystems
Dave Jiang (1):
dax: change bdev_dax_supported() to support boolean returns
Ross Zwisler (5):
dm: fix test for DAX device support
dm: prevent DAX mounts if not supported
dm: remove DM_TYPE_DAX_BIO_BASED dm_queue_mode
dm-snap: remove unnecessary direct_access() stub
dm-error: remove unnecessary direct_access() stub
drivers/dax/super.c | 40 ++++++++++++++++++++--------------------
drivers/md/dm-ioctl.c | 16 ++++++----------
drivers/md/dm-snap.c | 8 --------
drivers/md/dm-table.c | 29 +++++++++++------------------
drivers/md/dm-target.c | 7 -------
drivers/md/dm.c | 7 ++-----
fs/ext2/super.c | 3 +--
fs/ext4/super.c | 3 +--
fs/xfs/xfs_ioctl.c | 3 ++-
fs/xfs/xfs_iops.c | 30 +++++++++++++++++++++++++-----
fs/xfs/xfs_super.c | 10 ++++++++--
include/linux/dax.h | 11 ++++++-----
include/linux/device-mapper.h | 8 ++++++--
13 files changed, 88 insertions(+), 87 deletions(-)
--
2.14.3

Changes since v9 [1] and v10 [2]
* Resend the full series with the reworked "mm: introduce
MEMORY_DEVICE_FS_DAX and CONFIG_DEV_PAGEMAP_OPS" (Christoph)
* Move generic_dax_pagefree() into the pmem driver (Christoph)
* Cleanup __bdev_dax_supported() (Christoph)
* Cleanup some stale SRCU bits leftover from other iterations (Jan)
* Cleanup xfs_break_layouts() (Jan)
[1]: https://lists.01.org/pipermail/linux-nvdimm/2018-April/015457.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2018-May/015885.html
---
Background:
get_user_pages() in the filesystem pins file backed memory pages for
access by devices performing dma. However, it only pins the memory pages,
not the page-to-file offset association. If a file is truncated the
pages are mapped out of the file and dma may continue indefinitely into
a page that is owned by a device driver. This breaks coherency of the
file vs dma, but the assumption is that if userspace wants the
file-space truncated it does not matter what data is inbound from the
device, it is not relevant anymore. The only expectation is that dma can
safely continue while the filesystem reallocates the block(s).
Problem:
This expectation that dma can safely continue while the filesystem
changes the block map is broken by dax. With dax the target dma page
*is* the filesystem block. The model of leaving the page pinned for dma,
but truncating the file block out of the file, means that the filesystem
is free to reallocate a block under active dma to another file and now
the expected data-incoherency situation has turned into active
data-corruption.
Solution:
Defer all filesystem operations (fallocate(), truncate()) on a dax mode
file while any page/block in the file is under active dma. This solution
assumes that dma is transient. Cases where dma operations are known to
not be transient, like RDMA, have been explicitly disabled via
commits like 5f1d43de5416 "IB/core: disable memory registration of
filesystem-dax vmas".
The dax_layout_busy_page() routine is called by filesystems with a lock
held against mm faults (i_mmap_lock) to find pinned / busy dax pages.
The process of looking up a busy page invalidates all mappings
to trigger any subsequent get_user_pages() to block on i_mmap_lock.
The filesystem continues to call dax_layout_busy_page() until it finally
returns no more active pages. This approach assumes that the page
pinning is transient; if that assumption is violated, the system would
likely have hung from the uncompleted I/O anyway.
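The retry loop described above can be modelled in a few lines. This is a user-space sketch under stated assumptions, not the fs/dax.c or XFS code: struct sketch_file, sketch_busy_page() and sketch_wait_for_unpin() are hypothetical, locking is omitted entirely, and "waiting" is modelled as dropping one pin.

```c
#include <stddef.h>

/* Hypothetical model of a file: pin_count[i] > 0 means page i is under dma. */
struct sketch_file {
	int *pin_count;
	size_t nr_pages;
};

/* Models dax_layout_busy_page(): index of a pinned page, or -1 if none. */
static long sketch_busy_page(const struct sketch_file *f)
{
	size_t i;

	for (i = 0; i < f->nr_pages; i++)
		if (f->pin_count[i] > 0)
			return (long)i;
	return -1;
}

/* Models waiting for one dma to finish: drops a single pin on the page. */
static void sketch_wait_for_unpin(struct sketch_file *f, long idx)
{
	if (f->pin_count[idx] > 0)
		f->pin_count[idx]--;
}

/*
 * Keep looking for busy pages until none remain, then let the layout change
 * (truncate, hole punch) proceed. Returns the number of waits performed.
 */
static unsigned int sketch_break_layouts(struct sketch_file *f)
{
	unsigned int waits = 0;
	long idx;

	while ((idx = sketch_busy_page(f)) >= 0) {
		sketch_wait_for_unpin(f, idx);
		waits++;
	}
	return waits;
}
```

The loop only terminates if every pin is eventually dropped, which is the "dma is transient" assumption stated above.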
---
Dan Williams (7):
memremap: split devm_memremap_pages() and memremap() infrastructure
mm: introduce MEMORY_DEVICE_FS_DAX and CONFIG_DEV_PAGEMAP_OPS
mm: fix __gup_device_huge vs unmap
mm, fs, dax: handle layout changes to pinned dax mappings
xfs: prepare xfs_break_layouts() to be called with XFS_MMAPLOCK_EXCL
xfs: prepare xfs_break_layouts() for another layout type
xfs, dax: introduce xfs_break_dax_layouts()
drivers/dax/super.c | 14 ++-
drivers/nvdimm/pfn_devs.c | 2
drivers/nvdimm/pmem.c | 25 +++++
fs/Kconfig | 1
fs/dax.c | 97 +++++++++++++++++++++
fs/xfs/xfs_file.c | 72 ++++++++++++++--
fs/xfs/xfs_inode.h | 16 +++
fs/xfs/xfs_ioctl.c | 8 --
fs/xfs/xfs_iops.c | 16 ++-
fs/xfs/xfs_pnfs.c | 15 ++-
fs/xfs/xfs_pnfs.h | 5 +
include/linux/dax.h | 7 ++
include/linux/memremap.h | 36 ++------
include/linux/mm.h | 71 +++++++++++----
kernel/Makefile | 3 -
kernel/iomem.c | 167 ++++++++++++++++++++++++++++++++++++
kernel/memremap.c | 209 ++++++---------------------------------------
mm/Kconfig | 5 +
mm/gup.c | 36 ++++++--
mm/hmm.c | 13 ---
mm/swap.c | 3 -
21 files changed, 542 insertions(+), 279 deletions(-)
create mode 100644 kernel/iomem.c

Changes since v3:
* Updated the text in docs/nvdimm.txt to make it clear that the value
being passed in on the command line is an integer made up of various
bit fields. (Rob Elliott)
* Updated the "Highest Valid Capability" byte to be dynamic based on
the highest valid bit in the user's input. (Rob Elliott)
---
The first 2 patches in this series clean up some things I noticed while
coding.
Patch 3 adds support for the new Platform Capabilities Structure, which
was added to the NFIT in ACPI 6.2 Errata A. We add a machine command
line option "nvdimm-cap":
-machine pc,accel=kvm,nvdimm,nvdimm-cap=2
which allows the user to pass in a value for this structure. When such
a value is passed in we will generate the new NFIT subtable.
Patch 4 adds code to the "make check" self test infrastructure so that
we generate the new Platform Capabilities Structure, and adds it to the
expected NFIT output so that we test for it.
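As a sketch of how a "Highest Valid Capability" value can be derived from the user's input, the helper below returns the index of the highest set bit in the nvdimm-cap value. The function name is hypothetical and this is not necessarily how the patch computes it; it only illustrates the "dynamic based on the highest valid bit" idea from the changelog.

```c
#include <stdint.h>

/*
 * Return the bit index of the highest set bit in the user-supplied
 * capabilities value, or -1 if no bit is set.
 */
static int sketch_highest_valid_cap(uint32_t capabilities)
{
	int bit = -1;

	while (capabilities) {
		bit++;
		capabilities >>= 1;
	}
	return bit;
}
```

For the nvdimm-cap=2 example above, only bit 1 is set, so the helper returns 1.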
Ross Zwisler (4):
nvdimm: fix typo in label-size definition
tests/.gitignore: add entry for generated file
nvdimm, acpi: support NFIT platform capabilities
ACPI testing: test NFIT platform capabilities
docs/nvdimm.txt | 27 ++++++++++++++++++++
hw/acpi/nvdimm.c | 45 +++++++++++++++++++++++++++++++---
hw/i386/pc.c | 31 +++++++++++++++++++++++
hw/mem/nvdimm.c | 2 +-
include/hw/i386/pc.h | 1 +
include/hw/mem/nvdimm.h | 7 +++++-
tests/.gitignore | 1 +
tests/acpi-test-data/pc/NFIT.dimmpxm | Bin 224 -> 240 bytes
tests/acpi-test-data/q35/NFIT.dimmpxm | Bin 224 -> 240 bytes
tests/bios-tables-test.c | 2 +-
10 files changed, 109 insertions(+), 7 deletions(-)
--
2.14.3

Hello,
I would like to ask about the EXPERIMENTAL message printed for Filesystem DAX:
--------------------------------------------------------
DAX enabled. Warning: EXPERIMENTAL, use at your own risk
--------------------------------------------------------
AFAIK, the last remaining issue with Filesystem DAX is the metadata update
problem, and it is (or will be?) solved by the great efforts on MAP_SYNC and
the "fix dma vs truncate/hole-punch" patch set.
So I suppose that the Experimental message can be removed,
but I'm not sure.
Is that possible?
Otherwise, are there any other remaining issues with Filesystem DAX?
If this is a silly question, sorry for the noise.
Thanks,
---
Yasunori Goto