Before people get too excited, this isn't a proposal to kill DAX. The
topic proposal is a discussion to resolve lingering open questions
that currently motivate ext4 and xfs to scream "EXPERIMENTAL" when the
current DAX facilities are enabled. There are 2 primary concerns to
resolve: enumerate the remaining features/fixes, and identify a path
to implement it all without regressing any existing application use
cases.
An enumeration of remaining projects follows, please expand this list
if I missed something:
* "DAX" has no specific meaning by itself, users have 2 use cases for
"DAX" capabilities: userspace cache management via MAP_SYNC, and page
cache avoidance where the latter aspect of DAX has no current api to
discover / use it. The project is to supplement MAP_SYNC with a
MAP_DIRECT facility and MADV_SYNC / MADV_DIRECT to indicate the same
dynamically via madvise. Similar to O_DIRECT, MAP_DIRECT would be an
application hint to avoid / minimize page cache usage, but no strict
guarantee like what MAP_SYNC provides.
* Resolve all "if (dax) goto fail;" patterns in the kernel. Outside of
longterm-GUP (a topic in its own right) the projects here are
XFS-reflink and XFS-realtime-device support. DAX+reflink effectively
requires a given physical page to be mapped into two different inodes
at different (page->index) offsets. The challenge is to support
DAX-reflink without violating any existing application visible
semantics, the operating assumption / strawman to debate is that
experimental status is not blanket permission to go change existing
semantics in backwards incompatible ways.
* Deprecate, but not remove, the DAX mount option. Too many flows
depend on the option so it will never go away, but the facility is too
coarse. Provide an option to enable MAP_SYNC and
more-likely-to-do-something-useful-MAP_DIRECT on a per-directory
basis. The current proposal is to allow this property to only be
toggled while the directory is empty to avoid the complications of
racing page invalidation with new DAX mappings.
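As a rough illustration of the MAP_SYNC half of the first item, the sketch below asks for a MAP_SYNC mapping and falls back to a plain shared mapping when the kernel or filesystem refuses it. The helper names (map_prefer_sync, demo_map_scratch_file) are hypothetical, and the fallback flag values are the asm-generic ones, assumed only for libc headers that do not expose them. MAP_DIRECT, MADV_SYNC and MADV_DIRECT are only proposed above and do not exist, so they are not used here.

```c
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Fallbacks for older libc headers; asm-generic values (an assumption
 * for architectures that define these flags differently). */
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x080000
#endif

/*
 * Map len bytes of fd read/write, preferring MAP_SYNC.
 * MAP_SHARED_VALIDATE makes the kernel reject flags it cannot honor
 * instead of silently ignoring them, so a failure here tells us that
 * MAP_SYNC is unavailable for this file. *synced reports which mapping
 * type was obtained. Returns MAP_FAILED if both attempts fail.
 */
void *map_prefer_sync(int fd, size_t len, int *synced)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
    if (p != MAP_FAILED) {
        *synced = 1;
        return p;
    }
    *synced = 0;
    return mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}

/*
 * Usage on a scratch file: returns 1 if MAP_SYNC was granted, 0 if the
 * helper fell back to MAP_SHARED, and -1 on any other error.
 */
int demo_map_scratch_file(void)
{
    char path[] = "/tmp/map-sync-demo-XXXXXX";
    int fd = mkstemp(path);
    int synced = 0, ok = 0;

    if (fd < 0)
        return -1;
    if (ftruncate(fd, 4096) == 0) {
        char *p = map_prefer_sync(fd, 4096, &synced);
        if (p != MAP_FAILED) {
            p[0] = 'x';
            ok = (p[0] == 'x');
            munmap(p, 4096);
        }
    }
    close(fd);
    unlink(path);
    return ok ? synced : -1;
}
```

On a file that is not DAX-backed the first mmap() is expected to fail and the fallback path is taken, which is exactly the "strict guarantee" behavior that distinguishes MAP_SYNC from a hint.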
Secondary projects, i.e. important but I would submit are not in the
critical path to removing the "experimental" designation:
* Filesystem-integrated badblock management. Hook up the media error
notifications from libnvdimm to the filesystem to allow for operations
like "list files with media errors" and "enumerate bad file offsets on
a granularity smaller than a page". Another consideration along these
lines is to integrate machine-check-handling and dynamic error
notification into a filesystem interface. I've heard complaints that
the sigaction() based mechanism to receive BUS_MCEERR_* information,
while sufficient for the "System RAM" use case, is not precise enough
for the "Persistent Memory / DAX" use case where errors are repairable
and sub-page error information is useful.
* Userfaultfd for file-backed mappings and DAX
Ideally all the usual DAX, persistent memory, and GUP suspects could
be in the room to discuss this:
* Jan Kara
* Dave Chinner
* Christoph Hellwig
* Jeff Moyer
* Johannes Thumshirn
* Matthew Wilcox
* John Hubbard
* Jérôme Glisse
* MM folks for the reflink vs 'struct page' vs Xarray considerations

This patch set proposes KUnit, a lightweight unit testing and mocking
framework for the Linux kernel.
Unlike Autotest and kselftest, KUnit is a true unit testing framework;
it does not require installing the kernel on a test machine or in a VM
and does not require tests to be written in userspace running on a host
kernel. Additionally, KUnit is fast: From invocation to completion KUnit
can run several dozen tests in under a second. Currently, the entire
KUnit test suite for KUnit runs in under a second from the initial
invocation (build time excluded).
KUnit is heavily inspired by JUnit, Python's unittest.mock, and
Googletest/Googlemock for C++. KUnit provides facilities for defining
unit test cases, grouping related test cases into test suites, providing
common infrastructure for running tests, mocking, spying, and much more.
## What's so special about unit testing?
A unit test is supposed to test a single unit of code in isolation,
hence the name. There should be no dependencies outside the control of
the test; this means no external dependencies, which makes tests orders
of magnitude faster. Likewise, since there are no external dependencies,
there are no hoops to jump through to run the tests. Additionally, this
makes unit tests deterministic: a failing unit test always indicates a
problem. Finally, because unit tests necessarily have finer granularity,
they are able to test all code paths easily, solving the classic problem
of difficulty in exercising error handling code.
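To make the error-handling point concrete, here is a small userspace C sketch. It does not use the KUnit API; every name in it is hypothetical. The unit under test takes its allocator as a parameter, so a test can substitute one that always fails and reach the error path deterministically.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Dependency injected into the unit under test. */
typedef void *(*alloc_fn)(size_t size);

/* Hypothetical unit under test: copy len bytes of src into a new buffer. */
int dup_buffer(alloc_fn alloc, const char *src, size_t len, char **out)
{
    char *copy = alloc(len);

    if (!copy)
        return -ENOMEM; /* error path that is hard to hit end-to-end */
    memcpy(copy, src, len);
    *out = copy;
    return 0;
}

/* Test double: an allocator that always fails. */
static void *failing_alloc(size_t size)
{
    (void)size;
    return NULL;
}

/* Returns 0 when both the failure path and the success path behave. */
int run_dup_buffer_tests(void)
{
    char *out = NULL;

    /* Failure path: no allocation, so out must stay untouched. */
    if (dup_buffer(failing_alloc, "abc", 4, &out) != -ENOMEM || out != NULL)
        return 1;

    /* Success path with the real allocator. */
    if (dup_buffer(malloc, "abc", 4, &out) != 0 || strcmp(out, "abc") != 0)
        return 2;
    free(out);
    return 0;
}
```

Nothing outside the test controls the outcome, which is what makes the result repeatable.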
## Is KUnit trying to replace other testing frameworks for the kernel?
No. Most existing tests for the Linux kernel are end-to-end tests, which
have their place. A well tested system has lots of unit tests, a
reasonable number of integration tests, and some end-to-end tests. KUnit
is just trying to address the unit test space which is currently not
being addressed.
## More information on KUnit
There is a bunch of documentation near the end of this patch set that
describes how to use KUnit and best practices for writing unit tests.
For convenience I am hosting the compiled docs here:
https://google.github.io/kunit-docs/third_party/kernel/docs/
Additionally for convenience, I have applied these patches to a branch:
https://kunit.googlesource.com/linux/+/kunit/rfc/4.19/v3
The repo may be cloned with:
git clone https://kunit.googlesource.com/linux
This patchset is on the kunit/rfc/4.19/v3 branch.
## Changes Since Last Version
- Changed namespace prefix from `test_*` to `kunit_*` as requested by
Shuah.
- Started converting/cleaning up the device tree unittest to use KUnit.
- Started adding KUnit expectations with custom messages.
--
2.20.0.rc0.387.gc7a69e6b6c-goog

From: Ira Weiny <ira.weiny(a)intel.com>
Pre-requisites
==============
Based on mmotm tree.
Based on the feedback from LSFmm, the LWN article, the RFC series since
then, and a ton of scenarios I've worked in my mind and/or tested...[1]
Solution summary
================
The real issue is that there is no use case for a user to have RDMA pinned
memory which is then truncated. So really any solution we present which:
A) Prevents file system corruption or data leaks
...and...
B) Informs the user that they did something wrong
Should be an acceptable solution.
Because this is slightly new behavior, and because it is going to be
specific to DAX (because of the lack of a page cache), we have made the user
"opt in" to this behavior.
The following patches implement the following solution.
0) Registrations to Device DAX char devs are not affected
1) The user has to opt in to allowing page pins on a file with an exclusive
layout lease. Both exclusive and layout lease flags are user visible now.
2) Page pins will fail if the lease is not active when the file-backed page is
encountered.
3) Any truncate or hole punch operation on a pinned DAX page will fail.
4) The user has the option of holding the lease or releasing it. If they
release it no other pin calls will work on the file.
5) Closing the file is ok.
6) Unmapping the file is ok
7) Pins against the files are tracked back to an owning file or an owning mm
depending on the internal subsystem needs. With RDMA there is an owning
file which is related to the pinned file.
8) Only RDMA is currently supported
9) Truncation of pages which are not actively pinned nor covered by a lease
will succeed.
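The interaction of rules 2, 3, 4 and 9 can be sketched as a toy userspace state model. This is not kernel code and all names are hypothetical. The summary does not spell out what truncate does when a lease is held but nothing is pinned, so the model reports that case as unspecified instead of guessing; leaving existing pins in place when the lease is released is likewise an assumption.

```c
#include <stdbool.h>

/* Toy model of one DAX file; not kernel code, names are hypothetical. */
struct dax_file_model {
    bool lease_active; /* exclusive layout lease currently held (rule 1) */
    int pins;          /* outstanding long-term page pins */
};

enum trunc_result {
    TRUNC_OK,          /* truncate / hole punch proceeds */
    TRUNC_FAIL,        /* truncate / hole punch is refused */
    TRUNC_UNSPECIFIED  /* lease held, nothing pinned: not covered above */
};

/* Rules 1, 2 and 4: a pin succeeds only while the lease is active. */
bool model_pin(struct dax_file_model *f)
{
    if (!f->lease_active)
        return false;
    f->pins++;
    return true;
}

void model_unpin(struct dax_file_model *f)
{
    if (f->pins > 0)
        f->pins--;
}

/* Rule 4: the lease may be released; later pin calls then fail.
 * Existing pins stay in place in this model (an assumption). */
void model_release_lease(struct dax_file_model *f)
{
    f->lease_active = false;
}

enum trunc_result model_truncate(const struct dax_file_model *f)
{
    if (f->pins > 0)
        return TRUNC_FAIL;      /* rule 3 */
    if (!f->lease_active)
        return TRUNC_OK;        /* rule 9 */
    return TRUNC_UNSPECIFIED;
}
```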
Reporting of pinned files in procfs
===================================
A number of alternatives were explored for how to report the file pins within
procfs. The following incorporates ideas from Jan Kara, Jason Gunthorpe, Dave
Chinner, Dan Williams and myself.
A new entry is added to procfs
/proc/<pid>/file_pins
For processes which have pinned DAX file memory, file_pins references come in 2
flavors: those which are attached to another open file descriptor (for example,
what is done in the RDMA subsystem) and those which are attached to a process
mm.
For those which are attached to another open file descriptor (such as RDMA)
the file pin references go through the 'struct file' associated with that pin.
In RDMA this is the RDMA context struct file.
The resulting output from procfs is something like:
$ cat /proc/<pid>/file_pins
3: /dev/infiniband/uverbs0
   /mnt/pmem/foo
Where '3' is the file descriptor (and file path) of the rdma context within the
process. The paths of the files pinned using that context are then listed.
RDMA contexts may have multiple MRs, each of which may have multiple files
pinned within them. So an output like the following is possible:
$ cat /proc/<pid>/file_pins
4: /dev/infiniband/uverbs0
   /mnt/pmem/foo
   /mnt/pmem/bar
   /mnt/pmem/another
   /mnt/pmem/one
The actual memory regions associated with the file pins are not reported.
For processes which are pinning memory which is not associated with a specific
file descriptor, memory pins are reported directly as paths to the file.
$ cat /proc/<pid>/file_pins
/mnt/pmem/foo
Putting the above together, if a process was using RDMA and another subsystem
the output could be something like:
$ cat /proc/<pid>/file_pins
4: /dev/infiniband/uverbs0
   /mnt/pmem/foo
   /mnt/pmem/bar
   /mnt/pmem/another
   /mnt/pmem/one
/mnt/pmem/foo
/mnt/pmem/another
/mnt/pmem/mm_mapped_file
[1] https://lkml.org/lkml/2019/6/5/1046
Background
==========
It should be noted that one solution for this problem is to use RDMA's On
Demand Paging (ODP). There are 2 big reasons this may not work.
1) The hardware being used for RDMA may not support ODP
2) ODP may be detrimental to the overall network (cluster or cloud)
performance
Therefore, in order to support RDMA to File system pages without On Demand
Paging (ODP) a number of things need to be done.
1) "longterm" GUP users need to inform other subsystems that they have taken a
pin on a page which may remain pinned for a very "long time". The
definition of long time is debatable but it has been established that RDMA's
use of pages for minutes, hours, or even days after the pin is the extreme
case which makes this problem most severe.
2) Any page which is "controlled" by a file system needs to have special
handling. The details of the handling depend on whether the page is page cache
fronted or not.
2a) A page cache fronted page which has been pinned by GUP long term can use a
bounce buffer to allow the file system to write back snap shots of the page.
This is handled by the FS recognizing the GUP long term pin and making a copy
of the page to be written back.
NOTE: this patch set does not address this path.
2b) A FS "controlled" page which is not page cache fronted is either easier
to deal with or harder depending on the operation the filesystem is trying
to do.
2ba) [Hard case] If the FS operation _is_ a truncate or hole punch the
FS can no longer use the pages in question until the pin has been
removed. This patch set presents a solution to this by introducing
some reasonable restrictions on user space applications.
2bb) [Easy case] If the FS operation is _not_ a truncate or hole punch
then there is nothing which need be done. Data is Read or Written
directly to the page. This is an easy case which would currently work
if not for GUP long term pins being disabled. Therefore this patch set
need not change access to the file data but does allow for GUP pins
after 2ba above is dealt with.
This patch series presents a solution for problem 2ba).
Ira Weiny (19):
fs/locks: Export F_LAYOUT lease to user space
fs/locks: Add Exclusive flag to user Layout lease
mm/gup: Pass flags down to __gup_device_huge* calls
mm/gup: Ensure F_LAYOUT lease is held prior to GUP'ing pages
fs/ext4: Teach ext4 to break layout leases
fs/ext4: Teach dax_layout_busy_page() to operate on a sub-range
fs/xfs: Teach xfs to use new dax_layout_busy_page()
fs/xfs: Fail truncate if page lease can't be broken
mm/gup: Introduce vaddr_pin structure
mm/gup: Pass a NULL vaddr_pin through GUP fast
mm/gup: Pass follow_page_context further down the call stack
mm/gup: Prep put_user_pages() to take an vaddr_pin struct
{mm,file}: Add file_pins objects
fs/locks: Associate file pins while performing GUP
mm/gup: Introduce vaddr_pin_pages()
RDMA/uverbs: Add back pointer to system file object
RDMA/umem: Convert to vaddr_[pin|unpin]* operations.
{mm,procfs}: Add display file_pins proc
mm/gup: Remove FOLL_LONGTERM DAX exclusion
drivers/infiniband/core/umem.c | 26 +-
drivers/infiniband/core/umem_odp.c | 16 +-
drivers/infiniband/core/uverbs.h | 1 +
drivers/infiniband/core/uverbs_main.c | 1 +
fs/Kconfig | 1 +
fs/dax.c | 38 ++-
fs/ext4/ext4.h | 2 +-
fs/ext4/extents.c | 6 +-
fs/ext4/inode.c | 26 +-
fs/file_table.c | 4 +
fs/locks.c | 291 +++++++++++++++++-
fs/proc/base.c | 214 +++++++++++++
fs/xfs/xfs_file.c | 21 +-
fs/xfs/xfs_inode.h | 5 +-
fs/xfs/xfs_ioctl.c | 15 +-
fs/xfs/xfs_iops.c | 14 +-
include/linux/dax.h | 12 +-
include/linux/file.h | 49 +++
include/linux/fs.h | 5 +-
include/linux/huge_mm.h | 17 --
include/linux/mm.h | 69 +++--
include/linux/mm_types.h | 2 +
include/rdma/ib_umem.h | 2 +-
include/uapi/asm-generic/fcntl.h | 5 +
kernel/fork.c | 3 +
mm/gup.c | 418 ++++++++++++++++----------
mm/huge_memory.c | 18 +-
mm/internal.h | 28 ++
28 files changed, 1048 insertions(+), 261 deletions(-)
--
2.20.1

On Fri, Aug 23, 2019 at 09:04:29AM -0300, Jason Gunthorpe wrote:
> On Fri, Aug 23, 2019 at 01:23:45PM +1000, Dave Chinner wrote:
>
> > > But the fact that RDMA, and potentially others, can "pass the
> > > pins" to other processes is something I spent a lot of time trying to work out.
> >
> > There's nothing in file layout lease architecture that says you
> > can't "pass the pins" to another process. All the file layout lease
> > requirements say is that if you are going to pass a resource for
> > which the layout lease guarantees access for to another process,
> > then the destination process already have a valid, active layout
> > lease that covers the range of the pins being passed to it via the
> > RDMA handle.
>
> How would the kernel detect and enforce this? There are many ways to
> pass a FD.
AFAIC, that's not really a kernel problem. It's more of an
application design constraint than anything else. i.e. if the app
passes the IB context to another process without a lease, then the
original process is still responsible for recalling the lease and
has to tell that other process to release the IB handle and its
resources.
> IMHO it is wrong to try and create a model where the file lease exists
> independently from the kernel object relying on it. In other words the
> IB MR object itself should hold a reference to the lease it relies
> upon to function properly.
That still doesn't work. Leases are not individually trackable or
reference counted objects - they are attached to a struct
file but, in reality, they are far more restricted than a struct
file.
That is, a lease specifically tracks the pid and the _open fd_ it
was obtained for, so it is essentially owned by a specific process
context. Hence a lease is not able to be passed to a separate
process context and have it still work correctly for lease break
notifications. i.e. the layout break signal gets delivered to the
original process that created the struct file, if it still exists
and has the original fd still open. It does not get sent to the
process that currently holds a reference to the IB context.
So while a struct file passed to another process might still have
an active lease, and you can change the owner of the struct file
via fcntl(F_SETOWN), you can't associate the existing lease with
the new fd in the new process and so layout break signals can't be
directed at the lease fd....
This really means that a lease can only be owned by a single process
context - it can't be shared across multiple processes (so I was
wrong about dup/pass as being a possible way of passing them)
because there's only one process that can "own" a struct file, and
that's where signals are sent when the lease needs to be broken.
So, fundamentally, if you want to pass a resource that pins a file
layout between processes, both processes need to hold a layout lease
on that file range. And that means exclusive leases and passing
layouts between processes are fundamentally incompatible because you
can't hold two exclusive leases on the same file range....
Cheers,
Dave.
--
Dave Chinner
david(a)fromorbit.com