*[PATCH 0/11] fs: fixes for major copy_file_range() issues@ 2018-12-03 8:34 Dave Chinner
2018-12-03 8:34 ` [PATCH 01/11] vfs: copy_file_range source range over EOF should fail Dave Chinner
` (11 more replies)0 siblings, 12 replies; 83+ messages in thread
From: Dave Chinner @ 2018-12-03 8:34 UTC (permalink / raw)
To: linux-fsdevel, linux-xfs
Cc: olga.kornievskaia, linux-nfs, linux-unionfs, ceph-devel, linux-cifs
Hi folks,
As most of you already know, we really suck at introducing new
functionality. The recent problems we found with clone/dedupe file
range interfaces also plague the copy_file_range() API and
implementation. Not only doesn't it do exactly what the man page
says, the man page doesn't document everything the syscal does
either.
There's a few problems:
- can overwrite setuid files
- can read from and overwrite active swap files
- can overwrite immutable files
- doesn't update timestamps
- doesn't obey resource limits
- doesn't catch overlapping copy ranges to the same file
- doesn't consistently implement fallback strategies
- does error out when the source range extends past EOF like
the man page says it should
- isn't consistent with clone file range behaviour
- inconsistent behaviour between filesystems
- inconsistent fallback implementations
And so on. There's so much wrong, and I haven't even got to the
problems that the generic fallback code (i.e. do_splice_direct()
has). That's for another day.
So, what this series attempts to do is clean up the code, implement
all the missing checks, provide an infrastructure layout that allows
for consistent behaviour across filesystems and allows filesysetms
to control fallback mechanisms and cross-device copies.
I'll repeat that so it's clear: the series also enabled cross-device
copies once all the problems are sorted out.
To that end, the current fallback code is moved to
generic_copy_file_range(), and that is called only if the filesystem
does not provide a ->copy_file_range implementation. If the
filesystem provides such a method, itmust implement the page cache
copy fallback itself by calling generic_copy_file_range() when
appropriate. I did this because different filesystems have different
copy-offload capabilities and so need to fall back in different
situations. It's easier to have them call generic_copy_file_range()
to do that copy when necessary than it is to have them try to
communicate back up to vfs_copy_file_range() that it should run a
fallback copy.
To make all the implementations perform the same validity checks,
I've created a generic_copy_file_checks() which is similar to the
checks we do for clone/dedupe. It's not quite the same, but the core
is very similar. This strips setuid, updates timestamps, checks and
enforces filesystem and resource limits, bounds checks the copy
ranges, etc.
This needs to be run before we call ->remap_file_range() so that we
end up with consistent behaviour across copy_file_range() calls.
e.g. we want an XFS filesystem with reflink=1 (i.e. supports
->remap_file_range()) to behave the same as an XFS filesystem with
reflink=0. Hence we need to check all the parameters up front so we
don't end up with calls to ->remap_file_range() resulting in
different behaviour.
It also means that ->copy_file_range implementations only need to
bounds checking the input against fileystem internal constraints,
not everything. This makes the filesystem implementations simpler,
and means they can call the falloback generic_copy_file_range()
implementation without having to care about further bounds checking.
I have not changed the fallback behaviour of the CIFS, Ceph or NFS
client implementations. The still reject copy_file_range() to the
same file with EINVAL, even though it is supported by the fallback
and filesystems that implement ->remap_file_range(). I'll leave it
for the maintainers to decide if they want to implement the manual
data copy fallback or not. My personal opinion is that they should
implement the fallback where-ever they can, but userspace has to be
prepared for copy_file_range() to fail and so implementing the
fallback is an optional feature.
In terms of testing, Darrick and I have been beating the hell out of
copy_file_range with fsx on XFS to sort out all the data corruption
problems it has exposed (we're still working on that). Patches have
been posted to enhance fsx and fsstress in fstests to exercise
clone/dedupe/copy_file_range. Thread here:
https://www.spinics.net/lists/fstests/msg10920.html
I've also written a bounds/behaviour exercising test:
https://marc.info/?l=fstests&m=154381938829897&w=2https://marc.info/?l=fstests&m=154381939029898&w=2https://marc.info/?l=fstests&m=154381939229899&w=2https://marc.info/?l=fstests&m=154381939329900&w=2
I don't know whether I've got all the permission tests right in this
patchset. There's absolutely no documentation telling us when we
should use file_permission, inode_permission, etc in the
documentation or the code, so I just added the things that made the
tests do the things i think are the right things to be doing.
To run the tests, you'll also need modifications to xfs_io to allow
it to modify state appropriately. This is something we have
overlooked in the past, and so a lots of xfs_io based behaviour
checking is not actually testing the syscall we thought it was
testing but is instead testing the permission checking of the open()
syscall. Those patches are here:
https://marc.info/?l=linux-xfs&m=154378403323889&w=2https://marc.info/?l=linux-xfs&m=154378403523890&w=2https://marc.info/?l=linux-xfs&m=154378403323888&w=2https://marc.info/?l=linux-xfs&m=154379644526132&w=2
These changes really need to go in before we merge any more
copy_file_range() features - we need to get the basics right and get
test coverage over it before we unleash things like NFS server-side
copies on unsuspecting users with filesystems that have busted
copy_file_range() implementations.
I'll be appending a man page patch to this series that documents all
the errors this syscall can throw, the expected behaviours, etc. The
test and the man page were written together first, and the
implementation changes were done second. So if you don't agree with
the behaviour, discuss what the man page patch should say and define,
then I'll change the test to reflect that and I'll go from there.
-Dave.
^permalinkrawreply [flat|nested] 83+ messages in thread

*Re: [PATCH 02/11] vfs: introduce generic_copy_file_range()
2018-12-03 10:03 ` Amir Goldstein@ 2018-12-03 23:00 ` Dave Chinner0 siblings, 0 replies; 83+ messages in thread
From: Dave Chinner @ 2018-12-03 23:00 UTC (permalink / raw)
To: Amir Goldstein
Cc: linux-fsdevel, linux-xfs, Olga Kornievskaia,
Linux NFS Mailing List, overlayfs, ceph-devel, linux-cifs,
Miklos Szeredi
On Mon, Dec 03, 2018 at 12:03:41PM +0200, Amir Goldstein wrote:
> On Mon, Dec 3, 2018 at 10:34 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > Right now if vfs_copy_file_range() does not use any offload
> > mechanism, it falls back to calling do_splice_direct(). This fails
> > to do basic sanity checks on the files being copied. Before we
> > start adding this necessarily functionality to the fallback path,
> > separate it out into generic_copy_file_range().
> >
> > generic_copy_file_range() has the same prototype as
> > ->copy_file_range() so that filesystems can use it in their custom
> > ->copy_file_range() method if they so choose.
> >
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
>
> Looks good.
>
> Reviewed-by: Amir Goldstein <amir73il@gmail.com>
>
> Question:
> 2 years ago you suggested that I covert the overlayfs copy up
> code that does a do_direct_splice() with a loop of vfs_copy_file_range():
> https://marc.info/?l=linux-fsdevel&m=147369468521525&w=2
> We ended up with a slightly different solution, but with your recent
> changes, I can get back to your original proposal.
>
> Back then, I wondered whether it makes sense to push the killable
> loop of shorter do_direct_splice() calls into the vfs helper.
> What do you think about adding this to generic_copy_file_range()
> now? (I can do that after your changes are merged).
No. Adding another loop on top of all the loops already in the
do_direct_splice() is just crazy. The code is hard enough to follow
to begin with. If we are going to make do_splice_direct() killable,
then it needs to be done the splice_direct_to_actor loop that
already splits large splice ranges up into smaller chunks.
As it is, addressing the flaws of do_splice_direct() is not
something I'm about to do in this patchset. It has many issues, and
it's yet another piece of work we need to undertake to make
copy_file_range() somewhat user friendly.
> The fact that userspace *can* enter a very long unkillable loop
> with current copy_file_range() syscall doesn't mean that we
> *should* persist this situation. After all, fixing the brokenness
> of the existing interface is what you set out to do.
That's not an API issue - that's an implementation problem.
Quite frankly, making copy offload implementations killable is going
to "fun" for filesystems that offload the copy to remote servers, so
whatever we do fo the fallback isn't going to prevent
copy_file_range() from being unkillable.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^permalinkrawreply [flat|nested] 83+ messages in thread

*Re: [PATCH 08/11] vfs: push EXDEV check down into ->remap_file_range
2018-12-03 23:58 ` Darrick J. Wong@ 2018-12-04 9:17 ` Amir Goldstein0 siblings, 0 replies; 83+ messages in thread
From: Amir Goldstein @ 2018-12-04 9:17 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Dave Chinner, linux-fsdevel, linux-xfs, Olga Kornievskaia,
Linux NFS Mailing List, overlayfs, ceph-devel, linux-cifs
> > > I think this is sort of backwards -- the checks should stay in
> > > do_clone_file_range, and vfs_copy_file_range should be calling that
> > > instead of directly calling ->remap_range():
> > >
> > > vfs_copy_file_range()
> > > {
> > > file_start_write(...);
> > > ret = do_clone_file_range(...);
> > > if (ret > 0)
> > > return ret;
> > > ret = do_copy_file_range(...);
> > > file_end_write(...);
> > > return ret;
> > > }
> >
> > I'm already confused by the way we weave in and out of "vfs_/do_*"
> > functions, and this just makes it worse.
> >
> > Just what the hell is supposed to be in a "vfs_" prefixed function,
> > and why the hell is it considered a "vfs" level function if we then
> > export it's internal functions for individual filesystems to use?
>
> I /think/ vfs_ functions are file_start_write()/file_end_write()
> wrappers around a similarly named function that lacks the freeze
> protection??
That is definitely not an official definition of vfs_ vs. do_, but I found
this rule to be a common practice, which is why I swapped
{do,vfs}_clone_file_range(). But around vfs you can find many examples
where do_ helpers wrap vfs_ helpers.
>
> (AFAICT Amir made that split so that overlayfs could use these
> functions, though I do not know if everything vfs_ was made that way
> /specifically/ for overlayfs or if that's the way things have been and
> ovlfs simply takes advantage of it...)
>
> Guhhh, none of this is documented......
>
It looks like in git epoc, things were pretty straight forward.
vfs_XXX was the interface called after sys_XXX converted
userspace arguments (e.g. char *name, int fd) to vfs objects
(e.g. struct path,dentry,inode,file). Sometimes vfs_ helpers called
do_ helpers for several reasons. See for example epoc version
of fs/namei.c fs/read_write.c.
Even then there were exception. For example do_sendfile()
doesn't even have a vfs_ interface, although it is clear what
that prospect interface would look like.
To that end, do_splice_direct() acts as the standard do_ helper
to that non-existing vfs_ interface.
From there on, I guess things kinda grew organically.
fs/namei.c syscalls grew do_XXXat() helpers between syscalls
and vfs_XXX interface.
Overlayfs uses vfs_ interface 99% of the time, so from that perspective
it is regarded as an interface with vfs objects as arguments that does
NOT skip security_ checks and does NOT bypass freeze protection.
Overlayfs calling do_clone_file_range() and do_splice_direct() are
the only exception to this rule.
If we would want to replace those calls in ovl_copy_up_data() with
a single call to do_copy_file_range(), than said helper should NOT
be taking freeze protection and should do the fallback between
filesystem copy_file_range and generic_copy_file_range.
Cheers,
Amir.
^permalinkrawreply [flat|nested] 83+ messages in thread

*Re: [PATCH 01/11] vfs: copy_file_range source range over EOF should fail
2018-12-04 15:13 ` Christoph Hellwig@ 2018-12-04 21:29 ` Dave Chinner
2018-12-04 21:47 ` Olga Kornievskaia
2018-12-05 14:12 ` Christoph Hellwig0 siblings, 2 replies; 83+ messages in thread
From: Dave Chinner @ 2018-12-04 21:29 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Amir Goldstein, linux-fsdevel, linux-xfs, Olga Kornievskaia,
Linux NFS Mailing List, overlayfs, ceph-devel, linux-cifs
On Tue, Dec 04, 2018 at 07:13:32AM -0800, Christoph Hellwig wrote:
> On Mon, Dec 03, 2018 at 02:46:20PM +0200, Amir Goldstein wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > >
> > > The man page says:
> > >
> > > EINVAL Requested range extends beyond the end of the source file
> > >
> > > But the current behaviour is that copy_file_range does a short
> > > copy up to the source file EOF. Fix the kernel behaviour to match
> > > the behaviour described in the man page.
>
> I think the behavior implemented is a lot more useful than the one
> documented..
The current behaviour is really nasty. Because copy_file_range() can
return short copies, the caller has to implement a loop to ensure
the range hey want get copied. When the source range you are
trying to copy overlaps source EOF, this loop:
while (len > 0) {
ret = copy_file_range(... len ...)
...
off_in += ret;
off_out += ret;
len -= ret;
}
Currently the fallback code copies up to the end of the source file
on the first copy and then fails the second copy with EINVAL because
the source range is now completely beyond EOF.
So, from an application perspective, did the copy succeed or did it
fail?
Existing tools that exercise copy_file_range (like xfs_io) consider
this a failure, because the second copy_file_range() call returns
EINVAL and not some "there is no more to copy" marker like read()
returning 0 bytes when attempting to read beyond EOF.
IOWs, we cannot tell the difference between a real error and a short
copy because the input range spans EOF and it was silently
shortened. That's the API problem we need to fix here - the existing
behaviour is really crappy for applications. Erroring out
immmediately is one solution, and it's what the man page says should
happen so that is what I implemented.
Realistically, though, I think an attempt to read beyond EOF for the
copy should result in behaviour like read() (i.e. return 0 bytes),
not EINVAL. The existing behaviour needs to change, though.
> > i_size_read()...
> >
> > Otherwise
> > Reviewed-by: Amir Goldstein <amir73il@gmail.com>
>
> Looks like this doesn't even compile?
It's fixed in a later patch that consolidates the checks into a
generic check function, but I'm not sure why my "compile every
patch" script didn't catch this.
Cheers,
-Dave.
--
Dave Chinner
david@fromorbit.com
^permalinkrawreply [flat|nested] 83+ messages in thread

*Re: [PATCH 09/11] vfs: push copy_file_ranges -EXDEV checks down
2018-12-04 15:43 ` Christoph Hellwig@ 2018-12-04 22:18 ` Dave Chinner
2018-12-04 23:33 ` Olga Kornievskaia
2018-12-05 14:09 ` Christoph Hellwig0 siblings, 2 replies; 83+ messages in thread
From: Dave Chinner @ 2018-12-04 22:18 UTC (permalink / raw)
To: Christoph Hellwig
Cc: linux-fsdevel, linux-xfs, olga.kornievskaia, linux-nfs,
linux-unionfs, ceph-devel, linux-cifs
On Tue, Dec 04, 2018 at 07:43:47AM -0800, Christoph Hellwig wrote:
> Well, this isn't bugfixes anymore, but adding new features..
I made that perfectly clear in the cover description. I called it
twice, one of them explicitly stating that this series made these
infrastructure changes because we have pending functionality that
dependents on cross-device copies being supported in a sane manner.
I'll drop it if you want, but then I'll just have to come back after
all the NFS code is merged and do yet more cleanup work.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^permalinkrawreply [flat|nested] 83+ messages in thread

*Re: [PATCH 01/11] vfs: copy_file_range source range over EOF should fail
2018-12-04 21:47 ` Olga Kornievskaia@ 2018-12-04 22:31 ` Dave Chinner
2018-12-05 16:51 ` bfields
2019-05-20 9:10 ` Amir Goldstein0 siblings, 2 replies; 83+ messages in thread
From: Dave Chinner @ 2018-12-04 22:31 UTC (permalink / raw)
To: Olga Kornievskaia
Cc: Christoph Hellwig, Amir Goldstein, linux-fsdevel, linux-xfs,
linux-nfs, linux-unionfs, ceph-devel, linux-cifs
On Tue, Dec 04, 2018 at 04:47:18PM -0500, Olga Kornievskaia wrote:
> On Tue, Dec 4, 2018 at 4:35 PM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Tue, Dec 04, 2018 at 07:13:32AM -0800, Christoph Hellwig wrote:
> > > On Mon, Dec 03, 2018 at 02:46:20PM +0200, Amir Goldstein wrote:
> > > > > From: Dave Chinner <dchinner@redhat.com>
> > > > >
> > > > > The man page says:
> > > > >
> > > > > EINVAL Requested range extends beyond the end of the source file
> > > > >
> > > > > But the current behaviour is that copy_file_range does a short
> > > > > copy up to the source file EOF. Fix the kernel behaviour to match
> > > > > the behaviour described in the man page.
> > >
> > > I think the behavior implemented is a lot more useful than the one
> > > documented..
> >
> > The current behaviour is really nasty. Because copy_file_range() can
> > return short copies, the caller has to implement a loop to ensure
> > the range hey want get copied. When the source range you are
> > trying to copy overlaps source EOF, this loop:
> >
> > while (len > 0) {
> > ret = copy_file_range(... len ...)
> > ...
> > off_in += ret;
> > off_out += ret;
> > len -= ret;
> > }
> >
> > Currently the fallback code copies up to the end of the source file
> > on the first copy and then fails the second copy with EINVAL because
> > the source range is now completely beyond EOF.
> >
> > So, from an application perspective, did the copy succeed or did it
> > fail?
> >
> > Existing tools that exercise copy_file_range (like xfs_io) consider
> > this a failure, because the second copy_file_range() call returns
> > EINVAL and not some "there is no more to copy" marker like read()
> > returning 0 bytes when attempting to read beyond EOF.
> >
> > IOWs, we cannot tell the difference between a real error and a short
> > copy because the input range spans EOF and it was silently
> > shortened. That's the API problem we need to fix here - the existing
> > behaviour is really crappy for applications. Erroring out
> > immmediately is one solution, and it's what the man page says should
> > happen so that is what I implemented.
> >
> > Realistically, though, I think an attempt to read beyond EOF for the
> > copy should result in behaviour like read() (i.e. return 0 bytes),
> > not EINVAL. The existing behaviour needs to change, though.
>
> There are two checks to consider
> 1. pos_in >= EOF should return EINVAL
> 2. however what's perhaps should be relaxed is pos_in+len >= EOF
> should return a short copy.
>
> Having check#1 enforced allows to us to differentiate between a real
> error and a short copy.
That's what the code does right now and *exactly what I'm trying to
fix* because it EINVAL is ambiguous and not an indicator that we've
reached the end of the source file. EINVAL can indicate several
different errors, so it really has to be treated as a "copy failed"
error by applications.
Have a look at read/pread() - they return 0 in this case to indicate
a short read, and the value of zero is explicitly defined as meaning
"read position is beyond EOF". Applications know straight away that
there is no more data to be read and there was no error, so can
terminate on a successful short read.
We need to allow applications to terminate copy loops on a
successful short copy. IOWs, applications need to either:
- get an immediate error saying the range is invalid rather
than doing a short copy (as per the man page); or
- have an explicit marker to say "no more data to be copied"
Applications need the "no more data to copy" case to be explicit and
unambiguous so they can make sane decisions about whether a short
copy was successful because the file was shorter than expected or
whether a short copy was a result of a real error being encountered.
The current behaviour is largely unusable for applications because
they have to guess at the reason for EINVAL part way through a
copy....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^permalinkrawreply [flat|nested] 83+ messages in thread

*Re: [PATCH 09/11] vfs: push copy_file_ranges -EXDEV checks down
2018-12-04 22:18 ` Dave Chinner@ 2018-12-04 23:33 ` Olga Kornievskaia
2018-12-05 14:09 ` Christoph Hellwig1 sibling, 0 replies; 83+ messages in thread
From: Olga Kornievskaia @ 2018-12-04 23:33 UTC (permalink / raw)
To: david
Cc: Christoph Hellwig, linux-fsdevel, linux-xfs, linux-nfs,
linux-unionfs, ceph-devel, linux-cifs
On Tue, Dec 4, 2018 at 5:18 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Tue, Dec 04, 2018 at 07:43:47AM -0800, Christoph Hellwig wrote:
> > Well, this isn't bugfixes anymore, but adding new features..
>
> I made that perfectly clear in the cover description. I called it
> twice, one of them explicitly stating that this series made these
> infrastructure changes because we have pending functionality that
> dependents on cross-device copies being supported in a sane manner.
>
> I'll drop it if you want, but then I'll just have to come back after
> all the NFS code is merged and do yet more cleanup work.
This doesn't needs to be fixed in this patch series. I think Anna was
pointing out to me for something to take a look at.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
^permalinkrawreply [flat|nested] 83+ messages in thread

*Re: [PATCH 09/11] vfs: push copy_file_ranges -EXDEV checks down
2018-12-04 22:18 ` Dave Chinner
2018-12-04 23:33 ` Olga Kornievskaia@ 2018-12-05 14:09 ` Christoph Hellwig
2018-12-05 17:01 ` Olga Kornievskaia1 sibling, 1 reply; 83+ messages in thread
From: Christoph Hellwig @ 2018-12-05 14:09 UTC (permalink / raw)
To: Dave Chinner
Cc: Christoph Hellwig, linux-fsdevel, linux-xfs, olga.kornievskaia,
linux-nfs, linux-unionfs, ceph-devel, linux-cifs
On Wed, Dec 05, 2018 at 09:18:47AM +1100, Dave Chinner wrote:
> I'll drop it if you want, but then I'll just have to come back after
> all the NFS code is merged and do yet more cleanup work.
IFF we want these NFS "features" we'll have to get it right before
merging the code. But even with that I'd rather fix the glaring
issues you are fixing in your first patches as a priority before
adding more features. In other words: don't worry about NFS, lets
get the existing code right before worrying about the next round
of potential issues.
^permalinkrawreply [flat|nested] 83+ messages in thread

*Re: [PATCH 01/11] vfs: copy_file_range source range over EOF should fail
2018-12-04 22:31 ` Dave Chinner@ 2018-12-05 16:51 ` bfields
2019-05-20 9:10 ` Amir Goldstein1 sibling, 0 replies; 83+ messages in thread
From: bfields @ 2018-12-05 16:51 UTC (permalink / raw)
To: Dave Chinner
Cc: Olga Kornievskaia, Christoph Hellwig, Amir Goldstein,
linux-fsdevel, linux-xfs, linux-nfs, linux-unionfs, ceph-devel,
linux-cifs
On Wed, Dec 05, 2018 at 09:31:02AM +1100, Dave Chinner wrote:
> That's what the code does right now and *exactly what I'm trying to
> fix* because it EINVAL is ambiguous and not an indicator that we've
> reached the end of the source file. EINVAL can indicate several
> different errors, so it really has to be treated as a "copy failed"
> error by applications.
>
> Have a look at read/pread() - they return 0 in this case to indicate
> a short read, and the value of zero is explicitly defined as meaning
> "read position is beyond EOF". Applications know straight away that
> there is no more data to be read and there was no error, so can
> terminate on a successful short read.
>
> We need to allow applications to terminate copy loops on a
> successful short copy.
I'm a little confused by your definition of "short copy" and "short
read". Are you using that to mean a copy/read that returns zero? I
usually see it used to mean any successful call that returned less than
the requested amount. I'd expect a zero return to terminate a copy
loop, but not any positive return.
--b.
> IOWs, applications need to either:
>
> - get an immediate error saying the range is invalid rather
> than doing a short copy (as per the man page); or
> - have an explicit marker to say "no more data to be copied"
>
> Applications need the "no more data to copy" case to be explicit and
> unambiguous so they can make sane decisions about whether a short
> copy was successful because the file was shorter than expected or
> whether a short copy was a result of a real error being encountered.
> The current behaviour is largely unusable for applications because
> they have to guess at the reason for EINVAL part way through a
> copy....
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
^permalinkrawreply [flat|nested] 83+ messages in thread

*Re: [PATCH 09/11] vfs: push copy_file_ranges -EXDEV checks down
2018-12-05 14:09 ` Christoph Hellwig@ 2018-12-05 17:01 ` Olga Kornievskaia0 siblings, 0 replies; 83+ messages in thread
From: Olga Kornievskaia @ 2018-12-05 17:01 UTC (permalink / raw)
To: Christoph Hellwig
Cc: david, linux-fsdevel, linux-xfs, linux-nfs, linux-unionfs,
ceph-devel, linux-cifs
On Wed, Dec 5, 2018 at 9:09 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Wed, Dec 05, 2018 at 09:18:47AM +1100, Dave Chinner wrote:
> > I'll drop it if you want, but then I'll just have to come back after
> > all the NFS code is merged and do yet more cleanup work.
>
> IFF we want these NFS "features" we'll have to get it right before
> merging the code. But even with that I'd rather fix the glaring
> issues you are fixing in your first patches as a priority before
> adding more features. In other words: don't worry about NFS, lets
> get the existing code right before worrying about the next round
> of potential issues.
Dave,
Do you mind in v2 removing the 'retry, ret=EAGAIN' piece and leave the
call to the nfs42_copy_file_range() (with the superblock block check)?
If not, I could provide the patch.
This is a piece of code that got in as a part of the async copy
patches and it was meant for the upcoming server-to-server series.
This code will go right back in with the next series. But since dead
piece of code is glaring wrong currently by all means let's fix it.
Thank you.
^permalinkrawreply [flat|nested] 83+ messages in thread

*Re: [PATCH 05/11] vfs: use inode_permission in copy_file_range()
2018-12-03 23:55 ` Dave Chinner@ 2018-12-05 17:28 ` bfields0 siblings, 0 replies; 83+ messages in thread
From: bfields @ 2018-12-05 17:28 UTC (permalink / raw)
To: Dave Chinner
Cc: Darrick J. Wong, linux-fsdevel, linux-xfs, olga.kornievskaia,
linux-nfs, linux-unionfs, ceph-devel, linux-cifs
On Tue, Dec 04, 2018 at 10:55:17AM +1100, Dave Chinner wrote:
> On Mon, Dec 03, 2018 at 10:18:03AM -0800, Darrick J. Wong wrote:
> > On Mon, Dec 03, 2018 at 07:34:10PM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > >
> > > Similar to FI_DEDUPERANGE, make copy_file_range() check that we have
> >
> > TLDR: No, it's not similar to FIDEDUPERANGE -- the use of
> > inode_permission() in allow_file_dedupe() is to enable callers to dedupe
> > into a file for which the caller has write permissions but opened the
> > file O_RDONLY.
>
> What a grotty, nasty hack.
>
> > [Please keep reading...]
> >
> > > write permissions to the destination inode.
> > >
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > > mm/filemap.c | 5 +++++
> > > 1 file changed, 5 insertions(+)
> > >
> > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > index 0a170425935b..876df5275514 100644
> > > --- a/mm/filemap.c
> > > +++ b/mm/filemap.c
> > > @@ -3013,6 +3013,11 @@ int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
> > > (file_out->f_flags & O_APPEND))
> > > return -EBADF;
> > >
> > > + /* may sure we really are allowed to write to the destination inode */
> > > + ret = inode_permission(inode_out, MAY_WRITE);
> >
> > What's the difference between security_file_permission and
> > inode_permission, and when do we call them for a regular
> > open-write-close sequence? Hmmm, let me take a look:
> .....
> > We also cannot dedupe into a file that becomes immutable after we open
> > it for write, but we can dedupe into a file that loses its write
> > permissions after we open it.
>
> It's more nuanced than that - dedupe will proceed after write
> permissions have been removed only if you are root or own the file,
> otherwise it will fail.
>
> Updated summary:
>
> > op: after +immutable? after chmod a-w?
> > write yes yes
> > clonerange no yes
> > dedupe no maybe
> > newcopyrange no no
> >
> > My reaction: I don't think that writes should be allowed after an
> > administrator marks a file immutable (but that's a separate issue) but I
> > do think we should be consistent in allowing copying into a file that
> > has lost its write permissions after we opened the file for write, like
> > we do for write() and the remap ioct....
>
> If we want to allow copying to files we don't actually have
> permission to write to anymore, then I'll remove this from the test,
> the man page and the code. But, quite frankly, I don't trust remote
> server side copies to follow the same permission models as the
> client side OS, so I think we have to treat copy_file_range
> differently to a normal write syscall....
The NFS COPY command takes references to the protocol's equivalent to
open files, and I'd expect permission checks should depend on the open
mode, not the current file permissions.
But server behavior may vary. I'm not sure that's a good guide for what
to do locally.
In general I'm more comfortable the closer copy is to read & write.
--b.
^permalinkrawreply [flat|nested] 83+ messages in thread

*Re: [PATCH 01/11] vfs: copy_file_range source range over EOF should fail
2018-12-04 22:31 ` Dave Chinner
2018-12-05 16:51 ` bfields@ 2019-05-20 9:10 ` Amir Goldstein
2019-05-20 13:12 ` Olga Kornievskaia1 sibling, 1 reply; 83+ messages in thread
From: Amir Goldstein @ 2019-05-20 9:10 UTC (permalink / raw)
To: Dave Chinner
Cc: Olga Kornievskaia, Christoph Hellwig, linux-fsdevel, linux-xfs,
linux-nfs, overlayfs, ceph-devel, CIFS
On Wed, Dec 5, 2018 at 12:31 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Tue, Dec 04, 2018 at 04:47:18PM -0500, Olga Kornievskaia wrote:
> > On Tue, Dec 4, 2018 at 4:35 PM Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > On Tue, Dec 04, 2018 at 07:13:32AM -0800, Christoph Hellwig wrote:
> > > > On Mon, Dec 03, 2018 at 02:46:20PM +0200, Amir Goldstein wrote:
> > > > > > From: Dave Chinner <dchinner@redhat.com>
> > > > > >
> > > > > > The man page says:
> > > > > >
> > > > > > EINVAL Requested range extends beyond the end of the source file
> > > > > >
> > > > > > But the current behaviour is that copy_file_range does a short
> > > > > > copy up to the source file EOF. Fix the kernel behaviour to match
> > > > > > the behaviour described in the man page.
> > > >
> > > > I think the behavior implemented is a lot more useful than the one
> > > > documented..
> > >
> > > The current behaviour is really nasty. Because copy_file_range() can
> > > return short copies, the caller has to implement a loop to ensure
> > > the range hey want get copied. When the source range you are
> > > trying to copy overlaps source EOF, this loop:
> > >
> > > while (len > 0) {
> > > ret = copy_file_range(... len ...)
> > > ...
> > > off_in += ret;
> > > off_out += ret;
> > > len -= ret;
> > > }
> > >
> > > Currently the fallback code copies up to the end of the source file
> > > on the first copy and then fails the second copy with EINVAL because
> > > the source range is now completely beyond EOF.
> > >
> > > So, from an application perspective, did the copy succeed or did it
> > > fail?
> > >
> > > Existing tools that exercise copy_file_range (like xfs_io) consider
> > > this a failure, because the second copy_file_range() call returns
> > > EINVAL and not some "there is no more to copy" marker like read()
> > > returning 0 bytes when attempting to read beyond EOF.
> > >
> > > IOWs, we cannot tell the difference between a real error and a short
> > > copy because the input range spans EOF and it was silently
> > > shortened. That's the API problem we need to fix here - the existing
> > > behaviour is really crappy for applications. Erroring out
> > > immmediately is one solution, and it's what the man page says should
> > > happen so that is what I implemented.
> > >
> > > Realistically, though, I think an attempt to read beyond EOF for the
> > > copy should result in behaviour like read() (i.e. return 0 bytes),
> > > not EINVAL. The existing behaviour needs to change, though.
> >
> > There are two checks to consider
> > 1. pos_in >= EOF should return EINVAL
> > 2. however what's perhaps should be relaxed is pos_in+len >= EOF
> > should return a short copy.
> >
> > Having check#1 enforced allows to us to differentiate between a real
> > error and a short copy.
>
> That's what the code does right now and *exactly what I'm trying to
> fix* because it EINVAL is ambiguous and not an indicator that we've
> reached the end of the source file. EINVAL can indicate several
> different errors, so it really has to be treated as a "copy failed"
> error by applications.
>
> Have a look at read/pread() - they return 0 in this case to indicate
> a short read, and the value of zero is explicitly defined as meaning
> "read position is beyond EOF". Applications know straight away that
> there is no more data to be read and there was no error, so can
> terminate on a successful short read.
>
> We need to allow applications to terminate copy loops on a
> successful short copy. IOWs, applications need to either:
>
> - get an immediate error saying the range is invalid rather
> than doing a short copy (as per the man page); or
> - have an explicit marker to say "no more data to be copied"
>
> Applications need the "no more data to copy" case to be explicit and
> unambiguous so they can make sane decisions about whether a short
> copy was successful because the file was shorter than expected or
> whether a short copy was a result of a real error being encountered.
> The current behaviour is largely unusable for applications because
> they have to guess at the reason for EINVAL part way through a
> copy....
>
Dave,
I went a head and implemented the desired behavior.
However, while testing I observed that the desired behavior is already
the existing behavior. For example, trying to copy 10 bytes from a 2 bytes file,
xfs_io copy loop ends as expected:
copy_file_range(4, [0], 3, [0], 10, 0) = 2
copy_file_range(4, [2], 3, [2], 8, 0) = 0
This was tested on ext4 and xfs with reflink on recent kernel as well as on
v4.20-rc1 (era of original patch set).
Where and how did you observe the EINVAL behavior described above?
(besides man page that is). There are even xfstests (which you modified)
that verify the return 0 for past EOF behavior.
For now, I am just dropping this patch from the patch series.
Let me know if I am missing something.
Thanks,
Amir.
^permalinkrawreply [flat|nested] 83+ messages in thread

*Re: [PATCH 01/11] vfs: copy_file_range source range over EOF should fail
2019-05-20 9:10 ` Amir Goldstein@ 2019-05-20 13:12 ` Olga Kornievskaia
2019-05-20 13:36 ` Amir Goldstein0 siblings, 1 reply; 83+ messages in thread
From: Olga Kornievskaia @ 2019-05-20 13:12 UTC (permalink / raw)
To: Amir Goldstein
Cc: Dave Chinner, Christoph Hellwig, linux-fsdevel, linux-xfs,
linux-nfs, overlayfs, ceph-devel, CIFS
On Mon, May 20, 2019 at 5:10 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Wed, Dec 5, 2018 at 12:31 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Tue, Dec 04, 2018 at 04:47:18PM -0500, Olga Kornievskaia wrote:
> > > On Tue, Dec 4, 2018 at 4:35 PM Dave Chinner <david@fromorbit.com> wrote:
> > > >
> > > > On Tue, Dec 04, 2018 at 07:13:32AM -0800, Christoph Hellwig wrote:
> > > > > On Mon, Dec 03, 2018 at 02:46:20PM +0200, Amir Goldstein wrote:
> > > > > > > From: Dave Chinner <dchinner@redhat.com>
> > > > > > >
> > > > > > > The man page says:
> > > > > > >
> > > > > > > EINVAL Requested range extends beyond the end of the source file
> > > > > > >
> > > > > > > But the current behaviour is that copy_file_range does a short
> > > > > > > copy up to the source file EOF. Fix the kernel behaviour to match
> > > > > > > the behaviour described in the man page.
> > > > >
> > > > > I think the behavior implemented is a lot more useful than the one
> > > > > documented..
> > > >
> > > > The current behaviour is really nasty. Because copy_file_range() can
> > > > return short copies, the caller has to implement a loop to ensure
> > > > the range hey want get copied. When the source range you are
> > > > trying to copy overlaps source EOF, this loop:
> > > >
> > > > while (len > 0) {
> > > > ret = copy_file_range(... len ...)
> > > > ...
> > > > off_in += ret;
> > > > off_out += ret;
> > > > len -= ret;
> > > > }
> > > >
> > > > Currently the fallback code copies up to the end of the source file
> > > > on the first copy and then fails the second copy with EINVAL because
> > > > the source range is now completely beyond EOF.
> > > >
> > > > So, from an application perspective, did the copy succeed or did it
> > > > fail?
> > > >
> > > > Existing tools that exercise copy_file_range (like xfs_io) consider
> > > > this a failure, because the second copy_file_range() call returns
> > > > EINVAL and not some "there is no more to copy" marker like read()
> > > > returning 0 bytes when attempting to read beyond EOF.
> > > >
> > > > IOWs, we cannot tell the difference between a real error and a short
> > > > copy because the input range spans EOF and it was silently
> > > > shortened. That's the API problem we need to fix here - the existing
> > > > behaviour is really crappy for applications. Erroring out
> > > > immmediately is one solution, and it's what the man page says should
> > > > happen so that is what I implemented.
> > > >
> > > > Realistically, though, I think an attempt to read beyond EOF for the
> > > > copy should result in behaviour like read() (i.e. return 0 bytes),
> > > > not EINVAL. The existing behaviour needs to change, though.
> > >
> > > There are two checks to consider
> > > 1. pos_in >= EOF should return EINVAL
> > > 2. however what's perhaps should be relaxed is pos_in+len >= EOF
> > > should return a short copy.
> > >
> > > Having check#1 enforced allows to us to differentiate between a real
> > > error and a short copy.
> >
> > That's what the code does right now and *exactly what I'm trying to
> > fix* because it EINVAL is ambiguous and not an indicator that we've
> > reached the end of the source file. EINVAL can indicate several
> > different errors, so it really has to be treated as a "copy failed"
> > error by applications.
> >
> > Have a look at read/pread() - they return 0 in this case to indicate
> > a short read, and the value of zero is explicitly defined as meaning
> > "read position is beyond EOF". Applications know straight away that
> > there is no more data to be read and there was no error, so can
> > terminate on a successful short read.
> >
> > We need to allow applications to terminate copy loops on a
> > successful short copy. IOWs, applications need to either:
> >
> > - get an immediate error saying the range is invalid rather
> > than doing a short copy (as per the man page); or
> > - have an explicit marker to say "no more data to be copied"
> >
> > Applications need the "no more data to copy" case to be explicit and
> > unambiguous so they can make sane decisions about whether a short
> > copy was successful because the file was shorter than expected or
> > whether a short copy was a result of a real error being encountered.
> > The current behaviour is largely unusable for applications because
> > they have to guess at the reason for EINVAL part way through a
> > copy....
> >
>
> Dave,
>
> I went a head and implemented the desired behavior.
> However, while testing I observed that the desired behavior is already
> the existing behavior. For example, trying to copy 10 bytes from a 2 bytes file,
> xfs_io copy loop ends as expected:
> copy_file_range(4, [0], 3, [0], 10, 0) = 2
> copy_file_range(4, [2], 3, [2], 8, 0) = 0
>
> This was tested on ext4 and xfs with reflink on recent kernel as well as on
> v4.20-rc1 (era of original patch set).
>
> Where and how did you observe the EINVAL behavior described above?
> (besides man page that is). There are even xfstests (which you modified)
> that verify the return 0 for past EOF behavior.
>
> For now, I am just dropping this patch from the patch series.
> Let me know if I am missing something.
The was fixing inconsistency in what the man page specified (ie., it
must fail with EINVAL if offsets are out of range) which was never
enforced by the code. The patch then could be to fix the existing
semantics (man page) of the system call.
Copy file range range is not only read and write but rather
lseek+read+write and if somebody specifies an incorrect offset to the
lseek the system call should fail. Thus I still think that copy file
range should enforce that specifying a source offset beyond the end of
the file should fail with EINVAL.
If the copy file range returned 0 bytes does it mean it's a stopping
condition, not according to the current semantics.
^permalinkrawreply [flat|nested] 83+ messages in thread