On Sat, Nov 02, 2013 at 07:45:30PM -0400, Theodore Ts'o wrote:
> Add the ability for a user who has write access to a file to issue a> discard request for blocks belonging to a file. This can be done via> a new flag to the FIEMAP ioctl, or via the BLKDISCARD ioctl (which> previously only worked on block devices).> > Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
FIEMAP is not the correct interface for data modifying operations.
It is an interface that returns information about file metadata (i.e
the layout of a file) - it is not an interface for modifying the
contents of the file.
fallocate() is the interface used to modify file layout and
manipulate the data within it and hence, IMO, is much better aligned
to this operation. And to tell the truth, I'd much prefer such an
interface is guaranteed to return zeros to users rather than rely on
whether the underlying device supports discard or whatever they
return after a discard. i.e. if the user is asking to destroy the
data in the file, we better be able to ensure the data in the file
is in a known state at the filesystem level regardless of the
underlying storage capabilities....
For example, XFS already has a XFS_IOC_ZERO operation which is used
to return allocated blocks to unwritten state (i.e. always return
zeros) and could easily be extended to issue discards on the blocks
if the underlying device supports discards-to-zero or WRITE_SAME.
I've previously proposed this interface for fallocate() - old patch
here:
http://permalink.gmane.org/gmane.linux.file-systems/62449
Cheers,
Dave.

On Mon, Nov 04, 2013 at 10:14:42AM +1100, Dave Chinner wrote:
> > FIEMAP is not the correct interface for data modifying operations.> It is an interface that returns information about file metadata (i.e> the layout of a file) - it is not an interface for modifying the> contents of the file.
Well, it's been argued that FIEMAP was designed to be extensible, and
we should use it for other operations. Where the bounds of that
stretches to is certainly an arguable point, and I can understand your
observation that something which causes a change to the contents of
the file might not be best choice.
> fallocate() is the interface used to modify file layout and> manipulate the data within it and hence, IMO, is much better aligned> to this operation. And to tell the truth, I'd much prefer such an> interface is guaranteed to return zeros to users rather than rely on> whether the underlying device supports discard or whatever they> return after a discard. i.e. if the user is asking to destroy the> data in the file, we better be able to ensure the data in the file> is in a known state at the filesystem level regardless of the> underlying storage capabilities....
There are two different things that a user might want to do:
* write zeros
* signal to the flash that the contents of the file are not needed
(There's also a "secure discard" which recently got added which
apparently adds a guarantee that for certain storage devices, all of
the previous locations on the flash which had been mapped by the FTL
would also get discarded --- from what I can tell this was some kind
of eMMC thing, but I'll ignore this for now.)
The specific feature request that I was given was to be a file-level
equivalent of the BLKDISCARD ioctl --- and in the patch I added
support for the BLKDISCARD ioctl to be applied to files, since that
was the desired request.
For other use cases, I agree that a file-level equivalent to the
BLKZEROOUT ioctl might be more appropriate --- but that's not what
would be most useful for the particular user application that I'm
trying to support. (In fact, we're using flash where BLKDISCARDZEROS
returns false, since the discard might be a no-op under certain power
fail scenarios, since the flash doesn't force a stable write of the
FTL metadata when it receives a discard command, for performance
reasons.)
The reason why I found FIEMAP to more convenient from an
implementation perspective is that unlike fallocate(), we're not
actually modifying the logical->physical block map of the file. We
just need to be able to ask the file system to iterate over the
extents in a particular logical block range. And FIEMAP does that,
most conveniently, where as if we used fallocate(), each file system
would have to implement the code to issue the discard for the relevant
extent ranges in fs specific code --- or we would have to reimplement
the FIEMAP machinery in a more general fashion so that we could call
sb_issue_discard (or sb_issue_zeroout) from the VFS layer, after
receiving the extent ranges from the fs-specific layer.
Regards,
- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

On Sat, Nov 02, 2013 at 07:45:30PM -0400, Theodore Ts'o wrote:
> Add the ability for a user who has write access to a file to issue a> discard request for blocks belonging to a file. This can be done via> a new flag to the FIEMAP ioctl, or via the BLKDISCARD ioctl (which> previously only worked on block devices).
As Dave already pointed out overloading a data manipulating operation
over FIEMAP is a no-go.
Besides that I really miss an explanation what the intended use cases
are. What does this buy us over punching a hole on an actual real
workload? Where is the overhead? Is it our shitty discard
implementation? If so there's tons of low hanging fruit to fix there
anyway that we shouldn't work around by interfaces taking shortcuts.
Is it problems in ext4's extent management on hole punch? Is the
bit of metadata created when doing an actual hole punch too much for
that very specific workload?
Also you really need to restrict this to devices that set the
discard_zeroes_data flag, without that you'll expose random
uninitialized data.
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

On Sun, Nov 03, 2013 at 06:42:00PM -0500, Theodore Ts'o wrote:
> On Mon, Nov 04, 2013 at 10:14:42AM +1100, Dave Chinner wrote:> > > > FIEMAP is not the correct interface for data modifying operations.> > It is an interface that returns information about file metadata (i.e> > the layout of a file) - it is not an interface for modifying the> > contents of the file.> > Well, it's been argued that FIEMAP was designed to be extensible, and> we should use it for other operations. Where the bounds of that> stretches to is certainly an arguable point, and I can understand your> observation that something which causes a change to the contents of> the file might not be best choice.
Yes, I agree it is extensible, but it was never intended to be
extended into a data modification ioctl. It's for querying metadata
related to file data, not for modifying file data.
> > fallocate() is the interface used to modify file layout and> > manipulate the data within it and hence, IMO, is much better aligned> > to this operation. And to tell the truth, I'd much prefer such an> > interface is guaranteed to return zeros to users rather than rely on> > whether the underlying device supports discard or whatever they> > return after a discard. i.e. if the user is asking to destroy the> > data in the file, we better be able to ensure the data in the file> > is in a known state at the filesystem level regardless of the> > underlying storage capabilities....> > There are two different things that a user might want to do:> > * write zeros> * signal to the flash that the contents of the file are not needed
What the user is really saying is "we don't need the contents of
this file anymore, but we want to keep it allocated for future use".
What you are doing is optimising hardware behaviour based on the
fact the user says "we don't need the data anymore". That's an
implementation goal, but that doesn't necessarily mean that it's the
right API.
From a user API perspective, we want either the old contents remain
in the file, or if we modify the contents we guarantee that we don't
expose stale data from the underlying storage. If the hardware
cannot guarantee no stale data exposure, then the filesystem needs
to guarantee that. Hence the "convert to zeros" API specification -
we *guarantee at the API level* the behaviour the user can rely on.
> (There's also a "secure discard" which recently got added which> apparently adds a guarantee that for certain storage devices, all of> the previous locations on the flash which had been mapped by the FTL> would also get discarded --- from what I can tell this was some kind> of eMMC thing, but I'll ignore this for now.)> > The specific feature request that I was given was to be a file-level> equivalent of the BLKDISCARD ioctl --- and in the patch I added> support for the BLKDISCARD ioctl to be applied to files, since that> was the desired request.
Sure, but we don't leave underfined content in files that non-root
users are allowed to access. BLKDISCARD can leave undefined content
in the region of the block device that is being discarded, but we
don't expose that directly to users and so that isn't an issue at
all that BLKDISCARD needs to worry about.
> For other use cases, I agree that a file-level equivalent to the> BLKZEROOUT ioctl might be more appropriate --- but that's not what> would be most useful for the particular user application that I'm> trying to support. (In fact, we're using flash where BLKDISCARDZEROS> returns false, since the discard might be a no-op under certain power> fail scenarios, since the flash doesn't force a stable write of the> FTL metadata when it receives a discard command, for performance> reasons.)
Yup, another reason why for normal users that aren't google we have
layers of protection against stale data exposure.
Google already has a flag in the fallocate API for saying "stale
data is acceptible" (i.e. FALLOC_FL_NO_HIDE_STALE) for telling the
filesystem that you don't care about stale data exposure and so
zeroing via any fallocate operation (be it preallocation, hole
punching, etc) doesn't need to be robust. Hence you already have a
method of working around any sort of strict API requirements we
decide on for guaranteeing the contents of the file after a
FALLOC_FL_ZERO_RANGE operation...
> The reason why I found FIEMAP to more convenient from an> implementation perspective is that unlike fallocate(), we're not> actually modifying the logical->physical block map of the file. We> just need to be able to ask the file system to iterate over the> extents in a particular logical block range.
Which I think is polluting the API specification with implementation
details. As it is, fallocate() implementations already have to
iterate the extents in the range that is being modified in some way,
so I don't think that extent iteration is really a problem for
anyone. Also, it's definitely not a reason for chosing FIEMAP as a
user API because we can use fiemap internally in the kernel for
purposes other than implementing FIEMAP ioctls....
> And FIEMAP does that,> most conveniently, where as if we used fallocate(), each file system> would have to implement the code to issue the discard for the relevant> extent ranges in fs specific code --- or we would have to reimplement> the FIEMAP machinery in a more general fashion so that we could call> sb_issue_discard (or sb_issue_zeroout) from the VFS layer, after> receiving the extent ranges from the fs-specific layer.
Actually, sb_issue_zeroout() is unusuable for acceleration purposes
for XFS because if bdev_write_same() fails or discard fails because
of granularity/alignment mismatches, then we want to fall back to
conversion to unwritten extents rather than manual zero-out like
sb_issue_zeroout() does. Hence we can't use any sort of VFS-based
extent iteration and discard/zeroing like you are proposing, and
that brings us back to needing per-filesystem implementations here.
Cheers,
Dave.

On Nov 4, 2013, at 2:57 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Sun, Nov 03, 2013 at 06:42:00PM -0500, Theodore Ts'o wrote:>> On Mon, Nov 04, 2013 at 10:14:42AM +1100, Dave Chinner wrote:>>> >>> FIEMAP is not the correct interface for data modifying operations.>>> It is an interface that returns information about file metadata (i.e>>> the layout of a file) - it is not an interface for modifying the>>> contents of the file.>> >> Well, it's been argued that FIEMAP was designed to be extensible, and>> we should use it for other operations. Where the bounds of that>> stretches to is certainly an arguable point, and I can understand your>> observation that something which causes a change to the contents of>> the file might not be best choice.> > Yes, I agree it is extensible, but it was never intended to be> extended into a data modification ioctl. It's for querying metadata> related to file data, not for modifying file data.
I’d agree with Dave here - there is nothing about FIEMAP that indicates
it is an interface for data modification, only for reporting out the
underlying layout of a file. I think falloc() is already used in this
regard for hole punching, and it makes more sense to me to extend that
for the discard interface as well.
Cheers, Andreas
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

On Mon, Nov 04, 2013 at 02:03:43AM -0800, Christoph Hellwig wrote:
> > Besides that I really miss an explanation what the intended use cases> are. What does this buy us over punching a hole on an actual real> workload? Where is the overhead? Is it our shitty discard> implementation? If so there's tons of low hanging fruit to fix there> anyway that we shouldn't work around by interfaces taking shortcuts.> Is it problems in ext4's extent management on hole punch? Is the> bit of metadata created when doing an actual hole punch too much for> that very specific workload?
The an application in question wants to treat a large file as if it
were a block device --- that's hardly unprecedented; enterprise
databases tend to prefer using raw block devices (at least for
benchmarking purposes), but system administrators like to
administrative convenience of using a file system.
The goal here is get the performace as close to a raw block device as
possible. Especially if you are using fast flash, the overhead of
deallocating blocks using punch, only to reallocate the blocks when we
later write into them, is just unnecessary overhead. Also, if you
deallocate the blocks, they could end up getting grabbed by some other
block allocation, which means the file can end up getting very
fragmented --- which doesn't matter that much for flash, I suppose,
but it means the extent tree could end up growing and getting nasty
over time. The bottom line is why bother doing extra work when it's
not necessary?
If people don't like exposing this via FIEMAP, we could just simply
make it available via the BLKDISCARD ioctl, which already exists
(although currently it is only implemented for block devices).
- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

On Mon, Nov 04, 2013 at 07:51:46PM -0500, Theodore Ts'o wrote:
> The an application in question wants to treat a large file as if it> were a block device --- that's hardly unprecedented; enterprise> databases tend to prefer using raw block devices (at least for> benchmarking purposes), but system administrators like to> administrative convenience of using a file system.
Totally reasonable use case.
> > The goal here is get the performace as close to a raw block device as> possible. Especially if you are using fast flash, the overhead of> deallocating blocks using punch, only to reallocate the blocks when we> later write into them, is just unnecessary overhead. Also, if you> deallocate the blocks, they could end up getting grabbed by some other> block allocation, which means the file can end up getting very> fragmented --- which doesn't matter that much for flash, I suppose,> but it means the extent tree could end up growing and getting nasty> over time. The bottom line is why bother doing extra work when it's> not necessary?
Now we're getting into trouble. I'm all for optimizing for a use case
someone cares for. But exposing intimate implementation of that use
case is almost always a bad idea.
So having a new fallocate to zero out parts of a file and not requiring
an allocation to back the file is fine. If it is on a filesystem
supporting discards with the discard zeroes blocks flag we can use the
implementation from your patch. If the device doesn't support discards
or doesn't zero them we'd need to implement it like the
XFS_IOC_ZERO_RANGE ioctl.
Note that exposing stale blocks is a problem at the block device level,
too. If you look at the openstack volume service for example they have
to explicitly zero out volumes during volume creation or deletion to
make sure no data is exposed to another tenant. The only way to
avoid that is to have some auto-zeroing extent state either in software
or hardware.
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

On Mon, Nov 04, 2013 at 07:51:46PM -0500, Theodore Ts'o wrote:
> On Mon, Nov 04, 2013 at 02:03:43AM -0800, Christoph Hellwig wrote:> > > > Besides that I really miss an explanation what the intended use cases> > are. What does this buy us over punching a hole on an actual real> > workload? Where is the overhead? Is it our shitty discard> > implementation? If so there's tons of low hanging fruit to fix there> > anyway that we shouldn't work around by interfaces taking shortcuts.> > Is it problems in ext4's extent management on hole punch? Is the> > bit of metadata created when doing an actual hole punch too much for> > that very specific workload?> > The an application in question wants to treat a large file as if it> were a block device --- that's hardly unprecedented; enterprise> databases tend to prefer using raw block devices (at least for> benchmarking purposes), but system administrators like to> administrative convenience of using a file system.> > The goal here is get the performace as close to a raw block device as> possible. Especially if you are using fast flash, the overhead of> deallocating blocks using punch, only to reallocate the blocks when we> later write into them, is just unnecessary overhead. Also, if you> deallocate the blocks, they could end up getting grabbed by some other> block allocation, which means the file can end up getting very> fragmented --- which doesn't matter that much for flash, I suppose,> but it means the extent tree could end up growing and getting nasty> over time. The bottom line is why bother doing extra work when it's> not necessary?
commit 447223520520b17d3b6d0631aa4838fbaf8eddb4
Author: Dave Chinner <dchinner@redhat.com>
Date: Tue Aug 24 12:02:11 2010 +1000
xfs: Introduce XFS_IOC_ZERO_RANGE
XFS_IOC_ZERO_RANGE is the equivalent of an atomic XFS_IOC_UNRESVSP/
XFS_IOC_RESVSP call pair. It enabled ranges of written data to be
turned into zeroes without requiring IO or having to free and
reallocate the extents in the range given as would occur if we had
to punch and then preallocate them separately. This enables
applications to zero parts of files very quickly without changing
the layout of the files in any way.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
Sounds pretty much like the same reasons to me. IOWs, we did this
more than 3 years ago on XFS but discard was not a requirement for
the cloudy guy who asked for it so we simply didn't implement it.
I agree that per-file discard is useful, but it needs to have well
defined byte-range semantics and prevent stale data exposure to
unprivileged users. FALLOC_FL_ZERO_RANGE has those semantics, while
FALLOC_FL_ZERO_RANGE|FALLOC_FL_NO_HIDE_STALE gives google the
trigger to issue discards without caring about what you get back
from the discards....
Cheers,
Dave.