Comments

On Oct 24, 2008 19:09 +0900, Akira Fujita wrote:
> The EXT4_IOC_FIEMAP_INO is used to get extents information of> inode which set to ioctl.> The defragger uses this ioctl to check the fragment condition> and to get extents information in the specified block group.
Instead of having a separate IOC number for each such ioctl, instead
we implemented EXT4_IOC_WRAPPER, which is an root-specific ioctl that
passes in an inode number and a second IOC number so that arbitrary file
ioctls can be run on any inode by root.
This was mentioned last time these patches were posted, but there was
no reply from you. Christoph suggested a more generic VFS open-by-inum,
which isn't impossible to do but would cause a lot of controversy I
think, while the EXT4_IOC_WRAPPER is at least contained within ext4,
but is more generically useful than EXT4_IOC_FIEMAP_INO.
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

On Sun, Oct 26, 2008 at 02:40:48AM -0600, Andreas Dilger wrote:
> This was mentioned last time these patches were posted, but there was> no reply from you. Christoph suggested a more generic VFS open-by-inum,> which isn't impossible to do but would cause a lot of controversy I> think, while the EXT4_IOC_WRAPPER is at least contained within ext4,> but is more generically useful than EXT4_IOC_FIEMAP_INO.
I'll hack up a generic open_by_handle and then we can gather the
reaction - it shouldn't be more than about one or two hundred lines of
code. Note that you really want an open by handle and not just inum for
a defragmentation tool - without the generation you can easily run into
races.
Btw, any reason the XFS approach of passing in filehandles for both
the inode to be defragmented and the "donor" inode doesn't work for you?
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

On Sun, Oct 26, 2008 at 04:48:48AM -0400, Christoph Hellwig wrote:
> Btw, any reason the XFS approach of passing in filehandles for both> the inode to be defragmented and the "donor" inode doesn't work for you?
Sorry should be file descriptor and not filehandle above.
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Hi Andreas,
Andreas Dilger wrote:
> On Oct 24, 2008 19:09 +0900, Akira Fujita wrote:>> The EXT4_IOC_FIEMAP_INO is used to get extents information of>> inode which set to ioctl.>> The defragger uses this ioctl to check the fragment condition>> and to get extents information in the specified block group.> > Instead of having a separate IOC number for each such ioctl, instead> we implemented EXT4_IOC_WRAPPER, which is an root-specific ioctl that> passes in an inode number and a second IOC number so that arbitrary file> ioctls can be run on any inode by root.
The EXT4_IOC_WRAPPER ioctl seems to be usuful for many situations.
But the EXT4_IOC_FIEMAP_INO ioctl is used not only root user but also
non-root user to call fiemap,
so we cannot use the current EXT4_IOC_WRAPPER ioctl for defrag.
> This was mentioned last time these patches were posted, but there was> no reply from you. Christoph suggested a more generic VFS open-by-inum,> which isn't impossible to do but would cause a lot of controversy I> think, while the EXT4_IOC_WRAPPER is at least contained within ext4,> but is more generically useful than EXT4_IOC_FIEMAP_INO.>
Do you plan to add EXT4_IOC_WRAPPER into the ext4 patch queue?
Regards,
Akira Fujita
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

On Oct 27, 2008 19:21 +0900, Akira Fujita wrote:
> Andreas Dilger wrote:>> On Oct 24, 2008 19:09 +0900, Akira Fujita wrote:>>> The EXT4_IOC_FIEMAP_INO is used to get extents information of>>> inode which set to ioctl.>>> The defragger uses this ioctl to check the fragment condition>>> and to get extents information in the specified block group.>>>> Instead of having a separate IOC number for each such ioctl, instead>> we implemented EXT4_IOC_WRAPPER, which is an root-specific ioctl that>> passes in an inode number and a second IOC number so that arbitrary file>> ioctls can be run on any inode by root.>> The EXT4_IOC_WRAPPER ioctl seems to be usuful for many situations.> But the EXT4_IOC_FIEMAP_INO ioctl is used not only root user but also> non-root user to call fiemap,> so we cannot use the current EXT4_IOC_WRAPPER ioctl for defrag.
Why does a regular user need to do the ioctl on a file that it may not
have read permission to access? I can see this is useful for root
doing a defrag of the whole filesystem instead of opening and closing
all of the files, but for regular users we need to validate via the
full path to ensure they can even access the file before defragmenting it.
>> This was mentioned last time these patches were posted, but there was>> no reply from you. Christoph suggested a more generic VFS open-by-inum,>> which isn't impossible to do but would cause a lot of controversy I>> think, while the EXT4_IOC_WRAPPER is at least contained within ext4,>> but is more generically useful than EXT4_IOC_FIEMAP_INO.>> Do you plan to add EXT4_IOC_WRAPPER into the ext4 patch queue?
If there is interest, yes.
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Andreas Dilger Wrote:
> On Oct 27, 2008 19:21 +0900, Akira Fujita wrote:>> Andreas Dilger wrote:>>> On Oct 24, 2008 19:09 +0900, Akira Fujita wrote:>>>> The EXT4_IOC_FIEMAP_INO is used to get extents information of>>>> inode which set to ioctl.>>>> The defragger uses this ioctl to check the fragment condition>>>> and to get extents information in the specified block group.>>> Instead of having a separate IOC number for each such ioctl, instead>>> we implemented EXT4_IOC_WRAPPER, which is an root-specific ioctl that>>> passes in an inode number and a second IOC number so that arbitrary file>>> ioctls can be run on any inode by root.>> The EXT4_IOC_WRAPPER ioctl seems to be usuful for many situations.>> But the EXT4_IOC_FIEMAP_INO ioctl is used not only root user but also>> non-root user to call fiemap,>> so we cannot use the current EXT4_IOC_WRAPPER ioctl for defrag.> > Why does a regular user need to do the ioctl on a file that it may not> have read permission to access? I can see this is useful for root> doing a defrag of the whole filesystem instead of opening and closing> all of the files, but for regular users we need to validate via the> full path to ensure they can even access the file before defragmenting it.
The FIEMAP_INO ioctl just passes a inode number belongs to
the target block group from user space to kernel space
and then the owner check is done in the kernel space.
If the regular user (defrag -f excecutant) is owner of a file,
defrag handles this file as the candidate of victim file which would be moved
to the other block group to make free space.
So I think the full path check is unneeded because the owner check
is done in the kernel space (I'm not sure it's good enough).
If it's not good in the security point of view,
I will make defrag -f mode be done only by root user.
>>> This was mentioned last time these patches were posted, but there was>>> no reply from you. Christoph suggested a more generic VFS open-by-inum,>>> which isn't impossible to do but would cause a lot of controversy I>>> think, while the EXT4_IOC_WRAPPER is at least contained within ext4,>>> but is more generically useful than EXT4_IOC_FIEMAP_INO.>> Do you plan to add EXT4_IOC_WRAPPER into the ext4 patch queue?> > If there is interest, yes.
How do the other ext4 developers think about
implementing EXT4_IOC_WRAPPER?
Will it be used only for defrag so far?
Regards,
Akira Fujita
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Akira, can you please comment on these issues before going on?
I think the generation issue is a particularly important one if you
want to allow defrag by normal users.
On Sun, Oct 26, 2008 at 04:48:48AM -0400, Christoph Hellwig wrote:
> On Sun, Oct 26, 2008 at 02:40:48AM -0600, Andreas Dilger wrote:> > This was mentioned last time these patches were posted, but there was> > no reply from you. Christoph suggested a more generic VFS open-by-inum,> > which isn't impossible to do but would cause a lot of controversy I> > think, while the EXT4_IOC_WRAPPER is at least contained within ext4,> > but is more generically useful than EXT4_IOC_FIEMAP_INO.> > I'll hack up a generic open_by_handle and then we can gather the> reaction - it shouldn't be more than about one or two hundred lines of> code. Note that you really want an open by handle and not just inum for> a defragmentation tool - without the generation you can easily run into> races.> > Btw, any reason the XFS approach of passing in *file descriptors* for both> the inode to be defragmented and the "donor" inode doesn't work for you?
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

On Oct 31, 2008 18:46 +0900, Akira Fujita wrote:
>> Why does a regular user need to do the ioctl on a file that it may not>> have read permission to access? I can see this is useful for root>> doing a defrag of the whole filesystem instead of opening and closing>> all of the files, but for regular users we need to validate via the>> full path to ensure they can even access the file before defragmenting it.>> The FIEMAP_INO ioctl just passes a inode number belongs to> the target block group from user space to kernel space> and then the owner check is done in the kernel space.>> If the regular user (defrag -f excecutant) is owner of a file,> defrag handles this file as the candidate of victim file which would> be moved to the other block group to make free space.>> So I think the full path check is unneeded because the owner check> is done in the kernel space (I'm not sure it's good enough).> If it's not good in the security point of view,> I will make defrag -f mode be done only by root user.
If the defrag operation is limited to the owner of the file (or root
via CAP_DAC_OVERRIDE) then this is probably OK also. The data never
gets to userspace so there is relatively little risk to this operation.
>>>> This was mentioned last time these patches were posted, but there was>>>> no reply from you. Christoph suggested a more generic VFS open-by-inum,>>>> which isn't impossible to do but would cause a lot of controversy I>>>> think, while the EXT4_IOC_WRAPPER is at least contained within ext4,>>>> but is more generically useful than EXT4_IOC_FIEMAP_INO.>> How do the other ext4 developers think about> implementing EXT4_IOC_WRAPPER?> Will it be used only for defrag so far?
I expect the initial users of this ioctl will be FIEMAP and DEFRAG, but
it might also be useful for other ioctls in the future.
I haven't really asked other ext4 developers about it yet, and nobody
else has commented the last time I posted the patch.
I don't have an objection to Christoph's open-by-FH API, if there is
acceptance of this from other kernel developers (Al Viro in particular),
but that exposes a lot more security issues than just the ioctl wrapper.
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Hi Christoph,
Christoph Hellwig wrote:
> Akira, can you please comment on these issues before going on?> I think the generation issue is a particularly important one if you> want to allow defrag by normal users.
For a regular user defrag (-f), I'll add the check
to solve the generation issue in the following procedures (1-4).
Please check whether my approach is right or not.
1. Acquire the extents information of inode in kernel space
and then return it to user space with EXT4_IOC_FIEMAP_INO.
2. Calculate the victim extents from the combination of
extents (the result of 1) and free space extents.
3. Pass the victim extents (move to the other block group to make free space)
from user space to kernel space with EXT4_IOC_MOVE_VICTIM.
4. In kernel space, make sure the permission and the extents construction
of the passed inode (the passed inode still covers
the victim extent area or not).
If there is no problem, defrag continues its process.
If check (4) failed, it means victim extents was changed
and that area might be already used by other users, so defrag will fail.
If my approach doesn't seem to be a problem, I will work on it.
Regards,
Akira Fujita
> On Sun, Oct 26, 2008 at 04:48:48AM -0400, Christoph Hellwig wrote:>> On Sun, Oct 26, 2008 at 02:40:48AM -0600, Andreas Dilger wrote:>>> This was mentioned last time these patches were posted, but there was>>> no reply from you. Christoph suggested a more generic VFS open-by-inum,>>> which isn't impossible to do but would cause a lot of controversy I>>> think, while the EXT4_IOC_WRAPPER is at least contained within ext4,>>> but is more generically useful than EXT4_IOC_FIEMAP_INO.>> I'll hack up a generic open_by_handle and then we can gather the>> reaction - it shouldn't be more than about one or two hundred lines of>> code. Note that you really want an open by handle and not just inum for>> a defragmentation tool - without the generation you can easily run into>> races.>>>> Btw, any reason the XFS approach of passing in *file descriptors* for both>> the inode to be defragmented and the "donor" inode doesn't work for you?
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

On Fri, Oct 31, 2008 at 06:05:47AM -0400, Christoph Hellwig wrote:
> > I'll hack up a generic open_by_handle and then we can gather the> > reaction - it shouldn't be more than about one or two hundred lines of> > code. Note that you really want an open by handle and not just inum for> > a defragmentation tool - without the generation you can easily run into> > races.
I think having a generic open_by_handle is a Good Thing, but it
actually isn't quite enough for defrag, because that brings up the
question of how defrag can create the handle in the first place.
In the case of Aryan's desire to get the list of files that were read
during boot, it's pretty obvious how we can define an interface which
would make available a set of file handles corresponding to the files
that were opened during the boot process, and then that list of file
handles can be saved to a file and used on the subsequent boot to do
the readahead. Fairly straight forward.
In the case of the defrag situation, though, we need to step back and
figure out what we are trying to do. What the userspace program is
apparently trying to do is to get the block extent maps used by all of
the inodes in the block group. The problem is we aren't opening the
inodes by pathname, so we couldn't create a handle in the first place
(since in order to create a handle, we need the containing directory).
The bigger question is whether the defrag code is asking the right
question in the first place. The issue is that is that it currently
assumes that in order to find the owner of a particular block (or more
generally, extent), you should search the inodes in the block's
blockgroup. The problem is that for a very fragmented filesystem,
most of the blocks' owners might not be in their block group. In
fact, over time, if you use defrag -f frequently, it will move blocks
belonging to inodes in one block group to other block groups, which
will make defrag -f's job even harder, and eventually impossible, for
inodes belonging to other block groups.
> Akira, can you please comment on these issues before going on?> I think the generation issue is a particularly important one if you> want to allow defrag by normal users.
I'm not at all sure that it makes sense to allow "defrag -f" to be
used by normal users. The problem here is we're fighting multiple
constraints. First of all, we want to keep policy in the kernel to an
absolute minimum, since debugging kernel code is a mess, and I don't
think we want the complexity of a full-fledge defragger in the kernel.
Secondly, though, if we are going to do this in userspace, we need to
push a huge amount of information to the userspace defrag program,
that immediately raises some very serious security issues, because we
don't want to leak information unduly to random userspace programs.
> > Btw, any reason the XFS approach of passing in *file descriptors* for both> > the inode to be defragmented and the "donor" inode doesn't work for you?
I agree this is probably the better approach; it would certainly
reduce the amount of new code that needs to be added to the kernel.
Right now the "donor"/temporary inode is created and allocated in the
kernel, and then the kernel decides whether or not the temporary inode
is an improvement. If we make the userspace code responsible for
creating the temporary inode and then using fallocate() to allocate
the new blocks, then userspace can call FIEMAP and decide whether it
is an improvement.
- Ted
P.S. I've been looking at ext4_defrag_partial(), and the page locking
looks wrong. The page is only locked if it it hasn't been read into
memory yet? And at the end of the function, we do this?
if (PageLocked(page))
unlock_page(page);
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html