Hi all!
I'm new on this list. I work on Qubes OS, where Fedora is used as a base
distribution.
While trying to build the installation image in a reproducible manner[1],
I found that the current installation image has an unusual layout. Quoting
the dracut.cmdline manual page:
squashfs.img          |  Squashfs from LiveCD .iso downloaded via network
   !(mount)
   /LiveOS
       |- rootfs.img  |  Filesystem image to mount read-only
            !(mount)
            /bin      |  Live filesystem
            /boot     |
            /dev      |
            ...       |
This rootfs.img layer makes it very hard to build the image reproducibly.
Why is it even there? A bare squashfs.img layer should be enough. Then,
mount overlayfs over it (I see there is even some partial support for it
in dmsquash-live). Most other Live systems I've seen use just squashfs +
overlayfs (or aufs if the kernel is older), so it's a commonly tested
configuration. I *guess* it's there for historical reasons, from before
aufs/overlayfs were available. Is there any other reason for that?
If there is no other reason, I propose to drop this and have the
installer/live filesystem directly in squashfs.img. This has multiple
benefits:
- it's much easier to make the image build process reproducible (see
  below)
- less complexity, both in the build and in the boot (the whole
  dmsquash-live dracut module can be replaced with a <20 line
  function[2])
- smaller initramfs (which is extremely important if it needs to be
  included in efiboot.img, which can't be larger than 32MB)
- slightly faster boot time (device-mapper is slow)
What do you think?
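For illustration only (this is not the function from [2]): a minimal sketch of what a squashfs + overlayfs setup could look like at boot, assuming a tmpfs-backed upper layer. The function names, the /run/live paths and the DRY_RUN switch are all made up for this example.

```sh
# Hypothetical helper: run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Mount a squashfs image read-only and put a tmpfs-backed overlayfs on
# top of it. Paths under /run/live are illustrative. Needs root unless
# DRY_RUN=1 is set.
mount_live_root() {
    img=$1      # path to squashfs.img
    newroot=$2  # where the merged, writable tree should appear
    run mkdir -p /run/live/lower /run/live/rw "$newroot" || return 1
    run mount -t squashfs -o ro,loop "$img" /run/live/lower || return 1
    run mount -t tmpfs tmpfs /run/live/rw || return 1
    # upperdir and workdir must live on the same writable filesystem
    run mkdir -p /run/live/rw/upper /run/live/rw/work || return 1
    run mount -t overlay overlay \
        -o "lowerdir=/run/live/lower,upperdir=/run/live/rw/upper,workdir=/run/live/rw/work" \
        "$newroot"
}
```

Calling it with DRY_RUN=1 (e.g. `DRY_RUN=1 mount_live_root /path/to/squashfs.img /sysroot`) only prints the five commands, which is handy for checking the sequence without root.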
As for the reproducibility, I've made changes to lorax (including
dropping the rootfs.img layer), anaconda, pungi and createrepo, and
together they make it possible to build a bit-for-bit identical image
given the same input (rpm packages, pungi configuration,
$SOURCE_DATE_EPOCH variable[3]). Well, almost - there is an issue with
efiboot.img, but I already have a solution, I just haven't pushed it yet.
You can find all the pull requests collected here:
https://github.com/QubesOS/qubes-installer-qubes-os/pull/26
I'll keep working on getting the changes merged upstream.
[1] https://reproducible-builds.org/
[2] https://github.com/QubesOS/qubes-installer-qubes-os/pull/26/commits/332be...
[3] https://reproducible-builds.org/specs/source-date-epoch/
--
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?

Hi all!
I'm new on this list. I work on Qubes OS, where Fedora is used as a base
distribution.
While trying to build the installation image in reproducible manner[1],
I found the current installation image have unusual layout. Quoting
dracut.cmdline manual page:
squashfs.img | Squashfs from LiveCD .iso downloaded via network
!(mount)
/LiveOS
|- rootfs.img | Filesystem image to mount read-only
!(mount)
/bin | Live filesystem
/boot |
/dev |
... |
This rootfs.img layer makes the image build very much unreproducible.
Why is it even there? Bare squashfs.img layer should be enough. Then,
mount overlayfs over it (I see there is even some partial support for it
in dmsquash-live). Most other Live systems I've seen use just squashfs +
overlayfs (or aufs if kernel is older), so it's commonly tested
configuration. I *guess* it's there for historical reason, from before
aufs/overlayfs being available. Is there any other reason for that?

I'm pretty sure the original reason was that the default live install used
dd to block copy the root file system into the fedora-root LV, and
then resized the LV and ext4 file system. There have also been a
number of squashfs improvements since that decision, so there might
have been limitations with squashfs that ext4 didn't have (I'm
thinking xattrs were supported in ext4 long before squashfs, and maybe
capabilities?)

If there is no other reason, I propose to drop this and have
installer/live filesystem directly in squashfs.img. This have multiple
benefits:
- it's much easier to make the image build process reproducible (see
below)
- less complexity, both in the build and in the boot (the whole
dmsquash-live dracut module can be replaced with <20 line
function[2]
- smaller initramfs (which is extremely important if needed to be
included in efiboot.img, which can't be larger than 32MB)
- slightly faster boot time (device-mapper is slow)
What do you think?

Whatever we do should take into account the persistent root and
persistent home use cases, specifically:
https://github.com/livecd-tools/livecd-tools/blob/master/tools/livecd-iso...
--overlay-size-mb
--home-size-mb
A particular criticism of the device-mapper solution currently being
used is in that script: it blows up. Literally it's WORM, and deleting
files simply dereferences them, it doesn't free up pool space, so it
is inevitable that the pool will fill up, and when it does it blows up
the file system, and it can't be repaired. All you can do is reset the
overlay which means deleting all changes and starting over.
At least one of our spins, SOAS, depends on livecd-iso-to-disk for
creating their final installation because it's predicated on running
Fedora SOAS from a stick.
Why does efiboot.img have a 32MiB limit?

Cool! Well you've already done most of the work and if this has
support elsewhere already then I'm in favor of continuing in that
direction.
I did give all of these things some thought a long time ago when I ran
into a lorax hack by Will Woods who used Btrfs as the root.img file
system, I'm not sure why it was used. But it gave me the idea of using
a few features built into Btrfs specifically for this use case:
- seed/sprout feature can be used with zram block device for volatile
overlay; and used with a blank partition on the stick for persistent
overlay. Discovery is part of the btrfs kernel code.
- Since metadata and data are always checksummed on every read, we
wouldn't have to depend on the slow and transient ISO checksum
(rd.live.check, which uses checkisomd5), which likewise breaks when
creating a stick with livecd-iso-to-disk.
- Btrfs supports zstd compression. I did some testing and squashfs is
still a bit more efficient because it compresses fs metadata, whereas
Btrfs only compresses data extents.
The gotcha here is the resulting image isn't going to be bit for bit
reproducible: UUIDs and time stamps are strewn throughout the file
system (similar to ext4 and XFS), but any sufficiently complex file
system is going to have this problem. Off hand I'm not sure how
squashfs would get around it since it's going to draw from an ext4
source (not sure if the ephemeral root could be tmpfs and use it as
the source for mksquashfs?)
--
Chris Murphy

On Thu, Oct 11, 2018 at 6:37 PM, Marek Marczykowski-Górecki
<marmarek(a)invisiblethingslab.com> wrote:
> Hi all!
>
> I'm new on this list. I work on Qubes OS, where Fedora is used as a base
> distribution.
>
> While trying to build the installation image in reproducible manner[1],
> I found the current installation image have unusual layout. Quoting
> dracut.cmdline manual page:
>
> squashfs.img | Squashfs from LiveCD .iso downloaded via network
> !(mount)
> /LiveOS
> |- rootfs.img | Filesystem image to mount read-only
> !(mount)
> /bin | Live filesystem
> /boot |
> /dev |
> ... |
>
> This rootfs.img layer makes the image build very much unreproducible.
> Why is it even there? Bare squashfs.img layer should be enough. Then,
> mount overlayfs over it (I see there is even some partial support for it
> in dmsquash-live). Most other Live systems I've seen use just squashfs +
> overlayfs (or aufs if kernel is older), so it's commonly tested
> configuration. I *guess* it's there for historical reason, from before
> aufs/overlayfs being available. Is there any other reason for that?
I'm pretty sure the original reason was the default live install use
dd to block copy the root file system into the fedora-root LV, and
then resized the LV and ext4 file system.

There have also been a
number of squashfs improvements since that decision so there might
have been limitations with squashfs that ext4 didn't have (I'm
thinking xattr were long supported in ext4 before squashfs, and maybe
capabilities?)
>
> If there is no other reason, I propose to drop this and have
> installer/live filesystem directly in squashfs.img. This have multiple
> benefits:
> - it's much easier to make the image build process reproducible (see
> below)
> - less complexity, both in the build and in the boot (the whole
> dmsquash-live dracut module can be replaced with <20 line
> function[2]
> - smaller initramfs (which is extremely important if needed to be
> included in efiboot.img, which can't be larger than 32MB)
> - slightly faster boot time (device-mapper is slow)
>
> What do you think?
Whatever we do should take into account the persistent root and
persistent home use cases, specifically:
https://github.com/livecd-tools/livecd-tools/blob/master/tools/livecd-iso...
--overlay-size-mb
--home-size-mb
A particular criticism of the device-mapper solution currently being
used is in that script: it blows up. Literally it's WORM, and deleting
files simply dereferences them, it doesn't free up pool space, so it
is inevitable that the pool will fill up, and when it does it blows up
the file system, and it can't be repaired. All you can do is reset the
overlay which means deleting all changes and starting over.
At least one of our spins, SOAS, depends on livecd-iso-to-disk for
creating their final installation because it's predicated on running
Fedora SOAS from a stick.
Why does efiboot.img have a 32MiB limit?

> As for the reproducibility, I've made changes to lorax (including
> dropping rootfs.img layer), anaconda, pungi and createrepo and this all
> allows to build bit-by-bit identical image, given the same input (rpm
> packages, pungi configuration, $SOURCE_DATE_EPOCH variable[3]). Well,
> almost - there is an issue with efiboot.img, but I already have a
> solution, just not pushed it yet.
>
> You can find all the pull requests collected here:
> https://github.com/QubesOS/qubes-installer-qubes-os/pull/26
>
> I'll work further to make the changes merged upstream.
>
> [1] https://reproducible-builds.org/
> [2] https://github.com/QubesOS/qubes-installer-qubes-os/pull/26/commits/332be...
> [3] https://reproducible-builds.org/specs/source-date-epoch/
Cool! Well you've already done most of the work and if this has
support elsewhere already then I'm in favor of continuing in that
direction.
I did give all of these things some thought a long time ago when I ran
into a lorax hack by Will Woods who used Btrfs as the root.img file
system, I'm not sure why it was used. But it gave me the idea of using
a few features built into Btrfs specifically for this use case:
- seed/sprout feature can be used with zram block device for volatile
overlay; and used with a blank partition on the stick for persistent
overlay. Discovery is part of the btrfs kernel code.
- Since metadata and data is always checksummed on every read, we
wouldn't have to depend on the slow and transient ISO checksum
(rd.live.check which uses checkisomd5) which likewise breaks when
creating a stick with livecd-iso-to-disk.
- Btrfs supports zstd compression. I did some testing and squashfs is
still a bit more efficient because it compresses fs metadata, whereas
Btrfs only compresses data extents.
The gotcha here is the resulting image isn't going to be bit for bit
reproducible: UUIDs and time stamps are strewn throughout the file
system (similar to ext4 and XFS), but any sufficiently complex file
system is going to have this problem.

I wouldn't worry about _file_ timestamps that much - in most cases this is
a solvable problem, given an elaborate enough find+touch[4]. But that's
obviously not all: there are various timestamps in the superblock and
other metadata. The most problematic part in "normal" filesystems that
use a kernel driver is inode allocation, block allocation etc. This greatly
depends on timing, ordering, the specific kernel version etc.
See [5] for details.
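To make the find+touch part concrete, here is a minimal sketch along the lines of what [4] describes, assuming GNU find and GNU touch. The function name is made up for this example, and it only deals with file mtimes, not with superblock timestamps or allocation order.

```sh
# Clamp mtimes under a tree to SOURCE_DATE_EPOCH: anything newer than the
# epoch is set to exactly the epoch, older files are left alone.
# Relies on GNU find (-newermt) and GNU touch (--date=@SECONDS).
clamp_mtimes() {
    tree=$1
    : "${SOURCE_DATE_EPOCH:?SOURCE_DATE_EPOCH must be set}"
    find "$tree" -newermt "@${SOURCE_DATE_EPOCH}" \
        -exec touch --no-dereference --date="@${SOURCE_DATE_EPOCH}" {} +
}
```

Running `clamp_mtimes ./tree` with SOURCE_DATE_EPOCH exported would be done right before building the image from that tree.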

Off hand I'm not sure how
squashfs would get around it since it's going to draw from an ext4
source (not sure if the ephemeral root could be tmpfs and use it as
the source for mksquashfs?)

mksquashfs 5.0-rc1 has support for clamping mtime to the $SOURCE_DATE_EPOCH
variable[3]. And the other metadata is already reproducible in mksquashfs
4.3 (I think files are sorted, or a similar approach is taken).
TBH, there is also a tool to build an ext4 filesystem reproducibly, without
using the kernel driver. It's make_ext4 from the OpenWrt project. But I still
think it would be better to drop that layer anyway.
[4] https://reproducible-builds.org/docs/archives/
[5] https://reproducible-builds.org/docs/system-images/
--
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

> I'm pretty sure the original reason was the default live install use
> dd to block copy the root file system into the fedora-root LV, and
> then resized the LV and ext4 file system.
How is it done now?

On Live media installs, anaconda does:
rsync -pogAXtlHrDx --exclude /dev/ --exclude /proc/ --exclude /sys/ \
    --exclude /run/ --exclude '/boot/*rescue*' --exclude /etc/machine-id \
    /mnt/install/source/ /mnt/sysimage
On DVD and netinstalls, I'm guessing based on packaging.log that it's
a dnf+rpm installation even though I never see a dnf or rpm process in
either top or ps. In any case, the rpm packages are directly on the
iso9660 file system, not baked into the

OK. On Fedora 28 media, efiboot.img is ~9.2 MiB and does not contain
either the kernel or initramfs. The kernel and initramfs are found on
the iso9660 file system at images/pxeboot/ and also at isolinux/ where
GRUB UEFI uses the former, and isolinux BIOS uses the latter. Both
initrds are 65M so they're already too big to go into efiboot.img -
and they kinda need to be because this particular initramfs is built
by dracut with the --no-hostonly flag so that hopefully we can boot
anything. (Curiously, the initramfs is 65M on DVD/netinstall and 50M
on LiveOS - I don't have an explanation for that. I'm looking at
Fedora 28 release images.)
From my understanding, efiboot.img would only need to contain shim,
grubia32, grubx64 and supporting bootloader-only files.
BTW, trivia: Fedora's installer creates EFI System partitions that are
always FAT16. So far as I know, no computer has complained, only
humans. FAT12/16 is OK for removable media but the spec pretty clearly
expects FAT32 for ESPs on permanent installs. The installer team
doesn't want to use mkfs flags, they expect the defaults to work
unless they don't work, and they do work, so FAT16 it is.

Full story:
https://github.com/QubesOS/qubes-issues/issues/794#issuecomment-135988806
I've spent a lot of time debugging this, because mkisofs doesn't
complain about it, it just silently overflows the higher bits into the
adjacent field, which gives weird results depending on where you boot it.
Adding isohybrid to the picture doesn't make it easier (there, the higher
bits are truncated, or actually not copied to the MBR partition table, as
they weren't part of the original field).

I think we're stuck with isohybrid for a while. Having UEFI and BIOS
bootloaders, along with isohybrid supporting both as well as Macs, all
on one media image, that can be burned to optical media and written to
a USB stick - is hugely beneficial.
The compose process takes about 12 hours. That's every ISO for all the
editions, and the spins, and the VM images, for all archs. Even having
separate UEFI and BIOS images, or splitting out Macs with their own
image, would increase compose times and complexity across the board.
I'm not sure which happens first: the end to optical media booting
support; or dropping support for BIOS and/or old Apple EFI Macs (only
this year did they start using UEFI, rather than their own variant of
Intel EFI pre-UEFI, so it'll take some time to see how that shakes out
which also involves whether and how Secure Boot can ever be supported
on Macs).
This talks a bit about isohybrid and all the very clever hacks
involved to make Fedora boot practically anything with a single ISO
9660 image. (I'm being x86_64 arch specific when I say that.)
https://mjg59.dreamwidth.org/11285.html

>
> I did give all of these things some thought a long time ago when I ran
> into a lorax hack by Will Woods who used Btrfs as the root.img file
> system, I'm not sure why it was used. But it gave me the idea of using
> a few features built into Btrfs specifically for this use case:
>
> - seed/sprout feature can be used with zram block device for volatile
> overlay; and used with a blank partition on the stick for persistent
> overlay. Discovery is part of the btrfs kernel code.
>
> - Since metadata and data is always checksummed on every read, we
> wouldn't have to depend on the slow and transient ISO checksum
> (rd.live.check which uses checkisomd5) which likewise breaks when
> creating a stick with livecd-iso-to-disk.
>
> - Btrfs supports zstd compression. I did some testing and squashfs is
> still a bit more efficient because it compresses fs metadata, whereas
> Btrfs only compresses data extents.
>
> The gotcha here is the resulting image isn't going to be bit for bit
> reproducible: UUIDs and time stamps are strewn throughout the file
> system (similar to ext4 and XFS), but any sufficiently complex file
> system is going to have this problem.
I wouldn't worry about _files_ timestamps that much - in most cases this is
solvable problem by elaborate enough find+touch[4]. But that's not all
obviously, there are various timestamps in superblock, and other
metadata. The most problematic part in "normal" filesystems, using
kernel driver is inode allocation, block allocation etc. This greatly
depends on timing, ordering, specific kernel version etc.
See [5] for details.

mkfs.btrfs has --rootdir and --shrink features to pre-allocate a
volume with files at mkfs time; I have no idea to what degree it
depends on kernel code. The main benefit with this is it's really easy
to implement full checksum matching for metadata and data on every
read, and user space ends up with EIO instead of corrupt data, and
super clear kernel complaints. And such corruption whether on optical
or USB sticks, is common. Even the more rare case of a stick that
passes md5 checksum, can later have transient and silent corruption
that ends up showing up in weird ways.
It's plausible squashfs could implement this, I think by default it
already checksums every file to look for duplicates, but it doesn't
retain the per file hash for integrity checking later on. It's also
possible with dm-verity or dm-integrity but then that adds back the dm
complexity.
--
Chris Murphy

On Fri, 2018-10-12 at 15:44 -0600, Chris Murphy wrote:
> On Fri, Oct 12, 2018 at 4:30 AM, Marek Marczykowski-Górecki
> <marmarek(a)invisiblethingslab.com> wrote:
> > On Thu, Oct 11, 2018 at 09:24:08PM -0600, Chris Murphy wrote:
> > > I'm pretty sure the original reason was the default live install use
> > > dd to block copy the root file system into the fedora-root LV, and
> > > then resized the LV and ext4 file system.
> >
> > How is it done now?
>
> On Live media installs, anaconda does:
>
> rsync -pogAXtlHrDx --exclude /dev/ --exclude /proc/ --exclude /sys/
> --exclude /run/ --exclude /boot/*rescue* --exclude /etc/machine-id
> /mnt/install/source/ /mnt/sysimage
>
> On DVD and netinstalls, I'm guessing based on packaging.log that it's
> a dnf+rpm installation even though I never see a dnf or rpm process in
> either top or ps. In any case, the rpm packages are directly on the
> iso9660 file system, not baked into the
anaconda uses dnf's python interface, it does not *run* 'dnf'.
https://github.com/rhinstaller/anaconda/blob/master/pyanaconda/payload/dn...

Yep, but the DNF Python code still actually runs in a Python subprocess.
This is needed as apparently something during the package installation transaction
- most likely RPM - does a chroot. If the DNF code ran directly in the Anaconda process,
Anaconda would get chrooted as well and BAD THINGS (TM) would happen.
Bad things ranging from missing icons to GTK crashing due to files it uses suddenly
vanishing.

On Fri, Oct 12, 2018 at 4:30 AM, Marek Marczykowski-Górecki
<marmarek(a)invisiblethingslab.com> wrote:
> On Thu, Oct 11, 2018 at 09:24:08PM -0600, Chris Murphy wrote:
>> Why does efiboot.img have a 32MiB limit?
>
> Because "32MB should be enough for everybody"...
> Long story short, "El Torito" boot catalog structure have 16-bit field
> for image size (expressed in 512-bytes sectors). For details see here:
> https://wiki.osdev.org/El-Torito
>
> https://web.archive.org/web/20180112220141/https://download.intel.com/sup...
> (page 10)
OK. On Fedora 28 media, efiboot.img is ~9.2 MiB and does not contain
either the kernel or initramfs.

I know, this particular problem was specific to Qubes OS, where
kernel+initramfs needed to be on the ESP because of a Xen+EFI limitation
(basically the kernel needs to be loaded through UEFI instead of
by grub, so it needs to live on something that UEFI understands). And
actually recent Xen versions don't have this limitation anymore (at
least in theory...). This is just a bit of context on how it all
got here, much less relevant today.
(...)
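For anyone who wants the numbers behind the El Torito limit quoted above: a 16-bit count of 512-byte sectors tops out just under 32 MiB. A small sketch of the arithmetic follows. The function names are made up, and the 65 MiB figure is just the initramfs size mentioned earlier in the thread, used here as an example input.

```sh
SECTOR_SIZE=512      # El Torito counts in 512-byte sectors
MAX_SECTORS=65535    # largest value a 16-bit field can hold

# Largest image size, in bytes, that the 16-bit field can describe.
el_torito_max_bytes() {
    echo $((MAX_SECTORS * SECTOR_SIZE))
}

# Number of 512-byte sectors needed for an image of the given size in bytes.
sectors_for() {
    echo $(( ($1 + SECTOR_SIZE - 1) / SECTOR_SIZE ))
}

# What is left of the sector count if only its low 16 bits are kept.
truncated_sectors() {
    echo $(( $(sectors_for "$1") & 65535 ))
}

el_torito_max_bytes          # 33553920, i.e. 32 MiB minus one sector
sectors_for 68157440         # 65 MiB needs 133120 sectors
truncated_sectors 68157440   # low 16 bits alone: 2048 sectors (1 MiB)
```

So a 65 MiB image whose size gets cut down to 16 bits would look like a 1 MiB image to whatever reads that field, which matches the kind of weird results described in the issue.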

> Full story:
> https://github.com/QubesOS/qubes-issues/issues/794#issuecomment-135988806
>
> I've spent a lot of time debugging this, because mkisofs doesn't
> complain about it, just silently overflow higher bits to adjacent field,
> which results in weird results, depending on where you boot it. Adding
> isohybrid to the picture doesn't make it easier (there, higher bits are
> truncated, or actually not copied to the MBR partition table, as wasn't
> part of the original field).
I think we're stuck with isohybrid for a while. Having UEFI and BIOS
bootloaders, along with isohybrid supporting both as well as Macs, all
on one media image, that can be burned to optical media and written to
a USB stick - is hugely beneficial.

I have no problem with isohybrid alone. It's a major hack, but definitely
worth it.

The compose process takes about 12 hours. That every ISO for all the
editions, and the spins, and the VM images, for all archs. Even having
separate UEFI and BIOS images, or splitting out Macs with their own
image, it'll increase compose times and complexity across the board.

And also complexity for the users - which image to download. I totally
understand why it is beneficial.
(...)

>> I did give all of these things some thought a long time ago when I ran
>> into a lorax hack by Will Woods who used Btrfs as the root.img file
>> system, I'm not sure why it was used. But it gave me the idea of using
>> a few features built into Btrfs specifically for this use case:
>>
>> - seed/sprout feature can be used with zram block device for volatile
>> overlay; and used with a blank partition on the stick for persistent
>> overlay. Discovery is part of the btrfs kernel code.
>>
>> - Since metadata and data is always checksummed on every read, we
>> wouldn't have to depend on the slow and transient ISO checksum
>> (rd.live.check which uses checkisomd5) which likewise breaks when
>> creating a stick with livecd-iso-to-disk.
>>
>> - Btrfs supports zstd compression. I did some testing and squashfs is
>> still a bit more efficient because it compresses fs metadata, whereas
>> Btrfs only compresses data extents.
>>
>> The gotcha here is the resulting image isn't going to be bit for bit
>> reproducible: UUIDs and time stamps are strewn throughout the file
>> system (similar to ext4 and XFS), but any sufficiently complex file
>> system is going to have this problem.
>
> I wouldn't worry about _files_ timestamps that much - in most cases this is
> solvable problem by elaborate enough find+touch[4]. But that's not all
> obviously, there are various timestamps in superblock, and other
> metadata. The most problematic part in "normal" filesystems, using
> kernel driver is inode allocation, block allocation etc. This greatly
> depends on timing, ordering, specific kernel version etc.
> See [5] for details.
mkfs.btrfs has --rootdir and --shrink features to pre-allocate a
volume with files at mkfs time; I have no idea to what degree it
depends on kernel code.

Probably not at all, given it works as non-root user too.
I've tried to run it twice on the same directory (and with the same
--uuid) on 32MB of data and got different images (~2000 lines of hexdump
diff). Could be some timestamps, could be something else.
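A quick way to put a number on such a comparison, as a sketch: count the byte positions that differ between the two images. The function name and the image file names are made up for this example. It relies on `cmp -l`, which prints one line per differing byte, and it only compares the common prefix if the sizes differ.

```sh
# Print how many byte positions differ between two files.
# Prints 0 for identical files.
count_diff_bytes() {
    cmp -l "$1" "$2" 2>/dev/null | wc -l | tr -d ' '
}
```

For example, after running mkfs.btrfs with --rootdir twice into a.img and b.img, `count_diff_bytes a.img b.img` gives a rough idea of how far apart the two results are. It doesn't say which structures differ, so a hexdump diff is still needed for that.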

The main benefit with this is it's really easy
to implement full checksum matching for metadata and data on every
read, and user space ends up with EIO instead of corrupt data, and
super clear kernel complaints. And such corruption whether on optical
or USB sticks, is common. Even the more rare case of a stick that
passes md5 checksum, can later have transient and silent corruption
that ends up showing up in weird ways.
It's plausible squashfs could implement this, I think by default it
already checksums every file to look for duplicates, but it doesn't
retain the per file hash for integrity checking later on.

Indeed it looks that way. I'm able to make a one-byte modification to the
image file that results in different file contents (diff -r), but no read
error. I wonder if integrity checking is something on the squashfs roadmap...

It's also
possible with dm-verity or dm-integrity but then that adds back the dm
complexity.

Oh, please, no...
There are two almost separate aspects here:
- image layout (squashfs+ext4, squashfs alone, squashfs+btrfs)
- how copy-on-write is achieved (dm-snapshot, overlay fs)
For reproducibility, squashfs alone is the best option, but it does not
improve integrity checking (though it doesn't make it worse either).
For integrity checking, squashfs+btrfs may be better, but it doesn't help
that much with reproducibility. It may even make it worse, because
mkfs.btrfs also doesn't produce a reproducible result, while make_ext4 (not
to be confused with mkfs.ext4!) does. It not being packaged for Fedora
is only a small issue here.
As for copy-on-write, dm-snapshot is quite complex to set up and requires
the underlying FS to support writes. It also doesn't allow writing more data
than the original image size (which may be an issue for the persistent
partition case). Overlay fs on the other hand works with any underlying fs,
and you can write as much data as you want. And in the case of a persistent
partition, you can access that data even if the base image (the lower layer)
is unavailable/broken. I think the only downside of overlay fs is that when
you modify a large file, it gets copied in full to the upper layer. But I
don't think that's an issue in this use case.
For me, overlay fs is a clear winner here.
But as for image layout, it isn't that simple. For reproducibility,
squashfs alone is better. But if the goal of this change were also to
improve detection of read errors, then it isn't that clear anymore. It
may be that it takes a simple mkfs.btrfs patch to make it reproducible,
but that isn't obvious to me at this stage. Also, keeping two layers
looks like unnecessary complexity.
What do you think about sidestepping this discussion a little and
replacing dm-snapshot with overlay fs regardless of other changes here?
That should be doable without any change to the image format and will give
more flexibility there.
Then, it could even be made to support both 1-layer and 2-layer formats
at the same time (depending on whether rootfs.img is present). That's
something that isn't possible with dm-snapshot right now.
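A rough sketch of how the layout detection could work, assuming squashfs.img is already mounted somewhere. The function name and the output format are made up for this example, and the actual loop-mount of rootfs.img is left out.

```sh
# Given the mount point of squashfs.img, report which layout it has and
# what should become the read-only lower layer for overlay fs:
#   2-layer: the embedded LiveOS/rootfs.img (still has to be loop-mounted)
#   1-layer: the squashfs tree itself
live_layout() {
    sq=$1
    if [ -f "$sq/LiveOS/rootfs.img" ]; then
        echo "2-layer $sq/LiveOS/rootfs.img"
    else
        echo "1-layer $sq"
    fi
}
```

With overlay fs on top, the rest of the boot path would not need to care which of the two cases it got, only which path ends up as the lower layer.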
--
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

> mkfs.btrfs has --rootdir and --shrink features to pre-allocate a
> volume with files at mkfs time; I have no idea to what degree it
> depends on kernel code.
Probably not at all, given it works as non-root user too.
I've tried to run it twice on the same directory (and with the same
--uuid) on 32MB of data and got different images (~2000 lines of hexdump
diff). Could be some timestamps, could be something else.

There is the volume UUID, which is what --uuid affects. But there are other
uuids, including the chunk uuid which gets repeated in every leaf and
node along with the volume uuid, the device uuid, each file tree
(subvolume) gets its own uuid, etc. Time stamps include atime, otime,
mtime, and ctime. Some objects have all 0's for uuid, and some items
have only 0.0 for times. I'll float the reproducibility question on
the Btrfs list: whether it's desirable and useful, and how difficult it is. I
think subsetting Btrfs features to reduce complexity generally, and
therefore increase reproducibility as a consequence of that, has
merit.

ext4 alone, and btrfs alone are also viable. But since ext4 has no
compression, image size grows by maybe a factor of 2. Btrfs supports
lzo and zlib compression since forever, and zstd since kernel 4.14,
same as squashfs. What's been missing is mksquashfs with zstd support,
which I imagine will be in 5.0. The compression ratio compares well
with xz currently being used by mksquashfs in Fedora composes, but
with much less CPU to compress and decompress. So I'd say go with zstd
in any case.

For reproducibility, squashfs alone is the best option, but does not
improve integrity checking (but also doesn't make it worse).

I'm not able to estimate how much work it is to add a files hash
manifest to squashfs, and to always use it on reads, and then add some
error handling to EIO upon any mismatch. But yeah it'd need user space
code in mksquashfs and also kernel code to support it.

As for copy-on-write, dm-snapshot is quite complex to setup and require
underlying FS to support write. Also, doesn't allow to write more data
than original image size (may be an issue for persistent partition
case). Overlay fs on the other hand works with any underlying fs, you
can write as much data as you want. And in case of persistent partition,
you can access that data even if base image (the lower layer) is
unavailable/broken. I think the only downside of overlay fs is when you
modify large file it gets copied in full to the upper layer. But I don't
think that's an issue in this use case.
For me, overlay fs is a clear winner here.
But as for image layout, it isn't that simple. For reproducibility,
squashfs alone is better. But if the goal of this change would be also
improving read errors detection, then it isn't that clear anymore. It
may be that it takes a simple mkfs.btrfs patch to make it reproducible,
but it isn't obvious for me at this stage. Also, keeping two layers
looks like unnecessary complexity.

I agree. Overlayfs works fine with any of the discussed filesystems.
I'd give a slight edge to Btrfs seed+sprout as the overlay mechanism
in the case of persistence on a USB stick: a) checksumming b)
compression helps improve performance of USB flash drives and reduces
wear c) kernel discovers both seed and sprout in early boot by sprout
uuid alone, no special mount options needed for setup. But it's a
really minor point because a) and b) are still possible with overlayfs
with a new independent btrfs as the upperdir.

> What do you think about sidestepping this discussion a little and
> replacing dm-snapshot with overlay fs regardless of other changes here?
> That should be doable without any change to image format and will give
> more flexibility there.

Agreed. What I can't tell you off hand is if livecd-iso-to-disk would
be affected by this in some way; or whether the change policy applies.
But I think it's better to file the change so there's awareness and
coordination: installer team would have to sign off on the pull
request for lorax, and then releng team probably should know about it
because they define their own compose settings (I guess they often use
upstreams defaults but they don't have to), and then QA might want a
heads up so if things blow up they know who to ask what's up, and then
it's also a good idea to let SOAS folks know about it. And a central
point of filing changes is coordination.
https://fedoraproject.org/wiki/Changes/Policy
--
Chris Murphy

On Fri, Oct 12, 2018 at 5:26 PM, Marek Marczykowski-Górecki
<marmarek(a)invisiblethingslab.com> wrote:
> On Fri, Oct 12, 2018 at 03:44:38PM -0600, Chris Murphy wrote:
>> mkfs.btrfs has --rootdir and --shrink features to pre-allocate a
>> volume with files at mkfs time; I have no idea to what degree it
>> depends on kernel code.
>
> Probably not at all, given it works as non-root user too.
> I've tried to run it twice on the same directory (and with the same
> --uuid) on 32MB of data and got different images (~2000 lines of hexdump
> diff). Could be some timestamps, could be something else.
There is volume UUID which is what --uuid affects. But there are other
uuids, including the chunk uuid which gets repeated in every leaf and
node along with the volume uuid, device uuid, each files tree
(subvolume) get its own uuid, etc. Time stamps include atime, otime,
mtime, and ctime. Some objects have all 0's for uuid, and some items
have only 0.0 for times. I'll float the reproducibility question on
the Btrfs list, if it's desirable, useful, and how difficult it is. I
think subsetting Btrfs features to reduce complexity generally, and
therefore increase reproducibility as a consequence of that, has
merit.
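
The "run it twice and diff" experiment described above can be sketched generically. This is only an illustration: `is_reproducible` and the two stand-in builders are hypothetical names, and a real check would wrap the mkfs.btrfs or mksquashfs invocation in a builder function that takes the output path as its last argument.

```shell
#!/bin/sh
# Sketch: run a build command twice and compare the two outputs byte for byte.
# Returns 0 if identical, 1 if they differ, 2 if the build itself failed.

is_reproducible() {
    tmpdir=$(mktemp -d) || return 2
    if "$@" "$tmpdir/a.img" && "$@" "$tmpdir/b.img"; then
        if cmp -s "$tmpdir/a.img" "$tmpdir/b.img"; then
            status=0
        else
            status=1
        fi
    else
        status=2
    fi
    rm -rf "$tmpdir"
    return "$status"
}

# Stand-in builders for the demo (not real image builders).
build_fixed()   { printf 'same bytes\n' > "$1"; }
# Embeds random bytes, much like a freshly generated UUID would.
build_stamped() { od -An -N8 -tx1 /dev/urandom > "$1"; }

if is_reproducible build_fixed;   then echo "build_fixed: identical"; fi
if is_reproducible build_stamped; then :; else echo "build_stamped: differs"; fi
```

When the outputs differ, running `cmp -l` on the two files gives a rough count of differing bytes, which helps tell a few stray timestamps from pervasive per-node UUIDs.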

>
> There are two almost separate aspects here:
> - image layout (squashfs+ext4, squashfs alone, squashfs+btrfs)
> - how copy-on-write is achieved (dm-snapshot, overlay fs)
ext4 alone, and btrfs alone are also viable. But since ext4 has no
compression, image size grows by maybe a factor of 2. Btrfs supports
lzo and zlib compression since forever, and zstd since kernel 4.14,
same as squashfs. What's been missing is mksquashfs with zstd support,
which I imagine will be in 5.0. The compression ratio compares well
with xz currently being used by mksquashfs in Fedora composes, but
with much less CPU to compress and decompress. So I'd say go with zstd
in any case.
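
Since zstd availability depends on how squashfs-tools was built, a compose script could probe for it and fall back to xz. The sketch below assumes that the mksquashfs help text lists the compiled-in compressors; `pick_compressor` is a hypothetical helper, and the final command is only printed, not executed.

```shell
#!/bin/sh
# Sketch: choose zstd when mksquashfs advertises it, otherwise fall back to xz.

pick_compressor() {
    # Reads mksquashfs help text on stdin.
    if grep -q zstd; then
        echo zstd
    else
        echo xz
    fi
}

# Older releases print help on stderr, so merge both streams. If mksquashfs
# is not installed, the error message contains no "zstd" and xz is chosen.
comp=$( { mksquashfs -help 2>&1 || true; } | pick_compressor )

# ./rootfs and squashfs.img are placeholder names.
echo "mksquashfs ./rootfs squashfs.img -noappend -comp $comp"
```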

>
> For reproducibility, squashfs alone is the best option, but does not
> improve integrity checking (but also doesn't make it worse).
I'm not able to estimate how much work it is to add a files hash
manifest to squashfs, and to always use it on reads, and then add some
error handling to EIO upon any mismatch. But yeah it'd need user space
code in mksquashfs and also kernel code to support it.
> As for copy-on-write, dm-snapshot is quite complex to setup and require
> underlying FS to support write. Also, doesn't allow to write more data
> than original image size (may be an issue for persistent partition
> case). Overlay fs on the other hand works with any underlying fs, you
> can write as much data as you want. And in case of persistent partition,
> you can access that data even if base image (the lower layer) is
> unavailable/broken. I think the only downside of overlay fs is when you
> modify large file it gets copied in full to the upper layer. But I don't
> think that's an issue in this use case.
>
> For me, overlay fs is a clear winner here.
> But as for image layout, it isn't that simple. For reproducibility,
> squashfs alone is better. But if the goal of this change would be also
> improving read errors detection, then it isn't that clear anymore. It
> may be that it takes a simple mkfs.btrfs patch to make it reproducible,
> but it isn't obvious for me at this stage. Also, keeping two layers
> looks like unnecessary complexity.
I agree. Overlayfs works fine with any of the discussed filesystems.
I'd give a slight edge to Btrfs seed+sprout as the overlay mechanism
in the case of persistence on a USB stick: a) checksumming b)
compression helps improve performance of USB flash drives and reduces
wear c) kernel discovers both seed and sprout in early boot by sprout
uuid alone, no special mount options needed for setup. But it's a
really minor point because a) and b) are still possible with overlayfs
with a new independent btrfs as the upperdir.
> What do you think about sidestepping this discussion a little and
> replacing dm-snapshot with overlay fs regardless of other changes here?
> That should be doable without any change to image format and will give
> more flexibility there.
Agreed. What I can't tell you off hand is if livecd-iso-to-disk would
be affected by this in some way; or whether the change policy applies.
But I think it's better to file the change so there's awareness and
coordination: installer team would have to sign off on the pull
request for lorax, and then releng team probably should know about it
because they define their own compose settings (I guess they often use
upstreams defaults but they don't have to), and then QA might want a
heads up so if things blow up they know who to ask what's up, and then
it's also a good idea to let SOAS folks know about it. And a central
point of filing changes is coordination.

As the upstream for livecd-tools[1] (and thus livecd-iso-to-disk), I'd
be very interested in changes to support both Btrfs seed+sprout and
Btrfs+OverlayFS combinations.
[1]: https://github.com/livecd-tools/livecd-tools
--
真実はいつも一つ！/ Always, there's only one truth!

On Sat, Oct 13, 2018 at 8:17 PM Chris Murphy
<lists(a)colorremedies.com> wrote:
>
> On Fri, Oct 12, 2018 at 5:26 PM, Marek Marczykowski-Górecki
> <marmarek(a)invisiblethingslab.com> wrote:
> > On Fri, Oct 12, 2018 at 03:44:38PM -0600, Chris Murphy wrote:
>
> >> mkfs.btrfs has --rootdir and --shrink features to pre-allocate a
> >> volume with files at mkfs time; I have no idea to what degree it
> >> depends on kernel code.
> >
> > Probably not at all, given it works as non-root user too.
> > I've tried to run it twice on the same directory (and with the same
> > --uuid) on 32MB of data and got different images (~2000 lines of hexdump
> > diff). Could be some timestamps, could be something else.
>
> There is volume UUID which is what --uuid affects. But there are other
> uuids, including the chunk uuid which gets repeated in every leaf and
> node along with the volume uuid, device uuid, each files tree
> (subvolume) get its own uuid, etc. Time stamps include atime, otime,
> mtime, and ctime. Some objects have all 0's for uuid, and some items
> have only 0.0 for times. I'll float the reproducibility question on
> the Btrfs list, if it's desirable, useful, and how difficult it is. I
> think subsetting Btrfs features to reduce complexity generally, and
> therefore increase reproducibility as a consequence of that, has
> merit.
>
This is a really interesting idea...

Ahh I missed that. And looking at koji, it seems like squashfs-tools
are currently FTBFS on Fedora 29. I have F29 but
squashfs-tools-4.3-16.fc28.x86_64.
OK, so it sounds to me like the current proposals for this thread as
it relates to installer images for Fedora 30:
- Drop devicemapper in favor of overlayfs
- Drop squashfs+ext4 images in favor of squashfs only image
- Maybe move to zstd in the squashfs image
I think part of the feature/change proposal should be building an
example LiveOS image in copr so we can get an idea of how to blow it
up, and ask QA to run it through OpenQA tests and see what sorts of
things break there.
Neal, any ideas who Marek could be a co-owner of the feature and help
navigate the Fedora process? Maybe someone on the Anaconda or releng
teams?
--
Chris Murphy

On Sat, Oct 13, 2018 at 6:24 PM, Neal Gompa <ngompa13(a)gmail.com> wrote:
> On Sat, Oct 13, 2018 at 8:17 PM Chris Murphy <lists(a)colorremedies.com> wrote:
>>
>> On Fri, Oct 12, 2018 at 5:26 PM, Marek Marczykowski-Górecki
>> <marmarek(a)invisiblethingslab.com> wrote:
>> > On Fri, Oct 12, 2018 at 03:44:38PM -0600, Chris Murphy wrote:
>>
>> >> mkfs.btrfs has --rootdir and --shrink features to pre-allocate a
>> >> volume with files at mkfs time; I have no idea to what degree it
>> >> depends on kernel code.
>> >
>> > Probably not at all, given it works as non-root user too.
>> > I've tried to run it twice on the same directory (and with the same
>> > --uuid) on 32MB of data and got different images (~2000 lines of hexdump
>> > diff). Could be some timestamps, could be something else.
>>
>> There is volume UUID which is what --uuid affects. But there are other
>> uuids, including the chunk uuid which gets repeated in every leaf and
>> node along with the volume uuid, device uuid, each files tree
>> (subvolume) get its own uuid, etc. Time stamps include atime, otime,
>> mtime, and ctime. Some objects have all 0's for uuid, and some items
>> have only 0.0 for times. I'll float the reproducibility question on
>> the Btrfs list, if it's desirable, useful, and how difficult it is. I
>> think subsetting Btrfs features to reduce complexity generally, and
>> therefore increase reproducibility as a consequence of that, has
>> merit.
>>
>
> This is a really interesting idea...
https://lore.kernel.org/linux-btrfs/CAJCQCtTPwQnzwkpk=4ZsZXfWTC7HymYETxp-...

I'm interested to see how that thread turns out... It's a tempting
idea, because it gives you so much more flexibility. Installation onto
a disk could be a "btrfs send" and overlay changes could be easily
flattened on top of the target system. It'd also be much cheaper and
lighter for supporting the live environment.

> squashfs has supported zstd along with btrfs since kernel 4.14. zstd
> support was mainlined into squashfs-tools a year ago:
>
> https://github.com/plougher/squashfs-tools/commit/6113361316d5ce5bfdc118d...
>
> However, there's been no releases since the migration from CVS on SF
> to Git on GitHub.
Ahh I missed that. And looking at koji, it seems like squashfs-tools
are currently FTBFS on Fedora 29. I have F29 but
squashfs-tools-4.3-16.fc28.x86_64.
OK, so it sounds to me like the current proposals for this thread as
it relates to installer images for Fedora 30:
- Drop devicemapper in favor of overlayfs
- Drop squashfs+ext4 images in favor of squashfs only image
- Maybe move to zstd in the squashfs image
I think part of the feature/change proposal should be building an
example LiveOS image in copr so we can get an idea of how to blow it
up, and ask QA to run it through OpenQA tests and see what sorts of
things break there.
Neal, any ideas who Marek could be a co-owner of the feature and help
navigate the Fedora process? Maybe someone on the Anaconda or releng
teams?

Brian C. Lane from the Weldr team is probably the guy to work with on
this. He is the chief developer of Lorax, which is where
livemedia-creator comes from. I've CC'd him to this email.
--
真実はいつも一つ！/ Always, there's only one truth!

> > This is a really interesting idea...
>
> https://lore.kernel.org/linux-btrfs/CAJCQCtTPwQnzwkpk=4ZsZXfWTC7HymYETxp-...
>
I'm interested to see how that thread turns out... It's a tempting
idea, because it gives you so much more flexibility. Installation onto
a disk could be a "btrfs send" and overlay changes could be easily
flattened on top of the target system. It'd also be much cheaper and
lighter for supporting the live environment.

Ha! I just realized after all this time that the Btrfs wiki does not
make clear how to make a sprout, even though it mentions the more
esoteric recursive seed.[1] Of course you can mkfs.btrfs, mount it,
and send/receive. But send requires read only snapshots. Making a
sprout is easier, you just remove the seed device. This is supported
since 2009.
# losetup -r /dev/loop0 root.img
# mount /dev/loop0 /mnt/
# btrfs device add /dev/sda3 /mnt
# mount -o remount,rw /mnt
# btrfs device remove /dev/loop0 /mnt
And now it replicates extents from seed to sprout. The copy is faster
than pvmove, rsync, dd, or rpm-ostree deploy.
OK so let's say you have a USB stick 'sdb' and internal drive 'sda'.
And the stick already has a Fedora LiveOS imaged on it, only change is
the root.img is a Btrfs seed. The simplistic systemd pre-mount and
mount look like:
# losetup -r /dev/loop0 root.img
# mount -t btrfs /dev/loop0 /
# btrfs device add /dev/zram1 /
# mount -t btrfs -o remount,rw /
- now you have a live overlay in RAM; user can start using this LiveOS
environment including making changes like installing software; setting
up non-volatile persistence on the stick looks like:
# btrfs device add /dev/sdb3 /
# btrfs device remove /dev/zram1 /
# echo 1 > /sys/class/zram-control/hot_remove
- now the extents on zram1 are moved from zram1 to sdb3 (the stick);
setting up an installation to the internal drive 'sda' by "flattening"
as you say, merely means adding the internal drive to the mounted
Btrfs volume and removing all others:
# btrfs device add /dev/sda3 /
# btrfs device remove /dev/sdb3 /
# btrfs device remove /dev/loop0 /
- now extents on sdb3 (stick) and loop0 (seed) are copied to sda3
(internal), including any changes the user is making while all of this
is happening. In fact, the user does not even have to reboot because
once the operation finishes, and the loop is torn down, the stick is
not in use by the kernel. The user can just unplug the stick and keep
working. A spin or downstream could very sanely, and straightforwardly
build a no-UI OS installation.
It's not obvious that 'btrfs device add' incorporates a mkfs and that
you can now just delete the ro seed. Also not obvious is the 'dev add'
on an ro mounted seed causes a new volume UUID to be generated. This
is immediately discovered by libblkid. The kernel knows that this new
volume is a two device (or three device, whatever the case is) btrfs
and which devices they are. And this is such basic btrfs handling code
that GRUB and extlinux Btrfs code understand it.
[1]
https://btrfs.wiki.kernel.org/index.php/Seed-device
--
Chris Murphy

> Ha! I just realized after all this time that the Btrfs wiki does not
> make clear how to make a sprout, even though it mentions the more
> esoteric recursive seed.[1] Of course you can mkfs.btrfs, mount it,
> and send/receive. But send requires read only snapshots. Making a
> sprout is easier, you just remove the seed device. This is supported
> since 2009.
> # losetup -r /dev/loop0 root.img
> # mount /dev/loop0 /mnt/
> # btrfs device add /dev/sda3 /mnt
> # mount -o remount,rw /mnt
> # btrfs device remove /dev/loop0 /mnt
> And now it replicates extents from seed to sprout. The copy is faster
> than pvmove, rsync, dd, or rpm-ostree deploy.

Yeah sorry I made the assumption that "the seed" is already flagged
with btrfstune. If it weren't flagged as seed and is rw mounted,
replication does still happen however the first device has its
signature wiped, and the second device inherits the same fs UUID. The
use case here is live migration from one device to another.
In the seed/sprout use case the seed is not wiped (so it can be an
on-going source), and the sprout gets a new fs UUID assigned.
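
For completeness, the seed flag is set with `btrfstune -S 1` on the unmounted image before the steps quoted above. The sketch below only prints the full sequence so it can be reviewed without touching any device. `sprout_plan` is a hypothetical helper, and the image, loop device, target partition and mount point are placeholders taken from the example above.

```shell
#!/bin/sh
# Sketch: print the seed -> sprout command sequence. Nothing is executed.

sprout_plan() {
    seed_img=$1     # Btrfs image that will become the read-only seed
    loop_dev=$2     # loop device used to expose the image
    target_dev=$3   # device that becomes the writable sprout
    mnt=$4          # mount point
    cat <<EOF
btrfstune -S 1 $seed_img
losetup -r $loop_dev $seed_img
mount $loop_dev $mnt
btrfs device add $target_dev $mnt
mount -o remount,rw $mnt
btrfs device remove $loop_dev $mnt
EOF
}

sprout_plan root.img /dev/loop0 /dev/sda3 /mnt
```

Skipping the first line gives the non-seed case described above, where the original device has its signature wiped and the new device inherits the filesystem UUID.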
--
Chris Murphy

> Neal, any ideas who Marek could be a co-owner of the feature and help
> navigate the Fedora process? Maybe someone on the Anaconda or releng
> teams?
>
Brian C. Lane from the Weldr team is probably the guy to work with on
this. He is the chief developer of Lorax, which is where
livemedia-creator comes from. I've CC'd him to this email.

Thanks, Marek and I are already in touch :) As long as overlayfs can do
what we need the bulk of the extra work needs to be done in
anaconda-dracut.
We may also want to make this switch an option for a bit, while we work
out the details.
--
Brian C. Lane (PST8PDT)

On Sun, Oct 14, 2018 at 02:21:47PM -0400, Neal Gompa wrote:
> > Neal, any ideas who Marek could be a co-owner of the feature and help
> > navigate the Fedora process? Maybe someone on the Anaconda or releng
> > teams?
> >
>
> Brian C. Lane from the Weldr team is probably the guy to work with on
> this. He is the chief developer of Lorax, which is where
> livemedia-creator comes from. I've CC'd him to this email.
Thanks, Marek and I are already in touch :) As long as overlayfs can do
what we need the bulk of the extra work needs to be done in
anaconda-dracut.

We may also want to make this switch an option for a bit, while we work
out the details.

Supporting both layouts will be trickier, because of the split between
anaconda-dracut and dmsquash-live. Integrating (parts of?) the latter in
the former would make it much easier.
But IMO it's worth making it support both layouts, at least for now.
Anyway, can somebody help me with change proposal? For example I'm not
sure if this is "Self Contained" or "System Wide" Change, or what should
specifically be listed in "Scope". If IRC would be more appropriate for
such discussion, that's fine for me too.
--
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?

> Anyway, can somebody help me with change proposal? For example I'm not
> sure if this is "Self Contained" or "System Wide" Change, or what should
> specifically be listed in "Scope". If IRC would be more appropriate for
> such discussion, that's fine for me too.

I would suggest system-wide, since every edition and spin relies on the
installer.
"Scope" should cover everything you're changing, and everyone who is
impacted in some way (whether they need to be directly involved, or are
impacted and need to change something, or whether they just might like to be
aware).
--
Matthew Miller
<mattdm(a)fedoraproject.org>
Fedora Project Leader

On Fri, Oct 12, 2018 at 4:30 AM, Marek Marczykowski-Górecki
<marmarek(a)invisiblethingslab.com> wrote:
> On Thu, Oct 11, 2018 at 09:24:08PM -0600, Chris Murphy wrote:
>> I'm pretty sure the original reason was the default live install use
>> dd to block copy the root file system into the fedora-root LV, and
>> then resized the LV and ext4 file system.
>
> How is it done now?
On Live media installs, anaconda does:
rsync -pogAXtlHrDx --exclude /dev/ --exclude /proc/ --exclude /sys/ \
  --exclude /run/ --exclude /boot/*rescue* --exclude /etc/machine-id \
  /mnt/install/source/ /mnt/sysimage
On DVD and netinstalls, I'm guessing based on packaging.log that it's
a dnf+rpm installation even though I never see a dnf or rpm process in
either top or ps. In any case, the rpm packages are directly on the
iso9660 file system, not baked into the

One other thing that really hogs system resources for some reason, is
one of the loopback mount devices, I think loop1 which is root.img,
hogs nearly 100% CPU for the duration of the installation for LiveOS
media. I don't know why, but it might be worth benchmarking nbd based
mounts for comparison. The installation turns my computers into hair
dryers. The installation process bottleneck should be reading the
compressed root image, not CPU.
--
Chris Murphy

> Hi all!
> I'm new on this list. I work on Qubes OS, where Fedora is used as a base
> distribution.

Tangentially: Qubes is very cool and I'm glad you find Fedora useful
as a base system. I work on Fedora CoreOS and have patches in a
lot of OS components: lorax, systemd, etc. If there's something blocking
you, feel free to reach out and I may be able to spend some time to
help. Also, if you decide to investigate using rpm-ostree for the
Qubes dom0, I'd be very interested in helping.