Feature Request

What challenge are you facing?

We've been using btrfs as our volume driver by default on Linux for a long time now, and while we get a lot of good things from it (primarily nestability), we've encountered stability and portability problems in the wild.

What is a volume driver/graph driver?

Buckle up, as there's a lot of history and quirks here. There's a reason Docker supports like 10 different drivers. :)

You may recognize the words aufs, overlay/overlayfs, btrfs - these are all filesystems you can use to have a copy-on-write replica of an original volume/directory. This is how 10 containers are able to all use a 1GB image as their rootfs without using 10GB of disk space. Docker has a very similar concept, called a "graph driver", which is how it does all its image layering shenanigans to tie 20 different layers together to form 1 rootfs. Concourse's volume driver interface is a bit simpler, as BaggageClaim just supports creating volumes and copy-on-writes of other volumes. The full interface is in driver.go.

How do they compare?

aufs and overlay (formerly overlayfs) are both union filesystems. Filesystems like these are "pseudo" filesystems in that they don't directly interact with a device to provide filesystem semantics on their own (like ext4 and btrfs and other "real" filesystems you'd use for a physical machine). Instead they tie together upper and lower directories on an existing filesystem to form one mount point.

The terms "upper directory" and "lower directory" refer to the directory that writes go to, and the (possibly read-only) directory that the writes are overlayed on to, respectively. For example, a container's rootfs will start with an empty upper directory, with the original rootfs image as its lower directory. If the container writes to /etc/hosts or /etc/some-file, those writes will go into the upper directory, leaving the lower directory unchanged, allowing it to be shared across many containers at once without polluting each other.
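As a sketch, the upper/lower wiring described above maps directly onto overlay's mount options (directory names here are arbitrary; this requires root and kernel 3.18+):

```shell
# Create the pieces of an overlay mount: a read-only lower dir, a
# writable upper dir, a workdir (overlay requires one, on the same
# filesystem as the upper dir), and a mount point.
mkdir -p lower upper work merged
echo "original" > lower/hosts

# Tie them together into one view. Reads fall through to lower/;
# writes land in upper/ without touching lower/.
sudo mount -t overlay overlay \
  -o lowerdir=lower,upperdir=upper,workdir=work \
  merged

echo "modified" > merged/hosts   # copy-up: the write goes to upper/
cat lower/hosts                  # still "original"
cat upper/hosts                  # "modified"
```

As of kernel 4.0, `lowerdir` also accepts multiple colon-separated directories (e.g. `lowerdir=layer2:layer1:base`), which is what makes overlay a plausible aufs replacement for multi-layer images.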

aufs is kind of a black sheep: it was never part of the kernel, and has only ever been available by installing an -extra package. Despite this, it was chosen by Docker early on as its filesystem of choice, probably because it supported creating filesystems from multiple lower directories. overlayfs was the primary competitor at the time, but it was pretty new, only came with some versions of Ubuntu, and only supported one upper directory and one lower directory.

These days, overlayfs has been renamed to overlay and now ships with the kernel as of version 3.18, requiring no additional setup. Kernel version 4.0 introduces support for configuring multiple lower directories, making it a worthy replacement for aufs on paper.

The critical flaw with aufs and overlay for our use case is that they do not nest. Within a container with an aufs filesystem, you cannot then create more aufs filesystems with an aufs directory as the lower directory. The same is true for overlay.

btrfs, on the other hand, is a real filesystem that deals directly with a block device. You can use it for your entire machine, just like ext4. Instead of gluing together "upper" and "lower" directories, it supports copy-on-write semantics via snapshotting a volume to create a separate volume. The key feature btrfs brings to the table is nestability. Volumes created from a snapshot can themselves be snapshotted to create another volume. So if you were to run docker within a container with a btrfs filesystem, it would just use its btrfs driver and create subvolumes within the container. An important side-tangent here is that containers are just processes, it's the management of their rootfs that's complicated with regard to nesting. Using btrfs solves this problem elegantly.
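A sketch of the nesting described above (paths are arbitrary; this requires root and an existing btrfs mount):

```shell
# On an existing btrfs filesystem mounted at /mnt/btrfs:
sudo btrfs subvolume create /mnt/btrfs/base           # original volume
echo "layer 0" | sudo tee /mnt/btrfs/base/file

# A COW copy of the base volume...
sudo btrfs subvolume snapshot /mnt/btrfs/base /mnt/btrfs/child

# ...which can itself be snapshotted to any depth, unlike an
# overlay upper dir, which cannot serve as another overlay's lower dir.
sudo btrfs subvolume snapshot /mnt/btrfs/child /mnt/btrfs/grandchild
```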

Nestability is important because:

The docker-image resource can just use the btrfs driver, so we don't need a loopback for every image, which is great, as loopbacks are a global system resource that can leak if we're not careful.

Tasks can use docker compose, again by just using the btrfs driver.

What's wrong with btrfs?

Stability and portability. While btrfs has been available in the kernel for a long time, it's never been rock-solid.

It's also occasionally stripped out of some systems (e.g. Docker for Mac), which is frustrating.

There's also the initial dance we need to do in order to get a btrfs device available. In most deployments this involves creating a loopback device for btrfs, as it's extremely uncommon for it to be the primary filesystem of the disk (probably due to the aforementioned stability issues).
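The "initial dance" is roughly the following (sizes and paths are illustrative; this requires root):

```shell
# Back the filesystem with a sparse file rather than a real partition.
truncate -s 10G /var/lib/btrfs-volumes.img

# Attach it to a loopback device -- the global, leak-prone resource
# mentioned above.
LOOP=$(sudo losetup --find --show /var/lib/btrfs-volumes.img)

# Format and mount it; volumes are then created as subvolumes inside.
sudo mkfs.btrfs "$LOOP"
sudo mkdir -p /var/lib/volumes
sudo mount "$LOOP" /var/lib/volumes
```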

A Modest Proposal

Let's revisit our choice and see if we can accomplish everything we need with another driver. Because aufs does not come with the kernel, this will probably be overlay. There's also a new kid on the block, lcfs, which may be worth investigating if it's something we can carry around without requiring a kernel module to be installed.

We may be able to achieve nestability by ensuring there's a non-overlay scratch space available to resources and tasks, so that they can use their own layering driver. Initial testing suggests we'd need this kind of thing, as overlay does not nest.

So one approach for this could be to create an empty volume and mount it somewhere like /tmp in the container. That would then propagate the filesystem from the host (probably ext4 or something) into the container. This would probably fix the Docker-in-Concourse case. The next case to handle would be Concourse-in-Docker. The concourse/concourse image is likely to be run by a Docker daemon using overlay or aufs as its driver. For this we can just have a VOLUME directive for the work-dir, which should mount in the filesystem from the host.
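A sketch of the bind-mount approach (the paths here are hypothetical; this requires root):

```shell
# Create a plain directory on the host filesystem (ext4 or similar)
# and bind-mount it over the path Docker will use inside the container.
mkdir -p /scratch/container-1-docker
sudo mount --bind /scratch/container-1-docker \
  /path/to/container-1/rootfs/var/lib/docker

# Docker inside the container now sees the host filesystem at
# /var/lib/docker, not an overlay mount, so its own graph driver
# (overlay, btrfs, etc.) can operate normally.
```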

Let's also investigate lcfs as an alternative, experimental approach. It would be great if all of it could run in userland so we could literally package Concourse with it, giving us a known-good (or at least known) version. Not sure if that's possible, but worth investigating.

Side-note: after finding a working alternative, it may be worthwhile to observe performance differences between them in various deployment scenarios (e.g. binary directly on VM, BOSH-deployed, Concourse-in-Docker).

Initial investigation into overlay has yielded no real surprises.

First I merged in the overlay-driver branch, then changed the baggageclaim_ctl in the BOSH release to use the overlay driver instead of btrfs. I then ran a build with a simple get and things worked. Next I ran a hello-world build which uses image_resource. The build failed with "invalid argument" trying to fetch the image, as expected, as /var/lib/docker in the container is mounted overlay, which Docker can't do anything with.

So I patched the ATC to create an empty volume and bind-mount it into the container at /var/lib/docker. Then it worked! I moved on to TestFlight, which started failing on volume destroys, which I fixed with concourse/baggageclaim@06884dd. TestFlight is now passing.

Continuing on this thread would probably mean setting up a canonical "scratch space" made available to resource containers. We'd then need to decide whether it should be available to tasks as well, to support the "Docker Compose" use case. This scratch space can't be /tmp, as it turns out that causes the container create to fail, since Guardian places its init there during container start. We'd probably want to invent something like /opt/resource/scratch, but then we'd need to figure out what to do for tasks.

Looked into LCFS for a bit. I haven't made a POC as it looks like it'd be a bit of an investment, and I already have an initial set of concerns:

It's a bit early on in the project for my liking.

It requires a block device or a sparse file, just like btrfs. It at least doesn't require a loopback device for the sparse-file case. They do mention that using a file has a negative performance impact, but I do not see any benchmarks so I don't know how bad it is.

I'm not sure if it nests, or if aufs can run on top of it. Hard to tell without a POC. It can at least operate directly with a sparse-file, so we wouldn't need a loopback.

I'm a bit confused by its CLI. lcfs daemon requires you to pass two mount points, one called "host-mnt" and one called "plugin-mnt", which seems like some Docker graph driver plugin concerns leaking through the abstraction. The docs just describe them as two mount points. There's no perceivable difference in /proc/mounts. Wat?

It requires FUSE 3.0, but that may be fine. I think that's all user-land, and can be packaged alongside lcfs or just built in to it (not sure if the dependency is compile-time or runtime).

Starting with some performance testing between btrfs and overlay. I configured two workers, one with overlay and one with btrfs, and a pipeline that runs two jobs against both: one job that builds the atc-ci image (i.e. this job), and another that builds the git resource image and then runs its integration tests (i.e. this job).

These are very high-level tests, compared to the usual tests run against a graph driver, but they're at least realistic.

I'll let the pipeline run periodically overnight. The dashboard showing the results can be seen here:

I then opted to test another kind of job, @osis's Strabo, which demonstrates high write use during the initial get, and high COW volume use as it has eight put steps preceded by ~46 get steps, plus a couple task outputs. This results in 384 total COW volumes being created.

Initial findings were kind of interesting. btrfs took the 384 volume creates like a champ, and successfully ran the build.

overlay, however, cannot run the build successfully. Once it gets to the task, it errors trying to namespace the task's image, with `Post /volumes: net/http: timeout awaiting response headers`. I opened #1171 for this initially, but then I noticed it happens consistently. Looking into the logs reveals that namespacing a container's image takes about 1 second for btrfs, but 1 minute and 10 seconds for overlay:

This demonstrates the strengths and weaknesses of a "real" filesystem like btrfs compared to a union filesystem like overlay. Namespacing a volume entails recursing through it and chowning everything owned by root. This takes much longer with overlay, possibly because each chown requires more I/O to hoist the file into the upper layer before changing its permissions, or because of other file-attribute bookkeeping that overlay has to maintain.
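Namespacing as described is essentially a recursive ownership remap. A minimal sketch (the UID/GID offset of 100000 is illustrative, not Concourse's actual mapping; requires root):

```shell
# Remap everything owned by root (uid/gid 0) in the volume into the
# unprivileged range. On an overlay volume, each chown forces a
# copy-up of the file into the upper layer -- hence the slowdown.
VOLUME=/var/lib/volumes/some-volume

sudo find "$VOLUME" -uid 0 -exec chown -h 100000 {} +
sudo find "$VOLUME" -gid 0 -exec chgrp -h 100000 {} +
```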

I'll look into how this behaves on Docker, as they'll have run into the same issues once they added user namespacing support.

This is also a problem set that may be addressed by something like shiftfs in the future, but we need to solve today's problems, not next year's. :)

We could mitigate the slow namespacing by keeping privileged and unprivileged versions of these resource caches. Then it would only affect the first fetch. The downside would be doubled disk use.

Looking into how Docker avoids this issue: they have the advantage of not having data at rest that they then need to namespace. They fetch images from the registry, knowing whether their container has to be namespaced or not, and just remap the UID/GIDs as they extract, so there shouldn't be any performance difference. We, however, have to deal with scenarios like a get of a Docker image (or any other resource, even) being subsequently used by one privileged and one unprivileged task within the same build.

The concourse worker command will not require a recreate for this upgrade; it'll just stick with btrfs, as there will already be an existing btrfs mount point. (Note: we should validate this during acceptance.)

The BOSH release, however, will require a recreate of the workers, because the same autodetect logic is not implemented in the release. Maybe we should push all of this down into BaggageClaim?

this pushes the filesystem setup down in to baggageclaim, and should make the upgrade from btrfs-default to overlay-default a smooth transition, as workers will continue to use btrfs until they're recreated.
Submodule src/github.com/concourse/baggageclaim c12e0c4..178b8c0:
> complete the move of driver detection down
> auto-detect driver and set up btrfs loopback
Submodule src/github.com/concourse/bin dd6ef2d8..7dfb3e86:
> make asset setup non-platform-specific
> complete the move to baggageclaim
> move auto-driver-setup/detect into baggageclaim
#1045
Signed-off-by: Clara Fu <cfu@pivotal.io>

We ended up pushing the driver detection and setup logic (i.e. btrfs loopback image wiring) down into BaggageClaim, and removing it from the binaries and BOSH release. This has the (very much intended) side effect of no longer requiring a worker recreate to upgrade - the BOSH release now defaults to detect, along with the binary, so they'll both just see an existing btrfs mount (as it was set up previously) and continue to use that driver. Yay!

We are using the same filesystem that Docker's overlay2 driver uses. The naming is confusing: it went from "overlayfs" to just "overlay" when it was merged into the kernel. Docker already had a driver called overlay for the old one, so they called the new driver overlay2.

vito changed the title from "Switch from btrfs some other filesystem to resolve stability and portability issues" to "Switch from btrfs to some other filesystem to resolve stability and portability issues" on Jun 13, 2018