Links

If you cook on a pellet smoker/BBQ/grill, you might have noticed that
the forum pelletheads.com has been down for
a while. I came across pelletfan.com, which was
started by one of the former moderators at pelletheads, and a good chunk
of the pellet cooking community has found it.

So, if you have discovered pelletheads is down, like I did, and are looking
for a replacement forum, give pelletfan.com a try.

Did you know you can create your own Linux AWS EC2 AMI which is running 100%
ZFS for all filesystems (/, /boot - everything)? You can, and it’s not too hard
as long as you are experienced with installing Linux without an installer.
Here are rough instructions for setting this up with a modern Debian-based
system (I’ve tested with Debian and Ubuntu). As far as I know, this is the
first published account of how to set this up. There aren’t any prebuilt AMIs
available that I know of, but I might just publish one unless someone else
beats me to it.

Why run ZFS for the root filesystem? Not only is ZFS a high-performing
filesystem, but using native ZFS for everything makes storage management a
cinch. For example, want to keep your root EBS volumes small? No problem - keep
your AMI on a 1GB volume (yes, it’s possible to be that small), and extend the
ZFS pool dynamically at runtime by attaching additional EBS volumes as needed.
ZFS handles this extremely well.

Why build your own AMI instead of using a prebuilt one? There are a couple of
good reasons, but the primary one is that you get a minimal AMI with as little
cruft and bloat as possible. Many of the prebuilt cloud AMIs have a bunch of
packages installed that you might not need or want. By building from scratch,
your AMI contains just the things you want, not only lowering EBS costs, but
potentially reducing security risks.

Note that we don’t do anything special with ephemeral drives here - that’s best
kept in its own ZFS pool anyway, since mixing EBS and ephemeral drives will
have some interesting performance consequences. You can use ephemerals on an
instance, of course (in fact, it works great to stripe across all ephemeral
drives, or you could use SSD ephemerals for ZFS L2ARC) - that’s just not the
purpose of this article.

These instructions will only create an AMI that will boot on an HVM instance
type. Although it’s easy enough to create a snapshot that can be registered
separately as either an HVM or a PV AMI, all new AWS instance types support HVM.
Because of this, I’ve decided only to support newer instance types, hence
HVM-only.

I’ve tested this with the upcoming Debian Stretch (“testing” as of this
writing), as well as Ubuntu Yakkety. It should work with Ubuntu Xenial as well,
but I wouldn’t try anything earlier, since ZFS support is relatively new and
maturing rapidly (the last time I tried Debian Jessie with 100% ZFS, I found
that grub was too old to support booting from ZFS, although a separate EXT4
/boot worked fine; this may have changed since then).

Again, these instructions assume you are pretty familiar with installing Debian
via debootstrap, which means manually provisioning volumes, partitioning them,
creating filesystems, bootstrapping, and chrooting in for final setup. If you
don’t know what all these things mean, you might find this a difficult
undertaking. Unlike installing to your own hardware, there’s very little
instrumentation if things go wrong, and only a read-only console (if you are
lucky - if networking does not initialize properly, you might not even get
that). Expect this to take a few iterations and some frustration - this is a
general guide, not step-by-step instructions!

If you’ve ever installed a Debian based system from scratch, you’ll note that
most of these steps are no different than you’d do on physical hardware. There
are only a few things that are AWS specific, but the vast majority is exactly
how you’d install on bare metal.

Step 1 - Prepare Host Instance

Fire up a host instance to build out the AMI. This doesn’t need to be the same
distribution or version as the AMI to be built, but it has to be recent enough
to have ZFS. Debian Jessie (with jessie-backports) or Ubuntu Xenial will work.

We’ll use this instance (the “host”) to build out the target AMI, and if things
don’t go well we can come back to it and try again (so don’t terminate it until
you are ready, or have a working target AMI).

Once the host instance is up, provision a GP2 EBS volume via the AWS console
and attach it to the host. We use a 10GB volume, but you could make this as
small as 1GB if you really want to (be aware that GP2 IOPS scale with volume
size, so very small volumes don’t perform well).

We’ll assume the newly provisioned volume is attached at /dev/xvdf. The actual
device might vary; check “dmesg” if you aren’t sure.

Next, update /etc/apt/sources.list with the full sources list for your host
distribution. For Debian, use “main contrib non-free” - and you’ll need
jessie-backports if the host is Jessie. For Ubuntu, use “main restricted
universe multiverse”.
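
For example, on a Jessie host (the mirror URL is illustrative):

deb http://httpredir.debian.org/debian jessie main contrib non-free
deb http://httpredir.debian.org/debian jessie-backports main contrib non-free

Then install ZFS and the bootstrapping tools on the host (Debian package names
shown; on Ubuntu Xenial the ZFS bits are in zfsutils-linux):

$ apt-get update
$ apt-get install -t jessie-backports zfs-dkms zfsutils-linux
$ apt-get install gdisk debootstrap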

Step 2 - Prepare Target Pools And Filesystems

Now it’s time to set up ZFS on the new EBS volume. Assuming the target volume
device is /dev/xvdf, we’ll create a GPT partition table with a small BIOS boot
partition for GRUB and leave the rest of the disk for ZFS.

Be careful - many instructions out on the net for ZFS munge the sector
geometry, or trick sgdisk into using an unnatural alignment. The following is
correct (per AWS documentation) not only for EBS, but is also the exact
geometry I use when installing Linux with a ZFS root on physical hardware.
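
Assuming the target is /dev/xvdf, a sketch of the partitioning (sgdisk comes
from the gdisk package):

$ sgdisk --zap-all /dev/xvdf
$ sgdisk -n 1:2048:6143 -t 1:EF02 -c 1:GRUB /dev/xvdf
$ sgdisk -n 2:6144:0 -t 2:BF01 -c 2:ZFS /dev/xvdf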

This will create a small partition labelled GRUB (type EF02, 4096 sectors), and
use the rest of the disk with a partition labelled ZFS (type BF01). The grub
partition doesn’t technically need to be as big as 4096 sectors, but this
ensures everything is aligned properly.

It’s worth noting that I never give ZFS a full disk, and instead I always use
partitions for ZFS pools. If you give ZFS the entire disk, it will create its
own partition table, but waste 8MB in a Solaris partition that Linux has no use
for.

OK, great, next up let’s create our ZFS pool and set up some filesystems. This
will set the target up in /mnt. You can choose any mount point you want,
just remember to use it consistently if you choose a different one.

I use the ZFS pool name “rpool”, but you can choose a different one, just be
careful to substitute yours everywhere.

You may want different options - this will globally enable lz4 compression and
disable atime for the pool. You may want to disable compression generally and
only enable it for specific filesystems. The choice is up to you. We also allow
overlay mount on /var. This is an obscure but important bit - when the system
initially boots, it will log to /var/log before the /var ZFS filesystem is
mounted. Because the mount point is dirty, ZFS won’t mount /var without setting
the overlay flag. Note that /dev/xvdf2 is the second GPT partition we created
above.
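
A sketch of what that looks like (the dataset layout is illustrative - add
filesystems to taste; the bootfs property is what the initramfs uses to find
the root filesystem):

$ zpool create -o ashift=12 -O compression=lz4 -O atime=off \
    -O mountpoint=none -R /mnt rpool /dev/xvdf2
$ zfs create -o mountpoint=/ rpool/root
$ zfs create -o mountpoint=/var -o overlay=on rpool/var
$ zpool set bootfs=rpool/root rpool

Step 3 - Bootstrap Target

Now bootstrap the base system into the target with debootstrap (substitute
your release and preferred mirror):

$ debootstrap stretch /mnt http://deb.debian.org/debian

Step 4 - Chroot Into Target

Bind mount the usual virtual filesystems and chroot in for the rest of the
setup:

$ mount --bind /dev /mnt/dev
$ mount --bind /proc /mnt/proc
$ mount --bind /sys /mnt/sys
$ chroot /mnt /bin/bash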

Step 5 - Finalize Target Configuration

Now we’ll do some final configuration. Some of the steps here are different
between Debian and Ubuntu, but the general theme is the same.

Update /etc/apt/sources.list with the full sources list for your target
distribution. For Debian, use “main contrib non-free”. For Ubuntu, use
“main restricted universe multiverse”. Be sure you are setting up sources.list
for your target distribution, not the host like we did before!

Install packages, but be sure NOT to install grub when it asks - you’ll have
to acknowledge that this will result in a broken system (for now, anyway).
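
A plausible package set (a sketch - names differ a bit between distributions;
on Ubuntu the ZFS packages are zfsutils-linux and zfs-initramfs, and the
kernel is linux-image-generic):

$ apt-get update
$ apt-get install linux-image-amd64 zfs-dkms zfs-initramfs grub-pc cloud-init openssh-server

When grub-pc asks which devices to install to, select none - we’ll run
grub-install by hand below.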

On Debian, also create the symlink for /etc/mtab - there was a bug in ZFS that
relied on using /etc/mtab. We got that bug fixed in Ubuntu by Canonical, but as
of a couple of months ago, Stretch didn’t yet have the fix - it’s probably
fixed in Debian as well by now:
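
$ ln -s /proc/self/mounts /etc/mtab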

On Debian, I found I needed to modify GRUB_CMDLINE_LINUX in /etc/default/grub
with the following. Note the escaped ‘$’:

GRUB_CMDLINE_LINUX="boot=zfs \$bootfs"

This additional step might go away (or already be resolved) with a newer
version of ZFS and grub in stretch. You could (should) probably add this to the
grub.d configuration we add later, rather than here.

Verify grub and ZFS are happy. This is very important. If this step doesn’t
work, there’s no point in continuing - the target will not boot.

$ grub-probe /
zfs

This verifies that grub is able to probe filesystems and devices and has ZFS
support. If this returns an error, the target system isn’t going to boot.

Everything is good, so let’s install grub:

$ grub-install /dev/xvdf

Note we give grub the entire EBS volume of xvdf, not just xvdf1. This is
important (installing to just the GRUB partition will result in a non-booting
system).

Again, if this fails, you’ll need to diagnose why and potentially start over,
as you won’t have a bootable target system.

Now we need to add a configuration file for grub to set a few things. To do
this, create a file “/etc/default/grub.d/50-aws-settings.cfg” along these lines
(the parameters shown are illustrative - tune them for your instances):
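
GRUB_CMDLINE_LINUX_DEFAULT="console=tty0 console=ttyS0,115200 ip=dhcp tsc=reliable net.ifnames=0"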

This will configure grub to log as much as possible to the AWS console, get an
IP address as early as possible, and force TSC (time source) to be reliable (an
obscure boot parameter required for some AWS instance classes). net.ifnames is
set so ethernet adapters are enumerated as ethX instead of ensXX.

Now, let’s update grub:

$ update-grub

You might want to check “/boot/grub/grub.cfg” at this stage to verify that the
zfs module will be loaded and that the boot line looks right (vague advice, I
know).

Finally, set the ZFS cache and reconfigure - these might be unnecessary, but
since this works, I superstitiously don’t skip it :-).
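
Concretely, something like this (the reconfigure target is the ZFS DKMS
package on Debian; your package names may differ):

$ zpool set cachefile=/etc/zfs/zpool.cache rpool
$ dpkg-reconfigure zfs-dkms
$ update-initramfs -u -k all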

Again note that we’ve altered the boot commandline so network devices will be
enumerated as ethX, instead of ensXX.
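
The interface configuration itself is a standard DHCP stanza, added directly
to “/etc/network/interfaces”:

auto eth0
iface eth0 inet dhcp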

Don’t drop this config into “/etc/network/interfaces.d/eth0.cfg” - cloud-init
will blacklist that configuration.

Finally, you may wish to provision and configure a user (cloud-init will set up
a “debian” or “ubuntu” user by default). You may want to give the root user a
secure password and set PermitRootLogin in /etc/ssh/sshd_config, if this is
appropriate for your environment and security policies.

Step 6 - Quiesce Target Volume

Before creating an AMI, we need to exit the chroot, unmount everything, and
export the pool - basically quiesce the target so the volume can be
snapshotted.

Exit the chroot:

$ exit

Now, you should be back in the host instance.

Unmount the bind mounts (we use the lazy option, otherwise unmounts can fail):

$ umount -l /mnt/dev
$ umount -l /mnt/proc
$ umount -l /mnt/sys

And finally, export the ZFS pool.

$ zpool export rpool

Now, “zpool status”, “df”, etc. should show that our target filesystems are
unmounted, and /dev/xvdf is free to be safely cloned. If anything here fails
(unmounting, exporting), the target will not be in a good state and won’t boot.

Step 7 - Snapshot EBS And Create AMI

Now we are all set to create an AMI from our target EBS volume.

In the AWS console, take a snapshot of the target EBS volume - this should take
a minute or two.

Next, also in the AWS console, select the snapshot and register a new AMI. Be
sure to register as HVM and set up ephemeral mappings as you wish. Don’t mess
with kernel ID and other parameters.

Step 8 - Launch And Add Storage

Once registered, launch your shiny new AMI on an instance and enjoy ZFS root
filesystem goodness.

If your instance never comes up, take a look at the console logging available
in the AWS console. This is the only real avenue you have to debug a failed
launch, and it’s very limited. If grub fails, the log might be empty. If
networking fails, the log should have some details, but the instance will not
be reachable.

A very useful debugging technique for AMIs is to terminate the instance, but
don’t destroy the EBS volume - instead, attach the volume to another instance
and import the ZFS pool there. This will allow you to look at logs so hopefully
you can figure out why the boot failed.

If the instance doesn’t come up, you can re-import the ZFS pool on the host
used to stage the target and try to fix it (remember above, I suggested leaving
the host and target EBS volume around so you can iterate on it). Do the bind
mounts before your chroot, and don’t forget to unmount everything and export
the pool before taking another snapshot.

Log in with the “debian” or “ubuntu” user (with the default password), if
provisioned by default cloud-init - or however users are provisioned by
cloud-init if you customized it. Or log in as root if you set the root password
and modified the ssh configuration to allow root login.

Did it work? If so great! If not, give it another try, paying careful attention
to any errors, as well as scouring output of dkms builds, etc. This isn’t
completely straightforward, and it took me a few tries to get things figured
out.

Now, let’s show the power of ZFS by adding 100GB, which will be available
across the entire rpool, without having to fracture filesystems, mount new
storage at its own directory, or move files around to the new device.

Assuming we used a 10GB EBS volume for the AMI, our pool probably looks
something like:
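
$ zpool status rpool
  pool: rpool
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          xvda2     ONLINE       0     0     0

errors: No known data errors

Provision a new 100GB GP2 volume in the AWS console and attach it to the
instance - we’ll assume it appears as /dev/xvdg:

$ sgdisk --zap-all /dev/xvdg
$ sgdisk -n 1:2048:0 -t 1:BF01 -c 1:ZFS /dev/xvdg
$ zpool add rpool /dev/xvdg1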

This partitions the volume with a new GPT table, using everything for ZFS
(again, I don’t like giving ZFS the raw volume, as it will waste a bit of space
when it partitions the volume for Solaris compatibility). Finally, we extend
rpool onto the new volume.

We’ve added 100GB of storage completely transparently, and unlike creating a
traditional EXT or XFS volume we don’t have to mount it into a new directory -
with ZFS the storage is just there, and available to all our ZFS filesystems.

Thanks For Reading

Hope that helps for anyone else looking to run ZFS exclusively in AWS. While
not as easy as taking an off-the-shelf prebuilt AMI, you end up with an AMI
that has only a minimal Debian or Ubuntu install - you know exactly what went
into it, and the process for building it.

If you run into any issues trying this, you can indirectly contact me by
commenting on this blog entry, or try in ##aws on Freenode.

At work, we have a large-scale deployment at AWS on Ubuntu. As a member of
the Performance and Operating Systems Engineering team, I am partially
responsible for building out and stabilizing the base image we use to deploy
our instances. We are currently in the process of migrating to Xenial, the
current Ubuntu LTS release. There’s a lot that has to happen to go from the
foundation image to our deployable image. There are a few manual things, such
as making our AWS AMI bootable on both PV and HVM instance types (we’ve shared
how to do this with Canonical, but they don’t seem too interested, even though
it reduces operational complexity by not having to maintain multiple base
images). The vast majority of building out our image, on the other hand, is
an automated process involving a relatively large and complex chef recipe,
which we keep backward compatible for all versions of Ubuntu we support for
our internal customers.

All this works pretty well in practice, but iterating on a new base AMI, like
we are doing now for Xenial, takes some time as we try different recipes,
update init scripts (systemd is new in Xenial since the last LTS - Trusty),
and various other customizations we do. Making chef recipes idempotent is
difficult and not worth the effort, but that also means it’s not really possible
to re-run after a failed chef recipe. The end-to-end delay in trying out
changes is a fairly long process - we check package source into git, let
jenkins build packages, and kick off our automated AMI build process - which
involves taking our foundation image, chrooting into it, running the chef
recipes, and snapshotting the EBS volume into an AMI. Now, we can finally
launch an EC2 instance on the AMI and see if things worked.

This all takes a fair bit of time when rapidly iterating on our base image,
and I wanted to find a quicker way to try potentially breaking changes. Even
though we deploy on Ubuntu, all my personal and work laptops, desktops, and
servers run base Debian. Lately, I’ve been building out all my filesystems
(except for /boot) with ZFS using zfsonlinux (even on my LUKS/dm-crypt
encrypted laptops).

I’ve used LXC a fair bit in the past when needing to do cross-distribution
builds - and I’ve used BTRFS snapshots to make cloning containers fast and
space efficient. ZFS also supports copy-on-write, and is natively supported by
LXC on Debian Jessie, so this seemed like a good approach - and it is!

I’ve been using this method to iterate quickly on our recipes. I have a base
xenial image that I can clone and start in a few seconds to begin again from
scratch. I can also snapshot a container at any point in the process so
that I can repeat and retry what would otherwise not be idempotent.

Some of the ZFS integration in LXC is not well documented, so here’s some
rough steps on how I’m doing this on my work desktop, to help anyone else
trying to figure this out.

I started with a single ZFS pool called “pool0” with several filesystems. The
only one LXC cares about here is a dataset dedicated to containers - I use
pool0/lxc, created like so:
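
$ sudo zfs create pool0/lxc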

The “zfsroot” option is important - without it, LXC doesn’t know what pool
or filesystem to use (it defaults to ‘tank/lxc’).
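
Creating the base container then looks something like this (a sketch - the
download template is one way to fetch a Xenial root filesystem, and -B zfs
selects the ZFS backing store):

$ sudo lxc-create -n xenial -B zfs --zfsroot=pool0/lxc -t download -- -d ubuntu -r xenial -a amd64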

At this point, we have a working Xenial container - before starting it I
manually edited /var/lib/lxc/xenial/rootfs/etc/shadow, removing the passwords
for the “root” and “ubuntu” users. I then launch the container, log in through
the console, and change the passwords for both users. Then, I install
openssh-server and stop the container - this is my base that I can now clone.
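
Cloning the base into a disposable container is then a snapshot clone (the -s
flag), which is nearly instant:

$ sudo lxc-clone -s -o xenial -n try
$ sudo lxc-start -n try -d
$ sudo zfs list -r pool0/lxc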

You can see that each container is in its own ZFS copy-on-write volume. I
can easily clone and destroy containers now without going through a full
build, bake, and deploy process.

Here are a couple more hints - if you have trouble connecting to the LXC
console before openssh and networking are enabled, make sure you are connecting
to the console tty (for Xenial, I was otherwise getting tty1, which has no getty):

$ sudo lxc-console -n try -t 0

Finally, by default, LXC containers will not be set up with networking. It’s
easy to supply an “/etc/lxc/default.conf” to resolve this - mine is the classic
veth/lxcbr0 setup:
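
lxc.network.type = veth
lxc.network.link = lxcbr0
lxc.network.flags = up
lxc.network.hwaddr = 00:16:3e:xx:xx:xx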

Note: This post has been updated since discovering this is NOT an
Apache issue, and it turns out to entirely be a problem in the request
processing framework of the application Apache is proxying requests to.
Some frameworks follow old CGI specs that prohibit hyphens (“-”) in
request header names. Apache is passing along both its header and the
client-generated headers, but the proxied framework converts “-” to “_”,
which results in a map/dictionary key collision.

A fairly common use-case for this is to pass TLS/SSL headers to a proxied
backend service when TLS termination is done in Apache. Imagine a case
where client certificates are optional but the backend uses information
from the certificate, such as the DN, or just validating if a client
certificate was used.

Let’s take that last case as an example to illustrate this security risk,
where we wish to pass along the SSL_CLIENT_VERIFY Apache variable to a
backend, indicating that a client certificate was successfully used and
validated. A common, but insecure configuration (which you’ll find in many
guides and blogs if you search) is to do this:
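
RequestHeader set SSL_CLIENT_VERIFY "%{SSL_CLIENT_VERIFY}s" # Don't do this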

This directive will add the header “Ssl-Client-Verify” to the request passed
to the backend service, however this header can be overridden and spoofed by
a client!

Instead, use the following configuration, which is not vulnerable to header
forgery:

RequestHeader set SSLCLIENTVERIFY "%{SSL_CLIENT_VERIFY}s" # Do this
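
Using hyphens is also safe, because RequestHeader set replaces any
client-supplied header with the same (case-insensitive) name:

RequestHeader set SSL-CLIENT-VERIFY "%{SSL_CLIENT_VERIFY}s" # Or this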

Some request processing frameworks follow an old CGI specification that
prohibits “-“ in header names and convert these to “_”, so to prevent
a client from using a map/dictionary key collision to spoof headers, avoid
the use of these characters entirely.

Here’s an example of header forgery, where we can easily override the Apache
generated headers when specified like the “Don’t do this” case above:

$ curl --header "Ssl-Client-Verify: SPOOFED" -i https://my.site.foo

With a valid client certificate, we can still attempt the same override of the
Apache generated header (the certificate file names below are illustrative):
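
$ curl --cert client.pem --key client.key --header "Ssl-Client-Verify: SPOOFED" -i https://my.site.foo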

The resulting output will show that the client was able to override the
Apache header if underscores are used in the RequestHeader directive:

Ssl-Client-Verify: SPOOFED

Whereas using either the second or third form, where dashes are used
instead of underscores, the client cannot spoof the header:

Ssl-Client-Verify: SUCCESS

Or if client certificates are optional and none was provided:

Ssl-Client-Verify: (null)

This vulnerability happens if the client passes a header that
matches the final header of “Ssl-Client-Verify” (case doesn’t matter,
so a spoofed header of “SSL-CLIENT-VERIFY” will result in header forgery).
Passing a header of “SSL_CLIENT_VERIFY” from the client will not result in
a spoofed header, potentially giving a false sense of security in testing.

The security risk is pretty clear - a misconfigured Apache and backend
request processing framework that munges header names can result in
clients spoofing headers such that a proxied service incorrectly thinks
authentication or authorization has been confirmed when indeed it has not.

I switched this blog over to the Isso
commenting system from Disqus, and added support for Isso to my popular
Jekyll theme jekyll-clean. It
was always a bit of a battle getting Disqus to work right -
I had quite a few comments that would not show up, and just
logging into Disqus doesn’t work right if you use privacy blockers
like I do (Privacy Badger,
Ublock Origin, and HTTPS
Everywhere, for those interested; these are all worthwhile browser extensions
to use). There were always some questions about what Disqus does with data,
as well.

Isso is self-hosted, which means you can’t directly use it on static
webhosting such as github pages, and while your data is arguably no
more safe on someone’s random self-hosted blog (such as this one!), Isso
allows anonymous comments - so people only have to provide as much detail
as they wish. For people who want to demand it, you can make the email
and name fields mandatory, but there’s no verification so in practice
there’s not much point (when I come across comment forms that require
an email I always give a fake one).

We’ll see if spam is an issue - Isso has a basic moderation system. That’s
one benefit of hosted solutions such as Disqus - they have a shared
knowledge about spammers and can make some reasonable attempts to control
it, along with requiring you to create an account (with the obvious downside
being the lack of anonymous comments I mentioned above).

So, in the end, it’s not a clear choice, and everyone has to choose what
matters most to them - there are a few other options other than Isso as
well, but I liked the fact that Isso is small and simple, written in
Python, and uses sqlite for storage. There’s not much to go wrong nor
much attack surface for abuse.

Integrating Isso with Jekyll is pretty easy, you can take a look at
jekyll-clean to see how I
approached it.
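
The embed itself boils down to two elements in the post layout (the Isso
server URL below is a placeholder for wherever you host it):

<script data-isso="//comments.example.com/" src="//comments.example.com/js/embed.min.js"></script>
<section id="isso-thread"></section>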

On the topic of Jekyll for blogs - I switched over to Jekyll for this
blog about a year and a half ago and don’t regret it for a moment. It’s simple,
easy to modify and theme, and super fast.