Thursday, January 11, 2018

Avoiding data disasters with Sanoid

Sanoid
helps to recover from what I like to call "Humpty Level Events." In
other words, it can help you put Humpty Dumpty back together again, on
ZFS filesystems.

Humpty Dumpty sat on the wall,
Humpty Dumpty had a great fall.
All the King's horses and all the King's men
Couldn't put Humpty together again.

As a child, long before I read Lewis Carroll's books, I knew this
snatch of doggerel from Mother Goose's nursery rhymes by heart. Humpty
Dumpty's fall is probably the best-known nursery rhyme in the English
language. Why is this simple verse so popular?
It outlines a fear, and an experience, common to everyone: some
seminal, horrible event happened, and in the space of a moment there was
no going back. What had been order became chaos, and there was no way
to restore it. It sucks, but it's a basic part of the human
experience; you can't put an egg back into its shell, you can't unsay
the mean thing you said to your friend, and you can't undo the horrible
mistake you made on your computer.
Maybe you clicked the wrong link or the wrong email attachment and a
cryptomalware payload executed. Or maybe a bad system update came in
from your operating system vendor and bricked the boot process for your
machine. (I haven't actually seen this particular thing happen under
Linux, but it happens with depressing regularity to those of us who
manage enough Windows servers.) Perhaps a mission-critical application
needed an upgrade, and the vendor emailed you a 150-page PDF with
instructions, and it all went south on page 75. Heck, maybe you paid the
vendor to do the upgrade and it all went south on page 75.
Like most of the computing experience, none of these things are truly
new. They're all just rewritings of Humpty's parable. Entropy happens.

You can't unscramble the egg. (Or can you?)

Humpty Dumpty sat on the wall, Humpty Dumpty had a great fall.
The sysadmin did a rollback! Humpty Dumpty sat on the wall...

If you're a *nix person, rm -rf / is as apocryphal a tale as Humpty's fall itself. You may never have done it yourself, or even seen it done in person. We've all at least heard the stories, though, and cringed at the thought. GNU rm even added a special argument, --no-preserve-root,
to try to make it a little more difficult for fast, clumsy fingers to
wipe out the system! That still doesn't stop you from accidentally
nuking all sorts of important things that aren't root, though: /bin, /var, /home... you name it. (I accidentally destroyed /etc on an important system once. Once. And let us never speak of it again.)
In the most prosaic sense, Sanoid is a snapshot
management framework. It takes snapshots of ZFS filesystems, it
monitors their presence, and it deletes them when they should go away.
You feed it a policy, such as "for this dataset and all its children,
take a snapshot every hour, every day, and every month. Keep 30
hourlies, 30 dailies, and 3 monthlies", and it makes that happen for
you.
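To make that policy concrete, here is a minimal sketch of the retention rule in Python. It is not Sanoid's own code (Sanoid is not written in Python); the function name, the data layout, and the 48-hour history are all assumptions made up for illustration. It only shows the "keep the newest N of each kind, prune the rest" idea, using the numbers quoted above.

```python
from datetime import datetime, timedelta

# Policy from the text: keep 30 hourlies, 30 dailies, and 3 monthlies.
POLICY = {"hourly": 30, "daily": 30, "monthly": 3}


def split_keep_prune(snapshots, policy=POLICY):
    """Split (taken_at, kind) pairs into (keep, prune) lists.

    For each kind, the newest `policy[kind]` snapshots are kept and the rest
    are marked for pruning. Kinds missing from the policy are left alone.
    """
    keep, prune = [], []
    by_kind = {}
    for taken_at, kind in snapshots:
        by_kind.setdefault(kind, []).append(taken_at)
    for kind, times in by_kind.items():
        times.sort(reverse=True)  # newest first
        limit = policy.get(kind, len(times))
        keep += [(t, kind) for t in times[:limit]]
        prune += [(t, kind) for t in times[limit:]]
    return keep, prune


# Hypothetical history: 48 hourly snapshots, the newest taken at noon.
now = datetime(2018, 1, 11, 12, 0)
history = [(now - timedelta(hours=h), "hourly") for h in range(48)]
keep, prune = split_keep_prune(history)
print(len(keep), "kept,", len(prune), "pruned")
```

With 48 hourlies on hand and a limit of 30, the 18 oldest are the ones that get pruned.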
But forget the prose, and let me get a little poetic with you for a moment: Sanoid's real purpose is to rewrite the tale of Humpty's fall.
I used to get a feeling of existential dread when I'd see certain clients' names on my caller ID. There were days when I spent hours trying to pull arcane rabbits out of my hat to rescue broken systems in-place, fielding unanswerable "how much longer?"
questions from anxious users, wondering when it was time to abandon the
in-place rescue and begin the laborious restore-from-backup.
Of course, the only thing worse than "we accidentally borked the whole server" is, after you've finished
your restore process, hearing that plaintive cry: "Where is
SuperFooBarApp? It's mission critical!" ... and SuperFooBarApp is
something the client installed themselves, six months ago, and
never told you about. And it was outside the scope of your backup
process, and now it's. Just. Gone.
Sanoid was the thing I built out of sheer desperation, to make all of that stop happening. And it works! By doing the Real Work™
on virtual machines which are being snapshotted hourly, and keeping the
underlying bare metal clean as a whistle, running nothing but Sanoid
itself, there is no such thing as a Humpty Level Event any more.
Cryptomalware incursion? Rollback. Rogue (or even malicious!) user
deleting giant swathes of data? Rollback. Bad system updates hosed the
machine? Rollback. Client stealth-installed SuperFooBarApp six months
ago in some squirrely location using some squirrely back-end db engine
you've never heard of? Doesn't matter; the snapshots are
whole-disk-image, it's on there.
In super technical business-y planning terms, using Sanoid makes my recovery point objective (RPO) 59 minutes or less, and my recovery time objective (RTO)
roughly 60 seconds from the time it takes me to get to a keyboard. In
less technical person-who-has-to-fix-it terms, it means I can always
make the client happy, and it means that my day got a whole lot more predictable!
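The 59-minute figure falls straight out of hourly snapshots, and a rough sketch shows why. The code below assumes snapshots land exactly on the hour and uses made-up dataset and snapshot names; it picks the newest snapshot that predates an incident, measures how much work is lost, and builds (without running) a `zfs rollback` command for it.

```python
from datetime import datetime, timedelta


def pick_rollback_target(snapshot_times, incident_at):
    """Return the newest snapshot taken strictly before the incident."""
    earlier = [t for t in snapshot_times if t < incident_at]
    if not earlier:
        raise ValueError("no snapshot predates the incident")
    return max(earlier)


def rollback_command(dataset, snapshot_name):
    """Build (but do not run) a `zfs rollback` argv list.

    -r also destroys any snapshots newer than the target, which ZFS
    requires before it will roll back past them.
    """
    return ["zfs", "rollback", "-r", f"{dataset}@{snapshot_name}"]


# Hypothetical hourly snapshots taken on the hour, 00:00 through 13:00.
day = datetime(2018, 1, 11)
hourlies = [day + timedelta(hours=h) for h in range(14)]

# Worst case: disaster strikes just before the next snapshot is due.
incident = datetime(2018, 1, 11, 13, 59)
target = pick_rollback_target(hourlies, incident)
lost = incident - target
print("roll back to", target, "losing", lost)

# The dataset and snapshot names here are invented for illustration only.
cmd = rollback_command("data/images/clientvm", target.strftime("hourly_%Y-%m-%d_%H%M"))
print(" ".join(cmd))
```

The worst case is an incident one minute before the next snapshot, which is where "59 minutes or less" comes from.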

Configuring Sanoid

All you need is one single-line cron job and a simple, easy-to-read TOML configuration file.

With those in place, Sanoid will apply the
policies you've set easily, neatly, and predictably everywhere you've
asked it to. The first module definition in my configuration covers all nine of the VMs
currently on my workstation, and will automatically pick up any new VMs I create (as long as I place them under /images).

Not a whole lot to set up, and better yet, not much to forget
when new things inevitably get created later! There is still a missing
piece to this puzzle, though. What if banshee, the local machine itself,
catches on fire?

Look, Humpty didn't just get sick—he broke!

So far, we've been assuming that the hardware underneath the VM stays
healthy. Unfortunately, that isn't always the case. Snapshots are great
for recovering from soft failures—basically, disasters that
happen via software, or users interacting with software. But if you lose
the storage hardware, the snapshots go with it. And if you lose the
machine running the hardware, you're down for hours, maybe even a day or
two, waiting for replacements.
Since our goal is to get rid of all the Humpty Level Events, we need to plan for hard
failures, too. Hard drives died. The power supply died, and we're out
of town and a project is due tonight. Somebody stored food in the server
room, and a moth infestation shorted across components on the
motherboard. (Laugh it up - that happened to a client this year!)
It can get worse than that, too—what about whole-site disasters? The
fire sprinklers came on in the server room. The fire sprinklers didn't come on in the server room, and now the whole building's gone... you get the idea.
So we want snapshots, but we want them on more than one machine, and we want them in more than one place, too. This is where Syncoid comes in. Syncoid uses filesystem-level snapshot replication to move data from one machine to another, fast. For enormous blobs like virtual machine images, we're talking several orders of magnitude faster than rsync.
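Under the hood, snapshot replication on ZFS comes down to `zfs send` on one side and `zfs receive` on the other. The sketch below is not Syncoid's implementation, which does a good deal more; it only assembles the kind of incremental pipeline involved, as a string, without running anything. The host, dataset, and snapshot names are hypothetical.

```python
import shlex


def replication_pipeline(dataset, last_common, newest, target_host, target_dataset):
    """Return a shell pipeline string for an incremental ZFS replication.

    `zfs send -I a b` emits a stream containing every snapshot from a to b;
    `zfs receive` applies that stream on the receiving side.
    """
    send = ["zfs", "send", "-I", f"{dataset}@{last_common}", f"{dataset}@{newest}"]
    # The remote command travels as a single quoted argument to ssh.
    remote = " ".join(shlex.quote(a) for a in ["zfs", "receive", target_dataset])
    recv = ["ssh", target_host, remote]
    return " | ".join(" ".join(shlex.quote(a) for a in cmd) for cmd in (send, recv))


# All names below are invented for illustration.
pipeline = replication_pipeline(
    dataset="data/images/vm",
    last_common="hourly-12",   # newest snapshot both machines already have
    newest="hourly-13",        # newest snapshot on the source
    target_host="backuphost",
    target_dataset="backup/images/vm",
)
print(pipeline)
```

Only the blocks that changed between the two snapshots cross the wire, which is why this approach does not need to scan or compare the whole image the way a file-level tool does.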
If that isn't cool enough already, you don't even necessarily need to restore
from backup if you lost the production hardware; you can just boot up
the VM directly on the local hotspare hardware, or the remote disaster
recovery hardware, as appropriate. So even in case of catastrophic hardware failure, you're still looking at that 59m RPO, <1m RTO.

Snapshot replication makes it not only possible, but easy
to replicate multiple-terabyte VM images hourly over a local network,
and daily over a VPN. We're not talking enterprise 100Mbps symmetrical
fiber, either. Most of my clients have 5Mbps or less available for
upload, which doesn't keep them from automated, nightly over-the-air
backups, usually to a machine sitting quietly in an owner's house.
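Some back-of-the-envelope arithmetic shows why a slow uplink is still workable when only changed data has to travel. The sketch below assumes an 8-hour overnight window (my assumption, not a figure from any client) and ignores protocol overhead and compression.

```python
def uploadable_gigabytes(link_mbps, hours):
    """Decimal gigabytes a link can move in `hours` at `link_mbps` megabits/s."""
    megabits = link_mbps * 3600 * hours
    return megabits / 8 / 1000  # megabits -> megabytes -> gigabytes


def hours_to_upload(changed_gb, link_mbps):
    """Hours needed to push `changed_gb` decimal gigabytes over the link."""
    return changed_gb * 1000 * 8 / link_mbps / 3600


# A 5Mbps uplink over a hypothetical 8-hour night.
nightly_budget_gb = uploadable_gigabytes(5, 8)
print(f"{nightly_budget_gb:.1f} GB per night")

# How long would a hypothetical 4 GB of changed blocks take?
print(f"{hours_to_upload(4, 5):.2f} hours")
```

The multiple-terabyte image itself only has to make the trip once; after that, the nightly run only needs to fit the day's changes into that budget.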

Preventing your own Humpty Level Events

Sanoid is open source software, and so are all its dependencies. You
can run Sanoid and Syncoid themselves on pretty much anything with ZFS. I
developed it and use it on Linux myself, but people are using it (and I
support it) on OpenIndiana, FreeBSD, and FreeNAS too.
You can find the GPLv3 licensed code on the website (which actually just redirects to Sanoid's GitHub project page), and there's also a Chef Cookbook and an Arch AUR repo available from third parties.