Sunday, 25 March 2012

How to do dirt-cheap, cloud-based, encrypted backups (Part 2)

In my previous post, I described the method I used to store data online. I referred to what I was doing as a "backup", but it would probably have been more accurate to call it a near-real-time off-site mirror. In this post, I cover the pitfalls of that system and describe my much-improved latest technique.

Firstly, using s3fs, encfs and lsyncd worked well for small data sets of several megabytes, but when scaled up to 10GB of source code, uni assignments and random other files, the round-trip time to S3 and all the overheads in the system really start to add up. My internet connection should theoretically be able to upload that data in about 24 hours. At the rate it was running when I stopped it, it would have taken 2-3 weeks!

Secondly, there is the issue of download necessity. I will rarely, if ever, want to access the data I am uploading. It is, in all senses of the word, a backup. 99.9% of the time I shouldn't care about reading files and 100% of the time I shouldn't have to care about individual files.

Thirdly, I don't like being bound to S3. At some stage in the future, a geographically dispersed group of friends and I will all contribute to a pool of disks and provide our own hosting for backups.

With all that in mind, my latest strategy is somewhat simpler. It involves using ZFS for a storage filesystem and zfs send + cron for incremental backups.

Version 1.5: tar + cron
It's probably worth mentioning that old-school GNU tar can also replace zfs send here. The --listed-incremental flag for tar makes it trivial to do incremental backups, and the archives can be encrypted via openssl (or whatever tool you like) and uploaded along with the incremental snapshot ("snar") state file for super-trivial backups. No need for encfs. No need for lsyncd. You can also trivially make use of compression to get the most out of your uplink. Something like:
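For instance, a hedged sketch of that tar + openssl pipeline — the directories, filenames and passphrase below are illustrative assumptions (using temp dirs so it can be run safely), not the author's actual setup:

```shell
#!/bin/bash
# Hedged sketch of a tar-based incremental backup; paths, filenames and the
# passphrase are illustrative assumptions. Temp dirs stand in for real data.
set -e

SRC=$(mktemp -d)          # stand-in for the real data directory
DEST=$(mktemp -d)         # stand-in for the upload staging area
SNAR="$DEST/backup.snar"  # tar's incremental state ("snar") file
echo "important data" > "$SRC/notes.txt"

# First run is a full (level 0) backup; later runs with the same snar file
# archive only what changed since.
tar --listed-incremental="$SNAR" -czf "$DEST/backup-full.tar.gz" -C "$SRC" .

echo "new work" >> "$SRC/notes.txt"
tar --listed-incremental="$SNAR" -czf "$DEST/backup-incr.tar.gz" -C "$SRC" .

# Encrypt before uploading; symmetric AES-256 via openssl is one option.
echo "correct horse battery staple" > "$DEST/passphrase"
openssl enc -aes-256-cbc -salt -pbkdf2 \
    -in "$DEST/backup-incr.tar.gz" \
    -out "$DEST/backup-incr.tar.gz.enc" \
    -pass "file:$DEST/passphrase"
# The .enc archives plus the snar file are what get uploaded off-site.
```

Gzipping before encrypting matters here: encrypted data doesn't compress, so the order of the two steps decides how much uplink you save.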

I started down this approach but the lack of local snapshots didn't make me happy. This system solves the "house burned down while I was out" scenario but not the "sleep-deprived programmer deletes his last 12 hours of work by mistake" scenario.

Version 2.0: ZFS to the rescue!
I've used ZFS before with great success. It's an excellent filesystem that makes snapshots, de-duplication, compression, raid(ish), and multi-device configurations much easier to deal with. The only issue I had with it is that I use linux at home, not Solaris / FreeBSD, and the FUSE version of ZFS is not well maintained and has FUSE-related performance bottlenecks. Given I have very limited space for hardware where I live, I was contemplating a complex FreeBSD-based xen dom0 host running an NFS-exported ZFS filesystem, with a linux domU that I'd use for day-to-day computing. The fact that I was considering such a complex mess shows how desperate I was to find a solution. In any case, it was around this time that I stumbled across the wonderful zfsonlinux project, which seems to have resolved the legal issues preventing ZFS integration with the linux kernel! From their website:

The ZFS code can be modified to build as a CDDL licensed kernel module which is not distributed as part of the Linux kernel. This makes a Native ZFS on Linux implementation possible if you are willing to download and build it yourself.

Great! So I did. And its performance is quite impressive! I get the throughput I'd expect from FreeBSD or Solaris (50+MB/sec for a single drive, and slightly less than double that for two drives) and none of the CPU bottleneck issues I had with the FUSE version years ago. So, with a stable ZFS available for linux, I threw all the terabytes of storage I could find into my desktop and set about copying over my data. Now I can set up all the periodic local snapshots I want!
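As a concrete illustration of the features mentioned above, here is a hedged sketch of creating a mirrored pool with compression and de-duplication enabled. The pool name "tank" and the device names are assumptions, and the commands are only printed here (pipe the output to a root shell to actually apply them):

```shell
#!/bin/bash
# Hypothetical pool setup; the pool name "tank" and the devices
# /dev/sdb and /dev/sdc are assumptions, not the author's hardware.
# Commands are collected and printed so this sketch runs safely.
SETUP=$(cat <<'EOF'
zpool create tank mirror /dev/sdb /dev/sdc
zfs set compression=on tank
zfs set dedup=on tank
zfs snapshot tank@initial
EOF
)
echo "$SETUP"
```

Note that dedup in particular trades RAM for disk — the dedup table wants to live in memory — so it's worth checking pool size against available RAM before turning it on.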

To perform my snapshotting and upload my daily deltas, I've written a small shell script as follows:

#!/bin/bash
#
# Triggers rolling periodic snapshots of ZFS filesystems
#
# Mode (first argument) can be one of DAY, HOUR, MINUTE.
# The mode dictates the actual operations performed.
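The body of the script isn't reproduced here, but a minimal sketch of rolling snapshots under the same DAY/HOUR/MINUTE scheme might look like the following. The pool name "tank" and the retention counts are assumptions, and zfs defaults to a dry-run echo so the sketch can be run safely:

```shell
#!/bin/bash
# Minimal rolling-snapshot sketch; NOT the author's actual script.
# "tank" and the retention counts are assumptions. ZFS defaults to a
# dry-run echo here; set ZFS=zfs to actually snapshot.
ZFS=${ZFS:-echo zfs}
FS=${FS:-tank}
MODE=${1:-DAY}            # DAY, HOUR or MINUTE

case "$MODE" in
    DAY)    KEEP=7  ;;    # a week of dailies
    HOUR)   KEEP=24 ;;    # a day of hourlies
    MINUTE) KEEP=60 ;;    # an hour of minutelies
    *) echo "usage: $0 DAY|HOUR|MINUTE" >&2; exit 1 ;;
esac

SNAP="$FS@$MODE-$(date +%Y%m%d-%H%M%S)"
$ZFS snapshot "$SNAP"

# Roll off snapshots of this mode beyond the retention count, oldest first.
$ZFS list -t snapshot -o name -s creation 2>/dev/null \
    | grep "^$FS@$MODE-" \
    | head -n -"$KEEP" \
    | while read -r old; do
        $ZFS destroy "$old"
      done
```

Driven from cron with one entry per mode, something like this gives the periodic local snapshots described above.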

The check_zpool.sh script just emails me if zpool status -x returns anything other than "all pools are healthy".

That's it! Now that it's running, I don't know why it's taken me so long to set something like this up!

It's also worth mentioning the other niceties we can potentially get with this setup. If we want to keep a filesystem synchronized with a friend, we can use ssh and cron to push filesystem deltas every so often to a remote read-only copy of our filesystem using zfs send/receive! I'll probably give that a go at some stage soon as a means of sharing family photos with my parents and siblings, and post about it here.
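A hedged sketch of what that periodic delta push might look like — the hostname, dataset and snapshot names are all assumptions, and the pipeline is printed rather than executed:

```shell
#!/bin/bash
# Hypothetical incremental push of a photos dataset to a friend's machine.
# Dataset, snapshot names and the remote host are assumptions. Printed as
# a dry run; eval "$CMD" would execute it (root, or delegated permissions
# via "zfs allow", is needed on both ends).
FS=tank/photos
PREV=daily-2012-03-24
CURR=daily-2012-03-25
REMOTE=friend@backup.example.org

# "zfs send -i" streams only the delta between the two snapshots; the
# remote "zfs receive" applies it to the mirror copy.
CMD="zfs send -i $FS@$PREV $FS@$CURR | ssh $REMOTE zfs receive tank/photos-mirror"
echo "$CMD"
```

Run from cron right after the daily snapshot is taken, this keeps the remote copy at most one day behind the local filesystem.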