Backups are one of those things that are important, but that a lot of people don’t do. The thought of setting up backups always raised a mental barrier for me for a number of reasons:

I have to think about where to backup to.

I have to remember to run the backup on a periodic basis.

I worry about the bandwidth and/or storage costs.

I still remember the days when a 2.5 GB harddisk was considered large, and when I had to spend a few hours splitting MP3 files and putting them on 20 floppy disks to transfer them between computers. Backing up my entire harddisk would have cost me hundreds of dollars and hours of time. Because of this, I tend to worry about the efficiency of my backups. I only want to back up things that need backing up.

I tended to tweak my backup software and rules to be as efficient as possible. However, this made setting up backups a total pain, and made it very easy to procrastinate on backups… until it was too late.

I learned to embrace Moore’s Law

Times have changed. Storage is cheap, very cheap. Time Machine — Apple’s backup software — taught me to stop worrying about efficiency. Backing up everything not only makes backing up a mindless and trivial task, it also makes me feel safe. I don’t have to worry about losing my data anymore. I don’t have to worry that my backup rules missed an important file.

Backing up desktops and laptops is easy and cheap enough. A 2 TB harddisk costs only $100.

What about servers?

Most people can’t go to the data center and attach a hard disk. Buying or renting another harddisk from the hosting provider can be expensive. Furthermore, if your backup device resides in the same location as the data center, then destruction of the data center (e.g. by a fire) will destroy your backup as well.

Backup services provided by the hosting provider can be expensive.

Until a few years ago, bandwidth was relatively expensive, making backing up the entire harddisk to a remote storage service an unviable option for those with a tight budget.

And finally, do you trust that the storage provider will not read or tamper with your data?

Enter Duplicity and S3

Duplicity is a tool for creating incremental, encrypted backups. “Incremental” means that each backup only stores data that has changed since the last backup run. This is achieved by using the rsync algorithm.

What is rsync? It is a tool for synchronizing files between machines. The cool thing about rsync is that it only transfers changes. If you have a directory with 10 GB of files, and your remote machine has an older version of that directory, then rsync only transfers new files or changed files. Of the changed files, rsync is smart enough to only transfer the parts of the files that have changed!
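The block-matching idea can be sketched in a few lines of Python. This is a toy illustration only, not rsync’s actual implementation (real rsync uses a rolling weak checksum plus a strong hash so it can cheaply test for matches at every byte offset):

```python
# Toy illustration of the block-matching idea behind the rsync algorithm.
# We checksum fixed-size blocks of the old file, then scan the new file for
# blocks the receiver already has; only the remaining bytes need to be sent.
import hashlib

BLOCK = 4  # unrealistically small block size, for demonstration


def signatures(old: bytes) -> dict:
    """Map the checksum of each fixed-size block to its offset in the old data."""
    return {hashlib.md5(old[i:i + BLOCK]).hexdigest(): i
            for i in range(0, len(old), BLOCK)}


def delta(old: bytes, new: bytes) -> list:
    """Instructions to rebuild `new` from `old`: ('copy', offset) reuses a
    block the receiver already has; ('literal', b) sends a byte verbatim."""
    sigs = signatures(old)
    out, i = [], 0
    while i < len(new):
        chunk = new[i:i + BLOCK]
        key = hashlib.md5(chunk).hexdigest()
        if len(chunk) == BLOCK and key in sigs:
            out.append(('copy', sigs[key]))       # receiver already has this block
            i += BLOCK
        else:
            out.append(('literal', new[i:i + 1]))  # must transfer this byte
            i += 1
    return out


print(delta(b"abcdefgh", b"abcdXXefgh"))
# → [('copy', 0), ('literal', b'X'), ('literal', b'X'), ('copy', 4)]
```

Only the two inserted bytes are sent as literals; everything else is reconstructed from blocks already present on the receiving side.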

At some point, Ben Escoto authored the tool rdiff-backup, an incremental backup tool which uses an rsync-like algorithm to create filesystem backups. Rdiff-backup also saves metadata such as permissions, owner and group IDs, ACLs, etc. Rdiff-backup stores past versions as well and allows easy rollback to a point in time. It even compresses backups. However, rdiff-backup has one drawback: you have to install it on the remote server as well. This makes it impossible to use rdiff-backup to back up to storage services that don’t allow running arbitrary software.

Ben later created Duplicity, which is like rdiff-backup but encrypts everything. Duplicity works without needing special software on the remote machine and supports many storage methods, for example FTP, SSH, and even S3.

On the storage side, Amazon has consistently lowered the prices of S3 over the past few years. The current price for the us-west-2 region is only $0.09 per GB per month.

Bandwidth costs have also dropped tremendously. Many hosting providers these days allow more than 1 TB of traffic per month per server.

This makes Duplicity and S3 the perfect combination for backing up my servers. Using encryption means that I don’t have to trust my service provider. Storing 200 GB only costs $18 per month.

Setting up Duplicity and S3 using Duply

Duplicity in itself is still a relative pain to use. It has many options — too many if you’re just starting out. Luckily there is a tool which simplifies Duplicity even further: Duply. It keeps your settings in a profile, and supports pre- and post-execution scripts.

Let’s install Duplicity and Duply. If you’re on Ubuntu, you should add the Duplicity PPA so that you get the latest version. Otherwise, you can install an older version of Duplicity from your distribution’s repositories.
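On Ubuntu, the installation could look like this (the PPA name below is the one historically published by the Duplicity maintainers; verify it before relying on it):

```shell
# Add the Duplicity PPA and install both tools
sudo add-apt-repository ppa:duplicity-team/ppa
sudo apt-get update
sudo apt-get install duplicity duply

# Create a new Duply profile named "test"
duply test create
```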

This will create a configuration file in $HOME/.duply/test/conf. Open it in your editor. You will be presented with a lot of configuration options, but only a few are really important, among them GPG_KEY and GPG_PW. Duplicity supports asymmetric public-key encryption as well as symmetric password-only encryption. For the purposes of this tutorial we’re going to use symmetric password-only encryption because it’s the easiest.
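After choosing a password, the relevant part of the profile could end up looking something like this (the bucket name, region endpoint and AWS keys are placeholders; consult the comments in the generated conf file for the exact variable names your Duply version uses):

```
# ~/.duply/test/conf (excerpt; placeholder values)
GPG_PW='your-secret-passphrase'   # leave GPG_KEY unset for symmetric encryption

TARGET='s3://s3-us-west-2.amazonaws.com/your-backup-bucket/myserver'
TARGET_USER='YOUR_AWS_ACCESS_KEY_ID'
TARGET_PASS='YOUR_AWS_SECRET_ACCESS_KEY'

SOURCE='/'
```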

Setting up periodic incremental backups with cron
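Assuming the profile is named main and the Duply profiles live under /home/admin/.duply, the crontab entry for root could look like this (paths and schedule are examples):

```
# In root's crontab (sudo crontab -e)
HOME=/home/admin
0 2 * * 0  duply main backup
```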

This line runs the duply main backup command every Sunday at 2:00 AM. The job runs as root because the crontab belongs to root, but the Duply profiles are stored in /home/admin/.duply, which is why we set the HOME environment variable to /home/admin.

Making cron jobs less noisy

Cron has a nice feature: it emails you with the output of every job it has run. If you find that this gets annoying after a while, then you can make it only email you if something went wrong. For this, we’ll need the silence-unless-failed tool, part of phusion-server-tools. This tool runs the given command and swallows its output, unless the command fails.
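The wrapped cron entry could look like this, assuming phusion-server-tools is checked out to /tools (adjust the path to wherever you installed it):

```
HOME=/home/admin
0 2 * * 0  /tools/silence-unless-failed duply main backup
```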

Restoring a backup

Simple restores

You can restore the latest backup with the Duply restore command. It is important to use sudo because this allows Duplicity to restore the original filesystem metadata.

The following will restore the latest backup to a specific directory. The target directory does not need to exist; Duplicity will automatically create it. After restoration, you can move its contents to the root filesystem using mv.

sudo duply main restore /restored_files

You can’t just do sudo duply main restore / here, because your system files (e.g. bash, libc) are in use.

Moving the files from /restored_files to / using mv might still not work for you. In that case, consider booting your server from a rescue system and restoring from there.

Restoring a specific file or directory

Use the fetch command to restore a specific file. The following restores the /etc/passwd file from the backup and saves it to /home/admin/passwd. Notice the lack of a leading slash in the etc/passwd argument.

sudo duply main fetch etc/passwd /home/admin/passwd

The fetch command also works on directories:

sudo duply main fetch etc /home/admin/etc

Restoring from a specific date

Every restoration command accepts a date, allowing you to restore your files as they existed on that specific date.
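For example, to restore the latest backup from before May 1st, 2013 (Duplicity also accepts relative ages such as 3D for "three days ago"):

```
sudo duply main restore /restored_files '2013-05-01'
```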

Safely store your keys or passwords!

Whether you use asymmetric public-key encryption or symmetric password-only encryption, you must store your key or password safely! If you ever lose it, you will lose your data. There is no way to recover encrypted data for which the key or password is lost.

My preferred way of storing secrets is to store them inside 1Password and to replicate the data to my phone and tablet so that I have redundant encrypted copies. Alternatives to 1Password include LastPass and KeePass, although I have no experience with those.

Conclusion

With Duplicity, Duply and S3, you can set up cheap and secure automated backups in a matter of minutes. For many servers this combo is the silver bullet.

One thing that this tutorial hasn’t dealt with is database backups. While we’re backing up the database’s raw files, that by itself isn’t a good idea. If the database files were being written to at the time the backup was made, then the backup will contain potentially irrecoverably corrupted database files. Even the database’s journaling file or write-ahead log won’t help, because those technologies are designed only to protect against power failures, not against concurrent file-level backup processes. Luckily Duply supports the concept of pre-scripts. In the next part of this article, we’ll cover pre-scripts and database backups.

I hope you’ve enjoyed this article. If you have any comments, please don’t hesitate to post them below. We regularly publish news and interesting articles. If you’re interested, please follow us on Twitter, or subscribe to our newsletter.

Is Amazon Glacier also S3 compatible, so that I can set the TARGET to something like glacier.eu-west-1.amazonaws.com?

http://www.phusion.nl/ Hongli Lai

It isn’t, it uses a different API. Furthermore, retrieving and restoring something from Glacier takes 4 hours. Since Duplicity has to fetch metadata from the server on every backup run, Glacier is not a feasible option.

Ernestas Lukoševičius

Since the topic says full-disk backups, it is a highly feasible option. Running incremental backups of a full disk is not. Sorry for not reading the article, just the topic.

Restoring a backup with duply main restore / gives an error, so I tried to use --force. This brought new errors (mainly Error '[Errno 17] File exists'), stopping the restore process with a segmentation fault.

Yes. Tarsnap seems to be doing something similar to Duplicity but is more expensive, at 30 cents per GB per month. I didn’t see why I should use Tarsnap and I already had an AWS account so I went with this instead. I could be wrong though and maybe Tarsnap has advantages over Duplicity.

http://www.phusion.nl/ Hongli Lai

You can try restoring to a temporary directory, then moving the files to the root directory using mv.

Dan

How do you prevent your backup from being deleted if somebody owns your server and gets your API key?

David Burley

If you create a lifecycle rule to move content to glacier, and then another to delete the content a year later, you get the benefits of glacier pricing. The only caveat is you need to make full backups periodically and any restore after something has been moved to glacier requires manual effort to move the content back to S3 before restoring along with the noted delay. However, this comes at considerable savings.

Lovingdesigns

I like this question.

It is basically next to impossible to protect yourself from this; you will need to trust the host of the physical hardware.

You could encrypt the drive and unlock it via a tiny ssh server though.

Lovingdesigns

Nice, it’s about half the price of S3!

Dan

It’s a problem I’m facing now doing backups to the cloud. If somebody gets my Rackspace API key, they can do all sorts of really awful things. I wish the providers would let me set permissions on what a certain API key can do.

Anon

If you don’t have full control over your backups you might as well not backup.

http://www.phusion.nl/ Hongli Lai

There are two ways:

1. Use Amazon IAM permissions. You can create a user with a new API key, and restrict access for this user to download and upload only. Unfortunately this also prevents Duplicity from deleting old backups.

2. Enable versioning in your bucket. That way you are protected against all deletions.
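For option 1, a sketch of such an IAM policy might look like this (the bucket name is a placeholder):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::my-backup-bucket"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::my-backup-bucket/*"
    }
  ]
}
```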

gosukiwi

This is nice, I didn’t know about Duplicity, I already have my VM backups on DigitalOcean but I’ll bookmark this just in case 😛

lzap

If you have a shell on the remote, I can offer this: an incremental backup solution based on *pure* rsync and ssh (no extra tools involved) that uses hardlinks to create a complete “snapshot” every time you run it (but does not consume extra space on the target disk).

Then you should run your backups using another server, one which your primary server cannot connect to, but which can connect to your primary server.

Pizzicato Five Fan

What about using the lifecycle setting on buckets to migrate data to Glacier after a given period of time? Would that also break the functionality?

Scott

Do you know if duplicity will allow bucket to bucket direct transfers? I’d like to backup data already in an S3 bucket to another bucket without bringing the data down to a local machine first. Thanks!

http://nandovieira.com.br Nando Vieira

Yeah, that’s what I thought. Maybe you should consider updating the article (the sudo duply main restore /) because, well, it doesn’t work. 🙂

http://www.phusion.nl/ Hongli Lai

I think so. Duplicity has a very specific format for its files so you can’t just move something to Glacier.

http://www.phusion.nl/ Hongli Lai

Yes. Use the copy-paste functionality in the S3 control panel.

http://kennydude.me/ Joe Simpson

Dreamhost is a ball game of whether you get thrown on a decent server or not.

andyjeffries

I’m not advocating them as a general host, but their S3-compatible object store is cheap… Even better, I signed up early so got a lifetime 4c/GB price 🙂

P.S. Freaky, one of my best mates is called Joe Simpson, but he wouldn’t have a clue about Dreamhost, S3 and the like 🙂

One thing I’d be interested to know your opinion on: did you consider any of the more “serous” backup solutions such as Bacula or Amanda?

And would you consider them if you had to backup a larger number of servers? I’m a little worried about getting a central “view” of the backups if there were 10 (or way more!) servers running duply scripts!

Neil

Obviously “serous” should have read “serious”!

http://chromano.in/ Carlos H. Romano

Not exactly secure; you could truncate the files on the server and the backup server would fetch them… Versioning and incremental backups would fix it, but still, recent data won’t be available.

This makes me wonder how much outgoing data it actually uses to run Duplicity, since S3 charges 12 cents per gig outgoing.

JFD

… or try tklbam (TurnKey Linux Backup And Migration) turnkeylinux.org which uses duplicity + S3 with even more simplicity. It’s now available as a stand alone package using their ppa.

Aussie Mike

Just want to say thank you for a really well-written and useful guide. I went from having the thought “I really should do encrypted backups to S3”, to having the whole thing up and running, in about 15 minutes!

Isn’t Glacier’s minimum charge 3 months? So, if you roll your backups more frequently than once a month, it is not really a money-saver…

Ernestas

Even if you delete your backups the same day, you will not be charged more than 0.03 cents/GB for that month.

M T

Even if you delete your backups the same day, you will not be charged more than 0.03 cents/GB for that month.

(First of all, that’s 0.03 dollar/GB — or 3 cents per GB.)

My point was, you will not be charged less than that either. And S3 storage costs slightly less than 3 cents/GB/month. So, if you are storing a GB for less than a month, your cost with Glacier will be 3 cents, and slightly less than that with S3… And you will not need to wait 4 hours for the Glacier-hosted file to be made available either.

Cezinha

Thanks, it helps a lot!

iakkam

How does this duplicity method interact with the versioning feature of Amazon S3? Should we have versioning for the S3 bucket turned on or off?

maqsimum

For some reason it does not work for me; any idea what the reason could be? Below is the log:

$ tail -f backups.log

Last full backup left a partial set, restarting.

Last full backup date: Tue May 19 14:48:41 2015

RESTART: The first volume failed to upload before termination.

Restart is impossible…starting backup from beginning.

Import of duplicity.backends.dpbxbackend Failed: No module named dropbox
