One of the first things I did, after having installed
Arch, was to think about a backup
solution. Well, in fact I had been thinking about that before I even knew which
distro I would use, and I had decided to set up a mirror of my "main" drive,
containing both the system (/ and its friends) and my data (/home), which are
actually on the same partition (for now?).

Logically, what I did was play with mdadm, and I started setting up a RAID1
array. This went well, mdadm making things very easy, and since this was all
done in a VM I could, without a problem, make a disk disappear, bring in a new
one and see how rebuilding the array would work (in brief: it couldn't be
easier, and it was pretty fast, too).

Side note: It's one thing I'm liking more and more each day with Linux:
everything seems pretty easy to do (it's all relative, of course), and doable
simply from the command line. Now I'll admit that I'm more of a GUI guy
myself, but I can still appreciate how a couple of commands allow one to
create/format a partition, resize it, move it, create an image of it, or set up
encryption or pretty much any kind of RAID array of one's choice. And all
that (once you start to understand how it works) with an obvious simplicity. And
while this is a Linux thing in general, it feels even more true in Arch.
Anyways...

But then I realized I had made a mistake: a mirror wasn't what I was looking
for. While it was easy enough to set up, and had the other great advantage of
being a "set & forget" solution, it didn't actually provide what I needed. A
mirror means that if one drive dies, you can still go on as if nothing
happened: get a new drive, do the replacement and keep going without any
trouble or data loss.

It's not nothing, but I wanted something else. I wanted to be able to stare at
my screen and go, "Alright, I screwed things up nicely. Now let's restore
things to a working situation again." I wanted to be able to realize that,
while I meant mv, what I actually typed was rm, for some odd reason, without
having to panic or cry.

In other words, I wanted to have actual backups of my data, and not just a(n
always up-to-date) mirror. Backups that could be old, not too old, but
definitely not an instant mirror. So I said goodbye to mdadm, and simply
settled for a little bash script using rsync, that would run automatically at
night as a cron job.
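For what it's worth, wiring such a script into cron takes one line. A possible crontab entry (the script path here is only an assumption; adjust it to wherever you put yours):

```
# m h  dom mon dow  command
0 3 * * * /backups/backups.sh auto
```

This would run the script in auto mode every night at 3:00.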

Multiple backups, with just one copy of the data

rsync is a great and powerful tool, one that allows you to build a whole lot of
backup solutions, including over ssh and many more things I don't need. One of
the things it does, though, that will be of use to me, is that it can
basically create different backups of your data, with only one full copy of the
data, plus simply what changed (i.e. new/modified files).

I'm not talking about its algorithm that only transfers the parts of a file
that changed, to minimize the amount of data transferred and speed things up,
but about how the data are stored. The idea is that you provide rsync with one
(or more) additional location(s), besides the usual source & destination. When
a file is missing from the destination, before copying it from the source
rsync will check those additional locations. Then, it can do different things:

using --compare-dest it will simply skip those files; thus only new/modified
files are backed up.

using --copy-dest it will make a local copy of the files; this doesn't
actually save any space.

using --link-dest it will create hard links to the files; this is the magic
we want.

It's that third option that interests me, because it means you can end up
with e.g. three backups, each folder containing a full backup of your
drive/data at a given time, yet you don't actually need three times the space
for it. Only one copy, plus what's changed.

The magic of hard links

Quickly, for those not familiar with hard links: basically, when you have a
file stored on your drive, there are two things: the actual data (the content
of the file) and its name. Usually there's only one name per piece of content,
but you can actually have more. For example, /home/user/foo.log and
/var/log/foo could be the same file.

There are no links, shortcuts or anything like that: both names represent the
same data. Editing one file is the same as editing the other, since they
"share" their content. When you remove one, you just remove its name, thus
leaving the other unaffected. When you remove the last one, and there are no
more names for the data, the data is "dropped" and the space on the drive is
made available again.

Using the -i option of ls, one can have the inode number shown. This number
is a unique index that represents the data pointed to. Two files with the same
inode actually point to the same data, i.e. they are two hard links to the
same data. You can also use the -l option of cp to create a new hard link to
a file, instead of copying its data.
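Here's a quick illustration in a throwaway directory (the file names are just examples): one piece of data, several names for it.

```shell
tmp=$(mktemp -d)
echo "some content" > "$tmp/foo.log"

ln "$tmp/foo.log" "$tmp/foo"        # a second name (hard link) for the data
cp -l "$tmp/foo.log" "$tmp/foo2"    # cp -l hard-links instead of copying

ls -i "$tmp"                        # all three names show the same inode

rm "$tmp/foo.log"                   # removes one name only...
cat "$tmp/foo"                      # ...the data is still there
```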

And that's what rsync will do with the --link-dest option: create hard
links. If a file from the source is found in that additional location (we're
not talking about same inodes here, of course; rsync uses size, dates and
similar attributes to determine file equality by default, though you can have
it use checksums (e.g. MD5) as well), then a new hard link is created and no
data needs to be copied. This not only speeds things up quite a lot, but
reduces the amount of space needed to keep multiple backups at once.

And since what regularly changes is usually small files, while the big ones
tend to stay the same, this allows keeping a few backups for far less space
than completely separate copies would require. Yet each backup folder contains
a full backup, and not just a partial/incremental copy. That's the beauty of
it.

With that in mind, I decided to make myself a little bash script that would
run each night and update a backup called "day", representing the backup at
the beginning of the day. Every Sunday night/Monday morning, it would also
update the backup "week", representing the backup at the beginning of the
week, using the "day" backup as reference (as the additional location in
--link-dest, that is); and lastly, on the first of each month, the backup
"month" would be updated the same way.
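That scheduling logic boils down to very little. Here's a sketch (the function name is mine, and the actual rsync calls are left out):

```shell
# Given the day of the month ($1, as printed by `date +%d`) and the day
# of the week ($2, as printed by `date +%u`, 1 = Monday), print which
# backups to update tonight.
backups_to_update() {
    targets="day"                                 # refreshed every night
    [ "$2" = "1" ] && targets="$targets week"     # Monday morning
    [ "$1" = "01" ] && targets="$targets month"   # first of the month
    echo "$targets"
}

backups_to_update "$(date +%d)" "$(date +%u)"
```

So on most nights only "day" gets refreshed, and a Monday that falls on the 1st refreshes all three.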

So now I have three backups: "day", "week", and "month". Each one is a full
backup of my system at a different point in time. There are about 5 GB of data
in the source, and the three backups use about 7.4 GB altogether, so less than
what would be needed for two "full" backups. Pretty cool, isn't it?

Scripting time

In case you're interested, here are the two scripts I made. The first one is of
course the one that creates backups; the second one will be of use when you
need to restore things.

I should mention that I wrote those for myself, and poorly hard-coded some
things(*), like the destinations of the backups: they all go in /backups/
and are called, as we've seen, day, week and month.

(*) I should note that I am actually planning on doing a full rewrite of those
scripts, for a couple of reasons, and I'll then try to make things a bit better.
Meanwhile, those are easy enough to adapt, should you want to.

backups.sh

The first script is used to create a backup of the data. It supports two
modes: auto and manual.

The auto mode is likely to be used in a cron job, to run automatically every
night or something like that. It will always create/update /backups/day; then,
if the day is the first of the month, it will also create/update
/backups/month (using day as reference, i.e. in --link-dest), and if the day
is a Monday it will do the same with /backups/week (still using day as
reference).

The manual mode allows one to create a backup whenever one wants. You can give
it a name (i.e. the name of the folder where the backup will be, though it
will be a subfolder of /backups/) or, if you don't, it defaults to the current
date (e.g. 2011-09-23_15-42). The way it's done, it always uses day as
"reference", and therefore cannot run if day doesn't exist.
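The name-defaulting and the "day must exist" check amount to something like this (a sketch only; the function name is made up, and the backup root is a parameter here where the real script hard-codes /backups/):

```shell
# Print the name for a manual backup: the given name if any, otherwise
# the current date/time; fail if the "day" reference backup is missing.
manual_backup_name() {
    root=$1
    name=${2:-$(date +%Y-%m-%d_%H-%M)}
    [ -d "$root/day" ] || { echo "error: $root/day missing" >&2; return 1; }
    echo "$name"
}

# e.g.: manual_backup_name /backups my-manual-backup
```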

It also uses a file with a list of locations to exclude from the backup,
/backups/backups.excludes. It is simply passed to rsync using its
--exclude-from option. Note that this file is also used by the restore script.
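To give an idea, such an excludes file could look like this (the entries are just examples, based on the folders mentioned below; the "- " prefix marks an exclude rule in the list format rsync's --exclude-from accepts):

```
- /dev/
- /media/
- /mnt/
- /proc/
- /backups/
```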

restore.sh

This script is just there to easily restore a full backup. Usually, one would
imagine you'll only need to go get a file or two, but should there be a major
problem or something, and you want to restore everything, this is for you.

It will simply do two things:

start rsync with all the required options

read the /backups/backups.excludes file and, for each line in that file that
starts with "- " (dash & space), i.e. each folder that was excluded from the
backup, make sure said folder exists in the restored location, creating it if
not.

The point of this is that you'll probably want to exclude things like /dev/,
/media/, /mnt/, /proc/ or, of course, /backups/ itself. But those folders
might be required for your system to boot (correctly), hence this final step.