Friday, December 25, 2009

Backup Revisited: Disk Full

The nightly Bacula backup failed a few nights ago. The reason was simple - the external backup disk was full.

A full backup weighs close to 30GB (and it's steadily growing). A differential/incremental backup weighs about 500MB, on average. So, with a file retention period of 4 months, one full backup per month and roughly 30 nightly backups per month, the storage space needed for backup is about 4×(30+30×0.5)=180GB.
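The estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the function name and the 30-nights-per-month figure are assumptions made for illustration, and the real numbers drift as the full backup grows.

```python
def storage_needed_gb(retention_months, full_gb, incr_gb, nights_per_month=30):
    """Rough storage estimate: one full backup plus nightly backups, per retained month."""
    # nights_per_month=30 is an assumed round figure, matching the formula in the text
    per_month_gb = full_gb + nights_per_month * incr_gb
    return retention_months * per_month_gb


# 4 months of retention, 30GB full backup, 0.5GB per nightly backup
print(storage_needed_gb(4, 30, 0.5))  # 180.0
```

Halving the retention period halves the estimate, which is the main lever available once the disk size is fixed.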

I've configured Bacula's maximum volume size to 4GB (do read the fine manual). This means that it'll divide the backup archive into chunks of no more than 4GB in size. This allows Bacula to recycle volumes when their contents are not needed anymore, i.e. when all their contents are older than the retention period.

I've also configured Bacula to use separate pools of volumes for the monthly full backup jobs and for the nightly incremental/differential backup jobs. It seemed like a good idea at the time. It wasn't.

Bacula does not recycle volumes before it actually needs them. This means that I ended up with leftover volumes on disk that were no longer needed, but that would only be recycled on the next backup. And since I separated the volumes into two pools per client, the full backup leftover volumes remained on disk for a month, were then recycled, and were replaced by other, more recent, leftover volumes. The overhead is about 1 volume per full backup, and for two clients it amounts to 8GB.

Furthermore, I use the same disk to store the VirtualBox disk image of my virtual WinXP PC. That's about 15GB.

The disk capacity is 230GB, but 1 percent of this disk is used by the OS - that's 2.3GB down the drain.

That leaves me with close to 25GB of slack. Which doesn't seem too bad, but it's actually pretty bad. The problem is that Bacula, by default, will perform a full backup whenever it detects that the fileset, i.e. the list of files/directories that's included/excluded in each backup job, has been modified. And, as you can imagine, I did just that, at least once, during the past few months.
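To see why that slack is too thin, here is a small sketch that adds up the figures quoted above and compares what's left against the size of one unscheduled full backup. The dictionary labels and names are made up for illustration; the numbers are the ones from this post.

```python
DISK_GB = 230.0
FULL_BACKUP_GB = 30.0  # approximate size of one full backup

# Disk usage figures quoted in the post, in GB
usage_gb = {
    "backups, 4 months retention": 180.0,
    "leftover volumes, 2 clients x 4GB": 8.0,
    "VirtualBox disk image": 15.0,
    "OS overhead, 1% of the disk": 0.01 * DISK_GB,
}


def slack_gb(disk_gb, usage):
    """Space left on the disk after all known consumers are accounted for."""
    return disk_gb - sum(usage.values())


slack = slack_gb(DISK_GB, usage_gb)
print(f"slack: {slack:.1f} GB")
print("room for an unscheduled full backup:", slack >= FULL_BACKUP_GB)
```

With these figures the slack comes out smaller than a single full backup, so one fileset change is enough to fill the disk.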

I've had to reconfigure Bacula as follows:

use a single backup pool per client (I could merge both into a single pool, but it seems to me that keeping the clients' volumes separate is a more robust approach) - this should reduce the recycling overhead, because I expect volumes to be recycled more often now

reduce the volume size to 700MB, in an attempt to lower the leftover overhead even more, by lowering the chance that a volume contains files from different backup jobs (another, more accurate, approach is to set the Maximum Volume Jobs to 1)

reduce the retention period to 2 months (actually, I've never had to restore files older than a week or so, but... better safe than sorry)
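For reference, the three changes above might translate into a Pool resource in the director configuration along these lines. This is a sketch, not the actual configuration: the pool name and label prefix are hypothetical, and it assumes the retention period is expressed through the pool's Volume Retention directive (the File Retention and Job Retention directives in the Client resource may need to be lowered to match).

```
# One pool per client, used for full, differential and incremental jobs alike
Pool {
  Name = ClientA-Pool               # hypothetical name
  Pool Type = Backup
  Recycle = yes                     # allow purged volumes to be reused
  AutoPrune = yes                   # prune expired records automatically
  Volume Retention = 2 months
  Maximum Volume Bytes = 700M
  # Maximum Volume Jobs = 1         # alternative: one job per volume
  Label Format = "ClientA-"         # hypothetical prefix for auto-labeled volumes
}
```

With a single pool per client, the Job resource only needs its Pool directive to point at this pool, with no separate Full Backup Pool or Incremental Backup Pool overrides.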

I stopped the Bacula director daemon

invoke-rc.d bacula-director stop

erased all the backup volumes, reset the Bacula database (aka the catalog)

/usr/share/bacula-director/make_sqlite3_tables

(yes, I'm using the SQLite3 backend), started the director daemon, and then used bconsole to manually launch backup jobs for both my wife's PC and my own.

Planning ahead is a good idea. I just realized it too late.