Block Based Backup

Introduction

The paper "Venti, a new approach to archival storage" (saved here as
venti.html) got me thinking that there might be better ways to manage
the computer file backup problem. As I see it, most backup approaches
fall into one of the following areas:

none, or ad hoc saving of some critical things to archive media
such as CD-R/DVD-R

image based backup, usually of a computer's boot drive or
partition to allow rapid recovery from a dead drive

traditional file based systems, with some combination of periodic
full backups, perhaps interlaced with incremental or differential
backups

For the home user, who may only have a single machine, the first two
approaches appear to be moderately acceptable and cost effective using a
CD-R or DVD-R type drive, but the third approach really needs a tape
drive and may prove to be rather expensive. Once you need to back up
more than one or two machines on a regular basis, CD-R or DVD-R type
drives become too cumbersome to use (especially CD-R with its smaller
capacity), so larger capacity devices (like tape drives or removable
disks) become attractive.

Much the same can be said for the small business environment, except
now the financial costs of performing the backups on a regular basis
need to be weighed against the cost of lost data (for example due to a
disk failure).

For the home user the financial costs of lost data are hard to
evaluate; some even look on a disk failure as an opportunity to upgrade
a system. However, with the advent of widespread digital photography
the need to reliably back up photos is rising, and so is the difficulty
of doing a good job of it, because of the volume of photos that are
taken.

Block Oriented Storage

As an alternative to storing data on a file-by-file basis, a backup
system could break up the files into blocks (say 8K bytes each) and
work with these instead. This would allow further reduction of
redundancy in the case where only part of a file has changed, or in the
case where different versions of the same file are scattered across a
network and the versions are largely similar. It might also make
retrieval of data from backup media more rapid, especially with tape
devices, which can often seek forward to a particular block quite
quickly.
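
The splitting step described above can be sketched as follows. This is only an illustration, not taken from any existing backup tool: the function name iter_blocks and the BLOCK_SIZE constant are made up for the example, the 8K block size is the one suggested in the text, and MD5 is used as the block digest because that is what the measurements later in this article use.

```python
import hashlib
from typing import BinaryIO, Iterator, Tuple

BLOCK_SIZE = 8 * 1024  # 8K bytes, the example block size from the text


def iter_blocks(
    stream: BinaryIO, block_size: int = BLOCK_SIZE
) -> Iterator[Tuple[int, str, bytes]]:
    """Yield (block_index, md5_hex_digest, data) for each block read.

    The final block of a stream is usually shorter than block_size.
    Assumes a buffered stream (such as open(path, "rb")), where read()
    returns a short result only at end of file.
    """
    index = 0
    while True:
        data = stream.read(block_size)
        if not data:
            break
        yield index, hashlib.md5(data).hexdigest(), data
        index += 1
```

A caller would open each file in binary mode and pass it to iter_blocks. Two blocks with the same digest can then be treated as the same block and stored once, whichever file or machine they came from.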

Use of block oriented storage might also allow for easier
implementation of a caching mechanism within the backup system.

The disadvantage of the block based approach is that the file database
gets larger (since additional data is required to track the blocks).
The costs of this are examined later, but with a block size in the 256k
to 1M range the extra overhead is not prohibitive.
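
To get a rough feel for how the tracking overhead scales with block size, the arithmetic can be sketched as below. The record layout is purely an assumption for illustration (a 16 byte MD5 digest plus a hypothetical 8 byte media offset per block). It is not the database layout examined later, and a real file database would also carry per-file records.

```python
DIGEST_BYTES = 16    # an MD5 digest is 128 bits
LOCATION_BYTES = 8   # assumed: one 64-bit offset into the backup media


def index_overhead(data_bytes, block_size,
                   record_bytes=DIGEST_BYTES + LOCATION_BYTES):
    """Return (n_blocks, index_bytes, index_bytes / data_bytes).

    Treats the data as one contiguous stream, so the block count is a
    lower bound: each real file ends in its own partial block.
    """
    if data_bytes <= 0 or block_size <= 0:
        raise ValueError("data_bytes and block_size must be positive")
    n_blocks = (data_bytes + block_size - 1) // block_size  # ceiling division
    index_bytes = n_blocks * record_bytes
    return n_blocks, index_bytes, index_bytes / data_bytes


if __name__ == "__main__":
    one_gigabyte = 1024 ** 3
    for size in (8 * 1024, 256 * 1024, 1024 * 1024):
        blocks, index, fraction = index_overhead(one_gigabyte, size)
        print(f"block size {size:>8}: {blocks:>7} blocks, "
              f"{index:>8} index bytes ({fraction:.5%} of the data)")
```

Under these assumptions the index size is inversely proportional to the block size, which is why larger blocks keep the database small at the price of finding fewer duplicate blocks.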

I also looked at how much redundancy there actually was on the same
network. In this case I used the 256k byte chunk size, and out of a
total of 1024525 blocks (this is slightly less than the 1035201 quoted
above because there were files that could not be opened for various
reasons), 914029 blocks were unique (had unique MD5 digests). This left
110496 blocks which were duplicates (usually between machines, but
sometimes on the same machine due to copies of directories being made).
The saving that could be made if the duplicate blocks were not re-saved
would be 6,655,101,923 bytes (which is only 3.4% of the network total).
If one drops the two drives #8 and #15 from the network total (as they
contain archived photos, MP3 and video) then this rises to 18.8%. With
the savings only being at this sort of level it does not seem worth
doing (which is maybe why I have been unable to find any software that
does this already).
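
A measurement of this kind can be sketched as below. This is not the script that produced the figures above, only a minimal reconstruction of the counting involved. The names Redundancy and measure_redundancy are made up for the example, and the input is assumed to be a sequence of (digest, length) pairs, one per block, gathered from every machine on the network.

```python
from typing import Iterable, NamedTuple, Set, Tuple


class Redundancy(NamedTuple):
    total_blocks: int
    unique_blocks: int      # number of distinct digests seen
    duplicate_blocks: int   # total_blocks - unique_blocks
    bytes_saved: int        # bytes in blocks that need not be re-saved


def measure_redundancy(blocks: Iterable[Tuple[str, int]]) -> Redundancy:
    """Count duplicate blocks in a stream of (digest, length) pairs.

    The first block seen with a given digest counts as the stored copy;
    every later block with that digest counts as a saving.
    """
    seen: Set[str] = set()
    total = 0
    saved = 0
    for digest, length in blocks:
        total += 1
        if digest in seen:
            saved += length
        else:
            seen.add(digest)
    return Redundancy(total, len(seen), total - len(seen), saved)
```

With the block counts quoted above, the duplicate count is simply 1024525 - 914029 = 110496. The bytes saved is less than 110496 full 256k blocks because the last block of each file is usually short.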