For the home user, who may have only a single machine, the first two
approaches appear to be moderately acceptable and cost effective using
a CD-R or DVD-R type drive, but the third approach really needs a tape
drive and may prove to be rather expensive. Once you need to back up
more than one or two machines on a regular basis, the CD-R or DVD-R type
drives become too cumbersome to use (especially the CD-R with its
smaller capacity), so larger capacity devices (like tape drives or
removable disks) become attractive.

Much the same can be said for the small business environment, except
now the financial costs of performing the backups on a regular basis
need to be weighed against the cost of lost data (for example due to a
disk failure, accidental deletion, fire, theft, flood, hurricane,
tornado, earthquake, meteor strike or war).

For the home user the financial costs of lost data are hard to
evaluate; some even look on a disk failure as an opportunity to upgrade
a system and get a clean start. However, with the advent of widespread
digital photography the need to reliably back up family photos is
rising, and the difficulty of doing a good job of this is also rising
because of the volume and size of the photos that are taken.

See also my arcvback backup program, for which a download and manual
are available.

Requirements

There are various requirements for backup software. Not all forms of
backup will meet these, but understanding what the common requirements
are will help you in the selection of the type of backup software you
need.

Bare Metal Restore

This is often the toughest problem for backup software: when the main
disk drive fails in a computer and you replace it with a new
(unformatted) drive, how do you take your collection of backup media
and use it to rebuild the machine you once had?

The disk-image form of backup software excels at this. The traditional
file-based backup software makes this (typically) difficult to do (some
commercial systems have a "disaster recovery" option to better address
this issue). The problem with the traditional approach is that before
you can run the restore job you need to install (at least) an operating
system, and that can take some time.

Another approach is to be able to run the restore software from a
standalone (bootable) CDROM, allowing one to install a bare drive, boot
from the CDROM, format the drive as desired and then restore the
files to it. Backup software that can restore from a USB or
network attached drive can make a restore proceed much faster.

User Data Restore

Restoring lost user data (in the case of a failed non-system drive or
user or application error) is the problem that traditional backup
software handles well. It is, however, typically limited in the depth
of time over which recovery is possible.

Old Version Recovery

In certain cases it may be necessary to recover very early versions of
files or documents. Often one avoids this by periodically copying
and renaming a significant document, so that some of the earlier
document versions are still available; however, as this is a manual
process it will generally fail at some time. This sort of recovery
capability is of more importance in the corporate environment than the
home environment.

Typically providing the necessary depth of backup coverage is quite
expensive when using traditional backup software as this inflates the
media requirements dramatically.

Backup Size and Time

The volume of data that must be backed up can have a dramatic impact on
the design of a backup plan. It directly affects cost by dictating the
required capacity of the storage media and the type of hardware needed
to support it. It may cause compromises to be made in the frequency and
type of backup that is performed. It may impose restrictions on the
users of equipment being backed up (to ensure that they are not
preventing files from being backed up). It may cause heavy network
traffic and may necessitate network reorganization or upgrades to
provide the needed capacity.
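
To make the effect of data volume concrete, here is a rough
back-of-the-envelope sketch. The function names are mine, and the
capacities and throughput in the usage note are illustrative
assumptions only (they ignore compression, verification passes and
media-change time).

```python
import math


def media_needed(data_gb: float, media_capacity_gb: float) -> int:
    """Number of media pieces needed to hold data_gb of backup data."""
    if media_capacity_gb <= 0:
        raise ValueError("media capacity must be positive")
    return math.ceil(data_gb / media_capacity_gb)


def backup_hours(data_gb: float, throughput_mb_per_s: float) -> float:
    """Hours needed to move data_gb at a sustained rate (1 GB = 1000 MB)."""
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    return data_gb * 1000 / throughput_mb_per_s / 3600
```

For example, media_needed(40, 4.7) estimates how many single-layer
DVD-R discs a 40 GB full backup would take, and backup_hours(36, 10)
estimates how long 36 GB would take over a link sustaining 10 MB/s.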

Robustness

Backups need to be robust; ideally they also need to be tolerant of
media failure. For example, if a single backup spans several media the
loss of any one piece should not prevent the recovery of the data on
the remaining pieces. Better still, if the storage costs permit, it is
a very good idea to have a backup system that keeps redundant copies on
multiple pieces of media. That way, if a piece of media is lost or
damaged a copy of its data still exists on another piece.

Since all physical media have a finite lifespan it would also be a good
idea to allow a piece of media to be replaced by newer media at a later
date.

Utilities to verify the integrity of data on old media may also be
useful.
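
As a minimal sketch of what such a verification utility might do, the
following records a SHA-256 checksum for every file when the media is
written and later reports any file that is missing or no longer
matches. The manifest layout and function names are my own
assumptions, not those of any particular backup product.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Checksum a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return the relative paths that are missing or have changed."""
    damaged = []
    for rel, expected in manifest.items():
        candidate = root / rel
        if not candidate.is_file() or sha256_of(candidate) != expected:
            damaged.append(rel)
    return damaged
```

The manifest would have to be stored somewhere other than the media it
describes (or in several places) for this check to be of much use.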

Non-proprietary backup file formats may also be useful, especially
when trying to extract some data from a damaged backup file.

Fire, Theft and Flood

These sorts of risks are usually addressed by arranging to have a copy
of the backup media placed in storage at one or more remote sites. This
sort of risk is often overlooked in the home environment, but now that
unique data sets (such as the family photographs) are becoming
commonplace it should be considered. Consider the case of film director
Francis Ford Coppola, who had his computer and backup device stolen in
September 2007.

Ease of Use

For a backup system to be effective it must be used on a regular basis.
Ask any owner of a Palm Pilot if he has ever got that "sinking feeling"
that he really should have done a hot sync... There are some issues
here:

the software should be able to run on a daily (or other scheduled)
basis without needing to have any regular care and feeding (perhaps
changing a tape once a day or burning a few DVDs a week is ok)

the software should tolerate being run when some of the machines
that it is supposed to back up are not available (turned off...) and
when this happens it should not cause loss of some backup media (some
commercial products will overwrite the weekly rotation tape even
though nothing was received for that backup because of a communications
failure)

the software should make effective use of storage media, and
should allow you to use a combination of storage devices across a
network of machines

activities that involve manual intervention, like writing to the
archive media (tape, DVD or external hard drive), should be decoupled
from the actual backup process so that they can occur when it is
convenient to the operators
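
The second point above (not destroying an older backup when nothing
was received) amounts to a simple guard. Here is a minimal sketch; the
RotationSlot class and commit_backup function are hypothetical names
used only for illustration, with an in-memory dictionary standing in
for a tape.

```python
from dataclasses import dataclass, field


@dataclass
class RotationSlot:
    """One piece of media in the rotation (for example a weekly tape)."""
    label: str
    contents: dict[str, bytes] = field(default_factory=dict)


def commit_backup(slot: RotationSlot, received: dict[str, bytes]) -> bool:
    """Overwrite the slot only if the backup run actually delivered data.

    Returns True if the slot was overwritten and False if the older
    backup was left in place (machine off, communications failure...).
    """
    if not received:
        return False
    slot.contents = dict(received)
    return True
```

A real product would also want to log the skipped run so that the
operator knows the machine was missed.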

Existing Solutions

This section provides background information on the various types of
existing backup solutions.

Traditional

Traditional backup is what I am calling "file based backup using a
rotation of media". In it a program runs that periodically visits all
the files on the drives (partitions or subdirectories) that have been
identified as needing backup. At each file it checks to see if
something has changed (by looking at the archive bit or the last
modification time stamp) since the last time the file was backed up. If
the file has been modified then it saves a new copy of the file. This
process is typically done in two phases: a full backup, run once a
week, which backs up all files (and thus takes a lot of time and
storage), and a daily backup, which only backs up the files that have
changed since either the last daily backup (incremental mode) or the
last full backup (differential mode). The ability to go back in
time and recover a particular earlier version of a file is determined
by the number of full backups that are done and how often the
incremental or differential backups are performed.
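
The selection step described above can be sketched as follows, using
the last modification time stamp (not the archive bit) as the change
test. The function names and the mode strings are my own; real backup
software also has to cope with deleted files, clock changes and files
that are open while being read.

```python
from pathlib import Path


def changed_since(root: Path, since: float) -> list[Path]:
    """Files under root modified after `since` (seconds since the epoch)."""
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.stat().st_mtime > since
    )


def select_for_backup(root: Path, mode: str,
                      last_full: float, last_backup: float) -> list[Path]:
    """Pick the files a full, incremental or differential run would copy.

    last_full is the time of the last full backup; last_backup is the
    time of the most recent backup of any kind.
    """
    if mode == "full":
        return changed_since(root, float("-inf"))
    if mode == "incremental":
        return changed_since(root, last_backup)
    if mode == "differential":
        return changed_since(root, last_full)
    raise ValueError(f"unknown backup mode: {mode!r}")
```

The only difference between the incremental and differential modes is
which reference time the modification stamp is compared against.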

In a traditional rotation one might have three sets of weekly full
backup tapes and about 6 days' worth of incremental tapes. This would
allow you to recover a different end-of-day version of a file for the
last week, but only one version per week for the preceding two weeks.
This also provides a degree of redundancy: in the (somewhat likely)
event that something goes wrong with the weekly backup, there are still
two other (somewhat recent) versions of the weekly backup that could be
used to salvage most data. One might add another level to this by
adding a monthly and/or yearly full backup rotation (perhaps for off
site storage).

The traditional backup recognizes the fact that most files actually
remain unchanged for long periods of time, so the amount of data
that needs to be backed up on a daily basis is much smaller, and it
exploits this through the incremental or differential modes. Since the
incremental mode only backs up changes made since the last incremental
pass (or full pass) it will stay relatively small over long periods of
time, but because a restore job may have to access many (or all) of the
incremental tapes between now and the last full backup it becomes a
matter of user inconvenience that dictates how many incremental backups
can be tolerated. This inconvenience is why the differential backup
scheme was invented, though it is less efficient, having to back up
more and more data as the time since the full backup gets larger.
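
The trade-off between the two modes can be put into rough numbers.
This sketch assumes, purely for illustration, that every day changes
the same amount of data and that no file is changed twice between full
backups (in practice repeated changes to the same files make the
differential grow more slowly than this).

```python
def daily_sizes(daily_change: int, days: int, mode: str) -> list[int]:
    """Size of each daily backup for `days` days after a full backup."""
    if mode == "incremental":
        return [daily_change] * days
    if mode == "differential":
        return [daily_change * day for day in range(1, days + 1)]
    raise ValueError(f"unknown backup mode: {mode!r}")


def sets_needed_for_restore(days_since_full: int, mode: str) -> int:
    """Backup sets that must be read to restore to the latest state."""
    if days_since_full == 0:
        return 1                      # the full backup alone
    if mode == "incremental":
        return 1 + days_since_full    # full plus every incremental
    if mode == "differential":
        return 2                      # full plus the latest differential
    raise ValueError(f"unknown backup mode: {mode!r}")
```

With 2 GB changing per day over 6 days, the incremental runs store 12
GB in total while the differential runs store 42 GB, but a restore on
day 6 reads 7 sets in the incremental case and only 2 in the
differential case.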

The user convenience factor also comes into play in how effectively the
media is utilized. Typically the same size of media is used for both
full and incremental backups. This means that there might be enough
space on one tape to store several daily backup runs; however, because
of the difficulty of locating a particular set on a tape the software
will often just waste the remainder of the tape and have the user keep
a different tape for each day of the week.

The traditional backup also wastes a lot of time during the full
backups (again to provide user convenience). This is because most of
the files that are placed on the second week's full backup set were
already on the previous week's full backup. Because of cost one cannot
keep a weekly set of tapes around forever, so this periodic recopying of
all the data to allow media to be reused seems like a reasonable
compromise. However, if you are manually changing tapes (and using a
slower, more cost effective, tape technology), once a weekly backup
exceeds two or three tapes in size it starts to get rather
inconvenient and requires a long time to execute.

Drive Imaging

Drive imaging backup is the block-for-block copying and restoring of a
hard drive's data. Because this happens at the block level (below the
disk formats imposed by the operating system) this is usually a
snapshot technique that must be done when the machine is running in
some special standalone mode (for example booted into another operating
system loaded from CDROM or floppy). Recently there have been a number
of advances in these tools: allowing the image to be taken while the
machine is still running its regular operating system (which sounds
rather risky), storing only the blocks on the disk that the file
system is actually using (which saves a lot of backup space if the disk
is only partially full), and allowing the backup image to be browsed
and individual files within it to be restored. There is also the
possibility of providing an incremental approach to drive imaging,
whereby only the blocks that have changed since the last image was
taken need to be saved.
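
One way such an incremental image could work is sketched below: keep a
checksum per block from the last image and save only the blocks whose
checksum differs. This is my own illustration, not a description of
how any particular imaging tool does it; it treats the disk as a byte
string of fixed size, and the 4096 byte block size is an arbitrary
assumption.

```python
import hashlib

BLOCK_SIZE = 4096  # illustrative block size, in bytes


def block_hashes(image: bytes, block_size: int = BLOCK_SIZE) -> list[str]:
    """Checksum of every block in a disk image."""
    return [
        hashlib.sha256(image[offset:offset + block_size]).hexdigest()
        for offset in range(0, len(image), block_size)
    ]


def changed_blocks(old_hashes: list[str], new_image: bytes,
                   block_size: int = BLOCK_SIZE) -> dict[int, bytes]:
    """Blocks of new_image whose checksum differs from the last image."""
    changed = {}
    for index, offset in enumerate(range(0, len(new_image), block_size)):
        block = new_image[offset:offset + block_size]
        is_new = index >= len(old_hashes)
        if is_new or hashlib.sha256(block).hexdigest() != old_hashes[index]:
            changed[index] = block
    return changed


def apply_blocks(base: bytes, changed: dict[int, bytes],
                 block_size: int = BLOCK_SIZE) -> bytes:
    """Rebuild the newer image from the base image plus the saved blocks.

    Assumes the newer image is the same size as the base image.
    """
    image = bytearray(base)
    for index, block in changed.items():
        start = index * block_size
        image[start:start + len(block)] = block
    return bytes(image)
```

Restoring then means applying each saved set of changed blocks, in
order, on top of the last complete image.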

This form of backup is best used to protect the operating system drive
or partition of a machine to allow it to be placed back into service
quickly and cheaply without having to go through the tedium of
reinstalling the operating system and all of the applications.
Combining this with traditional backup of the user data areas would
seem like the best all round approach.

Design your disk partitions to support the imaging backup system (by
reducing the amount of data that is kept on the operating system
partition). If you keep the C: partition for just the operating system
and installed applications, and have another partition for the user
data files, then you can minimize the amount of data that the imaging
backup needs to copy (restricting it to just the C: partition).
Unfortunately Windows gets in the way of this by placing the "Documents
and Settings" directory on the C: drive (not to mention always wanting
to take up the full hard drive when it is installed).

Redundancy

Redundancy in the backup process is typically seen in the traditional
approach, but only in a partial form, through the way that the full
backup sets contain a lot of the same data. It can be added in a true
form by performing multiple full backups in a row, or by duplicating
the original set of backup media.

A backup approach that provides redundancy without imposing a lot of
additional work or cost would be useful even in the home environment as
this is what is needed to protect against perils such as fire, theft
and flood.

The main problem with redundancy is the added media cost. It also
increases the inconvenience factor due to the additional backups that
are done and also the additional trips that need to be made for offsite
storage.

Traditional backups often implement a pseudo-redundancy by having
several media sets that are used in rotation. For example you might
have a weekly full backup that has a 4 set rotation, meaning that on
the 5th week you overwrite the backup that was done in the first week.
This is not true redundancy because if there is a file you need from a
particular media set and it turns out that set is damaged, then if the
file only existed on that particular week you won't be able to recover
it. To get around this one might duplicate the media once it has been
written, doubling the media count.
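
The rotation arithmetic is simple enough to write down. In this small
sketch (the function names are mine) weeks and media sets are both
numbered from 1.

```python
def set_for_week(week: int, sets: int = 4) -> int:
    """Which media set (1..sets) is written in a given week."""
    if week < 1 or sets < 1:
        raise ValueError("week and sets must be at least 1")
    return (week - 1) % sets + 1


def weeks_retained(current_week: int, sets: int = 4) -> list[int]:
    """Weeks whose full backup still exists after current_week is written."""
    first = max(1, current_week - sets + 1)
    return list(range(first, current_week + 1))
```

With 4 sets, set_for_week(5) is set 1 again, and after week 5 is
written only weeks 2 through 5 can still be restored.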

Cache Drives

Cache drives can be used to decouple the backup operation from the
act of recording the backup data to removable media. When this is done
the backup writes to a large cache drive (which these days is pretty
inexpensive) so it can run at full speed, and then later the data is
saved from the cache to slow media such as tape. This also allows the
tasks that need operator intervention to be done at a convenient time,
which may mean that an expensive upgrade to a robotic tape changer or
to higher capacity tapes can be delayed or avoided.

The presence of the data on a cache drive for some period of time adds
a small element of risk to the system, but the failure of the cache
drive alone will not lose any data; one would also have to lose
(or erase) the original file at the same time. Placing the cache
directory on a RAID protected drive would greatly reduce this already
small risk, as would using backup software that writes two copies, to
two different cache devices.

Cache drives also allow for the possibility of doing restores directly
from the cache (especially if the cache is quite large) which can save
a lot of time, and allow for the possibility of very convenient
user-driven restores that don't need access to the backup media or
devices.

Use of a cache drive may also improve the utilization of backup media,
since one could delay the flushing of the cache until there is enough
data to fill a piece of media. Of course there is some risk with this
approach: a backup version could be lost if the cache drive fails
before it is flushed. This risk may be tolerable as the original data
should still be on its drive (unless this is the same drive as the
cache uses).
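
The delayed flush could be decided along the following lines. This is
only a sketch under simple assumptions of my own: each cached backup
set is written whole (no spanning across media), sets are flushed
oldest first, and sizes and capacity are in the same unit.

```python
def plan_flush(cached: list[tuple[str, int]], capacity: int) -> list[str]:
    """Choose which cached backup sets to write to one piece of media.

    `cached` holds (name, size) pairs, oldest first. Returns [] while
    the cache does not yet hold enough data to fill the media.
    """
    if sum(size for _, size in cached) < capacity:
        return []                     # keep accumulating in the cache
    chosen, used = [], 0
    for name, size in cached:
        if used + size > capacity:
            break                     # next set would not fit whole
        chosen.append(name)
        used += size
    return chosen
```

A set that is larger than the media on its own is never chosen here;
handling that would need the backup to span several pieces of media.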

Disk Based Backup

In recent years the falling costs of IDE hard drives have brought about
the odd situation of disk storage being the same price as, or even less
expensive than, tape storage on a $/GB basis, especially in the high
capacity ranges. It appears likely that disk prices will continue to
drop, while tape prices will not move much in the future. As a result
the temptation to use disks to replace tapes is going to grow.

Archival

An archival backup system is one which (in its truest form) never
overwrites backups. This means that old versions of changed files and
even files that were deleted and never replaced a long time ago can
still be retrieved from storage. Archival storage seems to be ignored
as being too expensive on media to implement, or too inconvenient to
use on a wide scale.
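
The idea of never overwriting can be sketched with a small in-memory
store that keeps every distinct version of every file. The
ArchiveStore class is hypothetical and only illustrates the principle;
a real archival system would write to write-once media and keep an
index of what is on each disc.

```python
import hashlib


class ArchiveStore:
    """Keeps every version of every file; nothing is ever overwritten."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}          # checksum -> contents
        self._history: dict[str, list[str]] = {}    # path -> checksums

    def save(self, path: str, data: bytes) -> bool:
        """Archive a version; returns False if the file is unchanged."""
        digest = hashlib.sha256(data).hexdigest()
        history = self._history.setdefault(path, [])
        if history and history[-1] == digest:
            return False
        self._blobs.setdefault(digest, data)        # store contents once
        history.append(digest)
        return True

    def versions(self, path: str) -> int:
        """How many versions of this path have been archived."""
        return len(self._history.get(path, []))

    def restore(self, path: str, version: int = -1) -> bytes:
        """Fetch a version by index (0 is the oldest, -1 the newest)."""
        return self._blobs[self._history[path][version]]
```

Because the history is never trimmed, a file that was deleted from the
original disk long ago can still be restored from the archive.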

However, it appears that with low cost media such as DVD-R we may have
reached a point where it is cost effective and convenient to replace a
tape based traditional backup system with a DVD-R (or RW) based
archival system for certain sizes of systems. This will become more
true (and applicable to a larger group of systems) in the future as the
price of DVD media continues to drop and the capacity of this type of
media rises with the advent of multi-layer and blue laser recording.

LD Backup (latedecember backup) is a simple cross-platform backup tool
written in Python.

A real-time backup system for NetBSD has been demonstrated. This wedges
into the file system at a driver level and for each write on the master
file system it echoes that write across a TCP connection to a remote
disk, a sort of drive mirroring across a LAN.