File Formats Used in ArcvBack

Introduction

Caution: the following discusses version 2 of arcvback and needs
updating for version 3.

The File Object

In a file-based archival system this is the fundamental object. For
practical reasons each file is, in fact, composed of one or more blocks
(or chunks) of data. The file object contains a table that holds the
SHA1 identifier (the BID) of each of these blocks. When a file is
modified so that it now contains some different BIDs, a new File Object
is allocated with the same file name but a different version number;
this allows the system to restore older versions of files if needed.
(Open question: should there just be a file ID, FID, that's a 4 or more
byte int? Would that do?)

The purpose of the file objects is to track the location and properties
of the various files on the various machines being backed up, along
with the BID (SHA1 value) of each block in the file. If there are
multiple copies of a file that differ only in machine, directory or
name, they will all have the same BIDs, so all the file objects will
share the same Block Objects. If there are multiple versions of a file
that have the same name but different contents, there will be multiple
file objects, each referring to a different set of block objects,
although they may share some of the BIDs if parts of the files are the
same.
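
As a rough illustration of the block table described above, here is a
minimal Python sketch. The names FileObject, block_ids and new_version,
the fixed 64 KB block size and the use of hex digests are all
assumptions made for the example; none of them is taken from arcvback
itself.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List

CHUNK_SIZE = 64 * 1024  # assumed block size; the document does not fix one


def block_ids(data: bytes, chunk_size: int = CHUNK_SIZE) -> List[str]:
    """Split data into fixed-size blocks and return each block's SHA1 (its BID)."""
    return [
        hashlib.sha1(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


@dataclass
class FileObject:
    """Hypothetical file object: a name, a version number and a table of BIDs."""
    name: str
    version: int
    bids: List[str] = field(default_factory=list)


def new_version(previous: FileObject, data: bytes) -> FileObject:
    """Return `previous` if the BIDs are unchanged, else a new version."""
    bids = block_ids(data)
    if bids == previous.bids:
        return previous
    return FileObject(previous.name, previous.version + 1, bids)
```

Two files with identical contents produce identical BID tables, which is
what lets their file objects share block objects.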

The file objects will also contain data about the file, such as the
creation and modification time stamps, the file name and security
attributes. The approximate deletion date should also be maintained;
this is the time at which the backup system first noticed that a file
it had previously backed up was deleted from the system. With it one
can determine when it is safe to delete the File and Block objects of a
deleted file, by checking whether the current time minus the deletion
time is greater than the archive time period.
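
The deletion check just described amounts to one comparison. The sketch
below is illustrative only; the function name can_purge and its
parameters are hypothetical, and a file that has never been seen as
deleted is represented here by None.

```python
from datetime import datetime, timedelta
from typing import Optional


def can_purge(deleted_at: Optional[datetime],
              archive_period: timedelta,
              now: datetime) -> bool:
    """True if the file was deleted longer ago than the archive period."""
    if deleted_at is None:
        # The file still exists on the backed-up system, so keep it.
        return False
    return now - deleted_at > archive_period
```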

The Block Data and Chunk Objects

The block data object contains the SHA1 value for the data (the BID)
and the media identifiers (MIDs) for where the actual data Chunks are
stored. When the system is configured, the user sets the desired level
of redundancy as the number of copies that must be stored for each
block. Each block will have a short array that contains the MIDs where
each copy of the chunk data is to be found. Since files are broken into
a set of chunks for storage, a single file may be larger than the
available space on one piece of media; in that case the chunks for that
file will be stored under different MIDs. We will never split a single
chunk across two pieces of media.
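
One possible shape for such a block data object is sketched below in
Python. The class name BlockData and the methods add_copy and
copies_needed are invented for the example, and the redundancy level is
assumed to be passed in from configuration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BlockData:
    """Hypothetical block data object: a BID plus the MIDs holding its copies."""
    bid: str
    mids: List[int] = field(default_factory=list)

    def add_copy(self, mid: int) -> None:
        """Record that a copy of this chunk now exists on media `mid`."""
        if mid not in self.mids:
            self.mids.append(mid)

    def copies_needed(self, redundancy: int) -> int:
        """How many more copies must be written to reach the configured level."""
        return max(0, redundancy - len(self.mids))
```

Because a chunk is never split across media, one MID per copy is enough
to locate it.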

Chunks may also be stored in the backup cache. In fact most restore
operations will be satisfied by loading the desired chunk data from the
cache, especially as the cache could be made quite large. In a home
environment a 100-200GB drive could act as the cache, allowing the
complete backup information for several machines to be kept in the
cache for immediate access, in addition to being recorded on backup
media for security. The file data object will record the MID of the
cache(s) that the chunk is currently contained in.

Package Identifiers

In an archive system there are many pieces of media in use, and some of
them may be quite old. To make things simple, when a piece of media is
formatted (or erased) it is given a unique media identifier (MID). For
simplicity we never reuse old media identifiers, so the MID will need
to be more than a byte, probably either a 2 byte or a 4 byte quantity.
In a system that uses CD-RW for backup a 2 byte MID (65,536
identifiers) would cover about 42TB of total storage, and at one disk
written per day it would take about 179 years to exhaust the available
MIDs. A more reasonable rate of 5 disks per day would still take about
36 years to exhaust. If we increase the MID size to 4 bytes then we
would have to write about 1 disk per second, 24 hours a day, for about
136 years to exhaust the MIDs, which is going to be impossible. So
clearly, a 4 byte MID is going to be sufficient for the worst case of
the smallest and least cost backup media, so unless the design shows
the MID size as being a significant problem we'll stick with 4 bytes.
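
The estimates above can be rechecked with a few lines of arithmetic.
The sketch below assumes a CD-RW capacity of 650 MB and a 365 day year;
both figures, and the function names, are assumptions for the example
rather than values stated in this document.

```python
DISC_MB = 650                    # assumed CD-RW capacity in megabytes
DAYS_PER_YEAR = 365              # leap days ignored
SECONDS_PER_DAY = 24 * 60 * 60


def mid_count(n_bytes: int) -> int:
    """Number of distinct MIDs an unsigned integer of n_bytes can hold."""
    return 2 ** (8 * n_bytes)


def total_capacity_tb(n_bytes: int) -> float:
    """Total storage addressable if every MID names one full disk (decimal TB)."""
    return mid_count(n_bytes) * DISC_MB / 1_000_000


def years_to_exhaust(n_bytes: int, discs_per_day: float) -> float:
    """Years until the MIDs run out at a steady write rate."""
    return mid_count(n_bytes) / discs_per_day / DAYS_PER_YEAR
```

For example, years_to_exhaust(2, 5) gives the 2 byte case at 5 disks per
day, and years_to_exhaust(4, SECONDS_PER_DAY) the 4 byte case at one
disk per second.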

The Directory Object

This contains the information about the location of a file. It might
also contain the machine name and drive or device name. This could be
as simple as a UNC formatted string of the complete path to each file,
but since directories are true trees it's more space efficient to have
a one to one mapping between these directory objects and the real ones.
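
To illustrate the tree approach, the Python sketch below stores only a
name and a parent link in each directory object and rebuilds the full
path on demand. The class name DirectoryObject and the full_path method
are hypothetical, chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DirectoryObject:
    """Hypothetical directory object mirroring one real directory."""
    name: str
    parent: Optional["DirectoryObject"] = None

    def full_path(self, sep: str = "\\") -> str:
        """Walk up to the root and join the names into a complete path."""
        parts = []
        node: Optional["DirectoryObject"] = self
        while node is not None:
            parts.append(node.name)
            node = node.parent
        return sep.join(reversed(parts))
```

Each path component is stored once, however many files and
subdirectories sit beneath it, which is where the space saving over one
full path string per file comes from.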

In a full-blown system additional security information will need to be
stored here too.

The Machine Object

This will identify each machine that is being backed up.

The Drive Object

This might be the same as a directory object; it identifies which drive
within a machine particular directories are stored on.

Media Layout

The backup media (and this includes the cache drives) will primarily
contain Chunk Objects as these are the actual data for the files. Other
things to think about:

Should there be a table of contents on the media, so that at restore
time (especially for tape) the first part of the media can be read and
the restore program can then seek directly to the chunks it needs? It
would hold the BID, length and location of each chunk on the tape.

Should there also be file/directory object data on the media, perhaps
just for the files that reference the chunks on this media, so that in
the event of database loss it would be possible to rebuild a database
by re-reading all the archive media (or at least this first part of all
the media)?

Should there be a backup of the database, or should this be done
separately?
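
If the table of contents idea from the first question were adopted, one
entry per chunk could be a small fixed-size record. The Python sketch
below is only one possible layout and is not a format defined by
arcvback: it assumes a 20 byte binary SHA1 digest as the BID, 8 byte
big-endian length and offset fields, and a 4 byte entry count at the
front.

```python
import struct
from typing import List, Tuple

# Assumed entry layout: 20-byte BID, 8-byte length, 8-byte offset, big-endian.
ENTRY = struct.Struct(">20sQQ")
COUNT = struct.Struct(">I")

TocEntry = Tuple[bytes, int, int]  # (bid, length, offset)


def pack_toc(entries: List[TocEntry]) -> bytes:
    """Serialise the table of contents: entry count, then fixed-size entries."""
    body = b"".join(ENTRY.pack(bid, length, offset)
                    for bid, length, offset in entries)
    return COUNT.pack(len(entries)) + body


def unpack_toc(raw: bytes) -> List[TocEntry]:
    """Read back the entries written by pack_toc."""
    (count,) = COUNT.unpack_from(raw, 0)
    return [ENTRY.unpack_from(raw, COUNT.size + i * ENTRY.size)
            for i in range(count)]
```

A restore program could read this table from the start of the media,
look up the BIDs it needs and seek straight to the recorded offsets.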