This page describes the repository format called 'Green Albatross'. It
is a preliminary format, intended to improve Obnam performance over
format-6. Current development status is PONDERING;
implementation has started, but everything on this
page may and probably will change. Until declared stable, the on-disk
data structures WILL change without warning. Only use this format to
help with its development.

Introduction

For background and the big picture, please read the ondisk page.
This page only discusses details specific to this format.

With few, specific exceptions, repository files are never
updated. Everything is done copy-on-write. This enables caching. The
exceptions are root nodes of DAGs, so that it's easy to know where
the DAG starts.

Data is stored in objects of various types. Objects may be small,
and to avoid having a file per object, objects are collected into
bags. This includes chunks. Each bag is stored in its own file.
Objects are identified by the bag identifier, plus an index into the
bag.

Objects are stored in a suitable serialisation encoding.

This is not Python pickles, since Obnam can't assume those are
stable over the lifetime of a backup repository.

JSON is not used, since JSON is not suitable for storing binary
data, such as filenames, without adding an encoding layer on top
of JSON.

Previously this was specified as YAML, but YAML is not fast to
parse. So it's not YAML.

Instead, a simple, custom binary encoding will be used. This
will encode ints, booleans, binary strings, lists, and dicts
(dict keys being strings). A quick prototype shows this to be easy
(it worked the first time), and fairly fast even without
optimisation.
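
To make the idea concrete, here is a minimal sketch of what such an
encoding could look like. This page does not specify the actual byte
layout, so everything here is an assumption made for illustration: the
one-byte type tags, the 64-bit big-endian lengths, the function names
encode and decode, and the use of byte strings as dict keys.

```python
import struct


def encode(value) -> bytes:
    """Encode a value as: one tag byte, then a tag-specific payload (hypothetical layout)."""
    # bool must be tested before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return b"b" + (b"\x01" if value else b"\x00")
    if isinstance(value, int):
        # Signed 64-bit; struct.error is raised for ints outside that range.
        return b"i" + struct.pack("!q", value)
    if isinstance(value, bytes):
        return b"s" + struct.pack("!Q", len(value)) + value
    if isinstance(value, list):
        body = b"".join(encode(item) for item in value)
        return b"l" + struct.pack("!Q", len(value)) + body
    if isinstance(value, dict):
        body = b"".join(encode(k) + encode(v) for k, v in value.items())
        return b"d" + struct.pack("!Q", len(value)) + body
    raise TypeError("cannot encode %r" % type(value))


def _decode_at(data: bytes, pos: int):
    """Decode one value starting at pos; return (value, position after it)."""
    tag = data[pos:pos + 1]
    pos += 1
    if tag == b"b":
        return data[pos] != 0, pos + 1
    if tag == b"i":
        return struct.unpack_from("!q", data, pos)[0], pos + 8
    if tag in (b"s", b"l", b"d"):
        (count,) = struct.unpack_from("!Q", data, pos)
        pos += 8
        if tag == b"s":
            return data[pos:pos + count], pos + count
        if tag == b"l":
            items = []
            for _ in range(count):
                item, pos = _decode_at(data, pos)
                items.append(item)
            return items, pos
        result = {}
        for _ in range(count):
            key, pos = _decode_at(data, pos)
            val, pos = _decode_at(data, pos)
            result[key] = val
        return result, pos
    raise ValueError("unknown type tag %r" % tag)


def decode(data: bytes):
    """Decode a complete encoded value, rejecting trailing bytes."""
    value, pos = _decode_at(data, 0)
    if pos != len(data):
        raise ValueError("trailing data after encoded value")
    return value
```

Since every value carries its own length or has a fixed size, decoding
needs no escaping or lookahead, which is one reason an encoding of this
kind can be fast to parse.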

Bags

A bag is used to store a number of blobs. Bags are identified by a
random 64-bit integer. This is used as the filename of the bag. The
bags are stored in a 3-level directory structure, using the top three
octets of the bag id as the directory names. Thus, a bag whose id in
hexadecimal is 0x1234567890abcdef would be stored as
12/34/56/1234567890abcdef.bag.
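
The mapping from bag id to filename can be sketched as below. The
function name bag_path is hypothetical, and the zero-padding to 16
hexadecimal digits is an assumption, made so that all bag ids have the
same length.

```python
def bag_path(bag_id: int) -> str:
    """Return the relative path of the file that stores the given bag."""
    if not 0 <= bag_id < 2 ** 64:
        raise ValueError("bag id must fit in 64 bits")
    hex_id = format(bag_id, "016x")  # fixed-width, zero-padded hexadecimal
    # The top three octets are the first six hex digits: one octet per
    # directory level.
    levels = [hex_id[0:2], hex_id[2:4], hex_id[4:6]]
    return "/".join(levels + [hex_id + ".bag"])


print(bag_path(0x1234567890abcdef))  # 12/34/56/1234567890abcdef.bag
```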

A bag is implemented as a Python dict object:

    {
        'bag-id': 'cafef00d',
        'blobs': [...],
    }

The blobs field contains the blobs. Each blob may be an arbitrary
byte string (for chunks), or an encoded Python object.

Object identifiers

Object identifiers are a pair consisting of the bag id and an index
into the bag. Since the identifiers are used frequently, it is
practical to store them as a unit rather than as a pair. Further, they
will be visible to the user (and, especially, the developer), so the
following syntax is used:

BAGID.INDEX

For example, the first and third objects stored in the bag with id
0x1234567890abcdef would be:

1234567890abcdef.0
1234567890abcdef.2

Note the use of hexadecimal for the bag id (so all bag identifiers are
of the same length), and indexing in decimal, starting from zero.
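
A minimal sketch of building and parsing identifiers in this syntax
follows. The function names make_object_id and parse_object_id are
hypothetical; only the BAGID.INDEX syntax itself comes from this page.

```python
def make_object_id(bag_id: int, index: int) -> str:
    """Format an object id: 16 hex digits, a dot, then a decimal index."""
    return "%016x.%d" % (bag_id, index)


def parse_object_id(object_id: str):
    """Split an object id back into (bag id, index), both as ints."""
    bag_hex, index_dec = object_id.split(".")
    return int(bag_hex, 16), int(index_dec, 10)


print(make_object_id(0x1234567890abcdef, 2))  # 1234567890abcdef.2
```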

We will keep bags effectively immutable so that an object id does not
need to change. This means that a bag may contain unused objects. If
it turns out that this wastes too much space, we can "pack" bags by
replacing the unused blobs with empty values (Python's None). This
mutates a bag, but only in ways that (correct) users won't
notice.
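
Packing could look roughly like the sketch below. The function name
pack_bag and the live_indexes argument are hypothetical; how Obnam
would determine which objects are still in use is not covered here. The
point is that the list keeps its length, so every surviving object
keeps its index and therefore its object id.

```python
def pack_bag(bag: dict, live_indexes: set) -> dict:
    """Return a copy of the bag with every unused blob replaced by None."""
    packed = dict(bag)
    packed["blobs"] = [
        blob if index in live_indexes else None
        for index, blob in enumerate(bag["blobs"])
    ]
    return packed


bag = {"bag-id": "cafef00d", "blobs": [b"first", b"second", b"third"]}
print(pack_bag(bag, {0, 2})["blobs"])  # [b'first', None, b'third']
```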

Client list

The client list is stored as client-list/data.bag in the repository,
and each item in the bag has the following structure:

    {
        'client-name': 'foo',
        'client-id': 123,
        'encryption-key': None,
    }

Chunks

Chunks are stored in bags. The chunk data is just a binary blob.

Chunk indexes

Chunk indexes map a checksum (using the user's chosen algorithm) to a
list of chunk ids, and a chunk id to a list of client ids. The root
object of the indexes is:

Checksum to chunk ids

The mapping from a checksum value to a list of chunk ids is done using
a lookup tree that is vaguely similar to a B-tree. The tree contains
index nodes and leaf nodes. Leaf nodes store the actual mappings:

The index node is a list of mappings, where each mapping corresponds
to an object on the next level in the lookup tree.

first-checksum is the smallest checksum in the sub-tree being
referenced.

last-checksum is the largest checksum.

leaf-id is the object id of the leaf node, assuming the next level
is leaf nodes.

index-id is the object id of the index node, assuming the next
level is index nodes.
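
One step of a lookup through an index node can be sketched as follows.
The field names are the ones described above; the function name
find_child, the representation of checksums as byte strings, and the
assumption that each mapping carries exactly one of leaf-id and
index-id are illustrative only.

```python
def find_child(index_node: list, checksum: bytes):
    """Return ('leaf' | 'index', object id) for the sub-tree that may
    contain the checksum, or None if no sub-tree covers it."""
    for entry in index_node:
        if entry["first-checksum"] <= checksum <= entry["last-checksum"]:
            if "leaf-id" in entry:
                return ("leaf", entry["leaf-id"])
            return ("index", entry["index-id"])
    return None


node = [
    {"first-checksum": b"00", "last-checksum": b"7f",
     "leaf-id": "1234567890abcdef.0"},
    {"first-checksum": b"80", "last-checksum": b"ff",
     "index-id": "1234567890abcdef.2"},
]
print(find_child(node, b"42"))  # ('leaf', '1234567890abcdef.0')
```

A full lookup would repeat this step, loading the referenced index node
each time, until it reaches a leaf node.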

The lookup tree is created in a copy-on-write manner. No node is ever
overwritten, but it may be deleted after it is no longer referenced.
The tree is not kept in balance, to keep the code that maintains it
as simple as possible.

When a new checksum is inserted into the lookup tree, it is added to
an existing leaf node by creating a new leaf node that is a copy of
the old one, and adding the new checksum to the new leaf. A big leaf
node is split in half. Any index nodes on the path to an updated leaf
node get updated.
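
The leaf update can be sketched as below. The leaf representation (a
dict mapping a checksum to a list of chunk ids), the function name
insert_into_leaf, and the max_keys threshold for what counts as a "big"
leaf are all assumptions made for illustration. Updating the index
nodes on the path to the leaf is left out.

```python
def insert_into_leaf(leaf: dict, checksum: bytes, chunk_id: str,
                     max_keys: int = 4) -> list:
    """Return the new leaf (or two leaves, after a split) that replace
    the old one. The old leaf is never modified."""
    # Copy-on-write: copy the mapping and each list of chunk ids.
    new_leaf = {key: list(ids) for key, ids in leaf.items()}
    new_leaf.setdefault(checksum, []).append(chunk_id)
    if len(new_leaf) <= max_keys:
        return [new_leaf]
    # Too big: split in half by sorted checksum.
    keys = sorted(new_leaf)
    middle = len(keys) // 2
    lower = {key: new_leaf[key] for key in keys[:middle]}
    upper = {key: new_leaf[key] for key in keys[middle:]}
    return [lower, upper]


old = {b"aa": ["1234567890abcdef.0"], b"cc": ["1234567890abcdef.1"]}
new_leaves = insert_into_leaf(old, b"bb", "1234567890abcdef.2", max_keys=2)
print(len(new_leaves))  # 2
```

Because the old leaf is left untouched, anything that still refers to
it, such as a cached copy, stays valid until the old node is deleted.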

Chunk id to client ids

This tree is similar in structure to the checksum tree, but index nodes
look like this:

Each directory in the live data is stored in a DIR object. The object
stores the metadata for the directory, plus the basenames and metadata
for each file in the directory (anything that isn't a
subdirectory), plus the basename and DIR object id of each
subdirectory.