Fossil

A Technical Overview Of The Design And Implementation Of Fossil

1.0 Introduction

At its lowest level, a Fossil repository consists of an unordered set
of immutable "artifacts". You might think of these artifacts as "files",
since in many cases the artifacts are exactly that.
But other "structural artifacts" are also included in the mix.
These structural artifacts define the relationships
between artifacts - which files go together to form a particular
version of the project, who checked in that version and when, what was
the check-in comment, what wiki pages are included with the project, what
are the edit histories of each wiki page, what bug reports or tickets are
included, who contributed to the evolution of each ticket, and so forth.
This low-level file format is called the "global state" of
the repository, since this is the information that is synced to peer
repositories using push and pull operations. The low-level file format
is also called "enduring" since it is intended to last for many years.
The details of the low-level, enduring, global file format
are described separately.

This article is about how Fossil is currently implemented. Instead of
dealing with vague abstractions of "enduring file formats" as the
other document does, this article provides
some detail on how Fossil actually stores information on disk.

2.0 Three Databases

Fossil stores state information in
SQLite database files.
SQLite keeps an entire relational database, including multiple tables and
indices, in a single disk file. The SQLite library allows the database
files to be efficiently queried and updated using the industry-standard
SQL language. SQLite updates are atomic, so even in the event of
a system crash or power failure the repository content is protected.

Fossil uses three separate classes of SQLite databases:

The configuration database

Repository databases

Checkout databases

The configuration database is a one-per-user database that holds
global configuration information used by Fossil. There is one
repository database per project. The repository database is the
file that people are normally referring to when they say
"a Fossil repository". The checkout database is found in the working
checkout for a project and contains state information that is unique
to that working checkout.

Fossil does not always use all three database files. The web interface,
for example, typically only uses the repository database. And the
fossil settings command only opens the configuration database
when the --global option is used. But other commands use all three
databases at once. For example, the fossil status
command will first locate the checkout database, then use the checkout
database to find the repository database, then open the configuration
database. Whenever multiple databases are used at the same time,
they are all opened on the same SQLite database connection using
SQLite's ATTACH command.
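
As a rough illustration of the ATTACH approach, the sketch below opens three
database files on one SQLite connection using Python's standard sqlite3
module. The function names, file names, and schema names ("repo", "cfg") are
assumptions made for this example; they are not taken from the Fossil sources.

```python
import os
import sqlite3
import tempfile


def open_all(checkout_path, repo_path, config_path):
    """Open three database files on a single SQLite connection."""
    conn = sqlite3.connect(checkout_path)  # becomes the "main" database
    # ATTACH adds another database file to the same connection under a
    # schema name. The schema names used here are illustrative only.
    conn.execute("ATTACH DATABASE ? AS repo", (repo_path,))
    conn.execute("ATTACH DATABASE ? AS cfg", (config_path,))
    return conn


def attached_names(conn):
    """Return the schema names of all databases on the connection."""
    # PRAGMA database_list yields one row per database: (seq, name, file).
    return [row[1] for row in conn.execute("PRAGMA database_list")]


# Usage: three throwaway files in a temporary directory.
with tempfile.TemporaryDirectory() as d:
    conn = open_all(
        os.path.join(d, "checkout.db"),
        os.path.join(d, "project.fossil"),
        os.path.join(d, "config.db"),
    )
    print(attached_names(conn))
    conn.close()
```

Once attached, tables in any of the files can be referenced in a single SQL
statement by prefixing the table name with the schema name.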

The chart below provides a quick summary of how each of these
database files is used by Fossil, with detailed discussion following.

2.1 The Configuration Database

The configuration database holds cross-repository preferences and a list of all
repositories for a single user.

The fossil settings command can be used to specify various
operating parameters and preferences for Fossil repositories. Settings can
apply to a single repository, or they can apply globally to all repositories
for a user. If both a global and a repository value exist for a setting,
then the repository-specific value takes precedence. All of the settings
have reasonable defaults, and so many users will never need to change them.
But if changes to settings are desired, the configuration database provides
a way to change settings for all repositories with a single command, rather
than having to change the setting individually on each repository.
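
The precedence rule can be sketched as a two-step lookup. This is only a
minimal illustration: the table names "repo_settings" and "global_settings",
the column names, and the sample values are hypothetical and do not describe
Fossil's actual schema.

```python
import sqlite3


def get_setting(conn, name, default=None):
    """Look up a setting; a repository value wins over a global value."""
    # Hypothetical tables: the repository-specific one is consulted first.
    for table in ("repo_settings", "global_settings"):
        row = conn.execute(
            f"SELECT value FROM {table} WHERE name = ?", (name,)
        ).fetchone()
        if row is not None:
            return row[0]
    # Neither database has a value, so fall back to the built-in default.
    return default


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE global_settings(name TEXT PRIMARY KEY, value TEXT);
    CREATE TABLE repo_settings(name TEXT PRIMARY KEY, value TEXT);
    INSERT INTO global_settings VALUES ('editor', 'vi');
    INSERT INTO global_settings VALUES ('autosync', 'on');
    INSERT INTO repo_settings VALUES ('editor', 'emacs');
    """
)

print(get_setting(conn, "editor"))    # set in both places
print(get_setting(conn, "autosync"))  # set only globally
```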

The configuration database also maintains a list of repositories. This
list is used by the fossil all command in order to run various
operations such as "sync" or "rebuild" on all repositories managed by a user.

On Unix systems, the configuration database is named ".fossil" and is
located in the user's home directory. On Windows, the configuration
database is named "_fossil" (using an underscore as the first character
instead of a dot) and is located in the directory specified by the
LOCALAPPDATA, APPDATA, or HOMEPATH environment variables, in that order.

You can override this default location by defining the environment
variable FOSSIL_HOME pointing to an appropriate (writable) directory.
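
The lookup order described above might be expressed as follows. This sketch
makes several assumptions: that the Unix home directory is taken from the
HOME environment variable, that the file keeps its usual name when
FOSSIL_HOME is set, and that the function name and parameters are free
choices for this example. It is not a transcription of Fossil's own code.

```python
import ntpath
import posixpath


def config_db_path(env, is_windows):
    """Return the expected configuration database path, or None."""
    name = "_fossil" if is_windows else ".fossil"
    join = ntpath.join if is_windows else posixpath.join

    # FOSSIL_HOME, when set, overrides the default location.
    if env.get("FOSSIL_HOME"):
        return join(env["FOSSIL_HOME"], name)

    if is_windows:
        # Try each variable in the documented order.
        for var in ("LOCALAPPDATA", "APPDATA", "HOMEPATH"):
            if env.get(var):
                return join(env[var], name)
        return None

    # Assumption: HOME identifies the user's home directory on Unix.
    home = env.get("HOME")
    return join(home, name) if home else None


print(config_db_path({"HOME": "/home/alice"}, is_windows=False))
print(config_db_path({"HOME": "/home/alice", "FOSSIL_HOME": "/srv/cfg"},
                     is_windows=False))
```

Passing the environment as a dictionary, rather than reading os.environ
directly, keeps the function easy to exercise with different settings.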

2.2 Repository Databases

The repository database is the file that is commonly referred to as
"the repository". This is because the repository database contains,
among other things, the complete revision, ticket, and wiki history for
a project. It is customary to name the repository database after the
name of the project, with a ".fossil" suffix. For example, the repository
database for the self-hosting Fossil repository is called "fossil.fossil"
and the repository database for SQLite is called "sqlite.fossil".

2.2.1 Global Project State

The bulk of the repository database (typically 75 to 85%) consists
of the artifacts that comprise the
enduring, global, shared state of the project.
The artifacts are stored as BLOBs, compressed using
zlib compression and, where applicable,
using delta compression.
The combination of zlib and delta compression results in considerable
space savings. For the SQLite project, at the time of this writing,
the total size of all artifacts is over 2.0 GB but thanks to the
combined zlib and delta compression, that content only takes up
32 MB of space in the repository database, for a compression ratio
of about 64:1. The average size of a content BLOB in the database
is around 500 bytes.

Note that the zlib and delta compression is not an inherent part of the
Fossil file format; it is just an optimization.
The enduring file format for Fossil is the unordered
set of artifacts. The compression techniques are just a detail of
how the current implementation of Fossil happens to store these artifacts
efficiently on disk.
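
To make the distinction concrete, here is a toy artifact store in which
compression is purely a storage detail. It assumes artifacts are keyed by
the SHA3-256 hash of their uncompressed content, and it uses zlib only;
delta compression is omitted. The class and method names are invented for
this sketch and are not Fossil's.

```python
import hashlib
import zlib


class ArtifactStore:
    """An unordered set of immutable artifacts, stored compressed."""

    def __init__(self):
        self._blobs = {}  # hash of uncompressed content -> compressed bytes

    def add(self, content: bytes) -> str:
        # The key depends only on the original content, so the storage
        # encoding could change without affecting artifact identity.
        key = hashlib.sha3_256(content).hexdigest()
        self._blobs[key] = zlib.compress(content)
        return key

    def get(self, key: str) -> bytes:
        # Callers always see the original, uncompressed artifact.
        return zlib.decompress(self._blobs[key])

    def stored_size(self) -> int:
        return sum(len(blob) for blob in self._blobs.values())


store = ArtifactStore()
original = b"int main(void){ return 0; }\n" * 200
key = store.add(original)
print(len(original), store.stored_size(), store.get(key) == original)
```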

All of the original uncompressed and undeltaed artifacts can be extracted
from a Fossil repository database using
the fossil deconstruct
command. Individual artifacts can be extracted using the
fossil artifact command.
When accessing the repository database using raw SQL and the
fossil sql command, the extension function
"content()", called with a single argument that is the SHA1 or
SHA3-256 hash
of an artifact, returns the complete undeltaed and uncompressed
content of that artifact.

Going the other way, the fossil reconstruct
command will scan a directory hierarchy and add all files found to
a new repository database. The fossil import command
works by reading the input git-fast-export stream and using it to construct
corresponding artifacts which are then written into the repository database.

2.2.2 Project Metadata

The global project state information in the repository database is
supplemented by computed metadata that makes querying the project state
more efficient. Metadata includes information such as the following:

The names for all files found in any check-in.

All check-ins that modify a given file

Parents and children of each check-in.

Potential timeline rows.

The names of all symbolic tags and the check-ins they apply to.

The names of all wiki pages and the artifacts that comprise each
wiki page.

Attachments and the wiki pages or tickets they apply to.

Current content of each ticket.

Cross-references between tickets, check-ins, and wiki pages.

The metadata is held in various SQL tables in the repository database.
The metadata is designed to facilitate queries for the various timelines and
reports that Fossil generates.
As the functionality of Fossil evolves,
the schema for the metadata can and does change.
But schema changes do not invalidate the repository. Remember that the
metadata contains no new information - only information that has been
extracted from the canonical artifacts and saved in a more useful form.
Hence, when the metadata schema changes, the prior metadata can be discarded
and the entire metadata corpus can be recomputed from the canonical
artifacts. That is what the
fossil rebuild command does.
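
The idea behind a rebuild can be shown with a deliberately simplified model:
discard a derived table and recompute it from the canonical artifacts. The
schema, the table names, and the "file <name>" line format below are
inventions for this sketch; real Fossil artifacts and metadata tables are
more involved.

```python
import sqlite3


def rebuild(conn):
    """Discard derived metadata and recompute it from the artifacts."""
    conn.execute("DROP TABLE IF EXISTS file_index")
    conn.execute("CREATE TABLE file_index(artifact_id INTEGER, filename TEXT)")
    rows = conn.execute("SELECT id, content FROM artifact").fetchall()
    for artifact_id, content in rows:
        for line in content.splitlines():
            # Toy format: each "file <name>" line names one file.
            if line.startswith("file "):
                conn.execute(
                    "INSERT INTO file_index VALUES (?, ?)",
                    (artifact_id, line[len("file "):]),
                )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE artifact(id INTEGER PRIMARY KEY, content TEXT)")
conn.execute("INSERT INTO artifact(content) VALUES (?)",
             ("file README\nfile src/main.c\n",))
conn.execute("INSERT INTO artifact(content) VALUES (?)",
             ("file README\n",))
rebuild(conn)
print(conn.execute("SELECT * FROM file_index").fetchall())
```

Because the index holds nothing that is not already in the artifact table,
running rebuild again after any damage to file_index restores the same rows.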

2.2.3 Display And Processing Preferences

The repository database also holds information used to help format
the display of web pages and configuration settings that override the
global configuration settings for the specific repository. All of
this information (and the user credentials and privileges too) is
local to each repository database; it is not shared between repositories
by fossil sync. That is because it is entirely reasonable
that two different websites for the same project might have completely
different display preferences and user communities. One instance of the
project might be a fork of the other, for example, which pulls from the
other but never pushes and extends the project in ways that the keepers of
the other website disapprove of.

Display and processing information includes the following:

The name and description of the project

The CSS file, header, and footer used by all web pages

The project logo image

Fields of tickets that are considered "significant" and which are
therefore collected from artifacts and made available for display

Templates for screens to view, edit, and create tickets

Ticket report formats and display preferences

Local values for settings that override the
global values defined in the per-user configuration database.

Though the display and processing preferences do not move between
repository instances using fossil sync, this information
can be shared between repositories using the
fossil config push and
fossil config pull commands.
The display and processing information is also copied into new
repositories when they are created using
fossil clone.

2.2.4 User Credentials And Privileges

Just because two development teams are collaborating on a project and allow
push and/or pull between their repositories does not mean that they
trust each other enough to share passwords and access privileges.
Hence the names and emails and passwords and privileges of users are
considered private information that is kept locally in each repository.

Each repository database has a table holding the username, privileges,
and login credentials for users authorized to interact with that particular
database. In addition, there is a table named "concealed" that maps the
SHA1 hash of each user's email address back into their true email address.
The concealed table allows just the SHA1 hash of email addresses to
be stored in tickets, and thus prevents actual email addresses from falling
into the hands of spammers who happen to clone the repository.
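
A minimal sketch of the concealment scheme follows. The table name
"concealed" comes from the text above, but the column names, the UTF-8
encoding of the address before hashing, and the helper function names are
assumptions made for illustration.

```python
import hashlib
import sqlite3


def conceal(conn, email):
    """Record an email address and return the hash that stands in for it."""
    digest = hashlib.sha1(email.encode("utf-8")).hexdigest()
    # Column names are assumed; only the table name appears in the text.
    conn.execute(
        "INSERT OR IGNORE INTO concealed(hash, content) VALUES (?, ?)",
        (digest, email),
    )
    return digest


def reveal(conn, digest):
    """Map a hash back to the address, if this repository knows it."""
    row = conn.execute(
        "SELECT content FROM concealed WHERE hash = ?", (digest,)
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE concealed(hash TEXT PRIMARY KEY, content TEXT)")

token = conceal(conn, "alice@example.com")
print(token)                # safe to store in a ticket
print(reveal(conn, token))  # only possible with the local table
```

A clone that lacks the concealed table sees only the hash, which is the
property the text relies on.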

The content of the user and concealed tables can be pushed and pulled using the
fossil config push and
fossil config pull commands with "user" and
"email" as the AREA argument, but only if you have administrative
privileges on the remote repository.

2.2.5 Shunned Artifact List

The set of canonical artifacts for a project - the global state for the
project - is intended to be an append-only database. In other words,
new artifacts can be added but artifacts can never be removed. But
it sometimes happens that inappropriate content is mistakenly or
maliciously added to a repository. The only way to get rid of
the undesired content is to "shun" it.
The "shun" table in the repository database records the hash values for
all shunned artifacts.

The shun table can be pushed or pulled using
the fossil config command with the "shun" AREA argument.
The shun table is also copied during a clone.

2.3 Checkout Databases

Fossil allows a single repository
to have multiple working checkouts. Each working checkout has a single
database in its root directory that records the state of that checkout.
The checkout database is named "_FOSSIL_" or ".fslckout".
The checkout database records information such as the location of the
repository database, the stash, the undo stack, and the state of the
bisect command.

For Fossil commands that run from within a working checkout, the
first thing that happens is that Fossil locates the checkout database.
Fossil first looks in the current directory. If not found there, it
looks in the parent directory. If not found there, the parent of the
parent. And so forth until either the checkout database is found
or the search reaches the root of the filesystem. (In the latter case,
Fossil returns an error, of course.) Once the checkout database is
located, it is used to locate the repository database.
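
The upward search can be sketched in a few lines. The two file names come
from the text above; the function name, the order in which the two names
are tried, and the use of None to signal "not found" are choices made for
this example, not a description of Fossil's internals.

```python
import tempfile
from pathlib import Path

CHECKOUT_DB_NAMES = ("_FOSSIL_", ".fslckout")


def find_checkout_db(start):
    """Walk from start toward the filesystem root looking for the database."""
    directory = Path(start).resolve()
    while True:
        for name in CHECKOUT_DB_NAMES:
            candidate = directory / name
            if candidate.is_file():
                return candidate
        if directory.parent == directory:
            # Reached the root of the filesystem without finding it.
            return None
        directory = directory.parent


# Usage: a fake checkout with a nested working directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / ".fslckout").touch()
    nested = root / "src" / "lib"
    nested.mkdir(parents=True)
    print(find_checkout_db(nested) == root / ".fslckout")
```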

Notice that the checkout database contains a pointer to the repository
database but that the repository database has no record of the checkout
databases. That means that a working checkout directory tree can be
freely renamed or copied or deleted without consequence. But the
repository database file, on the other hand, has to stay in the same
place with the same name or else the open checkout databases will not
be able to find it.

A checkout database is created by the fossil open command.
A checkout database is deleted by fossil close. The
fossil close command really isn't needed; one can accomplish the same
thing simply by deleting the checkout database.

Note that the stash, the undo stack, and the state of the bisect command
are all contained within the checkout database. That means that the
fossil close command will delete all stash content, the undo stack, and
the bisect state. The close command is not undoable. Use it with care.