Repository Maintenance

Maintaining a Subversion repository can be daunting, mostly
due to the complexities inherent in systems that have a database
backend. Doing the task well is all about knowing the
tools—what they are, when to use them, and how. This
section will introduce you to the repository administration
tools provided by Subversion and discuss how to wield them to
accomplish tasks such as repository data migration, upgrades,
backups, and cleanups.

An Administrator's Toolkit

Subversion provides a handful of utilities useful for
creating, inspecting, modifying, and repairing your repository.
Let's look more closely at each of those tools. Afterward,
we'll briefly examine some of the utilities included in the
Berkeley DB distribution that provide functionality specific
to your repository's database backend not otherwise provided
by Subversion's own tools.

svnadmin

The svnadmin program is the
repository administrator's best friend. Besides providing
the ability to create Subversion repositories, this program
allows you to perform several maintenance operations on
those repositories. The syntax of
svnadmin is similar to that of other
Subversion command-line programs:

$ svnadmin help
general usage: svnadmin SUBCOMMAND REPOS_PATH [ARGS & OPTIONS ...]
Type 'svnadmin help <subcommand>' for help on a specific subcommand.
Type 'svnadmin --version' to see the program version and FS modules.
Available subcommands:
crashtest
create
deltify
…

svnlook

svnlook is a tool provided by
Subversion for examining the various revisions and
transactions (which are revisions
in the making) in a repository. No part of this program
attempts to change the repository. svnlook
is typically used by the repository hooks for reporting the
changes that are about to be committed (in the case of the
pre-commit hook) or that were just
committed (in the case of the post-commit
hook) to the repository. A repository administrator may use
this tool for diagnostic purposes.

svnlook has a straightforward
syntax:

$ svnlook help
general usage: svnlook SUBCOMMAND REPOS_PATH [ARGS & OPTIONS ...]
Note: any subcommand which takes the '--revision' and '--transaction'
options will, if invoked without one of those options, act on
the repository's youngest revision.
Type 'svnlook help <subcommand>' for help on a specific subcommand.
Type 'svnlook --version' to see the program version and FS modules.
…

Most of svnlook's
subcommands can operate on either a revision or a
transaction tree, printing information about the tree
itself, or how it differs from the previous revision of the
repository. You use the --revision
(-r) and --transaction
(-t) options to specify which revision or
transaction, respectively, to examine. In the absence of
both the --revision (-r)
and --transaction (-t)
options, svnlook will examine the
youngest (or HEAD) revision in the
repository. So the following two commands do exactly the
same thing when 19 is the youngest revision in the
repository located at
/var/svn/repos:

$ svnlook info /var/svn/repos
$ svnlook info /var/svn/repos -r 19

One exception to these rules about subcommands is
the svnlook youngest subcommand, which
takes no options and simply prints out the repository's
youngest revision number:

$ svnlook youngest /var/svn/repos
19
$

Note

Keep in mind that the only transactions you can browse
are uncommitted ones. Most repositories will have no such
transactions because transactions are usually either
committed (in which case, you should access them as
revisions with the --revision
(-r) option) or aborted and
removed.

Output from svnlook is designed to be
both human- and machine-parsable. Take, as an example, the
output of the svnlook info subcommand, which consists
of the following, in the order given:

The author, followed by a newline

The date, followed by a newline

The number of characters in the log message,
followed by a newline

The log message itself, followed by a newline

This output is human-readable, meaning items such as the
datestamp are displayed using a textual representation
instead of something more obscure (such as the number of
nanoseconds since the Tastee Freez guy drove by). But the
output is also machine-parsable—because the log
message can contain multiple lines and be unbounded in
length, svnlook provides the length of
that message before the message itself. This allows scripts
and other wrappers around this command to make intelligent
decisions about the log message, such as how much memory to
allocate for the message, or at least how many bytes to skip
in the event that this output is not the last bit of data in
the stream.
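To make the machine-parsable side concrete, here is a minimal sketch of a shell function that splits svnlook info output into its four parts. The function name parse_svnlook_info is our own invention, and the sketch assumes the length line counts bytes, which is what lets it read a multi-line log message in one step.

```shell
# Sketch: split `svnlook info` output (read from stdin) into its parts.
# Assumption: the third line gives the log message length in bytes.
parse_svnlook_info() {
    IFS= read -r author        # line 1: author
    IFS= read -r date          # line 2: datestamp
    IFS= read -r length        # line 3: length of the log message
    # Read exactly that many bytes, so a multi-line message stays intact.
    message=$(head -c "$length")
    printf 'author: %s\n' "$author"
    printf 'date: %s\n' "$date"
    printf 'log (%s bytes): %s\n' "$length" "$message"
}

# Usage (repository path is illustrative):
#   svnlook info /var/svn/repos | parse_svnlook_info
```

Because the length arrives before the message, the function never has to guess where the log message ends.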

svnlook can perform a variety of
other queries: displaying subsets of the information
we've mentioned previously, recursively listing versioned
directory trees, reporting which paths were modified in a
given revision or transaction, showing textual and property
differences made to files and directories, and so on. See
the section called “svnlook” for a full reference of
svnlook's features.

svndumpfilter

While it won't be the most commonly used tool at the
administrator's disposal, svndumpfilter
provides a very particular brand of useful
functionality—the ability to quickly and easily modify
streams of Subversion repository history data by acting as a
path-based filter.

The syntax of svndumpfilter is as
follows:

$ svndumpfilter help
general usage: svndumpfilter SUBCOMMAND [ARGS & OPTIONS ...]
Type 'svndumpfilter help <subcommand>' for help on a specific subcommand.
Type 'svndumpfilter --version' to see the program version.
Available subcommands:
exclude
include
help (?, h)

There are only two interesting subcommands:
svndumpfilter exclude and
svndumpfilter include. They allow you to
make the choice between implicit or explicit inclusion of
paths in the stream. You can learn more about these
subcommands and svndumpfilter's unique
purpose later in this chapter, in the section called “Filtering Repository History”.

svnsync

The svnsync program, which was new to
the 1.4 release of Subversion, provides all the
functionality required for maintaining a read-only mirror of
a Subversion repository. The program really has one
job—to transfer one repository's versioned history
into another repository. And while there are a few ways to do
that, its primary strength is that it can operate
remotely—the “source” and
“sink”[33]
repositories may be on different computers from each other
and from svnsync itself.

As you might expect, svnsync has a
syntax that looks very much like every other program we've
mentioned in this chapter:

$ svnsync help
general usage: svnsync SUBCOMMAND DEST_URL [ARGS & OPTIONS ...]
Type 'svnsync help <subcommand>' for help on a specific subcommand.
Type 'svnsync --version' to see the program version and RA modules.
Available subcommands:
initialize (init)
synchronize (sync)
copy-revprops
info
help (?, h)
$

fsfs-reshard.py

While not an official member of the Subversion
toolchain, the fsfs-reshard.py script
(found in the tools/server-side
directory of the Subversion source distribution) is a useful
performance tuning tool for administrators of FSFS-backed
Subversion repositories. As described in the sidebar
Revision files and shards,
FSFS repositories use individual files to house information
about each revision. Sometimes these files all live in a
single directory; sometimes they are sharded across many
directories. But the neat thing is that the number of
directories used to house these files is
configurable. That's where
fsfs-reshard.py comes in.

fsfs-reshard.py reshuffles the
repository's file structure into a new arrangement that
reflects the requested number of sharding subdirectories and
updates the repository configuration to preserve this
change. This is especially useful for converting an older
Subversion repository into the new Subversion 1.5 sharded
layout (which Subversion will not automatically do for you)
or for fine-tuning an already sharded repository.

Berkeley DB utilities

If you're using a Berkeley DB repository, all of
your versioned filesystem's structure and data live in a set
of database tables within the db/
subdirectory of your repository. This subdirectory is a
regular Berkeley DB environment directory and can therefore
be used in conjunction with any of the Berkeley database
tools, typically provided as part of the Berkeley DB
distribution.

For day-to-day Subversion use, these tools are
unnecessary. Most of the functionality typically needed for
Subversion repositories has been duplicated in the
svnadmin tool. For example,
svnadmin list-unused-dblogs and
svnadmin list-dblogs perform a
subset of what is provided by the Berkeley
db_archive utility, and svnadmin
recover reflects the common use cases of the
db_recover utility.

However, there are still a few Berkeley DB utilities
that you might find useful. The db_dump
and db_load programs write and read,
respectively, a custom file format that describes the keys
and values in a Berkeley DB database. Since Berkeley
databases are not portable across machine architectures,
this format is a useful way to transfer those databases from
machine to machine, irrespective of architecture or
operating system. As we describe later in this chapter, you
can also use svnadmin dump and
svnadmin load for similar purposes, but
db_dump and db_load
can do certain jobs just as well and much faster. They can
also be useful if the experienced Berkeley DB hacker needs
to do in-place tweaking of the data in a BDB-backed
repository for some reason, which is something Subversion's
utilities won't allow. Also, the db_stat
utility can provide useful information about the status of
your Berkeley DB environment, including detailed statistics
about the locking and storage subsystems.

Commit Log Message Correction

Sometimes a user will have an error in her log message (a
misspelling or some misinformation, perhaps). If the
repository is configured (using the
pre-revprop-change hook; see the section called “Implementing Repository Hooks”) to accept changes to
this log message after the commit is finished, the user
can “fix” her log message remotely using
svn propset (see svn propset). However, because of the
potential to lose information forever, Subversion repositories
are not, by default, configured to allow changes to
unversioned properties—except by an
administrator.

If a log message needs to be changed by an administrator,
this can be done using svnadmin setlog.
This command changes the log message (the
svn:log property) on a given revision of a
repository, reading the new value from a provided file.

The svnadmin setlog command, by
default, is
still bound by the same protections against modifying
unversioned properties as a remote client is—the
pre- and
post-revprop-change hooks are still
triggered, and therefore must be set up to accept changes of
this nature. But an administrator can get around these
protections by passing the --bypass-hooks
option to the svnadmin setlog command.
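As a rough illustration, the following sketch wraps svnadmin setlog in a small shell function. The name set_log_message and the temporary-file handling are our own assumptions. Any extra arguments, such as --bypass-hooks, are passed through unchanged.

```shell
# Sketch: replace the log message (svn:log) of one revision.
# Arguments: REPOS_PATH REVISION NEW_MESSAGE [extra svnadmin options...]
set_log_message() {
    repos=$1 rev=$2 text=$3
    shift 3
    tmpfile=$(mktemp) || return 1
    printf '%s\n' "$text" > "$tmpfile"
    # svnadmin setlog reads the new log message from a file.
    svnadmin setlog "$repos" "$tmpfile" -r "$rev" "$@"
    status=$?
    rm -f "$tmpfile"
    return "$status"
}

# Usage (repository path and revision are illustrative):
#   set_log_message myrepos 388 "Here is the new, correct log message"
#   set_log_message myrepos 388 "Corrected message" --bypass-hooks
```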

Warning

Remember, though, that by bypassing the hooks, you are
likely avoiding such things as email notifications of
property changes, backup systems that track unversioned
property changes, and so on. In other words, be very
careful about what you are changing, and how you change
it.

Managing Disk Space

While the cost of storage has dropped incredibly in the
past few years, disk usage is still a valid concern for
administrators seeking to version large amounts of data.
Every bit of version history information stored in the live
repository needs to be backed up
elsewhere, perhaps multiple times as part of rotating backup
schedules. It is useful to know what pieces of Subversion's
repository data need to remain on the live site, which need to
be backed up, and which can be safely removed.

How Subversion saves disk space

To keep the repository small,
Subversion uses deltification (or
deltified storage) within the repository
itself. Deltification involves encoding the representation
of a chunk of data as a collection of differences against
some other chunk of data. If the two pieces of data are
very similar, this deltification results in storage savings
for the deltified chunk—rather than taking up space
equal to the size of the original data, it takes up only
enough space to say, “I look just like this other
piece of data over here, except for the following couple of
changes.” The result is that most of the repository
data that tends to be bulky—namely, the contents of
versioned files—is stored at a much smaller size than
the original full-text representation of that
data. And for repositories created with Subversion 1.4 or
later, the space savings are even better—now those
full-text representations of file contents are themselves
compressed.

Note

Because all of the data that is subject to
deltification in a BDB-backed repository is stored in a
single Berkeley DB database file, reducing the size of the
stored values will not immediately reduce the size of the
database file itself. Berkeley DB will, however, keep
internal records of unused areas of the database file and
consume those areas first before growing the size of the
database file. So while deltification doesn't produce
immediate space savings, it can drastically slow future
growth of the database.

Removing dead transactions

Though they are uncommon, there are circumstances in
which a Subversion commit process might fail, leaving behind
in the repository the remnants of the revision-to-be that
wasn't—an uncommitted transaction and all the file and
directory changes associated with it. This could happen for
several reasons: perhaps the client operation was
inelegantly terminated by the user, or a network failure
occurred in the middle of an operation.
Regardless of the reason, dead transactions can happen.
They don't do any real harm, other than consuming disk
space. A fastidious administrator may nonetheless wish to
remove them.

You can use the svnadmin lstxns
command to list the names of the currently outstanding
transactions:

$ svnadmin lstxns myrepos
19
3a1
a45
$

Each item in the resultant output can then be used with
svnlook (and its
--transaction (-t) option)
to determine who created the transaction, when it was
created, what types of changes were made in the
transaction—information that is helpful in determining
whether the transaction is a safe candidate for
removal! If you do indeed want to remove a transaction, its
name can be passed to svnadmin rmtxns,
which will perform the cleanup of the transaction. In fact,
svnadmin rmtxns can take its input
directly from the output of
svnadmin lstxns!

$ svnadmin rmtxns myrepos `svnadmin lstxns myrepos`
$

If you use these two subcommands like this, you should
consider making your repository temporarily inaccessible to
clients. That way, no one can begin a legitimate
transaction before you start your cleanup. Example 5.1, “txn-info.sh (reporting outstanding transactions)”
contains a bit of shell-scripting that can quickly generate
information about each outstanding transaction in your
repository.
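A script of that kind might look something like the sketch below, written here as a shell function (txn_info is a name we chose for illustration). It combines svnadmin lstxns with svnlook info and its -t option, both described above.

```shell
# Sketch of a txn-info.sh-style report: print `svnlook info` for each
# outstanding transaction in the repository at REPOS_PATH.
txn_info() {
    repos=$1
    if [ -z "$repos" ]; then
        echo "usage: txn_info REPOS_PATH" >&2
        return 1
    fi
    for txn in $(svnadmin lstxns "$repos"); do
        printf '%s\n' "---[ Transaction $txn ]---"
        svnlook info "$repos" -t "$txn"
    done
}

# Usage (repository path is illustrative):
#   txn_info myrepos
```

The output for each transaction is the same author, datestamp, length, and log message that svnlook info reports for a revision.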

A long-abandoned transaction usually represents some
sort of failed or interrupted commit. A transaction's
datestamp can provide interesting information—for
example, how likely is it that an operation begun nine
months ago is still active?

In short, transaction cleanup decisions need not be made
unwisely. Various sources of information—including
Apache's error and access logs, Subversion's operational
logs, Subversion revision history, and so on—can be
employed in the decision-making process. And of course, an
administrator can often simply communicate with a seemingly
dead transaction's owner (via email, for example) to verify
that the transaction is, in fact, in a zombie state.

Purging unused Berkeley DB logfiles

Until recently, the largest consumers of disk space
in BDB-backed Subversion repositories were the
logfiles in which Berkeley DB performs its prewrites before
modifying the actual database files. These files capture
all the actions taken along the route of changing the
database from one state to another—while the database
files, at any given time, reflect a particular state, the
logfiles contain all of the many changes along the way
between states. Thus, they can grow
and accumulate quite rapidly.

Fortunately, beginning with the 4.2 release of Berkeley
DB, the database environment has the ability to remove its
own unused logfiles automatically. Any
repositories created using svnadmin
when compiled against Berkeley DB version 4.2 or later
will be configured for this automatic logfile removal. If
you don't want this feature enabled, simply pass the
--bdb-log-keep option to the
svnadmin create command. If you forget
to do this or change your mind at a later time, simply edit
the DB_CONFIG file found in your
repository's db directory, comment out
the line that contains the set_flags
DB_LOG_AUTOREMOVE directive, and then run
svnadmin recover on your repository to
force the configuration changes to take effect. See the section called “Berkeley DB Configuration” for more information about
database configuration.

Without some sort of automatic logfile removal in
place, logfiles will accumulate as you use your repository.
This is actually somewhat of a feature of the database
system—you should be able to recreate your entire
database using nothing but the logfiles, so these files can
be useful for catastrophic database recovery. But
typically, you'll want to archive the logfiles that are no
longer in use by Berkeley DB, and then remove them from disk
to conserve space. Use the svnadmin
list-unused-dblogs command to list the unused
logfiles:
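One way to combine listing, archiving, and removal is sketched below. The function name archive_unused_dblogs and the archive directory are hypothetical. The sketch assumes that svnadmin list-unused-dblogs prints one logfile path per line.

```shell
# Sketch: move unused Berkeley DB logfiles into an archive directory.
# Arguments: REPOS_PATH ARCHIVE_DIR
archive_unused_dblogs() {
    repos=$1 archive_dir=$2
    mkdir -p "$archive_dir" || return 1
    svnadmin list-unused-dblogs "$repos" |
    while IFS= read -r logfile; do
        # Copy first; remove the original only if the copy succeeded.
        cp -p "$logfile" "$archive_dir/" && rm -f "$logfile"
    done
}

# Usage (paths are illustrative):
#   archive_unused_dblogs /var/svn/repos /var/backups/svn-dblogs
```

Copying before removing means an interrupted run leaves every logfile in at least one of the two places.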

BDB-backed repositories whose logfiles are used as
part of a backup or disaster recovery plan should
not make use of the logfile
autoremoval feature. Reconstruction of a repository's
data from logfiles can be accomplished only when
all the logfiles are available. If
some of the logfiles are removed from disk before the
backup system has a chance to copy them elsewhere, the
incomplete set of backed-up logfiles is essentially
useless.

Packing FSFS filesystems

As described in the sidebar
Revision files and shards,
FSFS-backed Subversion repositories create, by default, a
new on-disk file for each revision added to the repository.
Having thousands of these files present on your Subversion
server—even when housed in separate shard
directories—can lead to inefficiencies.

The first problem is that the operating system has to
reference many different files over a short period of time.
This leads to inefficient use of disk caches and, as a
result, more time spent seeking across large disks. Because
of this, Subversion pays a performance penalty when
accessing your versioned data.

The second problem is a bit more subtle. Because of the
ways that most filesystems allocate disk space, each file
claims more space on the disk than it actually uses. The
amount of extra space required to house a single file can
average anywhere from 2 to 16 kilobytes per
file, depending on the underlying
filesystem in use. This translates directly
into a per-revision disk usage penalty for FSFS-backed
repositories. The effect is most pronounced in repositories
which have many small revisions, since the overhead involved
in storing the revision file quickly outgrows the size of
the actual data being stored.

To solve these problems, Subversion 1.6 introduced the
svnadmin pack command. By concatenating
all the files of a completed shard into a single “pack” file
and then removing the original per-revision
files, svnadmin pack reduces the file
count within a given shard down to just a single file. In
doing so, it aids filesystem caches and reduces (to one) the
number of times a file storage overhead penalty is
paid.

Subversion can pack existing sharded repositories which
have been upgraded to the 1.6 filesystem format (see
svnadmin upgrade). To do so,
just run svnadmin pack on the
repository:
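A minimal sketch of such an invocation follows, wrapped in a hypothetical function named pack_if_fsfs. It assumes the backend type is recorded in the repository's db/fs-type file, and it skips anything that is not FSFS, since packing does nothing for other backends.

```shell
# Sketch: pack a repository only when it uses the FSFS backend.
# Assumption: REPOS_PATH/db/fs-type holds the backend name ("fsfs" or "bdb").
pack_if_fsfs() {
    repos=$1
    fstype=$(cat "$repos/db/fs-type" 2>/dev/null)
    if [ "$fstype" = "fsfs" ]; then
        svnadmin pack "$repos"
    else
        echo "skipping $repos: not an FSFS repository" >&2
    fi
}

# Usage (repository path is illustrative):
#   pack_if_fsfs /var/svn/repos
```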

Because the packing process obtains the required locks
before doing its work, you can run it on live repositories,
or even as part of a post-commit hook. Repacking packed
shards is legal, but will have no effect on the disk usage
of the repository.

svnadmin pack has no effect on
BDB-backed Subversion repositories.

Berkeley DB Recovery

As mentioned in the section called “Berkeley DB”, a Berkeley DB
repository can sometimes be left in a frozen state if not closed
properly. When this happens, an administrator needs to rewind
the database back into a consistent state. This is unique to
BDB-backed repositories, though—if you are using
FSFS-backed ones instead, this won't apply to you. And for
those of you using Subversion 1.4 with Berkeley DB 4.4 or
later, you should find that Subversion has become much more
resilient in these types of situations. Still, wedged
Berkeley DB repositories do occur, and an administrator needs
to know how to safely deal with this circumstance.

To protect the data in your repository, Berkeley
DB uses a locking mechanism. This mechanism ensures that
portions of the database are not simultaneously modified by
multiple database accessors, and that each process sees the
data in the correct state when that data is being read from
the database. When a process needs to change something in the
database, it first checks for the existence of a lock on the
target data. If the data is not locked, the process locks the
data, makes the change it wants to make, and then unlocks the
data. Other processes are forced to wait until that lock is
removed before they are permitted to continue accessing that
section of the database. (This has nothing to do with the
locks that you, as a user, can apply to versioned files within
the repository; we try to clear up the confusion caused by
this terminology collision in the sidebar The Three Meanings of “Lock”.)

In the course of using your Subversion repository, fatal
errors or interruptions can prevent a process from having the
chance to remove the locks it has placed in the database. The
result is that the backend database system gets
“wedged.” When this happens, any attempts to
access the repository hang indefinitely (since each new
accessor is waiting for a lock to go away—which isn't
going to happen).

If this happens to your repository, don't panic. The
Berkeley DB filesystem takes advantage of database
transactions, checkpoints, and prewrite journaling to
ensure that only the most catastrophic of events[34]
can permanently destroy a database environment. A
sufficiently paranoid repository administrator will have made
off-site backups of the repository data in some fashion, but
don't head off to the tape backup storage closet just yet.

Instead, use the following recipe to attempt to
“unwedge” your repository:

Make sure no processes are accessing (or
attempting to access) the repository. For networked
repositories, this also means shutting down the Apache HTTP
Server or svnserve daemon.

Become the user who owns and manages the repository.
This is important, as recovering a repository while
running as the wrong user can tweak the permissions of the
repository's files in such a way that your repository will
still be inaccessible even after it is
“unwedged.”

Run the command svnadmin recover
/var/svn/repos. You should see output such as
this:

Repository lock acquired.
Please wait; recovering the repository may take some time...
Recovery completed.
The latest repos revision is 19.

This command may take many minutes to complete.

Restart the server process.

This procedure fixes almost every case of repository
wedging. Make sure that you run this command as the user that
owns and manages the database, not just as
root. Part of the recovery process might
involve re-creating from scratch various database files (shared
memory regions, for example). Recovering as
root will create those files such that they
are owned by root, which means that even
after you restore connectivity to your repository, regular
users will be unable to access it.
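The ownership concern can be partly guarded against in a script. The sketch below, whose function name recover_repos is our own, refuses to run svnadmin recover unless the current user owns the repository's db directory. It relies on the shell's -O file test, which common shells provide but POSIX does not require. Stopping and restarting the server processes remains a manual step.

```shell
# Sketch: run `svnadmin recover` only as the owner of the database files.
# Assumption: the repository's database lives in REPOS_PATH/db.
recover_repos() {
    repos=$1
    # -O is true when the file exists and is owned by the effective user.
    if [ ! -O "$repos/db" ]; then
        echo "refusing: $repos/db is not owned by $(id -un)" >&2
        return 1
    fi
    svnadmin recover "$repos"
}

# Usage (repository path is illustrative):
#   recover_repos /var/svn/repos
```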

If the previous procedure, for some reason, does not
successfully unwedge your repository, you should do two
things. First, move your broken repository directory aside
(perhaps by renaming it to something like
repos.BROKEN) and then restore your
latest backup of it. Then, send an email to the Subversion
users mailing list (at <users@subversion.tigris.org>)
describing your problem in detail. Data integrity is an
extremely high priority to the Subversion developers.

Migrating Repository Data Elsewhere

A Subversion filesystem has its data spread throughout
files in the repository, in a fashion generally
understood by (and of interest to) only the Subversion
developers themselves. However, circumstances may arise that
call for all, or some subset, of that data to be copied or
moved into another repository.

Subversion provides such functionality by way of
repository dump streams. A repository
dump stream (often referred to as a “dump file”
when stored as a file on disk) is a portable, flat file format
that describes the various revisions in your
repository—what was changed, by whom, when, and so on.
This dump stream is the primary mechanism used to marshal
versioned history—in whole or in part, with or without
modification—between repositories. And Subversion
provides the tools necessary for creating and loading these
dump streams: the svnadmin dump and
svnadmin load subcommands,
respectively.

Warning

While the Subversion repository dump format contains
human-readable portions and a familiar structure (it
resembles an RFC 822 format, the same type of format used
for most email), it is not a plain-text
file format. It is a binary file format, highly sensitive
to meddling. For example, many text editors will corrupt
the file by automatically converting line endings.

There are many reasons for dumping and loading Subversion
repository data. Early in Subversion's life, the most common
reason was the evolution of Subversion itself. As
Subversion matured, there were times when changes made to the
backend database schema caused compatibility issues with
previous versions of the repository, so users had to dump
their repository data using the previous version of
Subversion and load it into a freshly created repository with
the new version of Subversion. Now, these types of schema
changes haven't occurred since Subversion's 1.0 release, and
the Subversion developers promise not to force users to dump
and load their repositories when upgrading between minor
versions (such as from 1.3 to 1.4) of Subversion. But there
are still other reasons for dumping and loading, including
re-deploying a Berkeley DB repository on a new OS or CPU
architecture, switching between the Berkeley DB and FSFS
backends, or (as we'll cover later in this chapter in the section called “Filtering Repository History”) purging versioned
data from repository history.

Whatever your reason for migrating repository history,
using the svnadmin dump and
svnadmin load subcommands is
straightforward. svnadmin dump will output
a range of repository revisions that are formatted using
Subversion's custom filesystem dump format. The dump format
is printed to the standard output stream, while informative
messages are printed to the standard error stream. This
allows you to redirect the output stream to a file while
watching the status output in your terminal window. For
example:
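The following is a minimal sketch of such a dump, wrapped in a function we have named dump_repos for illustration. Only standard output is redirected, so the progress messages still appear in the terminal.

```shell
# Sketch: dump a repository's history into a file.
# Arguments: REPOS_PATH DUMPFILE
dump_repos() {
    repos=$1 dumpfile=$2
    # The dump stream goes to the file; status messages written to
    # standard error are left alone and remain visible.
    svnadmin dump "$repos" > "$dumpfile"
}

# Usage (names are illustrative):
#   dump_repos myrepos dumpfile
```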

At the end of the process, you will have a single file
(dumpfile in the previous example) that
contains all the data stored in your repository in the
requested range of revisions. Note that svnadmin
dump is reading revision trees from the repository
just like any other “reader” process would
(e.g., svn checkout), so it's safe
to run this command at any time.

The other subcommand in the pair, svnadmin
load, parses the standard input stream as a
Subversion repository dump file and effectively replays those
dumped revisions into the target repository for that
operation. It also gives informative feedback, this time
using the standard output stream:
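A load might be scripted along these lines. The function name load_repos is hypothetical, the target repository is assumed to exist already, and any extra arguments are handed straight to svnadmin load.

```shell
# Sketch: replay a dump file into an existing repository.
# Arguments: REPOS_PATH DUMPFILE [extra svnadmin load options...]
load_repos() {
    repos=$1 dumpfile=$2
    shift 2
    # svnadmin load reads the dump stream from standard input.
    svnadmin load "$repos" "$@" < "$dumpfile"
}

# Usage (names are illustrative):
#   load_repos newrepos dumpfile
```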

The result of a load is new revisions added to a
repository—the same thing you get by making commits
against that repository from a regular Subversion client.
Just as in a commit, you can use hook programs to perform
actions before and after each of the commits made during a
load process. By passing the
--use-pre-commit-hook and
--use-post-commit-hook options to
svnadmin load, you can instruct Subversion
to execute the pre-commit and post-commit hook programs,
respectively, for each loaded revision. You might use these,
for example, to ensure that loaded revisions pass through the
same validation steps that regular commits pass through. Of
course, you should use these options with care—if your
post-commit hook sends emails to a mailing list for each new
commit, you might not want to spew hundreds or thousands of
commit emails in rapid succession at that list! You can read more about the use of hook
scripts in the section called “Implementing Repository Hooks”.

Note that because svnadmin uses
standard input and output streams for the repository dump and
load processes, people who are feeling especially saucy can try
things such as this (perhaps even using different versions of
svnadmin on each side of the pipe):
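Here is a sketch of that pipeline, using a function name (copy_history) of our own choosing. It creates the new repository first and then connects the dump directly to the load, with no intermediate file.

```shell
# Sketch: copy all history from one repository into a freshly created one.
# Arguments: OLD_REPOS_PATH NEW_REPOS_PATH
copy_history() {
    oldrepos=$1 newrepos=$2
    svnadmin create "$newrepos" || return 1
    # The dump stream flows through the pipe straight into the loader.
    svnadmin dump "$oldrepos" | svnadmin load "$newrepos"
}

# Usage (names are illustrative):
#   copy_history oldrepos newrepos
```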

By default, the dump file will be quite large—much
larger than the repository itself. That's because by default
every version of every file is expressed as a full text in the
dump file. This is the fastest and simplest behavior, and
it's nice if you're piping the dump data directly into some other
process (such as a compression program, filtering program, or
loading process). But if you're creating a dump file
for longer-term storage, you'll likely want to save disk space
by using the --deltas option. With this
option, successive revisions of files will be output as
compressed, binary differences—just as file revisions
are stored in a repository. This option is slower, but it
results in a dump file much closer in size to the original
repository.

We mentioned previously that svnadmin
dump outputs a range of revisions. Use the
--revision (-r) option to
specify a single revision, or a range of revisions, to dump.
If you omit this option, all the existing repository revisions
will be dumped.

As Subversion dumps each new revision, it outputs only
enough information to allow a future loader to re-create that
revision based on the previous one. In other words, for any
given revision in the dump file, only the items that were
changed in that revision will appear in the dump. The only
exception to this rule is the first revision that is dumped
with the current svnadmin dump
command.

By default, Subversion will not express the first dumped
revision as merely differences to be applied to the previous
revision. For one thing, there is no previous revision in the
dump file! For another, Subversion cannot know the state of
the repository into which the dump data will be loaded (if it
ever is). To ensure that the output of each
execution of svnadmin dump is
self-sufficient, the first dumped revision is, by default, a
full representation of every directory, file, and property in
that revision of the repository.

However, you can change this default behavior. If you add
the --incremental option when you dump your
repository, svnadmin will compare the first
dumped revision against the previous revision in the
repository—the same way it treats every other revision that
gets dumped. It will then output the first revision exactly
as it does the rest of the revisions in the dump
range—mentioning only the changes that occurred in that
revision. The benefit of this is that you can create several
small dump files that can be loaded in succession, instead of
one large one, like so:
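The sketch below shows one way to produce such a series. The function name dump_in_chunks, the chunk size, and the file naming scheme are all our own assumptions. The first file is dumped without --incremental so that it is self-sufficient, and every later file uses the option.

```shell
# Sketch: dump revisions 0..YOUNGEST into numbered files of CHUNK revisions.
# Arguments: REPOS_PATH YOUNGEST CHUNK FILE_PREFIX
dump_in_chunks() {
    repos=$1 youngest=$2 chunk=$3 prefix=$4
    start=0 n=1
    while [ "$start" -le "$youngest" ]; do
        end=$((start + chunk - 1))
        [ "$end" -gt "$youngest" ] && end=$youngest
        if [ "$n" -eq 1 ]; then
            # First file: its first revision is a full representation.
            svnadmin dump "$repos" -r "$start:$end" > "$prefix$n"
        else
            # Later files: every revision is expressed as changes only.
            svnadmin dump "$repos" -r "$start:$end" --incremental > "$prefix$n"
        fi || return 1
        start=$((end + 1)) n=$((n + 1))
    done
}

# Usage (names and numbers are illustrative):
#   dump_in_chunks myrepos 3000 1000 dumpfile
```

The resulting files must then be loaded in the same order in which they were dumped, one svnadmin load invocation per file.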

Another neat trick you can perform with this
--incremental option involves appending to an
existing dump file a new range of dumped revisions. For
example, you might have a post-commit hook
that simply appends the repository dump of the single revision
that triggered the hook. Or you might have a script that runs
nightly to append dump file data for all the revisions that
were added to the repository since the last time the script
ran. Used like this, svnadmin dump can be
one way to back up changes to your repository over time in case
of a system crash or some other catastrophic event.
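The post-commit variant might be sketched as follows. The function name append_revision_dump and the idea of passing the dump file path as a third argument are assumptions made for illustration; a post-commit hook itself receives the repository path and the new revision number as its first two arguments.

```shell
#!/bin/sh
# Sketch: append the incremental dump of one revision to a growing dump file.
# Usage: append_revision_dump REPOS_PATH REVISION DUMPFILE
append_revision_dump() {
    repos=$1 rev=$2 dumpfile=$3
    # --incremental keeps the revision as a delta, so it can follow
    # whatever revisions are already in the file.
    svnadmin dump "$repos" -r "$rev" --incremental >> "$dumpfile"
}

# Inside a real post-commit hook this would be called as, for example:
#   append_revision_dump "$1" "$2" /var/backups/repos-dumpfile
```

This sketch does no locking, so two commits finishing at nearly the same moment could interleave their output; a production hook would likely need to serialize the appends.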

The dump format can also be used to merge the contents of
several different repositories into a single repository. By
using the --parent-dir option of
svnadmin load, you can specify a new
virtual root directory for the load process. That means if
you have dump files for three repositories—say
calc-dumpfile,
cal-dumpfile, and
ss-dumpfile—you can first create a new
repository to hold them all:

$ svnadmin create /var/svn/projects
$

Then, make new directories in the repository that will
encapsulate the contents of each of the three previous
repositories:
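One way this could look, including the subsequent load step, is sketched below. The directory names calc, calendar, and spreadsheet and the function name merge_project_dumps are assumptions for illustration; the dump files are expected in the current directory under the names used above.

```shell
#!/bin/sh
# Sketch: give each old repository its own top-level directory, then load
# each dump file beneath that directory with --parent-dir.
# Usage: merge_project_dumps /absolute/path/to/repos
merge_project_dumps() {
    repos=$1
    svn mkdir -m "Initial project roots" \
        "file://$repos/calc" \
        "file://$repos/calendar" \
        "file://$repos/spreadsheet" || return 1
    svnadmin load "$repos" --parent-dir calc < calc-dumpfile || return 1
    svnadmin load "$repos" --parent-dir calendar < cal-dumpfile || return 1
    svnadmin load "$repos" --parent-dir spreadsheet < ss-dumpfile
}
```

For the repository created above, the call would be merge_project_dumps /var/svn/projects.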

We'll mention one final way to use the Subversion
repository dump format—conversion from a different
storage mechanism or version control system altogether.
Because the dump file format is, for the most part,
human-readable, it should be relatively easy to describe
generic sets of changes—each of which should be treated
as a new revision—using this file format. In fact, the
cvs2svn utility (see the section called “Converting a Repository from CVS to Subversion”) uses the dump format to
represent the contents of a CVS repository so that those
contents can be copied into a Subversion repository.

Filtering Repository History

Since Subversion stores your versioned history using, at
the very least, binary differencing algorithms and data
compression (optionally in a completely opaque database
system), attempting manual tweaks is unwise, if not outright
difficult, and in any case strongly discouraged. And once
data has been stored in your repository, Subversion
generally doesn't provide an easy way to remove that data.
[35]
But inevitably, there will be times when you would like to
manipulate the history of your repository. You might need
to strip out all instances of a file that was accidentally
added to the repository (and shouldn't be there for whatever
reason).
[36]
Or, perhaps you have multiple projects sharing a
single repository, and you decide to split them up into
their own repositories. To accomplish tasks such as these,
administrators need a more manageable and malleable
representation of the data in their repositories—the
Subversion repository dump format.

As we described earlier in the section called “Migrating Repository Data Elsewhere”, the Subversion
repository dump format is a human-readable representation of
the changes that you've made to your versioned data over time.
Use the svnadmin dump command to generate
the dump data, and svnadmin load to
populate a new repository with it. The great thing about the
human-readability aspect of the dump format is that, if you
aren't careless about it, you can manually inspect and modify
it. Of course, the downside is that if you have three years'
worth of repository activity encapsulated in what is likely to
be a very large dump file, it could take you a long, long time
to manually inspect and modify it.

That's where svndumpfilter becomes
useful. This program acts as a path-based filter for
repository dump streams. Simply give it either a list of
paths you wish to keep or a list of paths you wish to not
keep, and then pipe your repository dump data through this
filter. The result will be a modified stream of dump data
that contains only the versioned paths you (explicitly or
implicitly) requested.

Let's look at a realistic example of how you might use this
program. Earlier in this chapter (see the section called “Planning Your Repository Organization”), we discussed the
process of deciding how to choose a layout for the data in
your repositories—using one repository per project or
combining them, arranging stuff within your repository, and
so on. But sometimes after new revisions start flying in,
you rethink your layout and would like to make some changes.
A common change is the decision to move multiple projects
that are sharing a single repository into separate
repositories for each project.

Our imaginary repository contains three projects:
calc, calendar, and
spreadsheet. They have been living
side-by-side in a layout like this:
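The sketch below records an assumed layout in its comments (each project with its own trunk, branches, and tags) and shows how the projects might be separated: dump the whole repository once, then pass that dump through svndumpfilter include once per project. The function name split_projects and the output file names are illustrative assumptions.

```shell
#!/bin/sh
# Assumed layout of the shared repository:
#   /calc/{trunk,branches,tags}
#   /calendar/{trunk,branches,tags}
#   /spreadsheet/{trunk,branches,tags}
#
# Usage: split_projects REPOS_PATH   (writes dump files to the current dir)
split_projects() {
    repos=$1
    # Dump every revision of the shared repository once.
    svnadmin dump "$repos" > repos-dumpfile || return 1
    # Filter the same dump three times, keeping one top-level directory each.
    svndumpfilter include calc < repos-dumpfile > calc-dumpfile || return 1
    svndumpfilter include calendar < repos-dumpfile > cal-dumpfile || return 1
    svndumpfilter include spreadsheet < repos-dumpfile > ss-dumpfile
}
```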

At this point, you have to make a decision. Each of your
dump files will create a valid repository, but will preserve
the paths exactly as they were in the original repository.
This means that even though you would have a repository solely
for your calc project, that repository
would still have a top-level directory named
calc. If you want your
trunk, tags, and
branches directories to live in the root
of your repository, you might wish to edit your dump files,
tweaking the Node-path and
Node-copyfrom-path headers so that they no
longer have that first calc/ path
component. Also, you'll want to remove the section of dump
data that creates the calc directory. It
will look something like the following:

Node-path: calc
Node-action: add
Node-kind: dir
Content-length: 0

Warning

If you do plan on manually editing the dump file to
remove a top-level directory, make sure your editor is
not set to automatically convert end-of-line characters to
the native format (e.g., \r\n to
\n), as the content will then not agree
with the metadata. This will render the dump file
useless.
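If you prefer a stream editor over an interactive one, the header rewrite could be sketched as below. This is an assumption-laden shortcut, not a Subversion feature: the function name strip_calc_prefix is hypothetical, the patterns would also match any file content line that happens to begin with the same text (which would break the recorded content lengths), and the record that creates the calc directory still has to be removed separately.

```shell
#!/bin/sh
# Sketch: drop the leading "calc/" component from path headers.
# Reads a dump stream on stdin and writes the edited stream to stdout.
strip_calc_prefix() {
    # LC_ALL=C makes sed treat the stream as bytes, which matters because
    # dump files can carry binary file contents.
    LC_ALL=C sed \
        -e 's|^Node-path: calc/|Node-path: |' \
        -e 's|^Node-copyfrom-path: calc/|Node-copyfrom-path: |'
}
```

A typical invocation would be strip_calc_prefix < calc-dumpfile > calc-dumpfile.edited, followed by a test load into a scratch repository before the result is trusted.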

All that remains now is to create your three new
repositories, and load each dump file into the right
repository, ignoring the UUID found in the dump stream:
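A minimal sketch of that step, assuming the three dump files sit in the current directory and that the new repositories are created there too (the function name load_split_projects and the repository names are illustrative):

```shell
#!/bin/sh
# Sketch: create one repository per project and load its filtered dump.
load_split_projects() {
    svnadmin create calc || return 1
    # --ignore-uuid keeps the fresh repository's own UUID instead of
    # adopting the one recorded in the dump stream.
    svnadmin load --ignore-uuid calc < calc-dumpfile || return 1

    svnadmin create calendar || return 1
    svnadmin load --ignore-uuid calendar < cal-dumpfile || return 1

    svnadmin create spreadsheet || return 1
    svnadmin load --ignore-uuid spreadsheet < ss-dumpfile
}
```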

Both of svndumpfilter's subcommands
accept options for deciding how to deal with
“empty” revisions. If a given revision
contains only changes to paths that were filtered out, that
now-empty revision could be considered uninteresting or even
unwanted. So to give the user control over what to do with
those revisions, svndumpfilter provides
the following command-line options:

--drop-empty-revs

Do not generate empty revisions at all—just
omit them.

--renumber-revs

If empty revisions are dropped (using the
--drop-empty-revs option), change the
revision numbers of the remaining revisions so that
there are no gaps in the numeric sequence.

--preserve-revprops

If empty revisions are not dropped, preserve the
revision properties (log message, author, date, custom
properties, etc.) for those empty revisions.
Otherwise, empty revisions will contain only the
original datestamp, and a generated log message that
indicates that this revision was emptied by
svndumpfilter.
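As a brief illustration, the first two options are often combined so that the filtered history has neither empty revisions nor gaps in its numbering. The wrapper name filter_compact below is a hypothetical convenience, not part of Subversion.

```shell
#!/bin/sh
# Sketch: keep only the given paths, drop revisions that become empty,
# and renumber what remains so the sequence has no gaps.
# Reads a dump stream on stdin, writes the filtered stream to stdout.
filter_compact() {
    svndumpfilter include --drop-empty-revs --renumber-revs "$@"
}

# Example use:
#   filter_compact calc < repos-dumpfile > calc-dumpfile
```

Keep in mind that renumbering changes revision numbers, so any references to old revision numbers in log messages or issue trackers will no longer line up.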

While svndumpfilter can be very
useful and a huge timesaver, there are unfortunately a
couple of gotchas. First, this utility is overly sensitive
to path semantics. Pay attention to whether paths in your
dump file are specified with or without leading slashes.
You'll want to look at the Node-path and
Node-copyfrom-path headers.

…
Node-path: spreadsheet/Makefile
…

If the paths have leading slashes, you should
include leading slashes in the paths you pass to
svndumpfilter include and
svndumpfilter exclude (and if they don't,
you shouldn't). Further, if your dump file has an inconsistent
usage of leading slashes for some reason,
[37]
you should probably normalize those paths so that they all
have, or all lack, leading slashes.

Also, copied paths can give you some trouble.
Subversion supports copy operations in the repository, where
a new path is created by copying some already existing path.
It is possible that at some point in the lifetime of your
repository, you might have copied a file or directory from
some location that svndumpfilter is
excluding, to a location that it is including. To
make the dump data self-sufficient,
svndumpfilter needs to still show the
addition of the new path—including the contents of any
files created by the copy—and not represent that
addition as a copy from a source that won't exist in your
filtered dump data stream. But because the Subversion
repository dump format shows only what was changed in each
revision, the contents of the copy source might not be
readily available. If you suspect that you have any copies
of this sort in your repository, you might want to rethink
your set of included/excluded paths, perhaps including the
paths that served as sources of your troublesome copy
operations, too.

Finally, svndumpfilter takes path
filtering quite literally. If you are trying to copy the
history of a project rooted at
trunk/my-project and move it into a
repository of its own, you would, of course, use the
svndumpfilter include command to keep all
the changes in and under
trunk/my-project. But the resultant
dump file makes no assumptions about the repository into
which you plan to load this data. Specifically, the dump
data might begin with the revision that added the
trunk/my-project directory, but it will
not contain directives that would
create the trunk directory itself
(because trunk doesn't match the
include filter). You'll need to make sure that any
directories that the new dump stream expects to exist
actually do exist in the target repository before trying to
load the stream into that repository.
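For the trunk/my-project case just described, the preparation might be sketched like this. The function name load_with_parent and the log message are assumptions; the repository path must be absolute so that it forms a valid file:// URL.

```shell
#!/bin/sh
# Sketch: create the missing parent directory, then load the filtered dump.
# Usage: load_with_parent /absolute/path/to/new-repos DUMPFILE
load_with_parent() {
    repos=$1 dumpfile=$2
    # The filtered stream adds trunk/my-project but never creates trunk,
    # so create it first.
    svn mkdir -m "Create trunk for filtered history" \
        "file://$repos/trunk" || return 1
    svnadmin load "$repos" < "$dumpfile"
}
```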

Repository Replication

There are several scenarios in which it is quite handy to
have a Subversion repository whose version history is exactly
the same as some other repository's. Perhaps the most obvious
one is the maintenance of a simple backup repository, used
when the primary repository has become inaccessible due to a
hardware failure, network outage, or other such annoyance.
Other scenarios include deploying mirror repositories to
distribute heavy Subversion load across multiple servers, using
a mirror as a soft-upgrade mechanism, and so on.

As of version 1.4, Subversion provides a program for
managing scenarios such as
these—svnsync. This works by
essentially asking the Subversion server to
“replay” revisions, one at a time. It then uses
that revision information to mimic a commit of the same to
another repository. Neither repository needs to be locally
accessible to the machine on which svnsync is
running—its parameters are repository URLs, and it does
all its work through Subversion's Repository Access (RA)
interfaces. All it requires is read access to the source
repository and read/write access to the destination
repository.

Note

When using svnsync against a remote
source repository, the Subversion server for that repository
must be running Subversion version 1.4 or later.

Assuming you already have a source repository that you'd
like to mirror, the next thing you need is an empty target
repository that will actually serve as that mirror. This
target repository can use either of the available filesystem
data-store backends (see the section called “Choosing a Data Store”), but it must not
yet have any version history in it. The protocol that
svnsync uses to communicate revision information
is highly sensitive to mismatches between the versioned
histories contained in the source and target repositories.
For this reason, while svnsync cannot
demand that the target repository be
read-only,
[38]
allowing the revision history in the target repository to
change by any mechanism other than the mirroring process is a
recipe for disaster.

Warning

Do not modify a mirror repository
in such a way as to cause its version history to deviate
from that of the repository it mirrors. The only commits
and revision property modifications that ever occur on that
mirror repository should be those performed by the
svnsync tool.

Another requirement of the target repository is that the
svnsync process be allowed to modify
revision properties. Because svnsync works
within the framework of that repository's hook system, the
default state of the repository (which is to disallow revision
property changes; see pre-revprop-change) is
insufficient. You'll need to explicitly implement the
pre-revprop-change hook, and your script must allow
svnsync to set and change revision
properties. With those provisions in place, you are ready to
start mirroring repository revisions.

Tip

It's a good idea to implement authorization measures
that allow your repository replication process to perform
its tasks while preventing other users from modifying the
contents of your mirror repository at all.

Let's walk through the use of svnsync
in a somewhat typical mirroring scenario. We'll pepper this
discourse with practical recommendations, which you are free to
disregard if they aren't required by or suitable for your
environment.

As a service to the fine developers of our favorite
version control system, we will be mirroring the public
Subversion source code repository and exposing that mirror
publicly on the Internet, hosted on a different machine than
the one on which the original Subversion source code
repository lives. This remote host has a global configuration
that permits anonymous users to read the contents of
repositories on the host, but requires users to authenticate
to modify those repositories. (Please forgive us for
glossing over the details of Subversion server configuration
for the moment—those are covered thoroughly in Chapter 6, Server Configuration.) And for no other reason than
that it makes for a more interesting example, we'll be driving
the replication process from a third machine—the one that
we currently find ourselves using.

First, we'll create the repository which will be our
mirror. This and the next couple of steps do require shell
access to the machine on which the mirror repository will
live. Once the repository is all configured, though, we
shouldn't need to touch it directly again.

At this point, we have our repository, and due to our
server's configuration, that repository is now
“live” on the Internet. Now, because we don't
want anything modifying the repository except our replication
process, we need a way to distinguish that process from other
would-be committers. To do so, we use a dedicated username
for our process. Only commits and revision property
modifications performed by the special username
syncuser will be allowed.

We'll use the repository's hook system both to allow the
replication process to do what it needs to do and to enforce
that only it is doing those things. We accomplish this by
implementing two of the repository event
hooks—pre-revprop-change and start-commit. Our
pre-revprop-change hook script is found
in Example 5.2, “Mirror repository's pre-revprop-change hook script”, and basically verifies that the user attempting the
property changes is our syncuser user. If
so, the change is allowed; otherwise, it is denied.
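Both checks can be sketched in a few lines of shell. The versions below are written as functions so that they can be exercised in isolation; they are a sketch of the idea and not necessarily identical to the numbered example referenced above. Subversion passes the username as the third argument to pre-revprop-change and as the second argument to start-commit.

```shell
#!/bin/sh
# pre-revprop-change arguments: REPOS-PATH REVISION USER PROPNAME ACTION
pre_revprop_change() {
    if [ "$3" = "syncuser" ]; then return 0; fi
    echo "Only the syncuser user may change revision properties" >&2
    return 1
}

# start-commit arguments: REPOS-PATH USER CAPABILITIES
start_commit() {
    if [ "$2" = "syncuser" ]; then return 0; fi
    echo "Only the syncuser user may commit new revisions" >&2
    return 1
}

# A real hook file would contain one function and end with, for example:
#   pre_revprop_change "$@"; exit $?
```

A nonzero exit status from either hook makes Subversion refuse the operation and relay the message written to standard error back to the client.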

After installing our hook scripts and ensuring that they
are executable by the Subversion server, we're finished with
the setup of the mirror repository. Now, we get to actually
do the mirroring.

The first thing we need to do with
svnsync is to register in our target
repository the fact that it will be a mirror of the source
repository. We do this using the svnsync
initialize subcommand. The URLs we provide point to
the root directories of the target and source repositories,
respectively. In Subversion 1.4, this is required—only
full mirroring of repositories is permitted. In Subversion
1.5, though, you can use svnsync to mirror
only some subtree of the repository, too.
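A sketch of the initialization step follows. The function name init_mirror and the SYNC_PASSWORD environment variable are assumptions for illustration, and the --sync-username and --sync-password options shown require Subversion 1.5 or later.

```shell
#!/bin/sh
# Sketch: register DEST_URL as a mirror of SOURCE_URL.
# Usage: SYNC_PASSWORD=... init_mirror DEST_URL SOURCE_URL
init_mirror() {
    dest_url=$1 source_url=$2
    # Note the argument order: the target (mirror) URL comes first.
    svnsync initialize "$dest_url" "$source_url" \
        --sync-username syncuser --sync-password "$SYNC_PASSWORD"
}

# Example use (URLs are illustrative):
#   init_mirror http://svn.example.com/svn-mirror \
#               http://svn.collab.net/repos/svn
```

Passing a password on the command line can expose it to other users of the machine through the process list, so a cached credential may be preferable where that matters.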

Our target repository will now remember that it is a
mirror of the public Subversion source code repository.
Notice that we provided a username and password as arguments
to svnsync—that was required by the
pre-revprop-change hook on our mirror repository.

Note

In Subversion 1.4, the values given to
svnsync's --username and
--password command-line options were used
for authentication against both the source and destination
repositories. This caused problems when a user's
credentials weren't exactly the same for both repositories,
especially when running in noninteractive mode (with the
--non-interactive option).

This has been fixed in Subversion 1.5 with the
introduction of two new pairs of options. Use
--source-username and
--source-password to provide authentication
credentials for the source repository; use
--sync-username and
--sync-password to provide credentials for
the destination repository. (The old
--username and --password
options still exist for compatibility, but we advise against
using them.)

And now comes the fun part. With a single subcommand, we
can tell svnsync to copy all the
as-yet-unmirrored revisions from the source repository to the
target.
[39]
The svnsync synchronize subcommand will
peek into the special revision properties previously stored on
the target repository, and determine both which repository it
is mirroring and that the most recently mirrored
revision was revision 0. Then it will query the source
repository and determine what the latest revision in that
repository is. Finally, it asks the source repository's
server to start replaying all the revisions between 0 and that
latest revision. As svnsync gets the
resultant response from the source repository's server, it
begins forwarding those revisions to the target repository's
server as new commits.

Of particular interest here is that for each mirrored
revision, there is first a commit of that revision to the
target repository, and then property changes follow. This is
because the initial commit is performed by (and attributed to)
the user syncuser, and it is datestamped
with the time as of that revision's creation. Also,
Subversion's underlying repository access interfaces don't
provide a mechanism for setting arbitrary revision properties
as part of a commit. So svnsync follows up
with an immediate series of property modifications that copy
into the target repository all the revision properties found
for that revision in the source repository. This also has the
effect of fixing the author and datestamp of the revision to
match that of the source repository.

Also noteworthy is that svnsync
performs careful bookkeeping that allows it to be safely
interrupted and restarted without ruining the integrity of the
mirrored data. If a network glitch occurs while mirroring a
repository, simply repeat the svnsync
synchronize command, and it will happily pick up
right where it left off. In fact, as new revisions appear in
the source repository, this is exactly what you do
to keep your mirror up to date.
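Because the command is safe to repeat, a small wrapper can simply rerun it after a failure. The sketch below assumes a hypothetical function name, sync_with_retry, and a default of three attempts; it also assumes that credentials are already cached or otherwise supplied.

```shell
#!/bin/sh
# Sketch: run "svnsync synchronize" and repeat it if it fails.
# Usage: sync_with_retry DEST_URL [ATTEMPTS]
sync_with_retry() {
    dest_url=$1 attempts=${2:-3}
    i=1
    while [ "$i" -le "$attempts" ]; do
        # Each run resumes after the last revision that was fully mirrored.
        if svnsync synchronize "$dest_url"; then
            return 0
        fi
        echo "svnsync synchronize failed (attempt $i of $attempts)" >&2
        i=$((i + 1))
    done
    return 1
}
```

The same function, run periodically, also serves to pull in revisions that have appeared in the source repository since the previous run.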

svnsync Bookkeeping

svnsync needs to be able to set and
modify revision properties on the mirror repository because
those properties are part of the data it is tasked with
mirroring. As those properties change in the source
repository, those changes need to be reflected in the mirror
repository, too. But svnsync also uses a
set of custom revision properties—stored in revision 0
of the mirror repository—for its own internal
bookkeeping. These properties contain information such as
the URL and UUID of the source repository, plus some
additional state-tracking information.

One of those pieces of state-tracking information is a
flag that essentially just means “there's a
synchronization in progress right now.” This is used
to prevent multiple svnsync processes
from colliding with each other while trying to mirror data
to the same destination repository. Now, generally you
won't need to pay any attention whatsoever to
any of these special properties (all of
which begin with the prefix svn:sync-).
Occasionally, though, if a synchronization fails
unexpectedly, Subversion never has a chance to remove this
particular state flag. This causes all future
synchronization attempts to fail because it appears that a
synchronization is still in progress when, in fact, none is.
Fortunately, recovering from this situation is as simple as
removing the svn:sync-lock property which
serves as this flag from revision 0 of the mirror
repository:
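A sketch of that recovery step is shown here; the function name clear_sync_lock is an assumption, and the command must be run as a user whom the mirror's pre-revprop-change hook permits to change revision properties.

```shell
#!/bin/sh
# Sketch: remove a stale svn:sync-lock property from revision 0 of a mirror.
# Usage: clear_sync_lock MIRROR_URL
clear_sync_lock() {
    svn propdel --revprop -r0 svn:sync-lock "$1"
}

# Example use:
#   clear_sync_lock http://svn.example.com/svn-mirror
```

Only do this once you are sure that no svnsync process is actually still running against the mirror.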

That svnsync stores the source
repository URL in a bookkeeping property on the mirror
repository is the reason why you have to specify that
URL only once, during svnsync init. Future
synchronization operations against that mirror simply
consult the special svn:sync-from-url
property stored on the mirror itself to know where
to synchronize from. This value is used literally by the
synchronization process, though. So while from within
CollabNet's network you can perhaps access our example
source URL as http://svn/repos/svn
(because that first svn magically gets
.collab.net appended to it by DNS
voodoo), if you later need to update that mirror from
another machine outside CollabNet's network, the
synchronization might fail (because the hostname
svn is ambiguous). For this reason, it's
best to use fully qualified source repository URLs when
initializing a mirror repository rather than those that
refer to only hostnames or IP addresses (which can change
over time). But here again, if you need an existing mirror
to start referring to a different URL for the same source
repository, you can change the bookkeeping property which
houses that information:
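That change might be sketched as follows, with set_sync_source as a hypothetical function name. As with any revision property change on the mirror, the pre-revprop-change hook has to allow it for the user performing it.

```shell
#!/bin/sh
# Sketch: point an existing mirror at a different URL for the same source.
# Usage: set_sync_source NEW_SOURCE_URL MIRROR_URL
set_sync_source() {
    new_source_url=$1 mirror_url=$2
    svn propset --revprop -r0 svn:sync-from-url "$new_source_url" "$mirror_url"
}
```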

Another interesting thing about these special
bookkeeping properties is that svnsync
will not attempt to mirror any of those properties when they
are found in the source repository. The reason is probably
obvious, but basically boils down to
svnsync not being able to distinguish the
special properties it has merely copied from the source
repository from those it needs to consult and maintain for
its own bookkeeping needs. This situation could occur if,
for example, you were maintaining a mirror of a mirror of a
third repository. When svnsync sees its
own special properties in revision 0 of the source
repository, it simply ignores them.

In Subversion 1.6, an svnsync info
subcommand has been added to easily display the special
bookkeeping properties in the destination repository.

There is, however, one bit of inelegance in the process.
Because Subversion revision properties can be changed at any
time throughout the lifetime of the repository, and because
they don't leave an audit trail that indicates when they were
changed, replication processes have to pay special attention
to them. If you've already mirrored the first 15 revisions of
a repository and someone then changes a revision property on
revision 12, svnsync won't know to go back
and patch up its copy of revision 12. You'll need to tell it
to do so manually by using (or with some additional tooling
around) the svnsync copy-revprops
subcommand, which simply rereplicates all the revision
properties for a particular revision or range thereof.

$ svnsync help copy-revprops
copy-revprops: usage: svnsync copy-revprops DEST_URL [REV[:REV2]]
Copy the revision properties in a given range of revisions to the
destination from the source with which it was initialized.
…
$ svnsync copy-revprops http://svn.example.com/svn-mirror 12
Copied properties for revision 12.
$

That's repository replication in a nutshell. You'll
likely want some automation around such a process. For
example, while our example was a pull-and-push setup, you
might wish to have your primary repository push changes to one
or more blessed mirrors as part of its post-commit and
post-revprop-change hook implementations. This would enable
the mirror to be up to date in as near to real time as is
likely possible.
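A push-style setup could be sketched with two small hook bodies on the primary repository. The function names and the MIRROR_URL variable are assumptions for illustration, and the sketch presumes that the account running the hooks already has cached credentials for the mirror.

```shell
#!/bin/sh
# Assumed location of the mirror; adjust for your environment.
MIRROR_URL=http://svn.example.com/svn-mirror

# post-commit arguments: REPOS-PATH REVISION
post_commit() {
    # Push any revisions the mirror does not have yet.
    svnsync synchronize --non-interactive "$MIRROR_URL"
}

# post-revprop-change arguments: REPOS-PATH REVISION USER PROPNAME ACTION
post_revprop_change() {
    # Recopy the revision properties of the revision that was just edited.
    svnsync copy-revprops --non-interactive "$MIRROR_URL" "$2"
}
```

Run in the foreground like this, the commit does not return to the client until the push finishes; many administrators would background the call or hand it to a queue instead.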

Also, while it isn't very commonplace to do so,
svnsync does gracefully mirror repositories
in which the user it authenticates as has only partial
read access. It simply copies only the bits of the repository
that it is permitted to see. Obviously, such a mirror is not
useful as a backup solution.

In Subversion 1.5, svnsync grew the
ability to also mirror a subset of a repository rather than
the whole thing. The process of setting up and maintaining
such a mirror is exactly the same as when mirroring a whole
repository, except that instead of specifying the source
repository's root URL when running svnsync
init, you specify the URL of some subdirectory
within that repository. Synchronization to that mirror will
now copy only the bits that changed under that source
repository subdirectory. There are some limitations to this
support, though. First, you can't mirror multiple disjoint
subdirectories of the source repository into a single mirror
repository—you'd need to instead mirror some parent
directory that is common to both. Second, the filtering
logic is entirely path-based, so if the subdirectory you are
mirroring was renamed at some point in the past, your mirror
would contain only the revisions since the directory appeared
at the URL you specified. And likewise, if the source
subdirectory is renamed in the future, your synchronization
processes will stop mirroring data at the point that the
source URL you specified is no longer valid.

As far as user interaction with repositories and mirrors
goes, it is possible to have a single
working copy that interacts with both, but you'll have to jump
through some hoops to make it happen. First, you need to
ensure that both the primary and mirror repositories have the
same repository UUID (which is not the case by default). See
the section called “Managing Repository UUIDs” later in this
chapter for more about this.

Once the two repositories have the same UUID, you can use
svn switch with the --relocate option to point your working
copy to whichever of the repositories you wish to operate
against, a process that is described in svn switch. There is a possible danger
here, though, in that if the primary and mirror repositories
aren't in close synchronization, a working copy up to date
with, and pointing to, the primary repository will, if
relocated to point to an out-of-date mirror, become confused
about the apparent sudden loss of revisions it fully expects
to be present, and it will throw errors to that effect. If
this occurs, you can relocate your working copy back to the
primary repository and then either wait until the mirror
repository is up to date, or backdate your working copy to a
revision you know is present in the mirror repository, and then
retry the relocation.

Finally, be aware that the revision-based replication
provided by svnsync is only
that—replication of revisions. Only information carried
by the Subversion repository dump file format is available for
replication. As such, svnsync has the same
sorts of limitations that the repository dump stream has, and
does not include such things as the hook implementations,
repository or server configuration data, uncommitted
transactions, or information about user locks on repository
paths.

Repository Backup

Despite numerous advances in technology since the birth of
the modern computer, one thing unfortunately rings true with
crystalline clarity—sometimes things go very, very
awry. Power outages, network connectivity dropouts, corrupt
RAM, and crashed hard drives are but a taste of the evil that
Fate is poised to unleash on even the most conscientious
administrator. And so we arrive at a very important
topic—how to make backup copies of your repository
data.

There are two types of backup methods available for
Subversion repository administrators—full and
incremental. A full backup of the repository involves
squirreling away in one sweeping action all the information
required to fully reconstruct that repository in the event of
a catastrophe. Usually, it means, quite literally, the
duplication of the entire repository directory (which includes
either a Berkeley DB or FSFS environment). Incremental
backups are lesser things: backups of only the portion of the
repository data that has changed since the previous
backup.

As far as full backups go, the naïve approach might seem
like a sane one, but unless you temporarily disable all other
access to your repository, simply doing a recursive directory
copy runs the risk of generating a faulty backup. In the case
of Berkeley DB, the documentation describes a certain order in
which database files can be copied that will guarantee a valid
backup copy. A similar ordering exists for FSFS data. But
you don't have to implement these algorithms yourself, because
the Subversion development team has already done so. The
svnadmin hotcopy command takes care of the
minutiae involved in making a hot backup of your repository.
And its invocation is as trivial as the Unix
cp or Windows copy
operations:

$ svnadmin hotcopy /var/svn/repos /var/svn/repos-backup

The resultant backup is a fully functional Subversion
repository, able to be dropped in as a replacement for your
live repository should something go horribly wrong.

When making copies of a Berkeley DB repository, you can
even instruct svnadmin hotcopy to purge any
unused Berkeley DB logfiles (see the section called “Purging unused Berkeley DB logfiles”) from the
original repository upon completion of the copy. Simply
provide the --clean-logs option on the
command line.

Additional tooling around this command is available, too.
The tools/backup/ directory of the
Subversion source distribution holds the
hot-backup.py script. This script adds a
bit of backup management atop svnadmin
hotcopy, allowing you to keep only the most recent
configured number of backups of each repository. It will
automatically manage the names of the backed-up repository
directories to avoid collisions with previous backups and
will “rotate off” older backups, deleting them so
that only the most recent ones remain. Even if you also have an
incremental backup, you might want to run this program on a
regular basis. For example, you might consider using
hot-backup.py from a program scheduler
(such as cron on Unix systems), which can
cause it to run nightly (or at whatever granularity of time
you deem safe).

Some administrators use a different backup mechanism built
around generating and storing repository dump data. We
described in the section called “Migrating Repository Data Elsewhere”
how to use svnadmin dump with the --incremental option to
perform an incremental backup of a given revision or range of
revisions. And of course, you can achieve a full backup variation of
this by omitting the --incremental
option to that command. There is some value in these methods,
in that the format of your backed-up information is
flexible—it's not tied to a particular platform,
versioned filesystem type, or release of Subversion or
Berkeley DB. But that flexibility comes at a cost, namely
that restoring that data can take a long time—longer
with each new revision committed to your repository. Also, as
is the case with so many of the various backup methods,
revision property changes that are made to already backed-up
revisions won't get picked up by a nonoverlapping,
incremental dump generation. For these reasons, we recommend
against relying solely on dump-based backup approaches.

As you can see, each of the various backup types and
methods has its advantages and disadvantages. The easiest is
by far the full hot backup, which will always result in a
perfect working replica of your repository. Should something
bad happen to your live repository, you can restore from the
backup with a simple recursive directory copy. Unfortunately,
if you are maintaining multiple backups of your repository,
these full copies will each eat up just as much disk space as
your live repository. Incremental backups, by contrast, tend
to be quicker to generate and smaller to store. But the
restoration process can be a pain, often involving applying
multiple incremental backups. And other methods have their
own peculiarities. Administrators need to find the balance
between the cost of making the backup and the cost of
restoring it.

The svnsync program (see the section called “Repository Replication”) actually
provides a rather handy middle-ground approach. If you are
regularly synchronizing a read-only mirror with your main
repository, in a pinch your read-only mirror is probably
a good candidate for replacing that main repository if it
falls over. The primary disadvantage of this method is that
only the versioned repository data gets
synchronized—repository configuration files,
user-specified repository path locks, and other items that
might live in the physical repository directory but not
inside the repository's virtual versioned
filesystem are not handled by svnsync.
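
One way to compensate for that gap is to archive the unversioned pieces yourself whenever you refresh the mirror. The sketch below assumes the standard repository layout, in which configuration lives in conf and hook scripts live in hooks; the function name and archive path are hypothetical, and user path locks are not covered by it.

```shell
#!/bin/sh
# Sketch: bring a mirror up to date, then save the parts of the live
# repository that svnsync does not copy. The function name is
# hypothetical, and path locks are deliberately left out.
refresh_mirror() {
    mirror_url=$1
    live=$2
    archive=$3
    # Copy any new revisions from the source repository to the mirror.
    svnsync synchronize "$mirror_url"
    # Archive the configuration files and hook scripts separately.
    tar -czf "$archive" -C "$live" conf hooks
}
```

A call such as refresh_mirror file:///var/svn/mirror /var/svn/repos /var/backups/repos-unversioned.tar.gz could then be run from a scheduled job.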

In any backup scenario, repository administrators need
to be aware of how modifications to unversioned revision
properties affect their backups. Since these changes do not
themselves generate new revisions, they will not trigger
post-commit hooks, and may not even trigger the
pre-revprop-change and post-revprop-change hooks.
[40]
And since you can change revision properties without respect
to chronological order—you can change any revision's
properties at any time—an incremental backup of the
latest few revisions might not catch a property modification
to a revision that was included as part of a previous
backup.

Generally speaking, only the truly paranoid would need to
back up their entire repository, say, every time a commit
occurred. However, assuming that a given repository has some
other redundancy mechanism in place with relatively fine
granularity (such as per-commit emails or incremental dumps), a
hot backup of the database might be something that a
repository administrator would want to include as part of a
system-wide nightly backup. It's your data—protect it
as much as you'd like.

Often, the best approach to repository backups is a
diversified one that leverages combinations of the methods
described here. The Subversion developers, for example, back
up the Subversion source code repository nightly using
hot-backup.py and an off-site
rsync of those full backups; keep multiple
archives of all the commit and property change notification
emails; and have repository mirrors maintained by various
volunteers using svnsync. Your solution
might be similar, but should be tailored to your needs and that
delicate balance of convenience with paranoia. And whatever
you do, validate your backups from time to time—what
good is a spare tire that has a hole in it? While all of this
might not save your hardware from the iron fist of Fate,
[41]
it should certainly help you recover from those trying
times.
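
Validating a backup can be as simple as the following sketch, which runs svnadmin verify against the backup copy and reports how far it lags behind the live repository. The function name and message wording are assumptions for illustration only.

```shell
#!/bin/sh
# Sketch: check that a backup is readable and report its youngest
# revision alongside that of the live repository. The function name
# and messages are hypothetical.
check_backup() {
    live=$1
    backup=$2
    if ! svnadmin verify "$backup"; then
        echo "backup at $backup failed verification" >&2
        return 1
    fi
    backup_rev=$(svnlook youngest "$backup")
    live_rev=$(svnlook youngest "$live")
    echo "backup verified at r$backup_rev (live repository is at r$live_rev)"
}
```

A backup that trails the live repository by a few revisions is normal if commits have happened since it was made; a backup that fails verification is the spare tire with a hole in it.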

Managing Repository UUIDs

Subversion repositories have a universally unique
identifier (UUID) associated with them. This is used by
Subversion clients to verify the identity of a repository when
other forms of verification aren't good enough (such as
checking the repository URL, which can change over time).
Most Subversion repository administrators rarely, if ever,
need to think about repository UUIDs as anything more than a
trivial implementation detail of Subversion. Sometimes,
however, there is cause for attention to this detail.

As a general rule, you want the UUIDs of your live
repositories to be unique. That is, after all, the point of
having UUIDs. But there are times when you want the
repository UUIDs of two repositories to be exactly the same.
For example, if you make a copy of a repository for backup
purposes, you want the backup to be a perfect replica of the
original so that, in the event that you have to restore that
backup and replace the live repository, users don't suddenly
see what looks like a different repository. When dumping and
loading repository history (as described earlier in the section called “Migrating Repository Data Elsewhere”), you get to decide
whether to apply the UUID encapsulated in the data dump
stream to the repository in which you are loading the data. The
particular circumstance will dictate the correct
behavior.

There are a couple of ways to set (or reset) a
repository's UUID, should you need to. As of Subversion 1.5,
this is as simple as using the svnadmin
setuuid command. If you provide this subcommand
with an explicit UUID, it will validate that the UUID is
well-formed and then set the repository UUID to that value.
If you omit the UUID, a brand-new UUID will be generated for
your repository.
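
Both uses of svnadmin setuuid can be sketched as small shell functions. The function names are hypothetical; svnlook uuid is used here only to read back a repository's current UUID.

```shell
#!/bin/sh
# Sketch of the two forms of "svnadmin setuuid" (Subversion 1.5 and
# later). The function names are hypothetical.

# Give TARGET the same UUID as SOURCE, for example when TARGET is a
# restored copy that should look identical to clients.
copy_uuid() {
    source_repos=$1
    target_repos=$2
    svnadmin setuuid "$target_repos" "$(svnlook uuid "$source_repos")"
}

# Give a repository a brand-new UUID, then print the result.
fresh_uuid() {
    svnadmin setuuid "$1"
    svnlook uuid "$1"
}
```

Which of the two you want depends on whether clients should treat the repository as the same one they already know or as a different one.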

For folks using versions of Subversion earlier than 1.5,
these tasks are a little more complicated. You can explicitly
set a repository's UUID by piping a repository dump file stub
that carries the new UUID specification through
svnadmin load --force-uuid REPOS-PATH.

Having older versions of Subversion generate a brand-new
UUID is not quite as simple to do, though. Your best bet here
is to find some other way to generate a UUID, and then
explicitly set the repository's UUID to that value.
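
The pre-1.5 approach described in the last two paragraphs might look like the sketch below. The dump file stub consists of just the format header and a UUID line. The function name is hypothetical, and the use of the uuidgen utility as the "other way to generate a UUID" is an assumption; it is available on many Unix-like systems but is not part of Subversion.

```shell
#!/bin/sh
# Sketch: set a repository's UUID with Subversion versions before 1.5
# by loading a dump file stub. If no UUID is given, one is generated
# with uuidgen, which is assumed to be installed.
set_uuid_via_load() {
    repos=$1
    uuid=${2:-$(uuidgen)}
    svnadmin load --force-uuid "$repos" <<EOF
SVN-fs-dump-format-version: 2

UUID: $uuid
EOF
}
```

For example, set_uuid_via_load /var/svn/repos cf2b9d22-acb5-11dc-bc8c-05e83ce5dbec applies that specific UUID, while set_uuid_via_load /var/svn/repos applies a freshly generated one. Afterward, svnlook uuid /var/svn/repos should report the new value.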

[36] Conscious, cautious removal of certain bits of
versioned data is actually supported by real use cases.
That's why an “obliterate” feature has been
one of the most highly requested Subversion features,
and one which the Subversion developers hope to soon
provide.

[37] While svnadmin dump has a
consistent leading slash policy (to not include
them), other programs that generate dump data might
not be so consistent.

[38] In fact, it can't truly be read-only, or
svnsync itself would have a tough time
copying revision history into it.

[39] Be forewarned that while it will take only a few
seconds for the average reader to parse this paragraph and
the sample output that follows it, the actual time
required to complete such a mirroring operation is, shall
we say, quite a bit longer.

[40] svnadmin setlog can be called in a
way that bypasses the hook interface altogether.

You are reading Version Control with Subversion (for Subversion 1.6), by Ben Collins-Sussman, Brian W. Fitzpatrick, and C. Michael Pilato.
This work is licensed under the Creative Commons Attribution License v2.0.
To submit comments, corrections, or other contributions to the text, please visit http://www.svnbook.com/.