
Filtering Repository History

Since Subversion stores your versioned history using, at
the very least, binary differencing algorithms and data
compression (optionally in a completely opaque database
system), attempting manual tweaks is unwise, if not outright
difficult, and in any case strongly discouraged. And once
data has been stored in your repository, Subversion
generally doesn't provide an easy way to remove that data.[33]
But inevitably, there will be times when you would like to
manipulate the history of your repository. You might need
to strip out all instances of a file that was accidentally
added to the repository (and shouldn't be there for whatever
reason).[34]
Or, perhaps you have multiple projects sharing a
single repository, and you decide to split them up into
their own repositories. To accomplish tasks like this,
administrators need a more manageable and malleable
representation of the data in their repositories—the
Subversion repository dump format.

As we described in the section called “Migrating Repository Data Elsewhere”, the
Subversion repository dump format is a human-readable
representation of the changes that you've made to your
versioned data over time. You use the svnadmin
dump command to generate the dump data, and
svnadmin load to populate a new
repository with it. Because the format is human-readable,
you can inspect and modify it by hand, provided you are
careful. The downside is that if
you have three years' worth of repository activity
encapsulated in what is likely to be a very large dump file,
inspecting and modifying it by hand could take a very long
time.

That's where svndumpfilter becomes
useful. This program acts as a path-based filter for
repository dump streams. Simply give it either a list of
paths you wish to keep or a list of paths you wish to
discard, then pipe your repository dump data through this
filter. The result will be a modified stream of dump data
that contains only the versioned paths you (explicitly or
implicitly) requested.
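
As a rough mental model only (this is not svndumpfilter's own code), the keep-or-discard decision can be sketched in Python. The function names are hypothetical, and the component-wise prefix matching is an assumption about how a path is compared against the list you supply.

```python
def is_under(path: str, prefix: str) -> bool:
    """True if path is the prefix itself or lies somewhere beneath it."""
    return path == prefix or path.startswith(prefix + "/")


def keep_path(path: str, prefixes: list[str], mode: str) -> bool:
    """Decide whether a node path survives an include or exclude filter."""
    matched = any(is_under(path, p) for p in prefixes)
    if mode == "include":
        return matched
    if mode == "exclude":
        return not matched
    raise ValueError(f"unknown mode: {mode!r}")


# Illustrative paths only; "calculator" shares a string prefix with "calc"
# but is a different top-level directory, so it does not match.
paths = ["calc/trunk/main.c", "calendar/trunk/cal.c", "calculator/README"]
kept_by_include = [p for p in paths if keep_path(p, ["calc"], "include")]
kept_by_exclude = [p for p in paths if keep_path(p, ["calc"], "exclude")]
```

Matching on whole path components is what makes "implicitly requested" work: asking for calc also keeps everything beneath it.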

Let's look at a realistic example of how you might use this
program. We discuss elsewhere (see the section called “Planning Your Repository Organization”) the
process of choosing a layout for the data in
your repositories: using one repository per project or
combining them, arranging directories within your repository, and
so on. But sometimes, after new revisions start flying in,
you rethink your layout and would like to make some changes.
A common change is the decision to move multiple projects
that share a single repository into separate
repositories, one per project.

Our imaginary repository contains three projects:
calc, calendar, and
spreadsheet. They have been living
side by side as top-level directories of a single
repository, each with its own trunk, tags, and branches
subdirectories.
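
To split the projects apart, one would dump the whole repository once and then run that dump through the filter once per project. The following Python sketch drives the two command-line tools through subprocess. The repository path, the output file names, and the helper names are all hypothetical, and the sketch assumes svnadmin and svndumpfilter are on your PATH.

```python
import subprocess


def filter_command(project: str) -> list[str]:
    """Command line that keeps only one top-level project directory."""
    return ["svndumpfilter", "include", project]


def dump_repository(repos_path: str, dump_path: str) -> None:
    """Write the full dump stream of the repository to dump_path."""
    # Binary mode: the dump stream must be stored byte for byte.
    with open(dump_path, "wb") as out:
        subprocess.run(["svnadmin", "dump", repos_path], stdout=out, check=True)


def filter_project(dump_path: str, project: str) -> str:
    """Filter the full dump down to one project; return the new file name."""
    target = f"{project}-dumpfile"
    with open(dump_path, "rb") as src, open(target, "wb") as out:
        subprocess.run(filter_command(project), stdin=src, stdout=out, check=True)
    return target


# Example use (hypothetical paths; not executed here):
# dump_repository("/path/to/repos", "repos-dumpfile")
# for name in ("calc", "calendar", "spreadsheet"):
#     filter_project("repos-dumpfile", name)
```

Dumping once and filtering three times avoids re-reading the repository for each project.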

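Once you have dumped the repository and run that dump
through the filter once per project, keeping a different
top-level directory each time, you have three dump files
and a decision to make. Each of
your dump files will create a valid repository,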
but will preserve the paths exactly as they were in the
original repository. This means that even though you would
have a repository solely for your calc
project, that repository would still have a top-level
directory named calc. If you want
your trunk, tags,
and branches directories to live in the
root of your repository, you might wish to edit your
dump files, tweaking the Node-path and
Node-copyfrom-path headers to no longer have
that first calc/ path component. Also,
you'll want to remove the section of dump data that creates
the calc directory. It will look
something like:

Node-path: calc
Node-action: add
Node-kind: dir
Content-length: 0

Warning

If you do plan on manually editing the dump file to
remove a top-level directory, make sure that your editor is
not set to automatically convert end-of-line characters to the native
format (e.g. \r\n to \n), as the content will then not agree
with the metadata. This will render the dump file
useless.
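
If you would rather script the header edit than do it in an editor, a minimal Python sketch follows. It works on raw bytes and splits only on \n, so carriage returns are left untouched. The function name is hypothetical. The sketch does not remove the record that creates the top-level directory, and it assumes no file content in the dump has a line that begins with one of these header names followed by the same path prefix (such a line would be rewritten too, and the content would then disagree with its recorded length).

```python
def strip_top_level(dump: bytes, top: bytes = b"calc") -> bytes:
    """Drop the leading `top/` component from node path headers."""
    headers = (b"Node-path: ", b"Node-copyfrom-path: ")
    out = []
    # Split on b"\n" only, so any b"\r" bytes stay exactly where they were.
    for line in dump.split(b"\n"):
        for header in headers:
            prefix = header + top + b"/"
            if line.startswith(prefix):
                line = header + line[len(prefix):]
                break
        out.append(line)
    return b"\n".join(out)


# Example use (hypothetical file names; note the binary modes):
# with open("calc-dumpfile", "rb") as src:
#     data = src.read()
# with open("calc-dumpfile.edited", "wb") as dst:
#     dst.write(strip_top_level(data))
```

Reading the whole file into memory keeps the sketch short; a very large dump file would call for a streaming approach instead.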

All that remains now is to create your three new
repositories (with svnadmin create) and load each dump file
into the right repository (with svnadmin load).
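
A minimal sketch of that final step in Python, again driving the command-line tools through subprocess. The helper names and the `<project>-dumpfile` naming are hypothetical. Passing --ignore-uuid to svnadmin load is an assumption that you want each new repository to keep its own freshly generated UUID rather than take one from the dump stream.

```python
import subprocess


def create_and_load_commands(project: str) -> list[list[str]]:
    """Command lines for creating a repository and loading a dump into it."""
    return [
        ["svnadmin", "create", project],
        # Assumption: the new repository should not adopt the dump's UUID.
        ["svnadmin", "load", "--ignore-uuid", project],
    ]


def create_and_load(project: str) -> None:
    """Create ./<project> and load <project>-dumpfile into it."""
    create_cmd, load_cmd = create_and_load_commands(project)
    subprocess.run(create_cmd, check=True)
    with open(f"{project}-dumpfile", "rb") as dump:
        subprocess.run(load_cmd, stdin=dump, check=True)


# Example use (not executed here):
# for name in ("calc", "calendar", "spreadsheet"):
#     create_and_load(name)
```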

Both of svndumpfilter's subcommands
accept options for deciding how to deal with
“empty” revisions. If a given revision
contained only changes to paths that were filtered out, that
now-empty revision could be considered uninteresting or even
unwanted. So to give the user control over what to do with
those revisions, svndumpfilter provides
the following command-line options:

--drop-empty-revs

Do not generate empty revisions at all—just
omit them.

--renumber-revs

If empty revisions are dropped (using the
--drop-empty-revs option), change the
revision numbers of the remaining revisions so that
there are no gaps in the numeric sequence.

--preserve-revprops

If empty revisions are not dropped, preserve the
revision properties (log message, author, date, custom
properties, etc.) for those empty revisions.
Otherwise, empty revisions will only contain the
original datestamp, and a generated log message that
indicates that this revision was emptied by
svndumpfilter.
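
The effect of the first two options on revision numbers can be illustrated with a toy model in Python. This only mirrors the descriptions above, it is not how svndumpfilter is implemented, and the function and parameter names are hypothetical.

```python
def output_numbers(
    revs: list[int],
    emptied: set[int],
    drop_empty: bool = False,
    renumber: bool = False,
) -> dict[int, int]:
    """Map each surviving original revision number to its output number."""
    # Without drop_empty, emptied revisions stay in the stream.
    kept = [r for r in revs if not (drop_empty and r in emptied)]
    if drop_empty and renumber and kept:
        # Close the gaps left behind by the dropped revisions.
        first = kept[0]
        return {old: first + i for i, old in enumerate(kept)}
    return {r: r for r in kept}


revs = [0, 1, 2, 3, 4, 5]
emptied = {2, 4}  # revisions that touched only filtered-out paths
default = output_numbers(revs, emptied)
dropped = output_numbers(revs, emptied, drop_empty=True)
renumbered = output_numbers(revs, emptied, drop_empty=True, renumber=True)
```

In this model, dropping alone leaves revisions 0, 1, 3, and 5 with their original numbers, while dropping and renumbering maps them to 0, 1, 2, and 3.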

While svndumpfilter can be very
useful, and a huge timesaver, there are unfortunately a
few gotchas. First, this utility is overly sensitive
to path semantics. Pay attention to whether paths in your
dump file are specified with or without leading slashes.
You'll want to look at the Node-path and
Node-copyfrom-path headers.

…
Node-path: spreadsheet/Makefile
…

If the paths have leading slashes, you should
include leading slashes in the paths you pass to
svndumpfilter include and
svndumpfilter exclude (and if they don't,
you shouldn't). Further, if your dump file uses leading
slashes inconsistently for some reason,[35]
you should probably normalize those paths so they all
have, or all lack, leading slashes.
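
One way to check a dump file for this is to count how many path headers start with a slash and how many don't. The Python sketch below does so with a simple line scan. The function name is hypothetical, and the scan assumes that file content inside the dump does not contain lines beginning with these header names (such lines would be counted too).

```python
def leading_slash_usage(dump: bytes) -> dict[str, int]:
    """Count path headers with and without a leading slash."""
    headers = (b"Node-path: ", b"Node-copyfrom-path: ")
    counts = {"with_slash": 0, "without_slash": 0}
    for line in dump.split(b"\n"):
        for header in headers:
            if line.startswith(header):
                value = line[len(header):]
                key = "with_slash" if value.startswith(b"/") else "without_slash"
                counts[key] += 1
                break
    return counts


# Example use (hypothetical file name):
# with open("repos-dumpfile", "rb") as f:
#     usage = leading_slash_usage(f.read())
# If both counts are non-zero, the dump mixes the two styles.
```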

Also, copied paths can give you some trouble.
Subversion supports copy operations in the repository, where
a new path is created by copying some already existing path.
It is possible that at some point in the lifetime of your
repository, you might have copied a file or directory from
some location that svndumpfilter is
excluding, to a location that it is including. In order to
make the dump data self-sufficient,
svndumpfilter needs to still show the
addition of the new path—including the contents of any
files created by the copy—and not represent that
addition as a copy from a source that won't exist in your
filtered dump data stream. But because the Subversion
repository dump format only shows what was changed in each
revision, the contents of the copy source might not be
readily available. If you suspect that you have any copies
of this sort in your repository, you might want to rethink
your set of included/excluded paths, perhaps including the
paths that served as sources of your troublesome copy
operations, too.
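
To find out whether a dump file contains copies of this sort before you filter it, you could scan for nodes that fall inside your include list but were copied from outside it. The Python sketch below is one hedged way to do that. The function name is hypothetical, leading slashes are stripped before comparison, and the line scan assumes file content in the dump does not mimic these header lines.

```python
def copies_from_outside(
    dump: bytes, included: list[bytes]
) -> list[tuple[bytes, bytes]]:
    """Return (destination, source) pairs for copies that cross the filter."""

    def is_included(path: bytes) -> bool:
        return any(path == p or path.startswith(p + b"/") for p in included)

    found = []
    node_path = None
    for line in dump.split(b"\n"):
        if line.startswith(b"Node-path: "):
            # Remember which node the following headers belong to.
            node_path = line[len(b"Node-path: "):].lstrip(b"/")
        elif line.startswith(b"Node-copyfrom-path: ") and node_path is not None:
            source = line[len(b"Node-copyfrom-path: "):].lstrip(b"/")
            if is_included(node_path) and not is_included(source):
                found.append((node_path, source))
    return found
```

Each source path reported this way is a candidate for adding to your include list.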

Finally, svndumpfilter takes path
filtering quite literally. If you are trying to copy the
history of a project rooted at
trunk/my-project and move it into a
repository of its own, you would, of course, use the
svndumpfilter include command to keep all
the changes in and under
trunk/my-project. But the resulting
dump file makes no assumptions about the repository into
which you plan to load this data. Specifically, the dump
data might begin with the revision which added the
trunk/my-project directory, but it will
not contain directives which would
create the trunk directory itself
(because trunk doesn't match the
include filter). You'll need to make sure that any
directories that the new dump stream expects to exist
actually do exist in the target repository before trying to
load the stream into that repository.
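
Working out which directories must already exist is mechanical: they are the ancestors of each path you included. A small Python sketch, with a hypothetical function name:

```python
import posixpath


def required_parents(include_path: str) -> list[str]:
    """Ancestor directories of include_path, outermost first."""
    parents = []
    parent = posixpath.dirname(include_path.strip("/"))
    while parent:
        parents.append(parent)
        parent = posixpath.dirname(parent)
    return list(reversed(parents))


parents = required_parents("trunk/my-project")
```

For trunk/my-project this yields just trunk, which you could then create in the target repository (for example, with svn mkdir) before loading the filtered stream.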

[34] Conscious, cautious removal of certain bits of
versioned data is actually supported by real use-cases.
That's why an “obliterate” feature has been
one of the most highly requested Subversion features,
and one which the Subversion developers hope to soon
provide.

[35] While svnadmin dump has a
consistent leading slash policy—to not include
them—other programs which generate dump data might
not be so consistent.