9.2 Git and Other Systems - Migrating to Git

Migrating to Git

If you have an existing codebase in another VCS but you’ve decided to start using Git, you must migrate your project one way or another.
This section goes over some importers for common systems, and then demonstrates how to develop your own custom importer.
You’ll learn how to import data from several of the bigger professionally used SCM systems, because they make up the majority of users who are switching, and because high-quality tools for them are easy to come by.

Subversion

If you read the previous section about using git svn, you can easily use those instructions to git svn clone a repository; then, stop using the Subversion server, push to a new Git server, and start using that.
If you want the history, you can accomplish that as quickly as you can pull the data out of the Subversion server (which may take a while).

However, the import isn’t perfect; and because it will take so long, you may as well do it right.
The first problem is the author information.
In Subversion, each person committing has a user on the system who is recorded in the commit information.
The examples in the previous section show schacon in some places, such as the blame output and the git svn log.
If you want to map this to better Git author data, you need a mapping from the Subversion users to the Git authors.
Create a file called users.txt that holds this mapping, one Subversion user per line, in the form username = Full Name <email@address.com>.

To get a list of the author names that Subversion uses, you can build a pipeline from svn log --xml, grep, sort, and perl.
That generates the log output in XML format, then keeps only the lines with author information, discards duplicates, and strips out the XML tags.
(Obviously this only works on a machine with grep, sort, and perl installed.)
Then, redirect that output into your users.txt file so you can add the equivalent Git user data next to each entry.
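As a rough alternative to the shell pipeline, the same list can be produced in Ruby. The sketch below assumes the XML that svn log --xml --quiet writes, in which each log entry carries an <author> element. The helper name svn_author_lines is hypothetical and is not part of git svn.

```ruby
# Sketch: turn the XML printed by `svn log --xml --quiet` into users.txt stubs.
# svn_author_lines is a hypothetical helper name chosen for this example.
def svn_author_lines(xml)
  # Collect the text of every <author> element.
  authors = xml.scan(%r{<author>(.*?)</author>}m).flatten
  # One "name = " stub per distinct author, ready for the Git identity.
  authors.map(&:strip).uniq.sort.map { |name| "#{name} = " }
end
```

Each returned line still needs the Git name and email address added by hand after the equals sign.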

You can provide this file to git svn to help it map the author data more accurately.
You can also tell git svn not to include the metadata that Subversion normally imports, by passing --no-metadata to the clone or init command.
With both options, your import command combines git svn clone with --authors-file=users.txt and --no-metadata.

After such an import, not only does the Author field in the log look a lot better, but the git-svn-id is no longer there, either.

You should also do a bit of post-import cleanup.
For one thing, you should clean up the weird references that git svn set up.
First you’ll move the tags so they’re actual tags rather than strange remote branches, and then you’ll move the rest of the branches so they’re local.
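The rename can be pictured as a mapping from the remote-tracking refs that git svn created to ordinary tag and branch refs. The Ruby sketch below assumes those refs live under refs/remotes/, with tags under refs/remotes/tags/; your layout may differ depending on the prefix used at clone time. The helper name local_ref_for is hypothetical, and the sketch only computes names without touching the repository.

```ruby
# Sketch: map a git svn remote ref to the local ref it should become.
# Assumes refs/remotes/tags/<name> for tags and refs/remotes/<name> for
# branches; adjust the prefixes if your import used a different layout.
def local_ref_for(remote_ref)
  case remote_ref
  when %r{\Arefs/remotes/tags/(.+)\z} then "refs/tags/#{$1}"
  when %r{\Arefs/remotes/(.+)\z}      then "refs/heads/#{$1}"
  end # returns nil for refs that are not remote-tracking refs
end
```

Each old and new name pair could then be applied with git update-ref, removing the old remote ref afterwards.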

Now all the old branches are real Git branches and all the old tags are real Git tags.
The last thing to do is add your new Git server as a remote and push to it.
Here is an example of adding your server as a remote:

$ git remote add origin git@my-git-server:myrepository.git

Because you want all your branches and tags to go up, you can now run these commands (--all pushes only branches, so the tags need their own push):

$ git push origin --all
$ git push origin --tags

All your branches and tags should be on your new Git server in a nice, clean import.

Mercurial

Since Mercurial and Git have fairly similar models for representing versions, and since Git is a bit more flexible, converting a repository from Mercurial to Git is fairly straightforward.
The conversion uses a tool called “hg-fast-export”, which you’ll need a copy of:

$ git clone http://repo.or.cz/r/fast-export.git /tmp/fast-export

The first step in the conversion is to get a full clone of the Mercurial repository you want to convert:

$ hg clone <remote repo URL> /tmp/hg-repo

The next step is to create an author mapping file.
Mercurial is a bit more forgiving than Git for what it will put in the author field for changesets, so this is a good time to clean house.
Generating this is a one-line command in a bash shell:
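The list of names can also be gathered without a shell pipeline. As a minimal Ruby sketch, assuming the default output of hg log in which every changeset has a line beginning with user:, the distinct author strings can be collected like this (hg_users is a hypothetical helper name):

```ruby
# Sketch: collect the distinct values of the "user:" field from the text
# that `hg log` prints. hg_users is a hypothetical helper name.
def hg_users(log_text)
  log_text.each_line
          .grep(/\Auser:/)
          .map { |line| line.sub(/\Auser:\s*/, '').chomp }
          .uniq
          .sort
end
```

Writing the returned strings to a file, one per line, gives the starting point for the author mapping.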

Suppose the resulting list shows that the same person (Bob) has created changesets under four different names, one of which actually looks correct, and one of which would be completely invalid for a Git commit.
Hg-fast-export lets us fix this by adding ={new name and email address} at the end of every line we want to change, and removing the lines for any usernames that we want to leave alone.
If all the usernames look fine, we won’t need this file at all.
In this example, we want the file to keep only the three incorrect variants, each followed by = and Bob’s correct name and email address.

The next step is to create the new Git repository and run the export script inside it.
The -r flag tells hg-fast-export where to find the Mercurial repository we want to convert, and the -A flag tells it where to find the author-mapping file.
The script parses Mercurial changesets and converts them into a script for Git’s “fast-import” feature (which we’ll discuss in detail a bit later on).
This takes a bit (though it’s much faster than it would be over the network), and the output is fairly verbose.

That’s pretty much all there is to it.
All of the Mercurial tags have been converted to Git tags, and Mercurial branches and bookmarks have been converted to Git branches.
Now you’re ready to push the repository up to its new server-side home, using the same git remote add and git push commands shown earlier.

Perforce

The next system you’ll look at importing from is Perforce.
As we discussed above, there are two ways to let Git and Perforce talk to each other: git-p4 and Perforce Git Fusion.

Perforce Git Fusion

Git Fusion makes this process fairly painless.
Just configure your project settings, user mappings, and branches using a configuration file (as discussed in “Git Fusion”), and clone the repository.
Git Fusion leaves you with what looks like a native Git repository, which is then ready to push to a native Git host if you desire.
You could even use Perforce as your Git host if you like.

Git-p4

Git-p4 can also act as an import tool.
As an example, we’ll import the Jam project from the Perforce Public Depot.
To set up your client, you must export the P4PORT environment variable to point to the Perforce depot:

$ export P4PORT=public.perforce.com:1666

In order to follow along, you’ll need a Perforce depot to connect with.
We’ll be using the public depot at public.perforce.com for our examples, but you can use any depot you have access to.

Run the git p4 clone command to import the Jam project from the Perforce server, supplying the depot and project path and the path into which you want to import the project:

This particular project has only one branch, but if you have branches that are configured with branch views (or just a set of directories), you can use the --detect-branches flag to git p4 clone to import all the project’s branches as well.
See “Branching” for a bit more detail on this.

At this point you’re almost done.
If you go to the p4import directory and run git log, you can see your imported work:

You can see that git-p4 has left an identifier in each commit message.
It’s fine to keep that identifier there, in case you need to reference the Perforce change number later.
However, if you’d like to remove the identifier, now is the time to do so – before you start doing work on the new repository.
You can use git filter-branch to remove the identifier strings en masse:
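The message rewrite itself amounts to dropping one line from each commit message. The Ruby sketch below shows that step in isolation, assuming the identifier sits on a line of its own that starts with [git-p4:. The helper name strip_git_p4_id is hypothetical.

```ruby
# Sketch: drop the identifier line that git-p4 appends to a commit message.
# Assumes the identifier is on its own line and starts with "[git-p4:".
def strip_git_p4_id(message)
  message.each_line.reject { |line| line.start_with?('[git-p4:') }.join
end
```

Wrapped in a small script that reads the message from standard input and prints the result, a filter like this could be handed to the --msg-filter option of git filter-branch.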

TFS

If your team is converting their source control from TFVC to Git, you’ll want the highest-fidelity conversion you can get.
This means that, while we covered both git-tfs and git-tf for the interop section, we’ll only be covering git-tfs for this part, because git-tfs supports branches, and this is prohibitively difficult using git-tf.

This is a one-way conversion.
The resulting Git repository won’t be able to connect with the original TFVC project.

The first thing to do is map usernames.
TFVC is fairly liberal with what goes into the author field for changesets, but Git wants a human-readable name and email address.
You can get this information from the tf command-line client, like so:

PS> tf history $/myproject -recursive > AUTHORS_TMP

This grabs all of the changesets in the history of the project and puts them in the AUTHORS_TMP file, which we will process to extract the data in the User column (the second one).
Open the file, find the character positions at which that column starts and ends, and in the following command line replace the 11-20 parameter of the cut command with the positions you found:

PS> cat AUTHORS_TMP | cut -b 11-20 | tail -n+3 | sort | uniq > AUTHORS

The cut command keeps only the characters between 11 and 20 from each line.
The tail command skips the first two lines, which are field headers and ASCII-art underlines.
The result of all of this is piped to sort and uniq to eliminate duplicates (uniq only removes adjacent repeats, so the input must be sorted first), and saved to a file named AUTHORS.
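The same extraction can be sketched in Ruby, which may be convenient on a machine without cut and tail. The sketch assumes the fixed-width layout described above, with two header lines, and takes the column positions as 1-based byte offsets in the same way as cut -b. The helper name tfvc_users is hypothetical; unlike cut, it also trims the padding around each name.

```ruby
# Sketch: pull one fixed-width column out of `tf history` output.
# first_col and last_col are 1-based byte positions, as with cut -b.
def tfvc_users(history_text, first_col, last_col)
  width = last_col - first_col + 1
  history_text.lines
              .drop(2) # skip the header row and its underline
              .map { |line| line.chomp.byteslice(first_col - 1, width).to_s.strip }
              .reject(&:empty?)
              .uniq
              .sort
end
```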
The next step is manual; in order for git-tfs to make effective use of this file, each line must be in this format:

DOMAIN\username = User Name <email@address.com>

The portion on the left is the “User” field from TFVC, and the portion on the right side of the equals sign is the user name that will be used for Git commits.

Once you have this file, the next thing to do is make a full clone of the TFVC project you’re interested in:

A Custom Importer

If your system isn’t one of the above, you should look for an importer online; quality importers are available for many other systems, including CVS, ClearCase, Visual SourceSafe, even a directory of archives.
If none of these tools works for you, you have a more obscure tool, or you otherwise need a more custom importing process, you should use git fast-import.
This command reads simple instructions from stdin to write specific Git data.
It’s much easier to create Git objects this way than to run the raw Git commands or try to write the raw objects (see Chapter 10 for more information).
This way, you can write an import script that reads the necessary information out of the system you’re importing from and prints straightforward instructions to stdout.
You can then run this program and pipe its output through git fast-import.

To quickly demonstrate, you’ll write a simple importer.
Suppose you work in current, you back up your project by occasionally copying the directory into a time-stamped back_YYYY_MM_DD backup directory, and you want to import this into Git.
Your directory structure looks like this:

In order to import this directory into Git, you need to review how Git stores its data.
As you may remember, Git is fundamentally a linked list of commit objects that point to a snapshot of content.
All you have to do is tell fast-import what the content snapshots are, what commit data points to them, and the order they go in.
Your strategy will be to go through the snapshots one at a time and create commits with the contents of each directory, linking each commit back to the previous one.

As we did in “An Example Git-Enforced Policy”, we’ll write this in Ruby, because it’s what we generally work with and it tends to be easy to read.
You can write this example pretty easily in anything you’re familiar with – it just needs to print the appropriate information to stdout.
And, if you are running on Windows, this means you’ll need to take special care not to introduce carriage returns at the end of your lines, because git fast-import is very particular about wanting only line feeds (LF), not the carriage return line feeds (CRLF) that Windows uses.

To begin, you’ll change into the target directory and identify every subdirectory, each of which is a snapshot that you want to import as a commit.
You’ll change into each subdirectory and print the commands necessary to export it.
Your basic main loop looks like this:

last_mark = nil

# loop through the directories
Dir.chdir(ARGV[0]) do
  Dir.glob("*").each do |dir|
    next if File.file?(dir)

    # move into the target directory
    Dir.chdir(dir) do
      last_mark = print_export(dir, last_mark)
    end
  end
end

You run print_export inside each directory, which takes the manifest and mark of the previous snapshot and returns the manifest and mark of this one; that way, you can link them properly.
“Mark” is the fast-import term for an identifier you give to a commit; as you create commits, you give each one a mark that you can use to link to it from other commits.
So, the first thing to do in your print_export method is generate a mark from the directory name:

mark = convert_dir_to_mark(dir)

You’ll do this by creating an array of directories and using the index value as the mark, because a mark must be an integer.
Your method looks like this:
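One way to write it is sketched below, under the assumption that a global array (named $marks for this example) remembers the directories in the order they are first seen:

```ruby
# Directories seen so far, in order of first appearance.
$marks = []

# Sketch: the mark for a directory is its 1-based position in $marks,
# returned as a string so it can be concatenated into the output.
def convert_dir_to_mark(dir)
  $marks << dir unless $marks.include?(dir)
  ($marks.index(dir) + 1).to_s
end
```

Asking for the same directory twice returns the same mark, so the method is safe to call more than once per snapshot.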

Now that you have an integer representation of your commit, you need a date for the commit metadata.
Because the date is expressed in the name of the directory, you’ll parse it out.
The next line in your print_export file is
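A call such as date = convert_dir_to_date(dir) fits here. A minimal sketch of that helper follows, assuming directory names of the form back_YYYY_MM_DD plus a directory named current, which is stamped with the time of the import:

```ruby
# Sketch: derive a Unix timestamp from a snapshot directory name.
# Assumes names like back_2014_01_02; "current" gets the present time.
def convert_dir_to_date(dir)
  return Time.now.to_i if dir == 'current'

  year, month, day = dir.sub('back_', '').split('_').map(&:to_i)
  Time.local(year, month, day).to_i
end
```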

That returns an integer value for the date of each directory.
The last piece of meta-information you need for each commit is the committer data, which you hardcode in a global variable:

$author = 'John Doe <john@example.com>'

Now you’re ready to begin printing out the commit data for your importer.
The initial information states that you’re defining a commit object and what branch it’s on, followed by the mark you’ve generated, the committer information and commit message, and then the previous commit, if any.
The code looks like this:
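A sketch of that output is shown here as a standalone method so that it runs on its own; in the importer these lines would sit inside print_export. The method name print_commit_header is chosen for this example, and the commit message is written directly in the data format that the text goes on to describe.

```ruby
$author = 'John Doe <john@example.com>'

# Sketch of the commit header: branch, mark, committer, message, parent.
def print_commit_header(dir, mark, date, last_mark)
  message = "imported from #{dir}"

  puts 'commit refs/heads/master'
  puts "mark :#{mark}"
  puts "committer #{$author} #{date} -0700"
  # Commit message: byte count, newline, then the text itself.
  # fast-import does not require a newline after the data.
  print "data #{message.bytesize}\n#{message}"
  # Link to the previous commit, if there is one.
  puts "from :#{last_mark}" if last_mark
end
```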

You hardcode the time zone (-0700) because doing so is easy.
If you’re importing from another system, you must specify the time zone as an offset.
The commit message must be expressed in a special format:

data (size)\n(contents)

The format consists of the word data, the size of the data to be read, a newline, and finally the data.
Because you need to use the same format to specify the file contents later, you create a helper method, export_data:

def export_data(string)
  # fast-import expects the size in bytes, not characters
  print "data #{string.bytesize}\n#{string}"
end

All that’s left is to specify the file contents for each snapshot.
This is easy, because you have each one in a directory – you can print out the deleteall command followed by the contents of each file in the directory.
Git will then record each snapshot appropriately:
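The loop can be sketched as follows. To keep the fragment runnable on its own, it is written as a method (print_snapshot, a name chosen for this example) that yields each file to a block; in the importer, that block would call the helper that prints a file’s contents.

```ruby
# Sketch: clear the previous tree, then hand every regular file under the
# current directory to the caller's block.
def print_snapshot
  puts 'deleteall'
  Dir.glob('**/*').sort.each do |file|
    next unless File.file?(file) # directories are implied by file paths
    yield file
  end
end
```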

Note: Because many systems think of their revisions as changes from one commit to another, fast-import can also take commands with each commit to specify which files have been added, removed, or modified and what the new contents are.
You could calculate the differences between snapshots and provide only this data, but doing so is more complex – you may as well give Git all the data and let it figure it out.
If this is better suited to your data, check the fast-import man page for details about how to provide your data in this manner.

The format for listing the new file contents or specifying a modified file with the new contents is as follows:

M 644 inline path/to/file
data (size)
(file contents)

Here, 644 is the mode (if you have executable files, you need to detect and specify 755 instead), and inline says you’ll list the contents immediately after this line.
Your inline_data method looks like this:
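A minimal sketch of such a method follows. The export_data helper is included again so that the fragment runs on its own, and the file is read in binary mode on the assumption that its bytes should reach Git unchanged.

```ruby
# The export_data helper again, with the size counted in bytes.
def export_data(string)
  print "data #{string.bytesize}\n#{string}"
end

# Sketch: emit one file as an inline "M" (modify) command.
def inline_data(file, code = 'M', mode = '644')
  content = File.binread(file) # raw bytes, no newline translation
  puts "#{code} #{mode} inline #{file}"
  export_data(content)
end
```

For an executable file, a caller could pass '755' as the mode, for example when File.executable?(file) is true.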

You reuse the export_data method you defined earlier, because it’s the same as the way you specified your commit message data.

The last thing you need to do is to return the current mark so it can be passed to the next iteration:

return mark

If you are running on Windows you’ll need to make sure that you add one extra step.
As mentioned before, Windows uses CRLF for new line characters while git fast-import expects only LF.
To get around this problem and make git fast-import happy, you need to tell Ruby to use LF instead of CRLF:
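One way to do that is to switch standard output to binary mode before anything is printed, which stops Ruby from translating "\n" into CRLF on Windows. This is a sketch of that single step; on other platforms the call is harmless.

```ruby
# Put stdout in binary mode so "\n" is written as a bare LF on Windows.
$stdout.binmode
```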

To run the importer, pipe this output through git fast-import while in the Git directory you want to import into.
You can create a new directory and then run git init in it for a starting point, and then run your script:

As you can see, when it completes successfully, it gives you a bunch of statistics about what it accomplished.
In this case, you imported 13 objects total for 4 commits into 1 branch.
Now, you can run git log to see your new history:

There you go – a nice, clean Git repository.
It’s important to note that nothing is checked out – you don’t have any files in your working directory at first.
To get them, you must reset your branch to where master is now by running git reset --hard master.

You can do a lot more with the fast-import tool – handle different modes, binary data, multiple branches and merging, tags, progress indicators, and more.
A number of examples of more complex scenarios are available in the contrib/fast-import directory of the Git source code.