git

I wrote a bit ago about making commits via the GitHub API. That post outlined making changes in two simplified situations: making changes to a single file and making updates to two existing files at the root of the repository. Here I show a more general solution that allows arbitrary changes anywhere in the repo.

I want to be able to specify a repo and branch and say "here are the contents of files that have changed or been created and here are the names of files that have been deleted, please take all that and this message and make a new commit for me." Because the GitHub API is so rudimentary when it comes to making commits that will end up being a many-stepped process, but it’s mostly the same steps repeated many times so it’s not a nightmare to code up. At a high level the process goes like this:

Get the current repo state from GitHub

This is the names and hashes of all the files and directories, but not the actual file contents.

Construct a local, malleable representation of the repo

Modify the local representation according to the given updates, creations, and deletions

Walk though the modified local "repo" and upload new/changed files and directories to GitHub

This must be done from the bottom up because a change at the low level means every directory above that level will need to be changed.

Make a new commit pointed at the new root tree (I’ll explain trees soon.)

For fun I’ve been learning a bit about the GitHub API. Using the API it’s possible to do just about everything you can do on GitHub itself, from commenting on PRs to adding commits to a repo. Here I’m going to show how to do add commits to a repo on GitHub. A notebook demonstrating things with code is available here, but you may want to read this post first for the high level view.

Choosing a Client Library

The GitHub API is an HTTP interface so you can talk to it via any tool that speaks HTTP, including things like curl. To make programming with the API simpler there are a number of libraries that allow communicate with GitHub via means native to whatever language you’re using. I’m using Python and I went with the github3.py library based on its Python 3 compatibility, active development, and good documentation.

Making Commits

Modifying a Single File

The special case of making a commit affecting a single file is much simpler than affecting multiple files. Creating, updating, and deleting a file can be done via a single API call once you have enough information to specify what you want done.

Modifying Multiple Files

Making a commit affecting multiple files requires making multiple API calls and some understanding of Git’s internal data store. That’s because to change multiple files you have to add all the changes to the repo one at a time before making a commit. The process is outlined in full in the API docs about Git data.

I should note that I think deleting multiple files in a single commit requires a slightly different procedure, one I’ll cover in another post.

Data Provenance

When running scientific software it is considered a best practice to automatically record the versions of all the software you use. This practice is sometimes referred to as recording the provenance of the results and helps make your analysis more reproducible. Almost all software libraries will have a version number that you can somehow access from your own software. For example, NumPy’s version number is recorded in the variable numpy.__version__ and most Python packages will having something similar. Python’s version is in the variable sys.version (and, alternatively, sys.version_info).

However, a lot of personal or lab software doesn’t have a version number. The software might change so fast and be modified by so many people that manually incrememented version numbers aren’t very practical. There’s still hope in this situation, though, if the software is under version control. (Your software is under version control, isn’t it?) In Subversion the keyword properties feature is often used to record provenance. There isn’t a compatible feature in Git, but for Python software in Git repositories we can engineer a provenance solution using the GitPython package.

Returning to Previous States with Git

When you make a commit in Git the state of the repository is recorded and given a label based on a hash of the commit data. We can use the commit hash to return to any recorded state of the repository using the “git checkout” command. This means that if you know the commit hash of your software when you created a certain set of results, you can always set your software back to that state to reproduce the same results. Very handy!

Recording the Commit Hash

When you import a Python module, code at the global level of the module is actually executed. This is often used to set global variables within the module, which is what we’ll do here. GitPython lets us interact with Git repos from Python and one thing we can do is query a repo to get the commit hash of the current “HEAD“. (HEAD is a label in Git pointing to the latest commit of whatever state the repository is currently in.)

What we can do with that is make it so that when our software modules are imported they set a global variable containing the commit hash of their HEAD at the time the software was run. That hash can then be inserted into data products as a record of the software version used to create them. Here’s some code that gets and stores the hash of the HEAD of a repo:

Versioned Data

Some data formats, especially those that are text based, can be easily stored in version control. If you can put your data in a Git repo then the same strategy as above can be used to get and store the HEAD commit of the data repo when you run your analysis, allowing you to reproduce both your software and data states during later runs. If your data does not easily fit into Git it’s still a good idea to record a unique identifier for the dataset, but you may need to develop that yourself (such as a simple list of all the data files that were used as inputs).

These are a few git aliases I’ve made recently to make my life a little easier. The first two are for displaying the log, and the rest for merging. See the Git wiki for more on how to add aliases.

Logs

I pretty much always want the one-line log, so this is the basic version of that:

ls = log --oneline --decorate

--decorate shows any labels on commits, such as branch names. Sometimes I want the above but with the graph view turned on:

lg = log --oneline --decorate --graph

Merging

Merges in git fall into two basic categories: fast-forward merges in which the branch label is simply updated to a new commit (but no new commit is made), and all other merges in which a new commit is made with multiple parents.

By default the merge command will attempt to do a fast-forward merge. If that won’t work it will move to some other “real” merge strategy. I think the difference between fast-forward and the other merges is sufficient that they shouldn’t happen with the same command so I’ve set up aliases to separate them. First the alias for doing a fast-forward. This will fail if a fast-forward is not possible:

ff = merge --ff-only

And then an alias for forcing a non-fast-forward merge even in situations where a fast-forward would be possible:

mrg = merge --no-ff

(This is the sort of merge GitHub does when you merge a pull request.)

Pulling

The pull command is really fetch + merge so it takes most of the same options. When I do a pull I generally only want it to succeed if it’s possible to fast-forward my local branch. If git must do a real merge something is probably wrong, so I have a fast-forward-only pull alias:

Over the past year or so I have on a few occasions taught git to people
accustomed to using Subversion. These are experienced software developers so it
seems like it should be easy, but it is almost always a difficult process
filled with acrimony, resistance, and gasps. (And this from people who seem
to use Emacs, vim, and Linux without complaint.)

And yet when I teach git to people who don’t already have a preferred
version control system it seems to go pretty well. It takes time, and there
may be confusion, but we demo and practice and learn as with any topic.
There is no hate. I think I’m beginning to understand the situation:

Git isn’t just a little different, it’s completely different. But, dear
svn users, git is not different because it doesn’t like you. Git is different
because it’s optimized to facilitate the collaboration of a number of
developers working on a rapidly changing code base. Some examples:

Branching is easy in git so that you can work in peace isolated from
upstream changes until you are ready to send your work back.

When it’s time to merge back git provides rebase to facilitate cleanly
integrating upstream changes.

Git is distributed so that you and everyone else can make all manner of
branches and experiments without dirtying up the master repo.

Git is different from svn, but it does well what it is designed to do.
This brings me back to the Lao Tsu quote from the top and a simple message
to svn users learning git: You will have to drop your svn workflow to make
the most of git, but it’s not because git is bad, it’s just different. Yield,
use git in the way it was meant to be used, and be happy.

Normally when I do a push in git I do something like git push origin master, which really means push from the local branch named master to the remote branch named master. If you want to push to a remote branch with a different name than your local branch, separate the local and remote names with a colon: