Update 2008-05-21: Tim Dysinger and Pat Maddox pointed out that git submodules are inherently not well-suited for frequently updated projects. Read the comments for more details, and please use submodules with caution on projects where you can't guarantee that a shared repository has not changed between 'pull' and 'push' operations.

Today I'm releasing git.rake into the wild under an open-source license. It's a rakefile for managing multiple git submodules in a shared-server development environment.

We've been using it internally at my company, Pharos Enterprise Intelligence, for the last 5 months and it's been a huge timesaver for us. Read below for a detailed description of the features and its use.

The code is being released under the MIT license and the git repository is being hosted on github. Take a look:

Configure a Rails project for use with git. (Although you've seen
that elsewhere and are justifiably unimpressed.)

Prerequisites

If you're not sure how to add a submodule to your repo, or you're not
sure what a submodule is, take a quick trip over to the Git Submodule
Tutorial, and then come back. In fact, even if you ARE familiar with
submodules, it's probably worth reviewing.

The Problem We're Trying to Solve Here

Let's start with stating our basic assumptions:

you're using a shared repository (like github)

you're actively developing in one or more submodules

This model of development can get very tedious very quickly if you
don't have the right tools, because every time you decide to
"checkpoint" and commit your code (either locally or up to the shared
server), you have to:

iterate through your submodules, doing things like:

making sure you're on the right branch,

making sure you've pulled changes down from the server,

making sure that you've committed your changes,

and pushed all your commits

and then making sure that your superproject's references to the
submodules have also been committed and pushed.

If you do this a few times, you'll see that it's tedious and
error-prone. You could mistakenly push a version of the superproject
that refers to a local commit of a submodule. When people try to
pull that down from the server, all hell will break loose because that
commit won't exist for them.

Ugh! This is monkey work. Let's automate it.

Simple Solution

OK, fixing this issue sounds easy. All we have to do is:

develop some primitives for iterating over the submodules (and
optionally the superproject),

and then throw some actual functionality on top for sanity checking, pulling,
pushing and committing.
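
To make the first of those steps concrete, here is a minimal sketch of what such iteration primitives could look like. It is not the actual git.rake source: the method names submodule_paths and each_repo are invented for illustration, and it assumes submodule locations are read from the "path = ..." lines of the superproject's .gitmodules file.

```ruby
# Hypothetical sketch, not the actual git.rake source.

# Extract submodule paths from the text of a .gitmodules file.
# Only lines of the form "path = some/dir" are considered.
def submodule_paths(gitmodules_text)
  gitmodules_text.scan(/^[ \t]*path[ \t]*=[ \t]*(.+?)[ \t]*$/).flatten
end

# Yield each submodule directory, optionally followed by the
# superproject itself (represented as ".").
def each_repo(gitmodules_text, include_super: false)
  dirs = submodule_paths(gitmodules_text)
  dirs += ["."] if include_super
  dirs.each { |dir| yield dir }
end
```

A task built on top of this would typically call each_repo(File.read(".gitmodules")) and run a git command inside each yielded directory.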

Excellent! Not only did git:update automatically generate a useful log
message for me (indicating that we're updating to the latest submodule
version), but it's also embedding original commit logs for all the
changes included in that commit! That makes it much easier to find a
specific submodule commit in the superproject commit log.
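
For the curious, a message like that could be assembled along these lines. This is only a sketch under assumptions: the method name superproject_message, the headline wording, and the indentation are made up here, and the log text is assumed to come from running something like git log OLD..NEW inside each submodule.

```ruby
# Hypothetical sketch of assembling a superproject commit message.
# `logs` maps a submodule path to the log text of its new commits.
# Submodules with no new commits are left out; returns nil when
# nothing changed at all.
def superproject_message(logs)
  changed = logs.reject { |_path, log| log.strip.empty? }
  return nil if changed.empty?

  lines = ["updating to latest submodule versions: #{changed.keys.join(', ')}", ""]
  changed.each do |path, log|
    lines << "#{path}:"
    log.each_line { |l| lines << "    #{l.chomp}" }  # indent the embedded log
    lines << ""
  end
  lines.join("\n")
end
```

The result can be handed to git commit with the -F option (read the message from a file) after writing it to a temporary file.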

A Note on Branching and Merging

Note that there are no tasks for handling branching and merging. This
is intentional! It could be very dangerous to try to read your mind
about actions on branches, and frankly, I'm just not up to it today.

For example, let's say I invented a task to copy the current branch
master to a new branch foo (the equivalent of git checkout -b foo
master) in all submodules, but one of the submodules already has a
branch named foo!

Do we reduce this action to a simple git checkout foo for that
submodule? That could yield unexpected results if we a) forgot we had
a branch named foo and b) that branch is very different from the
master we expected to copy.

Well, then -- we can delete (or rename) the existing foo branch and
follow that up by copying master to foo. But then we're silently
renaming (or deleting) branches that a) could be upstream on the
shared server or b) we intended to keep around, but forgot to
git-stash.

In any case, my point is that it can get complicated, and so I'm
punting. If you want to copy branches or do simple checkouts, you
should use the git:for_each command.

Everyday Use of git.rake

In my day job, I've taken the vendor-everything approach and
refactored lots of common code (across clients) into plugins, each of
which is a git submodule. My current project has 14 submodules, of
which I am actively coding in probably 5 to 7 at any one time. (Plenty
of motivation for creating git.rake right there.)

Let's say I've hacked for an hour or two and am ready to commit to
my local repository. Let's first take a look at what's changed:

You'll notice first of all that, despite having 14 submodules, I'm
only seeing output for the ones that need commits, and even that
output is minimal, listing only the specific files and not all the
cruft in the original message. It tells me that all submodules are on
the same branch. It's smart enough to tell me that a file may need to
be git-added. It will even alert me when a repo needs to be pushed to
the origin.
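
The "only show what needs attention" filtering could be sketched roughly as follows. This is an illustration, not git.rake's real parsing: it assumes the machine-readable output of git status --porcelain (two status characters, a space, then the path, with ?? marking untracked files), and the method name summarize_status is invented.

```ruby
# Hypothetical sketch: condense `git status --porcelain` output for one repo.
# Returns nil when the repo is clean so the caller can stay silent.
def summarize_status(dir, porcelain)
  entries = porcelain.lines.map(&:chomp).reject(&:empty?)
  return nil if entries.empty?

  report = ["#{dir}:"]
  entries.each do |entry|
    code = entry[0, 2]    # two-character status, e.g. " M" or "??"
    path = entry[3..-1]   # path follows a single space
    note = code == "??" ? " (untracked, may need 'git add')" : ""
    report << "  #{code} #{path}#{note}"
  end
  report.join("\n")
end
```

Printing only the non-nil summaries gives the quiet behavior described above: clean submodules produce no output at all.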

I'll have to manually cd to the submodule and git-add that one
file, but once that's done, I can commit my changes by running:

$ rake git:commit

which will run git commit -a -v for each submodule, fire up the
editor for commit messages along the way, push each submodule to the
shared server, and then automagically create verbose commit logs for
the superproject.

To pull changes from the shared server:

$ rake git:pull

When you run this command, you'll notice that the output is filtered,
so if no changes were pulled, you'll see no output. Silence is golden.

To push?

$ rake git:push

Not only will this be silent if there's nothing to push, but the rake
task is smart enough to not even attempt to push to the server if
master is no different from origin/master. So it's silent and fast.
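
One way to get that "don't even try" check is to compare the commit ids of the two branches, as in the sketch below. The method name needs_push? and the injectable run parameter are made up for illustration; the only git command assumed is git rev-parse, which prints the commit id a name points to.

```ruby
# Hypothetical sketch: decide whether a push is worth attempting.
# `run` is any callable that takes a shell command and returns its
# output; by default it shells out with backticks.
def needs_push?(branch = "master", remote = "origin", run: ->(cmd) { `#{cmd}` })
  local_sha  = run.call("git rev-parse #{branch}").strip
  remote_sha = run.call("git rev-parse #{remote}/#{branch}").strip
  local_sha != remote_sha
end
```

Note that this only says the two branches differ. If the remote branch is the one that's ahead, a pull is what's needed, and a push would be rejected or do nothing.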

Let's say I want to copy the current branch, master, to a new
branch, working.

$ rake git:for_each CMD='git checkout -b working master'

If the command fails for any submodules, the rake task will terminate
immediately.
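
That fail-fast behavior is simple to picture. The sketch below is an assumption about how it could be done, not the real task: for_each_dir is an invented name, and it relies on Ruby's Kernel#system returning a falsy value when the command fails or can't be run.

```ruby
# Hypothetical sketch of a fail-fast for_each.
# Runs `cmd` inside each directory and raises on the first failure,
# so later directories are left untouched.
def for_each_dir(dirs, cmd)
  dirs.each do |dir|
    ok = system(cmd, chdir: dir)  # true on exit status 0, otherwise false or nil
    raise "'#{cmd}' failed in #{dir}" unless ok
  end
end
```

Stopping at the first failure matters here: if a checkout fails in one submodule, carrying on would leave the submodules on a mix of branches.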

Merging changes from 'working' back into 'master' for every submodule
(and the superproject)?

What git.rake Doesn't Do

A couple of things that come quickly to mind that git.rake should
probably do:

Push to the shared server for ANY branch that we're tracking from a
remote branch.

Be more intelligent about when we push to the server. Right now, the
code pushes submodules to the shared server every time we want to
commit the superproject. We might be able to get away with only
pushing the submodules when we push the superproject.

Parsing the output from various 'git' commands is prone to breakage
if the git crew starts modifying some of the strings.

There should probably be some unit/functional tests. See previous
item.

Anyway, the code is all up on github. Go hack it, and send back patches!

15 comments:

I don't use submodules anymore. They are a hassle when dealing with a bunch of inter-related sub-projects. Instead I use subtree and merge. It's much more like merging a branch (because that's exactly what it is). The branch just gets placed in a sub-directory. Read about it here: http://dysinger.net/2008/04/29/replacing-braid-or-piston-for-git-with-40-lines-of-rake/ The rspec project had a disaster, which luckily they recovered from, where they lost 15 days of commits due to a submodule mess-up. Anyway, submodules are good for some things. But that list is pretty small IMO.

@dysinger - After reading up on subtrees in the How-To, it seems like they're not appropriate for the situation git-rake was developed to handle -- namely, code under active development.

Specifically, subtrees are great for merging upstream changes into your local (downstream) repository. But they do not provide functionality for committing changes back upstream to the (shared) repository.

If your big complaint with submodules is that they're a hassle, and git-rake makes handling submodules easy, then that's problem solved, no?

I think the exact opposite. I think submodules are for code that is not edited often. Every time you edit code in the submodule, you have to make an additional commit in the parent module so that everyone can see the latest changes in the submodule = hassle.

You can do two things for working on projects with sub-trees where you need to edit the sub-tree project.

1 - branch the sub-tree's remote just like you would branch any remote. Commit to them and push them back to their origin repo (which has already been added as a remote).

2 - you can always work on the sub-treed project in another directory where you have the project cloned on its own, just like any git project. When you are done, push to the remote and pull back (merge) into your "parent" project.

It works and is less cumbersome than submodules IMO. It's just branches and merging - active development as it's supposed to be.

I understand where you're coming from. However, git-rake automatically pulls your commit logs from the submodule into the superproject ... without a hassle. That's the whole point.

I don't spend any more time managing my submodules than I would if I had all my code in a single project. If I make a change, I run "rake git:commit", and whatever changes were made in my submodules get committed and pushed upstream, and the superproject pulls the commit logs into its own log. Easy peasy.

I will take another look at subtrees; however, your description of pushing changes back upstream sounds like more work than I'm doing now with git-rake.

But really, that's the wonderful thing about git. There's a workflow for everyone's taste.

dysinger was right when he said that the RSpec project had a lot of trouble with submodules and found them unsuitable for active development.

They seem to work fine, as does git-rake, until you have more than one developer. Then the submodules become a real PITA.

Let's say you've got a super project with submodule a. I make changes to some file in a and commit my changes, and update the submodule reference to be commit abc123. Now you make some changes to a file in a, commit your changes and update the submodule ref to be def456. When either of us updates, there will be a merge conflict on the submodule reference, because we're both saying that the submodule reference has a different commit ID. You can just accept your own SHA, because you know you've already merged the submodules. But it's annoying to have to do that every time. On top of that, it's way too easy to blow away local changes by doing "git submodule update". So submodules, for us, had no upsides and plenty of downsides.

I experimented with git-rake for a bit and it doesn't appear to handle this problem. Are you using it on a team with multiple active developers, or are you using it alone, managing external dependencies?

As far as how we solved it in the RSpec codebase, we just check out the "child" repos into the proper dirs and use .gitignore to ignore their files from the "parent" repo. But there's no direct relationship between them, so we don't have any of the headaches of submodules.

@pat: I understand the issue as you've described, and was able to reproduce it, and it's clarified some things for me.

There are two reasons I (and my company) have not walked into this spinning propeller of submodule hell:

1) We always make sure our submodules are on a branch, and not on the (default) detached HEAD. This allows us to ...

2) Always pull changes from the shared repo before committing local changes.

Now, this is only feasible on branches with relatively low update frequencies.

That is, if you can't be reasonably sure that the shared repository hasn't changed between the times you do a 'git pull' and a 'git push' on your submodule, then (proverbially speaking) you're F-ed in the A.

So, I acknowledge that there is no reasonable way to make submodules work for frequently-updated projects, even with git.rake.

Many thanks to you and dysinger for pointing this out. It's definitely an issue. I'll be updating the documentation in git.rake's github repo to alert people.

We just set up our dev env in git with submodules so that we can get partial workspaces and versions for each of our dependent modules. Not everyone needs everything. Now I came across this post. Is this still the situation with submodules? Is there any better way to have sparse workspaces? Thanks.