Sunday, 31 March 2013

I had some bad experiences with git a few years back, and until a year or so ago, you couldn't get me to touch it with a ten-parsec pole. Let's just say that things got lost… changes, files, jobs, companies, little things like that. So, building out a Rails shop over the last year, with git being The Standard™(thank you, Github), you'd think I'd be super-extra-hyper-careful about it, right?

Well, the training wheels have to come off sometime, and if they take rather important bits of your skull along for the ride, consider it a Learning Experience. You were warned, weren't you? You did read that .2-point type on the side of the tin with the big label LIFE! that said "Death will be an inevitable finale to Life. Don't bother the lawyers", didn't you? Oh, well. Details.

Rule #16,472 in Dickey's Secrets of Software Survival states: "Never write important code while seriously ill with flu or whatnot. The results would be worse than if you had written the code while drunk. In fact, you may want to give serious consideration to getting drunk before attempting to code while sick." Because what violating 16,472 can get you is shenanigans like committing great laundry lists of files, several times, only to find yourself staring at a red bar and wondering how you got there. (Hot the Red Bar in Cable Street in London, but the red bar that is your test tool saying "your code and/or your specs/tests are busted, bub". That red bar.) This has actually happened to me not once, but twice in the last couple of months.)

At which point, you mutter dark threats under your breath at the imbecile who put you in this position (yourself) and start retracing your steps. Whereupon you find that your last genuine, not-because-things-were-wildly-cached proven-working version was five commits earlier. Five commits that have been pushed to the Master Repo on Github earlier. You then note that each of these touched a dozen files or so (because you don't like seeing "19,281 commits" in the project window when you know you're just getting started), and a couple hours of spelunking (as opposed to caving) leaves you none the wiser. What to do?

The first thing to remember is that there is nothing structurally or necessarily semantically significant about the master branch in git. As far as I can tell, it's merely the name assigned to the first branch created in a new repo, which is traditionally used as the gold-standard branch for the repo. (Create new branch, do your thing, merge back into master. Lather; rinse; repeat. The usual workflow.) On most levels, there's nothing that makes deleting the master branch any different than any other.

Of course, the devil is in the details. You've got to have at least one other surviving branch in the repo, obviously, or everything would go poof! when you killed master. And remote repos, like Github, have the additional detail that specifies HEAD as the default branch on that remote. (Thanks to Jefromi for this explanation on StackOverflow, to another guy's question.) There always has to be a default branch; it's set to master when the repo is created and very rarely touched. But it's nice to know that you can touch, or maul, it when the need arises.

Here's what I did:

I went back to that last proven good commit, five commits back, and created a new repair branch from that commit;

I scavenged the more mundane parts of what had been done in the commits on master after the branch point, and made (bloody well) sure specs fully covered the code and passed;

I methodically added in the code pieces that weren't so mundane, found the bug that had led me astray earlier, and fixed it. This left me with a repair branch that was everything master would have been had it been working.

Now for the seatbelts-and-helmets-decidedly-off part.

I verified that I was on the repair branch locally;

I deleted the master branch in my local repo

I ran git branch master to create a new master branch. (Note that we haven't touched the remote repo yet);

I checked out the (new) master branch (which at this point is an exact duplicate of repair, remember);

I (temporarily) pushed the repair branch to origin;

I used the command git remote set-head origin repair to remove the only link that really mattered to the existing master branch;

I deleted the master branch on the remote ("origin") repo as I would any other remote branch;

I force-pushed the new master to the remote, using the command git push --force origin master. I needed the --force option to override git's complaint that the remote branch was ahead of the one I was pushing;

I ran git remote set-head origin master to restore git's default remote branch to what it had recently been; and

I deleted the repair branch from origin

Transmogrification complete, and apparently successful. Even my GUI git client, SourceTree, only paused a second or so before displaying the newly-revised order of things in the repo. It's useful to know that you can trade the safety scissors for a machete when you really do feel the need.

However… so can anyone else with access to the repo. And I can easily see scenarios in a corporate setting where that might be a Very Bad Thing indeed. In a repo used by a sizeable team, with thousands of commits, it wouldn't be all that difficult for a disgruntled or knowingly-soon-to-be-disgruntled team member to write a program that would take the repo back far enough into the mish-mashed past to deter detection, create a clandestine patch to a feature branch that (in the original history) was merged into master some time later, and then walk the commits back onto the repo, including branches, before pushing the (tainted) repo back up to origin. I'd think that your auditing and SCM controls would have to be pretty tight to catch something like that. I'd also think that other DVCSes such as Mercurial or Bazaar would have a much harder time successfully doing what I did, and therefore, would be less likely to such exploitation. That this hasn't been done on a scale wide enough to be well-publicised, I think, speaks quite loudly about the ethics, job satisfaction, and/or laziness of most development-team members.

2 comments:

In git the 'traditional' way of maintaining a central repo has been to have devs push to their own forks and have an integration manager pull into the central repo from those forks. This way devs don't get the chance to mess up the central repo. (Presumable the integration person knows what they're doing.)

Of course, people are more used to the model where everyone pushes to the central repo, because that's how it worked in the centralised VCS world.

Actually I just reread your last para and saw what you're trying to say about a malicious dev injecting bad code into some point in the repo history. That would be laughably easy to detect with git. See Linus Torvalds' Google tech talk from 2007 for how, and also see http://www.linux.com/news/featured-blogs/171-jonathan-corbet/491001-the-cracking-of-kernelorg for a real example of how the Linux kernel sources themselves were safe from malicious attacks thanks to git.

About Me

Hi! I'm an American Web/software lead/manager with a more-than-stereotypically-cool startup, presently based in Singapore. I've been lucky enough to do a lot of travel in my life, living in various Asian countries for nearly the last ten years. I've been doing IT development and consulting for 30+ years now, and I have pretty strong opinions on things - mostly a very low tolerance of what I see as technical BS without overwhelming business necessity.