Here's an odd thing about the git bisect command: It has only 1 option
(--no-checkout). Compare with eg git commit, which has 36 options by my
count.

The difference is largely down to git having a pervasive culture of
carefully edited history. We need lots of git commit options to carefully
produce commits that look Just Right. Staging only some of the files we've
edited, perhaps even staging only some of the changes within a file. Amend
that commit if we notice we made a mistake. Create a whole series of beautiful
commits, and use rebase later to remix them into a more beautiful whole.

Beautiful fake histories. Because coding is actually messy; our real
edit history is full of blind alleys and doublings back, and contains
periods of many days when the code isn't building properly. We want to sweep
that complexity away, hide it under the rug. This works well except when it
doesn't -- when some detail airbrushed out of the only remaining history
turns out to be important.

Once we have these beautiful fake histories of changes, we can easily
bisect them and find the commit that introduced a bug. So bisect doesn't
need a lot of options to control how it works.

I'd like to suggest a new option though. At least as a thought experiment.
--merges-only would make bisect only check the merge commits in the
range of commits being bisected. The bisection would result in not a single
commit, but in the set of commits between two merges.
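Until such an option exists, something close can be scripted with git
bisect skip. A minimal sketch in a throwaway demo repository (the repo
layout and names here are all made up for illustration):

```shell
# Approximate the hypothetical "--merges-only" by telling bisect to
# skip every non-merge commit in the range, so it only ever stops on
# merge commits. The demo repo built here is contrived.
git init -q bisect-demo && cd bisect-demo
git config user.email demo@example.com
git config user.name demo
echo a > f && git add f && git commit -qm base && git tag good
git checkout -qb feature
echo b >> f && git commit -aqm feature-work
git checkout -q -
git merge -q --no-edit --no-ff feature   # the only merge in the range
echo c > g && git add g && git commit -qm more-work
git bisect start HEAD good
git bisect skip $(git rev-list --no-merges good..HEAD)
git bisect reset
```

After the skip, each `git bisect good`/`git bisect bad` step can only
land on a merge commit, which is the behaviour --merges-only would
give directly.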

I suspect this would be useful for faster bisecting some histories of the
beautiful fake kind. But I know it would be useful when the history is
messy and organic and full of false starts and points where the code
doesn't build. Merges, in such histories, are often the points where things
reach a certain level of beauty, where that messy feature branch
got to the point it all built again (please let this happen today)
and was merged into master. Bisecting such points in a messy organic history
should work about as well as bisecting carefully gardened histories.

I think I'll save the full rant about beautiful fake history vs messy real
history for some other day. Or maybe I've already ranted that rant here
before, I can't remember.

Let's just say that I personally come down on the side of liking my git
history to reflect the actual code I was working on, even if it was broken
and even if I threw it away later. I've indeed taken this to extreme
lengths with propellor;
in its git history
you can see every time I've ever run it, and the version of my config
file and code at that point. Apologies to anyone who's been put off by that...
But oddly, propellor gets far more contributions from others than any of my
other Haskell programs.

I've just released git-annex version 3, which stops cluttering
the filesystem with .git-annex directories. Instead it stores its
data in a git-annex branch, which it manages entirely transparently
to the user. It is essentially now using git as a distributed NoSQL database.
Let's call it a databranch.

This is not an unheard-of thing to do with git. The git notes feature built
into recent git does something similar, using a dynamically balanced tree in
a hidden branch to store notes. My own pristine-tar injects data into
a git branch. (Thanks to Alexander Wirt for showing me how to do that
when I was a git newbie.) Some
distributed bug trackers store
their data in git in various ways.

What I think takes git-annex beyond these is that it not only injects data
into git, but it does it in a way that's efficient for large quantities of
changing data, and it automates merging remote changes into its databranch.
This is novel enough to write up how I did it, especially the latter which
tends to be a weak spot in things that use git this way.

Indeed, it's important to approach your design for using git as a database
from the perspective of automated merging. Get the merging right and the
rest will follow. I've chosen to use the simplest possible merge, the union
merge: When merging parent trees A and B, the result will have all files
that are in either A or B, and files present in both will have their lines
merged (and possibly reordered or uniqed).
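Git's built-in line-level union merge driver can be tried out by
enabling it in .gitattributes. A small self-contained demo (the file
names are illustrative):

```shell
# Demo of git's built-in union merge driver: two branches append
# different lines to the same file, and the merge keeps both lines
# without conflicting.
git init -q union-demo && cd union-demo
git config user.email demo@example.com
git config user.name demo
echo '*.log merge=union' > .gitattributes
echo base > a.log
git add -A && git commit -qm base
git checkout -qb side
echo side >> a.log && git commit -aqm side
git checkout -q -
echo trunk >> a.log && git commit -aqm trunk
git merge -q --no-edit side
cat a.log   # all three lines survive the merge
```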

The main thing git-annex stores in its databranch is a bunch of
presence logs.
Each log file corresponds to one item, and has lines with this form:

timestamp [0|1] id

This records whether the item was present at the specified id at a given time.
It can be easily union merged, since only the newest timestamp for an id
is relevant. Older lines can be compacted away whenever the log is updated.
Generalizing this technique for other kinds of data is probably an
interesting problem. :)
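As a sketch of what that compaction looks like, here is one way to do
it with standard tools, assuming the "timestamp [0|1] id" format above
(the sample data is made up):

```shell
# Compact a presence log: keep only the newest line per id.
cat > presence.log <<'EOF'
1309117750 1 repo-a
1309117755 0 repo-a
1309117752 1 repo-b
EOF
# Sort newest-first, keep the first line seen for each id (field 3),
# then restore timestamp order.
sort -k1,1nr presence.log | awk '!seen[$3]++' | sort -k1,1n > compacted.log
cat compacted.log   # one line per id, each with its newest timestamp
```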

While git can union merge changes into the currently checked out branch,
when using git as a database, you want to merge into your internal-use
databranch instead, and maintaining a checkout of that branch is inefficient.
So git-annex includes a general purpose
git-union-merge command
that can union merge changes into a git branch, efficiently, without
needing the branch to be checked out. Another problem is how to trigger the
merge when git pulls changes from remotes. There is no suitable git hook
(post-merge won't do because the checked out branch may not change at all).
git-annex works around this problem by automatically merging */git-annex
into git-annex each time it is run. I hope that git might eventually get
such capabilities built into it to better support this type of thing.

So that's the data. Now, how to efficiently inject it into your databranch?
And how to efficiently retrieve it?

The second question is easier to answer, although it took me a while to
find the right way ... Which is two orders of magnitude faster than the
wrong way, and fairly close in speed to reading data files directly
from the filesystem.
The right choice is to use git-cat-file --batch; starting it up the
first time data is requested, and leaving it running for further queries.
This would be straightforward, except git-cat-file --batch is a little
difficult when a file is requested that does not exist. To detect that,
you'll have to examine its stderr for error messages too. Perhaps
git-cat-file --batch could be improved to print something machine
parseable to stdout when it cannot find a file. It takes some careful
parsing, but is straightforward enough.
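A minimal sketch of the batch-query approach, in a throwaway demo
repository. (Note that newer versions of git do now report missing
objects as "&lt;name&gt; missing" on stdout; older versions gave the stderr
errors described above.)

```shell
# Query blobs through one long-running cat-file process: feed object
# names on stdin, read sizes and contents back on stdout.
git init -q catfile-demo && cd catfile-demo
git config user.email demo@example.com
git config user.name demo
echo hello > greeting && git add greeting && git commit -qm initial
printf '%s\n' HEAD:greeting HEAD:nosuchfile |
    git cat-file --batch > batch.out 2> batch.err || true
cat batch.out   # "<sha> blob 6" then "hello"; then the miss report
```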

Efficiently injecting changes into the databranch was another place where
my first attempt was an order of magnitude slower than my final code.
The key trick is to maintain a separate index file for the branch.
(Set GIT_INDEX_FILE to make git use it.) Then changes can be fed
into git by using git hash-object, and those hashes recorded into
the branch's index file with git update-index --index-info. Finally,
just commit the separate index file and update the branch's ref.
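Here's a minimal sketch of that separate-index technique; the branch
name, path, and log contents are illustrative:

```shell
# Commit to a branch that is never checked out, via a private index
# file. The working tree and the normal index are untouched.
git init -q databranch-demo && cd databranch-demo
git config user.email demo@example.com
git config user.name demo
export GIT_INDEX_FILE=.git/databranch.index
blob=$(echo '1309117750 1 repo-a' | git hash-object -w --stdin)
printf '100644 blob %s\tlogs/item1.log\n' "$blob" |
    git update-index --index-info
tree=$(git write-tree)
commit=$(echo 'update databranch' | git commit-tree "$tree")
git update-ref refs/heads/databranch "$commit"
unset GIT_INDEX_FILE
git show databranch:logs/item1.log   # the data is in the branch
```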

That works ok, but the sad truth is that git's index files don't scale well
as the number of files in the tree grows. Once you have a hundred thousand
or so files, updating an index file becomes slow, since for every update,
git has to rewrite the entire file. I hope that git will be improved to
scale better, perhaps by some git wizard who understands index files (does
anyone except Junio and Linus?) arranging for them to be modified in-place.

In the meantime, I use a workaround: Each change that will be committed to
the databranch is first recorded into a journal file, and when git-annex
shuts down, it runs git hash-object just once, passing it all the journal
files, and feeds the resulting hashes into a single call to git
update-index. Of course, my database code has to make sure to check the
journal when retrieving data. And of course, it has to deal with possibly
being interrupted in the middle of updating the journal, or before it can
commit it, and so forth. If gory details interest you, the complete code
for using a git branch as a database, with journaling, is
here.
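The batching itself can be sketched like this, with made-up journal
file names; the real code linked above also handles journal lookup and
interruption recovery:

```shell
# Queue changes as plain journal files, then at shutdown do ONE
# hash-object call and ONE update-index call for all of them.
git init -q journal-demo && cd journal-demo
git config user.email demo@example.com
git config user.name demo
mkdir journal
echo '1309117750 1 repo-a' > journal/item1.log
echo '1309117751 0 repo-b' > journal/item2.log
export GIT_INDEX_FILE=.git/databranch.index
git hash-object -w journal/*.log > hashes   # one call, many files
ls journal/*.log > paths
paste hashes paths | while read sha path; do
    printf '100644 blob %s\t%s\n' "$sha" "$path"
done | git update-index --index-info        # one call, many entries
commit=$(echo 'flush journal' | git commit-tree "$(git write-tree)")
git update-ref refs/heads/databranch "$commit"
unset GIT_INDEX_FILE
git show databranch:journal/item1.log
```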

After all that, git-annex turned out to be nearly as fast as it was
before, when it simply read files from the filesystem, and actually faster
in some cases. And without the clutter of the .git-annex/ directory,
git use is overall faster, commits are uncluttered, and there's no difficulty
with branching. Using a git branch as a database is not always the right
choice, and git's plumbing could be improved to better support it, but it
is an interesting technique.

I wrote this code today to verify setup branch pushes on Branchable.
When I was writing it I was just down in the trenches coding until it worked,
but it's rather surprising that what it does to git works at all.

The following code runs in git's update hook. The fast path of the hook
(written in C) notices that the user is committing a change to the setup
branch, and hands the incoming git ref off to the setup verifier.

# doing a shared clone makes the setupref, which has
# not landed on any branch, be available for checkout
shell("git", "clone", "--quiet", "--shared", "--no-checkout",
	repository($hostname), $tmpcheckout);
chdir($tmpcheckout) || error "chdir $tmpcheckout: $!";
shell("git", "checkout", "--quiet", $setupref, "-b", "setup");

I got lucky here, since I initially passed --shared only to avoid
the overhead of a clone of the site's entire git repository (which can
be quite large, since Branchable doesn't have any real limits on site size).
Without the --shared, the clone wouldn't see the incoming ref at all.

In the setup branch is an ikiwiki.setup file, and we only want to allow
safe changes to be committed to it. Ikiwiki has metadata about which
configurations are safe. Checking that and various other amusing scenarios
(what if someone makes ikiwiki.setup a symlink etc) takes a hundred lines
of fairly hairy code, but that doesn't matter here. Eventually it decides
the setup file is ok as-is, or it's already died with an error message.

# Check out setup file in toplevel. This is slightly tricky
# as the commit has not landed in the bare git repo yet --
# but it is available in the tmpcheckout.
shell("git", "pull", "-q", $tmpcheckout, "setup");
# Refresh or rebuild site to reflect setup changes.
print STDERR "Updating site to reflect setup changes...\n";
shell("ikiwiki", "-setup", "ikiwiki.setup", "-v",
	($rebuild_needed ? ("-rebuild") : ("-refresh", "-wrappers")));

When this code runs there are three repositories, each with a different
view of the setup branch. The main bare repository is waiting for the
hook to succeed before it updates the ref to point to what was pushed.
The temporary clone has what was pushed already checked out.
And the site's home directory still has the old version of the setup
branch checked out. Possibly even a version that has diverged from what's
in the bare repository.

It's rather odd that the update hook goes and causes that latter
repository to be updated, before the change has finished landing in
the bare repository. But it does work; it ensures that if there is some
kind of bizarre merge problem, the user doing the push sees it, and I
probably won't regret it.

The result certainly is nice -- edit the ikiwiki.setup file locally,
commit and push it, and ikiwiki automatically reconfigures itself
and even rebuilds your whole site if you've changed something significant.

Couchdb came onto my radar since distributed stuff is interesting to me
these days. But most of what was being written about it put me off, since
it seemed to be very web-oriented, with javascript and html and stuff
stored in the database, served right out of it to web browsers in an
AJAXy mess.

Also, it's a database. I decided a long, long time ago not to mess with
traditional databases. (They're great, they're just not great for me. Said
the guy leaving after 5 years in the coal mines.)

Then I saw Damien Katz's
talk about how
he gave up everything to go off and create couchdb. Was very inspirational.
Seemed it must be worth another look, with that story behind it.

Now I'm reading the draft O'Reilly book,
like some things, as expected don't like others[1], and am not sure what to
think overall (plus still have half the book to get through yet),
but it has spurred some early thoughts:

... vs DVCS

Couchdb is very unlike a distributed VCS, and yet it's moved from
traditional database country much closer to VCS land. It's document
oriented, not normalized; the data stored in it has significant structure,
but is also in a sense freeform. It doesn't necessarily preserve all
history, but it does support multiple branches, merging, and conflict
resolution.

Oddly, the thing I dislike most about it is possibly its biggest strength
compared to a VCS, and that is that code is stored in the database
alongside the data. That means that changes to the data can trigger
processing, so it is mapped, reduced, views are updated, etc, on demand.
This is done using code that is included in the database, and so is always
available, and runs in an environment couchdb provides -- so replicating
the database automatically deploys it.

Compare with a VCS, where anything that is triggered by changes to the data
is tacked onto the side in hooks, has to be manually set up, and so is poorly
integrated overall.

Basically, what I've been doing with ikiwiki is adding some smarts
about handling a particular kind of data, on top of the VCS. But this is
done via a few narrow hooks; cloning the VCS repository does not get you a
wiki set up and ready to go.

There are good reasons why cloning a VCS repository does not clone the
hooks associated with it. The idea of doing so seems insane; how could you
trust those hooks? How could they work when cloned to another environment?
And so that's Never Been Done[2]. But with couchdb's example, this is
looking to me like a blind spot, that has probably stunted the range of
things VCSs are used for.

If you feel, like I do, that it's great we have these amazing distributed
VCSs, with so many advanced capabilities, but a shame that they're only
used by software developers, then that is an exciting thought.

[1] Javascript? Mixed all in a database with data it runs on? Imperative
code that's supposed to be side-effect free? (I assume the Haskell guys
have already been all over that.) Code stored without real version
control? Still having a hard time with this. :)

[2] I hope someone will give a counterexample of a VCS that does so in the
comments?

Posted in the wee hours of Tuesday night, October 28th, 2009
Tags:
git

I've used unison for a long while for keeping things like my music in sync
between machines. But it's never felt entirely safe, or right. (Or fast!)
Using a VCS would be better, but would consume a lot more space.

Well, space still matters on laptops, with their smallish SSDs, but I have
terabytes of disk on my file servers, so VCS space overhead there is no
longer of much concern for files smaller than videos. So, here's a way
I've been experimenting with to get rid of unison in this situation.

Set up some sort of networked filesystem connection
to the file server. I hate to admit I'm still using NFS.

Log into the file server, init a git repo,
and check all your music (or whatever) into
it.

When checking out on each client, use git clone --shared.
This avoids including any objects in the client's local .git
directory.

git clone --shared /mnt/fileserver/stuff.git stuff

Now you can just use git as usual, to add/remove stuff,
commit, update, etc.

Caveats:

git add is not very fast. Reading, checksumming, and writing
out gig after gig of data can be slow. Think hours. Maybe days.
(OTOH, I ran that on a Thecus.)

Overall, I'm happy with the speed, after the initial setup.
Git pushes data around faster than unison, despite not
really being intended to be used this way.

Note the use of git clone --shared, and read the caveats about
this mode in git-clone(1).

git repack is not recommended on clients because it would read
and write the whole git repo over NFS.

Make sure your NFS server has large file support. (The userspace
one doesn't; kernel one does.) You don't just need it for enormous pack
files. The failure mode I saw was git failing in amusing ways that
involved creating empty files.

Git doesn't deal very well with a bit flipping somewhere
in the middle of a 32 gigabyte pack file. And since this
method avoids duplicating the data in .git, the clones
are not available as backups if something goes wrong.
So if regenerating your entire repo doesn't appeal, keep
a backup of it.

(Thanks to Ted T'so for the hint about using --shared,
which makes this work significantly better, and simpler.)

If it looks good, next steps will be making things like
gitweb, viewvc, ikiwiki, etc, support it. I've already written a
preliminary webcheckout tool that will download an url, parse
the microformat, and run the appropriate VCS program(s).

(Followed by, with any luck, github, ohloh, etc using the
microformat in both the pages they publish, and perhaps,
in their data importers.)

Why? Well,

A similar approach worked great for Debian source packages
with the XS-VCS-* fields.

Pasting git urls from download pages of software
projects gets old.

I'm tired of having to do serious digging to find
where to clone the source to websites like Keith
Packard's blog, or cairographics.org, or St
Hugh of Lincoln Primary School. Sites that I
know live
in a git repo, somewhere.

With the downturn, hosting sites are going down left and
right, and users who trusted their data to these sites
are losing it. Examples include
AOL Hometown and Ficlets,
Google lively,
Journalspace,
podango, etc etc. Even livejournal's future is
looking shaky.
Various people are
trying to archive some of this data before it vanishes for good.
I'm more interested in establishing best practices that
make it easy and attractive to let all the data on
your website be cloned/forked/preserved. Things that people
bitten by these closures just might demand in the future.
This will be one small step in that direction.

I'm writing a piece of autobiography/alternate world fiction,
using git. Whether it will get finished or be any good, or be too
personal to share I don't know. The idea though is sorta interesting -- a
series of descriptions of inflection points in a life, each committed into
git at the time it describes. As the life paths diverge, branches form, but
never quite merge.

Reading this would not be quite like reading one of those choose your own
adventure books. Rather you'd start at the end of a path and read back
through the choices and events that led there. Or browse around for
interesting nuggets in gitk. Or perhaps the point isn't that it be read
at all, but is instead in the writing, and the committing.

The secret sauce, that makes this not a recipe for disaster but just a
nice feature, is that ikiwiki checks each change as it's pushed in, and
rejects any changes that couldn't be made to the wiki with a web browser.

I'm envisioning a graphical app that displays a file. Like a pager, the up
and down arrows move through the file. But the left and right arrows move
through time. As each successive change to the file is displayed, the
committer's name appears in a column to the left of the lines changed in
that commit. Hover the mouse over it to see the commit message. Names of
old committers will fade out as time advances, but still be visible
for a while. (A menu option will disable the fade out entirely.)

A nice bonus feature would be to allow opening multiple windows, with
multiple files from the same repo. Moving back and forward in time would
affect them all at once.

A nice, but getting harder feature would be to have a horizontal timeline
at the bottom, including branches, so you could click on a specific branch
to visit it. (Without this, when passing a fork or merge point, it would
have to choose a branch heuristically?)

A tricky subtle feature would be to attempt to keep the current code
block centered in the display as lines are added/removed from the file,
adjusting scroll bar position to compensate.

There seems to be a gannotate for bzr that may do something like this.
I'm offline, so I can't try it.

Google-and-caffeine-fed update: bzr gannotate is closest to what I envisioned,
though without a few of the bonuses (fade-out, smart scrolling, multiple
files). qgit's "tree view" includes the same functionality, but the interface
isn't as nice.

I use the standard ciabot.pl script in a git post-receive hook. This
works ok, except in the case where changes are made in a published branch,
and then that branch is merged into a second published branch.

In that case, ciabot.pl reports all the changes twice, once when they're
committed to the published branch, and again when the branch is merged.
This is worst when I sync up two branches; if there were a lot of changes
made on either branch, they all flood into the irc channel again.

Am I using the ciabot.pl script wrong, or is there a better script I should
use? Or maybe there's a CIA alternative that is smarter about git commits,
so it will filter out duplicates?