Sun, 31 Aug 2008

I wanted to get a list of who'd been contributing the most in a
particular open source project. Most projects of any size have a
ChangeLog file, in which check-ins have entries like this:

2008-08-26 Jane Hacker <hacker@domain.org>
* src/app/print.c: make sure the Portrait and Landscape
* buttons update according to the current setting.

I wanted to take each entry, save the name of the developer checking
in, then eventually count the number of times each name occurs (the
number of times that developer checked in) and print them in order
from most check-ins to least.

Getting the names is easy: for check-ins in the last 9 years, I just
want the lines that start with "200". (Of course, if I wanted earlier
check-ins I could make the match more general.)

grep "^200" ChangeLog

But now I want to trim the line so it includes only the
contributor's name. A bit of sed geekery can do that: the date is a
fixed format (four characters, a dash, two, dash, two, then two
spaces, so "^....-..-.. " matches that pattern.

But I want to remove the email address part too
(sometimes people use different email addresses
when they check in). So I want a sed pattern that will match
something at the front (to discard), something in the middle (keep that part)
and something at the end (discard).

Here's how to do that in sed:

grep "^200" ChangeLog | sed 's/^....-..-.. \(.*\)<.*$/\1/'

In English, that says: "For each line in the ChangeLog that starts
with 200, find a pattern at the beginning consisting of any four
characters, a dash, two characters, dash, two characters, dash, and
two spaces; then immediately after that, save all characters up to
a < symbol; then throw away the < and any characters that follow
until the end of the line."

That works pretty well! But it's not quite right: it includes the
two spaces after the name as part of the name. In sed, \s matches
any space character (like space or tab).
So you'd think this should work:

grep "^200" ChangeLog | sed 's/^....-..-.. \(.*\)\s+<.*$/\1/'

\s+ means it will require that at least one and maybe more space
characters immediately before the < are also discarded.
But it doesn't work. It turns out the reason is that the \(.*\)
expression is "greedier" than the \s+: so the saved name expression
grabs the first space, leaving only the second to the \s+.

The way around that is to make the name expression specify that it
can't end with a space. \S is the term for "anything that's not a
space character"; so the expression becomes

grep "^200" ChangeLog | sed 's/^....-..-.. \(.*\S\)\s\+<.*$/\1/'

(the + turned out to need a backslash before it).

We have the list of names! Add a | sort on the end to
sort them alphabetically -- that will make sure you get all the
"Jane Hacker" lines listed together. But how to count them?
The Unix program most frequently invoked after sort
is uniq, which gets rid of all the repeated lines.
On a hunch, I checked out the man page, man uniq,
and found the -c option: "prefix lines by the number of occurrences".
Perfect! Then just sort them by the number, from largest to
smallest:

Now, this isn't perfect since it doesn't catch "Checking in patch
contributed by susan@otherhost.com" attributions -- but those aren't in
a standard format in most projects, so they have to be handled by hand.

Disclaimer: Of course, number of check-ins is not a good measure of
how important or productive someone is. You can check in a lot of
one-line fixes, or you can write an important new module and submit
it for someone else to merge in. The point here wasn't to rank
developers, but just to get an idea who was checking into the tree
and how often.

Well, that ... and an excuse to play with nifty Linux shell pipelines.