A Song of Graph and Report, Part I: A Gathering of Stats

created: February 2, 2014

Statistics and graphs. They don't always mean much, but they can be so darn mesmerizing.

It's in that spirit that I was looking at the Map of CPAN the other day and thought "wouldn't it be cool to have the same view within a distribution?". It's not that I need to do that for anything specific, mind you. But to have a distribution split into territories based on authors and files... Maybe animated throughout different releases, to see the ebb and flow of contributors and code. It'd be so nifty!

Now, considering that there are many things that can be done with those statistics -- and because I should probably do some real work between bouts of fooling around -- I decided to cut the exercise into different parts. Therefore, today we'll begin with the essential, if a tad dry, step required for everything that will follow. Namely: the gathering of the statistics.

Lay The Blames Around

The good news is that Git already has its git blame sub-command, which provides us with a per-line author attribution of the codebase. All we have to do is interface with it, munge its data to our liking, and we'll be good to go.
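As a rough illustration of that interfacing, here is a minimal sketch, not necessarily the way the original script does it. It assumes we read `git blame --line-porcelain`, where every blamed line comes with an "author <name>" header and the line's content is prefixed by a tab. The helper authors_from_porcelain() and the exact signature of file_authors() are illustrative guesses.

```perl
use 5.010;
use strict;
use warnings;

# Turn `git blame --line-porcelain` output into one author name per line.
# (Hypothetical helper: kept separate so it can be fed canned output.)
sub authors_from_porcelain {
    my @porcelain = @_;
    my ( @authors, $current );
    for my $line (@porcelain) {
        if ( $line =~ /^author (.*)/ ) {
            $current = $1;              # header naming the author of the next line
        }
        elsif ( $line =~ /^\t/ ) {
            push @authors, $current;    # tab-prefixed line = actual file content
        }
    }
    return @authors;
}

# Assumed wrapper: blame one file at a given revision (HEAD by default)
# and return an array ref with one author name per line.
sub file_authors {
    my ( $file, $revision ) = @_;
    $revision //= 'HEAD';
    open my $blame, '-|', 'git', 'blame', '--line-porcelain', $revision, '--', $file
        or die "cannot run git blame on $file: $!";
    chomp( my @lines = <$blame> );
    close $blame;
    return [ authors_from_porcelain(@lines) ];
}
```

Calling file_authors('lib/Foo.pm') inside a Git checkout would then give back an array ref of names, one per line of the file (the path is only an example).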

Same Thing, But Compressed

With a per-file blame pass in place, we now have an array for each file, holding the name of the author of each line. If we just want to tally the overall contribution of each author, that's slightly overkill, but it'll come in handy when we're ready to draw the author map of each file.

Still... one entry per line, that's quite verbose. Instead, let's try to scrunch that into a tally of the successive lines associated with the same author:
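One way to do that scrunching is a plain run-length encoding. The sketch below assumes the per-line list is a flat list of author names. The function name compress_authors() and the [ author, count ] pair layout are my own picks for illustration.

```perl
use strict;
use warnings;

# Collapse a per-line author list into [ author, count ] runs,
# e.g. (A, A, B, A) becomes ( [A, 2], [B, 1], [A, 1] ).
sub compress_authors {
    my @authors = @_;
    my @runs;
    for my $author (@authors) {
        if ( @runs and $runs[-1][0] eq $author ) {
            $runs[-1][1]++;                # same author as the previous line: extend the run
        }
        else {
            push @runs, [ $author, 1 ];    # author changed: start a new run
        }
    }
    return @runs;
}
```

The order of the runs is preserved, so the structure can still be used later to draw the file line by line.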

Aaaah, yes. Not as dirt-simple as before, but the data structure is now much less wasteful.

Massively Parallel Blaming

Next challenge: git blame is a little bit on the slow side. Not sloth-on-a-diet-of-snails kind of slow, but it does have to spit out all lines of every file. For big projects, that takes a few seconds.

But... doesn't our little file_authors() function work in isolation for each file? Wouldn't that be a perfect moment to whip out some parallelization ninja trick? Like... oh, I don't know, try out that neat new MCE module that's been making the rounds these days?
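Here is a sketch of what that could look like with MCE::Map, which provides a parallel mce_map that mirrors the built-in map. The wrapper blame_in_parallel(), the choice of four workers and the idea of passing the per-file worker as a code ref are assumptions made for the example, not something lifted from the original script.

```perl
use strict;
use warnings;
use MCE::Map;

# Run $worker->($file) for every file across several worker processes
# and return a ( file => result ) list.
sub blame_in_parallel {
    my ( $worker, @files ) = @_;

    # chunk_size => 1 hands out one file at a time, so a single huge file
    # does not hold back a whole batch; max_workers => 4 is an arbitrary pick.
    MCE::Map::init { max_workers => 4, chunk_size => 1 };

    # each worker returns a [ file, result ] pair; mce_map keeps input order
    my @pairs = mce_map { [ $_ => $worker->($_) ] } @files;

    MCE::Map::finish();    # shut the workers down

    return map { @$_ } @pairs;
}
```

With a file_authors() function at hand, the call would be along the lines of `my %authors_of = blame_in_parallel( \&file_authors, @files );`. Whether the speed-up is worth the extra processes depends on the size of the project.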

Time To Generate a Report

The numbers are now ours to do with as we see fit. Pretty graphical stuff will have to wait for the next installments, but why not create a quick credit roll for the project version we just munged?

use 5.10.0;

use List::AllUtils qw/ part /;

# assumes %authors (author name => percentage of the lines),
# $nbr_lines and $project have been populated by the previous steps

# partition 0 holds the authors below 1%, partition 1 all the others,
# each sorted from the biggest contributor to the smallest
my( $minor, $major ) = part { $authors{$_} >= 1 ? 1 : 0 }
    reverse
    sort { $authors{$a} <=> $authors{$b} }
    keys %authors;

my $lines = $nbr_lines;

# 1000000 => 1,000,000
1 while $lines =~ s/^(\d+)(\d{3})/$1,$2/;

say <<"END";
# CREDIT ROLL

This is the list of all persons who had their hands in
crafting the $lines lines of code that make
this version of $project, according to `git blame`.

This being said, don't take those statistics too seriously,
as they are at best a very naive way to judge the contribution
of individuals. Furthermore, it doesn't take into account the army
of equally important people who reported bugs, worked on previous
versions and assisted in a thousand different ways. For a glimpse of
this large team, see the CONTRIBUTORS file.

## The Major Hackarna

All contributors claiming at least 1% of the lines of code of the project.

END

# part() leaves a partition undefined when nothing falls into it
printf " * %-50s %2d %%\n", $_, $authors{$_} for @{ $major || [] };

if ( $minor ) {
    say "\n\n## The Minor Hackarna\n\n",
        "With contributions of: \n";
    say " * ", $_ for @$minor;
}
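The report expects %authors to hold percentages and $nbr_lines the total line count. How those get computed isn't shown here, so the following is only a sketch of one way to derive them. It assumes each file's blame data is a list of [ author, line count ] runs, and the name tally_authors() is made up for the occasion.

```perl
use strict;
use warnings;

# $runs_for maps a file name to its list of [ author, line count ] runs.
# Returns the total number of lines, followed by an
# ( author => percentage of all lines ) list.
sub tally_authors {
    my ($runs_for) = @_;
    my %lines_by;
    my $nbr_lines = 0;
    for my $runs ( values %$runs_for ) {
        for my $run (@$runs) {
            my ( $author, $count ) = @$run;
            $lines_by{$author} += $count;
            $nbr_lines         += $count;
        }
    }
    # no author means no lines, so the division is never done on a zero total
    my %authors = map { $_ => 100 * $lines_by{$_} / $nbr_lines } keys %lines_by;
    return ( $nbr_lines, %authors );
}
```

Used as `my ( $nbr_lines, %authors ) = tally_authors( \%runs_for );`, it yields the two values the credit roll needs. The percentages are left unrounded, and the printf in the report truncates them to integers.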