An obsession with data (Git version)

While thinking about doing a post about my (somewhat depressing) writing career, I got curious about how much I write. This came up at work, so I did a quick word count (wc -w) on my repositories.

$ find . -name '*.txt' | xargs wc -w
4288655 total
$
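One caveat with that pipeline: xargs splits filenames on whitespace, and with enough files it runs wc more than once, so there can be several "total" lines to add up. Here is a minimal sketch that avoids both problems, assuming a POSIX find and awk. The function name total_words is made up for illustration.

```shell
# total_words DIR: print one grand total of words across every .txt file
# under DIR (default: the current directory). The name is hypothetical.
total_words() {
    # "-exec wc -w {} \;" runs wc once per file, so filenames with spaces
    # are safe and wc never prints a "total" line of its own.
    # awk then adds up the first column (the per-file word counts).
    find "${1:-.}" -type f -name '*.txt' -exec wc -w {} \; |
        awk '{ sum += $1 } END { print sum + 0 }'
}
```

Calling total_words with a directory prints a single number. It is slower than the xargs version because it starts one wc per file, which seemed like a fair trade for a one-off count.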

There is no way I've written 4.2 million words. So I got curious about how many words are actually in there, and also when I wrote them. Now, as much as I'm fond of data, I haven't actually tracked when I finished a story.

Since I'm obsessed with source control systems, I realized I could get a rough estimate of when I posted something. In general, I don't take more than 1-2 weeks to finish anything I've started, so I can say that I finished a story about the time I started it (i.e., when I first checked it into source control).

Currently, I use Git to track my files. I previously used Subversion, so I'll get to that in the next few days.

Two things. One, I am terrible at check-in comments when I write. I use this to make checkpoints on my story and I don't really go back. So, usually I just have a one-line comment like "Worked on chapter 3." Two, I'm using "-1" to limit this to one entry for illustration purposes.

The above output is really verbose. Ideally, we don't care about who did it (it is only me) or the commit message. Fortunately, Git can control its output with the "--pretty" option. In this case, we are using "%ai", which gives us the author date as a pretty little ISO-style timestamp.

$ git log -1 --pretty=%ai
2012-04-22 20:53:42 -0500
$

Pretty good, except that it doesn't show the files that changed. Well, we can fix that with the "--name-only" option.

$ git log -1 --pretty=%ai --name-only
2012-04-22 20:53:42 -0500

high/friend-guard.txt
nr-guard.txt
$

There we go. We have the date we made a change and which files we changed (or added or deleted). Of course, it's not really in a useful form. I'm fond of Perl for one-off programs, so I banged this up:

#!/usr/bin/perl
# git log --name-only --pretty=%ai
#
# Setup
#
# Directives
use strict;
use warnings;
#
# Directory Parsing
#
# Go through all the directories in the command-line arguments.
while (@ARGV)
{
    # If it isn't a directory, we don't care.
    my $dir = shift @ARGV;
    $dir =~ s@/$@@sg;
    if (! -d $dir)
    {
        print STDERR "Ignoring $dir (not a directory)\n";
        next;
    }
    # If it doesn't have a .git directory inside, we don't care either.
    my $git_dir = "$dir/.git";
    if (! -d $git_dir)
    {
        print STDERR "Ignoring $dir (no .git inside)\n";
        next;
    }
    # We're processing this directory.
    print STDERR "Processing $dir\n";
    # Set the GIT_DIR and GIT_WORK_TREE so we don't have to move into
    # that directory.
    $ENV{GIT_WORK_TREE} = $dir;
    $ENV{GIT_DIR} = $git_dir;
    # We want to build up a log of the entire repository. We want to
    # know each of the files and date that they were checked in.
    my $last_timestamp;
    my %files = ();
    # Read the output of git log through a pipe (list form, no shell).
    open my $git, '-|', 'git', 'log', '--name-only', '--pretty=%ai'
        or die "Cannot open Git for $dir ($!)";
    while (<$git>)
    {
        # Clean up the line and ignore blanks.
        chomp;
        next if /^\s*$/;
        # In general, we will have two types of lines. One is a
        # timestamp and the other is the name of a file.
        if (/^(\d+)-(\d+)-(\d+) (\d+):(\d+):(\d+)/)
        {
            # We only care about the date of the check-in.
            $last_timestamp = "$1-$2-$3";
        }
        else
        {
            # For everything else, we get a filename. There are a few
            # things that we frequently ignore, such as hidden files
            # (start with a period) and anything in a hidden directory.
            next if /^\./;
            next if m@/\.@;
            # We add the file and the current timestamp to the
            # hash. Since we replace with each one, and git log goes
            # backwards in time, the last time we see the file is the
            # point it was first added to the repository.
            $files{$_} = $last_timestamp;
        }
    }
    close $git;
    # Now that we are done parsing, we output the merged results.
    open my $report, '>', "$dir.files"
        or die "Cannot write $dir.files ($!)";
    foreach my $file (sort(keys(%files)))
    {
        # Pull out the date.
        my $date = $files{$file};
        # Keep track if this file exists.
        my $exists = 0;
        $exists = 1 if -f "$dir/$file";
        #print "$file\t$date\n";
        #print "$date\t$exists\t$file\n";
        print $report "$file\t$exists\t$date\n";
    }
    close $report;
}

And the output of this program is put into a ".files" file named after the directory.
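Since that report is just tab-separated lines (filename, exists flag, date, in the order the script prints them), it is easy to slice further. As one hedged sketch, here is a tally of how many files first show up in each year. The function name files_per_year is made up, and the count includes every renamed copy of a story, so treat the numbers as an upper bound.

```shell
# files_per_year REPORT
# REPORT is a ".files" report with "file<TAB>exists<TAB>date" lines,
# where the date looks like YYYY-MM-DD. The function name is hypothetical.
# Prints "year<TAB>count", oldest year first.
files_per_year() {
    awk -F '\t' '
        # Only well-formed three-column lines are counted.
        NF == 3 {
            year = substr($3, 1, 4)   # first four characters of the date
            count[year]++
        }
        END { for (year in count) print year "\t" count[year] }
    ' "$1" | sort
}
```

The sort at the end is there because awk walks its arrays in no particular order.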

Now, this catches all the various versions of a file as I (constantly) renamed files, changed formats, and basically mucked around. I'm not afraid of shifting things around, and the output reflects that. I put in the exists column (0 or 1) so I know which files are actually there versus the (constant) renames.

For example, I would combine "best-enemies.txt" and "best-of-enemies.odt" and take the earliest date. A bit of manual work isn't too bad for this project, but there you go: the first time a file shows up in a Git repository.
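That combining step can be sketched roughly as follows. It assumes a hand-written alias file (hypothetical; nothing in the script produces it) with one "filename<TAB>story name" pair per line, plus a ".files" report in the filename, exists, date layout the script prints. The function name earliest_dates is likewise made up.

```shell
# earliest_dates ALIASES REPORT
# ALIASES: hand-written "filename<TAB>story name" lines (hypothetical file).
# REPORT:  a ".files" report with "file<TAB>exists<TAB>date" lines.
# Prints "date<TAB>story name", oldest first. Files without an alias
# are kept under their own name.
earliest_dates() {
    awk -F '\t' '
        # Lines from the first file: remember which story a name belongs to.
        FILENAME == ARGV[1] { story[$1] = $2; next }
        # Lines from the report: fold the name into its story and keep the
        # smallest date. ISO dates sort correctly as plain strings.
        {
            name = ($1 in story) ? story[$1] : $1
            date = $3 ""              # force a string comparison
            if (!(name in first) || date < first[name]) first[name] = date
        }
        END { for (name in first) print first[name] "\t" name }
    ' "$1" "$2" | sort
}
```

The alias file is where the manual work goes: one line for each old name that should count as the same story.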

2012-04-24
