Sample solutions and discussion
Perl Quiz of The Week #20 (20040721)
I run mailing lists. People subscribe, people unsubscribe, and
people get unsubscribed automatically when their addresses
generate too many bounces.
I run these mailing lists using SmartList.
I'd like to find out how my lists are being used - do people
unsubscribe in a bunch when a flame war happens, or do they
just drift in and out over time? What does the
total-membership graph look like?
You are to write a function, parse_smartlist_log. It takes
three parameters:
(1) the name of a SmartList log file.
(2) the total current membership of the list.
(3) the base name of the output file.
It should parse a SmartList log file and generate a graph of
total list membership against time.
Note that not all subscriptions and unsubscriptions will be in
the log; it's possible that the listmaster has added or
removed addresses without using the administrative interface,
especially when the list was first set up. This is the reason
for the second parameter. Take whatever action seems
appropriate.
(The graph can be a bitmap, ASCII, or whatever else - just
give it a sensible filename based on the third parameter.)
A log file includes lines such as:
subscribe: foo@bar.com by: foo@bar.com Thu Mar 21 15:30:35 GMT 2002
unsubscribe: 9 foo@bar.com 32760 foo@bar.com by: foo@bar.com Sat Mar 23 16:27:35 GMT 2002
procbounce: Removed: foo@bar.com 32718
SmartList has fuzzy matching on unsubscription requests - if
the addresses in the line differ, use the first one.
There are many other lines that may appear in the log file.
Sometimes, as seen above for procbounce, there may be no date
on the log line.
Some sample log files may be obtained from
http://firedrake.org/roger/sample_logs.zip
or from
http://perl.plover.com/qotw/misc/r020/sample_logs.zip
http://perl.plover.com/qotw/misc/r020/sample_logs.tgz
----------------------------------------------------------------
Only two solutions were submitted on the discuss list.
The only external solution which solved the problem came from Jesper
Dalberg. It uses Text::Graph, a CPAN module of which I was not
previously aware (thanks!), and Date::Manip. Date::Manip is a relatively
inefficient way to parse dates; as it happens, all dates I have
observed in SmartList log files are in a format which Date::Parse can
handle.
This solution takes the sensible approach of latching a date value when
it is spotted and using it for subsequent undated lines. However, it
does not use dates found on lines which do not also contain a mailing
list transaction.
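The latching idea can be sketched in a few lines. This is only an
illustration, not Jesper's code: the helper name latch_dates is made up,
the date is kept as a plain string rather than parsed, and the time-zone
word is assumed to be optional. Because the date pattern is tried on
every line, independently of any transaction pattern, a date on a line
with no transaction would also be picked up.

```perl
use strict;
use warnings;

# Hypothetical helper: pair each line with the most recent date seen so
# far (undef until the first dated line turns up).
sub latch_dates {
    my @lines = @_;
    my $date;    # latched across undated lines
    my @out;
    for my $line (@lines) {
        # Same shape as the sample lines: "Thu Mar 21 15:30:35 GMT 2002".
        # Treating the time-zone word as optional is an assumption.
        if ($line =~ /([A-Z][a-z]{2}\s+[A-Z][a-z]{2}\s+\d+\s+\d+:\d+:\d+\s+(?:[A-Z]+\s+)?\d{4})/) {
            $date = $1;
        }
        push @out, [$date, $line];
    }
    return @out;
}

# The two sample lines from the quiz: the second one carries no date.
my @stamped = latch_dates(
    'subscribe: foo@bar.com by: foo@bar.com Thu Mar 21 15:30:35 GMT 2002',
    'procbounce: Removed: foo@bar.com 32718',
);
print "$_->[0]  $_->[1]\n" for @stamped;
```

Here the undated procbounce line inherits the date of the subscribe
line before it.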
The totalling logic seems broken; the default value for list membership
on any date is the final membership value, rather than being in any way
affected by previous values. (Was there perhaps a missing reassignment
to $cnt?)
MJD submitted a solution which, while appealing (I am a great fan of
PostScript and would love to see a Perl-PostScript Quiz of the Week),
does not actually solve the problem. He is correct in that the
SmartList log format is not particularly well-designed, and indeed
that was part of the reason why I chose it for this quiz; it is the
output of a variety of separate programs, including procmail, rather
than coming from an integrated system. In any case, working from the
provided PostScript output it appears that axes are unlabelled and
unscaled.
My own solution is designed for clarity. It parses every line in
search of a date (fed to Date::Parse), and looks for specific patterns
for subscription/unsubscription information. (It also looks for
something vaguely resembling an email address in the line; as David
Jones pointed out, not every line matching /^unsubscribe:/ will be an
unsubscription.) After parsing, the data are rebased to give the
correct final value. The code then uses George A. Fitch's
GD::Graph::xylines module to provide a graph with labelled, scaled
axes.
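As a small worked example of the rebasing step, here is a sketch that
is separate from the full program below. The helper name rebase and the
numbers are made up for illustration; the clamp at zero mirrors what
the program does.

```perl
use strict;
use warnings;

# Hypothetical helper: shift a series of running totals so that the last
# value equals the known current membership, never going below zero.
sub rebase {
    my ($final, @totals) = @_;
    return () unless @totals;
    my $offset = $final - $totals[-1];
    return map { my $v = $_ + $offset; $v < 0 ? 0 : $v } @totals;
}

# Logged changes of +1, +1, -1, +1 give running totals 1, 2, 1, 2.
# If the list really has 10 members now, 8 of them never appeared in
# the log, so every point is shifted up by 8.
my @rebased = rebase(10, 1, 2, 1, 2);
print "@rebased\n";    # 9 10 9 10
```

A negative offset can arise when members were removed by hand; the
clamp keeps the early part of the graph from dipping below zero.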
Possible sophistications would be:
* choose a strftime format based on the sample's date span (e.g.
"%H:%M" if the whole logfile only covers a day, "%b %Y" if it spans
several years).
* if a subscriber is unsubscribed twice without an intervening
resubscription, discount the earlier unsubscription (as he was clearly
re-added without showing up in the log).
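Neither sophistication is implemented in the program below. As a rough
sketch of how they might look: both helper names, the span thresholds
and the [action, address] event layout are assumptions made for this
example only.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Hypothetical helper: choose an axis-label format from the time span
# (in epoch seconds) of the data. The thresholds are arbitrary.
sub choose_time_format {
    my ($first, $last) = @_;
    my $span = $last - $first;
    my $day  = 24 * 60 * 60;
    return '%H:%M'    if $span <= $day;
    return '%d %b %Y' if $span <= 2 * 365 * $day;
    return '%b %Y';
}

# Hypothetical helper: given [action, address] events in log order, drop
# an unsubscription when the same address is unsubscribed again with no
# subscription in between.
sub discount_repeat_unsubscribes {
    my @events = @_;
    my %pending;                 # address => index of its open unsubscribe
    my @keep = (1) x @events;    # one flag per event
    for my $i (0 .. $#events) {
        my ($action, $addr) = @{ $events[$i] };
        if ($action eq 'unsubscribe') {
            $keep[ $pending{$addr} ] = 0 if exists $pending{$addr};
            $pending{$addr} = $i;
        } elsif ($action eq 'subscribe') {
            delete $pending{$addr};
        }
    }
    return @events[ grep { $keep[$_] } 0 .. $#events ];
}

my $fmt = choose_time_format(0, 3600);    # a log covering one hour
print strftime($fmt, gmtime(0)), "\n";

my @kept = discount_repeat_unsubscribes(
    [ subscribe   => 'foo@bar.com' ],
    [ unsubscribe => 'foo@bar.com' ],
    [ unsubscribe => 'foo@bar.com' ],    # no resubscription logged between
);
print scalar(@kept), " events kept\n";
```

The chosen format could be passed to strftime in the x_number_format
callback, and the filtered events could replace the raw +1/-1 stream
before totalling.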
#! /usr/bin/perl -w
use strict;
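sub parse_smartlist_log {
    use Date::Parse;
    use GD::Graph::xylines;
    use POSIX qw(strftime);
    my ($logfile,$final,$outputfile)=@_;
    my $total=0;    # running total of logged (un)subscriptions
    my (@x,@y);     # graph points: epoch time, membership
    my $date=0;     # most recent date seen; reused for undated lines
    open IN,"<$logfile" or die "Cannot open $logfile: $!";
    while (<IN>) {
        chomp;
        my $n=0;    # membership change on this line: +1, -1 or 0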
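        # Look for a date on every line. The time-zone word ("GMT" in the
        # sample lines above) is allowed between the time and the year.
        if (/([A-Z][a-z][a-z]\s+[A-Z][a-z][a-z]\s+\d+\s+\d+:\d+:\d+\s+(?:[A-Z]+\s+)?\d+)/) {
            $date=str2time($1) || $date;
        }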
        # Only count lines that also contain something like an address.
        if (/^subscribe: (\S+\@\S+)/) {
            $n=1;
        } elsif (/^(unsubscribe:\s+\d+|procbounce: Removed:)\s+(\S+\@\S+)/) {
            $n=-1;
        }
        # Changes seen before any date has been found are not plotted.
        if ($n && $date) {
            $total+=$n;
            push @x,$date;
            push @y,$total;
        }
    }
    close IN;
    # Rebase so that the last point equals the known current membership.
    my $offset=$final-$total;
    if ($offset) {
        foreach my $n (0..$#y) {
            $y[$n]+=$offset;
            if ($y[$n]<0) {
                $y[$n]=0;
            }
        }
    }
    my $graph=GD::Graph::xylines->new;
    $graph->set(
        x_label => 'date',
        y_label => 'subscribers',
        title => $logfile,
        x_number_format => sub{strftime('%d %b %Y',localtime(shift))},
        y_min_value => 0,
        transparent => 0
    );
    my $img=$graph->plot([\@x,\@y]);
    open OUT,">$outputfile.png" or die "Cannot write $outputfile.png: $!";
    binmode OUT;
    print OUT $img->png;
    close OUT;
}
__END__
[ Thanks to Roger Burton West for running the QOTW this week. The
solution was delayed because I was away at OSCON. I will send the
new quiz tomorrow. -MJD ]