Hack and / - Chopping Logs

Why wait for your awstats job to finish when you need custom log results now? Check out a quick-and-dirty Perl one-liner that creates speedy tallies from log files and is easy to tweak to suit your particular statistics needs.

If you are a sysadmin, logs can be both a bane and a boon to your
existence. On a bad day, a misbehaved program could dump gigabytes of
errors into its log file, fill up the disk and light up your pager like
a Christmas tree. On a good day, logs show you every clue you need to
track down any of a hundred strange system problems. Now, if you manage
any Web servers, logs provide even more valuable information in terms
of statistics. How many visitors did you get to your main index page
today? What spider is hammering your site right now?

Many
excellent log-analysis tools exist. Some provide really
nifty real-time visualizations of Web traffic, and others run every
night and generate manager-friendly reports for you to browse. All of
these programs are great, and I suggest you use them, but sometimes you
need specific statistics and you need them now. For these on-the-fly
statistics, I've developed a common template for a shell one-liner that
chops through logs like Paul Bunyan.

What I've found is that although the specific type of information I need
might change a little, for the most part, the algorithm remains mostly
the same. For any log file, each line contains some bit of unique
information I need. Then, I need to run through the log file, identify
that information and keep a running tally that increments each time I
see the particular pattern. Finally, I need to output that information
along with its final tally and sort based on the tally.

There are many ways you can do this type of log parsing.
Old-school command-line junkies might prefer a nice sed and awk
approach. The whipper-snappers out there might pick a nicely formatted
Python script. There's nothing at all wrong with those approaches, but I
suppose I fall into the middle-child scripting category—I prefer Perl
for this kind of text hacking. Maybe it's the power of Perl regular
expressions, or maybe it's how easy it is to use Perl hashes, or maybe
it's just what I'm most comfortable with, but I just seem to be able to
hack out this kind of script much faster in Perl.

Before I give a sample script though, here's a more specific algorithm. The script
parses through each line of input and uses a regular expression to match
a particular column or other pattern of data on the line. It then uses
that pattern as a key in a hash table and increments the value of that
key. When it's done accepting input, the script iterates through each
key in the hash and outputs the tally for that key and the key itself.

For the test case, I use a general-purpose problem you can try
yourself, as long as you have an Apache Web server. I want to find out
how many unique IP addresses visited one of my sites on November 1,
2008, and the top ten IPs in terms of hits.

Here's a sample entry from the log (the IP has been changed to protect
the innocent):

When you run this command, you should see output something like the
following only with more lines and IPs that aren't fake:

33 99.99.99.99
94 111.111.111.111
138 15.15.15.15

For those of you who know and love both Perl and regular expressions,
that one-liner probably isn't too difficult to parse, but for the rest
of you, let's go step by step. Sometimes it's easier to go through a
one-liner if you see it in a formatted way, so here's the Perl part of
the one-liner translated as though it were in a regular file:

First, let's discuss the while loop. Basically,
while(<>) iterates over
every line of input it receives either through a pipe or as a file
argument on the command line. Inside this loop, I set up a regular
expression to match and pull out an IP address. The regular expression
is probably worth looking at in more detail:

(^\d+\.\d+\.\d+\.\d+)

This section of the regular expression matches the beginning of the line
(^), then any amount of numbers (\d+), and then a dot, another series of
numbers, another dot, another series of numbers, another dot and finally
a fourth series of numbers. This pattern will match, for instance,
123.123.12.34 at the beginning of a line. I surrounded this part of
the regular expression in parentheses. Because this is the first set of
parentheses, when Perl matches it, it puts the resultant match into
the $1 variable so I can pull it out later.

Now, those of you who know regular expressions know that I cheated
here. This regular expression is not very explicit at all. For one, it
would match completely invalid IP addresses, such as 999.999.999.999. For
another, it even would match any series of four numbers with dots in
between, such as 12345.6.7.8910. I chose an overly generic regular
expression on purpose to make a point. There are explicit regular
expressions that match only valid IP addresses, but those expressions are
very long, very complex and, in this case, completely
unnecessary.

Because
I'm dealing with Apache logs, I am pretty confident that the first
set of numbers at the beginning of the file is an IP address and not
something else, and second, the IP address that Apache logged should
be reasonably valid. In taking the shortcut, I not only saved on typing,
but the resulting regular expression also is easier to read and understand
even if you aren't a regex wizard.

After I match the IP, I want to match only log entries from November
01, 2008:

.*?01/Nov/2008

This section performs matches on any number of characters (.*), and with the
question mark at the end, it matches only as much as it needs to and no
more. Then, it matches the datestamp for November 01, 2008. If I wanted
a tally of every day in the log file, I could omit this entire section of
the regular expression. Alternatively, if I wanted to match on some other
keyword (for instance, when the user performed a GET on a particular file),
I could replace or augment the above section with that keyword.

Once I have matched the IP address in a line and have assigned it
to $1, I then use it as a key in a hash I call %v here and increment
it ($h{$1}++). The power of a hash is that it forces each key to be
unique. That means each time I come across a new IP, it will
get its own key in the hash and have its value incremented. So, if it's
the first time I see the IP, its value will be one. The second time I
see the IP, it will increment it to two and so on.

Once I'm done iterating through each line in the file, I then drop to
a foreach loop:

foreach( keys %v ){
print "$v{$_}\t$_\n";
}

Basically, all this does is increment through every key in the hash and
output its value (the number of times I matched that IP in the file) and
the IP itself. Note that I didn't sort the values here. I very well
could have—Perl has powerful methods to sort output—but to make the code
simpler and more flexible, I opted to pipe the output to the command-line
sort command. That way, even if you don't know Perl too well but
know the command line, you could tweak arguments in sort to reverse the
output or even pipe it further to tail, so you could see only the top ten
IPs.

If I want to know only the overall number of unique visitors, as
each line represents a unique visitor, I just need to count the overall
number of lines. To do this, I simply need to pipe the output to
wc -l.

And, there you have it, a quick-and-dirty one-liner to chop through
your logs and tally results. The beauty of using Perl hashes for this
is that you can tweak the regular expression to match all sorts of
values in the file—not just IP addresses—and tally all sorts of useful
information. I've used modified versions of the script to count how many
times a particular file was downloaded by unique IPs, and I've even used it
to perform statistics on mailq output.

Kyle Rankin is a Senior Systems Administrator in the San Francisco Bay
Area
and the author of a number of books, including Knoppix
Hacks and Ubuntu
Hacks for O'Reilly Media. He is currently the president of
the
North Bay
Linux Users' Group.

Kyle Rankin is a director of engineering operations in the San Francisco Bay Area, the author of a number of books including DevOps Troubleshooting and The Official Ubuntu Server Book, and is a columnist for Linux Journal.