Breaking Up Apache Log Files for Analysis

I know, in my last article I promised I'd jump back into the mail merge
program I started building a while back. Since I'm having some hiccups
with my AskDaveTaylor.com web server, however, I'm going to claim
editorial privilege and bump that yet again.

What I need to do is be able to process Apache log files and isolate
specific problems and glitches that are being encountered—a perfect use
for a shell script. In fact, I have a script of this nature that offers
basic analytics in my book Wicked Cool Shell Scripts from
No Starch Press, but what I need here is a bit more specific.

Oh, Those Ugly Log Files

To start, let's take a glance at a few lines out of the latest
log file for the site:
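(The entries below are invented stand-ins in the same format; the real
lines are longer and messier, but the shape is identical.)

```
93.158.4.161 - - [23/Feb/2014:13:17:01 -0600] "GET /index.html HTTP/1.1" 200 10253 "-" "Mozilla/5.0 (compatible; ExampleBot/1.0)"
201.4.138.77 - - [23/Feb/2014:13:17:09 -0600] "GET /no-such-page HTTP/1.0" 404 7218 "http://example.com/" "Mozilla/5.0"
```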

Fortunately, the Apache website
has a somewhat clearer
explanation of what's known as the custom log file format that's in
use on my server. Of course, it's described in a way that only a
programmer could love:
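Apache's standard "combined" format, which appears to match what my
server produces, is defined with this LogFormat directive: %h is the
remote host, %l and %u the ident and auth user, %t the request
timestamp, \"%r\" the quoted request line, %>s the final status code,
%b the bytes sent, followed by the quoted Referrer and User Agent
headers:

```
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\"" combined
```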

This becomes complicated to parse because there are two different types of
field separator: a space between each of the major fields, plus double
quotes delimiting the Request, Referrer and User Agent fields, since
those values can themselves contain spaces.

As a general rule, shell utilities aren't so good at this sort of
mixed field separator, so it's time for a bit of out-of-the-box thinking!

Breaking Down Fields with Dissimilar Delimiters

It's true that the fields are divided up with dissimilar delimiters
(say that ten times fast), but you can process the information in stages.
You
can examine the line by just processing the quote delimiter with this
clumsy code block:
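Here's a minimal sketch of that staged parse (a reconstruction, not
necessarily the column's exact listing; the two access_log entries are
invented samples so the snippet runs standalone):

```shell
#!/bin/sh
# Invented sample entries so this sketch runs standalone:
cat > access_log << 'EOF'
93.158.4.161 - - [23/Feb/2014:13:17:01 -0600] "GET /index.html HTTP/1.1" 200 10253 "-" "ExampleBot/1.0"
201.4.138.77 - - [23/Feb/2014:13:17:09 -0600] "GET /no-such-page HTTP/1.0" 404 7218 "-" "Mozilla/5.0"
EOF

while read -r line ; do
  # cut -d\" makes the double quote the field delimiter, so field 2
  # is everything between the first pair of quotes: the request.
  request="$(echo "$line" | cut -d\" -f2)"
  # cut -d\  (backslash-space) makes the space the delimiter; counting
  # space-separated fields, the return code lands in field 9.
  returncode="$(echo "$line" | cut -d\  -f9)"
  echo "request: $request  -->  return code: $returncode"
done < access_log
```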

Using a space as a delimiter makes for a weird-looking command line, as you
can see, but the backslash forces the very next character to be
interpreted literally as the delimiter itself: first a double quote,
then a space character.

Extracting Just the Errors

Now, can you spin through the entire log file and just pull out error codes?
Sure you can, with just a simplification and tweak of the while loop:
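Sketched out, the simplified loop keeps only the return code, and a
tally bolted onto the end summarizes the non-200 results (again, a
reconstruction with invented sample entries):

```shell
#!/bin/sh
# Invented sample entries so this sketch runs standalone:
cat > access_log << 'EOF'
10.0.0.1 - - [23/Feb/2014:13:17:01 -0600] "GET /a HTTP/1.1" 200 1024 "-" "Mozilla/5.0"
10.0.0.2 - - [23/Feb/2014:13:18:12 -0600] "GET /old-page HTTP/1.1" 301 0 "-" "Mozilla/5.0"
10.0.0.3 - - [23/Feb/2014:13:19:40 -0600] "GET /missing HTTP/1.1" 404 512 "-" "ExampleBot/1.0"
10.0.0.4 - - [23/Feb/2014:13:20:05 -0600] "GET /missing HTTP/1.1" 404 512 "-" "Mozilla/5.0"
10.0.0.5 - - [23/Feb/2014:13:21:33 -0600] "POST /form HTTP/1.1" 405 256 "-" "Mozilla/5.0"
EOF

# Strip each entry down to its return code, drop the 200s,
# then count how often each remaining code appears:
while read -r line ; do
  echo "$line" | cut -d\  -f9
done < access_log | grep -v '^200$' | sort | uniq -c | sort -rn
```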

Error 405 is (according to the W3C Web standards
site)
"Method Not Allowed", while 301 is "Moved Permanently", and
404 is the standard "Not Found" error returned when someone requests a
resource that the server cannot find.

Useful, but let's take the next step. For every query where the return
code is not a 200 "OK" status, let's show the original log file
line in question. This time, let's modify the script to do the 200
filtering too:
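A sketch of that version (once more a reconstruction, with invented
sample entries so it runs standalone):

```shell
#!/bin/sh
# Invented sample entries so this sketch runs standalone:
cat > access_log << 'EOF'
10.0.0.1 - - [23/Feb/2014:13:17:01 -0600] "GET /a HTTP/1.1" 200 1024 "-" "Mozilla/5.0"
10.0.0.2 - - [23/Feb/2014:13:18:12 -0600] "GET /missing HTTP/1.1" 404 512 "-" "ExampleBot/1.0"
EOF

# For every hit that isn't a 200, print the return code
# followed by the full original log file entry:
while read -r line ; do
  returncode="$(echo "$line" | cut -d\  -f9)"
  if [ "$returncode" != "200" ] ; then
    echo "$returncode: $line"
  fi
done < access_log
```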

It's useful to see the return error code alongside the full log file
entry. Is there a pattern? Do they all have the same
user agent (for example, a bot)? Are they from the same IP address? A pattern
based on time of day?

With a judicious use of wc, I also can ascertain that this particular log
file encompasses 99,309 total hits, of which 4,393 (4.4%) are non-200
results.
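That counting looks something like this (sketched here against a tiny
invented log, assuming one hit per line; point it at the real
access_log to get the figures above):

```shell
#!/bin/sh
# Invented five-line sample so this sketch runs standalone:
cat > access_log << 'EOF'
10.0.0.1 - - [23/Feb/2014:13:17:01 -0600] "GET /a HTTP/1.1" 200 1024 "-" "M"
10.0.0.2 - - [23/Feb/2014:13:18:12 -0600] "GET /b HTTP/1.1" 404 512 "-" "M"
10.0.0.3 - - [23/Feb/2014:13:19:40 -0600] "GET /c HTTP/1.1" 200 99 "-" "M"
10.0.0.4 - - [23/Feb/2014:13:20:05 -0600] "GET /d HTTP/1.1" 200 99 "-" "M"
10.0.0.5 - - [23/Feb/2014:13:21:33 -0600] "GET /e HTTP/1.1" 301 0 "-" "M"
EOF

hits="$(wc -l < access_log)"                          # one hit per line
errors="$(cut -d\  -f9 < access_log | grep -cv '^200$')"
awk -v e="$errors" -v t="$hits" \
    'BEGIN { printf "%d of %d hits (%.1f%%) are non-200\n", e, t, 100 * e / t }'
```

On the five-line sample, that reports 2 of 5 hits (40.0%) as non-200.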

Another useful feature for this script would be to create
multiple output files automatically, one per error code. I shall leave that, however, as
an exercise for you, dear reader!

And, for my next article, I'll jump back into that mail merge script!

Dave Taylor has been hacking shell scripts on UNIX and Linux systems for a
really long time. He's the author of Learning Unix for Mac OS
X and Wicked Cool Shell Scripts. You can find him on Twitter
as @DaveTaylor, and you can reach him through his tech Q&A site: Ask Dave Taylor.