Apache Guide: Logging, Part 4 -- Log-File Analysis Page 2

HTTP is a stateless, anonymous
protocol. This is by design and is not, at least
in my opinion, a shortcoming of the protocol. If
you want to know more about your visitors, you
have to be polite and actually ask them, and be
prepared to get unreliable answers. This is
amazingly frustrating for marketing types. They
want to know the average income, number of kids,
and hair color of their target demographic, or
something like that, and they don't like to be
told that this information is not available in
the log files. It is simply never recorded there,
so no amount of analysis will get it out of the
log files. Explain to them that HTTP is
anonymous.

And even what the log files do
tell you is occasionally suspect. For example, I
have numerous entries in my log files indicating
that a machine called
cache-mtc-am05.proxy.aol.com visited
my web site today. I can tell that this is a
machine that is on the AOL network. But because
of the way that AOL works, this might be one
person visiting my site many times, or it might
be many people visiting my site one time each.
AOL does something called proxying, and
you can see from the machine name that it is a
proxy server. A proxy server is one that one or
more people sit behind. Someone types an address
into their browser, and the browser sends that
request to the proxy server. The proxy server
fetches the page from my web site (generating the
log file entry there) and then passes that page
back to the requesting machine. This means that I
never see the request from the originating
machine, but only the request from the proxy.

Another
implication of this is that if, 10 minutes later,
someone else sitting behind that same proxy
requests the same page, they don't generate a log
file entry at all. They type in the address, and
that request goes to the proxy server. The proxy
sees the request and thinks "I already have that
document in memory. There's no point asking the
web site for it again." And so instead of asking
my web site for the page, it gives the copy that
it already has to the client. So, not only is the
address field suspect, but the number of requests
is also suspect.
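
To make the caching effect concrete, here is a minimal
sketch in Python. It is a toy simulation, not a
description of how AOL's proxies are really built: the
class names OriginServer and CachingProxy and the client
names "alice" and "bob" are invented for illustration,
and a real proxy would also expire cached pages.

```python
class OriginServer:
    """Stands in for the web site. Logs only the requests it actually sees."""

    def __init__(self):
        self.access_log = []  # list of (remote_host, path) tuples

    def get(self, remote_host, path):
        self.access_log.append((remote_host, path))
        return f"<html>content of {path}</html>"


class CachingProxy:
    """Hypothetical caching proxy that many clients sit behind."""

    def __init__(self, name, origin):
        self.name = name
        self.origin = origin
        self.cache = {}  # path -> page body already fetched

    def get(self, client, path):
        # The origin is contacted only on a cache miss, and it sees the
        # proxy's name, never the name of the client behind the proxy.
        if path not in self.cache:
            self.cache[path] = self.origin.get(self.name, path)
        return self.cache[path]


origin = OriginServer()
proxy = CachingProxy("cache-mtc-am05.proxy.aol.com", origin)

proxy.get("alice", "/index.html")  # cache miss: origin logs one entry
proxy.get("bob", "/index.html")    # cache hit: origin logs nothing
proxy.get("alice", "/about.html")  # cache miss: origin logs one entry

for entry in origin.access_log:
    print(entry)
```

Three page views by two people show up on the web site
as two log entries, both attributed to the proxy.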

It might sound like the
data that you receive is so suspect as to be
useless. This is in fact not the case. It should
just be taken with a grain of salt. The number of
hits that your site receives is almost certainly
not really the number of visitors that came to
your site. But it's a good indication. And it
still gives you some useful information. Just
don't rely on it for exact numbers.
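
One way to see the gap between hits and visitors is to
count both the total number of entries and the number
of distinct hosts. The following is a rough sketch in
Python that assumes the host is the first
whitespace-separated field on each line, as it is in
the Common Log Format. The sample entries are invented
for illustration, and the function name hits_per_host
is my own.

```python
from collections import Counter

# Invented sample entries in Common Log Format; not taken from a real log.
SAMPLE_LOG = """\
cache-mtc-am05.proxy.aol.com - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326
cache-mtc-am05.proxy.aol.com - - [10/Oct/2000:13:57:02 -0700] "GET /index.html HTTP/1.0" 200 2326
cache-mtc-am05.proxy.aol.com - - [10/Oct/2000:14:01:44 -0700] "GET /about.html HTTP/1.0" 200 1178
host1.example.com - - [10/Oct/2000:14:03:10 -0700] "GET /index.html HTTP/1.0" 200 2326
"""


def hits_per_host(log_text):
    """Count log entries per remote host (the first field of each line)."""
    hosts = Counter()
    for line in log_text.splitlines():
        if line.strip():
            hosts[line.split()[0]] += 1
    return hosts


per_host = hits_per_host(SAMPLE_LOG)
total_hits = sum(per_host.values())
distinct_hosts = len(per_host)

print(f"{total_hits} hits from {distinct_hosts} distinct hosts")
for host, count in per_host.most_common():
    print(f"{count:5d}  {host}")
```

Neither number is the number of visitors. The three
entries from the proxy could be one person or three,
and other requests may never have reached the site at
all.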

So, to
the real meat of all of this. How do you actually
generate statistics from your Web-server
logs?

There are two main approaches that
you can take here. You can either do it yourself,
or you can get one of the existing applications
that are available to do it for you.
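
For a sense of what the do-it-yourself route involves,
here is a minimal sketch in Python. It assumes the logs
are in the Common Log Format (host, ident, user,
timestamp, request line, status, size). The regular
expression, the function names parse_line and
summarize, and the sample entries are all my own
illustration, not part of Apache or of any existing
analysis package.

```python
import re
from collections import Counter

# One Common Log Format entry: host ident user [time] "request" status size
CLF_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)$'
)


def parse_line(line):
    """Return the fields of one entry as a dict, or None if it does not match."""
    match = CLF_PATTERN.match(line.strip())
    return match.groupdict() if match else None


def summarize(lines):
    """Tally requests per path and per status code; count unparsable lines."""
    pages = Counter()
    statuses = Counter()
    skipped = 0
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            skipped += 1
            continue
        # The request line normally looks like: GET /path HTTP/1.0
        parts = entry["request"].split()
        path = parts[1] if len(parts) >= 2 else entry["request"]
        pages[path] += 1
        statuses[entry["status"]] += 1
    return pages, statuses, skipped


# Invented sample entries, including one deliberately malformed line.
sample = [
    'host1.example.com - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    'host2.example.com - - [10/Oct/2000:13:56:01 -0700] "GET /missing.html HTTP/1.0" 404 209',
    'host1.example.com - - [10/Oct/2000:13:58:12 -0700] "GET /index.html HTTP/1.0" 200 2326',
    'this line is not in Common Log Format',
]

pages, statuses, skipped = summarize(sample)
print("Requests per page:", pages.most_common())
print("Responses per status:", statuses.most_common())
print("Lines skipped:", skipped)
```

Since summarize accepts any iterable of lines, an open
log file can be passed to it directly. Even this small
sketch has to decide what to do with lines that do not
match, which hints at how much work a full-featured
analyzer saves you.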

Unless
you have custom log files that don't look
anything like the Common Log Format,
you should probably get one of the available apps
out there. There are some excellent commercial
products, and some really good free ones, so you
just need to decide what features you are looking
for.

So, without further ado, here are some
of the great apps out there that can help you
with this task.