Fri, 14 Apr 2006

I'm not very consistent about looking at the statistics on my web site.
Every now and then I think of it, and take a look at who's been
visiting, why, and with what, and it's always entertaining.

The first thing I do is take the apache log and run webalizer
on it, to give me a breakdown of some of the "top" lists.
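For anyone trying the same thing, the webalizer step is just one command. (A sketch; both paths here are hypothetical examples and depend on where your distribution keeps its logs.)

```shell
# Generate webalizer's HTML "top" reports (URLs, user agents, search
# strings, referrers) from an Apache access log.
# Both paths are hypothetical examples; adjust for your setup.
mkdir -p ~/webstats
webalizer -o ~/webstats /var/log/apache2/access.log
```

The index.html in the output directory then links to the monthly breakdowns.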

Of course, I'm extremely interested in the user agent list:
which browsers are being used most often? As of last month, the Shallowsky
list still has MSIE 6.0 in the lead ... but it's not as big a lead
as it used to be, at 56.04%. Mozilla 5.0 (which includes all Gecko-
based browsers, as far as I know, including Mozilla, Firefox, Netscape
6 and 7, Camino, etc.) is second with 20.31%. Next are four search
engine 'bots, and then we're into the single digit percentages with
a couple of old IE versions and Opera.

AvantGo (they're still around?) is number 11 with 0.37% -- interesting.
It looks like they're grabbing the Hitchhiker's Guide to the Moon;
then there are a bunch of lines referencing nineplanets.org,

and I'm not sure how to read those (nineplanets.org is The Nine Planets, Bill Arnett's
excellent and justifiably popular planetary site, and he and I have
cross-links, but I'm not sure what that has to do with avantgo and my
site). Not that it's a problem: of course, anyone is welcome to read
my site on a PDA, via AvantGo or otherwise. I'm just curious.

Amusingly, the last user agent in the top fifteen is GIMP Layers, syndicating this blog.

Another interesting list is the search queries: what search terms did
people use which led them to my site? Sometimes that's more
interesting than other times: around Christmas, people were searching
for "griffith park light show" and ending up at my lame collection of
photos from a previous year's light show. I felt so sorry for them:
Griffith Park never puts any information on the web, so it's impossible
to find out what hours and dates the light show will be open. I
know perfectly well why they were googling, and they certainly weren't
getting any help from me. I would have put the information there if
I'd known -- but I tried to find out and couldn't find it either.

But this month, no one is searching on anything unusual. The top
searches leading to my site for the past two months are terms like
birds, gimp plugins, linux powerpoint, mini laptops, debian chkconfig,
san andreas fault, pandora, hummingbird pictures, fiat x1/9,
jupiter's features, linux photo,
and a rather large assortment of dirt bike queries. (I have very
little dirt bike content on my site, but people must be desperate to
find web pages on dirt bikes because those always show up very
prominently in the search string list.)

Most popular pages are this blog (maybe just because of RSS readers),
the Hitchhiker's Guide to the Moon, and bird photos, with an
assortment of other pages covering software, linux tips, assorted
photo collections, and, of course, dirt bikes.

That's most of what I can get from webalizer. Now it's time to look at
the apache error logs. I have quite a few 404s (missing files).
I can clean up some of the obvious ones; others come from
external sites I can't do anything about, which for some reason link
to filenames I deleted seven years ago. But how can I get a list of
all the broken internal links on my site, so that at least I can fix
the errors that are my own fault?
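In the meantime, the access log itself can be mined to see which missing files actually get requested. A sketch, assuming the standard common/combined log format (status code in field 9, request path in field 7) and a log file named access.log:

```shell
# List the most-requested 404 paths from an Apache access log.
# "access.log" and the field positions are assumptions based on the
# common/combined log formats.
awk '$9 == 404 {print $7}' access.log | sort | uniq -c | sort -rn | head
```

That ranks the broken URLs by how often people hit them, which is a good guide to which ones are worth fixing first.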

Kathryn on Linuxchix pointed me to dead-links.com, a rather cool
site. But it turns out it only looks for broken external
links, not internal ones. That's useful, too, just not what I
was after this time.
Warning: if you try to save the page from firefox, it will
start running all over again. You have to copy the content and paste
it into a file if you want to save it.

But Kathryn and Val opined that wget was probably the way to go
for finding internal links. Turns out wget has an option to delete
each file after downloading it, so you can wget a whole site but not
actually need to use the local space to duplicate the site.
Use this command:
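(A sketch, since the exact flags vary by wget version; the URL and the log filename wget.log are placeholders.)

```shell
# Crawl the whole site, deleting each page as soon as it's been
# fetched, and save wget's progress messages -- including errors --
# to a file. The URL and "wget.log" are placeholder names.
wget --recursive -nv --delete-after http://example.com/ 2>&1 | tee wget.log
```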

Now open the resulting file in an editor and search repeatedly for
ERROR to find all the broken links. Unfortunately the errors are on
a separate line from the filenames they reference,
so you can't just use a grep. wget also
gets some things wrong: for instance, it tries to download the .class
file of a Java applet inside a .jar, then reports an error when the
class doesn't exist. (--reject .class might help that.)
Still, it's not hard to skip past these errors,
and wget does seem to be a fairly good way of finding broken internal links.
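One shortcut worth mentioning: a plain grep can't show which file each error belongs to, but grep's -B option prints lines of leading context, which here is the line naming the URL. (Assuming the wget output was saved to a file called wget.log:)

```shell
# Print each ERROR line plus the line before it, which in wget's
# non-verbose output names the URL it was trying to fetch.
# "wget.log" is a hypothetical filename for the saved wget output.
grep -B 1 ERROR wget.log
```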

There's one more check left to do in the access log.
But that's a longer story, and a posting for another day.