Searchspy Home Page

Searchspy is a simple Perl 5 script that
processes a standard web server log file (referrer information
required) and tells you what search queries people are using to find
your web pages from search engines.

Its report tells me, in a very simple way, that people found my movie
page when they were looking for "sexual anxiety" or information about
stereotypes in "Rumble in the Bronx". (Not a very good movie, by the
way. See "Snake in Eagle's Shadow" instead.)

Download and Docs

You may want to change the $searchURL variable in the script to
customize it for your particular web pages; by default it processes
all pages.

I wrote this because I was curious how well search engines worked
in practice. The results are quite amusing as well as informative. I'm
surprised at how on-target most search terms are. People find my movie
pages with queries that are about movies, not about random other
topics. I get few hits from random sex searches (maybe that says
something about my pages!). Full text indexing works better than you
might think.

I'm interested in any comments you may have, but this software is
really a one-off hack; I do not promise to maintain it in any
meaningful way. If someone is really excited by this program, I may be
interested in handing development over.

If you are looking for fancier logfile analysis,
analog now
also reports search engine queries in a similar way.

Jan 5 1999, revision 1.6. Added some entries to the engine map:
altavista.com is the main one.

How it works

Search engines usually encode queries in the standard CGI format.
That means that when a user clicks through to your page, the referrer
information with the search term is sent to your web server. If you've
got referrer logging turned on and the client isn't using a filter
like Intermute, you can find
out the search terms that led users to your pages.
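To make this concrete, here is a small sketch in Python (the script itself is Perl): it pulls the referrer URL out of a combined-format log line and shows that the search query rides along in standard CGI form. The log line and the AltaVista URL are made-up examples.

```python
from urllib.parse import urlsplit, parse_qs

log_line = ('1.2.3.4 - - [05/Jan/1999:10:00:00 -0800] "GET /movies/ HTTP/1.0" '
            '200 1234 "http://www.altavista.com/cgi-bin/query?q=rumble+in+the+bronx" '
            '"Mozilla/4.0"')

# In combined log format the referrer is the second-to-last quoted field.
referrer = log_line.split('"')[-4]
parts = urlsplit(referrer)
print(parts.query)                  # the raw CGI query string: q=rumble+in+the+bronx
print(parse_qs(parts.query)['q'])   # decoded: ['rumble in the bronx']
```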

This Perl script is a quick hack I wrote on my decrepit old Linux
laptop while on an airplane. The basic idea is that any referrer with
a ? in the URL is quite likely a search engine. So referrers
with a ? in them are parsed to pull out the particular query
term. Each engine uses a different field to mark the search
term: HotBot uses MT, AltaVista uses
q, etc. This field is extracted, processed into human-readable
form, and then reported.
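The whole pipeline can be sketched in Python as follows. The MT and q field names are from the script; matching engines by hostname substring is an assumption for illustration.

```python
from urllib.parse import urlsplit, parse_qs

# Which CGI field carries the query, per engine.
QUERY_FIELD = {
    'hotbot': 'MT',      # HotBot puts the query in MT
    'altavista': 'q',    # AltaVista uses q
}

def search_term(referrer):
    """Return the decoded query term, or None if this isn't a search hit."""
    parts = urlsplit(referrer)
    if not parts.query:          # no '?': probably not a search engine
        return None
    for engine, field in QUERY_FIELD.items():
        if engine in (parts.hostname or ''):
            terms = parse_qs(parts.query).get(field)
            return terms[0] if terms else None
    return None

print(search_term('http://www.hotbot.com/?MT=sexual+anxiety&SM=MC'))  # sexual anxiety
```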

Want to Hack the Script?

Feel free to play with the script - if you do something neat, email
me and let me know. Most of the code is straightforward. Lines are
read in and parsed (in an ugly, probably broken way). If the referring
URL has a ? in it, then it is unpacked into a hash named
%a and engine-specific parsing is done on it. Finally, the
query term is passed through a filter to turn +s into spaces
and the %xx hex encoding back into characters.
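That decoding filter looks roughly like this in Python (the original does it with a couple of Perl substitutions):

```python
import re

def decode(term):
    # '+' encodes a space in CGI form data.
    term = term.replace('+', ' ')
    # %xx is a two-digit hex escape for one character.
    return re.sub(r'%([0-9a-fA-F]{2})', lambda m: chr(int(m.group(1), 16)), term)

print(decode('rumble+in+the+bronx'))   # rumble in the bronx
print(decode('%22snake%22'))           # "snake"
```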

The engine-specific parsing is all encapsulated in the table
%engineMap. The basic idea there is if the search engine name
matches the left side of an entry then the result of the subroutine on
the right side is the actual query term. The function
applytable actually executes the table - this could be useful
code for other applications.
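The applytable idea, sketched in Python: a table mapping an engine-name pattern to a function that digs the query out of the parsed parameters. The entries here are illustrative, not the script's actual table.

```python
import re

# (pattern, handler) pairs; the handler of the first matching pattern wins.
engine_map = [
    (r'hotbot',    lambda params: params.get('MT')),
    (r'altavista', lambda params: params.get('q')),
]

def apply_table(table, name, params):
    """Run the handler of the first entry whose pattern matches name."""
    for pattern, handler in table:
        if re.search(pattern, name):
            return handler(params)
    return None

print(apply_table(engine_map, 'www.altavista.com', {'q': 'stereotypes'}))  # stereotypes
```

Because the dispatch logic lives entirely in apply_table, adding a new engine is just one more row in the table.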

Some suggested hacks:

Improve the logfile parsing.

Update and/or maintain the %engineMap table for new
search engines. Turn $debug on to see the errors.