Copyright Notice

This text is copyright by CMP Media, LLC, and is used with
their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in
WebTechniques magazine.
However, the version you are reading here is as the author
originally submitted the article for publication, not after their
editors applied their creativity.

Back in this column in April 1997, I provided a simple script that
searched the text of the programs I've written for this column over
the years. Recently, I've been hacking on my overall web site design,
and thought it would be cool to be able to search my entire site. The
program from that April column could do the trick, but only if I never
planned on getting anything else done with my web server box again,
because searching everything on each request would be expensive.

But then I thought to myself: hey, the big search engines have already
come to my site, fetched all the pages I want to have searched, and
indexed them for me. Furthermore, they have more spare CPU cycles than
I do, and it'd be nice to just take advantage of that.

And then I remembered that many of the search engines provide a way to
restrict the returned hits to a specific URL or site. I could use this
to my advantage to create a wrapper that asks the big search engine to
return hits only on my site!

The upside of this approach is that I leverage existing work, along
with someone else's disk and CPU. The downside is that the spiders
don't visit very often, so new material is likely to be missing from
such an index. But for mostly static or old pages, the tradeoff is
often worthwhile.

Of course, Perl can pass the proper values into the search engine's
form-response CGI programs, but the answer comes back as HTML. It would
be a mess to figure out which part of that HTML is a link to an actual
hit, and which part is simply a link to an ad or something.

Luckily, we don't have to figure that out, because the continually
maintained WWW::Search package in the CPAN lets us access the
output from these engines in a sane way, and all I have to do is
interface to that code. My first attempt resulted in the program in
[listing 1 below].

Line 1 enables warnings and taint checking. I like taint checking on
CGI scripts, because a CGI program is essentially acting on someone
else's behalf using the (hopefully limited) privileges of the web
server. Perl normally enables taint checking automatically on setuid
programs, but for a CGI program we have to ask for it explicitly
(here, with the -T switch).

Line 2 turns on the compiler restrictions, requiring me to declare my
variables, disabling the use of soft references, and preventing me
from accidentally using a string where I meant a subroutine
invocation. I use this on any program that is more than 10 lines long
that I use for more than 10 minutes (what I call my ``10 - 10'' rule).

Line 3 disables output buffering. I toyed for a while with making
this program an NPH program that first shoved a ``working...''
page to the browser (using server push), and then returned the real
page later. For that, unbuffering is essential. Here, it's just a
line I type frequently without thinking.

Line 5 pulls in Lincoln Stein's wonderful CGI.pm module, including
all the shortcuts for generating HTML and handling forms.
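
If you don't have the listing in front of you, the top of the program
looks roughly like this (a sketch reconstructed from the description
above; your copy may differ slightly):

  #!/usr/bin/perl -wT
  # line 1: -w enables warnings, -T enables taint checking

  use strict;                   # line 2: compiler restrictions

  $| = 1;                       # line 3: disable output buffering

  use CGI qw(:standard);        # line 5: form and HTML shortcuts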

Lines 7 through 12 define the configuration section, hopefully
containing all the things one would want to change to move this to a
different site. Line 9 gives the domain name that we'll ask the search
engines about, and line 10 defines the number of hits of interest. If
you leave the settings as they are in the listing, you'll be searching
live information about my site. The bigger the number of hits, the
longer the connection will be tied up, possibly resulting in a
timeout, so keep it reasonable.
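
In sketch form, that section might read (the $SITE name appears later
in this article; the name of the hit-count variable is my guess):

  my $SITE = 'stonehenge.com';  # line 9: domain to restrict hits to
  my $MAX_HITS = 15;            # line 10: guessed name; hits of interest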

Lines 14 through 22 define the search engines that meet my needs
(ones that can have some site-narrowing in the query string).
For each element of %ENGINES, the key gives the
WWW::Search search engine name, and the value is a coderef to
transform the search data into a query string. Note that AltaVista,
HotBot, and Infoseek are the easiest: an additional restriction
appended to the user's requested query is enough. NorthernLight was a
little odder, requiring some extra syntax to make it a full boolean
query. (I also noticed that WWW::Search hasn't stayed in sync with
NorthernLight's output, and it sends out an erroneous link. Hmm.)
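
A sketch of that hash, with the site-restriction syntax for each
engine recalled from memory (so treat the exact query syntax as an
approximation, not gospel):

  my %ENGINES = (
    AltaVista     => sub { "$_[0] +host:$SITE" },
    HotBot        => sub { "$_[0] domain:$SITE" },
    Infoseek      => sub { "$_[0] site:$SITE" },
    NorthernLight => sub { "($_[0]) AND url:$SITE" },
  );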

Lines 24 through 27 create the top of the CGI response, including
a nice CGI header (roughly the same as an HTTP header) and a title
of ``Search this site'' and a similar H1.
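Something like:

  print header, start_html('Search this site'), h1('Search this site');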

Lines 29 through 42 create the search form, regardless of whether
we're searching this time or not. Thanks to CGI.pm's sticky fields,
the default values in this form will be the same as the query being
acted upon, if any, allowing slightly modified queries or perhaps even
the same query from different engines (something I was doing
frequently while testing this program).

Lines 30 and 42 put horizontal rules around the form, one of my
conventions for visually delimiting a set of related input
features. Lines 31 and 41 generate the HTML for the start and end of
the form. I force the method to be GET rather than POST (the
default) so that the resulting query can be bookmarked. CGI.pm
doesn't care whether it's a GET or a POST, but bookmarking does.

Lines 32 to 40 generate a layout table to get everything to line up
nicely. The table has one row with four parts:

a submit button labeled ``Search stonehenge.com for'',

a text input field named search_for,

the word ``using'' (just to fill out the sentence properly), and

a radio-button group with a selection for each of the search engines.

The radio-button group is laid out vertically (using tables once again),
thanks to the -columns parameter to radio_group.
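
Putting those pieces together, the form section looks approximately
like this (the ordering of the radio buttons is my guess):

  print hr;
  print start_form(-method => 'GET');
  print table(Tr(
    td(submit(-value => "Search stonehenge.com for")),
    td(textfield(-name => 'search_for')),
    td("using"),
    td(radio_group(-name => 'engine',
                   -values => [sort keys %ENGINES],
                   -columns => 1)),
  ));
  print end_form;
  print hr;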

When the submit button is pressed, or when return is typed in the text
field (for most browsers), our script will be reinvoked with the
search_for and engine parameters. Line 44 detects this, and
invokes the actual search. By putting this code into a subroutine, I
can clearly see what gets done every time, and what gets done only
when parameters are present.

Lines 46 and 47 finish up the CGI output, ending the HTML and exiting
the program with a good status.
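
In sketch form (the subroutine name here is hypothetical):

  search_it() if defined param('search_for');

  print end_html;
  exit 0;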

Lines 51 to 82 handle the hard work of calling the search engine with
valid parameters, and displaying the search results. Hard only in the
sense that we have to get stuff validated and then interpret the
results from the nifty WWW::Search family of modules, but I'm
getting ahead of myself.

Lines 53 to 64 validate the form values. If anything goes awry, we
exit the subroutine immediately. Line 53 gets the search string,
simply fetching the value.

Lines 55 to 61 extract the search engine. If the engine is present,
we ensure that it's a Perl identifier, and extract that identifier.
This is needed because WWW::Search uses this engine string in a way
that trips up taint checking if left tainted. If the engine is absent,
we'll pretend they chose AltaVista. That lets me sprinkle the rest of
my key pages with something like:
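
  <a href="/cgi/search?search_for=apache">search for apache</a>

(That URL path and query term are placeholders, of course; point the
link at wherever you've installed the script.)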

Lines 63 and 64 validate the engine name one more time, ensuring that
it is a key in the %ENGINES hash. The value of that element is a
coderef, which we now invoke to turn the user's query into a query for
the selected engine as $engine_search_for.
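
Reconstructed, that validation section (lines 53 through 64) comes out
something like this; the exact untainting pattern is a guess:

  my $search_for = param('search_for');   # line 53

  my $engine = param('engine');
  if (defined $engine) {
    return unless $engine =~ /^(\w+)$/;   # untaint for WWW::Search
    $engine = $1;
  } else {
    $engine = 'AltaVista';                # default when absent
  }

  return unless exists $ENGINES{$engine};
  my $engine_search_for = $ENGINES{$engine}->($search_for);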

If we make it past all those treacherous return operations, it's
time for the actual engine interaction. The require in line 66
brings in the WWW::Search module. Note that this module is not
compiled if we never make it here, so we'll be saving compile time on
those invocations that are merely putting up the search form and not
fetching the results.

Line 68 creates the search object, passing the engine name to the
new method of WWW::Search. This also compiles the appropriate
code for that search engine.

Line 69 sets the number of items in which we're interested. The
default is a fairly large number -- not something I want to wait for
while it's being fetched.

Line 70 establishes the query. We pass the search string through an
escape_query method for reasons that are not entirely clear to me from
the documentation. But once that's done, it's handed to the
search engine interface, and we're off and running.
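
In code, those steps read roughly as:

  require WWW::Search;                      # line 66: delayed compilation

  my $search = WWW::Search->new($engine);   # line 68: create the object
  $search->maximum_to_retrieve($MAX_HITS);  # line 69: cap the hit count
  $search->native_query(WWW::Search::escape_query($engine_search_for));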

Lines 72 through 81 dump out a table of the results (again, using a
table for some layout control). For grins, I've centered the table
using an attribute in line 73.

Lines 74 through 77 label the table using a TH cell, using an
internal function of CGI.pm to escape the HTML in the search
string. This isn't exactly proper, but I doubt that the function will
change much in future releases of CGI.pm, and if it does, it's just
a five-line routine anyway.

Similarly, the map operation in lines 78 to 80 creates a table row for
every result. The results come from the return value of the call in
line 80. Each of them ends up in $_ in line 78. The url method is
called on the result to get the URL string, held for a moment in the
lexical variable $url. Line 79 generates an anchor link, with the link
text being the same as the place to which it sends the user. And
that's it!
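
A sketch of that results table (again, the exact markup is my
reconstruction):

  print table({ -align => 'center' },       # line 73: centered table
    Tr(th('Results for ' . CGI::escapeHTML($search_for))),
    map {
      my $url = $_->url;                    # line 78: fetch the URL string
      Tr(td(a({ -href => $url }, $url)));   # line 79: link text = URL
    } $search->results,                     # line 80: the hits themselves
  );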

So, you can drop this program into your CGI area, change the $SITE
parameter, and there you have it: instant searchability with very
little CPU power required.

I hope you find what you're looking for. And if Y2K doesn't turn us
all into characters from the Mad Max movies, I'll see you next month
right here. Enjoy.