What about those search engines? Authors Lawrence and Giles have stirred up a pot of competition by updating and expanding their earlier search engine research.1 Already, both FAST and Excite claim improved
technologies for data collection, while Lycos asserts SeeMore will find information related to words,
phrases or images "on any page on the Internet" (emphasis added).2 Are these useful improvements? What newer technologies exist
with potential to enhance our searching of the Web?

Among the study's findings:

- The publicly indexable Web contains approximately 800 million pages, which equates to about six terabytes of text once the HTML coding is stripped away (a quick check of the arithmetic follows this list).
- Of that phenomenal number of pages, the best-performing engine indexes only about 16%.
- Indexing new or modified pages takes months.
- The more popular a page, the more likely engines will index it.
- Only 34% of Web sites use meta tag coding.
- Little overlap of data exists between the engines.
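Dividing six terabytes by 800 million pages gives the average amount of text per page (my arithmetic, not a figure from the study):

```latex
\frac{6 \times 10^{12}\ \text{bytes}}{8 \times 10^{8}\ \text{pages}} = 7{,}500\ \text{bytes} \approx 7.5\ \text{KB of text per page}
```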

What do these discoveries mean for researchers? To begin with, they illustrate the inability of any
engine to index the entire Web. Note the study emphasizes "publicly
indexable" pages; that is, those unique URLs available to search engines.

According to Hobbes'
Internet Timeline, the Web hosts 4,389,131 servers as of March 1999, and each server houses
one or more Web pages. Think about it: the 800 million unique URLs to which the study
refers represent only a small piece of the Web.

What's missing? To start with, password-restricted free data like the Science
Magazine abstract linked in footnote number one or articles appearing in The New York Times or the Financial
Times. Absent too are password-restricted commercial data like articles appearing in
the Wall Street Journal or Consumer Reports, or public records stored on KnowX or cdb4web.
Many search engines also fail to index dynamically delivered data like congressional
documents housed at GPO Access, New
Jersey statutes, or the codes and rules of professional conduct available from
Cornell's American Legal Ethics
Library. Moreover, some engines cannot collect data contained within proprietary file formats
like portable document format (.pdf), Word, or WordPerfect, or within image maps and frames.

The authors suggest using meta search services like MetaCrawler to improve coverage of the Web. I
disagree. While sites like MetaCrawler
may offer more comprehensive coverage than a single search engine, researchers who
must search far and wide will do better to query each engine individually. To increase
their speed, meta search services often cut off queries before an engine finishes
searching its index. Many times, they provide inadequate query translations -- or none at
all. Consequently, results may misrepresent the data actually available.
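To make the translation problem concrete, here is a minimal Python sketch of what a meta searcher must do for each engine it queries. The engine syntaxes below are invented for illustration -- they are not the actual rules of MetaCrawler or any real service -- but they show how a skipped or sloppy translation hands an engine operators it will misread.

```python
# Hypothetical illustration of query translation. Engine names and
# syntax rules are invented; no real service is depicted.

def translate(required, excluded, engine):
    """Rewrite one query for a given (hypothetical) engine's syntax."""
    if engine == "plus-minus":       # a syntax built on +term / -term
        parts = [f"+{t}" for t in required] + [f"-{t}" for t in excluded]
    elif engine == "boolean":        # a syntax built on AND / AND NOT
        parts = [" AND ".join(required)] + [f"AND NOT {t}" for t in excluded]
    else:                            # no translation: operators pass through
        parts = required + excluded  # untouched -- and likely misread
    return " ".join(parts)

terms = (["school", "uniforms"], ["colleges"])
print(translate(*terms, "plus-minus"))  # +school +uniforms -colleges
print(translate(*terms, "boolean"))     # school AND uniforms AND NOT colleges
```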

The finding that only 34% of Web sites use meta tag coding presents another dilemma for
researchers. The authors write: "The low usage of the simple HTML metadata standard
suggests that acceptance and widespread use of more complex standards, such as XML
may be slow." This observation means researchers may not soon see XML realize its great
potential to deliver the Web as a database.
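For readers who have not seen meta tag coding up close, here is a minimal Python sketch of how an indexer might read the description and keywords a page author supplies. The sample page is invented, and real engines each apply their own rules to such metadata.

```python
# Minimal sketch: reading the metadata a page author supplies.
# The sample page below is invented for illustration.
from html.parser import HTMLParser

SAMPLE_PAGE = """
<html><head>
<title>School Safety Resources</title>
<meta name="description" content="Reports on school discipline policies">
<meta name="keywords" content="school safety, uniforms, youth violence">
</head><body>...</body></html>
"""

class MetaTagReader(HTMLParser):
    """Collect name/content pairs from <meta> tags."""
    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            if "name" in attrs and "content" in attrs:
                self.metadata[attrs["name"]] = attrs["content"]

reader = MetaTagReader()
reader.feed(SAMPLE_PAGE)
print(reader.metadata)
# {'description': 'Reports on school discipline policies',
#  'keywords': 'school safety, uniforms, youth violence'}
```

A page without these tags gives an engine nothing but its visible text to work with, which is the point of the 34% figure.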

I mentioned that newcomer FAST recently
announced it now indexes "more than 200 million unique URLs." This coverage
places it ahead of Northern Light, the engine
the study cited as reaching 16% of the Web. FAST's press release states that it
expects to index the entire Web within one year and then keep its index up to date.

Not to be outpaced, Excite
announced the next day a new "architecture capable of visiting and analyzing every
Web page."3 Others likely will follow with their own
assertions.

Despite these pronouncements, search engines will continue to face
hurdles like password-restrictions, databases, proprietary files and other technologies in
their data collection efforts. Any attempt to increase coverage, however, is a step in the
right direction, albeit a tiny one. As Danny Sullivan points out, sheer scope of coverage
without attention to the relevancy of information to a query does little to improve search
technology.4

To illustrate, take a close look at FAST. FAST
hopes to empower existing engines, not to compete with them for portal status.5 In fact, it's the technology behind Lycos' MP3 Search.

For sheer scope of coverage, not to mention processing speed as its name implies, FAST takes the lead. But anyone who searches FAST will soon discover its weakness -- relevancy.
For example, a search for the Long Beach Unified School District does not find a
link to the home page within the first page of results. A search for Science Magazine
returns links for Science Fiction, Fantasy, Weird Fiction Magazine Index, Science
Humor Magazine, and other unrelated items. A search for the laws of Northern Ireland
finds news items, a white paper, and constitutional proposals within the first page of
hits, but not Her Majesty's Stationery Office, which
publishes these laws.

On the other hand, separate queries for "school uniforms student behavior" and "risk
factors youth violence" found several useful documents, many of which appeared within
the first page of hits.

Clearly FAST, although speedy and far-reaching,
requires work to be of consistent use to professional researchers. But its ability to
index great quantities of Web pages, and to search them rapidly, adds to the current state
of affairs on the Web.

Where then might researchers look for technological developments
emphasizing relevancy? Currently, Google, Open Directory, Direct Hit
and CLEVER warrant mention.
The latter, however, is not available for public viewing.6

Believing in popularity as a relevancy indicator, Direct
Hit monitors which Web sites researchers select from a hit list. It factors in statistics
like time spent at a selected site and then applies this information to refine the
engine's index. If searchers frequently select site X in response to query Y, Direct Hit boosts site X's relevancy ranking.

Unfortunately, success with Direct Hit depends
on the choices of earlier searchers, who may or may not have the same information needs
-- including subject familiarity and intended use -- as the current researcher. To test
it, I ran the queries above at HotBot and MSN Search. Although results varied somewhat, I
immediately found Long Beach Unified School District and Science Magazine. The
search on "risk factors youth violence" yielded several helpful publications, while
the query for "school uniforms student behavior" produced mediocre results. Finally,
the search for the laws of Northern Ireland bombed at MSN
Search, failing to find anything of relevance. It also performed poorly at HotBot, finding only one indirect link.

Formerly NewHoo, Open Directory asks
"experts" to index and annotate resources in their area of expertise.7 It offers a category on School Safety,
for example, that links to several potentially relevant resources.

Researchers should note that although many popular engines use Open Directory, search results may vary for several reasons.
Licensees, for example, may elect to use only some of the Open
Directory data. Further, they may rank it according to their own definitions of
relevancy.8

Vastly different from other search technologies, excepting
possibly CLEVER, Google applies complex algorithms to determine relevancy.
The algorithms in part resemble the concept of citation analysis first employed by Science
Citation Index.9 That is, Google follows the basic principle that works cited by a
document offer potentially relevant information. Google's
creators further apply a number of weights to deal with certain peculiarities of the Web
like universally popular sites, competitive sites, or peer-reviewed publications.10
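Footnote 10 describes the actual weighting; as a rough illustration of the citation-analysis principle -- a page counts for more when pages that themselves count for a lot cite it -- here is a toy power-iteration sketch in Python. The four-page link graph and damping value are invented, not Google's data.

```python
# Toy citation-style ranking: a page's score is fed by the scores of
# the pages citing it, split across each citer's outgoing links.

links = {            # page -> pages it links to (i.e., "cites")
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

def rank(links, damping=0.85, iterations=50):
    pages = list(links)
    score = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        score = {
            p: (1 - damping) / len(pages)
               + damping * sum(score[q] / len(links[q])
                               for q in pages if p in links[q])
            for p in pages
        }
    return sorted(score.items(), key=lambda kv: kv[1], reverse=True)

print(rank(links))   # "C" ranks highest: the most-cited page wins
```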

The outcome provides researchers with the first engine to emphasize search precision
over recall. Try the searches above at Google. The
home page of the Long Beach Unified School District appears as the first result when
entering "long beach unified school district". Science Magazine pops up as the
second hit. A search for the laws of Northern Ireland performs less well: Her Majesty's Stationery Office appears eighth on the
hit list!

Hours before the deadline for this article, Surfwax,
a new meta search service, arrives on the scene. Usually, and for reasons I mention above,
I avoid meta searchers. In this case, I make an exception.

Developed with technologies that define words and word relationships, Surfwax improves upon existing meta searchers. Enter a
query in the search box. Then watch as two frames appear below it. Loading first, the
left-hand frame contains results from various search services including FAST and Google.

What distinguishes Surfwax is a
small, easily overlooked, light green icon that sometimes appears to the left of
linked hits. Clicking on this icon loads abstracts, key points and buzzwords from the
matching resource into the right-hand frame. Researchers may then select one or more of the
buzzwords to refine their searching.11

To focus the query, review the buzzwords and then click on the red icon next to one of
them (if you use Netscape, click on the buzzword hyperlink instead). I selected "school
discipline." It may appear as if nothing happens, but review your search statement,
which still appears at the top of the screen. Notice it now contains the phrase
"school discipline" in addition to the original query. Run the new search to
find resources discussing the effect of school discipline policies on youth violence.

As experienced online researchers know, the success of a query depends in part on the
search statement. Amateur chefs or frantic spouses looking for recipe suggestions for
leftover turkey want to enter more than just the word "turkey." Imagine the
reaction of a librarian if a patron approached with just this one word. Yeah, you too,
buddy!

Search engines, too, require context, or at least well-defined queries. Yet as
illustrated, even straightforward search statements may not produce desired results. What
is a poor researcher to do?

First, to locate a starting point or an answer as opposed to
all possible answers, begin with an engine like Google
that emphasizes relevancy over data collection. Google
works so well for this particular research strategy that in a recent workshop on research
strategies for finding government information on the Internet, I had difficulty steering
students away from Google U.S. Government Search
to examine other techniques.12 Second, for comprehensive
research like scouring the Web for use of common law trademarks or discovering competitive
intelligence, begin with an enhanced meta search service like Surfwax. Then, if necessary, and to accomplish
exhaustive searching of the Web to the extent available, query every existing engine
separately.

Sound like a lot of work? You bet! The current state of technology demands attention to
the limitations of "searching the Web." Yet the development of new services like
FAST, Google
and Surfwax, as well as new technologies like XML,
lends hope for the not-too-distant future.