Web searching

Web search engine

Teoma is a search engine created by Rutgers professor Apostolos Gerasoulis
and his associates. It uses a clustering algorithm to weight
relevance ratings. Each indexed page is assigned to one or more "communities":
sets of pages about the same subject. Inbound links from pages within the
same community are weighted more heavily than similar links from
outside the community. In addition to relevance weighting, this page classification
is the basis for two additional features of the search engine user interface.
The "Refine - suggestions to narrow your search list" list lets the user
select the appropriate classification for a given request. Similarly, the
"Resources - link collections from experts and enthusiasts" list presents
pages that Teoma has essentially determined to be bibliographies (lists
of resources) on the topic at hand.

Simple search

Quoted phrases: Yes. The search screen also has a check
box labelled "Find this phrase" that does the quoting for you
behind the scenes. Nice UI touch.

Truncation symbol: None available. Their instructions say "Different
word stems or endings can lead to different results. Try all endings."

Stemming: Not supported, see above.

Relevance algorithm: Each result
list is grouped into "communities": groups of sites with common
subject material. The communities are listed on the result page and a
user may click on one to limit responses to only members of that community.
The initial result set is ranked by a PageRank-like algorithm with the
enhancement of weighting links from within a community higher than links
from outside the community. The description is purposefully vague about
how relative weights between communities are assigned.
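To make the idea concrete, here is a minimal sketch of a PageRank-style
ranking with a community boost. Teoma does not publish its algorithm, so
everything below is an assumption for illustration: the function name, the
boost of 2.0 for same-community links, the 0.85 damping factor, and the
simplification that each page belongs to exactly one community.

```python
from collections import defaultdict


def community_weighted_rank(links, community, boost=2.0, damping=0.85, iterations=50):
    """Power-iteration link ranking in which same-community links count `boost` times.

    links     -- iterable of (source_page, target_page) pairs
    community -- dict mapping a page to its (single, assumed) community label
    """
    pages = set(community)
    for src, dst in links:
        pages.update((src, dst))
    pages = sorted(pages)

    # Weighted out-links: a link weighs `boost` when both ends share a community.
    out = defaultdict(dict)
    for src, dst in links:
        label = community.get(src)
        same = label is not None and label == community.get(dst)
        out[src][dst] = boost if same else 1.0

    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for src in pages:
            targets = out.get(src)
            if not targets:
                # Dangling page: spread its rank evenly over all pages.
                for p in pages:
                    new[p] += damping * rank[src] / n
                continue
            total = sum(targets.values())
            for dst, weight in targets.items():
                new[dst] += damping * rank[src] * weight / total
        rank = new
    return rank


# Hypothetical three-page web: "hub" and "p1" share a community, "x1" does not.
links = [("hub", "p1"), ("hub", "x1")]
community = {"hub": "popcorn", "p1": "popcorn", "x1": "energy"}
ranks = community_weighted_rank(links, community)
print(sorted(ranks, key=ranks.get, reverse=True))
```

With the boost set to 1.0 the two linked pages tie, which is plain
PageRank behavior; any boost above 1.0 favors the page inside the hub's
community. How Teoma weighs one community against another is exactly the
part the description leaves out, and this sketch does not attempt it.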

Site location: You can specify a domain or any of nine geographical
regions (Africa, Central America, Europe, India or Asia, Middle East,
North America, Oceania, South America, Southeast Asia).

Date page was modified: choice of range, specific date, before or after
a given date.

Search examples

popcorn energy machine: This is what
I started with... I was trying to think of three unrelated terms, but
of course, there are lots of relationships among these words! I tried some
variations to explore the syntax rules and various quote marks, then put them
into the advanced search to explore how the pulldown and the options
like "must have", "should not have", and "must not have" worked. I tried
eliminating a keyword and the characteristics of the result list changed
dramatically.

Tried a bunch of searches inspired by Dr. C. Jorgensen: running,
happy, fun, sad... "conceptual" searches, hard
searches in the image world. In the text world Teoma always finds something,
but the results were rather scattered and the secondary search tools
(Refine and Resources) were similarly unimpressive.

digital library collection development: When entered
as unquoted terms, this search resulted in a high quality list of
digital library resources: NYPL digital library, SunSITE, Yale, IFLA,
a CLIR report, DLib, Glasgow. The ninth entry on the list was the first
I did not know. When the check box "Find this
phrase" was checked, the list changed to a tighter focus on specific
policy statements of organizations. Other searches: "digital
library collection policy" only yielded two hits. "digital
library collection management" had twelve hits and no refinements
or resources. "digital library" "collection
management" was overwhelming, 4,000+ hits, but the Refine and
Resources that it generated were worthless.

The "Refine" and "Resources" sometimes provide powerful
enhancements to web searching, but the idea seems more useful than this
implementation delivers. Google has a feature similar to "Refine" but
it is not as sophisticated. Teoma lists classification titles so the user knows what
community is being selected. Google's "Similar pages" feature,
in contrast, is a "classification by example". You pick an individual site and
see pages that are similar, but you do not know the criteria or classification
system determining the similarity. Teoma lets you make fine distinctions
explicitly, where in Google you need to guess. The "Resources" feature
has no equivalent in Google. I used Teoma as my default search machinery
for four days by making it my browser home page. I found that I did not
trust the results; if it was a search I really cared about, I ran the same
search in Google.

Web directory

The "open directory project" (dmoz) is an open source directory constructed and
maintained by volunteers.

Truncation symbol: The * is a limited wildcard character
usable at the end of a search term only, e.g. "GRADUAT*". Other positional
forms like "*GRAD" and "GRAD*UA" are not supported.
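The trailing-only rule is easy to express in code. This is a sketch of
the behavior as described, not dmoz's implementation: the function name is
hypothetical, and both case-insensitive matching and raising an error for
unsupported wildcard positions are my assumptions.

```python
def matches_truncated(query, word):
    """Return True if `word` matches `query`, where `query` may end in one '*'."""
    query, word = query.lower(), word.lower()
    if "*" in query[:-1]:
        # Leading or embedded wildcards ("*GRAD", "GRAD*UA") are not supported.
        raise ValueError("wildcard is only supported at the end of a term")
    if query.endswith("*"):
        # Trailing wildcard: match any word that starts with the stem.
        return word.startswith(query[:-1])
    return word == query


for candidate in ["graduate", "graduation", "grad"]:
    print(candidate, matches_truncated("GRADUAT*", candidate))
```

A prefix test like this is all a trailing wildcard needs, which may be why
it is the only position offered; leading and embedded wildcards require a
different index or a full scan.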

Stemming: Not described; I'd assume it is only
available through the truncation symbol.

Relevance algorithm: The dmoz search
algorithm is derived from Isearch, and
the descriptions say that relevance ranking is supported.
I could not find a description of the algorithm on the web or in ACM
Digital Library, ArticleFirst, IEEE Xplore, Internet & Personal Computing
Abstracts, or ScienceDirect. I downloaded
the code and couldn't find the module where relevance ranking was
calculated.

Advanced search

The open directory project offers an advanced search, albeit with very
few choices.
Choices include:

Search "Categories Only", "Sites Only", "Sites
and Categories": Allows user to differentiate between
searching for a category and searching for a site

Kids and teens sites: Three checkboxes ("Kids", "Teens",
"Mature Teens"). In a rather awkward user interface, this filter is always
present but only active when the category "Kids and Teens" is selected.

Random: Searching for a null term (blank input box)
returns four categories chosen at random.

Relevance algorithm: There is no description of the
relevance algorithm on the website. The search engine started with Isearch.

Search examples

isearch relevance ranking: This is recursive but ineffectual,
as you would guess: "No results found." I tried permutations
of this phrase as well. The term "isearch" on its own turns up a few
of the same stale results that Teoma returned.

digital library collection development: This gives
two directories and two sites. One site was irrelevant; one was an excellent
find (The Digital Library Center at University of Tennessee) but not
specifically on the topic of
collection development in digital libraries. One directory was irrelevant
(but interesting) and Reference:Libraries:Digital was
close. Refining the search reveals why the original
was not successful: "collection development" is not a category in dmoz.
"Collection policy" is the phrase they use. However "digital library
collection policy" returns zero hits. Truncate to "digital library" and
the world opens up: 7 categories, 624 sites. Exploring leads to nothing:
Top: Reference: Libraries: Library and Information Science: Digital Library
Development is closest, but still does not contain what I'm looking for.
Explore more. Conclusion: Collection policy in digital libraries is too
specialized to have a category; Reference:
Libraries: Library and Information Science: Technical Services: Collection
Development is the best I'll get in dmoz.

directory crawl: The value of a directory is in its
classification system rather than its search. I used dmoz for several
reference needs. My elder son couldn't find his English-Spanish
dictionary. When I searched for English Spanish dictionary, I rapidly
located http://dmoz.org/Reference/Dictionaries/World_Languages/S/Spanish/English/. Trying
to find it via directory crawl took much longer. Reference/Dictionaries
is easy, but then you have to decide between English and World Languages.
English? Wrong answer. Once in World Languages, a new interface convention
shows up, an alphabet, from which you must choose the first letter of
the language. No prompt, no explanation; very hard to figure out. If
you guess "S" you can easily find Spanish and you are all set. However,
the search engine was easier and faster.

I hadn't used dmoz in years and thought it would be worth another
look. It forms the foundation for Google Directory
and other directories behind search engines. Since it is open, you and
I can create and edit categories. I'm looking forward to seeing http://dmoz.org/Reference/Libraries/Library_and_Information_Science/Librarians/Kazmer,_Michelle.
The directory structure is logical; the user interface is clean and easy
to traverse. Downsides? The classification system is not very deep, so my
topic (the LCSH heading is "Digital libraries--Collection development") is not included.
Over the three days I was using it for this paper, the server seemed very
sluggish and at times failed to respond to http requests.

Metasearch service

Clusty is a new metasearch tool from Vivisimo, a company that previously
sold search software components and now offers consumer-level search.
Clusty is rich in features and new ideas, works quickly,
and has a nice interface. It was developed by computer scientists
from Carnegie Mellon. The CEO is Raul Valdes-Perez, who has published
widely on a variety of topics.

Clusty's tabbed top lets a user select from nine sources: Web+, News,
Images, Shopping, Encyclopedia, Gossip, eBay, Blogs, Slashdot. Choices
like "Gossip", "Blogs",
and "Slashdot" differentiate this search machine from the competition.
There's even a customize function that lets you create your own tab and
title, with a customized set of search sources.

Relevance algorithm: Nothing on
the Clusty websites describes this. Vivisimo has descriptions of two
products used in Clusty: Clustering Engine and Content Integrator.
The former takes a result list and categorizes it, using some variant
of the vector space model that is not described. The Clustering Engine
has a cool feature: a licensee can tailor the clustering weights for
keywords, phrases, and terms specific to their trade or industry. The
Content Integrator is a tool that quickly translates Clusty queries into
queries for other search engines and assembles the results. Clusty is
clearly built on these two modules.
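The two-stage shape described above (gather results from several engines,
then categorize the combined list) can be sketched in a few lines. This is
not Vivisimo's method, which is unpublished. As stand-ins I use summed
reciprocal rank to merge the lists and a naive "most common title word"
grouping in place of the undescribed vector space variant. All function
names, the stopword list, and the sample URLs are hypothetical.

```python
from collections import Counter, defaultdict

STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}


def merge_results(result_lists):
    """Combine ranked lists of (url, title); dedupe by URL, score by summed 1/rank."""
    scores = defaultdict(float)
    titles = {}
    for results in result_lists:
        for position, (url, title) in enumerate(results, start=1):
            scores[url] += 1.0 / position
            titles.setdefault(url, title)
    ordered = sorted(scores, key=lambda u: (-scores[u], u))
    return [(url, titles[url]) for url in ordered]


def cluster_by_term(results, query):
    """Group results under the most widely shared title word that is not a query word."""
    skip = STOPWORDS | set(query.lower().split())
    tokens = {
        url: [w for w in title.lower().split() if w not in skip]
        for url, title in results
    }
    # Count, for each word, how many result titles contain it.
    counts = Counter(w for words in tokens.values() for w in set(words))
    clusters = defaultdict(list)
    for url, _ in results:
        words = tokens[url]
        # Ties between equally common words go to the alphabetically first one.
        label = min(words, key=lambda w: (-counts[w], w)) if words else "other"
        clusters[label].append(url)
    return dict(clusters)


# Made-up result lists standing in for two underlying search engines.
engine_a = [("http://example.org/1", "Digital Library Policy"),
            ("http://example.org/2", "Digital Library Conference")]
engine_b = [("http://example.org/2", "Digital Library Conference"),
            ("http://example.org/3", "Collection Policy Framework")]

merged = merge_results([engine_a, engine_b])
print(cluster_by_term(merged, "digital library"))
```

A page returned by both engines rises to the top of the merged list, and
the cluster labels come from the result titles rather than from a fixed
directory, which is the main contrast with dmoz.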

Advanced search

Advanced search allows the user more control over sources, clustering,
and type of content.

Search examples

vivisimo relevance ranking: This time recursion yields
a panoply of sources, but the algorithmic description still remains
unfound. I suspect they consider it a trade secret. I've looked in trade
publications and computer science literature.

digital library collection development: This is the
best search result from the variety of search engines I have used for
this search. The immediately found sites are the highest quality ones:
California Digital Library policy pages, SunSITE, D-Lib, American Memory.
The classification choices are excellent, with choices: University, Development
Policy, Science, California Digital Library, Library of Congress, Library
Research, Library Resources, Conference, Framework, Issues, and (more...).

defaults: the morning after the first Presidential
debate, the gossip column defaults included a "Bush, Debate" category.
At first I thought it was strange; shouldn't that be in the "News" tab?
Then I started reading, and son of a gun! It is gossip about the debate!
"News", on the other hand, had a "Kerry, Debate" category with substantive
articles. Political bias in the Clustering Engine? The "Encyclopedia"
tab had links to articles in Wikipedia about the 2004 debate program,
along with links to the candidates' pages. Other tabs display only a
blank search box.

images: I'm wandering through gossip land (Paris is
everywhere!) and I see a picture. What is Maureen Dowd doing in the gossip
pages? The image search quickly shows me the source of my confusion: Melissa
Etheridge or Maureen Dowd?

The nice thing about this tool is that once you find an initial category
that is close to the type of information you seek, you navigate through
the classification system to narrow or broaden your search. The results
returned were excellent. I've been using it for several days and it seems
to produce high quality results consistently. I have yet to find something
that would make me want to return to Google as my default search engine.

Credits

By Rich Ackerman
For Dr. M. Kazmer
Florida State University
Fall, 2004

Historical note

"Archie is a search engine designed to index FTP archives,
allowing people to find specific files. The original implementation
was written in 1990 by Alan Emtage, Bill Heelan, and L. Peter Deutsch, then
students at McGill University in Montreal.

"The earliest versions of archie simply contacted a list of
FTP archives on a regular basis (contacting each roughly once a month,
so as not to waste too much resources on the remote servers) and
requested a listing. These listings were stored in local files to
be searched using the UNIX grep command. Later, more efficient front-
and back-ends were developed, and the system spread from a local tool,
to a network-wide resource, to a popular service available
from multiple sites around the Internet.
Such archie servers could be accessed in multiple ways: using a local client
(such as archie or xarchie); telneting to a server directly; sending
queries by electronic mail; and later via World Wide Web interfaces."