powerful facilities for browsing and searching both magnet content
and contributed content

means of delegation of moderation

means of identifying members who are imposing an undue burden on the
community and ways of changing their behavior and/or excluding them
from the community without them realizing it

means of software extension by community members themselves

A sustainable online community is one that can accommodate new users.
If Joe Novice, via browsing and searching, cannot find existing
content relevant to his needs, he will ask questions that will annoy
other community members: "Didn't you search the archives?" "Haven't
you read the FAQ?" Long-term community members, instead of being
stimulated by discussion of new and interesting topics, find their
membership a tiresome burden of directing new users to pages that they
"should" have been able to find on their own.

A community's first line of defense is high quality information
architecture and navigation, as discussed at the end of the "Content Management" chapter. Users
are better at browsing than formulating search queries. A community's
second line of defense, however, is a superb full-text search
facility. The search database must include both publisher-authored
and user-contributed content. Here are some example query
categories:

question answering: e.g., planning a trip to Sanibel Island
(Florida) to take pictures of birds and wanting to know which long
telephoto lens to rent, the user types "best lens Sanibel"

navigation: the user knows that a document exists on the server,
but can't remember where it is, e.g., remembering that a tutorial
exists on how to take pictures in gardens, the user types "garden
photography"

task accomplishment: the user wants to find the photo upload page,
not find discussions of photo sharing when he or she types "photo
sharing"

housekeeping: the user wants to find the site's privacy policy,
not a discussion about privacy policies, after typing "privacy policy"

On a large site a user might wish to restrict the search in some way.
If the search form is at the top of a document that is a chapter of an
online book, it might make sense to offer "whole site" and "within the
chapters of this book" options. If the publisher or the other users
have gone to the trouble of rating content, the default search might
limit results to those documents that have been rated of high quality.
If there are multiple discussion forums on the site, each of which is
essentially a self-contained subcommunity, the search boxes on those
pages might offer a "restrict searching to postings in this forum"
option. If a user hasn't visited the site for a month and wants to
see if there is anything new and relevant, the site should perhaps
offer a "restrict searching to content added within the last 30 days"
option.

What's Wrong with SQL (Search Quality)

The relational database management system (RDBMS) sounds like the
perfect tool for this job. We have a lot of data and we want to
provide a lot of flexibility in querying. Suppose a person comes to a
site for athletes and types "running" into the search form. The site
sends the following SQL query to the database:

select *
from content
where body like '%' || :user_query || '%'

which, by the time the bind variable :user_query is
substituted, turns into

select *
from content
where body like '%running%'

In Oracle this won't pick up a row whose message contains the same
word but with a different capitalization. Instead we do

select *
from content
where upper(body) like upper('%running%')

What if the user typed multiple words? The query

select *
from content
where upper(body) like upper('%running shoes%')

would not pick up a message that contained the phrase "shoes for
running". Instead we'll need multiple where clauses:

select *
from content
where upper(body) like upper('%running%')
and upper(body) like upper('%shoes%')

This AND clause isn't quite right. If there are lots of documents
that contain both "running" and "shoes", these are the ones that we'd
like to see. However, if there aren't any rows with all query terms,
we should probably offer the user rows that contain some of the query
terms. We might need to use OR, a scoring function, and an ORDER BY
so that the rows containing both query terms are returned first. If
we insist on the AND clause, we've created a situation in which the
more the user tells us about her interests the fewer documents we'll
return in response to a search, eventually returning "0 results found"
if she keeps adding words. (Note that public search engines circa
2005, such as Google, Yahoo, A9, and MSN, do implicitly use
AND and do return 0 results if a user keeps adding words to a
query and there aren't any documents in the database that contain each
and every one of those words.)

There are some deeper problems with the Caveperson SQL Programmer
approach to full-text search. Suppose that a message contains the
phrase "My brother-in-law Billy Bob ran 20 miles yesterday"
but not the word "running". Or a message contains the phrase "My
cousin Gertrude runs 15 miles every day". These should be
returned as relevant to the query "running", but the LIKE clause won't
do the job. What is needed is a system for stemming both the
query terms and the indexed terms: "running", "runs", and "ran" would
all be bashed down to the stem word "run" for indexing and retrieval.

What about a message saying "I attended the 100th
anniversary Boston Marathon"? The LIKE query won't pick that up.
What is needed is a system for expanding queries through a thesaurus
powerful enough to make the connection between "running" and
"marathon".

What's Wrong with SQL (Performance)

Let's return to the simplest possible LIKE query:

select *
from content
where body like '%running%'

The RDBMS must examine every row in the content table to
answer this query, i.e., must perform a sequential table scan(O[N]
time, where N is the number of rows in the table). Suppose that a
standard RDBMS index is defined on the body column. The
values of body will be used as keys for a B-tree and
we could perform

select *
from content
where body = 'running'

and maybe, depending on the implementation,

select *
from content
where body like 'running%'

in O[logN] time. But the user's interest isn't restricted to
documents whose only word is "running" or documents that begin with
the word "running". The user wants documents in which the word
"running" may be buried. A single B-tree index is not going to help.

Abandoning the RDBMS

We can solve both the performance and search quality problems by
dumping all of our data into a full-text search system. As
the name implies, these systems index every word in a document, not
just the first words as with the standard RDBMS B-tree. A full-text
index can answer the question "Find me the documents containing the
word 'running'" in time that approaches O[1], i.e., an amount of time
that does not vary with the size of the corpus indexed. If
there are 10 million documents in the corpus, a search through those
10 million documents will not take much longer than a search through a
corpus of 1000 documents. (Getting close to constant time in this
situation would require that the 10-million-document collection did
not use a larger vocabulary than the 1000-document collection and that
it was not the case that, say, 90 percent of the documents contained
the word "running".)

How does it work? Like every other indexing strategy: extra work at
insertion time is traded for less work at query time. Consider
constructing a big table of every word in the English language next to
the database keys of those documents that contain the word:

If we build this as a hash table, we have O[1] access to a row in the
table. If we merely keep the rows in sorted order, we have O[log W]
access to any row in the table, where W is the number of words in our
vocabulary. Performance does not vary with the number of documents in
the collection... or does it? Just about every English document will
contain the word "the" and therefore simply returning the value of the
document_ids column for the word "the" will take O[N]
time, where N is the number of documents in the corpus. This row
isn't useful anyway because it isn't selective, i.e., we could get the
same information almost as fast with a sequential scan of the
documents table, collecting all the document IDs. While indexing a
document, a full-text search system will refer to a list of
stopwords, words that are too common to be worth indexing. For
standard English, the stopword list includes such words as "a", "and",
"as", "at", "for", "or", "the", etc.

Inserting a new document into the collection will be slow. We'll have
to go through the document, word by word, and update as many rows in
the index as there are distinct words in the document. But that extra
work at insertion time pays off in a reduction in query time from O[N]
to O[1].

Given a data structure of the preceding form, we can quickly find all
documents containing the word "running". We can also quickly find all
documents containing the word "shoes". We can intersect these result
sets quickly, giving us the documents that contain both "running" and
"shoes". With some fancier indexing data structures we can restrict
our search to documents that contain the contiguous phrase "running
shoes" as opposed to documents where those words appear separately.
But suppose that there are 1000 documents in the collection containing
these two words. Which are the most relevant to the user's query of
"running shoes"?

We need a new data structure: the word-frequency histogram.
This will tell us which words occur in a document and how frequently
they occur in a way that is easily adjusted for the total length of a
document.

Here's a word-frequency histogram for the first sentence of
Tolstoy's
Anna Karenina:

Word

Count

Frequency

all

1

1/16

another

1

1/16

but

1

1/16

each

1

1/16

families

1

1/16

family

1

1/16

happy

1

1/16

in

1

1/16

is

1

1/16

its

1

1/16

one

1

1/16

own

1

1/16

resemble

1

1/16

unhappy

2

2/16

way

1

1/16

One might argue that this sentence makes better literature as
"All happy families resemble one another, but each unhappy family
is unhappy in its own way," but the full-text search software
finds it more useful in this form.

After the crude histogram is made, it is typically adjusted for the
prevalence of words in standard English. So, for example, the
appearance of "resemble" is more interesting than "happy" because
"resemble" occurs less frequently in standard English. Stopwords such
as "is" are thrown away altogether. Stemming is another useful
refinement. In the index and in queries we convert all words to their
stems. The stem word for "families", for example, is "family". With
stemming, a query for "families" would match a document containing
"family" and vice versa.

Given a body of histograms it is possible to answer queries such as
"Show me documents that are similar to this one" or "Show me documents
whose histogram is closest to a user-entered string." The
inter-document similarity query can be handled by comparing histograms
already stored in the text database.
The search string "platinum mines in New Zealand" might be processed
first by throwing away the stopwords "in" and "new". By using
histogram comparison, the software would deliver articles that that
have the most occurrences of "platinum", "mines", and
"Zealand". Suppose that "Zealand" is a rarer word than
"platinum". Then a document with one occurrence of "Zealand" is
favored over one with one occurrence of "platinum". A document with one
occurrence of each word is preferred to an article where only one of
those words shows up. A document that contains only the
words "platinum mines Zealand" is a better match than a document that
contains 100,000 words, three of which happen to match the query
terms.

The power of this kind of system is enticing and raises the question
"Can we run our entire Web application from a specialized full-text search
database system?" Indeed, why not chuck the RDBMS altogether?

We don't chuck the RDBMS because we put it in to handle the problem of
concurrency: two users trying to update the same item simultaneously.
A better query tool is nice, but we can't adopt it as our primary
database management system unless it handles the concurrency problem
as well as the RDBMS.

A pragmatic approach would seem to start by keeping all the documents
in the RDBMS: articles, user comments, discussion forum postings, etc.
Either once per night or every time a new document was added, update a
full-text search system's collection. Pages that are part of the
standard user experience and workflow operate from the RDBMS. The
search box at the upper right corner of every page, however, queries
against the full-text search system. Let's call this a split-system
design.

**** insert figure *****

Figure 12.1:
A split-system approach to providing full-text search. The
application's content is stored in a relational database management
system. Scripts periodically maintain a second copy in a
specialized text database. The Web server program performs queries,
inserts, and updates to the RDBMS. When a user requests a full-text
search, however, the query is sent to the text database.

One argument against the split-system approach is that two copies of
the document collection are being kept. In an age of $200 disk drives
of absurdly high capacity, this isn't a powerful argument. It is
nearly impossible to fill a modern disk drive with words typed by
humans. One can fill up a disk drive with video or audio streams, but
not text. And in any case some full-text search systems can build an
index to a document collection without themselves keeping the original
document around, i.e., you would in fact have only one copy of the
document in the RDBMS.

A second argument against using RDBMS and full-text search systems
simultaneously is that the collections will get out of sync. If the
Web server crashes in the middle of an RDBMS transaction, all work is
rolled back. If the Web server was simultaneously inserting a
document into a full-text search system, it is possible that the
full-text database will contain a document that is not in fact
available on the main pages of the site—the site being generated from the
RDBMS. Alternatively, the RDBMS insert might succeed while the
full-text insert fails, leading to a document that is available on the
site, but not searchable. This argument, too, ultimately lacks power.
It is true that the RDBMS is a convenient and nearly foolproof means
of managing transactions and concurrency. However, it is not the only
way. If one were to hire sufficiently careful programmers and
sufficiently dedicated system and database administrators, it would be
possible to keep two databases in sync.

A third argument against the split system is the disparity of
interfaces. Suppose that our RDBMS is Oracle. The Web developers
know how to talk to Oracle through Active Server Pages. The desktop
programmers know how to talk to Oracle through the C API. The
marketing people know how to talk to Oracle through various reporting
tools. Some individual users have figured out to talk to Oracle from
standard desktop programs such as Microsoft Excel and Microsoft
Access. The cost of bringing in a new programmer grows if you have to
teach that person not only about an RDBMS, but also about specialized
tools, each with its own library of interfaces.

However, the best argument against using both an RDBMS and a
"bag-on-the-side" full-text search system is that the split system
does not naturally support the kinds of queries that are necessary:

show me documents matching "best restaurants" written by users
whose recorded street address is within 10 miles of zip code 02138

show me documents matching "studio photography" written by users
whose contributions have been rated above average by other users (said
content item ratings being stored in RDBMS tables)

show me documents matching "best advertising tricks" written by
users whose recent classified ads have attracted more than 5 bids each

Augmenting the RDBMS

Consider a full-text indexing system. It needs a way of writing stuff
down (the index data structures) and typically chooses the operating
system file system. It needs a way of performing computation in a
procedural computer language, typically C circa 2004.

Consider a modern relational database management system. It offers a
way of writing stuff down: CREATE TABLE and INSERT. It offers a way
of executing software written in a procedural language: C, Java, or
PL/SQL in the case of Oracle; any .NET-supported computer language in
the case of Microsoft SQL Server.

Why couldn't one build a full-text search indexer inside the RDBMS?
That's exactly what some of the commercial RDBMS vendors have done.
Oracle was a pioneer in this area and the relevant Oracle product is
called "Oracle Text".

In the preceding example, Oracle Text builds its own index on the
body column of the content table. When a
Text index is defined on a table it becomes possible to use the
contains operator in a WHERE clause. The Oracle RDBMS
SQL query processor is smart enough to know how to use the Text index
to answer this query without doing a sequential table scan. It is
possible to have more than one call to contains in the
same query. Thus the last argument of contains is an
integer identifying the query, in this case "1". It is possible to
get a relevance score out in the select list or in an ORDER BY clause
with the function score and an argument identifying from which
contains call the score should be pulled.

Oracle Text is one of the more difficult and complex Oracle RDBMS
products to use. For example, if you want to be able to search for a
phrase that occurs in either the one_line_summary or
body and combine the relevance score, you need to build a
multi-column index:

Notice that the index itself is built on the column
modified_date, which is not itself indexed. The call to
ctx_ddl.set_attribute in which the COLUMNS attribute is
set is what determines which columns get indexed.

Oracle Text also has the property that its default search mode is
exact phrase matching. A user who types "zippy pinhead" into a search
engine will expect to find documents that contain the phrase "Zippy
the Pinhead". This won't happen if your script passes the raw user
query right through to the Contains operator. More
problematic is what happens when a user types a query string that
contains characters that Oracle Text treats specially. This can
result in an error being raised by the SQL query and a "Server Error
500" returned to the user if you don't catch the error in your
procedural script. It would be nice if Oracle Text had a built-in
procedure called "ProcessRawQueryFromWebForm" or something. But it
doesn't, at least we couldn't find one in the documentation for
Oracle version 10g. The next best thing is a procedure called
pavtranslate, available from http://technet.oracle.com/sample_code/products/text/htdocs/query_syntax_translators/query_syntax_translators.html.

Oracle Text, via the "INSO filters" option, has the capability to
index a remarkable variety of documents in a BLOB column. For
example, the software can recognize a Microsoft Excel spreadsheet, pull
the text out and add it to the index. At the same time it is smart
enough to know when to ignore a document entirely, e.g., if the BLOB
column were filled with a JPEG photograph.

Exercise 1: Expected Queries

Ask your client what kinds of queries he or she expects to be most
common in your community. For example, in a site for academics, it
might be very important to type in a person's name and get all of the
publications authored by that person. In a site for shoppers, it might
be essential to query for a brand name and get back product reviews.
Only your client can say authoritatively.

Exercise 2: Document Your Design

Place a document at /doc/search in which you describe your team's plan
for providing full-text search over the content on your site. If your
content management system has left you with a mixed bag of stuff in
the file system and stuff in the RDBMS, explain how you're going to
synchronize and unify these documents in one full-text index. If
nightly maintenance scripts are required, document them here.

Include your client's answers to Exercise 1 in this document.

Exercise 3: Build the Basic Search Module

Build a basic search module that provides the following functions:

user query from the URI /search/, targeting
/search/results

administrator ability to view statistics on the size and structure
of the corpus (how many documents of each type, total size of collection)

administrator ability to drop and rebuild the full-text index.
Sadly this is necessary periodically with most tools and you don't
want the publisher to be forced into obscure shell commands. An ideal
solution will be completely maintainable from a Web browser.

Exercise 4: Big Brother

Generally users prefer to browse rather than search. If users are
resorting to searches in order to get standard answers or perform
common tasks, there may be something wrong with a site's navigation or
information architecture. If users are performing searches and
getting zero results back from your full-text search facility, either
your index or the site's content needs augmentation.

Record user search strings in an RDBMS table and let admins see what
the popular search terms are (by the day, week, or month). Make sure
to highlight any searches that resulted in the user seeing a page "No
documents matched your query". Ask yourself whether it would be
ethical to implement a facility whereby the site administrators could
view a report of search strings and the users who typed them in.

Update your /doc/search file to reflect the addition of
this facility.

Exercise 5: Linkage

Find logical places among your community's pages to link to the search
facility. For example, on many sites it will make sense to have a
quick search box in the upper-right corner of every page served. On
most sites, it makes sense to link back to search from the search
results page with a "search again" box filled in by default with the
original query.

Make sure that your main documentation page links to the docs for this
new module.

Working with the Public Search Engines

If your online community is on the public Internet you probably would
like to see your content indexed by public search engines such as Google (www.google.com). First, Google has to
know about your server. This happens either when someone already in
the Google index links to your site or when you manually add your URL
from a form off the google.com home page. Second, Google has to be
able to read the text on your server. At least as of 2005 none of the
public search engines implemented optical character recognition (OCR).
This means that text embedded in a GIF, Flash animation, or a Java
applet won't be indexed. It might be readable by a human user with
perfect eyesight, but it won't be readable by the computer programs
that crawl the Web to build databases for public search engines.
Third, Google has to be able to get into all the pages on your server.
If you've been requiring registration to view discussions, for
example, those discussions won't be indexed by Google unless your
software is smart enough to recognize that it is Google behind the
request and make an exception. How to recognize Google? Here's a
one-line snippet from the philip.greenspun.com access log (newlines inserted for
readability):

Notice the user-agent header at the end: Googlebot/2.1, with its
included suggestion that Web publishers check http://www.google.com/bot.html
for more information.
Because some search engines archive what they index, you would not want
to provide registration-free access to content that is truly private
to members. In theory a <META NAME="ROBOTS"
CONTENT="NOARCHIVE"> placed in the HEAD of your HTML
documents would prevent search engines from archiving the page, but
robots are not guaranteed to follow such directives.

Some search engines allow you to provide indexing hints and hints for
presentation once a user is looking at a search results page. For
example, in the table of contents page for this book, we have the
following META tags in the HEAD:

The "keywords" tag adds some words that are relevant to the document,
but not present in the visible text. This would help someone who
decided to search for "MIT 6.171 textbook", for example. The
"description" tag can be used by a search engine when summarizing a
page. If it isn't present, a search engine may show the first 20 words
on the page or follow some heuristics to build a reasonable summary.
These tags have been routinely abused. A publisher might add popular
search terms such as "sex" to a site that is unrelated to those terms,
in hopes of capturing more readers. A company might add the names of
its competitors as keywords. Users wouldn't see these dirty tricks
unless they went to the trouble of using the View Source command in
their browser. Because of this history of abuse, many public search
engines ignore these tags.

A particularly destructive practice is "cloaking", in which a Web
server is programmed to send entirely different pages to the search
engines and human users (identified by having "Mozilla" or "MSIE" in
their user-agent headers). An unscrupulous publisher would find out
what are the current most popular search terms on public search engines
(http://searchenginewatch.com/facts/searches.html offers a list of windows into various search services),
string those terms together, and serve a mishmash of those to
search engines. Meanwhile, when a regular user came to the site the
page presented would be a banal product pitch. Google threatens to
ban from their index any site that engages in this practice.

The /robots.txt File

Suppose that you don't want the public search engines indexing
anything underneath the /staging/ directory on your
server. This content isn't exactly secret, but neither do you want it
released before its time. Nor do you want two copies of the same
content in the Google index, one copy in the staging area and one copy
in its final position on the site.

You need to read the Standard for Web Exclusion, a protocol for
communication between Web publishers and Web crawlers, available from
http://www.robotstxt.org/wc/norobots.html. You
the publisher put a file on your site, accessible at /robots.txt,
with instructions for robots. Here's an example that excludes the
staging directory:

The User-agent line specifies for which robots the
injunctions are intended. Each Disallow asks a robot not
to look in a particular directory. Nothing requires a robot to observe
these injunctions, but the standard seems to have been adopted by all
the major indices nonetheless.

Exercise 6: robots.txt

Place a file on your server at /robots.txt that excludes
robots from appropriate portions of your server. Put some comments at
the top of the file explaining who created this, when it was created,
and the rationale behind the exclusions.

If you're doing a 100 percent database-backed content management
system, you are free to put the content of the robots.txt file in the
RDBMS, just so long as it is served when the URI
/robots.txt is requested.

Exercise 7: Client Signoff

Review the search facility, both user and admin pages, with your
client. Write down your client's reaction to this new module, paying
particular attention to any new ideas that the client might have for
what will be typical queries on the site.

The Future

As an online community grows older and larger it becomes ever more
likely that a user will be overwhelmed with "100,000 documents matched
your query". When a community is new and small, it is possible to
search for an answer merely by reading the titles of everything on the
site, i.e., by browsing. As a community grows, therefore, the greater
the importance of information retrieval tools. The exercises in this
chapter focus on answering a user's query by presenting links to
relevant documents. Suppose that we build a search facility that
always returns the very most relevant document in the corpus. Is that
an optimal solution? Only if you believe that users like to read.

Suppose that Joe User visits photo.net and types "At what shutter
speeds is a tripod required?" into the search box. Is it reasonable
to assume that Joe wants to read a 10,000-word document that contains
the answer to this question? Or would Joe rather get ... the answer
to his question. The answer "at shutter speeds slower than
1/lens-focal-length" is a lot smaller and quicker to read than a
document containing this information.

To get a feel for how a question answering system can be built on top
of a full-text indexer, read "Scaling Question Answering to the Web"
(Cody Kwok, Oren Etzioni, Dan Weld; WWW10 conference, May 2001; http://www.www10.org/cdrom/papers/pdf/p120.pdf),
which describes a system built at the University of Washington. This
system includes all of the expected linguistic gymnastics plus code to
sort out the Internet-specific problem of noise. Traditional
information retrieval systems are designed to work with authoritative
documents, e.g., the Encyclopedia Britannica, a binder of
corporate policies, or the design notes for a jetliner. The documents
in the corpus are presumed to be authoritative. There won't be four
different answers, three of them flat wrong, to questions such as "In
what year was Gioacchino Rossini born?", "How many signatures are
required for a purchase of $57,300?", or "How wide is the wingspan of
the airplane?" With user-authored content in an online community,
however, it seems safe to assume that while the average answer is
likely to be correct, for every 100 correct answers there will be at
least three or four incorrect ones. Even when the data require no
interpretation, there will be typos. For example, a Google
search for "rossini 1792-1868" returned 50,900 documents in February 2005; a search
for "rossini 1792-1869" returned 43 documents.
A question-answering system built on top of lightly moderated user-authored
content will have to exercise the same sort of judgment as do humans:
How many documents contain Answer A versus Answer B? What is the
relative authority of conflicting documents? Which of two conflicting
documents is more recent?

Mobile Internet devices put an even greater stress on information
retrieval. Connection speeds are slower. Screens are smaller. It
isn't practical for a user to drill down into 20 documents returned by
a search engine as possibly relevant to a query, especially if the
user is driving a car and using a voice browser.

If you want to emerge as a hero from the dust of the next Internet
collapse, work on information retrieval.

More

Time and Motion

The two client interviews, at the beginning of the exercises and again
at the end, should each take under an hour.

The search design and documentation should be a team effort, and take
one to two hours.

The luckiest teams will be able to get their search systems up and
running in an hour. Unlucky teams using difficult-to-install search
systems may require the better part of a day. Teams with a single
content table and no static html pages should be able to build the
basic page scripts in one to two hours. Additional time will be
required for designs that manage content across multiple tables and
the filesystem.