Seek and Ye Shall Find (Maybe)

The cover of WIRED 4.05 (May 1996).

The most popular sites on the web are trying to bring order out of chaos, in a frantic quest for the ultimate index of all human knowledge.

In 1668, the English philosopher John Wilkins presented a universal classification scheme to London's Royal Society. The scheme neatly divided all of reality into 40 root categories, including "things; called transcendental," "discourse," and "beasts." These categories were further divided into subgenuses (whole-footed beasts and cloven-footed beasts, for example), and each was carefully documented with examples. Wilkins's eagerly awaited proposal was immediately published and distributed throughout Europe.


Today, Wilkins's system is remembered only as an example of the arbitrariness of attempts to classify the knowable universe. Indeed, the dream of organizing all knowledge has been thoroughly discredited. It peaked in popularity during the 18th century, when the scope of human knowledge was still imaginable and the universe was thought to be rational. By the century's close, projects such as Wilkins's universal classification scheme, or Ephraim Chambers's comprehensive Cyclopaedia, or Universal Dictionary of Arts and Sciences, had come to seem utopian. Although a few have continued to dream of a universal library – Vannevar Bush, who described his memex system in 1945; Ted Nelson, who has been working on Xanadu since the early 1970s – they are widely seen as laughable in our relativist, postmodern era.

But recently there have been hints of entirely new ways to classify knowledge, new systems for sorting and storing information that avoid the pitfalls of the past and can work on unimaginably large corpuses. The long-moribund fields of knowledge organization and information retrieval are, once again, showing signs of life. The reason, of course, is the Web.

The most popular sites on the Web today are those – like the Yahoo! catalog, like the Alta Vista search engine – that attempt to exert some kind of order on an otherwise anarchic collection of documents. The hard problems of knowledge classification and indexing are suddenly of commercial importance. The result has been a spate of high-tech start-ups, formed mostly by computer scientists and linguists, that are intent on making the Web act more like a well-organized library. Their efforts, rooted in equal parts hubris and brilliance and marked by a conviction that the problem is solvable, can seem startlingly reminiscent of John Wilkins and his contemporaries.

Admittedly, equating the Web with all human knowledge is an exaggeration. But not as much of one as you might think. A year and a half ago, the content of the Web was heavily tilted toward a few niches: there was a lot about Unix and UFOs, not much about real estate or poetry. But today the breadth of the Web comes close to covering all major subjects. Indeed, at its current growth rate, the Web will contain more words than the giant Lexis-Nexis database by this summer, and more than today's Library of Congress by the end of 1998. And the Web defines "knowledge" far more loosely than any library. Even the Total Library of Jorge Luis Borges, which contained all knowledge and its contradiction, didn't include live video feeds of coffeepots. So if the entire Web can be organized, that goes a long way toward organizing all of knowledge as well.

But the difficulty of the task quickly becomes apparent when we look at attempts to solve similar problems. The most obvious place to turn – library science – turns out to be of almost no help. For one thing, even librarians admit that the schemes used today are antiquated and inadequate: the phrase "classification in crisis" has become a cliché in the library community. The most common systems in the US – the Dewey Decimal System and Library of Congress Classification – were developed toward the close of the 19th century. Unsurprisingly, they are poor at classifying knowledge in "newly" established fields like genetics or electrical engineering. More important, library classification is bound by restrictions that the digital world is not. While a physical book can be shelved in only one place, a digital document can be placed in several categories at the cost of only a few bytes.

The field of information retrieval, which focuses on automated techniques like keyword indexing for searching large databases, isn't much more encouraging for those trying to organize the Web. The simple reason: even humans are poor at deciding what information is relevant to a particular question. Trying to get a computer to figure it out is nearly impossible.

Given all this, how do researchers possibly believe they can organize the rapidly growing Web? Have they really solved the problems that have stumped scientists for the last 200 years, or are they just ignoring them? And if organizing the Web really is possible, what are the implications?

Yahoo!

To figure out some answers, I drove down to a grubby little office park, where transmission repair shops nestle next to high-tech start-ups, in Mountain View, California, to meet with the people behind Yahoo! (www.yahoo.com/). Their cramped office, jammed full with dilapidated desks covered in stacks of manuals, seemed at odds with the lighthearted image Yahoo! projects online. But the disarray clearly reflected the company's rapid growth.

Yahoo!'s statistics are impressive. Created in 1994 by Jerry Yang and David Filo, two disaffected electrical engineering and computer science grad students from Stanford University, Yahoo! lists more than 200,000 Web sites under 20,000 different categories. Sites that track pollution, for example, are listed under Society and Culture:Environment and Nature:Pollution. These categories form what the people at Yahoo! a bit pretentiously refer to as their ontology – a taxonomy of everything. Their ordering of the Web is precise enough – and intuitive enough – that almost 800,000 people a day use Yahoo! to search for everything from Web-controlled Christmas trees to research on paleontology. In almost every way you can measure, Yahoo! has successfully exerted order on the chaotic Web.

But how much longer can its hold last? Already, Yahoo! falls short of cataloguing the half-million or so sites on the Web. The enormity of its task is almost comical – I picture Jerry Yang as Charlie Chaplin in Modern Times, confronted with an endless stream of new work that is only increasing in speed. Sites that don't make a point of notifying Yahoo! of their existence often don't end up being listed. And as the Web continues its exponential growth, Yahoo! will have to grow exponentially as well. If it fails to keep up, Yahoo!'s catalog will become like the Cyclopaedia of Ephraim Chambers, whose claims of comprehensiveness were quickly destroyed by the rapid growth of knowledge.

It's a concern that Jerry Yang, the less publicity-shy of the two founders, had been thinking a lot about lately. Not that he seemed terribly worried – at least not at first. A studiously casual 27-year-old from Taiwan, Yang had the Web-to-riches rap down. His speech was peppered with buzzwords. I imagined him coolly promoting Yahoo! – "We're a content-driven, interactive, information provider" – to the executives at companies like Softbank Corp. and Sequoia Capital and walking away with a couple million dollars in financing. (This venture capital will soon be richly supplemented: On March 7, Yahoo! filed with the SEC for an initial public offering.) It was only when he began talking about the intricacies of Yahoo!'s design that Yang reverted to the CS student he was a little more than a year ago and actually admitted to worrying about the roadblocks ahead. Even then, the fear was covered up by plenty of intellectual braggadocio. As he told me, leaning back and raising his arms in an exaggerated shrug, "I like tough problems. The harder to solve, the better. And organizing the Web is probably the hardest information science problem out there."

That may be, but Yahoo!'s technology, at least, is relatively straightforward. Yahoo! works like this: First, the URLs of new Web sites are collected. Most of these come by email from people who want their sites listed, and some come from Yahoo!'s spider – a simple program that scans the Web, crawling from link to link in search of new sites. Then, one of twenty human classifiers at Yahoo! looks the Web site over and determines how to categorize it.

Really, the only hard part – the only part that your average high-school geek couldn't do – is developing the classification scheme. The ontology. Dividing human knowledge into a clean set of categories is a lot like trying to figure out where to find that suspenseful black comedy at your corner video store. Questions inevitably come up, like Are movies part of Art or Entertainment? (Yahoo! lists them under the latter.)

To solve this problem, Yang and Filo hired Srinija Srinivasan as their "Ontological Yahoo!" Another former Stanford student, Srinivasan is unfailingly helpful, quick to answer any question in her relaxed California accent. Perhaps that's why Newsweek claimed she was trained in library science when including her among the 50 people who matter most on the Internet. Actually, her background is in artificial intelligence. But Srinivasan was well prepared for tackling the organization of the Web: previously she had been working at a lunatic-fringe project in Texas, attempting to teach a computer the fundamentals of human knowledge. (See "CYC-O," Wired 2.04, page 94.)

Starting with the ad hoc categories she inherited from Yang and Filo, Srinivasan began slowly and deliberately steering Yahoo!'s ontology toward completeness. Mainly, it's been a matter of adding new categories and reorganizing hierarchies as the Web evolves from containing only specialized, technical information to containing content from every field of knowledge. But she's also set up certain guidelines to ensure consistency. For example, every regional Web site is now put in the regional hierarchy, and a cross-link to the site is placed under the appropriate topic. So a Florida real estate company is listed under Florida, with a cross-link from real estate.

A few months ago, Srinivasan told me, she was adding categories and making changes to the ontology almost every day. Now major adjustments are becoming much more infrequent. She pointed to this as support for Yang's assertion that "at some point, our scheme will become relatively stable. We will have captured the breadth of human knowledge."

I'd like to think it was that easy, that the goal of categorizing human knowledge would finally be solved by a few computer scientists in a cramped office park. Yang's obviously honest excitement about the promise of the Web made me want to see him succeed. But a story he and Srinivasan told me about recent events at Yahoo! left me convinced I would have to look elsewhere for the answer.

The story began when the Messianic Jewish Alliance of America submitted its Web page to Yahoo! A classifier quickly reviewed the site – which contains everything from Stars of David to articles about Israel, not to mention the word "Jewish" in its name – and placed it under Society and Culture:Religion:Judaism.

But here's where things got tricky. True, MJAA members are born of Jewish mothers and are hence, by definition, Jews. But they also believe that Jesus Christ is the messiah. In the eyes of most Jews, that makes the MJAA a bunch of heretics. Or at least Christians.

So when a few vocal and Net-savvy Jews saw the MJAA listed under Judaism, they let loose a salvo of email demanding that Yahoo! remove MJAA's listing. A bit taken aback by the protesters' virulence ("threats of boycotts," Yang said with amazement), Yahoo! yielded and reclassified MJAA under Christianity with a cross-reference from Judaism. Of course, this caused the MJAA to protest that they were now being incorrectly labeled. After a modern-day Solomonic compromise, the MJAA and a few similar groups can now be found listed under Society and Culture:Religion:Christianity:Messianic Judaism – which is linked by a cross-reference from Judaism.

Yang looked at me sheepishly when telling this story. After all, he believes in truth, justice, and the Internet way. Hell, he even gave me a mini-sermon that morning about how the Net is egalitarian – the little guy can publish just as easily as the big guy. Yet, he knows the MJAA was pushed around because it didn't have mainstream Judaism's clout.

But the MJAA story is interesting not just for exposing the realpolitik of classification. It's proof that no ontology is objective – all have their own biases and proclivities. Yang was quick to admit this: in fact, he referred to Yahoo!'s ontology as the company's editorial. "Organizing the Web is sometimes like being a newspaper editor and inciting riots," he said with a touch of exasperation. "If we put hate crimes in a higher level of the topic hierarchy, well, it's our editorial right to do so, but it's also a very heavy responsibility."

Yahoo!'s success, Yang argued, is evidence that point of view and knowledge classification are not incompatible. Just as we learn to automatically compensate for right-wing bias while reading The Wall Street Journal's editorial page, we can also learn to adjust for the perspective that Yahoo! embodies. We can learn to think like a Yahoo! classifier. The real problem, Yang and Srinivasan agreed, is making sure that Yahoo!'s point of view remains consistent even as the company expands to keep up with the growth of the Web.

After all, Yahoo!'s point of view comes from having the same 20 people classifying every site, and from having those people crammed together in the same building where they are constantly engaged in a discussion of what belongs where. Lose that closeness and the biases will start to become more diffuse. Yang admitted as much, saying, "It's hard to expand Yahoo!, because you end up with too many points of view." Instead of the Journal's editorial page, you end up with something like CNN, where prejudices are masked by a pretense of objectivity. For Yahoo!, that translates to a category scheme where users have a hard time guessing where they'll find what they're looking for.

So Yahoo! is faced with an unforgiving trade-off between the size and the quality of its directory. If Yahoo! hires another 50 or 60 classifiers to examine every last site on the Web, the catalog will become less consistent and more difficult to use. On the other hand, if Yahoo! stays with a small number of classifiers, the percentage of sites Yahoo! knows about will continue to shrink.

Yahoo! will probably take this latter path and simply admit that it is an opinionated guide, a sort of "best of the Web," and not a complete catalog. That will make for a successful business – look at how popular the "cool site of the day" Web pages are – but it doesn't bring us any closer to a universal library.

In my mind, Yang identified the problem with Yahoo! when he noted that "it is much more of a social-engineering problem than a library or computer science problem." By relying on human intelligence to organize the Web, Yahoo! falls victim to subjectivity. The problem must be attacked at some lower level that is amenable to automation.

Inktomi

What's needed, I decided, is an index of the Web. A concordance that keeps track of every word on every Web site. Like a catalog, a keyword index organizes Web sites based on their content, but it does so at the word level instead of by subject. Sites about Messianic Judaism are found by looking for pages that contain the words "Jesus" and "Jewish." This eliminates the subjectivity that plagues classification schemes like Yahoo!'s – a document either contains the word "Judaism" or it doesn't. However, indexing increases the size of the task from keeping track of millions of documents to keeping track of billions of words.

When the first concordance, or keyword index, of the Bible was compiled by Hugues de Saint-Cher in 1240, the task required the labor of 500 monks. But the labor involved is almost completely mindless; today, a computer can construct a keyword index for a small library in minutes, using a straightforward technique known as an inverted index.

An inverted index is simply a huge table, where rows represent documents and columns represent words. If document x contains word y, there will be a binary 1 in row x, column y of the table. To find all documents that contain a specific word, the computer simply scans for 1s in the appropriate column. With a little added work, it's possible to do more complex searches: Find all documents that contain the word "wired" and not the word "amphetamine." The table helps speed up the process because only the appropriate columns, instead of the documents themselves, need to be examined.
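The table described above can be sketched in a few lines of Python. The documents and their texts here are invented for illustration, and a set of document names per word stands in for a column of 1s in the table; real engines store this far more compactly:

```python
# A toy version of the inverted index described above: a set bit means
# "this document contains this word". All document texts are invented.

docs = {
    "doc1": "wired magazine covers the web",
    "doc2": "amphetamine keeps you wired",
    "doc3": "the web is growing fast",
}

# Build the index: word -> set of documents containing it.
index = {}
for name, text in docs.items():
    for word in text.split():
        index.setdefault(word, set()).add(name)

# "wired" AND NOT "amphetamine": only the two relevant columns are
# examined, never the documents themselves.
hits = index.get("wired", set()) - index.get("amphetamine", set())
print(sorted(hits))  # -> ['doc1']
```

The speedup the article describes comes from exactly this: the boolean query touches two small sets rather than rescanning every document's text.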

Even with the aid of computers, however, the problem of scale becomes daunting as the size of the corpus increases. Depending on whom you ask, the Web currently contains somewhere between 30 million and 50 million pages. (Louis Monier, the technical leader at Digital Equipment Corp.'s Alta Vista search engine, says at least 45 million, while Michael Mauldin of Lycos says 30 to 50.) Given that the average Web page contains about 500 words, or 7 kilobytes of text, we can guess that the Web contains somewhere between 200 and 330 gigabytes of text. And these numbers are growing by 20 percent every month, says Mauldin. In two years, as the Web surpasses the roughly 29 terabytes in the current Library of Congress, will the inverted index become too large to feasibly store? Will it simply take too long to compute? Or will attempts at indexing the Web break down in some other, unexpected way?

To find out, I headed to the computer science department at the University of California at Berkeley, where Eric Brewer, an assistant professor, is studying these questions. I first met Brewer when I was a CS grad student at Cal. He had just been hired after getting his doctorate from MIT, and he was teaching a course on high-speed networking. I remembered Brewer as a tall, curly-haired guy who was always bobbing his head and smiling nervously while bragging about his research.

He didn't seem to have changed much when I caught up with him in Cal's new computer science building, the startlingly ugly, green-tiled Soda Hall. As we sat down in an empty conference room, Brewer was quick to mention that, along with grad student Paul Gauthier, he had created Inktomi (inktomi.cs.berkeley.edu/, named after a mythological spider of the Plains Indians) – one of the largest indexes of the Web. And how, unlike other large Web indexes such as Lycos (www.lycos.com/) and Alta Vista (altavista.digital.com/), Inktomi doesn't require a half-a-million-dollar investment in computer hardware. "We didn't just throw money at the problem like those guys," he said. "We've come up with a truly scalable solution." The result, Brewer assured me, is a system that will be able to index the entire Web even five years down the road.

Inktomi is one of the first real-world applications of hive computing (see Wired 3.11, page 80). The idea is to create a supercomputer by lashing together lots of existing workstations with a network, then having each workstation work on one piece of a problem. The result is cheap (because you're using off-the-shelf components) and fast (because you can keep adding more workstations to increase performance). Inktomi works by splitting the inverted index of the entire Web over four Sun SPARCstations. This is enough computational power and memory to handle about a million users per day and index several million documents. But despite Brewer's assurance, I wasn't convinced that Inktomi's technique will work when the number of documents and users has increased by two orders of magnitude. At some point, it seemed to me, the Web would be so large, and changing so fast, that it would be physically impossible to keep up. Something would break.

My first guess was that the bottleneck would be getting the data. Right now, indexers use software spiders that crawl through the Web and download every page for indexing. (See "Bots Are Hot!" Wired 4.04, page 114.) The spiders start with a list of a dozen or so known sites. They index these pages, then follow every link to every new page the sites contain and index those. The process repeats until the spider can't find any links in the Web that it hasn't visited. Back when the Web was young, when it contained only a few thousand pages, this procedure took less than a day. Now it takes even the quickest spiders three or four days to roam the entire Net. Alta Vista's spider, for example, downloads 2.5 million pages a day out of the more than 21 million it knows about. Won't the Web soon be so large, I challenged Brewer, that the index cannot be completed before more pages are added, making it perpetually out of date?
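The crawl procedure described above is essentially a breadth-first traversal of the Web's link graph. A minimal sketch, with an in-memory dictionary of invented URLs standing in for live HTTP fetches:

```python
from collections import deque

# Toy link graph standing in for the live Web: "fetching" a page here
# just looks up its outgoing links. All URLs are invented.
links = {
    "a.com": ["b.com", "c.com"],
    "b.com": ["c.com", "d.com"],
    "c.com": [],
    "d.com": ["a.com"],
}

def crawl(seeds):
    """Visit every page reachable from the seed list, each exactly once."""
    seen, queue = set(seeds), deque(seeds)
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)          # a real spider would index the page here
        for url in links.get(page, []):
            if url not in seen:     # stop once no unvisited links remain
                seen.add(url)
                queue.append(url)
    return order

print(crawl(["a.com"]))  # -> ['a.com', 'b.com', 'c.com', 'd.com']
```

The `seen` set is what lets the spider terminate: the process repeats, as the article says, until no link leads anywhere new.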

Brewer leaned back and flatly disagreed. According to him, we will just use smarter and faster spiders. After all, he pointed out, it's possible to get a 155-Mbps connection to the Internet. That means the entire contents of the Web can theoretically be sucked down in about five hours. Sure, the Web is growing, said Brewer, but so is available bandwidth. The real problem is that Web spiders spend most of their time waiting to connect to the Web site. It's a problem familiar to anyone who has tried to go to a popular site in the middle of the afternoon.

So, to speed up the process, Inktomi uses multiple computers to crawl the Web. A few dozen workstations in the Berkeley computer science department are set up to start crawling the Web when nobody else is using them. By breaking the problem up this way, Inktomi can take almost full advantage of its Net connection. Inktomi also plans a few other tricks – for example, to keep track of which Web sites change most frequently and make sure it checks those sites every day. The result, taking into account the bandwidth increases that will be necessary to sustain the Net, is that crawling the Web will still be feasible in 2000.

OK. But what about storage? After all, you're trying to keep track of the entire Web! You're trying to store in one place the contents of hundreds of thousands of hard drives. In a few more years, that will have to be prohibitively expensive.

Not so, claimed Brewer, excited by the opportunity to showcase another advantage of his system. Remember that you need to store only the inverted index, instead of the actual documents. That makes for a great compression scheme: each occurrence of a word is represented by a single bit in the appropriate column. The result? In Inktomi, which uses some clever techniques to reduce the table's size even further, a document takes up only about 4 percent of its original space. Which means that even when the Web is a terabyte of text, a complete index will take up only about 41 gigabytes. You can buy that kind of disk space for less than $10,000 today.
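The arithmetic behind Brewer's 41-gigabyte figure is worth making explicit. Taking one terabyte as 1,024 gigabytes and applying the claimed 4 percent index-to-text ratio:

```python
# Back-of-the-envelope check of the storage claim above.
web_size_gb = 1024   # one terabyte of Web text, in gigabytes
index_ratio = 0.04   # Inktomi's claimed index size: ~4% of the original text
print(round(web_size_gb * index_ratio))  # -> 41
```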

Admittedly, Inktomi currently keeps track only of which words appear in which document – it doesn't know the order in which the words occur. That means Inktomi can't search for occurrences of "Clinton" within five words of "President," for example. However, it's an easy thing to add, said Brewer, and one he will add soon. Even with this word-proximity information, the index will still be only about 15 percent of the total size of the Web. (A consequence of storing word order is that indexes such as Alta Vista contain what is essentially a compressed version of the Web. This raises some tricky copyright questions.)

OK, OK, I was willing to acknowledge, storage might not be a problem. But it's not just the size of the Web that's growing, it's the number of users. What happens when everyone in the whole world is connected to the Web, and half of them are trying to use Inktomi at the same time? Perhaps computational power will be the bottleneck.

Definitely not a problem, Brewer insisted unflaggingly. Inktomi has been stress tested at more than 2.5 million queries a day with no difficulty – and that's with just four outdated workstations. Hook together 40 state-of-the-art computers and Inktomi should be able to handle 100 million queries a day – easy. Sure, the Web is growing exponentially, but microprocessors are on that same curve. Computational power is the least of our worries.

I left Berkeley convinced that indexing the Web, while likely to remain a challenge, won't be insurmountable. But after using Inktomi more, I started to wonder if an index really satisfied my desire for organizing knowledge. I could usually find what I was looking for, but I felt as if I was poking around in the dark. I remembered something Jerry Yang had told me at Yahoo!: "The difference between a catalog and an index is that a catalog provides context." That made sense now.

A catalog not only helps you find a Web site, it also tells you how it fits into the grand scheme of things. Yahoo!, for example, shows that the site for the United Patriotic Alliance belongs to Society and Culture:Alternative:Militia Movement. It also lists sites that offer an opposing viewpoint in that same section. And it includes a handy cross-reference to Society and Culture:Firearms. Doing a keyword search for the United Patriotic Alliance, on the other hand, doesn't provide any of that. It's like operating with blinders on: you can only see what's directly in front of you.

And, I found, there's another, more subtle drawback. Indexes not only don't provide context for the document, they don't provide context for the keywords. That's because the user can immediately jump to the page that contains a particular word. Using an online index, it's all too easy to find out what someone has said about, say, racism, and then quickly take that quote out of context. By allowing you to jump right to the good stuff, instead of forcing you to read all the way through the document, indexes promote scanning instead of reading.

Organizing knowledge with a keyword index is less like a universal library than like a giant, Burroughs-style cut-up poem. Pages become organized together for no reason other than random confluence of words. While indexes solve the problems of subjectivity and scale that plague classification schemes, they don't impose enough order. The more I tried to use Inktomi, the more I realized that operating just on words is too low-level. There needs to be something in between.

Architext

Finding that in-between has long been a goal of information retrieval research. Even in the 1960s, when online databases were puny by comparison, it was clear that simple keyword searching was inadequate. What was needed was some way to make sense of a document, to figure out what it was really about. But despite concerted efforts, nothing that really works much better has been found. That's why Architext Software's announcement last October of the Excite system (www.excite.com/), which indexes the Web by concept rather than by keyword, was greeted with as much skepticism as enthusiasm.

For one thing, Architext had come out of nowhere. Founded in 1993 by six Stanford students, none with any real background in information retrieval, the company picked up $3 million from Kleiner Perkins Caufield & Byers and began to promote Excite's "concept-based searching." But Architext didn't release any details about how the system actually worked, nor did it enter the annual TREC competitions, where search engines compete head-to-head. In short, it looked like just one more case study on how the word "Web" has the ability to cloud investors' minds.

That's why I was so surprised when I met Graham Spencer, Architext's 24-year-old vice president of technology. Instead of the glad-handing salesman I expected, he was a self-described punk. Punk like I remembered from high school, when it meant not only the music you listened to but a certain earnestness and idealism, evidenced by impassioned fanzines, distrust of anyone who made money, and spray-painted anarchy symbols. Tall, ectomorphic, with tightly cropped hair, Spencer looked out of place in the cubicle-filled office. But, he quietly insisted, he has stuck to the punk do-it-yourself ethic by founding a start-up and making sure it offers a useful service "without fucking anyone over."

The actual service, it turns out, was decided somewhat arbitrarily. The company's founders knew they wanted to start a business but weren't sure what kind. It was Spencer who suggested they build a search engine, because "information retrieval seemed like the easiest place to make progress."

Of course, it says something about Spencer that he assumed he could make progress in a field that has been stalled for some 20 years. But it's a prevalent attitude among computer scientists: Information retrieval is really only a problem for people in library science – if some computer scientists were to put their heads together, they'd probably have it solved before lunchtime.

The "problem" of information retrieval can actually be nailed down to two issues: synonymy and homonymy. The first is a problem because a search for documents containing the word "film" won't find documents containing synonyms such as "movie." Homonyms, words that are spelled the same but have different meanings, are a problem because the search will find documents containing "a film of oil."

All efforts at improving information retrieval involve trying to remove these problems. For example, some of today's best systems – such as Cornell's SMART engine – use a thesaurus to automatically expand a user's search and capture more documents. Some also eliminate homonyms by trying to figure out how a word is being used in a document. This is done by collecting statistics on which words commonly occur together. This way, if the search engine sees the word "film" near the word "director," it can guess that the word is being used to refer to a motion picture.
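The co-occurrence trick described above can be sketched simply. This is not SMART's actual implementation; the sense labels and clue-word lists below are invented for illustration:

```python
# Guess which sense of "film" a sentence uses by counting nearby words
# known to co-occur with each sense. Clue lists are invented examples.

sense_clues = {
    "motion picture": {"director", "actor", "screen", "studio"},
    "thin layer": {"oil", "surface", "coating", "residue"},
}

def guess_sense(sentence):
    """Pick the sense whose clue words overlap the sentence the most."""
    words = set(sentence.lower().split())
    return max(sense_clues, key=lambda sense: len(sense_clues[sense] & words))

print(guess_sense("the director shot the film on a closed studio lot"))
# -> 'motion picture'
print(guess_sense("a film of oil spread across the surface"))
# -> 'thin layer'
```

Real systems learn these co-occurrence statistics from the corpus rather than from hand-built lists, but the principle is the same: neighboring words disambiguate the homonym.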

When I quizzed Spencer on the actual technique Excite uses, he became noticeably more circumspect. On the one hand, he wants to brag about his system's algorithm so people don't think he's just full of hype. On the other hand, he doesn't want to give too much away. From what he did finally tell me, the system appears to use a fairly sophisticated approach. The idea is to take the inverted index of the Web, with its rows of documents and columns of keywords, and compress it so that documents with roughly similar profiles are clustered together. This way, two documents about movies will be clustered together – even if one uses the word "movie" and one uses "film" – because they will have many other words in common. The result is a matrix where the rows now represent concepts instead of actual documents. This cleanly attacks the problems of both synonymy and homonymy.
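The clustering intuition here can be sketched without Architext's (undisclosed) algorithm: represent each document as a word-count vector and compare vectors by cosine similarity, so two movie reviews land near each other even when one says "movie" and the other "film," because their remaining vocabulary overlaps. The documents below are invented:

```python
from math import sqrt

# A rough sketch of the clustering idea behind concept-based search
# (not Architext's actual algorithm). All texts are invented.
docs = {
    "review1": "the movie has a great director and a great cast",
    "review2": "the film has a fine director and a strong cast",
    "spill":   "a film of oil spread across the harbor water",
}

def vector(text):
    """Turn a text into a word-count vector (a dict of word -> count)."""
    v = {}
    for w in text.split():
        v[w] = v.get(w, 0) + 1
    return v

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = sqrt(sum(x * x for x in a.values())) * sqrt(sum(x * x for x in b.values()))
    return dot / norm

vecs = {name: vector(text) for name, text in docs.items()}
# The two reviews share "director", "cast", and more, so they score
# higher with each other than either does with the oil-spill page.
print(cosine(vecs["review1"], vecs["review2"]) > cosine(vecs["review1"], vecs["spill"]))  # -> True
```

Latent Semantic Indexing goes further, factoring the whole document-word matrix so that the shared directions become explicit "concept" axes, but the payoff is the same one the text describes: "movie" and "film" documents end up together.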

It turns out that the basic idea behind this approach was first developed in 1988 by a group of scientists at Bellcore, under the name Latent Semantic Indexing. The technique, although shown to be very effective, has been plagued by its heavy computational requirements. It's simply too slow for most practical applications. But, after all, that's what computer scientists like Spencer are good at. And what Architext has apparently done is find a way to perform LSI more efficiently. If so, it's a promising step toward improved information retrieval.

What makes Excite so exciting is that it comes up with a classification scheme through statistical analysis of the actual documents. It learns about subject categories from the bottom up, instead of imposing an order from the top down. It is a self-organizing system. This eliminates two of the biggest criticisms of library classification: that every scheme has a point of view, and that every scheme will be constantly struggling against obsolescence.

To come up with subject categories, Architext makes only one assumption: words that frequently occur together are somehow related. As the corpus changes – as new connections emerge between, say, O. J. Simpson and murder – the classification scheme automatically adjusts. The subject categories reflect the text itself – not the worldview of a few computer scientists in Mountain View, or of a 19th-century Puritan named Melvil Dewey.

But the proof, of course, is in how well it actually works. Although evaluating a search engine empirically is nearly impossible (since checking if it found every relevant Web page would require that someone also search the Web by hand), anecdotal evidence in support of Excite is fairly strong. I've tried doing identical searches on Inktomi, Lycos, and Excite, and found that Excite returned the most relevant documents. Which isn't to say that Excite is perfect: it still returned a fair number of superfluous documents that left me scratching my head, trying to figure out the possible connection. This isn't too surprising, since some words may frequently occur together even if they aren't really related – thereby throwing off Excite's statistical algorithms. Nonetheless, what bothered me most about Excite is not how it searches, but what it searches.

Excite doesn't just index the Web – it also indexes every message posted to about 10,000 Usenet newsgroups. That sounds harmless – after all, Usenet is a completely public message board that anyone can read. Yet searching Usenet with Excite, or similar services such as DejaNews (www.dejanews.com/) and Alta Vista, can feel surprisingly invasive. It's possible, for example, to search on a person's name and find every message they have posted – whether it's on comp.client-server or rec.arts.erotica. Using these tools, anyone can build a profile of a person's interests, based on where they post.

Spencer became animated when I asked him about the privacy issue, finally looking at me instead of the floor as he launched into a topic he had obviously thought about. "I think that indexing Usenet is OK because it is a completely public forum, but other things do make me uncomfortable." For example, Web indexes often end up indexing the archives of Internet mailing lists. "There is a process of joining a mailing list, so it does seem kind of private." Nonetheless, Architext has plans to take indexing even further. "One thing we want to do is index IRC (Internet Relay Chat)," said Spencer. "It will let you find people who are talking about things you are interested in right then." It will also let anyone play at being Big Brother.

Systems like Excite make it clear that as indexing becomes more prevalent, we're going to have to develop new notions of what it means for a document to be public and what a reasonable expectation of privacy entails. As Reva Basch (see "Super Searcher," Wired 3.05, page 152), a professional online searcher, says, "We can no longer depend on privacy through obscurity." With a full-text index, every word can be found, tracked, and correlated.

Privacy aside, Architext's Excite makes a significant step toward building a universal library. By using concepts instead of keywords, information is forced into an organized structure instead of being left as a jumble of words. But Excite still has a couple of technical shortcomings: its fairly simple statistical technique of automatic classification is prone to error, and it still doesn't provide the context a system like Yahoo! does.

Oracle

I found the pieces I was looking for at Oracle Corp.'s sprawling campus in Redwood Shores. I had been hearing rumors for a while about the product they were developing, but it seemed like no one in the close-knit Web indexing community really knew how it worked. What little I had heard was contradictory: Oracle's software was a hopeless pipe dream, a byzantine attempt at artificial intelligence that would never work – or it was the mother of all search engines.

To clear things up, I met with the man in charge of the project at Oracle, Kelly Wical. Walking up to the giant black-glass building, then entering the echoing lobby to page Wical from the glass and marble security desk, I felt as if I had suddenly entered the big leagues. Unsurprisingly, Wical turned out to be far older than the developers I had met with so far.

Genial and rotund, Wical isn't some computer scientist fresh out of college: he's been working for the last 20 years on a system to help computers understand English. His goal is a program that can not only analyze a sentence and figure out information such as what the important nouns are and how they are being modified, but actually understand the written word from the reader's point of view. His quest began while at a computer company in Houston, where he worked on a program to aid users searching for information on specific topics in gigantic online manuals. Then, in 1988 he founded Artificial Linguistics Inc. and continued to attack the problem of understanding written English, producing a sophisticated grammar checker as a spin-off of its core technology. In 1991, ALI was purchased by Oracle, and Wical was brought on board to continue development of his system, under the name ConText.

None of which may sound terribly relevant to building a universal library. Except that ConText's ability to understand English comes both from its knowledge of grammar and from its incredibly detailed hierarchy of concepts. ConText knows, for example, that Paris is a city in France, which is a country in Europe. This combination of knowledge is exactly what Excite lacks (and what causes its automatic classification algorithm to sometimes make glaring errors).

The problem is that creating such a comprehensive knowledge base seems impossible. OK, I said, even assuming that this linguistic engine can parse English sentences (something scientists have been struggling with for years), the process of creating a taxonomy of concepts, not just major subjects, would require an unprecedented amount of effort.

Wical smugly agreed. Already, more than 100 person-years have been spent building ConText's database of knowledge. To do so, Oracle has employed dozens of "lexicographers," a lofty title for what are often college interns who do the necessary legwork. "We've sent people to grocery stores, to scientific conferences, even sex shops," said Wical, with a touch of amazement at the resources Oracle can marshal. There, the lexicographers identify the subfields of metallurgy, for example, or the types of pornography, and then incorporate the results into ConText's ontology. This data is supplemented with automatic statistical techniques, similar to those used by Excite, that analyze huge collections of documents for unique concepts and relations between them.

The result of all this effort is a nine-level hierarchy – with each level offering increased specificity – that currently identifies a quarter-million different concepts in English. The scheme also includes approximately 10 million cross-references between related concepts, such as Paris and France, roadways and death. ConText uses this data when it automatically analyzes a document and then decides which of the concepts best describe the document's topic.
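A toy sketch of such a taxonomy, as a hierarchy plus cross-references between related concepts. The data structure here is an assumption for illustration, not Oracle's; the real ontology spans roughly 250,000 concepts over nine levels with some 10 million cross-references:

```python
# Invented ConText-style taxonomy fragment: each concept points to its
# parent, and cross-references link related concepts across branches.
PARENT = {
    "Paris": "France",
    "France": "Europe",
    "Europe": "Geography",
    "Supercomputing": "Computer Industry",
    "Computer Industry": "Hard Sciences",
    "Hard Sciences": "Science and Technology",
}
CROSS_REFS = {("Paris", "France"), ("roadways", "death")}

def path_to_root(concept):
    """Walk up the hierarchy from a concept to its top-level category."""
    path = [concept]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

print(path_to_root("Paris"))
# ['Paris', 'France', 'Europe', 'Geography']
print(":".join(reversed(path_to_root("Supercomputing"))))
# Science and Technology:Hard Sciences:Computer Industry:Supercomputing
```

Classifying a document then amounts to finding which concepts its words map to and reporting their paths, which is why ConText's answers come back as colon-separated chains of categories.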

That's the theory, anyway. I wanted to see how ConText works in practice, so Wical watched over my shoulder while I tried out the text-analysis engine on a few articles I had brought along. It wasn't exactly a rigorous evaluation, and we turned out to be using an old version of the software, but it gave me a feel for what the program can do. Some of ConText's features – like its summarize tool, which takes a document and tries to compress it down to just the important parts – turned out to be pretty unimpressive. But when it came to document classification, ConText was unerring.

An article I wrote about hive computing, for example, was correctly classified under Science and Technology:Hard Sciences:Computer Industry:Supercomputing, Science and Technology:Hard Sciences:Computer Industry:Workstations, and Business and Economics:Economics. An excerpt from Takedown, a book by Tsutomu Shimomura and John Markoff about notorious hacker Kevin Mitnick, was classified under Science and Technology:Hard Sciences:Computer Industry:Cyberculture:Hackers.

The only time ConText really failed at classification was when we tried it on a piece of fiction. Wical happened to have a chapter of Tolkien's The Hobbit on his hard drive, and it came back classified under Geography and Mythology. Neither of which seemed to me the real topic of the book. I half jokingly took Wical to task for this, and he shrugged. "What's in The Hobbit isn't what it's about." An obvious enough point, but it underscores an important truth about information retrieval. No matter how good the technology, it can only work when the meaning of a document is directly correlated to the words it contains. Fiction – or anything that relies on metaphor or allegory, that evokes instead of tells – can't be usefully classified or indexed. Its meaning comes from the reader. That's a significant limitation for any attempt to automatically organize the Web.

In his system's defense, Wical was quick to point out that automated tools for organizing fiction are no worse than the current, simplistic manual techniques. True. But that's missing the point. As automated indexing becomes available, we will begin to depend on it. It will encourage people to write plainly, without metaphors or double entendres that might confuse a search engine. After all, everyone wants people to be able to find what they have written.

Despite this concern, I drove away from Oracle's gleaming headquarters convinced that a useful and complete organization of the Web was possible. The Web no longer seemed too large, and computers no longer seemed too dumb. I imagined a system that combined the scalable hive computing of Inktomi, the self-organized classification of Excite, and the raw knowledge of ConText. But I was starting to question what the real point of indexing the Web was. I always had some vague notion of a universal library advancing science, informing voters, saving the world – who knows? The feeling of omniscience that came from searching gigantic databases like Lexis-Nexis seemed reason enough. But something Kelly Wical had said made me start wondering.

What's the purpose?

The issue came up when I asked Wical what could have possibly kept him interested and motivated to work on the same project – the quest to understand English – for the last 20 years. At first, he just made some vague noises about how it was an "interesting problem with a lot of practical application." But, wanting to hear something that jibed with my own reasons, I kept probing. Finally, he leaned back and said, "My personal reason? Well, I want to talk to hobbits."

Wical slowly began to talk about his fascination with The Lord of the Rings and his dream to bring Tolkien's books to life by writing a computer program that understands everything in the fantastical trilogy – that knows Gandalf is a wizard, that knows mithril is the most precious metal in Middle-earth, that knows the Elven family tree. Once the books have been made digital, Wical said, they could be interactive. The plot could be altered, magic powers could be adjusted, new characters could be added. Wical could enter the story.

I found this reply so odd and unexpected that it made me wonder if my motives for wanting to organize knowledge might appear equally strange. I decided to watch how people use existing search engines to understand their popularity. But after a boring half hour at UC Berkeley watching the queries as they came into Inktomi, I still didn't have an answer.

True, just looking at the most common search terms pointed to an obvious driving force: sex. The top 10 search terms sent to Inktomi were "sex," "nude," "pictures," "adult," "women," "software," "erotic," "erotica," "gay," and "naked." But percentagewise, these terms made up less than a quarter of all queries. Other search terms ranged all over the map, from people's names to "wood-burning stove" to "nine-inch nails."

So I went to see Brewster Kahle. The founder of Wide Area Information Servers Inc., Kahle is one of the handful of people who have managed to actually get rich from information retrieval. With his huge, bushy hair and exaggerated hyperkineticism, he looked like a clown after too much coffee. But no one knows more about the intersection of the Internet and knowledge organization.

"Information retrieval is not about finding how much tannin there is in an apple," he declared in his San Francisco office. "It's about letting everyone publish." With that, he was off on a long rant about how organizing the Web matters, because, as Architext's Spencer had told me, "it's about people finding people, not people finding information." Indexing the Web allows the 40 people interested in Bulgarian folksinging to find each other, it allows fans of long-forgotten TV shows to get together and reminisce. It creates communities.

Even with Kahle's dramatic gesticulations, the argument didn't seem very convincing. The desire to form clubs seems to stop with the Kiwanis set. Then Kahle started speaking at a fever pitch, with one foot on the table and arms oscillating wildly:

"I grew up watching just a whole lot of TV, signals coming right at me. Then, at school, teachers would just tell me stuff, and I'd just try to remember it. But, when I finally hit graduate school, the teachers would say, 'Here's what's known, here's what isn't. If you make any progress … here's my home number!' Finally, I had a chance to contribute!"

Now this was something I could relate to! Now I understood why indexes mattered. It was like Jerry Yang of Yahoo! had said: "If the Web was a broadcast medium, then we could just do something like TV Guide." But once anyone can publish – once anyone can contribute – some new kind of organization is needed.

Knowledge organization is important not because of how much knowledge there is now, but because of how many people are becoming involved in its production. Web indexes now play the same role that atlases did in the 16th century. Both hold an appeal that goes far beyond any possible usefulness. Both lead to dreams of exploring new territories, of discovering new opportunities. Both are evocative because of what they leave blank.
