Search pioneers join Yahoo! - but is the web beyond search?

Few visitors to IBM's Almaden research lab in 1999 and 2000 can fail to have been impressed by its lead in web search. IBM's Clever project both predated and informed what became Google: Brin and Page cited the Almaden work in their 1998 paper The Anatomy of a Large-Scale Hypertextual Web Search Engine. Google drew on the same concept of using the link structure to infer quality and authority, which it was to trademark and market as PageRank™.
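
For readers who want that idea in concrete form, here is a minimal sketch of link-based ranking in the general style of the published PageRank description. It is not Google's or IBM's code: the function name `rank_by_links`, the damping value of 0.85, the fixed iteration count and the three-page toy graph are all assumptions made for illustration.

```python
def rank_by_links(links, damping=0.85, iterations=50):
    """Score pages by who links to them.

    `links` maps each page name to the list of pages it links to.
    Returns a dict of page -> score; the scores sum to 1.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start with every page equally important.
    score = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Every page keeps a small baseline share of the total.
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # A page passes its score on, split across its outbound links.
                share = damping * score[p] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # A page with no outbound links spreads its score evenly.
                share = damping * score[p] / n
                for t in pages:
                    new[t] += share
        score = new
    return score


# Hypothetical toy web: "a" and "b" both link to "c", and "c" links back to "a".
toy_web = {"a": ["c"], "b": ["c"], "c": ["a"]}
scores = rank_by_links(toy_web)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

In the toy graph, "c" comes out on top because two pages point to it, and "a" beats "b" because its single inbound link comes from the best-regarded page. That same property is what makes the approach attractive to anyone able to manufacture links.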

But the Clever team was already thinking way beyond PageRank™. Your reporter was one such visitor more than five years ago and was struck by the scope and depth of the work. For example, in 1998 the Clever team was publishing its research into hierarchical topic taxonomies, and inferring web communities. Today, such subjects are presented to conferences of former HTML coders (today's wiki-fiddlers) who appear to be hearing the topics for the first time, such is their wide-eyed wonderment.

Working within IBM also allowed the team to draw on its rich history of database research and linguistic analysis, and at IBM you try not to lose your customers' data.

Google's fate is well known. After last year's IPO it became one of the wealthiest technology companies on the planet, and its founders are billionaires.

And Clever?

Well, IBM appeared to have some inkling that the project was valuable to it. A spin-off was discussed, but never followed through, and IBM officially welcomed licensees at one stage. But Clever was never allowed the opportunity to compete directly with the commercial search rivals, so we never really saw its potential.

Clever's trajectory in some ways mirrors that of IBM's relational database work. With its System R project, IBM had built the first implementation of the Relational Database in the early 70s, but bureaucratic infighting hampered the researchers' desire to turn it into a real product for IBM's customers. Ingres was first to get an RDBMS out of the door and Oracle's single-minded marketing won it big inroads into the new market in the 1980s.

"We were convinced IBM would never ship," Jim Gray later recalled (in one of the best oral histories of a computer project on the net).

Now, however, Yahoo! has hired several of the Clever team and plans to recruit more.

Last week the New York Times reported that Prabhakar Raghavan, one-time project leader, had been recruited from Verity, where he was chief scientist and CTO. Another staffer, Andrew Tomkins, is also on his way to Yahoo!, the Times reported.

These guys have their work cut out.

Web chaff beyond sorting?

"The World Wide Web of today is dramatically different from that of just five years ago," the team noted in 1999. "Predicting what it will be like in another five years (2004) seems futile. Will even the basic act of indexing the Web soon become infeasible?"

Google's link-based algorithms were soon imitated by rivals, and as a consequence all of today's search engines must now mine a web stuffed with synthetic documents of little relevance to anyone, many of which are generated by machines on behalf of the customers of the more unscrupulous SEOs (Search Engine Optimizers).

It's an algorithm arms race, and the SEOs themselves know the scale of the problem they nurtured. Some estimate that as much as a third of the web is fake, machine-generated pages, and Google can't really tell which third it is. Meanwhile, neither Yahoo!, Google nor MSN can yet match the most basic feature AltaVista offered in 1996: queries sorted by date. Want a listing of Tony Blair's comments about Iraq published between June and August 2003? Forget it. AltaVista could do this then, and still can, but none of the big three can handle this most basic of requests.

Because rigging the search engines is so profitable, the junk web, or "Web 2.0" as it's called, proliferates and mutates like a superbug. Each new solution to the problem is quickly co-opted by spammers and gamers. For example, last year's "tagging" craze is becoming this year's mortgage and Viagra scam.

Some maintain the web's problems can't be solved technically, but only politically or economically, for example by the application of compensation models which allow the really good data hoarded by database holders to be opened to the public at last. That may prove to be true: there are many flavors of private and public networks, we use a mixture every day, and that mixture will change over time.

The reassembled Clever team at Yahoo! may not even be offered a chance to answer the question.

The Times reports that the team itself is being directed to searching digital media, and hints that some areas of their earlier work remain IBM's intellectual property.

By some irony, we note that one of Sergey Brin's student projects was also searching digital media, only as a kind of RIAA enforcer. The system he developed was for the "automated detection of copyright violations", and was unfortunately called COPS (the COpyright Protection System). Fortunately, Sergey was more interested in developing a general purpose data mining application.

Would he make the same choice today?

Surely something must be done to renew the original raison d'etre behind both Google and Yahoo! - finding good stuff. The world in which an "I'm Feeling Lucky" button was even conceivable seems to belong to a distant past.

Google would rather sell you a shirt on Froogle, and Yahoo! would rather show you the way to the Coliseum, offering you a package tour that includes the ticket admission. And the former search leader's priorities seem to be elsewhere. In recent months Google has patented a widely used business method and beefed up its DC lobbying muscle, and last week's legal dispute over the hiring of a "search expert" by Google from Microsoft sounded thoroughly phoney and synthetic on both sides.

The Clever team that Yahoo! is reassembling are the genuine article. Perhaps if the management permits them, they'll be able to answer the question -