An Open Source Search Engine

Nutch could rewrite the rules of search development — especially with an impressive roster of Internet luminaries now lining up behind it.

Ask anyone in Silicon Valley what the hottest application on the Internet is today and you can bet their answer will be search. The dealmaking has been nothing short of torrid. Only a year ago there were at least half a dozen major players. Now there are just three: Yahoo, which last month bought search giant Overture in a $1.6 billion deal; Google, the undisputed king of search; and Microsoft, which is busy building a search platform of its own. They’re all fighting to dominate the huge and ballooning market, already worth $2 billion and expected to generate between $6 billion and $8 billion in revenues by 2007.

Search is a game of intellectual property, innovation, and market position. The three combatants all keep jealous watch over their patents (Yahoo, for one, has more than 60), engineering talent (hundreds of Ph.D. holders work at Google), and market advantages (Microsoft — need we say more?). Indeed, search is such a complicated and expensive undertaking that analysts have pegged the cost of market entry at well over $100 million.

All that could change this fall, when a new player strides onto the field.

Meet Nutch, the open-source search engine. Open-source applications are unusual in that the code upon which the software runs is not owned by a private, commercial company but rather bound by a simple license that allows anyone to use, modify, and even profit from it free of charge, as long as they pledge to contribute their own innovations back into the code base. Because of this, anyone will be able to access Nutch’s code and use it to their own ends, without paying licensing fees or hewing to a particular company’s set of rules.

Perhaps more important, Google takes a “trust us” approach to search; the company says it doesn’t skew its PageRank formula to favor certain sites, but we have no way of knowing for sure. With Nutch, the indexing and page-ranking technologies are all open and visible; you can check them yourself if you have a problem with your page’s ranking. Just as Linux has taken on Windows, Nutch could pose a threat to Google and other search giants, revolutionizing the rules of search-engine development and distribution. Interestingly, early Nutch development was supported in part by Overture’s R&D division, and an Overture official sits on the Nutch board.

“Search is interesting again,” says Doug Cutting, a founder and core project manager at Nutch. Cutting, whose development chops were honed at Xerox (XRX) PARC, Excite and Apple (AAPL), is building Nutch (that’s his toddler’s all-purpose word for “meal”) with a small team of engineers based around the country. But Cutting says they hope that once Nutch is loosed on the world, tinkerers from Romania to China to Palo Alto will help build it into a robust platform, in the spirit of Linux or Apache (which has garnered more than 60 percent of the Web-server software market in just the last couple of years).

“Search is the first thing people use on the Web now, and there are fewer and fewer alternatives,” Cutting says. With Nutch, “researchers, university folks, and anyone else can have a test bed to make search better. There are a lot of smart people out there that Google can’t hire.”

Mitch Kapor, who helped found Lotus Development and the Electronic Frontier Foundation and is founder and president of the Open Source Applications Foundation, certainly agrees. He’s thrown his weight behind the project by joining Nutch’s nonprofit board, as has Tim O’Reilly, the CEO of O’Reilly & Associates. Brewster Kahle, the visionary behind the Internet Archive, has also lent his support. Nutch is moving its servers to Kahle’s high-bandwidth location this weekend, a crucial step toward readying the engine for its public debut.

“I love Google,” Kapor says, “but this will push search to places that are not immediately obvious. In terms of research and innovation, there is a clear need for an open platform for search.” Kapor and others imagine new kinds of applications springing from Nutch, ideas that commercially driven companies like Yahoo or Microsoft would never fund. “Search is close to a duopoly,” Kapor points out. “Historically we know there are risks when that happens. It’s too important an application to not be transparent.”

Cutting won’t commit to a specific launch date for the engine, but he says he expects it to go live at Nutch.org sometime early this fall. Because of the move to Kahle’s facility and insufficient hardware (Cutting is looking for additional sponsors), Nutch’s demo, based on an initial crawl of more than 100 million webpages, is not yet open to the public. But Cutting, who together with his development partners has built an impressive resume in the search field, is confident his latest creation will be a contender once it launches. “It’s fun to go toe-to-toe with market leaders,” he says. “It’s always a challenge to build a better mousetrap.”

John Battelle is a visiting professor at the UC Berkeley Graduate School of Journalism, where he directs the business reporting program. He was the founder of the Industry Standard and a co-founding editor of Wired. A version of this article originally appeared in Business 2.0; reprinted with permission.
