A Web Spider for Everyone

As the quantity of information on the Internet continues to grow, so does the question of how to process it all and make it useful. A startup called 80legs, based in Houston, TX, is hoping that an inexpensive, distributed Web crawling service could help startups mine the Web for information without having to build the giant server farms used by major search engines. The company launched this week at DEMO, a conference in San Diego that showcases new companies.

Web crawlers, or spiders, are programs that automatically visit pages on the Internet; they can be used to index those pages and to gather bits of information from them. Search engines, for example, use crawlers to keep track of where information is located on the Web. But the scale of the Web means that comprehensive crawling consumes a lot of processing power, which typically means building huge data centers to run the software.
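
To make the idea concrete, here is a minimal sketch of what a crawler does: fetch a page, pull out its links, and queue any it has not seen yet. It is an illustration only, not 80legs's software. The names `LinkExtractor` and `crawl` are made up for this example, and `fetch` is a function the caller is assumed to supply, so the sketch runs without touching the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl starting at start_url.

    `fetch` is a hypothetical caller-supplied function that takes a URL
    and returns the page's HTML as a string, or None if it is unavailable.
    Returns a dict mapping each visited URL to the links found on it.
    """
    seen = {start_url}
    queue = deque([start_url])
    index = {}
    while queue and len(index) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # page could not be retrieved; skip it
        parser = LinkExtractor()
        parser.feed(html)
        # Resolve relative links against the page they appeared on.
        links = [urljoin(url, href) for href in parser.links]
        index[url] = links
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index
```

A real crawler would also honor robots.txt, limit its request rate, and spread the work across many machines, which is where the processing cost comes from.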

80legs hopes to make this technology more accessible to small companies and individuals by leasing access to its crawlers and letting customers pay only for what they crawl.

Web crawling technology is also crucial for semantic sites and services designed to process natural-language queries. While 80legs expects to see users interested in search and semantic applications, CEO Shion Deysarkar says that those testing the service also included customers with less technical interests. Some market researchers, for example, use 80legs to uncover mentions of specific companies or topics across the Web.

A user can start a Web crawl through 80legs’s Web-based interface. The form on the company’s site lets the user set parameters for the project and upload the custom code that controls how the crawler does its job. For example, a user might want the crawler to find images and check them against a database of copyrighted ones. Deysarkar says his company’s crawlers are capable of processing up to two billion pages a day. The company charges $2 for every million pages crawled, plus a fee of three cents per hour of processing used.
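
The quoted prices make it easy to estimate a bill. The sketch below simply applies the two rates from the article. The function name `estimate_crawl_cost` is invented for this example, and it assumes that partial millions of pages and partial hours are billed proportionally, which the company has not confirmed.

```python
PRICE_PER_MILLION_PAGES = 2.00   # dollars, as quoted in the article
PRICE_PER_PROCESSING_HOUR = 0.03  # dollars, as quoted in the article


def estimate_crawl_cost(pages_crawled, processing_hours):
    """Estimate the cost in dollars of a crawl at the quoted rates.

    Assumes proportional billing for fractions of a million pages and
    fractions of an hour (an assumption, not a documented policy).
    """
    if pages_crawled < 0 or processing_hours < 0:
        raise ValueError("pages_crawled and processing_hours must be non-negative")
    page_cost = PRICE_PER_MILLION_PAGES * pages_crawled / 1_000_000
    processing_cost = PRICE_PER_PROCESSING_HOUR * processing_hours
    return page_cost + processing_cost
```

Under these assumptions, a crawl of 10 million pages that uses 50 hours of processing would come to $20 for the pages plus $1.50 for the processing, or $21.50 in total.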

Many startups struggle to find the funding needed to build large data centers, and 80legs avoided that route when it constructed its Web crawling infrastructure. The company instead runs its software on a distributed network of personal computers, much like those used for projects such as SETI@home. The distributed computing network is put together by Plura Processing, which rents it to 80legs. Plura gets computer users to supply unused processing power in exchange for access to games, donations to charities, and other rewards.