Should Web Giants Let Startups Use the Information They Have About You?

Just after 10 am on June 7, 2007, Ryan Sit glanced at his Gmail inbox and saw the message he had been waiting nine months to receive. Sit, a 29-year-old software developer from San Diego, is the founder of Listpic, a site that used bots — automatic software-based agents — to pull images from craigslist for-sale listings and reorganize them into an easier-to-navigate, more attractive format. Instead of tediously clicking individual links to view photos, Listpic users could see them all collected onto a single page. The service was an instant success, and by early June it was pulling in more than 43,000 visitors a day and thousands of dollars a month in Google AdSense revenue.

Sit had long dared to hope that Listpic's success might prompt craigslist to commend him, initiate a partnership, or even buy Listpic and bring him aboard. So when he saw the message from craigslist CEO Jim Buckmaster in his inbox, he thought that his dreams were about to be realized.

Scrape at Your Peril
Many Web sites build their businesses by taking data from other online firms. It's a powerful — but risky — strategy. The pros and cons of scraping:

Pro

Gain access to data from big companies like Amazon and Google.

Discover how easy it is to turn a big idea into an instant Web business.

Help build a more robust and useful Web by promoting openness.

Con

Lose access if big companies decide to change their policies.

Discover how hard it is to get investors to gamble on a fragile biz model.

Help build a Web so open that privacy is compromised.

Then he read the subject line: "Cease and desist."

Instead of praising Sit, Buckmaster's email charged him with violating craigslist's terms of use, claiming that Listpic crossed the line between homage and copyright infringement. The missive demanded he stop displaying craigslist content. It closed with a terse "Please let us know of your plans for complying."

Sit didn't have much of a chance to respond. Two hours after receiving the message, he went to Listpic and found that none of the images on his homepage were loading. When he clicked on one of the links that was supposed to lead to a specific listing, he was redirected to craigslist's main page. Sit's bots had been crippled. "They didn't even talk to me about trying to work something out," he says. "They just banned me."

Distraught and perhaps a tad vengeful, Sit posted a message on his homepage asking Listpic fans to send protest emails to Buckmaster and craigslist founder Craig Newmark. But craigslist refused to budge. Buckmaster is unapologetic. He points to a couple of factors in craigslist's decision: Listpic's constant stream of data requests had slowed craigslist's page-loading times to a crawl, and, more egregious, Listpic had run Google text ads alongside the content, an affront to craigslist's pristine anti-advertising stance. "It sounds old-fashioned," Buckmaster says, "but we don't view postings by craigslist users as data to be exploited by third parties." Within weeks, Listpic had fallen from its perch as one of the top 15,000 sites on the Web — the height of its popularity — to somewhere below 100,000th place, where it languishes still. Today, Listpic pulls data from a different listings site, called Oodle, which was itself banned from accessing craigslist data.

"My goal was to help craigslist by making the user experience better," a despondent Sit says. "This just sucks."

The Internet these days is supposed to be all about sharing. Thanks to a common commitment to open access and cooperation, the data mashups that have defined the Web 2.0 phenomenon have exploded. Zillow pulls map information from several partners, including Navteq, GlobeXplorer, and Proxix, and combines it with real estate data from public records to estimate what a house is worth. Photosynth, a service that Microsoft is developing, merges pictures from Flickr and other sources into eye-popping 3-D models. A popular startup called Mint lets customers pull financial information from their bank accounts and reorganize it into an interface that puts Quicken to shame. And the tools to tap and manipulate all this data can be found at sites like Dapper and Kapow.

Giants like Yahoo and Google have thus far taken a mostly nonproprietary stance toward their data, typically letting outside developers access it in an attempt to curry favor with them and foster increased inbound Web traffic. Most of the largest Web companies position themselves as benign, bountiful data gardens, supplying the environment and raw materials to build inspired new products. After all, Google itself, that harbinger of the Web 2.0 era, thrives on info that could be said to "belong" to others — the links, keywords, and metadata that reside on other Web sites and that Google harvests and repositions into search results.

But beneath all the kumbayas, there's an awkward dance going on, an unregulated give-and-take of information for which the rules are still being worked out. And in many cases, some of the big guys that have been the source of that data are finding they can't — or simply don't want to — allow everyone to access their information, Web 2.0 dogma be damned. The result: a generation of businesses that depend upon the continued good graces of a relatively small group of Internet powerhouses that philosophically agree information should be free — until suddenly it isn't.

Scraping is such an unkind word. It refers to the act of automatically harvesting information from another site and using the results for sometimes nefarious activities. (Some scrapers, for instance, collect email addresses from public Web sites and sell them to spammers.) And so most Web 2.0 companies eschew the term, preferring words like importing to describe their own data-harvesting expeditions. But whatever you call it, it's a pretty simple process. Scrapers write software robots using scripting languages like Perl, PHP, or Java. They direct the bots to go out (either from a Web server or a computer of their own) to the target site and, if necessary, log in. Then the bots copy and bring back the requested payload, be it images, lists of contact information, or a price catalog.
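
In practice, a basic scraping bot can be only a few lines long. The sketch below, written in Python rather than the Perl, PHP, or Java mentioned above, shows the idea: fetch a listings page and pull out every image URL it references. The target address and HTML pattern here are hypothetical stand-ins, not taken from any real site.

```python
# A minimal, hypothetical scraping bot: download a listings page and collect
# every image URL it references. The target address and HTML pattern are
# invented for illustration, not taken from any real site.
import re
import urllib.request

LISTINGS_URL = "https://example-listings.test/for-sale"  # hypothetical target

def fetch_page(url: str) -> str:
    """Download the raw HTML for one page, identifying the bot in the User-Agent."""
    request = urllib.request.Request(url, headers={"User-Agent": "example-bot/0.1"})
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")

def extract_image_urls(html: str) -> list[str]:
    """Pull the src attribute out of every <img> tag on the page."""
    return re.findall(r'<img[^>]+src="([^"]+)"', html)

if __name__ == "__main__":
    page = fetch_page(LISTINGS_URL)
    for image_url in extract_image_urls(page):
        print(image_url)
```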

Technically, such activity violates most Web companies' terms of use. Gmail forbids its members from using "any robot, spider, other automated device, or manual process to monitor or copy any content from the Service." Microsoft echoes that in the terms of use for Windows Live, prohibiting "any automated process or service to access and/or use the service (such as a BOT, a spider, periodic caching of information stored by Microsoft, or 'meta-searching')." The Facebook agreement directs developers not to "use automated scripts to collect information from or otherwise interact with the Service or the Site."

"But despite the fine print, many companies welcome scrapers. Bank of America, Fidelity Investments, and scores of other financial institutions let their customers use bots from Yodlee to gather their account histories and reassemble them on Web servers outside of their corporate firewalls. And eBay permits Google's shopping service, Google Product Search, to scrape sales listings and display them on its own site. Sure, by allowing scraping, these companies are inviting a deluge of potentially cumbersome data requests. But they're also getting more visibility and happier customers who find the scrapee's information ever-more useful. That, it seems, is a worthwhile trade.

The mostly benign attitude toward scrapers also stems from an inconvenient truth: They can be tricky to stop. One way is to require all users to retype a series of distorted characters, those graphic forms called captchas, which bots are unable to read. But too many of these annoy — even alienate — customers. Another method, devised by Facebook to prevent wholesale copying of users' emails, is to display addresses as image files rather than text. With a little more effort, a site can task a counterbot with identifying browser sessions that have suspiciously high rates of data requests — most bots work at a pace that's far too quick to be human — and shutting off their access. But overuse of these measures can cost the data source, degrading the site's usability or plunging it into bot warfare. If an outside scraper improves user experience and maybe even brings in a few new visitors, companies usually let the bots come and go unopposed.
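
That request-rate check is conceptually simple. What follows is a rough sketch of how a site might implement it, not any particular company's actual defense: remember each session's recent request timestamps and cut off any session making requests faster than a human plausibly could. The window size and threshold are arbitrary placeholders.

```python
# Rough sketch of rate-based bot detection: remember each session's recent
# request timestamps and block any session making requests faster than a
# human plausibly could. The window and threshold are arbitrary placeholders.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 10            # how far back to look
MAX_REQUESTS_PER_WINDOW = 30   # more than this inside the window looks automated

recent_requests: dict[str, deque] = defaultdict(deque)
blocked_sessions: set[str] = set()

def allow_request(session_id: str) -> bool:
    """Return True if the request should be served, False once the session is cut off."""
    if session_id in blocked_sessions:
        return False
    now = time.time()
    window = recent_requests[session_id]
    window.append(now)
    # Discard timestamps that have aged out of the window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    if len(window) > MAX_REQUESTS_PER_WINDOW:
        blocked_sessions.add(session_id)
        return False
    return True
```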

Sometimes, though, a Web 2.0 upstart can improve the user experience too much for its own good. In February 2006, Ron Hornbaker created Alexaholic, a site that scraped data from Alexa, Amazon.com's Web-traffic service, and presented it in what Hornbaker thought was a friendlier interface. Users agreed with him: Alexaholic's traffic quickly shot up to 500,000 unique visitors a month. Then, in March 2007, Amazon began blocking browser and server requests from Alexaholic. (According to Amazon's public statements, it blocked Alexaholic only after it had "explored an acquisition" and was rebuffed.) Hornbaker rerouted his traffic through other servers, circumventing the blockade. Then Amazon sent him a cease-and-desist letter, demanding he stop scraping Alexa's data and profiting from its brand. Hornbaker changed his site's name to Statsaholic but continued to scrape and remix Alexa stats. Finally, Amazon — seemingly tired of the cat-and-mouse game — served Hornbaker with a lawsuit charging that he was violating its trademarks. Hornbaker had little choice but to give up. Today, Statsaholic draws upon traffic statistics from a variety of other sources, like Quantcast and Compete. (Hornbaker and Amazon would not discuss the fracas, citing terms of their settlement. Ironically, Statsaholic is three times more popular than Hornbaker's Alexaholic ever was.)

Such vulnerability to sudden data blackouts illustrates why some potential investors get nervous about funding scraping-dependent businesses. "Anybody who is a supplier to you has power over you," says Allen Morgan, a venture capitalist at the Mayfield Fund who has invested in a raft of Web 2.0 companies, including Tagged, a teen social network, and Slide, one of the most successful makers of Facebook applications. Morgan says that as those data providers help power more applications, they take on the role of operating systems — with a vested interest in consolidating their power. "Inevitably, they will feel compelled to compete with application developers in order to grow their business — and it's an unfair fight."

Investors aren't the only ones wary of the unspoken agreements and one-sided relationships that characterize the scraping industry. Some large Web companies don't relish the unregulated dispersal of their data and would love to find a way to monitor and control the information they dole out. That's why many of them have begun encouraging developers to access their data through sets of application programming interfaces, or APIs. If scraping is similar to raiding someone's kitchen, using an API is like ordering food at a restaurant. Rather than create their own bots, developers use a piece of code provided by the data source. Then, all information requests are funneled through the API, which can tell who is tapping the data and can set parameters on how much of it can be accessed. The advantage for an outside developer is that with a formal relationship, a data source is less likely to suddenly turn off the taps.

The downside, from the remixers' point of view, is that it gives data sources greater control over what information the remixers can access and how much of it they can harvest. With most APIs, a developer gets a unique key that lets the data supplier know when the developer is using the API. But it also lets the source block the key's owner for any reason.
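
In rough terms, the provider's side of that arrangement can be sketched like this (a simplified illustration, not any real company's API): every request must carry a developer key, usage is counted against a per-key quota, and a key can be revoked outright, which is exactly the lever that makes developers nervous. The key, owner, and limit below are invented.

```python
# Simplified sketch of the provider's side of an API-key arrangement: every
# request carries a key, usage is counted against a per-key daily quota, and
# a key can be revoked. The key, owner, and limit are invented.
DAILY_QUOTA = 50_000

api_keys = {
    "dev-key-example": {"owner": "example-developer", "revoked": False},
}
usage_today: dict[str, int] = {}

def handle_api_request(key: str) -> str:
    record = api_keys.get(key)
    if record is None or record["revoked"]:
        return "403 Forbidden: unknown or revoked key"
    used = usage_today.get(key, 0)
    if used >= DAILY_QUOTA:
        return "429 Too Many Requests: daily quota exceeded"
    usage_today[key] = used + 1
    return "200 OK: here is your data"

# A developer who exceeds the quota, or whose key is revoked, is cut off
# until the provider decides otherwise.
```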

In February, Jeremy Stoppelman, the 30-year-old cofounder of the community-directory site Yelp, received a late-night phone call from one of his engineers informing him that the maps on the site, built with the Google Maps API, were no longer working. It turned out that Yelp was generating more data requests than the API agreement allowed.

"It was scary," Stoppelman says of the subsequent negotiation with Google. A few months earlier, Yelp had raised a $10 million round of funding. Paying for map data hadn't been part of the business plan, and going into the meeting with Google, he says, "I didn't know if we'd get priced out." Eventually, Stoppelman cut a deal with Google to allow continued access to Google Maps for an undisclosed sum.

The promise — and the threat — of scraping is nowhere more evident than in the booming proto-industry of social networking. Social networks have thrived on scraping: Facebook, MySpace, and LinkedIn all encourage users to tap into their webmail address books as a way of inviting and connecting with their friends and coworkers. After prompting users to submit their login information, the sites unleash bots that scrape the webmail companies' servers, pulling out friends' addresses, checking them against the network's roster, and letting users invite contacts who aren't already signed up. The tactic has fueled an explosion in each site's membership; Facebook's stands at 54 million and is growing by more than a million new users every week.
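
Mechanically, that final step is little more than a set comparison: once the bot has the scraped addresses, the network splits them into existing members to connect with and outsiders to invite. A hypothetical sketch, with invented data:

```python
# Hypothetical sketch of the contact-import step: once a bot has scraped a
# user's webmail address book, split the addresses into existing members
# (to connect with) and non-members (to invite). All data here is made up.
def sort_contacts(scraped_addresses: list[str],
                  member_roster: set[str]) -> tuple[list[str], list[str]]:
    already_members = [a for a in scraped_addresses if a in member_roster]
    to_invite = [a for a in scraped_addresses if a not in member_roster]
    return already_members, to_invite

roster = {"alice@example.com", "bob@example.com"}
address_book = ["alice@example.com", "carol@example.com"]
members, invites = sort_contacts(address_book, roster)
print("connect with:", members)   # ['alice@example.com']
print("invite:", invites)         # ['carol@example.com']
```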

But recently, as the competition between social networks heats up, scraping has emerged as a high-stakes strategy. Microsoft announced a $240 million investment in Facebook last fall, and within weeks, LinkedIn users found themselves suddenly unable to import their webmail contacts from Microsoft's webmail services. Angus Logan, a Microsoft executive, says the restrictions are a matter of security and that the company is developing user-data APIs. "We do not advocate the practice of contacts scraping," he says, "as we believe it poses unnecessary risks to consumers, whether it be for nefarious practices like phishing scams or more straightforward social networking activities." But that philosophy is applied inconsistently. As of late November, Facebook members were still able to import their Microsoft webmail accounts through scraping.

In the end, says Reid Hoffman, the founding CEO of LinkedIn, it's the users who lose out when Web companies decide to crack down on popular scrapers. After all, LinkedIn becomes much less useful if its members can't quickly invite all of their friends; Yelp loses much of its appeal if it can't display Google's maps. "The question you hear," Hoffman says, "is 'You're doing all this scraping, and you're increasing the load on our servers. What are we getting out of it?'" Hoffman's answer: happy, connected users.

And in the process, the world is getting a better Internet, one where bright ideas become great services almost instantly and where information is easy to discover and use. Fundamentally, Hoffman adds, it's not the place of companies like Yahoo, Microsoft, Facebook, or LinkedIn to decide who gets access to their users' data. It should be up to the users themselves. "It's simple," he says. "The individual owns the data." Even if it sits in some company's server farm.