Screen-Scraping and Bots

The focus of this book is on creating mashups using public APIs and web services. If you want to mash up a web site, one of the first things to look for is a public API. A public API is specifically designed as an official channel for giving you programmatic access to data and services of the web site. In some cases, however, you may want to create mashups of services and data for which there is no public API. Even if there is a public API, it is extremely useful to look beyond just the API. An API is often incomplete. That is, there is functionality in the user interface that is not included in the API. Without a public API for a web site, you need to resort to other techniques to reuse the data and functionality of the application.

One such technique is screen-scraping, which involves extracting data from a user interface designed for display to human users. Let me define bots and spiders, which often use screen-scraping techniques. Bots (also known as Internet bots, web robots, and webbots) are computer programs that “run automated tasks over the Internet,” typically tasks that are “both simple and structurally repetitive.”[44] Examples of bots include the following:

“Chatterbots” that automatically reply to human users through instant messaging or IRC[45]

Wikipedia bots that automate the monitoring, maintaining, and editing of Wikipedia[46]

Ticket-purchasing bots that buy tickets on behalf of ticket scalpers

Bots that generate spam or launch distributed denial of service attacks

Web spiders (also known as web crawlers and web harvesters) are a special type of Internet bot. They typically focus on gathering large collections of web pages, up to billions of pages, rather than on focused extraction of data from a given page. Spiders from search engines such as Google and Yahoo! visit your web pages and collect them to build their large indexes of the Web.
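To make the idea of a spider concrete, here is a minimal sketch in Python of the basic loop: fetch a page, collect its links, and queue the ones not yet seen. The names LinkCollector, crawl, and FAKE_SITE are my own illustrations, and the three-page "site" is a hypothetical in-memory dictionary, so nothing here touches the network. A real spider would also need rate limiting, error handling, and respect for the site's terms.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl. fetch(url) returns HTML text, or None if unavailable."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # page could not be retrieved; skip it
        visited.append(url)
        parser = LinkCollector()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# Hypothetical three-page site held in memory (an assumption for illustration).
FAKE_SITE = {
    "http://example.com/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.com/a.html": '<a href="/b.html">B</a> <a href="/">Home</a>',
    "http://example.com/b.html": '<a href="/missing.html">Missing</a>',
}

pages = crawl("http://example.com/", FAKE_SITE.get)
print(pages)
```

The fetch function is passed in as a parameter so that the same loop could be pointed at a real HTTP client later. The seen set is what keeps the spider from revisiting pages that link back to one another.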

There are some important technical challenges to screen-scraping. The vast majority of data embedded in HTML is not marked up so that bots can parse it unambiguously and consistently. Hence, screen-scraping depends on making rather brittle assumptions about what the placement and presentation style of embedded data imply about the semantics of the data. Authors of web pages often change a page’s visual style without intending to change any underlying semantics, but still end up breaking screen-scraping code, often inadvertently. In contrast, by packaging data in commonly understood formats such as XML that are geared to computer consumption, a site owner is making an implicit (if not explicit) commitment to the reliable transfer of data to others. Public API functions are controlled, defined programmatic interfaces between the creator of the site and you as the user. Hence, accessing data through the public API should theoretically be less fragile than screen-scraping a web site.
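The following Python sketch illustrates that fragility under some made-up assumptions. The HTML fragments, the XML document, and the function names scrape_price and parse_price are all hypothetical. The scraper keys on presentation (a red font tag), so a purely visual redesign breaks it, whereas the XML version asks for the price element by name.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical page fragment: the price is identifiable only by its styling.
OLD_HTML = '<p>Widget <b><font color="red">$19.99</font></b></p>'
# The same data after a purely visual redesign.
NEW_HTML = '<p>Widget <span class="price">$19.99</span></p>'
# Hypothetical XML representation of the same item.
XML_FEED = '<item><name>Widget</name><price currency="USD">19.99</price></item>'


def scrape_price(html):
    """Brittle: assumes the price is whatever sits inside a red <font> tag."""
    match = re.search(r'<font color="red">\$([\d.]+)</font>', html)
    return float(match.group(1)) if match else None


def parse_price(xml_text):
    """Relies on the element name, not on how the data is displayed."""
    item = ET.fromstring(xml_text)
    return float(item.findtext("price"))


old_result = scrape_price(OLD_HTML)   # works against the old layout
new_result = scrape_price(NEW_HTML)   # the redesign silently breaks the scraper
xml_result = parse_price(XML_FEED)    # unaffected by presentation changes

print(old_result, new_result, xml_result)
```

Note that the scraper does not fail loudly after the redesign. It simply finds nothing, which is one reason breakage of this kind can go unnoticed for a while.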

Caution

Since I’m not a lawyer, do not construe anything in this book, including the following discussion, as legal advice!

If you engage in screen-scraping, you need to be thoughtful about how you go about it and, in some cases, even whether you should do it in the first place. Start by reading the terms of service (ToS) of the web site. Some ToSs explicitly forbid the use of bots (such as automated crawlers) on their sites. How should you respond to such terms of service? On the one hand, you could decide to take a conservative stance and not screen-scrape the site at all. Or you could go to the other extreme and screen-scrape the site at will, wagering that you won’t get sued and noting that if the web site owner is not happy, the owner could just use technical means to shut down your bot.

I think a middle ground is often in order, one that is well stated by Bausch, Calishain, and Dornfest: “So use the API whenever you can, scrape only when you absolutely must, and mind your Ps and Qs when fiddling about with other people’s data.”[47]

Even though bots have negative connotations, many people do recognize the benefits of some bots, especially those run by search engines. If everyone were to take an extremely conservative reading of the terms of service for web sites, wouldn’t many of the things we take for granted on the Internet (such as search engines) simply disappear?

Since screen-scraping web sites without public APIs is largely beyond the scope of this book, I will refer you to the following books for more information:

There’s some recent research around end-user innovation that should encourage web site owners to make their sites extensible and even hackable. See Eric von Hippel’s books. Von Hippel argues that many products and innovations are originally created by users of products, not by the manufacturers, who then bake in those innovations after the fact (http://en.wikipedia.org/wiki/Eric_Von_Hippel).