Digital journalist and educator

On the ethics of web scraping

Scraping data from websites is a valuable skill for a journalist, and often (as in my case) the first incentive to learn to code. But once you acquire the power to harvest massive amounts of data in a short time, ethical questions invariably pop up:

Can I take this data?

Can I republish this data?

Am I overloading the website’s servers?

What can I use this data for?

In Canada, the legality of web scraping has not been fully defined. In 2011, a B.C. court sided with a company that accused another website of scraping its content without authorization. The ruling essentially said that once you access a website, you’re bound by its fine print, and that site’s terms of use specifically forbid the copying and reuse of its data.

But that website had a specific clause against scraping. With other websites, especially those publishing government data, the rules are not so clear.

Thankfully, some Very Smart People have offered helpful thoughts on the issue.

Paul Bradshaw of the Online Journalism Blog argues that it’s basic journalistic ethics to “deceive no one”, and since some scrapers pose as web browsers, this could constitute deception. However, a scraper is accessing a computer, not another human, so no one is really getting harmed.

The potential for harm occurs when a scraper makes excessive demands on a server, preventing it from responding to other requests.

Justin Abrahms says scraping can be an ethical act if it helps others by making information easier to access. This quickly becomes unethical if the contents of a website are passed off as one’s own, which veers into copyright infringement.

Other posts on this topic rehash the same points: respect the terms of use and don’t overload the servers.

With that in mind, here are my personal rules for scraping data from websites:

1. Read the terms and conditions

“You may only use or reproduce the Content for your own personal and non-commercial use. The framing, scraping, data-mining, extraction or collection of the Content of the Sites in any form and by any means whatsoever is strictly prohibited. Furthermore, you may not mirror any material contained on this Sites.”

That said, the data owner might give you permission to scrape for certain uses. In which case…

2. When in doubt, ask

If it’s not clear from looking at the website, contact the webmaster and ask if and what you’re allowed to harvest.

3. Be gentle on smaller websites

Large government websites can probably handle the load of a single scraper making 3-4 requests per second. Environment Canada had no problem when I scraped 60 years of hourly weather data.

But that’s a very popular government website. It was built to handle high traffic. Smaller departments may not be so robust, and these may need special considerations. I offer two:

1. Run your scraper in off-peak hours, like evenings and weekends.

2. Space out each request so the server isn’t overwhelmed. Python’s time.sleep function makes this a breeze. For example:

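    import time
    import urllib2  # Python 2 module; in Python 3, use urllib.request instead

    for item in items_to_scrape:
        html = urllib2.urlopen(URL)  # URL is the address for the current item
        # ...
        # your scraping code
        # ...
        time.sleep(3)  # pause before the next request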

This tells the scraper to wait (“sleep”) for three seconds after each pass through the loop.
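The off-peak suggestion can be checked in code as well. Here is a minimal sketch, assuming that “off-peak” means weekends, or weekdays before 6 a.m. and after 6 p.m. Those hours and the names is_off_peak, OFF_PEAK_START and OFF_PEAK_END are my own illustrative choices, and the check uses your computer’s clock, which may not match the server’s time zone.

```python
from datetime import datetime

# Assumed quiet hours: from 18:00 in the evening until 06:00 the next morning.
OFF_PEAK_START = 18
OFF_PEAK_END = 6


def is_off_peak(moment):
    """Return True if `moment` is on a weekend or outside the assumed busy hours."""
    if moment.weekday() >= 5:  # Monday is 0, so 5 and 6 are Saturday and Sunday
        return True
    return moment.hour >= OFF_PEAK_START or moment.hour < OFF_PEAK_END


if __name__ == "__main__":
    if is_off_peak(datetime.now()):
        print("Off-peak: OK to start the scraper.")
    else:
        print("Peak hours: consider waiting until this evening.")
```

A scraper could call is_off_peak before starting a long run, and simply exit if the answer is no.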

4. Identify yourself in the header

If an admin of the website notices the unusual traffic caused by your scraper, he might want to investigate. You can make his job easier by saying who you are in your HTTP request.

If journalists are to demand transparency from government agencies, it’s only right that we be transparent ourselves (obvious sensitive investigations notwithstanding).

I like the requests Python library for this reason. It makes including a header quite easy. I usually add my name and contact info as a ‘user-agent’ string:
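A minimal sketch of what that might look like with requests. The name, email address and URL are placeholders rather than real contact details, and the fetch helper is a hypothetical name of my own:

```python
import requests

# Placeholder identity: replace with your own name and contact details.
HEADERS = {
    "User-Agent": "Jane Reporter (jane.reporter@example.com), scraping for a news story",
}


def fetch(url):
    """Request a page while identifying ourselves in the User-Agent header."""
    response = requests.get(url, headers=HEADERS, timeout=30)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text


if __name__ == "__main__":
    # Building the request without sending it shows the header a server would see.
    prepared = requests.Request("GET", "https://example.com/", headers=HEADERS).prepare()
    print(prepared.headers["User-Agent"])
```

An admin who spots the traffic in the server logs then has a name and an address to write to, instead of an anonymous default string.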

Hi. I came across your article while studying the ethics of web scraping. What happens when the company’s staff analyse the traffic data on their website to see the number of visitors? Aren’t they deceived by undetected web scrapers that pose as browsers? It seems like a little harm (they don’t have the actual number of real human visitors).

That’s a good question. When admins look at their logs, a scraper might show up as multiple visits by the same visitor. It will probably be obvious that it’s a scraper so I don’t think there’s any deception there. That’s why it’s a good idea to identify yourself in the headers to be transparent about this.