
Using Python’s calendar module for scraping date-based data

I’ve recently fallen in love with Python’s standard calendar module. It has lots of functions to make handling dates a breeze. And for scraping data based on dates, it couldn’t be more convenient.

Take Environment Canada’s historical hourly data for Montreal. Each page has 24 hours of data in a single day. If I want to get the data for every day since the start, I have to loop through each day of each month of each year.

This becomes a pain when you have to account for months that have 30 or 31 days. Leap years add to the hassle. Python’s calendar module handles all this for you.
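To make that concrete, here is a minimal sketch using two standard-library calls, `calendar.monthrange` and `calendar.isleap`. The variable names are just for illustration.

```python
import calendar

# monthrange returns a tuple: (weekday of the 1st, number of days in the month)
first_weekday, days_in_feb_2015 = calendar.monthrange(2015, 2)
_, days_in_feb_2016 = calendar.monthrange(2016, 2)

print(days_in_feb_2015)       # 28
print(days_in_feb_2016)       # 29, because 2016 is a leap year
print(calendar.isleap(2016))  # True
```

With the number of days in hand, there is no need to hard-code month lengths or leap-year rules.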

WORD OF WARNING: whenever scraping a website, be a good internet citizen. See my post on ethical web scraping for some guidelines.

First, look at the URL structure to see where you have to cycle through the dates. This URL takes you to the data for Feb. 8, 2015:

But wait, what are all those zeroes? Well, calendar works a lot like your wall calendar: it includes the entire week, starting on Monday by default, even if some days belong to the previous and next months.
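The zero padding is easy to see for yourself. This is a small sketch that assumes `calendar.monthcalendar`, which returns the month as a list of weeks; it is one of several calendar functions that pad with zeroes, and not necessarily the one used elsewhere in this post.

```python
import calendar

# Each inner list is one week, Monday first; days outside the month are 0
weeks = calendar.monthcalendar(2015, 2)
for week in weeks:
    print(week)

# [0, 0, 0, 0, 0, 0, 1]
# [2, 3, 4, 5, 6, 7, 8]
# [9, 10, 11, 12, 13, 14, 15]
# [16, 17, 18, 19, 20, 21, 22]
# [23, 24, 25, 26, 27, 28, 0]
```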

Displayed as a wall calendar, it looks like this (zeroes added in by me):

       February 2015
    Mo Tu We Th Fr Sa Su
     0  0  0  0  0  0  1
     2  3  4  5  6  7  8
     9 10 11 12 13 14 15
    16 17 18 19 20 21 22
    23 24 25 26 27 28  0

This, by the way, is another neat feature of the calendar module: the TextCalendar class. Create an instance with calendar.TextCalendar() and call its prmonth (print month) method, passing in the year and month.
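A short sketch of that call is below. Note that the real output leaves blanks where the padding days fall, rather than zeroes. `formatmonth` is the sibling method that returns the same text as a string instead of printing it.

```python
import calendar

cal = calendar.TextCalendar()  # weeks start on Monday by default
cal.prmonth(2015, 2)           # prints February 2015 to stdout

# formatmonth returns the same layout as a string, handy for logging
text = cal.formatmonth(2015, 2)
```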

We can’t feed the URL zero dates, but we can filter out the zeroes in the list comprehension.
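Here is one way that filtering might look: a minimal sketch, not the exact code from this post. It assumes `calendar.Calendar().itermonthdays`, which yields 0 for the padding days. The `URL_TEMPLATE` and `month_urls` names, and the query parameters, are hypothetical placeholders and not the real Environment Canada URL.

```python
import calendar

# Hypothetical URL template, used only for illustration
URL_TEMPLATE = "https://example.com/hourly?Year={year}&Month={month}&Day={day}"

def month_urls(year, month):
    """Return one URL per real day of the month, skipping the zero padding."""
    cal = calendar.Calendar()
    # itermonthdays yields 0 for days that belong to neighbouring months
    days = [day for day in cal.itermonthdays(year, month) if day != 0]
    return [URL_TEMPLATE.format(year=year, month=month, day=day) for day in days]

urls = month_urls(2015, 2)
print(len(urls))  # 28
print(urls[7])    # the URL for Feb. 8, 2015
```

Looping `month_urls` over every year and month of interest gives the full list of pages to request, with leap years handled automatically.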