Scraping Airbnb

Airbnb is not exactly keen to share data that might help analyse its impact on local housing markets. In 2016, the Amsterdam Municipality decided to collect Airbnb data using a scraper - a computer programme that automates the job of retrieving information from web pages.

Amsterdam is not the only government to use web scraping. Increasingly, this technique is used to obtain data about topics ranging from consumer prices to job vacancy statistics and business data. Collecting data from the internet has advantages, but it also poses some challenges. It may be difficult to aggregate data coming from different websites, and data found online may not cover all aspects of a phenomenon you’re trying to understand (for example, not all job vacancies are published online). On a more practical level, your web scraper code may break when websites change.

In March 2017, Amsterdam reported that its weekly scrapes of major platforms like Airbnb required little maintenance. But last week, it sent a report to the city council describing how Airbnb has been making changes to its website - perhaps in an attempt to frustrate efforts to collect information about its business practices. Initially, Amsterdam’s digital surveillance department successfully updated its scraper, but following new changes to the Airbnb website since May 2018, Amsterdam now appears to have given up on scraping Airbnb.

This made me curious about the technical characteristics of the Airbnb website. Here are some observations, based on an (admittedly superficial) examination:

The initial download of a web page isn’t the final version: after downloading, the contents of the page are dynamically altered using JavaScript. For some purposes, like navigating search results, you may prefer the final version of the page, which you can get using Selenium. Selenium would especially come in handy for interacting with the calendar to get availability and price information, which seems to be rather tricky.
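To illustrate the idea of retrieving the final version of a page, here is a minimal sketch using Selenium's Python bindings. It assumes Selenium 4 and a local Chrome installation; the function name fetch_rendered_html and the timeout value are my own choices for illustration, not something taken from Amsterdam's or anyone else's scraper. Waiting for the document to finish loading is only a rough proxy: content that scripts add later may need a more specific wait.

```python
def fetch_rendered_html(url, timeout=15):
    """Return the page HTML after the browser has run the page's JavaScript.

    Hypothetical helper: assumes Selenium 4 and Chrome are installed.
    """
    # Imported here so this sketch can be loaded without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # run Chrome without a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until the browser reports that the document has finished loading.
        WebDriverWait(driver, timeout).until(
            lambda d: d.execute_script("return document.readyState") == "complete"
        )
        return driver.page_source
    finally:
        # Always close the browser, even if loading fails.
        driver.quit()
```

You would call it with the address of a search results or listing page and then parse the returned HTML as usual.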

Some details on listings appear to be available only in the JavaScript code. You can find them using patterns like '\"lat\"\:(.*?),\"lng\"\:(.*?),'
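As a rough sketch of how such a pattern could be applied in Python: the example below uses a raw-string version of the pattern that should match the same text, and runs it on a made-up snippet. The sample string and the function name extract_coordinates are assumptions for illustration; the actual page source will look different and may change.

```python
import re

# Raw-string form of the latitude/longitude pattern quoted above.
COORD_PATTERN = re.compile(r'"lat":(.*?),"lng":(.*?),')


def extract_coordinates(page_source):
    """Return a list of (lat, lng) floats found in the page source."""
    coordinates = []
    for lat, lng in COORD_PATTERN.findall(page_source):
        try:
            coordinates.append((float(lat), float(lng)))
        except ValueError:
            # Skip matches that are not plain numbers.
            continue
    return coordinates


# Made-up snippet standing in for the JavaScript embedded in a listing page.
sample = '{"listing":{"id":123,"lat":52.3702,"lng":4.8952,"name":"x"}}'
print(extract_coordinates(sample))
```

Because the pattern relies on the exact order and spelling of the keys, it is likely to be one of the first things to break when the website changes.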

Airbnb uses NGINX to control access to its website. If you request too many pages too fast, you’ll hit a rate limit and get an error page. I guess it should be possible to avoid the rate limit by adding pauses to your programme, but it may take quite some time to figure out how often and how long they should be.
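Adding pauses could look something like the sketch below. It waits a randomised interval between pages and doubles the wait whenever a request fails. Everything specific here is an assumption: the fetch argument is a hypothetical function returning a status code and a page body, any status other than 200 is treated as a sign of rate limiting, and the pause lengths are placeholders, since I don't know what limits Airbnb actually applies.

```python
import random
import time


def fetch_with_pauses(urls, fetch, base_pause=5.0, max_retries=3, sleep=time.sleep):
    """Fetch each URL in turn, pausing between requests.

    fetch is a hypothetical callable: fetch(url) -> (status_code, body).
    Returns a dict mapping each successfully fetched URL to its body.
    """
    results = {}
    for url in urls:
        pause = base_pause
        for _ in range(max_retries + 1):
            status, body = fetch(url)
            if status == 200:
                results[url] = body
                break
            # Assumption: a non-200 status means we were too fast,
            # so wait twice as long before trying again.
            pause *= 2
            sleep(pause)
        # Randomised pause before moving on to the next page.
        sleep(base_pause + random.uniform(0, base_pause))
    return results
```

URLs that still fail after the last retry are simply left out of the result, so you can see which pages to try again later. The sleep argument is there so the pauses can be replaced with a stub when testing.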

While it appears that barriers to scraping the Airbnb website may be surmountable, it’s quite possible that I underestimate what this would take. If you actually built a scraper and used it to frequently collect information about all local listings, all kinds of new problems might arise.

Meanwhile, other sources of Airbnb data are available. In a previous post, I used data made available by Tom Slee and by Murray Cox’s Inside Airbnb. Slee has since stopped updating his data, but Inside Airbnb is still active. As the Amsterdam Municipality notes in its report, Inside Airbnb has successfully adapted its scraping technique each time Airbnb changed its website.

UPDATE 13 May - See comments on Twitter: Jens von Bergmann from Vancouver also has a scraper that is working. Following some requests, Tom Slee recently updated his scraper; his code is available on GitHub.