Scrapy is a Python-based web crawler that can be used to extract information from websites. It is fast and simple, and can follow links from page to page much as a browser does.

However, note that it is not suitable for websites and apps that use JavaScript to manipulate the user interface. Scrapy loads just the HTML. It has no facilities to execute JavaScript that might be used by the website to tailor the user’s experience.

Installation

We use Virtualenv to install Scrapy. This lets us install it without affecting other system-installed modules.

Create a working directory and initialize a virtual environment in that directory.

mkdir working
cd working
virtualenv venv
. venv/bin/activate

Now install Scrapy:

pip install scrapy

Check that it is working. At the time of writing, the installed version of Scrapy was 1.4.0.
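One way to check is Scrapy's own version subcommand:

scrapy version

If the installation succeeded, this prints a line like Scrapy 1.4.0.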

Writing a Spider

To begin with, create a file called redditspider.py and add to it the spider class sketched after the list below. This is a complete spider class, though one that does not yet do anything useful. A spider class requires, at a minimum, the following:

A name identifying the spider.

A start_urls list variable containing the URLs from which to begin crawling.
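Here is a minimal sketch of such a class. The class name, spider name, and start URL are illustrative choices; only the name and start_urls attributes themselves are required by Scrapy.

import scrapy

class RedditSpider(scrapy.Spider):
    # the name scrapy uses to identify this spider
    name = 'redditspider'
    # crawling begins from these URLs
    start_urls = ['https://www.reddit.com/']

    def parse(self, response):
        # called with the downloaded page for each start URL;
        # does nothing useful yet
        pass

Run it from the working directory with:

scrapy runspider redditspider.py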

Turn Off Logging

As you can see, this spider runs and prints a bunch of log messages, which can be useful for debugging. However, since they obscure the output of our program, let's turn them off for now.

Add these lines to the beginning of the file:

import logging
# show only warnings and errors from scrapy, hiding its chatty INFO output
logging.getLogger('scrapy').setLevel(logging.WARNING)

Now, when we run the spider, we should no longer see the messages obscuring our output.

Parsing the Response

Let's now parse the response from the crawl. This is done in the parse() method, where we use response.css() to perform CSS-style selections on the HTML and extract the required elements.

To identify the CSS selections to extract, we use Chrome's DOM Inspector to pick the elements. On Reddit's front page, we see that each post is wrapped in a <div class="thing">...</div>.

So we select all div.thing elements from the page and work with those further.
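At this stage, parse() can be sketched as a loop over those selections; the body is a placeholder that the next section fills in:

def parse(self, response):
    # each post on the front page is wrapped in <div class="thing">
    for thing in response.css('div.thing'):
        # per-post extraction goes here (see the next section)
        pass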

Extracting Required Elements

With this selection in place, let's extract the title from each Reddit post. Within div.thing, the title is available at div.entry>p.title>a.title::text. As mentioned before, the CSS selection for the required elements can be determined from any browser's DOM Inspector.

The results are returned to the caller using Python's yield statement. The way yield works is as follows: calling a function that contains a yield statement returns a generator to the caller. The caller repeatedly resumes this generator, receiving one result each time, until the generator terminates.
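A tiny standalone example of this behaviour, independent of Scrapy:

def numbers():
    # each yield suspends the function and hands one value to the caller
    for i in range(3):
        yield i

for n in numbers():
    print(n)  # prints 0, 1 and 2; the loop ends when the generator terminates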

In our case, the parse() method returns to the caller, on each invocation, a dictionary object containing a single key, title, until the div.thing list ends.
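Putting the pieces together, parse() might look like the sketch below. The selector string is the one identified above; extract_first() returns the first matching text node, or None when there is none:

def parse(self, response):
    for thing in response.css('div.thing'):
        # pull the title text from the anchor inside each post
        title = thing.css('div.entry>p.title>a.title::text').extract_first()
        yield {'title': title}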

Running the Spider and Collecting Output

Let us now run the spider again. A part of the copious output is shown (after re-enabling the log messages).
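To collect the items into a file rather than reading them off the console, Scrapy's feed export can be used; posts.json here is just an illustrative filename:

scrapy runspider redditspider.py -o posts.json

The -o option serialises every yielded item into the named file, with the output format inferred from the file extension.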

Conclusion

This article provided a basic view of how to extract information from websites using Scrapy. To use Scrapy, we write a spider module that instructs it to crawl a website and extract structured information from it. This information can then be returned in JSON format for consumption by downstream software.
