Getting started with Web Scraping using Python [Tutorial]

Small manual tasks, like scanning through information sources in search of small bits of relevant information, are in fact automatable. Instead of performing the same tasks over and over, we can use computers to do these menial chores and focus our own efforts on what humans are good at: high-level analysis and decision making based on the results. This tutorial shows how to use the Python language to automate common business tasks that can be greatly sped up if a computer is doing them.

The code files for this article are available on GitHub. This tutorial is an excerpt from a book written by Jaime Buelta titled Python Automation Cookbook.

The internet and the WWW (World Wide Web) are the most prominent sources of information today. In this article, we will learn to perform operations programmatically to automatically retrieve and process information. The Python requests module makes it very easy to perform these operations.

We’ll cover the following recipes:

Downloading web pages

Parsing HTML

Crawling the web

Accessing password-protected pages

Speeding up web scraping

Downloading web pages

The basic ability to download a web page involves making an HTTP GET request against a URL, which is the fundamental operation of any web browser. In this recipe, we'll see how to make a simple request to obtain a web page.

The operation of requests is very simple: perform the operation, GET in this case, against the URL. This returns a result object that can be analyzed. Its main elements are the status_code and the body content, which can be presented as text.
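As an illustration, the snippet below makes that GET request with requests. So that it is self-contained and does not depend on any external site, it first serves a tiny page from a background thread using the standard library; the handler and page content here are ours, not part of the recipe:

```python
import http.server
import threading

import requests  # pip install requests


class Handler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        # Serve a minimal HTML page for any path.
        body = b"<html><body><p>Hello scraper</p></body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output clean


# Port 0 asks the OS for any free port.
server = http.server.HTTPServer(("localhost", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://localhost:{server.server_port}/"
response = requests.get(url)   # the GET request
print(response.status_code)    # 200 if everything went well
print(response.text)           # the body, decoded as text

server.shutdown()
```

In real use, the URL would of course point at the remote page you want to download.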

You can check out the full requests documentation for more information.

Parsing HTML

We’ll use the excellent Beautiful Soup module to parse the HTML text into a memory object that can be analyzed. We need to use the beautifulsoup4 package to use the latest Python 3 version that is available. Add the package to your requirements.txt and install the dependencies in the virtual environment:
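The steps of the recipe leading up to this point (omitted in this excerpt) download the page and parse it into a page object. A minimal sketch of that parsing, using an inline HTML fragment in place of a downloaded page:

```python
from bs4 import BeautifulSoup  # from the beautifulsoup4 package

# A stand-in for the downloaded page; in the recipe, this text would
# come from response.text after a requests.get() call.
html = """
<html><body>
  <a name="links"></a>
  <h2>7. Links</h2>
  <p>Links can be internal or external.</p>
  <h3>Next section</h3>
</body></html>
"""

# Parse the raw text into a memory object that can be searched.
page = BeautifulSoup(html, "html.parser")
print(page.find("h2").string)
```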

6. Extract the text on the section links. Stop when you reach the next h3 tag:

>>> link_section = page.find('a', attrs={'name': 'links'})
>>> section = []
>>> for element in link_section.next_elements:
...     if element.name == 'h3':
...         break
...     section.append(element.string or '')
...
>>> result = ''.join(section)
>>> result
'7. Links\n\nLinks can be internal within a Web page (like to\nthe Table of ContentsTable of Contents at the top), or they\ncan be to external web pages or pictures on the same website, or they\ncan be to websites, pages, or pictures anywhere else in the world.\n\n\n\nHere is a link to the Kermit\nProject home pageKermit\nProject home page.\n\n\n\nHere is a link to Section 5Section 5 of this document.\n\n\n\nHere is a link to\nSection 4.0Section 4.0\nof the C-Kermit\nfor Unix Installation InstructionsC-Kermit\nfor Unix Installation Instructions.\n\n\n\nHere is a link to a picture:\nCLICK HERECLICK HERE to see it.\n\n\n'

Notice that there are no HTML tags; it’s all raw text.

The first step is to download the page. Then, the raw text can be parsed, as in step 3. The resulting page object contains the parsed information.

BeautifulSoup allows us to search for HTML elements. It can search for the first one with .find() or return a list with .find_all(). In step 5, it searched for a specific tag with the attribute name='links'. After that, it kept iterating over .next_elements until it found the next h3 tag, which marks the end of the section.

The text of each element is extracted and finally composed into a single text. Note the or '' that avoids storing None, returned when an element has no text.

Crawling the web

Given the nature of hyperlinked pages, starting from a known place and following links to other pages is a very important tool in the arsenal when scraping the web.

To do so, we crawl a page looking for a small phrase and print any paragraph that contains it. We search only in pages that belong to the same site, that is, only URLs starting with www.somesite.com; we won't follow links to external sites.

We'll use a prepared example, available in the GitHub repo. Download the whole site and run the included script:

$ python simple_delay_server.py

This serves the site in the URL http://localhost:8000. You can check it on a browser. It’s a simple blog with three entries. Most of it is uninteresting, but we added a couple of paragraphs that contain the keyword python.

How to crawl the web

The full script, crawling_web_step1.py, is available in GitHub. The most relevant bits are displayed here:

Search for references to python, to return a list with URLs that contain it and the paragraph. Notice there are a couple of errors because of broken links:

$ python crawling_web_step1.py http://localhost:8000/ -p python
Link http://localhost:8000/: --> A smaller article , that contains a reference to Python
Link http://localhost:8000/files/5eabef23f63024c20389c34b94dee593-1.html: --> A smaller article , that contains a reference to Python
Link http://localhost:8000/files/33714fc865e02aeda2dabb9a42a787b2-0.html: --> This is the actual bit with a python reference that we are interested in.
Link http://localhost:8000/files/archive-september-2018.html: --> A smaller article , that contains a reference to Python
Link http://localhost:8000/index.html: --> A smaller article , that contains a reference to Python

Another good search term is crocodile. Try it out:

$ python crawling_web_step1.py http://localhost:8000/ -p crocodile

Let’s see each of the components of the script:

A loop that goes through all the found links, in the main function:
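A sketch of that loop, under our own naming where the excerpt does not show the code: a queue of links to check, fed by the links each processed page returns. Here process_link is a stub standing in for the real function, and the two-page SITE map is invented for illustration:

```python
# Stub standing in for the real process_link of the script: it would
# download and search the page, then return the links found on it.
SITE = {
    "http://localhost:8000/": ["http://localhost:8000/a.html"],
    "http://localhost:8000/a.html": [],
}


def process_link(source_link, text):
    return SITE.get(source_link, [])


def main(base_url, to_search, max_checks=10):
    """Crawl from base_url, up to max_checks pages."""
    checked_links = set()
    to_check = [base_url]
    while to_check and max_checks:
        link = to_check.pop(0)
        new_links = process_link(link, text=to_search)
        checked_links.add(link)
        # Queue only links we have not seen yet.
        for new_link in new_links:
            if new_link not in checked_links and new_link not in to_check:
                to_check.append(new_link)
        max_checks -= 1
    return checked_links


print(main("http://localhost:8000/", "python"))
```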

Downloading and parsing the link, in the process_link function:

It downloads the file and checks that the status is correct, to skip errors such as broken links. It also checks that the type (as described in Content-Type) is an HTML page, to skip PDFs and other formats. And finally, it parses the raw HTML into a BeautifulSoup object.
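Under those requirements, process_link looks roughly like this. This is a sketch: search_text and get_links are the helper functions of the full script, not defined here:

```python
from urllib.parse import urlparse

import requests  # pip install requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4


def process_link(source_link, text):
    print(f'Extracting links from {source_link}')
    parsed_source = urlparse(source_link)
    result = requests.get(source_link)
    if result.status_code != 200:
        # Skip errors such as broken links.
        print(f'Error retrieving {source_link}: {result}')
        return []
    if 'html' not in result.headers['Content-Type']:
        # Skip PDFs and other non-HTML formats.
        print(f'Link {source_link} is not an HTML page')
        return []
    page = BeautifulSoup(result.text, 'html.parser')
    # search_text and get_links are the other parts of the script.
    search_text(source_link, page, text)
    return get_links(parsed_source, page)
```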

It also parses the source link using urlparse, so that later, in step 4, it can skip all references to external sources. urlparse divides a URL into its constituent elements:
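urlparse is part of the standard library; for example:

```python
from urllib.parse import urlparse

parsed = urlparse('http://localhost:8000/files/archive-september-2018.html')
print(parsed.scheme)   # 'http'
print(parsed.netloc)   # 'localhost:8000'
print(parsed.path)     # '/files/archive-september-2018.html'
```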

It searches the parsed object for the specified text. Note that the search is done as a regex, and only in the text. It prints the resulting matches, including source_link, the URL where the match was found:
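A sketch of that search, with inline HTML for illustration; the exact output format of the script may differ:

```python
import re

from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = '''
<html><body>
  <p>A smaller article, that contains a reference to Python.</p>
  <p>Nothing relevant here.</p>
</body></html>
'''
page = BeautifulSoup(html, 'html.parser')
source_link = 'http://localhost:8000/index.html'


def search_text(source_link, page, text):
    """Print the text elements matching the search term (case-insensitive regex)."""
    for element in page.find_all(string=re.compile(text, flags=re.IGNORECASE)):
        print(f'Link {source_link}: --> {element}')


search_text(source_link, page, 'python')
```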

It searches the parsed page for all the a elements and retrieves their href attributes, keeping only the elements that have an href and whose value is a fully qualified URL (starting with http). This removes links that are not URLs, such as a '#' link, or that are internal to the page.

An extra check verifies that the links have the same source as the original link; only then are they registered as valid links. The netloc attribute allows us to detect that a link comes from the same domain as the parsed URL generated in step 2.

Finally, the links are returned, where they’ll be added to the loop described in step 1.
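Both filters together can be sketched as follows (inline HTML and a hypothetical page layout, for illustration):

```python
from urllib.parse import urlparse

from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = '''
<html><body>
  <a href="http://localhost:8000/files/page1.html">internal</a>
  <a href="#top">fragment link, skipped</a>
  <a href="http://example.com/external.html">external, skipped</a>
</body></html>
'''
parsed_source = urlparse('http://localhost:8000/index.html')
page = BeautifulSoup(html, 'html.parser')


def get_links(parsed_source, page):
    """Retrieve the links on the page, keeping only same-domain absolute URLs."""
    links = []
    for element in page.find_all('a'):
        link = element.get('href')
        if not link:
            continue
        if not link.startswith('http'):
            # Not a fully qualified URL, e.g. '#top' or a relative path.
            continue
        if urlparse(link).netloc != parsed_source.netloc:
            # External site, skip it.
            continue
        links.append(link)
    return links


print(get_links(parsed_source, page))  # only the localhost:8000 link survives
```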

Accessing password-protected pages

Sometimes a web page is not open to the public, but protected in some way. The most basic mechanism is basic HTTP authentication, which is integrated into virtually every web server and is a user/password schema.
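With requests, supplying those credentials is one parameter away. The snippet below only defines the call; the URL and credentials in the usage comment are placeholders, not endpoints from the recipe:

```python
import requests  # pip install requests


def get_protected_page(url, username, password):
    """Retrieve a page protected by basic HTTP authentication."""
    # requests accepts a (user, password) tuple in the auth parameter
    # and builds the Authorization header for us.
    return requests.get(url, auth=(username, password))


# Usage, with placeholder URL and credentials:
# response = get_protected_page('http://localhost:8000/secret', 'user', 'password')
# print(response.status_code)
```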

Speeding up web scraping

Most of the time spent downloading information from web pages is spent waiting. A request goes from our computer to the server that will process it, and until the response is composed and travels back to our computer, we cannot do much about it.

During the execution of the recipes in the book, you’ll notice there’s a wait involved in requests calls, normally of around one or two seconds. But computers can do other stuff while waiting, including making more requests at the same time. In this recipe, we will see how to download a list of pages in parallel and wait until they are all ready. We will use an intentionally slow server to show the point.

We’ll get the code to crawl and search for keywords, making use of the futures capabilities of Python 3 to download multiple pages at the same time.

A future is an object that represents the promise of a value. You receive the future immediately, while the code is executed in the background; the code only blocks when its .result() is specifically requested, waiting until the value is available.

To generate a future, you need a background engine, called an executor. Once created, submit a function and its parameters to it to retrieve a future. Retrieving the result can be delayed as long as necessary, which allows generating several futures in a row and waiting until all are finished, executing them in parallel, instead of creating one, waiting until it finishes, creating another, and so on.

There are several ways to create an executor; in this recipe, we’ll use ThreadPoolExecutor, which will use threads.
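A minimal, self-contained illustration of the pattern, with time.sleep standing in for a slow download:

```python
import time
from concurrent import futures


def slow_download(url):
    """Stand-in for a page download: pretend each request takes 0.5 seconds."""
    time.sleep(0.5)
    return f'content of {url}'


urls = [f'http://localhost:8000/page{n}.html' for n in range(4)]

start = time.time()
with futures.ThreadPoolExecutor(max_workers=4) as executor:
    # Submitting returns immediately; the downloads run in the background.
    pending = [executor.submit(slow_download, url) for url in urls]
    # .result() blocks until each future's value is ready.
    results = [future.result() for future in pending]
elapsed = time.time() - start

print(results)
print(f'{elapsed:.1f}s')  # close to 0.5s, not 2s, as the four calls overlap
```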

We'll use a prepared example, available in the GitHub repo. Download the whole site and run the included script:

$ python simple_delay_server.py -d 2

This serves the site in the URL http://localhost:8000. You can check it on a browser. It's a simple blog with three entries. Most of it is uninteresting, but we added a couple of paragraphs that contain the keyword python. The parameter -d 2 makes the server intentionally slow, simulating a bad connection.

How to speed up web scraping

Write the following script, speed_up_step1.py. The full code is available in GitHub.

Notice the differences in the main function. Also, there’s an extra parameter added (number of concurrent workers), and the function process_link now returns the source link.

Run the crawling_web_step1.py script to get a time baseline. Notice the output has been removed here for clarity:

The main engine to create the concurrent requests is the main function. Notice that the rest of the code is basically untouched (other than returning the source link in the process_link function).

This is the relevant part of the code that handles the concurrent engine:

with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as executor:
    while to_check:
        futures = [executor.submit(process_link, url, to_search)
                   for url in to_check]
        to_check = []
        for data in concurrent.futures.as_completed(futures):
            link, new_links = data.result()
            checked_links.add(link)
            for link in new_links:
                if link not in checked_links and link not in to_check:
                    to_check.append(link)
            max_checks -= 1
            if not max_checks:
                return

The with context creates a pool of workers, specifying the number of them. Inside, a list of futures containing all the URLs to retrieve is created. The .as_completed() function returns the futures that are finished, and then there's some work dealing with obtaining newly found links and checking whether they need to be added to be retrieved or not. This process is similar to the one presented in the Crawling the web recipe.

The process repeats until enough links have been retrieved or there are no links left to retrieve.

In this post, we learned to use the power of Python to automate web scraping tasks. To understand how to automate monotonous tasks with Python 3.7, check out our book: Python Automation Cookbook