How To Work with Web Data Using Requests and Beautiful Soup with Python 3

Introduction

The web provides us with more data than any of us can read and understand, so we often want to work with that information programmatically in order to make sense of it. Sometimes, that data is provided to us by website creators via comma-separated values (.csv) files or through an API (Application Programming Interface). Other times, we need to collect text from the web ourselves.

This tutorial will go over how to work with the Requests and Beautiful Soup Python packages in order to make use of data from web pages. The Requests module lets you integrate your Python programs with web services, while the Beautiful Soup module is designed to get screen-scraping done quickly. Using the Python interactive console and these two libraries, we’ll go through how to collect a web page and work with the textual information available there.

Next, we can assign the result of requesting that page to the variable page with the requests.get() method. We pass the page’s URL (which was assigned to the url variable) to that method.

page = requests.get(url)

The variable page is assigned a Response object:

>>> page
<Response [200]>
>>>

The Response object above shows the status_code attribute in square brackets (in this case, 200). This attribute can be accessed explicitly:

>>> page.status_code
200
>>>

The returned code of 200 tells us that the page downloaded successfully. Codes that begin with the number 2 generally indicate success, while codes that begin with a 4 or 5 indicate that an error occurred. You can read more about HTTP status codes from the W3C’s Status Code Definitions.
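The status code families described above can be sketched as a small helper function. Note that describe_status() below is a hypothetical illustration for this tutorial, not part of the Requests library:

```python
# Hypothetical helper illustrating HTTP status code families;
# not part of the Requests library.
def describe_status(code):
    if 200 <= code < 300:
        return 'success'        # 2xx: the request succeeded
    if 400 <= code < 500:
        return 'client error'   # 4xx: a problem with the request
    if 500 <= code < 600:
        return 'server error'   # 5xx: a problem on the server
    return 'other'              # 1xx informational, 3xx redirects, etc.

print(describe_status(200))  # → success
print(describe_status(404))  # → client error
```

In practice, Requests also offers page.raise_for_status(), which raises an exception for 4xx and 5xx responses.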

In order to work with web data, we’re going to want to access the text-based content of web files. We can read the content of the server’s response with page.text (or page.content if we would like to access the response in bytes).
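The difference between the two is the Python type: page.content holds the raw bytes of the response body, while page.text is that same body decoded into a string. A quick sketch of the relationship, using a hard-coded byte string in place of a real response:

```python
# page.content holds raw bytes; page.text is those bytes decoded to str.
# Here a hard-coded byte string stands in for a real response body.
content = b'<html><body><p>Beautiful Soup</p></body></html>'  # like page.content
text = content.decode('utf-8')                                # like page.text

print(type(content))  # <class 'bytes'>
print(type(text))     # <class 'str'>
```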

If we print page.text, the full text of the page is printed out, with all of its HTML tags. However, it is difficult to read because there is not much spacing.

In the next section, we can leverage the Beautiful Soup module to work with this textual data in a more human-friendly manner.

Stepping Through a Page with Beautiful Soup

The Beautiful Soup library creates a parse tree from parsed HTML and XML documents (including documents with non-closed tags or tag soup and other malformed markup). This functionality will make the web page text more readable than what we saw coming from the Requests module.
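For example, Beautiful Soup with html.parser will still build a tree from markup whose tags are never closed. The one-line snippet below is our own example, not part of the page used in this tutorial:

```python
from bs4 import BeautifulSoup

# Malformed "tag soup": neither <p> tag is closed.
soup = BeautifulSoup('<p>one<p>two', 'html.parser')

# Beautiful Soup still recovers both paragraph tags.
print(len(soup.find_all('p')))  # 2
```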

To start, we’ll import Beautiful Soup into the Python console:

from bs4 import BeautifulSoup

Next, we’ll run the page.text document through the module to give us a BeautifulSoup object — that is, a parse tree from this parsed page that we’ll get from running Python’s built-in html.parser over the HTML. The constructed object represents the mockturtle.html document as a nested data structure. This is assigned to the variable soup.

soup = BeautifulSoup(page.text, 'html.parser')

To show the contents of the page in the terminal, we can print it with the prettify() method, which turns the Beautiful Soup parse tree into a nicely formatted Unicode string.

In the prettified output, there is one tag per line, and the tags are indented to reflect the nested tree schema used by Beautiful Soup.
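Since the full mockturtle.html page is not reproduced here, a small stand-alone snippet shows the effect; the HTML string below is our own example:

```python
from bs4 import BeautifulSoup

# A one-line HTML fragment standing in for page.text.
soup = BeautifulSoup(
    '<html><body><p id="first">Beautiful Soup</p></body></html>',
    'html.parser')

# prettify() reflows the parse tree with one tag per line,
# indented by nesting depth.
print(soup.prettify())
```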

Finding Instances of a Tag

We can extract all instances of a single tag from a page by using Beautiful Soup’s find_all method. This will return every instance of the given tag within the document.

soup.find_all('p')

Running that method on our object returns the full text of the song along with the relevant <p> tags and any tags contained within that requested tag, which here includes the line break tags <br/>:

Output

[<p class="verse" id="first">Beautiful Soup, so rich and green,<br/>
Waiting in a hot tureen!<br/>
Who for such dainties would not stoop?<br/>
Soup of the evening, beautiful Soup!<br/>
Soup of the evening, beautiful Soup!<br/></p>, <p class="chorus" id="second">Beau--ootiful Soo--oop!<br/>
...
Beau--ootiful Soo--oop!<br/>
Soo--oop of the e--e--evening,<br/>
Beautiful, beauti--FUL SOUP!<br/></p>]

You will notice in the output above that the data is contained in square brackets [ ]. This means it is a Python list data type.

Because it is a list, we can index into it to select a particular item (for example, the third <p> element), and use the get_text() method to extract all the text from inside that tag:

soup.find_all('p')[2].get_text()

The output we receive is the text contained in the third <p> element:

Output

'Beautiful Soup! Who cares for fish,\n Game or any other dish?\n Who would not give all else for two\n Pennyworth only of Beautiful Soup?\n Pennyworth only of beautiful Soup?'

Note that \n line breaks are also shown in the returned string above.
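The same pattern works on any document. Here is a self-contained sketch (our own three-paragraph example, not the mockturtle.html verses) showing find_all(), list indexing, and get_text() together; note that newlines in the HTML source survive into the extracted text, just as above:

```python
from bs4 import BeautifulSoup

# Our own stand-in document with three <p> tags; the third one
# contains a literal newline in the source.
html = '''<p>first verse</p>
<p>second verse</p>
<p>third
verse</p>'''
soup = BeautifulSoup(html, 'html.parser')

paragraphs = soup.find_all('p')   # a Python list of Tag objects
print(len(paragraphs))            # 3
print(paragraphs[2].get_text())   # 'third\nverse' -- source newline survives
```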

Finding Tags by Class and ID

The class and ID attributes that CSS selectors reference can be helpful to target when working with web data using Beautiful Soup. We can find specific classes and IDs by using the find_all() method and passing the class and ID strings as arguments.

First, let’s find all of the instances of the class chorus. In Beautiful Soup we will assign the string for the class to the keyword argument class_:

soup.find_all(class_='chorus')

When we run the above line, the two <p>-tagged sections with the class of chorus are printed out to the terminal as a list.

We can also specify that we want to search for the class chorus only within <p> tags, in case it is used for more than one tag:

soup.find_all('p', class_='chorus')

Running the line above will produce the same output as before.
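A self-contained sketch of both calls, using our own small document rather than the song; here the chorus class is deliberately reused on a <div> to show why restricting the search to <p> tags can matter:

```python
from bs4 import BeautifulSoup

# Our own example: one verse paragraph, two chorus paragraphs, and a
# <div> that reuses the chorus class.
html = '''<p class="verse">a verse</p>
<p class="chorus">first chorus</p>
<div class="chorus">not a paragraph</div>
<p class="chorus">second chorus</p>'''
soup = BeautifulSoup(html, 'html.parser')

print(len(soup.find_all(class_='chorus')))       # 3 -- any tag with the class
print(len(soup.find_all('p', class_='chorus')))  # 2 -- only <p> tags
```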

We can also use Beautiful Soup to target IDs associated with HTML tags. In this case we will assign the string 'third' to the keyword argument id:

soup.find_all(id='third')

Once we run the line above, we’ll receive the following output:

Output

[<p class="verse" id="third">Beautiful Soup! Who cares for fish,<br/>
Game or any other dish?<br/>
Who would not give all else for two<br/>
Pennyworth only of Beautiful Soup?<br/>
Pennyworth only of beautiful Soup?<br/></p>]

The text associated with the <p> tag with the id of third is printed out to the terminal along with the relevant tags.
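Because IDs are meant to be unique within a page, Beautiful Soup also offers the find() method, which returns the first matching tag rather than a list. A sketch with our own two-paragraph document:

```python
from bs4 import BeautifulSoup

# Our own example document with unique id attributes.
html = '''<p class="verse" id="first">first verse</p>
<p class="verse" id="third">third verse</p>'''
soup = BeautifulSoup(html, 'html.parser')

print(soup.find_all(id='third'))          # a one-element list
print(soup.find(id='third').get_text())   # 'third verse'
```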

Conclusion

This tutorial took you through retrieving a web page with the Requests module in Python and doing some preliminary scraping of that web page’s textual data in order to gain an understanding of Beautiful Soup.