Python BeautifulSoup Tutorial: Web scraping in 20 lines of code

Using Python and BeautifulSoup, we can quickly and efficiently scrape data from a web page. In the example below, I am going to show you how to scrape a web page in 20 lines of code, using BeautifulSoup and Python.

What is Web Scraping:

Web scraping is the process of automatically extracting information from a website. Web scraping, or data scraping, is useful for researchers, marketers and analysts interested in compiling, filtering and repackaging data.

A word of caution: Always respect the website’s privacy policy and check robots.txt before scraping. If a website offers an API to interact with its data, it is better to use that instead of scraping.
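
The robots.txt check can itself be done from Python. Here is a minimal sketch using the standard library's urllib.robotparser. The robots.txt content, the example.com URLs and the "my-scraper" user agent are made-up placeholders, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "my-scraper" and the example.com URLs are placeholders.
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/stats/players"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/private/data"))
```

Against a live site you would probably point the parser at the real file with set_url() and read() instead of passing the text in by hand.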

Web Scraping with Python and BeautifulSoup:

Web scraping in Python is a breeze. There are a number of ways to access a web page and scrape its data. I am using Python and BeautifulSoup for the purpose.

In this example, we are scraping college footballer data from the ESPN website.

As we are scraping the web page using the BeautifulSoup and Requests libraries, we need to install them first. This can be done using pip:

pip install requests

pip install beautifulsoup4

Ok. Time to brew some Python magic.

Let’s import the required libraries in our code. These include BeautifulSoup, requests, os and csv, since we are going to save the extracted data in a CSV file.

from bs4 import BeautifulSoup
import requests
import os, os.path, csv
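
To show what csv and os.path are for, here is a minimal sketch of saving rows to a CSV file. The save_rows helper, the column names, the sample players and the output file name are all hypothetical. This is not the code from the original script.

```python
import csv
import os.path
import tempfile


def save_rows(path, header, rows):
    """Write a header line and data rows to a CSV file, overwriting it."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


# Made-up sample data, written to the system temp directory.
out_path = os.path.join(tempfile.gettempdir(), "players_sample.csv")
save_rows(out_path, ["name", "position"], [["Player A", "QB"], ["Player B", "WR"]])
print(os.path.exists(out_path))
```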

The next step is to fetch the web page and store it in a BeautifulSoup object. We also need a parser to parse the fetched web page. BeautifulSoup can work with a variety of parsers; we are using Python's built-in html.parser in this example.
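
A minimal sketch of this step might look like the code below. It is not the exact 20-line script. The fetch_soup and table_rows helpers, the 10-second timeout and the sample HTML table are my own assumptions for illustration, and the real ESPN page will have a different structure.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url):
    """Download url and return it parsed as a BeautifulSoup object."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return BeautifulSoup(response.text, "html.parser")


def table_rows(soup):
    """Return the cell text of every non-empty <tr>, one list per row."""
    rows = []
    for tr in soup.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return rows


# Made-up HTML standing in for a fetched page, so the sketch runs offline.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>Position</th></tr>
  <tr><td>Player A</td><td>QB</td></tr>
  <tr><td>Player B</td><td>WR</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(table_rows(soup))
```

For a real page, you would call fetch_soup() with the page's URL and pass the result to table_rows() in the same way.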

So, this is how Python and BeautifulSoup are used to scrape a web page in just 20 lines of code.

While the code meets the requirements, it is not very elegant or self-explanatory. A more detailed version of the code, with comments and extra bits to tie up the loose ends, is available at GitHub [here].