A Complete Guide to Web Scraping With Python

In today’s day and age extracting data from the web is becoming more and more important. From getting valuable insights to creating useful metrics, a lot depends on our ability to extract useful data from the web. This article talks about how we can extract data via web scraping with python.

Although there are several ways of extracting data from the web, there are two which have become most frequently used over time. The first and most desirable route is by using an API(Application Programming Interface). That’s because APIs allow us to access a lot of structured data directly from the provider.

API – Image Source – GGSLK

However, an API is not always a feasible option for data extraction for a couple of reasons. The first is that some providers may not want to provide access to structured data from their websites. The other is that the provider simply may not have the technical knowledge.

In such a case, web scraping becomes the only feasible option to extract data from the web. Although not the first choice, web scraping is nonetheless a very useful and effective technique of data extraction and is practically indispensable today.

An interesting fact about web scraping with python is that contrary to popular belief it is perfectly legal. The only exception to this is when a website has blocked crawlers via robots.txt or have a Terms Of Service page that indicates that they don’t allow web scraping shouldn’t be crawled.

What is Web Scraping?

Web scraping is a computer software technique of extracting data from the web. Web scraping allows you to convert unstructured data on the web (present in HTML format) into structured data (such as a database or spreadsheet).

Moreover, effective web scraping services can extract data from a number of unstructured formats from HTML and other websites to social media sites, pdf, local listing, e-commerce portals, blogs and many other online resources.

What are the situations in which Web Scraping is required?

As discussed, web scraping is the best and sometimes only solution to extract large amounts of data from the web when APIs are not available. Although web scraping has been around for a while, it has never been as reliable as it is today.

In today’s day and age, analysis of key trends is the best way to make sound business decisions. In fact, with a deluge of information out there on the internet, it simply doesn’t make business sense to make important decisions without proper analysis. Web scraping gives you al the raw data you need to run the right analytics that will give you in-depth business intelligence.

Another important use of web scraping is in monitoring your brand. In today’s age of instant feedback, negative mentions on social media and other websites, be it Facebook or Twitter or Google Business can ruin your online reputation.

This can have a serious impact on revenues in the long-term. Effective web scraping techniques can help you keep an eye on all your mentions on the web, helping you manage your brand more effectively.

Other applications of web scraping include finding listings for a real website, powering AI experiments which require many keyword-linked images, monitoring prices of competitors on different e-commerce websites and tracking the stock market real-time.

Why is Python a suitable language to use for Web Scraping?

First, python is an easy language to learn and work with because the syntax reads like simple English and the core concepts are easy to understand. So if someone wants to scrape the web in an efficient manner but has no previous programming language, Python is the best choice.

The other advantage of Python is that it has the most elaborate and supportive ecosystem when it comes to web scraping. While many languages have libraries to help with web scraping, Python’s libraries have the most advanced tools and features.

Python is also an interpreted scripting language, which makes it compile free. What this means is that changes are implemented immediately when you run the application you’ve written the next time. Thus you’re able to fix mistakes and iterate quicker when web scraping with Python.

Free Data Analytics Webinar

Date: 28th Feb, 2019 (Thursday)Time: 3 PM to 4 PM (IST/GMT +5:30)

Which libraries can be used for Web Scraping with Python?

When you are web scraping with Python, you have access to some of the most advanced and supportive web scraping libraries.

(i) Scrapy

Scrapy is a comprehensive framework written for web scraping in Python. It allows you to do a number of things, from downloading the HTML of websites to storing them in the form you want to. While it is definitely very efficient and has a lower CPU usage, it is hard to install and may be too much for simple scraping tasks.

(ii) Urllib

Urllib can be used for fetching URLs. It defines the functions and classes that help with URL actions (basic and digest authentication, cookies, redirections, and so on).

(iii) BeautifulSoup (BS4)

BeautifulSoup is a parser that allows you to pull out information from the webpage. You can use it to extract lists, tables, paragraphs. You can even use filters to extract information from different web pages. It is one of the easiest and most popular tools to use.

(iv) LXML

LXML is a high performing parsing library that parses HTML and XML pages. While it is one of the best quality parsers out there, its complex documentation makes it a difficult tool to use for beginners.

Other important libraries for web scraping in Python include Mechanize, Scrapemark, Selenium and Requests. In this guide, we will be using a combination of Urllib and BeautifulSoup to scrape the web. Since BeautifulSoup can only parse the data and not fetch the web pages, Urllib needs to be used in addition to BeautifulSoup.

HTML tags you need to know for Web Scraping with Python

While web scraping with python doesn’t require you to be an HTML expert, it does use HTML tags which you need to become familiar with before getting started. Here’s what basic HTML syntax looks like.

HTML Syntax – Image Source: Analytics Vidhya

Here’s a guide to these basic HTML tags.

<!DOCTYPE html> : All HTML documents start with this declaration to identify it as an HTML document

The HTML document is contained between <html> and </html>

The part of the HTML document which is visible is between <body> and </body>

HTML headings are defined with the <h1> to <h6> tags, in decreasing order of size

HTML paragraphs are indicated through the <p> tag

Some other HTML tags that would be useful to know for web scraping with Python are given below.

The <a> tag is used for links. For instance, “<a href=“http://www.google.com”>This is a link for google.com</a>”

HTML tables are defined using<Table>, row using <tr> and rows are further divided into data using <td>

Steps to scrape the web using Beautifulsoup

The steps below will take you through the journey of scraping this Wikipedia page using BeautifulSoup. The goal is to extract the list of state and union territory capitals in India as well as details like date of the establishment, and the former capital and others from the Wikipedia page.

Free Data Analytics Webinar

Date: 28th Feb, 2019 (Thursday)Time: 3 PM to 4 PM (IST/GMT +5:30)

Import necessary libraries:

1. Import the library Urllib2 which will be used to query the websiteimport urllib2

2. Specify the URL that is to be queriedwiki = “https://en.wikipedia.org/wiki/List_of_state_and_union_territory_capitals_in_India”

3.Query the website and return the HTML to the variable ‘page’page = urllib2.urlopen(wiki)

4. Import the Beautiful soup functions so that you can parse the data which has been returned by the websitefrom bs4 import BeautifulSoup

5. Parse the HTML in the ‘page’ variable, and store it in the Beautiful Soup formatsoup = BeautifulSoup(page)

Use the “prettify” function to get the structure of the HTML page

This will help you see the structure of different HTML tags and you can play around with these to extract data.

Image Source: Analytics Vidhya

Work with HTML tags

soup.<tag>: Return content between the opening and the closing tag including the tag.

In[30]:soup.titleOut[30]:<title>List of state and union territory capitals in India – Wikipedia, the free encyclopedia</title>

soup.<tag>.string: Return string within a given tagIn [38]:soup.title.stringOut[38]:u’List of state and union territory capitals in India – Wikipedia, the free encyclopedia’

Find all the links within the page’s <a> tags: As discussed, we can tag a link using the tag “<a>”. So, you can go with soup.a and it will return all the links available on the web page.

In [40]:soup.a Out[40]:<a id=”top”></a>

This gives us only one output. In order to extract all the links that are present within <a>, the “find_all()” needs to be used. This can be seen in the image below.

Image Source: Analytics Vidhya

This gives us not just the links but also the titles and other information. To return only the link we need to use the “href” attribute as shown below.

Image Source – Analytics Vidhya

Identify the right table

We need to find the right table since we are specifically looking to extract data about state capitals. We start by using the command below to extract data from all the table tags.

all_tables=soup.find_all(‘table’)

After this, we need to identify the table we need. We use the attribute “class” to find the right table. You can use the command shown below for this.

Image Source – Analytics Vidhya

Extract the information to the DataFrame

For this, you will need to iterate through each row (HTML tag (tr)) and then assign each element of tr (HTML tag (td)) to a variable and append it to a list. To do this we need to look at the HTML structure of the table.

Image Source – Analytics Vidhya

As you can see in the image above, the second element of <tr> is within tag <th> instead of the more commonly used <td>. We need to account for this.

In order to access the value of each element, you need to use the “find(text=True)” option with each element. This is what the code looks like.You will now get the output in the format below.

Image Source – Analytics Vidhya

You can also see another great example of web scraping with Python using BeautifulSoup and Requests in this video.

Just like the examples here, many different kinds of websites can be scraped with Python using BeautifulSoup and other libraries. Given its wide-ranging applicationsand relative simplicity, web scraping with Python is definitely a great tool to have in one’s skill set.