Introduction To Python Web Scraping Tutorial Using BeautifulSoup

Hello Python enthusiasts, today I came up with a long article that can increase your passion for the Python programming language: a Python web scraper tutorial.

If you are a total newbie, don't worry: you can still understand and write your own Python web scraper. Please go through this article carefully.

Introduction

The need for web scraping is increasing day by day. As we know, there are tons and tons of data available on the internet, and sometimes we need to extract that information to accomplish our own work.

We can say that web scraping is a subset of data mining. Whatever data is available on the Internet, you can easily scrape it and analyze the results for your needs.

If you want to save the data from a single random web page, you can do it manually without any web scraper. But what if the site has a huge number of pages containing relevant data? Would you go to every page and scrape that data manually?

No. That is not a good option at all. It would take a lifetime. 😛

So a web scraper plays a crucial role here. It scrapes the content of each and every page, and it can also save those results to your Excel spreadsheets automatically.

What Is a Web Scraper?

Web scraping is a traditional software technique used to extract and save information from the structured HTML pages of a particular website according to our needs.

We can do data scraping in any programming language, but I prefer to do it in Python, as it provides some great modules like BeautifulSoup that make the task easier.

If you are not interested in coding, there are websites on the internet that scrape data for you (check Google). But they will not always meet your requirements, so I would suggest writing your own scraper to achieve your goals.

Some companies like Twitter, Facebook, and SEMrush provide APIs to use their data, with or without limitations. But what if you have to scrape the content of a website that doesn't provide any API?

Then we should go for manual web scraping using our favorite programming language. If you are a newbie, then please stick with this article; I will provide you the best Python web scraping tutorial with a real-world example.

Prerequisites

Little to no Python programming knowledge

HTML basics

Python 2.7

How To Find A Website To Scrape?

We are going to scrape the Yellow Pages website, which contains lots of pages, instead of scraping a single web page.

Okay, now go and search for a website which has numbered parameters in its URL. If you can't find one, don't worry: Google dorks will help you.

inurl:".php?id="

I have a list of hundreds of Google dorks, but I don't recommend scraping those websites without their owners' permission.

Python modules you need to install:

LXML

BeautifulSoup

urllib

Python Web Scraping:

import lxml
from bs4 import BeautifulSoup as b
import urllib

First, we are going to import the above three modules.

1. LXML – This module is used to process HTML and XML.

2. Next, we import BeautifulSoup as b from the bs4 module (for simplicity I have imported it as b; you can replace b with any name you wish). This module is used to extract structured data from HTML or XML.

3. The urllib module deals with URL requests and responses.
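A quick way to see BeautifulSoup in action, before touching the network, is to parse a small HTML string. A minimal sketch (the snippet and variable names are invented for illustration; the built-in "html.parser" is used here so it runs even where lxml is not installed):

```python
from bs4 import BeautifulSoup

# A made-up HTML snippet standing in for a downloaded page
snippet = "<html><body><h1>Yellow Pages</h1><a href='/shops'>Shops</a></body></html>"
soup = BeautifulSoup(snippet, "html.parser")

print(soup.h1.text)    # text inside the first <h1> tag
print(soup.a["href"])  # href attribute of the first <a> tag
```

The print() function form works under both Python 2.7 and Python 3, so the sketch runs unchanged on either.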

url = "http://yellowpages.fullhyderabad.com/-in-hyderabad-/"

Now go and find your website to scrape using Google dorks, and save its link into a variable called url.

page = urllib.urlopen(url).read()
html = b(page,"lxml")

Then we send a request to that URL using the urllib.urlopen() method and read the HTML page source. (Note: this is the Python 2.7 API; in Python 3 the equivalent is urllib.request.urlopen().)

Now we parse the HTML page source with BeautifulSoup, using the lxml parser, so we can extract the desired data, and we assign the resulting object to a variable.

Scraping Links

Now the fun starts. I am going to extract all the links from that page.

links = html.findAll("a")

Here, the links variable contains a list of all the anchor tags on the HTML page. So I'm going to iterate through links, which means printing every link the list contains.
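Before running this against a live page, the same iteration can be tried offline against a small HTML string (the snippet below is invented for illustration, and the built-in "html.parser" is used so it runs even where lxml is not installed; it behaves the same with "lxml"):

```python
from bs4 import BeautifulSoup

# Invented markup standing in for a downloaded page source
sample = '<a href="/hotels">Hotels</a><a href="/schools">Schools</a>'
soup = BeautifulSoup(sample, "html.parser")

links = soup.find_all("a")  # find_all is the modern spelling of findAll
for link in links:
    print(link.get("href"), link.text)  # href attribute and anchor text
```

Using link.get("href") instead of link["href"] avoids a KeyError when an anchor tag has no href attribute.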

Scrape Links and Anchor Texts in an Anchor Tag

import lxml
import urllib
from bs4 import BeautifulSoup as b

url = "http://yellowpages.fullhyderabad.com/-in-hyderabad-/"
page = urllib.urlopen(url).read()
html = b(page, "lxml")
links = html.findAll("a")
for i in links:          # i is one link in the list of links
    print i.get("href")  # the href attribute of the "a" tag (None if absent)
    print i.text         # the anchor text of that link

Scrape different attributes in any HTML tag

If we are going to scrape the titles of a blog, this method would be very helpful to us. To do this, first open the source code of the post page and copy the class attribute of the title element.
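For example, suppose the titles sit in h2 tags with a class of "post-title" (a hypothetical class name invented for this sketch; copy the real one from your target page's source). They can then be filtered like this:

```python
from bs4 import BeautifulSoup

# Hypothetical blog markup; the class name "post-title" is an assumption
page_source = """
<h2 class="post-title">First Post</h2>
<h2 class="sidebar-title">Popular</h2>
<h2 class="post-title">Second Post</h2>
"""
soup = BeautifulSoup(page_source, "html.parser")

# class_ has a trailing underscore because "class" is a Python keyword
titles = [h2.text for h2 in soup.find_all("h2", class_="post-title")]
print(titles)
```

Only the two h2 tags whose class matches are returned; the sidebar heading is skipped.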

We can use many more methods than the ones above to extract data. To see all the available methods of BeautifulSoup, just open your Python IDLE, import bs4, and type this command. You will get the list of methods; then try them one by one to learn how they work.

dir(bs4.BeautifulSoup)

My intention is not to bore you, so I'm writing another article on how to scrape all the links across the many pages of a website.