How to scrape websites using Python

By Devanshu Jain · Apr 8

It is that time of the year when the air is filled with the claps and cheers of 4s and 6s during the Indian Premier League Cricket T20 tournament, followed by the ICC Cricket World Cup in England.

And how can we forget the election results of the world's largest democracy, India, which will be out in the next few weeks? To stay updated on who will win this year's IPL title, which country will lift the ICC World Cup in 2019, or how the country's future will look over the next 5 years, we constantly need to be glued to the Internet.

But if you’re like me and cannot spare much time on the Internet, but have a strong desire to stay updated with all these titles, then this article is for you.

So without wasting any time, let's get started!

[Photo by Balázs Kétyi on Unsplash]

There are two ways in which we can access this updated information.

One way is through APIs provided by these media websites, and the other is through web/content scraping.

The API way is simple: calling the associated programming interface is probably the best way to get updated information.

But sadly, not all websites provide publicly accessible APIs.

So the only way left for us is web scraping.

Web Scraping

Web scraping is a technique to extract information from websites.

This technique mostly focuses on the transformation of unstructured data (HTML format) on the web into structured data (database or spreadsheet).

Web scraping may involve accessing the web directly using HTTP, or through a web browser.

In this article, we’ll be using Python to create a bot for scraping content from the websites.

Process Workflow

1. Get the URL of the page from which we want to extract/scrape data
2. Copy/download the HTML content of the page
3. Parse the HTML content and get the required data

The above flow helps us navigate to the URL of the required page, get its HTML content, and parse out the required data.
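The three steps above can be sketched in a few lines of plain Python. This is only a minimal illustration using the standard library's urllib (the packages the article actually uses come next); the URL is a placeholder, and the naive string search stands in for a real HTML parser:

```python
from urllib.request import urlopen

def download_html(url):
    # Steps 1-2: navigate to the URL and download the page's HTML content.
    return urlopen(url).read().decode("utf-8")

def extract_title(html):
    # Step 3: parse the HTML and pull out the data we need (here, the <title>).
    # Naive string search for illustration only; use a real parser in practice.
    start = html.find("<title>") + len("<title>")
    return html[start:html.find("</title>")].strip()

# Example usage (requires network access):
# print(extract_title(download_html("https://example.com")))
```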

But sometimes we first have to log in to the website and then navigate to a specific location to get the required data.

In that case, logging into the website adds one more step to the flow.

Packages

For parsing the HTML content and getting the required data, we use the Beautiful Soup library.

It’s an amazing Python package for parsing HTML and XML documents.

Do check out its documentation.
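Here's a quick sketch of what Beautiful Soup parsing looks like. It assumes the beautifulsoup4 package is installed via pip; the HTML snippet, class name, and score are made up for illustration:

```python
from bs4 import BeautifulSoup

# A small, made-up HTML snippet standing in for a downloaded page.
html = """
<html><body>
  <div class="score">IND 245/3 (42.1 ov)</div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first tag matching the given name and attributes.
score = soup.find("div", class_="score").get_text(strip=True)
print(score)
```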

For logging into the website, navigating to the required URL within the same session, and downloading the HTML content, we’ll be using the Selenium library.

Selenium with Python helps with clicking buttons, entering text into forms, and much more.
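A sketch of the login-and-download flow with Selenium might look like the following. This assumes the selenium package and a matching browser driver (e.g. chromedriver) are installed; the URLs, element IDs, and credentials are all placeholders:

```python
def fetch_html_after_login(login_url, target_url, username, password):
    """Log in, navigate to the target page in the same session, return its HTML.

    The element IDs ("username", "password", "login-btn") are hypothetical;
    inspect the real page to find the actual ones.
    """
    # Imported inside the function so this sketch loads even without selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()  # requires chromedriver on PATH
    try:
        # Fill in the login form and submit it.
        driver.get(login_url)
        driver.find_element(By.ID, "username").send_keys(username)
        driver.find_element(By.ID, "password").send_keys(password)
        driver.find_element(By.ID, "login-btn").click()

        # Same browser session, so we stay logged in.
        driver.get(target_url)
        return driver.page_source  # full HTML, ready for Beautiful Soup
    finally:
        driver.quit()
```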

Dive right into the code

First, we import all the libraries that we are going to use.

After installing the libraries, running python <program name> from the terminal will print the scraped values to the console.

In this way, we can scrape data from almost any website.

Now, if we are scraping a website which changes its content very frequently, like cricket scores or live election results, we can run this program in a cron job and set an interval for the cron job.
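For example, a crontab entry (edited with crontab -e on Linux/macOS) that runs the scraper every 5 minutes could look like this; the script path and log file are placeholders:

```shell
# Run the scraper every 5 minutes and append its output to a log file.
# min  hour  day-of-month  month  day-of-week  command
*/5    *     *             *      *            /usr/bin/python3 /home/user/scraper.py >> /home/user/scores.log 2>&1
```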

Apart from that, instead of printing to the console, we can display the results right on our screen, in a desktop notification that pops up after a particular time interval.
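On Linux, one simple way to do this is to shell out to the notify-send tool (part of libnotify). A minimal sketch, assuming the tool is installed; the title and message are placeholders:

```python
import subprocess

def notify(title, message):
    """Show a desktop notification via notify-send (Linux/libnotify).

    On macOS, osascript's "display notification" is a rough equivalent.
    """
    subprocess.run(["notify-send", title, message], check=False)

# Example usage:
# notify("IPL Score", "IND 245/3 (42.1 ov)")
```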

We can even share these values onto a messaging client.

Python has rich libraries that can help us with all that.

If you want me to explain how to set up a cron job and get notifications to appear on the desktop, feel free to ask me in the comments section.