Thursday, 12 May 2016

The need and importance of extracting data from the web is becoming increasingly loud and clear. Every few weeks, I find myself in a situation where we need to extract data from the web. For example, last week we were thinking of creating an index of hotness and sentiment about various data science courses available on the internet. This would not only require finding out new courses, but also scrape the web for their reviews and then summarizing them in a few metrics! This is one of the problems / products, whose efficacy depends more on web scrapping and information extraction (data collection) than the techniques used to summarize the data.

Ways to extract information from web

There are several ways to extract information from the web. Use of APIs being probably the best way to extract data from a website. Almost all large websites like Twitter, Facebook, Google, Twitter, StackOverflow provide APIs to access their data in a more structured manner. If you can get what you need through an API, it is almost always preferred approach over web scrapping. This is because if you are getting access to structured data from the provider, why would you want to create an engine to extract the same information.

Sadly, not all websites provide an API. Some do it because they do not want the readers to extract huge information in structured way, while others don’t provide APIs due to lack of technical knowledge. What do you do in these cases? Well, we need to scrape the website to fetch the information.

There might be a few other ways like RSS feeds, but they are limited in their use and hence I am not including them in the discussion here.

What is Web Scraping?

Web scraping is a computer software technique of extracting information from websites. This technique mostly focuses on the transformation of unstructured data (HTML format) on the web into structured data (database or spreadsheet).

You can perform web scrapping in various ways, including use of Google Docs to almost every programming language. I would resort to Python because of its ease and rich eocsystem. It has a library known as ‘Beautiful Soup’ which assists this task. In this article, I’ll show you the easiest way to learn web scraping using python programming.

For those of you, who need a non-programming way to extract information out of web pages, you can also look at import.io . It provides a GUI driven interface to perform all basic web scraping operations. The hackers can continue to read this article!

Libraries required for web scraping

As we know, python is a open source programming language. You may find many libraries to perform one function. Hence, it is necessary to find the best to use library. I prefer Beautiful Soup (python library), since it is easy and intuitive to work on. Precisely, I’ll use two Python modules for scraping data:

Urllib2: It is a Python module which can be used for fetching URLs. It defines functions and classes to help with URL actions (basic and digest authentication, redirections, cookies, etc). For more detail refer to the documentation page.

Beautiful Soup: It is an incredible tool for pulling out information from a webpage. You can use it to extract tables, lists, paragraph and you can also put filters to extract information from web pages. In this article, we will use latest version Beautiful Soup 4. You can look at the installation instruction in its documentation page.

Beautiful Soup does not fetch the web page for us. That’s why, I use urllib2 in combination with the BeautifulSoup library.

Python has several other options for HTML scraping in addition to Beatiful Soup. Here are some others:

-mechanize
-scrapemark
-scrapy

Basics – Get familiar with HTML (Tags)

While performing web scarping, we deal with html tags. Thus, we must have good understanding of them.
you already know basics of HTML, you can skip this section. Below is the basic syntax of HTML:
This syntax has various tags as elaborated below:

<!DOCTYPE html> : HTML documents must start with a type declaration
HTML document is contained between <html> and </html>
The visible part of the HTML document is between <body> and </body>
HTML headings are defined with the <h1> to <h6> tags
HTML paragraphs are defined with the <

Scrapping a web Page using Beautiful Soup

Here, I am scraping data from a Wikipedia page. Our final goal is to extract list of state, union territory capitals in India. And some basic detail like establishment, former capital and others form this wikipedia page. Let’s learn with doing this project step wise step:

Import necessary libraries:

#import the library used to query a website
import urllib2
#specify the url
wiki = "https://en.wikipedia.org/wiki/List_of_state_and_union_territory_capitals_in_India"
#Query the website and return the html to the variable 'page'
page = urllib2.urlopen(wiki)
#import the Beautiful soup functions to parse the data returned from the website
from bs4 import Beautiful Soup
#Parse the html in the 'page' variable, and store it in Beautiful Soup format
soup = Beautiful Soup(page)

Use function “prettify” to look at nested structure of HTML page

Above, you can see that structure of the HTML tags. This will help you to know about different available tags and how can you play with these to extract information.

Work with HTML tags

soup.<tag>: Return content between opening and closing tag including tag.
In[30]:soup.title
Out[30]:<title>List of state and union territory capitals in India - Wikipedia, the free encyclopedia</title>
soup.<tag>.string: Return string within given tag
In [38]:soup.title.string
Out[38]:u'List of state and union territory capitals in India - Wikipedia, the free encyclopedia'

Find all the links within page’s <a> tags:: We know that, we can tag a link using tag “<a>”. So, we should go with option soup.a and it should return the links available in the web page. Let’s do it.

In [40]:soup.a
Out[40]:<a id="top"></a>

Above, you can see that, we have only one output. Now to extract all the links within <a>, we will use

Above, it is showing all links including titles, links and other information. Now to show only links, we need to iterate over each a tag and then return the link using attribute “href” with get.

Find the right table: As we are seeking a table to extract information about state capitals, we should identify the right table first. Let’s write the command to extract information within all table tags.

all_tables=soup.find_all('table')

Now to identify the right table, we will use attribute “class” of table and use it to filter the right table. In chrome, you can check the class name by right click on the required table of web page –> Inspect element –> Copy the class name OR go through the output of above command find the class name of right table.

Extract the information to DataFrame: Here, we need to iterate through each row (tr) and then assign each element of tr (td) to a variable and append it to a list. Let’s first look at the HTML structure of the table (I am not going to extract information for table heading <th>)
Above, you can notice that second element of <tr> is within tag <th> not <td> so we need to take care for this. Now to access value of each element, we will use “find(text=True)” option with each element. Let’s look at the code

Similarly, you can perform various other types of web scraping using “Beautiful Soup“. This will reduce your manual efforts to collect data from web pages. You can also look at the other attributes like .parent, .contents, .descendants and .next_sibling, .prev_sibling and various attributes to navigate using tag name. These will help you to scrap the web pages effectively.-

But, why can’t I just use Regular Expressions?

Now, if you know regular expressions, you might be thinking that you can write code using regular expression which can do the same thing for you. I definitely had this question. In my experience with Beautiful Soup and Regular expressions to do same thing I found out:

Code written in Beautiful Soup is usually more robust than the one written using regular expressions. Codes written with regular expressions need to be altered with any changes in pages. Even Beautiful Soup needs that in some cases, it is just that Beautiful Soup is relatively better.

Regular expressions are much faster than Beautiful Soup, usually by a factor of 100 in giving the same outcome.

So, it boils down to speed vs. robustness of the code and there is no universal winner here. If the information you are looking for can be extracted with simple regex statements, you should go ahead and use them. For almost any complex work, I usually recommend BeautifulSoup more than regex.

End Note

In this article, we looked at web scraping methods using “Beautiful Soup” and “urllib2” in Python. We also looked at the basics of HTML and perform the web scraping step by step while solving a challenge. I’d recommend you to practice this and use it for collecting data from web pages.

This post was written by Rubén Moya, School of Data fellow in Mexico, and originally posted on Escuela de Datos.Import.io is a very powerful and easy-to-use tool for data extraction that has the aim of getting data from any website in a structured way. It is meant for non-programmers that need data (and for programmers who don’t want to overcomplicate their lives).I almost forgot!! Apart from everything, it is also a free tool (o_O)

The purpose of this post is to teach you how to scrape a website and make a dataset and/or API in under 60 seconds. Are you ready?

It’s very simple. You just have to go to http://magic.import.io; post the URL of the site you want to scrape, and push the “GET DATA” button. Yes! It is that simple! No plugins, downloads, previous knowledge or registration are necessary. You can do this from any browser; it even works on tablets and smartphones.

For example: if we want to have a table with the information on all items related to Chewbacca on MercadoLibre (a Latin American version of eBay), we just need to go to that site and make a search – then copy and paste the link (http://listado.mercadolibre.com.mx/chewbacca) on Import.io, and push the “GET DATA” button.

You’ll notice that now you have all the information on a table, and all you need to do is remove the columns you don’t need. To do this, just place the mouse pointer on top of the column you want to delete, and an “X” will appear.

Finally, it’s enough for you to click on “download” to get it in a csv file.In our example, we have 373 pages with 48 articles each. So this option will be very useful for us.

Good news for those of us who are a bit more technically-oriented! There is a button that says “GET API” and this one is good to, well, generate an API that will update the data on each request. For this you need to create an account (which is also free of cost).

As you saw, we can scrape any website in under 60 seconds, even if it includes tons of results pages. This truly is magic, no? For more complex things that require logins, entering subwebs, automatized searches, et cetera, there is downloadable import.io software… But I’ll explain that in a different post.

Contact Information

We are leading WEB SCRAPING company and enough capable to extract website information, review scraping, contact information scraping, business directory scraping, email list scraping etc. Contact us on