In this article, we take a quick look at how web scraping can be useful in the context of data science projects. Web “scraping” (also called “web harvesting”, “web data extraction” or even “web data mining”) can be defined as “the construction of an agent to download, parse, and organize data from the web in an automated manner”. In other words: instead of a human end-user clicking away in their web browser and copy-pasting interesting parts into, say, a spreadsheet, web scraping offloads this task to a computer program, which can execute it much faster and more accurately than a human can.

The automated gathering of data from the Internet is probably as old as the Internet itself, and the term “scraping” itself predates the web. Before “web scraping” became popularized as a term, a practice known as “screen scraping” was already well-established as a way to extract data from a visual representation—which in the early days of computing (think 1960s-80s) often boiled down to simple, text-based “terminals”. Just as today, people at the time were already interested in “scraping” large amounts of text from such terminals and storing this data for later use.

When surfing around the web using a normal web browser, you’ve probably encountered multiple sites where you considered the possibility of gathering, storing, and analyzing the data present on the site’s pages. Especially for data scientists, whose “raw material” is data, the web exposes a lot of interesting opportunities. In such cases, web scraping comes in handy: if you can view some data in your web browser, you can access and retrieve it through a program, and if you can access it through a program, the data can be stored, cleaned, and used in any way you like.

No matter your field of interest, there’s almost always a use case to improve or enrich your practice based on data. “Data is the new oil”, so the common saying goes, and the web has a lot of it.

We’re working on a new book entitled “Web Scraping for Data Science with Python”, which will be geared towards data scientists who want to adopt web scraping techniques in their workflow. Stay tuned for more information in the coming issues of Data Science Briefings and over at dataminingapps.com. As a sneak preview, we’re sharing a summarized version of one of the use cases from the upcoming book.

Our goal is to construct a social graph of S&P 500 companies and their interconnectedness through their board members. We’ll start from the S&P 500 page at Reuters: https://www.reuters.com/finance/markets/index/.SPX to obtain a list of symbols:
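The first step can be sketched as follows. Note that the exact markup of the Reuters page changes over time, so the HTML snippet below is a hypothetical stand-in for the structure we assume the index page uses (a table with one row per constituent, the symbol in a link in the first cell); a real script would fetch the page with `requests` and parse the response text instead.

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking the assumed structure of the Reuters
# S&P 500 index page; in practice you would fetch this with requests, e.g.
#   html = requests.get("https://www.reuters.com/finance/markets/index/.SPX").text
html = """
<table class="dataTable">
  <tr><th>Symbol</th><th>Company</th></tr>
  <tr><td><a href="/finance/stocks/overview/MMM.N">MMM.N</a></td><td>3M Co</td></tr>
  <tr><td><a href="/finance/stocks/overview/ABT.N">ABT.N</a></td><td>Abbott</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# Each symbol sits in a link inside a table cell (per our assumption above)
symbols = [a.get_text(strip=True)
           for a in soup.select("table.dataTable td a")]
print(symbols)  # ['MMM.N', 'ABT.N']
```

The selectors would need to be adapted to whatever the live page actually serves, but the pattern—fetch, parse, select, extract text—stays the same.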

Once we have obtained a list of symbols, another script will visit the board member pages for each of them (e.g. https://www.reuters.com/finance/stocks/company-officers/MMM.N), extract the table of board members, and store it as a pandas DataFrame:
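A sketch of this step, again using hypothetical markup (including made-up officer names) standing in for the officers table we assume such a page exposes:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the officers table on a page like
# https://www.reuters.com/finance/stocks/company-officers/MMM.N;
# the names and layout here are illustrative, not real data.
html = """
<table>
  <tr><th>Name</th><th>Age</th><th>Position</th></tr>
  <tr><td>Jane Doe</td><td>61</td><td>Director</td></tr>
  <tr><td>John Roe</td><td>58</td><td>Director</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# Flatten each <tr> into a list of cell texts, then build a DataFrame
# using the header row for the column names
rows = [[cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in soup.find_all("tr")]
officers = pd.DataFrame(rows[1:], columns=rows[0])
print(officers)
```

Running this over every symbol and concatenating the per-company DataFrames yields one table mapping board members to companies.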

This sort of information can lead to a lot of interesting use cases, especially in the realm of graph and social network analytics; take a look at our research page around this topic. We can use our collected information to export a graph and visualize it using Gephi, a popular graph visualization tool:
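One way to sketch the export step is with `networkx`, linking two companies whenever they share a board member and writing the result as a GEXF file, a format Gephi opens directly. The board-membership data below is a toy, hypothetical sample; in the full use case it would come from the scraped officer tables.

```python
import networkx as nx

# Toy, hypothetical data: each symbol mapped to the set of people
# sitting on its board (stand-in for the scraped officer tables)
boards = {
    "MMM.N": {"Jane Doe", "John Roe"},
    "ABT.N": {"Jane Doe"},
    "MSFT.O": {"John Roe"},
}

# Connect two companies whenever their boards share at least one member,
# weighting the edge by the number of shared members
G = nx.Graph()
G.add_nodes_from(boards)
companies = list(boards)
for i, a in enumerate(companies):
    for b in companies[i + 1:]:
        shared = boards[a] & boards[b]
        if shared:
            G.add_edge(a, b, weight=len(shared))

# Export in GEXF format, which Gephi can open directly
nx.write_gexf(G, "sp500_board_graph.gexf")
```

Opening the resulting `.gexf` file in Gephi then lets you lay out, color, and explore the interconnectedness of the companies.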