When my company started writing blog posts a year ago, we discovered two big problems. One was that writing took a long time (the majority of us are developers!); the other was distribution. We believe in contributing to the developer community, but what was the point if no one was reading our posts?

When one of our colleagues suggested we try Medium for more exposure, we were excited. We all read stuff on Medium, but how does it work? What topics are Medium readers interested in? We wanted to do some research to answer these questions. Since I have a technical background, I suggested crawling some data to see what types of posts and publications performed well. Here, I’ll share how we built a web crawler in a day to help our content team figure out which topics to focus on.

Our results for the keyword “BaaS” (Backend-as-a-Service), which is related to our open-source product, Skygear

It only takes a day to write a useful web crawler for your content team.

Do you need a technical background to build a web crawler?

Building a web crawler does require basic coding skills. For this project, I used the following:

The goal is to extract data from Medium and represent it in a nice spreadsheet for my team to analyze, like the one below.
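To turn crawled records into spreadsheet form, you can write each extracted item as a row. Here is a minimal sketch using Python’s standard csv module; the field names and values are placeholders, not our actual crawl output:

```python
import csv

# Hypothetical crawled items; in practice these would come from the spider.
stories = [
    {"title": "Intro to BaaS", "author": "Alice", "recommends": 120},
    {"title": "Scrapy in a Day", "author": "Bob", "recommends": 85},
]

with open("stories.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "author", "recommends"])
    writer.writeheader()       # first row: column names
    writer.writerows(stories)  # one row per crawled story
```

Scrapy can also export results for you (its feed exports support CSV and JSON), but writing the file yourself gives full control over the columns.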

We can also crawl based on publication and extract data such as tags and read length.

Knowing what type of content we’re crawling.

First, we chose the information we wanted and could realistically extract, such as title, keywords, tags, and post length. We also manually researched the size of popular publications and the follower counts of popular writers. FreeCodeCamp has great insights from the top 252 Medium stories of 2016.

How do we extract the right data from all the different parts of a blog post on a page?

However, not all sites store data the same way; there is structured data and unstructured data. Structured content, such as RSS, JSON, and XML, can be extracted directly and represented in an ordered way (such as a newsfeed, or rows on a spreadsheet). Unstructured content, like Medium’s pages, requires a two-step process: extract the data from the HTML, then turn it into structured data.
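As a rough illustration of the difference, using only Python’s standard library and invented snippets (not Medium’s actual markup): structured JSON maps straight to fields, while HTML needs a parsing step first.

```python
import json
from html.parser import HTMLParser

# Structured data: fields are directly addressable.
feed_item = json.loads('{"title": "Hello", "author": "Alice"}')
print(feed_item["title"])  # -> Hello

# Unstructured data: we must parse the HTML to pull the fields out.
class TitleExtractor(HTMLParser):
    """Capture the text of the first <h1> element."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_title = True

    def handle_data(self, data):
        if self.in_title and self.title is None:
            self.title = data

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_title = False

parser = TitleExtractor()
parser.feed("<article><h1>Hello</h1><p>Body text</p></article>")
print(parser.title)  # -> Hello
```

A crawler library hides this second step behind selectors, which is exactly why we reach for one below.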

No matter how different the layouts of blogs or publication sites look, the data falls into these two categories: structured and unstructured. Now we need to choose a tool to help us build a crawler that extracts this data.

Choosing your library: don’t build from scratch.

If you want to build something quickly like I did, open source tools are great. You can choose from a range of free crawler libraries for different programming languages. Here is a list of libraries you can consider.

This time, I chose Scrapy, as it is a mature, open-source Python library. It also has great community support, so beginners can easily ask for help.

Using Scrapy to crawl data.

Our company loves open source and collaborative frameworks like Scrapy

Install Scrapy

First, you will need to install Scrapy on your computer. You can follow the guide here to install Scrapy on different platforms such as Windows, Mac OS X, or Ubuntu.

Set up the starter project

One of the best ways to get started is with their starter project, since it sets up most of the configuration for you. After a successful installation, make a new directory and run the following command. (Or you can read the documentation and set everything up yourself.)

scrapy startproject tutorial

You will see a spiders folder inside the tutorial directory. Go to tutorial/spiders, create a new file called stories_spider.py, and paste the script below into it:

import scrapy

class StoriesSpider(scrapy.Spider):
    name = "stories"
    start_urls = [
        # URLs that you want to crawl
        'http://example.com/post/',
        'http://example.com/post2/',
    ]

    # For all stories
    def parse(self, response):
        # Replace 'path' with the right CSS path where the data is located
        for story in response.css('path'):
            yield {
                # Things you need to crawl
            }

name : identifies the Spider.

start_urls : the list of URLs you want to crawl. This list is then used by the default implementation of start_requests() to create the initial requests for your spider.

parse() : handles the response downloaded for each of the requests made.

In order to crawl the data from Medium, we have to figure out the URLs & the paths of the data and put them to stories_spider.py.

Study the website: URLs

Now you will need to tell the crawler which site you want to crawl data from by passing the right URLs to your Scrapy program. Don’t worry, you don’t have to input every post manually. Instead, look at the Archive pages, which let Scrapy see all the posts published within a given time period.

Medium publications have a page called ‘Archive’, where you can find the blog posts published in past years. For example, the URL for 2016 is https://m.oursky.com/archive/2016

You can search any publication with an archive and target date.

For Medium, articles are separated by year, then by month, so you will have to input the URLs for the individual months.

Let’s crawl the Oursky publication from June 2016 to February 2017. I put the URLs in stories_spider.py:

import scrapy

class StoriesSpider(scrapy.Spider):
    name = "stories"
    start_urls = [
        # URLs that you want to crawl
        'https://m.oursky.com/archive/2016/06',
        'https://m.oursky.com/archive/2016/07',
        'https://m.oursky.com/archive/2016/08',
        'https://m.oursky.com/archive/2016/09',
        'https://m.oursky.com/archive/2016/10',
        'https://m.oursky.com/archive/2016/11',
        'https://m.oursky.com/archive/2016/12',
        'https://m.oursky.com/archive/2017/01',
        'https://m.oursky.com/archive/2017/02',
    ]

    # For all stories
    def parse(self, response):
        # Replace 'path' with the right CSS path where the data is located
        for story in response.css('path'):
            yield {
                # Things you need to crawl
            }
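Listing every month by hand works, but it gets tedious for longer ranges. A small helper (a sketch of my own, not part of the original spider) can generate the archive URLs for any span of months:

```python
def archive_urls(base, start_year, start_month, end_year, end_month):
    """Generate Medium archive URLs for each month in the range, inclusive."""
    urls = []
    year, month = start_year, start_month
    while (year, month) <= (end_year, end_month):
        urls.append(f"{base}/archive/{year}/{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return urls

# June 2016 to February 2017, as in the spider above.
start_urls = archive_urls("https://m.oursky.com", 2016, 6, 2017, 2)
```

Assigning the result to start_urls keeps the spider identical to the hand-written version.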

So far so good? Hang in there, because it gets a bit trickier!

Study the website: Identify the path of the components you want to crawl

Tell your crawler which components you want to extract from the unstructured HTML data

Now, we need to find the right piece of information (i.e. date, author name, link) with a CSS expression or XPath expression. When you open the site’s HTML, you will find the tags and class names (or ID names) for every line of code. Copy those selectors into Scrapy so it can extract the data.

A CSS expression uses CSS selectors to get the DOM element, while an XPath expression queries the XML structure to get the element. Scrapy supports both. You can reference the Scrapy documentation here for the format of CSS and XPath expressions. I personally prefer CSS expressions, since I find the syntax easier to read.

Below, I will use CSS expression to do the demo.

Let’s crawl the author name. I opened the console to view the HTML and find the CSS tags and classes of the author name. The process is the same for the author link, article title, recommendations, and posting time. For Medium, I found this information under div.u-borderBottomLightest, then focused on finding each field’s path underneath it.
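With those paths in hand, the yield block can be filled in. The field selectors below are hypothetical stand-ins (Medium’s class names change over time, so check your own browser’s inspector); they are only meant to show the shape:

```python
# The story container comes from the article; the field paths are guesses.
STORY_PATH = 'div.u-borderBottomLightest'
FIELD_PATHS = {
    'author': 'a.author-name::text',          # author display name (hypothetical class)
    'author_link': 'a.author-name::attr(href)',
    'title': 'h3.post-title::text',           # hypothetical class
    'recommends': 'button.recommend::text',   # hypothetical class
}

# Inside the spider's parse(), this would become:
# for story in response.css(STORY_PATH):
#     yield {field: story.css(path).get() for field, path in FIELD_PATHS.items()}
```

The ::text and ::attr(href) suffixes are Scrapy’s CSS extensions for pulling out text content and attribute values rather than whole elements.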

What can you do with this data?

One of my technical colleagues always says, ‘If you need to spend 90 seconds daily on something, you should write a program for it.’ This saved my content colleagues hours of work. Instead, they could focus on brainstorming topics that overlapped with Medium readers’ tastes.

You can also, for example, calculate correlations between keywords, posting time, or read duration and recommends (as a proxy for reads and popularity).
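As a sketch of that kind of analysis, here is a Pearson correlation in plain Python. The numbers are made-up illustrative values, not our actual crawl results:

```python
# Pearson correlation between read duration (minutes) and recommends.
# These are invented sample values, purely for illustration.
read_minutes = [3, 5, 4, 8, 6, 7, 2]
recommends = [40, 90, 55, 160, 110, 150, 25]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

print(round(pearson(read_minutes, recommends), 3))
```

A value near 1 would suggest longer reads attract more recommends in your sample; with real crawl data you would run this over every keyword or tag of interest.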

We looked at Medium’s top tech and startup publications in turn and learned a few things:

The biggest publications such as FreeCodeCamp and StartupGrind publish often

The largest tech publications had many posts with 1,000+ recommends

Many hit authors weren’t famous

Not all tech topics performed the same (for example, “Serverless” and “BaaS” didn’t get that many recommends relative to more generic tags such as “programming” and “tech”)