I was also looking for an excuse to learn how to scrape websites for information and this seemed like the perfect opportunity to do it.

Enter Scrapy

I had a few requirements for my web scraper, although as usual the most important one was that the software is free and open source. I was looking for something in Python so that I would not have to learn a new programming language on top of a new tool. I also needed a scraper that could crawl the website for me so that I would not have to manually input every webpage.

The two big Python packages in the web scraper world are BeautifulSoup and Scrapy (apologies to anyone working on something else). BeautifulSoup has a lot of built-in tools to parse a webpage, but you have to know the webpages you want to scrape in advance. In the short term, it probably would have been a lot easier to use. Scrapy allows you to advance through pages automatically and is overall a far more robust tool than BeautifulSoup.

First Steps

There is a wonderful tutorial on using Scrapy on their website. I use Continuum's Anaconda Python, so the installation was as simple as conda install scrapy. You might have to reboot to have scrapy added to your PATH if you are using Windows, but creating a Scrapy project is as simple as running scrapy startproject followed by a name for the project.

All the spider is returning right now is the match date, which is a good
start, but there is a lot more to do. Scrapy can select content using two
different methods: XPath and CSS. While CSS can work fairly well, Scrapy's own
documentation recommends XPath as the more powerful way of identifying content
on a webpage. The quick overview is that all you need to do to identify the
XPath is to open the website in Chrome, right click on the data you want to
extract, select Inspect, right click on the highlighted element in the source
code, and click on Copy XPath. For a better overview of how to extract data
using XPath, check Google. extract_first() pulls out the first element of the
list of matches, and returns None instead of raising an error when nothing
matched.

You should have a Field for every piece of data you want. I found it useful
to sketch out which fields I wanted first. If you want to see all of the
fields I actually ended up collecting, check out my GitHub project. Now we
need to adapt the spider to use the items.

If you are unfamiliar with yield in Python, it is similar to return except
that a function using it returns a "generator" when called. A generator is an
iterable that can only be iterated over once, and it produces its values one
at a time as they are requested. If you want more information, check out this
Stack Overflow page.
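
The "only once" part is easiest to see in a tiny example. This sketch has
nothing to do with Scrapy itself, and the function name is made up.

```python
def match_ids(start, count):
    """Yield `count` consecutive match IDs beginning at `start`."""
    for offset in range(count):
        yield start + offset


ids = match_ids(100, 3)   # nothing runs yet; this only creates the generator
first_pass = list(ids)    # consumes the generator: [100, 101, 102]
second_pass = list(ids)   # the generator is exhausted, so this is []
```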

Advancing Pages

The standard way of advancing pages in Scrapy is to identify some button like
'Next Page' through which you can get the url of the next page. Once there is
no more button, you end the scraper. Unfortunately, Master League
only links to the next game in a given series and does not link between
series. However, Master League increases the number of each match by 1,
which makes it easy to advance to the next page.

I was a little lazy though, for two reasons. The first is that I just replace the
last five characters with the next match ID. Eventually they will hit match 10,000
and I will have to fix the code. Second, the code ends whenever there is a 404 error,
which I never bothered to handle. It runs fine despite that though.
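
A less lazy way to build the next URL is to find the trailing number, whatever
its length, and add one to it. The sketch below is not what my spider does; the
helper name and the example URLs are hypothetical, and it assumes the match ID
is the last thing in the URL, optionally followed by a slash.

```python
import re


def next_match_url(url):
    """Return `url` with its trailing integer ID increased by one.

    Returns None if the URL does not end in digits (optionally followed by '/').
    """
    match = re.search(r'(\d+)(/?)$', url)
    if match is None:
        return None
    next_id = int(match.group(1)) + 1
    # Keep everything before the ID, then the new ID, then the optional slash.
    return url[:match.start(1)] + str(next_id) + match.group(2)


# Hypothetical URLs, for illustration only.
print(next_match_url('https://example.com/match/1234/'))
print(next_match_url('https://example.com/match/9999/'))
```

Because the number is parsed rather than sliced by position, the jump from a
four-digit to a five-digit ID needs no special handling.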

Adding Statistics

While the above information allowed me to create my Glicko ratings, I
eventually wanted to be able to do something like Moneyball for the
HotS competitive scene. Luckily, while I was working on this project,
Master League added statistics to their page. The statistics are loaded
using JavaScript when you click on the statistics tab. Interestingly,
they are just loaded from a webpage with the same URL as the match with
/stats/ added on. To get the data, I check whether the statistics tab
exists and then use a second function to parse the data on the statistics
page.

The callback is now parseStats, at the end of which I go to the next match.
In order to pass the item with the basic information already in it, the item
is passed through request.meta['item'].

Pipelines

Most of what I learned about pipelines is from a blog post I found using
Google. In pipelines.py, I created two classes to store Items in the database,
called HotsPipeline and SQLiteStorePipeline. I also convert the match date to
a more appropriate format and find the number of weeks since the first
match. Most important is to make sure that __init__ and __del__ are set
up so that the connection to the database is closed properly. Since only one
process can write to a SQLite database at a time, this is crucial to being a
good person.
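
Here is a minimal sketch of a storage pipeline. It is not my actual code:
instead of __init__ and __del__ it uses the open_spider and close_spider hooks
that Scrapy calls on pipelines, and the database file name and the two columns
are assumptions made for the example.

```python
import sqlite3


class SQLiteStorePipeline:
    """Sketch of a pipeline that writes each item to a SQLite table."""

    def __init__(self, db_path='hots.db'):  # hypothetical file name
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Called by Scrapy once, when the spider starts.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS match_basic '
            '(match_id INTEGER PRIMARY KEY, date TEXT)'
        )

    def close_spider(self, spider):
        # Called by Scrapy once, when the spider finishes:
        # commit pending rows and release the database file.
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        # "?" placeholders let sqlite3 bind the values safely.
        self.conn.execute(
            'INSERT OR REPLACE INTO match_basic (match_id, date) VALUES (?, ?)',
            (item['match_id'], item['date']),
        )
        return item
```

Returning the item from process_item matters when more than one pipeline is
enabled, because the next pipeline receives whatever the previous one returns.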

The only thing missing is that the first row of match_basic has to
exist already in order to calculate the number of weeks. While an if
statement could handle this, entering one row by hand into a database
that I used to maintain entirely by hand is perfectly fine.
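
For anyone who would rather not enter that first row by hand, the if statement
could look something like this. The function name is hypothetical, and the
sketch assumes match_basic has a date column holding ISO-formatted strings
such as 2016-03-05.

```python
import sqlite3
from datetime import date


def weeks_since_first_match(conn, match_date):
    """Return whole weeks between the earliest stored match and `match_date`.

    If the table is empty, `match_date` is itself the first match, so return 0.
    """
    # MIN() over an empty table yields a single row containing NULL (None).
    (first,) = conn.execute('SELECT MIN(date) FROM match_basic').fetchone()
    if first is None:
        return 0
    return (match_date - date.fromisoformat(first)).days // 7
```

ISO-formatted dates sort correctly as text, which is what lets MIN() find the
earliest match without any date parsing inside the query.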

Next Steps

The code so far handles the vast majority of what I wanted it to, but
there are still a few small quality-of-life additions left. The most
important is DeltaFetch, as described on Scrapy's blog.
The middleware keeps a record of the pages you have already scraped so you
do not crawl them twice. This is great for both you and the site you are
crawling.
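
Turning the middleware on is a matter of a couple of lines in settings.py. This
fragment assumes the scrapy-deltafetch plugin is installed, and the priority
value of 100 is the one I have seen used in examples rather than something I
tuned.

```python
# settings.py (fragment), assuming the scrapy-deltafetch plugin is installed.
SPIDER_MIDDLEWARES = {
    'scrapy_deltafetch.DeltaFetch': 100,
}
DELTAFETCH_ENABLED = True
```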

Finally, make sure you adhere to the guidelines in this post
on scraping politely. You don't want to have your IP address banned for
looking like a hacker, nor do you want to accidentally DDoS a website.
If you have a front-facing application, also be careful about any SQL
injection vulnerabilities. Since this will just be hosted on GitHub,
I was not particularly worried, but it is something you should always
consider.
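
On the politeness side, Scrapy has several built-in settings that help. The
fragment below is a sketch of a cautious configuration; the numbers and the
user agent string are illustrative assumptions, not values taken from the post
linked above.

```python
# settings.py (fragment); the values are illustrative, not recommendations.
ROBOTSTXT_OBEY = True                # respect the site's robots.txt
DOWNLOAD_DELAY = 2                   # seconds to wait between requests
CONCURRENT_REQUESTS_PER_DOMAIN = 1   # one request at a time per site
AUTOTHROTTLE_ENABLED = True          # slow down when the server responds slowly
USER_AGENT = 'hots-scraper (contact: you@example.com)'  # hypothetical identity
```

A descriptive user agent with contact details gives the site owner a way to
reach you before resorting to a ban.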