Screen scraping with Perl

Screen scraping is a relatively well-known idea, but for those who are not
familiar with it, the term refers to the process of extracting data from a
website. This may involve sending form information, navigating through the
site, etc., but the part I'm most interested in is processing the HTML to
extract the information I'm looking for.

As I mentioned in my article about outliners, I've
been organising myself recently, and as part of that process of organisation
I've been writing several screen scrapers to reduce the amount of browsing
I do: repeatedly visiting news sites to see if they have been updated is a
waste of anyone's time, and in these times of feed readers, it's even less tolerable.

Liferea, my feed reader of
choice, has a facility to read a feed generated by a command, and I have been
taking advantage of this facility. As well as reducing the amount of time
I spend reading the news from various sources, this also allows me to keep
track of websites I wouldn't normally remember to read.

Perl

In my article about feed readers I mentioned RSSscraper, a Ruby-based
framework for writing screen scrapers. As much as I like RSSscraper, I've
been writing my screen scrapers in Perl. Ruby looks like a nice language,
but I find Perl's regexes easier to use, and CPAN is filled with convenient modules to do
just about everything you can think of (and many more things you'd probably
never think of).

Most of my screen scrapers use regexes, mainly because Perl's regexes were
haunting me: there was something I just wasn't grasping, and I wanted to
push past it (and I have, and now I can't remember what the block was :).
There are much better ways to write screen scrapers: Perl has modules like
WWW::Mechanize, HTML::TokeParser, etc., that make screen scraping easier.

sun-bizarre.pl.txt,
sun-viral.pl.txt:
I read The Sun. There, I admitted it. I
work in a factory, and it's good to keep up with the news that everyone else
reads, but mostly it's because I like looking at pictures of scantily clad
women. [shrug]. I also have sun-pic.pl.txt, which allows me to
bypass The Sun's annoying popups.

Most of the scrapers work in exactly the same way: fetch the page using
LWP::Simple, split the page into sections, and extract the blog entry from
each section. sun-pic.pl is a quick, hackish attempt to bypass the
popups and to work around The Sun's horrible site, which tends to crash Mozilla. It's
called with the address of the page, grabs the images from the popups,
and puts them in a specific directory. It's not meant to be useful to anyone
else, other than as an example of a quick and dirty script that's different
from the other examples here. If you're interested, read the comments in
the script.
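The recipe can be sketched like this. The sample page and its <h2> markers are hypothetical stand-ins; in the real scrapers the HTML comes from LWP::Simple's get(), shown here only in a comment so the parsing stands on its own:

```perl
#!/usr/bin/perl
# Sketch of the common recipe: fetch, split into sections, extract each entry.
use strict;
use warnings;

# In the real scrapers the page comes from LWP::Simple:
#   use LWP::Simple;
#   my $page = get('http://example.com/blog') or die "Couldn't fetch page";
my $page = <<'HTML';
<h2>First post</h2><p>Hello.</p>
<h2>Second post</h2><p>World.</p>
HTML

# Split the page into sections, one per entry (hypothetical <h2> markers)
my @sections = split /<h2>/, $page;
shift @sections;    # discard everything before the first entry

my @entries;
for my $section (@sections) {
    # Extract the title and body from each section
    my ($title, $body) = $section =~ m{^(.*?)</h2>\s*<p>(.*?)</p>}s;
    push @entries, { title => $title, body => $body };
}
print "$_->{title}: $_->{body}\n" for @entries;
```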

I'll use Telsa's diary as an example, because the page layout is clear,
and the script I wrote is one of the better examples (I'd learned to use
the /x modifier for clarity in regexes by then).

Each entry starts with <dt>, so I use that as the
point at which to split. From each entry, I want to grab the anchor name,
the title (between the <strong> tags), and everything
that follows, until the </dd> tag.
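In sketch form, using a hypothetical fragment in that layout (the real script's regex differs in detail, but the /x modifier lets the parts be commented like this):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical fragment in the layout described: <dt> starts an entry,
# with an anchor name, a <strong> title, and the body running to </dd>.
my $page = <<'HTML';
<dt><a name="monday"></a><strong>Monday</strong></dt>
<dd>Fixed the bike.</dd>
<dt><a name="tuesday"></a><strong>Tuesday</strong></dt>
<dd>Rain again.</dd>
HTML

my @entries;
for my $chunk (split /<dt>/, $page) {
    my ($anchor, $title, $body) = $chunk =~ m{
        <a\s+name="([^"]+)">     # the anchor name, used for the item link
        .*?
        <strong>(.*?)</strong>   # the entry title
        .*?
        <dd>(.*?)</dd>           # everything that follows, up to </dd>
    }xs or next;
    push @entries, { anchor => $anchor, title => $title, body => $body };
}
print "$_->{anchor}: $_->{title}: $_->{body}\n" for @entries;
```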

Most of the scrapers follow this general recipe, but the Michael Moore
and Terry Pratchett scrapers have two important differences.

Michael Moore's blog, unlike most blogs, has the links for each item on a
separate part of the page from the content that's being scraped, so I have
a function to scrape the content again for the link:
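A sketch of the idea, with a hypothetical page layout and a made-up find_link() helper: the page is scraped a second time for an anchor whose text matches the title of an item already scraped from the content:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical page: the content in one part, the matching links elsewhere.
my $page = <<'HTML';
<div class="content"><h3>Slacker Uprising</h3><p>New film news.</p></div>
<div class="links"><a href="/blog/42">Slacker Uprising</a></div>
HTML

# Scrape the page again for the link whose text matches an item's title.
sub find_link {
    my ($html, $title) = @_;
    my ($href) = $html =~ m{<a\s+href="([^"]+)">\Q$title\E</a>};
    return $href;
}

my $link = find_link($page, 'Slacker Uprising');
print "$link\n";
```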

It's important to have a unique URL for each item in a feed, because most
feed readers use the link as a key, and will only display one entry for each
link.

The Terry Pratchett scraper is also different, in that instead of using
LWP::Simple, it uses LWP::UserAgent. Google wouldn't accept a request from my
script, so I used LWP::UserAgent to masquerade as a browser:
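A sketch of the masquerade: the agent string here is a made-up example (the point is simply that it isn't the default "libwww-perl" string Google was rejecting), and the URL is taken from the command line:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
# Pretend to be a browser: the default "libwww-perl/x.xx" agent string
# is what Google was refusing. This string is a made-up example.
$ua->agent('Mozilla/5.0 (X11; U; Linux i686; en-GB) Gecko/20040413');

# The URL comes from the command line so the sketch stays generic.
if (my $url = shift @ARGV) {
    my $res = $ua->get($url);
    die $res->status_line unless $res->is_success;
    print $res->content;
}
```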

One thing I'm still looking at is getting the news from The Sun. The problem
with this page is that it has some of the worst abuses of HTML I've ever seen.
This snippet uses HTML::TableExtract to extract most of the headlines.
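A reduced sketch of the technique: the sample page here is hypothetical and nests its headline table only one level deep, where the real page needed a depth of 6:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical sample: a headline table nested one level deep.
my $html = <<'HTML';
<table><tr><td>
  <table>
    <tr><td>First headline</td></tr>
    <tr><td>Second headline</td></tr>
  </table>
</td></tr></table>
HTML

my $te = HTML::TableExtract->new( depth => 1 );   # the real page needed depth => 6
$te->parse($html);

my @headlines;
for my $table ($te->tables) {
    for my $row ($table->rows) {
        push @headlines, grep { defined && /\S/ } @$row;
    }
}
print "$_\n" for @headlines;
```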

HTML::TableExtract is a nice module that lets you extract the text
content of any table. The "depth" option lets you select tables nested at a
given depth within other tables (the page grabbed by this script has most of its
headlines at a depth of 6 tables within tables, but there are others at a
depth of 7 -- I think I'll come back to that one). You can also specify a
"count" option to tell it which table to extract from, or a "headers" option,
which makes the module look for columns with those headers.

Lastly, I'd like to take a look at HTML::TokeParser::Simple. If I had
known about this module when I started writing screen scrapers, they would
be a lot easier to understand, and more resilient to change. The scraper for
Telsa's diary, for example, will break if the <a> tag
has a href attribute as well as a name attribute.

HTML::TokeParser::Simple is, as the name implies, a simplified version
of HTML::TokeParser, which allows you to look for certain tags within a
file. HTML::TokeParser::Simple gives a number of methods with a prefix of
either "is_" or "return_" that tell you if a tag is a certain type or returns
it, respectively. HTML::TokeParser::Simple also inherits from
HTML::TokeParser, so it has full access to HTML::TokeParser's methods.

The Telsa scraper using HTML::TokeParser::Simple looks like this (text version):
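A sketch of the approach, using a hypothetical fragment in the diary's layout; note that, unlike the regex version, it doesn't care whether the <a> tag carries extra attributes:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical fragment in the diary's layout, with an extra href
# attribute that would have broken the regex-based scraper.
my $html = <<'HTML';
<dl>
<dt><a name="monday" href="#monday"></a><strong>Monday</strong></dt>
<dd>Fixed the bike.</dd>
</dl>
HTML

my @entries;
my $p = HTML::TokeParser::Simple->new(\$html);
while (my $token = $p->get_token) {
    # Look for an <a> tag that has a name attribute
    next unless $token->is_start_tag('a') && defined $token->get_attr('name');
    my $anchor = $token->get_attr('name');

    $p->get_tag('strong');              # skip ahead to the title
    my $title = $p->get_text('/strong');

    $p->get_tag('dd');
    my $body = $p->get_text('/dd');     # everything up to </dd>

    push @entries, { anchor => $anchor, title => $title, body => $body };
}
print "$_->{anchor}: $_->{title}: $_->{body}\n" for @entries;
```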

Jimmy has been using computers from the tender age of seven, when his father
inherited an Amstrad PCW8256. After a few brief flirtations with an Atari ST
and numerous versions of DOS and Windows, Jimmy was introduced to Linux in 1998
and hasn't looked back.

In his spare time, Jimmy likes to play guitar and read: not at the same time,
but the picks make handy bookmarks.