Spidering Hacks

Editor's note: This week we offer two hacks from Spidering Hacks that will save you time and extra trips to your favorite web sites. And check back to this space next week for two more hacks from the book; the first will be on scraping all the URLs in a specified subcategory of the Yahoo directory; the second will be on using a bit of Perl to quickly find the word you're looking for in either an online dictionary or thesaurus.

Hack #24: Painless RSS with Template::Extract

Wouldn't it be nice if you could simply visualize what
data on a page looks like, explain it in template form to Perl, and not bother
with the need for parsers, regular expressions, and other programmatic logic?
That's exactly what Template::Extract helps
you do.

One thing that I'd always wanted to do, but never got
around to doing, was produce RSS files for all those news
sites I read regularly that don't have their own RSS feeds. Maybe I'd read them
more regularly if they notified me when something was new, instead of requiring
me to remember to check.

One day, I was fiddling about with the Template Toolkit (http://www.template-toolkit.com/)
and it dawned on me that all these sites were, at some level, generated with
some templating engine. The Template Toolkit takes a template and some data and
produces HTML output. For instance, if I have the following Perl data
structure:

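A sketch of the idea, with invented headline data (the field names, URLs, and
template here are purely illustrative):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Template;

    # Invented data: a list of headlines, each with a title and a URL.
    my $vars = {
        headlines => [
            { url => "http://example.com/one.html", title => "First headline"  },
            { url => "http://example.com/two.html", title => "Second headline" },
        ],
    };

    # A template that loops over the headlines, one list item per headline.
    my $template = <<'END';
    <ul>
    [% FOREACH item = headlines %]<li><a href="[% item.url %]">[% item.title %]</a></li>
    [% END %]</ul>
    END

    # The Template Toolkit combines the data and the template, printing:
    #   <ul>
    #   <li><a href="http://example.com/one.html">First headline</a></li>
    #   <li><a href="http://example.com/two.html">Second headline</a></li>
    #   </ul>
    my $tt = Template->new;
    $tt->process(\$template, $vars) or die $tt->error;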
Okay, you might think, very interesting, but how does this relate to scraping
web pages for RSS? Well, we know what the HTML looks like, and we can make a
reasonable guess at what the template ought to look like, but we want only the
data. If only I could apply the Template Toolkit backward somehow: given the
HTML output and a template that could conceivably have generated it, I could
retrieve the original data structure, and from there, generating RSS from that
data would be a piece of cake.

Like most brilliant ideas, this is hardly original, and an equally brilliant
man named Autrijus Tang not only had the idea a long time
before me, but—and this is the hard part—actually worked out how to implement
it. His Template::Extract Perl module (http://search.cpan.org/author/AUTRIJUS/Template-Extract/)
does precisely this: extract a data structure from its template and output.

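Putting those pieces together for a single page might look something like the
following sketch (the page URL, the extraction template, and the feed details
are all hypothetical):

    #!/usr/bin/perl
    # Scrape one (hypothetical) news page and print an RSS feed for it.
    use strict;
    use warnings;
    use LWP::Simple qw(get);
    use Template::Extract;
    use XML::RSS;

    my $url  = "http://news.example.com/";
    my $html = get($url) or die "Couldn't fetch $url";

    # Describe what the page looks like; [% ... %] skips text we don't care about.
    my $template = <<'END';
    <ul>[% FOREACH item %]
    <li><a href="[% item.url %]">[% item.title %]</a></li>[% ... %]
    [% END %]</ul>
    END

    # Run the Template Toolkit "backward": output + template => data structure.
    my $data = Template::Extract->new->extract($template, $html)
        or die "Template didn't match the page";

    # From here, RSS really is a piece of cake.
    my $rss = XML::RSS->new(version => '1.0');
    $rss->channel(
        title       => "Example News",
        link        => $url,
        description => "Headlines scraped from $url",
    );
    $rss->add_item(title => $_->{title}, link => $_->{url}) for @{ $data->{item} };
    print $rss->as_string;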
You see, it's a shame to have solved such a generic problem—scraping a web
page into an RSS feed—in such a specific way. Instead, what I really use is the
following CGI driver, which allows me to specify all the details of the site and
the RSS in a separate file:

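A rough sketch of the shape such a driver might take, assuming a simple
per-site file whose first line is the page URL, whose second line is the feed
title, and whose remainder is the extraction template; the file format and
names here are invented for illustration, not taken from the actual driver,
but the point is the same: adding a new site means writing a new configuration
file rather than new code.

    #!/usr/bin/perl
    # site2rss.cgi (hypothetical): ?site=somesite reads somesite.conf, emits RSS.
    use strict;
    use warnings;
    use CGI qw(param header);
    use LWP::Simple qw(get);
    use Template::Extract;
    use XML::RSS;

    # Take a site name, letters and digits only, and read its config file.
    my ($site) = (param('site') || '') =~ /^(\w+)$/
        or die "Need a site=name parameter";
    open my $conf, '<', "$site.conf" or die "Can't read $site.conf: $!";
    chomp(my $url   = <$conf>);
    chomp(my $title = <$conf>);
    my $template = do { local $/; <$conf> };

    my $html = get($url) or die "Couldn't fetch $url";

    # Assumes the template marks each entry with item.title and item.url.
    my $data = Template::Extract->new->extract($template, $html)
        or die "Template didn't match $url";

    my $rss = XML::RSS->new(version => '1.0');
    $rss->channel(title => $title, link => $url, description => $title);
    $rss->add_item(title => $_->{title}, link => $_->{url})
        for @{ $data->{item} };

    print header(-type => 'application/xml'), $rss->as_string;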
Template::Extract is a brilliant new way of doing
data-directed screen scraping for structured documents, and it's especially
brilliant for anyone who already uses Template to turn
templates and data into HTML. Also look out for Autrijus's latest crazy idea,
Template::Generate (http://search.cpan.org/author/AUTRIJUS/Template-Generate/),
which provides the third side of the Template triangle,
turning data and output into a template.

Simon Cozens

Hack #37: Downloading Comics with dailystrips

It's hard to believe that, across all the
cultures of the Internet, there's one common denominator of humor. Can you guess
what it is? No, no; it's not the "All Your Base Are Belong to Us" videos. It's
the comic strip. Whether you're into geek humor, political humor, or unfortunate
youngsters forever failing to kick a football, there's a comic strip for
you.

In fact, there may be several comic strips for you. There may be so many that
it's a pain to visit all the sites containing said comic strips to view them.
But there's a great piece of software available to ease your woes: dailystrips grabs all the strips for you,
presenting them in one HTML file. Combine it with cron
[Hack #90] and you've got a great daily comic strip supplement right in your
mailbox or web site. The author, Andrew Medico, makes it clear
that if you set this up to run on a web site, you must ensure that you've
configured your site to restrict access to you alone or risk some legal
consequences.

Getting the Code

dailystrips is available at http://dailystrips.sourceforge.net/,
and this hack covers Version 1.0.27. There are two components: the program
itself and the definitions file, which defines the details of the available
comic strips. As of this writing, dailystrips
supports over 500 different comic strips. Once you've downloaded the program, go
back to the download page and grab the latest definitions file, which is updated
often. Save it over the strips.def file that comes
packaged in the ZIP archive with the application.

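To run it, pass the names of the strips you want on the command line; the
names come from the definitions file (dilbert here is just an example), and
--local saves the result to disk instead of printing it:

    ./dailystrips --local dilbert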
While the program is running, you'll get a count of any errors in retrieving
the images of the strips. From my experiments, it looked like the nonsyndicated
comics were easier to get and more consistent than the syndicated ones.

Once the program is finished, it'll either spit some HTML to STDOUT or, if
you've used the --local option, save the strips to an HTML file named after
the current date. The file is saved in the dailystrips directory.

Hacking the Hack

In this hack, we're not hacking the hack so much as hacking the defs file. The
defs file defines where each strip is retrieved from and the code snippets
used to retrieve it. The defs file also includes groups, which are shortcuts
for retrieving several comics at once. More extensive information on how to
define strips is available in the README.DEFS file.

Defining strips by URL

The first way to define new strips is by generating a URL based on the
current date. Here's an example for James Sharman's "Badtech" comic:

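(The imageurl pattern below is illustrative rather than the site's real
layout; README.DEFS and the shipped strips.def document the exact syntax.)

    strip badtech
      name Badtech
      artist James Sharman
      homepage http://www.badtech.com/
      type generate
      imageurl http://www.badtech.com/strips/%-y%-m%-d.jpg
      provides any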
The first line specifies a unique strip name that you'll use to add the strip
to a group or get it from the command line. The second line, name, specifies the name of the strip to display in the HTML
output. Next, artist includes the name of the
illustrator, which will also display in the HTML output. The fourth line
determines the home page of the strip, and the fifth line specifies how the
strip is found. In this case, we're generating a URL. imageurl specifies the URL of the comic, and %-y, %-m, and %-d specify the year, month, and day, respectively.

The final line, provides, indicates which types of
strips the definition can provide: either any for a
definition that can provide the strip for any given date, or latest for a definition that can provide only the current
strip.

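Defining strips by search pattern

The second way to define a strip is to search a page for the strip's image,
using a regular expression. Here's a hypothetical definition (the site and the
pattern are invented for illustration):

    strip samplestrip
      name Sample Strip
      homepage http://www.example.com/comics/
      type search
      searchpattern <img src="(/strips/current.*?\.gif)">
      matchpart 1
      baseurl http://www.example.com
      provides latest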
Notice that the options are similar to the options in the previous example.
The strip, name, and homepage options function as they do in the first example,
but the type option is now search.
With this type, you need to include a searchpattern,
which specifies a Perl regular expression that will match the strip's URL. The
matchpart line tells the script which parenthetical
section to match. In this example, there's only one parenthetical section.

baseurl is necessary only when the searchpattern line does not match a full URL (as in this
instance). When specified, it's prepended to whatever the regular expression of
searchpattern matches.

Gathering strips into a group

If you want to get a set of the same comic strips every day, it's kind of a
pain to type them all in. dailystrips lets you specify a
group name that gathers several comic strips at the same time. Groups go at the
top of the definitions file and look like this:

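(The group name and the strips included here are only examples; each name must
match a strip defined elsewhere in strips.def.)

    group mine
      desc My favorite daily strips
      include dilbert userfriendly foxtrot
      include garfield doonesbury peanuts pennyarcade
      include megatokyo sluggy goats pvp
    end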
group is the name of the group, and desc is its descriptive blurb. On each line after that, use
the word include and whatever strips you want gathered
into the group. As you can see, there are 11 strips in this group. When you're
finished, put end on its own line. You call groups of
strips with a @, as in this example:
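For instance, to fetch everything in the hypothetical group above and save it
to disk:

    ./dailystrips --local @mine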