More Spidering Hacks

Editor's note: In last week's sample hacks, excerpted from Spidering Hacks, we showed you two workarounds that will save you time and extra trips to your favorite web sites. This week we offer two more hacks on grabbing--or scraping--the information you need, whether it's the link count for a particular Yahoo! category, or the quick answer for the word that's just on the tip of your tongue. Enjoy.

Hack #49: Yahoo! Directory Mindshare in Google

Yahoo! and Google are two
very different animals. Yahoo! indexes only a
site's main URL, title, and description, while Google builds full-text indexes
of entire sites. Surely there's some interesting cross-pollination when you
combine results from the two.

This hack scrapes all the URLs in a specified subcategory of the Yahoo!
directory. It then takes each URL and gets its link count from Google. Each link
count provides a nice snapshot of how a particular Yahoo! category and its
listed sites stack up on the popularity scale.

TIP: What's a link count? It's
simply the total number of pages in Google's index that link to a specific
URL.
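The overall flow described above (collect the links on a directory page, look up a link count for each, and sort) can be sketched roughly as follows. This is a minimal illustration in Python, not the book's Perl listing. The names `LinkCollector`, `extract_links`, and `rank_by_link_count` are hypothetical, and the Google lookup is left as a caller-supplied function because the original lookup code is not shown here.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, in document order, without repeats."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value not in self.links:
                self.links.append(value)


def extract_links(html):
    """Return the unique link targets found in a page of HTML."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


def rank_by_link_count(urls, link_count):
    """Pair each URL with its link count and sort from most to fewest links.

    link_count is any callable that maps a URL to an integer. It stands in
    for the Google lookup (an assumption, not the original implementation).
    """
    counts = [(url, link_count(url)) for url in urls]
    return sorted(counts, key=lambda pair: pair[1], reverse=True)
```

Passing the lookup in as a function keeps the sorting logic testable with canned numbers, whatever service ends up supplying the real counts.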

There are a couple of ways you can use your knowledge of a subcategory's link
count. If you find a subcategory whose URLs have only a few links each in
Google, you may have found a subcategory that isn't getting a lot of attention
from Yahoo!'s editors. Consider going elsewhere for your research. If you're a
webmaster and you're considering paying to have Yahoo! add you to their
directory, run this hack on the category in which you want to be listed. Are
most of the listed sites really popular? If they are, are you sure your site will
stand out and get clicks? Maybe you should choose a different category.

Running the Hack

The hack takes a single piece of configuration, the Yahoo! directory you're
interested in, as one quoted argument on the command line. If you don't pass a
directory of your own, a default directory is used instead.

% perl mindshare.pl "/Entertainment/Humor/Procrastination/"
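For readers following along in another language, the argument handling described above might look something like this in Python. It is only a sketch: `directory_from_args` is a hypothetical helper, and `DEFAULT_DIRECTORY` is an assumed placeholder, since the script's real default is not shown in this excerpt.

```python
import sys

# Assumed placeholder; the original script's default directory is not shown.
DEFAULT_DIRECTORY = "/Entertainment/Humor/Procrastination/"


def directory_from_args(argv, default=DEFAULT_DIRECTORY):
    """Return the directory given as the first argument, or the default.

    argv is expected to look like sys.argv: the script name first, then
    any arguments the user typed.
    """
    path = argv[1] if len(argv) > 1 else default
    # Normalize so the path always starts and ends with a slash.
    if not path.startswith("/"):
        path = "/" + path
    if not path.endswith("/"):
        path = path + "/"
    return path
```

In a script you would call `directory_from_args(sys.argv)` once at startup and build the directory URL from the result.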

Your results list the URLs in that directory, sorted by total Google
links.

Hacking the Hack

Yahoo! isn't the only searchable subject index out there, of course; there's
also the Open Directory Project (DMOZ, http://www.dmoz.org/), which is the product of
thousands of volunteers busily cataloging and categorizing sites on the Web — the
web community's Yahoo!, if you will. This hack works just as well on DMOZ as it
does on Yahoo!; they're very similar in structure.

Next, replace the lines that check whether a URL should be measured for
mindshare. When we scraped Yahoo! in our original script, every directory
entry was prefixed with http://srd.yahoo.com/, followed by the URL itself.
To make sure we had a proper URL, we skipped any link that did not match
that criterion.
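As a rough illustration of that Yahoo! check, here is a small Python sketch (not the original Perl). It keeps a link only if it starts with http://srd.yahoo.com/ and then pulls out the destination URL. The helper name `unwrap_yahoo_link` is hypothetical, and the assumption that the destination begins at the first "http://" after the prefix is ours; the exact redirect format is not documented in this excerpt.

```python
YAHOO_REDIRECT_PREFIX = "http://srd.yahoo.com/"


def unwrap_yahoo_link(link):
    """Return the destination wrapped in a Yahoo! redirect link, or None.

    Links that do not start with the redirect prefix are skipped (None),
    mirroring the "skip unless it matches" rule described in the text.
    """
    if not link.startswith(YAHOO_REDIRECT_PREFIX):
        return None
    # Assumption: the real URL starts at the first "http://" found
    # after the redirect prefix.
    start = link.find("http://", len(YAHOO_REDIRECT_PREFIX))
    if start == -1:
        return None
    return link[start:]
```

Returning `None` for anything that fails the check lets the calling loop drop non-listing links with a single test.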

Since DMOZ is an entirely different site, our checks for validity have to
change. DMOZ doesn't modify the outgoing URL, so our previous Yahoo! checks have
no relevance here. Instead, we'll make sure each link is a full-blooded
location (i.e., it starts with http://) and doesn't match any of
DMOZ's internal page links. Likewise, we'll ignore links that run searches on
other engines.
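The DMOZ checks described above could be sketched like this in Python (again, not the book's Perl). The function name `is_measurable` and the contents of `SKIPPED_HOSTS` are assumptions for illustration; the original list of engines to ignore is not shown here.

```python
from urllib.parse import urlparse

# Assumed examples: DMOZ itself plus a few other search engines.
SKIPPED_HOSTS = ("dmoz.org", "google.com", "yahoo.com", "altavista.com")


def is_measurable(link):
    """Return True if a DMOZ page link should be measured for mindshare."""
    # Relative and internal links do not start with http://, so skip them.
    if not link.startswith("http://"):
        return False
    host = urlparse(link).hostname or ""
    # Skip DMOZ's own pages and searches on other engines.
    for skipped in SKIPPED_HOSTS:
        if host == skipped or host.endswith("." + skipped):
            return False
    return True
```

Matching on the hostname rather than on the whole URL avoids discarding a legitimate site just because an engine's name appears somewhere in its path.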