11.4. An HTTP Authentication Example: The Unicode Mailing Archive

Most password-protected
sites (whether protected
via HTTP Basic Authentication or otherwise) are that way because the
sites' owners don't want just
anyone to look at the content. And it would be a bit odd if I gave
away such a username and password by mentioning it in this book!
However, there is one well-known site whose content is password
protected without being secret: the mailing list archive of the
Unicode mailing lists.

In an effort to keep email-harvesting bots from finding the Unicode
mailing list archive while spidering the Web for fresh email
addresses, the Unicode.org sysadmins have put a password on that part
of their site. But to let actual humans (not bots) reach the archive,
the administrators publicly state the username and password on an
unprotected page, http://www.unicode.org/mail-arch/, which also
links to the protected part.

The main Unicode mailing list (called unicode)
occasionally has a thread that is very interesting and well worth
reading, but it's buried among a thousand
other messages that aren't worth downloading, even in digest
form. Luckily, this problem meets a tidy solution with LWP:
I've written a short program that, on the first of
every month, downloads the index of all the previous
month's messages and reports how many messages
were posted under each topic.

The trick is that the web pages that list this information are
password protected. Moreover, the URL for the index of last
month's posts is different every month, but in a
fairly obvious way. The URL for March 2002, for example, is:

http://www.unicode.org/mail-arch/unicode-ml/y2002-m03/

Deducing the URL for the month that has just ended is simple enough:

# To be run on the first of every month...
use POSIX ('strftime');
my $last_month = strftime("y%Y-m%m", localtime(time - 24 * 60 * 60));
# Since today is the first, one day ago (24*60*60 seconds) is in
# last month.
my $url = "http://www.unicode.org/mail-arch/unicode-ml/$last_month/";
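The snippet above relies on being run on the first of the month. As a rough sketch of an alternative that works on any day, the month can be stepped back explicitly instead of subtracting a day. The helper name last_month_url is hypothetical, not part of the original program:

```perl
use strict;
use warnings;
use POSIX ('strftime');

# Hypothetical helper: returns the archive URL for the month before the
# one containing $now (an epoch time; defaults to the current time).
sub last_month_url {
    my ($now) = @_;
    $now = time unless defined $now;
    # localtime: element 4 is the month (0..11), element 5 is years since 1900.
    my ($mon, $year) = (localtime $now)[4, 5];
    # Step back one month, wrapping January to the previous December.
    if ($mon == 0) { $mon = 11; $year-- } else { $mon-- }
    # strftime takes (sec, min, hour, mday, mon, year); use noon on the 1st.
    my $stamp = strftime("y%Y-m%m", 0, 0, 12, 1, $mon, $year);
    return "http://www.unicode.org/mail-arch/unicode-ml/$stamp/";
}

print last_month_url(), "\n";
```

Passing an explicit time makes the helper easy to check for a known date, such as any day in April 2002, which should yield the y2002-m03 URL shown earlier.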

But getting the contents of that URL involves first providing the
username and password and realm name. The
Unicode web site doesn't publicly declare the realm
name, because it's an irrelevant detail for users
with interactive browsers, but we need to know it for our call to the
credential method. To find out the realm name, try
accessing the URL in an interactive browser. The realm will be shown
in the authentication dialog box, as shown in Figure 11-1.
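If no interactive browser is handy, the realm can also be read from the WWW-Authenticate header that a server sends with a 401 response to an unauthenticated request. The following is only a sketch: the helper name realm_from_challenge is hypothetical, and the sample header value is an assumption about what the server sends rather than something captured from the Unicode site.

```perl
use strict;
use warnings;

# Hypothetical helper: extract the realm from a WWW-Authenticate header
# value such as: Basic realm="Some Realm"
sub realm_from_challenge {
    my ($challenge) = @_;
    return unless defined $challenge;
    return $1 if $challenge =~ /\brealm\s*=\s*"([^"]*)"/i;
    return;
}

# With LWP, the value would come from the 401 response, for example:
#   my $challenge = $response->header('WWW-Authenticate');
# A sample value is used here so the sketch runs without a network.
my $sample = 'Basic realm="Unicode-MailList-Archives"';
print realm_from_challenge($sample), "\n";
```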

In this case, it's
"Unicode-MailList-Archives", which
is the last piece of information we need to make our request.
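A request along these lines can be sketched with LWP::UserAgent's credentials method, which takes a host:port string, the realm, and the username and password. This is a minimal sketch, not necessarily the exact code of the original program: the helper names are hypothetical, the port 80 is assumed from the plain http URL, and the username and password are read from the command line because this section does not state them (they are published on http://www.unicode.org/mail-arch/).

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: build a user agent that knows the archive's credentials.
sub authenticated_browser {
    my ($user, $pass) = @_;
    my $browser = LWP::UserAgent->new;
    $browser->credentials(
        'www.unicode.org:80',          # host:port the credentials apply to
        'Unicode-MailList-Archives',   # realm name from the dialog box
        $user => $pass,
    );
    return $browser;
}

# Hypothetical helper: fetch a page or die with the HTTP status line.
sub fetch_page {
    my ($browser, $url) = @_;
    my $response = $browser->get($url);
    die "Error getting $url: ", $response->status_line, "\n"
        unless $response->is_success;
    return $response->content;
}

# Run as: perl this_script.pl USERNAME PASSWORD
if (@ARGV == 2) {
    my $url  = "http://www.unicode.org/mail-arch/unicode-ml/y2002-m03/";
    my $html = fetch_page(authenticated_browser(@ARGV), $url);
    print length($html), " bytes fetched\n";
}
```

If the username, password, or realm name is wrong, the server answers with a 401 status and fetch_page dies with that status line instead of returning content.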

Assuming the authorization data is correct, the page is retrieved
as if it were a normal, unprotected page. From there, counting the
topics and noting the absolute URL of the first message of each
thread is a matter of extracting data from the HTML source and
reporting it concisely.
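The extraction step might look something like the following sketch. The markup it matches is an assumption (one link per message, with a numbered filename as the href and the subject as the link text); the real archive's HTML may differ, so the regular expression would need adjusting. The helper name count_topics and the sample subjects are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: given index HTML and the index page's URL, return a
# hash reference mapping each topic to { count => N, first => absolute URL }.
sub count_topics {
    my ($html, $base) = @_;
    my %topic;
    # Assumed markup: <a href="0001.html">Subject line</a>
    while ($html =~ m{<a\s+href="(\d+\.html)">([^<]+)</a>}gi) {
        my ($href, $subject) = ($1, $2);
        $subject =~ s/^\s*(?:Re:\s*)*//i;   # fold replies into their thread
        $subject =~ s/\s+$//;
        # The first message seen for a topic supplies the thread's URL.
        # $base ends in "/" and $href is a bare filename, so joining works.
        $topic{$subject} ||= { count => 0, first => $base . $href };
        $topic{$subject}{count}++;
    }
    return \%topic;
}

my $sample = <<'END_HTML';
<li><a href="0001.html">Hebrew vowel points</a>
<li><a href="0002.html">Re: Hebrew vowel points</a>
<li><a href="0003.html">Klingon proposal</a>
<li><a href="0004.html">RE: Re: Hebrew vowel points</a>
END_HTML

my $base   = "http://www.unicode.org/mail-arch/unicode-ml/y2002-m03/";
my $topics = count_topics($sample, $base);

# Report the busiest topics first, breaking ties alphabetically.
for my $t (sort { $topics->{$b}{count} <=> $topics->{$a}{count} || $a cmp $b }
           keys %$topics) {
    printf "%3d  %s\n     %s\n", $topics->{$t}{count}, $t, $topics->{$t}{first};
}
```

Plain concatenation is enough here only because the assumed hrefs are bare filenames. For arbitrary relative links, the URI module's new_abs constructor is the more general way to resolve a link against the index page's URL.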