Headlines in the News

I find myself spending a lot of time participating in online discussion areas. Originally, all we had was Usenet. However, the concept of a “Web-based community” has finally taken hold. These communities usually provide some sort of message-based system (often with threading and separate discussion areas for topics) and frequently an HTML or Java-based “interactive chat” area.

I frequent one such Web community called “the Perl Monastery” (http://www.perlmonks.org). The community is active, posting dozens of messages every day, and is frequented by some sharp people, quick to answer questions. A recent posting piqued my interest. “Jcwren,” a user, suggested a series of “contests” to motivate people to show off while thinking of new solutions. He decided to kick off the first contest himself, awarding a “Perlmonks T-shirt” to the winner (funded out of his pocket).

The contest was to last only a week, and we’re midway through the week as I write this, so I can’t tell you the winner. It won’t be me, because Jcwren deliberately disallowed entries from the senior participants of the Monastery (called “saints”), of which I seem to be one.

I gave it a whack anyway. It was a nice challenge regarding a problem that’s becoming more and more common in Web-based solutions: the repackaging of information. I think we’re going to see more and more “middleware” on the Net (sites that act as brokers or meta-searchers), so I’m constantly researching to see what can be done to help.

The basic problem was to create a headline list for state-based headlines. CNN’s interactive news ticker delivers this information as a far-too-flashy pop-up window. However, the data file that it refreshes was easily reverse-engineered, and the URL and file format of that internal data file have apparently been stable for months.

Jcwren asked for a command-line program (not CGI) that fetches CNN’s internal data file each time that it is invoked. He expected this to be done from cron every 10 minutes or so. Any new headlines that were found there were to be remembered in an unspecified “database.” This was to be done as simply as possible, as he wanted this to run easily on both Unix and Windows. New headlines were to be timestamped on their first observation (there are no timestamps in the source data, so this was as close as we got to a freshness factor).

To keep the database from becoming a history book, each headline was to be aged out when it had not been seen in a specified amount of time (default one day). As long as CNN was still showing the headline, it would stay alive for at least this much longer.

To make it even more fun, the headlines had to be organized by state with a clickable set of links at the top of the output. All 50 states (plus DC) needed to always be present, but only states with current news were to be active links (which would scroll down within the document to that state). Everything had to be alphabetized, of course.

Further, the output was to be an HTML file (selectable, default index.html in the current directory) with a meta-refresh tag so that a browser window could be kept open on it.

I was curious about how long it would take me to write the program. I guessed around 90 minutes, and the first draft of the program was completed in just under that. I’ve since done about a half hour of tweaking. The program, which I will now describe, is present in Listing One (pg. 88).

Lines one through three start nearly every program I write, enabling warnings, turning on the normal compiler restrictions for large programs, and disabling the normal buffering of standard output.
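Since the listing itself isn't reproduced here, that standard boilerplate presumably looks something like this:

```perl
#!/usr/bin/perl -w
# -w enables warnings; strict enforces the usual large-program restrictions
use strict;
$|++;   # disable normal buffering of standard output
```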

Lines seven through 11 define the configurable constants used by this program that don’t make sense to override from the command line. The $CNN_URL is the source of our information. This program depends on the URL providing consistent data, so if it moves or changes format, you’re just out of luck. The $CNN_CACHE file is a local mirror of that remote URL. $DB_MEMORY holds our “database” in whatever format dbmopen selects (which will most often be Berkeley DB).
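A sketch of those constants follows; the actual CNN ticker URL isn't reproduced in this article, so the one below is strictly a placeholder:

```perl
my $CNN_URL   = "http://www.cnn.com/ticker/data.txt";  # placeholder, not the real URL
my $CNN_CACHE = "cnn_cache.txt";    # local mirror of the remote data file
my $DB_MEMORY = "cnn_headlines";    # base name; dbmopen picks the real filename(s)
```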

Line 13 pulls in the CGI module. No, this isn’t a CGI program. However, I am generating HTML, so I’m using the HTML generation shortcuts. It also happens that the CNN input format is nearly identical to uploaded form data, which I quickly recognized as a chance to leverage existing code. The CGI module, as of this writing, doesn’t include the HTML 4.01 standard col, thead, and tbody generation methods, so I added them.

Line 14 pulls in the mirror routine from LWP::Simple (part of the CPAN-installable LWP suite).

Similarly, line 15 brings in GetOptions from the standard Getopt::Long module.

We then parse the command-line arguments in lines 17 through 23. Four variables are declared with initial values, and GetOptions alters those values if the right command-line arguments are present. You can see the GetOptions documentation for additional details (however, this should be readable as-is).
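That argument parsing would look roughly like the following. The option names and spellings here are guesses based on the requirements described above (output file, age limit, refresh interval, and a database-clearing flag), not a transcription of the listing:

```perl
use strict;
use Getopt::Long;

my $output  = "index.html";   # where the generated HTML goes
my $age     = 1;              # days before an unseen headline expires
my $refresh = 10 * 60;        # meta-refresh interval, in seconds
my $clear   = 0;              # wipe the database before starting?

GetOptions(
    "output=s"  => \$output,
    "age=f"     => \$age,
    "refresh=i" => \$refresh,
    "clear"     => \$clear,
) or die "Usage: $0 [--output=file] [--age=days] [--refresh=secs] [--clear]\n";
```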

Beginning in line 25, we need a list of states. Note the split that breaks the items on either the embedded comma or the ending newline of each line.
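That split is worth a closer look: splitting the whole block of text on either a comma or a newline yields an alternating list of abbreviations and names, which drops straight into a hash. A minimal sketch (with the state list truncated for illustration):

```perl
# The real table lists all 50 states plus DC; three entries shown here.
my $list = "AK,Alaska\nAL,Alabama\nDC,District of Columbia\n";

# split on comma OR newline => ("AK", "Alaska", "AL", "Alabama", ...),
# which assigns to a hash as alternating key/value pairs
my %state_name = split /,|\n/, $list;
```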

Lines 35 to 40 get the “current” information. Because we are maintaining a cache, we can use mirror, which minimizes the transfer cost. The request made to the server includes an “if modified since” header; if the information has not changed since this time, the server returns a quick “304 error” to tell you that you have it already. When new information arrives, the timestamp on the file is set to the “last modified” header (if present), so that the next request has the right “if modified since” header to repeat the process. Slick. Normal expected returns are status 200 (we’ve got a new file) and 304 (we already have the data). Anything else is broken, so we abort quickly.

Line 42 opens the database. This is a simple “hash on disk” database, so we use dbmopen to let it pick the type and naming for us. (This database is cleared in line 48 if the right command-line parameter is present.)
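In sketch form (variable names are assumptions; dbmopen falls back to whatever DBM library your Perl was built with):

```perl
my $DB_MEMORY = "cnn_headlines";  # base name; dbmopen appends its own extensions
my $clear     = 0;                # would be set by the database-clearing flag

# "hash on disk": %db now persists between runs
dbmopen(my %db, $DB_MEMORY, 0644) or die "Cannot open $DB_MEMORY: $!";
%db = () if $clear;               # wipe the database on request
```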

Next, we set up the input and output streams (lines 43 and 44). The input is the file fetched from CNN. The output is the HTML file, except that we don’t want to overwrite the real file just yet, so we’ll append a tilde to the filename (my editor’s backup file convention, so I have scripts to clean those up). After finishing the file, I’ll rename this temporary file over the top of the real data in one fell swoop, so the browser will never see a partial content. This is an important strategy for uncooperative processes.
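The write-then-rename idiom is simple enough to show in full (the page body here is obviously a stand-in):

```perl
my $output = "index.html";   # selectable via the command line in the real program

# write to a temporary name first...
open my $out, ">", "$output~" or die "Cannot create $output~: $!";
print $out "<html><body>page body goes here</body></html>\n";
close $out or die "Error closing $output~: $!";

# ...then swap it into place in one step, so a browser reloading
# the page never sees partial content
rename "$output~", $output or die "Cannot rename $output~ to $output: $!";
```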

Line 46 processes STDIN using the CGI module’s ability to parse a form. Assigning the output to $CGI::Q means that we get to use param and friends without having to use the ugly, and nearly always unnecessary, object-oriented form of invocation.
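To see why CGI.pm is such a good fit, consider what decoding that data by hand would look like. The field names below come from the article's description; the sample values are made up:

```perl
# The CNN feed resembles URL-encoded form data: key=value pairs
# joined by '&', with '+' for spaces and %XX hex escapes.
my $data = "headline1=Flood+waters+rise&state1=OR";

my %param;
for my $pair (split /&/, $data) {
    my ($key, $value) = split /=/, $pair, 2;
    $value =~ tr/+/ /;                            # '+' encodes a space
    $value =~ s/%([0-9A-Fa-f]{2})/chr hex $1/ge;  # decode %XX escapes
    $param{$key} = $value;
}
```

CGI.pm's param() does all of this (and more) for free, which is exactly the code reuse being leveraged here.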

We then pass over the data three times. The first pass (beginning in line 50) looks for all “parameters” from the input data of the form headlineN, where N begins at 1 and increases (to about 100, judging from the data I saw while testing). The headline is stuffed into $headline, then the corresponding state jumps into $state.

We’ve now hopefully constructed a unique key of the state, a newline, and the headline. The corresponding value in the database is two integers separated by a space, both timestamps in Unix internal time format. The first number is when the headline first appeared (for display). The second number is the most recent time we’ve seen it (for aging purposes). So, if the key already exists, we update the second number to now, but if it doesn’t, we create a new entry with both numbers set to now.
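That create-or-update step, sketched with an ordinary hash standing in for the dbmopen'ed database:

```perl
my %db;   # stand-in for the on-disk database
my ($state, $headline) = ("OR", "Flood waters rise");  # made-up example

my $key = "$state\n$headline";   # state + newline + headline: unique (we hope)
my $now = time;

if (defined $db{$key}) {
    my ($first) = split ' ', $db{$key};
    $db{$key} = "$first $now";   # keep first-seen, refresh last-seen
} else {
    $db{$key} = "$now $now";     # brand-new headline: both timestamps are now
}
```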

On the second pass (beginning in line 62), we age out old data by looking at all the entries’ second numbers and deleting those that no longer qualify as fresh enough.
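That aging pass looks roughly like this (the sample entries are made up; 90,000 seconds is just past the one-day default):

```perl
my $age_limit = 24 * 60 * 60;   # default: one day, overridable on the command line
my $now = time;

my %db = (   # stand-in database: "first-seen last-seen" timestamp pairs
    "OR\nFresh story" => ($now - 100)    . " " . ($now - 100),
    "NY\nStale story" => ($now - 90_000) . " " . ($now - 90_000),
);

for my $key (keys %db) {
    my (undef, $last_seen) = split ' ', $db{$key};
    delete $db{$key} if $last_seen < $now - $age_limit;  # too old: gone
}
```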

Starting with line 68, it’s time to finally dump the data. For each of the keys (line 75), we pull out the state, headline, and first-seen timestamp as a three-element arrayref, which is then sorted by state, timestamp, and headline order.
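A map-then-sort sketch of that step, again with a plain hash and made-up entries standing in for the real database:

```perl
my %db = (   # key: "state\nheadline"; value: "first-seen last-seen"
    "OR\nSecond Oregon story" => "200 300",
    "OR\nFirst Oregon story"  => "100 300",
    "CA\nCalifornia story"    => "150 300",
);

my @data = sort {
    $a->[0] cmp $b->[0]          # state name
      or $a->[2] <=> $b->[2]     # first-seen timestamp
      or $a->[1] cmp $b->[1]     # headline text
} map {
    my ($state, $headline) = split /\n/;              # split the key
    [ $state, $headline, (split ' ', $db{$_})[0] ];   # first-seen only
} keys %db;
```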

Line 77 introduces %states_seen, which will be used to track the first appearance of each state in the sorted list and to figure out for which states to generate links at the top of the table.

Now comes the fun part — transforming the data into a table. First, we break each element of the @data array (line 89) into the three fields (line 81). Next, we create a table row (line 82) consisting of three cells (lines 83, 86, and 87). The first cell is either the state name (fixed so that it can’t wrap) or, on first appearance, the state name with an internal anchor. The second cell is an abbreviated portion of the localtime of the timestamp from when the headline first appeared. The final cell is the headline itself. Be particularly careful to properly encode this data as HTML entities if need be.

The next step is to generate the top of the HTML file (on STDOUT) handled in lines 91 to 93 with the right header, title, and meta-refresh information.

Now it’s time to generate the table. The cellspacing and cellpadding are personal choices (line 95). The next three lines give hints to standards-compliant browsers (unlike Netscape or IE) about the width and alignment of the three columns. Next comes the “table header,” one row, one cell (spanning three columns), of all the states. If a state was seen, a link to the proper internal anchor is generated; otherwise, a simple name is used. Again, the state names are guaranteed not to wrap. Finally, the table guts are dumped inside the “table body” tag.

Lines 108 to 110 finish the HTML page. Once this is complete, we rename the temporary output name to its final destination with line 112.

The two subroutines starting in line 115 handle some of the needed transformations. escapeHTMLbreak calls the CGI-module-provided escapeHTML routine to fix all the HTML entities but also changes all remaining spaces to non-breaking spaces. fixname crunches a string so that it’s a legal, unique, anchor name (for the expected dataset).
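Minimal stand-ins for those two helpers might look like this. The real escapeHTMLbreak delegates the entity work to CGI::escapeHTML rather than doing its own substitutions as shown here:

```perl
sub escapeHTMLbreak {
    my $text = shift;
    for ($text) {
        s/&/&amp;/g;            # entity-escape (CGI::escapeHTML's job
        s/</&lt;/g;             #  in the real program)
        s/>/&gt;/g;
        s/"/&quot;/g;
        s/ /&nbsp;/g;           # then forbid wrapping at spaces
    }
    $text;
}

sub fixname {
    my $name = shift;
    $name =~ s/\W+/_/g;         # crunch to a legal anchor name
    $name;
}
```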

That’s it. Stick it in a file somewhere (not in your Web server’s CGI area, and not necessarily in your PATH), and then run it frequently. You too will have the latest headlines from CNN. Hopefully, you’ll see a few new gizmos and gadgets to steal for your own code. Enjoy!