Monday, July 28, 2008

News Categories at Radio NZ

We've just completed a major update to the news section of the site that allows us to categorise news content.

We use the MySource Matrix CMS, and categorising content is a trivial exercise using the built-in asset listing pages. The tricky part in our case is that none of our news staff actually write their content directly in the CMS, or even have logins. Why?

When we first started using Matrix, the system's functionality was much more limited than it is today (in version 3.18.3). For example, paint layouts - which are used to style news items - had not yet been added to the system.

The available features were perfectly suitable for the service that we were able to offer at that time though.

The tool used by our journalists is iNews - a system optimised for processing news content in a large news environment - and it was significantly faster than Matrix for this type of work (as it should be). Because staff were already adept at using it, we decided to use iNews to edit and compile stories and export the result to the CMS.

This would also mean that staff didn't have to know any HTML, and we could add simple codes to the text to allow headings and other basic web formatting. It would also dramatically simplify initial and on-going training.

The proposed process required two scripts. The first script captured the individual iNews stories, ordered them, converted the content to HTML, and packaged them with some metadata into an XML file. (iNews can output a group of stories via FTP with a single command.)
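To make the packaging step concrete, here is a minimal sketch in Python. It is not the actual export script: the element names (publish, story, headline, body) and the shape of the story dictionaries are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET


def package_stories(stories, publish_time):
    """Wrap an ordered list of stories in a single XML document.

    `stories` is a list of dicts with hypothetical keys "headline" and
    "body_html"; `publish_time` is a stamp such as "200807221513".
    """
    root = ET.Element("publish", {"time": publish_time})
    for position, story in enumerate(stories, start=1):
        # Record the list position so the importer can preserve story order.
        item = ET.SubElement(root, "story", {"order": str(position)})
        ET.SubElement(item, "headline").text = story["headline"]
        # ElementTree escapes the HTML markup when serialising.
        ET.SubElement(item, "body").text = story["body_html"]
    return ET.tostring(root, encoding="unicode")
```

The order attribute is what would let an importer reproduce the running order chosen in the newsroom system.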

The XML was copied to the CMS server, where the second script imported the specified content. Each block of stories was placed in a folder in the CMS. The most recent folder represented the current publish, and was stamped with the date and time. The import script then changed the settings on the site's home page headline area, the RSS news feed and the News home to list the stories in this new folder.

Stories appeared on the site as they were ordered in the iNews system, and the first five were automatically displayed on the home page.

On the site, story URLs looked like this:

www.radionz.co.nz/news/200807221513/134e5321

and each new publish replaced any previous versions:

www.radionz.co.nz/news/200807221545/2ed5432

On the technical side, the iNews processing script ran once a minute via a cron job, but over time we found two problems with this approach.

The first was that the URL for a story was not unique over time - each update got a new URL. RSS readers had trouble working out what was new content and what was just a new publish of the same story. People linking to a story could end up with a stale version of the content, depending on when it was last updated.

The second related to the 1 minute cycle time of the processing script. Most of the time this worked fine, but occasionally we'd get a partial publish when the script started before iNews had finished publishing. On rare occasions we'd end up with two scripts trying to process the same content.

The Update

The first thing we had to do was revise the script for importing content. This work was done by Mark Brydon, one of the developers at Squiz. The resulting script allowed us to:

add a new story at a specific location in Matrix.

update the content in an existing story (keeping the URL).

remove a story.

put stories into a folder structure based on the date.

I provided some pseudo-code and XML and Mark did the rest, with a fair bit of testing and discussion along the way to get the script right. Revise actually isn't a strong enough word - Mark merged our four import scripts into one, refactored common code into functions, and brought it all up to Squiz coding standards.

One of the early design decisions was to use SHA1 hashes to compare content when updating. As you'll see later it made the script more flexible as we fine-tuned the publishing process. Initially the iNews exporter generated SHA1s based on the headline and bodycopy and these were stored in the spare fields in the Matrix news asset. These values could be checked to determine if content had changed.
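The change check can be sketched roughly as follows. This is an illustration only: it assumes a single combined hash over the headline and bodycopy, and the function names are made up for the example.

```python
import hashlib


def content_hash(headline, bodycopy):
    """Return a SHA1 hex digest covering the headline and the bodycopy."""
    digest = hashlib.sha1()
    digest.update(headline.encode("utf-8"))
    # A separator byte keeps ("ab", "c") distinct from ("a", "bc").
    digest.update(b"\x00")
    digest.update(bodycopy.encode("utf-8"))
    return digest.hexdigest()


def has_changed(stored_hash, headline, bodycopy):
    """Compare the hash stored with the CMS asset against the incoming text."""
    return stored_hash != content_hash(headline, bodycopy)
```

Comparing two short hex strings is cheaper than comparing full story bodies, and the stored value fits in a single spare field.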

The second task was to update the iNews exporter to generate the new XML. This proved to be a small challenge as I wanted to run the old and the new import jobs on the same content at the same time. Live content generated by real users is the best test data you can get, so new attributes were added to the XML where required to support this.

The first 3 weeks of testing were used to streamline the export script and write unit tests for the import script. I also added code to the exporter to process updates and removals of stories.

Add. This mode is simple enough - if the story was not in the system, add it.

Update. The update function used the headline of the story to find a match with an existing story on the site. We limited the match to content from the last 24 hours.

This created a problem though - if the headline was changed the system would not be able to find the original. To get around this I created the 'replace' mode. To replace a headline staff would go to the site and locate the story they wanted, capture the last segment of the URL, and paste this into the story with a special code.

In practice this proved to be unwieldy and unworkable. It completely interrupted the flow of news processing, and we dropped it after only 24 hours of testing.

As an aside, the purpose of a long test period is to solve not only technical issues, but also operational ones. The technology should facilitate a simple work-flow that allows staff to get on with their work. The technical side of things should be as transparent as possible; it is the servant, not the master.

What was needed was a unique ID that stayed with a story for its life in the system. iNews does assign a unique ID to every story, but these are lost when the content is duplicated in the system or published. After looking at the system again, I discovered (and I am kicking myself for not noticing earlier) that the creator id and timestamp are unique for every story, and are retained even when copies are made.

It was a simple matter to derive a SHA1 from this data, instead of the headline, and use that for matching stories in the import script. Had I not used a spare field in the CMS to hold the SHA1, we'd have had to rework the code.
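A minimal sketch of deriving such a key, assuming the creator id and timestamp are available as strings. The function name and the separator character are hypothetical.

```python
import hashlib


def story_key(creator_id, created_timestamp):
    """Derive a stable identifier from fields that survive copying in iNews.

    The headline is deliberately not part of the input, so editing it
    leaves the key unchanged.
    """
    raw = f"{creator_id}|{created_timestamp}"
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()
```

Because the key depends only on who created the story and when, it stays the same through any number of headline or bodycopy edits.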

After a couple of days of testing, the new SHA1 worked perfectly - staff could update the headline or bodycopy of any story in iNews, and when published it would update on the test page without any extra work.

This updated process allowed staff to have complete control over the listing order and content of stories simply by publishing them as a group. If only the story order was altered, the updated time on the story was not changed.

It has worked out to be very simple, but effective.

Kill. To kill a story a special code is entered into the body of the story. The import script sets the mode to kill and the CMS importer purges it from the system.
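The three modes can be summarised in a short sketch. It uses a plain dictionary as a stand-in for the CMS, and the field names ("key", "kill", "content_hash", "updated") are assumptions for illustration rather than the real importer's data model.

```python
def apply_story(store, story, now):
    """Apply one incoming story to `store` and return the action taken.

    `store` maps a story key to its stored fields; `now` is the time of
    this publish.
    """
    key = story["key"]
    if story.get("kill"):
        # Kill: purge the story if it is present.
        store.pop(key, None)
        return "kill"
    existing = store.get(key)
    if existing is None:
        # Add: the story is not in the system yet.
        store[key] = {
            "headline": story["headline"],
            "body": story["body"],
            "content_hash": story["content_hash"],
            "updated": now,
        }
        return "add"
    if existing["content_hash"] == story["content_hash"]:
        # Same content (for example, only the order changed):
        # leave the updated time alone.
        return "unchanged"
    # Update: replace the content in place, so the URL can stay the same.
    existing.update(
        headline=story["headline"],
        body=story["body"],
        content_hash=story["content_hash"],
        updated=now,
    )
    return "update"
```

Publishing the same story twice leaves its updated time untouched, which matches the behaviour described above for order-only changes.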

Because of all the work done on the iNews export script, I decided to fix the issues mentioned above - partial publishes, the 1 minute cycle time, and two scripts working at once.

The new script checks for content every 3 seconds, waits for iNews to finish publishing, and uses locking to avoid multiple jobs clashing. I'll cover the gory details of the script in a later post.
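Until then, here is one way such a loop could be put together, as a sketch only. It assumes that "finished publishing" can be detected by the export directory looking the same on two consecutive checks, and that an exclusively created lock file is enough to keep jobs apart. Neither is necessarily how the real script works.

```python
import os
import time


def snapshot(directory):
    """Return {filename: (size, mtime)} for the files in `directory`."""
    result = {}
    for entry in os.scandir(directory):
        if entry.is_file():
            info = entry.stat()
            result[entry.name] = (info.st_size, info.st_mtime)
    return result


def wait_until_settled(directory, interval=3.0, sleep=time.sleep):
    """Poll until the directory is non-empty and unchanged between checks."""
    previous = snapshot(directory)
    while True:
        sleep(interval)
        current = snapshot(directory)
        if current and current == previous:
            return current
        previous = current


def try_lock(lock_path):
    """Create the lock file atomically; return False if it already exists."""
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    return True


def release_lock(lock_path):
    """Remove the lock file so the next job can run."""
    os.remove(lock_path)
```

A job would call try_lock first, process the settled files only if it got the lock, and call release_lock in a finally block so that a crash does not leave the lock behind. The lock file should live outside the watched directory.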

Summary

The new scripts and work processes are now being used for live content. Each story gets a category code plus an optional code to put it in the top 5 on the home page. The story order in iNews is reflected on the site, and it is possible to correct old stories. It's all very simple to use and operate, and doesn't get in the way of publishing news.

And work continues to make the publishing process even simpler - I am looking at ways to remotely move content between categories and to simplify the process to kill items.