Building a digital library of television news

Computers in Libraries
[June 2003]

Breeding, Marshall

Copyright (c) 2003 Information Today

Abstract: Breeding comments on his latest project, the creation of a large-scale digital collection of video content from the Vanderbilt Television News Archive. He shares the different phases of his project and offers the various costs and challenges during the development.

This issue of CIL provides the perfect opportunity for me to talk about one of the projects that I've been working on recently: creating a large-scale digital collection of video content from the Vanderbilt Television News Archive. Though it poses a number of technical challenges, this project has been one of the most exciting endeavors I've been involved with. In this column, I'll give a brief sketch of the technical aspects of the project. Space doesn't allow me to cover the business model for financing the operation or the dicey legal and copyright issues involved; these concerns pose much greater challenges than the technical issues.

The content of the Vanderbilt Television News Archive represents an important element of the American cultural experience: the evening news watched in millions of homes.

Today's news quickly becomes tomorrow's history. If you wanted to read about events of years past, you could go to your local library and browse through newspaper archives, whether on yellowed newsprint, on microfiche, or as digitized text. Libraries collect newspapers and magazines as valuable sources of historical information. The evening news programs are another great record of current events. Wouldn't it be great if you could go back in time and view the news programs of years long past? You can. The Vanderbilt Television News Archive has been recording the national news programs of the three major U.S. television networks since 1968. Our collection is the world's most extensive and complete archive of television news.

715,000 Snippets of History

The snippets of history found in our collection never fail to impress. All the major events of the last half of the 20th century are there. I've seen the eruption of Mount St. Helens in May 1980, the first Apollo missions to the moon in 1969, the signing of the peace agreement between Israel and Egypt, and the fall of the Berlin Wall. The collection includes extensive coverage of the Vietnam War, the Gulf War of 1991, and now the war in Iraq. Our video coverage spans the administrations of Richard Nixon, Gerald Ford, Jimmy Carter, Ronald Reagan, George H. W. Bush, Bill Clinton, and George W. Bush. In quantitative terms, the collection includes over 40,000 news programs divided into 715,000 video segments.

The Vanderbilt Television News Archive serves researchers all over the world. While a few visitors come to view materials on-site each month, the majority of its use takes place through loans of videotapes. Researchers can request a videotape of a complete news broadcast, or they can select segments from multiple programs to be compiled on a tape. The Archive charges fees to recover the costs of providing this service. Since copyright issues apply to the collection, the Archive does not sell or license content; it simply creates videotapes for viewing, which must be returned.

For the last year or so, I've been involved with this unit of the university, helping to implement technologies that will allow this vital collection to survive into the future. One of my main responsibilities as library technology officer for the Vanderbilt Libraries has been to oversee the transformation of this video archive into an all-digital operation.

The Struggle Against Media Format Obsolescence

Preserving a collection of video material requires constant attention. While paper may last for hundreds of years, none of the media that store video content seem to last more than a few decades. When the Archive began, it recorded programs using 1-inch Ampex tape. That medium became obsolete in the late 1970s, forcing the Archive to implement a new taping system based on 3/4-inch Sony U-Matic videotape cassettes. All the material recorded up to that time had to be transferred to the new tape format.

Once again, the obsolescence of a tape format is driving the Archive to undergo major changes. Within a few years, maintaining equipment capable of reading our current library of videotapes will become increasingly problematic, and eventually the format will become extinct. The survival of the collection depends on transferring it onto a new medium. Moving the collection onto a next-generation tape format would be one approach, but such a migration would inevitably provide only a short-term solution. Our strategy, which we believe will provide more long-term options for preserving the collection, lies in transforming the Archive to a digital environment. We plan not only to convert the existing collection to digital format but also to record all new programs directly to digital files. While no digital format will last forever, we believe that future migrations from one digital format to another will be automated, requiring a fraction of the effort of a tape-to-tape transfer.

Converting this archive to a digital operation will take some time. Planning has been underway for over a year. We are now testing the methodologies we've developed and are in the early implementation stage.

Phase 1: The Metadata

The first phase of the project involved creating a structured database of metadata. Without good metadata, a digital collection can be practically useless. Fortunately, from the earliest days of the Archive, extensive effort went toward describing and indexing the collection. Archive staff wrote abstracts of every news item and created indexes to help researchers find material. The abstracts and indexes originally took print form. In the mid-1990s, text files of the abstracts were placed on a Web server with a search engine. Producing electronic text files from the printed indexes involved scanning and OCR (optical character recognition), followed by extensive manual cleanup. While the text files and Web-based search engine provided better access than the print indexes, they had many limitations.

To get the most out of the descriptive information that the Archive staff created, we converted all the text files into a comprehensive database and built a set of applications that provide a user interface for searching the collection and for making online requests for materials. The result of this effort is the current Web site for the Archive, http://tvnews.vanderbilt.edu. The primary database currently consists of 715,000 records that include descriptive metadata elements such as the begin time, end time, duration, date, and network of the broadcast; the text of the abstract; and a list of the anchors and reporters. DB/TextWorks serves as the underlying database, providing rapid search capabilities and a full suite of search features. The Web site also uses databases for registered patrons and for tape loan requests. Following a layered and modular approach, we access the database through ODBC (Open Database Connectivity) using SQL syntax. All the programming for the user interface and the e-commerce system for online ordering is written in Perl. The new database-driven Web site has been live since July 2002 and has worked well as a finding aid for the Archive's videotape collection and as a general-purpose news resource. We call this news database "TV-NewsSearch."
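To make the record structure above more concrete, here is a minimal sketch of a segment table and a keyword search expressed in SQL. It is not the Archive's implementation: it uses Python's built-in sqlite3 module as a stand-in for DB/TextWorks over ODBC (the production code is in Perl), and the table name, column names, and sample rows are all invented for illustration.

```python
import sqlite3

# Hypothetical schema. The fields mirror the metadata elements named in the
# text (times, duration, date, network, abstract, reporters), but the table
# and column names are assumptions, not the Archive's own.
SCHEMA = """
CREATE TABLE segments (
    id               INTEGER PRIMARY KEY,
    network          TEXT NOT NULL,
    broadcast_date   TEXT NOT NULL,
    begin_time       TEXT NOT NULL,
    end_time         TEXT NOT NULL,
    duration_seconds INTEGER NOT NULL,
    abstract         TEXT NOT NULL,
    reporters        TEXT NOT NULL
)
"""

# Made-up placeholder rows, not real Archive records.
SAMPLE_ROWS = [
    ("ABC", "1980-01-01", "17:30:10", "17:32:40",
     "Placeholder abstract: volcano erupts.", "Reporter A"),
    ("CBS", "1980-01-01", "17:31:00", "17:33:00",
     "Placeholder abstract: volcano aftermath.", "Reporter B"),
    ("NBC", "1980-01-02", "17:30:00", "17:31:30",
     "Placeholder abstract: election coverage.", "Reporter C"),
]


def hms_to_seconds(hms):
    """Convert an HH:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def build_demo_db():
    """Create an in-memory database holding the placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    for network, date, begin, end, abstract, reporters in SAMPLE_ROWS:
        # Duration is derived from the begin and end times.
        duration = hms_to_seconds(end) - hms_to_seconds(begin)
        conn.execute(
            "INSERT INTO segments (network, broadcast_date, begin_time, "
            "end_time, duration_seconds, abstract, reporters) "
            "VALUES (?, ?, ?, ?, ?, ?, ?)",
            (network, date, begin, end, duration, abstract, reporters),
        )
    conn.commit()
    return conn


def search(conn, keyword, network=None):
    """Return segments whose abstract contains keyword, oldest first."""
    sql = ("SELECT broadcast_date, network, begin_time, duration_seconds, "
           "abstract FROM segments WHERE abstract LIKE ?")
    params = [f"%{keyword}%"]
    if network is not None:
        sql += " AND network = ?"
        params.append(network)
    sql += " ORDER BY broadcast_date, begin_time"
    return conn.execute(sql, params).fetchall()


if __name__ == "__main__":
    demo = build_demo_db()
    for row in search(demo, "volcano"):
        print(row)
```

A plain LIKE match is only a crude approximation of a full-text search engine, but it shows how a layered design lets the user interface speak SQL without knowing which database sits underneath.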

Phase 2: Researching and Establishing Our Process

The phase of the project in which we're currently immersed involves investigating alternatives for storing the video content of the collection in digital form. The National Science Foundation awarded the Archive a grant of $93,000 to conduct this investigation. In this project we will select the technical components and define the processes for converting the existing collection of videotapes into digital form and will design the process for recording new material digitally.

One of our early decisions was to use MPEG-2 as the primary format for storing video files digitally. MPEG-2 is an international standard for compressing video in a way that maintains a high level of viewing quality while achieving a significant reduction in file size. We ruled out MPEG-4, a newer video compression standard, because of its emphasis on low-bandwidth streaming rather than on preservation of video quality. MPEG-4 also suffers from complex licensing issues, despite being an international standard.

To create MPEG-2 digital files, you need a hardware device to perform the compression and encoding. A number of PC cards are available that perform MPEG-2 encoding, each with slightly different features and options. An important part of our process was to evaluate each of the cards available to verify the quality of the MPEG-2 files produced and to determine which card's features would best meet the needs of our project.

Once we selected an MPEG-2 encoding card, we began putting together a digitizing workstation capable of both converting the videotapes from our existing collection and recording new programs digitally. The part of the workstation's configuration that digitizes videotapes is fairly straightforward. The output of a Sony U-Matic tape drive connects to the inputs of the MPEG-2 encoding card. To improve the quality of the video, we send the signal through a device that stabilizes the output of the tape deck, performing a process called timebase correction (TBC).

Designing a process for digitally recording the news turned out to be a larger challenge. One of the features of our video collection is a line of text inserted at the top of the display that shows the network from which the material was recorded, a running time clock, and the date of the broadcast. The Archive's current, tape-based equipment uses character generators that were manufactured in the 1970s and are not well-suited to the new digital recording system. We were able to find an inexpensive on-screen display (OSD) board that could be programmed to produce the required network-time-date overlay. The output from a PC-based TV tuner feeds into the OSD device, which then connects to the MPEG-2 encoding card. The final configuration of our video digitization workstation consisted of a relatively high-end PC with a large hard drive, equipped with an MPEG-2 encoding card and a programmable TV tuner card. The OSD and TBC devices are external to the PC.
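As a rough illustration of the network-time-date overlay described above, the sketch below builds the line of text for a given moment in a recording. The function name, the layout of the string, and the date format are assumptions made for this example; the column does not specify how the OSD board is programmed or exactly how the line is laid out.

```python
from datetime import datetime, timedelta


def overlay_text(network, broadcast_start, elapsed_seconds):
    """Build a hypothetical overlay line: network, running clock, date.

    broadcast_start is the wall-clock time at which recording began;
    elapsed_seconds is how far into the recording the current frame is.
    """
    clock = broadcast_start + timedelta(seconds=elapsed_seconds)
    # The date shown is the date of the broadcast, so it is taken from the
    # start time rather than from the running clock.
    return f"{network}  {clock:%H:%M:%S}  {broadcast_start:%Y-%m-%d}"


if __name__ == "__main__":
    start = datetime(2003, 6, 2, 17, 30, 0)
    print(overlay_text("NBC", start, 65))
```

In the workstation described here, the overlay is burned into the video signal by the external OSD device before encoding, so it becomes part of the picture rather than separate metadata.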

The Costs and Challenges

We have recently put a prototype of the digitizing workstation into service at the Archive. Now we will be setting up nine additional digitizing workstations, enabling the Archive to make the transition to recording news digitally. These workstations will also be used to convert the existing videotape collection. While a limited amount of this work can be accomplished with existing staff, we are hopeful that another grant proposal will be funded to hire a team of workers to process the full 30,000-hour collection.

Another major challenge of creating the digital collection of our television news archive is storing the material. Digitized video, even when compressed, consumes enormous amounts of storage. A standard television signal (CCIR 601 format) would require about 74 gigabytes of storage for an hour of content without compression. With MPEG-2 compression the same hour of video takes less than 3 gigabytes. Once converted, the entire collection will require at least 90 terabytes of storage.
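The storage figures above can be checked with a few lines of arithmetic. The sketch below takes the column's round numbers at face value (74 gigabytes per hour uncompressed, 3 gigabytes per hour as MPEG-2, and the 30,000-hour collection mentioned earlier) and assumes decimal units, with 1 gigabyte equal to 10^9 bytes and 1 terabyte equal to 1,000 gigabytes. The function and constant names are illustrative only.

```python
HOURS_IN_COLLECTION = 30_000      # size of the collection, from the text
UNCOMPRESSED_GB_PER_HOUR = 74     # CCIR 601 figure quoted in the text
MPEG2_GB_PER_HOUR = 3             # the text says "less than 3"; used as a ceiling


def gb_per_hour_to_mbit_per_s(gb_per_hour):
    """Convert a storage rate in decimal GB/hour to a bitrate in Mbit/s."""
    bits_per_hour = gb_per_hour * 1e9 * 8
    return bits_per_hour / 3600 / 1e6


def total_terabytes(hours, gb_per_hour):
    """Total storage in decimal terabytes for the given number of hours."""
    return hours * gb_per_hour / 1000


if __name__ == "__main__":
    print("MPEG-2 bitrate (Mbit/s):",
          round(gb_per_hour_to_mbit_per_s(MPEG2_GB_PER_HOUR), 2))
    print("Uncompressed bitrate (Mbit/s):",
          round(gb_per_hour_to_mbit_per_s(UNCOMPRESSED_GB_PER_HOUR), 2))
    print("Compression ratio:",
          round(UNCOMPRESSED_GB_PER_HOUR / MPEG2_GB_PER_HOUR, 1))
    print("MPEG-2 collection size (TB):",
          total_terabytes(HOURS_IN_COLLECTION, MPEG2_GB_PER_HOUR))
    print("Uncompressed collection size (TB):",
          total_terabytes(HOURS_IN_COLLECTION, UNCOMPRESSED_GB_PER_HOUR))
```

On these assumptions, 3 gigabytes per hour corresponds to a video bitrate of roughly 6.7 megabits per second, and 30,000 hours at that rate comes to the 90 terabytes cited above.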

Our storage strategy involves keeping working copies of the programs locally on DVD-R optical discs and working with the Library of Congress for long-term digital preservation.

The MPEG-2 files that we produce will be copied onto DVD-R discs that will serve as the working copies for the local collection. From these DVD-Rs we can make copies in other forms as needed. We will be able to produce regular VHS tapes (and DVDs) to continue our service of loaning material to researchers.
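For a sense of scale, a back-of-the-envelope count of working-copy discs follows. It assumes single-layer DVD-R media with a nominal capacity of 4.7 gigabytes (decimal), a collection of 90 terabytes, and perfect packing with no file-system overhead. Since whole programs would not normally be split across discs, the real count would be higher; treat the result as a lower bound, not a plan.

```python
DVD_R_CAPACITY_MB = 4_700              # nominal single-layer DVD-R, decimal MB
COLLECTION_SIZE_MB = 90 * 1_000_000    # 90 TB expressed in decimal MB


def discs_needed(total_mb, disc_mb=DVD_R_CAPACITY_MB):
    """Smallest number of discs whose combined capacity covers total_mb."""
    # Ceiling division using integers only, to avoid floating-point rounding.
    return -(-total_mb // disc_mb)


if __name__ == "__main__":
    print("Minimum DVD-R discs:", discs_needed(COLLECTION_SIZE_MB))
```

Under these assumptions the lower bound is a little over 19,000 discs.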

The Vanderbilt Television News Archive has a long-standing relationship with the Library of Congress (LC), under which LC receives copies of most of the recorded items for long-term preservation. We are in the process of working out the details of how LC will receive MPEG-2 files rather than videotape. The costs of building a long-term digital repository of this scale would be extremely difficult for us to assume locally, so this arrangement is a tremendous advantage.

That's the whirlwind tour of our project. I've only skimmed the surface of the technologies involved, and have omitted altogether the business plan and copyright issues. But I hope that I've given you a taste of what it's been like to work with a collection of digital video. The technologies are complex and the storage requirements are immense, but the results can be very impressive.