IFLA Newspaper Conference: Kenning Arlitsch and John Herbert – University of Utah

This panel is on Statewide Digitization Initiatives, with Kenning Arlitsch and John Herbert.

First up is Kenning: The Mountain West Digital Library provides political and technical support for statewide cooperative digitization in Utah and Nevada. The collection holds 400,000 objects, excluding newspapers.

Four partners each run their own CONTENTdm server, and an aggregating CONTENTdm server at the University of Utah harvests from the partners and provides unified searching and access. This preserves local control and identity for the partners while enforcing a common metadata standard.
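The harvesting step described above can be sketched in code. CONTENTdm servers can expose metadata via OAI-PMH, though the talk does not name the protocol, so treat that as an assumption; the sample response and record fields below are illustrative placeholders, not the project's actual data.

```python
# Hypothetical sketch of the aggregation step: the central server harvests
# partner metadata (assumed here to be OAI-PMH Dublin Core) and collects
# records for a unified index.
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

# A trimmed, made-up example of what one partner's ListRecords response
# might look like.
SAMPLE_RESPONSE = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <dc xmlns="http://purl.org/dc/elements/1.1/">
          <title>Deseret News, 1850-01-01</title>
          <identifier>oai:partner1:newspaper/123</identifier>
        </dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

def harvest(xml_text):
    """Extract (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iter(OAI + "record"):
        ident = rec.find(".//" + DC + "identifier")
        title = rec.find(".//" + DC + "title")
        records.append((ident.text, title.text))
    return records
```

In practice the aggregator would repeat this per partner and per resumption token, then merge the results into the central search index.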

The University of Utah has been microfilming newspapers since 1951. In 2001, they were awarded a grant to attempt digitization of 30,000 newspaper pages. They scanned the pages as 1-bit black-and-white TIFFs, converted them to MrSID, and loaded them into CONTENTdm, adding index information. Searching was by directory information only (no full text), from which you could view the page images. Realizing they could not do all that work themselves, they began working with iArchives to handle a large portion of it. Lessons learned: bad film; a vendor that was sold to a company in El Paso, with the master microfilm going along with it; and the need for more funding. The collection also proved really popular. Their second grant was to improve the process and digitize 100,000 more pages.

John is now speaking. The main site is http://digitalnewspapers.org, which consists of Utah newspapers from 1850–1961. It is full-text searchable, with 48 titles, 450,000 pages, and 5 million articles. Content is distributed across the state.

Their archival format is 4-bit TIFFs (~28 MiB per page), which works out to about 14 TiB for 500,000 pages. They attempt to scan from paper whenever possible: paper gives better-quality images, but the size of the papers necessitates local scanners, which can be hard to find, whereas film is cheaper and easier to ship. Articles are all segmented and viewed stand-alone from the page, with no zoom and pan. They find the OCR is more accurate due to the smaller corpus size. Also, iArchives keys in the headlines manually, yielding near-perfect data there.
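The storage figure quoted above checks out as a back-of-the-envelope calculation:

```python
# Rough check of the storage estimate: ~28 MiB per archival TIFF page,
# ~500,000 pages total.
MIB_PER_PAGE = 28
PAGES = 500_000

total_mib = MIB_PER_PAGE * PAGES
total_tib = total_mib / 1024 / 1024  # MiB -> GiB -> TiB
print(f"{total_tib:.1f} TiB")  # -> 13.4 TiB, i.e. roughly the 14 TiB cited
```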

For OCR options, their testing shows that single-word searching is optimized with two alternatives per word. However, adding additional words often causes bizarre search-result scenarios; things like “butch/dutch cassidy” are difficult to overcome, but he’s hopeful that new proximity searches will help with this. They also filter the OCR output through dictionaries and lists of surnames, and then throw out all numbers greater than 100 except for four-digit numbers beginning with 18, 19, or 20.

Imaging trade-offs come down to the classic size-vs.-quality problem.

Cost per page is low, usually less than $2, but the number of pages is enormous, making full digitization very expensive: at that rate, 500,000 pages approaches $1 million. They’ve been digitizing since 2002, and have “only” 500,000 pages.