Workflow process overview

Outline of Workflow Processes

1. Microfilm is sourced

Newspaper microfilm is sourced from State and Territory Libraries and sent for scanning by one of the scanning contractors on the National Library's service provider panel. Permission will be sought from publishers for post-1954 content prior to scanning.

ISSNs for the digitised titles are requested from the Australian ISSN Agency prior to scanning.

All State and Territory Libraries are contributors to the program. W & F Pascoe Ltd also owns microfilm masters in some cases.

2. Creation of digital images from microfilm (Scanning Contractor)

Digital images are created by scanning microfilm. They are then copied to LTO4 data tapes and are sent to the National Library for quality assurance. Microfilm is returned to libraries.

The Library issues specifications for scanning microfilm to its contractors. A copy of these specifications is available on request by contacting the Program manager through the Trove Contact Form.

3. Quality assurance work - part 1 and 2 (Library)

Quality assurance (QA) is a key part of the Program. It is carried out at the National Library of Australia and combines the automated and manual processes outlined below:

Identifying missing newspaper pages and issues and creating targets for these

Removing duplicate pages

Grouping digital images into batches of 2000 for Optical Character Recognition (OCR) processing
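The three QA tasks above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Library's actual tooling: the function names are hypothetical, pages are assumed to be numbered from 1, and a "duplicate" is assumed to mean byte-identical image content (two separate scans of the same page would normally need a visual comparison instead). The batch size of 2000 is the only value taken from the text.

```python
import hashlib
from itertools import islice

BATCH_SIZE = 2000  # batch size stated in the workflow above


def missing_pages(page_numbers, expected_count):
    """Return the page numbers in 1..expected_count that were not scanned."""
    return sorted(set(range(1, expected_count + 1)) - set(page_numbers))


def drop_duplicates(images):
    """Keep the first image for each distinct SHA-256 content digest.

    `images` is an iterable of (name, bytes) pairs. Byte-identical content
    is an assumed definition of "duplicate" for this sketch.
    """
    seen = set()
    unique = []
    for name, data in images:
        digest = hashlib.sha256(data).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((name, data))
    return unique


def batches(items, size=BATCH_SIZE):
    """Yield consecutive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch
```

For example, 4500 quality-assured images would be split into two full batches of 2000 and one final batch of 500.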

Once the images have been quality assured, they are ready to be sent to one of the OCR contractors on the National Library's service provider panel. If there are any issues with the digital images, they are returned to the contractor for reprocessing.

OCR contractors complete zoning, categorisation, OCR and re-keying work according to the Library's specifications. A copy of these specifications is available on request by contacting the Program manager through the Trove Contact Form.

Zone articles:

Each newspaper page is zoned into separate articles

The coordinates of the zones on the page are recorded in the ALTO/METS file
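As an illustration of how zone coordinates can be read back, the sketch below parses a small ALTO fragment with Python's standard library. The sample XML is invented for this example, and treating each TextBlock element as one zone is an assumption; the Library's own specifications define how zones are actually encoded. HPOS, VPOS, WIDTH and HEIGHT are the standard ALTO positional attributes. The function name zone_boxes is hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented sample; real files follow the Library's specifications.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1" WIDTH="4000" HEIGHT="6000">
      <PrintSpace>
        <TextBlock ID="TB1" HPOS="120" VPOS="200" WIDTH="900" HEIGHT="1500"/>
        <TextBlock ID="TB2" HPOS="1100" VPOS="200" WIDTH="900" HEIGHT="700"/>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Strip the '{namespace}' prefix so any ALTO version is accepted."""
    return tag.rsplit("}", 1)[-1]


def zone_boxes(alto_xml):
    """Return (id, hpos, vpos, width, height) for every TextBlock element."""
    root = ET.fromstring(alto_xml)
    boxes = []
    for element in root.iter():
        if local_name(element.tag) == "TextBlock":
            boxes.append((
                element.get("ID"),
                float(element.get("HPOS")),
                float(element.get("VPOS")),
                float(element.get("WIDTH")),
                float(element.get("HEIGHT")),
            ))
    return boxes


for box in zone_boxes(SAMPLE_ALTO):
    print(box)
```

Matching on the local element name rather than a fixed namespace keeps the sketch independent of the ALTO schema version in use.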

Categorise articles:

Each article is assigned a category from the list below. The category helps users include or exclude items when searching:

News

Advertising

Family Notices

Detailed lists, results, guides

In addition, illustrated articles are categorised as:

Illustrations

Photographs

Cartoons

Maps

Graphs
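To show how these categories could support including or excluding items in a search, here is a minimal Python sketch. The category and illustration names come from the lists above; the Article class, the filter_articles function, the sample records, and the choice to allow at most one illustration type per article are all assumptions made for illustration, not a description of how Trove is implemented.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Category(Enum):
    NEWS = "News"
    ADVERTISING = "Advertising"
    FAMILY_NOTICES = "Family Notices"
    DETAILED_LISTS = "Detailed lists, results, guides"


class IllustrationType(Enum):
    ILLUSTRATION = "Illustrations"
    PHOTOGRAPH = "Photographs"
    CARTOON = "Cartoons"
    MAP = "Maps"
    GRAPH = "Graphs"


@dataclass(frozen=True)
class Article:
    heading: str
    category: Category
    # Assumption: an article carries at most one illustration type.
    illustration: Optional[IllustrationType] = None


def filter_articles(articles, include=None, exclude=()):
    """Keep articles whose category is included and not excluded.

    `include=None` means every category is allowed.
    """
    allowed = set(include) if include is not None else set(Category)
    excluded = set(exclude)
    return [a for a in articles
            if a.category in allowed and a.category not in excluded]


# Invented sample records.
SAMPLE = [
    Article("Example news item", Category.NEWS, IllustrationType.PHOTOGRAPH),
    Article("Example advertisement", Category.ADVERTISING),
    Article("Example notice", Category.FAMILY_NOTICES),
]

without_ads = filter_articles(SAMPLE, exclude=[Category.ADVERTISING])
print([a.heading for a in without_ads])
```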

OCR on articles

Each article is converted into a full-text file: OCR software automatically converts the characters in the image into searchable text.

Re-keying of text in articles

The titles, subtitles, authors and first four lines of text in each article are re-keyed manually to achieve 99% accuracy.
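The text does not say how the 99% figure is measured. One common way to express text accuracy is at the character level, as one minus the edit distance divided by the length of a trusted reference text. The Python sketch below uses that definition purely as an assumption; the function names are hypothetical and the Library's contract may define accuracy differently.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference, candidate):
    """Return 1 - (edit distance / reference length), floored at 0.

    This definition of accuracy is an assumption for illustration.
    """
    if not reference:
        return 1.0 if not candidate else 0.0
    errors = edit_distance(reference, candidate)
    return max(0.0, 1.0 - errors / len(reference))


def meets_target(reference, candidate, target=0.99):
    """True when the candidate text reaches the target accuracy."""
    return character_accuracy(reference, candidate) >= target
```

Under this definition, a 200-character heading with two wrong characters sits exactly at the 99% threshold.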

Metadata

METS and ALTO are used to transfer metadata from OCR contractors to the National Library. An XML file in ALTO format is created for each page, and a METS file is created for each issue. The ALTO file contains the OCR results, including the positions of zones and words on the page. The METS file contains the re-keyed data (title, abstract, etc.) and structural information about the pages and articles. Metadata are extracted from the ALTO and METS files and stored in the repository database's internal format. Further details about the use of METS and ALTO files are available on request by contacting the Program manager through the Trove Contact Form.
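To make the role of the METS file more concrete, the following Python sketch walks the structural map of a small, invented METS document and lists each division with its type and label. The structMap and div elements and the TYPE and LABEL attributes belong to the METS standard, but the sample content, the TYPE values and the function name outline are assumptions; the Library's actual METS profile is described in its specifications.

```python
import xml.etree.ElementTree as ET

METS_NS = {"mets": "http://www.loc.gov/METS/"}

# Invented sample; TYPE values and labels are placeholders.
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:structMap TYPE="LOGICAL">
    <mets:div TYPE="ISSUE" LABEL="Example Gazette, 1 January 1900">
      <mets:div TYPE="ARTICLE" LABEL="Example heading one"/>
      <mets:div TYPE="ARTICLE" LABEL="Example heading two"/>
    </mets:div>
  </mets:structMap>
</mets:mets>"""


def outline(mets_xml):
    """Return (depth, TYPE, LABEL) for every div in every structMap."""
    root = ET.fromstring(mets_xml)
    rows = []

    def walk(div, depth):
        rows.append((depth, div.get("TYPE"), div.get("LABEL")))
        for child in div.findall("mets:div", METS_NS):
            walk(child, depth + 1)

    for struct_map in root.findall("mets:structMap", METS_NS):
        for top in struct_map.findall("mets:div", METS_NS):
            walk(top, 0)
    return rows


for depth, div_type, label in outline(SAMPLE_METS):
    print("  " * depth + f"{div_type}: {label}")
```

The nesting mirrors the description above: one METS file per issue, with the articles recorded as child divisions of that issue.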

5. Quality acceptance work - part 3 (Library)

The XML files are transferred to the Library by FTP. National Library staff then check that the processed pages and articles meet the quality acceptance percentage defined in the Contract. They are either accepted by the Library or sent back to the contractor for re-processing. The Library creates derivative images (JPEGs, PDFs and thumbnail images) from the master greyscale TIFF images for use in the public search and delivery system.

6. Public Search and correction of data

The data is loaded into Trove and is available to the public. Interested members of the public may correct the OCR text; their changes are saved in the database and made available to others. This improves the accuracy of the text, and therefore the searching of articles, for everyone. Millions of lines of text have already been corrected.