Automated Quality Control in large-scale digital preservation

This brief article gives an overview of the activities related to automated quality control within the SCAPE (SCAlable Preservation Environments) EC project proposal.

The SCAPE project will enhance the state of the art of digital preservation in three ways: by developing infrastructure and tools for scalable preservation actions; by providing a framework for automated, quality-assured preservation workflows; and by integrating these components with a policy-based preservation planning and watch system. These concrete project results will be validated within three large-scale Testbeds from diverse application areas: Digital Repositories from the library community, Web Content from the web archiving community, and Research Data Sets from the scientific community.

Change in a digital object’s form, structure or content introduces the risk that important or critical information has been lost. The impact of this loss may affect only certain properties of the object, or it may be catastrophic. Detecting loss when change has occurred is therefore a key digital preservation process. Change may occur involuntarily while the object is stored or preserved, or it may occur when a preservation specialist intervenes to address technology obsolescence (for example, by performing a file format migration).

In the past, large-scale quality assurance (QA), for example in digitization processes, has relied on a combination of human intervention and statistical methods. This manual approach to QA is time-consuming, but workable when applied to simple content and a straightforward set of properties that must be preserved (for example, QA of a digitization process or a migration from one image file format to another).

This approach does not, however, scale well when applied to complex content with correspondingly complex properties that must be preserved (for example, office documents typically require a long list of properties to be assessed and verified in the destination format). Approaches such as the eXtensible Characterisation Languages (XCL) hierarchically decompose a document and represent documents from different sources in an abstract representation of digital content in XML (eXtensible Markup Language). By mapping file format structures to XCL concepts, they enable automatic validation of document conversions and evaluation of migration quality.
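The idea of validating a conversion by comparing abstract property representations can be sketched as follows. This is only an illustration of the principle, not XCL itself: the flat property dictionaries, the property names and the helper `compare_properties` are all hypothetical, and a real characterisation would be hierarchical.

```python
def compare_properties(source, target, tolerances=None):
    """Compare properties extracted from a source and a migrated document.

    `source` and `target` are flat dicts of property name -> value
    (a hypothetical, simplified stand-in for an abstract representation).
    Returns a dict mapping each source property to
    'preserved', 'changed' or 'missing'.
    """
    tolerances = tolerances or {}
    report = {}
    for name, expected in source.items():
        if name not in target:
            report[name] = "missing"
            continue
        actual = target[name]
        tolerance = tolerances.get(name)
        numeric = isinstance(expected, (int, float)) and isinstance(actual, (int, float))
        if tolerance is not None and numeric:
            # Numeric properties may be allowed a small deviation.
            ok = abs(expected - actual) <= tolerance
        else:
            ok = expected == actual
        report[name] = "preserved" if ok else "changed"
    return report


# Hypothetical properties of an office document before and after migration.
original = {"page_count": 12, "title": "Annual report", "page_width_mm": 210.0}
migrated = {"page_count": 12, "title": "Annual report", "page_width_mm": 210.2}
report = compare_properties(original, migrated, tolerances={"page_width_mm": 0.5})
```

A report in which every property is 'preserved' would let the migration pass without human inspection; any other outcome could be routed to manual review.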

Another approach depends on an analysis of the logical structure and content of digital objects. In order to compare different types of media, such as images of web pages as well as digital documents, the first step is page segmentation. Competitions on page segmentation take place at the annual International Conference on Document Analysis and Recognition (ICDAR). The results of the 2009 competition were obtained on a realistic dataset for performance evaluation of document layout analysis, the PRImA dataset, which was recently made publicly available. In the 2009 competition the Fraunhofer newspaper segmenter, a collection of modules including black and white separator detection, page segmentation, and text line and region extraction, exceeded the state of the art in page segmentation.

This structural comparison, such as in Ben-Saad et al., may be used to define which types of sub-images should be compared between documents. Near-duplicate image detection techniques are powerful tools for handling this image matching step. Combining structural and image analyses is certainly a key to obtaining richer quality measures.
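As a rough illustration of near-duplicate image detection, the sketch below uses a simple average hash, one common perceptual hashing technique. It is not taken from the cited work; the function names, the 8x8 hash size and the representation of an image as a list of rows of grayscale values are assumptions made for the example.

```python
def average_hash(pixels, hash_size=8):
    """Compute an average hash of a grayscale image.

    `pixels` is a list of rows of integers in 0..255. The image must be
    at least `hash_size` pixels in each dimension. The image is reduced
    to hash_size x hash_size block means, and each block becomes 1 if it
    is brighter than the mean of all blocks, else 0.
    """
    height, width = len(pixels), len(pixels[0])
    cells = []
    for i in range(hash_size):
        r0, r1 = i * height // hash_size, (i + 1) * height // hash_size
        for j in range(hash_size):
            c0, c1 = j * width // hash_size, (j + 1) * width // hash_size
            block = [pixels[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    return [1 if value > mean else 0 for value in cells]


def hamming_distance(hash_a, hash_b):
    """Number of positions in which two equal-length hashes differ."""
    return sum(a != b for a, b in zip(hash_a, hash_b))


# A 16x16 gradient, a slightly brightened copy, and an inverted copy.
gradient = [[r * 16 + c for c in range(16)] for r in range(16)]
brighter = [[min(255, v + 5) for v in row] for row in gradient]
inverted = [[255 - v for v in row] for row in gradient]

near_duplicate_distance = hamming_distance(average_hash(gradient), average_hash(brighter))
different_distance = hamming_distance(average_hash(gradient), average_hash(inverted))
```

A small Hamming distance suggests that two sub-images are near duplicates, while a large one suggests they differ; where to set the threshold would have to be determined empirically for each content type.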

We hypothesize that the application of image-extraction tools, combined with established image analysis techniques, can provide highly useful quality assurance measures for a wide variety of media types. In this approach, in order to perform QA on a document migration, the original and final documents would be converted into images, after which established image analysis techniques such as pattern recognition algorithms could be used to quantitatively evaluate the migration process. We believe this technique also lends itself very well to deployment and invocation on distributed virtualized clusters or clouds. SCAPE will develop quality assurance workflows that combine extraction, analysis and comparison tools, and will test and validate the resulting QA components within Testbed workflows.
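The comparison step of such a workflow can be sketched as follows, assuming both documents have already been rendered to grayscale images of identical size. The text above leaves the choice of analysis technique open; peak signal-to-noise ratio (PSNR) is used here purely as a simple stand-in, and the function names and the 30 dB threshold are assumptions for illustration.

```python
import math


def mean_squared_error(original, migrated):
    """Mean squared pixel difference between two same-sized grayscale images.

    Each image is a list of rows of integers in 0..255.
    """
    same_shape = len(original) == len(migrated) and all(
        len(a) == len(b) for a, b in zip(original, migrated)
    )
    if not same_shape:
        raise ValueError("images must have identical dimensions")
    total = 0
    count = 0
    for row_a, row_b in zip(original, migrated):
        for a, b in zip(row_a, row_b):
            total += (a - b) ** 2
            count += 1
    return total / count


def psnr(original, migrated, max_value=255):
    """Peak signal-to-noise ratio in decibels; infinite for identical images."""
    mse = mean_squared_error(original, migrated)
    if mse == 0:
        return math.inf
    return 10 * math.log10(max_value ** 2 / mse)


def migration_passes(original, migrated, threshold_db=30.0):
    """Accept the migration if the rendered pages are similar enough.

    The threshold is an assumed value and would need calibration.
    """
    return psnr(original, migrated) >= threshold_db


# Rendered page before migration, and two hypothetical outcomes.
page = [[200] * 4 for _ in range(4)]
slightly_off = [[201] * 4 for _ in range(4)]
blank = [[0] * 4 for _ in range(4)]
```

Because each pair of pages is compared independently, a check of this kind can be run in parallel across many documents, which fits the distributed deployment mentioned above.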

The partners in the Quality Assurance Components work package include other national libraries in both the UK and Holland as well as technology/research institutions in the UK (Microsoft Research Cambridge) and Austria (Austrian Institute of Technology). There will therefore be good opportunities to work within one of these institutions for a period of time during the PhD. The project as a whole also includes several other European universities, so there is a great opportunity to become part of an international network within the fields of digital preservation and computer science.

For further information about the project, contact the head of Digital Resources at the State and University Library, Bjarne Andersen, tel: 8946 2165, email: bja@statsbiblioteket.dk