Abstract

Page layout analysis and the creation of an XML document from a document image are useful for many applications, including the preservation of archived documents, robust electronic access to printed documents, and access to print materials by the visually impaired. In this paper, the authors describe a document image processing pipeline composed of techniques for the identification of article headings and the related body text, the aggregation of the body text with the headings, and the creation of an XML document. The pipeline was developed to support multiple document images captured by the head-mounted cameras of a reading device for the visually impaired. Both automatic and manual adaptations of the pipeline were used to process a sample of 25 newspaper document images. By comparing the automatic and manual processes, the authors show that, overall, their approach generates high-quality XML-encoded documents for use in further processing, such as text-to-speech conversion for the visually impaired.

1. Introduction

Page layout analysis and the creation of an XML document from a document image are useful for many applications, including the preservation of archived documents (Wang, et al., 2009) and accessibility for those with visual impairments. TYFLOS (Keefer, et al., 2009a,b) is a prototype wearable mobile reading device for the visually impaired. TYFLOS is equipped with two web cameras mounted into a pair of glasses and with software for performing document image rectification and segmentation. Traditional document image analysis techniques play an important role in the operation of the TYFLOS prototype, including document image capture, binarization, page perspective correction in three dimensions, page curl correction, and page segmentation. In this paper, we describe techniques for headline identification, page segment aggregation, and the creation of an XML document from the document image. The XML document supports various forms of interaction with the text of the document, including a voice user interface (Keefer, et al., 2013).

Much work has been done to identify headlines within web sites and document images. This work has been carried out in the context of both improving access to documents for the visually impaired and providing digital access to archived documents. For example, Brudvick, et al. (2008) developed a method to predict whether web page content semantically functions as a headline by considering the visual features of the text when rendered in a browser. Similarly, Kohlschütter, et al. (2010) describe a method for identifying text elements within a web page.

Document segmentation has been of interest to the document image processing community for many years. O’Gorman’s (1993) Docstrum method offered an original and well-organized approach to document layout analysis, based on k-nearest-neighbor relationships among connected components, from which regions of text are identified. Akram et al. (2010) review methods for processing a document and segmenting its layout. In another approach, Winder et al. (2011) describe a method for page segmentation based on an analysis of the Voronoi zones of a histogram of the connected component heights of image segments. Breuel et al. (2011) patented a method for document image layout deconstruction. Finally, Ferilli, et al. (2011) apply supervised machine learning techniques to document image layout analysis.
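To make the nearest-neighbor idea concrete, the following is a minimal sketch that links connected-component centroids to their nearest neighbors and groups roughly horizontal, closely spaced links into text lines. It is a simplified illustration only, not O’Gorman’s Docstrum method, which additionally builds histograms of neighbor angles and distances to estimate skew and spacing. The function names, the parameters `k`, `max_gap`, and `max_angle`, and the sample centroids are assumptions made for this example.

```python
import math


def nearest_neighbor_pairs(centroids, k=2):
    """Return (i, j, distance, angle_degrees) for the k nearest neighbors of each centroid."""
    pairs = []
    for i, (xi, yi) in enumerate(centroids):
        candidates = []
        for j, (xj, yj) in enumerate(centroids):
            if i == j:
                continue
            dx, dy = xj - xi, yj - yi
            candidates.append((math.hypot(dx, dy), j, math.degrees(math.atan2(dy, dx))))
        candidates.sort()  # closest neighbors first
        for dist, j, angle in candidates[:k]:
            pairs.append((i, j, dist, angle))
    return pairs


def group_into_lines(centroids, max_gap, max_angle=30.0, k=2):
    """Group components joined by short, roughly horizontal neighbor links (hypothetical thresholds)."""
    parent = list(range(len(centroids)))

    def find(a):
        # Union-find root lookup with path halving.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for i, j, dist, angle in nearest_neighbor_pairs(centroids, k):
        # Fold the angle so that left and right neighbors both count as horizontal.
        folded = min(abs(angle), 180.0 - abs(angle))
        if dist <= max_gap and folded <= max_angle:
            parent[find(i)] = find(j)

    groups = {}
    for idx in range(len(centroids)):
        groups.setdefault(find(idx), []).append(idx)
    return sorted(groups.values())


# Five made-up centroids: three on one text line, two on a line 50 pixels below.
sample_centroids = [(0, 0), (10, 0), (20, 1), (0, 50), (10, 50)]
text_lines = group_into_lines(sample_centroids, max_gap=15)
```

With these sample points, the first three components fall into one group and the last two into another, because the vertical gap between the two rows exceeds `max_gap`.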

To support robust interaction with document images converted to XML, Ishitani (2003) proposed a method for transforming a document image into XML. This method extracts document elements such as title, headings, and body text from a document image. The hierarchical structure of the document is also extracted and described by a document object model (DOM). The XML document is created through a set of transforms applied to the extracted document elements and the DOM.
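As a rough illustration of turning extracted document elements into XML, the sketch below serializes headings and their body text with Python's standard `xml.etree.ElementTree` module. The element names (`document`, `article`, `heading`, `paragraph`), the input dictionary layout, and the sample text are assumptions chosen for this example; they are not the schema or the transforms used by Ishitani (2003).

```python
import xml.etree.ElementTree as ET


def build_document_xml(articles):
    """Serialize a list of {'heading': str, 'body': [str, ...]} dicts to an XML string."""
    root = ET.Element("document")
    for article in articles:
        node = ET.SubElement(root, "article")
        ET.SubElement(node, "heading").text = article["heading"]
        for paragraph in article["body"]:
            ET.SubElement(node, "paragraph").text = paragraph
    # encoding="unicode" returns a str rather than bytes.
    return ET.tostring(root, encoding="unicode")


# Made-up extracted elements standing in for the output of layout analysis and OCR.
sample_articles = [
    {"heading": "Storm & flood warnings", "body": ["First paragraph.", "Second paragraph."]},
    {"heading": "Local news", "body": ["Only paragraph."]},
]
xml_text = build_document_xml(sample_articles)
```

Because the serializer escapes reserved characters such as `&`, the resulting string can be parsed again with `ET.fromstring(xml_text)` and queried, for example with `findall("article")`, by a downstream consumer such as a voice interface.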

WISDOM++ (Altamura, et al., 2001) is a document processing system that performs document analysis, classification, and text transformation to generate an XML document from a document image. Agrawal and Doermann (2010) also discuss a method for page segmentation that produces GEDI XML files.

To create an XML document from the document image, a document image segmentation method must separate images from text, identify headings within the document image, and identify article content within the document image. The methods described by Ishitani (2003), Altamura, et al. (2001), Agrawal and Doermann (2010), Antonacopoulos and Karatzas (2004), and Pletschacher and Antonacopoulos (2010) all rely on robust document analysis methods to identify the structure and format of the document image, followed by an OCR step to convert the text within a segment to XML.