Main menu

You are here

PRImA's Aletheia - Ground Truth & Softalk Magazine

Submitted by Jim Salmons on Tue, 07/07/2015 - 17:51

Although most folks are now happy with full access to the complete run of the Apple edition of Softalk magazine now that Timlynn​ and I have funded its "ingestion" into the Internet Archive, the Softalk Apple Project is far from over. In fact, getting the collection into the Archive was just step one -- the Citizen History aspect of our projects.

We're now moving forward on strategic collaborations with world-class researchers and research centers to turn the Softalk Apple Project digital collection into a unique and valuable reference resource for broad applications in the Digital Humanities and Cognitive Computing domains.

One particularly exciting collaboration is shaping up with the VERY "deep weeds" researchers at the PRImA Research Center at the University of Salford in Manchester England. PRImA is "Pattern Recognition and Image Analysis" and this is the premiere research center working on the deepest technical challenges of doing document structure and layout recognition as well as OCR (text recognition) of print and hand-written documents.

PRImA's Aletheia - Our "Get-going" Fact-mining Tool

To date, the Center's research has focused (and made AMAZING progress) addressing within-page layout recognition via fine-grained page segmenting techniques. In order to test recognition algorithms, PRImA has created a "ground truth" tool called Aletheia. For the hard-core OCR crowd, "ground truth" is page segmentation meticulously done by, and validated to be "correct" as a perfect solution by, a human. The "ground-truth edition" of a page is used as an "answer key" to measure the accuracy of, and the differences between, the results of layout and text recognition algorithms.

The great thing is that FactMiners can use Aletheia as a "proof of concept" tool to begin creating the FactMiners' Fact Cloud "edition" of the Softalk Apple collection! A FactMiners' Fact Cloud is just our name for a "_machine-readable_ copy" of each issue of Softalk... that is, all the "facts mined" and stored in a graph database such that ALL the information that a human can gain by reading the magazine will also be accessible as #SmartData (that is, a graph database with a metamodel subgraph explaining the data's structure and processes for access and use) via the Fact Cloud.

Toward Automatic Recognition of Magazine Whole-Issue Structure

The picture I have included with this post is an example of detailed page-segmentation of the key "structure revealing" page of Softalk's first issue (September 1980). Page 1 is "hint rich" with information revealing the "meta-structure" (magazine's have a common but not required structure) of this issue of the magazine. In this case, page one has the Table of Contents, the Advertiser's Index (a 'secondary' type of TOC), the masthead, and Previews of next month's content.

Aletheia provides flexible region (page segment) creation, including the all-important page-segment-respecting OCR feature. Bulk OCR simply produces an unstructured "text soup" in an unseen layer of, for example, an image-based PDF file. While full-text searches can be done on such bulk OCR data, the actual and all-important structure of the magazine is nowhere to be found. This is a central challenge of FactMiners' fact-mining -- typed-structure-respecting text recognition... our Holy Grail of tool requirement.

The only way to create a FactMiners' Fact Cloud will be through a page-segment modeling and respecting tool... and to date, nothing I have seen comes close to filling this requirement the way Aletheia does.

FactMiners' "Visual Language of Magazine Design" Dataset

One of the #CognitiveComputing agenda items we're working on at FactMiners is a whole-issue commercial magazine layout recognition. This will be a specialized #SmartProgram that, given a set of images of the pages of a magazine, finds the page or pages that contain the "telltale specification" of the magazine -- that is, that finds and processes the table of contents, list of advertisers, and any other page-segments that reveal the whole-issue structure of the magazine. Finding and interpreting whole-issue magazine structure is a recognition process that spans individual pages and will be done in an iterative fashion over the collection of all page-images in an issue.

The whole-issue structure-recognizing challenge will require a "Sudoku-like" iterative solution. That is, by finding the "key page(s)" -- primarily the table of contents and list of advertisers -- our recognition algorithm or neural net will use a process of elimination to find and identify as many page segments as possible. While doing this at the page level, the #SmartProgram will be building up a whole-issue document structure as its iterative recognition reveals the front-of-book, feature-well, and back-of-book structure of most commercial magazines.

To this end, we'll be creating a computer-vision reference dataset of the "visual language" of magazine layout design. As we create the FactMiners' Fact Cloud, an intermediary step will be to create PAGE-format files (PRImA's XML spec similar to ALTO) that include a "whole page layout" description at the page level of the XML-based PAGE document structure hierarchy. This means that the Softalk Apple collection will be the first 9,300+ images of a dataset that can be used to teach deep learning algorithms and neural nets, etc. to recognize commercial magazine whole-issue structure. Our "semi-ground-truth" dataset -- as we'll only go down to typed page segments, and not down to baseline, word, and glyph boundaries, etc. -- will be a natural complement to the 880+ pages in the PRImA Magazine Layout Dataset seen here.

I am VERY excited to learn about the PRImA Research Center, its wonderful people, and the amazing tools they have created. I look forward to the progress that can be made through this evolving collaboration.