A number of federal agencies, universities, laboratories, and companies are placing their collections online and making them searchable via metadata fields such as author, title, and publishing organization. Manually creating metadata for a large collection is extremely time-consuming, but the task is difficult to automate, particularly for collections whose documents vary widely in layout and structure. Unfortunately, a number of federal organizations, such as DTIC, GPO, and NASA, manage exactly this kind of heterogeneous collection, for which existing approaches to automated metadata extraction do not work well. In this project, we are developing an automated metadata extraction process for large, diverse, and evolving document collections.

This website documents the Extract project through version 4.1.x (June 2010).
A new website provides information on version 4.2 and later.