Text, tables, and illustrations in scientific publications are the primary means of distributing data and information in many domains of Earth and life science, particularly those based on physical samples. Although efforts are underway to integrate data archiving protocols into existing scientific workflows, a vast amount of valuable “dark” legacy data remains sequestered in publications. Manually assembling the synthetic databases necessary to test a wide range of hypotheses is labor intensive and results in static lists of predetermined facts that are separated from their original contexts.

Technical Approach

Here, we address problems in geoscience and computer science by developing a machine reading and learning system that meets or exceeds human quality in complex multi-lingual text- and image-based data extraction tasks. Our system builds upon the DeepDive machine reading and learning system and leverages the computational infrastructure of Condor and the US Open Science Grid. The input to the system is a set of documents (e.g., PDFs or HTML) and a database structure that defines the entities and relationships of interest. The first step is to execute OCR, NLP, and other document processing tasks, which yield data that can be used to define features relating entities (e.g., by part of speech, or by occurrence of entities in the same row of a table).
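As an illustration of this step, the sketch below pairs up entity mentions within a processed sentence and emits simple features such as the tokens between them. The sentence structure, feature names, and example data are hypothetical, chosen for illustration; they are not DeepDive's actual internal representation.

```python
# Sketch: turning NLP output into candidate features.
# The dict layout and feature names here are illustrative assumptions.

def extract_features(sentence):
    """Pair up entity mentions in one processed sentence and emit
    simple features, e.g. the tokens appearing between the mentions."""
    features = []
    words = sentence["words"]
    mentions = sentence["mentions"]  # [(start, end, entity_type), ...]
    for i, (s1, e1, t1) in enumerate(mentions):
        for s2, e2, t2 in mentions[i + 1:]:
            features.append({
                "pair": (t1, t2),
                "words_between": words[e1:s2],  # tokens between the mentions
            })
    return features

# Hypothetical sentence after OCR/NLP, with two tagged mentions:
sentence = {
    "words": ["Tyrannosaurus", "occurs", "in", "the",
              "Hell", "Creek", "Formation"],
    "mentions": [(0, 1, "TAXON"), (4, 7, "STRAT_NAME")],
}
feats = extract_features(sentence)
```

Each candidate pair becomes a random variable in the downstream statistical model, with features like these determining which factors touch it.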

After extracting features, the next step is to generate a factor graph. Evidence variables derive from existing dictionaries, and existing knowledge bases can be used for distant supervision. One key challenge is that factor graphs can become very large; DeepDive draws on recent research in theory and systems to overcome this computational challenge.
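The idea behind distant supervision can be sketched as follows: candidate pairs that match an entry in an existing knowledge base receive a positive training label, while the rest remain unlabeled. The knowledge-base contents and function names below are illustrative assumptions, not DeepDive's API.

```python
# Sketch of distant supervision. KNOWN_PAIRS stands in for an existing
# knowledge base (e.g., taxon/unit pairs from a curated database);
# the data here is a made-up example.
KNOWN_PAIRS = {("Tyrannosaurus", "Hell Creek")}

def supervise(candidates):
    """Attach a training label to each candidate pair: True if the
    pair appears in the knowledge base, None (unlabeled) otherwise."""
    labeled = []
    for taxon, unit in candidates:
        label = True if (taxon, unit) in KNOWN_PAIRS else None
        labeled.append(((taxon, unit), label))
    return labeled

cands = [("Tyrannosaurus", "Hell Creek"), ("Triceratops", "Morrison")]
training = supervise(cands)
```

This avoids hand-labeling individual sentences: the labels come "from a distance," via the knowledge base, at the cost of some label noise that the statistical model must absorb.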

Given a factor graph, our system learns the weight for each factor and runs inference tasks to estimate the probability of each random variable. The output is a probabilistic database, in which each fact is associated with an estimated probability of being correct, and a set of statistics that describe the performance of the system. Improving quality is an iterative process.
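A downstream consumer of such a probabilistic database typically filters facts by their marginal probability. The sketch below shows this pattern; the fact tuples, probabilities, and threshold are illustrative assumptions.

```python
# Sketch: each extracted fact carries an estimated probability of
# being correct. A user keeps only facts above a chosen confidence
# threshold. All values here are made-up examples.
facts = [
    ("Tyrannosaurus", "Hell Creek", 0.97),
    ("Triceratops", "Morrison", 0.42),
]

def high_confidence(facts, threshold=0.9):
    """Return the facts whose estimated probability meets the threshold."""
    return [(a, b) for a, b, p in facts if p >= threshold]

kept = high_confidence(facts)
```

Because every fact retains a probability rather than a binary accept/reject decision, users can trade precision against recall by moving the threshold, and iterative improvements to features or supervision shift the whole distribution.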

Science Drivers

To validate our system and test the hypothesis that a machine can extract structured data from publications, we are attempting three test cases. The first, now completed and published (DOI: 10.1371/journal.pone.0113523), is to recreate and extend the human-constructed Paleobiology Database (http://paleobiodb.org). The second is to extract geochemical measurements, such as stable carbon isotopes, and integrate them with the Macrostrat database (http://macrostrat.org). The third is to extract data, such as structural orientation data, from published geological maps.

Benefits to Scientists

In addition to developing general capabilities in machine reading that can benefit scientists in all domains, we are creating infrastructure to lower the barrier to text and data mining (TDM) activities. Our infrastructure involves the creation of a next-generation, TDM-ready digital library that is well curated, constructed in collaboration with publishers and library staff, and built directly on top of a high throughput computing capability.

EarthCube is a collaboration between the Division of Advanced Cyberinfrastructure (ACI) and the Geosciences Directorate (GEO) of the US National Science Foundation (NSF). For official NSF EarthCube content, please see: http://www.nsf.gov/geo/earthcube/.