The Coriell Cell Line Ontology: Rapidly Developing Large Ontologies

Abstract

Motivation: Many online catalogues of biomedical products and artifacts are loosely structured but of great value to the community. These include catalogues of cell lines, enzymes, antibodies, reagents, and laboratory equipment. Improving the representation of these products has several benefits, including better reporting of products used in experimental protocols and integration of experimental data with BioSample databases. Formalization of these resources is often time-consuming, labor-intensive and expensive. We describe an approach to structuring these catalogues using semi-automated techniques to rapidly develop OWL ontologies. We demonstrate the approach using the Coriell Cell Line catalogue; the resulting ontology of 28,000 classes imports classes from other community ontologies such as the Disease Ontology, the Cell Type Ontology and FMA.

Availability: http://bioportal.bioontology.org/ontologies/1589

Authors

Introduction

The biomedical community has embraced the use of ontologies as a means of describing scientific data, such as experimental protocols (OBI) (The OBI Consortium, 2010) and experimental variables (EFO) (Malone et al., 2010). Manual development of these ontologies is a costly and time-consuming activity. There is clear value in producing robust, expertly curated ontologies such as the Gene Ontology (GO). However, development in this form is not repeatable across every area of biomedicine.

Programmatic approaches can be powerful when transforming and enhancing resources with pre-existing structure into an ontological form (Antezana et al., 2009). Loosely structured data sources contain implicit knowledge, either within the data or within the presentation layer, e.g. within categories in a drop-down list on a website. Similarly, implicit knowledge may be contained within the column headers of spreadsheets or within database table and field names. It is possible to exploit this implicit knowledge to enable a rapid transformation into explicit ontology classes.

Here we present our approach to the rapid development of the Coriell cell line ontology, based on a collection of semi-structured cell line descriptions from the Coriell cell line catalogue, which contains ~27,000 mammalian cell lines and associated metadata. We demonstrate that, by using a standardized modeling pattern and text mining approaches, a large ontology (~28,000 classes) can be rapidly produced which logically describes each cell line and its biological properties. The scope of this work was the representation of the catalogue in OWL and the development of a robust design pattern for cell lines; however, we expect the approach to scale and to be adaptable to other similar resources.

METHODS

The principal methodology underlying this work is ontology normalization (Rector, 2003). Specifically, we manage multiple inheritance using class descriptions in OWL and infer structure using description logic reasoners such as HermiT. By providing axioms on classes, the need to assert potentially conflicting or fragile subsumption hierarchies is removed. This approach ensures that the biological knowledge used to create the hierarchy is stated explicitly in the ontology rather than left implicit.

The first step was to develop a standardized model for cell lines. In collaboration with the Cell Line Ontology (Sarntivijai et al., 2011) and the Cell Type Ontology (Meehan et al., 2011) we created a model (Figure 1) which aligns these ontologies and which was used during development.

Fig. 1. The cell line model used to represent Coriell cell lines.

Our primary queries of interest are contained in this model and determined which data we extracted from the catalogue, specifically: cell line name, cell type, disease, organism parts, organism and gender. The model was evaluated against primary competency questions derived from use cases related to the development of a BioSample Database (www.ebi.ac.uk/biosamples/) at the EBI. These include queries by common cell types, by disease and by tissues. We use the relation is_model_for to reflect the use of cell lines as models for particular diseases.

Given the large size of the Coriell catalogue, we developed a scalable semi-automatic approach to creating the ontology. Information on each cell line was contained within 104 separate and redundant text files describing different aspects of the Coriell products and derived from an SQL dump of a relational database. Five key files were selected which contained semi-structured descriptions covering the entities described in Figure 1 and which corresponded to our use cases. These files were merged, redundant information was removed and a single ‘cell line’ spreadsheet was produced using bespoke Perl scripts.
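As an illustration of the merge step only, the following minimal Python sketch combines rows from several tables into one record per cell line and drops repeated values. It is not the Perl code used in this work: the key name cell_line_id, the field names and the example rows are all hypothetical.

```python
def merge_records(tables, key="cell_line_id"):
    """Merge rows from several tables into one record per cell line.

    The first non-empty value seen for a field is kept, so repeated
    (redundant) values in later tables are dropped.
    """
    merged = {}
    for table in tables:
        for row in table:
            record = merged.setdefault(row[key], {key: row[key]})
            for field, value in row.items():
                if value and not record.get(field):
                    record[field] = value
    return list(merged.values())


# Hypothetical rows standing in for two of the catalogue files.
samples = [
    {"cell_line_id": "EX0001", "cell_type": "fibroblast", "organism": "Homo sapiens"},
    {"cell_line_id": "EX0002", "cell_type": "", "organism": "Mus musculus"},
]
diagnoses = [
    {"cell_line_id": "EX0001", "organism": "Homo sapiens", "disease": "example disease"},
]

cell_lines = merge_records([samples, diagnoses])
```

Each merged record then corresponds to one row of a single 'cell line' table; fields with no value in any input are simply absent.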

Lexical concept recognition

The cell line spreadsheet was used as input for lexical concept recognition, with the aim of generating a list of classes from reference ontologies that matched the textual descriptions in the catalogue. The Perl Onto-Mapper (www.ebi.ac.uk/efo/tools) was employed as it has previously been used successfully in building similar application ontologies (Malone et al., 2009). The approach allows for fuzzy matching to identify classes from class labels and their synonyms. Because synonymy is common in the nomenclature of areas such as disease and anatomy, a fuzzy matching approach provided flexibility in mapping. A metric was assigned to each match and those with less than 100% confidence were manually inspected.
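The idea of fuzzy matching against labels and synonyms, with anything short of an exact match flagged for review, can be sketched as follows. This is a minimal Python illustration using the standard library's difflib, not the Perl Onto-Mapper or its scoring metric; the reference terms and example.org IRIs are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical reference terms: IRI -> (label, synonyms). Not real ontology content.
REFERENCE_TERMS = {
    "http://example.org/ref/0001": ("fibroblast", ["fibroblast cell"]),
    "http://example.org/ref/0002": ("smooth muscle cell", ["SMC"]),
    "http://example.org/ref/0003": ("plasma cell", ["plasmacyte"]),
}


def similarity(a, b):
    """Case-insensitive similarity in [0, 1] between two strings."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def best_match(text, terms=REFERENCE_TERMS):
    """Return (iri, matched_name, score) for the closest label or synonym."""
    best = (None, None, 0.0)
    for iri, (label, synonyms) in terms.items():
        for name in [label] + synonyms:
            score = similarity(text, name)
            if score > best[2]:
                best = (iri, name, score)
    return best


def needs_review(score):
    """Anything short of an exact (100%) match goes to manual inspection."""
    return score < 1.0
```

With these toy terms, a catalogue string such as "smooth muscle" is paired with "smooth muscle cell" at a score below 1.0 and would therefore be queued for manual inspection, whereas "Fibroblast" matches its label exactly.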

The reference ontologies (Table 1) were selected based on the catalogue content and the model. Anatomy was challenging because, although the Coriell cell lines are primarily mammalian, no single mammalian anatomy ontology exists which would provide the necessary coverage. Although some efforts are ongoing to develop a homology-based anatomy ontology (Travillian et al., 2010), we used a pre-existing resource, the Minimal Anatomy Terminology (Bard et al., 2008). This species-neutral ontology provides mappings to multiple anatomical ontologies and is subsumed by the Experimental Factor Ontology, with which we plan to merge the Coriell Cell Line Ontology in future. Some human-specific classes were also imported from FMA. Note, however, that the majority of the terms represent cell lines and were generated de novo rather than imported. There was no ontological resource for the Coriell catalogue prior to this work.

The disease information within the Coriell descriptions consisted of references to OMIM (McKusick, 2007). Since OMIM is not a disease ontology, we exploited the cross references to OMIM provided within the Human Disease Ontology (DO) and imported the corresponding DO classes. Where no cross references were found, manual inspection using BioPortal (Noy et al., 2009) was required to identify the disease.
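The cross-reference lookup can be pictured as inverting the DO-to-OMIM links and separating the OMIM numbers that resolve from those that need manual inspection. The Python sketch below is illustrative only; the identifiers are hypothetical placeholders, not real DO or OMIM entries.

```python
# Hypothetical placeholder identifiers; real OMIM numbers and DO classes are not reproduced.
DO_XREFS = {
    "DO:A": ["OMIM:000001", "OMIM:000002"],
    "DO:B": ["OMIM:000003"],
}


def invert_xrefs(do_xrefs):
    """Build an OMIM -> set of DO classes index from DO -> OMIM cross references."""
    omim_to_do = {}
    for do_id, omim_ids in do_xrefs.items():
        for omim_id in omim_ids:
            omim_to_do.setdefault(omim_id, set()).add(do_id)
    return omim_to_do


def resolve(omim_ids, omim_to_do):
    """Split OMIM ids into those with a DO mapping and those needing manual review."""
    mapped, unmapped = {}, []
    for omim_id in omim_ids:
        if omim_id in omim_to_do:
            mapped[omim_id] = sorted(omim_to_do[omim_id])
        else:
            unmapped.append(omim_id)
    return mapped, unmapped


index = invert_xrefs(DO_XREFS)
mapped, unmapped = resolve(["OMIM:000001", "OMIM:000002", "OMIM:999999"], index)
```

Because several OMIM numbers can point at the same DO class, the number of distinct DO classes imported can be smaller than the number of OMIM numbers mapped.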

Table 1. Reference ontologies used in the Coriell cell line ontology

Domain       Reference Ontology                   Term Number
Organism     NCBI Taxonomy, OBI                   93
Anatomy      Experimental Factor Ontology, FMA    61
Cell Type    Cell Type Ontology                   11
Disease      Human Disease, NCI Thesaurus         337
Gender       PATO                                 3

Ontology engineering using the OWL-API

The lexical mapping resulted in a set of files containing mappings between a label and the corresponding URI from the reference ontology, one file per domain. These mappings were used to construct the ontology programmatically (Figure 2).

Fig. 2. Methodology for programmatic ontology creation

The process was implemented as follows:

1. Input of cell line descriptions contained in the single merged spreadsheet.

2. Class IRIs are used to import corresponding ontology classes from reference ontologies, along with axiomatic and annotation information within the class signature if present and parent classes.

3. The EFO upper level is re-used here (a slim version of BFO) and determines where imported classes should be placed, e.g. disease classes are imported under the disease parent, itself a child of disposition.

4. The Coriell cell line ontology in OWL is output.

5. The ontology was manually reviewed for correctness, checked for consistency using HermiT 1.3.1 and test defined classes were added.
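To make the generation step concrete, the following minimal Python sketch renders one spreadsheet row as a class description in the style of Manchester OWL syntax, following the cell line pattern of Figure 1. It is plain string templating for illustration and does not use the OWL-API; the row contents and the use of readable names in place of IRIs are assumptions.

```python
def quote(name):
    """Wrap names containing spaces in single quotes, as Manchester syntax expects."""
    return f"'{name}'" if " " in name else name


def derives_from_expr(cell_type, organism_part, organism):
    """Nested restriction: cell type, part of an organism part, part of an organism."""
    return (
        f"derives_from some ({quote(cell_type)} and (part_of some "
        f"({quote(organism_part)} and (part_of some {quote(organism)}))))"
    )


def cell_line_frame(row):
    """Render one row as a class frame with 'cell line' as the only asserted parent."""
    supers = ["'cell line'"]
    if row.get("disease"):
        supers.append(f"is_model_for some {quote(row['disease'])}")
    if row.get("cell_type") and row.get("organism_part") and row.get("organism"):
        supers.append(
            derives_from_expr(row["cell_type"], row["organism_part"], row["organism"])
        )
    body = ",\n".join("        " + s for s in supers)
    return f"Class: {quote(row['name'])}\n    SubClassOf:\n{body}\n"


# Hypothetical row; not an actual Coriell catalogue entry.
row = {
    "name": "EX0001",
    "cell_type": "fibroblast",
    "organism_part": "skin",
    "organism": "Homo sapiens",
    "disease": "cancer",
}
frame = cell_line_frame(row)
```

Because every class is produced from the same template, a change to the model only requires editing the template and re-running the generation over all rows.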

RESULTS

The Coriell cell line ontology contains 27,002 cell line classes, covering 11 cell types, 61 anatomical terms and 93 organisms. 657 OMIM numbers were attached to cell lines, and 393 of these OMIM numbers were mapped to 337 Disease Ontology classes. 7,688 cell lines were confirmed to model disease and a small number modeled multiple diseases, for example ND00139, which models Parkinson's disease and Lewy Body Disease. Following the creation of the ontology and validation of all lexical matches and of the ontology by a domain expert, some refinements to the imported structure were required, as follows:

Organism taxonomy

Organism classes imported from the NCBI taxonomy have long chains of parent classes, e.g. Homo sapiens has 28 classes in its subclass hierarchy. We retrospectively removed some of these nodes, applying the following design principles: (1) remove intermediate classes when the child class does not have more than 2 siblings; (2) when the deletion would lead to >3 child classes, retain the parent class. This strategy removed a large number of classes which were not required by our query use cases.
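One possible reading of these two principles is sketched below in Python. The interpretation is an assumption: an intermediate class is collapsed only if it has at most 3 children (so each child has no more than 2 siblings) and only if its parent would end up with no more than 3 children afterwards. The taxonomy is a hypothetical child-to-parent dictionary, and classes in keep (those referenced by cell lines) are never removed.

```python
def prune(parent_of, keep):
    """Collapse intermediate taxonomy classes under an assumed reading of the two rules."""
    parent_of = dict(parent_of)  # work on a copy
    changed = True
    while changed:
        changed = False
        for node in sorted(parent_of):
            if node in keep or node not in parent_of:
                continue
            parent = parent_of[node]
            children = [c for c, p in parent_of.items() if p == node]
            if not children or len(children) > 3:
                continue  # leaves are left alone; rule 1 does not apply to wide nodes
            siblings = [c for c, p in parent_of.items() if p == parent and c != node]
            if len(siblings) + len(children) > 3:
                continue  # rule 2: the parent would gain too many children
            for child in children:
                parent_of[child] = parent
            del parent_of[node]
            changed = True
    return parent_of


# Hypothetical chain standing in for a long lineage above a species.
chain = {"a": "root", "b": "a", "c": "b", "Homo sapiens": "c"}
pruned = prune(chain, keep={"Homo sapiens"})
```

In this toy chain all three intermediate classes are collapsed, leaving the species directly under the root; a class whose removal would crowd its parent with more than 3 children is retained.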

Anatomy

There were 81 unique terms describing anatomy, of which 45 mapped exactly to pre-existing terms in the MAT. Unmapped terms that described classes other than anatomy, such as fibroma and leiomyoma (diseases), were removed. Buttock-thigh and Thorax/abdomen could each be separated into two single terms, but it is not clear which part the terms were describing and these were also removed. A further 9 unmapped terms which did not appear to fit into anatomy, such as Keloid breast organoid, were removed. Among the remaining terms left unmapped by the concept recognition step, 12 terms were mapped to FMA, 9 terms to EFO, 2 to SNOMED CT and 2 to NCI Thesaurus, and 1 term remains unmapped. Mixing of terms from the disease and anatomy domains was found to be common in many parts of the Coriell Catalogue; manual effort was spent assessing outputs from lexical matching to correct these.

Cell type

22 unique cell type terms were mapped to the Cell Type Ontology, 11 of them with 100% similarity. Partial mappings were refined manually, e.g. smooth muscle is not a cell type and was modified to smooth muscle cell. Myeloma is not a cell type but a cancer of plasma cells, and was changed to plasma cell. Another 11 unmapped terms were not cell type terms and were removed.

Disease

We imported 337 Disease Ontology terms into the Coriell cell line ontology. DO is not well axiomatised beyond its use of subclass relationships. EFO, however, provides more information on class relationships (e.g. disease to anatomical parts). For disease we therefore added axioms from EFO to allow construction of defined classes based on disease, e.g. ‘liver disease cell lines’. Imported classes were axiomatised using additional logical restrictions, e.g. an axiom linking a disease to an anatomical part. This does not affect the DO child and parent classes; the canonical structure and IRIs from DO are preserved.

Adding defined classes to infer structure

Use of the normalisation methodology results in a flat asserted cell line hierarchy, i.e. the only asserted parent class of each cell line is the cell line class. For browsing purposes, however, it is often useful to have an organizational hierarchy, and we therefore created several under cell line using defined classes in OWL, i.e. classes with necessary and sufficient restrictions describing their members.

Fig. 3. Inference of human cancer cell line hierarchy in Protégé

For example, human cancer cell line (Figure 3) shows inferred subclasses and has the following necessary and sufficient restriction using Manchester OWL syntax:

    'cell line'
        and (is_model_for some cancer)
        and (derives_from some
            ('cell type'
                and (part_of some
                    ('organism part'
                        and (part_of some 'Homo sapiens')))))

The nesting reflects an important distinction from a set of separate statements; in effect, we are describing a specific cell type that is part of a specific organism part, which is in turn part of a specific organism. For the example in Figure 3, the defined class restricts membership to those classes where cancer is the modeled disease and which are derived from humans (more specifically, from cell types that are part of an organism part which is part of a human). We have also used disjointness axioms in some areas of the ontology. For example, by making Homo sapiens disjoint from its siblings under organism, we are able to query for things which are not Homo sapiens, because they have been explicitly defined as such.
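The membership condition of such a defined class can be mimicked on toy data as follows. This Python sketch is only an illustration of the nested restriction; it performs a simple lookup over closed, hypothetical records and is not description logic reasoning of the kind carried out by HermiT. The disease hierarchy and cell line entries are invented for the example.

```python
# Hypothetical toy disease hierarchy (child -> parent); not real ontology content.
SUBCLASS_OF = {"carcinoma": "cancer", "cancer": "disease", "diabetes": "disease"}


def is_a(term, ancestor):
    """True if term equals ancestor or is a descendant of it in SUBCLASS_OF."""
    while term is not None:
        if term == ancestor:
            return True
        term = SUBCLASS_OF.get(term)
    return False


# Each record nests cell type -> organism part -> organism, mirroring the restriction.
CELL_LINES = {
    "EX0001": {
        "is_model_for": ["carcinoma"],
        "derives_from": {
            "cell_type": "epithelial cell",
            "part_of": {"organism_part": "liver", "part_of": "Homo sapiens"},
        },
    },
    "EX0002": {
        "is_model_for": ["carcinoma"],
        "derives_from": {
            "cell_type": "fibroblast",
            "part_of": {"organism_part": "skin", "part_of": "Mus musculus"},
        },
    },
    "EX0003": {
        "is_model_for": ["diabetes"],
        "derives_from": {
            "cell_type": "fibroblast",
            "part_of": {"organism_part": "skin", "part_of": "Homo sapiens"},
        },
    },
}


def is_human_cancer_cell_line(desc):
    """Check both conditions: models some cancer, and derives from a human cell type."""
    models_cancer = any(is_a(d, "cancer") for d in desc.get("is_model_for", []))
    organism = desc.get("derives_from", {}).get("part_of", {}).get("part_of")
    return models_cancer and organism == "Homo sapiens"


human_cancer_lines = sorted(
    name for name, desc in CELL_LINES.items() if is_human_cancer_cell_line(desc)
)
```

Only the first toy record satisfies both conditions; the second is excluded by organism and the third by disease, which is the grouping a reasoner would infer under the defined class.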

Rapid generation and regeneration

The ontology was developed over 3 months by one person working full time. The majority of this time was spent developing the code to produce the ontology; a repeat exercise would take considerably less time. We made several changes to the ontology as we progressed and refined the model slightly; the programmatic method used meant that regenerating the new OWL ontology took minutes. Rapid programmatic addition of content is also possible. By comparison with manual development of a similar ontology, e.g. the cell line ontology, we estimate that ~12 months of development time was saved.

DISCUSSION

One of the central claims of this work is that the ontology was rapidly developed using the methods described. Over the 3 months that this work was conducted, we estimate 2 months comprised investigation of the catalogue content and Perl scripting to merge and format the initial input files. A further month’s programming resulted in an ontology of ~28,000 classes. Generalizable components of the methodology include: design of reusable design patterns, re-use of ontology development code and exploitation of the MIREOT process for term imports.

There is a trade-off between hand-crafted curation by individual experts and the rapid development of a very large resource. Our approach is of most benefit when semi-structured data exist and existing Foundry-type ontologies are available, e.g. for cell types. As a one-off SQL dump was used for development, updates need to be managed in future and a dynamic method for accessing new data is desirable.

One of the criteria for inclusion in the OBO Foundry effort (Smith et al., 2009) is that every class is given a textual definition. The effort required to manually produce good textual definitions for an ontology the size of the Coriell cell line ontology is significant. Given the axiomatisation of the ontology, however, efforts such as producing natural language from OWL statements may offer an effective and rapid method of producing textual definitions (Stevens et al., 2011). If such an approach can be applied, we will seek to include the artifact in the OBO Foundry in the future. We are also currently working with the Cell Line Ontology to ensure our respective models are synchronized and to merge the Coriell cell line ontology with the CLO, which is currently derived from the American Type Culture Collection (ATCC). Other work includes mapping to all resources which contain cell line references and addition of these to the ontology, re-running of imports to detect changes in source ontologies, term requests to e.g. the cell type ontology to classify cells by anatomical part, and manual addition of information where possible. For example, much of the text containing phenotypic descriptions was unstructured and could be mined and added. A complete evaluation of additional metadata against that of the CLO is also desirable in order to prioritise where to add curation effort and which additional data could be added to the core we have built. This work has allowed us to refine the cell line model within EFO to be consistent with the CLO, and this will be revised in future releases of EFO. Future work also includes the release of the Coriell ontology to Bio2RDF for linked open data access. Finally, our programmatic approach is fully compatible with manual curation and ontology development, and a combined approach is likely to produce rich, well-structured ontologies for community use.

Acknowledgements

We thank the Functional Genomics Production Team, the Coriell Institute for Medical Research, Alan Ruttenberg and Science Commons for providing the Coriell SQL dump; Lynn Schriml and colleagues from the Disease Ontology for OMIM mappings; and Sirarat Sarntivijai, Oliver He, Alexander Diehl and Terry Meehan for discussions on the cell line model. Funding: The European Molecular Biology Laboratory, and EC (HEALTH theme no. 200754 Gen2Phen).