Review of Semantic Integration in the Life Sciences

This article discusses semantic data integration as a way to address the overwhelming challenge of integrating and querying across thousands of biological databases, in a way that should be superior to “syntactic” data integration. It discusses, at the highest level, two kinds of data integration strategies: i) local-as-view, where queries are mapped to their local sources (no transformation required), and ii) global-as-view, where source data is transformed into a common schema. While ontology presents a salient opportunity to unify various data sources through a shared conceptualization, at least one example (protein, as the term is understood in BioPAX versus UniProt) demonstrates that this will be a significant and non-trivial challenge. This is a great motivating example: it exemplifies both syntactic heterogeneity (URI differences) and semantic heterogeneity (differences in natural language and axiomatic definitions). How these differences should be resolved is a question worthy of careful analysis in this article.

From schemas to ontology

The article needs to be reformulated in such a way that it clearly presents the problem of data integration and clearly defines and contrasts syntactic and semantic approaches, but more importantly, identifies the role of ontology in this process, as opposed to schemas. Ontology, in this sense, refers not only to an enhanced logic-based formalism in which class descriptions can be logically evaluated for equivalence through subsumption/consistency checking, but also to philosophical ontology such that different kinds of entities, including relations, can be integrated across domains.
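To make the point about logical evaluation concrete for the authors, here is a minimal sketch of subsumption checking over a toy class hierarchy. It is a deliberate simplification: it only follows asserted subclass axioms transitively, whereas an OWL reasoner also classifies complex class descriptions and checks consistency. All class names and the `SUBCLASS_OF` table are hypothetical and are not taken from any published ontology.

```python
# Toy subclass axioms (child -> direct parents).
# All class names here are hypothetical illustrations.
SUBCLASS_OF = {
    "Enzyme": {"Protein"},
    "Protein": {"Macromolecule"},
    "Macromolecule": {"MaterialEntity"},
}


def superclasses(cls):
    """Return every class reachable from `cls` by following subclass axioms."""
    seen = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for parent in SUBCLASS_OF.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def is_subsumed_by(sub, sup):
    """True if `sup` subsumes `sub`; every class subsumes itself."""
    return sub == sup or sup in superclasses(sub)
```

With such a check, a query for instances of a general class can also return instances of its subclasses, which is something a plain schema does not provide on its own.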

From RDF to OWL

The vast majority of RDF-based data integration efforts are fairly trivial (although they require much time and effort) and do not exploit the explicit semantics found in rich ontologies. For instance, Bio2RDF (Belleau F et al. J biomed inform 2008) now contains over 5 billion linked data statements using RDF, but with no overarching ontology, queries must be formulated by tracing a path against existing resources. In contrast, data integration projects such as the pharmacogenomics of depression project (Michel Dumontier and Natalia Villanueva-Rosales. Briefings in Bioinformatics. 2009. 10(2):153-163.) use expressive logic-based ontologies that build on foundational philosophical ontology.
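As an illustration of what tracing a path against existing resources involves, the sketch below walks an explicit chain of predicates over a handful of toy triples. The identifiers, prefixes and predicate names are all hypothetical and do not reflect the actual Bio2RDF vocabulary; the point is only that the person writing the query must already know each source-specific link.

```python
# Toy triples (subject, predicate, object).
# Every identifier and predicate below is a hypothetical illustration.
TRIPLES = [
    ("src_a:gene1", "src_a:encodes", "src_b:prot1"),
    ("src_b:prot1", "src_c:targeted_by", "src_c:drug1"),
    ("src_a:gene2", "src_a:encodes", "src_b:prot2"),
]


def follow(starts, predicate):
    """Return the objects linked from any subject in `starts` via `predicate`."""
    return {o for s, p, o in TRIPLES if s in starts and p == predicate}


def trace_path(start, predicates):
    """Follow an explicit chain of predicates, one hop per link."""
    frontier = {start}
    for predicate in predicates:
        frontier = follow(frontier, predicate)
    return frontier


# The caller has to spell out the exact predicate chain across sources.
drugs = trace_path("src_a:gene1", ["src_a:encodes", "src_c:targeted_by"])
```

If a source names a predicate differently, the chain has to be rewritten by hand. An overarching ontology would instead let the query be stated against shared classes and relations.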

Vision

The vision, then, is that data integration occurs across domains through the combination of logic and philosophical ontology.