WikiProteins is an online terminology system with a wiki interface. Several important databases have already been imported, among them the UMLS MetaThesaurus. Each concept in the MetaThesaurus is tagged as belonging to at least one of 135 Semantic Types (ST; see Table 1), a simplified, hierarchical categorization system covering the breadth of UMLS.

Using the UMLS Semantic Types as an upper ontology for WikiProteins imposes some obvious limitations because of the inconsistencies and ambiguities they contain. To ensure extensibility and interoperability of the databases contained within WikiProteins, and to reduce ambiguity, we seek to implement a new upper ontology based on Description Logic (DL) in WikiProteins and, at the same time, to map the STs to it.

Upper ontologies are, by definition, not domain-specific. Because we intend to implement an upper ontology and at the same time extend it downwards into the biomedical domain, it is essential that a mid-level “glue” ontology can be attached seamlessly to the upper ontology chosen.

Originally a research project started in 1984, Cyc is currently maintained by Cycorp, Inc., for commercial AI applications such as call center transcript analysis, text mining and game design. A publicly available, non-commercial version, OpenCyc, was released in 2001, with subsequent updates, but the number of assertions is much lower than in the commercial version. A research version, ResearchCyc, also exists, but it requires non-disclosure, which precludes its use in a public project like WikiProteins. Cycorp has stated its intention to port all non-proprietary information from ResearchCyc to OpenCyc, but more than a year after this announcement, this has not yet happened.
Assertions in Cyc are expressed in first-order logic, with extensions for modal operators and higher-order quantification. A complicating factor is the widespread use of reification (a construct to allow the presence of contradictory statements). Given these conditions, it is unlikely that a DL representation of Cyc could be generated.
Despite its broad subject coverage, Cyc gives very little specific attention to the life sciences and medicine. Implementing the Semantic Network in Cyc would likely be a difficult task.

Like Cyc, SUMO is part of the IEEE Standard Upper Ontology Working Group. It is developed by Teknowledge Corp. in SUO-KIF, a variant of the Knowledge Interchange Format (KIF). While it may be possible to convert KIF in general to DL, the available OWL interpretation of SUMO is in OWL Full, not OWL DL. MILO is a copyright-encumbered mid-level ontology connecting SUMO to commercial Teknowledge domain ontologies, but “any ontology you create based on MILO or our domain ontologies is your property”. No scientific domain ontologies are available apart from two very specific ones (“biological virii” and “atomic elements”). Even if it is possible to express the Semantic Network in (SUO-)KIF, the current level of detail in MILO with regard to the biomedical domain goes no further than BiologicalProcess, which has two children: CausingHappiness and CausingUnhappiness. If this arbitrary division is representative of the orientation of SUMO/MILO, then it would appear to be unsuited to our needs.

BioTop is an upper-to-middle ontology which, as the name suggests, is geared towards the biomedical domain. It is developed at the Universities of Freiburg and Jena, and is entirely in DL. The ontology (Fig. 1) is divided into two parts. At the upper level is the Basic Formal Ontology (BFO), authored by Barry Smith, among others, which lends it a philosophical character. Below that is a relatively deep network of biological classes, intended to provide an interface to the domain ontologies contained in the OBO Foundry. Some classes appear to be a straight match with semantic types, some even being eponymous, like “Animal”, “Fungus” and “Virus”. In other cases, an ST may not be exactly identical to a BioTop class, but creating a new BioTop class would be unlikely to result in a better ontology. For instance, the “Biologic Function” ST is described as “A state, activity or process of the body or one of its systems or parts”, while the BioTop class “BiologicalProcess” makes no mention of bodies. UMLS probably does not consider biological processes in non-human organisms, but creating a “BiologicalProcessInHuman” class does not make much sense.

Considering the sound and compatible structure of BioTop, and the willingness of the BioTop developers to cooperate in this project, the choice to use it as a base for our upper ontology is clear. Although other biological upper ontologies (Simple Bio Upper Ontology, GFO-Bio, UBO) exist, and BioTop itself is neither complete nor without issues, none matches its quality and extent.

To begin, we will try to map each Semantic Type to BioTop, or if that is impossible, determine where the latter must be extended. A preliminary effort is being undertaken; a scheme of the results so far is available. This is facilitated by the availability of textual definitions of all ST classes (web view) and most BioTop classes. An outline of the mapping will be circulated among domain experts, soliciting comments where necessary. BioTop class membership can then be assigned to individual MetaThesaurus entities inside WikiProteins, using the same mechanism with which STs are assigned now.
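
As a minimal sketch of what one row of such a mapping could look like, the Python fragment below records, per Semantic Type, the BioTop class it maps to and the action to take. The field names, the three action labels ("exact", "approximate", "extend") and the placeholder row are assumptions made for illustration only; the four real rows merely restate the examples given above and are not results of the mapping effort.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical action labels; the actual mapping scheme may use others.
ACTIONS = {"exact", "approximate", "extend"}


@dataclass(frozen=True)
class Mapping:
    semantic_type: str
    biotop_class: Optional[str]  # None when BioTop has no suitable class yet
    action: str
    note: str = ""


MAPPINGS = [
    Mapping("Animal", "Animal", "exact"),
    Mapping("Fungus", "Fungus", "exact"),
    Mapping("Virus", "Virus", "exact"),
    Mapping("Biologic Function", "BiologicalProcess", "approximate",
            "ST definition refers to the body; the BioTop class does not"),
    # Placeholder row (not a real Semantic Type) showing the "extend" case.
    Mapping("Hypothetical Unmapped Type", None, "extend",
            "BioTop would need a new class here"),
]


def validate(mappings: List[Mapping]) -> List[str]:
    """Return a list of human-readable problems; empty if the table is sane."""
    problems = []
    seen = set()
    for m in mappings:
        if m.semantic_type in seen:
            problems.append(f"duplicate row for {m.semantic_type}")
        seen.add(m.semantic_type)
        if m.action not in ACTIONS:
            problems.append(f"unknown action {m.action!r} for {m.semantic_type}")
        elif m.action != "extend" and m.biotop_class is None:
            problems.append(f"{m.semantic_type} has no target class")
    return problems


def needs_extension(mappings: List[Mapping]) -> List[str]:
    """Semantic Types for which BioTop must be extended."""
    return [m.semantic_type for m in mappings if m.action == "extend"]
```

A table of this shape can be exported to a spreadsheet for circulation among domain experts, and the list returned by `needs_extension` is the part that requires changes to BioTop itself.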

After the semantic types have been mapped, creating an implementation in OWL should be possible. The most challenging part will likely be the mapping of the semantic relations from the frame-based UMLS Semantic Network (SN) to a DL, which has no ambiguity but less expressivity. The SN defines a list of permissible (and a few forbidden) relations between Semantic Types, as well as the (non-)inheritance of these relations from a generic parent type to its more specific children. Given such a list, and the classification of concepts by ST, it can be inferred which types of relations are allowed between which concepts.
A second issue may arise from the fact that concepts can belong to multiple semantic types and therefore inherit from multiple classes. This might lead to inconsistencies.
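
The inference of permitted relations from a type hierarchy can be sketched in Python as follows. This is an illustration under stated assumptions, not the SN's actual algorithm or content: the small hierarchy, the single allowed rule and the single blocked rule are toy data, and the sketch assumes that an explicit block on a pair of types also applies to all of their descendants.

```python
# Toy type hierarchy (child -> parent); illustrative, not the real SN tree.
PARENT = {
    "Plant": "Organism",
    "Animal": "Organism",
    "Mental Process": "Biologic Function",
}

# Illustrative rules as (subject type, relation, object type) triples.
ALLOWED = {("Biologic Function", "process_of", "Organism")}
BLOCKED = {("Mental Process", "process_of", "Plant")}


def ancestors(semantic_type):
    """Yield the type itself, then each ancestor up to the root."""
    current = semantic_type
    while current is not None:
        yield current
        current = PARENT.get(current)


def relation_allowed(subject, relation, obj):
    """True if the relation is inherited by this pair and not blocked."""
    subject_line = list(ancestors(subject))
    object_line = list(ancestors(obj))
    pairs = [(s, o) for s in subject_line for o in object_line]
    # Assumption: a block stated for a pair also covers its descendants.
    if any((s, relation, o) in BLOCKED for s, o in pairs):
        return False
    # Otherwise the pair inherits any rule stated for its ancestors.
    return any((s, relation, o) in ALLOWED for s, o in pairs)
```

With this toy data, `relation_allowed("Mental Process", "process_of", "Animal")` is inherited from the general rule, while the same relation with "Plant" as object is rejected by the block. A DL version would have to express both the inherited rule and the exception without relying on this kind of procedural override.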

Because the STs lack ontological rigor, concepts that differ in fundamental ways (e.g., classes versus instances) may still belong to the same ST. After the mapping to a DL version, a relation that was permitted between two semantic types may therefore hold for one pair of concepts of those types but not for another pair, because a distinction that the ST classification scheme did not make becomes explicit in the DL-based representation. Seeking out these kinds of errors will bring out inconsistencies in UMLS, which may then be corrected.

When this topic has been sufficiently explored, the WikiProteins software will be improved to allow the use of resources such as the SN to provide feedback to wiki editors: given the knowledge of which relation types are allowed between entries marked with which ST, it should be possible to limit the number of options available to the user when adding relations between concepts. Once this functionality is in place, similar resources such as the SwissProt annotations could be merged, further assisting the user. From this point onwards, cases where certain valid relations are not suggested and/or cannot be added by users should be investigated and resolved by increasing the resolution and completeness of the ruleset in tandem with the upper ontology used.
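
The following Python sketch illustrates how such a restriction of options might work for concepts that carry one or more STs. It is not the WikiProteins or WikiData implementation: the rule triples, the concept-to-type table and the function name `suggest_relations` are all hypothetical, and the rules are assumed to have been expanded already, so that no hierarchy lookup is needed.

```python
# Hypothetical, already-expanded rules: (subject type, relation, object type).
RULES = {
    ("Disease", "affects", "Plant"),
    ("Disease", "affects", "Animal"),
    ("Mental Dysfunction", "affects", "Animal"),
    ("Animal", "interacts_with", "Plant"),
}

# Hypothetical concept tagging; a concept may carry several types.
CONCEPT_TYPES = {
    "Fern": {"Plant"},
    "Dog": {"Animal"},
    "Depression": {"Mental Dysfunction", "Disease"},
    "Anxiety": {"Mental Dysfunction"},
}


def suggest_relations(subject, obj):
    """Relation types permitted for at least one pair of the concepts' types."""
    subject_types = CONCEPT_TYPES[subject]
    object_types = CONCEPT_TYPES[obj]
    return sorted({
        relation
        for (s_type, relation, o_type) in RULES
        if s_type in subject_types and o_type in object_types
    })
```

In this toy data, `suggest_relations("Anxiety", "Fern")` returns an empty list, so the editor would be offered no relation at all, whereas "Depression", which is also tagged as a generic "Disease", would still be offered "affects" for "Fern". This is the kind of case, caused by multiple or overly broad type assignments, that a finer-grained ruleset and upper ontology would have to resolve.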

Summarizing the tasks to be completed, the milestones in which they result, and in what order:

Map Semantic Types to BioTop classes, and expand the latter where needed to accommodate the former. This will result in a spreadsheet describing the action to take for each Semantic Type.

Verification by domain experts (Christine Chichester, Barend Mons, Olivier Bodenreider, Stephan Schulz, Elena Beisswanger, Ronald Cornet) of the mapping; incorporate feedback. This will be repeated until written consensus is reached on the action to take for each Semantic Type.

When satisfied, create an OWL implementation, and import the adapted BioTop classes into WikiProteins using the existing class membership mechanism.

Investigate inconsistencies created through the DL mapping (“Barry Smith” examples). A few examples might include:

Extend the WikiProteins software to enable suggestions/restrictions on possible relation types when adding relations to the wiki. This should prevent the addition of statements like "Ferns suffer from depression" (plants do not possess mental processes). The deliverable code, extending the WikiData extension, will be released under a GPL license.

Explore and expand the ruleset currently defined by the UMLS Semantic Network, and merge other sources of rules where possible.