Posted:May 11, 2016

New, Major Upgrade of UMBEL Released

The year since the last major release of UMBEL (Upper Mapping and Binding Exchange Layer) has been spent in a significant re-think of how the system is organized. Four years ago, in version 1.05, we began to split UMBEL into a core and a series of swappable modules. The first module adopted was in geographical information; the second was in attributes. This design served us well, but it was becoming apparent that we were on a path of multiple modules. Each of UMBEL’s major so-called ‘SuperTypes‘ — that is, major cleavages of the overall UMBEL structure that are largely disjoint from one another, such as between Animals and Facilities — were amenable to the module design. This across-the-board potential cleavage of the UMBEL system caused us to stand back and question whether a module design alone was the best approach. Ultimately, after much thought and testing, we adopted instead a typology design that brought additional benefits beyond simple modularity.

Today, we are pleased to announce the release of these efforts in UMBEL version 1.50. Besides standard release notes, this article discusses this new typology design, and explains its uses and benefits.

Basic UMBEL Background

The Web and enterprises in general are characterized by growing, diverse and distributed information sources and data. Some of this information resides in structured databases; some resides in schema, standards, metadata, specifications and semi-structured sources; and some resides in general text or media where the content meaning is buried in unstructured form. Given these huge amounts of information, how can one bring together what subsets are relevant? And, then for candidate material that does appear relevant, how can it be usefully combined or related given its diversity? In short, how does one go about actually combining diverse information to make it interoperable and coherent?

UMBEL thus has two broad purposes. UMBEL’s first purpose is to provide a general vocabulary of classes and predicates for describing and mapping domain ontologies, with the specific aim of promoting interoperability with external datasets and domains. UMBEL’s second purpose is to provide a coherent framework of reference subjects and topics for grounding relevant Web-accessible content. UMBEL presently has about 34,000 of these reference concepts drawn from the Cyc knowledge base, organized into 31 mostly disjoint SuperTypes.

The grounding of information mapped by UMBEL occurs by common reference to the permanent URIs (identifiers) for UMBEL’s concepts. The connections within the UMBEL upper ontology enable concepts from sources at different levels of abstraction or specificity to be logically related. Since UMBEL is an open source extract of the OpenCyc knowledge base, it can also take advantage of the reasoning capabilities within Cyc.

Diagram showing linked data datasets. UMBEL is near the hub, below and to the right of the central DBpedia.

UMBEL’s vocabulary is designed to recognize that different sources of information have different contexts and different structures, and meaningful connections between sources are not always exact. UMBEL’s 34,000 reference concepts form a knowledge graph of subject nodes that may be related to external classes and individuals (instances and entities). Via this coherent structure, we gain some important benefits:

Mapping to other ontologies — disparate and heterogeneous datasets and ontologies may be related to one another by mapping to the UMBEL structure

A scaffolding for domain ontologies — more specific domain ontologies can be made interoperable by using the UMBEL vocabulary and tieing their more general concepts into the UMBEL structure

Inferencing — the UMBEL reference concept structure is coherent and designed for inferencing, which supports better semantic search and look-ups

Semantic tagging — UMBEL, and ontologies mapped to it, can be used as input bases to ontology-based information extraction (OBIE) for tagging text or documents; UMBEL’s “semsets” broaden these matches and can be used across languages

Linked data mining — via the reference ontology, direct and related concepts may be retrieved and mined and then related to one another

Creating computable knowledge bases — with complete mappings to key portions of a knowledge base, say, for Wikipedia articles, it is possible to use the UMBEL graph structure to create a computable knowledge source, with follow-on benefits in artificial intelligence and KB testing and improvements, and

Categorizing instances and named entities — UMBEL can bring a consistent framework for typing entities and relating their descriptive attributes to one another.

UMBEL is written in the semantic Web languages of SKOS and OWL 2. It is a class structure used in linked data, along with other reference ontologies. Besides data integration, UMBEL has been used to aid concept search, concept definitions, query ranking, ontology integration, and ontology consistency checking. It has also been used to build large ontologies and for online question answering systems [1].

Including OpenCyc, UMBEL has about 65,000 formal mappings to DBpedia, PROTON, GeoNames, and schema.org, and provides linkages to more than 2 million Wikipedia pages (English version). All of its reference concepts and mappings are organized under a hierarchy of 31 different SuperTypes, which are mostly disjoint from one another. Development of UMBEL began in 2007. UMBEL was first released in July 2008. Version 1.00 was released in February 2011.

Summary of Version 1.50 Changes

These are the principal changes between the last public release, version 1.20, and this version 1.50. In summary, these changes include:

Re-aligned the SuperTypes to better support computability of the UMBEL graph and its resulting disjointedness

These SuperTypes were eliminated with concepts re-assigned: Earthscape, Extraterrestrial, Notations and Numbers

These new SuperTypes were introduced: AreaRegion, AtomsElements, BiologicalProcesses, Forms, LocationPlaces, and OrganicChemistry, with logically reasoned assignments of RefConcepts

The ShapesSuperType is a new ST that is inherently non-disjoint because it is shared with about half of the RefConcepts

The Situations is an important ST, overlooked in prior efforts, that helps better establish context for Activities and Events

Made re-alignments in UMBEL’s upper structure and introduced additional upper-level categories to better accommodate these refinements in SuperTypes

A typology was created for each of the resulting 31 disjoint STs, which enabled missing concepts to be identified and added and to better organize the concepts within each given ST

The broad adoption of the typology design for all of the (disjoint) SuperTypes also meant that prior module efforts, specifically Geo and Attributes, could now be made general to all of UMBEL. This re-integration also enabled us to retire these older modules without affecting functionality

Expanded and updated the UMBEL.org Web site, with access and demos of UMBEL.

UMBEL’s SuperTypes

The re-organizations noted above have resulted in some minor changes to the SuperTypes and how they are organized. These changes have made UMBEL more computable with a higher degree of disjointedness between SuperTypes. (Note, there are also organizational SuperTypes that work largely to aid the top levels of the knowledge graph, but are explicitly designed to NOT be disjoint. Important SuperTypes in this category include Abstractions, Attributes, Topics, Concepts, etc. These SuperTypes are not listed below.)

UMBEL thus now has 31 largely disjoint SuperTypes, organized into 10 or so clusters or “dimensions”:

Constituents

Natural Phenomena

Area or Region

Location or Place

Shapes

Forms

Situations

Time-related

Activities

Events

Times

Natural Matter

Atoms and Elements

Natural Substances

Chemistry

Organic Matter

Organic Chemistry

Biochemical Processes

Living Things

Prokaryotes

Protists & Fungus

Plants

Animals

Diseases

Agents

Persons

Organizations

Geopolitical

Artifacts

Products

Food or Drink

Drugs

Facilities

Information

Audio Info

Visual Info

Written Info

Structured Info

Social

Finance & Economy

Society

These disjoint SuperTypes provide the basis for the typology design described next.

The Typology Design

After a few years of working with SuperTypes it became apparent each SuperType could become its own “module”, with its own boundaries and hierarchical structure. Since across the UMBEL structure nearly 90% of the reference concepts are themselves entity classes, if these are properly organized, we can achieve a maximum of disjointness, modularity, and reasoning efficiency. Our early experience with modules pointed the way to a design for each SuperType that was as distinct and disjoint from other STs as possible. And, through a logical design of natural classes [3] for the entities in that ST, we could achieve a flexible, ‘accordion-like’ design that provides entity tie-in points from the general to the specific for each given SuperType. The design is effective for being able to interoperate across both fine-grained and coarse-grained datasets. For specific domains, the same design approach allows even finer-grained domain concepts to be effectively integrated.

All entity classes within a given SuperType are thus organized under the SuperType itself as the root. The classes within that ST are then organized hierarchically, with children classes having a subClassOf relation to their parent. Each class within the typology can become a tie-in point for external information, providing a collapsible or expandable scaffolding (the ‘accordion’ design). Via inferencing, multiple external sources may be related to the same typology, even though at different levels of specificity. Further, very detailed class structures can also be accommodated in this design for domain-specific purposes. Moreover, because of the single tie-in point for each typology at its root, it is also possible to swap out entire typology structures at once, should design needs require this flexibility.

We have thus generalized the earlier module design to where every (mostly) disjoint SuperType now has its own separate typology structure. The typologies provide the flexible lattice for tieing external content together at various levels of specificity. Further, the STs and their typologies may be removed or swapped out at will to deal with specific domain needs. The design also dovetails nicely with UMBEL’s build and testing scripts. Indeed, the evolution of these scripts via literate programming has also been a reinforcing driver for being able to test and refine the complete ST and typologies structure.

Still a Work in Progress

Though UMBEL retains its same mission as when the system was first formulated nearly a decade ago, we also see its role expanding. The two key areas of expansion are in UMBEL’s use to model and map instance data attributes and in acting as a computable overlay for Wikipedia (and other knowledge bases). These two areas of expansion are still a work in progress.

The mapping to Wikipedia is now about 85% complete. While we are testing automated mapping mechanisms, because of its central role we also need to vet all UMBEL-Wikipedia mapping assignments. This effort is pointing out areas of UMBEL that are over-specified, under-specified, and sometimes duplicative or in error. Our goal is to get to a 100% coverage point with Wikipedia, and then to exercise the structure for machine learning and other tests against the KB. These efforts will enable us to enhance the semsets in UMBEL as well as to move toward multilingual versions. This effort, too, is still a work in progress.

Despite these desired enhancements, we are using all aspects of UMBEL and its mappings to both aid these expansions and to test the existing mappings and structure. These efforts are proving the virtuous circle of improvements that is at the heart of UMBEL’s purposes.

Where to Get UMBEL and Learn More

The UMBEL Web site provides various online tools and Web services for exploring and using UMBEL. The UMBEL GitHub site is where you can download the UMBEL Vocabulary or the UMBEL Reference Concept ontology, both under a Creative Commons Attribution 3.0 license. Other documents and backup are also available from that location.