InfoSci®-Journals Annual Subscription Price for New Customers: As Low As US$ 4,950

This collection of over 175 e-journals offers unlimited access to highly-cited, forward-thinking content in full-text PDF and XML with no DRM. There are no platform or maintenance fees and a guarantee of no more than 5% increase annually.

Receive the complimentary e-books for the first, second, and third editions with the purchase of the Encyclopedia of Information Science and Technology, Fourth Edition e-book. Plus, take 20% off when purchasing directly through IGI Global's Online Bookstore.

Abstract

In this chapter, the authors discuss the challenges of performing reasoning on large scale RDF datasets from the Web. Using ter-Horst’s pD* fragment of OWL as a base, the authors compose a rule-based framework for application to Web data: they argue their decisions using observations of undesirable examples taken directly from the Web. The authors further temper their OWL fragment through consideration of “authoritative sources” which counter-acts an observed behaviour which they term “ontology hijacking”: new ontologies published on the Web re-defining the semantics of existing entities resident in other ontologies. They then present their system for performing rule-based forward-chaining reasoning which they call SAOR: Scalable Authoritative OWL Reasoner. Based upon observed characteristics of Web data and reasoning in general, they design their system to scale: the system is based upon a separation of terminological data from assertional data and comprises of a lightweight in-memory index, on-disk sorts and file-scans. The authors evaluate their methods on a dataset in the order of a hundred million statements collected from real-world Web sources and present scale-up experiments on a dataset in the order of a billion statements collected from the Web. In this republished version, the authors also present extended discussion reflecting upon recent developments in the area of scalable RDFS/OWL reasoning, some of which has drawn inspiration from the original publication (Hogan, et al., 2009).

Introduction

Information attainable through the Web is unique in terms of scale and diversity. The Semantic Web movement aims to bring order to this information by providing a stack of technologies, the core of which is the Resource Description Framework (RDF) for publishing data in a machine-readable format: there now exists millions of RDF data-sources on the Web contributing billions of statements. The Semantic Web technology stack includes means to supplement instance data being published in RDF with ontologies described in RDF Schema (RDFS) (Brickley & Guha, 2004) and the Web Ontology Language (OWL) (Bechhofer, et al., 2004; Smith, Welty, & McGuinness, 2004), allowing people to formally specify a domain of discourse, and providing machines a more sapient understanding of the data. In particular, the enhancement of assertional data (i.e., instance data) with terminological data (i.e., structural data) published in ontologies allows for deductive reasoning: i.e., inferring implicit knowledge.

In particular, our work on reasoning is motivated by the requirements of the Semantic Web Search Engine (SWSE) project: http://swse.deri.org/, within which we strive to offer search, querying and browsing over data taken from the Semantic Web. Reasoning over aggregated Web data is useful, for example: to infer new assertions using terminological knowledge from ontologies and therefore provide a more complete dataset; to unite fractured knowledge (as is common on the Web in the absence of restrictive formal agreement on identifiers) about individuals collected from disparate sources; and to execute mappings between domain descriptions and thereby provide translations from one conceptual model to another. The ultimate goal here is to provide a “global knowledge-base”, indexed by machines, providing querying over both the explicit knowledge published on the Web and the implicit knowledge inferable by machine. However, as we will show, complete inferencing on the Web is an infeasible goal, due firstly to the complexity of such a task and secondly to noisy Web data; we aim instead to strike a comprise between the above goals for reasoning and what is indeed feasible for the Web.

Current systems have had limited success in exploiting ontology descriptions for reasoning over RDF Web data. While there exists a large body of work in the area of reasoning algorithms and systems that work and scale well in confined environments, the distributed and loosely coordinated creation of a world-wide knowledge-base creates new challenges for reasoning:

•

the system has to perform on Web-scale, with implications on the completeness of the reasoning procedure, algorithms and optimisations;

•

the method has to perform on collaboratively created knowledge-bases, which has implications on trust and the privileges of data publishers.

With respect to the first requirement, many systems claim to inherit their scalability from the underlying storage -- usually some relational database system -- with many papers having been dedicated to optimisations on database schemata and access; cf. (Hondjack, Pierra, & Bellatreche, 2007; Pan & Heflin, 2003; Theoharis, Christophides, & Karvounarakis, 2005; Zhou, et al., 2006). With regards the second requirement, there have been numerous papers dedicated to the inter-operability of a small number of usually trustworthy ontologies; cf. (Ghilardi, Lutz, & Wolter, 2006; Jiménez-Ruiz, Grau, Sattler, Schneider, & Llavori, 2008; Lutz, Walther, & Wolter, 2007). We leave further discussion of related work for the end of the chapter, except to state that the combination of Web-scale and Web-tolerant reasoning has received little attention in the literature and that our approach is novel.