This document is an analysis provided by the authors and carries no
endorsement by the Consortium.

As we begin the XML Schema design [XMLSchema]
and examine the RDF Schema design [RDFSchema],
this document acknowledges the input we have received on how they fit together
and how they should fit together, and invites further exploration.

This section represents the status of this document at the
time this version was published. It will become outdated if and
when a new version is published. The latest status is maintained at the W3C.

Abstract

The World Wide Web is a universal information space. As a medium for human
exchange, it is becoming mature, but we are just beginning to build a space
where automated agents can contribute--just beginning to build the Semantic
Web. The RDF Schema design [RDFSchema] and XML
Schema design [XMLSchema] began independently,
but we explore a common model where they fit together as interlocking pieces
of the semantic web technology.

The architecture of the World Wide Web provides users with a simple
hypertext interface to a variety of remote resources, from static documents
purely for human consumption to interactive data services. HTML, the data format that facilitated the widespread
deployment of the Web, started by adding URI-based linking to word-processor-style
rich text to provide basic global
hypertext functionality. The addition of forms to HTML provided a minimal but
functional user interface to interactive data services.

While this HTML infrastructure has facilitated a revolution in global
information technology, it suffers from the inevitable limitations of a "one
size fits all" solution: rich document structures are lost as the content is
squeezed into the primitive structures of HTML. Similarly, the cost of
squeezing rich data structures into and out of HTML is paid in efficiency and
integrity.

Now that the Web has reached critical mass as a medium for human
communication, the next phase is to build the "Semantic Web". The Semantic Web
is a Web that includes documents, or portions of documents, describing
explicit relationships between things and containing semantic information
intended for automated processing by our machines.

XML began as a project to address HTML's
limitations on structured documents, by selecting a simple-to-implement yet
extensible subset of SGML for use on the Web. It has emerged as the
infrastructure for structured data interchange as well.

Meanwhile, in our effort to address the impact of the Web on society, the
W3C membership came together to develop the Platform for Internet Content
Selection (PICS), which provides users with the
ability to select content based on labels provided by information providers or
other sources. A critical component of PICS is the rating system description,
a sort of schema; every PICS label points to a description, in the Web, of the
fields in the label.

PICS was designed as a first step toward generalized labels that would
allow any party in the Web to make claims about the qualities of resources:
endorsements, terms and conditions for use, and so on. The Metadata Activity addresses the necessary
work to complete the picture: structured labels, rules, integration with
digital signatures. The PICS label design was generalized to a model of
information as directed labeled graphs (DLGs). This was known as the RDF model, and a serialization was defined in XML
syntax. PICS rating systems were incorporated as special cases in the design
of RDF Schemas.

XML documents have a mechanism for self-description as well: the DTD. As
the use of XML became more diverse and intense, the limitations of the aged
DTD design became acute, especially in the area of data typing, modularity,
and reuse; soon W3C began work in the XML
Activity on a new generation of schemas for XML.

The initial expectation was that RDF would be simply layered on top of XML,
with minimal interaction. But then the RDF design started to include a
"namespace" facility for connecting XML element names to web addresses, which
was closely related to a long-standing design discussion in XML (and SGML
before that). Similarly, the RDF requirements for datatypes like integer and
date were shared with many other XML based formats.

Over time, the interactions grew. The Document Object Model (DOM), which started as an effort to harmonize HTML
scripting facilities in browsers, expanded in scope to include XML and became
a foundational Application Programming Interface (API) in many Web software
platforms and structured data repositories. Software built on these platforms
sees RDF not as XML streams but as DOM objects. The emergence of the
transformation component of Extensible Style Language (XSL) as a useful component in its own right sheds
new light on many of the syntactic design issues in RDF. The benefits of a syntax
that is easy to manipulate with DOM and XSL were not evident in the early RDF
design stage.

At the Query Language Workshop [QL98] a number of
applications were being designed using XML to encode DLG data and it was clear
that the syntax used by the RDF community to do this was not as direct as that
assumed by some others. To some participants, the mapping of XML elements
directly to graph edges (rather than nodes) was the closer, more natural mapping. The direct
mapping meant that statements about RDF's arcs and XML's elements had
implications for each other by this stronger analogy. This suggested a need to
define precisely what that mapping was, thereby determining the architectural
connection between future work on the Semantic Web and other applications of
XML schemas.
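The direct element-to-edge mapping discussed at the workshop can be sketched as follows. This is an illustrative reading, not a specified algorithm: each child element of a node-describing element is taken as an arc whose label is the element name and whose target is the element's text content. The element and property names are invented for the example.

```python
# Sketch: read each child element as a labeled graph edge (arc) rather
# than as a node. Names here are illustrative, not from any specification.
import xml.etree.ElementTree as ET

def element_to_edges(subject, element):
    """Read each child element of `element` as an arc from `subject`."""
    edges = []
    for child in element:
        target = (child.text or "").strip()
        edges.append((subject, child.tag, target))
    return edges

doc = ET.fromstring(
    "<Person>"
    "<name>Alice</name>"
    "<homepage>http://example.org/alice</homepage>"
    "</Person>"
)
edges = element_to_edges("#alice", doc)
```

Under this reading, the `<Person>` element describes a node and its children describe two arcs leaving that node, which is the "stronger analogy" between XML elements and graph arcs.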

First, we review some of the requirements for the Semantic Web. Second, we
review the data models of many systems whose data is under strong pressure to
be accessible directly in semantic form. For each we try to delineate the
mapping where it is evident, but outline the areas where specification work is
required.

Traditionally, both documents and databases have been strongly typed; that
is, the producer and consumer have prior agreement on the structure of the
information units. But this by itself is not sufficient for the long-term
health of the Semantic Web. The Semantic Web must permit distributed
communities to work independently to increase the Web of understanding, adding
new information without insisting that the old be modified. This approach
allows the communities to resolve ambiguities and clarify inconsistencies over
time while taking maximum advantage of the wealth of backgrounds and abilities
reachable through the Web. Therefore the Semantic Web must be based on a
facility that can expand as human understanding expands. This facility must be
able to capture information that links independent representations of
overlapping areas of knowledge.

The XML 1.0 specification [XML98] takes a large
step toward enabling the interchange of information even with a party that is
able to recognize only a portion of a document. XML specifies the syntactic
constraint called well-formedness. Well-formedness is a fundamental
tool for allowing documents to include extended information while remaining
processable by older "down-level" applications.
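The down-level behavior enabled by well-formedness can be illustrated with a small sketch (element names and the extension namespace are invented): a document carrying vocabulary a consumer does not know still parses, and the consumer simply passes over the unfamiliar elements.

```python
# Sketch: a well-formed document may carry extended vocabulary; an older
# application parses it successfully and ignores what it does not know.
# Element names and the extension namespace are invented for illustration.
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<report>"
    "<title>Q3 summary</title>"
    "<newFeature xmlns='http://example.org/ext'>extra data</newFeature>"
    "</report>"
)

# A "down-level" consumer that only understands <title> skips the rest:
known = [el.text for el in doc if el.tag == "title"]
```

No schema or prior agreement about `newFeature` is needed for the older consumer to keep working; well-formedness alone guarantees the document remains processable.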

Old engineering habits[1] suggest that for
every document there must exist, in a single place, a complete
enumeration of every markup feature present in that document. While this
notion of XML validity is appropriate for many application contexts,
it is too strong a constraint to place on the Semantic Web.

Mixing of vocabularies is a critical feature for the Web [BC98]. Of the evolutionary requirements on protocols [HTTPNG98], the first two of three also apply to data
formats:

The extensibility requirement is that extended applications do not require
agreement across the whole Internet; rather, it suffices:

that conforming peers supporting a particular protocol extension or feature
can employ it with no prior agreement;

that it is possible for one party having a capability for a new protocol to
require that the other party either understand and abide by the new protocol
or abort the operation; and

that negotiation of capabilities is possible.

Incremental decentralized development of Semantic Web applications requires
documents to be able to contain an ad hoc mixture of features from multiple
application domains. The combinatoric issues make it impractical to predefine
document types that encompass all the possible vocabulary sets. Instead, the
XML Namespace facility [XMLNS99] allows this
vocabulary mix-in. The Resource Description Framework Model and Syntax
Recommendation [RDF99] leverages the XML Namespace
facility throughout.
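The vocabulary mix-in enabled by XML Namespaces can be sketched concretely. In this hedged example, a Dublin Core term and a term from a hypothetical inventory vocabulary (the `http://example.org/inventory` namespace is invented) coexist in one document with no prior coordination between their designers:

```python
# Sketch: two independently developed vocabularies mixed in one document
# via XML Namespaces. The inventory namespace is invented for illustration;
# ElementTree addresses namespaced names in {namespace}local form.
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<item xmlns:dc='http://purl.org/dc/elements/1.1/'"
    "      xmlns:inv='http://example.org/inventory'>"
    "<dc:title>Widget</dc:title>"
    "<inv:stock>42</inv:stock>"
    "</item>"
)
title = doc.find("{http://purl.org/dc/elements/1.1/}title").text
stock = int(doc.find("{http://example.org/inventory}stock").text)
```

Each element name is globally scoped by its namespace URI, so neither vocabulary can collide with or constrain the other.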

When design work on RDF began, the only XML schema facility available was
the DTD, which lacks support for decentralized evolution. Since then, XML
Schema work [XMLSchema] proposes ways to compose
strongly typed documents using XML Namespaces.

Any given XML document is finite, as is any table in a relational database.
But the Web is unbounded. The design of the Web fundamentally differed from
traditional hypertext systems in sacrificing link integrity for scalability.
While any party can (and should!) maintain link consistency within some part
of the Web, no tool that looks at the Web as a whole can assume
consistency.

This emerged as a critical design distinction at the workshop [QL98]. Assuming a finite repository, a query processor can
assume it has the total information; it can decide that, for example, there
are no elements or records that satisfy a query. But while there are services
that process queries of the form "find all links to X in the Web," they cannot
decide that there are no links to X, but only that their necessarily
incomplete knowledge of the Web includes no links to X.

The workshop showed that the problem of querying a bounded XML collection
has been solved in the research and industrial settings, and is perhaps a
commodity, amenable to standardization by now. But work presented there and
elsewhere [Craw90] showed that while the unbounded
query problem is perhaps at the research stage, the research is maturing, and
we should take care not to prevent the transition into products and commodity
technology. We should do what we can to see that data is recorded in a global
context, because [Ber98a]:

[...] we expect this data, while limited and simple within an application, to
be combined, later, with data from other applications into a Web. Applications
which run over the whole web must be able to use a common framework for
combining information from all these applications.

For example, HTML linking requires that all links from a resource be
expressed in the content of that resource. But that is a limitation of HTML,
not a limitation of Web architecture [Ber90]. The
design and deployment of HTML did not prevent the design and deployment of
out-of-line XML Links [XLink98], which allow links
from a resource to be expressed anywhere in the web.

In the same way that HTML and XML Linking allow authors to lead readers
from any place in the Web to any other place in the Web, data in the Semantic
Web must be able to relate anything to anything.

A requirement, therefore, of a data model for the Semantic Web is that
there should be no fundamental constraint relating what is said, what it is
said about, and where it is said.

To encompass the universe of network-accessible information [Ber92], the Semantic Web must provide a way of exposing
information from different systems. These systems may use a variety of
internal data models so this implies a requirement for some generic concept of
data at a low level that is in common between each system. For example, at the
W3C Query Language Workshop [QL98] the directed,
labeled graph (DLG) model was a common underlying model among many
systems.

Another challenge of the Semantic Web, then, is to support the mapping of
the existing and future systems onto the Web, preserving the universality of
the Web and also the properties of the local systems. Optimizations - such as
being able to enumerate and index all objects of a given type - that are
important to the local operation of a system do not scale to the Web.

An example of this is the scoping of identifiers. In the object-oriented
model, the variables in an object are declared when the object type is
declared. Entity-relationship models similarly are an optimization of a model
that presumes an enumerable set of properties. The Semantic Web should be able
to represent these constrained models but, as with link consistency, we must
relax absolute constraints to achieve scalability; when an object is exported
to the Web, the "anything can say anything about anything" rule allows
assertions to be made about the object expressing things which were not
foreseen in the original definition of that object.

The mechanism adopted in RDF [RDF99] to manage
the expression of constraints is to make all objects, all relationships, all
types, and even all assertions be "first class objects" on the Web. That is,
they have their own URIs and are not constrained at the fundamental level to
be combined in any particular way. By giving first class identifiers to types,
relationships, and assertions we allow the Semantic Web to make assertions
about itself.
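This first-class status can be sketched as a triple store in which a statement itself receives an identifier and can be the subject of further statements, in the spirit of RDF reification (the `rdf:subject`/`rdf:predicate`/`rdf:object` properties come from the RDF vocabulary; the other URIs here are invented):

```python
# Sketch: because statements are first class, a statement can be named and
# then described by other statements (RDF calls this reification).
# The #doc, #s1, and ex:/dc: identifiers are invented for illustration.

triples = []

def assert_statement(stmt_id, subject, predicate, obj):
    """Record a statement and the fact that stmt_id names it."""
    triples.append((subject, predicate, obj))
    triples.append((stmt_id, "rdf:subject", subject))
    triples.append((stmt_id, "rdf:predicate", predicate))
    triples.append((stmt_id, "rdf:object", obj))

assert_statement("#s1", "#doc", "dc:creator", "Alice")
# A second party can now make an assertion about the assertion itself:
triples.append(("#reviewer", "ex:disputes", "#s1"))
```

Because `#s1` is an ordinary node, the dispute is expressed with the same machinery as any other statement, which is what lets the Semantic Web make assertions about itself.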

All statements found on the Web occur in some context. Applications need
this context in order to determine the trustworthiness of the statements; that
is, the machinery of the Semantic Web does not assert that all statements
found on the Web are "true". Truth - or more pragmatically, trustworthiness -
is evaluated by, and in the context of, each application that processes the
information found on the Web. These are not new issues; the commerce and
financial communities have evolved techniques to manage exchange of
information (goods) without requiring perfect trust [Reagle96, Geer98].

Just as the design of the Web sacrificed link integrity for scalability,
the "all knowledge about my thing is contained here" notion cannot hold when
databases and objects are exported to the Web. A great benefit to relaxing
this assumption will be that, just as hypertext links connect different
information systems, the Semantic Web will connect data from vastly different
systems, allowing complex and far-reaching processing of a wide store of
available data.

We will need to consider one optimization that RDF does not currently
address and that is found in database systems. This occurs with operations on
composite objects and is frequently represented as containment (in the sense
of storage). Operations such as deletion and comparison on structured objects
frequently make use of such containment relationships. While this local
containment constraint also does not scale fully to the Web, it is an example
of a relationship that should be expressible in the Semantic Web.

The relationship between the tree structure of an XML document and a graph
structure was discussed at the workshop [QL98]. The
participants, coming from diverse backgrounds, agreed that a shared data model
was the cornerstone of the design of any query language, and, in fact, a
precursor to meaningful design discussion. The relational calculus [Codd70] underlying Structured Query Language (SQL) is a good
example.

Section
2.1 of the XML specification defines the way the elements in a document
form a tree, and Section 5
of the RDF specification defines the RDF data model as a directed, labeled
graph (DLG). The XML syntax of RDF reflects the difference between these
models. And it seemed at first glance that the design of a query language for
XML must start with a tree model, whereas a query language for RDF must start
with a DLG model.

A DLG model has been shown [GMW99] to be useful for
serializing a semistructured database in XML. Other work [LayA98, DSB98, BLR99] also discusses modeling graphs in XML. This work does
not propose that all XML documents should be modeled as DLGs -- the order of
elements generally gets lost, for example -- but it does show that RDF is not
the only XML application with a requirement to represent DLGs in XML.

These models use XML ID/IDREF to supplement parent/child relationships
expressed by element containment. While ID/IDREF works within a single
document, these designs depend on XML Linking [XLink98], i.e. well-known constructs for making
connections across documents, much the way metadata applications require that
the basic structure of RDF assertions be visible to all systems, even systems
that don't understand the semantics of the assertions.
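The combination of containment and ID/IDREF can be sketched as graph construction (the element and attribute names below are invented): containment supplies the tree-shaped arcs, while reference attributes add the cross-links that make the result a graph rather than a tree.

```python
# Sketch: build graph edges from an XML document whose reference
# attributes (invented names: id, ref) supplement element containment.
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<people>"
    "<person id='alice'><knows ref='bob'/></person>"
    "<person id='bob'/>"
    "</people>"
)

edges = []
for person in doc:
    pid = person.get("id")
    for link in person:
        # A contained element with a ref attribute becomes a cross-link arc.
        edges.append((pid, link.tag, link.get("ref")))
```

Within one document this works with plain ID/IDREF; to carry the same pattern across documents is exactly where the designs mentioned above depend on XML Linking.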

The content selection application added an operational requirement that
assertions be visible by inspection, i.e. without reference to a schema.
Expectations from HTML suggest that authors have the option to express simple
XML Linking constructs [XLink98] directly in the
document, and do not necessarily need to work with a DTD or schema.

From the workshop discussion, it was clear that a direct mapping between
XML elements and graph arcs was a strong design. Future work is needed to
address how to identify semantic arcs and cross-links within an arbitrary XML
document.

The Semantic Web model is very closely connected with the relational
database model [Ber98b]. A collection of RDF statements
about a node corresponds to a row in a table. A database join is a splicing of
graphs. Relational databases are optimized to handle large numbers of
instances of statements using the same property, and there might be
corresponding optimizations in an XML serialization for large volumes of
similar data. But we should expect that the basic structures that support
serializing relational databases can be shared with the RDF DLG data
model.
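The correspondence drawn above can be made concrete in a short sketch (table and column names are invented): each row becomes a set of statements about one node, and a join is a splice of the resulting graphs at a shared value.

```python
# Sketch: relational rows as statements about a node; a join as a splice
# of two graphs at a shared value. Table/column names are invented.

employees = [{"emp": "e1", "name": "Alice", "dept": "d7"}]
departments = [{"dept": "d7", "label": "Research"}]

def rows_to_triples(rows, key):
    """One row -> statements (subject, property, value) about one node."""
    triples = []
    for row in rows:
        node = row[key]
        for col, value in row.items():
            if col != key:
                triples.append((node, col, value))
    return triples

graph = rows_to_triples(employees, "emp") + rows_to_triples(departments, "dept")

# The join on dept is graph traversal: follow the 'dept' arc, then 'label'.
dept_of_alice = [o for s, p, o in graph if s == "e1" and p == "dept"][0]
label = [o for s, p, o in graph if s == dept_of_alice and p == "label"][0]
```

The two-step traversal at the end is the join: the spliced graphs share the node `d7`, just as the two tables share a value in their dept columns.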

In a table there are many records with the same set of properties. An
individual cell (which corresponds by analogy to an RDF property) is strongly
typed. In relational databases, combination rules tend to be loosely enforced; a
query can join tables by any columns that match on datatype, without any check on the
semantics. Joins across arbitrary fields are another case where constraint
enforcement in the Semantic Web cannot be absolute; the Semantic Web is not
designed just as a new data model - it is designed to support the linking of
data from many different models.

Much of the object-oriented world has to do with the modeling of functions
on objects. For the Semantic Web, the data model for the serializations of
objects when they are stored or transmitted is also of interest [Chang98].

The serialization of an object can be considered to be a series of data
fields expressing different properties of the object. In most O-O systems the
type of an object denotes constraints on the methods (functions) supported by
the object. The serialization of the object data is often considered to be an
internal matter, and is hidden from the caller of methods. Under this design
principle, XML technology is only of interest for the serialization of the
remote method calls (which is outside the scope of this paper) and cannot be
used to provide interoperability between implementations.

However, when interoperability is a goal for object serializations, then it
becomes reasonable to put objects on the Web. In this case, designing
self-describing serialization formats using XML makes the objects more robust
across time [KR97]. XML vocabularies for representing
inheritance are needed when object systems are serialized on the Semantic Web.
These mechanisms are desired by RDF and can be shared across other XML
applications.
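A self-describing serialization in the sense above can be sketched briefly (the class names are invented): the emitted data carries the object's fields together with its type chain, so a receiver without the original class definitions can still interpret, and partially process, the record.

```python
# Sketch: a self-describing serialization records the inheritance chain
# alongside the field data, making the object more robust across time.
# Class names are invented for illustration.

class Shape:
    pass

class Circle(Shape):
    def __init__(self, radius):
        self.radius = radius

def serialize(obj):
    """Emit the object's fields plus its full type chain as plain data."""
    chain = [c.__name__ for c in type(obj).__mro__ if c is not object]
    return {"types": chain, "fields": vars(obj)}

record = serialize(Circle(2.0))
```

A consumer that knows only `Shape` can still recognize the record as some kind of shape from the type chain, which is the kind of inheritance-aware interoperability the paragraph above calls for.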

A large number of applications store and communicate information that takes
the form of logical expressions. For example, configuration files defining
access control, specifications of capability profiles ([CONNEG98], [CCPP98]) show a need for
not only structure but logical combination. Future work must be chartered to
provide a common vocabulary for this logic language.

Knowledge representation systems (e.g. Knowledge Interchange Format (KIF) and Cyc [Cyc95]) include not only logical level information but also
expressive power which includes quantification and inference. The basic DLG
model of RDF provides a very natural base for the expression and interchange
of such data, but future work is needed to define common terms for these
extensions to the power of the language. There may be a use for some XML
shorthand in order to make such expressions sufficiently concise. While these
systems also make assumptions about having full access to information that
fail to scale to the Web, a comparison of the models suggests two areas of
impact when a typical KR system is represented on the Semantic Web. Many KR
systems are built to work with primarily one node representing any given
concept, while in general the Semantic Web may have many independently created
nodes which in fact represent the same thing. Also, KR systems in practice
store some kinds of hints in order to be able to help algorithms perform
certain types of queries on the data. The Web will not in general guarantee
one node per concept nor will it guarantee the presence of query hints and the
Semantic Web must therefore be able to function without these constraints.

We have shown the importance of a common architecture for tree-structured
documents and directed labeled graphs. We have also shed new light on some of
the design decisions in the XML syntax used by RDF.

We have discussed the way contemporary data models (relational, object,
knowledge representation) relate to a unified Semantic Web Architecture. We
look forward to elaborating these connections in future work.

1 The presumption that a complete specification of the
objects within a document must exist in one place appears in many places:
SGML, Ada, SQL, and in many object-oriented programming systems. At the same
time, the import of many different interfaces into a program module (in Ada,
C, etc) also demonstrates the concept of creating a new module using a mixture
of independently created vocabularies.