Abstract

This document presents GRDDL, a mechanism for
Gleaning Resource Descriptions from
Dialects of Languages; that is, for getting RDF
data out of XML and XHTML documents using explicitly associated
transformation algorithms, typically represented in XSLT.

Please send review comments, implementation experience reports,
etc. to public-rdf-in-xhtml-tf@w3.org,
the mailing list of the RDF in XHTML task-force of the Semantic Web
Best Practices and Deployment Working
Group and the HTML Working Group; the mailing list has a
public archive.

By publishing this document, Dan Connolly and Dominique
Hazaël-Massieux have made a formal submission to W3C for
discussion. Publication of this document by W3C indicates no
endorsement of its content by W3C, nor that W3C has, is, or will be
allocating any resources to the issues addressed by it. This document
is not the product of a chartered W3C group, but is published as
potential input to the W3C Process. Please consult the complete list of acknowledged
W3C Team Submissions.

1. Introduction: Data and Documents

Data formats like XML and XHTML are used in the Web for a large
spectrum of purposes, from poetry and drama to spreadsheets and
databases. The information in a poem may be rich and subtle; we might
use a computer to pick out the author's name, but themes and opposing
forces are not readily computable. When extracting data from
documents, preserving meaning is important: if a document says "It is
highly unlikely that the king was over twenty years old" and a
computation returns "the king was over twenty years old," that
computation does not preserve meaning.

The Resource Description Framework[RDFC04]
codifies certain forms of data—simple logical statements like
age(king, 20)—and specifies basic rules for preserving
meaning. The framework includes a constrained XML concrete syntax, but
it also includes an abstract syntax. GRDDL is a mechanism
for Gleaning Resource Descriptions from
Dialects of Languages; that is, for getting RDF
data out of XML and XHTML documents.

For example, Dublin Core meta-data can be written in an HTML
dialect[RFC2731] that has a clear
correspondence to an encoding in RDF/XML[DCRDF]. The correspondence can be expressed in an
XSLT transformation, dc-extract.xsl:

The transformation preserves the author's meaning, provided the
author understood the conventions of this dialect. But an author may
have accidentally conformed to the syntactic conventions
without any knowledge of Dublin Core at all. In that case, the mapping
most likely does not preserve the author's meaning. In GRDDL,
documents contain explicit references to the conventions that the author
used to encode data.
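
On the consuming side, detecting such explicit references might look like the following sketch. It is illustrative only and rests on assumptions: the profile URI http://www.w3.org/2003/g/data-view is assumed to identify the GRDDL profile, the document is assumed to be well-formed XHTML, and the function name grddl_links and the sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Assumption: this URI identifies the GRDDL profile.
GRDDL_PROFILE = "http://www.w3.org/2003/g/data-view"


def grddl_links(xhtml_text):
    """Return hrefs of rel="transformation" links, but only when the
    document's head explicitly names the GRDDL profile."""
    root = ET.fromstring(xhtml_text)
    head = root.find(f"{{{XHTML_NS}}}head")
    if head is None:
        return []
    # The profile attribute may list several space-separated URIs.
    profiles = head.get("profile", "").split()
    if GRDDL_PROFILE not in profiles:
        # No explicit reference: do not guess at the author's conventions.
        return []
    hrefs = []
    for link in head.iter(f"{{{XHTML_NS}}}link"):
        rels = link.get("rel", "").split()
        href = link.get("href")
        if "transformation" in rels and href:
            hrefs.append(href)
    return hrefs


# Hypothetical sample document.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head profile="http://www.w3.org/2003/g/data-view">
    <title>Example</title>
    <link rel="transformation" href="dc-extract.xsl" />
  </head>
  <body></body>
</html>"""

print(grddl_links(SAMPLE))
```

A document that merely happens to match the syntactic conventions, without naming the profile, yields an empty list, which mirrors the point made above about accidental conformance.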

For example, this document not only follows the conventions of
[RFC2731], but it explicitly uses the GRDDL
profile and links to a transformation that extracts the meta-data
in RDF/XML in a way that preserves the meaning of the document:

In the figure below, the arrow labelled info relates a
document to an abstract notion of the information contained in the
document. It shows that the RDF data extracted via the
dc-extract.xsl transformation is part of the information
contained in the document:

3. The GRDDL transformation attribute in XML

The GRDDL profile mechanism is a special case of GRDDL designed to
fit within the DTD-based syntax of XHTML. The general form of GRDDL is
an attribute suitable for use with a wide variety of XML dialects.

The transformation attribute in the
http://www.w3.org/2003/g/data-view# namespace on the root
element of an XML document refers to a list of transformations
that preserve the document's meaning.
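
As a rough illustration of how a consumer could read this attribute, the sketch below parses an XML document and returns the transformations named on its root element. It assumes the list is whitespace-separated; the function name root_transformations, the sample purchase-order vocabulary and the example.org URIs are hypothetical.

```python
import xml.etree.ElementTree as ET

DATA_VIEW_NS = "http://www.w3.org/2003/g/data-view#"


def root_transformations(xml_text):
    """Return the transformation references listed on the root element.

    Assumption: the attribute value is a whitespace-separated list.
    """
    root = ET.fromstring(xml_text)
    # ElementTree exposes namespaced attributes as "{namespace}localname".
    value = root.get(f"{{{DATA_VIEW_NS}}}transformation", "")
    return value.split()


# Hypothetical sample document.
SAMPLE = """<po xmlns="http://example.org/po#"
    xmlns:data-view="http://www.w3.org/2003/g/data-view#"
    data-view:transformation="http://example.org/po2rdf.xsl http://example.org/po-extra.xsl">
  <item>widget</item>
</po>"""

print(root_transformations(SAMPLE))
```

Only the root element is inspected here, matching the description above; a document without the attribute yields an empty list.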

4. GRDDL for XML Namespace and HTML Profile Documents

Transformations can be associated not only with individual
documents but also with whole dialects that share an XML namespace or
XHTML profile. Consider this privacy policy written in P3Q, a
contrived analog to P3P[P3P]:

Note that statements gleaned from namespace documents and profile
documents are a part of their meaning; these documents need not be
written in RDF/XML directly.

Consider a purchase order whose namespace document is an XML
Schema, where the XML Schema bears a data-view:transformation
attribute licensing extraction of statements that include
namespaceTransformation statements:

5. GRDDL Transformations

The transformation link type refers to a transformation algorithm
that should have available representations in widely supported formats.
We expect most consumers to support XSLT version 1[XSLT1] for the foreseeable
future, though XSLT2[XSLT2] deployment is increasing.
While JavaScript, C, or any other programming language can technically express
the relevant information, XSLT is specifically designed to express
XML-to-XML transformations and has some good safety characteristics.

Transformation algorithms should be well-defined functions whose
only input is the source document. The use of the XSLT
document() function to incorporate other data at transformation
time is an error.
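
A consumer or stylesheet author might want to check for this error before running a transformation. The sketch below is a conservative heuristic, not a full XPath parser: it flags any attribute value in the stylesheet that appears to call document(), so it can also flag a harmless string literal that contains the same text. The function name uses_document_function and the sample stylesheets are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Matches "document(" when it is not part of a longer or prefixed name.
DOCUMENT_CALL = re.compile(r"(?<![\w.:-])document\s*\(")


def uses_document_function(xslt_text):
    """Heuristically report whether a stylesheet calls document().

    Every attribute value is scanned, which covers select, match and
    test expressions as well as attribute value templates.
    """
    root = ET.fromstring(xslt_text)
    for element in root.iter():
        for value in element.attrib.values():
            if DOCUMENT_CALL.search(value):
                return True
    return False


# Hypothetical stylesheets: one pure, one that pulls in other data.
PURE = """<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/"><xsl:value-of select="//title"/></xsl:template>
</xsl:stylesheet>"""

IMPURE = PURE.replace("//title", "document('other.xml')//title")

print(uses_document_function(PURE), uses_document_function(IMPURE))
```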

6. Security considerations

Implementors should pay special attention to the security implications
of any media types that can cause the remote execution of any actions in
the recipient's environment. In such cases, the discussion of the
"application/postscript" type may serve as a model for considering other
media types with remote execution capabilities.

Given the expressive power of XSLT, and the possibility to access external
resources from an XSLT style sheet (e.g. through the document
function or the xsl:import mechanism), implementors should take
the appropriate measures to prevent malicious usage of this mechanism.
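
One possible measure, sketched below under stated assumptions, is to list the external style sheets a transformation would pull in through xsl:import or xsl:include and check each against a local policy before fetching anything. The policy shown (http and https only, from an explicit set of hosts) and the names external_references, is_allowed and ALLOWED_SCHEMES are hypothetical; a real deployment would choose its own rules.

```python
from urllib.parse import urljoin, urlsplit
import xml.etree.ElementTree as ET

XSL_NS = "http://www.w3.org/1999/XSL/Transform"
ALLOWED_SCHEMES = {"http", "https"}  # hypothetical policy


def external_references(xslt_text, base_uri):
    """Return absolute URIs named by xsl:import and xsl:include."""
    root = ET.fromstring(xslt_text)
    refs = []
    for name in ("import", "include"):
        for element in root.iter(f"{{{XSL_NS}}}{name}"):
            href = element.get("href")
            if href is not None:
                # Resolve relative references against the stylesheet URI.
                refs.append(urljoin(base_uri, href))
    return refs


def is_allowed(uri, allowed_hosts):
    """Accept only allowed schemes on explicitly trusted hosts."""
    parts = urlsplit(uri)
    return parts.scheme in ALLOWED_SCHEMES and parts.hostname in allowed_hosts


# Hypothetical stylesheet with one benign and one suspicious reference.
SHEET = """<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:import href="lib.xsl"/>
  <xsl:include href="file:///etc/passwd"/>
</xsl:stylesheet>"""

for ref in external_references(SHEET, "http://example.org/xsl/main.xsl"):
    print(ref, is_allowed(ref, {"example.org"}))
```

This check covers only static references in the style sheet; calls to the document function would need separate handling, for example by configuring the XSLT processor to refuse external access.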