Abstract

GRDDL is a mechanism for Gleaning Resource
Descriptions from Dialects of Languages. The
GRDDL specification introduces markup for declaring that an XML
document includes gleanable data and for linking to an algorithm, typically
represented in XSLT, for gleaning the RDF data from the document.

The markup includes a namespace-qualified attribute for use
in general-purpose XML documents and a profile-qualified
link relationship for use in valid XHTML documents. The GRDDL
mechanism also allows an XML namespace document
(or XHTML profile document) to declare that every document associated
with that namespace (or profile) includes gleanable data and to link
to an algorithm for gleaning the data.

A corresponding GRDDL specification
provides complete technical details. A GRDDL Primer demonstrates the
mechanism on XHTML documents which include widely-deployed dialects,
more recently known as microformats.

Status of this Document

This section describes the status of this document at the time
of its publication. Other documents may supersede this document. A
list of current W3C publications and the latest revision of this
technical report can be found in the W3C technical reports index at
http://www.w3.org/TR/.

This is a First Public Working Draft of GRDDL Use Cases:
Scenarios of extracting RDF data from XML documents.
A log of changes is maintained for the convenience
of editors and reviewers.

The GRDDL design was first released as a W3C technical report in
April 2004. This document was developed by the GRDDL Working Group,
which was chartered in July 2006 to review the specification and
develop use cases and tutorial material. The Working Group expects to
advance GRDDL to Recommendation Status, though these use cases may end
up as a separate Working Group Note.

Publication as a Working Draft does not imply endorsement by the
W3C Membership. This is a draft document and may be updated, replaced
or obsoleted by other documents at any time. It is inappropriate to
cite this document as other than work in progress.

1. Introduction: Data and
Documents

There are many dialects in practice among the many XML documents on the web.
There are dialects of XHTML, XML and RDF that are used to represent everything from
poetry to prose, purchase orders to invoices, spreadsheets to databases,
schemas to scripts, and linked lists to ontologies. Some are more formally defined,
while others exhibit more loosely coupled semantics. Recently, two progressive
encoding techniques have emerged to overlay additional semantics onto valid XHTML
documents: RDFa and microformats offer simple, open data formats built upon existing
and widely adopted standards.

While this breadth of expression is quite liberating, inspiring new
dialects to codify both common and customized meanings, it can prove to be
a barrier to understanding across different domains or fields. How, for
example, does software discover the author of a poem, a
spreadsheet and an ontology? And how can software determine whether
the authors of each are in fact the same person?

Any number of those XML documents on the web may contain data
whose value would increase dramatically if they were accessible to systems
which might not directly support such a wide variety of dialects but which
do support RDF.

The Resource Description Framework[RDFC04]
provides a standard for making statements about resources in the form
of a subject-predicate-object expression. One way to represent the
fact "The Stand's author is Stephen King" in RDF would be as a triple
whose subject is "The Stand," whose predicate is "has the author," and
whose object is "Stephen King," The predicate, "has the author"
expresses a relationship between the subject (The Stand) and the object
(Stephen King). Using URIs to uniquely identify the book, the author and
even the relationship would facilitate software design because not
everyone knows Stephen King or even spells his name consistently.
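
One way to write this triple down in RDF's XML syntax, using made-up example.org URIs
for the book, the relationship and the author, is the following sketch:

    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:ex="http://example.org/terms/">
      <!-- subject: the book; predicate: ex:hasAuthor; object: the author -->
      <rdf:Description rdf:about="http://example.org/books/the-stand">
        <ex:hasAuthor rdf:resource="http://example.org/people/stephen-king"/>
      </rdf:Description>
    </rdf:RDF>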

RDF includes an XML concrete syntax and an abstract
syntax. Software tools that use the Resource Description Framework
naturally work with documents whose data is encoded using
RDF/XML.

GRDDL is a mechanism for Gleaning Resource
Descriptions from Dialects of Languages; that is,
for extracting RDF data from XML documents by way of transformation
algorithms, typically represented in XSLT.

For example, Dublin Core meta-data can be written in an HTML
dialect[RFC2731] that has a clear
correspondence to an encoding in RDF/XML[DCRDF]. The following HTML and RDF excerpts
illustrate the correspondence:
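
For instance, a title, creator and date expressed in the HTML meta convention of
[RFC2731] might look like the following fragment (the document title, author name
and date are invented for the illustration):

    <head>
      <link rel="schema.DC" href="http://purl.org/dc/elements/1.1/" />
      <meta name="DC.title" content="Some Document" />
      <meta name="DC.creator" content="Jane Doe" />
      <meta name="DC.date" content="2006-10-02" />
    </head>

and would correspond to RDF/XML along these lines, with the document itself as the subject:

    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:dc="http://purl.org/dc/elements/1.1/">
      <rdf:Description rdf:about="">
        <dc:title>Some Document</dc:title>
        <dc:creator>Jane Doe</dc:creator>
        <dc:date>2006-10-02</dc:date>
      </rdf:Description>
    </rdf:RDF>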

The transformation algorithm used to glean the RDF data is expressed as an
XSLT transformation, dc-extract.xsl.
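
The actual dc-extract.xsl is not reproduced here, but a minimal sketch of such a
transformation, assuming a namespace-well-formed XHTML source and the DC.* meta
convention above, could read:

    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:xhtml="http://www.w3.org/1999/xhtml"
        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
        xmlns:dc="http://purl.org/dc/elements/1.1/">
      <xsl:output method="xml" indent="yes"/>
      <!-- Emit one RDF description of the source document itself -->
      <xsl:template match="/">
        <rdf:RDF>
          <rdf:Description rdf:about="">
            <!-- Each <meta name="DC.xyz" content="..."/> becomes a dc:xyz property -->
            <xsl:for-each select="//xhtml:meta[starts-with(@name, 'DC.')]">
              <xsl:element name="dc:{substring-after(@name, 'DC.')}">
                <xsl:value-of select="@content"/>
              </xsl:element>
            </xsl:for-each>
          </rdf:Description>
        </rdf:RDF>
      </xsl:template>
    </xsl:stylesheet>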

This document collects a number of motivating use cases together with their goals and
requirements for extracting RDF data from XML documents.
These use cases also illustrate how XML and XHTML documents can be decorated
with microformats, Embedded RDF
or RDFa statements to support
GRDDL transformations in charge of extracting
valuable data that can then be used to automate a variety of tasks.

The companion GRDDL Working Draft is a concise technical specification of
the GRDDL
mechanism and its XML syntax. It specifies the GRDDL syntax to use in
valid XHTML and well-formed XML documents, as well as how to encode
GRDDL into namespaces and HTML profiles.
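
Concretely, an XHTML document points at its transformation with the GRDDL profile
and a transformation link relationship in its head (the stylesheet URI is illustrative):

    <head profile="http://www.w3.org/2003/g/data-view">
      <title>Jane's Calendar</title>
      <link rel="transformation" href="http://example.org/xsl/extract-events.xsl" />
    </head>

while a general-purpose XML document can carry the namespace-qualified attribute on
its root element:

    <calendar xmlns="http://example.org/my-calendar-dialect"
              xmlns:grddl="http://www.w3.org/2003/g/data-view#"
              grddl:transformation="http://example.org/xsl/extract-events.xsl">
      ...
    </calendar>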

The companion document, the GRDDL Primer Working Draft, is a progressive
tutorial on the GRDDL mechanism with illustrated examples taken from the
GRDDL Use Cases Working Draft.

The seven use cases detailed below could be summarized as:

Use case #1: Jane is trying to coordinate a meeting with friends.
She uses GRDDL to extract data from each of their calendar pages and combine it in a single model.
She then writes a query to filter the events down to those dates when all of them are in the same city.

Use case #2: Kayode uses a single-purpose XML vocabulary as the
main representation format for computer-based patient records. He uses GRDDL to be able to
query these records both in their XML vocabulary and as RDF, without managing a dual representation.

Use case #3: Stephan wishes to buy a guitar and visits a site offering
a review service. He uses GRDDL to aggregate reviews and profiles of the reviewers in order to select
the reviews he can trust.

Use case #4: Adeline designs a system to allow her company
to streamline the publication of Technical Reports. The system relies on shared templates for publishing
documents and a GRDDL transformation to build an up-to-date RDF index used to create an authoritative repository.

Use case #5: The Technical University of Marcilly decides to use a wiki
with metadata embedded in its pages to tag, structure, navigate and query the resources of the wiki.
GRDDL is used to extract these metadata as RDF to feed the different tools of the system.

Use case #6: Voltaire has set up a weblog engine that utilizes XForms for editing
entries. He also provides a GRDDL transformation that extracts an RDF description of the XForms that other
client applications can use to update existing entries using the identified service URIs, and perform other
such services.

Use case #7: The Open Archives Initiative (OAI) publishes an XML schema
that universities can use to publish their archived documents. This schema also identifies a GRDDL transform to
apply to all its instance documents in order to extract their Creative Commons license.

Use case #1 - Scheduling: Jane is
trying to coordinate a meeting.

Jane is trying to coordinate a meeting with her friends Robin, David and Kate.
They each live in separate cities but often bump into each other at different
conferences throughout the year. Jane wants to find a time when all of her friends are in the same city.

Jane uses an online calendaring service that publishes an RSS 1.0
feed of her schedule.

Despite their different formats, the calendars of all four friends can be used as
source documents and converted to RDF. Once
expressed as RDF the data can be merged and queried using tools such
as the SPARQL query language.

Jane uses a GRDDL-aware agent to automatically extract data from each page, load this data in an
RDF store and combine it in a single model. She then writes a query to filter the events down to
those dates when all four friends are in the same city.
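
A sketch of that query, assuming the gleaned data uses the W3C RDF calendar (ical)
vocabulary for locations and start dates and a made-up ex:calendarOf property linking
each event to its owner, might be:

    PREFIX ical: <http://www.w3.org/2002/12/cal/ical#>
    PREFIX ex:   <http://example.org/terms/>
    SELECT ?city ?start
    WHERE {
      ?e1 ex:calendarOf ex:Jane  ; ical:location ?city ; ical:dtstart ?start .
      ?e2 ex:calendarOf ex:Robin ; ical:location ?city ; ical:dtstart ?start .
      ?e3 ex:calendarOf ex:David ; ical:location ?city ; ical:dtstart ?start .
      ?e4 ex:calendarOf ex:Kate  ; ical:location ?city ; ical:dtstart ?start .
    }

A real query would compare overlapping date ranges rather than require identical
start values, but the overall shape is the same.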

Jane is delighted to find that all four of them will be at conferences in LA at the beginning
of September and she immediately starts looking for restaurants to book for their night out.

Browsing the calendar of her friends, Jane noticed various conferences, talks, and
other gatherings of social groups in her area. These groups publish their calendars in
various HTML-based formats: microformats, eRDF, RDFa, or some home-grown way to express
calendar information.

These calendars are source documents and thus Jane could easily add all of these
events to her own calendar. However, Jane does not want to add all these events to her
calendar. She wants to pick and choose which events to attend. She wants to browse this
list of events and each time she finds an event she is interested in, she wants to be able
to select it and copy-paste it to her calendar.

To enable this copy-paste, Jane's browser includes a GRDDL-aware agent and supports a
default RDF-in-HTML embedding scheme called RDFa. The GRDDL transformation specified in
the page indicates how to transform this XHTML into XHTML+RDFa, while preserving the
style and layout of the page.

Thus, Jane's RDFa-aware browser can perform the transform even before rendering the XHTML.
The rendered XHTML+RDFa provides copy-paste functionality via right-clicking on an
event directly in the rendered XHTML+RDFa.
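
Once the transformation has added the RDFa, an event might be marked up along the
following lines (the event details and the use of the RDF calendar vocabulary are
illustrative):

    <div xmlns:cal="http://www.w3.org/2002/12/cal/ical#"
         about="http://example.org/events/la-conference-dinner">
      <span property="cal:summary">Conference dinner</span>,
      <span property="cal:dtstart" content="2007-09-03T19:00:00">September 3, 7pm</span>,
      <span property="cal:location">Los Angeles</span>
    </div>

The literal data carried by the property and content attributes is what the copy-paste
operation hands over to Jane's calendar.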

Use case #2 - Health care: Kayode queries
computer-based patient records both as XML and as RDF.

Kayode, a developer for a clinical research data management system, uses
XML as the main representation format for its computer-based patient
records. He edits the XML remotely via forms and submits the XML document
to a unique URI for each such record over HTTP.

He wants to use a content management system which
includes a mechanism to automatically replicate an XML document into equivalent,
named RDF graphs for persistence in synchrony with any changes to the document.

Maintaining a dual representation, as a single-purpose XML vocabulary and as RDF, incurs
storage and synchronization costs, but the primary value is being able to query the records
both as XML and as RDF. The corresponding XML documents can be transformed into other
non-RDF formats, evaluated by XPath and XPointer expressions, cross-linked by XLink or
XInclude, and structurally validated by RELAX NG (or XML Schema).
Kayode has found RDF queries more amenable to investigative querying, since they allow him to ask
speculative questions using standard healthcare ontologies for patient
records, such as the
HL7 OWL ontology.

Kayode realizes a GRDDL approach alleviates this expense by allowing
a computer-based patient record or any XML-based collection of clinical
research data to be queried semantically by associating a GRDDL profile with
the specific XML vocabulary.

Using RDF helps manage research projects assigned to residents. Kayode finds RDF
especially helpful while trying to determine initial search criteria for a patient population
relevant to a particular study. Each study has a set of
classifications specific to that study, expressed in an ontology
or as rules.

Kayode designs a web-based user interface that works with a GRDDL-aware agent
which picks computer-based patient records from a remote server.
Each is a source document associated with transforms that extract
clinical data as RDF expressed in a universally supported vocabulary for a
computer-based patient record.

The resident physicians then ask speculative questions of the resulting RDF
graph or apply the study-specific rules on the resulting RDF to classify the
data according to their domain of interest, such as specific diagnoses and
pathological observations.
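
One such speculative question might be sketched as follows, where the rec: terms stand
in for the actual vocabulary (for example the HL7 OWL ontology) and the diagnosis code
is only an example:

    PREFIX rec: <http://example.org/patient-record#>
    SELECT ?patient ?observation
    WHERE {
      ?patient     rec:hasDiagnosis   ?diagnosis .
      ?diagnosis   rec:code           "E11" .        # e.g. type 2 diabetes
      ?patient     rec:hasObservation ?observation .
      ?observation rec:relatedTo      ?diagnosis .
    }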

For Kayode, having an RDF representation of the clinical data provides
advantages over using a single-purpose XML vocabulary alone, in particular an additional level of
interpretation and the ability to integrate data from diverse sources. The inherent
difficulties of using multiple XML vocabularies over domains such as clinical
data make the mapping to a unified ontology even more valuable.

Use case #3 - Aggregating reviews: Stephan
wishes to buy a guitar.

Stephan wishes to buy a guitar, so he decides to check reviews.
There are various special interest publications
online which feature musical instrument reviews. There are also blogs which
contain reviews by individuals. Among the reviewers there may be friends of
Stephan, and people whose opinion Stephan values (e.g. well-known musicians and
people whose reviews Stephan has found useful in the past). There may also be
reviews deliberately planted by instrument manufacturers which offer very biased views.

Stephan visits a site offering a review service and enters his preference
for guitar reviews which gave a high rating for the instrument. This initial
request is answered with a list of all the relevant review titles/summaries
together with information about the reviewers.

From this list Stephan chooses only the reviewers he trusts, and on
submitting these preferences is finally presented with a set of full reviews
which match his criteria.

Reviews published using hReview microformats can be discovered using
existing search services. These source documents
can be consumed by a GRDDL-aware agent to extract
the RDF which is then aggregated together in a store. Information about the reviewers can also be
aggregated from various sources including hCard and XFN microformats and autodiscovered FOAF profiles possibly
harvested through links in Stephan's own profile. The filtering may be achieved by running
SPARQL queries against the aggregated data, presented to
the user through regular HTML form interfaces.
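
Such a filtering query might be sketched as follows, assuming the hReview data has been
gleaned into the RDF review vocabulary (rev:), that reviewer identities are described
with FOAF, and that Stephan's trusted reviewers are those his own (illustrative) FOAF
profile links to via foaf:knows:

    PREFIX rev:  <http://purl.org/stuff/rev#>
    PREFIX dc:   <http://purl.org/dc/elements/1.1/>
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?review ?title ?rating
    WHERE {
      ?review rev:rating   ?rating ;
              rev:reviewer ?reviewer ;
              dc:title     ?title .
      <http://example.org/people/stephan#me> foaf:knows ?reviewer .
      FILTER (?rating >= 4)
    }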


Use case #4 -
Querying sites and digital libraries: DC4Plus Corp. wants to automate the publication of its electronic
documents.

The company DC4Plus uses its web site to publish its catalogue of products and
services as well as a number of digital documents, both on its public web site
(white papers, user guides and technical manuals of products and brochures)
and on its intranet (internal reports and administrative forms).
Product after product, DC4Plus is growing a digital library as part of its web site.

Adeline is an IT manager at DC4Plus. She is concerned by the tension between, on the one
hand, the natural heterogeneity and distribution of all these electronic documents and,
on the other hand, the need to have an integrated and unified view of them.
She believes there is a need to automate the detection, indexing and search capabilities for these
documents. Moreover, several corporate documents follow a standard process before
being published, and there is a growing demand from users and managers to be able
to automate this process and follow the status of each document.

Adeline first focuses on the Technical Reports published by the different divisions
of DC4Plus. These reports are published following a well-defined process. She
proposes a system that relies on Semantic Web technologies to allow her company
to streamline the publication paper trail of Technical Reports, to maintain an
RDF-formalized index of these reports and to create a number of tools using
this newly available data.

Adeline's implementation of this vision at DC4Plus can be given in five steps:

XHTML templates including RDFa annotations are proposed for every type of document;
users edit these templates to create new documents without even noticing that some
parts are annotated in RDFa and thus they produce source documents.

one or more GRDDL transformations are generated for these templates;
the embedded annotations are used to identify the elements to extract (title, author, editor,
status, related product, department) and make the extraction resistant to
changes of structure in the documents;

a GRDDL-aware agent crawls the published documents, applies these transformations
and stores the extracted RDF data in an RDF store;

several new pages are added to the site to generate automatic indexes from the RDF
store showing different views of the documents (a catalogue in alphabetic order,
a list of documents by status, a list of publications of a given department)

more complex tools are developed to assist both internal processes (document
workflow monitoring tools, activity reporting tools, document review management
system) and external processes (a SPARQL web service for partners to query the
catalogue, an RSS feed to notify new publications)

This system relies on shared templates for publishing documents and
including RDFa annotations to mark important data. A GRDDL-aware agent
extracts this metadata as RDF. By crawling the published
reports and applying the associated GRDDL transformations
to them, a complete and up-to-date RDF index is built from resources distributed
over the organization's website. This RDF index is then used to create a central
yet flexible authoritative repository.
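
For illustration, one entry in that index might resemble the following sketch, combining
Dublin Core terms with a made-up co: vocabulary for company-specific properties such as
status and department (all URIs and values are invented):

    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:dc="http://purl.org/dc/elements/1.1/"
             xmlns:co="http://example.org/dc4plus/terms#">
      <rdf:Description rdf:about="http://example.org/dc4plus/reports/tr-2006-42">
        <dc:title>Widget Mk II User Guide</dc:title>
        <dc:creator>J. Dupont</dc:creator>
        <co:status>Published</co:status>
        <co:department rdf:resource="http://example.org/dc4plus/departments/engineering"/>
      </rdf:Description>
    </rdf:RDF>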

Adeline believes that this scenario can be generalized to any organization
interested in maintaining a portal to a digital library with customized indexes,
dedicated search forms and navigation widgets. In particular she appreciates that,
in such an architecture, the simple fact that XHTML documents are put online
following the official templates allows GRDDL-aware agents to extract the
corresponding RDF annotations, which can then be used to generate portals, feed
workflow engines and run queries directly against the site.

Use case #5 - Wikis and e-learning:
The Technical University of Marcilly decided to use wikis to foster knowledge
exchanges between lecturers and students.

The Technical University of Marcilly (TMU) decided to use
wikis to foster
knowledge exchanges between lecturers and students. They have tested several wikis
over the years, and they now want to experiment with novel ways of structuring the
wiki to improve navigation and retrieval; they also want to make it easier
to reuse learning objects in different contexts. Ideally, TMU wants the
information structuring the wiki to be:

easy to add, edit and enrich. All this should be done at the same time a
user edits a page to avoid multiplying interfaces and manipulations.

explicit and understandable to machines so that the wiki engine can
rely on it to propose related pages, to perform precise search, to
generate browsing interfaces, to build dynamic indexes based on
customized queries and to provide customized sorting and filtering for
them.

accessible to other applications to allow integration with other
information systems, links or migration to other wiki engines, extension
of its functionalities.

In this context TMU uses metadata embedded in the wikipages to:

store the results of social tagging on the pages: tags suggested by
users are inserted in the page itself and may reuse data from the page
(e.g. the author's name) or annotate specific portions of the page (e.g.
type a paragraph as a definition, categorize an image);

generate navigation widgets: lists of forward and back links to
navigate the wiki, lists of similar pages, list of all pages tagged with
a specific topic, view of the clusters of pages.

enrich the tags with schemata to restructure the wiki (declare equivalent
tags, broader/narrower tags, add synonymous labels to existing tags) and
enrich the navigation with these links;

include queries on the metadata in the wikipages to dynamically
generate tailored indexes for the different departments, the different
years, the different topics.

import learning objects edited in classical word processing applications
by using the styles of the different sections to extract annotations for
each section and recompose new documents (e.g. transform a handout into a
web site for practical sessions).

Let us consider the case of Michel, a lecturer in engines and thermodynamics.
He used the wiki to publish the handouts of his course. He initially tagged
each handout with the main concepts it introduces (e.g. "RenewableEnergies",
"Ethanol", "Diesel"). In addition, Michel automatically typed each section of
the document using predefined styles (e.g. definitions, formula, example.).
The next practical session will involve knowledge on classical Diesel engines
and Ethanol-based engines. In order to generate a mnemonic card for this session
Michel runs a query to extract definitions and formulas of the courses tagged
with "Diesel" or "Ethanol". He also uses these tags to generate dynamic "see also"
sections at the end of his sections suggesting other sections to read.

Students edit the online handouts, to add pointers, to insert comments on parts
they found difficult to understand, and to recall pieces of previous courses useful
for understanding a new course. Students also tagged the pages with their
own tags to organize their reading and bookmark important parts for them; they
use tags to create transversal thematic tracks (e.g. "LiquidFlow"), to give
feedback on the content (e.g. "Difficult"), to prioritise reading
(e.g. "NiceToKnow", "Vital"). These tags allow them to have transversal navigation
and reorganize the content depending on the task they are doing (e.g. preparing an
exam, writing a report, running an experiment). These tags are also used by Michel
to evaluate the understanding and the shortcomings of his course.

Finally the mass of the course material and tags is such that it needs to be reorganised.
Using the tag editor Michel groups "Ethanol" and "Methanol" as sub tags of a new tag
he calls "Alcohol". Doing so the pages tagged with "Ethanol" or "Methanol" are
grouped and accessible through "Alcohol". He repeats this with other tags (e.g.
"Alcohol" and "Hydrogen" become sub-tags of "NewEngineEnergy"). This reorganizes the
wiki seamlessly: for example, navigation suggestions in the pages automatically propose
narrower, broader and sibling tags, so that when viewing a page tagged with "Ethanol"
the system suggests other pages tagged with "Methanol". Later, when a student posts his
report on an engine using "CopraOil", his new tag can be placed under the existing
"NewEngineEnergy"; he or anyone else can do it, and the result will immediately benefit
the whole community of users. Using these tags and their organization, thematic indexes
are dynamically generated for the materials of the course and automatically updated.

From the technical standpoint, TMU designed a wiki that stores
its pages directly in XHTML; RDF annotations are used to represent the
wiki structure and to annotate the wikipages and the objects they contain
(images, uploaded files, etc.). The RDF structure allows refactoring the wiki
structure by editing the RDF annotations and the RDFS schemas they are based
on. RDF annotations are embedded in the wiki pages themselves using RDFa
and microformats. Some of the learning objects can be saved in XML formats,
and an XSLT stylesheet exploits the styles used for the sessions to tag the
different parts (e.g. definition, exercise, example); these annotations can
then be used to generate new views on the resource (e.g. lists of definitions,
hypertext support for practical sessions).

The embedded RDF is extracted by a GRDDL-aware agent using
GRDDL transformations available online as
XSLT stylesheets to
provide semantic annotations directly to the application that needs to extract the embedded metadata:

if someone sends a wiki page to someone else the annotations follow it
and can be processed by applications of the recipient;

if another application (e.g. the crawler of a search engine) crawls the
wiki site, it can extract the metadata and reuse it just by applying the
same GRDDL transformation;

if a new community of practice at TMU (e.g. the accountants) wants a
dedicated index of its working documents, this can be done by embedding the
corresponding SPARQL query in a wikipage: the search engine, fed with the
extracted metadata, solves this query and the result is rendered by an XSLT
stylesheet and embedded in the page;

if the wiki engine is to be changed, the migration transformations can
exploit the embedded metadata;

if a division wants to set up access rules for some documents, they can
be based on these metadata merged with others (e.g. only lecturers can
access documents tagged as "tests");

if some users are interested in being informed of any new information
on a topic (e.g. chemists want to be informed of any new norm for the
environment), they can use notification systems that monitor the wiki by
querying its metadata (e.g. recurrent SPARQL queries on pages tagged with
"environment", as in the sketch below).

Use case #6 - XForms-based Webapps:
Voltaire wants to facilitate the extraction of transport semantics from an online form used to edit
blog entries.

Voltaire's blog is pretty popular and encompasses many major areas of interest, one of which is bird watching.
Voltaire has so many areas of interest and spends so much time watching birds that he doesn't want to surf
the net and find each and every site he might want to syndicate. Rather than 'manually' subscribing to
third-party blogs that are appropriate to the themes he covers, he wants to reverse the subscription model
to be push-based, i.e. people who want their blogs to be included can push the appropriate entries to his blog;
his blog becomes somewhat of a magnet for similar entries of interest.

Voltaire has set up a weblog engine that utilizes XForms
for editing entries remotely using the
Atom Publishing Protocol.
Voltaire has found the use of XForms
for authoring fragments of Atom quite useful for a variety of reasons.
In particular, the Atom Publishing Protocol uses HTTP and a single-purpose XML vocabulary as
its primary remote messaging mechanism, which allows Voltaire to easily author various
XForms documents that use XForms
submission elements to
dispatch operations on web resources.

As a result, the XForms for dispatching these operations each contain a
rather rich set of information about transport-level services in the form of
service URIs, media-types and HTTP methods. These are completely encapsulated
in an XForms submission element. It so happens that there is an RDF
vocabulary for expressing transport metadata called
RDF Forms.
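
For example, the form used to update an entry might carry a submission element along
these lines (the entry URI is illustrative):

    <xforms:submission xmlns:xforms="http://www.w3.org/2002/xforms"
        id="update-entry"
        action="http://example.org/blog/entries/red-kite-sighting"
        method="put"
        mediatype="application/atom+xml"
        replace="none" />

The service URI, HTTP method and media type that a GRDDL transformation would lift into
RDF Forms are all present in this one element.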

Somewhere else on the planet, the professional ornithologist Johan Bos, who
recently spotted a red kite (Milvus milvus) far from their breeding ground in
central Wales, is planning to post blog entries about his observations. To make
his results visible he wants his entries to be included in Voltaire's blog.

Voltaire's site provides a general GRDDL transformation
that extracts an RDF Form graph from the XForms submission elements employed in the various web forms
for editing, deleting, and updating Atom entries on his weblog. Such a
transformation can uniformly extract an RDF description of the transport mechanisms
for a software agent to interpret. Johan's client can automatically
retrieve an Introspection Document (via the Atom Publishing Protocol), update
existing entries using the identified service URIs, and perform other such
services.

Thus Johan's client relies on a GRDDL-aware agent to periodically extract the service URIs,
transform the content at these URIs to Atom/OWL and query the resulting RDF to determine
if the topics match. Doing so, he will replicate his entries at the matching URIs by
POSTing them there.

Voltaire does not need to manage the subscriptions; all he might want to do
is perhaps grant Johan an account for HTTP-level authentication (as a deterrent
against spam; as you can imagine, reversing the subscription model in this way
opens Voltaire's system up to a lot of spam).

Use case #7 - XML schema specifying a transformation:
the OAI would like to be able to specify document licenses in their XML schema.

The Open Archives Initiative (OAI) publishes an XML schema that universities
can use to publish their archived documents. They include
guidelines for expressing
the rights of these documents, including the possibility of referencing a license,
such as a Creative Commons license.

More than 800 universities implement this schema. Creative Commons would like to
deploy tools, like the
MozCC browser extension
which provides a convenient way to
examine licenses embedded in web pages and interpret them.

It is unreasonable to expect such tools to interpret everyone's favorite XML schema,
yet communities like the OAI would like to be able to include licensing information
in their XML schema.

On the other hand, Creative Commons would like to be able to make a generic
recommendation to anyone with XML instance documents, allowing them to do what
they want with their XML schemata, as long as they include a transformation of
the instance documents to RDF.

Since the XML instance documents are often distributed, as in the OAI case, the XML schema itself could
embed RDF descriptions identifying a transform to apply
to all its instance documents. In this way, for each source document, the transformation is
indirectly referenced by the XML schema it follows.
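
One way to do this, sketched below, is to embed a small block of RDF in the schema's
annotation that names the transformation with the GRDDL namespaceTransformation property
(the stylesheet URI is illustrative and the schema body is elided):

    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
               xmlns:grddl="http://www.w3.org/2003/g/data-view#"
               targetNamespace="http://www.openarchives.org/OAI/2.0/">
      <xs:annotation>
        <xs:appinfo>
          <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
            <rdf:Description rdf:about="">
              <grddl:namespaceTransformation
                  rdf:resource="http://example.org/xsl/extract-cc-license.xsl"/>
            </rdf:Description>
          </rdf:RDF>
        </xs:appinfo>
      </xs:annotation>
      <!-- ... element and type declarations for archived documents ... -->
    </xs:schema>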

The XML schema is served from the namespace location and is a source document
which includes descriptions associating a GRDDL transform with its instances.
Thus it serves a dual purpose for its instances: validation and identifying transforms to glean meaning from them.